Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Not So Standard Deviations Episode 13 - It's Good that Someone is Thinking About Us

In this episode, Hilary and I talk about the difficulties of separating data analysis from its context, and Feather, a new file format for storing tabular data. Also, we respond to some listener questions and Hilary announces her new job.

If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.

Subscribe to the podcast on iTunes.

Please leave us a review on iTunes!

Show notes:

Download the audio for this episode.

Companies are Countries, Academia is Europe

I’ve been thinking a lot recently about the practice of data analysis in different settings and how the environment in which you work can affect the view you have on how things should be done. I’ve been working in academia for over 12 years now. I don’t have any industry data science experience, but long ago I worked as a software engineer at two companies. Obviously, my experience is biased toward the academic side.

I’ve seen an interesting divergence between what data scientists in industry write and my personal experience doing data science in academia. From the industry side, I see a lot of stuff about tooling/software and processes. This makes sense to me. Often, a company needs/wants to move quickly and doing so requires making decisions on a reasonable time scale. If decisions are made with data, then the process of collecting, organizing, analyzing, and communicating data needs to be well thought-out, systematized, rigorous, and streamlined. If every time someone at the company had a question the data science team developed some novel custom coded-from-scratch solution, decisions would be made at a glacial pace, which is probably not good for business. In order to handle this type of situation you need solid tools and flexible workflows. You also need agreement within the company about how things are done and the processes that are followed.

Now, I don’t mean to imply that life at a company is easy, or that there isn’t politics or bureaucracy to deal with. But I see companies as much like individual countries, with a clear (hierarchical) leadership structure and decision-making process (okay, maybe ideal companies). Much like in a country, it might take some time to come to a decision about a policy or problem (e.g. health insurance), with much negotiation and horse-trading, but once consensus is reached, the policy can often be implemented across the country on a reasonable timescale. In a company, if a certain workflow or data process can be shown to be beneficial and perhaps improve profitability down the road, then a decision can be made to implement it. Ultimately, everyone within a company is in the same boat and is interested in seeing the company succeed.

When I worked at a company as a software developer, I’d sometimes run into a problem that was confusing or difficult to code. So I’d walk down to the systems engineer’s office (the guy who wrote the specification) and talk to him about it. We’d hash things out for a while and then figure out a way to go forward. Often the technical writers who wrote the documentation would come and ask me what exactly a certain module did and I’d explain it to them. Communication was quick and efficient because it usually occurred person-to-person and because we were all on the same team.

Academia is more like Europe: a somewhat loose federation of states that only communicate with each other because they have to. Each principal investigator is a country, and s/he has to engage in constant (sometimes contentious) negotiations with other investigators (“countries”). As a data scientist, this can be tricky because unless I collect/generate my own data (which sometimes I do), I have to negotiate with another investigator to obtain the data. Even if I were collaborating with that investigator from the very beginning of a study, I typically have very little direct control over the data collection process because those people don’t work for me. The result is often that the data come to me in some format over which I had little input, and I just have to deal with it. Sometimes this is a nice CSV file, but often it is not.

In good situations, I can talk with the investigator collecting the data and we can hash out a plan to put the data into a certain format. But even if we can agree on that, often the expertise will not be available on their end to get the data into that format, so I’ll end up having to do it myself anyway. In not-so-good situations, I can make all the arguments I want for an organized data collection and analysis workflow, but if the investigator doesn’t want to do it, can’t afford it, or doesn’t see any incentive, then it’s not going to happen. Ever.

However, even in the good situations, every investigator works in their own personal way. I mean, that’s why people go into academia, because you can “be your own boss” and work on problems that interest you. Most people develop a process for running their group/lab that best suits their personality. If you’re a data scientist, you need to figure out a way to mesh with each and every investigator you collaborate with. In addition, you need to adapt yourself to whatever data process each investigator has developed for their group. So if you’re working with a genomics person, you might need to learn about BAM files. For a neuroimaging collaborator, you’ll need to know about SPM. If one person doesn’t like tidy data, then that’s too bad. You need to deal with it (or not work with them). As a result, it’s difficult to develop a useful “system” for data science, because any system that works for one collaborator is unlikely to work for another. In effect, each collaboration often results in a custom coded-from-scratch solution.

This contrast between companies and academia got me thinking about the Theory of the Firm. This is an economic theory that tries to explain why firms/companies develop at all, as opposed to individuals or small groups negotiating over an open market. My understanding is that it all comes down to how well you can write and enforce a contract between two parties. For example, if I need to manufacture iPhones, I can go to a contract manufacturer, give them the designs and the precise specifications/tolerances, and they can just produce millions of them. However, if I need to design the iPhone, it’s a bit harder for me to go to another company and just say “Design an awesome new phone!” That kind of contract is difficult to write down, much less enforce. The other company will be operating off of different incentives from mine and will likely not produce what I want. It’s probably better if I do the design work in-house. Ultimately, once the transaction costs of having two different companies work together become too high, it makes more sense for a company to do the work in-house.

I think collaborating on data analysis is a high-transaction-cost activity. Companies have an advantage in this realm to the extent that they can hire lots of data scientists to work in-house. Academics who are well-funded and have large labs can often hire a data analyst to work for them. This is good because it makes a well-trained person available at low transaction cost, but this setup is the exception. PIs with smaller labs barely have enough funding to do their experiments, and so either have to analyze the data themselves (for which they may not be appropriately trained) or collaborate with someone willing to do it. Large academic centers often have research cores that provide data analysis services, but this doesn’t change the fact that data analysis occurring “outside the company” dramatically increases the transaction costs of doing the research. Because data analysis is a highly iterative process, each time you have to go back and forth with an outside entity, the costs go up.

I think it’s possible to imagine a time when data analysis can effectively be made external. I mean, Apple used to manufacture all its products, but has shifted to contract manufacturing to great success. But I think we will have to develop a much better understanding of the data analysis process before we see the transaction costs start to go down.

New Feather Format for Data Frames

This past Tuesday, Hadley Wickham and Wes McKinney announced a new binary file format specifically for storing data frames.

One thing that struck us was that, while R’s data frames and Python’s pandas data frames use different internal memory representations, the semantics of their user data types are mostly the same. In both R and pandas, data frames contain lists of named, equal-length columns, which can be numeric, boolean, date-and-time, categorical (factor), or string. Additionally, these columns must support missing (null) values.
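To make that concrete, here’s a toy R data frame (the column names are invented for illustration) with one column of each shared type, each containing a missing value:

    # Illustrative only: one column of each shared type, all supporting NA
    df <- data.frame(
      x    = c(1.5, NA, 3.2),                  # numeric
      flag = c(TRUE, FALSE, NA),               # boolean
      when = as.POSIXct(c("2016-03-29 10:00:00", NA,
                          "2016-03-30 11:30:00")),  # date-and-time
      grp  = factor(c("a", "b", NA)),          # categorical (factor)
      name = c("foo", NA, "bar"),              # string
      stringsAsFactors = FALSE
    )
    str(df)  # every column type tolerates missing (NA) values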

Their work builds on the Apache Arrow project, which specifies a format for tabular data. There are currently Python and R implementations for reading/writing these files, but other implementations could easily be built, as the file format looks pretty straightforward. The git repository is here.
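Reading and writing is about as simple as it gets. Here’s a minimal sketch in R (assuming the feather package is installed from that repository, and reusing the toy df from above):

    library(feather)
    write_feather(df, "df.feather")    # write a data frame to disk
    df2 <- read_feather("df.feather")  # read it back, here or from Python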

Initial thoughts:

  • The possibility of passing data between languages is, I think, the main point here. Being able to pass data through a pipeline without worrying about the specifics of different languages could make for much more powerful analyses, where different tools are used for whatever they do best. Essentially, as long as data can be made tidy going in and coming out, there should not be a communication issue between languages.

  • R users might be wondering what the big deal is: we already have a binary serialization format (XDR). But R’s serialization format is meant to cover all possible R objects. Feather’s focus on data frames allows for the removal of many of the annoying (but seldom used) complexities of R objects and for the optimization of a very commonly used data format.

  • In my testing, there’s a noticeable speed difference between reading a feather file and reading an (uncompressed) R workspace file (feather seems about 2x faster); a crude version of this comparison is sketched after this list. I didn’t time writing files, but the difference didn’t seem as noticeable there. That said, it’s not clear to me that performance on files is the main point here.

  • Given the underlying framework and representation, there seem to be some interesting possibilities for low-memory environments.
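For what it’s worth, here is the rough shape of that timing comparison. It’s illustrative rather than a proper benchmark; the numbers will depend on your machine, and you’d want a much larger data frame than the toy df above:

    # Sketch of a crude timing comparison (not a rigorous benchmark)
    save(df, file = "df.RData", compress = FALSE)  # uncompressed workspace file
    write_feather(df, "df.feather")
    system.time(load("df.RData"))
    system.time(read_feather("df.feather"))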

I’ve only had a chance to quickly look at the code but I’m excited to see what comes next.

How to create an AI startup - convince some humans to be your training set

The latest trend in data science is artificial intelligence. It has been all over the news for tackling a bunch of interesting problems: beating expert human players at Go, recognizing images and speech, answering people’s requests on messaging platforms, and driving cars.

Almost all of these applications are based (at some level) on using variations on neural networks and deep learning. These models are used like any other statistical or machine learning model. They involve a prediction function that is based on a set of parameters. Using a training data set, you estimate the parameters. Then when you get a new set of data, you push it through the prediction function using those estimated parameters and make your predictions.
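As a caricature of that train-then-predict pattern, here is a sketch in R with simulated data, using logistic regression as a stand-in (a deep network follows the same pattern, just with vastly more parameters):

    # Sketch: estimate parameters on a training set, then predict on new data
    set.seed(1)
    train <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    train$y <- as.integer(train$x1 + train$x2 + rnorm(100) > 0)  # simulated labels

    fit <- glm(y ~ x1 + x2, data = train, family = binomial)  # estimate parameters

    newdata <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
    predict(fit, newdata, type = "response")  # push new data through the prediction function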

So why does deep learning do so well on problems like voice recognition, image recognition, and other complicated tasks? The main reason is that these models involve hundreds of thousands or millions of parameters, which allow the model to capture even very subtle structure in large-scale data sets. This type of model can be fit now because (a) we have huge training sets (think of all the pictures on Facebook, or all the voice recordings of people using Siri) and (b) we have fast computers that allow us to estimate the parameters.

Almost all of the high-profile examples of “artificial intelligence” we are hearing about involve this type of process. This means that the machine is “learning” from examples of how humans behave. The algorithm itself is a way to estimate subtle structure from collections of human behavior.

The result is that the typical trajectory for an AI business is:

  1. Get a large collection of humans to perform some repetitive but possibly complicated behavior (play thousands of games of Go, or answer requests from people on Facebook messenger, or label pictures and videos, or drive cars.)
  2. Record all of the actions the humans perform to create a training set.
  3. Feed these data into a statistical model with a huge number of parameters, made possible by the huge training set collected from the humans in steps 1 and 2.
  4. Apply the algorithm to perform the repetitive task and cut the humans out of the process.

The question is: how do you get the humans to perform the task for you? One option is to collect data from humans who are using your product (think Facebook image tagging). The other, more recent approach is to farm the task out to a large number of contractors (think gig economy jobs like driving for Uber, or responding to queries on Facebook).

The interesting thing about the latter case is that in the short term it produces a market for gigs for humans. But in the long term, by performing those tasks, the humans are putting themselves out of a job. This played out in a relatively public way just recently with a service called GoButler that used its employees to train a model and then replaced them with that model.

It will be interesting to see how many areas of employment this type of approach takes over. It is also interesting to think about how much value each task you perform contributes to the training set of a company like that. It will also be interesting to see whether the gig workers at these companies have a legal claim that their labor helped “create the value” at the companies that replaced them.

Not So Standard Deviations Episode 12 - The New Bayesian vs. Frequentist

In this episode, Hilary and I discuss the new direction for the journal Biostatistics, the recent fracas over ggplot2 and base graphics in R, and whether collecting more data is always better than collecting less (fewer?) data. Also, Hilary and Roger respond to some listener questions, and there’s more free advertising.

If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.

Subscribe to the podcast on iTunes.

Please leave us a review on iTunes!

Show notes:

Download the audio for this episode.