Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Data handcuffs

A few years ago, if you asked me what the top skills I got asked about for students going into industry, I’d definitely have said things like data cleaning, data transformation, database pulls, and other non-traditional statistical tasks. But as companies have progressed from the point of storing data to actually wanting to do something with it, I would say one of the hottest skills is understanding and dealing with data from randomized trials.

In particular I see data scientists talking more about A/B testing, sequential stopping rules, hazard regression and other ideas  that are really common in Biostatistics, which has traditionally focused on the analysis of data from designed experiments in biology.

I think it is great that companies are choosing to do experiments, as this still remains the gold standard for how to generate knowledge about causal effects. One interesting new development though is the extreme lengths it appears some organizations are going to to be “data-driven”.  They make all decisions based on data they have collected or experiments they have performed.

But data mostly tell you about small scale effects and things that happened in the past. To be able to make big discoveries/improvements requires (a) having creative ideas that are not data supported and (b) trying them in experiments to see if they work. If you get too caught up in experimenting on the same set of conditions you will inevitably asymptote to a maximum and quickly reach diminishing returns. This is where the data handcuffs come in. Data can only tell you about the conditions that existed in the past, they often can’t predict conditions in the future or ideas that may work out or might not.

In an interesting parallel to academic research a good strategy appears to be: (a) trying a bunch of things, including some things that have only a pretty modest chance of success, (b) doing experiments early and often when trying those things, and (c) getting very good at recognizing failure quickly and moving on to ideas that will be fruitful. The challenges are that in part (a) it is often difficult to generate really knew ideas, especially if you are already doing something that has had any level of success. There will be extreme pressure not to change what you are doing. In part (c) the challenge is that if you discard ideas too quickly you might miss a big opportunity, but if you don’t discard them quickly enough you will sink a lot of time/cost into utlimately not very fruitful projects.

Regardless, almost all of the most interesting projects I’ve worked on in my life were not driven by data that suggested they would be successful. They were often risks where the data either wasn’t in, or the data supported not doing at all. But as a statistician I decided to straight up ignore the data and try anyway. Then again, these ideas have also been the sources of my biggest flameouts.