
Increasing the cost of data analysis

Jeff’s post about the deterministic statistical machine got me thinking a bit about the cost of data analysis. The cost of data analysis these days is in many ways going up. The data being collected are getting bigger and more complex. Analyzing these data requires more expertise, more storage hardware, and more computing power. In fact, in some fields like genomics, the analysis is now more expensive than the collection of the data [There’s a graph that shows this but I can’t seem to find it anywhere; I’ll keep looking and post later. For now see here.].

However, that’s really the dollars-and-cents kind of cost. In a different sense, the cost of data analysis has gone way down. For the vast majority of applications that look at moderate to large datasets, many, many statistical analyses can be conducted essentially at the push of a button. And so there’s no cost to continuing to analyze data until a desirable result is obtained. Correcting for multiple testing is one way to “fix” this problem. But I personally don’t find multiple testing corrections to be all that helpful because ultimately they still try to boil down a complex analysis into a simple yes/no answer.
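
To make the “push of a button” point concrete, here is a small simulation sketch (mine, not from the post; the setup is arbitrary): run enough analyses of pure noise and something will almost always look significant.

```python
# Illustration only: how cheap it is to "analyze until a desirable result
# is obtained". Test pure noise many times and keep the best p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_values = []
for _ in range(200):                 # 200 push-button analyses of pure noise
    x = rng.normal(size=30)
    y = rng.normal(size=30)          # no true relationship between x and y
    p_values.append(stats.pearsonr(x, y)[1])

print(min(p_values))                 # the "best" result will almost surely be < 0.05
```

Each of those 200 analyses costs essentially nothing to run, which is exactly the problem.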

In the old days (for example, when Rafa was in grad school), computing time was precious and things had to be planned out carefully, starting with the planning of the experiment and continuing with the data collection and the analysis. In fact, much of current statistical education is still geared around the idea that computing is expensive, which is why we use things like asymptotic theorems and approximations even when we don’t really have to. Nowadays, there’s a bit of a “we’ll fix it in post” mentality, which values collecting as much data as possible when given the chance and figuring out what to do with it later. This kind of thinking can lead to (1) small big data problems; (2) poorly designed studies; and (3) data that don’t really address the question everyone is interested in.

What if the cost of data analysis were paid not in dollars but in some general unit of credibility? For example, Jeff’s hypothetical machine would do some of this:

Publishing all reports to figshare makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around.

So with each additional analysis of the data, you get an additional piece of paper added to your analysis paper trail. People can look at the analysis paper trail and make of it what they will. Maybe they don’t care. Maybe having a ton of analyses discredits the final results. The point is that it’s there for all to see.
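
As a rough sketch of what such a paper trail might look like in software (entirely hypothetical; the file name and function names are mine, not part of Jeff’s machine), every analysis run could append a record to a log that gets published alongside the final report:

```python
# A hypothetical "analysis paper trail": every analysis run appends a record
# to an append-only log. Names and file layout are illustrative only.
import hashlib
import json
import time
from functools import wraps

TRAIL_FILE = "analysis_paper_trail.jsonl"  # one record per analysis, never overwritten

def paper_trail(analysis_fn):
    """Log each invocation of an analysis function before running it."""
    @wraps(analysis_fn)
    def wrapper(*args, **kwargs):
        record = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "analysis": analysis_fn.__name__,
            # Hashing the arguments makes repeated re-analyses of the same data visible
            "args_hash": hashlib.sha256(repr((args, kwargs)).encode()).hexdigest()[:12],
        }
        with open(TRAIL_FILE, "a") as log:
            log.write(json.dumps(record) + "\n")
        return analysis_fn(*args, **kwargs)
    return wrapper

@paper_trail
def fit_model(data, predictors):
    # The actual analysis would go here; each call adds to the paper trail.
    pass
```

The design choice that matters is that the log is append-only: you can run as many analyses as you like, but each one leaves a mark.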

I do not think what we need is better methods for dealing with multiple testing; this is simply not a statistical issue. What we need is a way to increase the cost of data analysis by preserving the paper trail, so that people hesitate before they run all pairwise combinations of whatever. Reproducible research doesn’t really deal with this problem because reproducibility only requires that the final analysis be documented.

In other words, let the paper trail be the price of pushing the button.