Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Data science can't be point and click

As data becomes cheaper and cheaper, more people want to be able to analyze and interpret it. I see more and more people creating tools to accommodate folks who aren’t trained but who still want to look at data _right now_. While I admire the principle of this approach - we need to democratize access to data - I think it is the most dangerous way to solve the problem.

The reason is that, especially with big data, it is very easy to find things like this with point and click tools:

US spending on science, space, and technology correlates with Suicides by hanging, strangulation and suffocation (http://www.tylervigen.com/view_correlation?id=1597)

The danger with using point and click tools is that it is very hard to automate the identification of the warning signs that seasoned analysts notice when they have their hands in the data. These may be spurious correlations like the plot above, issues with data quality, missing confounders, or implausible results. These things are much easier to spot when analysis is being done interactively. Point and click software is getting better about reproducibility, but it is still a major problem for many interfaces.
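The spurious correlation case is easy to reproduce. Here is a minimal Python sketch, assuming nothing about how the plot above was actually made: it generates a few hundred unrelated random walks and searches every pair for the strongest correlation. The function names, the 200 series, and the 10 time points are all hypothetical choices for illustration.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def random_walk(rng, length):
    """A short series with no relationship to any other series."""
    level, path = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def strongest_pair(n_series=200, length=10, seed=0):
    """Search all pairs of unrelated series for the largest |correlation|."""
    rng = random.Random(seed)
    series = [random_walk(rng, length) for _ in range(n_series)]
    best = max(
        combinations(range(n_series), 2),
        key=lambda p: abs(pearson(series[p[0]], series[p[1]])),
    )
    return best, pearson(series[best[0]], series[best[1]])


if __name__ == "__main__":
    pair, r = strongest_pair()
    print(f"series {pair[0]} vs series {pair[1]}: r = {r:.3f}")
```

With 200 series there are 19,900 pairs to search, so some pair will look tightly related even though every series was generated independently. A tool that surfaces only the winning pair hides how many comparisons were made to find it.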

Despite these issues, point and click software is still all the rage. I understand the sentiment: there is a bunch of data just lying there and there aren’t enough people to analyze it expertly. But you wouldn’t want me to operate on you using point and click surgery software. You’d want a surgeon who has practiced on real people and knows what to do when she has an artery in her hand. In the same way, I think point and click software allows untrained people to do awful things to big data.

The ways to solve this problem are:

  1. More data analysis training
  2. Encouraging people to do their analysis interactively

I have a few more tips which I have summarized in this talk on things statistics taught us about big data.