10 things statistics taught us about big data analysis

22 May 2014

In my previous post I pointed out that a major problem with big data is that applied statistics has been left out. But many cool ideas in applied statistics are really relevant for big data analysis. So I thought I’d try to answer the second question in my previous post: “When thinking about the big data era, what are some statistical ideas we’ve already figured out?” Because the internet loves top 10 lists I came up with 10, but there are more if people find this interesting. Obviously mileage may vary with these recommendations, but I think they are generally not a bad idea.

1. If the goal is prediction accuracy, average many prediction models together. In general, the prediction algorithms that most frequently win Kaggle competitions or the Netflix prize blend multiple models together. The idea is that by averaging (or majority voting) multiple good prediction algorithms you can reduce variability without increasing bias. One of the earliest descriptions of this idea was of a much simplified version based on bootstrapping samples and building multiple prediction functions, a process called bagging (short for bootstrap aggregating). Random forests, another incredibly successful prediction algorithm, are based on a similar idea with classification trees.
2. When testing many hypotheses, correct for multiple testing. This comic points out the problem with standard hypothesis testing when many tests are performed. Classic hypothesis tests are designed to call a set of data significant 5% of the time, even when the null is true (i.e. nothing is going on). One really common choice for correcting for multiple testing is to use the false discovery rate to control the rate at which things you call significant are false discoveries. People like this measure because you can think of it as the rate of noise among the signals you have discovered. Benjamini and Hochberg gave the first definition of the false discovery rate and provided a procedure to control the FDR. There is also a really readable introduction to FDR by Storey and Tibshirani.
3. When you have data measured over space, distance, or time, you should smooth. This is one of the oldest ideas in statistics (regression is a form of smoothing and Galton popularized that a while ago). I personally like locally weighted scatterplot smoothing a lot. This paper is a good one by Cleveland about loess. Here it is in a gif. But people also like smoothing splines, Hidden Markov Models, moving averages and many other smoothing choices.
4. Before you analyze your data with computers, be sure to plot it. A common mistake made by amateur analysts is to immediately jump to fitting models to big data sets with the fanciest computational tool. But you can miss pretty obvious things like this if you don’t plot your data. There are too many plots to talk about individually, but one example of an incredibly important plot is the Bland-Altman plot (called an MA-plot in genomics), used when comparing measurements from multiple technologies. R provides tons of graphics for a reason and ggplot2 makes them pretty.
5. Interactive analysis is the best way to really figure out what is going on in a data set. This is related to the previous point; if you want to understand a data set you have to be able to play around with it and explore it. You need to make tables, make plots, and identify quirks, outliers, missing data patterns and problems with the data. To do this you need to interact with the data quickly. One way to do this is to analyze the whole data set at once using tools like Hive, Hadoop, or Pig. But an often easier, better, and more cost effective approach is to use random sampling. As Robert Gentleman put it, “make big data as small as possible as quick as possible”.
6. Know what your real sample size is. It can be easy to be tricked by the size of a data set. Imagine you have an image of a simple black circle on a white background stored as pixels. As the resolution increases the size of the data increases, but the amount of information may not (hence vector graphics). Similarly in genomics, the number of reads you measure (which is a main determinant of data size) is not the sample size; the sample size is the number of individuals. In social networks, the number of people in the network may not be the sample size. If the network is very dense, the sample size might be much less. In general, the bigger the sample size the better, but sample size and data size aren’t always tightly correlated.
7. Unless you ran a randomized trial, potential confounders should keep you up at night. Confounding is maybe the most fundamental idea in statistical analysis. It is behind the spurious correlations like these and the reason why nutrition studies are so hard. It is very hard to hold people to a randomized diet and people who eat healthy diets might be different than people who don’t in other important ways. In big data sets confounders might be technical variables about how the data were measured or they could be differences over time in Google search terms. Any time you discover a cool new result, your first thought should be, “what are the potential confounders?”
8. Define a metric for success up front. Maybe the simplest idea, but one that is critical in statistics and decision theory. Sometimes your goal is to discover new relationships and that is great if you define that up front. One thing that applied statistics has taught us is that changing the criteria you are going for after the fact is really dangerous. So when you find a correlation, don’t assume you can predict a new result or that you have discovered which way a causal arrow goes.
9. Make your code and data available and have smart people check it. As several people pointed out about my last post, the Reinhart and Rogoff problem did not involve big data. But even in this small data example, there was a bug in the code used to analyze the data. With big data and complex models this is even more important. Mozilla Science is doing interesting work on code review for data analysis in science. But in general if you just get a friend to look over your code it will catch a huge fraction of the problems you might have.
10. Problem first, not solution backward. One temptation in applied statistics is to take a tool you know well (regression) and use it to hit all the nails (epidemiology problems). There is a similar temptation in big data to get fixated on a tool (hadoop, pig, hive, nosql databases, distributed computing, gpgpu, etc.) and ignore the actual problem: can we infer that x relates to y or that x predicts y?
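
To make the multiple testing point in the list above a bit more concrete, here is a minimal Python sketch of the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. The function name `benjamini_hochberg` and the ten p-values are made up for illustration and do not come from any real study; in practice you would probably use an existing, tested implementation rather than this sketch.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    by the Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis whose p-value has rank <= max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten tests (illustration only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]

naive = [p <= 0.05 for p in p_values]          # no correction
bh = benjamini_hochberg(p_values, alpha=0.05)  # FDR controlled at 5%

print("significant without correction:", sum(naive))
print("significant after BH:", sum(bh))
```

With these made-up numbers, five tests pass the uncorrected 0.05 cutoff but only the two smallest p-values survive the FDR correction.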

Why big data is in trouble: they forgot about applied statistics

07 May 2014

This year the idea that statistics is important for big data has exploded into the popular media. Here are a few examples, starting with the Lazer et al. paper in Science that got the ball rolling on this idea.

All of these articles warn about issues that statisticians have been thinking about for a very long time: sampling populations, confounders, multiple testing, bias, and overfitting. In the rush to take advantage of the hype around big data, these ideas were ignored or not given sufficient attention.

One reason is that when you actually take the time to do an analysis right, with careful attention to all the sources of variation in the data, it is almost a law that you will have to make smaller claims than you could if you just shoved your data in a machine learning algorithm and reported whatever came out the other side.

The prime example in the press is Google Flu Trends. Google Flu Trends was originally developed as a machine learning algorithm for predicting the number of flu cases based on Google search terms. While the underlying data management and machine learning algorithms were correct, a misunderstanding about the uncertainties in the data collection and modeling process has led to highly inaccurate estimates over time. A statistician would have thought carefully about the sampling process, identified time series components to the spatial trend, investigated why the search terms were predictive, and tried to understand the likely reason that Google Flu Trends was working.

As we have seen, lack of expertise in statistics has led to fundamental errors in both genomic science and economics. In the first case a team of scientists led by Anil Potti created an algorithm for predicting the response to chemotherapy. This solution was widely praised in both the scientific and popular press. Unfortunately the researchers did not correctly account for all the sources of variation in the data set, misapplied statistical methods, and ignored major data integrity problems. The lead author and the editors who handled this paper didn’t have the necessary statistical expertise, which led to major consequences and cancelled clinical trials.

Similarly, two economists, Reinhart and Rogoff, published a paper claiming that GDP growth was slowed by high governmental debt. Later it was discovered that there was an error in an Excel spreadsheet they used to perform the analysis. But more importantly, the choice of weights they used in their regression model was questioned as being unrealistic and leading to dramatically different conclusions than the authors espoused publicly. The primary failing was a lack of sensitivity analysis to data analytic assumptions that any well-trained applied statistician would have performed.

Statistical thinking has also been conspicuously absent from major public big data efforts so far. Here are some examples:

White House Big Data Partners Workshop - 0/19 statisticians
National Academy of Sciences Big Data Workshop - 2/13 speakers statisticians
Moore Foundation Data Science Environments - 0/3 directors from statistical background, 1/25 speakers at OSTP event about the environments was a statistician
Original group that proposed NIH BD2K - 0/18 participants statisticians
Big Data rollout from the White House - 0/4 thought leaders statisticians, 0/n participants statisticians.

One example of this kind of thinking is this insane table from the alumni magazine of the University of California which I found from this [talk by Terry Speed](http://www.chalmers.se/en/areas-of-advance/ict/calendar/Pages/Terry-Speed.aspx) (via Rafa, go watch his talk right now, it gets right to the heart of the issue). It shows a fundamental disrespect for applied statisticians who have developed serious expertise in a range of scientific disciplines.

All of this leads to two questions:

1. Given the importance of statistical thinking, why aren’t statisticians involved in these initiatives?
2. When thinking about the big data era, what are some statistical ideas we’ve already figured out?

JHU Data Science: More is More

05 May 2014

Today Jeff Leek, Brian Caffo, and I are launching 3 new courses on Coursera as part of the Johns Hopkins Data Science Specialization. These courses are

I’m particularly excited about Reproducible Research, not just because I’m teaching it, but because I think it’s essentially the first of its kind being offered in a massive open format. Given the rich discussions about reproducibility that have occurred over the past few years, I’m happy to finally be able to offer this course for free to a large audience.

These courses are launching in addition to the first 3 courses in the sequence: The Data Scientist’s Toolbox, R Programming, and Getting and Cleaning Data, which are also running this month in case you missed your chance in April.

All told we have 6 of the 9 courses in the Specialization available as of today. We’re really looking forward to next month where we will be launching the final 3 courses: Regression Models, Practical Machine Learning, and Developing Data Products. We also have some exciting announcements coming soon regarding the Capstone Projects.

Every course will be available every month, so don’t worry about missing a session. You can always come back next month.

Confession: I sometimes enjoy reading the fake journal/conference spam

30 Apr 2014

I've spent a considerable amount of time setting up filters to avoid getting spam from fake journals and conferences. Unfortunately, they are exceptionally good at thwarting my defenses. This does not annoy me as much as I pretend because, secretly, I enjoy reading some of these emails. Here are three of my favorites.

1) Over-the-top robot:

It gives us immense pleasure to invite you and your research allies to submit a manuscript for the journal “REDACTED”. The expertise of you in the never ending field of Gene Technology is highly appreciable. The level of intricacy shown by you in your work makes us even more proud, and we believe that your works should be known to mankind of science.

2) Sarcastic robot?

First of all, congratulations on the publication of your highly cited original article < The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores > in the field of colon cancer, which has been cited more than 1 times and is in the world’s top one percent of papers. Such high number of citations reflects the high quality and influence of your paper.

3) Intimidating robot:

This is Rocky.... Recently we have mailed you about the details of the conference. But we still have not received your response. So today we contact you again.

NB: Although I am joking in this post, I do think these fake journals and conferences are a very serious problem. The fact that they are still around means enough money (mostly taxpayer money) is being spent to keep them in business. If you want to learn more, this blog does a good job of reporting on them and includes a list of culprits.

Picking a (bio)statistics thesis topic for real world impact and transferable skills

22 Apr 2014

One of the things that was hardest for me in graduate school was starting to think about my own research projects and not just the ideas my advisor fed me. I remember that it was stressful because I didn’t quite know where to start. After having done this for a while and particularly after having read a bunch of papers by people who are way more successful than I am, I have come to the following algorithm as a means for finding a topic that will have real world impact and also give you skills to take on new problems in a flexible way.

1. Find a scientific problem that hasn’t been solved with data (by far hardest part)
2. Define your metric for success
3. Collect data/partner up with someone with data for that problem.
4. Create a good solution to the problem
5. Only invent new methods if you must
6. (Optional) Write software and document the hell out of it
7. (Optional) Respond to users and update as needed
8. Don’t get (meanly) competitive

The first step is definitely the most important and the hardest. The balance is between big important problems that lots of people are working on but where the potential for innovation is low and small detailed problems where you won’t have serious competition but you will have limited impact. In general good ways to find scientific problems are the following. (1) Find close and real scientific/applications collaborators. Not real like you talk to them once a month, real like you have a weekly meeting, you try to understand how their data are collected or generated, and you ask them specifically what problems prevent them from doing their job well, then solve those problems. (2) You come up with a scientific question you have on your own. In mature research areas like genomics this requires a huge amount of reading to know what people have done before you, or to at least know what new technologies/data are becoming available. (3) You could read a ton of papers and find one that produces interesting data you think could answer a question the authors haven’t asked. In general, the key is to put the problem first, before you even think about how to quantify or answer the question.

Next you have to define your metric for success. This metric should be scientific. You should try to say, “if I could predict x at 70% accuracy I could solve scientific problem y” or “if I could infer the relationship between x and y I would know something about z”. The metric should be compared to the scientific standards in the field. As an example, screening tests for the general population often must be 99% sensitive and specific (or more) due to low prevalence. But in a sub population, sensitivity and specificity of 70% or 80% may be really useful.
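
The reason prevalence matters so much for the screening example can be seen with a little arithmetic. Here is a minimal Python sketch that computes the positive predictive value (the chance that a positive test is a true positive) from sensitivity, specificity, and prevalence using Bayes' rule. The function name and the two prevalence values (0.1% for the general population, 30% for the sub population) are assumptions picked purely for illustration.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive test result is a true positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical prevalences, chosen only for illustration.
general = positive_predictive_value(0.99, 0.99, prevalence=0.001)
subpop = positive_predictive_value(0.80, 0.80, prevalence=0.30)

print(f"general population PPV: {general:.3f}")
print(f"sub population PPV:     {subpop:.3f}")
```

Under these assumed numbers, a test that is 99% sensitive and specific still produces mostly false positives in the general population (PPV of roughly 9%), while an 80%/80% test in the higher prevalence group has a PPV of roughly 63%.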

Then you find the data. Here the key quote comes from Tukey:

The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

My experience is that when you start with the problem first, the data are often hard to come by, have quirks, or are not quite right for the problem you want to solve. Generating the perfect data is often very expensive, so a huge amount of the effort you will spend is either (a) generating the perfect data or (b) determining if the data you collected is “good enough” to answer the question. One important point here is that knowing when you have failed is the entire name of the game here. If you get stuck once, you should try again. If you get stuck 100 times, it might be time to look for a different data set or figure out why the problem is unanswerable with current data. Incidentally, this is the most difficult part of the approach I’m proposing for coming up with topics. Failure is both likely and frequent, but that is a good thing when you are in grad school if you can learn from it and learn to predict when you are going to fail.

Since you’ve identified a problem that hasn’t been solved before in step 1, the first thing to try is to come up with a sensible solution using only the methods that already exist. In many cases, these existing methods will work pretty well. If they don’t, invent only as much statistical methodology and theory as you need to solve the problem. If you invent something new here, you should try it out on simple simulated examples and complex data where you either know the answer or can perform cross-validation/replication analysis.

At this point, if you have a basic solution to the problem, even if it is just the t-test, you are in great shape! You have solved a problem that is new and you are ready to publish. If you have invented some methods along the way, publish those, too!

In some cases the problems you solve will be focused on an area where lots of other people can collect similar data to answer similar problems. In this case, your most direct route to maximum impact is to write simple, usable, and really well documented software other people can use. Write it in R, make it free, give it a vignette and advertise it! If people use your software they will send you bug reports, patches, typos, fixes, and wish lists of things they want your software to do. The more you help people and respond, the more your software will get used and the more impact your method will have.

Step 8 is often the hardest part. If you do something interesting, you will have a ton of competitors. People will write better and more precise methods down and will “beat” your method. That’s ok, in fact it is good! The more people that compare to your approach, the more you know you picked a good problem. In some cases, people will genuinely create better methods than you will. Learn from them and make your methods and software better. But try not to be upset that they wrote a paper about how their idea is so much better than yours; it is a high compliment that they thought your idea was worth comparing to. This is one the author of the post hasn’t nailed down perfectly but I think the more you can do it the happier you will be.

The best part of this algorithm is that it gives you the problem-first focus that will make it easy to transition if you do a postdoc with a different kind of data, or move to industry, or start with new collaborators.

Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek