20 Jun 2013
This article is getting some attention because Google’s VP for people operations has made public a few insights that the Google HR team has come to over the last several years. The most surprising might be:
- They don’t collect GPAs except for new candidates
- Test scores are worthless
- Interview scores aren’t correlated with success
- The brainteasers Google is so famous for are worthless
- Behavioral interviews are the most effective
The reason the article is getting so much attention is how surprising these facts may be to people who have little experience hiring/managing in technical fields. But I thought this quote was really telling:
One of my own frustrations when I was in college and grad school is that you knew the professor was looking for a specific answer. You could figure that out, but it’s much more interesting to solve problems where there isn’t an obvious answer.
Interestingly, that is the whole point of my data analysis course here at Hopkins. Over my relatively limited time as a faculty member I realized there were two key qualities that made students in biostatistics stand out: (1) they were hustlers - willing to just keep working until the problem was solved, even when it was frustrating - and (2) they were willing and able to try new approaches or techniques they weren’t comfortable with. I don’t have the quantitative data that Google does, but I would venture to guess that those two traits explain 80%+ of the variation in success rates for graduate students in statistics/computing/data analysis.
Once that realization is made, it becomes clear pretty quickly that textbook problems or re-analysis of well known data sets measure something orthogonal to traits (1) and (2). So I went about redesigning the types of problems our students had to tackle. Instead of assigning problems out of a book I redesigned the questions to have the following characteristics:
- They were based on live data sets. I define a “live” data set as a data set that has not been used to answer the question of interest previously.
- The questions were problem-forward, not solution-backward. I would have an idea of what would likely work and what would likely not, but I defined the questions without thinking about what methods the students might use.
- The answer was open ended (and often not known to me in advance).
- The problems often had to do with unique scenarios not encountered frequently in statistics (e.g., you have a census rather than just a sample).
- The problems involved methods application/development, coding, and writing/communication.
I have found that problems with these characteristics more precisely measure the hustle and flexibility Google is looking for in its hiring practices. Of course, there are some downsides to this approach. I think it can be more frustrating for students, who don’t have as clearly defined a path through the homework. It also means dramatically more work for the instructor in terms of analyzing the data to find the quirks, creating personalized feedback for students, and properly estimating the amount of work a project will take.
We have started thinking about how to do this same thing at scale on Coursera. In the meantime, Google will just have to send their recruiters to Hopkins Biostats to find students who meet the characteristics they are looking for :-).
14 Jun 2013
Whether you know it or not, everything you do produces data - from the websites you read to the rate at which your heart beats. Until pretty recently, most of the data you produced wasn’t collected; it floated off unmeasured. The only data that were collected were painstakingly gathered by scientists one number at a time in small experiments with a few people. This laborious process meant that data were expensive and time-consuming to collect. Yet many of the most amazing scientific discoveries over the last two centuries were squeezed from just a few data points. But over the last two decades, the unit price of data has dramatically dropped. New technologies touching every aspect of our lives, from our money to our health to our social interactions, have made data collection cheap and easy (see e.g. Camp Williams).
To give you an idea of how steep the drop in the price of data has been, in 1967 Stanley Milgram did an experiment to determine the number of degrees of separation between two people in the U.S. In his experiment he sent 296 letters to people in Omaha, Nebraska and Wichita, Kansas. The goal was to get the letters to a specific person in Boston, Massachusetts. The trick was that people had to send the letters to someone they knew, and that person then sent the letter to someone they knew, and so on. At the end of the experiment, only 64 letters made it to the individual in Boston. On average, the letters had gone through 6 people to get there. This is where the idea of “6 degrees of Kevin Bacon” comes from. Based on 64 data points. A 2007 study updated that number to “7 degrees of Kevin Bacon”. The study was based on 30 billion instant messaging conversations collected over the course of a month or two with the same amount of effort.
Once data started getting cheaper to collect, it got cheaper fast. Take another example, the human genome. The genome is the unique DNA code in every one of your cells. It consists of a set of 3 billion letters that is unique to you. By many measures, the race to be the first group to collect all 3 billion letters from a single person kicked off the data revolution in biology. The project was completed in 2000 after a decade of work and $3 billion to collect the 3 billion letters in the first human genome. This project was actually a stunning success - most people thought it would be much more expensive. But just over a decade later, new technology means that we can now collect all 3 billion letters from a person’s genome for about $10,000 in about a week.
As the price of data dropped so dramatically over the last two decades, the division of labor between analysts and everyone else became less and less clear. Data became so cheap that it couldn’t be confined to just a few highly trained people. So raw data started to trickle out in a number of different ways. It started with maps of temperatures across the U.S. in newspapers and quickly ramped up to information on how many friends you had on Facebook, the price of tickets on 50 airlines for the same flight, or measurements of your blood pressure, good cholesterol, and bad cholesterol at every doctor’s visit. Arguments about politics started focusing on the results of opinion polls and who was asking the questions. The doctor stopped telling you what to do and started presenting you with options and the risks that went along with each.
That is when statisticians stopped being the primary data analysts. At some point, the trickle of data about you, your friends, and the world started impacting every component of your life. Now almost every decision you make is based on data you have about the world around you. Let’s take something simple, like where you are going to eat tonight. You might just pick the nearest restaurant to your house. But you could also ask your friends on Facebook where you should eat, read reviews on Yelp, or check out menus on the restaurants’ websites. All of these are pieces of data that are collected and presented for you to “analyze”.
This revolution demands a new way of thinking about statistics. It has precipitated explosive growth in data visualization - the most accessible form of data analysis. It has encouraged explosive growth in MOOCs like the ones Roger, Brian and I taught. It has created open data initiatives in government. It has also encouraged more accessible data analysis platforms in the form of startups like StatWing that make it easier for non-statisticians to analyze data.
What does this mean for statistics as a discipline? Well, it is great news in that we have a lot more people to train. It also really drives home the importance of statistical literacy. But it also means we need to adapt our thinking about what it means to teach and perform statistics. We need to focus increasingly on interpretation and critique and move away from formulas and memorization (think English composition versus grammar). We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more fool-proofing, more time automating, and more time creating software. The potential payoff is huge for realizing that the tide has turned and most people who analyze data aren’t statisticians.
13 Jun 2013
There is an idea I have been thinking about for a while now. It re-emerged at the top of my list after seeing this really awesome post on using metadata to identify “conspirators” in the American Revolution. My first thought was: but how do you know that you aren’t just making lots of false discoveries?
Hypothesis testing and significance analysis were originally developed to make decisions for single hypotheses. In many modern applications, it is more common to test hundreds or thousands of hypotheses. In the standard multiple testing framework, you perform a hypothesis test for each of the “features” you are studying (these are typically genes or voxels in high-dimensional problems in biology, but can be other things as well). Then the following outcomes are possible:
|            | Call Null True  | Call Null False | Total       |
|------------|-----------------|-----------------|-------------|
| Null True  | True Negatives  | False Positives | True Nulls  |
| Null False | False Negatives | True Positives  | False Nulls |
|            | No Decisions    | Rejections      |             |
The reason for “No Decisions” is that the way hypothesis testing is set up, one should technically never accept the null hypothesis. The number of rejections is the total number of times you claim that a particular feature shows a signal of interest.
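To make the table concrete, here is a minimal simulated sketch (my own, not part of the original argument): 1,000 tests with roughly 80% true nulls, calling the null false whenever the P-value is below 0.05, and tabulating the four outcomes.
## Hypothetical simulation of the table above
set.seed(42)
nTests <- 1000
nullTrue <- runif(nTests) < 0.8
## Null P-values are uniform; alternative P-values are skewed toward zero
pSim <- ifelse(nullTrue, runif(nTests), rbeta(nTests,1,50))
rejected <- pSim <= 0.05
table(truth = ifelse(nullTrue,"Null True","Null False"),
      decision = ifelse(rejected,"Call Null False","Call Null True"))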
A very common measure of embarrassment in multiple hypothesis testing scenarios is the false discovery rate, defined as:

$$FDR = E\left[\frac{\text{False Positives}}{\text{Rejections}}\right].$$

There are some niceties that have to be dealt with here, like the fact that the number of Rejections may be equal to zero, inspiring things like the positive false discovery rate, which has some nice Bayesian interpretations.
The way that the process usually works is that a test statistic is calculated for each hypothesis test, where a larger statistic means more significant, and then operations are performed on the ordered statistics. The two most common operations are: (1) pick a cutoff along the ordered list of P-values, call everything less than this threshold significant, and estimate the FDR for that cutoff; or (2) pick an acceptable FDR level and find an algorithm to pick the threshold that controls the FDR, where control is usually defined by saying something like the algorithm guarantees

$$E\left[\frac{\text{False Positives}}{\text{Rejections}}\right] \leq \alpha$$

for the pre-specified level $\alpha$.
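As a rough illustration of these two operations (my own sketch, not from the original post), here is how they might look in R on simulated P-values, using a simple plug-in estimate for (1) and the Benjamini-Hochberg adjustment via p.adjust() for (2):
## Two common operations on an ordered list of P-values (hypothetical example)
set.seed(101)
pSim2 <- c(runif(900), rbeta(100,1,50))   ## 900 nulls, 100 signals
## (1) Pick a cutoff and estimate the FDR at that cutoff
## (plug-in estimate, conservatively treating all tests as null)
cutoff <- 0.01
length(pSim2)*cutoff/sum(pSim2 <= cutoff)
## (2) Pick an acceptable FDR level (5%) and let an algorithm
## (here Benjamini-Hochberg) choose the rejection threshold
sum(p.adjust(pSim2, method="BH") <= 0.05)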
Regardless of the approach these methods usually make an assumption that the rejection regions should be nested. In other words, if you call statistic $T_k$ significant and $T_j > T_k$ then your method should also call statistic $T_j$ significant. In the absence of extra information, this is a very reasonable assumption.
But in many situations you might have additional information you would like to use in the decision about whether to reject the null hypothesis for test $j$.
Example 1 A common example is gene-set analysis. Here you have a group of hypotheses that you have tested individually and you want to say something about the level of noise in the group. In this case, you might want to know something about the level of noise if you call the whole set interesting.
Example 2 Suppose you are a mysterious government agency and you want to identify potential terrorists. You observe some metadata on people and you want to predict who is a terrorist - say using betweenness centrality. You could calculate a P-value for each individual, say using a randomization test. Then estimate your FDR based on predictions using the metadata.
Example 3 You are monitoring a system over time where observations are random - say, whether there is an outbreak of a particular disease in a particular region at a given time, i.e., whether the rate of disease is higher than the background rate. How can you estimate the rate at which you make false claims?
For now I’m going to focus on the estimation scenario but you could imagine using these estimates to try to develop controlling procedures as well.
In each of these cases you have a scenario where you are interested in something like:

$$FDR(x) = E\left[\frac{\text{False Positives}}{\text{Rejections}} \mid X = x\right]$$

where $FDR(x)$ is a covariate-specific estimator of the false discovery rate. Returning to our examples you could imagine:

Example 1 $FDR(x)$ where $x$ indicates membership in the gene set

Example 2 $FDR(x)$ where $x$ is the betweenness centrality computed from the metadata

Example 3 $FDR(t) = s(t)$ where $t$ is time

where in the last case we have parameterized the relationship between FDR and time with a flexible model like a cubic spline $s(t)$.
The hard problem is fitting the regression models in Examples 1-3. Here I propose a basic estimator of the FDR regression model and leave it to others to be smart about it. Let’s focus on P-values because they are the easiest to deal with. Suppose that we calculate the random variables $Y_i = 1(P_i > \lambda)$. Then:

$$E[Y_i] = \Pr(P_i > \lambda) = \pi_0 (1 - \lambda) + (1 - \pi_0)\{1 - G(\lambda)\}$$

where $G(\lambda)$ is the distribution function of the P-values under the alternative hypothesis (this may be a mixture distribution) and $\pi_0$ is the proportion of true null hypotheses. If we assume reasonably powered tests and that $\lambda$ is large enough, then $G(\lambda) \approx 1$. So

$$E[Y_i] \approx \pi_0 (1 - \lambda).$$

One obvious choice is then to try to model

$$E[Y_i \mid X_i = x_i] \approx \pi_0(x_i)(1 - \lambda).$$

We could, for example, use the model:

$$E[Y_i \mid X_i = x_i] = f(x_i)$$

where $f(x_i)$ is a linear model or spline, etc. Then we get the fitted values and calculate:

$$\hat{\pi}_0(x_i) = \frac{\hat{f}(x_i)}{1 - \lambda}.$$
Here is a little simulated example where the goal is to estimate the probability of being a false positive as a smooth function of time.
## Load libraries
library(splines)
## Define the number of tests
set.seed(1345)
ntest <- 1000
## Set up the time vector and the probability of being null
tme <- seq(-2,2,length=ntest)
pi0 <- pnorm(tme)
## Calculate a random variable indicating whether to draw
## the p-values from the null or alternative
nullI <- rbinom(ntest,prob=pi0,size=1) > 0
## Sample the null P-values from U(0,1) and the alternatives
## from a beta distribution
pValues <- rep(NA,ntest)
pValues[nullI] <- runif(sum(nullI))
pValues[!nullI] <- rbeta(sum(!nullI),1,50)
## Set lambda and calculate the estimate
lambda <- 0.8
y <- pValues > lambda
glm1 <- glm(y ~ ns(tme,df=3))
## Get the estimated pi0 values
pi0hat <- glm1$fitted/(1-lambda)
## Plot the real versus fitted probabilities
plot(pi0,pi0hat,col="blue",type="l",lwd=3,xlab="Real pi0",ylab="Fitted pi0")
abline(c(0,1),col="grey",lwd=3)
The result is this plot:

Real versus estimated false discovery rate when calling all tests significant.
This estimate is obviously not guaranteed to estimate the FDR well; the operating characteristics need to be evaluated both theoretically and empirically, and the other examples need to be fleshed out. But isn’t the idea of FDR regression cool?
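One small variation worth trying (my own tweak, not something proposed in the post): fit the same regression with a binomial family so the fitted values are constrained to [0, 1] before rescaling, then overlay the result on the plot above.
## Hypothetical tweak: refit with a binomial family so fitted values
## stay in [0,1], then rescale by 1/(1-lambda) as before
glm2 <- glm(y ~ ns(tme,df=3),family=binomial)
pi0hat2 <- pmin(glm2$fitted/(1-lambda),1)
lines(pi0,pi0hat2,col="red",lwd=3)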
12 Jun 2013
There has been a lot of discussion of personalized medicine, individualized health, and precision medicine in the news and in the medical research community. Despite this recent attention, it is clear that healthcare has always been personalized to some extent. For example, men are rarely pregnant and heart attacks occur more often among older patients. In these cases, easily collected variables, such as sex and age, can be used to predict health outcomes and therefore used to “personalize” healthcare for those individuals.
So why the recent excitement around personalized medicine? The reason is that it is increasingly cheap and easy to collect more precise measurements about patients that might be able to predict their health outcomes. An example that has recently been in the news is the measurement of mutations in the BRCA genes. Angelina Jolie made the decision to undergo a prophylactic double mastectomy based on her family history of breast cancer and measurements of mutations in her BRCA genes. Based on these measurements, previous studies had suggested she might have a lifetime risk as high as 80% of developing breast cancer.
This kind of scenario will become increasingly common as newer and more accurate genomic screening and predictive tests are used in medical practice. When I read these stories there are two points I think of that sometimes get obscured by the obviously fraught emotional, physical, and economic considerations involved with making decisions on the basis of new measurement technologies:
- In individualized health/personalized medicine the “treatment” is information about risk. In some cases treatment will be personalized based on assays. But in many other cases, we still do not (and likely will not) have perfect predictors of therapeutic response. In those cases, the healthcare will be “personalized” in the sense that the patient will get more precise estimates of their likelihood of survival, recurrence etc. This means that patients and physicians will increasingly need to think about/make decisions with/act on information about risks. But communicating and acting on risk is a notoriously challenging problem; personalized medicine will dramatically raise the importance of understanding uncertainty.
- Individualized health/personalized medicine is a population-level treatment. Assuming that the 80% lifetime risk estimate was correct for Angelina Jolie, it still means there was a 1 in 5 chance she was never going to develop breast cancer. If that had been the case, then the surgery was unnecessary. So while her decision was based on personal information, there is still uncertainty in that decision for her. So the “personal” decision may not always be the “best” decision for any specific individual. It may, however, be the best thing to do for everyone in a population with the same characteristics.