Simply Statistics - A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Use R! 2014 to be at UCLA

The 2014 Use R! conference will be in Los Angeles, CA and will be hosted by the UCLA Department of Statistics (an excellent department, I must say) and the newly created Foundation for Open Access Statistics. This is basically the meeting for R users and developers and has grown to be quite an event.

Fourth of July data/statistics link roundup (7/4/2013)

  1. An interesting post about how lots of people start out in STEM majors but eventually bail because the majors are too hard. The authors recommend either (1) better preparing high school students or (2) making STEM majors easier. I like the idea of making STEM majors more interactive and self-paced. There is a bigger issue here of weed-out classes and barrier classes that deserves a longer discussion (via Alex N.).
  2. This is an incredibly interesting FDA proposal to share all clinical data. I didn’t know this, but apparently right now all FDA data is proprietary. That is stunning to me, given the openness that we have, say, in genomic data, where most data are public. This goes beyond even the alltrials idea of reporting all results. I think we need full open disclosure of data and need to think hard about the privacy/consent implications this may have (via Rima I.).
  3. This is a pretty cool data science fellowship program for people who want to transition from academia to industry, post PhD. I have no idea if the program is any good, but certainly the concept is a great one. (via Sherri R.)
  4. A paper in Nature Methods about data visualization and understanding the levels of uncertainty in data analysis. I love seeing that journals are recognizing the importance of uncertainty in analysis. Sometimes I feel like the “biggies” want perfect answers with no uncertainty - which never happens.

That’s it, just a short set of links today. Enjoy your 4th!

Repost: The 5 Most Critical Statistical Concepts

(Editor’s Note: This is an old post but a good one from Jeff.)

It seems like everywhere we look, data is being generated - from politics, to biology, to publishing, to social networks. There are also diverse new computational tools, like GPGPU and cloud computing, that expand the statistical toolbox. Statistical theory is more advanced than it’s ever been, with exciting work in a range of areas.

With all the excitement going on around statistics, there is also increasing diversity. It is increasingly hard to define “statistician” since the definition ranges from very mathematical to very applied. An obvious question is: what are the most critical skills needed by statisticians?

So just for fun, I made up my list of the top 5 most critical skills for a statistician by my own definition. They are by necessity very general (I only gave myself 5).

  1. The ability to manipulate/organize/work with data on computers - whether it is with Excel, R, SAS, or Stata, to be a statistician you have to be able to work with data.
  2. A knowledge of exploratory data analysis - how to make plots, how to discover patterns with visualizations, how to explore assumptions
  3. Scientific/contextual knowledge - at least enough to be able to abstract and formulate problems. This is what separates statisticians from mathematicians.
  4. Skills to distinguish true from false patterns - whether with p-values, posterior probabilities, meaningful summary statistics, cross-validation or any other means.
  5. The ability to communicate results to people without math skills - a key component of being a statistician is knowing how to explain math/plots/analyses.

What are your top 5? What order would you rank them in? Even though these are so general, I almost threw regression in there because of how often it pops up in various forms.

Related Posts: Rafa on graduate education and What is a Statistician? Roger on “Do we really need applied statistics journals?”

Measuring the importance of data privacy: embarrassment and cost

We live in an era when it is inexpensive and easy to collect data about ourselves or about other people. These data can take the form of health information - like medical records, or they could be financial data - like your online bank statements, or they could be social data - like your friends on Facebook. We can also easily collect information about our genetic makeup or our fitness (although it can be hard to get).

All of these data types are now stored electronically. There are obvious reasons why this is both economical and convenient. The downside, of course, is that the data can be used by the government or other entities in ways that you may not like. Whether it is to track your habits to sell you new products or to use your affiliations to make predictions about your political leanings, these data are not just “numbers”.

Data protection and data privacy are major issues in a variety of fields. In some areas, laws are in place to govern how your data can be shared and used (e.g. HIPAA). In others it is a bit more of a wild west mentality (see this interesting series of posts, “Know your data” by junkcharts talking about some data issues). I think most people have some idea that they would like to keep at least certain parts of their data private (from the government, from companies, or from their friends/family), but I’m not sure how most people think about data privacy.

For me there are two scales on which I measure the importance of the privacy of my own data:

  1. Embarrassment - Data about my personal habits, whether I let my son watch too much TV or what kind of underwear I buy, could be embarrassing if they were out in public.
  2. Financial - Data like my social security number, my bank account numbers, or my credit card account could be used to cost me money, either now or in the future.

My concerns about data privacy can almost always be measured primarily on these two scales. For example, I don’t want my medical records to be public because: (1) it might be embarrassing for people to know how bad my blood pressure is and (2) insurance companies might charge me more if they knew. On the other hand, I don’t want my bank account to get out primarily because it could cost me financially. So that mostly only registers on one scale.

One option, of course, would be to make all of my data totally private. But the problem is I want to share some of it with other people - I want my doctor to know my medical history and my parents to get to see pictures of my son. Usually I just make these choices about data sharing without even thinking about them, but after a little reflection I think these are the main considerations that go into my data sharing choices:

  1. Where does it rate on the two scales above?
  2. How much do I trust the person I’m sharing with? For example, my wife knows my bank account info, but I wouldn’t give it to a random stranger on the street. Google has my email and uses it to market to me, but that doesn’t bother me too much. But I trust them (I think) not to, say, tell people I’m negotiating with about my plans based on emails I sent to my wife (this goes with #4 below).
  3. How hard would it be to use the information? I give my credit card to waiters at restaurants all the time, but I also monitor my account - so it would be relatively hard to run up a big bill before I (or the bank) noticed. I put my email address online, but there are a couple of steps between that and anything that is embarrassing or financially damaging for me. You’d have to be able to use it to hack some account.
  4. Is there incentive for someone to use the information? I’m not fabulously wealthy or famous. So most of the time, even if financial/embarrassing stuff is online about me, it probably wouldn’t get used. On the other hand, if I was an actor, a politician, or a billionaire there would be a lot more people incentivized to use my data against me. For example, if Google used my info to blow up a negotiation they would gain very little. I, on the other hand, would lose a lot and would probably sue them.*

With these ideas in mind, it is a little easier for me to classify (at least personally) how much I care about different kinds of privacy breaches.

For example, suppose my health information was posted on the web. I would consider this a problem because of both financial and embarrassment potential. It is also on the web, so I basically don’t trust the vast majority of people that would have access. On the other hand, it would be at least reasonably hard to use this data directly against me unless you were an insurance provider, and most people wouldn’t have the incentive.

Take another example: someone tagging me in Facebook photos (I don’t have my own account). Here the financial considerations are only potential future employment problems, but the embarrassment considerations are quite high. I probably somewhat trust the person tagging me since I at least likely know them. On the other hand it would be super-easy to use the info against me - it is my face in a picture and would just need to be posted on the web. So in this case, it mostly comes down to incentive and I don’t think most people have an incentive to use pictures against me (except in jokes - which I’m mostly cool with).

I could do more examples, but you get the idea. I do wonder if there is an interesting statistical model to be built here on the basis of these axioms (or other more general ones) about when/how data should be used/shared.

*An interesting side note is that I did use my Gmail account when I was considering a position at Google fresh out of my Ph.D. I sent emails to my wife and my advisor discussing my plans/strategy. I always wondered if they looked at those emails when they were negotiating with me - although I never had any reason to suspect they had.

What is the Best Way to Analyze Data?

One topic I’ve been thinking about recently is the extent to which data analysis is an art versus a science. In my thinking about art and science, I rely on Don Knuth’s distinction, from his 1974 lecture “Computer Programming as an Art”:

Science is knowledge that we understand so well that we can teach it to a computer; and if we don’t fully understand something, it is an art to deal with it. Since the notion of an algorithm or a computer program provides us with an extremely useful test for the depth of our knowledge about any given subject, the process of going from an art to a science means that we learn how to automate something.

Of course, the phrase “analyze data” is far too general; it needs to be placed in a much more specific context. So choose your favorite specific context and consider this question: Is there a way to teach a computer how to analyze the data generated in that context? Jeff wrote about this a while back and he called this magical program the deterministic statistical machine.

For example, one area where I’ve done some work is in estimating short-term/acute population-level effects of ambient air pollution. These studies are typically done using time series data of ambient pollution from central monitors and community-level counts of some health outcome (e.g. deaths, hospitalizations). The basic question is: if pollution goes up on a given day, do we also see health outcomes go up on the same day, or perhaps in the few days afterwards? This is a fairly well-worn question in the air pollution literature and there have been hundreds of time series studies published. Similarly, there has been a lot of research into the statistical methodology for conducting time series studies, and I would wager that as a result of that research we actually know something about what to do and what not to do.

But is our level of knowledge about the methodology for analyzing air pollution time series data to the point where we could program a computer to do the whole thing? Probably not, but I believe there are aspects of the analysis that we could program.

Here’s how I might break it down. Assume we basically start with a rectangular dataset with time series data on a health outcome (say, daily mortality counts in a major city), daily air pollution data, and daily data on other relevant variables (e.g. weather). Typically, the target of analysis is the association between the air pollution variable and the outcome, adjusted for everything else.

  1. Exploratory analysis. Not sure this can be fully automated. Need to check for missing data and maybe stop analysis if proportion of missing data is too high? Check for high leverage points as pollution data tends to be skewed. Maybe log-transform if that makes sense in this context. Check for other outliers and note them for later (we may want to do a sensitivity analysis without those observations). 
  2. Model fitting. This is already fully automated. If the outcome is a count, then typically a Poisson regression model is used. We already know that maximum likelihood is an excellent approach and better than most others under reasonable circumstances. There’s plenty of GLM software out there so we don’t even have to program the IRLS algorithm (a rough sketch of this step, along with steps 4 and 5, follows the list).
  3. Model building. Since this is not a prediction model, the main concern we have is that we properly adjusted for measured and unmeasured confounding. Francesca Dominici and some of her colleagues have done some interesting work regarding how best to do this via Bayesian model averaging and other approaches. I would say that in principle this can be automated, but the lack of easy-to-use software at the moment makes it a bit complicated. That said, I think simpler versions of the “ideal approach” can be easily implemented.
  4. Sensitivity analysis. There are a number of key sensitivity analyses that need to be done in all time series analyses. If there were outliers during EDA, maybe re-run model fit and see if regression coefficient for pollution changes much. How much is too much? (Not sure.) For time series models, unmeasured temporal confounding is a big issue so this is usually checked using spline smoothers on the time variable with different degrees of freedom. This can be automated by fitting the model many different times with different degrees of freedom in the spline.
  5. Reporting. Typically, some summary statistics for the data are reported along with the estimate + confidence interval for the air pollution association. Estimates from the sensitivity analysis should be reported (probably in an appendix), and perhaps estimates from different lags of exposure, if that’s a question of interest. It’s slightly more complicated if you have a multi-city study.
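To make the automatable pieces concrete, here is a minimal R sketch of how steps 2, 4, and 5 might look for a single city. It is my own illustration rather than code from any actual study: the data frame df, its column names (deaths, pm10, temp, date), the 7 degrees of freedom per year for the time spline, and the alternative df values in the sensitivity loop are all assumptions made for the example.

```r
library(splines)

## Hypothetical daily data frame `df` with columns `deaths` (counts),
## `pm10` (pollution), `temp` (temperature), and `date` (a Date);
## every tuning choice below is an illustrative placeholder.
years <- as.numeric(diff(range(df$date))) / 365.25

## Step 2: Poisson regression of daily deaths on pollution, adjusting for
## temperature and a smooth function of time (7 df per year is a common
## default in this literature, but it is still a modeling choice).
fit <- glm(deaths ~ pm10 + ns(temp, 3) +
             ns(as.numeric(date), df = round(7 * years)),
           family = poisson, data = df)

## Step 4: check how sensitive the pollution coefficient is to the
## degrees of freedom used for the time spline.
sens <- sapply(c(2, 4, 7, 10, 14), function(df_per_year) {
  refit <- glm(deaths ~ pm10 + ns(temp, 3) +
                 ns(as.numeric(date), df = round(df_per_year * years)),
               family = poisson, data = df)
  coef(refit)["pm10"]
})

## Step 5: report the estimate and 95% confidence interval as the percent
## increase in daily mortality per 10-unit increase in PM10.
beta <- coef(fit)["pm10"]
se   <- sqrt(vcov(fit)["pm10", "pm10"])
round(100 * (exp(10 * (beta + c(0, -1.96, 1.96) * se)) - 1), 2)
```

The point is not that these particular defaults are right, but that once they are written down, re-running them is entirely mechanical.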

So I’d say that of the five major steps listed above, the one that I find most difficult to automate is EDA. There, a lot of choices have to be made that are not easy to program into a computer. But I think the rest of the analysis could be automated. I’ve left out the cleaning and preparation of the data here, which also involves making many choices. But in this case, much of that is often outside the control of the investigator. These analyses typically use publicly available data where the data are available “as-is”. For example, the investigator would likely have no control over how the mortality counts were created.
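That said, some of the mechanical checks from step 1 could be encoded even if the judgment calls cannot. Here is a sketch of my own, under the same assumed data frame as the earlier code block, with arbitrary placeholder cutoffs; whether to log-transform, or what to do with the flagged days, is still left to the analyst.

```r
## Mechanical pieces of the exploratory step; cutoffs are placeholders.
eda_checks <- function(dat, pollutant = "pm10", max_missing = 0.20) {
  x <- dat[[pollutant]]

  ## Halt if too much of the pollution series is missing (the 20%
  ## default is arbitrary, not a recommendation).
  if (mean(is.na(x)) > max_missing) stop("too much missing pollution data")

  ## Flag a heavily right-skewed pollution distribution; a human still
  ## decides whether a log transform makes sense in this context.
  skewed <- mean(x, na.rm = TRUE) > 2 * median(x, na.rm = TRUE)

  ## Note the most extreme days for a later sensitivity analysis rather
  ## than dropping them automatically.
  extreme_days <- dat$date[which(x > quantile(x, 0.99, na.rm = TRUE))]

  list(skewed = skewed, extreme_days = extreme_days)
}

## Usage: checks <- eda_checks(df)
```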

What’s the point of all this? Well, I would argue that if we cannot completely automate a data analysis for a given context, then either we need to narrow the context, or we have some more statistical research to do. Thinking about how one might automate a data analysis process is a useful way to identify where the major statistical gaps are in a given area. Here, there may be some gaps in how best to automate the exploratory analyses. Whether those gaps can be filled (or more importantly, whether you are interested in filling them) is not clear. But most likely it’s not a good idea to think about better ways to fit Poisson regression models.

So what do you do when all of the steps of the analysis have been fully automated? Well, I guess it’s time to move on then….