Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Art from Data

There’s a nice piece by Mark Hansen about data-driven aesthetics in the New York Times special section on big data.

From a speedometer to a weather map to a stock chart, we routinely interpret and act on data displayed visually. With a few exceptions, data has no natural “look,” no natural “visualization,” and choices have to be made about how it should be displayed. Each choice can reveal certain kinds of patterns in the data while hiding others.

I think drawing a line between a traditional statistical graphic and a pure work of art would be somewhat difficult. You can find examples of both that might fall in the opposite category: traditional graphics that transcend their utilitarian purposes and “pure art” works that tell you something new about your world.

Indeed, I think Mark Hansen’s own work with Ben Rubin falls into the latter category: art pieces that perhaps began as pure works of art but ended up giving you new insight into the world. For example, Listening Post was a highly creative installation that simultaneously gave you an emotional connection to random people chatting on the Internet as well as insight into what the Internet was “saying” at any given time (I wonder if NSA employees took a field trip to the Whitney Museum of American Art!).

Doing Statistical Research

There’s a wonderful article over at the STATtr@k web site by Terry Speed on How to Do Statistical Research. There is a lot of good advice there, but the column is most notable because it’s pretty much the exact opposite of the advice that I got when I first started out.

To quote the article:

The ideal research problem in statistics is “do-able,” interesting, and one for which there is not much competition. My strategy for getting there can be summed up as follows:

  • Consulting: Do a very large amount
  • Collaborating: Do quite a bit
  • Research: Do some

For the most part, I was told to flip the research and consulting bits. That is, you want to spend most of your time doing “research” and very little of your time doing “consulting”. Why? Because ostensibly, the consulting work doesn’t involve new problems, only solving old problems with existing techniques. The research work by definition involves addressing new problems.

But,

A strategy I discourage is “develop theory/model/method, seek application.” Developing theory, a model, or a method suggests you have done some context-free research; already a bad start. The existence proof (Is there a problem?) hasn’t been given. If you then seek an application, you don’t ask, “What is a reasonable way to answer this question, given this data, in this context?” Instead, you ask, “Can I answer the question with this data; in this context; with my theory, model, or method?” Who then considers whether a different (perhaps simpler) answer would have been better?

The truth is, most problems can be solved with an existing method. They may not be 100% solvable with existing tools, but usually 90% is good enough and it’s not worth developing a new statistical method to cover the remaining 10%. What you really want to be doing is working on the problem that is 0% solvable with existing methods. Then there’s a pretty big payback if you develop a new method to address it and it’s more likely that your approach will be adopted by others simply because there’s no alternative. But in order to find these 0% problems, you have to see a lot of problems, and that’s where the consulting and collaboration come in. Exposure to lots of problems lets you see the universe of possibilities and gives you a sense of where scientists really need help and where they’re more or less doing okay.

Even if you agree with Terry’s advice, implementing it may not be so straightforward. It may be easier/harder to do consulting and collaboration depending on where you work. Also, finding good collaborators can be tricky and may involve some trial and error.

But it’s useful to keep this advice in mind, especially when looking for a job. The places you want to be on the lookout for are places that give you the most exposure to interesting scientific problems, the 0% problems. These places will give you the best opportunities for collaboration and for having a real impact on science.

Does fraud depend on my philosophy?

Ever since my last post on replication and fraud I’ve been doing some more thinking about why people consider some things “scientific fraud”. (First of all, let me just say that I was a bit surprised by the discussion in the comments for that post. Some people apparently thought I was asking about the actual probability that the study was a fraud. This was not the case. I just wanted people to think about how they would react when confronted with the scenario.)

I often find that when I talk to people about the topic of scientific fraud, especially statisticians, there is a sense that much work that goes on out there is fraudulent, but the precise argument for why is difficult to pin down.

Consider the following three cases:

  1. I conduct a randomized clinical trial comparing a new treatment and a control and their effect on outcome Y1. I also collect data on outcomes Y2, Y3, … Y10. After conducting the trial I see that there isn’t a significant difference for Y1, so I test the other 9 outcomes and find a significant effect (defined as a p-value below 0.05) for Y7. I then publish a paper about outcome Y7 and state that it’s significant with p=0.04. I make no mention of the other outcomes.
  2. I conduct the same clinical trial with the 10 different outcomes and look at the difference between the treatment groups for all outcomes. I notice that the largest standardized effect size is for Y7, with a standardized effect of 3, suggesting the treatment is highly effective in this trial. I publish a paper about outcome Y7 and state that the standardized effect size was 3 for comparing treatment vs. control. I note that a standardized difference of 3 is large in practical terms, but I make no mention of statistical significance or p-values. I also make no mention of the other outcomes.
  3. I conduct the same clinical trial with the 10 outcomes. Now I look at all 10 outcomes and calculate the posterior probability that the effect is greater than zero (favoring the new treatment), given a pre-specified diffuse prior on the effect (assume it’s the same prior for each effect). Of the 10 outcomes I see that Y7 has the largest posterior probability of 0.98. I publish a paper about Y7 stating that my posterior probability for a positive effect is 0.98. I make no mention of the other outcomes.

Which one of these cases constitutes scientific fraud?
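
Whichever way you lean, it helps to be concrete about the fact that all three reports can be computed from exactly the same trial. Below is a minimal simulation sketch in Python; the sample size, the standard-normal outcomes, the choice of standardization, and the absence of any true treatment effect are assumptions of the sketch, not details taken from the scenarios above.

```python
# Minimal sketch (hypothetical numbers): one simulated trial, ten outcomes,
# no true treatment effect, summarized the three ways described above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50                                    # patients per arm (assumed)
treat = rng.normal(size=(n, 10))          # ten outcomes, treatment arm
ctrl = rng.normal(size=(n, 10))           # ten outcomes, control arm

diff = treat.mean(axis=0) - ctrl.mean(axis=0)
se = np.sqrt(treat.var(axis=0, ddof=1) / n + ctrl.var(axis=0, ddof=1) / n)

# Case 1: classical two-sample t-test p-value for each outcome
pvals = stats.ttest_ind(treat, ctrl, axis=0).pvalue

# Case 2: one possible "standardized effect size": mean difference / pooled SD
pooled_sd = np.sqrt((treat.var(axis=0, ddof=1) + ctrl.var(axis=0, ddof=1)) / 2)
std_effect = diff / pooled_sd

# Case 3: posterior P(effect > 0) under a diffuse (essentially flat) prior,
# which is approximately Phi(diff / se)
post_prob = stats.norm.cdf(diff / se)

best = np.argmax(diff / se)               # the outcome most favorable to treatment
print(f"Y{best + 1}: p = {pvals[best]:.3f}, "
      f"standardized effect = {std_effect[best]:.2f}, "
      f"P(effect > 0 | data) = {post_prob[best]:.3f}")
```

Only the reporting differs from case to case; the simulated data, and hence the evidence, are identical.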

  1. I think most people would object to Case 1. This is the classic multiple testing scenario where the end result is that the stated p-value is not correct. Rather than a p-value of 0.04, the real p-value is more like 0.4 (a quick simulation of this multiplicity effect is sketched after this list). A simple Bonferroni correction fixes this but obviously would have resulted in not finding any significant effects based on a 0.05 threshold. The real problem is that in Case 1 you are clearly trying to make an inference about future studies. You’re saying that if there’s truly no difference, then in 100 other studies just like this one, you’d expect only 4 to detect a difference under the same criteria that you used. But it’s incorrect to say this, and perhaps fraudulent (or negligent) depending on your underlying intent. In this case, a relevant detail that is missing is the number of other outcomes that were tested.
  2. Case 2 differs from Case 1 only in that no p-values are used; instead, the measure of significance is the standardized effect size. Therefore, no probability statements are made and no inference is made about future studies. Although the information about the other outcomes is omitted here just as in Case 1, it’s difficult for me to identify what is wrong with this paper.
  3. Case 3 takes a Bayesian angle and is more or less like case 2 in my opinion. Here, probability is used as a measure of belief about a parameter but no explicit inferential statements are made (i.e. there is no reference to some population of other studies). In this case I just state my belief about whether an effect/parameter is greater than 0. Although I also omit the other 9 outcomes in the paper, revealing that information would not have changed anything about my posterior probability.
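
The “more like 0.4” figure in the discussion of Case 1 is easy to check by simulation. A quick sketch, again assuming ten independent outcomes and no true treatment effects (all numbers are illustrative):

```python
# Rough check of the Case 1 claim: how often does the *smallest* of ten
# null p-values fall below 0.05, with and without a Bonferroni adjustment?
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_out, n_sim = 50, 10, 5000
naive = bonf = 0
for _ in range(n_sim):
    treat = rng.normal(size=(n, n_out))
    ctrl = rng.normal(size=(n, n_out))
    pmin = stats.ttest_ind(treat, ctrl, axis=0).pvalue.min()
    naive += pmin < 0.05            # report the best of ten at the usual threshold
    bonf += pmin < 0.05 / n_out     # Bonferroni-adjusted threshold
print(naive / n_sim)   # about 0.40, i.e. roughly 1 - 0.95**10
print(bonf / n_sim)    # back to roughly 0.05
```

Under independence, the chance that the smallest of ten null p-values falls below 0.05 is 1 - 0.95^10, or about 0.40, which is what the naive “best outcome” report effectively delivers; the Bonferroni threshold of 0.05/10 restores the nominal 5% rate, at the cost of declaring nothing significant in the scenario described.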

In each of these three scenarios, the underlying data were generated in the exact same way (let’s assume for the moment that the trial itself was conducted with complete integrity).  In each of the three scenarios, 10 outcomes were examined and outcome Y7 was in some sense the most interesting.

Of course, the analyses and the interpretation of the data were not the same in each scenario. Case 1 makes an explicit inference whereas Cases 2 and 3 essentially do not. However, I would argue the evidence about the new treatment compared to the control treatment in each scenario was identical.

I don’t believe that the investigator in Case 1 should be allowed to engage in such shenanigans with p-values, but should he/she be pilloried simply because the p-value was the chosen metric of significance? I guess the answer would be “yes” for many of you, but keep in mind that the investigator in Case 1 still generated the same evidence as the others. Should the investigators in Case 2 and Case 3 be thrown in the slammer? If so, on what basis?

My feeling is not that people should be allowed to do whatever they please, but we need a better way to separate the “stuff” from the stuff. This is both a methodological and a communications issue. For example, Case 3 may not be fraud but I’m not necessarily interested in what the investigator’s opinion about a parameter is. I want to know what the data say about that parameter (or treatment difference in this case). Is it fraud to make any inferences in the first place (as in Case 1)? I mean, how could you possibly know that your inference is “correct”? If “all models are wrong, but some are useful”, does that mean that everyone is committing fraud?

Sunday data/statistics link roundup (6/23/13)

  1. An interesting study describing the potential benefits of using significance testing and a scenario where the file drawer effect may even be beneficial. Granted, this is all simulation, so you have to take it with a grain of salt, but I like the pushback against the hypothesis testing haters. In all things moderation, including hypothesis testing.
  2. Venn Diagrams for the win, bro.
  3. The new basketball positions. The idea is to cluster players based on the positions on the floor where they shoot, etc. I like the idea of data-driven position definitions; I am a little worried about “reading ideas into” a network picture.
  4. A really cool idea about a startup that makes data on healthcare procedures available to patients. I’m all about data transparency, but it makes me wonder: how often do people with health insurance negotiate the prices of procedures? (via Leah J.)
  5. Another interesting article about using tweets (and other social media) to improve public health. I do wonder about potential sampling issues, like what happened with Google Flu Trends (via Nick C.)

Interview with Miriah Meyer - Microsoft Faculty Fellow and Visualization Expert

Miriah Meyer received her Ph.D. in computer science from the University of Utah, then did a postdoctoral fellowship at Harvard University and was a visiting fellow at MIT and the Broad Institute. Her research focuses on developing visualization tools in close collaboration with biological scientists. She has been recognized as a Microsoft Faculty Fellow and a TED Fellow, and has appeared on the TR35 list. We talked with Miriah about visualization, collaboration, and her influences during her career as part of the Simply Statistics Interview Series.

SS: Which term applies to you: data scientist, statistician, computer scientist, or something else?

MM: My training is as a computer scientist and much of the way I problem solve is grounded in computational thinking. I do, however, sometimes think of myself as a data counselor, as a big part of what I do is help my collaborators move towards a deeper and more articulate statement about what they want/need to do with their data.

SS: Most data analysis is done by scientists, not trained statisticians. How does data visualization help/hurt scientists when looking at piles of complicated data?

MM: In the sciences, visualization is particularly good for hypothesis generation and early-stage exploration. With many fields turning toward data-driven approaches, scientists are often not sure of exactly what they will find in a mound of data. Visualization allows them to look into the data without having to specify a particular question, query, or model. This early, exploratory analysis is very difficult to do strictly computationally. Exploration via interactive visualization can lead a scientist towards establishing a more specific question of the data that could then be addressed algorithmically.

SS: What are the steps in developing a visualization with a scientific collaborator?

MM: The first step is finding good collaborators :)

The beginning of a project is spent in discussions with the scientists, trying to understand their needs, data, and mental models of a problem. I find this part to be the most interesting, and also the most challenging. The goal is to develop a clear, abstract understanding of the problem and a set of design requirements. We do this through interviews and observations, with a focus on understanding how people currently solve problems and what they want to do but can’t with current tools.

The next step is to take this understanding and prototype ideas for visualization designs. Rapid prototyping on paper usually comes first, followed by more sophisticated software prototypes after getting feedback from the collaborators. Once a design is sufficiently fleshed out and validated, a (relatively) full-featured visualization tool is developed and deployed.

At this point, the scientists tend to realize that the problem they initially thought was most interesting isn’t… and the cycle continues.

Fast iteration is really essential in this process. In the past I’ve gone through as many as three cycles of this process before finding the right problem abstractions and designs.

SS: You have tackled some diverse visualizations (from synteny to poems); what are the characteristics of a problem that make it a good candidate for new visualizations?

MM: For me, the most important thing is to find good collaborators. It is essential to find partners that are willing to give lots of their time up front, are open-minded about research directions, and are working on cutting-edge problems in their field. This latter characteristic helps to ensure that there will be something novel needed from a data analysis and visualization perspective.

The other thing is to test whether a problem passes the Tableau/R/Matlab test: if the problem can’t be solved using one of these environments, then that is probably a good start.

SS: What is the four-level nested model for design validation and how did you extend it?

MM: This is a design decision model that helps to frame the different kinds of decisions made in the visualization design process, such as decisions about data derivations, visual representations, and algorithms. The model helps to put any one decision in the context of other visualization ideas, methods, and techniques, and also helps a researcher generalize new ideas to a broader class of problems. We recently extended this model to specifically define what a visualization “guideline” is, and how to relate this concept to how we design and evaluate visualizations.

SS: Who are the key people who have been positive influences on your career and how did they help you?

MM: One influence that jumps out to me is a collaboration with a designer in Boston named Bang Wong. Working with Bang completely changed my approach to visualization development and got me thinking about iteration, rapid prototyping, and trying out many ideas before committing. Also important were two previous supervisors, Ross Whitaker and Tamara Munzner, who constantly pushed me to be precise and articulate about problems and approaches to them. I believe that precision is a hallmark of good data science, even when characterizing imprecise things :)

SS: Do you have any advice for computer scientists/statisticians who want to work on visualization as a research area?

MM: Do it! Visualization is a really fun, vibrant, growing field. It relies on a broad spectrum of skills, from computer science, to design, to collaboration. I would encourage those interested not to get too infatuated with the engineering or the aesthetics and to instead focus on solving real-world problems. There is an unlimited supply of those!