Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Visualizing Yahoo Email

Here is a cool page where Yahoo shows you the email it is processing in real time. It includes a visualization of the most popular words in emails at a given time. A pretty neat tool and definitely good for procrastination, but I’m not sure what else it is good for…

Web-scraping

The internet is the greatest source of publicly available data. One of the key skills for obtaining data from the web is “web-scraping”: using a piece of software to run through a website and collect information.

This technique can be used to collect data from databases or to gather data that is scattered across a website. Here is a very cool little exercise in web-scraping that serves as an example of what is possible.
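The parsing half of web-scraping can be sketched in a few lines. Here is a minimal Python example using only the standard library; the HTML snippet, tag of interest, and the `LinkScraper` name are all invented for illustration (a real scrape would fetch the page over HTTP first):

```python
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collect (href, link text) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []      # accumulated (href, text) pairs
        self._href = None    # href of the <a> tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

# A made-up page standing in for a real site you might scrape.
page = """
<html><body>
  <a href="/schools/1">School One</a>
  <a href="/schools/2">School Two</a>
</body></html>
"""

scraper = LinkScraper()
scraper.feed(page)
print(scraper.links)  # [('/schools/1', 'School One'), ('/schools/2', 'School Two')]
```

In practice you would loop this parser over many fetched pages and write the collected pairs to a data file for analysis.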

Related Posts: Jeff on APIs, Data Sources, Regex, and The Open Data Movement.

Archetypal Athletes

Here is a cool paper on the ArXiv about archetypal athletes. The basic idea is to look at a large number of variables for each player and identify multivariate outliers or extremes. These outliers are the archetypes talked about in the title. 

Based on his analysis, the author claims the best players (for different reasons, i.e. different archetypes) in the NBA in 2009/2010 were Taj Gibson, Anthony Morrow, and Kevin Durant. The best soccer players were Wayne Rooney, Lionel Messi, and Cristiano Ronaldo.
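The flavor of “identify multivariate extremes” can be caricatured in a few lines of Python. The player names and stats below are invented, and flagging the maximum z-score per dimension is only a crude stand-in for the paper’s actual archetypal analysis:

```python
# Toy per-game stats (made up): points, rebounds, assists for four players.
players = {
    "A": (30.1, 5.0, 2.8),
    "B": (12.4, 11.2, 1.9),
    "C": (14.7, 3.1, 10.5),
    "D": (15.0, 6.0, 5.0),
}

def standardize(values):
    """Return z-scores so stats on different scales are comparable."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd for v in values]

# Standardize each stat across players, then flag every player who is the
# extreme (largest z-score) on at least one dimension -- each extreme plays
# the role of an "archetype" for that dimension.
names = list(players)
cols = list(zip(*players.values()))
z = list(zip(*(standardize(c) for c in cols)))  # one row of z-scores per player
extremes = {names[max(range(len(names)), key=lambda i: z[i][j])]
            for j in range(len(cols))}
print(sorted(extremes))  # ['A', 'B', 'C']
```

Player D, solid but extreme in nothing, is never flagged, which is exactly the intuition behind looking for archetypes rather than averages.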

Thanks to Andrew Jaffe for pointing out the article. 

Related Posts: Jeff on “Innovation and Overconfidence”, Rafa on “Once in a lifetime collapse”.

Graduate student data analysis inspired by a high-school teacher

I love watching TED talks. One of my absolute favorites is the talk by Dan Meyer on how math class needs a makeover. Dan also has one of the more fascinating blogs I have read. He talks about math education, primarily K-12 education. His posts on curriculum design, assessment, work ethic, and homework are really, really good. In fact, just go read all of his posts. You won’t regret it.

The best quote from the talk is:

Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and have to filter it out, or you didn’t have insufficient information and have to go find some?

Many of the data analyses I have done or assigned in class have focused on a problem with exactly the right information, with relatively little extraneous data or missing information. But I have been slowly evolving these problems; as an example, here is a data analysis project that we developed last year for the qualifying exam at JHU. This project is what I consider a first step toward a “less helpful” project model.

The project was inspired by this blog post at Marginal Revolution, which Rafa suggested. As with the homework problem Dan dissects in his talk, there are layers to this problem:

  1. Understanding the question
  2. Downloading and filtering the data
  3. Exploratory analysis
  4. Fitting models/interpreting results
  5. Synthesis and writing the results up
  6. Reproducibility of the R code

For this analysis, I was pretty specific with step 1, understanding the question:

(1) The association between enrollment and the percent of students scoring “Advanced” on the MSA in Reading and Math in the 5th grade.

(2) The change in the number of students scoring “Advanced” in Reading and Math from one year to the next (at minimum consider the change from 2009-2010) versus enrollment.

(3) Potential reasons for results like those in Table 1.  
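Question (1) boils down to relating enrollment to test scores, for instance with a simple regression. Here is a minimal Python sketch; the (enrollment, percent “Advanced”) pairs are made-up numbers, not actual MSA data:

```python
# Hypothetical (school enrollment, percent scoring "Advanced") pairs --
# invented for illustration, not actual Maryland MSA data.
data = [(150, 42.0), (300, 35.5), (450, 33.0), (600, 31.5), (900, 30.0)]

def ols(pairs):
    """Ordinary least squares slope and intercept for y regressed on x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = ols(data)
print(slope < 0)  # True for this toy data: larger schools score lower
```

A real answer would of course also quantify the uncertainty in the slope, which is where the interesting statistical work in the exam begins.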

I did not, however, mention the key idea from the Marginal Revolution post. I think for a qualifying exam this level of specificity is necessary, but for an in-class project I would have removed this information so students had to “discover the question” themselves.

I was also pretty specific with the data source, suggesting the Maryland Department of Education’s website. However, several students went above and beyond and found other data sources, or downloaded more data than I suggested. In the future, I think I will leave this off too. My Google/data-finding skills don’t hold a candle to those of my students.

Steps 3-5 were summed up with the statement: 

Your project is to analyze data from the MSA and write a short letter either in favor of or against spending money to decrease school sizes.

This is one part of the exam I’m happy with. It is sufficiently vague to let the students come to their own conclusions. It also suggests that the students should draw conclusions and support them with statistical analyses. One of the major difficulties I have struggled with in teaching this class is getting students to state a conclusion as a result of their analysis and to quantify how uncertain they are about that decision. In my mind, this is different from just the uncertainty associated with a single parameter estimate. 

It was surprising how much requiring reproducibility helped students focus their analyses. I think having to organize and collect their code helped them organize their analyses. Also, there was a strong correlation between reproducibility and the quality of the written reports.

Going forward I have a couple of ideas of how I would change my data analysis projects:

  1. Be less helpful - be less clear about the problem statement, data sources, etc. I definitely want students to get more practice formulating problems. 
  2. Focus on writing/synthesis - my students are typically very good at fitting models, but sometimes struggle with putting together the “story” of an analysis. 
  3. Stress much less about whether specific methods will work well on the data analyses I suggest. One of the more helpful things I think these messy problems produce is a chance to figure out what works and what doesn’t on real world problems. 

Related Posts: Rafa on the future of graduate education, Roger on applied statistics journals.

The self-assessment trap

Several months ago I was sitting next to my colleague Ben Langmead at the Genome Informatics meeting. Various talks were presented on short read alignment, and every single performance table showed the speaker’s method as #1 and Ben’s Bowtie as #2 among a crowded field of lesser methods. It was fun to tease Ben for getting beat every time, but in reality all I could conclude was that Bowtie was best and the speakers were falling into the self-assessment trap: each speaker had tweaked the assessment to make their own method look best.

This practice is pervasive in statistics, where easy-to-tweak Monte Carlo simulations are commonly used to assess performance. In a recent paper, a team at IBM described how the problem is pervasive in the systems biology literature as well. Co-author Gustavo Stolovitzky is a co-developer of the DREAM challenge, in which the assessments are fixed and developers are asked to submit. About 7 years ago we developed affycomp, a comparison webtool for microarray preprocessing methods. I encourage others in fields where methods are constantly being compared to develop such tools. It’s a lot of work, but journals are usually friendly to papers describing the results of such competitions.
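How easily a Monte Carlo benchmark can be tweaked is worth seeing concretely. The toy simulation below is my own construction, not anything from the IBM paper: whether the sample mean or the sample median “wins” at estimating a location parameter depends entirely on which noise distribution the simulator chooses, so each camp can publish a table where it comes out #1.

```python
import random

def mse(estimator, sampler, n=30, reps=2000, seed=1):
    """Monte Carlo mean squared error of an estimator of a true location of 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [sampler(rng) for _ in range(n)]
        total += estimator(sample) ** 2
    return total / reps

mean = lambda xs: sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    m = len(s) // 2
    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2

# Two "benchmark designs": clean Gaussian noise, and Gaussian noise with
# 10% contamination by a wild sd-10 component (heavy tails).
gauss = lambda rng: rng.gauss(0, 1)
heavy = lambda rng: rng.gauss(0, 1 if rng.random() < 0.9 else 10)

print(mse(mean, gauss) < mse(median, gauss))  # True: this design favors the mean
print(mse(mean, heavy) > mse(median, heavy))  # True: this design favors the median
```

Same two methods, opposite rankings; all that changed was a knob in the simulation. A fixed, third-party assessment removes exactly this degree of freedom.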

Related Posts: Roger on colors in R, Jeff on battling bad science