Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Confronting a Law Of Limits

An essay on why programmers need to learn statistics

This is awesome. There are a few places with some strong language, but overall I think the message is pretty powerful. Via Tariq K. I agree with Tariq, one of the gems is:

If you want to measure something, then don’t measure other sh**. 

A cool profile of a human rights statistician

Via AL Daily, this dude collects data and analyzes it to put war criminals away. The idea of using statistics to quantify mass testimony is interesting. 

With statistical methods and the right kind of data, he can make what we know tell us what we don’t know. He has shown human rights groups, truth commissions, and international courts how to take a collection of thousands of testimonies and extract from them the magnitude and pattern of violence — to lift the fog of war.

So how does he do it? With an idea from statistical ecology. This is a bit of a long quote but describes the key bit.

Working on the Guatemalan data, Ball found the answer. He called Fritz Scheuren, a statistician with a long history of involvement in human rights projects. Scheuren reminded Ball that a solution to exactly this problem had been invented in the 19th century to count wildlife. “If you want to find out how many fish are in the pond, you can drain the pond and count them,” Scheuren explained, “but they’ll all be dead. Or you can fish, tag the fish you catch, and throw them back. Then you go another day and fish again. You count how many fish you caught the first day, and the second day, and the number of overlaps.”

The number of overlaps is key. It tells you how representative a sample is. From the overlap, you can calculate how many fish are in the whole pond. (The actual formula is this: Multiply the number of fish caught the first day by the number caught the second day. Divide the total by the overlap. That’s roughly how many fish are really in the pond.) It gets more accurate if you can fish not just twice, but many more times — then you can measure the overlap between every pair of days.
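The two-day "count-the-fish" formula described above (the Lincoln–Petersen capture-recapture estimator) is simple enough to sketch in a few lines of code. The fish counts below are made up purely for illustration:

```python
# Lincoln-Petersen capture-recapture estimate, as described above:
# tag the fish you catch on day one, catch again on day two, and use
# the overlap (recaptured tagged fish) to estimate the whole pond.

def lincoln_petersen(caught_day1, caught_day2, overlap):
    """Estimate total population as (day1 * day2) / overlap."""
    if overlap == 0:
        # With no recaptures the estimate is undefined (effectively infinite).
        raise ValueError("no overlap between the two samples")
    return caught_day1 * caught_day2 / overlap

# Made-up numbers: tag 100 fish on day one; on day two catch 60 fish,
# of which 15 are already tagged.
estimate = lincoln_petersen(100, 60, 15)
print(round(estimate))  # -> 400
```

The intuition: if 15 of the 60 fish caught on day two are tagged, then the 100 tagged fish are roughly a quarter of the pond, giving about 400 fish total. Fishing on more days, as the article notes, lets you check the overlap between every pair of days and tightens the estimate.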

Guatemala had three different collections of human rights testimonies about what had happened during the country’s long, bloody civil war: from the U.N. truth commission, the Catholic Church’s truth commission, and the International Center for Research on Human Rights, an organization that worked with Guatemala’s human rights groups. Working for the official truth commission, Ball used the count-the-fish method, called multiple systems estimation (MSE), to compare all three databases. He found that over the time covered by the commission’s mandate, from 1978 to 1996, 132,000 people were killed (not counting those disappeared), and that government forces committed 95.4 percent of the killings. He was also able to calculate killings by the ethnicity of the victim. Between 1981 and 1983, 8 percent of the nonindigenous population of the Ixil region was assassinated; in the Rabinal region, the figure was around 2 percent. In both those regions, though, more than 40 percent of the Mayan population was assassinated.

Cool right? The article is worth a read. If you are inspired, check out Data Without Borders. 

The case for open computer programs

Statistics project ideas for students

Here are a few ideas that might make for interesting student projects at all levels (from high-school to graduate school). I’d welcome ideas/suggestions/additions to the list as well. All of these ideas depend on free or scraped data, which means that anyone can work on them. I’ve given a ballpark difficulty for each project to give people some idea.

Happy data crunching!

Data Collection/Synthesis

  1. Creating a webpage that explains conceptual statistical issues like randomization, margin of error, overfitting, cross-validation, concepts in data visualization, and sampling. The webpage should not use any math at all and should explain the concepts so a general audience could understand them. Bonus points if you make short 30-second animated YouTube clips that explain the concepts. (Difficulty: Lowish; Effort: Highish)
  2. Building an aggregator for statistics papers across disciplines that can be the central resource for statisticians. Journals ranging from PLoS Genetics to Neuroimage now routinely publish statistical papers. But there is no one central resource that aggregates all the statistics papers published across disciplines. Such a resource would be hugely useful to statisticians. You could build it using blogging software like WordPress so articles could be tagged and the resource could be added to your RSS reader. (Difficulty: Lowish; Effort: Mediumish)

Data Analyses

  1. Scrape the LivingSocial/Groupon sites for the daily deals and develop a prediction of how successful the deal will be based on location/price/type of deal. You could use either the RCurl R package or the XML R package to scrape the data. (Difficulty: Mediumish; Effort: Mediumish)
  2. You could use the data from your city (here are a few cities with open data) to: (a) identify the best and worst neighborhoods to live in based on different metrics, like how many parks are within walking distance and crime statistics; (b) identify concrete measures your city could take to improve different quality-of-life metrics like those described above - say, where the city should put a park; or (c) see if you can predict when/where crimes will occur (like these guys did). (Difficulty: Mediumish; Effort: Highish)
  3. Download data on state of the union speeches from here and use the tm package in R to analyze the patterns of word use over time (Difficulty: Lowish; Effort: Lowish)
  4. Use this data set from Donors Choose to determine the characteristics that make the funding of projects more likely. You could send your results to the Donors Choose folks to help them improve the funding rate for their projects. (Difficulty: Mediumish; Effort: Mediumish)
  5. Which basketball player would you want on your team? Here is a really simple analysis done by Rafa. But it doesn’t take into account things like defense. If you want to take on this project, you should take a look at this Dennis Rodman analysis, which is the gold standard. (Difficulty: Mediumish; Effort: Highish)

Data visualization

  1. Creating an R package that wraps the svgAnnotation package. This package can be used to create dynamic graphics in R, but is still a bit too flexible for most people to use. Writing some wrapper functions that simplify the interface would be potentially high impact. Maybe something like svgPlot() to create simple, dynamic graphics with only a few options (Difficulty: Mediumish; Effort: Mediumish). 
  2. The same as project 1 but for D3.js. The impact could potentially be a bit higher, since the graphics are a bit more professional, but the level of difficulty and effort would also both be higher. (Difficulty: Highish; Effort: Highish)