Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Sunday data/statistics link roundup (11/3/13)

  1. There has been a big knock-down-drag-out battle in the blogosphere over how GTEX is doing their analysis. Read the original post here, my summary here, and the response from GTEX here. I agree that the criticism bordered on hyperbolic, but I also think that methods matter. Consortia are under pressure to get something out and have to use software that works; I’m sympathetic, because that must be a tough position to be in, but it is important to remember that software running != software working well.
  2. Chris Bosh thinks you should learn to code. Me too. I wonder if the Heat will give me a contract now?
  3. Terry Speed wins the Prime Minister’s Prize for science. Here is an awesome interview with him. Watch to the end to find out how he is gonna spend all the money.
  4. Learn faster with the Feynman technique. tl;dr = practice teaching what you are trying to learn.
  5. Via Tim T. Jr., check out this interactive version of Simpson’s paradox. Super slick and educational. (A short R sketch of the paradox follows this list.)
  6. Stats used to determine the Gold Glove (in part).
  7. An angry newcomer’s guide to data types in R, dangit!
  8. Accidental aRt - accidentally beautiful creations in R.
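
For anyone who wants to see Simpson’s paradox without clicking through, here is a minimal R sketch with made-up data (the grouping, slopes, and numbers are mine, not from the linked post): the overall regression slope of y on x is negative, even though the slope within each group is positive.

    set.seed(1)
    d <- data.frame(
      group = rep(c("A", "B"), each = 50),
      x = c(rnorm(50, mean = 2), rnorm(50, mean = 5))
    )
    # group A sits at low x / high y, group B at high x / low y,
    # but within each group y increases with x
    d$y <- ifelse(d$group == "A", 5, 1) +
      0.8 * (d$x - ifelse(d$group == "A", 2, 5)) +
      rnorm(100, sd = 0.3)

    coef(lm(y ~ x, data = d))                         # overall slope: negative
    coef(lm(y ~ x, data = d, subset = group == "A"))  # within group A: positive
    coef(lm(y ~ x, data = d, subset = group == "B"))  # within group B: positive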

Unconference on the Future of Statistics (Live Stream) #futureofstats

The Unconference on the Future of Statistics will begin at 12pm EDT today. Watch the live stream here.

How to participate in #futureofstats Unconference

Tomorrow is the Unconference on the Future of Statistics from 12PM-1PM EDT. There are two ways that you can get in the game:

  1. Ask questions for our speakers on Twitter with the hashtag #futureofstats. Don’t wait; start right now. Roger, Rafa, and I are monitoring the hashtag and collecting questions. We will pick some to ask the speakers tomorrow during the Unconference.
  2. If you have an idea about the future of statistics, write it up and post it on GitHub, Blogger, WordPress, or your personal website, then tweet it with the hashtag #futureofstats. We will do our best to collect these and post them with the video so your contributions will be part of the Unconference.

Tukey Talks Turkey #futureofstats

I’ve been digging up old “future of statistics” writings in anticipation of our Unconference on the Future of Statistics this Wednesday, 12-1pm EDT. Last week I mentioned Daryl Pregibon’s experience trying to build statistical expertise into software. One classic is “The Future of Data Analysis” by John Tukey, published in the Annals of Mathematical Statistics in 1962.

Perhaps the most surprising aspect of this paper is how relevant it remains today. With just a few small revisions it could easily be published in a journal now, and few people would find it out of place.

In Section 3, titled “How can new data analysis be initiated?”, he describes the directions statisticians should take to grow the field of data analysis. The advice itself is quite general and should probably be heeded by any junior statistician just starting out in research.

How is novelty most likely to begin and grow? Not through work on familiar problems, in terms of familiar frameworks, and starting with the results of applying familiar processes to the observations. Some or all of these familiar constraints must be given up in each piece of work which may contribute novelty.

Tukey’s article serves as a coherent and comprehensive roadmap for the development of data analysis as a field. He suggests that we should study how people analyze data and uncover “what works” and what doesn’t. However, he appears to draw the line at suggesting that such study should result in a single way of analyzing a given type of data. Rather, statisticians should maintain some flexibility in modeling and analysis. I personally think the reality should be somewhere in the middle. Too much flexibility can lead to problems, but rigidity is not the solution.

It is interesting, from my perspective, how much of Tukey’s roadmap was essentially ignored, given how clear and coherent it was in 1962. In fact, the field pretty much went in the other direction, toward more mathematical elegance (I’m guessing Tukey sensed this would happen). His article is uncomfortable to read because it’s full of problems that arise in real data and are difficult to handle with standard approaches. He has an uncanny ability to make up methods that look totally bizarre at first glance but are totally reasonable after some thought.

I honestly can’t think of a better way to end this post than to quote Tukey himself.

The future of data analysis can involve great progress, the overcoming of real difficulties, and the provision of a great service to all fields of science and technology. Will it? That remains to us, to our willingness to take up the rocky road of real problems in preference to the smooth road of unreal assumptions, arbitrary criteria, and abstract results without real attachments. Who is for the challenge?

Read the paper. And then come join us at 12pm EDT tomorrow.

Simply Statistics Future of Statistics Speakers - Two Truths, One Lie #futureofstats

Our online conference, live-streamed on YouTube, will happen on October 30th from 12PM-1PM Baltimore time (UTC-4:00). You can find more information here or sign up for email alerts here. I get bored with the usual speaker bios at conferences, so I am turning our speaker bios into a game. Below you will find three bullet-pointed items of interest about each of our speakers. Two of them are truths and one is a lie. See if you can spot the lies and sign up for the Unconference!

Hadley Wickham

  • Created the ggplot2/devtools packages.
  • Developed R’s first class system.
  • Is chief scientist at RStudio.

Daniela Witten

  • Developed the most popular method for inferring Facebook connections.
  • Created the Spacejam algorithm for inferring networks.
  • Made the Forbes 30 under 30 list twice as a rising scientific star.

Joe Blitzstein 

  • A Professor of the Practice of Statistics at Harvard University.
  • Created the first statistical method for automatically teaching the t-test.
  • His statistics 101 course is frequently in the top 10 courses on iTunes U.

Hongkai Ji

  • Developed the hmChIP database of over 2,000 ChIP-Seq and ChIP-Chip data samples.
  • Coordinated the analysis of the orangutan genome project.
  • Analyzed data to help us understand sonic-hedgehog mediated neural patterning.

Sinan Aral

  • Coined the phrase “social networking potential”.
  • Ran a large randomized study that determined the value of upvotes.
  • Discovered that peer influence is dramatically overvalued in product adoption.

Hilary Mason

  • Is a co-founder of DataGotham and HackNY.
  • Developed computational algorithms for identifying the optimal cheeseburger.
  • Founded the first company to create link sorting algorithms.