Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Return of the sunday links! (10/26/14)

New look for the blog, and the links are back. If you have something that you’d like included in the Sunday links, email me and let me know. If you title the message “Sunday Links,” I’ll be more likely to find it when I search my Gmail.

  1. Thomas L. does a more technical post on semi-parametric efficiency. Normally I’m a data n’ applications guy, but I love these in-depth posts, especially when the papers remind me of all the things I studied at my alma mater.
  2. I am one of those people who only knows a tiny bit about Docker, but hears about it all the time. That being said, after I read about Rocker, I got pretty excited.
  3. Hadley W.’s favorite tools. Seems like that dude likes RStudio for some reason… (me too)
  4. A cool visualization of chess piece survival rates.
  5. A short movie by 538 about statistics and the battle between Deep Blue and Garry Kasparov. Where’s the popcorn?
  6. Twitter engineering released an R package for detecting breakouts. I wonder how circular binary segmentation would do?

 

 

An interactive visualization to teach about the curse of dimensionality

I was recently contacted for an interview about the curse of dimensionality. During the course of the conversation, I realized how hard it is to explain the curse to a general audience. One of the best descriptions I could come up with was to describe sampling points from a unit line, square, cube, etc. and then capturing the points that fall inside a region with a fixed side length. As the dimension grows, you capture fewer and fewer points. As I was saying this, I realized it is a pretty bad way to explain the curse of dimensionality in words. But there was potentially a cool data visualization that would illustrate the idea. I went to my student Prasad, our resident interactive viz design expert, to see if he could build it for me. He came up with this cool Shiny app where you can simulate a number of points (n), fix a side length for 1-D, 2-D, 3-D, and 4-D, and see how many points you capture in a cube of that length in that dimension. You can find the full app here or check it out on the blog here:
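For readers who would rather see the idea in a few lines of code, here is a minimal Python sketch of the same experiment. It is not the Shiny app's code. The function name `fraction_captured`, the choice to anchor the sub-cube at the origin, and the default values are all assumptions made for illustration.

```python
import random


def fraction_captured(n, side, dim, seed=0):
    """Simulate n uniform points in the unit cube [0, 1]^dim and return the
    fraction that land inside a sub-cube with the given side length.

    Assumption: the sub-cube is anchored at the origin, [0, side]^dim.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        point = [rng.random() for _ in range(dim)]
        # A point is captured only if every coordinate falls within the side length.
        if all(coord <= side for coord in point):
            inside += 1
    return inside / n


if __name__ == "__main__":
    n, side = 10_000, 0.5
    for dim in (1, 2, 3, 4):
        observed = fraction_captured(n, side, dim)
        expected = side ** dim  # volume of the sub-cube
        print(f"{dim}-D: captured {observed:.3f} of the points (volume = {expected:.4f})")
```

With the side length fixed, the captured fraction tracks the volume side^dim, so it shrinks geometrically as the dimension goes up.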

 

Vote on Simply Statistics’ new logo design

As you can tell, we have given the Simply Stats blog a little style update. It should be more readable on phones or tablets now. We are also about to get a new logo. We are down to the last couple of choices and can’t decide. Since we are statisticians, we thought we’d collect some data. Here is the link to the poll. Let us know.

Thinking like a statistician: don't judge a society by its internet comments

In a previous post I explained how thinking like a statistician can help you avoid feeling sad after using Facebook. The basic point was that missing not at random (MNAR) data on your friends’ profiles (showing only the best parts of their lives) can result in the biased view that your life is boring and uninspiring in comparison. A similar argument can be made to avoid losing faith in humanity after reading internet comments or anonymous tweets, one of the most depressing activities that I have voluntarily engaged in. If you want to see proof that racism, xenophobia, sexism and homophobia are still very much alive, read the unfiltered comments sections of articles related to race, immigration, gender or gay rights. However, as a statistician, I remain optimistic about our society after realizing how extremely biased these particular MNAR data can be.

Assume we could summarize an individual’s “righteousness” with a numerical index. I realize this is a gross oversimplification, but bear with me. Below is my view on the distribution of this index across all members of our society.

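[Figure: sketch of the distribution of the “righteousness” index across society, with the “bad” tail and the worst individual thoughts shown in red]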

Note that the distribution is not bimodal. This means there is no gap between good and evil; instead we have a continuum. Although there is variability, and we do have some extreme outliers on both sides of the distribution, most of us are much closer to the median than we like to believe. The offending internet commenters represent a very small proportion (the “bad” tail shown in red). But in a large population, such as internet users, this extremely small proportion can still add up to a large number of people, and that gives us a biased view.

There is one more level of variability here that introduces bias. Since internet comments can be anonymous, we get an unprecedentedly large glimpse into people’s opinions and thoughts. We assign a “righteousness” index to our thoughts and opinions and include it in the scatter plot shown in the figure above. Note that this index exhibits variability within individuals: even the best people have the occasional bad thought. The points in red represent thoughts so awful that no one, not even the worst people, would ever express them publicly. The red points give us an overly pessimistic estimate of the individuals that are posting these comments, which exacerbates our already pessimistic view due to a non-representative sample of individuals.
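The two layers of bias can be sketched with a toy simulation. Everything here is an assumption made up for illustration, not an estimate of anything real: the index is standard normal across people, each person's thoughts vary around their own average with standard deviation 1, and only thoughts below an arbitrary cutoff get posted anonymously.

```python
import random
import statistics


def simulate(n_people=20_000, thoughts_per_person=5, cutoff=-3.0, seed=0):
    """Toy model (all distributions and the cutoff are assumptions).

    Returns three lists of index values: every person's average,
    the averages of people who posted at least once, and the posted thoughts.
    """
    rng = random.Random(seed)
    person_means = []
    posted_thoughts = []
    posters = set()
    for person in range(n_people):
        mean = rng.gauss(0.0, 1.0)  # between-person variability
        person_means.append(mean)
        for _ in range(thoughts_per_person):
            thought = rng.gauss(mean, 1.0)  # within-person variability
            if thought < cutoff:  # only the worst thoughts are posted
                posted_thoughts.append(thought)
                posters.add(person)
    poster_means = [person_means[i] for i in posters]
    return person_means, poster_means, posted_thoughts


if __name__ == "__main__":
    everyone, posters, posted = simulate()
    print(f"average index, everyone:        {statistics.mean(everyone):.2f}")
    print(f"average index, posters:         {statistics.mean(posters):.2f}")
    print(f"average index, posted thoughts: {statistics.mean(posted):.2f}")
```

Under these assumptions the posters are worse than the population on average, and the comments we read are worse still than the people who wrote them.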

I hope that thinking like a statistician will help the media and social networks put the awful tweets and internet comments that represent the worst of the worst in statistical perspective. These actually provide little to no information on humanity’s distribution of righteousness, which I think is moving consistently, albeit slowly, towards the good.

 

 

Bayes Rule in an animated gif

Say Pr(A) = 5% is the prevalence of a disease (% of red dots on top fig). Each individual is given a test with accuracy Pr(B | A) = Pr(no B | no A) = 90%. The O in the middle turns into an X when the test fails. The rate of Xs is 1 - Pr(B | A). We want to know the probability of having the disease if you tested positive: Pr(A | B). Many find it counterintuitive that this probability is much lower than 90%; this animated gif is meant to help.

The individual being tested is highlighted with a moving black circle. A proportion Pr(B) of these will test positive: we put these in the bottom left and the rest in the bottom right. The proportion of all points that are red and end up in the bottom left is the proportion of red points, Pr(A), times the proportion of those with a positive test, Pr(B | A), thus Pr(B | A) x Pr(A). Pr(A | B), or the proportion of points in the bottom left that are red, is therefore Pr(B | A) x Pr(A) divided by Pr(B): Pr(A | B) = Pr(B | A) x Pr(A) / Pr(B)
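To put numbers on it, here is a small Python sketch that plugs the values from this post into the formula. The function name `posterior` and its argument names are my own labels for illustration. Pr(B) is expanded with the law of total probability as Pr(B | A) x Pr(A) + Pr(B | no A) x Pr(no A).

```python
def posterior(prevalence, sensitivity, specificity):
    """Return Pr(A | B), the probability of disease given a positive test.

    prevalence  = Pr(A)
    sensitivity = Pr(B | A)
    specificity = Pr(no B | no A)
    """
    true_pos = sensitivity * prevalence                # Pr(B | A) x Pr(A)
    false_pos = (1 - specificity) * (1 - prevalence)   # Pr(B | no A) x Pr(no A)
    p_positive = true_pos + false_pos                  # Pr(B)
    return true_pos / p_positive


if __name__ == "__main__":
    # Values from the post: 5% prevalence, 90% accuracy in both directions.
    print(f"Pr(A | B) = {posterior(0.05, 0.90, 0.90):.3f}")  # prints Pr(A | B) = 0.321
```

With these values Pr(B) = 0.045 + 0.095 = 0.14, so only about 32% of the people who test positive actually have the disease.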

ps - Is this a frequentist or Bayesian gif?