Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

A non-comprehensive list of awesome things other people did in 2014

Editor’s Note: Last year I made a list off the top of my head of awesome things other people did. I loved doing it so much that I’m doing it again for 2014. Like last year, I have surely missed awesome things people have done. If you know of some, you should make your own list or add it to the comments! The rules remain the same. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. Update: I missed pipes in R, now added!

 

  1. I’m copying everything about Jenny Bryan’s amazing Stat 545 class in my data analysis classes. It is one of my absolute favorite open online sets of notes on data analysis.
  2. Ben Baumer, Mine Cetinkaya-Rundel, Andrew Bray, Linda Loi, and Nicholas J. Horton wrote this awesome paper on integrating R Markdown into the curriculum. I love the stuff that Mine and Nick are doing to push data analysis into undergrad stats curricula.
  3. Speaking of those folks, the undergrad guidelines for stats programs put out by the ASA do an impressive job of balancing the advantages of statistics and the excitement of modern data analysis.
  4. Somebody tell Hector Corrada Bravo to stop writing so many awesome papers. He is making us all look bad. His epiviz paper is great and you should go start using the Bioconductor package if you do genomics.
  5. Hilary Mason founded fast forward labs. I love the business model of translating cutting edge academic (and otherwise) knowledge to practice. I am really pulling for this model to work.
  6. As far as I can tell 2014 was the year that causal inference became the new hotness. One example of that is this awesome paper from the Google folks on trying to infer causality from related time series. The R package has some cool features too. I definitely am excited to see all the new innovation in this area.
  7. Hadley was Hadley.
  8. Rafa and Mike taught an awesome class on data analysis for genomics. They also created a book on GitHub that I think is one of the best introductions to the statistics of genomics that exists so far.
  9. Hilary Parker wrote [a guide to writing an R package from scratch](http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/) that took the twitterverse by storm. It is perfectly written for people who are just at the point of being able to create their own R package. I think it probably generated 100+ R packages just by being so easy to follow.
  10. Oh you’re not reading StatsChat yet? For real?
  11. FiveThirtyEight launched. Despite some early bumps they have done some really cool stuff. Loved the recent piece on the beer mile and I read every piece that Emily Oster writes. She does an amazing job of explaining pretty complicated statistical topics to a really broad audience.
  12. David Robinson’s broom package is one of my absolute favorite R packages that was built this year. One of the most annoying things about R is the variety of outputs different models give and this tidy version makes it really easy to do lots of neat stuff.
  13. Chung and Storey introduced the jackstraw which is both a very clever idea and the perfect name for a method that can be used to identify variables associated with principal components in a statistically rigorous way.
  14. I rarely dig Excel-type replacements, but the simplicity of charted.co makes me love it. It does one thing and does that one thing really well.
  15. The hipsteR package for teaching old R dogs new tricks is one of the many cool things Karl Broman did this year. I read all of his tutorials and never cease to learn stuff. In related news, if I were 1/10th as organized as that dude I’d actually, you know, get stuff done.
  16. Whether or not I agree that they should be allowed to do unregulated human subjects research, statistics at tech companies, and in particular randomized experiments, have never been hotter. The boldest of the bunch is OKCupid, which writes blog posts with titles like “We experiment on human beings!”
  17. In related news, I love the PlanOut project by the folks over at Facebook, so cool to see an open source approach to experimentation at web scale.
  18. No wonder Mike Jordan (no not that Mike Jordan) is such a superstar. His reddit AMA raised my respect for him from already super high levels. First, it’s awesome that he did it, and second, it is amazing how well he articulates the relationship between CS and Stats.
  19. I’m trying to figure out a way to get Matthew Stephens to write more blog posts. He teased us with the Dynamic Statistical Comparisons post and then left us hanging. The people demand more Matthew.
  20. Di Cook also started a new blog in 2014. She was also part of this cool exploratory data analysis event for the UN. They have a monster program going over there at Iowa State, producing some amazing research and a bunch of students that are recognizable by one name (Yihui, Hadley, etc.).
  21. Love this paper on sure screening of graphical models out of Daniela Witten’s group at UW. It is so cool when a simple idea ends up being really well justified theoretically; it makes the world feel right.
  22. I’m sure this actually happened before 2014, but the Bioconductor folks are still the best open source data science project that exists in my opinion. My favorite development I started using in 2014 is the git-subversion bridge that lets me update my Bioc packages with pull requests.
  23. rOpenSci ran an awesome hackathon. The lineup of people they invited was great and I loved the commitment to a diverse group of junior R programmers. I really, really hope they run it again.
  24. Dirk Eddelbuettel and Carl Boettiger continue to make bigtime contributions to R. This time it is Rocker, with Docker containers for R. I think this could be a reproducibility/teaching gamechanger.
  25. Regina Nuzzo brought the p-value debate to the masses. She is also incredible at communicating pretty complicated statistical ideas to a broad audience and I’m looking forward to more stats pieces by her in the top journals.
  26. Barbara Engelhardt keeps rocking out great papers. But she is also one of the best AEs I have ever had handle a paper for me at PeerJ. Super efficient, super fair, and super demanding. People don’t get enough credit for being amazing in the peer review process and she deserves it.
  27. Ben Goldacre and Hans Rosling continue to be two of the best advocates for statistics and the statistical discipline - I’m not sure either claims the title of statistician but they do a great job anyway. This piece about Professor Rosling in Science gives some idea about the impact a statistician can have on the most current problems in public health. Meanwhile, I think Dr. Goldacre does a great job of explaining how personalized medicine is an information science in this piece on statins in the BMJ.
  28. Michael Lopez’s series of posts on graduate school in statistics should be 100% required reading for anyone considering graduate school in statistics. He really nails it.
  29. Trey Causey has an equally awesome Getting Started in Data Science post that I read about 10 times.
  30. Drop everything and go read all of Philip Guo’s posts. Especially this one about industry versus academia or this one on the practical reason to do a PhD.
  31. The top new Twitter feed of 2014 has to be @ResearchMark (incidentally I’m still mourning the disappearance of @STATSHULK).
  32. Stephanie Hicks’ blog combines recipes for delicious treats and statistics. I also thought she had a great summary of the Women in Stats (#WiS2014) conference.
  33. Emma Pierson is a Rhodes Scholar who wrote for 538, 23andMe, and a bunch of other major outlets as an undergrad. Her blog, obsessionwithregression.blogspot.com, is another must read. Here is an example of her awesome work on how different communities ignored each other on Twitter during the Ferguson protests.
  34. The RStudio crowd continues to be on fire. I think they are a huge part of the reason that R is gaining momentum. It wouldn’t be possible to list all their contributions (or it would be an RStudio exclusive list) but I really like Packrat and R Markdown v2.
  35. Another huge reason for the movement with R has been the outreach and development efforts of the Revolution Analytics folks. The Revolutions blog has been a must read this year.
  36. Julian Wolfson and Joe Koopmeiners at University of Minnesota are straight up gamers. They live streamed their recruiting event this year. One way I judge good ideas is by how mad I am I didn’t think of it and this one had me seeing bright red.
  37. This is just an awesome paper comparing lots of machine learning algorithms on lots of data sets. Random forests win, and this is a nice update of one of my favorite papers of all time: Classifier technology and the illusion of progress.
  38. Pipes in R! This stuff is for real. The piping functionality created by Stefan Milton and Hadley is one of the few inventions over the last several years that immediately changed whole workflows for me.

 

I’ll let @ResearchMark take us out:

Sunday data/statistics link roundup (12/14/14)

  1. An analysis suggests that economists are impartial when it comes to their liberal/conservative views. That being said, I’m not sure the regression line says what they think it does, particularly if you pay attention to the variance around the line (via Rafa).
  2. I am digging the simplicity of charted.co from the folks at Medium. But I worry about spurious correlations everywhere. I guess I should just let that ship sail.
  3. FiveThirtyEight does a run down of the beer mile. If they set up a data crunchers beer mile, we are in.
  4. I love it when Thomas Lumley interviews himself about silly research studies and particularly their associated press releases. I can actually hear his voice in my head when I read them. This time the lipstick/IQ silliness gets Lumleyed.
  5. Jordan was better than Kobe. Surprise. Plus Rafa always takes the Kobe bait.
  6. Matlab/Python/R translation cheat sheet (via Stephanie H.).
  7. If I’ve said it once, I’ve said it a million times: statistical thinking is now as important as reading and writing. The latest example is that parents not understanding the difference between sensitivity and the predictive value of a positive test may be leading to unnecessary abortions (via Dan M./Rafa).
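
To make the distinction in item 7 concrete, here is a minimal sketch using Bayes' rule. The numbers (99% sensitivity, 99% specificity, a condition affecting 1 in 1,000) are made up for illustration and do not come from the linked article; the function name is also just a label chosen here.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of truly having the condition given a positive test."""
    # Fraction of the population that is affected AND tests positive.
    true_positives = sensitivity * prevalence
    # Fraction of the population that is unaffected but still tests positive.
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical screening test: very accurate, but the condition is rare.
ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.001)
print(f"Predictive value of a positive: {ppv:.1%}")
```

With these assumed inputs, only about 9% of positives are true positives even though the test catches 99% of real cases, because the unaffected group is so much larger than the affected one.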

Kobe, data says stop blaming your teammates

This year, Kobe leads the league in missed shots ([by a lot](http://ftw.usatoday.com/2014/11/kobe-bryant-lakers-shot-stats)), has an abysmal FG% of 39 and his team plays better [when he is on the bench](http://bleacherreport.com/articles/2292515-how-much-blame-does-kobe-bryant-deserve-for-los-angeles-lakers-pathetic-start). Yet he blames his teammates for the Lakers’ 6-16 record. Below is a plot showing that 2014 is not the first time the Lakers have been mediocre during Kobe’s tenure. It shows the percentage points above .500 per season with the Shaq and twin towers eras highlighted. I include the same plot for LeBron as a control.
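
As a minimal sketch of the quantity on the plot's vertical axis (this is not the author's code, and the function name is a placeholder), percentage points above .500 can be computed from a win-loss record like this. Only the 6-16 record comes from the post.

```python
def points_above_500(wins, losses):
    """Winning percentage minus 50%, expressed in percentage points."""
    games = wins + losses
    if games == 0:
        raise ValueError("need at least one game played")
    return 100.0 * (wins / games - 0.5)


# The Lakers' 6-16 record mentioned above: 6/22 is about 27.3%,
# so the team sits roughly 22.7 percentage points below .500.
print(f"{points_above_500(6, 16):.1f}")
```

A season-by-season plot would apply the same function to each year's final record.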

[Figure: Rplot, percentage points above .500 per season for Kobe and LeBron]

So stop blaming your teammates!

And here is my hastily written code (don’t judge me!).

 

 



  

Genetically, there is no such thing as the Puerto Rican race

Editor’s note: Last week the Latin American media picked up a blog post with the eye-catching title “The perfect human is Puerto Rican”. More attention appears to have been given to the title than to the post itself. The coverage and comments on social media have demonstrated the need for scientific education on the topic of genetics and race. Here I will try to explain, in layman’s terms, why the interpretation I read in the main Puerto Rican paper is scientifically incorrect and somewhat concerning. The post was originally written in Spanish.

In a recent article titled “Ser humano perfecto sería puertorriqueño” (“The perfect human being would be Puerto Rican”), El Nuevo Día summarized a blog post (erroneously called a study) by the mathematician Lior Pachter. The author of the blog, trying to ridicule racist comments he heard James Watson make, describes a thought experiment in which he finds that the “perfect” human (the quotation marks are important), if one existed, would belong to a genetically mixed group. Of the people studied, the one genetically closest to his “perfect” human turned out to be a Puerto Rican woman. The motivation for this exercise was to ridicule the idea that one race can be superior to another. El Nuevo Día seems to miss this point and tells us that “the expert concluded that in any case it is no surprise that the person closest to such perfection would be a Puerto Rican woman, due to the combination of good genes that the Puerto Rican race has.” Here I describe why this interpretation is scientifically wrong.

What is the genome?

The human genome encodes (in DNA molecules) the genetic information needed for our biological development. We can think of the genome as two series of 3,000,000,000 letters (A, T, C or G) concatenated together. We receive one from our father and the other from our mother. Different pieces (the genes) encode proteins needed for the thousands of functions that our cells carry out and that lead to some of our physical characteristics. With a few exceptions, every cell in our body contains an exact copy of these two series of letters. The sperm and the egg have only one series of letters, a mixture of the other two. When the sperm and the egg join, the new cell, the zygote, combines the two series, and that is how we inherit characteristics from each parent.

What is genetic variation?

If we all descend from the first human, how is it that we are different? Although it is very rare, these letters sometimes mutate at random. For example, a C can change to a T. Over hundreds of thousands of years, enough mutations have occurred to create variation among humans. The theory of natural selection tells us that if a mutation confers a survival advantage, whoever carries it is more likely to pass it on to their descendants. For example, light skin is more advantageous in Europe, because of its ability to absorb vitamin D when there is little sun, than in West Africa, where the melanin in dark skin protects against intense sun. It is estimated that the differences among humans can be found in at least 10 million of the 3 billion letters (note that this is less than 1%).

Genetically, what is a “race”?

This is a controversial question. What is not controversial is that if we compare the series of letters of northern Europeans with those of West Africans or of the indigenous peoples of the Americas, we find pieces of the code that are unique to each region. If we study the parts of the code that vary among humans, we can easily distinguish the three groups. This should not surprise us given that, for example, differences in eye color and skin pigmentation are encoded by different letters in the genes associated with these characteristics. In this sense we could create a genetic definition of “race” based on the letters that distinguish these groups. Now then, can we do the same to distinguish a Puerto Rican from a Dominican? Can we create a genetic definition that includes Carlos Delgado and Mónica Puig, but not Robinson Canó and Juan Luis Guerra? The scientific literature tells us that we cannot.

[Figure PCAfinal: two-dimensional summary of genetic distance, one point per person, colored by group]

In a series of articles, the geneticist Carlos Bustamante and his colleagues have studied the genomes of people from various ethnic groups. They define a genetic distance, which they summarize in two dimensions in the plot above. Each point is a person and the color represents their group. Note the three corners of the plot with many points of the same color bunched together. These are white Europeans (red points), West Africans (green) and indigenous Americans (blue). The more scattered points in the middle are the admixed populations. Between the Europeans and the indigenous Americans we see the Mexicans, and between the Europeans and the Africans the African Americans. Puerto Ricans are the orange points. I have highlighted three of them with numbers. Number 1 is close to the supposed “perfect” human. Number 2 is indistinguishable from a European and number 3 is indistinguishable from an African American. The rest of us cover a broad spectrum. I also highlight with the number 4 a Dominican who is as close to “perfection” as the Puerto Rican woman. The main observation is that there is a lot of genetic variation among Puerto Ricans. Among those Bustamante studied, African ancestry ranges from 5-60%, European from 35-95% and Taíno from 0-20%. How, then, can we speak of a Puerto Rican “race” when our genomes span a space so large that it can include, among others, Europeans, African Americans and Dominicans?

What are “good” genes?

Some mutations are lethal. Others result in changes to proteins that cause diseases such as cystic fibrosis and require both parents to carry the mutation. Mixing different genomes therefore lowers the probability of these diseases. Recently, a series of studies has found advantages for certain combinations of letters related to common diseases such as hypertension. A genetic mixture that avoids having two copies of these higher-risk genes can be advantageous. But the supposed advantages are tiny and specific to diseases, not to other characteristics that we associate with “perfection”. The concept of “good genes” is a vestige of eugenics.

Despite our current social and economic problems, Puerto Rico has much to be proud of. In particular, we produce excellent engineers, athletes and musicians. Attributing their success to the “good genes” of our “race” is not only scientific nonsense, but also disrespectful to these individuals, who through hard work, discipline and dedication have achieved what they have achieved. If you want to know whether Puerto Rico had something to do with the success of these individuals, ask a historian, an anthropologist or a sociologist, not a geneticist. Now, if you want to learn about the potential of studying genomes to improve medical treatments and the importance of studying a diversity of individuals, a geneticist will have much to share.

Sunday data/statistics link roundup (12/7/14)

  1. A randomized controlled trial shows that using conversation to detect suspicious behavior is much more effective than just monitoring body language (via Ann L. on Twitter). This comes as a crushing blow to those of us who enjoyed the now-cancelled Lie to Me and assumed it was all real.
  2. Check out this awesome real-time visualization of different types of network attacks. Rafa says if you watch long enough you will almost certainly observe a “storm” of attacks. A cool student project would be modeling the distribution of these attacks if you could collect the data (via David S.).
  3. Consider this: Did Big Data Kill the Statistician? I understand the sentiment, that statistical thinking and applied statistics have been around a long time and have produced some good ideas. On the other hand, there is definitely a large group of statisticians who aren’t willing to expand their thinking beyond a really narrow set of ideas (via Rafa).
  4. Gangnam Style viewership creates integers too big for YouTube (via Rafa).
  5. A couple of interviews worth reading: [ours with Cole Trapnell](http://simplystatistics.org/2014/12/05/interview-with-cole-trapnell-of-uw-genome-sciences/) and SAMSI’s with Jyotishka Datta (via Jamie N.).
  6. A piece on the secrets we don’t know we are giving away by giving our data to [companies/the government/the internet].
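
On item 4: the overflow was widely reported as a 32-bit signed view counter reaching its ceiling of 2,147,483,647. Here is a minimal sketch of what happens at that boundary, using Python's `ctypes` to emulate a 32-bit signed integer. It is an illustration only, not YouTube's code, and the helper name `as_int32` is made up here.

```python
import ctypes

# Largest value a 32-bit signed integer can hold.
INT32_MAX = 2**31 - 1


def as_int32(n):
    """Store n in a 32-bit signed integer, wrapping around on overflow."""
    # ctypes does no overflow checking, so out-of-range values are truncated.
    return ctypes.c_int32(n).value


print(as_int32(INT32_MAX))      # still fits
print(as_int32(INT32_MAX + 1))  # one more view wraps around to a negative number
```

Moving to a 64-bit counter pushes the ceiling to 2**63 - 1, which is far beyond any plausible view count.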