A non-comprehensive list of awesome things other people did in 2015

21 Dec 2015

Editor’s Note: This is the third year I’m making a list of awesome things other people did this year. Just like the lists for 2013 and 2014 I am doing this off the top of my head. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. This year’s list is particularly “off the cuff” so I’d appreciate additions if you have ‘em. I have surely missed awesome things people have done.

I hear the Tukey conference put on by my former advisor John S. was amazing. Out of it came this really good piece by David Donoho on 50 years of Data Science.
Sherri Rose wrote really accurate and readable guides on academic CVs, academic cover letters, and how to be an effective PhD researcher.
I am not 100% sold on the deep learning hype, but Michael Nielsen wrote this awesome book on deep learning and neural networks. I like how approachable it is and how un-hypey it is. I also thought Andrej Karpathy’s blog post on whether you have a good selfie or not was fun.
Thomas Lumley continues to be a must-read regardless of which blog he writes for, with a ton of snarky, fun posts debunking the latest ridiculous health headlines on statschat and more in-depth posts like this one on pre-filtering multiple tests on notstatschat.
David Robinson is making a strong case for top data science blogger with his series of awesome posts on empirical Bayes.
Hadley Wickham doing Hadley Wickham things again. readr is the biggie for me this year.
I’ve been really enjoying the solid coverage of science/statistics from the (not entirely statistics focused as the name would suggest) STAT.
Ben Goldacre and co. launched OpenTrials for aggregating all the clinical trial data in the world in an open repository.
Christie Aschwanden’s piece on why Science Isn’t Broken is a must read and one of the least polemic treatments of the reproducibility/replicability issue I’ve read. The p-hacking graphic is just icing on the cake.
I’m excited about the new R Consortium and the idea of having more organizations that support folks in the R community.
Emma Pierson’s blog and writeups in various national level news outlets continue to impress. I thought this one on changing the incentives for sexual assault surveys was particularly interesting/good.
Amanda Cox and co. created this [interactive graphic](http://www.nytimes.com/interactive/2015/05/28/upshot/you-draw-it-how-family-income-affects-childrens-college-chances.html), which is an amazing way to teach people about pre-conceived biases in the way we think about relationships and correlations. I love the crowd-sourcing view on data analysis this suggests.
As usual Philip Guo was producing gold over on his blog. I appreciate this piece on twelve tips for data driven research.
I am really excited about the new field of adaptive data analysis: basically, understanding how we can let people be “real data analysts” and still get reasonable estimates at the end of the day. This paper from Cynthia Dwork and co. was one of the initial salvos that came out this year.
Datacamp incorporated Python into their platform. The idea of interactive education for R/Python/Data Science is a very cool one and has tons of potential.
I was really into the idea of Cross-Study validation that got proposed this year. With the growth of public data in a lot of areas we can really start to get a feel for generalizability.
The Open Science Collaboration did this incredible replication of 100 different studies in psychology with attention to detail and care that deserves a ton of attention.
Florian’s piece “You are not working for me; I am working with you.” should be required reading for all students/postdocs/mentors in academia. This is something I still hadn’t fully figured out until I read Florian’s piece.
I think Karl Broman’s post on why reproducibility is hard is a great introduction to the real issues in making data analyses reproducible.
This was the year of the f1000 post-publication review paper. I thought this one from Yoav and the ensuing fallout was fascinating.
I love pretty much everything out of Di Cook/Heike Hofmann’s groups. This year I liked the paper on visual statistical inference in high-dimensional low sample size settings.
This is pretty recent, but Nathan Yau’s day in the life graphic is mesmerizing.

This was a year where open source data people described their pain from people being demanding/mean to them for their contributions. As the year closes I just want to give a big thank you to everyone who did awesome stuff I used this year and have completely ungraciously failed to acknowledge.

Not So Standard Deviations: Episode 6 - Google is the New Fisher

18 Dec 2015

Episode 6 of Not So Standard Deviations is now posted. In this episode Hilary and I talk about the analytics of our own podcast, and analyses that seem easy but are actually hard.

If you haven’t already, you can subscribe to the podcast through iTunes.

This will be our last episode for 2015 so see you in 2016!

Notes

Roger’s books on Leanpub
KPIs
Reply All, a great podcast
useR! 2016 conference where Don Knuth is an invited speaker!
Liz Stuart’s directory of propensity score software
A/B testing
iid
R 3.2.3 release notes
pqR
John Myles White’s tweet

Download the audio file for this episode.

Instead of research on reproducibility, just do reproducible research

11 Dec 2015

Right now reproducibility, replicability, false positive rates, biases in methods, and other problems with science are the hot topic. As I mentioned in a previous post, pointing out a flaw in a scientific study is way easier to do correctly than generating a new scientific study. Some folks have noticed that right now there is a huge market for papers pointing out how science is flawed. The combination of the relative ease of pointing out flaws and the huge payout for writing these papers is helping to generate the hype around the “reproducibility crisis”.

I gave a talk a little while ago at an NAS workshop where I stated that all the tools for reproducible research exist (the caveat being really large analyses - although that is changing as well). To make a paper completely reproducible, open, and available for post publication review you can use the following approach with no new tools/frameworks needed.

Use Github for version control.
Use rmarkdown or iPython notebooks for your analysis code.
When your paper is done, post it to arxiv or biorxiv.
Post your data to an appropriate repository like SRA or a general purpose site like figshare.
Send any software you develop to a controlled repository like CRAN or Bioconductor.
Participate in the post-publication discussion on Twitter and on a blog.
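
As a rough illustration of what "reproducible" can mean at the level of a single analysis script kept under version control, here is a minimal Python sketch. It is not a tool used by anyone mentioned in this post: the function names (`file_checksum`, `run_analysis`, `provenance_record`), the toy bootstrap analysis, and the inline data are all assumptions made for illustration. The idea is simply to fix the random seed, fingerprint the input data, and store the software version next to the result, so that someone re-running the code can check that they got the same answer from the same inputs.

```python
import hashlib
import json
import platform
import random


def file_checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw input data."""
    return hashlib.sha256(data).hexdigest()


def run_analysis(values, seed=2015, n_boot=1000):
    """Toy analysis (hypothetical): bootstrap estimate of the mean."""
    # A local generator with a fixed seed makes the resampling repeatable
    # and keeps it independent of any global random state.
    rng = random.Random(seed)
    n = len(values)
    boot_means = []
    for _ in range(n_boot):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)
    return sum(boot_means) / n_boot


def provenance_record(data: bytes, seed: int, result: float) -> dict:
    """Bundle the result with what is needed to check it later."""
    return {
        "data_sha256": file_checksum(data),
        "seed": seed,
        "result": result,
        "python_version": platform.python_version(),
    }


if __name__ == "__main__":
    # Made-up data standing in for a file posted to a public repository.
    raw = b"1.0,2.5,3.5,4.0,6.0"
    values = [float(x) for x in raw.decode().split(",")]
    result = run_analysis(values, seed=2015)
    print(json.dumps(provenance_record(raw, 2015, result), indent=2))
```

Committing a record like this alongside the code gives a reader a quick way to confirm that their copy of the data and their re-run match what was reported.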

This is also true of open science, open data sharing, reproducibility, replicability, post-publication peer review and all the other issues forming the “reproducibility crisis”. There is a lot of attention and heat that has focused on the “crisis” or on folks who make a point to take a stand on reproducibility or open science or post publication review. But in the background, outside of the hype, there is a large group of people who are quietly executing solid, open, reproducible science.

I wish that this group would get more attention so I decided to point out a few of them. Next time somebody asks me about the research on reproducibility or open science I’ll just point them here and tell them to just follow the lead of people doing it.

Karl Broman - posts all of his talks online, generates many widely used open source packages, writes free/open tutorials on everything from knitr to making webpages, makes his papers highly reproducible.
Jessica Li - posts her data online and writes open source software for her analyses.
Mark Robinson - posts many of his papers as preprints on biorxiv, makes his analyses reproducible, writes open source software
Florian Markowetz - writes open source software, provides Bioconductor data for major projects, links his papers with his code nicely on his publications page.
Raphael Gottardo - writes/maintains many open source software packages, makes his analyses reproducible and available via Github, posts preprints of his papers.
Genevera Allen - writes [software](https://cran.r-project.org/web/packages/TCGA2STAT/index.html) to make data easier to access, posts preprints on biorxiv and on arxiv
Lorena Barba - teaches open source moocs, with lessons as open source iPython modules, and reproducible code for her analyses.
Alicia Oshlack - writes papers with completely reproducible analyses, publishes lots of open source software and publishes preprints for her papers.
Baggerly and Coombes - although they are famous for a highly public reproducible piece of research they have also quietly implemented policies like making all reports reproducible for their consulting center.

This list was made completely haphazardly, as all my lists are, just to indicate that there are a ton of people out there doing this. One thing that is clear too is that grad students and postdocs are adopting the approach I described at a very high rate.

Moreover there are people that have been doing parts of this for a long time (like the physics or biostatistics communities with preprints, or how people have used Sweave for a long time). I purposely left people off the list like Titus and Ethan who have gone all in, even posting their grants online. I did this because they are very loud advocates of open science, but I wanted to highlight quieter contributors and point out that while there is a lot of noise going on over in one corner, many people are quietly doing really good science in another.

By opposing tracking well-meaning educators are hurting disadvantaged kids

09 Dec 2015

An unfortunate fact about the US K-12 system is that the education gap between poor and rich is growing. One manifestation of this trend is that we rarely see US kids from disadvantaged backgrounds become tenure track faculty, especially in the STEM fields. In my experience, the ones that do make it, when asked how they overcame the suboptimal math education their school district provided, often respond "I was tracked" or "I went to a magnet school". Magnet schools filter students with admission tests and then teach at a higher level than an average school, so essentially the entire school is an advanced track.

Twenty years of classroom instruction experience has taught me that classes with diverse academic abilities present one of the most difficult teaching challenges. Typically, one is forced to focus on only a sub-group of students, usually the second quartile. As a consequence the lower and higher quartiles are not properly served. At the university level, we minimize this problem by offering different levels: remedial math versus math for engineers, probability for the Masters program versus probability for PhD students, co-ed intramural sports versus the varsity basketball team, intro to World Music versus a spot in the orchestra, etc. In K-12, tracking seems like the obvious solution to teaching to an array of student levels.

Unfortunately, there has been a trend recently to move away from tracking and several school districts now forbid it. The motivation seems to be a series of observational studies that note that “low-track classes tend to be primarily composed of low-income students, usually minorities, while upper-track classes are usually dominated by students from socioeconomically successful groups.” Tracking opponents infer that this unfortunate reality is due to bias (conscious or unconscious) in the informal referrals that are typically used to decide which students are advanced. However, this is a critique of the referral system, not of tracking itself. A simple fix is to administer an objective test or use the percentiles from state assessment tests. In fact, such exams have been developed and implemented. A recent study (summarized here) examined the data from a district that for a period of time implemented an objective assessment and found that

[t]he number of Hispanic students [in the advanced track increased] by 130 percent and the number of black students by 80 percent.

Unfortunately, instead of maintaining the placement criteria, which benefited underrepresented minorities without relaxing standards, these school districts reverted to the old, flawed system due to budget cuts.

Another argument against tracking is that students benefit more from being in classes with higher-achieving peers, rather than being in a class with students with similar subject mastery and a teacher focused on their level. However, a [randomized controlled trial](http://web.stanford.edu/~pdupas/Tracking_rev.pdf) (and the only one of which I am aware) finds that tracking helps all students:

We find that tracking students by prior achievement raised scores for all students, even those assigned to lower achieving peers. On average, after 18 months, test scores were 0.14 standard deviations higher in tracking schools than in non-tracking schools (0.18 standard deviations higher after controlling for baseline scores and other control variables). After controlling for the baseline scores, students in the top half of the pre-assignment distribution gained 0.19 standard deviations, and those in the bottom half gained 0.16 standard deviations. Students in all quantiles benefited from tracking.
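
For readers less used to effects reported in standard deviation units, here is a small Python sketch of how such a number can be computed. The scores below are made up for illustration and are not data from the study. Dividing the difference in means by the control group's standard deviation is one common convention and is assumed here; it is not necessarily the exact method used in the paper. The function name `standardized_difference` is likewise hypothetical.

```python
from statistics import mean, stdev


def standardized_difference(treated, control):
    """Difference in mean scores, in units of the control group's SD."""
    return (mean(treated) - mean(control)) / stdev(control)


# Made-up test scores for illustration only, NOT data from the study.
control_scores = [48, 52, 55, 60, 45, 50, 58, 52]
tracked_scores = [50, 53, 57, 61, 47, 51, 59, 54]

effect = standardized_difference(tracked_scores, control_scores)
print(f"Effect size: {effect:.2f} standard deviations")
```

Expressing the gain this way makes results comparable across tests with different scoring scales, which is why the quoted figures are given in standard deviations rather than raw points.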

I believe that without tracking, the achievement gap between disadvantaged children and their affluent peers will continue to widen since involved parents will seek alternative educational opportunities, including private schools or subject specific extracurricular acceleration programs. With limited or no access to advanced classes in the public system, disadvantaged students will be less prepared to enter the very competitive STEM fields. Note that competition comes not only from within the US, but from other countries including many with educational systems that track.

To illustrate the extreme gap, the following exercises are from a 7th grade public school math class (in a high performing school district):

[Image: exercises from the 7th grade public school math class]

There is no tracking so all students must work on these problems. Meanwhile, in a 7th grade advanced, private math class, that same student can be working on problems like these:

[Image: problems from the 7th grade advanced, private math class]

Let me stress that there is nothing wrong with the first example if it is the appropriate level for the student. However, a student who can work at the level of the second example should be provided with the opportunity to do so regardless of their family’s ability to pay. Poorer kids in districts which do not offer advanced classes will not only be less equipped to compete with their richer peers, but many of the academically advanced ones may, I suspect, dismiss academics due to lack of challenge and boredom. Educators need to consider evidence when making decisions regarding policy. Tracking can be applied unfairly, but that aspect can be remedied. Eliminating tracking altogether takes away a crucial tool for disadvantaged students to move into the STEM fields and, according to the empirical evidence, hurts all students.

Not So Standard Deviations: Episode 5 - IRL Roger is Totally With It

03 Dec 2015

I just posted Episode 5 of Not So Standard Deviations so check your feeds! Sorry for the long delay since the last episode but we got a bit tripped up by the Thanksgiving holiday.

In this episode, Hilary and I open up the mailbag and go through some of the feedback we’ve gotten on the previous episodes. The rest of the time is spent talking about the importance of reproducibility in data analysis both in academic research and in industry settings.

If you haven’t already, you can subscribe to the podcast through iTunes. Or you can use the SoundCloud RSS feed directly.

Notes:

Hilary’s talk on reproducible analysis in production at the New York R Conference
Hilary’s Ignite presentation at Strata 2013
Roger’s talk on “Computational and Policy Tools for Reproducible Research” at the Applied Mathematics Perspectives Workshop in Vancouver, 2011
Duke Scandal Starter Set
Keith Baggerly’s talk on Duke Scandal
The Web of Trust
testdat R package