Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Crowdsourcing resources for the Johns Hopkins Data Science Specialization

Since we began offering the Johns Hopkins Data Science Specialization we've noticed the unbelievable passion that our students have for our courses and the generosity they show toward each other on the course forums. Many students have created quality content around the subjects we discuss, and many of these materials are so good we feel that they should be shared with all of our students. We also know there are tons of other great organizations creating material (looking at you, Software Carpentry folks).

We're excited to announce that we've created a site using GitHub Pages: http://datasciencespecialization.github.io/ to serve as a directory for content that the community has created. If you've created materials relating to any of the courses in the Data Science Specialization please send us a pull request and we will add a link to your content on our site. You can find out more about contributing here: https://github.com/DataScienceSpecialization/DataScienceSpecialization.github.io#contributing

We can't wait to see what you've created and where the community can take this site!

swirl and the little data scientist's predicament

Editor's note: This is a repost of "R and the little data scientist's predicament". A brief idea for an update is presented at the end in italics. 

I just read this fascinating post on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was The Little Coder’s Predicament, which, boiled down, essentially says that computer programming languages have grown too complex, so children/newbies can’t get instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a simple line of code or two they could make the computer do things like play some music or make a video.

I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax.

Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:

  1. It would have to be easy to load data sets, with one line of uncomplicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load("education-data") and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background.
  2. It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures.
  3. It would be awesome if the functions would include some sort of dynamic graphics (with svgAnnotation or a wrapper for D3.js). Again, the syntax would have to be really accessible/not too much to learn.

That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends.

Update: Now that Nick and co. have created swirl, the technology is absolutely in place to have people do something awesome quickly. You could imagine taking the airplane data and immediately having them make a plot of all the flights using ggplot. Or you could take any number of awesome government data sets and go straight to ggvis. Solving this problem is no longer a technical challenge; it is just a matter of someone coming up with an amazing swirl module that immediately sucks students in. This would be a really awesome project for a grad student or even an undergrad with an interest in teaching. If you do it, you should absolutely send it our way and we'll advertise the heck out of it!

The Leek group guide to giving talks

I wrote a little guide to giving talks that goes along with my other guides, including the ones on R packages and reviewing. I posted it to Github and would be really happy to take any feedback/pull requests that folks might have. If you send a pull request please be sure to add yourself to the contributor list.

Stop saying "Scientists discover..." instead say, "Prof. Doe's team discovers..."

I was just reading an article about data science in the WSJ. They were talking about how data scientists with just 2 years experience can earn a whole boatload of money*. I noticed a description that seemed very familiar:

At e-commerce site operator Etsy Inc., for instance, a biostatistics Ph.D. who spent years mining medical records for early signs of breast cancer now writes statistical models to figure out the terms people use when they search Etsy for a new fashion they saw on the street.

This perfectly describes the resume of a student who worked with me here at Hopkins and is now tearing it up in industry. But it made me a little bit angry that they didn’t publicize her name. Now she may have requested her name not be used, but I think it is more likely that it is a case of the standard, “Scientists discover…” (see e.g. this article or this one or this one).

There is always a lot of discussion about how to push people to get into STEM fields, including a ton of misguided attempts that waste time and money. But here is one way that would cost basically nothing and dramatically raise the profile of scientists in the eyes of the public: use their names when you describe their discoveries.

The value of this simple change could be huge. In an era of selfies, reality TV, and the power of social media, emphasizing the value that individual scientists bring could have a huge impact on STEM recruiting. That paragraph above is a lot more inspiring to potential young data scientists when rewritten:

At e-commerce site operator Etsy Inc., for instance, Dr. Hilary Parker, a biostatistics Ph.D. who spent years mining medical records for early signs of breast cancer, now writes statistical models to figure out the terms people use when they search Etsy for a new fashion they saw on the street.

* Incidentally, I think it is a bit overhyped. I have rarely heard of anyone making $200k-$300k with that little experience, but maybe I’m wrong? I’d be interested to hear if people really were making that kind of $$ at that stage in their careers.

It's like Tinder, but for peer review.

I have an idea for an app. You input the title and authors of a preprint (maybe even the abstract). The app shows the title/authors/abstract to people who work in a similar area to you. You could estimate this based on papers they have published that have similar key words to start.
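
As a rough sketch of that matching step, here is one way it could work in Python, assuming each person is represented by a set of key words pulled from their published papers. The function names, the Jaccard overlap measure, the 0.2 cutoff, and the sample readers are all made up for illustration; none of this comes from an existing app.

```python
def jaccard(a, b):
    """Keyword overlap: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty keyword sets have nothing in common, so call that 0.0.
    return len(a & b) / len(union) if union else 0.0


def similar_readers(preprint_keywords, readers, threshold=0.2):
    """Names of readers whose overlap with the preprint is at least
    `threshold`, most similar first (ties broken alphabetically)."""
    scored = [(jaccard(preprint_keywords, kws), name)
              for name, kws in readers.items()]
    ranked = sorted(scored, key=lambda t: (-t[0], t[1]))
    return [name for score, name in ranked if score >= threshold]


# Hypothetical readers and the key words from their past papers.
readers = {
    "reader_a": {"genomics", "batch effects", "rna-seq"},
    "reader_b": {"causal inference", "epidemiology"},
    "reader_c": {"rna-seq", "normalization", "genomics", "statistics"},
}
preprint = {"rna-seq", "genomics", "normalization"}

matches = similar_readers(preprint, readers)
print(matches)
```

Key word overlap is about the crudest thing that could work, which is the point of "to start": you could swap in something smarter later without changing the rest of the app.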

Then you swipe left if you think the paper is interesting and right if you think it isn’t. We could then aggregate the data on how many “likes” a paper gets as a measure of how “interesting” it is. I wonder if this would be a better measure of later citations/interestingness than the opinion of a small number of editors and referees.
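
The aggregation itself would be simple. Here is a minimal sketch in Python, assuming swipes arrive as (paper, direction) pairs and using the convention above that a left swipe means "interesting". The data layout, the names, and the sample swipes are hypothetical.

```python
from collections import defaultdict

# Convention from the post: left = interesting, right = not interesting.
INTERESTING = "left"


def interest_scores(swipes):
    """Map each paper to (likes, total swipes, fraction that were likes)."""
    likes = defaultdict(int)
    totals = defaultdict(int)
    for paper, direction in swipes:
        totals[paper] += 1
        if direction == INTERESTING:
            likes[paper] += 1
    return {p: (likes[p], totals[p], likes[p] / totals[p]) for p in totals}


# Hypothetical swipe log.
swipes = [
    ("preprint-1", "left"), ("preprint-1", "left"),
    ("preprint-1", "right"), ("preprint-1", "left"),
    ("preprint-2", "right"), ("preprint-2", "left"),
]

scores = interest_scores(swipes)
for paper, (n_likes, n_total, frac) in sorted(scores.items()):
    print(paper, n_likes, n_total, frac)
```

Reporting the fraction alongside the raw count matters because a paper shown to more people will collect more "likes" even if it is no more interesting.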

This is obviously taking my proposal of a fast statistics journal to the extreme and would provide no measure of how scientifically sound the paper was. But in an age when scientific soundness is only one part of the equation for top journals, a measure of interestingness that was available before review could be of huge value to journals.

If done properly, it would encourage people to publish preprints. If you posted a preprint and it was immediately “interesting” to many scientists, you could use that to convince editors to get past that stage and consider your science. More things like this could happen:

So anyone want to build it?