29 Jan 2014
In his book Average Is Over, Tyler Cowen predicts that as automation becomes more common, modern economies will eventually be composed of two groups: 1) a highly educated minority involved in producing automated services and 2) a vast majority earning very little, but enough to consume some of the low-priced products created by group 1. Not everybody will agree with this view, but we can’t ignore the fact that automation has already eliminated many middle-class jobs in, for example, manufacturing and the automotive industry. New technologies, such as driverless cars and online retailers, will very likely eliminate many more jobs (e.g. drivers and retail clerks) than they create (programmers and engineers).
Computer literacy is essential for working with automated systems. Programming and learning from data are perhaps the most useful skills for creating these systems. Yet the current default curriculum includes neither computer science nor statistics. At the same time, there are plenty of resources for motivated parents with means to get their children to learn these subjects. Kids whose parents don’t have the wherewithal to take advantage of these educational resources will be at an even greater disadvantage than they are today. This disadvantage is compounded by the fact that many of the aforementioned resources are free and open to the world (Codecademy, Khan Academy, edX, and Coursera, for example), which means that a large pool of students who previously had no access to this learning material will also be competing for group 1 jobs. If we want to level the playing field, we should start by updating the public school curriculum so that, in principle, everybody has the opportunity to compete.
28 Jan 2014
Editor’s note: This post was written by Nick Carchedi, a Master’s degree student in the Department of Biostatistics at Johns Hopkins. He is working with us to develop the Data Science Specialization as well as software for interactive learning of R and statistics.
Official swirl website: swirlstats.com
On September 27, 2013, I wrote a guest blog post on Simply Statistics to announce the creation of Statistics with Interactive R Learning (swirl), an R package for teaching and learning statistics and R simultaneously and interactively. Over the next several months, I received a tremendous amount of feedback from all over the world. Two things became clear: 1) there were many opportunities for improvement on the original design and 2) there’s an incredible demand globally for new and better ways of learning statistics and R.
In the spirit of R and open source software, I shared the source code for swirl on GitHub. As a result, I quickly came in contact with several very talented individuals, without whom none of what I’m about to share with you would have been possible. Armed with invaluable feedback and encouragement from early adopters of swirl 1.0, my new team and I pursued a complete overhaul of the original design.
Today, I’m happy to announce the result of our efforts: swirl 2.0.
Like the first version of the software, swirl 2.0 guides students through interactive tutorials in the R console on a variety of topics related to statistics and R. The user selects from a menu of courses, each of which is broken up by topic into shorter lessons. Lessons, in turn, are a dialog between swirl and the user and are composed of text output, multiple choice and text-based questions, and (most importantly) questions that require the user to enter actual R code at the prompt. Responses are evaluated for correctness based on instructor-specified answer tests and appropriate feedback is given immediately to the user.
It’s helpful to think of swirl as the synthesis of two separate parts: content and platform. Content is authored by instructors in R Markdown files. The platform is then responsible for delivering this content to the user and interpreting the user’s responses in an interactive and engaging way.
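The “answer tests” mentioned above are the platform’s core trick: the user’s typed expression is evaluated and its result compared against what the instructor expects. As a hypothetical sketch of the idea (check_answer is an illustrative name, not swirl’s actual API):

```r
# Evaluate the user's typed R expression and compare its value to the
# expected answer (hypothetical sketch, not swirl's real answer-test API).
check_answer <- function(user_input, expected) {
  result <- try(eval(parse(text = user_input)), silent = TRUE)
  if (inherits(result, "try-error")) return(FALSE)
  isTRUE(all.equal(result, expected))
}

check_answer("mean(1:10)", 5.5)  # TRUE
check_answer("mean(1:10)", 5)    # FALSE
```

The real platform does considerably more, but evaluating input and giving immediate feedback is the basic loop.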
Our primary focus for swirl 2.0 was to build a more robust and extensible platform for delivering content. Here’s a (nontechnical) summary of new and revised features:
- A library of answer tests an instructor can deploy to check user input for correctness
- If stuck, a user can skip a question, causing swirl to enter the correct answer on their behalf
- During a lesson, a user can pause instruction to play around or practice something they just learned, then use a special keyword to regain swirl’s attention when ready to resume
- swirl “sees” user input the same way R “sees” it, which allows swirl to understand the composition of a user’s input on a much deeper level (thanks, Hadley)
- User progress is saved between sessions
- More readable output that adjusts to the width of the user’s display (thanks again, Hadley)
- Extensible framework allows others to easily extend swirl’s functionality
- Instructors can author content in a special flavor of R Markdown
(For a more technical understanding of swirl’s features and inner workings, we encourage readers to consult our GitHub repository.)
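To illustrate the “sees user input the same way R sees it” point from the list above — a minimal example of the underlying idea, not swirl’s actual code — parsing input with R’s own parser exposes the structure of an expression rather than just its text:

```r
# Parse a user's input with R's own parser so its structure can be inspected
# (which function was called, which named arguments were used).
expr <- parse(text = "mean(x, na.rm = TRUE)")[[1]]
as.character(expr[[1]])  # "mean" -- the function being called
names(expr)              # ""  ""  "na.rm" -- named arguments used
```

This is what allows answer tests to check the form of a response, not just its value.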
Although improving the platform was our first priority for this release, we’ve made some improvements to existing content and, more importantly, added the beginnings of a new course: Intro to R. Intro to R is our response to the overwhelming demand for a more accessible and interactive way to learn the R language. We’ve included the first three lessons of the course and plan to add many more in the coming months as our focus turns to creating more high-quality content.
Our ultimate goal is to have the statistics and R communities use swirl as a platform to deliver their own content to students interactively. We’ve heard from many people who have an interest in creating their own content and we’re working hard to make the process of creating content as simple and enjoyable as possible.
The goal of swirl is not to be flashy, but rather to provide the most authentic learning environment possible. We accomplish this by placing students directly on the R prompt, within the very same environment they’ll use for data analysis when they are not using swirl. We hope you find swirl to be a valuable tool for learning and teaching statistics and R.
It’s important to stress that, as with any new software, we expect there will be bugs. At this point, users should still consider themselves “early adopters”. For bug reports, suggested enhancements, or to learn more about swirl, please visit our website.
Contributors:
Many people have contributed to this project, either directly or indirectly, since its inception. I will attempt to list them all here, in no particular order. I’m sincerely grateful to each and every one of you.
- Bill & Gina: swirl is as much theirs as it is mine at this point. Their contributions are the only reason the project has evolved so much since the release of swirl 1.0.
- Brian: Challenged me to turn my idea for swirl into a working prototype. Coined the “swirl” acronym. swirl would still be an idea in my head without his encouragement.
- Jeff: Pushes me to think big picture and provides endless encouragement. Reminds me that a great platform is worthless without great content.
- Roger: Encouraged me to separate platform and content, a key paradigm that allowed swirl to mature from a messy prototype to something of real value. Introduced me to Git and GitHub.
- Lauren & Ethan: Helped with development of the earliest instructional content.
- Ramnath: Provided a model for content authoring via slidify “flavor” of R Markdown.
- Hadley: Made key suggestions for improvement and provided an important proof of concept. His work has had a profound influence on swirl’s development.
- Peter: Our discussions led to a better understanding of some key ideas behind swirl 2.0.
- Sally & Liz: Beta testers and victims of my endless rants during stats tutoring sessions.
- Kelly: Most talented graphic designer I know and mastermind behind the swirl logo. First line of defense against bad ideas, poor design, and crappy websites. Visit her website.
- Mom & Dad: Beta testers and my #1 fans overall.
28 Jan 2014
“There are sadistic scientists who hurry to hunt down error instead of establishing the truth.” -Marie Curie (http://en.wikiquote.org/wiki/Marie_Curie)
Thanks to Kasper H. for that quote. I think it is a perfect fit for today’s culture of academic put-downs as academic contribution. One perfect example is the explosion of hate directed at the quilt plot. A quilt plot is a heatmap with several parameters selected in advance; that’s it. This simplification of R’s heatmap function appeared in the journal PLoS ONE. The authors say (though not up front, and not clearly enough for my personal taste) that they know it is just a heatmap.
Over the course of the next several weeks quilt plots went viral. Here are a few example tweets. It was also [covered](http://liorpachter.wordpress.com/2014/01/19/why-do-you-look-at-the-speck-in-your-sisters-quilt-plot-and-pay-no-attention-to-the-plank-in-your-own-heat-map/) on people’s blogs and even in The Scientist. So I did an experiment. I built a table of frequencies in R, then applied the heatmap function in base R, then the quilt.plot function from the fields package, then the function written by the authors of the paper, with as minimal tweaking as possible.
set.seed(12345)
library(fields)
# a 5x5 table of counts, converted to proportions
x = matrix(rbinom(25, size = 4, prob = 0.5), nrow = 5)
pt = prop.table(x)
# 1) base R's heatmap
heatmap(pt)
# 2) quilt.plot from fields; as.vector(pt) runs down the columns,
#    so the row index (y) cycles fastest
quilt.plot(x = rep(1:5, each = 5), y = rep(1:5, 5), z = pt)
# 3) the authors' function, retyped from the paper
quilt(pt, 1:5, 1:5, zlabel = "Proportion")
Here are the results (three figures: the output of heatmap, quilt.plot, and quilt, respectively).
It is clear that, out of the box and with no tinkering, the new function makes something nicer and more interpretable: the columns and rows are where I expect them, and the scale is there and nicely labeled. Everyone who has ever made heatmaps in R has some bit of code that looks like this:
image(t(bdat)[,nrow(bdat):1],col=colsb(9),breaks=quantile(as.vector(as.matrix(dat)),probs=seq(0,1,length=10)),xaxt="n",yaxt="n",xlab="",ylab="")
to hack together a heatmap in R that looks the way you expect. It is a total pain. Obviously the quilt plot paper has a few flaws:
- It tries to introduce the quilt plot as a new idea.
- It doesn’t just come out and say it is a hack of the heatmap function, but tries to dance around it.
- It produces code, but only as images in word files. I had to retype the code to make my plot.
That being said, here are a couple of other true things about the paper:
- The code works if you type it out and apply it.
- They produced code.
- The paper is open access.
- The paper is correct technically.
- The hack is useful for users with few R skills.
So why exactly isn’t it a paper? It smacks of academic elitism to claim that this isn’t good enough because it isn’t a “new idea”. Not every paper discovers radium. Some papers are better than others, and that is OK. I don’t think the quilt plot being published is a problem. Maybe I don’t like exactly the way it is written, but the authors do acknowledge the heatmap, they do produce correct, relevant code, and it does solve a problem people actually have. That is better than a lot of papers that appear in more prestigious journals. Arsenic life, anyone?
I think it is useful to have a forum where people can post correct, useful, but not necessarily groundbreaking results and get credit for them, even if the credit is modest. Otherwise we might miss out on useful bits of code. Frank Harrell has a bunch of functions that tons of people use, but he doesn’t get citations for them; if you use R, you have probably heard of the Hmisc package.
But did you know that Karl Broman has a bunch of really useful functions in his personal R package? qqline2 is great. I know Rafa has a bunch of functions he has never published because they seem “too trivial”, but I use them all the time. Every scientist who touches code has a personal library like this. I’m not saying the quilt plot is in that category. But I am saying that it is stupid not to have a public forum for making these functions available to other scientists. And that won’t happen if the “quilt plot backlash” is what people see when they try to get published credit for simple code that solves real problems.
Hacks like the quilt plot can help people who aren’t comfortable with R write reproducible scripts without having to figure out every plotting parameter. Keeping in mind that the vast majority of data analysis is not done by statisticians, it seems like these little hacks are an important part of science. If you believe in figshare, github, open science, and shareable code, you shouldn’t be making fun of the quilt plotters.
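To make “hack” concrete: a wrapper in this spirit just presets the fiddly plotting parameters once so the user never has to see them. This is a hypothetical sketch (simple_heatmap is my name, not the authors’ function), using fields::image.plot for the labeled color scale:

```r
library(fields)

# Hypothetical wrapper in the spirit of the quilt plot: a heatmap with the
# fiddly image() parameters preset and a labeled color legend.
simple_heatmap <- function(mat, zlabel = "Value") {
  # transpose and flip rows so the plot matches the way the matrix prints
  image.plot(x = 1:ncol(mat), y = 1:nrow(mat), z = t(mat)[, nrow(mat):1],
             xlab = "", ylab = "", xaxt = "n", yaxt = "n",
             legend.lab = zlabel)
}

simple_heatmap(prop.table(matrix(1:25, nrow = 5)), zlabel = "Proportion")
```

A user who can call one function with one argument can now make a labeled heatmap reproducibly, which is exactly the audience the quilt plot serves.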
Marie Curie says so.
21 Jan 2014
We are very proud to announce the Johns Hopkins Data Science Specialization on Coursera. You can see the official announcement from the Coursera folks here. This is the main reason Simply Statistics has been a little quiet lately.
The three of us (Brian Caffo, Roger Peng, and Jeff Leek) along with a couple of incredibly hard working graduate students (Nick Carchedi of swirl fame and Sean Kross) have put together nine new one-month classes to run on the Coursera platform. The classes are:
- The Data Scientist’s Toolbox - A basic introduction to data and data science, and a basic guide to R/RStudio/GitHub/the command line interface.
- R Programming - Introduction to R programming, from installing R to types, to functions, to control structures.
- Getting and Cleaning Data - An introduction to getting data from the web, from images, from APIs, and from databases. The course also covers how to go from raw data to tidy data.
- Exploratory Data Analysis - This course covers plotting in base graphics, lattice, and ggplot2, as well as clustering and other exploratory techniques. It also covers how to think about exploring data you haven't seen.
- Reproducible Research - This is one of the courses unique to our sequence. It covers how to think about reproducible research, evidence-based data analysis, reproducible research checklists, and knitr, markdown, R Markdown, etc.
- Statistical Inference - This course covers the fundamentals of statistical inference from a practical perspective. The course covers both the technical details and important ideas like confounding.
- Regression Models - This course covers the fundamentals of linear and generalized linear regression modeling. It also serves as an introduction to how to "think about" relating variables to each other quantitatively.
- Practical Machine Learning - This course will cover the basic conceptual ideas in machine learning, like in/out-of-sample errors, cross validation, and training and test sets. It will also cover a range of machine learning algorithms and their practical implementation.
- Developing Data Products - This course will cover how to develop tools for communicating data, methods, and analyses with other people. It will cover building R packages, Shiny, and Slidify, among other things.
There will also be a specialization project: a 10th class in which students will work on projects conducted with industry, government, and academic partners.
The classes represent some of the content we have previously covered in our popular Coursera classes and a ton of brand new content for this specialization. Here are some things that I think make our program stand out:
- We will roll out three classes at a time starting in April. Once a class has launched, a new session will start every month, with sessions running concurrently.
- The specialization offers a bunch of unique content, particularly in the courses Getting and Cleaning Data, Reproducible Research, and Developing Data Products.
- All of the content is being developed open source and open access on GitHub. You are welcome to check it out as we develop it and contribute!
- You can take the first 9 courses of the specialization entirely for free.
- You can choose to pay a very modest fee to get “Signature Track” certification in every course.
I have also created a little page that summarizes some of the unique aspects of our program. Scroll through it and you’ll find sharing links at the bottom. Please share it with your friends; we think this is pretty cool: http://jhudatascience.org
19 Jan 2014
- Tesla is hiring a data scientist. That is all.
- I’m not sure I buy the idea that Python is taking over for R among people who actually do regular data science. I think it is still context dependent. A huge fraction of genomics happens in R and there is a steady stream of new packages that allow R users to push farther and farther back into the processing pipeline. On the other hand, I think language diversity is clearly a plus for someone who works with data. Not that I’d know…
- This is an awesome talk on why to pursue a Ph.D. It gives a really level-headed and measured discussion, specifically focused on computational programs (I think I got to it via Alyssa F.’s blog).
- En Español - A blog post about a study of genetic risk factors among Hispanic/Latino populations (via Rafa).
- Where have all the tenured women gone? This is a major issue and deserves much more press than it gets (via Sherri R.).
- Not related to statistics really, but these image captures from Google streetview are wild.