Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

A non-comprehensive comparison of prominent data science programs on cost and frequency.

[Table: comparison of data science program costs and course frequency]

We did a really brief comparison of a few notable data science programs for a grant submission we were working on. I thought it was pretty fascinating, so I’m posting it here. A couple of notes about the table.

  1. Our program can be taken for free, which includes assessments. If you want the official certificate and to take the capstone you pay the above costs.

  2. Udacity’s program can also be taken for free, but if you want the official certificate, assessments, or tutoring you pay the above costs.

  3. The asterisks denote programs where you get an official master’s degree.

  4. The MOOC programs (Udacity’s and ours) offer the most flexibility in terms of student schedules. Ours is the most flexible, with courses running every month. The in-person programs have the least flexibility but obviously the most direct instructor time.

  5. The programs are all quite different in terms of focus, design, student requirements, admissions, instruction, cost, and value.

  6. As far as we know, ours is the only one where every bit of lecture content has been open sourced (https://github.com/DataScienceSpecialization).

The fact that data analysts base their conclusions on data does not mean they ignore experts

Paul Krugman recently joined the new FiveThirtyEight hating bandwagon. I am not crazy about the new website either (although I’ll wait more than one week before judging) but in a recent post Krugman creates a false dichotomy that is important to correct. Krugman states that “[w]hat [Nate Silver] seems to have concluded is that there are no experts anywhere, that a smart data analyst can and should ignore all that.” I don’t think that is what Nate Silver, or any other smart data scientist or applied statistician, has concluded. Note that to build his election prediction model, Nate had to understand how the electoral college works, how polls work, how different polls differ, and the relationship between primaries and the presidential election, among many other details specific to polls and US presidential elections. He learned all of this by reading and talking to experts. The same is true for PECOTA, where data analysts who know quite a bit about baseball collect data to create meaningful and predictive summary statistics. As Jeff said before, the key word in “Data Science” is not Data, it is Science.

The one example Krugman points to as ignoring experts appears to be written by someone who, according to the article that Krugman links to, was biased by his own opinions, not by data analysis that ignored experts. However, in Nate’s analysis of polls and baseball data it is hard to argue that he let his bias affect his analysis. Furthermore, it is important to point out that he did not simply stick data into a black box prediction algorithm. Instead he did what most of us applied statisticians do: build models that are empirically inspired but guided by expert knowledge.

ps - Krugman links to a [New York Times](http://www.nytimes.com/2014/03/22/opinion/egan-creativity-vs-quants.html?src=me&ref=general) piece which has another false dichotomy as the title: “Creativity vs. Quants”. He should try doing it before assuming there is no creativity involved in extracting information from data.

The 80/20 rule of statistical methods development

Developing statistical methods is hard and often frustrating work. One of the underappreciated rules in statistical methods development is what I call the 80/20 rule (maybe it could even be the 90/10 rule). The basic idea is that the first reasonable thing you can do to a set of data often is 80% of the way to the optimal solution. Everything after that is working on getting the last 20%. (Edit: Rafa points out that once again I’ve reverse-scooped a bunch of people and this is already a thing that has been pointed out many times. See, for example, the Pareto principle and this post, also called the 80:20 rule.)

Sometimes that extra 20% is really important and sometimes it isn’t. In a clinical trial, where each additional patient may cost a large amount of money to recruit and enroll, it is definitely worth the effort. For more exploratory techniques like those often used when analyzing high-dimensional data it may not. This is particularly true because the extra 20% usually comes at a cost of additional assumptions about the way the world works. If your assumptions are right, you get the 20%, if they are wrong, you may lose and it isn’t always clear how much.

Here is a very simple example of the 80/20 rule from frequentist statistics - in my experience similar ideas hold in machine learning and Bayesian inference as well. Suppose that I collect some observations X_1,\ldots, X_n and want to test whether the mean of the observations is greater than 0. Suppose I know that the data are normal and that the variance is equal to 1. Then the absolute best statistical test (called the uniformly most powerful test) you could do rejects the hypothesis that the mean is zero if \bar{X} > z_{\alpha}\left(\frac{1}{\sqrt{n}}\right).
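To make the rejection rule concrete, plugging in the sample size and level used in the simulation below (n = 20, \alpha = 0.05; the arithmetic here is my own illustration, not part of the original post) gives

\bar{X} > z_{0.05}\left(\frac{1}{\sqrt{20}}\right) = \frac{1.645}{\sqrt{20}} \approx 0.37,

so the z-test rejects whenever the sample mean exceeds roughly 0.37.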

There are a bunch of other tests you could do though. If you assume the distribution is symmetric you could also use the sign test to test the same hypothesis by creating the random variables  Y_i = 1(X_i > 0) and testing the hypothesis  H_0: Pr(Y_i = 1) = 0.5 versus the alternative that the probability is greater than 0.5 . Or you could use the one sided t-test. Or you could use the Wilcoxon test. These are suboptimal if you know the data are Normal with variance one.

I tried each of these tests with a sample of size  n=20 at the  \alpha=0.05 level. In the plot below I show the ratio of power between each non-optimal test and the optimal z-test (you could do this theoretically but I’m lazy so did it with simulation, code here, colors by RSkittleBrewer).
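The linked code isn’t reproduced here, but a minimal sketch of this kind of power simulation in R might look like the following (the grid of true means, number of simulations, and plotting details are my own illustrative choices, not necessarily those behind the original figure):

# Sketch: power of the sign, t, and Wilcoxon tests relative to the optimal
# z-test for Normal(mu, 1) data with n = 20 at alpha = 0.05.
set.seed(20140326)
n <- 20; alpha <- 0.05; nsim <- 1000
mu <- seq(0, 1, by = 0.1)
power <- sapply(mu, function(m) {
  rejections <- replicate(nsim, {
    x <- rnorm(n, mean = m, sd = 1)
    c(z    = mean(x) > qnorm(1 - alpha) / sqrt(n),
      t    = t.test(x, alternative = "greater")$p.value < alpha,
      wilc = wilcox.test(x, alternative = "greater")$p.value < alpha,
      sign = binom.test(sum(x > 0), n, p = 0.5,
                        alternative = "greater")$p.value < alpha)
  })
  rowMeans(rejections)  # estimated power of each test at true mean m
})
# Ratio of each suboptimal test's power to the optimal z-test's power
relative.power <- sweep(power[-1, ], 2, power["z", ], "/")
matplot(mu, t(relative.power), type = "l", lty = 1,
        xlab = "True mean", ylab = "Power relative to z-test")
legend("bottomright", legend = rownames(relative.power), lty = 1, col = 1:3)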

[Figure: power of the sign, t, and Wilcoxon tests relative to the optimal z-test, as a function of the true mean]

The tests reach 80% of the power of the z-test at different values of the true mean (0.6 for the Wilcoxon test, 0.5 for the t-test, and 0.85 for the sign test). Overall, these methods very quickly catch up to the optimal method.

In this case, the non-optimal methods aren’t much easier to implement than the optimal solution. But in many cases, the optimal method requires significantly more computation, memory, assumptions, theory, or some combination of the four. The hard decision in whether to create a new method is whether the 20% is worth it. This is obviously application specific.

An important corollary of the 80/20 rule is that you can have a huge impact on new technologies if you are the first to suggest an already known 80% solution. For example, the first person to suggest hierarchical clustering or the singular value decomposition for a new high-dimensional data type will often get a large number of citations. But that is a hard way to make a living - you aren’t the only person who knows about these methods and the person who says it first soaks up a huge fraction of the credit. So the only way to take advantage of this corollary is to spend your time constantly trying to figure out what the next big technology will be. And you know what they say about prediction being hard, especially about the future.

The time traveler's challenge.

Editor’s note: This has nothing to do with statistics. 

I do a lot of statistics for a living and would claim to know a relatively large amount about it. I also know a little bit about a bunch of other scientific disciplines, a tiny bit of engineering, a lot about pointless sports trivia, some current events, the geography of the world (vaguely) and the geography of places I’ve lived (pretty well).

I have often wondered, if I was transported back in time to a point before the discovery of say, how to make a fire, how much of human knowledge I could recreate. In other words, what would be the marginal effect on the world of a single person (me) being transported back in time. I could propose Newton’s Laws, write down a bunch of the basis of calculus, and discover the central limit theorem. I probably couldn’t build an internal combustion engine - I know the concept but don’t know enough of the details. So the challenge is this.

 If you were transported back 4,000 or 5,000 years, how much could you accelerate human knowledge?

When I told Leah J. about this idea she came up with an even more fascinating variant.

Suppose that I told you that in 5 days you were going to be transported back 4,000 or 5,000 years but you couldn’t take anything with you. What would you read about on Wikipedia? 

ENAR is in Baltimore - Here's What To Do

This year’s meeting of the Eastern North American Region of the International Biometric Society (ENAR) is in lovely Baltimore, Maryland. As local residents Jeff and I thought we’d put down a few suggestions for what to do during your stay here in case you’re not familiar with the area.

Venue

The conference is being held at the Marriott in the Harbor East area of the city, which is relatively new and a great location. There are a number of good restaurants right in the vicinity, including Wit & Wisdom in the Four Seasons hotel across the street and Pabu, an excellent Japanese restaurant that I personally believe is the best restaurant in Baltimore (a very close second is Woodberry Kitchen, which is a bit farther away near Hampden). If you go to Pabu, just don’t get sushi; try something new for a change. Around Harbor East you’ll also find Cinghiale (an excellent northern Italian restaurant), Charleston (expensive southern food), Lebanese Taverna, and Ouzo Bay. If you’re sick of restaurants, there’s also a Whole Foods. If you want a great breakfast, you can walk just a few blocks down Aliceanna Street to the Blue Moon Cafe. Get the eggs Benedict. If you get the Cap’n Crunch French toast, you will need a nap afterwards.

Just east of Harbor East is an area called Fell’s Point. This is commonly known as the “bar district” and it lives up to its reputation. Max’s in Fell’s Point (on the square) has an obscene number of beers on tap. The Heavy Seas Alehouse on Central Avenue has some excellent beers from the local Heavy Seas brewery and also has great food from chef Matt Seeber. Finally, the Daily Grind coffee shop is a local institution.

Around the Inner Harbor

Outside of the immediate Harbor East area, there are a number of things to do. For kids, there’s Port Discovery, which my 3-year-old son seems to really enjoy. There’s also the National Aquarium where the Tuesday networking event will be held. This is also a great place for kids if you’re bringing family. There’s a neat little park on Pier 6 that is small, but has a number of kid-related things to do. It’s a nice place to hang out when the weather is nice. Around the other side of the harbor is the Maryland Science Center, another kid-fun place, and just west of the Harbor down Pratt Street is the B&O Railroad Museum, which I think is good for both kids and adults (I like trains).

Unfortunately, at this time there’s no football or baseball to watch.

Around Baltimore

There are a lot of really interesting things to check out around Baltimore if you have the time. If you need to get around downtown and the surrounding areas there’s the Charm City Circulator which is a free bus that runs every 15 minutes or so. The Mt. Vernon district has a number of cultural things to do. For classical music fans there’s the wonderful Baltimore Symphony Orchestra directed by Marin Alsop. The Peabody Institute often has some interesting concerts going on given by the students there. There’s the Walters Art Museum, which is free, and has a very interesting collection. There are also a number of good restaurants and coffee shops in Mt. Vernon, like Dooby’s (excellent dinner) and Red Emma’s  (lots of Noam Chomsky).

That’s all I can think of right now. If you have other questions about Baltimore while you’re here for ENAR tweet us up at @simplystats.