Simply Statistics
A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Expected Salary by Major

In this recent editorial about the Occupy Wall Street movement, Richard Kim profiles a protestor who, despite having a master's degree, can't find a job. This particular protestor quit his job as a school teacher three years ago and took out a $35K student loan to obtain a master's degree in puppetry from the University of Connecticut. I wonder if, before taking his money, UConn showed this person data on job prospects for their puppetry graduates. More generally, I wonder if any university shows its idealistic 18-year-old freshmen such data.

Georgetown's Center for Education and the Workforce has an informative interactive webpage that students can use to find by-major salary information. I scraped data from this Wall Street Journal webpage, which also provides, for each major, unemployment rates, salary quartiles, and rank in popularity. I used these data to compute expected salaries by multiplying median salary by the percent employed. The graph above shows expected salary versus popularity rank (1 = most popular) for the 50 most popular majors (go here for a complete table; here are the raw data and code). I also included Physics (the 70th). I used different colors to represent four categories: engineering, math/stat/computers, physical sciences, and the rest. As a baseline I added a horizontal line representing the average salary for a truck driver: $65K, a job currently with plenty of openings. Different font sizes are used only to make the names fit.

A couple of observations stand out. First, only one of the top 10 most popular majors, Computer Science, has a higher expected salary than truck drivers. Second, Psychology, the fifth most popular major, has an expected salary of $40K and, as seen in the table, an unemployment rate of 6.1%, almost three times that of nursing.
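For concreteness, here is the calculation as a short Python sketch. The median salaries below are made-up placeholders, not the actual scraped WSJ values; only Psychology's 6.1% unemployment rate comes from the table discussed above.

# Sketch of the expected-salary calculation used for the graph.
# Median salaries are illustrative placeholders, NOT the scraped WSJ values.
majors = {
    # major: (median salary in $K, unemployment rate)
    "Psychology": (48.0, 0.061),
    "Nursing": (60.0, 0.022),
    "Computer Science": (75.0, 0.055),
}

TRUCK_DRIVER_BASELINE = 65.0  # $K, the horizontal reference line in the graph

for major, (median_salary, unemployment) in majors.items():
    expected = median_salary * (1.0 - unemployment)  # median salary x percent employed
    side = "above" if expected > TRUCK_DRIVER_BASELINE else "below"
    print(f"{major}: expected salary ${expected:.1f}K ({side} the trucking baseline)")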

A few editorial remarks:

  1) I understand that being a truck driver is very hard and that there is little room for career development.
  2) I am not advocating that people pick majors based on future salaries.
  3) I think college freshmen deserve to know the data given how much money they fork over to us.
  4) The graph is for bachelor's degrees, not graduate education. The CEW website includes data for graduate degrees. Note that Biology shoots way up with a graduate degree.
  5) For those interested in a PhD in Statistics, I recommend you major in Math with a minor in a liberal arts subject, such as English, while taking as many programming classes as you can. We all know Math is the base for everything statisticians do, but why English? Students interested in academia tend to underestimate the importance of writing and communicating.

Related articles: This NY Times article describes how and why students are leaving the sciences. Here, Alex Tabarrok describes big changes in the balance of majors between 1985 and today, and here he shares his thoughts on Richard Kim's editorial. Matt Yglesias explains that unemployment is rising across the board. Finally, Peter Orszag shares his views on how a changing world is changing the value of a college degree.

Hat tip to David Santiago for sending several of these links and to Harris Jaffee for help with the scraping.

Statisticians on Twitter... help me find more!

In honor of our blog finally dragging itself into the 21st century and jumping onto Twitter/Facebook, I have been compiling a list of statistical people on Twitter. I couldn’t figure out an easy way to find statisticians in one go (which could be because I don’t have Twitter skills). 

So here is my very informal list of statisticians I found in a half hour of searching. I know I missed a ton of people; let me know who I missed so I can update!

@leekgroup - Jeff Leek (What, you thought I’d list someone else first?)

@rdpeng - Roger Peng

@rafalab - Rafael Irizarry

@storeylab - John Storey

@bcaffo - Brian Caffo

@sherrirose - Sherri Rose

@raphg - Raphael Gottardo

@airoldilab - Edo Airoldi

@stat110 - Joe Blitzstein

@tylermccormick - Tyler McCormick

@statpumpkin - Chris Volinsky

@fivethirtyeight - Nate Silver

@flowingdata - Nathan Yau

@kinggary - Gary King

@StatModeling - Andrew Gelman

@AmstatNews - Amstat News

@hadleywickham - Hadley Wickham

Coarse PM and measurement error paper

Howard Chang, a former PhD student of mine now at Emory, just published a paper on a measurement error model for estimating the health effects of coarse particulate matter (PM). This is a cool paper that deals with the problem that coarse PM tends to be very spatially heterogeneous. Coarse PM is a bit of a hot topic now because there is currently no national ambient air quality standard for coarse PM specifically. There is a standard for fine PM, but compared to fine PM, the scientific evidence for health effects of coarse PM is less developed.

When you want to assign a coarse PM exposure level to people in a county (assuming you don’t have personal monitoring) there is a fair amount of uncertainty about the assignment because of the spatial variability. This is in contrast to pollutants like fine PM or ozone which tend to be more spatially smooth. Standard approaches essentially ignore the uncertainty which may lead to some bias in estimates of the health effects.

Howard developed a measurement error model that uses observations from multiple monitors to estimate the spatial variability and correct for it in time series regression models estimating the health effects of coarse PM. Another nice thing about his approach is that it avoids any complex spatial-temporal modeling to do the correction.
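To give a flavor of the idea, here is a sketch in Python on simulated data. To be clear, this is not Howard's model; it is a classical regression-calibration correction under much simpler assumptions, but it shows how between-monitor spread can be used to undo the attenuation that exposure measurement error induces in a regression slope.

import numpy as np

# Simulated illustration: monitors measure the true county-average coarse PM
# with spatial noise, and the naive regression slope is attenuated toward zero.
rng = np.random.default_rng(1)
n_days, n_monitors = 1000, 4

true_pm = rng.gamma(shape=4.0, scale=5.0, size=n_days)  # true county-average exposure
spatial_sd = 6.0  # spatial heterogeneity: monitors disagree about the county average
monitors = true_pm[:, None] + rng.normal(0.0, spatial_sd, size=(n_days, n_monitors))

beta_true = 0.02
health = 1.0 + beta_true * true_pm + rng.normal(0.0, 1.0, size=n_days)

# Naive analysis: regress the outcome on the monitor average, ignoring its error.
x = monitors.mean(axis=1)
xy = np.cov(x, health)
beta_naive = xy[0, 1] / xy[0, 0]

# Estimate the error variance of the monitor average from the day-to-day
# between-monitor spread, then apply the reliability (attenuation) correction.
error_var = monitors.var(axis=1, ddof=1).mean() / n_monitors
reliability = (xy[0, 0] - error_var) / xy[0, 0]
beta_corrected = beta_naive / reliability

print(f"true {beta_true:.4f}  naive {beta_naive:.4f}  corrected {beta_corrected:.4f}")

Running this, the naive slope comes out below the true value and the corrected slope recovers it, which is the basic phenomenon the paper addresses in a far more careful spatio-temporal setting.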

Related Posts: Jeff on “Cool papers” and “Dissecting the genomics of trauma”

Is Statistics too darn hard?

In this NY Times article, Christopher Drew points out that many students planning engineering and science majors end up switching to other subjects or failing to get any degree. He argues that this is partly due to the difficulty of the classes. In a previous post we lamented the anemic growth in math and statistics majors in comparison to other majors. I do not think we should make our classes easier just to keep these students. But we can certainly do a better job of motivating the material and making it more interesting. After having fun in high school science classes, students entering college are faced with the reality that the first college science classes can be abstract and technical. But in Statistics we can certainly teach the practical aspects first. Learning the abstractions is so much easier and more enjoyable when you understand the practical problem behind the math. And in Statistics there is always a practical aspect behind the math.

The statistics class I took in college was so dry and removed from reality that I can see why it would turn students away from the subject. So, if you are teaching undergrads (or grads), I highly recommend the Stat Labs textbook by Deb Nolan and Terry Speed, which teaches mathematical statistics through applications. If you know of other good books, please post them in the comments. Also, if you know of similar books for other science, technology, engineering, and math (STEM) subjects, please share as well.

Related Posts: Jeff on “The 5 most critical statistical concepts”, Rafa on “The future of graduate education”, Jeff on “Graduate student data analysis inspired by a high-school teacher”

Reproducible research: Notes from the field

Over the past year, I’ve been doing a lot of talking about reproducible research. Talking to people, talking on panels, and talking about some of my own work. It seems to me that interest in the topic has exploded recently, in part due to some recent scandals, such as the Duke clinical trials fiasco.

If you are unfamiliar with the term “reproducible research”, the basic idea is that authors of published research should make available the materials necessary for others to reproduce, to a very high degree of similarity, the published findings. If that definition seems imprecise, well, that’s because it is.

I think reproducibility becomes easier to define in the context of a specific field or application. Reproducibility often comes up in the context of computational science. In computational science fields, much of the work is done on the computer, often with very large amounts of data. In other words, the analysis of the data is of comparable difficulty to the collection of the data (maybe even more complicated). The notion of reproducibility then typically comes down to making the analytic data and the computer code available to others. That way, knowledgeable people can run your code on your data and presumably get your results. If others do not get your results, that may be a sign of a problem, or perhaps a misunderstanding; in either case, a resolution needs to be found. Reproducibility is key to science much the way it is key to programming. When bugs are found in software, being able to reproduce the bug is an important step toward fixing it. Anyone learning to program in C knows the pain of dealing with a memory-related bug, which will often exhibit seemingly random and non-reproducible behavior.
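As a trivial illustration of the “run my code, get my results” idea (a generic sketch, not drawn from any particular project): fix the random seed, keep the analysis self-contained, and record the environment, so that a rerun produces identical numbers.

import platform
import random
import sys

# A toy "analysis" anyone can rerun and get identical numbers:
# fixed seed, no hidden state, environment recorded alongside the result.
random.seed(20111209)
data = [random.gauss(0, 1) for _ in range(100)]
mean = sum(data) / len(data)

print(f"mean = {mean:.6f}")  # should match exactly on a rerun
print(f"python {sys.version.split()[0]} on {platform.platform()}")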

My discussions with others about the need for reproducibility in science often range far and wide. One reason is that people have very different ideas about (a) what reproducibility is and (b) why we need it. Here is my take on various issues.

  • Reproducibility is not replication. There’s often honest confusion between the notion of reproducibility and what I would call a “full replication”. A full replication doesn’t analyze the same dataset, but rather involves an independent investigator collecting an independent dataset and conducting an independent analysis. Full replication has been a fundamental component of science for a long time now and will continue to be the primary yardstick for measuring the plausibility of scientific claims. I think most would agree that full replication is preferable, but often it is simply not possible.
  • Reproducibility is not needed solely to prevent fraud. I’ve heard many people emphasize reproducibility as a means to prevent fraud. Journal editors seem to think this is the main reason for demanding reproducibility. It is one reason, but to be honest, I’m not sure it’s all that useful for detecting fraud. If someone truly wants to commit fraud, then it’s possible to make the fraud reproducible. If I just generate a bunch of numbers and claim them as data that I collected, any analysis of that dataset can be reproducible. While demanding reproducibility may be useful for ferreting out certain types of fraud, it’s not a general solution and it’s not the primary reason we need it.
  • Reproducibility is not as easy as it sounds. Making one’s research reproducible is hard. It’s especially hard when you try to do it after the research has been done. In that case it’s more like an audit, and I’m guessing that for most people the word “audit” is NOT synonymous with “fun”. Even if you set out to make your work reproducible from the get-go, it’s easy to miss things. Code can get lost (even with a version control system) and metadata can slip through the cracks. Even when you’ve done everything right, computers and software can change. Virtual machines, like those on Amazon EC2 and elsewhere, seem to have some potential. The single most useful tool that I have found is a good version control system, like git.
  • At this point, anything would be better than nothing. Right now, I think the bar for reproducibility is quite low in the sense that most published work is not reproducible. Even if data are available, often the code that analyzed the data is not. So if you’re publishing research and you want to make it at least partially reproducible, just put what you can out there: on the web, on github, in a data repository, wherever you can. If you can’t publish the data, make your code available. Even that is better than nothing. In fact, I find reading someone’s code to be very informative, and questions can often arise without even looking at the data. Until we have a better infrastructure for distributing reproducible research, we will have to make do with what we have. But if we all start putting stuff out there, the conversation will turn from “Why should I make stuff available?” to “Why wouldn’t I make stuff available?”