Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Sunday data/statistics link roundup (2/24/2013)

  1. An attempt to create a version of knitr for Stata (via John M.). I like the direction that reproducible research is moving - toward easier use and more widespread adoption. The success of the IPython notebook is another great sign for the whole research area.
  2. Email is always a problem for me. In the last week I’ve been introduced to a couple of really nice apps that give me insight into my email habits (Gmail Meter - via John M.) and that help me send reminders to myself with minimal hassle (Boomerang - via Brian C.).
  3. Andrew Lo proposes a new model for cancer research funding based on his research in financial engineering. In light of the impending sequester I’m interested in alternative funding models for data science/statistics in biology. But the concerns I have about both crowd-funding and Lo’s idea are whether the basic scientists get hosed and whether sustained funding at a level that will continue to attract top scientists is possible.
  4. This is a really nice rundown of why medical costs are so high. The key things in the article, to me, are that: (1) he chased down the data about actual costs versus charges; (2) he highlights the role of the chargemaster - the price setter in medical centers - and how the prices are often set historically with yearly markups (not based on estimates of costs, etc.); and (3) he discusses key nuances like medical liability if the “best” tests aren’t run on everyone. Overall, it is definitely worth a read and this seems like a hugely important problem a statistician could really help with (if they could get their hands on the data).
  5. A really cool applied math project where flying robot helicopters toss and catch a stick. Applied math can be super impressive, but they always still need a little boost from statistics: “This also involved bringing the insights gained from their initial and many subsequent experiments to bear on their overall system design. For example, a learning algorithm was added to account for model inaccuracies.” (via Rafa via MR)
  6. We've talked about [trying to reduce meetings](http://simplystatistics.org/2011/09/19/meetings/) to increase productivity before. Here is an article in the NYT talking about [the same issue](http://www.nytimes.com/2013/02/17/jobs/too-many-office-meetings-and-how-to-fight-back.html?_r=1&) (via Rafa via Karl B.). Brian C. made an interesting observation though, that in a soft money research environment there should be evolutionary pressure against anything that doesn't improve your ability to obtain research funding. Despite this, meetings proliferate in soft-money environments. So there must be some selective advantage to them! Another interesting project for a stats/evolutionary biology student.
  7. If you have read all the Simply Statistics interviews and still want more, check out <http://www.analyticstory.com/>.

 

Tesla vs. NYT: Do the Data Really Tell All?

So far I’ve enjoyed the back and forth between Tesla Motors and New York Times reporter John Broder. The short version is:

  • Broder tested one of Tesla’s new Model S all-electric sedans on a drive from Washington, D.C. to Groton, CT. Part of the reason for this specific trip was to make use of Tesla’s new supercharger stations along the route (one in Delaware and one in Connecticut).
  • Broder’s trip appeared to have some bumps, including running out of electricity at one point and requiring a tow.
  • After the review was published in the New York Times, Elon Musk, the CEO/Founder of Tesla, was apparently livid. He published a detailed response on the Tesla blog explaining that what Broder wrote in his review was not true and that “he simply did not accurately capture what happened and worked very hard to force our car to stop running”.
  • Broder has since responded to Musk’s response with further explanation.

Of course, the most interesting aspect of Musk’s response on the Tesla blog was that he published the data collected by the car during Broder’s test drive. When the existence of these data first came to light, I thought it was a bit creepy, but Musk makes clear in his post that they require data collection for all reviewers because of a previous bad experience. So the fact that data were being collected on speed, cabin temperature, battery charge %, and rated range remaining was presumably known to all, especially Broder. Given that you know Big Brother Musk is watching, it seems odd to deliberately lie in a widely read publication like the Times.

Having read the original article, Musk’s response, and Broder’s rebuttal, one thing is clear to me: there’s more than one way to see the data. The challenge here is that Broder had the car, but not the data, so he had to rely on his personal recollection and notes. Musk has the data, but wasn’t there, and so has to rely on peering at graphs to interpret what happened on the trip.

One graph in particular was fascinating. Musk shows a periodic-looking segment of the speed graph and concludes

Instead of plugging in the car, he drove in circles for over half a mile in a tiny, 100-space parking lot. When the Model S valiantly refused to die, he eventually plugged it in.

Broder claims

I drove around the Milford service plaza in the dark looking for the Supercharger, which is not prominently marked. I was not trying to drain the battery. (It was already on reserve power.) As soon as I found the Supercharger, I plugged the car in.

Okay, so who’s right? Aren’t the data supposed to settle this?

In a few other cases in this story, the data support both people. In particular, it seems that there was some serious miscommunication between Broder and Tesla’s staff. I’m sure they have recordings of those telephone calls too, but those were not reproduced in Musk’s response.

The bottom line here, in my opinion, is that sometimes the data don’t tell all, especially “big data”. In the end, data are one thing, interpretation is another. Tesla had reams of black-box data from the car and yet some of the data still appear to be open to interpretation. My guess is that the data Tesla collects are not gathered specifically to root out liars, and so are perhaps not optimized for this purpose. That leads to another key point about big data: they are often used “off-label”, i.e. not for the purpose for which they were originally designed.
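
To see why a log like this can be read both ways, here is a toy simulation (entirely made up; it uses neither Tesla’s logs nor their logging format) in which two very different intentions, deliberately circling the lot versus hunting for the charger, leave essentially the same footprint in a coarse speed record.

```r
# Toy simulation (not Tesla's data): two different driver intentions that
# leave nearly the same footprint in a coarse, once-per-second speed log.
set.seed(1)
minutes <- seq(0, 5, by = 1 / 60)   # five minutes of once-per-second readings

# Driver A deliberately circles the lot, slowing slightly on every turn.
speed_circling  <- 10 + 2 * sin(2 * pi * minutes / 0.5) + rnorm(length(minutes), sd = 0.5)

# Driver B hunts for the charger, slowing at the end of each row of spaces.
speed_searching <- 10 + 2 * sin(2 * pi * minutes / 0.8) + rnorm(length(minutes), sd = 0.5)

# Summaries of the kind the published graphs support: average speed and distance.
summarize_log <- function(mph) c(mean_mph = mean(mph), miles = sum(mph) / 3600)
rbind(circling  = summarize_log(speed_circling),
      searching = summarize_log(speed_searching))
# Both traces oscillate around 10 mph for five minutes and cover well over half
# a mile; the record alone cannot tell you *why* the car was moving.
```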

I read this story with interest because I actually think Tesla is a fascinating company that makes cool products (that, sadly, I could never afford). This episode will surely not be the end of Tesla or of the New York Times, but it illustrates to me that simply “having the data” doesn’t necessarily give you what you want.

Sunday data/statistics link roundup (2/17/2013)

  1. The Why Axis - discussion of important visualizations on the web. This is one I think a lot of people know about, but it is new to me. (via Thomas L. - p.s. I’m @leekgroup on Twitter, not @jtleek). 
  2. This paper says that people who “engage in outreach” (read: write blogs) tend to have higher academic output (hooray!) but that outreach itself doesn’t help their careers (boo!).
  3. It is a little too late for this year, but next year you could make a Valentine with R.
  4. [The Email Charter](http://emailcharter.org/) (via Rafa). This is pretty similar to my getting email responses from busy people. Not sure who scooped whom. I’m still waiting for my to-do list app. Mailbox is close, but I still want actions to be multiple choice or yes/no or delegation rather than just snoozing emails for later.
  5. Top ten reasons not to share your code, and why you should anyway.

Interview with Nick Chamandy, statistician at Google

Nick Chamandy received his M.S. in statistics from the University of Chicago and his Ph.D. in statistics from McGill University, and then joined Google as a statistician. We talked to him about how he ended up at Google, what software he uses, and how big the Google data sets are. To read more interviews, check out our interviews page.

SS: Which term applies to you: data scientist, statistician, computer scientist, or something else?

NC: I usually use the term Statistician, but at Google we are also known as Data Scientists or Quantitative Analysts. All of these titles apply to some degree. As with many statisticians, my day to day job is a mixture of analyzing data, building models, thinking about experiments, and trying to figure out how to deal with large and complex data structures. When posting job opportunities, we are cognizant that people from different academic fields tend to use different language, and we don't want to miss out on a great candidate because he or she comes from a non-statistics background and doesn't search for the right keyword. On my team alone, we have had successful "statisticians" with degrees in statistics, electrical engineering, econometrics, mathematics, computer science, and even physics. All are passionate about data and about tackling challenging inference problems.

SS: How did you end up at Google?

Coming out of my PhD program at McGill, I was somewhat on the fence about the academia vs. industry decision. Ideally I wanted an opportunity that combined the intellectual freedom and stimulation of academia with the concreteness and real-world relevance of industrial problems. Google seemed to me at the time (and still does) to be by far the most exciting place to pursue that happy medium. The culture at Google emphasizes independent thought and idea generation, and the data are staggering in both size and complexity. That places us squarely on the "New Frontier" of statistical innovation, which is really motivating. I don't know of too many other places where you can both solve a research problem and have an impact on a multi-billion dollar business in the same day.

SS: Is your work related to the work you did as a Ph.D. student?

NC: Although I apply many of the skills I learned in grad school on a daily basis, my PhD research was on Gaussian random fields, with particular application to brain imaging data. The bulk of my work at Google is in other areas, since I work for the Ads Quality Team, whose goal is to quantify and improve the experience that users have interacting with text ads on the google.com search results page. Once in a while though, I come across data sets with a spatial or spatio-temporal component and I get the opportunity to leverage my experience in that area. Some examples are eye-tracking studies run by the user research lab (measuring user engagement on different parts of the search page), and click pattern data. These data sets typically violate many of the assumptions made in neuroimaging applications, notably smoothness and isotropy conditions. And they are predominantly 2-D applications, as opposed to 3-D or higher.

SS: What is your programming language of choice: R, Python or something else?

I use R, and occasionally MATLAB, for data analysis. There is a large, active and extremely knowledgeable R community at Google. Because of the scale of Google data, however, R is typically only useful after a massive data aggregation step has been accomplished. Before that, the data are not only too large for R to handle, but are stored on many thousands of machines. This step is usually accomplished using the MapReduce parallel computing framework, and there are several Google-developed scripting languages that can be used for this purpose, including Go. We also have an interactive, ad hoc query language which can be applied to massive, "sharded" data sets (even those with a nested structure), and for which there is an R API. The engineers at Google have also developed a truly impressive package for massive parallelization of R computations on hundreds or thousands of machines. I typically use shell or Python scripts for chaining together data aggregation and analysis steps into "pipelines".
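
Google's internal MapReduce and query tools are not shown here, but the shape of the workflow described above (a massive aggregation step first, then modeling of the much smaller result in R) can be sketched with plain base R standing in for the distributed step. Everything below, the table, the column names, and the click rates, is invented purely for illustration.

```r
# Minimal stand-in for the pipeline described above (not Google's tools):
# an event-level table is first collapsed by an aggregation step, and only
# the small aggregated result is modeled in R.
set.seed(42)

# Pretend event-level data. In reality this stage would be billions of rows
# spread across many machines and reduced by a MapReduce job, not by R.
events <- data.frame(
  query_class = sample(c("navigational", "commercial", "informational"),
                       1e5, replace = TRUE),
  top_ad      = rbinom(1e5, 1, 0.6)   # hypothetical: was an ad shown at the top?
)
events$clicked <- rbinom(nrow(events), 1, ifelse(events$top_ad == 1, 0.08, 0.01))

# "Reduce" step: one row per query class x ad condition, with click counts.
agg <- aggregate(clicked ~ query_class + top_ad, data = events,
                 FUN = function(x) c(clicks = sum(x), n = length(x)))
agg <- do.call(data.frame, agg)   # flatten the matrix column aggregate() returns

# Analysis step: the aggregated table is tiny, so ordinary R modeling applies.
fit <- glm(cbind(clicked.clicks, clicked.n - clicked.clicks) ~ query_class + top_ad,
           family = binomial, data = agg)
summary(fit)
```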

SS: How big are the data sets you typically handle? Do you extract them yourself or does someone else extract them for you?

Our data sets contain billions of observations before any aggregation is done. Even after aggregating down to a more manageable size, they can easily consist of 10s of millions of rows, and on the order of 100s of columns. Sometimes they are smaller, depending on the problem of interest. In the vast majority of cases, the statistician pulls his or her own data -- this is an important part of the Google statistician culture. It is not purely a question of self-sufficiency. There is a strong belief that without becoming intimate with the raw data structure, and the many considerations involved in filtering, cleaning, and aggregating the data, the statistician can never truly hope to have a complete understanding of the data. For massive and complex data, there are sometimes as many subtleties in whittling down to the right data set as there are in choosing or implementing the right analysis procedure. Also, we want to guard against creating a class system among data analysts -- every statistician, whether BS, MS or PhD level, is expected to have competence in data pulling. That way, nobody becomes the designated data puller for a colleague. That said, we always feel comfortable asking an engineer or other statistician for help using a particular language, code library, or tool for the purpose of data-pulling. That is another important value of the Google culture -- sharing knowledge and helping others get "unstuck".

SS: Do you work collaboratively with other statisticians/computer scientists at Google? How do the projects you work on get integrated into Google's products? Is there a process of approval?

Yes, collaboration with both statisticians and engineers is a huge part of working at Google. In the Ads Team we work on a variety of flavours of statistical problems, spanning but not limited to the following categories: (1) Retrospective analysis with the goal of understanding the way users and advertisers interact with our system; (2) Designing and running randomized experiments to measure the impact of changes to our systems; (3) Developing metrics, statistical methods and tools to help evaluate experiment data and inform decision-making; (4) Building models and signals which feed directly into our engineering systems. "Systems" here are things like the algorithms that decide which ads to display for a given query and context.

Clearly (2) and (4) require deep collaboration with engineers -- they can make the changes to our production codebase which deploy a new experiment or launch a new feature in a prediction model. There are multiple engineering and product approval steps involved here, meant to avoid introducing bugs or features which harm the user experience. We work with engineers and computer scientists on (1) and (3) as well, but to a lesser degree. Engineers and computer scientists tend to be extremely bright and mathematically-minded people, so their feedback on our analyses, methodology and evaluation tools is pretty invaluable!
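
As a concrete, entirely hypothetical illustration of the experiment work in category (2) above, here is the sort of back-of-envelope sizing calculation one might run in base R before launching a two-arm test. The click-through rates are invented, and nothing here describes Google's actual experiment infrastructure.

```r
# Rough sizing for a hypothetical two-arm experiment on a click-through rate.
# The numbers are made up; power.prop.test() is base R, not internal tooling.
baseline_ctr <- 0.050    # assumed control click-through rate
lift         <- 0.002    # smallest absolute change considered worth detecting

power.prop.test(p1 = baseline_ctr,
                p2 = baseline_ctr + lift,
                sig.level = 0.05,
                power     = 0.90)
# Detecting an effect this small needs on the order of a few hundred thousand
# observations per arm, which is why web-scale traffic makes such experiments
# routine rather than heroic.
```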

SS: Who have been good mentors to you during your career? Is there something in particular they did to help you?

I've had numerous important mentors at Google (in addition, of course, to my thesis advisors and professors at McGill). Largely they are statisticians who have worked in industry for a number of years and have mastered the delicate balance between deep-thinking a problem and producing something quick and dirty that can have an immediate impact. Grad school teaches us to spend weeks thinking about a problem and coming up with an elegant or novel methodology to solve it (sometimes without even looking at data). This process certainly has its place, but in some contexts a better outcome is to produce an unsophisticated but useful and data-driven answer, and then refine it further as needed. Sometimes the simple answer provides 80% of the benefit, and there is no reason to deprive the consumers of your method of this short-term win while you optimize for the remaining 20%. By encouraging the "launch and iterate" mentality for which Google is well-known, my mentors have helped me produce analyses, models and methods that have a greater and more immediate impact.

SS: What skills do you think are most important for statisticians/data scientists moving into the tech industry?

Broadly, statisticians entering the tech industry should do so with an open mind. Technically speaking, they should be comfortable with heavy-tailed, poorly-behaved distributions that fail to conform to assumptions or data structures underlying the models taught in most statistics classes. They should not be overly attached to the ways in which they currently interact with data sets, since most of these don't work for web-scale applications. They should be receptive to statistical techniques that require massive amounts of data or vast computing networks, since many tech companies have these resources at their disposal. That said, a statistician interested in the tech industry should not feel discouraged if he or she has not already mastered large-scale computing or the hottest programming languages. To me, it is less about what skills one must brush up on, and much more about a willingness to adaptively learn new skills and adjust one's attitude to be in tune with the statistical nuances and tradeoffs relevant to this New Frontier of statistics. Statisticians in the tech industry will be well-served by the classical theory and techniques they have mastered, but at times must be willing to re-learn things that they have come to regard as trivial. Standard procedures and calculations can quickly become formidable when the data are massive and complex.
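
To make the heavy-tail point concrete, here is a small simulation of my own (not from the interview) contrasting a well-behaved distribution with a heavy-tailed one for which the sample mean never settles down, no matter how much data arrive.

```r
# Running means under a light-tailed vs. a heavy-tailed distribution.
set.seed(2013)
n <- 1e5
running_mean <- function(x) cumsum(x) / seq_along(x)

light <- running_mean(rnorm(n, mean = 1))        # Gaussian: mean converges to 1
heavy <- running_mean(rcauchy(n, location = 1))  # Cauchy: no finite mean exists

checkpoints <- c(1e3, 1e4, 1e5)
estimates <- rbind(gaussian = light[checkpoints], cauchy = heavy[checkpoints])
colnames(estimates) <- paste0("after_", formatC(checkpoints, format = "d"), "_obs")
estimates
# The Gaussian row hugs 1 at every checkpoint; the Cauchy row keeps jumping
# around, because piling on more data cannot rescue a summary statistic whose
# distributional assumptions are violated.
```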

I'm a young scientist and sequestration will hurt me

I’m a biostatistician. That means that I help scientists and doctors analyze their medical data to try to figure out new screening tools, new therapies, and new ways to improve patients’ health. I’m also a professor. I spend a good fraction of my time teaching students about analyzing data in classes here at my university and online. Big data/data analysis is an area of growth for the U.S. economy and some have even suggested that there will be a critical shortage of trained data analysts.

I have other responsibilities but these are the two biggies - teaching and research. I work really hard to be good at them because I’m passionate about education and I’m passionate about helping people. I’m by no means the only (relatively) young person with this same drive. I would guess this is a big reason why a lot of people become scientists. They want to contribute to both our current knowledge (research) and the future of knowledge (teaching).

My salary comes from two places - the students who pay tuition at our school and, to a much larger extent, the federal government’s research funding through the NIH. So you are paying my salary. The way that the NIH distributes that funding is through a serious and very competitive process. I submit proposals of my absolute best ideas, so do all the other scientists in the U.S., and they are evaluated by yet another group of scientists who don’t have a vested interest in our grants. This system is the reason that only the best, most rigorously vetted science is funded by taxpayer money.

It is very hard to get a grant. In 2012, between 7% and 16% of new projects were funded. So you have to write a proposal that is better than 84-93% of all other proposals being submitted by other really, really smart and dedicated scientists. The practical result is that it is already very difficult for a good young scientist to get a grant. The NIH recognizes this and implements special measures for new scientists to get grants, but it still isn’t easy by any means.

Sequestration will likely dramatically reduce the fraction of grants that get funded. Already on that website, the “payline”, or cutoff for funding, has dropped from 10% of grants in 2012 to 6% in 2013 for some NIH institutes. If sequestration goes through, it will be worse - maybe a lot worse. The result is that it will go from being really hard to get individual grants to nearly impossible. If that happens, many young scientists like me won’t be able to get grants. No matter how passionate we are about helping people or doing the right thing, many of us will have to stop being researchers and scientists and get other jobs to pay the bills - we have to eat.

So if sequestration or other draconian cuts to the NIH go through, they will hurt me and other junior scientists like me. It will make it harder - if not impossible - for me to get grants. It will affect whether I can afford to educate the future generation of students who will analyze all the data we are creating. It will create dramatic uncertainty/difficulty in the lives of the young biological scientists I work with who may not be able to rely on funding from collaborative grants to the extent that I can. In the end, this will hurt me, it will hurt my other scientific colleagues, and it could dramatically reduce our competitiveness in science, technology, engineering, and mathematics (STEM) for years to come. Steven wrote this up beautifully on his blog.

I know that these cuts will also affect the lives of many other people from all walks of life, not just scientists. So I hope that Congress will do the right thing and decide that hurting all these people isn’t worth the political points they will score - on both sides. Sequestration isn’t the right choice - it is the choice that was most politically expedient when people’s backs were against the wall.

Instead of making dramatic, untested, and possibly disastrous cuts across the board for political reasons, let’s do what scientists and statisticians have been doing for years when deciding which drugs work and which don’t. Let’s run controlled studies and evaluate the impact of budget cuts to different programs - as Ben Goldacre and his colleagues have so beautifully laid out in their proposal. That way we can bring our spending into line, but sensibly and based on evidence, rather than on the politics of the moment or untested economic models not based on careful experimentation.