Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

A general audience friendly explanation for why Lars Peter Hansen won the Nobel Prize

Lars Peter Hansen won the Nobel Prize in economics for creating the generalized method of moments (GMM). A rather technical explanation of the idea appears on Wikipedia, and these lecture notes on GMM are good if you like math. I went over to Marginal Revolution to see what was being written about the Nobel Prize winners. Clearly a bunch of other people were doing the same thing, as the site was pretty slow to load. In describing Hansen’s work, Tyler C. says:

For years now journalists have asked me if Hansen might win, and if so, how they might explain his work to the general reading public.  Good luck with that one.

Alex T. does a good job of explaining the idea, but it still seems a bit technical for my tastes. Guan Y. gives another good, slightly less technical explanation here, but it is still a little rough if you aren’t an economist. So I took a shot at an even more “general audience friendly” version below.

A very common practice in economics (and most other scientific disciplines) is to collect experimental data on two (or more) variables and to try to figure out if the variables are related to each other. A huge amount of statistical research is dedicated to this relatively simple-sounding problem. Lars Hansen won the Nobel Prize for his research on this problem because:

  1. Economists (and scientists) hate assumptions they can’t justify with data and want to use as few of them as possible. The recent Rogoff and Reinhart controversy illustrates this idea. They wrote a paper that suggested public debt was bad for growth. But when they estimated the relationship between the variables, they made assumptions (chose weights) that have been widely questioned, suggesting that public debt might not be so bad after all. Unfortunately, that happened only after a bunch of politicians had used the result to justify austerity measures with a huge impact on the global economy.
  2. Economists (and mathematicians) love to figure out the “one true idea” that encompasses many others. When you show something about the really general solution, you get all the particular cases for free. This means that all the work you do to show a statistical procedure is good applies not just to the general case, but to every specific case that is an example of the general thing you are talking about.

I’m going to use a really silly example to illustrate the idea. Suppose that you collect information on the weight of animals’ bodies and the weight of their brains. You want to find out how body weight and brain weight are related to each other. You collect the data, and it might look something like this:

[Scatterplot: brain weight versus body weight for the animals in the sample]

So it looks like if you have a bigger body you have a bigger brain (except for poor old Triceratops who is big but has a small brain). Now you want to say something quantitative about this. For example:

Animals that are 1 kilogram larger have a brain that is on average k kilograms larger.

How do you figure that out? Well one problem is that you don’t have infinite money so you only collected information on a few animals. But you don’t want to say something just about the animals you measured - you want to change the course of science forever and say something about the relationship between the two variables for all animals.

The best way to do this is to make some assumptions about what the measurements of brain and body weight would look like if you could collect all of them. It turns out that if you assume you know the complete shape of the distribution (for example, that it follows a bell curve), it becomes pretty straightforward (with a little math) to estimate the relationship between brain and body weight using something called maximum likelihood estimation. This is probably the most common way that economists or scientists relate one variable to another (the inventor of this approach is still waiting for his Nobel).
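To make that concrete, here is a minimal sketch (my own, not from the original post) of maximum likelihood estimation for this toy problem. It assumes a hypothetical model in which body weight equals k times brain weight plus bell-curve (normal) noise, and the numbers are made up for illustration; under that normality assumption, maximizing the likelihood is the same as minimizing squared error, so the estimate of k has a simple closed form.

```python
import numpy as np

# Hypothetical brain and body weights (kg); illustrative numbers only,
# not the data behind the scatterplot in the post.
brain = np.array([0.4, 1.3, 5.7, 0.07, 0.5])
body = np.array([36.0, 62.0, 521.0, 3.3, 27.7])

# Assumed model: body_i = k * brain_i + normal noise.
# Maximizing the normal likelihood over k is equivalent to minimizing
# the sum of squared errors, which has a closed-form solution:
k_mle = np.sum(brain * body) / np.sum(brain ** 2)
print(k_mle)  # estimated kilos of body weight per kilo of brain weight
```

The point is not the particular formula but that the whole calculation leans on having assumed the normal shape.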

The problem is that you assumed a lot to get your answer. For example, here are just the brain weights we collected. It is pretty hard to guess exactly what shape the data from the whole world would take.

[Plot of the brain weights collected in the sample]

This presents the next problem: how do we know that we have the “right one”?

We don’t.

One way to get around this problem is to use a very old idea called the  method of moments. Suppose we believe the equation:

Average in World Body Weight = k * Average in World Brain Weight

In other words, if the average brain weight across all animals in the world is 5 kilos, then the average body weight will be (k * 5) kilos. The relationship only holds “on average” because there are a bunch of variables we didn’t measure, and they may affect the relationship between brain and body weight. You can see this in the scatterplot because the points don’t all lie on a single line.

One way to estimate k is to just replace the numbers you wish you knew (the world averages) with the numbers you actually have (the averages in your data):

Average in Data you Have Body Weight = k * Average in Data you Have Brain Weight

Since you have the data, the only thing you don’t know in the equation is k, so you can solve the equation and get an estimate. The nice thing here is that we don’t have to say much about the shape of the data we expect for body weight or brain weight. We just have to believe this one equation. The key insight is that you don’t have to know the whole shape of the data, just one part of it (the average). An important point to remember is that you are still making some assumptions (that the average is a good thing to estimate, for example), but they are definitely fewer assumptions than you make if you go all the way and specify the whole shape, or distribution, of the data.
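As a sketch of that step (reusing the made-up numbers from the earlier code), the method-of-moments estimate is literally the average body weight in your data divided by the average brain weight in your data:

```python
import numpy as np

# Same hypothetical brain and body weights (kg) as in the earlier sketch.
brain = np.array([0.4, 1.3, 5.7, 0.07, 0.5])
body = np.array([36.0, 62.0, 521.0, 3.3, 27.7])

# Moment condition: average body weight = k * average brain weight.
# Plug in the averages from the data you have and solve for k.
k_mom = body.mean() / brain.mean()
print(k_mom)  # the method-of-moments estimate of k
```

In general this will not agree exactly with the maximum likelihood estimate above, because the two estimates lean on different assumptions.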

This is a pretty oversimplified version of the problem that Hansen solved. In reality, when you make assumptions about the way the world works, you often end up with more equations like the one above than variables you want to estimate. Solving all of those equations is complicated because the answers from different equations might contradict each other (the technical term is that the system is overdetermined).

Hansen showed that in this case you can take the equations and multiply them by a set of weights, putting more weight on the equations you are more sure about, and then add them up. If you choose the weights well, you avoid the problem of having too many equations for too few variables. This is the thing he won the prize for: the generalized method of moments.
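Here is a rough sketch of what that recipe looks like for the toy problem, simplified well beyond Hansen’s actual setting. It assumes (hypothetically) that we believe two moment conditions about the single unknown k: the average error should be zero, and the error should be unrelated to a made-up extra variable z. GMM chooses the k that makes a weighted combination of the two conditions as close to zero as possible.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: brain and body weights (kg), plus a made-up extra
# variable z that we believe is unrelated to the noise in the relationship.
brain = np.array([0.4, 1.3, 5.7, 0.07, 0.5])
body = np.array([36.0, 62.0, 521.0, 3.3, 27.7])
z = np.array([1.0, 2.0, 6.0, 0.2, 1.1])

def moments(k):
    # Two moment conditions, each of which should be roughly zero at the true k:
    #   1) the average error is zero
    #   2) the errors are uncorrelated with z
    err = body - k * brain
    return np.array([err.mean(), (err * z).mean()])

# Weight matrix: put more weight on the condition you trust more.
# The identity matrix (equal weights) is the standard first step.
W = np.eye(2)

def objective(k):
    g = moments(k)
    return g @ W @ g  # weighted sum of squared moment conditions

k_gmm = minimize_scalar(objective, bounds=(0.0, 500.0), method="bounded").x
print(k_gmm)
```

With one unknown and two conditions there is usually no k that satisfies both exactly; that is the overdetermined situation described above, and the weight matrix W is where you encode which condition you trust more.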

This is all a big deal because the variables that economists measure frequently aren’t very pretty. One common way they aren’t pretty is that they are often measured over time, with complex relationships between values at different time points. That means it is hard to come up with realistic assumptions about what the data may look like.

By proposing an approach that doesn’t require as many assumptions, Hansen satisfied criterion (1) on the list of things economists like. And, if you squint just right at the equations he proposed, you can see that they are actually a general form of a bunch of other estimation techniques, like maximum likelihood estimation and instrumental variables, which made it easier to prove theoretical results and satisfied criterion (2).

---

Disclaimer: This post was written for a general audience and may cause nerd-rage in those who see (important) details I may have skimmed over. 

Disclaimer #2: I’m not an economist, so I can’t talk about economics. There are reasons GMM is useful economically that I didn’t even talk about here.

Sunday data/statistics link roundup (10/13/13)

  1. A really interesting comparison between educational and TV menus (via Rafa). On a related note, it will be interesting to see how/whether the traditional educational system will be disrupted. I’m as into the MOOC thing as the next guy, but I’m not sure I buy a series of pictures from your computer as “validation” you took/know the material for a course. Also I’m not 100% sure about what this is, but it has the potential to be kind of awesome - the Moocdemic.
  2. This piece of “investigative journalism” had the open-access internet up in arms. The piece shows pretty clearly that there are bottom-feeding journals who will use unscrupulous tactics and claim peer review while doing no such thing. But it says basically nothing about open access as far as I can tell. On a related note, a couple of years ago we developed an economic model for peer review, then tested the model out. In a very contrived/controlled system we showed peer review improves accuracy, even when people aren’t incentivized to review.
  3. Related to our guest post on NIH study sections is this pretty depressing piece in Nature.
  4. One of JHU Biostat’s NSF graduate research fellows was interviewed by Amstat News.
  5. Jenny B. has some great EDA lectures you should check out.

Why do we still teach a semester of trigonometry? How about engineering instead?

Arthur Benjamin says we should teach statistics before calculus. He points out that most of what we do in high school math is preparing us for calculus. He makes the point that while physicists, engineers and economists need calculus, in the digital age, discrete math, probability and statistics are much more relevant to everyone else. I agree with him and was happy to see Statistics as part of the common core. However, other topics I wish were included, such as engineering, programming, and finance, are missing.

This Saturday I took my 5th grader to a 3-hour robotics workshop. We both enjoyed it thoroughly. We built and programmed two-wheeled robots to, among other things, go around a table. To make this happen we learned about measurement error, how to use a protractor, that C = πd, a bit of algebra, how to use grid searches, if-else conditionals, and for-loops. Meanwhile, during a semester of high school trigonometry we learn things like the identity 2 sin^2 x = 1 - cos 2x (do you remember that one?). Of course it is important to know trigonometry, but do we really need to learn to derive and memorize identities that are rarely used and are readily available from a smartphone? One could easily teach the fundamentals as part of an applied class such as robotics. We can ask questions like: if you make a mistake of 0.5 degrees while turning, by how much will your robot miss its mark after traveling one meter? We could probably teach the fundamentals of trigonometry in about two weeks and then use those concepts in applied problems like this.
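For what it’s worth, here is a quick back-of-the-envelope answer to that question (my own sketch, assuming the robot drives in a straight line after the turn): the miss is roughly the distance traveled times the angular error in radians.

```python
import math

# How far off target is the robot if the turn is off by 0.5 degrees
# and it then drives 1 meter in a straight line?
error_deg = 0.5
distance_m = 1.0

error_rad = math.radians(error_deg)

# Small-angle approximation: miss ~= distance * angle (in radians).
miss_approx = distance_m * error_rad

# Exact straight-line gap between the intended and actual endpoints
# (the chord of a circle of radius distance_m subtending error_rad).
miss_exact = 2 * distance_m * math.sin(error_rad / 2)

print(f"approximate miss: {miss_approx * 1000:.1f} mm")  # about 8.7 mm
print(f"exact miss:       {miss_exact * 1000:.1f} mm")   # about 8.7 mm
```

So a half-degree error over one meter puts the robot a bit under a centimeter off its mark.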

Cancelled NIH study sections: a subtle, yet disastrous, effect of the government shutdown

Editor’s note: This post is contributed by Debashis Ghosh. Debashis is the chair of the Biostatistical Methods and Research Design (BMRD) study section at the National Institutes of Health (NIH). BMRD’s focus is statistical methodology.

I write today to discuss an effect of the government shutdown that will likely have disastrous long-term consequences for biomedical and scientific research: the suspension of NIH study sections. A list of the standing study sections can be found at http://public.csr.nih.gov/StudySections/Standing/Pages/default.aspx. These are panels of distinguished scientists in their fields that meet three times a year to review grant submissions to the NIH by investigators. For most professors and scientists who work in academia, these grants provide the means of conducting research and the funding for staff such as research associates, postdocs and graduate students. At most universities and medical schools in the U.S., having an independent research grant is necessary for assistant professors to be promoted and get tenure (of course, there is some variation in this across universities).

Yesterday, I was notified by NIH that the BMRD October meeting was cancelled and postponed until further notice.  I could not communicate with NIH staff about this because they are on furlough, meaning that they are not able to send or receive email or other communications.   This means that our study section will not be reviewing grants in October.  People who receive funding from NIH grants are familiar with the usual routine of submitting grants three times a year and getting reviewed approximately 6 months after submission.  This process has now stopped because of the government shutdown, and it is unclear when it will restart.  The session I chair is but one of 160 regular study sections and many of them would be meeting in October.  In fact, I was involved with a grant submitted to another study section that would have met on October 8, but this meeting did not happen.

The stoppage has many detrimental consequences. Because BMRD will not be reviewing the submitted grants at the scheduled time, they will lack a proper scientific evaluation. The NIH review process separates the scientific evaluation of grants from the actual awarding of funding. While there have been many criticisms of the process, it has also been acknowledged that the U.S. scientific research community has been the leader in the world, and NIH grant review has played a role in that status. With the suspension of activities, the status the U.S. currently enjoys is in peril. It is interesting to note that many countries are now attempting to adopt a review process similar to the one at NIH (R. Nakamura, personal communication).

The effects of the shutdown are perilous for the investigators who are submitting grants. Without the review, their grants cannot be evaluated and funded. This lag in the funding timeline stalls research, and in scientific research a stall now becomes even more disastrous in the long term. The type of delay described here will mean layoffs for lab technicians and research associates who are funded by grants needing renewal, as well as a hiring freeze for new lab personnel on newly funded grants. This delay and loss of labor will diminish the existing scientific knowledge base in the U.S., which leads to a loss of the competitive advantage we have enjoyed as a nation for decades in science.

Economically, the delay has a huge impact as well. Suppose there is a delay of three months in funding decisions. In the case of NIH grants, this is tens of millions of dollars that are not being given out for scientific research for a period of three months. The rate of return on these grants has been estimated at 25 to 40 percent a year (http://www.faseb.org/portals/0/pdfs/opa/2008/nih_research_benefits.pdf), and the findings from these grants have the potential to benefit thousands of patients a year by increasing their survival or improving the quality of their lives. In the starkest possible terms, more medical patients will die and suffer because the government shutdown is forcing the research that provides new methods of diagnosis and treatment to grind to a halt.

Note: The opinions expressed here represent my own and not those of my employer, Penn State University, nor those of the National Institutes of Health nor the Center for Scientific Review.

The Care and Feeding of Your Scientist Collaborator

Editor’s Note: This post, written by Roger Peng, is part of a two-part series on Scientist-Statistician interactions. The first post was written by Elizabeth C. Matsui, an Associate Professor in the Division of Allergy and Immunology at the Johns Hopkins School of Medicine.

This post is a followup to Elizabeth Matsui’s previous post for scientists/clinicians on collaborating with biostatisticians. Elizabeth and I have been working together for over half a decade, and I think the story of how we started working together is perhaps a brief lesson on collaboration in and of itself. Basically, she emailed someone who didn’t have time, so that person emailed someone else who didn’t have time, so that person emailed someone else who didn’t have time, so that person emailed me, who as a mere assistant professor had plenty of time! A few people I’ve talked to are irked by this process because it feels like you’re someone’s fourth choice. But personally, I don’t care. I’d say almost all my good collaborations have come about this way. To me, it either works or it doesn’t work, regardless of where on the list you were when you were contacted.

I’ve written before about how to find good collaborators (although I neglected to mention the process described above), but this post tries to answer the question, “Now that I’ve found this good collaborator, what do I do with her/him?” Here are some thoughts I’ve accumulated over the years.

  • Understand that a scientist is not a fountain from which “the numbers” flow. Most statisticians like to work with data, and some even need it to demonstrate the usefulness of their methods or theory. So there’s a temptation to go “find a scientist” to “give you some data”. This is starting off on the wrong foot. If you picture your collaborator as a person who hands over the data and then you never talk to that person again (because who needs a clinician for a JASA paper?), then things will probably not end up so great. And I think there are two ways in which the experience will be sub-optimal. First, your scientist collaborator may feel miffed that you basically went off and did your own thing, making her/him less inclined to work with you in the future. Second, the product you end up with (paper, software, etc.) might not have the same impact on science as it would have had if you’d worked together more closely. This is the bigger problem: see #5 below.

  • All good collaborations involve some teaching: Be patient, not patronizing. Statisticians are often annoyed that “So-and-so didn’t even know this” or “they tried to do this with a sample size of 3!” True, there are egregious cases of scientists with a lack of basic statistical knowledge, but in my experience, all good collaborations involve some teaching. Otherwise, why would you collaborate with someone who knows exactly the same things that you know? Just like it’s important to take some time to learn the discipline that you’re applying statistical methods to, it’s important to take some time to describe to your collaborator how those statistical methods you’re using really work. Where does the information in the data come from? What aspects are important; what aspects are not important? What do parameter estimates mean in the context of this problem? If you find you can’t actually explain these concepts, or become very impatient when they don’t understand, that may be an indication that there’s a problem with the method itself that may need rethinking. Or maybe you just need a simpler method.

  • Go to where they are. This bit of advice I got from Scott Zeger when I was just starting out at Johns Hopkins. His bottom line was that if you understand where the data come from (as in literally, the data come from this organ in this person’s body), then you might not be so flippant about asking for an extra 100 subjects to have a sufficient sample size. In biomedical science, the data usually come from people. Real people. And the job of collecting that data, the scientist’s job, is usually not easy. So if you have a chance, go see how the data are collected and what needs to be done. Even just going to their office or lab for a meeting rather than having them come to you can be helpful in understanding the environment in which they work. I know it can feel nice (and convenient) to have everyone coming to you, but that’s crap. Take the time and go to where they are.

  • Their business is your business, so pitch in. A lot of research (and actually most jobs) involves doing things that are not specifically relevant to your primary goal (a paper in a good journal). But sometimes you do those things to achieve broader goals, like building better relationships and networks of contacts. This may involve, say, doing a sample size calculation once in a while for a new grant that’s going in. That may not be pertinent to your current project, but it’s not that hard to do, and it’ll help your collaborator a lot. You’re part of a team here, so everyone has to pitch in. In a restaurant kitchen, even the Chef works the line once in a while. Another way to think of this is as an investment. Particularly in the early stages there’s going to be a lot of ambiguity about what should be done and what is the best way to proceed. Sometimes the ideal solution won’t show itself until much later (the so-called “j-shaped curve” of investment). In the meantime, pitch in and keep things going.

  • Your job is to advance the science. In a good collaboration, everyone should be focused on the same goal. In my area, that goal is improving public health. If I have to prove a theorem or develop a new method to do that, then I will (or at least try). But if I’m collaborating with a biomedical scientist, there has to be an alignment of long-term goals. Otherwise, if the goals are scattered, the science tends to be scattered, and ultimately sub-optimal with respect to impact. I actually think that if you think of your job in this way (to advance the science), then you end up with better collaborations. Why? Because you start looking for people who are similarly advancing the science and having an impact, rather than looking for people who have “good data”, whatever that means, for applying your methods.

In the end, I think statisticians need to focus on two things: Go out and find the best people to work with and then help them advance the science.