11 Jun 2013
I’m in the process of trying to get together a couple of sessions to submit to ENAR 2014. I’m pretty psyched about the topics and am looking forward to hosting the conference in Baltimore. It is pretty awesome to have one of the bigger stats conferences on our home turf and we are going to try to be well represented at the conference.
While putting the sessions together I’ve been thinking about my favorite characteristics of sessions at stats conferences. Alyssa has a few suggestions for speakers that I’m completely in agreement with, but I’m talking about whole sessions. Since statistics is often concerned primarily with precision/accuracy, the talks tend to be a little technical and sometimes dry. Even on topics they are really excited about, people try not to exaggerate. I think overall this is a great quality, but I’d [like to see a little more entertainment](http://www.hilarymason.com/speaking/speaking-entertain-dont-teach/) at a conference. I realized that one of my favorite kinds of sessions is the “future of statistics” session.
My only problem is that “future of the field” talks are always given by luminaries who have a lot of experience. This isn’t surprising, since (1) they are famous and their names are a big draw, (2) they have made lots of interesting/unique contributions, and (3) they are established, so they don’t have to worry about being a little imprecise.
But I’d love to see a “future of the field” session with only people who are students/postdocs/first year assistant professors. These are the people who will really be the future of the field and are often more on top of new trends. It would be so cool to see four or five of the most creative young people in the field making bold predictions about where we will go as a discipline. Then you could have one senior person discuss the talks and give some perspective on how realistic the visions would be in light of past experience.
Tell me that wouldn’t be an awesome conference session.
29 May 2013
There has been a lot of discussion among statisticians about big data and what statistics should do to get involved. Recently Steve M. and Larry W. took up the same issue on their blog. I have been thinking about this for a while, since I work in genomics, which almost always comes with “big data”. It is also one area of big data where statistics and statisticians have played a huge role.
A question that naturally arises is, “why have statisticians been so successful in genomics?” I think a major reason is the phrase I borrowed from Brian C. (who may have borrowed it from Ron B.)
problem first, not solution backward
One of the reasons that “big data” is even a term is that data are less expensive to collect than they were a few years ago. One example is the dramatic drop in the price of DNA sequencing, but there are many, many more. The quantified-self movement and Fitbits, Google Books, social network data from Twitter, etc. are all areas where data that cost a huge amount to collect 10 years ago can now be collected and stored very cheaply.
As statisticians we look for generalizable principles; I would say you have to zoom pretty far out to generalize from social networks to genomics, but here are two:
- The data can’t be easily analyzed in an R session on a simple laptop (say, low gigabytes to terabytes)
- The data are generally quirky and messy (unstructured text, json files with lots of missing data, fastq files with quality metrics, etc.)
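That second point, messiness, is easy to underestimate. As a minimal sketch (the records and field names here are hypothetical), consider JSON-lines data where any field may be missing or null; the unglamorous first step is making the data rectangular without crashing downstream code:

```python
import json

# Hypothetical messy JSON-lines records: fields may be missing or null.
raw_lines = [
    '{"user": "a", "followers": 120, "location": "Baltimore"}',
    '{"user": "b", "followers": null}',
    '{"user": "c"}',
]

records = []
for line in raw_lines:
    d = json.loads(line)
    # Impute or flag missing values explicitly rather than letting
    # None values propagate into the analysis.
    records.append({
        "user": d.get("user", "unknown"),
        "followers": d.get("followers") if d.get("followers") is not None else 0,
        "location": d.get("location", "unknown"),
    })

print(records)
```

How to impute (zero, a sentinel, a model-based guess) is itself a statistical decision, which is part of the point: the cleaning step is not separate from the analysis.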
So how does one end up at the “leading edge” of big data? By being willing to deal with the schlep and work out the nitty-gritty of how to apply even standard methods to data sets where taking the mean takes hours. Or by taking the time to learn all the kinks specific to, say, processing a microarray, and then taking the time to fix them. This is why statisticians were so successful in genomics: they focused on the practical problems, and this gave them access to data no one else had or could use properly.
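A minimal sketch of that schlep (the input format, one numeric value per line, is hypothetical): even something as simple as a mean has to be computed by streaming through the data when it won’t fit in memory, while tolerating the blank and malformed rows real files contain.

```python
# Streaming (single-pass) mean for data too big to load at once.
def streaming_mean(lines):
    total, n = 0.0, 0
    for line in lines:
        line = line.strip()
        if not line:          # messy data: skip blank lines
            continue
        try:
            total += float(line)
            n += 1
        except ValueError:    # messy data: skip unparseable rows
            continue
    return total / n if n else float("nan")

# In practice you'd pass an open file handle; a small list works the same way.
print(streaming_mean(["1.0", "2.0", "", "oops", "3.0"]))  # prints 2.0
```

Nothing here is statistically deep, and that is the point: the method is standard, but applying it at scale requires engineering decisions (what to do with bad rows, how to make one pass suffice) that someone has to own.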
Doing these things requires a lot of effort that isn’t elegant. It also isn’t “statistics” by the definition that only mathematical methodology is statistics. Steve alludes to this in his post when he says:
Frankly I am a little disappointed that there does not seem to be any really compelling new idea (e.g. as in neural nets or the kernel embedding idea that drove machine learning).
I think this is a view shared by many statisticians: since there isn’t an elegant new theory yet, there are no “new ideas” in big data. That focus is solution backward. We want an elegant theory that we can then apply to specific problems if they happen to come up.
The alternative is problem forward. The fact that we can collect data so cheaply means we can measure and study things we never could before. Computer scientists, physicists, genome biologists, and others are leading in big data precisely because they aren’t thinking about the statistical solution. They are thinking about solving an important scientific problem and are willing to deal with all the dirty details to get there. This allows them to work on data sets and problems that haven’t been considered by other people.
In genomics, this has happened before. In that case, the invention of microarrays revolutionized the field, and statisticians jumped on board, working closely with scientists, handling the dirty details, and building software so others could too. As a discipline, if we want to be part of the “big data” revolution, I think we need to focus on the scientific problems and let methodology come second. That requires rethinking what counts as statistics: things like parallel computing, data munging, reproducibility, and software development have to be accepted as equally important as methods development.
The good news is that there is plenty of room for statisticians to bring our unique skills in dealing with uncertainty to these new problems, but we will only get a seat at the table if we are willing to deal with the mess that comes with doing real science.
I’ll close by listing a few things I’d love to see:
- A Bioconductor-like project for social network data. Tyler M. and Ali S. have a paper that would make for an awesome package for this project.
- Statistical pre-processing for fMRI and other brain imaging data. Keep an eye on our SMART group for that.
- Data visualization for translational applications, dealing with all the niceties of human-data interfaces. See healthvis or the stuff Miriah Meyer is doing.
- Most importantly, starting with specific, unsolved scientific problems: seeking novel ways to collect cheap data and analyzing it, even with known and straightforward statistical methods, to deepen our understanding of ourselves or the universe.
17 May 2013
Here’s a little thought experiment for your weekend pleasure. Consider the following:
Joe Scientist decides to conduct a study (call it Study A) to test the hypothesis that a parameter D > 0 vs. the null hypothesis that D = 0. He designs a study, collects some data, conducts an appropriate statistical analysis and concludes that D > 0. This result is published in the Journal of Awesome Results along with all the details of how the study was done.
Jane Scientist finds Joe’s study very interesting and tries to replicate his findings. She conducts a study (call it Study B) that is similar to Study A but completely independent of it (and does not communicate with Joe). In her analysis she does not find strong evidence that D > 0 and concludes that she cannot rule out the possibility that D = 0. She publishes her findings in the Journal of Null Results along with all the details.
From these two studies, which of the following conclusions can we make?
- Study A is obviously a fraud. If the truth were that D > 0, then Jane should have concluded that D > 0 in her independent replication.
- Study B is obviously a fraud. If Study A were conducted properly, then Jane should have reached the same conclusion.
- Neither Study A nor Study B was a fraud, but the result for Study A was a Type I error, i.e., a false positive.
- Neither Study A nor Study B was a fraud, but the result for Study B was a Type II error, i.e., a false negative.
I realize that there are a number of subtle details concerning why things might happen, but I’ve purposely left them out. My question is: based on the information that you actually have about the two studies, what would you consider to be the most likely case? What further information would you like to know beyond what was given here?
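One way to build intuition before answering: a quick simulation. All the numbers below (true effect size, sample size, significance level) are made-up assumptions, but they illustrate that two honest, properly conducted studies of a real effect disagree surprisingly often when power is merely good, without any fraud or error in conduct.

```python
import math
import random

random.seed(1)  # reproducible simulation

def one_study(true_d, n, z_crit=1.645):
    """Simulate a one-sided z-test of H0: D = 0 vs H1: D > 0,
    with true effect true_d in units of the per-observation SD."""
    xs = [random.gauss(true_d, 1.0) for _ in range(n)]
    z = (sum(xs) / n) * math.sqrt(n)
    return z > z_crit  # True means "conclude D > 0"

# Hypothetical scenario: true_d = 0.25, n = 100 gives roughly 80% power.
trials = 10_000
discordant = sum(one_study(0.25, 100) != one_study(0.25, 100)
                 for _ in range(trials))
print(discordant / trials)  # about 2 * power * (1 - power), ~0.3 here
```

Under these assumptions, Joe and Jane reach opposite conclusions roughly a third of the time even though D really is positive and both studies were run perfectly, which suggests the "fraud" options deserve considerable skepticism.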