Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Follow up on "Statistics and the Science Club"

I agree with Roger’s latest post: “we need to expand the tent of statistics and include people who are using their statistical training to lead the new science”. I am perhaps a bit more worried than Roger. Specifically, I worry that talented go-getters interested in leading science via data analysis will achieve this without engaging our research community. 

A quantitatively trained person (an engineer, computer scientist, physicist, etc.) with strong computing skills (knows Python, C, and shell scripting) who reads, for example, “Elements of Statistical Learning” and learns R is well on their way. Eventually, many of these users of statistics will become developers, and if we don’t keep up, then what do they need from us? Our already-written books may be enough. In fact, in genomics, I know several people like this who are already developing novel statistical methods. I want these researchers to be part of our academic departments. Otherwise, I fear we will not be in touch with the problems and data that lead to, quoting Roger, “the most exciting developments of our lifetime.”

The problem with small big data

There’s lots of talk about “big data” these days and I think that’s great. I think it’s bringing statistics out into the mainstream (even if they don’t call it statistics) and it’s creating lots of opportunities for people with statistics training. It’s one of the reasons we created this blog.

One thing that I think gets missed in much of the mainstream reporting is that, in my opinion, the biggest problems aren’t with the truly massive datasets out there that need to be mined for important information. Sure, those types of problems pose interesting challenges with respect to hardware infrastructure and algorithm design.

I think a bigger problem is what I call “small big data”. Small big data is the dataset that is collected by an individual whose data collection skills are far superior to his/her data analysis skills. You can think of the size of the problem as being measured by the ratio of the dataset size to the investigator’s statistical skill level. For someone with no statistical skills, any dataset represents “big data”.

These days, any individual can create a massive dataset with relatively few resources. In some of the work I do, we send people out with portable air pollution monitors that record pollution levels every 5 minutes over a 1-week period. People with fitbits can get highly time-resolved data about their daily movements. A single MRI can produce millions of voxels of data.

One challenge here is that these examples all represent datasets that are large “on paper”. That is, there are a lot of bits to store, but that doesn’t mean there’s a lot of useful information there. For example, I find people are often impressed by data that are collected with very high temporal or spatial resolution. But often, you don’t need that level of detail and can get away with coarser resolution over a wider range of scenarios. For example, if you’re interested in changes in air pollution exposure across seasons but you only measure people in the summer, then it doesn’t matter if you measure levels down to the microsecond and produce terabytes of data. Another example might be the idea that sequencing technology doesn’t in fact remove biological variability, no matter how large a dataset it produces.

Another challenge is that the person who collected the data is often not qualified/prepared to analyze it. If the data collector didn’t arrange beforehand to have someone analyze the data, then they’re often stuck. Furthermore, usually the grant that paid for the data collection didn’t budget (enough) for the analysis of the data. The result is that there’s a lot of “small big data” that just sits around unanalyzed. This is an unfortunate circumstance, but in my experience quite common.

One conclusion we can draw is that we need to get more statisticians out into the field, both helping to analyze the data and, perhaps more importantly, designing good studies so that useful data are collected in the first place (as opposed to merely “big” data). But the sad truth is that there aren’t enough of us on the planet to fill the demand. So we need to come up with more creative ways to get the skills out there without requiring our physical presence.

Hilary Mason: From Tiny Links, Big Insights

The Evolution of Music

A specific suggestion to help recruit/retain women faculty at Hopkins

A recent article by a former Obama administration official has stirred up debate over the obstacles women face in balancing work/life. This reminded me of this report written by a committee here at Hopkins to help resolve the current gender-based career obstacles for women faculty. The report is great, but in practice we have a long way to go. For example, my department has not hired a woman at the tenure-track level in 15 years. This drought has not been for lack of trying, as we have made several offers, but none have been accepted. One issue that has come up multiple times is “spousal hires”. Anecdotal evidence strongly suggests that in academia the “two body” problem is more common with women than men. As hard as my department has tried to find jobs for spouses, efforts are ad hoc and we get close to no institutional support. As far as I know, as an institution, Hopkins allocates no resources to spousal hires. So, a tangible improvement we could make is changing this. Another specific improvement that many agree will help women is subsidized day care. The waiting list here is very long (as a result, few of my colleagues use it) and one still has to pay more than $1,600 a month for infants.

These two suggestions are of course easier said than done as they both require $. Quite a bit, actually, and Hopkins is not rich compared to other well-known universities. My suggestion is to get rid of the college tuition remission benefit for faculty. Hopkins covers half the college tuition for the children of all its employees. This perk helps male faculty in their 50s much more than it helps potential female recruits. So I say get rid of this benefit and use the $ for spousal hires and to further subsidize childcare.

It might be argued that the tuition remission perk helps retain faculty, but the institution can invest in that retention on a case-by-case basis as opposed to giving the subsidy to everybody independent of merit. I suspect spousal hires and subsidized day care will be more attractive at the time of recruitment.

Although this post is Hopkins-specific, I am sure a similar reallocation of funds is possible at other universities.