24 Aug 2012
Simply Statistics Podcast #1.
To mark the occasion of our 1-year anniversary of starting the blog, Jeff, Rafa, and I have recorded our first podcast. You can tell that it’s our very first podcast because we don’t appear to have any idea what we’re doing. However, we decided to throw caution to the wind.
In this episode we talk about why we started the blog and discuss our thoughts on statistics and big data. Be sure to watch to the end as Rafa provides a special treat.
UPDATE: For those of you who can’t bear the sight of us, there is an audio only version.
UPDATE 2: I have set up an RSS feed for the audio-only version of the podcast.
UPDATE 3: Here is the RSS feed for the HD video version of the podcast.
23 Aug 2012
I’ve fallen behind and so haven’t had a chance to mention this, but Science Exchange has started its Reproducibility Initiative. The idea is that authors can submit their study to be reproduced and Science Exchange will match the study with a validator who will attempt to reproduce the results (for a fee).
Validated studies will receive a Certificate of Reproducibility acknowledging that their results have been independently reproduced as part of the Reproducibility Initiative. Researchers have the opportunity to publish the replicated results as an independent publication in the PLOS Reproducibility Collection, and can share their data via the figshare Reproducibility Collection repository. The original study will also be acknowledged as independently reproduced if published in a supporting journal.
This is a very interesting initiative and it’s one I and a number of others have been talking about doing. They have an excellent advisory board and seem to have all the right partners/infrastructure lined up.
The obvious question to me is: if you’re going to submit your study to this service and get it reproduced, why would you ever want to submit it to a journal? The level of review you’d get here is quite a bit more rigorous than you’d receive at a journal and the submission process essentially involves writing a paper without the Introduction and the Discussion (usually the hardest and most annoying parts). At the moment, it seems the service is set up to work in parallel with standard publication or perhaps after the fact. But I could see it eventually replacing standard publication altogether.
The timing, of course, could be an issue. It’s not clear how long one should expect it to take to reproduce a study. But it’s probably not much longer than a review you’d get at a statistics journal.
22 Aug 2012
Y Combinator, the tech startup incubator, had its 15th demo day. Here are some of the data/statistics-related highlights (thanks to TechCrunch for doing the hard work):
-
EVERYDAY.ME: A PRIVATE, ONLINE RECORD OF YOUR LIFE.
This company seems to me like a meta-data company. It compiles your data from other sites.
-
MTH SENSE: IMPROVING MOBILE AD TARGETING
“Most [mobile] ads served are blind. Mth sense’s solution adds demographic data to ads through predictive modeling based on app and device usage. For example, if you have the Pinterest, and Vogue apps, you’re more likely to be a soccer mom.” Hmm, I guess I’d better delete those apps from my phone….
-
SURVATA: REPLACING PAYWALLS WITH SURVEYWALLS
Survata’s product replaces paywalls on premium content from online publishers with surveys that conduct market research.
-
RENT.IO — RENT PRICE PREDICTION
Rent.io says it wants to “optimize pricing of the single biggest recurring expense in lives of 100 million Americans.”
-
BIGCALC: FAST NUMBER-CRUNCHING FOR MAKING FINANCIAL TRADING DECISIONS
BigCalc says its platform for financial modeling scales to enormous datasets, and purportedly does simulations that typically take 22 hours in 24 minutes.
-
DATANITRO — A BACKBONE FOR FINANCE-RELATED DATA
DataNitro’s founders have both worked in finance, and they say they know from experience that financial industry software is basically “held together with duct tape.” A big problem with the status quo is how data is exported from Excel.
-
STATWING: EASY TO USE DATA ANALYSIS
Most existing data analysis tools (in particular SPSS) are built for statisticians. Statwing has created tools that make it easier for marketers and analysts to interact with data without dealing with arcane technical terminology. Those users only need a few core functions, Statwing says, so that’s what the company provides. With just a few clicks, users can get the graphs that they want. And the data is summarized in a single sentence of conversational English.
21 Aug 2012
There’s a controversy brewing at the National Science Foundation over names. Back in October 2011, Sastry Pantula, the Director of the Division of Mathematical Sciences at NSF (formerly the Chair of the NC State Statistics Department and President of the ASA), proposed that the name of the Division be changed to the “Division of Mathematical and Statistical Sciences”. In his original proposal, Pantula says:
Extracting useful knowledge from the deluge of data is critical to the scientific successes of the future. Data-intensive research will drive many of the major scientific breakthroughs in the coming decades. There is a long-term need for research and workforce development in computational and data-enabled sciences. Statistics is broadly recognized as a data-centric discipline, thus having it in the Division’s name as proposed would be advantageous whenever “Big Data” and data-sciences investments are discussed internally and externally.
This bureaucratic move by Pantula created quite a reaction. A sub-committee of the Math and Physical Sciences Advisory Committee (MPSAC) was formed to investigate the name change and to solicit feedback from the relevant communities. The sub-committee was chaired by Fred Roberts (Rutgers) and also included James Berger (Duke), Emery Brown (MIT), Kevin Corlette (U. of Chicago), Irene Fonseca (CMU), and Juan Meza (UC Merced). A number of organizations provided feedback to the sub-committee, including the American Statistical Association and the American Mathematical Society.
There was intense feedback both for and against the name change. Somewhat predictably, mathematicians were adamantly opposed to the name change and statisticians were for it. The final report of the sub-committee is both interesting and enlightening for those not familiar with the arguments involved.
First a little background for people (like me) who are not familiar with NSF’s organizational structure. NSF has a number of Directorates, of which Mathematical and Physical Sciences (MPS) is one, and within MPS is the Division of Mathematical Sciences (DMS). DMS includes 11 program areas ranging from algebra and number theory to topology. Statistics is one of those program areas.
This should already give one pause. How exactly do statistics and topology end up in the same basket? I’m not exactly sure but I’m guessing it’s the result of bureaucratic inertia. Statistics came later and it had to be stuck somewhere. DMS is not the only place at NSF to get funding for statistics, but a quick search through the currently active grants shows that the vast majority of statistics-related grants go through DMS, with a smattering coming through other Divisions.
The primary issue here, and the only reason it’s an issue at all, is money. Statistics is one of 11 program areas in DMS, which, if funding were split evenly across areas, means it gets roughly 9% of the funding allocated to DMS. This is worth noting: the entire field of statistics gets roughly as much funding as, say, topology. For example, one of the arguments against the name change in the sub-committee’s report is
3). Statistics constitutes a small (although significant) proportion of the DMS portfolio in terms of number of programs, number of grant applications, number of grants funded.
Well, yes, but I would argue that the reason for this is the historically low prioritization of statistics in the Division. This is a choice, not a fact. I believe statistics could play a much bigger role in the Division and perhaps within NSF more generally if there were an agreement on its importance. A key argument comes next, which is
If the name change attracts more proposals to the Division from the statistics community, this could draw funding away from other subfields and it could also increase the workload of the Division’s program officers.
Okay, so money’s important too, but let’s get to the main attraction, which comes in comment number 5:
5). Statistics is funded throughout the federal government. The traditional funding of statistics by DMS is appropriate: fund fundamental research in statistics. Broadening the mission of DMS to include more applied statistics would not benefit the overall funding of the mathematical sciences.
The first sentence is a fact: Many government agencies fund statistics research. For example, the National Institutes of Health funds many statisticians who develop and apply methods to problems in the health sciences. The EPA will occasionally fund statisticians to develop methods for environmental health applications.
But who is charged with funding the development and application of statistical methods to every other scientific field? The problem now is that you essentially have a group of NIH-funded (bio)statisticians doing biomedical research and a group of NSF-funded statisticians doing “fundamental” research in statistics (note that “fundamental” equals “mathematical” here). But that hardly represents all of the statisticians out there. So for the rest of the statisticians who are not doing biomedical research and are not doing “fundamental” research, where do they go for funding?
These days, statistics is “applied” to everything. NSF itself has acknowledged that we are in an era of big data—it’s clear that statistics will play a big role whether we call it “statistics” or not. If NSF decided to fund research into the application of statistics to all areas, it would likely overwhelm the funding of every other program area in DMS. This is why the “solution” is to resort to what is informally understood as the mission of NSF, which is to fund “fundamental” research.
But it’s not clear to me that NSF should limit itself in this manner. In particular, if NSF got serious about funding the application of statistics to all scientific areas (either through DMS or some other Division), it would incentivize statisticians to build stronger and closer collaborations with scientists all over. I see this as a win-win for everyone involved.
As a statistician, I’m willing to admit I’m biased, but I think NSF should play a much bigger role in advancing statistics as one of the critical tools of the future. Perhaps the solution is not to rename the Division, but to create a separate division for statistical sciences independent of mathematics, one of the suggestions in the sub-committee report. This separation would mirror what has occurred in many universities over the past 50 years or so with the creation of independent departments of statistics and biostatistics.
Ultimately, the name of the Division was not changed. Here’s the release from last week:
NSF is committed to supporting the research necessary to maximize the benefits to be derived from the age of data, and to promoting and funding research related to data-centric scientific discovery and innovation, and in particular, the growing role of the statistical sciences in all research areas. Recognizing both the complex composition of the various communities and the support of statistical sciences throughout NSF, and taking into account the various community views described in the very thoughtful report of the MPSAC, I have decided to maintain the name “Division of Mathematical Sciences (DMS)” within MPS, but to affirm strong commitment to the statistical sciences.
To demonstrate this commitment, (a) whenever appropriate, we will specifically mention “statistics” alongside “mathematics” in budget requests and in solicitations in order to recognize the unique and pervasive role of statistical sciences, and to ensure that relevant solicitations reach the statistical sciences community….
Well, I feel better already. I suppose this is progress of some sort.