30 Jun 2014
Thomas Piketty’s book Capital in the 21st Century was a surprise best seller and the subject of intense scrutiny. A few weeks ago the Financial Times claimed that the analysis was riddled with errors, leading to a firestorm of discussion. A few days ago the London School of Economics posted a similar call to make the data open and machine readable, saying:
None of this data is explicitly open for everyone to reuse, clearly licenced and in machine-readable formats.
A few friends of Simply Stats had started on a project to translate his work from the Excel files where the original analysis resides into R. The people who helped were Alyssa Frazee, Aaron Fisher, Bruce Swihart, Abhinav Nellore, Hector Corrada Bravo, John Muschelli, and me. We haven’t finished translating all chapters, so we are asking anyone who is interested to help contribute to translating the book’s technical appendices into R markdown documents. If you are interested, please send pull requests to the gh-pages branch of this GitHub repo.
As a way to entice you to participate, here is one interesting thing we found. We don’t know enough economics to know whether what we are finding is “right” or not, but the x-axes in the Excel files are really distorted. For example, here is Figure 1.1 from the Excel files, where the ticks on the x-axis are separated by 20, 50, 43, 37, 20, 20, and 22 years.
Here is the same plot with an equally spaced x-axis.
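To see what that spacing does to a picture, here is a minimal R sketch of the two versions. The tick gaps mirror the ones listed above, but the starting year and the y-values are made up for illustration; this is not Piketty’s actual series.

```r
# Minimal sketch: how drawing unequal year gaps at equal visual spacing
# (the Excel-style axis) distorts a trend relative to a proportional axis.
# The gaps between ticks (20, 50, 43, 37, 20, 20, 22 years) come from the
# post above; the starting year and y-values are hypothetical.
years <- c(1800, 1820, 1870, 1913, 1950, 1970, 1990, 2012)
value <- c(1.0, 1.2, 1.8, 2.9, 3.5, 5.0, 7.5, 10.0)

par(mfrow = c(1, 2))

# Distorted version: years treated as evenly spaced categories
plot(seq_along(years), value, type = "b", xaxt = "n",
     xlab = "Year (categorical spacing)", ylab = "Value",
     main = "Excel-style axis")
axis(1, at = seq_along(years), labels = years)

# "Equally spaced" version: each year gets the same visual width
plot(years, value, type = "b",
     xlab = "Year (proportional spacing)", ylab = "Value",
     main = "Proportional axis")
```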
I'm not sure if it makes any difference, but it is interesting. It sounds like, on balance, the Piketty analysis was mostly reproducible and reasonable. But having the data available in a more readily analyzable format will allow for more concrete discussion based on the data. So consider contributing to our GitHub repo.
25 Jun 2014
The U.S. Supreme Court just made a unanimous ruling in Riley v. California, making it clear that police officers must get a warrant before searching through the contents of a cell phone obtained incident to an arrest. The message was put pretty clearly in the decision:
Our answer to the question of what police must do before searching a cell phone seized incident to an arrest is accordingly simple — get a warrant.
But I was more fascinated by this quote:
The sum of an individual’s private life can be reconstructed through a thousand photographs labeled with dates, locations, and descriptions; the same cannot be said of a photograph or two of loved ones tucked into a wallet.
So n = 2 is not enough to recreate a private life, but n = 2,000 (with associated annotation) is enough. I wonder what the minimum sample size needed is to officially violate someone’s privacy. I’d be curious to get Cathy O’Neil’s opinion on that question; she seems to have thought very hard about the relationship between data and privacy.
This is another case where I think that, to some extent, the Supreme Court made a decision on the basis of a statistical concept. Last time it was correlation; this time it is inference. As I read the opinion, part of the argument hinged on how much information you get by searching a cell phone versus a wallet, and, importantly, how much you can infer from those two sets of data.
If any of the Supremes want a primer in statistics, I’m available.
23 Jun 2014
I was reading one of my favorite stats blogs, StatsChat, where Thomas points to this article in The Atlantic and highlights this quote:
Dassault Systèmes is focusing on that level of granularity now, trying to simulate propagation of cholesterol in human cells and building oncological cell models. “It’s data science and modeling,” Charlès told me. “Coupling the two creates a new environment in medicine.”
I think that is a perfect example of data hype. This is a cool idea, and if it worked it would be completely revolutionary. But the reality is that we are not even close to this. In very simple model organisms we can predict very high-level phenotypes some of the time with whole-cell modeling. We aren’t anywhere near the resolution we’d need to model the behavior of human cells, let alone the complex genetic, epigenetic, genomic, and environmental components that likely contribute to complex diseases. It is awesome that people are thinking about the future, and the fastest way to the scientific future is usually through science fiction, but this is way overstating the power of current, or even currently achievable, data science.
So does that mean data science for improving clinical trials right now should be abandoned?
No.
There is tons of currently applicable, real-world data science being done in sequential analysis, adaptive clinical trials, and dynamic treatment regimes. These are important contributions that are impacting clinical trials _right now_, areas where advances can reduce costs, prevent patient harm, and speed the implementation of clinical trials. I think that is the hope of data science: using statistics and data to make steady, realizable improvements in the way we treat patients.
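To give a concrete flavor of the first of those ideas, here is a minimal sketch of a sequential analysis in R: a toy two-arm trial with a single interim look that can stop early if the evidence is already strong. The simulated data, effect size, and stopping boundaries are all assumed values for illustration, not a real trial design.

```r
# Toy sequential analysis: one interim look plus a final analysis.
# All numbers below (sample size, effect size, boundaries) are assumptions.
set.seed(1)

n_per_arm <- 200                 # planned sample size per arm
interim_n <- n_per_arm / 2       # look at the data halfway through
z_interim <- 2.80                # conservative early-stopping boundary (assumed)
z_final   <- 1.98                # final-analysis boundary (assumed)

treatment <- rnorm(n_per_arm, mean = 0.3)  # hypothetical treatment outcomes
control   <- rnorm(n_per_arm, mean = 0.0)  # hypothetical control outcomes

# Two-sample z-style statistic
z_stat <- function(x, y) {
  (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
}

# Interim analysis: only the first half of each arm has been observed
z1 <- z_stat(treatment[1:interim_n], control[1:interim_n])
if (abs(z1) > z_interim) {
  cat("Stop early at the interim look: z =", round(z1, 2), "\n")
} else {
  # Otherwise continue to the planned sample size and do the final analysis
  z2 <- z_stat(treatment, control)
  cat("Final analysis: z =", round(z2, 2),
      if (abs(z2) > z_final) "(reject null)" else "(fail to reject)", "\n")
}
```

The appeal of designs like this is that stopping early when the answer is already clear exposes fewer patients to the inferior arm, which is exactly the kind of cost and harm reduction described above.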
18 Jun 2014
Update (6/19/14): The folks at JNCI and OUP have kindly confirmed that they will consider manuscripts that have been posted to preprint servers.
I just got this email about a paper we submitted to JNCI:
Dear Dr. Leek:
I am sorry that we will not be able to use the above-titled manuscript. Unfortunately, the paper was published online on a site called bioRXiv, The Preprint Server for Biology, hosted by Cold Spring Harbor Lab. JNCI does not publish previously published work.
Thank you for your submission to the Journal.
I have to say I’m not totally surprised, but I am a little disappointed. The future of academic publishing is definitely not evenly distributed.