14 Mar 2014
Happy π day everybody!
I wanted to write some simple code (included below) to test the parallelization capabilities of my new cluster. So, in honor of π day, I decided to check for evidence that π is a normal number. A normal number is a real number whose infinite sequence of digits has the property that any given m-digit pattern appears with limiting frequency 10^-m, just as it would if the digits were generated at random. For example, using the Poisson approximation, we can predict that the pattern “123456789” should show up between 0 and 3 times in the first billion digits of π (it actually shows up twice, starting at the 523,551,502-th and 773,349,079-th decimal places).
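For the curious, here is a quick back-of-the-envelope sketch of that Poisson calculation (illustrative only; it is not part of the parallel code below): there are roughly 10^9 starting positions, each matching a fixed 9-digit pattern with probability 10^-9, so the expected count is about 1 and a Poisson(1) count falls between 0 and 3 about 98% of the time.
n <- 1e9                      ## digits examined
m <- 9                        ## length of the pattern "123456789"
positions <- n - m + 1        ## possible starting positions
lambda <- positions * 10^-m   ## expected number of occurrences, about 1
ppois(3, lambda)              ## P(0 <= count <= 3), about 0.98
dpois(2, lambda)              ## probability of exactly 2 occurrences, about 0.18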
To test our hypothesis, let Y1, …, Y100 be the number of occurrences of “00”, “01”, …, “99” in the first billion digits of π. If π is in fact normal, then the Ys should be approximately IID binomial with N = 1 billion and p = 0.01. In the qq-plot below I show Z-scores, (Y − 10,000,000) / √9,900,000, which appear to follow a normal distribution as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3, 4, 5, 6, and 7 digit patterns (for 5, 6, and 7 I sampled 10,000 patterns). Note that we can perform a chi-square test for the uniform distribution as well. For patterns of size 1, 2, 3, and 4 the p-values were 0.84, 0.89, 0.92, and 0.99.
Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.
These observed counts should also be independent, and to explore this we can look at autocorrelation plots:
To do this in about an hour and with just a few lines of code (included below), I used the Bioconductor Biostrings package to match strings and the foreach function to parallelize.
library(Biostrings)
library(doParallel)
registerDoParallel(cores = 48)

## read in the digits and drop the leading "3." so only decimals remain
x <- scan("pi-billion.txt", what = "c")
x <- substr(x, 3, nchar(x)) ## remove "3."
x <- BString(x)
n <- length(x)

par(mfrow = c(2, 3))
for (d in 2:7) {
  ## all patterns for short lengths; a random sample of 10,000 patterns for d >= 5
  if (d < 5) {
    patterns <- sprintf(paste0("%0", d, "d"), seq(0, 10^d - 1))
  } else {
    patterns <- sprintf(paste0("%0", d, "d"), sample(10^d, 10^4) - 1)
  }
  p <- 1 / (10^d)
  ## count each pattern in parallel and convert the counts to z-scores
  res <- foreach(pat = patterns, .combine = c) %dopar% countPattern(pat, x)
  z <- (res - n * p) / sqrt(n * p * (1 - p))
  qqnorm(z, xlab = "Theoretical quantiles", ylab = "Observed z-scores",
         main = paste(d, "digits"))
  abline(0, 1)
  ## chi-square test for uniformity, only when all patterns were counted
  ## correction: original post had length(res)
  if (d < 5) print(1 - pchisq(sum((res - n * p)^2 / (n * p)), length(res) - 1))
}
### Now count each pattern within non-overlapping segments
d <- 1
m <- 10^5   ## digits per segment
patterns <- sprintf(paste0("%0", d, "d"), seq(0, 10^d - 1))
res <- foreach(pat = patterns, .combine = cbind) %dopar% {
  tmp <- start(matchPattern(pat, x))   ## starting positions of the pattern
  tmp2 <- floor((tmp - 1) / m)         ## which segment each match falls in
  return(tabulate(tmp2 + 1, nbins = n / m))
}
##qq-plots
par(mfrow=c(2,5))
p <- 1/(10^d)
for (i in 1:ncol(res)) {
  z <- (res[, i] - m * p) / sqrt(m * p * (1 - p))
  qqnorm(z, xlab = "Theoretical quantiles", ylab = "Observed z-scores",
         main = paste(i - 1))
  abline(0, 1)
}
##ACF plots
par(mfrow=c(2,5))
for(i in 1:ncol(res)) acf(res[,i])
NB: A normal number has the property stated above in any base. The examples above are for base 10.
12 Mar 2014
An astute reader (Niels Hansen, who is visiting our department today) caught a bug in my code on Github for the Leekasso. I had:
lm1 = lm(y ~ leekX)
predict.lm(lm1,as.data.frame(leekX2))
Unfortunately, this meant that I was getting predictions for the training set rather than the test set. Since I set up the test and training sets the same way, this meant that I was actually getting training set error rates for the Leekasso. Niels Hansen noticed the bug and reran the fixed code with this instead:
lm1 = lm(y ~ ., data = as.data.frame(leekX))
predict.lm(lm1,as.data.frame(leekX2))
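To see why the first version silently returns training-set predictions, here is a small illustrative sketch with simulated data (trainX, testX, and y are made-up names for this example; this is not the original Leekasso code). When the formula hard-codes the matrix object, predict.lm cannot find a variable with that name in newdata and falls back to the training matrix in the formula's environment:
set.seed(1)
trainX <- matrix(rnorm(100 * 10), 100, 10)
testX <- matrix(rnorm(100 * 10), 100, 10)
y <- rnorm(100)
## buggy: the formula refers to the trainX object itself, so newdata is ignored
fit_bad <- lm(y ~ trainX)
p_bad <- predict.lm(fit_bad, as.data.frame(testX))
all.equal(unname(p_bad), unname(fitted(fit_bad))) ## TRUE: these are training-set predictions
## fixed: columns are referenced by name, so the test data are actually used
fit_good <- lm(y ~ ., data = as.data.frame(trainX))
p_good <- predict.lm(fit_good, as.data.frame(testX))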
He created a heatmap of the difference in average accuracy between the Leekasso and the Lasso and showed they are essentially equivalent.
This is a bummer: the Leekasso isn’t a world-crushing algorithm. On the other hand, I’m happy that just choosing the top 10 is still competitive with the optimized lasso on average. More importantly, although I hate being wrong, I appreciate people taking the time to look through my code.
Just out of curiosity I’m taking a survey. Do you think I should publish this top10 predictor thing as a paper? Or do you think it is too trivial?
05 Mar 2014
I’ve been closely following the fallout from PLoS One’s new policy for data sharing. The policy says, basically, that if you publish a paper, all data and code that go with that paper should be made publicly available at the time of publication, and that authors must include an explicit data sharing statement in the paper they submit.
I think the reproducibility debate is over. Data should be made available when papers are published. The Potti scandal and the Reinhart/Rogoff scandal have demonstrated the extreme consequences of lack of reproducibility and the reproducibility advocates have taken this one home. The question with reproducibility isn’t “if” anymore it is “how”.
The transition toward reproducibility is likely to be rough for two reasons. One is that many people who generate data lack training in handling and analyzing data, even in a data saturated field like genomics. The story is even more grim in areas that haven’t been traditionally considered “data rich” fields.
The second problem is a cultural and economic problem. It involves the fundamental disconnect between (1) the incentives of our system for advancement, grant funding, and promotion and (2) the policies that will benefit science and improve reproducibility. Most of the debate on social media seems to conflate these two issues. I think it is worth breaking the debate down into three main constituencies: journals, data creators, and data analysts.
Journals with requirements for data sharing
Data sharing, especially for large data sets, isn’t easy and it isn’t cheap. Not knowing how to share data is not an excuse - to be a modern scientist this is one of the skills you have to have. But if you are a journal that makes huge profits and you want open sharing, you should put up or shut up. The best way to do that would be to pay for storage on something like AWS for all data sets submitted to comply with your new policy. In the era of cheap hosting and standardized templates, charging $1,000 or more for an open access paper is way too much. It costs essentially nothing to host that paper online and you are getting peer review for free. So you should spend some of your profits paying for the data sharing that will benefit your journal and the scientific community.
Data creators
It is really hard to create a serious, research quality data set in almost any scientific discipline. If you are studying humans, it requires careful adherence to rules and procedures for handling human data. If you are in ecology, it may involve extensive field work. If you are in behavioral research, it may involve careful review of thousands of hours of video tape.
The value of one careful, rigorous, and interesting data set is hard to overstate. In my field, the data Leonid Kruglyak’s group generated measuring gene expression and genetics in a careful yeast experiment spawned an entirely new discipline within both genomics and statistics.
The problem is that generating one really good data set can take months or even years. It is definitely possible to publish more than one paper on a really good data set. But after the data are generated, most of these papers will have to do with data analysis, not data generation. If there are ten papers that could be published on your data set and your group publishes the data with the first one, you may get to the second or third, but someone else might publish papers 4 through 10.
This may be good for science, but it isn’t good for the careers of data generators. Ask anyone in academics whether they’d rather have 6 citations from awesome papers or 6 awesome papers, and 100% of them will take the papers.
I’m completely sympathetic to data generators who spend a huge amount of time creating a data set and are worried they may be scooped on later papers. This is a place where the culture of credit hasn’t caught up with the culture of science. If you write a grant and generate an amazing data set that 50 different people use - you should absolutely get major credit for that in your next grant. However, you probably shouldn’t get authorship unless you intellectually contributed to the next phase of the analysis.
The problem is we don’t have an intermediate form of credit for data generators that is weighted more heavily than a citation. In the short term, this lack of a proper system of credit will likely lead data generators to make the (completely sensible) decision to hold their data close and then publish multiple papers at once - like ENCODE did. This will drive everyone crazy and slow down science - but it is the appropriate career choice for data generators until our system of credit has caught up.
Data analysts
I think that data analysts who are pushing for reproducibility are genuine in their desire for reproducibility. I also think that the debate is over. I think we can contribute to the success of the reproducibility transition by figuring out ways to give stronger and more appropriate credit to data generators. I don’t think authorship is the right approach. But I do think that it is the right approach to loudly and vocally give credit to people who generated the data you used in your purely data analytic paper. That includes making sure the people that are responsible for their promotion and grants know just how incredibly critical it is that they keep generating data so you can keep doing your analysis.
Finally, I think that we should be more sympathetic to the career concerns of folks who generate data. I have written methods and made the code available. I have then seen people write very similar papers using my methods and code - then getting credit/citations for producing a very similar method to my own. Being [reverse scooped](http://simplystatistics.org/2011/12/03/reverse-scooping/) like this is incredibly frustrating. If you’ve ever had that experience, imagine what it would feel like to spend a whole year creating a data set and then only getting one publication.
I also think that the primary use of reproducibility so far has been as a weapon. It has been used (correctly) to point out critical flaws in research. It has also been used as a way to embarrass authors who don’t (and even some who do) have training in data analysis. The transition to fully reproducible science can either be a painful fight or a smoother transition. One thing that would go a long way would be to think of code review/reproducibility not like peer review, but more like pull requests and issues on Github. The goal isn’t to show how the other person did it wrong; the goal is to help them do it right.