04 Oct 2012
A little while ago I wrote a post on statistics project ideas for students. In honor of the first Simply Statistics Coursera offering, Computing for Data Analysis, here is a new list of student projects for folks excited about trying out those new R programming skills. Again, each project is rated with my best guess at the difficulty and effort required. Happy computing!
Data Analysis
- Use city data to predict the areas with the highest risk of parking tickets. Here is the data for Baltimore; a minimal R sketch for getting started appears after this list. (Difficulty: Moderate, Effort: Low/Moderate)
- If you have a Fitbit with a premium account, download the data into a spreadsheet (or get Chris’s data). Then build various predictors using the data: (a) are you running or walking, (b) are you having a good day or not, (c) did you eat well that day or not, (d) etc. For special bonus points, create a blog with your new discoveries and share your data with the world. (Difficulty: Depends on what you are trying to predict, Effort: Moderate with Fitbit/Jawbone/etc.)
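To get going on the parking-ticket project, here is a minimal sketch; the file name and the neighborhood column are hypothetical placeholders, so check the actual Baltimore download for the real names.

```r
# Load the citations (file and column names here are assumptions --
# substitute the ones from the actual Baltimore data download)
citations <- read.csv("parking_citations.csv", stringsAsFactors = FALSE)

# Tabulate tickets by neighborhood to find the highest-risk areas
risk <- sort(table(citations$neighborhood), decreasing = TRUE)
head(risk, 10)

# Quick visual check of where tickets cluster
barplot(head(risk, 10), las = 2, cex.names = 0.7,
        main = "Neighborhoods with the most parking tickets")
```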
Data Collection/Synthesis
- Make a list of skills associated with each component of the Data Scientist Venn Diagram. Then update the data scientist R function described in this post to ask a set of questions, then plot people on the diagram; see the sketch after this list. Hint: check out the readline() function. (Difficulty: Moderately low, Effort: Moderate)
- HealthData.gov has a ton of data from various sources about public health, medicines, etc. Some of this data is super useful for projects/analysis and some of it is just data dumps. Create an R package that downloads data from HealthData.gov and gives some measures of how useful/interesting it is for projects (e.g., number of samples in the study, number of variables measured, whether it is summary data or raw data, etc.). (Difficulty: Moderately hard, Effort: High)
- Build an up-to-date aggregator of R tutorials/how-to videos, and summarize/rate each one so that people know which ones to look at for learning which tasks. (Difficulty: Low, Effort: Medium)
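As a starting point for the Venn diagram project, here is a minimal sketch of the readline() idea; the questions, scoring, and the 2-d projection used for plotting are all invented for illustration and are not the function from the post. Note that readline() only prompts in an interactive R session.

```r
# A toy quiz that places someone relative to the three Venn diagram
# skill areas (hacking, statistics, substantive expertise).
# Questions and scoring are made up for illustration.
dataScientistQuiz <- function() {
  ask <- function(question) {
    as.numeric(readline(paste0(question, " (1 = not at all, 5 = expert): ")))
  }
  hacking <- ask("How comfortable are you writing code?")
  stats   <- ask("How comfortable are you with statistical modeling?")
  domain  <- ask("How deep is your substantive expertise?")

  # Collapse the three scores onto a simple 2-d plot
  plot(hacking - stats, domain, xlim = c(-4, 4), ylim = c(0, 6),
       xlab = "hacking <--> statistics", ylab = "substantive expertise",
       pch = 19, main = "Where you land on the diagram")
}
```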
Tool building
- Build software that creates a 2-d author list and averages people’s 2-d author lists. (Difficulty: Medium, Effort: Low)
- Create an R package that interacts with and downloads data from government websites and processes it in a way that is easy to analyze. (Difficulty: Medium, Effort: High)
04 Oct 2012
Nate Silver, everyone’s favorite statistician made good, just gave an interview where he said he thinks many journal articles should be blog posts. I have been thinking about this same issue for a while now, and I’m not the only one. This is a really interesting post suggesting that although scientific journals once facilitated dissemination of ideas, they now impede the flow of information and make it more expensive.
Two recent examples really drove this message home for me. In the first example, I posted a quick idea called the Leekasso, which led to some discussion on the blog, has nearly 2,000 page views (a pretty decent number of downloads for a paper), and has been implemented in software by someone other than me. If this were one of my papers, it would be a reasonably high-impact one. The second example is a post I put up about a recent Nature paper. The authors (who are really good sports) ended up writing to me to get my critiques. I wrote them out, and they responded. All of this happened after peer review and informally, and the entire interaction occurred in email, where no one but us can see it.
It wouldn’t take much to go to a blog-based system. What if everyone who published scientific results started a blog (free), and a site run by PubMed aggregated the feeds (this would be super cheap to set up and maintain)? People with verified accounts could comment on blog posts and vote for the ones they liked. We would skip peer review in favor of just producing results and discussing them, and the interesting results would be shared by email, Twitter, etc.
Why would we do this? Well, the current journal system: (1) significantly slows the publication of research, (2) costs thousands of dollars, and (3) costs significant labor that is not scientifically productive (such as resubmitting).
Almost every paper I have had published was rejected at least one place, including the “good” ones. This means that the results of even the good papers have been delayed by months, or in the case of one paper, by a full year and a half. Any time I publish open access, it costs me at minimum around $1,500. I like open access because I think science funded by taxpayers should be free, but it is a significant drain on the resources of my group. Finally, most of the resubmission process is wasted labor: it generally doesn’t produce new science or improve the quality of the science. The effort goes into reformatting and re-entering information about authors.
So why not have everyone just post results on their blog or on figshare? Each post would have a DOI that could be cited. We’d reduce everyone’s labor in reviewing/editing/resubmitting by an order of magnitude or two and save taxpayers a few thousand dollars per scientist per year in publication fees. We’d also increase the speed of updating/reacting to new ideas by an order of magnitude.
I still maintain we should be evaluating people based on reading their actual work, not highly subjective and error-prone indices. But if the powers that be insisted, it would be easy to evaluate people based on likes/downloads/citations/discussion of papers rather than on the basis of journal titles and the arbitrary decisions of editors.
So should we stop publishing peer-reviewed papers?
Edit: Titus points to a couple of good posts with interesting ideas about the peer review process that are worth reading, here and here. Also, Joe Pickrell et al. are already on this for population genetics, having set up the aggregator Haldane’s Sieve. It would be nice if this expanded to other areas (and people got credit for the papers published there, like they do for papers in journals).
03 Oct 2012
The paper is a review of how to do software development for academics. I saw it via C. Titus Brown (whom we have interviewed), who is also a co-author. How to write software (particularly for other people) is underemphasized in many curricula, but it turns out to be one of the more important components of disseminating your work in modern applied statistics. My only wish is that there were an accompanying website with resources/links for people to chase down.
03 Oct 2012
The order of authors on scientific papers matters a lot. The best places to be on a paper vary by field, but typically the first-author and corresponding-author (usually last) slots are the prime real estate. When people are evaluated on the job market, for promotion, or for grants, the number of first/corresponding-author papers can be the difference between success and failure.
At the same time, many journals list “author contributions” at the end of the manuscript, but these are rarely prominently displayed. The result is that, regardless of the true distribution of credit for a manuscript, the first and last authors get the bulk of the benefit.
This system is antiquated for a few reasons:
- In multidisciplinary science, there are often equal and very different contributions from people working in different disciplines.
- Science is increasingly collaborative, even within a single discipline, and papers are rarely the effort of just two people anymore.
How about a 2-D, re-sortable author list? Each author is a column and each kind of contribution is a row. The contributions are: (1) conceived the idea, (2) collected the data, (3) did the computational analysis, and (4) wrote the paper (you could imagine adding others). Each category then gets a quantitative value: the fraction of the effort each author contributed to that component of the paper. Then you build an interactive graphic that lets you sort the authors by each of the categories, so you can see who did what on the paper.
To get an overall impression of which activities an author performs, you could average their contributions across papers in each of the categories, creating a “heatmap of contributions.” Anyone want to build this?
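For anyone tempted, here is a minimal sketch of the idea in R; the authors, papers, and effort fractions are all invented, and a real version would want an interactive front end rather than base graphics.

```r
# Contribution matrix for one paper: rows are contribution types,
# columns are authors, and each row holds fractions of effort summing to 1.
# All names and numbers below are made up for illustration.
paper1 <- rbind(
  idea     = c(0.6, 0.2, 0.2),
  data     = c(0.1, 0.7, 0.2),
  analysis = c(0.2, 0.1, 0.7),
  writing  = c(0.5, 0.2, 0.3)
)
colnames(paper1) <- c("Author A", "Author B", "Author C")

# "Re-sort" the author list by any contribution category
paper1[, order(paper1["analysis", ], decreasing = TRUE)]

# Average contributions across papers to get the
# "heatmap of contributions" for a group of authors
paper2 <- rbind(
  idea     = c(0.3, 0.5, 0.2),
  data     = c(0.2, 0.2, 0.6),
  analysis = c(0.6, 0.2, 0.2),
  writing  = c(0.4, 0.4, 0.2)
)
colnames(paper2) <- colnames(paper1)
avg <- (paper1 + paper2) / 2

heatmap(avg, Rowv = NA, Colv = NA, scale = "none",
        main = "Average contribution by author and category")
```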
02 Oct 2012
Good friend and friend of the blog Rob Gould has started a statistics blog called Citizen Statistician. What is a citizen statistician, you ask?
What is a citizen statistician? A citizen statistician participates in formal and informal data gathering. A citizen statistician is aware of his or her data trail, and of the harm that could be done to themselves or to others through data aggregation. Citizen statisticians recognize opportunities to improve their personal or professional lives through analyzing data, and know how to share data with others. They know that almost any question about the world can be answered using data, know how to find relevant data sources on the web, and can critically evaluate those sources. A citizen statistician also knows how to bring that data into an analysis package and how to start their investigation.
What’s even better than having more statistics blogs? Having more statisticians.