Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

An update on Georgia Tech's MOOC-based CS degree

This article in Inside Higher Ed discusses next steps for Georgia Tech’s ground-breaking low-cost CS degree based on MOOCs run by Udacity. With Sebastian Thrun stepping down as CEO at Udacity, it seems both Georgia Tech and Udacity might be moving into a new phase.

One fact that surprised me about the Georgia Tech program concerned the demographics:

Once the first applications for the online program arrived, Georgia Tech was surprised by how the demographics differed from the applications to the face-to-face program. The institute’s face-to-face cohorts tend to have more men than women and international students than U.S. citizens or residents. Applications to the online program, however, came overwhelmingly from students based in the U.S. (80 percent). The gender gap was even larger, with nearly nine out of 10 applications coming from men.

Write papers like a modern scientist (use Overleaf or Google Docs + Paperpile)

Editor’s note - This is a chapter from my book How to be a modern scientist where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before.

Writing - what should I do and why?

Write using collaborative software to avoid version control issues.

On almost all modern scientific papers you will have co-authors. The traditional way of handling this was to create a single working document and pass it around. Unfortunately this system always results in a long collection of versions of a manuscript, which are often hard to distinguish and definitely hard to synthesize.

An alternative approach is to use formal version control systems like Git and GitHub. However, the overhead for using these systems can be pretty heavy for paper authoring. They also require all parties participating in the writing of the paper to be familiar with version control and the command line. Alternative paper authoring tools are now available that provide some of the advantages of version control without the major overhead involved in using version control systems directly.

The usual result of file naming by a group (image via https://xkcd.com/1459/)

Make figures the focus of your writing

Once you have a set of results and are ready to start writing up the paper the first thing is not to write. The first thing you should do is create a set of 1-10 publication-quality plots with 3-4 as the central focus (see Chapter 10 here for more information on how to make plots). Show these to someone you trust to make sure they “get” your story before proceeding. Your writing should then be focused around explaining the story of those plots to your audience. Many people, when reading papers, read the title, the abstract, and then usually jump to the figures. If your figures tell the whole story you will dramatically increase your audience. It also helps you to clarify what you are writing about.

Write clearly and simply even though it may make your papers harder to publish.

Learn how to write papers in a very clear and simple style. Whenever you can, write in plain English and make the approach you are using understandable and clear. This can (sometimes) make it harder to get your papers into journals. Referees are trained to find things to criticize and by simplifying your argument they will assume that what you have done is easy or just like what has been done before. This can be extremely frustrating during the peer review process. But the peer review process isn’t the end goal of publishing! The point of publishing is to communicate your results to your community and beyond so they can use them. Simple, clear language leads to much higher use/reading/citation/impact of your work.

Include links to code, data, and software in your writing

Not everyone recognizes the value of re-analysis, scientific software, or data and code sharing. But it is the fundamental cornerstone of the modern scientific process to make all of your materials easily accessible, re-usable and checkable. Include links to data, code, and software prominently in your abstract, introduction and methods and you will dramatically increase the use and impact of your work.

Give credit to others

In academics the main currency we use is credit for publication. In general assigning authorship and getting credit can be a very tricky component of the publication process. It is almost always better to err on the side of offering credit. A very useful test that my advisor John Storey taught me is if you are embarrassed to explain the authorship credit to anyone who was on the paper or not on the paper, then you probably haven’t shared enough credit.

Writing - what tools should I use?

WYSIWYG software: Google Docs and Paperpile.

This system uses Google Docs for writing and Paperpile for reference management. If you have a Google account you can easily create documents and share them with your collaborators either privately or publicly. Paperpile allows you to search for academic articles and insert references into the text using a system that will be familiar if you have previously used Endnote and Microsoft Word.

This system has the advantage of being a what you see is what you get system - anyone with basic text processing skills should be immediately able to contribute. Google Docs also automatically saves versions of your work so that you can flip back to older versions if someone makes a mistake. You can also easily see which part of the document was written by which person and add comments.

Getting started

  1. Set up accounts with Google and with Paperpile. If you are an academic the Paperpile account will cost $2.99 a month, but there is a 30 day free trial.
  2. Go to Google Docs and create a new document.
  3. Set up the Paperpile add-on for Google Docs
  4. In the upper right hand corner of the document, click on the Share link and share the document with your collaborators
  5. Start editing
  6. When you want to include a reference, place the cursor where you want the reference to go, then using the Paperpile menu, choose insert citation. This should give you a search box where you can search by PubMed ID or on the web for the reference you want.
  7. Once you have added some references use the Citation style option under Paperpile to pick the citation style for the journal you care about.
  8. Then use the Format citations option under Paperpile to create the bibliography at the end of the document

The nice thing about using this system is that everyone can easily directly edit the document simultaneously - which reduces conflict and difficulty of use. A disadvantage is getting the formatting just right for most journals is nearly impossible, so you will be sending in a version of your paper that is somewhat generic. For most journals this isn’t a problem, but a few journals are sticklers about this.

Typesetting software: Overleaf or ShareLaTeX

An alternative approach is to use typesetting software like LaTeX. This requires a little bit more technical expertise since you need to understand the LaTeX typesetting language. But it allows for more precise control over what the document will look like. Using LaTeX on its own you will have many of the same issues with version control that you would have for a Word document. Fortunately there are now “Google Docs like” solutions for editing LaTeX code that are readily available. Two of the most popular are Overleaf and ShareLaTeX.

In either system you can create a document, share it with collaborators, and edit it in a similar manner to a Google Doc, with simultaneous editing. Under both systems you can save versions of your document easily as you move along so you can quickly return to older versions if mistakes are made.

I have used both kinds of software, but now primarily use Overleaf because it has a killer feature: once you have finished writing your paper you can directly submit it to some preprint servers like arXiv or bioRxiv and even some journals like PeerJ or F1000Research.

Getting started

  1. Create an Overleaf account. There is a free version of the software. Paying $8/month will give you easy saving to Dropbox.
  2. Click on New Project to create a new document and select from the available templates
  3. Open your document and start editing
  4. Share with colleagues by clicking on the Share button within the project. You can share either a read only version or a read and edit version.

Once you have finished writing your document you can click on the Publish button to automatically submit your paper to the available preprint servers and journals. Or you can download a pdf version of your document and submit it to any other journal.

Writing - further tips and issues

When to write your first paper

As soon as possible! The purpose of graduate school is (in some order):

  • Freedom
  • Time to discover new knowledge
  • Time to dive deep
  • Opportunity for leadership
  • Opportunity to make a name for yourself
    • R packages
    • Papers
    • Blogs
  • Get a job

The first couple of years of graduate school are typically focused on (1) teaching you all the technical skills you need and (2) data dumping as much hard-won practical experience from more experienced people into your head as fast as possible.

After that one of your main focuses should be on establishing your own program of research and reputation. Especially for Ph.D. students it cannot be emphasized enough that no one will care about your grades in graduate school but everyone will care what you produced. See for example, Sherri’s excellent guide on CVs for academic positions.

I firmly believe that R packages and blog posts can be just as important as papers, but the primary signal to most traditional academic communities still remains published peer-reviewed papers. So you should get started on writing them as soon as you can (definitely before you feel comfortable enough to try to write one).

Even if you aren’t going to be in academics, papers are a great way to show off that you can (a) identify a useful project, (b) finish a project, and (c) write well. So the first thing you should be asking when you start a project is “what paper are we working on?”

What is an academic paper?

A scientific paper can be distilled into four parts:

  1. A set of methodologies
  2. A description of data
  3. A set of results
  4. A set of claims

When you (or anyone else) write a paper the goal is to communicate items 1-3 clearly so that they can justify the set of claims you are making. Before you can even write down 4 you have to do 1-3. So that is where you start when writing a paper.

How do you start a paper?

The first thing you do is you decide on a problem to work on. This can be a problem that your advisor thought of or it can be a problem you are interested in, or a combination of both. Ideally your first project will have the following characteristics:

  1. Concrete
  2. Solves a scientific problem
  3. Gives you an opportunity to learn something new
  4. Something you feel ownership of
  5. Something you want to work on

Points 4 and 5 can’t be emphasized enough. Others can try to help you come up with a problem, but if you don’t feel like it is your problem it will make writing the first paper a total slog. You want to find an option where you are just insanely curious to know the answer at the end, to the point where you just have to figure it out and kind of don’t care what the answer is. That doesn’t always happen, but that makes the grind of writing papers go down a lot easier.

Once you have a problem the next step is to actually do the research. I’ll leave this for another guide, but the basic idea is that you want to follow the usual data analytic process:

  1. Define the question
  2. Get/tidy the data
  3. Explore the data
  4. Build/borrow a model
  5. Perform the analysis
  6. Check/critique results
  7. Write things up

The hardest part for the first paper is often knowing where to stop and start writing.

How do you know when to start writing?

Sometimes this is an easy question to answer. If you started with a very concrete question, then once you have done enough analysis to convince yourself that you have the answer, and that answer is interesting/surprising, it is time to stop and write.

If you started with a question that wasn’t so concrete then it gets a little trickier. The basic idea here is that you have convinced yourself you have a result that is worth reporting. Usually this takes the form of between 1 and 5 figures that show a coherent story that you could explain to someone in your field.

In general one thing you should be working on in graduate school is your own internal timer that tells you, “ok we have done enough, time to write this up”. I found this one of the hardest things to learn, but if you are going to stay in academics it is a critical skill. There are rarely deadlines for paper writing (unless you are submitting to CS conferences) so it will eventually be up to you when to start writing. If you don’t have a good clock, this can really slow down your ability to get things published and promoted in academics.

One good principle to keep in mind is “the perfect is the enemy of the very good”. Another one is that a published paper in a respectable journal beats a paper you just never submit because you want to get it into the “best” journal.

A note on “negative results”

If the answer to your research problem isn’t interesting/surprising but you started with a concrete question it is also time to stop and write. But things often get more tricky with this type of paper as most journals when reviewing papers filter for “interest” so sometimes a paper without a really “big” result will be harder to publish. This is ok!! Even though it may take longer to publish the paper, it is important to publish even results that aren’t surprising/novel. I would much rather that you come to an answer you are comfortable with and we go through a little pain trying to get it published than you keep pushing until you get an “interesting” result, which may or may not be justifiable.

How do you start writing?

  1. Once you have a set of results and are ready to start writing up the paper the first thing is not to write. The first thing you should do is create a set of 1-4 publication-quality plots (see Chapter 10 here). Show these to someone you trust to make sure they “get” your story before proceeding.
  2. Start a project on Overleaf or Google Docs.
  3. Write up a story around the four plots in the simplest language you feel you can get away with, while still reporting all of the technical details that you can.
  4. Go back and add references in only after you have finished the whole first draft.
  5. Add in additional technical detail in the supplementary material if you need it.
  6. Write up a reproducible version of your code that returns exactly the same numbers/figures in your paper with no input parameters needed.
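
Step 6 can be sketched in code. The following is a minimal Python illustration, not the workflow from the Leek group guide: the seed value, the simulated data, and the function name run_analysis are all assumptions made up for the example. The point is only that the script takes no input parameters and fixes its random seed, so every run returns the same numbers.

```python
import random
import statistics

# Hypothetical fixed seed: any constant works, as long as it never changes.
SEED = 2016


def run_analysis():
    """Return every number reported in the (made-up) paper, with no inputs."""
    # A private generator, so other code cannot disturb the random stream.
    rng = random.Random(SEED)
    # Stand-in for the real data; a real script would read the raw data file here.
    sample = [rng.gauss(0, 1) for _ in range(1000)]
    return {
        "n": len(sample),
        "mean": round(statistics.mean(sample), 4),
        "sd": round(statistics.stdev(sample), 4),
    }


if __name__ == "__main__":
    print(run_analysis())
```

Because nothing is passed in from outside, a collaborator or referee can rerun the script and compare its output directly with the numbers in the manuscript.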

What are the sections in a paper?

Keep in mind that most people will read the title of your paper only, a small fraction of those people will read the abstract, a small fraction of those people will read the introduction, and a small fraction of those people will read your whole paper. So make sure you get to the point quickly!

The sections of a paper are always some variation on the following:

  1. Title: Should be very short, no colons if possible, and state the main result. Example, “A new method for sequencing data that shows how to cure cancer”. Here you want to make sure people will read the paper without overselling your results - this is a delicate balance.
  2. Abstract: In (ideally) 4-5 sentences explain (a) what problem you are solving, (b) why people should care, (c) how you solved the problem, (d) what are the results and (e) a link to any data/resources/software you generated.
  3. Introduction: A more lengthy (1-3 pages) explanation of the problem you are solving, why people should care, and how you are solving it. Here you also review what other people have done in the area. The most critical thing is never underestimate how little people know or care about what you are working on. It is your job to explain to them why they should.
  4. Methods: You should state and explain your experimental procedures, how you collected results, your statistical model, and any strengths or weaknesses of your proposed approach.
  5. Comparisons (for methods papers): Compare your proposed approach to the state of the art methods. Do this with simulations (where you know the right answer) and data you haven’t simulated (where you don’t know the right answer). If you can base your simulation on data, even better. Make sure you are simulating both the easy case (where your method should be great) and harder cases where your method might be terrible.
  6. Your analysis: Explain what you did, what data you collected, how you processed it and how you analysed it.
  7. Conclusions: Summarize what you did and explain why what you did is important one more time.
  8. Supplementary Information: If there are a lot of technical computational, experimental or statistical details, you can include a supplement that has all of the details so folks can follow along. As far as possible, try to include the detail in the main text but explained clearly.
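
The simulation advice in the Comparisons item can be sketched as follows. This is a toy Python example under stated assumptions, not a method from the chapter: the sample mean stands in for a “proposed” estimator, the sample median for the “state of the art”, and the sample size, number of replicates, contamination rate, and seed are arbitrary choices. Because the data are simulated with a true center of 0, the right answer is known, and the mean squared error of each estimator can be computed in an easy case (clean normal data) and a harder case (normal data with occasional large outliers).

```python
import random
import statistics

TRUE_CENTER = 0.0  # known because we simulate the data ourselves


def simulate(rng, n, outlier_rate):
    """Draw n points around TRUE_CENTER; a fraction come from a much wider distribution."""
    return [
        rng.gauss(TRUE_CENTER, 10.0 if rng.random() < outlier_rate else 1.0)
        for _ in range(n)
    ]


def mean_squared_error(estimator, outlier_rate, n=50, reps=500, seed=1):
    """Average squared distance between the estimate and the known truth."""
    # Same seed on every call, so both estimators see the same simulated data sets.
    rng = random.Random(seed)
    errors = [
        (estimator(simulate(rng, n, outlier_rate)) - TRUE_CENTER) ** 2
        for _ in range(reps)
    ]
    return statistics.mean(errors)


def compare():
    """Return the mean squared error of each estimator in the easy and hard cases."""
    results = {}
    for case, rate in [("easy", 0.0), ("hard", 0.1)]:
        results[case] = {
            "mean": mean_squared_error(statistics.mean, rate),
            "median": mean_squared_error(statistics.median, rate),
        }
    return results
```

Calling compare() gives one error figure per estimator per case. Under these settings the mean would be expected to do better on the clean data and the median on the contaminated data, which is exactly the kind of “where is my method terrible” result worth reporting.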

The length of the paper will depend a lot on which journal you are targeting. In general the shorter/more concise the better. But unless you are shooting for a really glossy journal you should try to include the details in the paper itself. This means most papers will be in the 4-15 page range, but with a huge variance.

Note: Part of this chapter appeared in the Leek group guide to writing your first paper

As a data analyst the best data repositories are the ones with the least features

Lately, for a range of projects I have been working on I have needed to obtain data from previous publications. There is a growing list of data repositories where data is made available. General purpose data sharing sites include:

There are also a host of field-specific data sharing sites. One thing that I find a little frustrating about these sites is that they add a lot of bells and whistles. For example I wanted to download a p-value data set from Dataverse (just to pick on one, but most repositories have similar issues). I go to the page and click Download on the data set I want.

(image: downloading a Dataverse paper)

Then I have to accept terms:

(image: downloading a Dataverse paper, part 2)

Then the data set is downloaded. But it comes from a button that doesn’t allow me to get the direct link. There is an R package that you can use to download dataverse data, but again not with direct links to the data sets.

This is a similar system to many data repositories where there is a multi-step process to downloading data rather than direct links.

But as a data analyst I often find that I want:

  • To be able to find a data set with some minimal search terms
  • To find the data set in .csv or tab-delimited format via a direct link
  • To have the data set available as both raw and processed versions
  • To have the processed version be either one or many tidy data sets

As a data analyst I would rather have all of the data stored as direct links and ideally as csv files. Then you don’t need to figure out a specialized package, an API, or anything else. You just use read.csv directly using the URL in R and you are off to the races. It also makes it easier to point to which data set you are using. So I find the best data repositories are the ones with the least features.
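
To make the direct-link workflow concrete, here is a minimal sketch in Python, an analogue of calling read.csv on a URL in R. It uses only the standard library; the function names and the URL in the final comment are hypothetical, and the sketch assumes a UTF-8 encoded file with a header row.

```python
import csv
import io
import urllib.request


def parse_csv_text(text):
    """Parse CSV text into a list of dicts keyed by the header row.

    All values are returned as strings; convert types afterwards as needed.
    """
    return list(csv.DictReader(io.StringIO(text)))


def read_csv_url(url):
    """Fetch a CSV file from a direct link and parse it in one step."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")  # assumes UTF-8 encoding
    return parse_csv_text(text)


# Hypothetical usage; this URL is a placeholder, not a real data set:
# rows = read_csv_url("https://example.org/data/pvalues.csv")
```

With a direct link there is nothing else to learn: no terms page, no download button, no repository-specific package. The URL itself also doubles as an unambiguous pointer to the exact data set used.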

Junior scientists - you don't have to publish in open access journals to be an open scientist.

Editor’s note - This is a chapter from my book How to be a modern scientist where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before.

Publishing - what should I do and why?

A modern scientific writing process goes as follows.

  1. You write a paper
  2. You post a preprint
    a. Everyone can read and comment
  3. You submit it to a journal
  4. It is peer reviewed privately
  5. The paper is accepted or rejected
    a. If rejected go back to step 2 and start over
    b. If accepted it will be published

You can take advantage of modern writing and publishing tools to handle several steps in the process.

Post preprints of your work

Once you have finished writing your paper, you want to share it with others. Historically, this involved submitting the paper to a journal, waiting for reviews, revising the paper, resubmitting, and eventually publishing it. There is now very little reason to wait that long for your paper to appear in print. Generally you can post a paper to a preprint server and have it appear in 1-2 days. This is a dramatic improvement on the weeks or months it takes for papers to appear in peer reviewed journals even under optimal conditions. There are several advantages to posting preprints.

  • Preprints establish precedence for your work, which reduces your risk of being scooped.
  • Preprints allow you to collect feedback on your work and improve it quickly.
  • Preprints can help you to get your work published in formal academic journals.
  • Preprints can get you attention and press for your work.
  • Preprints give junior scientists and other researchers gratification that helps them handle the stress and pressure of their first publications.

The last point is underappreciated and was first pointed out to me by Yoav Gilad. It takes a really long time to write a scientific paper. For a student publishing their first paper, the first feedback they get is often (a) delayed by several months and (b) negative and in the form of a referee report. This can have a major impact on the motivation of those students to keep working on projects. Preprints allow students to have an immediate product they can point to as an accomplishment, allow them to get positive feedback along with constructive or negative feedback, and can ease the pain of difficult referee reports or rejections.

Choose the journal that maximizes your visibility

You should try to publish your work in the best journals for your field. There are a couple of reasons for this. First, being a scientist is both a calling and a career. To advance your career, you need visibility among your scientific peers and among the scientists who will be judging you for grants and promotions. The best way to get it is by publishing in the top journals in your field. The important thing is to do your best to do good work and submit it to these journals, even if the results aren’t the most “sexy”. Don’t adapt your workflow to the journal, but don’t ignore the career implications either. Do this even if the journals are closed access. There are ways to make your work accessible and you will both raise your profile and disseminate your results to the broadest audience.

Share your work on social media

Academic journals are good for disseminating your work to the appropriate scientific community. As a modern scientist you have other avenues and other communities - like the general public - that you would like to reach with your work. Once your paper has been published in a preprint or in a journal, be sure to share your work through appropriate social media channels. This will also help you develop facility in coming up with one line or one figure that best describes what you think you have published so you can share it on social media sites like Twitter.

Preprints and criticism

See the section on scientific blogging for how to respond to criticism of your preprints online.

Publishing - what tools should I use?

Preprint servers

Here are a few preprint servers you can use.

  1. arXiv (free) - primarily takes math/physics/computer science papers. You can submit papers and they are reviewed and posted within a couple of days. It is important to note that once you submit a paper here, you cannot take it down. But you can submit revisions to the paper which are tracked over time. This outlet is followed by a large number of journalists and scientists.
  2. bioRxiv (free) - primarily takes biology focused papers. They are pretty strict about which categories you can submit to. You can submit papers and they are reviewed and posted within a couple of days. bioRxiv also allows different versions of manuscripts, but some folks have had trouble with their versioning system, which can be a bit tricky for keeping your paper coordinated with your publication. bioRxiv is pretty carefully followed by the biological and computational biology communities.
  3. PeerJ (free) - takes a wide range of different types of papers. They will again review your preprint quickly and post it online. You can also post different versions of your manuscript with this system. This system is newer and so has fewer followers, so you will need to do your own publicity if you publish your paper here.

Journal preprint policies

This list provides information on which journals accept papers that were first posted as preprints. However, you shouldn’t

Publishing - further tips and issues

Open vs. closed access

Once your paper has been posted to a preprint server you need to submit it for publication. There are a number of considerations you should keep in mind when submitting papers. One of these considerations is closed versus open access. Closed access journals do not require you to pay to submit or publish your paper. But then people who want to read your paper either need to pay or have a subscription to the journal in question.

There has been a strong push for open access journals over the last couple of decades. There are some very good reasons justifying this type of publishing including (1) moral arguments based on using public funding for research, (2) ease of access to papers, and (3) benefits in terms of people being able to use your research. In general, most modern scientists want their work to be as widely accessible as possible. So modern scientists often opt for open access publishing.

Open access publishing does have a couple of disadvantages. First it is often expensive, with fees for publication ranging between $1,000 and $4,000 depending on the journal. Second, while science is often a calling, it is also a career. Sometimes the best journals in your field may be closed access. In general, one of the most important components of an academic career is being able to publish in journals that are read by a lot of people in your field so your work will be recognized and impactful.

However, modern systems make both closed and open access journals reasonable outlets.

Closed access + preprints

If the top journals in your field are closed access and you are a junior scientist then you should try to submit your papers there. But to make sure your papers are still widely accessible you can use preprints. First, you can submit a preprint before you submit the paper to the journal. Second, you can update the preprint to keep it current with the published version of your paper. This system allows you to make sure that your paper is read widely within your field, but also allows everyone to freely read the same paper. On your website, you can then link to both the published and preprint version of your paper.

Open access

If the top journal in your field is open access you can submit directly to that journal. Even if the journal is open access it makes sense to submit the paper as a preprint during the review process. You can then keep the preprint up-to-date, but this system has the advantage that the formally published version of your paper is also available for everyone to read without constraints.

Responding to referee comments

After your paper has been reviewed at an academic journal you will receive referee reports. If the paper has not been outright rejected, it is important to respond to the referee reports in a timely and direct manner. Referee reports are often maddening. There is little incentive for people to do a good job refereeing and the most qualified reviewers will likely be those with a conflict of interest.

The first thing to keep in mind is that the power in the refereeing process lies entirely with the editors and referees. The first thing to do when responding to referee reports is to eliminate the impulse to argue or respond with any kind of emotion. A step-by-step process for responding to referee reports is the following.

  1. Create a Google Doc. Put in all referee and editor comments in italics.
  2. Break the comments up into each discrete criticism or request.
  3. In bold respond to each comment. Begin each response with “On page xx we did yy to address this comment”
  4. Perform the analyses and experiments that you need to fill in the yy
  5. Edit the document to reflect all of the experiments that you have performed

By actively responding to each comment you will ensure you are responsive to the referees and give your paper the best chance of success. If a comment is incorrect or nonsensical, think about how you can edit the paper to remove the confusion that led to it.

Finishing

While I have advocated here for using preprints to disseminate your work, it is important to follow the process all the way through to completion. Responding to referee reports is drudgery and no one likes to do it. But in terms of career advancement preprints are almost entirely valueless until they are formally accepted for publication. It is critical to see all papers all the way through to the end of the publication cycle.

You aren’t done!

Publication of your paper is only the beginning of successfully disseminating your science. Once you have published the paper, it is important to use your social media, blog, and other resources to disseminate your results to the broadest audience possible. You will also give talks, discuss the paper with colleagues, and respond to requests for data and code. The most successful papers have a long half life and the responsibilities linger long after the paper is published. But the most successful scientists continue to stay on top of requests and respond to critiques long after their papers are published.

Note: Part of this chapter appeared in the Simply Statistics blog post: “Preprints are great, but post publication peer review isn’t ready for prime time”

A Natural Curiosity of How Things Work, Even If You're Not Responsible For Them

I just read Karl’s great post on what it means to be a data scientist. I can’t really add much to it, but reading it got me thinking about the Apollo 12 mission, the second moon landing.

This mission is actually famous because of its launch, where the Saturn V was struck by lightning and John Aaron (played wonderfully by Loren Dean in the movie Apollo 13), the flight controller in charge of environmental, electrical, and consumables (EECOM), had to make a decision about whether to abort the launch.

In this great clip from the movie Failure is Not An Option, the real John Aaron describes what makes for a good EECOM flight controller. The bottom line is that

A good EECOM has a natural curiosity for how things work, even if you…are not responsible for them

I think a good data scientist or statistician also has that property. The key part of that line is the “even if you are not responsible for them” part. I’ve found that a lot of being a statistician involves nosing around in places where you’re not supposed to be, seeing how data are collected, handled, managed, analyzed, and reported. Focusing on the development and implementation of methods is not enough.

Here’s the clip, which describes the famous “SCE to AUX” call from John Aaron: