13 Feb 2014
Local regression (loess) is one of the statistical procedures I use most. Here is a movie showing how it works.
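If you want to play with it yourself, here is a minimal sketch in base R (not the code behind the movie, just a loess fit to simulated data):

```r
# Minimal loess illustration on simulated data (base R only).
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)            # noisy signal

fit <- loess(y ~ x, span = 0.3, degree = 2)   # local quadratic fits using 30% of the data per window

plot(x, y, col = "grey", pch = 16, main = "Local regression (loess)")
lines(x, predict(fit), lwd = 2)               # the fitted smooth curve
```

Shrinking the span makes the fit wigglier; increasing it smooths the curve out.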
10 Feb 2014
I’m going to try Mondays for the links. Let me know what you think.
- The Guardian is reading our blog. A week after Rafa posts that everyone should learn to code for career preparedness, the Guardian gets on the bandwagon.
- Nature Methods published a paper on a webtool for creating boxplots (via Simina B.). The nerdrage rivaled the quilt plot. I’m not opposed to papers like this being published; in fact, it is an important part of making sure we don’t miss out on the good software when it comes. There are two important things to keep in mind, though: (a) Nature Methods grades on a heavy “innovative” curve, which makes it pretty hard to publish papers there, so publishing papers like this could cause frustration among people who would submit there; and (b) if you use boxplots made with this tool, you must cite the relevant software that generated the boxplot.
- This story about Databall (via Rafa) is great; I love the way it talks about statisticians as the leaders on a new data type. I also enjoyed reading the paper the story is about. One interesting thing about that paper, and many of the papers at the Sloan Sports Conference, is that the data are proprietary (via Chris V.), so the code/data/methods are not available for most papers (including this one). In the short term this isn’t a big deal; the papers are fun to read. In the long term, it will dramatically slow progress. It is almost always a bad long-term strategy to make data private if the goal is to maximize value.
- The P-value curve for fixing publication bias (via Rafa). I think it is an interesting idea, very similar to our approach for the science-wise false discovery rate. People who liked our paper will like the P-value curve paper. People who hated our paper for the uniformity-under-the-null assumption will hate that paper for the same reason (via David S.).
- Hopkins discovers bones are the best (via Michael R.).
- Awesome scientific diagrams in TeX. Some of these are ridiculous.
- Mary Carillo goes crazy on backyard badminton. This is awesome. If you love the Olympics and the internet, you will love this (via Hilary P.).
- B’more Biostats has been on a tear lately. I’ve been reading [Leo’s post](http://lcolladotor.github.io/2014/02/05/DropboxAndGoogleDocsFromR/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+FellgernonBit+%28Fellgernon+Bit%29) on uploading files to Dropbox/Google Drive from R, Mandy’s post explaining quantitative MRI, Yenny’s post on data sciences, John’s post on graduate school open houses, and Alyssa’s post on vectorization. If you like Simply Stats you should be following them on Twitter here.
05 Feb 2014
Today I was thinking about reviewing, probably because I was handling a couple of papers as an AE and doing tasks associated with reviewing several other papers. I know this is idle thinking, but suppose peer review were just a drop-down ranking with these six questions.
- How close is this paper to your area of expertise?
- Does the paper appear to be technically right?
- Does the paper use appropriate statistics/computing?
- Is the paper interesting to people in your area?
- Is the paper interesting to a broad audience?
- Are the appropriate data and code available?
Each question would be rated on a 1-5 star scale: 1 star = completely inadequate, 3 stars = acceptable, 5 stars = excellent. There would be an optional comments box used only for major/interesting thoughts, and anything that got above 3 stars on questions 2, 3, and 6 would be published. Incidentally, you could do this for free on GitHub if the papers were written in Markdown, which would reduce the substantial costs of open-access publishing.
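Just to make the idea concrete, here is a hypothetical sketch in R of the publication rule described above (the ratings are made up):

```r
# Hypothetical sketch of the proposed rule: publish anything rated above
# 3 stars on questions 2, 3, and 6 (technically right, appropriate
# statistics/computing, data and code available). Ratings are invented.
review <- c(expertise = 4, technically_right = 4, statistics_computing = 5,
            interest_area = 3, interest_broad = 2, data_code_available = 4)

publish <- all(review[c("technically_right", "statistics_computing",
                        "data_code_available")] > 3)
publish
#> [1] TRUE
```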
No doubt peer review would happen faster this way. I was wondering, would it be any worse?
04 Feb 2014
One of the nice things about developing 9 new courses for the JHU Data Science Specialization in a short period of time is that you get to learn all kinds of cool and interesting tools. One of the ways that we were able to push out so much content in just a few months was that we did most of the work ourselves, rather than outsourcing things like video production and editing. You could argue that this results in a poorer quality final product but (a) I disagree; and (b) even if that were true, I think the content is still valuable.
The advantage of learning all the tools was that it allowed for a quick turnaround from the creation of the lecture to the final exporting of the video (often within a single day). For a hectic schedule, it’s nice to be able to write slides in the morning, record some video in between two meetings in the afternoon, and then combine/edit all the video in the evening. Then if you realize something doesn’t work, you can start over the next day and have another version done in less than 24 hours.
I thought it might be helpful to someone out there to detail the workflow and tools that I use to develop the content for my online courses.
- I use Camtasia for Mac to do all my screencasting/recording. This is a nice tool and I think it has more features than your average screen recorder. That said, if you just want to record your screen on your Mac, you can actually use the built-in QuickTime software. I used to do all of my video editing in Camtasia, but now it’s pretty much glorified screencasting software for me.
- For talking head type videos I use my iPhone 5S mounted on a tripod. The iPhone produces surprisingly good 1080p HD 30 fps video and is definitely sufficient for my purposes (see here for a much better example of what can be done). I attach the phone to an Apogee microphone to pick up better sound. For some of the interviews that we do on Simply Statistics I use two iPhones (a 5S and a 4S, my older phone).
- To record my primary sound (i.e. me talking), I use the Zoom H4N portable recorder. This thing is not cheap, but it records very high-quality stereo sound. I can connect it to my computer via USB or it can record to an SD card.
- For simple sound recording (no video or screen) I use Audacity.
- All of my lecture videos are run through Final Cut Pro X on my 15-inch MacBook Pro with Retina Display. Videos from Camtasia are exported in Apple ProRes format and then imported into Final Cut. Learning FCPX is not for the faint of heart if you’re not used to a nonlinear editor (as I was not). I bought this excellent book to help me learn it, but I still probably only use 1% of the features. In the end, using a real editor was worth it because it makes merging multiple videos much easier (i.e. multicam shots for screencasts + talking head) and editing out mistakes (e.g. typos on slides) much simpler. The editor in Camtasia is pretty good, but if you have more than one camera/microphone it becomes infeasible.
- I have an 8TB Western Digital Thunderbolt drive to store the raw video for all my classes (and some backups). I also use two 1TB Thunderbolt drives to store video for individual classes (each 4-week class borders on 1TB of raw video). These smaller drives are nice because I can just throw them in my bag and edit video at home or on the weekend if I need to.
- Finished videos are shared with a Dropbox for Business account so that Jeff, Brian, and I can all look at each other’s stuff. Videos are exported to H.264/AAC and uploaded to Coursera.
- For developing slides, Jeff, Brian, and I have standardized around using Slidify. The beauty of using Slidify is that it lets you write everything in Markdown, a super simple text format. It also makes it simpler to manage all the course material in Git/GitHub because you don’t have to lug around huge PowerPoint files. Everything is a lightweight text file. And thanks to Ramnath’s incredible grit and moxie, we have handy tools to easily export everything to PDF and HTML slides (HTML slides hosted via GitHub Pages).
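For anyone curious, here is a rough sketch of what the Slidify part of that workflow looks like (the deck name is a placeholder; see the Slidify documentation for installation and publishing details):

```r
# Rough sketch of a Slidify workflow; assumes the slidify package is installed.
library(slidify)

author("mydeck")        # scaffold a new deck and open index.Rmd for editing
# ...write the slides as Markdown in index.Rmd...
slidify("index.Rmd")    # compile the Markdown into an HTML slide deck
# The generated HTML can then be pushed to GitHub Pages for hosting.
```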
The first courses for the Data Science Specialization start on April 7th. Don’t forget to sign up!
03 Feb 2014
Collaborations between biologists and statisticians are very common in genomics. For the data analysis to be fruitful, the statistician needs to understand what samples are being analyzed. For the analysis report to make sense to the biologist, it needs to be properly annotated with information such as gene names, genomic location, etc. In a recent post, Jeff laid out his guide for such collaborations; here I describe an approach that has helped me in mine.
In many of my past collaborations, sharing the experiment’s key information in a way that facilitates data analysis turned out to be more time consuming than the analysis itself. One reason is that the data producers annotated samples in ways that were impossible to decipher without direct knowledge of the experiment (e.g. using lab-specific codes in the filenames, or colors in Excel files). In the early days of microarrays, a group of researchers noticed this problem and created a markup language to describe and communicate information about experiments electronically.
The Bioconductor project took a less ambitious approach and created classes specifically designed to store the minimal information needed to perform an analysis. These classes can be thought of as three tables, stored as flat text files, all of which are relatively easy for biologists to create.
The first table contains the experimental data with rows representing features (e.g. genes) and the columns representing samples.
The second table contains the sample information, with one row for each column of the experimental data table and at least two columns. The first column contains an identifier that can be used to match the rows of this table to the columns of the first table. The second contains the main outcome of interest, e.g. case or control, cancer or normal. Other commonly included columns are the filename of the original raw data associated with each row, the date the experiment was processed, and other information about the samples.
The third table contains the feature information, with one row for each row of the experimental data table and at least two columns. The first column contains an identifier that can be used to match the rows of this table to the rows of the first table. The second contains an annotation that makes sense to biologists, e.g. a gene name. For technologies that are widely used (e.g. Affymetrix gene expression arrays) these tables are readily available.
With these three relatively simple files in place, less time is spent “figuring out” the data: the statisticians can focus their energy on data analysis while the biologists can focus theirs on interpreting the results. This approach seems to have been the inspiration for the MAGE-TAB format.
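As a concrete sketch, here is roughly how the three tables can be assembled into a Bioconductor ExpressionSet (the file names and identifiers below are made up for illustration):

```r
# Sketch: loading the three tables and building an ExpressionSet.
# File names and column contents are hypothetical.
library(Biobase)

exprs_mat <- as.matrix(read.table("expression.txt", header = TRUE,
                                  row.names = 1))         # table 1: features x samples
pdata <- read.table("sample_info.txt", header = TRUE,
                    row.names = 1)                         # table 2: one row per sample
fdata <- read.table("feature_info.txt", header = TRUE,
                    row.names = 1)                         # table 3: one row per feature

# The identifiers must line up: sample IDs link table 2 to the columns of
# table 1, and feature IDs link table 3 to the rows of table 1.
stopifnot(identical(rownames(pdata), colnames(exprs_mat)),
          identical(rownames(fdata), rownames(exprs_mat)))

eset <- ExpressionSet(assayData   = exprs_mat,
                      phenoData   = AnnotatedDataFrame(pdata),
                      featureData = AnnotatedDataFrame(fdata))
```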
Note that with newer technologies, statisticians prefer to get access to the raw data. In this case, instead of an experimental data table (table 1), they will want the original raw data files. The sample information table must then contain a column with the filenames so that sample annotation can be properly matched.
NB: These three tables are not a complete description of an experiment and are not intended as an alternative to standards such as MAGE and MIAME. But in many cases, they provide the very minimum information needed to carry out a rudimentary analysis. Note that Bioconductor provides tools to import information from MAGE-ML and other related formats.