Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Spectacular Plots Made Entirely in R

When doing data analysis, I often create a set of plots quickly just to explore the data and see what the general trends are. Later I go back and fiddle with the plots to make them look pretty for publication. But some people have taken this to the next level. Here are two plots made entirely in R:

The descriptions of how they were created are here and here.

Related: Check out Roger’s post on R colors and my post on APIs.

Caffo + Ninjas = Awesome

Our colleague Brian Caffo and his team of statistics ninjas won the “Imaging-Based Diagnostic Classification Contest” as part of the ADHD 200 Global Competition. From the prize citation:

The method developed by the team from Johns Hopkins University excelled in its specificity, or its ability to identify typically developing children (TDC) without falsely classifying them as ADHD-positive. They correctly classified 94% of TDC, showing that a diagnostic imaging methodology can be developed with a very low risk of false positives, a fantastic result. Their method was not as effective in terms of sensitivity, or its ability to identify true positive ADHD diagnoses. They only identified 21% of cases; however, among those cases, they discerned the subtypes of ADHD with 89.47% accuracy. Other teams demonstrated that there is ample room to improve sensitivity scores. 

Congratulations to Brian and his team!

Colors in R

One of my favorite R packages that I use all the time is the RColorBrewer package. The package has been around for a while now and is written/maintained by Erich Neuwirth. The guts of the package are based on Cynthia Brewer’s very cool work on the use of color in cartography (check out the colorbrewer web site).

As a side note, I think the ability to manipulate colors in plots/graphs/maps is one of R’s many great strengths. My personal experience is that getting the right color scheme can make a difference in how data are perceived in a plot.

The workhorse of RColorBrewer is one function, brewer.pal, which generates different types of color palettes. There are three types of palettes: sequential, diverging, and qualitative. Roughly speaking, sequential palettes are for continuous data where low is less important and high is more important, diverging palettes are for continuous data where both low and high are important (i.e. deviation from some reference point), and qualitative palettes are for categorical data where there is no logical order (e.g. male/female).
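As a rough sketch of the three types, the snippet below pulls one palette of each kind. The palette names “Blues”, “RdBu”, and “Set1” are just my picks for illustration; brewer.pal.info is a data frame that ships with the package and labels each palette as “seq”, “div”, or “qual” in its category column.

```r
library(RColorBrewer)

# One palette of each type (the palette names are illustrative choices)
seq_cols  <- brewer.pal(5, "Blues")  # sequential: light to dark
div_cols  <- brewer.pal(5, "RdBu")   # diverging: two hues around a neutral midpoint
qual_cols <- brewer.pal(5, "Set1")   # qualitative: distinct hues, no implied order

# brewer.pal.info records the type of every palette in its 'category' column
pal_types <- as.character(brewer.pal.info[c("Blues", "RdBu", "Set1"), "category"])
print(pal_types)
```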

To use the brewer.pal function, it’s often useful to combine it with another R function, colorRampPalette. This function is built into R and is part of the grDevices package. It takes a palette of colors and interpolates between the colors to give you an entire spectrum. Think of a painter’s palette with 4 or 5 color blotches on it, and then think of the painter taking a brush and blending the colors together. That’s what colorRampPalette does. So brewer.pal gives you the colors and colorRampPalette mashes them together. It’s a happy combination.

So, how do we use these functions? My basic approach is to first set the palette depending on the type of data. Suppose we have continuous sequential data and we want the “Blue-Purple” palette:

library(RColorBrewer)
colors <- brewer.pal(4, "BuPu")

Here, I’ve taken 4 colors from the “BuPu” palette, so there are now 4 blotches on my palette. To interpolate these colors, I can call colorRampPalette, which actually returns a function.

pal <- colorRampPalette(colors)

Now, pal is a function that takes a positive integer argument and returns that number of colors from the palette. So for example:

> pal(5)
[1] "#EDF8FB" "#C1D7E9" "#9FB1D4" "#8B80BB" "#88419D"

I got 5 different colors from the palette, with their red, green, and blue values coded in hexadecimal. If I wanted 20 colors I could have called pal(20).
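To make the hexadecimal coding concrete, here is a small sketch that rebuilds the same pal function, asks for 20 colors, and decodes one hex string back into its red, green, and blue values with col2rgb (from grDevices). The variable name cols20 is mine, just for illustration.

```r
library(RColorBrewer)

colors <- brewer.pal(4, "BuPu")
pal <- colorRampPalette(colors)

# Ask for 20 interpolated colors instead of 5
cols20 <- pal(20)
length(cols20)

# The ramp starts and ends at the first and last input colors
cols20[c(1, 20)]
colors[c(1, 4)]

# Decode a hex string into red, green, and blue values on a 0-255 scale
col2rgb("#EDF8FB")
```

Each pair of hex digits is one channel, so “ED”, “F8”, and “FB” are the red, green, and blue values 237, 248, and 251.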

The pal function is useful in other functions like image or wireframe (in the lattice package). In image, the ‘col’ argument can be given a set of colors generated by the pal function; in wireframe, the colors for a draped surface go in the ‘col.regions’ argument instead. For example, you could call

data(volcano)
image(volcano, col = pal(30))

and you would plot the ‘volcano’ data using 30 colors from the “BuPu” palette.
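For the lattice side, here is a minimal sketch of the same idea with wireframe. As far as I know, wireframe takes the surface colors through col.regions together with drape = TRUE; the choice of 100 colors is arbitrary.

```r
library(RColorBrewer)
library(lattice)

pal <- colorRampPalette(brewer.pal(4, "BuPu"))

# drape = TRUE colors the surface by height; col.regions supplies the colors
p <- wireframe(volcano, drape = TRUE, col.regions = pal(100))
print(p)
```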

If you’re wondering what all the different palettes are and what colors are in them, here’s a handy reference:

Or you can just call

display.brewer.all()
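If you only care about one palette, a sketch like the following looks up how many colors brewer.pal can return for it and displays just that palette. The name max_n is mine; display.brewer.pal and brewer.pal.info both come with RColorBrewer.

```r
library(RColorBrewer)

# Largest number of colors brewer.pal can return for "BuPu"
max_n <- brewer.pal.info["BuPu", "maxcolors"]

# Show just that one palette at its full size
display.brewer.pal(max_n, "BuPu")
```

That upper limit is one reason to pair brewer.pal with colorRampPalette when you need more colors than the palette itself provides.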

There’s been a lot of interesting work done on colors in R and this is just scratching the surface. I’ll probably return to this subject in a future post.

Competing through data: Three experts offer their game plan

Where would we be without Dennis Ritchie?

Most have probably seen this already since it happened a few days ago, but Dennis Ritchie died. It just blows my mind how influential his work was (developing the C language, Unix) and how so many pieces of technology bear his fingerprints.

My first encounter with K&R was in college when I learned C programming in the “Data Structures and Programming Techniques” class at Yale (taught by Stan “the man” Eisenstadt). Looking back, his book seems fairly easy to read and understand, but I must have cursed that book a million times when I took that course!