Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

R Workshop: Reading in Large Data Frames

One question I get a lot is how to read large data frames into R. There are some useful tricks that can save you both time and memory when reading large data frames, but I find that many people are not aware of them. Of course, your ability to read data is limited by your available memory. I usually do a rough calculation along the lines of

# rows * # columns * 8 bytes / 2^20

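This gives you the approximate size of the data frame in megabytes, assuming every value takes 8 bytes as a numeric value does (roughly speaking, it could be less, for example with integer columns, which take 4 bytes per value). If this number is more than half the amount of memory on your computer, then you might run into trouble.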
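As a quick sketch of the rough calculation above, the arithmetic can be wrapped in a small helper. The function name ‘estimate_mb’ and the example dimensions are made up for illustration, and the 8 bytes per value assumes every column is numeric.

```r
# Rough size estimate in megabytes, assuming 8 bytes per value
# (i.e. every column is numeric). Hypothetical helper, not part of R.
estimate_mb <- function(n_rows, n_cols) {
  n_rows * n_cols * 8 / 2^20
}

# Example: 1,000,000 rows by 10 columns (made-up dimensions)
size_mb <- estimate_mb(1e6, 10)
print(round(size_mb, 1))  # about 76.3 MB
```

For a data frame that is already in memory, ‘object.size()’ reports the actual size, which is a handy way to check how close the estimate was.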

First, read the help page for ‘read.table’. It contains many hints for how to read in large tables. Of course, help pages tend to be a little confusing so I’ll try to distill the relevant details here.

The following options to ‘read.table()’ can affect R’s ability to read large tables:

colClasses

This option takes a vector whose length is equal to the number of columns in your table. Specifying this option instead of using the default can make ‘read.table’ run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are “numeric”, for example, then you can just set ‘colClasses = "numeric"’. If the columns are all different classes, or perhaps you just don’t know, then you can have R do some of the work for you.

You can read in just a few rows of the table and then create a vector of classes from just the few rows. For example, if I have a file called “datatable.txt”, I can read in the first 100 rows and determine the column classes from that:

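# Read only the first 100 rows to infer the column classes
tab100rows <- read.table("datatable.txt", header = TRUE, nrows = 100)
classes <- sapply(tab100rows, class)
# Re-read the full file with the classes supplied up front
tabAll <- read.table("datatable.txt", header = TRUE, colClasses = classes)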

Always try to use ‘colClasses’; it will make a very big difference. In particular, columns whose class is “character”, “integer”, “numeric”, or “logical” are handled most efficiently (because those are the basic classes).
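If you want to see how much ‘colClasses’ helps on your own machine, here is a minimal, self-contained sketch that times the two approaches. The temporary file, the column names, and the table size are all made up for illustration, and the timings will vary from machine to machine, so treat the numbers as a rough guide only.

```r
# Build a throwaway file so the example is self-contained
# (column names and sizes are made up for illustration).
set.seed(1)
n <- 50000
dat <- data.frame(id = seq_len(n),
                  x = rnorm(n),
                  label = sample(c("a", "b", "c"), n, replace = TRUE))
path <- tempfile(fileext = ".txt")
write.table(dat, path, row.names = FALSE)

# Infer the column classes from the first 100 rows
classes <- sapply(read.table(path, header = TRUE, nrows = 100), class)

# Time the read without and with colClasses
time_default <- system.time(tab_default <- read.table(path, header = TRUE))
time_typed <- system.time(
  tab_typed <- read.table(path, header = TRUE, colClasses = classes)
)

print(time_default["elapsed"])
print(time_typed["elapsed"])
```

One thing to watch for: classes guessed from the first 100 rows may not hold for the whole file. If a column looks like “integer” at the top but contains decimals further down, the full read with ‘colClasses’ may fail, so it is worth sanity-checking the guessed classes.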

nrows

Specifying the ‘nrows’ argument doesn’t necessarily make things go faster but it can help a lot with memory usage. R doesn’t know how many rows it’s going to read in so it first makes a guess, and then when it runs out of room it allocates more memory. The constant allocations can take a lot of time, and if R overestimates the amount of memory it needs, your computer might run out of memory. Of course, you may not know how many rows your table has. The easiest way to find this out is to use the ‘wc’ command in Unix. So if you run ‘wc datatable.txt’ in Unix, then it will report to you the number of lines in the file (the first number). This count includes the header line, if there is one, so it slightly overestimates the number of rows. You can then pass this number to the ‘nrows’ argument of ‘read.table()’. If you can’t use ‘wc’ for some reason, but you know that there are definitely fewer than, say, N rows, then you can specify ‘nrows = N’ and things will still be okay. A mild overestimate for ‘nrows’ is better than none at all.
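If ‘wc’ isn’t available (on Windows, say), one alternative is to count the lines from within R. Here is a minimal sketch that reads the file in chunks so the whole thing is never held in memory at once. The helper name ‘count_lines’, the chunk size, and the temporary file are my own illustrations, not anything built into R.

```r
# Count the lines in a text file, reading a chunk at a time.
# Hypothetical helper; the default chunk size is an arbitrary choice.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    n <- n + length(chunk)
  }
  n
}

# Throwaway file with a header line plus 250 data rows
path <- tempfile(fileext = ".txt")
write.table(data.frame(a = 1:250, b = runif(250)), path, row.names = FALSE)

# The count includes the header, so it is a mild overestimate of the rows
n_lines <- count_lines(path, chunk_size = 100L)
tab <- read.table(path, header = TRUE, nrows = n_lines)
print(c(lines = n_lines, rows = nrow(tab)))
```

Since ‘nrows’ is treated as a maximum, passing the full line count (header included) is harmless.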

comment.char

If your file has no comments in it (e.g. lines starting with ‘#’), then setting ‘comment.char = ""’ will sometimes make ‘read.table()’ run faster. In my experience, the difference is not dramatic.
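To put the pieces together, here is a small sketch of a single ‘read.table()’ call that sets all three options. The file is a throwaway one created just for the example, and the column names and classes are assumptions that go with that made-up file.

```r
# Throwaway file with two columns and no comment lines
dat <- data.frame(id = 1:1000, x = runif(1000))
path <- tempfile(fileext = ".txt")
write.table(dat, path, row.names = FALSE)

# Supply the column classes, the row count, and turn off comment handling
tab <- read.table(path, header = TRUE,
                  colClasses = c("integer", "numeric"),
                  nrows = 1000,
                  comment.char = "")
str(tab)
```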