Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

New Feather Format for Data Frames

By Roger Peng

This past Tuesday, Hadley Wickham and Wes McKinney announced a new binary file format specifically for storing data frames.

One thing that struck us was that, while R’s data frames and Python’s pandas data frames utilize different internal memory representations, the semantics of their user data types are mostly the same. In both R and pandas, data frames contain lists of named, equal-length columns, which can be numeric, boolean, and date-and-time, categorical (factors), or string. Additionally, these columns must support missing (null) values.

Their work builds on the Apache Arrow project, which specifies a format for tabular data. There is currently a Python and R implementation for reading/writing these files but other implementations could easily be built as the file format looks pretty straightforward. The git repository is here.

Initial thoughts:

  • The possibilities for passing data between languages is I think the main point here. The potential for passing data through a pipeline without worrying about the specifics of different languages could make for much more powerful analyses where different tools are used for whatever they tend to do best. Essentially, as long as data can be made tidy going in and coming out, there should not be a communication issue between languages.

  • R users might be wondering what the big deal is–we already have a binary serialization format (XDR). But R’s serialization format is meant to cover all possible R objects. Feather’s focus on data frames allows for the removal of many of the annoying (but seldom used) complexities of R objects and optimizing a very commonly used data format.

  • In my testing, there’s a noticeable speed difference between reading a feather file and reading an (uncompressed) R workspace file (feather seems about 2x faster). I didn’t time writing files, but the difference didn’t seem as noticeable there. That said, it’s not clear to me that performance on files is the main point here.

  • Given the underlying framework and representation, there seem to be some interesting possibilities for low-memory environments.

I’ve only had a chance to quickly look at the code but I’m excited to see what comes next.