A nice presentation on regex in R
09 Oct 2011

Over at Recology there is a nice presentation on regular expressions. I found this on the R-bloggers site.
Here’s a claim for which I have absolutely no data: I believe I am more productive with a smaller screen/monitor. I have a 13-inch MacBook Air that I occasionally hook up to a 21-inch external monitor. Sometimes, when I want to read a document, I’ll hook up the external monitor so that I can see a whole page at a time. Other times, when I’m using R, I’ll have the graphics window on the external monitor and the R console and Emacs on the main screen.
But my feeling is that when I’ve got more monitor real estate I’m less productive. I think it’s because I have the freedom to open more windows and to have more things going on. When I’ve got my laptop, I can only really afford to have 1 or 2 windows open. So I’m more focused on whatever I’m supposed to be doing. I also think this is one of the (small) reasons that people like things like the iPad. It’s a single application/single window device.
A quick Google search will find some pretty crazy multiple-monitor setups out there. For some of them you’d think they were head of security at Los Angeles International Airport or something. And most people I know would scoff at the idea of working solely on your laptop while in the office. Partially, it’s an ergonomic issue. But maybe they just need an external monitor that’s 13 inches? I think I have one sitting in my basement somewhere….
One question I get a lot is how to read large data frames into R. There are some useful tricks that can save you both time and memory when reading large data frames, but I find that many people are not aware of them. Of course, your ability to read data is limited by your available memory. I usually do a rough calculation along the lines of
# rows * # columns * 8 bytes / 2^20
This gives you the number of megabytes of the data frame (roughly speaking, it could be less). If this number is more than half the amount of memory on your computer, then you might run into trouble.
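For example (an illustrative calculation with made-up dimensions, not data from a real file), you can do the arithmetic directly in R:

n_rows <- 1500000  ## hypothetical number of rows
n_cols <- 120      ## hypothetical number of columns
n_rows * n_cols * 8 / 2^20  ## roughly 1373 MB for an all-numeric data frame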
First, read the help page for ‘read.table’. It contains many hints for how to read in large tables. Of course, help pages tend to be a little confusing so I’ll try to distill the relevant details here.
The following options to ‘read.table()’ can affect R’s ability to read large tables:
colClasses
This option takes a vector whose length is equal to the number of columns in your table. Specifying this option instead of using the default can make ‘read.table’ run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are “numeric”, for example, then you can just set ‘colClasses = “numeric”’. If the columns are all different classes, or perhaps you just don’t know, then you can have R do some of the work for you.
You can read in just a few rows of the table and then create a vector of classes from just the few rows. For example, if I have a file called “datatable.txt”, I can read in the first 100 rows and determine the column classes from that:
tab5rows <- read.table("datatable.txt", header = TRUE, nrows = 100)  ## read a small sample of rows
classes <- sapply(tab5rows, class)  ## determine the class of each column
tabAll <- read.table("datatable.txt", header = TRUE, colClasses = classes)  ## read the full table with known classes
Always try to use ‘colClasses’; it will make a very big difference. In particular, if one of the column classes is “character”, “integer”, “numeric”, or “logical”, then things will be optimal (because those are the basic classes).
nrows
Specifying the ‘nrows’ argument doesn’t necessarily make things go faster, but it can help a lot with memory usage. R doesn’t know how many rows it’s going to read in, so it first makes a guess, and then when it runs out of room it allocates more memory. These repeated allocations can take a lot of time, and if R overestimates the amount of memory it needs, your computer might run out of memory. Of course, you may not know how many rows your table has. The easiest way to find out is to use the ‘wc’ command in Unix: if you run ‘wc datafile.txt’, it will report the number of lines in the file (the first number). You can then pass this number to the ‘nrows’ argument of ‘read.table()’. If you can’t use ‘wc’ for some reason, but you know that there are definitely fewer than, say, N rows, then you can specify ‘nrows = N’ and things will still be okay. A mild overestimate for ‘nrows’ is better than none at all.
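As a rough sketch (assuming a Unix-like system and the same hypothetical “datatable.txt” file as above), the line count from ‘wc’ can be captured and passed straight to ‘read.table()’:

out <- system("wc -l datatable.txt", intern = TRUE)  ## e.g. "  1000001 datatable.txt"
nlines <- as.integer(strsplit(trimws(out), "\\s+")[[1]][1])  ## first field is the line count
tab <- read.table("datatable.txt", header = TRUE, nrows = nlines)  ## a mild overestimate (it includes the header line) is fine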
comment.char
If your file has no comments in it (e.g. lines starting with ‘#’), then setting ‘comment.char = “”’ will sometimes make ‘read.table()’ run faster. In my experience, the difference is not dramatic.
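Putting the pieces together (reusing the hypothetical “datatable.txt” file and the ‘classes’ and ‘nlines’ objects from the sketches above), a full call might look like this:

tabAll <- read.table("datatable.txt", header = TRUE,
                     colClasses = classes, nrows = nlines,
                     comment.char = "")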
I just found this really cool paper on the phenomenon of the “hot hand” in sports. The idea behind the “hot hand” (also called the “clustering illusion”) is that success breeds success. In other words, when you are successful (you win games, you make free throws, you get hits), you will continue to be successful. In sports, it has frequently been observed that successive events are close to independent, meaning that the “hot hand” is just an illusion.
In the paper, the authors downloaded all the data on NBA free throws for the 2005/2006 through the 2009/2010 seasons. They cleaned up the data, then analyzed changes in conditional probability. Their analysis suggested that free throw success was not an independent event. They go on to explain:
However, while statistical traces of this phenomenon are observed in the data, an open question still remains: are these non random patterns a result of “success breeds success” and “failure breeds failure” mechanisms or simply “better” and “worse” periods? Although free throws data is not adequate to answer this question in a definite way, we speculate based on it, that the latter is the dominant cause behind the appearance of the “hot hand” phenomenon in the data.
The things I like about the paper are that the authors explain things very simply, use a lot of real data that they obtained themselves, and are very careful in their conclusions.