24 Jan 2016
Hilary and I were snowed in over the weekend, so we recorded Episode 8 of Not So Standard Deviations. In this episode, Hilary and I talk about how to get your foot in the door with data science, the New England Journal’s view on data sharing, Google’s “Cohort Analysis”, and trying to predict a movie’s box office returns based on the movie’s script.
Subscribe to the podcast on iTunes.
Follow @NSSDeviations on Twitter!
Show notes:
Apologies for my audio on this episode. I had a bit of a problem calibrating my microphone. I promise to figure it out for the next episode!
Download the audio for this episode.
21 Jan 2016
I’m working on a new chapter for my R Programming book and the topic is parallel computation. So, I was happy to see this tweet from David Robinson (@drob) yesterday:
What does this have to do with parallel computation? Briefly, the code generates 5,000 standard normal random variates, repeats this 5,000 times and stores them in a 5,000 x 5,000 matrix (`x’). Then it computes x x’. The second part is key, because it involves a matrix multiplication.
Matrix multiplication in R is handled, at a very low level, by the library that implements the Basic Linear Algebra Subroutines, or BLAS. The stock R that you download from CRAN comes with what’s known as a reference implementation of BLAS. It works, it produces what everyone agrees are the right answers, but it is in no way optimized. Here’s what I get when I run this code on my Mac using Studio and the CRAN version of R for Mac OS X:
system.time({ x <- replicate(5e3, rnorm(5e3)); tcrossprod(x) })
user system elapsed
59.622 0.314 59.927
Note that the “user” time and the “elapsed” time are roughly the same. Note also that I use the tcrossprod() function instead of the otherwise equivalent expression x %*% t(x). Both crossprod() and tcrossprod() are generally faster than using the %*% operator.
Now, when I run the same code on my built-from-source version of R (version 3.2.3), here’s what I get:
system.time({ x <- replicate(5e3, rnorm(5e3)); tcrossprod(x) })
user system elapsed
14.378 0.276 3.344
Overall, it’s faster when I don’t run the code through RStudio (14s vs. 59s). Also on this version the elapsed time is about 1/4 the user time. Why is that?
The build-from-source version of R is linked to Apple’s Accelerate framework, which is a large library that includes an optimized BLAS library for Intel chips. This optimized BLAS, in addition to being optimized with respect to the code itself, is designed to be multi-threaded so that it can split work off into chunks and run them in parallel on multi-core machines. Here, the tcrossprod() function was run in parallel on my machine, and so the elapsed time was about a quarter of the time that was “charged” to the CPU(s).
David’s tweet indicated that when using Microsoft R Open, which is a custom built binary of R, that the (I assume?) elapsed time is 2.5 seconds. Looking at the attached link, it appears that Microsoft’s R Open is linked against Intel’s Math Kernel Library (MKL) which contains, among other things, an optimized BLAS for Intel chips. I don’t know what kind of computer David was running on, but assuming it was similarly high-powered as mine, it would suggest Intel’s MKL sees slightly better performance. But either way, both Accelerate and MKL achieve that speed up through custom-coding of the BLAS routines and multi-threading on multi-core systems.
If you’re going to be doing any linear algebra in R (and you will), it’s important to link to an optimized BLAS. Otherwise, you’re just wasting time unnecessarily. Besides Accelerate (Mac) and Intel MKL, theres AMD’s ACML library for AMD chips and the ATLAS library which is a general purpose tunable library. Also Goto’s BLAS is optimized but is not under active development.
14 Jan 2016
If you’ve ever wanted to know more about my Not So Standard Deviations co-host (and Johns Hopkins graduate) Hilary Parker, you can go check out the great profile of her on the American Statistical Association’s This Is Statistics web site.
What advice would you give to high school students thinking about majoring in statistics?
It’s such a great field! Not only is the industry booming, but more importantly, the disciplines of statistics teaches you to think analytically, which I find helpful for just about every problem I run into. It’s also a great field to be interested in as a generalist– rather than dedicating yourself to studying one subject, you are deeply learning a set of tools that you can apply to any subject that you find interesting. Just one glance at the topics covered on The Upshot or 538 can give you a sense of that. There’s politics, sports, health, history… the list goes on! It’s a field with endless possibility for growth and exploration, and as I mentioned above, the more I explore the more excited I get about it.
12 Jan 2016
The latest episode of Not So Standard Deviations is out, and boy does Hilary have a story to tell.
We also talk about Theranos and the pitfalls of diagnostic testing, Spotify’s Discover Weekly playlist generation algorithm (and the need for human product managers), and of course, a little Star Wars. Also, Hilary and I start a new segment where we each give some “free advertising” to something interesting that they think other people should know about.
Show Notes:
Download the audio for this episode.
11 Jan 2016
Jeff Leek, Brian Caffo, and I are doing a Reddit AMA TODAY at 3pm EST. We’re happy to answer questions about…anything…including our roles as Co-Directors of the Johns Hopkins Data Science Specialization as well as the Executive Data Science Specialization.
This is one of the few pictures of the three of us together.