Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Statistical Theory is our "Write Once, Run Anywhere"

Having followed the software industry as a casual bystander, I periodically see the tension flare up between the idea of writing “native apps”, software that is tuned to a particular platform (Windows, Mac, etc.), and more cross-platform apps, which run on many platforms without too much modification. Over the years it has come up in many different forms, but the fundamentals are the same. Back in the day, there was Java, which was supposed to be the platform that ran on any computing device. Sun Microsystems originated the phrase “Write Once, Run Anywhere” to illustrate the cross-platform strengths of Java. More recently, Steve Jobs famously banned Flash from any iOS device. Apple is also moving away from standards like OpenGL and towards its own Metal platform.

What’s the problem with “write once, run anywhere”, or with cross-platform development more generally, assuming it’s possible? Well, there are a number of issues: often there are performance penalties, it may be difficult to use the native look and feel of a platform, and you may be reduced to using the “lowest common denominator” of feature sets. It seems to me that anytime a new meta-platform comes out that promises to relieve programmers of the burden of having to write for multiple platforms, it eventually gets modified or subsumed by the need to optimize apps for a given platform as much as possible. The chance to squeeze as much juice as possible out of an app seems to be too important an opportunity to pass up.

In statistics, theory and theorems are our version of “write once, run anywhere”. The basic idea is that theorems provide an abstract layer (a “virtual machine”) that allows us to reason across a large number of specific problems. Think of the central limit theorem, probably our most popular theorem. It could be applied to any problem/situation where you have a notion of sample size that could in principle be increasing.

But can it be applied to every situation, or even any situation? This might be more of a philosophical question, given that the CLT is stated asymptotically (maybe we’ll find out the answer eventually). In practice, my experience is that many people attempt to apply it to problems where it likely is not appropriate. Think of large-scale studies with a sample size of 10. Many people will use Normal-based confidence intervals in those situations, but those intervals probably have very poor coverage.
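
Here is a rough simulation sketch, in base R, of what that poor coverage can look like; the skewed (exponential) distribution and the specific numbers are just for illustration.

    ## Sketch: coverage of a Normal-based 95% CI with n = 10 and skewed data
    set.seed(42)
    n <- 10; nsim <- 10000; true_mean <- 1
    covered <- replicate(nsim, {
      x <- rexp(n, rate = 1)                                    # skewed data, true mean = 1
      ci <- mean(x) + c(-1, 1) * qnorm(0.975) * sd(x) / sqrt(n)
      ci[1] <= true_mean && true_mean <= ci[2]
    })
    mean(covered)                                               # empirical coverage, typically well under 0.95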

Because the CLT doesn’t apply in many situations (small samples, dependent data, etc.), variations of the CLT have been developed, as well as entirely different approaches to achieving the same ends, like confidence intervals, p-values, and standard errors (think bootstrap, jackknife, permutation tests). While the CLT can provide beautiful insight in a large variety of situations, in reality one must often resort to a custom solution when analyzing a given dataset or problem. This should be a familiar conclusion to anyone who analyzes data. The promise of “write once, run anywhere” is always tantalizing, but the reality never seems to meet that expectation.
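
As one concrete example of such a custom solution, here is a rough sketch of a bootstrap percentile interval for a mean in base R; it makes no Normality assumption, at the cost of more computation.

    ## Sketch: bootstrap percentile 95% CI for the mean, no Normal assumption
    set.seed(1)
    x <- rexp(10, rate = 1)                          # a small, skewed sample
    boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))
    quantile(boot_means, c(0.025, 0.975))            # percentile interval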

Ironically, if you look across history and all programming languages, probably the most “cross-platform” language is C, which was originally considered to be too low-level to be broadly useful. C programs run on basically every existing platform and the language has been completely standardized so that compilers can be written to produce well-defined output. The keys to C’s success, I think, are that it’s a very simple/small language that gives enormous (sometimes dangerous) power to the programmer, and that an enormous toolbox (compiler toolchains, IDEs) has been developed over time to help developers write applications on all platforms.

In a sense, we need “compilers” that can help us translate statistical theory for specific data analysis problems. In many cases, I’d imagine the compiler would “fail”, meaning the theory was not applicable to that problem. This would be a Good Thing, because right now we have no way of really enforcing the appropriateness of a theorem for specific problems.

More practically (perhaps), we could develop data analysis pipelines that could be applied to broad classes of data analysis problems. Then a “compiler” could be employed to translate the pipeline so that it worked for a given dataset/problem/toolchain.
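
To make the idea a bit more concrete, here is a purely hypothetical sketch in R of what such a “compiler” check could look like. The function name and the crude sample-size rule are invented for illustration; a real check would have to be far more careful.

    ## Hypothetical sketch of a "theory compiler": refuse to emit a Normal-based
    ## interval unless the CLT's preconditions plausibly hold.
    clt_ci <- function(x, conf = 0.95, min_n = 30) {
      if (length(x) < min_n)
        stop("CLT check failed: sample size too small for a Normal approximation")
      if (any(is.na(x)))
        stop("CLT check failed: missing values present")
      se <- sd(x) / sqrt(length(x))
      mean(x) + c(-1, 1) * qnorm(1 - (1 - conf) / 2) * se
    }

    clt_ci(rnorm(100))    # "compiles" and returns an interval
    ## clt_ci(rnorm(10))  # "fails to compile" with an error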

The key point is to recognize that there is a “translation” process that occurs when we use theory to justify certain data analysis actions, but this translation process is often not well documented or even thought through. Having an explicit “compiler” for this would help us to understand the applicability of certain theorems and may serve to prevent bad data analysis from occurring.

Autonomous killing machines won't look like the Terminator...and that is why they are so scary

Just a few days ago, many of the most incredible minds in science and technology urged governments to avoid using artificial intelligence to create autonomous killing machines. One thing that always happens when such a warning is issued is that you see the inevitable Terminator picture:

[Image: Terminator]

The reality is that robots that walk and talk are getting better but still have a ways to go:

Does this mean I think all those really smart people are silly for making this plea about AI now? No, I think they are probably just in time.

The reason is that the first autonomous killing machines will definitely not look anything like the Terminator. They will more likely than not be drones, which are already in widespread use by the military and will soon be flying over our heads delivering Amazon products.

[Image: drone]

I also think that when people hear “artificial intelligence” they think about robots that can mimic the behaviors of a human being, including the ability to talk, hold a conversation, or pass the Turing test. But it turns out that the “artificial intelligence” you would need to create an automated killing system is much, much simpler than that and is mostly some basic data science. The things you would need are:

  1. A drone with the ability to fly on its own
  2. The ability to make decisions about what people to target
  3. The ability to find those people and attack them

The first issue, being able to fly on autopilot, is something that has existed for a while. You have probably flown on a plane that has used autopilot for at least some of the flight. I won’t get into the details on this one because I think it is the least interesting - it has been around for a while and we didn’t get dire warnings about autonomous agents then.

The second issue, deciding which people to target, is already being addressed as well. We have already seen programs like PRISM and others that collect individual-level metadata and presumably use it to make predictions. While the true and false positive rates are probably thrown off by the fact that there are very, very few “true positives”, these programs are being developed, and even relatively simple statistical models can be used to build a predictor - even if those predictions don’t work well.

The third issue is being able to find those people and attack them. This is where the real “artificial intelligence” comes into play. But it isn’t artificial intelligence like you might think about. It could be as simple as having the drone fly around and take people’s pictures. Then we could use those pictures to match up with the people identified through metadata and attack them. Facebook has a [paper](file:///Users/jtleek/Downloads/deepface.pdf) that demonstrates an algorithm that can identify people with near human-level accuracy. This approach is based on something called deep neural nets, which sounds very intimidating but is actually just a set of nested nonlinear logistic regression models. These models have gotten very good because (a) we are getting better at fitting them mathematically and computationally, but mostly (b) we have much more data to train them with than we ever did before. The speed at which this part of the process is developing is (I think) why there is so much recent concern about potentially negative applications like autonomous killing machines.
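
To see why “nested nonlinear logistic regression models” is a fair description, here is a toy sketch of a two-layer network written as nested logistic (sigmoid) functions in R; the weights are made-up numbers for illustration, not a trained model.

    ## Toy sketch: a two-layer neural net as nested logistic regressions
    sigmoid <- function(z) 1 / (1 + exp(-z))

    x  <- c(0.2, -1.3, 0.7)                                     # one input with three features
    W1 <- matrix(c(0.5, -0.3, 1.2, 0.8, -1.1, 0.4), nrow = 2)   # made-up first-layer weights (2 hidden units)
    b1 <- c(0.1, -0.2)
    w2 <- c(1.5, -0.8); b2 <- 0.05                              # made-up second-layer weights

    h <- sigmoid(W1 %*% x + b1)                                 # hidden layer: logistic regressions on x
    sigmoid(sum(w2 * h) + b2)                                   # output: a logistic regression on h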

The scary thing is that these technologies could be combined *right now* to create a system that is not controlled directly by humans but makes automated decisions and flies drones to carry out those decisions. The technology for shrinking these types of deep neural net systems is so good that they can already be made simple enough to run on a phone for things like language translation, and they could easily be embedded in a drone.

So I am with Musk, Hawking, and others who would urge caution by governments in developing these systems. Just because we can make it doesn’t mean it will do what we want. Just look at how well Facebook/Amazon/Google make suggestions for “other things you might like” to get an idea about how potentially disastrous automated killing systems could be.

Announcing the JHU Data Science Hackathon 2015

We are pleased to announce that the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health will be hosting the first ever JHU Data Science Hackathon (DaSH) on September 21-23, 2015 at the Baltimore Marriott Waterfront.

This event will be an opportunity for data scientists and data scientists-in-training to get together and hack on real-world problems collaboratively and to learn from each other. The DaSH will feature data scientists from government, academia, and industry presenting problems and describing challenges in their respective areas. There will also be a number of networking opportunities where attendees can get to know each other. We think this will be a fun event and we encourage people from all areas, including students (graduate and undergraduate), to attend.

To get more details and to sign up for the hackathon, you can go to the DaSH web site. We will be posting more information as the event nears.

Organizers:

  • Jeff Leek
  • Brian Caffo
  • Roger Peng
  • Leah Jager

Funding:

  • National Institutes of Health
  • Johns Hopkins University

stringsAsFactors: An unauthorized biography

Recently, I was listening in on the conversation of some colleagues who were discussing a bug in their R code. The bug was ultimately traced back to the well-known phenomenon that functions like ‘read.table()’ and ‘read.csv()’ in R convert columns that are detected to be character/strings to be factor variables. This led to the spontaneous outcry from one colleague of

Why does stringsAsFactors not default to FALSE????

The argument ‘stringsAsFactors’ is an argument to the ‘data.frame()’ function in R. It is a logical that indicates whether strings in a data frame should be treated as factor variables or as just plain strings. The argument also appears in ‘read.table()’ and related functions because of the role these functions play in reading in table data and converting them to data frames. By default, ‘stringsAsFactors’ is set to TRUE.
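For a quick illustration of the behavior described above (with the defaults as they stood when this was written):

    ## With the default stringsAsFactors = TRUE (at the time of writing):
    df <- data.frame(id = 1:3, name = c("alpha", "beta", "gamma"))
    class(df$name)          # "factor"

    ## Opting out, per call:
    df2 <- data.frame(id = 1:3, name = c("alpha", "beta", "gamma"),
                      stringsAsFactors = FALSE)
    class(df2$name)         # "character"
    ## The same argument works in read.table()/read.csv()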

This argument dates back to May 20, 2006 when it was originally introduced into R as the ‘charToFactor’ argument to ‘data.frame()’. Soon afterwards, on May 24, 2006, it was changed to ‘stringsAsFactors’ to be compatible with S-PLUS by request from Bill Dunlap.

Most people I talk to today who use R are completely befuddled by the fact that ‘stringsAsFactors’ is set to TRUE by default. First of all, it should be noted that before the ‘stringsAsFactors’ argument even existed, the behavior of R was to coerce all character strings to be factors in a data frame. If you didn’t want this behavior, you had to manually coerce each column to be character.

So here’s the story:

In the old days, when R was primarily being used by statisticians and statistical types, setting strings to be factors made total sense. In most tabular data, if there were a column of the table that was non-numeric, it almost certainly encoded a categorical variable. Think sex (male/female), country (U.S./other), region (east/west), etc. In R, categorical variables are represented by ‘factor’ vectors and so character columns got converted to factor.

Why do we need factor variables to begin with? Because of modeling functions like ‘lm()’ and ‘glm()’. Modeling functions need to expand categorical variables into individual dummy variables, so that a categorical variable with 5 levels will be expanded into 4 different columns in your modeling matrix. There’s no way for R to know it should do this unless it has some extra information in the form of the factor class. From this point of view, setting ‘stringsAsFactors = TRUE’ when reading in tabular data makes total sense. If the data is just going to go into a regression model, then R is doing the right thing.
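
A minimal base-R illustration of that expansion:

    ## A 5-level factor expands into 4 dummy columns (plus the intercept)
    ## in the model matrix used by lm()/glm().
    region <- factor(c("north", "south", "east", "west", "central"))
    model.matrix(~ region)   # 5 rows: intercept + 4 dummies (one level is the baseline)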

There’s also a more obscure reason. Factor variables are encoded as integers in their underlying representation. So a variable like “disease” and “non-disease” will be encoded as 1 and 2 in the underlying representation. Roughly speaking, since integers only require 4 bytes on most systems, the conversion from string to integer actually saved some space for long strings. All that had to be stored were the integer codes and the level labels. That way you didn’t have to repeat the strings “disease” and “non-disease” for as many observations as you had, which would have been wasteful.
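
You can see that integer representation directly:

    ## Factors store one integer code per observation, plus the level labels once.
    status <- factor(c("disease", "non-disease", "disease", "disease"))
    typeof(status)      # "integer"
    unclass(status)     # codes 1 2 1 1, with levels "disease" and "non-disease"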

Around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code thanks to Seth Falcon. What this meant was that effectively, character strings were hashed to an integer representation and stored in a global table in R. Anytime a given string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factor. Of course, you still needed to use ‘factors’ for the modeling functions.

The difference nowadays is that R is being used by a very wide variety of people doing all kinds of things the creators of R never envisioned. This is, of course, wonderful, but it introduces lots of use cases that were not originally planned for. I find that most often, the people complaining about ‘stringsAsFactors’ not being FALSE are people who are doing things that are not the traditional statistical modeling things (things that old-time statisticians like me used to do). In fact, I would argue that if you’re upset about ‘stringsAsFactors = TRUE’, then it’s a pretty good indicator that you’re either not a statistician by training, or you’re doing non-traditional statistical things.

For example, in genomics, you might have the names of the genes in one column of data. It really doesn’t make sense to encode these as factors because they won’t be used in any modeling function. They’re just labels, essentially. And because of CHARSXP hashing, you don’t gain anything from an efficiency standpoint by converting them to factors either.

But of course, given the long-standing behavior of R, many people depend on the default conversion of characters to factors when reading in tabular data. Changing this default would likely result in an equal number of people complaining about ‘stringsAsFactors’.

I fully expect that this blog post will now make all R users happy. If you think I’ve missed something from this unauthorized biography, please let me know on Twitter (@rdpeng).

The statistics department Moneyball opportunity

Moneyball is a book and a movie about Billy Beane. It makes statisticians look awesome and I loved the movie. I loved it so much I’m putting the movie trailer right here:

The basic idea behind Moneyball was that the Oakland Athletics were able to build a very successful baseball team on a tight budget by valuing skills that many other teams undervalued. In baseball those skills were things like on-base percentage and slugging percentage. By correctly valuing these skills and their impact on a team’s winning percentage, the A’s were able to build one of the most successful regular season teams on a minimal budget. This graph shows what an outlier they were, from a nice FiveThirtyEight analysis.

[Image: oakland]

I think that the data science/data analysis revolution that we have seen over the last decade has created a similar Moneyball opportunity for statistics and biostatistics departments. Traditionally in these departments the highest-value activities have been publishing in a select number of important statistics journals (JASA, JRSS-B, Annals of Statistics, Biometrika, Biometrics, and more recently journals like Biostatistics and Annals of Applied Statistics). But there are some hugely valuable ways to contribute to statistics/data science that don’t necessarily end with papers in those journals, such as:

  1. Creating good, well-documented, and widely used software
  2. Being primarily an excellent collaborator who brings in grant money and is a major contributor to science through statistics
  3. Publishing in top scientific journals rather than statistics journals
  4. Being a good scientific communicator who can attract talent
  5. Being a statistics educator who can build programs

Another thing that is undervalued is not having a Ph.D. in statistics or biostatistics. The fact that these skills are undervalued right now means that up-and-coming departments could identify and recruit talented people who might be missed by other departments and have a huge impact on the world. One tricky thing is that the rankings of departments are based on the votes of people from other departments who may or may not value these same skills. Another tricky thing is that many industry data science positions put incredibly high value on these skills, so you might end up competing with them for people - a competition that will definitely drive up the market value of these data scientists/statisticians. But for the folks who want to stay in academia, now is a prime opportunity.