Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Sometimes there's friction for a reason

Thinking about my post on Theranos yesterday, it occurred to me that one thing that’s great about all of the innovation and technology coming out of places like Silicon Valley is the tremendous reduction of friction in our lives. With Uber it’s much easier to get a ride because of improvements in communication and an increase in the supply of cars. With Amazon, I can get that jug of vegetable oil that I always wanted without having to leave the house, because Amazon.

So why is there all this friction? Sometimes it’s because of regulation, which may have made sense at an earlier time, but perhaps doesn’t make as much sense now. Sometimes, you need a company like Amazon to really master the logistics operation to be able to deliver anything anywhere. Otherwise, you’re just stuck driving to the grocery store to get that vegetable oil.

But sometimes there’s friction for a reason. For example, Ben Thompson talks about how previously there was quite a bit more friction involved before law enforcement could listen in on our communications. Although wiretapping had long been around (as noted by David Simon of…The Wire), the removal of all friction by the NSA made the situation quite different. Related to this idea is the massive data release from OkCupid a few weeks ago, which I discussed on the latest Not So Standard Deviations podcast episode. Sure, your OkCupid profile is visible to everyone with an account, but having someone vacuum up 70,000 profiles and dump them on the web for anyone to view is not what people signed up for; there is a qualitative difference there.

When it comes to Theranos and diagnostic testing in general, there is similarly a need for some friction in order to protect public health. John Ioannidis notes in his commentary for JAMA:

Even if the tests were accurate, when they are performed in massive scale and multiple times, the possibility of causing substantial harm from widespread testing is very real, as errors accumulate with multiple testing. Repeated testing of an individual is potentially a dangerous self-harm practice, and these individuals are destined to have some incorrect laboratory results and eventually experience harm, such as, for example, the anxiety of being labeled with a serious condition or adverse effects from increased testing and procedures to evaluate false-positive test results. Moreover, if the diagnostic testing process becomes dissociated from physicians, self-testing and self-interpretation could cause even more problems than they aim to solve.

Unlike with the NSA, where the differences in scale may be difficult to quantify because the exact extent of the program is unknown to most people, with diagnostic testing we can precisely quantify how a diagnostic test’s characteristics will change if we apply it to 1,000 people vs. 1,000,000 people. This is why organizations like the US Preventive Services Task Force consider recommendations for testing or screening so carefully (and why they have a really tough job).
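As a minimal sketch of the kind of calculation involved (the sensitivity, specificity, and prevalence below are made-up values, not estimates for any real test), the R code below shows that the test’s error rates stay fixed while the absolute number of false positives scales with the number of people tested:

```r
# Sketch: a hypothetical screening test applied at two different scales.
# All three parameters are illustrative, not real values.
sensitivity <- 0.95    # P(test positive | disease)
specificity <- 0.99    # P(test negative | no disease)
prevalence  <- 0.001   # 1 in 1,000 people tested actually has the condition

test_at_scale <- function(n_tested) {
  n_diseased <- n_tested * prevalence
  n_healthy  <- n_tested - n_diseased
  true_pos   <- n_diseased * sensitivity
  false_pos  <- n_healthy * (1 - specificity)
  data.frame(n_tested, true_pos, false_pos,
             ppv = true_pos / (true_pos + false_pos))
}

rbind(test_at_scale(1e3), test_at_scale(1e6))
# The positive predictive value is the same at both scales (about 9%), but the
# expected number of false positives goes from roughly 10 to roughly 10,000.
```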

I’ll admit that a lot of the friction in our daily lives is pointless and it would be great to reduce it where possible. But in many cases, we put the friction there for a reason, and it’s sometimes good to think about why before we move to eliminate it.

Update On Theranos

I think it’s fair to say that things for Theranos, the Silicon Valley blood testing company, are not looking up. From the Wall Street Journal (via The Verge):

Theranos has voided two years of results from its Edison blood-testing machines, issuing tens of thousands of corrected reports to patients and doctors and raising the possibility that many health care decisions may have been made based on inaccurate data. The Wall Street Journal first reported the news, saying that many of the corrected tests have been run using traditional machinery. One doctor told the Journal that she sent a patient to the emergency room after seeing abnormal results from a Theranos test; the corrected report returned normal readings.

Furthermore, this commentary in JAMA from John Ioannidis emphasizes the need for caution when implementing testing on a massive scale. In particular, “The notion of patients and healthy people being repeatedly tested in supermarkets and pharmacies, or eventually in cafeterias or at home, sounds revolutionary, but little is known about the consequences”, and the consequences really matter here. In addition, as the title of the commentary indicates, research done in secret is not research at all. For a company like this to have credibility, its data need to be made public.

I continue to maintain that the fundamental premise on which the company is built, as stated by its founder Elizabeth Holmes, is flawed. Two claims are made repeatedly in the context of Theranos:

  • More testing is better. Anyone who stayed awake in their introduction to Bayesian statistics lecture knows this is very difficult to make true in all circumstances, no matter how accurate the test is. With rare diseases, the number of false positives is overwhelming and can have very real harmful effects on people. Combine testing on a massive scale with repeated application over time and you get a recipe for confusion (a short sketch of the repeated-testing arithmetic follows this list).
  • People do not get tested because they are afraid of needles. Elizabeth Holmes makes a big deal about her personal fear of needles and its impact on her (not) getting blood tests done. I have no doubt that many people share this fear, but I have serious doubts that this is the reason people don’t get medical testing done. There are many barriers to people getting the medical care that they need, many of them non-financial in nature and unrelated to fear of needles. The problem of getting people the medical care they need deserves serious attention, but changing the manner in which blood is collected is not going to solve it.
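As a minimal sketch of the repeated-testing arithmetic in the first point (the specificity here is a made-up value, and the tests are assumed independent), even a fairly specific test applied over and over to a healthy person will eventually return a false positive:

```r
# Sketch: chance that a healthy person sees at least one false positive after
# repeated, independent applications of a test (specificity is illustrative).
specificity <- 0.99
n_tests <- c(1, 5, 10, 50, 100)
data.frame(n_tests,
           p_at_least_one_false_positive = 1 - specificity^n_tests)
# With 100 independent tests, the probability is already above 60%.
```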

Not So Standard Deviations Episode 16 - The Silicon Valley Episode

Roger and Hilary are back, with Hilary broadcasting from the west coast. Hilary and Roger discuss the possibility of scaling data analysis and how that may or may not work for companies like Palantir. Also, the latest on Theranos and the release of data from OkCupid.

If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.

Subscribe to the podcast on iTunes.

Subscribe to the podcast on Google Play.

Please leave us a review on iTunes!

Support us through our Patreon page.

Show notes:

Download the audio for this episode.

What is software engineering for data science?

Editor’s note: This post is a chapter from the book Executive Data Science: A Guide to Training and Managing the Best Data Scientists, written by me, Brian Caffo, and Jeff Leek.

Software is the generalization of a specific aspect of a data analysis. If specific parts of a data analysis require implementing or applying a number of procedures or tools together, software encompasses all of these tools in a single module or procedure that can be applied repeatedly in a variety of settings. Software allows a procedure to be systematized and standardized, so that different people can use it and understand what it’s going to do at any given time.

Software is useful because it formalizes and abstracts the functionality of a set of procedures or tools by providing a well-defined interface to the analysis: a set of inputs and a set of outputs that are well understood. People can think about the inputs and the outputs without having to worry about the gory details of what’s going on underneath. They may be interested in those details, but applying the software in a given setting does not necessarily depend on knowing them. Rather, knowing the interface to the software is what matters for using it in any given situation.

For example, most statistical packages have a linear regression function with a very well-defined interface. Typically, you’ll have to input things like the outcome and the set of predictors, and maybe there will be some other inputs like the data set or weights. Most linear regression functions work roughly that way. Importantly, the user does not have to know exactly how the linear regression calculation is done under the hood. Rather, they only need to know that they must specify the outcome, the predictors, and a couple of other things. The linear regression function abstracts away the details required to implement linear regression, so that the user can apply the tool in a variety of settings.
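To make the idea of an interface concrete, here is what this looks like with R’s built-in lm() function, using the built-in mtcars data purely as an example:

```r
# Fit a linear regression: the outcome (mpg) goes on the left of the formula,
# the predictors (wt, hp) on the right, and the data set is supplied separately.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The user works entirely through the interface's outputs...
coef(fit)
summary(fit)
# ...without needing to know how the least-squares fit is computed internally.
```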

There are three levels of software that are important to consider, going from the simplest to the most abstract.

  1. At the first level, you might just have some code that you wrote, and you might want to automate a set of procedures using a loop (or something similar) that repeats an operation multiple times.
  2. The next step might be some sort of function. Regardless of what language you may be using, there will often be some notion of a function, which is used to encapsulate a set of instructions. The key thing about a function is that you’ll have to define some sort of interface, that is, the inputs to the function. The function may also have a set of outputs, or it may have some side effect, for example if it’s a plotting function. Now the user only needs to know those inputs and what the outputs will be. This is the first level of abstraction that you might encounter, where you have to actually define an interface to that function (the first two levels are sketched just after this list).
  3. The highest level is an actual software package, which will often be a collection of functions and other things. This will be a little more formal because there will be a very specific interface, or API, that a user has to understand. A software package will often have a number of convenience features for users, like documentation, examples, or tutorials, to help the user apply the software in many different settings. A full-on software package is the most general level, in the sense that it should be applicable to more than one setting.
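Here is a rough sketch of the first two levels in R; the file names and the column name are hypothetical stand-ins for whatever your analysis actually uses:

```r
# Level 1: ad hoc code that automates a repeated operation with a loop
# (the CSV files and the "measurement" column are hypothetical).
files <- c("site1.csv", "site2.csv", "site3.csv")
results <- numeric(length(files))
for (i in seq_along(files)) {
  d <- read.csv(files[i])
  results[i] <- mean(d$measurement, na.rm = TRUE)
}

# Level 2: the same operation encapsulated in a function with a defined
# interface: the inputs are a file path and a column name, the output a number.
summarize_site <- function(file, variable = "measurement") {
  d <- read.csv(file)
  mean(d[[variable]], na.rm = TRUE)
}
results <- vapply(files, summarize_site, numeric(1))
```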

One question you’ll find yourself asking is: at what point do you need to systematize common tasks and procedures across projects, versus recreating or rewriting code from scratch on every new project? The answer depends on a variety of factors, and getting to it may require communication within your team and with people outside your team. You may need to develop an understanding of how often a given process is repeated, or how often a given type of data analysis is done, in order to weigh the costs and benefits of investing in a software package or something similar.

Within your team, you may want to ask yourself, “Is the data analysis you’re going to do something that you will build upon for future work, or is it just going to be a one-shot deal?” In our experience, there are relatively few one-shot deals out there. Often you will have to do a certain analysis more than once, twice, or even three times, at which point you’ve reached the threshold where you want to write some code, some software, or at least a function. The point at which you need to systematize a given set of procedures will come sooner than you think. The initial investment in developing more formal software is higher, of course, but it will likely pay off in time savings down the road.

A basic rule of thumb is:

  • If you’re going to do something once (that does happen on occasion), just write some code and document it very well. The important thing is to make sure that you understand what the code does, and that requires both writing the code well and writing documentation. You want to be able to reproduce it later on if you ever come back to it, or if someone else comes back to it.
  • If you’re going to do something twice, write a function. This allows you to abstract a small piece of code, and it forces you to define an interface, so you have well-defined inputs and outputs.
  • If you’re going to do something three times or more, you should think about writing a small package. It doesn’t have to be commercial-level software, but a small package that encapsulates the set of operations you’re going to be doing in a given analysis. It’s also important to write some real documentation so that people can understand what’s supposed to be going on, and can apply the software to a different situation if they have to (a rough sketch of what this might look like follows this list).
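For the third case, here is a rough sketch of what a small R package might involve. The package name, file name, and function are hypothetical, and usethis with roxygen2 comments is just one common tool chain, not the only way to do it:

```r
# One-time setup: create a package skeleton (hypothetical package name).
usethis::create_package("sitetools")

# R/summarize_site.R -- the function from the earlier sketch, now documented
# with roxygen2 comments so the package can generate help pages for other users.

#' Summarize one site's measurements
#'
#' @param file Path to a CSV file of measurements.
#' @param variable Name of the column to summarize.
#' @return The mean of the requested column, with missing values removed.
#' @export
summarize_site <- function(file, variable = "measurement") {
  d <- read.csv(file)
  mean(d[[variable]], na.rm = TRUE)
}
```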

Disseminating reproducible research is fundamentally a language and communication problem

Just about 10 years ago, I wrote my first of many articles about the importance of reproducible research. Since that article, one of the points I’ve made is that the key issue to resolve is one of tools and infrastructure. At the time, many people were concerned that researchers would not want to share data and that we would have to spend a lot of energy finding ways to either compel or incentivize them to do so. But the reality was that it was difficult to answer the question “What should I do if I desperately want to make my work reproducible?” Back then, even if you could convince a clinical researcher to use R and LaTeX to create a Sweave document (!), it was not immediately obvious where they should host the document, code, and data files.

Much has happened since then. We now have knitr and Markdown for live documents (as well as IPython notebooks and the like). We also have git, GitHub, and friends, which provide free code-sharing repositories in a distributed manner (unlike older systems like CVS and Subversion). With the recent announcement of the Journal of Open Source Software, posting code on GitHub can now be recognized within the current system of credit and incentives. Finally, the number of data repositories has grown, providing more places for researchers to deposit their data files.

Is the tools and infrastructure problem solved? I’d say yes. One thing that has become clear is that disseminating reproducible research is no longer a software problem. At least in R land, building live documents that can be executed by others is straightforward and not too difficult to pick up (thank you, John Gruber!). For other languages there are many equivalent (if not better) tools for writing documents that mix code and text. For this kind of thing, there’s just no excuse anymore. Could things be optimized a bit for some edge cases? Sure, but the tools are perfectly fine for the vast majority of use cases.

But now there is a bigger problem that needs to be solved, which is that we do not have an effective way to communicate data analyses. One might think that publishing the full code and datasets is the perfect way to communicate a data analysis, but in a way, it is too perfect. That approach can provide too much information.

I find the following analogy useful for discussing this problem. If you look at music, one way to communicate music is to provide an audio file, a standard WAV file or something similar. In a way, that is a near-perfect representation of the music—bit-for-bit—that was produced by the performer. If I want to experience a Beethoven symphony the way that it was meant to be experienced, I’ll listen to a recording of it.

But if I want to understand how Beethoven wrote the piece—the process and the details—the recording is not a useful tool. What I look at instead is the score. The recording is a serialization of the music. The score provides an expanded representation of the music that shows all of the different pieces and how they fit together. A person with a good ear can often reconstruct the score, but this is a difficult and time-consuming task. Better to start with what the composer wrote originally.

Over centuries, classical music composers developed a language and system for communicating their musical ideas so that

  1. there was enough detail that a third party could interpret the music and perform it to a level of accuracy that satisfied the composer; but
  2. it was not so prescriptive or constraining that different performers could not build on the work and incorporate their own ideas.

It would seem that traditional computer code satisfies those criteria, but I don’t think so. Traditional computer code (even R code) is designed to communicate programming concepts and constructs, not to communicate data analysis constructs. For example, a for loop is not a data analysis concept, even though we may use for loops all the time in data analysis.

Because of this disconnect between computer code and data analysis, I often find it difficult to understand what a data analysis is doing, even if it is coded very well. I imagine this is what programmers felt like when programming in processor-specific assembly language. Before languages like C were developed, where high-level concepts could be expressed, you had to know the gory details of how each CPU operated.

The closest thing that I can see to a solution emerging is the work that Hadley Wickham is doing with packages like dplyr and ggplot2. The dplyr package’s verbs (filter, arrange, etc.) represent data manipulation concepts that are meaningful to analysts. But we still have a long way to go to cover all of data analysis in this way.
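For example, a short dplyr pipeline reads as a sequence of data analysis steps rather than as programming constructs (this uses the built-in mtcars data just for illustration):

```r
library(dplyr)

# Each verb names an analysis operation, not a programming construct:
# keep the 4-cylinder cars, order them by fuel economy, then summarize.
mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  summarize(mean_mpg = mean(mpg), n_cars = n())
```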

Reproducible research is important because it is fundamentally about communicating what you have done in your work. Right now we have a sub-optimal way to communicate what was done in a data analysis, via traditional computer code. I think developing a new approach to communicating data analysis could have a few benefits:

  1. It would provide greater transparency
  2. It would allow others to more easily build on what was done in an analysis by extending or modifying specific elements
  3. It would make it easier to understand what common elements there were across many different data analyses
  4. It would make it easier to teach data analysis in a systematic and scalable way

So, any takers?