Simply Statistics
2021-11-12T13:27:19+00:00
http://simplystats.github.io
Some default and debt restructuring data
2017-05-04T00:00:00+00:00
http://simplystats.github.io/2017/05/04/debt-haircuts
<p>Yesterday the government of Puerto Rico <a href="https://www.nytimes.com/2017/05/03/business/dealbook/puerto-rico-debt.html">asked for bankruptcy relief in federal court</a>. Puerto Rico owes about $70 billion to bondholders and about $50 billion in pension obligations. Before asking for protection the government offered to pay back some of the debt (50% according to some news reports) but bondholders refused. Bondholders will now fight in court to recover as much of what is owed as possible while the government and a federal oversight board will try to lower this amount. What can we expect to happen?</p>
<p>A case like this is unprecedented, but there are plenty of data on restructurings. An <a href="http://www.elnuevodia.com/opinion/columnas/ladeudaserenegociaraeneltribunal-columna-2317174/">op-ed</a> by Juan Lara pointed me to <a href="http://voxeu.org/article/argentina-s-haircut-outlier">this</a> blog post describing data on 180 debt restructurings. I am not sure how informative these data are with regard to Puerto Rico, but the plot below sheds some light on the variability of previous restructurings. Colors represent regions of the world and the lines join points from the same country. I added data from US cases shown in <a href="http://www.nfma.org/assets/documents/RBP/wp_statliens_julydraft.pdf">this paper</a>.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-05-04/haircuts.png" alt="" /></p>
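<p>For readers who want to explore data like these on their own, here is a minimal sketch of how a plot in this spirit could be drawn in Python. The file name and column names are hypothetical placeholders, not the actual dataset used above; the real data come from the sources linked in the text.</p>
<pre><code class="language-python">
# Minimal sketch of a haircut-by-year plot; file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

# Assumed columns: country, region, year, haircut (haircut as a percent)
haircuts = pd.read_csv("debt_restructurings.csv")

fig, ax = plt.subplots(figsize=(8, 5))
colors = {region: f"C{i}" for i, region in enumerate(haircuts["region"].unique())}

# Color encodes region; lines join restructurings from the same country
for (region, country), df in haircuts.groupby(["region", "country"]):
    ax.plot(df["year"], df["haircut"], marker="o", color=colors[region])

ax.set_xlabel("Year of restructuring")
ax.set_ylabel("Haircut (%)")
plt.show()
</code></pre>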
<p>The cluster of points you see below the 30% mark appears to consist of cases involving particularly poor countries: Albania, Argentina, Bolivia, Ethiopia, Bosnia and Herzegovina, Guinea, Guyana, Honduras, Cameroon, Iraq, the Republic of the Congo, Costa Rica, Mauritania, Sao Tome and Principe, Mozambique, Senegal, Nicaragua, Niger, Serbia and Montenegro, Sierra Leone, Tanzania, Togo, Uganda, Yemen, and Zambia. Note also that these restructurings happened after 1990.</p>
Science really is non-partisan: facts and skepticism annoy everybody
2017-04-24T00:00:00+00:00
http://simplystats.github.io/2017/04/24/march-for-science
<p>This is a short open letter to those who believe scientists have a “liberal bias” and question their objectivity. I suspect that for many conservatives, this Saturday’s March for Science served as confirmation of this belief. In this post I will try to convince you that this is not the case, specifically by pointing out how scientists often annoy the left as much as the right.</p>
<p>First, let me emphasize that scientists are highly appreciative of members of Congress and past administrations that have supported Science funding through the DoD, NIH and NSF. Although the current administration did propose a 20% cut to NIH, we are aware that, generally speaking, support for scientific research has traditionally been bipartisan.</p>
<p>It is true that the typical data-driven scientist will disagree, sometimes strongly, with many stances that are considered conservative. For example, most scientists will argue that:</p>
<ol>
<li>Climate change is real and is driven largely by increased carbon dioxide and other human-made emissions into the atmosphere.</li>
<li>Evolution needs to be part of children’s education and creationism has no place in Science class.</li>
<li>Homosexuality is not a choice.</li>
<li>Science must be publicly funded because the free market is not enough to make science thrive.</li>
</ol>
<p>But scientists will also hold positions that are often criticized heavily by some of those who identify as politically left wing:</p>
<ol>
<li>Current vaccination programs are safe and need to be enforced: without herd immunity thousands of children would die.</li>
<li>Genetically modified organisms (GMOs) are safe and are indispensable to fight world hunger. There is no need for warning labels.</li>
<li>Using nuclear energy to power our electrical grid is much less harmful than using natural gas, oil and coal and, currently, more viable than renewable energy.</li>
<li>Alternative medicine, such as homeopathy, naturopathy, faith healing, reiki, and acupuncture, is pseudo-scientific quackery.</li>
</ol>
<p>The timing of the announcement of the March for Science, along with the organizers’ focus on environmental issues and diversity, may have made it seem like a partisan or left-leaning event, but please also note that many scientists <a href="https://www.nytimes.com/2017/01/31/opinion/a-scientists-march-on-washington-is-a-bad-idea.html">criticized</a> the organizers for this very reason and there was much debate in general. Most scientists I know that went to the march did so not necessarily because they are against Republican administrations, but because they are legitimately concerned about some of the choices of this particular administration and the future of our country if we stop funding and trusting science.</p>
<p>If you haven’t already seen this <a href="https://www.youtube.com/watch?v=8MqTOEospfo">Neil deGrasse Tyson video</a> on the importance of Science to everyone, I highly recommend it.</p>
Redirect
2017-04-06T00:00:00+00:00
http://simplystats.github.io/2017/04/06/march-for-science
<p>This page was generated in error. The “Science really is non-partisan: facts and skepticism annoy everybody” blog post is <a href="http://simplystatistics.org/2017/04/24/march-for-science/">here</a></p>
<p>Apologies for the inconvenience.</p>
Enrollment, the cost per credit, and strikes at the UPR
2017-04-06T00:00:00+00:00
http://simplystats.github.io/2017/04/06/huelga
<p>The University of Puerto Rico (UPR) receives approximately 800 million
dollars from the state each year. This investment allows it to offer higher
salaries, which attract the best professors, to maintain the best facilities
for research and teaching, and to keep the price per credit lower than at the private universities. Thanks to these major
advantages, the UPR tends to be the first choice of Puerto Rican students, in
particular its two largest campuses, Río Piedras (UPRRP) and Mayagüez. A
student who makes the most of their time at the UPR, besides developing as a citizen, can
successfully enter the workforce or go on to the best graduate schools. The
modest price per credit, combined with federal Pell grants, has
helped thousands of economically disadvantaged students complete their
studies without having to take on debt.</p>
<p>Over the past decade a worrying reality has emerged: while demand for
university education has grown, as shown by rising enrollment at the private universities, the number of students enrolled at the UPR
has declined.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-04-06/matricula.png" alt="" /></p>
<p>Why has enrollment at the UPR declined?
<a href="http://www.elnuevodia.com/noticias/locales/nota/protestalauniondejuventudessocialistas-1331982/">One popular explanation</a>
is that “the drop in enrollment is caused by the increase in the cost of
tuition”. The theory that rising costs lower enrollment is
widely accepted because it makes economic sense: when the price goes up,
sales go down. But then why has enrollment grown at the private
universities? Nor is it explained by growth in the number of wealthy students, since,
in 2012, <a href="http://www.80grados.net/hacia-una-universidad-mas-pequena-y-agil/">the median family income of young people enrolled at
a UPR campus was $32,379; in contrast, the median income of
those enrolled at a private university was $25,979</a>. Another problem with this theory is that, once we adjust for inflation, the cost per credit has remained more or less stable both at the UPR and at the private universities.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-04-06/costo.png" alt="" /></p>
<p>Now, if we look closely at the enrollment data we notice that the largest drops came precisely in the strike years (2005, 2010, 2011). In 2005 a positive enrollment trend begins at Sagrado, with the largest growth in 2010 and 2011.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-04-06/cambio-en-matricula.png" alt="" /></p>
<p>At the moment, several campuses, including Río Piedras, <a href="http://www.elnuevodia.com/noticias/locales/nota/estudiantesapruebanvotodehuelgasistemicaenlaupr-2307616/">are closed
indefinitely</a>. At a national assembly attended by 10% of the system’s more than 50,000 students, an indefinite strike was approved by a vote of 4,522 to 1,154. To resume operations, the students demand that “no sanctions be imposed on students who participate in the strike, that a university reform plan developed by the university community be presented, that the public debt be audited, and that the members of the commission evaluating the public audit, along with its budget, be reinstated”. This comes in response to the proposal by the <a href="https://en.wikipedia.org/wiki/PROMESA">Fiscal Oversight Board (JSF)</a> and the governor to
<a href="http://www.elnuevodia.com/noticias/locales/nota/revelanelplanderecortesparaelsistemadelaupr-2302675/">cut</a> the UPR budget as part of their attempts to
resolve a <a href="https://www.project-syndicate.org/commentary/puerto-rico-debt-plan-deep-depression-by-joseph-e--stiglitz-and-martin-guzman-2017-02">serious fiscal
crisis</a>.</p>
<p>During the shutdown, the striking students block entry to campus for the
rest of the university community, including those who do not consider the strike an effective form of protest. Those who oppose it and want to keep studying are accused of being selfish or of siding with those who want to destroy the UPR. So far these students have not received explicit support from professors or administrators either. We should not be surprised if those who want to keep studying end up paying more at a private university.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-04-06/IMG_7076.jpg" alt="portones2" style="width: 300px;" /></p>
<p>Although it is possible that the strike will generate enough political pressure for the demands approved at the assembly to be met, there are other, less favorable possibilities for the students:</p>
<ul>
<li>The lack of academic activity results in an exodus of thousands of students to the private universities.</li>
<li>The JSF uses the shutdown to justify even deeper cuts: an institution does not need millions of dollars a day if it is closed.</li>
<li>The closed campuses lose their accreditation, since a university that is not holding classes cannot meet the <a href="http://www.msche.org/?Nav1=About&Nav2=FAQ&Nav3=Question07">required standards</a>.</li>
<li>Pell grants are revoked for students whose studies are on hold.</li>
</ul>
<p>There is ample empirical evidence demonstrating the importance of accessible university education. The same cannot be said of strikes as a strategy for defending that education. And there is a real possibility that the indefinite strike will have the opposite effect and do enormous harm to the students, particularly those forced to enroll at a private university.</p>
<p>Notes:</p>
<ol>
<li>
<p>Data provided by the <a href="http://www2.pr.gov/agencias/cepr/inicio/estadisticas_e_investigacion/Pages/Estadisticas-Educacion-Superior.aspx">Consejo de Educación de Puerto Rico (CEPR)</a>.</p>
</li>
<li>
<p>The 2011 cost per credit does not include the fee.</p>
</li>
</ol>
The Importance of Interactive Data Analysis for Data-Driven Discovery
2017-04-03T00:00:00+00:00
http://simplystats.github.io/2017/04/03/interactive-data-analysis
<p>Data analysis workflows and recipes are commonly used in science. They
are actually indispensable since reinventing the wheel for each
project would result in a colossal waste of time. On the other hand,
mindlessly applying a workflow can result in
totally wrong conclusions if the required assumptions don’t hold.
This is why successful data analysts rely heavily on interactive
data analysis (IDA). I write today because I am somewhat
concerned that the importance of IDA is not fully appreciated by many
of the policy makers and thought leaders that will influence how we
access and work with data in the future.</p>
<p>I start by constructing a very simple example to illustrate the
importance of IDA. Suppose that as
part of a demographic study you are asked to summarize male heights
across several counties. Since sample sizes are large and heights are
known to be well approximated by a normal distribution you feel
comfortable using a tried and tested recipe:
report the average and standard deviation as a summary. You are
surprised to find a county with an average height of 6.1 feet and a
standard deviation (SD) of 7.8 feet. Do you start writing a paper and a
press release to describe this very interesting finding? Here,
interactive data analysis saves us from naively reporting this.
First, we note that the standard deviation is impossibly big if the data are in
fact normally distributed: more than 15% of heights would be
negative. Given this nonsensical result, the next
obvious step for an experienced data analyst is to explore the data,
say with a boxplot (see below). This immediately reveals a problem: it
appears one value was reported in centimeters (180) rather than
feet. After fixing this, the summary changes to an average height
of 5.75 feet with a 3-inch SD.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/heights-with-outlier.png" alt="European Outlier" /></p>
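<p>To make this concrete, here is a small simulated sketch. The sample size and simulation parameters are made up, chosen only so the summaries roughly match the numbers quoted above; the point is that a single height recorded in centimeters instead of feet inflates the SD to an impossible value, and a simple boxplot exposes it.</p>
<pre><code class="language-python">
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Simulated county: male heights in feet, plus one value mistakenly entered in centimeters
heights = rng.normal(loc=5.75, scale=0.25, size=500)
heights = np.append(heights, 180.0)  # 180 cm recorded as if it were feet

mean, sd = heights.mean(), heights.std()
print(f"average = {mean:.2f} ft, SD = {sd:.2f} ft")  # roughly 6.1 and 7.8

# If heights really were normal with these parameters, a large share would be negative
print(f"implied fraction of negative heights: {norm.cdf(0, loc=mean, scale=sd):.1%}")

# The interactive step: look at the data
plt.boxplot(heights)
plt.ylabel("height (ft)")
plt.show()

# After converting the miscoded value to feet, the summary is sensible again
heights[-1] = 180.0 / 30.48
print(f"fixed: average = {heights.mean():.2f} ft, SD = {heights.std():.2f} ft")
</code></pre>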
<p>Years of data analysis experience will show you that examples like this are
common. Unfortunately, as data and analyses get more complex, workflow
failures are harder to detect and often go unnoticed. An important
principle many of us teach our trainees is to carefully check for
hidden problems when data analysis leads you to unexpected results,
especially when having the unexpected results hold up would benefit us
professionally, for example by leading to a publication.</p>
<p>Interactive data analysis is also indispensable for the
development of new methodology. For example, in my field of research, exploring
the data has led to the discovery of the need for new methods and
motivated new approaches that handle specific cases that existing
workflows can’t handle.</p>
<p>So why am I concerned?
As public datasets become larger and more
numerous, many funding agencies, policy makers and industry leaders are
advocating for using cloud computing to bring computing to the
data. If done correctly, this would provide a great improvement over
the current redundant and unsystematic approach of everybody downloading data and working with it locally. However, after
looking into the details of some of these plans, I have become a bit
concerned that perhaps the importance of IDA is not fully appreciated by decision makers.</p>
<p>As an example consider the NIH efforts to promote data-driven discovery
that center around plans for the
<a href="https://datascience.nih.gov/commons"><em>Data Commons</em></a>. The linked page
describes an ecosystem with four components, one of which is
“Software”. According to the description, the software component of
<em>The Commons</em> should provide “[a]ccess to and deployment of scientific analysis
tools and pipeline workflows”. There is no mention of a strategy that
will grant access to the
raw data. Without this, carefully checking the workflow output and
developing the analysis tools and pipeline workflows of the future
will be difficult.</p>
<p>I note that data analysis workflows are very popular in fields in which data
analysis is indispensable, as is the case in biomedical research, my
focus area. In this field, data generators, who typically
lead the scientific enterprise, are not always trained data
analysts. But the literature is overflowing with proposed workflows.
You can gauge the popularity of these by the vast number
published in the Nature journals, as demonstrated by this
<a href="https://www.google.com/search?q=workflow+site:nature.com&biw=1706&bih=901&source=lnms&tbm=isch&sa=X&ved=0ahUKEwi3usL8-dDPAhUDMSYKHaBFBTAQ_AUIBigB#tbm=isch&q=analysis+workflow+site:nature.com">Google search</a>:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/many-workflows.png" alt="Nature workflows" /></p>
<p>In a field in which data generators are not data analysis experts, the
workflow has the added allure that it removes the need to think deeply about
data analysis and instead shifts the responsibility to pre-approved
software. Note that these workflows are not always described with the
mathematical language or computer code needed to truly understand them
but rather with a series of PowerPoint shapes. The gist of the typical
data analysis workflow can be simplified into the following:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/workflow.png" alt="workflows" /></p>
<p>This simplification of the data analysis process makes it particularly
worrisome that the intricacies of IDA are not fully appreciated.</p>
<p>As mentioned above, data analysis workflows are a necessary component of
the scientific enterprise. Without them the process would slow down to
a halt. However, workflows should only be implemented once consensus
is reached regarding their optimality. And even then, IDA is needed to
ensure that the process is performing as expected. The careers of many of my
colleagues have been dedicated mostly to the development of such
analysis tools. We have learned that rushing to implement workflows
before they are mature enough can have widespread negative
consequences. And, at least in my experience, developing rigorous tools is
impossible without interactive data analysis. So I hope that this post
helps make a case for the importance of interactive data analysis and
that it continues to be a part of the scientific enterprise.</p>
The levels of data science class
2017-03-16T00:00:00+00:00
http://simplystats.github.io/2017/03/16/evo-ds-class
<p>In a recent post, Nathan Yau <a href="http://flowingdata.com/2013/03/12/data-hackathon-challenges-and-why-questions-are-important/">points to a comment</a> by Jake Porway about data science hackathons. They both say that for data science/visualization projects to be successful you have to start with an important question, not with a pile of data. This is the <a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">problem forward not solution backward</a> approach to data science and big data. This is the approach also advocated in the really nice piece on teaching data science by <a href="https://arxiv.org/abs/1612.07140">Stephanie and Rafa</a>.</p>
<p>I have adopted a similar approach in the data science class here at Hopkins, largely inspired by Dan Meyer’s <a href="https://www.ted.com/talks/dan_meyer_math_curriculum_makeover/transcript">patient problem solving for middle school math class</a>. So instead of giving students a full problem description I give them project suggestions like:</p>
<ul>
<li><strong>Option 1</strong>: Develop a prediction algorithm for identifying and classifying users that are trolling or being mean on Twitter. If you want an idea of what I’m talking about just look at the responses to any famous person’s tweets.</li>
<li><strong>Option 2</strong>: Analyze the traffic fatality data to identify any geographic, time varying, or other characteristics that are associated with traffic fatalities: https://www.transportation.gov/fastlane/2015-traffic-fatalities-data-has-just-been-released-call-action-download-and-analyze.</li>
<li><strong>Option 3</strong>: Develop a model for predicting life expectancy in Baltimore down to single block resolution with estimates of uncertainty. You may need to develop an approach for “downsampling” since the outcome data you’ll be able to find is likely aggregated at the neighborhood level (http://health.baltimorecity.gov/node/231).</li>
<li><strong>Option 4</strong>: Develop a statistical model for inferring the variables you need to calculate the Gail score (http://www.cancer.gov/bcrisktool/) for a woman based on her Facebook profile. Develop a model for the Gail score prediction from Facebook and its uncertainty. You should include estimates of uncertainty in the predicted score due to your inferred variables.</li>
<li><strong>Option 5</strong>: Potentially fun but super hard project. Develop an algorithm for a self-driving car using the training data: http://research.comma.ai/. Build a model for predicting at every moment what direction the car should be going, whether it should be signalling, and what speed it should be going. You might consider starting with a small subsample of the (big) training set.</li>
</ul>
<p>Each of these projects shares the characteristic that there is an interesting question - but the data may or may not be available. If it is available it may or may not have to be processed/cleaned/organized. Moreover, with the data in hand you may need to think about how it was collected or go out and collect some more data. This kind of problem is inspired by this quote from Dan’s talk - he was talking about math but it could easily have been data science:</p>
<blockquote>
<p>Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and have to filter it out, or you didn’t have sufficient information and have to go
find some?</p>
</blockquote>
<p>I realize though that this is advanced data science. So I was thinking about the levels of a data science course and how you would build up a curriculum. I came up with the following courses/levels and would be interested in what others thought.</p>
<ul>
<li><strong>Level 0: Background</strong>: Basic computing, some calculus with a focus on optimization, basic linear algebra.</li>
<li><strong>Level 1: Data science thinking</strong>: How to define a question, how to turn a question into a statement about data, how to identify data sets that may be applicable, experimental design, critical thinking about data sets.</li>
<li><strong>Level 2: Data science communication</strong>: Teaching students how to write about data science, how to express models qualitatively and in mathematical notation, explaining how to interpret results of algorithms/models. Explaining how to make figures.</li>
<li><strong>Level 3: Data science tools</strong>: Learning the basic tools of R, loading data of various types, reading data, plotting data.</li>
<li><strong>Level 4: Real data</strong>: Manipulating different file formats, working with “messy” data, trying to organize multiple data sets into one data set.</li>
<li><strong>Level 5: Worked examples</strong>: Use real data examples, but work them through from start to finish as case studies, don’t make them easy clean data sets, but have a clear path from the beginning of the problem to the end.</li>
<li><strong>Level 6: Just the question</strong>: Give students a question where you have done a little research to know that it is possible to get at least some data, but aren’t 100% sure it is the right data or that the problem can be perfectly solved. Part of the learning process here is knowing how to define success or failure and when to keep going or when to quit.</li>
<li><strong>Level 7: The student is the scientist</strong>: Have the students come up with their own questions and answer them using data.</li>
</ul>
<p>I think that a lot of the thought right now in biostatistics has been on level 3 and 4 courses. These are courses where we have students work with real data sets and learn about tools. To be self-sufficient as a data scientist it is clear you need to be able to work with real world data.
But what Jake/Nathan are referring to is level 5 or level 6 - cases where you have a question but the data needs a ton of work and may not even be good enough without collecting new information. Jake and Nathan have perfectly identified the ability to translate murky questions into data answers as the most valuable data skill. If I had to predict the future of data courses I would see them trending in that direction.</p>
When do we need interpretability?
2017-03-08T00:00:00+00:00
http://simplystats.github.io/2017/03/08/when-do-we-need-interpretability
<p>I just saw a link to an <a href="https://arxiv.org/abs/1702.08608">interesting article</a> by Finale Doshi-Velez and Been Kim titled “Towards A Rigorous Science of Interpretable Machine Learning”. From the abstract:</p>
<blockquote>
<p>Unfortunately, there is little consensus on what interpretability in machine learning is and how to evaluate it for benchmarking. Current interpretability evaluation typically falls into two categories. The first evaluates interpretability in the context of an application: if the system is useful in either a practical application or a simplified version of it, then it must be somehow interpretable. The second evaluates interpretability via a quantifiable proxy: a researcher might first claim that some model class—e.g. sparse linear models, rule lists, gradient boosted trees—are interpretable and then present algorithms to optimize within that class.</p>
</blockquote>
<p>The paper raises a good point, which is that we don’t really have a definition of “interpretability”. We just know it when we see it. For the most part, there’s been some agreement that parametric models are “more interpretable” than other models, but that’s a relatively fuzzy statement.</p>
<p>I’ve long heard that complex machine learning models that lack any real interpretability are okay because there are many situations where we don’t care “how things work”. When Netflix is recommending my next movie based on my movie history and perhaps other data, the only thing that matters is that the recommendation is something I like. In other words, the <a href="http://simplystatistics.org/2017/01/23/ux-value/">user experience defines the value</a> to me. However, in other applications, such as when we’re assessing the relationship between air pollution and lung cancer, a more interpretable model may be required.</p>
<p>I think the dichotomization between these two kinds of scenarios will eventually go away for a few reasons:</p>
<ol>
<li>For some applications, lack of interpretability is fine…until it’s not. In other words, what happens when things go wrong? Interpretability can help us to decipher why things went wrong and how things can be <em>modified</em> to be fixed. In order to move the levers of a machine to fix it, we need to know exactly where the levers are. Yet another way to say this is that it’s possible to jump from one situation (interpretability not needed) to another (what the heck just happened?) very quickly.</li>
<li>I think interpretability will become the new <a href="http://simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important/">reproducible research</a>, transmogrified to the machine learning and AI world. In the scientific world, reproducibility took some time to catch on (and has not quite caught on completely), but it is not so controversial now and many people in many fields accept the notion that all studies should at least be reproducible (if <a href="http://www.pnas.org/content/112/6/1645.full">not necessarily correct</a>). There was a time when people differentiated between cases that needed reproducibility (big data, computational work), and cases where it wasn’t needed. But that differentiation is slowly going away. I believe interpretability in machine learning and statistical modeling will go the same way as reproducibility in science.</li>
</ol>
<p>Ultimately, I think it’s the success of machine learning that brings the requirement of interpretability on to the scene. Because machine learning has become ubiquitous, we as a society begin to develop expectations for what it is supposed to do. Thus, the <a href="http://simplystatistics.org/2017/01/23/ux-value/">value of the machine learning begins to be defined externally</a>. It will no longer be good enough to simply provide a great user experience.</p>
Model building with time series data
2017-03-07T00:00:00+00:00
http://simplystats.github.io/2017/03/07/time-series-model
<p>A nice post by Alex Smolyanskaya over at the <a href="http://multithreaded.stitchfix.com/blog/2017/02/28/whats-wrong-with-my-time-series/">Stitch Fix blog</a> about some of the unique challenges of model building in a time series context:</p>
<blockquote>
<p>Cross validation is the process of measuring a model’s predictive power by testing it on randomly selected data that was not used for training. However, autocorrelations in time series data mean that data points are not independent from each other across time, so holding out some data points from the training set doesn’t necessarily remove all their associated information. Further, time series models contain autoregressive components to deal with the autocorrelations. These models rely on having equally spaced data points; if we leave out random subsets of the data, the training and testing sets will have holes that destroy the autoregressive components.</p>
</blockquote>
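<p>A common way to respect this structure is to validate forward in time: train on an initial stretch of the series and test on the stretch that follows, sliding the split forward. The toy series and model below are made up for illustration and are not from the Stitch Fix post; the sketch just shows the mechanics using scikit-learn’s <code>TimeSeriesSplit</code>.</p>
<pre><code class="language-python">
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy autocorrelated series (a random walk); predict y[t] from its two previous values
y = np.cumsum(rng.normal(size=300))
X = np.column_stack([y[1:-1], y[:-2]])  # lags 1 and 2
target = y[2:]

# Each fold trains on an initial block and tests on the block that follows,
# so no future information leaks into training and no holes are punched in the series.
errors = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], target[train_idx])
    pred = model.predict(X[test_idx])
    errors.append(np.mean((pred - target[test_idx]) ** 2))

print("mean squared error by fold:", np.round(errors, 3))
</code></pre>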
Reproducibility and replicability is a glossy science now so watch out for the hype
2017-03-02T00:00:00+00:00
http://simplystats.github.io/2017/03/02/rr-glossy
<p><a href="http://biorxiv.org/content/early/2016/07/29/066803">Reproducibility</a> is the ability to take the code and data from a previous publication, rerun the code and get the same results. <a href="http://biorxiv.org/content/early/2016/07/29/066803">Replicability</a> is the ability to rerun an experiment and get “consistent” results with the original study using new data. Results that are not reproducible are hard to verify and results that do not replicate in new studies are harder to trust. It is important that we aim for reproducibility and replicability in science.</p>
<p>Over the last few years there has been increasing concern about problems with reproducibility and replicability in science. There are a number of suggestions for why this might be:</p>
<ul>
<li>Papers published by scientists who lack training in statistics and computation</li>
<li>Treating statistics as a second-class discipline that can be “tacked on” at the end of a science experiment</li>
<li>Financial incentives for companies and others to publish desirable results.</li>
<li>Academic incentives for scientists to publish desirable results so they can get their next grant.</li>
<li>Incentives for journals to publish surprising/eye catching/interesting results.</li>
<li>Over-hyped studies with statistical weaknesses (small sample sizes, questionable study populations, etc.)</li>
<li>TED-style sound bites of scientific results that are digested and repeated in the press despite limited scientific evidence.</li>
<li>Scientists who refuse to consider alternative explanations for their data</li>
</ul>
<p>Usually the targets of discussion about reproducibility and replicability are highly visible scientific studies. The targets are usually papers in what are considered “top journals”, or papers in journals like Science and Nature that seek to maximize visibility. Or, more recently, entire widely publicized fields of science - like psychology or cancer biology - are targeted for reproducibility and replicability studies.</p>
<p>These studies have pointed out serious issues with the statistics, study designs, code availability and methods descriptions in papers they have studied. These are fundamental issues that deserve attention and should be taught to all scientists. As more papers have come out pointing out potential issues, they have merged into what is being called “a crisis of reproducibility”, “a crisis of replicability”, “a crisis of confidence in science” or other equally strong statements.</p>
<p>As the interest around reproducibility and replicability has built to a fever pitch in the scientific community it has morphed into a glossy scientific field in its own right. All of the characteristics are in place:</p>
<ul>
<li>A big central “positive” narrative that all science is not replicable, reproducible, or correct.</li>
<li>Incentives to publish these types of results because they can appear in Nature/Science/other glossy journals. (<a href="http://www.pnas.org/content/112/6/1645.full">I’m not immune to this</a>)</li>
<li>Strong and aggressive responses to papers that provide alternative explanations or don’t fit the narrative.</li>
<li>Researchers whose careers depend on the narrative being true</li>
<li>TED-style talks and sound bites (“most published research is false”, “most papers don’t replicate”)</li>
<li>Press hype, including for papers with statistical weaknesses (small sample sizes, weaker study designs)</li>
</ul>
<p>Reproducibility and replicability has “arrived” and become a field in its own right. That has both positives and negatives. On the positive side it means critical statistical issues are now being talked about by a broader range of people. On the negative side, researchers now have to do the same sober evaluation of the claims in reproducibility and replicability papers that they do for any other scientific field. Papers on reproducibility and replicability must be judged with the same critical eye as we apply to any other scientific study. That way we can sift through the hype and move science forward.</p>
Learning about Machine Learning with an Earthquake Example
2017-02-23T00:00:00+00:00
http://simplystats.github.io/2017/02/23/ml-earthquakes
<p><em>Editor’s note: This is the fourth chapter of a book I’m working on called <a href="https://leanpub.com/demystifyai/">Demystifying Artificial Intelligence</a>. I’ve also added a co-author, <a href="https://twitter.com/data_divya">Divya Narayanan</a>, a master’s student here at Johns Hopkins! The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. We are developing the book over time - so if you buy the book on Leanpub, know that there are only four chapters in there so far, but I’ll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this <a href="https://twitter.com/notajf/status/795717253505413122">amazing tweet</a> by Twitter user <a href="https://twitter.com/notajf/">@notajf</a>. Feedback is welcome and encouraged!</em></p>
<blockquote>
<p>“A learning machine is any device whose actions are influenced by past experience.” - Nils John Nilsson</p>
</blockquote>
<p>Machine learning describes exactly what you would think: a machine that learns. As we described in the previous chapter, a machine “learns” from previous examples, just like humans do. Just as certain experiences give us an understanding of a particular concept, machines can be trained on similar experiences, or at least to mimic them. With very routine tasks, our brains become attuned to the characteristics that define different objects or activities.</p>
<p>Before we can dive into the algorithms - like neural networks - that are most commonly used for artificial intelligence, let’s consider a real example to understand how machine learning works in practice.</p>
<h2 id="the-big-one">The Big One</h2>
<p>Earthquakes occur when the surface of the Earth experiences a shake due to displacement of the ground, and can readily occur along fault lines where there have already been massive displacements of rock or ground (Wikipedia 2017a). For people living in places like California where earthquakes occur relatively frequently, preparedness and safety are major concerns. One famous fault in southern California, called the San Andreas Fault, is expected to produce the next big earthquake in the foreseeable future, often referred to as the “Big One”. Naturally, some residents are concerned and may like to know more so they are better prepared.</p>
<p>The following data are pulled from <strong>FiveThirtyEight</strong>, a political and sports blogging site, and describe how worried people are about the “Big One” (Hickey 2015). Here’s an example of the first few observations in this dataset:</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: left">worry_general</th>
<th style="text-align: left">worry_bigone</th>
<th style="text-align: left">will_occur</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">1004</td>
<td style="text-align: left">Somewhat worried</td>
<td style="text-align: left">Somewhat worried</td>
<td style="text-align: left">TRUE</td>
</tr>
<tr>
<td style="text-align: left">1005</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1006</td>
<td style="text-align: left">Not so worried</td>
<td style="text-align: left">Not so worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1007</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1008</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1009</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1010</td>
<td style="text-align: left">Not so worried</td>
<td style="text-align: left">Somewhat worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1011</td>
<td style="text-align: left">Not so worried</td>
<td style="text-align: left">Extremely worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1012</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">Not so worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1013</td>
<td style="text-align: left">Somewhat worried</td>
<td style="text-align: left">Not so worried</td>
<td style="text-align: left">FALSE</td>
</tr>
</tbody>
</table>
<p>Just by looking at this subset of the data, we can already get a feel for how many different ways it could be structured. Here, we see that there are 10 observations which represent 10 individuals. For each individual, we have information on 11 different aspects of earthquake preparedness and experience (only 3 of which are shown here). Data can be stored as text, logical responses (true/false), or numbers. Sometimes, and quite often at that, it may be missing; for example, observation 1013.</p>
<p>So what can we do with this data? For example, we could predict - or classify - whether or not someone was likely to have taken any precautions for an upcoming earthquake, like bolting their shelves to the wall or coming up with an evacuation plan. Using this idea, we have now found a question that we’re interested in analyzing: are you prepared for an earthquake or not? And now, based on this question and the data that we have, we can see that you can either be prepared (seen above as “true”) or not (seen above as “false”).</p>
<blockquote>
<p>Our question: How well can we predict whether or not someone is prepared for an earthquake?</p>
</blockquote>
<h2 id="an-algorithm--whats-that">An Algorithm – what’s that?</h2>
<p>With our question in tow, we want to design a way for our machine to determine if someone is prepared for an earthquake or not. To do this, the machine goes through a flowchart-like set of instructions. At each fork in the flowchart, there are different answers which take the machine on a different path to get to the final answer. If you go through the correct series of questions and answers, it can correctly identify a person as being prepared. Here’s a small portion of the final flowchart for the San Andreas data which we will proceed to dissect (note: the ellipses on the right-hand side of the flowchart indicate where the remainder of the algorithm lies. This will be expanded later in the chapter):</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Flowchart-partial.png" alt="" /></p>
<p>The steps that we take through the flowchart, or <strong>tree</strong>, make up the <strong>classification algorithm</strong>. An algorithm is essentially a set of step-by-step instructions that we follow to organize our data or, in other words, to make a prediction about it. In this case, our goal is to classify an individual as prepared or not by working our way through the different branches of the tree. So how did we establish this particular set of questions to be in our framework of identifying a prepared individual?</p>
<p><strong>CART</strong>, or a classification and regression tree, is one way to assess which of these characteristics is the most important in terms of splitting up the data into prepared and unprepared individuals (Wikipedia 2017b, Breiman et al. (1984)). There are multiple ways of implementing this method, often with the earlier branches making larger splits in the data and later branches making smaller splits.</p>
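<p>As a rough illustration of how such a tree could be fit in practice, here is a short sketch using scikit-learn’s <code>DecisionTreeClassifier</code>. The tiny table of responses is invented and the column names are simplified stand-ins for the survey questions; the real analysis would use the full FiveThirtyEight data after cleaning and encoding it.</p>
<pre><code class="language-python">
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented mini-survey in the spirit of the data shown above (0/1 encoded)
survey = pd.DataFrame({
    "region_mtn_pacific": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    "experienced_quake":  [1, 0, 1, 0, 0, 1, 0, 1, 1, 0],
    "will_occur":         [1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    "prepared":           [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
})

features = ["region_mtn_pacific", "experienced_quake", "will_occur"]
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(survey[features], survey["prepared"])

# Print the learned flowchart: each split names a feature and a cutoff (the "parameters")
print(export_text(tree, feature_names=features))
</code></pre>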
<p>Within an algorithm, there exists another level of organization composed of <strong>features</strong> and <strong>parameters</strong>.</p>
<p>In order to tell if someone is prepared for an earthquake or not, there have to be certain characteristics, known as <strong>features</strong>, that separate those who are prepared from those who are not. Features are basically the things you measured in your dataset that are chosen to give you insight into an individual and how to best classify them into groups. Looking at our sample data, we can see that some of the possible features are: whether or not an individual is worried about earthquakes in general, prior experiences with earthquakes, or their gender. As we will soon see, certain features will carry more weight in separating an individual into the two groups (prepared vs. unprepared).</p>
<p>If we were looking at how important previously experiencing an earthquake was in classifying someone as prepared, we’d say it plays a pretty big role, since it’s one of the first features that we encounter in our flowchart. The feature that seems to make a bigger split in our data is region, as it appears as the first feature in our algorithm shown above. We would expect that people in the Mountain and Pacific regions have more experience and knowledge about earthquakes, as that part of the country is more prone to experiencing an earthquake. However, someone’s age may not be as important in classifying a prepared individual. Since it doesn’t even show up near the top of our flowchart, it means it wasn’t as crucial to know this information to decide if a person is prepared or not and it didn’t separate the data much.</p>
<p>The second form of organization within an algorithm is the questions and cutoffs for moving one direction or another at each node. These are known as <strong>parameters</strong> of our algorithm. These parameters give us insight as to how the features we have established define the observation we are trying to identify. Let us consider an example using the feature of region. As we mentioned earlier, we would expect that those in the Mountain and Pacific regions would have more experience with earthquakes, which may be reflected in their level of preparedness. Looking back at our abbreviated classification tree, the first node in our tree has a parameter of “Mountain or Pacific” for the feature region, which can be split into “yes” (those living in one of these regions) or “no” (living elsewhere in the US).</p>
<p>If we were looking at a continuous variable, say number of years living in a region, we may set a threshold instead, say greater than 5 years, as opposed to a yes/no distinction. In supervised learning, where we are teaching the machine to identify a prepared individual, we provide the machine multiple observations of prepared individuals and include different parameter values of features to show how a person could be prepared. To illustrate this point, let us consider the 10 sample observations below, and specifically examine the outcome, preparedness, with respect to the features: will_occur, female, and household income.</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: left">prepared</th>
<th style="text-align: left">will_occur</th>
<th style="text-align: left">female</th>
<th style="text-align: left">hhold_income</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">1004</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">$50,000 to $74,999</td>
</tr>
<tr>
<td style="text-align: left">1005</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">$10,000 to $24,999</td>
</tr>
<tr>
<td style="text-align: left">1006</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">$200,000 and up</td>
</tr>
<tr>
<td style="text-align: left">1007</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">$75,000 to $99,999</td>
</tr>
<tr>
<td style="text-align: left">1008</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">Prefer not to answer</td>
</tr>
<tr>
<td style="text-align: left">1009</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">Prefer not to answer</td>
</tr>
<tr>
<td style="text-align: left">1010</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">$50,000 to $74,999</td>
</tr>
<tr>
<td style="text-align: left">1011</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">Prefer not to answer</td>
</tr>
<tr>
<td style="text-align: left">1012</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">$50,000 to $74,999</td>
</tr>
<tr>
<td style="text-align: left">1013</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">NA</td>
<td style="text-align: left">NA</td>
</tr>
</tbody>
</table>
<p>Of these ten observations, 7 are not prepared for the next earthquake and 3 are. But to make this information more useful, we can look at some of the features to see if there are any similarities that the machine can use as a classifier. For example, of the 3 individuals that are prepared, two are female and only one is male. But notice we get the same distribution of males and females by looking at those who are not prepared: you have 4 females and 2 males, the same 2:1 ratio. From such a small sample, the algorithm may not be able to tell how important gender is in classifying preparedness. But, by looking through the remaining features and a larger sample, it can start to classify individuals. This is what we mean when we say a machine learning algorithm <strong>learns</strong>.</p>
<p>Now, let us take a closer look at observations 1005, 1011, and 1012, and more specifically the household income feature:</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: left">prepared</th>
<th style="text-align: left">will_occur</th>
<th style="text-align: left">female</th>
<th style="text-align: left">hhold_income</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">1005</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">$10,000 to $24,999</td>
</tr>
<tr>
<td style="text-align: left">1011</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">Prefer not to answer</td>
</tr>
<tr>
<td style="text-align: left">1012</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">$50,000 to $74,999</td>
</tr>
</tbody>
</table>
<p>All three of these observations are females and believe that the “Big One” won’t occur in their lifetime. But despite the fact that they are all unprepared, they each report a different household income. Based on just these three observations, we may conclude that household income is not as important in determining preparedness. By showing a machine different examples of which features a prepared individual has (or unprepared, as in this case), it can start to recognize patterns and identify the features, or combination of features, and parameters that are most indicative of preparedness.</p>
<p>In summary, every flowchart will have the following components:</p>
<ol>
<li>
<p><strong>The algorithm</strong> - The general workflow or logic that dictates the path the machine travels, based on chosen features and parameter values. In turn, the machine classifies or predicts which group an observation belongs to</p>
</li>
<li><strong>Features</strong> - The variables or types of information we have about each observation</li>
<li><strong>Parameters</strong> - The possible values a particular feature can have</li>
</ol>
<p>Even with the experience of seeing numerous observations with various feature values, there is no way to show our machine information on every single person that exists in the world. What will it do when it sees a brand new observation that is not identified as prepared or unprepared? Is there a way to improve how well our algorithm performs?</p>
<h2 id="training-and-testing-data">Training and Testing Data</h2>
<p>You may have heard of the terms <em>sample</em> and <em>population</em>. In case these terms are unfamiliar, think of the population as the entire group of people we want to get information from, study, and describe. This would be like getting a piece of information, say income, from every single person in the world. Wouldn’t that be a fun exercise!</p>
<p>If we had the resources to do this, we could then take all those incomes and find out the average income of an individual in the world. But since this is not possible, it might be easier to get that information from a smaller number of people, or <em>sample</em>, and use the average income of that smaller pool of people to represent the average income of the world’s population. We could only say that the average income of the sample is <em>representative</em> of the population if the sample of people that we picked have the same characteristics of the population.</p>
<p>For example, suppose that our population of interest is a company with 1,000 employees, where 200 of those employees make $60,000 each and 800 of them make $30,000 each. Since we have this information on everyone, we could easily calculate the average income of an employee in the company, which would be $36,000. Now, say we randomly picked a group of 100 individuals from the company as our sample. If all of those 100 individuals came from the group of employees that made $60,000, we might think that the average income for an employee was actually much higher than the true average of the population (the whole company). The opposite would be true if all 100 of those employees came from the group making less money - we would mistakenly think the average income of employees is lower. In order to accurately reflect the distribution of income of the company employees through our sample, or rather to have a <em>representative</em> sample, we would try to pick 20 individuals from the higher income group and 80 individuals from the lower income group to get an accurate representation of this company population.</p>
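<p>The arithmetic behind this example is easy to check in a few lines of code; the numbers below are just the hypothetical company described above.</p>
<pre><code class="language-python">
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical company: 200 employees at $60,000 and 800 at $30,000
incomes = np.array([60_000] * 200 + [30_000] * 800)
print("population mean:", incomes.mean())  # 36,000

# A sample drawn only from the high earners badly overstates the mean
print("biased sample mean:", incomes[:100].mean())  # 60,000

# A representative sample keeps the 20/80 mix and lands on the truth
representative = np.concatenate([
    rng.choice(incomes[:200], size=20, replace=False),
    rng.choice(incomes[200:], size=80, replace=False),
])
print("representative sample mean:", representative.mean())  # 36,000
</code></pre>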
<p>Now heading back to our earthquake example, our big picture goal is to be able to feed our algorithm a brand new observation of someone who answered information about themselves and earthquake preparedness, and have the machine be able to correctly identify whether or not they are prepared for a future earthquake.</p>
<p>One definition of a population could consist of all individuals in the world. However, since we can’t get access to data on all these individuals, we decide to sample 1013 respondents and ask them earthquake-related questions. Remember that in order for our machine to be able to accurately identify an individual as prepared or unprepared, it needs to have had some experience seeing observations whose features identify the individual as prepared, as well as some where they do not. This seems a little counterintuitive though. If we want our algorithm to identify a prepared individual, why wouldn’t we show it all the observations that are prepared?</p>
<p>By showing our machine observations of respondents that are not prepared, it can better strengthen its idea of what features identify a prepared individual. But we also want to make our algorithm as <em>robust</em> as possible. For an algorithm to be robust, it should be able to take in a wide range of values for each feature, and appropriately go through the algorithm to make a classification. If we only show our machine a narrow set of experiences, say people who have an income of between $10,000 and $25,000, it will be harder for the algorithm to correctly classify an individual who has an income of $50,000.</p>
<p>One way we can give our machine this set of experiences is to take all 1013 observations and randomly split them up into two groups. Note: for simplification, any observations that had missing data (total: 7) for the outcome variable were removed from the original dataset, leaving 1006 observations for our analysis.</p>
<ol>
<li>
<p><strong>Training data</strong> - This serves as the wide range of experiences that we want our machine to see to have a better understanding of preparedness</p>
</li>
<li>
<p><strong>Testing data</strong> - This data will allow us to evaluate our algorithm and see how well it was able to pick up on features and parameter values that are specific to prepared individuals and correctly label them as such</p>
</li>
</ol>
<p>So what’s the point of splitting up our data into training and testing? We could have easily fed all the data that we have into the algorithm and have it detect the most important features and parameters based on the provided observations. But there’s an issue with that, known as <strong>overfitting</strong>. When an algorithm has overfit the data, it means that it has been fit specifically to the data at hand, and only that data. Such an algorithm would struggle with new data that does not fit within the bounds of our training data (even though it would perform very well on the training set itself). Moreover, the algorithm would only accurately classify a very narrow set of observations. This example nicely illustrates the concept we introduced earlier - <em>robustness</em> - and demonstrates the importance of exposing our algorithm to a wide range of experiences. We want our algorithm to be useful, which means it needs to be able to take in all kinds of data with different distributions, and still be able to accurately classify them.</p>
<p>To create training and testing sets, we can adopt the following idea (a rough code sketch follows the list):</p>
<ol>
<li>Split the 1006 observations in half: roughly 500 for training, and the remainder for testing</li>
<li>Feed the 500 training observations through the algorithm for the machine to understand what features best classify individuals as prepared or unprepared</li>
<li>Once the machine is trained, feed the remaining test observations through the algorithm and see how well it classifies them</li>
</ol>
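<p>Here is one rough way the split in steps 1-3 might look in R. This is only a sketch: the data frame below is a made-up stand-in for the cleaned survey data, not the actual data set used in this post.</p>
<pre><code class="language-r"># Stand-in for the cleaned survey data: 1006 rows, one per respondent,
# with a binary outcome and a couple of illustrative features
set.seed(2017)
quake_data = data.frame(
  prepared = sample(c("yes", "no"), 1006, replace = TRUE),
  income   = round(runif(1006, 10000, 150000)),
  region   = sample(c("west", "other"), 1006, replace = TRUE)
)

n = nrow(quake_data)
train_rows = sample(n, size = round(n / 2))  # roughly 500 randomly chosen rows

training = quake_data[train_rows, ]   # used to train the algorithm (step 2)
testing  = quake_data[-train_rows, ]  # held out to evaluate it (step 3)
</code></pre>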
<h2 id="algorithm-accuracy">Algorithm Accuracy</h2>
<p>Now that we’ve built up our algorithm and split our data into training and test sets, let’s take a look at the full classification algorithm:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Flowchart-full.png" alt="" /></p>
<p>Recall the question we set out to answer with respect to the earthquake data: <strong>How well can we predict whether or not someone is prepared for an earthquake?</strong> In a binary (yes/no) case like this, we can set up our results in a 2x2 table, where the rows represent predicted preparedness (based on the features of our algorithm) and the columns represent true preparedness (what their true label is).</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2x2-table-results.png" alt="" /></p>
<p>This simple 2x2 table carries quite a bit of information. Essentially, we can grade our machine on how well it learned to tell whether individuals are prepared or unprepared. We can see how well our algorithm did at classifying new observations by calculating the <strong>predictive accuracy</strong>, done by summing the concordant cells A and D and dividing by the total number of observations, or more simply, (A + D) / N. Through this calculation, we see that the algorithm from our example correctly classified individuals as prepared or unprepared 77.9% of the time. Not bad!</p>
<p>When we feed the roughly 500 test observations through the algorithm, it is the first time the machine has seen those observations. As a result, there is a chance that, despite going through the algorithm, the machine <strong>misclassified</strong> someone as prepared when in fact they were unprepared. To determine how often this happens, we can calculate the <strong>test error rate</strong> from the 2x2 table above. To calculate the test error rate, we take the number of observations that are <em>discordant</em>, or dissimilar between true and predicted status, and divide it by the total number of observations that were assessed. Based on the above table, the test error rate would be (B + C) / N, or 22.1%.</p>
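<p>As a small illustration of how these two summaries fit together, here is a sketch in R. The label vectors below are made up for the example; they are not the actual survey results.</p>
<pre><code class="language-r"># Made-up true and predicted labels for a handful of test observations
truth     = c("prepared", "prepared", "unprepared", "unprepared", "prepared", "unprepared")
predicted = c("prepared", "unprepared", "unprepared", "unprepared", "prepared", "prepared")

tab = table(predicted, truth)  # the 2x2 table: rows = predicted, columns = true status
tab

accuracy   = sum(diag(tab)) / sum(tab)  # concordant cells over the total, (A + D) / N
error_rate = 1 - accuracy               # discordant cells over the total, (B + C) / N
accuracy
error_rate
</code></pre>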
<p>There are a number of reasons that a test error rate would be high. Depending on the data set, there might be different methods that are better for developing the algorithm. Additionally, despite randomly splitting our data into training and testing sets, there may be some inherent differences between the two (think back to the employee income example above), making it harder for the algorithm to correctly label an observation.</p>
<h2 id="references">References</h2>
<p>Breiman, Leo, Jerome H Friedman, Richard A Olshen, and Charles J Stone. 1984. <em>Classification and Regression Trees</em>. Monterey, CA: Wadsworth & Brooks.</p>
<p>Hickey, Walt. 2015. “The Rock Isn’t Alone: Lots of People Are Worried About ‘the Big One’.” <em>FiveThirtyEight</em>. FiveThirtyEight. <a href="https://fivethirtyeight.com/datalab/the-rock-isnt-alone-lots-of-people-are-worried-about-the-big-one/">https://fivethirtyeight.com/datalab/the-rock-isnt-alone-lots-of-people-are-worried-about-the-big-one/</a>.</p>
<p>Wikipedia. 2017a. “Earthquake — Wikipedia, the Free Encyclopedia.” <a href="http://en.wikipedia.org/w/index.php?title=Earthquake&oldid=762614740">http://en.wikipedia.org/w/index.php?title=Earthquake&oldid=762614740</a>.</p>
<p>———. 2017b. “Predictive analytics — Wikipedia, the Free Encyclopedia.” <a href="http://en.wikipedia.org/w/index.php?title=Predictive%20analytics&oldid=764577274">http://en.wikipedia.org/w/index.php?title=Predictive%20analytics&oldid=764577274</a>.</p>
My Podcasting Setup
2017-02-20T00:00:00+00:00
http://simplystats.github.io/2017/02/20/podcasting-setup
<p>I’ve gotten a number of inquiries over the last 2 years about my podcasting setup and I’ve been meaning to write about it but….</p>
<p>But here it is! I actually wanted to write this because I felt like there wasn’t a ton of good information about this on the Internet for more “casual” podcasters, as opposed to people who want to do it professionally. So here’s what I’ve got.</p>
<p>There are two types of podcasts roughly: The kind you record with everyone in the same room and the kind you record where everyone is in different rooms. They both require slightly different setups so I’ll talk about both. For me, Elizabeth Matsui and I record <a href="http://effortreport.libsyn.com">The Effort Report</a> locally because we’re both at Johns Hopkins. But Hilary Parker and I record <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> remotely because she’s on the other side of the country most of the time.</p>
<h2 id="recording-equipment">Recording Equipment</h2>
<p>When Hilary and I first started we just used the microphone attached to the headphones you get with your iPhone or whatever. That’s okay but the sound feels very “narrow” to me. That said, it’s a good way to get started and it likely costs you nothing.</p>
<p>The next level up for many people is the <a href="https://www.amazon.com/Blue-Yeti-USB-Microphone-Silver/dp/B002VA464S/">Blue Yeti USB Microphone</a>, which is a perfectly fine microphone and not too expensive. Also, it uses USB (as opposed to the more professional XLR) so it connects to any computer, which is nice. However, it typically retails for $120, which isn’t nothing, and there are probably cheaper microphones that are just as good. For example, Jason Snell recommends the <a href="https://www.amazon.com/Audio-Technica-ATR2100-USB-Cardioid-Dynamic-Microphone/dp/B004QJOZS4/ref=as_li_ss_tl?ie=UTF8&qid=1479488629&sr=8-2&keywords=audio-technica+atr&linkCode=sl1&tag=incomparablepod-20&linkId=0919132824ac2090de45f2b1135b0163">Audio Technica ATR2100</a>, which is only about $70.</p>
<p>If you’re willing to shell out a little more money, I’d highly recommend the <a href="https://www.zoom-na.com/products/field-video-recording/field-recording/zoom-h4n-handy-recorder">Zoom H4n</a> portable recorder. This is actually two things: a microphone <em>and</em> a recorder. It has a nice stereo microphone built into the top along with two XLR inputs on the bottom that allow you to record from external mics. It records to SD cards so it’s great for a portable setup where you don’t want to carry a computer around with you. It retails for about $200 so it’s <em>not</em> cheap, but in my opinion it is worth every penny. I’ve been using my H4n for years now.</p>
<p>Because we do a lot of recording for our online courses here, we’ve actually got a bit more equipment in the office. So for in-person podcasts I sometimes record using a <a href="https://en-us.sennheiser.com/short-gun-tube-microphone-camera-films-mkh-416-p48u3">Sennheiser MKH416-P48US</a> attached to an <a href="https://www.amazon.com/gp/product/B00D4AGIBS/">Auray MS-5230T microphone stand</a>, which is decidedly not cheap but is a great piece of hardware.</p>
<p>By the way, a microphone stand is great to have, if you can get one, so you don’t have to set the microphone on your desk or table. That way if you bump the table by accident or generally like to bang the table, it won’t get picked up on the microphone. It’s not something to get right away, but maybe later when you make the big time.</p>
<h2 id="recording-software">Recording Software</h2>
<p>If you’re recording by yourself, you can just hook up your microphone to your computer and record to any old software that records sound (on the Mac you can use Quicktime). If you have multiple people, you can either</p>
<ol>
<li>Speak into the same mic and have both your voices recorded on the same audio file</li>
<li>Use separate mics (and separate computers) and record separately onto separate audio files. This requires syncing the audio files in an editor, but that’s not too big a deal if you only have 2-3 people.</li>
</ol>
<p>For local podcasts, I actually just use the H4n and record directly to the SD card. This creates separate WAV files for each microphone that are already synced so you can just plop them in the editor.</p>
<p>For remote podcasts, you’ll need some communication software. Hilary and I use <a href="https://zencastr.com">Zencastr</a> which has its own VoIP system that allows you to talk to anyone by just sending your guests a link. So I create a session in Zencastr, send Hilary the link for the session, she logs in (without needing any credentials) and we just start talking. The web site records the audio directly off of your microphone and then uploads the audio files (one for each guest) to Dropbox. The service is really nice and there are now a few just like it. Zencastr costs $20 a month right now but there is a limited free tier.</p>
<p>The other approach is to use something like Skype and then use something like <a href="http://www.ecamm.com/mac/callrecorder/">ecamm call-recorder</a> to record the conversation. The downside with this approach is that if you have any network trouble that messes up the audio, then you will also record that. However, Zencastr (and related services) do not work on iOS devices and other devices that use WebKit based browsers. So if you have someone calling in on a mobile device via Skype or something, then you’ll have to use this approach. Otherwise, I prefer the Zencastr approach and can’t really see any downside except for the cost.</p>
<h2 id="editing-software">Editing Software</h2>
<p>There isn’t a lot of software that’s specifically designed for editing podcasts. I actually started off editing podcasts in Final Cut Pro X (nonlinear video editor) because that’s what I was familiar with. But now I use <a href="http://www.apple.com/logic-pro/">Logic Pro X</a>, which is not really designed for podcasts, but it’s a real digital audio workstation and has nice features (like <a href="https://support.apple.com/kb/PH13055?locale=en_US">strip silence</a>). But I think something like <a href="http://www.audacityteam.org">Audacity</a> would be fine for basic editing.</p>
<p>The main thing I need to do with editing is merge the different audio tracks together and cut off any extraneous material at the beginning or the end. I don’t usually do a lot of editing in the middle unless there’s a major mishap like a siren goes by or a cat jumps on the computer. Once the editing is done I bounce to an AAC or MP3 file for uploading.</p>
<h2 id="hosting">Hosting</h2>
<p>You’ll need a service for hosting your audio files if you don’t have your own server. You can technically host your audio files anywhere, but specific services have niceties like auto-generating the RSS feed. For Not So Standard Deviations I use <a href="https://soundcloud.com/stream">SoundCloud</a> and for The Effort Report I use <a href="https://www.libsyn.com">Libsyn</a>.</p>
<p>Of the two services, I think I prefer Libsyn, because it’s specifically designed for podcasting and has somewhat better analytics. The web site feels a little bit like it was designed in 2003, but otherwise it works great. Libsyn also has features for things like advertising and subscriptions, but I don’t use any of those. SoundCloud is fine but wasn’t really designed for podcasting and sometimes feels a little unnatural.</p>
<h2 id="summary">Summary</h2>
<p>If you’re interested in getting started in podcasting, here’s my bottom line:</p>
<ol>
<li>Get a partner. It’s more fun that way!</li>
<li>If you and your partner are remote, use Zencastr or something similar.</li>
<li>Splurge for the Zoom H4n if you can, otherwise get a reasonably cheap microphone like the Audio Technica or the Yeti.</li>
<li>Don’t focus too much on editing. Just clip off the beginning and the end.</li>
<li>Host on Libsyn.</li>
</ol>
Data Scientists Clashing at Hedge Funds
2017-02-15T00:00:00+00:00
http://simplystats.github.io/2017/02/15/Data-Scientists-Clashing-at-Hedge-Funds
<p>There’s an interesting article over at Bloomberg about how <a href="https://www.bloomberg.com/news/articles/2017-02-15/point72-shows-how-firms-face-culture-clash-on-road-to-quantland">data scientists have struggled at some hedge funds</a>:</p>
<blockquote>
<p>The firms have been loading up on data scientists and coders to deliver on the promise of quantitative investing and lift their ho-hum returns. But they are discovering that the marriage of old-school managers and data-driven quants can be rocky. Managers who have relied on gut calls resist ceding control to scientists and their trading signals. And quants, emboldened by the success of computer-driven funds like Renaissance Technologies, bristle at their second-class status and vie for a bigger voice in investing.</p>
</blockquote>
<p>There are some interesting tidbits in the article that I think hold lessons for any collaboration between a data scientist or analyst and a non-data scientist (for lack of a better word).</p>
<p>At Point72, the family office successor to SAC Capital, there were problems at the quant unit (known as Aperio):</p>
<blockquote>
<p>The divide between Aperio quants and fundamental money managers was also intellectual. They struggled to communicate about the basics, like how big data could inform investment decisions. [Michael] Recce’s team, which was stacked with data scientists and coders, developed trading signals but didn’t always fully explain the margin of error in the analysis to make them useful to fund managers, the people said.</p>
</blockquote>
<p>It’s hard to know the details of what actually happened, but for data scientists collaborating with others, there always needs to be an explanation of “what’s going on”. There’s a general feeling that it’s okay that machine learning techniques build complicated uninterpretable models because they work better. But in my experience that’s not enough. People want to know why they work better, when they work better, and when they <em>don’t</em> work.</p>
<p>On over-theorizing:</p>
<blockquote>
<p>Haynes, who joined Stamford, Connecticut-based Point72 in early 2014 after about two decades at McKinsey & Co., and other senior managers grew dissatisfied with Aperio’s progress and impact on returns, the people said. When the group obtained new data sets, it spent too much time developing theories about how to process them rather than quickly producing actionable results.</p>
</blockquote>
<p>I don’t necessarily agree with this “criticism”, but I only put it here because the land of hedge funds isn’t generally viewed on the outside as a place where lots of theorizing goes on.</p>
<p>At BlueMountain, another hedge fund:</p>
<blockquote>
<p>When quants showed their risk analysis and trading signals to fundamental managers, they sometimes were rejected as nothing new, the people said. Quants at times wondered if managers simply didn’t want to give them credit for their ideas.</p>
</blockquote>
<p>I’ve seen this quite a bit. When a data scientist presents results to collaborators, there’s often two responses:</p>
<ol>
<li>“I knew that already” and so you haven’t taught me anything new</li>
<li>“I didn’t know that already” and so you must be wrong</li>
</ol>
<p>The common link here, of course, is the inability to admit that there are things you don’t know. Whether this is an inherent character flaw or something that can be overcome through teaching is not yet clear to me. But it is common when data is brought to bear on a problem that previously lacked data. One of the key tasks that a data scientist in any industry must prepare for is the task of giving people information that will make them uncomfortable.</p>
Not So Standard Deviations Episode 32 - You Have to Reinvent the Wheel a Few Times
2017-02-13T00:00:00+00:00
http://simplystats.github.io/2017/02/13/nssd-episode-32
<p>Hilary and I discuss training in PhD programs, estimating the variance vs. the standard deviation, the bias variance tradeoff, and explainable machine learning.</p>
<p>We’re also introducing a new level of support on our Patreon page, where you can get access to some of the outtakes from our episodes. Check out our <a href="https://www.patreon.com/NSSDeviations">Patreon page</a> for details.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://www.darpa.mil/program/explainable-artificial-intelligence">Explainable AI</a></p>
</li>
<li>
<p><a href="http://multithreaded.stitchfix.com/blog/2016/11/22/nba-rankings/">Stitch Fix Blog NBA Rankings</a></p>
</li>
<li>
<p><a href="http://varianceexplained.org/r/empirical-bayes-book/">David Robinson’s Empirical Bayes book</a></p>
</li>
<li>
<p><a href="https://warontherocks.com/2017/01/introducing-bombshell-the-explosive-first-episode/">War on the Rocks podcast</a></p>
</li>
<li>
<p><a href="https://twitter.com/rdpeng">Roger on Twitter</a></p>
</li>
<li>
<p><a href="https://twitter.com/hspter">Hilary on Twitter</a></p>
</li>
<li>
<p><a href="https://leanpub.com/conversationsondatascience/">Get the Not So Standard Deviations book</a></p>
</li>
<li>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a></p>
</li>
<li>
<p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a></p>
</li>
<li>
<p><a href="https://soundcloud.com/nssd-podcast">Find past episodes</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-32-you-have-to-reinvent-the-wheel-a-few-times">Download the audio for this episode</a></p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/306883468&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Reproducible Research Needs Some Limiting Principles
2017-02-01T00:00:00+00:00
http://simplystats.github.io/2017/02/01/reproducible-research-limits
<p>Over the past 10 years of thinking and writing about reproducible research, I’ve come to the conclusion that much of the discussion is incomplete. While I think we as a scientific community have come a long way in changing people’s thinking about data and code and making them available to others, there are some key sticking points that keep coming up and are preventing further progress in the area.</p>
<p>When I used to write about reproducibility, I felt that the primary challenge/roadblock was a lack of tooling. Much has changed in just the last five years though, and many new tools have been developed to make life a lot easier. Packages like knitr (for R), markdown, and IPython notebooks have made writing reproducible data analysis documents a lot easier. Web sites like GitHub and many others have made distributing analyses a lot simpler because now everyone effectively has a free web site (this was NOT true in 2005).</p>
<p>Even still, our basic definition of reproducibility is incomplete. Most people would say that a data analysis is reproducible if the analytic data and metadata are available and the code that did the analysis is available. Furthermore, it would be preferable to have some documentation to go along with both. But there are some key issues that need to be resolved to complete this general definition.</p>
<h2 id="reproducible-for-whom">Reproducible for Whom?</h2>
<p>In discussions about reproducibility with others, the topic of <strong>who</strong> should be able to reproduce the analysis only occasionally comes up. There’s a general sense, especially amongst academics, that <strong>anyone</strong> should be able to reproduce any analysis if they wanted to.</p>
<p>There is an analogy with free software here in the sense that free software can be free for some people and not for others. This made more sense in the days before the Internet when distribution was much more costly. The idea here was that I could write software for a client and give them the source code for that software (as they would surely demand). The software is free for them but not for anyone else. But free software ultimately only matters when it comes to distribution. Once I distribute a piece of software, that’s when all the restrictions come into play. However, if I only distribute it to a few people, I only need to guarantee that those few people have those freedoms.</p>
<p>Richard Stallman once said that something like 90% of software was free software because almost all software being written was custom software for individual clients (I have no idea where he got this number). Even if the number is wrong, the point still stands that if I write software for a single person, it can be free for that person even if no one in the world has access to the software.</p>
<p>Of course, now with the Internet, everything pretty much gets distributed to everyone because there’s nothing stopping someone from taking a piece of free software and posting it on a web site. But the idea still holds: Free software only needs to be free for the people who receive it.</p>
<p>That said, the analogy is not perfect. Software and research are not the same thing. The key difference is that you can’t call something research unless it is generally available and disseminated. If Pfizer comes up with the cure for cancer and never tells anyone about it, it’s not research. If I discover that there’s a 9th planet and only tell my neighbor about it, it’s not research. Many companies might call those activities research (particularly from a tax/accounting point of view) but since society doesn’t get to learn about them, it’s not research.</p>
<p>If research is by definition disseminated to all, then it should therefore be reproducible by all. However, there are at least two circumstances in which we do not even pretend to believe this is possible.</p>
<ol>
<li><strong>Imbalance of resources</strong>: If I conduct a data analysis that requires the <a href="https://www.top500.org/lists/2016/06/">world’s largest supercomputer</a>, I can make all the code and data available that I want–few people will be able to actually reproduce it. That’s an extreme case, but even if I were to make use of a <a href="https://jhpce.jhu.edu">dramatically smaller computing cluster</a> it’s unlikely that anyone would be able to recreate those resources. So I can distribute something that’s reproducible in theory but not in reality by most people.</li>
<li><strong>Protected data</strong>: Numerous analyses in the biomedical sciences make use of protected health information that cannot easily be disseminated. Privacy is an important issue, in part, because in many cases it allows us to collect the data in the first place. However, most would agree we cannot simply post that data for all to see in the name of reproducibility. First, it is against the law, and second it would likely deter anyone from agreeing to participate in any study in the future.</li>
</ol>
<p>We can pretend that we can make data analyses reproducible for all, but in reality it’s not possible. So perhaps it would make sense for us to consider whether a limiting principle should be applied. The danger of not considering it is that one may take things to the extreme—if it can’t be made reproducible for all, then why bother trying? A partial solution is needed here.</p>
<h2 id="for-how-long">For How Long?</h2>
<p>Another question that needs to be resolved for reproducibility to be a widely implemented and sustainable phenomenon is for how long should something be reproducible? Ultimately, this is a question about time and resources because ensuring that data and code can be made available and can run on current platforms <em>in perpetuity</em> requires substantial time and money. In the academic community, where projects are often funded off of grants or contracts with finite lifespans, often the money is long gone even though the data and code must be maintained. The question then is who pays for the maintenance and upkeep of the data and code?</p>
<p>I’ve never heard a satisfactory answer to this question. If the answer is that data analyses should be reproducible forever, then we need to consider a different funding model. This position would require a perpetual funds model, essentially an endowment, for each project that is disseminated and claims to be reproducible. The endowment would pay for things like servers for hosting the code and data and perhaps engineers to adapt and adjust the code as the surrounding environment changes. While there are a number of <a href="http://dataverse.org">repositories</a> that have developed scalable operating models, it’s not clear to me that the funding model is completely sustainable.</p>
<p>If we look at how scientific publications are sustained, we see that it’s largely private enterprise that shoulders the burden. Journals house most of the publications out there and they charge a fee for access (some for profit, some not for profit). Whether the reader pays or the author pays is not relevant; the point is that a decision has been made about <em>who</em> pays.</p>
<p>The author-pays model is interesting though. Here, an author pays a publication charge of ~$2,000, and the reader never pays anything for access (in perpetuity, presumably). The $2,000 payment by the author is like a one-time capital expense for maintaining that one publication forever (a mini-endowment, in a sense). It works for authors because grant/contract supported research often budget for one-time publication charges. There’s no need for continued payments after a grant/contract has expired.</p>
<p>The publication system is quite a bit simpler because almost all publications are the same size and require the same resources for access—basically a web site that can serve up PDF files and people to maintain it. For data analyses, one could see things potentially getting out of control. For a large analysis with terabytes of data, what would the one-time up-front fee be to house the data and pay for anyone to access it for free forever?</p>
<p>Using Amazon’s <a href="http://calculator.s3.amazonaws.com/index.html">monthly cost estimator</a> we can get a rough sense of what the pure data storage might cost. Suppose we have a 10GB dataset that we want to store and we anticipate that it might be downloaded 10 times per month. This would cost about $7.65 per month, or $91.80 per year. If we assume Amazon raises their prices about 3% per year and a discount rate of 5%, the total cost for the storage is $4,590. If we tack on 20% for other costs, that brings us to $5,508. This is perhaps not unreasonable, and the scenario would certainly include most people. For comparison, a 1 TB dataset downloaded once a year, using the same formula, gives us a one-time cost of about $40,000. This is real money when it comes to fixed research budgets and would likely require some discussion of trade-offs.</p>
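<p>For readers who want to check the arithmetic, the one-time figure is essentially the present value of a perpetuity whose payments grow at 3% and are discounted at 5%. A quick sketch of that calculation, using the assumed numbers from the 10GB example above:</p>
<pre><code class="language-r">annual_storage = 7.65 * 12   # about $91.80 per year for the 10GB example
growth   = 0.03              # assumed annual price increase
discount = 0.05              # assumed discount rate

# Present value of a growing perpetuity: first-year cost / (discount - growth)
pv = annual_storage / (discount - growth)
pv          # $4,590
pv * 1.2    # about $5,508 after tacking on 20% for other costs
</code></pre>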
<h2 id="summary">Summary</h2>
<p>Reproducibility is a necessity in science, but it’s high time that we start considering the practical implications of actually doing the job. There are still holdouts when it comes to the basic idea of reproducibility, but they are fewer and farther between. If we do not seriously consider the details of how to implement reproducibility, perhaps by introducing some limiting principles, we may never be able to achieve any sort of widespread adoption.</p>
Turning data into numbers
2017-01-31T00:00:00+00:00
http://simplystats.github.io/2017/01/31/data-into-numbers
<p><em>Editor’s note: This is the third chapter of a book I’m working on called <a href="https://leanpub.com/demystifyai/">Demystifying Artificial Intelligence</a>. The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I’m developing the book over time - so if you buy the book on Leanpub know that there are only three chapters in there so far, but I’ll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this <a href="https://twitter.com/notajf/status/795717253505413122">amazing tweet</a> by Twitter user <a href="https://twitter.com/notajf/">@notajf</a>. Feedback is welcome and encouraged!</em></p>
<blockquote>
<p>“It is a capital mistake to theorize before one has data.” Arthur Conan Doyle</p>
</blockquote>
<h2 id="data-data-everywhere">Data, data everywhere</h2>
<p>I already have some data about you. You are reading this book. Does that seem like data? It’s just something you did, that’s not data is it? But if I collect that piece of information about you, it actually tells me a surprising amount. It tells me you have access to an internet connection, since the only place to get the book is online. That in turn tells me something about your socioeconomic status and what part of the world you live in. It also tells me that you like to read, which suggests a certain level of education.</p>
<p>Whether you know it or not, everything you do produces data - from the websites you read to the rate at which your heart beats. Until pretty recently, most of the data you produced wasn’t collected, it floated off unmeasured. Data were painstakingly gathered by scientists one number at a time in small experiments with a few people. This laborious process meant that data were expensive and time-consuming to collect. Yet many of the most amazing scientific discoveries over the last two centuries were squeezed from just a few data points. But over the last two decades, the unit price of data has dramatically dropped. New technologies touching every aspect of our lives from our money, to our health, to our social interactions have made data collection cheap and easy.</p>
<p>To give you an idea of how steep the drop in the price of data has been, in 1967 Stanley Milgram did an experiment to determine the number of degrees of separation between two people in the U.S. (Travers and Milgram 1969). In his experiment he sent 296 letters to people in Omaha, Nebraska and Wichita, Kansas. The goal was to get the letters to a specific person in Boston, Massachusetts. The trick was people had to send the letters to someone they knew, and they then sent it to someone they knew and so on. At the end of the experiment, only 64 letters made it to the individual in Boston. On average, the letters had gone through 6 people to get there.</p>
<p>This is an idea that is so powerful it even became part of the popular consciousness. For example, it is the foundation of the internet meme “the 6-degrees of Kevin Bacon” (Wikipedia contributors 2016a) - the idea that if you take any actor and look at the people they have been in movies with, then the people those people have been in movies with, it will take you at most six steps to end up at the actor Kevin Bacon. This idea, despite its popularity, was originally studied by Milgram using only 64 data points. A 2007 study updated that number to “7 degrees of Kevin Bacon”. The study was based on 30 billion instant messaging conversations collected over the course of a month or two with the same amount of effort (Leskovec and Horvitz 2008).</p>
<p>Once data started getting cheaper to collect, it got cheaper fast. Take another example, the human genome. The genome is the unique DNA code in every one of your cells. It consists of a set of 3 billion letters that is unique to you. By many measures, the race to be the first group to collect all 3 billion letters from a single person kicked off the data revolution in biology. The project was completed in 2000 after a decade of work and $3 billion to collect the 3 billion letters in the first human genome (Venter et al. 2001). This project was actually a stunning success, most people thought it would be much more expensive. But just over a decade later, new technology means that we can now collect all 3 billion letters from a person’s genome for about $1,000 in about a week (“The Cost of Sequencing a Human Genome,” n.d.), soon it may be less than $100 (Buhr 2017).</p>
<p>You may have heard that this is the era of “big data” from The Economist or The New York Times. It is really the era of cheap data collection and storage. Measurements we never bothered to collect before are now so easy to obtain that there is no reason not to collect them. Advances in computer technology also make it easier to store huge amounts of data digitally. This may not seem like a big deal, but it is much easier to calculate the average of a bunch of numbers stored electronically than it is to calculate that same average by hand on a piece of paper. Couple these advances with the free and open distribution of data over the internet and it is no surprise that we are awash in data. But tons of data on their own are meaningless. It is understanding and interpreting the data where the real advances start to happen.</p>
<p>This explosive growth in data collection is one of the key driving influences behind interest in artificial intelligence. When teaching computers to do something that only humans could do previously, it helps to have lots of examples. You can then use statistical and machine learning models to summarize that set of examples and help a computer make decisions about what to do. The more examples you have, the more flexible your computer model can be in making decisions, and the more “intelligent” the resulting application.</p>
<h2 id="what-is-data">What is data?</h2>
<h3 id="tidy-data">Tidy data</h3>
<p>“What is data?” This seems like a relatively simple question. In some ways this question is easy to answer. According to <a href="https://en.wikipedia.org/wiki/Data">Wikipedia</a>:</p>
<blockquote>
<p>Data (/ˈdeɪtə/ day-tə, /ˈdætə/ da-tə, or /ˈdɑːtə/ dah-tə)[1] is a set of values of qualitative or quantitative variables. An example of qualitative data would be an anthropologist’s handwritten notes about her interviews with people of an Indigenous tribe. Pieces of data are individual pieces of information. While the concept of data is commonly associated with scientific research, data is collected by a huge range of organizations and institutions, ranging from businesses (e.g., sales data, revenue, profits, stock price), governments (e.g., crime rates, unemployment rates, literacy rates) and non-governmental organizations (e.g., censuses of the number of homeless people by non-profit organizations).</p>
</blockquote>
<p>When you think about data, you probably think of orderly sets of numbers arranged in something like an Excel spreadsheet. In the world of data science and machine learning this type of data has a name - “tidy data” (Wickham and others 2014). Tidy data has the property that all measured quantities are represented by numbers or character strings (think words). The data are organized such that:</p>
<ol>
<li>Each variable you measured is in one column</li>
<li>Each different measurement of that variable is in a different row</li>
<li>There is one data table for each “type” of variable.</li>
<li>If there are multiple tables then they are linked by a common ID.</li>
</ol>
<p>This idea is borrowed from data management schemas that have long been used for storing data in databases. Here is an example of a tidy data set of swimming world records.</p>
<table>
<thead>
<tr>
<th style="text-align: right">year</th>
<th style="text-align: right">time</th>
<th style="text-align: left">sex</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1905</td>
<td style="text-align: right">65.8</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1908</td>
<td style="text-align: right">65.6</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1910</td>
<td style="text-align: right">62.8</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1912</td>
<td style="text-align: right">61.6</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1918</td>
<td style="text-align: right">61.4</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1920</td>
<td style="text-align: right">60.4</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1922</td>
<td style="text-align: right">58.6</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1924</td>
<td style="text-align: right">57.4</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1934</td>
<td style="text-align: right">56.8</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1935</td>
<td style="text-align: right">56.6</td>
<td style="text-align: left">M</td>
</tr>
</tbody>
</table>
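<p>In code, a tidy data set like this is typically stored as a data frame with one column per variable and one row per measurement. Here is a small sketch in R showing the first few rows of the table above in that form:</p>
<pre><code class="language-r"># The first few swimming world records from the table above, as a tidy data frame
swim_records = data.frame(
  year = c(1905, 1908, 1910, 1912),
  time = c(65.8, 65.6, 62.8, 61.6),
  sex  = c("M", "M", "M", "M")
)
swim_records
</code></pre>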
<p>This type of data - neat, organized, and nicely numeric - is not the kind of data people are talking about when they say the “era of big data”. Data almost never start their lives in such a neat and organized format.</p>
<h3 id="raw-data">Raw data</h3>
<p>The explosion of interest in AI has been powered by a variety of types of data that you might not even think of when you think of “data”. The data might be pictures you take and upload to social media, the text of the posts on that same platform, or the sound captured from your voice when you speak to your phone.</p>
<p>Social media and cell phones aren’t the only area where data is being collected more frequently. Speed cameras on roads collect data on the movement of cars, electronic medical records store information about people’s health, wearable devices like Fitbit collect information on the activity of people. GPS information stores the location of people, cars, boats, airplanes, and an increasingly wide array of other objects.</p>
<p>Images, voice recordings, text files, and GPS coordinates are what experts call “raw data”. To create an artificial intelligence application you need to begin with a lot of raw data. But as we discussed in the simple AI example from the previous chapter - a computer doesn’t understand raw data in its natural form. It is not always immediately obvious how the raw data can be turned into numbers that a computer can understand. For example, when an artificial intelligence works with a picture the computer doesn’t “see” the picture file itself. It sees a set of numbers that represent that picture and operates on those numbers. The first step in almost every artificial intelligence application is to “pre-process” the data - to take the image files or the movie files or the text of a document and turn it into numbers that a computer can understand. Then those numbers can be fed into algorithms that can make predictions and ultimately be used to make an interface look intelligent.</p>
<h2 id="turning-raw-data-into-numbers">Turning raw data into numbers</h2>
<p>So how do we convert raw data into a form we can work with? It depends on what type of measurement or data you have collected. Here I will use two examples to explain how you can convert images and the text of a document into numbers that an algorithm can be applied to.</p>
<h3 id="images">Images</h3>
<p>Suppose that we were developing an AI to identify pictures of the author of this book. We would need to collect a picture of the author - maybe an embarrassing one.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff.jpg" alt="An embarrassing picture of the author" /></p>
<p>This picture is made of pixels. If you zoom in very close on the image, you can see that it consists of many hundreds of little squares, each square just one color. Those squares are called pixels and they are one step closer to turning the image into numbers.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-smile.png" alt="A zoomed in view of the author's smile - you can see that each little square corresponds to one pixel and has an individual color" /></p>
<p>You can think of each pixel like a dot of color. Let’s zoom in a little bit more and instead of showing each pixel as a square show each one as a colored dot.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-smile-dots.png" alt="A zoomed in view of the author's smile - now each of the pixels are little dots one for each pixel." /></p>
<p>Imagine we are going to build an AI application on the basis of lots of images. Then we would like to turn a set of images into “tidy data”. As described above, a tidy data set is defined as the following:</p>
<ol>
<li>Each variable you measured is in one column</li>
<li>Each different measurement of that variable is in a different row</li>
<li>There is one data table for each “type” of variable.</li>
<li>If there are multiple tables then they are linked by a common ID.</li>
</ol>
<p>A translation of tidy data for a collection of images would be the following.</p>
<ol>
<li><em>Variables</em>: Are the pixels measured in the images. So the top left pixel is a variable, the bottom left pixel is a variable, and so on. So each pixel should be in a separate column.</li>
<li><em>Measurements</em>: The measurements are the values for each pixel in each image. So each row corresponds to the values of the pixels for a single image.</li>
<li><em>Tables</em>: There would be two tables - one with the data from the pixels and one with the labels of each image (if we know them).</li>
</ol>
<p>To start to turn the image into a row of the data set we need to stretch the dots into a single row. One way to do this is to snake along the image going from top left corner to bottom right corner and creating a single line of dots.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-smile-lines.png" alt="Follow the path of the arrows to see how you can turn the two dimensional picture into a one dimensional picture" /></p>
<p>This still isn’t quite data a computer can understand - a computer doesn’t know about dots. But we could take each dot and label it with a color name.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-color-names.png" alt="Labeling each color with a name" /></p>
<p>We could take each color name and give it a number, something like <code class="language-plaintext highlighter-rouge">rosybrown = 1</code>, <code class="language-plaintext highlighter-rouge">mistyrose = 2</code>, and so on. This approach runs into some trouble because we don’t have names for every possible color and because it is pretty inefficient to have a different number for every hue we could imagine.</p>
<p>It also wouldn’t be very meaningful to a computer. An alternative strategy that is often used is to encode the intensity of the red, green, and blue colors for each pixel. This is sometimes called the RGB color model (Wikipedia contributors 2016b). So for example we can take these dots and show how much red, green, and blue they have in them.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-rgb.png" alt="Breaking each color down into the amount of red, green and blue" /></p>
<p>Looking at it this way we now have three measurements for each pixel. So we need to update our tidy data definition to be:</p>
<ol>
<li><em>Variables</em>: Are the three colors for each pixel measured in the images. So the top left pixel red value is a variable, the top left pixel green value is a variable and so on. So each pixel/color combination should be in a separate column.</li>
<li><em>Measurements</em>: The measurements are the values for each pixel in each image. So each row corresponds to the values of the pixels for a single image.</li>
<li><em>Tables</em>: There would be two tables - one with the data from the pixels and one with the labels of each image (if we know them).</li>
</ol>
<p>So a tidy data set might look something like this for just the image of Jeff.</p>
<table>
<thead>
<tr>
<th>id</th>
<th>label</th>
<th>p1red</th>
<th>p1green</th>
<th>p1blue</th>
<th>p2red</th>
<th>…</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>“jeff”</td>
<td>238</td>
<td>180</td>
<td>180</td>
<td>205</td>
<td>…</td>
</tr>
</tbody>
</table>
<p>Each additional image would then be another row in the data set. As we will see in the chapters that follow we can then feed this data into an algorithm for performing an artificial intelligence task.</p>
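<p>As a rough sketch of how this pre-processing might look in practice, the R code below reads an image and flattens its red, green, and blue values into a single row. The file name is a placeholder rather than an actual file from this book, and the exact ordering of the pixels is just one of several reasonable choices.</p>
<pre><code class="language-r">library(png)

# "jeff.png" is a placeholder file name; any small PNG image would work
img = readPNG("jeff.png")   # array: height x width x 3 (or 4 if there is transparency)

# Flatten the array into one long row of numbers: p1red, p1green, p1blue, p2red, ...
# reading the pixels left to right, one row of the image at a time
pixel_row = as.vector(aperm(img, c(3, 2, 1)))

# One row of the tidy data set: an id, a label, and the pixel values
tidy_row = data.frame(id = 1, label = "jeff", t(pixel_row))
dim(tidy_row)
</code></pre>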
<h2 id="notes">Notes</h2>
<p>Parts of this chapter appeared in the Simply Statistics blog post <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">“The vast majority of statistical analysis is not performed by statisticians”</a>, written by the author of this book.</p>
<h2 id="references">References</h2>
<p>Buhr, Sarah. 2017. “Illumina Wants to Sequence Your Whole Genome for $100.” <a href="https://techcrunch.com/2017/01/10/illumina-wants-to-sequence-your-whole-genome-for-100/">https://techcrunch.com/2017/01/10/illumina-wants-to-sequence-your-whole-genome-for-100/</a>.</p>
<p>Leskovec, Jure, and Eric Horvitz. 2008. “Planetary-Scale Views on an Instant-Messaging Network.”</p>
<p>“The Cost of Sequencing a Human Genome.” n.d. <a href="https://www.genome.gov/sequencingcosts/">https://www.genome.gov/sequencingcosts/</a>.</p>
<p>Travers, Jeffrey, and Stanley Milgram. 1969. “An Experimental Study of the Small World Problem.” <em>Sociometry</em> 32 (4). [American Sociological Association, Sage Publications, Inc.]: 425–43.</p>
<p>Venter, J Craig, Mark D Adams, Eugene W Myers, Peter W Li, Richard J Mural, Granger G Sutton, Hamilton O Smith, et al. 2001. “The Sequence of the Human Genome.” <em>Science</em> 291 (5507). American Association for the Advancement of Science: 1304–51.</p>
<p>Wickham, Hadley, and others. 2014. “Tidy Data.” <em>Under Review</em>.</p>
<p>Wikipedia contributors. 2016a. “Six Degrees of Kevin Bacon.” <a href="https://en.wikipedia.org/w/index.php?title=Six_Degrees_of_Kevin_Bacon&oldid=748831516">https://en.wikipedia.org/w/index.php?title=Six_Degrees_of_Kevin_Bacon&oldid=748831516</a>.</p>
<p>———. 2016b. “RGB Color Model.” <a href="https://en.wikipedia.org/w/index.php?title=RGB_color_model&oldid=756764504">https://en.wikipedia.org/w/index.php?title=RGB_color_model&oldid=756764504</a>.</p>
New class - Data App Prototyping for Public Health and Beyond
2017-01-26T00:00:00+00:00
http://simplystats.github.io/2017/01/26/new-prototyping-class
<p>Are you interested in building data apps to help save the world, start the next big business, or just to see if you can? We are running a data app prototyping class for people interested in creating these apps.</p>
<p>This will be a special topics class at JHU and is open to any undergrad student, grad student, postdoc, or faculty member at the university. We are also seeing if we can make the class available to people outside of JHU, so if you aren’t at JHU but are interested, you should still let us know below.</p>
<p>One of the principles of our approach is that anyone can prototype an app. Our class starts with some tutorials on Shiny and R. While we have no formal pre-reqs for the class, you will have much more fun if you have the background equivalent to our Coursera classes:</p>
<ul>
<li><a href="https://www.coursera.org/learn/data-scientists-tools">Data Scientist’s Toolbox</a></li>
<li><a href="https://www.coursera.org/learn/r-programming">R programming</a></li>
<li><a href="https://www.coursera.org/learn/r-packages">Building R packages</a></li>
<li><a href="https://www.coursera.org/learn/data-products">Developing Data Products</a></li>
</ul>
<p>If you don’t have that background you can take the classes online starting now to get up to speed! To see some examples of apps we will be building check out our <a href="http://jhudatascience.org/data_app_gallery.html">gallery</a>.</p>
<p>We will mostly be able to support development with R and Shiny but would be pumped to accept people with other kinds of development background - we just might not be able to give a lot of technical assistance.</p>
<p>As part of the course we are also working with JHU’s <a href="https://ventures.jhu.edu/fastforward/">Fast Forward</a> program to streamline and ease the process of starting a company around the app you build for the class. So if you have entrepreneurial ambitions, this is the class for you!</p>
<p>We are in the process of setting up the course times, locations, and enrollment cap. The class will run from March to May (exact dates TBD). To sign up for announcements about the class please fill out your information <a href="http://jhudatascience.org/prototyping_students.html">here</a>.</p>
User Experience and Value in Products - What Regression and Surrogate Variables can Teach Us
2017-01-23T00:00:00+00:00
http://simplystats.github.io/2017/01/23/ux-value
<p>Over the past year, there have been a number of recurring topics in my global news feed that have a shared theme to them. Some examples of these topics are:</p>
<ul>
<li><strong>Fake news</strong>: Before and after the election in 2016, Facebook (or Facebook’s Trending News algorithm) was accused of promoting news stories that turned out to be completely false, pushed by dubious news sources in FYROM and elsewhere.</li>
<li><strong>Theranos</strong>: This diagnostic testing company promised to revolutionize the blood testing business and prevent disease for all by making blood testing simple and painless. This way people would not be afraid to get blood tests and would do them more often, presumably catching diseases while they were in the very early stages. Theranos lobbied to allow patients to order their own blood tests so that they wouldn’t need a doctor’s order.</li>
<li><strong>Homeopathy</strong>: This is a so-called <a href="https://nccih.nih.gov/health/homeopathy">alternative medical system</a> developed in the late 18th century based on notions such as “like cures like” and the “law of minimum dose.”</li>
<li><strong>Online education</strong>: New companies like Coursera and Udacity promised to revolutionize education by making it accessible to a broader audience than conventional universities were able to reach.</li>
</ul>
<p>What exactly do these things have in common?</p>
<p>First, consumers love them. Fake news played to people’s biases by confirming to them, from a seemingly trustworthy source, what they always “knew to be true”. The fact that the stories weren’t actually true was irrelevant given that users enjoyed the experience of seeing what they agreed with. Perhaps the best explanation of the entire Facebook fake news issue was from Kim-Mai Cutler:</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">The best way to have the stickiest and most lucrative product? Be a systematic tool for confirmation bias. <a href="https://t.co/8uOHZLomhX">https://t.co/8uOHZLomhX</a></p>— Kim-Mai Cutler (@kimmaicutler) <a href="https://twitter.com/kimmaicutler/status/796560990854905857">November 10, 2016</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Theranos promised to revolutionize blood testing and change the user experience behind the whole industry. Indeed the company had some fans (particularly amongst its <a href="https://www.axios.com/tim-drapers-keeps-defending-theranos-2192078259.html">investor base</a>). However, after investigations by the Center for Medicare and Medicaid Services, the FDA, and an independent laboratory, it was found that Theranos’s blood testing machine was wildly inconsistent and variable, leading to Theranos ultimately retracting all of its blood test results and cutting half its workforce.</p>
<p>Homeopathy is not company specific, but is touted by many as an “alternative” treatment for many diseases, with many claiming that it “works for them”. However, the NIH states quite clearly on its <a href="https://nccih.nih.gov/health/homeopathy">web site</a> that “There is little evidence to support homeopathy as an effective treatment for any specific condition.”</p>
<p>Finally, companies like Coursera and Udacity in the education space have indeed produced products that people like, but in some instances have hit bumps in the road. Udacity conducted a brief experiment/program with San Jose State University that failed due to the large differences between the population that took online courses and the one that took them in person. Coursera has massive offerings from major universities (including my own) but has run into continuing <a href="http://www.economist.com/news/special-report/21714173-alternative-providers-education-must-solve-problems-cost-and">challenges with drop out</a> and questions over whether the courses offered are suitable for job placement.</p>
<h2 id="user-experience-and-value">User Experience and Value</h2>
<p>In each of these four examples there is a consumer product that people love, often because they provide a great user experience. Take the fake news example–people love to read headlines from “trusted” news sources that agree with what they believe. With Theranos, people love to take a blood test that is not painful (maybe “love” is the wrong word here). With many consumer products companies, it is the user experience that defines the value of a product. Often when describing the user experience, you are simultaneously describing the value of the product.</p>
<p>Take for example Uber. With Uber, you open an app on your phone, click a button to order a car, watch the car approach you on your phone with an estimate of how long you will be waiting, get in the car and go to your destination, and get out without having to deal with paying. If someone were to ask me “What’s the value of Uber?” I would probably just repeat the description in the previous sentence. Isn’t it obvious that it’s better than the usual taxi experience? The same could be said for many companies that have recently come up: Airbnb, Amazon, Apple, Google. With many of the products from these companies, <em>the description of the user experience is a description of its value</em>.</p>
<h2 id="disruption-through-user-experience">Disruption Through User Experience</h2>
<p>In the example of Uber (and Airbnb, and Amazon, etc.) you could depict the relationship between the product, the user experience, and the value as such:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ux1.png" alt="" /></p>
<p>Any changes that you can make to the product to improve the user experience will then improve the value that the product offers. Another way to say it is that the user experience serves as a <em>surrogate outcome</em> for the value. We can influence the UX and know that we are improving value. Furthermore, any measurements that we take on the UX (surveys, focus groups, app data) will serve as direct observations on the value provided to customers.</p>
<p>New companies in these kinds of consumer product spaces can disrupt the incumbents by providing a much better user experience. When incumbents have gotten fat and lazy, there is often a sizable segment of the customer base that feels underserved. That’s when new companies can swoop in to specifically serve that segment, often with a “worse” product overall (as in fewer features) and usually much cheaper. The Internet has made the “swooping in” much easier by <a href="https://stratechery.com/2015/netflix-and-the-conservation-of-attractive-profits/">dramatically reducing transaction and distribution costs</a>. Once the new company has a foothold, they can gradually work their way up the ladder of customer segments to take over the market. It’s classic disruption theory a la <a href="http://www.claytonchristensen.com">Clayton Christensen</a>.</p>
<h2 id="when-value-defines-the-user-experience-and-product">When Value Defines the User Experience and Product</h2>
<p>There has been much talk of applying the classic disruption model to every space imaginable, but I contend that not all product spaces are the same. In particular, the four examples I described in the beginning of this post cover some of those different areas:</p>
<ul>
<li>Medicine (Theranos, homeopathy)</li>
<li>News (Facebook/fake news)</li>
<li>Education (Coursera/Udacity)</li>
</ul>
<p>One thing you’ll notice about these areas, particularly with medicine and education, is that they are all heavily regulated. The reason is that we as a community have decided that there is a minimum level of value that is required to be provided by entities in this space. That is, the value that a product offers is <em>defined first</em>, before the product can come to market. Therefore, the value of the product actually constrains the space of products that can be produced. We can depict this relationship as such:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ux2.png" alt="" /></p>
<p>In classic regression modeling language, the value of a product must be “adjusted for” before examining the relationship between the product and the user experience. Naturally, as in any regression problem, when you adjust for a variable that is related to the product and the user experience, you reduce the overall variation in the product.</p>
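<p>To make the regression analogy concrete, here is a toy R illustration (a made-up sketch with invented variable names, not real market data). Once you condition on a variable that constrains the product, the variation left over in the product is much smaller:</p>
<pre><code class="language-r">set.seed(1)
value   <- rnorm(1000)                          # the agreed-upon value (e.g., fixed by regulation)
product <- 0.8 * value + rnorm(1000, sd = 0.3)  # products are built around that required value

var(product)                     # total variation across possible products
var(resid(lm(product ~ value)))  # variation left after "adjusting for" value: much smaller
</code></pre>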
<p>In situations where the value defines the product and the user experience, there is much less room to maneuver for new entrants in the market. The reason is that they, like everyone else, are constrained by the value that is agreed upon by the community, usually in the form of regulations.</p>
<p>When Theranos comes in and claims that it’s going to dramatically improve the user experience of blood testing, that’s great, but it must still be constrained by the value that society demands, which is a certain precision and accuracy in its testing results. Companies in the online education space are welcome to disrupt things by providing a better user experience. Online offerings in fact do this by allowing students to take classes according to their own schedule, wherever they may live in the world. But we still demand that the students learn an agreed-upon set of facts, skills, or lessons.</p>
<p>New companies will often argue that the things that we currently value are outdated or no longer valuable. Their incentive is to change the value required so that there is more room for new companies to enter the space. This is a good thing, but it’s important to realize that this cannot happen solely through changes in the product. Innovative features of a product may help us to understand that we should be valuing different things, but ultimately the change in what we perceive as value occurs independently of any given product.</p>
<p>When I see new companies enter the education, medicine, or news areas, I always hesitate a bit because I want some assurance that they will still provide the value that we have come to expect. In addition, with these particular areas, there is a genuine sense that failing to deliver on what we value could cause serious harm to individuals. However, I think the discussion that is provoked by new companies entering the space is always welcome because we need to constantly re-evaluate what we value and whether it matches the needs of our time.</p>
An example that isn't that artificial or intelligent
2017-01-20T00:00:00+00:00
http://simplystats.github.io/2017/01/20/not-artificial-not-intelligent
<p><em>Editor’s note: This is the second chapter of a book I’m working on called <a href="https://leanpub.com/demystifyai/">Demystifying Artificial Intelligence</a>. The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I’m developing the book over time - so if you buy the book on Leanpub know that there are only two chapters in there so far, but I’ll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this <a href="https://twitter.com/notajf/status/795717253505413122">amazing tweet</a> by Twitter user <a href="https://twitter.com/notajf/">@notajf</a>. Feedback is welcome and encouraged!</em></p>
<blockquote>
<p>“I am so clever that sometimes I don’t understand a single word of
what I am saying.” Oscar Wilde</p>
</blockquote>
<p>As we have described it, artificial intelligence applications consist of
three things:</p>
<ol>
<li>A large collection of data examples</li>
<li>An algorithm for learning a model from that training set.</li>
<li>An interface with the world.</li>
</ol>
<p>In the following chapters we will go into each of these components in
much more detail, but let’s start with a couple of very simple examples
to make sure that the components of an AI are clear. We will start with
a completely artificial example and then move to more complicated
examples.</p>
<h2 id="building-an-album">Building an album</h2>
<p>Let’s start with a very simple hypothetical example that can be
understood even if you don’t have a technical background. We can also
use this example to define some of the terms we will be discussing later
in the book.</p>
<p>In our simple example the goal is to make an album of photos for a
friend. For example, suppose I want to take the photos in my photobook
and find all the ones that include pictures of myself and my son Dex for
his grandmother.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/cartoon-phone-photos.png" alt="The author's drawing of the author's phone album. Don't make fun, he's
a data scientist, not an artist" /></p>
<p>If you are anything like the author of this book, then you probably have
a very large number of pictures of your family on your phone. So the
first step in making the photo album would be to sort through all of my
pictures and pick out the ones that should be part of the album.</p>
<p>This is a typical example of the type of thing we might want to train a
computer to do in an artificial intelligence application. Each of the
components of an AI application is there:</p>
<ol>
<li><strong>The data</strong>: all of the pictures on the author’s phone (a big
training set!)</li>
<li><strong>The algorithm</strong>: finding pictures of me and my son Dex</li>
<li><strong>The interface</strong>: the album to give to Dex’s grandmother.</li>
</ol>
<p>One way to solve this problem is for me to sort through the pictures one
by one and decide whether they should be in the album or not, then
assemble them together, and then put them into the album. If I did it
like this then I myself would be the AI! That wouldn’t be very
artificial though…imagine we instead wanted to teach a computer to
make this album…</p>
<blockquote>
<p>But what does it mean to “teach” a computer to do something?</p>
</blockquote>
<p>The terms “machine learning” and “artificial intelligence” invoke the
idea of teaching computers in the same way that we teach children. This
was a deliberate choice to make the analogy - both because in some ways
it is appropriate and because it is useful for explaining complicated
concepts to people with limited backgrounds. To teach a child to find
pictures of the author and his son, you would show her lots of examples
of that type of picture and maybe some examples of the author with other
kids who were not his son. You’d repeat to the child that the pictures
of the author and his son were the kinds you wanted and the others
weren’t. Eventually she would retain that information and if you gave
her a new picture she could tell you whether it was the right kind or
not.</p>
<p>To teach a machine to perform the same kind of recognition you go
through a similar process. You “show” the machine many pictures labeled
as either the ones you want or not. You repeat this process until the
machine “retains” the information and can correctly label a new photo.
Getting the machine to “retain” this information is a matter of getting
the machine to create a set of step by step instructions it can apply to
go from the image to the label that you want.</p>
<h2 id="the-data">The data</h2>
<p>The images are what people in the fields of artificial intelligence and
machine learning call <em>“raw data”</em> (Leek, n.d.). The categories of
pictures (a picture of the author and his son or a picture of something
else) are called the <em>“labels”</em> or <em>“outcomes”</em>. If the computer gets to
see the labels when it is learning then it is called <em>“supervised
learning”</em> (Wikipedia contributors 2016) and when the computer doesn’t
get to see the labels it is called <em>“unsupervised learning”</em> (Wikipedia
contributors 2017a).</p>
<p>Going back to our analogy with the child, supervised learning would be
teaching the child to recognize pictures of the author and his son
together. Unsupervised learning would be giving the child a pile of
pictures and asking them to sort them into groups. They might sort them
by color or subject or location - not necessarily into categories that
you care about. But probably one of the categories they would make would
be pictures of people - so she would have found some potentially useful
information even if it wasn’t exactly what you wanted. One whole field
of artificial intelligence is figuring out how to use the information
learned in this “unsupervised” setting for supervised tasks - this is
sometimes called <em>“transfer learning”</em> (Raina et al. 2007) by people
in the field, since you are transferring information from one task to
another.</p>
<p>Returning to the task of “teaching” a computer to retain information
about what kind of pictures you want we run into a problem - computers
don’t know what pictures are! They also don’t know what audio clips,
text files, videos, or any other kind of information is. At least not
directly. They don’t have eyes, ears, and other senses along with a
brain designed to decode the information from these senses.</p>
<p>So what can a computer understand? A good rule of thumb is that a
computer works best with numbers. If you want a computer to sort
pictures into an album for you, the first thing you need to do is to
find a way to turn all of the information you want to “show” the
computer into numbers. In the case of sorting pictures into albums - a
supervised learning problem - we need to turn the labels and the images
into numbers the computer can use.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/labels-to-numbers.png" alt="Label each picture as a one or a zero depending on whether it is the
kind of picture you want in the album" /></p>
<p>One way to do that would be for you to do it for the computer. You could
take every picture on your phone and label it with a 1 if it was a
picture of the author and his son and a 0 if not. Then you would have a
set of 1’s and 0’s corresponding to all of the pictures. This takes
something the computer can’t understand (the picture) and turns it into
something the computer can understand (the label).</p>
<p>While this process would turn the labels into something a computer could
understand, it still isn’t something we could teach a computer to do.
The computer can’t actually “look” at the image and doesn’t know who the
author or his son are. So we need to figure out a way to turn the images
into numbers for the computer to use to generate those labels directly.</p>
<p>This is a little more complicated but you could still do it for the
computer. Let’s suppose that the author and his son always wear matching
blue shirts when they spend time together. Then you could go through and
look at each image and decide what fraction of the image is blue. So
each picture would get a number ranging from zero to one like 0.30 if
the picture was 30% blue and 0.53 if it was 53% blue.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/images-to-numbers.png" alt="Calculate the fraction of each image that is the color blue as a
"feature" of the image that is numeric" /></p>
<p>The fraction of the picture that is blue is called a <em>“feature”</em> and the
process of creating that feature is called <em>“feature engineering”</em>
(Wikipedia contributors 2017b). Until very recently feature engineering
of text, audio, or video files was best performed by an expert human. In
later chapters we will discuss how one of the most exciting parts about
AI applications is that it is now possible to have computers perform
feature engineering for you.</p>
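<p>To make this concrete, a hand-engineered “fraction of blue” feature could be computed with a few lines of R (a minimal sketch, not actual code from the book; it assumes the photos are color JPEG files and uses the <code>jpeg</code> package to read them in as pixel arrays):</p>
<pre><code class="language-r">library(jpeg)

# Read one photo and compute the fraction of pixels that look blue
fraction_blue <- function(path) {
  img   <- readJPEG(path)   # array: height x width x 3, channels R, G, B in [0, 1]
  red   <- img[, , 1]
  green <- img[, , 2]
  blue  <- img[, , 3]
  # call a pixel "blue" when the blue channel dominates the other two
  is_blue <- blue > red & blue > green & blue > 0.5
  mean(is_blue)              # e.g., 0.30 for a photo that is 30% blue
}
</code></pre>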
<h2 id="the-algorithm">The algorithm</h2>
<p>Now that we have converted the images to numbers and the labels to
numbers, we can talk about how to “teach” a computer to label the
pictures. A good rule of thumb when thinking about algorithms is that a
computer can’t “do” anything without being told very explicitly what to
do. It needs a step by step set of instructions. The instructions should
start with a calculation on the numbers for the image and should end
with a prediction of what label to apply to that image. The image
(converted to numbers) is the <em>“input”</em> and the label (also a number) is
the <em>“output”</em>. You may have heard the phrase:</p>
<blockquote>
<p>“Garbage in, garbage out”</p>
</blockquote>
<p>What this phrase means is that if the inputs (the images) are bad - say
they are all very dark or hard to see - then the output of the algorithm
will also be bad: the predictions won’t be very good.</p>
<p>A machine learning <em>“algorithm”</em> can be thought of as a set of
instructions with some of the parts left blank - sort of like mad-libs.
One example of a really simple algorithm for sorting pictures into the
album would be:</p>
<blockquote>
<ol>
<li>Calculate the fraction of blue in the image.</li>
<li>If the fraction of blue is above <em>X</em> label it 1</li>
<li>If the fraction of blue is less than <em>X</em> label it 0</li>
<li>Put all of the images labeled 1 in the album</li>
</ol>
</blockquote>
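<p>Written out as R code (again a minimal sketch, not code from the book), the mad-lib nature of the algorithm is easy to see - the cutoff <em>X</em> is simply an argument that has been left blank:</p>
<pre><code class="language-r"># Label a picture 1 (goes in the album) or 0 (left out) based on the cutoff X
sort_into_album <- function(blue_fraction, X) {
  as.numeric(blue_fraction > X)
}
</code></pre>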
<p>The machine <em>“learns”</em> by using the examples to fill in the blanks in
the instructions. In the case of our really simple algorithm we need to
figure out what fraction of blue to use (<em>X</em>) for labeling the picture.</p>
<p>To figure out a guess for <em>X</em> we need to decide what we want the
algorithm to do. If we set <em>X</em> to be too low then all of the images will
be labeled with a 1 and put into the album. If we set <em>X</em> to be too high
then all of the images will be labeled 0 and none will appear in the
album. In between there is some grey area - do we care if we
accidentally get some pictures of the ocean or the sky with our
algorithm?</p>
<p>But the number of images in the album isn’t even the thing we really
care about. What we might care about is making sure that the album is
mostly pictures of the author and his son. In the field of AI they
usually turn this statement around - we want to make sure the album has
a very small fraction of pictures that are not of the author and his
son. This fraction - the fraction that are incorrectly placed in the
album is called the <em>“loss”</em>. You can think about it like a game where
the computer loses a point every time it puts the wrong kind of picture
into the album.</p>
<p>Using our loss (how many pictures we incorrectly placed in the album) we
can now use the data we have created (the numbers for the labels and the
images) to fill in the blanks in our mad-lib algorithm (picking the
cutoff on the amount of blue). We have a large number of pictures where
we know what fraction of each picture is blue and whether it is a
picture of the author and his son or not. We can try each possible <em>X</em>
and calculate the fraction of pictures in the album that are incorrectly
placed into the album (the loss) and find the <em>X</em> that produces the
smallest fraction.</p>
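<p>That search could look like the following R sketch (not code from the book). It assumes we already have a vector <code>blue_fraction</code> holding the feature for every picture and a 0/1 vector <code>label</code> saying which pictures really are of the author and his son, and it reuses the hypothetical <code>sort_into_album</code> function from above:</p>
<pre><code class="language-r">pick_cutoff <- function(blue_fraction, label) {
  cutoffs <- seq(0, 1, by = 0.01)              # candidate values of X to try
  loss <- sapply(cutoffs, function(X) {
    in_album <- sort_into_album(blue_fraction, X) == 1
    if (!any(in_album)) return(NA)             # empty album: the loss (0/0) is undefined
    mean(label[in_album] == 0)                 # fraction of album pictures that are wrong
  })
  cutoffs[which.min(loss)]                     # the X with the smallest loss
}
</code></pre>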
<p>Suppose that the value of <em>X</em> that gives the smallest fraction of wrong
pictures in the album is 0.30. Then our “learned” model would be:</p>
<blockquote>
<ol>
<li>Calculate the fraction of blue in the image</li>
<li>If the fraction of blue is above 0.30 label it 1</li>
<li>If the fraction of blue is less than 0.30 label it 0</li>
<li>Put all of the images labeled 1 in the album</li>
</ol>
</blockquote>
<h2 id="the-interface">The interface</h2>
<p>The last part of an AI application is the interface. In this case, the
interface would be the way that we share the pictures with Dex’s
grandmother. For example we could imagine uploading the pictures to
<a href="https://www.shutterfly.com/">Shutterfly</a> and having the album delivered
to Dex’s grandmother.</p>
<p>Putting this all together we could imagine an application using our
trained AI. The author uploads his unlabeled photos. The photos are then
passed to the computer program which calculates the fraction of the
image that is blue, then applies a label according to the algorithm we
learned, then takes all the images predicted to be of the author and his
son and sends them off to be a Shutterfly album mailed to the author’s
mother.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ai-album.png" alt="Whoa that computer is smart - from the author's picture to grandma's
hands!" /></p>
<p>If the algorithm was good, then from the perspective of the author the
website would look “intelligent”. I just uploaded pictures and it
created an album for me with the pictures that I wanted. But the steps
in the process were very simple and understandable behind the scenes.</p>
<h2 id="references">References</h2>
<p>Leek, Jeffrey. n.d. “The Elements of Data Analytic Style.”
<a href="{https://leanpub.com/datastyle}">{https://leanpub.com/datastyle}</a>.</p>
<p>Raina, Rajat, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y
Ng. 2007. “Self-Taught Learning: Transfer Learning from Unlabeled Data.”
In <em>Proceedings of the 24th International Conference on Machine
Learning</em>, 759–66. ICML ’07. New York, NY, USA: ACM.</p>
<p>Wikipedia contributors. 2016. “Supervised Learning.”
<a href="https://en.wikipedia.org/w/index.php?title=Supervised_learning&oldid=752493505">https://en.wikipedia.org/w/index.php?title=Supervised_learning&oldid=752493505</a>.</p>
<p>———. 2017a. “Unsupervised Learning.”
<a href="https://en.wikipedia.org/w/index.php?title=Unsupervised_learning&oldid=760556815">https://en.wikipedia.org/w/index.php?title=Unsupervised_learning&oldid=760556815</a>.</p>
<p>———. 2017b. “Feature Engineering.”
<a href="https://en.wikipedia.org/w/index.php?title=Feature_engineering&oldid=760758719">https://en.wikipedia.org/w/index.php?title=Feature_engineering&oldid=760758719</a>.</p>
What is artificial intelligence? A three part definition
2017-01-19T00:00:00+00:00
http://simplystats.github.io/2017/01/19/what-is-artificial-intelligence
<p><em>Editor’s note: This is the first chapter of a book I’m working on called <a href="https://leanpub.com/demystifyai/">Demystifying Artificial Intelligence</a>. The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I’m developing the book over time - so if you buy the book on Leanpub know that there is only one chapter in there so far, but I’ll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this <a href="https://twitter.com/notajf/status/795717253505413122">amazing tweet</a> by Twitter user <a href="https://twitter.com/notajf/">@notajf</a>. Feedback is welcome and encouraged!</em></p>
<h1 id="what-is-artificial-intelligence">What is artificial intelligence?</h1>
<blockquote>
<p>“If it looks like a duck and quacks like a duck but it needs
batteries, you probably have the wrong abstraction” <a href="https://lostechies.com/derickbailey/2009/02/11/solid-development-principles-in-motivational-pictures/">Derick
Bailey</a></p>
</blockquote>
<p>This book is about artificial intelligence. The term “artificial
intelligence” or “AI” has a long and convoluted history (Cohen and
Feigenbaum 2014). It has been used by philosophers, statisticians,
machine learning experts, mathematicians, and the general public. This
historical context means that when people say <em>artificial intelligence</em>
the term is loaded with one of many potential different meanings.</p>
<h2 id="humanoid-robots">Humanoid robots</h2>
<p>Before we can demystify artificial intelligence it is helpful to have
some context for what the word means. When asked about artificial
intelligence, most people’s imagination leaps immediately to images of
robots that can act like and interact with humans. Near-human robots
have long been a source of fascination for humans and have appeared in
cartoons like the <em>Jetsons</em> and science fiction like <em>Star Wars</em>. More
recently, subtler forms of near-human robots with artificial
intelligence have played roles in movies like <em>Her</em> and <em>Ex machina</em>.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/movie-ai.png" alt="People usually think of artificial intelligence as a human-like robot
performing all the tasks that a person could." /></p>
<p>The type of artificial intelligence that can think and act like a human
is something that experts call artificial general intelligence
(Wikipedia contributors 2017a).</p>
<blockquote>
<p>… is the intelligence of a machine that could successfully perform any
intellectual task that a human being can.</p>
</blockquote>
<p>There is an understandable fascination and fear associated with robots,
created by humans, but evolving and thinking independently. While this
is a major area of research (Laird, Newell, and Rosenbloom 1987) and of
course the center of most people’s attention when it comes to AI, there
is no near term possibility of this type of intelligence (Urban, n.d.).
There are a number of barriers to human-mimicking AI from difficulty
with robotics (Couden 2015) to needed speedups in computational power
(Langford, n.d.).</p>
<p>One of the key barriers is that most current forms of the computer
models behind AI are trained to do one thing really well, but can not be
applied beyond that narrow task. There are extremely effective
artificial intelligence applications for translating between languages
(Wu et al. 2016), for recognizing faces in images (Taigman et al. 2014),
and even for driving cars (Santana and Hotz 2016).</p>
<p>But none of these technologies are generalizable across the range of
tasks that most adult humans can accomplish. For example, the AI
application for recognizing faces in images can not be directly applied
to drive cars and the translation application couldn’t recognize a
single image. While some of the internal technology used in the
applications is the same, the final version of the applications can’t be
transferred. This means that when we talk about artificial intelligence
we are not talking about a general purpose humanoid replacement.
Currently we are talking about technologies that can typically
accomplish one or two specific tasks that a human could accomplish.</p>
<h2 id="cognitive-tasks">Cognitive tasks</h2>
<p>While modern AI applications couldn’t do everything that an adult could
do (Baciu and Baciu 2016), they can perform individual tasks nearly as
well as a human. There is a second commonly used definition of
artificial intelligence that is considerably more narrow (Wikipedia
contributors 2017b):</p>
<blockquote>
<p>… the term “artificial intelligence” is applied when a machine
mimics “cognitive” functions that humans associate with other human
minds, such as “learning” and “problem solving”.</p>
</blockquote>
<p>This definition encompasses applications like machine translation and
facial recognition. They are “cognitive” functions that are usually
only performed by humans. A difficulty with this definition is
that it is relative. People refer to machines that can do tasks that we
thought humans could only do as artificial intelligence. But over time,
as we become used to machines performing a particular task it is no
longer surprising and we stop calling it artificial intelligence. John
McCarthy, one of the leading early figures in artificial intelligence,
said (Vardi 2012):</p>
<blockquote>
<p>As soon as it works, no one calls it AI anymore…</p>
</blockquote>
<p>As an example, when you send a letter in the mail, there is a machine
that scans the writing on the letter. A computer then “reads” the
characters on the front of the letter. The computer reads the characters
in several steps - the color of each pixel in the picture of the letter
is stored in a data set on the computer. Then the computer uses an
algorithm that has been built using thousands or millions of other
letters to take the pixel data and turn it into predictions of the
characters in the image. Then the characters are identified as
addresses, names, zipcodes, and other relevant pieces of information.
Those are then stored in the computer as text which can be used for
sorting the mail.</p>
<p>This task used to be considered “artificial intelligence” (Pavlidis,
n.d.). It was surprising that a computer could perform the tasks of
recognizing characters and addresses just based on a picture of the
letter. This task is now called “optical character recognition”
(Wikipedia contributors 2016). Many tutorials on the algorithms behind
machine learning begin with this relatively simple task (Google
Tensorflow Team, n.d.). Optical character recognition is now used in a
wide range of applications including in Google’s effort to digitize
millions of books (Darnton 2009).</p>
<p>Since this type of algorithm has become so common it is no longer called
“artificial intelligence”. This transition happened because we no longer
think it is surprising that computers can do this task - so it is no
longer considered intelligent. This process has played out with a number
of other technologies. Initially it is thought that only a human can do
a particular cognitive task. As computers become increasingly proficient
at that task they are called artificially intelligent. Finally, when
that task is performed almost exclusively by computers it is no longer
considered “intelligent” and the boundary moves.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/timeline-ai.png" alt="Timeline of tasks we were surprised that computers could do as well as
humans." /></p>
<p>Over the last two decades tasks from optical character recognition, to
facial recognition in images, to playing chess have started as
artificially intelligent applications. At the time of this writing there
are a number of technologies that are currently on the boundary between
doable only by a human and doable by a computer. These are the tasks
that are considered AI when you read about the term in the media.
Examples of tasks that are currently considered “artificial
intelligence” include:</p>
<ul>
<li>Computers that can drive cars</li>
<li>Computers that can identify human faces from pictures</li>
<li>Computers that can translate text from one language to another</li>
<li>Computers that can label pictures with text descriptions</li>
</ul>
<p>Just as it used to be with optical character recognition, self-driving
cars and facial recognition are tasks that still surprise us when
performed by a computer. So we still call them artificially intelligent.
Eventually, many or most of these tasks will be performed nearly
exclusively by computers and we will no longer think of them as
components of computer “intelligence”. To go a little further we can
think about any task that is repetitive and performed by humans. For
example, picking out music that you like or helping someone buy
something at a store. An AI can eventually be built to do those tasks
provided that: (a) there is a way of measuring and storing information
about the tasks and (b) there is technology in place to perform the task
if given a set of computer instructions.</p>
<p>The more narrow definition of AI is used colloquially in the news to
refer to new applications of computers to perform tasks previously
thought impossible. It is important to know both the definition of AI
used by the general public and the more narrow and relative definition
used to describe modern applications of AI by companies like Google and
Facebook. But neither of these definitions is satisfactory to help
demystify the current state of artificial intelligence applications.</p>
<h2 id="a-three-part-definition">A three part definition</h2>
<p>The first definition describes a technology that we are not currently
faced with - fully functional general purpose artificial intelligence.
The second definition suffers from the fact that it is relative to the
expectations of people discussing applications. For this book, we need a
definition that is concrete, specific, and doesn’t change with societal
expectations.</p>
<p>We will consider specific examples of human-like tasks that computers
can perform. So we will use the definition that artificial intelligence
requires the following components:</p>
<ol>
<li><em>The data set</em> : A collection of data examples that can be used to train a
statistical or machine learning model to make predictions.</li>
<li><em>The algorithm</em> : An algorithm that can be trained based on the data
examples to take a new example and execute a human-like task.</li>
<li><em>The interface</em> : An interface for the trained algorithm to receive
a data input and execute the human like task in the real world.</li>
</ol>
<p>This definition encompasses optical character recognition and all the
more modern examples like self-driving cars. It is also intentionally
broad, covering even examples where the data set is not large or the
algorithm is not complicated. We will use our definition to break down
modern artificial intelligence applications into their constituent
parts and make it clear how the computer represents knowledge learned
from data examples and then applies that knowledge.</p>
<p>As one example, consider Amazon Echo and Alexa - an application
currently considered to be artificially intelligent (Nuñez, n.d.). This
combination meets our definition of artificially intelligent since each
of the components is in place.</p>
<ol>
<li><em>The data set</em> : The large set of data examples consists of all the
recordings that Amazon has collected of people talking to their
Amazon devices.</li>
<li><em>The machine learning algorithm</em> : The Alexa voice service (Alexa
Developers 2016) is a machine learning algorithm trained using the
previous recordings of people talking to Amazon devices.</li>
<li><em>The interface</em> : The interface is the Amazon Echo (Amazon Inc 2016),
a speaker that can record humans talking to it and respond with
information or music.</li>
</ol>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/alexa-ai.png" alt="The three parts of an artificial intelligence illustrated with Amazon
Echo and Alexa" /></p>
<p>Breaking artificial intelligence down into these steps makes it
clearer why there has been such a sudden explosion of interest in
artificial intelligence over the last several years.</p>
<p>First, the cost of data storage and collection has gone down steadily
(Irizarry, n.d.) but dramatically (Quigley, n.d.) over the last several
years. As the costs have come down, it is increasingly feasible for
companies, governments, and even individuals to store large collections
of data (Component 1 - <em>The Data</em>). To take advantage of these huge
collections of data requires incredibly flexible statistical or machine
learning algorithms that can capture most of the patterns in the data
and re-use them for prediction. The most common type of algorithms used
in modern artificial intelligence are something called “deep neural
networks”. These algorithms are so flexible they capture nearly all of
the important structure in the data. They can only be trained well if
huge data sets exist and computers are fast enough. Continual increases
in computing speed and power over the last several decades now make it
possible to apply these models to these huge collections of data (Component 2 -
<em>The Algorithm</em>).</p>
<p>Finally, the most underappreciated component of the AI revolution does
not have to do with data or machine learning. Rather it is the
development of new interfaces that allow people to interact directly
with machine learning models. For a number of years now, if you were an
expert with statistical and machine learning software it has been
possible to build highly accurate predictive models. But if you were a
person without technical training it was not possible to directly
interact with algorithms.</p>
<p>Or as statistical experts Diego Kuonen and Rafael Irizarry have put it:</p>
<blockquote>
<p>The big in big data refers to importance, not size</p>
</blockquote>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/importance-not-size.jpg" alt="It isn't about how much data you have, it is about how many people you
can get to use it." /></p>
<p>The explosion of interfaces for regular, non-technical people to
interact with machine learning is an underappreciated driver of the AI
revolution of the last several years. Artificial intelligence can now
power labeling friends on Facebook, parsing your speech for personal
assistants like Siri or Google Assistant, giving you directions in your
car, and answering when you talk to your Echo. More recently, sensors and
devices have made it possible for the instructions created by a computer
to steer and drive a car.</p>
<p>These interfaces now make it possible for hundreds of millions of people
to directly interact with machine learning algorithms. These algorithms
can range from exceedingly simple to mind bendingly complex. But the
common result is that the interface allows the computer to perform a
human-like action and makes it look like artificial intelligence to the
person on the other side. This interface explosion only promises to
accelerate as we are building sensors for both data input and behavior
output in objects from phones to refrigerators to cars (Component 3 -
<em>The interface</em>).</p>
<p>This definition of artificial intelligence in three components will
allow us to demystify artificial intelligence applications from self
driving cars to facial recognition. Our goal is to provide a high-level
interface to the current conception of AI and how it can be applied to
problems in real life. It will include discussion and references to the
sophisticated models and data collection methods used by Facebook,
Tesla, and other companies. However, the book does not assume a
mathematical or computer science background and will attempt to explain
these ideas in plain language. Of course, this means that some details
will be glossed over, so we will attempt to point the interested reader
toward more detailed resources throughout the book.</p>
<h2 id="references">References</h2>
<p>Alexa Developers. 2016. “Alexa Voice Service.”
<a href="https://developer.amazon.com/alexa-voice-service">https://developer.amazon.com/alexa-voice-service</a>.</p>
<p>Amazon Inc. 2016. “Amazon Echo.”
<a href="https://www.amazon.com/Amazon-Echo-Bluetooth-Speaker-with-WiFi-Alexa/dp/B00X4WHP5E">https://www.amazon.com/Amazon-Echo-Bluetooth-Speaker-with-WiFi-Alexa/dp/B00X4WHP5E</a>.</p>
<p>Baciu, Assaf, and Assaf Baciu. 2016. “Artificial Intelligence Is More
Artificial Than Intelligent.” <em>Wired</em>, 7 Dec.</p>
<p>Cohen, Paul R, and Edward A Feigenbaum. 2014. <em>The Handbook of
Artificial Intelligence</em>. Vol. 3. Butterworth-Heinemann.
<a href="https://goo.gl/wg5rMk">https://goo.gl/wg5rMk</a>.</p>
<p>Couden, Craig. 2015. “Why It’s so Hard to Make Humanoid Robots | Make:”
<a href="http://makezine.com/2015/06/15/hard-make-humanoid-robots/">http://makezine.com/2015/06/15/hard-make-humanoid-robots/</a>.</p>
<p>Darnton, Robert. 2009. <em>Google & the Future of Books</em>.</p>
<p>Google Tensorflow Team. n.d. “MNIST for ML Beginners | TensorFlow.”
<a href="https://www.tensorflow.org/tutorials/mnist/beginners/">https://www.tensorflow.org/tutorials/mnist/beginners/</a>.</p>
<p>Irizarry, Rafael. n.d. “The Big in Big Data Relates to Importance Not
Size · Simply Statistics.”
<a href="http://simplystatistics.org/2014/05/28/the-big-in-big-data-relates-to-importance-not-size/">http://simplystatistics.org/2014/05/28/the-big-in-big-data-relates-to-importance-not-size/</a>.</p>
<p>Laird, John E, Allen Newell, and Paul S Rosenbloom. 1987. “Soar: An
Architecture for General Intelligence.” <em>Artificial Intelligence</em> 33
(1). Elsevier: 1–64.</p>
<p>Langford, John. n.d. “AlphaGo Is Not the Solution to AI « Machine
Learning (Theory).” <a href="http://hunch.net/?p=3692542">http://hunch.net/?p=3692542</a>.</p>
<p>Nuñez, Michael. n.d. “Amazon Echo Is the First Artificial Intelligence
You’ll Want at Home.”
<a href="http://www.popsci.com/amazon-echo-first-artificial-intelligence-youll-want-home">http://www.popsci.com/amazon-echo-first-artificial-intelligence-youll-want-home</a>.</p>
<p>Pavlidis, Theo. n.d. “Computers Versus Humans - 2002 Lecture.”
<a href="http://www.theopavlidis.com/comphumans/comphuman.htm">http://www.theopavlidis.com/comphumans/comphuman.htm</a>.</p>
<p>Quigley, Robert. n.d. “The Cost of a Gigabyte over the Years.”
<a href="http://www.themarysue.com/gigabyte-cost-over-years/">http://www.themarysue.com/gigabyte-cost-over-years/</a>.</p>
<p>Santana, Eder, and George Hotz. 2016. “Learning a Driving Simulator,”
3 Aug.</p>
<p>Taigman, Y, M Yang, M Ranzato, and L Wolf. 2014. “DeepFace: Closing the
Gap to Human-Level Performance in Face Verification.” In <em>2014 IEEE
Conference on Computer Vision and Pattern Recognition</em>, 1701–8.</p>
<p>Urban, Tim. n.d. “The AI Revolution: How Far Away Are Our Robot
Overlords?”
<a href="http://gizmodo.com/the-ai-revolution-how-far-away-are-our-robot-overlords-1684199433">http://gizmodo.com/the-ai-revolution-how-far-away-are-our-robot-overlords-1684199433</a>.</p>
<p>Vardi, Moshe Y. 2012. “Artificial Intelligence: Past and Future.”
<em>Commun. ACM</em> 55 (1). New York, NY, USA: ACM: 5–5.</p>
<p>Wikipedia contributors. 2016. “Optical Character Recognition.”
<a href="https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=757150540">https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=757150540</a>.</p>
<p>———. 2017a. “Artificial General Intelligence.”
<a href="https://en.wikipedia.org/w/index.php?title=Artificial_general_intelligence&oldid=758867755">https://en.wikipedia.org/w/index.php?title=Artificial_general_intelligence&oldid=758867755</a>.</p>
<p>———. 2017b. “Artificial Intelligence.”
<a href="https://en.wikipedia.org/w/index.php?title=Artificial_intelligence&oldid=759177704">https://en.wikipedia.org/w/index.php?title=Artificial_intelligence&oldid=759177704</a>.</p>
<p>Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi,
Wolfgang Macherey, Maxim Krikun, et al. 2016. “Google’s Neural Machine
Translation System: Bridging the Gap Between Human and Machine
Translation,” 26 Sep.</p>
Got a data app idea? Apply to get it prototyped by the JHU DSL!
2017-01-18T00:00:00+00:00
http://simplystats.github.io/2017/01/18/data-prototyping-class
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/papr.png" alt="Get your app built" /></p>
<p>Last fall we ran the first iteration of a class at the <a href="http://jhudatascience.org/">Johns Hopkins Data Science Lab</a> where we teach students to build data web-apps using Shiny, R, GoogleSheets and a number of other technologies. Our goals were to teach students to build data products, to reduce friction for students who want to build things with data, and to help people solve important data problems with web and SMS apps.</p>
<p>We are going to be running a second iteration of our program from March-June this year. We are looking for awesome projects for students to build that solve real world problems. We are particularly interested in projects that could have a positive impact on health but are open to any cool idea. We generally build apps that are useful for:</p>
<ul>
<li><strong>Data donation</strong> - if you have a group of people you would like to donate data to your project.</li>
<li><strong>Data collection</strong> - if you would like to build an app for collecting data from people.</li>
<li><strong>Data visualization</strong> - if you have a data set and would like to have a web app for interacting with the data.</li>
<li><strong>Data interaction</strong> - if you have a statistical or machine learning model and you would like a web interface for it.</li>
</ul>
<p>But we are interested in any consumer-facing data product that you might be interested in having built. We want you to submit your wildest, most interesting ideas and we’ll see if we can get them built for you.</p>
<p>We are hoping to solicit a large number of projects and then build as many as possible. The best part is that we will build the prototype for you for free! If you have an idea of something you’d like built please submit it to this <a href="https://docs.google.com/forms/d/1UPl7h8_SLw4zNFl_I9li_8GN14gyAEtPHtwO8fJ232E/edit?usp=forms_home&ths=true">Google form</a>.</p>
<p>Students in the class will select projects they are interested in during early March. We will let you know if your idea was selected for the program by mid-March. If you aren’t selected you will have the opportunity to roll your submission over to our next round of prototyping.</p>
<p>I’ll be writing a separate post targeted at students, but if you are interested in being a data app prototyper, sign up <a href="http://jhudatascience.org/prototyping_students.html">here</a>.</p>
Interview with Al Sommer - Effort Report Episode 23
2017-01-17T00:00:00+00:00
http://simplystats.github.io/2017/01/17/effort-report-episode-23
<p>My colleague <a href="https://twitter.com/elizabethmatsui">Elizabeth Matsui</a> and I had a great opportunity to talk with Al Sommer on the <a href="http://effortreport.libsyn.com/23-special-guest-al-sommer">latest episode</a> of our podcast <a href="http://effortreport.libsyn.com">The Effort Report</a>. Al is the former Dean of the Johns Hopkins Bloomberg School of Public Health and is Professor of Epidemiology and International Health at the School. He is (among other things) world-renowned for his pioneering research in vitamin A deficiency and mortality in children.</p>
<p>Al had some good bits of advice for academics and being successful in academia.</p>
<blockquote>
<p>What you are excited about and interested in at the moment, you’re much more likely to be successful at—because you’re excited about it! So you’re going to get up at 2 in the morning and think about it, you’re going to be putting things together in ways that nobody else has put things together. And guess what? When you do that you’re more successful [and] you actually end up getting academic promotions.</p>
</blockquote>
<p>On the slow rate of progress:</p>
<blockquote>
<p>It took ten years, after we had seven randomized trials already to show that you get this 1/3 reduction in child mortality by giving them two cents worth of vitamin A twice a year. It took ten years to convince the child survival Nawabs of the world, and there are still some that don’t believe it.</p>
</blockquote>
<p>On working overseas:</p>
<blockquote>
<p>It used to be true [that] it’s a lot easier to work overseas than it is to work here because the experts come from somewhere else. You’re never an expert in your own home.</p>
</blockquote>
<p>You can listen to the entire episode here:</p>
<iframe style="border: none" src="//html5-player.libsyn.com/embed/episode/id/4992405/height/90/width/700/theme/custom/autonext/no/thumbnail/yes/autoplay/no/preload/no/no_addthis/no/direction/forward/render-playlist/no/custom-color/87A93A/" height="90" width="700" scrolling="no" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" oallowfullscreen="" msallowfullscreen=""></iframe>
Not So Standard Deviations Episode 30 - Philately and Numismatology
2017-01-09T00:00:00+00:00
http://simplystats.github.io/2017/01/09/nssd-episode-30
<p>Hilary and I follow up on open data and data sharing in government. We also discuss artificial intelligence, self-driving cars, and doing your taxes in R.</p>
<p>If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p>Lucy D’Agostino McGowan (@LucyStats) made a <a href="http://www.lucymcgowan.com/hill-for-data-scientists.html">great translation of Hill’s criteria using XKCD comics</a></p>
</li>
<li>
<p><a href="http://www.lucymcgowan.com">Lucy’s web page</a></p>
</li>
<li>
<p><a href="https://www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf">Preparing for the Future of Artificial Intelligence</a></p>
</li>
<li>
<p><a href="http://12%20Dec%202016%20White%20House%20Special%20with%20DJ%20Patil,%20US%20Chief%20Data%20Scientist">Partially Derivative White House Special – with DJ Patil, US Chief Data Scientist</a></p>
</li>
<li>
<p><a href="https://soundcloud.com/nssd-podcast/episode-29-standards-are-like-toothbrushes">Not So Standard Deviations – Standards are Like Toothbrushes – with with Daniel Morgan, Chief Data Officer for the U.S. Department of Transportation and Terah Lyons, Policy Advisor to the Chief Technology Officer of the U.S.</a></p>
</li>
<li>
<p><a href="http://www.hgitner.com">Henry Gitner Philatelists</a></p>
</li>
<li>
<p><a href="https://drive.google.com/file/d/0B678uTpUfn80a2RkOUc5LW51cVU/view?usp=sharing">Some Pioneers of Modern Statistical Theory: A Personal Reflection by Sir David R. Cox</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-30-philately-and-numismatology">Download the audio for this episode</a></p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/301065336&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Some things I've found help reduce my stress around science
2016-12-29T00:00:00+00:00
http://simplystats.github.io/2016/12/29/some-stress-reducers
<p>Being a scientist can be pretty stressful for any number of reasons, from the peer review process, to getting funding, to <a href="http://simplystatistics.org/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics/">getting blown up on the internet</a>.</p>
<p>Like a lot of academics I suffer from a lot of stress related to my own high standards and the imposter syndrome that comes from not meeting them on a regular basis. I was just reading through the excellent material in Lorena Barba’s class on <a href="https://barbagroup.github.io/essential_skills_RRC/">essential skills in reproducibility</a> and came across this <a href="http://www.stat.berkeley.edu/~stark/Seminars/reproNE16.htm#1">set of slides</a> by Phillip Stark. The one that caught my attention said:</p>
<blockquote>
<p>If I say just trust me and I’m wrong, I’m untrustworthy.
If I say here’s my work and it’s wrong, I’m honest, human, and serving scientific progress.</p>
</blockquote>
<p>I love this quote because it shows how being open about both your successes and failures makes it less stressful to be a scientist. Inspired by this quote I decided to make a list of things that I’ve learned through hard experience don’t cure my own imposter syndrome but do help me to feel less stressed out about my science.</p>
<ol>
<li><em>Put everything out in the open.</em> We release all of our software, data, and analysis scripts. This has led to almost exclusively positive interactions with people as they help us figure out good and bad things about our work.</li>
<li><em>Admit mistakes quickly.</em> Since my code/data are out in the open I’ve had people find little bugs and big whoa this is bad bugs in my code. I used to freak out when that happens. But I found the thing that minimizes my stress is to just quickly admit the error and submit updates/changes/revisions to code and papers as necessary.</li>
<li><em>Respond to requests for support at my own pace.</em> I try to be as responsive as I can when people email me about software/data/code/papers of mine. I used to stress about doing this <em>right away</em> when I would get the emails. I still try to be prompt, but I don’t let that dominate my attention/time. I also prioritize things that are wrong/problematic and then later handle the requests for free consulting every open source person gets.</li>
<li><em>Treat rejection as a feature not a bug.</em> This one is by far the hardest for me but preprints have helped a ton. The academic system is <em>designed</em> to be critical. That is a good thing, skepticism is one of the key tenets of the scientific process. It took me a while to just plan on one or two rejections for each paper, one or two or more rejections for each grant, etc. But now that I plan on the rejection I find I can just focus on how to steadily move forward and constructively address criticism rather than taking it as a personal blow.</li>
<li><em>Don’t argue with people on the internet, especially on Twitter.</em> This is a new one for me and one I’m having to practice hard every single day. But I’ve found that I’ve had very few constructive debates on Twitter. I also found that this is almost purely negative energy for me and doesn’t help me accomplish much.</li>
<li><em>Redefine success.</em> I’ve found that if I recalibrate what success means to include accomplishing tasks like peer reviewing papers, getting letters of recommendation sent at the right times, providing support to people I mentor, and the submission rather than the success of papers/grants then I’m much less stressed out.</li>
<li><em>Don’t compare myself to other scientists.</em> It is <a href="http://simplystatistics.org/2015/02/09/the-trouble-with-evaluating-anything/">very hard to get good evaluation in science</a> and I’m extra bad at self-evaluation. Scientists are good in many different dimensions and so whenever I pick a one dimensional summary and compare myself to others there are always people who are “better” than me. I find I’m happier when I set internal, short term goals for myself and only compare myself to them.</li>
<li><em>When comparing, at least pick a metric I’m good at.</em> I’d like to claim I never compare myself to others, but the reality is I do it more than I’d like. I’ve found one way to not stress myself out for my own internal comparisons is to pick metrics I’m good at - even if they aren’t the “right” metrics. That way at least if I’m comparing I’m not hurting my own psyche.</li>
<li><em>Let myself be bummed sometimes.</em> Some days despite all of that I still get the imposter syndrome feels and can’t get out of the funk. I used to beat myself up about those days, but now I try to just build that into the rhythm of doing work.</li>
<li><em>Try very hard to be positive in my interactions.</em> This is another hard one, because it is important to be skeptical/critical as a scientist. But I also try very hard to do that in as productive a way as possible. I try to assume other people are doing the right thing and I try very hard to stay positive or neutral when writing blog posts/opinion pieces, etc.</li>
<li><em>Realize that giving credit doesn’t take away from me.</em> In my research career I have worked with some extremely <a href="http://genomics.princeton.edu/storeylab/">generous</a> <a href="http://rafalab.github.io/">mentors</a>. They taught me to always give credit whenever possible. I also learned from <a href="http://www.biostat.jhsph.edu/~rpeng/">Roger</a> that you can give credit and not lose anything yourself, in fact you almost always gain. Giving credit is low cost but feels really good so is a nice thing to help me feel better.</li>
</ol>
<p>The last thing I’d say is that having a blog has helped reduce my stress, because sometimes I’m having a hard time getting going on my big project for the day and I can quickly write a blog post and still feel like I got something done…</p>
A non-comprehensive list of awesome things other people did in 2016
2016-12-20T00:00:00+00:00
http://simplystats.github.io/2016/12/20/noncomprehensive-list-of-awesome
<p><em>Editor’s note: For the last few years I have made a list of awesome things that other people did (<a href="http://simplystatistics.org/2015/12/21/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2015/">2015</a>, <a href="http://simplystatistics.org/2014/12/17/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2014/">2014</a>, <a href="http://simplystatistics.org/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year/">2013</a>). Like in previous years I’m making a list, again right off the top of my head. If you know of some, you should make your own list or add it to the comments! I have also avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I write this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data.</em></p>
<ul>
<li>Thomas Lin Pedersen created the <a href="https://github.com/thomasp85/tweenr">tweenr</a> package for interpolating graphs in animations. Check out this awesome <a href="https://twitter.com/thomasp85/status/809896220906897408">logo</a> he made with it.</li>
<li>Yihui Xie is still blowing away everything he does. First it was <a href="https://bookdown.org/yihui/bookdown/">bookdown</a> and then the yolo feature in <a href="https://github.com/yihui/xaringan">xaringan</a> package.</li>
<li>J Alammar built this great <a href="https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/">visual introduction to neural networks</a></li>
<li>Jenny Bryan is working literal world wonders with legos to teach functional programming. I loved her <a href="https://speakerdeck.com/jennybc/data-rectangling">Data Rectangling</a> talk. The analogy between exponential families and data frames is so so good.</li>
<li>Hadley Wickham’s book on <a href="http://r4ds.had.co.nz/">R for data science</a> is everything you’d expect. Super clear, great examples, just a really nice book.</li>
<li>David Robinson is a machine put on this earth to create awesome data science stuff. Here is <a href="http://varianceexplained.org/r/trump-tweets/">analyzing Trump’s tweets</a> and here he is on <a href="http://varianceexplained.org/r/hierarchical_bayes_baseball/">empirical Bayes modeling explained with baseball</a>.</li>
<li>Julia Silge and David created the <a href="https://cran.r-project.org/web/packages/tidytext/index.html">tidytext</a> package. This is a holy moly big contribution to NLP in R. They also have a killer <a href="http://tidytextmining.com/">book on tidy text mining</a>.</li>
<li>Julia used the package to do this <a href="http://juliasilge.com/blog/Reddit-Responds/">fascinating post</a> on mining Reddit after the election.</li>
<li>It would be hard to pick just five different major contributions from JJ Allaire (great interview <a href="https://www.rstudio.com/rviews/2016/10/12/interview-with-j-j-allaire/">here</a>), Joe Cheng, and the rest of the Rstudio folks. Rstudio is absolutely <em>churning</em> out awesome stuff at a rate that is hard to keep up with. I loved <a href="https://blog.rstudio.org/2016/10/05/r-notebooks/">R notebooks</a> and have used them extensively for teaching.</li>
<li>Konrad Kording and Brett Mensh full on mike dropped on how to write a paper with their <a href="http://biorxiv.org/content/early/2016/11/28/088278">10 simple rules piece</a>. Figure 1 from that paper should be affixed to the office of every student/faculty member in the world permanently.</li>
<li>Yaniv Erlich just can’t stop himself from doing interesting things like <a href="https://seeq.io/">seeq.io</a> and <a href="https://dna.land/">dna.land</a>.</li>
<li>Thomaz Berisa and Joe Pickrell set up a freaking <a href="https://medium.com/the-seeq-blog/start-a-human-genomics-project-with-a-few-lines-of-code-dde90c4ef68#.g64meyjim">Python API for genomics projects</a>.</li>
<li>DataCamp continues to do great things. I love their <a href="https://www.datacamp.com/community/blog/an-interview-with-david-robinson-data-scientist-at-stack-overflow">DataChats</a> series and they have been rolling out tons of new courses.</li>
<li>Sean Rife and Michele Nuijten created <a href="http://statcheck.io/">statcheck.io</a> for checking papers for p-value calculation errors. This was all over the press, but I just like the site as dummy-proofing for myself.</li>
<li>This was the artificial intelligence <a href="https://twitter.com/notajf/status/795717253505413122">tweet of the year</a>.</li>
<li>I loved seeing PLoS Genetics start a policy of looking for papers on <a href="http://blogs.plos.org/plos/2016/10/the-best-of-both-worlds-preprints-and-journals/">bioRxiv</a>.</li>
<li>Matthew Stephens’ <a href="https://medium.com/@biostatistics/guest-post-matthew-stephens-on-biostatistics-pre-review-and-reproducibility-a14a26d83d6f#.usisi7kd3">post</a> on his preprint getting pre-accepted and reproducibility is also awesome. Preprints are so hot right now!</li>
<li>Lorena Barba made this amazing <a href="https://hackernoon.com/barba-group-reproducibility-syllabus-e3757ee635cf#.2orb46seg">reproducibility syllabus</a>, then <a href="https://twitter.com/LorenaABarba/status/809641955437051904">won the Leamer-Rosenthal prize</a> in open science.</li>
<li>Colin Dewey continues to do just stellar stellar work, this time on <a href="http://biorxiv.org/content/early/2016/11/30/090506">re-annotating genomics samples</a>. This is one of the key open problems in genomics.</li>
<li>I love FlowingData sooooo much. Here is one on <a href="http://flowingdata.com/2016/05/17/the-changing-american-diet/">the changing American diet</a>.</li>
<li>If you like computational biology and data science and like <em>super</em> detailed reports of meetings/talks, then <a href="https://twitter.com/michaelhoffman">Michael Hoffman</a> is your man. How he actually summarizes that much information in real time is still beyond me.</li>
<li>I really really wish I had been at Alyssa Frazee’s talk at startup.ml but loved this <a href="http://www.win-vector.com/blog/2016/09/adversarial-machine-learning/">review of it</a>. Sampling, inverse probability weighting? Love that stats flavor!</li>
<li>I have followed Cathy O’Neil for a long time in her persona as <a href="https://twitter.com/mathbabedotorg">mathbabedotorg</a> so it is no surprise to me that her new book <a href="https://www.amazon.com/dp/B019B6VCLO/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1">Weapons of Math Destruction</a> is so good. One of the best works on the ethics of data out there.</li>
<li>A related and very important piece is on <a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">Machine bias in sentencing</a> by Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner at ProPublica.</li>
<li>Dimitris Rizopoulos created this stellar <a href="http://iprogn.blogspot.com/2016/03/an-integrated-shiny-app-for-course-on.html">integrated Shiny app</a> for his repeated measures class. I wish I could build things half this nice.</li>
<li>Daniel Engber’s piece on <a href="http://fivethirtyeight.com/features/who-will-debunk-the-debunkers/">Who will debunk the debunkers?</a> at fivethirtyeight just keeps getting more relevant.</li>
<li>I am rarely willing to watch a talk posted on the internet, but <a href="https://www.youtube.com/watch?v=hps9r7JZQP8">Amelia McNamara’s talk on seeing nothing</a> was an exception. Plus she talks so fast #jealous.</li>
<li>Sherri Rose’s post on <a href="http://drsherrirose.com/economic-diversity-and-the-academy-statistical-science">economic diversity in the academy</a> focuses on statistics but should be required reading for anyone thinking about diversity. Everything about it is impressive.</li>
<li>If you like your data science with a side of Python you should definitely be checking out Jake Vanderplas’s <a href="http://shop.oreilly.com/product/0636920034919.do">data science handbook</a> and the associated <a href="https://github.com/jakevdp/PythonDataScienceHandbook">Jupyter notebooks</a>.</li>
<li>I love Thomas Lumley <a href="http://www.statschat.org.nz/2016/12/19/sauna-and-dementia/">being snarky</a> about the stats news. It’s a guilty pleasure. If he ever collected them into a book I’d buy it (hint, Thomas :)).</li>
<li>Dorothy Bishop’s blog is one of the ones I read super regularly. Her post on <a href="http://deevybee.blogspot.com/2016/12/when-is-replication-not-replication.html">When is a replication a replication</a> is just one example of her very clearly explaining a complicated topic in a sensible way. I find that so hard to do and she does it so well.</li>
<li>Ben Goldacre’s crowd is doing a bunch of interesting things. I really like their <a href="https://openprescribing.net/">OpenPrescribing</a> project.</li>
<li>I’m really excited to see what Elizabeth Rhodes does with the experimental design for the <a href="http://blog.ycombinator.com/moving-forward-on-basic-income/">Y Combinator Basic Income Experiment</a>.</li>
<li>Lucy D’Agostino McGowan made this <a href="http://www.lucymcgowan.com/hill-for-data-scientists.html">amazing explanation</a> of Hill’s criteria using xkcd.</li>
<li>It is hard to overstate how good Leslie McClure’s blog is. This post on <a href="https://statgirlblog.wordpress.com/2016/09/16/biostatistics-is-public-health/">biostatistics is public health</a> should be read aloud at every SPH in the US.</li>
<li>The ASA’s <a href="http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108">statement on p-values</a> is a really nice summary of all the issues around a surprisingly controversial topic. Ron Wasserstein and Nicole Lazar did a great job putting it together.</li>
<li>I really liked <a href="http://jama.jamanetwork.com/article.aspx?articleId=2513561&guestAccessKey=4023ce75-d0fb-44de-bb6c-8a10a30a6173">this piece</a> on the relationship between income and life expectancy by Raj Chetty and company.</li>
<li>Christie Aschwanden continues to be the voice of reason on the <a href="http://fivethirtyeight.com/features/failure-is-moving-science-forward/">statistical crises in science</a>.</li>
</ul>
<p>That’s all I have for now, I know I’m missing things. Maybe my New Year’s resolution will be to keep better track of the awesome things other people are doing :).</p>
The four eras of data
2016-12-16T00:00:00+00:00
http://simplystats.github.io/2016/12/16/the-four-eras-of-data
<p>I’m teaching <a href="http://jtleek.com/advdatasci16/">a class in data science</a> for our masters and PhD students here at Hopkins. I’ve been teaching a variation on this class since 2011 and over time I’ve introduced a number of new components to the class: high-dimensional data methods (2011), data manipulation and cleaning (2012), real, possibly not doable data analyses (2012,2013), peer reviews (2014), building <a href="http://swirlstats.com/">swirl tutorials</a> for data analysis techniques (2015), and this year building data analytic web apps/R packages.</p>
<p>I’m the least efficient teacher in the world, probably because I’m very self-conscious about my teaching. So I always feel like I have to completely re-do my lecture materials every year I teach the class (I know, I know, I’m a dummy). This year I was reviewing my notes on high-dimensional data and I was looking at this breakdown of the three eras of statistics from Brad Efron’s <a href="http://statweb.stanford.edu/~ckirby/brad/other/2010LSIexcerpt.pdf">book</a>:</p>
<blockquote>
<ol>
<li>The age of Quetelet and his successors, in which huge census-level data
sets were brought to bear on simple but important questions: Are there
more male than female births? Is the rate of insanity rising?</li>
<li>The classical period of Pearson, Fisher, Neyman, Hotelling, and their
successors, intellectual giants who developed a theory of optimal inference
capable of wringing every drop of information out of a scientific
experiment. The questions dealt with still tended to be simple — Is treatment
A better than treatment B? — but the new methods were suited to
the kinds of small data sets individual scientists might collect.</li>
<li>The era of scientific mass production, in which new technologies typified
by the microarray allow a single team of scientists to produce data
sets of a size Quetelet would envy. But now the flood of data is accompanied
by a deluge of questions, perhaps thousands of estimates or
hypothesis tests that the statistician is charged with answering together;
not at all what the classical masters had in mind.</li>
</ol>
</blockquote>
<p>While I think this is a useful breakdown, I realized I think about it in a slightly different way as a statistician. My breakdown goes more like this:</p>
<ol>
<li><strong>The era of not much data.</strong> This is everything prior to about 1995 in my field. The era when we could only collect a few measurements at a time. The whole point of statistics was to try to optimally squeeze information out of a small number of samples - so you see methods like maximum likelihood and minimum variance unbiased estimators being developed.</li>
<li><strong>The era of lots of measurements on a few samples.</strong> This one hit hard in biology with the development of the microarray and the ability to measure thousands of genes simultaneously. This is the same statistical problem as in the previous era but with a lot more noise added. Here you see the development of methods for multiple testing and regularized regression to separate signals from piles of noise (a minimal sketch of this setting appears just after this list).</li>
<li><strong>The era of a few measurements on lots of samples.</strong> This era is overlapping to some extent with the previous one. Large scale collections of data from EMRs and Medicare are examples where you have a huge number of people (samples) but a relatively modest number of variables measured. Here there is a big focus on statistical methods for knowing how to model different parts of the data with hierarchical models and separating signals of varying strength with model calibration.</li>
<li><strong>The era of all the data on everything.</strong> This is an era that currently we as civilians don’t get to participate in. But Facebook, Google, Amazon, the NSA and other organizations have thousands or millions of measurements on hundreds of millions of people. Other than just sheer computing I’m speculating that a lot of the problem is in segmentation (like in era 3) coupled with avoiding crazy overfitting (like in era 2).</li>
</ol>
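<p>To make the era 2 setting concrete, here is a minimal R sketch of my own (not from any particular paper): simulate many measurements on a handful of samples with only a few true signals, and compare raw p-values to multiple-testing-adjusted ones. All of the numbers below are made up purely for illustration.</p>
<pre><code class="language-r">## Era 2 sketch: 10,000 "genes" measured on 5 + 5 samples, 100 true signals.
## Hypothetical simulated data, purely for illustration.
set.seed(1)
n_genes <- 10000
n_per_group <- 5
truth <- rep(c(TRUE, FALSE), c(100, n_genes - 100))

## Expression matrix: rows are genes, columns are samples
expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
group <- rep(c("A", "B"), each = n_per_group)
## Add a mean shift to the true signals in group B
expr[truth, group == "B"] <- expr[truth, group == "B"] + 2

## One t-test per gene, then adjust for multiple testing (Benjamini-Hochberg)
pvals <- apply(expr, 1, function(y) t.test(y[group == "A"], y[group == "B"])$p.value)
qvals <- p.adjust(pvals, method = "BH")

## Compare calls against the truth
table(raw_hit = pvals < 0.05, truth = truth)
table(bh_hit = qvals < 0.05, truth = truth)
</code></pre>
<p>With 9,900 null features, a raw 0.05 cutoff flags hundreds of false positives; the adjusted cutoff mostly keeps the real signals. That is exactly the "piles of noise" problem that era 2 methods were built to handle.</p>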
<p>I’ve focused here on the implications of these eras from a statistical modeling perspective, but as we discussed in my class, era 4 coupled with advances in machine learning methods means that there are social, economic, and behavioral implications of these eras as well.</p>
Not So Standard Deviations Episode 28 - Writing is a lot Harder than Just Talking
2016-12-15T00:00:00+00:00
http://simplystats.github.io/2016/12/15/nssd-episode-28
<p>Hilary and I talk about building data science products that provide a good user experience while adhering to some kind of ground truth, whether it’s in medicine, education, news, or elsewhere. Also Gilmore Girls.</p>
<p>If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Show notes:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Bradford_Hill_criteria">Hill’s criteria for causation</a></li>
<li><a href="https://www.oreilly.com/topics/oreilly-bots-podcast">O’Reilly Bots Podcast</a></li>
<li><a href="http://www.nhtsa.gov/nhtsa/av/index.html">NHTSA’s Federal Automated Vehicles Policy</a></li>
<li>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>. And please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</li>
<li>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</li>
<li>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-28-writing-is-a-lot-harder-than-just-talking">Download the audio for this episode</a></p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/297930039&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
What is going on with math education in the US?
2016-12-09T00:00:00+00:00
http://simplystats.github.io/2016/12/09/pisa-us-math
<p>When colleagues with young children seeking information about schools
ask me if I like the Massachusetts public school my
children attend, my answer is always the same: “it’s great…except for
math”. The fact is that in our household we supplement our kids’ math
education with significant extracurricular work in order to ensure
that they receive a math education comparable to what we received as
children in the public system.</p>
<p>The latest
<a href="http://www.businessinsider.com/pisa-worldwide-ranking-of-math-science-reading-skills-2016-12">results</a>
from the Program for International Student Assessment (PISA)
show that there is a general problem with math education in the
US. Were it a country, Massachusetts would have been in second place
in reading, sixth in science, but 20th in math, only ten points above
the OECD average of 490. The US as a whole did not fare nearly as well
as MA, and the same discrepancy between math and the other two
subjects was present. In fact, among the top 30 performing
countries ranked by their average of science and reading scores, the
US has, by far, the largest discrepancy between math and
the other two subjects tested by PISA. The difference of 27 was
substantially greater than the second largest difference,
which came from Finland at 17. Massachusetts had a difference of 28.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/pisa-2015-math-v-others.png" alt="PISA 2015 Math minus average of science and reading" /></p>
<p>If we look at the trend of this difference since PISA was started 16
years ago, we see a disturbing progression. While science and reading
have
<a href="http://www.artofteachingscience.org/wp-content/uploads/2013/12/Screen-Shot-2013-12-17-at-9.28.38-PM.png">remained stable, math has declined</a>. In
2000 the difference between the results in math and the other subjects
was only 8.5. Furthermore,
the US is not performing exceptionally well in any subject:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/pisa-2015-scatter.png" alt="PISA 2015 Math versus average of science and reading" /></p>
<p>So what is going on? I’d love to read theories in the comment
section. From my experience comparing my kids’ public schools now
with those that I attended, I have one theory of my own. When I was a
kid there was a math textbook. Even when a teacher was bad, it
provided structure and an organized alternative for learning on your
own. Today this approach is seen as being “algorithmic” and has fallen
out of favor. “Project based learning” coupled with group activities have
become popular replacements.</p>
<p>Project based learning is great in principle. But, speaking from
experience, I can say it is very hard to come up with good projects,
even for highly trained mathematical minds. And it is certainly much
more time consuming for the instructor than following a
textbook. Teachers don’t have more time now than they did 30 years ago
so it is no surprise that this new more open approach leads to
improvisation and mediocre lessons. A recent example of a pointless
math project involved 5th graders picking a number and preparing a
colorful poster showing “interesting” facts about this number. To
make things worse in terms of math skills, students are often rewarded
for effort, while correctness is secondary and often disregarded.</p>
<p>Regardless of the reason for the decline, given the trends
we are seeing, we need to rethink the approach to math education. Math
education may have had its problems in the past, but recent evidence
suggests that the reforms of the past few decades seem to have
only worsened the situation.</p>
<p>Note: To make these plots I downloaded and read the data into R as described <a href="https://www.r-bloggers.com/pisa-2015-how-to-readprocessplot-the-data-with-r/">here</a>.</p>
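<p>For readers who want to reproduce the gist of the figures, here is a rough sketch of the computation (my own, not the exact code behind the plots above). The data frame below uses placeholder country names and scores purely so the sketch runs; the real scores come from the PISA files read in as described in the link.</p>
<pre><code class="language-r">## Rough sketch of the math-versus-other-subjects comparison.
## Placeholder countries and scores only; substitute the real PISA data.
library(ggplot2)

pisa <- data.frame(
  country = c("Country A", "Country B", "Country C"),
  math    = c(470, 520, 490),
  reading = c(500, 510, 495),
  science = c(495, 515, 492)
)

## Gap between math and the average of the other two subjects
pisa$other <- (pisa$reading + pisa$science) / 2
pisa$gap   <- pisa$math - pisa$other   # negative means math lags behind
pisa[order(pisa$gap), ]

## Math versus the average of science and reading (roughly the second plot)
ggplot(pisa, aes(x = other, y = math, label = country)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  geom_point() +
  geom_text(vjust = -0.5) +
  labs(x = "Average of science and reading", y = "Math")
</code></pre>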
Not So Standard Deviations Episode 27 - Special Guest Amelia McNamara
2016-11-30T00:00:00+00:00
http://simplystats.github.io/2016/11/30/nssd-episode-27
<p>I had the pleasure of sitting down with Amelia McNamara, Visiting Assistant Professor of Statistical and Data Sciences at Smith College, to talk about data science, data journalism, visualization, the problems with R, and adult coloring books.</p>
<p>If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://www.science.smith.edu/~amcnamara/index.html">Amelia McNamara’s web site</a></p>
</li>
<li>
<p><a href="http://datascience.columbia.edu/mark-hansen">Mark Hansen</a></p>
</li>
<li>
<p><a href="https://www.youtube.com/watch?v=dD36IajCz6A">Listening Post</a></p>
</li>
<li>
<p><a href="http://www.nytimes.com/video/arts/1194817116105/moveable-type.html">Moveable Type</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Alan_Kay">Alan Kay</a></p>
</li>
<li>
<p><a href="https://harc.ycr.org/">HARC (Human Advancement Research Community)</a></p>
</li>
<li>
<p><a href="http://www.vpri.org/index.html">VPRI (Viewpoints Research Institute)</a></p>
</li>
<li>
<p><a href="https://www.youtube.com/watch?v=hps9r7JZQP8">Interactive essays</a></p>
</li>
<li>
<p><a href="https://rafaelaraujoart.com/products/golden-ratio-coloring-book">Golden Ratio Coloring Book</a></p>
</li>
<li>
<p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>. And please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</p>
</li>
<li>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
</li>
<li>
<p>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-27-special-guest-amelia-mcnamara">Download the audio for this episode</a></p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/295593774&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Help choose the Leek group color palette
2016-11-17T00:00:00+00:00
http://simplystats.github.io/2016/11/17/leekgroup-colors
<p>My research group just recently finished a paper where several different teams within the group worked on different analyses. If you are interested, the paper describes the <a href="http://biorxiv.org/content/early/2016/08/08/068478">recount resource</a>, which includes processed versions of thousands of human RNA-seq data sets.</p>
<p>As part of this project each group had to contribute some plots to the paper. One thing that I noticed is that each person used their own color palette and theme when building the plots. When we wrote the paper this made it a little harder for the figures to all fit together - especially when different group members worked on a single panel of a multi-panel plot.</p>
<p>So I started thinking about setting up a Leek group theme for both base R and ggplot2 graphics. One of the first problems was that every group member had their own opinion about what the best color palette would be. So we are running a little competition to determine what the official Leek group color palette for plots will be in the future.</p>
<p>As part of that process, one of my awesome postdocs, Shannon Ellis, decided to collect some data on how people perceive different color palettes. The survey is here:</p>
<p>https://docs.google.com/forms/d/e/1FAIpQLSfHMXVsl7pxYGarGowJpwgDSf9lA2DfWJjjEON1fhuCh6KkRg/viewform?c=0&w=1</p>
<p>If you have a few minutes and have an opinion about colors (I know you do!) please consider participating in our little poll and helping to determine the future of Leek group plots!</p>
Open letter to my lab: I am not "moving to Canada"
2016-11-11T00:00:00+00:00
http://simplystats.github.io/2016/11/11/im-not-moving-to-canada
<p>Dear Lab Members,</p>
<p>I know that the results of Tuesday’s election have many of you
concerned about your future. You are not alone. I am concerned
about my future as well. But I want you to know that I have no plans
of going anywhere and I intend to dedicate as much time to our
projects as I always have. Meeting, discussing ideas and putting them
into practice with you is, by far, the best part of my job.</p>
<p>We are all concerned that if certain campaign promises are kept many
of our fellow citizens may need our help. If this happens, then we
will pause to do whatever we can to help. But I am currently
cautiously optimistic that we will be able to continue focusing on
helping society in the best way we know how: by doing scientific
research.</p>
<p>This week Dr. Francis Collins assured us that there is strong
bipartisan support for scientific research. As an example consider
<a href="http://www.nytimes.com/2015/04/22/opinion/double-the-nih-budget.html?_r=0">this op-ed</a>
in which Newt Gingrich advocates for doubling the NIH budget. There
also seems to be wide consensus in this country that scientific
research is highly beneficial to society and an understanding that to
do the best research we need the best of the best no matter their
gender, race, religion or country of origin. Nothing good comes from
creative, intelligent, dedicated people leaving science.</p>
<p>I know there is much uncertainty but, as of now, there is nothing stopping us
from continuing to work hard. My plan is to do just that and I hope
you join me.</p>
Not all forecasters got it wrong: Nate Silver does it again (again)
2016-11-09T00:00:00+00:00
http://simplystats.github.io/2016/11/09/not-all-forecasters-got-it-wrong
<p>Four years ago we
<a href="http://simplystatistics.org/2012/11/07/nate-silver-does-it-again-will-pundits-finally-accept/">posted</a>
on Nate Silver’s, and other forecasters’, triumph over pundits. In
contrast, after yesterday’s presidential election, in which the results contradicted
most polls and data-driven forecasts, several news articles came out
wondering how this happened. It is important to point
out that not all forecasters got it wrong. Statistically
speaking, Nate Silver, once again, got it right.</p>
<p>To show this, below I include a plot showing the expected margin of
victory for Clinton versus the actual results for the most competitive states provided by 538. It includes the uncertainty bands provided by 538 on
<a href="http://projects.fivethirtyeight.com/2016-election-forecast/">this site</a>
(I eyeballed the band sizes to make the plot in R, so they are not
exactly like 538’s).</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/us-election-2016-538-prediction.png" alt="538-2016-election" /></p>
<p>Note that if these are 95% confidence/credible intervals, 538 got 1
wrong. This is exactly what we expect since 15/16 is about
95%. Furthermore, judging by the plot <a href="http://projects.fivethirtyeight.com/2016-election-forecast/">here</a>, 538 estimated the popular vote margin to be 3.6%
with a confidence/credible interval of about 5%.
This too was an accurate
prediction since Clinton is going to win the popular vote by
about 1% <del>0.5%</del> (note this final result is in the margin of error of
several traditional polls as well). Finally, when other forecasters were
giving Trump between 14% and 0.1% chances of winning, 538 gave
him about a
30% chance which is slightly more than what a team has when down 3-2
in the World Series. In contrast, in 2012 538 gave Romney only a 9%
chance of winning. Also, remember, if in ten election cycles you
call it for someone with a 70% chance, you should get it wrong 3
times. If you get it right every time then your 70% statement was wrong.</p>
<p>So how did 538 outperform all other forecasters? First, as far as I
can tell they model the possibility of an overall bias, modeled as a
random effect, that affects
every state. This bias can be introduced by systematic
lying to pollsters or undersampling some group. Note that this bias
can’t be estimated from data from
one election cycle, but its variability can be estimated from
historical data. 538 appears
to estimate the standard error of this term to be
about 2%. More details on this are included <a href="http://simplystatistics.org/html/midterm2012.html">here</a>. In 2016 we saw this bias and you can see it in
the plot above (more points are above the line than below). The
confidence bands account for this source of variability and furthermore
their simulations account for the strong correlation you will see
across states: the chance of seeing an upset in Pennsylvania, Wisconsin,
and Michigan is <strong>not</strong> the product of an upset in each. In
fact it’s much higher. Another advantage 538 had is that they somehow
were able to predict a systematic, not random, bias against
Trump. You can see this by
comparing their adjusted data to the raw data (the adjustment favored
Trump by about 1.5 points on average). We can clearly see this when comparing the 538
estimates to The Upshot’s:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/us-election-2016-538-v-upshot.png" alt="538-2016-election" /></p>
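<p>To see why the shared bias term matters so much, here is a small simulation of my own (a sketch, not 538’s actual model). Suppose Clinton leads by 3 points in each of three states, each state poll average has an independent 2-point error, and there may be a common 2-point bias affecting every state, roughly the standard error mentioned above. The chance that all three states flip is far larger when the bias is shared than when the errors are independent, even though each state’s marginal uncertainty is the same. All numbers are made up for illustration.</p>
<pre><code class="language-r">## Sketch of why a shared polling bias matters (not 538's actual model).
set.seed(538)
n_sim    <- 100000
lead     <- 3   # polled margin in each of three states (hypothetical)
se_state <- 2   # independent state-level polling error
se_bias  <- 2   # shared national bias (the random effect)

## Independent errors with the same total variance per state
ind <- replicate(3, lead + rnorm(n_sim, 0, sqrt(se_state^2 + se_bias^2)) < 0)
mean(rowSums(ind) == 3)   # chance all three states flip

## Shared bias: one draw of the bias is added to every state
bias <- rnorm(n_sim, 0, se_bias)
corr <- sapply(1:3, function(i) lead + bias + rnorm(n_sim, 0, se_state) < 0)
mean(rowSums(corr) == 3)  # noticeably larger than the independent case
</code></pre>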
<p>The fact that 538 did so much better than other forecasters should
remind us how hard it is to do data analysis in real life. Knowing
math, statistics and programming is not enough. It requires experience
and a deep understanding of the nuances related to the specific
problem at hand. Nate Silver and the 538 team seem to understand this
more than others.</p>
<p>Update: Jason Merkin points out (via Twitter) that 538 provides 80% credible
intervals.</p>
Data scientist on a chromebook take two
2016-11-08T00:00:00+00:00
http://simplystats.github.io/2016/11/08/chromebook-part2
<p>My friend Fernando showed me his collection of <a href="https://twitter.com/jtleek/status/795749713966497793">old Apple dongles</a> that no longer work with the latest generation of Apple devices. This, coupled with the announcement of the MacBook Pro, which promises way more dongles and mostly the same computing, had me freaking out about my computing platform for the future. I’ve been using cloudy tools for more and more of what I do and so it had me wondering if it was time to go back and try my <a href="http://simplystatistics.org/2012/01/09/a-statistician-and-apple-fanboy-buys-a-chromebook-and/">Chromebook experiment</a> again. Basically the question is whether I can do everything I need to do comfortably on a Chromebook.</p>
<p>So to execute the experiment I got a brand new <a href="https://www.asus.com/us/Notebooks/ASUS_Chromebook_Flip_C100PA/">ASUS Chromebook Flip</a> and the connector I need to plug it into HDMI monitors (there is no escaping at least one dongle I guess :(). Here is what that bad boy looks like in my home office with Apple superfanboy Roger on the screen.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/chromebook2.jpg" alt="chromebook2" /></p>
<p>In terms of software there have been some major improvements since I last tried this experiment out. Some of these I talk about in my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a>. As of this writing this is my current setup:</p>
<ul>
<li>Music on <a href="https://play.google.com">Google Play</a></li>
<li>Latex on <a href="https://www.overleaf.com">Overleaf</a></li>
<li>Blog/website/code on <a href="https://github.com/">Github</a></li>
<li>R programming on an <a href="http://www.louisaslett.com/RStudio_AMI/">Amazon AMI with Rstudio loaded</a> although <a href="https://twitter.com/earino/status/795750908457984000">I hear</a> there may be other options that are good there that I should try.</li>
<li>Email/Calendar/Presentations/Spreadsheets/Docs with <a href="https://www.google.com/">Google</a> products</li>
<li>Twitter with <a href="https://tweetdeck.twitter.com/">Tweetdeck</a></li>
</ul>
<p>That handles the vast majority of my workload so far (it’s only been a day :)). But I would welcome suggestions, and I’ll report back either when I give up or if things are still going strong in a little while….</p>
Not So Standard Deviations Episode 25 - How Exactly Do You Pronounce SQL?
2016-10-28T00:00:00+00:00
http://simplystats.github.io/2016/10/28/nssd-episode-25
<p>Hilary and I go through the overflowing mailbag to respond to listener questions! Topics include causal inference in trend modeling, regression model selection, using SQL, and data science certification.</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="https://www.amazon.com/gp/product/B0017LNHY2/">Professor Kobre’s Lightscoop Standard Version Bounce Flash Device</a></p>
</li>
<li>
<p><a href="https://www.speechpad.com">Speechpad</a></p>
</li>
<li>
<p><a href="https://www.amazon.com/gp/product/0544703391/">Speaking American by Josh Katz</a></p>
</li>
<li>
<p><a href="https://medium.com/@josh_nussbaum/data-sets-are-the-new-server-rooms-40fdb5aed6b0?_hsenc=p2ANqtz-8IHAReMPP2JjyYs6TqyMYCnjUapQdLQFEaQOjNX9BfUhZV2nzXWwy2NHJHrCs-VN67GxT4djKCUWq8tkhTyiQkb965bg&_hsmi=36470868#.wybl0l3p7">Data Sets Are The New Server Rooms</a></p>
</li>
<li>
<p><a href="http://simplystatistics.org/2016/10/26/datasets-new-server-rooms/">Are Datasets the New Server Rooms?</a></p>
</li>
<li>
<p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>. And please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</p>
</li>
<li>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
</li>
<li>
<p>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-25-how-exactly-do-you-pronounce-sql">Download the audio for this episode</a></p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/290164484&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Are Datasets the New Server Rooms?
2016-10-26T00:00:00+00:00
http://simplystats.github.io/2016/10/26/datasets-new-server-rooms
<p>Josh Nussbaum has an <a href="https://medium.com/@josh_nussbaum/data-sets-are-the-new-server-rooms-40fdb5aed6b0?_hsenc=p2ANqtz-8IHAReMPP2JjyYs6TqyMYCnjUapQdLQFEaQOjNX9BfUhZV2nzXWwy2NHJHrCs-VN67GxT4djKCUWq8tkhTyiQkb965bg&_hsmi=36470868#.wz8f23tak">interesting post</a> over at Medium about whether massive datasets are the new server rooms of tech business.</p>
<p>The analogy comes from the “old days” where in order to start an Internet business, you had to buy racks and servers, rent server space, buy network bandwidth, license expensive server software, set up backups, and on and on. Doing all that up front required a substantial amount of capital just to get off the ground. As inconvenient as this might have been, it provided an immediate barrier to entry for any other competitors who weren’t able to raise similar capital.</p>
<p>Of course,</p>
<blockquote>
<p>…the emergence of open source software and cloud computing completely eviscerated the costs and barriers to starting a company, leading to deflationary economics where one or two people could start their company without the large upfront costs that were historically the hallmark of the VC industry.</p>
</blockquote>
<p>So if startups don’t have huge capital costs in the beginning, what costs <em>do</em> they have? Well, for many new companies that rely on machine learning, they need to collect data.</p>
<blockquote>
<p>As a startup collects the data necessary to feed their ML algorithms, the value the product/service provides improves, allowing them to access more customers/users that provide more data and so on and so forth.</p>
</blockquote>
<p>Collecting huge datasets ultimately costs money. The sooner a startup can raise money to get that data, the sooner they can defend themselves from competitors who may not yet have collected the huge datasets for training their algorithms.</p>
<p>I’m not sure the analogy between datasets and server rooms quite works. Even back when you had to pay a lot of up-front costs to set up servers and racks, a lot of that technology was already a commodity, and anyone could have access to it for a price.</p>
<p>I see massive datasets used to train machine learning algorithms as more like the new proprietary software. The startups of yore spent a lot of time writing custom software for what we might now consider mundane tasks. This was a time-consuming activity but the software that was developed had value and was a differentiator for the company. Today, many companies write complex machine learning algorithms, but those algorithms and their implementations are quickly becoming commodities. So the only thing that separates one company from another is the amount and quality of data that they have to train those algorithms.</p>
<p>Going forward, it will be interesting to see what these companies will do with those massive datasets once they no longer need them. Will they “open source” them and make them available to everyone? Could there be an open data movement analogous to the open source movement?</p>
<p>For the most part, I doubt it. While I think many today would perhaps sympathize with the sentiment that <a href="https://www.gnu.org/gnu/manifesto.en.html">software shouldn’t have owners</a>, those same people I think would argue vociferously that data most certainly do have owners. I’m not sure how I’d feel if Facebook made all their data available to anyone. That said, many datasets are made available by various businesses, and as these datasets grow in number and in usefulness, we may see a day where the collection of data is not a key barrier to entry, and that you can train your machine learning algorithm on whatever is out there.</p>
Distributed Masochism as a Pedagogical Model
2016-10-20T00:00:00+00:00
http://simplystats.github.io/2016/10/20/distributed-masochism-as-a-pedagogical-model
<p><em>Editor’s note: This is a guest post by
<a href="http://seankross.com/">Sean Kross</a>. Sean is a software developer in the
Department of Biostatistics at the Johns Hopkins Bloomberg School of Public
Health. Sean has contributed to several of our specializations including
<a href="https://www.coursera.org/specializations/jhu-data-science">Data Science</a>,
<a href="https://www.coursera.org/specializations/executive-data-science">Executive Data Science</a>,
and <a href="https://www.coursera.org/specializations/r">Mastering Software Development in R</a>.
He tweets <a href="https://twitter.com/seankross">@seankross</a>.</em></p>
<p>Over the past few months I’ve been helping Jeff develop the Advanced Data
Science class he’s teaching at the Johns Hopkins Bloomberg School of Public
Health. We’ve been trying to identify technologies that we can teach to
students which (we hope) will enable them to rapidly prototype data-based
software applications which will serve a purpose in public health. We started with
technologies that we’re familiar with (R, Shiny, static websites) but we’re
also trying to teach ourselves new technologies (the Amazon Alexa Skills API,
iOS and Swift). We’re teaching skills that we know intimately along with skills
that we’re learning on the fly which is a style of teaching that we’ve practiced
<a href="https://www.coursera.org/specializations/jhu-data-science">several</a>
<a href="https://www.coursera.org/specializations/r">times</a>.</p>
<p>Jeff and I have come to realize that while building new courses with
technologies that are new to us we experience particular pains and frustrations
which, when documented, become valuable learning resources for our students.
This process of documenting new-tech-induced pain is only a preliminary step.
When we actually launch classes either online or
in person our students run into new frustrations which we respond to with
changes to either documentation or course content. This process of quickly
iterating on course material is especially enhanced in online courses where the
time span for a course lasts a few weeks compared to a full semester, so kinks
in the course are ironed out at a faster rate compared to traditional in-person
courses. All of the material in our courses is open-source and available on
GitHub, and we teach our students how to use Git and GitHub. We can take
advantage of improvements and contributions the students think we should make
to our courses through pull requests that we receive. Student contributions
further reduce the overall start-up pain experienced by other students.</p>
<p>With students from all over the world participating in our online courses we’re
unable to anticipate every technical need considering different locales,
languages, and operating systems. Instead of being anxious about this reality
we depend on a system of “distributed masochism” whereby documenting every
student’s unique technical learning pains is an important aspect of improving
the online learning experience. Since we only have a few months head start
using some of these technologies compared to our students it’s likely that as
instructors we’ve recently climbed a similar learning curve which makes it
easier for us to help our students. We believe that this approach of teaching
new technologies by allowing any student to contribute to open course material
allows a course to rapidly adapt to students’ needs and to the inevitable
changes and upgrades that are made to new technologies.</p>
<p>I’m extremely interested in communicating with anyone else who is using similar techniques, so if you’re interested please contact me via Twitter (<a href="https://twitter.com/seankross">@seankross</a>) or send me an email: sean at seankross.com.</p>
Not So Standard Deviations Episode 24 - 50 Minutes of Blathering
2016-10-16T00:00:00+00:00
http://simplystats.github.io/2016/10/16/nssd-episode-24
<p>Another IRL episode! Hilary and I met at a Jimmy John’s to talk data
science, like you do. Topics covered include RStudio Conf, polling,
millennials, Karl Broman, and more!</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>. And please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="https://www.rstudio.com/conference/">rstudio::conf</a></p>
</li>
<li>
<p><a href="http://www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html?_r=0">We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Millennials">Millenials</a></p>
</li>
<li>
<p><a href="http://kbroman.org">Karl Broman</a></p>
</li>
<li>
<p><a href="https://www.rstudio.com/2016/10/12/interview-with-j-j-allaire/">Interview with J.J. Allaire</a></p>
</li>
<li>
<p><a href="http://varianceexplained.org/r/year_data_scientist/">One Year at Stack Overflow</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-24-50-minutes-of-blathering">Download the audio for this episode</a></p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/287815210&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Should I make a chatbot or a better FAQ?
2016-10-14T00:00:00+00:00
http://simplystats.github.io/2016/10/14/chatabot-or-faq
<p>Roger pointed me to this <a href="https://www.theinformation.com/behind-facebooks-messenger-missteps">interesting article</a> (paywalled, sorry!) about Facebook’s chatbot service. I think the article made a couple of interesting points. The first thing I thought was interesting was their explicit acknowledgement of the process I outlined in a previous post for building an AI startup - (1) convince (or in this case pay) some humans to be your training set, and (2) collect the data on the humans and then use it to build your AI.</p>
<p>The other point that is pretty fascinating is that they realized how many data points they would need before they could reasonably replace a human with an AI chatbot. The original estimate was tens of thousands and the ultimate number was millions or more. I have been thinking a lot that the AI “revolution” is just a tradeoff between parameters and data points. If you have a billion parameter prediction algorithm it may work pretty awesome - as long as you have a few hundred billion data points to train it with.</p>
<p>But the theme of the article was that chatbots may have had some mis-steps/may not be ready for prime time. I think the main reason is that at the moment most AI efforts can only report facts, not intuit intention and alter the question for the user or go beyond the facts/state of the world.</p>
<p>One example I’ve run into recently was booking a ticket on an airline. I wanted to know if I could make a certain change to my ticket. The airline didn’t have any information about the change I wanted to make online. After checking thoroughly I clicked on the “Chat with an agent” button and was directed to what was clearly a chatbot. The chatbot asked a question or two and then sent me to the “make changes to a ticket” page of the website.</p>
<p>I eventually had to call and get a person on the phone, because what I wanted to ask about didn’t apply to the public information. They set me straight and I booked the ticket. The chatbot wasn’t helpful because it could only respond with information it had available on the website. It couldn’t identify a new situation, realize it had to ask around, figure out there was an edge case, and then make a ruling/help out.</p>
<p>I would guess that most of the time if a person interacts with a chatbot they are doing it only because they already looked at all the publicly available information on the FAQ, etc. and couldn’t find it. So an alternative solution, which would require a lot less work and a much smaller training set, is to just have a more complete FAQ.</p>
<p>The question to me is does anyone other than Facebook or Google have a big enough training set to make a chatbot worth it?</p>
The Dangers of Weighting Up a Sample
2016-10-12T00:00:00+00:00
http://simplystats.github.io/2016/10/12/weighting-survey
<p>There’s a <a href="http://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html">great story</a> by Nate Cohn over at the New York Times’ Upshot
about the dangers of “weighting up” a sample from a survey. In this
case, it is in regard to a U.S.C./LA Times poll asking who people will
vote for President:</p>
<blockquote>
<p>The U.S.C./LAT poll weights for many tiny categories: like 18-to-21-year-old men, which U.S.C./LAT estimates make up around 3.3 percent of the adult citizen population. Weighting simply for 18-to-21-year-olds would be pretty bold for a political survey; 18-to-21-year-old men is really unusual.</p>
</blockquote>
<p>The U.S.C./LA Times poll apparently goes even further:</p>
<blockquote>
<p>When you start considering the competing demands across multiple categories, it can quickly become necessary to give an astonishing amount of extra weight to particularly underrepresented voters — like 18-to-21-year-old black men. This wouldn’t be a problem with broader categories, like those 18 to 29, and there aren’t very many national polls that are weighting respondents up by more than eight or 10-fold. The extreme weights for the 19-year-old black Trump voter in Illinois are not normal.</p>
</blockquote>
<p>It’s worth noting (as a good thing) that the U.S.C./LA Times poll data is completely open, thus allowing the NYT to reproduce this entire analysis.</p>
<p>I haven’t done much in the way of survey analyses, but I’ve done some
inverse probability weighting and in my experience it can be a tricky
procedure in ways that are not always immediately obvious. The article
discusses weight trimming, but also notes the dangers of that
procedure. Overall, a good treatment of a complex issue.</p>
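<p>To make the concern concrete, here is a toy simulation of my own with made-up numbers (not the U.S.C./LA Times weighting scheme). A single respondent carrying an extreme weight can noticeably move a weighted proportion, and Kish’s effective sample size formula shows how much information the extreme weights cost you.</p>
<pre><code class="language-r">## Toy sketch of how one extreme weight distorts a weighted proportion.
## Made-up numbers; not the actual U.S.C./LA Times weighting scheme.
set.seed(2016)
n <- 1000
support <- rbinom(n, 1, 0.45)   # 1 = supports candidate X

## Most respondents get weight 1; one underrepresented respondent gets 30
weights <- rep(1, n)
weights[1] <- 30
support[1] <- 1                 # and that respondent happens to be a supporter

c(unweighted = mean(support),
  weighted   = sum(weights * support) / sum(weights))

## Kish's effective sample size: extreme weights shrink it well below n
n_eff <- sum(weights)^2 / sum(weights^2)
n_eff
</code></pre>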
Information and VC Investing
2016-10-03T00:00:00+00:00
http://simplystats.github.io/2016/10/03/the-information-vc
<p>Sam Lessin at The Information has a <a href="http://go.theinformation.com/xXfQ5plmVMI">nice post</a> (sorry, paywall, but it’s a great publication) about how increased measurement and analysis is changing the nature of venture capital investing.</p>
<blockquote>
<p>This brings me back to what is happening at series A financings. Investors have always, obviously, tried to do diligence at all financing rounds. But series A investments used to be an exercise in a few top-level metrics a company might know, some industry interviews and analysis, and a whole lot of trust. The data that would drive capital market efficiency usually just wasn’t there, so capital was expensive and there were opportunities for financiers. Now, I am seeing more and more that after a seed round to boot up most companies, the backbone of a series A financing is an intense level of detail in reporting and analytics. It can be that way because the companies have the data</p>
</blockquote>
<p>I’ve seen this happen in other areas where data comes in to disrupt the way things are done. Good analysis only gives you an advantage if no one else is doing it. Once everyone accepts the idea and everyone has the data (and a good analytics team), there’s no more value left in the market.</p>
<p>Time to search elsewhere.</p>
papr - it's like tinder, but for academic preprints
2016-10-03T00:00:00+00:00
http://simplystats.github.io/2016/10/03/papr
<p>As part of the <a href="http://jhudatascience.org/">Johns Hopkins Data Science Lab</a> we are setting up a web and mobile <a href="http://jhudatascience.org/prototyping/">data product prototyping shop</a>. As part of that process I’ve been working on different types of very cheap and easy to prototype apps. A few days ago I posted about creating a <a href="http://simplystatistics.org/2016/08/26/googlesheets/">distributed data collection app with Google Sheets</a>.</p>
<p>So for fun I built another kind of app. This one I’m calling <a href="https://jhubiostatistics.shinyapps.io/papr/">papr</a> and it’s sort of like “Tinder for preprints”. I scraped all of the papers out of the <a href="http://biorxiv.org/">http://biorxiv.org/</a> database. When you open the app you see one at random and you can rate it according to two axes:</p>
<ul>
<li><em>Is the paper interesting?</em> - a paper can be rated as exciting or boring. We leave the definitions of those terms up to you.</li>
<li><em>Is the paper correct or questionable?</em> - a paper can either be solidly correct or potentially questionable in its results. We leave the definitions of those terms up to you.</li>
</ul>
<p>When you click on your rating you are shown another randomly selected paper from bioRxiv. You can “level up” as you rate more papers. You can also download your ratings at any time.</p>
<p>If you have any feedback on the app I’d love to hear it and if anyone knows how to get custom domain names to work with shinyapps.io I’d also love to hear from you. I tried the instructions with no luck…</p>
<p>Try the app here:</p>
<p>https://jhubiostatistics.shinyapps.io/papr/</p>
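<p>For the curious, the core interaction - show a random paper, record a rating, then show another - fits in a few lines of Shiny. The sketch below is my own and is not the real app’s code; it only shows one of the two rating axes, and the <code>papers</code> data frame is a placeholder rather than the scraped bioRxiv titles.</p>
<pre><code class="language-r">## Bare-bones sketch of the papr interaction (not the real app's code).
## `papers` is a placeholder; the real app uses titles scraped from bioRxiv.
library(shiny)

papers <- data.frame(title = paste("Placeholder preprint", 1:10),
                     stringsAsFactors = FALSE)

ui <- fluidPage(
  h3(textOutput("title")),
  actionButton("exciting", "Exciting"),
  actionButton("boring", "Boring")
)

server <- function(input, output, session) {
  rv <- reactiveValues(current = sample(nrow(papers), 1),
                       ratings = data.frame())

  output$title <- renderText(papers$title[rv$current])

  rate <- function(label) {
    ## record the rating, then serve up another random paper
    rv$ratings <- rbind(rv$ratings,
                        data.frame(paper = rv$current, rating = label))
    rv$current <- sample(nrow(papers), 1)
  }
  observeEvent(input$exciting, rate("exciting"))
  observeEvent(input$boring, rate("boring"))
}

shinyApp(ui, server)
</code></pre>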
Not So Standard Deviations Episode 23 - Special Guest Walt Hickey
2016-10-01T00:00:00+00:00
http://simplystats.github.io/2016/10/01/nssd-episode-23
<p>Hilary and Roger invite Walt Hickey of FiveThirtyEight.com on to the show to talk about polling, movies, and data analysis reproducibility (of course).</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</p>
<p>Show Notes:</p>
<ul>
<li>
<p><a href="http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/">FiveThirtyEight’s polling methodology</a></p>
</li>
<li>
<p><a href="https://twitter.com/walthickey">Walt Hickey on Twitter</a></p>
</li>
<li>
<p><a href="http://fivethirtyeight.com/features/the-20-most-extreme-cases-of-the-book-was-better-than-the-movie/">The 20 Most Extreme Cases Of ‘The Book Was Better Than The Movie’</a></p>
</li>
<li>
<p><a href="http://practicaltypography.com">Matthew Butterick Typography</a></p>
</li>
<li>
<p><a href="http://www.hoppstudios.com">Hopp</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-23-special-guest-walt-hickey">Download the audio for this episode</a>.</p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/285159790&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Statistical vitriol
2016-09-29T00:00:00+00:00
http://simplystats.github.io/2016/09/29/statistical-vitriol
<p>Over the last few months there has been a lot of vitriol around statistical ideas. First there were <a href="http://www.nejm.org/doi/full/10.1056/NEJMe1516564">data parasites</a> and then there were <a href="https://www.dropbox.com/s/9zubbn9fyi1xjcu/Fiske%20presidential%20guest%20column_APS%20Observer_copy-edited.pdf">methodological terrorists</a>. These epithets came from established scientists who have relatively little statistical training. There was the predictable backlash to these folks from their counterparties, typically statisticians or statistically trained folks who care about open source.</p>
<p>I’m a statistician who cares about open source but I also frequently collaborate with scientists from different fields. It makes me sad and frustrated that statistics - which I’m so excited about and have spent my entire professional career working on - is something that is causing so much frustration, anxiety, and anger.</p>
<p>I have been thinking a lot about the cause of this anger and division in the sciences. As a person who interacts with both groups pretty regularly I think that the reasons are some combination of the following.</p>
<ol>
<li>Data is now everywhere, so every single publication involves some level of statistical modeling and analysis. It can’t be escaped.</li>
<li>The deluge of scientific papers means that only big claims get your work noticed, get you into fancy journals, and get you attention.</li>
<li>Most senior scientists, the ones leading and designing studies, <a href="http://simplystatistics.org/2012/04/27/people-in-positions-of-power-that-dont-understand/">have little or no training in statistics</a>. There is a structural reason for this: data was sparse when they were trained and there wasn’t any reason for them to learn statistics. So statistics and data science wasn’t (and still often isn’t) integrated into medical and scientific curricula.</li>
<li>There is an imbalance of power in the scientific process between statisticians/computational scientists and scientific investigators or clinicians. The clinicians/scientific investigators are “in charge” and the statisticians are often relegated to a secondary role. Statisticians with some control over their environment (think senior tenured professors of (bio)statistics) can avoid these imbalances and look for collaborators who respect statistical thinking, but not everyone can. There are a large number of <a href="http://www.opiniomics.org/a-guide-for-the-lonely-bioinformatician/">lonely bioinformaticians</a> out there.</li>
<li>Statisticians and computational scientists are also frustrated because there is often no outlet for them to respond to these papers in the formal scientific literature - those outlets are controlled by scientists and rarely have statisticians in positions of influence within the journals.</li>
</ol>
<p>Since statistics is everywhere (1) and only flashy claims get you into journals (2) and the people leading studies don’t understand statistics very well (3), you get many publications where the paper makes a big claim based on shaky statistics but it gets through. This then frustrates the statisticians because they have little control over the process (4) and can’t get their concerns into the published literature (5).</p>
<p>This used to just result in lots of statisticians and computational scientists complaining behind closed doors. The internet changed all that, everyone is an <a href="http://simplystatistics.org/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics/">internet scientist</a> now. So the statisticians and statistically savvy take to blogs, f1000research, and other outlets to get their point across.</p>
<p>Sometimes to get attention, statisticians start to have the same problem as scientists; they need their complaints to get attention to have any effect. So they go over the top. They accuse people of fraud, or being statistically dumb, or nefarious, or intentionally doing things with data, or cast a wide net and try to implicate a large number of scientists in poor statistics. The ironic thing is that these things are the same thing that the scientists are doing to get attention that frustrated the statisticians in the first place.</p>
<p>Just to be 100% clear here I am also guilty of this. I have definitely fallen into the hype trap - talking about the “replicability crisis”. I also made the mistake earlier in my blogging career of trashing the statistics of a paper that frustrated me. I am embarrassed I did that now, it wasn’t constructive and the author ended up being very responsive. I think if I had just emailed that person they would have resolved their problem.</p>
<p>I just recently had an experience where a very prominent paper hadn’t made their data public and I was having trouble getting the data. I thought about writing a blog post to get attention, but at the end of the day just did the work of emailing the authors, explaining myself over and over and finally getting the data from them. The result is the same (I have the data) but it cost me time and frustration. So I understand when people don’t want to deal with that.</p>
<p>The problem is that scientists see the attention the statisticians are calling down on them - primarily negative and often over-hyped. Then they get upset and call the statisticians/open scientists names, or push back on entirely sensible policies because they are worried about being humiliated or discredited. While I don’t agree with that response, I also understand the feeling of “being under attack”. I’ve had that happen to me too and it doesn’t feel good.</p>
<p>So where do we go from here? How do we end statistical vitriol and make statistics a positive force? Here is my six part plan:</p>
<ol>
<li>We should create continuing education for senior scientists and physicians in statistical and open data thinking so people who never got that training can understand the unique requirements of a data rich scientific world.</li>
<li>We should encourage journals and funders to incorporate statisticians and computational scientists at the highest levels of influence so that they can drive policy that makes sense in this new data driven time.</li>
<li>We should recognize that scientists and data generators have <a href="http://simplystatistics.org/2016/01/25/on-research-parasites-and-internet-mobs-lets-try-to-solve-the-real-problem/">a lot more on the line</a> when they produce a result or a scientific data set. We should give them appropriate credit for doing that even if they don’t get the analysis exactly right.</li>
<li>We should de-escalate the consequences of statistical mistakes. Right now the consequences are: retractions that hurt careers, blog posts that are aggressive and often too personal, and humiliation by the community. We should make it easy to acknowledge these errors without ruining careers. This will be hard - scientists’ careers often depend on the results they get (recall 2 above). So we need a way to pump up/give credit to/acknowledge scientists who are willing to sacrifice that to get the stats right.</li>
<li>We need to stop treating retractions/statistical errors/mistakes like a sport where there are winners and losers. Statistical criticism should be easy, allowable, publishable and not angry or personal.</li>
<li>Any paper that includes statistical analysis must have a statistically trained author, a statistically trained reviewer, or both. I wouldn’t believe a paper on genomics performed entirely by statisticians with no biology training any more than I would believe a paper with statistics in it performed entirely by physicians with no statistical training.</li>
</ol>
<p>I think scientists forget that statisticians feel un-empowered in the scientific process and statisticians forget that a lot is riding on any given study for a scientist. So being a little more sympathetic to the pressures we all face would go a long way to resolving statistical vitriol.</p>
<p>I’d be eager to hear other ideas too. It makes me sad that statistics has become so political on both sides.</p>
The Mystery of Palantir Continues
2016-09-28T00:00:00+00:00
http://simplystats.github.io/2016/09/28/mystery-palantir-continues
<p>Palantir, the secretive data science/consulting/software company, continues to be a mystery to most people, but recent reports have not been great. <a href="http://www.nytimes.com/reuters/2016/09/26/business/26reuters-palantir-tech-discrimination-lawsuit.html?smprod=nytcore-iphone&smid=nytcore-iphone-share&_r=0">Reuters reports</a> that the U.S. Department of Labor is suing it for employment discrimination:</p>
<blockquote>
<p>The lawsuit alleges Palantir routinely eliminated Asian applicants in the resume screening and telephone interview phases, even when they were as qualified as white applicants.</p>
</blockquote>
<p>Interestingly, the report includes a statistical argument:</p>
<blockquote>
<p>In one example cited by the Labor Department, Palantir reviewed a pool of more than 130 qualified applicants for the role of engineering intern. About 73 percent of applicants were Asian. The lawsuit, which covers Palantir’s conduct between January 2010 and the present, said the company hired 17 non-Asian applicants and four Asians. “The likelihood that this result occurred according to chance is approximately one in a billion,” said the lawsuit, which was filed with the department’s Office of Administrative Law Judges.</p>
</blockquote>
<p><em>Update: Thanks to David Robinson for pointing out that (a) I read the numbers incorrectly and (b) I should have used the hypergeometric distribution to account for the sampling without replacement. The paragraph below is corrected accordingly.</em></p>
<p>Note the use of the phrase “qualified applicants” in reference to the 130. Presumably, there was a screening process that removed “unqualified applicants” and that led us to 130. Of the 130, 73% were Asian. Presumably, there was a follow up selection process (interview, exam) that led to 4 Asians being hired out of 21 (about 19%). Clearly there’s a difference between 19% and 73% but the reasons may not be nefarious. If you assume the number of Asians hired is proportional to the number in the qualified pool, then the p-value for the observed data is about 10^-8, which is not quite “1 in a billion” as the report claims but it’s indeed small. But my guess is the Labor Department has more than this simple calculation in terms of evidence if they were to go through with a suit.</p>
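<p>For those curious, here is a back-of-the-envelope version of that calculation in R. The 95/35 split below is an assumption (roughly 73% of the 130 qualified applicants), and the hires are modeled as a draw without replacement from the qualified pool:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Probability of seeing 4 or fewer Asian hires among 21, if the 21 hires were
## drawn at random (without replacement) from a pool of 95 Asian and 35
## non-Asian qualified applicants
phyper(4, m = 95, n = 35, k = 21)
</code></pre></div></div>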
<p>Alfred Lee from <a href="http://go.theinformation.com/r958P12lLdw">The Information</a> reports that a mutual fund run by Valic sold their shares of Palantir for below the recent valuation:</p>
<blockquote>
<p>The Valic fund sold its stake at $4.50 per share, filings show, down from the $11.38 per share at which the company raised money in December. The value of the stake at the sale price was $621,000. Despite the price drop, Valic made money on the deal, as it had acquired stock in preferred fundraisings in 2012 and 2013 at between $3.06 and $3.51 per share.</p>
</blockquote>
<p>The valuation suggested in the article by the recent sale is $8 billion. In my <a href="http://simplystatistics.org/2016/05/11/palantir-struggles/">previous post on Palantir</a>, I noted that while other large-scale consulting companies certainly make a lot of money, none have the sky-high valuation that Palantir commands. However, a more “down-to-Earth” valuation of $8 billion might be more or less in line with these other companies. It may be bad news for Palantir, but should the company ever have an IPO, it would be good for the public if market participants recognized the intrinsic value of the company.</p>
Thinking like a statistician: this is not the election for progressives to vote third party
2016-09-27T00:00:00+00:00
http://simplystats.github.io/elections/2016/09/27/thinking-like-statistician-election-2016
<p>Democratic elections permit us to vote for whoever we perceive has
the highest expectation to do better with the issues we care about. Let’s
simplify and assume we can quantify how satisfied we are with an
elected official’s performance. Denote this quantity with <em>X</em>. Because
when we cast our vote we still don’t know for sure how the candidate
will perform, we base our decision on what we expect, denoted here with
<em>E(X)</em>. Thus we try to maximize <em>E(X)</em>. However, both political theory
and data tell us that in US presidential elections only two parties
have a non-negligible probability of winning. This implies that
<em>E(X)</em> is 0 for some candidates no matter how large <em>X</em> could
potentially be. So what we are really doing is deciding if <em>E(X-Y)</em> is
positive or negative with <em>X</em> representing one candidate and <em>Y</em> the
other.</p>
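<p>To make this concrete, here is a toy illustration in R with entirely made-up numbers: the expected satisfaction from a candidate is their perceived performance if elected, scaled by their probability of actually being elected. No matter how large a third-party candidate’s potential <em>X</em> is, a negligible win probability drives <em>E(X)</em> toward 0:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Hypothetical numbers, purely for illustration
p_elected   = c(candidate_a = 0.50, candidate_b = 0.49, third_party = 0.01)
performance = c(candidate_a = 6, candidate_b = 3, third_party = 9)  # 0-10 scale
p_elected * performance   # third party expectation is near 0 despite high potential
</code></pre></div></div>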
<p>In past elections some progressives have argued that the difference
between candidates is negligible and have therefore supported the Green Party
ticket. The 2000 election is a notable example. The
<a href="https://en.wikipedia.org/wiki/United_States_presidential_election,_2000">2000 election</a>
was won by George W. Bush by just five <a href="https://en.wikipedia.org/wiki/Electoral_College_(United_States)">electoral votes</a>. In Florida,
which had 25 electoral votes, Bush beat Al
Gore by just 537 votes. Green Party candidate Ralph
Nader obtained 97,488 votes. Many progressive voters were OK with this
outcome because they perceived <em>E(X-Y)</em> to be practically 0.</p>
<p>In contrast, in 2016, I suspect few progressives think that
<em>E(X-Y)</em> is anywhere near 0. In the figures below I attempt to
quantify the progressive’s pre-election perception of consequences for
the last five contests. The first
figure shows <em>E(X)</em> and <em>E(Y)</em> and the second shows <em>E(X-Y)</em>. Note
despite <em>E(X)</em> being the lowest in the past five elections,
<em>E(X-Y)</em> is by far the largest. So if these figures accurately depict
your perception and you think
like a statistician, it becomes clear that this is not the election to
vote third party.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/election.png" alt="election-2016" /></p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/election-diff.png" alt="election-diff-2016" /></p>
Facebook and left censoring
2016-09-26T00:00:00+00:00
http://simplystats.github.io/2016/09/26/facebook-left-censoring
<p>From the <a href="http://www.wsj.com/articles/facebook-overestimated-key-video-metric-for-two-years-1474586951">Wall Street Journal</a>:</p>
<blockquote>
<p>Several weeks ago, Facebook disclosed in a post on its “Advertiser Help Center” that its metric for the average time users spent watching videos was artificially inflated because it was only factoring in video views of more than three seconds. The company said it was introducing a new metric to fix the problem.</p>
</blockquote>
<p>A classic case of left censoring (in this case, by “accident”).</p>
<p>Also this:</p>
<blockquote>
<p>Ad buying agency Publicis Media was told by Facebook that the earlier counting method likely overestimated average time spent watching videos by between 60% and 80%, according to a late August letter Publicis Media sent to clients that was reviewed by The Wall Street Journal.</p>
</blockquote>
<p>What does this information tell us about the actual time spent watching Facebook videos?</p>
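<p>Here is a rough, back-of-the-envelope sketch of an answer. If we assume (purely for illustration) that viewing times are exponentially distributed with a mean of five seconds, then dropping views shorter than three seconds inflates the average by about 60%, near the low end of the range reported to Publicis:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Toy simulation of the left censoring: compare the true average viewing time
## to the average computed only on views longer than 3 seconds
set.seed(1)
views = rexp(1e5, rate = 1 / 5)            # hypothetical viewing times, mean 5 seconds
mean(views)                                # average over all views
mean(views[views > 3])                     # the inflated metric
mean(views[views > 3]) / mean(views) - 1   # relative overestimate, about 0.6
</code></pre></div></div>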
Not So Standard Deviations Episode 22 - Number 1 Side Project
2016-09-19T00:00:00+00:00
http://simplystats.github.io/2016/09/19/nssd-episode-22
<p>Hilary and I celebrate our one year anniversary doing the podcast together by discussing whether there are cities that are good for data scientists, reproducible research, and professionalizing data science.</p>
<p>Also, Hilary and I have just published a new book, <a href="https://leanpub.com/conversationsondatascience?utm_source=SimplyStats&utm_campaign=NSSD&utm_medium=BlogPost">Conversations on Data Science</a>, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show Notes:</p>
<ul>
<li>
<p><a href="https://www.biostat.washington.edu/suminst/sisbid2016/modules/BD1603">Roger’s reproducible research workshop</a></p>
</li>
<li>
<p><a href="http://radar.oreilly.com/2013/06/theres-more-than-one-kind-of-data-scientist.html">There’s More Than One Kind of Data Scientist by Harlan Harris</a></p>
</li>
<li>
<p><a href="http://sf.curbed.com/maps/mapping-the-10-sf-homes-with-the-highest-property-taxes">Billionaire’s row in San Francisco</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Mindfulness-based_stress_reduction">Mindfulness-based stress reduction</a></p>
</li>
<li>
<p><a href="http://www.asteroidmission.org/">OSIRIS-REx</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-22-1-side-project">Download the audio for this episode</a>.</p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/282927998&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Mastering Software Development in R
2016-09-19T00:00:00+00:00
http://simplystats.github.io/2016/09/19/msdr-launch-announcement
<p>Today I’m happy to announce that we’re launching a new specialization on Coursera titled <a href="https://www.coursera.org/specializations/r/"><strong>Mastering Software Development in R</strong></a>. This is a 5-course sequence developed with <a href="https://twitter.com/seankross">Sean Kross</a> and <a href="http://csu-cvmbs.colostate.edu/academics/erhs/Pages/brooke-anderson.aspx">Brooke Anderson</a>.</p>
<p>This sequence differs from our previous Data Science Specialization because it focuses primarily on using R for developing <em>software</em>. We’ve found that as the field of data science evolves, it is becoming ever more clear that software development skills are essential for producing useful data science results and products. In addition, there is a tremendous need for tooling in the data science universe and we want to train people to build those tools.</p>
<p>The first course, <a href="https://www.coursera.org/learn/r-programming-environment">The R Programming Environment</a>, launches today. In the following months, we will launch the remaining courses:</p>
<ul>
<li>Advanced R Programming</li>
<li>Building R Packages</li>
<li>Building Data Visualization Tools</li>
</ul>
<p>In addition to the course, we have a <a href="https://leanpub.com/msdr">companion textbook</a> that goes along with the sequence. The book is available from Leanpub and is currently in progress (if you get the book now, you will receive free updates as they are available). We will be releasing new chapters of the book alongside the launches of the other courses in the sequence.</p>
Interview With a Data Sucker
2016-09-07T00:00:00+00:00
http://simplystats.github.io/open%20science/2016/09/07/interview-with-a-data-sucker
<p>A few months ago Jill Sederstrom from ASH Clinical News interviewed
me for <a href="http://ashclinicalnews.org/attack-of-the-data-suckers/">this article</a> on the data sharing editorial published by the The New England Journal of Medicine (NEJM) and the debate it generated.
The article presented a nice summary, but I thought the original
comprehensive set of questions was very good too. So, with permission from
ASH Clinical News, I am sharing them here along with my answers.</p>
<p>Before I answer the questions below, I want to make an important remark.
When writing these answers I am reflecting on data sharing in
general. Nuances arise in different contexts that need to be
discussed on an individual basis. For example, there are different
considerations to keep in mind when sharing publicly funded data in
genomics (my field) and sharing privately funded clinical trials data,
just to name two examples.</p>
<h3 id="in-your-opinion-what-do-you-see-as-the-biggest-pros-of-data-sharing">In your opinion, what do you see as the biggest pros of data sharing?</h3>
<p>The biggest pro of data sharing is that it can accelerate and improve
the scientific enterprise. This can happen in a variety of ways. For
example, competing experts may apply an improved statistical analysis
that finds a hidden discovery the original data generators missed.
Furthermore, examination of data by many experts can help correct
errors missed by the analyst of the original project. Finally, sharing
data facilitates the merging of datasets from different sources that
allow discoveries not possible with just one study.</p>
<p>Note that data sharing is not a radical idea. For example, thanks to
an organization called <a href="http://fged.org">The MGED Society</a>, most journals require all published
microarray gene expression data to be public in one of two
repositories: GEO or ArrayExpress. This has been an incredible
success, leading to new discoveries, new databases that combine
studies, and the development of widely used statistical methods and
software built with these data as practice examples.</p>
<h3 id="the-nejm-editorial-expressed-concern-that-a-new-generation-of-researchers-will-emerge-those-who-had-nothing-to-do-with-collecting-the-research-but-who-will-use-it-to-their-own-ends-it-referred-to-these-as-research-parasites-is-this-a-real-concern">The NEJM editorial expressed concern that a new generation of researchers will emerge, those who had nothing to do with collecting the research but who will use it to their own ends. It referred to these as “research parasites.” Is this a real concern?</h3>
<p>Absolutely not. If our goal is to facilitate scientific discoveries that
improve our quality of life, I would be much more concerned about
“data hoarders” than “research parasites”. If an important nugget of
knowledge is hidden in a dataset, don’t you want the best data
analysts competing to find it? Restricting the researchers who can
analyze the data to those directly involved with the generators cuts
out the great majority of experts.</p>
<p>To further illustrate this, let’s consider a very concrete example
with real life consequences. Imagine a loved one has a disease with
high mortality rates. Finding a cure is possible but only after
analyzing a very very complex genomic assay. If some of the best data
analysts in the world want to help, does it make any sense at all to
restrict the pool of analysts to, say, a freshly minted masters level
statistician working for the genomics core that generated the data?
Furthermore, what would be the harm of having someone double check
that analysis?</p>
<h3 id="the-nejm-editorial-also-presented-several-other-concerns-it-had-with-data-sharing-including-whether-researchers-would-compare-data-across-clinical-trials-that-is-not-in-fact-comparable-and-a-failure-to-provide-correct-attribution-do-you-see-these-as-being-concerns-what-cons-do-you-believe-there-may-be-to-data-sharing">The NEJM editorial also presented several other concerns it had with data sharing including whether researchers would compare data across clinical trials that is not in fact comparable and a failure to provide correct attribution. Do you see these as being concerns? What cons do you believe there may be to data sharing?</h3>
<p>If such mistakes are made, good peer reviewers will catch the error.
If it escapes peer review, we point it out in post publication
discussions. Science is constantly self correcting.</p>
<p>Regarding attribution, this is a legitimate, but in my opinion, minor
concern. Developers of open source statistical methods and software
see our methods used without attribution quite often. We survive. But
as I elaborate below, we can do things to alleviate this concern.</p>
<h3 id="is-data-stealing-a-real-worry-have-you-ever-heard-of-it-happening-before">Is data stealing a real worry? Have you ever heard of it happening before?</h3>
<p>I can’t say I can recall any case of data being stolen. But let’s
remember that most published data is paid for by taxpayers. They are the
actual owners. So there is an argument to be made that the public’s
data is being held hostage.</p>
<h3 id="does-data-sharing-need-to-happen-symbiotically-as-the-editorial-suggests-why-or-why-not">Does data sharing need to happen symbiotically as the editorial suggests? Why or why not?</h3>
<p>I think symbiotic sharing is the most effective approach to the
repurposing of data. But no, I don’t think we need to force it to happen this way.
Competition is one of the key ingredients of the scientific
enterprise. Having many groups competing almost always beats out a
small group of collaborators. And note that the data generators won’t
necessarily have time to collaborate with all the groups interested in
the data.</p>
<h3 id="in-a-recent-blog-post-you-suggested-several-possible-data-sharing-guidelines-what-would-the-advantage-be-of-having-guidelines-in-place-in-help-guide-the-data-sharing-process">In a recent blog post, you suggested several possible data sharing guidelines. What would the advantage be of having guidelines in place in help guide the data sharing process?</h3>
<p>I think you are referring to <a href="http://simplystatistics.org/2016/01/25/on-research-parasites-and-internet-mobs-lets-try-to-solve-the-real-problem/">a post by Jeff Leek</a> but I am happy to
answer. For data to be generated, we need to incentivize the endeavor.
Guidelines that assure patient privacy should of course be followed.
Some other simple guidelines related to those mentioned by Jeff are:</p>
<ol>
<li>Reward data generators when their data is used by others.</li>
<li>Penalize those that do not give proper attribution.</li>
<li>Apply the same critical rigor to critiques of the original analysis
as we apply to the original analysis.</li>
<li>Include data sharing ethics in scientific education.</li>
</ol>
<h3 id="one-of-the-guidelines-suggested-a-new-designation-for-leaders-of-major-data-collection-or-software-generation-projects-why-do-you-think-this-is-important">One of the guidelines suggested a new designation for leaders of major data collection or software generation projects. Why do you think this is important?</h3>
<p>Again, this was Jeff, but I agree. This is important because we need
an incentive other than giving the generators exclusive rights to
publications emanating from said data.</p>
<h3 id="you-also-discussed-the-need-for-requiring-statisticalcomputational-co-authors-for-papers-written-by-experimentalists-with-no-statisticalcomputational-co-authors-and-vice-versa-what-role-do-you-see-the-referee-serving-why-is-this-needed">You also discussed the need for requiring statistical/computational co-authors for papers written by experimentalists with no statistical/computational co-authors and vice versa. What role do you see the referee serving? Why is this needed?</h3>
<p>I think the same rule should apply to referees. Every paper based on
the analysis of complex data needs to have a referee with
statistical/computational expertise. I also think biomedical journals
publishing data-driven research should start adding these experts to
their editorial boards. I should mention that NEJM actually has had
such experts on their editorial board for a while now.</p>
<h3 id="are-there-certain-guidelines-would-feel-would-be-most-critical-to-include">Are there certain guidelines would feel would be most critical to include?</h3>
<p>To me the most important ones are:</p>
<ol>
<li>
<p>The funding agencies and the community should reward data
generators when their data is used by others. Perhaps more than for
the papers they produce with these data.</p>
</li>
<li>
<p>Apply the same critical rigor to critiques of the original analysis
as we apply to the original analysis. Bashing published results and
talking about the “replication crisis”
has become fashionable. Although in some cases it is very well merited
(see Baggerly and Coombes’ <a href="http://projecteuclid.org/euclid.aoas/1267453942#info">work</a>, for example), in some circumstances critiques are made without much care, mainly for the attention. If we
are not careful about keeping a good balance, we may end up
paralyzing scientific progress.</p>
</li>
</ol>
<h3 id="you-mentioned-that-you-think-symbiotic-data-sharing-would-be-the-most-effective-approach-what-are-some-ways-in-which-scientists-can-work-symbiotically">You mentioned that you think symbiotic data sharing would be the most effective approach. What are some ways in which scientists can work symbiotically?</h3>
<p>I can describe my experience. I am trained as a statistician. I analyze
data on a daily basis both as a collaborator and method developer.
Experience has taught me that if one does not understand the
scientific problem at hand, it is hard to make a meaningful
contribution through data analysis or method development. Most
successful applied statisticians will tell you the same thing.</p>
<p>Most difficult scientific challenges have nuances that only the
subject matter expert can effectively describe. Failing to understand
these usually leads analysts to chase false leads, interpret results
incorrectly or waste time solving a problem no one cares about.
Successful collaborations usually involve a constant back and forth
between the data analysts and the subject matter experts.</p>
<p>However, in many circumstances the data generator is not necessarily
the only one that can provide such guidance. Some data analysts
actually become subject matter experts themselves, others download
data and seek out other collaborators that also understand the details
of the scientific challenge and data generation process.</p>
A Short Guide for Students Interested in a Statistics PhD Program
2016-09-06T00:00:00+00:00
http://simplystats.github.io/advice/2016/09/06/a-short-guide-for-phd-applicants
<p>This summer I had several conversations with undergraduate
students seeking career advice. All were interested in data analysis
and were considering graduate school. I also frequently receive
requests for advice via email. We have posted on this topic
before, for example
<a href="http://simplystatistics.org/2015/02/18/navigating-big-data-careers-with-a-statistics-phd/">here</a>
and
<a href="http://simplystatistics.org/2015/11/09/biostatistics-its-not-what-you-think-it-is/">here</a>, but
I thought it would be useful to share this short guide I put together based on my recent interactions.</p>
<h2 id="its-ok-to-be-confused">It’s OK to be confused</h2>
<p>When I was a college senior I didn’t really understand what Applied
Statistics was nor did I understand what one does as a researcher in
academia. Now I love being an academic doing research in applied statistics.
But it is hard to understand what being a researcher is like until you do
it for a while. Things become clearer as you gain more experience. One
important piece of advice is
to carefully consider advice from those with more
experience than you. It might not make sense at first, but I
can tell today that I knew much less than I thought I did when I was 22.</p>
<h2 id="should-i-even-go-to-graduate-school">Should I even go to graduate school?</h2>
<p>Yes. An undergraduate degree in mathematics, statistics, engineering, or computer science
provides a great background, but some more training greatly increases
your career options. You may be able to learn on the job, but note
that a masters can be as short as a year.</p>
<h2 id="a-masters-or-a-phd">A masters or a PhD?</h2>
<p>If you want a career in academia or as a researcher in industry or
government you need a PhD. In general, a PhD will
give you more career options. If you want to become a data analyst or
research assistant, a masters may be enough. A masters is also a good way
to test out if this career is a good match for you. Many people do a
masters before applying to PhD Programs. The rest of this guide
focuses on those interested in a PhD.</p>
<h2 id="what-discipline">What discipline?</h2>
<p>There are many disciplines that can lead you to a career in data
science: Statistics, Biostatistics, Astronomy, Economics, Machine Learning, Computational
Biology, and Ecology are examples that come to mind. I did my PhD
in Statistics and got a job in a Department of Biostatistics. So this
guide focuses on Statistics/Biostatistics.</p>
<p>Note that once you finish your PhD you have a chance to become a
postdoctoral fellow and further focus your training. By then you will have a
much better idea of what you want to do and will have the opportunity
to choose a lab that closely matches your interests.</p>
<h2 id="what-is-the-difference-between-statistics-and-biostatistics">What is the difference between Statistics and Biostatistics?</h2>
<p>Short answer: very little. I treat them as the same in this guide. Long answer: read
<a href="http://simplystatistics.org/2015/11/09/biostatistics-its-not-what-you-think-it-is/">this</a>.</p>
<h2 id="how-should-i-prepare-during-my-senior-year">How should I prepare during my senior year?</h2>
<h3 id="math">Math</h3>
<p>Good grades in math and statistics classes
are almost a requirement. Good GRE scores help and you need to get a near perfect score in
the Quantitative Reasoning part of the GRE. Get yourself a practice
book and start preparing. Note that to survive the first two years of a statistics PhD program
you need to prove theorems and derive relatively complicated
mathematical results. If you can’t easily handle the math part of the GRE, this will be
quite challenging.</p>
<p>When choosing classes note that the area of math most related to your
stat PhD courses is Real
Analysis. The area of math most used in applied work is Linear
Algebra, specifically matrix theory including understanding
eigenvalues and eigenvectors. You might not make the connection between
what you learn in class and what you use in practice until much
later. This is totally normal.</p>
<p>If you don’t feel ready, consider doing a masters first. But also, get
a second opinion. You might be being too hard on yourself.</p>
<h3 id="programming">Programming</h3>
<p>You will be using a computer to analyze data so knowing some
programming is a must these days. At a minimum, take a basic
programming class. Other computer science classes will help especially
if you go into an area dealing with large datasets. In hindsight, I
wish I had taken classes on optimization and algorithm design.</p>
<p>Know that learning to program and learning a computer language are
different things. You need to learn to program. The choice of language
is up for debate. If you only learn one, learn R. If you learn three,
learn R, Python and C++.</p>
<p>Knowing Linux/Unix is an advantage. If you have a Mac try to use the
terminal as much as possible. On Windows get an emulator.</p>
<h3 id="writing-and-communicating">Writing and Communicating</h3>
<p>My biggest educational regret is that, as a college student, I underestimated the importance
of writing. To this day I am correcting that mistake.</p>
<p>Your success as a researcher greatly depends on how well
you write and communicate. Your thesis, papers, grant
proposals and even emails have to be well written. So practice as much as
possible. Take classes, read works by good writers, and
<a href="http://bulletin.imstat.org/2011/09/terence%E2%80%99s-stuff-speaking-reading-writing/">practice</a>. Consider
starting a blog even if you don’t make it public. Also note that in
academia, job interviews will
involve a 50 minute talk as well as several conversations about your
work and future plans. So communication skills are also a big plus.</p>
<h2 id="but-wait-why-so-much-math">But wait, why so much math?</h2>
<p>The PhD curriculum is indeed math heavy. Faculty often debate the
possibility of changing the curriculum. But regardless of
differing opinions on what is the right amount, math is the
foundation of our discipline. Although it is true that you will not
directly use much of what you learn, I don’t regret learning so much abstract
math because I believe it positively shaped the way I think and attack
problems.</p>
<p>Note that after the first two years you are
pretty much done with courses and you start on your research. If you work with an
applied statistician you will learn data analysis via the
apprenticeship model. You will learn the most, by far, during this
stage. So be patient. Watch
<a href="https://www.youtube.com/watch?v=R37pbIySnjg">these</a>
<a href="https://www.youtube.com/watch?v=Bg21M2zwG9Q">two</a> Karate Kid scenes
for some inspiration.</p>
<h2 id="what-department-should-i-apply-to">What department should I apply to?</h2>
<p>The top 20-30 departments are practically interchangeable in my
opinion. If you are interested in applied statistics make sure you
pick a department with faculty doing applied research. Note that some
professors focus their research on the mathematical aspects of
statistics. By reading some of their recent papers you will be able to
tell. An applied paper usually shows data (not simulated) and
motivates a subject area challenge in the abstract or introduction. A
theory paper shows no data at all or uses it only as an example.</p>
<h2 id="can-i-take-a-year-off">Can I take a year off?</h2>
<p>Absolutely. Especially if it’s to work in a data related job. In
general, maturity and life experiences are an advantage in grad school.</p>
<h2 id="what-should-i-expect-when-i-finish">What should I expect when I finish?</h2>
<p>You will have many, many options. The demand for your expertise is
great and growing. As a result there are many high-paying options. If you want to
become an academic I recommend doing a postdoc. <a href="http://simplystatistics.org/2011/12/28/grad-students-in-bio-statistics-do-a-postdoc/">Here</a> is why.
But there are many other options as we describe <a href="http://simplystatistics.org/2015/02/18/navigating-big-data-careers-with-a-statistics-phd/">here</a>
and <a href="http://simplystatistics.org/2011/09/12/advice-for-stats-students-on-the-academic-job-market/">here</a>.</p>
Not So Standard Deviations Episode 21 - This Might be the Future!
2016-08-26T00:00:00+00:00
http://simplystats.github.io/2016/08/26/nssd-episode-21
<p>Hilary and I are apart again and this time we’re talking about political polling. We also discuss Trump’s tweets and the fact that Hilary owns a bowling ball.</p>
<p>Also, Hilary and I have just published a new book, <a href="https://leanpub.com/conversationsondatascience?utm_source=SimplyStats&utm_campaign=NSSD&utm_medium=BlogPost">Conversations on Data Science</a>, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show Notes:</p>
<ul>
<li>
<p><a href="http://projects.fivethirtyeight.com/2016-election-forecast/">FiveThirtyEight election dashboard</a></p>
</li>
<li>
<p><a href="http://www.nytimes.com/interactive/2016/upshot/presidential-polls-forecast.html">The Upshot’s election dashboard</a></p>
</li>
<li>
<p><a href="http://varianceexplained.org/r/trump-tweets/">David Robinson’s post on Trump’s tweets</a></p>
</li>
<li>
<p><a href="https://twitter.com/juliasilge">Julia Silge’s Twitter account</a></p>
</li>
<li>
<p><a href="http://thekateringshow.com">The Katering Show</a></p>
</li>
<li>
<p><a href="https://www.beomni.com">Omni</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-21-this-might-be-the-future">Download the audio for this episode</a>.</p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/279922412&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
How to create a free distributed data collection "app" with R and Google Sheets
2016-08-26T00:00:00+00:00
http://simplystats.github.io/2016/08/26/googlesheets
<p><a href="http://www.stat.ubc.ca/~jenny/">Jenny Bryan</a>, developer of the <a href="https://github.com/jennybc/googlesheets">google sheets R package</a>, <a href="https://speakerdeck.com/jennybc/googlesheets-talk-at-user2015">gave a talk</a> at Use2015 about the package.</p>
<p>One of the things that got me most excited about the package was an example she gave in her talk of using the Google Sheets package for data collection at ultimate frisbee tournaments. One reason is that I used to play a little ultimate <a href="http://www.pbase.com/jmlane/image/76852417">back in the day</a>.</p>
<p>Another is that her idea is an amazing one for producing cool public health applications. One of the major issues with public health is being able to do distributed data collection cheaply, easily, and reproducibly. So I decided to write a little tutorial on how one could use <a href="https://www.google.com/sheets/about/">Google Sheets</a> and R to create a free distributed data collection “app” for public health (or anything else really).</p>
<h3 id="what-you-will-need">What you will need</h3>
<ul>
<li>A Google account and access to <a href="https://www.google.com/sheets/about/">Google Sheets</a></li>
<li><a href="https://www.r-project.org/">R</a> and the <a href="https://github.com/jennybc/googlesheets">googlesheets</a> package.</li>
</ul>
<h3 id="the-app">The “app”</h3>
<p>What we are going to do is collect data in a Google Sheet or sheets. This sheet can be edited by anyone with the link using their computer or a mobile phone. Then we will use the <code class="language-plaintext highlighter-rouge">googlesheets</code> package to pull the data into R and analyze it.</p>
<h3 id="making-the-google-sheet-work-with-googlesheets">Making the Google Sheet work with googlesheets</h3>
<p>After you have a Google account, the first thing to do is go to Google Sheets. I suggest bookmarking this page: https://docs.google.com/spreadsheets/u/0/, which skips the annoying splash screen.</p>
<p>Create a blank sheet and give it an appropriate title for whatever data you will be collecting.</p>
<p>Next, we need to make the sheet <em>public on the web</em> so that the <em>googlesheets</em> package can read it. This is different from the sharing settings you set with the big button on the right. To make the sheet public on the web, go to the “File” menu and select “Publish to the web…”. Like this:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_publishweb.png" alt="" /></p>
<p>then it will ask you if you want to publish the sheet, just hit publish</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_publish.png" alt="" /></p>
<p>Copy the link it gives you and you can use it to read in the Google Sheet. If you want to see all the Google Sheets you can read in, you can load the package and use the <code class="language-plaintext highlighter-rouge">gs_ls</code> function.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">googlesheets</span><span class="p">)</span><span class="w">
</span><span class="n">sheets</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_ls</span><span class="p">()</span><span class="w">
</span><span class="n">sheets</span><span class="p">[</span><span class="m">1</span><span class="p">,]</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## # A tibble: 1 x 10
## sheet_title author perm version updated
## <chr> <chr> <chr> <chr> <time>
## 1 app_example jtleek rw new 2016-08-26 17:48:21
## # ... with 5 more variables: sheet_key <chr>, ws_feed <chr>,
## # alternate <chr>, self <chr>, alt_key <chr>
</code></pre></div></div>
<p>It will pop up a dialog asking you to authorize the <code class="language-plaintext highlighter-rouge">googlesheets</code> package to read from your Google Sheets account. Then you should see a list of spreadsheets you have created.</p>
<p>In my example I created a sheet called “app_example” so I can load the Google Sheet like this:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Identifies the Google Sheet</span><span class="w">
</span><span class="n">example_sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_title</span><span class="p">(</span><span class="s2">"app_example"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Sheet successfully identified: "app_example"
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Reads the data</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_read</span><span class="p">(</span><span class="n">example_sheet</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Accessing worksheet titled 'Sheet1'.
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## No encoding supplied: defaulting to UTF-8.
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">head</span><span class="p">(</span><span class="n">dat</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## # A tibble: 3 x 5
## who_collected at_work person time date
## <chr> <chr> <chr> <chr> <chr>
## 1 jeff no ingo 13:47 08/26/2016
## 2 jeff yes roger 13:47 08/26/2016
## 3 jeff yes brian 13:47 08/26/2016
</code></pre></div></div>
<p>In this case the data I’m collecting is about who is at work right now as I’m writing this post :). But you could collect whatever you want.</p>
<h3 id="distributing-the-data-collection">Distributing the data collection</h3>
<p>Now that you have the data published to the web, you can read it into R with the googlesheets package. Also, anyone with the link will be able to view the Google Sheet. But if you don’t change the sharing settings, you are the only one who can edit the sheet.</p>
<p>This is where you can make your data collection distributed if you want. If you go to the “Share” button, then click on advanced you will get a screen like this and have some options.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_share_advanced.png" alt="" /></p>
<p><em>Private data collection</em></p>
<p>In the example I’m using I haven’t changed the sharing settings, so while you can <em>see</em> the sheet, you can’t edit it. This is nice if you want to collect some data and allow other people to read it, but you don’t want them to edit it.</p>
<p><em>Controlled distributed data collection</em></p>
<p>If you just enter people’s emails then you can open the data collection to just those individuals you have shared the sheet with. Be careful though, if they don’t have Google email addresses, then they get a link which they could share with other people and this could lead to open data collection.</p>
<p><em>Uncontrolled distributed data collection</em></p>
<p>Another option is to click on “Change” next to “Private - Only you can access” and then select “On - Anyone with the link”, which starts out as “Can View”.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_can_view.png" alt="" /></p>
<p>Then you can modify it to say “Can Edit” and hit “Save”. Now anyone who has the link can edit the Google Sheet. This means that you can’t control who will be editing it (careful!) but you can really widely distribute the data collection.</p>
<h3 id="collecting-data">Collecting data</h3>
<p>Once you have distributed the link either to your collaborators or more widely it is time to collect data. This is where I think that the “app” part of this is so cool. You can edit the Google Sheet from a Desktop computer, but if you have the (free!) Google Sheets app for your phone then you can also edit the data on the go. There is even an offline mode if the internet connection isn’t available where you are working (more on this below).</p>
<h3 id="quality-control">Quality control</h3>
<p>One of the major issues with distributed data collection is quality control. If possible you want people to input data using (a) a controlled vocabulary/system and (b) the same controlled vocabulary/system. My suggestion here depends on whether you are using a controlled distributed system or an uncontrolled distributed system.</p>
<p>For the controlled distributed system you are specifically giving access to individual people - you can provide some training or a walk through before giving them access.</p>
<p>For the uncontrolled distributed system you should create a <em>very</em> detailed set of instructions. For example, for my sheet I would create a set of instructions like:</p>
<ol>
<li>Every data point must have a label of who collected it in the <code class="language-plaintext highlighter-rouge">who_collected</code> column. You should pick a username that does not currently appear in the sheet and stick with it. Use all lower case for your username.</li>
<li>You should either report “yes” or “no” in lowercase in the <code class="language-plaintext highlighter-rouge">at_work</code> column.</li>
<li>You should report the name of the person in all lower case in the <code class="language-plaintext highlighter-rouge">person</code> column. You should search and make sure that the person you are reporting on doesn’t appear before introducing a new name. If the name already exists, use the name spelled exactly as it is in the sheet already.</li>
<li>You should report the <code class="language-plaintext highlighter-rouge">time</code> in the format hh:mm on a 24 hour clock in the eastern time zone of the United States.</li>
<li>You should report the <code class="language-plaintext highlighter-rouge">date</code> in the mm/dd/yyyy format.</li>
</ol>
<p>You could be much more detailed depending on the case.</p>
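<p>If you want to check incoming rows against instructions like these programmatically, here is a minimal validation sketch (the column names are the ones from the example sheet above; adapt the checks to your own rules):</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Quick sanity checks on the example sheet read in earlier with gs_read()
dat = gs_read(example_sheet)
all(dat$who_collected == tolower(dat$who_collected))   # usernames in lower case
all(dat$at_work %in% c("yes", "no"))                   # controlled vocabulary
all(grepl("^\\d{2}:\\d{2}$", dat$time))                # times in hh:mm format
all(grepl("^\\d{2}/\\d{2}/\\d{4}$", dat$date))         # dates in mm/dd/yyyy format
</code></pre></div></div>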
<h3 id="offline-editing-and-conflicts">Offline editing and conflicts</h3>
<p>One of the cool things about Google Sheets is that they can even be edited without an internet connection. This is particularly useful if you are collecting data in places where internet connections may be spotty. But that may generate conflicts if you use only one sheet.</p>
<p>There may be different ways to handle this, but one I thought of is to just create one sheet for each person collecting data (if you are using controlled distributed data collection). Then each person only edits the data in their sheet, avoiding potential conflicts if multiple people are editing offline and non-synchronously.</p>
<h3 id="reading-the-data">Reading the data</h3>
<p>Anyone with the link can now read the most up-to-date data with the following simple code.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Identifies the Google Sheet</span><span class="w">
</span><span class="n">example_sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_url</span><span class="p">(</span><span class="s2">"https://docs.google.com/spreadsheets/d/177WyyzWOHGIQ9O5iUY9P9IVwGi7jL3f4XBY4d98CY_o/pubhtml"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Sheet-identifying info appears to be a browser URL.
## googlesheets will attempt to extract sheet key from the URL.
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Putative key: 177WyyzWOHGIQ9O5iUY9P9IVwGi7jL3f4XBY4d98CY_o
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Sheet successfully identified: "app_example"
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Reads the data</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_read</span><span class="p">(</span><span class="n">example_sheet</span><span class="p">,</span><span class="w"> </span><span class="n">ws</span><span class="o">=</span><span class="s2">"Sheet1"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Accessing worksheet titled 'Sheet1'.
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## No encoding supplied: defaulting to UTF-8.
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dat</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## # A tibble: 3 x 5
## who_collected at_work person time date
## <chr> <chr> <chr> <chr> <chr>
## 1 jeff no ingo 13:47 08/26/2016
## 2 jeff yes roger 13:47 08/26/2016
## 3 jeff yes brian 13:47 08/26/2016
</code></pre></div></div>
<p>Here the url is the one I got when I went to the “File” menu and clicked on “Publish to the web…”. The argument <code class="language-plaintext highlighter-rouge">ws</code> in the <code class="language-plaintext highlighter-rouge">gs_read</code> command is the name of the worksheet. If you have multiple sheets assigned to different people, you can read them in one at a time and then merge them together.</p>
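<p>For example, a quick sketch of that merge might look like the following (the worksheet names here are hypothetical, one worksheet per collector):</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Read one worksheet per collector and stack the results into one data frame
collectors = c("jeff", "roger", "brian")   # hypothetical worksheet names
dat_list = lapply(collectors, function(ws) gs_read(example_sheet, ws = ws))
dat = do.call(rbind, dat_list)
</code></pre></div></div>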
<h3 id="conclusion">Conclusion</h3>
<p>So that’s it, it’s pretty simple. But as I gear up to teach advanced data science here at Hopkins I’m thinking a lot about Sean Taylor’s awesome post <a href="http://seanjtaylor.com/post/41463778912/real-scientists-make-their-own-data">Real scientists make their own data</a>.</p>
<p>I think this approach is a super cool/super lightweight system for collecting data either on your own or as a team. As I said I think it could be really useful in public health, but it could also be used for any data collection you want.</p>
Interview with COPSS award winner Nicolai Meinshausen.
2016-08-24T00:00:00+00:00
http://simplystats.github.io/2016/08/24/meinshausen-copss
<p><em>Editor’s Note: We are again pleased to interview the COPSS President’s award winner. The COPSS Award is one of the most prestigious in statistics, sometimes called the Nobel Prize in statistics. This year the award went to Nicolai Meinshausen from ETH Zurich. He is known for his work in causality, high-dimensional statistics, and machine learning. Also see our past COPSS award interviews with <a href="http://simplystatistics.org/2015/08/25/interview-with-copss-award-winner-john-storey/">John Storey</a> and <a href="http://simplystatistics.org/2014/08/18/interview-with-copss-award-winner-martin-wainright/">Martin Wainwright</a>.</em></p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/meinshausen.png" alt="Nicolai Meinshausen" /></p>
<h2 id="do-you-consider-yourself-to-be-a-statistician-data-scientist-machine-learner-or-something-else">Do you consider yourself to be a statistician, data scientist, machine learner, or something else?</h2>
<p>Perhaps all of the above. If you forced me to pick one, then statistician but I hope we will soon come to a point where these distinctions do not matter much any more.</p>
<h2 id="how-did-you-find-out-you-had-won-the-copss-award">How did you find out you had won the COPSS award?</h2>
<p>Jeremy Taylor called me. I know I am expected to say I did not expect it but that was indeed the case and it was a genuine surprise.</p>
<h2 id="how-do-you-see-the-fields-of-causal-inference-and-high-dimensional-statistics-merging">How do you see the fields of causal inference and high-dimensional statistics merging?</h2>
<p>Causal inference is already very challenging in the low-dimensional case - if understood as data for which the number of observations exceeds
the number of variables. There are commonalities between high-dimensional statistics and the subfield of causal discovery, however, as we try to recover a sparse underlying structure from data in both cases
(say when trying to reconstruct a gene network from
observational and intervention data). The interpretations are just slightly different. A further difference is the implicit optimization. High-dimensional estimators can often be framed as convex optimization problems and the question is whether causal discovery can or should be
pushed in this direction as well.</p>
<h2 id="can-you-explain-a-little-about-how-you-can-infer-causal-effects-from-inhomogeneous-data">Can you explain a little about how you can infer causal effects from inhomogeneous data?</h2>
<p>Why do we want a causal model in the first place? In most cases the benefit of a causal over a regression model
is that the predictions of a causal model continue to be valid even if we intervene on the variables we use for prediction.</p>
<p>The inference we proposed turns this around and is looking for all models that are invariant in the sense that they give the same prediction accuracy across a number of different settings or environments. If we just have observational data, then this invariance
holds for all models but if the data are inhomogeneous, certain models can be discarded since they work better in one environment than in another and can thus not be causal. If all models that show invariance use a certain variable, we can claim that the variable in question
has a causal effect (while controlling type I error rates) and construct confidence intervals for the strength of the effect.</p>
<h2 id="you-have-worked-on-studying-the-effects-of-climate-change---do-you-think-statisticians-can-play-an-important-role-in-this-debate">You have worked on studying the effects of climate change - do you think statisticians can play an important role in this debate?</h2>
<p>To a certain extent. I have worked on several projects with physicists and the general caveat is that physicists are in general quite advanced in their methodology already and have quite a good understanding of the relevant statistical concepts. Biology is thus maybe a field where even more external input is required. Then again, it saves one from having to calculate t-tests in collaborations with physicists, and only the interesting and challenging problems are left.</p>
<h2 id="what-advice-would-you-give-young-statisticians-getting-into-the-discipline-right-now">What advice would you give young statisticians getting into the discipline right now?</h2>
<p>First I would say that they have made a good choice since these are interesting times for the field with many challenging and relevant problems still open and unsolved (but not completely out of reach either).
I think it’s important to keep an open mind and read as much literature as possible from neighbouring fields. My personal experience has been that it is very beneficial to get involved in some scientific collaborations.</p>
<h2 id="what-sorts-of-things-is-your-group-working-on-these-days">What sorts of things is your group working on these days?</h2>
<p>Two general themes: the first is what people would call more classical machine learning. For example, how can we detect interactions in large-scale datasets in sub-quadratic time? Secondly, we are trying to extend the invariance approach to causal inference
to more general settings, for example allowing for nonlinearities and hidden variables while at the same time
improving the computational aspects.</p>
A Simple Explanation for the Replication Crisis in Science
2016-08-24T00:00:00+00:00
http://simplystats.github.io/2016/08/24/replication-crisis
<p>By now, you’ve probably heard of the <a href="https://en.wikipedia.org/wiki/Replication_crisis">replication crisis in science</a>. In summary, many conclusions from experiments done in a variety of fields have been found not to hold water when followed up in subsequent experiments. There are now any number of famous examples, particularly from the fields of <a href="http://science.sciencemag.org/content/349/6251/aac4716">psychology</a> and <a href="http://biorxiv.org/content/early/2016/04/27/050575">clinical medicine</a>, that show that the rate of replication of findings is less than the expected rate.</p>
<p>The reasons proposed for this crisis are wide ranging, but they typically center on the preference for “novel” findings in science and the pressure on investigators (especially new ones) to “publish or perish”. My favorite reason places the blame for the entire crisis on <a href="http://www.nature.com/news/psychology-journal-bans-p-values-1.17001">p-values</a>.</p>
<p>I think to develop a better understanding of why there is a “crisis”, we need to step back and look across different fields of study. There is one key question you should be asking yourself: <em>Is the replication crisis evenly distributed across different scientific disciplines?</em> My reading of the literature would suggest “no”, but the reasons attributed to the replication crisis are common to all scientists in every field (i.e. novel findings, publishing, etc.). So why would there be any heterogeneity?</p>
<h2 id="an-aside-on-replication-and-reproducibility">An Aside on Replication and Reproducibility</h2>
<p>As Lorena Barba recently <a href="https://twitter.com/LorenaABarba/status/764836487212957696">pointed</a> <a href="https://github.com/ReScience/ReScience-article/issues/5#issuecomment-241232791">out</a>, there can be tremendous confusion over the use of the words <strong>replication</strong> and <strong>reproducibility</strong>, so it’s best that we sort that out now. Here’s how I use both words:</p>
<ul>
<li>
<p><em>replication</em>: This is the act of repeating an entire study, independently of the original investigator without the use of original data (but generally using the same methods).</p>
</li>
<li>
<p><em>reproducibility</em>: A study is reproducible if you can take the original data and the <em>computer code</em> used to analyze the data and reproduce all of the numerical findings from the study. This may initially sound like a trivial task, but experience has shown that it’s not always easy to achieve this seemingly minimal standard.</p>
</li>
</ul>
<p>For more precise definitions of what I mean by these terms, you can take a look at <a href="http://biorxiv.org/content/early/2016/07/29/066803">my recent paper with Jeff Leek and Prasad Patil</a>.</p>
<p>One key distinction between replication and reproducibility is that with replication, there is no need to trust any of the original findings. In fact, you may be attempting to replicate a study <em>because</em> you do not trust the findings of the original study. A recent example is the creation of stem cells from ordinary cells, a finding so extraordinary that it led several laboratories to quickly try to replicate the study. Ultimately, <a href="http://www.nature.com/nature/journal/v525/n7570/full/nature15513.html">seven separate laboratories could not replicate the finding</a> and the original study was retracted.</p>
<h2 id="astronomy-and-epidemiology">Astronomy and Epidemiology</h2>
<p>What do the fields of astronomy and epidemiology have in common? You might think nothing. Those two departments are often not even on the same campus at most universities! However, they have at least one common element, which is that the things they study are generally reluctant to be controlled by human beings. As a result, both astronomers and epidemiologists rely heavily on one tool: the <strong>observational study</strong>.
Much has been written about observational studies of late, and I’ll spare you the literature search by saying that the bottom line is they can’t be trusted (particularly observational studies that have not been pre-registered!).</p>
<p>But that’s fine—we have a method for dealing with things we don’t trust: It’s called replication. Epidemiologists actually codified their understanding of when they believe a causal claim (see <a href="https://en.wikipedia.org/wiki/Bradford_Hill_criteria">Hill’s Criteria</a>), which is typically only after a claim has been replicated in numerous studies in a variety of settings. My understanding is that astronomers have a similar mentality as well—no single study will result in anyone believing something new about the universe. Rather, findings need to be replicated using different approaches, instruments, etc.</p>
<p>The key point here is that in both astronomy and epidemiology expectations are low with respect to individual studies. <strong>It’s difficult to have a replication crisis when nobody believes the findings in the first place</strong>. Investigators have a culture of distrusting individual one-off findings until they have been replicated numerous times. In my own area of research, the idea that ambient air pollution causes health problems was difficult to believe for decades, until we started seeing the same associations appear in numerous studies conducted all around the world. It’s hard to imagine any single study “proving” that connection, no matter how well it was conducted.</p>
<h2 id="theory-and-experimentation-in-science">Theory and Experimentation in Science</h2>
<p>I’ve already described the primary mode of investigation in astronomy and epidemiology, but there are of course other methods in other fields. One large category of methods includes the <strong>controlled experiment</strong>. Controlled experiments come in a variety of forms, whether they are laboratory experiments on cells or randomized clinical trials with humans, all of them involve intentional manipulation of some factor by the investigator in order to observe how such manipulation affects an outcome. In clinical medicine and the social sciences, controlled experiments are considered the “gold standard” of evidence. Meta-analyses and literature summaries generally weight publications with controlled experiments more highly than other approaches like observational studies.</p>
<p>The other aspect I want to look at here is whether a field has a strong basic theoretical foundation. The idea here is that some fields, like say physics, have a strong set of basic theories whose predictions have been consistently validated over time. Other fields, like medicine, lack even the most rudimentary theories that can be used to make basic predictions. Granted, the distinction between fields with or without “basic theory” is a bit arbitrary on my part, but I think it’s fair to say that different fields of study fall on a spectrum in terms of how much basic theory they can rely on.</p>
<p>Given the theoretical nature of different fields and the primary mode of investigation, we can develop the following crude 2x2 table, in which I’ve inserted some representative fields of study.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/replication_2x2.png" alt="Theory vs. Experimentation in Science" /></p>
<p>My primary contention here is</p>
<blockquote>
<p>The replication crisis in science is concentrated in areas where (1) there is a tradition of controlled experimentation and (2) there is relatively little basic theory underpinning the field.</p>
</blockquote>
<p>Further, in general, I don’t believe that there’s anything wrong with the people tirelessly working in the upper right box. At least, I don’t think there’s anything <em>more</em> wrong with them compared to the good people working in the other three boxes.</p>
<p>In case anyone is wondering where the state of clinical science is relative to, say, particle physics with respect to basic theory, I only point you to the web site for the <a href="https://nccih.nih.gov">National Center for Complementary and Integrative Health</a>. This is essentially a government agency with a budget of $124 million dedicated to <a href="http://www.forbes.com/sites/stevensalzberg/2011/08/29/nihs-350000-questionnaire/#1ee73d4d4fc6">advancing pseudoscience</a>. This is the state of “basic theory” in clinical medicine.</p>
<h2 id="the-bottom-line">The Bottom Line</h2>
<p>The people working in the upper right box have an uphill battle for at least two reasons</p>
<ol>
<li>The lack of strong basic theory makes it more difficult to guide investigation, leading to wider ranging efforts that may be less likely to replicate.</li>
<li>The tradition of controlled experimentation places <em>high expectations</em> that research produced here is “valid”. I mean, hey, they’re using the gold standard of evidence, right?</li>
</ol>
<p>The confluence of these two factors leads to a much greater disappointment when findings from these fields do not replicate. This leads me to believe that <strong>the replication crisis in science is largely attributable to a mismatch between our expectations of how often findings should replicate and how difficult it is to actually discover true findings in certain fields</strong>. Further, the reliance on controlled experiments in certain fields has lulled us into believing that we can place tremendous weight on a small number of studies. Ultimately, when someone asks, “Why <em>haven’t</em> we cured cancer yet?” the answer is “Because it’s hard”.</p>
<h2 id="the-silver-lining">The Silver Lining</h2>
<p>It’s important to remember that, as my colleague Rafa Irizarry <a href="http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/">pointed out</a>, findings from many of the fields in the upper right box, especially clinical medicine, can have tremendous positive impacts on our lives when they do work out. Rafa says</p>
<blockquote>
<p>…I argue that the rate of discoveries is higher in biomedical research than in physics. But, to achieve this higher true positive rate, biomedical research has to tolerate a higher false positive rate.</p>
</blockquote>
<p>It is certainly possible to reduce the rate of false positives—one way would be to do no experiments at all! But is that what we want? Would that most benefit us as a society?</p>
<h2 id="the-takeaway">The Takeaway</h2>
<p>What to do? Here are a few thoughts:</p>
<ul>
<li>We need to stop thinking that any single study is definitive or confirmatory, no matter if it was a controlled experiment or not. Science is always a cumulative business, and the value of a given study should be understood in the context of what came before it.</li>
<li>We have to recognize that some areas are more difficult to study and are less mature than other areas because of the lack of basic theory to guide us.</li>
<li>We need to think about what the tradeoffs are for discovering many things that may not pan out relative to discovering only a few things. What are we willing to accept in a given field? This is a discussion that I’ve not seen much of.</li>
<li>As Rafa pointed out in his article, we can definitely focus on things that make science better for everyone (better methods, rigorous designs, etc.).</li>
</ul>
A meta list of what to do at JSM 2016
2016-07-30T00:00:00+00:00
http://simplystats.github.io/2016/07/30/jsm2016
<p>I’m going to be heading out tomorrow for JSM 2016. If you want to catch up I’ll be presenting in the 6-8PM poster session on <a href="https://www.amstat.org/meetings/jsm/2016/onlineprogram/ActivityDetails.cfm?SessionID=213079">The Extraordinary Power of Data</a> on Sunday and on <a href="https://www.amstat.org/meetings/jsm/2016/onlineprogram/ActivityDetails.cfm?SessionID=212543">data visualization (and other things) in MOOCs</a> at 8:30am on Monday. Here is a little sneak preview, the first slide from my talk:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/firstslide.jpg" alt="Was too scared to use GIFs" /></p>
<p>This year I am so excited that other people have done all the work of going through the program for me and picking out what talks to see. Here is a list of lists.</p>
<ul>
<li><a href="https://kbroman.wordpress.com/2016/07/27/my-jsm-2016-itinerary/">Karl Broman</a> - if you like open source software, data viz, and genomics.</li>
<li><a href="https://blog.rstudio.org/2016/07/19/discover-r-and-rstudio-at-jsm-2016-chicago/">Rstudio</a> - if you like Rstudio</li>
<li><a href="http://citizen-statistician.org/2016/07/29/my-jsm2016-itinerary/">Mine Cetinkaya Rundel</a> - if you like stat ed, data science, data viz, and data journalism.</li>
<li><a href="https://twitter.com/DrJWolfson/status/758990552754827264">Julian Wolfson</a> - if you like missing sessions and guilt.</li>
<li><a href="https://github.com/stephaniehicks/classroomNotes/blob/master/conferences/JSM2016.md">Stephanie Hicks</a> - if you like lots of sessions and can’t make up your mind (also stat genomics, open source software, stat computing, stats for social good…)</li>
</ul>
<p>If you know about more lists, please feel free to tweet at me or send pull requests.</p>
<p>I also saw the materials for this <a href="https://github.com/simonmunzert/rscraping-jsm-2016">awesome tutorial on webscraping</a> that I’m sorry I’ll miss.</p>
The relativity of raw data
2016-07-20T00:00:00+00:00
http://simplystats.github.io/2016/07/20/relativity-raw-data
<p>“Raw data” is one of those terms that everyone in statistics and data science uses but no one defines. For example, we all agree that we should be able to recreate results in scientific papers from the raw data and the code for that paper.</p>
<blockquote>
<p>But what do we mean when we say raw data?</p>
</blockquote>
<p>When working with collaborators or students I often find myself saying: could you just give me the raw data so I can do the normalization or processing myself? To give a concrete example, I work in the analysis of data from <a href="http://www.nature.com/nbt/journal/v26/n10/full/nbt1486.html">high-throughput genomic sequencing experiments</a>.</p>
<p>These experiments produce data by breaking up genomic molecules into short fragments of DNA, then reading off parts of those fragments to generate “reads”, usually 100 to 200 letters long per read. But the reads are just puzzle pieces that need to be fit back together and then quantified to produce measurements of DNA variation or gene expression abundance.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/sequencing.png" alt="High throughput sequencing" /></p>
<p><a href="http://cbcb.umd.edu/~hcorrada/CFG/lectures/lect22_seqIntro/seqIntro.pdf">Image from Hector Corrata Bravo’s lecture notes</a></p>
<p>When I say “raw data” when talking to a collaborator I mean the reads that are reported from the sequencing machine. To me that is the rawest form of the data I will look at. But to generate those reads the sequencing machine first (1) created a set of images for each letter in the sequence of reads, (2) measured the color at the spots on that image to get the quantitative measurement of which letter, and (3) calculated which letter was there with a confidence measure. The raw data I ask for only includes the confidence measure and the sequence of letters itself, but ignores the images and the colors extracted from them (steps 1 and 2).</p>
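<p>For readers who have not seen this kind of data, here is a rough sketch of what that rawest-to-me form looks like. Reads typically arrive in FASTQ files, where each read occupies four lines: an identifier, the sequence of letters, a separator, and a string encoding the per-letter confidence. The file name below is made up.</p>
<pre><code class="language-r">
lines <- readLines("reads.fastq")                 # hypothetical file of reads
ids   <- lines[seq(1, length(lines), by = 4)]     # read identifiers
seqs  <- lines[seq(2, length(lines), by = 4)]     # the letters (A, C, G, T)
quals <- lines[seq(4, length(lines), by = 4)]     # per-letter confidence strings
# Phred-style quality scores are commonly stored as ASCII characters offset by 33
phred <- lapply(quals, function(q) utf8ToInt(q) - 33L)
</code></pre>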
<p>So to me the “raw data” is the files of reads. But to the people who make the sequencing machine, the raw data may be the images or the color data. To my collaborator the raw data may be the quantitative measurements I calculate from the reads. When thinking about this I realized an important characteristic of raw data.</p>
<blockquote>
<p>Raw data is relative to your reference frame.</p>
</blockquote>
<p>In other words, the raw data is raw to <em>you</em> if you have done no processing, manipulation, coding, or analysis of the data; the file you received from the person before you is untouched. But it may not be the <em>rawest</em> version of the data. The person who gave you the raw data may have done some computations. They have a different “raw data set”.</p>
<p>The implication for reproducibility and replicability is that we need a “chain of custody”, just like with evidence collected by the police. As long as each person keeps a copy and record of the “raw data” to them, you can trace the provenance of the data back to the original source.</p>
Not So Standard Deviations Episode 19 - Divide by n-1, or n-2, or Whatever
2016-07-18T00:00:00+00:00
http://simplystats.github.io/2016/07/18/nssd-episode-19
<p>Hilary and I talk about statistical software in fMRI analyses, the differences between software packages when testing differences in proportions (a must listen!), and a preview of JSM 2016.</p>
<p>Also, Hilary and I have just published a new book, <a href="https://leanpub.com/conversationsondatascience?utm_source=SimplyStats&utm_campaign=NSSD&utm_medium=BlogPost">Conversations on Data Science</a>, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes.</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show Notes:</p>
<ul>
<li>
<p><a href="http://www.theregister.co.uk/2016/07/03/mri_software_bugs_could_upend_years_of_research/?mt=1467760452040">fMRI bugs could upend years of research</a></p>
</li>
<li>
<p><a href="http://www.pnas.org/content/113/28/7900.full">Eklund et al. PNAS Paper</a></p>
</li>
<li>
<p><a href="https://www.amstat.org/meetings/jsm/2016/onlineprogram/index.cfm">JSM 2016 Program</a></p>
</li>
<li>
<p><a href="https://leanpub.com/conversationsondatascience">Conversations on Data Science</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-19-divide-by-n-1-or-n-2-or-whatever">Download the audio for this episode</a>.</p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/274214566&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Tuesday update
2016-07-11T00:00:00+00:00
http://simplystats.github.io/2016/07/11/tuesday-update
<h2 id="it-might-all-be-wrong">It Might All Be Wrong</h2>
<p>Tom Nichols and colleagues have published a paper on the software used to analyze fMRI data:</p>
<blockquote>
<p>Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.</p>
</blockquote>
<h2 id="criminal-justice-forecasts">Criminal Justice Forecasts</h2>
<p>The <a href="http://www.theatlantic.com/technology/archive/2016/06/when-algorithms-take-the-stand/489566/">ongoing discussion</a> over the use of prediction algorithms in the criminal justice system reminds me a bit of the introduction of DNA evidence decades ago. Ultimately, there is a technology that few people truly understand and there are questions as to whether the information they provide is fair or accurate.</p>
<h2 id="shameless-promotion">Shameless Promotion</h2>
<p>I have a <a href="https://leanpub.com/conversationsondatascience">new book</a> coming out with Hilary Parker, based on our <em>Not So Standard Deviations</em> podcast. Sign up to be notified of its release (which should be Real Soon Now).</p>
Not So Standard Deviations Episode 18 - Back on Planet Earth
2016-07-05T00:00:00+00:00
http://simplystats.github.io/2016/07/05/nssd-episode-18
<p>With Hilary fresh from Use R! 2016, Hilary and I discuss some of the highlights from the conference. Also, some followup about a previous Free Advertising and the NSSD drinking game.</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://www.vanityfair.com/hollywood/2016/06/jennifer-lawrence-theranos-elizabeth-holmes">Theranos movie with Jennifer Lawrence and Adam McKay</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Snowden_(film)">Snowden movie</a></p>
</li>
<li>
<p><a href="http://www.npr.org/2016/06/19/482514949/welcome-to-mongolias-new-postal-system-an-atlas-of-random-words">What3Words being used in Mongolia</a></p>
</li>
<li>
<p><a href="https://github.com/jimhester/lintr">lintr package</a></p>
</li>
<li>
<p><a href="https://youtu.be/dhh8Ao4yweQ">“The Electronic Coach” with Don Knuth</a></p>
</li>
<li>
<p><a href="http://alyssafrazee.com/gender-and-github-code.html">Exploring the data on gender and GitHub repo ownership</a></p>
</li>
<li>
<p><a href="https://blog.codinghorror.com/falling-into-the-pit-of-success/">Jeff Atwood “Falling Into the Pit of Success”</a></p>
</li>
<li>
<p><a href="https://research.googleblog.com/2014/08/doing-data-science-with-colaboratory.html">Google coLaboratory</a></p>
</li>
<li>
<p><a href="https://www.stickermule.com/marketplace/12936-number-rcatladies">#rcatladies stickers</a></p>
</li>
<li>
<p><a href="https://twitter.com/astrokatie/status/745529809669787649">Katie Mack time-lapse video</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-18-back-on-planet-earth">Download the audio for this episode</a>.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/272064450&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Tuesday Update
2016-06-28T00:00:00+00:00
http://simplystats.github.io/2016/06/28/tuesday-update
<h2 id="if-you-werent-sick-of-theranos-yet">If you weren’t sick of Theranos yet….</h2>
<p>Looks like there will be a movie version of the <a href="http://simplystatistics.org/2016/05/23/update-on-theranos/">Theranos saga</a> which, as far as I can tell, isn’t over yet, but no matter. It will be done by Adam McKay, the writer-director of The Big Short (excellent film), and will star Jennifer Lawrence as Elizabeth Holmes. From <a href="http://www.vanityfair.com/hollywood/2016/06/jennifer-lawrence-theranos-elizabeth-holmes">Vanity Fair</a>:</p>
<blockquote>
<p>Legendary Pictures snapped up rights to the hot-button biopic for a reported $3 million Thursday evening, after outbidding and outlasting a swarm of competition from Warner Bros., Twentieth Century Fox, STX Entertainment, Regency Enterprises, Cross Creek, Amazon Studios, AG Capital, the Weinstein Company, and, in the penultimate stretch, Paramount, among other studio suitors.</p>
</blockquote>
<blockquote>
<p>Based on a book proposal by two-time Pulitzer Prize-winning journalist John Carreyrou titled Bad Blood: Secrets and Lies in Silicon Valley, the project (reported to be in the $40 million to $50 million budget range) has made the rounds to almost every studio in town. It’s been personally pitched by McKay, who won an Oscar for best adapted screenplay for last year’s rollicking financial meltdown procedural The Big Short.</p>
</blockquote>
<p>Frankly, I think we all know how this movie will end.</p>
<h2 id="the-people-vs-oj-simpson-vsstatistics">The People vs. OJ Simpson vs….Statistics</h2>
<p>I’m in the middle of watching <a href="https://en.wikipedia.org/wiki/The_People_v._O._J._Simpson:_American_Crime_Story">The People vs. OJ Simpson</a> and so far it is fantastic—I highly recommend it. One thing that is not represented in the show is the important role that statistics played in the trial. The trial was just in the early days of using DNA as evidence in criminal trials and there were many questions about how likely it was to find DNA matches in blood.</p>
<p>Terry Speed ended up testifying for the defense (Simpson) and in this <a href="http://www.statisticsviews.com/details/feature/4915471/To-some-statisticians-a-number-is-a-number-but-to-me-a-number-is-packed-with-his.html">nice interview</a>, he explains how that came to be:</p>
<blockquote>
<p>At the beginning of the Simpson trial, there was going to be a pre-trial hearing and experts from both sides would argue in front of the judge as to what approaches should be accepted. Other pre-trial activities dragged on, and the one on DNA forensics was eventually scrapped. The DNA experts, including me were then asked whether they wanted to give evidence for the prosecution or defence, or leave. I did not initially plan to join the defence team, but wished to express my point of view in what was more or less a scientific environment before the trial started, but when the pre-trial DNA hearing was scrapped, I decided that I had no choice but to express my views in court on behalf of the defence, which I did.</p>
</blockquote>
<p>The full interview is well worth the read.</p>
<h2 id="ai-is-the-residual">AI is the residual</h2>
<p>I just recently found out about the <a href="https://en.m.wikipedia.org/wiki/AI_effect">AI effect</a> which I thought was interesting. Basically, “AI” is whatever can’t be explained, or in other words, the residuals of machine learning.</p>
A Year at Stack Overflow
2016-06-28T00:00:00+00:00
http://simplystats.github.io/2016/06/28/stack-overflow-drob
<p>David Robinson (<a href="https://twitter.com/drob">@drob</a>) has a great post on his blog about his <a href="http://varianceexplained.org/r/year_data_scientist/">first year as a data scientist at Stack Overflow</a>. This section in particular stood out for me:</p>
<blockquote>
<p>I like using R to learn interesting things about our data, but my longer term goal is to make it easy for any of our engineers to do so….Towards this goal, I’ve been focusing on building reliable tools and frameworks that people can apply to a variety of problems, rather than “one-off” analysis scripts. (There’s an awesome post by Jeff Magnusson at StitchFix about some of these general challenges). My approach has been building internal R packages, similar to AirBnb’s strategy (though our data team is quite a bit younger and smaller than theirs). These internal packages can query databases and parse our internal APIs, including making various security and infrastructure issues invisible to the user.</p>
</blockquote>
<p>The world needs an army of David Robinsons.</p>
Ultimate AI battle - Apple vs. Google
2016-06-14T00:00:00+00:00
http://simplystats.github.io/2016/06/14/ultimate-ai-battle
<p>Yesterday, Apple launched its Worldwide Developer’s Conference (WWDC) and had its public keynote address. While many new things were announced, the one thing that caught my eye was the <a href="http://go.theinformation.com/HnOAdA6DQ7g">dramatic expansion</a> of Apple’s use of artificial intelligence (AI) tools. I talked a bit about AI with Hilary Parker on the latest <a href="http://simplystatistics.org/2016/06/09/nssd-episode-17/"><em>Not So Standard Deviations</em></a>, particularly in the context of Amazon’s Echo/Alexa, and I think it’s definitely going to be an area of intense competition between the major tech companies.</p>
<p>Pretty much every major tech player is involved in AI—Google, Facebook, Amazon, Apple, Microsoft—the list goes on. Recently, <a href="https://marco.org/2016/05/21/avoiding-blackberrys-fate">some commentators</a> <a href="https://stratechery.com/2015/tim-cooks-unfair-and-unrealistic-privacy-speech-strategy-credits-the-privacy-priority-problem/">have suggested</a> that Apple in particular will never catch up with the likes of Google with respect to AI because of Apple’s strict stance on privacy and unwillingness to gather/aggregate data from all its users. However, yesterday at WWDC, Apple revealed a few clues about what it was up to in the AI world.</p>
<p>First, Apple mentioned deep learning more than a few times, including specifically calling out its use of <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">LSTM</a> in its Messages app and facial recognition in its Photos app. Previously, Apple had been rumored to be applying deep learning to its <a href="http://go.theinformation.com/4Z2WhEs9_Nc">Siri assistant and its fingerprint sensor</a>. At WWDC, Craig Federighi stressed Apple’s continued focus on privacy and how Apple does not need to develop “user profiles” server-side, but rather does most computation on-device (in this case, on the iPhone).</p>
<p>However, it can’t be that Apple does all its deep learning computation on the iPhone. These models tend to be enormous and take advantage of reams of data that can only be reasonably processed server-side. Unfortunately, because most companies (Apple in particular) release few details about what they do, we may never know how this works. But we can definitely speculate!</p>
<h2 id="apple-vs-google">Apple vs. Google</h2>
<p>I think the Apple/Google dichotomy provides an interesting opportunity to talk about how models can be learned using data in different ways. There are two approaches being represented here by Apple and Google:</p>
<ul>
<li><strong>Google way</strong> - Collect lots of data from users and store them on a server in the Googleplex somewhere. Then use that data to fit an enormous model that can predict when you’ve taken a picture of a cat. As users generate more data, bring that data back to the Googleplex and update/refine the model.</li>
<li><strong>Apple way</strong> - Build a “starter model” in the Apple <a href="http://9to5mac.com/2015/10/05/spaceship-campus-2-drone-video-october/">Mothership</a>. As users generate data on their phones, bring the model to the phone and update the model using just their data. Bring the updated model back to the Apple Mothership and leave the user’s data on the phone.</li>
</ul>
<p>Perhaps the easiest way to understand this difference is with the arithmetic mean, the simplest “model” there is. Suppose you have a bunch of users out there and you want to compute the average of some attribute that they have on their phones (or whatever device). The first approach would be to get all that data and compute the mean in the usual way.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/googleway.png" alt="Google way" /></p>
<p>Once all the data is in the Googleplex, we can just use the formula</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Googlemean.png" alt="Google mean" /></p>
<p>I’ll call this the “Google mean” because it requires that you get all the data X<sub>1</sub> through X<sub>n</sub>, then sum them up and divide by n. Here, each of the X<sub>i</sub>’s represents the ith user’s data. The general principle here is to gather all the data and then estimate the model parameters server-side.</p>
<p>What if you didn’t want to gather everyone’s data centrally? Can you still compute the mean?</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/appleway.png" alt="Apple way" /></p>
<p>Yes, because there’s a nice recurrence formula for the mean:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Applemean.png" alt="Apple mean" /></p>
<p>We can call this the “Apple mean”. With this strategy, we can send our current estimate of the mean to each user, update our estimate by taking the weighted average of the old value and the new value, and then move on to the next user. Here, you send the model parameters out to the users, update those parameters and then bring the parameters back.</p>
<p>Which method is better? Well, in this case, both give you the same answer. In general, for linear models (like the mean), you can usually rework the formulas to build out either “whole data” (Google) approaches or “streaming” (Apple) approaches and get pretty much the same answer. But for non-linear models, it’s not so simple and you usually cannot achieve this kind of equivalence.</p>
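<p>Here is a small sketch in R, with made-up numbers, showing the two routes to the same answer: the “whole data” calculation and the streaming update described above.</p>
<pre><code class="language-r">
x <- c(3.1, 4.7, 2.2, 5.9, 4.4)      # hypothetical values, one per user

# "Google mean": gather everything in one place, then average
google_mean <- sum(x) / length(x)

# "Apple mean": carry the current estimate from user to user,
# folding in each new value as a weighted average of old and new
apple_mean <- 0
for (n in seq_along(x)) {
  apple_mean <- apple_mean + (x[n] - apple_mean) / n
}

all.equal(google_mean, apple_mean)   # TRUE: same answer, different data flow
</code></pre>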
<h2 id="clients-and-servers">Clients and Servers</h2>
<p>Balancing how much work is done on a server and how much is done on the client is an age-old computing problem and, over time, the balance of work between client and server seems to shift back and forth like a pendulum. When I was in grad school, we had so-called “dumb terminals” that were basically a screen that you used to login to the server. Today, I use my laptop for computing/work and that’s it. But I use the cloud for many other tasks.</p>
<p>The Apple approach definitely requires a “fatter” client because the work of integrating current model parameters with new user data has to happen on the phone. With the Google approach, all the phone has to do is be able to collect the data and send it over the network to Google.</p>
<p>The Apple approach is also closely related to what my colleagues <a href="http://www.biostat.jhsph.edu/~mlindqui/">Martin Lindquist</a> and <a href="http://www.bcaffo.com">Brian Caffo</a> refer to as “fusion science”, whereby Big Data and “Small Data” can be fused together via models to improve inference, but without ever having to actually combine the data. In a Bayesian context, you might think of the Big Data as making up the prior distribution and the Small Data as the likelihood. The Small Data can be used to update the model parameters and produce the posterior distribution, after which the Small Data can be thrown out.</p>
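<p>A toy version of that fusion idea, under a normal-normal model with made-up numbers: the Big Data summary plays the role of the prior, the on-device Small Data plays the role of the likelihood, and only the updated parameters need to leave the phone.</p>
<pre><code class="language-r">
prior_mean <- 10                  # hypothetical "Big Data" summary (the prior)
prior_var  <- 4
x <- c(11.2, 9.8, 10.9)           # hypothetical on-device "Small Data"
sigma2 <- 1                       # assumed known measurement variance

# Standard conjugate update: precisions add, means are precision-weighted
post_var  <- 1 / (1 / prior_var + length(x) / sigma2)
post_mean <- post_var * (prior_mean / prior_var + sum(x) / sigma2)

c(post_mean, post_var)            # only these summaries leave the device
</code></pre>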
<h2 id="and-the-winner-is">And the Winner is…</h2>
<p>It’s not clear to me which approach is better in terms of building a better model for prediction or inference. Sadly, we may never have enough details to find out, and will only be able to evaluate which approach is better by the performance of the systems in the marketplace. But perhaps that’s the way things should be evaluated in this case?</p>
Good list of good books
2016-06-13T00:00:00+00:00
http://simplystats.github.io/2016/06/13/good-books
<p>The MultiThreaded blog over at Stitch Fix (hat tip to Hilary Parker)
has posted a <a href="http://multithreaded.stitchfix.com/blog/2016/06/09/ds-books/">really nice list of data science books</a> (disclosure: one
of <a href="https://leanpub.com/artofdatascience/">my books</a> is on the list).</p>
<blockquote>
<p>We’ve queried our data science team for some of their favorite data science books. This list is by no means exhaustive, but should keep any data scientist/engineer new or old learning and entertained for many an evening.</p>
</blockquote>
<p>Enjoy!</p>
Not So Standard Deviations Episode 17 - Diurnal High Variance
2016-06-09T00:00:00+00:00
http://simplystats.github.io/2016/06/09/nssd-episode-17
<p>Hilary and I talk about Amazon Echo and Alexa as AI as a service, the COMPAS algorithm, criminal justice forecasts, and whether algorithms can introduce or remove bias (or both).</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/">In Two Moves, AlphaGo and Lee Sedol Redefined the Future</a></p>
</li>
<li>
<p><a href="http://qz.com/639952/googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/">Google’s AI won the game Go by defying millennia of basic human instinct</a></p>
</li>
<li>
<p><a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And it’s Biased Against Blacks</a></p>
</li>
<li>
<p><a href="https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm">ProPublica analysis of COMPAS</a></p>
</li>
<li>
<p><a href="http://www.amazon.com/Criminal-Justice-Forecasts-Risk-SpringerBriefs/dp/1461430844?ie=UTF8&*Version*=1&*entries*=0">Richard Berk’s <em>Criminal Justice Forecasts of Risk</em></a></p>
</li>
<li>
<p><a href="http://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815">Cathy O’Neill’s <em>Weapons of Math Destruction</em></a></p>
</li>
<li>
<p><a href="https://mathbabe.org/2016/04/07/ill-stop-calling-algorithms-racist-when-you-stop-anthropomorphizing-ai/">I’ll stop calling algorithms racist when you stop anthropomorphizing AI</a></p>
</li>
<li>
<p><a href="https://cran.r-project.org/web/packages/rmsfact/index.html">RMS Fact package</a></p>
</li>
<li>
<p><a href="http://user2016.org">Use R! 2016</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-17-diurnal-high-variance">Download the audio for this episode.</a></p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/268232081&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Defining success - Four secrets of a successful data science experiment
2016-06-03T00:00:00+00:00
http://simplystats.github.io/2016/06/03/defining-success
<p><em>Editor’s note: This post is excerpted from the book <a href="https://leanpub.com/eds">Executive Data Science: A Guide to Training and Managing the Best Data Scientists</a>, written by myself, Brian Caffo, and Jeff Leek. This particular section was written by Brian Caffo.</em></p>
<p>Defining success is a crucial part of managing a data science experiment. Of course, success is often context specific. However, some aspects of success are general enough to merit discussion. A list of hallmarks of success includes:</p>
<ol>
<li>New knowledge is created.</li>
<li>Decisions or policies are made based on the outcome of the experiment.</li>
<li>A report, presentation, or app with impact is created.</li>
<li>It is learned that the data can’t answer the question being asked of it.</li>
</ol>
<p>Some more negative outcomes include: decisions being made that disregard clear evidence from the data; equivocal results that do not shed light in one direction or another; and uncertainty that prevents new knowledge from being created.</p>
<p>Let’s discuss some of the successful outcomes first.</p>
<p>New knowledge seems ideal in many cases (especially since we are academics), but new knowledge doesn’t necessarily mean that it’s important. If this new knowledge produces actionable decisions or policies, that’s even better. The idea of having evidence-based policy, while perhaps newer than the analogous evidence-based medicine movement that has transformed medical practice, has the potential to similarly transform public policy. Finally, it is of course ideal if our data science products have a great (positive) impact on an audience that is much wider than a group of data scientists. Creating reusable code or apps is a great way to increase the impact of a project and to disseminate its findings.</p>
<p>The fourth point is perhaps the most controversial. I view it as a success if we can show that the data can’t answer the questions being asked. I am reminded of a friend who told a story of the company he worked at. They hired many expensive prediction consultants to help use their data to inform pricing. However, the prediction results weren’t helping. They were able to prove that the data couldn’t answer the hypothesis under study. There was too much noise and the measurements just weren’t accurately measuring what was needed. Sure, the result wasn’t optimal, as they still needed to know how to price things, but it did save money on consultants. I have since heard this story repeated nearly identically by friends in different industries.</p>
Sometimes the biggest challenge is applying what we already know
2016-05-31T00:00:00+00:00
http://simplystats.github.io/2016/05/31/barrier-to-medication
<p>There’s definitely a need to innovate and develop new treatments in
the area of asthma, but it’s easy to underestimate the barriers to
just doing what we already know, such as making sure that people are
following existing, well-established guidelines on how to treat
asthma. My colleague, Elizabeth Matsui, has <a href="http://skybrudeconsulting.com/blog/2016/05/31/barriers-medication.html">written about the
challenges</a> in a <a href="https://clinicaltrials.gov/ct2/show/NCT02251379?term=ecatch&rank=1">study</a> that we are collaborating on:</p>
<blockquote>
<p>Our group is currently conducting a study that includes implementation of national guidelines-based medical care for asthma, so that one process that we’ve had to get right is to <strong>prescribe an appropriate dose of medication and get it into the family’s hands</strong>. [emphasis added]</p>
</blockquote>
<p>Seems simple, right?</p>
Sometimes there's friction for a reason
2016-05-24T00:00:00+00:00
http://simplystats.github.io/2016/05/24/somtimes-theres-friction-for-a-reason
<p>Thinking about <a href="http://simplystatistics.org/2016/05/23/update-on-theranos/">my post on Theranos</a> yesterday it occurred to me that one thing that’s great about all of the innovation and technology coming out of places like Silicon Valley is the tremendous reduction of friction in our lives. With Uber it’s much easier to get a ride because of improvement in communication and an increase in the supply of cars. With Amazon, I can get that jug of <a href="http://www.amazon.com/Wesson-Pure-100%25-Natural-Vegetable/dp/B007F1KVX8/ref=sr_1_2_a_it?ie=UTF8&qid=1464092378&sr=8-2&keywords=vegetable+oil">vegetable oil</a> that I always wanted without having to leave the house, because Amazon.</p>
<p>So why is there all this friction? Sometimes it’s because of regulation, which may have made sense at an earlier time, but perhaps doesn’t make as much sense now. Sometimes, you need a company like Amazon to really master the logistics operation to be able to deliver anything anywhere. Otherwise, you’re just stuck driving to the grocery store to get that vegetable oil.</p>
<p>But sometimes there’s friction for a reason. For example, <a href="https://stratechery.com/2013/friction/">Ben Thompson talks about</a> how previously there was quite a bit more friction involved before law enforcement could listen in on our communications. Although wiretapping had long been around (as <a href="http://davidsimon.com/we-are-shocked-shocked/">noted</a> by David Simon of…<a href="http://www.hbo.com/the-wire">The Wire</a>) the removal of all friction by the NSA made the situation quite different. Related to this idea is the massive <a href="http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release">data release from OkCupid</a> a few weeks ago, as I discussed on the latest <a href="https://soundcloud.com/nssd-podcast/episode-16-the-silicon-valley-episode">Not So Standard Deviations</a> podcast episode. Sure, your OkCupid profile is visible to everyone with an account, but having someone vacuum up 70,000 profiles and dumping them on the web for anyone to view is not what people signed up for—there is a qualitative difference there.</p>
<p>When it comes to Theranos and diagnostic testing in general, there is similarly a need for some friction in order to protect public health. John Ioannidis notes in his <a href="http://jama.jamanetwork.com/article.aspx?articleid=2524161#.Vz-lkeuAj9p.twitter">commentary for JAMA</a>:</p>
<blockquote>
<p>Even if the tests were accurate, when they are performed in massive scale and multiple times, the possibility of causing substantial harm from widespread testing is very real, as errors accumulate with multiple testing. Repeated testing of an individual is potentially a dangerous self-harm practice, and these individuals are destined to have some incorrect laboratory results and eventually experience harm, such as, for example, the anxiety of being labeled with a serious condition or adverse effects from increased testing and procedures to evaluate false-positive test results. Moreover, if the diagnostic testing process becomes dissociated from physicians, self-testing and self-interpretation could cause even more problems than they aim to solve.</p>
</blockquote>
<p>Unlike with the NSA, where the differences in scale may be difficult to quantify because the exact extent of the program is unknown to most people, with diagnostic testing we can <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">precisely quantify</a> how a diagnostic test’s characteristics will change if we apply it to 1,000 people vs. 1,000,000 people. This is why organizations like the US Preventive Services Task Force so carefully consider recommendations for testing or screening (and why they have a really tough job).</p>
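<p>To see why the scale matters, here is the back-of-the-envelope Bayes calculation with made-up but not unreasonable numbers: a test with 99% sensitivity and 99% specificity applied to a population where 1 in 1,000 people truly has the condition.</p>
<pre><code class="language-r">
sens <- 0.99    # sensitivity (hypothetical)
spec <- 0.99    # specificity (hypothetical)
prev <- 0.001   # 1 in 1,000 people truly has the condition

# Probability that a positive result is a true positive (Bayes' theorem)
ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
ppv   # about 0.09: roughly 9 out of 10 positives are false alarms

# The false-positive rate is unchanged, but the head count is not
round(c(1e3, 1e6) * (1 - prev) * (1 - spec))   # expected false positives
</code></pre>
<p>Screening a million people instead of a thousand turns roughly 10 false alarms into roughly 10,000, which is exactly the kind of arithmetic a task force has to weigh.</p>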
<p>I’ll admit that a lot of the friction in our daily lives is pointless and it would be great to reduce it if possible. But in many cases, it was us that put the friction there for a reason, and it’s sometimes good to think about why before we move to eliminate it.</p>
Update On Theranos
2016-05-23T00:00:00+00:00
http://simplystats.github.io/2016/05/23/update-on-theranos
<p>I think it’s fair to say that things for Theranos, the Silicon Valley blood testing company, are not looking up. From the Wall Street Journal (via <a href="http://www.theverge.com/2016/5/19/11711004/theranos-voids-edison-blood-test-results">The Verge</a>):</p>
<blockquote>
<p>Theranos has voided two years of results from its Edison blood-testing machines, issuing tens of thousands of corrected reports to patients and doctors and raising the possibility that many health care decisions may have been made based on inaccurate data. The Wall Street Journal first reported the news, saying that many of the corrected tests have been run using traditional machinery. One doctor told the Journal that she sent a patient to the emergency room after seeing abnormal results from a Theranos test; the corrected report returned normal readings.</p>
</blockquote>
<p>Furthermore, <a href="http://jama.jamanetwork.com/article.aspx?articleid=2524161#.Vz-lkeuAj9p.twitter">this commentary in JAMA</a> from John Ioannidis emphasizes the need for caution when implementing testing on a massive scale. In particular, “The notion of patients and healthy people being repeatedly tested in supermarkets and pharmacies, or eventually in cafeterias or at home, sounds revolutionary, but little is known about the consequences”, and the consequences really matter here. In addition, as the title of the commentary would indicate, research done in secret is not research at all. For there to be credibility for a company like this, data needs to be made public.</p>
<p>I <a href="http://simplystatistics.org/2015/10/28/discussion-of-the-theranos-controversy-with-elizabeth-matsui/">continue to maintain</a> that the fundamental premise on which the company is built, as stated by its founder Elizabeth Holmes, is flawed. Two concepts are repeatedly made in the context of Theranos:</p>
<ul>
<li><strong>More testing is better</strong>. Anyone who stayed awake in their introduction to Bayesian statistics lecture knows this is very difficult to make true in all circumstances, no matter how accurate a test is. With rare diseases, the number of false positives is overwhelming and can have very real harmful effects on people. Combine testing on a massive scale, with repeated application over time, and you get a recipe for confusion.</li>
<li><strong>People do not get tested because they are afraid of needles</strong>. Elizabeth Holmes makes a big deal about her personal fear of needles and its impact on her (not) getting blood tests done. I have no doubt that many people share this fear, but I have serious doubt that this is the reason people don’t get medical testing done. There are <a href="http://www.rwjf.org/en/library/research/2012/02/special-issue-of-health-services-research-links-health-care-rese/nonfinancial-barriers-and-access-to-care-for-us-adults.html">many barriers</a> to people getting the medical care that they need, many of them non-financial in nature and not including fear of needles. The problem of getting people the medical care that they need is one deserving of serious attention, but changing the manner in which blood is collected is not going to do it.</li>
</ul>
Not So Standard Deviations Episode 16 - The Silicon Valley Episode
2016-05-23T00:00:00+00:00
http://simplystats.github.io/2016/05/23/nssd-episode-16
<p>Roger and Hilary are back, with Hilary broadcasting from the west coast. Hilary and Roger discuss the possibility of scaling data analysis and how that may or may not work for companies like Palantir. Also, the latest on Theranos and the release of data from OkCupid.</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="https://www.buzzfeed.com/williamalden/inside-palantir-silicon-valleys-most-secretive-company">BuzzFeed Article on Palantir</a></p>
</li>
<li>
<p><a href="http://simplystatistics.org/2016/05/11/palantir-struggles/">Roger’s Simply Statistics post on Palantir</a></p>
</li>
<li>
<p><a href="https://looker.com">Looker</a></p>
</li>
<li>
<p><a href="http://simplystatistics.org/2015/03/17/data-science-done-well-looks-easy-and-that-is-a-big-problem-for-data-scientists/">Data science done well looks easy</a></p>
</li>
<li>
<p><a href="http://www.wsj.com/articles/theranos-voids-two-years-of-edison-blood-test-results-1463616976">Latest on Theranos</a></p>
</li>
<li>
<p><a href="http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release">OkCupid Data Release</a></p>
</li>
<li>
<p><a href="http://fr.slideshare.net/sblank/secret-history-why-stanford-and-not-berkeley">Secret history of Silicon Valley</a></p>
</li>
<li>
<p><a href="https://blog.wealthfront.com">Wealthfront blog</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-16-the-silicon-valley-episode">Download the audio for this episode.</a></p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/265158223&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
What is software engineering for data science?
2016-05-18T00:00:00+00:00
http://simplystats.github.io/2016/05/18/software-engineering-data-science
<p><em>Editor’s note: This post is a chapter from the book <a href="https://leanpub.com/eds">Executive Data Science: A Guide to Training and Managing the Best Data Scientists</a>, written by myself, Brian Caffo, and Jeff Leek.</em></p>
<p>Software is the generalization of a specific aspect of a data analysis.
If specific parts of a data analysis require implementing or applying a number of procedures or tools together, software is the encompassing of all these tools into a specific module or procedure that can be repeatedly applied in a variety of settings. Software allows for the systematizing and the standardizing of a procedure, so that different people can use it and understand what it’s going to do at any given time.</p>
<p>Software is useful because it formalizes and abstracts the functionality of a set of procedures or tools, by developing a well
defined interface to the analysis. Software will have an interface,
or a set of inputs and a set of outputs that are well understood. People can think about the inputs and the outputs without having to worry about the gory details of what’s going on underneath. Now, they may be interested in those details, but the application of the software at any given setting will not necessarily depend on the knowledge of those details. Rather, the knowledge of the <em>interface</em> to that software is important to using it in any given situation.</p>
<p>For example, most statistical packages will have a linear regression function which has a very well defined interface. Typically, you’ll have to input things like the outcome and the set of predictors, and maybe there will be some other inputs like the data set or weights. Most linear regression functions kind of work in that way. And importantly, the user does not have to know exactly how the linear regression calculation is done underneath the hood. Rather, they only need to know that they need to specify the outcome, the predictors, and a couple of other things. The linear regression function abstracts all the details that are required to implement linear regression, so that the user can apply the tool in a variety of settings.</p>
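<p>In R, for example, that interface looks like this (using the built-in <code>mtcars</code> data set):</p>
<pre><code class="language-r">
# The user supplies the outcome, the predictors, and the data;
# optional inputs like weights are part of the same well-defined interface.
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)                 # well-defined outputs
summary(fit)$r.squared
</code></pre>
<p>Nothing about using this function requires knowing how the underlying least squares computation is carried out.</p>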
<p>There are three levels of software that are important to consider, going from kind of the simplest to the most abstract.</p>
<ol>
<li>At the first level you might just have some code that you wrote, and you might want to encapsulate the automation of a set of procedures using a loop (or something similar) that repeats an operation multiple times.</li>
<li>The next step might be some sort of function. Regardless of what language you may be using, often there will be some notion of a function, which is used to encapsulate a set of instructions. And the key thing about a function is that you’ll have to define some sort of interface, which will be the inputs to the function. The function may also have a set of outputs, or it may have some side effect, for example if it’s a plotting function. Now the user only needs to know those inputs and what the outputs will be. This is the first level of abstraction that you might encounter, where you have to actually define an interface to that function.</li>
<li>The highest level is an actual software package, which will often be a collection of functions and other things. That will be a little bit more formal because there’ll be a very specific interface or API that a user has to understand. Often for a software package there’ll be a number of convenience features for users, like documentation, examples, or tutorials that may come with it, to help the user apply the software to many different settings. A full on software package will be most general in the sense that it should be applicable to more than one setting.</li>
</ol>
<p>One question that you’ll find yourself asking is: at what point do you need to systematize common tasks and procedures across projects, versus recreating code or writing new code from scratch on every new project? It depends on a variety of factors, and answering this question may require communication within your team and with
people outside of your team. You may need to develop an understanding of how often a given process is repeated, or how often a given type of data analysis is done, in order to weigh the costs and benefits of investing in developing a software package or something similar.</p>
<p>Within your team, you may want to ask yourself, “Is the data analysis you’re going to do something that you are going to build upon for future work, or is it just going to be a one-shot deal?” In our experience, there are relatively few one-shot deals out there. Often you will have to do a certain analysis more than once, twice, or even three times, at which point you’ve reached the threshold where you want to write some code, some software, or at least a function. The point at which you need to systematize a given set of procedures is going to come sooner than you think. The initial investment in developing more formal software will be higher, of course, but it will likely pay off in time savings down the road.</p>
<p>A basic rule of thumb is</p>
<ul>
<li>If you’re going to do something <strong>once</strong> (that does happen on occasion), just write some code and document it very well. The important thing is that you want to make sure that you understand what the code does, and that requires both writing the code well and writing documentation. You want to be able to reproduce it later on if you ever come back to it, or if someone else comes back to it.</li>
<li>If you’re going to do something <strong>twice</strong>, write a function. This allows you to abstract a small piece of code, and it forces you to define an interface, so you have well-defined inputs and outputs (see the sketch after this list).</li>
<li>If you’re going to do something <strong>three</strong> times or more, you should think about writing a small package. It doesn’t have to be commercial-level software, but a small package that encapsulates the set of operations you’re going to be doing in a given analysis. It’s also important to write some real documentation so that people can understand what’s supposed to be going on, and can apply the software to a different situation if they have to.</li>
</ul>
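<p>As a minimal sketch of the “twice” step referenced above, here is what moving from copy-pasted code to a function with a defined interface might look like in R. The data frames and column names are invented for illustration.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Before: the same cleaning steps copy-pasted for every new data set
## d1_clean <- d1[!is.na(d1$value), ]; d1_clean$value <- log(d1_clean$value)
## d2_clean <- d2[!is.na(d2$value), ]; d2_clean$value <- log(d2_clean$value)

## After: one function with a defined interface --
## input: a data frame and a column name; output: a cleaned data frame
clean_values <- function(d, column = "value") {
    d <- d[!is.na(d[[column]]), ]    ## drop missing measurements
    d[[column]] <- log(d[[column]])  ## put measurements on the log scale
    d
}

d1_clean <- clean_values(d1)
d2_clean <- clean_values(d2)
</code></pre></div></div>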
Disseminating reproducible research is fundamentally a language and communication problem
2016-05-13T00:00:00+00:00
http://simplystats.github.io/2016/05/13/reproducible-research-language
<p>Just about 10 years ago, I wrote my <a href="http://www.ncbi.nlm.nih.gov/pubmed/16510544">first</a> of many articles about the importance of reproducible research. Since that article, one of the points I’ve made is that the key issue to resolve was one of tools and infrastructure. At the time, many people were concerned that people would not want to share data and that we had to spend a lot of energy finding ways to either compel or incentivize them to do so. But the reality was that it was difficult to answer the question “What should I do if I desperately want to make my work reproducible?” Back then, even if you could convince a clinical researcher to use R and LaTeX to create a <a href="https://en.wikipedia.org/wiki/Sweave">Sweave</a> document (!), it was not immediately obvious where they should host the document, code, and data files.</p>
<p>Much has happened since then. We now have knitr and Markdown for live documents (as well as iPython notebooks and the like). We also have git, GitHub, and friends, which provide free code sharing repositories in a distributed manner (unlike older systems like CVS and Subversion). With the recent announcement of the <a href="http://www.arfon.org/announcing-the-journal-of-open-source-software">Journal of Open Source Software</a>, posting code on GitHub can now be recognized within the current system of credits and incentives. Finally, the number of <a href="http://dataverse.org">data</a> <a href="https://osf.io">repositories</a> has grown, providing more places for researchers to deposit their data files.</p>
<p>Is the tools and infrastructure problem solved? I’d say yes. One thing that has become clear is that disseminating reproducible research is <strong>no longer a software problem</strong>. At least in R land, building live documents that can be executed by others is straightforward and not too difficult to pick up (thank you <a href="https://daringfireball.net/projects/markdown/">John Gruber</a>!). For other languages there are many equivalent (if not better) tools for writing documents that mix code and text. For this kind of thing, there’s just no excuse anymore. Could things be optimized a bit for some edge cases? Sure, but the tools are perfectly fine for the vast majority of use cases.</p>
<p>But now there is a bigger problem that needs to be solved, which is that <strong>we do not have an effective way to communicate data analyses</strong>. One might think that publishing the full code and datasets is the perfect way to communicate a data analysis, but in a way, it is too perfect. That approach can provide too much information.</p>
<p>I find the following analogy useful for discussing this problem. If you look at music, one way to communicate music is to provide an audio file, a standard WAV file or something similar. In a way, that is a near-perfect representation of the music—bit-for-bit—that was produced by the performer. If I want to experience a Beethoven symphony the way that it was meant to be experienced, I’ll listen to a <a href="https://itun.es/us/TudVe?i=79443286">recording of it</a>.</p>
<p>But if I want to understand how Beethoven wrote the piece—the process and the details—the recording is not a useful tool. What I look at instead is <a href="http://www.amazon.com/dp/0486260348">the score</a>. The recording is a serialization of the music. The score provides an expanded representation of the music that shows all of the different pieces and how they fit together. A person with a good ear can often reconstruct the score, but this is a difficult and time-consuming task. Better to start with what the composer wrote originally.</p>
<p>Over centuries, classical music composers developed a language and system for communicating their musical ideas so that</p>
<ol>
<li>there was enough detail that a 3rd party could interpret the music and perform it to a level of accuracy that satisfied the composer; but</li>
<li>it was not so prescriptive or constraining that different performers could not build on the work and incorporate their own ideas.</li>
</ol>
<p>It would seem that traditional computer code satisfies those criteria, but I don’t think so. Traditional computer code (even R code) is designed to communicate programming concepts and constructs, not to communicate data analysis constructs. For example, a <code class="language-plaintext highlighter-rouge">for</code> loop is not a data analysis concept, even though we may use <code class="language-plaintext highlighter-rouge">for</code> loops all the time in data analysis.</p>
<p>Because of this disconnect between computer code and data analysis, I often find it difficult to understand what a data analysis is doing, even if it is coded very well. I imagine this is what programmers felt like when programming in processor-specific <a href="https://en.wikipedia.org/wiki/Assembly_language">assembly language</a>. Before languages like C were developed, where high-level concepts could be expressed, you had to know the gory details of how each CPU operated.</p>
<p>The closest thing that I can see to a solution emerging is the work that Hadley Wickham is doing with packages like <a href="https://github.com/hadley/dplyr">dplyr</a> and <a href="https://github.com/hadley/ggplot2">ggplot2</a>. The <code class="language-plaintext highlighter-rouge">dplyr</code> package’s verbs (<code class="language-plaintext highlighter-rouge">filter</code>, <code class="language-plaintext highlighter-rouge">arrange</code>, etc.) represent data manipulation concepts that are meaningful to analysts. But we still have a long way to go to cover all of data analysis in this way.</p>
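<p>To make the contrast concrete, here is a small, hypothetical example of the same summary expressed first with a <code class="language-plaintext highlighter-rouge">for</code> loop and then with <code class="language-plaintext highlighter-rouge">dplyr</code> verbs. The data set and column names are invented for illustration.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

## Programming-level description: loop over groups and accumulate results
city_means <- numeric(0)
for (g in unique(airdata$city)) {
    city_means[g] <- mean(airdata$pm25[airdata$city == g], na.rm = TRUE)
}

## Analysis-level description: the verbs name the data analysis concepts directly
airdata %>%
    group_by(city) %>%
    summarize(mean_pm25 = mean(pm25, na.rm = TRUE))
</code></pre></div></div>
<p>Both versions compute the same thing, but only the second reads as a description of an analysis rather than of a program.</p>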
<p>Reproducible research is important because it is fundamentally about communicating what you have done in your work. Right now we have a sub-optimal way to communicate what was done in a data analysis, via traditional computer code. I think developing a new approach to communicating data analysis could have a few benefits:</p>
<ol>
<li>It would provide greater transparency</li>
<li>It would allow others to more easily build on what was done in an analysis by extending or modifying specific elements</li>
<li>It would make it easier to understand what common elements there were across many different data analyses</li>
<li>It would make it easier to teach data analysis in a systematic and scalable way</li>
</ol>
<p>So, any takers?</p>
The Real Lesson for Data Science That is Demonstrated by Palantir's Struggles
2016-05-11T00:00:00+00:00
http://simplystats.github.io/2016/05/11/palantir-struggles
<p>Buzzfeed recently published a <a href="https://www.buzzfeed.com/williamalden/inside-palantir-silicon-valleys-most-secretive-company?utm_term=.ko2PLKaMJ#.wiPxJERyA">long article</a> on the struggles of the secretive data science company, Palantir.</p>
<blockquote>
<p>Over the last 13 months, at least three top-tier corporate clients have walked away, including Coca-Cola, American Express, and Nasdaq, according to internal documents. Palantir mines data to help companies make more money, but clients have balked at its high prices that can exceed $1 million per month, expressed doubts that its software can produce valuable insights over time, and even experienced difficult working relationships with Palantir’s young engineers. Palantir insiders have bemoaned the “low-vision” clients who decide to take their business elsewhere.</p>
</blockquote>
<p>Palantir’s origins are with PayPal, and its founders are part of the <a href="https://en.wikipedia.org/wiki/PayPal_Mafia">PayPal Mafia</a>. As Peter Thiel describes it in his book <a href="https://en.wikipedia.org/wiki/Zero_to_One">Zero to One</a>, PayPal was having a lot of trouble with fraud and the FBI was getting on its case. So PayPal developed some software to monitor the millions of transactions going through its systems and to flag transactions that were suspicious. Eventually, they realized that this kind of software might be useful to government agencies in a variety of contexts, and the idea for Palantir was born.</p>
<p>Much of the press reaction to Buzzfeed’s article amounts to schadenfreude over the potential fall of <a href="http://simplystatistics.org/2015/10/16/thorns-runs-head-first-into-the-realities-of-diagnostic-testing/">another</a> so-called Silicon Valley unicorn. Indeed, Palantir is valued at $20 billion, a valuation only exceeded in the private markets by Airbnb and Uber. But to me, nothing in the article indicates that Palantir is necessarily more poorly run than your average startup. It looks like they are going through pretty standard growing pains, trying to scale the business and diversify the customer base. It’s not surprising to me that employees would leave at this point—going from startup to “real company” is often not that fun and just a lot of work.</p>
<p>However, a key question that arises is that if Palantir is having trouble trying to scale the business, why might that be? The Buzzfeed article doesn’t contain any answers but in this post I will attempt to speculate.</p>
<p>The real message from the Buzzfeed article goes beyond just Palantir and is highly relevant to the data science world. It ultimately comes down to the question of <strong>what is the value of data analysis?</strong>, and secondarily, <strong>how do you communicate that value?</strong></p>
<h2 id="the-data-science-spectrum">The Data Science Spectrum</h2>
<p>Data science activities live on a spectrum with <strong>software</strong> on one end and <strong>highly customized consulting</strong> on another end (I lump a lot of things into consulting, including methods development, modeling, etc.).</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/DS_Spectrum2.png" alt="Data Science Spectrum" /></p>
<p>The idea being that if someone comes to you with a data problem, there are two extremes that you might offer to them:</p>
<ol>
<li>Give them some software, some documentation, and maybe a brief tutorial on how to use the software, and then send them on their way. For example, if someone wants to see if two groups are different from each other, you could send them the <code class="language-plaintext highlighter-rouge">t.test()</code> function in R and explain how to use it (a minimal sketch follows this list). This could be done over email; you wouldn’t even have to talk to the person.</li>
<li>Meet with the person, talk about their problem and the question they’re trying to answer, develop an analysis plan, and build a custom software solution that produces the exact output that they’re looking for.</li>
</ol>
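<p>At the software end of the spectrum, the entire “deliverable” can be as small as the sketch below; the two vectors stand in for whatever measurements the person actually has.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Two-sample t-test comparing the means of two groups
group1 <- c(5.1, 4.9, 6.2, 5.8, 5.5)  ## illustrative measurements
group2 <- c(6.8, 7.1, 6.5, 7.4, 6.9)
t.test(group1, group2)
</code></pre></div></div>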
<p>The first option is cheap, simple, and if you had a good enough web site, the person probably wouldn’t even have to talk with you at all! For example, I use <a href="http://hedwig.mgh.harvard.edu/sample_size/size.html">this web site</a> for sample size calculations and I’ve never spoken with the author of the web site. Much of the labor is up front, for the development of the software, and then is amortized over the life of the product. Ultimately, a software product has zero marginal cost and so it can be easily replicated and is <em>infinitely scalable</em>.</p>
<p>The second option is labor intensive, time-consuming, ongoing in nature, and is only scalable to the extent that you are willing to forgo sleep and maybe bend the space-time continuum. By definition, a custom solution is unique and is difficult to replicate.</p>
<h2 id="selling-data-science">Selling Data Science</h2>
<p>An important question for Palantir and data scientists in general is “How do you communicate the value of data analysis?” Many people expect the result of a good data analysis to be something “surprising”, i.e. something that they didn’t already know. Because if they knew it already, why bother looking at the data? Think Moneyball—if you can find that “diamond in the rough”, it makes spending the time to analyze the data worthwhile. But <strong>the success of a data analysis can’t depend on the results</strong>. What if you go through the data and find nothing? Is the data analysis a failure? We as data scientists can only show what the data show. Otherwise, it just becomes a recipe for telling people what they want to hear.</p>
<p>It’s tempting for a client to say “well, the data didn’t show anything surprising so there’s no value there.” Also, a data analysis may reveal something that is perhaps interesting but doesn’t necessarily lead to any sort of decision. For example, there may be an aspect of a business process that is inefficient but is nevertheless unmodifiable. There may be little perceived value in discovering this with data.</p>
<h3 id="what-is-useful">What is Useful?</h3>
<p>Palantir apparently tried to develop a relationship with American Express, but ultimately failed.</p>
<blockquote>
<p>But some major firms have not found Palantir’s products and services that useful. In April 2015, employees were informed that American Express (codename: Charlie’s Angels) had dumped Palantir after 18 months of cybersecurity work, including a six-month pilot, an email shows. “We struggled from day 1 to make Palantir a sticky product for users and generate wins,” Sid Rajgarhia, a Palantir business development employee, said in the email.</p>
</blockquote>
<p>What does it mean for a data analysis product to be useful? It’s not necessarily clear to me in this case. Did Palantir not reveal new information? Did they not highlight something that could be modified?</p>
<h3 id="lack-of-deep-expertise">Lack of Deep Expertise</h3>
<p>A failed attempt at working with Coke reveals some other challenges of the data science business model.</p>
<blockquote>
<p>The beverage giant also had other concerns [in addition to the price]. Coke “wanted deeper industry expertise in a partner,” Jonty Kelt, a Palantir executive, told colleagues in the email. He added that Coca-Cola’s “working relationship” with the youthful Palantir employees was “difficult.” The Coke executive acknowledged that the beverage giant “needs to get better at working with millennials,” according to Kelt. Coke spokesperson Scott Williamson declined to comment.</p>
</blockquote>
<p>Annoying millennials notwithstanding, it’s clear that Coke didn’t feel comfortable collaborating with Palantir’s personnel. Like any data science collaboration, it’s key that the data scientist have some familiarity with the domain. In many cases, having “deep expertise” in an area can give a collaborator confidence that you will focus on the things that matter to them. But developing that expertise costs money and time, and it may prevent you from working with other types of clients where you will necessarily have less expertise. For example, Palantir’s long experience working with the US military and intelligence agencies gave them deep expertise in those areas, but how does that help them with a consumer products company?</p>
<h3 id="harder-than-it-looks">Harder Than It Looks</h3>
<p>The final example of a client that backed out is Kimberly-Clark:</p>
<blockquote>
<p>But Kimberly-Clark was getting cold feet by early 2016. In January, a year after the initial pilot, Kimberly-Clark executive Anthony J. Palmer said he still wasn’t ready to sign a binding contract, meeting notes show. Palmer also “confirmed our suspicion” that a primary reason Kimberly-Clark had not moved forward was that “<em>they wanted to see if they could do it cheaper themselves</em>,” Kelt told colleagues in January. [emphasis added]</p>
</blockquote>
<p>This is a common problem confronted by anyone in the data science business. A good analysis often looks easy in retrospect—all you did was run some functions and put the data through some models! In fact, running the models probably <em>is</em> the easy part, but getting to the point where you can actually fit models can be incredibly hard. Many clients, not seeing the long and winding process leading to a model, will be tempted to think they can do it themselves.</p>
<h2 id="palantirs-valuation">Palantir’s Valuation</h2>
<p>Ultimately, what makes Palantir interesting is its astounding valuation. But what is the driver of this valuation? I think the key to answering this question lies in the description of the company itself:</p>
<blockquote>
<p>The company, based in Palo Alto, California, is essentially a hybrid software and consulting firm, placing what it calls “forward deployed engineers” on-site at client offices.</p>
</blockquote>
<p>What does it mean to be a “hybrid software and consulting firm”? And which one is the company more like? Consulting or software? Because ultimately, revealing which side of the spectrum Palantir is <em>really</em> on could have huge implications for its valuation and future prospects.</p>
<p>Consulting companies can surely make a lot of money, but none to my knowledge have the kind of valuation that Palantir currently commands. If it turns out that every customer of Palantir’s requires a custom solution, then I think they’re likely overvalued, because that model scales poorly. On the other hand, if Palantir has genuinely figured out a way to “software-ize” data analysis and to turn it into a commodity, then they are very likely undervalued.</p>
<p>Given the tremendous difficulty of turning data analysis into a software problem, my guess is that they are more akin to a consulting company and are overvalued. This is not to say that they won’t make money—they will likely make plenty—but that they won’t be the Silicon Valley darling that everyone wants them to be.</p>
A means not an end - building a social media presence as a junior scientist
2016-05-10T00:00:00+00:00
http://simplystats.github.io/2016/05/10/social-media
<p><em>Editor’s note - This is a chapter from my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a> where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before. 50% of all royalties from the book go to support <a href="http://www.datacarpentry.org/">Data Carpentry</a> to promote data science education.</em></p>
<h2 id="social-media---what-should-i-do-and-why">Social media - what should I do and why?</h2>
<p>Social media can serve a variety of roles for modern scientists. Here I am going to focus on the role of social
media for working scientists whose primary focus is not on scientific communication. Something that is often missed by people who are just getting started with social media is that there are two separate components to developing a successful social media presence.</p>
<p>The first is to develop a following and connections
to people in your community. This is achieved through being either a content curator, a content generator, or
being funny/interesting in some other way. This often has nothing to do with your scientific output.</p>
<p>The second component is using your social media presence to magnify the audience for your scientific work. You can
only do this if you have successfully developed a network and community in the first step. Then, when you post about
your own scientific papers they will be shared.</p>
<p>To most effectively achieve all of these goals you need to identify relevant communities and develop a network
of individuals who follow you and will help to share your ideas and work.</p>
<p><strong>Set up social media accounts and follow relevant people/journals</strong></p>
<p>One of the largest academic communities has developed around Twitter, but some scientists are also using Facebook for professional purposes. If you set up a Twitter account, you should then follow as many colleagues in your area of expertise as you can find, along with any journals that are in your area.</p>
<p><strong>Use your social media account to promote the work of other people</strong></p>
<p>If you just use your social media account to post links to any papers that you publish, it will be hard to develop much of a following. It is also hard to develop a following by constantly posting long-form original content such as blog posts. Alternatively, you can gain a large number of followers by being (a) funny, (b) interesting, or (c) a content curator. This latter approach can be particularly useful for people new to social media. Just follow people and journals you find interesting and share anything that you think is important/creative/exciting.</p>
<p><strong>Share any work that you develop</strong></p>
<p>Any code, publications, data, or blog posts you create you can share from your social media account. This can help raise your profile as people notice your good work. But if you only post your own work it is rarely possible to develop a large following unless you are already famous for another reason.</p>
<h2 id="social-media---what-tools-should-i-use">Social media - what tools should I use?</h2>
<p>There are a large number of social media platforms that are available to scientists. Creatively using any new social media platform, if it has a large number of users, can be a way to quickly jump into the consciousness of more people. That being said, the two largest communities of scientists have organized around two of the largest social media platforms.</p>
<ul>
<li><a href="https://twitter.com/">Twitter</a> - is a platform where you can post short (less than 140 character) messages. This is a great platform for both discovering science and engaging in conversations about topics at a superficial level. It is not particularly useful for in depth scientific discussions.</li>
<li><a href="https://www.facebook.com/">Facebook</a> - some scientists post longer form scientific discussions on Facebook, but the community there is somewhat less organized and people tend to use it less for professional reasons. However, sharing content on Facebook, particularly when it is of interest to a general audience, can lead to a broader engagement in your work.</li>
</ul>
<p>There are also a large and growing number of academic-specific social networks. For the most part these social networks are not widely used by practicing scientists and so don’t represent the best use of your time.</p>
<p>You might also consider short videos on <a href="https://vine.co/">Vine</a>, longer videos on <a href="https://www.youtube.com/">Youtube</a>, or more image-intensive social media on <a href="https://www.tumblr.com/">Tumblr</a> or <a href="https://www.instagram.com">Instagram</a> if you have content that regularly fits those outlets. But they tend to have smaller communities of scientists, with less opportunity for back and forth.</p>
<h2 id="social-media---further-tips-and-issues">Social media - further tips and issues</h2>
<h3 id="you-do-not-need-to-develop-original-content">You do not need to develop original content</h3>
<p>Social media can be a time suck, particularly if you are spending a lot of time engaging in conversations on your platform of choice. Generating long form content in particular can take up a lot of time. But you don’t need to do that to generate a decent following. Finding the right community and then sharing work within that community and adding brief commentary and ideas can often help you develop a large following which can then be useful for other reasons.</p>
<h3 id="add-your-own-commentary">Add your own commentary</h3>
<p>Once you are comfortable using the social media platform of your choice you can start to engage with other people in conversation or add comments when you share other people’s work. This will increase the interest in your social media account and help you develop followers. This can be as simple as one-liners copied straight from the text of papers or posts that you think are most important.</p>
<h3 id="make-online-friends---then-meet-them-offline">Make online friends - then meet them offline</h3>
<p>One of the biggest advantages of scientific social media is that it levels the playing field. Don’t be afraid to engage with members of your scientific community at all levels, from members of the National Academy (if they are online!) all the way down to junior graduate students. Getting to know a diversity of people can really help you during scientific meetings and visits. If you spend time cultivating online friendships, you’ll often meet a “familiar handle” at any conference or meeting you go to.</p>
<h3 id="include-images-when-you-can">Include images when you can</h3>
<p>If you see a plot from a paper you think is particularly compelling, copy it and attach it when you post/tweet a link to the paper. On social media, images are often better received than plain text.</p>
<h3 id="be-careful-of-hot-button-issues-unless-you-really-care">Be careful of hot button issues (unless you really care)</h3>
<p>One thing to keep in mind on social media is the amplification of opinions. There are a large number of issues that are of extreme interest and generate really strong opinions on multiple sides. Some of these issues are common societal issues (e.g., racism, feminism, economic inequality) and some are specific to science (e.g., open access publishing, open source development). If you are starting a social media account to engage in these topics then you should definitely do that. If you are using your account primarily for scientific purposes you should consider carefully the consequences of wading into these discussions. The debates run very hot on social media and you may post what you consider to be a relatively tangential or light message on one of these topics and find yourself the center of a lot of attention (positive and negative).</p>
Time Series Analysis in Biomedical Science - What You Really Need to Know
2016-05-05T00:00:00+00:00
http://simplystats.github.io/2016/05/05/timeseries-biomedical
<p>For a few years now I have given a guest lecture on time series analysis in our School’s <em>Environmental Epidemiology</em> course. The basic thrust of this lecture is that you should generally ignore what you read about time series modeling, either in papers or in books. The reason is that I find much of the time series literature is not particularly helpful when doing analyses in a biomedical or population health context, which is what I do almost all the time.</p>
<h2 id="prediction-vs-inference">Prediction vs. Inference</h2>
<p>First, most of the literature on time series models tends to assume that you are interested in doing prediction—forecasting future values in a time series. I am almost never doing this. In my work looking at air pollution and mortality, the goal is never to find the best model that predicts mortality. In particular, if our goal were to predict mortality, we would probably <em>never include air pollution as a predictor</em>. This is because air pollution has an inherently weak association with mortality at the population level, whereas things like temperature and other seasonal factors tend to have a much stronger association.</p>
<p>What I <em>am</em> interested in doing is estimating an association between changes in air pollution levels and mortality and making some sort of inference about that association, either to a broader population or to other time periods. The challenges in these types of analyses include estimating weak associations in the presence of many stronger signals and appropriately adjusting for any potential confounding variables that similarly vary over time.</p>
<p>The reason the distinction between prediction and inference is important is that focusing on one vs. the other can lead you to very different model building strategies. Prediction modeling strategies will always push you to include in the model factors that are strongly correlated with the outcome and that explain a lot of the outcome’s variation. If you’re trying to do inference and use a prediction modeling strategy, you may make at least two errors:</p>
<ol>
<li>You may conclude that your key predictor of interest (e.g. air pollution) is not important because the modeling strategy didn’t deem it worth including</li>
<li>You may omit important potential confounders because they have a weak relationship with the outcome (but maybe have a strong relationship with your key predictor). For example, one class of potential confounders in air pollution studies is other pollutants, which tend to be weakly associated with mortality but may be strongly associated with your pollutant of interest.</li>
</ol>
<h2 id="random-vs-fixed">Random vs. Fixed</h2>
<p>Another area where I feel much of the time series literature differs from my practice is on whether to focus on fixed effects or random effects. Most of what you might think of when you think of time series models (i.e. AR models, MA models, GARCH, etc.) focuses on modeling the <em>random</em> part of the model. Often, you end up treating time series data as random because you simply do not have any other data. But the reality is that in many biomedical and public health applications, patterns in time series data can be explained by clearly understood fixed patterns.</p>
<p>For example, take this time series here. It is lower at the beginning and at the end of the series, with higher levels in the middle of the period.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_fixed.png" alt="Time series with seasonal pattern 1" /></p>
<p>It’s possible that this time series could be modeled with an auto-regressive (AR) model or maybe an auto-regressive moving average (ARMA) model. Or it’s possible that the data are exhibiting a seasonal pattern. It’s impossible to tell from the data whether this is a random formulation of this pattern or whether it’s something you’d expect every time. The problem is that we usually only have <em>one observation</em> from the time series. That is, we observe the entire series only once.</p>
<p>Now take a look at this time series. It exhibits some of the same properties as the first series: it’s low at the beginning and end and high in the middle.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_random.png" alt="Time series with seasonal pattern 2" /></p>
<p>Should we model this as a random process or as a process with a fixed pattern? That ultimately will depend on what type of data this is and what we know about it. If it’s air pollution data, we might do one thing, but if it’s stock market data, we might do a totally different thing.</p>
<p>If one were to see replicates of the time series, we’d be able to resolve the fixed vs. random question. For example, because I simulated the data above, I can simulate another replicate and see what happens. In the plot below I show two replications from each of the processes.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_both.png" alt="Fixed and random time series patterns" /></p>
<p>It’s clear now that the time series on the top row has a fixed “seasonal” pattern while the time series on the bottom row is random (in fact it is simulated from an AR(1) model).</p>
<p>The point here is that I think very often we know things about the time series that we’re modeling that introduce fixed variation into the data: seasonal patterns, day-of-week effects, and long-term trends. Furthermore, there may be other time-varying covariates that can help predict whatever time series we’re modeling and can be put into the fixed part of the model (a.k.a. regression modeling). Ultimately, when many of these fixed components are accounted for, there’s relatively little of interest left in the residuals.</p>
<h2 id="what-to-model">What to Model?</h2>
<p>So the question remains: What should I do? The short answer is to try to incorporate everything that you know about the data into the fixed/regression part of the model. Then take a look at the residuals and see if you still care.</p>
<p>Here’s a quick example from my work in air pollution and mortality. The data are all-cause mortality and PM10 pollution from Detroit for the years 1987–2000. The question is whether daily mortality is associated with daily changes in ambient PM10 levels. We can try to answer that with a simple linear regression model:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = death ~ pm10, data = ds)
Residuals:
Min 1Q Median 3Q Max
-26.978 -5.559 -0.386 5.109 34.022
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.978826 0.112284 418.394 <2e-16
pm10 0.004885 0.001936 2.523 0.0117
Residual standard error: 8.03 on 5112 degrees of freedom
Multiple R-squared: 0.001244, Adjusted R-squared: 0.001049
F-statistic: 6.368 on 1 and 5112 DF, p-value: 0.01165
</code></pre></div></div>
<p>PM10 appears to be positively associated with mortality, but when we look at the autocorrelation function of the residuals, we see</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-3-1.png" alt="ACF1" /></p>
<p>If we see a seasonal-like pattern in the auto-correlation function, then that means there’s a seasonal pattern in the residuals as well. Not good.</p>
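<p>If you are following along in R, a minimal sketch of this check might look like the code below (assuming the data frame is called <code class="language-plaintext highlighter-rouge">ds</code>, as in the printed output above).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Fit the simple model and examine the autocorrelation of its residuals
fit <- lm(death ~ pm10, data = ds)
acf(residuals(fit))
</code></pre></div></div>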
<p>But okay, we can just model the seasonal component with an indicator of the season.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = death ~ season + pm10, data = ds)
Residuals:
Min 1Q Median 3Q Max
-25.964 -5.087 -0.242 4.907 33.884
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.830458 0.215679 235.676 < 2e-16
seasonQ2 -4.864167 0.304838 -15.957 < 2e-16
seasonQ3 -6.764404 0.304346 -22.226 < 2e-16
seasonQ4 -3.712292 0.302859 -12.258 < 2e-16
pm10 0.009478 0.001860 5.097 0.000000358
Residual standard error: 7.649 on 5109 degrees of freedom
Multiple R-squared: 0.09411, Adjusted R-squared: 0.09341
F-statistic: 132.7 on 4 and 5109 DF, p-value: < 2.2e-16
</code></pre></div></div>
<p>Note that the coefficient for PM10, the coefficient of real interest, gets a little bigger when we add the seasonal component.</p>
<p>When we look at the residuals now, we see</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-5-1.png" alt="ACF2" /></p>
<p>The seasonal pattern is gone, but we see that there’s positive autocorrelation at seemingly long distances (~100s of days). This is usually an indicator that there’s some sort of long-term trend in the data. Since we only care about the day-to-day changes in PM10 and mortality, it would make sense to remove any such long-term trend. I can do that by just including the date as a linear predictor.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Call:
lm(formula = death ~ season + date + pm10, data = ds)
Residuals:
Min 1Q Median 3Q Max
-23.407 -5.073 -0.375 4.718 32.179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.04317325 0.64858433 92.576 < 2e-16
seasonQ2 -4.76600268 0.29841993 -15.971 < 2e-16
seasonQ3 -6.56826695 0.29815323 -22.030 < 2e-16
seasonQ4 -3.42007191 0.29704909 -11.513 < 2e-16
date -0.00106785 0.00007108 -15.022 < 2e-16
pm10 0.00933871 0.00182009 5.131 0.000000299
Residual standard error: 7.487 on 5108 degrees of freedom
Multiple R-squared: 0.1324, Adjusted R-squared: 0.1316
F-statistic: 156 on 5 and 5108 DF, p-value: < 2.2e-16
</code></pre></div></div>
<p>Now we can look at the autocorrelation function one last time.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-7-1.png" alt="ACF3" /></p>
<p>The ACF trails to zero reasonably quickly now, but there’s still some autocorrelation at short lags up to about 15 days or so.</p>
<p>Now we can engage in some traditional time series modeling. We might want to model the residuals with an auto-regressive model of order <em>p</em>. What should <em>p</em> be? We can check by looking at the partial autocorrelation function (PACF).</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-8-1.png" alt="PACF" /></p>
<p>The PACF seems to suggest we should fit an AR(6) or AR(7) model. Let’s use an AR(6) model and see how things look. We can use the <code class="language-plaintext highlighter-rouge">arima()</code> function for that.</p>
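<p>A sketch of how that fit might be set up is below. The names <code class="language-plaintext highlighter-rouge">y</code> and <code class="language-plaintext highlighter-rouge">m</code> match the printed call, but constructing the regression matrix this way is my guess at the setup, not necessarily how the original analysis was coded.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Response series and regression (xreg) matrix, including an intercept column
y <- ds$death
m <- model.matrix(~ season + date + pm10, data = ds)

## AR(6) errors, with the fixed covariates entering through xreg
fit_ar6 <- arima(y, order = c(6, 0, 0), xreg = m, include.mean = FALSE)
</code></pre></div></div>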
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Call:
arima(x = y, order = c(6, 0, 0), xreg = m, include.mean = FALSE)
Coefficients:
ar1 ar2 ar3 ar4 ar5 ar6 (Intercept)
0.0869 0.0933 0.0733 0.0454 0.0377 0.0489 59.8179
s.e. 0.0140 0.0140 0.0141 0.0141 0.0140 0.0140 1.0300
seasonQ2 seasonQ3 seasonQ4 date pm10
-4.4635 -6.2778 -3.2878 -0.0011 0.0096
s.e. 0.4569 0.4624 0.4546 0.0001 0.0018
sigma^2 estimated as 53.69: log likelihood = -17441.84, aic = 34909.69
</code></pre></div></div>
<p>Note that the coefficient for PM10 hasn’t changed much from our initial models. The usual concern with not accounting for residual autocorrelation is that the variance/standard error of the coefficient of interest will be affected. In this case, there does not appear to be much of a difference between using the AR(6) to account for the residual autocorrelation and ignoring it altogether. Here’s a comparison of the standard errors for each coefficient.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Naive AR model
(Intercept) 0.648584 1.030007
seasonQ2 0.298420 0.456883
seasonQ3 0.298153 0.462371
seasonQ4 0.297049 0.454624
date 0.000071 0.000114
pm10 0.001820 0.001819
</code></pre></div></div>
<p>The standard errors for the <code class="language-plaintext highlighter-rouge">pm10</code> variable are almost identical, while the standard errors for the other variables are somewhat bigger in the AR model.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Ultimately, I’ve found that in many biomedical and public health applications, time series modeling is very different from what I read in the textbooks. The key takeaways are:</p>
<ol>
<li>
<p>Make sure you know if you’re doing <strong>prediction</strong> or <strong>inference</strong>. Most often you will be doing inference, in which case your modeling strategies will be quite different.</p>
</li>
<li>
<p>Focus separately on the <strong>fixed</strong> and <strong>random</strong> parts of the model. In particular, work with the fixed part of the model first, incorporating as much information as you can that will explain variability in your outcome.</p>
</li>
<li>
<p>Model the random part appropriately, after incorporating as much as you can into the fixed part of the model. Classical time series models may be of use here, but also simple robust variance estimators may be sufficient.</p>
</li>
</ol>
Not So Standard Deviations Episode 15 - Spinning Up Logistics
2016-05-04T00:00:00+00:00
http://simplystats.github.io/2016/05/04/nssd-episode-15
<p>This is Hilary’s and my last New York-Baltimore episode! In future
episodes, Hilary will be broadcasting from California. In this episode
we discuss collaboration tools and workflow management for data
science projects. To date, I have not found a project management tool
that I can actually use (besides email), but am open to suggestions
(from students).</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://twitter.com/hspter/status/725411087110299649">Hilary’s tweet on cats</a></p>
</li>
<li>
<p><a href="http://www.etsy.com/listing/185113916/…mug-coffee-cup-tea">Awesome vs. cats mug</a></p>
</li>
<li>
<p><a href="http://math.mit.edu/~urschel/">John Urschel’s web page</a></p>
</li>
<li>
<p><a href="http://www.ams.org/publications/journa…1602/rnoti-p148.pdf">Profile of John Urschel by the AMS</a></p>
</li>
<li>
<p><a href="http://en.wikipedia.org/wiki/Frank_Ryan_…merican_football">The other NFL player/mathematician</a>)</p>
</li>
<li>
<p><a href="http://guides.github.com/introduction/flow/">GitHub flow</a></p>
</li>
<li>
<p><a href="http://www.theinformation.com/articles/why-…a-product-fix">Problems with Slack</a></p>
</li>
<li>
<p><a href="http://www.astronomy.ohio-state.edu/~pogge/Ast…5/gps.html">Relativity and GPS</a></p>
</li>
<li>
<p><a href="http://www.theinformation.com/become-a-data…e-information">The Information is looking for a Data Storyteller</a></p>
</li>
<li>
<p><a href="http://www.stitchfix.com/careers?gh_jid=1…46?gh_jid=169746">Stitch Fix is looking for Data Scientists</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/nssd-episode-15-spinning-up-logistics">Download the audio for this episode.</a></p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/261374684&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
High school student builds interactive R class for the intimidated with the JHU DSL
2016-04-27T00:00:00+00:00
http://simplystats.github.io/2016/04/27/r-intimidated
<p>Annika Salzberg is currently an undergraduate at Haverford College majoring in biology. While in high school here in Baltimore she developed and taught an R class to her classmates at the <a href="http://www.parkschool.net/">Park School</a>. Her interest in R grew out of a project where she and her fellow students and teachers went to the Canadian sub-Arctic to collect data on permafrost depth and polar bears. When analyzing the data she learned R (with the help of a teacher) to be able to do the analyses, some of which she did on her laptop while out in the field.</p>
<p>Later she worked on developing a course that she felt was friendly and approachable enough for her fellow high-schoolers to benefit from. With the help of Steven Salzberg and the folks here at the JHU DSL, she built a class she calls <a href="https://www.datacamp.com/courses/r-for-the-intimidated">R for the intimidated</a>, which just launched on <a href="https://www.datacamp.com/courses/r-for-the-intimidated">DataCamp</a> and which you can take for free!</p>
<p>The class is a great introduction for people who are just getting started with R. It walks through R/Rstudio, package installation, data visualization, data manipulation, and a final project. We are super excited about the content that Annika created working here at Hopkins and think you should go check it out!</p>
An update on Georgia Tech's MOOC-based CS degree
2016-04-27T00:00:00+00:00
http://simplystats.github.io/2016/04/27/georgia-tech-mooc-program
<p><a href="https://www.insidehighered.com/news/2016/04/27/georgia-tech-plans-next-steps-online-masters-degree-computer-science?utm_source=Inside+Higher+Ed&utm_campaign=d373e33023-DNU20160427&utm_medium=email&utm_term=0_1fcbc04421-d373e33023-197601005#.VyCmdfkGRPU.mailto">This article</a> in Inside Higher Ed discusses next steps for Georgia
Tech’s ground-breaking low-cost CS degree based on MOOCs run by
Udacity. With Sebastian Thrun <a href="http://blog.udacity.com/2016/04/udacity-has-a-new-___.html">stepping down</a> as CEO at Udacity, it seems both Georgia Tech and Udacity might be moving into a new phase.</p>
<p>One fact that surprised me about the Georgia Tech program concerned the demographics:</p>
<blockquote>
<p>Once the first applications for the online program arrived, Georgia Tech was surprised by how the demographics differed from the applications to the face-to-face program. The institute’s face-to-face cohorts tend to have more men than women and international students than U.S. citizens or residents. Applications to the online program, however, came overwhelmingly from students based in the U.S. (80 percent). The gender gap was even larger, with nearly nine out of 10 applications coming from men.</p>
</blockquote>
Write papers like a modern scientist (use Overleaf or Google Docs + Paperpile)
2016-04-21T00:00:00+00:00
http://simplystats.github.io/2016/04/21/writing
<p><em>Editor’s note - This is a chapter from my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a> where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before.</em></p>
<h2 id="writing---what-should-i-do-and-why">Writing - what should I do and why?</h2>
<p><strong>Write using collaborative software to avoid version control issues.</strong></p>
<p>On almost all modern scientific papers you will have co-authors. The traditional way of handling this was to
create a single working document and pass it around. Unfortunately this system always results in a long collection of
versions of a manuscript, which are often hard to distinguish and definitely hard to synthesize.</p>
<p>An alternative approach is to use formal version control systems like <a href="https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control">Git</a> and <a href="https://github.com/">Github</a>. However, the overhead for using these systems can be pretty heavy for paper authoring. They also require
all parties participating in the writing of the paper to be familiar with version control and the command line.
Alternative paper authoring tools are now available that provide some of the advantages of version control without the major overhead involved
in using base version control systems.</p>
<p><img src="https://imgs.xkcd.com/comics/documents.png" alt="The usual result of file naming by a group (image via https://xkcd.com/1459/)" /></p>
<p><strong>Make figures the focus of your writing</strong></p>
<p>Once you have a set of results and are ready to start writing up the paper the first thing is <em>not to write</em>. The first thing you should do is create a set of 1-10 publication-quality plots with 3-4 as the central focus (see Chapter 10 <a href="http://leanpub.com/datastyle">here</a> for more information on how to make plots). Show these to someone you trust to make sure they “get” your story before proceeding. Your writing should then be focused around explaining the story of those plots to your audience. Many people, when reading papers, read the title, the abstract, and then usually jump to the figures. If your figures tell the whole story you will dramatically increase your audience. It also helps you to clarify what you are writing about.</p>
<p><strong>Write clearly and simply even though it may make your papers harder to publish</strong>.</p>
<p>Learn how to write papers in a very clear and simple style. Whenever you can, write in plain English and make the approach you are using understandable and clear. This can (sometimes) make it harder to get your papers into journals. Referees are trained to find things to criticize, and by simplifying your argument they will assume that what you have done is easy or just like what has been done before. This can be extremely frustrating during the peer review process. But the peer review process isn’t the end goal of publishing! The point of publishing is to communicate your results to your community and beyond so they can use them. Simple, clear language leads to much higher use/reading/citation/impact of your work.</p>
<p><strong>Include links to code, data, and software in your writing</strong></p>
<p>Not everyone recognizes the value of re-analysis, scientific software, or data and code sharing. But it is the fundamental cornerstone of the modern scientific process to make all of your materials easily accessible, re-usable and checkable. Include links to data, code, and software prominently in your abstract, introduction and methods and you will dramatically increase the use and impact of your work.</p>
<p><strong>Give credit to others</strong></p>
<p>In academics the main currency we use is credit for publication. In general, assigning authorship and getting credit can be a very tricky component of the publication process. It is almost always better to err on the side of offering credit. A very useful test that my advisor <a href="http://www.genomine.org/">John Storey</a> taught me is that if you would be embarrassed to explain the authorship credit to anyone who was on the paper or not on the paper, then you probably haven’t shared enough credit.</p>
<h2 id="writing---what-tools-should-i-use">Writing - what tools should I use?</h2>
<h3 id="wysiwyg-software-google-docs-and-paperpile">WYSIWYG software: Google Docs and Paperpile.</h3>
<p>This system uses <a href="https://www.google.com/docs/about/">Google Docs</a> for writing and <a href="https://paperpile.com/app">Paperpile</a> for reference management. If you have a Google account you can easily create documents and share them with your collaborators either privately or publicly. Paperpile allows you to search for academic articles and insert references into the text using a system that will be familiar if you have previously used <a href="http://endnote.com/">Endnote</a> and <a href="https://products.office.com/en-us/word">Microsoft Word</a>.</p>
<p>This system has the advantage of being a what you see is what you get system - anyone with basic text processing skills should be immediately able to contribute. Google Docs also automatically saves versions of your work so that you can flip back to older versions if someone makes a mistake. You can also easily see which part of the document was written by which person and add comments.</p>
<p><em>Getting started</em></p>
<ol>
<li>Set up accounts with <a href="https://accounts.google.com/SignUp">Google</a> and with <a href="https://paperpile.com/">Paperpile</a>. If you are an
academic the Paperpile account will cost $2.99 a month, but there is a 30 day free trial.</li>
<li>Go to <a href="https://docs.google.com/document/u/0/">Google Docs</a> and create a new document.</li>
<li>Set up the <a href="https://paperpile.com/blog/free-google-docs-add-on/">Paperpile add-on for Google Docs</a></li>
<li>In the upper right hand corner of the document, click on the <em>Share</em> link and share the document with your collaborators</li>
<li>Start editing</li>
<li>When you want to include a reference, place the cursor where you want the reference to go, then using the <em>Paperpile</em> menu, choose
insert citation. This should give you a search box where you can search by Pubmed ID or on the web for the reference you want.</li>
<li>Once you have added some references use the <em>Citation style</em> option under <em>Paperpile</em> to pick the citation style for the journal you care about.</li>
<li>Then use the <em>Format citations</em> option under <em>Paperpile</em> to create the bibliography at the end of the document</li>
</ol>
<p>The nice thing about using this system is that everyone can easily edit the document directly and simultaneously - which reduces conflict and difficulty of use. A disadvantage is that getting the formatting just right for most journals is nearly impossible, so you will be sending in a version of your paper that is somewhat generic. For most journals this isn’t a problem, but a few journals are sticklers about this.</p>
<h3 id="typesetting-software-overleaf-or-sharelatex">Typesetting software: Overleaf or ShareLatex</h3>
<p>An alternative approach is to use typesetting software like Latex. This requires a little bit more technical expertise since you need
to understand the Latex typesetting language. But it allows for more precise control over what the document will look like. Using Latex
on its own, you will have many of the same issues with version control that you would have for a Word document. Fortunately there are now
“Google Docs like” solutions for editing Latex code that are readily available. Two of the most popular are <a href="https://www.overleaf.com/">Overleaf</a> and <a href="https://www.sharelatex.com/">ShareLatex</a>.</p>
<p>In either system you can create a document, share it with collaborators, and edit it in a similar manner to a Google Doc, with simultaneous editing. Under both systems you can save versions of your document easily as you move along so you can quickly return to older versions if mistakes are made.</p>
<p>I have used both kinds of software, but now primarily use Overleaf because they have a killer feature. Once you have
finished writing your paper you can directly submit it to some preprint servers like <a href="http://arxiv.org/">arXiv</a> or <a href="http://biorxiv.org/">biorXiv</a> and even some journals like <a href="https://peerj.com">Peerj</a> or <a href="http://f1000research.com/">f1000research</a>.</p>
<p><em>Getting started</em></p>
<ol>
<li>Create an <a href="https://www.overleaf.com/signup">Overleaf account</a>. There is a free version of the software. Paying $8/month will give you easy saving to Dropbox.</li>
<li>Click on <em>New Project</em> to create a new document and select from the available templates</li>
<li>Open your document and start editing</li>
<li>Share with colleagues by clicking on the <em>Share</em> button within the project. You can share either a read only version or a read and edit version.</li>
</ol>
<p>Once you have finished writing your document you can click on the <em>Publish</em> button to automatically submit your paper to the available preprint servers and journals. Or you can download a pdf version of your document and submit it to any other journal.</p>
<h2 id="writing---further-tips-and-issues">Writing - further tips and issues</h2>
<h3 id="when-to-write-your-first-paper">When to write your first paper</h3>
<p>As soon as possible! The purpose of graduate school is (in some order):</p>
<ul>
<li>Freedom</li>
<li>Time to discover new knowledge</li>
<li>Time to dive deep</li>
<li>Opportunity for leadership</li>
<li>Opportunity to make a name for yourself
<ul>
<li>R packages</li>
<li>Papers</li>
<li>Blogs</li>
</ul>
</li>
<li>Get a job</li>
</ul>
<p>The first couple of years of graduate school are typically focused on (1) teaching you all the technical skills you need and (2) data dumping as much hard-won practical experience from more experienced people into your head as fast as possible.</p>
<p>After that one of your main focuses should be on establishing your own program of research and reputation. Especially for Ph.D. students it cannot be emphasized enough that <em>no one will care about your grades in graduate school but everyone will care what you produced</em>. See, for example, Sherri’s excellent <a href="http://drsherrirose.com/academic-cvs-for-statistical-science-faculty-positions">guide on CVs for academic positions</a>.</p>
<p>I firmly believe that <a href="http://simplystatistics.org/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper/">R packages</a> and blog posts can be just as important as papers, but the primary signal to most traditional academic communities still remains published peer-reviewed papers. So you should get started on writing them as soon as you can (definitely before you feel comfortable enough to try to write one).</p>
<p>Even if you aren’t going to be in academics, papers are a great way to show off that you can (a) identify a useful project, (b) finish a project, and (c) write well. So the first thing you should be asking when you start a project is “what paper are we working on?”</p>
<h3 id="what-is-an-academic-paper">What is an academic paper?</h3>
<p>A scientific paper can be distilled into four parts:</p>
<ol>
<li>A set of methodologies</li>
<li>A description of data</li>
<li>A set of results</li>
<li>A set of claims</li>
</ol>
<p>When you (or anyone else) writes a paper the goal is to communicate clearly items 1-3 so that they can justify the set of claims you are making. Before you can even write down 4 you have to do 1-3. So that is where you start when writing a paper.</p>
<h3 id="how-do-you-start-a-paper">How do you start a paper?</h3>
<p>The first thing you do is you decide on a problem to work on. This can be a problem that your advisor thought of or it can be a problem you are interested in, or a combination of both. Ideally your first project will have the following characteristics:</p>
<ol>
<li>Concrete</li>
<li>Solves a scientific problem</li>
<li>Gives you an opportunity to learn something new</li>
<li>Something you feel ownership of</li>
<li>Something you want to work on</li>
</ol>
<p>Points 4 and 5 can’t be emphasized enough. Others can try to help you come up with a problem, but if you don’t feel like it is <em>your</em> problem it will make writing the first paper a total slog. You want to find an option where you are just insanely curious to know the answer at the end, to the point where you <em>just have to figure it out</em> and kind of don’t care what the answer is. That doesn’t always happen, but when it does it makes the grind of writing papers go down a lot easier.</p>
<p>Once you have a problem the next step is to actually do the research. I’ll leave this for another guide, but the basic idea is that you want to follow the usual <a href="https://leanpub.com/datastyle/">data analytic process</a>:</p>
<ol>
<li>Define the question</li>
<li>Get/tidy the data</li>
<li>Explore the data</li>
<li>Build/borrow a model</li>
<li>Perform the analysis</li>
<li>Check/critique results</li>
<li>Write things up</li>
</ol>
<p>The hardest part for the first paper is often knowing where to stop and start writing.</p>
<h3 id="how-do-you-know-when-to-start-writing">How do you know when to start writing?</h3>
<p>Sometimes this is an easy question to answer. If you started with a very concrete question at the beginning, then you are ready to write once you have done enough analysis to convince yourself that you have the answer to the question. If the answer to the question is interesting/surprising then it is time to stop and write.</p>
<p>If you started with a question that wasn’t so concrete then it gets a little trickier. The basic idea here is that you have convinced yourself you have a result that is worth reporting. Usually this takes the form of between 1 and 5 figures that show a coherent story that you could explain to someone in your field.</p>
<p>In general one thing you should be working on in graduate school is your own internal timer that tells you, “ok we have done enough, time to write this up”. I found this one of the hardest things to learn, but if you are going to stay in academics it is a critical skill. There are rarely deadlines for paper writing (unless you are submitting to CS conferences) so it will eventually be up to you when to start writing. If you don’t have a good clock, this can really slow down your ability to get things published and promoted in academics.</p>
<p>One good principle to keep in mind is “the perfect is the enemy of the very good.” Another one is that a published paper in a respectable journal beats a paper you just never submit because you want to get it into the “best” journal.</p>
<h3 id="a-note-on-negative-results">A note on “negative results”</h3>
<p>If the answer to your research problem isn’t interesting/surprising but you started with a concrete question, it is also time to stop and write. But things often get trickier with this type of paper, since most journals filter for “interest” when reviewing papers, so a paper without a really “big” result will sometimes be harder to publish. <strong>This is ok!!</strong> Even though it may take longer to publish the paper, it is important to publish even results that aren’t surprising/novel. I would much rather that you come to an answer you are comfortable with and we go through a little pain trying to get it published than you keep pushing until you get an “interesting” result, which may or may not be justifiable.</p>
<h3 id="how-do-you-start-writing">How do you start writing?</h3>
<ol>
<li>Once you have a set of results and are ready to start writing up the paper the first thing is <em>not to write</em>. The first thing you should do is create a set of 1-4 publication-quality plots (see Chapter 10 <a href="http://leanpub.com/datastyle">here</a>). Show these to someone you trust to make sure they “get” your story before proceeding.</li>
<li>Start a project on <a href="https://www.overleaf.com/">Overleaf</a> or <a href="https://www.google.com/docs/about/">Google Docs</a>.</li>
<li>Write up a story around the four plots in the simplest language you feel you can get away with, while still reporting all of the technical details that you can.</li>
<li>Go back and add references in only after you have finished the whole first draft.</li>
<li>Add in additional technical detail in the supplementary material if you need it.</li>
<li>Write up a reproducible version of your code that returns exactly the same numbers/figures in your paper with no input parameters needed (a minimal sketch of what such a script can look like follows this list).</li>
</ol>
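<p>On the last point, here is a minimal sketch (in R) of what a “no input parameters” script can look like. The file names, variables, and model below are hypothetical placeholders, not part of any particular paper:</p>
<pre><code class="language-r">## make_figures.R - regenerate every number and figure in the paper from scratch
## (all paths and variable names below are made-up examples)
set.seed(20160411)                          # fix any randomness so results are identical every run

dat <- read.csv("data/analysis_data.csv")   # the raw data that ships with the paper

fit <- lm(outcome ~ exposure, data = dat)   # the model reported in the text

## Figure 1: exactly the plot that appears in the manuscript
pdf("figures/figure1.pdf", width = 6, height = 4)
plot(dat$exposure, dat$outcome, xlab = "Exposure", ylab = "Outcome")
abline(fit, lwd = 2)
dev.off()

## the numbers cited in the text
write.csv(summary(fit)$coefficients, "results/table1.csv")
</code></pre>
<p>The goal is that a collaborator (or you, a year later) can run the one script and get back the paper’s figures and numbers with no manual steps.</p>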
<h3 id="what-are-the-sections-in-a-paper">What are the sections in a paper?</h3>
<p>Keep in mind that most people will read the title of your paper only, a small fraction of those people will read the abstract, a small fraction of those people will read the introduction, and a small fraction of those people will read your whole paper. So make sure you get to the point quickly!</p>
<p>The sections of a paper are always some variation on the following:</p>
<ol>
<li><strong>Title</strong>: Should be very short, no colons if possible, and state the main result. Example, “A new method for sequencing data that shows how to cure cancer”. Here you want to make sure people will read the paper without overselling your results - this is a delicate balance.</li>
<li><strong>Abstract</strong>: In (ideally) 4-5 sentences explain (a) what problem you are solving, (b) why people should care, (c) how you solved the problem, (d) what are the results and (e) a link to any data/resources/software you generated.</li>
<li><strong>Introduction</strong>: A more lengthy (1-3 pages) explanation of the problem you are solving, why people should care, and how you are solving it. Here you also review what other people have done in the area. The most critical thing is to never underestimate how little people know or care about what you are working on. It is your job to explain to them why they should.</li>
<li><strong>Methods</strong>: You should state and explain your experimental procedures, how you collected results, your statistical model, and any strengths or weaknesses of your proposed approach.</li>
<li><strong>Comparisons (for methods papers)</strong>: Compare your proposed approach to the state of the art methods. Do this with simulations (where you know the right answer) and data you haven’t simulated (where you don’t know the right answer). If you can base your simulation on data, even better. Make sure you are <a href="http://simplystatistics.org/2013/03/06/the-importance-of-simulating-the-extremes/">simulating both the easy case (where your method should be great) and harder cases where your method might be terrible</a>.</li>
<li><strong>Your analysis</strong>: Explain what you did, what data you collected, how you processed it and how you analysed it.</li>
<li><strong>Conclusions</strong>: Summarize what you did and explain why what you did is important one more time.</li>
<li><strong>Supplementary Information</strong>: If there are a lot of technical computational, experimental or statistical details, you can include a supplement that has all of the details so folks can follow along. As far as possible, try to include the detail in the main text but explained clearly.</li>
</ol>
<p>The length of the paper will depend a lot on which journal you are targeting. In general the shorter/more concise the better. But unless you are shooting for a really glossy journal you should try to include the details in the paper itself. This means most papers will be in the 4-15 page range, but with a huge variance.</p>
<p><em>Note</em>: Part of this chapter appeared in the <a href="https://github.com/jtleek/firstpaper">Leek group guide to writing your first paper</a></p>
As a data analyst the best data repositories are the ones with the least features
2016-04-20T00:00:00+00:00
http://simplystats.github.io/2016/04/20/data-repositories
<p>Lately, for a range of projects I have been working on I have needed to obtain data from previous publications. There is a growing list of data repositories where data is made available. General purpose data sharing sites include:</p>
<ul>
<li>The <a href="https://osf.io/">open science framework</a></li>
<li>The <a href="https://dataverse.harvard.edu/">Harvard Dataverse</a></li>
<li><a href="https://figshare.com/">Figshare</a></li>
<li><a href="https://datadryad.org/">Datadryad</a></li>
</ul>
<p>There are also a host of field-specific data sharing sites. One thing that I find a little frustrating about these sites is that they add a lot of bells and whistles. For example I wanted to download a <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6FMTT3">p-value data set</a> from Dataverse (just to pick on one, but most repositories have similar issues). I go to the page and click <code class="language-plaintext highlighter-rouge">Download</code> on the data set I want.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-04-20/dataverse1.png" alt="Downloading a dataverse paper " /></p>
<p>Then I have to accept terms:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-04-20/dataverse2.png" alt="Downloading a dataverse paper part 2 " /></p>
<p>Then the data set is downloaded. But it comes from a button that doesn’t allow me to get the direct link. There is an <a href="https://github.com/ropensci/dvn">R package</a> that you can use to download dataverse data, but again not with direct links to the data sets.</p>
<p>This is a similar system to many data repositories where there is a multi-step process to downloading data rather than direct links.</p>
<p>But as a data analyst I often find that I want:</p>
<ul>
<li>To be able to find a data set with some minimal search terms</li>
<li>Find the data set in .csv or tab delimited format via a direct link</li>
<li>Have the data set be available both as raw and processed versions</li>
<li>The processed version will either be one or many <a href="https://www.jstatsoft.org/article/view/v059i10">tidy data sets</a>.</li>
</ul>
<p>As a data analyst I would rather have all of the data stored as direct links and ideally as csv files. Then you don’t need to figure out a specialized package, an API, or anything else. You just use <code class="language-plaintext highlighter-rouge">read.csv</code> directly using the URL in R and you are off to the races. It also makes it easier to point to which data set you are using. So I find the best data repositories are the ones with the least features.</p>
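<p>To make that concrete, here is the entire acquisition step when a repository does expose a direct link (the URL below is a made-up placeholder, not a real repository link):</p>
<pre><code class="language-r">## read a csv straight from a repository URL (use read.delim for tab-delimited files)
pvals <- read.csv("https://example-repository.org/datasets/pvalues.csv")
head(pvals)
</code></pre>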
Junior scientists - you don't have to publish in open access journals to be an open scientist.
2016-04-11T00:00:00+00:00
http://simplystats.github.io/2016/04/11/publishing
<p><em>Editor’s note - This is a chapter from my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a> where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before.</em></p>
<h2 id="publishing---what-should-i-do-and-why">Publishing - what should I do and why?</h2>
<p>A modern scientific writing process goes as follows.</p>
<ol>
<li>You write a paper</li>
<li>You post a preprint
a. Everyone can read and comment</li>
<li>You submit it to a journal</li>
<li>It is peer reviewed privately</li>
<li>The paper is accepted or rejected
a. If rejected go back to step 2 and start over
b. If accepted it will be published</li>
</ol>
<p>You can take advantage of modern writing and publishing tools to
handle several steps in the process.</p>
<p><strong>Post preprints of your work</strong></p>
<p>Once you have finished writing your paper, you want to share it with others. Historically, this involved submitting the paper to a journal, waiting for reviews, revising the paper, resubmitting, and eventually publishing it. There is now very little reason to wait that long for your paper to appear in print. Generally you can post a paper to a preprint server and have it appear in 1-2 days. This is a dramatic improvement on the weeks or months it takes for papers to appear in peer reviewed journals even under optimal conditions. There are several advantages to posting preprints.</p>
<ul>
<li>Preprints establish precedence for your work so it reduces your risk of being scooped.</li>
<li>Preprints allow you to collect feedback on your work and improve it quickly.</li>
<li>Preprints can help you to get your work published in formal academic journals.</li>
<li>Preprints can get you attention and press for your work.</li>
<li>Preprints give junior scientists and other researchers gratification that helps them handle the stress and pressure of their
first publications.</li>
</ul>
<p>The last point is underappreciated and was first pointed out to me by <a href="http://giladlab.uchicago.edu/">Yoav Gilad</a>. It takes a really long time to write a scientific paper. For a student publishing their first paper, the first feedback they get is often (a) delayed by several months and (b) negative and in the form of a referee report. This can have a major impact on the motivation of those students to keep working on projects. Preprints allow students to have an immediate product they can point to as an accomplishment, allow them to get positive feedback along with constructive or negative feedback, and can ease the pain of difficult referee reports or rejections.</p>
<p><strong>Choose the journal that maximizes your visibility</strong></p>
<p>You should try to publish your work in the best journals for your field. There are a couple of reasons for this. First, being a
scientist is both a calling and a career. To advance your career, you need visibility among your scientific peers and among the scientists
who will be judging you for grants and promotions. The best place to do this is by publishing in the top journals in your field. The
important thing is to do your best to do good work and submit it to these journals, even if the results aren’t the most “sexy”. Don’t
adapt your workflow to the journal, but don’t ignore the career implications either. Do this even if the journals are closed access.
There are ways to make your work accessible and you will both raise your profile and disseminate your results to the broadest audience.</p>
<p><strong>Share your work on social media</strong></p>
<p>Academic journals are good for disseminating your work to the appropriate scientific community. As a modern scientist you have other avenues and other communities - like the general public - that you would like to reach with your work. Once your paper has been published in a preprint or in a journal, be sure to share your work through appropriate social media channels. This will also help you develop facility in coming up with one line or one figure that best describes what you think you have published so you can share it on social media sites like Twitter.</p>
<h3 id="preprints-and-criticism">Preprints and criticism</h3>
<p>See the section on scientific blogging for how to respond to criticism of your preprints online.</p>
<h2 id="publishing---what-tools-should-i-use">Publishing - what tools should I use?</h2>
<h3 id="preprint-servers">Preprint servers</h3>
<p>Here are a few preprint servers you can use.</p>
<ol>
<li><a href="http://arxiv.org/">arXiv</a> (free) - primarily takes math/physics/computer science papers. You can submit papers and they are reviewed and posted within a couple of days. It is important to note that once you submit a paper here, you can not take it down. But you can submit revisions
to the paper which are tracked over time. This outlet is followed by a large number of journalists and scientists.</li>
<li><a href="http://biorxiv.org/">biorXiv</a> (free) - primarily takes biology focused papers. They are pretty strict about which categories you can submit to. You can submit papers and they are reviewed and posted within a couple of days. biorxiv also allows different versions of manuscripts, but some folks have had trouble with their versioning system, which can be a bit tricky for keeping your paper coordinated with your publication. bioXiv is pretty carefully followed by the biological and computational biology communities.</li>
<li><a href="https://peerj.com/preprints/">Peerj</a> (free) - takes a wide range of different types of papers. They will again review your preprint quickly and post it online. You can also post different versions of your manuscript with this system. This system is newer and so has fewer followers, you will need to do your own publicity if you publish your paper here.</li>
</ol>
<h3 id="journal-preprint-policies">Journal preprint policies</h3>
<p>This <a href="https://en.wikipedia.org/wiki/List_of_academic_journals_by_preprint_policy">list</a> provides information on which journals accept papers that were first posted as preprints. However, you shouldn’t let a journal’s preprint policy be the main driver of where you submit your work.</p>
<h2 id="publishing---further-tips-and-issues">Publishing - further tips and issues</h2>
<h3 id="open-vs-closed-access">Open vs. closed access</h3>
<p>Once your paper has been posted to a preprint server you need to submit it for publication. There are a number of considerations you should keep in mind when submitting papers. One of these considerations is closed versus open access. Closed access journals do not require you to pay to submit or publish your paper. But then people who want to read your paper either need to pay or have a subscription to the journal in question.</p>
<p>There has been a strong push for open access journals over the last couple of decades. There are some very good reasons justifying this type of publishing including (a) moral arguments based on using public funding for research, (b) ease of access to papers, and (c) benefits in terms of people being able to use your research. In general, most modern scientists want their work to be as widely accessible as possible. So modern scientists often opt for open access publishing.</p>
<p>Open access publishing does have a couple of disadvantages. First it is often expensive, with fees for publication ranging between <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">$1,000 and $4,000</a> depending on the journal. Second, while science is often a calling, it is also a career. Sometimes the best journals in your field may be closed access. In general, one of the most important components of an academic career is being able to publish in journals that are read by a lot of people in your field so your work will be recognized and impactful.</p>
<p>However, modern systems make both closed and open access journals reasonable outlets.</p>
<h3 id="closed-access--preprints">Closed access + preprints</h3>
<p>If the top journals in your field are closed access and you are a junior scientist then you should try to submit your papers there. But to make sure your papers are still widely accessible you can use preprints. First, you can submit a preprint before you submit the paper to the journal. Second, you can update the preprint to keep it current with the published version of your paper. This system allows you to make sure that your paper is read widely within your field, but also allows everyone to freely read the same paper. On your website, you can then link to both the published and preprint version of your paper.</p>
<h3 id="open-access">Open access</h3>
<p>If the top journal in your field is open access you can submit directly to that journal. Even if the journal is open access it makes sense to submit the paper as a preprint during the review process. You can then keep the preprint up-to-date, but this system has the advantage that the formally published version of your paper is also available for everyone to read without constraints.</p>
<h3 id="responding-to-referee-comments">Responding to referee comments</h3>
<p>After your paper has been reviewed at an academic journal you will receive referee reports. If the paper has not been outright rejected, it is important to respond to the referee reports in a timely and direct manner. Referee reports are often maddening. There is little incentive for people to do a good job refereeing and the most qualified reviewers will likely be those with a <a href="http://simplystatistics.org/2015/02/09/the-trouble-with-evaluating-anything/">conflict of interest</a>.</p>
<p>The first thing to keep in mind is that the power in the refereeing process lies entirely with the editors and referees. So when responding to referee reports, start by eliminating the impulse to argue or respond with any kind of emotion. A step-by-step process for responding to referee reports is the following.</p>
<ol>
<li>Create a Google Doc. Put in all referee and editor comments in italics.</li>
<li>Break the comments up into each discrete criticism or request.</li>
<li>In bold respond to each comment. Begin each response with “On page xx we did yy to address this comment”</li>
<li>Perform the analyses and experiments that you need to fill in the yy</li>
<li>Edit the document to reflect all of the experiments that you have performed</li>
</ol>
<p>By actively responding to each comment you will ensure you are responsive to the referees and give your paper the best chance of success. If a comment is incorrect or non-sensical, think about how you can edit the paper to remove this confusion.</p>
<h3 id="finishing">Finishing</h3>
<p>While I have advocated here for using preprints to disseminate your work, it is important to follow the process all the way through to completion. Responding to referee reports is drudgery and no one likes to do it. But in terms of career advancement preprints are almost entirely valueless until they are formally accepted for publication. It is critical to see all papers all the way through to the end of the publication cycle.</p>
<h3 id="you-arent-done">You aren’t done!</h3>
<p>Publication of your paper is only the beginning of successfully disseminating your science. Once you have published the paper, it is important to use your social media, blog, and other resources to disseminate your results to the broadest audience possible. You will also give talks, discuss the paper with colleagues, and respond to requests for data and code. The most successful papers have a long half life and the responsibilities linger long after the paper is published. But the most successful scientists continue to stay on top of requests and respond to critiques long after their papers are published.</p>
<p><em>Note:</em> Part of this chapter appeared in the Simply Statistics blog post: <a href="http://simplystatistics.org/2016/02/26/preprints-and-pppr/">“Preprints are great, but post publication peer review isn’t ready for prime time”</a></p>
A Natural Curiosity of How Things Work, Even If You're Not Responsible For Them
2016-04-08T00:00:00+00:00
http://simplystats.github.io/2016/04/08/eecom
<p>I just read Karl’s <a href="https://kbroman.wordpress.com/2016/04/08/i-am-a-data-scientist/">great
post</a>
on what it means to be a data scientist. I can’t really add much to
it, but reading it got me thinking about the Apollo 12 mission, the
second moon landing.</p>
<p>This mission is actually famous because of its launch, where the
Saturn V was struck by lightning and <a href="https://en.wikipedia.org/wiki/John_Aaron">John
Aaron</a> (played wonderfully
by Loren Dean in the movie <a href="http://www.imdb.com/title/tt0112384/">Apollo
13</a>), the flight controller in
charge of environmental, electrical, and consumables (EECOM), had to
make a decision about whether to abort the launch.</p>
<p>In this great clip from the movie <em>Failure is Not An Option</em>, the real
John Aaron describes what makes for a good EECOM flight
controller. The bottom line is that</p>
<blockquote>
<p>A good EECOM has a natural curiosity for how things work, even if you…are not responsible for them</p>
</blockquote>
<p>I think a good data scientist or statistician also has that
property. The key part of that line is the “<em>even if you are not
responsible for them</em>” part. I’ve found that a lot of being a
statistician involves nosing around in places where you’re not
supposed to be, seeing how data are collected, handled, managed,
analyzed, and reported. Focusing on the development and implementation
of methods is not enough.</p>
<p>Here’s the clip, which describes the famous “SCE to AUX” call from
John Aaron:</p>
<iframe width="640" height="480" src="https://www.youtube.com/embed/eWQIryll8y8" frameborder="0" allowfullscreen=""></iframe>
Not So Standard Deviations Episode 13 - It's Good that Someone is Thinking About Us
2016-04-07T00:00:00+00:00
http://simplystats.github.io/2016/04/07/nssd-episode-13
<p>In this episode, Hilary and I talk about the difficulties of
separating data analysis from its context, and Feather, a new file
format for storing tabular data. Also, we respond to some listener
questions and Hilary announces her new job.</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at @NSSDeviations.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="https://www.patreon.com/NSSDeviations">NSSD Patreon page</a></p>
</li>
<li>
<p><a href="https://github.com/wesm/feather/">Feather git repository</a></p>
</li>
<li>
<p><a href="https://arrow.apache.org">Apache Arrow</a></p>
</li>
<li>
<p><a href="https://google.github.io/flatbuffers/">FlatBuffers</a></p>
</li>
<li>
<p><a href="http://simplystatistics.org/2016/03/31/feather/">Roger’s blog post on feather</a></p>
</li>
<li>
<p><a href="https://www.etsy.com/shop/NausicaaDistribution">NausicaaDistribution</a></p>
</li>
<li>
<p><a href="http://www.rstats.nyc">New York R Conference</a></p>
</li>
<li>
<p><a href="https://goo.gl/J2QAWK">Every Frame a Painting</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-13-its-good-that-someone-is-thinking-about-us">Download the audio for this episode.</a></p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/257851619&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Companies are Countries, Academia is Europe
2016-04-05T00:00:00+00:00
http://simplystats.github.io/2016/04/05/corporations-academia
<p>I’ve been thinking a lot recently about the practice of data analysis
in different settings and how the environment in which you work can
affect the view you have on how things should be done. I’ve been
working in academia for over 12 years now. I don’t have any industry
data science experience, but long ago I worked as a software engineer
at <a href="http://www.northropgrumman.com/Pages/default.aspx">two</a>
<a href="http://kencast.com">companies</a>. Obviously, my experience is biased on
the academic side.</p>
<p>I’ve seen an interesting divergence between what I see being written
from data scientists in industry and my personal experience doing data
science in academia. From the industry side, I see a lot of stuff
about tooling/software and processes. This makes sense to me. Often, a
company needs/wants to move quickly and doing so requires making
decisions on a reasonable time scale. If decisions are made with data,
then the process of collecting, organizing, analyzing, and
communicating data needs to be well thought-out, systematized,
rigorous, and streamlined. If every time someone at the company had a
question the data science team developed some novel custom
coded-from-scratch solution, decisions would be made at a glacial
pace, which is probably not good for business. In order to handle this
type of situation you need solid tools and flexible workflows. You
also need agreement within the company about how things are done and
the processes that are followed.</p>
<p>Now, I don’t mean to imply that life at a company is easy, that there
isn’t politics or bureaucracy to deal with. But I see companies as much
like individual countries, with a clear (hierarchical) leadership
structure and decision-making process (okay, maybe ideal
companies). Much like in a country, it might take some time to come to
a decision about a policy or problem (e.g. health insurance), with
much negotiation and horse-trading, but once consensus is arrived at,
often the policy can be implemented across the country at a reasonable
timescale. In a company, if a certain workflow or data process can be
shown to be beneficial and perhaps improve profitability down the
road, then a decision could be made to implement it. Ultimately,
everyone within a company is in the same boat and is interested in
seeing the company succeed.</p>
<p>When I worked at a company as a software developer, I’d sometimes run
into a problem that was confusing or difficult to code. So I’d walk
down to the systems engineer’s office (the guy who wrote the
specification) and talk to him about it. We’d hash things out for a
while and then figure out a way to go forward. Often the technical
writers who wrote the documentation would come and ask me what exactly
a certain module did and I’d explain it to them. Communication was
usually quick and efficient because it usually occurred
person-to-person and because we were all on the same team.</p>
<p>Academia is more like Europe, a somewhat loose federation of states
that only communicates with each other because they have to. Each
principal investigator is a country and s/he has to engage in constant
(sometimes contentious) negotiations with other investigators
(“countries”). As a data scientist, this can be tricky because unless
I collect/generate my own data (which sometimes, <a href="http://www.ncbi.nlm.nih.gov/pubmed/18477784">I
do</a>), I have to negotiate
with another investigator to obtain the data. Even if I were
collaborating with that investigator from the very beginning of a
study, I typically have very little direct control over the data
collection process because those people don’t work for me. The result
is often, the data come to me in some format over which I had little
input, and I just have to deal with it. Sometimes this is a nice CSV
file, but often it is not.</p>
<p>In good situations, I can talk with the investigator collecting the
data and we can hash out a plan to put the data into a <a href="https://www.jstatsoft.org/article/view/v059i10">certain
format</a>. But even if
we can agree on that, often the expertise will not be available on
their end to get the data into that format, so I’ll end up having to
do it myself anyway. In not-so-good situations, I can make all the
arguments I want for an organized data collection and analysis
workflow, but if the investigator doesn’t want to do it, can’t afford
it, or doesn’t see any incentive, then it’s not going to happen. Ever.</p>
<p>However, even in the good situations, every investigator works in
their own personal way. I mean, that’s why people go into academia,
because you can “be your own boss” and work on problems that interest
you. Most people develop a process for running their group/lab that
most suits their personality. If you’re a data scientist, you need to
figure out a way to mesh with each and every investigator you
collaborate with. In addition, you need to adapt yourself to whatever
data process each investigator has developed for their group. So if
you’re working with a genomics person, you might need to learn about
BAM files. For a neuroimaging collaborator, you’ll need to know about
SPM. If one person doesn’t like tidy data, then that’s too bad. You
need to deal with it (or don’t work with them). As a result, it’s
difficult to develop a useful “system” for data science because any
system that works for one collaborator is unlikely to work for another
collaborator. In effect, each collaboration often results in a custom
coded-from-scratch solution.</p>
<p>This contrast between companies and academia got me thinking about the
<a href="https://en.wikipedia.org/wiki/Theory_of_the_firm">Theory of the
Firm</a>. This is an
economic theory that tries to explain why firms/companies develop at
all, as opposed to individuals or small groups negotiating over an
open market. My understanding is that it all comes down to how well
you can write and enforce a contract between two parties. For example,
if I need to manufacture iPhones, I can go to a contract manufacturer,
give them the designs and the precise specifications/tolerances and
they can just produce millions of them. However, if I need to <em>design</em>
the iPhone, it’s a bit harder for me to go to another company and just
say “Design an awesome new phone!” That kind of contract is difficult
to write down, much less enforce. That other company will be operating
off of different incentives from me and will likely not produce what I
want. It’s probably better if I do the design work
in-house. Ultimately, once the transaction costs of having two
different companies work together become too high, it makes more sense
for a company to do the work in-house.</p>
<p>I think collaborating on data analysis is a high transaction cost
activity. Companies have an advantage in this realm to the extent that
they can hire lots of data scientists to work in-house. Academics that
are well-funded and have large labs can often hire a data analyst to
work for them. This is good because it makes a well-trained person
available at low transaction cost, but this setup is the
exception. PIs with smaller labs barely have enough funding to do
their experiments and so either have to analyze the data themselves
(for which they may not be appropriately trained) or collaborate with
someone willing to do it. Large academic centers often have research
cores that provide data analysis services, but this doesn’t change the
fact that data analysis that occurs “outside the company” dramatically
increases the transaction costs of doing the research. Because data
analysis is a highly iterative process, each time you have to go back
and forth with an outside entity, the costs go up.</p>
<p>I think it’s possible to see a time when data analysis can effectively
be made external. I mean, Apple used to manufacture all its products,
but has shifted to contract manufacturing to great success. But I
think we will have to develop a much better understanding of the data
analysis process before we see the transaction costs start to go down.</p>
New Feather Format for Data Frames
2016-03-31T00:00:00+00:00
http://simplystats.github.io/2016/03/31/feather
<p>This past Tuesday, Hadley Wickham and Wes McKinney
<a href="http://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/">announced</a>
a new binary file format specifically for storing data frames.</p>
<blockquote>
<p>One thing that struck us was that, while R’s data frames and Python’s pandas data frames utilize different internal memory representations, the semantics of their user data types are mostly the same. In both R and pandas, data frames contain lists of named, equal-length columns, which can be numeric, boolean, and date-and-time, categorical (factors), or string. Additionally, these columns must support missing (null) values.</p>
</blockquote>
<p>Their work builds on the Apache Arrow project, which specifies a
format for tabular data. There is currently a Python and R
implementation for reading/writing these files but other
implementations could easily be built as the file format looks pretty
straightforward. The git repository is
<a href="https://github.com/wesm/feather/">here</a>.</p>
<p>Initial thoughts:</p>
<ul>
<li>
<p>The possibilities for passing data between languages is I think the
main point here. The potential for passing data through a pipeline
without worrying about the specifics of different languages could
make for much more powerful analyses where different tools are used
for whatever they tend to do best. Essentially, as long as data can
be made tidy going in and coming out, there should not be a
communication issue between languages.</p>
</li>
<li>
<p>R users might be wondering what the big deal is–we already have a
binary serialization format (XDR). But R’s serialization format is
meant to cover all possible R objects. Feather’s focus on data
frames allows for the removal of many of the annoying (but seldom
used) complexities of R objects and optimizing a very commonly used
data format.</p>
</li>
<li>
<p>In my testing, there’s a noticeable speed difference between reading
a feather file and reading an (uncompressed) R workspace file
(feather seems about 2x faster). I didn’t time writing files, but
the difference didn’t seem as noticeable there. That said, it’s not
clear to me that performance on files is the main point here.</p>
</li>
<li>
<p>Given the underlying framework and representation, there seem to be
some interesting possibilities for low-memory environments.</p>
</li>
</ul>
<p>I’ve only had a chance to quickly look at the code but I’m excited to
see what comes next.</p>
How to create an AI startup - convince some humans to be your training set
2016-03-30T00:00:00+00:00
http://simplystats.github.io/2016/03/30/humans-as-training-set
<p>The latest trend in data science is <a href="https://en.wikipedia.org/wiki/Artificial_intelligence">artificial intelligence</a>. It has been all over the news for tackling a bunch of interesting questions. For example:</p>
<ul>
<li><a href="https://deepmind.com/alpha-go.html">AlphaGo</a> <a href="http://www.techrepublic.com/article/how-googles-deepmind-beat-the-game-of-go-which-is-even-more-complex-than-chess/">beat</a> one of the top Go players in the world in what has been called a major advance for the field.</li>
<li>Microsoft created a chatbot <a href="http://techcrunch.com/2016/03/23/microsofts-new-ai-powered-bot-tay-answers-your-tweets-and-chats-on-groupme-and-kik/">Tay</a> that ultimately <a href="http://www.bbc.com/news/technology-35902104">went very very wrong</a>.</li>
<li>Google and a number of others are working on <a href="https://www.google.com/selfdrivingcar/">self driving cars</a>.</li>
<li>Facebook is creating an artificial intelligence based <a href="http://www.engadget.com/2015/08/26/facebook-messenger-m-assistant/">virtual assistant called M</a></li>
</ul>
<p>Almost all of these applications are based (at some level) on using variations on <a href="http://neuralnetworksanddeeplearning.com/">neural networks and deep learning</a>. These models are used like any other statistical or machine learning model. They involve a prediction function that is based on a set of parameters. Using a training data set, you estimate the parameters. Then when you get a new set of data, you push it through the prediction function using those estimated parameters and make your predictions.</p>
<p>So why does deep learning do so well on problems like voice recognition, image recognition, and other complicated tasks? The main reason is that these models involve hundreds of thousands or millions of parameters, that allow the model to capture even very subtle structure in large scale data sets. This type of model can be fit now because (a) we have huge training sets (think all the pictures on Facebook or all voice recordings of people using Siri) and (b) we have fast computers that allow us to estimate the parameters.</p>
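<p>The deep networks behind these products are vastly bigger, but the mechanics are the same fit/predict cycle as any other model. Here is a toy sketch in R, using the small <code class="language-plaintext highlighter-rouge">nnet</code> package and the built-in iris data purely for illustration (this is obviously nowhere near the scale of the systems described above):</p>
<pre><code class="language-r">library(nnet)
set.seed(1)

## "training set": recorded examples of the behavior we want to imitate
train <- iris[sample(nrow(iris), 100), ]

## estimate the parameters (weights) of the prediction function from the training data
fit <- nnet(Species ~ ., data = train, size = 2, trace = FALSE)

## push new data through the fitted prediction function to get predictions
predict(fit, newdata = iris[1:5, ], type = "class")
</code></pre>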
<p>Almost all of the high-profile examples of “artificial intelligence” we are hearing about involve this type of process. This means that the machine is “learning” from examples of how humans behave. The algorithm itself is a way to estimate subtle structure from collections of human behavior.</p>
<p>The result is that the typical trajectory for an AI business is:</p>
<ol>
<li>Get a large collection of humans to perform some repetitive but possibly complicated behavior (play thousands of games of Go, or answer requests from people on Facebook messenger, or label pictures and videos, or drive cars.)</li>
<li>Record all of the actions the humans perform to create a training set.</li>
<li>Feed these data into a statistical model with a huge number of parameters - made possible by having a huge training set collected from the humans in steps 1 and 2.</li>
<li>Apply the algorithm to perform the repetitive task and cut the humans out of the process.</li>
</ol>
<p>The question is how do you get the humans to perform the task for you? One option is to collect data from humans who are using your product (think Facebook image tagging). The other, more recent phenomenon, is to farm the task out to a large number of contractors (think <a href="http://www.theguardian.com/commentisfree/2015/jul/26/will-we-get-by-gig-economy">gig economy</a> jobs like driving for Uber, or responding to queries on Facebook).</p>
<p>The interesting thing about the latter case is that in the short term it produces a market for gigs for humans. But in the long term, by performing those tasks, the humans are putting themselves out of a job. This played out in a relatively public way just recently with a service called <a href="http://www.fastcompany.com/3058060/this-is-what-it-feels-like-when-a-robot-takes-your-job">GoButler</a> that used its employees to train a model and then replaced them with that model.</p>
<p>It will be interesting to see how many areas of employment this type of approach takes over. It is also interesting to think about how much value each task you perform for a company like that is worth to the training set. It will also be interesting if there is a legal claim for the gig workers at these companies to make that their labor helped “create the value” at the companies that replace them.</p>
Not So Standard Deviations Episode 12 - The New Bayesian vs. Frequentist
2016-03-26T00:00:00+00:00
http://simplystats.github.io/2016/03/26/nssd-episode-12
<p>In this episode, Hilary and I discuss the new direction for the
journal Biostatistics, the recent fracas over ggplot2 and base
graphics in R, and whether collecting more data is always better than
collecting less (fewer?) data. Also, Hilary and Roger respond to some
listener questions and more free advertising.</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at @NSSDeviations.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://goo.gl/am6I3r">Jeff Leek on why he doesn’t use ggplot2</a></p>
</li>
<li>
<p>David Robinson on <a href="http://varianceexplained.org/r/why-I-use-ggplot2/">why he uses ggplot2</a></p>
</li>
<li>
<p><a href="http://goo.gl/6iEB2I">Nathan Yau’s post comparing ggplot2 and base graphics</a></p>
</li>
<li>
<p><a href="https://goo.gl/YuhFgB">Biostatistics Medium post</a></p>
</li>
<li>
<p><a href="http://goo.gl/tXNdCA">Photoviz</a></p>
</li>
<li>
<p><a href="https://twitter.com/PigeonAir">PigeonAir</a></p>
</li>
<li>
<p><a href="https://goo.gl/jqlg0G">I just want to plot()</a></p>
</li>
<li>
<p><a href="https://goo.gl/vvCfkl">Hilary and Rush Limbaugh</a></p>
</li>
<li>
<p><a href="http://imgur.com/a/K4RWn">Deep learning training set</a></p>
</li>
<li>
<p><a href="http://patreon.com/NSSDeviations">NSSD Patreon Page</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-12-the-new-bayesian-vs-frequentist">Download the audio for this episode.</a></p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/255099493&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
The future of biostatistics
2016-03-24T00:00:00+00:00
http://simplystats.github.io/2016/03/24/the-future-of-biostatistics
<p>Starting in January my colleague <a href="https://twitter.com/drizopoulos">Dimitris Rizopoulos</a> and I took over as co-editors of the journal
Biostatistics. We are pretty fired up to try some new things with the journal and to make sure that the most important advances
in statistical methodology and application have a good home.</p>
<p>We started a blog for the journal and our first post is here: <a href="https://medium.com/@biostatistics/the-future-of-biostatistics-5aa8246e14b4#.uk1gat5sr">The future of Biostatistics</a>. Thanks to <a href="https://twitter.com/kwbroman/status/695306823365169154">Karl Broman
and his family</a> we also have the twitter handle <a href="https://twitter.com/biostatistics">@biostatistics</a>. Follow us there to hear about all the new stuff we are rolling out.</p>
The Evolution of a Data Scientist
2016-03-21T00:00:00+00:00
http://simplystats.github.io/2016/03/21/dataScientistEvo-jaffe
<p><em>Editor’s note: This post is a guest post by <a href="http://aejaffe.com">Andrew Jaffe</a></em></p>
<p>“How do you get to Carnegie Hall? Practice, practice, practice.” (“The Wit Parade” by E.E. Kenyon on March 13, 1955)</p>
<p>”..an extraordinarily consistent answer in an incredible number of fields … you need to have practiced, to have apprenticed, for 10,000 hours before you get good.” (Malcolm Gladwell, on Outliers)</p>
<p>I have been a data scientist for the last seven or eight years, probably before “data science” existed as a field. I work almost exclusively in the R statistical environment which I first toyed with as a sophomore in college, which ramped up through graduate school. I write all of my code in Notepad++ and make all of my plots with base R graphics, over newer and probably easier approaches, like R Studio, ggplot2, and R Markdown. Every so often, someone will email asking for code used in papers for analysis or plots, and I dig through old folders to track it down. Every time this happens, I come to two realizations: 1) I used to write fairly inefficient and not-so-great code as an early PhD student, and 2) I write a lot of R code.</p>
<p>I think there are some pretty good ways of measuring success and growth as a data scientist – you can count software packages and their user-bases, projects and papers, citations, grants, and promotions. But I wanted to calculate one more metric to add to the list – how much R code have I written in the last 8 years? I have been using the Joint High Performance Computing Exchange (JHPCE) at Johns Hopkins University since I started graduate school, so all of my R code was pretty much all in one place. I therefore decided to spend my Friday night drinking some Guinness and chronicling my journey using R and evolution as a data scientist.</p>
<p>I found all of the .R files across my four main directories on the computing cluster (after copying over my local scripts), and then removed files that came with packages, that belonged to other users, and that resulted from poorly designed simulation and permutation analyses (perm1.R,…,perm100.R) before I learned how to use array jobs, and then extracted the creation date, last modified date, file size, and line count for each R script. Based on this analysis, I have written 3257 R scripts across 13.4 megabytes and 432,753 lines of code (including whitespace and comments) since February 22, 2009.</p>
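<p>For anyone who wants to try this on their own scripts, the bookkeeping step is only a few lines of base R. This is a rough sketch with placeholder directory paths; Andrew’s actual code is in the GitHub repository linked at the end of the post:</p>
<pre><code class="language-r">## find all .R files under a set of directories (paths are placeholders)
dirs <- c("~/projects", "~/scratch")
scripts <- list.files(dirs, pattern = "\\.R$", recursive = TRUE, full.names = TRUE)

## file sizes plus creation and last-modified dates
info <- file.info(scripts)[, c("size", "mtime", "ctime")]

## line counts, including whitespace and comments
info$lines <- sapply(scripts, function(f) length(readLines(f, warn = FALSE)))
</code></pre>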
<p>I found that my R coding output has generally increased over time when tabulated by month (number of scripts: p=6.3e-7, size of files: p=3.5e-9, and number of lines: p=5.0e-9). These metrics of coding – number, size, and lines – also suggest that, on average, I wrote the most code during my PhD (p-value range: 1.7e-4 to 1.8e-7). Interestingly, the changes in output over time were surprisingly consistent across the three phases of my academic career: PhD, postdoc, and faculty (see Figure 1) – you can see the initial dropoff in production during the first one or two months as I transitioned to a postdoc at the Lieber Institute for Brain Development after finishing my PhD. My output rate has dropped slightly as a faculty member as I started working with doctoral students who took over the analyses of some projects (month-by-output interaction p-values: 5.3e-4, 0.002, and 0.03, respectively, for number, size, and lines). The mean coding output – on average, how much code it takes for a single analysis – also increased over time and slightly decreased at LIBD, although to lesser extents (all p-values were between 0.01 and 0.05). I was actually surprised that coding output increased – rather than decreased – over time, as any gains in coding efficiency were probably canceled out by my oftentimes more modular analyses at LIBD.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-03-21/sizeVsMonth_rCode.jpg" alt="Figure 1: Coding output over time. Vertical bars separate my PhD, postdoc, and faculty jobs" /></p>
<p>I also looked at coding output by hour of the day to better characterize my working habits – the output per hour is shown stratified by the two eras, each about three years long (Figure 2). As expected, I never really work much in the morning – very little work gets done before 8AM – and little has changed since I was a second-year PhD student. As a faculty member, I have the highest output between 9AM-3PM. The trough between 4PM and 7PM likely corresponds to walking the dog we got three years ago, working out, and cooking (and eating) dinner. The output then increases steadily from 8PM-12AM, where I can work largely uninterrupted from meetings and people dropping by my office, with occasional days (or nights) working until 1AM.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-03-21/sizeVsHour_rCode.jpg" alt="Figure 2: Coding output by hour of day. X-axis starts at 5AM to divide the day into a more temporal order." /></p>
<p>Lastly, I examined R coding output by day of the week. As expected, the lowest output occurred over the weekend, especially on Saturdays. Interestingly, I tended to increase output later in the work week as a faculty member, and also work a little more on Sundays and Mondays, compared to a PhD student.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-03-21/sizeVsDay_rCode.jpg" alt="Figure 3: Coding output by day of week." /></p>
<p>Looking at the code itself, of the 432,753 lines, 84,343 were newlines (19.5%), 66,900 were lines that were exclusively comments (15.5%), and an additional 6,994 lines contained comments following R code (1.6%). Some of my most used syntax and symbols, as line counts containing at least one symbol, were pretty much as expected (dropping commas and requiring whitespace between characters):</p>
<table>
<tbody>
<tr>
<td>Code</td>
<td>Count</td>
<td>Code</td>
<td>Count</td>
</tr>
<tr>
<td>=</td>
<td>175604</td>
<td>==</td>
<td>5542</td>
</tr>
<tr>
<td>#</td>
<td>48763</td>
<td><</td>
<td>5039</td>
</tr>
<tr>
<td><-</td>
<td>16492</td>
<td>for(i</td>
<td>5012</td>
</tr>
<tr>
<td>{</td>
<td>11879</td>
<td>&</td>
<td>4803</td>
</tr>
<tr>
<td>}</td>
<td>11612</td>
<td>the</td>
<td>4734</td>
</tr>
<tr>
<td>in</td>
<td>10587</td>
<td>function(x)</td>
<td>4591</td>
</tr>
<tr>
<td>##</td>
<td>8508</td>
<td>###</td>
<td>4105</td>
</tr>
<tr>
<td>~</td>
<td>6948</td>
<td>-</td>
<td>4034</td>
</tr>
<tr>
<td>></td>
<td>5621</td>
<td>%in%</td>
<td>3896</td>
</tr>
</tbody>
</table>
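<p>The counts in the table above are the number of lines containing each token, which can be tallied along these lines (a rough sketch; <code class="language-plaintext highlighter-rouge">scripts</code> is the vector of file paths from the earlier sketch, and the exact comma/whitespace rules are in the repository below):</p>
<pre><code class="language-r">## pool every line of code and count how many lines contain each token literally
all_lines <- unlist(lapply(scripts, readLines, warn = FALSE))

tokens <- c("=", "#", "<-", "{", "}", "%in%")
counts <- sapply(tokens, function(tok) sum(grepl(tok, all_lines, fixed = TRUE)))
counts
</code></pre>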
<p>My code is available on GitHub: https://github.com/andrewejaffe/how-many-lines (after removing file paths and names, as many of the projects are currently unpublished and many files are placed in folders named by collaborator), so feel free to give it a try and see how much R code you’ve written over your career. While there are probably a lot more things to play around with and explore, this was about all the time I could commit to this, given other responsibilities (I’m not on sabbatical like <a href="http://jtleek.com">Jeff Leek</a>…). All in all, this was a pretty fun experience and largely reflected, with data, how my R skills and experience have progressed over the years.</p>
Not So Standard Deviations Episode 11 - Start and Stop
2016-03-14T00:00:00+00:00
http://simplystats.github.io/2016/03/14/nssd-episode-11
<p>We’ve started a Patreon page! Now you can support the podcast directly by going to <a href="http://patreon.com/NSSDeviations">our page</a> and making a pledge. This will help Hilary and me build the podcast, add new features, and get some better equipment.</p>
<p>Episode 11 is an all craft episode of <em>Not So Standard Deviations</em>, where Hilary and Roger discuss starting and ending a data analysis. What do you do at the very beginning of an analysis? Hilary and Roger talk about some of the things that seem to come up all the time. Also up for discussion is the American Statistical Association’s statement on <em>p</em> values, famous statisticians on Twitter, and evil data scientists on TV. Plus two new things for free advertising.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://patreon.com/NSSDeviations">NSSD Patreon Page</a></p>
</li>
<li>
<p><a href="https://twitter.com/deleeuw_jan">Jan de Leeuw</a></p>
</li>
<li>
<p><a href="https://twitter.com/BatesDmbates">Douglas Bates</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Sports_Night">Sports Night</a></p>
</li>
<li>
<p><a href="http://goo.gl/JFz7ic">ASA’s statement on p values</a></p>
</li>
<li>
<p><a href="http://goo.gl/O8kL60">Basic and Applied Psychology Editorial banning p values</a></p>
</li>
<li>
<p><a href="http://www.seriouseats.com/vegan-experience">J. Kenji Alt’s Vegan Experience</a></p>
</li>
<li>
<p><a href="http://fieldworkfail.com/">fieldworkfail</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-11-start-and-stop">Download the audio for this episode</a>.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/251825714&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Not So Standard Deviations Episode 10 - It's All Counterexamples
2016-03-02T00:00:00+00:00
http://simplystats.github.io/2016/03/02/nssd-episode-10
<p>In the latest episode of Not So Standard Deviations Hilary and I talk about the motivation behind the <a href="https://github.com/hilaryparker/explainr">explainr</a> package and the general usefulness of automated reporting and interpretation of statistical tests. Also, Roger struggles to come up with a quick and easy way to explain why statistics is useful when it sometimes doesn’t produce any different results.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Show notes:</p>
<ul>
<li>
<p>The <a href="https://github.com/hilaryparker/explainr">explainr</a> package</p>
</li>
<li>
<p><a href="https://google.github.io/CausalImpact/CausalImpact.html">Google’s CausalImpact package</a></p>
</li>
<li>
<p><a href="http://www.wsj.com/articles/SB10001424053111903480904576512250915629460">Software is Eating the World</a></p>
</li>
<li>
<p><a href="http://allendowney.blogspot.com/2015/12/many-rules-of-statistics-are-wrong.html">Many Rules of Statistics are Wrong</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-10-its-all-counterexamples">Download the audio for this episode</a>.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/249517993&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Preprints are great, but post publication peer review isn't ready for prime time
2016-02-26T00:00:00+00:00
http://simplystats.github.io/2016/02/26/preprints-and-pppr
<p>The current publication system works something like this:</p>
<h3 id="coupled-review-and-publication">Coupled review and publication</h3>
<ol>
<li>You write a paper</li>
<li>You submit it to a journal</li>
<li>It is peer reviewed privately</li>
<li>The paper is accepted or rejected
a. If rejected go back to step 2 and start over
b. If accepted it will be published</li>
<li>If published then people can read it</li>
</ol>
<p>This system has several major disadvantages that bother scientists. It means
all research appears on a lag (whatever the time in peer review is). It can be
a major lag time if the paper is sent to “top tier journals” and rejected then filters
down to “lower tier” journals before ultimate publication. Another disadvantage
is that there are two options for most people to publish their papers: (a) in closed access journals where
it doesn’t cost anything to publish but then the articles are behind paywalls and (b)
in open access journals where anyone can read them but it costs money to publish. Especially
for junior scientists or folks without resources, this creates a difficult choice because
they <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">might not be able to afford open access fees</a>.</p>
<p>For a number of years some fields like physics (with the <a href="http://arxiv.org/">arxiv</a>) and
economics (with <a href="http://www.nber.org/papers.html">NBER</a>) have solved this problem
by decoupling peer review and publication. In these fields the system works like this:</p>
<h3 id="decoupled-review-and-publication">Decoupled review and publication</h3>
<ol>
<li>You write a paper</li>
<li>You post a preprint
a. Everyone can read and comment</li>
<li>You submit it to a journal</li>
<li>It is peer reviewed privately</li>
<li>The paper is accepted or rejected
a. If rejected go back to step 2 and start over
b. If accepted it will be published</li>
</ol>
<p>Lately there has been a growing interest in this same system in molecular and computational biology. I think
this is a really good thing, because it makes it easier to publish papers more quickly and doesn’t cost researchers anything to publish. That is
why the papers my group publishes all show up on <a href="http://biorxiv.org/search/author1%3AJeffrey%2BLeek%2B">biorxiv</a> or <a href="http://arxiv.org/find/stat/1/au:+Leek_J/0/1/0/all/0/1">arxiv</a> first.</p>
<p>While I think this decoupling is great, there seems to be a push for this decoupling and at the same time
a move to post publication peer review.
I used to argue pretty strongly for <a href="http://simplystatistics.org/2012/10/04/should-we-stop-publishing-peer-reviewed-papers/">post-publication peer review</a> but Rafa <a href="http://simplystatistics.org/2012/10/08/why-we-should-continue-publishing-peer-reviewed-papers/">set me
straight</a> and pointed
out that at least with peer review every paper that gets submitted gets evaluated by <em>someone</em> even if the paper
is ultimately rejected.</p>
<p>One of the risks of post publication peer review is that there is no incentive to peer review in the current system. In a paper a
few years ago I actually showed that under an economic model for closed peer review the Nash equilibrium is for <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026895">no one to peer review at all</a>. We showed in that same paper that under
open peer review there is an increase in the amount of time spent reviewing, but the effect was relatively small. Moreover
the dangers of open peer review are clear (junior people reviewing senior people and being punished for it) while the
benefits (potentially being recognized for insightful reviews) are much hazier. Even the most vocal proponents of
post publication peer review <a href="http://www.ncbi.nlm.nih.gov/myncbi/michael.eisen.1/comments/">don’t do it that often</a> when given the chance.</p>
<p>The reason is that everyone in academics already have a lot of things they are asked to do. Many review papers either out
of a sense of obligation or because they want to be in the good graces of a particular journal. Without this system in place
there is a strong chance that peer review rates will drop and only a few papers will get reviewed. This will ultimately decrease
the accuracy of science. In our (admittedly contrived/simplified) <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026895">experiment</a> on peer review, accuracy went from 39% to 78% after solutions were reviewed. You might argue that only “important” papers should be peer reviewed, but then you are back in the camp of glamour. Say what you want about glamour journals. They are for sure biased by the names of the people submitting the papers there. But it is <em>possible</em> for someone to get a paper in no matter who they are. If we go to a system where there is no curation through a journal-like mechanism then popularity/twitter followers/etc. will drive readers. I’m not sure that is better than where we are now.</p>
<p>So while I think pre-prints are a great idea, I’m still waiting to see a system that beats pre-publication review for maintaining scientific quality (even though it may just be an <a href="http://simplystatistics.org/2015/02/09/the-trouble-with-evaluating-anything/">impossible problem</a>).</p>
Spreadsheets: The Original Analytics Dashboard
2016-02-23T08:42:30+00:00
http://simplystats.github.io/2016/02/23/spreadsheets-the-original-analytics-dashboard
<p>Soon after my discussion with Hilary Parker and Jenny Bryan about spreadsheets on <em><a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a></em>, Brooke Anderson forwarded me <a href="https://backchannel.com/a-spreadsheet-way-of-knowledge-8de60af7146e#.gj4f2bod4">this article</a> written by Steven Levy about the original granddaddy of spreadsheets, <a href="https://en.wikipedia.org/wiki/VisiCalc">VisiCalc</a>. Actually, the real article was written back in 1984 as so-called microcomputers were just getting their start. VisiCalc was originally written for the Apple II computer and notable competitors at the time included <a href="https://en.wikipedia.org/wiki/Lotus_1-2-3">Lotus 1-2-3</a> and Microsoft <a href="https://en.wikipedia.org/wiki/Multiplan">Multiplan</a>, all since defunct.</p>
<p>It’s interesting to see Levy’s perspective on spreadsheets back then and to compare it to the current thinking about data, data science, and reproducibility in science. The problem back then was that “ledger sheets” (what we might now call spreadsheets), which contained numbers and calculations related to businesses, were tedious to make and keep up to date.</p>
<blockquote>
<p>Making spreadsheets, however necessary, was a dull chore best left to accountants, junior analysts, or secretaries. As for sophisticated “modeling” tasks – which, among other things, enable executives to project costs for their companies – these tasks could be done only on big mainframe computers by the data-processing people who worked for the companies Harvard MBAs managed.</p>
</blockquote>
<p>You can see one issue here: Spreadsheets/Ledgers were a “dull chore”, and best left to junior people. However, the “real” computation was done by the people in the “data processing” center on big mainframes. So what exactly does that leave for the business executive to do?</p>
<p>Note that the way of doing things back then was effectively reproducible, because the presentation (ledger sheets printed on paper) and the computation (data processing on mainframes) were separated.</p>
<p>The impact of the microcomputer-based spreadsheet program appears profound.</p>
<blockquote>
<p id="9424" class="graf--p graf-after--p">
Already, the spreadsheet has redefined the nature of some jobs; to be an accountant in the age of spreadsheet program is — well, almost sexy. And the spreadsheet has begun to be a forceful agent of decentralization, breaking down hierarchies in large companies and diminishing the power of data processing.
</p>
<p class="graf--p graf-after--p">
There has been much talk in recent years about an “entrepreneurial renaissance” and a new breed of risk-taker who creates businesses where none previously existed. Entrepreneurs and their venture-capitalist backers are emerging as new culture heroes, settlers of another American frontier. Less well known is that most of these new entrepreneurs depend on their economic spreadsheets as much as movie cowboys depend on their horses.
</p>
</blockquote>
<p class="graf--p graf-after--p">
If you replace "accountant" with "statistician" and "spreadsheet" with "big data" and you are magically teleported into 2016.
</p>
<p class="graf--p graf-after--p">
The way I see it, in the early 80's, spreadsheets satisfied the never-ending desire that people have to interact with data. Now, with things like tablets and touch-screen phones, you can literally "touch" your data. But it took microcomputers to get to a certain point before interactive data analysis could really be done in a way that we recognize today. Spreadsheets tightened the loop between question and answer by cutting out the Data Processing department and replacing it with an Apple II (or an IBM PC, if you must) right on your desk.
</p>
<p class="graf--p graf-after--p">
Of course, the combining of presentation with computation comes at a cost of reproducibility and perhaps quality control. Seeing the description of how spreadsheets were originally used, it seems totally natural to me. It is not unlike today's analytic dashboards that give you a window into your business and allow you to "model" various scenarios by tweaking a few numbers or formulas. Over time, people took spreadsheets to all sorts of extremes, using them for purposes for which they were not originally designed, and problems naturally arose.
</p>
<p class="graf--p graf-after--p">
So now, we are trying to separate out the computation and presentation bits a little. Tools like knitr and R and shiny allow us to do this and to bring them together with a proper toolchain. The loss in interactivity is only slight because of the power of the toolchain and the speed of computers nowadays. Essentially, we've brought back the Data Processing department, but have staffed it with robots and high speed multi-core computers.
</p>
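<p>To make that last point a bit more concrete, here is a minimal sketch of what that separation looks like in the R toolchain mentioned above; the file name is made up and this is only an illustration, not something from Levy's article:</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;"># the computation lives in an R Markdown file (hypothetical name) and the
# presentation is regenerated from the code, rather than edited by hand
library(rmarkdown)
render("quarterly_report.Rmd")</pre>
</td>
</tr>
</table>
</div>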
Non-tidy data
2016-02-17T15:47:23+00:00
http://simplystats.github.io/2016/02/17/non-tidy-data
<p>During the discussion that followed the ggplot2 posts from David and me last week, we started talking about tidy data and the man himself noted that matrices are often useful instead of <a href="http://vita.had.co.nz/papers/tidy-data.pdf">“tidy data”</a> and I mentioned there might be other data that are usefully “non tidy”. Here I will be using tidy/non-tidy according to Hadley’s definition. So tidy data have:</p>
<ul>
<li>One variable per column</li>
<li>One observation per row</li>
<li>Each type of observational unit forms a table</li>
</ul>
<p>I push this approach in my <a href="https://github.com/jtleek/datasharing">guide to data sharing</a> and in a lot of my personal work. But note that non-tidy data can definitely be already processed, cleaned, organized and ready to use.</p>
<blockquote class="twitter-tweet" data-width="550">
<p lang="en" dir="ltr">
<a href="https://twitter.com/hadleywickham">@hadleywickham</a> <a href="https://twitter.com/drob">@drob</a> <a href="https://twitter.com/mark_scheuerell">@mark_scheuerell</a> I'm saying that not all data are usefully tidy (and not just matrices) so I care more abt flexibility
</p>
<p>
— Jeff Leek (@jtleek) <a href="https://twitter.com/jtleek/status/698247927706357760">February 12, 2016</a>
</p>
</blockquote>
<p>This led to a very specific blog request:</p>
<blockquote class="twitter-tweet" data-width="550">
<p lang="en" dir="ltr">
<a href="https://twitter.com/jtleek">@jtleek</a> <a href="https://twitter.com/drob">@drob</a> I want a blog post on non-tidy data!
</p>
<p>
— Hadley Wickham (@hadleywickham) <a href="https://twitter.com/hadleywickham/status/698251883685646336">February 12, 2016</a>
</p>
</blockquote>
<p>So I thought I’d talk about a couple of reasons why data are usefully non-tidy. The basic reason is that I usually take a <a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">problem first, not solution backward</a> approach to my scientific research. In other words, the goal is to solve a particular problem and the format I chose is the one that makes it most direct/easy to solve that problem, rather than one that is theoretically optimal. To illustrate these points I’ll use an example from my area.</p>
<p><strong>Example data</strong></p>
<p>Often you want data in a matrix format. One good example is gene expression data or data from another high-dimensional experiment. David talks about one such example in <a href="http://varianceexplained.org/r/tidy-genomics/">his post here</a>. He makes the (valid) point that for students who aren’t going to do genomics professionally, it may be more useful to learn an abstract tool such as tidy data/dplyr. But for those working in genomics, this can make you do unnecessary work in the name of theory/abstraction.</p>
<p>He analyzes the data in that post by first tidying the data.</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">library(dplyr)
library(tidyr)
library(stringr)
library(readr)
library(broom)
cleaned_data = original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
mutate_each(funs(trimws), name:systematic_name) %>%
select(-number, -GID, -YORF, -GWEIGHT) %>%
gather(sample, expression, G0.05:U0.3) %>%
separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE)</pre>
</td>
</tr>
</table>
</div>
<p>It isn’t 100% tidy as data of different types are in the same data frame (gene expression and metadata/phenotype data belong in different tables). But it’s close enough for our purposes. Now suppose that you wanted to fit a model and test for association between the “rate” variable and gene expression for each gene. You can do this with David’s tidy data set, dplyr, and the broom package like so:</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">rate_coeffs = cleaned_data %>% group_by(name) %>%
do(fit = lm(expression ~ rate + nutrient, data = .)) %>%
tidy(fit) %>%
dplyr::filter(term=="rate")</pre>
</td>
</tr>
</table>
</div>
<p>On my computer we get something like:</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">system.time( cleaned_data %>% group_by(name) %>%
+ do(fit = lm(expression ~ rate + nutrient, data = .)) %>%
+ tidy(fit) %>%
+ dplyr::filter(term=="rate"))
|==========================================================|100% ~0 s remaining
user system elapsed
12.431 0.258 12.364</pre>
</td>
</tr>
</table>
</div>
<p>Let’s now try that analysis a little bit differently. As a first step, let’s store the data in two separate tables: a table of “phenotype information” and a matrix of “expression levels”. This is the more common format used for this type of data. Here is the code to do that:</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">expr = original_data %>%
select(grep("[0:9]",names(original_data)))
rownames(expr) = original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
select(systematic_name) %>% mutate_each(funs(trimws),systematic_name) %>% as.matrix()
vals = data.frame(vals=names(expr))
pdata = separate(vals,vals,c("nutrient", "rate"), sep = 1, convert = TRUE)
expr = as.matrix(expr)</pre>
</td>
</tr>
</table>
</div>
<p>If we leave the data in this format, we can get the model fits and the p-values using some simple linear algebra.</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">expr = as.matrix(expr)
mod = model.matrix(~ rate + as.factor(nutrient),data=pdata)
rate_betas = expr %*% mod %*% solve(t(mod) %*% mod)</pre>
</td>
</tr>
</table>
</div>
<p>This gives the same answer after re-ordering (here <em>ind</em> is the index that re-orders the per-gene results so the two sets of estimates line up):</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">all(abs(rate_betas[,2]- rate_coeffs$estimate[ind]) < 1e-5,na.rm=T)
[1] TRUE</pre>
</td>
</tr>
</table>
</div>
<p>But this approach is much faster.</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;"> system.time(expr %*% mod %*% solve(t(mod) %*% mod))
user system elapsed
0.015 0.000 0.015</pre>
</td>
</tr>
</table>
</div>
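<p>For completeness, the p-values mentioned above come out of the same matrix computations. Here is a minimal sketch (not part of the original analysis) that computes per-gene residual variances and then t-statistics for the rate coefficient, assuming as above that rate is the second column of the design matrix:</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">n = ncol(expr)
p = ncol(mod)
fitted = rate_betas %*% t(mod)                    # fitted values, genes x samples
resid_var = rowSums((expr - fitted)^2) / (n - p)  # residual variance per gene
xtx_inv = solve(t(mod) %*% mod)
se_rate = sqrt(resid_var * xtx_inv[2, 2])         # std. error of the rate coefficient
t_stat = rate_betas[, 2] / se_rate
p_val = 2 * pt(-abs(t_stat), df = n - p)          # two-sided p-value for each gene</pre>
</td>
</tr>
</table>
</div>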
<p>This requires some knowledge of linear algebra and isn’t pretty. But it brings us to the first general point: <strong>you might not use tidy data because some computations are more efficient if the data is in a different format. </strong></p>
<p>Many examples, from graphical models to genomics to neuroimaging to the social sciences, rely on some kind of linear algebra based computations (matrix multiplication, singular value decompositions, eigen decompositions, etc.) which are all optimized to work on matrices, not tidy data frames. There are ways to improve performance with tidy data for sure, but they would require an equal amount of custom code to take advantage of, say, C or vectorization properties in R.</p>
<p>Ok now the linear regressions here are all treated independently, but it is very well known that you get much better performance in terms of the false positive/true positive tradeoff if you use an empirical Bayes approach for this calculation where <a href="https://bioconductor.org/packages/release/bioc/html/limma.html">you pool variances</a>.</p>
<p>If the data are in this matrix format you can do it with R like so:</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">library(limma)
fit_limma = lmFit(expr,mod)
ebayes_limma = eBayes(fit_limma)
topTable(ebayes_limma)</pre>
</td>
</tr>
</table>
</div>
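<p>If you want the moderated tests for the rate coefficient specifically, rather than the default ranking across all coefficients, topTable can be pointed at that column of the design matrix (assuming, as above, that rate is the second column of mod):</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">topTable(ebayes_limma, coef = 2)  # moderated t-statistics and p-values for rate</pre>
</td>
</tr>
</table>
</div>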
<p>This approach is again very fast, optimized for the calculations being performed, and performs much better than the one-by-one regression approach. But it requires the data in matrix or expression set format. Which brings us to the second general point: <strong>you might not use tidy data because many functions require a different, also very clean and useful data format, and you don’t want to have to constantly be switching back and forth. </strong>Again, this requires you to be more specific to your application, but the potential payoffs can be really big as in the case of limma.</p>
<p>I’m showing an example here with expression sets and matrices, but in NLP the data are often input in the form of lists, in graphical analyses as matrices, in genomic analyses as GRanges lists, etc. etc. etc. One option would be to rewrite all infrastructure in your area of interest to accept tidy data formats but that would be going against conventions of a community and would ultimately cost you a lot of work when most of that work has already been done for you.</p>
<p>The final point, which I won’t discuss at length here, is that data are often usefully represented in a non-tidy way. Examples include the aforementioned <a href="http://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_GRanges.html">GRanges list</a> which consists of (potentially) ragged arrays of intervals and quantitative measurements about them. You could <strong>force</strong> these data to be tidy by the definition above, but again most of the infrastructure is built around a different format that is much more intuitive for that type of data. Similarly data from other applications may be more suited to application specific formats.</p>
<p>In summary, tidy data is a useful conceptual idea and is often the right way to go for general, small data sets, but may not be appropriate for all problems. Here are some examples of data formats (biased toward my area, but there are others) that have been widely adopted, have a ton of useful software, but don’t meet the tidy data definition above. I will define these as “processed data” as opposed to “tidy data”.</p>
<ul>
<li><a href="http://bioconductor.org/packages/3.3/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf">Expression sets</a> for expression data</li>
<li><a href="http://kasperdanielhansen.github.io/genbioconductor/html/SummarizedExperiment.html">Summarized experiments</a> for a variety of genomic experiments</li>
<li><a href="http://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_GRanges.html">Granges Lists</a> for genomic intervals</li>
<li><a href="https://cran.r-project.org/web/packages/tm/tm.pdf">Corpus</a> objects for corpora of texts.</li>
<li><a href="http://igraph.org/r/doc/">igraph objects</a> for graphs</li>
</ul>
<p>I’m sure there are a ton more I’m missing and would be happy to get some suggestions on Twitter too.</p>
When it comes to science - its the economy stupid.
2016-02-16T14:57:14+00:00
http://simplystats.github.io/2016/02/16/when-it-comes-to-science-its-the-economy-stupid
<p>I read a lot of articles about what is going wrong with science:</p>
<ul>
<li><a href="http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble">The reproducibility/replicability crisis</a></li>
<li><a href="http://www.theatlantic.com/business/archive/2013/02/the-phd-bust-americas-awful-market-for-young-scientists-in-7-charts/273339/">Lack of jobs for PhDs</a></li>
<li><a href="https://theresearchwhisperer.wordpress.com/2013/11/19/academic-scattering/">The pressure on the families (or potential families) of scientists</a></li>
<li><a href="http://quillette.com/2016/02/15/the-unbearable-asymmetry-of-bullshit/?utm_content=buffer235f2&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer">Hype around specific papers and a more general abundance of BS</a></li>
<li><a href="http://www.michaeleisen.org/blog/?p=1179">Consortia and their potential evils</a></li>
<li><a href="http://www.vox.com/2015/12/7/9865086/peer-review-science-problems">Peer review not working well</a></li>
<li><a href="http://www.nejm.org/doi/full/10.1056/NEJMe1516564">Research parasites</a></li>
<li><a href="http://gmwatch.org/news/latest-news/16691-public-science-is-broken-says-professor-who-helped-expose-water-pollution-crisis">Not enough room for applications/public good</a></li>
<li><a href="http://www.statnews.com/2016/02/10/press-releases-stink/?s_campaign=stat:rss">Press releases that do evil</a></li>
<li><a href="https://twitter.com/Richvn/status/697725899404349440">Scientists don’t release enough data</a></li>
</ul>
<p>These articles always point to the “incentives” in science and how they don’t align with how we’d like scientists to work. These discussions often frustrate me because they almost always boil down to asking scientists (especially and often junior scientists) to make some kind of change for the public good without any guarantee that they are going to be ok. I’ve seen an acceleration/accumulation of people who are focusing on these issues, I think largely because it is now possible to make a very nice career by pointing out how other people are doing science wrong.</p>
<p>The issue I have is that the people who propose unilateral moves seem to care less that science is both (a) a calling and (b) a career for most people. I do science because I love it. I do science because I want to discover new things about the world. It is a direct extension of the wonder and excitement I had about the world when I was a little kid. But science is also a career for me. It matters if I get my next grant, if I get my next paper. Why? Because I want to be able to support myself and my family.</p>
<p>The issue with incentives is that talking about them costs nothing, but actually changing them is expensive. Right now our system, broadly defined, rewards (a) productivity - lots of papers, (b) cleverness - coming up with an idea first, and (c) measures of prestige - journal titles, job titles, etc. This is because there are tons of people going for a relatively small amount of grant money. More importantly, that money is decided on by processes that are both peer reviewed and political.</p>
<p>Suppose that you wanted to change those incentives to something else. Here is a small list of things I would like:</p>
<ul>
<li>People can have stable careers and live in a variety of places without massive two-body problems</li>
<li>Scientists shouldn’t have to move 2-3 times, every couple of years, right at the beginning of their careers</li>
<li>We should distribute our money among the <a href="http://simplystatistics.org/2015/12/01/thinking-like-a-statistician-fund-more-investigator-initiated-grants/">largest number of scientists possible </a></li>
<li>Incentivizing long term thinking</li>
<li>Incentivizing objective peer review</li>
<li>Incentivizing openness and sharing</li>
</ul>
<div>
The key problem isn't publishing, or code, or reproducibility, or even data analysis.
</div>
<div>
<b>The key problem is that the fundamental model by which we fund science is completely broken. </b>
</div>
<div>
The model now is that you have to come up with an idea every couple of years and then "sell" it to funders, your peers, etc. This is the source of the following problems:
</div>
<ul>
<li>An incentive to publish only positive results so your ideas look good</li>
<li>An incentive to be closed so people don’t discover flaws in your analysis</li>
<li>An incentive to publish in specific “big name” journals, which skews the results (again mostly in the positive direction)</li>
<li>Pressure to publish quickly, which leads to cutting corners</li>
<li>Pressure to stay in a single area and make incremental changes so you know things will work.</li>
</ul>
<div>
If we really want to have any measurable impact on science we need to solve the funding model. The solution is actually pretty simple. We need to give out 20+ year grants to people who meet minimum qualifications. These grants would cover their own salary plus one or two people and the minimum necessary equipment.
</div>
<div>
The criteria for getting or renewing these grants should not be things like Nature papers or number of citations. They have to be designed to incentivize the things that we want (mine are listed above). So if I were going to define the criteria for meeting the standards, people would have to be:
</div>
<ul>
<li>Working on a scientific problem and trained as a scientist</li>
<li>Publishing all results immediately online as preprints/free code</li>
<li>Responding to queries about their data/code</li>
<li>Agreeing to peer review a number of papers per year</li>
</ul>
<p>More importantly these grants should be given out for a very long term (20+ years) and not be tied to a specific institution. This would allow people to have flexible careers and to target bigger picture problems. We saw the benefits of people working on problems they weren’t originally funded to work on with <a href="http://www.wired.com/2016/02/zika-research-utmb/">research on the Zika virus.</a></p>
<p>These grants need to be awarded using a rigorous peer review system, just like the NIH, HHMI, and other organizations use, to ensure we are identifying scientists with potential early in their careers and letting them flourish. But they’d be given out in a different manner. I’m very confident in the ability of peer review to detect the difference between pseudo-science and real science, or complete hype and realistic improvement. But I’m much less confident in the ability of peer review to accurately distinguish “important” from “not important” research. So I think we should <a href="http://www.wsj.com/articles/SB10001424052702303532704579477530153771424">consider seriously the lottery</a> for these grants.</p>
<p>Each year all eligible scientists who meet some minimum entry requirements submit proposals for what they’d like to do scientifically. Each year those proposals are reviewed to make sure they meet the very minimum bar (are they scientific? do they have relevant training at all?). Among all the (very large) class of people who pass that bar we hold a lottery. We take the number of research dollars and divide it up to give the maximum number of these grants possible. These grants might be pretty small - just enough to fund the person’s salary and maybe one or two students/postdocs. To make this work for labs that require equipment, there would have to be cooperative arrangements between multiple independent individuals to fund/sustain the equipment they needed. Renewal of these grants would happen as long as you were posting your code/data online, you were meeting peer review requirements, and responding to inquiries about your work.</p>
<p>One thing we’d do to fund this model is eliminate/reduce large-scale projects and super well funded labs. Instead of having 30 postdocs in a well funded lab, you’d have some fraction of those people funded as independent investigators right from the get-go. If we wanted to run a massive large-scale program, that would come out of a very specific pot of money that would have to be saved up and spent, completely outside of the pot of money for investigator-initiated grants. That would reduce the hierarchy in the system, reduce the pressure that leads to bad incentives, and give us the best chance to fund creative, long-term-thinking science.</p>
<p>Regardless of whether you like my proposal or not, I hope that people will start focusing on how to change the incentives, even when that means doing something big or potentially costly.</p>
Not So Standard Deviations Episode 9 - Spreadsheet Drama
2016-02-12T11:24:04+00:00
http://simplystats.github.io/2016/02/12/not-so-standard-deviations-episode-9-spreadsheet-drama
<p>For this episode, special guest Jenny Bryan (@jennybryan) joins us from the University of British Columbia! Jenny, Hilary, and I talk about spreadsheets and why some people love them and some people despise them. We also discuss blogging as part of scientific discourse.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Show notes:</p>
<ul>
<li><a href="http://stat545-ubc.github.io/">Jenny’s Stat 545</a></li>
<li><a href="http://goo.gl/VvFyXz">Coding is not the new literacy</a></li>
<li><a href="https://goo.gl/mC0Qz9">Goldman Sachs spreadsheet error</a></li>
<li><a href="https://goo.gl/hNloVr">Jingmai O’Connor episode</a></li>
<li><a href="http://goo.gl/IYDwn1">De-weaponizing reproducibility</a></li>
<li><a href="https://goo.gl/n02EGP">Vintage Space</a></li>
<li><a href="https://goo.gl/H3YgV6">Tabby Cats</a></li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-9-spreadsheet-drama">Download the audio for this episode</a>.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/246296744&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Why I don't use ggplot2
2016-02-11T13:25:38+00:00
http://simplystats.github.io/2016/02/11/why-i-dont-use-ggplot2
<p>Some of my colleagues think of me as super data-sciencey compared to other academic statisticians. But one place I lose tons of street cred in the data science community is when I talk about ggplot2. For the 3 data type people on the planet who still don’t know what that is, <a href="https://cran.r-project.org/web/packages/ggplot2/index.html">ggplot2</a> is an R package/phenomenon for data visualization. It was created by Hadley Wickham, who is (in my opinion) perhaps the most important statistician/data scientist on the planet. It is one of the best maintained, most important, and really well done R packages. Hadley also supports R software like few other people on the planet.</p>
<p>But I don’t use ggplot2 and I get nervous when other people do.</p>
<p>I get no end of grief for this from <a href="https://soundcloud.com/nssd-podcast/episode-9-spreadsheet-drama">Hilary and Roger</a> and especially from <a href="https://twitter.com/drob/status/625682366913228800">drob</a>, among many others. So I thought I would explain why and defend myself from the internet hordes. To understand why I don’t use it, you have to understand the three cases where I use data visualization.</p>
<ol>
<li>When creating exploratory graphics - graphs that are fast, not to be shown to anyone else and help me to explore a data set</li>
<li>When creating expository graphs - graphs that I want to put into a publication and that have to be very carefully made.</li>
<li>When grading student data analyses.</li>
</ol>
<p>Let’s consider each case.</p>
<p><strong>Exploratory graphs</strong></p>
<p>Exploratory graphs don’t have to be pretty. I’m going to be the only one who looks at 99% of them. But I have to be able to make them <em>quickly</em> and I have to be able to make a <em>broad range of plots</em> <em>with minimal code</em>. There are a large number of types of graphs, including things like heatmaps, that don’t neatly fit into ggplot2 code, which makes those graphs challenging to produce. The flexibility of base R comes at a price, but it means you can make all sorts of things you need to without struggling against the system. Which is a huge advantage for data analysts. There are some graphs (<a href="http://rafalab.dfci.harvard.edu/images/frontb300.png">like this one</a>) that are pretty straightforward in base, but require quite a bit of work in ggplot2. In many cases qplot can be used sort of interchangeably with plot, but then you really don’t get any of the advantages of the ggplot2 framework.</p>
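<p>To make the qplot point concrete with a toy example (mtcars is just a stand-in dataset, not something from a real analysis), the base and qplot versions of a quick exploratory scatterplot are about equally short:</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">plot(mtcars$wt, mtcars$mpg)    # base R: one line, no setup
library(ggplot2)
qplot(wt, mpg, data = mtcars)  # the roughly interchangeable qplot call</pre>
</td>
</tr>
</table>
</div>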
<p><strong>Expository graphs</strong></p>
<p>When making graphs that are production ready or fit for publication, you can do this with any system. You can do it with ggplot2, with lattice, with base R graphics. But regardless of which system you use it will require about an equal amount of code to make a graph ready for publication. One perfect example of this is the <a href="http://motioninsocial.com/tufte/">comparison of different plotting systems</a> for creating Tufte-like graphs. To create this minimal barchart:</p>
<p><img class="aligncenter" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAkAAAAGwCAYAAABB4NqyAAAD8GlDQ1BJQ0MgUHJvZmlsZQAAOI2NVd1v21QUP4lvXKQWP6Cxjg4Vi69VU1u5GxqtxgZJk6XpQhq5zdgqpMl1bhpT1za2021Vn/YCbwz4A4CyBx6QeEIaDMT2su0BtElTQRXVJKQ9dNpAaJP2gqpwrq9Tu13GuJGvfznndz7v0TVAx1ea45hJGWDe8l01n5GPn5iWO1YhCc9BJ/RAp6Z7TrpcLgIuxoVH1sNfIcHeNwfa6/9zdVappwMknkJsVz19HvFpgJSpO64PIN5G+fAp30Hc8TziHS4miFhheJbjLMMzHB8POFPqKGKWi6TXtSriJcT9MzH5bAzzHIK1I08t6hq6zHpRdu2aYdJYuk9Q/881bzZa8Xrx6fLmJo/iu4/VXnfH1BB/rmu5ScQvI77m+BkmfxXxvcZcJY14L0DymZp7pML5yTcW61PvIN6JuGr4halQvmjNlCa4bXJ5zj6qhpxrujeKPYMXEd+q00KR5yNAlWZzrF+Ie+uNsdC/MO4tTOZafhbroyXuR3Df08bLiHsQf+ja6gTPWVimZl7l/oUrjl8OcxDWLbNU5D6JRL2gxkDu16fGuC054OMhclsyXTOOFEL+kmMGs4i5kfNuQ62EnBuam8tzP+Q+tSqhz9SuqpZlvR1EfBiOJTSgYMMM7jpYsAEyqJCHDL4dcFFTAwNMlFDUUpQYiadhDmXteeWAw3HEmA2s15k1RmnP4RHuhBybdBOF7MfnICmSQ2SYjIBM3iRvkcMki9IRcnDTthyLz2Ld2fTzPjTQK+Mdg8y5nkZfFO+se9LQr3/09xZr+5GcaSufeAfAww60mAPx+q8u/bAr8rFCLrx7s+vqEkw8qb+p26n11Aruq6m1iJH6PbWGv1VIY25mkNE8PkaQhxfLIF7DZXx80HD/A3l2jLclYs061xNpWCfoB6WHJTjbH0mV35Q/lRXlC+W8cndbl9t2SfhU+Fb4UfhO+F74GWThknBZ+Em4InwjXIyd1ePnY/Psg3pb1TJNu15TMKWMtFt6ScpKL0ivSMXIn9QtDUlj0h7U7N48t3i8eC0GnMC91dX2sTivgloDTgUVeEGHLTizbf5Da9JLhkhh29QOs1luMcScmBXTIIt7xRFxSBxnuJWfuAd1I7jntkyd/pgKaIwVr3MgmDo2q8x6IdB5QH162mcX7ajtnHGN2bov71OU1+U0fqqoXLD0wX5ZM005UHmySz3qLtDqILDvIL+iH6jB9y2x83ok898GOPQX3lk3Itl0A+BrD6D7tUjWh3fis58BXDigN9yF8M5PJH4B8Gr79/F/XRm8m241mw/wvur4BGDj42bzn+Vmc+NL9L8GcMn8F1kAcXgSteGGAABAAElEQVR4Ae3dBZgcRd6A8eLC4RDcLbg7h7sGC+5uwfXQwzncXQ734MH9cHcPENyPAIEgH/rlrbtaOsPs7GZ3trd75q3n2Z2ZnpaqX/VM/6equnuE34emYFJAAQUUUEABBZpI4C9NVFaLqoACCiiggAIKRAEDIHcEBRRQQAEFFGg6AQOgpqtyC6yAAgoooIACBkDuAwoooIACCijQdAIGQE1X5RZYAQUUUEABBQyA3AcUUEABBRRQoOkEDICarsotsAIKKKCAAgoYALkPKKCAAgoooEDTCRgANV2VW2AFFFBAAQUUMAByH1BAAQUUUECBphMwAGq6KrfACiiggAIKKGAA5D6ggAIKKKCAAk0nYADUdFVugRVQQAEFFFDAAMh9QAEFFFBAAQWaTsAAqOmq3AIroIACCiiggAGQ+4ACCiiggAIKNJ2AAVDTVbkFVkABBRRQQAEDIPcBBRRQQAEFFGg6AQOgpqtyC6yAAgoooIACBkDuAwoooIACCijQdAIGQE1X5RZYAQUUUEABBQyA3AcUUEABBRRQoOkEDICarsotsAIKKKCAAgoYALkPKKCAAgoooEDTCRgANV2VW2AFFFBAAQUUMAByH1BAAQUUUECBphMwAGq6KrfACiiggAIKKGAA5D6ggAIKKKCAAk0nYADUdFVugRVQQAEFFFDAAMh9QAEFFFBAAQWaTsAAqOmq3AIroIACCiigwIgSKKCAAgp0nUD//v07vPI+ffp0eFkXVECB2gK2ANX28V0FFFBAAQUUaEABA6AGrFSLpIACCiiggAK1BQyAavv4rgIKKKCAAgo0oIABUANWqkVSQAEFFFBAgdoCBkC1fXxXAQUUUEABBRpQwACoASvVIimggAIKKKBAbQEDoNo+vquAAgoooIACDShgANSAlWqRFFBAAQUUUKC2gAFQbR/fVUABBRRQQIEGFDAAasBKtUgKKKCAAgooUFvAAKi2j+8qoIACCiigQAMKGAA1YKVaJAUUUEABBRSoLWAAVNvHdxVQQAEFFFCgAQUMgBqwUi2SAgoooIACCtQWMACq7eO7CiiggAIKKNCAAgZADVipFkkBBRRQQAEFagsYANX28V0FFFBAAQUUaEABA6AGrFSLpIACCiiggAK1BQyAavv4rgIKKKCAAgo0oIABUANWaq0iffTRR+Hll1+uNYvvKaCAAgoo0PACBkANX8XDFvDMM88MRxxxxLATfaWAAgoooECTCRgANVGF/9///V+46667wrXXXhs+/PDDJiq5RVVAAQUUUGBYgR6HDE3DTvJVowpcfvnlYa211goPPPBA+O6778Kyyy7bUtQPPvggXHrppeGZZ54JU0wxRRhjjDHCLbfcEuedY445Qo8ePcJjjz0WbrrppjD++OOHccYZJy77/fffhxtvvDFMPvnk4cILL4zLjjzyyOHxxx8Pd9xxR/jkk0/CDDPM0LKd3377Ldx3333htttuC7/88kvczogjjhjXT1B21VVXhcGDB4epp546jDDCCC3LZZ88+uij4c477wxvvvlmmG666QLLk8jvPffcE3766acw5ZRTtizyyiuvhOuvvz68+uqrMX+jjjpqfK9a3il3tXKyQGvbbdmQTxSoIjBgwIAqU9s3aaaZZmrfjM6lgALDLWAL0HCTlXcBAo+ll1467LDDDuHcc88NBAApEfS888474eyzzw4TTzxxnEwAMsooo4S//vWv4aSTTgovvvhiWGCBBcJqq60Wg51vvvkmbLPNNmGrrbYKZ511Vujfv38cX7TTTjuFN954I2y55Zbh+OOPj4EQK/z999/DUkstFSabbLKw6KKLxr999tknDBw4MDz77LPhmGOOCUsssUS45JJLwqqrrpqyNszjrbfeGghoNtlkk3D//feHr776Kr6/+eabh7/85S+hb9++Ydttt435443zzz8/5m2zzTaLgducc84Zl28t79XKyXpa2y7vmRRQQAEFyicwwtCD0u/ly7Y5Hl4BWmSefvrpQHDy5ZdfxhabE044IWy//fYtq/r8889jC8lTTz0VaPU54IADwsEHHxw+++yzsNJKK4X9998/znvNNdeE999/PzAfQcgK
K6wQBg0aFFtzmOHkk08Offr0Cb169QobbbRRbAFiPbQ8EagQaJFmm222sOeee4YtttgiLLnkkmG99daLLUu0RqXAiHVk02GHHRaef/75cNlllwWCGFpzaA264oorYksU89JSQ/DGr2cCO1p+UovQyiuvHFueWKYy77RAtVbOatvt2bNnNms+H04BAuaOJvavsqRmKWdZ6sN8KpAE/tt3kF752LACDH6mhWTnnXeOZZxooonCKaecErbbbruWrqYJJ5wwrLPOOrF16Oijj47BBa0/Tz75ZGwVIkghpUee00LEH11HKbGNK6+8MgYYdGcNGTIkvjXppJOGd999N/7RxTX66KOHZZZZJvz666+xy4wutNQ9RQvP2GOPnVbZ8rj11luH3r17x6CKVidaimjZomUnpYUXXjg+TV19E0wwQXorLLbYYrGliQmVea9VzmrbbVmpTxRQQAEFSidgF1jpqmz4M8w4HMbsXHTRReG0006Lf4wHYmzC7bffPswKd9xxx9i6QjCy/vrrx/doRXn99dcDgQTdY/yNN954cazOMAv/78WGG24Yu7to2SHoSWn66aePwdXGG28c7r777sDwM1pmGF80ySSTxPFCaf08fvvtt2nRlkdaXZ544onYarTmmmvG/NPaw+DubCLoSgEULV8pkZ80PU1Lj7XKWW27aTkfFVBAAQXKJ2AAVL46G+4cn3rqqYGgJJtoJaGb67jjjstODgsttFCYZpppYoAy7bTTxvfmm2++wODlPfbYI3afMYD6jDPOCKONNloMdGjBSenrr78OV199dex6Ytrbb78dW3h4znJcg+jhhx8Oyy23XGzJYTpp9dVXD7vttlt44YUX4joZtPzpp5/+983M/9NPPz223Bx++OFhr732iuOSVlxxxRgU0TVHjy4BEgOwZ5111vh37733tqyBVp4U2DFvNu+1ylltuy0r9YkCCiigQOkEDIBKV2XDl2HGyhAAMS4mO9yLQIOWF8bBHHroocOslFagTTfdtGUaLSyM6znvvPPiuB4GFC+//PKxhYZWJQIbBg//+OOPsXWFYIaAi3E8dJf169cvBlQENAxwHnfcccNUU00V5plnnnDsscfG7ey+++5xPXPNNVdYfPHF4wDt2WefvSUP6Qnjkej2YlDyzz//HAdgMx7kH//4RxxvNNZYY4WLL744rLHGGvHsMIIi5k2tX1wK4MADD4yBXGXeWysn3YDVtpvy5KMCCiigQPkEHARdvjrr8hzTKkIwwJihbCLgYEwPp8G3lQg0OB2exOnunKr+3HPPxQHJiyyySAxAWBfBChdm5MwwEkES3V+tJVqi+Mt2caV5CcDSqfVpGo8EfrREjTnmmIFxTm2lauWstd221uf71QWaZXBws5Szei07VYHiCgx7hCtuPs1ZjgK0DFUGP2yelpD2BD/Mm4Ifnqfr9DCwmbFDUw8dAE3rDy09tARlA57sc5atTOSL9VUbx1M5GDstSzBHd157gh+WqVbOWttN2/FRAQUUUKA8Ap4FVp66Kn1OGfRM9xN/888/f+wK45pEBFwmBRRQQAEF8hQwAMpTu8m3tfbaawf+st1jTU5i8RVQQAEFuknALrBugm/mzWa7x5rZwbIroIACCnSfgAFQ99m7ZQUUUEABBRToJgEDoG6Cd7MKKKCAAgoo0H0CBkDdZ++WFVBAAQUUUKCbBAyAugnezSqggAIKKKBA9wkYAHWfvVtWQAEFFFBAgW4SMADqJvjObPbcc88N3Om8zKloZfjpp5/CnnvuGb766qsys5p3BRRQQIF2ChgAtROqu2fjNg8pffzxx2HQoEHpZWke8yjDDz/80CEPbn/xzjvvxPuZpRV0dF1peR8VUEABBYorYABU3LppydmJJ54YXnrppZbXXFF5zTXXbHldhid5lIH7gPXt27dDHKOPPnrgDvSTTDJJXL4z6+pQBlxIAQUUUCBXgR5DD6aH5LrFAmzslVdeiQe7V199NUwxxRRh1FFHjbmiVeWGG24IM844Y7j22mvDU089Feacc854Y9BstmktuPfee+OkAQMGhOuuuy7e04qbbab02GOPhZtuuineO2ucccZJk+ONPO+7775w2223tdy4k3tbcTuI1157LXZt8f5ss80W70nF3dx32mmneB+rkUYaKUw55ZRh4MCB4eWXX443EOXu6g899FC84efkk08eyM+VV14Z52e7dO1wYH/66afDdNNNF1hHtfThhx+Gq666Kt7slHt1cf8s0vPPPx/v+k6rEzd15KalBAuU7/bbb4/54TWJm44y7Y477ghffPFFmH766eN6apWBe4HVWrY91izPfNxr7M477wwTTDBBmGiiiaLpjTfeGHC58MILY11zEcbHH3885vGTTz4JM8wwQ9w+Nzu9//774zKjjTban9ZF3T766KNx/W+++Wa0TPc4iyvw33ALsK92NM0000wdXTT35ZqlnLnDukEFOinQdC1A559/fjjrrLPCZpttFoMTAhwCIgKFQw89NGy55ZbhhBNOCHR/HHbYYeGMM874EzEH2T59+oR//vOfcdl///vfYb311muZ76STTgovvvhiWGCBBcJqq60WD768SYCw1FJLxSBi0UUXDfzts88+MaC555574hgUbhXB3dh33nnnuL5lllkmBkoLLbRQmHnmmePd1Nk2wQoHYIK1XXfdNd5XiwUIOghOevXqFcu0/fbbx+1xM0+CPe62XpmeffbZcMwxx4QlllgiEFCtuuqqcRbKxTRi5GeeeSbwmpuYHnvsseGtt94KV199ddh0003jvAQQq6++esz7NttsEwOMFVZYIb6uVQYWrrVsW9Zx40P/EbBxc9UxxhgjsD1syMdWW20V65vgjaCRYPKNN96I9Xz88cfHfLIOAt5ll1021kXlusYdd9xw6623xromyCJQcqxQkvdRAQUUKKdAUwVAHLQIFvbee+/Y6rPSSiuFOeaYI+yxxx6xZWSLLbaIQQODYXlOMEIwUZlWWWWVGFQsvfTSYeuttw4HHHBAeOKJJ2KAQ0sKrQ09e/aMQQIByplnnhlX8eCDD4b3338/8Ot1rrnmCrPMMksgsOE1rVBsk0SrCK0UpNQlwyMHYpZZZJFF4nv8Y3kO3JdeemmcRiBFEEf617/+FQMQtskdzmnZSPPFGf73j/Kz3ueeey7MOuussXWK8TAEawRdBEGbb755LBetJiuuuGIMfAiECJ5ItDp99tlnMXjiruxHHnlkoCXroosuarMMtZatZf2/7McHAh7uVJ9aybjzOwEQ9x0j6KFVCifKs+SSS8YAidYs6o207rrrxjrjeeW6KA8BIK1bBKcHH3xwtGRekwIKKKBAOQWaKgCiVea7776LXSSpuhZbbLHw5JNPxpep2yc90u2RHbibluGRedJ8tDrQgkRLBuuaeOKJ40GWAy0tSDfffHNcdNJJJw3vvvtu/GMCXUe0VpAIagh8Tj/99Nh6RD6zKW2LadnnvN5xxx3DBRdcEFuKaKVJ63z44YfDvPPO25IXDuIEBdnEAZ1gi6CG/NLCQXdXCrxoOeKPRCBAntP2KXc
ajE33EAFIStjRupZsmZ6Wq3ze1rIsl5bNWqdttfZIfvljmZRoWXvkkUdiMDd48OAwZMiQ9FbLNlomZJ4Q6NL1SJcZXaMEuCYFFFBAgfIKNFUANPbYY8eaYjxMSgQlaXqa1plHuplef/31GGQRCPE33njjxbE1dE9x+vfGG28c7r777ti1xJgeEt1JRx11VAxmaNWpTCkAqJzOa1pJaLUgCKLlI81LXjhop3zwyFiZbGLsEcEOLTvZ+b799tvsbC3PUzDUMuF/TxhvRICVTdhmA4WUr+w8PG/PspXLtPa6tW2k+TfccMPYUkdrG/mrlbLrohy0FrEcA9BpDTIpoIACCpRXoKkCILp3+EsDmKk2WijWX3/9WIO0hmRTZbCQfY/WnpQ4Y4jEGJ/55psvtgTRrfTll1/GFidageh+olWHcSi0zCy33HKhd+/eaRWxu4oDMgfdt99+O7bmpDdpxSAgSfkjX9m8EcRst912Ya+99gobbbRRWiyOyaEbjOCK/LLeagduxu7stttu4YUXXohlYNB0GitEmfgjsQ66lFI+0nTeW2uttcJ//vOfOEaJ1yQCTbqWSLXK0NayrVnHFWf+0Y2YWnTII/lLeWW2r7/+OloQGJLwyL6PaarLynXRMkcZDj/88OhMa6JJAQUUUKC8Ak0VANFKcs0118QBrYxNOe200+IB/cADD4wHviuuuCLWJGctMZaHAbi0atCik01chPC9996LY2U424l1kS6//PIYwJx88snhvPPOiwORGWy9/PLLxzE4BBUMMmYsD91daUAxy26wwQbxPQIYWhsIeA466CDeiq0ODDZm4DPdL3ThEEQRsKREFw2Do2kBSmnBBReMwR0BHl03p5xySkuwl+bhcffdd4/BGeOSGEj8/fffh9lnnz22eHCgv+uuu8Lnn38eW5jo6mMgOV1flIUAgjFPc889dywzg64Zb0NAxqByAkISLSetlaHWsrWs44oz/+aff/5oyyBufKgXgk4GpdOVSUsfwR6tQAw+p8uvX79+sZ4pA+aMRyJwza6LSxCk8U0MhiZQYnC1SQEFFFCgvAIjDP2V/N+f9+Utw3DnnCLz659xKtmAYbhXVGMBDpKMMcmOi2GQMafeM96HgyzvX3zxxeGII46Ig6ppXeE0bRItEQRsKXEgT6ebp2mVj2yTwc6ViZYPWi/4q5UI0OgG60yiDHS79Rp6Flq6vEBaX1tlqLVsWkdbjxhQv62d7s/ytZyz68+ui1Yo/mhhqmeXaXZ7zfacM/M6mgj2y5KapZxlqQ/zqUAS+OMIm6YM5yMtKrSCcMBmbEv2mjfDuarcZqebadppp+3S7RGIZIMfNsYAY069nnrodXb4o/WEU6pT0JGCH+bNBj+8biv4YZ5qwQ/T23vATvlgmY4mysAZZdVSW2WotWy19VWb1ppBdt5aztn5sutKg8Hba5ldj88VUEABBYon0KkAiOup0EXBheYIfMoQ/HRnFRwy9Ho6dLfxl7pYdthhh3jxve7Ml9tWQAEFFFCg2QQ6FQCdeuqpYZpppolnxdCiYaotwHWF+Mt2wdRewncVUEABBRRQoCsEOjUGiLObOKOKwcJnn3127OLpiky6TgUUaDwBxqZ1NJWpK7JZytnRunQ5BbpLoFMBUMo03WCcXcNp0JVjV9I8PiqggAIKKKCAAkUR6FQXWCoEt0rg+jrcaoIbUZoUUECBtgSyVwlva97K9//2t79VTirs62YpZ2ErwIwp0IpAXQIgTg8ea6yxDH5aQXayAgr8WYCrjzdDapZyNkNdWsbGEuhwAMTNMun24gJ3nM7NRfZMCiiggAIKKKBAGQQ6HABxoTuuSMxF57igoEkBBRRQQAEFFCiLQIcDIAqYvct2WQpsPhVQQAEFFFBAgaa6F1hXVDe3t+CeYvvtt1/LjTi7YjuV62TcFZcg4H5b2bvbp/m4hxk3Qn3sscdabmaa3qPljjxz77IffvghTR7mkXkuvfTSkPdNP88999x4cc1hMlPlxfPPPx+vql3lLScpoIACCijQpoABUJtErc/ADTKPPvrosPPOOwduEppnsMB2X3nllXj7i8UWWyzcc889LRnt27dv+Oabb+INOwmACM5SOvPMM+PNS7fddtswySSTxBu1ZoMgysTNUzlzhVubzDHHHGnRLnvkRqUpffzxx/FGq+l1a49cW+X9999v7e26TM+61GWFrkQBBRRQoDACdbkOUGFKk3NGuDs7dwenpSTPRJBACxC3ICFtvvnm8fk///nPeCkC7jTPPHRR8rjooovGm79yl3fu93XzzTcHLl1AmmmmmQK349hll13i/dyYzt3rDz744Ph+V/878cQTAwEctwYpUuLedltuuWW8432R8tVIeWmWm4Q2Szkbad+0LM0h0GPo/akOaY6i1reUd999dwx8aIXgRpkMBOdu6zfeeGMMRi688MIwxRRTxCCElprrr78+3gmeaeku6Z9//nnsvmJZginuPj7ZZJPFlo1+/frFDE866aR/yjjzc9mBlE4//fTYysPNV1k3XV/cbX7ZZZcNl112WZh++unDkksuGbfP7Uuo8nQl3SeeeCI8++yzYdNNN433KLvvvvvCDTfcEIOhHj16pE0M80hwQPm5Weijjz4arrvuujDRRBOF8cYbL843aNCg8PDDDwe++Lk/XLop7HvvvRfouuKu8Lz32muvhZ122ilMOOGE8e7tU045ZbyT/MsvvxwI4lK65ZZbYgsXrWzMQ6KFi0COG6+2lR9cH3/88XDHHXcETkmeYYYZ4jqoF1rtuHP8JZdcEtdD/TA/N67lCudc14qyYU5Zmfbmm2+G6aabzot+RsWO/xswYECHFyZwL0tqlnKWpT7MpwJJwC6wJDGcj1yIjRYYupFWWWWV0LNnz7DNNtvEbqezzjorHuA5kJ9//vmB15tttlkMBOacc87YdfXhhx/G+4IRABCkELAQpNBdRXDw7rvvxlYagqTW0gcffBAvQ0DgRRCREtsj0FlqqaVidxI3XyVx3zYCJMYtpUTAxRW8SVdeeWUMrMjTMsssE8vy6aefpllbHgnmVlxxxbDvvvsGghqCgvnmmy8MHjw4BjfzzDNPWHDBBWNX2iKLLBIDC4IN8kPLEtu56KKLYssPwctCCy0UZp555hig9enTJwaDaWO0bhFg0q1Htx2BJVZse/vtt4+z1coPM1AebtxLi87xxx8fAyEu40CdsJ6rr7465nvppZeO5RlhhBHC4osvHoNXHMYdd9zY0kcZCIzuv//+2NKW8uijAgoooED5BAyAOlhnBDyjjTZaGHnkkWMrAS0FBEDc6JQDLq0N8847b9h1113D3nvvHQOPlVZaKY6p2WOPPWLwxHWU6JJinr///e8x6OBgz5iiI488MrY80ErSWqL1iTE6tOKwjZToxlpjjTViSwUtUl988UV8i1aM1VZbLQZkjKF566234sGc4IMgiKDgqKOOivd14yD/6quvhnPOOSettuVx3XXXjQEfAQRBCOOPuAXKbbfdFq8JxXRaqGgRotWG4IMrhffu3Tu2CNFVxyDr1MpDEEmQQW
sOAVNKBCbkEzdao2jZ4pc/86611lpptlArP8xE+QguySMBH15cxmG55ZYLtLBRH/vss0+sC4JD5qPVipYhWpwIMBlUfvvtt8fyEcRR9yYFFFBAgfIKGADVse44UPKXLg9A9wrdPdnbgzDeJV0an5YG/lLioEsAlBIBC60drSWChd133z12WdGqwrZI66yzThzTQ3cT61h55ZVbVkGLFHngoM/ZYwQYjBGi24eUuocIONZcc82QuuJaVvC/J+SbLjAS8xJMEDQR+BDA0S1HSxbzDRkyJM6HTeoOixP+9y9rkH1OdxwtZiktvPDCsbWI19n50utq+eE98kPAResRrVQpP6wjux7qLTsgm2VT2nrrrWP3HD5PPfVUDADTez4qoIACCpRP4I+jbfnyXvgcp3E22dPUaXFI0ysLkA1+Kt+r9ZruJoIQWqPoWqO1gi46gg3GFtGqQWsPafTRRw8HHHBAuOCCC2KgxlW8ObinfL3wwgstm6Jlhi6q9iQCKVpXPvroo9jVRRBGy0x7ypQNQrLbYvpdd92VndQSvAwzscqLlB/e2nDDDeOlALhqebUxVVUWj5Oy+aLFj5Yj1kFgSGuQSQEFFFCgvAIGQJ2oO7q76OJJiatiE1CkRLcPf1yvJyVafzjNnMT82UTrQ3Z53q+ch/nZZrZliAHHnMlF1w0Hbbqz0v2H6Iai1YLBvdlE9xndPgRCBE4kXtOKkxKDo1Ne07TsI+UnkR9aRej64qw4bopLNxXjhziVPJWp0odlaRX69ttvW+ahJSq1RjHOiKDjmmuuiQ48p2uRlJ0vThj6r1p+CIToSkvlf/vtt1u2xZl02USwl7wZK5Vaisg/LVrk9fDDDw977bVXrpc8yObR5woooIAC9RHwLLAOOhLIcAo3ZwRxlhXdXCeccEI824jWgrnnnjuOIWHg77HHHhufP/TQQ/G0dMbZEKQcc8wx8UwoBt9y/R2WpwWH8SoMoOZihQQ0vE5njpFdWpQYOExLDy07HKgZQ0RrC11QjFuh64rAgm4kxv2k6/kQONEVRPfUxRdfHMfUJALWSR7pOmM5BmCTVwKrykTeOa2e0+wJnBizxPIEXKyfsUd0SREAsS5akxhPxLgiurUYg0NieQYm40cAwz3lGNxNNx1jmQhK6K477rjjYnDCc/J18sknx+CIs7EoW2v5IWihVYtB4Sw344wzxgHY5Iduw4EDB8bT/skX28ZwhRVWiPm59tprY0sPQSxngBEEUbcEUXSrOQ6ocq8YvtfNcnZUs5Rz+GrfuRXofgGvA5RDHdCqwEGT8TjZs7U6s2laJQgu0nijauuiJYbT0LOJ4IkWoexp9Nn3eU5gQwsLB/vWEuslQJh99tlj0EMXXErkjTITOPHIX62uMMYu0TXXWqJljECoVllr5Yf1Elylli7WVS2oq9w+BuSdwdC0FvFHsNlaF2bl8r6uLUAQ3tHE2YJlSc1SzrLUh/lUIAn8+ad9esfHugnQijPttNPWbX2siICjVkDAPJXBD9M4Xb2t1N6WDQKCagFdNhii7PzVSrWCH5ajFac9qbX8sGwKfnjenuCH+dKgap4TwPFn8IOGSQEFFCi/gGOAyl+HuZeACxAytoYWILqwujsVLT/d7eH2FVBAAQXaFrAFqG0j56gQYKwPY5ZIbbVCVSzaJS+Llp8uKaQrVUABBRSoq4ABUF05m2Nl1a7l050lL1p+utPCbSuggAIKtE/ALrD2OTmXAgoooIACCjSQgAFQA1WmRVFAAQUUUECB9gkYALXPybkUUEABBRRQoIEEDIAaqDItigIKKKCAAgq0T8AAqH1OzqWAAgoooIACDSTgWWA5Vib31uLqwq+99lrc6sQTTxwWWGCBcNttt8XpXGiPm5jONNNMYdCgQfE2DFy1eJ555mm5lUWO2e3yTXEF6DvvvDM89thj8VYXHd0gtwSZaKKJwuSTT96yCty4ESzXK5ptttnCkksu2fIeT1iG+uA9biabvVgj91HjnmhcMHHllVeOt7zg1iVcIXuqqaYaZj2+UEABBRQop4AtQDnVGzcd5aah3DrihhtuCFtttVVYZJFF4tWauZdV37594/2mCH5I3FOL2zdwIE738aqVVW4X0d2JW3MMT+LKzW+88Ua49NJLh2exlnm5FtFJJ50Ur27NzV1T4vYV3EiVW4/suOOO8YKN3NMsJay/+eabWAcEX/vtt196K9x9993h6KOPjneyJzjiGkPc2oN7lRE0vfjiiy3z+kQBBRRQoLwCBkA51B3BD607tDRwm4nDDjss3lfqwQcfjFsnKOLeRtyINJvee++9sP/++2cnVX3OTVlfeumlqu/lNZEWFwKL4UkEKLRudTTR6rPbbrvF24Jk10GAiQemtOxsu+224eCDDw4ESdwfjZugzj///LFOuNs9d4tPiYCKeuJ2HjPPPHO80CM3hyWtueaa8W7w3OXepIACCihQbgEDoC6uP7p5uJM5B9qUuBs69+QiMEqJgy53JH/88cfTpHineW5cmhLdYnfddVe8C/3rr78eJ1922WXxbuzcDuKRRx6J0wicOMhzV3buVp8Sd4K//PLLwxlnnDFMwMR6mU6LE91GLEfrTGuJVhPWwQ1eSXTrbbzxxrErj2Wr3R6jWt5bW//wTK92rzG6trJdVdzNnft60d3G/dG4cOIRRxwRN8ONKjfbbLOWTRKMnnDCCTFQoiuMVq1skLb55pvHALZlAZ8ooIACCpRSwACoi6vtuuuuC1NPPfWfbujJgZSghbElJO7Sznznn39+fP3www/H7pf4Yug/AikOxARKBFN0n9Hqsswyy8RHumposeCAve+++4YNNtggjou544474ioIhNZaa62wxBJLxJaMDTfcMJx++umxW+7QQw8NW265ZTzwszwtVAQ41RItJHQDMXZptdVWi8ESQcjiiy8eW0vIz7jjjjvMoq3lfZiZ6viCgOfll1+OLqyW1hzGWyXrs846K5x66qlhqaWWimOtDjzwwJat77nnnjGgo3z77LNPwC87Pgjjf/3rX/Gu8C0L+UQBBRRQoHQCBkBdXGW06nDwrUypRYjxLwyKnn766cMWW2wR+vXrF4OdK664Iqy33notizEOhZaKscYaK44PYjwR42cmmWSSOA+PBB7vvvtuoGuNda600kphueWWi+/vtddeYZVVVokDhZn373//e/xjkDDbZX0c/Hm+9tprx4HJLRv/35MPP/wwBjw9e/YMb731VphxxhnDmWeeGQcL06oy0kgjhSmnnPJPwV5rea9cf71eE5jR4kOAx9goghhacwheSASBa6yxRmwdu/HGG8MXX3zRsmnubs/4LLrn6Ep74YUXWt7jyRRTTBFbx1IL3DBv+kIBBRRQoDQCBkBdXFUfffRRGH300f+0FQY5c4YR3WAXXXRR7EIi+KC15JJLLoldMIxxSYnAZ+edd44HdbptaJUYMmRIerullYKDPGNf5pprrtgSNN1008V5Hn300TiQNy2w6KKLxi4vWnNSC0d65OBfbVD1k08+GYM5zqjij1YiWrHaSm3lva3lh/d9vCnvO++8E
/bee+8Y4NBaxhl2pHXWWSfssssugbE9lJV6SAn7Dz74ILbIpaCRQDMlAisCWgJNkwIKKKBAeQUMgLq47qaddtqqY2LY7CabbBJeeeWVeHYRLUC0Liy//PLxoL3qqqsOkzMCKbpsOHivu+66cQBvdoYUvBBA0SpzzTXXBMYHbbfddnG2scceOx7U0zK0ArEM09ubyB8tH5wRRRDAH4Hc4MGD4ypSHirX11beK+evx2u6wU455ZTY1cWZXQQ8k002WaAVi7O5CIZotWLM03PPPRdbtNjuOeecE7v2KMtBBx0UuxrpxswmAs/U8pad7nMFFFBAgfIIGAB1cV1xCjsBQLVEFxUtEJxdlBJjcWitqAyAbr311tgqRDcXZyExVoeuJdIoo4wSvv322/iaAcoc4OkGuuWWW2IrB/PQnXbfffcFThEnMeaILiwGZKf1xDeG/mNQc7XEwG0GR++xxx6BAdUEW7QCcWbbqKOO2tIiVbm+WnlnfSlPaZu33357y7ronrr++utbxvOkedIjy1Yun97j8eKLL46DtTkLjERgw1ggusRIBHAMNCe4I3G2Xjrri9fTTDNNmHfeeXkaE2WmdSx1p6XpPiqggAIKlEugxyFDU7myXK7c0gJ03nnnxUHJI4888jCZ50J7AwcOjKdyE0SQmJ+zq9IYobQAB2rOsGLMCt0wBEAENMsuu2wMfo4//vjYMkNAxbVv2BbrXn311eMBfuGFF45Bzz333BMDHlo6CF5ozWFgM4FTr1694usjjzwyjo/p3bt3bCVJeSB4YIwPLSOnnXZaHAy99dZbx5YgxgVde+218eKNtL5kxz21lneCLwYkc+bbLLPMEliO1hW67wi2uCYSZ7NxhhnX5OH9bCI4orWLM+MoL/OnAdi0VFGOMcYYI5aTIJFEdxxlYKwVQSOGBIvpWku0DDH25/PPP4+tcwSDlDElxhNRbwwyN3VOYMCAAR1eQbpeVodXkOOCzVLOHEndlAJ1ERhh6K/n/zYJ1GV1rqSaAFd6ZhwJ16ypTLQoVI4RqjaN5WhZobo4APPIHy0WpLQMLSpMIzigi6cycTo6rTeMDWqty6pymcrXBAV0e1Wun+nkicHQlalW3ivnZWB2tmuOIIUWJs50a0/iIocEkZzSzhlgrSWuCcRp8dUS66BeKpdnnNYhQ38zZE+zr7a809oWYCxbRxPj3MqSmqWcZakP86lAErALLEl04SNdXZNOOukwY3DS5iqDH6ZXm8Z0DsYEPySClxT88Dotk6ZVBifMQ6I1hvFGHQ1+WActUNXWz/RqwQ/L1Mo772dTNvghWGNQcnuDH9ZDKw+tSJXBS3YbPG8t+OE91lG5PNcOYiyRwQ9CJgUUUKDcAgZAOdUfA5dT8JLTJhtiM3RpZS9U2F2FokuMrri55567u7LgdhVQQAEF6ijw3+aEOq7QVbUuQKuEqZwCXB/IpIACCijQOAK2ADVOXVoSBRRQQAEFFGingAFQO6GcTQEFFFBAAQUaR8AAqHHq0pIooIACCiigQDsFDIDaCeVsCiiggAIKKNA4AgZAjVOXlkQBBRRQQAEF2ilgANROKGdTQAEFFFBAgcYRMABqnLq0JAoooIACCijQTgEDoHZCOZsCCiiggAIKNI6AAVDj1KUlUUABBRRQQIF2ChgAtRPK2RRQQAEFFFCgcQQMgBqnLi2JAgoooIACCrRTwAConVDOpoACCiiggAKNI2AA1Dh1aUkUUEABBRRQoJ0CBkDthHI2BRRQQAEFFGgcAQOgxqlLS6KAAgoooIAC7RQwAGonlLMpoIACCiigQOMI1CUA+uCDD8IDDzzQOCqWRAEFFFBAAQUaWqDTAdBvv/0Wdthhh/DII480NJSFU0ABBRRQQIHGERixs0W57LLLwsILLxx+//33zq7K5RVQoIkExhxzzKYobbOUsykq00I2lECnAqBnn302TDXVVOHHH38MX3zxRUPBWBgFFOhagaWXXrprN1CQtTdLOQvCbTYUaLdAhwOg7777Ljz55JNhu+22CwMGDGj3Bp1RAQUUQOCXX37pMMSII3b4q6vD2+zogs1Szo76uJwC3SXQ4W+RU089NbzxxhvhhRdeCK+++mpsBZp88snDpptu2l1lcbsKKFAigVtvvbXDue3Tp0+Hl817wWYpZ96ubk+Bzgp0OADq27dv+Prrr+P2+/XrFwYNGhR69+7d2fy4vAIKKKCAAgoo0OUCHQ6Axh133MAfaYIJJoiDoHk0KaCAAgoooIACRRfocACULdjWW2+dfelzBRRQQAEFFFCg0AKdvg5QoUtn5hRQQAEFFFBAgSoCBkBVUJykgAIKKKCAAo0tYADU2PVr6RRQQAEFFFCgioABUBUUJymggAIKKKBAYwsYADV2/Vo6BRRQQAEFFKgiYABUBcVJCiiggAIKKNDYAgZAjV2/lk4BBRRQQAEFqggYAFVBcZICCiiggAIKNLaAAVBj16+lU0ABBRRQQIEqAgZAVVCcpIACCiiggAKNLWAA1Nj1a+kUUEABBRRQoIqAAVAVFCcpoIACCiigQGMLGAA1dv1aOgUUUEABBRSoImAAVAXFSQoooIACCijQ2AIGQI1dv5ZOAQUUUEABBaoIGABVQXGSAgoooIACCjS2gAFQY9evpVNAAQUUUECBKgIGQFVQnKSAAgoooIACjS1gANTY9WvpFFBAAQUUUKCKgAFQFRQnKaCAAgoooEBjCxgANXb9WjoFFFBAAQUUqCJgAFQFxUkKKKCAAgoo0NgCBkCNXb+WTgEFFFBAAQWqCIxYZZqTFFBAAQUUGC6B/v37D9f82Zn79OmTfelzBXIRsAUoF2Y3ooACCiiggAJFEjAAKlJtmBcFFFBAAQUUyEXAACgXZjeigAIKKKCAAkUSMAAqUm2YFwUUUEABBRTIRcAAKBdmN6KAAgoooIACRRIwACpSbZgXBRRQQAEFFMhFwAAoF2Y3ooACCiiggAJFEjAAKlJtmBcFFFBAAQUUyEXAACgXZjeigAIKKKCAAkUSMAAqUm2YFwUUUEABBRTIRcAAKBdmN6KAAgoooIACRRIwACpSbZgXBRRQQAEFFMhFwAAoF2Y3ooACCiiggAJFEjAAKlJtmBcFFFBAAQUUyEXAACgXZjeigAIKKKCAAkUSMAAqUm2YFwUUUEABBRTIRcAAKBdmN6KAAgoooIACRRIwACpSbZgXBRRQQAEFFMhFwAAoF2Y3ooACCiiggAJFEjAAKlJtmBcFFFBAAQUUyEXAACgXZjeigAIKKKCAAkUSMAAqUm2YFwUUUEABBRTIRcAAKBdmN6KAAgoooIACRRIwACpSbZgXBRRQQAEFFMhFwAAoF2Y3ooACCiiggAJFEhixSJkxLwoooIACCijQ/QL9+/fvcCb69OnT4WXzXNAWoDy13ZYCCiiggAIKFEKgUy1A3377bbj66qvDTz/9FDbccMPQs2fPQhTKTCiggAIKKKCAArUEOtUCdN1114W55porvPrqq2G//fartR3fU0ABBRRQQAEFCiPQ4RYgWn3WX3/9MMooo4Sff/459OvXrzCFMiMK
KFB8gVlmmaX4maxDDi1nHRBdRe4CzbDfjvD70NQZ2YEDB4Zdd901nHzyyWG66abrzKpcVgEFFFBAAQUUyEWgwy1AKXfPPfdcGDx4cNhoo43CE088kSb7qIACCtQUeOedd2q+X+vNXr161Xq7UO9Zzraro0z12XZpGmOOZthvOx0Arb322oG/eeaZJ3z88cdh0kknbYzatxQKKNClAi+++GKH11+mA6blbLuay1SfbZemMeZohv22U4Ogs9XMDjzxxBNnJ/lcAQUUUEABBRQopECHA6Cvv/46LLroouHiiy8OTz75ZDjyyCPDX/7S4dUVEsdMKaCAAgoooEBjCnS4C2zssccODzzwQFTp0aNHY+pYKgW6QaAZrsDaDaxuUgEFFBhGoMMBEGsx8BnG0hcKKKCAAgooUBIB+6xKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJzBiZ1Y1ZMiQcP3114cRRxwxrLnmmmGUUUbpzOpcVgEFFFBAAQUUyEWgUy1AJ510Uvjll1/C1VdfHdZff/1cMuxGFFBAAQUUUECBzgp0uAXok08+CX379g0TTjhhWH311UOvXr3Cr7/+Gnr06NHZPLm8AgoooIACCijQpQIj/D40dXYL7733Xth5553DTTfd1NlVubwCTS/w448/dtigTN3QlrPtarY+2zZyjq4RaIbPZ10CoGOOOSYsueSSYYEFFuiamnCtCiiggAIKKKBAHQU63AWW8vDMM8+EWWed1eAngfioQCcFHnzwwQ6vYfHFF+/wsnkvaDnbFrc+2zZyjq4RaIbPZ6cCoIEDB4ZPP/00rLLKKrEGXnvttTDzzDN3TW24VgWaROCrr75qipJazsaq5mapz8aqtdZL0wz12eGzwBgE3bt377D77ruHGWaYIUw22WRh0KBBrWv6jgIKKKCAAgooUBCBDrcATTLJJOGNN94oSDHMhgIKKKCAAgoo0H6BDrcAtX8TzqmAAgoooIACChRLwACoWPVhbhRQQAEFFFAgBwEDoByQ3YQCCiiggAIKFEvAAKhY9WFuFFBAAQUUUCAHAQOgHJDdhAIKKKCAAgoUS8AAqFj1YW4UUEABBRRQIAeBDp8Gn0Pe3IQCwwj0799/mNftfdGnT5/2zup8CiiggAJNImALUJNUtMVUQAEFFFBAgT8EDID+sPCZAgoooIACCjSJgAFQk1S0xVRAAQUUUECBPwQMgP6w8JkCCiiggAIKNImAAVCTVLTFVEABBRRQQIE/BAyA/rDwmQIKKKCAAgo0iYABUJNUtMVUQAEFFFBAgT8EDID+sPCZAgoooIACCjSJgBdCbJKKtpgKKKCAAp0X6OgFWdmyF2XtvH8912ALUD01XZcCCiiggAIKlELAAKgU1WQmFVBAAQUUUKCeAgZA9dR0XQoooIACCihQCgEDoFJUk5lUQAEFFFBAgXoKGADVU9N1KaCAAgoooEApBAyASlFNZlIBBRRQQAEF6ilgAFRPTdelgAIKKKCAAqUQMAAqRTWZSQUUUEABBRSop4ABUD01XZcCCiiggAIKlELAAKgU1WQmFVBAAQUUUKCeAgZA9dR0XQoooIACCihQCgEDoFJUk5lUQAEFFFBAgXoKGADVU9N1KaCAAgoooEApBAyASlFNZlIBBRRQQAEF6ilgAFRPTdelgAIKKKCAAqUQMAAqRTWZSQUUUEABBRSop4ABUD01XZcCCiiggAIKlELAAKgU1WQmFVBAAQUUUKCeAgZA9dR0XQoooIACCihQCgEDoFJUk5lUQAEFFFBAgXoKGADVU9N1KaCAAgoooEApBAyASlFNZlIBBRRQQAEF6ilgAFRPTdelgAIKKKCAAqUQMAAqRTWZSQUUUEABBRSop4ABUD01XZcCCiiggAIKlELAAKgU1WQmFVBAAQUUUKCeAgZA9dR0XQoooIACCihQCoERS5FLM1lToH///jXfb+3NPn36tPaW0xVQQAEFFGhoAVuAGrp6LZwCCiiggAIKVBNo6BYgW0aqVbnTFFBAAQUUUMAWIPcBBRRQQAEFFGg6AQOgpqtyC6yAAgoooIACBkDuAwoooIACCijQdAIGQE1X5RZYAQUUUEABBQyA3AcUUEABBRRQoOkEOh0Afffdd+G+++5rOjgLrIACCiiggALlFejUafCDBg0KRx55ZHjvvffC0ksvXV4Fc66AAgoooIACTSXQqQBovPHGC8stt1w477zzCok2/vjjFzJf9c6U5ay3aPeuz/rsXv96b936rLdo967P+uxe/3pufYTfh6bOrPCOO+6IAdC1117bmdW4rAIKKKCAAgookJtAp1qAcstlBzfE+KSOpNFHH70ji3XbMpazNn2z1CcKZSprR/dby1l7f++ud63PtuX9fLZtlOccDR0A3XPPPR2yLNtNQi1n7WpulvpEoUxl7eh+azlr7+/d9a712ba8n8+2jfKco9NngeWZWbelgAIKKKCAAgrUQ6BTAdBXX30VGAM0YMCA8Prrr9cjP65DAQUUUEABBRTocoFOdYGNM8444eSTT+7yTLoBBRR
QQAEFFFCgngKdagGqZ0ZclwIKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKIzDC70NTYXJjRhRQQAEFFFBAgRwEbAHKAdlNKKCAAgoooECxBAyAilUf5kYBBRRQQAEFchAwAMoB2U0ooIACCiigQLEEDICKVR/mRgEFFFBAAQVyEDAAygHZTSiggAIKKKBAsQQMgIpVH+ZGAQUUUEABBXIQMADKAdlNKKCAAgoooECxBAyAilUf5kYBBRRQQAEFchAwAMoB2U0ooIACCiigQLEEDICKVR/mRgEFFFBAAQVyEDAAygHZTSiggAIKKKBAsQQMgIpVH+ZGAQUUUEABBXIQMADKAdlNKKCAAgoooECxBAyAilUf5kYBBRRQQAEFchAwAMoB2U0ooIACCiigQLEEDICKVR/mRgEFFFBAAQVyEDAAygHZTSiggAIKKKBAsQQMgIpVH+ZGAQUUUEABBXIQMADKAdlNKKCAAgoooECxBAyAilUf5kYBBRRQQAEFchAwAMoB2U0ooIACCiigQLEEDICKVR/mRgEFFFBAAQVyEDAAygHZTSiggAIKKKBAsQQMgIpVH+ZGAQUUUEABBXIQaOgA6NVXXw0XXHBBuPjii9uk/PXXX8NBBx0Ufv/992Hmfeutt8Kxxx4bvvvuu2Gmd+eL77//Plx22WXh0UcfzSUb9957bzj//PNz2dbwbiRbb59++mk4+eSTwyOPPDK8q8l1/gceeCDul/XY6CuvvBIuv/zyuKqnnnoq7LPPPuHnn3+ux6o7vI5LLrkkvP76620un627tmZ+8803wzHHHBN++umnllkp57XXXhvOOuuslmnd/YTvG/I6PGXr7jy7/e4TKOLxpfs08t9yQwdARxxx3+7eSQAADlFJREFURNhiiy3CCCOM0KbsX/7ylzD11FP/aV6mnXrqqeG3335rcx15zTDqqKOGJ598Mrz88su5bPKvf/1ruOaaa3LZ1vBuJFtvE088cXj++efDwIED42o++eST8MMPPwzvKrt8fvbHK6+8slPbeeedd+LyY4wxRphoooni8znmmCOce+658eDbqZV3cmHyQ77aStm6Y96333671UUmn3zy+EPkl19+aZmH/ZLPwm233dYyrbufsA9S9lplK+p+2d12zbj9Ih5f6lkPRd/XGyYA+vHHH8PgwYNj3dGKwwHigw8+iAfATTbZJP4q5sszzcOMn3/+eUtdc1DaeOONh2kB+vrrrwOtLaOMMkr8QmuZOYcn//nPf/4UdH300UexHOR1nHHGibngF3H2VzETK5fFg7ITDKRADq/KRFkr08cffxwPMnyh552++uqrltYMflFTH5T1//7v/+JzWgAq661nz54xm5Rl3XXXDV988UVcB8ullhEeswfSvMs17rjjtmxyyJAhLc/Tk2r1kOqeeY4++uhw0003xfJMNdVUYdFFF42LjjzyyLGu0np4rLau7Ptd8XyFFVYI448/fsuq2e8oJ/tesudzmK27F154Ieyyyy7DBKzZ/ZhAZ7TRRovr/Pbbb1vWPfbYY7c8T0+6o8xp2yuttFIYb7zxWi1bdr9Mn0X25/Sc9WSNeM6+y/7P55jn2XnTdrv6kTrkM/PNN98Ms6laeefzyueP7xqW5Tn5TylbT6yXfYQypu/dNN+XX34ZW+C7o9zZzx35IQ+fffZZylp8zB57mJCtvzRjpRPTUzm74/iS8pUeW/t+5f20/1EP2cR0Wt2z9ZItZ7V9Pbt8EZ7nf1TrglLT5N6vX7/Qv3//cN5558UDJM/5wN1xxx3h2WefDdNNN104/vjjwyKLLBJbCHbYYYfw/vvvh759+8Yc8QXMr8z0oTzttNMC62U9BAF5pq222ioQOe+7777htddei5s+5JBDYkC3/fbbt2TloYceCocffnjggJNaPSqX/fDDD8M888wTfz3vvffeYc455wyXXnppOOCAA8LKK68cv5z4IF5xxRXhoosuCqusskr0Y0febrvtwnPPPRe7azhY5ZnojqSraL755guUmYB2yy23jHmkO3K11VaLdVNZbymP1C3dMHfeeWc0nGKKKeL8vH/ccccF7Loz8cVx5JFHhvXWWy92Z5IXWq/OOeec8PDDD8dgILWIZOue/ZP3X3rppUD3F11+BHqVqVqdVs7TFa/5kqRurrrqqkCgQvn4jK2zzjrhzDPPHOZzSJdd+szRnfvee+8FultJlftxyivd0csvv3wg0KgM4rurzClvHBiXXXbZuM9l98ts2bL7JQf966+/Ptx6661hjTXWiF3adNtnv6v4DsKIHwMccHDJ/ohL2+6qR74HNtxww7D55psHWtTnn3/+WK9sr6280xXdq1ev+H3CQXKPPfaIdVZtP2ddGDDf6aefHrfHNvg+f+ONN8KBBx7Yrm5VlqlXyn7uWCffhXzfcjxZeumlY1BXeeyprD/8Kp1YV3ceX9h+ZS
JArfb9OmDAgFiHhx56aDweLLDAAnFffPfdd+PnmTo+5ZRT4uoqy1m5r1duswivGyIAGmussULv3r1jRTE2hoh6ueWWi79C11xzzTDvvPMGfnUTKKRuo7XWWitMP/308SDLL00Cg5FGGinWCV+sHDj5RUrrEa0teQUA/AJacMEFw2yzzRa/8P7973/HR7o2yN/BBx/cst9QLj6QfClx4Ki2LF+ek002WVh88cXjh47uiWmnnTYGg6yIPmiCH4JFyslBi3XRrUAeCJIwyDMRzN1///1h9dVXD1tvvXWgW4GDwgwzzBCzQV1Sd6RsvcUJ//s300wzxRaDtddeO9A1tOeee4bHH388vovjUkstlZ099+djjjlm2H///QNfsgSeJL5kqEsO8NjzJclBL1v3tILMPPPMgS+iueaaKz5Wa82qVqdxI138j7ph/yJRRuqHffD2228PO+200zCfQ8qaPnPsy5NMMkkMwKvtxynbu+66awwUOFDyGc2m7ipzygOfLQJtUna/zJYtu1/yPXPDDTfEAIGuEMYqzjLLLMMYEfDQysdngnpeeOGFW1p/03a78pFWRfY18sd3T/rhQQDfVt757uB7h+56WpD4Tua7utp+Pvvss8di9OjRI34GUpn4/uO7gB8LE0wwQZrc5Y+Vnzs2yIGeH6Uca9gPSZXHnsr6I3CtdOrO40vMdJV/1HO179cZZ5wx7m+bbrppuPrqq8M000wTbr755kCAT7k4pvIdW21/yO7r1Vpqq2Qj90kNEQDxRZoOFpW/CpMoX7SpEviCpgWASJ4vaT6cpBTk3HPPPS3zMp3m97wSeZh00knD2WefHejOIW/knYMgXygXXnhhS1b4siCNPvrocb5qy/I+3VdpXgzSc8pFwEMrGV+sG2ywQQwI+XV94403tnyZM1+yYX1dnThgDho0KH7IaBYniK2VauUtvUcrBK2EdIvSRdHdacQRR4xZoO7YZznoE3im4IH6uOuuu1qt+1Su1spRrU5bm7fe07N544s1dUuyneznkNfZedNzHis/A8xL4vPK+/wCp2Ugm7qzzCkf2a7iVB7eyz5Pr/meoZx87ji40vpHqjSi5faEE04It9xyS2wliTPl+I+8p/zzXUALVHvzvuOOO8Zy0ZLOD5rW9vPWinPUUUfFctMynbVtbf56Ta/8ziX4fPDBB2PdsI0+ffoEPsPVjj3Z+qvm1J3Hl474UPeMtyMttNBCsaWWY9Hf/va3QID07tDWoGrlTNtK+056XaTHhgiAiMb5cPHriw9YStnnaRqPRLIcdGgFINJPifn5m3XWWePOTl8uKfW/p/m68pEDP7+Q6PZhpyM/NO3TAkKXB1E3O1y2bOl5tWUr85rtr03v8SGm5YzEB51mTQzYqUmUv9py8c0u+MdBk9YRuhRohUu/DvlioUmZRD5Tnih/Mshmh0AvzcNBmAMNf+wr3Zmy+U3P+ZIg8KS7hERX0oorrli17jkQUCe1UrU6rTV/vd+rVh/VtpHKny1Te/Zjmtcru/66u8yV5atWNuZJ+yX55fOczjDlAFst8QOAX9y0XtMaWoTU3ryzD5Nvxlum4LXafp79bPO9m/Yfhh/wY5VufM6wyytVfucylIDjC2cdkjjTj89oa8eelM9qTt15fEn5qvaYrYPs92t2XvZDvj+ffvrpOJSAk2MI0KuVk+XSvp5dR5GeN0QARDMkB0zGRjB25r777otfLPxCpKIYR8P4AnZeDhwcYHnO2V00W9O6wnwccPkVSb81/aF0RXHqLV/OTM8j8auC5kR2Kg72NDfy4SMQoA+aJmgGmNKsTKBC9x3POXDypVG57Isvvhj7znmf9fBlRFn4YqF/ly9dzpSjK4zxNvvtt1+Ye+65w2abbRbN6JfnjCVs0piUrnbgi+Uf//hHOPHEE+NYAMaOkPjgMeaF8QC0jDGWgHEkqd4o+zPPPBNdeJ/xGHR9cbAk7bbbbvFLLA0gjxO74R/N+nRpEMjS6oMrdcM+zHt84dNaxa9nAvRs3dMdsthii8XLErAs3UCMdWJdTzzxROzKvPvuu6vWaR5FpQ74LLFf8Zz9jrFcPK/8HGY/c7R8UY+05JKYP/sZ4D1+sPCZZR/gBwFdu+zLfM75fFfbj/Moc7VttFY2Dixpv6Q1he4DDoiMSeR1pRHr5iDCvkCXZ96JYISuY743qAOe85mjvtqTd747t9lmm2G6nKvt53T58pmlDpMdQROXOGDMIvs9gVNeqdrnjh+mdAPS+sH3DN29lccexgRljzW0kFQ6defxpZZfte/XFIhSLr6P6UHgBynHFS65wf7MsaJaOdlW2tfTd3Ct7XfHeyMMLeAfTSbdkYM6bZOKIHjgFz8furZSmp/it9ZERwBCsJTmbWud9XqfII0ykK9UHvLJL8W2Ti+utmx788UA23SmTVqGljJaovgSzivxBUhgR386ze10x3GAIPBLdZJc2spTmp/5+OKmPAQQRU4c/CeccMKW/bha3WfLVass1eq01vzd+R4HHfY19vvW9mMOksyTuhCr5bc7y8wPp2233Tb+eMrmLVs2pmfrrz35JTDkpATKXqTUnrxTl9W+P6rt5wRc1G36Dq+27+dV/mrbZhrlye5/6fhQ6zupmlPaB9LyeZWr1nZSnrJlIeChC5PvpHT8wYF8Mx/HyJRqlTPNU6TH/w5EKFKOOpiXtEOmD05bq0nztxb8sHyq2DRvW+us1/vZL4tUHvKZdr5a26m2bK35s+9VBj+8x4DyvBNnEzD2h19X6UsynVad6iS5tJU35qeFhF8v2BTponmt5b2ym6Na3SeH1taRpler0/Re0R5pgk+ptf2YVpK2UneUmSCdX/YcQGjNqEzZsvFetv5q5feioQPkadmji6FowQ/lqJV33idl6/K/U/77v9p+XulUbd/PrqMrn1fbNtMqjwfpda3vpGpOaR9Iy3dlWdq77pSnbFn40cFf9viDQ7X9sVY525uHPOdrmBagPNHcVtcK8MuC7j26hZZccsnY1dGZLRJMcd0cuvPacwDtzLZctjkF6Makq49uuqmHnjFVr8S4P7r16UowKZC3AF2fdLXT3UwXWZGCtXpYGADVQ9F1KKCAAgoooECpBNoeLFOq4phZBRRQQAEFFFCgbQEDoLaNnEMBBRRQQAEFGkzAAKjBKtTiKKCAAgoooEDbAgZAbRs5hwIKKKCAAgo0mIABUINVqMVRQAEFFFBAgbYFDIDaNnIOBRRQQAEFFGgwAQOgBqtQi6OAAgoooIACbQsYALVt5BwKKKCAAgoo0GACBkANVqEWRwEFFFBAAQXaFjAAatvIORRQQAEFFFCgwQT+Hxr6T8CDMgAiAAAAAElFTkSuQmCC" alt="" width="373" height="280" /></p>
<p>The code they use in base graphics is this (super blurry, sorry; you can also <a href="http://motioninsocial.com/tufte/">go to the website</a> for a better view).</p>
<p><img class="aligncenter wp-image-4646" src="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-300x82.png" alt="Screen Shot 2016-02-11 at 12.56.53 PM" width="483" height="132" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-300x82.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-768x209.png 768w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-1024x279.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-260x71.png 260w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM.png 1248w" sizes="(max-width: 483px) 100vw, 483px" /></p>
<p>in ggplot2 the code is:</p>
<p><img class="aligncenter wp-image-4647" src="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-300x73.png" alt="Screen Shot 2016-02-11 at 12.56.39 PM" width="526" height="128" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-300x73.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-768x187.png 768w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-1024x249.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-260x63.png 260w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM.png 1334w" sizes="(max-width: 526px) 100vw, 526px" /></p>
<p>Both require a significant amount of coding. The ggplot2 plot also takes advantage of the ggthemes package here, which means that without that package a plot like this would require even more coding.</p>
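<p>For readers who don’t want to squint at the screenshots, here is a rough sketch of what a Tufte-style minimal barchart looks like in ggplot2 with the ggthemes package. This is not the exact code from the linked post; the data and the styling details are illustrative.</p>
<pre><code># Rough sketch of a Tufte-style minimal barchart with ggplot2 + ggthemes.
# Not the exact code from the linked post; data and styling are illustrative.
library(ggplot2)
library(ggthemes)

d <- data.frame(year = 1967:1977,
                rate = c(0.66, 0.65, 0.64, 0.62, 0.61, 0.60,
                         0.59, 0.58, 0.57, 0.57, 0.60))

ggplot(d, aes(x = factor(year), y = rate)) +
  geom_bar(stat = "identity", fill = "gray", width = 0.25) +
  theme_tufte(ticks = FALSE) +
  labs(x = "", y = "")
</code></pre>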
<p>The bottom line is for production graphics, any system requires work. So why do I still use base R like an old person? Because I learned all the stupid little tricks for that system, it was a huge pain, and it would be a huge pain to learn it again for ggplot2, to make very similar types of plots. This is one where neither system is particularly better, but the time-optimal solution is to stick with whichever system you learned first.</p>
<p><strong>Grading student work</strong></p>
<p>People I seriously respect suggest teaching ggplot2 before base graphics as a way to get people up and going quickly making pretty visualizations. This is a good solution to the <a href="http://simplystatistics.org/2014/08/13/swirl-and-the-little-data-scientists-predicament/">little data scientist’s predicament</a>. The tricky thing is that the defaults in ggplot2 are just pretty enough that they might trick you into thinking the graph is production ready as is. Say for example you make a plot of the latitude and longitude of <a href="https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/quakes.html">quakes</a> data in R, colored by the number of stations reporting. This is one case where ggplot2 crushes base R for simplicity because of the automated generation of a color scale. You can make this plot with just the line:</p>
<p><code>ggplot() + geom_point(data = quakes, aes(x = lat, y = long, colour = stations))</code></p>
<p>And get this out:</p>
<p><img class="aligncenter wp-image-4649" src="http://simplystatistics.org/wp-content/uploads/2016/02/quakes-300x264.png" alt="quakes" width="420" height="370" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/quakes-300x264.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/quakes-227x200.png 227w, http://simplystatistics.org/wp-content/uploads/2016/02/quakes.png 627w" sizes="(max-width: 420px) 100vw, 420px" /></p>
<p>That is a pretty amazing plot in one line of code! What often happens with students in a first serious data analysis class is they think that plot is done. But it isn’t even close. Here are a few things you would need to do to make this plot production ready: (1) make the axes bigger, (2) make the labels bigger, (3) make the labels be full names (latitude and longitude, ideally with units when variables need them), (4) make the legend title be number of stations reporting. Those are the bare minimum. But a very common move by a person who knows a little R/data analysis would be to leave that graph as it is and submit it directly. I know this from lots of experience.</p>
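<p>For concreteness, here is one way (certainly not the only way) to address those four points in ggplot2; the exact text sizes are illustrative.</p>
<pre><code># One way (among many) to address the four points above; sizes are illustrative.
library(ggplot2)
data(quakes)
ggplot(quakes, aes(x = lat, y = long, colour = stations)) +
  geom_point() +
  labs(x = "Latitude", y = "Longitude",
       colour = "Number of stations reporting") +
  theme(axis.text = element_text(size = 14),
        axis.title = element_text(size = 16))
</code></pre>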
<p>The one nice thing about teaching base R here is that the base version for this plot is either (a) a ton of work or (b) ugly. In either case, it makes the student think very hard about what they need to do to make the plot better, rather than just assuming it is ok.</p>
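<p>To make that comparison concrete, here is a minimal sketch of what a base R version might look like, with a hand-rolled color scale. It is one of many possible approaches and the details are illustrative.</p>
<pre><code># Minimal base R sketch of the same plot with a hand-rolled color scale.
# Illustrative only; many other approaches are possible.
data(quakes)
pal  <- colorRampPalette(c("darkblue", "lightblue"))(10)
bins <- cut(quakes$stations, breaks = 10)
plot(quakes$lat, quakes$long, col = pal[bins], pch = 19,
     xlab = "Latitude", ylab = "Longitude")
legend("topright", legend = levels(bins), col = pal, pch = 19,
       title = "Stations reporting", cex = 0.7)
</code></pre>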
<p><strong>Where ggplot2 is better for sure</strong></p>
<p>ggplot2 being compatible with piping, having a simple system for theming, having a good animation package, and in general being an excellent platform for developers who create <a href="https://ggplot2-exts.github.io/index.html">extensions</a> are all huge advantages. It is also great for getting absolute newbies up and making medium-quality graphics in a huge hurry. This is a great way to get more people engaged in data science and I’m psyched about the reach and power ggplot2 has had. Still, I probably won’t use it for my own work, even though it disappoints my data scientist friends.</p>
Data handcuffs
2016-02-10T15:38:37+00:00
http://simplystats.github.io/2016/02/10/data-handcuffs
<p>A few years ago, if you asked me what the top skills I got asked about for students going into industry, I’d definitely have said things like data cleaning, data transformation, database pulls, and other non-traditional statistical tasks. But as companies have progressed from the point of storing data to actually wanting to do something with it, I would say one of the hottest skills is understanding and dealing with data from randomized trials.</p>
<p>In particular I see data scientists talking more about <a href="https://medium.com/@InVisionApp/a-b-and-see-a-beginner-s-guide-to-a-b-testing-a16406f1a239#.p7hoxirwo">A/B testing</a>, <a href="http://varianceexplained.org/r/bayesian-ab-testing/">sequential stopping rules</a>, <a href="https://twitter.com/hspter/status/696820603945414656">hazard regression</a> and other ideas that are really common in Biostatistics, which has traditionally focused on the analysis of data from designed experiments in biology.</p>
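<p>As a concrete (made-up) example of the kind of analysis being described, a simple A/B test of conversion rates often boils down to something like a two-proportion test:</p>
<pre><code># Toy A/B test: compare conversion rates in arms A and B (numbers are made up).
conversions <- c(A = 120, B = 150)
visitors    <- c(A = 2000, B = 2000)
prop.test(conversions, visitors)
</code></pre>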
<p>I think it is great that companies are choosing to do experiments, as this <a href="http://simplystatistics.org/2013/07/15/yes-clinical-trials-work/">still remains</a> the gold standard for how to generate knowledge about causal effects. One interesting new development, though, is the extreme lengths some organizations appear to be going to in order to be “data-driven”. They make all decisions based on data they have collected or experiments they have performed.</p>
<p>But data mostly tell you about small-scale effects and things that happened in the past. Making big discoveries or improvements requires (a) having creative ideas that are not supported by data and (b) trying them in experiments to see if they work. If you get too caught up in experimenting on the same set of conditions you will inevitably asymptote to a maximum and quickly reach diminishing returns. This is where the data handcuffs come in. Data can only tell you about the conditions that existed in the past; they often can’t predict conditions in the future or which ideas will work out.</p>
<p>In an interesting parallel to academic research, a good strategy appears to be: (a) trying a bunch of things, including some things that have only a pretty modest chance of success, (b) doing experiments early and often when trying those things, and (c) getting very good at recognizing failure quickly and moving on to ideas that will be fruitful. The challenges are that in part (a) it is often difficult to generate really new ideas, especially if you are already doing something that has had any level of success. There will be extreme pressure not to change what you are doing. In part (c) the challenge is that if you discard ideas too quickly you might miss a big opportunity, but if you don’t discard them quickly enough you will sink a lot of time and cost into ultimately not very fruitful projects.</p>
<p>Regardless, almost all of the most <a href="http://simplystatistics.org/2013/09/25/is-most-science-false-the-titans-weigh-in/">interesting projects</a> I’ve worked on in my life were not driven by data that suggested they would be successful. They were often risks where the data either weren’t in yet, or the data supported not doing them at all. But as a statistician I decided to straight up ignore the data and try anyway. Then again, these ideas have also been the sources of <a href="http://simplystatistics.org/2012/01/11/healthnewsrater/">my biggest flameouts</a>.</p>
Leek group guide to reading scientific papers
2016-02-09T13:59:53+00:00
http://simplystats.github.io/2016/02/09/leek-group-guide-to-reading-scientific-papers
<p>The other day on Twitter, Amelia requested a guide for reading papers:</p>
<blockquote class="twitter-tweet" data-width="550">
<p lang="en" dir="ltr">
I love <a href="https://twitter.com/jtleek">@jtleek</a>’s github guides to reviewing papers, writing R packages, giving talks, etc. Would love one on reading papers, for students.
</p>
<p>
— Amelia McNamara (@AmeliaMN) <a href="https://twitter.com/AmeliaMN/status/695633602751635456">February 5, 2016</a>
</p>
</blockquote>
<p>So I came up with a guide, which you can find here: <a href="https://github.com/jtleek/readingpapers">Leek group guide to reading papers</a>. This was actually the one I had the hardest time writing. I described how I tend to read a paper, but I’m not sure that is really the optimal (or even a very good) way. I’d really appreciate pull requests if you have ideas on how to improve the guide.</p>
A menagerie of messed up data analyses and how to avoid them
2016-02-01T13:39:57+00:00
http://simplystats.github.io/2016/02/01/a-menagerie-of-messed-up-data-analyses-and-how-to-avoid-them
<p><em>Update: I realize this may seem like I’m picking on people. I really don’t mean to, I have for sure made all of these mistakes and many more. I can give many examples, but the one I always remember is the time Rafa saved me from “I got a big one here” when I made a huge mistake as a first year assistant professor.</em></p>
<p>In any introductory statistics or data analysis class they might teach you the basics, how to load a data set, how to munge it, how to do t-tests, maybe how to write a report. But there are a whole bunch of ways that a data analysis can be screwed up that often get skipped over. Here is my first crack at creating a “menagerie” of messed up data analyses and how you can avoid them. Depending on interest I could probably list a ton more, but as always I’m doing the non-comprehensive list :).</p>
<p><img class="alignleft wp-image-4613" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png" alt="direction411" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction411-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;"><strong>Outcome switching</strong></span></p>
<p><em>What it is: </em>Outcome switching is where you collect data looking at, say, the relationship between exercise and blood pressure. Once you have the data, you realize that blood pressure isn’t really related to exercise. So you change the outcome and ask if HDL levels are related to exercise, and you find a relationship. It turns out that when you do this kind of switch you have biased your analysis, because you would have just stopped if you had found the original relationship.</p>
<p style="text-align: left;">
<em>An example: </em><a href="http://www.vox.com/2015/12/29/10654056/ben-goldacre-compare-trials">In this article</a> they discuss how Paxil, an anti-depressant, was originally studied for several main outcomes, none of which showed an effect - but some of the secondary outcomes did. So they switched the outcome of the trial and used this result to market the drug.
</p>
<p style="text-align: left;">
<em>What you can do: </em>Pre-specify your analysis plan, including which outcomes you want to look at. Then very clearly state when you are analyzing a primary outcome or a secondary analysis. That way people know to take the secondary analyses with a grain of salt. You can even get paid $$ to pre-specify with the OSF's <a href="https://cos.io/prereg/">pre-registration challenge</a>.
</p>
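<p>To see why outcome switching biases your results, here is a small simulation (mine, not from the article above): with ten unrelated outcomes, reporting “whichever one came out significant” produces a false positive far more often than the nominal 5%.</p>
<pre><code># Simulation: test 10 unrelated outcomes and report the best-looking one.
# The chance of at least one "significant" result is about 1 - 0.95^10, i.e. ~40%.
set.seed(2016)
one_study <- function(n = 50, n_outcomes = 10) {
  x <- rnorm(n)  # the exposure, unrelated to every outcome
  pvals <- replicate(n_outcomes,
                     summary(lm(rnorm(n) ~ x))$coefficients[2, 4])
  min(pvals) < 0.05
}
mean(replicate(2000, one_study()))  # roughly 0.4, not 0.05
</code></pre>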
<p><img class="alignleft wp-image-4618" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png" alt="direction398" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398.png 512w" sizes="(max-width: 125px) 100vw, 125px" /></p>
<p><span style="text-decoration: underline;"><strong>Garden of forking paths</strong></span></p>
<p><em>What it is: </em>In this case you may or may not have specified your outcome and stuck with it. Let’s assume you have, so you are still looking at blood pressure and exercise. But it turns out a bunch of people had apparently erroneous measures of blood pressure. So you dropped those measurements and did the analysis with the remaining values. This is a totally sensible thing to do, but if you didn’t specify in advance how you would handle bad measurements, you can make a bunch of different choices here (the forking paths). You could drop them, impute them, multiply impute them, weight them, etc. Each of these gives a different result and you can accidentally pick the one that works best even if you are being “sensible.”</p>
<p><em>An example</em>: <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">This article</a> gives several examples of the forking paths. One is where authors report that at peak fertility women are more likely to wear red or pink shirts. They made several inclusion/exclusion choices (which women to place in which comparison group) that could easily have gone a different direction or that went against their own stated rules.</p>
<p><em>What you can do: </em>Pre-specify every part of your analysis plan, down to which observations you are going to drop, transform, etc. To be honest this is super hard to do because almost every data set is messy in a unique way. So the best thing here is to point out steps in your analysis where you made a choice that wasn’t pre-specified and that you could have made differently. Or, even better, try some of the different choices and make sure your results aren’t dramatically different.</p>
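<p>To see how much the forking paths can matter, here is a minimal R sketch (made-up blood pressure and exercise data, not from any real study) that runs the same regression under a few defensible ways of handling the suspicious measurements and compares the estimates:</p>
<pre>set.seed(123)
## hypothetical data: exercise (hours/week) and systolic blood pressure,
## with a handful of implausible blood pressure readings mixed in
d <- data.frame(exercise = rnorm(100, mean = 3, sd = 1))
d$bp <- 130 - 2 * d$exercise + rnorm(100, sd = 10)
d$bp[c(3, 17, 42, 88)] <- c(320, 8, 305, 15)

## the same model under three reasonable choices for the bad values
fits <- list(
  keep_all  = lm(bp ~ exercise, data = d),
  drop_bad  = lm(bp ~ exercise, data = d, subset = bp > 50 & bp < 250),
  winsorize = lm(pmin(pmax(bp, 50), 250) ~ exercise, data = d)
)
## if these disagree wildly, the conclusion depends on an analysis choice
sapply(fits, function(f) coef(f)["exercise"])
</pre>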
<p> </p>
<p><strong><img class="alignleft wp-image-4621" src="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png" alt="emoticon149" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">P-hacking</span></strong></p>
<p><em>What it is: </em>The nefarious cousin of the garden of forking paths. Basically here the person outcome switches, uses the garden of forking paths, intentionally doesn’t correct for multiple testing, or uses any of these other means to cheat and get a result that they like.</p>
<p><em>An example:</em> This one gets talked about a lot and there is <a href="http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106">some evidence that it happens</a>. But it is usually pretty hard to ascribe purely evil intentions to people and I’d rather not point the finger here. I think that often the garden of forking paths results in just as bad an outcome without people having to try.</p>
<p><em>What to do:</em> Know how to do an analysis well and don’t cheat.</p>
<p><em>Update: </em> Some <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2649230">definitions of p-hacking describe it as occurring</a> “when honest researchers face ambiguity about what analyses to run, and convince themselves those leading to better results are the correct ones (see e.g., Gelman & Loken, 2014; John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011; Vazire, 2015).” This coincides with the definition of the “garden of forking paths” above. I have been asked to point this out <a href="https://twitter.com/talyarkoni/status/694576205089996800">on Twitter.</a> It was never my intention to accuse anyone of fraud. That being said, I still think the connotation many people have in mind when they say “p-hacking” corresponds to my definition above, although I agree with folks that this connotation isn’t helpful - which is why I prefer to call the non-nefarious version the garden of forking paths.</p>
<p> </p>
<p><strong><img class="alignleft wp-image-4623" src="http://simplystatistics.org/wp-content/uploads/2016/02/paypal15.png" alt="paypal15" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/paypal15-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/paypal15.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">Uncorrected multiple testing </span></strong></p>
<p><em>What it is: </em>This one is related to the garden of forking paths and outcome switching. Most statistical methods for measuring the potential for error assume you are only evaluating one hypothesis at a time. But in reality you might be measuring a ton either on purpose (in a big genomics or neuroimaging study) or accidentally (because you consider a bunch of outcomes). In either case, the expected error rate changes a lot if you consider many hypotheses.</p>
<p><em>An example: </em> The <a href="http://users.stat.umn.edu/~corbett/classes/5303/Bennett-Salmon-2009.pdf">most famous example</a> is when someone did an fMRI on a dead fish and showed that there were a bunch of significant regions at the P < 0.05 level. The reason is that there is natural variation in the background of these measurements and if you consider each pixel independently ignoring that you are looking at a bunch of them, a few will have P < 0.05 just by chance.</p>
<p><em>What you can do</em>: Correct for multiple testing. When you calculate a large number of p-values make sure you <a href="http://varianceexplained.org/statistics/interpreting-pvalue-histogram/">know what their distribution</a> is expected to be and you use a method like Bonferroni, Benjamini-Hochberg, or q-value to correct for multiple testing.</p>
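<p>To make this concrete, here is a minimal R sketch (simulated data where the null hypothesis is true for every test, so any hit is a false positive by construction) showing how different the uncorrected and corrected counts can be:</p>
<pre>set.seed(1)
## 10,000 tests in which the null hypothesis is true every single time
p <- replicate(10000, t.test(rnorm(10), rnorm(10))$p.value)

sum(p < 0.05)                                   # ~500 "discoveries", all of them false
sum(p.adjust(p, method = "bonferroni") < 0.05)  # controls the family-wise error rate
sum(p.adjust(p, method = "BH") < 0.05)          # Benjamini-Hochberg, controls the FDR
</pre>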
<p> </p>
<p><strong><img class="alignleft wp-image-4625" src="http://simplystatistics.org/wp-content/uploads/2016/02/animal162.png" alt="animal162" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/animal162-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/animal162.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">I got a big one here</span></strong></p>
<p><em>What it is:</em> One of the most painful experiences for all new data analysts. You collect data and discover a huge effect. You are super excited so you write it up and submit it to one of the best journals or convince your boss to bet the farm. The problem is that huge effects are incredibly rare and are usually due to some combination of experimental artifacts and biases or mistakes in the analysis. Almost no effects you detect with statistics are huge. Even the relationship between smoking and cancer is relatively weak in observational studies and requires very careful calibration and analysis.</p>
<p><em>An example:</em> <a href="http://www.ncbi.nlm.nih.gov/pubmed/17206142">In a paper</a> authors claimed that 78% of genes were differentially expressed between Asians and Europeans. But it turns out that most of the Asian samples were measured in one batch and the Europeans in another. <a href="http://www.ncbi.nlm.nih.gov/pubmed/17597765">Batch effects explained</a> a large fraction of these differences.</p>
<p><em>What you can do</em>: Be deeply suspicious of big effects in data analysis. If you find something huge and counterintuitive, especially in a well established research area, spend <em>a lot</em> of time trying to figure out why it could be a mistake. If you don’t, others definitely will, and you might be embarrassed.</p>
<p><span style="text-decoration: underline;"><strong><img class="alignleft wp-image-4632" src="http://simplystatistics.org/wp-content/uploads/2016/02/man298.png" alt="man298" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/man298-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/man298.png 256w" sizes="(max-width: 125px) 100vw, 125px" />Double complication</strong></span></p>
<p><em>What it is</em>: When faced with a large and complicated data set, beginning analysts often feel compelled to use a big complicated method. Imagine you have collected data on thousands of genes or hundreds of thousands of voxels and you want to use this data to predict some health outcome. There is a severe temptation to use deep learning or blend random forests, boosting, and five other methods to perform the prediction. The problem is that complicated methods fail for complicated reasons, which will be extra hard to diagnose if you have a really big, complicated data set.</p>
<p><em>An example:</em> There are a large number of examples where people use very small training sets and complicated methods. One example (there were many other problems with this analysis, too) is when people <a href="http://www.nature.com/nm/journal/v12/n11/full/nm1491.html">tried to use complicated prediction algorithms</a> to predict which chemotherapy would work best using genomics. Ultimately this paper was retracted for many problems, but the complexity of the methods plus the complexity of the data made the problems hard to detect.</p>
<p><em>What you can do:</em> When faced with a big, messy data set, try simple things first. Use linear regression, make simple scatterplots, check to see if there are obvious flaws with the data. If you must use a really complicated method, ask yourself if there is a reason it is outperforming the simple methods because often with large data sets <a href="http://arxiv.org/pdf/math/0606441.pdf">even simple things work</a>.</p>
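<p>As a small, hypothetical illustration of the “simple things first” workflow in R (simulated data with one real predictor among many noise variables): look at the data, fit a one-variable linear model, and only reach for something fancier if it clearly beats that baseline on held-out data.</p>
<pre>set.seed(7)
## hypothetical data: 100 samples, 50 candidate predictors, one real signal
n <- 100
x <- matrix(rnorm(n * 50), nrow = n)
y <- 2 * x[, 1] + rnorm(n)

plot(x[, 1], y)                 # eyeball the relationship before modeling anything
fit <- lm(y ~ x[, 1])           # the simple baseline
summary(fit)$r.squared

## a deep learner or a big ensemble has to clearly beat this baseline
## (out of sample) before the extra complexity is worth it
</pre>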
<p><span style="text-decoration: underline;"><strong>Image credits:</strong></span></p>
<ul>
<li>Outcome switching. Icon made by <a href="http://hananonblog.wordpress.com" title="Hanan">Hanan</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li>
<li>Forking paths. Icon made by <a href="http://iconalone.com" title="Popcic">Popcic</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li>
<li>P-hacking.Icon made by <a href="http://www.icomoon.io" title="Icomoon">Icomoon</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li>
<li>Uncorrected multiple testing.Icon made by <a href="http://www.freepik.com" title="Freepik">Freepik</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li>
<li>Big one here. Icon made by <a href="http://www.freepik.com" title="Freepik">Freepik</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li>
<li>Double complication. Icon made by <a href="http://www.freepik.com" title="Freepik">Freepik</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li>
</ul>
Exactly how risky is breathing?
2016-01-26T09:58:23+00:00
http://simplystats.github.io/2016/01/26/exactly-how-risky-is-breathing
<p>This <a href="http://nyti.ms/23nysp5">article by George Johnson</a> in the NYT describes a study by Kamen P. Simonov and Daniel S. Himmelstein that examines the hypothesis that people living at higher altitudes experience lower rates of lung cancer than people living at lower altitudes.</p>
<blockquote>
<p>All of the usual caveats apply. Studies like this, which compare whole populations, can be used only to suggest possibilities to be explored in future research. But the hypothesis is not as crazy as it may sound. Oxygen is what energizes the cells of our bodies. Like any fuel, it inevitably spews out waste — a corrosive exhaust of substances called “free radicals,” or “reactive oxygen species,” that can mutate DNA and nudge a cell closer to malignancy.</p>
</blockquote>
<p>I’m not so much focused on the science itself, which is perhaps intriguing, but rather on the way the article was written. First, George Johnson links to the <a href="https://peerj.com/articles/705/">paper</a> itself, <a href="http://simplystatistics.org/2015/01/15/how-to-find-the-science-paper-behind-a-headline-when-the-link-is-missing/">already a major victory</a>. Also, I thought he did a very nice job of laying out the complexity of doing a population-level study like this one–all the potential confounders, selection bias, negative controls, etc.</p>
<p>I remember particulate matter air pollution epidemiology used to have this feel. You’d try to do all these different things to make the effect go away, but for some reason, under every plausible scenario, in almost every setting, there was always some association between air pollution and health outcomes. Eventually you start to believe it….</p>
On research parasites and internet mobs - let's try to solve the real problem.
2016-01-25T14:34:08+00:00
http://simplystats.github.io/2016/01/25/on-research-parasites-and-internet-mobs-lets-try-to-solve-the-real-problem
<p>A couple of days ago one of the editors of the New England Journal of Medicine <a href="http://www.nejm.org/doi/full/10.1056/NEJMe1516564">posted an editorial</a> showing some moderate level of support for data sharing but also introducing the term “research parasite”:</p>
<blockquote>
<p>A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”</p>
</blockquote>
<p>While this is obviously the most inflammatory statement in the article, I think that there are several more important and overlooked misconceptions. The biggest problems are:</p>
<ol>
<li><strong>“The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters.” </strong>This almost certainly would be the fault of the investigators who published the data. If the authors adhere to good <a href="https://github.com/jtleek/datasharing">data sharing</a> policies and respond to queries from people using their data promptly, then this should not be a problem at all.</li>
<li><strong>“… but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited.” </strong>The idea that no one should be able to try to disprove ideas with the authors data has been covered in other blogs/on Twitter. One thing I do think is worth considering here is the concern about credit. I think that the traditional way credit has accrued to authors has been citations. But if you get a major study funded, say for 50 million dollars, run that study carefully, sit on a million conference calls, and end up with a single major paper, that could be frustrating. Which is why I think that a better policy would be to have the people who run massive studies get credit in a way that <em>is not papers</em>. They should get some kind of formal administrative credit. But then the data should be immediately and publicly available to anyone to publish on. That allows people who run massive studies to get credit and science to proceed normally.</li>
<li><strong>“</strong><strong>The new investigators arrived on the scene with their own ideas and worked symbiotically, rather than parasitically, with the investigators holding the data, moving the field forward in a way that neither group could have done on its own.” </strong> The story that follows about a group of researchers who collaborated with the NSABP to validate their gene expression signature is very encouraging. But it isn’t the only way science should work. Researchers shouldn’t be constrained to one model or another. Sometimes collaboration is necessary, sometimes it isn’t, but in neither case should we label the researchers “symbiotic” or “parasitic”, terms that have extreme connotations.</li>
<li><strong>“How would data sharing work best? We think it should happen symbiotically, not parasitically.”</strong> I think that it should happen <em>automatically</em>. If you generate a data set with public funds, you should be required to immediately make it available to researchers in the community. But you should <em>get credit for generating the data set and the hypothesis that led to the data set</em>. The problem is that people who generate data will almost never be as fast at analyzing it as people who know how to analyze data. But both deserve credit, whether they are working together or not.</li>
<li><strong>“Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant coauthorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested.”</strong> The trouble with this framework is that it preferentially accrues credit to data generators and doesn’t accurately describe the role of either party. To flip this argument around, you could just as easily say that anyone who uses <a href="http://salzberg-lab.org/">Steven Salzberg</a>’s software for aligning or assembling short reads should make him a co-author. I think Dr. Drazen would agree that not everyone who aligned reads should add Steven as co-author, despite his contribution being critical for the completion of their work.</li>
</ol>
<p>After the piece was posted there was predictable internet rage from <a href="https://twitter.com/dataparasite">data parasites</a>, a <a href="https://twitter.com/hashtag/researchparasite?src=hash">dedicated hashtag</a>, and half a dozen angry blog posts written about the piece. These inspired a <a href="http://www.nejm.org/doi/full/10.1056/NEJMe1601087">follow up piece</a> from Drazen. I recognize why these folks were upset - the “research parasites” thing was unnecessarily inflammatory. But <a href="http://simplystatistics.org/2014/03/05/plos-one-i-have-an-idea-for-what-to-do-with-all-your-profits-buy-hard-drives/">I also sympathize with data creators</a> who are also subject to a tough environment - particularly when they are junior scientists.</p>
<p>I think the response to the internet outrage also misses the mark and comes off as a defense of people with angry perspectives on data sharing. I would have much rather seen a more pro-active approach from a leading journal of medicine. I’d like to see something that acknowledges different contributions appropriately and doesn’t slow down science. Something like:</p>
<ol>
<li>We will require all data, including data from clinical trials, to be made public immediately on publication as long as it poses minimal risk to the patients involved or the patients have been consented to broad sharing.</li>
<li>When data are not made publicly available they are still required to be deposited with a third party such as the NIH or Figshare to be held available for request from qualified/approved researchers.</li>
<li>We will require that all people who use data give appropriate credit to the original data generators in terms of data citations.</li>
<li>We will require that all people who use software/statistical analysis tools give credit to the original tool developers in terms of software citations.</li>
<li>We will include a new designation for leaders of major data collection or software generation projects that can be included to demonstrate credit for major projects undertaken and completed.</li>
<li>When reviewing papers written by experimentalists with no statistical/computational co-authors we will require no fewer than 2 statistical/computational referees to ensure there has not been a mistake made by inexperienced researchers.</li>
<li>When reviewing papers written by statistical/computational authors with no experimental co-authors we will require no fewer than 2 experimental referees to ensure there has not been a mistake made by inexperienced researchers.</li>
</ol>
<p> </p>
Not So Standard Deviations Episode 8 - Snow Day
2016-01-24T21:41:44+00:00
http://simplystats.github.io/2016/01/24/not-so-standard-deviations-episode-8-snow-day
<p>Hilary and I were snowed in over the weekend, so we recorded Episode 8 of Not So Standard Deviations. In this episode, Hilary and I talk about how to get your foot in the door with data science, the New England Journal’s view on data sharing, Google’s “Cohort Analysis”, and trying to predict a movie’s box office returns based on the movie’s script.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Follow <a href="https://twitter.com/nssdeviations">@NSSDeviations</a> on Twitter!</p>
<p>Show notes:</p>
<ul>
<li><a href="http://goo.gl/eUU2AK">Remembrances of Peter Hall</a></li>
<li><a href="http://goo.gl/HbMu87">Research Parasites</a> (NEJM editorial by Dan Longo and Jeffrey Drazen)</li>
<li>Amazon <a href="http://goo.gl/83DvvO">review/data analysis</a> of Fifty Shades of Grey</li>
<li><a href="https://youtu.be/55psWVYSbrI">Time-lapse cats</a></li>
<li><a href="https://getpocket.com">Pocket</a></li>
</ul>
<p>Apologies for my audio on this episode. I had a bit of a problem calibrating my microphone. I promise to figure it out for the next episode!</p>
<p><a href="https://api.soundcloud.com/tracks/243634673/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio for this episode</a>.</p>
<p> </p>
Parallel BLAS in R
2016-01-21T11:53:07+00:00
http://simplystats.github.io/2016/01/21/parallel-blas-in-r
<p>I’m working on a new chapter for my R Programming book and the topic is parallel computation. So, I was happy to see this tweet from David Robinson (@drob) yesterday:</p>
<blockquote class="twitter-tweet" lang="en">
<p dir="ltr" lang="en">
How fast is this <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a> code? x <- replicate(5e3, rnorm(5e3)) x %*% t(x) For me, w/Microsoft R Open, 2.5sec. Wow. <a href="https://t.co/0SbijNxxVa">https://t.co/0SbijNxxVa</a>
</p>
<p>
— David Robinson (@drob) <a href="https://twitter.com/drob/status/689916280233562112">January 20, 2016</a>
</p>
</blockquote>
<p>What does this have to do with parallel computation? Briefly, the code generates 5,000 standard normal random variates, repeats this 5,000 times, and stores them in a 5,000 x 5,000 matrix, x. Then it computes x %*% t(x), the matrix multiplied by its own transpose. The second part is key, because it involves a matrix multiplication.</p>
<p>Matrix multiplication in R is handled, at a very low level, by the library that implements the Basic Linear Algebra Subroutines, or BLAS. The stock R that you download from CRAN comes with what’s known as a reference implementation of BLAS. It works, it produces what everyone agrees are the right answers, but it is in no way optimized. Here’s what I get when I run this code on my Mac using RStudio and the CRAN version of R for Mac OS X:</p>
<pre>system.time({ x <- replicate(5e3, rnorm(5e3)); tcrossprod(x) })
user system elapsed
59.622 0.314 59.927
</pre>
<p>Note that the “user” time and the “elapsed” time are roughly the same. Note also that I use the tcrossprod() function instead of the otherwise equivalent expression x %*% t(x). Both crossprod() and tcrossprod() are generally faster than using the %*% operator.</p>
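<p>If you want to check the difference yourself, here is a small benchmark you can run as-is; the exact timings will depend entirely on your hardware and on which BLAS your R is linked against, so treat the numbers as illustrative:</p>
<pre>set.seed(42)
x <- matrix(rnorm(2000 * 2000), nrow = 2000)

system.time(y1 <- x %*% t(x))      # forms t(x) explicitly, then multiplies
system.time(y2 <- tcrossprod(x))   # computes x %*% t(x) without materializing t(x)
all.equal(y1, y2)                  # same answer, typically in less time
</pre>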
<p>Now, when I run the same code on my built-from-source version of R (version 3.2.3), here’s what I get:</p>
<pre>system.time({ x <- replicate(5e3, rnorm(5e3)); tcrossprod(x) })
user system elapsed
14.378 0.276 3.344
</pre>
<p>Overall, it’s faster when I don’t run the code through RStudio (14s vs. 59s). Also on this version the elapsed time is about 1/4 the user time. Why is that?</p>
<p>The build-from-source version of R is linked to Apple’s Accelerate framework, which is a large library that includes an optimized BLAS library for Intel chips. This optimized BLAS, in addition to being optimized with respect to the code itself, is designed to be multi-threaded so that it can split work off into chunks and run them in parallel on multi-core machines. Here, the tcrossprod() function was run in parallel on my machine, and so the elapsed time was about a quarter of the time that was “charged” to the CPU(s).</p>
<p>David’s tweet indicated that when using Microsoft R Open, which is a custom-built binary of R, the (I assume?) elapsed time is 2.5 seconds. Looking at the attached link, it appears that Microsoft’s R Open is linked against <a href="https://software.intel.com/en-us/intel-mkl">Intel’s Math Kernel Library</a> (MKL) which contains, among other things, an optimized BLAS for Intel chips. I don’t know what kind of computer David was running on, but assuming it was as high-powered as mine, it would suggest Intel’s MKL sees slightly better performance. But either way, both Accelerate and MKL achieve that speed-up through custom-coding of the BLAS routines and multi-threading on multi-core systems.</p>
<p>If you’re going to be doing any linear algebra in R (and you will), it’s important to link to an optimized BLAS. Otherwise, you’re just wasting time unnecessarily. Besides Accelerate (Mac) and Intel MKL, there’s AMD’s <a href="http://developer.amd.com/tools-and-sdks/archive/amd-core-math-library-acml/">ACML</a> library for AMD chips and the <a href="http://math-atlas.sourceforge.net">ATLAS</a> library, which is a general purpose tunable library. Also, <a href="https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2">Goto’s BLAS</a> is optimized but is not under active development.</p>
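<p>If you are not sure which BLAS your installation is using, a quick check (assuming a reasonably recent R; versions from around 3.4.0 on report this) is to look at sessionInfo(), which lists the BLAS and LAPACK libraries R is linked against, and then run a one-line benchmark to get a feel for whether that BLAS is optimized:</p>
<pre>## recent versions of R report the BLAS/LAPACK shared libraries in use
sessionInfo()

## rough benchmark: roughly a minute with the reference BLAS,
## typically a few seconds with an optimized multi-threaded BLAS
x <- replicate(5e3, rnorm(5e3))
system.time(tcrossprod(x))
</pre>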
Profile of Hilary Parker
2016-01-14T21:15:46+00:00
http://simplystats.github.io/2016/01/14/profile-of-hilary-parker
<p>If you’ve ever wanted to know more about my <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> co-host (and Johns Hopkins graduate) Hilary Parker, you can go check out the <a href="http://thisisstatistics.org/hilary-parker-gets-crafty-with-statistics-in-her-not-so-standard-job/">great profile of her</a> on the American Statistical Association’s This Is Statistics web site.</p>
<blockquote>
<p><strong>What advice would you give to high school students thinking about majoring in statistics?</strong></p>
<p>It’s such a great field! Not only is the industry booming, but more importantly, the disciplines of statistics teaches you to think analytically, which I find helpful for just about every problem I run into. It’s also a great field to be interested in as a generalist– rather than dedicating yourself to studying one subject, you are deeply learning a set of tools that you can apply to any subject that you find interesting. Just one glance at the topics covered on The Upshot or 538 can give you a sense of that. There’s politics, sports, health, history… the list goes on! It’s a field with endless possibility for growth and exploration, and as I mentioned above, the more I explore the more excited I get about it.</p>
</blockquote>
Not So Standard Deviations Episode 7 - Statistical Royalty
2016-01-12T08:45:24+00:00
http://simplystats.github.io/2016/01/12/not-so-standard-deviations-episode-7-statistical-royalty
<p>The latest episode of Not So Standard Deviations is out, and boy does Hilary have a story to tell.</p>
<p>We also talk about Theranos and the pitfalls of diagnostic testing, Spotify’s Discover Weekly playlist generation algorithm (and the need for human product managers), and of course, a little Star Wars. Also, Hilary and I start a new segment where we each give some “free advertising” to something interesting that we think other people should know about.</p>
<p>Show Notes:</p>
<ul>
<li><a href="http://goo.gl/JDk6ni">Gosset Icterometer</a></li>
<li>The <a href="http://skybrudeconsulting.com/blog/2015/10/16/theranos-healthcare.html">dangers</a> of <a href="https://www.fredhutch.org/en/news/center-news/2013/11/scientists-urge-caution-personal-genetic-screenings.html">entertainment</a> <a href="http://mobihealthnews.com/35444/the-rise-of-the-seemingly-serious-but-just-for-entertainment-purposes-medical-app/">medicine</a></li>
<li>Spotify’s Discover Weekly <a href="http://goo.gl/enzFeR">solves human curation</a>?</li>
<li>David Robinson’s <a href="http://varianceexplained.org">Variance Explained</a></li>
<li><a href="http://what3words.com">What3Words</a></li>
</ul>
<p><a href="https://api.soundcloud.com/tracks/241071463/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio for this episode</a>.</p>
Jeff, Roger and Brian Caffo are doing a Reddit AMA at 3pm EST Today
2016-01-11T09:29:28+00:00
http://simplystats.github.io/2016/01/11/jeff-roger-and-brian-caffo-are-doing-a-reddit-ama-at-3pm-est-today
<p>Jeff Leek, Brian Caffo, and I are doing a <a href="https://www.reddit.com/r/IAmA">Reddit AMA</a> TODAY at 3pm EST. We’re happy to answer questions about…anything…including our roles as Co-Directors of the <a href="https://www.coursera.org/specializations/jhu-data-science">Johns Hopkins Data Science Specialization</a> as well as the <a href="https://www.coursera.org/specializations/executive-data-science">Executive Data Science Specialization</a>.</p>
<p>This is one of the few pictures of the three of us together.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189.jpg"><img class="alignright size-large wp-image-4586" src="http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-1024x768.jpg" alt="IMG_0189" width="990" height="743" srcset="http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-120x90.jpg 120w, http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-1024x768.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-260x195.jpg 260w" sizes="(max-width: 990px) 100vw, 990px" /></a></p>
A non-comprehensive list of awesome things other people did in 2015
2015-12-21T11:22:07+00:00
http://simplystats.github.io/2015/12/21/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2015
<p><em>Editor’s Note: This is the third year I’m making a list of awesome things other people did this year. Just like the lists for <a href="http://simplystatistics.org/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year/">2013</a> and <a href="http://simplystatistics.org/2014/12/17/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2014/">2014</a> I am doing this off the top of my head. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. This year’s list is particularly “off the cuff” so I’d appreciate additions if you have ‘em. I have surely missed awesome things people have done.</em></p>
<ol>
<li>I hear the <a href="http://sml.princeton.edu/tukey">Tukey conference</a> put on by my former advisor John S. was amazing. Out of it came this really good piece by David Donoho on <a href="https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf">50 years of Data Science</a>.</li>
<li>Sherri Rose wrote really accurate and readable guides on <a href="http://drsherrirose.com/academic-cvs-for-statistical-science-faculty-positions">academic CVs</a>, <a href="http://drsherrirose.com/academic-cover-letters-for-statistical-science-faculty-positions">academic cover letters</a>, and <a href="http://drsherrirose.com/how-to-be-an-effective-phd-researcher">how to be an effective PhD researcher</a>.</li>
<li>I am not 100% sold on the deep learning hype, but Michael Nielsen wrote this awesome book on <a href="http://neuralnetworksanddeeplearning.com/">deep learning and neural networks</a>. I like how approachable it is and how un-hypey it is. I also thought Andrej Karpathy’s <a href="http://karpathy.github.io/2015/10/25/selfie/">blog post</a> on whether you have a good selfie or not was fun.</li>
<li>Thomas Lumley continues to be a must-read regardless of which blog he writes for, with a ton of snarky, fun posts debunking the latest ridiculous health headlines on <a href="http://www.statschat.org.nz/2015/11/27/to-find-the-minds-construction-near-the-face/">statschat</a> and more in-depth posts like this one on pre-filtering multiple tests on <a href="http://notstatschat.tumblr.com/post/131478660126/prefiltering-very-large-numbers-of-tests">notstatschat</a>.</li>
<li>David Robinson is making a strong case for top data science blogger with his series of <a href="http://varianceexplained.org/r/bayesian_fdr_baseball/">awesome</a> <a href="http://varianceexplained.org/r/credible_intervals_baseball/">posts</a> on <a href="http://varianceexplained.org/r/empirical_bayes_baseball/">empirical Bayes</a>.</li>
<li>Hadley Wickham doing Hadley Wickham things again. <a href="https://github.com/hadley/readr">readr</a> is the biggie for me this year.</li>
<li>I’ve been really enjoying the solid coverage of science/statistics from the (not entirely statistics focused as the name would suggest) <a href="https://twitter.com/statnews">STAT</a>.</li>
<li>Ben Goldacre and co. launched <a href="http://opentrials.net/">OpenTrials</a> for aggregating all the clinical trial data in the world in an open repository.</li>
<li>Christie Aschwanden’s piece on why <a href="http://fivethirtyeight.com/features/science-isnt-broken/">Science Isn’t Broken </a> is a must read and one of the least polemic treatments of the reproducibility/replicability issue I’ve read. The p-hacking graphic is just icing on the cake.</li>
<li>I’m excited about the new <a href="http://blog.revolutionanalytics.com/2015/06/r-consortium.html">R Consortium</a> and the idea of having more organizations that support folks in the R community.</li>
<li>Emma Pierson’s blog and writeups in various national level news outlets continue to impress. I thought <a href="https://www.washingtonpost.com/news/grade-point/wp/2015/10/15/a-better-way-to-gauge-how-common-sexual-assault-is-on-college-campuses/">this one</a> on changing the incentives for sexual assault surveys was particularly interesting/good.</li>
<li>Amanda Cox and co. created this <a href="http://www.nytimes.com/interactive/2015/05/28/upshot/you-draw-it-how-family-income-affects-childrens-college-chances.html">interactive graphic</a>, which is an amazing way to teach people about pre-conceived biases in the way we think about relationships and correlations. I love the crowd-sourcing view on data analysis this suggests.</li>
<li>As usual Philip Guo was producing gold over on his blog. I appreciate this piece on <a href="http://www.pgbovine.net/tips-for-data-driven-research.htm">twelve tips for data driven research</a>.</li>
<li>I am really excited about the new field of adaptive data analysis. Basically understanding how we can let people be “real data analysts” and still get reasonable estimates at the end of the day. <a href="http://www.sciencemag.org/content/349/6248/636.abstract">This paper</a> from Cynthia Dwork and co was one of the initial salvos that came out this year.</li>
<li>Datacamp <a href="https://www.datacamp.com/courses/intro-to-python-for-data-science?utm_source=growth&utm_campaign=python&utm_medium=button">incorporated Python</a> into their platform. The idea of interactive education for R/Python/Data Science is a very cool one and has tons of potential.</li>
<li>I was really into the idea of <a href="http://projecteuclid.org/euclid.aoas/1430226098">Cross-Study validation</a> that got proposed this year. With the growth of public data in a lot of areas we can really start to get a feel for generalizability.</li>
<li>The Open Science Foundation did this <a href="http://www.sciencemag.org/content/349/6251/aac4716">incredible replication of 100 different studies</a> in psychology with attention to detail and care that deserves a ton of attention.</li>
<li>Florian’s piece “<a href="http://www.ncbi.nlm.nih.gov/pubmed/26402330">You are not working for me; I am working with you.</a>” should be required reading for all students/postdocs/mentors in academia. This is something I still hadn’t fully figured out until I read Florian’s piece.</li>
<li>I think Karl Broman’s post on why <a href="https://kbroman.wordpress.com/2015/09/09/reproducibility-is-hard/">reproducibility is hard</a> is a great introduction to the real issues in making data analyses reproducible.</li>
<li>This was the year of the f1000 post-publication review paper. I thought <a href="http://f1000research.com/articles/4-121/v1">this one</a> from Yoav and the ensuing fallout was fascinating.</li>
<li>I love pretty much everything out of Di Cook/Heike Hoffman’s groups. This year I liked the paper on <a href="http://download.springer.com/static/pdf/611/art%253A10.1007%252Fs00180-014-0534-x.pdf?originUrl=http%3A%2F%2Flink.springer.com%2Farticle%2F10.1007%2Fs00180-014-0534-x&token2=exp=1450714996~acl=%2Fstatic%2Fpdf%2F611%2Fart%25253A10.1007%25252Fs00180-014-0534-x.pdf%3ForiginUrl%3Dhttp%253A%252F%252Flink.springer.com%252Farticle%252F10.1007%252Fs00180-014-0534-x*~hmac=3c5f5c7c1b2381685437659d8ffd64e1cb2c52d1dfd10506cad5d2af1925c0ac">visual statistical inference in high-dimensional low sample size settings</a>.</li>
<li>This is pretty recent, but Nathan Yau’s <a href="https://flowingdata.com/2015/12/15/a-day-in-the-life-of-americans/">day in the life graphic is mesmerizing</a>.</li>
</ol>
<p>This was a year where open source data people <a href="http://treycausey.com/emotional_rollercoaster_public_work.html">described</a> their <a href="https://twitter.com/johnmyleswhite/status/666429299327569921">pain</a> from people being demanding/mean to them for their contributions. As the year closes I just want to give a big thank you to everyone who did awesome stuff I used this year and have completely ungraciously failed to acknowledge.</p>
Not So Standard Deviations: Episode 6 - Google is the New Fisher
2015-12-18T13:08:10+00:00
http://simplystats.github.io/2015/12/18/not-so-standard-deviations-episode-6-google-is-the-new-fisher
<p>Episode 6 of Not So Standard Deviations is now posted. In this episode Hilary and I talk about the analytics of our own podcast, and analyses that seem easy but are actually hard.</p>
<p>If you haven’t already, you can subscribe to the podcast through <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a>.</p>
<p>This will be our last episode for 2015 so see you in 2016!</p>
<p>Notes</p>
<ul>
<li><a href="https://goo.gl/X0TFt9">Roger’s books on Leanpub</a></li>
<li><a href="https://goo.gl/VO0ckP">KPIs</a></li>
<li><a href="http://replyall.soy">Reply All</a>, a great podcast</li>
<li><a href="http://user2016.org">Use R! 2016 conference</a> where Don Knuth is an invited speaker!</li>
<li><a href="http://goo.gl/wUcTBT">Liz Stuart’s directory of propensity score software</a></li>
<li><a href="https://goo.gl/CibhJ0">A/B testing</a></li>
<li><a href="https://goo.gl/qMyksb">iid</a></li>
<li><a href="https://goo.gl/qHVzWQ">R 3.2.3 release notes</a></li>
<li><a href="http://www.pqr-project.org/">pqR</a></li>
<li><a href="https://goo.gl/pFOVkx">John Myles White’s tweet</a></li>
</ul>
<p><a href="https://api.soundcloud.com/tracks/237909534/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p>
Instead of research on reproducibility, just do reproducible research
2015-12-11T12:18:33+00:00
http://simplystats.github.io/2015/12/11/instead-of-research-on-reproducibility-just-do-reproducible-research
<p>Right now reproducibility, replicability, false positive rates, biases in methods, and other problems with science are hot topics. As I mentioned in a previous post, pointing out a flaw in a scientific study is way easier to do correctly than generating a new scientific study. Some folks have noticed that right now there is a huge market for papers pointing out how science is flawed. The combination of the relative ease of pointing out flaws and the huge payout for writing these papers is helping to generate the hype around the “reproducibility crisis”.</p>
<p>I <a href="http://www.slideshare.net/jtleek/evidence-based-data-analysis-45800617">gave a talk</a> a little while ago at an NAS workshop where I stated that all the tools for reproducible research exist (the caveat being really large analyses, although that is changing as well). To make a paper completely reproducible, open, and available for post-publication review you can use the following approach, with no new tools/frameworks needed (a minimal sketch of what step 2 might look like follows the list).</p>
<ol>
<li>Use <a href="https://github.com/">Github </a>for version control.</li>
<li>Use <a href="http://rmarkdown.rstudio.com/">rmarkdown</a> or <a href="http://ipython.org/notebook.html">iPython notebooks</a> for your analysis code</li>
<li>When your paper is done post it to <a href="http://arxiv.org/">arxiv</a> or <a href="http://biorxiv.org/">biorxiv</a>.</li>
<li>Post your data to an appropriate repository like <a href="http://www.ncbi.nlm.nih.gov/sra">SRA</a> or a general purpose site like <a href="https://figshare.com/">figshare.</a></li>
<li>Send any software you develop to a controlled repository like <a href="https://cran.r-project.org/">CRAN</a> or <a href="http://bioconductor.org/">Bioconductor</a>.</li>
<li>Participate in the <a href="http://simplystatistics.org/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics/">post publication discussion on Twitter and with a Blog</a>.</li>
</ol>
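<p>As a concrete (and entirely hypothetical) sketch of step 2, an R Markdown file committed to the GitHub repository might look like the following; the file name, data file, and chunk contents are placeholders, not part of any specific project:</p>
<pre><code class="r">---
title: "Analysis for: My Paper"
author: "A. Researcher"
output: html_document
---

```{r load-data}
# the raw data file is also deposited in a public repository (e.g. figshare)
dat = read.csv("data/raw_measurements.csv")
```

```{r main-figure}
# every figure and table in the paper is regenerated from code, not pasted in
plot(dat$dose, dat$response)
summary(lm(response ~ dose, data = dat))
```
</code></pre>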
<p>The same is true of open science, open data sharing, reproducibility, replicability, post-publication peer review and all the other issues forming the “reproducibility crisis”: the tools already exist. A lot of attention and heat has focused on the “crisis” or on folks who make a point of taking a stand on reproducibility, open science, or post-publication review. But in the background, outside of the hype, there is a large group of people quietly executing solid, open, reproducible science.</p>
<p>I wish this group would get more attention, so I decided to point out a few of them. Next time somebody asks me about the research on reproducibility or open science I’ll point them here and tell them to follow the lead of the people doing it.</p>
<ul>
<li><strong>Karl Broman</strong> - posts all of his <a href="http://kbroman.org/pages/talks.html">talks online </a>, generates many widely used <a href="http://kbroman.org/pages/software.html">open source packages</a>, writes <a href="http://kbroman.org/pages/tutorials.html">free/open tutorials</a> on everything from knitr to making webpages, makes his <a href="http://www.ncbi.nlm.nih.gov/pubmed/26290572">papers</a> highly <a href="https://github.com/kbroman/Paper_SampleMixups">reproducible</a>.</li>
<li><strong>Jessica Li</strong> - <a href="http://www.stat.ucla.edu/~jingyi.li/software-and-data.html">posts her data online and writes open source software for her analyses</a>.</li>
<li><strong>Mark Robinson - </strong>posts many of his papers as <a href="http://biorxiv.org/search/author1%3Arobinson%252C%2Bmd%20numresults%3A10%20sort%3Arelevance-rank%20format_result%3Astandard">preprints on biorxiv</a>, makes his <a href="https://github.com/markrobinsonuzh/diff_splice_paper">analyses reproducible</a>, writes <a href="http://bioconductor.org/packages/release/bioc/html/Repitools.html">open source software</a>.</li>
<li><strong>Florian Markowetz - </strong><a href="http://www.markowetzlab.org/software/">writes open source software</a>, provides <a href="http://www.markowetzlab.org/data.php">Bioconductor data for major projects</a>, links <a href="http://www.markowetzlab.org/publications.php">his papers with his code</a> nicely on his publications page.</li>
<li><strong>Raphael Gottardo</strong> - <a href="http://www.rglab.org/software.html">writes/maintains many open source software packages</a>, makes <a href="https://github.com/RGLab/BNCResponse">his analyses reproducible and available via Github</a>, posts <a href="http://biorxiv.org/content/early/2015/06/15/020842">preprints of his papers</a>.</li>
<li><strong>Genevera Allen - </strong>writes <a href="https://cran.r-project.org/web/packages/TCGA2STAT/index.html">open source software</a> to make data easier to access, posts <a href="http://biorxiv.org/content/early/2015/09/24/027516">preprints on biorxiv</a> and <a href="http://arxiv.org/pdf/1502.03853v1.pdf">on arxiv</a>.</li>
<li><strong>Lorena Barba</strong> - <a href="http://openedx.seas.gwu.edu/courses/GW/MAE6286/2014_fall/about">teaches open source moocs</a>, with lessons as <a href="https://github.com/barbagroup/CFDPython">open source iPython modules</a>, and <a href="https://github.com/barbagroup/pygbe">reproducible code for her analyses</a>.</li>
<li><strong>Alicia Oshlack - </strong>writes papers with <a href="http://www.genomemedicine.com/content/7/1/43">completely reproducible analyses</a>, <a href="http://bioconductor.org/packages/release/bioc/html/missMethyl.html">publishes lots of open source software</a> and publishes <a href="http://biorxiv.org/content/early/2015/01/23/013698">preprints</a> for her papers.</li>
<li><strong>Baggerly and Coombs</strong> - although they are famous for a <a href="https://projecteuclid.org/euclid.aoas/1267453942">highly public reproducible piece of research</a> they have also quietly implemented policies like <a href="http://magazine.amstat.org/blog/2011/01/01/scipolicyjan11/">making all reports reproducible for their consulting center</a>.</li>
</ul>
<p>This list was made completely haphazardly as all my lists are, but just to indicate there are a ton of people out there doing this. One thing that is clear too is that grad students and postdocs are adopting the approach I described at a very high rate.</p>
<p>Moreover there are people that have been doing parts of this for a long time (like the <a href="http://arxiv.org/">physics</a> or <a href="http://biostats.bepress.com/jhubiostat/">biostatistics</a> communities with preprints, or how people have used <a href="https://projecteuclid.org/euclid.aoas/1267453942">Sweave for a long time</a>). I purposely left off the list people like Titus and Ethan who have gone all in, even posting their <a href="http://ivory.idyll.org/blog/grants-posted.html">grants</a> <a href="http://jabberwocky.weecology.org/2012/08/10/a-list-of-publicly-available-grant-proposals-in-the-biological-sciences/">online</a>. I did this because they are very loud advocates of open science, but I wanted to highlight quieter contributors and point out that while there is a lot of noise going on over in one corner, many people are quietly doing really good science in another.</p>
By opposing tracking well-meaning educators are hurting disadvantaged kids
2015-12-09T10:10:02+00:00
http://simplystats.github.io/2015/12/09/by-opposing-tracking-well-meaning-educators-are-hurting-disadvantaged-kids
<div class="page" title="Page 2">
<div class="layoutArea">
<div class="column">
<p>
An unfortunate fact about the US K-12 system is that the education gap between poor and rich is growing. One manifestation of this trend is that we rarely see US kids from disadvantaged backgrounds become tenure track faculty, especially in the STEM fields. In my experience, the ones that do make it, when asked how they overcame the suboptimal math education their school district provided, often respond "I was <a href="https://en.wikipedia.org/wiki/Tracking_(education)">tracked</a>" or "I went to a <a href="https://en.wikipedia.org/wiki/Magnet_school">magnet school</a>". Magnet schools filter students with admission tests and then teach at a higher level than an average school, so essentially the entire school is an advanced track.
</p>
</div>
</div>
</div>
<p>Twenty years of classroom instruction experience has taught me that classes with diverse academic abilities present one of the most difficult teaching challenges. Typically, one is forced to focus on only a sub-group of students, usually the second quartile. As a consequence the lower and higher quartiles are not properly served. At the university level, we minimize this problem by offering different levels: remedial math versus math for engineers, probability for the Masters program versus probability for PhD students, co-ed intramural sports versus the varsity basketball team, intro to World Music versus a spot in the orchestra, etc. In K-12, tracking seems like the obvious solution to teaching to an array of student levels.</p>
<p>Unfortunately, there has been a trend recently to move away from tracking and several school districts now forbid it. The motivation seems to be a series of <a href="http://www.tandfonline.com/doi/abs/10.1207/s15430421tip4501_9">observational</a> <a href="http://files.eric.ed.gov/fulltext/ED329615.pdf">studies</a> that note that “low-track classes tend to be primarily composed of low-income students, usually minorities, while upper-track classes are usually dominated by students from socioeconomically successful groups.” Tracking opponents infer that this unfortunate reality is due to bias (conscious or unconscious) in the informal referrals that are typically used to decide which students are advanced. However, <strong>this is a critique of the referral system, not of tracking itself.</strong> A simple fix is to administer an objective test or use the percentiles from <a href="http://www.doe.mass.edu/mcas/overview.html">state assessment tests</a>. In fact, such exams have been developed and implemented. A recent study (summarized <a href="http://www.vox.com/2015/11/23/9784250/card-giuliano-gifted-talented">here</a>) examined the data from a district that for a period of time implemented an objective assessment and found that</p>
<blockquote>
<p>[t]he number of Hispanic students [in the advanced track increased] by 130 percent and the number of black students by 80 percent.</p>
</blockquote>
<p>Unfortunately, instead of maintaining the placement criteria, which benefited underrepresented minorities without relaxing standards, these school districts reverted to the old, flawed system due to budget cuts.</p>
<p>Another argument against tracking is that students benefit more from being in classes with higher-achieving peers, rather than being in a class with students with similar subject mastery and a teacher focused on their level. However, a <a href="http://web.stanford.edu/~pdupas/Tracking_rev.pdf">randomized trial</a> (and the only one of which I am aware) finds that tracking helps all students:</p>
<blockquote>
<p>We find that tracking students by prior achievement raised scores for all students, even those assigned to lower achieving peers. On average, after 18 months, test scores were 0.14 standard deviations higher in tracking schools than in non-tracking schools (0.18 standard deviations higher after controlling for baseline scores and other control variables). After controlling for the baseline scores, students in the top half of the pre-assignment distribution gained 0.19 standard deviations, and those in the bottom half gained 0.16 standard deviations. <strong>Students in all quantiles benefited from tracking. </strong></p>
</blockquote>
<p>I believe that without tracking, the achievement gap between disadvantaged children and their affluent peers will continue to widen since involved parents will seek alternative educational opportunities, including private schools or subject specific extracurricular acceleration programs. With limited or no access to advanced classes in the public system, disadvantaged students will be less prepared to enter the very competitive STEM fields. Note that competition comes not only from within the US, but from other countries including many with educational systems that track.</p>
<p>To illustrate the extreme gap, the following exercises are from a 7th grade public school math class (in a high performing school district):</p>
<table style="width: 100%;">
<tr>
<td>
<a href="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.49.41-AM.png"><img src="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.49.41-AM.png" alt="Screen Shot 2015-12-07 at 11.49.41 AM" width="275" /></a>
</td>
<td>
<a href="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-09-at-9.00.57-AM.png"><img src="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-09-at-9.00.57-AM.png" alt="Screen Shot 2015-12-09 at 9.00.57 AM" width="275" /></a>
</td>
</tr>
</table>
<p>(Click to enlarge). There is no tracking so all students must work on these problems. Meanwhile, in an advanced private 7th grade math class, that same student can be working on problems like these:<a href="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM.png"><img class="alignnone size-full wp-image-4511" src="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM.png" alt="Screen Shot 2015-12-07 at 11.47.45 AM" width="1165" height="341" srcset="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM-300x88.png 300w, http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM-1024x300.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM-260x76.png 260w, http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM.png 1165w" sizes="(max-width: 1165px) 100vw, 1165px" /></a>Let me stress that there is nothing wrong with the first example if it is the appropriate level for the student. However, a student who can work at the level of the second example should be provided with the opportunity to do so notwithstanding their family’s ability to pay. Poorer kids in districts which do not offer advanced classes will not only be less equipped to compete with their richer peers, but many of the academically advanced ones may, I suspect, dismiss academics due to lack of challenge and boredom. Educators need to consider evidence when making decisions regarding policy. Tracking can be applied unfairly, but that aspect can be remedied. Eliminating tracking altogether takes away a crucial tool for disadvantaged students to move into the STEM fields and, according to the empirical evidence, hurts all students.</p>
Not So Standard Deviations: Episode 5 - IRL Roger is Totally With It
2015-12-03T09:52:47+00:00
http://simplystats.github.io/2015/12/03/not-so-standard-deviations-episode-5-irl-roger-is-totally-with-it
<p>I just posted Episode 5 of Not So Standard Deviations so check your feeds! Sorry for the long delay since the last episode but we got a bit tripped up by the Thanksgiving holiday.</p>
<p>In this episode, Hilary and I open up the mailbag and go through some of the feedback we’ve gotten on the previous episodes. The rest of the time is spent talking about the importance of reproducibility in data analysis both in academic research and in industry settings.</p>
<p>If you haven’t already, you can subscribe to the podcast through <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a>. Or you can use the <a href="http://feeds.soundcloud.com/users/soundcloud:users:174789515/sounds.rss">SoundCloud RSS feed</a> directly.</p>
<p>Notes:</p>
<ul>
<li>Hilary’s <a href="https://youtu.be/7B3n-5atLxM">talk on reproducible analysis in production</a> at the New York R Conference</li>
<li>Hilary’s <a href="https://youtu.be/zlSOckFpYqg">Ignite presentation</a> at Strata 2013</li>
<li>Roger’s <a href="https://youtu.be/aH8dpcirW1U">talk on “Computational and Policy Tools for Reproducible Research”</a> at the Applied Mathematics Perspectives Workshop in Vancouver, 2011</li>
<li>Duke Scandal <a href="http://goo.gl/rEO5QD">Starter Set</a></li>
<li><a href="https://youtu.be/7gYIs7uYbMo">Keith Baggerly’s talk</a> on Duke Scandal</li>
<li>The <a href="https://goo.gl/RtpBZa">Web of Trust</a></li>
<li><a href="https://goo.gl/MlM0gu">testdat</a> R package</li>
</ul>
<p><a href="https://api.soundcloud.com/tracks/235689361/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p>
<p>Or you can listen right here:</p>
Thinking like a statistician: the importance of investigator-initiated grants
2015-12-01T11:40:29+00:00
http://simplystats.github.io/2015/12/01/thinking-like-a-statistician-fund-more-investigator-initiated-grants
<p>A substantial amount of scientific research is funded by investigator-initiated grants. A researcher has an idea, writes it up and sends a proposal to a funding agency. The agency then elicits help from a group of peers to evaluate competing proposals. Grants are awarded to the most highly ranked ideas. The percent awarded depends on how much funding gets allocated to these types of proposals. At the NIH, the largest funding agency of these types of grants, the success rate recently <a href="https://nihdirectorsblog.files.wordpress.com/2013/09/sequestration-success-rates1.jpg">fell below 20% from a high above 35%</a>. Part of the reason these percentages have fallen is to make room for large collaborative projects. Large projects seem to be increasing, and not just at the NIH. In Europe, for example, the <a href="https://www.humanbrainproject.eu/">Human Brain Project</a> has an estimated cost of over 1 billion US$ over 10 years. To put this in perspective, 1 billion dollars can fund over 500 <a href="http://grants.nih.gov/grants/funding/r01.htm">NIH R01s</a>. R01 is the NIH mechanism most appropriate for investigator initiated proposals.</p>
<p>The merits of big science have been widely debated (for example <a href="http://www.michaeleisen.org/blog/?p=1179">here</a> and <a href="http://simplystatistics.org/2013/02/27/please-save-the-unsolicited-r01s/">here</a>), and most agree that some big projects have been successful. However, in this post I present a statistical argument highlighting the importance of investigator-initiated awards. The idea is summarized in the graph below.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/12/Rplot.png"><img class="alignnone size-full wp-image-4483" src="http://simplystatistics.org/wp-content/uploads/2015/12/Rplot.png" alt="Rplot" width="1112" height="551" srcset="http://simplystatistics.org/wp-content/uploads/2015/12/Rplot-300x149.png 300w, http://simplystatistics.org/wp-content/uploads/2015/12/Rplot-1024x507.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/12/Rplot-260x129.png 260w, http://simplystatistics.org/wp-content/uploads/2015/12/Rplot.png 1112w" sizes="(max-width: 1112px) 100vw, 1112px" /></a></p>
<p>The two panels above represent two different funding strategies: fund-many-R01s (left) or reduce R01s to fund several large projects (right). The grey crosses represent investigators and the gold dots represent potential paradigm-shifting geniuses. Locations on the Cartesian plane represent research areas, with the blue circles denoting areas that are prime for an important scientific advance. The largest scientific contributions occur when a gold dot falls in a blue circle. Large contributions also result from the accumulation of incremental work produced by grey crosses in the blue circles.</p>
<p>Although not perfect, the peer review approach implemented by most funding agencies appears to work quite well at weeding out unproductive researchers and unpromising ideas. They also seem to do well at spreading funds across general areas. For example, NIH spreads funds across <a href="https://www.nih.gov/institutes-nih/list-nih-institutes-centers-offices">diseases and public health challenges</a> (for example cancer, mental health, genomics, and heart and lung disease) as well as <a href="https://www.nigms.nih.gov/Pages/default.aspx">general medicine</a>, <a href="https://www.genome.gov/">genomics</a> and <a href="https://www.nlm.nih.gov/">information</a>. However, precisely predicting who will be a gold dot or what specific area will be a blue circle seems like an impossible endeavor. Increasing the number of tested ideas and researchers therefore increases our chance of success. When a funding agency decides to invest big in a specific area (green dollar signs), they are predicting the location of a blue circle. As funding flows into these areas, so do investigators (note the clusters). The total number of funded lead investigators also drops. The risk here is that if the dollar sign lands far from a blue circle, we pull researchers away from potentially fruitful areas. If, after 10 years of funding, the <a href="https://www.humanbrainproject.eu/">Human Brain Project</a> doesn’t <a href="https://www.humanbrainproject.eu/mission">“achieve a multi-level, integrated understanding of brain structure and function”</a> we will have missed out on trying out 500 ideas by hundreds of different investigators. With a sample size this large, we expect at least a handful of these attempts to result in the type of impactful advance that justifies funding scientific research.</p>
<p>The simulation presented here (code below) is clearly an oversimplification, but it depicts the statistical reason why I favor investigator-initiated grants: the strategy of funding many of them is key to the continued success of scientific research.</p>
<pre><code class="r">set.seed(2)
library(rafalib)
library(MASS)
thecol = "gold3"
mypar(1, 2, mar = c(0.5, 0.5, 2, 0.5))

###
## Start with the many R01s model
###
## generate location of 2,000 investigators
N = 2000
x = runif(N)
y = runif(N)
## 1% are geniuses
Ng = N * 0.01
g = rep(4, N); g[1:Ng] = 16
## generate location of important areas of research
M0 = 10
x0 = runif(M0)
y0 = runif(M0)
r0 = rep(0.03, M0)
## Make the plot
nullplot(xaxt = "n", yaxt = "n", main = "Many R01s")
symbols(x0, y0, circles = r0, fg = "black", bg = "blue",
        lwd = 3, add = TRUE, inches = FALSE)
points(x, y, pch = g, col = ifelse(g == 4, "grey", thecol))
points(x, y, pch = g, col = ifelse(g == 4, NA, thecol))

### Generate the location of 5 big projects
M1 = 5
x1 = runif(M1)
y1 = runif(M1)
## make initial plot
nullplot(xaxt = "n", yaxt = "n", main = "A Few Big Projects")
symbols(x0, y0, circles = r0, fg = "black", bg = "blue",
        lwd = 3, add = TRUE, inches = FALSE)
### Generate location of investigators attracted
### to location of big projects. There are 1000 total
### investigators
Sigma = diag(2) * 0.005
N1 = 200
Ng1 = round(N1 * 0.01)
g1 = rep(4, N); g1[1:Ng1] = 16
for (i in 1:M1) {
  xy = mvrnorm(N1, c(x1[i], y1[i]), Sigma)
  points(xy[, 1], xy[, 2], pch = g1, col = ifelse(g1 == 4, "grey", thecol))
}
### generate location of investigators that ignore big projects
### (note: now 500 instead of 2,000; the overall total is also lower
### because large projects result in fewer lead investigators)
N = 500
x = runif(N)
y = runif(N)
Ng = N * 0.01
g = rep(4, N); g[1:Ng] = 16
points(x, y, pch = g, col = ifelse(g == 4, "grey", thecol))
points(x1, y1, pch = "$", col = "darkgreen", cex = 2, lwd = 2)
</code></pre>
A thanksgiving dplyr Rubik's cube puzzle for you
2015-11-25T12:14:06+00:00
http://simplystats.github.io/2015/11/25/a-thanksgiving-dplyr-rubiks-cube-puzzle-for-you
<p><a href="http://nickcarchedi.com/">Nick Carchedi</a> is back visiting from <a href="https://www.datacamp.com/">DataCamp</a> and for fun we came up with a <a href="https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html">[Nick Carchedi](http://nickcarchedi.com/) is back visiting from [DataCamp](https://www.datacamp.com/) and for fun we came up with a</a> Rubik’s cube puzzle. Here is how it works. To solve the puzzle you have to make a 4 x 3 data frame that spells Thanksgiving like this:</p>
<div class="oembed-gist">
<noscript>
View the code on <a href="https://gist.github.com/jtleek/4d4b63a035973231e6d4">Gist</a>.
</noscript>
</div>
<p><span style="line-height: 1.5;">To solve the puzzle you need to pipe this data frame in </span></p>
<div class="oembed-gist">
<noscript>
View the code on <a href="https://gist.github.com/jtleek/aae1218a8f4d1220e07d">Gist</a>.
</noscript>
</div>
<p>and pipe out the Thanksgiving data frame using only the dplyr commands <em>arrange</em>, <em>mutate</em>, <em>slice</em>, <em>filter</em> and <em>select</em>. For advanced users you can try our slightly more complicated puzzle:</p>
<div class="oembed-gist">
<noscript>
View the code on <a href="https://gist.github.com/jtleek/b82531d9dac78ba3c60a">Gist</a>.
</noscript>
</div>
<p>See if you can do it <a href="http://www.theguardian.com/technology/video/2015/nov/24/boy-completes-rubiks-cube-in-49-seconds-word-recordvideo">this fast</a>. Post your solutions in the comments and Happy Thanksgiving!</p>
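<p>If you have not used these verbs before, here is a tiny, unrelated warm-up (made-up data, not a solution to either puzzle) showing how a data frame is piped through all five of them:</p>
<pre><code class="r">library(dplyr)

# a small made-up data frame just to demonstrate the five allowed verbs
df = data.frame(word = c("turkey", "pie", "stuffing", "gravy"),
                size = c(7, 3, 8, 5),
                stringsAsFactors = FALSE)

df %>%
  filter(size > 3) %>%              # keep rows meeting a condition
  mutate(loud = toupper(word)) %>%  # add a new column
  arrange(desc(size)) %>%           # reorder the rows
  slice(1:2) %>%                    # keep rows by position
  select(loud)                      # keep only some columns
</code></pre>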
20 years of Data Science: from Music to Genomics
2015-11-24T10:00:56+00:00
http://simplystats.github.io/2015/11/24/20-years-of-data-science-and-data-driven-discovery-from-music-to-genomics
<p>I finally got around to reading David Donoho’s <a href="https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf">50 Years of Data Science</a> paper. I highly recommend it. The following quote seems to summarize the sentiment that motivated the paper, as well as why it has resonated among academic statisticians:</p>
<div class="page" title="Page 5">
<div class="layoutArea">
<div class="column">
<blockquote>
<p>
The statistics profession is caught at a confusing moment: the activities which preoccupied it over centuries are now in the limelight, but those activities are claimed to be bright shiny new, and carried out by (although not actually invented by) upstarts and strangers.
</p>
</blockquote>
</div>
</div>
</div>
<p>The reason we started this blog over four years ago was because, as Jeff wrote in his inaugural post, we were “<a href="http://simplystatistics.org/2011/09/07/first-things-first/">fired up about the new era where data is abundant and statisticians are scientists</a>”. It was clear that many disciplines were becoming data-driven and that interest in data analysis was growing rapidly. We were further motivated because, despite this <a href="http://simplystatistics.org/2014/09/15/applied-statisticians-people-want-to-learn-what-we-do-lets-teach-them/">new found interest in our work</a>, academic statisticians were, in general, more interested in the development of context free methods than in leveraging applied statistics to take <a href="http://simplystatistics.org/2012/06/22/statistics-and-the-science-club/">leadership roles</a> in data-driven projects. Meanwhile, great and highly visible applied statistics work was occurring in other fields such as astronomy, computational biology, computer science, political science and economics. So it was not completely surprising that some (bio)statistics departments were being left out from larger university-wide data science initiatives. Some of <a href="http://simplystatistics.org/2014/07/25/academic-statisticians-there-is-no-shame-in-developing-statistical-solutions-that-solve-just-one-problem/">our</a> <a href="http://simplystatistics.org/2013/04/15/data-science-only-poses-a-threat-to-biostatistics-if-we-dont-adapt/">posts</a> exhorted academic departments to embrace larger numbers of applied statisticians:</p>
<blockquote>
<p>[M]any of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none. By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.</p>
</blockquote>
<p>Donoho points out that John Tukey had a similar preoccupation 50 years ago:</p>
<div class="page" title="Page 10">
<div class="layoutArea">
<div class="column">
<blockquote>
<p>
For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. ... All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data
</p>
</blockquote>
<p>
Many applied statisticians do the things Tukey mentions above. In the blog we have encouraged them to <a href="http://simplystatistics.org/2014/09/15/applied-statisticians-people-want-to-learn-what-we-do-lets-teach-them/">teach the gory details of what they do</a>, along with the general methodology we currently teach. With all this in mind, several months ago, when I was invited to give a talk at a department that was, at the time, deciphering their role in their university's data science initiative, I gave a talk titled<em> 20 years of Data Science: from Music to Genomics. </em>The goal was to explain why <em>applied statistician</em> is not considered synonymous with <em>data scientist </em>even when we focus on the same goal: <a href="https://en.wikipedia.org/wiki/Data_science">extract knowledge or insights from data</a>.
</p>
<p>
The first example in the talk related to how academic applied statisticians tend to emphasize the parts that will be most appreciated by our math stat colleagues and ignore the aspects that are today being heralded as the linchpins of data science. I used my thesis papers as examples. <a href="http://archive.cnmat.berkeley.edu/Research/1998/Rafael/tesis.pdf">My dissertation work</a> was about finding meaningful parametrization of musical sound signals that<img class="wp-image-4449 alignright" src="http://www.biostat.jhsph.edu/~ririzarr/Demo/img7.gif" alt="Spectrogram" width="380" height="178" /> my collaborators could use to manipulate sounds to create new ones. To do this, I prepared a database of sounds, wrote code to extract and import the digital representations from CDs into S-plus (yes, I'm that old), visualized the data to motivate models, wrote code in C (or was it Fortran?) to make the analysis go faster, and tested these models with residual analysis by ear (you can listen to them <a href="http://www.biostat.jhsph.edu/~ririzarr/Demo/">here</a>). None of these data science aspects were highlighted in the <a href="http://www3.stat.sinica.edu.tw/statistica/oldpdf/A10n42.pdf">papers</a> <a href="http://www.tandfonline.com/doi/abs/10.1198/000313001300339969#.Vk4_ht-rQUE">I</a> <a href="http://www.tandfonline.com/doi/abs/10.1198/016214501750332875#.Vk4_mN-rQUE">wrote </a><a href="http://www.tandfonline.com/doi/abs/10.1198/016214501753168082#.Vk4_qt-rQUE">about</a> my <a href="http://onlinelibrary.wiley.com/doi/10.1111/1467-9892.01515/abstract?userIsAuthenticated=false&deniedAccessCustomisedMessage=">thesis</a>. Here is a screen shot from <a href="http://onlinelibrary.wiley.com/doi/10.1111/1467-9892.01515/abstract">this paper</a>:
</p>
</div>
</div>
</div>
<p><a href="http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM.png"><img class="wp-image-4449 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM.png" alt="Screen Shot 2015-04-15 at 12.24.40 PM" width="320" height="342" srcset="http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM-957x1024.png 957w, http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM-187x200.png 187w, http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM.png 1204w" sizes="(max-width: 320px) 100vw, 320px" /></a></p>
<p>I am actually glad I wrote out and published all the technical details of this work. It was great training. My point was simply that based on the focus of these papers, this work would not be considered data science.</p>
<p>The rest of my talk described some of the work I did once I transitioned into applications in Biology. I was fortunate to have a <a href="http://www.jhsph.edu/faculty/directory/profile/3859/scott-zeger">department chair</a> that appreciated lead-author papers in the subject matter journals as much as statistical methodology papers. This opened the door for me to become a full fledged applied statistician/data scientist. In the talk I described how <a href="http://bioinformatics.oxfordjournals.org/content/20/3/307.short">developing software packages,</a> <a href="http://www.nature.com/nmeth/journal/v2/n5/abs/nmeth756.html">planning</a> the <a href="http://www.nature.com/nmeth/journal/v4/n11/abs/nmeth1102.html">gathering of data</a> to <a href="http://www.ncbi.nlm.nih.gov/pubmed/?term=16108723">aid method development</a>, developing <a href="http://www.ncbi.nlm.nih.gov/pubmed/14960458">web tools</a> to assess data analysis techniques in the wild, and facilitating <a href="http://www.ncbi.nlm.nih.gov/pubmed/19151715">data-driven discovery</a> in biology has been very gratifying and, simultaneously, helped my career. However, at some point, early in my career, senior members of my department encouraged me to write and submit a methods paper to a statistical journal to go along with every paper I sent to the subject matter journals. Although I do write methods papers when I think the ideas add to the statistical literature, I did not follow the advice to simply write papers for the sake of publishing in statistics journals. Note that if (bio)statistics departments require applied statisticians to do this, then it becomes harder to have an impact as data scientists. Departments that are not producing widely used methodology or successful and visible applied statistics projects (or both), should not be surprised when they are not included in data science initiatives. So, applied statistician, read that Tukey quote again, listen to <a href="https://youtu.be/vbb-AjiXyh0">President Obama</a>, and go do some great data science.</p>
Some Links Related to Randomized Controlled Trials for Policymaking
2015-11-19T12:49:03+00:00
http://simplystats.github.io/2015/11/19/some-links-related-to-randomized-controlled-trials-for-policymaking
<div>
<p>
In response to <a href="http://simplystatistics.org/2015/11/17/why-are-randomized-trials-not-used-by-policymakers/">my previous post</a>, <a href="https://gspp.berkeley.edu/directories/faculty/avi-feller">Avi Feller</a> sent me these links related to efforts promoting the use of RCTs and evidence-based approaches for policymaking:
</p>
<ul>
<li>
The theme of this year's just-concluded APPAM conference (the national public policy research organization) was "evidence-based policymaking," with a headline panel on using experiments in policy (see <a href="http://www.appam.org/events/fall-research-conference/2015-fall-research-conference-information/" target="_blank">here</a> and <a href="http://www.appam.org/2015appam-student-summary-using-experiments-for-evidence-based-policy-lessons-from-the-private-sector/" target="_blank">here</a>).
</li>
</ul>
<ul>
<li>
Jeff Liebman has written extensively about the use of randomized experiments in policy (see <a href="http://govinnovator.com/ten_year_challenge/" target="_blank">here</a> for a recent interview).
</li>
</ul>
<ul>
<li>
The White House now has an entire office devoted to running randomized trials to improve government performance (the so-called "nudge unit"). Check out their recent annual report <a href="https://www.whitehouse.gov/sites/default/files/microsites/ostp/sbst_2015_annual_report_final_9_14_15.pdf" target="_blank">here</a>.
</li>
</ul>
<ul>
<li>
JPAL North America just launched a major initiative to help state and local governments run randomized trials (see <a href="https://www.povertyactionlab.org/about-j-pal/news/j-pal-north-america-state-and-local-innovation-initiative-release" target="_blank">here</a>).
</li>
</ul>
</div>
Given the history of medicine, why are randomized trials not used for social policy?
2015-11-17T10:42:24+00:00
http://simplystats.github.io/2015/11/17/why-are-randomized-trials-not-used-by-policymakers
<p>Policy changes can have substantial societal effects. For example, clean water and hygiene policies have saved millions, if not billions, of lives. But effects are not always positive. For example, <a href="https://en.wikipedia.org/wiki/Prohibition_in_the_United_States">prohibition</a>, or the “noble experiment”, boosted organized crime, slowed economic growth and increased deaths caused by tainted liquor. Good intentions do not guarantee desirable outcomes.</p>
<p>The medical establishment is well aware of the danger of basing decisions on the good intentions of doctors or biomedical researchers. For this reason, randomized controlled trials (RCTs) are the standard approach to determining if a new treatment is safe and effective. In these trials an objective assessment is achieved by assigning patients at random to a treatment or control group, and then comparing the outcomes in these two groups. Probability calculations are used to summarize the evidence in favor or against the new treatment. Modern RCTs are considered <a href="http://abcnews.go.com/Health/TenWays/story?id=3605442&page=1">one of the greatest medical advances of the 20th century</a>.</p>
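<p>As a toy sketch of the kind of comparison an RCT allows (simulated data and made-up effect sizes, not taken from any real trial), one could summarize the evidence like this:</p>
<pre><code class="r">set.seed(1)
# hypothetical trial: 200 patients randomly assigned to treatment or control
n = 200
treated = rbinom(n, 1, 0.5)

# made-up continuous outcome (say, blood pressure) with an assumed treatment benefit
outcome = rnorm(n, mean = ifelse(treated == 1, 120, 130), sd = 15)

# compare outcomes between the two randomized groups
tapply(outcome, treated, mean)
t.test(outcome ~ treated)   # the p-value summarizes the evidence for a treatment effect
</code></pre>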
<p>Despite their unprecedented success in medicine, RCTs have not been fully adopted outside of scientific fields. In <a href="http://www.badscience.net/2011/05/we-should-so-blatantly-do-more-randomised-trials-on-policy/">this post</a>, Ben Goldacre advocates for politicians to learn from scientists and base policy decisions on RCTs. He provides several examples in which results contradicted conventional wisdom. In <a href="https://www.ted.com/talks/esther_duflo_social_experiments_to_fight_poverty?language=en">this TED talk</a> Esther Duflo convincingly argues that RCTs should be used to determine what interventions are best at fighting poverty. Although some RCTs are being conducted, they are still rare and oftentimes ignored by policymakers. For example, despite at least <a href="http://peabody.vanderbilt.edu/research/pri/VPKthrough3rd_final_withcover.pdf">two</a> <a href="http://www.acf.hhs.gov/sites/default/files/opre/executive_summary_final.pdf">RCTs</a> finding that universal pre-K programs are not effective, policymakers in New York <a href="http://www.npr.org/sections/ed/2015/09/08/438584249/new-york-city-mayor-goes-all-in-on-free-preschool">are implementing a $400 million a year program</a>. Supporters of this noble endeavor defend their decision by pointing to observational studies and “expert” opinion that support their preconceived views. Before the 1950s, indifference to RCTs was common among medical doctors as well, and the outcomes were at times devastating.</p>
<p>Today, when we <a href="http://www.ncbi.nlm.nih.gov/pubmed/7058834">compare conclusions from non-RCT studies to RCTs</a>, we note the unintended strong effects that preconceived notions can have. The first chapter in <a href="http://www.amazon.com/Statistics-4th-Edition-David-Freedman/dp/0393929728">this book</a> provides a summary and some examples. One example comes from <a href="http://www.jameslindlibrary.org/grace-nd-muench-h-chalmers-tc-1966/">a study</a> of 51 studies on the effectiveness of the portacaval shunt. Here is a table summarizing the conclusions of the 51 studies:</p>
<table>
<tr>
<td>
Design
</td>
<td>
Marked Improvement
</td>
<td>
Moderate Improvement
</td>
<td>
None
</td>
</tr>
<tr>
<td>
No control
</td>
<td>
24
</td>
<td>
7
</td>
<td>
1
</td>
</tr>
<tr>
<td>
Controls, but not randomized
</td>
<td>
10
</td>
<td>
3
</td>
<td>
2
</td>
</tr>
<tr>
<td>
Randomized
</td>
<td>
</td>
<td>
1
</td>
<td>
3
</td>
</tr>
</table>
<p>Compare the first and last columns to appreciate the importance of the randomized trials.</p>
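<p>To make the comparison concrete, here is a quick calculation of the fraction of studies reporting marked improvement under each design (the blank cell in the randomized row is treated as 0, which is my assumption):</p>
<pre><code class="r"># counts copied from the table above (rows: design, columns: conclusion)
shunt = matrix(c(24, 7, 1,
                 10, 3, 2,
                  0, 1, 3),   # blank "Marked" cell assumed to be 0
               nrow = 3, byrow = TRUE,
               dimnames = list(c("No control", "Controls, not randomized", "Randomized"),
                               c("Marked", "Moderate", "None")))

# proportion of studies of each design reporting marked improvement
round(shunt[, "Marked"] / rowSums(shunt), 2)
#> roughly 0.75, 0.67, and 0.00 respectively
</code></pre>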
<p>A particularly troubling example relates to the studies on Diethylstilbestrol (DES). DES is a drug that was used to prevent spontaneous abortions. Five out of five studies using historical controls found the drug to be effective, yet all three randomized trials found the opposite. Before the randomized trials convinced doctors to stop using this drug, it was given to thousands of women. This turned out to be a tragedy as later studies showed DES has <a href="http://diethylstilbestrol.co.uk/des-side-effects/">terrible side effects</a>. Despite the doctors having the best intentions in mind, ignoring the randomized trials resulted in unintended consequences.</p>
<p>Well-meaning experts are regularly implementing policies without really testing their effects. Although randomized trials are not always possible, it seems that they are rarely considered, in particular when the intentions are noble. Just like well-meaning turn-of-the-20th-century doctors, convinced that they were doing good, put their patients at risk by providing ineffective treatments, well-intentioned policies may end up hurting society.</p>
<p><strong>Update: </strong>A reader pointed me to <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2534811">these</a> <a href="http://eml.berkeley.edu//~crwalters/papers/kline_walters.pdf">preprints</a>, which point out that the control group in <a href="http://www.acf.hhs.gov/sites/default/files/opre/executive_summary_final.pdf">one of the cited</a> early education RCTs included children that received care in a range of different settings, not just staying at home. This implies that the signal is attenuated if what we want to know is whether the program is effective for children that would otherwise stay at home. In <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2534811">this preprint</a> they use statistical methodology (the principal stratification framework) to obtain separate estimates: the effect for children that would otherwise go to other center-based care and the effect for children that would otherwise stay at home. They find no effect for the former group but a significant effect for the latter. Note that in this analysis the effect being estimated is no longer based on groups assigned at random. Instead, model assumptions are used to infer the two effects. To avoid dependence on these assumptions we will have to perform an RCT with better-defined controls. Also note that the RCT data facilitated the principal stratification analysis. I also want to restate what <a href="http://simplystatistics.org/2014/04/17/correlation-does-not-imply-causation-parental-involvement-edition/">I’ve posted before</a>: “I am not saying that observational studies are uninformative. If properly analyzed, observational data can be very valuable. For example, the data supporting smoking as a cause of lung cancer is all observational. Furthermore, there is an entire subfield within statistics (referred to as causal inference) that develops methodologies to deal with observational data. But unfortunately, observational data are commonly misinterpreted.”</p>
So you are getting crushed on the internet? The new normal for academics.
2015-11-16T09:49:04+00:00
http://simplystats.github.io/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics
<p>Roger and I were just talking about all the discussion around the <a href="http://www.pnas.org/content/early/2015/10/29/1518393112.full.pdf">Case and Deaton paper</a> on death rates for middle-aged white Americans. Andrew Gelman <a href="http://www.slate.com/articles/health_and_science/science/2015/11/death_rates_for_white_middle_aged_americans_are_not_increasing.html">discussed it</a> among many others. They noticed a potential bias in the analysis and did some re-analysis. Just yesterday <a href="http://noahpinionblog.blogspot.com/2015/11/gelman-vs-case-deaton-academics-vs.html">Noah Smith</a> wrote a piece about academics versus blogs and how many academics are taken by surprise when they see their paper being discussed so rapidly on the internet. Much of the debate comes down to the speed, tone, and ferocity of internet discussion of academic work - along with the fact that sometimes it isn’t fully fleshed out.</p>
<p>I have been seeing this play out not just in the case of this specific paper, but many times that folks have been confronted with blogs or the quick publication process of <a href="http://f1000research.com/">f1000Research</a>. I think it is pretty scary for folks who aren’t used to “internet speed” to see this play out and I thought it would be helpful to make a few points.</p>
<ol>
<li><strong>Everyone is an internet scientist now.</strong> The internet has arrived as part of academics and if you publish a paper that is of interest (or if you are a Nobel prize winner, or if you dispute a claim, etc.) you will see discussion of that paper within a day or two on the blogs. This is now a fact of life.</li>
<li><strong>The internet loves a fight</strong>. The internet responds best to personal/angry blog posts or blog posts about controversial topics like p-values, errors, and bias. Almost certainly if someone writes a blog post about your work or an f1000 paper it will be about an error/bias/correction or something personal.</li>
<li><strong>Takedowns are easier than new research and happen faster</strong>. It is much, much easier to critique a paper than to design an experiment, collect data, figure out what question to ask, ask it quantitatively, analyze the data, and write it up. This doesn’t mean the critique won’t be good/right; it just means it will happen much, much faster than it took you to publish the paper because it is easier to do. All it takes is noticing one little bug in the code or one error in the regression model. So be prepared for speed in the response.</li>
</ol>
<p>In light of these three things, you have a couple of options about how to react if you write an interesting paper and people are discussing it - which they will certainly do (point 1), in a way that will likely make you uncomfortable (point 2), and faster than you’d expect (point 3). The first thing to keep in mind is that the internet wants you to “fight back” and wants to declare a “winner”. Reading about amicable disagreements doesn’t build audience. That is why there is reality TV. So there will be pressure for you to score points, be clever, be fast, and refute every point or be declared the loser. I have found from my own experience that is what I feel like doing too. I think that resisting this urge is both (a) very very hard and (b) the right thing to do. I find the best solution is to be proud of your work, but be humble, because no paper is perfect and that’s ok. If you do the best you can, sensible people will acknowledge that.</p>
<p>I think these are the three ways to respond to rapid internet criticism of your work.</p>
<ul>
<li><strong>Option 1: Respond on internet time.</strong> This means if you publish a big paper that you think might be controversial you should block off a day or two to spend time on the internet responding. You should be ready to do new analysis quickly, be prepared to admit mistakes quickly if they exist, and you should be prepared to make it clear when there aren’t. You will need social media accounts and you should probably have a blog so you can post longer form responses. Github/Figshare accounts make it better for quickly sharing quantitative/new analyses. Again your goal is to avoid the personal and stick to facts, so I find that Twitter/Facebook are best for disseminating your more long form responses on blogs/Github/Figshare. If you are going to go this route you should try to respond to as many of the major criticisms as possible, but usually they cluster into one or two specific comments, which you can address all in one.</li>
<li><strong>Option 2: Respond in academic time.</strong> You might have spent a year writing a paper to have people respond to it essentially instantaneously. Sometimes they will have good points, but they will rarely have carefully thought out arguments given the internet-speed response (although remember point 3 that good critiques can be faster than good papers). One approach is to collect all the feedback, ignore the pressure for an immediate response, and write a careful, scientific response which you can publish in a journal or in a fast outlet like f1000Research. I think this route can be the most scientific and productive if executed well. But this will be hard because people will treat that like “you didn’t have a good answer so you didn’t respond immediately”. The internet wants a quick winner/loser and that is terrible for science. Even if you choose this route though, you should make sure you have a way of publicizing your well thought out response - through blogs, social media, etc. once it is done.</li>
<li><strong>Option 3: Do not respond.</strong> This is what a lot of people do and I’m unsure if it is ok or not. Clearly internet-facing commentary can have an impact on you/your work/how it is perceived, for better or worse. So if you ignore it, you are ignoring those consequences. This may be ok, but depending on the severity of the criticism it may be hard to deal with and it may mean that you have a lot of questions to answer later. Honestly, I think as time goes on, if you write a big paper under a lot of scrutiny, Option 3 is going to go away.</li>
</ul>
<p>All of this only applies if you write a paper that a ton of people care about/is controversial. Many technical papers won’t have this issue and if you keep your claims small, this also probably won’t apply. But I thought it was useful to try to work out how to act under this “new normal”.</p>
Prediction Markets for Science: What Problem Do They Solve?
2015-11-10T20:29:19+00:00
http://simplystats.github.io/2015/11/10/prediction-markets-for-science-what-problem-do-they-solve
<p>I’ve recently seen a bunch of press on <a href="http://www.pnas.org/content/early/2015/11/04/1516179112.abstract">this paper</a>, which describes an experiment with developing a prediction market for scientific results. From FiveThirtyEight:</p>
<blockquote>
<p>Although <a href="http://fivethirtyeight.com/datalab/psychology-is-starting-to-deal-with-its-replication-problem/">replication is essential for verifying results</a>, the <a href="http://fivethirtyeight.com/features/science-isnt-broken/">current scientific culture does little to encourage it in most fields</a>. That’s a problem because it means that misleading scientific results, like those from the “shades of gray” study, <a href="http://pss.sagepub.com/content/22/11/1359.short?rss=1&ssource=mfr">could be common in the scientific literature</a>. Indeed, a 2005 study claimed that <a href="http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124">most published research findings are false.</a></p>
<p>[…]</p>
<p>The researchers began by selecting some studies slated for replication in the <a href="https://osf.io/ezcuj/wiki/home/">Reproducibility Project: Psychology</a> — a project that aimed to reproduce 100 studies published in three high-profile psychology journals in 2008. They then recruited psychology researchers to take part in <a href="https://osf.io/yjmht/">two prediction markets</a>. These are the same types of markets that people use <a href="http://www.nytimes.com/2015/10/24/upshot/betting-markets-call-marco-rubio-front-runner-in-gop.html?_r=0">to bet on who’s going to be president</a>. In this case, though, researchers were betting on whether a study would replicate or not.</p>
</blockquote>
<p>There are all kinds of prediction markets these days–for politics, general ideas–so having one for scientific ideas is not too controversial. But I’m not sure I see exactly what problem is solved by having a prediction market for science. In the paper, they claim that the market-based bets were better predictors of replication than the general survey that was administered to the scientists. I’ll admit that’s an interesting result, but I’m not yet convinced.</p>
<p>First off, it’s worth noting that this work comes out of the massive replication project conducted by the Center for Open Science, where I believe they <a href="http://simplystatistics.org/2015/10/01/a-glass-half-full-interpretation-of-the-replicability-of-psychological-science/">have a</a> <a href="http://simplystatistics.org/2015/10/20/we-need-a-statistically-rigorous-and-scientifically-meaningful-definition-of-replication/">fundamentally flawed definition of replication</a>. So I’m not sure I can really agree with the idea of basing a prediction market on such a definition, but I’ll let that go for now.</p>
<p>The purpose of most markets is some general notion of “price discovery”. One popular market is the stock market and I think it’s instructive to see how that works. Basically, people continuously bid on the shares of certain companies and markets keep track of all the bids/offers and the completed transactions. If you are interested in finding out what people are willing to pay for a share of Apple, Inc., then it’s probably best to look at…what people are willing to pay. That’s exactly what the stock market gives you. You only run into trouble when there’s no liquidity, so no one shows up to bid/offer, but that would be a problem for any market.</p>
<p>Now, suppose you’re interested in finding out what the “true fundamental value” of Apple, Inc. is. Some people think the stock market gives you that at every instance, while <a href="http://www.econ.yale.edu/~shiller/">others</a> think that the stock market can behave irrationally for long periods of time. Perhaps in the very long run, you get a sense of the fundamental value of a company, but that may not be useful information at that point.</p>
<p>What does the market for scientific hypotheses give you? Well, it would be one thing if granting agencies participated in the market. Then, we would never have to write grant applications. The granting agencies could then signal what they’d be willing to pay for different ideas. But that’s not what we’re talking about.</p>
<p>Here, we’re trying to get at whether a given hypothesis is <em>true or not</em>. The only real way to get information about that is to conduct an experiment. How many people betting in the markets will have conducted an experiment? Likely the minority, given that the whole point is to save money by not having people conduct experiments investigating hypotheses that are likely false.</p>
<p>But if market participants aren’t contributing real information about an hypothesis, what are they contributing? Well, they’re contributing their <em>opinion</em> about an hypothesis. How is that related to science? I’m not sure. Of course, participants could be experts in the field (although not necessarily) and so their opinions will be informed by past results. And ultimately, it’s consensus amongst scientists that determines, after repeated experiments, whether an hypothesis is true or not. But at the early stages of investigation, it’s not clear how valuable people’s opinions are.</p>
<p>In a way, this reminds me of a time a while back when the EPA was soliciting “expert opinion” about the health effects of outdoor air pollution, as if that were a reasonable substitute for collecting actual data on the topic. At least it cost less money–just the price of a conference call.</p>
<p>There’s a version of this playing out in the health tech market right now. Companies like <a href="http://simplystatistics.org/2015/10/28/discussion-of-the-theranos-controversy-with-elizabeth-matsui/">Theranos</a> and 23andMe are selling health products that they claim are better than some current benchmark. In particular, Theranos claims its blood tests are accurate when only using a tiny sample of blood. Is this claim true or not? No one outside Theranos knows for sure, but we can look to the financial markets.</p>
<p>Theranos can point to the marketplace and show that people are willing to pay for its products. Indeed, the $9 billion valuation of the private company is another indicator that people…highly value the company. But ultimately, <em>we still don’t know if their blood tests are accurate</em> because we don’t have any data. If we were to go by the financial markets alone, we would necessarily conclude that their tests are good, because why else would anyone invest so much money in the company?</p>
<p>I think there may be a role to play for prediction markets in science, but I’m not sure discovering the truth about nature is one of them.</p>
Biostatistics: It's not what you think it is
2015-11-09T10:00:20+00:00
http://simplystats.github.io/2015/11/09/biostatistics-its-not-what-you-think-it-is
<p><a href="http://www.hsph.harvard.edu/biostatistics">My department</a> recently sent me on a recruitment trip for our graduate program. I had the opportunity to chat with undergrads interested in pursuing a career related to data analysis. I found that several did not know about the existence of Departments of <em>Biostatistics</em> and most of the rest thought <em>Biostatistics</em> was the study of clinical trials. We <a href="http://simplystatistics.org/2012/08/14/statistics-statisticians-need-better-marketing/">have</a> <a href="http://simplystatistics.org/2011/11/02/we-need-better-marketing/">posted</a> on the need for better marketing for Statistics, but Biostatistics needs it even more. So this post is for students considering a career as applied statisticians or data science and are considering PhD programs.</p>
<p>There are dozens of Biostatistics departments and most run PhD programs. As an undergraduate, you may have never heard of them because they are usually in schools that undergrads don’t regularly frequent: Public Health and Medicine. However, they are very active in research and teaching graduate students. In fact, the 2014 US News & World Report ranking of Statistics Departments includes three Biostat departments in the top five spots. Although clinical trials are a popular area of interest in these departments, there are now many other areas of research. With so many fields of science shifting to data-intensive research, Biostatistics has adapted to work in these areas. Today pretty much any Biostat department will have people working on projects related to genetics, genomics, computational biology, electronic medical records, neuroscience, environmental sciences, epidemiology, health-risk analysis, and clinical decision making. Through collaborations, academic biostatisticians have early access to the cutting-edge datasets produced by public health scientists and biomedical researchers. Our research usually revolves around either developing statistical methods that are used by researchers working in these fields or working directly with a collaborator on data-driven discovery.</p>
<p><strong>How is it different from Statistics? </strong>In the grand scheme of things, they are not very different. As implied by the name, Biostatisticians focus on data related to biology while statisticians tend to be more general. However, the underlying theory and skills we learn are similar. In my view, the major difference is that Biostatisticians, in general, tend to be more interested in data and the subject matter, while in Statistics Departments more emphasis is given to the mathematical theory.</p>
<p><strong>What type of job can I get with a PhD in Biostatistics? </strong><a href="http://fortune.com/2015/04/27/best-worst-graduate-degrees-jobs/">A well paying one</a>. And you will have many options to choose from. Our graduates tend to go to academia, industry or government. Also, the <strong>Bio </strong>in the name does not keep our graduates from landing non-bio related jobs, such as in high tech. The reason for this is that the training our students receive and what they learn from research experiences can be widely applied to data analysis challenges.</p>
<p><strong>How should I prepare if I want to apply to a PhD program?</strong> First you need to decide if you are going to like it. One way to do this is to participate in one of the <a href="http://www.nhlbi.nih.gov/research/training/summer-institute-biostatistics-t15">summer programs</a> where you get a glimpse of what we do. My department runs <a href="http://www.hsph.harvard.edu/biostatistics/diversity/summer-program/">one of these as well</a>. However, as an undergrad I would mainly focus on courses. Undergraduate research experiences are a good way to get an idea of what it’s like, but it is difficult to do real research unless you can set aside several hours a week for several consecutive months. This is difficult as an undergrad because you have to make sure to do well in your courses, prepare for the GRE, and get a solid mathematical and computing foundation in order to conduct research later. This is why these programs are usually in the summer. If you decide to apply to a PhD program, I recommend you take advanced math courses such as Real Analysis and Matrix Algebra. If you plan to develop software for complex datasets, I recommend CS courses that cover algorithms and optimization. Note that programming skills are not the same thing as the theory taught in these CS courses. Programming skills in R will serve you well if you plan to analyze data regardless of what academic route you follow. Python and a low-level language such as C++ are more powerful languages that many biostatisticians use these days.</p>
<p>I think the demand for well-trained researchers that can make sense of data will continue to be on the rise. If you want a fulfilling job where you analyze data for a living, you should consider a PhD in Biostatistics.</p>
Not So Standard Deviations: Episode 4 - A Gajillion Time Series
2015-11-07T11:46:49+00:00
http://simplystats.github.io/2015/11/07/not-so-standard-deviations-episode-4-a-gajillion-time-series
<p>Episode 4 of Not So Standard Deviations is hot off the audio editor. In this episode Hilary first explains to me what the heck DevOps is, and then we talk about the statistical challenges in detecting rare events in an enormous set of time series data. There’s also some discussion of Ben and Jerry’s and the t-test, so you’ll want to hang on for that.</p>
<p>Notes:</p>
<ul>
<li><a href="https://goo.gl/259VKI">Nobody Loves Graphite Anymore</a></li>
<li><a href="http://goo.gl/zB7wM9">A response</a></li>
<li><a href="https://goo.gl/7PgLKY">Why Gosset is awesome</a></li>
</ul>
How I decide when to trust an R package
2015-11-06T13:41:02+00:00
http://simplystats.github.io/2015/11/06/how-i-decide-when-to-trust-an-r-package
<p>One thing that I’ve given a lot of thought to recently is the process that I use to decide whether I trust an R package or not. Kasper Hansen took a break from <a href="https://twitter.com/KasperDHansen/status/657589509975076864">trolling me</a> <a href="https://twitter.com/KasperDHansen/status/621315346633519104">on Twitter</a> to talk about how he trusts packages on Github less than packages that are on CRAN and particularly Bioconductor. He makes a couple of points that I think are very relevant. First, that having a package on CRAN/Bioconductor raises trust in that package:</p>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
.<a href="https://twitter.com/michaelhoffman">@michaelhoffman</a> But it's not on Bioconductor or CRAN. This decreases trust substantially.
</p>
<p>
— Kasper Daniel Hansen (@KasperDHansen) <a href="https://twitter.com/KasperDHansen/status/659777449098637312">October 29, 2015</a>
</p>
</blockquote>
<p>The primary reason is because Bioc/CRAN demonstrate something about the developer’s willingness to do the boring but critically important parts of package development like documentation, vignettes, minimum coding standards, and being sure that their code isn’t just a rehash of something else. The other big point Kasper made was the difference between a repository - which is user oriented and should provide certain guarantees - and Github - which is a developer platform that makes things easier/better for developers but doesn’t have a user guarantee system in place.</p>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
.<a href="https://twitter.com/StrictlyStat">@StrictlyStat</a> CRAN is a repository, not a development platform. It is user oriented, not developer oriented. GH is the reverse.
</p>
<p>
— Kasper Daniel Hansen (@KasperDHansen) <a href="https://twitter.com/KasperDHansen/status/661746848437243904">November 4, 2015</a>
</p>
</blockquote>
<p>This discussion got me thinking about when/how I depend on R packages and how I make that decision. The scenarios where I depend on R packages are:</p>
<ol>
<li>Quick and dirty analyses for myself</li>
<li>Shareable data analyses that I hope are reproducible</li>
<li>As dependencies of R packages I maintain</li>
</ol>
<p>As you move from 1-3 it is more and more of a pain if the package I’m depending on breaks. If it is just something I was doing for fun, it’s not that big of a deal. But if it means I have to rewrite/recheck/rerelease my R package then that is a much bigger headache.</p>
<p>So my scale for how stringent I am about relying on packages varies by the type of activity, but what are the criteria I use to measure how trustworthy a package is? For me, the criteria are in this order:</p>
<ol>
<li><strong>People prior </strong></li>
<li><strong>Forced competence</strong></li>
<li><strong>Indirect data</strong></li>
</ol>
<p>I’ll explain each criterion in a minute, but the main purpose of using these criteria is (a) to ensure that I’m using a package that works and (b) to ensure that if the package breaks I can trust it will be fixed or at least I can get some help from the developer.</p>
<p><strong>People prior</strong></p>
<p>The first thing I do when I look at a package I might depend on is look at who the developer is. If that person is someone I know has developed widely used, reliable software and who quickly responds to requests/feedback then I immediately trust the package. I have a list of people like <a href="https://en.wikipedia.org/wiki/Brian_D._Ripley">Brian</a>, or <a href="https://github.com/hadley">Hadley,</a> or <a href="https://github.com/jennybc">Jenny</a>, or <a href="http://rafalab.dfci.harvard.edu/index.php/software-and-data">Rafa</a>, who could post their package just as a link to their website and I would trust it. It turns out almost all of these folks end up putting their packages on CRAN/Bioconductor anyway. But even if they didn’t I assume that the reason is either (a) the package is very new or (b) they have a really good reason for not distributing it through the normal channels.</p>
<p><strong>Forced competence</strong></p>
<p>For people who I don’t know about or whose software I’ve never used, then I have very little confidence in the package a priori. This is because there are a ton of people developing R packages now with highly variable levels of commitment to making them work. So as a placeholder for all the variables I don’t know about them, I use the repository they choose as a surrogate. My personal prior on the trustworthiness of a package from someone I don’t know goes something like:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM.png"><img class="aligncenter wp-image-4410 size-full" src="http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM.png" alt="Screen Shot 2015-11-06 at 1.25.01 PM" width="843" height="197" srcset="http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM-300x70.png 300w, http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM-260x61.png 260w, http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM.png 843w" sizes="(max-width: 843px) 100vw, 843px" /></a></p>
<p>This prior is based on the idea of forced competence. In general, you have to do more to get a package approved on Bioconductor than on CRAN (for example you have to have a good vignette) and you have to do more to get a package on CRAN (pass R CMD CHECK and survive the review process) than to put it on Github.</p>
<p>This prior isn’t perfect, but it does tell me something about how much the person cares about their package. If they go to the work of getting it on CRAN/Bioc, then at least they cared enough to document it. They are at least forced to be minimally competent - at least at the time of submission and enough for the packages to still pass checks.</p>
<p><strong>Indirect data</strong></p>
<p>After I’ve applied my priors I then typically look at the data. For Bioconductor I look at the badges, like how downloaded it is, whether it passes the checks, and how well it is covered by tests. I’m already inclined to trust it a bit since it is on that platform, but I use the data to adjust my prior a bit. For CRAN I might look at the <a href="http://cran-logs.rstudio.com/">download stats</a> provided by Rstudio. The interesting thing is that as John Muschelli points out, Github actually has the most indirect data available for a package:</p>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
.<a href="https://twitter.com/KasperDHansen">@KasperDHansen</a> Flipside: CRAN has no issue pages, stars/ratings, outdated limits on size, and limited development cycle/turnover.
</p>
<p>
— John Muschelli (@StrictlyStat) <a href="https://twitter.com/StrictlyStat/status/661746348409114624">November 4, 2015</a>
</p>
</blockquote>
<p>If I’m going to use a package that is on Github from a person who isn’t on my prior list of people to trust then I look at a few things. The number of stars/forks/watchers is one thing that is a quick and dirty estimate of how used a package is. I also look very carefully at how many commits the person has submitted to both the package in question and in general all other packages over the last couple of months. If the person isn’t actively developing either the package or anything else on Github, that is a bad sign. I also look to see how quickly they have responded to issues/bug reports on the package in the past if possible. One idea I haven’t used but I think is a good one is to submit an issue for a trivial change to the package and see if I get a response very quickly. Finally I look and see if they have some demonstration that their package works across platforms (say with a <a href="https://travis-ci.org/">travis badge</a>). If the package is highly starred, frequently maintained, all issues are responded to and up-to-date, and it passes checks on all platforms, then that data might overwhelm my prior and I’d go ahead and trust the package.</p>
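<p>For what it’s worth, much of this indirect data can be pulled programmatically. Here is a rough sketch in R; the package name and GitHub repo below are placeholders rather than real projects, and it assumes the cranlogs and jsonlite packages are installed:</p>
<pre><code># Rough sketch: CRAN download counts and GitHub metrics for a package.
# "somepkg" and "someuser/somepkg" are placeholders.
library(cranlogs)   # download logs from RStudio's CRAN mirror
library(jsonlite)

dl <- cran_downloads("somepkg", when = "last-month")
sum(dl$count)       # downloads over the last month

# Stars, forks, and open issues from the (rate-limited) GitHub API
repo <- fromJSON("https://api.github.com/repos/someuser/somepkg")
repo[c("stargazers_count", "forks_count", "open_issues_count")]
</code></pre>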
<p><strong>Summary</strong></p>
<p>In general, one of the best things about the R ecosystem is being able to rely on other packages so that you don’t have to write everything from scratch. But there is a hard balance to strike with keeping the dependency list small. One way I maintain this balance is to use the strategy I’ve outlined above, so that I can worry less about whether my dependencies are trustworthy.</p>
The Statistics Identity Crisis: Am I a Data Scientist
2015-10-30T14:21:08+00:00
http://simplystats.github.io/2015/10/30/the-statistics-identity-crisis-am-i-a-data-scientist
<p>The joint ASA/Simply Statistics webinar on the statistics identity crisis is now live!</p>
Faculty/postdoc job opportunities in genomics across Johns Hopkins
2015-10-30T10:33:06+00:00
http://simplystats.github.io/2015/10/30/facultypostdoc-job-opportunities-in-genomics-across-johns-hopkins
<p>It’s pretty exciting to be in genomics at Hopkins right now with three new Bloomberg professors in genomics areas, a ton of stellar junior faculty, and a really fun group of students/postdocs. If you want to get in on the action here is a non-comprehensive list of great opportunities.</p>
<h2 id="faculty-jobs"><span style="text-decoration: underline;"><strong>Faculty Jobs</strong></span></h2>
<p><strong>Job: </strong>Multiple tenure track faculty positions in all areas including in genomics</p>
<p><strong>Department: </strong> Biostatistics</p>
<p><strong>To apply</strong>: <a href="http://www.jhsph.edu/departments/biostatistics/_docs/faculty-ad-2016-combined-large-final.pdf">http://www.jhsph.edu/departments/biostatistics/_docs/faculty-ad-2016-combined-large-final.pdf</a></p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> Tenure track position in data intensive biology</p>
<p><strong>Department: </strong> Biology</p>
<p><strong>To apply</strong>: <a href="http://apply.interfolio.com/31146">http://apply.interfolio.com/31146</a></p>
<p><strong>Deadline: </strong>Nov 1st and ongoing</p>
<p><strong>Job:</strong> Tenure track positions in bioinformatics, with focus on proteomics or sequencing data analysis</p>
<p><strong>Department: </strong> Oncology Biostatistics</p>
<p><strong>To apply</strong>: <a href="https://www.research-it.onc.jhmi.edu/DBB/PhD_Statistician.pdf">https://www.research-it.onc.jhmi.edu/DBB/PhD_Statistician.pdf</a></p>
<p><strong>Deadline:</strong> Review ongoing</p>
<h2 id="postdoc-jobs"><span style="text-decoration: underline;"><strong>Postdoc Jobs</strong></span></h2>
<p><strong>Job:</strong> Postdoc(s) in statistical methods/software development for RNA-seq</p>
<p><strong>Employer: </strong> Jeff Leek</p>
<p><strong>To apply</strong>: email Jeff (<a href="http://jtleek.com/jobs/">http://jtleek.com/jobs/</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> Data scientist for integrative genomics in the human brain (MS/PhD)</p>
<p><strong>Employer: </strong> Andrew Jaffe</p>
<p><strong>To apply</strong>: email Andrew (<a href="http://www.aejaffe.com/jobs.html">http://www.aejaffe.com/jobs.html</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> Research associate for genomic data processing and analysis (BA+)</p>
<p><strong>Employer: </strong> Andrew Jaffe</p>
<p><strong>To apply</strong>: email Andrew (<a href="http://www.aejaffe.com/jobs.html">http://www.aejaffe.com/jobs.html</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> PhD developing scalable software and algorithms for analyzing sequencing data</p>
<p><strong>Employer: </strong> Ben Langmead</p>
<p><strong>To apply</strong>: http://www.cs.jhu.edu/graduate-studies/phd-program/</p>
<p><strong>Deadline:</strong> See site</p>
<p><strong>Job:</strong> Postdoctoral researcher developing scalable software and algorithms for analyzing sequencing data</p>
<p><strong>Employer: </strong> Ben Langmead</p>
<p><strong>To apply</strong>: email Ben (<a href="http://www.langmead-lab.org/open-positions/">http://www.langmead-lab.org/open-positions/</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> Postdoctoral researcher developing algorithms for challenging problems in large-scale genomics, including whole-genome assembly, RNA-seq analysis, and microbiome analysis</p>
<p><strong>Employer: </strong> Steven Salzberg</p>
<p><strong>To apply</strong>: email Steven (<a href="http://salzberg-lab.org/">http://salzberg-lab.org/</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> Research associate for genomic data processing and analysis (BA+) in cancer</p>
<p><strong>Employer: </strong> Luigi Marchionni (with Don Geman)</p>
<p><strong>To apply</strong>: email Luigi (<a href="http://luigimarchionni.org/">http://luigimarchionni.org/</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job: </strong>Postdoctoral researcher developing algorithms for biomarkers development and precision medicine application in cancer</p>
<p><strong>Employer: </strong> Luigi Marchionni (with Don Geman)</p>
<p><strong>To apply</strong>: email Luigi (<a href="http://luigimarchionni.org/">http://luigimarchionni.org/</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> Postdoctoral researcher developing methods in machine learning, genomics, and regulatory variation</p>
<p><strong>Employer: </strong> Alexis Battle</p>
<p><strong>To apply</strong>: email Alexis (<a href="http://battlelab.jhu.edu/join_us.html">http://battlelab.jhu.edu/join_us.html</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job: </strong>Postdoctoral fellow with interests in biomarker discovery for Alzheimer’s disease</p>
<p><strong>Employer: </strong> Madhav Thambisetty / Ingo Ruczinski</p>
<p><strong>To apply</strong>: <a href="http://www.alzforum.org/jobs/postdoctoral-research-fellow-alzheimers-disease-biomarkers"> http://www.alzforum.org/jobs/postdoctoral-research-fellow-alzheimers-disease-biomarkers</a></p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job: </strong>Postdoctoral positions for research in the interface of statistical genetics, precision medicine and big data</p>
<p><strong>Employer: </strong> Nilanjan Chatterjee</p>
<p><strong>To apply</strong>: <a href="http://www.jhsph.edu/departments/biostatistics/_docs/postdoc-ad-chatterjee.pdf">http://www.jhsph.edu/departments/biostatistics/_docs/postdoc-ad-chatterjee.pdf</a></p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job: </strong>Postdoctoral research developing algorithms and software for time course pattern detection in genomics data</p>
<p><strong>Employer: </strong> Elana Fertig</p>
<p><strong>To apply</strong>: email Elana (ejfertig@jhmi.edu)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job: </strong>Postdoctoral fellow to develop novel methods for large-scale DNA and RNA sequence analysis related to human and/or plant genetics, such as developing methods for discovering structural variations in cancer or for assembling and analyzing large complex plant genomes.</p>
<p><strong>Employer: </strong> Mike Schatz</p>
<p><strong>To apply</strong>: email Mike (<a href="http://schatzlab.cshl.edu/apply/">http://schatzlab.cshl.edu/apply/</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<h2 id="students"><span style="text-decoration: underline;"><strong>Students</strong></span></h2>
<p>We are all always on the hunt for good Ph.D. students. At Hopkins students are admitted to specific departments. So if you find a faculty member you want to work with, you can apply to their department. Here are the application details for the various departments admitting students to work on genomics:<a href="https://ccb.jhu.edu/students.shtml"> https://ccb.jhu.edu/students.shtml</a></p>
The statistics identity crisis: am I really a data scientist?
2015-10-29T13:32:13+00:00
http://simplystats.github.io/2015/10/29/the-statistics-identity-crisis-am-i-really-a-data-scientist
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/crisis.png"><img class="aligncenter wp-image-4397" src="http://simplystatistics.org/wp-content/uploads/2015/10/crisis-300x75.png" alt="crisis" width="508" height="127" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/crisis-300x75.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/crisis-260x65.png 260w, http://simplystatistics.org/wp-content/uploads/2015/10/crisis.png 720w" sizes="(max-width: 508px) 100vw, 508px" /></a></p>
<p><em>Tl;dr: We will host a Google Hangout of our popular JSM session October 30th 2-4 PM EST. </em></p>
<p>I organized a session at JSM 2015 called <em>“The statistics identity crisis: am I really a data scientist?”</em> The session turned out to be pretty popular:</p>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
Packed room of statisticians with identity crises at <a href="https://twitter.com/hashtag/JSM2015?src=hash">#JSM2015</a> session: are we really data scientists? <a href="http://t.co/eLsGosoTCt">pic.twitter.com/eLsGosoTCt</a>
</p>
<p>
— Dr Ruth Etzioni (@retzioni) <a href="https://twitter.com/retzioni/status/631134032357502978">August 11, 2015</a>
</p>
</blockquote>
<p>but it turns out not everyone fit in the room:</p>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
This is the closest I can get to <a href="https://twitter.com/statpumpkin">@statpumpkin</a>'s talk. <a href="https://twitter.com/hashtag/jsm2015?src=hash">#jsm2015</a> still had no clue how to predict session attendance. <a href="http://t.co/gTb4OqdAo3">pic.twitter.com/gTb4OqdAo3</a>
</p>
<p>
— sandy griffith (@sgrifter) <a href="https://twitter.com/sgrifter/status/631134590229442560">August 11, 2015</a>
</p>
</blockquote>
<p>Thankfully, Steve Pierson at the ASA had the awesome idea to re-run the session for people who couldn’t be there. So we will be hosting a Google Hangout with the following talks:</p>
<table width="100%" cellspacing="0" cellpadding="4" bgcolor="white">
<tr>
<td align="right" valign="top" width="110">
</td>
<td>
<a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314339">'Am I a Data Scientist?': The Applied Statistics Student's Identity Crisis</a> — <b>Alyssa Frazee, Stripe</b>
</td>
</tr>
<tr>
<td align="right" valign="top" width="110">
</td>
<td>
<a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314376">How Industry Views Data Science Education in Statistics Departments</a> — <b>Chris Volinsky, AT&T</b>
</td>
</tr>
<tr>
<td align="right" valign="top" width="110">
</td>
<td>
<a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314414">Evaluating Data Science Contributions in Teaching and Research</a> — <b>Lance Waller, Emory University</b>
</td>
</tr>
<tr>
<td align="right" valign="top" width="110">
</td>
<td>
<a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314641">Teach Data Science and They Will Come</a> — <b>Jennifer Bryan, The University of British Columbia</b>
</td>
</tr>
</table>
<p>You can watch it on Youtube or Google Plus. Here is the link:</p>
<p>https://plus.google.com/events/chuviltukohj2inbqueap9h7228</p>
<p>The session will be held October 30th (tomorrow!) from 2-4PM EST. You can watch it live and discuss the talks using the <a href="https://twitter.com/search?q=%23jsm2015">#JSM2015 hashtag</a>, or you can watch later as the video will remain on Youtube.</p>
Discussion of the Theranos Controversy with Elizabeth Matsui
2015-10-28T14:54:50+00:00
http://simplystats.github.io/2015/10/28/discussion-of-the-theranos-controversy-with-elizabeth-matsui
<p>Theranos is a Silicon Valley diagnostic testing company that has been in the news recently. The story of Theranos has fascinated me because I think it represents a perfect collision of the tech startup culture and the health care culture and how combining them together can generate unique problems.</p>
<p>I talked with Elizabeth Matsui, a Professor of Pediatrics in the Division of Allergy and Immunology here at Johns Hopkins, to discuss Theranos, the realities of diagnostic testing, and the unique challenges that a health-tech startup faces with respect to doing good science and building products people want to buy.</p>
<p>Notes:</p>
<ul>
<li>Original <a href="http://www.wsj.com/articles/theranos-has-struggled-with-blood-tests-1444881901">Wall Street Journal story</a> on Theranos (paywalled)</li>
<li>Related stories in <a href="http://www.wired.com/2015/10/theranos-scandal-exposes-the-problem-with-techs-hype-cycle/">Wired</a> and NYT’s <a href="http://www.nytimes.com/2015/10/28/business/dealbook/theranos-under-fire.html">Dealbook</a> (not paywalled)</li>
<li>Theranos <a href="https://www.theranos.com/news/posts/custom/theranos-facts">response</a> to WSJ story</li>
</ul>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/230510705%3Fsecret_token%3Ds-WbZX8&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Not So Standard Deviations: Episode 3 - Gilmore Girls
2015-10-24T23:17:18+00:00
http://simplystats.github.io/2015/10/24/not-so-standard-deviations-episode-3-gilmore-girls
<p>I just uploaded Episode 3 of <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> so check your feeds. In this episode Hilary and I talk about our jobs and the life of the data scientist in both academia and the tech industry. It turns out that they’re not as different as I would have thought.</p>
<p><a href="https://api.soundcloud.com/tracks/229957578/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p>
We need a statistically rigorous and scientifically meaningful definition of replication
2015-10-20T10:05:22+00:00
http://simplystats.github.io/2015/10/20/we-need-a-statistically-rigorous-and-scientifically-meaningful-definition-of-replication
<p>Replication and confirmation are indispensable concepts that help define scientific facts. However, the way in which we reach scientific consensus on a given finding is rather complex. Although <a href="http://simplystatistics.org/2015/06/24/how-public-relations-and-the-media-are-distorting-science/">some press releases try to convince us otherwise</a>, rarely is one publication enough. In fact, most published results go unnoticed and no attempts to replicate them are made. These are not debunked either; they simply get discarded to the dustbin of history. The very few results that garner enough attention for others to spend time and energy on them are assessed by an ad-hoc process involving a community of peers. The assessments are usually a combination of deductive reasoning, direct attempts at replication, and indirect checks obtained by attempting to build on the result in question. This process eventually leads to a result either being accepted by consensus or not. For particularly important cases, an official scientific consensus report may be commissioned by a national academy or an established scientific society. Examples of results that have become part of the scientific consensus in this way include smoking causing lung cancer, HIV causing AIDS, and climate change being caused by humans. In contrast, the published result that vaccines cause autism has been thoroughly debunked by several follow up studies. In none of these four cases was a simple definition of replication used to confirm or falsify a result. The same is true for most results for which there is consensus. Yet science moves on, and continues to be an incomparable force at improving our quality of life.</p>
<p>Regulatory agencies, such as the FDA, are an exception since they clearly spell out a <a href="http://www.fda.gov/downloads/Drugs/.../Guidances/ucm078749.pdf">definition</a> of replication. For example, to approve a drug they may require two independent clinical trials, adequately powered, to show statistical significance at some predetermined level. They also require a large enough effect size to justify the cost and potential risks associated with treatment. This is not to say that FDA approval is equivalent to scientific consensus, but they do provide a clearcut definition of replication.</p>
<p>In response to a growing concern over a <em><a href="http://www.nature.com/news/reproducibility-1.17552">reproducibility crisis</a></em>, projects such as the <a href="http://osc.centerforopenscience.org/">Open Science Collaboration</a> have commenced to systematically try to replicate published results. In a <a href="http://simplystatistics.org/2015/10/01/a-glass-half-full-interpretation-of-the-replicability-of-psychological-science/">recent post</a>, Jeff described one of their <a href="http://www.sciencemag.org/content/349/6251/aac4716">recent papers</a> on estimating the reproducibility of psychological science (they really mean replicability; see note below). This Science paper led to lay press reports with eye-catching headlines such as “only 36% of psychology experiments replicate”. Note that the 36% figure comes from a definition of replication that mimics the definition used by regulatory agencies: results are considered replicated if a p-value < 0.05 was reached in both the original study and the replicated one. Unfortunately, this definition ignores both effect size and statistical power. If power is not controlled, then the expected proportion of correct findings that replicate can be quite small. For example, if I try to replicate the smoking-causes-lung-cancer result with a sample size of 5, there is a good chance it will not replicate. In his post, Jeff notes that for several of the studies that did not replicate, the 95% confidence intervals intersected. So should intersecting confidence intervals be our definition of replication? This too has a flaw since it favors imprecise studies with very large confidence intervals. If effect size is ignored, we may waste our time trying to replicate studies reporting practically meaningless findings. Generally defining replication for published studies is not as easy as for highly controlled clinical trials. However, one clear improvement from what is currently being done is to consider statistical power and effect sizes.</p>
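<p>To put a number on the power issue, here is a quick R sketch (the effect size is made up, not taken from any of the studies discussed): even a genuine effect of a full standard deviation will usually fail a p < 0.05 replication test with only 5 subjects per group.</p>
<pre><code># Power of a two-sample t-test to detect a true effect of 1 SD (a made-up
# effect size) at the 0.05 level, for two replication sample sizes
power.t.test(n = 5,  delta = 1, sd = 1, sig.level = 0.05)$power  # about 0.29
power.t.test(n = 50, delta = 1, sd = 1, sig.level = 0.05)$power  # about 0.99
</code></pre>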
<p>To further illustrate this, let’s consider a very concrete example with real life consequences. Imagine a loved one has a disease with high mortality rates and asks for your help in evaluating the scientific evidence on treatments. Four experimental drugs are available, all with promising clinical trials resulting in p-values < 0.05. However, a replication project redoes the experiments and finds that only the drug A and drug B studies replicate (p < 0.05). So which drug do you take? Let’s give a bit more information to help you decide. Here are the p-values for both original and replication trials:</p>
<table style="width: 100%;">
<tr>
<td>
Drug
</td>
<td>
Original
</td>
<td>
Replication
</td>
<td>
Replicated
</td>
</tr>
<tr>
<td>
A
</td>
<td>
0.0001
</td>
<td>
0.001
</td>
<td>
Yes
</td>
</tr>
<tr>
<td>
B
</td>
<td>
<0.000001
</td>
<td>
0.03
</td>
<td>
Yes
</td>
</tr>
<tr>
<td>
C
</td>
<td>
0.03
</td>
<td>
0.06
</td>
<td>
No
</td>
</tr>
<tr>
<td>
D
</td>
<td>
<0.000001
</td>
<td>
0.10
</td>
<td>
No
</td>
</tr>
</table>
<p>Which drug would you take now? The information I have provided is based on p-values and therefore is missing a key piece of information: the effect sizes. Below I show the confidence intervals for the four original studies (left) and the four replication studies (right). Note that except for drug B, all confidence intervals intersect. In light of the figure below, which one would you choose?</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/replication.png"><img class=" wp-image-4368 alignright" src="http://simplystatistics.org/wp-content/uploads/2015/10/replication.png" alt="replication" width="359" height="338" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/replication-300x283.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/replication-212x200.png 212w, http://simplystatistics.org/wp-content/uploads/2015/10/replication.png 617w" sizes="(max-width: 359px) 100vw, 359px" /></a></p>
<p>I would be inclined to go with drug D because it has a large effect size, a small p-value, and the replication experiment effect estimate fell inside a 95% confidence interval. I would definitely not go with A since it provides marginal benefits, even if the trial found a statistically significant effect and was replicated. So the p-value based definition of replication is worthless from a practical standpoint.</p>
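<p>The figure makes the point visually; a tiny R sketch with invented numbers (none of these come from the actual trials) shows the same thing with confidence intervals:</p>
<pre><code># Hypothetical effect estimates and standard errors, invented for illustration
ci <- function(est, se) round(est + c(-1.96, 1.96) * se, 2)
ci(est = 0.05, se = 0.02)  # significant (p < 0.05), but a practically marginal benefit
ci(est = 0.50, se = 0.30)  # not significant, yet consistent with a large benefit
</code></pre>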
<p>It seems that before continuing the debate over replication, and certainly before declaring that we are in a <a href="http://www.nature.com/news/reproducibility-1.17552">reproducibility crisis</a>, we need a statistically rigorous and scientifically meaningful definition of replication. This definition does not necessarily need to be dichotomous (replicated or not) and it will probably require more than one replication experiment and more than one summary statistic: one for effect size and one for uncertainty. In the meantime, we should be careful not to dismiss the current scientific process, which seems to be working rather well at either ignoring or debunking false positive results while producing useful knowledge and discovery.</p>
<hr />
<p>Footnote on reproducibility versus replication: As Jeff pointed out, the cited Open Science Collaboration paper is about replication, not reproducibility. A study is considered reproducible if an independent researcher can recreate the tables and figures from the original raw data. Replication is not nearly as simple to define because it involves probability. To replicate the experiment it has to be performed again, with a different random sample and a new set of measurement errors.</p>
Theranos runs head first into the realities of diagnostic testing
2015-10-16T08:42:11+00:00
http://simplystats.github.io/2015/10/16/thorns-runs-head-first-into-the-realities-of-diagnostic-testing
<p>The Wall Street Journal has published a <a href="http://www.wsj.com/articles/theranos-has-struggled-with-blood-tests-1444881901">lengthy investigation</a> into the diagnostic testing company Theranos.</p>
<blockquote>
<p>The company offers more than 240 tests, ranging from cholesterol to cancer. It claims its technology can work with just a finger prick. Investors have poured more than $400 million into Theranos, valuing it at $9 billion and her majority stake at more than half that. The 31-year-old Ms. Holmes’s bold talk and black turtlenecks draw comparisons to Apple<span class="company-name-type"> Inc.</span> cofounder Steve Jobs.</p>
</blockquote>
<p>If ever there were a warning sign, the comparison to Steve Jobs has got to be it.</p>
<blockquote>
<p>But Theranos has struggled behind the scenes to turn the excitement over its technology into reality. At the end of 2014, the lab instrument developed as the linchpin of its strategy handled just a small fraction of the tests then sold to consumers, according to four former employees.</p>
<div class=" media-object wrap scope-web|mobileapps " data-layout="wrap ">
One former senior employee says Theranos was routinely using the device, named Edison after the prolific inventor, for only 15 tests in December 2014. Some employees were leery about the machine’s accuracy, according to the former employees and emails reviewed by The Wall Street Journal.
</div>
<div class=" media-object wrap scope-web|mobileapps " data-layout="wrap ">
</div>
<div class=" media-object wrap scope-web|mobileapps " data-layout="wrap ">
In a complaint to regulators, one Theranos employee accused the company of failing to report test results that raised questions about the precision of the Edison system. Such a failure could be a violation of federal rules for laboratories, the former employee said.
</div>
</blockquote>
<div class=" media-object wrap scope-web|mobileapps " data-layout="wrap ">
With these kinds of stories, it's always hard to tell whether there's reality here or it's just a bunch of axe grinding. But one thing that's for sure is that people are talking, and probably not for good reasons.
</div>
Minimal R Package Check List
2015-10-14T08:21:48+00:00
http://simplystats.github.io/2015/10/14/minimal-r-package-check-list
<p>A little while back I had the pleasure of flying in a small Cessna with a friend and for the first time I got to see what happens in the cockpit with a real pilot. One thing I noticed was that basically you don’t lift a finger without going through some sort of check list. This starts before you even roll the airplane out of the hangar. It makes sense because flying is a pretty dangerous hobby and you want to prevent problems from occurring when you’re in the air.</p>
<p>That experience got me thinking about what might be the minimal check list for building an R package, a somewhat less dangerous hobby. First off, much has changed (for the better) since I started making R packages and I wanted to have some clean documentation of the process, particularly with using RStudio’s tools. So I wiped off my installations of both R and RStudio and started from scratch to see what it would take to get someone to build their first R package.</p>
<p>The list is basically a “pre-flight” list: the presumption here is that you actually know the important details of building packages, but need to make sure that your environment is set up correctly so that you don’t run into errors or problems. I find this is often a problem when teaching students to build packages because I focus on the details of actually making the packages (i.e. DESCRIPTION files, Roxygen, etc.) and forget that, way back when, I actually had to configure my environment to do this.</p>
<p><strong>Pre-flight Procedures for R Packages</strong></p>
<ol>
<li>Install most recent version of R</li>
<li>Install most recent version of RStudio</li>
<li>Open RStudio</li>
<li>Install <strong>devtools</strong> package</li>
<li>Click on Project –> New Project… –> New Directory –> R package</li>
<li>Enter package name</li>
<li>Delete boilerplate code and “hello.R” file</li>
<li>Go to the “man” directory and delete the “hello.Rd” file</li>
<li>In File browser, click on package name to go to the top level directory</li>
<li>Click “Build” tab in environment browser</li>
<li>Click “Configure Build Tools…”</li>
<li>Check “Generate documentation with Roxygen”</li>
<li>Check “Build & Reload” when Roxygen Options window opens –> Click OK</li>
<li>Click OK in Project Options window</li>
</ol>
<p>At this point, you’re clear to build your package, which obviously involves writing R code, Roxygen documentation, writing package metadata, and building/checking your package.</p>
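<p>For those who prefer the console to the Build tab, roughly the same build/check cycle can be run with <strong>devtools</strong>; this is just a sketch of the usual commands, not an additional step in the checklist:</p>

```r
## Run from the package's top-level directory once the environment is set up.
install.packages("devtools")   # step 4 of the pre-flight list

library(devtools)
document()   # regenerate Rd files and NAMESPACE from Roxygen comments
check()      # run R CMD check on the package
build()      # produce the .tar.gz source package
```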
<p>If I’m missing a step or have too many steps, I’d like to hear about it. But I think this is the minimum number of steps you need to configure your environment for building R packages in RStudio.</p>
<p>UPDATE: I’ve made some changes to the check list and will be posting future updates/modifications to my <a href="https://github.com/rdpeng/daprocedures/blob/master/lists/Rpackage_preflight.md">GitHub repository</a>.</p>
Profile of Data Scientist Shannon Cebron
2015-10-03T09:32:20+00:00
http://simplystats.github.io/2015/10/03/profile-of-data-scientist-shannon-cebron
<p>The “This is Statistics” campaign has a nice <a href="http://thisisstatistics.org/interview-with-shannon-cebron-from-pegged-software/">profile of Shannon Cebron</a>, a data scientist working at the Baltimore-based Pegged Software.</p>
<blockquote>
<p><strong>What advice would you give to someone thinking of a career in data science?</strong></p>
<p>Take some advanced statistics courses if you want to see what it’s like to be a statistician or data scientist. By that point, you’ll be familiar with enough statistical methods to begin solving real-world problems and understanding the power of statistical science. I didn’t realize I wanted to be a data scientist until I took more advanced statistics courses, around my third year as an undergraduate math major.</p>
</blockquote>
Not So Standard Deviations: Episode 2 - We Got it Under 40 Minutes
2015-10-02T09:00:29+00:00
http://simplystats.github.io/2015/10/02/not-so-standard-deviations-episode-2-we-got-it-under-40-minutes
<p>Episode 2 of my podcast with Hilary Parker, <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a>, is out! In this episode, we talk about user testing for statistical methods, navigating the Hadleyverse, the crucial significance of rename(), and the secret reason for creating the podcast (hint: it rhymes with “bee”). Also, I erroneously claim that <a href="http://www.stat.purdue.edu/~wsc/">Bill Cleveland</a> is <em>way</em> older than he actually is. Sorry Bill.</p>
<p>In other news, <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">we are finally on iTunes</a> so you can subscribe from there directly if you want (just search for “Not So Standard Deviations” or paste the link directly into your podcatcher).</p>
<p><a href="https://api.soundcloud.com/tracks/226538106/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p>
<p>Notes:</p>
<ul>
<li><a href="http://www.sciencemag.org/content/229/4716/828.short">Bill Cleveland’s paper in Science</a>, on graphical perception, <strong>published in 1985</strong></li>
<li><a href="https://www.eventbrite.com/e/statistics-making-a-difference-a-conference-in-honor-of-tom-louis-tickets-16248614042">TomFest</a></li>
</ul>
A glass half full interpretation of the replicability of psychological science
2015-10-01T10:00:53+00:00
http://simplystats.github.io/2015/10/01/a-glass-half-full-interpretation-of-the-replicability-of-psychological-science
<p style="line-height: 18.0pt;">
<em>tl;dr: 77% of replication effects from the psychology replication study were in (or above) the 95% prediction interval based on the original effect size. This isn't perfect and suggests (a) there is still room for improvement, (b) the scientists who did the replication study are pretty awesome at replicating, (c) we need a better definition of replication that respects uncertainty but (d) the scientific sky isn't falling. We wrote this up in a <a href="http://arxiv.org/abs/1509.08968">paper on arxiv</a>; <a href="https://github.com/jtleek/replication_paper">the code is here.</a> </em>
</p>
<p style="line-height: 18.0pt;">
<span style="font-size: 12.0pt; font-family: Georgia; color: #333333;">A week or two ago a paper came out in Science on<span class="apple-converted-space"> </span><a href="http://www.sciencemag.org/content/349/6251/aac4716">Estimating the reproducibility of psychological science</a>. The basic behind the study was to take a sample of studies that appeared in a particular journal in 2008 and try to replicate each of these studies. Here I'm using the definition that reproducibility is the ability to recalculate all results given the raw data and code from a study and replicability is the ability to re-do the study and get a consistent result. </span>
</p>
<p style="line-height: 18.0pt;">
<span style="font-size: 12.0pt; font-family: Georgia; color: #333333;">The paper is pretty incredible and the authors did an amazing job of going back to the original sources and trying to be faithful to the original study designs. I have to admit when I first heard about the study design I was incredibly pessimistic about the results (I suppose grouchy is a natural default state for many statisticians –especially those with sleep deprivation). I mean 2008 was well before the push toward reproducibility had really taken off (Biostatistics was one of the first journals to adopt a policy on reproducible research and that didn't happen <a href="http://biostatistics.oxfordjournals.org/content/10/3/405.full">until 2009</a>). More importantly, the student researchers from those studies had possibly moved on, study populations may change, there could be any number of minor variations in the study design and so forth. I thought the chances of getting any effects in the same range was probably pretty low. </span>
</p>
<p style="line-height: 18.0pt;">
So when the results were published I was pleasantly surprised. I wasn’t the only one:
</p>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
Someone has to say it, but this plot shows that science is, in fact, working. <a href="http://t.co/JUy10xHfbH">http://t.co/JUy10xHfbH</a> <a href="http://t.co/lJSx6IxPw2">pic.twitter.com/lJSx6IxPw2</a>
</p>
<p>
— Roger D. Peng (@rdpeng) <a href="https://twitter.com/rdpeng/status/637009904289452032">August 27, 2015</a>
</p>
</blockquote>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
Looks like psychologists are in a not-too-bad spot on the ROC curves of science (<a href="http://t.co/fPsesCn2yK">http://t.co/fPsesCn2yK</a>) <a href="http://t.co/9rAOdZWvzv">http://t.co/9rAOdZWvzv</a>
</p>
<p>
— Joe Pickrell (@joe_pickrell) <a href="https://twitter.com/joe_pickrell/status/637304244538896384">August 28, 2015</a>
</p>
</blockquote>
<p>But that was definitely not the prevailing impression that the paper left on social and mass media. A lot of the discussion around the paper focused on the <a href="https://github.com/jtleek/replication_paper/blob/gh-pages/in_the_media.md">idea that only 36% of the studies</a> had a p-value less than 0.05 in both the original and replication study. But many of the sample sizes were small and the effects were modest. So the first question I asked myself was, “Well what would we expect to happen if we replicated these studies?” The original paper measured replicability in several ways and tried hard to calibrate expected coverage of confidence intervals for the measured effects.</p>
<p>With <a href="http://www.biostat.jhsph.edu/~rpeng/">Roger</a> and <a href="http://www.biostat.jhsph.edu/~prpatil/">Prasad</a>, we tried a slightly different approach: we estimated the 95% prediction interval for the replication effect given the original effect size.</p>
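<p>As a rough sketch of how such an interval can be computed for a correlation, here is an illustration using the standard Fisher z-transformation; the numbers are hypothetical and the exact calculation we used is in the <a href="http://arxiv.org/abs/1509.08968">paper</a> and <a href="https://github.com/jtleek/replication_paper">code</a>:</p>

```r
## Sketch of a 95% prediction interval for a replication correlation.
## Assumes the original study reported correlation r0 with sample size n0,
## and the replication has sample size n1. Illustrative only.
prediction_interval <- function(r0, n0, n1, level = 0.95) {
  z0 <- atanh(r0)                            # Fisher transform of the original effect
  se <- sqrt(1 / (n0 - 3) + 1 / (n1 - 3))    # combines uncertainty in both studies
  q  <- qnorm(1 - (1 - level) / 2)
  tanh(z0 + c(-1, 1) * q * se)               # back-transform to the correlation scale
}

prediction_interval(r0 = 0.30, n0 = 40, n1 = 80)   # hypothetical numbers
```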
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter.png"><img class="aligncenter wp-image-4337" src="http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-300x300.png" alt="pi_figure_nofilter" width="397" height="397" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter.png 1050w" sizes="(max-width: 397px) 100vw, 397px" /></a></p>
<p> </p>
<p>72% of the replication effects were within the 95% prediction interval and 2 were above the interval (they showed a stronger signal in the replication than predicted from the original study). This definitely shows that there is still room for improvement in the replication of these studies - we would expect 95% of the effects to fall into the 95% prediction interval. But my opinion is that 72% (or 77% if you count the 2 above the P.I.) of studies falling in the prediction interval is (a) not bad and (b) a testament to the authors of the reproducibility paper and their efforts to get the studies right.</p>
<p>An important point here is that replication and reproducibility aren’t the same thing. When reproducing a study we expect the numbers and figures to be <em>exactly the same</em>. But a replication involves recollection of data and is subject to variation, so <em>we don’t expect the answer to be exactly the same in the replication</em>. This is of course made more confusing by regression to the mean, publication bias, and <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">the garden of forking paths</a>. Our use of a prediction interval measures both the variation expected in the original study and in the replication. One thing we noticed when re-analyzing the data is how many of the studies had very low sample sizes. <a href="http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter.png"><img class="aligncenter wp-image-4339" src="http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-300x300.png" alt="samplesize_figure_nofilter" width="450" height="450" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter.png 1050w" sizes="(max-width: 450px) 100vw, 450px" /></a></p>
<p> </p>
<p>Sample sizes were generally bigger in the replications, but often still very low. This makes it more difficult to disentangle what didn’t replicate from what is just expected variation for a small-sample study. Whether those small studies should be trusted at all is a separate question, but for the purposes of measuring replication they make the problem harder.</p>
<p>One thing I have been thinking about a lot, and that this study drove home, is that if we are measuring replication we need a definition that incorporates uncertainty directly. Suppose that you collect a data set <strong>D0</strong> from an original study and <strong>D1</strong> from a replication. Informally, the study replicates if the data are generated from the same distribution in both experiments: <strong>D0 ~ F</strong> and <strong>D1 ~ F</strong>. To get an estimate you apply a pipeline to the data set: <strong>e0 = p(D0)</strong>. If the study is also reproducible then <strong>p()</strong> is the same for both studies, and <strong>p(D0) ~ G</strong> and <strong>p(D1) ~ G</strong>, subject to some conditions on <strong>p()</strong>.</p>
<p>One interesting consequence of this definition is that each complete replication data set represents <em>only a single data point</em> for measuring replication. To measure replication with this definition you either need to make assumptions about the data generating distribution for <strong>D0</strong> and <strong>D1</strong>, or you need to perform a complete replication of a study many times to determine if it replicates. However, it does mean that we can define replication even for studies with a very small number of replicates, since the data generating distribution may be arbitrarily variable in each case.</p>
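<p>As a toy illustration of this definition (my own sketch, not an analysis from the paper), take <strong>F</strong> to be a normal distribution and let the pipeline <strong>p()</strong> be the sample mean; each simulated original/replication pair yields a single pair of estimates, which is why one replication is only one data point:</p>

```r
## Toy illustration of the D0 ~ F, D1 ~ F definition of replication.
## F is a normal distribution and the pipeline p() is just the sample mean.
set.seed(42)
p <- function(d) mean(d)          # the analysis pipeline applied to a data set

one_replication <- function(n = 20) {
  d0 <- rnorm(n, mean = 0.5)      # original study data, D0 ~ F
  d1 <- rnorm(n, mean = 0.5)      # replication data,     D1 ~ F
  c(e0 = p(d0), e1 = p(d1))       # one pair (e0, e1): a single "data point"
}

## Judging replication requires many such pairs (or distributional assumptions)
estimates <- t(replicate(1000, one_replication()))
head(estimates)
cor(estimates[, "e0"], estimates[, "e1"])   # near 0: the pairs are independent draws
```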
<p>Regardless of this definition, I was excited that the <a href="https://osf.io/">OSF</a> folks did the study and pulled it off as well as they did, and I was a bit bummed about the most common reaction. There is an easy narrative that “science is broken,” which I think isn’t a positive thing for a number of reasons. I love the way that {reproducibility/replicability/open science/open publication} are becoming more and more common, but I think we often fall into the same trap of reporting these results as clear cut, just as we do when exaggerating or oversimplifying scientific discoveries in headlines. I’m excited to see how these kinds of studies look in 10 years when Github/open science/pre-prints/etc. are all the standards.</p>
Apple Music's Moment of Truth
2015-09-30T07:38:08+00:00
http://simplystats.github.io/2015/09/30/apple-musics-moment-of-truth
<p>Today is the day when Apple, Inc. learns whether its brand new streaming music service, Apple Music, is going to be a major contributor to the bottom line or just another streaming service (JASS?). Apple Music launched 3 months ago and all new users are offered a 3-month free trial. Today, that free trial ends and the big question is how many people will start to <strong>pay</strong> for their subscription, as opposed to simply canceling it. My guess is that most people (&gt; 50%) will opt to pay, but that’s a complete guess. For what it’s worth, I’ll be paying for my subscription. After adding all this music to my library, I’d hate to see it all go away.</p>
<p>Back on August 18, 2015, consumer market research firm MusicWatch <a href="http://www.businesswire.com/news/home/20150818005755/en#.VddbR7Scy6F">released a study</a> that claimed, among other things, that</p>
<blockquote>
<p>Among people who had tried Apple Music, 48 percent reported they are not currently using the service.</p>
</blockquote>
<p>This would suggest that almost half of people who had signed up for the free trial period of Apple Music were not interested in using it further and would likely not pay for it once the trial ended. If it were true, it would be a blow to the newly launched service.</p>
<p>But how did MusicWatch arrive at its number? It claimed to have surveyed 5,000 people in its study. Shortly before the survey by MusicWatch was released, Apple claimed that about 11 million people had signed up for their new Apple Music service (because the service had just launched, everyone who had signed up was in the free trial period). Clearly, 5,000 people do not make up the entire population, so we have but a small sample of users.</p>
<p>What is the target that MusicWatch was trying to answer? It seems that they wanted to know the percentage of <strong>all people who had signed up for Apple Music</strong> that were still using the service. Can they make inference about the entire population from the sample of 5,000?</p>
<p>If the sample is representative and the individuals are independent, we could use the number 48% as an estimate of the percentage in the population who no longer use the service. The press release from MusicWatch did not indicate any measure of uncertainty, so we don’t know how reliable the number is.</p>
<p>Interestingly, soon after the MusicWatch survey was released, Apple released a statement to the publication <em>The Verge</em>, stating that 79% of users who had signed up were still using the service (i.e. only 21% had stopped using it, as opposed to 48% reported by MusicWatch). In other words, Apple just came out and <em>gave us the truth</em>! This was unusual because Apple typically does not make public statements about newly launched products. I just found this amusing because I’ve never been in a situation where I was trying to estimate a parameter and then someone later just told me what its value was.</p>
<p>If we believe that Apple and MusicWatch were measuring the same thing in their analyses (and it’s not clear that they were), then it would suggest that MusicWatch’s estimate of the population percentage (48%) was quite far off from the true value (21%). What would explain this large difference?</p>
<ol>
<li><strong>Random variation</strong>. It’s true that MusicWatch’s survey was a small sample relative to the full population, but the sample was still big with 5,000 people. Furthermore, the analysis was fairly simple (just taking the proportion of users still using the service), so the uncertainty associated with that estimate is unlikely to be that large (see the quick calculation after this list).</li>
<li><strong>Selection bias</strong>. Recall that it’s not clear how MusicWatch sampled its respondents, but it’s possible that the way that they did it led them to capture a set of respondents who were less inclined to use Apple Music. Beyond this, we can’t really say more without knowing the details of the survey process.</li>
<li><strong>Respondents are not independent</strong>. It’s possible that the survey respondents are not independent of each other. This would primarily affect the uncertainty about the estimate, making it larger than we might expect if the respondents were all independent. However, since we do not know what MusicWatch’s uncertainty about their estimate was in the first place, it’s difficult to tell if dependence between respondents could play a role. Apple’s number, of course, has no uncertainty.</li>
<li><strong>Measurement differences</strong>. This is the big one, in my opinion. What we don’t know is how either MusicWatch or Apple defined “still using the service”. You could imagine a variety of ways to determine whether a person was still using the service. You could ask “Have you used it in the last week?” or perhaps “Did you use it yesterday?” Responses to these questions would be quite different and would likely lead to different overall percentages of usage.</li>
</ol>
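<p>On point 1, a quick back-of-the-envelope calculation (assuming, generously, a simple random sample of independent respondents) shows that sampling variability alone is far too small to explain a 48% versus 21% gap:</p>

```r
## Approximate 95% confidence interval for the MusicWatch proportion,
## assuming simple random sampling and independent respondents.
p_hat <- 0.48    # reported share no longer using Apple Music
n     <- 5000    # reported sample size
se    <- sqrt(p_hat * (1 - p_hat) / n)
round(p_hat + c(-1, 1) * 1.96 * se, 3)   # roughly 0.466 to 0.494
```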
We Used Data to Improve our HarvardX Courses: New Versions Start Oct 15
2015-09-29T09:53:31+00:00
http://simplystats.github.io/2015/09/29/we-used-data-to-improve-our-harvardx-courses-new-versions-start-oct-15
<p>You can sign up following links <a href="http://genomicsclass.github.io/book/pages/classes.html">here</a>.</p>
<p>Last semester we successfully <a href="http://simplystatistics.org/2014/11/25/harvardx-biomedical-data-science-open-online-training-curriculum-launches-on-january-19/">ran the second version</a> of my <a href="http://simplystatistics.org/2014/03/31/data-analysis-for-genomic-edx-course/">Data Analysis course</a>. To create the second version, the first was split into eight courses. Over 2,000 students successfully completed the first of these, but, as expected, the numbers were lower for the more advanced courses. We wanted to remove any structural problems keeping students from maximizing what they get from our courses, so we studied the assessment question data, which included completion rates and times, and used the findings to make improvements. We also used qualitative data from the discussion board. The major changes to version 3 are the following:</p>
<ul>
<li>We no longer use R packages that Microsoft Windows users had trouble installing in the first course.</li>
<li>All courses are now designed to be completed in 4 weeks.</li>
<li>We added new assessment questions.</li>
<li>We improved the assessment questions determined to be problematic.</li>
<li>We split the two courses that students took the longest to complete into smaller modules. Students now have twice as much time to complete these.</li>
<li>We consolidated the case studies into one course.</li>
<li>We combined the materials from the statistics courses into a <a href="http://simplystatistics.org/2015/09/23/data-analysis-for-the-life-sciences-a-book-completely-written-in-r-markdown/">book</a>, which you can download <a href="https://leanpub.com/dataanalysisforthelifesciences">here</a>. The material in the book matches what is taught in class, so you can use it to follow along.</li>
</ul>
<p>You can enroll into any of the seven courses following the links below. We will be on the discussion boards starting October 15, and we hope to see you there.</p>
<ol>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-1-statistics-harvardx-ph525-1x">Statistics and R for the Life Sciences</a> starts October 15.</li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-2-harvardx-ph525-2x">Introduction to Linear Models and Matrix Algebra</a> starts November 15.</li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-3-harvardx-ph525-3x">Statistical Inference and Modeling for High-throughput Experiments</a> starts December 15.</li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-4-harvardx-ph525-4x">High-Dimensional Data Analysis</a> starts January 15.</li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-5-harvardx-ph525-5x">Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays</a> starts February 15.</li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-6-high-harvardx-ph525-6x">High-performance Computing for Reproducible Genomics</a> starts March 15.</li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-7-case-harvardx-ph525-7x">Case Studies in Functional Genomics</a> start April 15.</li>
</ol>
<p>The landing page for the series continues to be <a href="http://genomicsclass.github.io/book/pages/classes.html">here</a>.</p>
Data Analysis for the Life Sciences - a book completely written in R markdown
2015-09-23T09:37:27+00:00
http://simplystats.github.io/2015/09/23/data-analysis-for-the-life-sciences-a-book-completely-written-in-r-markdown
<p class="p1">
The book <em>Data Analysis for the Life Sciences</em> is now available on <a href="https://leanpub.com/dataanalysisforthelifesciences">Leanpub</a>.
</p>
<p class="p1">
<span class="s1"><img class="wp-image-4313 alignright" src="http://simplystatistics.org/wp-content/uploads/2015/09/title_page-232x300.jpg" alt="title_page" width="222" height="287" srcset="http://simplystatistics.org/wp-content/uploads/2015/09/title_page-232x300.jpg 232w, http://simplystatistics.org/wp-content/uploads/2015/09/title_page-791x1024.jpg 791w" sizes="(max-width: 222px) 100vw, 222px" />Data analysis is now part of practically every research project in the life sciences. In this book we use data and computer code to teach the necessary statistical concepts and programming skills to become a data analyst. Following in the footsteps of <a href="https://www.stat.berkeley.edu/~statlabs/">Stat Labs</a>, instead of showing theory first and then applying it to toy examples, we start with actual applications and describe the theory as it becomes necessary to solve specific challenges.<span class="Apple-converted-space"> We use simulations and data analysis examples to teach statistical concepts. </span></span><span class="s1">The book includes links to computer code that readers can use to program along as they read the book.</span>
</p>
<p class="p1">
It includes the following chapters: Inference, Exploratory Data Analysis, Robust Statistics, Matrix Algebra, Linear Models, Inference for High-Dimensional Data, Statistical Modeling, Distance and Dimension Reduction, Practical Machine Learning, and Batch Effects.
</p>
<p class="p1">
The text was completely written in R markdown and every section contains a link to the document that was used to create that section. This means that you can use <a href="http://yihui.name/knitr/">knitr</a> to reproduce any section of the book on your own computer. You can also access all these markdown documents directly from <a href="https://github.com/genomicsclass/labs">GitHub</a>. Please send a pull request if you fix a typo or other mistake! For now we are keeping the R markdowns for the exercises private since they contain the solutions. But you can see the solutions if you take our <a href="http://genomicsclass.github.io/book/pages/classes.html">online course</a> quizzes. If we find that most readers want access to the solutions, we will open them up as well.
</p>
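<p class="p1">
For example, reproducing a section amounts to cloning the GitHub repository and knitting the corresponding R markdown file. Here is a minimal sketch (the particular file path below is just an illustration; pick any .Rmd from the repository):
</p>

```r
## In a shell first: git clone https://github.com/genomicsclass/labs.git
install.packages("knitr")
library(knitr)
knit("labs/inference/random_variables.Rmd")   # writes the corresponding .md file
```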
<p class="p1">
The material is based on the online courses I have been teaching with <a href="http://mikelove.github.io/">Mike Love</a>. As we created the course, Mike and I wrote R markdown documents for the students and put them on GitHub. We then used<a href="http://www.stephaniehicks.com/githubPages_tutorial/pages/githubpages-jekyll.html"> jekyll</a> to create a <a href="http://genomicsclass.github.io/book/">webpage</a> with html versions of the markdown documents. Jeff then convinced us to publish it on <del>Leanbup</del><a href="https://leanpub.com/dataanalysisforthelifesciences">Leanpub</a>. So we wrote a shell script that compiled the entire book into a Leanpub directory, and after countless hours of editing and tinkering we have a 450+ page book with over 200 exercises. The entire book compiles from scratch in about 20 minutes. We hope you like it.
</p>
The Leek group guide to writing your first paper
2015-09-18T10:57:26+00:00
http://simplystats.github.io/2015/09/18/the-leek-group-guide-to-writing-your-first-paper
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
The <a href="https://twitter.com/jtleek">@jtleek</a> guide to writing your first academic paper <a href="https://t.co/APLrEXAS46">https://t.co/APLrEXAS46</a>
</p>
<p>
— Stephen Turner (@genetics_blog) <a href="https://twitter.com/genetics_blog/status/644540432534368256">September 17, 2015</a>
</p>
</blockquote>
<p>I have written guides on <a href="https://github.com/jtleek/reviews">reviewing papers</a>, <a href="https://github.com/jtleek/datasharing">sharing data</a>, and <a href="https://github.com/jtleek/rpackages">writing R packages</a>. One thing I haven’t touched on until now has been writing papers. Certainly for me, and I think for a lot of students, the hardest transition in graduate school is between taking classes and doing research.</p>
<p>There are several hard parts to this transition including trying to find a problem, trying to find an advisor, and having a ton of unstructured time. One of the hardest things I’ve found is knowing (a) when to start writing your first paper and (b) how to do it. So I wrote a guide for students in my group:</p>
<p><a href="https://github.com/jtleek/firstpaper">https://github.com/jtleek/firstpaper</a></p>
<p>On how to write your first paper. It might be useful for other folks as well so I put it up on Github. Just like with the other guides I’ve written this is a very opinionated (read: doesn’t apply to everyone) guide. I also would appreciate any feedback/pull requests people have.</p>
Not So Standard Deviations: The Podcast
2015-09-17T10:57:45+00:00
http://simplystats.github.io/2015/09/17/not-so-standard-deviations-the-podcast
<p>I’m happy to announce that I’ve started a brand new podcast called <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> with Hilary Parker at Etsy. Episode 1 “RCatLadies Origin Story” is available through SoundCloud. In this episode we talk about the origins of RCatLadies, evidence-based data analysis, my new book, and the Python vs. R debate.</p>
<p>You can subscribe to the podcast using the <a href="http://feeds.soundcloud.com/users/soundcloud:users:174789515/sounds.rss">RSS feed</a> from SoundCloud. We’ll be getting it up on iTunes hopefully very soon.</p>
<p><a href="https://api.soundcloud.com/tracks/224180667/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio file</a>.</p>
<p>Show Notes:</p>
<ul>
<li><a href="https://twitter.com/rcatladies">RCatLadies Twitter account</a></li>
<li>Hilary’s <a href="http://hilaryparker.com/2013/01/30/hilary-the-most-poisoned-baby-name-in-us-history/">analysis of the name Hilary</a></li>
<li><a href="https://leanpub.com/artofdatascience">The Art of Data Science</a></li>
<li>What is <a href="http://www.amstat.org/meetings/jsm.cfm">JSM</a>?</li>
<li><a href="https://en.wikipedia.org/wiki/A_rising_tide_lifts_all_boats">A rising tide lifts all boats</a></li>
</ul>
Interview with COPSS award Winner John Storey
2015-08-25T09:25:28+00:00
http://simplystats.github.io/2015/08/25/interview-with-copss-award-winner-john-storey
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey.jpg"><img class="aligncenter wp-image-4289 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-198x300.jpg" alt="jdstorey" width="198" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-198x300.jpg 198w, http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-132x200.jpg 132w" sizes="(max-width: 198px) 100vw, 198px" /></a></p>
<p> </p>
<p><em>Editor’s Note: We are again pleased to interview the COPSS President’s award winner. The <a href="https://en.wikipedia.org/wiki/COPSS_Presidents%27_Award">COPSS Award</a> is one of the most prestigious in statistics, sometimes called the Nobel Prize in statistics. This year the award went to <a href="http://www.genomine.org/">John Storey</a> who also won the <a href="http://sml.princeton.edu/news/john-storey-receives-2015-mortimer-spiegelman-award">Mortimer Spiegelman award</a> for his outstanding contribution to public health statistics. This interview is a <a href="https://twitter.com/simplystats/status/631607146572988417">particular pleasure</a> since John was my Ph.D. advisor and has been a major role model and incredibly supportive mentor for me throughout my career. He also <a href="https://github.com/jdstorey/simplystatistics">did the whole interview in markdown and put it under version control at Github</a> so it is fully reproducible. </em></p>
<p><strong>SimplyStats: Do you consider yourself to be a statistician, data scientist, machine learner, or something else?</strong></p>
<p>JS: For the most part I consider myself to be a statistician, but I’m also very serious about genetics/genomics, data analysis, and computation. I was trained in statistics and genetics, primarily statistics. I was also exposed to a lot of machine learning during my training since Rob Tibshirani was my <a href="http://genealogy.math.ndsu.nodak.edu/id.php?id=69303">PhD advisor</a>. However, I consider my research group to be a data science group. We have the <a href="http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram">Venn diagram</a> reasonably well covered: experimentalists, programmers, data wranglers, and developers of theory and methods; biologists, computer scientists, and statisticians.</p>
<p><strong>SimplyStats: How did you find out you had won the COPSS Presidents’ Award?</strong></p>
<p>JS: I received a phone call from the chairperson of the awards committee while I was visiting the Department of Statistical Science at Duke University to <a href="https://stat.duke.edu/events/15731.html">give a seminar</a>. It was during the seminar reception, and I stepped out into the hallway to take the call. It was really exciting to get the news!</p>
<p><strong>SimplyStats: One of the areas where you have had a big impact is inference in massively parallel problems. How do you feel high-dimensional inference is different from more traditional statistical inference?</strong></p>
<p>JS: My experience is that the most productive way to approach high-dimensional inference problems is to first think about a given problem in the scenario where the parameters of interest are random, and the joint distribution of these parameters is incorporated into the framework. In other words, I first gain an understanding of the problem in a Bayesian framework. Once this is well understood, it is sometimes possible to move in a more empirical and nonparametric direction. However, I have found that I can be most successful if my first results are in this Bayesian framework.</p>
<p>As an example, Theorem 1 from <a href="http://genomics.princeton.edu/storeylab/papers/Storey_Annals_2003.pdf">Storey (2003) Annals of Statistics</a> was the first result I obtained in my work on false discovery rates. This paper <a href="https://statistics.stanford.edu/research/false-discovery-rate-bayesian-interpretation-and-q-value">first appeared as a technical report in early 2001</a>, and the results spawned further work on a <a href="http://genomics.princeton.edu/storeylab/papers/directfdr.pdf">point estimation approach</a> to false discovery rates, the <a href="http://genomics.princeton.edu/storeylab/papers/ETST_JASA_2001.pdf">local false discovery rate</a>, <a href="http://www.bioconductor.org/packages/release/bioc/html/qvalue.html">q-value</a> and its <a href="http://www.pnas.org/content/100/16/9440.full">application to genomics</a>, and a <a href="http://genomics.princeton.edu/storeylab/papers/623.pdf">unified theoretical framework</a>.</p>
<p>Besides false discovery rates, this approach has been useful in my work on the <a href="http://genomics.princeton.edu/storeylab/papers/Storey_JRSSB_2007.pdf">optimal discovery procedure</a> as well as <a href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161">surrogate variable analysis</a> (in particular, <a href="http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2011.645777#.VdxderxVhBc">Desai and Storey 2012</a> for surrogate variable analysis). For high-dimensional inference problems, I have also found it is important to consider whether there are any plausible underlying causal relationships among variables, even if causal inference is not the goal. For example, causal model considerations provided some key guidance in a <a href="http://www.nature.com/ng/journal/v47/n5/full/ng.3244.html">recent paper of ours</a> on testing for genetic associations in the presence of arbitrary population structure. I think there is a lot of insight to be gained by considering what is the appropriate approach for a high-dimensional inference problem under different causal relationships among the variables.</p>
<p><strong>SimplyStats: Do you have a process when you are tackling a hard problem or working with students on a hard problem?</strong></p>
<p>JS: I like to work on statistics research that is aimed at answering a specific scientific problem (usually in genomics). My process is to try to understand the why in the problem as much as the how. The path to success is often found in the former. I try first to find solutions to research problems by using simple tools and ideas. I like to get my hands dirty with real data as early as possible in the process. I like to incorporate some theory into this process, but I prefer methods that work really well in practice over those that have beautiful theory justifying them without demonstrated success on real-world applications. In terms of what I do day-to-day, listening to music is integral to my process, for both concentration and creative inspiration: typically <a href="https://en.wikipedia.org/wiki/King_Crimson">King Crimson</a> or some <a href="http://www.metal-archives.com/">variant of metal</a> or <a href="https://en.wikipedia.org/wiki/Brian_Eno">ambient</a> – which Simply Statistics co-founder <a href="http://jtleek.com/">Jeff Leek</a> got to <del>endure</del> enjoy for years during his PhD in my lab.</p>
<p><strong>SimplyStats: You are the founding Director of the Center for Statistics and Machine Learning at Princeton. What parts of the new gig are you most excited about?</strong></p>
<p>JS: Princeton closed its Department of Statistics in the early 1980s. Because of this, the style of statistician and machine learner we have here today is one who’s comfortable being appointed in a field outside of statistics or machine learning. Examples include myself in genomics, Kosuke Imai in political science, Jianqing Fan in finance and economics, and Barbara Engelhardt in computer science. Nevertheless, statistics and machine learning here is strong, albeit too small at the moment (which will be changing soon). This is an interesting place to start, very different from most universities.</p>
<p>What I’m most excited about is that we get to answer the question: “What’s the best way to build a faculty, educate undergraduates, and create a PhD program starting now, focusing on the most important problems of today?”</p>
<p>For those who are interested, we’ll be releasing a <a href="http://www.princeton.edu/strategicplan/taskforces/sml/">public version of our strategic plan</a> within about six months. We’re trying to do something unique and forward-thinking, which will hopefully make Princeton an influential member of the statistics, machine learning, and data science communities.</p>
<p><strong>SimplyStats: You are organizing the Tukey conference at Princeton (to be held September 18, <a href="http://csml.princeton.edu/tukey">details here</a>).</strong> <strong>Do you think Tukey’s influence will affect your vision for re-building statistics at Princeton?</strong></p>
<p>JS: Absolutely, Tukey has been and will be a major influence in how we re-build. He made so many important contributions, and his approach was extremely forward thinking and tied into real-world problems. I strongly encourage everyone to read Tukey’s 1962 paper titled <a href="https://projecteuclid.org/euclid.aoms/1177704711">The Future of Data Analysis</a>. Here he’s 50 years into the future, foreseeing the rise of data science. This paper has truly amazing insights, including:</p>
<blockquote>
<p>For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.</p>
<p>All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.</p>
<p>Data analysis is a larger and more varied field than inference, or incisive procedures, or allocation.</p>
<p>By and large, the great innovations in statistics have not had correspondingly great effects upon data analysis. . . . Is it not time to seek out novelty in data analysis?</p>
</blockquote>
<p>In this regard, another paper that has been influential in how we are re-building is Leo Breiman’s titled <a href="http://projecteuclid.org/euclid.ss/1009213726">Statistical Modeling: The Two Cultures</a>. We’re building something at Princeton that includes both cultures and seamlessly blends them into a bigger picture community concerned with data-driven scientific discovery and technology development.</p>
<p><strong>SimplyStats:</strong> <strong>What advice would you give young statisticians getting into the discipline now?</strong></p>
<p>JS: My most general advice is don’t isolate yourself within statistics. Interact with and learn from other fields. Work on problems that are important to practitioners of science and technology development. I recommend that students should master both “traditional statistics” and at least one of the following: (1) computational and algorithmic approaches to data analysis, especially those more frequently studied in machine learning or data science; (2) a substantive scientific area where data-driven discovery is extremely important (e.g., social sciences, economics, environmental sciences, genomics, neuroscience, etc.). I also recommend that students should consider publishing in scientific journals or computer science conference proceedings, in addition to traditional statistics journals. I agree with a lot of the constructive advice and commentary given on the Simply Statistics blog, such as encouraging students to learn about reproducible research, problem-driven research, software development, improving data analyses in science, and outreach to non-statisticians. These things are very important for the future of statistics.</p>
The Next National Library of Medicine Director Can Help Define the Future of Data Science
2015-08-24T10:00:26+00:00
http://simplystats.github.io/2015/08/24/the-next-national-library-of-medicine-director-can-help-define-the-future-of-data-science
<p>The main motivation for starting this blog was to share our enthusiasm about the increased importance of data and data analysis in science, industry, and society in general. Based on recent initiatives, such as <a href="https://datascience.nih.gov/bd2k">BD2k</a>, it is clear that the NIH is also enthusiastic and very much interested in supporting data science. For those that don’t know, the National Institutes of Health (NIH) is the largest public funder of biomedical research in the world. This federal agency has an annual budget of about $30 billion.</p>
<p>The NIH has <a href="http://www.nih.gov/icd/icdirectors.htm">several institutes</a>, each with its own budget and capability to guide funding decisions. Currently, the missions of most of these institutes relate to a specific disease or public health challenge. Many of them fund research in statistics and computing because these topics are important components of achieving their specific mission. Currently, however, there is no institute directly tasked with supporting data science per se. This is about to change.</p>
<p>The National Library of Medicine (NLM) is one of the few NIH institutes that is not focused on a particular disease or public health challenge. Apart from the important task of maintaining an actual library, it supports, among many other initiatives, indispensable databases such as PubMed, GenBank and GEO. After over 30 years of successful service as NLM director, Dr. Donald Lindberg stepped down this year and, as is customary, an advisory board was formed to advise the NIH on what’s next for NLM. One of the main recommendations of <a href="http://acd.od.nih.gov/reports/Report-NLM-06112015-ACD.pdf">the report</a> is the following:</p>
<blockquote>
<p>NLM should be the intellectual and programmatic epicenter for data science at NIH and stimulate its advancement throughout biomedical research and application.</p>
</blockquote>
<p>Data science features prominently throughout the report, making it clear that the NIH is very much interested in further supporting this field. The next director can therefore have an enormous influence on the future of data science. So, if you love data, have administrative experience, and have a vision about the future of data science as it relates to the medical and related sciences, consider this exciting opportunity.</p>
<p>Here is the <a href="http://www.jobs.nih.gov/vacancies/executive/nlm_director.htm">ad</a>.</p>
Interview with Sherri Rose and Laura Hatfield
2015-08-21T13:20:14+00:00
http://simplystats.github.io/2015/08/21/interview-with-sherri-rose-and-laura-hatfied
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose.png"><img class="aligncenter wp-image-4273 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose-300x200.png" alt="Sherri Rose and Laura Hatfield" width="300" height="200" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose-300x200.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose-260x173.png 260w, http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose.png 975w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p style="text-align: center;">
Rose/Hatfield © Savannah Bergquist
</p>
<p><em><a href="http://www.hcp.med.harvard.edu/faculty/core/laura-hatfield-phd">Laura Hatfield</a> and <a href="http://www.drsherrirose.com/">Sherri Rose</a> are Assistant Professors specializing in biostatistics at Harvard Medical School in the <a href="http://www.hcp.med.harvard.edu">Department of Health Care Policy</a>. Laura received her PhD in Biostatistics from the University of Minnesota and Sherri completed her PhD in Biostatistics at UC Berkeley. They are developing novel statistical methods for health policy problems.</em></p>
<p><strong><em>SimplyStats: Do you consider yourselves statisticians, data scientists, machine learners, or something else?</em></strong></p>
<p><strong>Rose</strong>: I’d definitely say a statistician. Even when I’m working on things that fall into the categories of data science or machine learning, there’s underlying statistical theory guiding that process, be it for methods development or applications. Basically, there’s a statistical foundation to everything I do.</p>
<p><strong>Hatfield</strong>: When people ask what I do, I start by saying that I do research in health policy. Then I say I’m a statistician by training and I work with economists and physicians. People have mistaken ideas about what a statistician or professor does, so describing my context and work seems more informative. If I’m at a party, I usually wrap it up in a bow as, “I crunch numbers to study how Obamacare is working.” [laughs]</p>
<p> </p>
<p><strong><em>SimplyStats: What is the</em></strong> <a href="http://www.healthpolicydatascience.org/"><strong><em>Health Policy Data Science Lab</em></strong></a><strong><em>? How did you decide to start that?</em></strong></p>
<p><strong>Hatfield</strong>: We wanted to give our trainees a venue to promote their work and get feedback from their peers. And it helps me keep up on the cool projects Sherri and her students are working on.</p>
<p><strong>Rose</strong>: This grew out of us starting to jointly mentor trainees. It’s been a great way for us to make intellectual contributions to each other’s work through Lab meetings. Laura and I approach statistics from <em>completely</em> different frameworks, but work on related applications, so that’s a unique structure for a lab.</p>
<p> </p>
<p><strong><em>SimplyStats: What kinds of problems are your groups working on these days? Are they mostly focused on health policy?</em></strong></p>
<p><strong>Rose</strong>: One of the fun things about working in health policy is that it is quite expansive. Statisticians can have an even bigger impact on science and public health if we take that next step: thinking about the policy implications of our research, and then about who needs to see the work in order to influence relevant policies. A couple projects I’m working on that demonstrate this breadth include a machine learning framework for risk adjustment in insurance plan payment and a new estimator for causal effects in a complex epidemiologic study of chronic disease. The first might be considered more obviously health policy, but the second will have important policy implications as well.</p>
<p><strong>Hatfield</strong>: When I start an applied collaboration, I’m also thinking, “Where is the methods paper?” Most of my projects use messy observational data, so there is almost always a methods paper. For example, many studies here need to find a control group from an administrative data source. I’ve been keeping track of challenges in this process. One of our Lab students is working with me on a pathological case of a seemingly benign control group selection method gone bad. I love the creativity required in this work; my first 10 analysis ideas may turn out to be infeasible given the data, but that’s what makes this fun!</p>
<p> </p>
<p><strong><em>SimplyStats: What are some particular challenges of working with large health data?</em></strong></p>
<p><strong>Hatfield</strong>: When I first heard about the huge sample sizes, I was excited! Then I learned that data not collected for research purposes…</p>
<p><strong>Rose</strong>: This was going to be my answer!</p>
<p><strong>Hatfield</strong>: …are <em>very</em> hard to use for research! In a recent project, I’ve been studying how giving people a tool to look up prices for medical services changes their health care spending. But the data set we have leaves out [painful pause] a lot of variables we’d like to use for control group selection and… a lot of the prices. But as I said, these gaps in the data are begging to be filled by new methods.</p>
<p><strong>Rose</strong>: I think the fact that we have similar answers is important. I’ve repeatedly seen “big data” not have a strong signal for the research question, since they weren’t collected for that purpose. It’s easy to get excited about thousands of covariates in an electronic health record, but so much of it is noise, and then you end up with an R<sup>2</sup> of 10%. It can be difficult enough to generate an effective prediction function, even with innovative tools, let alone try to address causal inference questions. It goes back to basics: what’s the research question and how can we translate that into a statistical problem we can answer given the limitations of the data.</p>
<p><strong><em>SimplyStats: You both have very strong data science skills but are in academic positions. Do you have any advice for students considering the tradeoff between academia and industry?</em></strong></p>
<p><strong>Hatfield</strong>: I think there is more variance within academia and within industry than between the two.</p>
<p><strong>Rose</strong>: Really? That’s surprising to me…</p>
<p><strong>Hatfield</strong>: I had stereotypes about academic jobs, but my current job defies those.</p>
<p><strong>Rose</strong>: What if a larger component of your research platform included programming tools and R packages? My immediate thought was about computing and its role in academia. Statisticians in genomics have navigated this better than some other areas. It can surely be done, but there are still challenges folding that into an academic career.</p>
<p><strong>Hatfield</strong>: I think academia imposes few restrictions on what you can disseminate compared to industry, where there may be more privacy and intellectual property concerns. But I take your point that R packages do not impress most tenure and promotion committees.</p>
<p><strong>Rose</strong>: You want to find a good match between how you like spending your time and what’s rewarded. Not all academic jobs are the same and not all industry jobs are alike either. I wrote a more detailed <a href="http://simplystatistics.org/2015/02/18/navigating-big-data-careers-with-a-statistics-phd/">guest post</a> on this topic for <em>Simply Statistics</em>.</p>
<p><strong>Hatfield</strong>: I totally agree you should think about how you’d actually spend your time in any job you’re considering, rather than relying on broad ideas about industry versus academia. Do you love writing? Do you love coding? etc.</p>
<p> </p>
<p><strong><em>SimplyStats: You are both adopters of social media as a mechanism of disseminating your work and interacting with the community. What do you think of social media as a scientific communication tool? Do you find it is enhancing your careers?</em></strong></p>
<p><strong>Hatfield</strong>: Sherri is my social media mentor!</p>
<p><strong>Rose</strong>: I think social media can be a useful tool for networking, finding and sharing neat articles and news, and putting your research out there to a broader audience. I’ve definitely received speaking invitations and started collaborations because people initially “knew me from Twitter.” It’s become a way to recruit students as well. Prospective students are more likely to “know me” from a guest post or Twitter than traditional academic products, like journal articles.</p>
<p><strong>Hatfield</strong>: I’m grateful for our <a href="https://twitter.com/HPDSLab">Lab’s new Twitter</a> because it’s a purely academic account. My personal account has been awkwardly transitioning to include professional content; I still tweet silly things there.</p>
<p><strong>Rose</strong>: My timeline might have <a href="https://twitter.com/sherrirose/status/569613197600272386">a cat picture</a> or <a href="https://twitter.com/sherrirose/status/601822958491926529">two</a>.</p>
<p><strong>Hatfield</strong>: My very favorite thing about academic Twitter is discovering things I wouldn’t have even known to search for, especially packages and tricks in R. For example, that’s how I got converted to tidy data and dplyr.</p>
<p><strong>Rose</strong>: I agree. I think it’s a fantastic place to become exposed to work that’s incredibly related to your own but in another field, and you wouldn’t otherwise find it preparing a typical statistics literature review.</p>
<p> </p>
<p><strong><em>SimplyStats: What would you change in the statistics community?</em></strong></p>
<p><strong>Rose</strong>: Mentoring. I was tremendously lucky to receive incredible mentoring as a graduate student and now as a new faculty member. Not everyone gets this, and trainees don’t know where to find guidance. I’ve actively reached out to trainees during conferences and university visits, erring on the side of offering too much unsolicited help, because I feel there’s a need for that. I also have a <a href="http://drsherrirose.com/resources">resources page</a> on my website that I continue to update. I wish I had a more global solution beyond encouraging statisticians to take an active role in mentoring not just your own trainees. We shouldn’t lose good people because they didn’t get the support they needed.</p>
<p><strong>Hatfield</strong>: I think we could make conferences much better! Being in the same physical space at the same time is very precious. I would like to take better advantage of that at big meetings to do work that requires face time. Talks are not an example of this. Workshops and hackathons and panels and working groups – these all make better use of face-to-face time. And are a lot more fun!</p>
<p> </p>
If you ask different questions you get different answers - one more way science isn't broken, it is just really hard
2015-08-20T14:52:34+00:00
http://simplystats.github.io/2015/08/20/if-you-ask-different-quetions-you-get-different-asnwers-one-more-way-science-isnt-broken-it-is-just-really-hard
<p>If you haven’t already read the amazing piece by Christie Aschwanden on why <a href="http://fivethirtyeight.com/features/science-isnt-broken/">Science isn’t Broken</a>, you should do so immediately. It does a great job of capturing the nuance of statistics as applied to real data sets, and of how that nuance can be misconstrued as science being “broken”, without falling for the easy “everything is wrong” meme.</p>
<p>One thing that caught my eye was how the piece highlighted a crowd-sourced data analysis of soccer red cards. The key figure for that analysis is this one:</p>
<p> </p>
<p><a href="http://fivethirtyeight.com/features/science-isnt-broken/"><img class="aligncenter" src="https://espnfivethirtyeight.files.wordpress.com/2015/08/truth-vigilantes-soccer-calls2.png?w=1024&h=597" alt="" width="1024" height="597" /></a></p>
<p>I think the figure and <a href="https://osf.io/qix4g/">underlying data</a> for this figure are fascinating in that they really highlight the human behavioral variation in data analysis and you can even see some <a href="http://simplystatistics.org/2015/04/29/data-analysis-subcultures/">data analysis subcultures </a>emerging from the descriptions of how people did the analysis and justified or not the use of covariates.</p>
<p>One subtlety of the figure that I missed on the original reading is that not all of the estimates being reported are measuring the same thing. For example, if some groups adjusted for the country of origin of the referees and some did not, then the estimates for those two groups are measuring different things (the association conditional on country of origin or not, respectively). In this case the estimates may be different, but entirely consistent with each other, since they are just measuring different things.</p>
<p>If you ask two people to do the analysis and you only ask them the simple question: <em>Are referees more likely to give red cards to dark skinned players?</em> then you may get a different answer based on those two estimates. But the reality is the answers the analysts are reporting are actually to the questions:</p>
<ol>
<li>Are referees more likely to give red cards to dark skinned players holding country of origin fixed?</li>
<li>Are referees more likely to give red cards to dark skinned players averaging over country of origin (and everything else)?</li>
</ol>
<p>The subtlety lies in the fact that changes to covariates in the analysis are actually changing the hypothesis you are studying.</p>
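<p>To make this concrete, here is a small simulated example in R (a hypothetical illustration I am adding; it is not based on the actual red-card data). Skin tone is associated with a referee’s country of origin, and red-card rates differ by country, so the marginal and conditional analyses answer different questions and give different numbers, even though both are computed correctly:</p>
<pre>set.seed(42)
n <- 10000
country   <- rbinom(n, 1, 0.5)                    # hypothetical referee country indicator
dark_skin <- rbinom(n, 1, 0.3 + 0.3 * country)    # skin tone associated with country
## red cards depend on country but, conditionally on country, not on skin tone
red_card  <- rbinom(n, 1, plogis(-3 + 1 * country))

## Question 2: averaging over country of origin (marginal association, positive)
coef(glm(red_card ~ dark_skin, family = binomial))["dark_skin"]
## Question 1: holding country of origin fixed (conditional association, near zero)
coef(glm(red_card ~ dark_skin + country, family = binomial))["dark_skin"]
</pre>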
<p>So in fact the conclusions in that figure may all be entirely consistent after you condition on asking the same question. I’d be interested to see the same plot, but only for the groups that conditioned on the same set of covariates, for example. This is just one more reason that science is really hard and why I’m so impressed at how well the FiveThirtyEight piece captured this nuance.</p>
<p> </p>
<p> </p>
P > 0.05? I can make any p-value statistically significant with adaptive FDR procedures
2015-08-19T10:38:31+00:00
http://simplystats.github.io/2015/08/19/p-0-05-i-can-make-any-p-value-statistically-significant-with-adaptive-fdr-procedures
<p>Everyone knows now that you have to correct for multiple testing when you calculate many p-values otherwise this can happen:</p>
<div style="width: 550px" class="wp-caption aligncenter">
<a href="http://xkcd.com/882/"><img class="" src=" http://imgs.xkcd.com/comics/significant.png" alt="" width="540" height="1498" /></a>
<p class="wp-caption-text">
http://xkcd.com/882/
</p>
</div>
<p> </p>
<p>One of the most popular ways to correct for multiple testing is to estimate or control the <a href="https://en.wikipedia.org/wiki/False_discovery_rate">false discovery rate</a>. The false discovery rate attempts to quantify the fraction of made discoveries that are false. If we call all p-values less than some threshold <em>t</em> significant, then borrowing notation from this <a href="http://www.ncbi.nlm.nih.gov/pubmed/12883005">great introduction to false discovery rates </a></p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr3.gif"><img class="aligncenter size-full wp-image-4246" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr3.gif" alt="fdr3" width="285" height="40" /></a></p>
<p> </p>
<p>So <em>F(t)</em> is the (unknown) total number of null hypotheses called significant and <em>S(t)</em> is the total number of hypotheses called significant. The FDR is the expected ratio of these two quantities, which, under certain assumptions can be approximated by the ratio of the expectations.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr4.gif"><img class="aligncenter size-full wp-image-4247" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr4.gif" alt="fdr4" width="246" height="44" /></a></p>
<p> </p>
<p>To get an estimate of the FDR we just need estimates for <em>E[F(t)]</em> and <em>E[S(t)]</em>. The latter is pretty easy to estimate as just the total number of rejections (the number of <em>p < t</em>). If you assume that the p-values follow the expected distribution, then <em>E[F(t)]</em> can be approximated by multiplying the fraction of null hypotheses by the total number of hypotheses and by <em>t</em>, since the null p-values are uniform. To do this, we need an estimate for <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_d4c98d75e25f5d28461f1da221eb7a95.gif" style="vertical-align: middle; border: none; padding-bottom:1px;" class="tex" alt="\pi_0" /></span>, the proportion of null hypotheses. There are a large number of ways to estimate this quantity, but it is almost always estimated using the full distribution of computed p-values in an experiment. The most popular estimator compares the fraction of p-values greater than some cutoff to the fraction you would expect if every single hypothesis were null; that ratio is approximately the fraction of null hypotheses.</p>
<p>Combining the above equation with our estimates for <em>E[F(t)]</em> and <em>E[S(t)]</em>, we get:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr5.gif"><img class="aligncenter size-full wp-image-4250" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr5.gif" alt="fdr5" width="238" height="42" /></a></p>
<p> </p>
<p>The q-value is a multiple testing analog of the p-value and is defined as:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr61.gif"><img class="aligncenter size-full wp-image-4258" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr61.gif" alt="fdr6" width="163" height="26" /></a></p>
<p> </p>
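<p>The formulas above are embedded as images; for readers who can’t see them, here is my reconstruction in LaTeX of what they most likely say, following the standard notation of the Storey-style approach described in the linked introduction (a sketch based on that notation, not a transcription of the images). With <em>m</em> the total number of tests and \(\lambda\) a tuning cutoff (often 0.5):</p>
<p>
\[
\mathrm{FDR}(t) \approx \frac{E[F(t)]}{E[S(t)]}, \qquad
\widehat{\mathrm{FDR}}(t) = \frac{\hat{\pi}_0\, m\, t}{S(t)}, \qquad
\hat{\pi}_0 = \frac{\#\{p_i > \lambda\}}{m(1-\lambda)}, \qquad
\hat{q}(p_i) = \min_{t \ge p_i} \widehat{\mathrm{FDR}}(t).
\]
</p>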
<p>This is of course a very loose version of this and you can get a more technical description <a href="http://www.genomine.org/papers/directfdr.pdf">here</a>. But the main thing to notice is that the q-value depends on the estimated proportion of null hypotheses, which depends on the distribution of the observed p-values. The smaller the estimated fraction of null hypotheses, the smaller the FDR estimate and the smaller the q-value. This suggests a way to make any p-value significant by altering its “testing partners”. Here is a quick example. Suppose that we have done a test and have a p-value of 0.8. Not super significant. Suppose we perform this test in conjunction with a number of hypotheses that are null generating a p-value distribution like this.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals.png"><img class="aligncenter size-medium wp-image-4260" src="http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals-300x300.png" alt="uniform-pvals" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>Then you get a q-value greater than 0.99 as you would expect. But if you test that exact same p-value with a ton of other non-null hypotheses that generate tiny p-values in a distribution that looks like this:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals.png"><img class="aligncenter size-medium wp-image-4261" src="http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals-300x300.png" alt="significant-pvals" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p>Then you get a q-value of 0.0001 for that same p-value of 0.8. The reason is that the estimate of the fraction of null hypotheses goes essentially to zero, which drives down the q-value. You can do this with any p-value, if you make its testing partners have sufficiently low p-values then the q-value will also be as small as you like.</p>
<p>A couple of things to note:</p>
<ul>
<li>Obviously doing this on purpose to change the significance of a calculated p-value is cheating and shouldn’t be done.</li>
<li>For correctly calculated p-values on a related set of hypotheses this is actually a sensible property to have - if you have almost all very small p-values and one very large p-value, you are doing a set of tests where almost everything appears to be alternative and you should weight that in some sensible way.</li>
<li>This is the reason that sometimes a “multiple testing adjusted” p-value (or q-value) is smaller than the p-value itself.</li>
<li>This doesn’t affect non-adaptive FDR procedures - but those procedures still depend on the “testing partners” of any p-value through the total number of tests performed. This is why people talk about the so-called “multiple testing burden”. But that is a subject for a future post. It is also the reason non-adaptive procedures can be severely underpowered compared to adaptive procedures when the p-values are correct.</li>
<li>I’ve appended the code to generate the histograms and calculate the q-values in this post in the following gist.</li>
</ul>
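<p>The gist itself is not reproduced in this archive, so here is a minimal, self-contained R sketch of the same demonstration (my own code, written from the description above using a hand-rolled Storey-type estimator rather than the <tt>qvalue</tt> package, so the exact numbers will differ from those quoted in the post):</p>
<pre>set.seed(1)

## Storey-type q-values computed by hand so no extra packages are needed
qvals <- function(p, lambda = 0.5) {
  m   <- length(p)
  pi0 <- min(1, mean(p > lambda) / (1 - lambda))  # estimated fraction of null hypotheses
  o   <- order(p)
  fdr <- pmin(1, pi0 * m * p[o] / seq_len(m))     # estimated FDR at each threshold
  q   <- rev(cummin(rev(fdr)))                    # enforce monotonicity in p
  q[order(o)]                                     # return in the original order
}

p_fixed <- 0.8

## Case 1: the 1000 testing partners are null (uniform p-values)
p_null <- c(runif(1000), p_fixed)
hist(p_null, breaks = 20)
qvals(p_null)[1001]    # close to 1, as expected

## Case 2: the 1000 testing partners are strongly non-null (tiny p-values)
p_alt <- c(rbeta(1000, 1, 1000), p_fixed)
hist(p_alt, breaks = 20)
qvals(p_alt)[1001]     # tiny, even though the p-value is still 0.8
</pre>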
<p> </p>
UCLA Statistics 2015 Commencement Address
2015-08-12T10:34:03+00:00
http://simplystats.github.io/2015/08/12/ucla-statistics-2015-commencement-address
<p>I was asked to speak at the <a href="http://www.stat.ucla.edu">UCLA Department of Statistics</a> Commencement Ceremony this past June. As one of the first graduates of that department back in 2003, I was tremendously honored to be invited to speak to the graduates. When I arrived I was just shocked at how much the department had grown. When I graduated I think there were no more than 10 of us between the PhD and Master’s programs. Now they have ~90 graduates per year with undergrad, Master’s and PhD. It was just stunning.</p>
<p>Here’s the text of what I said, which I think I mostly stuck to in the actual speech.</p>
<p> </p>
<p><strong>UCLA Statistics Graduation: Some thoughts on a career in statistics</strong></p>
<p>When I asked Rick [Schoenberg] what I should talk about, he said to “talk for 95 minutes on asymptotic properties of maximum likelihood estimators under nonstandard conditions”. I thought, this is a great opportunity! I busted out Tom Ferguson’s book and went through my old notes. Here we go. Let X be a complete normed vector space….</p>
<p>I want to thank the department for inviting me here today. It’s always good to be back. I entered the UCLA stat department in 1999, only the second entering class, and graduated from UCLA Stat in 2003. Things were different then. Jan was the chair and there were not many classes so we could basically do whatever we wanted. Things are different now and that’s a good thing. Since 2003, I’ve been at the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, where I was first a postdoctoral fellow and then joined the faculty. It’s been a wonderful place for me to grow up and I’ve learned a lot there.</p>
<p>It’s just an incredible time to be a statistician. You guys timed it just right. I’ve been lucky enough to witness two periods like this, the first time being when I graduated from college at the height of the dot-com boom. Today, it’s not computer programming skills that the world needs, but rather it’s statistical skills. I wish I were in your shoes today, just getting ready to start out. But since I’m not, I figured the best thing I could do is share some of the things I’ve learned and talk about the role that these things have played in my own life.</p>
<p>Know your edge: What’s the one thing that you know that no one else seems to know? You’re not a clone—you have original ideas and skills. You might think they’re not valuable but you’re wrong. Be proud of these ideas and use them to your advantage. As an example, I’ll give you my one thing. Right now, I believe the greatest challenge facing the field of statistics today is getting the entire world to know what we in this room already know. Data are everywhere today and the biggest barrier to progress is our collective inability to process and analyze those data to produce useful information. The need for the things that we know has absolutely exploded and we simply have not caught up. That’s why I created, along with Jeff Leek and Brian Caffo, the Johns Hopkins Data Science Specialization, which is currently the most successful massive open online course program ever. Our goal is to teach the entire world statistics, which we think is an essential skill. We’re not quite there yet, but—assuming you guys don’t steal my idea—I’m hopeful that we’ll get there sometime soon.</p>
<p>At some point the edge you have will no longer work: That sounds like a bad thing, but it’s actually good. If what you’re doing really matters, then at some point everyone will be doing it. So you’ll need to find something else. I’ve been confronted with this problem at least 3 times in my life so far. Before college, I was pretty good at the violin, and it opened a lot of doors for me. It got me into Yale. But when I got to Yale, I quickly realized that there were a lot of really good violinists here. Suddenly, my talent didn’t have so much value. This was when I started to pick up computer programming and in 1998 I learned an obscure little language called R. When I got to UCLA I realized I was one of the only people who knew R. So I started a little brown bag lunch series where I’d talk about some feature of R to whomever would show up (which wasn’t many people usually). Picking up on R early on turned out to be really important because it was a small community back then and it was easy to have a big impact. Also, as more and more people wanted to learn R, they’d usually call on me. It’s always nice to feel needed. Over the years, the R community exploded and R’s popularity got to the point where it was being talked about in the New York Times. But now you see the problem. Saying that you know R doesn’t exactly distinguish you anymore, so it’s time to move on again. These days, I’m realizing that the one useful skill that I have is the ability to make movies. Also, my experience being a performer on the violin many years ago is coming in handy. My ability to quickly record and edit movies was one of the key factors that enabled me to create an entire online data science program in 2 months last year.</p>
<p>Find the right people, and stick with them forever. Being a statistician means working with other people. Choose those people wisely and develop a strong relationship. It doesn’t matter how great the project is or how famous or interesting the other person is, if you can’t get along then bad things will happen. Statistics and data analysis is a highly verbal process that requires constant and very clear communication. If you’re uncomfortable with someone in any way, everything will suffer. Data analysis is unique in this way—our success depends critically on other people. I’ve only had a few collaborators in the past 12 years, but I love them like family. When I work with these people, I don’t necessarily know what will happen, but I know it will be good. In the end, I honestly don’t think I’ll remember the details of the work that I did, but I’ll remember the people I worked with and the relationships I built.</p>
<p>So I hope you weren’t expecting a new asymptotic theorem today, because this is pretty much all I’ve got. As you all go on to the next phase of your life, just be confident in your own ideas, be prepared to change and learn new things, and find the right people to do them with. Thank you.</p>
Correlation is not a measure of reproducibility
2015-08-12T10:33:25+00:00
http://simplystats.github.io/2015/08/12/correlation-is-not-a-measure-of-reproducibility
<p>Biologists make wide use of correlation as a measure of reproducibility. Specifically, they quantify reproducibility with the correlation between measurements obtained from replicated experiments. For example, <a href="https://genome.ucsc.edu/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf">the ENCODE data standards document</a> states</p>
<blockquote>
<p>A typical R<sup>2</sup> (Pearson) correlation of gene expression (RPKM) between two biological replicates, for RNAs that are detected in both samples using RPKM or read counts, should be between 0.92 to 0.98. Experiments with biological correlations that fall below 0.9 should be either be repeated or explained.</p>
</blockquote>
<p>However, for reasons I will explain here, correlation is not necessarily informative with regard to reproducibility. The mathematical results described below are not inconsequential theoretical details, and understanding them will help you assess new technologies, experimental procedures and computational methods.</p>
<p>Suppose you have collected data from an experiment</p>
<p style="text-align: center;">
<em>x</em><sub>1</sub>, <em>x</em><sub>2</sub>,..., <em>x</em><sub>n</sub>
</p>
<p>and want to determine if a second experiment replicates these findings. For simplicity, we represent data from the second experiment as adding unbiased (averages out to 0) and statistically independent measurement error <em>d</em> to the first:</p>
<p style="text-align: center;">
<em>y</em><sub>1</sub>=<em>x</em><sub>1</sub>+<em>d</em><sub>1</sub>, <em>y</em><sub>2</sub>=<em>x</em><sub>2</sub>+<em>d</em><sub>2</sub>, ... <em>y</em><sub>n</sub>=<em>x</em><sub>n</sub>+<em>d</em><sub>n</sub>.
</p>
<p>For us to claim reproducibility we want the differences</p>
<p style="text-align: center;">
<em>d</em><sub>1</sub>=<em>y</em><sub>1</sub>-<em>x</em><sub>1</sub>, <em>d</em><sub>2</sub>=<em>y</em><sub>2</sub>-<em>x</em><sub>2</sub>,<em>... </em>,<em>d</em><sub>n</sub>=<em>y</em><sub>n</sub>-<em>x</em><sub>n</sub>
</p>
<p>to be “small”. To give this some context, imagine the <em>x</em> and <em>y</em> are log scale (base 2) gene expression measurements which implies the <em>d</em> represent log fold changes. If these differences have a standard deviation of 1, it implies that fold changes of 2 are typical between replicates. If our replication experiment produces measurements that are typically twice as big or twice as small as the original, I am not going to claim the measurements are reproduced. However, as it turns out, such terrible reproducibility can still result in correlations higher than 0.92.</p>
<p>To someone basing their definition of correlation on the current common language usage this may seem surprising, but to someone basing it on math, it is not. To see this, note that the mathematical definition of correlation tells us that because <em>d</em> and <em>x</em> are independent:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/pearsonformula.png"><img class=" aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/pearsonformula-300x55.png" alt="pearsonformula" width="300" height="55" /></a></p>
<p>This tells us that correlation summarizes the variability of <em>d</em> relative to the variability of <em>x</em>. Because of the wide range of gene expression values we observe in practice, the standard deviation of <em>x</em> can easily be as large as 3 (variance is 9). This implies we expect to see correlations as high as 1/sqrt(1+1/9) = 0.95, despite the lack of reproducibility when comparing <em>x</em> to <em>y</em>.</p>
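<p>Here is a quick simulation in R that makes this concrete (my own illustration, with made-up numbers chosen to match the ones in the text: log<sub>2</sub> expression with standard deviation 3 and measurement error with standard deviation 1):</p>
<pre>set.seed(1)
n <- 10000
x <- rnorm(n, mean = 8, sd = 3)   # log2 expression, experiment 1
d <- rnorm(n)                     # error with sd 1: 2-fold differences are typical
y <- x + d                        # "replicate" experiment 2

cor(x, y)   # roughly 0.95, even though replicates routinely disagree 2-fold
</pre>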
<p>Note that using Spearman correlation does not fix this problem. A Spearman correlation of 1 tells us that the ranks of <em>x</em> and <em>y</em> are preserved, yet does not summarize the actual differences. The problem comes down to the fact that we care about the variability of <em>d</em>, and correlation, Pearson or Spearman, does not provide an optimal summary. While correlation relates to the preservation of ranks, a much more appropriate summary of reproducibility is the distance between <em>x</em> and <em>y</em>, which is related to the standard deviation of the differences <em>d</em>. A very simple R command you can use to generate this summary statistic is:</p>
<pre>sqrt(mean(d^2))</pre>
<p>or the robust version:</p>
<pre>median(abs(d)) ##multiply by 1.4826 for unbiased estimate of true sd
</pre>
<p>The equivalent suggestion for plots is to make an <a href="https://en.wikipedia.org/wiki/MA_plot">MA-plot</a> instead of a scatterplot.</p>
<p>But aren’t correlations and distances directly related? Sort of, and this actually brings up another problem. If the <em>x</em> and <em>y</em> are standardized to have average 0 and standard deviation 1 then, yes, correlation and distance are directly related:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/distcorr.png"><img class=" size-medium wp-image-4202 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/distcorr-300x51.png" alt="distcorr" width="300" height="51" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/distcorr-300x51.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/distcorr-260x44.png 260w, http://simplystatistics.org/wp-content/uploads/2015/08/distcorr.png 878w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>However, if instead <em>x</em> and <em>y</em> have different average values, which would call reproducibility into question, then distance is sensitive to this problem while correlation is not. If the standard deviation is 1, the formula is:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/distcor2.png"><img class=" size-medium wp-image-4204 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/distcor2-300x27.png" alt="distcor2" width="300" height="27" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/distcor2-300x27.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/distcor2-1024x94.png 1024w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>Once we consider units (standard deviations different from 1) then the relationship becomes even more complicated. Two advantages of distance you should be aware of are:</p>
<ol>
<li>it is in the same units as the data, while correlations have no units, which makes them hard to interpret and makes selecting thresholds difficult, and</li>
<li>distance accounts for bias (differences in average), while correlation does not.</li>
</ol>
<p>A final important point relates to the use of correlation with data that are not approximately normal. The useful interpretation of correlation as a summary statistic stems from the bivariate normal approximation: for every standard unit increase in the first variable, the second variable increases <em>r</em> standard units, with <em>r</em> the correlation. A summary of this is <a href="http://genomicsclass.github.io/book/pages/exploratory_data_analysis_2.html">here</a>. However, when data are not normal this interpretation no longer holds. Furthermore, heavy-tailed distributions, which are common in genomics, can lead to instability. Here is an example of uncorrelated data with a single point added that leads to correlations close to 1. This is quite common with RNAseq data.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2.png"><img class=" size-medium wp-image-4208 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-300x300.png" alt="supp_figure_2" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-200x200.png 200w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
rafalib package now on CRAN
2015-08-10T10:00:26+00:00
http://simplystats.github.io/2015/08/10/rafalib-package-now-on-cran
<p>For the last several years I have been <a href="https://github.com/ririzarr/rafalib">collecting functions</a> I routinely use during exploratory data analysis in a private R package. <a href="http://mike-love.net/">Mike Love</a> and I used some of these in our HarvardX course and now, due to popular demand, I have created man pages and added the <a href="https://cran.r-project.org/web/packages/rafalib/">rafalib</a> package to CRAN. Mike has made several improvements and added some functions of his own. Here are quick descriptions of the rafalib functions I use most:</p>
<p>mypar - Before making a plot in R I almost always type <tt>mypar()</tt>. This basically gets around the suboptimal defaults of <tt>par</tt>. For example, it makes the margins (<tt>mar</tt>, <tt>mgp</tt>) smaller and defines RColorBrewer colors as defaults. It is optimized for the RStudio window. Another advantage is that you can type <tt>mypar(3,2)</tt> instead of <tt>par(mfrow=c(3,2))</tt>. <tt>bigpar()</tt> is optimized for R presentations or PowerPoint slides.</p>
<p>as.fumeric - This function turns characters into factors and then into numerics. This is useful, for example, if you want to plot values <tt>x,y</tt> with colors defined by their corresponding categories saved in a character vector <tt>labs</tt>: <tt>plot(x,y,col=as.fumeric(labs))</tt>.</p>
<p>shist (smooth histogram, pronounced <em>shitz</em>) - I wrote this function because I have a hard time interpreting the y-axis of <tt>density</tt>. The height of the curve drawn by <tt>shist</tt> can be interpreted as the height of a histogram if you used the units shown on the plot. Also, it automatically draws a smooth histogram for each entry in a matrix on the same plot.</p>
<p>splot (subset plot) - The datasets I work with are typically large enough that <tt>plot(x,y)</tt> involves millions of points, which is <a href="http://stackoverflow.com/questions/7714677/r-scatterplot-with-too-many-points">a problem</a>. Several solutions are available to avoid overplotting, such as alpha-blending, hexbinning and 2d kernel smoothing. For reasons I won’t explain here, I generally prefer subsampling over these solutions. <tt>splot</tt> automatically subsamples. You can also specify an index that defines the subset.</p>
<p>sboxplot (smart boxplot) - This function draws points, boxplots or outlier-less boxplots depending on sample size. Coming soon is the kaboxplot (Karl Broman box-plots) for when you have too many boxplots.</p>
<p>install_bioc - For Bioconductor users, this function simply does the <tt>source(“http://www.bioconductor.org/biocLite.R”)</tt> for you and then uses <tt>BiocLite</tt> to install.</p>
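<p>A short usage sketch based only on the descriptions above (argument details may differ slightly from the package documentation, so check the man pages):</p>
<pre>## install.packages("rafalib")   # from CRAN
library(rafalib)

x <- rnorm(1000)
y <- x + rnorm(1000)
labs <- sample(c("treated", "control"), 1000, replace = TRUE)

mypar(1, 2)                          # nicer par() defaults, 1 x 2 layout
plot(x, y, col = as.fumeric(labs))   # color points by a character vector
shist(x)                             # smooth histogram with an interpretable y-axis
</pre>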
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1.png"><img class="alignnone size-large wp-image-4190" src="http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-1024x773.png" alt="unnamed" width="990" height="747" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-300x226.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-1024x773.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-260x196.png 260w, http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1.png 1035w" sizes="(max-width: 990px) 100vw, 990px" /></a></p>
Interested in analyzing images of brains? Get started with open access data.
2015-08-09T21:29:17+00:00
http://simplystats.github.io/2015/08/09/interested-in-analyzing-images-of-brains-get-started-with-open-access-data
<div>
<i>Editor's note: This is a guest post by <a href="http://www.anieloyan.com/" target="_blank"><span class="lG">Ani</span> Eloyan</a>. She is an Assistant Professor of Biostatistics at Brown University. Dr. Eloyan’s work focuses on</i> <i>semi-parametric likelihood based methods for matrix decompositions, statistical analyses of brain images, and the integration of various types of complex data structures for analyzing health care data</i><i>. She received her PhD in statistics from North Carolina State University and subsequently completed a postdoctoral fellowship in the <a href="http://www.biostat.jhsph.edu/">Department of Biostatistics at Johns Hopkins University</a>. Dr. Eloyan and her team won the <a>ADHD200 Competition</a></i> <i>discussed in <a href="http://journal.frontiersin.org/article/10.3389/fnsys.2012.00061/abstract" target="_blank">this</a> article. She tweets <a href="https://twitter.com/eloyan_ani">@eloyan_ani</a>.</i>
</div>
<div>
<i> </i>
</div>
<div>
<div>
Neuroscience is one of the exciting new fields for biostatisticians interested in real-world applications where they can contribute novel statistical approaches. Most research in brain imaging has historically included studies run for small numbers of patients. While justified by the costs of data collection, the claims based on analyzing data for such small numbers of subjects often do not hold for our populations of interest. As discussed in <a href="http://www.huffingtonpost.com/american-statistical-association/wanted-neuroquants_b_3749363.html" target="_blank">this</a> article, there is a huge demand for biostatisticians in the field of quantitative neuroscience; so-called neuroquants or neurostatisticians. However, while more statisticians are interested in the field, we are far from competing with other substantive domains. For instance, a quick search of the abstract keywords “brain imaging” and “neuroscience” in the online program of the upcoming <a href="https://www.amstat.org/meetings/jsm/2015/" target="_blank">JSM2015</a> conference returns 15 records, while a search of the words “genomics” and “genetics” generates 76 records.
</div>
<div>
</div>
<div>
Assuming you are trained in statistics and an aspiring neuroquant, how would you go about working with brain imaging data? As a graduate student in the <a href="http://www.stat.ncsu.edu/" target="_blank">Department of Statistics at NCSU</a> several years ago, I was very interested in working on statistical methods that would be directly applicable to solving problems in neuroscience. But I had this same question: “Where do I find the data?” I soon learned that to <i>really</i> approach substantial, relevant problems I also needed to learn about the subject matter underlying these complex data structures.
</div>
<div>
</div>
<div>
In recent years, several leading groups have uploaded their lab data with the common goal of fostering the collection of high dimensional brain imaging data to build powerful models that can give generalizable results. <a href="http://www.nitrc.org/" target="_blank">Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC)</a>, founded in 2006, is a platform for public data sharing that facilitates streamlining data processing pipelines and compiling high dimensional imaging datasets for crowdsourcing the analyses. It includes data for people with neurological diseases and neurotypical children and adults. If you are interested in Alzheimer’s disease, you can check out <a href="http://adni.loni.usc.edu/" target="_blank">ADNI</a>. <a href="http://fcon_1000.projects.nitrc.org/indi/abide/" target="_blank">ABIDE</a> provides data for people with Autism Spectrum Disorder and neurotypical peers. <a href="http://fcon_1000.projects.nitrc.org/indi/adhd200/" target="_blank">ADHD200</a> was released in 2011 as a part of a competition to motivate building predictive methods for disease diagnoses using functional magnetic resonance imaging (fMRI) in addition to demographic information to predict whether a child has attention deficit hyperactivity disorder (ADHD). While the competition ended in 2011, the dataset has been widely utilized afterwards in studies of ADHD. According to Google Scholar, the <a href="http://www.nature.com/mp/journal/v19/n6/abs/mp201378a.html" target="_blank">paper</a> introducing the ABIDE set has been cited 129 times since 2013, while the <a href="http://journal.frontiersin.org/article/10.3389/fnsys.2012.00062/full" target="_blank">paper</a> discussing the ADHD200 has been cited 51 times since 2012. These are only a few examples from the list of open access datasets that could be utilized by statisticians.
</div>
<div>
</div>
<div>
Anyone can download these datasets (you may need to register and complete some paperwork in some cases); however, there are several data processing and cleaning steps to perform before the final statistical analyses. These preprocessing steps can be daunting for a statistician new to the field, especially as the tools used for preprocessing may not be available in R. <a href="https://hopstat.wordpress.com/2014/08/27/statisticians-in-neuroimaging-need-to-learn-preprocessing/" target="_blank">This</a> discussion makes the case as to why statisticians need to be involved in every step of preprocessing the data, while <u><a href="https://hopstat.wordpress.com/2014/06/17/fslr-an-r-package-interfacing-with-fsl-for-neuroimaging-analysis/" target="_blank">this R package</a></u> contains new tools linking R to a commonly used platform, <a href="http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/" target="_blank">FSL</a>. However, as a newcomer, it can be easier to start with data that are already processed. <a href="http://projecteuclid.org/euclid.ss/1242049389" target="_blank">This</a> excellent overview by Dr. Martin Lindquist provides an introduction to the different types of analyses for brain imaging data from a statistician’s point of view, while our <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0089470" target="_blank">paper</a> provides tools in R and example datasets for implementing some of these methods. At least one course on Coursera can help you get started with <a href="https://www.coursera.org/course/fmri" target="_blank">functional MRI</a> data. Talking to, and reading the papers of, biostatisticians working in quantitative neuroscience and scientists working in neuroscience is the key.
</div>
</div>
Statistical Theory is our "Write Once, Run Anywhere"
2015-08-09T11:19:53+00:00
http://simplystats.github.io/2015/08/09/statistical-theory-is-our-write-once-run-anywhere
<p>Having followed the software industry as a casual bystander, I periodically see the tension flare up between the idea of writing “native apps”, software that is tuned to a particular platform (Windows, Mac, etc.), and more cross-platform apps, which run on many platforms without too much modification. Over the years it has come up in many different forms, but the fundamentals are the same. Back in the day, there was Java, which was supposed to be the platform that ran on any computing device. Sun Microsystems originated the phrase “<a href="https://en.wikipedia.org/wiki/Write_once,_run_anywhere">Write Once, Run Anywhere</a>” to illustrate the cross-platform strengths of Java. More recently, Steve Jobs famously <a href="https://www.apple.com/hotnews/thoughts-on-flash/">banned Flash</a> from any iOS device. Apple is also moving away from standards like OpenGL and towards its own Metal platform.</p>
<p>What’s the problem with “write once, run anywhere”, or of cross-platform development more generally, assuming it’s possible? Well, there are a <a href="https://en.wikipedia.org/wiki/Cross-platform#Challenges_to_cross-platform_development">number of issues</a>: often there are performance penalties, it may be difficult to use the native look and feel of a platform, and you may be reduced to using the “lowest common denominator” of feature sets. It seems to me that anytime a new meta-platform comes out that promises to relieve programmers of the burden of having to write for multiple platforms, it eventually gets modified or subsumed by the need to optimize apps for a given platform as much as possible. The need to squeeze as much juice out of an app seems to be too important an opportunity to pass up.</p>
<p>In statistics, theory and theorems are our version of “write once, run anywhere”. The basic idea is that theorems provide an abstract layer (a “virtual machine”) that allows us to reason across a large number of specific problems. Think of the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a>, probably our most popular theorem. It could be applied to any problem/situation where you have a notion of sample size that could in principle be increasing.</p>
<p>But can it be applied to every situation, or even any situation? This might be more of a philosophical question, given that the CLT is stated asymptotically (maybe we’ll find out the answer eventually). In practice, my experience is that many people attempt to apply it to problems where it likely is not appropriate. Think, large-scale studies with a sample size of 10. Many people will use Normal-based confidence intervals in those situations, but they probably have very poor coverage.</p>
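<p>Here is a toy simulation in R (my own illustration, not from the original post) of what poor coverage looks like: draw samples of size 10 from a skewed distribution and check how often the usual Normal-based 95% interval covers the true mean.</p>
<pre>set.seed(1)
n <- 10
true_mean <- 1   # mean of an Exponential(1) distribution

covered <- replicate(10000, {
  x  <- rexp(n, rate = 1)
  se <- sd(x) / sqrt(n)
  ci <- mean(x) + c(-1.96, 1.96) * se
  ci[1] <= true_mean && true_mean <= ci[2]
})

mean(covered)    # noticeably below the nominal 0.95
</pre>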
<p>Because the CLT doesn’t apply in many situations (small sample, dependent data, etc.), variations of the CLT have been developed, as well as entirely different approaches to achieving the same ends, like confidence intervals, p-values, and standard errors (think bootstrap, jackknife, permutation tests). While the CLT can provide beautiful insight in a large variety of situations, in reality, one must often resort to a custom solution when analyzing a given dataset or problem. This should be a familiar conclusion to anyone who analyzes data. The promise of “write once, run anywhere” is always tantalizing, but the reality never seems to meet that expectation.</p>
<p>Ironically, if you look across history and all programming languages, probably the most “cross-platform” language is C, which was originally considered to be too low-level to be broadly useful. C programs run on basically every existing platform and the language has been completely standardized so that compilers can be written to produce well-defined output. The keys to C’s success I think are that it’s a very simple/small language which gives enormous (sometimes dangerous) power to the programmer, and that an enormous toolbox (compiler toolchains, IDEs) has been developed over time to help developers write applications on all platforms.</p>
<p>In a sense, we need “compilers” that can help us translate statistical theory for specific data analysis problems. In many cases, I’d imagine the compiler would “fail”, meaning the theory was not applicable to that problem. This would be a Good Thing, because right now we have no way of really enforcing the appropriateness of a theorem for specific problems.</p>
<p>More practically (perhaps), we could develop <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">data analysis pipelines</a> that could be applied to broad classes of data analysis problems. Then a “compiler” could be employed to translate the pipeline so that it worked for a given dataset/problem/toolchain.</p>
<p>The key point is to recognize that there is a “translation” process that occurs when we use theory to justify certain data analysis actions, but this translation process is often not well documented or even thought through. Having an explicit “compiler” for this would help us to understand the applicability of certain theorems and may serve to prevent bad data analysis from occurring.</p>
Autonomous killing machines won't look like the Terminator...and that is why they are so scary
2015-07-30T11:09:22+00:00
http://simplystats.github.io/2015/07/30/autonomous-killing-machines-wont-look-like-the-terminator-and-that-is-why-they-are-so-scary
<p>Just a few days ago many of the most incredible minds in science and technology <a href="http://www.theguardian.com/technology/2015/jul/27/musk-wozniak-hawking-ban-ai-autonomous-weapons">urged governments to avoid using artificial intelligence</a> to create autonomous killing machines. One thing that always happens when such a warning is issued is that you see the inevitable Terminator picture:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/terminator.jpeg"><img class="aligncenter wp-image-4160 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/07/terminator-300x180.jpeg" alt="terminator" width="300" height="180" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/terminator-300x180.jpeg 300w, http://simplystatistics.org/wp-content/uploads/2015/07/terminator-260x156.jpeg 260w, http://simplystatistics.org/wp-content/uploads/2015/07/terminator.jpeg 620w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p>The reality is that robots that walk and talk are getting better but still have a ways to go:</p>
<p> </p>
<p> </p>
<p>Does this mean that I think all those really smart people are silly for making this plea about AI now though? No, I think they are probably just in time.</p>
<p>The reason is that the first autonomous killing machines will definitely not look anything like the Terminator. They will more likely than not be drones, which are already in widespread use by the military and will soon be flying over our heads <a href="http://money.cnn.com/2015/07/29/technology/amazon-drones-air-space/">delivering Amazon products</a>.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/drone.jpg"><img class="aligncenter size-medium wp-image-4161" src="http://simplystatistics.org/wp-content/uploads/2015/07/drone-300x238.jpg" alt="drone" width="300" height="238" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/drone-300x238.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/07/drone-1024x814.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/07/drone.jpg 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p>I also think that when people hear “artificial intelligence” they think of robots that can mimic the behaviors of a human being, including the ability to talk, hold a conversation, <a href="https://en.wikipedia.org/wiki/Turing_test">or pass the Turing test</a>. But it turns out that the “artificial intelligence” you would need to create an automated killing system is much, much simpler than that and is mostly some basic data science. The things you would need are:</p>
<ol>
<li>A drone with the ability to fly on its own</li>
<li>The ability to make decisions about what people to target</li>
<li>The ability to find those people and attack them</li>
</ol>
<p> </p>
<p>The first issue, being able to fly on autopilot, is something that has existed for a while. You have probably flown on a plane that has <a href="https://en.wikipedia.org/wiki/Autopilot">used autopilot</a> for at least some of the flight. I won’t get into the details on this one because I think it is the least interesting - it has been around a while and we didn’t get the dire warnings about autonomous agents.</p>
<p>The second issue, deciding which people to target, is already being addressed as well. We have already seen programs like <a href="https://en.wikipedia.org/wiki/PRISM_(surveillance_program)">PRISM</a> and others that collect individual-level metadata and presumably use those data to make predictions. While the true and false positive rates are probably messed up by the fact that there are very, very few “true positives”, these programs are being developed, and even relatively simple statistical models can be used to build a predictor - even if those predictions don’t work well.</p>
<p>The second issue is being able to find people to attack them. This is where the real “artificial intelligence” comes in to play. But it isn’t artificial intelligence like you might think about. It could be just as simple as having the drone fly around and take people’s pictures. Then we could use those pictures to match up with the people identified through metadata and attack them. Facebook has a [Just a few days ago many of the most incredible minds in science and technology <a href="http://www.theguardian.com/technology/2015/jul/27/musk-wozniak-hawking-ban-ai-autonomous-weapons">urged governments to avoid using artificial intelligence</a> to create autonomous killing machines. One thing that always happens when such a warning is put into place is you see the inevitable Terminator picture:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/terminator.jpeg"><img class="aligncenter wp-image-4160 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/07/terminator-300x180.jpeg" alt="terminator" width="300" height="180" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/terminator-300x180.jpeg 300w, http://simplystatistics.org/wp-content/uploads/2015/07/terminator-260x156.jpeg 260w, http://simplystatistics.org/wp-content/uploads/2015/07/terminator.jpeg 620w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p>The reality is that robots that walk and talk are getting better but still have a ways to go:</p>
<p> </p>
<p> </p>
<p>Does this mean that I think all those really smart people are silly for making this plea about AI now though? No, I think they are probably just in time.</p>
<p>The reason is that the first autonomous killing machines will definitely not look anything like the Terminator. They will more likely than not be drones, that are already in widespread use by the military, and will soon be flying over our heads <a href="http://money.cnn.com/2015/07/29/technology/amazon-drones-air-space/">delivering Amazon products</a>.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/drone.jpg"><img class="aligncenter size-medium wp-image-4161" src="http://simplystatistics.org/wp-content/uploads/2015/07/drone-300x238.jpg" alt="drone" width="300" height="238" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/drone-300x238.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/07/drone-1024x814.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/07/drone.jpg 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p>I also think that when people think about “artificial intelligence” they also think about robots that can mimic the behaviors of a human being, including the ability to talk, hold a conversation, <a href="https://en.wikipedia.org/wiki/Turing_test">or pass the Turing test</a>. But it turns out that the “artificial intelligence” you would need to create an automated killing system is much much simpler than that and is mostly some basic data science. The things you would need are:</p>
<ol>
<li>A drone with the ability to fly on its own</li>
<li>The ability to make decisions about what people to target</li>
<li>The ability to find those people and attack them</li>
</ol>
<p> </p>
<p>The first issue, being able to fly on autopilot, is something that has existed for a while. You have probably flown on a plane that has <a href="https://en.wikipedia.org/wiki/Autopilot">used autopilot</a> for at least some of the flight. I won’t get into the details on this one because I think it is the least interesting - it has been around a while and we didn’t get the dire warnings about autonomous agents.</p>
<p>The second issue, about deciding which people to target is already in existence as well. We have already seen programs like <a href="https://en.wikipedia.org/wiki/PRISM_(surveillance_program)">PRISM</a> and others that collect individual level metadata and presumably use those to make predictions. We have already seen programs like <a href="https://en.wikipedia.org/wiki/PRISM_(surveillance_program)">PRISM</a> and others that collect individual level metadata and presumably use those to make predictions. While the true and false positive rates are probably messed up by the fact that there are very very few “true positives” these programs are being developed and even relatively simple statistical models can be used to build a predictor - even if those don’t work.</p>
<p>The third issue is being able to find those people and attack them. This is where the real “artificial intelligence” comes into play. But it isn’t artificial intelligence like you might think about. It could be just as simple as having the drone fly around and take people’s pictures. Then we could use those pictures to match up with the people identified through metadata and attack them. Facebook has published a paper on its DeepFace system that demonstrates an algorithm that can identify people with near human-level accuracy. This approach is based on something called deep neural nets, which sound very intimidating but are actually just sets of nested nonlinear <a href="https://en.wikipedia.org/wiki/Deep_learning">logistic regression models</a>. These models have gotten very good because (a) we are getting better at fitting them mathematically and computationally but mostly (b) we have much more data to train them with than we ever did before. The speed at which this part of the process is developing is (I think) why there is so much recent concern about potentially negative applications like autonomous killing machines.</p>
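<p>To make the “nested logistic regressions” description concrete, here is a minimal sketch in R of a toy two-layer network. The weights below are random placeholders (this is not Facebook’s model or any real face recognizer); it only illustrates that a “deep” prediction is a sequence of logistic-style transformations applied to the input.</p>
<pre><code>sigmoid <- function(z) 1 / (1 + exp(-z))

## Toy network: 3 input features -> 4 hidden units -> 1 output probability.
## Weights are random placeholders, not trained values.
set.seed(1)
W1 <- matrix(rnorm(4 * 3), nrow = 4)   # first layer weights (4 x 3)
b1 <- rnorm(4)
W2 <- matrix(rnorm(1 * 4), nrow = 1)   # second layer weights (1 x 4)
b2 <- rnorm(1)

predict_prob <- function(x) {
  h <- sigmoid(W1 %*% x + b1)          # hidden layer: a bank of logistic regressions
  sigmoid(W2 %*% h + b2)               # output layer: one more logistic regression
}

predict_prob(c(0.2, -1.3, 0.8))        # returns a number between 0 and 1
</code></pre>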
<p>The scary thing is that these technologies could be combined <em>right now</em> to create a system that was not controlled directly by humans but made automated decisions and flew drones to carry out those decisions. The technology for shrinking these types of deep neural net systems is now so good that they can be made simple enough to <a href="http://googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html">run on a phone</a> for things like language translation, and a system that identifies people could easily be embedded in a drone.</p>
<p>So I am with Musk, Hawking, and others who would urge caution by governments in developing these systems. Just because we can make it doesn’t mean it will do what we want. Just look at how well Facebook/Amazon/Google make suggestions for “other things you might like” to get an idea about how potentially disastrous automated killing systems could be.</p>
Announcing the JHU Data Science Hackathon 2015
2015-07-28T13:31:04+00:00
http://simplystats.github.io/2015/07/28/announcing-the-jhu-data-science-hackathon-2015
<p>We are pleased to announce that the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health will be hosting the first ever <a href="https://www.regonline.com/jhudash">JHU Data Science Hackathon</a> (DaSH) on <strong>September 21-23, 2015</strong> at the Baltimore Marriott Waterfront.</p>
<p>This event will be an opportunity for data scientists and data scientists-in-training to get together and hack on real-world problems collaboratively and to learn from each other. The DaSH will feature data scientists from government, academia, and industry presenting problems and describing challenges in their respective areas. There will also be a number of networking opportunities where attendees can get to know each other. We think this will be a fun event and we encourage people from all areas, including students (graduate and undergraduate), to attend.</p>
<p>To get more details and to sign up for the hackathon, you can go to the <a href="https://www.regonline.com/jhudash">DaSH web site</a>. We will be posting more information as the event nears.</p>
<p>Organizers:</p>
<ul>
<li>Jeff Leek</li>
<li>Brian Caffo</li>
<li>Roger Peng</li>
<li>Leah Jager</li>
</ul>
<p>Funding:</p>
<ul>
<li>National Institutes of Health</li>
<li>Johns Hopkins University</li>
</ul>
stringsAsFactors: An unauthorized biography
2015-07-24T11:04:20+00:00
http://simplystats.github.io/2015/07/24/stringsasfactors-an-unauthorized-biography
<p>Recently, I was listening in on the conversation of some colleagues who were discussing a bug in their R code. The bug was ultimately traced back to the well-known phenomenon that functions like ‘read.table()’ and ‘read.csv()’ in R convert columns that are detected to be character/strings into factor variables. This led to the spontaneous outcry from one colleague of</p>
<blockquote>
<p>Why does stringsAsFactors not default to FALSE????</p>
</blockquote>
<p>The argument ‘stringsAsFactors’ is an argument to the ‘data.frame()’ function in R. It is a logical that indicates whether strings in a data frame should be treated as factor variables or as just plain strings. The argument also appears in ‘read.table()’ and related functions because of the role these functions play in reading in table data and converting them to data frames. By default, ‘stringsAsFactors’ is set to TRUE.</p>
<p>This argument dates back to May 20, 2006 when it was originally introduced into R as the ‘charToFactor’ argument to ‘data.frame()’. Soon afterwards, on May 24, 2006, it was changed to ‘stringsAsFactors’ to be compatible with S-PLUS by request from Bill Dunlap.</p>
<p>Most people I talk to today who use R are completely befuddled by the fact that ‘stringsAsFactors’ is set to TRUE by default. First of all, it should be noted that before the ‘stringsAsFactors’ argument even existed, the behavior of R was to coerce all character strings to be factors in a data frame. If you didn’t want this behavior, you had to manually coerce each column to be character.</p>
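<p>For readers who have never been bitten by this, here is a quick illustration of the default behavior described above (on the R versions current when this post was written; R 4.0.0 later changed the default to FALSE):</p>
<pre><code>## With the old default (stringsAsFactors = TRUE), character columns
## silently become factors:
df1 <- data.frame(id = 1:2, gene = c("BRCA1", "TP53"))
class(df1$gene)
#> [1] "factor"

## Setting the argument yourself keeps them as plain strings:
df2 <- data.frame(id = 1:2, gene = c("BRCA1", "TP53"),
                  stringsAsFactors = FALSE)
class(df2$gene)
#> [1] "character"
</code></pre>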
<p>So here’s the story:</p>
<p>In the old days, when R was primarily being used by statisticians and statistical types, setting strings to be factors made total sense. In most tabular data, if there was a column of the table that was non-numeric, it almost certainly encoded a categorical variable. Think sex (male/female), country (U.S./other), region (east/west), etc. In R, categorical variables are represented by ‘factor’ vectors and so character columns got converted to factor.</p>
<p>Why do we need factor variables to begin with? Because of modeling functions like ‘lm()’ and ‘glm()’. Modeling functions need to expand categorical variables into individual dummy variables, so that a categorical variable with 5 levels will be expanded into 4 different columns in your modeling matrix. There’s no way for R to know it should do this unless it has some extra information in the form of the factor class. From this point of view, setting ‘stringsAsFactors = TRUE’ when reading in tabular data makes total sense. If the data is just going to go into a regression model, then R is doing the right thing.</p>
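<p>As a small illustration of that expansion (the variable and its values here are made up):</p>
<pre><code>## A factor with 5 levels becomes 4 dummy columns (plus the intercept)
## in the design matrix that lm() and glm() build internally:
region <- factor(c("north", "south", "east", "west", "central"))
model.matrix(~ region)
## Columns: (Intercept), regioneast, regionnorth, regionsouth, regionwest
## ("central" is the baseline level, so it gets no column of its own)
</code></pre>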
<p>There’s also a more obscure reason. Factor variables are encoded as integers in their underlying representation. So a variable with values “disease” and “non-disease” will be encoded as 1 and 2 in the underlying representation. Roughly speaking, since integers only require 4 bytes on most systems, the conversion from string to integer actually saved some space for long strings. All that had to be stored were the integer codes and the level labels. That way you didn’t have to repeat the strings “disease” and “non-disease” for as many observations as you had, which would have been wasteful.</p>
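<p>You can see that underlying representation directly (a small sketch, using the disease example above):</p>
<pre><code>status <- factor(c("disease", "non-disease", "disease", "disease"))
typeof(status)       # "integer": the data are stored as integer codes
unclass(status)
#> [1] 1 2 1 1
#> attr(,"levels")
#> [1] "disease"     "non-disease"
## The strings themselves are stored only once, in the levels attribute.
</code></pre>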
<p>Around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code thanks to Seth Falcon. What this meant was that effectively, character strings were hashed to an integer representation and stored in a global table in R. Anytime a given string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factor. Of course, you still needed to use ‘factors’ for the modeling functions.</p>
<p>The difference nowadays is that R is being used by a very wide variety of people doing all kinds of things the creators of R never envisioned. This is, of course, wonderful, but it introduces lots of use cases that were not originally planned for. I find that most often, the people complaining about ‘stringsAsFactors’ not being FALSE are people who are doing things that are not the traditional statistical modeling things (things that old-time statisticians like me used to do). In fact, I would argue that if you’re upset about ‘stringsAsFactors = TRUE’, then it’s a pretty good indicator that you’re either not a statistician by training, or you’re doing non-traditional statistical things.</p>
<p>For example, in genomics, you might have the names of the genes in one column of data. It really doesn’t make sense to encode these as factors because they won’t be used in any modeling function. They’re just labels, essentially. And because of CHARSXP hashing, you don’t gain anything from an efficiency standpoint by converting them to factors either.</p>
<p>But of course, given the long-standing behavior of R, many people depend on the default conversion of characters to factors when reading in tabular data. Changing this default would likely result in an equal number of people complaining about ‘stringsAsFactors’.</p>
<p>I fully expect that this blog post will now make all R users happy. If you think I’ve missed something from this unauthorized biography, please let me know on Twitter (@rdpeng).</p>
The statistics department Moneyball opportunity
2015-07-17T09:21:16+00:00
http://simplystats.github.io/2015/07/17/the-statistics-department-moneyball-opportunity
<p><a href="https://en.wikipedia.org/wiki/Moneyball">Moneyball</a> is a book and a movie about Billy Beane. It makes statisticians look awesome and I loved the movie. I loved it so much I’m putting the movie trailer right here.</p>
<p>The basic idea behind Moneyball was that the Oakland Athletics were able to build a very successful baseball team on a tight budget by valuing skills that many other teams undervalued. In baseball those skills were things like on-base percentage and slugging percentage. By correctly valuing these skills and their impact on a team’s winning percentage, the A’s were able to build one of the most successful regular season teams on a minimal budget. This graph shows what an outlier they were, from a nice <a href="http://fivethirtyeight.com/features/billion-dollar-billy-beane/">fivethirtyeight analysis</a>.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/oakland.png"><img class="aligncenter wp-image-4146" src="http://simplystatistics.org/wp-content/uploads/2015/07/oakland-1024x818.png" alt="oakland" width="500" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/oakland-1024x818.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/07/oakland-250x200.png 250w, http://simplystatistics.org/wp-content/uploads/2015/07/oakland.png 1150w" sizes="(max-width: 500px) 100vw, 500px" /></a></p>
<p>I think that the data science/data analysis revolution that we have seen over the last decade has created a similar moneyball opportunity for statistics and biostatistics departments. Traditionally in these departments the highest-value activities have been publishing in a select number of important statistics journals (JASA, JRSS-B, Annals of Statistics, Biometrika, Biometrics and more recently journals like Biostatistics and Annals of Applied Statistics). But there are some hugely valuable ways to contribute to statistics/data science that don’t necessarily end with papers in those journals, like:</p>
<ol>
<li>Creating good, well-documented, and widely used software</li>
<li>Being primarily an excellent collaborator who brings in grant money and is a major contributor to science through statistics</li>
<li>Publishing in top scientific journals rather than statistics journals</li>
<li>Being a good scientific communicator who can attract talent</li>
<li>Being a statistics educator who can build programs</li>
</ol>
<p>Another thing that is undervalued is not having a Ph.D. in statistics or biostatistics. The fact that these skills are undervalued right now means that up-and-coming departments could identify and recruit talented people that might be missed by other departments and have a huge impact on the world. One tricky thing is that the rankings of departments are based on the votes of people from other departments who may or may not value these same skills. Another tricky thing is that many industry data science positions put incredibly high value on these skills and so you might end up competing with them for people - a competition that will definitely drive up the market value of these data scientist/statisticians. But for the folks that want to stay in academia, now is a prime opportunity.</p>
The Mozilla Fellowship for Science
2015-07-10T11:10:26+00:00
http://simplystats.github.io/2015/07/10/the-mozilla-fellowship-for-science
<p>This looks like an <a href="https://www.mozillascience.org/fellows">interesting opportunity</a> for grad students, postdocs, and early career researchers:</p>
<blockquote>
<p>We’re looking for researchers with a passion for open source and data sharing, already working to shift research practice to be more collaborative, iterative and open. Fellows will spend 10 months starting September 2015 as community catalysts at their institutions, mentoring the next generation of open data practitioners and researchers and building lasting change in the global open science community.</p>
<p>Throughout their fellowship year, chosen fellows will receive training and support from Mozilla to hone their skills around open source and data sharing. They will also craft code, curriculum and other learning resources that help their local communities learn open data practices, and teach forward to their peers.</p>
</blockquote>
<p>Here’s what you get:</p>
<blockquote>
<p>Fellows will receive:</p>
<ul>
<li>A stipend of $60,000 USD, paid in 10 monthly installments.</li>
<li>One-time health insurance supplement for Fellows and their families, ranging from $3,500 for single Fellows to $7,000 for a couple with two or more children.</li>
<li>One-time childcare allotment for families with children of up to $6,000.</li>
<li>Allowance of up to $3,000 towards the purchase of laptop computer, digital cameras, recorders and computer software; fees for continuing studies or other courses, research fees or payments, to the extent related to the fellowship.</li>
<li>All approved fellowship trips – domestic and international – are covered in full.</li>
</ul>
</blockquote>
<p>Deadline is August 14.</p>
JHU, UMD researchers are getting a really big Big Data center
2015-07-08T16:26:45+00:00
http://simplystats.github.io/2015/07/08/jhu-umd-researchers-are-getting-a-really-big-big-data-center
<p>From <a href="http://technical.ly/baltimore/2015/07/07/jhu-umd-big-data-maryland-advanced-research-computing-center-marcc/">Technical.ly Baltimore</a>:</p>
<blockquote>
<p>A nondescript, 3,700-square-foot building on Johns Hopkins’ Bayview campus will house a new data storage and computing center for university researchers. The $30 million Maryland Advanced Research Computing Center (MARCC) will be available to faculty from JHU and the University of Maryland, College Park.</p>
</blockquote>
<p>The web site has a pretty cool time-lapse video of the construction of the computing center. There’s also a bit more detail at the <a href="http://hub.jhu.edu/2015/07/06/computing-center-bayview">JHU Hub</a> site.</p>
The Massive Future of Statistics Education
2015-07-03T10:17:24+00:00
http://simplystats.github.io/2015/07/03/the-massive-future-of-statistics-education
<p><em>NOTE: This post was written as a chapter for the not-yet-released Handbook on Statistics Education. </em></p>
<p>Data are eating the world, but our collective ability to analyze data is going on a starvation diet.</p>
<div id="content">
<p>
Everywhere you turn, data are being generated somehow. By the time you read this piece, you’ll probably have collected some data. (For example this piece has 2,072 words). You can’t avoid data—it’s coming from all directions.
</p>
<p>
So what do we do with it? For the most part, nothing. There’s just too much data being spewed about. But for the data that we <em>are</em> interested in, we need to know the appropriate methods for thinking about and analyzing them. And by “we”, I mean pretty much everyone.
</p>
<p>
In the future, everyone will need some data analysis skills. People are constantly confronted with data and the need to make choices and decisions from the raw data they receive. Phones deliver information about traffic, we have ratings about restaurants or books, and even rankings of hospitals. High school students can obtain complex and rich information about the colleges to which they’re applying while admissions committees can get real-time data on applicants’ interest in the college.
</p>
<p>
Many people already have heuristic algorithms to deal with the data influx—and these algorithms may serve them well—but real statistical thinking will be needed for situations beyond choosing which restaurant to try for dinner tonight.
</p>
<p>
<strong>Limited Capacity</strong>
</p>
<p>
The McKinsey Global Institute, in a <a href="http://www.mckinsey.com/insights/americas/us_game_changers">highly cited report</a>, predicted that there would be a shortage of “data geeks” and that by 2018 there would be between 140,000 and 190,000 unfilled positions in data science. In addition, there will be an estimated 1.5 million people in managerial positions who will need to be trained to manage data scientists and to understand the output of data analysis. If history is any guide, it’s likely that these positions will get filled by people, regardless of whether they are properly trained. The potential consequences are disastrous as untrained analysts interpret complex big data coming from myriad sources of varying quality.
</p>
<p>
Who will provide the necessary training for all these unfilled positions? The field of statistics’ current system of training people and providing them with master’s degrees and PhDs is woefully inadequate to the task. In 2013, the top 10 largest statistics master’s degree programs in the U.S. graduated a total of <a href="http://community.amstat.org/blogs/steve-pierson/2014/02/09/largest-graduate-programs-in-statistics">730 people</a>. At this rate we will never train the people needed. While statisticians have greatly benefited from the sudden and rapid increase in the amount of data flowing around the world, our capacity for scaling up the needed training for analyzing those data is essentially nonexistent.
</p>
<p>
On top of all this, I believe that the McKinsey report is a gross underestimation of how many people will need to be trained in <em>some</em> data analysis skills in the future. Given how much data is being generated every day, and how critical it is for everyone to be able to intelligently interpret these data, I would argue that it’s necessary for <em>everyone</em> to have some data analysis skills. Needless to say, it’s foolish to suggest that everyone go get a master’s or even bachelor’s degrees in statistics. We need an alternate approach that is both high-quality and scalable to a large population over a short period of time.
</p>
<p>
<strong>Enter the MOOCs</strong>
</p>
<p>
In April of 2014, Jeff Leek, Brian Caffo, and I launched the <a href="https://www.coursera.org/specialization/jhudatascience/1">Johns Hopkins Data Science Specialization</a> on the Coursera platform. This is a sequence of nine courses that intends to provide a “soup-to-nuts” training in data science for people who are highly motivated and have some basic mathematical and computing background. The sequence of the nine courses follows what we believe is the essential “data science process”, which is
</p>
<ol>
<li>
Formulating a question that can be answered with data
</li>
<li>
Assembling, cleaning, tidying data relevant to a question
</li>
<li>
Exploring data, checking, eliminating hypotheses
</li>
<li>
Developing a statistical model
</li>
<li>
Making statistical inference
</li>
<li>
Communicating findings
</li>
<li>
Making the work reproducible
</li>
</ol>
<p>
We took these basic steps and designed courses around each one of them.
</p>
<p>
Each course is provided in a massive open online format, which means that many thousands of people typically enroll in each course every time it is offered. The learners in the courses do homework assignments, take quizzes, and peer assess the work of others in the class. All grading and assessment is handled automatically so that the process can scale to arbitrarily large enrollments. As an example, the April 2015 session of the R Programming course had nearly 45,000 learners enrolled. Each class is exactly 4 weeks long and every class runs every month.
</p>
<p>
We developed this sequence of courses in part to address the growing demand for data science training and education across the globe. Our background as biostatisticians was very closely aligned with the training needs of people interested in data science because, essentially, data science is <em>what we do every single day</em>. Indeed, one curriculum rule that we had was that we couldn’t include something if we didn’t in fact use it in our own work.
</p>
<p>
The sequence has a substantial amount of standard statistics content, such as probability and inference, linear models, and machine learning. It also has non-standard content, such as git, GitHub, R programming, Shiny, and Markdown. Together, the sequence covers the full spectrum of tools that we believe will be needed by the practicing data scientist.
</p>
<p>
For those who complete the nine courses, there is a capstone project at the end that involves taking all of the skills in the courses and developing a data product. For our first capstone project we partnered with <a href="http://swiftkey.com/en/">SwiftKey</a>, a predictive text analytics company, to develop a project where learners had to build a statistical model for predicting words in a sentence. This project involves taking unstructured, messy data, processing it into an analyzable form, developing a statistical model while making tradeoffs for efficiency and accuracy, and creating a Shiny app to show off their model to the public.
</p>
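<p>
To give a flavor of the kind of model learners build (this is a toy sketch, not any learner’s submitted solution and far simpler than what the capstone requires), a bare-bones bigram predictor in R might look like this:
</p>
<pre><code>## Toy next-word predictor: pick the most frequent bigram continuation.
corpus  <- c("the cat sat on the mat", "the cat ran away")
words   <- unlist(strsplit(tolower(corpus), "\\s+"))
bigrams <- paste(head(words, -1), tail(words, -1))  # sentences concatenated for simplicity

predict_next <- function(word) {
  candidates <- bigrams[startsWith(bigrams, paste0(word, " "))]
  if (length(candidates) == 0) return(NA_character_)
  sub(".* ", "", names(which.max(table(candidates))))  # most common continuation
}

predict_next("the")   # "cat" ("the cat" occurs twice, "the mat" once)
</code></pre>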
<p>
<strong>Degree Alternatives</strong>
</p>
<p>
The Data Science Specialization is not a formal degree program offered by Johns Hopkins University—learners who complete the sequence do not get any Johns Hopkins University credit—and so one might wonder what the learners get out of the program (besides, of course, the knowledge itself). To begin with, the sequence is completely portfolio based, so learners complete projects that are immediately viewable by others. This allows others to evaluate a learner’s ability on the spot with real code or data analysis.
</p>
<p>
All of the lecture content is openly available and hosted on GitHub, so outsiders can view the content and see for themselves what is being taught. This gives outsiders an opportunity to evaluate the program directly rather than have to rely on the sterling reputation of the institution teaching the courses.
</p>
<p>
Each learner who completes a course using Coursera’s “Signature Track” (which currently costs $49 per course) can get a badge on their LinkedIn profile, which shows that they completed the course. This can often be as valuable as a degree or other certification as recruiters scouring LinkedIn for data scientist positions will be able to see our completers’ certifications in various data science courses.
</p>
<p>
Finally, the scale and reach of our specialization immediately creates a large alumni social network that learners can take advantage of. As of March 2015, there were approximately 700,000 people who had taken at least one course in the specialization. These 700,000 people have a shared experience that, while not quite at the level of a college education, still is useful for forging connections between people, especially when people are searching around for jobs.
</p>
<p>
<strong>Early Numbers</strong>
</p>
<p>
So far, the sequence has been wildly successful. It averaged 182,507 enrollees a month for the first year in existence. The overall course completion rate was about 6% and the completion rate amongst those in the “Signature Track” (i.e. paid enrollees) was 67%. In October of 2014, barely 7 months since the start of the specialization, we had 663 learners enroll in the capstone project.
</p>
<p>
<strong>Some Early Lessons</strong>
</p>
<p>
From running the Data Science Specialization for over a year now, we have learned a number of lessons, some of which were unexpected. Here, I summarize the highlights of what we’ve learned.
</p>
<p>
<strong>Data Science as Art and Science. </strong>Ironically, although the word “Science” appears in the name “Data Science”, there’s actually quite a bit about the practice of data science that doesn’t really resemble science at all. Much of what statisticians do in the act of data analysis is intuitive and ad hoc, with each data analysis being viewed as a unique flower.
</p>
<p>
When attempting to design data analysis assignments that could be graded at scale with tens of thousands of people, we discovered that designing the rubrics for grading these assignments was not trivial. The reason is that our understanding of what makes a “good” analysis different from a bad one is not well articulated. We could not identify any community-wide understanding of what the components of a good analysis are. What are the “correct” methods to use in a given data analysis situation? What is definitely the “wrong” approach?
</p>
<p>
Although each one of us had been doing data analysis for the better part of a decade, none of us could succinctly write down what the process was and how to recognize when it was being done wrong. To paraphrase Daryl Pregibon from his <a href="http://www.nap.edu/catalog/1910/the-future-of-statistical-software-proceedings-of-a-forum">1991 talk at the National Academies of Science</a>, we had a process that we regularly espoused but barely understood.
</p>
<p>
<strong>Content vs. Curation</strong>. Much of the content that we put online is available elsewhere. With YouTube, you can find high-quality videos on almost any topic, and our videos are not really that much better. Furthermore, the subject matter that we were teaching was in no way proprietary. The linear models that we teach are the same linear models taught everywhere else. So what exactly was the value we were providing?
</p>
<p>
Searching on YouTube requires that you know what you are looking for. This is a problem for people who are just getting into an area. Effectively, what we provided was a <em>curation</em> of all the knowledge that’s out there on the topic of data science (we also added our own quirky spin). Curation is hard, because you need to make definitive choices between what is and is not a core element of a field. But curation is essential for learning a field for the uninitiated.
</p>
<p>
<strong>Skill sets vs. Certification</strong>. Because we knew that we were not developing a true degree program, we knew we had to develop the program in a manner such that the learners could quickly see for themselves the value they were getting out of it. This led us to take a portfolio approach where learners produced things that could be viewed publicly.
</p>
<p>
In part because of the self-selection of the population seeking to learn data science skills, our learners were more interested in being able to demonstrate the skills taught in the course rather than an abstract (but official) certification as might be gotten in a degree program. This is not unlike going to a music conservatory, where the output is your ability to play an instrument rather than the piece of paper you receive upon graduation. We feel that giving people the ability to demonstrate skills and skill sets is perhaps more important than official degrees in some instances because it gives employers a concrete sense of what a person is capable of doing.
</p>
<p>
<strong>Conclusions</strong>
</p>
<p>
As of April 2015, we had a total of 1,158 learners complete the entire specialization, including the capstone project. Given these numbers and our rate of completion for the specialization as a whole, we believe we are on our way to achieving our goal of creating a highly scalable program for training people in data science skills. Of course, this program alone will not be sufficient for all of the data science training needs of society. But we believe that the approach that we’ve taken, using non-standard MOOC channels, focusing on skill sets instead of certification, and emphasizing our role in curation, is a rich opportunity for the field of statistics to explore in order to educate the masses about our important work.
</p>
</div>
Looks like this R thing might be for real
2015-07-02T10:01:45+00:00
http://simplystats.github.io/2015/07/02/looks-like-this-r-thing-might-be-for-real
<p>Not sure how I missed this, but the Linux Foundation just announced the <a href="http://www.linuxfoundation.org/news-media/announcements/2015/06/linux-foundation-announces-r-consortium-support-millions-users">R Consortium</a> for supporting the “world’s most popular language for analytics and data science and support the rapid growth of the R user community”. From the Linux Foundation:</p>
<blockquote>
<p>The R language is used by statisticians, analysts and data scientists to unlock value from data. It is a free and open source programming language for statistical computing and provides an interactive environment for data analysis, modeling and visualization. The R Consortium will complement the work of the R Foundation, a nonprofit organization based in Austria that maintains the language. The R Consortium will focus on user outreach and other projects designed to assist the R user and developer communities.</p>
<p>Founding companies and organizations of the R Consortium include The R Foundation, Platinum members Microsoft and RStudio; Gold member TIBCO Software Inc.; and Silver members Alteryx, Google, HP, Mango Solutions, Ketchum Trading and Oracle.</p>
</blockquote>
How Airbnb built a data science team
2015-07-01T08:39:29+00:00
http://simplystats.github.io/2015/07/01/how-airbnb-built-a-data-science-team
<p>From <a href="http://venturebeat.com/2015/06/30/how-we-scaled-data-science-to-all-sides-of-airbnb-over-5-years-of-hypergrowth/">Venturebeat</a>:</p>
<blockquote>
<p>Back then we knew so little about the business that any insight was groundbreaking; data infrastructure was fast, stable, and real-time (I was querying our production MySQL database); the company was so small that everyone was in the loop about every decision; and the data team (me) was aligned around a singular set of metrics and methodologies.</p>
<p>But five years and 43,000 percent growth later, things have gotten a bit more complicated. I’m happy to say that we’re also more sophisticated in the way we leverage data, and there’s now a lot more of it. The trick has been to manage scale in a way that brings together the magic of those early days with the growing needs of the present — a challenge that I know we aren’t alone in facing.</p>
</blockquote>
How public relations and the media are distorting science
2015-06-24T10:07:45+00:00
http://simplystats.github.io/2015/06/24/how-public-relations-and-the-media-are-distorting-science
<p>Throughout history, engineers, medical doctors and other applied scientists have helped convert basic science discoveries into products, public goods and policy that have greatly improved our quality of life. With rare exceptions, it has taken years if not decades to establish these discoveries. And even the exceptions stand on the shoulders of incremental contributions. The researchers that produce this knowledge go through a slow and painstaking process to reach these achievements.</p>
<p>In contrast, most science related media reports that grab the public’s attention fall into three categories:</p>
<ol>
<li>The <em>exaggerated big discovery</em>: Recent examples include the discovery of <a href="http://www.cbsnews.com/news/dangerous-pathogens-and-mystery-microbes-ride-the-subway/">the bubonic plague in the NYC subway</a>, <a href="http://www.bbc.com/news/science-environment-32287609">liquid water on Mars</a>, and <a href="http://www.nytimes.com/2015/05/24/opinion/sunday/infidelity-lurks-in-your-genes.html?ref=opinion&_r=3">the infidelity gene</a>.</li>
<li><em>Over-promising</em>: These try to explain a complicated basic science finding and, in the case of biomedical research, then speculate without much explanation that the finding will ”lead to a deeper understanding of diseases and new ways to treat or cure them”.</li>
<li><em>Science is broken</em>: These tend to report an anecdote about an allegedly corrupt scientist, maybe cite the “Why Most Published Research Findings are False” paper, and then extrapolate speculatively.</li>
</ol>
<p>In my estimation, despite the attention-grabbing headlines, the great majority of the subject matter included in these reports will not have an impact on our lives and will not even make it into scientific textbooks. So does science still have anything to offer? Reports of the third category have even scientists particularly worried. I, however, remain optimistic. First, I do not see any empirical evidence showing that the negative effects of the lack of reproducibility are worse now than 50 years ago. Furthermore, although not widely reported in the lay press, I continue to see bodies of work built by several scientists over several years or decades with much promise of leading to tangible improvements to our quality of life. Recent advances that I am excited about include <a href="http://physics.gmu.edu/~pnikolic/articles/Topological%20insulators%20(Physics%20World,%20February%202011).pdf">topological insulators</a>, <a href="http://www.ncbi.nlm.nih.gov/pubmed/24955707">PD-1 pathway inhibitors</a>, <a href="https://en.wikipedia.org/wiki/CRISPR">clustered regularly interspaced short palindromic repeats</a>, advances in solar energy technology, and prosthetic robotics.</p>
<p>However, there is one general aspect of science that I do believe has become worse. Specifically, it’s a shift in how much scientists jockey for media attention, even if it’s short-lived. Instead of striving for having a sustained impact on our field, which may take decades to achieve, an increasing number of scientists seem to be placing more value on appearing in the New York Times, giving a Ted Talk or having a blog or tweet go viral. As a consequence, too many of us end up working on superficial short term challenges that, with the help of a professionally crafted press release, may result in an attention grabbing media report. NB: I fully support science communication efforts, but not when the primary purpose is garnering attention, rather than educating.</p>
<p>My concern spills over to funding agencies and philanthropic organizations as well. Consider the following two options. Option 1: be the funding agency representative tasked with organizing a big science project with a well-oiled PR machine. Option 2: be the funding agency representative in charge of several small projects, one of which may with low, but non-negligible, probability result in a Nobel Prize 30 years down the road. In the current environment, I see a preference for option 1.</p>
<p>I am also concerned about how this atmosphere may negatively affect societal improvements within science. Publicly shaming transgressors on Twitter or expressing one’s outrage on a blog post can garner many social media clicks. However, these may have a smaller positive impact than mundane activities such as serving on a committee that, after several months of meetings, implements incremental, yet positive, changes. Time and energy spent on trying to increase internet clicks is time and energy we don’t spend on the tedious administrative activities that are needed to actually affect change.</p>
<p>Because so many of the scientists that thrive in this atmosphere of short-lived media reports are disproportionately rewarded, I imagine investigators starting their careers feel some pressure to garner some media attention of their own. Furthermore, their view of how they are evaluated may be highly biased because evaluators that ignore media reports and focus more on the specifics of the scientific content tend to be less visible. So if you want to spend your academic career slowly building a body of work with the hopes of being appreciated decades from now, you should not think that it is hopeless based on what is, perhaps, a distorted view of how we are currently being evaluated.</p>
<p>Update: changed topological insulators links to <a href="http://scienceblogs.com/principles/2010/07/20/whats-a-topological-insulator/">these</a> <a href="http://physics.gmu.edu/~pnikolic/articles/Topological%20insulators%20(Physics%20World,%20February%202011).pdf">two</a>. <a href="http://spectrum.ieee.org/semiconductors/materials/topological-insulators">Here</a> is one more. Via David S.</p>
Interview at Leanpub
2015-06-16T21:49:33+00:00
http://simplystats.github.io/2015/06/16/interview-at-leanpub
<p>A few weeks ago I sat down with Len Epp over at Leanpub to talk about my recently published book <em><a href="https://leanpub.com/rprogramming">R Programming for Data Science</a></em>. So far, I’ve only published one book through Leanpub but I’m a huge fan. They’ve developed a system that is, in my opinion, perfect for academic publishing. The book’s written in Markdown and they compile it into PDF, ePub, and mobi formats automatically.</p>
<p>The full interview transcript is over at the <a href="http://blog.leanpub.com/2015/06/roger-peng.html">Leanpub blog</a>. If you want to listen to the audio of the interview, you can subscribe to the Leanpub <a href="https://itunes.apple.com/ca/podcast/id517117137?mt=2">podcast on iTunes</a>.</p>
<p><a href="https://leanpub.com/rprogramming"><em>R Programming for Data Science</em></a> is available at Leanpub for a suggested price of $15 (but you can get it for free if you want). R code files, datasets, and video lectures are available through the various add-on packages. Thanks to all of you who’ve already bought a copy!</p>
Johns Hopkins Data Science Specialization Capstone 2 Top Performers
2015-06-10T14:33:09+00:00
http://simplystats.github.io/2015/06/10/johns-hopkins-data-science-specialization-captsone-2-top-performers
<p><em>The second capstone session of the <a href="https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage">Johns Hopkins Data Science Specialization</a> concluded recently. This time, we had 1,040 learners sign up to participate in the session, which again featured a project developed in collaboration with the amazingly innovative folks at <a href="http://swiftkey.com/en/">SwiftKey</a>. </em></p>
<p><em>We’ve identified the learners listed below as the top performers in this capstone session. This is an incredibly talented group of people who have worked very hard throughout the entire nine-course specialization. Please take some time to read their stories and look at their work. </em></p>
<h1 id="ben-apple">Ben Apple</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple.jpg"><img class="aligncenter size-medium wp-image-4091" src="http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple-300x285.jpg" alt="Ben_Apple" width="300" height="285" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple-300x285.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple.jpg 360w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>Ben Apple is a Data Scientist and Enterprise Architect with the Department of Defense. Mr. Apple holds an MS in Information Assurance and is a PhD candidate in Information Sciences.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4>
<p>As a self-trained data scientist, I was looking for a program that would formalize my established skills while expanding my data science knowledge and toolbox.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4>
<p>The capstone project was the most demanding aspect of the program. As such, I am most proud of the final project. The project stretched each of us beyond the standard coursework of the program and was quite satisfying.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4>
<p>To open doors so that I may further my research into the operational value of applying data science thought and practice to analytics of my domain.</p>
<p><strong>Final Project: </strong><a href="https://bengapple.shinyapps.io/coursera_nlp_capstone">https://bengapple.shinyapps.io/coursera_nlp_capstone</a></p>
<p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/bengapple/71376">http://rpubs.com/bengapple/71376</a></p>
<h1 id="ivan-corneillet">Ivan Corneillet</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet.jpg"><img class="aligncenter size-medium wp-image-4092" src="http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet-300x300.jpg" alt="Ivan.Corneillet" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet-300x300.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet-200x200.jpg 200w, http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet.jpg 400w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>A technologist, thinker, and tinkerer, Ivan facilitates the establishment of start-up companies by advising these companies about the hiring process, product development, and technology development, including big data, cloud computing, and cybersecurity. In his 17-year career, Ivan has held a wide range of engineering and management positions at various Silicon Valley companies. Ivan is a recent Wharton MBA graduate, and he previously earned his master’s degree in computer science from the Ensimag, and his master’s degree in electrical engineering from Université Joseph Fourier, both located in France.</p>
<p><strong>Why did you take the JHU Data Science Specialization?</strong></p>
<p>There are three reasons why I decided to enroll in the JHU Data Science Specialization. First, fresh from college, my formal education was best suited for scaling up the Internet’s infrastructure. However, because every firm in every industry now creates products and services from analyses of data, I challenged myself to learn about Internet-scale datasets. Second, I am a big supporter of MOOCs. I do not believe that MOOCs should replace traditional education; however, I do believe that MOOCs and traditional education will eventually coexist in the same way that open-source and closed-source software does (read my blog post for more information on this topic: http://ivantur.es/16PHild). Third, the Johns Hopkins University brand certainly motivated me to choose their program. With a great name comes a great curriculum and fantastic professors, right?</p>
<p>Once I had completed the program, I was not disappointed at all. I had read a blog post that explained that the JHU Data Science Specialization was only a start to learning about data science. I certainly agree, but I would add that this program is a great start, because the curriculum emphasizes information that is crucial, while providing additional resources to those who wish to deepen their understanding of data science. My thanks to Professors Caffo, Leek, and Peng; the TAs, and Coursera for building and delivering this track!</p>
<p><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></p>
<p>The capstone project made for a very rich and exhilarating learning experience, and was my favorite course in the specialization. Because I did not have prior knowledge in natural language processing (NLP), I had to conduct a fair amount of research. However, the program’s minimal-guidance approach mimicked a real-world environment, and gave me the opportunity to leverage my experience with developing code and designing products to get the most out of the skillset taught in the track. The result was that I created a data product that implemented state-of-the-art NLP algorithms using what I think are the best technologies (i.e., C++, JavaScript, R, Ruby, and SQL), given the choices that I had made. Bringing everything together is what made me the most proud. Additionally, my product capabilities are a far cry from IBM’s Watson, but while I am well versed in supercomputer hardware, this track helped me to gain a much deeper appreciation of Watson’s AI.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-1"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4>
<p>Thanks to the broad skillset that the specialization covered, I feel confident wearing a data science hat. The concepts and tools covered in this program helped me to better understand the concerns that data scientists have and the challenges they face. From a business standpoint, I am also better equipped to identify the opportunities that lie ahead.</p>
<p><strong>Final Project: </strong><a href="https://paspeur.shinyapps.io/wordmaster-io/">https://paspeur.shinyapps.io/wordmaster-io/</a></p>
<p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/paspeur/wordmaster-io">http://rpubs.com/paspeur/wordmaster-io</a></p>
<h1 id="oscar-de-león">Oscar de León</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon.jpg"><img class="aligncenter size-medium wp-image-4093" src="http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-300x225.jpg" alt="Oscar_De_Leon" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-120x90.jpg 120w, http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-1024x768.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-260x195.jpg 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>Oscar is an assistant researcher at a research institute in a developing country; he graduated as a licentiate in biochemistry and microbiology in 2010 from the same university that hosts the institute. He has always loved technology, programming, and statistics, and has engaged in self-learning of these subjects from an early age, finally using his abilities in the health-related research in which he has been involved since 2008. He is now working on the design, execution, and analysis of various research projects, consulting for other researchers and students, and is looking forward to developing his academic career in biostatistics.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-1"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4>
<p>I wanted to integrate my R experience into a more comprehensive data analysis workflow, which is exactly what this specialization offers. This was in line with the objectives of my position at the research institute in which I work, so I presented a study plan to my supervisor and she approved it. I also wanted to engage in an activity which enabled me to document my abilities in a verifiable way, and a Coursera Specialization seemed like a good option.</p>
<p>Additionally, I’ve followed the JHSPH group’s courses since the first offering of Mathematical Biostatistics Bootcamp in November 2012. They have proved the standards and quality of education at their institution, and it was not something to let go by.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-1"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4>
<p>I’m not one to usually interact with other students, and certainly didn’t do it during most of the specialization courses, but I decided to try out the fora on the Capstone project. It was wonderful; sharing ideas with, and receiving criticism from, my peers provided a very complete learning experience. After all, my contributions ended up being appreciated by the community, and a few posts saying so were very rewarding. This rekindled my passion for teaching, and I’ll try to engage in it more from now on.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-2"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4>
<p>First, I’ll file it with HR at my workplace, since our research projects paid for the specialization <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>I plan to use the certificate as a credential for data analysis with R when it is relevant. For example, I’ve been interested in offering an R workshop for life sciences students and researchers at my University, and this certificate (and the projects I prepared during the specialization) could help me show I have a working knowledge on the subject.</p>
<p><strong>Final Project: </strong><a href="https://odeleon.shinyapps.io/ngram/">https://odeleon.shinyapps.io/ngram/</a></p>
<p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/chemman/n-gram">http://rpubs.com/chemman/n-gram</a></p>
<h1 id="jeff-hedberg">Jeff Hedberg</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Jeff_Hedberg.jpg"><img class="aligncenter size-full wp-image-4094" src="http://simplystatistics.org/wp-content/uploads/2015/06/Jeff_Hedberg.jpg" alt="Jeff_Hedberg" width="200" height="200" /></a></p>
<p>I am passionate about turning raw data into actionable insights that solve relevant business problems. I also greatly enjoy leading large, multi-functional projects with impact in areas pertaining to machine and/or sensor data. I have a Mechanical Engineering Degree and an MBA, in addition to a wide range of Data Science (IT/Coding) skills.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-2"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4>
<p>I was looking to gain additional exposure into Data Science as a current practitioner, and thought this would be a great program.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-2"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4>
<p>I am most proud of completing all courses with distinction (top of peers). Also, I’m proud to have achieved full points on my Capstone project having no prior experience in Natural Language Processing.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-3"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4>
<p>I am going to add this to my Resume and LinkedIn Profile. I will use it to solidify my credibility as a data science practitioner of value.</p>
<p><strong>Final Project: </strong><a href="https://hedbergjeffm.shinyapps.io/Next_Word_Prediction/">https://hedbergjeffm.shinyapps.io/Next_Word_Prediction/</a></p>
<p><strong>Project Slide Deck: </strong><a href="https://rpubs.com/jhedbergfd3s/74960">https://rpubs.com/jhedbergfd3s/74960</a></p>
<h1 id="hernán-martínez-foffani">Hernán Martínez-Foffani</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani.jpg"><img class="aligncenter size-medium wp-image-4095" src="http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-300x225.jpg" alt="Hernán_Martínez-Foffani" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-120x90.jpg 120w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-1024x768.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-260x195.jpg 260w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani.jpg 1256w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>I was born in Argentina but now I’m settled in Spain. I’ve been working in computer technology since the eighties, in digital networks, programming, consulting, project management. Now, as CTO in a software company, I lead a small team of programmers developing a supply chain management app.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-3"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4>
<p>In my opinion the curriculum is carefully designed with a nice balance between theory and practice. The JHU authoring and the teachers’ widely known prestige ensure the content quality. The ability to choose the learning pace, one per month in my case, fits everyone’s schedule.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-3"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4>
<p>The capstone definitely. It resulted in a fresh and interesting challenge. I sweat a lot, learned much more and in the end had a lot of fun.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-4"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4>
<p>While for the time being I don’t have any specific plan for the certificate, it’s a beautiful reward for the effort done.</p>
<p><strong>Final Project: </strong><a href="https://herchu.shinyapps.io/shinytextpredict/">https://herchu.shinyapps.io/shinytextpredict/</a></p>
<p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/herchu1/shinytextprediction">http://rpubs.com/herchu1/shinytextprediction</a></p>
<h1 id="francois-schonken">Francois Schonken</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Francois-Schonken1.jpg"><img class="aligncenter size-medium wp-image-4097" src="http://simplystatistics.org/wp-content/uploads/2015/06/Francois-Schonken1-197x300.jpg" alt="Francois Schonken" width="197" height="300" /></a></p>
<p>I’m a 36-year-old South African male, born and raised. I recently (4 years now) immigrated to lovely Melbourne, Australia. I wrapped up a BSc (Hons) Computer Science with specialization in Computer Systems back in 2001. Next I co-founded a small boutique Software Development house operating from South Africa. I wrapped up my MBA from Melbourne Business School in 2013, and now I consult for my small boutique Software Development house and 2 (very) small internet start-ups.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-4"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4>
<p>One of the core subjects in my MBA was Data Analysis, basically an MBA take on undergrad Statistics with focus on application over theory (not that there was any shortage of theory). Waiting in a lobby room some 6 months later, I was paging through the financial section of a business-focused weekly. I came across an article explaining how a Melbourne local applied a language called R to solve a grammatically and statistically challenging issue. The rest, as they say, is history.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-4"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4>
<p>I’m quite proud of both my Developing Data Products and Capstone projects, but for me these tangible outputs merely served as a vehicle to better understand a different way of thinking about data. I’ve spent most of my Software Development life dealing with one form or another of RDBMS (Relational Database Management System). This, in my experience, leads to a very set-oriented way of thinking about data.</p>
<p>I’m most proud of developing a new tool in my “Skills Toolbox” that I consider highly complementary to both my Software and Business outlook on projects.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-5"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4>
<p>Honestly, I had not planned on using my Certificate in and of itself. The skills I’ve acquired have already helped shape my thinking in designing an in-house web-based consulting collaboration platform.</p>
<p>I do not foresee this being the last time I’ll be applying Data Science thinking moving forward on my journey.</p>
<p><strong>Final Project: </strong><a href="https://schonken.shinyapps.io/WordPredictor">https://schonken.shinyapps.io/WordPredictor</a></p>
<p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/schonken/sentence-builder">http://rpubs.com/schonken/sentence-builder</a></p>
<h1 id="david-j-tagler">David J. Tagler</h1>
<p>David is passionate about solving the world’s most important and challenging problems. His expertise spans chemical/biomedical engineering, regenerative medicine, healthcare technology management, information technology/security, and data science/analysis. David earned his Ph.D. in Chemical Engineering from Northwestern University and B.S. in Chemical Engineering from the University of Notre Dame.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-5"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4>
<p>I enrolled in this specialization in order to advance my statistics, programming, and data analysis skills. I was interested in taking a series of courses that covered the entire data science pipeline. I believe that these skills will be critical for success in the future.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-5"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4>
<p>I am most proud of the R programming and modeling skills that I developed throughout this specialization. Previously, I had no experience with R. Now, I can effectively manage complex data sets, perform statistical analyses, build prediction models, create publication-quality figures, and deploy web applications.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-6"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4>
<p>I look forward to utilizing these skills in future research projects. Furthermore, I plan to take additional courses in data science, machine learning, and bioinformatics.</p>
<p><strong>Final Project: </strong><a href="http://dt444.shinyapps.io/next-word-predict">http://dt444.shinyapps.io/next-word-predict</a></p>
<p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/dt444/next-word-predict">http://rpubs.com/dt444/next-word-predict</a></p>
<h1 id="melissa-tan">Melissa Tan</h1>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan.png"><img class="aligncenter size-medium wp-image-4099" src="http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan-300x198.png" alt="MelissaTan" width="300" height="198" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan-300x198.png 300w, http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan-260x172.png 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>I’m a financial journalist from Singapore. I did philosophy and computer science at the University of Chicago, and I’m keen on picking up more machine learning and data viz skills.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-6"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4>
<p>I wanted to keep up with coding, while learning new tools and techniques for wrangling and analyzing data that I could potentially apply to my job. Plus, it sounded fun. :)</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-6"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4>
<p>Building a word prediction app pretty much from scratch (with a truckload of forum reading). The capstone project seemed insurmountable initially and ate up all my weekends, but getting the app to work passably was worth it.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-7"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4>
<p>It’ll go on my CV, but I think it’s more important to be able to actually do useful things. I’m keeping an eye out for more practical opportunities to apply and sharpen what I’ve learnt.</p>
<p><strong>Final Project: </strong><a href="https://melissatan.shinyapps.io/word_psychic/">https://melissatan.shinyapps.io/word_psychic/</a></p>
<p><strong>Project Slide Deck: </strong><a href="https://rpubs.com/melissatan/capstone">https://rpubs.com/melissatan/capstone</a></p>
<h1 id="felicia-yii">Felicia Yii</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii.jpg"><img class="aligncenter size-medium wp-image-4100" src="http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii-232x300.jpg" alt="FeliciaYii" width="232" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii-232x300.jpg 232w, http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii-793x1024.jpg 793w" sizes="(max-width: 232px) 100vw, 232px" /></a></p>
<p>Felicia likes to dream, think and do. With over 20 years in the IT industry, her current fascination is at the intersection of people, information and decision-making. Ever inquisitive, she has acquired an expertise in subjects as diverse as coding to cookery to costume making to cosmetics chemistry. It’s not apparent that there is anything she can’t learn to do, apart from housework. Felicia lives in Wellington, New Zealand with her husband, two children and two cats.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-7"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4>
<p>Well, I love learning and the JHU Data Science Specialization appealed to my thirst for a new challenge. I’m really interested in how we can use data to help people make better decisions. There’s so much data out there these days that it is easy to be overwhelmed by it all. Data visualisation was at the heart of my motivation when starting out. As I got into the nitty gritty of the course, I really began to see the power of making data accessible and appealing to the data-agnostic world. There’s so much potential for data science thinking in my professional work.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-7"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4>
<p>Getting through it for starters, while also working and looking after two children. Seriously though, being able to say I know what ‘practical machine learning’ is all about. Before I started the course, I had limited knowledge of statistics, let alone how to apply it in a machine learning context. I was thrilled to be able to use what I learned to test a cool game concept in my final project.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-8"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4>
<p>I want to use what I have learned in as many ways possible. Firstly, I see opportunities to apply my skills to my day-to-day work in information technology. Secondly, I would like to help organisations that don’t have the skills or expertise in-house to apply data science thinking to help their decision making and communication. Thirdly, it would be cool one day to have my own company consulting on data science. I’ve more work to do to get there though!</p>
<p><strong>Final Project: </strong><a href="https://micasagroup.shinyapps.io/nwpgame/">https://micasagroup.shinyapps.io/nwpgame/</a></p>
<p><strong>Project Slide Deck: </strong><a href="https://rpubs.com/MicasaGroup/74788">https://rpubs.com/MicasaGroup/74788</a></p>
<p> </p>
Batch effects are everywhere! Deflategate edition
2015-06-09T11:47:27+00:00
http://simplystats.github.io/2015/06/09/batch-effects-are-everywhere-deflategate-edition
<p>In my opinion, batch effects are the biggest challenge faced by genomics research, especially in precision medicine. As we point out in <a href="http://www.ncbi.nlm.nih.gov/pubmed/20838408">this review</a>, they are everywhere among high-throughput experiments. But batch effects are not specific to genomics technology. In fact, in <a href="http://amstat.tandfonline.com/doi/abs/10.1080/00401706.1972.10488878">this 1972 paper</a> (paywalled), <a href="http://en.wikipedia.org/wiki/William_J._Youden">WJ Youden</a> describes batch effects in the context of measurements made by physicists. Check out this plot of <a href="https://en.wikipedia.org/wiki/Astronomical_unit">astronomical unit</a> <del>speed of light</del> estimates <strong>with an estimate of spread <del>confidence intervals</del></strong> (red and green are same lab).</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Rplot.png"><img class=" wp-image-4295 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/06/Rplot.png" alt="Rplot" width="467" height="290" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Rplot-300x186.png 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Rplot.png 903w" sizes="(max-width: 467px) 100vw, 467px" /></a></p>
<p style="text-align: center;">
<p>
</p>
<p>
Sometimes you find batch effects where you least expect them. For example, in the <a href="http://en.wikipedia.org/wiki/Deflategate">deflategate</a> debate. Here is quote from the New England patriot's deflategate<a href="http://www.boston.com/sports/football/patriots/2015/05/14/key-takeaways-from-the-patriots-deflategate-report-rebuttal/hK0J0J9abNgtGyhTwlW53L/story.html"> rebuttal</a> (written with help from Nobel Prize winner <a href="http://en.wikipedia.org/wiki/Roderick_MacKinnon">Roderick MacKinnon</a>)
</p>
<blockquote>
<p>
in other words, the Colts balls were measured after the Patriots balls and had warmed up more. For the above reasons, the Wells Report conclusion that physical law cannot explain the pressures is incorrect.
</p>
</blockquote>
<p style="text-align: left;">
Here is another one:
</p>
<blockquote>
<p style="text-align: left;">
In the pressure measurements physical conditions were not very well-defined and major uncertainties, such as which gauge was used in pre-game measurements, affect conclusions.
</p>
</blockquote>
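<p style="text-align: left;">
Both quotes come down to measurement conditions. As a rough illustration of how far temperature alone can move a reading, here is a back-of-the-envelope Gay-Lussac calculation in R (a sketch of mine with made-up numbers, not figures from the rebuttal or the Wells Report):
</p>
<pre><code class="r"># At fixed volume, absolute pressure scales with absolute temperature.
psi_atm  <- 14.7                         # atmospheric pressure (psi)
gauge_in <- 12.5                         # hypothetical gauge pressure at inflation (psi)
T_room   <- (70 - 32) * 5/9 + 273.15     # hypothetical 70 F locker room, in Kelvin
T_field  <- (48 - 32) * 5/9 + 273.15     # hypothetical 48 F field, in Kelvin

abs_out <- (gauge_in + psi_atm) * T_field / T_room
abs_out - psi_atm                        # gauge pressure after cooling: about 11.4 psi
</code></pre>
<p style="text-align: left;">
A ball measured after it has warmed back up will read higher than one measured cold, which is exactly the kind of batch effect the rebuttal is pointing to.
</p>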
<p style="text-align: left;">
So NFL, please read <a href="http://www.ncbi.nlm.nih.gov/pubmed/20838408">our paper</a> before you accuse a player of cheating.
</p>
<p style="text-align: left;">
Disclaimer: I live in New England but I am a <a href="http://www.urbandictionary.com/define.php?term=Ball+so+Hard+University">Ravens</a> fan.
</p>
I'm a data scientist - mind if I do surgery on your heart?
2015-06-08T14:15:39+00:00
http://simplystats.github.io/2015/06/08/im-a-data-scientist-mind-if-i-do-surgery-on-your-heart
<p>There has been a lot of recent interest from scientific journals and from other folks in creating checklists for data science and data analysis. The idea is that the checklist will help prevent results that won’t reproduce or replicate from the literature. One analogy that I’m frequently hearing is the analogy with checklists for surgeons that <a href="http://www.nejm.org/doi/full/10.1056/NEJMsa0810119">can help reduce patient mortality</a>.</p>
<p>The one major difference between checklists for surgeons and checklists I’m seeing for research purposes is the difference in credentialing between people allowed to perform surgery and people allowed to perform complex data analysis. You would never let me do surgery on you. I have no medical training at all. But I’m frequently asked to review papers that include complicated and technical data analyses yet have no trained data analysts or statisticians involved. The most common approach is that a postdoc or graduate student in the group is assigned to do the analysis, even if they don’t have much formal training. Whenever this happens, red flags go up all over the place. Just like I wouldn’t trust someone without years of training and a medical license to do surgery on me, I wouldn’t let someone without years of training and credentials in data analysis make major conclusions from complex data analysis.</p>
<p>You might argue that the consequences for surgery and for complex data analysis are on completely different scales. I’d agree with you, but not in the direction that you might think. I would argue that high-pressure and complex data analysis can have much larger consequences than surgery. In surgery there is usually only one person that can be hurt. But if you do a bad data analysis, say claiming that <a href="http://www.ncbi.nlm.nih.gov/pubmed/9500320">vaccines cause autism</a>, that can have massive consequences for hundreds or even thousands of people. So complex data analysis, especially for important results, should be treated with at least as much care as surgery.</p>
<p>The reason why I don’t think checklists alone will solve the problem is that they are likely to be used by people without formal training. One obvious (and recent) example that I think makes this really clear is the <a href="https://developer.apple.com/healthkit/">HealthKit</a> data we are about to start seeing. A ton of people signed up for studies on their iPhones and it has been all over the news. The checklist will (almost certainly) say to have a big sample size. HealthKit studies will certainly pass the checklist, but they are going to get <a href="http://en.wikipedia.org/wiki/Dewey_Defeats_Truman">Truman/Deweyed</a> big time if they aren’t careful about biased sampling.</p>
<div>
If I walked into an operating room and said I'm going to start dabbling in surgery I would be immediately thrown out. But people do that with statistics and data analysis all the time. What they really need is to require careful training and expertise in data analysis on each paper that analyzes data. Until we treat it as a first class component of the scientific process we'll continue to see retractions, falsifications, and irreproducible results flourish.
</div>
Interview with Class Central
2015-06-04T09:27:20+00:00
http://simplystats.github.io/2015/06/04/4063
<p>Recently I sat down with Class Central to do an interview about the Johns Hopkins Data Science Specialization. We talked about the motivation for designing the sequence and the capstone project. With the demand for data science skills greater than ever, the importance of the specialization is only increasing.</p>
<p>See the <a href="https://www.class-central.com/report/data-science-specialization/">full interview</a> at the Class Central site. Below is a short excerpt.</p>
Interview with Chris Wiggins, chief data scientist at the New York Times
2015-06-01T09:00:27+00:00
http://simplystats.github.io/2015/06/01/interview-with-chris-wiggins-chief-data-scientist-at-the-new-york-times
<p><em>Editor’s note: We are trying something a little new here and doing an interview with Google Hangouts on Air. The interview will be live at 11:30am EST. I have some questions lined up for Chris, but if you have others you’d like to ask, you can tweet them @simplystats and I’ll see if I can work them in. After the livestream we’ll leave the video on Youtube so you can check out the interview if you can’t watch the live stream. I’m embedding the Youtube video here but if you can’t see the live stream when it is running go check out the event page: <a href="https://plus.google.com/events/c7chrkg0ene47mikqrvevrg3a4o">https://plus.google.com/events/c7chrkg0ene47mikqrvevrg3a4o</a>.</em></p>
Science is a calling and a career, here is a career planning guide for students and postdocs
2015-05-28T10:16:47+00:00
http://simplystats.github.io/2015/05/28/science-is-a-calling-and-a-career-here-is-a-career-planning-guide-for-students-and-postdocs
<p><em>Editor’s note: This post was inspired by a really awesome <a href="https://github.com/BenLangmead/langmead-lab/blob/master/postdoc_questionnaire.md">career planning guide</a> that Ben Langmead created, which you should go check out right now. You can also find the slightly adapted <a href="https://github.com/jtleek/careerplanning">Leek group career planning guide</a> here.</em></p>
<p>The most common reason that people go into science is altruistic. They loved dinosaurs and spaceships when they were kids and that never wore off. On some level this is one of the reasons I love this field so much: it is an area where, if you can get past all the hard parts, you can really keep introducing wonder into what you work on every day.</p>
<p>Sometimes I feel like this altruism has negative consequences. For example, I think that there is less emphasis on the career planning and development side in the academic community. I don’t think this is malicious, but I do think that sometimes people think of the career part of science as unseemly. But if you have any job that you want people to pay you to do, then there will be parts of that job that will be career oriented. So if you want to be a professional scientist, being brilliant and good at science is not enough. You also need to pay attention to your career trajectory and plan it carefully.</p>
<p>A colleague of mine, Ben Langmead, created a really nice guide for his postdocs for thinking about and planning the career side of a postdoc, <a href="https://github.com/BenLangmead/langmead-lab/blob/master/postdoc_questionnaire.md">which he has over on Github</a>. I thought it was such a good idea that I immediately modified it and asked all of my graduate students and postdocs to fill it out. It is kind of long so there was no penalty if they didn’t finish it, but I think it is an incredibly useful tool for thinking about how to strategize a career in the sciences. I think that the more we are concrete about the career side of graduate school and postdocs, including being honest about all the realistic options available, the better prepared our students will be to succeed on the market.</p>
<p>You can find the <a href="https://github.com/jtleek/careerplanning">Leek Group Guide to Career Planning</a> here and make sure you also go <a href="https://github.com/BenLangmead/langmead-lab/blob/master/postdoc_questionnaire.md">check out Ben’s</a> since it was his idea and his is great.</p>
<p> </p>
Is it species or is it batch? They are confounded, so we can't know
2015-05-20T11:11:18+00:00
http://simplystats.github.io/2015/05/20/is-it-species-or-is-it-batch-they-are-confounded-so-we-cant-know
<p>In a 2005 OMICS <a href="http://online.liebertpub.com/doi/abs/10.1089/153623104773547462" target="_blank">paper</a>, an analysis of human and mouse gene expression microarray measurements from several tissues led the authors to conclude that “any tissue is more similar to any other human tissue examined than to its corresponding mouse tissue”. Note that this was a rather surprising result given how similar tissues are between species. For example, both mice and humans see with their eyes, breathe with their lungs, pump blood with their hearts, etc… Two follow-up papers (<a href="http://mbe.oxfordjournals.org/content/23/3/530.abstract?ijkey=2c3d98666afbc99949fdcf514f10e3fedadee259&keytype2=tf_ipsecsha" target="_blank">here</a> and <a href="http://mbe.oxfordjournals.org/content/24/6/1283.abstract?ijkey=366fdf09da56a5dd0cfdc5f74082d9c098ae7801&keytype2=tf_ipsecsha" target="_blank">here</a>) demonstrated that platform-specific technical variability was the cause of this apparent dissimilarity. The arrays used for the two species were different and thus measurement platform and species were completely <strong>confounded</strong>. In a 2010 paper, we confirmed that once this technical variability was accounted for, the number of genes expressed in common between the same tissue across the two species was much higher than those expressed in common between two species across the different tissues (see Figure 2 <a href="http://nar.oxfordjournals.org/content/39/suppl_1/D1011.full" target="_blank">here</a>).</p>
<p>So <a href="http://genomicsclass.github.io/book/pages/confounding.html">what is confounding</a> and <a href="http://www.nature.com/ng/journal/v39/n7/full/ng0707-807.html">why is it a problem</a>? This topic has been discussed broadly. We wrote a <a href="http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html">review</a> some time ago. But based on recent discussions I’ve participated in, it seems that there is still some confusion. Here I explain, aided by some math, how confounding leads to problems in the context of estimating species effects in genomics. We will use</p>
<ul>
<li><em>X<sub>i</sub></em> to represent the gene expression measurements for human tissue <em>i,</em></li>
<li><em>a<sub>X</sub></em> to represent the level of expression that is specific to humans and</li>
<li><em>b<sub>X</sub></em> to represent the batch effect introduced by the use of the human microarray platform.</li>
<li>Therefore <em>X<sub>i</sub></em> =<em>a<sub>X </sub></em>+ <em>b<sub>X </sub></em>+ <em>e<sub>i</sub></em>, with <em>e<sub>i</sub></em> the tissue <em>i</em> effect and other uninteresting sources of variability.</li>
</ul>
<p>Similarly, we will use:</p>
<ul>
<li><em>Y<sub>i</sub></em> to represent the measurements for mouse tissue <em>i</em></li>
<li><em>a<sub>Y</sub></em> to represent the mouse specific level and</li>
<li><em>b<sub>Y</sub></em> the batch effect introduced by the use of the mouse microarray platform.</li>
<li>Therefore <em>Y</em><sub>i</sub> = <em>a<sub>Y</sub></em>+ <em>b<sub>Y</sub></em> + <em>f<sub>i</sub></em>, with <em>f<sub>i</sub></em> tissue <em>i</em> effect and other uninteresting sources of variability.</li>
</ul>
<p>If we are interested in estimating a species effect that is general across tissues, then we are interested in the following quantity:</p>
<p style="text-align: center;">
<em>a<sub>Y</sub> - a<sub>X</sub></em>
</p>
<p>Naively, we would think that we can estimate this quantity using the observed differences between the species that cancel out the tissue effect. We observe a difference for each tissue: <em>Y<sub>1 </sub></em> - <em>X<sub>1 </sub></em>, <em>Y<sub>2</sub></em> - <em>X<sub>2 </sub></em>, etc… The problem is that <em>a<sub>X</sub></em> and <em>b<sub>X</sub></em> are always together as are <em>a<sub>Y</sub></em> and <em>b<sub>Y</sub></em>. We say that the batch effect <em>b<sub>X</sub></em> is <strong>confounded</strong> with the species effect <em>a<sub>X</sub></em>. Therefore, on average, the observed differences include both the species and the batch effects. To estimate the difference above we would write a model like this:</p>
<p style="text-align: center;">
<em>Y<sub>i</sub></em> - <em>X<sub>i</sub></em> = (<em>a<sub>Y</sub> - a<sub>X</sub></em>) + (<em>b<sub>Y</sub> - b<sub>X</sub></em>) + other sources of variability
</p>
<p style="text-align: left;">
and then estimate the unknown quantities of interest: (<em>a<sub>Y</sub> - a<sub>X</sub></em>) and (<em>b<sub>Y</sub> - b<sub>X</sub></em>) from the observed data <em>Y<sub>1</sub></em> - <em>X<sub>1</sub></em>, <em>Y<sub>2</sub></em> - <em>X<sub>2</sub></em>, etc... The problem is that we can estimate the aggregate effect (<em>a<sub>Y</sub> - a<sub>X</sub></em>) + (<em>b<sub>Y</sub> - b<sub>X</sub></em>), but, mathematically, we can't tease apart the two differences. To see this, note that if we are using least squares, the estimates (<em>a<sub>Y</sub> - a<sub>X</sub></em>) = 7, (<em>b<sub>Y</sub> - b<sub>X</sub></em>) = 3 will fit the data exactly as well as (<em>a<sub>Y</sub> - a<sub>X</sub></em>) = 3, (<em>b<sub>Y</sub> - b<sub>X</sub></em>) = 7 since
</p>
<p style="text-align: center;">
<em>{(Y-X) - (7+3)}^2 = {(Y-X) - (3+7)}^2.</em>
</p>
<p style="text-align: left;">
In fact, under these circumstances, there are an infinite number of solutions to the standard statistical estimation approaches. A simple analogy is to try to find a unique solution to the equation m+n = 0. If batch and species are not confounded then we are able to tease apart the differences, just as if we were given another equation: m+n=0; m-n=2. You can learn more about this in <a href="https://www.edx.org/course/introduction-linear-models-matrix-harvardx-ph525-2x">this linear models course</a>.
</p>
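<p style="text-align: left;">
To make the aliasing concrete, here is a minimal simulation sketch in R (mine, with made-up effect sizes, not data from any of the papers discussed). The tissue-matched differences recover only the sum of the species and batch effects, and a linear model with both terms returns an NA for one of them because the design is rank deficient:
</p>
<pre><code class="r"># One gene measured in 5 tissues per species; the species effects (aX, aY) are
# perfectly confounded with the platform/batch effects (bX, bY).
set.seed(1)
n  <- 5
aX <- 1; bX <- 2                      # human species effect and human-array batch effect
aY <- 3; bY <- 4                      # mouse species effect and mouse-array batch effect
X  <- aX + bX + rnorm(n)              # X_i = aX + bX + e_i
Y  <- aY + bY + rnorm(n)              # Y_i = aY + bY + f_i

mean(Y - X)                           # estimates (aY - aX) + (bY - bX) = 4, the sum only

d       <- c(X, Y)
species <- factor(rep(c("human", "mouse"), each = n))
batch   <- factor(rep(c("humanArray", "mouseArray"), each = n))
coef(lm(d ~ species + batch))         # the batch coefficient is NA: the two terms are aliased
</code></pre>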
<p style="text-align: left;">
Note that the above derivation applies to each gene affected by the batch effect. In practice we commonly see hundreds of genes affected. As a consequence, when we compute distances between two samples from different species we may see large differences even where there is no species effect. This is because the <em>b<sub>Y</sub> - b<sub>X</sub></em> differences for each gene are squared and added up.
</p>
<p style="text-align: left;">
In summary, if you completely confound your variable of interest, in this case species, with a batch effect, you will not be able to estimate the effect of either. In fact, in a <a href="http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html">2010 Nature Genetics Review</a> about batch effects we warned about "cases in which batch effects are confounded with an outcome of interest and result in misleading biological or clinical conclusions". We also warned that none of the existing solutions for batch effects (Combat, SVA, RUV, etc...) can save you from a situation with perfect confounding. Because we can't always predict what will introduce unwanted variability, we recommend randomization as an experimental design approach.
</p>
<p style="text-align: left;">
Almost a decade after the OMICS paper was published, the same surprising conclusion was reached in <a href="http://www.pnas.org/content/111/48/17224.abstract" target="_blank">this PNAS paper</a>: “tissues appear more similar to one another within the same species than to the comparable organs of other species”. This time RNAseq was used for both species and therefore the different platform issue was not considered<sup>*</sup>. Therefore, the authors implicitly assumed that (<em>b<sub>Y</sub> - b<sub>X</sub></em>)=0. However, in a recent F1000 Research <a href="http://f1000research.com/articles/4-121/v1" target="_blank">publication</a> Gilad and Mizrahi-Man describe an exercise in <a href="http://projecteuclid.org/euclid.aoas/1267453942">forensic bioinformatics</a> that led them to discover that mice and human samples were run in different lanes or different instruments. The confounding was near perfect (see <a href="https://f1000researchdata.s3.amazonaws.com/manuscripts/7019/9f5f4330-d81d-46b8-9a3f-d8cb7aaf577e_figure1.gif">Figure 1</a>). As pointed out by these authors, with this experimental design we can't simply accept that (<em>b<sub>Y</sub> - b<sub>X</sub></em>)=0, which implies that we can't estimate a species effect. Gilad and Mizrahi-Man then apply a <a href="http://biostatistics.oxfordjournals.org/content/8/1/118.abstract">linear model</a> (ComBat) to account for the batch/species effect and find that <a href="https://f1000researchdata.s3.amazonaws.com/manuscripts/7019/9f5f4330-d81d-46b8-9a3f-d8cb7aaf577e_figure3.gif">samples cluster almost perfectly by tissue</a>. However, Gilad and Mizrahi-Man correctly note that, due to the confounding, if there is in fact a species effect, this approach will remove it along with the batch effect. Unfortunately, due to the experimental design it will be hard or impossible to determine if it's batch or if it's species. More data and more analyses are needed.
</p>
<p>Confounded designs ruin experiments. Current batch effect removal methods will not save you. If you are designing a large genomics experiments, learn about randomization.</p>
<p style="text-align: left;">
* The fact that RNAseq was used does not necessarily mean there is no platform effect. The species have different genomes, with different sequences, which can lead to different biases during the experimental protocols.
</p>
<p style="text-align: left;">
<strong>Update: </strong>Shin Lin has repeated a small version of the experiment described in the <a href="http://www.pnas.org/content/111/48/17224.abstract" target="_blank">PNAS paper</a>. The new experimental design does not confound lane/instrument with species. The new data confirms their original results pointing to the fact that lane/instrument do not explain the clustering by species. You can see his response in the comments <a href="http://f1000research.com/articles/4-121/v1" target="_blank">here</a>.
</p>
Residual expertise - or why scientists are amateurs at most of science
2015-05-18T10:21:18+00:00
http://simplystats.github.io/2015/05/18/residual-expertise
<p><em>Editor’s note: I have been unsuccessfully attempting to finish a book I started 3 years ago about how and why everyone should get pumped about reading and understanding scientific papers. I’ve adapted part of one of the chapters into this blogpost. It is pretty raw but hopefully gets the idea across. </em></p>
<p>An episode of <em>The Daily Show with Jon Stewart</em> featured Lisa Randall, an incredible physicist and noted scientific communicator, as the invited guest.</p>
<div style="background-color: #000000; width: 520px;">
<div style="padding: 4px;">
</p>
<p style="text-align: left; background-color: #ffffff; padding: 4px; margin-top: 4px; margin-bottom: 0px; font-family: Arial, Helvetica, sans-serif; font-size: 12px;">
<b><a href="http://thedailyshow.cc.com/">The Daily Show</a></b><br /> Get More: <a href="http://thedailyshow.cc.com/full-episodes/">Daily Show Full Episodes</a>,<a href="http://www.facebook.com/thedailyshow">The Daily Show on Facebook</a>,<a href="http://thedailyshow.cc.com/videos">Daily Show Video Archive</a>
</p>
</div>
</div>
<p>Near the end of the interview, Stewart asked Randall why, with all the scientific progress we have made, we have been unable to move away from fossil fuel-based engines. The question led to this exchange:</p>
<blockquote>
<p><em>Randall: “So this is part of the problem, because I’m a scientist doesn’t mean I know the answer to that question.”</em></p>
</blockquote>
<blockquote>
<p><em>Stewart: “Oh is that true? Here’s the thing, here’s what’s part of the answer. You could say anything and I would have no idea what you are talking about.”</em></p>
</blockquote>
<p>Professor Randall is a world-leading physicist, the first woman to achieve tenure in physics at Princeton, Harvard, and MIT, and a member of the National Academy of Sciences. But when it comes to the science of fossil fuels, she is just an amateur. Her response to this question is just perfect - it shows that even brilliant scientists can just be interested amateurs on topics outside of their expertise. Despite Professor Randall’s over-the-top qualifications, she is an amateur on a whole range of scientific topics from medicine, to computer science, to nuclear engineering. Being an amateur isn’t a bad thing, and recognizing where you are an amateur may be the truest indicator of genius. That doesn’t mean Professor Randall can’t know a little bit about fossil fuels or be curious about why we don’t all have nuclear-powered hovercrafts yet. It just means she isn’t the authority.</p>
<p>Stewart’s response is particularly telling and indicative of what a lot of people think about scientists. It takes years of experience to become an expert in a scientific field - some have suggested as many as 10,000 hours of dedicated time. Professor Randall is a scientist - so she must have more information about any scientific problem than an informed amateur like Jon Stewart. But of course this isn’t true, Jon Stewart (and you) could quickly learn as much about fossil fuels as a scientist if the scientist wasn’t already an expert in the area. Sure a background in physics would help, but there are a lot of moving parts in our dependence on fossil fuels, including social, political, economic problems in addition to the physics involved.</p>
<p>This is an example of “residual expertise” - when people without deep scientific training are willing to attribute expertise to scientists even if it is outside their primary area of focus. It is closely related to the logical fallacy behind the <a href="http://en.wikipedia.org/wiki/Argument_from_authority">argument from authority</a>:</p>
<blockquote>
<p>A is an authority on a particular topic</p>
<p>A says something about that topic</p>
<p>A is probably correct</p>
</blockquote>
<p>the difference is that with residual expertise you assume that since A is an authority on a particular topic, if they say something about another, potentially related topic, they will probably be correct. This idea is critically important, it is how quacks make their living. The logical leap of faith from “that person is a doctor” to “that person is a doctor so of course they understand epidemiology, or vaccination, or risk communication” is exactly the leap empowered by the idea of residual expertise. It is also how you can line up scientific experts against any well established doctrine like evolution or climate change. Experts in the field will know all of the relevant information that supports key ideas in the field and what it would take to overturn those ideas. But experts outside of the field can be lined up and their residual expertise used to call into question even the most supported ideas.</p>
<p>What does this have to do with you?</p>
<p>Most people aren’t necessarily experts in the scientific disciplines they care about. Becoming a successful amateur requires a much smaller time commitment than becoming an expert, but it can still be incredibly satisfying, fun, and useful. This book is designed to help you become a fired-up amateur in the science of your choice. Think of it like a hobby, but one where you get to learn about some of the coolest new technologies and ideas coming out in the scientific literature. If you can ignore the way residual expertise makes you feel silly for reading scientific papers you don’t fully understand, you can still learn a ton and have a pretty fun time doing it.</p>
The tyranny of the idea in science
2015-05-08T11:58:51+00:00
http://simplystats.github.io/2015/05/08/the-tyranny-of-the-idea-in-science
<p>There are a lot of analogies between <a href="http://simplystatistics.org/2012/09/20/every-professor-is-a-startup/">startups and academic science labs</a>. One thing that is definitely very different is the relative value of ideas in the startup world and in the academic world. For example, <a href="http://simplystatistics.org/2012/09/20/every-professor-is-a-startup/">Paul Graham has said:</a></p>
<blockquote>
<p>Actually, startup ideas are not million dollar ideas, and here’s an experiment you can try to prove it: just try to sell one. Nothing evolves faster than markets. The fact that there’s no market for startup ideas suggests there’s no demand. Which means, in the narrow sense of the word, that startup ideas are worthless.</p>
</blockquote>
<p>In academics, almost the opposite is true. There is huge value to being first with an idea, even if you haven’t gotten all the details worked out or stable software in place. Here are a couple of extreme examples illustrated with Nobel prizes:</p>
<ol>
<li><strong>Higgs Boson</strong> - Peter Higgs <a href="http://journals.aps.org/pr/abstract/10.1103/PhysRev.145.1156">postulated the Boson in 1964</a>, <a href="http://www.symmetrymagazine.org/article/october-2013/nobel-prize-in-physics-honors-prediction-of-higgs-boson">he won the Nobel Prize in 2013 for that prediction</a>, in between tons of people did follow on work, someone convinced Europe to build one of the <a href="http://en.wikipedia.org/wiki/Large_Hadron_Collider">most expensive pieces of scientific equipment ever built</a> and conservatively thousands of scientists and engineers had to do a ton of work to get the equipment to (a) work and (b) confirm the prediction.</li>
<li><strong>Human genome</strong> - <a href="http://en.wikipedia.org/wiki/Molecular_Structure_of_Nucleic_Acids:_A_Structure_for_Deoxyribose_Nucleic_Acid">Watson and Crick postulated the structure of DNA</a> in 1953, <a href="http://www.nobelprize.org/nobel_prizes/medicine/laureates/1962/">they won the Nobel Prize in medicine in 1962</a> for this work. But the real value of the human genome was realized when the <a href="http://en.wikipedia.org/wiki/Human_Genome_Project">largest biological collaboration in history sequenced the human genome</a>, along with all of the subsequent work in the genomics revolution.</li>
</ol>
<p>These are two large scale examples where the academic scientific community (as represented by the Nobel committee, mostly because it is a concrete example) rewards the original idea and not the hard work to achieve that idea. I call this, “the tyranny of the idea.” I notice a similar issue on a much smaller scale, for example when people <a href="http://ivory.idyll.org/blog/2015-software-as-a-primary-product-of-science.html">don’t recognize software as a primary product of science</a>. I feel like these decisions devalue the real work it takes to make any scientific idea a reality. Sure the ideas are good, but it isn’t clear that some ideas wouldn’t be discovered by someone else - but surely we aren’t going to build another large hadron collider. I’d like to see the scales correct back the other way a little bit so we put at least as much emphasis on the science it takes to follow through on an idea as on discovering it in the first place.</p>
Mendelian randomization inspires a randomized trial design for multiple drugs simultaneously
2015-05-07T11:30:09+00:00
http://simplystats.github.io/2015/05/07/mendelian-randomization-inspires-a-randomized-trial-design-for-multiple-drugs-simultaneously
<p>Joe Pickrell has an interesting new paper out about <a href="http://biorxiv.org/content/early/2015/04/16/018150.full-text.pdf+html">Mendelian randomization.</a> He discusses some of the interesting issues that come up with these studies and performs a mini-review of previously published studies using the technique.</p>
<p>The basic idea behind Mendelian Randomization is the following. In a simple, randomly mating population Mendel’s laws tell us that at any genomic locus (a measured spot in the genome) the allele (genetic material you got) you get is assigned at random. At the chromosome level this is very close to true due to properties of meiosis (here is an example of how this looks in very cartoonish form in yeast). A very famous example of this was an experiment performed by Leonid Kruglyak’s group where they took two strains of yeast and repeatedly mated them, then measured genetics and gene expression data. The experimental design looked like this:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/05/Slide06.jpg"><img class="aligncenter wp-image-4009 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/05/Slide06-300x224.jpg" alt="Slide06" width="300" height="224" srcset="http://simplystatistics.org/wp-content/uploads/2015/05/Slide06-300x224.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/05/Slide06-260x194.jpg 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p>If you look at the alleles inherited from the two parental strains (BY, RM) at two separate genes on different chromosomes in each of the 112 segregants (yeast offspring), they do appear to be random and independent:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/05/Screen-Shot-2015-05-07-at-11.20.46-AM.png"><img class="aligncenter wp-image-4010 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/05/Screen-Shot-2015-05-07-at-11.20.46-AM-235x300.png" alt="Screen Shot 2015-05-07 at 11.20.46 AM" width="235" height="300" /></a></p>
<p>So this is a randomized trial in yeast where the yeast were each randomized to many many genetic “treatments” simultaneously. Now this isn’t strictly true, since genes on the same chromosomes near each other aren’t exactly random and in humans it is definitely not true since there is population structure, non-random mating and a host of other issues. But you can still do cool things to try to infer causality from the genetic “treatments” to downstream things like gene expression ( <a href="http://genomebiology.com/2007/8/10/r219">and even do a reasonable job in the model organism case</a>).</p>
<p>In my mind this raises a potentially interesting study design for clinical trials. Suppose that there are 10 treatments for a disease that we know about. We design a study where each of the patients in the trial was randomized to receive treatment or placebo for each of the 10 treatments. So on average each person would get 5 treatments. Then you could try to tease apart the effects using methods developed for the Mendelian randomization case. Of course, this is ignoring potential interactions, side effects of taking multiple drugs simultaneously, etc. But I’m seeing lots of <a href="http://www.nature.com/news/personalized-medicine-time-for-one-person-trials-1.17411">interesting proposals</a> for new trial designs (<a href="http://notstatschat.tumblr.com/post/118102423391/precise-answers-but-not-necessarily-to-the-right">which may or may not work</a>), so I thought I’d contribute my own interesting idea.</p>
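<p>Here is a minimal sketch in R of the design I have in mind (hypothetical patients, treatments, and effect sizes, just to illustrate): each patient is independently randomized to each of the 10 treatments, and an additive linear model is then used to tease the effects apart.</p>
<pre><code class="r">set.seed(2)
n_patients   <- 500
n_treatments <- 10

# Independent coin-flip assignment to every treatment: ~5 treatments per patient
Z <- matrix(rbinom(n_patients * n_treatments, 1, 0.5), nrow = n_patients)
colnames(Z) <- paste0("trt", seq_len(n_treatments))

true_effects <- c(2, -1, 0.5, rep(0, 7))           # made-up additive effects
y <- drop(Z %*% true_effects) + rnorm(n_patients)  # outcome with no interactions

fit <- lm(y ~ Z)
round(coef(fit)[-1], 2)   # estimates should track true_effects
</code></pre>
<p>Because the assignments are independent, the treatment indicators are roughly orthogonal, which is what makes the additive effects estimable; interactions and side effects of combining drugs are exactly the complications set aside here.</p>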
Rafa's citations above replacement in statistics journals is crazy high.
2015-05-01T11:18:47+00:00
http://simplystats.github.io/2015/05/01/rafas-citations-above-replacement-in-statistics-journals-is-crazy-high
<p><em>Editor’s note: I thought it would be fun to do some bibliometrics on a Friday. This is super hacky and the CAR/Y stat should not be taken seriously. </em></p>
<p>I downloaded data on the 400 most cited papers between 2000 and 2010 in some statistical journals from <a href="webofscience.com/">Web of Science</a>. Here is a boxplot of the average number of citations per year (from publication date to 2015) to these papers in the journals Annals of Statistics, Biometrics, Biometrika, Biostatistics, JASA, Journal of Computational and Graphical Statistics, Journal of Machine Learning Research, and Journal of the Royal Statistical Society Series B.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/05/journals.png"><img class="aligncenter wp-image-4001" src="http://simplystatistics.org/wp-content/uploads/2015/05/journals-300x300.png" alt="journals" width="500" height="500" srcset="http://simplystatistics.org/wp-content/uploads/2015/05/journals-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/05/journals-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/05/journals-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/05/journals.png 1050w" sizes="(max-width: 500px) 100vw, 500px" /></a></p>
<p> </p>
<p>There are several interesting things about this graph right away. One is that JASA has the highest median number of citations, but has fewer “big hits” (papers with 100+ citations/year) than Annals of Statistics, JMLR, or JRSS-B. Another thing is how much of a lottery developing statistical methods seems to be. Most papers, even among the 400 most cited, have around 3 citations/year on average. But a few lucky winners have 100+ citations per year. One interesting thing for me is the papers that get 10 or more citations per year but aren’t huge hits. I suspect these are the papers that <a href="http://simplystatistics.org/2014/07/25/academic-statisticians-there-is-no-shame-in-developing-statistical-solutions-that-solve-just-one-problem/">solve one problem well but don’t solve the most general problem ever</a>.</p>
<p>Something that jumps out from that plot is the outlier for the journal Biostatistics. One of their papers is cited 367.85 times per year. The next nearest competitor is 67.75 and it is 19 standard deviations above the mean! The paper in question is: “Exploration, normalization, and summaries of high density oligonucleotide array probe level data”, which is the paper that introduced RMA, one of the most popular methods for pre-processing microarrays ever created. It was written by Rafa and colleagues. It made me think of the statistic “<a href="http://www.fangraphs.com/library/misc/war/">wins above replacement</a>” which quantifies how many extra wins a baseball team gets by playing a specific player in place of a league average replacement.</p>
<p>What about a “citations /year above replacement” statistic where you calculate for each journal:</p>
<blockquote>
<p>Median number of citations to a paper/year with Author X - Median number of citations/year to an average paper in that journal</p>
</blockquote>
<p>Then average this number across journals. This attempts to quantify how many extra citations/year a person’s papers generate compared to the “average” paper in that journal. For Rafa the numbers look like this:</p>
<ul>
<li>Biostatistics: Rafa = 15.475, Journal = 1.855, CAR/Y = 13.62</li>
<li>JASA: Rafa = 74.5, Journal = 5.2, CAR/Y = 69.3</li>
<li>Biometrics: Rafa = 4.33, Journal = 3.38, CAR/Y = 0.95</li>
</ul>
<p>So Rafa’s citations above replacement is (13.62 + 69.3 + 0.95)/3 = 27.96! There are a couple of reasons why this isn’t a completely accurate picture. One is the low sample size, the second is the fact that I only took the 400 most cited papers in each journal. Rafa has a few papers that didn’t make the top 400 for journals like JASA - which would bring down his CAR/Y.</p>
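<p>For what it’s worth, the arithmetic is easy to reproduce in R using only the medians quoted above:</p>
<pre><code class="r">journals       <- c("Biostatistics", "JASA", "Biometrics")
author_median  <- c(15.475, 74.5, 4.33)   # median citations/year for Rafa's papers in each journal
journal_median <- c(1.855, 5.2, 3.38)     # median citations/year for an average paper in that journal

car_y <- author_median - journal_median   # per-journal citations above replacement
round(car_y, 2)                           # 13.62 69.30 0.95
mean(car_y)                               # about 27.96
</code></pre>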
<p> </p>
Figuring Out Learning Objectives the Hard Way
2015-04-30T11:10:06+00:00
http://simplystats.github.io/2015/04/30/figuring-out-learning-objectives-the-hard-way
<p>When building the <a href="https://www.coursera.org/specialization/genomics/41" title="Genomic Data Science Specialization">Genomic Data Science Specialization</a> (which starts in June!) we had to figure out the learning objectives for each course. We initially set our ambitions high, but as you can see in this video below, Steven Salzberg brought us back to Earth.</p>
Data analysis subcultures
2015-04-29T10:23:57+00:00
http://simplystats.github.io/2015/04/29/data-analysis-subcultures
<p>Roger and I responded to the controversy around the journal that banned p-values today <a href="http://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412">in Nature.</a> A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper:</p>
<blockquote>
<p>Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time ‘panel data’, to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as ‘longitudinal data’, and often go at it with generalized estimating equations.</p>
</blockquote>
<p>I think this is one of the least appreciated components of modern data analysis. Data analysis is almost entirely taught through an apprenticeship culture with completely different behaviors taught in different disciplines. All of these disciplines agree about the mathematical optimality of specific methods under very specific conditions. That is why you see <a href="http://psychclassics.yorku.ca/Peirce/small-diffs.htm">methods</a> like <a href="http://en.wikipedia.org/wiki/Statistical_Methods_for_Research_Workers">randomized trials</a> <a href="http://www.ted.com/talks/esther_duflo_social_experiments_to_fight_poverty?language=en">used</a> across <a href="http://www.badscience.net/category/evidence-based-policy/">multiple disciplines</a>.</p>
<p>But any real data analysis is always a multi-step process involving data cleaning and tidying, exploratory analysis, model fitting and checking, summarization and communication. If you gave someone from economics, biostatistics, statistics, and applied math an identical data set they’d give you back <strong>very</strong> different reports on what they did, why they did it, and what it all meant. Here are a few examples I can think of off the top of my head:</p>
<ul>
<li>Economics calls longitudinal data panel data and uses mostly linear mixed effects models, while generalized estimating equations are more common in biostatistics (this is the example from Roger/my paper; see the sketch after this list).</li>
<li>In genome wide association studies the family wise error rate is the most common error rate to control. In gene expression studies people frequently use the false discovery rate.</li>
<li>This is changing a bit, but if you learned statistics at Duke you are probably a Bayesian and if you learned at Berkeley you are probably a frequentist.</li>
<li>Psychology has a history of using <a href="http://en.wikipedia.org/wiki/Psychological_statistics">parametric statistics</a>, genomics is big into <a href="http://www.bioconductor.org/packages/release/bioc/html/limma.html">empirical Bayes</a>, and you see a lot of Bayesian statistics in <a href="https://www1.ethz.ch/iac/people/knuttir/papers/meinshausen09nat.pdf">climate studies</a>.</li>
<li>You see formal tests of model assumptions, like the <a href="http://en.wikipedia.org/wiki/White_test">White test</a> for heteroskedasticity, used a lot in econometrics, but that is hardly ever done through formal hypothesis testing in biostatistics.</li>
<li>Training sets and test sets are used in machine learning for prediction, but rarely used for inference.</li>
</ul>
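<p>To make the first example in the list concrete, here is a small sketch in R with simulated longitudinal/panel data (assuming the <code>lme4</code> and <code>geepack</code> packages are installed; the data are made up). The same data set and the same question get two different default tools depending on which subculture trained you:</p>
<pre><code class="r">library(lme4)
library(geepack)

set.seed(3)
dat <- data.frame(id   = rep(1:50, each = 4),    # 50 subjects, 4 time points each
                  time = rep(0:3, times = 50))
dat$y <- 1 + 0.5 * dat$time +
         rep(rnorm(50), each = 4) + rnorm(200)   # random intercept plus noise

# "Longitudinal data" in biostatistics: generalized estimating equations
gee_fit <- geeglm(y ~ time, id = id, data = dat, corstr = "exchangeable")

# "Panel data" in economics: a linear mixed-effects model
lmm_fit <- lmer(y ~ time + (1 | id), data = dat)

summary(gee_fit)$coefficients
summary(lmm_fit)$coefficients
</code></pre>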
<p>This is just a partial list I thought of off the top of my head, there are a ton more. These decisions matter <strong>a lot</strong> in a data analysis. The problem is that the behavioral component of a data analysis is incredibly strong, no matter how much we’d like to think of the process as mathematico-theoretical. Until we acknowledge that the most common reason a method is chosen is because, “I saw it in a widely-cited paper in journal XX from my field” it is likely that little progress will be made on resolving the statistical problems in science.</p>
Why is there so much university administration? We kind of asked for it.
2015-04-13T17:13:16+00:00
http://simplystats.github.io/2015/04/13/why-is-there-so-much-university-administration-we-kind-of-asked-for-it
<p>The latest commentary on the rising cost of college tuition is by Paul F. Campos and is titled <a href="http://www.nytimes.com/2015/04/05/opinion/sunday/the-real-reason-college-tuition-costs-so-much.html">The Real Reason College Tuition Costs So Much</a>. There has been much debate about this article and whether Campos is right or wrong…and I don’t plan to add to that. However, I wanted to pick up on a major point of the article that I felt got left hanging out there: The rising levels of administrative personnel at universities.</p>
<p>Campos argues that the reason college tuition is on the rise is not that colleges get less and less money from the government (mostly state government for state schools), but rather that there is an increasing number of administrators at universities that need to be paid in dollars and cents. He cites a study that shows that for the California State University system, in a 34 year period, the number of faculty rose by about 3% whereas the number of administrators rose by 221%.</p>
<p>My initial thinking when I saw the 221% number was “only that much?” I’ve been a faculty member at Johns Hopkins now for about 10 years, and just in that short period I’ve seen the amount of administrative work I need to do go up what feels like at least 221%. Partially, of course, that is a result of climbing up the ranks. As you get more qualified to do administrative work, you get asked to do it! But even adjusting for that, there are quite a few things that faculty need to do now that they weren’t required to do before. Frankly, I’m grateful for the few administrators that we do have around here to help me out with various things.</p>
<p>Campos seems to imply (but doesn’t come out and say) that the bulk of administrators are not necessary, and that if we were to cut these people from the payrolls, we could reduce tuition to what it was in the old days. Or at least, it would be cheaper. This argument reminds me of debates over the federal budget: everyone thinks the budget is too big, but no one wants to suggest something to cut.</p>
<p>My point here is that the reason there are so many administrators is that there’s actually quite a bit of administration to do. And the amount of administration that needs to be done has increased over the past 30 years.</p>
<p>Just for fun, I decided to go to the <a href="http://webapps.jhu.edu/jhuniverse/administration/">Johns Hopkins University Administration</a> web site to see who all these administrators were. This site shows the President’s Cabinet and the Deans of the individual schools, which isn’t everybody, but it represents a large chunk. I don’t know all of these people, but I have met and worked with a few of them.</p>
<p>For the moment I’m going to skip over individual people because, as much as you might think they are overpaid, no individual’s salary is large enough to move the needle on college tuition. So I’ll stick with people who actually represent large offices with staff. Here’s a sample.</p>
<ul>
<li><strong>University President</strong>. Call me crazy, but I think the university needs a President. In the U.S. the university President tends to focus on outward-facing activities like raising money from various sources, liaising with the government(s), and pushing university initiatives around the world. This is not something I want to do (but I think it’s necessary), so I’d rather have the President take care of it for me.</li>
<li>
<p><strong>University Provost</strong>. At most universities in the U.S. the Provost is the “senior academic officer”, which means that he/she runs the university. This is a big job, especially at big universities, and requires coordinating across a variety of constituencies. Also, at JHU, the Provost’s office deals with a number of compliance-related issues like Title IX, accreditation, the Americans with Disabilities Act, and many others. I suppose we could save some money by violating federal law, but that seems short-sighted.
The people in this office do tough work involving a ton of paper. One example involves online education. Most states in the U.S. say that if you’re going to run an education program in their state, it needs to be approved by some regulatory body. Some states have essentially a reciprocal agreement, so if it’s okay in your state, then it’s okay in their state. But many states require an entire approval process for a program to run in that state. And by “a program” I mean something like an M.S. in Mathematics. If you want to run an M.S. in English that’s another approval, etc. So someone has to go to all the 50 states and D.C. and get approval for every online program that JHU runs in order to enroll students into that program from that state. I think Arkansas actually requires that someone come to Arkansas and testify in person about a program asking for approval.</p>
<p>I support online education programs, and I’m glad the Provost’s office is getting all those approvals for us.</p></li>
<li><strong>Corporate Security</strong>. This may be a difficult one for some people to understand, but bear in mind that much of Johns Hopkins is located in East Baltimore. If you’ve ever seen the TV show <a href="http://en.wikipedia.org/wiki/The_Wire">The Wire</a>, then you know why we need corporate security.</li>
<li><strong>Facilities and Real Estate</strong>. Johns Hopkins owns and deals with a lot of real estate; it’s a big organization. Who is supposed to take care of all that? For example, we just installed a brand new supercomputer jointly with the University of Maryland, called <a href="http://marcc.jhu.edu">MARCC</a>. I’m really excited to use this supercomputer for research, but systems like this require a bit of space. A lot of space actually. So we needed to get some land to put it on. If you’ve ever bought a house, you know how much paperwork is involved.</li>
<li><strong>Development and Alumni Relations</strong>. I have a new appreciation for this office now that I co-direct a <a href="https://www.coursera.org/specialization/jhudatascience/1">program</a> that has enrolled over 1.5 million people in just over a year. It’s critically important that we keep track of our students for many reasons: tracking student careers and success, tapping them to mentor current students, developing relationships with organizations that they’re connected to are just a few.</li>
<li><strong>General Counsel</strong>. I’m not the lawbreaking type, so I need lawyers to help me out.</li>
<li><strong>Enterprise Development</strong>. This office involves, among other things, technology transfer, which I have recently been involved with quite a bit for my role in the Data Science Specialization offered through Coursera. This is just to say that I personally benefit from this office. I’ve heard people say that universities shouldn’t be involved in tech transfer, but Bayh-Dole is what it is and I think Johns Hopkins should play by the same rules as everyone else. I’m not interested in filing patents, trademarks, and copyrights, so it’s good to have people doing that for me.</li>
</ul>
<p>Okay, that’s just a few offices, but you get the point. These administrators seem to be doing a real job (imagine that!) and actually helping out the university. Many of these people are actually helping <em>me</em> out. Some of these jobs are essentially required by the existence of federal laws, and so we need people like this.</p>
<p>So, just to recap, I think there are in fact more administrators in universities than there used to be. Is this causing an increase in tuition? It’s possible, but it’s probably not the only cause. If you believe the CSU study, there was about a 3.5% annual increase in the number of administrators each year from 1975 to 2008. College tuition during that time period went up <a href="http://trends.collegeboard.org/college-pricing/figures-tables/average-rates-growth-published-charges-decade">around 4% per year</a> (inflation adjusted). But even so, much of this administration needs to be done (because faculty don’t want to do it), so this is a difficult path to go down if you’re looking for ways to lower tuition.</p>
<p>Even if we’ve found the smoking gun, the question is what do we do about it?</p>
Genomics Case Studies Online Courses Start in Two Weeks (4/27)
2015-04-13T10:00:29+00:00
http://simplystats.github.io/2015/04/13/genomics-case-studies-online-courses-start-in-two-weeks-427
<p>The last month of the <a href="http://genomicsclass.github.io/book/pages/classes.html">HarvardX Data Analysis for Genomics series</a> starts on 4/27. We will cover case studies on RNA-seq, variant calling, ChIP-seq, and DNA methylation. Faculty include Shirley Liu, Mike Love, Oliver Hofmann, and the HSPH Bioinformatics Core. Although taking the previous courses in the series will help, the four case study courses were developed to stand alone and you can obtain a certificate for each one without taking any other course.</p>
<p>Each course is presented over two weeks but will remain open until June 13 to give students an opportunity to take them all if they wish. For more information follow the links listed below.</p>
<ol>
<li><a href="https://www.edx.org/course/case-study-rna-seq-data-analysis-harvardx-ph525-5x">RNA-seq data analysis</a> will be lead by Mike Love</li>
<li><a href="https://www.edx.org/course/case-study-variant-discovery-and-genotyping-harvardx-ph525-6x">Variant Discovery and Genotyping</a> will be taught by Shannan Ho Sui, Oliver Hofmann, Radhika Khetani and Meeta Mistry (from the The HSPH Bioinformatics Core)</li>
<li><a href="https://www.edx.org/course/case-study-chip-seq-data-analysis-harvardx-ph525-7x">ChIP-seq data analysis</a> will be lead by Shirley Liu</li>
<li><a href="https://www.edx.org/course/case-study-dna-methylation-data-analysis-harvardx-ph525-8x">DNA methylation data analysis</a> will be lead by Rafael Irizarry</li>
</ol>
A blessing of dimensionality often observed in high-dimensional data sets
2015-04-09T15:19:13+00:00
http://simplystats.github.io/2015/04/09/a-blessing-of-dimensionality-often-observed-in-high-dimensional-data-sets
<p><a href="http://www.jstatsoft.org/v59/i10/paper"></a> have one observation per row and one variable per column. Using this definition, big data sets can be either:</p>
<ol>
<li><strong>Wide</strong> - a wide data set has a large number of measurements per observation, but fewer observations. This type of data set is typical in neuroimaging, genomics, and other biomedical applications.</li>
<li><strong>Tall</strong> - a tall data set has a large number of observations, but fewer measurements. This is the typical setting in a large clinical trial or in a basic social network analysis.</li>
</ol>
<p>The <a href="http://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> tells us that estimating some quantities gets harder as the number of dimensions of a data set increases - as the data gets taller or wider. An example of this was <a href="http://simplystatistics.org/2014/10/24/an-interactive-visualization-to-teach-about-the-curse-of-dimensionality/">nicely illustrated</a> by my student Prasad (although it looks like his quota may be up on Rstudio).</p>
<p>For wide data sets there is also a blessing of dimensionality. The basic reason for the blessing of dimensionality is that:</p>
<blockquote>
<p>No matter how many new measurements you take on a small set of observations, the number of observations and all of their characteristics are fixed.</p>
</blockquote>
<p>As an example, suppose that we make measurements on 10 people. We start out by making one measurement (blood pressure), then another (height), then another (hair color) and we keep going and going until we have one million measurements on those same 10 people. The blessing occurs because the measurements on those 10 people will all be related to each other. If 5 of the people are women and 5 are men, then any measurement that has a relationship with sex will be highly correlated with any other measurement that has a relationship with sex. So by knowing one small bit of information, you can learn a lot about many of the different measurements.</p>
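<p>A quick toy simulation in R (made-up numbers, just to illustrate the point) shows this: any two sex-related measurements on the same 10 people end up correlated with each other, while the pure-noise measurements do not.</p>
<pre><code>set.seed(2)
sex <- rep(c(0, 1), each = 5)                 # 5 women, 5 men
p   <- 10000                                  # measurements per person

## half the measurements are related to sex, half are pure noise
related <- sapply(1:(p / 2), function(i) 2 * sex + rnorm(10))
noise   <- sapply(1:(p / 2), function(i) rnorm(10))
X <- cbind(related, noise)                    # 10 x 10,000 data matrix

## average absolute correlation among a few sex-related measurements: large
cr <- cor(X[, 1:10])
mean(abs(cr[upper.tri(cr)]))

## the same thing for a few pure-noise measurements: near zero
cn <- cor(X[, (p / 2 + 1):(p / 2 + 10)])
mean(abs(cn[upper.tri(cn)]))
</code></pre>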
<p>This blessing of dimensionality is the key idea behind many of the statistical approaches to wide data sets whether it is stated explicitly or not. I thought I’d make a very short list of some of these ideas:</p>
<p><strong>1. Idea: </strong><a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3841439/">De-convolving mixed observations from high-dimensional data. </a></p>
<p><strong>How the blessing plays a role: </strong>The measurements for each observation are assumed to be a mixture of values measured from different observation types. The proportion of each observation type is assumed to be fixed across measurements, so you can take advantage of the multiple measurements to estimate the mixing percentage and perform the deconvolution. (<a href="http://odin.mdacc.tmc.edu/~wwang7/">Wenyi Wang</a> came and gave an excellent seminar on this idea at JHU a couple of days ago, which inspired this post).</p>
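<p>A stripped-down sketch of the general idea (not the method in the linked paper): if each sample’s measurements are a fixed mixture of two known reference profiles, then with many measurements the mixing proportion falls out of a simple regression.</p>
<pre><code>set.seed(3)
m   <- 5000                                      # number of measurements (e.g. genes)
ref <- cbind(typeA = rnorm(m, 5, 2),             # reference profile for type A
             typeB = rnorm(m, 7, 2))             # reference profile for type B
a   <- 0.3                                       # true, fixed proportion of type A
mix <- a * ref[, "typeA"] + (1 - a) * ref[, "typeB"] + rnorm(m, sd = 0.5)

## mix - B = a * (A - B) + noise, so regression through the origin recovers a
fit <- lm(I(mix - ref[, "typeB"]) ~ 0 + I(ref[, "typeA"] - ref[, "typeB"]))
coef(fit)                                        # should be close to 0.3
</code></pre>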
<p><strong>2. Idea:</strong> <a href="http://biostatistics.oxfordjournals.org/content/5/2/155.short">The two groups model for false discovery rates</a>.</p>
<p><strong>How the blessing plays a role: </strong> The models assume that a hypothesis test is performed for each observation and that the probability any observation is drawn from the null, the null distribution, and the alternative distributions are common across observations. If the null is assumed known, then it is possible to use the known null distribution to estimate the common probability that an observation is drawn from the null.</p>
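<p>A toy version of this in R, with a known N(0,1) null and a simple Storey-type estimate of the proportion of nulls (simulated data, not any particular published implementation):</p>
<pre><code>set.seed(4)
n_tests <- 10000
pi0     <- 0.8                                   # true proportion of null tests
is_null <- rbinom(n_tests, 1, pi0) == 1
z <- ifelse(is_null, rnorm(n_tests), rnorm(n_tests, mean = 3))
p <- 2 * pnorm(-abs(z))                          # two-sided p-values

## p-values above lambda come almost entirely from the null component
lambda  <- 0.5
pi0_hat <- mean(p > lambda) / (1 - lambda)
pi0_hat                                          # close to 0.8
</code></pre>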
<p> </p>
<p><strong>3. Idea: </strong><a href="http://www.degruyter.com/view/j/sagmb.2004.3.issue-1/sagmb.2004.3.1.1027/sagmb.2004.3.1.1027.xml">Empirical Bayes variance shrinkage for linear models</a></p>
<p><strong>How the blessing plays a role: </strong> A linear model is fit for each observation and the means and variances of the log ratios calculated from the model are assumed to follow a common distribution across observations. The method estimates the hyper-parameters of these common distributions and uses them to adjust any individual measurement’s estimates.</p>
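<p>This is essentially what the <code>limma</code> workflow does. A minimal sketch, assuming a gene-by-sample expression matrix <code>expr</code> and a two-group factor <code>group</code> (both placeholders here):</p>
<pre><code>library(limma)
design <- model.matrix(~ group)        # intercept + group effect
fit    <- lmFit(expr, design)          # one linear model per gene
fit    <- eBayes(fit)                  # shrink gene-wise variances toward a
                                       # common prior estimated across genes
topTable(fit, coef = 2, number = 10)   # top genes by moderated t-statistic
</code></pre>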
<p> </p>
<p><strong>4. Idea: </strong><a href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161">Surrogate variable analysis</a></p>
<p><strong>How the blessing plays a role: </strong> Each observation is assumed to be influenced by a single variable of interest (a primary variable) and multiple unmeasured confounders. Since the observations are fixed, the values of the unmeasured confounders are the same for each measurement and a supervised PCA can be used to estimate surrogates for the confounders. (<a href="http://www.slideshare.net/jtleek/jhu-feb2009">see my JHU job talk for more on the blessing</a>)</p>
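<p>A minimal <code>sva</code> call, assuming a gene-by-sample matrix <code>dat</code> and a primary variable <code>treatment</code> (both placeholders here):</p>
<pre><code>library(sva)
pheno <- data.frame(treatment)
mod   <- model.matrix(~ treatment, data = pheno)   # model with the primary variable
mod0  <- model.matrix(~ 1, data = pheno)           # null model: intercept only
sv    <- sva(dat, mod, mod0)                       # estimate surrogate variables
sv$n.sv                                            # number of surrogates found
## sv$sv can then be included as covariates in the downstream gene-wise models
</code></pre>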
<p> </p>
<p>The blessing of dimensionality I’m describing here is related to the idea that <a href="http://andrewgelman.com/2004/10/27/the_blessing_of/">Andrew Gelman refers to in this 2004 post.</a> Basically, since an increasingly large number of measurements are made on the same observations, there is an inherent structure to those observations. If you take advantage of that structure, then as the dimensionality of your problem increases you actually get <strong>better estimates</strong> of the structure in your high-dimensional data - a nice blessing!</p>
How to Get Ahead in Academia
2015-04-09T13:38:01+00:00
http://simplystats.github.io/2015/04/09/how-to-get-ahead-in-academia
<p>This video on how to make it in academia was produced over 10 years ago by Steven Goodman for the ENAR Junior Researchers Workshop. Now the whole world can benefit from its wisdom.</p>
<p>The movie features current and former JHU Biostatistics faculty, including Francesca Dominici, Giovanni Parmigiani, Scott Zeger, and Tom Louis. You don’t want to miss Scott Zeger’s secret formula for getting promoted!</p>
Why You Need to Study Statistics
2015-04-02T21:42:06+00:00
http://simplystats.github.io/2015/04/02/why-you-need-to-study-statistics
<p>The American Statistical Association is continuing its campaign to get you to study statistics, if you haven’t already. I have to agree with them that being a statistician is a pretty good job. Their latest video highlights a wide range of statisticians working in industry, government, and academia. You can check it out here:</p>
Teaser trailer for the Genomic Data Science Specialization on Coursera
2015-03-26T10:06:43+00:00
http://simplystats.github.io/2015/03/26/teaser-trailer-for-the-genomic-data-science-specialization-on-coursera
<p> </p>
<p>We have been hard at work in the studio putting together our next specialization to launch on Coursera. It will be called the “Genomic Data Science Specialization” and includes a spectacular line up of instructors: <a href="http://salzberg-lab.org/">Steven Salzberg</a>, <a href="http://ccb.jhu.edu/people/mpertea/">Ela Pertea</a>, <a href="http://jamestaylor.org/">James Taylor</a>, <a href="http://ccb.jhu.edu/people/florea/">Liliana Florea</a>, <a href="http://www.hansenlab.org/">Kasper Hansen</a>, and me. The specialization will cover command line tools, statistics, Galaxy, Bioconductor, and Python. There will be a capstone course at the end of the sequence featuring an in-depth genomic analysis. If you are a grad student, postdoc, or principal investigator in a group that does genomics this specialization is for you. If you are a person looking to transition into one of the hottest areas of research with the new precision medicine initiative this is for you. Get pumped and share the teaser-trailer with your friends!</p>
Introduction to Bioconductor HarvardX MOOC starts this Monday March 30
2015-03-24T09:24:27+00:00
http://simplystats.github.io/2015/03/24/introduction-to-bioconductor-harvardx-mooc-starts-this-monday-march-30
<p>Bioconductor is one of the most widely used open source toolkits for biological high-throughput data. In this four week course, co-taught with Vince Carey and Mike Love, we will introduce you to Bioconductor’s general infrastructure and then focus on two specific technologies: next generation sequencing and microarrays. The lectures and assessments will be annotated in case you want to focus only on one of these two technologies, although if you plan to be a bioinformatician we recommend you learn both.</p>
<p>Topics covered include:</p>
<ul>
<li>A short introduction to molecular biology and measurement technology</li>
<li>An overview on how to leverage the platform and genome annotation packages and experimental archives</li>
<li>GenomicRanges: the infrastructure for storing, manipulating and analyzing next generation sequencing data (a short example follows this list)</li>
<li>Parallel computing and cloud concepts</li>
<li>Normalization, preprocessing and bias correction.</li>
<li>Statistical inference in practice: including hierarchical models and gene set enrichment analysis</li>
<li>Building statistical analysis pipelines of genome-scale assays including the creation of reproducible reports</li>
</ul>
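<p>As a taste of the GenomicRanges material mentioned above, here is a tiny made-up example of storing and querying aligned reads:</p>
<pre><code>library(GenomicRanges)
reads <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
                 ranges   = IRanges(start = c(100, 250, 500), width = 75),
                 strand   = c("+", "-", "+"))
reads
width(reads)                                                 # read lengths
subsetByOverlaps(reads, GRanges("chr1", IRanges(200, 400)))  # reads in a region
</code></pre>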
<p>Throughout the class we will be using data examples from both next generation sequencing and microarray experiments.</p>
<p>We will assume <a href="https://www.edx.org/course/statistics-r-life-sciences-harvardx-ph525-1x">basic knowledge of Statistics and R</a>.</p>
<p>For more information visit the <a href="https://www.edx.org/course/introduction-bioconductor-harvardx-ph525-4x">course website</a>.</p>
A surprisingly tricky issue when using genomic signatures for personalized medicine
2015-03-19T13:06:32+00:00
http://simplystats.github.io/2015/03/19/a-surprisingly-tricky-issue-when-using-genomic-signatures-for-personalized-medicine
<p>My student Prasad Patil has a really nice paper that <a href="http://bioinformatics.oxfordjournals.org/content/early/2015/03/18/bioinformatics.btv157.full.pdf?keytype=ref&ijkey=loVpUJfJxG2QjoE">just came out in Bioinformatics</a> (<a href="http://biorxiv.org/content/early/2014/06/06/005983">preprint</a> in case paywalled). The paper is about a surprisingly tricky normalization issue with genomic signatures. Genomic signatures are basically statistical/machine learning functions applied to the measurements for a set of genes to predict how long patients will survive, or how they will respond to therapy. The issue is that usually when building and applying these signatures, people normalize across samples in the training and testing set.</p>
<p>An example of this normalization is to mean-center the measurements for each gene in the testing/application stage, then apply the prediction rule. The problem is that if you use a different set of samples when calculating the mean you can get a totally different prediction function. The basic problem is illustrated in this graphic.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-12.58.03-PM.png"><img class="aligncenter wp-image-3947 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-12.58.03-PM-300x227.png" alt="Screen Shot 2015-03-19 at 12.58.03 PM" width="300" height="227" srcset="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-12.58.03-PM-300x227.png 300w, http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-12.58.03-PM-260x197.png 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p>This seems like a pretty esoteric statistical issue, but it turns out that this one simple normalization problem can dramatically change the results of the predictions. In particular, we show that the predictions for the same patient, with the exact same data, can change dramatically if you just change the subpopulations of patients within the testing set. In this plot, Prasad made predictions for the exact same set of patients two times when the patient population varied in ER status composition. As many as 30% of the predictions were different for the same patient with the same data if you just varied who they were being predicted with.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-1.02.25-PM.png"><img class="aligncenter wp-image-3948 size-full" src="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-1.02.25-PM.png" alt="Screen Shot 2015-03-19 at 1.02.25 PM" width="494" height="277" srcset="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-1.02.25-PM-300x168.png 300w, http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-1.02.25-PM.png 494w" sizes="(max-width: 494px) 100vw, 494px" /></a></p>
<p> </p>
<p>This paper highlights how tricky statistical issues can slow down the process of translating ostensibly really useful genomic signatures into clinical practice and lends even more weight to the idea that precision medicine is a statistical field.</p>
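<p>Here is a toy R illustration of the issue (simulated values, not data from the paper): one patient, one gene, and a rule trained on mean-centered data. The prediction flips depending on which cohort the patient happens to be normalized with.</p>
<pre><code>set.seed(5)
low  <- rnorm(50, mean = 2)    # the gene in a low-expression subtype
high <- rnorm(50, mean = 6)    # the same gene in a high-expression subtype
risk <- function(x) ifelse(x > 0, "high risk", "low risk")  # rule on centered data

patient <- 4                   # one fixed patient with intermediate expression

setA <- c(patient, low)        # test cohort dominated by the low subtype
setB <- c(patient, high)       # test cohort dominated by the high subtype

risk(scale(setA, scale = FALSE)[1])   # "high risk"
risk(scale(setB, scale = FALSE)[1])   # "low risk" -- same patient, same data
</code></pre>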
A simple (and fair) way all statistics journals could drive up their impact factor.
2015-03-18T16:32:10+00:00
http://simplystats.github.io/2015/03/18/a-simple-and-fair-way-all-statistics-journals-could-drive-up-their-impact-factor
<p>Hypothesis:</p>
<blockquote>
<p>If every method in every stats journal was implemented in a corresponding R package (<a href="http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/">easy</a>), was required to have a companion document that was a tutorial on how to use the software (<a href="http://www.bioconductor.org/help/package-vignettes/">easy</a>), included a reference to how to cite the paper if you used the software (<a href="http://www.inside-r.org/r-doc/utils/citation">easy</a>) and the paper/tutorial was posted to the relevant message boards for the communities of interest (<a href="http://seqanswers.com/forums/showthread.php?t=42018">easy</a>) that journal would see a dramatic bump in its impact factor.</p>
</blockquote>
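<p>The citation piece really is easy. For example, a package could ship an <code>inst/CITATION</code> file along these lines (the package name and reference below are made up):</p>
<pre><code>## inst/CITATION for a hypothetical package "mymethod"
citHeader("To cite the mymethod package in publications, please use:")
bibentry(
  bibtype = "Article",
  title   = "Title of the companion methods paper",
  author  = "Author One and Author Two",
  journal = "Name of the statistics journal",
  year    = "2015"
)
</code></pre>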
Data science done well looks easy - and that is a big problem for data scientists
2015-03-17T10:47:12+00:00
http://simplystats.github.io/2015/03/17/data-science-done-well-looks-easy-and-that-is-a-big-problem-for-data-scientists
<p>Data science has a ton of different definitions. For the purposes of this post I’m going to use the definition of data science we used when creating our Data Science program online. Data science is:</p>
<blockquote>
<p>Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.</p>
</blockquote>
<p>In general the data science process is iterative and the different components blend together a little bit. But for simplicity lets discretize the tasks into the following 7 steps:</p>
<ol>
<li>Define the question of interest</li>
<li>Get the data</li>
<li>Clean the data</li>
<li>Explore the data</li>
<li>Fit statistical models</li>
<li>Communicate the results</li>
<li>Make your analysis reproducible</li>
</ol>
<p>A good data science project answers a real scientific or business analytics question. In almost all of these experiments the vast majority of the analyst’s time is spent on getting and cleaning the data (steps 2-3) and communication and reproducibility (6-7). In most cases, if the data scientist has done her job right the statistical models don’t need to be incredibly complicated to identify the important relationships the project is trying to find. In fact, if a complicated statistical model seems necessary, it often means that you don’t have the right data to answer the question you really want to answer. One option is to spend a huge amount of time trying to tune a statistical model to answer the question, but serious data scientists usually instead try to go back and get the right data.</p>
<p>The result of this process is that most well executed and successful data science projects don’t (a) use super complicated tools or (b) fit super complicated statistical models. The characteristics of the most successful data science projects I’ve evaluated or been a part of are: (a) a laser focus on solving the scientific problem, (b) careful and thoughtful consideration of whether the data is the right data and whether there are any lurking confounders or biases and (c) relatively simple statistical models applied and interpreted skeptically.</p>
<p>It turns out doing those three things is actually surprisingly hard and very, very time consuming. It is my experience that data science projects take a solid 2-3 times as long to complete as a project in theoretical statistics. The reason is that inevitably the data are a mess and you have to clean them up, then you find out the data aren’t quite what you wanted to answer the question, so you go find a new data set and clean it up, etc. After a ton of work like that, you have a nice set of data to which you fit simple statistical models and then it looks <strong>super easy </strong>to someone who either doesn’t know about the data collection and cleaning process or doesn’t care.</p>
<p>This poses a major public relations problem for serious data scientists. When you show someone a good data science project they almost invariably think “oh that is easy” or “that is just a trivial statistical/machine learning model” and don’t see all of the work that goes into solving the real problems in data science. A concrete example of this is in academic statistics. It is customary for people to show theorems in their talks and maybe even some of the proof. This gives people working on theoretical projects an opportunity to “show their stuff” and demonstrate how good they are. The equivalent for a data scientist would be showing how they found and cleaned multiple data sets, merged them together, checked for biases, and arrived at a simplified data set. Showing the “proof” would be equivalent to showing how they matched IDs. These things often don’t look nearly as impressive in talks, particularly if the audience doesn’t have experience with how incredibly delicate real data analysis is. I imagine versions of this problem play out in industry as well (candidate X did a good analysis but it wasn’t anything special, candidate Y used Hadoop to do BIG DATA!).</p>
<p>The really tricky twist is that bad data science looks easy too. You can scrape a data set off the web and slap a machine learning algorithm on it no problem. So how do you judge whether a data science project is really “hard” and whether the data scientist is an expert? Just like with anything, there is no easy shortcut to evaluating data science projects. You have to ask questions about the details of how the data were collected, what kind of biases might exist, why they picked one data set over another, etc. In the meantime, don’t be fooled by what looks like simple data science - <a href="http://fivethirtyeight.com/interactives/senate-forecast/">it can often be pretty effective</a>.</p>
<p> </p>
<p><em>Editor’s note: If you like this post, you might like my pay-what-you-want book Elements of Data Analytic Style: <a href="https://leanpub.com/datastyle">https://leanpub.com/datastyle</a></em></p>
<p> </p>
π day special: How to use Bioconductor to find empirical evidence in support of π being a normal number
2015-03-14T10:15:10+00:00
http://simplystats.github.io/2015/03/14/%cf%80-day-special-how-to-use-bioconductor-to-find-empirical-evidence-in-support-of-%cf%80-being-a-normal-number
<p><em>Editor’s note: Today 3/14/15 at some point between 9:26:53 and 9:26:54 it was the most π day of them all. Below is a repost from last year. </em></p>
<p>Happy π day everybody!</p>
<p>I wanted to write some simple code (included below) to test the parallelization capabilities of my new cluster. So, in honor of π day, I decided to check for <a href="http://www.davidhbailey.com/dhbpapers/normality.pdf">evidence that π is a normal number</a>. A <a href="http://en.wikipedia.org/wiki/Normal_number">normal number</a> is a real number whose infinite sequence of digits has the property that the probability of observing any given m-digit pattern at a random position is 10<sup>−m</sup>. For example, using the Poisson approximation, we can predict that the pattern “123456789” should show up between 0 and 3 times in the <a href="http://stuff.mit.edu/afs/sipb/contrib/pi/">first billion digits of π</a> (it actually shows up twice, starting at the 523,551,502-th and 773,349,079-th decimal places).</p>
<p>To test our hypothesis, let Y<sub>1</sub>, …, Y<sub>100</sub> be the number of occurrences of “00”, “01”, …, “99” in the first billion digits of π. If π is in fact normal then the Ys should be approximately IID binomials with N = 1 billion and p = 0.01. In the qq-plot below I show the Z-scores (Y − 10,000,000) / √9,900,000, which appear to follow a normal distribution as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3, 4, 5, 6, and 7 digit patterns (for 5, 6 and 7 I sampled 10,000 patterns). Note that we can perform a chi-square test for the uniform distribution as well. For patterns of size 1, 2, 3 the p-values were 0.84, <del>0.89,</del> 0.92, and 0.99.</p>
<p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/pi-3/" rel="attachment wp-att-2792"><img class="alignnone size-full wp-image-2792" src="http://simplystatistics.org/wp-content/uploads/2014/03/pi2.png" alt="pi" width="4800" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/pi2-300x187.png 300w, http://simplystatistics.org/wp-content/uploads/2014/03/pi2-1024x640.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/pi2.png 4800w" sizes="(max-width: 4800px) 100vw, 4800px" /></a></p>
<p>Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.</p>
<p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/pi2/" rel="attachment wp-att-2793"><img class="alignnone size-full wp-image-2793" src="http://simplystatistics.org/wp-content/uploads/2014/03/pi21.png" alt="pi2" width="5600" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/pi21-1024x548.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/pi21.png 5600w" sizes="(max-width: 5600px) 100vw, 5600px" /></a></p>
<p>These observed counts should also be independent, and to explore this we can look at autocorrelation plots:</p>
<p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/piacf-2/" rel="attachment wp-att-2794"><img class="alignnone size-full wp-image-2794" src="http://simplystatistics.org/wp-content/uploads/2014/03/piacf1.png" alt="piacf" width="5600" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/piacf1-1024x548.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/piacf1.png 5600w" sizes="(max-width: 5600px) 100vw, 5600px" /></a></p>
<p>To do this in about an hour and with just a few lines of code (included below), I used the <a href="http://www.bioconductor.org/">Bioconductor</a> <a href="http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html">Biostrings</a> package to match strings and the <em>foreach</em> function to parallelize.</p>
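<p>A sketch of that approach, assuming the digits have been saved locally in a plain-text file (here called <code>pi-digits.txt</code>, containing only the digits) and a parallel backend such as <code>doParallel</code> is available, might look something like this:</p>
<pre><code>library(Biostrings)
library(foreach)
library(doParallel)
registerDoParallel(cores = 4)

## read the first billion digits of pi as one long string
digits <- BString(readChar("pi-digits.txt", nchars = 1e9))

## count each of the 100 two-digit patterns in parallel
patterns <- sprintf("%02d", 0:99)                 # "00", "01", ..., "99"
counts <- foreach(pat = patterns, .combine = c) %dopar%
  countPattern(pat, digits)

## under the normality hypothesis, counts ~ Binomial(1e9, 0.01)
z <- (counts - 1e9 * 0.01) / sqrt(1e9 * 0.01 * 0.99)
qqnorm(z); qqline(z)
</code></pre>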
<p>NB: A normal number has the above stated property in any base. The examples above are for base 10.</p>
De-weaponizing reproducibility
2015-03-13T10:24:05+00:00
http://simplystats.github.io/2015/03/13/de-weaponizing-reproducibility
<div>
A couple of weeks ago Roger and I went to a <a href="http://sites.nationalacademies.org/DEPS/BMSA/DEPS_153236">conference on statistical reproducibility </a>held at the National Academy of Sciences. The discussion was pretty wide ranging and I love that the thinking about reproducibility is coming back to statistics. There was pretty widespread support for the idea that prevention is the <a href="http://arxiv.org/abs/1502.03169">right way to approach reproducibility</a>.
</div>
<div>
</div>
<div>
It turns out I was the last speaker of the whole conference. This is an unenviable position to be in with so many bright folks speaking first as they covered a huge amount of what I wanted to say. <a href="http://www.slideshare.net/jtleek/evidence-based-data-analysis">My talk focused on three key points:</a>
</div>
<div>
</div>
<ol>
<li>The tools for reproducibility already exist, the barrier isn’t tools</li>
<li>We need to de-weaponize reproducibility</li>
<li>Prevention is the right approach to reproducibility</li>
</ol>
<p> </p>
<p>In terms of the first point, <a href="http://simplystatistics.org/2014/09/04/why-the-three-biggest-positive-contributions-to-reproducible-research-are-the-ipython-notebook-knitr-and-galaxy/">tools like iPython, knitr, and Galaxy</a> can be used to make all but the absolute largest analyses reproducible right now. Our group does this all the time with our papers and so do many others. The problem isn’t a lack of tools.</p>
<p>Speaking to point two, I think many people would agree that part of the issue is culture change. One issue that is increasingly concerning to me is the “weaponization” of reproducibility. I have been noticing that some of us (like me, my students, other folks at JHU, and lots of particularly junior computational people elsewhere) are trying really hard to be reproducible. Most of the time this results in really positive reactions from the community. But when a co-author of mine and I wrote that paper about the <a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt007.abstract">science-wise false discovery rate</a>, one of the discussants used our code (great), improved on it (great), identified a bug (great), and then did his level best to humiliate us both in front of the editor and the general public because of that bug (<a href="http://simplystatistics.org/2013/09/26/how-could-code-review-discourage-code-disclosure-reviewers-with-motivation/">not so great</a>).</p>
<div>
</div>
<div>
I have seen this happen several times. Most of the time if a paper is reproducible the authors get a pat on the back and their code is either ignored or used in a positive way. But for high-profile and important problems, people largely use reproducibility to:
</div>
<div>
</div>
<ol>
<li> Impose regulatory hurdles in the short term while people transition to reproducibility. One clear example of this is the <a href="https://www.congress.gov/bill/113th-congress/house-bill/4012">Secret Science Reform Act</a> which is a bill that imposes strict reproducibility conditions on all science before it can be used as evidence for regulation.</li>
<li>Humiliate people who aren’t good coders or who make mistakes in their code. This is what happened in my paper when I produced reproducible code for my analysis, but has also happened <a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/">to other people</a>.</li>
<li>Take advantage of people’s code to plagiarize/straight up steal work. I have stories about this I’d rather not put on the internet.</li>
</ol>
<p> </p>
<p>Of the three, I feel like (1) and (2) are the most common. Plagiarism and scooping by theft I think are actually relatively rare based on my own anecdotal experience. But I think that the “weaponization” of reproducibility to block regulation or to humiliate folks who are new to computational sciences is more common than I’d like it to be. Until reproducibility is the standard for everyone - which I think is possible now and will happen as the culture changes - the people who are the early adopters are at risk of being bludgeoned with their own reproducibility. As a community, if we want widespread reproducibility adoption we have to be ferocious about not allowing this to happen.</p>
The elements of data analytic style - so much for a soft launch
2015-03-03T11:22:28+00:00
http://simplystats.github.io/2015/03/03/the-elements-of-data-analytic-style-so-much-for-a-soft-launch
<p><em>Editor’s note: I wrote a book called Elements of Data Analytic Style. Buy it on <a href="https://leanpub.com/datastyle">Leanpub</a> or <a href="http://www.amazon.com/Elements-Data-Analytic-Style-ebook/dp/B00U6D80YY/ref=sr_1_1?ie=UTF8&qid=1425397222&sr=8-1&keywords=elements+of+data+analytic+style">Amazon</a>! If you buy it on Leanpub, you get all updates (there are likely to be some) for free and you can pay what you want (including zero) but the author would be appreciative if you’d throw a little scratch his way. </em></p>
<p>So uh, I was going to soft launch my new book The Elements of Data Analytic Style yesterday. I figured I’d just quietly email my Coursera courses to let them know I created a new reference. It turns out that that wasn’t very quiet. First this happened:</p>
<blockquote class="twitter-tweet" width="550">
<p>
<a href="https://twitter.com/jtleek">@jtleek</a> <a href="https://twitter.com/albertocairo">@albertocairo</a> <a href="https://twitter.com/simplystats">@simplystats</a> Instabuy. And apparently not just for me: it looks like you just Slashdotted <a href="https://twitter.com/leanpub">@leanpub</a>'s website.
</p>
<p>
— Andrew Janke (@AndrewJanke) <a href="https://twitter.com/AndrewJanke/status/572474567467401216">March 2, 2015</a>
</p>
</blockquote>
<p> </p>
<p>and sure enough the website was down:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM.png"><img class="aligncenter wp-image-3919 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM-300x202.png" alt="Screen Shot 2015-03-02 at 2.14.05 PM" width="300" height="202" srcset="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM-300x202.png 300w, http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM-1024x690.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM-260x175.png 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p> </p>
<p>then overnight it did something like 6,000+ units:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera.png"><img class="aligncenter wp-image-3920 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera-300x300.png" alt="whoacoursera" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p> </p>
<p>So lesson learned, there is no soft open with Coursera. Here is the post I was going to write though:</p>
<p> </p>
<p><strong>Post I was gonna write</strong></p>
<p>I have been doing data analysis for something like 10 years now (gulp!) and teaching data analysis in person for 6+ years. One of the things we do in <a href="https://github.com/jtleek/jhsph753and4">my data analysis class at Hopkins</a> is to perform a complete data analysis (from raw data to written report) every couple of weeks. Then I grade each assignment for everything from data cleaning to the written report and reproducibility. I’ve noticed over the course of teaching this class (and classes online) that there are many common elements of data analytic style that I don’t often see in textbooks, or when I do, I see them spread across multiple books.</p>
<p>I’ve posted on some of these issues in some open source guides I’ve posted to Github like:</p>
<ul>
<li><a href="http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/" target="_self">10 things statistics taught us about big data analysis</a></li>
<li><a href="https://github.com/jtleek/rpackages" target="_self">The Leek Group Guide to R packages</a></li>
<li><a href="https://github.com/jtleek/datasharing" target="_self">How to share data with a statistician</a></li>
</ul>
<p>But I decided that it might be useful to have a more complete guide to the “art” part of data analysis. One goal is to summarize in a succinct way the most common difficulties encountered by practicing data analysts. It may be a useful guide for peer reviewers who could refer to section numbers when evaluating manuscripts, for instructors who have to grade data analyses, as a supplementary text for a data analysis class, or just as a useful reference. It is modeled loosely in format and aim on the <a href="http://www.bartleby.com/141/">Elements of Style</a> by William Strunk. Just as with the EoS, both the checklist and my book cover a small fraction of the field of data analysis, but my experience is that once these elements are mastered, data analysts benefit most from hands on experience in their own discipline of application, and that many principles may be non-transferable beyond the basics. But just as with writing, new analysts would do better to follow the rules until they know them well enough to violate them.</p>
<ul>
<li><a href="https://leanpub.com/datastyle/">Buy EDAS on Leanpub</a></li>
<li><a href="http://www.amazon.com/Elements-Data-Analytic-Style-ebook/dp/B00U6D80YY/ref=sr_1_1?ie=UTF8&qid=1425397222&sr=8-1&keywords=elements+of+data+analytic+style">Buy EDAS on Amazon</a></li>
</ul>
<p>The book includes a basic checklist that may be useful as a guide for beginning data analysts or as a rubric for evaluating data analyses. I’m reproducing it here so you can comment/hate/enjoy on it.</p>
<p> </p>
<p><em><strong>The data analysis checklist</strong></em></p>
<p>This checklist provides a condensed look at the information in this book. It can be used as a guide during the process of a data analysis, as a rubric for grading data analysis projects, or as a way to evaluate the quality of a reported data analysis.</p>
<p><strong>I Answering the question</strong></p>
<ol>
<li>
<p>Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?</p>
</li>
<li>
<p>Did you define the metric for success before beginning?</p>
</li>
<li>
<p>Did you understand the context for the question and the scientific or business application?</p>
</li>
<li>
<p>Did you record the experimental design?</p>
</li>
<li>
<p>Did you consider whether the question could be answered with the available data?</p>
</li>
</ol>
<p><strong>II Checking the data</strong></p>
<ol>
<li>
<p>Did you plot univariate and multivariate summaries of the data?</p>
</li>
<li>
<p>Did you check for outliers?</p>
</li>
<li>
<p>Did you identify the missing data code?</p>
</li>
</ol>
<p><strong>III Tidying the data</strong></p>
<ol>
<li>
<p>Is each variable one column?</p>
</li>
<li>
<p>Is each observation one row?</p>
</li>
<li>
<p>Do different data types appear in each table?</p>
</li>
<li>
<p>Did you record the recipe for moving from raw to tidy data?</p>
</li>
<li>
<p>Did you create a code book?</p>
</li>
<li>
<p>Did you record all parameters, units, and functions applied to the data?</p>
</li>
</ol>
<p><strong>IV Exploratory analysis</strong></p>
<ol>
<li>
<p>Did you identify missing values?</p>
</li>
<li>
<p>Did you make univariate plots (histograms, density plots, boxplots)?</p>
</li>
<li>
<p>Did you consider correlations between variables (scatterplots)?</p>
</li>
<li>
<p>Did you check the units of all data points to make sure they are in the right range?</p>
</li>
<li>
<p>Did you try to identify any errors or miscoding of variables?</p>
</li>
<li>
<p>Did you consider plotting on a log scale?</p>
</li>
<li>
<p>Would a scatterplot be more informative?</p>
</li>
</ol>
<p><strong>V Inference</strong></p>
<ol>
<li>
<p>Did you identify what large population you are trying to describe?</p>
</li>
<li>
<p>Did you clearly identify the quantities of interest in your model?</p>
</li>
<li>
<p>Did you consider potential confounders?</p>
</li>
<li>
<p>Did you identify and model potential sources of correlation such as measurements over time or space?</p>
</li>
<li>
<p>Did you calculate a measure of uncertainty for each estimate on the scientific scale?</p>
</li>
</ol>
<p><strong>VI Prediction</strong></p>
<ol>
<li>
<p>Did you identify in advance your error measure?</p>
</li>
<li>
<p>Did you immediately split your data into training and validation?</p>
</li>
<li>
<p>Did you use cross validation, resampling, or bootstrapping only on the training data?</p>
</li>
<li>
<p>Did you create features using only the training data?</p>
</li>
<li>
<p>Did you estimate parameters only on the training data?</p>
</li>
<li>
<p>Did you fix all features, parameters, and models before applying to the validation data?</p>
</li>
<li>
<p>Did you apply only one final model to the validation data and report the error rate?</p>
</li>
</ol>
<p><strong>VII Causality</strong></p>
<ol>
<li>
<p>Did you identify whether your study was randomized?</p>
</li>
<li>
<p>Did you identify potential reasons that causality may not be appropriate such as confounders, missing data, non-ignorable dropout, or unblinded experiments?</p>
</li>
<li>
<p>If not, did you avoid using language that would imply cause and effect?</p>
</li>
</ol>
<p><strong>VIII Written analyses</strong></p>
<ol>
<li>
<p>Did you describe the question of interest?</p>
</li>
<li>
<p>Did you describe the data set, experimental design, and question you are answering?</p>
</li>
<li>
<p>Did you specify the type of data analytic question you are answering?</p>
</li>
<li>
<p>Did you specify in clear notation the exact model you are fitting?</p>
</li>
<li>
<p>Did you explain on the scale of interest what each estimate and measure of uncertainty means?</p>
</li>
<li>
<p>Did you report a measure of uncertainty for each estimate on the scientific scale?</p>
</li>
</ol>
<p><strong>IX Figures</strong></p>
<ol>
<li>
<p>Does each figure communicate an important piece of information or address a question of interest?</p>
</li>
<li>
<p>Do all your figures include plain language axis labels?</p>
</li>
<li>
<p>Is the font size large enough to read?</p>
</li>
<li>
<p>Does every figure have a detailed caption that explains all axes, legends, and trends in the figure?</p>
</li>
</ol>
<p><strong>X Presentations</strong></p>
<ol>
<li>
<p>Did you lead with a brief, understandable to everyone statement of your problem?</p>
</li>
<li>
<p>Did you explain the data, measurement technology, and experimental design before you explained your model?</p>
</li>
<li>
<p>Did you explain the features you will use to model data before you explain the model?</p>
</li>
<li>
<p>Did you make sure all legends and axes were legible from the back of the room?</p>
</li>
</ol>
<p><strong>XI Reproducibility</strong></p>
<ol>
<li>
<p>Did you avoid doing calculations manually?</p>
</li>
<li>
<p>Did you create a script that reproduces all your analyses?</p>
</li>
<li>
<p>Did you save the raw and processed versions of your data?</p>
</li>
<li>
<p>Did you record all versions of the software you used to process the data?</p>
</li>
<li>
<p>Did you try to have someone else run your analysis code to confirm they got the same answers?</p>
</li>
</ol>
<p><strong>XII R packages</strong></p>
<ol>
<li>
<p>Did you make your package name “Googleable”?</p>
</li>
<li>
<p>Did you write unit tests for your functions?</p>
</li>
<li>
<p>Did you write help files for all functions?</p>
</li>
<li>
<p>Did you write a vignette?</p>
</li>
<li>
<p>Did you try to reduce dependencies to actively maintained packages?</p>
</li>
<li>
<p>Have you eliminated all errors and warnings from R CMD CHECK?</p>
</li>
</ol>
<p> </p>
Advanced Statistics for the Life Sciences MOOC Launches Today
2015-03-02T09:37:39+00:00
http://simplystats.github.io/2015/03/02/advanced-statistics-for-the-life-sciences-mooc-launches-today
<p>In <a href="https://www.edx.org/course/advanced-statistics-life-sciences-harvardx-ph525-3x#.VPRzYSnffwc">In</a> we will teach statistical techniques that are commonly used in the analysis of high-throughput data and their corresponding R implementations. In Week 1 we will explain inference in the context of high-throughput data and introduce the concept of error controlling procedures. We will describe the strengths and weakness of the Bonferroni correction, FDR and q-values. We will show how to implement these in cases in which thousands of tests are conducted, as is typically done with genomics data. In Week 2 we will introduce the concept of mathematical distance and how it is used in exploratory data analysis, clustering, and machine learning. We will describe how techniques such as principal component analysis (PCA) and the singular value decomposition (SVD) can be used for dimension reduction in high dimensional data. During week 3 we will describe confounding, latent variables and factor analysis in the context of high dimensional data and how this relates to batch effects. We will show how to implement methods such as SVA to perform inference on data affected by batch effects. Finally, during week 4 we will show how statistical modeling, and empirical Bayes modeling in particular, are powerful techniques that greatly improve precision in high-throughput data. We will be using R code to explain concepts throughout the course. We will also be using exploratory data analysis and data visualization to motivate the techniques we teach during each week.</p>
Navigating Big Data Careers with a Statistics PhD
2015-02-18T10:12:29+00:00
http://simplystats.github.io/2015/02/18/navigating-big-data-careers-with-a-statistics-phd
<div>
<em>Editor's note: This is a guest post by <a href="http://www.drsherrirose.com/" target="_blank">Sherri Rose</a>. She is an Assistant Professor of Biostatistics in the Department of Health Care Policy at Harvard Medical School. Her work focuses on nonparametric estimation, causal inference, and machine learning in health settings. Dr. Rose received her BS in statistics from The George Washington University and her PhD in biostatistics from the University of California, Berkeley, where she coauthored a book on <a href="http://drsherrirose.com/targeted-learning-book/" target="_blank">Targeted Learning</a>. She tweets <a href="https://twitter.com/sherrirose" target="_blank">@sherrirose</a>.</em>
</div>
<div>
</div>
<div>
A quick scan of the science and technology headlines often yields two words: big data. The amount of information we collect has continued to increase, and this data can be found in varied sectors, ranging from social media to genomics. Claims are made that big data will solve an array of problems, from understanding devastating diseases to predicting political outcomes. There is substantial “big data” hype in the press, as well as business and academic communities, but how do upcoming, current, and recent statistical science PhDs handle the array of training opportunities and career paths in this new era? <a href="http://www.amstat.org/newsroom/pressreleases/2015-StatsFastestGrowingSTEMDegree.pdf" target="_blank">Undergraduate interest in statistics degrees is exploding</a>, bringing new talent to graduate programs and the post-PhD job pipeline. Statistics training is diversifying, with students focusing on theory, methods, computation, and applications, or a blending of these areas. A few years ago, Rafa outlined the academic career options for statistics PhDs in <a href="http://simplystatistics.org/2011/09/12/advice-for-stats-students-on-the-academic-job-market/" target="_blank">two</a> <a href="http://simplystatistics.org/2011/09/15/another-academic-job-market-option-liberal-arts/" target="_blank">posts</a>, which cover great background material I do not repeat here. The landscape for statistics PhD careers is also changing quickly, with a variety of companies attracting top statistics students in new roles. As a <a href="http://www.drsherrirose.com/" target="_blank">new faculty member</a> at the intersection of machine learning, causal inference, and health care policy, I've already found myself frequently giving career advice to trainees. The choices have become much more nuanced than just academia vs. industry vs. government.
</div>
<div>
</div>
<div>
</div>
<div>
So, you find yourself inspired by big data problems and fascinated by statistics. While you are a student, figuring out what you enjoy working on is crucial. This exploration could involve engaging in internship opportunities or collaborating with multiple faculty on different types of projects. Both positive and negative experiences can help you identify your preferences.
</div>
<div>
</div>
<div>
</div>
<div>
Undergraduates may wish to spend a couple months at a <a href="http://www.nhlbi.nih.gov/research/training/summer-institute-biostatistics-t15" target="_blank">Summer Institute for Training in Biostatistics</a> or <a href="http://www.nsf.gov/crssprgm/reu/" target="_blank">National Science Foundation Research Experience for Undergraduates</a>. There are <a href="https://www.udacity.com/course/st101" target="_blank">also</a> <a href="https://www.coursera.org/course/casebasedbiostat" target="_blank">many</a> <a href="https://www.coursera.org/specialization/jhudatascience/1" target="_blank">MOOC</a> <a href="https://www.edx.org/course/statistics-r-life-sciences-harvardx-ph525-1x#.VJOhXsAAPe" target="_blank">options</a> <a href="https://www.coursera.org/course/maththink" target="_blank">to</a> <a href="https://www.udacity.com/course/ud120" target="_blank">get</a> <a href="https://www.udacity.com/course/ud359" target="_blank">a</a> <a href="https://www.udacity.com/course/ud651" target="_blank">taste</a> <a href="https://www.edx.org/course/foundations-data-analysis-utaustinx-ut-7-01x#.VNpQRd4bakA" target="_blank">of</a> <a href="https://www.edx.org/course/introduction-linear-models-matrix-harvardx-ph525-2x#.VNpQS94bakA" target="_blank">different</a> <a href="https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x#.VNpQU94bakA" target="_blank">areas</a> <a href="https://www.edx.org/course/introduction-computational-thinking-data-mitx-6-00-2x-0#.VNpQWd4bakA" target="_blank">of</a><a href="https://www.edx.org/course/fundamentals-clinical-trials-harvardx-hsph-hms214x#.VNpQt94bakA" target="_blank">statistics</a>. Selecting a graduate program for PhD study can be a difficult choice, especially when your interests within statistics have yet to be identified, as is often the case for undergraduates. However, if you know that you have interests in software and programming, it can be easy to sort which statistical science PhD programs have a curricular or research focus in this area by looking at department websites. Similarly, if you know you want to work in epidemiologic methods, genomics, or imaging, specific programs are going to jump right to the top as good fits. Getting advice from faculty in your department will be important. Competition for admissions into statistics and biostatistics PhD programs has continued to increase, and most faculty advise applying to as many relevant programs as is reasonable given the demands on your time and finances. If you end up sitting on multiple (funded) offers come April, talking to current students, student alums, and looking at alumni placement can be helpful. Don't hesitate to contact these people, selectively. Most PhD programs genuinely do want you to end up in the place that is best for you, even if it is not with them.
</div>
<div>
</div>
<div>
</div>
<div>
Once you're in a PhD program, internship opportunities for graduate students are listed each year by the <a href="http://www.amstat.org/education/internships.cfm" target="_blank">American Statistical Association</a>. Your home department may also have ties with local research organizations and companies with openings. Internships can help you identify future positions and the types of environments where you will flourish in your career. <a href="https://www.linkedin.com/pub/lauren-kunz/a/aab/293" target="_blank">Lauren Kunz</a>, a recent PhD graduate in biostatistics from Harvard University, is currently a Statistician at the National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health. Dr. Kunz said, "As a previous summer intern at the NHLBI, I was able to get a feel for the day to day life of a biostatistician at the NHLBI. I found the NHLBI Office of Biostatistical Research to be a collegial, welcoming environment, and I soon learned that NHLBI biostatisticians have the opportunity to work on a variety of projects, very often collaborating with scientists and clinicians. Due to the nature of these collaborations, the biostatisticians are frequently presented with scientifically interesting and important statistical problems. This work often motivates methodological research which in turn has immediate, practical applications. These factors matched well with my interest in collaborative research that is both methodological and applied."
</div>
<div>
</div>
<div>
</div>
<div>
<span style="font-family: Helvetica;">Industry is also enticing to statistics PhDs, particularly those with an applied or computational focus, like <a href="http://www.stephaniesapp.com/" target="_blank">Stephanie Sapp</a> and</span> <a href="http://alyssafrazee.com/" target="_blank">Alyssa Frazee</a><span style="font-family: Helvetica;">. Dr. Sapp has a PhD in statistics from the University of California, Berkeley, and is currently a Quantitative Analyst at <a href="http://www.google.com/" target="_blank">Google</a>. She also completed an internship there the summer before she graduated. In commenting about her choice to join Google, Dr. Sapp said, "</span>I really enjoy both academic research and seeing my work used in practice. Working at Google allows me to continue pursuing new and interesting research topics, as well as see my results drive more immediate impact." <span style="font-family: Helvetica;">Dr. Frazee just finished her PhD in biostatistics at Johns Hopkins University and previously spent a summer exploring her interests in <a href="https://www.hackerschool.com/" target="_blank">Hacker School</a>. While she applied to both academic and industry positions, receiving multiple offers, she ultimately chose to go into industry and work for <a href="https://stripe.com/" target="_blank">Stripe</a>: "</span>I accepted a tech company's offer for many reasons, one of them being that I really like programming and writing code. There are tons of opportunities to grow as a programmer/engineer at a tech company, but building an academic career on that foundation would be more of a challenge. I'm also excited about seeing my statistical work have more immediate impact. At smaller companies, much of the work done there has visible/tangible bearing on the product. Academic research in statistics is operating a lot closer to the boundaries of what we know and discovering a lot of cool stuff, which means researchers get to try out original ideas more often, but the impact is less immediately tangible. A new method or estimator has to go through a lengthy peer review/publication process and be integrated into the community's body of knowledge, which could take several years, before its impact can be fully observed." One of Dr. Frazee, Dr. Sapp, and Dr. Kunz's considerations in choosing a job reflects many of those in the early career statistics community: having an impact.
</div>
<div>
</div>
<div>
</div>
<div>
<span style="font-family: Helvetica;">Interest in both developing methods </span><i>and</i> <span style="font-family: Helvetica;">translating statistical advances into practice is a common theme in the big data statistics world, but not one that always leads to an industry or government career. There are also academic opportunities in statistics, biostatistics, and interdisciplinary departments like my own where your work can have an impact on current science. The <a href="http://www.hcp.med.harvard.edu/" target="_blank">Department of Health Care Policy</a> (HCP) at Harvard Medical School has 5 tenure-track/tenured statistics faculty members, including myself, among a total of about 20 core faculty members. The statistics faculty work on a range of theoretical and methodological problems while collaborating with HCP faculty (health economists, clinician <wbr />researchers, and sociologists) and leading our own substantive projects in health care policy (e.g., <a href="http://www.massdac.org/" target="_blank">Mass-DAC</a>). I find it to be a unique and exciting combination of roles, and love that the science truly informs my statistical research, giving it broader impact. Since joining the department a year and a half ago, I've worked in many new areas, such as plan payment risk adjustment methodology. I have also applied some of my previous work in machine learning to predicting adverse health outcomes in large datasets. Here, I immediately saw a need for new avenues of statistical research to make the optimal approach based on statistical theory align with an optimal approach in practice. My current research portfolio is diverse; example projects include the development of a double robust estimator for the study of chronic disease, leading an evaluation of a new state-wide health plan initiative, and collaborating with department colleagues on statistical issues in all-payer claims databases, physician prescribing intensification behavior, and predicting readmissions. The <a href="http://statistics.fas.harvard.edu/" target="_blank">larger</a> <a href="http://www.hsph.harvard.edu/biostatistics/" target="_blank">statistics</a> <a href="http://www.iq.harvard.edu/" target="_blank">community</a> <a href="http://bcb.dfci.harvard.edu/" target="_blank">at</a> Harvard also affords many opportunities to interact with statistics faculty across the campus, and <a href="http://www.faculty.harvard.edu/" target="_blank">university-wide junior faculty events</a> have connected me with professors in computer science and engineering. I feel an immense sense of research freedom to pursue my interests at HCP, which was a top priority when I was comparing job offers.</span>
</div>
<div>
</div>
<div>
</div>
<div>
<a href="http://had.co.nz/" target="_blank">Hadley Wickam</a>, of <a href="http://www.amazon.com/dp/0387981403/" target="_blank">ggplot2</a> and <a href="http://www.amazon.com/dp/1466586966/" target="_blank">Advanced R</a> fame, took on a new role as Chief Scientist at <a href="http://www.rstudio.com/" target="_blank">RStudio</a> in 2013. Freedom was also a key component in his choice to move sectors: "For me, the driving motivation is freedom: I know what I want to work on, I just need the freedom (and support) to work on it. It's pretty unusual to find an industry job that has more freedom than academia, but I've been noticeably more productive at RStudio because I don't have any meetings, and I can spend large chunks of time devoted to thinking about hard problems. It's not possible for everyone to get that sort of job, but everyone should be thinking about how they can negotiate the freedom to do what makes them happy. I really like the thesis of Cal Newport's book <a href="http://www.amazon.com/dp/1455509124/" target="_blank"><i>So </i></a><a href="http://www.amazon.com/dp/1455509124/" target="_blank"><i>Good They Can't Ignore You</i></a> - the better you are at your job, the greater your ability to negotiate for what you want."
</div>
<div>
</div>
<div>
</div>
<div>
There continues to be a strong emphasis in the work force on the vaguely defined field of “data science,” which incorporates the collection, storage, analysis, and interpretation of big data. Statisticians not only work in and lead teams with other scientists (e.g., clinicians, biologists, computer scientists) to attack big data challenges, but with each other. Your time as a statistics trainee is an amazing opportunity to explore your strengths and preferences, and which sectors and jobs appeal to you. Do your due diligence to figure out which employers are interested in and supportive of the type of career you want to create for yourself. Think about how you want to spend your time, and remember that you're the only person who has to live your life once you get that job. Other people's opinions are great, but your values and instincts matter too. Your definition of "best" doesn't have to match someone else's. Ask questions! Try new things! The potential for breakthroughs with novel flexible methods is strong. Statistical science training has progressed to the point where trainees are armed with thorough knowledge in design, methodology, theory, and, increasingly, data collection, applications, and computation. Statisticians working in data science are poised to continue making important contributions in all sectors for years to come. Now, you just need to decide where you fit.
</div>
Introduction to Linear Models and Matrix Algebra MOOC starts this Monday Feb 16
2015-02-13T09:00:11+00:00
http://simplystats.github.io/2015/02/13/introduction-to-linear-models-and-matrix-algebra-mooc-starts-this-monday-feb-16
<p>Matrix algebra is the language of modern data analysis. We use it to develop and describe statistical and machine learning methods, and to code efficiently in languages such as R, matlab and python. Concepts such as principal component analysis (PCA) are best described with matrix algebra. It is particularly useful to describe linear models.</p>
<p>Linear models are everywhere in data analysis. ANOVA, linear regression, limma, edgeR, DEseq, most smoothing techniques, and batch correction methods such as SVA and Combat are based on linear models. In this two-week MOOC we will describe the basics of matrix algebra, demonstrate how linear models are used in the life sciences, and show how to implement these efficiently in R.</p>
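<p>As a small sketch of the connection described above (not course material itself), the least squares estimates from <code>lm()</code> can be reproduced directly with matrix algebra; the simulated two-group data are hypothetical.</p>
<pre><code class="language-r">set.seed(1)
n <- 100
group <- factor(rep(c("control", "treated"), each = n / 2))
y <- 5 + 2 * (group == "treated") + rnorm(n)   # simulated outcome

X <- model.matrix(~ group)                     # design matrix: intercept + group indicator
beta_hat <- solve(t(X) %*% X, t(X) %*% y)      # least squares solution (X'X)^{-1} X'y
beta_hat
coef(lm(y ~ group))                            # lm() returns the same estimates
</code></pre>
<p>In practice <code>lm()</code> (or packages like limma when fitting thousands of such models) handles the numerical details, but the matrix form makes the connection among these methods explicit.</p>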
<p>Update: Here is <a href="https://www.edx.org/course/introduction-linear-models-matrix-harvardx-ph525-2x">the link</a> to the class</p>
Is Reproducibility as Effective as Disclosure? Let's Hope Not.
2015-02-12T10:21:35+00:00
http://simplystats.github.io/2015/02/12/is-reproducibility-as-effective-as-disclosure-lets-hope-not
<p>Jeff and I just this week published a <a href="http://www.pnas.org/content/112/6/1645.full">commentary</a> in the <em>Proceedings of the National Academy of Sciences</em> on our latest thinking on reproducible research and its ability to solve the reproducibility/replication “crisis” in science (there’s a version on <a href="http://arxiv.org/abs/1502.03169">arXiv</a> too). In a nutshell, we believe reproducibility (making data and code available so that others can recompute your results) is an essential part of science, but it is not going to end the crisis of confidence in science. In fact, I don’t think it’ll even make a dent. The problem is that reproducibility, as a tool for preventing poor research, comes in at the wrong stage of the research process (the end). While requiring reproducibility may deter people from committing outright fraud (a small group), it won’t stop people who just don’t know what they’re doing with respect to data analysis (a much larger group).</p>
<p>In an eerie coincidence, Jesse Eisinger of the investigative journalism non-profit ProPublica, has just published a piece on the New York Times Dealbook site discussing how <a href="http://dealbook.nytimes.com/2015/02/11/an-excess-of-sunlight-a-paucity-of-rules/">requiring disclosure rules in the financial industry has produced meager results</a>. He writes</p>
<blockquote>
<p class="story-body-text">
Over the last century, disclosure and transparency have become our regulatory crutch, the answer to every vexing problem. We require corporations and government to release reams of information on food, medicine, household products, consumer financial tools, campaign finance and crime statistics. We have a booming “report card” industry for a range of services, including hospitals, public schools and restaurants.
</p>
</blockquote>
<p class="story-body-text">
The rationale for all this disclosure is that
</p>
<blockquote>
<p class="story-body-text">
someone, somewhere reads the fine print in these contracts and keeps corporations honest. It turns out what we laymen intuit is true: <a href="http://www.law.nyu.edu/news/ideas/Marotta-Wurgler-standard-form-contracts-fine-print">No one reads them</a>, according to research by a New York University law professor, Florencia Marotta-Wurgler.
</p>
</blockquote>
<p class="story-body-text">
But disclosure is nevertheless popular because how could you be against it?
</p>
<blockquote>
<p class="story-body-text">
The disclosure bonanza is easy to explain. Nobody is against it. It’s politically expedient. Companies prefer such rules, especially in lieu of actual regulations that would curtail bad products or behavior. The opacity lobby — the <a href="http://en.wikipedia.org/wiki/Remora">remora fish</a> class of lawyers, lobbyists and consultants in New York and Washington — knows that disclosure requirements are no bar to dodgy practices. You just have to explain what you’re doing in sufficiently incomprehensible language, a task that earns those lawyers a hefty fee.
</p>
</blockquote>
<p class="story-body-text">
In the now infamous <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">Duke Saga</a>, Keith Baggerly was able to reproduce the work of Potti et al. after roughly 2,000 hours of work because the data were publicly available (although the code was not). It's not clear how much time would have been saved if the code had been available, but it seems reasonable to assume that it would have taken some amount of time to <em>understand</em> the analysis, if not reproduce it. Once the errors in Potti's work were discovered, it took 5 years for the original Nature Medicine paper to be retracted.
</p>
<p class="story-body-text">
Although you could argue that the process worked in some sense, it came at tremendous cost of time and money. Wouldn't it have been better if the analysis had been done right in the first place?
</p>
The trouble with evaluating anything
2015-02-09T19:24:22+00:00
http://simplystats.github.io/2015/02/09/the-trouble-with-evaluating-anything
<p>It is very hard to evaluate people’s productivity or work in any meaningful way. This problem is the source of:</p>
<ol>
<li><a href="http://simplystatistics.org/2013/09/26/how-could-code-review-discourage-code-disclosure-reviewers-with-motivation/">Consternation about peer review</a></li>
<li><a href="http://simplystatistics.org/2014/02/21/heres-why-the-scientific-publishing-system-can-never-be-fixed/">The reason why post publication peer review doesn’t work</a></li>
<li><a href="http://simplystatistics.org/2012/05/24/how-do-we-evaluate-statisticians-working-in-genomics/">Consternation about faculty evaluation</a></li>
<li>Major problems at companies like <a href="http://www.bloomberg.com/bw/articles/2013-11-12/yahoos-latest-hr-disaster-ranking-workers-on-a-curve">Yahoo</a> and <a href="http://www.bloomberg.com/bw/articles/2013-11-13/microsoft-kills-its-hated-stack-rankings-dot-does-anyone-do-employee-reviews-right">Microsoft</a>.</li>
</ol>
<p>Roger and I were just talking about this problem in the context of evaluating the impact of software as a faculty member and Roger suggested the problem is that:</p>
<blockquote>
<p>Evaluating people requires real work and so people are always looking for shortcuts</p>
</blockquote>
<p>To evaluate a person’s work or their productivity requires three things:</p>
<ol>
<li>To be an expert in what they do</li>
<li>To have absolutely no reason to care whether they succeed or not</li>
<li>To have time available to evaluate them</li>
</ol>
<p>These three fundamental things are at the heart of why it is so hard to get good evaluations of people and why peer review and other systems are under such fire. The main source of the problem is the conflict between 1 and 2. The group of people in any organization or on any scale that is truly world class at any given topic, from software engineering to history, is small. It has to be by definition. This group of people inevitably has some reason to care about the success of the other people in that same group. Either they work with the other world class people and want them to succeed, or they are competing with them, intentionally or unintentionally.</p>
<p>The conflict between being an expert and having no stake wouldn’t be such a problem if it wasn’t for issue number 3: the time to evaluate people. To truly get good evaluations what you need is for someone who <em>isn’t an expert in a field and so has no stake</em> to take the time to become an expert and then evaluate the person/software. But this requires a huge amount of effort on the part of a reviewer who has to become an expert in a new field. Given that reviewing is often considered the least important task in people’s workflow (as evidenced by how little we reward acting as a peer reviewer for journals, or doing a careful job on promotion evaluations in companies), it is no wonder people don’t take the time to become experts.</p>
<p>I actually think that tenure review committees at forward-thinking places may be the best at this (<a href="http://simplystatistics.org/2012/12/20/the-nih-peer-review-system-is-still-the-best-at-identifying-innovative-biomedical-investigators/">Rafa said the same thing about NIH study section</a>). They at least attempt to get outside reviews from people who are unbiased about the work that a faculty member is doing before they are promoted. This system, of course, has large and well-documented problems, but I think it is better than having a person’s direct supervisor - who clearly has a stake - being the only person evaluating them. It is also better than only using quantifiable metrics like number of papers and impact factor of the corresponding journals. I also think that most senior faculty who evaluate people take the job very seriously despite the only incentive being good citizenship.</p>
<p>Since real evaluation requires hard work and expertise, most of the time people are looking for a short cut. These short cuts typically take the form of quantifiable metrics. In the academic world these shortcuts are things like:</p>
<ol>
<li>Number of papers</li>
<li>Citations to academic papers</li>
<li>The impact factor of a journal</li>
<li>Downloads of a person’s software</li>
</ol>
<p>I think all of these things are associated with quality but none define quality. You could try to model the relationship, but it is very hard to come up with a universal definition for the outcome you are trying to model. In academics, some people have suggested that <a href="http://www.michaeleisen.org/blog/?p=694">open review or post-publication review</a> solves the problem. But this is only true for a very small subset of cases that violate rule number 2. The only papers that get serious post-publication review are where people have an incentive for the paper to go one way or the other. This means that papers in Science will be post-pub reviewed much much more often than equally important papers in discipline specific journals - just because people care more about Science. This will leave the vast majority of papers unreviewed - as evidenced by the relatively modest number of papers reviewed by <a href="https://pubpeer.com/">PubPeer</a> or <a href="http://www.ncbi.nlm.nih.gov/pubmedcommons/">Pubmed Commons.</a></p>
<p>I’m beginning to think that the only way to do evaluation well is to hire people whose <em>only job is to evaluate something well</em>. In other words, peer reviewers who are paid to review papers full time and are only measured by how often those papers are retracted or proved false. Or tenure reviewers who are paid exclusively to evaluate tenure cases and are measured by how well the post-tenure process goes for the people they evaluate and whether there is any measurable bias in their reviews.</p>
<p>The trouble with evaluating anything is that it is hard work and right now we aren’t paying anyone to do it.</p>
<p> </p>
Johns Hopkins Data Science Specialization Top Performers
2015-02-05T10:40:14+00:00
http://simplystats.github.io/2015/02/05/johns-hopkins-data-science-specialization-top-performers
<p><em>Editor’s note: The Johns Hopkins Data Science Specialization is the largest data science program in the world. <a href="http://www.bcaffo.com/">Brian</a>, <a href="http://www.biostat.jhsph.edu/~rpeng/">Roger</a>, and <a href="http://jtleek.com/">myself</a> conceived the program at the beginning of January 2014, then built, recorded, and launched the classes starting in April 2014 with the help of <a href="https://twitter.com/iragooding">Ira</a>. Since April 2014 we have enrolled 1.76 million students and awarded 71,589 Signature Track verified certificates. The first capstone class ran in October - just 7 months after the first classes launched and 4 months after all classes were running. Despite this incredibly short time frame, 917 students finished all 9 classes and enrolled in the Capstone Course. Of those, 478 successfully completed the course.</em></p>
<p>When we first announced the Data Science Specialization, we said that the top performers would be profiled here on Simply Statistics. Well, that time has come, and we’ve got a very impressive group of participants that we want to highlight. These folks have successfully completed all nine MOOCs in the specialization and earned top marks in our first capstone session with <a href="http://swiftkey.com/en/">SwiftKey</a>. We had the pleasure of meeting some of them last week in a video conference, and we were struck by their insights and expertise. Check them out below.</p>
<h2 id="sasa-bogdanovic"><strong>Sasa Bogdanovic</strong></h2>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/Sasa-Bogdanovic.jpg"><img class="size-thumbnail wp-image-3874 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/Sasa-Bogdanovic-120x90.jpg" alt="Sasa-Bogdanovic" width="120" height="90" /></a></p>
<p>Sasa Bogdanovic is passionate about everything data. For the last 6 years, he’s been working in the iGaming industry, providing data products (integrations, data warehouse architectures and models, business intelligence tools, analyst reports and visualizations) for clients, helping them make better, data-driven, business decisions.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>Although I’ve been working with data for many years, I wanted to take a different perspective and learn more about data science concepts and get insights into the whole pipeline from acquiring data to developing final data products. I also wanted to learn more about statistical models and machine learning.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>I am very happy to have discovered the data science field. It is a whole new world that I find fascinating and inspiring to explore. I am looking forward to my new career in data science. This will allow me to combine all my previous knowledge and experience with my new insights and methods. I am very proud of every single quiz, assignment and project. For sure, the capstone project was a culmination, and I am very proud and happy to have succeeded to make a solid data product and to be a one of the top performers in the group. For this I am very grateful to the instructors, community TAs, all other peers for their contributions in the forums, and Coursera for putting it all together and making it possible.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>I have already put the certificate in motion. My company is preparing new projects, and I expect the certificate to add weight to our proposals.</p>
<h2 id="alejandro-morales-gallardo">Alejandro Morales Gallardo</h2>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/Alejandro.png"><img class="size-thumbnail wp-image-3875 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/Alejandro-120x90.png" alt="Alejandro" width="120" height="90" /></a></p>
<p>I’m a trained physicist with strong coding skills. I have a passion for dissecting datasets to find the hidden stories in data and produce insights through creative visualizations. A hackathon and open-data aficionado, I have an interest in using data (and science) to improve our lives.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-1"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>I wanted to close a gap in my skills and transition to becoming a full-blown Data Scientist by learning key concepts and practices in the field. Learning R, an industry-relevant language, while creating a portfolio to showcase my abilities in the entire data science pipeline seemed very attractive.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-1"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>I’m most proud of the Predictive Text App I developed. With the Capstone Project, it was extremely rewarding to be able to tackle a brand new data type and learn about text mining and natural language processing while building a fun and attractive data product. I was particularly proud that the accuracy of my app was not that far off from the SwiftKey smartphone app. I’m also proud of being a top performer!</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-1"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>I want to apply my new set of skills to develop other products, analyze new datasets and keep growing my portfolio. It is also helpful to have Verified Certificates to show prospective employers.</p>
<h2 id="nitin-gupta">Nitin Gupta</h2>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/NitinGupta.jpg"><img class="size-thumbnail wp-image-3876 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/NitinGupta-120x90.jpg" alt="NitinGupta" width="120" height="90" /></a></p>
<p>Nitin is an independent trader and quant strategist with over 13 years of multi-faceted experience in the investment management industry. In the past he worked for a leading investment management firm where he built automated trading and risk management systems and gained complete life-cycle expertise in creating systematic investment products. He has a background in computer science with a strong interest in machine learning and its applications in quantitative modeling.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-2"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>I was fortunate to have done the first Machine Learning course taught by Prof. Andrew Ng at the launch of Coursera in 2012, which really piqued my interest in the topic. The next course I did on Coursera was Prof. Roger Peng’s Computing For Data Analysis which introduced me to R. I realized that R was ideally suited for the quantitative modeling work I was doing. When I learned about the range of topics that the JHU DSS would cover - from the best practices in tidying and transforming data to modeling, analysis and visualization - I did not hesitate to sign up. Learning how to do all of this in an ecosystem built around R has been a huge plus.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-2"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>I am quite pleased with the web apps I built which utilize the concepts learned during the track. One of my apps visualizes and compares historical stock performance with other stocks and market benchmarks after querying the data directly from web resources. Another one showcases a predictive typing engine that dynamically predicts the next few words to use and append, as the user types a sentence. The process of building these apps provided a fantastic learning experience. Also, for the first time I built something that even my near and dear ones could use and appreciate, which is terrific.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-2"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>The broad skill set developed through this specialization could be applied across multiple domains. My current focus is on building robust quantitative models for systematic trading strategies that could learn and adapt to changing market environments. This would involve the application of machine learning techniques among other skills learned during the specialization. Using R and Shiny to interactively analyze the results would be tremendously useful.</p>
<h2 id="marc-kreyer">Marc Kreyer</h2>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/Marc-Kreyer.jpeg"><img class="size-thumbnail wp-image-3877 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/Marc-Kreyer-120x90.jpeg" alt="Marc Kreyer" width="120" height="90" /></a></p>
<p>Marc Kreyer is an expert business analyst and software engineer with extensive experience in financial services in Austria and Liechtenstein. He successfully finishes complex projects by not only using broad IT knowledge but also outstanding comprehension of business needs. Marc loves combining his programming and database skills with his affinity for mathematics to transform data into insight.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-3"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>There are many data science MOOCs, but usually they are independent 4-6 week courses. The JHU Data Science Specialization was the first offering of a series of courses that build upon each other.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-3"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>Creating a working text prediction app without any prior NLP knowledge and only minimal assistance from instructors.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-3"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>Knowledge and experience are the most valuable things gained from the Data Science Specialization. As they can’t be easily shown to future employers, the certificate can be a good indicator for them. Unfortunately there is neither an issue date nor a verification link on the certificate, so it will be interesting to see how valuable it really will be.</p>
<h2 id="hsing-liu">Hsing Liu</h2>
<p> </p>
<p style="text-align: left;">
<a href="http://simplystatistics.org/wp-content/uploads/2015/02/Paul_HsingLiu.jpeg"><img class="size-thumbnail wp-image-3878" src="http://simplystatistics.org/wp-content/uploads/2015/02/Paul_HsingLiu-120x90.jpeg" alt="Paul_HsingLiu" width="120" height="90" /></a>
</p>
<p>I studied in the U.S. for a number of years, and received my M.S. in mathematics from NYU before returning to my home country, Taiwan. I’m most interested in how people think and learn, and education in general. This year I’m starting a new career as an iOS app engineer.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-4"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>In my brief past job as an instructional designer, I read a lot about the new wave of online education, and was especially intrigued by how Khan Academy’s data science division is using data to help students learn. It occurred to me that to leverage my math background and make a bigger impact in education (or otherwise), data science could be an exciting direction to take.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-4"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>It may sound boring, but I’m proud of having done my best for each course in the track, going beyond the bare requirements when I’m able. The parts of the Specialization fit into a coherent picture of the discipline, and I’m glad to have put in the effort to connect the dots and gained a new perspective.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-4"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>I’m listing the certificate on my resume and LinkedIn, and I expect to be applying what I’ve learned once my company’s e-commerce app launches.</p>
<h2 id="yichen-liu">Yichen Liu</h2>
<p> </p>
<p>Yichen Liu is a business analyst at Toyota Western Australia where he is responsible for business intelligence development, data analytics and business improvement. His prior experience includes working as a sessional lecturer and tutor at Curtin University in finance and econometrics units.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-5"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>Recognising the trend that the world is more data-driven than before, I felt it was necessary to gain further understanding in data analysis to tackle both current and future challenges at work.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-5"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>What I am most proud of from the program is that I have gained some basic knowledge in a totally new area, natural language processing. Though its connection with my current work is limited, I see the future of data analysis becoming more unstructured-data-driven and am willing to develop more knowledge in this area.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-5"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>I see the certificate as a stepping stone into the data science world, and would like to conduct more advanced studies in data science especially for unstructured data analysis.</p>
<h2 id="johann-posch">Johann Posch</h2>
<p style="text-align: left;">
<a href="http://simplystatistics.org/wp-content/uploads/2015/02/PictureJohannPosch.png"><img class="size-thumbnail wp-image-3879" src="http://simplystatistics.org/wp-content/uploads/2015/02/PictureJohannPosch-120x90.png" alt="PictureJohannPosch" width="120" height="90" /></a>
</p>
<p>After graduating from Vienna University of Technology with a specialization in Artificial Intelligence I joined Microsoft. There I worked as a developer on various products, but the majority of the time as a Windows OS developer. After venturing into start-ups for a few years I joined GE Research to work on the Predix Big Data Platform, and recently I joined the Industrial Data Science team.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-6"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>Ever since I wrote my masters thesis in Neural Networks I have been intrigued with machine learning. I see data science as a field where great advances will happen over the next decade and as an opportunity to positively impact millions of lives. I like how JHU structured the course series.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-6">What are you most proud of doing as part of the JHU Data Science Specialization?</h3>
<p>Being able to complete the JHU Data Science Specialization in 6 months and to get a distinction on every one of the courses was a great success. However, the best moment was probably the way my capstone project (next word prediction) turned out: the model could be trained in incremental steps and was able to provide meaningful options in real time.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-6"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>The course covered the concepts and tools needed to successfully address data science problems. It gave me the confidence and knowledge to apply for data science position. I am now working in the field at GE Research. I am grateful to all who made this Specialization happen!</p>
<h2 id="jason-wilkinson">Jason Wilkinson</h2>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/JasonWilkinson.jpg"><img class="size-thumbnail wp-image-3880 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/JasonWilkinson-120x90.jpg" alt="JasonWilkinson" width="120" height="90" /></a></p>
<p>Jason Wilkinson is a trader of commodity futures and other financial securities at a small proprietary trading firm in New York City. He and his wife, Katie, and dog, Charlie, can frequently be seen at the Jersey shore. And no, it’s nothing like the tv show, aside from the fist pumping.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-7"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>The JHU Data Science Specialization helped me to prepare as I begin working on a Masters of Computer Science specializing in Machine Learning at Georgia Tech and also in researching algorithmic trading ideas. I also hope to find ways of using what I’ve learned in philanthropic endeavors, applying data science for social good.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-7"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>I’m most proud of going from knowing zero R code to being able to apply it in the capstone and other projects in such a short amount of time.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-7"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>The knowledge gained in pursuing the specialization certificate alone was worth the time put into it. A certificate is just a piece of paper. It’s what you can do with the knowledge gained that counts.</p>
<h2 id="uli-zellbeck">Uli Zellbeck</h2>
<p> </p>
<p style="text-align: left;">
<a href="http://simplystatistics.org/wp-content/uploads/2015/02/Uli.jpg"><img class="size-thumbnail wp-image-3881" src="http://simplystatistics.org/wp-content/uploads/2015/02/Uli-120x90.jpg" alt="Uli" width="120" height="90" /></a>
</p>
<p> </p>
<p>I studied economics in Berlin with focus on econometrics and business informatics. I am currently working as a Business Intelligence / Data Warehouse Developer in an e-commerce company. I am interested in recommender systems and machine learning.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-8"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>I wanted to learn about Data Science because it provides a different approach on solving business problems with data. I chose the JHU Data Science Specialization on Coursera because it promised a wide range of topics and I like the idea of online courses. Also, I had experience with R and I wanted to deepen my knowledge with this tool.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-8">What are you most proud of doing as part of the JHU Data Science Specialization?</h3>
<p>There are two things. I successfully took all nine courses in 4 months and the capstone project was really hard work.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-8"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>I might get the chance to develop a Data Science department at my company. I would like to use the certificate as a basis to gain deeper knowledge in the many parts of Data Science.</p>
<h2 id="fred-zhengzhenhao">Fred Zheng Zhenhao</h2>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/ZHENG-Zhenhao.jpeg"><img class="size-thumbnail wp-image-3882 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/ZHENG-Zhenhao-120x90.jpeg" alt="ZHENG Zhenhao" width="120" height="90" /></a></p>
<p>By the time I enrolled in the JHU Data Science Specialization, I was an undergraduate student at The Hong Kong Polytechnic University. Before that, I had read some data mining books and felt excited about the content, but I never got to implement any of the algorithms because I barely had any programming skills. After taking this series of courses, I am now able to use R to analyze the web content related to my research.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-9"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>I took this series of courses as a challenge to myself: I wanted to see whether my interest could carry me through 9 courses and 1 capstone project, and I did want to learn more in this field. This specialization is different from other data mining or machine learning classes in that it covers the entire process, including Git, R, R Markdown, Shiny, etc., and I think these are necessary skills too.</p>
<p><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></p>
<p>Getting my word prediction app to respond in 0.05 seconds was already exciting, and one of the reviewers said: “congratulations your engine came up with the most correct prediction among those I reviewed: 3 out of 5, including one that stumped everyone else: ‘child might stick her finger or a foreign object into an electrical (outlet)’”. I guess that’s the part I am most proud of.</p>
<p><strong>How are you planning on using your Data Science Specialization Certificate?</strong></p>
<p>It definitely goes in my CV for future job hunting.</p>
Early data on knowledge units - atoms of statistical education
2015-02-05T09:44:49+00:00
http://simplystats.github.io/2015/02/05/early-data-on-knowledge-units-atoms-of-statistical-education
<p>Yesterday I posted <a href="http://simplystatistics.org/2015/02/04/knowledge-units-the-atoms-of-statistical-education/">about atomizing statistical education into knowledge units</a>. You can try out the first knowledge unit here: <a href="https://jtleek.typeform.com/to/jMPZQe">https://jtleek.typeform.com/to/jMPZQe</a>. The early data is in and it is consistent with many of our hypotheses about the future of online education.</p>
<p>Namely:</p>
<ol>
<li>Completion rates are high when segments are shorter</li>
<li>You can learn something about statistics in a short amount of time (2 minutes to complete, many people got all questions right)</li>
<li>People will consume educational material on tablets/smartphones more and more.</li>
</ol>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/Screen-Shot-2015-02-05-at-9.34.51-AM.png"><img class="aligncenter wp-image-3863" src="http://simplystatistics.org/wp-content/uploads/2015/02/Screen-Shot-2015-02-05-at-9.34.51-AM.png" alt="Screen Shot 2015-02-05 at 9.34.51 AM" width="500" height="402" srcset="http://simplystatistics.org/wp-content/uploads/2015/02/Screen-Shot-2015-02-05-at-9.34.51-AM-300x241.png 300w, http://simplystatistics.org/wp-content/uploads/2015/02/Screen-Shot-2015-02-05-at-9.34.51-AM.png 1004w" sizes="(max-width: 500px) 100vw, 500px" /></a></p>
<p> </p>
Knowledge units - the atoms of statistical education
2015-02-04T16:45:21+00:00
http://simplystats.github.io/2015/02/04/knowledge-units-the-atoms-of-statistical-education
<p><em>Editor’s note: This idea is <a href="http://www.bcaffo.com/">Brian’s idea</a> and based on conversations with him and Roger, but I just executed it.</em></p>
<p>The length of academic courses has traditionally ranged from a few days for a short course to a few months for a semester-long course. Lectures are typically either 30 minutes or one hour. Term and lecture lengths have been dictated by tradition and the relative inconvenience of coordinating schedules of the instructors and students for shorter periods of time. As classes have moved online the barrier of inconvenience to varying the length of an academic course has been removed. Despite this flexibility, most academic online courses adhere to the traditional semester-long format. For example, the first massive online open courses were simply semester-long courses directly recorded and offered online.</p>
<p>Data collected from massive online open courses suggest that <a href="https://onlinelearninginsights.wordpress.com/2014/04/28/mooc-design-tips-maximizing-the-value-of-video-lectures/">shorter lecture videos</a> and the <a href="https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop">shorter course format</a> lead to higher student retention. These results line up with data on other online activities such as Youtube video watching or form completion, which also show that shorter activities lead to higher completion rates.</p>
<p>We have some of the earliest and most highly subscribed massive online open courses through the Coursera platform: Data Analysis, Computing for Data Analysis, and Mathematical Biostatistics Bootcamp. Our original courses were translated from courses we offered locally and were therefore closer to semester long with longer lectures ranging from 15-30 minutes. Based on feedback from our students and the data we observed about completion rates, we made the decision to break our courses down into smaller, one-month courses with no more than two hours of lecture material per week. Since then, we have enrolled more than a million students in our MOOCs.</p>
<p>The data suggest that the shorter you can make an academic unit online, the higher the completion percentage. The question then becomes “How short can you make an online course?” To answer this question requires a definition of a course. For our purposes we will define a course as an educational unit consisting of the following three components:</p>
<ul>
<li>
<p><strong>Knowledge delivery</strong> - the distribution of educational material through lectures, audiovisual materials, and course notes.</p>
</li>
<li>
<p><strong>Knowledge evaluation</strong> - the evaluation of how much of the knowledge delivered to a student is retained.</p>
</li>
<li>
<p><strong>Knowledge certification</strong> - an independent claim or representation that a student has learned some set of knowledge.</p>
</li>
</ul>
<p> </p>
<p>A typical university class delivers 36 hours (12 weeks x 3 hours/week) of content knowledge, evaluates that knowledge with on the order of 10 homework assignments and 2 tests, and results in a certification equivalent to 3 university credits. With this definition, what is the smallest possible unit that satisfies all three components of a course? We will call this smallest possible unit one knowledge unit. The smallest knowledge unit that satisfies all three components is a course that:</p>
<ul>
<li>
<p><strong>Delivers a single unit of content</strong> - We will define a single unit of content as a text, image, or video describing a single concept.</p>
</li>
<li>
<p><strong>Evaluates that single unit of content</strong> - The smallest unit of evaluation possible is a single question to evaluate a student’s knowledge.</p>
</li>
<li>
<p><strong>Certifies knowledge</strong> - Provides the student with a statement of successful evaluation of the knowledge in the knowledge unit.</p>
</li>
</ul>
<p>An example of a knowledge unit appears here: <a href="https://jtleek.typeform.com/to/jMPZQe">https://jtleek.typeform.com/to/jMPZQe</a>. The knowledge unit consists of a short (less than two minutes) video and 3 quiz questions. When completed, the unit sends the completer an email verifying that the quiz has been completed. Just as an atom is the smallest unit of mass that defines a chemical element, the knowledge unit is the smallest unit of education that defines a course.</p>
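<p>As a rough back-of-the-envelope calculation (using the numbers above, not actual course data), here is how many knowledge units a traditional 36-hour course would decompose into if each unit carries roughly a two-minute video and three quiz questions:</p>
<pre><code class="r"># Back-of-the-envelope: a semester course expressed in knowledge units
# (rough numbers from this post, not exact course data)
lecture_hours  <- 36   # 12 weeks x 3 hours/week
unit_video_min <- 2    # roughly a two-minute video per knowledge unit
unit_questions <- 3    # quiz questions per knowledge unit

units_per_course <- lecture_hours * 60 / unit_video_min
units_per_course                   # about 1080 knowledge units of content
units_per_course * unit_questions  # about 3240 evaluation questions
</code></pre>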
<p>Shrinking the units down to this scale opens up some ideas about how you can connect them together into courses and credentials. I’ll leave that for a future post.</p>
Precision medicine may never be very precise - but it may be good for public health
2015-01-30T14:24:17+00:00
http://simplystats.github.io/2015/01/30/precision-medicine-will-never-be-very-precise-but-it-may-be-good-for-public-health
<p><em>Editor’s note: This post was originally titled: <a href="http://simplystatistics.org/2013/06/12/personalized-medicine-is-primarily-a-population-health-intervention/">Personalized medicine is primarily a population health intervention</a>. It has been updated with the graph of odds ratios/betas from GWAS studies.</em></p>
<p>There has been a lot of discussion of <a href="http://en.wikipedia.org/wiki/Personalized_medicine">personalized medicine</a>, <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/">individualized health</a>, and <a href="http://www.ucsf.edu/welcome-to-ome">precision medicine</a> in the news and in the medical research community and President Obama just <a href="http://www.whitehouse.gov/the-press-office/2015/01/30/fact-sheet-president-obama-s-precision-medicine-initiative">announced a brand new initiative in precision medicine</a> . Despite this recent attention, it is clear that healthcare has always been personalized to some extent. For example, men are rarely pregnant and heart attacks occur more often among older patients. In these cases, easily collected variables such as sex and age, can be used to predict health outcomes and therefore used to “personalize” healthcare for those individuals.</p>
<p>So why the recent excitement around personalized medicine? The reason is that it is increasingly cheap and easy to collect more precise measurements about patients that might be able to predict their health outcomes. An example that <a href="http://www.nytimes.com/2013/05/14/opinion/my-medical-choice.html?_r=0">has recently been in the news</a> is the measurement of mutations in the BRCA genes. Angelina Jolie made the decision to undergo a prophylactic double mastectomy based on her family history of breast cancer and measurements of mutations in her BRCA genes. Based on these measurements, previous studies had suggested she might have a lifetime risk as high as 80% of developing breast cancer.</p>
<p>This kind of scenario will become increasingly common as newer and more accurate genomic screening and predictive tests are used in medical practice. When I read these stories there are two points I think of that sometimes get obscured by the obviously fraught emotional, physical, and economic considerations involved with making decisions on the basis of new measurement technologies:</p>
<ol>
<li><strong>In individualized health/personalized medicine the “treatment” is information about risk</strong>. In <a href="http://en.wikipedia.org/wiki/Gleevec">some cases</a> treatment will be personalized based on assays. But in many other cases, we still do not (and likely will not) have perfect predictors of therapeutic response. In those cases, the healthcare will be “personalized” in the sense that the patient will get more precise estimates of their likelihood of survival, recurrence etc. This means that patients and physicians will increasingly need to think about/make decisions with/act on information about risks. But communicating and acting on risk is a notoriously challenging problem; personalized medicine will dramatically raise the importance of <a href="http://understandinguncertainty.org/">understanding uncertainty</a>.</li>
<li><strong>Individualized health/personalized medicine is a population-level treatment.</strong> Assuming that the 80% lifetime risk estimate was correct for Angelina Jolie, it still means there is a 1 in 5 chance she was never going to develop breast cancer. If that had been her case, then the surgery was unnecessary. So while her decision was based on personal information, there is still uncertainty in that decision for her. So the “personal” decision may not always be the “best” decision for any specific individual. It may however, be the best thing to do for everyone in a population with the same characteristics.</li>
</ol>
<p>The first point bears serious consideration in light of President Obama’s new proposal. We have already collected a massive amount of genetic data about a large number of common diseases. In almost all cases, the amount of predictive information that we can glean from genetic studies is modest. One paper pointed this issue out in a rather snarky way by comparing two approaches to predicting people’s heights: (1) averaging their parents’ heights - an approach from the Victorian era - and (2) combining the latest information on the best genetic markers at the time. It turns out that all the genetic information we gathered isn’t as good as <a href="http://www.nature.com/ejhg/journal/v17/n8/full/ejhg20095a.html">averaging parents’ heights</a>. Another way to see this is to download data on all genetic variants associated with disease from the <a href="http://www.genome.gov/gwastudies/">GWAS catalog</a> that have a P-value less than 1 x 10^-8. If you do that and look at the distribution of effect sizes, you see that 95% have an odds ratio or beta coefficient less than about 4. Here is a histogram of the effect sizes:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall.png"><img class="aligncenter size-full wp-image-3852" src="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall.png" alt="gwas-overall" width="480" height="480" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall.png 480w" sizes="(max-width: 480px) 100vw, 480px" /></a></p>
<p> </p>
<p> </p>
<p>This means that nearly all identified genetic effects are small. The ones that are really large (effect size greater than 100) are not for common disease outcomes; they are for <a href="http://en.wikipedia.org/wiki/Birdshot_chorioretinopathy">Birdshot chorioretinopathy</a> and hippocampal volume. You can really see this by zooming the plot in on the x-axis, which shows that the bulk of the effect sizes are less than 2:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed.png"><img class="aligncenter size-full wp-image-3853" src="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed.png" alt="gwas-zoomed" width="480" height="480" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed.png 480w" sizes="(max-width: 480px) 100vw, 480px" /></a></p>
<p> </p>
<p> </p>
<p>These effect sizes translate into very limited predictive capacity for most identified genetic biomarkers. The implication is that personalized medicine, at least for common diseases, is highly likely to be inaccurate for any individual person. But if we can take advantage of the population-level improvements in health from precision medicine by increasing risk literacy, improving our use of uncertain markers, and understanding that precision medicine isn’t precise for any one person, it could be a really big deal.</p>
Reproducible Research Course Companion
2015-01-26T16:22:36+00:00
http://simplystats.github.io/2015/01/26/reproducible-research-course-companion
<p><a href="https://itunes.apple.com/us/book/reproducible-research/id961495566?ls=1&mt=13" rel="https://itunes.apple.com/us/book/reproducible-research/id961495566?ls=1&mt=13"><img class="alignright wp-image-3838" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM-779x1024.png" alt="Screen Shot 2015-01-26 at 4.14.26 PM" width="331" height="435" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM-228x300.png 228w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM-779x1024.png 779w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM-152x200.png 152w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM.png 783w" sizes="(max-width: 331px) 100vw, 331px" /></a>I’m happy to announce that you can now get a copy of the <a title="Reproducible Research Course Companion" href="https://itunes.apple.com/us/book/reproducible/id961495566?ls=1&mt=13" target="_blank">Reproducible Research Course Companion</a> from the Apple iBookstore. The purpose of this e-book is pretty simple. The book provides all of the key video lectures from my <a title="JHU/Coursera Reproducible Research Course " href="https://www.coursera.org/course/repdata" target="_blank">Reproducible Research course</a> offered on Coursera, in a simple offline e-book format. The book can be viewed on a Mac, iPad, or iPad mini.</p>
<p>If you’re interested in taking my Reproducible Research course on Coursera and would like a flavor of what the course will be like, then you can view the lectures through the book (the free sample contains three lectures). On the other hand, if you already took the course and would like access to the lecture material afterwards, then this might be a useful add-on. If you are currently enrolled in the course, then this could be a handy way for you to take the lectures on the road with you.</p>
<p>Please note that all of the lectures are still available for free on YouTube via my <a href="https://www.youtube.com/channel/UCZA0RbbSK1IXeeJysKYRWuQ" target="_blank">YouTube channel</a>. Also, the book provides content only. If you wish to actually complete the course, you must take it through the Coursera web site.</p>
Data as an antidote to aggressive overconfidence
2015-01-21T11:58:07+00:00
http://simplystats.github.io/2015/01/21/data-as-an-antidote
<p>A recent <a href="http://www.nytimes.com/2014/12/07/opinion/sunday/adam-grant-and-sheryl-sandberg-on-discrimination-at-work.html?_r=0">NY Times op-ed</a> reminded us of the many biases faced by women at work. A <a href="http://time.com/3666135/sheryl-sandberg-talking-while-female-manterruptions/">follow-up piece</a> gave specific recommendations for how to conduct ourselves in meetings. In general, I found these very insightful, but I don’t necessarily agree with the recommendation that women should “Practice Assertive Body Language”. Instead, we should make an effort to judge ideas by their content and not be impressed by body language. More generally, it is a problem that many of the characteristics that help advance careers contribute nothing to intellectual output. One of these is what I call <em>aggressive overconfidence</em>.</p>
<p>Here is an example (based on a true story). A data scientist finds a major flaw with the data analysis performed by a prominent data-producing scientist’s lab. Both are part of a large collaborative project. A meeting is held among the project leaders to discuss the disagreement. The data producer is very self-confident in defending his approach. The data scientist, who is not nearly as aggressive, is <a href="http://time.com/3666135/sheryl-sandberg-talking-while-female-manterruptions/">interrupted</a> so much that she barely gets her point across. The project leaders decide that this seems to be simply a difference of opinion and, for all practical purposes, ignore the data scientist. I imagine this story sounds familiar to many. While in many situations this story ends here, when the results are data driven we can actually fact check opinions that are pronounced as fact. In this example, the data is public and anybody with the right expertise can download the data and corroborate the flaw in the analysis. This is typically quite tedious, but it can be done. Because the key flaws are rather complex, the project leaders, lacking expertise in data analysis, can’t make this determination. But eventually, a chorus of fellow data analysts will be too loud to ignore.</p>
<p>That aggressive overconfidence is generally rewarded in academia is a problem. And if this trait is <a href="http://scholar.google.com/scholar?hl=en&as_sdt=0,22&q=overconfidence+gender">highly correlated with being male</a>, then a manifestation of this is a worsened gender gap. My experience (including reading internet discussions among scientists on controversial topics) has convinced me that this trait is in fact correlated with gender. But the solution is not to help women become more aggressively overconfident. Instead we should continue to strive to judge work based on content rather than style. I am optimistic that more and more, data, rather than who sounds more sure of themselves, will help us decide who wins a debate.</p>
<p> </p>
Gorging ourselves on "free" health care: Harvard's dilemma
2015-01-20T09:00:56+00:00
http://simplystats.github.io/2015/01/20/gorging-ourselves-on-free-health-care-harvards-dilemma
<p><em>Editor’s note: This is a guest post by <a href="http://www.hcp.med.harvard.edu/faculty/core/laura-hatfield-phd">Laura Hatfield</a>. Laura is an Assistant Professor of Health Care Policy at Harvard Medical School, with a specialty in Biostatistics. Her work focuses on understanding trade-offs and relationships among health outcomes. Dr. Hatfield received her BS in genetics from Iowa State University and her PhD in biostatistics from the University of Minnesota. She tweets <a href="https://twitter.com/bioannie">@bioannie</a></em></p>
<p>I didn’t imagine when I joined Harvard’s Department of Health Care Policy that the New York Times would be <a href="http://www.nytimes.com/2015/01/06/us/health-care-fixes-backed-by-harvards-experts-now-roil-its-faculty.html">writing about my benefits package</a>. Then a vocal and aggrieved group of faculty <a href="http://www.thecrimson.com/article/2014/11/12/harvards-health-benefits-unfairness/">rebelled against health benefits changes</a> for 2015, and commentators responded by gleefully <a href="http://www.thefiscaltimes.com/2015/01/07/Harvards-Whiny-Profs-Could-Get-Obamacare-Bonus">skewering</a> entitled-sounding Harvard professors. But I’m a statistician, so I want to talk data.</p>
<p>Health care spending is tremendously right-skewed. The figure below shows the annual spending distribution among people with any spending (~80% of the total population) in two data sources on people covered by employer-sponsored insurance, such as the Harvard faculty. Notice that the y axis is on the log scale. More than half of people spend $1000 or less, but a few very unfortunate folks top out near half a million.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/spending_distribution.jpg"><img class="alignnone size-full wp-image-3814" src="http://simplystatistics.org/wp-content/uploads/2015/01/spending_distribution.jpg" alt="spending_distribution" width="600" height="400" /></a></p>
<p>Source: <a href="https://www.bea.gov/papers/working_papers.htm">Measuring health care costs of individuals with employer-sponsored health insurance in the US: A comparison of survey and claims data</a>. A. Aizcorbe, E. Liebman, S. Pack, D.M. Cutler, M.E. Chernew, A.B. Rosen. BEA working paper. WP2010-06. June 2010.</p>
<p>If, instead of contributing to my premiums, Harvard gave me the $1000/month premium contribution in the form of wages, I would be on the hook for my own health care expenses. If I stay healthy, I pocket the money, minus income taxes. If I get sick, I have the extra money available to cover the expenses…provided I’m not one of the unlucky 10% of people spending more than $12,000/year. In that case, the additional wages would be insufficient to cover my health care expenses. This “every woman for herself” system lacks the key benefit of insurance: risk pooling. The sickest among us would be bankrupted by health costs. Another good reason for an employer to give me benefits is that I do not pay taxes on this part of my compensation (more on that later).</p>
<p>At the opposite end of the spectrum is the Harvard faculty health insurance plan. Last year, the university paid ~$1030/month toward my premium and I put in ~$425 (tax-free). In exchange for this ~$17,000 of premiums, my family got first-dollar insurance coverage with very low co-pays. Faculty contributions to our collective health care expenses were distributed fairly evenly among all of us, with only minimal cost sharing to reflect how much care each person consumed. The sickest among us were in no financial peril. My family didn’t use much care and thus didn’t get our (or Harvard’s) money’s worth for all that coverage, but I’m ok with it. I still prefer risk pooling.</p>
<p>Here’s the problem: moral hazard. It’s a word I learned when I started hanging out with health economists. It describes the tendency of people to over-consume goods that feel free, such as health care paid through premiums or desserts at an all-you-can-eat buffet. Just look at this array—how much cake do <em>you</em> want to eat for $9.99?</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/buffet.jpg"><img class="alignnone size-large wp-image-3815" src="http://simplystatistics.org/wp-content/uploads/2015/01/buffet-1024x768.jpg" alt="buffet" width="500" height="380" /></a></p>
<p>Source: https://www.flickr.com/photos/jimmonk/5687939526/in/photostream/</p>
<p>One way to mitigate moral hazard is to expose people to more of their cost of care at the point of service instead of through premiums. You might think twice about that fifth tiny cake if you were paying per morsel. This is what the new Harvard faculty plans do: our premiums actually go down, but now we face a modest deductible, $250 per person or $750 max for a family. This is meant to encourage faculty to use their health care more efficiently, but it still affords good protection against catastrophic costs. The out-of-pocket max remains low at $1500 per individual or $4500 per family, with recent announcements to further protect individuals who pay more than 3% of salary in out-of-pocket health costs through a reimbursement program.</p>
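<p>As a purely hypothetical illustration of how deductible-plus-cap cost sharing works, here is a small R function. The $250 deductible and $1500 out-of-pocket maximum are the individual figures quoted above, but the 10% coinsurance rate is invented for illustration and is not the actual plan design:</p>
<pre><code class="r"># Hypothetical cost-sharing illustration (not the actual Harvard plan).
# Deductible and out-of-pocket max are the individual figures quoted above;
# the 10% coinsurance rate is made up purely for illustration.
faculty_cost <- function(spend, deductible = 250, coinsurance = 0.10,
                         oop_max = 1500) {
  below_deductible <- pmin(spend, deductible)
  coinsured        <- coinsurance * pmax(spend - deductible, 0)
  pmin(below_deductible + coinsured, oop_max)
}

faculty_cost(c(100, 1000, 5000, 50000))  # 100, 325, 725, 1500
</code></pre>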
<p>The allocation of individuals’ contributions between premiums and point-of-service costs is partly a question of how we cross-subsidize each other. If Harvard’s total contribution remains the same and health care costs do not grow faster than wages (ha!), then increased cost sharing decreases the amount by which people who use less care subsidize those who use more. How you feel about the “right” level of cost sharing may depend on whether you’re paying or receiving a subsidy from your fellow employees. And maybe your political leanings.</p>
<p>What about the argument that it is better for an employer to “pay” workers by health insurance premium contributions rather than wages because of the tax benefits? While we might prefer to get our compensation in the form of tax-free health benefits vs taxed wages, the university, like all employers, is looking ahead to the <a href="http://www.forbes.com/sites/sallypipes/2014/12/01/a-cadillac-tax-for-everyone/">Cadillac tax provision of the ACA</a>. So they have to do some re-balancing of our overall compensation. If Harvard reduces its health insurance contributions to avoid the tax, we might reasonably <a href="http://www.washingtonpost.com/blogs/wonkblog/wp/2013/08/30/youre-spending-way-more-on-your-health-benefits-than-you-think/">expect to make up that difference</a> in higher wages. The empirical evidence is <a href="http://www.hks.harvard.edu/fs/achandr/JLE_LaborMktEffectsRisingHealthInsurancePremiums_2006.pdf">complicated</a> and suggests that employers may not immediately return savings on health benefits dollar-for-dollar in the form of wages.</p>
<p>As far as I can tell, Harvard is contributing roughly the same amount as last year toward my health benefits, but exact numbers are difficult to find. I switched plan types (into a high-deductible plan, but that’s a topic for another post!), so I can’t directly compare Harvard’s contributions in the same plan type this year and last. Peter Ubel <a href="http://www.peterubel.com/health_policy/how-behavioral-economics-could-have-prevented-the-harvard-meltdown-over-healthcare-costs/">argues</a> that if the faculty <em>had</em> seen these figures, we might not have revolted. The actuarial value of our plans remains very high (91%, just a bit better than the expensive Platinum plans on the exchanges) and Harvard’s spending on health care has grown from 8% to 12% of the university’s budget over the past few years. Would these data have been sufficient to quell the insurrection? Good question.</p>
If you were going to write a paper about the false discovery rate you should have done it in 2002
2015-01-16T10:58:04+00:00
http://simplystats.github.io/2015/01/16/if-you-were-going-to-write-a-paper-about-the-false-discovery-rate-you-should-have-done-it-in-2002
<p>People often talk about academic superstars as people who have written highly cited papers. Some of that has to do with people’s genius, or ability, or whatever. But one factor that I think sometimes gets lost is luck and timing. So I wrote a little script to get the first 30 papers that appear when you search Google Scholar for the terms:</p>
<ul>
<li>empirical processes</li>
<li>proportional hazards model</li>
<li>generalized linear model</li>
<li>semiparametric</li>
<li>generalized estimating equation</li>
<li>false discovery rate</li>
<li>microarray statistics</li>
<li>lasso shrinkage</li>
<li>rna-seq statistics</li>
</ul>
<p>Google Scholar sorts by relevance, but that relevance is driven to a large degree by citations. For example, here are the first 10 papers you get when searching for false discovery rate:</p>
<ul>
<li>Controlling the false discovery rate: a practical and powerful approach to multiple testing</li>
<li>Thresholding of statistical maps in functional neuroimaging using the false discovery rate</li>
<li>The control of the false discovery rate in multiple testing under dependency</li>
<li>Controlling the false discovery rate in behavior genetics research</li>
<li>Identifying differentially expressed genes using false discovery rate controlling procedures</li>
<li>The positive false discovery rate: A Bayesian interpretation and the q-value</li>
<li>On the adaptive control of the false discovery rate in multiple testing with independent statistics</li>
<li>Implementing false discovery rate control: increasing your power</li>
<li>Operating characteristics and extensions of the false discovery rate procedure</li>
<li>Adaptive linear step-up procedures that control the false discovery rate</li>
</ul>
<p>People who work in this area will recognize that many of these papers are the most important/most cited in the field.</p>
<p>Now we can make a plot that shows for each term when these 30 highest ranked papers appear. There are some missing values, because of the way the data are scraped, but this plot gives you some idea of when the most cited papers on these topics were published:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot.png"><img class="aligncenter size-full wp-image-3798" src="http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot.png" alt="citations-boxplot" width="600" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot-300x200.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot-260x173.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot.png 600w" sizes="(max-width: 600px) 100vw, 600px" /></a></p>
<p>You can see from the plot that the median publication year of the top 30 hits for “empirical processes” was 1990 and for “RNA-seq statistics” was 2010. The medians for the other topics were:</p>
<ul>
<li>Emp. Proc. 1990.241</li>
<li>Prop. Haz. 1990.929</li>
<li>GLM 1994.433</li>
<li>Semi-param. 1994.433</li>
<li>GEE 2000.379</li>
<li>FDR 2002.760</li>
<li>microarray 2003.600</li>
<li>lasso 2004.900</li>
<li>rna-seq 2010.765</li>
</ul>
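<p>If you want to replicate this kind of summary, here is a minimal sketch. It assumes you have already scraped the results into a data frame <code>hits</code> with one row per paper and columns <code>term</code> and <code>year</code>; the actual scraping code is in the gist linked at the end of this post.</p>
<pre><code class="r"># Summarize publication years of the top Google Scholar hits by search term.
# Assumes a data frame `hits` with columns `term` and `year`, one row per paper
# (the scraping that builds it is in the gist linked below).
med_year <- aggregate(year ~ term, data = hits, FUN = median)
med_year[order(med_year$year), ]

boxplot(year ~ term, data = hits, las = 2,
        ylab = "Publication year of top 30 hits")
</code></pre>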
<p>I think this pretty much matches up with the intuition most people have about the relative timing of fields, with a few exceptions (GEE in particular seems a bit late). There are a bunch of reasons this analysis isn’t perfect, but it does suggest that luck and timing in choosing a problem can play a major role in the “success” of academic work as measured by citations. It also suggests that there are reasons for success in science beyond individual brilliance. Given the potentially negative consequences the <a href="http://www.sciencemag.org/content/347/6219/262.abstract">expectation of brilliance has on certain subgroups</a>, it is important to recognize the role of timing and luck. The median publication year of the most cited “false discovery rate” papers was 2002, and almost none of the top 30 hits were published after about 2008.</p>
<p><a href="https://gist.github.com/jtleek/c5158965d77c21ade424">The code for my analysis is here</a>. It is super hacky so have mercy.</p>
How to find the science paper behind a headline when the link is missing
2015-01-15T13:35:42+00:00
http://simplystats.github.io/2015/01/15/how-to-find-the-science-paper-behind-a-headline-when-the-link-is-missing
<p>I just saw a pretty wild statistic on Twitter that less than 60% of university news releases link to the papers they are describing.</p>
<p> </p>
<blockquote class="twitter-tweet" width="550">
<p>
Amazingly, less than 60% of university news releases link to the papers they're describing <a href="http://t.co/daN11xYvKs">http://t.co/daN11xYvKs</a> <a href="http://t.co/QtneZUAeFD">pic.twitter.com/QtneZUAeFD</a>
</p>
<p>
— Justin Wolfers (@JustinWolfers) <a href="https://twitter.com/JustinWolfers/status/555782983429677056">January 15, 2015</a>
</p>
</blockquote>
<p>Before you believe anything you read about science in the news, you need to go and find the original article. When the article isn’t linked in the press release, sometimes you need to do a bit of sleuthing. Here is an example of how I do it for a news article. In general the press-release approach is very similar, but you skip the first step I describe below.</p>
<p><strong>Here is the news article (<a href="http://www.huffingtonpost.com/2015/01/14/online-avatar-personality_n_6463484.html?utm_hp_ref=science">link</a>):</strong></p>
<p> </p>
<p><img class="aligncenter wp-image-3787" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.11.22-PM.png" alt="Screen Shot 2015-01-15 at 1.11.22 PM" width="300" height="405" /></p>
<p> </p>
<p> </p>
<p><strong>Step 1: Look for a link to the article</strong></p>
<p>Usually it will be linked near the top or the bottom of the article. In this case, the article links to the press release about the paper. <em>This is not the original research article</em>. If you don’t get to a scientific journal you aren’t finished. In this case, the press release actually gives the full title of the article, but that will happen less than 60% of the time according to the statistic above.</p>
<p> </p>
<p><strong>Step 2: Look for names of the authors, scientific key words and journal name if available</strong></p>
<p>You are going to use these terms to search in a minute. In this case the only two things we have are the journal name:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png"><img class="aligncenter size-full wp-image-3791" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png" alt="Untitled presentation (2)" width="949" height="334" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2-300x105.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2-260x91.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png 949w" sizes="(max-width: 949px) 100vw, 949px" /></a></p>
<p> </p>
<p>And some key words:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png"><img class="aligncenter size-full wp-image-3792" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png" alt="Untitled presentation (3)" width="933" height="343" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3-300x110.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png 933w" sizes="(max-width: 933px) 100vw, 933px" /></a></p>
<p> </p>
<p><strong>Step 3 Use Google Scholar</strong></p>
<p>You could just google those words and sometimes you get the real paper, but often you just end up back at the press release/news article. So instead the best way to find the article is to go to <a href="https://scholar.google.com/">Google Scholar </a>then click on the little triangle next to the search box.</p>
<p> </p>
<p> </p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png"><img class="aligncenter size-full wp-image-3793" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png" alt="Untitled presentation (4)" width="960" height="540" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4-260x146.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png 960w" sizes="(max-width: 960px) 100vw, 960px" /></a></p>
<p>Fill in information where you can: the same year as the press release, information about the journal, the university, and key words.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.31.38-PM.png"><img class="aligncenter size-full wp-image-3794" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.31.38-PM.png" alt="Screen Shot 2015-01-15 at 1.31.38 PM" width="509" height="368" /></a></p>
<p> </p>
<p><strong>Step 4 Victory</strong></p>
<p>Often this will come up with the article you are looking for:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png"><img class="aligncenter size-full wp-image-3795" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png" alt="Screen Shot 2015-01-15 at 1.32.45 PM" width="813" height="658" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM-247x200.png 247w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png 813w" sizes="(max-width: 813px) 100vw, 813px" /></a></p>
<p> </p>
<p>Unfortunately, the article may be paywalled, so if you don’t work at a university or institute with a subscription, you can always tweet the article name with the hashtag <a href="https://twitter.com/hashtag/icanhazpdf">#icanhazpdf</a> and your contact info. Then you just have to hope that someone will send it to you (they often do).</p>
<p> </p>
<p> </p>
Statistics and R for the Life Sciences: New HarvardX course starts January 19
2015-01-12T10:30:08+00:00
http://simplystats.github.io/2015/01/12/statistics-and-r-for-the-life-sciences-new-harvardx-course-starts-january-19
<p>The first course of our Biomedical Data Science online curriculum starts next week. You can sign up <a href="https://www.edx.org/course/statistics-r-life-sciences-harvardx-ph525-1x">here</a>. Instead of relying on mathematical formulas to teach statistical concepts, students can program along as we show computer code for simulations that illustrate the main ideas of exploratory data analysis and statistical inference (p-values, confidence intervals and power calculations, for example). By doing this, students will learn Statistics and R simultaneously and will not be bogged down by having to memorize formulas. We have three types of learning modules: lectures (see picture below), screencasts and assessments. After each video, students will have the opportunity to assess their understanding through homeworks involving coding in R. A big improvement over the first version is that we have added dozens of assessments.</p>
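<p>To give a flavor of this simulation-based style (this particular snippet is just an illustration, not material from the course), here is how one might estimate a p-value by permutation in R:</p>
<pre><code class="r"># Illustration only: a two-sided permutation p-value for a difference in means
set.seed(1)
control   <- rnorm(12, mean = 24)
treatment <- rnorm(12, mean = 26)
obs_diff  <- mean(treatment) - mean(control)

null_diffs <- replicate(10000, {
  shuffled <- sample(c(control, treatment))     # break any real association
  mean(shuffled[1:12]) - mean(shuffled[13:24])
})

mean(abs(null_diffs) >= abs(obs_diff))          # permutation p-value
</code></pre>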
<p>Note that this course is the first in an <a href="http://simplystatistics.org/2014/03/31/data-analysis-for-genomic-edx-course/">eight part series</a> on Data Analysis for Genomics. Updates will be provided via twitter <a href="https://twitter.com/rafalab">@rafalab</a>.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2.png"><img class="alignnone size-large wp-image-3773" src="http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2-1024x603.png" alt="edx_screenshot_v2" width="495" height="291" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2-300x176.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2-1024x603.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2-260x153.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2.png 1298w" sizes="(max-width: 495px) 100vw, 495px" /></a></p>
Beast mode parenting as shown by my Fitbit data
2015-01-07T11:22:57+00:00
http://simplystats.github.io/2015/01/07/beast-mode-parenting-as-shown-by-my-fitbit-data
<p>This weekend was one of those hardcore parenting weekends that any parent of little kids will understand. We were up and actively taking care of kids for a huge fraction of the weekend. (Un)fortunately I was wearing my Fitbit, so I can quantify exactly how little we were sleeping over the weekend.</p>
<p>Here is Saturday:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/saturday.png"><img class="aligncenter wp-image-3762 size-full" src="http://simplystatistics.org/wp-content/uploads/2015/01/saturday.png" alt="saturday" width="500" height="500" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/saturday-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/saturday-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/01/saturday.png 500w" sizes="(max-width: 500px) 100vw, 500px" /></a></p>
<p> </p>
<p> </p>
<p>There you can see that I rocked about midnight-4am without running around chasing a kid or bouncing one to sleep. But Sunday was the real winner:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/sunday.png"><img class="aligncenter wp-image-3763 size-full" src="http://simplystatistics.org/wp-content/uploads/2015/01/sunday.png" alt="sunday" width="500" height="500" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/sunday-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/sunday-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/01/sunday.png 500w" sizes="(max-width: 500px) 100vw, 500px" /></a></p>
<p>Check that out. I was totally asleep from like 4am-6am there. Nice.</p>
<p>Stay tuned for much more from my Fitbit data over the next few weeks.</p>
<p> </p>
<p> </p>
Sunday data/statistics link roundup (1/4/15)
2015-01-04T14:45:19+00:00
http://simplystats.github.io/2015/01/04/sunday-datastatistics-link-roundup-1415
<ol>
<li>I am digging <a href="http://waitbutwhy.com/2014/05/life-weeks.html">this visualization of your life in weeks</a>. I might have to go so far as to actually make one for myself.</li>
<li>I’m very excited about the new podcast <a href="http://www.thetalkingmachines.com/">TalkingMachines</a> and what an awesome name! I wish someone would do that same thing for applied statistics (Roger?)</li>
<li>I love that they call Ben Goldacre the <a href="http://www.vox.com/2014/12/27/7423229/ben-goldacre">anti-Dr. Oz in this piece</a>, especially given how often <a href="http://www.bmj.com/content/349/bmj.g7346">Dr. Oz is telling the truth</a>.</li>
<li>If you haven’t read it yet, <a href="http://www.economist.com/news/christmas-specials/21636589-how-statisticians-changed-war-and-war-changed-statistics-they-also-served">this piece in the Economist</a> on statisticians during the war is really good.</li>
<li>The arXiv <a href="http://www.nature.com/news/the-arxiv-preprint-server-hits-1-million-articles-1.16643">celebrated its 1 millionth paper upload</a>. It costs less to run than what the <a href="https://twitter.com/joe_pickrell/status/549762678160625664">top 2 executives at PLoS make</a>. It is <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">too darn expensive</a> to publish open access right now.</li>
</ol>
Ugh ... so close to one million page views for 2014
2014-12-31T13:16:14+00:00
http://simplystats.github.io/2014/12/31/ugh-so-close-to-one-million-page-views-for-2014
<p>In my <a href="http://simplystatistics.org/2014/12/21/sunday-datastatistics-link-roundup-122114/">last Sunday Links roundup</a> I mentioned we were going to be really close to 1 million page views this year. Chris V. tried to rally the troops:</p>
<p> </p>
<blockquote class="twitter-tweet" width="550">
<p>
Lets get them over the hump // “<a href="https://twitter.com/simplystats">@simplystats</a>: Sunday data/statistics link roundup (12/21/14) <a href="http://t.co/X1WDF9zZc1">http://t.co/X1WDF9zZc1</a> <a href="https://twitter.com/hashtag/simplystats1e6?src=hash">#simplystats1e6</a>”
</p>
<p>
— Chris Volinsky (@statpumpkin) <a href="https://twitter.com/statpumpkin/status/546872078730010624">December 22, 2014</a>
</p>
</blockquote>
<p> </p>
<p>but alas we are probably not going to make it (unless by some miracle one of our posts goes viral in the next 12 hours):</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/12/soclose.png"><img class="aligncenter wp-image-3752" src="http://simplystatistics.org/wp-content/uploads/2014/12/soclose-1024x1024.png" alt="soclose" width="400" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2014/12/soclose-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2014/12/soclose-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/12/soclose-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2014/12/soclose.png 1050w" sizes="(max-width: 400px) 100vw, 400px" /></a></p>
<p> </p>
<p>Stay tuned for a bunch of cool new stuff from Simply Stats in 2015, including a new podcasting idea, more interviews, another unconference, and <a href="https://github.com/jtleek/simplystats">a new plotting theme</a>!</p>
On how meetings and conference calls are disruptive to a data scientist
2014-12-22T10:00:51+00:00
http://simplystats.github.io/2014/12/22/meetings-2
<p><em>Editor’s note: The week of Xmas eve is usually my most productive of the year. This is because there are fewer emails and no meetings (I do take a break, but only after this great week for work). Here is a repost of one of our first entries explaining why meetings and conference calls are particularly disruptive in data science. </em></p>
<p>In <a href="http://www.ted.com/talks/jason_fried_why_work_doesn_t_happen_at_work.html" target="_blank">this</a> TED talk Jason Fried explains why work doesn’t happen at work. He describes the evils of meetings. Meetings are particularly disruptive for applied statisticians, especially for those of us that hack data files, explore data for systematic errors, get inspiration from visual inspection, and thoroughly test our code. Why? Before I become productive I go through a ramp-up/boot-up stage. Scripts need to be found, data loaded into memory, and most importantly, my brains needs to re-familiarize itself with the data and the essence of the problem at hand. I need a similar ramp up for writing as well. It usually takes me between 15 to 60 minutes before I am in full-productivity mode. But once I am in “the zone”, I become very focused and I can stay in this mode for hours. There is nothing worse than interrupting this state of mind to go to a meeting. I lose much more than the hour I spend at the meeting. A short way to explain this is that having 10 separate hours to work is basically nothing, while having 10 hours in the zone is when I get stuff done.</p>
<p>Of course not all meetings are a waste of time. Academic leaders and administrators need to consult and get advice before making important decisions. I find lab meetings very stimulating and, generally, productive: we unstick the stuck and realign the derailed. But before you go and set up a standing meeting, consider this calculation: a weekly one hour meeting with 20 people translates into 1 hour x 20 people x 52 weeks/year = 1040 person hours of potentially lost production per year. Assuming 40 hour weeks, that translates into six months. How many grants, papers, and lectures can we produce in six months? And this does not take into account the non-linear effect described above. Jason Fried suggests you cancel your next meeting, notice that nothing bad happens, and enjoy the extra hour of work.</p>
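<p>Spelling out that calculation in R:</p>
<pre><code class="r"># Cost of a weekly one-hour meeting with 20 people, using the numbers above
person_hours <- 1 * 20 * 52  # hours x people x weeks = 1040 person hours/year
person_hours / 40            # = 26 forty-hour weeks, i.e. about six months
</code></pre>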
<p>I know many others that are like me in this regard and for you I have these recommendations: 1- Avoid unnecessary meetings, especially if you are already in full-productivity mode. Don’t be afraid to use this as an excuse to cancel. If you are in a soft $ institution, remember who pays your salary. 2- Try to bunch all the necessary meetings together into one day. 3- Set aside at least one day a week to stay home and work for 10 hours straight. Jason Fried also recommends that every workplace declare a day in which no one talks. No meetings, no chit-chat, no friendly banter, etc… No talk Thursdays anyone?</p>
Sunday data/statistics link roundup (12/21/14)
2014-12-21T22:00:33+00:00
http://simplystats.github.io/2014/12/21/sunday-datastatistics-link-roundup-122114
<p>James Stewart, author of the most popular Calculus textbook in the world, <a href="http://classic.slashdot.org/story/14/12/20/0036210">passed away</a>. In case you wonder if there is any money in textbooks, he had a $32 million house in Toronto. Maybe I should get out of MOOCs and into textbooks.</p>
<ol>
<li><a href="https://medium.com/the-physics-arxiv-blog/cause-and-effect-the-revolutionary-new-statistical-test-that-can-tease-them-apart-ed84a988e">This post</a> on medium about a new test for causality is making the rounds. The authors <a href="http://arxiv.org/pdf/1412.3773v1.pdf">of the original paper</a> make clear their assumptions make the results basically unrealistic for any real analysis for example:”<a href="http://arxiv.org/pdf/1412.3773v1.pdf">We simplify the causal discovery problem by assuming no confounding, selection bias and feedback.</a>” The medium article is too bold and as I replied to an economist who tweeted there was a new test that could distinguish causality: “<a href="https://twitter.com/simplystats/status/545769855564398593">Nope</a>”.</li>
<li>I’m excited that Rafa + the ASA have started a section <a href="https://twitter.com/rafalab/status/543115692770607104">on Genomics and Genetics</a>. It is nice to have a place to belong within our community. I hope it can be a place where folks who aren’t into the hype (a lot of those in genomics), but really care about applications, can meet each other and work together.</li>
<li><a href="https://medium.com/@hannawallach/big-data-machine-learning-and-the-social-sciences-927a8e20460d">Great essay</a> by Hanna W. about data, machine learning and fairness. I love this quote: “in order to responsibly articulate and address issues relating to bias, fairness, and inclusion, we need to stop thinking of big data sets as being homogeneous, and instead shift our focus to the many diverse data sets nested within these larger collections.” (via Hilary M.)</li>
<li>Over at Flowing Data they ran down <a href="http://flowingdata.com/2014/12/19/the-best-data-visualization-projects-of-2014-2/">the best data visualizations</a> of the year.</li>
<li><a href="http://dirk.eddelbuettel.com/blog/2014/12/21/#sorry_julia_2014-12">This rant</a> from Dirk E. perfectly encapsulates every annoying thing about the Julia versus R comparisons I see regularly.</li>
<li>We are tantalizingly close to 1 million page views for the year for Simply Stats. Help get us over the edge, share your favorite simply stats article with all your friends using the hashtag <a href="https://twitter.com/search?f=realtime&q=%23simplystats1e6&src=typd">#simplystats1e6</a></li>
</ol>
Interview with Emily Oster
2014-12-19T09:39:38+00:00
http://simplystats.github.io/2014/12/19/interview-with-emily-oster
<div>
<div class="nD">
<div dir="ltr">
<div>
<a href="http://simplystatistics.org/wp-content/uploads/2014/12/Emily_Oster_Photo.jpg"><img class="aligncenter wp-image-3714 " src="http://simplystatistics.org/wp-content/uploads/2014/12/Emily_Oster_Photo-198x300.jpg" alt="Emily Oster" width="121" height="184" /></a>
</div>
<div>
</div>
<div>
</div>
<div>
<em><a href="http://en.wikipedia.org/wiki/Emily_Oster">Emily Oster</a> is an Associate Professor of Economics at Brown University. She is a frequent and highly respected <a href="http://fivethirtyeight.com/contributors/emily-oster/">contributor to 538 </a>where she brings clarity to areas of interest to parents, pregnant woman, and the general public where empirical research is conflicting or difficult to interpret. She is also the author of the popular new book about pregnancy:<a href="http://www.amazon.com/Expecting-Better-Conventional-Pregnancy-Wrong/dp/0143125702"> Expecting Better: Why the Conventional Pregnancy Wisdom Is Wrong--and What You Really Need to Know</a><b>. </b>We interviewed Emily as part of our <a href="http://simplystatistics.org/interviews/">ongoing interview series</a> with exciting empirical data scientists. </em>
</div>
<div>
<em> </em>
</div>
<div>
</div>
<div>
<b>SS: Do you consider yourself an economist, econometrician, statistician, data scientist or something else?</b>
</div>
<div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
EO: I consider myself an empirical economist. I think my econometrics colleagues would have a hearty laugh at the idea that I'm an econometrician! The questions I'm most interested in tend to have a very heavy empirical component - I really want to understand what we can learn from data. In this sense, there is a lot of overlap with statistics. But at the end of the day, the motivating questions and the theories of behavior I want to test come straight out of economics.
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div dir="ltr">
<div>
<div>
</div>
<div>
<b>SS: You are a frequent contributor to 538. Many of your pieces are attempts to demystify often conflicting sets of empirical research (about concussions and suicide, or the dangers of water fluoridation). What would you say are the issues that make empirical research about these topics most difficult?</b>
</div>
<div>
<b> </b>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
<div>
EO: In nearly all the cases, I'd summarize the problem as: "The data isn't good enough." Sometimes this is because we only see observational data, not anything randomized. A large share of studies using observational data that I discuss have serious problems with either omitted variables or reverse causality (or both). This means that the results are suggestive, but really not conclusive. A second issue is even when we do have some randomized data, it's usually on a particular population, or a small group, or in the wrong time period. In the fluoride case, the studies which come closest to being "randomized" are from 50 years ago. How do we know they still apply now? This makes even these studies challenging to interpret.
</div>
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div dir="ltr">
<div>
<div>
</div>
<div>
<b>SS: Your recent book "Expecting Better: Why the Conventional Pregnancy Wisdom Is Wrong--and What You Really Need to Know" takes a similar approach to pregnancy. Why do you think there are so many conflicting studies about pregnancy? Is it because it is so hard to perform randomized studies?</b>
</div>
<div>
<b> </b>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
<div>
EO: I think the inability to run randomized studies is a big part of this, yes. One area of pregnancy where the data is actually quite good is labor and delivery. If you want to know the benefits and consequences of pain medication in labor, for example, it is possible to point you to some reasonably sized randomized trials. For various reasons, there has been more willingness to run randomized studies in this area. When pregnant women want answers to less medical questions (like, "Can I have a cup of coffee?") there is typically no randomized data to rely on. Because the possible benefits of drinking coffee while pregnant are pretty much nil, it is difficult to conceptualize a randomized study of this type of thing.
</div>
<div>
</div>
<div>
Another big issue I found in writing the book was that even in cases where the data was quite good, data often diverges from practice. This was eye-opening for me and convinced me that in pregnancy (and probably in other areas of health) people really do need to be their own advocates and know the data for themselves.
</div>
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div dir="ltr">
<div>
<div>
</div>
<div>
<b>SS: Have you been surprised about the backlash to your book for your discussion of the zero-alcohol policy during pregnancy? </b>
</div>
<div>
<b> </b>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
<div>
EO: A little bit, yes. This backlash has died down a lot as pregnant women actually read the book and use it. As it turns out, the discussion of alcohol makes up a tiny fraction of the book and most pregnant women are more interested in the rest of it! But certainly when the book came out this got a lot of focus. I suspected it would be somewhat controversial, although the truth is that every OB I actually talked to told me they thought it was fine. So I was surprised that the reaction was as sharp as it was. I think in the end a number of people felt that even if the data were supportive of this view, it was important not to say it because of the concern that some women would over-react. I am not convinced by this argument.
</div>
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div dir="ltr">
<div>
<div>
</div>
<div>
<b>SS: What are the three most important statistical concepts for new mothers to know? </b>
</div>
<div>
<b> </b>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
<div>
EO: I really only have two!
</div>
<div>
</div>
<div>
I think the biggest thing is to understand the difference between randomized and non-randomized data and to have some sense of the pitfalls of non-randomized data. I reviewed studies of alcohol where the drinkers were twice as likely as non-drinkers to use cocaine. I think people (pregnant or not) should be able to understand why one is going to struggle to draw conclusions about alcohol from these data.
</div>
<div>
</div>
<div>
A second issue is the concept of probability. It is easy to say, "There is a 10% chance of the following" but do we really understand that? If someone quotes you a 1 in 100 risk from a procedure, it is important to understand the difference between 1 in 100 and 1 in 400. For most of us, those seem basically the same - they are both small. But they are not, and people need to think of ways to structure decision-making that acknowledge these differences.
</div>
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div dir="ltr">
<div>
<div>
</div>
<div>
<b>SS: What computer programming language is most commonly taught for data analysis in economics? </b>
</div>
<div>
<b> </b>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
<div>
EO: So, I think the majority of empirical economists use Stata. I have been seeing more R, as well as a variety of other things, but mostly among people working in more computationally heavy fields.
</div>
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div dir="ltr">
<div>
<div>
</div>
<div>
<b>SS: Do you have any advice for young economists/statisticians who are interested in empirical research? </b>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
</div>
<div>
EO:
</div>
<div>
1. Work on topics that interest you. As an academic you will ultimately have to motivate yourself to work. If you aren't interested in your topic (at least initially!), you'll never succeed.
</div>
<div>
2. One project which is 100% done is way better than five projects at 80%. You need to actually finish things, something which many of us struggle with.
</div>
<div>
3. Presentation matters. Yes, the substance is the most important thing, but don't discount the importance of conveying your ideas well.
</div>
</div>
</div>
</div>
Repost: Statistical illiteracy may lead to parents panicking about Autism
2014-12-18T12:09:24+00:00
http://simplystats.github.io/2014/12/18/repost-statistical-illiteracy-may-lead-to-parents-panicking-about-autism
<p><em>Editor’s Note: This is a repost of a <a href="http://simplystatistics.org/2012/11/30/statistical-illiteracy-may-lead-to-parents-panicking-about-autism/">previous post on our blog from 2012</a>. The repost is inspired by similar issues with statistical illiteracy that are coming up in <a href="http://skybrudeconsulting.com/blog/2014/12/12/diagnostic-testing.html">allergy screening</a> and <a href="http://www.bostonglobe.com/metro/2014/12/14/oversold-and-unregulated-flawed-prenatal-tests-leading-abortions-healthy-fetuses/aKFAOCP5N0Kr8S1HirL7EN/story.html">pregnancy screening</a>. </em></p>
<p>I just was doing my morning reading of a few news sources and stumbled across this <a href="http://www.huffingtonpost.com/2012/11/29/autism-risk-babies-cries_n_2211729.html">Huffington Post article</a> talking about research correlating babies’ cries to autism. It suggests that the sound of a baby’s cries may predict their future risk for autism. As the parent of a young son, this obviously caught my attention in a very lizard-brain, caveman sort of way. I couldn’t find a link to the research paper in the article so I did some searching and found out this result is also being covered by <a href="http://healthland.time.com/2012/11/28/can-a-babys-cry-be-a-clue-to-autism/">Time</a>, <a href="http://www.sciencedaily.com/releases/2012/11/121127111352.htm">Science Daily</a>, <a href="http://www.medicaldaily.com/articles/13324/20121129/baby-s-cry-reveal-autism-risk.htm">Medical Daily</a>, and a bunch of other news outlets.</p>
<p>Now thoroughly freaked, I looked online and found the pdf of the <a href="https://www.ewi-ssl.pitt.edu/psychology/admin/faculty-publications/201209041019040.Sheinkopf%20in%20press.pdf">original research article</a>. I started looking at the statistics and took a deep breath. Based on the analysis they present in the article there is absolutely no statistical evidence that a baby’s cries can predict autism. Here are the flaws with the study:</p>
<ol>
<li><strong>Small sample size</strong>. The authors only recruited 21 at-risk infants and 18 healthy infants. Then, because of data processing issues, they only ended up analyzing 7 high autistic-risk versus 5 low autistic-risk infants in one analysis and 10 versus 6 in another. That is nowhere near a representative sample and barely qualifies as a pilot study.</li>
<li><strong>Major and unavoidable confounding</strong>. The way the authors determined high autistic risk versus low risk was based on whether an older sibling had autism. Leaving aside the quality of this metric for measuring risk of autism, there is a major confounding factor: the families of the high risk children all had an older sibling with autism and the families of the low risk children did not! It would not be surprising at all if children with one autistic older sibling might get a different kind of attention and hence cry differently regardless of their potential future risk of autism.</li>
<li><strong>No correction for multiple testing</strong>. This is one of the oldest problems in statistical analysis. It is also one that is a consistent culprit of false positives in epidemiology studies. XKCD <a href="http://xkcd.com/882/">even did a cartoon</a> about it! They tested 9 variables measuring the way babies cry and tested each one with a statistical hypothesis test. They did not correct for multiple testing. So I gathered the resulting p-values and did the correction <a href="https://gist.github.com/4177366">for them</a>. It turns out that after adjusting for multiple comparisons, nothing is significant at the usual P < 0.05 level, which would probably have prevented publication (a small sketch of what this kind of adjustment looks like in R follows this list).</li>
</ol>
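<p>For the curious, here is a minimal sketch in R of that kind of adjustment. The nine p-values below are made up purely for illustration (the real ones are in the gist linked above); the point is just that p.adjust can erase nominal significance once you account for running nine tests.</p>
<pre>
# Illustrative only: nine made-up p-values standing in for the nine
# hypothesis tests reported in the paper (the real values are in the gist)
pvals <- c(0.01, 0.03, 0.04, 0.20, 0.35, 0.50, 0.62, 0.71, 0.90)

# Unadjusted: three tests look "significant" at the 0.05 level
sum(pvals < 0.05)

# Adjust for multiple testing (Bonferroni and Benjamini-Hochberg)
p.adjust(pvals, method = "bonferroni")
p.adjust(pvals, method = "BH")

# After adjustment, nothing survives the 0.05 cutoff for these values
sum(p.adjust(pvals, method = "BH") < 0.05)
</pre>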
<p>Taken together, these problems mean that the statistical analysis of these data do not show any connection between crying and autism.</p>
<p>The problem here exists on two levels. First, there was a failing in the statistical evaluation of this manuscript at the peer review level. Most statistical referees would have spotted these flaws and pointed them out for such a highly controversial paper. A second problem is that the news agencies that report on this result, despite paying lip service to potential limitations, are not statistically literate enough to point out the major flaws in the analysis that reduce the probability of a true positive. Should journalists have some minimal training in statistics that allows them to determine whether a result is likely to be a false positive, to save us parents a lot of panic?</p>
<p> </p>
A non-comprehensive list of awesome things other people did in 2014
2014-12-17T13:08:43+00:00
http://simplystats.github.io/2014/12/17/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2014
<p><em>Editor’s Note: Last year <a href="http://simplystatistics.org/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year/">I made a list</a> off the top of my head of awesome things other people did. I loved doing it so much that I’m doing it again for 2014. Like last year, I have surely missed awesome things people have done. If you know of some, you should make your own list or add it to the comments! The rules remain the same. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. Update: I missed pipes in R, now added!</em></p>
<p> </p>
<ol>
<li>I’m copying everything about Jenny Bryan’s amazing <a href="http://stat545-ubc.github.io/">Stat 545 class</a> in my data analysis classes. It is one of my absolute favorite open online set of notes on data analysis.</li>
<li>Ben Baumer, Mine Cetinkaya-Rundel, Andrew Bray, Linda Loi, Nicholas J. Horton wrote <a href="http://arxiv.org/abs/1402.1894">this awesome paper</a> on integrating R markdown into the curriculum. I love the stuff that Mine and Nick are doing to push data analysis into undergrad stats curricula.</li>
<li>Speaking of those folks, the undergrad <a href="file:///Users/jtleek/Downloads/Report%20on%20Undergrad%20Ed_final3.pdf">guidelines for stats programs put out by the ASA</a> do an impressive job of balancing the advantages of statistics and the excitement of modern data analysis.</li>
<li>Somebody tell Hector Corrada Bravo to stop writing so many awesome papers. He is making us all look bad. His <a href="http://www.nature.com/nmeth/journal/v11/n9/abs/nmeth.3038.html">epiviz paper is great</a> and you should go start using the <a href="http://www.bioconductor.org/packages/release/bioc/html/epivizr.html">Bioconductor package</a> if you do genomics.</li>
<li>Hilary Mason founded <a href="http://www.fastforwardlabs.com/">Fast Forward Labs</a>. I love the business model of translating cutting edge academic (and otherwise) knowledge to practice. I am really pulling for this model to work.</li>
<li>As far as I can tell 2014 was the year that causal inference became the new hotness. One example of that is this awesome paper from the Google folks on trying to <a href="http://google.github.io/CausalImpact/CausalImpact.html">infer causality from related time series</a>. <a href="http://google.github.io/CausalImpact/CausalImpact.html">The R package</a> has some <a href="https://twitter.com/hspter/status/496689866953224192">cool features too</a>. I definitely am excited to see all the new innovation in this area.</li>
<li><a href="http://r-pkgs.had.co.nz/">Hadley</a> was <a href="https://github.com/hadley/dplyr">Hadley</a>.</li>
<li>Rafa and <a href="http://www.mike-love.net/">Mike </a>taught an awesome class on data analysis for genomics. They also created a <a href="http://genomicsclass.github.io/book/">book on Github</a> that I think is one of the best introductions to the statistics of genomics that exists so far.</li>
<li>Hilary Parker <a href="http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/">wrote a tutorial on writing an R package from scratch</a> that took the twitterverse by storm. It is perfectly written for people who are just at the point of being able to create their own R package. I think it probably generated 100+ R packages just by being so easy to follow.</li>
<li>Oh you’re <a href="http://www.statschat.org.nz/2014/12/10/spin-and-manipulation-in-science-reporting/">not reading StatsChat yet</a>? <a href="http://www.statschat.org.nz/2014/12/13/blaming-mothers-again/">For real</a>?</li>
<li>FiveThirtyEight launched. Despite <a href="http://fivethirtyeight.com/features/a-formula-for-decoding-health-news/">some early bumps</a> they have done some really cool stuff. Loved the recent <a href="http://fivethirtyeight.com/tag/beer-mile/">piece on the beer mile</a> and I read every piece that <a href="http://fivethirtyeight.com/contributors/emily-oster/">Emily Oster writes</a>. She does an amazing job of explaining pretty complicated statistical topics to a really broad audience.</li>
<li>David Robinson’s <a href="https://github.com/dgrtwo/broom">broom package</a> is one of my absolute favorite R packages that was built this year. One of the most annoying things about R is the variety of outputs different models give and this tidy version makes it really easy to do lots of neat stuff.</li>
<li>Chung and Storey <a href="http://bioinformatics.oxfordjournals.org/content/early/2014/10/21/bioinformatics.btu674.full.pdf">introduced the jackstraw</a> which is both a very clever idea and the perfect name for a method that can be used to identify variables associated with principal components in a statistically rigorous way.</li>
<li>I rarely dig excel-type replacements, but the <a href="http://www.charted.co/">simplicity of charted.co</a> makes me love it. It does one thing and one thing really well.</li>
<li>The <a href="http://kbroman.wordpress.com/2014/05/15/hipster-re-educating-people-who-learned-r-before-it-was-cool/">hipsteR package</a> for teaching old R dogs new tricks is one of the many cool things Karl Broman did this year. I read all of his tutorials and never cease to learn stuff. In related news, if I were 1/10th as organized as that dude I’d actually, you know, get stuff done.</li>
<li>Whether or not I agree that they should be allowed to do unregulated human subjects research, statistics at tech companies, and in particular randomized experiments, have never been hotter. The boldest of the bunch is OKCupid who writes blog posts with titles like, “<a href="http://blog.okcupid.com/index.php/we-experiment-on-human-beings/">We experiment on human beings</a>!”</li>
<li>In related news, I love the <a href="https://facebook.github.io/planout/">PlanOut project</a> by the folks over at Facebook, so cool to see an open source approach to experimentation at web scale.</li>
<li>No wonder <a href="http://www.cs.berkeley.edu/~jordan/">Mike Jordan </a>(no not that <a href="http://en.wikipedia.org/wiki/Michael_Jordan">Mike Jordan</a>) is such a superstar. His <a href="http://www.reddit.com/r/MachineLearning/comments/2fxi6v/ama_michael_i_jordan">reddit AMA</a> raised my respect for him from already super high levels. First, its awesome that he did it, and second it is amazing how well he articulates the relationship between CS and Stats.</li>
<li>I’m trying to figure out a way to get Matthew Stephens to <a href="http://stephens999.github.io/blog/">write more blog posts.</a> He teased us with the <a href="http://stephens999.github.io/blog/2014/11/dscr.html">Dynamic Statistical Comparisons</a> post and then left us hanging. The people demand more Matthew.</li>
<li>Di Cook also <a href="http://dicook.github.io/blog.html">started a new blog</a> in 2014. She was also <a href="https://unite.un.org/techevents/eda">part of this cool exploratory data analysis event</a> for the UN. They have a monster program going over there at Iowa State, producing some amazing research and a bunch of students that are recognizable by one name (Yihui, Hadley, etc.).</li>
<li>Love <a href="http://arxiv-web3.library.cornell.edu/pdf/1407.7819v1.pdf">this paper on sure screening of graphical models</a> out of Daniela Witten’s group at UW. It is so cool when a simple idea ends up being really well justified theoretically, it makes the world feel right.</li>
<li>I’m sure this actually happened before 2014, but the Bioconductor folks are still the best open source data science project that exists in my opinion. My favorite development I started using in 2014 is the <a href="http://www.bioconductor.org/developers/how-to/git-svn/">git-subversion bridge</a> that lets me update my Bioc packages with pull requests.</li>
<li>rOpenSci <a href="https://github.com/ropensci/hackathon">ran an awesome hackathon</a>. The lineup of people they invited was great and I loved the commitment to a diverse group of junior R programmers. I really, really hope they run it again.</li>
<li>Dirk Eddelbuettel and Carl Boettiger continue to make bigtime contributions to R. This time it is <a href="http://dirk.eddelbuettel.com/blog/2014/10/23/">Rocker</a>, with Docker containers for R. I think this could be a reproducibility/teaching gamechanger.</li>
<li>Regina Nuzzo <a href="http://www.nature.com/news/scientific-method-statistical-errors-1.14700">brought the p-value debate to the masses</a>. She is also incredible at communicating pretty complicated statistical ideas to a broad audience and I’m looking forward to more stats pieces by her in the top journals.</li>
<li>Barbara Engelhardt keeps <a href="http://arxiv.org/abs/1411.2698">rocking out great papers</a>. But she is also one of the best AE’s I have ever had handle a paper for me at PeerJ. Super efficient, super fair, and super demanding. People don’t get enough credit for being amazing in the peer review process and she deserves it.</li>
<li>Ben Goldacre and Hans Rosling continue to be two of the best advocates for statistics and the statistical discipline - I’m not sure either claims the title of statistician but they do a great job anyway. <a href="http://news.sciencemag.org/africa/2014/12/star-statistician-hans-rosling-takes-ebola?rss=1&utm_source=dlvr.it&utm_medium=twitter">This piece</a> about Professor Rosling in Science gives some idea about the impact a statistician can have on the most current problems in public health. Meanwhile, I think Dr. Goldacre <a href="http://www.bmj.com/content/348/bmj.g3306/rr/759401">does a great job</a> of explaining how personalized medicine is an information science in this piece on statins in the BMJ.</li>
<li>Michael Lopez’s <a href="http://statsbylopez.com/2014/07/23/so-you-want-a-graduate-degree-in-statistics/">series of posts</a> on graduate school in statistics should be 100% required reading for anyone considering graduate school in statistics. He really nails it.</li>
<li> Trey Causey has an equally awesome <a href="http://treycausey.com/getting_started.html">Getting Started in Data Science</a> post that I read about 10 times.</li>
<li>Drop everything and <a href="http://www.pgbovine.net/writings.htm">go read all of Philip Guo’s posts</a>. Especially <a href="http://www.pgbovine.net/academia-industry-junior-employee.htm">this one</a> about industry versus academia or this one on <a href="http://www.pgbovine.net/practical-reason-to-pursue-PhD.htm">the practical reason to do a PhD</a>.</li>
<li>The top new Twitter feed of 2014 has to be <a href="https://twitter.com/ResearchMark">@ResearchMark</a> (incidentally I’m still mourning the disappearance of <a href="https://twitter.com/STATSHULK">@STATSHULK</a>).</li>
<li>Stephanie Hicks’ blog <a href="http://statisticalrecipes.blogspot.com/">combines recipes for delicious treats and statistics</a>, also I thought she had <a href="http://statisticalrecipes.blogspot.com/2014/05/inaugural-women-in-statistics-2014.html">a great summary</a> of the Women in Stats (<a href="https://twitter.com/search?q=%23WiS2014%20&src=typd">#WiS2014</a>) conference.</li>
<li>Emma Pierson is a Rhodes Scholar who wrote for 538, 23andMe, and a bunch of other major outlets as an undergrad. Her blog, <a href="http://obsessionwithregression.blogspot.com/">obsessionwithregression.blogspot.com</a> is another must read. <a href="http://qz.com/302616/see-how-red-tweeters-and-blue-tweeters-ignore-each-other-on-ferguson/">Here is an example</a> of her awesome work on how different communities ignored each other on Twitter during the Ferguson protests.</li>
<li>The Rstudio crowd continues to be on fire. I think they are a huge part of the reason that R is gaining momentum. It wouldn’t be possible to list all their contributions (or it would be an Rstudio exclusive list) but I really like <a href="http://blog.rstudio.org/2014/07/22/announcing-packrat-v0-4/">Packrat</a> and <a href="http://blog.rstudio.org/2014/06/18/r-markdown-v2/">R markdown v2</a>.</li>
<li>Another huge reason for the movement with R has been the outreach and development efforts of the <a href="http://www.revolutionanalytics.com/">Revolution Analytics folks.</a> The <a href="http://blog.revolutionanalytics.com/">Revolutions blog</a> has been a must read this year.</li>
<li>Julian Wolfson and Joe Koopmeiners at University of Minnesota are straight up gamers. <a href="http://sph.umn.edu/site/docs/biostats/OpenHouseFlyer2014.pdf">They live streamed their recruiting event</a> this year. One way I judge good ideas is by how mad I am I didn’t think of it and this one had me seeing bright red.</li>
<li>This is <a href="http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf">just an awesome paper</a> comparing lots of machine learning algorithms on lots of data sets. Random forests wins and this is a nice update of one of my favorite papers of all time: <a href="http://arxiv.org/pdf/math/0606441.pdf">Classifier technology and the illusion of progress</a>.</li>
<li><a href="http://www.r-statistics.com/2014/08/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/">Pipes in R</a>! This stuff is for real. The piping functionality created by Stefan Milton and Hadley is one of the few inventions over the last several years that immediately changed whole workflows for me.</li>
</ol>
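<p>For anyone who hasn’t tried the pipe yet, here is a minimal before-and-after sketch of the workflow change it enables, using the built-in mtcars data purely for illustration:</p>
<pre>
library(magrittr)
library(dplyr)

# Nested-call style: you have to read it inside out
arrange(summarise(group_by(mtcars, cyl), mpg = mean(mpg)), desc(mpg))

# Piped style: read it top to bottom, left to right
mtcars %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg)) %>%
  arrange(desc(mpg))
</pre>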
<p> </p>
<p>I’ll let <a href="https://twitter.com/ResearchMark">@ResearchMark</a> take us out:</p>
<p><a href="https://pbs.twimg.com/media/B2NC5c7IYAAt_j-.jpg"><img class="aligncenter" src="https://pbs.twimg.com/media/B2NC5c7IYAAt_j-.jpg" alt="" width="308" height="308" /></a></p>
Sunday data/statistics link roundup (12/14/14)
2014-12-14T12:54:50+00:00
http://simplystats.github.io/2014/12/14/sunday-datastatistics-link-roundup-121414
<ol>
<li><a href="http://www.motherjones.com/kevin-drum/2014/12/economists-are-almost-inhumanly-impartial"> 1.</a> suggests that economists are impartial when it comes to their liberal/conservative views. That being said, I’m not sure the regression line says what they think it does, particularly if you pay attention to the variance around the line (via Rafa).</li>
<li>I am digging the simplicity of <a href="http://www.charted.co/">charted.co</a> from the folks at Medium. But I worry about spurious correlations everywhere. I guess I should just let that ship sail.</li>
<li>FiveThirtyEight <a href="http://fivethirtyeight.com/features/beer-mile-chug-run-repeat/">does a run down of the beer mile</a>. If they set up a data crunchers beer mile, we are in.</li>
<li>I love it when Thomas Lumley interviews himself about silly research studies and particularly their associated press releases. I can actually hear his voice in my head when I read them. This time the <a href="http://www.statschat.org.nz/2014/12/13/blaming-mothers-again/">lipstick/IQ silliness gets Lumleyed</a>.</li>
<li><a href="http://fivethirtyeight.com/datalab/michael-jordan-kobe-bryant/">Jordan was better than Kobe</a>. Surprise. Plus <a href="http://simplystatistics.org/2014/12/12/kobe-data-says-stop-blaming-your-teammates/">Rafa always takes the Kobe bait</a>.</li>
<li><a href="http://mathesaurus.sourceforge.net/matlab-python-xref.pdf">Matlab/Python/R translation cheat sheet</a> (via Stephanie H.).</li>
<li>If I’ve said it once, I’ve said it a million times: statistical thinking is now as important as reading and writing. <a href="http://www.bostonglobe.com/metro/2014/12/14/oversold-and-unregulated-flawed-prenatal-tests-leading-abortions-healthy-fetuses/aKFAOCP5N0Kr8S1HirL7EN/story.html">The latest example</a>: parents not understanding the difference between a test’s sensitivity and the predictive value of a positive result may be leading to unnecessary abortions (via Dan M./Rafa). A quick worked example of that distinction follows this list.</li>
</ol>
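<p>Since the sensitivity-versus-predictive-value confusion keeps coming up, here is a quick worked example in R. The numbers are hypothetical, but they show how a seemingly accurate screening test for a rare condition still produces mostly false positives.</p>
<pre>
# Hypothetical numbers for illustration only
prevalence  <- 1 / 1000   # 1 in 1,000 pregnancies affected
sensitivity <- 0.99       # P(test positive | affected)
specificity <- 0.99       # P(test negative | not affected)

# Bayes' rule: probability the fetus is affected given a positive test
ppv <- (sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
ppv
# about 0.09 -- with these numbers, roughly 9 out of 10 positives are false
</pre>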
Kobe, data says stop blaming your teammates
2014-12-12T10:00:20+00:00
http://simplystats.github.io/2014/12/12/kobe-data-says-stop-blaming-your-teammates
<p>This year, Kobe leads the league in missed shots (<a href="http://ftw.usatoday.com/2014/11/kobe-bryant-lakers-shot-stats">by a lot</a>), has an abysmal FG% of 39 and his team plays better <a href="http://bleacherreport.com/articles/2292515-how-much-blame-does-kobe-bryant-deserve-for-los-angeles-lakers-pathetic-start">when he is on the bench</a>. Yet he <a href="http://espn.go.com/los-angeles/nba/story/_/id/12016979/los-angeles-lakers-star-kobe-bryant-critical-teammates-heated-scrimmage">blames his teammates</a> for the Lakers’ 6-16 record. Below is a plot showing that 2014 is not the first time the Lakers are mediocre during Kobe’s tenure. It shows the percentage points above .500 per season with the Shaq and twin towers eras highlighted. I include the same plot for Lebron as a control.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/12/Rplot.png"><img class="alignnone size-large wp-image-3679" src="http://simplystatistics.org/wp-content/uploads/2014/12/Rplot-1024x511.png" alt="Rplot" width="525" srcset="http://simplystatistics.org/wp-content/uploads/2014/12/Rplot-1024x511.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/12/Rplot.png 1106w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></p>
<p>So stop blaming your teammates!</p>
<p>And here is my <a href="http://rafalab.jhsph.edu/simplystats/kobe2014.R">hastily written code</a> (don’t judge me!).</p>
<p> </p>
<p> </p>
Genéticamente, no hay tal cosa como la raza puertorriqueña
2014-12-08T09:09:59+00:00
http://simplystats.github.io/2014/12/08/geneticamente-no-hay-tal-cosa-como-la-raza-puertorriquena
<p><em>Editor’s note: Last week the Latin American media picked up a blog post with the eye-catching title “<a href="http://liorpachter.wordpress.com/2014/12/02/the-perfect-human-is-puerto-rican/">The perfect human is Puerto Rican</a>”. More attention appears to have been given to the title than the post itself. The coverage and comments on social media have demonstrated the need for scientific education on the topic of genetics and race. Here I will try to explain, in layman’s terms, why the interpretation I read in the main Puerto Rican paper is scientifically incorrect and somewhat concerning. The post is in Spanish.</em></p>
<p>En un artículo reciente titulado “<a href="http://www.elnuevodia.com/serhumanoperfectoseriapuertorriqueno-1903858.html">Ser humano perfecto sería puertorriqueño</a>”, El Nuevo Día resumió una entrada en el blog (erróneamente llamado un estudio) del matemático Lior Pachter. El autor del blog, intentando ridiculizar comentarios racistas que escuchó decir a James Watson, describe un experimento mental en el cual encuentra que el humano “perfecto” (las comillas son importantes), de existir, pertenecería a un grupo genéticamente mezclado. De las personas estudiadas, la más genéticamente cercana a su humano “perfecto” resultó ser una mujer puertorriqueña. La motivación de este ejercicio era ridiculizar la idea de que una raza puede ser superior a otra. El Nuevo Día parece no captar este punto y nos dice que “el experto concluyó que en todo caso no es de sorprenderse que la persona más cercana a tal perfección sería una puertorriqueña, debido a la combinación de buenos genes que tiene la raza puertorriqueña.” Aquí describo por qué esta interpretación es científicamente errada.</p>
<p><strong>¿Qué es el genoma?</strong></p>
<p>El genoma humano codifica (en moléculas de <a href="http://es.wikipedia.org/wiki/%C3%81cido_desoxirribonucleico">ADN</a>) la información genética necesaria para nuestro desarrollo biológico. Podemos pensar en el genoma como dos series de 3,000,000,000 letras (A, T, C o G) concatenadas. Una la recibimos de nuestro padre y la otra de nuestra madre. Distintos pedazos (los genes) codifican proteínas necesarias para las miles de funciones que cumplen nuestras células y que conllevan a algunas de nuestras características físicas. Con unas pocas excepciones, todas las células en nuestro cuerpo contienen una copia exacta de estas dos series de letras. El esperma y el huevo tienen sólo una serie de letras, una mezcla de las otras dos. Cuando se unen el esperma y el huevo, la nueva célula, el cigoto, une las dos series y es así que heredamos características de cada progenitor.</p>
<p><strong>¿Qué es la variación genética?</strong></p>
<p>Si todos venimos del primer humano,¿cómo entonces es que somos diferentes? Aunque es muy raro, estas letras a veces mutan aleatoriamente. Por ejemplo, una C puede cambiar a una T. A través de cientos de miles de años suficientes mutaciones han ocurrido para crear variación entre los humanos. La teoría de selección natural nos dice que si esta mutación confiere una ventaja para la supervivencia, el que la posee tiene más probabilidad de pasarla a sus descendientes. Por ejemplo, en Europa la piel clara es más ventajosa, por su habilidad de absorber vitamina D cuando hay poco sol, que en África Occidental donde la melanina en la piel oscura protege del sol intenso. Se estima que las diferencias entre los humanos se pueden encontrar en por lo menos 10 millones de las 3 mil millones de letras (noten que es menos de 1%).</p>
<p><strong>Genéticamente, ¿qué es una “raza” ?</strong></p>
<p>Esta es un pregunta controversial. Lo que no es controversial es que si comparamos la serie de letras de los europeos del norte con los africanos occidentales o con los indígenas de las Américas, encontramos pedazos del código que son únicos a cada región. Si estudiamos las partes del código que cambian entre humanos, fácilmente podemos distinguir los tres grupos. Esto no nos debe sorprender dado que, por ejemplo, la diferencia en el color de ojos y la pigmentación de la piel se codifica con distintas letras en los genes asociados con estas características. En este sentido podríamos crear una definición genética de “raza” basada en las letras que distinguen a estos grupos. Ahora bien, ¿podemos hacer lo mismo para distinguir un puertorriqueño de un dominicano? ¿Podemos crear una definición genética que incluye a Carlos Delgado y a Mónica Puig, pero no a Robinson Canó y Juan Luis Guerra? La literatura científica nos dice que no.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/12/PCAfinal.png"><img class="alignnone wp-image-3636 size-large" src="http://simplystatistics.org/wp-content/uploads/2014/12/PCAfinal-914x1024.png" alt="PCAfinal" width="411" height="461" srcset="http://simplystatistics.org/wp-content/uploads/2014/12/PCAfinal-267x300.png 267w, http://simplystatistics.org/wp-content/uploads/2014/12/PCAfinal-914x1024.png 914w, http://simplystatistics.org/wp-content/uploads/2014/12/PCAfinal-178x200.png 178w" sizes="(max-width: 411px) 100vw, 411px" /></a></p>
<p>En una <a href="http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1003925">serie</a> de <a href="http://www.pnas.org/content/107/Supplement_2/8954">artículos</a> , el genético Carlos Bustamante y sus colegas han estudiado los genomas de personas de varios grupos étnicos. Ellos definen una distancia genética que resumen con dos dimensiones en la gráfica arriba. Cada punto es una persona y el color presenta a su grupo. Noten los tres extremos de la gráfica con muchos puntos del mismo color amontonados. Estos son los europeos blancos (puntos rojo), africanos occidentales (verde) e indígenas americanos (azul). Los puntos más regados en el medio son las poblaciones mezcladas. Entre los europeos y los indígenas vemos a los mexicanos y entre los europeos y africanos a los afroamericanos. Los puertorriqueños son los puntos anaranjados. He resaltado con números a tres de ellos. El <strong>1</strong> está cerca del supuesto humano “perfecto”. El <strong>2</strong> es indistinguible de un europeo y el <strong>3</strong> es indistinguible de un afroamericano. Los demás cubrimos un espectro amplio. También resalto con el número <strong>4</strong> a un dominicano que está tan cerca a la “perfección” como la puertorriqueña. La observación principal es que hay mucha variación genética entre los puertorriqueños. En los que Bustamante estudió, la ascendencia africana varía de 5-60%, la europea de 35-95% y la taína de 0-20%. ¿Cómo entonces podemos hablar de una “raza” puertorriqueña cuando nuestros genomas abarcan un espacio tan grande que puede incluir, entre otros, europeos, afroamericanos y dominicanos ?</p>
<p><strong>¿Qué son los genes “buenos”?</strong></p>
<p>Algunas mutaciones son letales. Otras resultan en cambios a proteínas que causan enfermedades como la <a href="http://es.wikipedia.org/wiki/Fibrosis_qu%C3%ADstica">fibrosis quística</a> y requieren que ambos padres tengan la mutación. Por lo tanto la mezcla de genomas diferentes disminuye las probabilidades de estas enfermedades. Recientemente una serie de estudios ha encontrado ventajas de algunas combinaciones de letras relacionadas a enfermedades comunes como la hipertensión. Una mezcla genética que evita tener dos copias de estos genes con más riesgo puede ser ventajosa. Pero las supuestas ventajas son pequeñísimas y específicas a enfermedades, no a otras características que asociamos con la “perfección”. El concepto de “genes buenos” es un vestigio de la <a href="http://en.wikipedia.org/wiki/Eugenics">eugenesia</a>.</p>
<p>A pesar de nuestros problemas sociales y económicos actuales, Puerto Rico tiene mucho de lo cual estar orgulloso. En particular, producimos buenísimos ingenieros, atletas y músicos. Atribuir su éxito a “genes buenos” de nuestra “raza” no sólo es un disparate científico, sino una falta de respeto a estos individuos que a través del trabajo duro, la disciplina y el esmero han logrado lo que han logrado. Si quieren saber si Puerto Rico tuvo algo que ver con el éxito de estos individuos, pregúntenle a un historiador, un antropólogo o un sociólogo y no a un genetista. Ahora, si quieren aprender del potencial de estudiar genomas para mejorar tratamientos médicos y la importancia de estudiar una diversidad de individuos, un genetista tendrá mucho que compartir.</p>
Sunday data/statistics link roundup (12/7/14)
2014-12-07T10:00:43+00:00
http://simplystats.github.io/2014/12/07/sunday-datastatistics-link-roundup-12714
<ol>
<li><a href="http://www.apa.org/news/press/releases/2014/11/airport-security.aspx">A randomized controlled trial</a> shows that using conversation to detect suspicious behavior is much more effective then just monitoring body language (via Ann L. on Twitter). This comes as a crushing blow to those of us who enjoyed the now-cancelled <a href="http://en.wikipedia.org/wiki/Lie_to_Me">Lie to Me</a> and assumed it was all real.</li>
<li>Check out this awesome <a href="http://map.ipviking.com/">real-time visualization</a> of different types of network attacks. Rafa says if you watch long enough you will almost certainly observe a “storm” of attacks. A cool student project would be modeling the distribution of these attacks if you could collect the data (via David S.).</li>
<li><a href="http://goodstrat.com/2014/12/03/consider-this-did-big-data-kill-the-statistician/">Consider this: Did Big Data Kill the Statistician?</a> I understand the sentiment, that statistical thinking and applied statistics has been around a long time and has <a href="http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/">produced some good ideas</a>. On the other hand, there is definitely a large group of statisticians who aren’t willing to expand their thinking beyond a really narrow set of ideas (via Rafa)</li>
<li><a href="http://www.huffingtonpost.com/2014/12/03/gangnam-style-youtube_n_6261332.html">Gangnam Style viewership creates integers too big for Youtube</a> (via Rafa)</li>
<li>A couple of interviews worth reading: <a href="http://simplystatistics.org/2014/12/05/interview-with-cole-trapnell-of-uw-genome-sciences/">ours with Cole Trapnell</a> and <a href="http://samsiatrtp.wordpress.com/2014/11/18/samsi-postdoctoral-profile-jyotishka-datta/">SAMSI’s with Jyotishka Datta</a> (via Jamie N.)</li>
<li> <a href="http://www.theguardian.com/technology/2014/dec/05/when-data-gets-creepy-secrets-were-giving-away">A piece on the secrets we don’t know we are giving away</a> through giving our data to [companies/the government/the internet].</li>
</ol>
Interview with Cole Trapnell of UW Genome Sciences
2014-12-05T12:06:57+00:00
http://simplystats.github.io/2014/12/05/interview-with-cole-trapnell-of-uw-genome-sciences
<div id="mO" class="">
<div class="tNsA5e-nUpftc nUpftc ja xpv2f">
<div class="pf">
<div class="nXx3q">
<div class="cA">
<div class="cl ac">
<div class="yDSKFc viy5Tb">
<div class="rt">
<div class="DsPmj">
<div class="scroll-list-section-body scroll-list-section-body-0">
<div class="scroll-list-item top-level-item scroll-list-item-open scroll-list-item-highlighted" tabindex="0" data-item-id="Bs#gmail:thread-f:1463549268702220125" data-item-id-qs="qsBs-gmail-thread-f-1463549268702220125-0">
<div class="ah V T qX V-M">
<div class="af qX af-M">
<div class="fB qX">
<div class="ag qX" tabindex="0" data-msg-id="Bs#msg-f:1463577765776057801" data-msg-id-qs="qsBs-msg-f-1463577765776057801">
<div class="nI qX">
<div class="gm qX">
<div class="bK xJNT8d">
<div>
<div class="nD">
<blockquote>
<div dir="ltr">
<div>
<a href="http://simplystatistics.org/wp-content/uploads/2014/12/cole_cropped.jpg"><img class="aligncenter wp-image-3624" src="http://simplystatistics.org/wp-content/uploads/2014/12/cole_cropped-278x300.jpg" alt="cole_cropped" width="186" height="200" /></a>
</div>
</div>
</blockquote>
<div dir="ltr">
</div>
<div dir="ltr">
<div style="text-align: left;">
<em><a href="http://cole-trapnell-lab.github.io/">Cole Trapnell</a> is an Assistant Professor of Genome Sciences at the University of Washington. He is the developer of multiple incredibly widely used tools for genomics including Tophat, Cufflinks, and Monocle. His lab at UW studies cell differentiation, reprogramming, and other transitions between stable or metastable cellular states using a combination of computational and experimental techniques. We talked to Cole as part of our <a href="http://simplystatistics.org/interviews/">ongoing interview series</a> with exciting junior data scientists. </em>
</div>
<div style="text-align: left;">
</div>
<div style="text-align: left;">
</div>
<div style="text-align: left;">
<strong>SS: Do you consider yourself a computer scientist, a statistician, a computational biologist, or something else?</strong>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div>
<p>
CT: The questions that get me up and out of bed in the morning the fastest are biology questions. I work on cell differentiation - I want to know how to define the state of a cell and how to predict transitions between states. That said, my approach to these questions so far has been to use new technologies to look at previously hard to access aspects of gene regulation. For example, I’ve used RNA-Seq to look beyond gene expression into finer layers of regulation like splicing. Analyzing sequencing experiments often involves some pretty non-trivial math, computer science, and statistics. These data sets are huge, so you need fast algorithms to even look at them. They all involve transforming reads into a useful readout of biology, and the technical and biological variability in that transformation needs to be understood and controlled for, so you see cool mathematical and statistical problems all the time. So I guess you could say that I’m a biologist, both experimental and computational. I have to do some computer science and statistics in order to do biology.
</p>
<div>
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div>
<div>
<div>
<div>
<div>
<div dir="ltr">
<div>
<strong>SS: You got a Ph.D. in computer science but have spent the last several years in a wet lab learning to be a bench biologist - why did you make that choice?</strong>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div>
<div>
<p>
CT: Three reasons, mainly:
</p>
<p>
1) I thought learning to do bench work would make me a better overall scientist. It has, in many ways, I think. It’s fundamentally changed the way I approach the questions I work on, but it’s also made me more effective in lots of tiny ways. I remember when I first got to John Rinn’s lab, we needed some way to track lots of libraries and other material. I came up with some scheme where each library would get an 8-digit alphanumeric code generated by a hash function or something like that (we’d never have to worry about collisions!). My lab mate handed me a marker and said, “OK, write that on the side of these 12 micro centrifuge tubes”. I threw out my scheme and came up with something like “JR_1”, “JR_2”, etc. That’s a silly example, but I mention it because it reminds me of how completely clueless I was about where biological data really comes from.
</p>
<p>
2) I wanted to establish an independent, long-term research program investigating differentiation, and I didn’t want to have to rely on collaborators to generate data. I knew at the end of grad school that I wanted to have my own wet lab, and I doubted that anyone would trust me with that kind of investment without doing some formal training. Despite the now-common recognition by experimental biologists that analysis is incredibly important, there’s still a perception out there that computational biologists aren’t “real biologists”, and that computational folks are useful tools, but not the drivers of the intellectual agenda. That's of course not true, but I didn’t want to fight the stigma.
</p>
<p>
3) It sounded fun. I had one or two friends who had followed the "dry to wet” training trajectory, and they were having a blast. Seeing a result live under the microscope is satisfying in a way that I’ve rarely experienced looking at a computer screen.
</p>
<div>
</div>
</div>
</div>
</div>
</div>
<div>
<strong>SS: Do you plan to have both a wet lab and a dry lab when you start your new group? </strong>
</div>
<div>
<div class="F3hlO">
<div>
<div>
<div>
<p>
CT: Yes. I’m going to be starting my lab at the University of Washington in the department of Genome Sciences this summer, and it’s going to be a roughly 50/50 operation, I hope. Many of the labs there are set up that way, and there’s a real culture of valuing both sides. As a postdoc, I’ve been extremely fortunate to collaborate with grad students and postdocs who were trained as cell or molecular biologists but wanted to learn sequencing analysis. We’d train each other, often at great cost in terms of time spent solving “somebody else’s problem”. I’m going to do my best to create an environment like that, the way John did for me and my lab mates.
</p>
<div>
</div>
<div>
<strong>SS: You are frequently on the forefront of new genomic technologies. As data sets get larger and more complicated how do we ensure reproducibility and replicability of computational results? </strong>
</div>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div>
<div>
<div>
<div>
<div>
<p>
CT: That’s a good question, and I don’t really have a good answer. You’ve talked a lot on this blog about the importance of making science more reproducible and how journals could change to make it so. I agree wholeheartedly with a lot of what you’ve said. I like the idea of "papers as packages”, but I don’t see it happening soon, because it’s a huge amount of extra work and there’s not a big incentive to do so. Doing so might make it easier to be attacked, so there could even be a disincentive! Scientists do well when they publish papers and those papers are cited widely. We have lots of ways to quantify “impact” - h-index, total citation count, how many times your paper is shared via twitter on a given day, etc. (Say what you want about whether these are meaningful measures).
</p>
<p>
We don’t have a good way to track who’s right and who’s wrong, or whose results are reproducible and whose aren’t, short of full blown paper retraction. Most papers aren’t even checked in a serious way. Worse, the papers that are checked are the ones that a lot of people see - few people spend precious time following up on tangential observations in low circulation journals. So there’s actually an incentive to publish “controversial" results in highly visible journals because at least you’re getting attention.
</p>
<p>
Maybe we need a Yelp for papers and data sets? One where in order to dispute the reproducibility of the analysis, you’d have to provide the code *you* ran to generate a contradictory result? There needs to be a genuine and tangible *reward* (read: funding and career advancement) for putting up an analysis that others can dive into, verify, extend, and learn from.
</p>
<p>
In any case, I think it’s worth noting that reproducibility is not a problem unique to computation - experimentalists have a hard time reproducing results they got last week, much less results that came from some other lab! There’s all kinds of harmless reasons for that. Experiments are hard. Reagents come in bad lots. You had too much coffee that morning and can’t steady your pipet hand to save your life. But I worry a bit that we could spend a lot of effort making our analysis totally automated and perfectly reproducible and still be faced with the same problem.
</p>
<div>
</div>
<div>
<strong>SS: What are the interesting statistical challenges in single-cell RNA-sequencing? </strong>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div>
<div>
<div>
<div>
<div>
<div>
<p>
CT:
</p>
<p>
Oh man, there are many. Here’s a few:
</p>
<p>
1) There are some very interesting questions about variability in expression across cells, or within one cell across time. There’s clearly a lot of variability in the expression level of a given gene across cells. But there’s really no way right now to take “replicate” measurements of a single cell. What would that mean? With current technology, to make an RNA-Seq library from a cell, you have to lyse it. So that’s it for that cell. Even if you had a non-invasive way to measure the whole transcriptome, the cell is a living machine that’s always changing in ways large and small, even in culture. Would you consider repeated measurements “replicates”? Furthermore, how can you say that two different cells are “replicate” measurements of a single, defined cell state? Do such states even really exist?
</p>
<p>
For that matter, we don’t have a good way of assessing how much variability stems from technical sources as opposed to biological sources. One common way of assessing technical variability is to spike some alien transcripts at known concentrations into purified RNA before making the library, so you can see how variable your endpoint measurements are for those alien transcripts. But to do that for single-cell RNA-Seq, we’d have to actually spike transcripts *into* the nucleus of a cell before we lyse it and put it through the library prep process. Just doping it into the lysate’s not good enough, because the lysis itself might (and likely does) destroy a substantial fraction of the endogenous RNA in the cell. So there are some real barriers to overcome in order to get a handle on how much variability is really biological.
</p>
<p>
2) A second challenge is writing down what a biological process looks like at single cell resolution. I mean we want to write down a model that predicts the expression levels of each gene in a cell as it goes through some biological process. We want to be able to say this gene comes on first, then this one, then these genes, and so on. In genomics up until now, we’ve been in the situation where we are measuring many variables (P) from few measurements (N). That is, N << P, typically, which has made this problem extremely difficult. With single cell RNA-Seq, that may no longer be the case. We can already easily capture hundreds of cells, and thousands of cells per capture is just around the corner, so soon, N will be close to P, and maybe someday greater.
</p>
<p>
Assume for the moment that we are capturing cells that are either resting at or transiting between well defined states. You can think of each cell as a point in a high-dimensional geometric space, where each gene is a different dimension. We’d like to find those equilibrium states and figure out which genes are correlated with which other genes. Even better, we’d like to study the transitions between states and identify the genes that drive them. The curse of dimensionality is always going to be a problem (we’re not likely to capture millions or billions of cells anytime soon), but maybe we have enough data to make some progress. There’s interesting literature out there for tackling problems at this scale, but to my knowledge these methods haven’t yet been widely applied in biology. I guess you can think of cell differentiation viewed at whole-transcriptome, single-cell resolution as one giant manifold learning problem. Same goes for oncogenesis, tissue homeostasis, reprogramming, and on and on. It’s going to be very exciting to see the convergence of large scale statistical machine learning and cell biology.
</p>
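<p>
<em>An illustration from us (not Cole’s code): a minimal R sketch of the geometry he describes, treating each simulated cell as a point in a high-dimensional gene space and projecting onto the first two principal components. The expression matrix is entirely made up.</em>
</p>
<pre>
set.seed(1)
n_cells <- 200   # simulated cells
n_genes <- 500   # simulated genes (the dimensions)

# Two hidden cell states that shift the mean expression of the first 50 genes
state <- sample(c(0, 1), n_cells, replace = TRUE)
shift <- cbind(matrix(2 * state, n_cells, 50), matrix(0, n_cells, n_genes - 50))

# Log-scale expression: noise plus the state-dependent shift
expr <- matrix(rnorm(n_cells * n_genes), n_cells, n_genes) + shift

# Each row is a cell, i.e. a point in a 500-dimensional gene space;
# PCA gives a low-dimensional view in which the two states separate
pc <- prcomp(expr, scale. = TRUE)
plot(pc$x[, 1], pc$x[, 2], col = state + 1,
     xlab = "PC1", ylab = "PC2", pch = 16)
</pre>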
<p>
<strong>SS: If you could do it again would you do computational training then wet lab training or the other way around? </strong>
</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<p>
CT: I’m happy with how I did things, but I’ve seen folks go the other direction very successfully. My labmates Loyal Goff and Dave Hendrickson started out as molecular biologists, but they’re wizards at the command line now.
</p>
<div>
</div>
</div>
</div>
<div>
<div class="nD">
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div dir="ltr">
<div>
<strong>SS: What is your programming language of choice? </strong>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div>
<div>
<div>
<div>
<div>
<p>
CT: Oh, I’d say I hate them all equally 😉
</p>
<p>
Just kidding. I’ll always love C++. I work in R a lot these days, as my work has veered away from developing tools for other people towards analyzing data I’ve generated. I still find lots of things about R to be very painful, but ggplot2, plyr, and a handful of other godsend packages make the juice worth the squeeze.
</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
Repost: A deterministic statistical machine
2014-12-04T13:13:45+00:00
http://simplystats.github.io/2014/12/04/repost-a-deterministic-statistical-machine
<p><em>Editor’s note: This is a repost of our previous post about deterministic statistical machines. It is inspired by the <a href="https://gigaom.com/2014/12/02/google-is-funding-an-artificial-intelligence-for-data-science/">recent announcement</a> that the <a href="http://www.automaticstatistician.com/">Automatic Statistician </a>received funding from Google. In 2012 we also applied to Google for a small research award to study this same problem, but didn’t get it. In the interest of extreme openness like Titus Brown or Ethan White, <a href="https://docs.google.com/document/d/1ERL40_LYt4U_vYx2rUxPvIhCrxnpld3dcrtEiCeWn8U/edit">here is our application</a> we submitted to Google. I showed this to a friend who told me the reason we didn’t get it is because our proposal was missing two words: “artificial”, “intelligence”. </em></p>
<p>As Roger pointed out, the most recent batch of Y Combinator startups included a bunch of <a href="http://simplystatistics.org/post/29964925728/data-startups-from-y-combinator-demo-day" target="_blank">data-focused</a> companies. One of these companies, <a href="https://www.statwing.com/" target="_blank">StatWing</a>, is a web-based tool for data analysis that looks like an improvement on SPSS with more plain text, more visualization, and a lot of the technical statistical details “under the hood”. I first read about StatWing on TechCrunch, in an article titled <a href="http://techcrunch.com/2012/08/16/how-statwing-makes-it-easier-to-ask-questions-about-data-so-you-dont-have-to-hire-a-statistical-wizard/" target="_blank">“How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”</a>.</p>
<p>StatWing looks super user-friendly and the idea of democratizing statistical analysis so more people can access these ideas is something that appeals to me. But, as one of the aforementioned statistical wizards, this had me freaked out for a minute. Once I looked at the software though, I realized it suffers from the same problem that most “user-friendly” statistical software suffers from. It makes it really easy to screw up a data analysis. It will tell you when something is significant and if you don’t like that it isn’t, you can keep slicing and dicing the data until it is. The key issue behind getting insight from data is knowing when you are fooling yourself with confounders, or small effect sizes, or overfitting. StatWing looks like an improvement on the UI experience of data analysis, but it won’t prevent false positives that plague science and cost business big $$.</p>
<p>So I started thinking about what kind of software would prevent these sorts of problems while still being accessible to a big audience. My idea is a “deterministic statistical machine”. Here is how it works: you input a data set and then specify the question you are asking (Is variable Y related to variable X? Can I predict Z from W?). Then, depending on your question, it uses a deterministic set of methods to analyze the data, say regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, <a href="http://www.nature.com/news/the-data-detective-1.10937" target="_blank">maybe even data fudging</a>. It generates a report with a markdown tool and then immediately publishes the result to <a href="http://figshare.com/" target="_blank">figshare</a>.</p>
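<p>To make the idea concrete, here is a minimal R sketch of what the deterministic core could look like. The function and the particular checks are hypothetical, chosen only to illustrate the design: the question type, not the analyst, picks the method.</p>
<pre><code class="language-r">## Hypothetical sketch of a deterministic statistical machine (DSM).
## The question type fully determines the method; the analyst makes no modeling choices.
dsm <- function(data, question = c("association", "prediction"), y, x) {
  question <- match.arg(question)
  ## Pre-specified checks (a real DSM would run many more, and log them in the report)
  if (anyNA(data[, c(y, x)])) warning("missing data detected; using complete cases")
  data <- data[stats::complete.cases(data[, c(y, x)]), ]
  fit <- switch(question,
    association = lm(reformulate(x, y), data = data),        # fixed method for inference
    prediction  = MASS::lda(reformulate(x, y), data = data)  # fixed method for prediction
  )
  ## A real DSM would also render a report (e.g., R Markdown) and post it publicly
  list(question = question, n = nrow(data), fit = fit)
}

## Example: is mpg related to weight?
summary(dsm(mtcars, "association", y = "mpg", x = "wt")$fit)
</code></pre>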
<p>The advantage is that people can get their data-related questions answered using a standard tool. It does a lot of the “heavy lifting” in checking for potential problems and produces nice reports. But it is a deterministic algorithm for analysis so overfitting, fudging the analysis, etc. are harder. By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around.</p>
<p>The DSM should be a web service that is easy to use. Anybody want to build it? Any suggestions for how to do it better?</p>
Thinking Like a Statistician: Social Media and the ‘Spiral of Silence’
2014-12-02T10:00:39+00:00
http://simplystats.github.io/2014/12/02/thinking-like-a-statistician-social-media-and-the-spiral-of-silence
<p>A few months ago the Pew Research Internet Project published a <a href="http://www.pewinternet.org/2014/08/26/social-media-and-the-spiral-of-silence/">paper</a> on social media and the ‘<a href="http://en.wikipedia.org/wiki/Spiral_of_silence">spiral of silence</a>’. Their main finding is that people are less likely to discuss a controversial topic on social media than in person. Unlike others, I did not find this result surprising, perhaps because I think like a statistician.</p>
<p>Shares or retweets of published opinions on controversial political topics - religion, abortion rights, gender inequality, immigration, income inequality, race relations, the role of government, foreign policy, education, climate change - are ubiquitous in social media. These are usually accompanied by passionate statements of strong support or outraged disagreement. Because these are posted by people we elect to follow, we generally agree with what we see on our feeds. Here is a statistical explanation for why many keep silent when they disagree.</p>
<p>We will summarize the <em>political view</em> of an individual as their opinions on the 10 topics listed above. For simplicity I will assume these opinions can be quantified with a left (liberal) to right (conservative) scale. Every individual can therefore be defined by a point in a 10 dimensional space. Once quantified in this way, we can define a political distance between any pair of individuals. In the American landscape there are two clear clusters which I will call the Fox News and MSNBC clusters. As seen in the illustration below, the cluster centers are very far from each other and individuals within the clusters are very close. Each cluster has a very low opinion of the other. A glance through a social media feed will quickly reveal individuals squarely inside one of these clusters. Members of the clusters fearlessly post their opinions on controversial topics as this behavior is rewarded by likes, retweets or supportive comments from others in their cluster. Based on the uniformity of opinion inferred from the comments, one would think that everybody is in one of these two groups. But this is obviously not the case.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/12/plotforpost.png"><img class="aligncenter wp-image-3602 size-large" src="http://simplystatistics.org/wp-content/uploads/2014/12/plotforpost-1024x1007.png" alt="plotforpost" width="396" height="389" srcset="http://simplystatistics.org/wp-content/uploads/2014/12/plotforpost-300x295.png 300w, http://simplystatistics.org/wp-content/uploads/2014/12/plotforpost-1024x1007.png 1024w" sizes="(max-width: 396px) 100vw, 396px" /></a></p>
<p>In the illustration above I include an example of an individual (the green dot) that is outside the two clusters. Although not shown, there are many of these <em>independent thinkers</em>. In our example, this individual is very close to the MSNBC cluster, but not in it. The controversial topic posts in this person’s feed are mostly posted by those in the cluster of closest proximity, and the spiral of silence is due in part to the fact that independent thinkers are uniformly averse to disagreeing publicly. For the mathematical explanation of why, we introduce the concept of a <a href="http://en.wikipedia.org/wiki/Projection_%28mathematics%29"><em>projection</em></a>.</p>
<p>In mathematics, a projection can map a multidimensional point to a smaller, simpler subset. In our illustration, the independent thinker is very close to the MSNBC cluster on all dimensions except one. To use education as an example, let’s say this person supports <a href="http://www.foxnews.com/opinion/2014/10/10/florida-senator-why-am-fighting-for-school-choice-lifeline-for-poor-kids/">school choice</a>. As seen in the illustration, in the projection to the education dimension, that mostly liberal person is squarely in the Fox News cluster. Now imagine that a friend shares an article on <a href="http://www.huffingtonpost.com/diann-woodard/the-corporate-takeover_b_3397091.html">The Corporate Takeover of Public Education</a> along with a passionate statement of approval. Independent thinkers have a feeling that by voicing their dissent, dozens, perhaps hundreds, of strangers on social media (friends of friends for example) will judge them solely on this projection. To make matters worse, public shaming of the independent thinker, for supposedly being a member of the Fox News cluster, will then be rewarded by increased social standing among the MSNBC cluster as evidenced by retweets, likes and supportive comments. In a worst-case scenario for this person, and a best-case scenario for the critics, this public shaming goes viral. While the short term rewards for preaching to the echo chamber are clear, there are no apparent incentives for dissent.</p>
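<p>A toy calculation in R may help. The numbers below are made up purely for illustration: an individual who is close to one cluster on nine of ten topics can still land squarely in the other cluster when you only look at the single dimension being discussed.</p>
<pre><code class="language-r">## Hypothetical cluster centers on a left (-1) to right (+1) scale, 10 topics
msnbc <- rep(-0.8, 10)
fox   <- rep( 0.8, 10)

## An "independent thinker": near the MSNBC cluster on 9 topics,
## but on the other side for topic 10 (say, school choice)
indep <- c(rep(-0.7, 9), 0.7)

## Full 10-dimensional distances: clearly closer to MSNBC
sqrt(sum((indep - msnbc)^2))   # about 1.5
sqrt(sum((indep - fox)^2))     # about 4.5

## Projection onto the one topic being discussed: looks like a Fox News member
abs(indep[10] - msnbc[10])     # 1.5, far
abs(indep[10] - fox[10])       # 0.1, close
</code></pre>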
<p>The superficial and fast-paced nature of social media is not amenable to nuances and subtleties. Disagreement with the groupthink on one specific topic can therefore get a person labeled as a “neoliberal corporate shill” by the MSNBC cluster or a “godless liberal” by the Fox News one. The irony is that in social media, those politically closest to you will be the ones attaching the unwanted label.</p>
HarvardX Biomedical Data Science Open Online Training Curriculum launches on January 19
2014-11-25T14:01:47+00:00
http://simplystats.github.io/2014/11/25/harvardx-biomedical-data-science-open-online-training-curriculum-launches-on-january-19
<p>We recently received funding from the NIH <a href="http://bd2k.nih.gov/FY14/COE/COE.html#sthash.ESkvsyrj.dpbs">Big Data to Knowledge (BD2K)</a> initiative to develop MOOCs for biomedical data science. Our first offering will be version 2 of my <a href="http://simplystatistics.org/2014/03/31/data-analysis-for-genomic-edx-course/">Data Analysis for Genomics course</a> which will launch on January 19. In this version, the course will be turned into an 8 course series and you can get a certificate in each one of them. The motivation for doing this is to go more in-depth into the different topics and to provide different entry points for students with different levels of expertise. We provide four courses on concepts and skills and four case-study based courses. We basically broke the original class into the following eight parts:</p>
<ol>
<li><a href="https://www.edx.org/course/statistics-with-r-for-life-sciences-harvardx-ph525-1x#.VHTQgmTF86B">Statistics and R for the Life Sciences</a></li>
<li><a href="https://www.edx.org/course/introduction-to-linear-models-and-matrix-algebra-harvardx-ph525-2x#.VHTQxGTF86B">Introduction to Linear Models and Matrix Algebra</a></li>
<li><a href="https://www.edx.org/course/advanced-statistics-for-the-life-sciences-harvardx-ph525-3x#.VHTQ0GTF86B">Advanced Statistics for the Life Sciences</a></li>
<li><a href="https://www.edx.org/course/introduction-to-bioconductor-harvardx-ph525-4x#.VHTQ22TF86B">Introduction to Bioconductor</a></li>
<li><a href="https://www.edx.org/course/case-study-rna-seq-data-analysis-harvardx-ph525-5x#.VHTQ5mTF86B">Case study: RNA-seq data analysis</a></li>
<li><a href="https://www.edx.org/course/case-study-variant-discovery-and-genotyping-harvardx-ph525-6x#.VHTQ-WTF86B">Case study: Variant Discovery and Genotyping</a></li>
<li><a href="https://www.edx.org/course/case-study-chip-seq-data-analysis-harvardx-ph525-7x#.VHTRBWTF86B">Case study: ChIP-seq data analysis</a></li>
<li><a href="https://www.edx.org/course/case-study-dna-methylation-data-analysis-harvardx-ph525-8x#.VHTREmTF86B">Case study: DNA methylation data analysis</a></li>
</ol>
<p>You can follow the links to enroll. While not required, some familiarity with R and Rstudio will serve you well so consider taking <a href="https://www.coursera.org/course/rprog">Roger’s R course</a> and Jeff’s <a href="https://www.coursera.org/course/datascitoolbox">Toolbox</a> course before delving into this class.</p>
<p>In years 2 and 3 we plan to introduce several other courses covering topics such as python for data analysis, probability, software engineering, and data visualization which will be taught by a collaboration between the departments of Biostatistics, Statistics and Computer Science at Harvard.</p>
<p>Announcements will be made here and on twitter: <a href="https://twitter.com/rafalab">@rafalab</a></p>
<p> </p>
Data Science Students Predict the Midterm Election Results
2014-11-12T13:37:36+00:00
http://simplystats.github.io/2014/11/12/data-science-students-predict-the-midterm-election-results
<p>As explained in an <a href="http://simplystatistics.org/2014/11/04/538-election-forecasts-made-simple/">earlier post</a>, one of the homework assignments of my <a href="http://cs109.github.io/2014/">CS109</a> class was to predict the results of the midterm election. We created a competition that 49 students entered. The most interesting challenge was to provide intervals for the republican-democrat difference in each of the 35 senate races. Anybody missing more than 2 races was eliminated. The average size of the intervals was the tie breaker.</p>
<p>The main teaching objective here was to get students thinking about how to evaluate prediction strategies when chance is involved. To a naive observer, a biased strategy that favored democrats and correctly called, say, Virginia may look good in comparison to strategies that called it a toss-up. However, a look at the other 34 states would reveal the weakness of this biased strategy. I wanted students to think of procedures that can help distinguish lucky guesses from strategies that universally perform well.</p>
<p>One of the concepts we discussed in class was the systematic bias of polls which we modeled as a random effect. One can’t infer this bias from polls until after the election passes. By studying previous elections students were able to estimate the SE of this random effect and incorporate it into the calculation of intervals. The realization of this random effect was <a href="http://fivethirtyeight.com/features/the-polls-were-skewed-toward-democrats/">very large</a> in these elections (about +4 for the democrats) which clearly showed the importance of modeling this source of variability. Strategies that restricted standard error measures to sample estimates from this year’s polls did very poorly. The <a href="http://fivethirtyeight.com/interactives/senate-forecast/">90% credible intervals</a> provided by 538, which I believe does incorporate this, missed 8 of the 35 races (23%). This suggests that they underestimated the variance. Several of our students compared favorably to 538:</p>
<div class="table-responsive">
<table style="width:100%;" class="easy-table easy-table-default" border="0">
<tr><th>name</th><th>avg bias</th><th>MSE</th><th>avg interval size</th><th># missed</th></tr>
<tr><td>Manuel Andere</td><td>-3.9</td><td>6.9</td><td>24.1</td><td>3</td></tr>
<tr><td>Richard Lopez</td><td>-5.0</td><td>7.4</td><td>26.9</td><td>3</td></tr>
<tr><td>Daniel Sokol</td><td>-4.5</td><td>6.4</td><td>24.1</td><td>4</td></tr>
<tr><td>Isabella Chiu</td><td>-5.3</td><td>9.6</td><td>26.9</td><td>6</td></tr>
<tr><td>Denver Mosigisi Ogaro</td><td>-3.2</td><td>6.6</td><td>18.9</td><td>7</td></tr>
<tr><td>Yu Jiang</td><td>-5.6</td><td>9.6</td><td>22.6</td><td>7</td></tr>
<tr><td>David Dowey</td><td>-3.5</td><td>6.2</td><td>16.3</td><td>8</td></tr>
<tr><td>Nate Silver</td><td>-4.2</td><td>6.6</td><td>16.4</td><td>8</td></tr>
<tr><td>Filip Piasevoli</td><td>-3.5</td><td>7.4</td><td>22.1</td><td>8</td></tr>
<tr><td>Yapeng Lu</td><td>-6.5</td><td>8.2</td><td>16.5</td><td>10</td></tr>
<tr><td>David Jacob Lieb</td><td>-3.7</td><td>7.2</td><td>17.1</td><td>10</td></tr>
<tr><td>Vincent Nguyen</td><td>-3.8</td><td>5.9</td><td>11.1</td><td>14</td></tr>
</table>
</div>
<p>It is important to note that 538 would have probably increased their interval size had they actively participated in a competition requiring 95% of the intervals to cover. But all in all, students did very well. The majority correctly predicted the republican takeover. The median mean square error across all 49 participants was 8.2, which was not much worse than 538’s 6.6. Other examples of strategies that I think helped some of these students perform well were the use of creative weighting schemes (based on previous elections) to average polls and the use of splines to estimate trends, which in this particular election were going in the republicans’ favor.</p>
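<p>For readers curious about the mechanics, here is a rough R sketch of the kind of interval the better entries used. All numbers are invented; the point is that the variance has a term for the shared poll bias, estimated from past elections, on top of this year’s sampling variability.</p>
<pre><code class="language-r">## Toy 95% interval for the republican - democrat spread in one race (illustrative numbers)
polls      <- c(2.1, 3.5, -0.5, 1.8)  # this year's poll spreads, in percentage points
sigma_bias <- 2.5                     # SD of the election-wide poll bias, from past elections

est <- mean(polls)
se_sampling <- sd(polls) / sqrt(length(polls))

## Naive interval: this year's sampling variability only (tends to be too narrow)
est + c(-1, 1) * qnorm(0.975) * se_sampling

## Interval that also treats the shared poll bias as a random effect
est + c(-1, 1) * qnorm(0.975) * sqrt(se_sampling^2 + sigma_bias^2)
</code></pre>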
<p>Here are some plots showing results from two of our top performers:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/11/Rplot.png"><img class="alignnone wp-image-3560" src="http://simplystatistics.org/wp-content/uploads/2014/11/Rplot.png" alt="Rplot" width="714" height="233" srcset="http://simplystatistics.org/wp-content/uploads/2014/11/Rplot-300x98.png 300w, http://simplystatistics.org/wp-content/uploads/2014/11/Rplot-1024x334.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/11/Rplot.png 1674w" sizes="(max-width: 714px) 100vw, 714px" /></a> <a href="http://simplystatistics.org/wp-content/uploads/2014/11/Rplot01.png"><img class="alignnone wp-image-3561" src="http://simplystatistics.org/wp-content/uploads/2014/11/Rplot01.png" alt="Rplot01" width="714" height="233" srcset="http://simplystatistics.org/wp-content/uploads/2014/11/Rplot01-300x98.png 300w, http://simplystatistics.org/wp-content/uploads/2014/11/Rplot01-1024x334.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/11/Rplot01.png 1674w" sizes="(max-width: 714px) 100vw, 714px" /></a></p>
<p>I hope this exercise helped students realize that data science can be both fun and useful. I can’t wait to do this again in 2016.</p>
Sunday data/statistics link roundup (11/9/14)
2014-11-10T01:30:00+00:00
http://simplystats.github.io/2014/11/10/sunday-datastatistics-link-roundup-11914
<p>So I’m a day late, but you know, I got a new kid and stuff…</p>
<ol>
<li><a href="http://www.newyorker.com/science/maria-konnikova/moocs-failure-solutions">The New Yorker hating on MOOCs</a>, they mention all the usual stuff. Including the <a href="http://simplystatistics.org/2013/07/19/the-failure-of-moocs-and-the-ecological-fallacy/">really poorly designed San Jose State experiment</a>. I think this deserves a longer post, but this is definitely a case where people are looking at MOOCs on the <a href="http://en.wikipedia.org/wiki/Hype_cycle">wrong part of the hype curve</a>. MOOCs won’t solve all possible education problems, but they are hugely helpful to many people and writing them off is a little silly (via Rafa).</li>
<li>My colleague Dan S. is <a href="http://www.eventzilla.net/web/event?eventid=2139054537">teaching a missing data workshop</a> here at Hopkins next week (via Dan S.)</li>
<li>A couple of cool Youtube videos explaining <a href="http://www.youtube.com/watch?v=YmOsDTczOFs">how the normal distribution sounds</a> and the <a href="http://www.youtube.com/watch?v=F-I-BVqMiNI">pareto principle with paperclips</a> (via Presh T., pair with the <a href="http://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/">80/20 rule of statistical methods development</a>)</li>
<li>If you aren’t following <a href="https://twitter.com/ResearchMark">Research Wahlberg</a>, you aren’t on academic twitter.</li>
<li>I followed <a href="https://twitter.com/hashtag/biodata14?src=hash">#biodata14</a> closely. I think having a meeting on Biological Big Data is a great idea and many of the discussion leaders are people I admire a ton. I also am a big fan of Mike S. I have to say I was pretty bummed that more statisticians weren’t invited (we like to party too!).</li>
<li>Our data science specialization generates <a href="http://rpubs.com/hadley/39122">almost 1,000 new R github repos a month</a>! Roger and I are in a neck and neck race to be the person who has taught the most people statistics/data science in the history of the world.</li>
<li>The Rstudio guys have also put together what looks like a <a href="http://blog.rstudio.org/2014/11/06/introduction-to-data-science-with-r-video-workshop/">great course</a> similar in spirit to our Data Science Specialization. The Rstudio folks have been *super* supportive of the DSS and we assume anything they make will be awesome.</li>
<li><a href="http://datacarpentry.github.io/blog/2014/11/05/announce/">Congrats to Data Carpentry</a> and <a href="https://twitter.com/tracykteal">Tracy Teal</a> on their funding from the Moore Foundation!</li>
<li><a href="http://www.newyorker.com/science/maria-konnikova/moocs-failure-solutions">The New Yorker hating on MOOCs</a>, they mention all the usual stuff. Including the <a href="http://simplystatistics.org/2013/07/19/the-failure-of-moocs-and-the-ecological-fallacy/">really poorly designed San Jose State experiment</a>. I think this deserves a longer post, but this is definitely a case where people are looking at MOOCs on the <a href="http://en.wikipedia.org/wiki/Hype_cycle">wrong part of the hype curve</a>. MOOCs won’t solve all possible education problems, but they are hugely helpful to many people and writing them off is a little silly (via Rafa).</li>
<li>My colleague Dan S. is <a href="http://www.eventzilla.net/web/event?eventid=2139054537">teaching a missing data workshop</a> here at Hopkins next week (via Dan S.)</li>
<li>A couple of cool Youtube videos explaining <a href="http://www.youtube.com/watch?v=YmOsDTczOFs">how the normal distribution sounds</a> and the <a href="http://www.youtube.com/watch?v=F-I-BVqMiNI">pareto principle with paperclips</a> (via Presh T., pair with the <a href="http://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/">80/20 rule of statistical methods development</a>)</li>
<li>If you aren’t following <a href="https://twitter.com/ResearchMark">Research Wahlberg</a>, you aren’t on academic twitter.</li>
<li>I followed <a href="https://twitter.com/hashtag/biodata14?src=hash">#biodata14</a> closely. I think having a meeting on Biological Big Data is a great idea and many of the discussion leaders are people I admire a ton. I also am a big fan of Mike S. I have to say I was pretty bummed that more statisticians weren’t invited (we like to party too!).</li>
<li>Our data science specialization generates <a href="http://rpubs.com/hadley/39122">almost 1,000 new R github repos a month</a>! Roger and I are in a neck and neck race to be the person who has taught the most people statistics/data science in the history of the world.</li>
<li>The Rstudio guys have also put together what looks like a <a href="http://blog.rstudio.org/2014/11/06/introduction-to-data-science-with-r-video-workshop/">great course</a> similar in spirit to our Data Science Specialization. The Rstudio folks have been *super* supportive of the DSS and we assume anything they make will be awesome.</li>
<li><a href="http://datacarpentry.github.io/blog/2014/11/05/announce/">Congrats to Data Carpentry</a> and](https://twitter.com/tracykteal) on their funding from the Moore Foundation!</li>
</ol>
<blockquote class="twitter-tweet" width="550">
<p>
Sup. Party's over. Keep moving. <a href="http://t.co/R8sTbKzpF8">pic.twitter.com/R8sTbKzpF8</a>
</p>
<p>
— Research Wahlberg (@ResearchMark) <a href="https://twitter.com/ResearchMark/status/530109209543999489">November 5, 2014</a>
</p>
</blockquote>
Time varying causality in n=1 experiments with applications to newborn care
2014-11-05T13:13:11+00:00
http://simplystats.github.io/2014/11/05/time-varying-causality-in-n1-experiments-with-applications-to-newborn-care
<p>We just had our second son about a week ago and I’ve been hanging out at home with him and the rest of my family. It has reminded me of a few things from when we had our first son. First, newborns are tiny and super-duper adorable. Second, daylight savings time means gaining an extra hour of sleep for many people, but for people with young children it is more like (via Reddit):</p>
<p><a href="http://www.reddit.com/r/funny/comments/2l25vx/gain_an_extra_hour_of_sleep_waityou_have_toddlers/"><img class="aligncenter" src="http://i.imgur.com/1HWQIPa.gif" alt="" width="480" height="270" /></a></p>
<p> </p>
<p>Third, taking care of a newborn is like performing a series of n=1 experiments where the causal structure of the problem changes every time you perform an experiment.</p>
<p>Suppose, hypothetically, that your newborn has just had something to eat and it is 2am in the morning (again, just hypothetically). You are hoping he’ll go back down to sleep so you can catch some shut-eye yourself. But your baby just can’t sleep and seems uncomfortable. Here is a partial list of causes for this: (1) dirty diaper, (2) needs to burp, (3) still hungry, (4) not tired, (5) over tired, (6) has gas, (7) just chillin. So you start going down the list and trying to address each of the potential causes of late-night sleeplessness: (1) check diaper, (2) try burping, (3) feed him again, etc. etc. Then, miraculously, one works and the little guy falls asleep.</p>
<p>It is interesting how the natural human reaction to this is to reorder the potential causes of sleeplessness and, the next time, start with the thing that worked. Then we often get frustrated when the same thing doesn’t work. You can’t help it: you did an experiment, you have some data, you want to use it. But the reality is that the next time may have nothing to do with the first.</p>
<p>I’m in the process of collecting some very poorly annotated data collected exclusively at night if anyone wants to write a dissertation on this problem.</p>
538 election forecasts made simple
2014-11-04T17:12:16+00:00
http://simplystats.github.io/2014/11/04/538-election-forecasts-made-simple
<p>Nate Silver does a <a href="http://fivethirtyeight.com/features/how-the-fivethirtyeight-senate-forecast-model-works/">great job</a> of explaining his forecast model to laypeople. However, as a statistician I’ve always wanted to know more details. After preparing a “<a href="http://cs109.github.io/2014/pages/homework.html">predict the midterm elections</a>” homework for my <a href="http://cs109.github.io/2014">data science class</a> I have a better idea of what is going on.</p>
<p><a href="http://simplystatistics.org/html/midterm2012.html">Here</a> is my best attempt at explaining the ideas of 538 using formulas and data. <del>And <a href="http://rafalab.jhsph.edu/simplystats/midterm2012.Rmd">here</a> is the R markdown.</del></p>
Sunday data/statistics link roundup (11/2/14)
2014-11-02T19:16:22+00:00
http://simplystats.github.io/2014/11/02/sunday-datastatistics-link-roundup-11214
<p>Better late than never! If you have something cool to share, please continue to email it to me with subject line “Sunday links”.</p>
<ol>
<li>A <a href="http://www.drivendata.org/">DrivenData</a> is a Kaggle-like site but for social good. I like the principle of using data for societal benefit, since there are so many ways it seems to be used for nefarious purposes (via Rafa).</li>
<li>This article <a href="http://www.nytimes.com/2014/11/02/opinion/sunday/academic-science-isnt-sexist.html?ref=opinion&_r=2">claiming academic science isn’t sexist</a> has been widely panned; Emily Willingham <a href="http://www.emilywillinghamphd.com/2014/11/academic-science-is-sexist-we-do-have.html">pretty much destroys it here</a> (via Sherri R.). The thing that is interesting about this article is the way that it tries to use data to give the appearance of empiricism, while using language to try to skew the results. Is it just me or is this totally bizarre in light of the NYT also <a href="http://www.nytimes.com/2014/11/02/us/handling-of-sexual-harassment-case-poses-larger-questions-at-yale.html?smid=tw-share">publishing this piece</a> about academic sexual harassment at Yale?</li>
<li>Noah Smith, an economist, <a href="http://www.bloombergview.com/articles/2014-10-29/bad-data-can-make-us-smarter">tries to summarize</a> the problem with “most research being wrong”. It is an interesting take, I wonder if he read Roger’s piece <a href="http://simplystatistics.org/2014/10/15/dear-laboratory-scientists-welcome-to-my-world/">saying almost exactly the same thing </a> like the week before? He also mentions it is hard to quantify the rate of false discoveries in science, maybe he should <a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt007.abstract">read our paper</a>?</li>
<li>Nature <a href="http://www.nature.com/news/code-share-1.16232">now requests</a> that code sharing occur “where possible” (via Steven S.)</li>
<li>Great <a href="http://imgur.com/gallery/ZpgQz">cartoons</a>, I particularly like the one about replication (via Steven S.).</li>
<li>A <a href="http://www.drivendata.org/">DrivenData</a> is a Kaggle-like site but for social good. I like the principle of using data for societal benefit, since there are so many ways it seems to be used for nefarious purposes (via Rafa).</li>
<li>This article <a href="http://www.nytimes.com/2014/11/02/opinion/sunday/academic-science-isnt-sexist.html?ref=opinion&_r=2">claiming academic science isn’t sexist</a> has been widely panned Emily Willingham <a href="http://www.emilywillinghamphd.com/2014/11/academic-science-is-sexist-we-do-have.html">pretty much destroys it here</a> (via Sherri R.). The thing that is interesting about this article is the way that it tries to use data to give the appearance of empiricism, while using language to try to skew the results. Is it just me or is this totally bizarre in light of the NYT also <a href="http://www.nytimes.com/2014/11/02/us/handling-of-sexual-harassment-case-poses-larger-questions-at-yale.html?smid=tw-share">publishing this piece</a> about academic sexual harassment at Yale?</li>
<li>Noah Smith, an economist, <a href="http://www.bloombergview.com/articles/2014-10-29/bad-data-can-make-us-smarter">tries to summarize</a> the problem with “most research being wrong”. It is an interesting take, I wonder if he read Roger’s piece <a href="http://simplystatistics.org/2014/10/15/dear-laboratory-scientists-welcome-to-my-world/">saying almost exactly the same thing </a> like the week before? He also mentions it is hard to quantify the rate of false discoveries in science, maybe he should <a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt007.abstract">read our paper</a>?</li>
<li>Nature <a href="http://www.nature.com/news/code-share-1.16232">now requests</a> that code sharing occur “where possible” (via Steven S.)</li>
<li>Great](http://imgur.com/gallery/ZpgQz) cartoons, I particularly like the one about replication (via Steven S.).</li>
</ol>
Why I support statisticians and their resistance to hype
2014-10-28T10:19:01+00:00
http://simplystats.github.io/2014/10/28/why-i-support-statisticians-and-their-resistance-to-hype
<p>Despite Statistics being the most mature data related discipline, statisticians <a href="http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/">have not fared well</a> in terms of being selected for funding or leadership positions in the new initiatives brought about by the increasing interest in data. Just to give one example (<a href="http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/">Jeff</a> and <a href="http://www.chalmers.se/en/areas-of-advance/ict/calendar/Pages/Terry-Speed.aspx">Terry Speed</a> give many more) the <a href="http://www.nitrd.gov/nitrdgroups/index.php?title=White_House_Big_Data_Partners_Workshop">White House Big Data Partners Workshop</a> had 19 members of which 0 were statisticians. The statistical community is clearly worried about this predicament and there is widespread consensus that we need to be <a href="http://simplystatistics.org/2012/08/14/statistics-statisticians-need-better-marketing/">better at marketing</a>. Although I agree that only good can come from better communicating what we do, it is also important to continue doing one of the things we do best: resisting the hype and being realistic about data.</p>
<p>This week, after reading Mike Jordan’s <a href="http://www.reddit.com/r/MachineLearning/comments/2fxi6v/ama_michael_i_jordan">reddit ask me anything</a>, I was reminded of exactly how much I admire this quality in statisticians. From reading the interview one learns about instances where hype has led to confusion, and how getting past this confusion helps us better understand, and consequently appreciate, the importance of his field. For the past 30 years, Mike Jordan has been one of the most prolific academics working in the areas that today are receiving increased attention. Yet, you won’t find a hyped-up press release coming out of his lab. In fact, when a <a href="http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts">journalist tried to hype up Jordan’s critique of hype</a>, Jordan <a href="https://amplab.cs.berkeley.edu/2014/10/22/big-data-hype-the-media-and-other-provocative-words-to-put-in-a-title/">called out the author</a>.</p>
<p>Assessing the current situation with data initiatives it is hard not to conclude that hype is being rewarded. Many statisticians have come to the sad realization that by being cautious and skeptical, we may be losing out on funding possibilities and leadership roles. However, I remain very much upbeat about our discipline. First, being skeptical and cautious has actually led to many important contributions. An important example is how randomized controlled experiments changed how medical procedures are evaluated. A more recent one is the concept of FDR, which helps control false discoveries in, for example, high-throughput experiments. Second, many of us continue to work in the interface with real world applications placing us in a good position to make relevant contributions. Third, despite the failures alluded to above, we continue to successfully find ways to fund our work. Although resisting the hype has cost us in the short term, we will continue to produce methods that will be useful in the long term, as we have been doing for decades. Our methods will still be used when today’s hyped up press releases are long forgotten.</p>
Return of the sunday links! (10/26/14)
2014-10-26T10:00:31+00:00
http://simplystats.github.io/2014/10/26/return-of-the-sunday-links-102614
<p>New look for the blog and bringing back the links. If you have something that you’d like included in the Sunday links, email me and let me know. If you use the title of the message “Sunday Links” you’ll be more likely for me to find it when I search my gmail.</p>
<ol>
<li>Thomas L. does a more technical post on <a href="http://notstatschat.tumblr.com/post/100893932596/semiparametric-efficiency-and-nearly-true-models">semi-parametric efficiency</a>, normally I’m a data n’ applications guy, but I love these in depth posts, especially when the papers remind me of all the things I studied at my <a href="http://www.biostat.washington.edu/">alma mater</a>.</li>
<li>I am one of those people who only knows a tiny bit about Docker, but hears about it all the time. That being said, after I read about <a href="http://dirk.eddelbuettel.com/blog/2014/10/23/#introducing_rocker">Rocker</a>, I got pretty excited.</li>
<li>Hadley W.’s <a href="https://www.biostars.org/p/115481/">favorite tools</a>, seems like that dude likes R Studio for some reason….(me too)</li>
<li><a href="http://priorprobability.com/2014/10/22/chess-piece-survival-rates/">A cool visualization</a> of chess piece survival rates.</li>
<li><a href="http://espn.go.com/video/clip?id=11694550">A short movie by 538</a> about statistics and the battle between Deep Blue and Gary Kasparov. Where’s the popcorn?</li>
<li>Twitter engineering released an R package for <a href="https://blog.twitter.com/2014/breakout-detection-in-the-wild">detecting outbreaks</a>. I wonder how <a href="http://www.bioconductor.org/packages/release/bioc/html/DNAcopy.html">circular binary segmentation</a> would do?</li>
</ol>
An interactive visualization to teach about the curse of dimensionality
2014-10-24T11:14:43+00:00
http://simplystats.github.io/2014/10/24/an-interactive-visualization-to-teach-about-the-curse-of-dimensionality
<p>I recently was contacted for an interview about the curse of dimensionality. During the course of the conversation, I realized how hard it is to explain the curse to a general audience. One of the best descriptions I could come up with was to describe sampling from a unit line, square, cube, etc., and counting how many points fall inside a region of fixed side length: you capture fewer and fewer points as the dimension grows. As I was saying this, I realized it is a pretty bad way to explain the curse of dimensionality in words. But there was potentially a cool data visualization that would illustrate the idea. I went to my student <a href="http://www.biostat.jhsph.edu/~prpatil/">Prasad</a>, our resident interactive viz design expert, to see if he could build it for me. He came up with this cool Shiny app where you can simulate a number of points (n) and then fix a side length for 1-D, 2-D, 3-D, and 4-D and see how many points you capture in a cube of that length in that dimension. You can find the <a href="https://prpatil.shinyapps.io/cod_app/">full app here</a> or check it out on the blog here:</p>
<p> </p>
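<p>If you prefer code to words, here is a small simulation in the same spirit as the app: sample points uniformly from the unit hypercube and count how many fall in a corner cube with side length 0.5 as the dimension grows.</p>
<pre><code class="language-r">set.seed(1)
n <- 10000
side <- 0.5

sapply(1:4, function(d) {
  x <- matrix(runif(n * d), ncol = d)   # n points in the d-dimensional unit cube
  mean(rowSums(x < side) == d)          # proportion captured by a cube of side 0.5
})
## roughly 0.5, 0.25, 0.125, 0.0625: the same side length captures ever fewer points
</code></pre>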
Vote on simply statistics new logo design
2014-10-22T10:38:10+00:00
http://simplystats.github.io/2014/10/22/vote-on-simply-statistics-new-logo-design
<p>As you can tell, we have given the Simply Stats blog a little style update. It should be more readable on phones or tablets now. We are also about to get a new logo. We are down to the last couple of choices and can’t decide. Since we are statisticians, we thought we’d collect some data. <a href="http://99designs.com/logo-design/vote-3datw8">Here is the link</a> to the poll. Let us know</p>
Thinking like a statistician: don't judge a society by its internet comments
2014-10-20T13:59:03+00:00
http://simplystats.github.io/2014/10/20/thinking-like-a-statistician-dont-judge-a-society-by-its-internet-comments
<p>In a previous <a href="http://simplystatistics.org/2014/01/17/missing-not-at-random-data-makes-some-facebook-users-feel-sad/">post</a> I explained how thinking like a statistician can help you avoid <a href="http://www.npr.org/2014/01/09/261108836/many-younger-facebook-users-unfriend-the-network">feeling sad after using Facebook.</a> The basic point was that <em>missing not at random</em> (MNAR) data on your friends’ profiles (showing only the best parts of their life) can result in the biased view that your life is boring and uninspiring in comparison. A similar argument can be made to avoid losing faith in humanity after reading internet comments or anonymous tweets, one of the most depressing activities that I have voluntarily engaged in. If you want to see proof that racism, xenophobia, sexism and homophobia are still very much alive, read the unfiltered comments sections of articles related to race, immigration, gender or gay rights. However, as a statistician, I remain optimistic about our society after realizing how extremely biased these particular MNAR data can be.</p>
<p>Assume we could summarize an individual’s “righteousness” with a numerical index. I realize this is a gross oversimplification, but bear with me. Below is my view on the distribution of this index across all members of our society.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/10/IMG_5842.jpg"><img class="aligncenter wp-image-3409" src="http://simplystatistics.org/wp-content/uploads/2014/10/IMG_5842.jpg" alt="IMG_5842" width="442" height="463" srcset="http://simplystatistics.org/wp-content/uploads/2014/10/IMG_5842-286x300.jpg 286w, http://simplystatistics.org/wp-content/uploads/2014/10/IMG_5842-977x1024.jpg 977w, http://simplystatistics.org/wp-content/uploads/2014/10/IMG_5842.jpg 2139w" sizes="(max-width: 442px) 100vw, 442px" /></a></p>
<p>Note that the distribution is not bimodal. This means there is no gap between good and evil, instead we have a continuum. Although there is variability, and we do have some extreme outliers on both sides of the distribution, most of us are much closer to the median than we like to believe. The offending internet commentators represent a very small proportion (the “bad” tail shown in red). But in a large population, such as internet users, this extremely small proportion can be quite numerous and gives us a biased view.</p>
<p>There is one more level of variability here that introduces biases. Since internet comments can be anonymous, we get an unprecedentedly large glimpse into people’s opinions and thoughts. We assign a “righteousness” index to our thoughts and opinions and include them in the scatter plot shown in the figure above. Note that this index exhibits variability within individuals: even the best people have the occasional bad thought. The points in red represent thoughts so awful that no one, not even the worst people, would ever express them publicly. The red points give us an overly pessimistic estimate of the individuals that are posting these comments, which exacerbates our already pessimistic view due to a non-representative sample of individuals.</p>
<p>I hope that thinking like a statistician will help the media and social networks put in statistical perspective the awful tweets or internet comments that represent the worst of the worst. These actually provide little to no information on humanity’s distribution of righteousness, which I think is moving consistently, albeit slowly, towards the good.</p>
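<p>A quick simulation makes the selection bias concrete. The distribution and cutoff below are invented for illustration: if only the most extreme thoughts end up as anonymous comments, the comments we read say almost nothing about the bulk of the distribution.</p>
<pre><code class="language-r">set.seed(1)
## Hypothetical "righteousness" index for a million people (higher = better)
people <- rnorm(1e6)

## Each person's thoughts scatter around their own index
thoughts <- rnorm(1e6, mean = people, sd = 0.5)

## Suppose only the worst 0.1% of thoughts get posted as anonymous comments
posted <- thoughts[thoughts < quantile(thoughts, 0.001)]

mean(people)   # near 0: the population is centered close to the median
mean(posted)   # deep in the left tail: what the comment section shows you
</code></pre>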
Bayes Rule in an animated gif
2014-10-17T10:00:41+00:00
http://simplystats.github.io/2014/10/17/bayes-rule-in-a-gif
<p>Say Pr(A)=5% is the prevalence of a disease (% of red dots on top fig). Each individual is given a test with accuracy Pr(B|A) = Pr(no B|no A) = 90%. The O in the middle turns into an X when the test fails. The rate of Xs is 1-Pr(B|A). We want to know the probability of having the disease if you tested positive: Pr(A|B). Many find it counterintuitive that this probability is much lower than 90%; this animated gif is meant to help.</p>
<p><img src="http://rafalab.jhsph.edu/simplystats/bayes.gif" alt="" width="600" /></p>
<p>The individual being tested is highlighted with a moving black circle. Pr(B) of these will test positive: we put these in the bottom left and the rest in the bottom right. The proportion of red points that end up in the bottom left is the proportion of red points Pr(A) with a positive test Pr(B|A), thus Pr(B|A) x Pr(A). Pr(A|B), or the proportion of reds in the bottom left, is therefore Pr(B|A) x Pr(A) divided by Pr(B): Pr(A|B) = Pr(B|A) x Pr(A) / Pr(B).</p>
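<p>For anyone who prefers code to a gif, here is the same calculation in a few lines of R:</p>
<pre><code class="language-r">prev <- 0.05   # Pr(A): prevalence of the disease
sens <- 0.90   # Pr(B | A): positive test given disease
spec <- 0.90   # Pr(no B | no A): negative test given no disease

pr_pos <- sens * prev + (1 - spec) * (1 - prev)  # Pr(B), by total probability
sens * prev / pr_pos                             # Pr(A | B), roughly 0.32
</code></pre>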
<p>ps - Is this a <a href="http://simplystatistics.org/2014/10/13/as-an-applied-statistician-i-find-the-frequentists-versus-bayesians-debate-completely-inconsequential/">frequentist or Bayesian</a> gif?</p>
Creating the field of evidence based data analysis - do people know what a p-value looks like?
2014-10-16T15:00:34+00:00
http://simplystats.github.io/2014/10/16/creating-the-field-of-evidence-based-data-analysis-do-people-know-what-a-p-value-looks-like
<p>In the medical sciences, there is a discipline called “<a href="http://en.wikipedia.org/wiki/Evidence-based_medicine">evidence based medicine</a>”. The basic idea is to study the actual practice of medicine using experimental techniques. The reason is that while we may have good experimental evidence about specific medicines or practices, the global behavior and execution of medical practice may also matter. There have been some success stories from this approach and also backlash from physicians who <a href="http://onlinelibrary.wiley.com/doi/10.1111/j.1523-536X.1996.tb00491.x/abstract">don’t like to be told how to practice medicine.</a> However, on the whole it is a valuable and interesting scientific exercise.</p>
<p>Roger introduced the idea of <a href="http://simplystatistics.org/2013/08/28/evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-2/">evidence based data analysis</a> in a previous post. The basic idea is to study the actual practice and behavior of data analysts to identify how they behave. There is a strong history of this type of research within the data visualization community <a href="http://www.stat.purdue.edu/~wsc/">starting with Bill Cleveland</a> and extending forward to work by <a href="http://dicook.github.io/cv.html">Diane Cook</a>, <a href="http://vis.stanford.edu/papers/crowdsourcing-graphical-perception">Jeffrey Heer</a>, and others.</p>
<p><a href="https://peerj.com/articles/589/">Today we published</a> a large-scale evidence based data analysis randomized trial. Two of the most common data analysis tasks (for better or worse) are exploratory analysis and the identification of statistically significant results. Di Cook’s group calls this idea <a href="http://stat.wharton.upenn.edu/~buja/PAPERS/Wickham-Cook-Hofmann-Buja-IEEE-TransVizCompGraphics_2010-Graphical%20Inference%20for%20Infovis.pdf">“graphical inference” or “visual significance”</a> and they have studied human’s ability to detect significance in the context of [In the medical sciences, there is a discipline called “<a href="http://en.wikipedia.org/wiki/Evidence-based_medicine">evidence based medicine</a>”. The basic idea is to study the actual practice of medicine using experimental techniques. The reason is that while we may have good experimental evidence about specific medicines or practices, the global behavior and execution of medical practice may also matter. There have been some success stories from this approach and also backlash from physicians who <a href="http://onlinelibrary.wiley.com/doi/10.1111/j.1523-536X.1996.tb00491.x/abstract">don’t like to be told how to practice medicine.</a> However, on the whole it is a valuable and interesting scientific exercise.</p>
<p>Roger introduced the idea of <a href="http://simplystatistics.org/2013/08/28/evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-2/">evidence based data analysis</a> in a previous post. The basic idea is to study the actual practice and behavior of data analysts to identify how analysts behave. There is a strong history of this type of research within the data visualization community <a href="http://www.stat.purdue.edu/~wsc/">starting with Bill Cleveland</a> and extending forward to work by <a href="http://dicook.github.io/cv.html">Diane Cook</a>, , <a href="http://vis.stanford.edu/papers/crowdsourcing-graphical-perception">Jeffrey Heer</a>, and others.</p>
<p><a href="https://peerj.com/articles/589/">Today we published</a> a large-scale evidence based data analysis randomized trial. Two of the most common data analysis tasks (for better or worse) are exploratory analysis and the identification of statistically significant results. Di Cook’s group calls this idea <a href="http://stat.wharton.upenn.edu/~buja/PAPERS/Wickham-Cook-Hofmann-Buja-IEEE-TransVizCompGraphics_2010-Graphical%20Inference%20for%20Infovis.pdf">“graphical inference” or “visual significance”</a> and they have studied human’s ability to detect significance in the context of](http://www.tandfonline.com/doi/abs/10.1080/01621459.2013.808157) and how it <a href="http://arxiv.org/abs/1408.1974">associates with demographics and visual characteristics of the plot.</a></p>
<p>We performed a randomized study to determine if data analysts with basic training could identify statistically significant relationships. Or as the first author put it in a tweet:</p>
<blockquote class="twitter-tweet" width="550">
<p>
First paper just dropped! Can you tell the difference between these two plots? <a href="https://t.co/Lng0FWI0XY">https://t.co/Lng0FWI0XY</a> <a href="http://t.co/zFCwwcxaAX">pic.twitter.com/zFCwwcxaAX</a>
</p>
<p>
— Aaron Fisher (@PrfFarnsworth) <a href="https://twitter.com/PrfFarnsworth/status/522790724774141952">October 16, 2014</a>
</p>
</blockquote>
<p>What we found was that people were pretty bad at detecting statistically significant results, but that over multiple trials they could improve. This is a tentative first step toward understanding how the general practice of data analysis works. If you want to play around and see how good you are at seeing p-values we also built this interactive Shiny app. If you don’t see the app you can also go to the <a href="http://glimmer.rstudio.com/afisher/EDA/">Shiny app page here.</a></p>
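<p>If you want a quick feel for the task before opening the app, here is a tiny R version of the “which plot shows a real relationship?” game. The sample size and effect size are arbitrary choices for illustration.</p>
<pre><code class="language-r">set.seed(2014)
n <- 50
x <- rnorm(n)

## One response has a modest real relationship with x, the other is pure noise
y_real <- 0.4 * x + rnorm(n)
y_null <- rnorm(n)

panels <- sample(list(y_real, y_null))   # shuffle so you can't cheat
par(mfrow = c(1, 2))
plot(x, panels[[1]], xlab = "x", ylab = "y", main = "Plot 1")
plot(x, panels[[2]], xlab = "x", ylab = "y", main = "Plot 2")

## After you guess, check the p-values for the slope in each panel
sapply(panels, function(y) summary(lm(y ~ x))$coefficients[2, 4])
</code></pre>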
<p> </p>
Dear Laboratory Scientists: Welcome to My World
2014-10-15T19:42:03+00:00
http://simplystats.github.io/2014/10/15/dear-laboratory-scientists-welcome-to-my-world
<p>Consider the following question: Is there a reproducibility/replication crisis in epidemiology?</p>
<p>I think there are only two possible ways to answer that question:</p>
<ol>
<li>No, there is no replication crisis in epidemiology because no one ever believes the result of an epidemiological study unless it has been replicated a minimum of 1,000 times in every possible population.</li>
<li>Yes, there is a replication crisis in epidemiology, and it started in 1854 when <a href="http://www.ph.ucla.edu/epi/snow/snowbook2.html">John Snow</a> inferred, from observational data, that cholera was spread via contaminated water obtained from public pumps.</li>
</ol>
<p>If you chose (2), then I don’t think you are allowed to call it a “crisis” because I think by definition, a crisis cannot last 160 years. In that case, it’s more of a chronic disease.</p>
<p>I had an interesting conversation last week with a prominent environmental epidemiologist over the replication crisis that has been reported about extensively in the scientific and popular press. In his view, he felt this was less of an issue in epidemiology because epidemiologists never really had the luxury of people (or at least fellow scientists) believing their results because of their general inability to conduct controlled experiments.</p>
<p>Given the observational nature of most environmental epidemiological studies, it’s generally accepted in the community that no single study can be considered causal, and that many replications of a finding are needed to establish a causal connection. Even the popular press knows now to include the phrase “correlation does not equal causation” when reporting on an observational study. The work of <a href="http://en.wikipedia.org/wiki/Bradford_Hill_criteria">Sir Austin Bradford Hill</a> essentially codifies the standard of evidence needed to draw causal conclusions from observational studies.</p>
<p>So if “correlation does not equal causation”, it raises the question: what <em>does</em> equal causation? Many would argue that a controlled experiment, whether it’s a randomized trial or a laboratory experiment, equals causation. But people who work in this area have long known that while controlled experiments do assign the treatment or exposure, there are still many other elements of the experiment that are <em>not</em> controlled.</p>
<p>For example, if subjects drop out of a randomized trial, you now essentially have an observational study (or at least a <a href="http://amstat.tandfonline.com/doi/abs/10.1198/016214503000071#.VD8EqL5DuoY">“broken” randomized trial</a>). If you are conducting a laboratory experiment and all of the treatment samples are measured with one technology and all of the control samples are measured with a different technology (perhaps because of a lack of blinding), then you still have confounding.</p>
<p>The correct statement is not “correlation does not equal causation” but rather “no single study equals causation”, regardless of whether it was an observational study or a controlled experiment. Of course, a very tightly controlled and rigorously conducted controlled experiment will be more valuable than a similarly conducted observational study. But in general, all studies should simply be considered as further evidence for or against an hypothesis. We should not be lulled into thinking that any single study about an important question can truly be definitive.</p>
I declare the Bayesian vs. Frequentist debate over for data scientists
2014-10-13T10:45:44+00:00
http://simplystats.github.io/2014/10/13/as-an-applied-statistician-i-find-the-frequentists-versus-bayesians-debate-completely-inconsequential
<p>In a recent New York Times <a href="http://www.nytimes.com/2014/09/30/science/the-odds-continually-updated.html?_r=1">article</a> the “Frequentists versus Bayesians” debate was brought up once again. I agree with Roger:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
NYT wants to create a battle b/w Bayesians and Frequentists but it's all crap. Statisticians develop techniques. <a href="http://t.co/736gbqZGuq">http://t.co/736gbqZGuq</a>
</p>
<p>
— Roger D. Peng (@rdpeng) <a href="https://twitter.com/rdpeng/status/516739602024267776">September 30, 2014</a>
</p>
</blockquote>
<p>Because the real story (or non-story) is way too boring to sell newspapers, the author resorted to a sensationalist narrative that went something like this: “Evil and/or stupid frequentists were ready to let a fisherman die; the persecuted Bayesian heroes saved him.” This piece adds to the growing number of writings blaming frequentist statistics for the so-called reproducibility crisis in science. If there is something Roger, <a href="http://simplystatistics.org/2013/11/26/statistical-zealots/">Jeff</a> and <a href="http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/">I</a> agree on, it is that this debate is <a href="http://noahpinionblog.blogspot.com/2013/01/bayesian-vs-frequentist-is-there-any.html">not constructive</a>. As <a href="http://arxiv.org/pdf/1106.2895v2.pdf">Rob Kass</a> suggests, it’s time to move on to pragmatism. Here I follow up Jeff’s <a href="http://simplystatistics.org/2014/09/30/you-think-p-values-are-bad-i-say-show-me-the-data/">recent post</a> by sharing related thoughts brought about by two decades of practicing applied statistics and hope it helps put this unhelpful debate to rest.</p>
<p>Applied statisticians help answer questions with data. How should I design a roulette so my casino makes $? Does this fertilizer increase crop yield? Does streptomycin cure pulmonary tuberculosis? Does smoking cause cancer? What movie would this user enjoy? Which baseball player should the Red Sox give a contract to? Should this patient receive chemotherapy? Our involvement typically means analyzing data and designing experiments. To do this we use a variety of techniques that have been successfully applied in the past and that we have mathematically shown to have desirable properties. Some of these tools are frequentist, some of them are Bayesian, some could be argued to be both, and some don’t even use probability. The casino will do just fine with frequentist statistics, while the baseball team might want to apply a Bayesian approach to avoid overpaying for players that have simply been lucky.</p>
<p>It is also important to remember that good applied statisticians <strong>think</strong>. They don’t apply techniques blindly or religiously. If applied statisticians, regardless of their philosophical bent, are asked if the sun just exploded, they would not design an experiment like the one depicted in this popular XKCD cartoon.</p>
<p><a href="http://xkcd.com/1132/"><img class="aligncenter" src="http://imgs.xkcd.com/comics/frequentists_vs_bayesians.png" alt="" width="234" height="355" /></a></p>
<p>Only someone who does not know how to think like a statistician would act like the frequentists in the cartoon. Unfortunately we do have such people analyzing data. But their choice of technique is not the problem, it’s their lack of critical thinking. However, even the most frequentist-appearing applied statistician understands Bayes rule and will adopt the Bayesian approach when appropriate. In the above XKCD example, any self-respecting applied statistician would not even bother examining the data (the dice roll), because they would assign a probability of 0 to the sun exploding (the empirical prior based on the fact that they are alive). However, superficial propositions arguing for wider adoption of Bayesian methods fail to realize that using these techniques in an actual data analysis project is very different from simply thinking like a Bayesian. To do this we have to represent our intuition or prior knowledge (or whatever you want to call it) with mathematical formulae. When theoretical Bayesians pick these priors, they mainly have mathematical/computational considerations in mind. In practice we can’t afford this luxury: a bad prior will render the analysis useless regardless of its convenient mathematical properties.</p>
<p>Despite these challenges, applied statisticians regularly use Bayesian techniques successfully. In one of the fields I work in, Genomics, empirical Bayes techniques are widely used. In <a href="http://www.ncbi.nlm.nih.gov/pubmed/16646809">this</a> popular application of empirical Bayes we use data from all genes to improve the precision of estimates obtained for specific genes. However, the most widely used output of the software implementation is not a posterior probability. Instead, an empirical Bayes technique is used to improve the estimate of the standard error used in a good ol’ fashioned t-test. This idea has changed the way thousands of Biologists search for differentially expressed genes and is, in my opinion, one of the most important contributions of Statistics to Genomics. Is this approach frequentist? Bayesian? To this applied statistician it doesn’t really matter.</p>
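<p>To give a flavor of the idea, here is a stripped-down illustration in R. This is not the algorithm implemented in the software referenced above, just the general principle: borrow strength across genes by shrinking each gene’s noisy variance estimate toward a common value before forming the t-statistic.</p>
<pre><code class="language-r">set.seed(1)
genes <- 1000; n <- 3          # 1000 genes, 3 replicates (small samples are typical)

## Simulated log-expression differences with no true signal
x <- matrix(rnorm(genes * n), nrow = genes)

s2   <- apply(x, 1, var)       # noisy per-gene variance estimates
s2_0 <- mean(s2)               # common variance pooled across genes
d0   <- 4                      # prior "degrees of freedom" (hypothetical weight)

## Shrink each gene's variance toward the pooled value
s2_shrunk <- (d0 * s2_0 + (n - 1) * s2) / (d0 + n - 1)

t_ordinary  <- rowMeans(x) / sqrt(s2 / n)
t_moderated <- rowMeans(x) / sqrt(s2_shrunk / n)

## Ordinary t-statistics blow up for genes with lucky, tiny variances; moderated ones don't
max(abs(t_ordinary)); max(abs(t_moderated))
</code></pre>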
<p>For those arguing that simply switching to a Bayesian philosophy will improve the current state of affairs, let’s consider the smoking and cancer example. Today there is wide agreement that smoking causes lung cancer. Without a clear deductive biochemical/physiological argument and without the possibility of a randomized trial, this connection was established with a series of observational studies. Most, if not all, of the associated data analyses were based on frequentist techniques. None of the reported confidence intervals on their own established the consensus. Instead, as usually happens in science, a long series of studies supporting this conclusion were needed. How exactly would this have been different with a strictly Bayesian approach? Would a single paper have been enough? Would using priors have helped given the “expert knowledge” at the time (see below)?</p>
<p><img src="http://cdn.saveourbones.com/wp-content/uploads/smoking_doctor.jpg" width="234" height="355" class="aligncenter" alt="" /></p>
<p>And how would the Bayesian analyses performed by tobacco companies have shaped the debate? Ultimately, I think applied statisticians would have made an equally convincing case against smoking with Bayesian posteriors as opposed to frequentist confidence intervals. Going forward I hope applied statisticians continue to be free to use whatever techniques they see fit and that critical thinking about data continues to be what distinguishes us. Imposing a Bayesian or frequentist philosophy on us would be a disaster.</p>
Data science can't be point and click
2014-10-09T16:16:17+00:00
http://simplystats.github.io/2014/10/09/data-science-cant-be-point-and-click
<p>As data becomes cheaper and cheaper there are more people who want to be able to analyze and interpret that data. I see more and more that people are creating tools to accommodate folks who aren’t trained but who still want to look at data <em>right now</em>. While I admire the principle of this approach - we need to democratize access to data - I think it is the most dangerous way to solve the problem.</p>
<p>The reason is that, especially with big data, it is very easy to find things like this with point and click tools:</p>
<div style="width: 670px" class="wp-caption aligncenter">
<a href="http://www.tylervigen.com/view_correlation?id=1597"><img class="" src="http://www.tylervigen.com/correlation_project/correlation_images/us-spending-on-science-space-and-technology_suicides-by-hanging-strangulation-and-suffocation.png" alt="" width="660" height="230" /></a>
<p class="wp-caption-text">
US spending on science, space, and technology correlates with Suicides by hanging, strangulation and suffocation (http://www.tylervigen.com/view_correlation?id=1597)
</p>
</div>
<p>The danger with using point and click tools is that it is very hard to automate the identification of warning signs that seasoned analysts get when they have their hands in the data. These may be spurious correlations like the plot above, issues with data quality, missing confounders, or implausible results. These things are much easier to spot when the analysis is being done interactively. Point and click software is also getting better about reproducibility, but it is still a major problem for many interfaces.</p>
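<p>A small simulation (a toy example, not any particular tool) shows how easy it is to stumble on a large spurious correlation once you screen many unrelated series:</p>
<pre><code># 200 independent random walks, 15 "years" each: no real relationships anywhere
set.seed(1)
x <- replicate(200, cumsum(rnorm(15)))
r <- cor(x)
diag(r) <- NA
max(abs(r), na.rm = TRUE)   # the best-looking pair is typically very highly correlated by chance alone
</code></pre>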
<p>Despite these issues, point and click software is still all the rage. I understand the sentiment: there is a bunch of data just lying there and <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">there aren’t enough people to analyze it expertly</a>. But you wouldn’t want me to operate on you using point and click surgery software. You’d want a surgeon who has practiced on real people and knows what to do when she has an artery in her hand. In the same way, I think point and click software allows untrained people to do awful things to big data.</p>
<p>The ways to solve this problem are:</p>
<ol>
<li>More data analysis training</li>
<li>Encouraging people to do their analysis interactively</li>
</ol>
<p>I have a few more tips which I have summarized in this talk on <a href="http://www.slideshare.net/jtleek/10-things-statistics-taught-us-about-big-data">things statistics taught us about big data</a>.</p>
The Leek group guide to genomics papers
2014-10-08T14:16:00+00:00
http://simplystats.github.io/2014/10/08/the-leek-group-guide-to-genomics-papers
<p><a href="https://github.com/jtleek/genomicspapers/">Leek group guide to genomics papers</a></p>
<p>When I was a student, my advisor, <a href="http://www.genomine.org/">John Storey</a>, made a list of papers for me to read on nights and weekends. That list was incredibly helpful for a couple of reasons.</p>
<ul class="task-list">
<li>
It got me caught up on the field of computational genomics
</li>
<li>
It was expertly curated, so it filtered a lot of papers I didn't need to read
</li>
<li>
It gave me my first set of ideas to try to pursue as I was reading the papers
</li>
</ul>
<p>I have often thought I should make a similar list for folks who may want to work with me (or who want to learn about statistical genomics). So this is my first attempt at that list. I’ve tried to separate the papers into categories and I’ve probably missed important papers. I’m happy to take suggestions for the list, but this is primarily designed for people in my group so I might be a little bit parsimonious.</p>
<p> </p>
An economic model for peer review
2014-10-06T10:00:36+00:00
http://simplystats.github.io/2014/10/06/an-economic-model-for-peer-review
<p>I saw this tweet the other day:</p>
<blockquote class="twitter-tweet" width="550">
<p>
Has anyone applied game theory to the issue of anonymous peer review in academia?
</p>
<p>
— Mick Watson (@BioMickWatson) <a href="https://twitter.com/BioMickWatson/status/517715981104590848">October 2, 2014</a>
</p>
</blockquote>
<p>It reminded me that a few years ago <a href="http://simplystatistics.org/2012/07/11/my-worst-recent-experience-with-peer-review/">I had a paper that went through the peer review wringer</a>. It drove me completely bananas. One thing that drove me so crazy about the process was how long the referees waited before reviewing and how terrible the reviews were after that long wait. So I started thinking about the “economics of peer review”. Basically, what is the incentive for scientists to contribute to the system?</p>
<p>To get a handle on this idea, I designed a “peer review game” where there are a fixed number of players N. The players play the game for a fixed period of time. During that time, they can submit papers or they can review papers. For each person, their final score at the end of the time is S<sub>i</sub> = (number of their submitted papers that were accepted).</p>
<p>Based on this model, under closed peer review, there is one Nash equilibrium under the strategy that <strong>no one reviews any papers</strong>. Basically, no one can hope to improve their score by reviewing, they can only hope to improve their score by submitting more papers (sound familiar?). Under open peer review, there are more potential equilibria, based on the relative amount of goodwill you earn from your fellow reviewers by submitting good reviews.</p>
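<p>Here is a toy sketch (not the model from the paper) of the intuition behind the no-reviewing equilibrium under closed review: any minute spent reviewing is a minute not spent submitting, and only accepted submissions add to the score.</p>
<pre><code># Expected score when a player devotes `time_reviewing` minutes of a fixed budget to reviews
# (all numbers are made up for illustration)
expected_score <- function(time_reviewing, total_time = 60,
                           p_accept = 0.5, mins_per_paper = 10) {
  n_submitted <- (total_time - time_reviewing) / mins_per_paper
  n_submitted * p_accept   # under closed review, reviewing earns nothing back
}
sapply(c(0, 10, 20, 30), expected_score)   # the score only falls as reviewing time grows
</code></pre>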
<p>We then built a model system for testing out our theory. The system involved having groups of students play a “peer review game” where they submitted solutions to SAT problems like:</p>
<p><img class="aligncenter" src="http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0026895.g005&representation=PNG_M" alt="" width="390" height="335" /></p>
<p>Each solution was then randomly assigned to another player to review. Those players could (a) review it and reject it, (b) review it and accept it, or (c) not review it. The person with the most points at the end of the time (one hour) won.</p>
<p>We found some cool things:</p>
<ol>
<li>In closed review, reviewing gave no benefit.</li>
<li>In open review, reviewing gave a small positive benefit.</li>
<li>Both systems gave comparable accuracy</li>
<li>All peer review increased the overall accuracy of responses</li>
</ol>
<p><a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895">The paper is here</a> and all of the <a href="http://www.biostat.jhsph.edu/~jleek/peerreview/">data and code are here</a>.</p>
The Drake index for academics
2014-10-02T13:30:52+00:00
http://simplystats.github.io/2014/10/02/the-drake-index-for-academics
<p>I think academic indices are pretty silly; maybe we should introduce so many academic indices that people can’t even remember which one is which. There are pretty serious flaws with both citation indices and social media indices that I think render them pretty meaningless in a lot of ways.</p>
<p>Regardless of these obvious flaws, I want in the game. Instead of the <a href="http://genomebiology.com/2014/15/7/424">K-index</a> for academics I propose the <a href="http://www.drakeofficial.com/">Drake</a> index. Drake has achieved <a href="http://en.wikipedia.org/wiki/Drake_(rapper)">both critical and popular success</a>. His song “Honorable Mentions” from the ESPYs (especially the first verse) reminds me of the motivation of the K-index paper.</p>
<p>To quantify both the critical and popular success of a scientist, I propose the Drake Index (TM). The Drake Index is defined as follows</p>
<blockquote>
<p>(# Twitter Followers)/(Max Twitter Followers by a Person in your Field) + (#Citations)/(Max Citations by a Person in your Field)</p>
</blockquote>
<p>Let’s break the index down. There are two main components (Twitter followers and Citations) measuring popular and critical acclaim. But they are measured on different scales. So we attempt to normalize them to the maximum in their field so both components will be between 0 and 1. This means that your Drake index score is between 0 and 2. Let’s look at a few examples to see how the index works.</p>
<ol>
<li><a href="https://twitter.com/Drake">Drake</a> = (16.9M followers)/(55.5 M followers for Justin Bieber) + (0 citations)/(134 <a href="http://scholar.google.com/scholar?hl=en&q=+Natalie+Hershlag&btnG=&as_sdt=1%2C21&as_sdtp=">Citations for Natalie Portman</a>) = 0.30</li>
<li>Rafael Irizarry = (1.1K followers)/(17.6K followers for Simply Stats) + (<a href="http://scholar.google.com/citations?user=nFW-2Q8AAAAJ&hl=en&oi=ao">38,194 citations</a>)/(<a href="http://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:biostatistics">185,740 citations for Doug Altman</a>) = 0.27</li>
<li>Roger Peng - (4.5K followers)/(17.6K followers for Simply Stats) + (<a href="http://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=roger+peng">4,011 citations</a>)/(<a href="http://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:biostatistics">185,740 citations for Doug Altman</a>) = 0.27</li>
<li>Jeff Leek - (2.6K followers)/(17.6K followers for Simply Stats) + (<a href="http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en">2,348 citations</a>)/(<a href="http://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:biostatistics">185,740 citations for Doug Altman</a>) = 0.16</li>
</ol>
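<p>For the record, the arithmetic above fits in a two-line function (a quick sketch, just for fun):</p>
<pre><code># Drake Index as defined above: normalized followers plus normalized citations
drake_index <- function(followers, max_followers, citations, max_citations) {
  followers / max_followers + citations / max_citations
}
drake_index(16.9e6, 55.5e6, 0, 134)       # Drake: about 0.30
drake_index(1100, 17600, 38194, 185740)   # Rafael Irizarry: about 0.27
</code></pre>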
<p>In the interest of this not being taken any more seriously than an afternoon blog post should be, I won’t calculate any other people’s Drake index. But you can :-).</p>
You think P-values are bad? I say show me the data.
2014-09-30T12:00:44+00:00
http://simplystats.github.io/2014/09/30/you-think-p-values-are-bad-i-say-show-me-the-data
<div class="page" title="Page 1">
<div class="layoutArea">
<div class="column">
<p>
Both the scientific community and the popular press are freaking out about reproducibility right now. I think they have good reason to, because even the US Congress is now <a href="http://web.stanford.edu/~vcs/talks/Testimony-STODDEN.pdf">investigating the transparency of science</a>. It has been driven by the very public reproducibility disasters in <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">genomics</a> and <a href="http://simplystatistics.org/2013/04/19/podcast-7-reinhart-rogoff-reproducibility/">economics</a>.
</p>
<p>
There are three major components to a reproducible and replicable study from a computational perspective: (1) the raw data from the experiment must be available, (2) the statistical code and documentation to reproduce the analysis must be available and (3) a correct data analysis must be performed.
</p>
<p>
There have been successes and failures in releasing all the data, but <a href="http://blogs.plos.org/everyone/2014/02/24/plos-new-data-policy-public-access-data-2/">PLoS' policy on data availability</a> and the <a href="http://www.alltrials.net/">alltrials</a> initiative hold some hope. The most progress has been made on making code and documentation available. Galaxy, knitr, and iPython make it easier to distribute literate programs than it has ever been previously and people are actually using them!
</p>
<p>
The trickiest part of reproducibility and replicability is ensuring that people perform a good data analysis. The first problem is that we actually don't know which statistical methods lead to higher reproducibility and replicability in users' hands. Articles like <a href="http://www.nytimes.com/2014/09/30/science/the-odds-continually-updated.html?_r=0">the one that just came out in the NYT</a> suggest that using one type of method (Bayesian approaches) over another (p-values) will address the problem. But the real story is that those are still 100% philosophical arguments. We actually have very little good data on whether analysts will perform better analyses using one method or another. <a href="http://simplystatistics.org/2014/02/14/on-the-scalability-of-statistical-procedures-why-the-p-value-bashers-just-dont-get-it/">I agree with Roger</a> in his tweet storm (quick someone is wrong on the internet Roger, fix it!):
</p>
<blockquote class="twitter-tweet" width="550">
<p>
5/If using Bayesian methods made you a better scientist, that would be great. But I need to see the evidence on that first.
</p>
<p>
— Roger D. Peng (@rdpeng) <a href="https://twitter.com/rdpeng/status/516958707859857409">September 30, 2014</a>
</p>
</blockquote>
<p>
</p>
<p>
This is even more of a problem because the data deluge demands that <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">almost all data analysis be performed by people with basic to intermediate statistics training</a> at best. There is no way around this in the short term. There just aren't enough trained statisticians/data scientists to go around. So we need to study statistics just like any other human behavior to figure out which methods work best in the hands of the people most likely to be using them.
</p>
</div>
</div>
</div>
Unbundling the educational package
2014-09-22T11:46:44+00:00
http://simplystats.github.io/2014/09/22/unbundling-the-educational-package
<p>I just got back from the World Economic Forum’s summer meeting in Tianjin, China and there was much talk of disruption and innovation there. Basically, if you weren’t disrupting, you were furniture. Perhaps not surprisingly, one topic area that was universally considered ripe for disruption was Education.</p>
<p>There are many ideas bandied about with respect to “disrupting” education and some are interesting to consider. MOOCs were the darlings of…last year…but they’re old news now. Sam Lessin has a <a href="https://www.theinformation.com/Why-Universities-Should-Wise-Up-and-Retain-Their-Users">nice piece</a> in The Information (total paywall, sorry, but it’s worth it) about building a subscription model for universities. Aswath Damodaran has what I think is a <a href="http://www.aswathdamodaran.blogspot.com/2014/09/the-education-business-road-map-for.html">nice framework for thinking about the “education business”</a>.</p>
<p>One thing that I latched on to in Damodaran’s piece is the idea of education as a “bundled product”. Indeed, I think the key aspect of traditional on-site university education is the simultaneous offering of</p>
<ol>
<li>Subject matter content (i.e. course material)</li>
<li>Mentoring and guidance by faculty</li>
<li>Social and professional networking</li>
<li>Other activities (sports, arts ensembles, etc.)</li>
</ol>
<p>MOOCs have attacked #1 for many subjects, typically large introductory courses. Endeavors like the <a href="http://www.minervaproject.com">Minerva project</a> are attempting to provide lower-cost seminar-style courses (i.e. anti-MOOCs).</p>
<p>I think the extent to which universities will truly be disrupted will hinge on how well we can unbundle the four (or maybe more?) elements described above and provide them separately but at roughly the same level of quality. Is it possible? I don’t know.</p>
Applied Statisticians: people want to learn what we do. Let's teach them.
2014-09-15T10:00:04+00:00
http://simplystats.github.io/2014/09/15/applied-statisticians-people-want-to-learn-what-we-do-lets-teach-them
<p>In <a href="http://bulletin.imstat.org/2014/09/data-science-how-is-it-different-to-statistics%E2%80%89/">this</a> recent opinion piece, <a href="http://had.co.nz/">Hadley Wickham</a> explains how data science goes beyond Statistics and that data science is not promoted in academia. He defines data science as follows:</p>
<blockquote>
<p>I think there are three main steps in a data science project: you <em>collect</em> data (and questions), <em>analyze</em> it (using visualization and models), then <em>communicate</em> the results.</p>
</blockquote>
<p>and makes the important point that</p>
<blockquote>
<p>Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling.</p>
</blockquote>
<p>The above describes what I have been doing since I became an academic applied statistician about 20 years ago. It describes what several of my colleagues do as well. For example, 15 years ago Karl Broman, in <a href="https://www.biostat.wisc.edu/~kbroman/presentations/interference.pdf">his excellent job talk</a>, covered all the items in Hadley’s definition. The arc of the talk revolved around the scientific problem and not the statistical models. He spent a considerable amount of time describing how the data was acquired and how he used perl scripts to clean up microsatellite data. More than half <a href="https://www.biostat.wisc.edu/~kbroman/presentations/interference.pdf">his slides</a> contained visualizations, either illustrative cartoons or data plots. This research eventually led to his widely used “data product” <a href="http://www.rqtl.org/">R/qtl</a>. Although not described in the talk, Karl used <a href="http://kbroman.org/minimal_make/">make</a> to help make the results reproducible.</p>
<p>So why then does Hadley think that “Statistics research focuses on data collection and modeling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products”? I suspect one reason is that most applied work is published outside the flagship statistical journals. For example, Karl’s work was published in the <a href="http://www.ncbi.nlm.nih.gov/pubmed/9718341#" title="American journal of human genetics.">American Journal of Human Genetics.</a> A second reason may be that most of us academic applied statisticians don’t teach what we do. Despite writing a <a href="http://www.biostat.jhsph.edu/~ririzarr/Demo/index.html">thesis</a> that involved much data wrangling (reading music aiff files into Splus) and data visualization (including listening to fitted signals and residuals), the first few courses I taught as an assistant professor were almost solely on GLM theory.</p>
<p>About five years ago I tried changing the Methods course for our PhD students from one focusing on the math behind statistical methods to a problem and data-driven course. This was not very successful as many of our students were interested in the mathematical aspects of statistics and did not like the open-ended assignments. Jeff Leek built on that class by incorporating question development, much more vague problem statements, data wrangling, and peer grading. He also found it challenging to teach the more messy parts of applied statistics. It often requires exploration and failure which can be frustrating for new students.</p>
<p>This story has a happy ending though. Last year Jeff created a data science Coursera course that enrolled over 180,000 students with 6,000+ completing. This year I am subbing for <a href="http://www.people.fas.harvard.edu/~blitz/Site/Home.html">Joe Blitzstein</a> (talk about filling in big shoes) in <a href="http://cs109.github.io/2014/">CS109</a>: the Data Science undergraduate class <a href="http://vcg.seas.harvard.edu/">Hanspeter Pfister</a> and Joe created last year at Harvard. We have over 300 students registered, making it one of the largest classes on campus. I am not teaching them GLM theory.</p>
<p>So if you are an experienced applied statistician in academia, consider developing a data science class that teaches students what you do.</p>
<p> </p>
<p> </p>
<p> </p>
A non-comprehensive list of awesome female data people on Twitter
2014-09-09T09:59:39+00:00
http://simplystats.github.io/2014/09/09/a-non-comprehensive-list-of-awesome-female-data-people-on-twitter
<p>I was just talking to a student who mentioned she didn’t know Jenny Bryan was on Twitter. She is and she is an awesome person to follow. I also realized that I hadn’t seen a good list of women on Twitter who do stats/data. So I thought I’d make one. This list is what I could make in 15 minutes based on my own feed and will, with 100% certainty, miss really awesome people. Can you please add them in the comments and I’ll update the list?</p>
<ul>
<li><a href="https://twitter.com/JennyBryan">@JennyBryan</a> (Jenny Bryan) statistics professor at UBC, teaching a great <a href="http://stat545-ubc.github.io/">intro to data science class</a> right now.</li>
<li><a href="http://twitter.com/hspter">@hspter</a> (Hilary Parker) data analyst at Etsy (former Hopkins grad student!) and co-creator (I think) of #rcatladies, also wrote this nice post on <a href="http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/">writing an R package from scratch</a></li>
<li><a href="https://twitter.com/acfrazee">@acfrazee</a> (Alyssa Frazee) Ph.D. student at Hopkins, <a href="http://alyssafrazee.com/">writes a great blog</a> on data stuff, works on statistical genomics</li>
<li><a href="https://twitter.com/emsweene57" target="_blank">@emsweene57</a> (Elizabeth Sweeney) - Hopkins Ph.D. student, developer of methods for neuroimaging.</li>
<li><a href="https://twitter.com/hmason">@hmason</a> (Hilary Mason) - currently running one of my favorite startups <a href="http://www.fastforwardlabs.com/">Fast Forward Labs</a>, but basically needs no introduction, one of the biggest names in data science right now.</li>
<li><a href="https://twitter.com/sherrirose">@sherrirose</a> (Sherri Rose) - former Hopkins postdoc, now at Harvard. Literally <a href="http://drsherrirose.com/targeted-learning-book/">wrote the book on targeted learning</a>.</li>
<li><a href="https://twitter.com/eloyan_ani">@eloyan_ani</a> (Ani Eloyan) - Hopkins Biostat faculty, working on neuroimaging and EMRs. Lead the team that won the <a href="http://www.ncbi.nlm.nih.gov/pubmed/23060754">ADHD-200 competition</a>.</li>
<li><a href="https://twitter.com/mrogati">@mrogati</a> (Monica Rogati) - Former Linkedin data scientist, now running the data team at <a href="https://jawbone.com/">Jawbone</a>.</li>
<li><a href="https://twitter.com/annmariastat">@annmariastat</a> (<span style="color: #333333;">AnnMaria De Mars) - runs the Julia group, also world class judoka, writes one of <a href="http://www.thejuliagroup.com/blog/">my favorite stats/education blogs</a>. </span></li>
<li><a href="https://twitter.com/kara_woo">@kara_woo</a> (Kara Woo) - Works at the<span style="color: #36312d;"> </span><a style="color: #e94f1d;" href="http://www.nceas.ucsb.edu/" target="_blank">National Center for Ecological Analysis and Synthesis</a> and maintains their <a href="http://baikaldimensions.wordpress.com/">projections blog</a></li>
<li><a href="https://twitter.com/jhubiostat" target="_blank">@jhubiostat</a> (Betsy Ogburn) - Hopkins biostat faculty, not technically her account. But she is the reason this is the funniest/best academic department twitter account out there.</li>
<li><a href="https://twitter.com/lovestats" target="_blank">@lovestats</a> (Annie Pettit) - Does surveys and data quality/MRX work. If you are into MRX, <a href="http://lovestats.wordpress.com/" target="_blank">check out her blog</a>.</li>
<li><a href="https://twitter.com/ProfEmilyOster" target="_blank">@ProfEmilyOster</a> (Emily Oster) - Econ professor at U Chicago. Has been my favorite <a href="http://fivethirtyeight.com/contributors/emily-oster/" target="_blank">writer for FiveThirtyEight</a> since their relaunch.</li>
<li><a href="https://twitter.com/MonaChalabi" target="_blank">@monachalabi</a> (Mona Chalabi) - writer for FiveThirtyEight, I like her “Am I normal” series of posts.</li>
<li><a href="https://twitter.com/lisaczhang" target="_blank">@lisaczhang</a> (Lisa Zhang)- cofounder of Polychart.</li>
<li><a href="https://twitter.com/notawful">@notawful</a> (Jessica Hartnett) - professor at Gannon University, writes a <a href="http://notawfulandboring.blogspot.com/">great blog on teaching statistics</a>.</li>
<li>@<a href="https://twitter.com/AliciaOshlack">AliciaOshlack</a> (Alicia Oshlack) - researcher at Murdoch Children’s research institute, one of the real superstars in computational genomics.</li>
<li><a href="https://twitter.com/AmeliaMN">@AmeliaMN</a> (Amelia McNamara) - graduate student at UCLA, works on the <a href="http://www.mobilizingcs.org/">Mobilize project</a> and other awesome data education initiatives in LA school system.</li>
<li> <a href="https://twitter.com/leighadlr">@leighadlr</a> (LEIGH ARINO DE LA RUBIA) Editor in chief of <a href="http://datascience.la/">DataScience.LA</a></li>
<li><a href="https://twitter.com/inesgn">@inesgn</a> (Ines Germendia) - data scientist working on official statistics at Basque Statistics - Eustat</li>
<li><a href="https://twitter.com/sgrifter">@sgrifter</a> (Sandy Griffith) - Biostat Ph.D., fellow #rcatladies creator, professor at the Cleveland Clinic in quantitative medicine</li>
<li><a href="https://twitter.com/ladamic">@ladamic</a> (Lada Adamic) - professor at Michigan, teacher of really highly regarded <a href="https://www.coursera.org/course/sna">social network analysis class</a> on Coursera, now at Facebook (I think)</li>
<li><a href="https://twitter.com/stephaniehicks">@stephaniehicks</a> - (Stephanie Hicks) postdoc in compbio at Harvard, <a href="http://www.stephaniehicks.com/pages/teaching.html">lead teaching assistant for Data Science course at Harvard</a>.</li>
<li><a href="https://twitter.com/ansate">@ansate</a> - (Melissa Santos) manager of Hadoop infrastructure at Etsy, maintainer of the women in data list below.</li>
<li><a href="https://twitter.com/lauramclay">@lauramclay</a> (Laura McLay) - professor of operations research at UW Madison, writes a blog with an amazing name: <a href="http://punkrockor.wordpress.com/">Punk Rock Operations Research</a>.</li>
<li><a href="https://twitter.com/bioannie">@bioannie</a> (Laura Hatfield) - professor at Harvard, also has one of the best data titles I’ve ever heard: <a href="https://twitter.com/bioannie">Princess of Bayesia</a></li>
<li><a href="https://twitter.com/kaythaney">@kaythaney</a> (Kaitlin Thaney) - director of the Mozilla Science Lab, also works with Data Kind UK.</li>
<li><a href="https://twitter.com/laurieskelly">@laurieskelly</a> (Laurie Skelly) - Data scientist at Data Scope analytics</li>
<li><a href="https://twitter.com/bo_p">@bo_p</a> (Bo Peng) - Data scientist at Data Scope analytics</li>
<li><a href="https://twitter.com/siminaboca">@siminaboca</a> (Simina Boca) - former Hopkins Ph.D. student, now assistant professor at Georgetown in Biomedical informatics.</li>
<li><a href="https://twitter.com/HelenPowell01">@HelenPowell01</a> (Helen Powell) - postdoc in Biostatistics at Hopkins, works on statistics for relationship between air pollution and health.</li>
<li><a href="https://twitter.com/victoriastodden">@victoriastodden</a> (Victoria Stodden) - one of the leaders in the legal and sociological aspects of reproducible research.</li>
<li><a href="https://twitter.com/hannawallach">@hannawallach</a> (Hanna Wallach) - CS professor and researcher at Microsoft Research NY.</li>
<li><a href="https://twitter.com/kralljr">@kralljr</a> (Jenna Krall) - postdoctoral fellow in environmental statistics at Emory (Hopkins grad!)</li>
<li><a href="https://twitter.com/lssli">@LssLi</a> (Shanshan Li) - professor of Biostatistics at IUPI, works on neuroimaging, aging and epidemiology (Hopkins grad!)</li>
<li><a href="https://twitter.com/aheineike">@aheineike</a> (Amy Heineike) - director of mathematics at Quid, also <a href="http://simplystatistics.org/2012/03/19/interview-with-amy-heineike-director-of-mathematics/">excellent interviewee</a>.</li>
<li><a href="https://twitter.com/mathbabedotorg">@mathbabedotorg</a> (Cathy O’Neil) program director of the Lede Program at Columbia’s J School, <a href="http://mathbabe.org/">writes a very popular data science blog</a>.</li>
<li><a href="https://twitter.com/ameliashowalter">@ameiliashowalter</a> (Amelia Showalter) Former director of digital analytics for Obama2012. Data consultant.</li>
<li><a href="https://twitter.com/minebocek">@minebocek</a> (Mine Cetinkaya Rundel) Professor at Duke, teaches the <a href="https://www.coursera.org/course/statistics">great statistics MOOC</a> from them based on OpenIntro.</li>
<li><a href="https://twitter.com/@YennyWebbV">@YennyWebbV</a> (Yenny Webb Vargas) Ph.D. student in Biostatistics at Johns Hopkins, one of the founders of Bmore Biostats and <a href="http://yennywebbv.weebly.com/blog/data-sciences">a blogger</a></li>
<li><a href="https://twitter.com/@OMGannaks">@OMGannaks</a> (Anna Smith) - former data scientist at Bitly, now analytics engineer at rentherunway.</li>
<li><a href="https://twitter.com/@kristin_linn">@kristin_linn</a> (Kristin Linn) - postdoc at UPenn, formerly NC State grad student, part of the awesome statistics band (!) <a href="https://twitter.com/TheFifthMoment">@TheFifthMoment</a></li>
<li><a href="http://www.stat.berkeley.edu/~ledell/">@ledell</a> (Erin LeDell) - grad student in Biostatistics at Berkeley working on machine learning, <a href="http://cran.r-project.org/web/packages/subsemble/index.html">co-author of subsemble R package</a>.</li>
<li><a href="https://twitter.com/atmccann">@atmccann</a> (Allison McCann) - writer for FiveThirtyEight. Data viz person, my favorite post of hers is <a href="http://www.businessweek.com/articles/2014-02-13/how-airbus-is-debugging-the-a350#p2">how to debug a jet</a></li>
<li><a href="https://twitter.com/@ReginaNuzzo">@ReginaNuzzo</a> (Regina Nuzzo) - stats prof and freelance writer. Her piece on <a href="http://www.nature.com/news/scientific-method-statistical-errors-1.14700">p-values in Nature</a> just won the statistical reporting award.</li>
<li><a href="https://twitter.com/jrfAleks">@jrfAleks</a> (Aleks Collingwood) - programme manager for the Joseph Rowntree Foundation. Working on poverty and aging.</li>
<li><a href="https://twitter.com/@abarysh">@abarysh</a> (Anastasia Baryshnikova) - princeton Lewis-Sigler fellow, co-leader of major project on large international yeast knockout study.</li>
<li><a href="Sharon%20Machlis">@sharon000</a> (Sharon Machlis) - online managing editor at Computerworld.</li>
<li><a href="https://twitter.com/2plus2make5">@2plus2make5</a> (Emma Pierson) - Stanford undergrad, Rhodes Scholar, frequent contributor to FiveThirtyEight and other data blogs.</li>
<li><a href="https://twitter.com/mandyfmejia">@mandyfmejia</a> (Mandy Mejia) - Johns Hopkins PhD student, brain imaging analyzer, also <a href="http://mandymejia.wordpress.com/">writes a great blog</a>!</li>
</ul>
<p>I have also been informed that these Twitter lists are probably better than my post. But I’ll keep updating my list anyway cause I want to know who all the right people to follow are!</p>
<ul>
<li>
<div>
<a href="https://twitter.com/ansate/lists/women-in-data" target="_blank">https://twitter.com/ansate/<wbr />lists/women-in-data</a>
</div>
</li>
<li>
<div>
<a href="https://twitter.com/BecomingDataSci/lists/women-in-data-science" target="_blank">https://twitter.com/<wbr />BecomingDataSci/lists/women-<wbr />in-data-science</a>
</div>
</li>
</ul>
<p> </p>
Why the three biggest positive contributions to reproducible research are the iPython Notebook, knitr, and Galaxy
2014-09-04T14:08:42+00:00
http://simplystats.github.io/2014/09/04/why-the-three-biggest-positive-contributions-to-reproducible-research-are-the-ipython-notebook-knitr-and-galaxy
<p>There is a huge amount of interest in reproducible research and replication of results. Part of this is driven by some of the pretty major mistakes in reproducibility we have seen in <a href="http://simplystatistics.org/2013/04/19/podcast-7-reinhart-rogoff-reproducibility/">economics </a>and <a href="http://simplystatistics.org/2011/09/11/the-duke-saga/">genomics</a>. This has spurred discussion at a variety of levels including at the level of the <a href="http://simplystatistics.org/2014/04/01/this-is-how-an-important-scientific-debate-is-being-used-to-stop-epa-regulation/">United States Congress.</a></p>
<p>To solve this problem we need the appropriate infrastructure. I think developing infrastructure is a lot like playing the lottery, only if the lottery required a lot more work to buy a ticket. You pour a huge amount of effort into building good infrastructure. I think it helps if you build it for yourself like Yihui did for knitr:</p>
<p>(also <a href="http://datascience.la/yihui-xie-the-user-2014-interview/">make sure you go read the blog post</a> over at Data Science LA)</p>
<p>If lots of people adopt it, you are set for life. If they don’t, you did all that work for nothing. So you have to applaud all the groups who have made efforts at building infrastructure for reproducible research.</p>
<p>I would contend that the largest positive contributions to reproducibility in sheer number of analyses made reproducible are:</p>
<ul>
<li> The <a href="http://yihui.name/knitr/">knitr</a> R package (or more recently <a href="http://rmarkdown.rstudio.com/">rmarkdown</a>) for creating literate webpages and documents in R.</li>
<li><a href="http://ipython.org/notebook.html">iPython notebooks </a> for creating literate webpages and documents interactively in Python.</li>
<li>The <a href="http://galaxyproject.org/">Galaxy project</a> for creating reproducible work flows (among other things) combining known tools.</li>
</ul>
<p>There are similarities and differences between the different platforms but the one thing I think they all have in common is that they add little or no extra effort to people’s data analytic workflows.</p>
<p>knitr and iPython notebooks have primarily increased reproducibility among folks who have some scripting experience. I think a major reason they are so popular is because you just write code like you normally would, but embed it in a simple-to-use document. The workflow doesn’t change much for the analyst because they were going to write that code anyway. The tools just allow that code to be built into a more shareable document.</p>
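<p>A minimal sketch of that workflow (the file name is hypothetical): the code chunks live inside the document, and rendering re-runs them every time, so the report and the results can’t drift apart.</p>
<pre><code># report.Rmd (hypothetical file) contains prose plus code chunks like:
#   ```{r}
#   fit <- lm(mpg ~ wt, data = mtcars)
#   summary(fit)$coefficients
#   ```
rmarkdown::render("report.Rmd")   # re-runs the chunks and builds an HTML document with results inline
</code></pre>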
<p>Galaxy has increased reproducibility for many folks, but my impression is that the primary user base is folks with less scripting experience. They have worked hard to make it possible for these folks to analyze data they couldn’t before in a reproducible way. But the reproducibility is incidental in some sense. The main reason users come is that they would have had to stitch those pipelines together anyway. Now they have an easier way to do it (lowering workload) and they get reproducibility as a bonus.</p>
<p>If I was in charge of picking the next round of infrastructure projects that are likely to impact reproducibility or science in a positive way, I would definitely look for projects that have certain properties.</p>
<ul>
<li>For scripters and experts I would look for projects that interface with what people are already doing (most data analysis is in R or Python these days), require almost no extra work, and provide some benefit (reproducibility or otherwise). I would also look for things that are agnostic to which packages/approaches people are using.</li>
<li>For non-experts I would look for projects that enable people to build pipelines they weren’t able to before using already standard tools and give them things like reproducibility for free.</li>
</ul>
<p>Of course I wouldn’t put me in charge anyway, I’ve never won the lottery with any infrastructure I’ve tried to build.</p>
A (very) brief review of published human subjects research conducted with social media companies
2014-08-20T10:32:02+00:00
http://simplystats.github.io/2014/08/20/a-very-brief-review-of-published-human-subjects-research-conducted-with-social-media-companies
<p>As I wrote the other day, more and more human subjects research is being performed by large tech companies. The best way to handle the ethical issues raised by this research <a href="http://simplystatistics.org/2014/08/05/do-we-need-institutional-review-boards-for-human-subjects-research-conducted-by-big-web-companies/">is still unclear</a>. The first step is to get some idea of what has already been published from these organizations. So here is a brief review of the papers I know about where human subjects experiments have been conducted by companies. I’m only counting experiments here that have (a) been published in the literature and (b) involved experiments on users. I realized I could come up with surprisingly few. I’d be interested to see more in the comments if people know about them.</p>
<p><strong>Paper</strong>: <a href="http://www.pnas.org/content/111/24/8788.full">Experimental evidence of massive-scale emotional contagion through social networks</a></p>
<p><strong>Company</strong>: Facebook</p>
<p><strong>What they did</strong>: Randomized people to get different emotions in their news feed and observed if they showed an emotional reaction.</p>
<p><strong>What they found</strong>: That there was almost no real effect on emotion. The effect was statistically significant but not scientifically or emotionally meaningful.</p>
<p><strong>Paper: </strong><a href="http://www.sciencemag.org/content/341/6146/647.abstract">Social influence bias: a randomized experiment</a></p>
<p><strong>Company</strong>: Not stated but sounds like Reddit</p>
<p><strong>What they did</strong>: Randomly up-voted, down voted, or left alone posts to the social networking site. Then they observed whether there was a difference in the overall rating of posts within each treatment.</p>
<p><strong>What they found</strong>: Posts that were upvoted ended up with a final rating score (total upvotes - total downvotes) that was 25% higher.</p>
<p><strong>Paper: <a href="http://www.sciencemag.org/content/337/6092/337.full">Identifying influential and susceptible members of social networks</a> </strong></p>
<p><strong>Company</strong>: Facebook</p>
<p><strong>What they did</strong>: Using a commercial Facebook app, they found users who adopted a product and randomized sending messages to their friends about the use of the product. Then they measured whether their friends decided to adopt the product as well.</p>
<p><strong>What they found</strong>: Many interesting things. For example: susceptibility to influence decreases with age, people over 31 are stronger influencers, women are less susceptible to influence than men, etc. etc.</p>
<p> </p>
<p><strong>Paper: </strong><a href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41854.pdf">Inferring causal impact using Bayesian structural time-series models</a></p>
<p><strong>Company</strong>: Google</p>
<p><strong>What they did</strong>: They developed methods for inferring the causal impact of an ad in a time series situation. They used data from an advertiser who showed ads to people related to keywords and measured how many visits there were to the advertiser’s website through paid and organic (non-paid) clicks.</p>
<p><strong>What they found</strong>: That the ads worked. But more importantly that they could predict the causal effect of the ad using their methods.</p>
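<p>Google has since released an open-source R package, CausalImpact, implementing this approach. Here is a hedged sketch on simulated data (nothing like the advertiser’s data in the paper):</p>
<pre><code>library(CausalImpact)
set.seed(1)
x <- 100 + arima.sim(model = list(ar = 0.9), n = 100)   # control series
y <- 1.2 * x + rnorm(100)                               # response series
y[71:100] <- y[71:100] + 10                             # add a simulated "ad" effect in the post-period
impact <- CausalImpact(cbind(y, x), pre.period = c(1, 70), post.period = c(71, 100))
summary(impact)   # estimated causal effect of the intervention
</code></pre>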
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
SwiftKey and Johns Hopkins partner for Data Science Specialization Capstone
2014-08-19T12:18:05+00:00
http://simplystats.github.io/2014/08/19/swiftkey-and-johns-hopkins-partner-for-data-science-specialization-capstone
<p>I use <a href="http://swiftkey.com/en/">SwiftKey</a> on my Android phone all the time. So I was super pumped up <a href="http://www.jhsph.edu/news/news-releases/2014/johns-hopkins-bloomberg-school-of-public-healths-data-science-specialization-mooc-series-launches-industry-collaboration-with-swiftkey.html">to announce that we are partnering with SwiftKey</a> on the Data Science Specialization capstone course, which will run in October 2014. To enroll in the course you have to pass the other 9 courses in the <a href="https://www.coursera.org/specialization/jhudatascience/1">Data Science Specialization</a>.</p>
<p>The 9 courses have only been running for 4 months but already 200+ people have finished all 9! It has been unbelievable to see the response to the specialization and we are excited about taking it to the next level.</p>
<p>Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:</p>
<p><em>I went to the</em></p>
<p>the keyboard presents three options for what the next word might be. For example, the three words might be <em>gym, store, restaurant</em>. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.</p>
<p>This course will start with the basics, analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, students will use the knowledge gained in our Data Products course to build a predictive text product they can show off to their family, friends, and potential employers.</p>
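<p>To make the idea concrete, here is a toy next-word predictor built from bigram counts, using a deliberately tiny made-up corpus (and nothing like SwiftKey’s actual models; it even ignores sentence boundaries):</p>
<pre><code>corpus <- c("i went to the gym", "i went to the store", "i went to the gym today")
words  <- unlist(strsplit(corpus, " "))
prev   <- head(words, -1); nxt <- tail(words, -1)   # consecutive word pairs
predict_next <- function(w) {
  counts <- sort(table(nxt[prev == w]), decreasing = TRUE)
  names(counts)[seq_len(min(3, length(counts)))]    # up to three candidate words
}
predict_next("the")   # "gym" then "store", ordered by frequency
</code></pre>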
<p>We are really excited to work with SwiftKey to take our Specialization to the next level! Here is Roger’s intro video for the course to get you fired up too.</p>
Interview with COPSS Award winner Martin Wainwright
2014-08-18T10:00:15+00:00
http://simplystats.github.io/2014/08/18/interview-with-copss-award-winner-martin-wainright
<p><em>Editor’s note: <a href="http://www.cs.berkeley.edu/~wainwrig/">Martin Wainwright</a> is the winner of the 2014 COPSS Award. This award is the most prestigious award in statistics, sometimes referred to as the <a href="http://en.wikipedia.org/wiki/COPSS_Presidents'_Award">Nobel Prize in Statistics</a>. Martin received the award for: “<span style="color: #222222;">For fundamental and groundbreaking contributions to high-dimensional statistics, graphical modeling, machine learning, optimization and algorithms, covering deep and elegant mathematical analysis as well as new methodology with wide-ranging implications for numerous applications.” He kindly agreed to be interviewed by Simply Statistics. </span></em></p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/08/wainwright.jpg"><img class="alignnone wp-image-3232" src="http://simplystatistics.org/wp-content/uploads/2014/08/wainwright.jpg" alt="wainwright" width="250" height="333" srcset="http://simplystatistics.org/wp-content/uploads/2014/08/wainwright-225x300.jpg 225w, http://simplystatistics.org/wp-content/uploads/2014/08/wainwright-768x1024.jpg 768w, http://simplystatistics.org/wp-content/uploads/2014/08/wainwright.jpg 3000w" sizes="(max-width: 250px) 100vw, 250px" /></a></p>
<p><strong>SS: How did you find out you had received the COPSS prize?</strong></p>
<p>It was pretty informal – I received an email in February from Raymond Carroll, who chaired the committee. But it had explicit instructions to keep the information private until the award ceremony in August.</p>
<p><strong>SS: You are in Electrical Engineering & Computer Science (EECS) and Statistics at Berkeley: why that mix of departments?</strong></p>
<p>Just to give a little bit of history, I did my undergraduate degree in math at the University of Waterloo in Canada, and then my Ph.D. in EECS at MIT, before coming to Berkeley to work as a postdoc in Statistics. So when it came time to look at faculty positions, having a joint position between these two departments made a lot of sense. Berkeley has always been at the forefront of having effective joint appointments of the “Statistics plus X” variety, whether X is EECS, Mathematics, Political Science, Computational Biology and so on.</p>
<p>For me personally, the EECS plus Statistics combination is terrific, as a lot of my interests lie at the boundary between these two areas, whether it is investigating tradeoffs between computational and statistical efficiency, connections between information theory and statistics, and so on. I hope that it is also good for my students! In any case, whether they enter in EECS or Statistics, they graduate with a strong background in both statistical theory and methods, as well as optimization, algorithms and so on. I think that this kind of mix is becoming increasingly relevant to the practice of modern statistics, and one can certainly see that Berkeley consistently produces students, whether from my own group or other people at Berkeley, with this kind of hybrid background.</p>
<p><strong>SS: What do you see as the relationship between statistics and machine learning?</strong></p>
<p>This is an interesting question, but tricky to answer, as it can really depend on the person. In my own view, statistics is a very broad and encompassing field, and in this context, machine learning can be viewed as a particular subset of it, one especially focused on algorithmic and computational aspects of statistics. But on the other hand, as things stand, machine learning has rather different cultural roots than statistics, certainly strongly influenced by computer science.</p>
<p>In general, I think that both groups have lessons to learn from each other. For instance, in my opinion, anyone who wants to do serious machine learning needs to have a solid background in statistics. Statisticians have been thinking about data and inferential issues for a very long time now, and these fundamental issues remain just as important now, even though the application domains and data types may be changing. On the other hand, in certain ways, statistics is still a conservative field, perhaps not as quick to move into new application domains, experiment with new methods and so on, as people in machine learning do. So I think that statisticians can benefit from the playful creativity and unorthodox experimentation that one sees in some machine learning work, as well as the algorithmic and programming expertise that is standard in computer science.</p>
<p><strong>SS: What sorts of things is your group working on these days?</strong></p>
<p>I have fairly eclectic interests, so we are working on a range of topics. A number of projects concern the interface between computation and statistics. For instance, we have a recent pre-print (with postdoc Sivaraman Balakrishnan and colleague Bin Yu) that tries to address the gap between statistical and computational guarantees in applications of the expectation-maximization (EM) algorithm for latent variable models. In theory, we know that the global minimizer of the (nonconvex) likelihood has good properties, but in practice, the EM algorithm only returns local optima. How to resolve this gap between existing theory and actual practice? In this paper, we show that under pretty reasonable conditions – that hold for various types of latent variable models – the EM fixed points are as good as the global minima from the statistical perspective. This explains what is observed a lot in practice, namely that when the EM algorithm is given a reasonable initialization, it often returns a very good answer.</p>
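<p><em>For readers who have not seen the algorithm in action, here is a toy EM iteration for a two-component Gaussian mixture with unit variances and equal weights (a much simpler setting than the work described above); with a reasonable starting point the fixed point lands near the true means:</em></p>
<pre><code>set.seed(1)
y  <- c(rnorm(500, -2), rnorm(500, 2))   # simulated mixture data with true means -2 and 2
mu <- c(-0.5, 0.5)                       # a reasonable initialization
for (i in 1:100) {
  r  <- dnorm(y, mu[1]) / (dnorm(y, mu[1]) + dnorm(y, mu[2]))   # E-step: responsibilities for component 1
  mu <- c(sum(r * y) / sum(r), sum((1 - r) * y) / sum(1 - r))   # M-step: weighted means
}
round(mu, 2)   # close to the true means
</code></pre>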
<p>There are lots of other interesting questions at this computation/statistics interface. For instance, a lot of modern data sets (e.g., Netflix) are so large that they cannot be stored on a single machine, but must be split up into separate pieces. Any statistical task must then be carried out in a distributed way, with each processor performing local operations on a subset of the data, and then passing messages to other processors that summarize the results of its local computations. This leads to a lot of fascinating questions. What can be said about the statistical performance of such distributed methods for estimation or inference? How many bits do the machines need to exchange in order for the distributed performance to match that of the centralized “oracle method” that has access to all the data at once? We have addressed some of these questions in a recent line of work (with student Yuchen Zhang, former student John Duchi and colleague Michael Jordan).</p>
<p>So my students and postdocs are keeping me busy, and in addition, I am also busy writing a couple of books, one jointly with Trevor Hastie and Rob Tibshirani at Stanford University on the Lasso and related methods, and a second solo-authored effort, more theoretical in focus, on high-dimensional and non-asymptotic statistics.</p>
<p><strong>SS: What role do you see statistics playing in the relationship between Big Data and Privacy?</strong></p>
<p>Another very topical question: privacy considerations are certainly becoming more and more relevant as the scale and richness of data collection grows. Witness the recent controversies with the NSA, data manipulation on social media sites, etc. I think that statistics should have a lot to say about data and privacy. There has been a long line of statistical work on privacy, dating back at least to Warner’s work on survey sampling in the 1960s, but I anticipate seeing more of it over the next years. Privacy constraints bring a lot of interesting statistical questions – how to design experiments, how to perform inference, how should data be aggregated and what should be released and so on – and I think that statisticians should be at the forefront of this discussion.</p>
<p>In fact, in some joint work with former student John Duchi and colleague Michael Jordan, we have examined some tradeoffs between privacy constraints and statistical utility. We adopt the framework of local differential privacy that has been put forth in the computer science community, and study how statistical utility (in the form of estimation accuracy) varies as a function of the privacy level. Obviously, preserving privacy means obscuring something, so that estimation accuracy goes down, but what is the quantitative form of this tradeoff? An interesting consequence of our analysis is that in certain settings, it identifies optimal mechanisms for preserving a certain level of privacy in data.</p>
<p><strong>What advice would you give young statisticians getting into the discipline right now?</strong></p>
<p>It is certainly an exciting time to be getting into the discipline. For undergraduates thinking of going to graduate school in statistics, I would encourage them to build a strong background in basic mathematics (linear algebra, analysis, probability theory and so on) that are all important for a deep understanding of statistical methods and theory. I would also suggest “getting their hands dirty”, that is doing some applied work involving statistical modeling, data analysis and so on. Even for a person who ultimately wants to do more theoretical work, having some exposure to real-world problems is essential. As part of this, I would suggest acquiring some knowledge of algorithms, optimization, and so on, all of which are essential in dealing with large, real-world data sets.</p>
Crowdsourcing resources for the Johns Hopkins Data Science Specialization
2014-08-15T15:55:37+00:00
http://simplystats.github.io/2014/08/15/crowdsourcing-resources-for-the-johns-hopkins-data-science-specialization
<p style="color: #222222;">
Since we began offering the <a href="https://www.coursera.org/specialization/jhudatascience/1">Johns Hopkins Data Science Specialization</a> we've noticed the unbelievable passion that our students have about our courses and the generosity they show toward each other on the course forums. Many students have created quality content around the subjects we discuss, and many of these materials are so good we feel that they should be shared with all of our students. We also know there are tons of other great organizations creating material (looking at you <a href="http://software-carpentry.org/">Software Carpentry folks</a>).
</p>
<p style="color: #222222;">
We're excited to announce that we've created a site using GitHub Pages: <a style="color: #1155cc;" href="http://datasciencespecialization.github.io/" target="_blank">http://<wbr />datasciencespecialization.<wbr />github.io/</a> to serve as a directory for content that the community has created. If you've created materials relating to any of the courses in the Data Science Specialization please send us a pull request and we will add a link to your content on our site. You can find out more about contributing here: <a style="color: #1155cc;" href="https://github.com/DataScienceSpecialization/DataScienceSpecialization.github.io#contributing" target="_blank">https://github.com/<wbr />DataScienceSpecialization/<wbr />DataScienceSpecialization.<wbr />github.io#contributing</a>
</p>
<p style="color: #222222;">
We can't wait to see what you've created and where the community can take this site!
</p>
<p style="color: #222222;">
<p style="color: #222222;">
</p></p>
swirl and the little data scientist's predicament
2014-08-13T15:41:58+00:00
http://simplystats.github.io/2014/08/13/swirl-and-the-little-data-scientists-predicament
<p style="color: #333333;">
<em>Editor's note: This is a repost of "<a href="http://simplystatistics.org/2012/03/26/r-and-the-little-data-scientists-predicament/">R and the little data scientist's predicament</a>". A brief idea for an update is presented at the end in italics. </em>
</p>
<p style="color: #333333;">
I just read this <a href="http://www.slate.com/articles/technology/technology/2012/03/ruby_ruby_on_rails_and__why_the_disappearance_of_one_of_the_world_s_most_beloved_computer_programmers_.single.html" target="_blank">fascinating post</a> on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was <a href="http://viewsourcecode.org/why/hacking/theLittleCodersPredicament.html" target="_blank">The Little Coder’s Predicament</a>, which, boiled down, essentially says that computer programming languages have grown too complex - so children/newbies can’t get the instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a simple line of code or two they could make the computer <strong>do things</strong> like play some music or make a video.
</p>
<p style="color: #333333;">
I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax.
</p>
<p style="color: #333333;">
Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:
</p>
<ol style="color: #333333;">
<li>
It would have to be easy to load data sets, one line of not complicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load(“education-data”) and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background.
</li>
<li>
It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures.
</li>
<li>
It would be awesome if the functions would include some sort of dynamic graphics (with <a href="http://www.omegahat.org/SVGAnnotation/" target="_blank">svgAnnotation</a> or a wrapper for <a href="http://mbostock.github.com/d3/" target="_blank">D3.js</a>). Again, the syntax would have to be really accessible/not too much to learn.
</li>
</ol>
<p style="color: #333333;">
That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends.
</p>
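<p style="color: #333333;">
To make the idea concrete, here is a minimal sketch of what that two-line experience might look like. Everything here is hypothetical: the function names (load_data, quick_plot), the data registry, and the example URL are made up for illustration and are not part of any existing package.
</p>
<pre><code># Hypothetical sketch of a "little data scientist" package interface.
# load_data() would hide the messy downloading/parsing; quick_plot() would
# pick sensible defaults so a beginner never touches pch/cex/col.

load_data <- function(name) {
  # a curated registry mapping friendly names to URLs (one made-up entry shown)
  urls <- c("education-data" = "https://example.org/education.csv")
  read.csv(urls[[name]])
}

quick_plot <- function(data, x, y) {
  plot(data[[x]], data[[y]], pch = 19, col = "steelblue",
       xlab = x, ylab = y, main = paste(y, "vs", x))
}

# The two lines a newcomer would actually type:
# edu <- load_data("education-data")
# quick_plot(edu, "spending", "test_scores")
</code></pre>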
<p style="color: #333333;">
<em>Update: Now that Nick and co. have created <a href="http://swirlstats.com/">swirl</a>, the technology is absolutely in place to have people do something awesome quickly. You could imagine taking the airplane data and immediately having them make a plot of all the flights using ggplot. Or any number of awesome government data sets and going straight to ggvis. Solving this problem is no longer technically a challenge; it is just a matter of someone coming up with an amazing swirl module that immediately sucks students in. This would be a really awesome project for a grad student or even an undergrad with an interest in teaching. If you do it, you should absolutely send it our way and we'll advertise the heck out of it!</em>
</p>
The Leek group guide to giving talks
2014-08-12T14:19:53+00:00
http://simplystats.github.io/2014/08/12/the-leek-group-guide-to-giving-talks
<p>I wrote a little guide to giving talks that goes along with my <a href="https://github.com/jtleek/datasharing">data sharing</a>, <a href="https://github.com/jtleek/rpackages">R packages</a>, and <a href="https://github.com/jtleek/reviews">reviewing</a> guides. I posted it to Github and would be really happy to take any feedback/pull requests that folks might have. If you send a pull request please be sure to add yourself to the contributor list.</p>
<ul>
<li><a href="https://github.com/jtleek/talkguide/blob/master/README.md">Leek group guide to giving talks</a></li>
</ul>
Stop saying "Scientists discover..." instead say, "Prof. Doe's team discovers..."
2014-08-11T12:54:53+00:00
http://simplystats.github.io/2014/08/11/stop-saying-scientists-discover-instead-say-prof-does-team-discovers
<p>I was just reading an <a href="http://online.wsj.com/articles/academic-researchers-find-lucrative-work-as-big-data-scientists-1407543088">article about data science in the WSJ</a>. They were talking about how data scientists with just 2 years experience can earn a whole boatload of money*. I noticed a description that seemed very familiar:</p>
<blockquote>
<p>At e-commerce site operator Etsy Inc., for instance, a biostatistics Ph.D. who spent years mining medical records for early signs of breast cancer now writes statistical models to figure out the terms people use when they search Etsy for a new fashion they saw on the street.</p>
</blockquote>
<p>This perfectly describes the resume of a student who worked with me here at Hopkins and is now tearing it up in industry. But it made me a little bit angry that they didn’t publicize her name. Now she may have requested her name not be used, but I think it is more likely that it is a case of the standard, “Scientists discover…” (see e.g. <a href="http://news.yahoo.com/scientists-discover-why-thrive-less-sleep-others-163712171.html">this article</a> or <a href="http://www.science20.com/writer_on_the_edge/blog/scientists_discover_that_atheists_might_not_exist_and_thats_not_a_joke-139982">this one</a> or <a href="http://www.huffingtonpost.com/2014/08/10/hummingbird-helicopters-technology-video_n_5659013.html?utm_hp_ref=science">this one</a>).</p>
<p>There is always a lot of discussion about how to push people to get into STEM fields, including a ton of misguided attempts that waste time and money. But here is one way that would cost basically nothing and dramatically raise the profile of scientists in the eyes of the public: <strong>use their names when you describe their discoveries</strong>.</p>
<p>The value of this simple change could be huge. In an era of selfies, reality TV, and the power of social media, emphasizing the value that individual scientists bring could have a huge impact on STEM recruiting. That paragraph above is a lot more inspiring to potential young data scientists when rewritten:</p>
<blockquote>
<p><span style="font-style: italic;">At e-commerce site operator Etsy Inc., for instance, Dr Hilary Parker, a biostatistics Ph.D. who spent years mining medical records for early signs of breast cancer now writes statistical models to figure out the terms people use when they search Etsy for a new fashion they saw on the street.</span></p>
</blockquote>
<p>* <em>Incidentally, I think it is a bit overhyped. I have rarely heard of anyone making $200k-$300k with that little experience, but maybe I’m wrong? I’d be interested to hear if people really were making that kind of $$ at that stage in their careers. </em></p>
It's like Tinder, but for peer review.
2014-08-07T16:29:47+00:00
http://simplystats.github.io/2014/08/07/its-like-tinder-but-for-peer-review
<p>I have an idea for an app. You input the title and authors of a preprint (maybe even the abstract). The app shows the title/authors/abstract to people who work in a similar area to you. You could estimate this based on papers they have published that have similar key words to start.</p>
<p>Then you swipe left if you think the paper is interesting and right if you think it isn’t. We could then aggregate the data on how many “likes” a paper gets as a measure of how “interesting” it is. I wonder if this would be a better measure of later citations/interestingness than the opinion of a small number of editors and referees.</p>
<p>This is obviously taking my proposal of a <a href="http://simplystatistics.org/2012/03/14/a-proposal-for-a-really-fast-statistics-journal/">fast statistics journal</a> to the extreme and would provide no measure of how scientifically sound the paper was. But in an age when scientific soundness is only one part of the equation for top journals, a measure of interestingness that was available before review could be of huge value to journals.</p>
<p>If done properly, it would encourage people to publish preprints. If you posted a preprint and it was immediately “interesting” to many scientists, you could use that to convince editors to get past that stage and consider your science. More things like this could happen:</p>
<blockquote class="twitter-tweet" width="550">
<p>
Is this the future? "We saw with interest your preprint on <a href="https://twitter.com/biorxivpreprint">@biorxivpreprint</a>. We encourage you to submit it to [well-established journal]."
</p>
<p>
— Leonid Kruglyak (@leonidkruglyak) <a href="https://twitter.com/leonidkruglyak/status/466254954261254144">May 13, 2014</a>
</p>
</blockquote>
<p>So anyone want to build it?</p>
If you like A/B testing here are some other Biostatistics ideas you may like
2014-08-06T10:35:59+00:00
http://simplystats.github.io/2014/08/06/if-you-like-ab-testing-here-are-some-other-biostatistics-ideas-you-may-like
<p>Web companies are using A/B testing and experimentation regularly now to determine which features to push for advertising or improving user experience. A/B testing is a form of <a href="http://en.wikipedia.org/wiki/Randomized_controlled_trial">randomized controlled trial</a> that was originally employed in psychology but first adopted on a massive scale in Biostatistics. Since then a large amount of work on trials and trial design has been performed in the Biostatistics community. Some of these ideas may be useful in the same context within web companies; probably a lot of them are already being used and I just haven’t seen published examples. Here are some examples:</p>
<ol>
<li><a href="http://en.wikipedia.org/wiki/Sequential_analysis">Sequential study designs</a>. Here the sample size isn’t fixed in advance (an issue that I imagine is pretty hard to do with web experiments) but as the experiment goes on, the data are evaluated and a stopping rule that controls appropriate error rates is used. Here are a couple of good (if a bit dated) review on sequential designs<a href="http://smm.sagepub.com/content/9/5/497.full.pdf"> [1]</a> <a href="http://www.ncbi.nlm.nih.gov/pubmed/18663761">[2]</a>.</li>
<li><a href="http://en.wikipedia.org/wiki/Randomized_controlled_trial#Adaptive">Adaptive study designs</a>. These are study designs that use covariates or responses to adapt the treatment assignments of people over time. With careful design and analysis choices, you can still control the relevant error rates. Here are a couple of reviews on adaptive trial designs <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2422839/">[1]</a> <a href="http://www.trialsjournal.com/content/13/1/145">[2]</a></li>
<li><a href="http://en.wikipedia.org/wiki/Randomized_controlled_trial#By_hypothesis_.28superiority_vs._noninferiority_vs._equivalence.29">Noninferiority trials</a> These are trials designed to show that one treatment is at least as good as the standard of care. They are often implemented when a good placebo group is not available, often for ethical reasons. In light of the <a href="http://simplystatistics.org/2014/08/05/do-we-need-institutional-review-boards-for-human-subjects-research-conducted-by-big-web-companies/">ethical concerns for human subjects research at tech companies</a> this could be a useful trial design. Here is a systematic review for noninferiority trials <a href="http://www.ncbi.nlm.nih.gov/pubmed/22317762">[1]</a></li>
</ol>
<p>It is also probably useful to read about <a href="http://en.wikipedia.org/wiki/Proportional_hazards_model">proportional hazards models</a> and <a href="http://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.20.1.145">time varying coefficients</a>. Obviously these are just a few ideas that might be useful, but talking to a Biostatistician who works on clinical trials (not me!) would be a great way to get more information.</p>
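<p>As a rough illustration of the sequential idea, here is a minimal simulation sketch in R. The conversion rates, users per look, and stopping threshold are all made-up assumptions; a real design would derive the boundary from a formal alpha-spending or Pocock/O'Brien-Fleming calculation rather than the approximate constant used below.</p>
<pre><code>set.seed(1)

p_a <- 0.10; p_b <- 0.12   # assumed true conversion rates for arms A and B
n_per_look <- 2000          # new users per arm between interim looks
max_looks <- 5
z_bound <- 2.41             # roughly a Pocock-type bound for 5 equally spaced looks

x_a <- x_b <- n <- 0
for (look in 1:max_looks) {
  n   <- n + n_per_look
  x_a <- x_a + rbinom(1, n_per_look, p_a)   # cumulative conversions, arm A
  x_b <- x_b + rbinom(1, n_per_look, p_b)   # cumulative conversions, arm B
  pooled <- (x_a + x_b) / (2 * n)
  z <- (x_b / n - x_a / n) / sqrt(pooled * (1 - pooled) * 2 / n)
  cat(sprintf("look %d: z = %.2f\n", look, z))
  if (abs(z) > z_bound) { cat("stopping early at look", look, "\n"); break }
}
</code></pre>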
Do we need institutional review boards for human subjects research conducted by big web companies?
2014-08-05T12:34:28+00:00
http://simplystats.github.io/2014/08/05/do-we-need-institutional-review-boards-for-human-subjects-research-conducted-by-big-web-companies
<p>Web companies have been doing human subjects research for a while now. Companies like Facebook and Google have employed statisticians for almost a decade (or more) and part of the culture they have introduced is the idea of randomized experiments to identify ideas that work and that don’t. They have figured out that experimentation and statistical analysis often beat out the opinion of the highest paid person at the company for identifying features that “work”. Here “work” may mean features that cause people to <a href="https://www.youtube.com/watch?v=E_F5GxCwizc">read advertising</a>, or <a href="http://www.cnet.com/news/google-starts-placing-ads-directly-in-gmail-inboxes/">click on ads</a>, or <a href="http://blog.okcupid.com/index.php/we-experiment-on-human-beings/">match up with more people</a>.</p>
<p>This has created a huge amount of value and definitely a big interest in the statistical community. For example, today’s session on “Statistics: The Secret Weapon of Successful Web Giants” was standing room only.</p>
<blockquote class="twitter-tweet" width="550">
<p>
Can't get into the session I wanted to attend, "Statistics: The Secret Weapon of Successful Web Giants" <img src="http://simplystatistics.org/wp-includes/images/smilies/frownie.png" alt=":(" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <a href="https://twitter.com/hashtag/JSM2014?src=hash">#JSM2014</a> <a href="http://t.co/y6KNnPfDe2">pic.twitter.com/y6KNnPfDe2</a>
</p>
<p>
— Hilary Parker (@hspter) <a href="https://twitter.com/hspter/status/496671469473370114">August 5, 2014</a>
</p>
</blockquote>
<p>But at the same time, these experiments have raised some issues. Recently scientists from Cornell and Facebook <a href="http://www.pnas.org/content/111/24/8788.full">published a study</a> where they experimented with the news feeds of users. This turned into a PR problem for Facebook and Cornell because people were pretty upset they were being experimented on and weren’t being told about it. This has led defenders of the study to say: (a) Facebook is doing the experiments anyway, they just published it this time, (b) in this case very little harm was done, (c) most experiments done by Facebook are designed to increase profitability, at least this experiment had a more public good focused approach, and (d) there was a small effect size so what’s the big deal?</p>
<p>OK Cupid then published a very timely blog post with the title, “<a href="http://blog.okcupid.com/index.php/we-experiment-on-human-beings/">We experiment on human beings!</a>”, probably at least in part to take advantage of the press around the Facebook experiment. This post was received with less vitriol than the Facebook study, but really drove home the point that <strong>large web companies perform as much human subjects research as most universities and with little or no oversight. </strong></p>
<p>The same situation was the way academic research used to work. Scientists used their common sense and their scientific sense to decide on what experiments to run. Most of the time this worked fine, but then things like the <a href="http://en.wikipedia.org/wiki/Tuskegee_Syphilis_Study">Tuskegee Syphilis Study</a> happened. These really unethical experiments led to the <a href="http://en.wikipedia.org/wiki/National_Research_Act">National Research Act of 1974</a>, which codified rules about <a href="http://en.wikipedia.org/wiki/Institutional_review_board">institutional review boards (IRBs)</a> to oversee research conducted on human subjects and guarantee their protection. IRBs are designed to consider the ethical issues involved with performing research on humans, balancing protection of rights with advancing science.</p>
<p>Facebook, OK Cupid, and other companies are not subject to IRB approval. Yet they are performing more and more human subjects experiments. Obviously the studies described in the Facebook paper and the OK Cupid post pale in comparison to the Tuskegee study. I also know scientists at these companies and know they are ethical and really trying to do the right thing. But it raises interesting questions about oversight. Given the emotional, professional, and economic value that these websites control for individuals around the globe, it may be time to consider the equivalent of “institutional review boards” for human subjects research conducted by companies.</p>
<p>Companies who test drugs on humans such as Merck are subject to careful oversight and regulation to prevent potential harm to patients during the discovery process. This is obviously not the optimal solution for speed - understandably a major advantage and goal of tech companies. But there are issues that deserve serious consideration. For example, I think it is nowhere near sufficient to claim that by signing the terms of service people have given informed consent to be part of an experiment. That being said, they could just stop using Facebook if they don’t like that they are being experimented on.</p>
<p>Our reliance on these tools for all aspects of our lives means that it isn’t easy to just tell people, “Well if you don’t like being experimented on, don’t use that tool.” You would have to give up at minimum Google, Gmail, Facebook, Twitter, and Instagram to avoid being experimented on. But you’d also have to give up using smaller sites like OK Cupid, because almost all web companies are recognizing the <a href="http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/">importance of statistics</a>. One good place to start might be in considering <a href="http://biorxiv.org/content/biorxiv/early/2014/06/25/006601.full.pdf">new and flexible forms of consent</a> that make it possible to opt in and out of studies in an informed way, but with enough speed and flexibility not to slow down innovation at tech companies.</p>
Introducing people to R: 14 years and counting
2014-07-29T17:17:09+00:00
http://simplystats.github.io/2014/07/29/introducing-people-to-r-14-years-and-counting
<p>I’ve been introducing people to R for quite a long time now and I’ve been doing some reflecting today on how that process has changed quite a bit over time. I first started using R around 1998–1999. I think I first started talking about R informally to my fellow classmates (and some faculty) back when I was in graduate school at UCLA. There, the department was officially using <a href="http://www.stat.uiowa.edu/~luke/xls/xlsinfo/xlsinfo.html">Lisp-Stat</a> (which I loved) and only later <a href="http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAA&url=http%3A%2F%2Fwww.jstatsoft.org%2Fv13%2Fi07%2Fpaper&ei=uADYU_rQMrLfsATQ44L4Bw&usg=AFQjCNFymozQrHQaHHRdLkF3QGIqamvCSQ&sig2=Twlq_KFfvTekZgJSr9SS1g&bvm=bv.71778758,d.cWc">converted its courses over to R</a>. Through various brown-bag lunches and seminars I would talk about R, and the main selling point at the time was “It’s just like S-PLUS but it’s free!” As it turns out, S-PLUS was basically abandoned by academics and its ownership changed hands a number of times over the years (it is currently owned by TIBCO). I still talk about S-PLUS when <a href="http://youtu.be/kzxHxFHW6hs">I talk about the history of R</a> but I’m not sure many people nowadays actually have any memories of the product.</p>
<p>When I got to Johns Hopkins in 2003 there wasn’t really much of a modern statistical computing class, so <a href="http://kbroman.org">Karl Broman</a>, <a href="http://rafalab.dfci.harvard.edu">Rafa Irizarry</a>, <a href="http://www.bcaffo.com">Brian Caffo</a>, <a href="http://www.biostat.jhsph.edu/~iruczins/">Ingo Ruczinski</a>, and I got together and started what we called the “KRRIB” class, which was basically a weekly seminar where one of us talked about a computing topic of interest. I gave some of the R lectures in that class and when I asked people who had heard of R before, almost no one raised their hand. And no one had actually used it before. My approach was pretty much the same at the time, although I left out the part about S-PLUS because no one had used that either. A lot of people had experience with SAS or Stata or SPSS. A number of people had used something like Java or C/C++ before and so I often used that as a reference frame. No one had ever used a functional-style programming language like Scheme or Lisp.</p>
<p>Over time, the population of students (mostly first-year graduate students) slowly shifted to the point where many of them had been introduced to R while they were undergraduates. This trend mirrored the overall trend with statistics where we are seeing more and more students do undergraduate majors in statistics (as opposed to, say, mathematics). Eventually, by 2008–2009, when I’d ask how many people had heard of or used R before, everyone raised their hand. However, even at that late date, I still felt the need to convince people that R was a “real” language that could be used for real tasks.</p>
<p>R has grown a lot in recent years, and is being used in so many places now, that I think it’s essentially impossible for a person to keep track of everything that is going on. That’s fine, but it makes “introducing” people to R an interesting experience. Nowadays in class, students are often teaching me something new about R that I’ve never seen or heard of before (they are quite good at Googling around for themselves). I feel no need to “bring people over” to R. In fact it’s quite the opposite–people might start asking questions if I <em>weren’t</em> teaching R.</p>
<p>Even though my approach to introducing R has evolved over time, with the topics that I emphasize or de-emphasize changing, I’ve found there are a few topics that I always stress to people who are generally newcomers to R. For whatever reason, these topics are always new or at least a little unfamiliar.</p>
<ul>
<li><strong>R is a functional-style language</strong>. Back when most people primarily saw something like C as a first programming language, it made sense to me that the functional style of programming would seem strange. I came to R from Lisp-Stat so the functional aspect was pretty natural for me. But many people seem to get tripped up over the idea of passing a function as an argument or not being able to modify the state of an object in place. Also, it sometimes takes people a while to get used to doing things like lapply() and map-reduce types of operations. Everyone still wants to write a for loop! (See the short sketch after this list.)</li>
<li><strong>R is both an interactive system and a programming language</strong>. Yes, it’s a floor wax and a dessert topping–get used to it. Most people seem to expect one or the other. SAS users are wondering why you need to write 10 lines of code to do what SAS can do in one massive PROC statement. C programmers are wondering why you don’t write more for loops. C++ programmers are confused by the weird system for object orientation. In summary, no one is ever happy.</li>
<li><strong>Visualization/plotting capabilities are state-of-the-art</strong>. One of the big selling points back in the “old days” was that from the very beginning R’s plotting and graphics capabilities were far more elegant than the ASCII-art that was being produced by other statistical packages (true for S-PLUS too). I find it a bit strange that this point has largely remained true. While other statistical packages have definitely improved their output (and R certainly has some areas where it is perhaps deficient), R still holds its own quite handily against those other packages. If the community can continue to produce things like ggplot2 and rgl, I think R will remain at the forefront of data visualization.</li>
</ul>
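<p>For readers who have not hit this before, here is the tiny sketch referenced in the list above: the same computation written as the for loop newcomers reach for and in the functional style, passing a function as an argument to lapply(). The function and values are made up purely for illustration.</p>
<pre><code>square <- function(x) x^2

# the for-loop version most newcomers write first
out <- numeric(5)
for (i in 1:5) out[i] <- square(i)

# the functional version: pass the function itself to lapply()
out2 <- unlist(lapply(1:5, square))

identical(out, out2)  # TRUE
</code></pre>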
<p>I’m looking forward to teaching R to people as long as people will let me, and I’m interested to see how the next generation of students will approach it (and how my approach to them will change). Overall, it’s been just an amazing experience to see the widespread adoption of R over the past decade. I’m sure the next decade will be just as amazing.</p>
Academic statisticians: there is no shame in developing statistical solutions that solve just one problem
2014-07-25T11:27:39+00:00
http://simplystats.github.io/2014/07/25/academic-statisticians-there-is-no-shame-in-developing-statistical-solutions-that-solve-just-one-problem
<p dir="ltr">
I think that the main distinction between academic statisticians and those calling themselves data scientists is that the latter are very much willing to invest most of their time and energy into solving specific problems by analyzing specific data sets. In contrast, most academic statisticians strive to develop methods that can be very generally applied across problems and data types. There is a reason for this of course: historically statisticians have had enormous influence by developing general theory/methods/concepts such as the p-value, maximum likelihood estimation, and linear regression. However, these types of success stories are becoming more and more rare while data scientists are becoming increasingly influential in their respective areas of applications by solving important context-specific problems. The success of Moneyball and the prediction of election results are two recent widely publicized examples.
</p>
<p dir="ltr">
A survey of papers published in our flagship journals makes it quite clear that context-agnostic methodology is valued much more than detailed descriptions of successful solutions to specific problems. These applied papers tend to get published in subject matter journals and do not usually receive the same weight in appointments and promotions. This culture has therefore kept most statisticians holding academic positions away from collaborations that require substantial time and energy investments in understanding and attacking the specifics of the problem at hand. Below I argue that to remain relevant as a discipline we need a cultural shift.
</p>
<p dir="ltr">
It is of course understandable that, to remain a discipline, academic statisticians can’t devote all our effort to solving specific problems and none to trying to generalize these solutions. It is the development of these abstractions that defines us as an academic discipline and not just a profession. However, if our involvement with real problems is too superficial, we run the risk of developing methods that solve no problem at all, which will eventually render us obsolete. We need to accept that as data and problems become more complex, more time will have to be devoted to understanding the gory details.
</p>
<p>But what should the balance be?</p>
<p dir="ltr">
Note that many of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none. By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.
</p>
Jan de Leeuw owns the Internet
2014-07-16T11:22:51+00:00
http://simplystats.github.io/2014/07/16/jan-de-leeuw-owns-the-internet
<p>One of the best things to happen on the Internet recently is that <a href="http://gifi.stat.ucla.edu">Jan de Leeuw</a> has decided to own the Twitter/Facebook universe. If you do not already, you should be <a href="https://twitter.com/deleeuw_jan">following him</a>. Among his many accomplishments, he founded the Department of Statistics at UCLA (my <em>alma mater</em>), which is currently thriving. On the occasion of the Department’s 10th birthday, there was a small celebration, and I recall Don Ylvisaker mentioning that the reason they invited Jan to UCLA way back when was because he “knew everyone and knew everything”. Pretty accurate description, in my opinion.</p>
<p>Jan’s been tweeting quite a bit of late, but recently had this gem:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
As long as statistics continues to emphasize assumptions, models, and inference it will remain a minor subfield of data science.
</p>
<p>
— Jan de Leeuw (@deleeuw_jan) <a href="https://twitter.com/deleeuw_jan/statuses/488835963297087488">July 15, 2014</a>
</p>
</blockquote>
<p>followed by</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/j8feng">@j8feng</a> <a href="https://twitter.com/rdpeng">@rdpeng</a> Statistics is the applied science that constructs and studies techniques for data analysis.
</p>
<p>
— Jan de Leeuw (@deleeuw_jan) <a href="https://twitter.com/deleeuw_jan/statuses/488889040771354625">July 15, 2014</a>
</p>
</blockquote>
<p>I’m not sure what Jan’s thinking behind the first tweet was, but I think many in statistics would consider it a “good thing” to be a minor subfield of data science. Why get involved in that messy thing called data science where people are going wild with data in an unprincipled manner?</p>
<p>This is a situation where I think there is a large disconnect between what “should be” and what “is reality”. What should be is that statistics should include the field of data science. Honestly, that would be beneficial to the field of statistics and would allow us to provide a home to many people who don’t necessarily have one (primarily, people working on the border between two fields). Nate Silver made reference to this in his keynote address to the Joint Statistical Meetings last year when he said data science was just a fancy term for statistics.</p>
<p>The reality though is the opposite. Statistics has chosen to limit itself to a few areas, such as inference, as Jan mentions, and to willfully ignore other important aspects of data science as “not statistics”. This is unfortunate, I think, because unlike many in the field of statistics, I believe data science is here to stay. The reason is that statistics has decided not to fill the spaces that have been created by the increasing complexity of modern data analysis. The needs of modern data analyses (reproducibility, computing on large datasets, data preprocessing/cleaning) didn’t fall into the usual statistics curriculum, and so they were ignored. In my view, data science is about stringing together many different tools for many different purposes into an analytic whole. Traditional statistical modeling is a part of this (often a small part), but statistical thinking plays a role in all of it.</p>
<p>Statisticians should take on the challenge of data science and own it. We may not be successful in doing so, but we certainly won’t be if we don’t try.</p>
Piketty in R markdown - we need some help from the crowd
2014-06-30T09:45:02+00:00
http://simplystats.github.io/2014/06/30/piketty-in-r-markdown-we-need-some-help-from-the-crowd
<p>Thomas Piketty’s book <a href="http://www.amazon.com/Capital-Twenty-First-Century-Thomas-Piketty/dp/067443000X">Capital in the 21st Century</a> was a surprise best seller and the subject of intense scrutiny. A few weeks ago the <a href="http://www.ft.com/cms/s/2/e1f343ca-e281-11e3-89fd-00144feabdc0.html#axzz33PSo6ySt">Financial Times claimed</a> that the analysis was riddled with errors, leading to a firestorm of discussion. A few days ago the <a href="http://blogs.lse.ac.uk/impactofsocialsciences/2014/05/22/thomas-piketty-data-make-it-open/">London School of Economics posted</a> a similar call to make the data open and machine readable, saying:</p>
<blockquote>
<p>None of this data is explicitly open for everyone to reuse, clearly licenced and in machine-readable formats.</p>
</blockquote>
<p>A few friends of Simply Stats had started on a project to translate his work from the Excel files where the <a href="http://piketty.pse.ens.fr/en/capital21c2">original analysis resides</a> into R. The people that helped were <a href="http://alyssafrazee.com/">Alyssa Frazee</a>, <a href="http://aaronjfisher.wordpress.com/">Aaron Fisher</a>, <a href="http://www.biostat.jhsph.edu/~bswihart/">Bruce Swihart</a>, <a href="http://www.biostat.jhsph.edu/people/postdocs/nellore.shtml">Abhinav Nellore</a>, <a href="http://www.cbcb.umd.edu/~hcorrada/">Hector Corrada Bravo</a>, <a href="http://biostat.jhsph.edu/~jmuschel/">John Muschelli</a>, and me. We haven’t finished translating all chapters, so we are asking anyone who is interested to help contribute to translating the book’s technical appendices into R markdown documents. If you are interested, please send pull requests to the <a href="https://github.com/jtleek/capitalIn21stCenturyinR/tree/gh-pages">gh-pages branch of this Github repo</a>.</p>
<p>As a way to entice you to participate, here is one interesting thing we found. We don’t know enough economics to know if what we are finding is “right” or not, but one thing I found is that the x-axes in the Excel files are really distorted. For example, here is Figure 1.1 from the Excel files, where the ticks on the x-axis are separated by 20, 50, 43, 37, 20, 20, and 22 years.</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/06/30/piketty-in-r-markdown-we-need-some-help-from-the-crowd/fig11/" rel="attachment wp-att-3189"><img class=" wp-image-3189 aligncenter" alt="fig11" src="http://simplystatistics.org/wp-content/uploads/2014/06/fig11.png" width="503" height="346" srcset="http://simplystatistics.org/wp-content/uploads/2014/06/fig11-300x206.png 300w, http://simplystatistics.org/wp-content/uploads/2014/06/fig11-1024x704.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/06/fig11.png 1396w" sizes="(max-width: 503px) 100vw, 503px" /></a>
</p>
<p> </p>
<p>Here is the same plot with an equally spaced x-axis.</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/06/30/piketty-in-r-markdown-we-need-some-help-from-the-crowd/f11-us/" rel="attachment wp-att-3190"><img class=" wp-image-3190 aligncenter" alt="f11-us" src="http://simplystatistics.org/wp-content/uploads/2014/06/f11-us.png" width="450" height="393" srcset="http://simplystatistics.org/wp-content/uploads/2014/06/f11-us-300x262.png 300w, http://simplystatistics.org/wp-content/uploads/2014/06/f11-us.png 576w" sizes="(max-width: 450px) 100vw, 450px" /></a>
</p>
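<p>If you want to see how the two versions of the axis arise, here is a small sketch in R with made-up numbers (not Piketty's actual series). Treating the year as a factor reproduces the Excel-style axis, where every observation gets one equally spaced tick regardless of how many years separate the observations; treating it as numeric spaces the ticks by elapsed time.</p>
<pre><code>library(ggplot2)

# made-up values at unevenly spaced years, for illustration only
d <- data.frame(
  year  = c(1700, 1820, 1913, 1950, 1970, 1990, 2010),
  value = c(0.28, 0.30, 0.40, 0.35, 0.33, 0.36, 0.38)
)

# factor x-axis: one tick per observation, so a 120-year gap looks like a 20-year gap
ggplot(d, aes(factor(year), value, group = 1)) + geom_line() + geom_point()

# numeric x-axis: ticks spaced by actual elapsed time
ggplot(d, aes(year, value)) + geom_line() + geom_point()
</code></pre>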
<p style="text-align: center;">
<p style="text-align: left;">
I'm not sure if it makes any difference but it is interesting. It sounds like on measure, the Piketty analysis <a href="http://simplystatistics.org/2014/06/03/post-piketty-lessons/">was mostly reproducible and reasonable</a>. But having the data available in a more readily analyzable format will allow for more concrete discussion based on the data. So consider c<a href="https://github.com/jtleek/capitalIn21stCenturyinR/tree/gh-pages">ontributing to our github repo</a>.
</p>
</p>
Privacy as a function of sample size
2014-06-25T14:41:09+00:00
http://simplystats.github.io/2014/06/25/privacy-as-a-function-of-sample-size
<p>The U.S. Supreme Court just made a unanimous ruling in <a href="http://www.docstoc.com/docs/document-preview.aspx?doc_id=171429294">Riley v. California</a> making it clear that police officers must get a warrant before searching through the contents of a cell phone obtained incident to an arrest. The message was put pretty clearly in the decision:</p>
<blockquote>
<p> Our answer to the question of what police must do before searching a cell phone seized incident to an arrest is accordingly simple — get a warrant.</p>
</blockquote>
<p>But I was more fascinated by this quote:</p>
<blockquote>
<p>The sum of an individual’s private life can be reconstructed through a thousand photographs labeled with dates, locations, and descriptions; the same cannot be said of a photograph or two of loved ones tucked into a wallet.</p>
</blockquote>
<p>So n = 2 is not enough to recreate a private life, but n = 2,000 (with associated annotation) is enough. I wonder what the minimum sample size needed is to officially violate someone’s privacy. I’d be curious get <a href="http://mathbabe.org/">Cathy O’Neil</a>’s opinion on that question, she seems to have thought very hard about the relationship between data and privacy.</p>
<p>This is another case where I think that, to some extent, the Supreme Court made a decision <a href="http://simplystatistics.org/2011/12/12/the-supreme-courts-interpretation-of-statistical/">on the basis of a statistical concept</a>. Last time it was correlation, this time it is inference. As I read the opinion, part of the argument hinged on how much information do you get by searching a cell phone versus a wallet? Importantly, how much can you infer from those two sets of data?</p>
<p>If any of the Supremes want a primer in statistics, I’m available.</p>
New book on implementing reproducible research
2014-06-24T13:24:34+00:00
http://simplystats.github.io/2014/06/24/new-book-on-implementing-reproducible-research
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/06/9781466561595.jpg"><img class="alignright" alt="9781466561595" src="http://simplystatistics.org/wp-content/uploads/2014/06/9781466561595.jpg" width="180" height="281" /></a>I have mentioned this in a few places but my book edited with Victoria Stodden and Fritz Leisch, <em><a href="http://www.crcpress.com/product/isbn/9781466561595">Implementing Reproducible Research</a></em>, has just been published by CRC Press. Although it is technically in their “R Series”, the chapters contain information on a wide variety of useful tools, not just R-related tools. <a href="http://simplystatistics.org/wp-content/uploads/2014/06/9781466561595.jpg">[<img class="alignright" alt="9781466561595" src="http://simplystatistics.org/wp-content/uploads/2014/06/9781466561595.jpg" width="180" height="281" />](http://simplystatistics.org/wp-content/uploads/2014/06/9781466561595.jpg)I have mentioned this in a few places but my book edited with Victoria Stodden and Fritz Leisch, <em>[Implementing Reproducible Research](http://www.crcpress.com/product/isbn/9781466561595)</em>, has just been published by CRC Press. Although it is technically in their “R Series”, the chapters contain information on a wide variety of useful tools, not just R-related tools. </a></p>
<p>There is also a <a href="http://www.implementingRR.org">supplementary web site</a> hosted through Open Science Framework that contains a lot of additional information, including the list of chapters.</p>
The difference between data hype and data hope
2014-06-23T13:14:12+00:00
http://simplystats.github.io/2014/06/23/the-difference-between-data-hype-and-data-hope
<p>I was reading one of my favorite stats blogs, <a href="http://www.statschat.org.nz/">StatsChat</a>, where Thomas points <a href="http://www.theatlantic.com/technology/archive/2014/05/virtual-clinical-trials-doctors-could-use-algorithms-instead-of-people-to-test-new-drugs/371902/?mkt_tok=3RkMMJWWfF9wsRovuaTLZKXonjHpfsX86O8oW6Sg38431UFwdcjKPmjr1YIBSMFrI%2BSLDwEYGJlv6SgFSrnAMaxlzLgNXRk%3D">to this article</a> in the Atlantic and highlights this quote:</p>
<blockquote>
<p>Dassault Systèmes is focusing on that level of granularity now, trying to simulate propagation of cholesterol in human cells and building oncological cell models. “It’s data science and modeling,” Charlès told me. “Coupling the two creates a new environment in medicine.”</p>
</blockquote>
<p>I think that is a perfect example of data hype. This is a cool idea and if it worked would be completely revolutionary. But the reality is we are not even close to this. In very simple model organisms we can predict <a href="http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2724.html">very high level phenotypes some of the time</a> with whole cell modeling. We aren’t anywhere near the resolution we’d need to model the behavior of human cells, let alone the complex genetic, epigenetic, genomic, and environmental components that likely contribute to complex diseases. It is awesome that people are thinking about the future, and the fastest way to the scientific future is usually through science fiction, but this is way overstating the power of current or even currently achievable data science.</p>
<p>So does that mean data science for improving clinical trials right now should be abandoned?</p>
<p>No.</p>
<p>There is tons of currently applicable and real world data science being done in <a href="http://en.wikipedia.org/wiki/Sequential_analysis">sequential analysis</a>, <a href="http://en.wikipedia.org/wiki/Adaptive_clinical_trial">adaptive clinical trials</a>, and <a href="http://en.wikipedia.org/wiki/Dynamic_treatment_regime">dynamic treatment regimes</a>. These are important contributions that are impacting clinical trials <em>right now</em>, where advances can reduce costs, prevent harm to patients, and speed the implementation of clinical trials. I think that is the hope of data science - using statistics and data to make steady, realizable improvement in the way we treat patients.</p>
Heads up if you are going to submit to the Journal of the National Cancer Institute
2014-06-18T12:08:53+00:00
http://simplystats.github.io/2014/06/18/heads-up-if-you-are-going-to-submit-to-the-journal-of-the-national-cancer-institute
<p><strong>Update (6/19/14):</strong> <em>The folks at JNCI and OUP have kindly confirmed that they will consider manuscripts that have been posted to preprint servers. </em></p>
<p>I just got this email about a paper we submitted to JNCI</p>
<blockquote>
<p>Dear Dr. Leek:</p>
<p>I am sorry that we will not be able to use the above-titled manuscript. Unfortunately, the paper was published online on a site called bioRXiv, The Preprint Server for Biology, hosted by Cold Spring Harbor Lab. JNCI does not publish previously published work.</p>
<p>Thank you for your submission to the Journal.</p>
</blockquote>
<p>I have to say I’m not totally surprised, but I am a little disappointed, the future of academic publishing <a href="http://simplystatistics.org/2014/06/16/the-future-of-academic-publishing-is-here-it-just-isnt-evenly-distributed/">is definitely not evenly distributed</a>.</p>
The future of academic publishing is here, it just isn't evenly distributed
2014-06-16T10:10:34+00:00
http://simplystats.github.io/2014/06/16/the-future-of-academic-publishing-is-here-it-just-isnt-evenly-distributed
<p>Academic publishing has always been a slow process. Typically you would submit a paper for publication and then wait a few months to more than a year (statistics journals can be slow!) for a review. Then you’d revise the paper in a process that would take another couple of months, resubmit it and potentially wait another few months while this second set of reviews came back.</p>
<p>Lately statistics and statistical genomics have been doing more of what math does and posting papers to the <a href="http://arxiv.org/">arXiv</a> or to <a href="http://biorxiv.org/">bioRxiv</a>. I don’t know if it is just me, but using this process has led to a massive speedup in the rate that my academic work gets used/disseminated. Here are a few examples of how crazy it is out there right now.</p>
<p>I <a href="https://github.com/jtleek/talkguide">started a post</a> on giving talks on Github. It was tweeted before I even finished!</p>
<blockquote class="twitter-tweet" lang="en">
<p>
(not a joke) If <a href="https://twitter.com/jtleek">@jtleek</a>'s new guide turns out like any of the others in the series, it will be one to bookmark <a href="https://t.co/WGKjn6MINH">https://t.co/WGKjn6MINH</a>
</p>
<p>
— Stephen Turner (@genetics_blog) <a href="https://twitter.com/genetics_blog/statuses/450980566369067008">April 1, 2014</a>
</p>
</blockquote>
<p>I really appreciate the compliment, especially coming from someone whose posts I read all the time, but it was wild to me that I hadn’t even finished the post yet (still haven’t) and it was already public.</p>
<p>Another example is that we have posted several papers on biorxiv and they all get tweeted/read. When we posted the <a href="http://biorxiv.org/content/early/2014/03/30/003665">Ballgown paper</a> it was rapidly discussed. The day after it was posted, there were already <a href="http://nextgenseek.com/2014/03/ballgown-for-estimating-differential-expression-of-genes-transcripts-or-exons-from-rna-seq/">blog posts</a> about the paper up.</p>
<p>We also have been working on another piece of software on Github that hasn’t been published yet, but have already had <a href="https://github.com/lcolladotor/derfinder/graphs/contributors">multiple helpful contributions</a> from people outside our group.</p>
<p>While all of this is going on, we have a paper out to review that we have been waiting to hear about for multiple months. So while open science is dramatically speeding up the rate at which we disseminate our results, the speed isn’t evenly distributed.</p>
What I do when I get a new data set as told through tweets
2014-06-13T09:06:18+00:00
http://simplystats.github.io/2014/06/13/what-i-do-when-i-get-a-new-data-set-as-told-through-tweets
<p>Hilary Mason asked a really interesting question yesterday:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
Data people: What is the very first thing you do when you get your hands on a new data set?
</p>
<p>
— Hilary Mason (@hmason) <a href="https://twitter.com/hmason/statuses/476905839035305984">June 12, 2014</a>
</p>
</blockquote>
<p>You should really consider reading the whole discussion <a href="https://twitter.com/hmason/status/476905839035305984">here</a>; it is amazing. But it also inspired me to write a post about what I do, as told by other people on Twitter. I apologize in advance if I missed your tweet; there was way too much good stuff to get them all.</p>
<p><strong>Step 0: Figure out what I’m trying to do with the data</strong></p>
<p>At least for me I come to a new data set in one of three ways: (1) I made it myself, (2) a collaborator created a data set with a specific question in mind, or (3) a collaborator created a data set and just wants to explore it. In the first case and the second case I already know what the question is, although sometimes in case (2) I still spend a little more time making sure I understand the question before diving in. @visualisingdata and I think alike here:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> this will sound textbooky but I stop, look and think about "what's it about (phenomena, activity, entity etc). Look before see.
</p>
<p>
— Andy Kirk (@visualisingdata) <a href="https://twitter.com/visualisingdata/statuses/476958934528704512">June 12, 2014</a>
</p>
</blockquote>
<p> Usually this involves figuring out what the variables mean like @_jden does:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> try to figure out what the fields mean and how it's coded — :sandwich emoji: (@_jden) <a href="https://twitter.com/_jden/statuses/476907686307430400">June 12, 2014</a>
</p>
</blockquote>
<p>If I’m working with a collaborator I do what @evanthomaspaul does:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> Interview the source, if possible, to know all of the problems with the data, use limitations, caveats, etc. — Evan Thomas Paul (@evanthomaspaul) <a href="https://twitter.com/evanthomaspaul/statuses/476924149852827648">June 12, 2014</a>
</p>
</blockquote>
<p>If the data don’t have a question yet, I usually start thinking right away about what questions can actually be answered with the data and what can’t. This prevents me from wasting a lot of time later chasing trends. @japerk does something similar:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> figure out the format & how to read it. Then ask myself, what can be learned from this data? — Jacob (@japerk) <a href="https://twitter.com/japerk/statuses/476909485651279872">June 12, 2014</a>
</p>
</blockquote>
<p><strong>Step 1: Learn about the elephant</strong> Unless the data is something I’ve analyzed a lot before, I usually feel like the <a href="http://en.wikipedia.org/wiki/Blind_men_and_an_elephant">blind men and the elephant.</a></p>
<p><a href="http://changeprocessdesign.files.wordpress.com/2009/11/6-blind-men-hans.jpg"><img class="aligncenter" alt="" src="http://changeprocessdesign.files.wordpress.com/2009/11/6-blind-men-hans.jpg" width="293" height="188" /></a></p>
<p>So the first thing I do is fool around a bit to try to figure out what the data set “looks” like by doing things like what @jasonpbecker does: looking at the types of variables I have and what the first few and last few observations look like.</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> sapply(df, class); head(df); tail(df) — Jason Becker (@jasonpbecker) <a href="https://twitter.com/jasonpbecker/statuses/476907832718397440">June 12, 2014</a>
</p>
</blockquote>
<p>If it is medical/social data I usually use this to look for personally identifiable information and then do what @peteskomoroch does:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> remove PII and burn it with fire — Peter Skomoroch (@peteskomoroch) <a href="https://twitter.com/peteskomoroch/statuses/476910403348209665">June 12, 2014</a>
</p>
</blockquote>
<p>If the data set is really big, I usually take a carefully chosen random subsample to make it possible to do my exploration interactively like @richardclegg</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> unless it is big data in which case sample then import to R and look for NAs... <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":-)" class="wp-smiley" style="height: 1em; max-height: 1em;" /> — Richard G. Clegg (@richardclegg) <a href="https://twitter.com/richardclegg/statuses/477113022658641920">June 12, 2014</a>
</p>
</blockquote>
<p>After doing that I look for weird quirks, like if there are missing values or outliers like @feralparakeet</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> ALL THE DESCRIPTIVES. Well, after reviewing the codebook, of course. — Vickie Edwards (@feralparakeet) <a href="https://twitter.com/feralparakeet/statuses/476913969962053634">June 12, 2014</a>
</p>
</blockquote>
<p>and like @cpwalker07</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> count # rows, read every column header — Chris Walker (@cpwalker07) <a href="https://twitter.com/cpwalker07/statuses/476922532596289536">June 12, 2014</a>
</p>
</blockquote>
<p>and like @toastandcereal</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a><a href="https://twitter.com/mispagination">@mispagination</a> jot down number of rows. That way I can assess right away whether I've done something dumb later on. — Jessica Balsam (@toastandcereal) <a href="https://twitter.com/toastandcereal/statuses/476949846377914368">June 12, 2014</a>
</p>
</blockquote>
<p>and like @cld276</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> run a bunch of count/groupby statements to gauge if I think it's corrupt. — Carol Davidsen (@cld276) <a href="https://twitter.com/cld276/statuses/476908703493677056">June 12, 2014</a>
</p>
</blockquote>
<p>and @adamlaiacano</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> summary() — Adam Laiacano (@adamlaiacano) <a href="https://twitter.com/adamlaiacano/statuses/476906966049374208">June 12, 2014</a>
</p>
</blockquote>
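<p>Pulling several of those suggestions together, a condensed version of my first pass looks something like the sketch below. It assumes a data frame called df is already loaded; the subsample size is an arbitrary placeholder.</p>
<pre><code>str(df)                                  # classes and dimensions in one shot
head(df); tail(df)                       # first and last few observations
summary(df)                              # quick descriptives, including NA counts
sapply(df, function(x) sum(is.na(x)))    # missing values per column

# for a very large data set, explore a random subsample interactively
df_small <- df[sample(nrow(df), min(nrow(df), 1e4)), ]
</code></pre>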
<p><strong>Step 2: Clean/organize</strong> I usually use the first exploration to figure out things that need to be fixed so that I can mess around with a <a href="http://vita.had.co.nz/papers/tidy-data.pdf">tidy data set</a>. This includes fixing up missing value encoding like @chenghlee</p>
<blockquote class="twitter-tweet" lang="en">
<p>
.<a href="https://twitter.com/hmason">@hmason</a> Often times, "fix" various codings, esp. for missing data (e.g., mixed strings & ints for coded vals; decide if NAs, "" are equiv.) — Cheng H. Lee (@chenghlee) <a href="https://twitter.com/chenghlee/statuses/476919091056226306">June 12, 2014</a>
</p>
</blockquote>
<p>or more generically like: @RubyChilds</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> clean it — Ruby ˁ˚ᴥ˚ˀ (@RubyChilds) <a href="https://twitter.com/RubyChilds/statuses/476932385913569282">June 12, 2014</a>
</p>
</blockquote>
<p>I usually do a fair amount of this, like @the_turtle too:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> Spend the next two days swearing because nobody cleaned it. — The Turtle (@the_turtle) <a href="https://twitter.com/the_turtle/statuses/476907578404786176">June 12, 2014</a>
</p>
</blockquote>
<p>When I’m done I do a bunch of sanity checks and data integrity checks like @deaneckles, and if things are screwed up I go back and fix them:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/treycausey">@treycausey</a> <a href="https://twitter.com/hmason">@hmason</a> Test really boring hypotheses. Like num_mobile_comments <= num_comments. — Dean Eckles (@deaneckles) <a href="https://twitter.com/deaneckles/statuses/476911179361972224">June 12, 2014</a>
</p>
</blockquote>
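<p>A small sketch of what that fixing and checking can look like in R, continuing from the first pass above (the column names here are made up for illustration):</p>
<pre><code class="r">## Recode ad hoc missing-value markers to proper NAs
dat[dat == ""] <- NA
dat$age[dat$age %in% c(-99, 999)] <- NA
dat$age <- as.numeric(dat$age)

## Test really boring hypotheses; stop loudly if any of them fail
stopifnot(all(dat$num_mobile_comments <= dat$num_comments, na.rm = TRUE))
stopifnot(!any(duplicated(dat$id)))
stopifnot(all(dat$age >= 0, na.rm = TRUE))
</code></pre>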
<p><strong>Step 3: Plot. That. Stuff.</strong> After getting a handle on the data with mostly text-based tables and output (things that don’t require a graphics device) and cleaning things up a bit, I start plotting everything, like @hspter</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> usually head(data) then straight to visualization. Have been working on some "unit tests" for data as well <a href="https://t.co/6Qd3URmzpe">https://t.co/6Qd3URmzpe</a> — Hilary Parker (@hspter) <a href="https://twitter.com/hspter/statuses/476915876927520768">June 12, 2014</a>
</p>
</blockquote>
<p>At this stage my goal is to get the maximum amount of information about the data set in the minimal amount of time. So I do not make the graphs pretty (I think there is a distinction between exploratory and expository graphics). I do histograms and jittered one-dimensional plots to look at variables one by one, like @FisherDanyel</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/TwoHeadlines">@TwoHeadlines</a><a href="https://twitter.com/hmason">@hmason</a> After looking at a few hundred random rows? Histograms & scatterplots of columns to understand what I have. — Danyel Fisher (@FisherDanyel) <a href="https://twitter.com/FisherDanyel/statuses/477206626558951425">June 12, 2014</a>
</p>
</blockquote>
<p>To compare the distributions of variables I usually use overlaid density plots, like @sjwhitworth</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> density plot all the things!
</p>
<p>
— Stephen Whitworth (@sjwhitworth) <a href="https://twitter.com/sjwhitworth/statuses/476953907424657408">June 12, 2014</a>
</p>
</blockquote>
<p>I make tons of scatterplots to look at relationships between variables like @wduyck</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> plot scatterplots and distributions
</p>
<p>
— Wouter Duyck (@wduyck) <a href="https://twitter.com/wduyck/statuses/476979620706013184">June 12, 2014</a>
</p>
</blockquote>
<p>I usually color/size the dots in the scatterplots by other variables to see if I can identify any confounding relationships that might screw up analyses downstream. Then, if the data are multivariate, I do some dimension reduction to get a feel for high dimensional structure. Nobody mentioned principal components or hierarchical clustering in the Twitter conversation, but I end up using these a lot to just figure out if there are any weird multivariate dependencies I might have missed.</p>
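<p>Continuing the sketch, the quick-and-ugly versions of these plots in base R might look something like this (again, the column names are invented):</p>
<pre><code class="r">## One variable at a time: histograms and jittered one-dimensional plots
hist(dat$age, breaks = 50)
stripchart(dat$age, method = "jitter", pch = 19, cex = 0.3)

## Compare distributions with overlaid density plots
plot(density(dat$age[dat$group == "a"], na.rm = TRUE))
lines(density(dat$age[dat$group == "b"], na.rm = TRUE), col = "red")

## Scatterplots, colored by a potential confounder
plot(dat$x, dat$y, col = as.integer(factor(dat$group)), pch = 19, cex = 0.5)

## Quick look at multivariate structure
num <- dat[, sapply(dat, is.numeric)]
pc  <- prcomp(na.omit(num), scale. = TRUE)
plot(pc$x[, 1], pc$x[, 2])
plot(hclust(dist(scale(na.omit(num)))))
</code></pre>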
<p><strong>Step 4: Get a quick and dirty answer to the question from Step 1</strong></p>
<p>After I have a feel for the data I usually try to come up with a quick and dirty answer to the question I care about. This might be a simple predictive model (I usually use 60% training, 40% test) or a really basic regression model when possible, just to see if the signal is huge, medium or subtle. I use this as a place to start when doing the rest of the analysis. I also often check this against the intuition of the person who generated the data to make sure something hasn’t gone wrong in the data set.</p>
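<p>As a minimal sketch of that quick and dirty answer, assuming a made-up outcome <code>y</code> and predictor <code>x</code> in the same data set:</p>
<pre><code class="r">set.seed(20140612)
n     <- nrow(dat)
train <- sample(n, size = round(0.6 * n))    # 60% training, 40% test

fit   <- lm(y ~ x, data = dat[train, ])      # a really basic model first
summary(fit)                                 # is the signal huge, medium, or subtle?

pred  <- predict(fit, newdata = dat[-train, ])
sqrt(mean((pred - dat$y[-train])^2, na.rm = TRUE))   # test-set RMSE
</code></pre>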
The Real Reason Reproducible Research is Important
2014-06-06T06:19:31+00:00
http://simplystats.github.io/2014/06/06/the-real-reason-reproducible-research-is-important
<p>Reproducible research has been on my mind a bit these days, partly because it has been in the news with the <a href="http://simplystatistics.org/2014/06/03/post-piketty-lessons/">Piketty stuff</a>, and also perhaps because I just <a href="http://www.amazon.com/Implementing-Reproducible-Research-Chapman-Series/dp/1466561599/ref=sr_1_1?ie=UTF8&qid=1402049601&sr=8-1&keywords=roger+peng">published a book on it</a> and I’m <a href="https://www.coursera.org/course/repdata">teaching a class on it</a> as we speak (as well as next month and the month after…).</p>
<p>However, as I watch and read many discussions over the role of reproducibility in science, I often feel that many people miss the point. Now, just to be clear, when I use the word “reproducibility” or say that a study is reproducible, I do not mean “independent verification” as in a separate investigator conducted an independent study and came to the same conclusion as the original study (that is what I refer to as “replication”). By using the word reproducible, I mean that the original data (and original computer code) can be analyzed (by an independent investigator) to obtain the same results as the original study. In essence, it is the notion that the <em>data analysis</em> can be successfully repeated. Reproducibility is particularly important in large computational studies where the data analysis can often play an outsized role in supporting the ultimate conclusions.</p>
<p>Many people seem to conflate the ideas of reproducibility and correctness, but they are not the same thing. One must always remember that <strong>a study can be reproducible and still be wrong</strong>. By “wrong”, I mean that the conclusion or claim can be wrong. If I claim that X causes Y (think “sugar causes cancer”), my data analysis might be reproducible, but my claim might ultimately be incorrect for a variety of reasons. If my claim has any value, then others will attempt to replicate it and the correctness of the claim will be determined by whether others come to similar conclusions.</p>
<p>Then why is reproducibility so important? Reproducibility is important because <strong>it is the only thing that an investigator can guarantee about a study</strong>.</p>
<p>Contrary to what most press releases would have you believe, an investigator cannot guarantee that the claims made in a study are correct (unless they are purely descriptive). This is because in the history of science, no meaningful claim has ever been proven by a single study. (The one exception might be mathematics, where they are literally proving things in their papers.) So reproducibility is important not because it ensures that the results are correct, but rather because it ensures transparency and gives us confidence in understanding exactly what was done.</p>
<p>These days, with the complexity of data analysis and the subtlety of many claims (particularly about complex diseases), reproducibility is pretty much the only thing we can hope for. Time will tell whether we are ultimately right or wrong about any claims, but reproducibility is something we can know right now.</p>
Post-Piketty Lessons
2014-06-03T07:04:14+00:00
http://simplystats.github.io/2014/06/03/post-piketty-lessons
<p>The latest crisis in data analysis comes to us (once again) from the field of Economics. Thomas Piketty, a French economist, recently published a book titled <em>Capital in the 21st Century</em> that has been a best-seller. I have not read the book, but based on media reports, it appears to make the claim that inequality has increased in recent years and will likely increase into the future. The book argues that this increase in inequality is driven by capitalism’s tendency to reward capital more than labor. This is my non-economist’s understanding of the book, but the specific claims of the book are not what I want to discuss here (there is much discussion elsewhere).</p>
<p>An interesting aspect of Piketty’s work, from my perspective, is that <a href="http://piketty.pse.ens.fr/en/capital21c2">he has made all of his data and analysis available on the web</a>. From what I can tell, his analysis was not trivial—data were collected and merged from multiple disparate sources and adjustments were made to different data series to account for various incompatibilities. To me, this sounds like a standard data analysis, in the sense that all meaningful data analyses are complicated. As noted by Nate Silver, data do not arise from a “<a href="http://fivethirtyeight.com/features/be-skeptical-of-both-piketty-and-his-skeptics/">virgin birth</a>”, and in any example worth discussing, much work has to be done to get the data into a state in which statistical models can be fit, or even more simply, plots can be made.</p>
<p>Chris Giles, a journalist for the Financial Times, recently published a column (unfortunately blocked by paywall) in which he claimed that much of the analysis that Piketty had done was flawed or incorrect. In particular, he claimed that based on his (Giles’) analysis, inequality was not growing as much over time as Piketty claimed. Among other points, Giles claims that numerous errors were made in assembling the data and in Piketty’s original analysis.</p>
<p>This episode smacked of the recent <a href="http://simplystatistics.org/2013/04/16/i-wish-economists-made-better-plots/">Reinhart-Rogoff kerfuffle</a> in which some fairly basic errors were discovered in those economists’ Excel spreadsheets. Some of those errors only made small differences to the results, but a critical methodological component, in which the data were weighted in a special way, appeared to have a significant impact on the results if alternate approaches were taken.</p>
<p>Piketty has since <a href="http://www.nytimes.com/2014/05/30/upshot/thomas-piketty-responds-to-criticism-of-his-data.html?_r=0">responded forcefully</a> to the FT’s column, defending all of the work he has done and addressing the criticisms one by one. To me, the most important result of the FT analysis is that <em>Piketty’s work appears to be largely reproducible</em>. Piketty made his data available, with reasonable documentation (in addition to his book), and Giles was able to come up with the same numbers Piketty came up with. This is a <em>good thing</em>. Piketty’s work was complex, and the only way to communicate the entirety of it was to make the data and code available.</p>
<p>The other aspects of Giles’ analysis are, from an academic standpoint, largely irrelevant to me, particularly because I am not an economist. The reason I find them irrelevant is because the objections are largely over <em>whether he is correct or not</em>. This is an obviously important question, but in any field, no single study or even synthesis can be determined to be “correct” at that instance. Time will tell, and if his work is “correct”, his predictions will be borne out by nature. It’s not so satisfying to have to wait many years to know if you are correct, but that’s how science works.</p>
<p>In the meantime, economists will have a debate over the science and the appropriate methods and data used for analysis. This is also how science works, and it is only (really) possible because Piketty made his work reproducible. Otherwise, the debate would be largely uninformed.</p>
The Big in Big Data relates to importance not size
2014-05-28T11:31:15+00:00
http://simplystats.github.io/2014/05/28/the-big-in-big-data-relates-to-importance-not-size
<p>In the past couple of years several non-statisticians have asked me “What is Big Data exactly?” or “How big is Big Data?”. My answer has been that I think Big Data is much more about “data” than “big”. I explain below.</p>
<table>
<tr>
<td>
<a href="http://simplystatistics.org/2014/05/28/the-big-in-big-data-relates-to-importance-not-size/screen-shot-2014-05-28-at-10-14-53-am/" rel="attachment wp-att-3096"><img class="alignnone size-full wp-image-3096" alt="Screen Shot 2014-05-28 at 10.14.53 AM" src="http://simplystatistics.org/wp-content/uploads/2014/05/Screen-Shot-2014-05-28-at-10.14.53-AM.png" width="262" height="230" /></a>
</td>
<td>
<a href="http://simplystatistics.org/2014/05/28/the-big-in-big-data-relates-to-importance-not-size/screen-shot-2014-05-28-at-10-15-04-am/" rel="attachment wp-att-3097"><img class="alignnone size-full wp-image-3097" alt="Screen Shot 2014-05-28 at 10.15.04 AM" src="http://simplystatistics.org/wp-content/uploads/2014/05/Screen-Shot-2014-05-28-at-10.15.04-AM.png" width="265" height="233" /></a>
</td>
</tr>
</table>
<p>Since 2011 Big Data has been all over the news. The New York Times, The Economist, Science, Nature, etc. have told us that the Big Data Revolution is upon us (see the Google Trends figure above). But was this really a revolution? What happened to the Massive Data Revolution (see figure above)? For this to be called a revolution, there must be some drastic change, a discontinuity, or a quantum leap of some kind. So has there been such a discontinuity in the rate of growth of data? Although this may be true for some fields (for example in genomics, next generation sequencing <a href="http://www.genome.gov/sequencingcosts/">did introduce a discontinuity around 2007</a>), overall, data size seems to have been growing at a steady rate for decades. For example, in the <a href="http://www.singularity.com/charts/page80.html">graph below</a> (see <a href="http://www.dtc.umn.edu/~odlyzko/doc/oft.internet.growth.pdf">this paper</a> for source) note the trend in internet traffic data (which btw dwarfs genomics data). There does seem to be a change in the rate, but it happens during the 1990s, which brings me to my main point.</p>
<p><img alt="internet data traffic" src="http://www.singularity.com/images/charts/InternetDataTraffic2b.jpg" width="500" /></p>
<p>Although several fields (including Statistics) are having to innovate to keep up with growing data size, I don’t see this as something that is new. But I do think that we are in the midst of a Big Data revolution. Although the media only noticed it recently, it started about 30 years ago. The discontinuity is not in the size of data, but in the percent of fields (across academia, industry and government) that use data. At some point in the 1980s, with the advent of cheap computers, data were moved from the file cabinet to the disk drive. Then in the 1990s, with the democratization of the internet, these data started to become easy to share. All of a sudden, people could use data to answer questions that were previously answered only by experts, theory or intuition.</p>
<p>In this blog we like to point out examples, so let me review a few. Credit card companies started using purchase data to detect fraud. Baseball teams started scraping data and evaluating players without ever seeing them. Financial companies started analyzing stock market data to develop investment strategies. Environmental scientists started to gather and analyze data from air pollution monitors. Molecular biologists started quantifying outcomes of interest into matrices of numbers (as opposed to looking at stains on nylon membranes) to discover new tumor types and develop diagnostic tools. Cities started using crime data to guide policing strategies. Netflix started using customer ratings to recommend movies. Retail stores started mining bonus card data to deliver targeted advertisements. Note that all the data sets mentioned were tiny in comparison to, for example, sky survey data collected by astronomers. But I still call this phenomenon Big Data because the percent of people using data was in fact Big.</p>
<p><img src="http://simplystatistics.org/wp-content/uploads/2014/05/IMG_5053.jpg" alt="bigdata" /></p>
<p>I borrowed the title of this post from a <a href="http://www.slideshare.net/kuonen/big-datadatascience-may2014">very nice presentation</a> by Diego Kuonen.</p>
10 things statistics taught us about big data analysis
2014-05-22T11:37:41+00:00
http://simplystats.github.io/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis
<p>In <a href="http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/">my previous post</a> I pointed out that a major problem with big data is that applied statistics has been left out. But many cool ideas in applied statistics are really relevant for big data analysis. So I thought I’d try to answer the second question in my previous post: <em>“When thinking about the big data era, what are some statistical ideas we’ve already figured out?”</em> Because the internet loves top 10 lists I came up with 10, but there are more if people find this interesting. Obviously mileage may vary with these recommendations, but I think they are generally not a bad idea.</p>
<ol>
<li><strong>If the goal is prediction accuracy, average many prediction models together</strong>. In general, the prediction algorithms that most frequently win Kaggle competitions or the Netflix prize <a href="http://en.wikipedia.org/wiki/Ensemble_learning">blend multiple models together</a>. The idea is that by averaging (or majority voting) multiple good prediction algorithms you can reduce variability without giving up bias. One of the earliest descriptions of this idea was of a much simplified version based on <a href="http://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrapping samples</a> and building multiple prediction functions - a <a href="http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf">process called bagging</a> (short for bootstrap aggregating). <a href="http://en.wikipedia.org/wiki/Random_forest">Random forests</a>, another incredibly successful prediction algorithm, is based on a similar idea with classification trees.</li>
<li><strong>When testing many hypotheses, correct for multiple testing</strong> <a href="http://xkcd.com/882/">This comic</a> points out the problem with standard hypothesis testing when many tests are performed. Classic hypothesis tests are designed to call a set of data significant 5% of the time, even when the null is true (e.g. nothing is going on). One really common choice for correcting for multiple testing is to use <a href="http://en.wikipedia.org/wiki/False_discovery_rate">the false discovery rate</a> to control the rate at which things you call significant are false discoveries. People like this measure because you can think of it as the rate of noise among the signals you have discovered. Benjamini and Hochberg gave the <a href="http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf">first definition of the false discovery rate and provided a procedure to control the FDR</a>. There is also a really readable introduction to FDR by <a href="http://www.pnas.org/content/100/16/9440.full">Storey and Tibshirani</a>. (A minimal R sketch of FDR control, and of the smoothing in the next item, appears after this list.)</li>
<li><strong>When you have data measured over space, distance, or time, you should smooth </strong>This is one of the oldest ideas in statistics (regression is a form of smoothing and Galton <a href="http://en.wikipedia.org/wiki/Regression_toward_the_mean">popularized that a while ago</a>). I personally like locally weighted scatterplot smoothing a lot. <a href="http://www.people.fas.harvard.edu/~gov2000/Handouts/lowess.pdf">This paper</a> is a good one by Cleveland about loess. Here it is in a gif. <a href="http://simplystatistics.org/2014/02/13/loess-explained-in-a-gif/" rel="attachment wp-att-3069"><img class=" wp-image-3069 aligncenter" alt="loess" src="http://simplystatistics.org/wp-content/uploads/2014/05/loess.gif" width="202" height="202" /></a>But people also like <a href="http://en.wikipedia.org/wiki/Smoothing_spline">smoothing splines</a>, <a href="http://en.wikipedia.org/wiki/Hidden_Markov_model">Hidden Markov Models</a>, <a href="http://en.wikipedia.org/wiki/Moving_average">moving averages</a> and many other smoothing choices.</li>
<li><strong>Before you analyze your data with computers, be sure to plot it</strong> A common mistake made by amateur analysts is to immediately jump to fitting models to big data sets with the fanciest computational tool. But you can miss pretty obvious things <a href="http://en.wikipedia.org/wiki/Anscombe's_quartet">like this </a>if you don’t plot your data. <a href="http://en.wikipedia.org/wiki/File:Bland-altman_plot.png" rel="attachment wp-att-3068"><img class=" wp-image-3068 aligncenter" alt="ba" src="http://simplystatistics.org/wp-content/uploads/2014/05/ba.png" width="288" height="288" srcset="http://simplystatistics.org/wp-content/uploads/2014/05/ba-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2014/05/ba-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2014/05/ba.png 600w" sizes="(max-width: 288px) 100vw, 288px" /></a>There are too many plots to talk about individually, but one example of an incredibly important plot is the <a href="http://en.wikipedia.org/wiki/Bland%E2%80%93Altman_plot">Bland-Altman plot,</a> (called an MA-plot in genomics) when comparing measurements from multiple technologies. R provides tons of graphics for a reason and <a style="font-size: 16px;" href="http://ggplot2.org/">ggplot2</a> makes them pretty.</li>
<li><strong>Interactive analysis is the best way to really figure out what is going on in a data set</strong> This is related to the previous point; if you want to understand a data set you have to be able to play around with it and explore it. You need to make tables, make plots, identify quirks, outliers, missing data patterns and problems with the data. To do this you need to interact with the data quickly. One way to do this is to analyze the whole data set at once using tools like Hive, Hadoop, or Pig. But an often easier, better, and more cost-effective approach is to use random sampling. As Robert Gentleman put it, “<a href="https://twitter.com/EllieMcDonagh/status/469184554549248000">make big data as small as possible as quick as possible</a>”.</li>
<li><strong>Know what your real sample size is. </strong> It can be easy to be tricked by the size of a data set. Imagine you have an image of a simple black circle on a white background stored as pixels. As the resolution increases the size of the data increases, but the amount of information may not (hence <a href="http://en.wikipedia.org/wiki/Vector_graphics">vector graphics</a>). Similarly in genomics, the number of reads you measure (which is a main determinant of data size) is not the sample size; the sample size is the number of individuals. In social networks, the number of people in the network may not be the sample size. If the network is very dense, the sample size <a href="http://arxiv.org/pdf/1112.0840.pdf">might be much less</a>. In general, the bigger the sample size the better, and sample size and data size aren’t always tightly correlated.</li>
<li><strong>Unless you ran a randomized trial, potential confounders should keep you up at night </strong>Confounding is maybe the most fundamental idea in statistical analysis. It is behind the <a href="http://www.tylervigen.com/">spurious correlations</a> like these and the reason why nutrition studies <a href="http://fivethirtyeight.com/features/eat-more-nuts-and-vegetables-and-dont-forget-to-exercise-and-quit-smoking/">are so hard</a>. It is very hard to hold people to a randomized diet and people who eat healthy diets might be different than people who don’t in other important ways. In big data sets confounders might be <a href="http://www.cis.jhu.edu/publications/papers_in_database/GEMAN/Geman_NatureReviews_2010.pdf">technical variables</a> about how the data were measured or they could be <a href="http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf">differences over time in Google search terms</a>. Any time you discover a cool new result, your first thought should be, “what are the potential confounders?”<a href="http://xkcd.com/552/" rel="attachment wp-att-3067"><img class=" wp-image-3067 aligncenter" alt="correlation" src="http://simplystatistics.org/wp-content/uploads/2014/05/correlation.png" width="275" height="111" /></a></li>
<li><strong>Define a metric for success up front</strong> Maybe the simplest idea, but one that is critical in statistics and <a href="http://en.wikipedia.org/wiki/Decision_theory">decision theory</a>. Sometimes your goal is to discover new relationships and that is great if you define that up front. One thing that applied statistics has taught us is that changing the criteria you are going for after the fact is really dangerous. So when you find a correlation, don’t assume you can predict a new result or that you have discovered which way a causal arrow goes.</li>
<li><strong>Make your code and data available and have smart people check it</strong> As several people pointed out about my last post, the Reinhart and Rogoff problem did not involve big data. But even in this small data example, there was a bug in the code used to analyze the data. With big data and complex models this is even more important. Mozilla Science is <a href="http://mozillascience.org/code-review-for-science-what-we-learned/">doing interesting work</a> on code review for data analysis in science. But in general, if you just get a friend to look over your code, they will catch a huge fraction of the problems you might have.</li>
<li><strong><a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">Problem first not solution backward </a></strong>One temptation in applied statistics is to take a tool you know well (regression) and use it to hit all the nails (epidemiology problems). <a href="http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/hitnails/" rel="attachment wp-att-3066"><img class=" wp-image-3066 aligncenter" alt="hitnails" src="http://simplystatistics.org/wp-content/uploads/2014/05/hitnails.png" width="288" height="216" srcset="http://simplystatistics.org/wp-content/uploads/2014/05/hitnails-300x225.png 300w, http://simplystatistics.org/wp-content/uploads/2014/05/hitnails.png 800w" sizes="(max-width: 288px) 100vw, 288px" /></a>There is a similar temptation in big data to get fixated on a tool (hadoop, pig, hive, nosql databases, distributed computing, gpgpu, etc.) and ignore the question of whether we can infer that x relates to y or that x predicts y.</li>
</ol>
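<p>As promised in item 2, here is a minimal R sketch of two of these ideas on simulated data; the numbers are made up purely for illustration:</p>
<pre><code class="r">## Item 2: control the false discovery rate with Benjamini-Hochberg via p.adjust
set.seed(1)
p   <- c(runif(1000), rbeta(50, 1, 50))   # 1,000 null p-values plus 50 small "signal" p-values
fdr <- p.adjust(p, method = "BH")
sum(p   < 0.05)   # naive thresholding: lots of calls, many of them noise
sum(fdr < 0.05)   # FDR-adjusted: fewer calls, mostly the real signals

## Item 3: smooth noisy data measured over time with loess
x <- 1:200
y <- sin(x / 20) + rnorm(200, sd = 0.3)
plot(x, y)
lines(x, predict(loess(y ~ x, span = 0.3)), col = "red", lwd = 2)
</code></pre>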
Why big data is in trouble: they forgot about applied statistics
2014-05-07T10:08:32+00:00
http://simplystats.github.io/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics
<p>This year the idea that statistics is important for big data has exploded into the popular media. Here are a few examples, starting with the Lazer et al. paper in Science that got the ball rolling on this idea.</p>
<ul>
<li><a href="http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf">The parable of Google Flu: traps in big data analysis</a></li>
<li><a href="http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz30INfAyMi">Big data are we making a big mistake?</a></li>
<li><a href="http://bits.blogs.nytimes.com/2014/03/28/google-flu-trends-the-limits-of-big-data/">Google Flu Trends: the limits of big data</a></li>
<li><a href="http://www.nytimes.com/2014/04/07/opinion/eight-no-nine-problems-with-big-data.html">Eight (No, Nine!) Problems with Big Data</a></li>
</ul>
<p>All of these articles warn about issues that statisticians have been thinking about for a very long time: sampling populations, confounders, multiple testing, bias, and overfitting. In the rush to take advantage of the hype around big data, these ideas were ignored or not given sufficient attention.</p>
<p>One reason is that when you actually take the time to do an analysis right, with careful attention to all the sources of variation in the data, it is almost a law that you will have to make smaller claims than you could if you just shoved your data in a machine learning algorithm and reported whatever came out the other side.</p>
<p>The prime example in the press is Google Flu Trends. Google Flu Trends was originally developed as a machine learning algorithm for predicting the number of flu cases based on Google search terms. While the underlying data management and machine learning algorithms were correct, a misunderstanding about the uncertainties in the data collection and modeling process has led to highly inaccurate estimates over time. A statistician would have thought carefully about the sampling process, identified time series components of the spatial trend, investigated <em>why</em> the search terms were predictive, and tried to understand the likely reason that Google Flu Trends was working.</p>
<p>As we have seen, lack of expertise in statistics has led to fundamental errors in both <a style="font-size: 16px;" href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">genomic science</a> and <a style="font-size: 16px;" href="http://simplystatistics.org/2013/04/19/podcast-7-reinhart-rogoff-reproducibility/">economics</a>. In the first case a team of scientists led by Anil Potti created an algorithm for predicting the response to chemotherapy. This solution was widely praised in both the scientific and popular press. Unfortunately the researchers did not correctly account for all the sources of variation in the data set and had misapplied statistical methods and ignored major data integrity problems. The lead author and the editors who handled this paper didn’t have the necessary statistical expertise, which led to major consequences and cancelled clinical trials.</p>
<p>Similarly, two economists, Reinhart and Rogoff, published a paper claiming that GDP growth was slowed by high governmental debt. Later it was discovered that there was an error in an Excel spreadsheet they used to perform the analysis. But more importantly, the choice of weights they used in their regression model was questioned as being unrealistic and as leading to dramatically different conclusions than the authors espoused publicly. The primary failing was a lack of the sensitivity analysis to data-analytic assumptions that any well-trained applied statistician would have performed.</p>
<p>Statistical thinking has also been conspicuously absent from major public big data efforts so far. Here are some examples:</p>
<ul>
<li><a href="http://www.nitrd.gov/nitrdgroups/index.php?title=White_House_Big_Data_Partners_Workshop">White House Big Data Partners Workshop</a> - 0/19 statisticians</li>
<li><a href="http://sites.nationalacademies.org/DEPS/DEPS_087192">National Academy of Sciences Big Data Worskhop</a> - 2/13 speakers statisticians</li>
<li><a href="http://news.cs.washington.edu/2013/11/12/uw-berkeley-nyu-collaborate-on-37-8m-data-science-initiative/">Moore Foundation Data Science Environments</a> - 0/3 directors from statistical background, 1/25 speakers at <a href="http://lazowska.cs.washington.edu/MS/OSTP.release.pdf">OSTP event</a> about the environments was a statistician</li>
<li><a href="http://acd.od.nih.gov/Data-and-Informatics-Implementation-Plan.pdf">Original group that proposed NIH BD2K</a> - 0/18 participants statisticians</li>
<li><a href="http://nsf.gov/news/news_videos.jsp?cntn_id=123607&media_id=72174&org=NSF">Big Data rollout from the White House</a> - 0/4 thought leaders statisticians, 0/n participants statisticians.</li>
</ul>
<p>One example of this kind of thinking is this insane table from the alumni magazine of the University of California, which I found from <a href="http://www.chalmers.se/en/areas-of-advance/ict/calendar/Pages/Terry-Speed.aspx">this talk by Terry Speed</a> (via Rafa, go watch his talk right now, it gets right to the heart of the issue). It shows a fundamental disrespect for applied statisticians who have developed serious expertise in a range of scientific disciplines.</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/screen-shot-2014-05-06-at-9-06-38-pm/" rel="attachment wp-att-3032"><img class=" wp-image-3032 aligncenter" alt="Screen Shot 2014-05-06 at 9.06.38 PM" src="http://simplystatistics.org/wp-content/uploads/2014/05/Screen-Shot-2014-05-06-at-9.06.38-PM.png" width="362" height="345" /></a>
</p>
<p>All of this leads to two questions:</p>
<ol>
<li>Given the importance of statistical thinking why aren’t statisticians involved in these initiatives?</li>
<li>When thinking about the big data era, what are some statistical ideas we’ve already figured out?</li>
</ol>
<p dir="ltr">
</p>
JHU Data Science: More is More
2014-05-05T10:09:09+00:00
http://simplystats.github.io/2014/05/05/jhu-data-science-more-is-more
<p>Today Jeff Leek, Brian Caffo, and I are launching 3 new courses on Coursera as part of the <a href="https://www.coursera.org/specialization/jhudatascience/1">Johns Hopkins Data Science Specialization</a>. These courses are</p>
<ul>
<li><a href="https://www.coursera.org/course/exdata">Exploratory Data Analysis</a></li>
<li><a href="https://www.coursera.org/course/repdata">Reproducible Research</a></li>
<li><a href="https://www.coursera.org/course/statinference">Statistical Inference</a></li>
</ul>
<p>I’m particularly excited about Reproducible Research, not just because I’m teaching it, but because I think it’s essentially the first of its kind being offered in a massive open format. Given the rich discussions about reproducibility that have occurred over the past few years, I’m happy to finally be able to offer this course for free to a large audience.</p>
<p><span style="font-size: 16px;">These courses are launching in </span>addition to the first 3 courses in the sequence: <a href="https://www.coursera.org/course/datascitoolbox">The Data Scientist’s Toolbox</a>, <a href="https://www.coursera.org/course/rprog">R Programming</a>, and <a href="https://www.coursera.org/course/getdata">Getting and Cleaning Data</a>, which are also running this month in case you missed your chance in April.</p>
<p>All told we have 6 of the 9 courses in the Specialization available as of today. We’re really looking forward to next month where we will be launching the final 3 courses: <a href="https://www.coursera.org/course/regmods">Regression Models</a>, <a href="https://www.coursera.org/course/predmachlearn">Practical Machine Learning</a>, and <a href="https://www.coursera.org/course/predmachlearn">Developing Data Products</a>. We also have some exciting announcements coming soon regarding the Capstone Projects.</p>
<p>Every course will be available every month, so don’t worry about missing a session. You can always come back next month.</p>
Confession: I sometimes enjoy reading the fake journal/conference spam
2014-04-30T10:00:06+00:00
http://simplystats.github.io/2014/04/30/confession-i-sometimes-enjoy-reading-the-fake-journalconference-spam
<p style="text-align: left">
I've spent a considerable amount of time setting up filters to avoid getting spam from fake <a href="http://www.nytimes.com/2013/04/08/health/for-scientists-an-exploding-world-of-pseudo-academia.html?pagewanted=all">journals and conferences</a>. Unfortunately, they are exceptionally good at thwarting my defenses. This does not annoy me as much as I pretend because, secretly, I enjoy reading some of these emails. Here are three of my favorites.
</p>
<p style="text-align: left">
1) Over-the-top robot:
</p>
<blockquote>
<p style="text-align: left">
<span style="font-style: italic">It gives us immense pleasure to invite you and your research allies to submit a manuscript for the journal “REDACTED”. The expertise of you in the never ending field of Gene Technology is highly appreciable. The level of intricacy shown by you in your work makes us even more proud, and </span><strong style="font-style: italic">we believe that your works should be known to mankind of science.</strong>
</p>
</blockquote>
<p>2) Sarcastic robot?</p>
<blockquote>
<p>First of all, congratulations on the publication of your highly cited original article < The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores > in the field of colon cancer, <strong>which has been cited more than 1 times and is in the world’s top one percent of papers</strong>. Such high number of citations reflects the high quality and influence of your paper.</p>
</blockquote>
<div>
<p>
3) Intimidating robot:
</p>
<blockquote>
<p>
This is Rocky.... Recently we have mailed you about the details of the conference. But we still have not received your response. So today we contact you again.
</p>
</blockquote>
<p>
NB: Although I am joking in this post, I do think these fake journals and conferences are a very serious problem. The fact that they are still around means enough money (mostly taxpayer money) is being spent to keep them in business. If you want to learn more, <a href="http://scholarlyoa.com/">this blog</a> does a good job on reporting on them and includes a <a href="http://scholarlyoa.com/publishers/">list of culprits.</a>
</p>
</div>
Picking a (bio)statistics thesis topic for real world impact and transferable skills
2014-04-22T14:39:27+00:00
http://simplystats.github.io/2014/04/22/picking-a-biostatistics-thesis-topic-for-real-world-impact-and-transferable-skills
<p>One of the things that was hardest for me in graduate school was starting to think about my own research projects and not just the ideas my advisor fed me. I remember that it was stressful because I didn’t quite know where to start. After having done this for a while and particularly after having read a bunch of papers by people who are way more successful than I am, I have come to the following algorithm as a means for finding a topic that will have real world impact and also give you skills to take on new problems in a flexible way.</p>
<ol>
<li> Find a scientific problem that hasn’t been solved with data (by far hardest part)</li>
<li>Define your metric for success</li>
<li> Collect data/partner up with someone with data for that problem.</li>
<li> Create a good solution to the problem</li>
<li> Only invent new methods if you must</li>
<li>(Optional) Write software and document the hell out of it</li>
<li>(Optional) Respond to users and update as needed</li>
<li>Don’t get (meanly) competitive</li>
</ol>
<p>The first step is definitely the most important and the hardest. The balance is between big important problems that lots of people are working on but where the potential for innovation is low and small detailed problems where you won’t have serious competition but you will have limited impact. In general good ways to find scientific problems are the following. (1) Find close and real scientific/applications collaborators. Not real like you talk to them once a month, real like you have a weekly meeting, you try to understand how their data are collected or generated and you ask them specifically what problems prevent them from doing their job well, then solve those problems. (2) You come up with a scientific question you have on your own. In mature research areas like genomics this requires a huge amount of reading to know what people have done before you, or to at least know what new technologies/data are becoming available. (3) You could read a ton of papers and find one that produces interesting data you think could answer a question the authors haven’t asked. In general, <a style="font-size: 16px;" href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">the key is to put the problem first</a>, before you even think about how to quantify or answer the question.</p>
<p>Next you have to define your metric for success. This metric should be scientific. You should try to say, “if I could predict x at 70% accuracy I could solve scientific problem y” or “if I could infer the relationship between x and y I would know something about z”. The metric should be compared to the scientific standards in the field. As an example, screening tests for the general population often must be 99% sensitive and specific (or more) due to low prevalence. But in a sub population, sensitivity and specificity of 70% or 80% may be really useful.</p>
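<p>A quick back-of-the-envelope calculation in R shows why the population matters so much here; the numbers below are hypothetical:</p>
<pre><code class="r">## Positive predictive value of a 99%/99% screening test at 0.1% prevalence
sens <- 0.99; spec <- 0.99; prev <- 0.001
(sens * prev) / (sens * prev + (1 - spec) * (1 - prev))   # about 0.09: mostly false positives

## The same calculation in a high-risk sub-population, with an 80%/80% test at 20% prevalence
sens <- 0.80; spec <- 0.80; prev <- 0.20
(sens * prev) / (sens * prev + (1 - spec) * (1 - prev))   # 0.5: half the positives are real
</code></pre>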
<p>Then you find the data. Here the key quote comes from Tukey:</p>
<blockquote>
<p>The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.</p>
</blockquote>
<p>My experience is that when you start with the problem first, the data are often hard to come by, have quirks, or are not quite right for the problem you want to solve. Generating the perfect data is often very expensive, so a huge amount of the effort you will spend is either (a) generating the perfect data or (b) determining if the data you collected is “good enough” to answer the question. One important point here is that knowing when you have failed is the entire name of the game here. If you get stuck once, you should try again. If you get stuck 100 times, it might be time to look for a different data set or figure out why the problem is unanswerable with current data. Incidentally, this is the most difficult part of the approach I’m proposing for coming up with topics. Failure is both likely and frequent, but that is a good thing when you are in grad school if you can learn from it and learn to predict when you are going to fail.</p>
<p>Since you’ve identified a problem that hasn’t been solved before in step 1, the first thing to try is to come up with a sensible solution using only the methods that already exist. In many cases, these existing methods <a href="http://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/">will work pretty well</a>. If they don’t, invent only as much statistical methodology and theory as you need to solve the problem. If you invent something new here, you should try it out <a href="http://simplystatistics.org/2013/03/06/the-importance-of-simulating-the-extremes/">on simple simulated examples and complex data</a> where you either know the answer or can perform cross-validation/replication analysis.</p>
<p>At this point, if you have a basic solution to the problem, even if it is just the t-test, you are in great shape! You have solved a problem that is new and you are ready to publish. If you have invented some methods along the way, publish those, too!</p>
<p>In some cases the problems you solve will be focused on an area where lots of other people can collect similar data to answer similar problems. In this case, your most direct route to maximum impact is to write simple, usable, and really well documented software other people can use. Write it in R, make it free, give it a vignette and advertise it! If people use your software they will send you bug reports, patches, typos, fixes, and wish lists of things they want your software to do. The more you help people and respond, the more your software will get used and the more impact your method will have.</p>
<p>Step 8 is often the hardest part. If you do something interesting, you will have a ton of competitors. People will write better and more precise methods down and will “beat” your method. That’s ok, in fact it is good! The more people that compare to your approach, the more you know you picked a good problem. In some cases, people will genuinely create better methods than you will. Learn from them and make your methods and software better. But try not to be upset that they wrote a paper about how their idea is so much better than yours; it is a high compliment that they thought your idea was worth comparing to. This is one thing the author of this post hasn’t nailed down perfectly, but I think the more you can do it the happier you will be.</p>
<p>The best part of this algorithm is that it gives you the problem first focus that will make it easy to transition if you do a postdoc with a different kind of data, or move to industry, or start with new collaborators.</p>
Correlation does not imply causation (parental involvement edition)
2014-04-17T10:00:24+00:00
http://simplystats.github.io/2014/04/17/correlation-does-not-imply-causation-parental-involvement-edition
<p>The New York Times recently published <a href="http://opinionator.blogs.nytimes.com/2014/04/12/parental-involvement-is-overrated/?rref=opinion&module=ArrowsNav&contentCollection=Opinion&action=keypress&region=FixedLeft&pgtype=Blogs">an article</a> on education titled “Parental Involvement Is Overrated”. Most research in this area supports the opposite view, but the authors claim that “evidence from our research suggests otherwise”. Before you stop helping your children understand long division or correcting their grammar, you should learn about one of the most basic statistical concepts: correlation does not imply causation. The first two chapters of this <a href="http://www.amazon.com/Statistics-4th-Edition-David-Freedman/dp/0393929728">very popular text book</a> describes the problem and even <a href="https://www.khanacademy.org/math/probability/regression/regression-correlation/v/correlation-and-causality">Khan Academy</a> has a class on it. As several of the commenters in the NYT article point out, the authors fail to make this distinction.</p>
<p>To illustrate the problem, imagine you want to know how effective tutoring is for students in a math class you are teaching. So you compare the test scores of students that received tutoring to those that didn’t. You find that receiving tutoring is correlated with lower test scores. So do you conclude that tutoring causes lower grades? Of course not! In this particular case we are confusing cause and effect: students that have trouble with math are much more likely to seek out tutoring and this is what drives the observed correlation. With that example in mind, consider this quote from the New York Times article:</p>
<blockquote>
<p>When we examined whether regular help with homework had a positive impact on children’s academic performance, we were quite startled by what we found. Regardless of a family’s social class, racial or ethnic background, or a child’s grade level, consistent homework help almost never improved test scores or grades…. Even more surprising to us was that when parents regularly helped with homework, kids usually performed worse.</p>
</blockquote>
<p>A first question we would ask here is: how do we know that the children’s performance would not have been even worse had they not received help? I imagine the authors made use of <em>controls</em>: we compare the group that received the treatment (regular help with homework) to a control group that did not. But this brings up a more difficult question: how do we know that the treatment and control groups are comparable?</p>
<p>In a randomized controlled experiment, we would take a group of kids and randomly assign each one to the treatment group (will be helped with their homework) or control group (no help with homework). By doing this we can use probability calculations to determine the range of differences we expect to see by chance when the treatment has no effect. Note that by chance one group may end up with a few more “better testers” than the other. However, if we see a big enough difference that can’t be explained by chance, then the alternative that the treatment is responsible for the observed differences becomes more believable.</p>
<p>Given all the prior research (and common sense) suggesting that parent involvement, in its many manifestations, is in fact helpful to students, many would consider it unethical to run a randomized controlled trial on this issue (you would knowingly hurt the control group). Therefore, the authors are left with no choice but to use an <em>observational study</em> to reach their conclusions. In this case, we have no control over who receives help and who doesn’t. Kids that require regular help with their homework are different in many ways from kids that don’t, even after correcting for all the factors mentioned. For example, one can envision how kids that have a mediocre teacher or have trouble with tests are more likely to be in the treatment group, while kids who naturally test well or go to schools that offer in-school tutoring are more likely to be in the control group.</p>
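<p>A toy simulation (mine, not the study's) makes the concern concrete: even when homework help truly raises scores, a confounder that drives both who gets help and how kids test can make the helped group look worse:</p>
<pre><code class="r">set.seed(42)
n <- 5000
struggling <- rbinom(n, 1, 0.3)                                     # unobserved trouble with the material
helped     <- rbinom(n, 1, ifelse(struggling == 1, 0.8, 0.2))       # struggling kids get help more often
score      <- 75 - 10 * struggling + 2 * helped + rnorm(n, sd = 5)  # help truly adds 2 points

tapply(score, helped, mean)             # naive comparison: the helped group scores worse
coef(lm(score ~ helped + struggling))   # adjusting for the confounder recovers the +2 effect
</code></pre>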
<p>I am not an expert on education, but as a statistician I am skeptical of the conclusions of this data-driven article. In fact, I would recommend parents actually do get involved early on by, for example, teaching children that correlation does not imply causation.</p>
<p>Note that I am not saying that observational studies are uninformative. If properly analyzed, observational data can be very valuable. For example, the data supporting smoking as a cause of lung cancer is all observational. Furthermore, there is an entire subfield within statistics (referred to as causal inference) that develops methodologies to deal with observational data. But unfortunately, observational data are commonly misinterpreted.</p>
The #rOpenSci hackathon #ropenhack
2014-04-10T09:41:09+00:00
http://simplystats.github.io/2014/04/10/the-ropensci-hackathon-ropenhack
<p><em>Editor’s note: This is a guest post by <a href="http://alyssafrazee.com/">Alyssa Frazee</a>, a graduate student in the Biostatistics department at Johns Hopkins and a participant in the recent rOpenSci hackathon. </em></p>
<p>Last week, I took a break from my normal PhD student schedule to participate in a <a href="https://github.com/ropensci/hackathon">hackathon</a> in San Francisco. The two-day event was hosted by <a href="http://ropensci.org/">rOpenSci</a>, an organization committed to developing R tools for open science. Working with <a href="https://github.com/ropensci/hackathon/wiki/Confirmed-attendees">several wonderful people</a> from the R community was inspiring, humbling, and incredibly fun. So many great things happened in a two-day whirlwind: it would be impossible now to capture the whole thing in a narrative that would do it justice. So instead of a play-by-play, here are some of the quotes from the event that I’ve recently been reflecting on:</p>
<h3 id="the-enemy-isnt-r-python-or-julia-the-enemy-is-closed-source-science"><strong>“The enemy isn’t R, Python, or Julia. The enemy is closed-source science.”</strong></h3>
<p dir="ltr">
There have been some lively internet debates recently about mathematical and scientific computing languages. While conversations about these languages are interesting and necessary, the forest often gets lost for the trees: in the end, we are here to do good science, and we should use whatever makes that easiest. We should build strong, collaborative communities, both within languages and across them. A closed-source science mentality hinders this kind of collaboration. I thought one of the hackathon projects, an<a href="https://github.com/takluyver/IRkernel"> R kernel for the iPython notebook</a>, especially exemplified a commitment to open science and to cross-language collaboration. It was so awesome to spend two days with R folks like this who genuinely enjoy working together, in any language, to make scientific computing better.
</p>
<h3 id="pair-debugging-is-fun"><strong>“Pair debugging is fun!”</strong></h3>
<p dir="ltr">
This quote perfectly captures one of my favorite things about hackathons: genuine group work! During my time in graduate school, I've done most of my programming solo. I think this is the nature of getting a PhD: the projects have to be yours, and all the other PhD students are working on their solo projects. So I really enjoyed the hackathon because it facilitated true pair/group work: two or more peers working on the same project, in the same room, at the same time. I like this work strategy for many reasons:
</p>
<p dir="ltr">
• The rate at which I learn new things is high, since it's so easy to ask a question. Lots of time is saved by not having to sift through internet search results.
</p>
<p dir="ltr">
• Sometimes I find solo debugging to be<a href="https://twitter.com/irqed/status/358212928404586498"> pretty painful</a>. But I think pair debugging is fun and satisfying: it's like an inspirational sports movie. It's you and me, the ragtag underdogs, against the computer, the evil bully from across town. Relatedly, the sweet sweet taste of victory is also shared.
</p>
<p dir="ltr">
• It's easier to stay focused on the task at hand. I'm not as easily distracted by email/Twitter/Facebook/blogs/the rest of the internet when I'm not coding alone.
</p>
<p dir="ltr">
My<a href="http://en.wikipedia.org/wiki/Academic_genealogy"> academic sister</a>,<a href="http://hilaryparker.com/"> Hilary</a>, and I did a good amount of pair debugging during the hackathon, and I kept finding myself thinking "I wish this would have been possible while we were both grad students!" I think we both had lots of fun working together. For a short discussion of more fun aspects of pairing,<a href="http://jvns.ca/blog/2014/03/02/pair-programming/"> here's a blog post I like</a>. At the rOpenSci hackathon in particular, group work was especially awesome because we could ask questions in person to people who have written the libraries our projects depend on, or to RStudio developers, or to GitHub employees, or to potential users of the projects. Just some of the many joys of having lots of<a href="https://github.com/ropensci/hackathon/wiki/Confirmed-attendees"> talented, friendly R programmers</a> all in the same space!
</p>
<h3 id="want-me-to-write-some-unit-tests-for-your-unit-tests"><strong>“Want me to write some unit tests for your unit tests?”</strong></h3>
<p dir="ltr">
During the hackathon, I primarily worked on a unit-testing package called<a href="https://github.com/ropensci/testdat"> testdat</a>. Testdat provides functions that check for and fix common problems with tabular data, like UTF-8 characters and inconsistent missing data codes, with the overall goal of making data processing/cleaning more reproducible. The project was really good for a two-day hackathon, since it was small enough to almost finish in two days, and it was very modular: one person worked on the missing data checking functions, another worked on UTF-8 checking, a third wrote the tests for the finished functions (unit tests for unit tests!), etc. Also, it didn't require a lot of background knowledge in a specific subject area or a deep dive into an existing codebase: all it required were some coding skills and perhaps a frustrating experience with messy data in the past (for motivation).
</p>
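<p>To make the idea concrete, here is a rough sketch of the kind of checks such a package performs. The function names below are hypothetical illustrations written for this post, not testdat’s actual API:</p>
<pre>## Hypothetical illustration of checks for messy tabular data (not the testdat API)
check_missing_codes <- function(df, codes = c("", "NA", "N/A", "-999", ".")) {
  ## report which columns contain any of the suspicious missing-data codes
  hits <- lapply(df, function(col) intersect(unique(as.character(col)), codes))
  Filter(length, hits)
}

check_nonascii <- function(df) {
  ## flag character columns containing non-ASCII (e.g., stray UTF-8) characters
  char_cols <- names(df)[vapply(df, is.character, logical(1))]
  Filter(length, lapply(df[char_cols],
                        function(col) grep("[^ -~]", col, value = TRUE)))
}

## usage sketch on a tiny made-up data frame
dat <- data.frame(id    = c("1", "2", "3"),
                  score = c("10", "-999", "N/A"),
                  note  = c("ok", "caf\u00e9", "fine"),
                  stringsAsFactors = FALSE)
check_missing_codes(dat)   # flags the "score" column
check_nonascii(dat)        # flags the "note" column</pre>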
<p dir="ltr">
Finding an appropriate project to work on was probably my biggest challenge at this hackathon. I spent the summer at<a href="https://www.hackerschool.com/"> Hacker School</a>, where the days were structured similarly to how they were at the rOpenSci hackathon: there wasn't really any structure. In both scenarios, the minimal structure was intentional. Lots of great collaborative work can happen with a few free days of hacking. But with two free days at the hackathon (versus Hacker School's 50), it was much more important to choose a good project quickly and get coding. One way to do this would have been to arrive at the hackathon with a small project in hand (<a href="https://github.com/ropensci/hackathon/issues?state=open">many people did this</a>). My strategy, however, was to chat with a few different project groups for the first hour or two on day 1, and then stick with one of those groups for the rest of the time. It worked well -- as I mentioned above, testdat was a great project -- but I did feel some time pressure (internally!) to choose a small project quickly.
</p>
<p dir="ltr">
For a look at some of the other hackathon projects, check out<a href="https://github.com/ropensci"> rOpenSci's GitHub page</a>, the<a href="https://github.com/ropensci/hackathon"> hackathon GitHub page</a>, project-specific posts on the<a href="http://ropensci.org/blog/"> rOpenSci blog</a>, or the hackathon's live-tweet hashtag,<a href="https://twitter.com/search?src=typd&q=%23ropenhack"> #ropenhack</a>.
</p>
<h3 id="why-are-there-so-many-minnesotans-here"><strong>“Why are there so many Minnesotans here?”</strong></h3>
<p dir="ltr">
There were at least four hackathon attendees (out of 35-40 total) that either currently live in or hail from Minnesota. Talk about overrepresentation! We are everywhere.
</p>
<h3 id="i-love-my-job"><strong>“I love my job.”</strong></h3>
<p dir="ltr">
I'm a late-stage PhD student, so the job market is looming closer with every passing day. When I meet new people working in statistics, genomics, data science, or another related field, I like to ask them whether they like their current work, how it compares to other jobs they've had, etc. Hackathon attendees had all kinds of jobs: academic researcher, industry scientist, freelancer, student, etc. The majority of the responses to my inquiries about how they liked their work was "I love it." The situation made the job market seem exciting, rather than intimidating: among the hackathon attendees and folks from the SF data science community that hung out with us for a dinner, the jobs themselves were pretty heterogeneous, but the general enjoyment of the work seemed consistently high.
</p>
<h3 id="whats-the-future-of-r"><strong>“What’s the future of R?”</strong></h3>
<p dir="ltr">
I suppose we should have known that existential questions like this would come up when 40 passionate R people spend two straight days together. Our discussion of the future of R didn't really yield any definitive answers or predictions, but I think we have big dreams for what R's future will look like: vibrant, open, collaborative, and scientifically driven. If the hackathon atmosphere was any indication of R's future, I'm feeling pretty optimistic about where things are going.
</p>
<p>In closing: we’re really grateful to the people and organizations that made the hackathon possible: <a href="http://ropensci.org/">rOpenSci</a>, <a href="http://inundata.org/">Karthik Ram</a>, <a href="http://github.com">GitHub</a>, the <a href="http://www.sloan.org/">Sloan Foundation</a>, and <a href="http://f1000research.com/">F1000 Research</a>. Thanks for strengthening the R community, giving us the chance to meet each other outside of the internet, and helping us have a great time doing R, for science, together!</p>
Writing good software can have more impact than publishing in high impact journals for genomic statisticians
2014-04-07T10:46:16+00:00
http://simplystats.github.io/2014/04/07/writing-good-software-can-have-more-impact-than-publishing-in-high-impact-journals-for-genomic-statisticians
<!-- html table generated in R 3.0.3 by xtable 1.7-1 package -->
<p>Every once in a while we see computational papers published in science journals with high impact factors. Genomics related methods appear quite often in these journals. Several of my junior colleagues express frustration that all their papers get rejected from these journals. I tell them that the same is true for most of my papers and remind them of these examples:</p>
<!-- Sat Apr 5 22:41:28 2014 -->
<table border="1">
<tr>
<th>
Method
</th>
<th>
Journal
</th>
<th>
Year
</th>
<th>
#Citations
</th>
</tr>
<tr>
<td>
PLINK
</td>
<td>
AJHG
</td>
<td align="right">
2007
</td>
<td align="right">
6481
</td>
</tr>
<tr>
<td>
Bioconductor
</td>
<td>
Genome Biology
</td>
<td align="right">
2004
</td>
<td align="right">
5973
</td>
</tr>
<tr>
<td>
RMA
</td>
<td>
Biostatistics
</td>
<td align="right">
2003
</td>
<td align="right">
5674
</td>
</tr>
<tr>
<td>
limma
</td>
<td>
SAGMB
</td>
<td align="right">
2004
</td>
<td align="right">
5637
</td>
</tr>
<tr>
<td>
quantile normalization
</td>
<td>
Bioinformatics
</td>
<td align="right">
2003
</td>
<td align="right">
4646
</td>
</tr>
<tr>
<td>
Bowtie
</td>
<td>
Genome Biology
</td>
<td align="right">
2009
</td>
<td align="right">
3849
</td>
</tr>
<tr>
<td>
BWA
</td>
<td>
Bioinformatics
</td>
<td align="right">
2009
</td>
<td align="right">
3327
</td>
</tr>
<tr>
<td>
Loess normalization
</td>
<td>
NAR
</td>
<td align="right">
2002
</td>
<td align="right">
3313
</td>
</tr>
<tr>
<td>
qvalues
</td>
<td>
JRSS-B
</td>
<td align="right">
2002
</td>
<td align="right">
2758
</td>
</tr>
<tr>
<td>
tophat
</td>
<td>
Bioinformatics
</td>
<td align="right">
2008
</td>
<td align="right">
1868
</td>
</tr>
<tr>
<td>
vsn
</td>
<td>
Bioinformatics
</td>
<td align="right">
2002
</td>
<td align="right">
1398
</td>
</tr>
<tr>
<td>
GCRMA
</td>
<td>
JASA
</td>
<td align="right">
2004
</td>
<td align="right">
1397
</td>
</tr>
<tr>
<td>
MACS
</td>
<td>
Genome Biology
</td>
<td align="right">
2008
</td>
<td align="right">
1277
</td>
</tr>
<tr>
<td>
deseq
</td>
<td>
Genome Biology
</td>
<td align="right">
2010
</td>
<td align="right">
1264
</td>
</tr>
<tr>
<td>
CBS
</td>
<td>
Biostatistics
</td>
<td align="right">
2004
</td>
<td align="right">
1051
</td>
</tr>
<tr>
<td>
R/qtl
</td>
<td>
Bioinformatics
</td>
<td align="right">
2003
</td>
<td align="right">
1027
</td>
</tr>
</table>
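<p>As the comments in the HTML above indicate, the table was generated in R with the xtable package. Here is a minimal sketch of how a table like this can be produced (only the first three rows are re-entered, with the citation counts copied from the table above):</p>
<pre>library(xtable)
## First three rows of the table above, re-entered by hand
methods <- data.frame(
  Method    = c("PLINK", "Bioconductor", "RMA"),
  Journal   = c("AJHG", "Genome Biology", "Biostatistics"),
  Year      = c(2007, 2004, 2003),
  Citations = c(6481, 5973, 5674)
)
## Render the data frame as an HTML table, the format embedded in this post
print(xtable(methods), type = "html", include.rownames = FALSE)</pre>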
<p>Let me know of other examples in the comments.</p>
<p>update: I added one more to the list.</p>
This is how an important scientific debate is being used to stop EPA regulation
2014-04-01T09:13:08+00:00
http://simplystats.github.io/2014/04/01/this-is-how-an-important-scientific-debate-is-being-used-to-stop-epa-regulation
<p dir="ltr">
Environmental regulation in the United States has protected human health for over 40 years. Since the Clean Air Act was enacted in 1970, levels of outdoor air pollution have dropped dramatically, changing the landscape of once heavily-polluted cities like Los Angeles and Pittsburgh. A 2011 <a href="http://www.epa.gov/air/sect812/prospective2.html">cost-benefit analysis</a> conducted by the U.S. Environmental Protection Agency estimated that the 1990 amendments to the CAA prevented 160,000 deaths and 13 million lost work days in the year 2010 alone. They estimated that the monetary benefits of the CAA were 30 times greater than the costs of implementing the regulations.
</p>
<p dir="ltr">
The benefits of environmental regulations like the CAA significantly outweigh their costs. But there are still costs, and those costs must be borne by someone. The burden is usually put on the polluters, such as the automobile and power generation industries, which have long fought any notion of air pollution regulation as a threat to their existence. Initially, as air pollution and health studies were still emerging, opponents of regulation often challenged the science itself, claiming flaws in the methodology, the measurements, or the interpretation. But when study after study demonstrated a connection between outdoor air pollution and a variety of health problems, it became increasingly difficult for critics to mount a credible challenge. Lawsuits are another tactic used by industry, with one case brought by the American Trucking Association going all the way to the <a href="http://www.oyez.org/cases/2000-2009/2000/2000_99_1257">U.S. Supreme Court</a>.
</p>
<p>The latest attack comes from the House of Representatives in the form of the <a href="http://beta.congress.gov/bill/113th-congress/house-bill/4012">Secret Science Reform Act</a>, or H.R. 4012. In summary, the proposed bill requires that every scientific paper cited by the EPA to justify a new rule or regulation needs to be reproducible. What exactly does this mean? To answer that question we need to take a brief diversion into some recent important developments in statistical science.</p>
<p>The idea behind reproducibility is simple. All the data used in a scientific paper and all the computer code used to analyze that data should be made available to other researchers and the public. It may be surprising that much of this data actually isn’t already available. The primary reason most data isn’t available is because, until recently, most people didn’t ask scientists for their data. The data was often small and collected for a specific purpose so other scientists and the general public just weren’t that interested. If a scientist were interested in checking the truth of a claim, she could simply repeat the experiment in her lab to see if the claim could be replicated.</p>
<p>The nature of science has changed quickly over the last three decades. There has been an explosion of data, fueled by the decreasing cost of data collection technologies and computing power. At the same time, increased access to sophisticated computing power has let scientists conduct more sophisticated analyses on their data. The massive growth in data and the increasing sophistication of the analyses has made communicating what was done in a scientific study more complicated.</p>
<p>The traditional medium of journal publications has proven to be inadequate for describing the important details of a data analysis. As a result, it has been said that scientific articles are merely the “advertising” for the research that was conducted. The real research is buried in the data and the computer code actually used to compute the results. Journals have traditionally not required that data or computer code be published along with papers. As a result, many important details may be lost and prevent key studies from being fully reproducible.</p>
<p>The explosion of data has also made completely replicating a large study by an independent scientist much more difficult and costly. A large study is expensive to conduct in the first place; there is usually little appetite or funding to repeat it. The result is that much of published scientific research cannot be reproduced by other scientists because the necessary data and analytic details are not available to others.</p>
<p>The scientific community is currently engaged in a debate over how to improve reproducibility across all of science. You might be tempted to ask, why not just share the data? Even if we could get everyone to agree with that in principle, it’s not clear how to do it.</p>
<p>Imagine if everyone in the U.S. decided we were all going to share our movie collections, and suppose for the sake of this example that the movie industry did not object. How would it work? Numerous questions immediately arise. Where would all these movies be stored? How would they be transferred from one person to another? How would I know what movies everyone else had? If my movies are all on the old DVD format, do I need to convert them to some other format before I can share? My Internet connection is very slow, how can I download a 3 hour HD movie? My mother doesn’t use computers much, but she has a great movie collection that I think others should have access to. What should she do? And who is going to pay for all of this? While each question may have a reasonable answer, it’s not clear what is the optimal combination and how you might scale it to the entire country.</p>
<p>Some of you may recall that the music industry had a brilliant sharing service that essentially allowed everyone to share their music collections. It was called Napster. Napster solved many of the problems raised above except for one – they failed to survive. So even when a decent solution is found, there’s no guarantee that it will always be there.</p>
<p>As outlandish as this example may seem, minor variations on these exact questions come up when we discuss how to share scientific data. The volume of data being produced today is enormous and making all of it available to everyone is not an easy task. That’s not to say it is impossible. If smart people get together and work constructively, it is entirely possible that a reasonable approach could be found. But at this point, a credible long-term solution has yet to emerge.</p>
<p>This brings us back to the Secret Science Reform Act. The latest tactic by opponents of air quality regulation is to force the EPA to ensure that all of the studies that it cites to support new regulations are reproducible. A cursory reading of the bill gives the impression that the sponsors are genuinely concerned about making science more transparent to the public. But when one reads the language of the bill in the context of ongoing discussions about reproducibility, it becomes clear that the sponsors of the bill have no such goal in mind. The purpose of H.R. 4012 is to prevent the Environmental Protection Agency from proposing new regulations.</p>
<p>The EPA develops rules and regulations on the basis of scientific evidence. For example, the Clean Air Act requires EPA to periodically review the scientific literature for the latest evidence on the health effects of air pollution. The science the EPA considers needs to be published in peer-reviewed journals. This makes the EPA a key consumer of scientific knowledge and it uses this knowledge to make informed decisions about protecting public health. What the EPA is not is a large funder of scientific studies. The entire budget for the Office of Research and Development at EPA is roughly $550 million (<a href="http://nepis.epa.gov/Exe/ZyNET.exe/P100GCS2.TXT?ZyActionD=ZyDocument&Client=EPA&Index=2011+Thru+2015&Docs=&Query=&Time=&EndTime=&SearchMethod=1&TocRestrict=n&Toc=&TocEntry=&QField=&QFieldYear=&QFieldMonth=&QFieldDay=&IntQFieldOp=0&ExtQFieldOp=0&XmlQuery=&File=D%3A%5Czyfiles%5CIndex%20Data%5C11thru15%5CTxt%5C00000007%5CP100GCS2.txt&User=ANONYMOUS&Password=anonymous&SortMethod=h%7C-&MaximumDocuments=1&FuzzyDegree=0&ImageQuality=r75g8/r75g8/x150y150g16/i425&Display=p%7Cf&DefSeekPage=x&SearchBack=ZyActionL&Back=ZyActionS&BackDesc=Results%20page&MaximumPages=1&ZyEntry=1&SeekPage=x&ZyPURL">fiscal 2014</a>), or less than 2 percent of the budget for the National Institutes of Health (about $30 billion for fiscal 2014). This means EPA has essentially no influence over the scientists behind many of the studies it cites because it funds very few of those studies. The best the EPA can do is politely ask scientists to make their data available. If a scientist refuses, there’s not much the EPA can use as leverage.</p>
<p dir="ltr">
The latest controversy to come up involves the <a href="http://www.ncbi.nlm.nih.gov/pubmed/8179653">Harvard Six Cities study</a> published in 1993. This landmark study found a large difference in mortality rates comparing cities with high and low air pollution, even after adjusting for smoking and other factors. The House committee has been trying to make the data for this study publicly available so that it can ensure that regulations are “<a href="http://online.wsj.com/news/articles/SB10001424127887323829104578624562008231682">backed by good science</a>”. However, the Committee has either forgotten or never knew that this particular study <a href="http://www.ncbi.nlm.nih.gov/pubmed/16020032">has been fully reproduced by independent investigators</a>. In 2005, independent investigators found that they were “...<a href="http://www.ncbi.nlm.nih.gov/pubmed/16020032">able to reproduce virtually all of the original numerical results</a>, including the 26 percent increase in all-cause mortality in the most polluted city (Stubenville, OH) as compared to the least polluted city (Portage, WI). The audit and validation of the Harvard Six Cities Study conducted by the reanalysis team generally confirmed the quality of the data and the numerical results reported by the original investigators.”
</p>
<p>It would be hard to find an air pollution study that has been subject to more scrutiny than the Six Cities studies. Even if you believed the Six Cities study was totally wrong, its original findings have been replicated numerous times since its publication, with different investigators, in different populations, using different analysis techniques, and in different countries. If you’re looking for an example where the science was either not reproducible or not replicable, sorry, but this is not your case study.</p>
<p>Ultimately, it is clear that the sponsors of this bill are cynically taking advantage of a genuine (but difficult) scientific debate over reproducibility to push a political agenda. Scientists are in agreement that reproducibility is important, but there is no consensus yet on how to make it happen for everyone. By forcing the EPA to ensure reproducibility of the science on which it bases regulation, lawmakers are asking the EPA to solve a problem that the entire scientific community has yet to figure out. The end result of passing a bill like H.R. 4012 is that the EPA will be forced to stop proposing any new regulation, handing a major victory to opponents of air quality standards and dealing a major blow to public health in the U.S.</p>
Data Analysis for Genomics edX Course
2014-03-31T10:10:37+00:00
http://simplystats.github.io/2014/03/31/data-analysis-for-genomic-edx-course
<p>Mike Love (@mikelove) and I have been working hard the past couple of months preparing a free online <a href="https://www.edx.org/">edX</a> course on <a href="https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401">data analysis for genomics</a>. Our target audience are the postdocs, graduate students and research scientists that are tasked with analyzing genomics data, but don’t have any formal training. The eight week course will start with the very basics, but will ramp up rather quickly and end with real-life workflows for genome variation, RNA-seq, DNA methylation, and ChIP-seq.</p>
<p>Throughout the course students will learn skills and concepts that provide a foundation for analyzing genomics data. Specifically, we will cover exploratory data analysis, basic statistical inference, linear regression, modeling with parametric distributions, empirical Bayes, multiple comparison corrections and smoothing techniques.</p>
<p>In the class we will make heavy use of computer labs. Almost every lecture is accompanied by an R markdown document that students can use to recreate the plots shown in the lectures. The html documents resulting from these R markdown files will serve as a textbook for the class.</p>
<p>Questions will be discussed on online forums led by Stephanie Hicks (@stephaniehicks) and Jim MacDonald.</p>
<p>If you want to sign up, <a href="https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401">here</a> is the link.</p>
A non-comprehensive comparison of prominent data science programs on cost and frequency.
2014-03-26T10:07:45+00:00
http://simplystats.github.io/2014/03/26/a-non-comprehensive-comparison-of-prominent-data-science-programs-on-cost-and-frequency
<p><a href="http://simplystatistics.org/2014/03/26/a-non-comprehensive-comparison-of-prominent-data-science-programs-on-cost-and-frequency/screen-shot-2014-03-26-at-9-29-53-am/" rel="attachment wp-att-2872"><img class="alignnone size-full wp-image-2872" alt="Screen Shot 2014-03-26 at 9.29.53 AM" src="http://simplystatistics.org/wp-content/uploads/2014/03/Screen-Shot-2014-03-26-at-9.29.53-AM.png" width="743" height="226" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/Screen-Shot-2014-03-26-at-9.29.53-AM-300x91.png 300w, http://simplystatistics.org/wp-content/uploads/2014/03/Screen-Shot-2014-03-26-at-9.29.53-AM.png 743w" sizes="(max-width: 743px) 100vw, 743px" /></a></p>
<p>We did a really brief comparison of a few notable data science programs for a grant submission we were working on. I thought it was pretty fascinating, so I’m posting it here. A couple of notes about the table.</p>
<ol>
<li>
<p>Our program can be taken for free, which includes assessments. If you want the official certificate and to take the capstone you pay the above costs.</p>
</li>
<li>
<p>Udacity’s program can also be taken for free, but if you want the official certificate, assessments, or tutoring you pay the above costs.</p>
</li>
<li>
<p>The asterisks denote programs where you get an official master’s degree.</p>
</li>
<li>
<p>The MOOC programs (Udacity’s and ours) offer the most flexibility in terms of student schedules. Ours is the most flexible, with courses running every month. The in-person programs have the least flexibility but obviously the most direct instructor time.</p>
</li>
<li>
<p>The programs are all quite different in terms of focus, design, student requirements, admissions, instruction, cost and value.</p>
</li>
<li>
<p>As far as we know, ours is the only one where every bit of lecture content has been open sourced (<a href="https://github.com/DataScienceSpecialization">https://github.com/DataScienceSpecialization</a>).</p>
</li>
</ol>
The fact that data analysts base their conclusions on data does not mean they ignore experts
2014-03-24T10:00:42+00:00
http://simplystats.github.io/2014/03/24/the-fact-that-data-analysts-base-their-conclusions-on-data-does-not-mean-they-ignore-experts
<p>Paul Krugman recently <a href="http://krugman.blogs.nytimes.com/2014/03/23/tarnished-silver/?_php=true&_type=blogs&_php=true&_type=blogs&_r=1&">joined</a> the new FiveThirtyEight hating <a href="http://www.salon.com/2014/03/18/nate_silvers_new_fivethirtyeight_is_getting_some_high_profile_bad_reviews/">bandwagon</a>. I am not crazy about the new website either (although I’ll wait more than one week<del>s</del> before judging) but in a recent post Krugman creates a false dichotomy that is important to correct. Krugman<del>am</del> states that “[w]hat [Nate Silver] seems to have concluded is that there are no experts anywhere, that a smart data analyst can and should ignore all that.” I don’t think that is what Nate Silver<del>,</del> nor any other smart data scientist or applied statistician has concluded. Note that to build his election prediction model, Nate had to understand how the electoral college works, how polls work, how different polls are different, the relationship between primaries and presidential election, among many other details specific to polls and US presidential elections. He learned all of this by reading and talking to experts. Same is true for PECOTA where data analysts who know quite a bit about baseball collect data to create meaningful and predictive summary statistics. As Jeff said before, <a href="http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/">the key word in “Data Science” is not Data, it is Science</a>.</p>
<p>The <a href="http://fivethirtyeight.com/features/disasters-cost-more-than-ever-but-not-because-of-climate-change/">one example</a> Krugman points to as ignoring experts appears to be written by someone who, <a href="http://thinkprogress.org/climate/2014/03/19/3416369/538-climate-article/">according to the article that Krugman links to</a>, was biased by his own opinions, not by data analysis that ignored experts. However, in Nate’s analysis of polls and baseball data it is hard to argue that he let his bias affect his analysis. Furthermore, it is important to point out that he did not simply stick data into a black box prediction algorithm. Instead he did what most of us applied statisticians do: we build empirically inspired models but guided by expert knowledge.</p>
<p>ps - Krugman links to a <a href="http://www.nytimes.com/2014/03/22/opinion/egan-creativity-vs-quants.html?src=me&ref=general">piece</a> which has another false dichotomy as the title: “Creativity vs. Quants”. He should try doing it before assuming there is no creativity involved in extracting information from data.</p>
The 80/20 rule of statistical methods development
2014-03-20T11:10:33+00:00
http://simplystats.github.io/2014/03/20/the-8020-rule-of-statistical-methods-development
<p>Developing statistical methods is hard and often frustrating work. One of the underappreciated rules in statistical methods development is what I call the 80/20 rule (maybe it could even be the 90/10 rule). The basic idea is that the first <em>reasonable</em> thing you can do to a set of data often is 80% of the way to the optimal solution. Everything after that is working on getting the last 20%. (<em>Edit: Rafa points out that once again I’ve <a href="http://simplystatistics.org/2011/12/03/reverse-scooping/">reverse-scooped</a> a bunch of people and this is already a thing that has been pointed out many times. See for example the <a href="http://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a> and <a href="http://c2.com/cgi/wiki?EightyTwentyRule">this post</a> also called the 80:20 rule</em>)</p>
<p>Sometimes that extra 20% is really important and sometimes it isn’t. In a clinical trial, where each additional patient may cost a large amount of money to recruit and enroll, it is definitely worth the effort. For more exploratory techniques like those often used when analyzing high-dimensional data it may not. This is particularly true because the extra 20% usually comes at a cost of additional assumptions about the way the world works. If your assumptions are right, you get the 20%; if they are wrong, you may lose and it isn’t always clear how much.</p>
<p>Here is a very simple example of the 80/20 rule from frequentist statistics - in my experience similar ideas hold in machine learning and Bayesian inference as well. Suppose that I collect some observations $X_1, \ldots, X_n$ and want to test whether the mean of the observations is greater than 0. Suppose I know that the data are normal and that the variance is equal to 1. Then the absolute best statistical test (called the <a href="http://en.wikipedia.org/wiki/Uniformly_most_powerful_test">uniformly most powerful test</a>) you could do rejects the hypothesis that the mean is zero if $\bar{X} > z_{\alpha}\left(\frac{1}{\sqrt{n}}\right)$.</p>
<p>There are a bunch of other tests you could do though. If you assume the distribution is symmetric you could also use the sign test to test the same hypothesis by creating the random variables $Y_i = 1(X_i > 0)$ and testing the hypothesis $H_0: \Pr(Y_i = 1) = 0.5$ versus the alternative that the probability is greater than 0.5. Or you could use the one-sided t-test. Or you could use the Wilcoxon test. These are suboptimal if you <em>know</em> the data are Normal with variance one.</p>
<p>I tried each of these tests with a sample of size $n = 20$ at the $\alpha = 0.05$ level. In the plot below I show the ratio of power between each non-optimal test and the optimal z-test (you could do this theoretically but I’m lazy so did it with simulation, <a href="https://gist.github.com/jtleek/9665572">code here</a>, colors by <a href="http://alyssafrazee.com/RSkittleBrewer.html">RSkittleBrewer</a>).</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/relpower-3/" rel="attachment wp-att-2830"><img class=" wp-image-2830 aligncenter" alt="relpower" src="http://simplystatistics.org/wp-content/uploads/2014/03/relpower2.png" width="504" height="469" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/relpower2-300x279.png 300w, http://simplystatistics.org/wp-content/uploads/2014/03/relpower2.png 720w" sizes="(max-width: 504px) 100vw, 504px" /></a>
</p>
<p>The tests get to 80% of the power of the z-test for different sizes of the true mean (0.6 for Wilcoxon, 0.5 for the t-test, and 0.85 for the sign test). Overall, these methods very quickly catch up to the optimal method.</p>
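<p>For readers who want to try this themselves, here is a minimal sketch of this kind of power simulation. This is not the gist linked above; the sample size, effect size, and number of simulations are illustrative:</p>
<pre>## Compare the power of the z-test (optimal here) with the t-test, Wilcoxon, and sign tests
set.seed(1)
n <- 20; alpha <- 0.05; mu <- 0.5; nsim <- 10000
reject <- matrix(FALSE, nsim, 4,
                 dimnames = list(NULL, c("z", "t", "wilcoxon", "sign")))
for (i in 1:nsim) {
  x <- rnorm(n, mean = mu, sd = 1)
  reject[i, "z"]        <- mean(x) > qnorm(1 - alpha) / sqrt(n)
  reject[i, "t"]        <- t.test(x, alternative = "greater")$p.value < alpha
  reject[i, "wilcoxon"] <- wilcox.test(x, alternative = "greater")$p.value < alpha
  reject[i, "sign"]     <- binom.test(sum(x > 0), n, alternative = "greater")$p.value < alpha
}
colMeans(reject)                        # estimated power of each test
colMeans(reject) / mean(reject[, "z"])  # power relative to the optimal z-test</pre>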
<p>In this case, the non-optimal methods aren’t much easier to implement than the optimal solution. But in many cases, the optimal method requires significantly more computation, memory, assumptions, theory, or some combination of the four. The hard part of deciding whether to create a new method is figuring out whether the 20% is worth it. This is obviously application specific.</p>
<p>An important corollary of the 80/20 rule is that you can have a huge impact on new technologies if you are the first to suggest an already known 80% solution. For example, the first person to suggest <a href="http://www.pnas.org/content/95/25/14863.long">hierarchical clustering</a> or the <a href="http://www.pnas.org/content/97/18/10101.abstract">singular value decomposition</a> for a new high-dimensional data type will often get a large number of citations. But that is a hard way to make a living - you aren’t the only person who knows about these methods and the person who says it first soaks up a huge fraction of the credit. So the only way to take advantage of this corollary is to spend your time constantly trying to figure out what the next big technology will be. And you know what they say about prediction being hard, especially about the future.</p>
The time traveler's challenge.
2014-03-19T09:50:00+00:00
http://simplystats.github.io/2014/03/19/the-end-of-the-world-challenge
<p><em>Editor’s note: This has nothing to do with statistics. </em></p>
<p>I do a lot of statistics for a living and would claim to know a relatively large amount about it. I also know a little bit about a bunch of other scientific disciplines, a tiny bit of engineering, a lot about pointless sports trivia, some current events, the geography of the world (vaguely) and the geography of places I’ve lived (pretty well).</p>
<p>I have often wondered, if I was transported back in time to a point before the discovery of say, how to make a fire, how much of human knowledge I could recreate. In other words, what would be the marginal effect on the world of a single person (me) being transported back in time. I could propose Newton’s Laws, write down a bunch of the basis of calculus, and discover the central limit theorem. I probably couldn’t build an internal combustion engine - I know the concept but don’t know enough of the details. So the challenge is this.</p>
<p><em> If you were transported back 4,000 or 5,000 years, how much could you accelerate human knowledge?</em></p>
<p>When I told Leah J. about this idea she came up with an even more fascinating variant.</p>
<p><em>Suppose that I told you that in 5 days you were going to be transported back 4,000 or 5,000 years but you couldn’t take anything with you. What would you read about on Wikipedia? </em></p>
ENAR is in Baltimore - Here's What To Do
2014-03-14T14:18:20+00:00
http://simplystats.github.io/2014/03/14/enar-is-in-baltimore-heres-what-to-do
<p>This year’s meeting of the Eastern North American Region of the International Biometric Society (ENAR) is in lovely Baltimore, Maryland. As local residents Jeff and I thought we’d put down a few suggestions for what to do during your stay here in case you’re not familiar with the area.</p>
<p><strong>Venue</strong></p>
<p>The conference is being held at the Marriott in the Harbor East area of the city, which is relatively new and a great location. There are a number of good restaurants right in the vicinity, including <a href="http://www.witandwisdombaltimore.com">Wit & Wisdom</a> in the Four Seasons hotel across the street and <a href="http://www.pabuizakaya.com">Pabu</a>, an excellent Japanese restaurant that I personally believe is the best restaurant in Baltimore (a very close second is <a href="http://www.woodberrykitchen.com">Woodberry Kitchen</a>, which is a bit farther away near Hampden). If you go to Pabu, just don’t get sushi; try something new for a change. Around Harbor East you’ll also find a <a href="http://www.cgeno.com">Cinghiale</a> (excellent northern Italian restaurant), <a href="http://www.charlestonrestaurant.com">Charleston</a> (expensive southern food), <a href="http://www.lebanesetaverna.com/lebanese-restaurant-baltimore-md.html">Lebanese Taverna</a>, and <a href="http://www.ouzobay.com">Ouzo Bay</a>. If you’re sick of restaurants, there’s also a Whole Foods. If you want a great breakfast, you can walk just a few blocks down Aliceanna street to the <a href="http://bluemoonbaltimore.com">Blue Moon Cafe</a>. Get the eggs Benedict. If you get the Cap’n Crunch French toast, you will need a nap afterwards.</p>
<p>Just east of Harbor East is an area called Fell’s Point. This is commonly known as the “bar district” and it lives up to its reputation. <a href="http://www.maxs.com">Max’s</a> in Fell’s Point (on the square) has an obscene number of beers on tap. The <a href="http://heavyseasalehouse.com">Heavy Seas Alehouse</a> on Central Avenue has some excellent beers from the local Heavy Seas brewery and also has great food from chef <a href="https://twitter.com/Matt_Seeber">Matt Seeber</a>. Finally, the <a href="http://www.fellsgrind.com/Index.aspx">Daily Grind</a> coffee shop is a local institution.</p>
<p><strong>Around the Inner Harbor</strong></p>
<p>Outside of the immediate Harbor East area, there are a number of things to do. For kids, there’s <a href="http://www.portdiscovery.org/index.cfm?">Port Discovery</a>, which my 3-year-old son seems to really enjoy. There’s also the <a href="http://aqua.org">National Aquarium</a> where the Tuesday networking event will be held. This is also a great place for kids if you’re bringing family. There’s a neat <a href="https://www.google.com/maps/place/39°17'07.5%22N+76°36'20.4%22W/@39.285406,-76.605667,15z/data=!3m1!4b1!4m2!3m1!1s0x0:0x0">little park on Pier 6</a> that is small, but has a number of kid-related things to do. It’s a nice place to hang out when the weather is nice. Around the other side of the harbor is the <a href="http://www.mdsci.org">Maryland Science Center</a>, another kid-fun place, and just west of the Harbor down Pratt Street is the <a href="http://www.borail.org">B&O Railroad Museum</a>, which I think is good for both kids and adults (I like trains).</p>
<p>Unfortunately, at this time there’s no football or baseball to watch.</p>
<p><strong>Around Baltimore</strong></p>
<p>There are a lot of really interesting things to check out around Baltimore if you have the time. If you need to get around downtown and the surrounding areas there’s the <a href="http://www.charmcitycirculator.com">Charm City Circulator</a> which is a free bus that runs every 15 minutes or so. The Mt. Vernon district has a number of cultural things to do. For classical music fans there’s the wonderful <a href="http://bsomusic.org">Baltimore Symphony Orchestra</a> directed by Marin Alsop. The <a href="http://www.peabody.jhu.edu">Peabody Institute</a> often has some interesting concerts going on given by the students there. There’s the <a href="http://thewalters.org">Walters Art Museum</a>, which is free, and has a very interesting collection. There are also a number of good restaurants and coffee shops in Mt. Vernon, like <a href="http://www.doobyscoffee.com">Dooby’s</a> (excellent dinner) and <a href="https://redemmas.org">Red Emma’s</a> (lots of Noam Chomsky).</p>
<p>That’s all I can think of right now. If you have other questions about Baltimore while you’re here for ENAR tweet us up at @simplystats.</p>
How to use Bioconductor to find empirical evidence in support of π being a normal number
2014-03-14T10:00:19+00:00
http://simplystats.github.io/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number
<p>Happy π day everybody!</p>
<p>I wanted to write some simple code (included below) to test the parallelization capabilities of my new cluster. So, in honor of π day, I decided to check for <a href="http://www.davidhbailey.com/dhbpapers/normality.pdf">evidence that π is a normal number</a>. A <a href="http://en.wikipedia.org/wiki/Normal_number">normal number</a> is a real number whose infinite sequence of digits has the property that any given m-digit pattern appears with limiting frequency 10<sup>−m</sup>. For example, using the Poisson approximation, we can predict that the pattern “123456789” should show up between 0 and 3 times in the <a href="http://stuff.mit.edu/afs/sipb/contrib/pi/">first billion digits of π</a> (it actually shows up twice, starting at the 523,551,502-th and 773,349,079-th decimal places).</p>
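<p>As a quick back-of-the-envelope check of that Poisson claim (a sketch, assuming the digits behave like iid uniform draws):</p>
<pre>## Expected number of occurrences of a fixed 9-digit pattern in the first 10^9 digits
lambda <- (1e9 - 9 + 1) * 10^-9   # number of starting positions times 10^-9, about 1
ppois(3, lambda)                  # P(0 to 3 occurrences), about 0.98
dpois(2, lambda)                  # P(exactly 2 occurrences), about 0.18</pre>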
<p>To test our hypothesis, let Y<sub>1</sub>, …, Y<sub>100</sub> be the number of “00”, “01”, …, “99” in the first billion digits of π. If π is in fact normal then the Ys should be approximately IID binomials with N=1 billion and p=0.01. In the qq-plot below I show Z-scores (Y - 10,000,000) / √9,900,000 which appear to follow a normal distribution as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3,4,5,6, and 7 digit patterns (for 5,6 and 7 I sampled 10,000 patterns). Note that we can perform a chi-square test for the uniform distribution as well. For patterns of size 1,2,3 the p-values were 0.84, <del>0.89,</del> 0.92, and 0.99.</p>
<p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/pi-3/" rel="attachment wp-att-2792"><img class="alignnone size-full wp-image-2792" alt="pi" src="http://simplystatistics.org/wp-content/uploads/2014/03/pi2.png" width="4800" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/pi2-300x187.png 300w, http://simplystatistics.org/wp-content/uploads/2014/03/pi2-1024x640.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/pi2.png 4800w" sizes="(max-width: 4800px) 100vw, 4800px" /></a></p>
<p>Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.</p>
<p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/pi2/" rel="attachment wp-att-2793"><img class="alignnone size-full wp-image-2793" alt="pi2" src="http://simplystatistics.org/wp-content/uploads/2014/03/pi21.png" width="5600" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/pi21-1024x548.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/pi21.png 5600w" sizes="(max-width: 5600px) 100vw, 5600px" /></a></p>
<p>These observed counts should also be independent, and to explore this we can look at autocorrelation plots:</p>
<p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/piacf-2/" rel="attachment wp-att-2794"><img class="alignnone size-full wp-image-2794" alt="piacf" src="http://simplystatistics.org/wp-content/uploads/2014/03/piacf1.png" width="5600" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/piacf1-1024x548.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/piacf1.png 5600w" sizes="(max-width: 5600px) 100vw, 5600px" /></a></p>
<p>To do this in about an hour and with just a few lines of code (included below), I used the <a href="http://www.bioconductor.org/">Bioconductor</a> <a href="http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html">Biostrings</a> package to match strings and the <code class="language-plaintext highlighter-rouge">foreach</code> function to parallelize.</p>
<pre>library(Biostrings)
library(doParallel)
registerDoParallel(cores = 48)
x=scan("pi-billion.txt",what="c")
x=substr(x,3,nchar(x)) ##remove 3.
x=BString(x)
n<-length(x)
par(mfrow=c(2,3))
for(d in 2:7){ ## 2- through 7-digit patterns, as described in the text
p <- 1/(10^d) ## probability of any given d-digit pattern
if(d<5){
patterns<-sprintf(paste0("%0",d,"d"),seq(0,10^d-1))
} else{
patterns<-sprintf(paste0("%0",d,"d"),sample(10^d,10^4)-1)
}
res <- foreach(pat=patterns,.combine=c) %dopar% countPattern(pat,x)
z <- (res - n*p ) / sqrt( n*p*(1-p) )
qqnorm(z,xlab="Theoretical quantiles",ylab="Observed z-scores",main=paste(d,"digits"))
abline(0,1)
##correction: original post had length(res)
if(d<5) print(1-pchisq(sum ((res - n*p)^2/(n*p)),length(res)-1))
}
###Now count in segments
d <- 1
m <-10^5
patterns <-sprintf(paste0("%0",d,"d"),seq(0,10^d-1))
res <- foreach(pat=patterns,.combine=cbind) %dopar% {
tmp<-start(matchPattern(pat,x))
tmp2<-floor( (tmp-1)/m)
return(tabulate(tmp2+1,nbins=n/m))
}
##qq-plots
par(mfrow=c(2,5))
p <- 1/(10^d)
for(i in 1:ncol(res)){
z <- (res[,i] - m*p) / sqrt( m*p*(1-p) )
qqnorm(z,xlab="Theoretical quantiles",ylab="Observed z-scores",main=paste(i-1))
abline(0,1)
}
##ACF plots
par(mfrow=c(2,5))
for(i in 1:ncol(res)) acf(res[,i])</pre>
<p>NB: A normal number has the above stated property in any base. The examples above are for base 10.</p>
Oh no, the Leekasso....
2014-03-12T09:38:31+00:00
http://simplystats.github.io/2014/03/12/oh-no-the-leekasso
<p>An astute reader (Niels Hansen, who is visiting our department today) caught a bug in <a href="https://github.com/jtleek/leekasso">my code</a> on Github for the Leekasso. I had:</p>
<p><em>lm1 = lm(y ~ leekX)</em></p>
<p><em>predict.lm(lm1, as.data.frame(leekX2))</em></p>
<p>Unfortunately, this meant that I was getting predictions for the training set on the test set. Since I set up the test/training sets the same, this meant that I was actually getting training set error rates for the Leekasso. Niels Hansen noticed the bug and reran the fixed code with this term instead:</p>
<p><em>lm1 = lm(y ~ ., data = as.data.frame(leekX))</em></p>
<p><em>predict.lm(lm1, as.data.frame(leekX2))</em></p>
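<p>To see concretely why the original call returns training-set predictions, here is a small simulated sketch (the data and dimensions are made up for illustration). When the formula references the matrix <em>leekX</em> itself, <em>predict.lm</em> cannot find a variable with that name in the new data frame, so it falls back to the training matrix in the formula’s environment:</p>
<pre>## Simulated illustration of the bug; data are made up
set.seed(1)
n <- 20; p <- 10
leekX  <- matrix(rnorm(n * p), n, p)   # "training" predictors
leekX2 <- matrix(rnorm(n * p), n, p)   # "test" predictors
y <- rnorm(n)

## Buggy version: the formula captures the matrix leekX itself, so predict()
## ignores the new data and returns the training-set fitted values
lm1 <- lm(y ~ leekX)
bad <- predict.lm(lm1, as.data.frame(leekX2))
all.equal(unname(bad), unname(fitted(lm1)))   # TRUE: these are training predictions

## Fixed version: fit on a data frame so predict() can match columns in the new data
lm2 <- lm(y ~ ., data = as.data.frame(leekX))
ok  <- predict.lm(lm2, as.data.frame(leekX2))  # genuine test-set predictions</pre>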
<p>He created a heatmap subtracting the average accuracy of the Leekasso/Lasso and showed they are essentially equivalent.</p>
<p><a href="http://simplystatistics.org/2014/03/12/oh-no-the-leekasso/leekassolasso/" rel="attachment wp-att-2553"><img alt="LeekassoLasso" src="http://simplystatistics.org/wp-content/uploads/2014/01/LeekassoLasso-300x300.png" width="300" height="300" /></a></p>
<p>This is a bummer; the Leekasso isn’t a world-crushing algorithm. On the other hand, I’m happy that just choosing the top 10 is still competitive with the optimized lasso on average. More importantly, although I hate being wrong, I appreciate people taking the time to look through my code.</p>
<p>Just out of curiosity I’m taking a survey. Do you think I should publish this top10 predictor thing as a paper? Or do you think it is too trivial?</p>
Per capita GDP versus years since women received right to vote
2014-03-07T10:00:10+00:00
http://simplystats.github.io/2014/03/07/per-capita-gdp-versus-years-since-women-received-right-to-vote
<p>Below is a plot of per capita GDP (in log scale) against years since women received the right to vote for 42 countries. Is this cause, effect, both or neither? We all know correlation does not imply causation, but I see many (non-statistical) arguments to support both cause and effect here. Happy <a href="http://en.wikipedia.org/wiki/International_Women's_Day">International Women’s Day</a>! <a href="http://simplystatistics.org/2014/03/07/per-capita-gdp-versus-years-since-women-received-right-to-vote/rplot/" rel="attachment wp-att-2766"><img class="alignnone size-full wp-image-2766" alt="Rplot" src="http://simplystatistics.org/wp-content/uploads/2014/03/Rplot.png" width="983" height="591" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/Rplot-300x180.png 300w, http://simplystatistics.org/wp-content/uploads/2014/03/Rplot.png 983w" sizes="(max-width: 983px) 100vw, 983px" /></a></p>
<p>The data is from <a href="http://www.infoplease.com/ipa/A0931343.html">here</a> and <a href="http://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita">here</a>. I removed countries where women have had the right to vote for less than 20 years.</p>
<p>pd - What’s with Switzerland?</p>
<p>update - R^2 and p-value added to graph</p>
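<p>For anyone who wants to make a similar figure, here is a rough sketch of the plotting code. The file and column names are hypothetical; the actual data come from the two links above:</p>
<pre>## Hypothetical sketch: per capita GDP (log scale) vs. years since women's suffrage
dat <- read.csv("suffrage_gdp.csv")   # assumed columns: country, year_suffrage, gdp_pc
dat$years <- 2014 - dat$year_suffrage
dat <- subset(dat, years >= 20)       # drop countries with fewer than 20 years, as in the post

plot(dat$years, dat$gdp_pc, log = "y",
     xlab = "Years since women received the right to vote",
     ylab = "Per capita GDP (log scale)")
fit <- lm(log(gdp_pc) ~ years, data = dat)
legend("topleft", bty = "n",
       legend = sprintf("R^2 = %.2f, p = %.2g",
                        summary(fit)$r.squared,
                        summary(fit)$coefficients[2, 4]))</pre>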
PLoS One, I have an idea for what to do with all your profits: buy hard drives
2014-03-05T11:07:03+00:00
http://simplystats.github.io/2014/03/05/plos-one-i-have-an-idea-for-what-to-do-with-all-your-profits-buy-hard-drives
<p>I’ve been closely following the fallout from PLoS One’s <a href="http://www.plos.org/data-access-for-the-open-access-literature-ploss-data-policy/">new policy for data sharing</a>. The policy says, basically, that if you publish a paper, all data and code to go with that paper should be made publicly available at the time of publishing and include an explicit data sharing policy in the paper they submit.</p>
<p>I think the reproducibility debate is over. Data should be made available when papers are published. The <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">Potti scandal </a>and the <a href="http://simplystatistics.org/2013/04/19/podcast-7-reinhart-rogoff-reproducibility/">Reinhart/Rogoff scandal</a> have demonstrated the extreme consequences of lack of reproducibility and the reproducibility advocates have taken this one home. The question with reproducibility isn’t “if” anymore it is “how”.</p>
<p>The transition toward reproducibility is likely to be rough for two reasons. One is that many people who generate data lack training in <a href="http://simplystatistics.org/2012/04/27/people-in-positions-of-power-that-dont-understand/">handling and analyzing data</a>, even in a data saturated field like genomics. The story is even more grim in areas that haven’t been traditionally considered “data rich” fields.</p>
<p>The second problem is a cultural and economic problem. It involves the fundamental disconnect between (1) the incentives of our system for advancement, grant funding, and promotion and (2) the policies that will benefit science and improve reproducibility. Most of the debate on social media seems to conflate these two issues. I think it is worth breaking the debate down into three main constituencies: journals, data creators, and data analysts.</p>
<p><strong>Journals with requirements for data sharing</strong></p>
<p>Data sharing, especially for large data sets, isn’t easy and it isn’t cheap. Not knowing how to share data is not an excuse - to be a modern scientist this is one of the skills you have to have. But if you are a journal that <a href="http://www.nature.com/news/plos-profits-prompt-revamp-1.14205">makes huge profits</a> and you want open sharing, you should put up or shut up. The best way to do that would be to pay for storage on something like AWS for all data sets submitted to comply with your new policy. In the era of cheap hosting and standardized templates, charging $1,000 or more for an open access paper is way too much. It costs essentially nothing to host that paper online and you are getting peer review for free. So you should spend some of your profits paying for the data sharing that will benefit your journal and the scientific community.</p>
<p><strong>Data creators</strong></p>
<p>It is really hard to create a serious, research quality data set in almost any scientific discipline. If you are studying humans, it requires careful adherence to rules and procedures for handling human data. If you are in ecology, it may involve extensive field work. If you are in behavioral research, it may involve careful review of thousands of hours of video tape.</p>
<p>The value of one careful, rigorous, and interesting data set is hard to overstate. In my field, the data Leonid Kruglyak’s group generated measuring <a href="http://www.pnas.org/content/102/5/1572.long">gene expression and genetics</a> in a careful yeast experiment spawned an entirely new discipline within both genomics and statistics.</p>
<p>The problem is that to generate one really good data set can take months or even years. It is definitely possible to publish more than one paper on a really good data set. But after the data are generated, most of these papers will have to do with data analysis, not data generation. If there are ten papers that could be published on your data set and your group publishes the data with the first one, you may get to the second or third, but someone else might publish 4-10.</p>
<p>This may be good for science, but it isn’t good for the careers of data generators. Ask anyone in academics whether you’d rather have 6 citations from awesome papers or 6 awesome papers and 100% of them will take the papers.</p>
<p>I’m completely sympathetic to data generators who spend a huge amount of time creating a data set and are worried they may be scooped on later papers. This is a place where the culture of credit hasn’t caught up with the culture of science. If you write a grant and generate an amazing data set that 50 different people use - you should absolutely get major credit for that in your next grant. However, you probably shouldn’t get authorship unless you intellectually contributed to the next phase of the analysis.</p>
<p>The problem is we don’t have an intermediate form of credit for data generators that is weighted more heavily than a citation. In the short term, this lack of a proper system of credit will likely lead data generators to make the following (completely sensible) decision to hold their data close and then publish multiple papers at once - <a href="http://www.nature.com/encode/#/threads">like ENCODE did</a>. This will drive everyone crazy and slow down science - but it is the appropriate career choice for data generators until our system of credit has caught up.</p>
<p><strong>Data analysts</strong></p>
<p>I think that data analysts who are pushing for reproducibility are genuine in their desire for reproducibility. I also think that the debate is over. I think we can contribute to the success of the reproducibility transition by figuring out ways to give stronger and more appropriate credit to data generators. I don’t think authorship is the right approach. But I do think that it is the right approach to loudly and vocally give credit to people who generated the data you used in your purely data analytic paper. That includes making sure the people that are responsible for their promotion and grants know just how incredibly critical it is that they keep generating data so you can keep doing your analysis.</p>
<p>Finally, I think that we should be more sympathetic to the career concerns of folks who generate data. I have written methods and made the code available. I have then seen people write very similar papers using my methods and code - then getting credit/citations for producing a very similar method to my own. Being <a href="http://simplystatistics.org/2011/12/03/reverse-scooping/">reverse scooped</a> like this is incredibly frustrating. If you’ve ever had that experience imagine what it would feel like to spend a whole year creating a data set and then only getting one publication.</p>
<p>I also think that the primary use of reproducibility so far has been as a weapon. It has been used (correctly) to point out critical flaws in research. It has also been used as a way to embarrass authors who don’t (<a href="http://simplystatistics.org/2013/09/26/how-could-code-review-discourage-code-disclosure-reviewers-with-motivation/">and even some who do</a>) have training in data analysis. The transition to fully reproducible science can either be a painful fight or a smoother transition. One thing that would go a long way would be to think of code review/reproducibility not like peer review, but more like pull requests and issues on Github. The goal isn’t to show how the other person did it wrong, the goal is to help them do it right.</p>
Data Science is Hard, But So is Talking
2014-02-26T09:01:07+00:00
http://simplystats.github.io/2014/02/26/data-science-is-hard-but-so-is-talking
<p>Jeff, Brian, and I had to record nine separate introductory videos for our <a href="http://jhudatascience.org">Data Science Specialization</a> and, well, some of us were better at it than others. It takes a bit of practice to read effectively from a teleprompter, something that is exceedingly obvious from this video.</p>
Here's why the scientific publishing system can never be "fixed"
2014-02-21T12:41:55+00:00
http://simplystats.github.io/2014/02/21/heres-why-the-scientific-publishing-system-can-never-be-fixed
<p>There’s been much discussion recently about how the scientific publishing system is “broken”. Just the latest one that I saw was a tweet from Princeton biophysicist Josh Shaevitz:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
Editor at a ‘fancy’ journal to my postdoc “This is amazing work that will change the field. No one will publish it.” Sci. pubs are broken.
</p>
<p>
— Joshua Shaevitz (@shaevitz) <a href="https://twitter.com/shaevitz/statuses/433990986457284608">February 13, 2014</a>
</p>
</blockquote>
<p>On this blog, we’ve talked quite a bit about the publishing system, including in this interview with <a href="http://simplystatistics.org/2013/12/12/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science/">Michael</a> <a href="http://simplystatistics.org/2013/12/13/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science-part-22/">Eisen</a>. Jeff recently posted about <a href="http://simplystatistics.org/2014/02/05/just-a-thought-on-peer-reviewing-i-cant-help-myself/">changing the reviewing system</a> (again). We have a <a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/">few</a> <a href="http://simplystatistics.org/2013/10/23/the-leek-group-guide-to-reviewing-scientific-papers/">other</a> <a href="http://simplystatistics.org/2013/10/22/blog-posts-that-impact-real-science-software-review-and-gtex/">posts</a> <a href="http://simplystatistics.org/2013/09/04/repost-a-proposal-for-a-really-fast-statistics-journal/">on</a> <a href="http://simplystatistics.org/2012/01/26/when-should-statistics-papers-be-published-in-science/">this</a> <a href="http://simplystatistics.org/2011/12/14/dear-editors-associate-editors-referees-please-reject/">topic</a>. Yes, we like to complain like the best of them.</p>
<p>But there’s a simple fact: The scientific publishing system, as broken as you may find it to be, can never truly be fixed.</p>
<p>Here’s the tl;dr</p>
<ul>
<li>The collection of scientific publications out there make up a marketplace of ideas, hypotheses, theorems, conjectures, and comments about nature.</li>
<li>Each member of society has an algorithm for placing a value on each of those publications. Valuation methodologies vary, but they often include factors like the reputation of the author(s), the journal in which the paper was published, the source of funding, as well as one’s own personal beliefs about the quality of the work described in the publication.</li>
<li>Given a valuation methodology, each scientist can rank order the publications from “most valuable” to “least valuable”.</li>
<li>Fixing the scientific publication system would require forcing everyone to agree on the same valuation methodology for all publications.</li>
</ul>
<p><strong>The Marketplace of Publications</strong></p>
<p>The first point is that the collection of scientific publications make up a kind of market of ideas. Although we don’t really “trade” publications in this market, we do estimate the value of each publication and label some as “important” and some as not important. I think this is important because it allows us to draw analogies with other types of markets. In particular, consider the following question: Can you think of a market in any item where each item was priced perfectly, so that every (rational) person agreed on its value? I can’t.</p>
<p>Consider the stock market, which might be the most analyzed market in the world. Professional investors make their entire living analyzing the companies that are listed on stock exchanges and buying and selling their shares based on what they believe is the value of those companies. And yet, there can be huge disagreements over the valuation of these companies. Consider the current <a href="http://dealbook.nytimes.com/?s=herbalife">Herbalife drama</a>, where investors William Ackman and Carl Icahn (and Daniel Loeb) are taking complete opposite sides of the trade (Ackman is short and Icahn is long). They can’t both be right about the valuation; they must have different valuation strategies. Everyday, the market’s collective valuation of different companies changes, reacting to new information and perhaps to <a href="http://www.nber.org/papers/w0456">irrational behavior</a>. In the long run, <a href="http://apple.com">good companies</a> survive while <a href="http://en.wikipedia.org/wiki/Pets.com">others</a> do not. In the meantime, everyone will argue about the appropriate price.</p>
<p>Journals are in some ways like the stock exchanges of yore. There are very prestigious ones (e.g. NYSE, the “big board”) and there are less prestigious ones (e.g. NASDAQ) and everyone tries to get their publication into the prestigious journals. Journals have listing requirements–you can’t just put any publication in the journal. It has to meet certain standards set by the journal. The importance of being listed on a prestigious stock exchange has diminished somewhat over the years. The most <a href="https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=oGEHU7jbE8TF6gH9Ew">valuable company in the world</a> trades on the NASDAQ. Similarly, although Science, Nature, and the New England Journal of Medicine are still quite sought after by scientists, competition is increasing from journals (such as those from the Public Library of Science) who are willing to publish papers that are technically correct and let readers determine their importance.</p>
<p><strong>What’s the “Fix”?</strong></p>
<p>Now let’s consider a world where we obliterate journals like Nature and Science and that there’s only the “one true journal”. Suppose this journal accepts any publication that satisfies some basic technical requirements (i.e. not content-based) and then has a sophisticated rating system that allows readers to comment on, rate, and otherwise evaluate each publication. There is no pre-publication peer review. Everything is immediately published. Problem solved? Not really, in my opinion. Here’s what I think would end up happening:</p>
<ul>
<li>People would have to (slightly) alter their methodology for ranking individual scientists. They would not be able to say “so-and-so has 10 Nature papers, so he must be good”. But most likely, another proxy for actually reading the papers would arise. For example, “My buddy from University of Whatever put this paper in his top-ten list, so it must be good”. As <a href="http://simplystatistics.org/2013/12/13/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science-part-22/">Michael Eisen said in our interview</a>, the ranking system induced by journals like Science and Nature is just an abstract hierarchy; we can still reproduce the hierarchy even if Science/Nature don’t exist.</li>
<li>In the current system, certain publications often “get stuck” with overly inflated valuations and it is often difficult to effectively criticize such publications because there does not exist an equivalent venue for informed criticism on par with Science and Nature. These publications achieve such high valuations partly because they are published in high-end journals like Nature and Science, but partly it is because some people actually believe they are valuable. In other words, it is possible to create a “bubble” where people irrationally believe a publication is valuable, just because everyone believes it’s valuable. If you destroy the current publication system, there will still be publications that are “over-valued”, just like in every other market. And furthermore, it will continue to be difficult to criticize such publications. Think of all the analysts that were yelling about how the housing market was dangerously inflated back in 2007. Did anyone listen? Not until it was too late.</li>
</ul>
<p><strong>What Can be Done?</strong></p>
<p>I don’t mean for this post to be depressing, but I think there’s a basic reality about publication that perhaps is not fully appreciated. That said, I believe there are things that can be done to improve science itself, as well as the publication system.</p>
<ul>
<li><strong>Raise the <a href="http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/">ROC curves of science</a></strong>. Efforts in this direction make everyone better and improve our ability to make more important discoveries.</li>
<li><strong>Increase the reproducibility of science</strong>. This is kind of the “<a href="http://en.wikipedia.org/wiki/Sarbanes–Oxley_Act">Sarbanes-Oxley</a>” of science. For the most part, I think the debate about <em>whether</em> science should be made more reproducible is coming to a close (or it is for me). The real question is how do we do it, for all scientists? I don’t think there are enough people thinking about this question. It will likely be a mix of different strategies, policies, incentives, and tools.</li>
<li><strong>Develop more sophisticated evaluation technologies for publications</strong>. Again, to paraphrase Michael Eisen, we are better able to judge the value of a pencil on Amazon than we are able to judge a scientific publication. The technology exists for improving the system, but someone has to implement it. I think a useful system along these lines would go a long way towards de-emphasizing the importance of “vanity journals” like Nature and Science.</li>
<li><strong>Make open access more accessible</strong>. Open access journals have been an important addition to the publication universe, but they are still very expensive (the cost has just been shifted). We need to think more about lowering the overall cost of publication so that it is truly open access.</li>
</ul>
<p>Ultimately, in a universe where there are finite resources, a system has to be developed to determine how those resources should be distributed. Any system that we can come up with will be flawed because, by necessity, there will be winners and losers. I think there are serious efforts that need to be made to make the system more fair and more transparent, but the problem will never truly be “fixed” to everyone’s satisfaction.</p>
Why do we love R so much?
2014-02-19T10:05:59+00:00
http://simplystats.github.io/2014/02/19/why-do-we-love-r-so-much
<p>When Jeff, Brian, and I started the <a href="http://jhudatascience.org">Johns Hopkins Data Science Specialization</a> we decided early on to organize the program around using R. Why? Because we love R, we use it everyday, and it has an incredible community of developers and users. The R community has created an ecosystem of packages and tools that lets R continue to be relevant and useful for real problems.</p>
<p>We created a short video to talk about one of the reasons we love R so much.</p>
k-means clustering in a GIF
2014-02-18T13:09:21+00:00
http://simplystats.github.io/2014/02/18/k-means-clustering-in-a-gif
<p><a href="http://en.wikipedia.org/wiki/K-means_clustering">k-means</a> is a simple and intuitive clustering approach. Here is a movie showing how it works:</p>
<p><a href="http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/kmeans/" rel="attachment wp-att-2716"><img class="alignnone size-full wp-image-2716" alt="kmeans" src="http://simplystatistics.org/wp-content/uploads/2014/02/kmeans.gif" width="480" height="480" /></a></p>
Repost: Ronald Fisher is one of the few scientists with a legit claim to most influential scientist ever
2014-02-17T12:03:55+00:00
http://simplystats.github.io/2014/02/17/repost-ronald-fisher-is-one-of-the-few-scientists-with-a-legit-claim-to-most-influential-scientist-ever
<p><em>Editor’s Note: This is a repost of the post “<a href="http://simplystatistics.org/2012/03/07/r-a-fisher-is-the-most-influential-scientist-ever/">R.A. Fisher is the most influential scientist ever</a>” with a picture of my pilgrimage to his gravesite in Adelaide, Australia. </em></p>
<p>You can now see profiles of famous scientists on Google Scholar citations. Here are links to a few of them (via Ben L.). <a href="http://scholar.google.com/citations?user=6kEXBa0AAAAJ&hl=en" target="_blank">Von Neumann</a>, <a href="http://scholar.google.com/citations?user=qc6CJjYAAAAJ&hl=en" target="_blank">Einstein</a>, <a href="http://scholar.google.com/citations?user=xJaxiEEAAAAJ&hl=en" target="_blank">Newton</a>, <a href="http://scholar.google.com/citations?user=B7vSqZsAAAAJ&hl=en" target="_blank">Feynman</a></p>
<p>But their impact on science pales in comparison (with the possible exception of Newton) to the impact of one statistician: <a href="http://en.wikipedia.org/wiki/Ronald_Fisher" target="_blank">R.A. Fisher</a>. Many of the concepts he developed are so common and are considered so standard, that he is never cited/credited. Here are some examples of things he invented along with a conservative number of citations they would have received calculated via Google Scholar*.</p>
<ol>
<li>P-values - <strong>3 million citations</strong></li>
<li>Analysis of variance (ANOVA) - <strong>1.57 million citations</strong></li>
<li>Maximum likelihood estimation - <strong>1.54 million citations</strong></li>
<li>Fisher’s linear discriminant - <strong>62,400 citations</strong></li>
<li>Randomization/permutation tests - <strong>37,940 citations</strong></li>
<li>Genetic linkage analysis - <strong>298,000 citations</strong></li>
<li>Fisher information - <strong>57,000 citations</strong></li>
<li>Fisher’s exact test - <strong>237,000 citations</strong></li>
</ol>
<p>A couple of notes:</p>
<ol>
<li>These are seriously conservative estimates, since I only searched for a few variants on some key words</li>
<li>These numbers are <strong>BIG</strong>, there isn’t another scientist in the ballpark. The guy who wrote the “<a href="http://www.jbc.org/content/280/28/e25.full" target="_blank">most highly cited paper</a>” got 228,441 citations on GS. His next most cited paper? <a href="http://scholar.google.com/citations?hl=en&user=YCS0XAcAAAAJ&oi=sra" target="_blank">3,000 citations</a>. Fisher has at least 5 papers with more citations than his best one.</li>
<li><a href="http://archive.sciencewatch.com/sept-oct2003/sw_sept-oct2003_page2.htm" target="_blank">This page</a> says Bert Vogelstein has the most citations of any person over the last 30 years. If you add up the number of citations to his top 8 papers on GS, you get 57,418. About as many as to the Fisher information matrix.</li>
</ol>
<p>I think this really speaks to a couple of things. One is that Fisher invented some of the most critical concepts in statistics. The other is the breadth of impact of statistical ideas across a range of disciplines. In any case, I would be hard pressed to think of another scientist who has influenced a greater range or depth of scientists with their work.</p>
<p><strong>Update:</strong> I recently went to Adelaide to give a couple of talks on Bioinformatics, Statistics and MOOCs. My host Gary informed me that Fisher was buried in Adelaide. I went to the cathedral to see the memorial and took this picture. I couldn’t get my face in the picture because the plaque was on the ground. You’ll have to trust me that these are my shoes.</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/02/17/repost-ronald-fisher-is-one-of-the-few-scientists-with-a-legit-claim-to-most-influential-scientist-ever/2013-12-03-16-27-07/" rel="attachment wp-att-2710"><img class="alignnone size-medium wp-image-2710" alt="2013-12-03 16.27.07" src="http://simplystatistics.org/wp-content/uploads/2014/02/2013-12-03-16.27.07-300x225.jpg" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2014/02/2013-12-03-16.27.07-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2014/02/2013-12-03-16.27.07-1024x768.jpg 1024w" sizes="(max-width: 300px) 100vw, 300px" /></a>
</p>
<ul>
<li>
<p>Calculations of citations</p>
<ol>
<li><a href="http://simplystatistics.tumblr.com/post/15402808730/p-values-and-hypothesis-testing-get-a-bad-rap-but-we" target="_blank">As described</a> in a previous post</li>
<li># of GS results for “Analysis of Variance” + # for “ANOVA” - “Analysis of Variance”</li>
<li># of GS results for “maximum likelihood”</li>
<li># of GS results for “linear discriminant”</li>
<li># of GS results for “permutation test” + # for “permutation tests” - “permutation test”</li>
<li># of GS results for “linkage analysis”</li>
<li># of GS results for “fisher information” + # for “information matrix” - “fisher information”</li>
<li># of GS results for “fisher’s exact test” + # for “fisher exact test” - “fisher’s exact test”</li>
</ol>
</li>
</ul>
On the scalability of statistical procedures: why the p-value bashers just don't get it.
2014-02-14T12:40:06+00:00
http://simplystats.github.io/2014/02/14/on-the-scalability-of-statistical-procedures-why-the-p-value-bashers-just-dont-get-it
<p><strong>Executive Summary</strong></p>
<ol>
<li>The problem is not p-values it is a fundamental shortage of data analytic skill.</li>
<li>In general it makes sense to reduce researcher degrees of freedom for non-experts, but any choice of statistic, when used by many untrained people, will be flawed.</li>
<li>The long term solution is to require training in <strong>both statistics and data analysis</strong> for anyone who uses data but particularly journal editors, reviewers, and scientists in molecular biology, medicine, physics, economics, and astronomy.</li>
<li><a href="https://www.coursera.org/specialization/jhudatascience/1">The Johns Hopkins Specialization in Data Science</a> runs every month and can be easily integrated into any program. Other, more specialized, online courses and short courses make it possible to round this training out in ways that are appropriate for each discipline.</li>
</ol>
<p><strong>Scalability of Statistical Procedures</strong></p>
<p>The P-value is in the news again. <a href="http://www.nature.com/news/scientific-method-statistical-errors-1.14700?WT.mc_id=PIN_NatureNews">Nature came out with a piece</a> talking about how scientists are naive about the use of P-values <a href="https://twitter.com/leonidkruglyak/status/433747859414872065">among other things</a>. P-values have known flaws which have been regularly discussed. If you want to see some criticisms just Google “NHST”. Despite their flaws, from a practical perspective it is an oversimplification to point to the use of P-values as the critical flaw in scientific practice. The problem is not that people use P-values poorly, it is that <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">the vast majority of data analysis is not performed by people properly trained to perform data analysis. </a></p>
<p>Data are now abundant in nearly every discipline from astrophysics, to biology, to the social sciences, and even in qualitative disciplines like literature. By scientific standards, the growth of data came on at a breakneck pace. Over a period of about 40 years we went from a scenario where data was measured in bytes to terabytes in almost every discipline. Training programs haven’t adapted to this new era. This is particularly true in genomics where within one generation we went from a data poor environment to a data rich environment. Many of the <a href="http://simplystatistics.org/2012/04/27/people-in-positions-of-power-that-dont-understand/">people in positions of power</a> were trained before data were widely available and used.</p>
<p>The result is that the vast majority of people performing statistical and data analysis are people with only one or two statistics classes and little formal data analytic training under their belt. Many of these scientists would happily work with a statistician, but as any applied statistician at a research university will tell you, it is impossible to keep up with the demand from our scientific colleagues. Everyone is collecting major data sets or analyzing public data sets; there just aren’t enough hours in the day.</p>
<p>Since most people performing data analysis are not statisticians there is a lot of room for error in the application of statistical methods. This error is magnified enormously when naive analysts are given too many “researcher degrees of freedom”. If a naive analyst can pick any of a range of methods and does not understand how they work, they will generally pick the one that gives them maximum benefit.</p>
<p>The short-term solution is to find a balance <a href="http://simplystatistics.org/2013/07/31/the-researcher-degrees-of-freedom-recipe-tradeoff-in-data-analysis/">between researcher degrees of freedom and “recipe book” style approaches</a> that require a specific method to be applied. In general, for naive analysts, it makes sense to lean toward less flexible methods that have been shown to work across a range of settings. The key idea here is to evaluate methods in the hands of naive users and see which ones work best most frequently, an idea we have previously called “<a href="http://simplystatistics.org/2013/08/28/evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-2/">evidence based data analysis</a>”.</p>
<p>An incredible success story of evidence based data analysis in genomics is the use of the <a href="http://www.bioconductor.org/packages/release/bioc/html/limma.html">limma package</a> for differential expression analysis of microarray data. Limma <a href="http://biostatistics.oxfordjournals.org/content/8/2/414.full.pdf">can be beat</a> in certain specific scenarios, but it is robust to such a wide number of study designs, sample sizes, and data types that the choice to use something other than limma should only be exercised by experts.</p>
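<p>For concreteness, here is a minimal sketch of the standard limma workflow. The objects <code>exprs_mat</code> (a log2 expression matrix with genes in rows and samples in columns) and <code>group</code> (a two-level factor) are hypothetical stand-ins for a real experiment, and limma itself comes from Bioconductor:</p>
<pre><code>library(limma)                           # Bioconductor package
# exprs_mat: hypothetical genes x samples matrix of log2 expression values
# group:     hypothetical factor with levels "control" and "treated"
design <- model.matrix(~ group)          # two-group design matrix
fit <- lmFit(exprs_mat, design)          # gene-wise linear models
fit <- eBayes(fit)                       # empirical Bayes moderation of the variances
topTable(fit, coef = 2, number = 10)     # top 10 genes for the group effect
</code></pre>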
<p><strong>The trouble with criticizing p-values without an alternative</strong></p>
<p>P-values are an obvious target of wrath by people who don’t do day to day statistical analysis because the P-value is the most successful statistical procedure ever invented. If every person who used a P-value cited the inventor, P-values would have, <em>very conservatively</em>, <a href="http://simplystatistics.org/2012/03/07/r-a-fisher-is-the-most-influential-scientist-ever/">3 million citations</a>. That’s an insane amount of use for one statistic.</p>
<p>Why would such a terrible statistic be used by so many people? The reason is that it is critical that we have some measure of uncertainty we can assign to data analytic results. Without such a measure, the only way to determine if results are real or not is to rely on people’s intuition, which is a <a href="http://psiexp.ss.uci.edu/research/teaching/Tversky_Kahneman_1974.pdf">notoriously unreliable metric</a> when uncertainty is involved. It is pretty clear science would be much worse off if we decided whether results were reliable based on people’s gut feeling about the data.</p>
<p>P-values can and are misinterpreted, misused, and abused both by naive analysts and by statisticians. Sometimes these problems are due to statistical naiveté, sometimes they are due to wishful thinking and career pressure, and sometimes they are malicious. The reason is that P-values are complicated and require training to understand.</p>
<p>Critics of the P-value argue in favor of a large number of procedures to be used in place of P-values. But when considering the scale at which the methods must be used to address the demands of the current data rich world, many alternatives would result in similar flaws. <em>This in no way proves the use of P-values is a good idea, but it does prove that coming up with an alternative is hard.</em> Here are a few potential alternatives.</p>
<ol>
<li><strong>Methods should only be chosen and applied by true data analytic experts. Pros:</strong> This is the best case scenario. <strong>Cons:</strong> Impossible to implement broadly given the level of statistical and data analytic expertise in the community.</li>
<li><strong>The full prior, likelihood and posterior should be detailed and complete sensitivity analysis should be performed. </strong><strong>Pros: </strong>In cases where this can be done this provides much more information about the model and uncertainty being considered. <strong>Cons</strong>: The model requires more advanced statistical expertise, is computationally much more demanding, and can not be applied in problems where model based approaches have not been developed. Yes/no decisions about credibility of results still come down to picking a threshold or allowing more researcher degrees of freedom.</li>
<li><strong>A direct Bayesian approach should be used reporting credible intervals and Bayes estimators. </strong><strong>Pros:</strong> In cases where the model can be fit, can be used by non-experts, provides scientific measures of uncertainty like confidence intervals. <strong>Cons</strong>: The prior allows a large number of degrees of freedom when not used by an expert, sensitivity analysis is required to determine the effect of the prior, many more complex models can not be implemented, results are still sample size dependent.</li>
<li><strong>Replace P-values with likelihood ratios. </strong><strong>Pros:</strong> In cases where it is available it may reduce some of the conceptual difficulty with the null hypothesis. <strong>Cons:</strong> Likelihood ratios can usually only be computed exactly for cases with few or no nuisance parameters, likelihood ratios run into trouble for complex alternatives, they are still sample size dependent, and a likelihood ratio threshold is equivalent to a p-value threshold in many cases.</li>
<li><strong>We should use Confidence Intervals exclusively in the place of p-values. Pros: </strong>An estimate and its variability on the scale of interest will be reported. We can evaluate effect sizes on a scientific scale. <strong>Cons: </strong>Confidence intervals are still sample size dependent and can be misleading for large samples, significance levels can be chosen to make intervals artificially wide/small, and if used as a decision making tool there is a one-to-one mapping between a confidence interval and a p-value threshold (see the small illustration after this list).</li>
<li><strong>We should use Bayes Factors instead of p-values. </strong><strong>Pros</strong>: They can compare the evidence (loosely defined) for both the null and alternative. They can incorporate prior information. <strong>Cons:</strong> Priors provide researcher degrees of freedom, cutoffs may still lead to false/true positives, BF’s still depend on sample size.</li>
</ol>
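<p>To see the one-to-one mapping between confidence intervals and p-value thresholds mentioned in the confidence interval item above, here is a tiny R illustration with simulated data: for a two-sided one-sample t-test, the 95% interval excludes zero exactly when the p-value falls below 0.05.</p>
<pre><code>set.seed(1)
x <- rnorm(30, mean = 0.5)          # simulated data with a true nonzero mean
tt <- t.test(x)
tt$conf.int                         # 95% confidence interval
tt$p.value < 0.05                   # TRUE exactly when the interval excludes 0
</code></pre>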
<p>This is not to say that many of these methods don’t have advantages over P-values. But at scale any of these methods will be prone to abuse, misinterpretation and error. For example, none of them by default deals with multiple testing. Reducing researcher degrees of freedom is good when dealing with a lack of training, but the consequence is potential for mistakes, and all of these methods would be ferociously criticized if used as frequently as p-values.</p>
<p><strong>The difference between data analysis and statistics</strong></p>
<p>Many disciplines, including medicine and molecular biology, usually require an introductory statistics or machine learning class during their program. This is a great start, but is not sufficient for the modern data saturated era. The introductory statistics or machine learning class is enough to teach someone the language of data analysis, but not how to use it. For example, you learn about the t-statistic and how to calculate it. You may also learn the asymptotic properties of the statistic. But you rarely learn about what happens to the t-statistic when there is <a href="http://en.wikipedia.org/wiki/Confounding">an unmeasured confounder</a>. You also don’t learn how to handle non-iid data, sample mixups, reproducibility, scripting, etc.</p>
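<p>Here is a small simulated example of the unmeasured confounder problem (all numbers are made up): the outcome depends only on a batch variable, but because batch is correlated with the group of interest, a naive t-test reports a “significant” group effect, and that effect shrinks toward zero once batch is included in the model.</p>
<pre><code>set.seed(1)
n <- 100
batch <- rbinom(n, 1, 0.5)                           # unmeasured batch indicator
group <- rbinom(n, 1, ifelse(batch == 1, 0.8, 0.2))  # group of interest, correlated with batch
y <- 2 * batch + rnorm(n)                            # outcome driven by batch, not group
t.test(y ~ factor(group))                            # naive test: spurious group "effect"
summary(lm(y ~ group + batch))                       # adjusting for batch shrinks it toward zero
</code></pre>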
<p>It is therefore critical that if you plan to use or understand data analysis you take both the introductory course and at least one data analysis course. The data analysis course should cover study design, more general data analytic reasoning, non-iid data, biased sampling, basics of non-parametrics, training vs test sets, prediction error, sources of likely problems in data sets (like sample mixups), and reproducibility. These are the concepts that appear regularly when analyzing real data that don’t usually appear in the first course in statistics that most medical and molecular biology professionals see. There are awesome statistical educators who are trying hard to bring more of this into the introductory stats world, but it is just too much to cram into one class.</p>
<p><strong>What should we do</strong></p>
<p>The thing that is the most frustrating about the frequent and loud criticisms of P-values is that they usually point out what is wrong with P-values, but don’t suggest what we should do about it. When they do make suggestions, they frequently ignore the fundamental problems:</p>
<ol>
<li>Statistics are complicated and require careful training to understand properly. This is true regardless of the choice of statistic, philosophy, or algorithm.</li>
<li>Data is incredibly abundant in all disciplines and shows no sign of slowing down.</li>
<li>There is a fundamental shortage of training in statistics <em>and data analysis </em></li>
<li>Giving untrained analysts extra researcher degrees of freedom is dangerous.</li>
</ol>
<p>The most direct solution to this problem is increased training in statistics and data analysis. Every major or program in a discipline that regularly analyzes data (molecular biology, medicine, finance, economics, astrophysics, etc.) should require at minimum an introductory statistics class and a data analysis class. If the expertise doesn’t exist to create these sorts of courses there are options. For example, we have introduced a series of 9 courses that run every month that cover most of the basic topics that are common across disciplines.</p>
<p><a href="http://jhudatascience.org/">http://jhudatascience.org/</a></p>
<p><a href="https://www.coursera.org/specialization/jhudatascience/1">https://www.coursera.org/specialization/jhudatascience/1</a></p>
<p>I think our course on <a href="https://www.coursera.org/course/repdata">Reproducible Research</a> is of particular interest given the <a href="http://www.nature.com/news/policy-nih-plans-to-enhance-reproducibility-1.14586">NIH Director’s recent comments</a> on reproducibility. There are also many more specialized resources that are very good and widely available that will build on the base we created with the data science specialization.</p>
<ol>
<li>For scientific software engineering/reproducibility: <a href="http://software-carpentry.org/">Software Carpentry</a>.</li>
<li>For data analysis in genomics: Rafa’s <a href="https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401">Data Analysis for Genomics Class</a>.</li>
<li>For Python and computing: <a href="https://www.coursera.org/specialization/fundamentalscomputing/9/courses">The Fundamentals of Computing Specialization</a></li>
</ol>
<p>Enforcing education and practice in data analysis is the only way to resolve the problems that people usually attribute to P-values. In the short term, we should at minimum require all the editors of journals who regularly handle data analysis to show competency in statistics and data analysis.</p>
<p><em>Correction:</em> After seeing Katie K.’s comment on Facebook I concur that P-values were not directly referred to as “worse than useless”, so to more fairly represent the article, I have deleted that sentence.</p>
loess explained in a GIF
2014-02-13T10:53:58+00:00
http://simplystats.github.io/2014/02/13/loess-explained-in-a-gif
<p><a href="http://en.wikipedia.org/wiki/Local_regression">Local regression</a> (loess) is one of the statistical procedures I most use. Here is a movie showing how it works</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/02/13/loess-explained-in-a-gif/loess/" rel="attachment wp-att-2661"><img class="size-full wp-image-2661 aligncenter" alt="loess" src="http://simplystatistics.org/wp-content/uploads/2014/02/loess.gif" width="480" height="480" /></a>
</p>
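<p>And here is a minimal R sketch you can run yourself using the built-in <code>loess</code> function; the simulated curve and the span value are arbitrary illustrations:</p>
<pre><code>set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.4)        # noisy nonlinear signal
fit <- loess(y ~ x, span = 0.3)           # local regression; smaller span = wigglier fit
plot(x, y, col = "grey", pch = 19)
lines(x, predict(fit), lwd = 2)           # fitted loess curve
</code></pre>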
Monday data/statistics link roundup (2/10/14)
2014-02-10T05:44:14+00:00
http://simplystats.github.io/2014/02/10/monday-datastatistics-link-roundup-11014
<p>I’m going to try Mondays for the links. Let me know what you think.</p>
<ol>
<li>The Guardian is reading our blog. A week after <a href="http://simplystatistics.org/2014/01/29/not-teaching-computing-and-statistics-in-our-public-schools-will-make-upward-mobility-even-harder/">Rafa posts</a> that everyone should learn to code for career preparedness, <a href="http://www.theguardian.com/technology/2014/feb/07/year-of-code-dan-crow-songkick">the Guardian gets on the bandwagon</a>.</li>
<li>Nature Methods published a paper on a <a href="http://blogs.nature.com/methagora/2014/01/bring-on-the-box-plots-boxplotr.html/">webtool for creating boxplots</a> (via Simina B.). The nerdrage rivaled the quilt plot. <a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/">I’m not opposed to papers like this being published</a>, in fact it is an important part of making sure we don’t miss out on the good software when it comes. There are two important things to keep in mind though: (a) Nature Methods grades on a heavy “innovative” curve which makes it pretty hard to publish papers there, so publishing papers like this could cause frustration among people who would submit there and (b) if you use the boxplots from using this tool you <strong>must</strong> cite the relevant software that generated the boxplot.</li>
<li><a href="http://grantland.com/features/expected-value-possession-nba-analytics/">This story about Databall</a> (via Rafa.) is great, I love the way that it talks about statisticians as the leaders on a new data type. I also enjoyed <a href="http://www.sloansportsconference.com/wp-content/uploads/2014/02/2014_SSAC_Pointwise-Predicting-Points-and-Valuing-Decisions-in-Real-Time.pdf">reading the paper </a>the story is about. One interesting thing about that paper and many of the papers at the Sloan Sports Conference is that the <a href="http://regressing.deadspin.com/here-are-this-years-sloan-finalist-papers-and-their-bi-1518317761">data are proprietary</a> (via Chris V.) so the code/data/methods are actually not available for most papers (including this one). In the short term this isn’t a big deal, the papers are fun to read. In the long term, it will dramatically slow progress. It is almost always a bad long term strategy to make data private if the goal is to maximize value.</li>
<li><a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2377290">The P-value curve</a> for fixing publication bias (via Rafa). I think it is an interesting idea, very similar to our approach for the <a href="http://simplystatistics.org/swfdr/">science-wise false discovery rate</a>. People who liked our paper will like the P-value curve paper. People who hated our paper for the uniformity under the null assumption will hate that paper for the same reason (via David S.)</li>
<li><a href="http://www.theonion.com/video/new-study-shows-that-bones-are-incredibly-cool,35111/">Hopkins discovers bones are the best</a> (via Michael R.).</li>
<li><a href="http://tex.stackexchange.com/questions/158668/nice-scientific-pictures-show-off">Awesome scientific diagrams in tex</a>. Some of these are ridiculous.</li>
<li><a href="https://www.youtube.com/watch?v=cZDn0U0w78k&feature=youtube_gdata_player">Mary Carillo goes crazy on backyard badminton</a>. This is awesome. If you love the Olympics and the internet, you will love this (via Hilary P.)</li>
<li>
<p><a href="http://bmorebiostat.com/">B’more Biostats</a> has been on a tear lately. I’ve been reading [I’m going to try Monday’s for the links. Let me know what you think.</p>
</li>
</ol>
Just a thought on peer reviewing - I can't help myself.
2014-02-05T20:47:09+00:00
http://simplystats.github.io/2014/02/05/just-a-thought-on-peer-reviewing-i-cant-help-myself
<p>Today I was thinking about reviewing, probably because I was handling a couple of papers as AE and doing tasks associated with reviewing several other papers. I know that this is idle thinking, but suppose peer review was just a drop down ranking with these 6 questions.</p>
<ol>
<li>How close is this paper to your area of expertise?</li>
<li>Does the paper appear to be technically right?</li>
<li>Does the paper use appropriate statistics/computing?</li>
<li>Is the paper interesting to people in your area?</li>
<li>Is the paper interesting to a broad audience?</li>
<li>Are the appropriate data and code available?</li>
</ol>
<p>Each question would be rated on a 1-5 star scale. 1 star = completely inadequate, 3 stars = acceptable, 5 stars = excellent. There would be an optional comments box that would only be used for major/interesting thoughts, and anything that got above 3 stars for questions 2, 3, and 6 would be published. Incidentally, you could do this for free on GitHub if the papers were written in markdown, which would reduce the <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">substantial costs of open-access publishing</a>.</p>
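<p>Just to make the hypothetical decision step concrete, here is a toy R sketch; the question names and star values are invented for illustration:</p>
<pre><code># made-up ratings for the six questions above, on a 1-5 star scale
ratings <- c(expertise = 4, technically_right = 4, appropriate_stats = 5,
             interesting_area = 3, interesting_broad = 2, data_code_available = 4)
# publish if questions 2, 3, and 6 are all above 3 stars
publish <- all(ratings[c("technically_right", "appropriate_stats",
                         "data_code_available")] > 3)
publish   # TRUE in this example, so the paper goes out
</code></pre>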
<p>No doubt peer review would happen faster this way. I was wondering, would it be any worse?</p>
My Online Course Development Workflow
2014-02-04T09:30:00+00:00
http://simplystats.github.io/2014/02/04/my-online-course-development-workflow
<p>One of the nice things about developing <a href="http://jhudatascience.org">9 new courses</a> for the JHU Data Science Specialization in a short period of time is that you get to learn all kinds of cool and interesting tools. One of the ways that we were able to push out so much content in just a few months was that we did most of the work ourselves, rather than outsourcing things like video production and editing. You could argue that this results in a poorer quality final product but (a) I disagree; and (b) even if that were true, I think the content is still valuable.</p>
<p>The advantage of learning all the tools was that it allowed for a quick turn-around from the creation of the lecture to the final exporting of the video (often within a single day). For a hectic schedule, it’s nice to be able to write slides in the morning, record some video in between two meetings in the afternoon, and then combine/edit all the video in the evening. Then if you realize something doesn’t work, you can start over the next day and have another version done in less than 24 hours.</p>
<p>I thought it might be helpful to someone out there to detail the workflow and tools that I use to develop the content for my online courses.</p>
<ul>
<li><span style="line-height: 16px;">I use <a href="http://www.techsmith.com/camtasia-mac-features.html">Camtasia for Mac</a> to do all my screencasting/recording. This is a nice tool and I think has more features than your average screen recorder. That said, if you just want to record your screen on your Mac, you can actually use the built-in Quicktime software. I used to do all of my video editing in Camtasia but now it’s pretty much glorified screencasting software for me.</span></li>
<li>For talking head type videos I use my <a href="http://www.apple.com/iphone-5s/">iPhone 5S</a> <a href="http://www.amazon.com/gp/product/B00AAKERD6/ref=oh_details_o03_s00_i00?ie=UTF8&psc=1">mounted</a> on a <a href="http://www.amazon.com/gp/product/B000V7AF8E/ref=oh_details_o02_s00_i00?ie=UTF8&psc=1">tripod</a>. The iPhone produces surprisingly good 1080p HD 30 fps video and is definitely sufficient for my purposes (see <a href="http://www.apple.com/30-years/1-24-14-film/#video-1242014-film">here</a> for a much better example of what can be done). I attach the phone to an <a href="http://apogeedigital.com/products/mic.php">Apogee microphone</a> to pick up better sound. For some of the <a href="http://simplystatistics.org/interviews/">interviews</a> that we do on Simply Statistics I use two iPhones (A 5S and a 4S, my older phone).</li>
<li>To record my primary sound (i.e. me talking), I use the <a href="http://www.amazon.com/gp/product/B001QWBM62/ref=oh_details_o00_s00_i01?ie=UTF8&psc=1">Zoom H4N portable recorder</a>. This thing is not cheap but it records very high-quality stereo sound. I can connect it to my computer via USB or it can record to a SD card.</li>
<li>For simple sound recording (no video or screen) I use <a href="http://audacity.sourceforge.net">Audacity</a>.</li>
<li>All of my lecture videos are run through <a href="http://www.apple.com/final-cut-pro/">Final Cut Pro X</a> on my <a href="http://www.apple.com/macbook-pro/">15-inch MacBook Pro with Retina Display</a>. Videos from Camtasia are exported in Apple ProRes format and then imported into Final Cut. Learning FCPX is not for the faint-of-heart if you’re not used to a nonlinear editor (as I was not). I bought this <a href="http://www.amazon.com/gp/product/0321774671/ref=wms_ohs_product?ie=UTF8&psc=1">excellent book</a> to help me learn it, but I still probably only use 1% of the features. In the end using a real editor was worth it because it makes merging multiple videos much easier (i.e. multicam shots for screencasts + talking head) and editing out mistakes (e.g. typos on slides) much simpler. The editor in Camtasia is pretty good but if you have more than one camera/microphone it becomes infeasible.</li>
<li>I have an <a href="http://store.apple.com/us/product/HD816ZM/A/wd-8tb-my-book-thunderbolt-duo-dual-drive-storage-system?fnode=5f&fs=f%3Dthunderbolt%26fh%3D3783%252B309a">8TB Western Digital Thunderbolt drive</a> to store the raw video for all my classes (and some backups). I also use two <a href="http://store.apple.com/us/product/HE965VC/A/g-tech-1tb-g-drive-mobile-thunderboltusb-30-hard-drive?fnode=5f&fs=f%3Dthunderbolt%26fh%3D3783%252B309a">1TB Thunderbolt drives</a> to store video for individual classes (each 4-week class borders on 1TB of raw video). These smaller drives are nice because I can just throw them in my bag and edit video at home or on the weekend if I need to.</li>
<li>Finished videos are shared with a <a href="https://www.dropbox.com/business">Dropbox for Business</a> account so that Jeff, Brian, and I can all look at each other’s stuff. Videos are exported to H.264/AAC and uploaded to Coursera.</li>
<li>For developing slides, Jeff, Brian, and I have standardized around using <a href="http://slidify.org">Slidify</a>. The beauty of using Slidify is that it lets you write everything in <a href="http://daringfireball.net/projects/markdown/">Markdown</a>, a super simple text format. It also makes it simpler to manage all the course material in <a href="https://github.com/DataScienceSpecialization/courses">Git/GitHub</a> because you don’t have to lug around huge PowerPoint files. Everything is a lightweight text file. And thanks to <a href="http://people.mcgill.ca/ramnath.vaidyanathan/">Ramnath’s</a> incredible grit and moxie, we have handy tools to easily export everything to PDF and HTML slides (HTML slides hosted via <a href="http://pages.github.com">GitHub Pages</a>).</li>
</ul>
<p>The first courses for the <a href="https://www.coursera.org/specialization/jhudatascience/1">Data Science Specialization</a> start on April 7th. Don’t forget to sign up!</p>
The three tables for genomics collaborations
2014-02-03T10:12:19+00:00
http://simplystats.github.io/2014/02/03/the-three-tables-for-genomics-collaborations
<p>Collaborations between biologists and statisticians are very common in genomics. For the data analysis to be fruitful, the statistician needs to understand what samples are being analyzed. For the analysis report to make sense to the biologist, it needs to be properly annotated with information such as gene names, genomic location, etc… In a recent post, Jeff laid out <a href="http://simplystatistics.org/2013/11/14/the-leek-group-guide-to-sharing-data-with-a-statistician-to-speed-collaboration/">his guide</a> for such collaborations; here I describe an approach that has helped me in mine.</p>
<p>In many of my past collaborations, sharing the experiment’s key information, in a way that facilitates data analysis, turned out to be more time-consuming than the analysis itself. One reason is that the data producers annotated samples in ways that were impossible to decipher without direct knowledge of the experiment (e.g. using lab-specific codes in the filenames, or colors in Excel files). In the early days of microarrays, a group of researchers noticed this problem and created a <a href="http://www.mged.org/Workgroups/MAGE/mage-ml.html">markup language</a> to describe and communicate information about experiments electronically.</p>
<p>The <a href="http://www.bioconductor.org/">Bioconductor project</a> took a less ambitious approach and created <a href="http://www.bioconductor.org/packages/2.14/bioc/vignettes/Biobase/inst/doc/BiobaseDevelopment.pdf">classes</a> specifically designed to store the minimal information needed to perform an analysis. These classes can be thought of as three tables, stored as <strong>flat text files</strong>, all of which are relatively easy to create for biologists.</p>
<p>The first table contains the <strong>experimental data</strong> with rows representing features (e.g. genes) and the columns representing samples.</p>
<p>The second table contains the <strong>sample information</strong>. This table contains a row for each column in the experimental data table. This table contains at least two columns. The first contains an identifier that can be used to match the rows of this table to the columns of the first table. The second contains the main outcome of interest, e.g. case or control, cancer or normal. Other commonly included columns are the filename of the original raw data associated with each row, the date the experiment was processed, and other information about the samples.</p>
<p>The third table contains the <strong>feature information</strong>. This table contains a row for each row in the experimental table and at least two columns. The first contains an identifier that can be used to match the rows of this table to the rows of the first table. The second contains an annotation that makes sense to biologists, e.g. a gene name. For technologies that are widely used (e.g. Affymetrix gene expression arrays) these tables are readily available.</p>
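<p>Here is a minimal sketch, using made-up data and made-up identifiers, of what the three tables look like in practice and how they can be bundled into a Bioconductor ExpressionSet with the Biobase package:</p>
<pre class="brush: r; title: ; notranslate" title="">library(Biobase)

## Table 1: experimental data, features in rows and samples in columns
exprs = matrix(rnorm(5 * 4), nrow = 5,
               dimnames = list(paste0("feature", 1:5), paste0("sample", 1:4)))

## Table 2: sample information, one row per column of table 1
pheno = data.frame(outcome  = c("case", "case", "control", "control"),
                   filename = paste0("sample", 1:4, ".CEL"),
                   row.names = colnames(exprs))

## Table 3: feature information, one row per row of table 1
fdata = data.frame(gene = paste0("GENE", 1:5), row.names = rownames(exprs))

## the row and column names are the identifiers, so the tables match up automatically
eset = ExpressionSet(assayData = exprs,
                     phenoData = AnnotatedDataFrame(pheno),
                     featureData = AnnotatedDataFrame(fdata))
</pre>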
<p>With these three relatively simple files in place less time is spent “figuring out” the data and the statisticians can focus their energy on data analysis while the biologists can focus their energy on interpreting the results. This approach seems to have been the inspiration for the <a href="http://www.mged.org/mage-tab/">MAGE-TAB</a> format.</p>
<p>Note that with newer technologies, statisticians prefer to get access to <strong>the raw data</strong>. In this case, instead of an experimental data table (table 1), they will want the original raw data files. The sample information then must contain a column with the filenames so that sample annotation can be properly matched.</p>
<p><strong>NB</strong>: These three tables are not a complete description of an experiment and are not intended as an alternative to standards such as MAGE and MIAME. But in many cases, they provide the very minimum information needed to carry out a rudimentary analysis. Note that Bioconductor provides <a href="http://www.bioconductor.org/packages/2.3/bioc/html/RMAGEML.html">tools</a> to import information from MAGE-ML and other related formats.</p>
Not teaching computing and statistics in our public schools will make upward mobility even harder
2014-01-29T10:40:51+00:00
http://simplystats.github.io/2014/01/29/not-teaching-computing-and-statistics-in-our-public-schools-will-make-upward-mobility-even-harder
<p>In his book <a href="http://www.amazon.com/dp/0525953736/?tag=slatmaga-20" target="_blank">Average Is Over</a>, Tyler Cowen predicts that as automation becomes more common, modern economies will eventually be composed of two groups: <a href="http://en.wikipedia.org/wiki/Average_is_Over">1) a highly educated minority involved in the production of automated services and 2) a vast majority earning very little but enough to consume some of the low-priced products created by group 1</a>. Not everybody will agree with this view, but we can’t ignore the fact that automation has already eliminated many middle-class jobs in, for example, manufacturing and the automotive industries. New technologies, such as <a href="http://www.youtube.com/watch?v=cdgQpa1pUUE">driverless cars</a> and online retailers, will very likely eliminate many more jobs (e.g. drivers and retail clerks) than they create (programmers and engineers).</p>
<p>Computer literacy is essential for working with automated systems. Programming and learning from data are perhaps the most useful skills for creating these systems. Yet the current default curriculum includes neither computer science nor statistics. At the same time, there are plenty of resources for motivated parents with means to get their children to learn these subjects. Kids whose parents don’t have the wherewithal to take advantage of these educational resources will be at an even greater disadvantage than they are today. This disadvantage is made worse by the fact that many of the aforementioned resources are free and open to the world (<a href="http://www.codecademy.com/">Codecademy</a>, <a href="https://www.khanacademy.org/">Khan Academy</a>, <a href="https://www.edx.org/">EdX</a>, and <a href="https://www.coursera.org/">Coursera</a>, for example), which means that a large pool of students that previously had no access to this learning material will also be competing for group 1 jobs. If we want to level the playing field, we should start by updating the public school curriculum so that, in principle, everybody has the opportunity to compete.</p>
Announcing the Release of swirl 2.0
2014-01-28T09:44:06+00:00
http://simplystats.github.io/2014/01/28/swirl-2
<p><em>Editor’s note: This post was written by Nick Carchedi, a Master’s degree student in the Department of Biostatistics at Johns Hopkins. He is working with us to develop the <a href="http://jhudatascience.org">Data Science Specialization</a> as well as software for interactive learning of R and statistics.</em></p>
<p>Official swirl website: <a href="http://swirlstats.com">swirlstats.com</a></p>
<p>On September 27, 2013, I wrote a guest <a href="http://simplystatistics.org/2013/09/27/announcing-statistics-with-interactive-r-learning-software-environment/">blog post</a> on Simply Statistics to announce the creation of Statistics with Interactive R Learning (swirl), an R package for teaching and learning statistics and R simultaneously and interactively. Over the next several months, I received a tremendous amount of feedback from all over the world. Two things became clear: 1) there were many opportunities for improvement on the original design and 2) there’s an incredible demand globally for new and better ways of learning statistics and R.</p>
<p>In the spirit of R and open source software, I shared the source code for swirl on GitHub. As a result, I quickly came in contact with several very talented individuals, without whom none of what I’m about to share with you would have been possible. Armed with invaluable feedback and encouragement from early adopters of swirl 1.0, my new team and I pursued a complete overhaul of the original design.</p>
<p>Today, I’m happy to announce the result of our efforts: swirl 2.0.</p>
<p>Like the first version of the software, swirl 2.0 guides students through interactive tutorials in the R console on a variety of topics related to statistics and R. The user selects from a menu of courses, each of which is broken up by topic into shorter lessons. Lessons, in turn, are a dialog between swirl and the user and are composed of text output, multiple choice and text-based questions, and (most importantly) questions that require the user to enter actual R code at the prompt. Responses are evaluated for correctness based on instructor-specified answer tests and appropriate feedback is given immediately to the user.</p>
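<p>For readers who want to try it out, getting started is typically just a few lines in the R console (assuming the package is available from CRAN; see <a href="http://swirlstats.com">swirlstats.com</a> for up-to-date installation instructions):</p>
<pre class="brush: r; title: ; notranslate" title="">install.packages("swirl")  # install from CRAN
library(swirl)
swirl()  # launches the interactive menu of courses and lessons
</pre>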
<p>It’s helpful to think of swirl as the synthesis of two separate parts: content and platform. Content is authored by instructors in R Markdown files. The platform is then responsible for delivering this content to the user and interpreting the user’s responses in an interactive and engaging way.</p>
<p>Our primary focus for swirl 2.0 was to build a more robust and extensible platform for delivering content. Here’s a (nontechnical) summary of new and revised features:</p>
<ul>
<li>A library of answer tests an instructor can deploy to check user input for correctness</li>
<li>If stuck, a user can skip a question, causing swirl to enter the correct answer on their behalf</li>
<li>During a lesson, a user can pause instruction to play around or practice something they just learned, then use a special keyword to regain swirl’s attention when ready to resume</li>
<li>swirl “sees” user input the same way R “sees” it, which allows swirl to understand the composition of a user’s input on a much deeper level (thanks, Hadley)</li>
<li>User progress is saved between sessions</li>
<li>More readable output that adjusts to the width of the user’s display (thanks again, Hadley)</li>
<li>Extensible framework allows others to easily extend swirl’s functionality</li>
<li>Instructors can author content in a special flavor of R markdown</li>
</ul>
<p>(For a more technical understanding of swirl’s features and inner workings, we encourage readers to consult our <a href="https://github.com/swirldev/swirl">GitHub repository</a>.)</p>
<p>Although improving the platform was our first priority for this release, we’ve made some improvements to existing content and, more importantly, added the beginnings of a new course: Intro to R. Intro to R is our response to the overwhelming demand for a more accessible and interactive way to learn the R language. We’ve included the first three lessons of the course and plan to add many more over the coming months as our focus turns to creating more high quality content.</p>
<p>Our ultimate goal is to have the statistics and R communities use swirl as a platform to deliver their own content to students interactively. We’ve heard from many people who have an interest in creating their own content and we’re working hard to make the process of creating content as simple and enjoyable as possible.</p>
<p>The goal of swirl is not to be flashy, but rather to provide the most authentic learning environment possible. We accomplish this by placing students directly on the R prompt, within the very same environment they’ll use for data analysis when they are not using swirl. We hope you find swirl to be a valuable tool for learning and teaching statistics and R.</p>
<p>It’s important to stress that, as with any new software, we expect there will be bugs. At this point, users should still consider themselves “early adopters”. For bug reports, suggested enhancements, or to learn more about swirl, please visit <a href="http://swirlstats.com">our website</a>.</p>
<h2 id="contributors">Contributors:</h2>
<p>Many people have contributed to this project, either directly or indirectly, since its inception. I will attempt to list them all here, in no particular order. I’m sincerely grateful to each and every one of you.</p>
<ul>
<li>Bill & Gina: swirl is as much theirs as it is mine at this point. Their contributions are the only reason the project has evolved so much since the release of swirl 1.0.</li>
<li>Brian: Challenged me to turn my idea for swirl into a working prototype. Coined the “swirl” acronym. swirl would still be an idea in my head without his encouragement.</li>
<li>Jeff: Pushes me to think big picture and provides endless encouragement. Reminds me that a great platform is worthless without great content.</li>
<li>Roger: Encouraged me to separate platform and content, a key paradigm that allowed swirl to mature from a messy prototype to something of real value. Introduced me to Git and GitHub.</li>
<li>Lauren & Ethan: Helped with development of the earliest instructional content.</li>
<li>Ramnath: Provided a model for content authoring via slidify “flavor” of R Markdown.</li>
<li>Hadley: Made key suggestions for improvement and provided an important <a href="https://gist.github.com/hadley/6734404">proof of concept</a>. His work has had a profound influence on swirl’s development.</li>
<li>Peter: Our discussions led to a better understanding of some key ideas behind swirl 2.0.</li>
<li>Sally & Liz: Beta testers and victims of my endless rants during stats tutoring sessions.</li>
<li>Kelly: Most talented graphic designer I know and mastermind behind the swirl logo. First line of defense against bad ideas, poor design, and crappy websites. Visit <a href="http://kellynealon.com">her website</a>.</li>
<li>Mom & Dad: Beta testers and my #1 fans overall.</li>
</ul>
Marie Curie says stop hating on quilt plots already.
2014-01-28T08:50:37+00:00
http://simplystats.github.io/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already
<blockquote>
<p>“There are sadistic scientists who hurry to hunt down error instead of establishing the truth.” -Marie Curie (http://en.wikiquote.org/wiki/Marie_Curie)</p>
</blockquote>
<p>Thanks to Kasper H. for that quote. I think it is a perfect fit for today’s culture of academic put down as academic contribution. One perfect example is the explosion of hate against the quilt plot. A quilt plot is a heatmap with several parameters selected in advance; that’s it. <a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0085047">This simplification</a> of R’s heatmap function appeared in the journal PLoS One. They say (though not up front and not clearly enough for my personal taste) that they know it is just a heatmap.</p>
<p>Over the course of the next several weeks quilt plots went viral. Here <a href="https://twitter.com/EvolOdonata/status/427657216154664960">are a</a> <a href="https://twitter.com/BioMickWatson/status/426780957279281152">few</a> <a href="https://twitter.com/rvimieiro/status/423418772368547840">example</a> tweets. It was also <a href="http://liorpachter.wordpress.com/2014/01/19/why-do-you-look-at-the-speck-in-your-sisters-quilt-plot-and-pay-no-attention-to-the-plank-in-your-own-heat-map/">dissected</a> on <a href="http://eagereyes.org/series/peer-review/1-quilt-plots">people’s blogs</a> and <a href="http://www.the-scientist.com/?articles.view/articleNo/38919/title/Not-So-New-/">even in The Scientist</a>. So I did an experiment. I built a table of frequencies in R like this and applied the heatmap function in R, then the quilt.plot function in the fields package, then the function written by the authors of the paper, with as minimal tweaking as possible.</p>
<pre class="brush: r; title: ; notranslate" title="">set.seed(12345)
library(fields)  # provides quilt.plot()
# a 5x5 table of simulated counts, converted to proportions
x = matrix(rbinom(25, size = 4, prob = 0.5), nrow = 5)
pt = prop.table(x)
# 1) base R heatmap
heatmap(pt)
# 2) quilt.plot from the fields package, one (x, y) coordinate per cell of the grid
quilt.plot(x = rep(1:5, 5), y = rep(1:5, each = 5), z = pt)
# 3) quilt(), retyped from the PLoS One paper (definition not shown here)
quilt(pt, 1:5, 1:5, zlabel = "Proportion")
</pre>
<p>Here are the results:</p>
<p><strong>heatmap</strong></p>
<p><a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/heatmap/" rel="attachment wp-att-2588"><img class="alignnone size-medium wp-image-2588" alt="heatmap" src="http://simplystatistics.org/wp-content/uploads/2014/01/heatmap-300x300.png" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2014/01/heatmap-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2014/01/heatmap-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2014/01/heatmap.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p><strong>quilt.plot</strong></p>
<p><a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/quilt-plot/" rel="attachment wp-att-2589"><img class="alignnone size-medium wp-image-2589" alt="quilt.plot" src="http://simplystatistics.org/wp-content/uploads/2014/01/quilt.plot_-300x300.png" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2014/01/quilt.plot_-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2014/01/quilt.plot_-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2014/01/quilt.plot_.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p><strong>quilt</strong></p>
<p><strong><a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/quilt/" rel="attachment wp-att-2590"><img class="alignnone size-medium wp-image-2590" alt="quilt" src="http://simplystatistics.org/wp-content/uploads/2014/01/quilt-300x300.png" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2014/01/quilt-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2014/01/quilt-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2014/01/quilt.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></strong></p>
<p>It is clear that out of the box and with no tinkering, the new plot makes something nicer/more interpretable. The columns/rows are where I expect and the scale is there and nicely labeled. Everyone who has ever made heatmaps in R has some bit of code that looks like this:</p>
<pre class="brush: r; title: ; notranslate" title=""># illustrative only: bdat, dat, and colsb() are stand-ins from a real analysis
image(t(bdat)[,nrow(bdat):1],col=colsb(9),breaks=quantile(as.vector(as.matrix(dat)),probs=seq(0,1,length=10)),xaxt="n",yaxt="n",xlab="",ylab="")
</pre>
<p>That is the kind of code you end up writing to hack together a heatmap in R that looks the way you expect. It is a total pain. Obviously the quilt plot paper has a few flaws:</p>
<ol>
<li>It tries to introduce the quilt plot as a new idea.</li>
<li>It doesn’t just come out and say it is a hack of the heatmap function, but tries to dance around it.</li>
<li>It produces code, but only as images in Word files. I had to retype the code to make my plot.</li>
</ol>
<p>That being said, here are a few other true things about the paper:</p>
<ol>
<li>The code works if you type it out and apply it.</li>
<li>They produced code.</li>
<li>The paper is open access.</li>
<li>The paper is correct technically.</li>
<li>The hack is useful for users with few R skills.</li>
</ol>
<p>So why exactly isn’t it a paper? It smacks of academic elitism to claim that this isn’t good enough because it isn’t a “new idea”. Not every paper discovers radium. Some papers are better than others, and that is ok. I think the quilt plot being published isn’t a problem. Maybe I don’t love exactly how it is written, but they do acknowledge the heatmap, they do produce correct, relevant code, and it does solve a problem people actually have. That is better than a lot of papers that appear in more prestigious journals. <a href="http://www.nature.com/news/arsenic-life-bacterium-prefers-phosphorus-after-all-1.11520#/b1">Arsenic life</a> anyone?</p>
<p>I think it is useful to have a forum where people can post correct, useful, but not necessarily groundbreaking results and get credit for them, even if the credit is modest. Otherwise we might miss out on useful bits of code. Frank Harrell has a <a href="http://cran.r-project.org/web/packages/Hmisc/index.html">bunch of functions</a> that tons of people use but for which he doesn’t get citations; if you use R, you have probably heard of the Hmisc package.</p>
<p>But did you know Karl Broman has a bunch of really useful functions in his <a href="https://github.com/kbroman/broman">personal R package</a>? <a href="https://github.com/kbroman/broman/blob/master/R/qqline2.R">qqline2</a> is great. I know Rafa has a bunch of functions he has never published because they seem “too trivial”, but I use them all the time. Every scientist who touches code has a personal library like this. I’m not saying the quilt plot is in that category. But I am saying that it is stupid not to have a public forum for making these functions available to other scientists. And that won’t happen if the “quilt plot backlash” is what people see when they try to get published credit for simple code that solves real problems.</p>
<p>Hacks like the quilt plot can help people who aren’t comfortable with R write reproducible scripts without having to figure out every plotting parameter. Keeping in mind that <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">the vast majority of data analysis is not done by statisticians</a>, it seems like these little hacks are an important part of science. If you believe in figshare, github, open science, and shareable code, you shouldn’t be making fun of the quilt plotters.</p>
<p>Marie Curie says so.</p>
The Johns Hopkins Data Science Specialization on Coursera
2014-01-21T11:06:47+00:00
http://simplystats.github.io/2014/01/21/the-johns-hopkins-data-science-specialization-on-coursera
<p>We are very proud to announce the Johns Hopkins Data Science Specialization on Coursera. You can see the official announcement from the Coursera folks <a href="http://blog.coursera.org/post/73994272513/coursera-specializations-focused-programs-in-popular">here</a>. This is the main reason Simply Statistics has been a little quiet lately.</p>
<p>The three of us (Brian Caffo, Roger Peng, and Jeff Leek) along with a couple of incredibly hard working graduate students (Nick Carchedi of <a href="http://swirlstats.com/">swirl</a> fame and Sean Kross) have put together <em>nine</em> new one-month classes to run on the Coursera platform. The classes are:</p>
<ol>
<li><strong>The Data Scientist’s Toolbox</strong> - A basic introduction to data and data science and a basic guide to R/RStudio/GitHub/the command line interface.</li>
<li><strong>R Programming </strong> - Introduction to R programming, from installing R to types, to functions, to control structures.</li>
<li><strong>Getting and Cleaning Data</strong> - An introduction to getting data from the web, from images, from APIs, and from databases. The course also covers how to go from raw data to tidy data.</li>
<li>
<div>
<strong>Exploratory Data Analysis</strong> - This course covers plotting in base graphics, lattice, ggplot2 and clustering and other exploratory techniques. It also covers how to think about exploring data you haven't seen.
</div>
</li>
<li>
<div>
<strong>Reproducible Research </strong> - This is one of the unique courses to our sequence. It covers how to think about reproducible research, evidence based data analysis, reproducible research checklists and knitr, markdown, R markdown, etc.
</div>
</li>
<li>
<div>
<strong>Statistical Inference </strong> - This course covers the fundamentals of statistical inference from a practical perspective. The course covers both the technical details and important ideas like confounding.
</div>
</li>
<li>
<div>
<strong>Regression Models </strong> - This course covers the fundamentals of linear and generalized linear regression modeling. It also serves as an introduction to how to "think about" relating variables to each other quantitatively.
</div>
</li>
<li>
<div>
<strong>Practical Machine Learning </strong> - This course will cover the basic conceptual ideas in machine learning like in/out of sample errors, cross validation, and training and test sets. It will also cover a range of machine learning algorithms and their practical implementation.
</div>
</li>
<li>
<div>
<strong>Developing Data Products </strong> - This course will cover how to develop tools for communicating data, methods, and analyses with other people. It will cover building R packages, Shiny, and Slidify, among other things.
</div>
</li>
</ol>
<p>There will also be a specialization project - consisting of a 10th class where students will work on projects conducted with industry, government, and academic partners.</p>
<p>The classes represent some of the content we have previously covered in our popular Coursera classes and a ton of brand new content for this specialization. Here are some things that I think make our program stand out:</p>
<ul>
<li>We will roll out 3 classes at a time starting in April. Once a class is running, it will run every single month concurrently.</li>
<li>The specialization offers a bunch of unique content, particularly in the courses Getting and Cleaning Data, Reproducible Research, and Developing Data Products.</li>
<li>All of the content is being developed open source and open-access on Github. You are welcome to check it out as we develop it and contribute!</li>
<li>You can take the first 9 courses of the specialization entirely for free.</li>
<li>You can choose to pay a very modest fee to get “Signature Track” certification in every course.</li>
</ul>
<p>I have also created a little page that summarizes some of the unique aspects of our program. Scroll through it and you’ll find sharing links at the bottom. Please share with your friends, we think this is pretty cool: <a href="http://jhudatascience.org/">http://jhudatascience.org</a></p>
Sunday data/statistics link roundup (1/19/2014)
2014-01-19T22:57:42+00:00
http://simplystats.github.io/2014/01/19/sunday-datastatistics-link-roundup-1192014
<ol>
<li><a href="http://ch.tbe.taleo.net/CH07/ats/careers/requisition.jsp?org=TESLA&cws=1&rid=12268">Tesla is hiring a data scientist</a>. That is all.</li>
<li>I’m not sure I buy <a href="http://www.talyarkoni.org/blog/2013/11/18/the-homogenization-of-scientific-computing-or-why-python-is-steadily-eating-other-languages-lunch/">the idea</a> <a href="http://readwrite.com/2013/11/25/python-displacing-r-as-the-programming-language-for-data-science?utm_medium=readwr.it-twitter&utm_source=t.co&utm_content=awesmsharetools-sharebuttons&awesm=readwr.it_p0jm&utm_campaign=#awesm=~osWcapOVQuLAaP">that Python</a> is taking over for R among people who actually do regular data science. I think it is still context dependent. A huge fraction of genomics happens in R and there is a steady stream of new packages that allow R users to push farther and farther back into the processing pipeline. On the other hand, I think language diversity is clearly a plus for someone who works with data. Not that I’d know…</li>
<li>This is an awesome talk on <a href="http://vimeo.com/80236275">why to pursue a Ph.D.</a>. It gives a really level headed and measured discussion, specifically focused on computational programs (I think I got to it via Alyssa F.’s blog).</li>
<li>En Español - <a href="http://www.elnuevodia.com/paraentenderlageneticalatina-1689123.html?fb_action_ids=10100207969753748">A blog post</a> about a study of genetic risk factors among Hispanic/Latino populations (via Rafa).</li>
<li><a href="http://magazine.amstat.org/blog/2014/01/01/tenured-women/">Where have all the tenured women gone?</a> This is a major issue and deserves much more press than it gets (via Sherri R.).</li>
<li><span style="line-height: 16px;">Not related to statistics really, <a href="http://9-eyes.com/">but these image captures</a> from Google streetview are wild. </span></li>
</ol>
Missing not at random data makes some Facebook users feel sad
2014-01-17T10:22:20+00:00
http://simplystats.github.io/2014/01/17/missing-not-at-random-data-makes-some-facebook-users-feel-sad
<p><a href="http://www.npr.org/2014/01/09/261108836/many-younger-facebook-users-unfriend-the-network">This article</a>, published last week, explained how “some younger users of Facebook say that using the site often leaves them feeling sad, lonely and inadequate”. Being a statistician gives you an advantage here because we know that naive estimates from missing not at random (MNAR) data can be very biased. The posts you see on Facebook are not a random sample from your friends’ lives. We see pictures of their vacations, abnormally flattering pictures of themselves, reports on their major achievements, etc… but no view of the mundane, typical daily occurrences. Here is a simple cartoon explanation of how MNAR data can give you a biased view of what’s really going on. Suppose your life occurrences are rated from 1 (worst) to 5 (best); this table compares what you see to what is really going on after 15 occurrences:</p>
<p><a href="http://simplystatistics.org/2014/01/17/missing-not-at-random-data-makes-some-facebook-users-feel-sad/screen-shot-2014-01-17-at-10-16-32-am/" rel="attachment wp-att-2516"><img class="alignnone size-full wp-image-2516" alt="Screen Shot 2014-01-17 at 10.16.32 AM" src="http://simplystatistics.org/wp-content/uploads/2014/01/Screen-Shot-2014-01-17-at-10.16.32-AM.png" width="1105" height="197" srcset="http://simplystatistics.org/wp-content/uploads/2014/01/Screen-Shot-2014-01-17-at-10.16.32-AM-300x53.png 300w, http://simplystatistics.org/wp-content/uploads/2014/01/Screen-Shot-2014-01-17-at-10.16.32-AM-1024x182.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/01/Screen-Shot-2014-01-17-at-10.16.32-AM.png 1105w" sizes="(max-width: 1105px) 100vw, 1105px" /></a></p>
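<p>A quick simulation makes the same point as the table above; the posting probabilities below are made-up numbers for illustration:</p>
<pre class="brush: r; title: ; notranslate" title="">set.seed(1)
## everything that actually happens, rated 1 (worst) to 5 (best)
true_life = sample(1:5, 1000, replace = TRUE)
## better occurrences are far more likely to get posted
p_post = c(0.01, 0.02, 0.05, 0.30, 0.80)[true_life]
observed = true_life[rbinom(1000, 1, p_post) == 1]
mean(true_life)  # about 3, the real average
mean(observed)   # much higher: the naive estimate is biased upward
</pre>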
edge.org asks famous scientists what scientific concept to throw out & they say statistics
2014-01-16T10:10:00+00:00
http://simplystats.github.io/2014/01/16/edge-org-asks-famous-scientists-what-scientific-concept-to-throw-out-they-say-statistics
<p>I don’t think I’ve ever been forwarded one link on the web more than I have been forwarded the edge.org post on <a href="http://www.edge.org/responses/what-scientific-idea-is-ready-for-retirement">“What scientific idea is ready for retirement?”</a>. Here are a few of the comments with my responses. I’m going to keep them brief because I think the edge.org crowd pushes people to say outrageous things, so it isn’t even clear they mean what they say.</p>
<p>I think the whole conceit of the question is a little silly. If you are going to retire a major scientific idea, you had better have a replacement or at least a guess at what we could do next. The prompt totally ignores the key follow-up question: “Suppose we actually did what you suggested, what would we do instead?”</p>
<p><strong>On getting rid of big clinical trials</strong></p>
<blockquote>
<p>It is a commonly held but erroneous belief that a larger study is always more rigorous or definitive than a smaller one, and a randomized controlled trial is always the gold standard. However, there is a growing awareness that size does not always matter and a randomized controlled trial may introduce its own biases. We need more creative experimental designs.</p>
</blockquote>
<p><strong>My response: </strong><a href="http://simplystatistics.org/2013/07/15/yes-clinical-trials-work/">Yes clinical trials work</a>. Yes, bigger trials and randomized trials are more definitive. There is currently no good alternative for generating causal statements that doesn’t require quite severe assumptions. The call for “creative experimental designs” has serious potential to be abused by folks who say things like “Well my friend Susie totally said that diet worked for her…”. The author says we should throw out RCTs, with all the benefits they have provided, because it is hard to get women to adhere to a pretty serious behavioral intervention over an 8-year period. If anything, this makes us reconsider what counts as a reasonable intervention, not the randomized trial part.</p>
<p><strong>On bailing on statistical independence assumptions</strong></p>
<blockquote>
<p>It is time for science to retire the fiction of statistical independence…..So the overwhelming common practice is simply to assume that sampled events are independent. An easy justification for this is that almost everyone else does it and it’s in the textbooks. This assumption has to be one of the most widespread instances of groupthink in all of science.</p>
</blockquote>
<p><strong>My response: </strong>There are a huge number of statistical methods for dealing with non-independent data. Statisticians have been working on this for decades with <a href="http://en.wikipedia.org/wiki/Blocking_(statistics)">blocking</a>, <a href="http://en.wikipedia.org/wiki/Stratified_sampling">stratification</a>, <a href="http://en.wikipedia.org/wiki/Random_effects_model">random effects</a>, <a href="http://en.wikipedia.org/wiki/Deep_learning">deep learning</a>, <a href="http://en.wikipedia.org/wiki/Multilevel_model">multilevel models</a>, <a href="http://en.wikipedia.org/wiki/Generalized_estimating_equation">GEE</a>, <a href="http://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity">GARCH models</a>, etc., etc., etc. It’s a fact that <a href="http://en.wikiquote.org/wiki/George_E._P._Box">statistical independence is a fiction, but sometimes it is a useful one</a>.</p>
<p><strong>On bailing on the p-value (or any other standardized statistical procedure)</strong></p>
<blockquote>
<p>Not for a minute should anyone think that this procedure has much to do with statistics proper… A 2011 paper in <em>Nature Neuroscience</em> presented an analysis of neuroscience articles in <em>Science, Nature, Nature Neuroscience, Neuron</em> and <em>The Journal of Neuroscience</em>, which showed that although 78 did as they should, 79 used the incorrect procedure.</p>
</blockquote>
<p><strong>My response: </strong>P-values on their own and P-values en-masse are both annoying and not very helpful. But we need a way to tell whether those effect sizes you observed are going to replicate or not. P-values are probably not the best thing for measuring that (<a href="http://www.biomedcentral.com/1471-2105/14/360">maybe you should try to estimate it directly?</a>). But any procedure you scale up to 100,000’s of thousands of users is going to cause all sorts of problems. If you give people more dimensions to call their result “real” or “significant” you aren’t going to reduce false positives. <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">At scale we need fewer researcher degrees of freedom not more</a>.</p>
<p><strong>On science not being self-correcting</strong></p>
<blockquote>
<p>The pace of scientific production has quickened, and self-correction has suffered. Findings that might correct old results are considered less interesting than results from more original research questions. Potential corrections are also more contested. As the competition for space in prestigious journals has become increasingly frenzied, doing and publishing studies that would confirm the rapidly accumulating new discoveries, or would correct them, became a losing proposition. [One remedy is] public registration of the design and analysis plan of a study before it is begun. Clinical trials researchers have done this for decades, and in 2013 researchers in other areas rapidly followed suit. Registration includes the details of the data analyses that will be conducted, which eliminates the former practice of presenting the inevitable fluctuations of multifaceted data as robust results. Reviewers assessing the associated manuscripts end up focusing more on the soundness of the study’s registered design rather than disproportionately favoring the findings. This helps reduce the disadvantage that confirmatory studies usually have relative to fishing expeditions. Indeed, a few journals have begun accepting articles from well-designed studies even before the results come in.</p>
</blockquote>
<p>Wait, I thought there was a big rise in retraction rates that has everyone freaking out. Isn’t there a website just dedicated to <a href="http://retractionwatch.com/">outing and shaming people who retract stuff</a>? I think registry of study designs for confirmatory research is a great idea. But I wonder what the effect would be on reducing the <a href="http://www.acs.org/content/acs/en/education/whatischemistry/landmarks/flemingpenicillin.html">opportunity for scientific mistakes that turn into big ideas</a>. This person needs to read the <a href="http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/">ROC curves of science</a>. Any basic research system that doesn’t allow for a lot of failure is never going to discover anything interesting.</p>
<p><strong>Big effects are due to multiple small effects</strong></p>
<blockquote>
<p>So, do big effects tend to have big explanations, or many explanations? There is probably no single, simple and uniformly correct answer to this question. (It’s a hopeless tree!) But, we can use a simple model to help make an educated guess.</p>
</blockquote>
<p>The author simulates 200 variables each drawn from a N(0,i) for i=1…5. The author finds that most of the largest values come from the N(0,5) not the N(0,1). This says nothing about simple or complex phenomena. It says a lot about how a N(0,5) is more variable than a N(0,1). This does not address the issue of whether hypotheses are correct or not.</p>
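<p>Here is a rough reconstruction of that simulation (not the original code) showing that the largest values are dominated by the most variable distribution, which is all the example really demonstrates:</p>
<pre class="brush: r; title: ; notranslate" title="">set.seed(2014)
sds = rep(1:5, each = 200)                     # 200 draws from N(0, i) for each i = 1,...,5
vals = rnorm(length(sds), mean = 0, sd = sds)
top = head(order(-abs(vals)), 20)              # indices of the 20 largest values in absolute terms
table(sds[top])                                # almost all of them come from the sd = 5 draws
</pre>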
<p><strong>Bonus round: On abandoning evolution</strong></p>
<blockquote>
<p>Intelligent design and other Creationist critiques have been easily shrugged off and the facts of evolution well established in the laboratory, fossil record, DNA record and computer simulations. If evolutionary biologists are really Seekers of the Truth, they need to focus more on finding the mathematical regularities of biology, following in the giant footsteps of Sewall Wright, JBS Haldane, Ronald Fisher and so on.</p>
</blockquote>
<p>Among many other things, this person needs a course in statistics. The people he is talking about focused on quantifying uncertainty about biology, not certainty or mathematical regularity.</p>
<p><strong>One I actually agree with: putting an end to the idea that Big Data solves all problems</strong></p>
<blockquote>
<p>No, I don’t literally mean that we should stop believing in, or collecting, Big Data. But we should stop pretending that Big Data is magic.</p>
</blockquote>
<p>That guy must be reading our blog. <a href="http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/">The key word in data science is science</a>, after all.</p>
<p><strong>On focusing on the variance rather than the mean</strong></p>
<blockquote>
<p>Our focus on averages should be retired. Or, if not retired, we should give averages an extended vacation. During this vacation, we should catch up on another sort of difference between groups that has gotten short shrift: we should focus on comparing the difference in variance (which captures the spread or range of measured values) between groups.</p>
</blockquote>
<p>I actually like most of this article, but the format for the edge.org pieces killed it. The author says we should stop caring about the mean or make it secondary. I completely agree we should consider the variance - the examples he points out are great. But we should also always keep in mind the first moment before we move on to the second, so not “retire” just “add to”.</p>
<p><strong>No one asked me, but here is what I’d throw out:</strong></p>
<ul>
<li>Sweeping generalizations without careful theory, experimentation, and good data</li>
<li>Oversimplified questions that don’t ask for solutions that deal with the complexity of the real world.</li>
<li>Sensationalism by scientists about science</li>
<li>Sensationalism by journalists about science</li>
<li>Absolutist claims about uncertain data</li>
</ul>
Sunday data/statistics link roundup (1/12/2014)
2014-01-13T04:59:04+00:00
http://simplystats.github.io/2014/01/13/sunday-datastatistics-link-roundup-1132014
<p>Well, it technically is Monday, but I never went to sleep, so that still counts as Sunday, right?</p>
<ol>
<li>As a person who has taught a couple of MOOCs, I’m used to getting some pushback from people who don’t like the whole concept. But I’m still happy that I’m not the only one who thinks they are a <a href="http://www.eduwire.com/technology/more-not-or-fear-and-loathing-the-world-of-moocs/">pretty good idea</a> and still worth doing. I think that both the hype and the backlash are too much. The hype claimed it would completely end the university as we know it. The backlash says it will have no impact. I think more likely it will have a major impact on people who traditionally don’t attend colleges. That’s ok with me. I think <a href="http://science-and-food.blogspot.com/2013/06/on-super-professors-and-mooc-pushback.html">this post</a> gets it about right.</li>
<li>The Leekasso is finally dethroned! <a href="http://strimmerlab.org/korbinian.html">Korbinian Strimmer</a> used my simulation code and compared it to CAT scores in the sda package coupled with Higher Criticism feature selection. <a href="https://github.com/jtleek/leekasso/blob/master/cat-vs-leekasso.png">Here is the accuracy plot</a>. Looks like Leekasso is competitive with CAT-Leekasso, but CAT+HC wins. Big win for Github there and thanks to Korbinian for taking the time to do the simulation!</li>
<li>Jack Andraka is <a href="http://www.forbes.com/sites/matthewherper/2014/01/08/why-biotech-whiz-kid-jack-andraka-is-not-on-the-forbes-30-under-30-list/">getting some pushback</a> from serious scientists on the draft of his paper describing the research he <a href="http://www.ted.com/talks/jack_andraka_a_promising_test_for_pancreatic_cancer_from_a_teenager.html">outlined in his TED talk</a>. He is taking the criticism like a pro, which says a lot about the guy. From reading the second hand reviews, it sounds like his project was like most good science projects - it made some interesting progress but needs a lot of grinding before it turns into something real. The hype made it sound too good to be true. I hope that he will just ignore the hype machine from here on in and keep grinding (via Rafa).</li>
<li>I’ve probably posted this before, but here is the <a href="http://matt.might.net/articles/phd-school-in-pictures/">illustrated guide to a Ph.D.</a> Lest you think that little bump doesn’t matter, don’t forget to scroll to the bottom and <a href="http://matt.might.net/articles/my-sons-killer/">read this</a>.</li>
<li>The bmorebiostat bloggers (<a href="http://bmorebiostat.com/">http://bmorebiostat.com/</a>), if you aren’t following them, you should be.</li>
<li><a href="http://source.opennews.org/en-US/articles/introducing-treasuryio/">Potentially cool website</a> for accessing treasury data.</li>
<li>Ok, it’s 5 am. I need a <a href="https://twitter.com/rdpeng/status/422382846041665537">githug</a> and then off to bed.</li>
</ol>
The top 10 predictor takes on the debiased Lasso - still the champ!
2014-01-08T10:39:30+00:00
http://simplystats.github.io/2014/01/08/the-top-10-predictor-takes-on-the-debiased-lasso-still-the-champ
<p>After reposting on the comparison between the Lasso and the always-top-10 predictor (the Leekasso), I got some feedback that the problem could be that I wasn’t debiasing the Lasso (thanks Tim T. on Twitter!). The idea behind debiasing (as I understand it) is to use the Lasso to do feature selection and then fit the model without shrinkage to “debias” the coefficients. The debiased model is then used for prediction. <a href="http://faculty.washington.edu/nrsimon/">Noah Simon</a>, who knows approximately infinitely more about this than I do, kindly provided some code for fitting a debiased Lasso. He is not responsible for any mistakes/silliness in the simulation; he was just nice enough to provide some debiased Lasso code. He mentions a similar idea appears in the <a href="http://cran.r-project.org/web/packages/relaxo/relaxo.pdf">relaxo package</a> if you set <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_0b92f8c2972983f15725fd66e4a72066.gif" style="vertical-align: middle; border: none; " class="tex" alt="\phi=0" /></span>.</p>
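<p>For concreteness, here is a minimal sketch of that two-step idea using the glmnet package on made-up data (this is just an illustration of the approach, not the code Noah provided):</p>
<pre class="brush: r; title: ; notranslate" title="">library(glmnet)
set.seed(1)
x = matrix(rnorm(100 * 500), nrow = 100)     # n = 100 samples, p = 500 predictors
y = rbinom(100, 1, plogis(x[, 1] - x[, 2]))  # outcome driven by the first two predictors
## Step 1: the Lasso picks the features
cvfit = cv.glmnet(x, y, family = "binomial")
beta = as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # drop the intercept
keep = which(beta != 0)
## Step 2: refit the selected features without shrinkage ("debiasing") and predict with this fit
refit = glm(y ~ x[, keep, drop = FALSE], family = binomial)
</pre>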
<p>I used the <a href="http://simplystatistics.org/2014/01/04/repost-prediction-the-lasso-vs-just-using-the-top-10-predictors/">same simulation set up </a>as before and tried out the Leekasso, the Lasso and the Debiased Lasso. Here are the accuracy results (more red = higher accuracy):</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/01/08/the-top-10-predictor-takes-on-the-debiased-lasso-still-the-champ/accuracy-plot-2/" rel="attachment wp-att-2412"><img class="size-medium wp-image-2412 aligncenter" alt="accuracy-plot" src="http://simplystatistics.org/wp-content/uploads/2014/01/accuracy-plot1-300x100.png" width="300" height="100" srcset="http://simplystatistics.org/wp-content/uploads/2014/01/accuracy-plot1-300x100.png 300w, http://simplystatistics.org/wp-content/uploads/2014/01/accuracy-plot1-1024x341.png 1024w" sizes="(max-width: 300px) 100vw, 300px" /></a>
</p>
<p style="text-align: left;">
The results suggest the debiased Lasso still doesn't work well under this design. Keep in mind as I mentioned in my previous post that the Lasso may perform better under a different causal model.
</p>
<p style="text-align: left;">
<strong>Update: </strong> <a href="https://github.com/jtleek/leekasso">Code available here on Github</a> if you want to play around.
</p>
Preparing for tenure track job interviews
2014-01-07T10:00:55+00:00
http://simplystats.github.io/2014/01/07/preparing-for-tenure-track-job-interviews-2
<p><em>Editor’s note: This is a slightly modified version of a previous post.</em></p>
<p>If you are in the job market, you will soon be receiving (or have already received) an invitation for an interview. So how should you prepare? You have two goals. The first is to make a good impression. Here are some tips:</p>
<p>1) During your talk, do NOT go over your allotted time. Practice your talk at least twice, both times in front of a live audience that asks questions.</p>
<p>2) Know your audience. If it’s a “math-y” department, give a more “math-y” talk. If it’s an applied department, give a more applied talk. But (sorry for the cliché) be yourself. Don’t pretend to be interested in something you are not as this almost always backfires.</p>
<p>3) Learn about the faculty’s research interests. This will help during the one-on-one meetings.</p>
<p>4) Be ready to answer the question “what do you want to teach?” and “where do you see yourself in five years?”</p>
<p>5) I can’t think of any department where it is necessary to wear a suit (correct me if I’m wrong in the comments). In some places you might feel uncomfortable wearing a suit while those interviewing you are in <a href="http://owpdb.mfo.de/photoNormal?id=7558" target="_blank">shorts and t-shirt</a>.</p>
<p>Second, and just as important, you want to figure out if you like the department you are visiting. Do you want to spend the next 5, 10, 50 years there? Make sure to find out as much as you can to answer this question. Some questions are more appropriate for junior faculty, the more sensitive ones for the chair. Here are some example questions I would ask:</p>
<p>1) What are the expectations for promotion? Would you promote someone publishing exclusively in subject matter journals such as Nature, Science, Cell, PLoS Biology, American Journal of Epidemiology? Somebody publishing exclusively in Annals of Statistics? Is being a PI on an R01 a requirement for tenure?</p>
<p>2) What are the expectations for teaching/service/collaboration? How are teaching and committee service assignments made?</p>
<p>3) How did you connect with your collaborators? How are these connections made?</p>
<p>4) What percent of my salary am I expected to cover? Is it possible to do this by being a co-investigator?</p>
<p>5) Where do you live? How are the schools? How is the commute?</p>
<p>6) How many graduate students does the department have? How are graduate students funded? If I want someone to work with me, do I have to cover their stipend/tuition?</p>
<p>7) How is computing supported? This varies a lot from place to place. Some departments share amazing systems; ask how costs are shared. How is the IT staff? Is R supported? In other departments you might have to buy your own hardware. Get <strong>all</strong> the details.</p>
<p>Specific questions for the junior Faculty:</p>
<p>Are the expectations for promotion made clear to you? Do you get feedback on your progress? Do the senior faculty mentor you? Do the senior faculty get along? What do you like most about the department? What can be improved? In the last 10 years, what percent of junior faculty get promoted?</p>
<p>Questions for the chair:</p>
<p>What percent of my salary am I expected to cover? How soon? Is there bridge funding? What is a standard startup package? Can you describe the promotion process in detail? What space is available for postdocs? (For hard money places) I love teaching, but can I buy out teaching with grants?</p>
<p>I am sure I missed stuff, so please comment away….</p>
Sunday data/statistics link roundup (1/5/14)
2014-01-05T10:57:59+00:00
http://simplystats.github.io/2014/01/05/sunday-datastatistics-link-roundup-1514
<ol>
<li>If you haven’t seen <a href="http://lolmythesis.com/">lolmythesis</a> it is pretty incredible. 1-2 line description of thesis projects. I think every student should be required to make one of these up before they defend. The best I could come up with for mine is, “We built a machine sensitive enough to measure the abundance of every gene in your body at once; turns out it measures other stuff too.”</li>
<li><a href="http://www.nytimes.com/2013/12/31/science/i-had-my-dna-picture-taken-with-varying-results.html?_r=0">An interesting article</a> about how different direct to consumer genetic tests give different results. It doesn’t say, but it would be interesting if the raw data were highly replicable and the interpretations were different. If the genotype calls themselves didn’t match up that would be much worse on some level. I agree people have a right to their genetic data. On the other hand, I think it is important to remember that even people with Ph.D’s and 15 years experience have trouble interpreting the results of a GWAS. To assume the average individual will understand their genetic risk is seriously optimistic (via Rafa).</li>
<li>The <a href="http://www.codinghorror.com/blog/2006/05/the-ten-commandments-of-egoless-programming.html">10 commandments of egoless programming</a>. These are so important on big collaborative projects like the ones my group has been working on for the last year or so. Fortunately, my students and postdocs are much better at being egoless than I am (I am an academic with a blog, so it isn’t like you couldn’t see the ego coming <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":-)" class="wp-smiley" style="height: 1em; max-height: 1em;" />).</li>
<li><a href="http://cnr.lwlss.net/GarminR/?utm_content=buffer66fff&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer">This is a neat post</a> on parsing and analyzing data from a Garmin. The analysis even produces an automated report! I love it when people do cool things like this with their own data in R.</li>
<li><a href="http://biology.duke.edu/johnsenlab/advice.html">Super interesting advice page</a> for potential graduate students from a faculty member at Duke Biology. This is particularly interesting in light of the ongoing debate about the viability of the graduate education pipeline <a href="http://www.bloomberg.com/news/2014-01-03/can-t-get-tenure-then-get-a-real-job.html">highlighted in this recent article</a>. I think it is important for graduate students in Ph.D. programs to know that not every student goes to an academic position. This has been true for a long time in Biostatistics, where many people end up in industry positions. That also means it is the obligation of Ph.D. programs to prepare students for a variety of jobs. Fortunately, most Ph.D.s in Biostatistics have experience processing data, working with collaborators, and developing data products so are usually also really prepared for industry.</li>
<li><a href="http://stat-graphics.org/movies/prim9.html">This old video</a> of Tukey and Friedman is awesome and mind-blowing (via Mike L.).</li>
<li><a href="http://balancedbudget.baltimorecity.gov/exercise/index.php">Cool site</a> that lets you try to balance Baltimore’s budget. This type of thing would be even cooler if there were Github like pull requests where you could make new suggestions as well.</li>
<li><a href="http://alyssafrazee.com/introducing-R.html">My student Alyssa</a> has a very interesting post on teaching R to a non-programmer in one hour. Take the Frazee Challenge and list what you would teach.</li>
</ol>
Repost: Prediction: the Lasso vs. just using the top 10 predictors
2014-01-04T14:37:24+00:00
http://simplystats.github.io/2014/01/04/repost-prediction-the-lasso-vs-just-using-the-top-10-predictors
<p><em>Editor’s note: This is a previously published post of mine from a couple of years ago (!). I always thought about turning it into a paper. The interesting idea (I think) is how the causal model matters for whether the lasso or the marginal regression approach works better. Also <a href="https://github.com/ecpolley/SuperLearner/blob/master/R/SL.leekasso.R">check it out</a>, the Leekasso is part of the SuperLearner package.</em></p>
<p>One incredibly popular tool for the analysis of high-dimensional data is the <a href="http://www-stat.stanford.edu/~tibs/lasso.html" target="_blank">lasso</a>. The lasso is commonly used in cases when you have many more predictors than independent samples (the n « p problem). It is also often used in the context of prediction.</p>
<p>Suppose you have an outcome <strong>Y</strong> and several predictors <strong>X1</strong>,…,<strong>XM</strong>, the lasso fits a model:</p>
<p><strong>Y = B0 + B1 X1 + B2 X2 + … + BM XM + E</strong></p>
<p>subject to a constraint on the sum of the absolute value of the <strong>B</strong> coefficients. The result is that: (1) some of the coefficients get set to zero, and those variables drop out of the model, (2) other coefficients are “shrunk” toward zero. Dropping some variables is good because there are a lot of potentially unimportant variables. Shrinking coefficients may be good, since the big coefficients might be just the ones that were really big by random chance (this is related to Andrew Gelman’s <a href="http://andrewgelman.com/2011/09/the-statistical-significance-filter/" target="_blank">type M errors</a>).</p>
<p>I work in genomics, where n ≪ p problems come up all the time. Whenever I use the lasso or when I read papers where the lasso is used for prediction, I always think: “How does this compare to just using the top 10 most significant predictors?” I have asked this out loud enough that <a href="http://www.biostat.jhsph.edu/~rpeng/" target="_blank">some</a> <a href="http://www.biostat.jhsph.edu/~iruczins/" target="_blank">people</a> <a href="http://www.bcaffo.com/" target="_blank">around</a> <a href="http://rafalab.jhsph.edu/" target="_blank">here</a> <a href="http://people.csail.mit.edu/mrosenblum/" target="_blank">started</a> calling it the “Leekasso” to poke fun at me. So I’m going to call it that in a thinly veiled attempt to avoid <a href="http://en.wikipedia.org/wiki/Stigler's_law_of_eponymy" target="_blank">Stigler’s law of eponymy</a> (actually Rafa points out that using this name is a perfect example of this law, since this feature selection approach has been proposed before <a href="http://www.stat.berkeley.edu/tech-reports/576.pdf" target="_blank">at least once</a>).</p>
<p>Here is how the Leekasso works. You fit each of the models:</p>
<p><strong>Y = B0 + BkXk + E</strong></p>
<p>then take the 10 variables with the smallest p-values from testing the <strong>Bk</strong> coefficients and fit a linear model with just those 10 variables. You never use 9 or 11; the Leekasso is always 10.</p>
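<p><em>To make that concrete, here is a minimal R sketch of the idea (my own toy version, not the SuperLearner implementation linked above):</em></p>
<pre><code class="language-r"># Toy Leekasso: screen predictors by marginal p-value, keep the top 10,
# and fit an ordinary linear model on just those 10 variables.
leekasso <- function(x, y) {
  # x: n x p numeric matrix of predictors; y: numeric outcome of length n
  pvals <- apply(x, 2, function(xj) summary(lm(y ~ xj))$coefficients[2, 4])
  top10 <- order(pvals)[1:10]   # always exactly 10, never 9 or 11
  df <- data.frame(y = y, x[, top10, drop = FALSE])
  list(fit = lm(y ~ ., data = df), vars = top10)
}
</code></pre>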
<p>For fun I did an experiment to compare the accuracy of the Leekasso and the Lasso.</p>
<p>Here is the setup:</p>
<ul>
<li>I simulated 500 variables and 100 samples for each study, each N(0,1)</li>
<li>I created an outcome that was 0 for the first 50 samples, 1 for the last 50</li>
<li>I set a certain number of variables (between 5 and 50) to be associated with the outcome using the model <strong>Xi = b0i + b1i Y + e</strong> (this is an important choice; more on this later in the post)</li>
<li>I tried different levels of signal for the truly predictive features</li>
<li>I generated two data sets (training and test) from the exact same model for each scenario</li>
<li>I fit the Lasso using the <a href="http://cran.r-project.org/web/packages/lars/index.html" target="_blank">lars </a>package, choosing the shrinkage parameter as the value that minimized the cross-validation MSE in the training set</li>
<li>I fit the Leekasso and the Lasso on the training sets and evaluated accuracy on the test sets.</li>
</ul>
<p>The R code for this analysis is available <a href="http://biostat.jhsph.edu/~jleek/code/leekasso.R" target="_blank">here</a> and the resulting data is <a href="http://biostat.jhsph.edu/~jleek/code/lassodata.rda" target="_blank">here</a>.</p>
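<p><em>If you just want to tinker without downloading the full script, here is a stripped-down sketch of a single scenario. It swaps the lars package for glmnet and simplifies several choices, so the numbers will not match the plot below exactly.</em></p>
<pre><code class="language-r">library(glmnet)

set.seed(1)
n <- 100; p <- 500; n_signal <- 20; effect <- 1

simulate_data <- function() {
  y <- rep(c(0, 1), each = n / 2)
  x <- matrix(rnorm(n * p), n, p)
  # signal features follow the "top 10"-style model Xi = b0i + b1i*Y + e
  x[, 1:n_signal] <- x[, 1:n_signal] + effect * y
  list(x = x, y = y)
}

train <- simulate_data()
test  <- simulate_data()

# Lasso: shrinkage parameter chosen by cross-validation on the training set
cvfit <- cv.glmnet(train$x, train$y, family = "binomial")
lasso_pred <- predict(cvfit, newx = test$x, s = "lambda.min", type = "class")
lasso_acc <- mean(as.numeric(lasso_pred) == test$y)

# Leekasso: top 10 features by marginal p-value, then a plain linear model
pvals <- apply(train$x, 2, function(xj) summary(lm(train$y ~ xj))$coefficients[2, 4])
top10 <- order(pvals)[1:10]
lfit  <- lm(train$y ~ train$x[, top10])
leek_pred <- cbind(1, test$x[, top10]) %*% coef(lfit)
leek_acc  <- mean((leek_pred > 0.5) == test$y)

c(lasso = lasso_acc, leekasso = leek_acc)
</code></pre>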
<p>The results show that for all configurations, using the top 10 has a higher out-of-sample prediction accuracy than the lasso. A larger version of the plot is <a href="http://biostat.jhsph.edu/~jleek/code/accuracy-plot.png" target="_blank">here</a>.</p>
<p><img alt="" src="http://biostat.jhsph.edu/~jleek/code/accuracy-plot.png" width="480" height="240" /></p>
<p>Interestingly, this is true even when there are fewer than 10 real features in the data or when there are many more than 10 real features (remember, the Leekasso always picks 10).</p>
<p>Some thoughts on this analysis:</p>
<ol>
<li>This is only test-set prediction accuracy, it says nothing about selecting the “right” features for prediction.</li>
<li>The Leekasso took about 0.03 seconds to fit and test per data set compared to about 5.61 seconds for the Lasso.</li>
<li>The data generating model is the model underlying the top 10, so it isn’t surprising that it has higher performance. Note that I simulated from the model <strong>Xi = b0i + b1i Y + e</strong>; this is the model commonly assumed in differential expression analysis (genomics) or voxel-wise analysis (fMRI). Alternatively I could have simulated from the model <strong>Y = B0 + B1 X1 + B2 X2 + … + BM XM + E</strong>, where most of the coefficients are zero. In this case, the Lasso would outperform the top 10 (data not shown). This is a key, and possibly obvious, issue raised by this simulation. When doing prediction, differences in the true “causal” model matter a lot. So if we believe the “top 10 model” holds in many high-dimensional settings, then it may be the case that regularization approaches don’t work well for prediction, and vice versa. A short sketch contrasting the two generative models appears just after this list.</li>
<li>I think what may be happening is that the Lasso is overshrinking the parameter estimates; in other words, you take on too much bias in exchange for the reduction in variance. Alan Dabney and John Storey have a really nice <a href="http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0001002" target="_blank">paper</a> discussing shrinkage in the context of genomic prediction that I think is related.</li>
</ol>
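<p><em>To make the contrast between the two data generating models concrete, here is a toy sketch (my own notation, not code from the original analysis):</em></p>
<pre><code class="language-r">set.seed(1)
n <- 100; p <- 500

# Model A ("top 10"-style): the outcome drives a handful of features,
# Xi = b0i + b1i*Y + e, as assumed in differential expression or voxel-wise analyses
y_a <- rep(c(0, 1), each = n / 2)
x_a <- matrix(rnorm(n * p), n, p)
x_a[, 1:10] <- x_a[, 1:10] + y_a   # 10 features shift with the outcome

# Model B (regression-style): a sparse linear combination of features drives the outcome,
# Y = B0 + B1*X1 + ... + BM*XM + E, with most coefficients equal to zero
x_b  <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 10), rep(0, p - 10))
y_b  <- drop(x_b %*% beta) + rnorm(n)
</code></pre>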
The Supreme Court takes on Pollution Source Apportionment...and Realizes It's Hard
2014-01-03T09:00:43+00:00
http://simplystats.github.io/2014/01/03/the-supreme-court-takes-on-pollution-source-apportionment-and-realizes-its-hard
<p>Recently, the U.S. Supreme Court heard arguments in the cases <em><a href="http://www.scotusblog.com/case-files/cases/environmental-protection-agency-v-eme-homer-city-generation/">EPA v. EME Homer City Generation</a></em> and <em><a href="http://www.scotusblog.com/case-files/cases/american-lung-association-v-eme-homer-city-generation/">American Lung Association v. EME Homer City Generation</a></em>. SCOTUSblog has a nice <a href="http://www.scotusblog.com/2013/12/argument-recap-a-good-day-for-epa/#more-201950">summary of the legal arguments</a>, for the law buffs out there.</p>
<p>The basic problem is that the way air pollution is regulated, the EPA and state and local agencies monitor the air pollution in each state. When the levels of pollution are above the national ambient air quality standards at the monitors in that state, the state is considered in “non-attainment” (i.e. they have not attained the standard). Otherwise, they are in attainment.</p>
<p>But what if your state doesn’t actually generate any pollution, but there’s all this pollution blowing in from another state? Pollution knows no boundaries and in that case, the monitors in your state will be in non-attainment, and it isn’t even your fault! The Clean Air Act has something called the “good neighbor” policy that was designed to address this issue. From SCOTUSblog:</p>
<blockquote>
<p>One of the obligations that states have, in drafting implementation plans [to reduce pollution], is imposed by what is called the “good neighbor” policy. It dates from 1963, in a more elemental form, but its most fully developed form requires each state to include in its plan the measures necessary to prevent the migration of their polluted air to their neighbors, if that would keep the neighbors from meeting EPA’s quality standards.</p>
</blockquote>
<p>The problem is that if you live in a state like Maryland, your air pollution is coming from a bunch of states (Pennsylvania, Ohio, etc.). So who do you blame? Well, the logical thing would be to say that if Pennsylvania contributes to 90% of Maryland’s interstate air pollution and Ohio contributes 10%, then Pennsylvania should get 90% of the blame and Ohio 10%. But it’s not so easy because air pollution doesn’t have any special identifiers on it to indicate what state it came from. This is the <em>source apportionment problem</em> in air pollution and it involves trying to back-calculate where a given amount of pollution came from (or what was its source). It’s not an easy problem.</p>
<p>EPA realized the unfairness here and devised the State Air Pollution Rule, also known as the “Transport Rule”. From SCOTUSblog:</p>
<blockquote>
<p>What the Transport Rule sought to do is to set up a regime to limit cross-border movement of emissions of nitrogen oxides and sulfur dioxide. Those substances, sent out from coal-fired power plants and other sources, get transformed into ozone and “fine particular matter” (basically, soot), and both are harmful to human health, contributing to asthma and heart attacks. They also damage natural terrain such as forests, destroy farm crops, can kill fish, and create hazes that reduce visibility.</p>
<p>Both of those pollutants are carried by the wind, and they can be transported very large distances — a phenomenon that is mostly noticed in the eastern states.</p>
</blockquote>
<p>There are actually a few versions of this problem. One common one involves identifying the source of a particle (e.g. automobile, power plant, road dust) based on its chemical composition. The idea here is that at any given monitor, there are particles blowing in from all different types of sources and so the pollution you measure is a mixture of all these sources. Making some assumptions about chemical mass balance, there are ways to statistically separate out the contributions from individual sources based on the chemical composition of the total mass measurement. If the particles that we measure, say, have a lot of ammonium ions and we know that particles generated by coal-burning power plants have a lot of ammonium ions, then we might infer that the particles came from a coal-burning power plant.</p>
<p>The key idea here is that different sources of particles have “chemical signatures” that can be used to separate out their various contributions. This is already a difficult problem, but at least here, we have some knowledge of the chemical makeup of various sources and can incorporate that knowledge into the statistical analysis.</p>
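<p><em>To give a flavor of the statistics involved, here is a toy chemical mass balance sketch in R. The source profiles and numbers below are completely made up; real source apportionment methods are far more careful (non-negativity constraints, measurement error models, and so on).</em></p>
<pre><code class="language-r"># Toy chemical mass balance: observed species = source profiles x contributions + noise
set.seed(1)
species <- c("sulfate", "nitrate", "ammonium", "elemental_carbon", "silicon")

# Hypothetical profiles: fraction of each species in particles from each source
profiles <- cbind(
  coal_plant = c(0.40, 0.05, 0.20, 0.05, 0.01),
  traffic    = c(0.05, 0.30, 0.05, 0.40, 0.02),
  road_dust  = c(0.02, 0.02, 0.01, 0.05, 0.60)
)
rownames(profiles) <- species

true_contrib <- c(coal_plant = 5, traffic = 3, road_dust = 2)  # micrograms per m^3
observed <- drop(profiles %*% true_contrib) + rnorm(length(species), sd = 0.1)

# Unconstrained least-squares estimate of the source contributions
fit <- lm(observed ~ profiles - 1)
round(coef(fit), 2)
</code></pre>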
<p>In the problem at the Supreme Court, we’re not concerned with particles from various types of sources, but rather from different locations. But, for the most part, different states don’t have “chemical signatures” or tracer elements, so it’s hard to identify whether a given particle (or other pollutant) blowing in the wind came from Pennsylvania versus Ohio.</p>
<p>So what did EPA do? Well, instead of figuring out where the pollution came from, they decided that states would reduce emissions based on how much it would cost to control those emissions. The states objected because the cost of controlling emissions may well have nothing to do with how much pollution is actually being contributed downwind.</p>
<p>The legal question involves whether or not EPA has the authority to devise a regulatory plan based on costs as opposed to actual pollution contribution. I will let people who actually know the law address that question, but given the general difficulty of source apportionment, I’m not sure EPA could have come up with a much better plan.</p>
Some things R can do you might not be aware of
2013-12-30T16:04:24+00:00
http://simplystats.github.io/2013/12/30/some-things-r-can-do-you-might-not-be-aware-of
<p>There is a lot of noise around the “R versus Contender X” for Data Science. I think the two main competitors right now that I hear about are Python and Julia. I’m not going to weigh into the debates because I go by the motto: “Why not just use something that works?”</p>
<p>R offers a lot of benefits if you are interested in statistical or predictive modeling. It is basically unrivaled in terms of the breadth of packages for applied statistics. But I think sometimes it isn’t obvious that R can handle some tasks that you used to have to do with other languages. This misconception is particularly common among people who regularly code in a different language and are moving to R. So I thought I’d point out a few cool things that R can do. Please add to the list in the comments if I’ve missed things that R can do people don’t expect.</p>
<ol>
<li><strong>R can do regular expressions/text processing:</strong> Check out <a href="http://cran.r-project.org/web/packages/stringr/index.html">stringr</a>, <a href="http://cran.r-project.org/web/packages/tm/index.html">tm</a>, and a large number of other <a href="http://cran.r-project.org/web/views/NaturalLanguageProcessing.html">natural language processing packages</a> (a tiny example appears just after this list).</li>
<li><strong>R can get data out of a database:</strong> Check out <a href="http://cran.r-project.org/web/packages/RMySQL/index.html">RMySQL</a>, <a href="http://cran.r-project.org/web/packages/rmongodb/index.html">RMongoDB</a>, <a href="http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html">rhdf5</a>, <a href="http://cran.r-project.org/web/packages/ROracle/index.html">ROracle</a>, <a href="http://monetr.r-forge.r-project.org/">MonetDB.R</a> (via Anthony D.).</li>
<li><strong>R can process nasty data: </strong>Check out <a href="http://cran.r-project.org/web/packages/plyr/index.html">plyr</a>, <a href="http://cran.r-project.org/web/packages/reshape2/index.html">reshape2</a>, <a href="http://cran.r-project.org/web/packages/Hmisc/index.html">Hmisc</a></li>
<li><strong>R can process images: </strong><a href="http://www.bioconductor.org/packages/2.13/bioc/html/EBImage.html">EBImage</a> is a good general purpose tool, but there are also packages for various file types like <a href="http://cran.fhcrc.org/web/packages/jpeg/index.html">jpeg</a>.</li>
<li><strong>R can handle different data formats: </strong><a href="http://cran.r-project.org/web/packages/XML/index.html">XML</a> and <a href="http://cran.r-project.org/web/packages/RJSONIO/index.html">RJSONIO</a> handle two common types, but you can also read from Excel files with <a href="http://cran.r-project.org/web/packages/xlsx/index.html">xlsx</a> or handle pretty much every common data storage type (you’ll have to search <a href="http://lmgtfy.com/?q=R+%2B+data+type">R + data type</a>) to find the package.</li>
<li><strong>R can interact with APIs</strong>: Check out <a href="http://cran.r-project.org/web/packages/RCurl/index.html">RCurl</a> and <a href="http://cran.r-project.org/web/packages/httr/">httr</a> for general purpose software, or you could try some specific examples like <a href="http://cran.r-project.org/web/packages/twitteR/index.html">twitteR</a>. You can create an API from R code using <a href="http://yhathq.com/">yhat</a>.</li>
<li><strong>R can build apps/interactive graphics: </strong>Some pretty cool things have already been built with <a href="http://www.rstudio.com/shiny/">shiny</a>, and <a href="http://rcharts.io/">rCharts</a> interfaces with a ton of interactive graphics packages.</li>
<li><strong>R can create dynamic documents: </strong>Try out <a href="http://yihui.name/knitr/">knitr</a> or <a href="http://slidify.org/">slidify</a>.</li>
<li><strong>R can play with Hadoop: </strong>Check out the <a href="https://github.com/RevolutionAnalytics/RHadoop/wiki">rhadoop wiki</a>.</li>
<li><strong>R can create interactive teaching modules:</strong> You can do it in the console with <a href="http://swirlstats.com/">swirl</a> or on the web with <a href="http://www.datacamp.com/">Datamind</a>.</li>
<li><strong>R interfaces very nicely with C/C++ if you need to be hardcore (and maybe with Python too): </strong><a href="http://dirk.eddelbuettel.com/code/rcpp.html">Rcpp</a>, enough said. Also <a href="http://adv-r.had.co.nz/Rcpp.html">read the tutorial</a>. I haven’t tried the <a href="http://cran.r-project.org/web/packages/rPython/rPython.pdf">rPython</a> library, but it looks like a great idea.</li>
</ol>
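<p><em>As a tiny taste of item 1 (my own throwaway example), here is some text processing with base R and stringr:</em></p>
<pre><code class="language-r">library(stringr)

notes <- c("contact me at someone@example.edu",
           "no address in this one",
           "also try another.person@example.org")

# Base R regular expressions
grepl("@", notes)                        # which strings contain an @?
gsub("@.*$", "@[redacted]", notes)       # crude redaction

# The same idea with stringr
pattern <- "[[:alnum:]._]+@[[:alnum:].]+"
str_detect(notes, pattern)
str_extract(notes, pattern)
</code></pre>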
A non-comprehensive list of awesome things other people did this year.
2013-12-20T10:46:28+00:00
http://simplystats.github.io/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year
<p><em>Editor’s Note:</em> <em>I made this list off the top of my head and have surely missed awesome things people have done this year. If you know of some, you should make your own list or add it to the comments! I have also avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. </em></p>
<ul>
<li>I emailed Hadley Wickham about some trouble we were having memory profiling. He wrote back immediately, <a href="https://github.com/hadley/lineprof">then wrote an R package</a>, then wrote <a href="http://adv-r.had.co.nz/memory.html">this awesome guide</a>. That guy is ridiculous.</li>
<li>Jared Horvath <a href="http://blogs.scientificamerican.com/guest-blog/2013/12/04/the-replication-myth-shedding-light-on-one-of-sciences-dirty-little-secrets/">wrote this</a> incredibly well-written and compelling argument for the scientific system that has given us a wide range of discoveries.</li>
<li>Yuwen Liu and colleagues wrote <a href="http://bioinformatics.oxfordjournals.org/content/early/2013/12/06/bioinformatics.btt688.short">this really interesting paper</a> on power for RNA-seq studies comparing biological replicates and sequencing depth. It shows pretty conclusively that you should go for more replicates (music to a statistician’s ears!).</li>
<li>Yoav Benjamini and Yotam Hechtlingler <a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt032.extract">wrote an amazing discussion</a> of the paper we wrote about the science-wise false discovery rate. It contributes new ideas about estimation/control in that context.</li>
<li>Sherri Rose <a href="http://static.squarespace.com/static/5006630b24ac4eefa45a0d3e/t/5027310ee4b09eb28be16153/1344745742832/">wrote a fascinating article</a> about statistician’s role in big data. One thing I really liked was this line: “This may require implementing commonly used methods, developing a new method, or integrating techniques from other fields to answer our problem.” I really like the idea that integrating and applying standard methods in new and creative ways can be viewed as a statistical contribution.</li>
<li>Karl Broman gave his now legendary talk (<a href="http://www.biostat.wisc.edu/~kbroman/presentations/IowaState2013/graphs_combined.pdf">part1</a>/<a href="http://www.biostat.wisc.edu/~kbroman/presentations/IowaState2013/index.html">part2</a>) on statistical graphics, which I think should be required viewing for anyone who will ever plot data, on a Google Hangout with the Iowa State data viz crowd. They had some technical difficulties during the broadcast, so Karl B. took it down. Join me in begging him to put it back up again despite the warts.</li>
<li>Everything Thomas Lumley wrote on <a href="http://notstatschat.tumblr.com/">notstatschat</a>, I follow that blog super closely. I love <a href="http://notstatschat.tumblr.com/post/66056322820/from-labhacks-the-25-scrunchable-scientific-poster">this scrunchable poster</a> he pointed to and <a href="http://notstatschat.tumblr.com/post/62048763550/statins-and-the-causal-markov-property">this post</a> on Statins and the Causal Markov property.</li>
<li>I wish I could take Joe Blitzstein’s <a href="http://cs109.org/">data science class</a>. Particularly check out the reading list, which I think is excellent.</li>
<li>Lev Muchnik, Sinan Aral, and Sean Taylor <a href="http://www.sciencemag.org/content/341/6146/647.abstract">brought the randomized controlled trial</a> to social influence bias on a massive scale. I love how RCTs are finding their way into new, sexy areas.</li>
<li>Genevera Allen taught a congressman about statistical brain mapping and holy crap <a href="http://www.c-spanvideo.org/clip/4465538">he talked about it on the floor of the House.</a></li>
<li>Lior Pachter started <a href="http://liorpachter.wordpress.com/">mixing it up on his blog</a>. I don’t necessarily agree with all of his posts but it is hard to deny the influence that his posts have had on real science. I definitely read it regularly.</li>
<li>Marie Davidian, President of the ASA, has been on a tear this year, doing tons of cool stuff, including landing the big fish, <a href="http://blog.revolutionanalytics.com/2013/08/nate-silver-jsm.html">Nate Silver</a>, for JSM. Super impressive to watch the energy. I’m also really excited to see what Bin Yu works on this year as <a href="http://imstat.org/officials/current_officials.html">president of IMS</a>.</li>
<li>The <a href="http://www.statistics2013.org/">Stats 2013</a> crowd has done a ridiculously good job of getting the word out about statistics this year. I keep seeing statistics pop up in places like the <a href="http://online.wsj.com/news/articles/SB10001424052702303559504579197942777726778">WSJ</a>, which warms my heart.</li>
<li>One way I judge a paper is by how angry/jealous I am that I didn’t think of or write that paper. <a href="http://www.nature.com/nbt/journal/v31/n11/full/nbt.2702.html">This paper</a> on the reproducibility of RNA-seq experiments was so good I was seeing red. I’ll be reading everything that Tuuli Lappalainen’s new group at the <a href="http://www.nygenome.org/">New York Genome Center</a> writes.</li>
<li>Hector Corrada Bravo and the crowd at UMD wrote this paper about <a href="http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2658.html">differential abundance in microbial communities</a> that also made me crazy jealous. Just such a good idea done so well.</li>
<li>Chad Myers and Curtis Huttenhower continue to absolutely tear it up on <a href="http://www.sciencedirect.com/science/article/pii/S109727651300405X">networks</a> and <a href="http://www.nature.com/nbt/journal/v31/n9/abs/nbt.2676.html">microbiome</a> stuff. Just stop guys, you are making the rest of us look bad…</li>
<li><a href="http://www.youtube.com/watch?v=oH7rt2GZnW8&feature=youtu.be">I don’t want to go to Stanford I want to go to Johns Hopkins</a>.</li>
<li>Ramnath keeps Ramnathing (def. to build incredible things at a speed that we can’t keep up with by repurposing old tools in the most creative way possible) with <a href="http://rcharts.io/">rCharts</a>.</li>
<li>Neo Chung and John Storey invented the <a href="http://arxiv.org/pdf/1308.6013v1.pdf">jackstraw</a> for testing the association between measured variables and principal components. It is an awesome idea and a descriptive name.</li>
<li>I wasn’t at <a href="https://secure.bioconductor.org/BioC2013/">Bioc 2013</a>, but I heard from two people who I highly respect (and who it takes a lot to impress) that Levi Waldron gave one of the best talks they’d ever seen. The paper isn’t up yet (I think) but <a href="http://database.oxfordjournals.org/content/2013/bat013.abstract">here is the R package</a> with the data he described. His <a href="https://bitbucket.org/lwaldron/survhd">survHd</a> package for fast coxph fits (think rowFtests but with Cox) is also worth checking out.</li>
<li>John Cook kept cranking out interesting posts, as usual. <a href="http://www.johndcook.com/blog/2013/09/17/to-err-is-human-to-catch-an-error-shows-expertise/">One of my favorites</a> talks about how one major component of expertise is the ability to quickly find and correct inevitable errors (for example, in code).</li>
<li>Larry Wasserman’s <a href="http://normaldeviate.wordpress.com/2013/06/20/simpsons-paradox-explained/">Simpson’s Paradox post</a> should be required reading. He is shutting down Normal Deviate, which is a huge bummer.</li>
<li>Andrew Gelman and I don’t always agree on scientific issues, but there is no arguing that he and the stan team have made a pretty impressive piece of software with <a href="http://mc-stan.org/">stan</a>. Richard McElreath also <a href="https://github.com/rmcelreath/glmer2stan">wrote a slick interface</a> that makes fitting a fully Bayesian model match the syntax of lmer.</li>
<li>Steve Pierson and Ron Wasserstein from the ASA are also doing a huge service for our community in tackling the big issues like interfacing statistics with government funding agencies. <a href="https://twitter.com/ASA_SciPol">Steve’s Twitter feed</a> has been a great resource for keeping track of deadlines for competitions, grants, and other opportunities.</li>
<li><a href="http://spark-1590165977.us-west-2.elb.amazonaws.com/jkatz/SurveyMaps/">Joshua Katz built these amazing dialect maps</a> that have been all over the news. Shiny Apps are getting to be serious business.</li>
<li>Speaking of RStudio, they keep rolling out the goodies, my favorite recent addition is <a href="http://www.rstudio.com/ide/docs/debugging/overview">interactive debugging</a>.</li>
<li>I’ll close with <a href="http://mlg.eng.cam.ac.uk/duvenaud/">David Duvenaud</a>’s HarlMCMC shake:</li>
</ul>
A summary of the evidence that most published research is false
2013-12-16T09:50:58+00:00
http://simplystats.github.io/2013/12/16/a-summary-of-the-evidence-that-most-published-research-is-false
<p>One of the hottest topics in science has two main conclusions:</p>
<ul>
<li>Most published research is false</li>
<li>There is a reproducibility crisis in science</li>
</ul>
<p>The first claim is often stated in a slightly different way: that most results of scientific experiments do not replicate. I recently <a href="http://simplystatistics.org/2013/09/25/is-most-science-false-the-titans-weigh-in/">got caught up in this debate</a> and I frequently get asked about it.</p>
<p>So I thought I’d do a very brief review of the reported evidence for the two perceived crises. An important point is that all of the scientists below have made the best effort they can to tackle a fairly complicated problem, and these are early days in the study of science-wise false discovery rates. But the take home message is that there is currently no definitive evidence one way or another about whether most results are false.</p>
<ol>
<li><strong>Paper:</strong> <a href="http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124">Why most published research findings are false</a>. <strong>Main idea: </strong>People use hypothesis testing to determine if specific scientific discoveries are significant. This significance calculation is used as a screening mechanism in the scientific literature. Under assumptions about the way people perform these tests and report them it is possible to construct a universe where most published findings are false positive results. <strong>Important drawback:</strong> The paper contains no real data, it is purely based on conjecture and simulation.</li>
<li><strong>Paper: </strong><a href="http://www.nature.com/nature/journal/v483/n7391/full/483531a.html">Drug development: Raise standards for preclinical research</a>. <strong>Main idea</strong><strong>: </strong>Many drugs fail when they move through the development process. Amgen scientists tried to replicate 53 high-profile basic research findings in cancer and could only replicate 6. <strong>Important drawback:</strong> This is not a scientific paper. The study design, replication attempts, selected studies, and the statistical methods to define “replicate” are not defined. No data is available or provided.</li>
<li><strong>Paper:</strong> <a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt007.abstract">An estimate of the science-wise false discovery rate and application to the top medical literature</a>. <strong>Main idea:</strong> The paper collects P-values from published abstracts of papers in the medical literature and uses a statistical method to estimate the false discovery rate proposed in paper 1 above. <strong>Important drawback:</strong> The paper only collected data from major medical journals and the abstracts. P-values can be manipulated in many ways that could call into question the statistical results in the paper.</li>
<li><strong>Paper:</strong> <a href="http://www.pnas.org/content/early/2013/10/28/1313476110.abstract">Revised standards for statistical evidence</a>. <strong>Main idea: </strong>The P-value cutoff of 0.05 is used by many journals to determine statistical significance. This paper proposes an alternative method for screening hypotheses based on Bayes factors. <strong>Important drawback</strong>: The paper is a theoretical and philosophical argument for simple hypothesis tests. The data analysis recalculates Bayes factors for reported t-statistics and plots the Bayes factor versus the t-test then makes an argument for why one is better than the other.</li>
<li><strong>Paper: </strong><a href="http://jama.jamanetwork.com/article.aspx?articleid=201218">Contradicted and initially stronger effects in highly cited research</a>. <strong>Main idea: </strong>This paper looks at studies that attempted to answer the same scientific question where the second study had a larger sample size or more robust (e.g. randomized trial) study design. Some effects reported in the second study do not match the results exactly from the first. <strong>Important drawback: </strong>The title does not match the results. 16% of studies were contradicted (meaning an effect in a different direction), 16% reported a smaller effect size, 44% were replicated, and 24% were unchallenged. So 44% + 24% + 16% = 84% were not contradicted. <a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt038.full">Lack of replication is also not proof of error</a>.</li>
<li><strong>Paper</strong><strong>: </strong><a href="http://www.nature.com/nature/journal/vaop/ncurrent/full/nature12786.html">Modeling the effects of subjective and objective decision making in scientific peer review</a>. <strong>Main idea:</strong> This paper considers a theoretical model for how referees of scientific papers may behave socially. They use simulations to point out how an effect called “herding” (basically peer-mimicking) may lead to biases in the review process. <strong>Important drawback:</strong> The model makes major simplifying assumptions about human behavior and supports these conclusions entirely with simulation. No data is presented.</li>
<li><strong>Paper: </strong><a href="http://www.nature.com/ng/journal/v41/n2/abs/ng.295.html">Repeatability of published microarray gene expression analyses</a>. <strong>Main idea: </strong>This paper attempts to collect the data used in published papers and to repeat one randomly selected analysis from the paper. For many of the papers the data was either not available or available in a format that made it difficult/impossible to repeat the analysis performed in the original paper. The types of software used were also not clear. <strong>Important drawback</strong><strong>: </strong>This paper was written about 18 data sets in 2005-2006. This is both early in the era of reproducibility and not comprehensive in any way. This says nothing about the rate of false discoveries in the medical literature but does speak to the reproducibility of genomics experiments 10 years ago.</li>
<li><a href="https://osf.io/wx7ck/"><strong>Paper: </strong>Investigating variation in replicability: The “Many Labs” replication project.</a> (not yet published) <strong>Main idea</strong><strong>: </strong>The idea is to take a bunch of published high-profile results and try to get multiple labs to replicate the results. They successfully replicated 10 out of 13 results and the distribution of results you see is about what you’d expect (see embedded figure below). <strong>Important drawback:</strong> The paper isn’t published yet and it only covers 13 experiments. That being said, this is by far the strongest, most comprehensive, and most reproducible analysis of replication among all the papers surveyed here.</li>
</ol>
<p>I do think that the reviewed papers are important contributions because they draw attention to real concerns about the modern scientific process. Namely</p>
<ul>
<li>We need more statistical literacy</li>
<li>We need more computational literacy</li>
<li>We need to require code be published</li>
<li>We need mechanisms of peer review that deal with code</li>
<li>We need a culture that doesn’t use reproducibility as a weapon</li>
<li>We need increased transparency in review and evaluation of papers</li>
</ul>
<p>Some of these have simple fixes (more statistics courses, publishing code) some are much, much harder (changing publication/review culture).</p>
<p>The Many Labs project (Paper 8) points out that statistical research is proceeding in a fairly reasonable fashion. Some effects are overestimated in individual studies, some are underestimated, and some are just about right. Regardless, no single study should stand alone as the last word about an important scientific issue. It obviously won’t be possible to replicate every study as intensely as those in the Many Labs project, but this is a reassuring piece of evidence that things aren’t as bad as some paper titles and headlines may make it seem.</p>
<div style="width: 379px" class="wp-caption aligncenter">
<img alt="" src="http://2.bp.blogspot.com/-iEeV4FlwKsE/UpdaFts3fzI/AAAAAAAAAqw/OJjvoXG2e6g/s1600/Picture+73.png" width="369" height="244" />
<p class="wp-caption-text">
Many labs data. Blue x's are original effect sizes. Other dots are effect sizes from replication experiments (http://rolfzwaan.blogspot.com/2013/11/what-can-we-learn-from-many-labs.html)
</p>
</div>
<p>The Many Labs results suggest that the hype about the failures of science is, at the very least, premature. I think an equally important idea is that science has pretty much always worked with some number of false positive and irreplicable studies. This was beautifully described by Jared Horvath in this <a href="http://blogs.scientificamerican.com/guest-blog/2013/12/04/the-replication-myth-shedding-light-on-one-of-sciences-dirty-little-secrets/">blog post at Scientific American</a>. I think the take home message is that regardless of the rate of false discoveries, the scientific process has led to amazing and life-altering discoveries.</p>
Sunday data/statistics link roundup (12/15/13)
2013-12-15T23:02:51+00:00
http://simplystats.github.io/2013/12/15/sunday-datastatistics-link-roundup-121513
<ol>
<li>Rafa (in Spanish) <a href="http://www.80grados.net/desmitificando-los-gmos/">clarifying some of the problems</a> with the anti-GMO crowd.</li>
<li>Joe Blitzstein, most recently of <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">#futureofstats</a> fame, talks up data science in the <a href="http://www.thecrimson.com/article/2013/12/11/big-data-joe-blitzstein/">Harvard Crimson</a> (via Rafa). As has been pointed out by <a href="http://simplystatistics.org/2012/10/19/interview-with-rebecca-nugent-of-carnegie-mellon/">Rebecca Nugent</a> when she stopped to visit us, class sizes in undergrad stats programs are blowing up!</li>
<li>If you missed it, Michael Eisen dropped by to chat about open access (<a href="http://simplystatistics.org/2013/12/12/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science/">part 1</a>/<a href="http://simplystatistics.org/2013/12/13/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science-part-22/">part 2</a>). We talked about Randy Schekman, a recent Nobel prize winner who says he <a href="http://www.theguardian.com/commentisfree/2013/dec/09/how-journals-nature-science-cell-damage-science">isn’t publishing in Nature/Science/Cell anymore</a>. Professor Schekman did a Reddit AMA where <a href="http://www.reddit.com/r/IAmA/comments/1sq4vd/im_randy_schekman_corecipient_of_the_2013_nobel/">he got grilled pretty hard</a> about pushing a glamour open access journal eLife, while dissing N/S/C, where he published a lot of stuff before winning the Nobel.</li>
<li>The article I received most over the last couple of weeks <a href="http://www.theguardian.com/science/2013/dec/06/peter-higgs-boson-academic-system">is this one</a>. In it, Peter Higgs says he wouldn’t have had time to think deeply enough to perform the research that led to the Boson discovery in the modern publish-or-perish academic system. But he got the prize, at least in part, because of the people who conceived/built/tested the theory in the Large Hadron Collider. I’m much more inclined to believe someone would have come up with the Boson theory in our current system than that someone would have built the LHC in a system without competitive pressure.</li>
<li>I think <a href="http://www.biasedtransmission.org/2013/12/is-the-obesity-paradox-for-diabetes-simply-bad-statistics.html">this post</a> raises some interesting questions about the <a href="http://jama.jamanetwork.com/article.aspx?articleid=1309174">Obesity Paradox</a> that says overweight people with diabetes may have lower risk of death than normal weight people. The analysis is obviously tongue-in-cheek, but I’d be interested to hear what other people think about whether it is a serious issue or not.</li>
</ol>
Simply Statistics Interview with Michael Eisen, Co-Founder of the Public Library of Science (Part 2/2)
2013-12-13T09:02:08+00:00
http://simplystats.github.io/2013/12/13/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science-part-22
<p>Here is Part 2 of Jeff’s and my interview with Michael Eisen, Co-Founder of the Public Library of Science.</p>
The key word in "Data Science" is not Data, it is Science
2013-12-12T15:24:42+00:00
http://simplystats.github.io/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science
<p>One of my colleagues was just at a conference where they saw a presentation about using data to solve a problem where data had previously not been abundant. The speaker claimed the data were “big data” and a question from the audience was: “Well, that isn’t really big data is it, it is only X Gigabytes”.</p>
<p>While that exact question would elicit groans from most people who work with data, I think it highlights one of the key problems with the thinking around data science. Most people hyping data science have focused on the first word: data. They care about volume and velocity and whatever other buzzwords describe data that is too big for you to analyze in Excel. This hype about the size (relative or absolute) of the data being collected fed into the second category of hype - hype about tools. People threw around EC2, Hadoop, Pig, and had huge debates about Python versus R.</p>
<p>But the key word in data science is not “data”; it is “science”. Data science is only useful when the data are used to answer a question. That is the science part of the equation. The problem with this view of data science is that it is much harder than the view that focuses on data size or tools. It is much, much easier to calculate the size of a data set and say “My data are bigger than yours” or to say, “I can code in Hadoop, can you?” than to say, “I have this really hard question, can I answer it with my data?”.</p>
<p>A few reasons it is harder to focus on the science than the data/tools are:</p>
<ol>
<li><span style="line-height: 16px;">John Tukey’s quote: “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” You may have 100 GB and <a href="http://simplystatistics.org/2013/09/23/the-limiting-reagent-for-big-data-is-often-small-well-curated-data/">only 3 KB are useful</a> for answering the real question you care about. </span></li>
<li>When you start with the question you often discover that you need to collect new data or design an experiment to confirm you are getting the right answer.</li>
<li>It is easy to discover “structure” or “networks” in a data set. There will always be correlations for a thousand reasons if you collect enough data. Understanding whether these correlations matter for specific, interesting questions is much harder.</li>
<li>Often the structure you found on the first pass is due to a phenomenon (measurement error, artifacts, data processing) that doesn’t answer an interesting question.</li>
</ol>
<p>The issue is that the hype around big data/data science will flame out (it already is) if data science is only about “data” and not about “science”. The long term impact of data science will be measured by the scientific questions we can answer with the data.</p>
Simply Statistics Interview with Michael Eisen, Co-Founder of the Public Library of Science (Part 1/2)
2013-12-12T11:42:02+00:00
http://simplystats.github.io/2013/12/12/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science
<p>Jeff and I had a chance to interview <a href="http://www.michaeleisen.org/blog/">Michael Eisen</a>, a co-founder of the <a href="http://plos.org">Public Library of Science</a>, HHMI Investigator, and a Professor at UC Berkeley. We talked with him about publishing in open access and how young investigators might publish in open access journals under the current system. Watch part 1 of the interview above.</p>
Are MOOC's fundamentally flawed? Or is it a problem with statistical literacy?
2013-12-11T13:58:56+00:00
http://simplystats.github.io/2013/12/11/are-moocs-fundamentally-flawed-or-is-it-a-problem-with-statistical-literacy
<p>People know I have taught a MOOC on Data Analysis, so I frequently get emails about updates on the “state of MOOCs”. It definitely feels like the wild west of education is happening right now. If you make an analogy to air travel, I would say we are about here:</p>
<p style="text-align: center;">
<a href="http://charliekennedy.files.wordpress.com/2009/12/1217-wright-bros-1903.jpg"><img class="aligncenter" alt="" src="http://charliekennedy.files.wordpress.com/2009/12/1217-wright-bros-1903.jpg" width="422" height="263" /></a>
</p>
<p style="text-align: left;">
So of course I feel like it is a bit premature for quotes like this:
</p>
<blockquote>
<p style="text-align: left;">
Two years after a Stanford professor drew 160,000 students from around the globe to a free online course on artificial intelligence, starting what was widely viewed as a revolution in higher education, early results for such large-scale courses are disappointing, forcing a rethinking of how college instruction can best use the Internet.
</p>
</blockquote>
<p>
These headlines are being driven in large part by Sebastian Thrun, the founder of Udacity, which has had some trouble with its business model. One reason is that Udacity seems to have had trouble luring instructors from the top schools to its platform.
</p>
<p>
But the main reason that gets cited for the "failure" of MOOCs is <a href="http://www.sjsu.edu/chemistry/People/Faculty/Collins_Research_Page/AOLE%20Report%20-September%2010%202013%20final.pdf">this experiment </a>performed at San Jose State. I previously pointed out one major flaw with the study design: <a href="http://simplystatistics.org/2013/07/19/the-failure-of-moocs-and-the-ecological-fallacy/">that the students in the two comparison groups were not comparable</a>.
</p>
<p>
Here are a few choice quotes from the study:
</p>
<p>
<strong>Poor response rate:</strong>
</p>
<blockquote>
<p>
While a major effort was made to increase participation in the survey research within this population, the result was disappointing (response rates of 32% for Survey 1; 34% for Survey 2, and 32% for Survey 3).
</p>
</blockquote>
<p>
<strong>Not a representative sample:</strong>
</p>
<blockquote>
<p>
The research team compared the survey participants to the entire student population and found significant differences. Most importantly, students who succeeded are over-represented among the survey respondents.
</p>
</blockquote>
<p>
<strong>Difficulties with data collection/processing:</strong>
</p>
<blockquote>
<p>
While most of the data were provided by the end of the Spring 2013 semester, clarifications, corrections and data transformations had to be made for many weeks thereafter, including resolving accuracy questions that arose once the analysis of the Udacity platform data began
</p>
</blockquote>
<p>
These issues alone point to an incredibly suspect study, though that is not the fault of the researchers in question. They were working with the data as best they could, but the study design and data are deeply flawed. The most egregious problem, of course, is the difference in populations between the students who matriculated and those who didn't (Tables 1-4 show the <em>dramatic</em> differences in population).
</p>
<p>
My take home message is that if this study were submitted to a journal it would be seriously questioned on both scientific and statistical grounds. Before we rush to claim that the whole idea of MOOCs is flawed, I think we should wait until more thorough, larger, and better-designed studies are performed.
</p>
NYC crime rates by year/commissioner
2013-12-05T13:54:00+00:00
http://simplystats.github.io/2013/12/05/nyc-crime-rates-by-yearcommissioner
<p>NYC mayor-elect Bill de Blasio <a href="http://www.nytimes.com/2013/12/06/nyregion/de-blasio-to-name-bratton-as-new-york-police-commissioner.html?smid=fb-nytimes&WT.z_sma=NY_DBT_20131205&bicmp=AD&bicmlukp=WT.mc_id&bicmst=%20201385874000000&bicmet=%20201388638800000">is expected</a> to name William J. Bratton to lead the NYPD. Bratton has been commissioner before (1994-1996), so I was curious to see the crime rates during his tenure, which was within the period that saw an impressive drop (1990-2010). Here is the graph of violent crimes per 100,000 inhabitants for <del>NYC</del> NY state for the years 1965-2012 (divided by commissioner). Will Bratton be able to continue the trend? The graph suggests to me that they have hit a “floor” (1960s levels!).</p>
<p><a href="http://simplystatistics.org/2013/12/05/nyc-crime-rates-by-yearcommissioner/nycrimes/" rel="attachment wp-att-2278"><img class="alignnone size-full wp-image-2278" alt="nycrimes" src="http://simplystatistics.org/wp-content/uploads/2013/12/nycrimes.png" width="1036" height="492" srcset="http://simplystatistics.org/wp-content/uploads/2013/12/nycrimes-300x142.png 300w, http://simplystatistics.org/wp-content/uploads/2013/12/nycrimes-1024x486.png 1024w, http://simplystatistics.org/wp-content/uploads/2013/12/nycrimes.png 1036w" sizes="(max-width: 1036px) 100vw, 1036px" /></a></p>
<p>Data is <a href="http://www.disastercenter.com/crime/nycrime.htm">here</a>.</p>
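<p><em>If you want to make a similar plot yourself, here is a rough sketch in R. It assumes you have put the linked table into a data frame <code>crime</code> with columns <code>year</code> and <code>violent_rate</code> (violent crimes per 100,000); the shaded band marks Bratton’s 1994-1996 tenure mentioned above.</em></p>
<pre><code class="language-r"># crime: data frame with columns year (1965-2012) and violent_rate (per 100,000),
# assembled by hand from the linked Disaster Center table
plot(crime$year, crime$violent_rate, type = "l", lwd = 2,
     xlab = "Year", ylab = "Violent crimes per 100,000",
     main = "NY state violent crime rate, 1965-2012")

# shade Bratton's first tenure as commissioner (1994-1996)
rect(1994, min(crime$violent_rate), 1996, max(crime$violent_rate),
     col = rgb(0, 0, 1, 0.1), border = NA)
</code></pre>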
Advice for students on the academic job market
2013-12-04T10:00:26+00:00
http://simplystats.github.io/2013/12/04/advice-for-stats-students-on-the-academic-job-market-2
<p><em>Editor’s note: This is a slightly modified version of a previous post.</em></p>
<p>Job hunting season is upon us. Openings are already being posted <a href="http://www.stat.ufl.edu/vlib/Index.html" target="_blank">here</a>, <a href="http://www.stat.washington.edu/jobs/" target="_blank">here</a>, and <a href="http://jobs.amstat.org/" target="_blank">here</a>. So you should have your CV, research statement, and web page ready. I highly recommend having a web page. It doesn’t have to be fancy. <a href="http://jkp-mac1.uchicago.edu/~pickrell/Site/Home.html" target="_blank">Here</a>, <a href="http://www.biostat.jhsph.edu/~khansen/" target="_blank">here</a>, and <a href="http://www.biostat.jhsph.edu/~jleek/research.html" target="_blank">here</a> are some good ones ranging from simple to a bit over the top. Minimum requirements are a list of publications and a link to a CV. If you have written software, link to that as well.</p>
<p>The earlier you submit the better. Don’t wait for your letters. Keep in mind two things: 1) departments have a limit on how many people they can invite and 2) search committee members get tired after reading 200+ CVs.</p>
<p>If you are seeking an academic job your CV should focus on the following: PhD granting institution, advisor (including postdoc advisor if you have one), and papers. Be careful not to drown out these most important features with superfluous entries. For papers, include three sections: 1-published, 2-under review, and 3-in preparation. For 2, include the journal names and if possible have tech reports available on your web page. For 3, be ready to give updates during the interview. If you have papers for which you are co-first author, be sure to highlight that fact somehow.</p>
<p>So what are the different types of jobs? Before listing the options I should explain the concept of hard versus soft money. Revenue in academia comes from tuition (in public schools the state kicks in some extra $), external funding (e.g. NIH grants), services (e.g. patient care), and philanthropy (endowment). The money that comes from tuition, services, and philanthropy is referred to as hard money. Within an institution, roughly the same amount is available every year and the way it’s split among departments rarely changes. When it does, it’s because your chair has either lost or won a long, hard-fought, zero-sum battle. Research money comes from NIH, NSF, DoD, etc., and one has to write grants to <em>raise</em> funding (which pay part or all of your salary). These days about 10% of grant applications are funded, so it is certainly not guaranteed. Although at the institution level the law of large numbers kicks in, at the individual level it certainly doesn’t. Note that the breakdown of revenue varies widely from institution to institution. Liberal arts colleges are almost 100% hard money while research institutes are almost 100% soft money.</p>
<p>So to simplify, your salary will come from teaching (tuition) and research (grants). The percentages will vary depending on the department. Here are five types of jobs:</p>
<p>1) Soft money university positions: examples are Hopkins and Harvard Biostat. A typical breakdown is 75% soft/25% hard. To earn the hard money you will have to teach, but not that much. In my dept we teach 48 classroom hours a year (equivalent to one one-semester class). To earn the soft money you have to write, and eventually get, grants. As a statistician you don’t necessarily have to write your own grants, you can partner up with other scientists that need help with their data. And there are many! Salaries are typically higher in these positions. Stress levels are also higher given the uncertainty of funding. I personally like this as it keeps me motivated, focused, and forces me to work on problems important enough to receive NIH funding.</p>
<p>1a) Some schools of medicine have Biostatistics units that are 100% soft money. One does not have to teach, but, unless you have a joint appointment, you won’t have access to grad students. Still, these are tenure track jobs. Although at 100% soft, what does tenure mean? I should mention that at MD Anderson, one only needs to raise 50% of one’s salary and the other 50% is earned via service (statistical consulting to the institution). I imagine there are other places like this, as well as institutions that use endowments to provide some hard money.</p>
<p>2) Hard money positions: examples are Berkeley and Stanford Stat. A typical breakdown is 75% hard/25% soft. You get paid a 9-month salary. If you want to get paid in the summer and pay students, you need a grant. Here you typically teach two classes a semester, but many places let you “buy out” of teaching if you can get grants to pay your salary. Some tension exists when chairs decide who teaches the big undergrad courses (lots of grunt work) and who teaches the small seminar classes where you talk about your own work.</p>
<p>2a) Hard money positions: Liberal arts colleges will cover as much as 100% of your salary from tuition. As a result, you are expected to teach much more. Most liberal arts colleges weigh teaching as much as (or more than) research during promotion, although there is a trend towards weighing research more.</p>
<p>3) Research associate positions: examples are jobs in schools of medicine in departments other than Stat/Biostat. These positions are typically 100% soft and are created because someone at the institution has a grant to pay for you. These are usually not tenure track positions and you rarely have to teach. You also have less independence since you have to work on the grant that funds you.</p>
<p>4) Industry: typically 100% hard. There are plenty of for-profit companies where one can have a fruitful research career. AT&T, Google, IBM, Microsoft, and Genentech are all examples of companies with great research groups. Note that S, the language that R is based on, was born at Bell Labs. And one of the co-creators of R now does his research at Genentech. Salaries are typically higher in industry, and <a href="http://gawker.com/375460/facebook-hires-away-googles-top-chef">cafeteria food</a> can be quite awesome. The drawbacks are no access to students and lack of independence (although not always).</p>
<p>5) Government jobs: The FDA and NIH are examples of agencies that have research positions. The NCI’s Biometric Research Branch is an example. I would classify these as 100% hard. But it is different than other hard money places in that you have to justify your budget every so often. Service, collaborative, and independent research is expected. A drawback is that you don’t have access to students although you can get joint appointments. Hopkins Biostat has a couple of NCI researchers with joint appointments.</p>
<p>Ok, that is it for now. Later this month we will blog about job interviews.</p>
On the future of the textbook
2013-12-03T13:11:42+00:00
http://simplystats.github.io/2013/12/03/on-the-future-of-the-textbook
<p>The latest issue of <a href="http://escholarship.org/uc/uclastat_cts_tise">Technological Innovations in Statistics Education</a> is focused on the future of the textbook. Editor Rob Gould has put together an interesting list of contributions as well as discussions from the leaders in the field of statistics education. Articles include</p>
<ul>
<li><span style="line-height: 16px;"><a href="http://escholarship.org/uc/item/12q2z58x">The Course as Textbook: A Symbiotic Relationship in the Introductory Statistics Class</a> by Zieffler, Isaak, and Garfield<br /> </span></li>
<li><a href="http://escholarship.org/uc/item/6ms0x5nf">OpenIntro Statistics: an Open-source Textbook</a> by Cetinkaya-Rundel, Diez, and Barr</li>
<li><a href="http://escholarship.org/uc/item/8mv5b3zt">Textbooks 2.0</a> by Webster West</li>
</ul>
<p>Go check it out!</p>
Academics should not feel guilty for maximizing their potential by leaving their homeland
2013-12-02T10:07:32+00:00
http://simplystats.github.io/2013/12/02/academics-should-not-feel-guilty-for-maximizing-their-potential-by-leaving-their-homeland
<p>In a New York Times op-ed titled <a href="http://www.nytimes.com/2013/11/30/opinion/migration-hurts-the-homeland.html">Migration Hurts the Homeland</a>, Paul Collier tells us that</p>
<blockquote>
<p dir="ltr">
What’s good for migrants from poor places is not always good for the countries they’re leaving behind.
</p>
</blockquote>
<p dir="ltr">
He makes the argument that those that favor open immigration don't realize that they are actually hurting "the poor" more than they are helping. This post is not about the issue of whether migration is bad for the homeland (I know of others that <a href="http://essential.metapress.com/content/cv2115573n315017/">make the opposite claim</a>) but rather about the opinions I have formed by leaving my <a href="http://www.caribbeanbusinesspr.com/news/census-pr-brain-drain-picking-up-80281.html">homeland</a> to become an academic in a US research university.
</p>
<p dir="ltr">
Let me start by pointing out that an outstanding <a href="http://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country">470 Nobel prizes</a> have been handed out to residents of the US or the UK. About 25% of these went to immigrants. These Nobel laureates include academics born in Egypt, Venezuela, and Mexico. In contrast, only one of the 20 prizes awarded to Italy went to an immigrant (none in the last 50 years). I view my university as international, not American.
</p>
<p dir="ltr">
Throughout my career I have encountered several foreign graduate students/postdocs who ponder passing on academic jobs in the US to go back and help the homeland. I was one of them and I admire the commitment of those who decide to go back. However, I think it's important to point out that the accomplishments of those who take jobs in American research universities are in large part due to the unique support that these universities provide. This is particularly true in the sciences, where research success depends on low teaching loads, lab infrastructure, high-performance computers, administrative support for grant submission, and talented collaborators.
</p>
<p dir="ltr">
The latter is by far the most important for applied statisticians like myself who depend on subject matter experts that provide quantitative challenges. Having a critical mass of such innovators is key. Although I will never know for sure, I am quite certain that most of what I have accomplished would not have happened had I returned home.
</p>
<p dir="ltr">
It is also important to point out that my homeland benefits from what I have learned during 15 years working in top research universities. I am always looking for an excuse to visit my friends and family and I also enjoy giving back to my <a href="http://www.uprrp.edu/">alma mater</a>. This has greatly increased my interactions through workshops, academic talks, participation in advisory boards, and many other informal exchanges.
</p>
<p dir="ltr">
So, if you are an up-and-coming academic deciding if you should go back or not, do not let guilt factor into the decision. Humanity benefits from you maximizing your potential. Your homeland will benefit in indirect ways as well.
</p>
<p dir="ltr">
ps - Do <a href="http://www.biostat.jhsph.edu/~jleek/">people from Idaho</a> feel guilty for leaving their <a href="http://www.hcn.org/blogs/range/western-brain-drain">brain-drained state</a>?
</p>
<p dir="ltr">
</p>
Sunday data/statistics link roundup (12/2/13)
2013-12-01T13:26:09+00:00
http://simplystats.github.io/2013/12/01/sunday-datastatistics-link-roundup-12213
<ol>
<li><span style="line-height: 16px;">I’m in Australia for <a href="http://www.maths.adelaide.edu.au/biosummer2013/index.html">Bioinfo Summer 2013</a>! First time in Australia and excited about the great lineup of speakers and to meet a bunch of people at the University of Adelaide. </span></li>
<li><a href="http://www.bostonglobe.com/business/2013/11/26/computer-science-course-breaks-stereotypes-and-fills-halls-harvard/7XAXko7O392DiO1nAhp7dL/story.html">An interesting post</a> about how CS has become the de facto language of our times. They specifically talk about <a href="https://cs50.harvard.edu/">CS50</a> at Harvard. I think in terms of being an informed citizen CS and Statistics are quickly being added to Reading, Writing, and Arithmetic as the required baseline knowledge (link via Alex N.)</li>
<li><a href="http://gking.harvard.edu/files/gking/files/psc47_1-1300153-king-2.pdf">A long but fascinating</a> read by Gary King about restructuring the social sciences with a focus on ending the quantitative/qualitative divide. I think a similar restructuring has been going on in biology for a while. It is nearly impossible to be a modern molecular biologist without at least some basic training in statistics. Similarly statisticians are experiencing an inverted revolution where we are refocusing on applications and some basic scientific experience is becoming a required component of being a statistician (link via Rafa).</li>
<li><a href="http://www.rochester.edu/rocdata/recruit/interdisciplinary.html">This is how you make a splash in data science</a>. Rochester is hiring 20! faculty across multiple disciplines. It will be interesting to see how that works out (link via Rafa). This goes along with the recent announcement of the Moore foundation funding <a href="http://www.moore.org/programs/science/data-driven-discovery/data-science-environments">Berkeley, UW, and NYU to build data science cultures/environments</a>.</li>
<li><a href="http://www.nature.com/news/plos-profits-prompt-revamp-1.14205">PLoS is rich and they have to figure out what to do</a>! They are a non-profit, but their journal PLoS One publishes about 30k papers a year at about 1k a pop. That is some serious money, which they need to figure out how to spend pronto. My main suggestion: fund research to figure out a way to put peer reviewing on the same level as publishing in terms of academic credit (link via Simina B.)</li>
<li>A group of psychologists got together and performed replication experiments for 13 major effects. <a href="https://openscienceframework.org/project/WX7Ck/">They replicated 11/13</a> (of course depending on your definition of replication). Hopefully these results are a good first step toward reducing the mania around the “replication crisis” and refocusing attention back on real solutions.</li>
</ol>
Statistical zealots
2013-11-26T10:51:21+00:00
http://simplystats.github.io/2013/11/26/statistical-zealots
<p>Yesterday <a href="https://github.com/jtleek/datasharing">my data sharing policy</a> went a little bit viral. It hit the front page of Hacker News and was a trending repo on Github. I was <a href="https://news.ycombinator.com/item?id=6793291">reading the comments on Hacker News</a> and came across this gem:</p>
<blockquote>
<p>So, while I can imagine there are good Frequentists Statisticians out there, I insist that frequentism itself is bogus.</p>
</blockquote>
<p>This is the extension of a long standing debate about the relative merits of <a href="http://en.wikipedia.org/wiki/Frequentist_inference">frequentist</a> and <a href="http://en.wikipedia.org/wiki/Bayesian_inference">Bayesian</a> statistical methods. It is interesting that I largely only see one side of the debate being played out these days. The Bayesian zealots have it in for the frequentists in a big way. The Hacker News comments are one example, but <a href="http://www.johnmyleswhite.com/notebook/2012/05/10/criticism-1-of-nhst-good-tools-for-individual-researchers-are-not-good-tools-for-research-communities/">here</a> <a href="http://www.entsophy.net/blog/?p=200">are</a> a <a href="http://wmbriggs.com/blog/?p=5062">few</a> <a href="http://www.nature.com/news/weak-statistical-standards-implicated-in-scientific-irreproducibility-1.14131">more</a>. Interestingly, even the “popular geek press” is getting in the game.</p>
<p style="text-align: center;">
<img class="aligncenter" alt="" src="http://imgs.xkcd.com/comics/frequentists_vs_bayesians.png" width="281" height="425" />
</p>
<p style="text-align: left;">
I think it probably deserves a longer post but here are my thoughts on statistical zealotry:
</p>
<ol>
<li><span style="line-height: 16px;">User effect >>>>>>>>> Philosophy effect. The person doing the statistics probably matters more than the statistical philosophy. I would rather have Andrew Gelman analyze my data than a lot of frequentists. Similarly, I’d rather have John Storey analyze my data than a lot of Bayesians.</span></li>
<li>I agree with Noahpinion that this is likely mostly a <a href="http://noahpinionblog.blogspot.com/2013/01/bayesian-vs-frequentist-is-there-any.html">philosophy battle</a> than a real practical applications battle.</li>
<li>I like <a href="http://arxiv.org/pdf/1106.2895v2.pdf">Rob Kass’s idea</a> that we should move away from frequentist vs. Bayesian to pragmatism. I think most real applied statisticians have already done this, if for no other reason than being pragmatic helps you get things done (see the small example after this list).</li>
<li><a href="http://www.pnas.org/content/early/2013/10/28/1313476110">Papers like this one</a> that claim total victory for one side or the other all have one thing in common: they rarely use real data to verify their claims. The real world is messy and one approach never wins all the time.</li>
</ol>
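<p>To make the pragmatism point concrete, here is a minimal sketch in R with made-up data: for a routine binomial-proportion problem, the frequentist confidence interval and the Bayesian credible interval land in essentially the same place. The counts, the flat Beta(1, 1) prior, and the 95% level are all assumptions of the illustration.</p>
<pre><code># Toy illustration of pragmatism: for a simple binomial proportion with a
# reasonable sample size, the frequentist and Bayesian answers nearly coincide.
x <- 38   # successes (made-up data)
n <- 100  # trials

# Frequentist: exact (Clopper-Pearson) 95% confidence interval
binom.test(x, n)$conf.int

# Bayesian: central 95% credible interval from the Beta(1 + x, 1 + n - x)
# posterior implied by a flat Beta(1, 1) prior
qbeta(c(0.025, 0.975), 1 + x, 1 + n - x)
</code></pre>
<p>Both intervals come out at roughly 0.29 to 0.48; the philosophical gap only starts to bite with strong priors, small samples, or genuinely sequential problems.</p>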
<p>My final thought on this matter is: never trust people with an agenda bearing extreme counterexamples.</p>
Simply Statistics interview with Daphne Koller, Co-Founder of Coursera
2013-11-22T10:13:49+00:00
http://simplystats.github.io/2013/11/22/simply-statistics-interview-with-daphnekoller-co-founder-of-coursera
<p>Jeff and I had an opportunity to sit down with Daphne Koller, Co-Founder of <a href="http://coursera.org">Coursera</a> and Rajeev Motwani Professor of Computer Science at Stanford University. Jeff and I both teach massive open online courses using the Coursera platform and it was great to be able to talk with Professor Koller about the changing nature of education today.</p>
<p>Some highlights:</p>
<ul>
<li>[1:35] <strong>On the origins of Coursera</strong>: “I actually came to that realization when listening to a talk about YouTube, and realizing that, why does it make sense for me to come and deliver the same lecture year after year after year where I could package it in much smaller bite size chunks that were much more fun and much more cohesive and then use the class time for engaging with students in more meaningful ways.”</li>
<li>[7:22] <strong>On the role of MOOCs in academia</strong>: “Sometimes I have these discussions with some people in academic institutions who say that they feel that by engaging, for example, with MOOCs or blogs or social media they are diverting energy from what is their primary function which is teaching of their registered students…. But I think for most academic institutions, if I had to say what the primary function of an academic institution is, it’s the creation and dissemination of knowledge…. The only way society is going to move forward is if more people are better educated.”</li>
<li>[10:15] <strong>On teaching</strong>: “I think that teaching is a scholarly work as well, a kind of distillation of knowledge that has to occur in order to put together a really great course.”</li>
<li>[11:19] <strong>On teaching to the world</strong>. “Teaching, and quality of teaching, that used to be something that you could hide away from everyone…here, we’re suddenly in a world where teaching is really visible to everyone, and as a consequence, good teaching is going to be visible as a role model.”</li>
<li>[13:33] <strong>On work/life balance</strong>: “It’s been insane. It’s also been somewhat surreal…. Sometimes I look at my life and I’m saying really, I mean, whose life is this?”</li>
</ul>
<iframe width="560" height="315" src="https://www.youtube.com/embed/6Mx3_9fo_aE" frameborder="0" allowfullscreen=""></iframe>
You must be at least 20 years old for this job
2013-11-21T20:28:44+00:00
http://simplystats.github.io/2013/11/21/you-must-be-at-least-20-years-old-for-this-job
<p>The New York Times is recruiting a <a href="http://jobs.nytco.com/mobile/job/New-York-Chief-Data-Scientist-Job-NY-10001/27577700/?utm_content=buffer1c5a2&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer">chief data scientist</a>.</p>
Future of Statistics take home messages. #futureofstats
2013-11-21T10:20:25+00:00
http://simplystats.github.io/2013/11/21/future-of-statistics-take-home-messages-futureofstats
<p>A couple weeks ago we had the Future of Statistics Unconference. <a href="http://www.youtube.com/watch?v=Y4UJjzuYjfM">You can still watch it online here</a>. Rafa also attended the <a href="http://www.statistics2013.org/presentations-and-panelists/">Future of Statistical Sciences Workshop</a> and wrote a <a href="http://simplystatistics.org/2013/11/18/feeling-optimistic-after-the-future-of-the-statistical-sciences-workshop/">great summary which you can read here</a>.</p>
<p>I decided to write a summary of take home messages from our speakers at the Unconference. <a href="https://github.com/jtleek/futureofstats/blob/master/README.md">You can read it on Github here</a>. I put it on Github for two reasons:</p>
<ol>
<li>I agree with Hadley’s statement that the future of statistics is on Github.</li>
<li>I summarized them based on my interpretation and would love collaboration on the document. If you want to add your new thoughts/summaries, add a new section with your bullet pointed ideas and send me a pull request!</li>
</ol>
<p>I sent our speakers a gift for presenting in the Unconference (if you were a speaker and didn’t get yours, email me!). Hadley posted the front on Twitter. Here is the back:</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2013/11/21/future-of-statistics-take-home-messages-futureofstats/2013-11-21-10-16-54/" rel="attachment wp-att-2221"><img class="alignnone size-medium wp-image-2221" alt="2013-11-21 10.16.54" src="http://simplystatistics.org/wp-content/uploads/2013/11/2013-11-21-10.16.54-300x225.jpg" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2013/11/2013-11-21-10.16.54-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2013/11/2013-11-21-10.16.54-1024x768.jpg 1024w" sizes="(max-width: 300px) 100vw, 300px" /></a>
</p>
<p style="text-align: center;">
<p style="text-align: left;">
P.S. Stay tuned for the future of Simply Statistics Unconferences.
</p>
</p>
Feeling optimistic after the Future of the Statistical Sciences Workshop
2013-11-18T10:24:17+00:00
http://simplystats.github.io/2013/11/18/feeling-optimistic-after-the-future-of-the-statistical-sciences-workshop
<p>Last week I participated in the <a href="http://www.statistics2013.org/presentations-and-panelists/">Future of the Statistical Sciences Workshop</a>. I arrived feeling somewhat pessimistic about the future of our discipline. My pessimism stemmed from the emergence of the term <em>Data Science</em> and the small role academic (bio)statistics departments are playing in the excitement and initiatives surrounding it. Data Science centers/departments/initiatives are popping up in universities without much interaction with (bio)statistics departments. Funding agencies, interested in supporting Data Science, are not always including academic statisticians in the decision making process.</p>
<p>About <a href="http://www.statistics2013.org/workshop-invited-participants/">100 participants,</a> including many of our discipline’s leaders, attended the workshop. It was organized into <a href="http://www.statistics2013.org/presentations-and-panelists/">sessions</a> with about a dozen talks, some about the future and others featuring collaborations between statisticians and subject matter experts. The collaborative talks provided great examples of the best our field has to offer and the rest generated provocative discussions. In most of these discussions the disconnect between our discipline and Data Science was raised as a cause for concern.</p>
<p>Some participants thought <em>Data Science</em> is just another fad like Data Mining was 10-20 years ago. I actually disagree because I view the recent increase in the number of fields that have suddenly become data-driven <a href="http://simplystatistics.org/2013/05/15/the-bright-future-of-applied-statistics/">as a historical discontinuity</a>. For example, we <a href="http://simplystatistics.org/2011/11/22/data-scientist-vs-statistician/">first posted about statistics versus data science</a> back in 2011.</p>
<p>At the workshop, Mike Jordan explained that the term was coined by industry for practical reasons: emerging companies needed a workforce that could solve problems with data and statisticians were not fitting the bill. However, there was consensus at the workshop that our discipline needs a jolt to meet these new challenges. The take-away messages were all in line with ideas we have been promoting here in Simply Statistics (here is a good <a href="http://simplystatistics.org/2013/04/15/data-science-only-poses-a-threat-to-biostatistics-if-we-dont-adapt/">summary post from Jeff</a>):</p>
<ol>
<li>
<p>We need to engage in real present-day problems (<a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">problem first not solution backward</a>)</p>
</li>
<li>
<p>Computing should be a big part of our PhD curriculum (<a href="http://simplystatistics.org/2013/04/15/data-science-only-poses-a-threat-to-biostatistics-if-we-dont-adapt/">here are some suggestions</a>)</p>
</li>
<li>
<p>We need <a href="http://simplystatistics.org/2012/05/28/schlep-blindness-in-statistics/">to deliver solutions</a> (and stop whining about not being listened to); be more like engineers than mathematicians. (here is a <a href="http://simplystatistics.org/2012/06/22/statistics-and-the-science-club/">related post by Roger</a>, in statistical genomics <a href="http://simplystatistics.org/2012/05/24/how-do-we-evaluate-statisticians-working-in-genomics/">this has been the de facto rule</a> for a while.)</p>
</li>
<li>
<p>We need to improve our communication skills (<a href="http://simplystatistics.org/2012/03/05/characteristics-of-my-favorite-statistics-talks/">in talks</a> or <a href="http://simplystatistics.org/2012/01/05/why-all-academics-should-have-professional-twitter/">on Twitter</a>)</p>
</li>
</ol>
<p>The fact that there was consensus on these four points gave me reason
to feel optimistic about our future.</p>
What should statistics do about massive open online courses?
2013-11-16T19:49:44+00:00
http://simplystats.github.io/2013/11/16/what-should-statistics-do-about-massive-open-online-courses
<p>Marie Davidian, the President of the American Statistical Association, writes about the <a href="http://magazine.amstat.org/blog/2013/11/01/prescolumnnov2013/">JHU Biostatistics effort to deliver massive open online courses</a>. She interviewed Jeff, Brian Caffo, and me and summarized our thoughts.</p>
<blockquote>
<p>All acknowledge that the future is unknown. How MOOCs will affect degree programs remains to be seen. Roger notes that the MOOCs he, Jeff, Brian, and others offer seem to attract many students who would likely not enter a degree program at Hopkins, regardless, so may be filling a niche that will not result in increased degree enrollments. But Brian notes that their MOOC involvement has brought extensive exposure to the Hopkins Department of Biostatistics—for many people the world over, Hopkins biostatistics is statistics.</p>
</blockquote>
What's the future of inference?
2013-11-15T09:56:27+00:00
http://simplystats.github.io/2013/11/15/whats-the-future-of-inference
<p>Rob Gould reports on what appears to have been an interesting <a href="http://citizen-statistician.org/2013/11/14/the-future-of-inference/">panel discussion on the future of statistics</a> hosted by the UCLA Statistics Department. The panelists were Songchun Zhu (UCLA Statistics), Susan Paddock (RAND Corp.), and Jan de Leeuw (UCLA Statistics).</p>
<p>He describes Jan’s thoughts on the future of inference in the field of statistics:</p>
<blockquote>
<p>Jan said that inference as an activity belongs in the substantive field that raised the problem. Statisticians should not do inference. Statisticians might, he said, design tools to help specialists have an easier time doing inference. But the inferential act itself requires intimate substantive knowledge, and so the statistician can assist, but not do.</p>
</blockquote>
<p>I found this comment to be thought provoking. First of all, it sounds exactly like something Jan would say, which makes me smile. In principle, I agree with the premise. In order to make a reasonable (or intelligible) inference you have to have some knowledge of the substantive field. I don’t think that’s too controversial. However, I think it’s incredibly short-sighted to conclude therefore that statisticians should not be engaged in inference. To me, it seems more logical that statisticians should go learn some science. After all, we keep telling the scientists to learn some statistics.</p>
<p>In my experience, it’s not so easy to draw a clean line between the person analyzing the data and the person drawing the inferences. It’s generally not possible to say to someone, “Hey, I just analyze the data, I don’t care about your science.” For starters, that tends to make for bad collaborations. But more importantly, that kind of attitude assumes that you can effectively analyze the data without any substantive knowledge. That you can just “crunch the numbers” and produce a useful product.</p>
<p>Ultimately, I can see how statisticians would want to stay away from the inference business. That part is hard, it’s controversial, it involves messy details about sampling, and opens one up to criticism. And statisticians love to criticize <em>other</em> people. Why would anyone want to get mixed up with that? This is why machine learning is so attractive–it’s all about prediction and in-sample learning.</p>
<p>However, I think I agree with <a href="http://www.biostat.washington.edu/~dwitten/">Daniela Witten</a>, who at our recent <a href="http://simplystatistics.org/unconference/">Unconference</a>, said that the future of statistics <em>is</em> inference. That’s where statisticians are going to earn their money.</p>
The Leek group guide to sharing data with a data analyst to speed collaboration
2013-11-14T11:16:36+00:00
http://simplystats.github.io/2013/11/14/the-leek-group-guide-to-sharing-data-with-a-statistician-to-speed-collaboration
<p>My group collaborates with many different scientists and the number one determinant of how fast we can turn around results is the status of the data we receive from our collaborators. If the data are well organized and all the important documentation is there, it dramatically speeds up the analysis time.</p>
<p>I recently had the experience where a postdoc who was requesting help with an analysis provided an amazing summary of the data she wanted analyzed. It made me want to prioritize her analysis in my queue and it inspired me to write a how-to guide that will help scientific/business collaborators get speedier results from their statistician colleagues.</p>
<p>Here is the <a href="https://github.com/jtleek/datasharing">Leek group guide to sharing data with statisticians/data analysts</a>.</p>
<p>As usual I put it on Github because I’m sure this first draft will have mistakes or less than perfect ideas. I would love help in making the guide more comprehensive and useful. If you issue a pull request make sure you add yourself to list of contributors at the end.</p>
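<p>For readers who want a concrete picture before clicking through, here is a hypothetical sketch in R of the kind of data package that makes an analyst's life easy; the file names, variables, and values below are invented for illustration and are not taken from the guide itself.</p>
<pre><code># One plausible shape of a well-organized data share: a tidy table
# (one row per observation, one column per variable) plus a code book
# that documents every column. All names and values here are made up.
tidy_data <- data.frame(
  subject_id = c("S01", "S01", "S02", "S02"),
  visit      = c(1, 2, 1, 2),
  treatment  = c("drug", "drug", "placebo", "placebo"),
  outcome    = c(5.2, 4.8, 6.1, 6.3)
)

code_book <- data.frame(
  variable    = c("subject_id", "visit", "treatment", "outcome"),
  description = c("De-identified subject label",
                  "Visit number (1 = baseline, 2 = follow-up)",
                  "Assigned study arm",
                  "Primary outcome, in made-up units")
)

write.csv(tidy_data, "tidy_data.csv", row.names = FALSE)
write.csv(code_book, "code_book.csv", row.names = FALSE)
</code></pre>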
Original source code for Apple II DOS
2013-11-13T08:21:24+00:00
http://simplystats.github.io/2013/11/13/original-source-code-for-apple-ii-dos
<p>Someone needs to put <a href="http://www.digibarn.com/collections/business-docs/apple-II-DOS/index.html">this</a> on GitHub right now.</p>
<blockquote>
<p>Thanks Paul Laughton for your donation of this superb collection of early to mid-1978 documents including the letters, agreements, specifications (including hand-written code and schematics), and two original source code listings for the creation of the Apple II “DOS” (Disk Operating System). This was, of course, Apple’s first operating system, written not by Steve Wozniak (“Woz”) but by an external contractor (Paul Laughton working for Shepardson Microsystems). Woz lacked the skills to write an OS (as did anyone then at Apple). Paul authored the actual Apple II DOS to its release in the fall of 1978.</p>
</blockquote>
<p>Update: At this point I see some GitHub stub accounts, but no real code (yet).</p>
Survival analysis for hard drives
2013-11-12T16:29:40+00:00
http://simplystats.github.io/2013/11/12/survival-analysis-for-hard-drives
<p><a href="http://www.extremetech.com/computing/170748-how-long-do-hard-drives-actually-live-for">How long do hard drives last</a>?</p>
<blockquote>
<p>Backblaze has kept up to 25,000 hard drives constantly online for the last four years. Every time a drive fails, they note it down, then slot in a replacement. After four years, Backblaze now has some amazing data and graphs that detail the failure rate of hard drives over the first four years of their life.</p>
</blockquote>
<p>I guess it’s easier to do this with hard drives than it is for people.</p>
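<p>The statistical machinery is the same one we use for people, though. Here is a hedged sketch in R with simulated lifetimes (the Weibull parameters and the four-year cutoff are assumptions for illustration, not Backblaze's numbers): drives still running at the end of the study are right-censored, which is exactly what the Kaplan-Meier estimator handles.</p>
<pre><code># Kaplan-Meier survival curve for simulated hard drive lifetimes.
# The lifetime distribution is made up; only the censoring logic matters here.
library(survival)
set.seed(7)
n <- 25000
lifetime <- rweibull(n, shape = 1.2, scale = 6)   # hypothetical lifetimes in years
observed <- pmin(lifetime, 4)                     # the study only ran four years
failed   <- as.numeric(lifetime <= 4)             # 1 = failure observed, 0 = censored

fit <- survfit(Surv(observed, failed) ~ 1)
plot(fit, xlab = "Years in service", ylab = "Proportion of drives surviving")
</code></pre>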
Future of Statistical Sciences Workshop is happening right now #FSSW2013
2013-11-12T10:07:30+00:00
http://simplystats.github.io/2013/11/12/future-of-statistical-sciences-workshop-is-happening-right-now-fssw2013
<p>ASA Executive Director Ron Wasserstein is tweeting like a madman. If you’re not in London, catch up on what’s happening at the hashtag #FSSW2013.</p>
Apple's Touch ID and a worldwide lesson in sensitivity and specificity
2013-11-11T14:06:57+00:00
http://simplystats.github.io/2013/11/11/apples-touch-id-and-a-worldwide-lesson-in-sensitivity-and-specificity
<p>I’ve been playing with my new iPhone 5s for the last few weeks, and first let me just say that it’s an awesome phone. Don’t listen to whatever Jeff says. It’s probably worth it just for the camera, but I’ve been particularly interested in the behavior of Apple’s fingerprint sensor (a.k.a. Touch ID). Before the phone came out, there were persistent rumors of a fingerprint sensor from now-defunct AuthenTec, and I wondered how the sensor would work given that it was unlikely to be perfect.</p>
<p>Apple reportedly sold 9 million iPhone 5c and 5s models over the opening weekend alone. Of those, about 7 million were estimated to be the 5s model which includes the fingerprint sensor (the 5c does not include it). So now millions of people have been using this thing and I’m getting the sense that many people are experiencing the same behavior I’ve observed over the last few weeks.</p>
<ul>
<li><strong>The sensor appears to have a high specificity</strong>. If you put the wrong finger, or the wrong person’s finger on the sensor, it will not let you unlock the phone. I haven’t seen a single instance of a false positive here, which seems like a good thing.</li>
<li><strong>The sensor’s sensitivity is modest</strong>. Given the correct finger, the sensor seems to have a sensitivity of between 50% and 80% based on my completely unscientific guesstimation. It seems to depend a little on the finger. I don’t know if this is high or low compared to other fingerprint sensors, but it’s mildly annoying to have to switch fingers or type in the passcode more often than I expected.</li>
<li><strong>Behavior seems to change depending on the task</strong>. This is pure speculation, but it seems the sensor is a bit more open to false positives if you’re using it to buy a song on iTunes. Although I haven’t actually seen it happen, it feels like I don’t have to place my finger on the sensor so perfectly if I’m just purchasing a song or an app.</li>
</ul>
<p>If my experiences in any way reflect reality, it seems to make sense. Apple had to make some choices on what cutoffs to make for false positives and negatives, and I think they erred on the side of security. Having a high specificity is critical because that prevents a bad guy from accessing the phone. A low sensitivity is annoying, but not critical because the correct user could always type in a passcode. As for modifying the behavior based on the task, that seems to make sense too because you can’t buy songs or apps without first unlocking the phone.</p>
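<p>To see what that choice of cutoff looks like numerically, here is a toy sketch in R; the score distributions are entirely made up (Apple's matcher is a black box), so the numbers only illustrate the direction of the tradeoff.</p>
<pre><code># Hypothetical match scores for the enrolled finger vs. other fingers.
set.seed(42)
genuine  <- rnorm(10000, mean = 3, sd = 1)   # scores when the right finger is used
impostor <- rnorm(10000, mean = 0, sd = 1)   # scores for the wrong finger/person

rates <- function(threshold) {
  c(sensitivity = mean(genuine > threshold),        # right finger accepted
    specificity = 1 - mean(impostor > threshold))   # wrong finger rejected
}

# A strict cutoff buys near-perfect specificity at the cost of sensitivity
rbind(strict = rates(3.5), lenient = rates(1.5))
</code></pre>
<p>With these made-up distributions, the strict threshold rejects impostor fingers more than 99.9% of the time but accepts the correct finger only about a third of the time, which is roughly the qualitative pattern described above.</p>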
<p>Overall, I think Apple did a good job with the fingerprint sensor, especially for version 1.0. I’m guessing they’re making improvements in the technology/software as we speak and will want to improve the sensitivity before they start using it for more tasks or applications.</p>
Out with Big Data, in with Hyperdata
2013-11-11T09:53:55+00:00
http://simplystats.github.io/2013/11/11/out-with-big-data-in-with-hyperdata
<p>Big data is so <a href="http://www.nytimes.com/2013/11/11/technology/gathering-more-data-faster-to-produce-more-up-to-date-information.html">last year</a>.</p>
<blockquote>
<p itemprop="articleBody">
Collecting data from all sorts of odd places and analyzing it much faster than was possible even a couple of years ago has become one of the hottest areas of the technology industry. The idea is simple: With all that processing power and a little creativity, researchers should be able to find novel patterns and relationships among different kinds of information.
</p>
<p itemprop="articleBody">
For the last few years, insiders have been calling this sort of analysis Big Data. Now Big Data is evolving, becoming more “hyper” and including all sorts of sources. Start-ups like Premise and ClearStory Data, as well as larger companies like General Electric, are getting into the act.
</p>
<p itemprop="articleBody">
...
</p>
<p itemprop="articleBody">
“Hyperdata comes to you on the spot, and you can analyze it and act on it on the spot,” said Bernt Wahl, an industry fellow at the Center for Entrepreneurship and Technology at the University of California, Berkeley. “It will be in regular business soon, with everyone predicting and acting the way Amazon instantaneously changes its prices around.”
</p>
</blockquote>
How to Host a Conference on Google Hangouts on Air
2013-11-05T09:41:33+00:00
http://simplystats.github.io/2013/11/05/how-to-host-a-conference-on-google-hangouts-on-air
<p>We recently hosted the first ever <a href="http://simplystatistics.org/unconference">Simply Statistics Unconference on the Future of Statistics</a>. In preparing for the event, we learned a lot about how to organize such an event and frankly we wished there had been a bit more organized documentation on how to do this. The various Google web sites were full of nice videos demonstrating how cool the technology is, but not much in the way of specific instructions on how to get it done.</p>
<p>I posted on GitHub my <a href="https://github.com/rdpeng/ConferenceGHOA">step-by-step list of instructions</a> for how to set up and run a conference on Google Hangouts on Air in the hopes that someone would find it useful. I’m also happy to accept corrections if something there is not right.</p>
Sunday data/statistics link roundup (11/3/13)
2013-11-03T21:32:57+00:00
http://simplystats.github.io/2013/11/03/sunday-datastatistics-link-roundup-11313
<ol>
<li>There has been a big knockdown-dragout battle in the blogosphere over how GTEX is doing their analysis. Read the original post <a href="http://liorpachter.wordpress.com/2013/10/21/gtex/">here</a>, my summary <a href="http://simplystatistics.org/2013/10/22/blog-posts-that-impact-real-science-software-review-and-gtex/">here</a>, and the <a href="http://liorpachter.wordpress.com/2013/10/31/response-to-gtex-is-throwing-away-90-of-their-data/">response from GTEX here.</a> I agree that the criticism bordered on hyperbolic but also think that methods matter. I also think that consortia are under pressure to get something out and have to use software that works. I’m sympathetic because that must be a tough position to be in, but it is important to remember that software runs != software works well.</li>
<li>Chris Bosh <a href="http://www.businessinsider.com/chris-bosh-thinks-you-should-learn-how-to-code-2013-10">thinks you should learn to code</a>. Me too. I wonder if the Heat will give me a contract now?</li>
<li>Terry Speed wins the Prime Minister’s Prize for science. Here is an <a href="http://www.abc.net.au/news/2013-10-31/prime-ministers-prize-for-science-award-winner-terry-speed/5059718">awesome interview</a> with him. Watch to the end to find out how he is gonna spend all the money.</li>
<li>Learn faster with the <a href="http://www.youtube.com/watch?v=FrNqSLPaZLc">Feynman technique</a>. tl;dr = practice teaching what you are trying to learn.</li>
<li>Via Tim T. Jr. check out this <a href="http://vudlab.com/simpsons/">interactive version</a> of Simpson’s paradox. Super slick and educational.</li>
<li><a href="http://bleacherreport.com/articles/1830249-golden-glove-awards-2013-full-list-of-winners-and-analysis">Stats used to determine</a> the Gold Glove (in part).</li>
<li><a href="https://github.com/tdsmith/aRrgh">An angry newcomer’s guide</a> to data types in R, dangit!</li>
<li><a href="http://accidental-art.tumblr.com/">Accidental aRt</a> - accidentally beautiful creations in R.</li>
</ol>
Unconference on the Future of Statistics (Live Stream) #futureofstats
2013-10-30T11:36:34+00:00
http://simplystats.github.io/2013/10/30/unconference-on-the-future-of-statistics-live-stream-futureofstats
<p>The Unconference on the Future of Statistics will begin at 12pm EDT today. Watch the live stream here.</p>
How to participate in #futureofstats Unconference
2013-10-29T09:52:21+00:00
http://simplystats.github.io/2013/10/29/how-to-participate-in-futureofstats-unconference
<p>Tomorrow is the <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">Unconference on the Future of Statistics</a> from 12PM-1PM EDT. There are two ways that you can get in the game:</p>
<ol>
<li><span style="line-height: 16px;">Ask questions for our speakers on Twitter with the hashtag #futureofstats. Don’t wait; start right now. Roger, Rafa, and I are monitoring the hashtag and collecting questions. We will pick some to ask the speakers tomorrow during the Unconference.</span></li>
<li>If you have an idea about the future of statistics write it up, post it on Github, on Blogger, on WordPress, on your personal website, then tweet it with the hashtag #futureofstats. We will do our best to collect these and post them with the video so your contributions will be part of the Unconference.</li>
</ol>
Tukey Talks Turkey #futureofstats
2013-10-29T09:38:00+00:00
http://simplystats.github.io/2013/10/29/tukey-talks-turkey-futureofstats
<p>I’ve been digging up old “future of statistics” writings from the past in anticipation of our <a href="http://simplystatistics.org/unconference">Unconference on the Future of Statistics</a> this Wednesday 12-1pm EDT. Last week I mentioned Daryl Pregibon’s experience trying to <a href="http://simplystatistics.org/2013/10/25/back-to-the-future-of-statistical-software-futureofstats/">build statistical expertise into software</a>. One classic is <a href="http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoms/1177704711">“The Future of Data Analysis”</a> by John Tukey published in the <em>Annals of Mathematical Statistics</em> in 1962.</p>
<p>Perhaps the most surprising aspect of this paper is how relevant it remains today. I think perhaps with just a few small revisions it could easily be published in a journal today and few people would find it out of place.</p>
<p>In Section 3 titled “How can new data analysis be initiated?” he describes directions in which statisticians should go to grow the field of data analysis. But the advice itself is quite general and probably should be heeded by any junior statistician just starting out in research.</p>
<blockquote>
<p>How is novelty most likely to begin and grow? Not through work on familiar problems, in terms of familiar frameworks, and starting with the results of applying familiar processes to the observations. Some or all of these familiar constraints must be given up in each piece of work which may contribute novelty.</p>
</blockquote>
<p>Tukey’s article serves as a coherent and comprehensive roadmap for the development of data analysis as a field. He suggests that we should study how people analyze data and uncover “what works” and what doesn’t. However, he appears to draw the line at suggesting that such study should result in a single way of analyzing a given type of data. Rather, statisticians should maintain some flexibility in modeling and analysis. I personally think the reality should be somewhere in the middle. Too much flexibility can lead to problems, but rigidity is not the solution.</p>
<p>It is interesting, from my perspective, how much of Tukey’s roadmap was essentially ignored, given how clear and coherent it was in 1962. In fact, the field pretty much went the other direction towards more mathematical elegance (I’m guessing Tukey sensed this would happen). His article is uncomfortable to read, because it’s full of problems that arise in real data that are difficult to handle with standard approaches. He has an uncanny ability to make up methods that look totally bizarre at first glance but are totally reasonable after some thought.</p>
<p>I honestly can’t think of a better way to end this post than to quote Tukey himself.</p>
<blockquote>
<p>The future of data analysis can involve great progress, the overcoming of real difficulties, and the provision of a great service to all fields of science and technology. Will it? That remains to us, to our willingness to take up the rocky road of real problems in preference to the smooth road of unreal assumptions, arbitrary criteria, and abstract results without real attachments. Who is for the challenge?</p>
</blockquote>
<p>Read the paper. And then come join us at 12pm EDT tomorrow.</p>
Simply Statistics Future of Statistics Speakers - Two Truths, One Lie #futureofstats
2013-10-28T10:28:19+00:00
http://simplystats.github.io/2013/10/28/simply-statistics-future-of-statistics-speakers-two-truths-one-lie-futureofstats
<p>Our online conference live-streamed on Youtube is going to happen on October 30th 12PM-1PM Baltimore (UTC-4:00) time. You can find more information <a href="http://simplystatistics.org/unconference">here</a> or sign up for email alerts <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">here</a>. I get bored with the usual speaker bios at conferences so I am turning our speaker bios into a game. Below you will find three bullet pointed items of interest about each of our speakers. Two of them are truths and one is a lie. See if you can spot the lies and <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">sign up for the unconference</a>!</p>
<p><strong><a href="http://had.co.nz/">Hadley Wickham</a></strong></p>
<ul>
<li>Created the ggplot2/devtools packages.</li>
<li>Developed R’s first class system.</li>
<li>Is chief scientist at RStudio.</li>
</ul>
<p><a href="http://www.biostat.washington.edu/~dwitten/"><strong>Daniela Witten</strong></a></p>
<ul>
<li>Developed the most popular method for inferring Facebook connections.</li>
<li>Created the Spacejam algorithm for inferring networks.</li>
<li>Made the Forbes 30 under 30 list twice as a rising scientific star.</li>
</ul>
<p><strong><a href="http://www.people.fas.harvard.edu/~blitz/Site/Home.html">Joe Blitzstein </a></strong></p>
<ul>
<li>A Professor of the Practice of Statistics at Harvard University.</li>
<li>Created the first statistical method for automatically teaching the t-test.</li>
<li>His statistics 101 course is frequently in the top 10 courses on iTunes U.</li>
</ul>
<p><a href="http://www.biostat.jhsph.edu/~hji/"><strong>Hongkai Ji</strong></a></p>
<ul>
<li>Developed the hmChIP database of over 2,000 ChIP-Seq and ChIP-Chip data samples.</li>
<li>Coordinated the analysis of the orangutan genome project.</li>
<li>Analyzed data to help us understand sonic-hedgehog mediated neural patterning.</li>
</ul>
<p><a href="http://web.mit.edu/sinana/www/"><strong>Sinan Aral</strong></a></p>
<ul>
<li>Coined the phrase “social networking potential”.</li>
<li>Ran a large randomized study that determined the value of upvotes.</li>
<li>Discovered that peer influence is dramatically overvalued in product adoption.</li>
</ul>
<p><strong><a href="http://www.hilarymason.com/">Hilary Mason</a></strong></p>
<ul>
<li>Is a co-founder of DataGotham and HackNY</li>
<li>Developed computational algorithms for identifying the optimal cheeseburger</li>
<li>Founded the first company to create link sorting algorithms.</li>
</ul>
Sunday data/statistics link roundup (10/27/13)
2013-10-27T13:16:51+00:00
http://simplystats.github.io/2013/10/27/sunday-datastatistics-link-roundup-102713
<ol>
<li><a href="http://www.ncbi.nlm.nih.gov/pubmedcommons/">Pubmed Commons</a> is a new post-publication commenting system. I think this is a great idea and I hope it succeeds. Right now it is in “private beta” so only people with Pubmed Commons accounts can post/view comments. But you can follow along with who is making comments via <a href="https://twitter.com/pmctrawler">this neat twitter bot</a>. I think the main feature this lacks to be a hugely successful experiment is some way to give real, tangible academic credit to commenters. One very obvious way would be by assigning DOIs to every comment and making the comments themselves Pubmed searchable. Then they could be listed as contributions on CVs - a major incentive.</li>
<li><a href="http://countaleph.wordpress.com/2013/10/20/dear-startups-stop-asking-me-math-puzzles-to-figure-out-if-i-can-code/">A post</a> on the practice of asking potential hires tricky math problems - even if they are going to be hired to do something else (like software engineering). This happens all the time in academia as well - often the exams we give/questions we ask aren’t neatly aligned with the ultimate goals of a program (producing innovative/determined researchers).</li>
<li>This is going to be a short Sunday Links because my <a href="https://www.coursera.org/course/dataanalysis">Coursera class</a> is starting again tomorrow.</li>
<li>Don’t forget that next week is the <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">Unconference on the Future of Statistics</a> on Wednesday, October 30th at noon Baltimore time!</li>
</ol>
(Back to) The Future of Statistical Software #futureofstats
2013-10-25T08:24:52+00:00
http://simplystats.github.io/2013/10/25/back-to-the-future-of-statistical-software-futureofstats
<p>In anticipation of the upcoming <a href="http://simplystatistics.org/unconference/">Unconference on the Future of Statistics</a> next Wednesday at 12-1pm EDT, I thought I’d dig up what people in the past had said about the future so we can see how things turned out. In doing this I came across an old National Academy of Sciences report from 1991 on the <a href="http://www.nap.edu/catalog.php?record_id=1910">Future of Statistical Software</a>. This was a panel discussion hosted by the National Research Council and summarized in this volume. I believe you can download the entire volume as a PDF for free from the NAS web site.</p>
<p>The entire volume is a delight to read but I was particularly struck by Daryl Pregibon’s presentation on “Incorporating Statistical Expertise into Data Analysis Software” (starting on p. 51). Pregibon describes his (unfortunate) experience trying to develop statistical software which has the ability to incorporate expert knowledge into data analysis. In his description of his goals, it’s clear in retrospect that he was incredibly ambitious to attempt to build a kind of general-purpose statistical analysis machine. In particular, it was not clear how to incorporate subject matter information.</p>
<blockquote>
<div title="Page 63">
<div>
<div>
<div>
<p>
[T]he major factor limiting the number of people using these tools was the recognition that (subject matter) context was hard to ignore and even harder to incorporate into software than the statistical methodology itself. Just how much context is required in an analysis? When is it used? How is it used? The problems in thoughtfully integrating context into software seemed overwhelming.
</p>
</div>
</div>
</div>
</div>
</blockquote>
<p>Pregibon skirted the problem of integrating subject matter context into statistical software.</p>
<blockquote>
<div title="Page 64">
<div>
<div>
<div>
<p>
I am not talking about integrating context into software. That is ultimately going to be important, but it cannot be done yet. The expertise of concern here is that of carrying out the plan, the sequence of steps used once the decision has been made to do, say, a regression analysis or a one-way analysis of variance. Probably the most interesting things statisticians do take place before that.
</p>
</div>
</div>
</div>
</div>
</blockquote>
<p>Statisticians (and many others) tend to focus on the application of the “real” statistical method: the regression model, lasso shrinkage, or support vector machine. But as much painful experience in a variety of fields has demonstrated, much of what happens before the application of the key model is as important, or even more so.</p>
<p>Pregibon makes an important point that although statisticians are generally resistant to incorporating their own expertise into software, they have no problem writing textbooks about the same topic. I’ve observed the same attitude when I talk about <a href="http://simplystatistics.org/2013/08/28/evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-2/">evidence-based data analysis</a>. If I were to guess, the problem is that textbooks are still to a certain extent abstract, while software is 100% concrete.</p>
<blockquote>
<div title="Page 62">
<p>
Initial efforts to incorporate statistical expertise into software were aimed at helping inexperienced users navigate through the statistical software jungle that had been created…. Not surprisingly, such ideas were not enthusiastically embraced by the statistics community. Few of the criticisms were legitimate, as most were concerned with the impossibility of automating the “art” of data analysis. <strong>Statisticians seemed to be making a distinction between providing statistical expertise in textbooks as opposed to via software</strong>. [emphasis added]
</p>
</div>
</blockquote>
<p>In short, Pregibon wanted to move data analysis from an <em>art</em> to a <em>science</em>, more than 20 years ago! He stressed that data analysis, at that point in time, was not considered a process worth studying. I found the following paragraph interesting and worth considering now, over 20 years later. He talks about the reasons for incorporating statistical expertise into software.</p>
<blockquote>
<div title="Page 64">
<div>
<div>
<div>
<p>
The third [reason] is to study the data analysis process itself, and that is my motivating interest. Throughout American or even global industry, there is much advocacy of statistical process control and of understanding processes. <strong>Statisticians have a process they espouse but do not know anything about</strong>. It is the process of putting together many tiny pieces, the process called data analysis, and is not really understood. Encoding these pieces provides a platform from which to study this process that was invented to tell people what to do, and about which little is known. [emphasis added]
</p>
</div>
</div>
</div>
</div>
</blockquote>
<div title="Page 64">
<p>
I believe we have come quite far since 1991, but I don't think we know much more about the process of data analysis, especially in newer areas that involve newer data. The reason is that the field has not put much effort into studying the whole data analysis process. I think there is still a resistance to studying this process, in part because it involves "stooping" to analyze data and in part because it is difficult to model with mathematics. In his presentation, Pregibon suggests that resampling methods like the bootstrap might allow us to skirt the mathematical difficulties in studying data analysis processes.
</p>
<p>
One interesting lesson Pregibon relates during the development of REX, an early system that failed, involves the difference between the end-goals of statisticians and non-statisticians:
</p>
<blockquote>
<div title="Page 68">
<p>
Several things were learned from the work on REX. The first was that statisticians wanted more control. There were no users, rather merely statisticians looking over my shoulder to see how it was working. Automatically, people reacted negatively. They would not have done it that way. In contrast, non-statisticians to whom it was shown loved it. They wanted less control. In fact they did not want the system--they wanted answers.
</p>
</div>
</blockquote>
</div>
The Leek group guide to reviewing scientific papers
2013-10-23T11:14:03+00:00
http://simplystats.github.io/2013/10/23/the-leek-group-guide-to-reviewing-scientific-papers
<p>There has been a lot of discussion of peer review on this blog and elsewhere. One thing I realized is that no one ever formally taught me the point of peer review or how to write a review.</p>
<p>Like a lot of other people, I have been <a href="http://simplystatistics.org/2012/07/11/my-worst-recent-experience-with-peer-review/">frustrated by the peer review process</a>. I also now frequently turn to my students to perform supervised peer review of papers, both for their education and because I can’t handle the large number of peer review requests I get on my own.</p>
<p>So I wrote <a href="https://github.com/jtleek/reviews">this guide</a> on how to write a review of a scientific paper on Github. Last time I did this <a href="http://simplystatistics.org/2013/10/07/the-leek-group-policy-for-developing-sustainable-r-packages/">with R packages </a>a bunch of people contributed to make the guide better. I hope that the same thing will happen this time.</p>
Blog posts that impact real science - software review and GTEX
2013-10-22T11:53:56+00:00
http://simplystats.github.io/2013/10/22/blog-posts-that-impact-real-science-software-review-and-gtex
<p>There was a flurry of activity on social media yesterday surrounding a blog post by <a href="http://liorpachter.wordpress.com/2013/10/21/gtex/">Lior Pachter</a>. He was speaking about the <a href="http://commonfund.nih.gov/GTEx/">GTEX project</a> - a large NIH funded project that has the goal of understanding expression variation within and among human beings. The project has measured gene expression in multiple tissues of over 900 individuals.</p>
<p>In the post, the author claims that the GTEX project is “throwing away” 90% of its data. The basis for this claim is a simulation study using the parameters from one of the author’s papers. The claim of 90% is based on the fact that increasing the number of mRNA fragments leads to increasing correlation in abundance measurements in the simulation study. In order to get the same Spearman correlation as other methodologies have at 10M fragments, the software being used by GTEX needs 100M fragments.</p>
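<p>The shape of that argument is easy to reproduce with a toy simulation in R. To be clear, this is not the simulation from the post (which used parameters estimated in the author's own paper); the log-normal abundances and multinomial sampling below are assumptions chosen only to show how the Spearman correlation between true and estimated abundances grows with the number of fragments.</p>
<pre><code># Toy version of the fragments-vs-correlation argument (not the original simulation).
set.seed(1)
n_transcripts <- 5000
true_abundance <- rlnorm(n_transcripts, meanlog = 0, sdlog = 2)
true_prop <- true_abundance / sum(true_abundance)

spearman_at <- function(n_fragments) {
  counts <- rmultinom(1, size = n_fragments, prob = true_prop)[, 1]
  cor(counts, true_prop, method = "spearman")
}

# Correlation climbs toward 1 as the number of sequenced fragments grows
sapply(c(1e5, 1e6, 1e7, 1e8), spearman_at)
</code></pre>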
<p>This post and the associated furor raise three issues:</p>
<ol>
<li>The power and advantage of blog posts and social media as a form of academic communication.</li>
<li>The importance of using published software.</li>
<li>Extreme critiques deserve as much scrutiny as extreme claims.</li>
</ol>
<p>The first point is obvious; the post was rapidly disseminated and elicited responses from the leaders of the GTEX project. Interestingly, I think the authors got an early view of the criticisms they would face from reviewers through the blog post. The short term criticism is probably not fun to deal with but it might save them time later.</p>
<p>I think the criticism about using software that has not been fully vetted through the publication/peer review process is an important one. For such a large scale project, you’d like to see the primary analysis being done with “community approved” software. The reason is that we just don’t know if it is better or worse because no one published a study on the software. It would be interesting to see how the <a href="http://simplystatistics.org/2012/09/07/top-down-versus-bottom-up-science-data-analysis/">bottom up approach</a> would have fared here. The good news for GTEX here is that for future papers they will either get out a more comprehensive comparison or they will switch software - either of which will improve their work.</p>
<p>Regarding point 3, Pachter did a “back of the envelope” calculation that suggested the Flux software wasn’t performing well. These back of the envelope calculations are very important - <a href="http://simplystatistics.org/2013/03/06/the-importance-of-simulating-the-extremes/">if you can’t solve the easy case, how can you expect to solve the hard case?</a> Lost in all of the publicity about the 90% number is that Pachter’s blog post hasn’t been vetted, either. Here are a few questions that immediately jumped to my mind when reading the blog post:</p>
<ol>
<li>Why use Spearman correlation as the important measure of agreement?</li>
<li>What is the correlation between replicates?</li>
<li>What parameters did he use for the Flux calculation?</li>
<li>Where is his code so we can see if there were any bugs (I’m sure he is willing to share, but I don’t see a link)?</li>
<li>That 90% number seems very high; I wonder whether varying the simulation approach, parameter settings, etc. would show it isn’t quite that bad (a toy sketch of this kind of check follows the list).</li>
<li>Throwing away 90% of your data might not matter if you get the right answer to the question you care about at the end. Can we evaluate something closer to what we care about? A list of DE genes/transcripts, for example?</li>
</ol>
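<p>To make question 5 concrete, here is a toy R sketch of the kind of check I have in mind. This is emphatically <em>not</em> Pachter’s simulation: the number of transcripts, the abundance distribution, and the depths are all made up. The point is only to show how sequencing depth drives the Spearman correlation between true and estimated abundances under even the simplest sampling model, and how easy it is to rerun such a check with different settings.</p>
<pre><code># Toy sketch (invented parameters, not Pachter's simulation)
set.seed(1)
n_tx   <- 20000                         # hypothetical number of transcripts
true_p <- rgamma(n_tx, shape = 0.5)     # skewed "true" relative abundances
true_p <- true_p / sum(true_p)

spearman_at_depth <- function(n_frag) {
  counts <- rmultinom(1, size = n_frag, prob = true_p)[, 1]
  cor(counts, true_p, method = "spearman")
}

sapply(c(1e6, 1e7, 1e8), spearman_at_depth)   # correlation rises with depth
</code></pre>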
<p>Whenever a scientist sees a claim as huge as “throwing away 90% of the data” they should be skeptical. This is particularly true in genomics, where huge effects are often due to bugs or artifacts. So in general, it is important that we apply the same level of scrutiny to extreme critiques as we do to extreme claims.</p>
<p>My guess is that, ultimately, the 90% number will end up being an overestimate of how bad the problem is. On the other hand, I think it was hugely useful for Pachter to point out the potential issue and give GTEX the chance to respond. If nothing else, it points out (1) the danger of using unpublished methods when good published alternatives exist and (2) that science moves faster in the era of blog posts and social media.</p>
<p><em>Disclaimers: I work on RNA-seq analysis although I’m not an author on any of the methods being considered. I have spoken at a GTEX meeting, but am not involved in the analysis of the data. Most importantly, I have not analyzed any data and am in no position to make claims about any of the software in question. I’m just making observations about the sociology of this interaction.</em></p>
PubMed commons is launching
2013-10-22T11:00:47+00:00
http://simplystats.github.io/2013/10/22/pubmed-commons-is-launching
<p><a href="http://www.ncbi.nlm.nih.gov/pubmed">PubMed</a>, the main database of life sciences and biomedical literature, is now allowing comments and upvotes. <a href="http://www.ncbi.nlm.nih.gov/pubmedcommons">Here</a> is more information and the twitter handle is @PubMedCommons.</p>
Why are the best relievers not used when they are most needed?
2013-10-21T10:00:51+00:00
http://simplystats.github.io/2013/10/21/why-are-the-best-relievers-not-used-when-they-are-most-needed
<p>During Saturday’s ALCS game 6 the Red Sox’s manager John Farrell took out his starter in the 6th inning. They were leading by 1, but had runners on first and second with no outs. This is a hard situation to get out of without giving up a run. The chances of scoring with an average pitcher are about <a href="http://www.nssl.noaa.gov/users/brooks/public_html/feda/datasets/expectedruns.html">64</a>%. I am sure that with a top of the line pitcher, like <a href="http://www.baseball-reference.com/players/u/ueharko01.shtml">Koji Uehara</a>, this number goes down substantially. So what does a typical manager do in this situation? Because managers like to save their better relievers for the end, and it’s only the 6th inning, they will bring in a mediocre one instead. This is what Farrell did, and two batters later the score was 2-1 Tigers. To really understand why this is a bad move, note that the chances of a mediocre pitcher giving up runs when starting an inning are about 28%. So why not bring in your best reliever when the game is actually on the line? <a href="http://www.billjamesonline.com/stats26/">Here</a> is an article by John Dewan with a good in-depth discussion. Note that the Red Sox won the game 5-2 and Koji Uehara was brought in for the ninth inning to get 3 outs with the bases empty and a 3-run lead.</p>
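<p>A quick back-of-the-envelope version of the leverage argument: suppose, purely as an assumption on my part, that an elite reliever cuts the other team’s chance of scoring by about 30% relative to a mediocre one. Then the ace saves you far more in the jam than at the start of a clean ninth.</p>
<pre><code># The 64% and 28% figures are from the run-expectancy table linked above;
# the 30% relative improvement for an elite reliever is a made-up assumption.
p_jam    <- 0.64   # chance of allowing a run: runners on 1st and 2nd, no outs
p_clean  <- 0.28   # chance of allowing a run: bases empty, start of an inning
ace_gain <- 0.30   # hypothetical relative reduction from using your best reliever

p_jam * ace_gain     # ~0.19: scoring probability saved by using the ace in the jam
p_clean * ace_gain   # ~0.08: scoring probability saved by holding him for a clean 9th
</code></pre>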
Platforms and Integration in Statistical Research (Part 2/2)
2013-10-18T08:40:43+00:00
http://simplystats.github.io/2013/10/18/platforms-and-integration-in-statistical-research-part-22
<p>In my <a href="http://simplystatistics.org/2013/10/15/platforms-and-integration-in-statistical-research-part-12/">last post</a>, I talked about two general approaches to conducting statistical research: platforms and integration. In this followup I thought I’d describe the characteristics of certain fields that suggest taking one approach over the other.</p>
<p>I think in practice, most statisticians will dedicate some time to both the platform and integrative approaches to doing statistical research because different approaches work better in different situations. The question then is not “Which approach is better?” but rather “What characteristics of a field suggest one should take a platform / integrative approach in order to have the greatest impact?” I think one way to answer this question is to make an analogy with transaction costs a la the <a href="http://en.wikipedia.org/wiki/Theory_of_the_firm">theory of the firm</a>. (This kind of analogy also plays a role in determining who best to collaborate with but that’s a different post).</p>
<p>In the context of an academic community, I think if it’s easy to exchange information, for example, about data, then building platforms that are widely used makes sense. For example, if everyone uses a standardized technology for collecting a certain kind of data, then it’s easy to develop a platform that applies some method to that data. Regression methodology works in any field that can organize their data into a rectangular table. On the other hand, if information exchange is limited, then building platforms is more difficult and closer collaboration may be required with individual investigators. For example, if there is no standard data collection method or if everyone uses a different proprietary format, then it’s difficult to build a platform that generalizes to many different areas.</p>
<p>There are two case studies with which I am somewhat familiar that I think are useful for demonstrating these characteristics.</p>
<ul>
<li><strong>Genomics</strong>. I think genomics is an area where you can see statisticians definitely taking both approaches. However, I’m struck by the intense focus on the development of methods and data analysis pipelines, particularly in order to adapt to the ever-changing ‘omics technologies that are being developed. Part of the reason is that for a given type of data, there are relatively few companies developing the technology for collecting the data. Here, it is possible to develop a method or pipeline to deal with a new kind of data generated by a new technology in the early stages of when that data are being produced. If your method works well relative to others, then it’s possible for your method to become essentially a standard approach that everyone uses for that technology. So there’s a pretty big incentive to be the person who develops a platform for a data collection technology on which everyone builds their research. It is helpful if you can get early access to new technologies so you can get a peek at the data before everyone else and get a head start on developing the methods. Another aspect of genomics is that the field is quite open relative to others, in that there is quite a bit of information sharing. With the enormous amount of publicly available data out there, there’s a very large population of potential users of your method/software. Those people who don’t collect their own primary data can still take your method and apply it to data that’s already out there. Therefore, I think from a statistician’s point of view, genomics is a field that presents many opportunities to build platforms that will be used by many people addressing many different types of questions.</li>
<li><strong>Environmental Health</strong>. The area of environmental health, where I generally operate, is a much smaller field than genomics. You can see this by looking at things like journal impact factors and h-indices. It does not have the same culture as genomics: relatively little data is shared openly, and there are typically no requirements from journals to make data available upon publication. Data are often very expensive and time-consuming to collect, particularly if you are running large cohorts and are monitoring things like personal exposure. There are no real standardized methods for data collection and many formats are proprietary. Statisticians in this area tend to be attached to larger groups who run studies or collect human health and exposure data. It’s relatively hard to be an independent statistician here because you need access to a collaborator who has relevant expertise, resources, and data. The lack of publicly available health data severely limits the participation of statisticians outside the biomedical research institutions where the data are primarily collected. I would argue that in environmental health, the integrative approach is more fruitful because (1) in order to do the work in the first place you already need to be working closely with people collecting the health data; (2) there is a general lack of information sharing and standardization with respect to data collection; (3) if you develop a new tool, there is not a particularly large audience available to adopt those tools; (4) because studies are not unified by shared technologies, as in genomics, it’s often difficult to usefully generalize methodology from one study to the next. While I think it’s possible to develop general methodology for a certain type of study in this field, the impact is inherently limited due to the small size of the field.</li>
</ul>
<p>In the end I think areas that are ripe for the platform approach to statistical research are those that are very open and have culture of information sharing, have a large community of active methodologists, and have a lot of useful publicly available data. Areas that do not have these qualities might be better served by an integrative approach where statisticians work more directly with scientific collaborators and focus on the specific questions and problems of a given study.</p>
The @fivethirtyeight effect - watching @walthickey gain Twitter followers in real time
2013-10-17T10:16:31+00:00
http://simplystats.github.io/2013/10/17/the-fivethirtyeight-effect-watching-walthickey-gain-twitter-followers-in-real-time
<p>Last night Nate Silver announced that he had hired Walt Hickey away from Business Insider to be an editor for the new http://www.fivethirtyeight.com/ website with a couple of tweets:</p>
<blockquote class="twitter-tweet" width="550">
<p>
Super excited to announce that 538 is hiring <a href="https://twitter.com/WaltHickey">@WaltHickey</a>, the talented young writer/journalist/data geek from Business Insider.
</p>
<p>
— Nate Silver (@NateSilver538) <a href="https://twitter.com/NateSilver538/status/390608474763063296">October 16, 2013</a>
</p>
</blockquote>
<blockquote class="twitter-tweet" width="550">
<p>
.<a href="https://twitter.com/WaltHickey">@WaltHickey</a> will have a similarly broad range for 538, bringing a data-driven view toward all types of things. Give him a follow!
</p>
<p>
— Nate Silver (@NateSilver538) <a href="https://twitter.com/NateSilver538/status/390608971280551936">October 16, 2013</a>
</p>
</blockquote>
<p>I knew about Walt because he <a href="http://www.businessinsider.com/fox-news-charts-tricks-data-2012-11">syndicated one of my posts</a> about Fox News Graphics on Business Insider. But he clearly wasn’t as well known as Nate S., who is probably the face of statistical analysis to most people in the world. So I figured the announcement might increase Walt’s following on Twitter.</p>
<p>After goofing around a bit with the <a href="https://dev.twitter.com/">Twitter API</a> and the <a href="http://cran.r-project.org/web/packages/twitteR/index.html">twitteR</a> R package, I managed to start sampling the number of followers for Walt H. This started about an hour or so (I think) after the announcement was made; here is a plot of Walt’s followers over about two hours.</p>
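<p>For the curious, here is a rough sketch of the kind of polling loop this involves. It is not the exact code I ran; the OAuth keys are placeholders, and you would need your own Twitter app credentials for <code>setup_twitter_oauth</code>.</p>
<pre><code>library(twitteR)
# Placeholder credentials -- replace with your own app's keys
setup_twitter_oauth("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

followers <- data.frame(time = as.POSIXct(character()), count = integer())
for (i in 1:120) {                       # poll roughly once a minute for two hours
  n <- getUser("WaltHickey")$followersCount
  followers <- rbind(followers, data.frame(time = Sys.time(), count = n))
  Sys.sleep(60)
}
plot(followers$time, followers$count, type = "l",
     xlab = "time", ylab = "followers")
</code></pre>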
<p><a href="http://simplystatistics.org/2013/10/17/the-fivethirtyeight-effect-watching-walthickey-gain-twitter-followers-in-real-time/walthickey-followers-3/" rel="attachment wp-att-2048"><img class="alignnone size-full wp-image-2048" alt="walthickey-followers" src="http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-followers.png" width="400" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-followers-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-followers-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-followers.png 400w" sizes="(max-width: 400px) 100vw, 400px" /></a></p>
<p>Over the two hours he gained almost 1,000 followers! We can also take a look at the rate he was gaining followers.</p>
<p><a href="http://simplystatistics.org/2013/10/17/the-fivethirtyeight-effect-watching-walthickey-gain-twitter-followers-in-real-time/walthickey-rate/" rel="attachment wp-att-2049"><img class="alignnone size-full wp-image-2049" alt="walthickey-rate" src="http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-rate.png" width="400" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-rate-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-rate-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-rate.png 400w" sizes="(max-width: 400px) 100vw, 400px" /></a>n</p>
<p>So he was gaining followers at around 10-15 per minute on average at 7:30 yesterday. It cooled off over those two hours, but he was still getting a few followers a minute. To put those numbers in perspective, our Twitter account, @simplystats, gets on average about 10 new followers <em>a day</em>.</p>
<p>So there you have it, the real time (albeit two hours too late) 538 bump in Twitter followers.</p>
Platforms and Integration in Statistical Research (Part 1/2)
2013-10-15T10:21:50+00:00
http://simplystats.github.io/2013/10/15/platforms-and-integration-in-statistical-research-part-12
<p>In the technology world today, one of the common topics of interest is the so-called “war” between Apple and Google (or Android). This war is ostensibly over dominance of the mobile phone industry, where Apple sells the most popular phone but Google/Android (as an operating system) controls over half the market. (Android phones themselves are manufactured by a variety of companies and no one of those companies sells more phones than Apple.)</p>
<p><strong>Apple vs. Google (vs. Microsoft)</strong></p>
<p>Apple’s model is to own the entire (or most of the) development of the phone. They build the hardware, the software, and create the design. They also control the App Store for selling their own software and third party software, distribute the music from their iTunes store, and distribute the e-books through their iBookstore. They even have their own proprietary messaging platform. This “walled-garden” approach is a hallmark of Apple and its famously controlling founder Steve Jobs. Rather than “walled-garden”, I would call it more of an “integrative” approach, where Apple essentially has its fingers in all the relevant pies, controlling every aspect.</p>
<p>The Google/Android approach is more modular and involves controlling the platform on which pretty much every phone could theoretically be built. Until recently, Google did not build their own phones, but rather let other companies build the phones and use Google’s operating system as the software for the phone. The model here is similar to the <a href="http://en.wikipedia.org/wiki/Unix_philosophy">Unix philosophy</a>, which is to “do one thing well”. Google is really good at developing Android and handset manufacturers are really good at building phones. There’s no point in one company doing two things moderately well when you could have two companies each do one thing really well. Here, Google focuses on the platform, the Android operating system, and tries to spread it as far and wide as possible to cover the most possible phones, tablets, watches, whatever mobile device is relevant.</p>
<p>For us older people, the more relevant “war” is between Microsoft and everyone else. Microsoft built one of the most legendary platforms in computer history: the Windows operating system. For decades this platform was (and continues to be) the dominant operating system for personal desktop computers. Although Microsoft never really focused on building its own hardware, Microsoft’s impact on the PC world through its control of Windows is undeniable. Unfortunately, an asterisk must be put next to all of this history because we now know that much of this dominance was achieved through <a href="http://en.wikipedia.org/wiki/United_States_v._Microsoft_Corp.">criminal activity</a>.</p>
<p><strong>Theory and Methods vs. Applications</strong></p>
<p>There’s much debate in the technology world over which approach is better, the Apple integrative model or the Google/Microsoft modular platform model. I think this “debate” exists because it’s fun to argue about Apple vs. Google and it gives technology reporters something to write about. When the dust settles (if ever) I think the answer will be “it depends”.</p>
<p>In the statistical community I find there’s often an analogous debate that goes on regarding which is the more important form of statistical activity, theory/methods or applications. In a nutshell (perhaps even a cartoon nutshell) there’s a sense that theoretical or abstract methodological development has a greater impact because it is broadly generalizable to many different areas. Applications work is less impactful because it is focused on a specific area and any lessons learned that might be applicable to other areas would only be realized much later.</p>
<p>We could spend a lot of time debating the specific arguments here (and I have already spent that time!) but I think a better way to frame this debate is to use the analogy of Apple and Google, that is between integrative statistical research and platforms research. In particular, I think the “theory vs. applications” moniker is a bit outdated and does not cover many of the recent developments in the field of statistics.</p>
<p><strong>Platforms in Statistics</strong></p>
<p>When I was in graduate school and learning about being a statistician, it was pretty much hammered into my brain that the ultimate goal of a statistician is to build a platform. It was not described to me in those words, but that was the essential message. The basic idea was that you would develop a new method that was as general as possible so that it could be applied to a wide variety of fields, from agriculture to zoology. Ideally, you would demonstrate that this method was better than any other method through some sort of theorem.</p>
<p>The ultimate platform in statistics might be the <em>t</em>-test, or perhaps the <em>p</em>-value. Those two statistical methods are used in some form in almost any scientific context you could possibly imagine. I’d argue that the p-value is the Microsoft Windows of science. (Much like with Windows, you could argue this is for better or for worse.) Other essential platforms in statistics might be linear regression, generalized linear models, the bootstrap, the EM algorithm, etc. If you could be the developer of one of these platforms your impact would be tremendous because everyone in every discipline would use it. That’s why Ronald Fisher should be the <a href="http://simplystatistics.org/2012/03/07/r-a-fisher-is-the-most-influential-scientist-ever/">most influential scientist ever</a>.</p>
<p>I think the notion of a platform, rather than theory/methods, is a much more useful context here because it more accurately describes why these things are so important. Generalized linear models may be interesting because they represent an abstract concept of linear relationships, but it’s useful because it’s a platform on which a ton of other research in many other fields can be built. If we accept the idea that something is important because it serves as a platform on which many other things can be built, then I think this idea encompasses more than what might be published in the pages of the <em>Journal of the American Statistical Association</em> or the <em>Annals of Statistics</em>.</p>
<p>In particular, I think one of the greatest statistical platforms developed in the last 10 to 15 years is <a href="http://www.r-project.org/">R</a>. If you consider what R really is, yes it’s a software package that does statistical calculations, but primarily it’s a platform on which an enormous community of people can build things. The Comprehensive R Archive Network is the “App Store” through which statisticians can develop and distribute their tools. R itself has been extended (through packages) and applied to almost every imaginable scientific discipline. Take one look at the <a href="http://cran.r-project.org/web/views/">Task Views</a> section to get a sense of the diversity of areas to which R has been applied. Entire sub-projects (i.e. <a href="http://bioconductor.org">Bioconductor</a>) have been developed around using R in specific fields of research. At this point the impact of R on both the sciences and on statistics is as undeniable as the <em>t</em>-test.</p>
<p><strong>Integrative Statistical Research</strong></p>
<p>Integrative research in statistics is something that I think harks back to a much earlier era in the history of statistics, the era in which the field of statistics didn’t really exist. Before the field really had solidified itself as a separate discipline, people “doing statistics” <a href="http://simplystatistics.org/2011/09/10/what-is-a-statistician/">came from all areas of science as well as mathematics</a>. Here, the statistician was involved in all aspects of research and not just walled-off in a separate area dreaming up abstract methods. Many methods were later abstracted and generalized, but this largely grew out of an initial need to solve a specific problem.</p>
<p>As the field matured and separate Departments of Statistics started to appear, the discipline moved more towards a platform approach by focusing more on abstraction and generalizable approaches. It’s easy to see why this move would occur. If you’re trying to distinguish your discipline as being separate from other disciplines (and therefore deserving of separate resources), you need to demonstrate a unique contribution that is separate from the other fields and, in a sense, wall yourself off a little from the others. Given that computers were not generally available at the time this move began, mathematics was the most useful and easily accessible tool to build new statistical platforms.</p>
<p>Today, I think the field of statistics is moving back towards the old model of integrating closer with scientists in other disciplines. In particular, we are seeing more and more people “invading” the field from other related areas like computer science, just like the old days. Personally, I think these “outsiders” should be welcomed under our tent as they bring unique insights to our field and provide a richness not otherwise obtainable.</p>
<p>With the integrative statistical research model we see more statisticians “embedded” into the sciences, in the thick of it, so to speak, with involvement in every aspect. They publish in discipline-specific journals and in some cases are flat-out leading large-scale scientific collaborations. The reasons for this are many, but I think are centered around advances in computer technology that have allowed for the rapid collection of large and complex datasets and the easy implementation of sophisticated models. The heterogeneity and unique complexity of these different datasets has required statisticians to dig deeper into the field and learn more of the substantive details before a useful contribution can be made. This accumulation of deep knowledge of a field occurs at the expense of being able to work in many different fields at once, or as John Tukey said, to “play in everyone’s backyard”.</p>
<p>The integrative approach to statistical research is exciting because it allows for the statistician to have a direct impact on a scientific discipline rather than a more indirect one through developing platforms. However, the approach is resource intensive in that it requires an interdisciplinary research environment with <a href="http://simplystatistics.org/2011/10/20/finding-good-collaborators/">good collaborators</a> in the relevant disciplines. As such, it may only be possible to take the integrative approach in certain institutions and environments. I think a similar argument could be made with respect to conducting platform research but I think there are many cases there where it was not strictly necessary.</p>
<p>In the next post, I’ll talk a bit (and give examples) about where I think the platform and integrative approaches may be more or less fruitful.</p>
Teaching least squares to a 5th grader by calibrating a programmable robot
2013-10-15T10:07:16+00:00
http://simplystats.github.io/2013/10/15/teaching-least-squares-to-a-5th-grader-by-calibrating-a-programmable-robot
<p>The Lego Mindstorm kit provides software and hardware to create programmable robots. A very simple first task is figuring out how to make the robot move any given distance. You get to program the number of wheel rotations. The video below shows how one can use this to motivate and teach least squares. We assumed the formula was distance = K × rotations, collected data for 1, 2, …, 10 rotations, then used R to motivate (via plots) and calculate the least squares estimate of K.</p>
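<p>For concreteness, here is a small R sketch of that fit. The measurements below are invented for illustration (they are roughly what you would expect from a wheel of diameter 5.6 cm, since π × 5.6 ≈ 17.6 cm per rotation); they are not the numbers we collected in the video.</p>
<pre><code>rotations <- 1:10
distance  <- c(17.5, 34.8, 52.9, 70.1, 88.0, 105.6, 122.9, 140.7, 158.3, 176.0)  # cm, hypothetical

plot(rotations, distance)             # points fall close to a line through the origin
fit <- lm(distance ~ 0 + rotations)   # no intercept: distance = K * rotations
coef(fit)                             # least squares estimate of K, about 17.6 cm per rotation
</code></pre>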
<p>Not shown in the video is my explanation of how we could also use the formula circumference = π × diameter to figure out K and a discussion about which approach is better. Next project will be to calibrate turns, which are achieved by rotating the wheels in opposite directions. This time I will use both the geometric approach (compute the wheel circumference and the circumference defined by robot turns) and the statistical approach.</p>
A general audience friendly explanation for why Lars Peter Hansen won the Nobel Prize
2013-10-14T10:54:33+00:00
http://simplystats.github.io/2013/10/14/why-did-lars-peter-hansen-win-the-nobel-prize-generalized-method-of-moments-explained
<p><em>Lars Peter Hansen won the Nobel Prize in economics for creating the generalized method of moments. <a href="http://en.wikipedia.org/wiki/Generalized_method_of_moments">A rather technical</a> explanation of the idea appears on Wikipedia. <a href="http://lipas.uwasa.fi/~sjp/Teaching/gmm/lectures/gmmc3.pdf">These</a> are a good set of lecture notes on gmms if you like math. I went over to Marginal Revolution to see what was being written about the Nobel Prize winners. Clearly a bunch of other people were doing the same thing as the site was pretty slow to load. <a href="http://marginalrevolution.com/marginalrevolution/2013/10/lars-peter-hansen-nobel-laureate.html">Here is what Tyler C. says about Hansen</a>. In describing Hansen’s work he says:</em></p>
<blockquote>
<p>For years now journalists have asked me if Hansen might win, and if so, how they might explain his work to the general reading public. Good luck with that one.</p>
</blockquote>
<p><em><a href="http://marginalrevolution.com/marginalrevolution/2013/10/lars-peter-hansen-nobelist.html">Alex T. does a good job</a> of explaining the idea, but it still seems a bit technical for my tastes. <a href="http://noahpinionblog.blogspot.com/2013/10/lars-peter-hansen-explained-kind-of.html">Guan Y.</a> does another good, and a little less technical explanation here, but it is still a little rough if you aren’t an economist. So I took a shot at an even more “general audience friendly” version below.</em></p>
<p>A very common practice in economics (and most other scientific disciplines) is to collect experimental data on two (or more) variables and to try to figure out if the variables are related to each other. A huge amount of statistical research is dedicated to this relatively simple-sounding problem. Lars Hansen won the Nobel Prize for his research on this problem because:</p>
<ol>
<li><strong>Economists (and scientists) hate assumptions they can’t justify with data and want to use the fewest number possible. </strong>The recent <a href="http://www.newyorker.com/online/blogs/johncassidy/2013/04/the-rogoff-and-reinhart-controversy-a-summing-up.html">Rogoff and Reinhart controversy</a> illustrates this idea. They wrote a paper that suggested public debt was bad for growth. But when they estimated the relationship between variables they made assumptions (chose weights) that have been questioned widely - suggesting that public debt might not be so bad after all. But not before a bunch of politicians used this result to justify austerity measures which had a huge impact on the global economy.</li>
<li><strong>Economists (and mathematicians) love to figure out the “one true idea” that encompasses many ideas.</strong> When you show something about the really general solution, you get all the particular cases for free. This means that all the work you do to show some statistical procedure is good helps not just you in a general sense, but all the specific cases that are examples of the general things you are talking about.</li>
</ol>
<p>I’m going to use a really silly example to illustrate the idea. Suppose that you collect information on the weight of animals bodies and the weight of their brains. You want to find out how body weight and brain weight are related to each other. You collect the data, they might look something like this:<a href="http://simplystatistics.org/2013/10/14/why-did-lars-peter-hansen-win-the-nobel-prize-generalized-method-of-moments-explained/weights-2/" rel="attachment wp-att-1990"><img class="alignnone size-full wp-image-1990" alt="weights" src="http://simplystatistics.org/wp-content/uploads/2013/10/weights1.png" width="445" height="427" /></a></p>
<p>So it looks like if you have a bigger body you have a bigger brain (except for poor old Triceratops who is big but has a small brain). Now you want to say something quantitative about this. For example:</p>
<blockquote>
<p>Animals that are 1 kilogram larger have a brain that is on average k kilograms larger.</p>
</blockquote>
<p>How do you figure that out? Well one problem is that you don’t have infinite money so you only collected information on a few animals. But you don’t want to say something just about the animals you measured - you want to change the course of science forever and say something about the relationship between the two variables <em>for all animals</em>.</p>
<p>The best way to do this is to make some assumptions about what the measurements of brain and body weight look like if you could collect all of the measurements. It turns out if you assume that you know the complete shape of the distribution in this way, it becomes pretty straightforward (with a little math) to estimate the relationship between brain and body weight using something called <a href="http://en.wikipedia.org/wiki/Maximum_likelihood">maximum likelihood estimation</a>. This is probably the most common way that economists or scientists relate one variable to another (<a href="http://en.wikipedia.org/wiki/Ronald_Fisher">the inventor</a> of this approach is still waiting for his Nobel).</p>
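<p>As a hedged illustration, the <code>Animals</code> dataset in the MASS R package contains average brain and body weights much like the ones plotted above (body is in kilograms and brain in grams there). If we assume the deviations around the line are normal with constant variance, the maximum likelihood estimate of k is just the least squares fit through the origin:</p>
<pre><code>library(MASS)                          # Animals: average body (kg) and brain (g) weights
# Assuming normal errors around body = k * brain, the MLE of k is the least squares slope:
fit <- lm(body ~ 0 + brain, data = Animals)
coef(fit)
</code></pre>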
<p>The problem is you assumed a lot to get your answer. For example, here are the data from just the brains that we have collected. It is pretty hard to guess exactly what shape the data from the whole world would look like.</p>
<p><a href="http://simplystatistics.org/2013/10/14/why-did-lars-peter-hansen-win-the-nobel-prize-generalized-method-of-moments-explained/brains/" rel="attachment wp-att-1995"><img class="alignnone size-full wp-image-1995" alt="brains" src="http://simplystatistics.org/wp-content/uploads/2013/10/brains.png" width="445" height="427" /></a></p>
<p>This presents the next problem: how do we know that we have the “right one”?</p>
<p>We don’t.</p>
<p>One way to get around this problem is to use a very old idea called the <a href="http://en.wikipedia.org/wiki/Method_of_moments_(statistics)">method of moments</a>. Suppose we believe the equation:</p>
<p style="text-align: center;">
<em>Average<strong> in World</strong> Body Weight = k * Average <strong>in World</strong> Brain Weight</em>
</p>
<p style="text-align: left;">
In other words, if we take any animal in the world whose brain weighs on average 5 kilos, then its body will on average weigh (k * 5) kilos. The relationship is only "on average" because there are a bunch of variables we didn't measure and they may affect the relationship between brain and body weight. You can see this in the scatterplot because the points don't all fall exactly on one line.
</p>
<p style="text-align: left;">
One way to estimate k is to just replace the numbers you wish you knew with the numbers you have in your population:
</p>
<p><em>Average <strong>in Data you Have</strong> Body Weight = k * Average <strong>in Data you Have</strong> Brain Weight</em></p>
<p>Since you have the data the only thing you don’t know in the equation is k, so you can solve the equation and get an estimate. The nice thing here is we don’t have to say much about the shape of the data we expect for body weight or brain weight. <em>We just have to believe this one equation</em>. The key insight here is that you don’t have to know the whole shape of the data, just one part of it (the average). An important point to remember is that you are still making some assumptions here (that the average is a good thing to estimate, for example) but they are definitely fewer assumptions than you make if you go all the way and specify the whole shape, or distribution, of the data.</p>
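<p>Continuing the hedged illustration with the same MASS <code>Animals</code> data, the method of moments version simply replaces the world averages in the equation with the sample averages and solves for k. It generally gives a different answer than the least squares / maximum likelihood fit above, which is a small reminder that the assumptions you make change the estimate you get.</p>
<pre><code>library(MASS)
# Replace the unknown world averages with the averages in the data you have, then solve for k:
k_hat <- mean(Animals$body) / mean(Animals$brain)
k_hat                                  # typically not the same as the least squares answer above
</code></pre>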
<p>This is a pretty oversimplified version of the problem that Hansen solved. In reality when you make assumptions about the way the world works you often get more equations like the one above than variables you want to estimate. Solving all of those equations is now complicated because the answers from different equations might contradict each other (the technical word is <a href="http://en.wikipedia.org/wiki/Overdetermination">overdetermined</a>).</p>
<p>Hansen showed that in this case you can take the equations and multiply them by a set of weights. You put more weight on equations you are more sure about, then add them up. If you choose the weights well, you avoid the problem of having too many equations for too few variables. This is the thing he won the prize for - the <strong>generalized method of moments</strong>.</p>
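<p>Here is a toy, GMM-flavored sketch of that weighting idea. It is not Hansen’s actual estimator and every number in it is invented: two moment equations for a single unknown k that disagree slightly, reconciled by minimizing a weighted sum of their squared discrepancies.</p>
<pre><code># Two made-up moment conditions for one parameter k; alone they give k = 5 and k ~ 4.6
g <- function(k) c(10 - k * 2,
                   55 - k * 12)
w <- c(2, 1)                                      # more weight on the equation we trust more

objective <- function(k) sum(w * g(k)^2)          # weighted sum of squared discrepancies
optimize(objective, interval = c(0, 20))$minimum  # a compromise between the two answers
</code></pre>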
<p>This is all a big deal because the variables that economists measure frequently aren’t very pretty. One common way they aren’t pretty is that they are often measured over time, with complex relationships between values at different time points. That means it is hard to come up with realistic assumptions about what the data may look like.</p>
<p>By proposing an approach that doesn’t require as many assumptions Hansen satisfied criterion (1) for things economists like. And, if you squint just right at the equations he proposed, you can see they actually are a general form of a bunch of other estimation techniques like <a href="http://en.wikipedia.org/wiki/Maximum_likelihood">maximum likelihood estimation</a> and <a href="http://en.wikipedia.org/wiki/Instrumental_variable">instrumental variables</a>, which made it easier to prove theoretical results and satisfied criterion (2) for things economists like.</p>
<hr />
<p><em>Disclaimer: This post was written for a general audience and may cause nerd-rage in those who see (important) details I may have skimmed over. </em></p>
<p><em>Disclaimer #2: I’m not an economist, so I can’t talk about economics. There are reasons GMM is useful economically that I didn’t even talk about here.</em></p>
Sunday data/statistics link roundup (10/13/13)
2013-10-13T14:31:17+00:00
http://simplystats.github.io/2013/10/13/sunday-datastatistics-link-roundup-101313
<ol>
<li>A really interesting comparison <a href="http://marginalrevolution.com/marginalrevolution/2013/10/online-education-and-the-tivo-revolution.html">between educational and TV menus</a> (via Rafa). On a related note, it will be interesting to see <a href="http://www.slate.com/articles/technology/education/2013/09/edx_mit_and_online_certificates_how_non_degree_certificates_are_disrupting.html">how/whether the traditional educational system will be disrupted</a>. I’m as into the MOOC thing as the next guy, but I’m not sure I buy a series of pictures from your computer as “validation” you took/know the material for a course. Also I’m not 100% sure about what this is, but it has the potential to be kind of awesome - <a href="https://www.moocdemic.com/">the Moocdemic</a>.</li>
<li><a href="http://www.sciencemag.org/content/342/6154/60.full">This piece</a> of “investigative journalism” had the open-access internet up in arms. The piece shows pretty clearly that there are bottom-feeding journals who will use unscrupulous tactics and claim peer review while doing no such thing. But it says basically nothing about open access as far as I can tell. On a related note, a couple of years ago we <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895">developed an economic model for peer review</a>, then tested the model out. In a very contrived/controlled system we showed peer review improves accuracy, even when people aren’t incentivized to review.</li>
<li>Related to <a href="http://simplystatistics.org/2013/10/10/cancelled-nih-study-sections-a-subtle-yet-disastrous-effect-of-the-government-shutdown/">our guest post</a> on NIH study sections is this <a href="http://www.nature.com/news/nih-campus-endures-slow-decay-1.13942">pretty depressing piece in Nature</a>.</li>
<li>One of JHU Biostat’s NSF graduate research fellows <a href="http://stattrak.amstat.org/2013/10/01/fellowship-experience/">was interviewed by Amstat News</a>.</li>
<li>Jenny B. has <a href="http://www.stat.ubc.ca/~jenny/STAT545A/2012-lectures/">some great EDA lectures</a> you should check out.</li>
</ol>
Why do we still teach a semester of trigonometry? How about engineering instead?
2013-10-11T10:54:19+00:00
http://simplystats.github.io/2013/10/11/why-do-we-still-teach-a-semester-of-trigonometry-how-about-engineering-instead
<p>Arthur Benjamin says we should <a href="http://www.ted.com/talks/arthur_benjamin_s_formula_for_changing_math_education.html">teach statistics before calculus</a>. He points out that most of what we do in high school math is preparing us for calculus. He makes the point that while physicists, engineers and economists need calculus, in the digital age, discrete math, probability and statistics are much more relevant to everyone else. I agree with him and was happy to see Statistics as part of the <a href="http://www.corestandards.org/Math">common core</a>. However, other topics I wish were there, such as engineering, programming, and finance, are missing.</p>
<p>This past Saturday I took my 5th grader to a 3-hour robotics workshop. We both enjoyed it thoroughly. We built and programmed two-wheeled robots to, among <a href="http://www.youtube.com/watch?v=24sX9MtqQNA">other things</a>, go around a table. To make this happen we learned about measurement error, how to use a protractor, that C = π d, a bit of algebra, how to use grid searches, if-else conditionals, and for-loops. Meanwhile during a semester of high school trigonometry we learn <a href="http://www.sosmath.com/trig/Trig5/trig5/trig5.html">this</a> (do you remember that 2 sin^2 x = 1-cos 2x ? ). Of course it is important to know trigonometry, but do we really need to learn to derive and memorize these identities that are rarely used and are readily available from a smartphone? One could easily teach the fundamentals as part of an applied class such as robotics. We can ask questions like: if while turning you make a mistake of 0.5 degrees, by how much will your robot miss its mark after traveling one meter? We can probably teach the fundamentals of trigonometry in about 2 weeks, later using these concepts in applied problems.</p>
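<p>For what it’s worth, that robot question has a two-line answer once the key idea is in place: to a first approximation, the lateral miss after traveling a distance d with a heading error of θ is d × sin(θ).</p>
<pre><code>d     <- 1                 # meters traveled
theta <- 0.5 * pi / 180    # 0.5 degree heading error, converted to radians
d * sin(theta)             # about 0.0087 m, i.e. roughly 9 millimeters off target
</code></pre>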
Cancelled NIH study sections: a subtle, yet disastrous, effect of the government shutdown
2013-10-10T10:00:05+00:00
http://simplystats.github.io/2013/10/10/cancelled-nih-study-sections-a-subtle-yet-disastrous-effect-of-the-government-shutdown
<p><em>Editor’s note:</em> <em>This post is contributed by <a href="http://stat.psu.edu/people/dug10">Debashis Ghosh</a>. Debashis is the chair of the Biostatistical Methods and Research Design (BMRD) study section at the National Institutes of Health (NIH). BMRD’s focus is statistical methodology.</em></p>
<p>I write today to discuss one effect of the government shutdown that will likely do disastrous long-term damage to the state of biomedical and scientific research: the cancellation of NIH study sections. A list of the study sections can be found at <a href="http://public.csr.nih.gov/StudySections/Standing/Pages/default.aspx">http://public.csr.nih.gov/StudySections/Standing/Pages/default.aspx</a>. These are panels of distinguished scientists in their fields that meet three times a year to review grant submissions to the NIH by investigators. For most professors and scientists who work in academia, these grants provide the means of conducting research and funding for staff such as research associates, postdocs and graduate students. At most universities and medical schools in the U.S., having an independent research grant is necessary for assistant professors to be promoted and get tenure (of course, there is some variation in this across all universities).</p>
<p>Yesterday, I was notified by NIH that the BMRD October meeting was cancelled and postponed until further notice. I could not communicate with NIH staff about this because they are on furlough, meaning that they are not able to send or receive email or other communications. This means that our study section will not be reviewing grants in October. People who receive funding from NIH grants are familiar with the usual routine of submitting grants three times a year and getting reviewed approximately 6 months after submission. This process has now stopped because of the government shutdown, and it is unclear when it will restart. The session I chair is but one of 160 regular study sections and many of them would be meeting in October. In fact, I was involved with a grant submitted to another study section that would have met on October 8, but this meeting did not happen.</p>
<p>The stoppage has many detrimental consequences. Because BMRD will not be reviewing the submitted grants at the scheduled time, they will lack a proper scientific evaluation. The NIH review process separates the scientific evaluation of grants from the actual awarding of funding. While there have been many criticisms of the process, it has also been acknowledged that the U.S. scientific research community has been the leader in the world, and NIH grant review has played a role in this status. With the suspension of activities, the status that the U.S. currently enjoys is in peril. It is interesting to note that many countries are now attempting to install a review process similar to the one at NIH (R. Nakamura, personal communication).</p>
<p>The effects of the shutdown are perilous for the investigators who are submitting grants. Without the review, their grants cannot be evaluated and funded. This lag in the funding timeline stalls research, and in scientific research a stall now becomes far more damaging over the long term. The type of delay described here will mean layoffs for lab technicians and research associates who are funded by grants needing renewal, as well as a hiring freeze for new lab personnel using newly funded grants. This delay and loss of labor will diminish the existing scientific knowledge base in the U.S., which leads to a loss of the competitive advantage we have enjoyed as a nation for decades in science.</p>
<p>Economically, the delay has a huge impact as well. Suppose there is a delay of three months in funding decisions. In the case of NIH grants, this is tens of millions of dollars that are not being given out for scientific research for a period of three months. The rate of return of these grants has been estimated to be 25 – 40 percent a year (<a href="http://www.faseb.org/portals/0/pdfs/opa/2008/nih_research_benefits.pdf">http://www.faseb.org/portals/0/pdfs/opa/2008/nih_research_benefits.pdf</a>), and the findings from these grants have the potential to benefit thousands of patients a year by increasing their survival or improving the quality of their lives. In the starkest possible terms, more medical patients will die and suffer because the government shutdown is forcing the research that provides new methods of diagnosis and treatment to grind to a halt.</p>
<p>Note: The opinions expressed here represent my own and not those of my employer, Penn State University, nor those of the National Institutes of Health nor the Center for Scientific Review.</p>
<p align="center">
</p>
The Care and Feeding of Your Scientist Collaborator
2013-10-09T08:55:52+00:00
http://simplystats.github.io/2013/10/09/the-care-and-feeding-of-your-scientist-collaborator
<p><em>Editor’s Note: This post written by Roger Peng is part of a two-part series on Scientist-Statistician interactions. The <a href="http://simplystatistics.org/2013/10/08/the-care-and-feeding-of-the-biostatistician/">first post</a> was written by <a href="http://www.hopkinschildrens.org/elizabeth-matsui-md.aspx">Elizabeth C. Matsui</a>, an Associate Professor in the Division of Allergy and Immunology at the Johns Hopkins School of Medicine.</em></p>
<p>This post is a followup to Elizabeth Matsui’s <a href="http://simplystatistics.org/2013/10/08/the-care-and-feeding-of-the-biostatistician/">previous post</a> for scientists/clinicians on collaborating with biostatisticians. Elizabeth and I have been working together for over half a decade, and I think the story of how we started working together is perhaps a brief lesson on collaboration in and of itself. Basically, she emailed someone who didn’t have time, so that person emailed someone else who didn’t have time, so that person emailed someone else who didn’t have time, so that person emailed me, who as a mere assistant professor had plenty of time! A few people I’ve talked to are irked by this process because it feels like you’re someone’s fourth choice. But personally, I don’t care. I’d say almost all my good collaborations have come about this way. To me, it either works or it doesn’t work, regardless of where on the list you were when you were contacted.</p>
<p>I’ve written before about <a href="http://simplystatistics.org/2011/10/20/finding-good-collaborators/">how to find good collaborators</a> (although I neglected to mention the process described above), but this post tries to answer the question, “Now that I’ve found this good collaborator, what do I do with her/him?” Here are some thoughts I’ve accumulated over the years.</p>
<ul>
<li>
<p><strong>Understand that a scientist is not a fountain from which “the numbers” flow</strong>. Most statisticians like to work with data, and some even need it to demonstrate the usefulness of their methods or theory. So there’s a temptation to go “find a scientist” to “<a href="http://simplystatistics.org/2012/01/08/where-do-you-get-your-data/">give you some data</a>”. This is starting off on the wrong foot. If you picture your collaborator as a person who hands over the data and then you never talk to that person again (because who needs a clinician for a JASA paper?), then things will probably not end up so great. And I think there are two ways in which the experience will be sub-optimal. First, your scientist collaborator may feel miffed that you basically went off and did your own thing, making her/him less inclined to work with you in the future. Second, the product you end up with (paper, software, etc.) might not have the same impact on science as it would have had if you’d worked together more closely. This is the bigger problem: see #5 below.</p>
</li>
<li>
<p><strong>All good collaborations involve some teaching: Be patient, not patronizing</strong>. Statisticians are often annoyed that “So-and-so didn’t even know this” or “they tried to do this with a sample size of 3!” True, there are egregious cases of scientists with a lack of basic statistical knowledge, but in my experience, all good collaborations involve some teaching. Otherwise, why would you collaborate with someone who knows exactly the same things that you know? Just like it’s important to take some time to learn the discipline that you’re applying statistical methods to, it’s important to take some time to describe to your collaborator how those statistical methods you’re using really work. Where does the information in the data come from? What aspects are important; what aspects are not important? What do parameter estimates mean in the context of this problem? If you find you can’t actually explain these concepts, or become very impatient when they don’t understand, that may be an indication that there’s a problem with the method itself that may need rethinking. Or maybe you just need a simpler method.</p>
</li>
<li>
<p><strong>Go to where they are</strong>. This bit of advice I got from <a href="http://www.biostat.jhsph.edu/~szeger/">Scott Zeger</a> when I was just starting out at Johns Hopkins. His bottom line was that if you understand where the data come from (as in literally, the data come from this organ in this person’s body), then you might not be so flippant about asking for an extra 100 subjects to have a sufficient sample size. In biomedical science, the data usually come from people. Real people. And the job of collecting that data, the scientist’s job, is usually not easy. So if you have a chance, go see how the data are collected and what needs to be done. Even just going to their office or lab for a meeting rather than having them come to you can be helpful in understanding the environment in which they work. I know it can feel nice (and convenient) to have everyone coming to you, but that’s crap. Take the time and go to where they are.</p>
</li>
<li>
<p><strong>Their business is your business, so pitch in</strong>. A lot of research (and actually most jobs) involves doing things that are not specifically relevant to your primary goal (a paper in a good journal). But sometimes you do those things to achieve broader goals, like building better relationships and networks of contacts. This may involve, say, doing a sample size calculation once in a while for a new grant that’s going in. That may not be pertinent to your current project, but it’s not that hard to do, and it’ll help your collaborator a lot. You’re part of a team here, so everyone has to pitch in. In a restaurant kitchen, even the Chef works the line once in a while. Another way to think of this is as an investment. Particularly in the early stages there’s going to be a lot of ambiguity about what should be done and what is the best way to proceed. Sometimes the ideal solution won’t show itself until much later (the so-called “j-shaped curve” of investment). In the meantime, pitch in and keep things going.</p>
</li>
<li>
<p><strong>Your job is to advance the science</strong>. In a good collaboration, everyone should be focused on the same goal. In my area, that goal is improving public health. If I have to prove a theorem or develop a new method to do that, then I will (or at least try). But if I’m collaborating with a biomedical scientist, there has to be an alignment of long-term goals. Otherwise, if the goals are scattered, the science tends to be scattered, and ultimately sub-optimal with respect to impact. I actually think that if you think of your job in this way (to advance the science), then you end up with better collaborations. Why? Because you start looking for people who are similarly advancing the science and having an impact, rather than looking for people who have “good data”, whatever that means, for applying your methods.</p>
</li>
</ul>
<p>In the end, I think statisticians need to focus on two things: Go out and find the best people to work with and then help them advance the science.</p>
The Care and Feeding of the Biostatistician
2013-10-08T10:33:10+00:00
http://simplystats.github.io/2013/10/08/the-care-and-feeding-of-the-biostatistician
<p><em>Editor’s Note: This guest post was written by <a href="http://www.hopkinschildrens.org/elizabeth-matsui-md.aspx">Elizabeth C. Matsui</a>, an Associate Professor in the Division of Pediatric Allergy and Immunology at the Johns Hopkins School of Medicine.</em></p>
<p>I’ve been collaborating with Roger for several years now and we have had quite a few discussions about characteristics of a successful collaboration between a clinical investigator and a biostatistician. I can’t remember for certain, but think that <a href="http://www.youtube.com/watch?v=Hz1fyhVOjr4">this cartoon</a> may have been the impetus for some of our discussions. I have joked that I should write a guide for clinical investigators entitled, “The Care and Feeding of the Biostatistician.” Fortunately, Roger has a good sense of humor and appreciates the ironic title, so asked me to write down a few thoughts for Simply Statistics. Forging successful collaborations may seem less important than other skills such as grant writing, but successful collaboration is an important determinant of career success, and for many people, an enormous source of career satisfaction. And in the current scientific environment in which large, complex datasets and sophisticated quantitative and data visualization methods are becoming increasingly common, collaboration with biostatisticians is necessary to harness the full potential of your data and to have the greatest scientific impact. In some cases, not engaging a biostatistical collaborator may put you at risk of making statistical missteps that could result in erroneous results.</p>
<ul>
<li>
<p><strong>Be respectful of time</strong>. This tenet, of course, is applicable to all collaborations, but may be a more common stumbling block for clinical investigators working with biostatisticians. Most power estimates and sample size calculations, for example, are more complex than appreciated by the clinical investigator. A discussion about the research question, primary outcome, etc. is required and some thought has to go into determining the most appropriate approach before your biostatistician collaborator has even laid hands on the keyboard and fired up R. At a minimum, engage your biostatistician collaborator earlier than you might think necessary, and ideally, solicit their input during the planning stages. Engaging a biostatistician sooner rather than later not only fosters good will, but will also improve your science. A biostatistician’s time, like yours, is valuable, so respect their time by allocating an appropriate level of salary support on grants. Most academicians I come across appreciate that budgets are tight, so they understand that they may not get the level of salary support that they think is most appropriate. However, “finding room” in the budget for 1% salary support for a biostatistician sends the message that the biostatistician is an afterthought, a necessity for a sample size calculation and a competitive grant application, but in the end, just a formality. Instead, dedicate sufficient salary support in your grant to support the level of biostatistical effort that will be needed. This sends the message that you would like your biostatistician collaborator to be an integral part of the investigator team and provides an opportunity for the kind of regular, ongoing interactions that are needed for productive collaborations.</p>
</li>
<li>
<p><strong>Understand that a biostatistician is not a computational tool</strong>. Although sample size and power calculations are probably the most common service solicited from biostatisticians, and biostatisticians can be enormously helpful in this arena, they have the most impact when they are engaged in discussions about study designs and analytic approaches for a scientific question. Their quantitative approach to scientific problems provides a fresh perspective that can increase the scientific impact of your work. My sense is that this is also much more interesting work for a biostatistician than sample size and power calculations, and engaging them in interesting work goes a long way towards cementing a mutually productive collaboration.</p>
</li>
<li>
<p><strong>Make an effort to learn the language of biostatistics</strong>. Technical jargon is a serious impediment to successful collaboration. Again, this is true of all cross-discipline collaborations, but may be particularly true in collaborations with biostatisticians. The field has a penchant for eponymous methods (Hosmer-Lemeshow, Wald, etc.) and terminology that is entertaining, but not intuitive (jackknife, bootstrapping, lasso). While I am not suggesting that a clinical investigator needs to enroll in biostatistics courses (why gain expertise in a field when your collaborator provides this expertise), I am advocating for educating yourself about the basic concepts and terminology of statistics. Know what is meant by: distribution of a variable, predictor variable, outcome variable, and variance, for example. There are some terrific “Biostatistics 101”-type lectures and course materials online that are excellent resources. But also lean on your biostatistician collaborator by asking him/her to explain terminology and teach you these basics and do not be afraid to ask questions.</p>
</li>
<li>
<p><strong>When all else fails (and even when all else doesn’t fail), draw pictures</strong>. In truth, this is often the place where I start when I first engage a biostatistician. Showing your biostatistician collaborator what you expect your data to look like in a figure or conceptual diagram simplifies communication as it avoids use of jargon and biostatisticians can readily grasp the key information they need from a figure or diagram to come up with a sample size estimate or analytic approach.</p>
</li>
<li>
<p><strong>Teach them your language</strong>. Clinical medicine is also rife with jargon, and just as biostatistical jargon can make it difficult to communicate clearly with a biostatistician, so can clinical jargon. Avoid technical jargon where possible, and define terminology where it is not possible. Educate your collaborator about the background, context and rationale for your scientific question and encourage questions.</p>
</li>
<li>
<p><strong>Generously share your data and ideas</strong>. In many organizations, biostatisticians are very interested in developing new methods, applying more sophisticated methods to an “old” problem, and/or answering their own scientific questions. Do what you can to support these career interests, such as sharing your data and your ideas. Sharing data opens up avenues for increasing the impact of your work, as your biostatistician collaborator has opportunities to develop quantitative approaches to answering research questions related to your own interests. Sharing data alone is not sufficient, though. Discussions about what you see as the important, unanswered questions will help provide the necessary background and context for the biostatistician to make the most of the available data. As highlighted in a recent <a href="http://www.nytimes.com/2013/03/31/magazine/is-giving-the-secret-to-getting-ahead.html?ref=magazine&pagewanted=all&_r=0">book</a>, giving may be an important and overlooked component of success, and I would argue, also a cornerstone of a successful collaboration.</p>
</li>
</ul>
The Leek group policy for developing sustainable R packages
2013-10-07T11:56:40+00:00
http://simplystats.github.io/2013/10/07/the-leek-group-policy-for-developing-sustainable-r-packages
<p>As my group has grown over the past few years and I have more people writing software, I have started to progressively freak out more and more about how to make sure that the software is sustainable as students graduate and move on to bigger and better things. I am also concerned with maintaining quality of the software we are developing in a field where the pace of development/discovery is so high.</p>
<p>As a person who simultaneously (a) has no formal training in CS or software development and (b) believes that <a href="http://simplystatistics.org/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper/">if there is no software there is no paper</a>, I am worried about creating a bunch of unsustainable software. So I solicited the advice of people around here who know more about it than I do and collected my past experience with creating software and how I screwed it up. I put it all together in the <a href="https://github.com/jtleek/rpackages">Leek group guide to building and maintaining software packages</a>.</p>
<p>The guide covers (among other things):</p>
<ul>
<li>When to start building a package</li>
<li>How to version the package</li>
<li>How to document the package</li>
<li>What not to include</li>
<li>How to build unit tests</li>
<li>How to create a vignette</li>
<li>The commitment I expect in terms of software maintenance</li>
</ul>
<p>I put it on Github because I’m still not 100% sure I got it right. The policy takes effect as of now. But I would welcome feedback/pull requests on how we can improve the policy and reduce the probability that I end up with a bunch of broken packages when all my awesome students, who are much better coders than me, eventually graduate.</p>
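<p>As a concrete taste of one topic in the guide, here is a minimal sketch of a unit test using the testthat package; the function being tested is hypothetical and is not taken from the guide itself:</p>
<pre><code class="language-r">library(testthat)

# hypothetical package function
add_one <- function(x) x + 1

test_that("add_one increments numeric input", {
  expect_equal(add_one(1), 2)
  expect_error(add_one("a"))  # non-numeric input should fail
})
</code></pre>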
Sunday data/statistics link roundup (10/6/2013)
2013-10-06T14:52:47+00:00
http://simplystats.github.io/2013/10/06/sunday-datastatistics-link-roundup-1062013
<ol>
<li><span style="line-height: 16px;"><a href="http://www.gwern.net/The%20Existential%20Risk%20of%20Mathematical%20Error">A fascinating read</a> about applying decision theory to mathematical proofs. They talk about Type I and Type II errors and everything. </span></li>
<li>Statistical concepts <a href="http://www.youtube.com/watch?v=VFjaBh12C6s&list=PLCkLQOAPOtT2H1hJRUxUYOxThRwfVI9jI&index=1">explained through dance</a>. Even for a pretty culture-deficient dude like me this is cool.</li>
<li>Lots of good talks from the <a href="http://www.winworkshop.net/videos.php">WIN Workshop</a>, including by one of our speakers for the <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">Unconference on the Future of Statistics</a>.</li>
<li>The best advice for graduate students (or any academics) I have seen in my time writing the Sunday Links. <a href="https://blogs.akamai.com/2013/10/you-must-try-and-then-you-must-ask.html">You must try, and then you must ask</a> (via Seth F.).</li>
<li>Alberto C. has a MOOC on <a href="http://www.thefunctionalart.com/2013/09/the-third-introduction-to-infographics.html">infographics and visualization</a> that looks pretty cool. That way you can <a href="http://xkcd.com/1273/">avoid this kind of thing</a>.</li>
<li><a href="https://twitter.com/AstroKatie/status/386757429351813120/photo/1">This picture is awesome</a>. Nothing to do with statistics. (via @AstroKatie).</li>
<li>If you aren’t reading Thomas L.’s <a href="http://notstatschat.tumblr.com/">notstatschat</a>, you should be.</li>
<li>Karl B. has an interesting <a href="http://www.biostat.wisc.edu/~kbroman/presentations/openaccess.pdf">presentation on open access</a> that is itself open access. First Beamer theme I’ve seen that didn’t make me want to cover my eyes in sadness. My only problem is I wish open access publishing <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">wasn’t so expensive</a>. Can’t we just use a blog/<a href="http://figshare.com/">figshare</a> to publish journals that are almost as good? <a href="http://www.sciencemag.org/content/342/6154/66.full?sid=cb2de807-61a8-4dda-ba15-3b4c76e0c627&utm_content=buffer8aaf1&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer">This dude</a> says peer review is old news anyway.</li>
</ol>
Repost: Finding good collaborators
2013-10-04T14:02:39+00:00
http://simplystats.github.io/2013/10/04/repost-finding-good-collaborators
<p><em>Editor’s note: Simply Statistics is still freaking out about the government shutdown and potential impending economic catastrophe if the debt ceiling isn’t raised. Since anything new we might write seems trivial compared to what is going on in Washington, we are reposting an awesome old piece by Roger on finding good collaborators. </em></p>
<p>The job of the statistician is almost entirely about collaboration. Sure, there’s theoretical work that we can do by ourselves, but most of the impact that we have on science comes from our work with scientists in other fields. Collaboration is also what makes the field of statistics so much fun.</p>
<p>So one question I get a lot from people is “how do you find good collaborations”? Or, put another way, how do you find good collaborators? It turns out this distinction is more important than it might seem.</p>
<p>My approach to developing collaborations has evolved over time and I consider myself fairly lucky to have developed a few very productive and very enjoyable collaborations. These days my strategy for finding good collaborations is to look for good collaborators. I personally find it important to work with people that I like as well as respect as scientists, because a good collaboration is going to involve a lot of personal interaction. A place like Johns Hopkins has no shortage of very intelligent and very productive researchers that are doing interesting things, but that doesn’t mean you want to work with all of them.</p>
<p>Here’s what I’ve been telling people lately about finding collaborations, which is a mish-mash of a lot of advice I’ve gotten over the years.</p>
<ol>
<li><strong>Find people you can work with</strong>. I sometimes see situations where a statistician will want to work with someone because he/she is working on an important problem. Of course, you want to be working on a problem that interests you, but it’s only partly about the specific project. It’s very much about the person. If you can’t develop a strong working relationship with a collaborator, both sides will suffer. If you don’t feel comfortable asking (stupid) questions, pointing out problems, or making suggestions, then chances are the science won’t be as good as it could be.</li>
<li><strong>It’s going to take some time</strong>. I sometimes half-jokingly tell people that good collaborations are what you’re left with after getting rid of all your bad ones. Part of the reasoning here is that you actually may not know what kinds of people you are most comfortable working with. So it takes time and a series of interactions to learn these things about yourself and to see what works and doesn’t work. Of course, you can’t take forever, particularly in academic settings where the tenure clock might be ticking, but you also can’t rush things either. One rule I heard once was that a collaboration is worth doing if it will likely end up with a published paper. That’s a decent rule of thumb, but see my next comment.</li>
<li><strong>It’s going to take some time</strong>. Developing good collaborations will usually take some time, even if you’ve found the right person. You might need to learn the science, get up to speed on the latest methods/techniques, learn the jargon, etc. So it might be a while before you can start having intelligent conversations about the subject matter. Then it takes time to understand how the key scientific questions translate to statistical problems. Then it takes time to figure out how to develop new methods to address these statistical problems. So a good collaboration is a serious long-term investment which has some risk of not working out. There may not be a lot of papers initially, but the idea is to make the early investment so that truly excellent papers can be published later.</li>
<li><strong>Work with people who are getting things done</strong>. Nothing is more frustrating than collaborating on a project with someone who isn’t that interested in bringing it to a close (e.g., a published paper or a completed software package). Sometimes there isn’t a strong incentive for the collaborator to finish (e.g., she/he is already tenured) and other times things just fall by the wayside. So finding a collaborator who is continuously getting things done is key. One way to determine this is to check out their CV. Is there a steady stream of productivity? Papers in good journals? Software used by lots of other people? Grants? Web site that’s not in total disrepair?</li>
<li><strong>You’re not like everyone else</strong>. One thing that surprised me was discovering that just because someone you know works well with a specific person doesn’t mean that <em>you</em> will work well with that person. This sounds obvious in retrospect, but there were a few situations where a collaborator was recommended to me by a source that I trusted completely, and yet the collaboration didn’t work out. The bottom line is to trust your mentors and friends, but realize that differences in personality and scientific interests may determine a different set of collaborators with whom you work well.</li>
</ol>
<p>These are just a few of my thoughts on finding good collaborators. I’d be interested in hearing others’ thoughts and experiences along these lines.</p>
Statistical Ode to Mariano Rivera
2013-09-30T10:00:06+00:00
http://simplystats.github.io/2013/09/30/statistical-ode-to-mariano-rivera
<p>Mariano Rivera is an outlier in many ways. The plot below shows one of them: top 10 pitchers ranked by postseason saves.</p>
<p><a href="http://simplystatistics.org/?attachment_id=1922" rel="attachment wp-att-1922"><img class="alignnone wp-image-1922" alt="mariano" src="http://simplystatistics.org/wp-content/uploads/2013/09/mariano.png" width="500" height="500" srcset="http://simplystatistics.org/wp-content/uploads/2013/09/mariano-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2013/09/mariano-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2013/09/mariano-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2013/09/mariano.png 4200w" sizes="(max-width: 500px) 100vw, 500px" /></a></p>
Sunday data/statistics link roundup (9/29/13)
2013-09-29T13:19:26+00:00
http://simplystats.github.io/2013/09/29/sunday-datastatistics-link-roundup-92913
<p>The links are back! Read on.</p>
<ol>
<li><span style="line-height: 15.994318008422852px;">Susan Murphy - a statistician - <a href="http://ns.umich.edu/new/multimedia/videos/21711-u-m-professor-susan-murphy-earns-prestigious-macarthur-fellowship">wins a Macarthur Award</a>. Great for the field of statistics (via Dan S. and Simina B., among others).</span></li>
<li>Related: an <a style="font-size: 16px;" href="http://www.youtube.com/watch?v=heWEDx1gbB0">Interview with David Donoho</a> about the Shaw Prize. Statisticians are blowing up! (via Rafa)</li>
<li>Hope that the award winners <a href="http://www.nber.org/papers/w19445">don’t lose momentum</a>! (via Andrew J.)</li>
<li>Hopkins grad students <a href="http://www.baltimoresun.com/news/opinion/oped/bs-ed-biomedical-research,0,6244826.story">take to the Baltimore Sun</a> to report yet more ongoing negative effects of sequestration. Particularly appropriate in light of the current mayhem around keeping the government open. (via Rafa)</li>
<li>Great <a href="http://www.youtube.com/watch?v=1OQvGvQAI7A">BBC piece</a> featuring David Spiegelhalter on the science of chance. I rarely watch Youtube videos that long all the way through, but I made it to the end of this one.</li>
<li>Love how Yahoo finance has recognized the agonized cries of statisticians and <a href="http://finance.yahoo.com/news/where-americans-rich-poor-spent-193609849.html">is converting pie charts to bar charts</a>. (via Rafa - <a href="http://simplystatistics.org/2012/11/27/i-give-up-i-am-embracing-pie-charts/">who has actually given up on the issue</a>).</li>
<li><a href="http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html">Don’t use Hadoop - your data aren’t that big</a>.</li>
<li>Don’t forget to <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">sign up</a> for the future of statistics unconference October 30th Noon-1pm eastern. We have an awesome lineup of speakers and over 500 people RSVP’d on google plus alone. It’s going to be a thing.</li>
</ol>
Announcing Statistics with Interactive R Learning Software Environment
2013-09-27T11:26:31+00:00
http://simplystats.github.io/2013/09/27/announcing-statistics-with-interactive-r-learning-software-environment
<p dir="ltr">
<em>Editor's note: This post was written by Nick Carchedi, a Master's degree student in the Department of Biostatistics at Johns Hopkins. He is working with us to develop software for interactive learning of R and statistics. </em>
</p>
<p dir="ltr">
Inspired by the relative lack of computer-based platforms for learning statistics and the R programming language, we at <a href="http://www.biostat.jhsph.edu">Johns Hopkins Biostatistics</a> have created a new R package designed to teach both topics simultaneously and interactively. Accordingly, we’ve named the package swirl, which stands for “Statistics with Interactive R Learning”. We sought to model swirl after other highly successful interactive learning platforms such as Codecademy, Code School, and Khan Academy, but with a specific focus on teaching statistics and R. Additionally, we wanted users to learn these topics within the same environment in which they would be applying them, namely the R console.
</p>
<p dir="ltr">
If you’re reading this article, then you probably already have an appreciation for the R language and there’s no need to beat that drum any further. Staying true to the R culture, the swirl package is totally open-source and free for anyone to use, modify, or improve. Furthermore, anyone with something to teach can use the platform to create their own interactive content for the world to use.
</p>
<p dir="ltr">
A typical swirl session has a user load the package from the R console, choose from a menu of options the course he or she would like to take, then work through 10-15 minute interactive modules, each covering a particular topic. A module generally alternates between instructional text output to the user and prompts for the user to answer questions. One question may ask for the result of a simple numerical calculation, while another requires the user to enter an actual R command (which is parsed and executed, if correct) to perform a requested task. Multiple choice, text-based and approximate numerical answers are also fair game. Whenever the user answers a question incorrectly, immediate feedback is given in the form of a hint before prompting her to try again. Finally, plots, figures, and even videos may be incorporated into a module for the sake of reinforcing the methods or concepts being taught.
</p>
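<p dir="ltr">
For the curious, starting a session looks roughly like this from the R console (a minimal sketch, assuming the package has already been installed following the instructions on the swirl website linked below):
</p>
<pre><code class="language-r">library(swirl)  # load the package
swirl()         # launches the interactive menu of courses and modules
</code></pre>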
<p dir="ltr">
We believe that this form of interactive learning, or learning by doing, is essential for true mastery of topics as challenging and complex as statistics and statistical computing. While we are aware of a handful of other platforms for learning R interactively, our goal was to focus on the teaching of R and statistics simultaneously. As far as we know, swirl is the only platform of its kind and almost certainly the only one that takes place within the R console.
</p>
<p dir="ltr">
When we developed the swirl package, we wanted from the start to allow other people to extend and customize it to their particular needs. The beauty of the swirl platform is that anyone can create their own content and have it included in the package for all users to access. We have designed pre-formatted templates (color-coded spreadsheets) that instructors can fill out with their own content according to a fairly simple set of instructions. Once instructors send us the completed templates, we then load the content into the package so that anyone with the most recent version of swirl on their computer can access the content. We’ve tried to make the process of content creation as simple and painless as possible so that the statistics and computing communities are encouraged to share their knowledge with the world through our platform.
</p>
<p dir="ltr">
The package currently includes only a few sample modules that we’ve created in-house, primarily serving as demonstrations of how the platform works and how a typical module may appear to users. In the future, we envision a vibrant and dynamic collection of full courses and short modules that users can vote up or down based on the quality of their experience with each. In such a scenario, the very best courses would naturally float to the top and the less effective courses would fall out of favor and perhaps be recommended for revision.
</p>
<p dir="ltr">
In addition to making more content available to future users, we hope to one day transition swirl from being an interactive learning environment to one that is truly adaptive to the individual needs of each user. Perhaps this future version of our software would support a more intricate web of content, intelligently navigating users among topics based on a dynamic, data-driven interpretation of their strengths, weaknesses, competencies, and knowledge gaps. With the right people on board, this could become a reality.
</p>
<p dir="ltr">
We’ve created this package with the hope that the statistics and computing communities find it to be a valuable educational tool. We’ve got the basic infrastructure in place, but we recognize that there is a great deal of room for improvement. The swirl package is still very much in development and we are actively seeking feedback on how we can make it better. Please visit the swirl website to download the package or for more information on the project. We’d love for you to give it a try and let us know what you think.
</p>
<p dir="ltr">
Go to the swirl website: <a href="http://swirlstats.com">http://swirlstats.com</a>
</p>
How could code review discourage code disclosure? Reviewers with motivation.
2013-09-26T11:08:00+00:00
http://simplystats.github.io/2013/09/26/how-could-code-review-discourage-code-disclosure-reviewers-with-motivation
<p><a href="http://www.nature.com/news/mozilla-plan-seeks-to-debug-scientific-code-1.13812"></a> appeared a couple of days ago in Nature describing Mozilla’s efforts to implement code review for scientific papers. As anyone who follows our blog knows, we are in favor of reproducible research, in favor of disclosing code, and in favor of open science.</p>
<p>So people were surprised when they saw this quote from Roger at the end of the Nature piece:</p>
<blockquote>
<p>“One worry I have is that, with reviews like this, scientists will be even more discouraged from publishing their code. We need to get more code out there, not improve how it looks.”</p>
</blockquote>
<p>Not surprisingly a bunch of reproducible research/open science people were quick to jump on this quote:</p>
<blockquote class="twitter-tweet" width="550">
<p>
.<a href="https://twitter.com/kaythaney">@kaythaney</a> re code review story, <a href="http://t.co/7rlAsmLuPw">http://t.co/7rlAsmLuPw</a> comment by <a href="https://twitter.com/simplystats">@simplystats</a> seems off to me... must be more nuance there <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" />
</p>
<p>
— Titus Brown (@ctitusbrown) <a href="https://twitter.com/ctitusbrown/status/382562811039064064">September 24, 2013</a>
</p>
</blockquote>
<blockquote class="twitter-tweet" width="550">
<p>
<a href="https://twitter.com/nickbarnes">@nickbarnes</a> <a href="https://twitter.com/cboettig">@cboettig</a> <a href="https://twitter.com/ctitusbrown">@ctitusbrown</a> agree. comment lead with this backfiring / discouraging others to make code available, which seemed off.
</p>
<p>
— Kaitlin Thaney (@kaythaney) <a href="https://twitter.com/kaythaney/status/382819174206423040">September 25, 2013</a>
</p>
</blockquote>
<p>Now, Roger’s quote was actually a little more nuanced and it was posted after a pretty in-depth discussion on Twitter:</p>
<blockquote class="twitter-tweet" width="550">
<p>
<a href="https://twitter.com/ctitusbrown">@ctitusbrown</a> <a href="https://twitter.com/cboettig">@cboettig</a> <a href="https://twitter.com/kaythaney">@kaythaney</a> <a href="https://twitter.com/nickbarnes">@nickbarnes</a> see whole <a href="https://twitter.com/simplystats">@simplystats</a> quote on prof. code review discouraging sharing <a href="http://t.co/pNQWT9Safz">pic.twitter.com/pNQWT9Safz</a>
</p>
<p>
— Erika Check Hayden (@Erika_Check) <a href="https://twitter.com/Erika_Check/status/382911015358181376">September 25, 2013</a>
</p>
</blockquote>
<p>But I think the real source of confusion was best summed up by Titus B.:</p>
<blockquote class="twitter-tweet" width="550">
<p>
.<a href="https://twitter.com/cboettig">@cboettig</a> <a href="https://twitter.com/kaythaney">@kaythaney</a> <a href="https://twitter.com/nickbarnes">@nickbarnes</a> As one of my grad students said to me, "I don't understand why 'must share code' is a radical opinion."
</p>
<p>
— Titus Brown (@ctitusbrown) <a href="https://twitter.com/ctitusbrown/status/382904483102982145">September 25, 2013</a>
</p>
</blockquote>
<p>That is the key issue. People are surprised that sharing code would be anything but an obvious thing to do. To people who share code all the time, this is an obvious no-brainer. My bias is clearly in that camp as well. I require reproducibility of my students’ analyses, I discuss reproducible research when I teach, I take my own medicine by making my analyses reproducible, and I frequently state in reviews that papers are only acceptable after the code is available.</p>
<p><em>So what’s the big deal?</em></p>
<p>In an incredibly interesting coincidence, I <a href="http://simplystatistics.org/2013/09/25/is-most-science-false-the-titans-weigh-in/">had a paper</a> come out the same week in Biostatistics that has been, uh…a little controversial.</p>
<p>In this case, our paper was published with discussion. For people outside of statistics, a discussant and a reviewer are different things. The paper first goes through peer review in the usual way. Then, once it is accepted for publication, it is sent out to discussants to read and comment on.</p>
<p>A couple of discussants were very, very motivated to discredit our approach. Despite this, because we believe in open science, stating our assumptions, and being reproducible, we made all of the code we used and data we collected available for the discussants (and for everyone else). In an awesome win for open science, many of the discussants used/evaluated our code in their discussions.</p>
<p>One of the very motivated discussants identified an actual bug in the code. This bug caused the journal names to be scrambled in Figures 3 and 4. The bug (thank goodness!) did not substantively alter the methods, the results or the conclusions of our paper. On top of it, the cool thing about having our code on github meant we could carefully look it over, fix the bug, and push the changes to the repository (and update the paper) so the discussant could see the revised version as soon as we pushed it.</p>
<p>We were happy that the discussant didn’t find any more substantial bugs (because we knew they were motivated to review our code for errors as carefully as possible). We were also happy to make the changes, admit our mistake and move on.</p>
<p>An interesting thing happened though. The motivated discussant wanted to discredit our approach. So they included in the supplement how they noticed the bug (totally fair game, it was a bug). But they also included their email exchange with the editor about the bug and this quote:</p>
<blockquote>
<p>As all seasoned methodologists know, minor coding errors causing total havoc is quite common (I have seen it happen in my own work). I think that it is ironic that a paper that claims to prove the reliability of the literature had completely messed up the two main figures that represent the core of all its data and its main results.</p>
</blockquote>
<p>A couple of points here: (1) the minor bug didn’t wreak havoc with our results, it didn’t change any conclusions and it didn’t affect our statistics and (2) the statement is clearly designed for the sole purpose of embarrassing us (the authors) and discrediting our work.</p>
<p>The problem here is that the code reviewer deeply cares about us being wrong. This incident highlights one reason for Roger’s concerns. I feel we acted in pretty good faith here to try to be honest about our assumptions and open with our code. We also responded quickly and thoroughly to the report of a bug. But the discussant used the fact that we had a bug at all to try to discredit our whole analysis with sarcasm. This sort of thing could absolutely discourage a person from releasing code.</p>
<p>One thing the discussant is absolutely right about is that most code will have minor bugs. Personally, I’m very grateful to the discussant for catching the bug before the work was published and I’m happy that we made the code available and corrected our mistake.</p>
<p><em>But the key risk here is that people who demand reproducible code do so only so they can try to embarrass analysts and discredit science they don’t like. </em></p>
<p>If we want people to make code available, be willing to admit mistakes, and continuously update their code then we don’t just need code review. We need a policy and commitment from the community to not just use reproducible research as a vehicle for embarrassment and discrediting each other. We need a policy that:</p>
<ol>
<li>Doesn’t discourage people from putting code up before papers are published for fear of embarrassment.</li>
<li>Acknowledges minor bugs happen and doesn’t penalize people for admitting them/fixing them.</li>
<li>Prevents people from publishing when they have major typos, but doesn’t humiliate them.</li>
<li>Defines specific, positive ways that code sharing can benefit the community (collaboration) rather than only reporting errors that are discovered when code is made available.</li>
<li>Recognizes that most scientists are not professional software developers and focuses review on the scientific correctness/reproducibility of code, rather than technical software development skills.</li>
</ol>
<p>One way I think we could address a lot of these issues is not to think of it as code review, but as code evaluation and update. <span style="font-size: 16px;">That is one thing I really like about Mozilla’s approach - they report their findings to the authors and let them respond. </span><span style="font-size: 16px;">The only thing that would be better is if Mozilla actually created patches/bug fixes for the code and issued pull requests that the authors could incorporate. </span></p>
<p>Ultimately, I hope we can focus on a way to make scientific software correct, not just point out how it is wrong.</p>
Is most science false? The titans weigh in.
2013-09-25T11:06:11+00:00
http://simplystats.github.io/2013/09/25/is-most-science-false-the-titans-weigh-in
<p>Some of you may recall that a few months ago my colleague and I posted a <a href="http://arxiv.org/pdf/1301.3718.pdf">paper</a> to the ArXiv on estimating the rate of false discoveries in the scientific literature. The paper was picked up by the <a href="http://m.technologyreview.com/view/510126/the-statistical-puzzle-over-how-much-biomedical-research-is-wrong/">Tech Review</a> and led to a post on <a href="http://andrewgelman.com/2013/01/24/i-dont-believe-the-paper-empirical-estimates-suggest-most-published-medical-research-is-true-that-is-the-claim-may-very-well-be-true-but-im-not-at-all-convinced-by-the-analysis-being-used/">Andrew G.’s blog</a>, <a href="http://blogs.discovermagazine.com/neuroskeptic/2013/01/24/is-medical-science-really-86-true/#.UkLqWWTXis0">on Discover blogs</a>, and <a href="http://simplystatistics.org/2013/01/24/why-i-disagree-with-andrew-gelmans-critique-of-my-paper-about-the-rate-of-false-discoveries-in-the-medical-literature/">on our blog</a>. One other interesting feature of our paper was that we put all the <a href="https://github.com/jtleek/swfdr">code/data we collected on Github</a>.</p>
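<p>For readers who want a feel for the kind of quantity involved, here is a toy R sketch of the classical idea of estimating the fraction of true null hypotheses from a collection of p-values; this is illustration only and is not the method from our paper:</p>
<pre><code class="language-r">set.seed(2)
# simulated p-values: 80% from true nulls (uniform), 20% from real effects (near zero)
p <- c(runif(800), rbeta(200, 1, 20))
lambda <- 0.5
pi0_hat <- mean(p > lambda) / (1 - lambda)  # Storey-style estimate of the null fraction
pi0_hat  # lands near 0.8 for these simulated data
</code></pre>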
<p>At the time this whole thing blew up our paper still wasn’t published. After the explosion of interest we submitted the paper to Biostatistics. They liked the paper and actually solicited formal discussion of our approach by other statisticians. We were then allowed to respond to the discussions.</p>
<p>Overall, it was an awesome experience at Biostatistics - they did a great job of doing a thorough, but timely, review. They got some amazing discussants. Finally, they made our paper open-access. So much goodness. (conflict of interest disclaimer - I am an associate editor for Biostatistics)</p>
<p>Here are the papers that came out which I think are all worth reading:</p>
<ul>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt007.full">Our paper</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt032.full">Discussion by Benjamini and colleagues</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt033.full">Discussion by D.R. Cox (!)</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt034.full">Discussion by Gelman and colleagues</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt035.full">Discussion by Goodman</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt036.full">Discussion by Ioannidis</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt037.full">Discussion by Schuemie and colleagues</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt038.full">Our rejoinder</a></li>
</ul>
<p>I’m very proud of our paper and the rejoinder. The discussants were very passionate and added a huge amount of value, particularly in the collection/analysis of our data and additional data they collected.</p>
<p>I think it is 100% worth reading all of the papers over at Biostatistics but for the tldr crowd here are some take home messages I have from the experience and summarizing the discussion above:</p>
<ol>
<li>Posting to ArXiv can be a huge advantage for a paper like ours but be ready for the heat.</li>
<li>Biostatistics (the journal) is awesome. Great job of reviewing/editing in a timely way and great job of organizing the discussion!</li>
<li>When talking about the science-wise false discovery rate you have to bring data.</li>
<li>We proposed the first formal framework for evaluating the science-wise false discovery rate which lots of people care about (and there are a ton of ideas in the discussion about ways to estimate it better).</li>
<li>I think based on our paper and the discussion that it is pretty unlikely that most published research is false. But that probably varies by your definition of false/what you mean by most/the journal type/the field you are considering/the analysis type/etc.</li>
<li>This is a question people care about. <em>A lot</em>.</li>
</ol>
<p>Finally, I think this is the most important quote from our rejoinder:</p>
<blockquote>
<p>We are encouraged, however, that several of the discussants collected additional data to evaluate the impact of the above decisions on the SWFDR estimates. The discussion illustrates the powerful way that data collection can be used to move the theoretical and philosophical discussion on to a more concrete, scientific footing—discussing the specific strengths and weaknesses of a particular empirical approach. Moreover, the interesting additional data collected by the discussants on study types, journals, and endpoints demonstrate that data beget data and lead to a stronger and more directed conversation.</p>
</blockquote>
How I view an academic talk: like a sports game
2013-09-24T10:32:55+00:00
http://simplystats.github.io/2013/09/24/how-i-view-an-academic-talk-like-a-sports-game
<p>I know this is a little random/non-statisticsy but I have been thinking about it a lot lately. Over the last couple of weeks I have been giving a bunch of talks and guest lectures here locally around the Baltimore/DC area. Each one of them was to a slightly <a style="font-size: 16px;" href="http://www.meetup.com/Data-Science-MD/events/135629022/">different</a> <a style="font-size: 16px;" href="http://www.cbcb.umd.edu/~langmead/teaching/f2013_439/syllabus.pdf">audience.</a></p>
<p>As I was preparing/giving all of these talks I realized I have a few habits that I have developed in the way I view the talks and in the way that I give them. I 100% agree with Hilary M. that a talk <a href="http://www.hilarymason.com/speaking/speaking-entertain-dont-teach/">should entertain</a> more than it should teach. I also try to give talks that <a href="http://simplystatistics.org/2012/03/05/characteristics-of-my-favorite-statistics-talks/">I would like to see myself</a>.</p>
<p>Another thing I realized is that I view talks in a very specific way. I see them as a sports game. From the time I was a kid until the end of <a href="http://biostat.jhsph.edu/~jleek/ultimate.png">graduate school</a> I was on sports teams. I love playing/watching all kinds of sports and I definitely miss playing competitively.</p>
<p>Unfortunately, being a faculty member doesn’t leave much time for sports. So now, the only chance I have to get up and play is during a talk. Here are the ways that I see the two activities as being similar:</p>
<ol>
<li>They both require practice. I played a lot of sports with <a href="http://www.biostat.umn.edu/~rudser/Images/IMG_0220.JPG">this guy</a> who liked the quote, “Practice doesn’t make perfect, perfect practice makes perfect”. I feel the same way.</li>
<li>They are both a way to entertain. I rarely played in front of crowds as big as the groups I speak to these days, but whenever there was an audience I would always get way more pumped up.</li>
<li>There is some competition to both. In terms of talks, there is always at least one audience member who wants to challenge your ideas. I see this exchange as a game, rather than something I dread. Sometimes I win (my answers cover all the questions) and sometimes I lose (I missed something important). Usually, winning these exchanges comes down to better preparation, which comes back to practice.</li>
<li>I get a rush off of both playing in games and giving talks. Part of that is self-fueled. I like to listen to pump-up music right before I give a talk or play a game.</li>
</ol>
<p>One thing I wish is that more talks were joint talks. One thing I love about sports is playing on a team. The preparation of a talk is always done with a team - usually the students/postdocs/collaborators working on the project. But I wish presentations were more often a team activity. It makes it more fun to celebrate if the talk went well and less painful if I flub when I give a talk with someone else. Plus it is fun to cheer on your teammate.</p>
<p>Does anyone else think of talks this way? Or do you have another way of thinking about talks?</p>
The limiting reagent for big data is often small, well-curated data
2013-09-23T10:32:29+00:00
http://simplystats.github.io/2013/09/23/the-limiting-reagent-for-big-data-is-often-small-well-curated-data
<p>I’ve been working on “big” data in genomics since I was a first-year student in graduate school (longer than I care to admit). At the time, “big” meant <a href="http://genomics.princeton.edu/storeylab/trauma/">microarray studies with a couple of hundred patients</a>. Of course, that is now a really small drop in the pond compared to the huge sequencing data sets, <a href="http://www.nature.com/nature/journal/vaop/ncurrent/full/nature12531.html">like the one</a> published recently in Nature.</p>
<p>Despite the exploding size of these genomic data sets, the discovery process is almost always limited by the quality and quantity of useful metadata that go along with them. In the trauma study I referenced above, the genomic data was both costly and hard to collect. But the bigger, more impressive feat was to collect the data from trauma patients at relatively precise time points after they had been injured. Along with the genomic data a host of clinical data was also collected and aligned with the genomic data.</p>
<p><em>The key insights derived from the data were the relationships between low-dimensional and high-dimensional measurements. </em></p>
<p>This is actually relatively common:</p>
<ul>
<li>In computer vision you need quality labeled images to use as a training set (this type of manual labeling is so common it forms the basis for major citizen science projects like <a href="https://www.zooniverse.org/">zooniverse</a>)</li>
<li>In genome-wide association studies you need accurate phenotypes.</li>
<li>In the analysis of social networks like the Framingham Heart Study, you need to <a href="http://www.nejm.org/doi/full/10.1056/NEJMsa066082">collect data on obesity levels</a>, etc.</li>
</ul>
<p>One common feature of these studies is that they are examples of what computer scientists call <em><a href="http://en.wikipedia.org/wiki/Supervised_learning">supervised learning</a></em>. Most hypothesis-driven research falls into this type of study. It is important to recognize that these studies can only work with painstaking and careful collection of small data. So in many cases, the limits to the insights we can obtain from big data are imposed by <a href="http://simplystatistics.org/2012/05/28/schlep-blindness-in-statistics/">how much schlep</a> we are willing to put in to get small data.</p>
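<p>To make the supervised-learning point concrete, here is a toy R sketch in which a small set of carefully collected labels is what makes it possible to learn anything from the high-dimensional measurements; all of the data are simulated for illustration:</p>
<pre><code class="language-r">set.seed(3)
n <- 100
X <- matrix(rnorm(n * 50), nrow = n)  # "big" high-dimensional measurements
# the small, curated data: labels that took real effort to collect
y <- factor(ifelse(X[, 1] + rnorm(n) > 0, "case", "control"))
# a supervised model is only possible because the labels exist
fit <- glm(y ~ X[, 1] + X[, 2], family = binomial)
summary(fit)$coefficients
</code></pre>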
Announcing the Simply Statistics Unconference on the Future of Statistics #futureofstats
2013-09-17T11:22:39+00:00
http://simplystats.github.io/2013/09/17/announcing-the-simply-statistics-unconference-on-the-future-of-statistics-futureofstats
<p><a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">Sign up here!</a></p>
<p dir="ltr">
We here at Simply Statistics are pumped about the <a href="http://www.statistics2013.org/introduction-to-the-future-of-statistical-sciences-workshop/">Statistics 2013 Future of Statistical Sciences Workshop (Nov. 11-12</a>). It is a great time to be a statistician and discussing the future of our discipline is of utmost importance to us. In fact, we liked the idea so much that we decided to get in the game ourselves.
</p>
<p dir="ltr">
We are super excited to announce the first-ever “Unconference” hosted by Simply Statistics. The unconference will focus on the Future of Statistics and will be held October 30th from 12-1pm EST. The unconference will be hosted on Google Hangouts and will be simultaneously live-streamed on YouTube. After the unconference is over, we will maintain a recorded version for viewing on YouTube. Our goal is to complement and continue the discussion inspired by the Statistics 2013 Workshop.
</p>
<p dir="ltr">
This unconference will feature some of the most exciting and innovative statistical thinkers in the field discussing their views on the future of the field and focusing on issues that affect junior statisticians the most: education, new methods, software development, collaborations with natural sciences/social sciences, and the relationship between statistics and industry.
</p>
<p dir="ltr">
The confirmed presenters are:
</p>
<ul>
<li><strong>Daniela Witten</strong>, Assistant Professor, Department of Biostatistics, University of Washington</li>
<li><strong>Hongkai Ji</strong>, Assistant Professor, Department of Biostatistics, Johns Hopkins University</li>
<li><strong>Joe Blitzstein</strong>, Professor of the Practice, Department of Statistics, Harvard University</li>
<li><strong>Sinan Aral</strong>, Associate Professor, MIT Sloan School of Management</li>
<li><strong>Hadley Wickham</strong>, Chief Scientist, RStudio</li>
<li><strong>Hilary Mason</strong>, Chief Data Scientist at Accel Partners</li>
</ul>
<p><a href="https://twitter.com/simplystats">Follow us on Twitter</a> or sign up for the Unconference at <a style="font-size: 16px;" href="http://simplystatistics.org/unconference">http://simplystatistics.org/unconference</a>. In the month or so leading up to the conference we would also love to hear from you about your thoughts on the future of statistics. Let us know about your ideas on Twitter with the hashtag #futureofstats, we’ll be compiling the information and will make it available along with the talks so that you can tell us what you think the future is.</p>
<p>Tell your friends, tell your family, it is on!</p>
Data Analysis in the top 9 courses in lifetime enrollment at Coursera!
2013-09-16T14:41:26+00:00
http://simplystats.github.io/2013/09/16/data-analysis-in-the-top-9-courses-in-lifetime-enrollment-at-coursera
<p>Holy cow, I just saw this: my Coursera class is in the top 9 by all-time enrollment!</p>
<blockquote class="twitter-tweet" width="550">
<p>
Top 9 courses on <a href="https://twitter.com/hashtag/coursera?src=hash">#coursera</a> by lifetime enrollment current as of 9/16- check out & enroll: <a href="http://t.co/2X0EJoetoC">http://t.co/2X0EJoetoC</a>! <a href="http://t.co/d0Sko1KhoD">pic.twitter.com/d0Sko1KhoD</a>
</p>
<p>
— Coursera (@coursera) <a href="https://twitter.com/coursera/status/379650980154863617">September 16, 2013</a>
</p>
</blockquote>
<p>The only problem is those pesky other classes ahead of me. Help me take down Creativity, Innovation and Change (what good is all that anyway? <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":-)" class="wp-smiley" style="height: 1em; max-height: 1em;" />) by <a href="https://www.coursera.org/course/dataanalysis">signing up here</a>!</p>
So you're moving to Baltimore
2013-09-13T15:21:59+00:00
http://simplystats.github.io/2013/09/13/so-youre-moving-to-baltimore
<p><em>Editor’s Note: This post was written by Brian Caffo, occasional Simply Statistics contributor and Director of Graduate Studies in the Department of Biostatistics at Johns Hopkins. This was written primarily for incoming graduate students, but if you’re planning on moving to Baltimore anyway, feel free to use it to your advantage!</em></p>
<p>Congratulations on picking Hopkins Biostatistics for your graduate studies. Now that you’re either here or coming to Baltimore, I’m guessing that you’ll need some start-up knowledge for this quirky, fun city. Here’s a guide to some of my favorite Baltimore places and traditions.</p>
<p>Put more in the comments!</p>
<h2 dir="ltr">
Events
</h2>
<p dir="ltr">
First, let me discuss some sporting events that you should be aware of. Absolutely top on the list is going to a baseball game at <a href="http://baltimore.orioles.mlb.com/bal/ballpark/index.jsp">Camden Yards </a>to watch the <a href="http://baltimore.orioles.mlb.com/">Orioles</a>. There’s lots of games on days, nights and weekends and for the most part, tickets are easy to get and relatively cheap. Going to the (twice Super Bowl champion) <a href="http://www.baltimoreravens.com/">NFL Ravens</a> is a bit harder and more expensive, but well worth the splurge once during your studies. Then you can come back to your research on investigating the long term impact of football head trauma.
</p>
<p><strong>The <a href="http://www.preakness.com/">Preakness</a></strong> horse race is another that’s worth going to at least once. The Preakness takes place on a Saturday and is a very popular event; this can translate to big crowds. If you don’t like big crowds but would like to see what all the fuss is about, you may enjoy the Black-Eyed Susan Stakes; this is a day of racing at Pimlico on the Friday before the Preakness where the crowds are smaller, it costs $5 to get into the track, and you can enjoy the celebratory atmosphere of the Preakness. Another fun event is the <a href="http://www.grandprixofbaltimore.com/">Baltimore Grand Prix</a>, which happens every Labor Day weekend (at least for the next few years). Since you’re at Hopkins, try to go catch a lacrosse game. The Hopkins team is consistently among the best. If you’re a distance runner, there’s the <a href="http://www.thebaltimoremarathon.com/">Baltimore Marathon</a>. Also, I hesitate to include this with sports, but I can’t get enough of the <a href="http://www.kineticbaltimore.com/">Kinetic Sculpture “Race”</a>, the most fun Baltimore event that I can think of. And we would be doing Hilary Parker a disservice if we failed to mention the <a href="http://www.charmcityrollergirls.com/">Charm City Roller Girls</a>!</p>
<p dir="ltr">
The main non-sporting events that I like are all of the festivals. Every year, especially during the summer, every neighborhood has a festival. <a href="http://www.youtube.com/watch?v=q-p5wqCA-ao">Honfest</a> in <a href="http://en.wikipedia.org/wiki/Hampden,_Baltimore">Hampden</a> is surely the one not to be missed (but there are festivals in every notable neighborhood, including the <a href="http://fellspointfest.com/">Fells Point Festival</a>). At Christmas time, there’s the <a href="http://en.wikipedia.org/wiki/Miracle_on_34th_Street_(Baltimore)">Miracle on 34th Street</a> right nearby, and <a href="http://www.destinationmainstreets.com/maryland/hampden.php">36th Street (the Avenue)</a> is a fun place to go out for shopping and eating, regardless of whether Honfest is going on. During the summer months, a local radio station sponsors “<a href="http://wtmd.org/radio/first-thursday-concerts-in-the-park/">First Thursdays</a>” where they put on a free concert series at the Washington Monument in Mt. Vernon.
</p>
<h2 dir="ltr">
Things to do during the day
</h2>
<p dir="ltr">
Probably you’ll visit the <a href="http://baltimore.org/about-baltimore/inner-harbor">Harbor </a>as one of the first things you do. Make sure to hit the <a href="http://www.aqua.org/">National Aquarium</a>, the <a href="http://www.avam.org/">Visionary Arts Museum</a> and the <a href="http://www.mdsci.org/">Maryland Science Center</a> (not all in one day). Downtown there’s the <a href="http://thewalters.org/">Walters Art Museum</a> and the <a href="http://www.artbma.org/">Baltimore Museum of Art</a> on the Johns Hopkins Homewood campus. Go see <a href="http://www.nps.gov/fomc/">Fort McHenry</a>, where Francis Scott Key wrote the National Anthem. The <a href="http://www.rflewismuseum.org/">Museum of African American History and Culture</a> is right near the Inner Harbor on Pratt Street.
</p>
<p dir="ltr">
If you’re outdoorsy, <a href="http://www.dnr.state.md.us/publiclands/central/patapsco.asp">Patapsco </a>and<a href="http://www.dnr.state.md.us/publiclands/central/gunpowder.asp"> Gunpowder Falls</a> appear to be good places nearby. <a href="http://www.nps.gov/cato/">Catoctin Park</a> is nearby with Camp David tucked in it somewhere; you’ll know you’ve found it when the secret service tackles you. If you don’t want to travel too far, just outside the northern border of the city is <a href="http://www.baltimorecountymd.gov/Agencies/recreation/programdivision/naturearea/relpark/">Robert E. Lee park</a> which has a nice hiking trail and a dog park. When you’re done there you can grab lunch at the <a href="http://www.hautedogcarte.com">Haute Dog</a>.
</p>
<p>If you have kids, the <a href="http://www.marylandzoo.org/">Baltimore Zoo</a> is a really nice outdoor zoo that is a great place to go if the weather is nice. It’s in <a href="http://www.druidhillpark.org/">Druid Hill Park</a>, which is also a great place to go running or biking. If you’re willing to drive an hour or more, the outdoor options are basically endless.</p>
<p>DC and Philly are easy day trips using the train and Annapolis is an easy drive. If you go to DC, only schedule a few museums right near one another; otherwise you’ll spend the whole day walking. On a nice day, the <a href="http://nationalzoo.si.edu/">National Zoo</a> is fantastic (and free). The MARC train goes to DC from Penn Station and is under $10 each way, but it only runs in the morning and evening. Outside of those times you can take the Amtrak train. If you drive, it’s usually about an hour one-way, depending on where you’re going.</p>
<h2 dir="ltr">
Things to do during the night
</h2>
<p dir="ltr">
I have little kids. How would I know? My answer is, fight about bedtime and collapse. However, if I was forced to come up with something, I would say go to <a href="http://www.pattersonbowl.com/">Patterson Park Lanes</a> and do Duckpin Bowling. Make sure to reserve a lane earlier on in the week if you want to go at night on a weekend.
</p>
<p>From my outside vantage point, there appears to be tons of nightlife. The best places appear to be in upscale city areas, like Fells Point, Canton, downtown, Harbor East, and Federal Hill.</p>
<p>Congratulations on picking Hopkins Biostatistics for your graduate studies. Now that you’re either here or coming to to Baltimore, I’m guessing that you’ll need some start-up knowledge for this quirky, fun city. Here’s a guide of to some of my favorite Baltimore places and traditions.</p>
<p>Put more in the comments!</p>
<h2 dir="ltr">
Events
</h2>
<p dir="ltr">
First, let me discuss some sporting events that you should be aware of. Absolutely top on the list is going to a baseball game at <a href="http://baltimore.orioles.mlb.com/bal/ballpark/index.jsp">Camden Yards </a>to watch the <a href="http://baltimore.orioles.mlb.com/">Orioles</a>. There’s lots of games on days, nights and weekends and for the most part, tickets are easy to get and relatively cheap. Going to the (twice Super Bowl champion) <a href="http://www.baltimoreravens.com/">NFL Ravens</a> is a bit harder and more expensive, but well worth the splurge once during your studies. Then you can come back to your research on investigating the long term impact of football head trauma.
</p>
<p><strong>** </strong>**The [<em>Editor’s Note: This post was written by Brian Caffo, occasional Simply Statistics contributor and Director of Graduate Studies in the Department of Biostatistics at Johns Hopkins. This was written primarily for incoming graduate students, but if you’re planning on moving to Baltimore anyway, feel free to use it to your advantage!</em></p>
<p>Congratulations on picking Hopkins Biostatistics for your graduate studies. Now that you’re either here or coming to to Baltimore, I’m guessing that you’ll need some start-up knowledge for this quirky, fun city. Here’s a guide of to some of my favorite Baltimore places and traditions.</p>
<p>Put more in the comments!</p>
<h2 dir="ltr">
Events
</h2>
<p dir="ltr">
First, let me discuss some sporting events that you should be aware of. Absolutely top on the list is going to a baseball game at <a href="http://baltimore.orioles.mlb.com/bal/ballpark/index.jsp">Camden Yards </a>to watch the <a href="http://baltimore.orioles.mlb.com/">Orioles</a>. There’s lots of games on days, nights and weekends and for the most part, tickets are easy to get and relatively cheap. Going to the (twice Super Bowl champion) <a href="http://www.baltimoreravens.com/">NFL Ravens</a> is a bit harder and more expensive, but well worth the splurge once during your studies. Then you can come back to your research on investigating the long term impact of football head trauma.
</p>
<p><strong>** </strong>**The](http://www.preakness.com/) horse race is another that’s worth going to at least once. The Preakness takes place on a Saturday and is a very popular event; this can translate to big crowds. If you don’t like big crowds but would like to see what all the fuss is about, you may enjoy the Black Eye Susan Stakes; this is a day of racing at Pimlico on Friday before the Preakness where the crowds are smaller, it costs $5 to get into the track and you can enjoy the celebratory atmosphere of the Preakness. Another fun event is the <a href="http://www.grandprixofbaltimore.com/">Baltimore Grand Prix</a> which happens every Labor day weekend (at least for the next few years). Since you’re at Hopkins, try to go catch a lacrosse game. The Hopkins team is consistently among the best. If you’re a distance runner, there’s the <a href="http://www.thebaltimoremarathon.com/">Baltimore Marathon</a>. Also, I hesitate to include this with sports, but I can’t get enough of the <a href="http://www.kineticbaltimore.com/">Kinetic Sculpture “Race</a>”, the most fun Baltimore event that I can think of. And we would be doing Hilary Parker a disservice if we failed to mention the <a href="http://www.charmcityrollergirls.com/">Charm City Roller Girls</a>!</p>
<p dir="ltr">
The main non-sporting events that I like are the festivals. Every year, especially during the summer, every neighborhood has a festival. <a href="http://www.youtube.com/watch?v=q-p5wqCA-ao">Honfest</a> in <a href="http://en.wikipedia.org/wiki/Hampden,_Baltimore">Hampden</a> is surely the one not to be missed (but there are festivals in every notable neighborhood, including the <a href="http://fellspointfest.com/">Fells Point Festival</a>). At Christmas time, there’s the <a href="http://en.wikipedia.org/wiki/Miracle_on_34th_Street_(Baltimore)">Miracle on 34th Street</a> right nearby, and <a href="http://www.destinationmainstreets.com/maryland/hampden.php">36th Street (the Avenue)</a> is a fun place to go out for shopping and eating, regardless of whether Honfest is going on. During the summer months, a local radio station sponsors "<a href="http://wtmd.org/radio/first-thursday-concerts-in-the-park/">First Thursdays</a>", where they put on a free concert series at the Washington Monument in Mt. Vernon.
</p>
<h2 dir="ltr">
Things to do during the day
</h2>
<p dir="ltr">
Probably you’ll visit the <a href="http://baltimore.org/about-baltimore/inner-harbor">Harbor </a>as one of the first things you do. Make sure to hit the <a href="http://www.aqua.org/">National Aquarium</a>, the <a href="http://www.avam.org/">Visionary Arts Museum</a> and the <a href="http://www.mdsci.org/">Maryland Science Center</a> (not all in one day). Downtown there’s the <a href="http://thewalters.org/">Walters Art Museum</a> and the <a href="http://www.artbma.org/">Baltimore Museum of Art</a> on the Johns Hopkins Homewood campus. Go see <a href="http://www.nps.gov/fomc/">Fort McHenry</a>, where Francis Scott Key wrote the National Anthem. The <a href="http://www.rflewismuseum.org/">Museum of African American History and Culture</a> is right near the Inner Harbor on Pratt Street.
</p>
<p dir="ltr">
If you’re outdoorsy, <a href="http://www.dnr.state.md.us/publiclands/central/patapsco.asp">Patapsco</a> and <a href="http://www.dnr.state.md.us/publiclands/central/gunpowder.asp">Gunpowder Falls</a> appear to be good places nearby. <a href="http://www.nps.gov/cato/">Catoctin Park</a> is nearby with Camp David tucked in it somewhere; you’ll know you’ve found it when the Secret Service tackles you. If you don’t want to travel too far, just outside the northern border of the city is <a href="http://www.baltimorecountymd.gov/Agencies/recreation/programdivision/naturearea/relpark/">Robert E. Lee Park</a>, which has a nice hiking trail and a dog park. When you’re done there you can grab lunch at the <a href="http://www.hautedogcarte.com">Haute Dog</a>.
</p>
<p>If you have kids, the <a href="http://www.marylandzoo.org/">Baltimore Zoo</a> is a really nice outdoor zoo and a great place to go when the weather is good. It’s in <a href="http://www.druidhillpark.org/">Druid Hill Park</a>, which is also a great place to go running or biking. If you’re willing to drive an hour or more, the outdoor options are basically endless.</p>
<p>DC and Philly are easy day trips using the train, and Annapolis is an easy drive. If you go to DC, only schedule a few museums that are near one another; otherwise you’ll spend the whole day walking. On a nice day, the <a href="http://nationalzoo.si.edu/">National Zoo</a> is fantastic (and free). The MARC train goes to DC from Penn Station and is under $10 each way, but it only runs in the morning and evening. Outside of those times you can take the Amtrak train. If you drive, it’s usually about an hour one-way, depending on where you’re going.</p>
<h2 dir="ltr">
Things to do during the night
</h2>
<p dir="ltr">
I have little kids. How would I know? My answer is, fight about bedtime and collapse. However, if I were forced to come up with something, I would say go to <a href="http://www.pattersonbowl.com/">Patterson Park Lanes</a> and do duckpin bowling. Make sure to reserve a lane earlier in the week if you want to go at night on a weekend.
</p>
<p>From my outside vantage point, there appears to be tons of nightlife. The best places appear to be in upscale city areas like Fells Point, Canton, downtown, Harbor East, and Federal Hill. Also, catch a show at the <a href="http://www.france-merrickpac.com/home.html">France-Merrick Performing Arts Center</a> or <a href="https://www.centerstage.org">Center Stage</a> or any one of the many local theatres. The best places to go to movies are the <a href="http://www.thesenatortheatre.com/">Senator</a>, <a href="http://www.fandango.com/rotundacinemas_aabot/theaterpage">Rotunda</a>, <a href="http://www.thecharles.com/">the Charles</a> and the <a href="http://articles.washingtonpost.com/2012-10-11/entertainment/35499403_1_kwame-kwei-armah-strand-theater-rain-pryor">Landmark at Harbor East</a>.</p>
<p>The <a href="http://www.bsomusic.org">Baltimore Symphony</a> is one of the top orchestras in the country and usually has interesting programs. You can usually just show up a few minutes before the concert and get a good (cheap) ticket. There’s also opera at the <a href="http://www.lyricoperahouse.com/page_img.php?cms_id=2">Lyric Opera House</a>, but Ingo will tell you that the real stuff is in DC at the <a href="http://www.kennedy-center.org/wno/">National Opera</a>.</p>
<h2 dir="ltr">
Things to eat
</h2>
<p dir="ltr">
There are too many restaurants to discuss, so I’ll just make some recommendations. If you have to have deli food, go to <a href="http://www.attmansdeli.com/">Attman’s on Lombard Street</a>. If you need authentic Chinese food, go to <a href="http://hunantastemd.com/">Hunan Taste</a> in Catonsville. All of the Korean restaurants are just north of North Avenue on Charles; try <a href="http://www.yelp.com/biz/jong-kak-baltimore-3">Jong Kak</a>. If you’re a locavore and want to go out for a nice dinner, there are a lot of choices. I like the <a href="http://www.woodberrykitchen.com/">Woodberry Kitchen</a> and <a href="http://bmoreclementine.com/">Clementine</a>. If you want to break the bank, go to the <a href="http://www.charlestonrestaurant.com/">Charleston</a>, probably the fanciest restaurant in the city. Also, make sure to hit the big <a href="http://www.promotionandarts.com/index.cfm?page=events&id=3">Farmer’s Market</a> on Sunday at least once. The best place to go drink beer and eat crabs is <a href="http://www.lpsteamers.com/">LP Steamers</a>. If you want a crab cake the size of a softball, go to <a href="http://www.faidleyscrabcakes.com/">Faidley’s</a> in <a href="http://www.lexingtonmarket.com/">Lexington Market</a>. Lexington Market is its own spectacle that you should try at least once. If you need an Italian deli, <a href="http://baltimore.cbslocal.com/top-lists/best-italian-delis-in-the-baltimore-area/">there are several</a> (<a href="http://www.mastellones.com/">Mastellone’s</a> is my favorite, but this list at least omits Isabella’s in Little Italy and Ceriello in Belvedere Square).
</p>
<h2 dir="ltr">
What you eat
</h2>
<p dir="ltr">
You’re a Baltimoron now, so you drink<a href="http://nationalbohemian.com/"> Natty Boh</a>, eat <a href="http://www.utzsnacks.com/">Utz Potato chips</a> and <a href="http://bergercookies.com/">Berger cookies</a>. (Don’t question; this is what you do now.) In the summer, go get an <a href="http://www2.citypaper.com/bob/story.asp?id=8153">egg cream snowball with marshmallow</a>. If you want high end local beer, I like <a href="http://www.hsbeer.com/#">Heavy Seas</a> and <a href="http://unioncraftbrewing.com/">Union Craft</a>. If you’re a coffee drinker, you drink <a href="http://www.zekescoffee.com/">Zeke’s coffee</a> now.
</p>
<h2 dir="ltr">
Baltimore stuff
</h2>
<p dir="ltr">
You need to know a few things so you don’t look the fool. I’ve created a Baltimore cheat sheet. Normally I wouldn’t suggest cheating, but feel free to write this on your hand or something.
</p>
<ul dir="ltr">
<li dir="ltr">
<p dir="ltr">
The O’s are the baseball team (Orioles, named after a species of bird that lives around here); they have a rich history and are in a division with poser glamour bankroll teams: the Yankees and Red Sox. You do not like the Yankees or Red Sox now.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
Cal Ripken Jr is a former O’s player who broke a famous record for number of consecutive games played.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
The Ravens are the football team (named after the poem by Edgar Allan Poe; see below). They have been very good for a while. There was an issue where the old team, the Baltimore Colts, left Baltimore for Indianapolis, and Baltimore subsequently got Cleveland’s team and named it the Ravens. So now you don’t like Indianapolis Colts fans, and people from Cleveland don’t like you.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
Lacrosse is a sport that exists and Hopkins is good at it.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
Thurgood Marshall, the first black U.S. Supreme Court justice, was born here. The airport is named after him.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
The author Edgar Allan Poe lived, worked, died and was buried here. You can go visit his grave.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
The most famous baseball player ever, Babe Ruth, was born, grew up and started in baseball here. He really liked duckpin bowling, so the story goes.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
Olympic swimmer Michael Phelps grew up, lives and trains here.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
John Waters, a famous director of cult-classic films, is from Baltimore, and the city features prominently in many of his movies.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
H. L. Mencken, the celebrated writer and intellectual, lived and worked here.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
Frederick Douglass, the abolitionist and intellectual, was born and lived near here.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
There was a wonderfully done and controversial television program from HBO, <a href="http://www.imdb.com/title/tt0306414/">The Wire</a>, by David Simon, that everyone talks about around here. It was filmed in, and is about, Baltimore.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
There is a Baltimore accent, but you may miss it at first. People say hon as a term of endearment, pronounce Baltimore as Bawlmer and Washington as Warshington, among other things. Think about all of the time you can save for research now, by omitting several pesky syllables.
</p>
</li>
</ul>
<p dir="ltr">
That’s it for now. We’ll do another one on Hopkins and research in the area.
</p>
Help needed for establishing an ASA statistical genetics and genomics section
2013-09-12T11:32:57+00:00
http://simplystats.github.io/2013/09/12/help-needed-for-establishing-an-asa-statistical-genetics-and-genomics-section
<p>To promote research and education in statistical genetics and genomics, some of us in the community would like to establish a statistical genetics and genomics section of the American Statistical Association (ASA). Having an ASA section gives us certain advantages, such as having allocated invited sessions at JSM, young investigator and student awards, and senior investigator awards in statistical genetics and genomics, as well as a community to interact and exchange information.</p>
<p>We need at least 100 ASA members to pledge that they will join the section (if you are in more than 3 sections already you will be asked to pay a nominal fee of less than $10). If you are interested, please fill in a row in the following Google Doc by November 1st:</p>
<p><a href="https://docs.google.com/spreadsheet/ccc?key=0AtD3gd8kGN45dE9BZ1pTYWtCa0M2VWhKckRoUE9KLVE#gid=0" target="_blank">https://docs.google.com/<wbr />spreadsheet/ccc?key=<wbr />0AtD3gd8kGN45dE9BZ1pTYWtCa0M2V<wbr />WhKckRoUE9KLVE#gid=0</a></p>
Implementing Evidence-based Data Analysis: Treading a New Path for Reproducible Research (Part 3)
2013-09-05T16:30:47+00:00
http://simplystats.github.io/2013/09/05/implementing-evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-3
<p><a href="http://simplystatistics.org/2013/08/28/evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-2/">Last week</a> I talked about how we might be able to improve data analyses by moving towards “evidence-based” data analysis and to use data analytic techniques that are proven to be useful based on statistical research rather. My feeling was this approach attacks the most “upstream” aspect of data analysis before problems have the chance to filter down into things like publications, or even worse, clinical decision-making.</p>
<p>In this third (and final!) post on this topic I wanted to describe a little how we could implement evidence-based data analytic pipelines. Depending on your favorite software system you could imagine a number of ways to do this. If the pipeline were implemented in R, you could imagine it as an R package. The precise platform is not critical at this point; I would imagine most complex pipelines would involve multiple different software systems tied together.</p>
<p>Below is a rough diagram of how I think the various pieces of an evidence-based data analysis pipeline would fit together.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2013/09/dsmpic.png"><img class="alignright size-large wp-image-1800" alt="dsmpic" src="http://simplystatistics.org/wp-content/uploads/2013/09/dsmpic-1024x608.png" width="640" height="380" srcset="http://simplystatistics.org/wp-content/uploads/2013/09/dsmpic-300x178.png 300w, http://simplystatistics.org/wp-content/uploads/2013/09/dsmpic-1024x608.png 1024w" sizes="(max-width: 640px) 100vw, 640px" /></a>There are a few key elements of this diagram that I’d like to stress:</p>
<ol>
<li><span style="line-height: 16px;"> <strong>Inputs are minimal</strong>. You don’t want to allow for a lot of inputs or arguments that can be fiddled with. This reduces the number of degrees of freedom and hopefully reduces the amount of hacking. Basically, you want to be able to input the data and perhaps some metadata.</span></li>
<li><strong>Analysis comes in stages</strong>. There are multiple stages in any analysis, not just the part where you fit a model. Everything is important and every stage should use the best available method.</li>
<li><strong>The stuff in the red box does not involve manual intervention</strong>. The point is to not allow tweaking, fudging, and fiddling. Once the data goes in, we just wait for something to come out the other end.</li>
<li><strong>Methods should be benchmarked</strong>. For each stage of the analysis, there is a set of methods that are applied. These methods should, at a minimum, be benchmarked via a standard group of datasets. That way, if another method comes along, we have an objective way to evaluate whether the new method is better than the older methods. New methods that improve on the benchmarks can replace the existing methods in the pipeline.</li>
<li><strong>Output includes a human-readable report</strong>. This report summarizes what the analysis was and what the results were (including results of any sensitivity analysis). The material in this report could be included in the “Methods” section of a paper and perhaps in the “Results” or “Supplementary Materials”. The goal would be to allow someone who was not intimately familiar with all of the methods used in the pipeline to be able to walk away with a report that he/she could understand and interpret. At a minimum, this person could take the report and share it with their local statistician for help with interpretation.</li>
<li><strong>There is a defined set of output parameters</strong>. Each analysis pipeline should, in a sense, have an “API” so that we know what outputs to expect (not the exact values, of course, but what kinds of values). For example, if a pipeline fits a regression model at the end and the regression parameters are the key objects of interest, then the output could be defined as a vector of regression parameters. There are two reasons to have this: (1) the outputs, if the pipeline is deterministic, could be used for regression testing in case the pipeline is modified; and (2) the outputs could serve as inputs into another pipeline or algorithm.</li>
</ol>
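<p>To make the shape of such a pipeline a little more concrete, here is a minimal, purely illustrative R sketch of the elements above. The cleaning step, the hard-coded model formula, and the report contents are placeholders I’m using just to show the structure (minimal inputs, fixed stages, no manual intervention, a defined output and a human-readable report); they are not a proposal for what the benchmarked methods should actually be.</p>
<pre><code>## Toy "pipeline": minimal inputs, fixed stages, defined outputs.
## The stages here are placeholders for properly benchmarked methods.
run_pipeline <- function(data, metadata = list()) {
    cleaned <- data[complete.cases(data), , drop = FALSE]   # stage 1: cleaning
    fit <- lm(y ~ x, data = cleaned)                        # stage 2: model fitting
    list(parameters = coef(fit),                            # defined output "API"
         metadata   = metadata,
         report     = c(sprintf("Rows analyzed: %d of %d", nrow(cleaned), nrow(data)),
                        sprintf("Estimated slope for x: %.3f", coef(fit)["x"])))
}

## Example use on a toy dataset with columns named x and y
d <- data.frame(x = rnorm(100), y = rnorm(100))
d$y[sample(100, 5)] <- NA
out <- run_pipeline(d, metadata = list(study = "toy example"))
out$report
</code></pre>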
<p>Clearly, one pipeline is not enough. We need many of them for different problems. So what do we do with all of them?</p>
<p>I think we could organize them in a central location (kind of a specialized GitHub) where people could search for, download, create, and contribute to existing data analysis pipelines. An analogy (but not exactly a model) is the <a href="http://www.cochrane.org">Cochrane Collaboration</a> which serves as a repository for evidence-based medicine. There are already a number of initiatives along these lines, such as the <a href="http://galaxyproject.org">Galaxy Project</a> for bioinformatics. I don’t know whether it’d be ideal to have everything in one place or have a number of sub-projects for specialized areas.</p>
<p>Each pipeline would have a leader (or “friendly dictator”) who would organize the contributions and determine which components would go where. This could obviously be contentious, more so in some areas than in others, but I don’t think any more contentious than your average open source project (check the archives of the Linux kernel or Git mailing lists and you’ll see what I mean).</p>
<p>So, to summarize, I think we need to organize lots of evidence-based data analysis pipelines and make them widely available. If I were writing this 5 or 6 years ago, I’d be complaining about a lack of infrastructure out there to support this. But nowadays, I think we have pretty much everything we need in terms of infrastructure. So what are we waiting for?</p>
Repost: A proposal for a really fast statistics journal
2013-09-04T14:54:13+00:00
http://simplystats.github.io/2013/09/04/repost-a-proposal-for-a-really-fast-statistics-journal
<p><em>Editor’s note: This is a repost of a previous Simply Statistics column that seems to be relevant again in light of Marie Davidian’s <a href="http://magazine.amstat.org/blog/2013/09/01/peerreview/">really important column</a> on the peer review process. You should also check out <a href="http://yihui.name/en/2012/03/a-really-fast-statistics-journal/">Yihui’s thoughts on this</a>, which verge on the creation of a very fast/dynamic stats journal. </em></p>
<p>I know we need a new journal like we need a good poke in the eye. But I got fired up by the recent discussion of open science (by <a href="http://krugman.blogs.nytimes.com/2012/01/17/open-science-and-the-econoblogosphere/" target="_blank">Paul Krugman</a> and others) and the seriously misguided <a href="http://en.wikipedia.org/wiki/Research_Works_Act" target="_blank">Research Works Act</a>- that aimed to make it illegal to deposit published papers funded by the government in Pubmed central or other open access databases.</p>
<div>
I also realized that I spend a huge amount of time/effort on the following things: (1) waiting for reviews (typically months), (2) addressing reviewer comments that are unrelated to the accuracy of my work - just adding citations to referees’ papers or doing additional simulations, and (3) resubmitting rejected papers to new journals - this is a huge time suck since I have to reformat, etc. Furthermore, if I want my papers to be published open-access, I have to pay at minimum <a href="http://simplystatistics.tumblr.com/post/12286350206/free-access-publishing-is-awesome-but-expensive-how" target="_blank">$1,000 per paper</a>. So I thought up my criteria for an ideal statistics journal. It would be accurate, have fast review times, and not discriminate based on how interesting an idea is. I have found that my most interesting ideas are the hardest ones to get published. This journal would:</p>
<ul>
<li>
Be open-access and free to publish your papers there. You own the copyright on your work.
</li>
<li>
The criteria for publication would be: (1) the work has to do with statistics, computation, or data analysis, and (2) the work is technically correct.
</li>
<li>
We would accept manuals, reports of new statistical software, and full length research articles.
</li>
<li>
There would be no page limits/figure limits.
</li>
<li>
The journal would be published exclusively online.
</li>
<li>
We would guarantee reviews within 1 week and publication immediately upon review if criteria (1) and (2) are satisfied
</li>
<li>
Papers would receive a star rating from the editor - 0-5 stars. There would be a place for readers to also review articles
</li>
<li>
All articles would be published with a tweet/like button so they can be easily distributed
</li>
</ul>
<div>
</div>
<div>
To achieve such a fast review time, here is how it would work. We would have a large group of Associate Editors (hopefully 30 or more). When a paper was received, it would be assigned to an AE. The AEs would agree to referee papers within 2 days. They would use a form like this:
</div>
<div>
</div>
<blockquote>
<ul>
<li>
Review of: Jeff’s Paper
</li>
<li>
Technically Correct: Yes
</li>
<li>
About statistics/computation/data analysis: Yes
</li>
<li>
Number of Stars: 3 stars
</li>
</ul>
<ul>
<li>
3 Strengths of Paper (1 required):
</li>
<li>
This paper revolutionizes statistics
</li>
</ul>
<ul>
<li>
3 Weaknesses of Paper (1 required):
</li>
<li>
The proof that this paper revolutionizes statistics is pretty weak because he only includes one example.
</li>
</ul>
</blockquote>
<div>
</div>
<div>
That’s it, super quick, super simple, so it wouldn’t be hard to referee. As long as the answers to the first two questions were yes, it would be published.
</div>
<div>
</div>
<div>
So now here’s my questions:
</div>
<div>
</div>
<div>
<ol>
<li>
Would you ever consider submitting a paper to such a journal?
</li>
<li>
Would you be willing to be one of the AEs for such a journal?
</li>
<li>
Is there anything you would change?
</li>
</ol>
</div>
</div>
Sunday data/statistics link roundup (9/1/13)
2013-09-01T15:00:01+00:00
http://simplystats.github.io/2013/09/01/sunday-datastatistics-link-roundup-9113
<ol>
<li><span style="line-height: 16px;">There has been a lot of discussion of the importance of open access on Twitter. I am 100% in favor of open access (<a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">I do wish it was less expensive</a>), but I also think that sometimes people lose sight of other important issues for junior scientists that go beyond open access. Dr. Isis has a<a href="http://isisthescientist.com/2013/08/28/the-morality-of-open-access-vs-increasing-diversity/"> great example</a> of this on her blog. </span></li>
<li>Sherri R. has a <a href="http://drsherrirose.com/resources">great list of resources</a> for stats minded folks at the undergrad, grad, and faculty levels.</li>
<li>There he goes again. Another <a href="http://www.80grados.net/arce-y-la-estrella/">awesome piece by Rafa</a> on someone else’s blog. It is in Spanish but the google translate does ok. Be sure to check out questions 3 and 4.</li>
<li>A really nice summary of Nate Silver’s talk at JSM and a post-talk interview (in video format) <a href="http://www.statisticsviews.com/details/feature/5133141/Nate-Silver-Wha%20t-I-need-from-statisticians.html">are available here.</a> Pair with this <a href="http://www.theonion.com/articles/nate-silver-vows-to-teach-chris-berman-how-to-read,33610/">awesome Onion piece</a> (both links via Marie D.)</li>
<li><a href="http://noahpinionblog.blogspot.com/2013/08/a-few-words-about-math.html">A really nice post</a> that made the rounds in the economics blogosphere talking about the use of mathematics in econ. This seems like a pretty relevant quote, “Instead, it was just some random thing that someone made up and wrote down because A) it was tractable to work with, and B) it sounded plausible enough so that most other economists who looked at it tended not to make too much of a fuss.”</li>
<li>More on <a href="http://blog.alinelerner.com/silicon-valley-hiring-is-not-a-meritocracy/">hiring technical people</a>. This is related to Google saying their <a href="http://simplystatistics.org/2013/06/20/googles-brainteasers-that-dont-work-and-johns-hopkins-biostatistics-data-analysis/">brainteaser interview questions don’t work</a>. Check out <a href="http://blog.alinelerner.com/lessons-from-a-years-worth-of-hiring-data/">the list</a> here of things that this person found useful in hiring technical people that could be identified easily. I like how typos and grammatical errors were one of the best predictors.</li>
</ol>
AAAS S&T Fellows for Big Data and Analytics
2013-08-30T10:00:51+00:00
http://simplystats.github.io/2013/08/30/aaas-st-fellows-for-big-data-and-analytics
<p>Thanks to <a href="https://twitter.com/ASA_SciPol">Steve Pierson</a> of the ASA for letting us know that the AAAS <a href="http://fellowships.aaas.org/02_Areas/02_index.shtml">Science and Technology Fellowship program</a> has a new category for “<a href="http://fellowships.aaas.org/02_Areas/02_index.shtml#data">Big Data and Analytics</a>”. For those not familiar, AAAS organizes the S&T Fellowship program to get scientists involved in the policy-making process in Washington and at the federal agencies. In general, the requirements for the program are</p>
<blockquote>
<p>Applicants must have a PhD or an equivalent doctoral-level degree at the time of application. Individuals with a master’s degree in engineering and at least three years of post-degree professional experience also may apply. Some programs require additional experience. Applicants must be U.S. citizens. Federal employees are not eligible for the fellowships.</p>
</blockquote>
<p>Further details are on the <a href="http://fellowships.aaas.org/04_Become/04_Eligibility.shtml">AAAS web site</a>.</p>
<p>I’ve met a number of current and former AAAS fellows working on Capitol Hill and at the various agencies and I have to say I’ve been universally impressed. I personally think getting more scientists into the federal government and involved with the policy-making process is a Good Thing. If you’re a statistician looking to have a different kind of impact, this might be for you.</p>
The return of the stat - Computing for Data Analysis & Data Analysis back on Coursera!
2013-08-29T09:31:08+00:00
http://simplystats.github.io/2013/08/29/the-return-of-the-stat-computing-for-data-analysis-data-analysis-back-on-coursera
<p>It’s the <a href="http://www.tubechop.com/watch/1390349">return of the stat</a>. Roger and I are going to be re-offering our Coursera courses:</p>
<p><strong>Computing for Data Analysis (starts Sept 23)</strong></p>
<p><a href="https://www.coursera.org/course/compdata">Sign up here</a>.</p>
<p><strong>Data Analysis (starts Oct 28)</strong></p>
<p><a href="https://www.coursera.org/course/dataanalysis">Sign up here</a>.</p>
Evidence-based Data Analysis: Treading a New Path for Reproducible Research (Part 2)
2013-08-28T10:14:32+00:00
http://simplystats.github.io/2013/08/28/evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-2
<p dir="ltr">
Last week I <a href="http://simplystatistics.org/2013/08/21/treading-a-new-path-for-reproducible-research-part-1/">posted</a> about how I thought the notion of reproducible research did not go far enough to address the question of whether you could trust that a given data analysis was conducted appropriately. From some of the discussion on the post, it seems some of you thought I believed therefore that reproducibility had no value. That’s definitely not true and I’m hoping I can clarify my thinking in this followup post.
</p>
<p dir="ltr">
Just to summarize a bit from last week, one key problem I find with requiring reproducibility of a data analysis is that it comes only at the most “downstream” part of the research process, the post-publication part. So anything that was done incorrectly has already happened and the damage has been done to the analysis. Having code and data available, importantly, makes it possible to discover these problems, but only after the fact. I think this results in two problems: (1) It may take a while to figure out what exactly the problems are (even with code/data) and how to fix them; and (2) the problems in the analysis may have already caused some sort of harm.
</p>
<p dir="ltr">
<strong>Open Source Science?</strong>
</p>
<p dir="ltr">
For the first problem, I think a reasonable analogy for reproducible research is open source software. There the idea is that source code is available for all computer programs so that we can inspect and modify how a program runs. With open source software “<a href="http://en.wikipedia.org/wiki/Linus's_Law">all bugs are shallow</a>”. But the key here is that as long as all programmers have the requisite tools, they can modify the source code on their own, publish their corrected version (if they are fixing a bug), others can review it and accept or modify, and on and on. All programmers are more or less on the same footing, as long as they have the ability to hack the code. With distributed source code management systems like <a href="http://git-scm.com">git</a>, people don’t even need permission to modify the source tree. In this environment, the best idea wins.
</p>
<p dir="ltr">
The analogy with open source software breaks down a bit with scientific research because not all players are on the same footing. Typically, the original investigator is much better equipped to modify the “source code”, in this case the data analysis, and to fix any problems. Some types of analyses may require tremendous resources that are not available to all researchers. Also, it might take a long time for others who were not involved in the research to fully understand what is going on and how to make reasonable modifications. That may involve, for example, learning the science in the first place, or learning how to program a computer for that matter. So I think making changes to a data analysis and having them accepted is a slow process in science, much more so than with open source software. There are definitely things we can do to improve our ability to make rapid changes/updates, but the implementation of those changes is only just getting started.
</p>
<p dir="ltr">
<strong>First Do No Harm</strong>
</p>
<p dir="ltr">
The second problem, that some sort of harm may have already occurred before an analysis can be fully examined, is an important one. As I mentioned in the previous post, merely stating that an analysis is reproducible doesn’t say a whole lot about whether it was done correctly. In order to verify that, someone knowledgeable has to go into the details and muck around to see what is going on. If someone is not available to do this, then we may never know what actually happened. Meanwhile, the science still stands and others may build off of it.
</p>
<p dir="ltr">
In the Duke saga, one of the most concerning aspects of the whole story was that some of Potti’s research was going to be used to guide therapy in a clinical trial. The fact that a series of flawed data analyses was going to be used as the basis of choosing what cancer treatments people were going to get was very troubling. In particular, one of these flawed analyses reversed the labeling of the cancer and control cases!
</p>
<p>To me, it seems that waiting around for someone like Keith Baggerly to come around and spend close to 2,000 hours reproducing, inspecting, and understanding a series of analyses is not an efficient system. In particular, when actual human lives may be affected, it would be preferable if the analyses were done right in the first place, without the “statistics police” having to come in and check that everything was done properly.</p>
<p><strong>Evidence-based Data Analysis</strong></p>
<p dir="ltr">
What I think the statistical community needs to invest time and energy into is what I call “evidence-based data analysis”. What do I mean by this? Most data analyses are not the simple classroom exercises that we’ve all done involving linear regression or two-sample t-tests. Most of the time, you have to obtain the data, clean that data, remove outliers, impute missing values, transform variables and on and on, even before you fit any sort of model. Then there’s model selection, model fitting, diagnostics, sensitivity analysis, and more. So a data analysis is really a pipeline of operations where the output of one stage becomes the input of another.
</p>
<p dir="ltr">
The basic idea behind evidence-based data analysis is that for each stage of that pipeline, we should be using the best method, justified by appropriate statistical research that provides evidence favoring one method over another. If we cannot reasonably agree on a best method for a given stage in the pipeline, then we have a gap that needs to be filled. So we fill it!
</p>
<p dir="ltr">
Just to clarify things before moving on too far, here’s a simple example.
</p>
<p dir="ltr">
<strong>Evidence-based Histograms</strong>
</p>
<p dir="ltr">
Consider the following simple histogram.
</p>
<p dir="ltr">
<a href="http://simplystatistics.org/wp-content/uploads/2013/08/hist.png"><img class="alignright size-large wp-image-1773" alt="hist" src="http://simplystatistics.org/wp-content/uploads/2013/08/hist-1024x703.png" width="640" height="439" srcset="http://simplystatistics.org/wp-content/uploads/2013/08/hist-300x206.png 300w, http://simplystatistics.org/wp-content/uploads/2013/08/hist-1024x703.png 1024w" sizes="(max-width: 640px) 100vw, 640px" /></a>
</p>
<p dir="ltr">
The histogram was created in R by calling hist(x) on some Normal random deviates (I don’t remember the seed so unfortunately it is not reproducible). Now, we all know that a histogram is a kind of smoother, and with any smoother, the critical parameter is the smoothing parameter or the bandwidth. Here, it’s the size of the bin or the number of bins.
</p>
<p>Notice that when I call ‘hist’ I don’t actually specify the number of bins. Why not? Because in R, the default is to use Sturges’ formula for the number of bins. Where does that come from? Well, there is a <a href="http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1926.10502161?journalCode=uasa20#.Uh3_FBbHKZY">paper</a> in the <em>Journal of the American Statistical Association</em> in 1926 by H. A. Sturges that justifies why such a formula is reasonable for a histogram (it is a very short paper, those were the days). R provides other choices for choosing the number of bins. For example, David Scott <a href="http://biomet.oxfordjournals.org/content/66/3/605.short">wrote a paper</a> in <em>Biometrika</em> that justified bandwidth/bin size based on integrated mean squared error criteria.</p>
<p>The point is that R doesn’t just choose the default number of bins willy-nilly, there’s actual research behind that choice and evidence supporting why it’s a good choice. Now, we may not all agree that this default is the best choice at all times, but personally I rarely modify the default number of bins. Usually I just want to get a sense of what the distribution looks like and the default is fine. If there’s a problem, transforming the variable somehow often is more productive than modifying the number of bins. What’s the best transformation? Well, it turns out there’s <a href="http://en.wikipedia.org/wiki/Power_transform">research on that too</a>.</p>
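<p>You can poke at these defaults directly in base R: the Sturges and Scott rules mentioned above are built in, along with the Freedman-Diaconis rule (a quick, reproducible sketch):</p>
<pre><code>set.seed(1)          # unlike the figure above, this one is reproducible
x <- rnorm(500)

nclass.Sturges(x)    # the rule hist() uses by default
nclass.scott(x)      # Scott's rule, based on integrated mean squared error
nclass.FD(x)         # Freedman-Diaconis, another published rule

hist(x)                    # Sturges' formula by default
hist(x, breaks = "Scott")  # switch to Scott's rule
</code></pre>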
<p><strong>Evidence-based Reproducible Research</strong></p>
<p dir="ltr">
Now why can’t we extend the idea behind the histogram bandwidth to all data analysis? I think we can. For every stage of a given data analysis pipeline, we can have the “best practices” and back up those practices with statistical research. Of course it’s possible that such best practices have not yet been developed. This is common in emerging areas like genomics where the data collection technology is constantly changing. That’s fine, but in more mature areas, I think it’s possible for the community to agree on a series of practices that work, say, 90% of the time.
</p>
<p dir="ltr">
There are a few advantages to evidence-based reproducible research.
</p>
<ol>
<li>It reduces the “researcher degrees of freedom”. Researchers would be disincentivized from choosing the method that produces the “best” results if there is already a generally agreed upon approach. If a given data analysis required a different approach, the burden would be on the analyst to justify why a deviation from the generally accepted approach was made.</li>
<li>The methodology would be transparent because the approach would have been vetted by the community. I call this “transparent box” analysis, as opposed to black box analysis. The analysis would be transparent so you would know exactly what is going on, but it would be “<a href="http://www.hulu.com/watch/284761">locked in a box</a>” so that you couldn’t tinker with it to game the results.</li>
<li>You would not have the <a href="http://simplystatistics.org/2013/08/09/embarrassing-typos-reveal-the-dangers-of-the-lonely-data-analyst/">lonely data analyst</a> coming up with their own magical method to analyze the data. If a researcher claimed to have conducted an analysis using an evidence-based pipeline, you could at least have a sense that something reasonable was done. You would still need reproducibility to ensure that the researcher was not misrepresenting him/herself, but now we would have two checks on the analysis, not just one.</li>
<li>Most importantly, evidence-based reproducible research attacks the furthest upstream aspect of the research, which is the analysis itself. It guarantees that generally accepted approaches are used to analyze the data from the very beginning and hopefully prevents problems from occurring rather than letting them propagate through the system.</li>
</ol>
<p dir="ltr">
What can we do to bring evidence-based data analysis practices to all of the sciences? I’ll write about what I think we can do in the next post.
</p>
Interview with Ani Eloyan and Betsy Ogburn
2013-08-27T10:25:48+00:00
http://simplystats.github.io/2013/08/27/interview-with-ani-eloyan-and-betsy-ogburn
<p>Jeff and I interview Ani Eloyan and Betsy Ogburn, two new Assistant Professors in the Department of Biostatistics here.</p>
<p>Jeff and I talk to Ani and Betsy about their research interests and finally answer the burning question: “What is the future of statistics?”</p>
Statistics meme: Sad p-value bear
2013-08-26T12:43:11+00:00
http://simplystats.github.io/2013/08/26/statistics-meme-sad-p-value-bear
<div style="width: 470px" class="wp-caption aligncenter">
<img alt="" src="http://i.imgflip.com/37w9c.jpg" width="460" height="480" />
<p class="wp-caption-text">
Sad p-value bear wishes you had a bigger sample size.
</p>
</div>
<p>I was just at a conference where the idea for a sad p-value bear meme came up (in the spirit of <a href="http://biostatisticsryangoslingreturns.tumblr.com/">Biostatistics Ryan Gosling</a>). This should not be considered an endorsement of p-values or p-value hacking.</p>
Did Faulty Software Shut Down the NASDAQ?
2013-08-24T10:00:33+00:00
http://simplystats.github.io/2013/08/24/did-faulty-software-shut-down-the-nasdaq
<p>This past Thursday, the NASDAQ stock exchange shut down for just over 3 hours due to some technical problems. It’s still not clear what the problem was because NASDAQ officials are being tight-lipped. NASDAQ has had a bad run of problems recently, the most visible being the botching of the Facebook initial public offering.</p>
<p>Stock trading these days is a highly technical business involving complex algorithms and multiple exchanges spread across the country. Poorly coded software or just plain old bugs have the potential to take down an entire exchange and paralyze parts of the financial system for hours.</p>
<p>Mary Jo White, the Chairman of the SEC, is apparently getting involved.</p>
<blockquote>
<p>Thursday evening, Ms. White said in a statement that the paralysis at the Nasdaq was “serious and should reinforce our collective commitment to addressing technological vulnerabilities of exchanges and other market participants.”</p>
<p>She said she would push ahead with recently proposed rules that would add testing requirements and safeguards for trading software. So far, those rules have faced resistance from the exchange companies. Ms. White said that she would “shortly convene a meeting of the leaders of the exchanges and other major market participants to accelerate ongoing efforts to further strengthen our markets.”</p>
</blockquote>
<p>Having testing requirements for trading software is an interesting idea. It’s easy to see why the industry would be against it. Trading is a fast-moving business, and my guess is software is updated/modified constantly to improve performance or to give people an edge. If you had to get approval or run a bunch of tests every time you wanted to deploy something, you’d quickly get behind the curve.</p>
<p>But is there an issue of safety here? If a small bug in the computer code on which the exchange relies can take down the entire system for hours, isn’t that a problem of “financial safety”? Other problems, like the notorious “flash crash” of 2010 where the Dow Jones Industrial Average dropped 700 points in minutes, have the potential to affect regular people, not just hedge fund traders.</p>
<p>It’s not unprecedented to subject computer code to higher scrutiny. Code that flies airplanes or runs air-traffic control systems is all tested and reviewed rigorously before being put into production and I think most people would consider that reasonable. Are financial markets the next area? What about scientific research?</p>
Stratifying PISA scores by poverty rates suggests imitating Finland is not necessarily the way to go for US schools
2013-08-23T10:01:31+00:00
http://simplystats.github.io/2013/08/23/stratifying-pisa-scores-by-poverty-rates-suggests-imitating-finland-is-not-necessarily-the-way-to-go-for-us-schools
<p>For the past several years a <a href="http://www.businessinsider.com/finlands-education-system-best-in-world-2012-11?op=1">steady</a> <a href="http://www.nytimes.com/2011/12/13/education/from-finland-an-intriguing-school-reform-model.html?pagewanted=all">stream</a> of <a href="http://www.smithsonianmag.com/people-places/Why-Are-Finlands-Schools-Successful.html">news articles</a> and <a href="http://www.greatschools.org/students/2453-finland-education.gs">opinion pieces</a> has been praising the virtues of Finnish schools and exhorting the US to imitate this system. One data point supporting this view comes from the most recent PISA scores (2009), in which Finland outscored the US 536 to 500. Several people have pointed out that this is an apples (huge and diverse) to oranges (small and homogeneous) comparison. One of the many differences that makes the comparison complicated is that Finland has fewer students living in poverty (3%) than the US (20%). <a href="http://nasspblogs.org/principaldifference/2010/12/pisa_its_poverty_not_stupid_1.html">This post</a> defending US public school teachers makes this point with data. Here I show these data in graphical form. The plot on the left shows <a href="http://nces.ed.gov/surveys/pisa/">PISA scores</a> versus the percent of students living in poverty for several countries. There is a pattern suggesting that higher poverty rates are associated with lower PISA scores. In the plot on the right, US schools are stratified by % poverty (orange points). The regression line is the same. Some countries are added (purple) for comparative purposes (the <a href="http://nasspblogs.org/principaldifference/2010/12/pisa_its_poverty_not_stupid_1.html">post</a> does not provide their poverty rates). Note that US schools with poverty rates comparable to Finland’s (below 10%) outperform Finland, and schools in the 10-24% range aren’t far behind. So why should these schools change what they are doing? Schools with poverty rates above 25% are another story. Clearly the US has lots of work to do in trying to improve performance in these schools, but is it safe to assume that Finland’s system would work for these student populations?</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2013/08/pisa2.png"><img src="http://simplystatistics.org/wp-content/uploads/2013/08/pisa2.png" alt="pisa scores plotted against percent poverty" /></a></p>
<p>Note that I scraped data from <a href="http://nasspblogs.org/principaldifference/2010/12/pisa_its_poverty_not_stupid_1.html">this post</a> and not the original source.</p>
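<p>If you want to play with this kind of plot yourself, here is a rough sketch of the idea in R. The numbers below are simulated stand-ins just to illustrate the code; they are not the scraped PISA values (for those, see the links above).</p>
<pre><code>## Simulated stand-in data, NOT the actual PISA/poverty values
set.seed(2013)
poverty <- runif(30, min = 0, max = 50)            # percent of students in poverty
score   <- 540 - 2 * poverty + rnorm(30, sd = 15)  # made-up relationship

plot(poverty, score,
     xlab = "Percent of students living in poverty",
     ylab = "PISA score")
abline(lm(score ~ poverty))   # regression line, as in the figure above
</code></pre>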
If you are near DC/Baltimore, come see Jeff talk about Coursera
2013-08-23T08:53:08+00:00
http://simplystats.github.io/2013/08/23/if-you-are-near-dcbaltimore-come-see-jeff-talk-about-coursera
<p>I’ll be speaking at the Data Science Maryland meetup. The title of my presentation is “Teaching Data Science to the Masses”. The talk is at 6pm on Thursday, Sept. 19th. More info <a href="http://www.meetup.com/Data-Science-MD/events/135629022/">here</a>.</p>
Chris Lane, U.S. tourism boycotts, and large relative risks on small probabilities
2013-08-22T09:45:57+00:00
http://simplystats.github.io/2013/08/22/chris-lane-u-s-tourism-boycotts-and-large-relative-risks-on-small-probabilities
<p>Chris Lane <a href="http://www.cnn.com/2013/08/21/justice/australia-student-killed-oklahoma/index.html?hpt=hp_t2">was tragically killed</a> (link via Leah J.) in a shooting in Duncan, Oklahoma. According to the reports, it was apparently a random and completely senseless act of violence. It is horrifying to think that those kids were just looking around for someone to kill because they were bored.</p>
<p>Gun violence in the U.S. is way too common, and I’m happy about efforts to reduce the chance of this type of event. But I noticed this quote in the above-linked CNN article from the former deputy prime minister of Australia, Tim Fischer:</p>
<blockquote>
<p>People thinking of going to the USA for business or tourist trips should think carefully about it given the statistical fact you are 15 times more likely to be shot dead in the USA than in Australia per capita per million people.</p>
</blockquote>
<p>The CNN article suggests he is calling for a boycott of U.S. tourism. I’m guessing he got his data from a table <a href="http://en.wikipedia.org/wiki/List_of_countries_by_firearm-related_death_rate">like this</a>. According to the table, the total firearm-related deaths per one million in Australia is 10.6 and in the U.S. 103. So the ratio is something like 10 times. If you restrict to homicides, the rates are 1.3 per million for Australia and 36 per million for the U.S. Here the ratio is about 28 times.</p>
<p>So the question is, should you boycott the U.S. if you are an Australian tourist? Well, the percentage of people killed in a firearm homicide is 0.0036% in the U.S. and 0.00013% in Australia. So it is incredibly unlikely that you will be killed by a firearm in either country. The issue here is that with small probabilities, you can get huge relative risks, even when both outcomes are very unlikely in an absolute sense. The Chris Lane killing is tragic and horrifying, but I’m not sure a tourism boycott for the purposes of safety is justified.</p>
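<p>The arithmetic is easy to check directly, using the homicide rates per one million people quoted above:</p>
<pre><code>rate_us  <- 36    # U.S. firearm homicides per million (from the table above)
rate_aus <- 1.3   # Australian firearm homicides per million

rate_us / rate_aus        # relative risk: roughly 28-fold

100 * rate_us  / 1e6      # absolute risk in the U.S.: about 0.0036%
100 * rate_aus / 1e6      # absolute risk in Australia: about 0.00013%
</code></pre>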
Treading a New Path for Reproducible Research: Part 1
2013-08-21T14:34:04+00:00
http://simplystats.github.io/2013/08/21/treading-a-new-path-for-reproducible-research-part-1
<p dir="ltr">
Discussions about reproducibility in scientific research have been on the rise lately, including <a href="http://simplystatistics.org/2013/07/09/repost-preventing-errors-through-reproducibility/">on</a> <a href="http://simplystatistics.org/2013/04/30/reproducibility-and-reciprocity/">this</a> <a href="http://simplystatistics.org/2011/12/02/reproducible-research-in-computational-science/">blog</a>. There are many underlying trends that have produced this increased interest in reproducibility: larger and larger studies being harder to replicate independently, cheaper data collection technologies/methods producing larger datasets, cheaper computing power allowing for more sophisticated analyses (even for small datasets), and the rise of general computational science (for every “X” we now have “Computational X”).
</p>
<p>For those that haven’t been following, here’s a brief review of what I mean when I say “reproducibility”. For the most part in science, we focus on what I and some others call “replication”. The purpose of replication is to address the validity of a scientific claim. If I conduct a study and conclude that “X is related to Y”, then others may be encouraged to replicate my study–with independent investigators, data collection, instruments, methods, and analysis–in order to determine whether my claim of “X is related to Y” is in fact true. If many scientists replicate the study and come to the same conclusion, then there’s evidence in favor of the claim’s validity. If other scientists cannot replicate the same finding, then one might conclude that the original claim was false. In either case, this is how science has always worked and how it will continue to work.</p>
<p dir="ltr">
Reproducibility, on the other hand, focuses on the validity of the data analysis. In the past, when datasets were small and the analyses were fairly straightforward, the idea of being able to reproduce a data analysis was perhaps not that interesting. But now, with computational science, where data analyses can be extraordinarily complicated, there’s great interest in whether certain data analyses can in fact be reproduced. By this I mean: is it possible to take someone’s dataset and come to the same numerical/graphical/whatever output that they came to? While this seems theoretically trivial, in practice it’s very complicated because a given data analysis, which typically will involve a long pipeline of analytic operations, may be difficult to keep track of without proper organization, training, or software.
</p>
<p><strong>What Problem Does Reproducibility Solve?</strong></p>
<p dir="ltr">
In my opinion, reproducibility cannot really address the validity of a scientific claim as well as replication. Of course, if a given analysis is not reproducible, that may call into question any conclusions drawn from the analysis. However, if an analysis is reproducible, that says practically nothing about the validity of the conclusion or of the analysis itself.
</p>
<p dir="ltr">
In fact, there are numerous examples in the literature of analyses that were reproducible but just wrong. Perhaps the most nefarious recent example is the <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">Potti scandal at Duke</a>. Given the amount of effort (somewhere close to 2000 hours) Keith Baggerly and his colleagues had to put into figuring out what Potti and others did, I think it’s reasonable to say that their work was not reproducible. But in the end, Baggerly was able to reproduce some of the results--this was how he was able to figure out that the analyses were incorrect. If the Potti analysis had not been reproducible from the start, it would have been impossible for Baggerly to come up with the laundry list of errors that they made.
</p>
<p dir="ltr">
The <a href="http://simplystatistics.org/2013/04/21/nevins-potti-reinhart-rogoff/">Reinhart-Rogoff kerfuffle</a> is another example of analysis that ultimately was reproducible but nevertheless questionable. While Herndon did have to do a little reverse engineering to figure out the original analysis, it was nowhere near the years-long effort of Baggerly and colleagues. However, it was Reinhart-Rogoff’s unconventional weighting scheme (fully reproducible, mind you) that drew all of the attention and strongly influenced the analysis.
</p>
<p dir="ltr">
I think the key question we want to answer when seeing the results of any data analysis is “Can I trust this analysis?” It’s not possible to go into every data analysis and check everything, even if all the data and code were available. In most cases, we want to have a sense that the analysis was done appropriately (if not optimally). I would argue that requiring that analyses be reproducible does not address this key question.
</p>
<p>With reproducibility you get a number of important benefits: transparency, data and code for others to analyze, and an increased rate of transfer of knowledge. These are all very important things. Data sharing in particular may be important independent of the need to reproduce a study if others want to aggregate datasets or do meta-analyses. But reproducibility does not guarantee validity or correctness of the analysis.</p>
<p><strong>Prevention vs. Medication</strong></p>
<p dir="ltr">
One key problem with the notion of reproducibility is the point in the research process at which we can apply it as an intervention. Reproducibility plays a role only in the most downstream aspect of the research process--post-publication. Only after a paper is published (and after any questionable analyses have been conducted) can we check to see if an analysis was reproducible or conducted in error.
</p>
<p dir="ltr">
<a href="http://simplystatistics.org/wp-content/uploads/2013/08/rrpipeline.png"><img class="alignright size-large wp-image-1705" alt="rrpipeline" src="http://simplystatistics.org/wp-content/uploads/2013/08/rrpipeline-1024x463.png" width="640" height="289" srcset="http://simplystatistics.org/wp-content/uploads/2013/08/rrpipeline-300x135.png 300w, http://simplystatistics.org/wp-content/uploads/2013/08/rrpipeline-1024x463.png 1024w" sizes="(max-width: 640px) 100vw, 640px" /></a>
</p>
<p dir="ltr">
At this point it may be difficult to correct any mistakes if they are identified. Grad students have graduated, postdocs have left, people have moved on. In the Potti case, letters to the journal editors were ignored. While it may be better to check the research process at the end rather than to never check it, intervening at the post-publication phase is arguably the most expensive place to do it. At this phase of the research process, you are merely “medicating” the problem, to draw an analogy with chronic diseases. But fundamental data analytic damage may have already been done.
</p>
<p dir="ltr">
This medication aspect of reproducibility reminds me of a famous quotation from R. A. Fisher:
</p>
<blockquote>
<p>To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.</p>
</blockquote>
<p dir="ltr">
Reproducibility allows for the statistician to conduct the post mortem of a data analysis. But wouldn’t it have been better to have prevented the analysis from dying in the first place?
</p>
<p dir="ltr">
<strong>Moving Upstream</strong>
</p>
<p dir="ltr">
There has already been much discussion of changing the role of reproducibility in the publication/dissemination process. What if a paper had to be deemed reproducible before it was published? The question here is who will reproduce the analysis? We can't trust the authors to do it so we have to get an independent third party. What about peer reviewers? I would argue that this is a pretty big burden to place on a peer reviewer who is already working for free. How about one of the Editors? Well, at the journal <em>Biostatistics</em>, that’s <a href="http://biostatistics.oxfordjournals.org/content/10/3/405.long">exactly what we do</a>. However, our policy is voluntary and only plays a role after a paper has been accepted through the usual peer review process. At any rate, from a business perspective, most journal owners will be reluctant to implement any policy that might reduce the number of submissions to the journal.
</p>
<p><strong>What Then?</strong></p>
<p dir="ltr">
To summarize, I believe reproducibility of computational research is very important, primarily to increase transparency and to improve knowledge sharing. However, I don’t think reproducibility in and of itself addresses the fundamental question of “Can I trust this analysis?”. Furthermore, reproducibility plays a role at the most downstream part of the research process (post-publication) where it is costliest to fix any mistakes that may be discovered. Ultimately, we need to think beyond reproducibility and to consider developing ways to ensure the quality of data analysis from the start.
</p>
<p dir="ltr">
How can we address the key problem concerning the validity of a data analysis? I’ll talk about what I think we should do in Part 2 of this post.
</p>
A couple of requests for the @Statistics2013 future of statistics workshop
2013-08-20T10:22:50+00:00
http://simplystats.github.io/2013/08/20/a-couple-of-requests-for-the-statistics2013-future-of-statistics-workshop
<p>Statistics 2013 is hosting a workshop on the <a href="http://www.statistics2013.org/about-the-future-of-the-statistical-sciences-workshop/">future of statistics</a>. Given the timing and the increasing popularity of our discipline, I think it’s a great idea to showcase the future of our field.</p>
<p>I just have two requests:</p>
<div id=":27d">
<div role="chatMessage">
<div dir="ltr" id=":2o3">
<ol>
<li>
Please invite more junior people to speak who are doing cutting edge work that will define the future of our field.
</li>
<li>
Please focus the discussion on some of the real and very urgent issues facing our field.
</li>
</ol>
<p>Regarding #1, the list of speakers appears to include only <a href="http://www.statistics2013.org/presentations-and-panelists/">very senior people</a>. I wish there were more junior speakers because: (1) the future of statistics will be defined by people who are just starting their careers now and (2) there are some awesome superstars who are making huge advances in, among other things, the <a href="http://www.princeton.edu/~hanliu/">theory of machine learning</a>, <a href="http://www.biostat.wisc.edu/~cdewey/">high-throughput data analysis</a>, <a href="http://simplystatistics.org/2012/06/01/interview-with-amanda-cox-graphics-editor-at-the-new/">data visualization</a>, and <a href="http://www.biostat.jhsph.edu/~khansen/software.html">software creation</a>. I think including at least one person under 40 on the speaking list would bring some fresh energy.*</p>
<p>Regarding #2 I think there are a few issues that are incredibly important for our field as we move forward. I hope that the discussion will cover some of these:</p>
<ol>
<li><a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">Problem first not solution backward</a>. It would be awesome if there was a whole panel filled with people from industry/applied statistics talking about the major problems where statisticians are needed and how we can train statisticians to tackle those problems. In particular it would be cool to see discussion of: (1) should we remove some math and add some software development to our curriculum?, (2) should we rebalance our curriculum to include more machine learning?, (3) should we require all students to do rotations in scientific or business internships?, (4) should we make presentation skills a high priority skill along with the required courses in math stats/applied stats?</li>
<li>Straight up embracing online education. <a href="http://simplystatistics.org/2012/08/10/why-we-are-teaching-massive-open-online-courses-moocs/">We are teaching MOOCs</a> here at Simply Stats. But that is only one way to embrace online education. What about <a href="https://github.com/hadley/devtools/wiki/Rcpp">online</a> <a href="http://kbroman.github.io/minimal_make/">tutorials</a> on Github? Or how about making educational videos for <a href="http://www.youtube.com/watch?v=znaO6OHLTeY">software packages</a>?</li>
<li><a href="http://simplystatistics.org/2012/05/28/schlep-blindness-in-statistics/">Good software is now the most important contribution of statisticians</a>. The most glaring absence from the list of speakers and panels is that there is no discussion of software! I have gone so far as to say if you (or someone else) aren’t writing software for your methods, <a href="http://simplystatistics.org/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper/">they don’t really exist</a>. We need to have a serious discussion as a field about how the future of version control, reproducibility, data sharing, etc. are going to work. This seems like the perfect venue.</li>
<li>How can we forge better partnerships with industry and other data generators? Facebook, Google, Bitly, Twitter, Fitbit, etc. are all collecting huge amounts of data. But <a href="http://simplystatistics.org/2013/01/02/fitbit-why-cant-i-have-my-data/">there is no data sharing protocol</a> like there was for genomics. Similarly, much of the imaging data in the world is tied up in academic and medical institutes. Fresh statistical eyes can’t be placed on these problems until the data are available in easily accessible, analyzable formats. How can we forge partnerships that make the data more valuable to the companies/institutes creating them and add immense value to young statisticians?</li>
</ol>
<p>These next two are primarily targeted at academics:</p>
<ol>
<li><a href="http://simplystatistics.org/2012/03/14/a-proposal-for-a-really-fast-statistics-journal/">How we can speed up our publication process</a>? For academic statisticians this is a killer and major problem. I regularly wait 3-5 months for papers to be reviewed for the first time at the fastest stat journals. Some people still wait years. By then, the highest impact applied problems have moved on with better technology, newer methodology etc.</li>
<li>How can we make our promotion/awards process more balanced between theoretical and applied contributions? I think both are very important, but right now, on balance, papers in JASA are much more highly rated than Bioconductor packages with 10,000+ users. Both are hard work, both represent important contributions, and both should be given strong weight (for example in <a href="http://www.amstat.org/fellows/nominations/pdfs/RatingofNominees.pdf">rating ASA Fellows</a>).</li>
</ol>
<p>Anyway, I hope the conference is a huge success. I was pumped to see all the chatter on Twitter when Nate Silver spoke at JSM. That was a huge win for the organizers of the event. I am really hopeful that, with the important efforts of the organizers of these big events, we will see a continued trend toward a bigger and bigger impact of statistics.</p>
<p><em>* Rafa is invited, but he’s over 40 :-).**</em></p>
<p><em>** Rafa told me to mention he’s barely over 40.</em></p>
WANTED: Neuro-quants
2013-08-13T15:16:31+00:00
http://simplystats.github.io/2013/08/13/wanted-neuro-quants
<p>Our good colleagues <a href="http://www.bcaffo.com">Brian Caffo</a>, Martin Lindquist, and Ciprian Crainiceanu have written a nice editorial for the HuffPo on the <a href="http://www.huffingtonpost.com/american-statistical-association/wanted-neuroquants_b_3749363.html">need for statisticians in neuroimaging</a>.</p>
Embarrassing typos reveal the dangers of the lonely data analyst
2013-08-09T09:33:53+00:00
http://simplystats.github.io/2013/08/09/embarrassing-typos-reveal-the-dangers-of-the-lonely-data-analyst
<p><a href="http://lifesciencephdadventures.wordpress.com/2013/08/07/a-failure-of-authorship-and-peer-review/">A silly, but actually very serious, error</a> in the supplementary material of a recent paper in Organometallics is causing a stir on the internets (I saw it on <a href="http://andrewgelman.com/2013/08/08/for-chrissake-just-make-up-an-analysis-already-we-have-a-lab-here-to-run-yknow/">Andrew G.’s</a> blog). The error in question is a comment in the <a href="http://pubs.acs.org/doi/suppl/10.1021/om4000067/suppl_file/om4000067_si_002.pdf">supplementary material</a> of the paper:</p>
<blockquote>
<p>Emma, please insert NMR data here! where are they? and for this compound, just make up an elemental analysis . . .</p>
</blockquote>
<p><a href="http://blog.chembark.com/2013/08/06/a-disturbing-note-in-a-recent-si-file/">As has been pointed ou</a>t on the chemistry blogs, this is actually potentially a pretty serious problem. Apparently, the type of analysis in question is relatively easy to make up or at minimum, there are a lot of <a href="http://simplystatistics.org/2013/07/31/the-researcher-degrees-of-freedom-recipe-tradeoff-in-data-analysis/">researcher degrees of freedom</a>.</p>
<p>This error reminds me of another slip-up, this one from <a href="http://nsaunders.wordpress.com/2012/07/23/we-really-dont-care-what-statistical-method-you-used/">a paper in BMC Bioinformatics</a>. Here is the key bit, from the abstract:</p>
<blockquote>
<p>In this study, we have used (insert statistical method here) to compile unique DNA methylation signatures from normal human heart, lung, and kidney using the</p>
</blockquote>
<p>These slip-ups seem pretty embarrassing/funny at first pass. I will also admit that in some ways, I’m pretty sympathetic as a person who advises students and analysts. The comments on intermediate drafts of papers frequently say things like, “put this analysis here” or “fill in details here”. I think if one slipped through the cracks and ended up in the abstract or supplement of a paper I was publishing, I’d look pretty silly too.</p>
<p>But there are some more important issues here that relate to the issue of analysts/bioinformaticians/computing experts being directed by scientists. In some cases <a href="http://simplystatistics.org/2012/04/27/people-in-positions-of-power-that-dont-understand/">the scientists might not understand statistics</a>, which has its own set of problems. But often the scientists know exactly what they are talking about; the analyst and their advisor/boss just need to communicate about what is acceptable and what isn’t acceptable in practice. This is beautifully covered in this post on advice for <a href="http://biomickwatson.wordpress.com/2013/04/23/a-guide-for-the-lonely-bioinformatician/">lonely bioinformaticians</a>. I would extend that to all students/lonely analysts in any field. Finally, in the era of open science and collaboration, it is pretty clear that it is important to make sure that statements made in the margins of drafts can’t be misinterpreted and to check for typos in final submitted drafts of papers. <strong>Always double check for typos. </strong></p>
Data scientist is just a sexed up word for statistician
2013-08-08T10:28:16+00:00
http://simplystats.github.io/2013/08/08/data-scientist-is-just-a-sexed-up-word-for-statistician
<p>A couple of cool things happened at this year’s JSM.</p>
<ol>
<li>Twitter adoption went way up and it was much easier for people (like me) who weren’t there to keep track of all the action by monitoring the <a href="https://twitter.com/search?q=%23jsm2013&src=typd">#JSM2013</a> hashtag.</li>
<li>Nate Silver gave the keynote and <a href="https://twitter.com/rafalab/status/364480835577073664/photo/1">a huge crowd</a> showed up.</li>
</ol>
<p>Nate Silver is hands down the <a href="http://simplystatistics.org/2013/07/26/statistics-takes-center-stage-in-the-independent/">rockstar of our field</a>. I mean, no other statistician changing jobs would make the news at the Times, at ESPN, and on pretty much every other major news source.</p>
<p>Silver’s talk at JSM focused on 11 principles of statistical journalism, which are <a href="http://blog.revolutionanalytics.com/2013/08/nate-silver-jsm.html">covered really nicely</a> here by Joseph Rickert from Revolution. After his talk, he answered questions tweeted from the audience. He brought the house down (I’m sure in person, but definitely on Twitter) when, asked about data scientists versus statisticians, he gave the perfectly weighted response for the audience:</p>
<blockquote>
<p>Data scientist is just a sexed up word for statistician</p>
</blockquote>
<p>Of course statisticians love to hear this but data scientists didn’t necessarily agree.</p>
<blockquote class="twitter-tweet" width="550">
<p>
Not at <a href="https://twitter.com/hashtag/JSM2013?src=hash">#JSM2013</a>, but intersect of self-ID’ed statisticians w/ self-ID’ed data scis is ~ null. Not sure who’s losing in the “sexed up” dept.
</p>
<p>
— Drew Conway (@drewconway) <a href="https://twitter.com/drewconway/status/364493993117507584">August 5, 2013</a>
</p>
</blockquote>
<blockquote class="twitter-tweet" width="550">
<p>
<a href="https://twitter.com/hspter">@hspter</a> not sure that describes what I do.
</p>
<p>
— josh attenberg (@jattenberg) <a href="https://twitter.com/jattenberg/status/364550740506710016">August 6, 2013</a>
</p>
</blockquote>
<blockquote class="twitter-tweet" width="550">
<p>
<a href="https://twitter.com/jattenberg">@jattenberg</a> <a href="https://twitter.com/hspter">@hspter</a> Me either. <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" />
</p>
<p>
— Hilary Mason (@hmason) <a href="https://twitter.com/hmason/status/364551047445884928">August 6, 2013</a>
</p>
</blockquote>
<p>I’ve talked about the statistician/data scientist divide before and how I think that we need better marketing as statisticians. I think it is telling that some of the very accomplished, very successful people tweeting about Nate’s quote are uncomfortable being labeled statistician. The reason, I think, is that statisticians have a reputation for focusing primarily on theory and not being willing to <a href="http://simplystatistics.org/2012/05/28/schlep-blindness-in-statistics/">do the schlep</a>.</p>
<p>I do think there is some cachet to having the “hot job title” but eventually solving real problems matters more. Which leads me to my favorite part of Nate’s quote, the part that isn’t getting nearly as much play as it should:</p>
<blockquote>
<p>Just do good work and call yourself whatever you want.</p>
</blockquote>
<p>I think that as statisticians we should embrace a <a href="http://simplystatistics.org/2012/08/14/statistics-statisticians-need-better-marketing/">“big tent” approach</a> to labeling. But rather than making it competitive by saying data scientists aren’t that great they are just “sexed up” statisticians, we should make it inclusive, “data scientists are statisticians because being a statistician is awesome and anyone who does cool things with data is a statistician”. People who build websites, or design graphics, or make reproducible documents, or build pipelines, or hack low-level data are all statisticians and we should respect them all for their unique skills.</p>
Simply Statistics #JSM2013 Picks for Wednesday
2013-08-07T08:48:04+00:00
http://simplystats.github.io/2013/08/07/simply-statistics-jsm2013-picks-for-wednesday
<p>Sorry for the delay with my session picks for Wednesday. Here’s what I’m thinking of:</p>
<ul>
<li><span style="line-height: 16px;">8:30-10:20am: <b>Bayesian Methods for Causal Inference in Complex Settings </b>(CC-520a) or <b>Developments in Statistical Methods for Functional and Imaging Data </b>(CC-522bc)</span></li>
<li>10:30am-12:20pm: <strong>Spatial Statistics for Environmental Health Studies</strong> (CC-510c) or <strong>Big Data Exploration with Amazon</strong> (CC-516c)</li>
<li>2-3:50pm: There are some future stars in the session <strong>Environmental Impacts on Public and Ecological Health</strong> (CC-512h) and <strong>Statistical Challenges in Cancer Genomics with Next-Generation Sequencing and Microarrays</strong> (CC-514a)</li>
<li>4-5:50pm: Find out who won the COPSS award! (CC-517ab)</li>
</ul>
Simply Statistics #JSM2013 Picks for Tuesday
2013-08-06T07:40:32+00:00
http://simplystats.github.io/2013/08/06/simply-statistics-jsm2013-picks-for-tuesday
<p>It seems like Monday was a big hit at JSM with Nate Silver’s talk and all. Rafa estimates that there were about <a href="https://twitter.com/rafalab/status/364480835577073664">1 million people there</a> (+/- 1 million). Ramnath Vaidyanathan has a nice summary of the <a href="http://storify.com/ramnathv/jsm-2013?utm_campaign=&utm_source=t.co&utm_content=storify-pingback&awesm=sfy.co_hPE1&utm_medium=sfy.co-twitter">talk</a> and the <a href="https://gist.github.com/ramnathv/975b7d5df642c3804fc5">Q&A</a> afterwards. Among other things, Silver encouraged people to start a blog and communicate directly with the public. Couldn’t agree more! Thanks to all who live-tweeted at #JSM2013. I felt like I was there.</p>
<p>On to Tuesday! Here’s where I’d like to go:</p>
<ul>
<li><span style="line-height: 16px;">8:30-10:20am: <b>Spatial Uncertainty in Public Health Problems </b>(CC-513b); and since Nate says education is the next important area, <b>Statistical Knowledge for Teaching: Research Results and Implications for Professional Development</b> (CC-520d)</span></li>
<li>10:30am-12:20pm: Check out the latest in causal inference at <strong>Fresh Perspectives on Causal Inference</strong> (CC-512f) and come see the future of statistics at the <strong>SBSS Student Paper Travel Award Winners II</strong> (CC-520d)</li>
<li>2-3:50pm: There’s a cast of all-stars over in the <strong>Biased Epidemiological Study Designs: Opportunities and Challenges</strong> (CC-511c) session and a visualization session with an interesting premise <strong>Painting a Picture of Life in the United States</strong> (CC-510a)</li>
<li>4-5:50pm: Only two choices here, so take your pick (or flip a coin).</li>
</ul>
Simply Statistics #JSM2013 Picks for Monday
2013-08-05T07:55:48+00:00
http://simplystats.github.io/2013/08/05/simply-statistics-jsm2013-picks-for-monday
<p>I’m sadly not able to attend the Joint Statistical Meetings this year (where Nate Silver is the keynote speaker!) in the great city of Montreal. I’m looking forward to checking out the chatter on #JSM2013 but in the meantime, here are the sessions I would have attended if I’d been there. If I pick more than one session for a given time slot, I assume you can run back and forth between the two.</p>
<ul>
<li>8:30-10:20am: Kasper Hansen is presenting in <strong>Statistical Methods for High-Dimensional Data: Presentations by Junior Researchers</strong> (CC-515c) and there are some great people in <strong>The Profession of Statistics and Its Impact on the Media</strong> (CC-516d)</li>
<li>10:30am-12:20pm: There are some heavy hitters in the <strong>Showcase of Analysis of Correlated Measurements</strong> (CC-511d); this session has a great title: <strong>Herd Immunity: Teaching Techniques for the Health Sciences</strong> (CC-515b)</li>
<li>2-3:50pm: I have a soft spot in my heart for a good MCMC session like <strong>Challenges in Using Markov Chain Monte Carlo in Modern Applications</strong> (CC-510d); I also have a soft spot for visualization and Simon Urbanek - <strong>Visualizing Big Data Interactively </strong>(CC-510b)</li>
<li>4-5:50pm: I would check out Nate Silver’s talk (CC-517ab)</li>
</ul>
<p>Have fun!</p>
Sunday data/statistics link roundup (8/4/13)
2013-08-04T11:58:53+00:00
http://simplystats.github.io/2013/08/04/sunday-datastatistics-link-roundup-8413
<ol>
<li><a href="http://m.us.wsj.com/articles/a/SB10001424127887324635904578639780253571520?mg=reno64-wsj">The $4 million teacher</a>. I love the idea that teaching is becoming a competitive industry where the best will get the kind of pay they really really deserve. I can’t think of another profession where the ratio of (if you are good at how much influence you have on the world)/(salary) is so incredibly large. <a href="http://marginalrevolution.com/marginalrevolution/2013/08/competition-in-higher-education-continues-to-grow.html">MOOC’s may contribute</a> to this, that is if they aren’t felled by the <a href="http://simplystatistics.org/2013/07/19/the-failure-of-moocs-and-the-ecological-fallacy/">ecological fallacy</a> (via Alex N.).</li>
<li>The NIH is considering <a href="http://www.nature.com/news/nih-mulls-rules-for-validating-key-results-1.13469">requiring replication of results</a> (via Rafa). Interestingly, the article talks about <a href="http://simplystatistics.org/2012/04/18/replication-psychology-and-big-science/">reproducibility, as opposed to replication</a>, throughout most of the text.</li>
<li><a href="http://www.r-bloggers.com/demand-for-r-jobs-on-the-rise-while-sas-jobs-decline/">R jobs on the rise</a>! Pair that with this <a href="http://www.r-bloggers.com/statisticians-an-endangered-species/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+RBloggers+%28R+bloggers%29">rather intense critique</a> of Marie Davidian’s interview about big data because she didn’t mention R. I think R/software development is definitely coming into its own as a critical part of any statistician’s toolbox. As that happens we need to take more and more care to include relevant training in version control, software development, and documentation for our students.</li>
<li>Not technically statistics, but holy crap a <a href="http://www.oddly-even.com/2013/07/31/the-largest-photo-ever-taken-of-tokyo-is-zoomable-and-it-is-glorious/">600,000 megapixel picture</a>?</li>
<li>A short <a href="http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/2/">history of data science</a>. Not many card-carrying statisticians make it into the history, which is a shame, given <a href="http://www.huffingtonpost.com/american-statistical-association/statistical-thinking-the-bedrock-of-data-science_b_3651121.html">all the good</a> they have contributed to the development of the foundations of this exciting discipline (via Rafa).</li>
<li>For those of you at JSM 2013, make sure you wear out that hashtag (<a href="https://twitter.com/search?q=%23jsm2013&src=typd">#JSM2013</a>) for those of us on the outside looking in. Watch out for the <a href="http://notstatschat.tumblr.com/post/57329870050/some-failure-modes-of-statistics-research-talks">Lumley 12</a> and make sure you check out <a href="http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208498">Shirley’s talk</a>, <a href="http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208765">Lumley and Hadley together</a>, this interesting looking <a href="http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208599">ethics session</a>, and Martin <a href="http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208786">doing his fMRI thang</a>, among others….</li>
</ol>
That causal inference came out of nowhere
2013-08-02T09:43:52+00:00
http://simplystats.github.io/2013/08/02/that-causal-inference-came-out-of-nowhere
<p><a href="http://archpedi.jamanetwork.com/article.aspx?articleid=1720224">This</a> is a study of breastfeeding and its impact on IQ that has been making the rounds on a number of different media outlets. I first saw it on the <a href="http://online.wsj.com/article/SB10001424127887324809004578635783141433600.html">Wall Street Journal</a> where I was immediately drawn to this quote:</p>
<blockquote>
<p>They then subtracted those factors using a statistical model. Dr. Belfort said she hopes that “what we have left is the true connection” with nursing and IQ.</p>
</blockquote>
<p>As the father of a young child this was of course pretty interesting to me so I thought I’d go and <a href="http://archpedi.jamanetwork.com/article.aspx?articleid=1720224">check out the paper</a> itself. I was pretty stunned to see this line right there in the conclusions:</p>
<blockquote>
<p>Our results support a causal relationship of breastfeeding duration with receptive language and verbal and nonverbal intelligence later in life.</p>
</blockquote>
<p>I immediately thought: “man, how did they run a clinical trial of breastfeeding?”. It seems like it would be a huge challenge to get past the IRB. So then I read a little bit more carefully how they performed the analysis. It was a prospective study, where they followed the children over time, then performed a linear regression analysis to adjust for a number of other factors that might influence childhood intelligence. Some examples include mother’s IQ, socio-demographic information, and questionnaires about delivery.</p>
<p>They then fit a number of regression models with different combinations of covariates and outcomes. They did not attempt to perform any sort of causal inference to make up for the fact that the study was not randomized. Moreover, they did not perform multiple hypothesis testing correction for all of the combinations of effects they observed. The actual reported connections represent just a small fraction of all the possible connections they tested.</p>
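<p>To see why adjusting for measured covariates is not the same as establishing causation, here is a minimal simulated sketch (my own toy example with made-up variable names, not the paper’s data or analysis): when a confounder is left out of the model, the adjusted coefficient can look “significant” even when the true causal effect is zero.</p>
<pre><code class="language-r"># Toy simulation (not the paper's analysis): an unmeasured confounder drives both
# breastfeeding duration and child IQ, so the adjusted coefficient is still biased.
set.seed(42)
n <- 1000
mom_iq     <- rnorm(n, mean = 100, sd = 15)                  # measured confounder
home_env   <- rnorm(n)                                       # unmeasured confounder
breastfeed <- 3 + 0.02 * mom_iq + 0.5 * home_env + rnorm(n)  # months of breastfeeding
child_iq   <- 60 + 0.35 * mom_iq + 3 * home_env + rnorm(n, sd = 8)
# note: breastfeeding has NO direct effect on child_iq in this simulation

fit <- lm(child_iq ~ breastfeed + mom_iq)  # adjust only for what was measured
summary(fit)$coefficients["breastfeed", ]
# the coefficient is positive and "significant" even though the true causal
# effect is zero -- adjustment only removes the confounders you actually measure
</code></pre>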
<p>So I was pretty surprised when they said:</p>
<blockquote>
<p>In summary, our results support a causal relationship of breastfeeding in infancy with receptive language at age 3 and with verbal and nonverbal IQ at school age.</p>
</blockquote>
<p style="text-align: left;">
<a href="http://simplystatistics.org/2013/07/15/yes-clinical-trials-work/">I'm</a> as <a href="http://simplystatistics.org/2013/05/06/why-the-current-over-pessimism-about-science-is-the-perfect-confirmation-bias-vehicle-and-we-should-proceed-rationally/">optimistic</a> as <a href="http://arxiv.org/abs/1301.3718">science</a> as they come. But where did that causal inference come from?
</p>
The ROC curves of science
2013-08-01T10:21:33+00:00
http://simplystats.github.io/2013/08/01/the-roc-curves-of-science
<p>Andrew Gelman’s <a href="http://andrewgelman.com/2013/07/24/too-good-to-be-true-the-scientific-mass-production-of-spurious-statistical-significance/">recent post</a> on what he calls the “scientific mass production of spurious statistical significance” reminded me of a thought I had back when I read John Ioannidis’ paper claiming that <a href="http://simplystatistics.org/2013/05/06/why-the-current-over-pessimism-about-science-is-the-perfect-confirmation-bias-vehicle-and-we-should-proceed-rationally/">most published research findings are false</a>. Many authors, whom I will refer to as <em>the pessimists</em>, have joined Ioannidis in making similar claims and repeatedly blaming the current state of affairs on the mindless use of frequentist inference. The gist of my thought is that, for some scientific fields, the pessimists’ criticism misses a critical point: in practice, driving down the rate of false discoveries also drives down the rate of true discoveries, and true discoveries from fields such as the biomedical sciences provide an enormous benefit to society. Before I explain this in more detail, I want to be very clear that I do think that reducing false discoveries is an important endeavor and that some of these false discoveries are completely avoidable. But, as I describe below, a general solution that improves the current situation is much more complicated than simply abandoning the frequentist inference that currently dominates.</p>
<p>Few will deny that our current system, with all its flaws, still produces important discoveries. Many of the pessimists’ proposals for reducing false positives seem to be, in one way or another, a call for being more conservative in reporting findings. Examples of recommendations include that we require larger effect sizes or smaller p-values, that we correct for the “<a href="http://www.slate.com/articles/health_and_science/science/2013/07/statistics_and_psychology_multiple_comparisons_give_spurious_results.html">researcher degrees of freedom</a>”, and that we use Bayesian analyses with pessimistic priors. I tend to agree with many of these recommendations but I have yet to see a specific proposal on exactly how conservative we should be. Note that we could easily bring the false positives all the way down to 0 by simply taking this recommendation to its extreme and stopping the publication of biomedical research results altogether. This absurd proposal brings me to receiver operating characteristic (ROC) curves.</p>
<p><a href="http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/slide1-2/" rel="attachment wp-att-1627"><img class="alignnone size-full wp-image-1627" alt="Slide1" src="http://simplystatistics.org/wp-content/uploads/2013/07/Slide11.png" width="515" height="427" /></a></p>
<p>ROC curves plot true positive rates (TPR) versus false positive rates (FPR) for a given classifying procedure. For example, suppose a regulatory agency that runs randomized trials on drugs (e.g. FDA) classifies a drug as effective when a pre-determined statistical test produces a p-value < 0.05 or a posterior probability > 0.95. This procedure will have a historical false positive rate and true positive rate pair: one point in an ROC curve. We can change the 0.05 to, say, 0.2 (or the 0.95 to 0.80) and we would move up the ROC curve: higher FPR and TPR. Not doing research would put us at the useless bottom left corner. It is important to keep in mind that biomedical science is done by imperfect humans on imperfect and stochastic measurements so to make discoveries the field has to tolerate some false discoveries (ROC curves don’t shoot straight up from 0% to 100%). Also note that it can take years to figure out which publications report important true discoveries.</p>
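<p>To make the threshold idea concrete, here is a small simulation of my own (all numbers invented for illustration): a collection of null and non-null hypotheses, with the (FPR, TPR) pair traced out as the p-value cutoff is loosened.</p>
<pre><code class="language-r"># Illustrative only: simulate a mix of true nulls and real effects and compute
# the (FPR, TPR) point implied by different p-value cutoffs.
set.seed(1)
n_null <- 9000; n_real <- 1000                     # assumed mix of hypotheses
p_null <- runif(n_null)                            # p-values under the null
p_real <- pnorm(rnorm(n_real, mean = 2.8), lower.tail = FALSE)  # modest power
pvals  <- c(p_null, p_real)
truth  <- c(rep(FALSE, n_null), rep(TRUE, n_real))

for (cutoff in c(0.001, 0.01, 0.05, 0.2)) {
  fpr <- mean(pvals[!truth] < cutoff)
  tpr <- mean(pvals[truth] < cutoff)
  cat(sprintf("cutoff %.3f -> FPR %.3f, TPR %.3f\n", cutoff, fpr, tpr))
}
# loosening the cutoff moves you up and to the right along the same curve;
# only better measurements, designs, and training move the whole curve up
</code></pre>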
<p>I am going to use the concept of an ROC curve to distinguish between reducing FPR by being statistically more conservative and reducing FPR via more general improvements. In my ROC curve the y-axis represents the number of important discoveries per decade and the x-axis the number of false positives per decade (to avoid confusion I will continue to use the acronyms TPR and FPR). The current state of biomedical research is represented by one point on the red curve: one (TPR, FPR) pair. The pessimists argue that the FPR is close to 100% of all results but they rarely comment on the TPR. Being more conservative lowers our FPR, which saves us time and money, but it also lowers our TPR, which could reduce the number of important discoveries that improve human health. So what is the optimal balance and how far are we from it? I don’t think this is an easy question to answer.</p>
<p>Now, one thing we can all agree on is that moving the ROC curve up is a good thing, since it means that we get a higher TPR for any given FPR. Examples of ways we can achieve this are developing better measurement technologies, statistically improving the quality of these measurements, augmenting the statistical training of researchers, thinking harder about the hypotheses we test, and making fewer coding or experimental mistakes. However, applying a more conservative procedure does not move the ROC up; it moves our point left on the existing ROC: we reduce our FPR but reduce our TPR as well.</p>
<p>In the plot above I draw two imagined ROC curves: one for physics and one for biomedical research. The physicists’ curve looks great. Note that it shoots up really fast which means they can make most available discoveries with very few false positives. Perhaps due to the maturity of the field, physicists can afford and tend to use <a href="http://www.guardian.co.uk/science/2012/jul/04/higgs-boson-cern-scientists-discover">very stringent criteria</a>. The biomedical research curve does not look as good. This is mainly due to the fact that biology is way more complex and harder to model mathematically than physics. However, because there is a larger uncharted territory and more research funding, I argue that the rate of discoveries is higher in biomedical research than in physics. But, to achieve this higher TPR, biomedical research has to tolerate a higher FPR. According to my imaginary ROC curves, if we become as stringent as physicists our TPR would be five times smaller. It is not obvious to me that this would result in a better situation than the current one. At the same time, note that the red ROC suggests that increasing the FPR, with the hopes of increasing our TPR, is not a good idea because the curve is quite flat beyond our current location on the curve.</p>
<p>Clearly I am oversimplifying a very complicated issue, but I think it is important to point out that there are two discussions to be had: 1) where should we be on the ROC curve (keeping in mind the relationship between FPR and TPR)? and 2) what can we do to improve the ROC curve? My own view is that we can probably move down the ROC curve some and reduce the FPR without much loss in TPR (for example, by raising awareness of the <a href="http://www.slate.com/articles/health_and_science/science/2013/07/statistics_and_psychology_multiple_comparisons_give_spurious_results.html">researcher degrees of freedom</a>). But I also think that most of our efforts should go to reducing the FPR by improving the ROC. In general, I think statisticians can add to the conversation about 1) while at the same time continuing to collaborate to move the red ROC curve up.</p>
The researcher degrees of freedom - recipe tradeoff in data analysis
2013-07-31T10:25:34+00:00
http://simplystats.github.io/2013/07/31/the-researcher-degrees-of-freedom-recipe-tradeoff-in-data-analysis
<p>An important concept that is only recently gaining the <a href="http://andrewgelman.com/2012/11/01/researcher-degrees-of-freedom/">attention</a> <a href="http://theness.com/neurologicablog/index.php/publishing-false-positives/">it</a> <a href="http://duncanlaw.wordpress.com/2012/04/09/researcher-degrees-of-freedom/">deserves</a> is researcher degrees of freedom. From <a href="http://people.psych.cornell.edu/~jec7/pcd%20pubs/simmonsetal11.pdf">Simmons et al</a>.:</p>
<blockquote>
<p>The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?</p>
</blockquote>
<p>So far, researcher degrees of freedom has primarily been used with <a href="http://www.slate.com/articles/health_and_science/science/2013/07/statistics_and_psychology_multiple_comparisons_give_spurious_results.html">negative connotations</a>. This probably stems from the original definition of the idea which focused on how analysts could “manufacture” statistical significance by changing the way the data was processed without disclosing those changes. Reproducible research and distributed code would of course address these issues to some extent. But it is still relatively easy to obfuscate dubious analysis by <a href="http://petewarden.com/2013/07/18/why-you-should-never-trust-a-data-scientist/">dressing it up in technical language</a>.</p>
<p>One interesting point that I think sometimes gets lost in all of this is the researcher degrees of freedom - recipe tradeoff. You could think of this as the <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">bias-variance tradeoff</a> for big data.</p>
<p>At one end of the scale you can allow the data analyst full freedom, in which case researcher degrees of freedom may lead to overfitting and open yourself up to the manufacture of statistical results (optimistic significance or point estimates or confidence intervals). Or you can <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">require a recipe</a> for every data analysis which means that it isn’t possible to adapt to the unanticipated quirks (missing data mechanism, outliers, etc.) that may be present in an individual data set.</p>
<p>As with the bias-variance tradeoff, the optimal approach probably depends on your optimality criteria. You could imagine fitting a linear model that minimizes the mean squared error without constraining the analyst’s degrees of freedom in any way (that might represent an analysis where the researcher tries all possible models, including all types of data munging, choices of which observations to drop, how to handle outliers, etc.) to get the absolute best fit. Of course, this would likely be a strongly overfit/biased model. Alternatively you could penalize the flexibility allowed to the analyst. For example, you could minimize a weighted criterion like:</p>
<p style="text-align:center;">
<span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_e18e41b63161ab4018790b295f7fb05d.gif" style="vertical-align: middle; border: none;" class="tex" alt=" \sum_{i=1}^n (y_i - b_0 x_{i1} + b_1 x_{i2})^2 + Researcher \; Penalty(\vec{y},\vec{x})" /></span>
</p>
<p>Some examples of the penalties could be:</p>
<ul>
<li>$\lambda \times \sum_{i=1}^n 1_{\{\text{researcher dropped } (y_i, x_i) \text{ from analysis}\}}$</li>
<li>$\lambda \times \#\{\text{transforms tried}\}$</li>
<li>$\lambda \times \#\{\text{outliers removed ad hoc}\}$</li>
</ul>
<p>You could also combine all of the penalties together into the “elastic researcher net” type approach. Then as the collective penalty $\lambda \rightarrow \infty$ you get the <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">DSM</a>, like you have in a clinical trial for example. As $\lambda \rightarrow 0$ you get fully flexible data analysis, which you might want for discovery.</p>
<p>Of course if you allow researchers to choose the penalty you are right back to a scenario where you have degrees of freedom in the analysis (the problem you always get with any penalized approach). On the other hand it would make it easier to disclose how those degrees of freedom were applied.</p>
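<p>Just to make the idea concrete, here is a toy sketch (entirely hypothetical, not a proposal for a real method) of what charging the analyst for each discretionary choice might look like:</p>
<pre><code class="language-r"># Toy illustration: a fit criterion that adds a penalty for each analyst choice.
researcher_penalized_rss <- function(fit, n_dropped = 0, n_transforms_tried = 0,
                                     n_outliers_removed = 0, lambda = 10) {
  rss <- sum(residuals(fit)^2)
  penalty <- lambda * (n_dropped + n_transforms_tried + n_outliers_removed)
  rss + penalty
}

fit <- lm(mpg ~ wt, data = mtcars)
researcher_penalized_rss(fit)                                        # hands-off analysis
researcher_penalized_rss(fit, n_dropped = 2, n_transforms_tried = 3) # pays for its choices
# as lambda grows you approach a fixed recipe; as lambda shrinks you get full flexibility
</code></pre>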
Sunday data/statistics link roundup (7/28/13)
2013-07-28T10:53:56+00:00
http://simplystats.github.io/2013/07/28/sunday-datastatistics-link-roundup-72813
<ol>
<li><span style="line-height: 16px;"><a href="http://www.huffingtonpost.com/2013/07/23/women-in-physics-statistics-hiring-bias-female-faculty_n_3635710.html">An article</a> in the Huffpo about a report claiming there is no gender bias in the hiring of physics faculty. I didn’t read the paper carefully but I definitely agree with the quote from Prof. Dame Athene Donald that the comparison should be made to the number of faculty candidates on the market. I’d also be a little careful about touting my record of gender equality if only 13% of faculty in my discipline were women (via Alex N.).</span></li>
<li>If you are the only person who hasn’t seen the upwardly mobile by geography article yet, <a href="http://www.nytimes.com/2013/07/22/business/in-climbing-income-ladder-location-matters.html?hp&_r=1&">here it is</a> (via Rafa). Also covered over at the great “<a href="http://chartsnthings.tumblr.com/post/56193905994/winning">charts n things</a>” blog.</li>
<li>Finally <a href="http://news.sciencemag.org/scientific-community/2013/07/senate-panel-gives-nsf-8-budget-boost">some good news</a> on the science funding front; a Senate panel raises NSF’s budget by 8% (the link worked for me earlier but I was having a little trouble today). I think that this is of course a positive development. I think that article pairs very well with <a href="http://www.businessinsider.com/a-private-university-might-have-saved-detroit-2013-7">this provocative piece</a> suggesting Detroit might have done better if they had a private research school.</li>
<li>I’m probably going to talk about this more later in the week because it gets my blood pressure up, but I thought I’d just say again that <a href="http://andrewgelman.com/2013/07/24/too-good-to-be-true-the-scientific-mass-production-of-spurious-statistical-significance/">hyperbolic takedowns</a> of the statistical methods in specific papers in the popular press <a href="http://simplystatistics.org/2013/05/06/why-the-current-over-pessimism-about-science-is-the-perfect-confirmation-bias-vehicle-and-we-should-proceed-rationally/">lead in only one direction</a>.</li>
</ol>
Statistics takes center stage in the Independent
2013-07-26T16:18:11+00:00
http://simplystats.github.io/2013/07/26/statistics-takes-center-stage-in-the-independent
<p>Check out <a href="http://www.independent.co.uk/news/world/americas/heroes-of-zeroes-nate-silver-his-rivals-and-the-big-electoral-data-revolution-8734380.html">this really good piece</a> over at the Independent. It talks about the rise of statisticians as rockstars, naming Hans Rosling, Nate Silver, and Chris Volinsky among others. I think that those guys are great and deserve all the attention they get.</p>
<p>I only hope that more of the superstars that fly under the radar of the general public but have made huge contributions to science/medicine (like Ross Prentice, Terry Speed, Scott Zeger, or others that were highlighted in the comments <a href="http://simplystatistics.org/2013/07/17/name-5-statisticians-now-name-5-young-statisticians/">here</a>) get the same kind of attention (although I suspect they might not want it).</p>
<p>I think one of the best parts of the article (which you should read in its entirety) is Marie Davidian’s quote:</p>
<blockquote>
<p>There are rock stars, and then there are rock bands: statisticians frequently work in teams</p>
</blockquote>
What are the 5 most influential statistics papers of 2000-2010?
2013-07-22T10:52:45+00:00
http://simplystats.github.io/2013/07/22/what-are-the-5-most-influential-statistics-papers-of-2000-2010
<p>A few folks here at Hopkins were just reading the comments of our post on <a href="http://simplystatistics.org/2013/07/17/name-5-statisticians-now-name-5-young-statisticians/">awesome young/senior statisticians</a>. It was cool to see the diversity of opinions and all the impressive people working in our field. We realized that another question we didn’t have a great answer to was:</p>
<blockquote>
<p>What are the 5 most influential statistics papers of the aughts (2000-2010)?</p>
</blockquote>
<p>Now that the auggies or aughts or whatever are a few years behind us, we have the benefit of a little hindsight and can get a reasonable measure of retrospective impact.</p>
<p>Since this is a pretty broad question I’d thought I’d lay down some generic ground rules for nominations:</p>
<ol>
<li>Papers must have been published in 2000-2010.</li>
<li>Papers must primarily report a statistical method or analysis (the impact shouldn’t be only because of the scientific result).</li>
<li>Papers may be published in either statistical or applied journals.</li>
</ol>
<p>For extra credit, along with your list give your definition of impact. Mine would be something like:</p>
<ul>
<li>Has been cited at a high rate in scientific papers (in other words, it is used by science, not just cited by statisticians trying to beat it)</li>
<li>Has corresponding software that has been used</li>
<li>Made simpler/changed the way we did a specific type of analysis</li>
</ul>
<p>I don’t have my list yet (I know, a cop-out) but I’m working on it.</p>
Sunday data/statistics link roundup (7/21/2013)
2013-07-21T20:23:56+00:00
http://simplystats.github.io/2013/07/21/sunday-datastatistics-link-roundup-7212013
<ol>
<li><a href="http://www.nytimes.com/2013/07/21/opinion/sunday/lets-shake-up-the-social-sciences.html?hp&_r=2&">Let’s shake up the social sciences</a> is a piece in the New York Times by Nicholas Christakis who rose to fame by claiming that <a href="http://www.nejm.org/doi/full/10.1056/NEJMsa066082">obesity is contagious</a>. <a href="http://andrewgelman.com/2013/07/21/defensive-political-science-responds-defensively-to-an-attack-on-social-science/">Gelman responds</a> that he thinks maybe Christakis got a little ahead of himself. I’m going to stay out of this one as it is all pretty far outside my realm - but I will say that I think quantitative social sciences is a hot area and all hot areas bring both interesting new results and hype. You just have to figure out which is which (via Rafa).</li>
<li><a href="http://www.aclu.org/blog/technology-and-liberty-national-security/police-documents-license-plate-scanners-reveal-mass">This</a> is both creepy and proves <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">my point</a> about the ubiquity of data. Basically police departments are storing tons of information about where we drive because, well, it is easy to do so why not?</li>
<li>I mean, I’m not an actuary and I don’t run cities, but <a href="http://dealbook.nytimes.com/2013/07/19/detroit-gap-reveals-industry-dispute-on-pension-math/?hp">this</a> strikes me as a little insane. How do you not just keep track of all the pensions you owe people and add them up to know your total obligation? Why predict it when you could actually just collect the data? Maybe an economist can explain this one to me. (via Andrew J.)</li>
<li><a href="http://www.nytimes.com/2013/07/19/opinion/in-defense-of-clinical-drug-trials.html?src=recg&gwh=9D33ABD1323113EF3AC9C48210900171">The New York Times</a> reverse scoops <a href="http://simplystatistics.org/2013/07/15/yes-clinical-trials-work/">our clinical trials post</a>! In all seriousness, there are a lot of nice responses there to the original article.</li>
<li><a href="http://touch.baltimoresun.com/#section/-1/article/p2p-76681838/">JH Hospital back to #1</a>. Order is restored. Read <a href="http://simplystatistics.org/2012/07/18/a-closer-look-at-data-suggests-johns-hopkins-is-still/">our analysis</a> of Hopkins ignominious drop to #2 last year (via Sherri R.).</li>
</ol>
The "failure" of MOOCs and the ecological fallacy
2013-07-19T10:52:41+00:00
http://simplystats.github.io/2013/07/19/the-failure-of-moocs-and-the-ecological-fallacy
<p>At first blush <a href="http://www.sfgate.com/news/article/San-Jose-State-suspends-online-courses-4672870.php">the news out of San Jose State</a> that the partnership with Udacity is being temporarily suspended is bad news for MOOCs. It is particularly bad news since the main reason for the suspension is poor student performance on exams. I think in the PR game there is certainly some reason to be disappointed in the failure of this first big experiment, but as someone who loves the idea of high-throughput education, I think that this is primarily a good learning experience.</p>
<p>The money quote in my mind is:</p>
<blockquote>
<p>Officials say the data suggests many of the students had little college experience or held jobs while attending classes. Both populations traditionally struggle with college courses.</p>
<p>“We had this element that we picked, student populations who were not likely to succeed,” Thrun said.</p>
</blockquote>
<p>I think it was a really nice idea to try to expand educational opportunities to students who traditionally don’t have time for college or have struggled with college. But this represents a pretty major confounder in the analysis comparing educational outcomes between students in the online and in-person classes. There is a lot of room for the <a href="http://en.wikipedia.org/wiki/Ecological_fallacy">ecological fallacy</a> to make it look like online classes are failing. They could very easily address this problem by using a subset of students randomized in the right way. There are even really good papers - <a href="http://scholar.harvard.edu/aglynn/publications/alleviating-linear-ecological-bias-and-optimal-design-subsample-data">like this one by Glynn</a> - on the optimal way to do this.</p>
<p>I think there are some potential lessons learned here from this PR problem:</p>
<ol>
<li><span style="line-height: 15.994318008422852px;"><strong>We need good study design in high-throughput education</strong>. I don’t know how rigorous the study design was in the case of the San Jose State experiment, but if the comparison is just whoever signed up in class versus whoever signed up online we have a long way to go in evaluating these classes.<br /> </span></li>
<li><strong>We need coherent programs online</strong>. It looks like they offered a scattered collection of mostly lower level courses online (elementary statistics, college algebra, entry level math, introduction to programming and introduction to psychology). These courses are obvious ones for picking off with MOOCs since they are usually large lecture-style courses in person as well. But they are also hard classes to “get motivated for” if there isn’t a clear end goal in mind. If you are learning college algebra online but don’t have a clear path to using that education it might make more sense to start with the <a href="https://www.khanacademy.org/math/algebra">Khan Academy</a>.</li>
<li><strong>We need to parse variation in educational attainment</strong>. It makes sense to evaluate in-class and online students with similar instruments. But I wonder if there is a way to estimate the components of variation: motivation, prior skill, time dedicated to the course, learning from course materials, learning from course discussion, and learning for different types of knowledge (e.g. vocational versus theoretical) using statistical models. I think that kind of modeling would offer a much clearer picture of whether these programs are “working” (a rough sketch of what such a model might look like appears after this list).</li>
</ol>
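<p>Here is the kind of decomposition I have in mind, as a rough sketch only (simulated data and hypothetical variable names; a real evaluation would need a carefully designed study):</p>
<pre><code class="language-r"># Toy example: simulate exam scores and see how much variation is explained by
# student characteristics versus course format after adjustment.
set.seed(7)
n <- 500
prior_skill <- rnorm(n)                  # e.g., placement test score (standardized)
hours_week  <- rexp(n, rate = 1/5)       # time dedicated to the course
online      <- rbinom(n, 1, 0.5)         # 1 = online section, 0 = in person
score <- 70 + 5 * prior_skill + 0.8 * hours_week + 0 * online + rnorm(n, sd = 6)

fit <- lm(score ~ online + prior_skill + hours_week)
anova(fit)                               # variation attributable to each component
coef(summary(fit))["online", ]           # the format "effect" after adjustment
</code></pre>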
Defending clinical trials
2013-07-19T08:16:47+00:00
http://simplystats.github.io/2013/07/19/defending-clinical-trials
<p>The New York Times has published some <a href="http://www.nytimes.com/2013/07/19/opinion/in-defense-of-clinical-drug-trials.html?src=recg">letters to the Editor</a> in response to the piece by Clifton Leaf on clinical trials. You can also see <a href="http://simplystatistics.org/2013/07/15/yes-clinical-trials-work/">our response here</a>.</p>
Name 5 statisticians, now name 5 young statisticians
2013-07-17T11:31:51+00:00
http://simplystats.github.io/2013/07/17/name-5-statisticians-now-name-5-young-statisticians
<p>I have been thinking for a while how hard it is to find statisticians to interview for the blog. When I started the interview series, it was targeted at interviewing statisticians at the early stages of their careers. It is relatively easy, if you work in academic statistics, to name 5 famous statisticians. If you asked me to do that, I’d probably say something like: Efron, Tibshirani, Irizarry, Prentice, and Storey. I could also name 5 famous statisticians in industry with relative ease: Mason, Volinsky, Heineike, Patil, Conway.</p>
<p>Most of that is because of where I went to school (Storey/Prentice), the area I work in (Tibshirani/Irizarry/Storey), my advisor (Storey), or the bootstrap (Efron) and the people I see on Twitter (all the industry folks). I could, of course, name a lot of other famous statisticians. Almost all of them biased by my education or the books I read.</p>
<p>But almost surely I will miss people who work outside my area or didn’t go to school where I did. This is particularly true in applied statistics, where people might not even spend most of their time in statistics departments. It is doubly true of people who are young and just getting started, as I haven’t had a chance to hear about them.</p>
<p>So if you have a few minutes in the comments name five statisticians you admire. Then name five junior statisticians you think will be awesome. They don’t have to be famous (in fact it is better if they are good but <em>not</em> famous so I can learn something). Plus it will be interesting to see the responses.</p>
Yes, Clinical Trials Work
2013-07-15T11:20:06+00:00
http://simplystats.github.io/2013/07/15/yes-clinical-trials-work
<p>This Saturday the New York Times published an opinion piece wondering “<a style="font-size: 16px;" href="http://www.nytimes.com/2013/07/14/opinion/sunday/do-clinical-trials-work.html?pagewanted=all&_r=0">do clinical trials work?</a>”. The answer, of course, is: absolutely. For those that don’t know the history, randomized controlled trials (RCTs) are one of the reasons why life spans skyrocketed in the 20th century. Before RCTs, wishful thinking and arrogance led numerous well-meaning scientists and doctors to incorrectly believe their treatments worked. RCTs are so successful that they have been adopted with much fanfare in far-flung arenas like poverty alleviation (see, e.g., this discussion by <a style="font-size: 16px;" href="http://www.effectivephilanthropy.org/blog/2011/06/esther-duflo-explains-why-she-believes-randomized-controlled-trials-are-so-vital/">Esther Duflo</a>), where wishful thinking also led many to incorrectly believe their interventions helped.</p>
<p>The first chapter of <a href="http://www.amazon.com/Statistics-4th-Edition-David-Freedman/dp/0393929728">this book</a> contains several examples and <a href="http://clinicaltrials.gov/ct2/about-studies/learn">this is a really nice introduction</a> to clinical studies. A very common problem was that the developers of the treatment would create treatment groups that were healthier to start with. Randomization takes care of this. To understand the importance of controls I quote the opinion piece to demonstrate a common mistake we humans make: “Some patients did do better on the drug, and indeed, doctors and patients insist that some who take Avastin significantly beat the average.” The problem is that the fact that Avastin did not do better on average means that the exact same statement can be made about the control group! It also means that some patients did worse than average too. The use of a control points to the possibility that Avastin has nothing to do with the observed improvements.</p>
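<p>A quick simulated illustration of that fallacy (my own made-up numbers, nothing to do with the Avastin data): even when a drug does nothing at all, some treated patients will “significantly beat the average”, and so will some controls.</p>
<pre><code class="language-r"># Hypothetical outcomes where the drug has no effect whatsoever
set.seed(2013)
treated <- rnorm(1000, mean = 12, sd = 5)
control <- rnorm(1000, mean = 12, sd = 5)
overall <- mean(c(treated, control))
mean(treated > overall)   # about half the treated patients beat the average...
mean(control > overall)   # ...and so do about half the controls
</code></pre>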
<p>The opinion piece is very critical of current clinical trials work and complains about the “dismal success rate for drug development”. But what is the author comparing to? Dismal compared to what? We are talking about developing complicated compounds that must be both safe and efficacious in often critically ill populations. It would be surprising if our success rate was incredibly high. Or is the author comparing the current state of affairs to the pre-clinical-trials days when procedures such as <a style="font-size: 16px;" href="http://en.wikipedia.org/wiki/Bloodletting">bloodletting</a> were popular?</p>
<p>A better question might be, “how can we make clinical trials more efficient?” There is a lively and ongoing research effort devoted to answering this question. In some cases trials can definitely be made more efficient by adapting to new developments such as biomarkers and the advent of personalized medicine. This is why there are dozens of statisticians working in this area.</p>
<p>The article says that</p>
<blockquote>
<p>“[p]art of the novelty lies in a statistical technique called Bayesian analysis that lets doctors quickly glean information about which therapies are working best. “</p>
</blockquote>
<p>As <a style="font-size: 16px;" href="http://simplystatistics.org/2013/07/14/sunday-datastatistics-link-roundup-7142013/">Jeff pointed out</a> this is a pretty major oversimplification of all of the hard work that it takes to maintain scientific integrity and patient safety when studying new compounds. The fact that the analysis is Bayesian is ancillary to other issues like <a href="http://www.trialsjournal.com/content/13/1/145">adaptive trials</a> (as Julian <a href="http://simplystatistics.org/2013/07/14/sunday-datastatistics-link-roundup-7142013/#comment-962395470">pointed out in the comments)</a>, <a href="http://en.wikipedia.org/wiki/Dynamic_treatment_regime">dynamic treatment regimes</a>, or even more established ideas <a href="http://en.wikipedia.org/wiki/Sequential_analysis">like group sequential trials</a>. The basic principle underlying these ideas is the same: <em>can we run a trial more efficiently while achieving reasonable estimates of effect sizes and uncertainties?</em> You could imagine doing this by focusing on treatments that seem to work well for subpopulations with specific biomarkers, by stopping trials early if drugs are strongly (in)effective, or by picking optimal paths through multiple treatments. Whether the statistical methodology is Bayesian or frequentist has little to do with the ways that clinical trials are adapting to be more efficient.</p>
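<p>To give a flavor of the “stop early if the drug is strongly (in)effective” idea, here is a toy two-look design in R. This is just a sketch with made-up boundary values (the function name and numbers are ours), not one of the formally calibrated designs, such as O'Brien-Fleming boundaries, discussed in the references below.</p>

```r
## Toy two-look group sequential trial: stop at the interim look only if the
## evidence is very strong, otherwise continue to the full sample size.
## The boundary values are made up for illustration, not formally calibrated.
simulate_trial <- function(n_per_arm = 200, effect = 0.3,
                           interim_z = 2.8, final_z = 1.98) {
  x <- rnorm(n_per_arm, mean = 0, sd = 1)        # control arm outcomes
  y <- rnorm(n_per_arm, mean = effect, sd = 1)   # treatment arm outcomes
  half <- n_per_arm / 2

  z1 <- abs(unname(t.test(y[1:half], x[1:half])$statistic))
  if (z1 > interim_z) {                          # strong signal: stop early
    return(c(rejected = 1, n_used = 2 * half))
  }
  z2 <- abs(unname(t.test(y, x)$statistic))      # otherwise use everyone
  c(rejected = as.numeric(z2 > final_z), n_used = 2 * n_per_arm)
}

set.seed(2)
sims <- replicate(2000, simulate_trial())
rowMeans(sims)   # rejection rate and average number of patients actually used
```

<p>In this toy setup, trials with a real effect often stop at the interim look, so on average fewer patients are enrolled than in a fixed design of the same maximum size.</p>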
<p>This is a wide open area and deserves a much more informed conversation. I’m providing here a list of resources that would be a good place to start:</p>
<ol>
<li><a href="http://www.clinicaltrials.gov/ct2/info/resources">An introduction to clinical trials</a></li>
<li><a href="http://people.csail.mit.edu/mrosenblum/Teaching/adaptive_designs_2010.html">Michael Rosenblum’s adaptive trial design page. </a></li>
<li><a href="http://clinicaltrials.gov/">Clinicaltrials.gov</a> - registry of clinical trials</li>
<li><a href="https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/62529/TLA-1906126.pdf">Test, learn adapt</a> - a white paper on using clinical trials for public policy</li>
<li><a href="http://www.alltrials.net/">Alltrials</a> - an initiative to make all clinical trial data public</li>
<li><a href="http://www.asco.org/advocacy-practice/clinical-trial-resources">ASCO clinical trials resources</a> - on clinical trials ethics and standards</li>
<li><a href="http://jco.ascopubs.org/content/29/6/606">Don Berry’s paper on adaptive design</a>.</li>
<li><a href="http://www.amazon.com/dp/1441915850">Fundamentals of clinical trials</a> - a good general book (via David H.)</li>
<li><a href="http://www.amazon.com/Clinical-Trials-Methodologic-Perspective-Probability/dp/0471727814">Clinical trials, a methodological perspective</a> - a more regulatory take (via David H.)</li>
</ol>
<p><em>This post is by Rafa and Jeff. </em></p>
Sunday data/statistics link roundup (7/14/2013)
2013-07-14T12:10:53+00:00
http://simplystats.github.io/2013/07/14/sunday-datastatistics-link-roundup-7142013
<ol>
<li><span style="line-height: 15.994318008422852px;">Question: <a href="http://www.nytimes.com/2013/07/14/opinion/sunday/do-clinical-trials-work.html?pagewanted=all&_r=1&">Do clinical trials work</a>?Answer: Yes. Clinical trials are one of the defining success stories in the process of scientific inquiry. Do they work as fast/efficiently as a pharma company with potentially billions on the line would like? That is definitely much more up for debate. Most of the article is a good summary of how drug development works - although I think the statistics reporting is a little prone to hyperbole. I also think this sentence is both misleading, wrong, and way over the top, <em>“Part of the novelty lies in a statistical technique called Bayesian analysis that lets doctors quickly glean information about which therapies are working best. There’s no certainty in the assessment, but doctors get to learn during the process and then incorporate that knowledge into the ongoing trial.” </em><br /> </span></li>
<li><a href="http://www.nytimes.com/2013/07/11/business/2-competitors-sued-by-genetics-company-for-patent-infringement.html?src=rechp&_r=0">The fun begins</a> in the grim world of patenting genes. Two companies are being sued by Myriad even though they just lost the case on their main patent. Myriad is claiming violation of one of their 500 or so other patents. Can someone with legal expertise give me an idea - is Myriad now a patent troll?</li>
<li><a href="http://thomaslevine.com/!/r-spells-for-data-wizards/">R spells for data wizards</a> from Thomas Levine. I also link the pink on grey look.</li>
<li>Larry W. takes on <a href="http://normaldeviate.wordpress.com/2013/07/13/lost-causes-in-statistics-ii-noninformative-priors/">non-informative priors</a>. Worth the read, particularly the discussion of how non-informative priors can be informative in different parameterizations (a quick illustration of this appears after the list below). The problem Larry points out here is one I think is critical - in big data applications where the goal is often discovery, we rarely have enough prior information to make reasonable informative priors either. Not to say some regularization can’t be helpful, but I think there is danger in putting an even weakly informative prior on a poorly understood, high-dimensional space and then claiming victory when we discover something.</li>
<li>Statistics and actuarial science are jumping into a politically fraught situation by <a href="http://www.nytimes.com/2013/07/08/us/schools-seeking-to-arm-employees-hit-hurdle-on-insurance.html?hp&pagewanted=all&_r=0">raising the insurance on schools that allow teachers to carry guns</a>. Fiscally, this is clearly going to be the right move. I wonder what the political fallout will be for the insurance company and for the governments that passed these laws (via Rafa via Marginal Revolution).</li>
<li>Timmy!! Tim Lincecum <a href="http://scores.espn.go.com/mlb/recap?gameId=330713125">throws his first no hitter.</a> I know this isn’t strictly data/stats but he went to UW like me!</li>
</ol>
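<p>As a quick illustration of the parameterization point in item 4 (our own toy example, not from Larry’s post): a “flat” prior on a probability is anything but flat on the log-odds scale.</p>

```r
## A flat prior on a probability p is not flat after reparameterization:
## the same draws, expressed as log-odds, pile up around zero.
set.seed(3)
p <- runif(1e5)          # "non-informative" Uniform(0, 1) draws for p
log_odds <- qlogis(p)    # logit(p) = log(p / (1 - p))
hist(log_odds, breaks = 100,
     main = "Implied prior on the log-odds scale", xlab = "logit(p)")
```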
What are the iconic data graphs of the past 10 years?
2013-07-10T10:00:56+00:00
http://simplystats.github.io/2013/07/10/what-are-the-iconic-data-graphs-of-the-past-10-years
<p>This article in the New York Times about the supposed <a href="http://bits.blogs.nytimes.com/2013/07/05/the-death-of-photography-has-been-greatly-exaggerated/?smid=pl-share">death of photography</a> got me thinking about statistics. Apparently, the death of photography has been around the corner for some time now:</p>
<blockquote>
<p>For years, photographers have been bracing for this moment, warned that the last rites will be read for photography when video technology becomes good enough for anyone to record. But as this Fourth of July showed me, I think the reports of the death of photography have been greatly exaggerated.</p>
</blockquote>
<p>Yet, photography has not died and, says <a href="http://www.fas.harvard.edu/~amciv/faculty/kelsey.shtml">Robin Kelsey</a>, a professor of photography at Harvard,</p>
<blockquote>
<p>The fact that we can commit a single image to memory in a way that we cannot with video is a big reason photography is still used so much today.</p>
</blockquote>
<p>This got me thinking about data graphics. One long-time gripe about data graphics in R has been its horrible lack of support for dynamic or interactive graphics. This is an indisputable fact, especially in the early years. Nowadays there are quite a few extensions and packages that allow R to create dynamic graphics, but it still doesn’t feel like part of the “core”. I still feel like when I talk to people about R, the first criticism they jump to is the poor support for dynamic/interactive graphics.</p>
<p>But personally, I’ve never thought it was a big deal. Why? Because I don’t really find such graphics useful for truly <em>thinking</em> about data. I’ve definitely enjoyed viewing some of them (especially some of the D3 stuff), and it’s often fun to move sliders around and see how things change (perhaps my favorite is the <a href="http://www.babynamewizard.com/voyager">Baby Name Voyager</a> or maybe <a href="http://www.businessweek.com/articles/2013-07-09/jay-z-is-right-most-rappers-are-lying-about-their-money">this one showing rapper wealth</a>).</p>
<p>But in the end, what are you supposed to walk away with? As a creator of such a graphic, how are you supposed to communicate the evidence in the data? The key element of dynamic/interactive graphics is that it allows the viewer to explore the data in their own way, not in some prescribed static way that you’ve explicitly set out. Ultimately, I think that aspect makes dynamic graphics useful for presenting <em>data</em>, but not that useful for presenting <em>evidence</em>. If you want to present evidence, you have to tell a story with the data; you can’t just let the viewer tell their own story.</p>
<p>This got me thinking about what are the iconic data “photos” of the past 10 years (or so). The NYT article mentions the famous “<a href="http://en.wikipedia.org/wiki/Raising_the_Flag_on_Iwo_Jima">Raising the Flag on Iwo Jima</a>” by AP photographer Joe Rosenthal as an image that many would recognize (and perhaps remember). What are the data graphics that are burned in your memory?</p>
<p>I’ll give one example. I remember seeing Richard Peto give a talk here about the benefits of smoking cessation and its effect on life expectancy. He found that according to large population surveys, people who quit smoking by the age of 40 or so had more or less the same life expectancy as those who never smoked at all. The graph he showed was one very similar to <a href="http://www.nejm.org/action/showImage?doi=10.1056%2FNEJMsa1211128&iid=f03">Figure 3 from this article</a>. Although I already knew that smoking was bad for you, this picture really crystallized it for me in a specific way.</p>
<p>Of course, sometimes data graphics are <a href="http://simplystatistics.org/2012/11/26/the-statisticians-at-fox-news-use-classic-and-novel-graphical-techniques-to-lead-with-data/">memorable for other reasons</a>, but I’d like to try and stay positive here. Which data graphics have made a big impression on you?</p>
Repost: Preventing Errors Through Reproducibility
2013-07-09T10:00:39+00:00
http://simplystats.github.io/2013/07/09/repost-preventing-errors-through-reproducibility
<p>Checklist mania has hit clinical medicine thanks to people like Peter Pronovost and many others. The basic idea is that simple and short checklists along with changes to clinical culture can prevent major errors from occurring in medical practice. One particular success story is Pronovost’s central line checklist which <a href="http://www.ncbi.nlm.nih.gov/pubmed/15483409" target="_blank">dramatically reduced bloodstream infections</a> in hospital intensive care units.</p>
<p>There are three important points about the checklist. First, it neatly summarizes information, bringing the latest evidence directly to clinical practice. It is easy to follow because it is short. Second, it serves to slow you down from whatever you’re doing. Before you cut someone open for surgery, you stop for a second and run the checklist. Third, it is a kind of equalizer that subtly changes the culture: everyone has to follow the checklist, no exceptions. A number of studies have now shown that when clinical units follow checklists, infection rates go down and hospital stays are shorter compared to units using standard procedures.</p>
<p>Here’s a question: What would it take to convince you that an article’s results were reproducible, short of going in and reproducing the results yourself? I recently raised this question in a <a href="http://simplystatistics.tumblr.com/post/12243614318/i-gave-a-talk-on-reproducible-research-back-in" target="_blank">talk I gave</a> at the Applied Mathematics Perspectives conference. At the time I didn’t get any responses, but I’ve had some time to think about it since then.</p>
<p>I think most people are thinking of this issue along the lines of “The only way I can confirm that an analysis is reproducible is to reproduce it myself”. In order for that to work, everyone needs to have the data and code available to them so that they can do their own independent reproduction. Such a scenario would be sufficient (and perhaps ideal) to claim reproducibility, but is it strictly necessary? For example, if I reproduced a published analysis, would that satisfy you that the work was reproducible, or would you have to independently reproduce the results for yourself? If you had to choose someone to reproduce an analysis for you (not including yourself), who would it be?</p>
<p>This idea is embedded in the <a href="http://www.ncbi.nlm.nih.gov/pubmed/19535325" target="_blank">reproducible research policy at <em>Biostatistics</em></a>, but of course we make the data and code available too. There, a (hopefully) trusted third party (the Associate Editor for Reproducibility) reproduces the analysis and confirms that the code was runnable (at least at that moment in time).</p>
<p>It’s important to point out that reproducible research is not only about correctness and prevention of errors. It’s also about making research results available to others so that they may more easily build on the work. However, preventing errors is an important part and the question is then what is the best way to do that? Can we generate a reproducibility checklist?</p>
Use R! 2014 to be at UCLA
2013-07-08T16:33:00+00:00
http://simplystats.github.io/2013/07/08/use-r-2014-to-be-at-ucla
<p>The <a href="http://user2014.stat.ucla.edu">2014 Use R! conference</a> will be in Los Angeles, CA and will be hosted by the <a href="http://www.stat.ucla.edu">UCLA Department of Statistics</a> (an excellent department, I must say) and the newly created <a href="http://www.foastat.org">Foundation for Open Access Statistics</a>. This is basically <em>the</em> meeting for R users and developers and has grown to be quite an event.</p>
Fourth of July data/statistics link roundup (7/4/2013)
2013-07-04T13:49:11+00:00
http://simplystats.github.io/2013/07/04/fourth-of-july-datastatistics-link-roundup-742013
<ol>
<li><a href="http://www.slate.com/blogs/moneybox/2013/07/01/science_majors_are_hard_that_s_why_people_don_t_do_them.html">An interesting post</a> about how lots of people start out in STEM majors but eventually bail because they are too hard. They recommend either: (1) we better prepare high school students or (2) we make STEM majors easier. I like the idea of making STEM majors more interactive and self-paced. There is a bigger issue here of weed-out classes and barrier classes that deserves a longer discussion (via Alex N.)</li>
<li>This is <a href="http://www.gpo.gov/fdsys/pkg/FR-2013-06-04/html/2013-13083.htm">an incredibly interesting FDA proposal</a> to share all clinical data. I didn’t know this, but apparently right now all FDA data is proprietary. That is stunning to me, given the openness that we have say in genomic data - where most data are public. This goes beyond even the alltrials idea of reporting all results. I think we need full open disclosure of data and need to think hard about the privacy/consent implications this may have (via Rima I.).</li>
<li>This is a <a href="http://insightdatascience.com/">pretty cool data science</a> fellowship program for people who want to transition from academia to industry, post PhD. I have no idea if the program is any good, but certainly the concept is a great one. (via Sherri R.)</li>
<li><a href="http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2530.html?utm_content=buffercf9e7&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer">A paper in Nature Methods</a> about data visualization and understanding the levels of uncertainty in data analysis. I love seeing that journals are recognizing the importance of uncertainty in analysis. Sometimes I feel like the “biggies” want perfect answers with no uncertainty - which never happens.</li>
</ol>
<p>That’s it, just a short set of links today. Enjoy your 4th!</p>
Repost: The 5 Most Critical Statistical Concepts
2013-07-03T13:56:24+00:00
http://simplystats.github.io/2013/07/03/repost-the-5-most-critical-statistical-concepts
<p>(Editor’s Note: This is an old post but a good one from Jeff.)</p>
<p>It seems like everywhere we look, data is being generated - from politics, to biology, to publishing, to social networks. There are also diverse new computational tools, like GPGPU and cloud computing, that expand the statistical toolbox. Statistical theory is more advanced than it’s ever been, with exciting work in a range of areas.</p>
<p>With all the excitement going on around statistics, there is also increasing diversity. It is increasingly hard to define “statistician” since the definition ranges from <a href="http://www.stat.washington.edu/jaw/" target="_blank">very mathematical</a> to <a href="http://en.wikipedia.org/wiki/Nate_Silver" target="_blank">very applied</a>. An obvious question is: what are the most critical skills needed by statisticians?</p>
<p>So just for fun, I made up my list of the top 5 most critical skills for a statistician by my own definition. They are by necessity very general (I only gave myself 5).</p>
<ol>
<li><strong>The ability to manipulate/organize/work with data on computers</strong> - whether it is with excel, R, SAS, or Stata, to be a statistician you have to be able to work with data.</li>
<li><strong>A knowledge of exploratory data analysis</strong> - how to make plots, how to discover patterns with visualizations, how to explore assumptions</li>
<li><strong>Scientific/contextual knowledge</strong> - at least enough to be able to abstract and formulate problems. This is what separates statisticians from mathematicians.</li>
<li><strong>Skills to distinguish true from false patterns</strong> - whether with p-values, posterior probabilities, meaningful summary statistics, cross-validation or any other means.</li>
<li><strong>The ability to communicate results to people without math skills</strong> - a key component of being a statistician is knowing how to explain math/plots/analyses.</li>
</ol>
<p>What are your top 5? What order would you rank them in? Even though these are so general, I almost threw regression in there because of how often it pops up in various forms.</p>
<p><strong>Related Posts: </strong>Rafa on <a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">graduate education</a> and <a href="http://simplystatistics.tumblr.com/post/10021164565/what-is-a-statistician" target="_blank">What is a Statistician</a>? Roger on <a href="http://simplystatistics.tumblr.com/post/11655593971/do-we-really-need-applied-statistics-journals" target="_blank">“Do we really need applied statistics journals?”</a></p>
Measuring the importance of data privacy: embarrassment and cost
2013-07-01T15:52:38+00:00
http://simplystats.github.io/2013/07/01/measuring-the-importance-of-data-privacy-embarrassment-and-cost
<p>We <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">We</a> when it is inexpensive and easy to collect data about ourselves or about other people. These data can take the form of health information - like medical records, or they could be financial data - like your online bank statements, or they could be social data - like your friends on Facebook. We can also easily collect information about our <a href="https://www.23andme.com/">genetic makeup</a> or our <a href="http://www.fitbit.com/">fitness</a> (although it can be <a href="http://simplystatistics.org/2013/01/02/fitbit-why-cant-i-have-my-data/">hard to get</a>).</p>
<p>All of these data types are now stored electronically. There are obvious reasons why this is both economical and convenient. The downside, of course, is that the data can be <a href="http://en.wikipedia.org/wiki/PRISM_(surveillance_program)">used by the government</a> or <a href="http://junkcharts.typepad.com/numbersruleyourworld/2013/07/know-your-data-11-facebook-and-you.html">other entities</a> in ways that you may not like. Whether it is to track your habits to sell you new products or to use your affiliations to make predictions about your political leanings, these data are not just “numbers”.</p>
<p>Data protection and data privacy are major issues in a variety of fields. In some areas, laws are in place to govern how your data can be shared and used (e.g. HIPAA). In others it is a bit more of a wild west mentality (see this interesting series of posts, “<a href="http://junkcharts.typepad.com/numbersruleyourworld/know-your-data/">Know your data</a>” by junkcharts talking about some data issues). I think most people have some idea that they would like to keep at least certain parts of their data private (from the government, from companies, or from their friends/family), but I’m not sure how most people think about data privacy.</p>
<p>For me there are two scales on which I measure the importance of the privacy of my own data:</p>
<ol>
<li><strong>Embarrassment</strong> - Data about my personal habits, whether I let my son watch too much TV, or what kind of underwear I buy could be embarrassing if it was out in public.</li>
<li><strong>Financial </strong> - Data about my social security number, my bank account numbers, or my credit card account could be used to cost me either my current money or potential future money.</li>
</ol>
<p>My concerns about data privacy can almost always be measured primarily on these two scales. For example, I don’t want my medical records to be public because: (1) it might be embarrassing for people to know how bad my blood pressure is and (2) insurance companies might charge me more if they knew. On the other hand, I don’t want my bank account to get out primarily because it could cost me financially. So that mostly only registers on one scale.</p>
<p>One option, of course, would be to make all of my data totally private. But the problem is I want to share some of it with other people - I want my doctor to know my medical history and my parents to get to see pictures of my son. Usually I just make these choices about data sharing without even thinking about them, but after a little reflection I think these are the main considerations that go into my data sharing choices:</p>
<ol>
<li><strong>Where does it rate on the two scales above?</strong></li>
<li><strong>How much do I trust the person I’m sharing with?</strong> For example, my wife knows my bank account info, but I wouldn’t give it to a random stranger on the street. Google has my email and uses it to market to me, but that doesn’t bother me too much. But I trust them (I think) not to, say, tell the people I’m negotiating with about my plans based on emails I sent to my wife (this goes with #4 below).</li>
<li><strong>How hard would it be to use the information? </strong>I give my credit card to waiters at restaurants all the time, but I also monitor my account - so it would be relatively hard to run up a big bill before I (or the bank) noticed. I put my email address online, but it is a couple of steps between that and anything that is embarrassing/financially dubious for me. You’d have to be able to use that to hack some account.</li>
<li><strong>Is there incentive for someone to use the information? </strong>I’m not fabulously wealthy or famous. So most of the time, even if financial/embarrassing stuff is online about me, it probably wouldn’t get used. On the other hand, if I was an actor, a politician, or a billionaire there would be a lot more people incentivized to use my data against me. For example, if Google used my info to blow up a negotiation they would gain very little. I, on the other hand, would lose a lot and would probably sue them.*</li>
</ol>
<p>With these ideas in mind it makes it a little easier for me to (at least personally) classify how much I care about different kinds of privacy breaches.</p>
<p>For example, suppose my health information was posted on the web. I would consider this a problem because of both financial and embarrassment potential. It is also on the web, so I basically don’t trust the vast majority of people that would have access. On the other hand, it would be at least reasonably hard to use this data directly against me unless you were an insurance provider and most people wouldn’t have the incentive.</p>
<p>Take another example: someone tagging me in Facebook photos (I don’t have my own account). Here the financial considerations are only potential future employment problems, but the embarrassment considerations are quite high. I probably somewhat trust the person tagging me since I at least likely know them. On the other hand it would be super-easy to use the info against me - it is my face in a picture and would just need to be posted on the web. So in this case, it mostly comes down to incentive and I don’t think most people have an incentive to use pictures against me (except in jokes - which I’m mostly cool with).</p>
<p>I could do more examples, but you get the idea. I do wonder if there is an interesting statistical model to be built here on the basis of these axioms (or other more general ones) about when/how data should be used/shared.</p>
<p>* <em style="font-size: 16px;">An interesting side note is that I did use my gmail account when I was considering a position at Google fresh out of my Ph.D. I sent emails to my wife and my advisor discussing my plans/strategy. I always wondered if they looked at those emails when they were negotiating with me - although I never had any reason to suspect they had. </em></p>
What is the Best Way to Analyze Data?
2013-06-27T16:41:20+00:00
http://simplystats.github.io/2013/06/27/what-is-the-best-way-to-analyze-data
<p>One topic I’ve been thinking about recently is the extent to which data analysis is an art versus a science. In my thinking about art and science, I rely on Don Knuth’s distinction, from his 1974 lecture “Computer Programming as an Art”:</p>
<blockquote>
<p>Science is knowledge that we understand so well that we can teach it to a computer; and if we don’t fully understand something, it is an art to deal with it. Since the notion of an algorithm or a computer program provides us with an extremely useful test for the depth of our knowledge about any given subject, the process of going from an art to a science means that we learn how to automate something.</p>
</blockquote>
<p>Of course, the phrase “analyze data” is far too general; it needs to be placed in a much more specific context. So choose your favorite specific context and consider this question: Is there a way to teach a computer how to analyze the data generated in that context? Jeff wrote about this a while back and he called this magical program the <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">deterministic statistical machine</a>.</p>
<p>For example, one area where I’ve done some work is in estimating short-term/acute population-level effects of ambient air pollution. These are typically done using time series data of ambient pollution from central monitors and community-level counts of some health outcome (e.g. deaths, hospitalizations). The basic question is if pollution goes up on a given day, do we also see health outcomes go up on the same day, or perhaps in the few days afterwards. This is a fairly well-worn question in the air pollution literature and there have been hundreds of time series studies published. Similarly, there has been a lot of research into the statistical methodology for conducting time series studies and I would wager that as a result of that research we actually know something about what <em>to</em> do and what <em>not</em> to do.</p>
<p>But is our level of knowledge about the methodology for analyzing air pollution time series data to the point where we could program a computer to do the whole thing? Probably not, but I believe there are aspects of the analysis that we could program.</p>
<p>Here’s how I might break it down. Assume we basically start with a rectangular dataset with time series data on a health outcome (say, daily mortality counts in a major city), daily air pollution data, and daily data on other relevant variables (e.g. weather). Typically, the target of analysis is the association between the air pollution variable and the outcome, adjusted for everything else.</p>
<ol>
<li><span style="line-height: 16px;"><strong>Exploratory analysis</strong>. Not sure this can be fully automated. Need to check for missing data and maybe stop analysis if proportion of missing data is too high? Check for high leverage points as pollution data tends to be skewed. Maybe log-transform if that makes sense in this context. Check for other outliers and note them for later (we may want to do a sensitivity analysis without those observations). </span></li>
<li><strong>Model fitting</strong>. This is already fully automated. If the outcome is a count, then typically a Poisson regression model is used. We already know that maximum likelihood is an excellent approach and better than most others under reasonable circumstances. There’s plenty of GLM software out there so we don’t even have to program the IRLS algorithm (a minimal sketch of a fit like this appears just after this list).</li>
<li><strong>Model building</strong>. Since this is not a prediction model, the main concern we have is that we properly adjusted for measured and unmeasured confounding. <a href="http://www.hsph.harvard.edu/francesca-dominici/">Francesca Dominici</a> and some of her colleagues have done <a href="http://www.ncbi.nlm.nih.gov/pubmed/18552590">some</a> <a href="http://www.ncbi.nlm.nih.gov/pubmed/22364439">interesting</a> <a href="http://www.tandfonline.com/doi/abs/10.1198/016214504000000656#.Ucye6BbHKZY">work</a> regarding how best to do this via Bayesian model averaging and other approaches. I would say that in principle this can be automated, but the lack of easy-to-use software at the moment makes it a bit complicated. That said, I think simpler versions of the “ideal approach” can be easily implemented.</li>
<li><strong>Sensitivity analysis</strong>. There are a number of key sensitivity analyses that need to be done in all time series analyses. If there were outliers during EDA, maybe re-run model fit and see if regression coefficient for pollution changes much. How much is too much? (Not sure.) For time series models, unmeasured temporal confounding is a big issue so this is usually checked using spline smoothers on the time variable with different degrees of freedom. This can be automated by fitting the model many different times with different degrees of freedom in the spline.</li>
<li><strong>Reporting</strong>. Typically, some summary statistics for the data are reported along with the estimate + confidence interval for the air pollution association. Estimates from the sensitivity analysis should be reported (probably in an appendix), and perhaps estimates from different lags of exposure, if that’s a question of interest. It’s slightly more complicated if you have a multi-city study.</li>
</ol>
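<p>As a rough sketch of what the automated pieces might look like (simulated data and simplified code of our own, not the pipeline from any published study), steps 2 and 4 could be something like this:</p>

```r
## Rough sketch of the core Poisson fit (step 2) and the spline sensitivity
## analysis (step 4), using simulated data rather than a real city's series.
library(splines)

set.seed(4)
n      <- 365 * 3                                   # three years of daily data
day    <- 1:n
temp   <- 20 + 10 * sin(2 * pi * day / 365) + rnorm(n)
pm10   <- 30 + 5 * sin(2 * pi * day / 365) + rnorm(n, sd = 5)
deaths <- rpois(n, exp(3 + 0.0005 * pm10 + 0.01 * cos(2 * pi * day / 365)))

## Step 2: Poisson regression of daily deaths on pollution, adjusting for
## temperature and a smooth function of time
fit <- glm(deaths ~ pm10 + ns(temp, df = 3) + ns(day, df = 7 * 3),
           family = poisson)
coef(summary(fit))["pm10", ]

## Step 4: how sensitive is the pm10 coefficient to the degrees of freedom
## (per year) used for the time spline?
sapply(c(2, 4, 7, 10), function(df_per_year) {
  f <- glm(deaths ~ pm10 + ns(temp, df = 3) + ns(day, df = df_per_year * 3),
           family = poisson)
  coef(f)["pm10"]
})
```

<p>The loop over degrees of freedom is the part that is trivially automated; deciding which of those fits to believe is where the judgment comes back in.</p>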
<p>So I’d say that of the five major steps listed above, the one that I find most difficult to automate is EDA. There, a lot of choices have to be made that are not easy to program into a computer. But I think the rest of the analysis could be automated. I’ve left out the cleaning and preparation of the data here, which also involves making many choices. But in this case, much of that is often outside the control of the investigator. These analyses typically use publicly available data where the data are available “as-is”. For example, the investigator would likely have no control over how the mortality counts were created.</p>
<p>What’s the point of all this? Well, I would argue that if we cannot completely automate a data analysis for a given context, then either we need to narrow the context, or we have some more statistical research to do. Thinking about how one might automate a data analysis process is a useful way to identify where the major statistical gaps are in a given area. Here, there may be some gaps in how best to automate the exploratory analyses. Whether those gaps can be filled (or more importantly, whether <em>you</em> are interested in filling them) is not clear. But most likely it’s not a good idea to think about better ways to fit Poisson regression models.</p>
<p>So what do you do when all of the steps of the analysis have been fully automated? Well, I guess time to move on then….</p>
Art from Data
2013-06-26T08:39:23+00:00
http://simplystats.github.io/2013/06/26/art-from-data
<p>There’s a nice piece by Mark Hansen about <a href="http://bits.blogs.nytimes.com/2013/06/19/data-driven-aesthetics/">data-driven aesthetics</a> in the New York Times special section on big data.</p>
<blockquote>
<p>From a speedometer to a weather map to a stock chart, we routinely interpret and act on data displayed visually. With a few exceptions, data has no natural “look,” no natural “visualization,” and choices have to be made about how it should be displayed. Each choice can reveal certain kinds of patterns in the data while hiding others.</p>
</blockquote>
<p>I think drawing a line between a traditional statistical graphic and a pure work of art would be somewhat difficult. You can find examples of both that might fall in the opposite category: traditional graphics that transcend their utilitarian purposes and “pure art” works that tell you something new about your world.</p>
<p>Indeed, I think Mark Hansen’s own work with Ben Rubin falls into the latter category–art pieces that perhaps had their beginnings purely as works of art but ended up giving you new insight into the world. For example, <a href="http://earstudio.com/2010/09/29/listening-post/">Listening Post</a> was a highly creative installation that simultaneously gave you an emotional connection to random people chatting on the Internet as well as insight into what the Internet was “saying” at any given time (I wonder if NSA employees took a field trip to the Whitney Museum of American Art!).</p>
Doing Statistical Research
2013-06-25T09:12:36+00:00
http://simplystats.github.io/2013/06/25/doing-statistical-research
<p>There’s a wonderful article over at the STATtr@k web site by Terry Speed on <a href="http://stattrak.amstat.org/2013/06/01/how-to-do-statistical-research/">How to Do Statistical Research</a>. There is a lot of good advice there, but the column is most notable because it’s pretty much the exact opposite of the advice that I got when I first started out.</p>
<p>To quote the article:</p>
<blockquote>
<p>The ideal research problem in statistics is “do-able,” interesting, and one for which there is not much competition. My strategy for getting there can be summed up as follows:</p>
<ul>
<li>Consulting: Do a very large amount</li>
<li>Collaborating: Do quite a bit</li>
<li>Research: Do some</li>
</ul>
</blockquote>
<p>For the most part, I was told to flip the research and consulting bits. That is, you want to spend most of your time doing “research” and very little of your time doing “consulting”. Why? Because ostensibly, the consulting work doesn’t involve new problems, only solving old problems with existing techniques. The research work by definition involves addressing new problems.</p>
<p>But,</p>
<blockquote>
<p>A strategy I discourage is “develop theory/model/method, seek application.” Developing theory, a model, or a method suggests you have done some context-free research; already a bad start. The existence of proof (Is there a problem?) hasn’t been given. If you then seek an application, you don’t ask, “What is a reasonable way to answer this question, given this data, in this context?” Instead, you ask, “Can I answer the question with this data; in this context; with my theory, model, or method?” Who then considers whether a different (perhaps simpler) answer would have been better?</p>
</blockquote>
<p>The truth is, most problems can be solved with an existing method. They may not be 100% solvable with existing tools, but usually 90% is good enough and it’s not worth developing a new statistical method to cover the remaining 10%. What you really want to be doing is working on the problem that is 0% solvable with existing methods. Then there’s a pretty big payback if you develop a new method to address it and it’s more likely that your approach will be adopted by others simply because there’s no alternative. But in order to find these 0% problems, you have to see a lot of problems, and that’s where the consulting and collaboration comes in. Exposure to lots of problems lets you see the universe of possibilities and gives you a sense of where scientists really need help and where they’re more or less doing okay.</p>
<p>Even if you agree with Terry’s advice, implementing it may not be so straightforward. It may be easier/harder to do consulting and collaboration depending on where you work. Also, <a href="http://simplystatistics.org/2011/10/20/finding-good-collaborators/">finding good collaborators</a> can be tricky and may involve some trial and error.</p>
<p>But it’s useful to keep this advice in mind, especially when looking for a job. The places you want to be on the lookout for are places that give you the most exposure to interesting scientific problems, the 0% problems. These places will give you the best opportunities for collaboration and for having a real impact on science.</p>
Does fraud depend on my philosophy?
2013-06-24T10:00:30+00:00
http://simplystats.github.io/2013/06/24/does-fraud-depend-on-my-philosophy
<p>Ever since my <a href="http://simplystatistics.org/2013/05/17/when-does-replication-reveal-fraud/">last post on replication and fraud</a> I’ve been doing some more thinking about why people consider some things “scientific fraud”. (First of all, let me just say that I was a bit surprised by the discussion in the comments for that post. Some people apparently thought I was asking about the actual probability that the study was a fraud. This was not the case. I just wanted people to think about how they would react when confronted with the scenario.)</p>
<p>I often find that when I talk to people about the topic of scientific fraud, especially statisticians, there is a sense that much work that goes on out there is fraudulent, but the precise argument for why is difficult to pin down.</p>
<p>Consider the following three cases:</p>
<ol>
<li>I conduct a randomized clinical trial comparing a new treatment and a control and their effect on outcome Y1. I also collect data on outcomes Y2, Y3, … Y10. After conducting the trial I see that there isn’t a significant difference for Y1 so I test the other 9 outcomes and find a significant effect (defined as p-value equal to 0.04) for Y7. I then publish a paper about outcome Y7 and state that it’s significant with p=0.04. I make no mention of the other outcomes.</li>
<li>I conduct the same clinical trial with the 10 different outcomes and look at the difference between the treatment groups for all outcomes. I notice that the largest standardized effect size is for Y7 with a standardized effect of 3, suggesting the treatment is highly effective in this trial. I publish a paper about outcome Y7 and state that the standardized effect size was 3 for comparing treatment vs. control. I note that a difference of 3 is highly significant, but I make no mention of <em>statistical</em> significance or p-values. I also make no mention of the other outcomes.</li>
<li>I conduct the same clinical trial with the 10 outcomes. Now I look at all 10 outcomes and calculate the posterior probability that the effect is greater than zero (favoring the new treatment), given a pre-specified diffuse prior on the effect (assume it’s the same prior for each effect). Of the 10 outcomes I see that Y7 has the largest posterior probability of 0.98. I publish a paper about Y7 stating that my posterior probability for a positive effect is 0.98. I make no mention of the other outcomes.</li>
</ol>
<p>Which one of these cases constitutes scientific fraud?</p>
<ol>
<li>I think most people would object to Case 1. This is the classic multiple testing scenario where the end result is that the stated p-value is not correct. Rather than a p-value of 0.04 the real p-value is more like 0.4 (a small simulation after this list illustrates the inflation). A simple Bonferroni correction fixes this but obviously would have resulted in not finding any significant effects based on a 0.05 threshold. The real problem is that in Case 1 you are clearly trying to make an inference about future studies. You’re saying that if there’s truly no difference, then in 100 other studies just like this one, you’d expect only 4 to detect a difference under the same criteria that you used. But it’s incorrect to say this and perhaps fraudulent (or negligent) depending on your underlying intent. In this case a relevant detail that is missing is the number of other outcomes that were tested.</li>
<li>Case 2 differs from case 1 only in that no p-values are used but rather the measure of significance is the standardized effect size. Therefore, no probability statements are made and no inference is made about future studies. Although the information about the other outcomes is similarly omitted in this case as in case 1, it’s difficult for me to identify what is wrong with this paper.</li>
<li>Case 3 takes a Bayesian angle and is more or less like case 2 in my opinion. Here, probability is used as a measure of belief about a parameter but no explicit inferential statements are made (i.e. there is no reference to some population of other studies). In this case I just state my belief about whether an effect/parameter is greater than 0. Although I also omit the other 9 outcomes in the paper, revealing that information would not have changed anything about my posterior probability.</li>
</ol>
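<p>For what it’s worth, here is a small simulation (ours, purely illustrative) of the problem in Case 1: if none of the 10 outcomes has a real effect and you report only the smallest p-value, the chance that it falls below 0.05 is roughly 1 - 0.95^10 ≈ 0.4, not 0.05.</p>

```r
## Case 1 as a simulation: ten null outcomes per trial, keep the smallest p-value.
set.seed(5)
n_sims    <- 5000
n_per_arm <- 50

min_p <- replicate(n_sims, {
  p <- replicate(10, t.test(rnorm(n_per_arm), rnorm(n_per_arm))$p.value)
  min(p)
})

mean(min_p < 0.05)        # ~0.4: the reported "p = 0.04" is nothing like 0.04
mean(min_p < 0.05 / 10)   # the Bonferroni threshold brings this back near 0.05
```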
<p>In each of these three scenarios, the underlying data were generated in the exact same way (let’s assume for the moment that the trial itself was conducted with complete integrity). In each of the three scenarios, 10 outcomes were examined and outcome Y7 was in some sense the most interesting.</p>
<p>Of course, the analyses and the interpretation of the data were <em>not</em> the same in each scenario. Case 1 makes an explicit inference whereas Cases 2 and 3 essentially do not. However, I would argue the <em>evidence</em> about the new treatment compared to the control treatment in each scenario was identical.</p>
<p>I don’t believe that the investigator in Case 1 should be allowed to engage in such shenanigans with p-values, but should he/she be pilloried simply because the p-value was the chosen metric of significance? I guess the answer would be “yes” for many of you, but keep in mind that the investigator in Case 1 still generated the same evidence as the others. Should the investigators in Case 2 and Case 3 be thrown in the slammer? If so, on what basis?</p>
<p>My feeling is not that people should be allowed to do whatever they please, but we need a better way to separate the “stuff” from the stuff. This is both a methodological and a communications issue. For example, Case 3 may not be fraud but I’m not necessarily interested in what the investigator’s opinion about a parameter is. I want to know what the data say about that parameter (or treatment difference in this case). Is it fraud to make any inferences in the first place (as in Case 1)? I mean, how could you possibly know that your inference is “correct”? If “all models are wrong, but some are useful”, does that mean that everyone is committing fraud?</p>
Sunday data/statistics link roundup (6/23/13)
2013-06-23T22:24:40+00:00
http://simplystats.github.io/2013/06/23/sunday-datastatistics-link-roundup-62313
<ol>
<li><a href="http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0066463">An interesting study</a> describing the potential benefits of using significance testing may be potentially beneficial and a scenario where the file drawer effect may even be beneficial. Granted this is all simulation so you have to take it with a grain of salt, but I like the pushback against the hypothesis testing haters. In all things moderation, including hypothesis testing.</li>
<li><a href="http://www.npr.org/blogs/codeswitch/2013/06/21/193881290/jeah-we-mapped-out-the-four-basic-aspects-of-being-a-bro">Venn Diagrams for the win, bro</a>.</li>
<li><a href="http://www.youtube.com/watch?v=E-gpSQQe3w8">The new basketball positions</a>. The idea is to cluster players based on the positions on the floor where they shoot, etc. I like the idea of data driven position definitions; I am a little worried about “reading ideas in” to a network picture.</li>
<li><a href="http://qz.com/95516/an-start-ups-plan-to-make-us-health-care-cheaper-tell-people-what-it-costs/">A really cool idea</a> about a startup that makes data on healthcare procedures available to patients. I’m all about data transparency, but it makes me wonder, how often do people with health insurance negotiate the prices of procedures (via Leah J.)</li>
<li>Another interesting article <a href="http://www.nytimes.com/2013/06/23/opinion/sunday/theres-a-fly-in-my-tweets.html?emc=eta1&_r=0">about using tweets</a> (and other social media) to improve public health. I do wonder about potential sampling issues, like what happened with <a href="http://bits.blogs.nytimes.com/2013/02/24/disruptions-google-flu-trends-shows-problems-of-big-data-without-context/">google flu trends</a> (via Nick C.)</li>
</ol>
Interview with Miriah Meyer - Microsoft Faculty Fellow and Visualization Expert
2013-06-21T10:39:24+00:00
http://simplystats.github.io/2013/06/21/interview-with-miriah-meyer-microsoft-faculty-fellow-and-visualization-expert
<p><a href="http://simplystatistics.org/2013/06/21/interview-with-miriah-meyer-microsoft-faculty-fellow-and-visualization-expert/miriah-2/" rel="attachment wp-att-1424"><img class="alignnone wp-image-1424" alt="miriah" src="http://simplystatistics.org/wp-content/uploads/2013/06/miriah1.jpg" width="311" height="256" /></a></p>
<p><em><a href="http://www.cs.utah.edu/~miriah/">Miriah Meyer</a> received her Ph.D. in computer science from the University of Utah, then did a postdoctoral fellowship at Harvard University and was a visiting fellow at MIT and the Broad Institute. Her research focuses on developing visualization tools in close collaboration with biological scientists. She has been recognized as a Microsoft Faculty Fellow, a TED Fellow, and appeared on the TR35. We talked with Miriah about visualization, collaboration, and her influences during her career as part of the <a href="http://simplystatistics.org/interviews/">Simply Statistics Interview Series</a>.</em></p>
<p><strong>SS: Which term applies to you: data scientist, statistician, computer scientist, or something else?</strong></p>
<p>MM: My training is as a computer scientist and much of the way I problem solve is grounded in computational thinking. I do, however, sometimes think of myself as a data counselor, as a big part of what I do is help my collaborators move towards a deeper and more articulate statement about what they want/need to do with their data.</p>
<p><strong>SS: Most data analysis is done by scientists, not trained statisticians. How does data visualization help/hurt scientists when looking at piles of complicated data?</strong></p>
<p>MM: In the sciences, visualization is particularly good for hypothesis generation and early stage exploration. With many fields turning toward data-driven approaches, scientists are often not sure of exactly what they will find in a mound of data. Visualization allows them to look into the data without having to specify a specific question, query, or model. This early, exploratory analysis is very difficult to do strictly computationally. Exploration via interactive visualization can lead a scientist towards establishing a more specific question of the data that could then be addressed algorithmically.</p>
<p><strong>SS: </strong><strong>What are the steps in developing a visualization with a scientific collaborator?</strong></p>
<p>MM: The first step is finding good collaborators <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>The beginning of a project is spent in discussions with the scientists, trying to understand their needs, data, and mental models of a problem. I find this part to be the most interesting, and also the most challenging. The goal is to develop a clear, abstract understanding of the problem and set of design requirements. We do this through interviews and observations, with a focus on understanding how people currently solve problems and what they want to do but can’t with current tools.</p>
<p>Next is to take this understanding and prototype ideas for visualization designs. Rapid prototyping on paper is usually first, followed by more sophisticated, software prototypes after getting feedback from the collaborators. Once a design is sufficiently fleshed out and validated, a (relatively) full-featured visualization tool is developed and deployed.</p>
<p>At this point, the scientists tend to realize that the problem they initially thought was most interesting isn’t… and the cycle continues.</p>
<p>Fast iteration is really essential in this process. In the past I’ve gone through as many as three cycles of this process before finding the right problem abstractions and designs.</p>
<p><strong>SS: You have tackled some diverse visualizations (from synteny to poems); what are the characteristics of a problem that make it a good candidate for new visualizations?</strong></p>
<p>MM: For me, the most important thing is to find good collaborators. It is essential to find partners that are willing to give lots of their time up front, are open-minded about research directions, and are working on cutting-edge problems in their field. This latter characteristic helps to ensure that there will be something novel needed from a data analysis and visualization perspective.</p>
<p>The other thing is to test whether a problem passes the Tableau/R/Matlab test: if the problem can’t be solved using one of these environments, then that is probably a good start.</p>
<p><strong>SS: What is the four-level nested model for design validation and how did you extend it?</strong></p>
<p>MM: This is a design decision model that helps to frame the different kinds of decisions made in the visualization design process, such as decisions about data derivations, visual representations, and algorithms. The model helps to put any one decision in the context of other visualization ideas, methods, and techniques, and also helps a researcher generalize new ideas to a broader class of problems. We recently extended this model to specifically define what a visualization “guideline” is, and how to relate this concept to how we design and evaluate visualizations.</p>
<p><strong>SS: Who are the key people who have been positive influences on your career and how did they help you?</strong></p>
<p>MM: One influence that jumps out to me is a collaboration with a designer in Boston named Bang Wong. Working with Bang completely changed my approach to visualization development and got me thinking about iteration, rapid prototyping, and trying out many ideas before committing. Also important were two previous supervisors, Ross Whitaker and Tamara Munzner, who constantly pushed me to be precise and articulate about problems and approaches to them. I believe that precision is a hallmark of good data science, even when characterizing imprecise things <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p><strong>SS: Do you have any advice for computer scientists/statisticians who want to work on visualization as a research area?</strong></p>
<p>MM: Do it! Visualization is a really fun, vibrant, growing field. It relies on a broad spectrum of skills, from computer science, to design, to collaboration. I would encourage those interested not to get too infatuated with the engineering or the aesthetics and to instead focus on solving real-world problems. There is an unlimited supply of those!</p>
Google's brainteasers (that don't work) and Johns Hopkins Biostatistics Data Analysis
2013-06-20T10:10:15+00:00
http://simplystats.github.io/2013/06/20/googles-brainteasers-that-dont-work-and-johns-hopkins-biostatistics-data-analysis
<p><a href="http://www.nytimes.com/2013/06/20/business/in-head-hunting-big-data-may-not-be-such-a-big-deal.html?pagewanted=all&_r=0">This article</a> is getting some attention, because Google’s VP for people operations at Google has made public a few insights that the Google HR team has come to over the last several years. The most surprising might be:</p>
<ol>
<li><span style="line-height: 16px;">They don’t collect GPAs except for new candidates</span></li>
<li>Test scores are worthless</li>
<li>Interview scores weren’t correlated with success.</li>
<li>Brainteasers that Google is so famous for are worthless</li>
<li>Behavioral interviews are the most effective</li>
</ol>
<p>The reason the article is getting so much attention is how surprising these facts may be to people who have little experience hiring/managing in technical fields. But I thought this quote was really telling:</p>
<blockquote>
<p> One of my own frustrations when I was in college and grad school is that you knew the professor was looking for a specific answer. You could figure that out, but it’s much more interesting to solve problems where there isn’t an obvious answer.</p>
</blockquote>
<p>Interestingly, <a href="http://simplystatistics.org/2011/10/22/graduate-student-data-analysis-inspired-by-a/">that is the whole point</a> of my data analysis course here at Hopkins. Over my relatively limited time as a faculty member I realized there were two key qualities that made students in biostatistics stand out: (1) that they were hustlers - willing to just work until the problem is solved even if it was frustrating and (2) that they were willing/able to try new approaches or techniques they weren’t comfortable with. I don’t have the quantitative data that Google does, but I would venture to guess those two traits explain 80%+ of the variation in success rates for graduate students in statistics/computing/data analysis.</p>
<p>Once that realization is made, it becomes clear pretty quickly that textbook problems or re-analysis of well known data sets measure something orthogonal to traits (1) and (2). So I went about redesigning the types of problems our students had to tackle. Instead of assigning problems out of a book I redesigned the questions to have the following characteristics:</p>
<ol>
<li><span style="line-height: 16px;">The were based on live data sets. I define a “live” data set as a data set that has not been used to answer the question of interest previously. </span></li>
<li>The questions are <a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">problem forward, not solution backward</a>. I would have an idea of what would likely work and what would likely not work. But I defined the question without thinking about what methods the students might use.</li>
<li>The answer was open ended (and often not known to me in advance).</li>
<li>The problems often had to do with unique scenarios not encountered frequently in statistics (e.g. you have a data census instead of just a sample).</li>
<li>The problems involved methods application/development, coding, and writing/communication.</li>
</ol>
<p>I have found that problems with these characteristics more precisely measure hustle and flexibility, like Google is looking for in their hiring practices. Of course, there are some down sides to this approach. I think it can be more frustrating for students, who don’t have as clearly defined a path through the homework. It also means dramatically more work for the instructor in terms of analyzing the data to find the quirks, creating personalized feedback for students, and being able to properly estimate the amount of work a project will take.</p>
<p><a href="http://simplystatistics.org/2013/03/26/an-instructors-thoughts-on-peer-review-for-data-analysis-in-coursera/">We have started thinking</a> about how to do this same thing at scale on Coursera. In the meantime, Google will just have to send their recruiters to Hopkins Biostats to find students who meet the characteristics they are looking for :-).</p>
Sunday data/statistics link roundup (6/16/13 - Father's day edition!)
2013-06-16T10:31:51+00:00
http://simplystats.github.io/2013/06/16/sunday-datastatistics-link-roundup-61613-fathers-day-edition
<ol>
<li><span style="line-height: 16px;"><a href="http://www.npr.org/blogs/health/2013/06/07/189565146/datapalooza-a-concept-a-conference-and-a-movement">Datapalooza</a>! I’m wondering where my invite is? I do health data stuff, pick me, pick me! Actually it does sound like a pretty good idea - in general giving a bunch of smart people access to interesting data and real science problems can produce some cool results (link via Dan S.)</span></li>
<li>This <a href="http://www.manhattan-institute.org/pdf/fda_06.pdf">report on precision medicine</a> from the Manhattan Institute is related to my post this week on <a href="http://simplystatistics.org/2013/06/12/personalized-medicine-is-primarily-a-population-health-intervention/">personalized medicine</a>. I like the idea that we should be focusing on developing new ideas for adaptive trials (my buddy <a href="http://people.csail.mit.edu/mrosenblum/Home.html">Michael</a> is all over that stuff). I did think it was a little pie-in-the-sky with plenty of buzzwords like Bayesian causal networks and pattern recognition. I think these ideas are certainly applicable, but the report overstates the current level of applicability of these methods. We need more funding and way more research to support this area before we should automatically adopt it - big data can be used to confuse when methods aren’t well understood (link via Rafa via Marginal Revolution).</li>
<li><a href="http://ropensci.org/blog/2013/06/12/sloan/">rOpenSci</a> wins a grant from the Sloan Foundation! Psyched to see this kind of innovative open software development get the support it deserves. My favorite rOpenSci package is <a href="http://ropensci.org/packages/figshare.html">rFigshare</a>, what’s yours?</li>
<li>A <a href="http://snikolov.wordpress.com/2012/11/14/early-detection-of-twitter-trends/">k-means approach</a> to detecting what will be trending on Twitter. It always gets me so pumped up to see the creative ways that methods that have been around forever can be adapted to solve real, interesting problems.</li>
<li>Finally, I <a href="http://www.brainpickings.org/index.php/2013/06/14/einstein-letter-to-son/">thought this link</a> was very appropriate for Father’s Day. I couldn’t agree more that the best kind of learning happens when you are just so into something that you forget you are learning. Happy Father’s Day everyone!</li>
</ol>
The vast majority of statistical analysis is not performed by statisticians
2013-06-14T10:31:10+00:00
http://simplystats.github.io/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians
<p dir="ltr">
Whether you know it or not, everything you do produces data - from the websites you read to the rate at which your heart beats. Until pretty recently, most of the data you produced wasn’t collected, it floated off unmeasured. The only data that were collected were painstakingly gathered by scientists one number at a time in small experiments with a few people. This laborious process meant that data were expensive and time-consuming to collect. Yet many of the most amazing scientific discoveries over the last two centuries were squeezed from just a few data points. But over the last two decades, the unit price of data has dramatically dropped. New technologies touching every aspect of our lives from our money, to our health, to our social interactions have made data collection cheap and easy (see e.g. <a href="http://en.wikipedia.org/wiki/Camp_Williams">Camp Williams</a>).
</p>
<p dir="ltr">
To give you an idea of how steep the drop in the price of data has been, in 1967 Stanley Milgram <a href="http://en.wikipedia.org/wiki/Small_world_phenomenon">did an experiment</a> to determine the number of degrees of separation between two people in the U.S. In his experiment he sent 296 letters to people in Omaha, Nebraska and Wichita, Kansas. The goal was to get the letters to a specific person in Boston, Massachusetts. The trick was people had to send the letters to someone they knew, and they then sent it to someone they knew and so on. At the end of the experiment, only 64 letters made it to the individual in Boston. On average, the letters had gone through 6 people to get there. This is where the idea of “6-degrees of Kevin Bacon” comes from. Based on 64 data points. <a href="http://research.microsoft.com/en-us/um/people/horvitz/Messenger_graph_www.htm">A 2007 study</a> updated that number to “7 degrees of Kevin Bacon”. The study was based on 30 billion instant messaging conversations collected over the course of a month or two with the same amount of effort.
</p>
<p dir="ltr">
Once data started getting cheaper to collect, it got cheaper fast. Take another example, the human genome. The genome is the unique DNA code in every one of your cells. It consists of a set of 3 billion letters that is unique to you. By many measures, the race to be the first group to collect all 3 billion letters from a single person kicked off the data revolution in biology. The project was completed in 2000 after a decade of work and <a href="http://www.genome.gov/11006943">$3 billion</a> to collect the 3 billion letters in the first human genome. This project was actually <a href="http://www.nature.com/news/economic-return-from-human-genome-project-grows-1.13187">a stunning success</a>, most people thought it would be much more expensive. But just over a decade later, new technology means that we can now collect all 3 billion letters from a person’s genome for about $10,000 in about a week.
</p>
<p> As the price of data dropped so dramatically over the last two decades, the division of labor between analysts and everyone else became less and less clear. Data became so cheap that it couldn’t be confined to just a few highly trained people. So raw data started to trickle out in a number of different ways. It started with maps of temperatures across the U.S. in newspapers and quickly ramped up to information on how many friends you had on Facebook, the price of tickets on 50 airlines for the same flight, or measurements of your blood pressure, good cholesterol, and bad cholesterol at every doctor’s visit. Arguments about politics started focusing on the results of opinion polls and who was asking the questions. The doctor stopped telling you what to do and started presenting you with options and the risks that went along with each.</p>
<p dir="ltr">
That is when statisticians stopped being the primary data analysts. At some point, the trickle of data about you, your friends, and the world started impacting every component of your life. Now almost every decision you make is based on data you have about the world around you. Let’s take something simple, like where you are going to eat tonight. You might just pick the nearest restaurant to your house. But you could also ask your friends on Facebook where you should eat, or read reviews on Yelp, or check out menus on the restaurants’ websites. All of these are pieces of data that are collected and presented for you to “analyze”.
</p>
<p dir="ltr">
This revolution demands a new way of thinking about statistics. It has precipitated <a href="http://flowingdata.com/">explosive growth in data visualization </a>- the most accessible form of data analysis. It has encouraged explosive growth in MOOCs like the ones <a href="http://simplystatistics.org/courses/">Roger, Brian and I taught. </a>It has created <a href="https://data.baltimorecity.gov/">open data initiatives in government</a>. It has also encouraged more accessible data analysis platforms in the form of startups like <a href="https://www.statwing.com/">StatWing</a> that make it easier for non-statisticians to analyze data.
</p>
<p dir="ltr">
What does this mean for statistics as a discipline? Well, it is great news in that we have a lot more people to train. It also really drives home the <a href="http://simplystatistics.org/tag/statistical-literacy/">importance of statistical literacy</a>. But it also means we need to adapt our thinking about what it means to teach and perform statistics. We need to focus increasingly on interpretation and critique and away from formulas and memorization (think English composition versus grammar). We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more fool-proofing, <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">more time automating</a>, and <a href="http://simplystatistics.org/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper/">more time creating software</a>. The potential payout is huge for realizing that the tide has turned and most people who analyze data aren’t statisticians.
</p>
False discovery rate regression (cc NSA's PRISM)
2013-06-13T10:36:43+00:00
http://simplystats.github.io/2013/06/13/false-discovery-rate-regression-cc-nsas-prism
<p><em>There is an idea I have been thinking about for a while now. It re-emerged at the top of my list after seeing this <a href="http://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/">really awesome post</a> on using metadata to identify “conspirators” in the American revolution. My first thought was: but how do you know that you aren’t just <a href="http://www.statsblogs.com/2013/06/07/how-likely-is-the-nsa-prism-program-to-catch-a-terrorist/">making lots of false discoveries</a>?</em></p>
<p>Hypothesis testing and significance analysis were originally developed to make decisions for single hypotheses. In many modern applications, it is more common to test hundreds or thousands of hypotheses. In the standard multiple testing framework, you perform a hypothesis test for each of the “features” you are studying (these are typically genes or voxels in high-dimensional problems in biology, but can be other things as well). Then the following outcomes are possible:</p>
<div class="table-responsive">
<table style="width:100%; " class="easy-table easy-table-default " border="0">
<tr>
<th>
</th>
<th>
Call Null True
</th>
<th>
Call Null False
</th>
<th>
Total
</th>
</tr>
<tr>
<td>
Null True
</td>
<td>
True Negatives
</td>
<td>
False Positives
</td>
<td>
True Nulls
</td>
</tr>
<tr>
<td>
Null False
</td>
<td>
False Negatives
</td>
<td>
True Positives
</td>
<td>
False Nulls
</td>
</tr>
<tr>
<td>
</td>
<td>
No Decisions
</td>
<td>
Rejections
</td>
<td>
Total Tests
</td>
</tr>
</table>
</div>
<p>The reason for “No Decisions” is that the way hypothesis testing is set up, one should technically never accept the null hypothesis. The number of rejections is the total number of times you claim that a particular feature shows a signal of interest.</p>
<p>A very common measure of embarrassment in multiple hypothesis testing scenarios is the <a href="http://www.pnas.org/content/100/16/9440.long">false discovery rate</a> defined as:</p>
<p style="text-align:center;">
$$ FDR = E\left[\frac{\#\text{ of False Positives}}{\#\text{ of Rejections}}\right] $$
</p>
<p>There are some niceties that have to be dealt with here, like the fact that the $\#\text{ of Rejections}$ may be equal to zero, inspiring things like the <a href="http://www.genomine.org/papers/directfdr.pdf">positive false discovery rate</a>, which has <a href="http://genomics.princeton.edu/storeylab/papers/Storey_Annals_2003.pdf">some nice Bayesian interpretations</a>.</p>
<p>The way that the process usually works is that a test statistic is calculated for each hypothesis test, where a larger statistic means more significant, and then operations are performed on these ordered statistics. The two most common operations are: (1) pick a cutoff along the ordered list of p-values, call everything less than this threshold significant, and <em>estimate</em> the FDR for that cutoff; and (2) pick an acceptable FDR level and find an algorithm to pick the threshold that <em>controls</em> the FDR, where control is usually defined by saying something like the algorithm produces $E[FDP] \leq FDR$.</p>
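<p>To make operation (2) concrete, here is a tiny R sketch (using simulated p-values, not data from any real study) that applies the Benjamini-Hochberg adjustment available in base R via <code>p.adjust</code> to control the FDR at 10%:</p>
<pre class="brush: r; title: ; notranslate" title="">## A small sketch of operation (2): control the FDR at 10% with the
## Benjamini-Hochberg adjustment. The p-values are simulated: 900 nulls
## (uniform) and 100 signals (beta-distributed near zero).
set.seed(101)
pvals <- c(runif(900), rbeta(100, 1, 50))
qvals <- p.adjust(pvals, method = "BH")
sum(qvals <= 0.10)   ## number of rejections at an FDR of 10%
</pre>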
<p>Regardless of the approach these methods usually make an assumption that the rejection regions should be nested. In other words, if you call statistic $T_k$ significant and $T_j > T_k$ then your method should also call statistic $T_j$ significant. In the absence of extra information, this is a very reasonable assumption.</p>
<p>But in many situations you might have additional information you would like to use in the decision about whether to reject the null hypothesis for test $j$.</p>
<p><strong>Example 1 </strong>A common example is gene-set analysis. Here you have a group of hypotheses that you have tested individually and you want to say something about the level of noise <em>in the group</em>. In this case, you might want to know something about the level of noise if you call the whole set interesting.</p>
<p><strong>Example 2</strong> Suppose you are a <a href="http://www.nsa.gov/">mysterious government agency</a> and you want to identify potential terrorists. You observe some metadata on people and you want to predict who is a terrorist - <a href="http://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/">say using betweenness centrality</a>. You could calculate a P-value for each individual, <a href="http://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/">say using a randomization test</a>. Then estimate your FDR based on predictions using the metadata.</p>
<p><strong>Example 3 </strong>You are monitoring a system over time where observations are random. Say, for example, whether there is an outbreak of a particular disease in a particular region at a given time: is the rate of disease higher than the background rate? How can you estimate the rate at which you make false claims?</p>
<p>For now I’m going to focus on the estimation scenario but you could imagine using these estimates to try to develop controlling procedures as well.</p>
<p>In each of these cases you have a scenario where you are interested in something like:</p>
<p style="text-align:center;">
$$ E\left[\frac{V}{R} \,\Big|\, X=x\right] = fdr(x) $$
</p>
<p>where $fdr(x)$ is a covariate-specific estimator of the false discovery rate. Returning to our examples you could imagine:</p>
<p><strong>Example 1</strong></p>
<p style="text-align:center;">
$$ E\left[\frac{V}{R} \,\Big|\, GS = k\right] = \beta_0 + \sum_{\ell=1}^K \beta_{\ell}\, 1(GS=\ell) $$
</p>
<p><strong>Example 2</strong></p>
<p style="text-align:center;">
$$ E\left[\frac{V}{R} \,\Big|\, Person, Age\right] = \beta_0 + \gamma\, Age + \sum_{\ell=1}^K \beta_{\ell}\, 1(Person = \ell) $$
</p>
<p><strong>Example 3</strong></p>
<p style="text-align:center;">
$$ E\left[\frac{V}{R} \,\Big|\, Time\right] = \beta_0 + \sum_{\ell=1}^{K} s_{\ell}(time) $$
</p>
<p>where, in the last case, we have parameterized the relationship between FDR and time with a flexible model like <a href="http://en.wikipedia.org/wiki/Spline_(mathematics)">cubic splines</a>.</p>
<p>The hard problem is fitting the regression models in Examples 1-3. Here I propose a basic estimator of the FDR regression model and leave it to others to be smart about it. Let’s focus on P-values because they are the easiest to deal with. Suppose that we calculate the random variables $Y_i = 1(P_i > \lambda)$. Then:</p>
<p style="text-align:center;">
$$ E[Y_i] = Prob(P_i > \lambda) = (1-\lambda)\,\pi_0 + (1-G(\lambda))\,(1-\pi_0) $$
</p>
<p>where $G(\lambda)$ is the empirical distribution function for the P-values under the alternative hypothesis. This may be a mixture distribution. If we assume reasonably powered tests and that $\lambda$ is large enough, then $G(\lambda) \approx 1$. So</p>
<p style="text-align:center;">
$$ E[Y_i] \approx (1-\lambda)\,\pi_0 $$
</p>
<p>One obvious choice is then to try to model</p>
<p style="text-align:center;">
$$ E[Y_i \mid X = x] \approx (1-\lambda)\,\pi_0(x) $$
</p>
<p>We could, for example use the model:</p>
<p style="text-align:center;">
$$ \mathrm{logit}(E[Y_i \mid X = x]) = f(x) $$
</p>
<p>where $f(x)$ is a linear model or spline, etc. Then we get the fitted values and calculate:</p>
<p style="text-align:center;">
$$ \hat{\pi}_0(x) = \hat{E}[Y_i \mid X=x] / (1-\lambda) $$
</p>
<p>Here is a little simulated example where the goal is to estimate the probability of being a false positive as a smooth function of time.</p>
<pre class="brush: r; title: ; notranslate" title="">## Load libraries
library(splines)
## Define the number of tests
set.seed(1345)
ntest <- 1000
## Set up the time vector and the probability of being null
tme <- seq(-2,2,length=ntest)
pi0 <- pnorm(tme)
## Calculate a random variable indicating whether to draw
## the p-values from the null or alternative
nullI <- rbinom(ntest,prob=pi0,size=1)> 0
## Sample the null P-values from U(0,1) and the alternatives
## from a beta distribution
pValues <- rep(NA,ntest)
pValues[nullI] <- runif(sum(nullI))
pValues[!nullI] <- rbeta(sum(!nullI),1,50)
## Set lambda and calculate the estimate
lambda <- 0.8
y <- pValues > lambda
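## Fit a smooth function of time to the exceedance indicator. Note this is
## the default gaussian/identity glm, so the fitted values estimate
## E[Y | time] directly; a binomial (logit) fit would be another option.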
glm1 <- glm(y ~ ns(tme,df=3))
## Get the estimated pi0 values
pi0hat <- glm1$fitted/(1-lambda)
## Plot the real versus fitted probabilities
plot(pi0,pi0hat,col="blue",type="l",lwd=3,xlab="Real pi0",ylab="Fitted pi0")
abline(c(0,1),col="grey",lwd=3)
</pre>
<p>The result is this plot:</p>
<p><a href="http://simplystatistics.org/2013/06/13/false-discovery-rate-regression-cc-nsas-prism/pi0/" rel="attachment wp-att-1312"><img class="alignnone size-full wp-image-1312" alt="pi0" src="http://simplystatistics.org/wp-content/uploads/2013/05/pi0.png" width="480" height="480" srcset="http://simplystatistics.org/wp-content/uploads/2013/05/pi0-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2013/05/pi0-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2013/05/pi0.png 480w" sizes="(max-width: 480px) 100vw, 480px" /></a></p>
<p><span style="color: #000000;"><b>Real versus estimated false discovery rate when calling all tests significant.</b></span></p>
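<p>For comparison, here is a small follow-up (assuming the objects from the simulation above are still in your workspace) that contrasts the covariate-specific estimate with a single overall Storey-type estimate of $\pi_0$ that ignores time entirely:</p>
<pre class="brush: r; title: ; notranslate" title="">## Continuing the simulation above: compare the time-varying estimate to a
## single overall (Storey-type) estimate of pi0 that ignores the covariate.
pi0hat.overall <- mean(pValues > lambda) / (1 - lambda)
pi0hat.overall    ## one number for all tests
range(pi0hat)     ## the regression-based estimate varies with time
</pre>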
<p>This estimate is obviously not guaranteed to estimate the FDR well, the operating characteristics both theoretically and empirically need to be evaluated and the other examples need to be fleshed out. But isn’t the idea of FDR regression cool?</p>
Personalized medicine is primarily a population-health intervention
2013-06-12T11:06:11+00:00
http://simplystats.github.io/2013/06/12/personalized-medicine-is-primarily-a-population-health-intervention
<p>There has been a lot of discussion of <a href="http://en.wikipedia.org/wiki/Personalized_medicine">personalized medicine</a>, <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/">individualized health</a>, and <a href="http://www.ucsf.edu/welcome-to-ome">precision medicine</a> in the news and in the medical research community. Despite this recent attention, it is clear that healthcare has always been personalized to some extent. For example, men are rarely pregnant and heart attacks occur more often among older patients. In these cases, easily collected variables such as sex and age, can be used to predict health outcomes and therefore used to “personalize” healthcare for those individuals.</p>
<p>So why the recent excitement around personalized medicine? The reason is that it is increasingly cheap and easy to collect more precise measurements about patients that might be able to predict their health outcomes. An example that <a href="http://www.nytimes.com/2013/05/14/opinion/my-medical-choice.html?_r=0">has recently been in the news</a> is the measurement of mutations in the BRCA genes. Angelina Jolie made the decision to undergo a prophylactic double mastectomy based on her family history of breast cancer and measurements of mutations in her BRCA genes. Based on these measurements, previous studies had suggested she might have a lifetime risk as high as 80% of developing breast cancer.</p>
<p>This kind of scenario will become increasingly common as newer and more accurate genomic screening and predictive tests are used in medical practice. When I read these stories there are two points I think of that sometimes get obscured by the obviously fraught emotional, physical, and economic considerations involved with making decisions on the basis of new measurement technologies:</p>
<ol>
<li><strong>In individualized health/personalized medicine the “treatment” is information about risk</strong>. In <a href="http://en.wikipedia.org/wiki/Gleevec">some cases</a> treatment will be personalized based on assays. But in many other cases, we still do not (and likely will not) have perfect predictors of therapeutic response. In those cases, the healthcare will be “personalized” in the sense that the patient will get more precise estimates of their likelihood of survival, recurrence etc. This means that patients and physicians will increasingly need to think about/make decisions with/act on information about risks. But communicating and acting on risk is a notoriously challenging problem; personalized medicine will dramatically raise the importance of <a href="http://understandinguncertainty.org/">understanding uncertainty</a>.</li>
<li><strong>Individualized health/personalized medicine is a population-level treatment.</strong> Assuming that the 80% lifetime risk estimate was correct for Angelina Jolie, it still means there is a 1 in 5 chance she was never going to develop breast cancer. If that had been her case, then the surgery was unnecessary. So while her decision was based on personal information, there is still uncertainty in that decision for her. So the “personal” decision may not always be the “best” decision for any specific individual. It may however, be the best thing to do for everyone in a population with the same characteristics.</li>
</ol>
Why not have a "future of the field" session at a conference with only young speakers?
2013-06-11T10:17:46+00:00
http://simplystats.github.io/2013/06/11/why-not-have-a-future-of-the-field-session-at-a-conference-with-only-young-speakers
<p>I’m in the process of trying to get together a couple of sessions to submit to ENAR 2014. I’m pretty psyched about the topics and am looking forward to hosting the conference in Baltimore. It is pretty awesome to have one of the bigger stats conferences on our home turf and we are going to try to be well represented at the conference.</p>
<p>While putting the sessions together I’ve been thinking about what my favorite characteristics of sessions at stats conferences are. Alyssa has a <a href="http://alyssafrazee.wordpress.com/2013/03/18/ideas-for-super-awesome-conferences/">few suggestions</a> for speakers which I’m completely in agreement with, but I’m talking about whole sessions. Since statistics is often concerned primarily with precision/accuracy, the talks tend to be a little bit technical and sometimes dry. Even on topics I really am excited about, people try not to exaggerate. I think overall this is a great quality, but I’d also love to see more speakers <a href="http://www.hilarymason.com/speaking/speaking-entertain-dont-teach/">entertain, not just teach</a> at a conference. I realized that one of my favorite kinds of sessions is the “future of statistics” session.</p>
<p>My only problem is that future of the field talks are always given by luminaries who have a lot of experience. This isn’t surprising, since (1) they are famous and their names are a big draw, (2) they have made lots of interesting/unique contributions, and (3) they are established so they don’t have to worry about being a little imprecise.</p>
<p>But I’d love to see a “future of the field” session with only people who are students/postdocs/first year assistant professors. These are the people who will really <em>be</em> the future of the field and are often more on top of new trends. It would be so cool to see four or five of the most creative young people in the field making bold predictions about where we will go as a discipline. Then you could have one senior person discuss the talks and give some perspective on how realistic the visions would be in light of past experience.</p>
<p>Tell me that wouldn’t be an awesome conference session.</p>
Sunday data/statistics link roundup (6/2/13)
2013-06-02T21:53:23+00:00
http://simplystats.github.io/2013/06/02/sunday-datastatistics-link-roundup-6213
<ol>
<li>Awesome, a <a href="https://plot.ly/plot">GUI for d3 graphs</a>. Via John M.</li>
<li>Tom L. on <a href="http://researchmatters.blogs.census.gov/2013/05/30/statistics-matter/">why statistics matter</a>, especially at the <a href="http://www.census.gov/research/">Census</a>!</li>
<li>I’ve been spending the last several weeks house hunting like crazy, so the idea of data on schools is high on my mind right now. So this link to <a href="http://greatergreatereducation.org/post/18992/osse-releases-more-school-data-on-students-neighborhoods/?utm_source=feedly">data on the geography of schools and students’ neighborhoods</a> seemed particularly interesting (via Rafa).</li>
<li><a href="http://www.nbcnews.com/technology/students-self-driving-car-tech-wins-intel-science-fair-1C9977186">A student dramatically reduces the cost</a> of the self-driving car. The big technological breakthrough? Sampling! (via Marginal Revolution).</li>
</ol>
What statistics should do about big data: problem forward not solution backward
2013-05-29T17:59:07+00:00
http://simplystats.github.io/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward
<p>There has been a lot of discussion among statisticians about big data and what statistics should do to get involved. Recently <a href="http://normaldeviate.wordpress.com/2013/05/28/steve-marron-on-big-data/">Steve M. and Larry W.</a> took up the same issue on their blog. I have been thinking about this for a while, since I work in genomics, which almost always comes with “big data”. It is also one area of big data where statistics and statisticians have played a huge role.</p>
<p>A question that naturally arises is, “why have statisticians been so successful in genomics?” I think a major reason is the phrase I borrowed from <a href="http://www.bcaffo.com/">Brian C. </a>(who may have borrowed it from <a href="http://www.biostat.ucla.edu/Directory/Brookmeyer">Ron B</a>.)</p>
<blockquote>
<p>problem first, not solution backward</p>
</blockquote>
<p>One of the reasons that “big data” is even a term is that data are less expensive than they were a few years ago. One example is the dramatic drop in the price of <a href="http://genomebiology.com/2010/11/5/207">DNA-sequencing</a>. But there are many, many more examples. The quantified self movement and Fitbits, Google Books, social network data from Twitter, etc. are all areas where data that cost us a huge amount to collect 10 years ago can now be collected and stored very cheaply.</p>
<p>As statisticians we look for generalizable principles; I would say that you have to zoom pretty far out to generalize from social networks to genomics but here are two:</p>
<ol>
<li>The data can’t be easily analyzed in an R session on a simple laptop (say low Gigs to Terabytes)</li>
<li>The data are generally quirky and messy (unstructured text, json files with lots of missing data, fastq files with quality metrics, etc.)</li>
</ol>
<p>So how does one end up at the “leading edge” of big data? By being willing to <a href="http://simplystatistics.org/2012/05/28/schlep-blindness-in-statistics/">deal with the schlep</a> and work out the nitty-gritty of how you apply even standard methods to data sets where taking the mean takes hours. Or by taking the time to learn all the kinks that are specific to, say, processing a microarray, and then taking the time to fix them. This is why statisticians were so successful in genomics: they focused on the practical problems and this gave them access to data no one else had or could use properly.</p>
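<p>To make the schlep concrete, here is a minimal sketch of the kind of unglamorous code this involves - computing a plain mean over a file too large to read into memory by streaming it in chunks. The file name and its single numeric column are hypothetical stand-ins:</p>
<pre class="brush: r; title: ; notranslate" title="">## A minimal sketch of the "schlep": a running mean over a file too big to
## load at once. "huge_measurements.txt" (one numeric value per line) is a
## hypothetical stand-in for whatever the real data look like.
con <- file("huge_measurements.txt", open = "r")
total <- 0; n <- 0
repeat {
  chunk <- scan(con, what = numeric(), nlines = 10000, quiet = TRUE)
  if (length(chunk) == 0) break
  total <- total + sum(chunk)
  n <- n + length(chunk)
}
close(con)
total / n   ## the mean, computed one chunk at a time
</pre>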
<p>Doing these things requires a lot of effort that isn’t elegant. It also isn’t “statistics” by the definition that only mathematical methodology is statistics. Steve alludes to this in his post when he says:</p>
<blockquote>
<p>Frankly I am a little disappointed that there does not seem to be any really compelling new idea (e.g. as in neural nets or the kernel embedding idea that drove machine learning).</p>
</blockquote>
<p>I think this is a view shared by many statisticians. That since there isn’t a new elegant theory yet, there aren’t “new ideas” in big data. That focus is solution backward. We want an elegant theory that we can then apply to specific problems if they happen to come up.</p>
<p>The alternative is problem forward. The fact that we can collect data so cheaply means we can measure and study things we never could before. Computer scientists, physicists, genome biologists, and others are leading in big data precisely because they aren’t thinking about the statistical solution. They are thinking about solving an important scientific problem and are willing to deal with all the dirty details to get there. This allows them to work on data sets and problems that haven’t been considered by other people.</p>
<p>In genomics, this has happened before. In that case, the invention of microarrays revolutionized the field and statisticians jumped on board, working closely with scientists, handling the dirty details, and <a href="http://www.bioconductor.org/">building software so others could too</a>. As a discipline if we want to be part of the “big data” revolution I think we need to focus on the scientific problems and let methodology come second. That requires a rethinking of what it means to be statistics. Things like parallel computing, data munging, reproducibility, and software development have to be accepted as equally important to methods development.</p>
<p>The good news is that there is plenty of room for statisticians to bring our unique skills in dealing with uncertainty to these new problems; but we will only get a seat at the table if we are willing to deal with <a href="http://simplystatistics.org/2012/06/22/statistics-and-the-science-club/">the mess that comes with doing real science</a>.</p>
<p>I’ll close by listing a few things I’d love to see:</p>
<ol>
<li><span style="line-height: 16px;">A Bioconductor-like project for social network data. Tyler M. and Ali S. <a href="http://www.csss.washington.edu/Papers/wp127.pdf">have a paper </a>that would make for an awesome package for this project. </span></li>
<li><a href="http://smart-stats.org/">Statistical pre-processing</a> for fMRI and other brain imaging data. Keep an eye on our smart group for that.</li>
<li>Data visualization for translational applications, dealing with all the niceties of human-data interfaces. See <a href="http://healthvis.org/">healthvis</a> or the stuff <a href="http://www.cs.utah.edu/~miriah/">Miriah Meyer</a> is doing.</li>
<li>Most importantly, starting with specific, unsolved scientific problems; seeking novel ways to collect cheap data; and analyzing them, even with known and straightforward statistical methods, to deepen our understanding of ourselves or the universe.</li>
</ol>
Sunday data/statistics link roundup (5/19/2013)
2013-05-19T12:01:52+00:00
http://simplystats.github.io/2013/05/19/sunday-datastatistics-link-roundup-5192013
<ol>
<li>This is a <a href="http://camdp.com/blogs/21st-century-problems">nice post</a> on 20th versus 21st century problems and the rise of the importance of empirical science. I particularly like the discussion of what it means to be a “solved” problem and how that has changed.</li>
<li><a href="http://www.sciencemag.org/content/340/6134/787.full">A discussion</a> in Science about the (arguably) most important statistics among academics, the impact factor and h-index. This comes on the heels of the <a href="http://am.ascb.org/dora/">San Francisco Declaration of Research Assessment</a>. I like the idea that we should focus on evaluating science for its own merit rather than focusing on summaries like impact factor. But I worry that the “gaming” people are worried about with quantitative numbers like IF will be replaced with “politicking” if it becomes too qualitative. (via Rafa)</li>
<li>A <a href="http://blogs.telegraph.co.uk/news/tomchiversscience/100217094/depressing-just-nine-per-cent-of-britons-trust-stats-over-our-own-experience-though-most-of-us-wont-believe-that/">write-up</a> about a survey in Britain that suggests people don’t believe statistics (surprise!). I think this is symptomatic of a bigger issue which is being raised over and over. In the era when scientific problems don’t have deterministic solutions how do we determine if a problem has been solved? There is no good answer for this yet and it threatens to undermine a major fraction of the scientific enterprise going forward.</li>
<li>Businesses are confusing <a href="http://qz.com/81661/most-data-isnt-big-and-businesses-are-wasting-money-pretending-it-is/">data analysis and big data</a>. This is so important and true. Big data infrastructure is often critical for creating/running data products. But discovering new ideas from data often happens on much smaller data sets with good intuition and interactive data analysis.</li>
<li><a href="http://www.nytimes.com/2013/05/19/sports/topps-changes-baseball-card-numbering-to-criticism.html?_r=1&">Really interesting article</a> about how the baseball card numbering system matters and how changing it can upset collectors (via Chris V.).</li>
</ol>
When does replication reveal fraud?
2013-05-17T09:32:01+00:00
http://simplystats.github.io/2013/05/17/when-does-replication-reveal-fraud
<p>Here’s a little thought experiment for your weekend pleasure. Consider the following:</p>
<p>Joe Scientist decides to conduct a study (call it Study A) to test the hypothesis that a parameter <em>D</em> > 0 vs. the null hypothesis that <em>D</em> = 0. He designs a study, collects some data, conducts an appropriate statistical analysis and concludes that <em>D</em> > 0. This result is published in the Journal of Awesome Results along with all the details of how the study was done.</p>
<p>Jane Scientist finds Joe’s study very interesting and tries to replicate his findings. She conducts a study (call it Study B) that is similar to Study A but completely independent of it (and does not communicate with Joe). In her analysis she does not find strong evidence that <em>D</em> > 0 and concludes that she cannot rule out the possibility that <em>D</em> = 0. She publishes her findings in the Journal of Null Results along with all the details.</p>
<p>From these two studies, which of the following conclusions can we make?</p>
<ol>
<li>Study A is obviously a fraud. If the truth were that <em>D</em> > 0, then Jane should have concluded that <em>D</em> > 0 in her independent replication.</li>
<li>Study B is obviously a fraud. If Study A were conducted properly, then Jane should have reached the same conclusion.</li>
<li>Neither Study A nor Study B was a fraud, but the result for Study A was a Type I error, i.e. a false positive.</li>
<li>Neither Study A nor Study B was a fraud, but the result for Study B was a Type II error, i.e a false negative.</li>
</ol>
<p>I realize that there are a number of subtle details concerning why things might happen but I’ve purposely left them out. My question is, based on the information that you <em>actually have</em> about the two studies, what would you consider to be the most likely case? What further information would you like to know beyond what was given here?</p>
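<p>One way to build intuition (though it is not an answer to the thought experiment) is a quick simulation under assumed numbers: suppose <em>D</em> really is a modest positive effect and both studies are honest tests with the same moderate sample size, and ask how often a pair of honest studies splits exactly this way - Study A rejecting and Study B not:</p>
<pre class="brush: r; title: ; notranslate" title="">## Intuition-builder with made-up numbers: D is a modest positive mean
## difference, both studies are honest one-sample t-tests with n = 30.
## How often does the "A rejects, B does not" pattern occur by chance?
set.seed(42)
D <- 0.4; n <- 30; nsim <- 10000
split.pattern <- replicate(nsim, {
  pA <- t.test(rnorm(n, mean = D))$p.value
  pB <- t.test(rnorm(n, mean = D))$p.value
  (pA < 0.05) & (pB >= 0.05)
})
mean(split.pattern)   ## roughly a quarter of honest study pairs split this way
</pre>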
The bright future of applied statistics
2013-05-15T10:00:33+00:00
http://simplystats.github.io/2013/05/15/the-bright-future-of-applied-statistics
<p>In 2013, the Committee of Presidents of Statistical Societies (COPSS) celebrates its 50th Anniversary. As part of its celebration, COPSS will publish a book, with contributions from past recipients of its awards, titled “Past, Present and Future of Statistical Science”. Below is my contribution titled <em>The bright future of applied statistics</em>.</p>
<p>When I was asked to contribute to this issue, titled Past, Present, and Future of Statistical Science, I contemplated my career while deciding what to write about. One aspect that stood out was how much I benefited from the right circumstances. I came to one clear conclusion: it is a great time to be an applied statistician. I decided to describe the aspects of my career that I have thoroughly enjoyed in the <em>past</em> and <em>present</em> and explain why this has led me to believe that the <em>future is bright for applied statisticians</em>.</p>
<p>I became an applied statistician while working with David Brillinger on my PhD thesis. When searching for an advisor I visited several professors and asked them about their interests. David asked me what I liked and all I came up with was “<em>I don’t know. Music?</em>”, to which he responded “<em>That’s what we will work on</em>”. Apart from the necessary theorems to get a PhD from the Statistics Department at Berkeley, my thesis summarized my collaborative work with researchers at the Center for New Music and Audio Technology. The work<br /> involved separating and parameterizing the harmonic and non-harmonic components of musical sound signals [<a href="#Xirizarry2001local">5</a>]. The sounds had been digitized into data. The work was indeed fun, but I also had my first glimpse into the incredible potential of statistics in a world becoming more and more data-driven.</p>
<p>Despite having expertise only in music, and a thesis that required a CD player to hear the data, <a href="http://www.biostat.jhsph.edu/~ririzarr/Demo/index.html">fitted models and residuals</a>, I was hired by the Department of Biostatistics at Johns Hopkins School of Public Health. Later I realized what was probably obvious to the School’s leadership: that regardless of the subject matter of my thesis, my time series expertise could be applied to several public health applications [<a href="#Xirizarry2001assessing">8</a>, <a href="#Xdipietro2001cross">2</a>, <a href="#Xcrone2001electrocorticographic">1</a>]. The public health and biomedical challenges surrounding me were simply too hard to resist and my new department knew this. It was inevitable that I would quickly turn into an applied Biostatistician.</p>
<p>Since the day that I arrived at Hopkins 15 years ago, Scott Zeger, the department chair, fostered and encouraged faculty to leverage their statistical expertise to make a difference and to have an immediate impact in science. At that time, we were in the midst of a measurement revolution that was transforming several scientific fields into data-driven ones. By being located in a School of Public Health and next to a medical school, we were surrounded by collaborators working in such fields. These included environmental science, neuroscience, cancer biology, genetics, and molecular biology. Much of my work was motivated by collaborations with biologists who, for the first time, were collecting large amounts of data. Biology was changing from a data-poor discipline to a data-intensive one.</p>
<p>A specific example came from the measurement of gene expression. Gene expression is the process where DNA, the blueprint for life, is copied into RNA, the templates for the synthesis of proteins, the building blocks for life. Before microarrays were invented in the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper (see Figure 1A below). With microarrays, this suddenly changed to sifting through tens of thousands of numbers (see Figure 1B). Biologists went from using their eyes to categorize results to having thousands (and now millions) of measurements per sample to analyze. Furthermore, unlike genomic DNA, which is static, gene expression is a dynamic quantity: different tissues express different genes at different levels and at different times. The complexity was exacerbated by unpolished technologies that made measurements much noisier than anticipated. This complexity and level of variability made statistical thinking an important aspect of the analysis. The Biologists that used to say, “if I need statistics, the experiment went wrong” were now seeking out our help. The results of these collaborations have led to, among other things, the development of breast cancer recurrence gene expression assays making it possible to identify patients at risk of distant recurrence following surgery [<a href="#Xvan2002gene">9</a>].</p>
<div class="figure" style="text-align: left;">
<p class="noindent">
<a href="http://simplystatistics.org/2013/05/15/the-bright-future-of-applied-statistics/expression/" rel="attachment wp-att-1329"><img class="alignnone size-full wp-image-1329" alt="expression" src="http://simplystatistics.org/wp-content/uploads/2013/05/expression.jpg" /></a>
</p>
<div class="caption">
Figure 1: Illustration of gene expression data before and after microarrays.
</div>
</div>
<p>When biologists at Hopkins first came to our department for help with their microarray data, Scott put them in touch with me because I had experience with (what was then) large datasets (digitized music signals are represented by 44,100 points per second). The more I learned about the scientific problems and the more data I explored, the more motivated I became. The potential for statisticians having an impact in this nascent field was clear and my department was encouraging me to take the plunge. This institutional encouragement and support was crucial as successfully working in this field made it harder to publish in the mainstream statistical journals; an accomplishment that had traditionally been heavily weighted in the promotion process. The message was clear: having an immediate impact on specific scientific fields would be rewarded as much as mathematically rigorous methods with general applicability.</p>
<p>As with my thesis applications, it was clear that to solve some of the challenges posed by microarray data I would have to learn all about the technology. For this I organized a sabbatical with Terry Speed’s group in Melbourne where they helped me accomplish this goal. During this visit I reaffirmed my preference for attacking applied problems with simple statistical methods, as opposed to overcomplicated ones or developing new techniques. Learning that deciphering clever ways of putting the existing statistical toolbox to work was good enough for an accomplished statistician like Terry gave me the necessary confidence to continue working this way. More than a decade later this continues to be my approach to applied statistics. This approach has been instrumental for some of my current collaborative work. In particular, it led to important new biological discoveries made together with Andy Feinberg’s lab [<a href="#Xirizarry2009human">7</a>].</p>
<p>During my sabbatical we developed preliminary solutions that improved precision and aided in the removal of systematic biases for microarray data [<a href="#Xirizarry2003exploration">6</a>]. I was aware that hundreds, if not thousands, of other scientists were facing the same problematic data and were searching for solutions. Therefore I was also thinking hard about ways in which I could share whatever solutions I developed with others. During this time I received an email from Robert Gentleman asking if I was interested in joining a new software project for the delivery of statistical methods for genomics data. This new collaboration eventually became the <a href="http://www.bioconductor.org">Bioconductor project</a>, which to this day continues to grow its user and developer base [<a href="#Xgentleman2004bioconductor">4</a>]. Bioconductor was the perfect vehicle for having the impact that my department had encouraged me to seek. With Ben Bolstad and others we wrote an R package that has been downloaded tens of thousands of times [<a href="#Xgautier2004affy">3</a>]. Without the availability of software, the statistical method would not have received nearly as much attention. This lesson served me well throughout my career, as developing software packages has greatly helped disseminate my statistical ideas. The fact that my department and school rewarded software publications provided important support.</p>
<p>The impact statisticians have had in genomics is just one example of our field’s accomplishments in the 21st century. In academia, the number of statisticians becoming leaders in fields like environmental sciences, human genetics, genomics, and social sciences continues to grow. Outside of academia, Sabermetrics has become a standard approach in several sports (not just baseball) and inspired the Hollywood movie Moneyball. A PhD statistician led the team that won the <a href="http://www.netflixprize.com">Netflix million dollar prize</a>. <a href="http://mashable.com/2012/11/07/nate-silver-wins">Nate Silver</a> proved the pundits wrong by once again using statistical models to predict election results almost perfectly. R has become a widely used programming language. It is no surprise that Statistics majors at Harvard have more than <a href="http://nesterko.com/visuals/statconcpred2012-with-dm/">quadrupled since 2000</a> and that statistics MOOCs are among the <a href="http://edudemic.com/2012/12/the-11-most-popular-open-online-courses/">most popular</a>.</p>
<p>The unprecedented advance in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming science. Scientific fields that have traditionally relied upon simple data analysis techniques have been turned on their heads by these technologies. Furthermore, advances such as these have brought about a shift from hypothesis-driven to discovery-driven research. However, interpreting information extracted from these massive and complex datasets requires sophisticated statistical skills, as one can easily be fooled by patterns that arise by chance. This has greatly elevated the importance of our discipline in biomedical research.</p>
<p>I think that the data revolution is just getting started. Datasets are currently being, or have already been, collected that contain, hidden in their complexity, important truths waiting to be discovered. These discoveries will increase the scientific understanding of our world. Statisticians should be excited and ready to play an important role in the new scientific renaissance driven by the measurement revolution.</p>
<h2 class="likechapterHead" style="text-align: left;">
<a id="x1-20001"></a>Bibliography
</h2>
<div class="thebibliography">
<p class="bibitem" style="text-align: left;">
[1] <a id="Xcrone2001electrocorticographic"></a>NE Crone, L Hao, J Hart, D Boatman, RP Lesser, R Irizarry, and<br /> B Gordon. Electrocorticographic gamma activity during word production<br /> in spoken and sign language. Neurology, 57(11):2045–2053, 2001.
</p>
<p class="bibitem" style="text-align: left;">
[2] <a id="Xdipietro2001cross"></a>Janet A DiPietro, Rafael A Irizarry, Melissa Hawkins, Kathleen A<br /> Costigan, and Eva K Pressman. Cross-correlation of fetal cardiac and<br /> somatic activity as an indicator of antenatal neural development. American<br /> journal of obstetrics and gynecology, 185(6):1421–1428, 2001.
</p>
<p class="bibitem" style="text-align: left;">
[3] <a id="Xgautier2004affy"></a>Laurent Gautier, Leslie Cope, Benjamin M Bolstad, and Rafael A<br /> Irizarry. affy&mdash;analysis of Affymetrix GeneChip data at the probe level.<br /> Bioinformatics, 20(3):307–315, 2004.
</p>
<p class="bibitem" style="text-align: left;">
[4] <a id="Xgentleman2004bioconductor"></a>Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad,<br /> Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao<br /> Ge, Jeff Gentry, et al. Bioconductor: open software development for<br /> computational biology and bioinformatics. Genome biology, 5(10):R80, 2004.
</p>
<p class="bibitem" style="text-align: left;">
[5] <a id="Xirizarry2001local"></a>Rafael A Irizarry. Local harmonic estimation in musical sound signals.<br /> Journal of the American Statistical Association, 96(454):357–367, 2001.
</p>
<p class="bibitem" style="text-align: left;">
[6] <a id="Xirizarry2003exploration"></a>Rafael A Irizarry, Bridget Hobbs, Francois Collin, Yasmin D<br /> Beazer-Barclay, Kristen J Antonellis, Uwe Scherf, and Terence P Speed.<br /> Exploration, normalization, and summaries of high density oligonucleotide<br /> array probe level data. Biostatistics, 4(2):249–264, 2003.
</p>
<p class="bibitem" style="text-align: left;">
[7] <a id="Xirizarry2009human"></a>Rafael A Irizarry, Christine Ladd-Acosta, Bo Wen, Zhijin Wu, Carolina<br /> Montano, Patrick Onyango, Hengmi Cui, Kevin Gabo, Michael Rongione,<br /> Maree Webster, et al. The human colon cancer methylome shows similar<br /> hypo-and hypermethylation at conserved tissue-specific cpg island shores.<br /> Nature genetics, 41(2):178–186, 2009.
</p>
<p class="bibitem" style="text-align: left;">
[8] <a id="Xirizarry2001assessing"></a>Rafael A Irizarry, Clarke Tankersley, Robert Frank, and Susan<br /> Flanders. Assessing homeostasis through circadian patterns. Biometrics,<br /> 57(4):1228–1237, 2001.
</p>
<p class="bibitem" style="text-align: left;">
[9] <a id="Xvan2002gene"></a>Laura J van’t Veer, Hongyue Dai, Marc J Van De Vijver, Yudong D<br /> He, Augustinus AM Hart, Mao Mao, Hans L Peterse, Karin van der Kooy,<br /> Matthew J Marton, Anke T Witteveen, et al. Gene expression profiling<br /> predicts clinical outcome of breast cancer. nature, 415(6871):530–536, 2002.
</p>
</div>
Sunday data/statistics link roundup (5/12/2013, Mother's Day!)
2013-05-12T22:29:17+00:00
http://simplystats.github.io/2013/05/12/sunday-datastatistics-link-roundup-5122013-mothers-day
<ol>
<li><span style="line-height: 16px;">A tutorial on <a href="http://deeplearning.net/tutorial/">deep-learning</a>, I really enjoyed reading it, but I’m still trying to figure out how this is different than non-linear logistic regression to estimate features then supervised prediction using those features? Or maybe I’m just naive….</span></li>
<li>Rafa on <a href="http://www.80grados.net/la-importancia-de-la-autonomia-politica-para-las-ciencias/">political autonomy for science</a> for a blog in PR called <a href="http://www.80grados.net/">80 grados. </a> He writes about Rep. Lamar Smith and then focuses more closely on issues related to the University of Puerto Rico. A very nice read. (via Rafa)</li>
<li><a href="http://deadspin.com/infographic-is-your-states-highest-paid-employee-a-co-489635228">Highest paid employees by state</a>. I should have coached football…</li>
<li><a href="http://www.motherjones.com/kevin-drum/2013/05/groundbreaking-isaac-newton-invention-youve-never-heard">Newton took the mean.</a> It warms my empirical heart to hear about how the theoretical result was backed up by averaging (via David S.)</li>
<li>Reinhart and Rogoff <a href="http://www.cnbc.com/id/100721630">publish a correction but stand by their original claims</a>. I’m not sure whether this is a good or a bad thing. But it definitely is an overall win for reproducibility.</li>
<li>Statesy folks are getting some much-deserved attention. Terry Speed is a <a href="http://royalsociety.org/people/terence-speed/">Fellow of the Royal Society</a>, Peter Hall is a <a href="http://www.nasonline.org/news-and-multimedia/news/2013_04_30_NAS_Election.html">foreign associate of the NAS</a>, Gareth Roberts is <a href="http://royalsociety.org/people/gareth-roberts/">also a Fellow of the Royal Society</a> (via Peter H.)</li>
<li><a href="http://www.nytimes.com/2013/05/06/business/media/solving-equation-of-a-hit-film-script-with-data.html?src=rechp&_r=1&">Statisticians go to the movies </a>and <a href="http://well.blogs.nytimes.com/2013/05/08/are-hot-hands-in-sports-for-real/">the hot hand analysis makes the NY Times</a> (via Dan S.)</li>
</ol>
<p><strong>Bonus Link! </strong> Karl B.’s Github <a href="http://www.statsblogs.com/2013/05/10/tutorials-on-gitgithub-and-gnu-make/">tutorial is awesome</a> and every statistician should be required to read it. I only ask why he gives all the love to Nacho’s admittedly awesome <a href="https://github.com/nachocab/clickme">Clickme package</a> and no love to <a href="http://healthvis.org/">healthvis</a>, we are on <a href="https://github.com/hcorrada/healthvis">Github too</a>!</p>
A Shiny web app to find out how much medical procedures cost in your state.
2013-05-08T17:09:08+00:00
http://simplystats.github.io/2013/05/08/a-shiny-web-app-to-find-out-how-much-medical-procedures-cost-in-your-state
<p>Today the <a href="http://www.huffingtonpost.com/2013/05/08/hospital-prices-cost-differences_n_3232678.html">front page of the Huffington Post featured</a> the <a href="https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/index.html">data released by the Centers for Medicare &amp; Medicaid Services</a> that shows the cost of many popular procedures broken down by hospital. We here at Simply Statistics think you should be able to explore these data more easily. So we asked <a href="http://biostat.jhsph.edu/~jmuschel/">John Muschelli</a> to help us build a Shiny App that allows you to interact with these data. You can choose your state and your procedure and see how much the procedure costs at hospitals in your state. It takes a second to load because it is a lot of data….</p>
<p><a href="http://glimmer.rstudio.com/muschellij2/Shiny_Health_Data/">Here is the link the app. </a></p>
<p>Here are some screenshots for intracranial hemorrhage for the US and for Idaho.</p>
<p><a href="http://simplystatistics.org/2013/05/08/a-shiny-web-app-to-find-out-how-much-medical-procedures-cost-in-your-state/screen-shot-2013-05-08-at-4-57-56-pm/" rel="attachment wp-att-1317"><img class="alignnone size-full wp-image-1317" alt="Screen Shot 2013-05-08 at 4.57.56 PM" src="http://simplystatistics.org/wp-content/uploads/2013/05/Screen-Shot-2013-05-08-at-4.57.56-PM.png" width="516" height="439" srcset="http://simplystatistics.org/wp-content/uploads/2013/05/Screen-Shot-2013-05-08-at-4.57.56-PM-300x255.png 300w, http://simplystatistics.org/wp-content/uploads/2013/05/Screen-Shot-2013-05-08-at-4.57.56-PM.png 516w" sizes="(max-width: 516px) 100vw, 516px" /></a><a href="http://simplystatistics.org/2013/05/08/a-shiny-web-app-to-find-out-how-much-medical-procedures-cost-in-your-state/screen-shot-2013-05-08-at-4-58-09-pm/" rel="attachment wp-att-1318"><img class="alignnone size-full wp-image-1318" alt="Screen Shot 2013-05-08 at 4.58.09 PM" src="http://simplystatistics.org/wp-content/uploads/2013/05/Screen-Shot-2013-05-08-at-4.58.09-PM.png" width="549" height="460" srcset="http://simplystatistics.org/wp-content/uploads/2013/05/Screen-Shot-2013-05-08-at-4.58.09-PM-300x251.png 300w, http://simplystatistics.org/wp-content/uploads/2013/05/Screen-Shot-2013-05-08-at-4.58.09-PM.png 549w" sizes="(max-width: 549px) 100vw, 549px" /></a>\</p>
<p><a href="https://github.com/muschellij2/Shiny_Health_Data">The R code is here</a> if you want to tweak/modify.</p>
Why the current over-pessimism about science is the perfect confirmation bias vehicle and we should proceed rationally
2013-05-06T14:30:41+00:00
http://simplystats.github.io/2013/05/06/why-the-current-over-pessimism-about-science-is-the-perfect-confirmation-bias-vehicle-and-we-should-proceed-rationally
<p>Recently there have been some high profile flameouts in scientific research. A couple examples include <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">the Duke saga</a>, <a href="http://simplystatistics.org/2012/07/03/replication-and-validation-in-omics-studies-just-as/">the replication issues in social sciences</a>, <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704">p-value hacking</a>, <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2114571&http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2114571">fabricated data</a>, <a href="http://www.michaeleisen.org/blog/?p=1312">not enough open-access publication</a>, and on and on.</p>
<p>Some of these results have had major non-scientific consequences, which is the reason they have drawn so much attention both inside and outside of the academic community. For example, the Duke saga <a href="http://www.nytimes.com/2011/07/08/health/research/08genes.html?_r=0">made the New York Times</a>, the lack of replication has led to high-profile arguments between scientists in <a href="http://blogs.discovermagazine.com/notrocketscience/?p=7765#.UYfhJitKnKo">Discover</a> and <a href="http://www.nature.com/news/replication-studies-bad-copy-1.10634">Nature</a> among other outlets, and the <a href="http://www.businessinsider.com/why-the-reinhart-rogoff-excel-debacle-could-be-devastating-for-the-austerity-movement-2013-4">Reinhart-Rogoff analysis has been attacked</a> (sometimes <a href="http://www.colbertnation.com/the-colbert-report-videos/425748/april-23-2013/austerity-s-spreadsheet-error">comically</a>) <a href="http://simplystatistics.org/2013/04/19/podcast-7-reinhart-rogoff-reproducibility/">because of a lack of reproducibility</a>.</p>
<p>The result of this high-profile attention is that there is a movement on to “<a href="http://www.newyorker.com/online/blogs/newsdesk/2012/12/cleaning-up-science.html">clean up science</a>”. As has been pointed out, there is a group of scientists who are making names for themselves primarily as critics of what is wrong with the scientific process. The good news is that these key players are calling attention to issues: reproducibility, replicability, and open access, among others, that are critically important for the scientific enterprise.</p>
<p>I too am concerned about these issues and have altered my own research process to try to address them for my own research group. I also think that the solutions others have proposed on a larger scale like <a href="http://www.alltrials.net/">alltrials.net</a> or <a href="http://www.plos.org/">PLoS</a> are great advances for the scientific community.</p>
<p>I am also very worried that people are using a few high-profile cases to hyperventilate about the real, solvable, and recognized problems in the scientific process. These people get credit and a lot of attention for pointing out how science is “failing”. But they aren’t giving proportional time to all of the incredible success stories we have had, both in performing research and in reforming research with reproducibility, open access, and replication initiatives.</p>
<p>We should recognize that science is hard and even dedicated, diligent, and honest scientists will make mistakes, perform irreproducible or irreplicable studies, or publish in closed access journals. Sometimes this is because of ignorance of good research principles, sometimes it is because people are new to working in a world where data/computation are a major player, and sometimes it is because it is legitimately, really hard to make real advances in science. I think people who participate in real science recognize these problems and are eager to solve them. I also have noticed that real scientists generally try to propose a solution when they complain about these issues.</p>
<p>But it seems like sometimes people use these high-profile mistakes out of context to push their own scientific pet peeves. For example:</p>
<ol>
<li><strong>I don’t like p-values and there are lots of results that fail to replicate so it must be the fault of p-values.</strong> Many studies fail to replicate not because the researchers used p-values, but because they performed studies that were either weak or had poorly understood scientific mechanisms.</li>
<li><strong>I don’t like not being able to access people’s code so lack of reproducibility is causing science to fail. </strong>Even in the two most infamous cases (Potti and Reinhart - Rogoff) the problem with the science wasn’t reproducibility - it was that the analysis was incorrect/flawed. Reproducibility compounded the problem but wasn’t the root cause of the problem.</li>
<li><strong>I don’t like not being able to access scientific papers so closed-access journals are evil. </strong>For whatever reason (I don’t know if I understand why) it is expensive to publish journals. Someone pays either way: <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">publishing open access is expensive</a> and closed access journals are expensive too. If I’m a junior researcher, I’ll definitely post my preprints online, but I also want papers in “good” journals and don’t have a ton of grant money, so sometimes I’ll choose closed access.</li>
<li><strong>I don’t like these crazy headlines from social psychology (substitute other field here) and there have been some that haven’t replicated, so none must replicate. </strong>Of course some papers won’t replicate, including even high profile papers. If you are doing statistics, then by definition some papers won’t replicate since you have to make a decision on noisy data.</li>
</ol>
<p>These are just a few examples where I feel like a basic, fixable flaw in science has been used to justify a hugely pessimistic view of science in general. I’m not saying it is all rainbows and unicorns. Of course we want to improve the process. But I’m worried that the rational, reasonable problems we have, with enough hyperbole, will make it look like the sky is falling on the scientific process and will leave the door open for individuals like Rep. Lamar Smith to come in and <a href="http://www.huffingtonpost.com/2013/04/30/lamar-smith-science-peer-review_n_3189107.html?utm_hp_ref=politics">turn the scientific process into a political one</a>.</p>
<p>P.S. <a href="http://andrewgelman.com/2013/05/06/against-optimism-about-social-science/#more-18943">Andrew Gelman</a> posted on a similar topic yesterday as well. He argues the case for less optimism and for making sure we don’t stay complacent. He added a P.S. and mentioned two points on which we can agree: (1) science is hard and is a human system and we are working to fix the flaws inherent in such systems and (2) that it is still easier to publish a splashy claim than to publish a correction. I do definitely agree with both. I think Gelman would also likely agree that we need to be careful about <a href="http://simplystatistics.org/2013/04/30/reproducibility-and-reciprocity/">reciprocity</a> with these issues. If earnest scientists work hard to address reproducibility, replicability, open access, etc. then people who criticize them should have to work just as hard to justify their critiques. Just because it is a critique doesn’t mean it should automatically get the same treatment as the original paper.</p>
Talking about MOOCs on MPT Direct Connection
2013-05-06T09:01:06+00:00
http://simplystats.github.io/2013/05/06/talking-about-moocs-on-mpt-direct-connection
<p style="font-size: 11px; font-family: Arial, Helvetica, sans-serif; color: #808080; margin-top: 5px; background: transparent; text-align: center; width: 512px;">
Watch <a style="text-decoration: none !important; font-weight: normal !important; height: 13px; color: #4eb2fe !important;" href="http://video.mpt.tv/video/2365006588" target="_blank">Monday, April 29, 2013</a> on PBS. See more from <a style="text-decoration: none !important; font-weight: normal !important; height: 13px; color: #4eb2fe !important;" href="http://www.mpt.org/dc" target="_blank">Direct Connection.</a>
</p>
<p>I appeared on Maryland Public Television’s Direct Connection with Jeff Salkin last Monday to talk about MOOCs (along with our Dean Mike Klag).</p>
Reproducibility at Nature
2013-05-02T17:22:32+00:00
http://simplystats.github.io/2013/05/02/reproducibility-at-nature
<p>Nature has jumped on to the reproducibility bandwagon and has <a href="http://www.nature.com/news/announcement-reducing-our-irreproducibility-1.12852">announced</a> a new approach to improving reproducibility of submitted papers. The new effort is focused primarily on methodology, including statistics, and on making sure that it is clear what an author has done.</p>
<blockquote>
<p>To ease the interpretation and improve the reliability of published results we will more systematically ensure that key methodological details are reported, and we will give more space to methods sections. We will examine statistics more closely and encourage authors to be transparent, for example by including their raw data.</p>
</blockquote>
<p>To this end they have created a <a href="http://www.nature.com/authors/policies/checklist.pdf">checklist</a> for highlighting key aspects that need to be clear in the manuscript. A number of these points are statistical, and two specifically highlight data deposition and computer code availability. I think an important change is the following:</p>
<blockquote>
<p>To allow authors to describe their experimental design and methods in as much detail as necessary, the participating journals, including <em>Nature</em>, will abolish space restrictions on the methods section.</p>
</blockquote>
<p>I think this is particularly important because of the message it sends. Most journals have overall space limitations and some journals even have specific limits on the Methods section. This sends a clear message that “methods aren’t important, results are”. Removing space limits on the Methods section will allow people to just say what they actually did, rather than figure out some tortured way to summarize years of work into a smattering of key words.</p>
<p>I think this is a great step forward by a leading journal. The next step will be for Nature to stick to it and make sure that authors live up to their end of the bargain.</p>
Reproducibility and reciprocity
2013-04-30T09:58:47+00:00
http://simplystats.github.io/2013/04/30/reproducibility-and-reciprocity
<p>One element about the entire discussion about reproducible research that I haven’t seen talked about very much is the potential for the lack of reciprocity. I think even if scientists were not concerned about the possibility of getting scooped by others by making their data/code available this issue would be sufficient to give people pause about making their work reproducible.</p>
<p>What do I mean by reciprocity? Consider the following (made up) scenario:</p>
<ol>
<li>I conduct a study (say, a randomized controlled trial, for concreteness) that I register at clinicaltrials.gov beforehand and specify details about the study like the design, purpose, and primary and secondary outcomes.</li>
<li>I rigorously conduct the study, ensuring safety and privacy of subjects, collect the data, and analyze the data.</li>
<li>I publish the results for the primary and secondary outcomes in the peer-reviewed literature where I describe how the study was conducted and the statistical methods that were used. For the sake of concreteness, let’s say the results were “significant” by whatever definition of significant you care to use and that the paper was highly influential.</li>
<li>Along with publishing the paper I make the analytic dataset and computer code available so that others can look at what I did and, if they want, reproduce the result.</li>
</ol>
<p>So far so good right? It seems this would be a great result for any study. Now consider the following possible scenarios:</p>
<ol>
<li>Someone obtains the data and the code from the web site where it is hosted, analyzes it, and then publishes a note claiming that the intervention negatively affected a different outcome not described in the original study (i.e. not one of the primary or secondary outcomes).</li>
<li>A second person obtains the data, analyzes it, and then publishes a note on the web claiming that the intervention was ineffective for the primary outcome in the subset of participants who were male.</li>
<li>A third person obtains the data, analyzes the data, and then publishes a note on the web saying that the study is flawed and that the original results of the paper are incorrect. No code, data, or details of their methods are given.</li>
</ol>
<p>Now, how should one react to the follow-up note claiming the study was flawed? It’s easy to imagine a spectrum of possible responses ranging from accusations of fraud to staunch defenses of the original study. Because the original study was influential, there is likely to be a kerfuffle either way.</p>
<p>But what’s the problem with the three follow-up scenarios described? The one thing that they have in common is that none of the three responding people were subjected to the same standards to which the original investigator (me) was subjected. I was required to register my trial and state the outcomes in advance. In an ideal world you might argue I should have stated my hypotheses in advance too. That’s fine, but the point is that the people analyzing the data subsequently were not required to do any of this. Why should they be held to a lower standard of scrutiny?</p>
<p>The first person analyzed a different outcome that was not a primary or secondary outcome. How many outcomes did they test before they came to that one negatively significant result? The second person examined a subset of the participants. Was the study designed (or powered) to look at this subset? Probably not. The third person claims fraud, but does not provide any details of what they did.</p>
<p>I think it’s easy to take care of the third person–just require that they make their work reproducible too. That way we can all see what they did and verify that there was in fact fraud. But the first two people are a little more difficult. If there are no barriers to obtaining the data, then they can just get the data and run a bunch of analyses. If the results don’t go their way, they can just move on and no one would be the wiser. If they did, they can try to publish something.</p>
<p>What I think a good reproducibility policy should have is a type of “viral” clause. For example, the GNU General Public License (GPL) is an open source software license that requires, among other things, that anyone who writes their own software, but links to or integrates software covered under the GPL, must publish their software under the GPL too. This “viral” requirement ensures that people cannot make use of the efforts of the open source community without also giving back to that community. There have been numerous heated discussions in the software community regarding the pros and cons of such a clause, with (large) commercial software developers often coming down against it. Open source developers have largely been skeptical of the arguments of large commercial developers, claiming that those companies simply want to “steal” open source software and/or maintain their dominance.</p>
<p>I think it is important that if we are going to make reproducibility the norm in science, that we have analogous “viral” clauses to ensure that everyone is held to the same standard. This is particularly important in policy-relevant or in politically sensitive subject areas where there are often parties involved who have essentially no interest (and are in fact paid to have no interest) in holding themselves to the same standard of scientific conduct.</p>
<p>Richard Stallman was right to assume that without the <a href="http://en.wikipedia.org/wiki/Copyleft">copyleft clause</a> in the GPL that large commercial interests would simply usurp the work of the free software community and essentially crush it before it got started. Reproducibility needs its own version of copyleft or else scientists will be left to defend themselves against unscrupulous individuals who are not held to the same standard.</p>
Sunday data/statistics link roundup (4/28/2013)
2013-04-28T22:31:21+00:00
http://simplystats.github.io/2013/04/28/sunday-datastatistics-link-roundup-4282013
<ol>
<li><a href="http://mathwithbaddrawings.com/2013/04/25/were-all-bad-at-math-1-i-feel-stupid-too/">What it feels like to be bad at math</a>. My personal experience like this culminated in some difficulties <a href="http://en.wikipedia.org/wiki/Green's_function">with Green’s functions</a> back in my early days at USU. I think almost everybody who does enough math eventually runs into a situation where they don’t understand what is going on and it stresses them out.</li>
<li><a href="http://www.nytimes.com/2013/04/28/technology/how-big-data-is-playing-recruiter-for-specialized-workers.html?_r=0">An article</a> about companies that are using data to try to identify people for jobs (via Rafa).</li>
<li><a href="http://www.forbes.com/sites/davidleinweber/2013/04/26/big-data-gets-bigger-now-google-trends-can-predict-the-market/">Google trends for predicting the market</a>. I’m not sure that “predicting” is the right word here. I think a better word might be “explaining/associating”. I also wonder if <a href="http://www.nature.com/news/when-google-got-flu-wrong-1.12413">this could go off the rails</a>.</li>
<li><a href="http://www.r-bloggers.com/faster-higher-stonger-a-guide-to-speeding-up-r-code-for-busy-people/?utm_source=feedly&utm_medium=feed&utm_campaign=Feed:+RBloggers+(R+bloggers)">This article</a> is worth reading in terms of describing the ways that you can speed up R code. My favorite part of it is that it starts with the “why”. Exactly. <a href="http://en.wikiquote.org/wiki/Donald_Knuth">Premature optimization is the root of all evil</a>.</li>
<li><a href="http://blog.mortardata.com/post/47549853491/data-science-at-tumblr">A discussion of data science at Tumblr</a>. The author/speaker <a href="http://www.adamlaiacano.com/">also has a great blog</a>.</li>
</ol>
Mindlessly normalizing genomics data is bad - but ignoring unwanted variability can be worse
2013-04-26T10:49:08+00:00
http://simplystats.github.io/2013/04/26/mindlessly-normalizing-genomics-data-is-bad-but-ignoring-unwanted-variability-can-be-worse
<p>Yesterday, and bleeding over into today, <a href="http://www.ncbi.nlm.nih.gov/pubmed/12538238">quantile normalization</a> (QN) was being discussed on Twitter. This is the <a href="https://twitter.com/mbeisen/status/327563522185764864">tweet</a> that started the whole thing off. The conversation went a bunch of different directions and then this happened:</p>
<blockquote>
<p>well, this happens all over bio-statistics - ie, naive use in seemingly undirected ways until you get a “good” pvalue. And then end</p>
</blockquote>
<p>So Jeff and I felt it was important to respond - since we are biostatisticians that work in genomics. We felt a couple of points were worth making:</p>
<ol>
<li><strong>Most statisticians we know, including us, know QN’s limitations and are always nervous about using QN</strong>. But with most datasets we see, unwanted variability is overwhelming and we are left with no choice but to normalize in order to extract anything useful from the data. In fact, many times QN is not enough and we have to apply further transformations, e.g., to remove <a href="http://www.ncbi.nlm.nih.gov/pubmed/20838408">batch effects</a>.</li>
<li><strong>We would be curious to know which biostatisticians were being referred to. </strong>We would like some examples, because most of the genomic statisticians we know work very closely with biologists to aid them in cleaning dirty data to help them find real sources of signal. Furthermore, we encourage biologists to validate their results. In many cases, quantile normalization (or other transforms) are critical to finding results that validate and there is a long literature (both biological and statistical) supporting the importance of appropriate normalization.</li>
<li><strong>Assuming the data that you get (sequences, probe intensities, etc.) from high-throughput tech = direct measurement of abundance is incorrect.</strong> Before worrying about QN (or other normalization) being an arbitrary transformation that distorts the data, keep in mind that what you want to measure has already been distorted by PCR, the imperfections of the microarray, scanner measurement error, image bleeding, cross hybridization or alignment artifacts, ozone effects, etc…</li>
</ol>
<p>To go into a little more detail about the reasons that normalization is important in many cases, I have written more below, with data, if you are interested.</p>
<!--more-->
<p>Most, if not all, the high throughput data we have analyzed needs some kind of normalization. This applies to both microarrays and next-gen sequencing. To demonstrate why, below I include 5 boxplots of log intensities from 5 microarrays that were hybridized to the same RNA (technical replicates).</p>
<p><a href="http://simplystatistics.org/2013/04/26/mindlessly-normalizing-genomics-data-is-bad-but-ignoring-unwanted-variability-can-be-worse/screen-shot-2013-04-25-at-11-12-20-pm/" rel="attachment wp-att-1216"><img class="wp-image-1216 alignleft" alt="Screen shot 2013-04-25 at 11.12.20 PM" src="http://simplystatistics.org/wp-content/uploads/2013/04/Screen-shot-2013-04-25-at-11.12.20-PM.png" width="285" height="271" srcset="http://simplystatistics.org/wp-content/uploads/2013/04/Screen-shot-2013-04-25-at-11.12.20-PM-300x285.png 300w, http://simplystatistics.org/wp-content/uploads/2013/04/Screen-shot-2013-04-25-at-11.12.20-PM.png 475w" sizes="(max-width: 285px) 100vw, 285px" /></a></p>
<p>See the problem? If we took the data at face value we would conclude that there is a large (almost 2 fold) global change in expression when comparing, say, samples C and E. But they are technical replicates so the observed difference is not biologically driven. Discrepancies like these are the rule rather than the exception. Biologists seem to underestimate the amount of unwanted variability present in the data they produce. Look at enough data and you will quickly learn that, in most cases, unwanted experimental variability dwarfs the biological differences we are interested in discovering. Normalization is the statistical technique that saves biologists millions of dollars a year by fixing this problem in silico rather than redoing the experiment.</p>
<p>For the data above you might be tempted to simply standardize the data by subtracting the median. But the problem is more complicated than that as shown in the plot below. This plot shows the log ratio (M) versus the average of the log intensities (A) for two technical replicates in which 16 probes (red dots) have been “spiked-in” to have true fold changes of 2. The other ~20,000 probesets (blue streak) are supposed to be unchanged (M=0). See the curvature of the genes that are supposed to be at 0? Taken at face value, thousands of the low expressed probes exhibit larger differential expression than the only 16 that are actually different. That’s a problem. And standardizing by subtracting the median won’t fix it. Non-linear biases such as this one are also quite common.<a href="http://simplystatistics.org/2013/04/26/mindlessly-normalizing-genomics-data-is-bad-but-ignoring-unwanted-variability-can-be-worse/screen-shot-2013-04-25-at-11-14-20-pm/" rel="attachment wp-att-1218"><img class=" wp-image-1218 alignright" alt="Screen shot 2013-04-25 at 11.14.20 PM" src="http://simplystatistics.org/wp-content/uploads/2013/04/Screen-shot-2013-04-25-at-11.14.20-PM.png" width="483" height="275" /></a></p>
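<p>For readers who want to play with a plot like this, here is a small sketch that builds an MA plot from simulated intensities (random numbers, not the spike-in data shown in the figure):</p>

```r
# Simulate raw intensities for two technical replicates of ~20,000 probes
set.seed(1)
x <- 2^rnorm(20000, mean = 8)
y <- x * 2^rnorm(20000, sd = 0.2)   # same truth plus measurement noise

M <- log2(y) - log2(x)              # log ratio
A <- (log2(x) + log2(y)) / 2        # average log intensity

plot(A, M, pch = ".", xlab = "A (average log2 intensity)",
     ylab = "M (log2 ratio)")
abline(h = 0, col = "red")          # unchanged probes should scatter around M = 0
```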
<p>QN offers one solution to this problem if you can assume that the true distribution of what you are measuring is roughly the same across samples. Briefly, QN forces each sample to have the same distribution. The after picture above is the result of QN. It removes the curvature but preserves most of the real differences.</p>
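<p>To make the idea concrete, here is a hand-rolled sketch of quantile normalization on simulated data. It is only meant to illustrate the algorithm; in practice you would use an established implementation, for example the one in Bioconductor’s preprocessCore package.</p>

```r
set.seed(1)
# A genes x samples matrix of log intensities; sample 3 gets an artificial global shift
mat <- matrix(rnorm(5000 * 5), ncol = 5)
mat[, 3] <- mat[, 3] + 1

quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  ref    <- rowMeans(sorted)            # reference distribution: mean of the sorted columns
  apply(ranks, 2, function(r) ref[r])   # map each sample's ranks onto the reference
}

qn <- quantile_normalize(mat)
boxplot(mat, main = "Before QN")        # sample 3 is visibly shifted
boxplot(qn,  main = "After QN")         # all samples now share the same distribution
```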
<p>So why should we be nervous? QN and other normalization techniques risk throwing the baby out with the bath water. What if there is a real global difference? If there is, and you use QN, you will miss it and you may introduce artifacts. <em>But the assumptions are no secret and it’s up to the biologists to decide if they are reasonable.</em> At the same time, we have to be very careful about interpreting large scale changes given that we see large scale changes when we know there are none. Other than cases where global differences are forced or simulated, I have yet to see a good example in which QN causes more harm than good. I’m sure there are some real data examples out there, so if you have one please share, as I would love to use it as an example in class.</p>
<p>Also note that statisticians (including me) are working hard at deciphering ways to normalize without the need for such strong assumptions. Although in their first incarnation they were useless, current control probes/transcripts techniques are promising. We have used them in the past to <a href="http://www.ncbi.nlm.nih.gov/pubmed/20858772">normalize methylation data</a> (a similar approach was used <a href="http://www.ncbi.nlm.nih.gov/pubmed/23101621">here</a> for gene expression data). And then there is <a style="font-size: 16px;" href="http://www.ncbi.nlm.nih.gov/pubmed/20976876">subset quantile normalization</a>. I am sure there are others and more to come. So Biologists, don’t worry, we have your backs and serve at your pleasure. In the meantime don’t be so afraid of QN: at least give it a try before you knock it.</p>
Interview at Yale Center for Environmental Law & Policy
2013-04-23T10:00:44+00:00
http://simplystats.github.io/2013/04/23/interview-at-yale-center-for-environmental-law-policy
<p><a href="http://vimeo.com/64067594">Interview with Roger Peng</a> from <a href="http://vimeo.com/ycelp">YCELP</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
<p>A few weeks ago I sat down with Angel Hsu of the Yale Center for Environmental Law and Policy to talk about some of their work on air pollution indicators.</p>
<p>(Note: I haven’t moved–I still work at the John<strong>s</strong> Hopkins School of Public Health.)</p>
Nevins-Potti, Reinhart-Rogoff
2013-04-21T21:35:41+00:00
http://simplystats.github.io/2013/04/21/nevins-potti-reinhart-rogoff
<p>There’s an interesting parallel between the <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">Nevins-Potti debacle</a> (a true debacle, in my mind) and the recent <a href="http://simplystatistics.org/2013/04/19/podcast-7-reinhart-rogoff-reproducibility/">Reinhart-Rogoff kerfuffle</a>. Both were exposed via some essentially small detail that had nothing to do with the real problem.</p>
<p>In the case of Reinhart-Rogoff, the Excel error was what made them look ridiculous, but it was in fact the “unconventional weighting” of the data that had the most dramatic effect. Furthermore, ever since the paper had come out, academic economists were debating and challenging its conclusions from the get go. Even when legitimate scientific concerns were raised, policy-makers and other academics were not convinced. As soon as the Excel error was revealed, everything needed to be re-examined.</p>
<p>In the Nevins-Potti debacle, Baggerly and Coombes wrote article after article pointing out all the problems and, for the most part, no one in a position of power really cared. The Nevins-Potti errors were real zingers too, not some trivial Excel error (i.e. switching the labels between people with disease and people without disease). But in the end, it took Potti’s claim of being a Rhodes Scholar to bring him down. Clearly, the years of academic debate beforehand were meaningless compared to lying on a CV.</p>
<p>In the Reinhart-Rogoff case, reproducibility was an issue and if the data had been made available earlier, the problems would have been discovered earlier and perhaps that would have headed off years of academic debate (for better or for worse). In the Nevins-Potti example, reproducibility was not an issue–the original Nature Medicine study was done using public data and so was reproducible (although it would have been easier if code had been made available). The problem there is that no one listened.</p>
<p>One has to wonder if the academic system is working in this regard. In both cases, it took a minor, but <em>personal</em>, failing to bring down the entire edifice. But the protestations of reputable academics, challenging the research on the merits, were ignored. I’d say in both cases the original research conveniently said what people wanted to hear (debt slows growth, personalized gene signatures can predict response to chemotherapy), and so no amount of research would convince people to question the original findings.</p>
<p>One also has to wonder whether reproducibility is of any help here. I certainly don’t think it hurts, but in the case of Nevins-Potti, where the errors were shockingly obvious to anyone paying attention, the problems were deemed merely technical (i.e. statistical). The truth is, reproducibility will be most necessary in highly technical and complex analyses where it’s often not obvious how an analysis is done. If you can show a flaw in an analysis that is complicated, what’s the use if your work will be written off as merely concerned with technical details (as if those weren’t important)? Most of the news articles surrounding Reinhart-Rogoff characterized the problems as complex and statistical (i.e. not important) and not concerned with fundamental questions of interest.</p>
<p>In both cases, I think science was used to push an external agenda, and when the science was called into question, it was difficult to back down. I’ll write more in a future post about these kinds of situations and what, if anything, we can do to improve matters.</p>
Podcast #7: Reinhart, Rogoff, Reproducibility
2013-04-19T15:27:52+00:00
http://simplystats.github.io/2013/04/19/podcast-7-reinhart-rogoff-reproducibility
<p>Jeff and I talk about the recent Reinhart-Rogoff reproducibility kerfuffle and how it turns out that data analysis is really hard no matter how big the dataset.</p>
I wish economists made better plots
2013-04-16T18:14:59+00:00
http://simplystats.github.io/2013/04/16/i-wish-economists-made-better-plots
<p>I’m seeing lots of traffic on a big-time economics article by Reinhart and Rogoff that failed to reproduce, and here are my quick thoughts. You can read a pretty good summary here by <a href="http://www.nextnewdeal.net/rortybomb/researchers-finally-replicated-reinhart-rogoff-and-there-are-serious-problems">Mike Konczal</a>.</p>
<p>Quick background: Carmen Reinhart and Kenneth Rogoff wrote an <a href="http://www.nber.org/papers/w15639.pdf">influential paper</a> that was used by many to justify the need for austerity measures taken by governments to reduce debts relative to GDP. Yesterday, Thomas Herndon, Michael Ash, and Robert Pollin (HAP) <a href="http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04deaa6388b1/publication/566/">released a paper</a> where they reproduced the Reinhart-Rogoff (RR) analysis and noted a few irregularities or errors. In their abstract, HAP claim that they “find that coding errors, selective exclusion of available data, and unconventional weighting of summary statistics [in the RR analysis] lead to serious errors that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies in the post-war period.”</p>
<p>It appears there were three points made by HAP: (1) RR excluded some important data from their final analysis; (2) RR weighted countries in a manner that was <em>not</em> proportional to the number of years they contributed to the dataset (RR used equal weighting of countries); and (3) there was an error in RR’s Excel formula which resulted in them inadvertently leaving out five countries from their final analysis.</p>
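<p>The weighting issue is easy to illustrate with a toy example (the numbers below are made up and are not RR’s or HAP’s): suppose one country contributes a single bad year to the high debt/GDP category while another contributes 19 mostly good years.</p>

```r
# Made-up average growth rates (%) for two countries in the high-debt category
growth <- c(countryA = -7.9, countryB = 2.4)
years  <- c(countryA = 1,    countryB = 19)   # country-years contributed

mean(growth)                  # equal weight per country (as described for RR): -2.75
weighted.mean(growth, years)  # weight by country-years contributed:             1.885
```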
<p>The bottom line is shown in HAP’s Figure 1, which I reproduce below (on the basis of fair use):</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2013/04/Screen-Shot-2013-04-16-at-5.48.06-PM.png"><img class="alignright size-full wp-image-1173" alt="HAP Analysis" src="http://simplystatistics.org/wp-content/uploads/2013/04/Screen-Shot-2013-04-16-at-5.48.06-PM.png" width="768" height="992" srcset="http://simplystatistics.org/wp-content/uploads/2013/04/Screen-Shot-2013-04-16-at-5.48.06-PM-232x300.png 232w, http://simplystatistics.org/wp-content/uploads/2013/04/Screen-Shot-2013-04-16-at-5.48.06-PM.png 768w" sizes="(max-width: 768px) 100vw, 768px" /></a></p>
<p>From the plot you can see that the HAP’s adjusted analysis (circles) more or less coincides with RR’s analysis (diamonds) except for the last categories of countries with debt/GDP ratios over 90%. In that category RR’s analysis shows a large drop in growth whereas HAP’s analysis shows a more or less smooth decline (but still positive growth).</p>
<p>To me, it seems that the incorrect Excel formula is a real error, but easily fixed. It also seemed to have the least impact on the final analysis. The other two problems, which had far bigger impacts, might have some explanation that I’m not aware of. I am not an economist so I await others to weigh in. RR apparently do not comment on the exclusion of certain data points or on the weighting scheme so it’s difficult to say what the thinking was, whether it was inadvertent or purposeful.</p>
<p>In summary, so what? Here’s what I think:</p>
<ol>
<li><strong>Is there some fishiness?</strong> Sure, but this is not the Potti-Nevins scandal a la economics. I suppose it’s possible RR manipulated the analysis to get the answer austerity hawks were looking for, but we don’t have the evidence yet and this just doesn’t feel like that kind of thing.</li>
<li><strong>What’s the counterfactual?</strong> Or, what would have happened if the analysis had been done the way HAP propose? Would the world have embraced pro-growth policies by taking on a greater debt burden? My guess is no. Austerity hawks would have found some other study that supported their claims (and in fact there was at least one other).</li>
<li>RR’s original analysis did not contain a plot like Figure 1 in HAP’s analysis, which I personally find very illuminating. From HAP’s figure, you can see that there’s quite a bit of variation across countries and perhaps an overall downward trend. I’m not sure I would have dramatically changed my conclusion if I had done the HAP analysis instead of the RR analysis. My point is that <strong>plots like this, which <em>show the variability</em>, are very important</strong>.</li>
<li><strong>People see what they want to see</strong>. I would not be surprised to see some claim that HAP’s analysis supports the austerity conclusion because growth under high debt loads is much lower (almost 50%!) than under low debt loads.</li>
<li><strong>If RR’s analysis had been correct, should they have even made the conclusions they made?</strong> RR indicated that there was a “threshold” at 90% debt/GDP. My experience is that statements about thresholds are generally very hard to make, even with good data. I wonder what other more knowledgeable people think of the original conclusions.</li>
<li><strong>If the data had been made available sooner, this problem would have been fixed sooner</strong>. But in my opinion, that’s all that would have happened.</li>
</ol>
<p>The vibe on the Internets seems to be that if only this problem had been identified sooner, the world would be a better place. But my cynical mind says, uh, no. You can toss this incident in the very large bucket of papers with some technical errors that are easily fixed. Thankfully, someone found these errors and fixed them, and that’s a good thing. Science moves on.</p>
<p>UPDATE: Reinhart-Rogoff <a href="http://www.slate.com/blogs/moneybox/2013/04/16/reinhart_and_rogoff_respond_researchers_say_high_debt_is_associated_with.html">respond</a>.</p>
<p>UPDATE 2: Reinhart-Rogoff more <a href="http://blogs.wsj.com/economics/2013/04/17/reinhart-rogoff-admit-excel-mistake-rebut-other-critiques/">detailed response</a>.</p>
Data science only poses a threat to (bio)statistics if we don't adapt
2013-04-15T15:19:16+00:00
http://simplystats.github.io/2013/04/15/data-science-only-poses-a-threat-to-biostatistics-if-we-dont-adapt
<p>We have previously mentioned on this blog how <a href="http://simplystatistics.org/2012/08/14/statistics-statisticians-need-better-marketing/">statistics needs better marketing</a>. Recently, Karl B. has suggested that “<a href="http://kbroman.wordpress.com/2013/04/05/data-science-is-statistics/">Data science is statistics</a>” and Larry W. has wondered if “<a href="http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/">Data science is the end of statistics?</a>” I think there are a couple of types of data science and that each has a different relationship to the discipline of academic statistics:</p>
<ol>
<li><strong>Data science as marketing tool</strong>. Data analytics, data science, big data, etc. are terms that companies who already did something (IT infrastructure, consulting, database management, etc.) throw around to make them sound like they are doing the latest and greatest thing. These marketers are dabblers in what I would call the real “science of data” or maybe deal with just one part of the data pipeline. I think they pose no threat to the statistics community other than by generating backlash by over promising on the potential of data science or diluting the term to the point of being almost non-sensical.</li>
<li><strong>Data science as business analytics.</strong> Another common use of “data science” is to describe the exact same set of activities that used to be performed by business analytics people, maybe allowing for some growth in the size of the data sets. This might be a threat to folks who do statistics in business schools - although more likely it will be beneficial to those programs as there is growth in the need for business-oriented statisticians.</li>
<li><strong>Data science as big data engineer</strong> Sometimes data science refers to people who do stuff with huge amounts of data. Larry refers to this in his post when he talks about people <a href="http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/">working on billions of data points</a>. Most classically trained statisticians aren’t comfortable with data of this size. But at places like Google - where big data sets are routine - the infrastructure is built <a href="http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/">so that statisticians can access and compress the parts of the data that they need</a> to do their jobs. I don’t think this is necessarily a threat to statistics; but we should definitely be integrating data access into our curriculum.</li>
<li><strong>Data science as replacement for statistics </strong>Some people (and I think it is the minority) are exactly referring to things that statisticians do when they talk about data science. This means manipulating, collecting, and analyzing data, then making inferences to a population or predictions about what will happen next. This is, of course, a threat to statisticians. Some places, like <a href="http://analytics.ncsu.edu/">NC State</a> and <a href="http://idse.columbia.edu/">Columbia</a>, are tackling this by developing centers/institutes/programs with data science in the name. But I think that is a little dangerous. The data don’t matter - it is the problem you can solve with the data. So the key thing is that these institutes need to focus on solving real problems - not just churning out people who know a little R, a little SQL, and a little Python.</li>
</ol>
<p>So why is #4 happening? I think one reason is reputation. Larry mentions that a statistician produces an estimate and a confidence interval and maybe the confidence interval is too wide. I think he is on to something there, but I think it is a bigger problem. <a href="http://simplystatistics.org/2012/06/22/statistics-and-the-science-club/">As Roger has pointed out</a> - statisticians often see themselves as referees - rather than scientists/business people. So a lot of people have the experience of going to a statistician and feeling like they have been criticized for bad experimental design, too small a sample size, etc. These issues are hugely important - but sometimes you have to make do with what you have. I think data scientists in category 4 are taking advantage of a cultural tendency of statisticians to avoid making concrete decisions.</p>
<p>A second reason is that some statisticians have avoided getting their hands dirty. “Hands clean” statisticians don’t get the data from the database, or worry about the data munging, or match identifiers, etc. They wait until the data are nicely formatted in a matrix to apply their methods. To stay competitive, we need to produce more “hands dirty” statisticians who are willing to go beyond <a href="http://simplystatistics.org/2012/05/28/schlep-blindness-in-statistics/">schlep blindness</a> and handle all aspects of a data analysis. In academia, we can encourage this by incorporating more of those issues into our curriculum.</p>
<p>Finally, I think statisticians’ focus on optimality hurts us. Our field grew up in an era where data were sparse and we had to squeeze every last ounce of information out of what little data we had. Those constraints led to a cultural focus on optimality to a degree that is no longer necessary when data are abundant. In fact, an abundance of data is <a href="http://www.youtube.com/watch?v=yvDCzhbjYWs">often unreasonably effective even with suboptimal methods</a>. “Data scientists” understand this and shoot for the 80% solution that is good enough in most cases.</p>
<p>In summary I don’t think statistics will be killed off by data science. Most of the hype around data science is actually somewhat removed from our field (see above). But I do think that it is worth considering some potential changes that reposition our discipline as the most useful for answering questions with data. Here are some concrete proposals:</p>
<ol>
<li>Remove some theoretical requirements and add computing requirements to statistics curricula.</li>
<li>Focus on statistical writing, presentation, and communication as a main part of the curriculum.</li>
<li>Focus on positive interactions with collaborators (being a scientist) rather than immediately going to the referee attitude.</li>
<li>Add a unit on translating scientific problems to statistical problems.</li>
<li>Add a unit on data munging and getting data from databases.</li>
<li>Integrate real and live data analyses into our curricula.</li>
<li>Make all our students create an R package (a data product) before they graduate.</li>
<li>Most important of all have a “big tent” attitude about what constitutes statistics.</li>
</ol>
Sunday data/statistics link roundup (4/14/2013)
2013-04-14T10:36:29+00:00
http://simplystats.github.io/2013/04/14/sunday-datastatistics-link-roundup-4142013
<ol>
<li><a href="http://storify.com/Kalido/most-influential-data-scientists-on-twitter">The most influential data scientists on Twitter</a>, featuring Amy Heineike, Hilary Mason, and a few other familiar names to readers of this blog. In other news, I love reading list of the “Top K _<em>__</em>” as much as the next person. I love them even more when they are quantitative (the list above isn’t) - even when the quantification is totally bogus. (via John M.)</li>
<li>Rod Little and our own Tom Louis <a href="http://www.huffingtonpost.com/rod-little/decennial-census_b_3046611.html?utm_hp_ref=science">over at the Huffingtonpost</a> talking about the ways in which the U.S. Census supports our democracy. It is a very good piece and I think highlights the critical importance that statistics and data play in keeping government open and honest.</li>
<li><a href="http://www.nytimes.com/2013/04/08/health/for-scientists-an-exploding-world-of-pseudo-academia.html?src=me&ref=general&_r=1&">An article</a> about the growing number of fake academic journals and their potential predatory practices. I think I’ve been able to filter out the fake journals/conferences pretty well (if they’ve invited 30 Nobel Laureates - probably fake). But this poses big societal problems; how do we tell what is real science from what is fake if you don’t have inside knowledge about which journals are real? (via John H.)</li>
<li><a href="https://www.capitalbikeshare.com/trip-history-data">Trip history data</a> on the DC Capital Bikeshare. One of my favorite things is when a government organization just opens up its data. The best part is that the files are formatted as csv’s. Clearly someone who knows that the best data formats are open, free, and easy to read into statistical software. In other news, I think one of the most important classes that could be taught is “How to share data 101” (via David B.)</li>
<li>A slightly belated link to a <a href="http://blogs.sas.com/content/jmp/2013/03/29/george-box-a-remembrance/">remembrance of George Box.</a> He was the one who said, “All models are wrong, but some are useful.” An absolute titan of our field.</li>
<li>Check out these <a href="http://exp.lore.com/post/47740806673/mexico-based-designer-alan-betacourt-has-created">cool logotypes for famous scientists</a>. I want one! Also, see the article on these awesome <a href="http://www.brainpickings.org/index.php/2012/09/26/hydrogene-women-in-science-posters/">minimalist posters celebrating legendary women in science</a>. I want the Sally Ride poster on a t-shirt.</li>
<li>As an advisor, I aspire to treat my students/postdocs <a href="https://twitter.com/hunterwalk/status/323294179046326273/photo/1">like this</a>. (<a href="https://twitter.com/hunterwalk">@hunterwalk</a>). I’m not always so good at it, but those are some good ideals to try to live up to.</li>
</ol>
Great scientist - statistics = lots of failed experiments
2013-04-12T15:25:44+00:00
http://simplystats.github.io/2013/04/12/great-scientist-statistics-lots-of-failed-experiments
<p><a href="http://en.wikipedia.org/wiki/E._O._Wilson">E.O. Wilson</a> is a famous evolutionary biologist. He is currently an emeritus professor at Harvard and just this last week dropped <a href="http://online.wsj.com/article/SB10001424127887323611604578398943650327184.html">this little gem</a> in the Wall Street Journal. In the piece, he suggests that knowing mathematics is not important for becoming a great scientist. Wilson goes even further, suggesting that you can be mathematically semi-literate and still be an amazing scientist. There are two key quotes in the piece that I think deserve special attention:</p>
<blockquote>
<p>Fortunately, exceptional mathematical fluency is required in only a few disciplines, such as particle physics, astrophysics and information theory. Far more important throughout the rest of science is the ability to form concepts, during which the researcher conjures images and processes by intuition.</p>
</blockquote>
<p>I agree with this quote in general <a href="http://krugman.blogs.nytimes.com/2013/04/09/doing-the-math/">as does Paul Krugman</a>. Many scientific areas don’t require advanced measure theory, differential geometry, or number theory to make big advances. It seems like this is the kind of mathematics to which E.O. Wilson is referring, and on that point I think there is probably universal agreement that you can have a hugely successful scientific career without knowing about measurable spaces.</p>
<p>Wilson doesn’t stop there, however. He goes on to paint a much broader picture about how one can pursue science without the aid of even basic mathematics or statistics, and this is where I think he goes off the rails a bit:</p>
<blockquote>
<p>Ideas in science emerge most readily when some part of the world is studied for its own sake. They follow from thorough, well-organized knowledge of all that is known or can be imagined of real entities and processes within that fragment of existence. When something new is encountered, the follow-up steps usually require mathematical and statistical methods to move the analysis forward. If that step proves too technically difficult for the person who made the discovery, a mathematician or statistician can be added as a collaborator.</p>
</blockquote>
<p>I see two huge problems with this statement:</p>
<ol>
<li>Poor design of experiments is one of the most common reasons, if not the most common, for an experiment to fail. It is so important that Fisher said, “To consult the statistician after an experiment is finished is often merely to ask him to conduct a <em>post mortem</em> examination. He can perhaps say what the experiment died of.” Wilson is suggesting that with careful conceptual thought and some hard work you can do good science, but without a fundamental understanding of basic math, statistics, and study design even the best conceived experiments are likely to fail (a short sketch of a basic design calculation follows this list).</li>
<li>While armchair science was likely the norm when Wilson was in his prime, huge advances have been made in both science and technology. Scientifically, it is difficult to synthesize and understand everything that has been done without some basic understanding of the statistical quality of previous experiments. Similarly, as data collection has evolved statistics and computation are playing a more and more central role. As Rafa has pointed out, <a href="http://simplystatistics.tumblr.com/post/21914291274/people-in-positions-of-power-that-dont-understand">people in positions of power who don’t understand statistics are a big problem for science</a>.</li>
</ol>
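<p>As a rough illustration of the first point, here is a minimal sketch in R (mine, not Wilson’s or Fisher’s) of the kind of design calculation that basic statistical literacy makes routine: working out how many subjects an experiment needs before it is run, rather than performing the post mortem afterwards.</p>
<pre class="brush: r; title: ; notranslate" title=""># A hypothetical design calculation: how many subjects per group are needed
# to detect a difference of half a standard deviation between two groups?
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)

# Roughly 64 subjects per group. An experiment run with, say, 10 per group
# would most likely "die" of low power -- exactly the failure Fisher's
# post mortem quote is about.
</pre>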
<p>More importantly, as we live in an increasingly data rich environment both in the sciences and in the broader community - basic statistical and numerical literacy are becoming more and more important. While I agree with Wilson that we should try not to discourage people who have a difficult first encounter with math from pursuing careers in science, I think it is both disingenuous and potentially disastrous to downplay the importance of quantitative skill at the exact moment in history that those skills are most desperately needed.</p>
<p>As a counter proposal to Wilson’s idea that we should encourage people to disregard quantitative sciences I propose that we build a better infrastructure for ensuring all people interested in the sciences are able to improve their quantitative skills and literacy. Here at Simply Stats we are all about putting our money where our mouth is and we have already started by creating <a href="http://simplystatistics.org/courses/">free, online versions</a> of our quantitative courses. Maybe Wilson should take one….</p>
Climate Science Day on Capitol Hill
2013-04-10T10:00:36+00:00
http://simplystats.github.io/2013/04/10/climate-science-day-on-capitol-hill
<p>A few weeks ago I participated in the fourth annual Climate Science Day organized by the ASA and a host of other professional and scientific societies. There’s a nice write up of the event written by Steve Pierson over at <a href="http://magazine.amstat.org/blog/2013/04/01/csdapril2013/">Amstat News</a>. There were a number of statisticians there besides me, but the vast majority of people were climate modelers, atmospheric scientists, agronomists, and the like. Below is our crack team of scientists outside the office of (Dr.) Andy Harris. Might be the only time you see me wearing a suit.</p>
<p><img class="alignright size-medium wp-image-1149" alt="IMG_3783" src="http://simplystatistics.org/wp-content/uploads/2013/04/IMG_3783-300x225.jpg" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2013/04/IMG_3783-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2013/04/IMG_3783-1024x768.jpg 1024w" sizes="(max-width: 300px) 100vw, 300px" /></p>
<p>The basic idea behind the day is to get scientists who do climate-related research into the halls of Congress to introduce themselves to members of Congress and make themselves available for scientific consultations. I was there (with Brooke Anderson, the other JHU rep) because of some of my work on the health effects of heat. I was paired up with Tony Broccoli, a climate modeler at Rutgers, as we visited the various offices of New Jersey and Maryland legislators. We also talked to staff from the Senate Health, Education, Labor, and Pensions (HELP) committee.</p>
<p>Here are a few things I learned:</p>
<ul>
<li>It was fun. I’d never been to Congress before so it was interesting for me to walk around and see how people work. Everyone (regardless of party) was super friendly and happy to talk to us.</li>
<li>The legislature appears to be run by women. Seriously, I think every staffer we met with (but one) was a woman. Might have been a coincidence, but I was not expecting that. We only met with one actual member of Congress, and that was (Dr.) Andy Harris from Maryland’s first district.</li>
<li>Climate change is not really on anyone’s radar. Oh well, we were there 3 days before the sequester hit so there were understandably other things on their minds. Waxman-Markey was the most recent legislation taken up by the House and it went nowhere in the Senate.</li>
<li>The Senate HELP committee has PhDs working on its staff. Didn’t know that.</li>
<li>Staffers are working on like 90 things at once, probably none of which are related to each other. That’s got to be a tough job.</li>
<li>I used more business cards on this one day than in my entire life.</li>
<li>Senate offices are way nicer than House offices.</li>
<li>The people who write our laws are around 22 years old. Maybe 25 if they went to law school. I’m cool with that, I think.</li>
</ul>
NIH is looking for an Associate Director for Data Science: Statisticians should consider applying
2013-04-08T16:22:07+00:00
http://simplystats.github.io/2013/04/08/nih-is-looking-for-an-associate-director-for-data-science-statisticians-should-consider-applying
<p>NIH understands the importance of data and several months ago they announced this new position. Here is an excerpt from <a href="http://www.jobs.nih.gov/vacancies/executive/adds.htm">the ad</a>:</p>
<blockquote>
<p>The ADDS will focus on the urgent need and increased opportunities for capitalizing on the expanding collections of biomedical data to advance NIH’s mission. In doing so, the incumbent will provide programmatic NIH-wide leadership for areas of data science that relate to data emanating from many areas of study (e.g., genomics, imaging, and electronic heath records). This will require knowledge about multiple domains of study as well as familiarity with approaches for integrating data from these various domains.</p>
</blockquote>
<p>In my opinion, the person holding this job should have hands-on experience with data analysis and programming. The <del>nuisances</del> nuances of what a data analyst needs to successfully do his/her job can’t be overstated. This knowledge will help this director make the right decisions when it comes to choosing what data to make available and how to make it available. When it comes to creating data resources, good intentions don’t always translate into usable products.</p>
<p>In this new era of data-driven science this position will be highly influential, making it quite attractive. If you know of a statistician you think would be interested, please pass along the information.</p>
Introducing the healthvis R package - one line D3 graphics with R
2013-04-02T10:00:39+00:00
http://simplystats.github.io/2013/04/02/introducing-the-healthvis-r-package-one-line-d3-graphics-with-r
<p dir="ltr">
We have been a little slow on the posting for the last couple of months here at Simply Stats. That’s bad news for the blog, but good news for our research programs!
</p>
<p>Today I’m announcing the new <a style="font-size: 16px" href="http://healthvis.org/">healthvis</a> R package that is being developed by my student <a style="font-size: 16px" href="http://www.biostat.jhsph.edu/~prpatil/">Prasad Patil </a>(who needs a website like yesterday), <a style="font-size: 16px" href="http://www.cbcb.umd.edu/~hcorrada/">Hector Corrada Bravo</a>, and myself*. The basic idea is that I have loved <a style="font-size: 16px" href="http://d3js.org/">D3 interactive graphics</a> for a while. But they are hard to create from scratch, since they require knowledge of both Javascript and the D3 library.</p>
<p>Even with those skills, it can take a while to develop a new graphic. On the other hand, I know a lot about R and am often analyzing biomedical data where interactive graphics could be hugely useful. There are a couple of really useful tools for creating interactive graphics in R, most notably <a style="font-size: 16px" href="http://www.rstudio.com/shiny/">Shiny</a>, which is awesome. But these tools still require a bit of development to get right and are designed for “stand alone” tools.</p>
<p>So we created an R package that builds specific graphs that come up commonly in the analysis of health data like survival curves, heatmaps, and <a style="font-size: 16px" href="http://library.mpib-berlin.mpg.de/ft/mg/MG_Using_2009.pdf">icon arrays</a>. For example, here is how you make an interactive survival plot comparing treated to untreated individuals with healthvis:</p>
<pre class="brush: r; title: ; notranslate" title=""># Load libraries
library(healthvis)
library(survival)
# Run a cox proportional hazards regression
cobj <- coxph(Surv(time, status)~trt+age+celltype+prior, data=veteran)
# Plot using healthvis - one line!
survivalVis(cobj, data=veteran, plot.title="Veteran Survival Data", group="trt", group.names=c("Treatment", "No Treatment"), line.col=c("#E495A5","#39BEB1"))
</pre>
<p>The “survivalVis” command above produces an interactive graphic <a style="font-size: 16px" href="http://healthviz.appspot.com/display/hs_20001">like this</a>. Here it is embedded (you may have to scroll to see the dropdowns on the right - we are working on resizing)</p>
<p>The advantage of this approach is that you can make common graphics interactive without a lot of development time. Here are some other unique features:</p>
<ul>
<li>
<p dir="ltr">
The graphics are hosted on Google App Engine. With one click you can get a permanent link and share it with collaborators.
</p>
</li>
<li>
<p dir="ltr">
With another click you can get the code to embed the graphics in your website.
</p>
</li>
<li>
<p dir="ltr">
If you have already created D3 graphics it only takes a few minutes to <a href="http://healthvis.wordpress.com/develop/">develop a healthvis version</a> to let R users create their own - email us and we will make it part of the healthvis package!
</p>
</li>
<li>
<p dir="ltr">
healthvis is totally general - you can develop graphics that don’t have anything to do with health using our framework. Just email us at <a href="mailto:healthvis@gmail.com">healthvis@gmail.com</a> to get in touch if you want to be a developer.
</p>
</li>
</ul>
<p>We have started a blog over at <a style="font-size: 16px" href="http://healthvis.org/">healthvis.org</a> where we will be talking about the tricks we learn while developing D3 graphics, updates to the healthvis package, and generally talking about visualization for new technologies like those developed by the CCNE and individualized health. If you are interested in getting involved as a developer, user or tester, drop us a line and let us know. In the meantime, happy visualizing!</p>
<p><em>* This project is supported by the <a href="http://ccne.inbt.jhu.edu/">JHU CCNE</a> (U54CA151838) and the Johns Hopkins <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/">inHealth initiative</a>.</em></p>
An instructor's thoughts on peer-review for data analysis in Coursera
2013-03-26T10:52:42+00:00
http://simplystats.github.io/2013/03/26/an-instructors-thoughts-on-peer-review-for-data-analysis-in-coursera
<p>I used peer-review for the data analysis course I just finished. As I mentioned in the <a href="http://simplystatistics.org/2013/03/25/podcast-6-data-analysis-mooc-post-mortem/">post-mortem podcast</a> I knew in advance that it was likely to be the most controversial component of the class. So it wasn’t surprising that based on feedback in the discussion boards and on this blog, the peer review process is by far the thing students were most concerned about.</p>
<p>But to evaluate complete data analysis projects at scale there is no other economically feasible alternative. To give you an idea, I have our local students perform 3 data analyses in an 8-week term here at Johns Hopkins. There are generally 10-15 students in that class and I estimate that I spend around an hour reading each analysis, digesting what was done, and writing up comments. That means I usually spend almost an entire weekend grading just for 10-15 data analyses. If you extrapolate that out to the 5,000 or so people who turned in data analysis assignments, it is clearly not possible for me to do all the grading.</p>
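<p>To put that extrapolation in rough numbers (my back-of-the-envelope arithmetic, not figures from the course):</p>
<pre class="brush: r; title: ; notranslate" title=""># Assuming roughly 1 hour of grading per submitted data analysis
analyses <- 5000
hours <- analyses * 1
hours / 40 # about 125 forty-hour work weeks of grading for a single assignment
</pre>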
<p>Another alternative would be to pay trained data analysts to grade all the assignments. Of course that would be expensive - you couldn’t farm it out to the Mechanical Turk. If you want a better/more consistent grading scheme than peer review you’d need to hire highly trained data analysts to do it, and that would be very expensive. While Johns Hopkins has been incredibly supportive in terms of technical support and giving me the flexibility to pursue the class, it is definitely something I did on my own time and with a lot of my own resources. It isn’t clear that it makes sense for Hopkins to pour huge resources into really high-quality grading. At the same time, I’m not sure Coursera could afford to do this for all of the classes where peer review is needed, as they are just a startup.</p>
<p>So I think that at least for the moment, peer review is the best option for grading. This has big implications for the value of the Coursera statements of accomplishment in classes where peer review is necessary. I think that it would benefit Coursera hugely to do some research on how to ensure/maintain quality in peer review (Coursera - if you are reading this and you have some $$ you want to send my way to support some students/postdocs I have some ideas on how to do that). The good news is that the amazing Coursera platform collects so much data that it is possible to do that kind of research.</p>
Podcast #6: Data Analysis MOOC Post-mortem
2013-03-25T13:34:01+00:00
http://simplystats.github.io/2013/03/25/podcast-6-data-analysis-mooc-post-mortem
<p>Jeff and I talk about Jeff’s recently completed MOOC on Data Analysis.</p>
Sunday data/statistics link roundup (3/24/2013)
2013-03-24T10:00:42+00:00
http://simplystats.github.io/2013/03/24/sunday-datastatistics-link-roundup-3242013
<ol>
<li><span style="font-size: 16px">My Coursera Data Analysis class is done for now! All the lecture notes </span><a style="font-size: 16px" href="https://github.com/jtleek/dataanalysis">are on Github</a><span style="font-size: 16px"> all the videos </span><a style="font-size: 16px" href="http://www.youtube.com/user/jtleek2007/videos?sort=dd&tag_id=UC8xNPQ-3a5t9uMU7Vah-jWA.3.coursera&view=46">are on Youtube</a><span style="font-size: 16px">. They are tagged by week with tags “Week x”.</span></li>
<li>After ENAR the comments on how to have better stats conferences started flowing. Check out <a href="http://alyssafrazee.wordpress.com/2013/03/18/ideas-for-super-awesome-conferences/">Frazee</a>, <a href="http://yihui.name/en/2013/03/on-enar-or-statistical-meetings-in-general/">Xie</a>, and <a href="http://kbroman.wordpress.com/2013/03/19/enar-highs-and-lows/">Broman</a>. My favorite cherry picked ideas: conference app (frazee), giving the poster session more focus (frazee), free and announced wifi (broman), more social media (i loved following ENAR <a href="https://twitter.com/search/realtime?q=%23ENAR2013&src=hash">on twitter</a> but wish there had been more tweeting) (xie), add some jokes to talks (xie).</li>
<li>A related post is this one from Hilary M. on how a talk s<a href="http://www.hilarymason.com/speaking/speaking-entertain-dont-teach/">hould entertain, not teach</a>.</li>
<li>This is an <a href="http://blogs.spectator.co.uk/books/2013/03/interview-with-a-writer-jaron-lanier/">interview with Jaron Lanier</a> I found via AL Daily. My favorite lines? “You run into this attitude, that if ordinary people cannot set their Facebook privacy settings, then they deserve what is coming to them. There is a hacker superiority complex to this.” I think this is certainly something we have a lot of in statistics as well.</li>
<li>The CIA wants to <a href="http://www.rawstory.com/rs/2013/03/21/cias-big-data-mission-collect-everything-and-hang-onto-it-forever/">collect all the dataz</a>. Call me when cat videos become important for national security, ok guys?</li>
<li>Given I just completed my class, the <a href="http://www.katyjordan.com/MOOCproject.html">MOOC completion rates</a> graph is pretty appropriate. I think my #’s are right in line with what other people report. I’m still trying to figure out how to know how many people “completed” the class.</li>
</ol>
Youtube should check its checksums
2013-03-21T22:37:35+00:00
http://simplystats.github.io/2013/03/21/youtube-should-check-its-checksums
<p>I am in the process of uploading the video lectures for <a href="https://www.coursera.org/course/dataanalysis">Data Analysis</a>. I am getting ready to send out the course wrap-up email and I wanted to include the link to the Youtube playlist as well.</p>
<p>Unfortunately, Youtube keeps reporting that a pair of the videos in week 2 are duplicates. This is true despite them being different lengths (12:15 vs. 16:58), having different titles, and having dramatically different content. I <a href="http://productforums.google.com/forum/#!topic/youtube/Yc7hHqwtBX0">found this explanation</a> on the forums:</p>
<blockquote>
<p>YouTube uses a checksum to determine duplicates. The chances of having two different files containing different content but have the same checksum would be astronomical.</p>
</blockquote>
<p>That isn’t on the <a href="http://support.google.com/youtube/bin/answer.py?hl=en&answer=58139">official Google documentation page</a>, which is pretty sparse, but is the only description I can find of how Youtube checks for duplicate content. A <a href="http://en.wikipedia.org/wiki/Checksum">checksum</a> is a function applied to a video’s data that (ideally) yields, with high probability, different values for different videos and the same value when the same video is uploaded again. One possible checksum function could be the length of the video. Obviously that won’t work in general because many videos might be 2 minutes exactly.</p>
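<p>To make the idea concrete, here is a small sketch in R of checksum-based duplicate detection. MD5 (via tools::md5sum) stands in for whatever function Youtube actually uses, which I don’t know, and the file names below are hypothetical.</p>
<pre class="brush: r; title: ; notranslate" title=""># A sketch of checksum-based duplicate detection. MD5 is a stand-in for
# Youtube's (unknown) checksum; these file names are hypothetical and
# assumed to exist on disk.
library(tools)

files <- c("week2_lecture1.mp4", "week2_lecture2.mp4")
sums <- md5sum(files)
print(sums)

# Flag the upload as a duplicate if any two files share a checksum
if (any(duplicated(sums))) {
  message("Duplicate detected: identical checksums")
} else {
  message("All files distinct according to MD5")
}
</pre>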
<p>Regardless, it looks like Youtube can’t distinguish my lecture videos. I’m thinking Vimeo or something else if I can’t get this figured out. Of course, if someone has a suggestion (short of re-exporting the videos from Camtasia) that would allow me to circumvent this problem I’d love to hear it!</p>
<p><strong>Update</strong>: <em>I ended up fiddling with the videos and got them to upload. Thanks to the helpful comments!</em></p>
Call for papers for a special issue of Statistical Analysis and Data Mining
2013-03-19T11:06:32+00:00
http://simplystats.github.io/2013/03/19/call-for-papers-for-a-special-issue-of-statistical-analysis-and-data-mining
<p>David Madigan sends the following. It looks like a really interesting place to submit papers for both statisticians and data scientists, so submit away!</p>
<blockquote>
<p>Statistical Analysis and Data Mining, An American Statistical Association Journal</p>
<div>
Call for Papers
</div>
<div>
Special Issue on Observational Healthcare Data
</div>
<div>
Guest Editors: Patrick Ryan, J&J and Marc Suchard, UCLA
</div>
<div>
Due date: July 1, 2013
</div>
<div>
Data sciences is the rapidly evolving field that integrates mathematical and statistical knowledge, software engineering and large-scale data management skills, and domain expertise to tackle difficult problems that typically cannot be solved by any one discipline alone. Some of the most difficult, and arguably most important, problems exist in healthcare. Knowledge about human biology has exponentially advanced in the past two decades with exciting progress in genetics, biophysics, and pharmacology. However, substantial opportunities exist to extend the evidence base about human disease, patient health and effects of medical interventions and translate knowledge into actions that can directly impact clinical care. The emerging availability of 'big data' in healthcare, ranging from prospective research with aggregated genomics and clinical trials to observational data from administrative claims and electronic health records through social media, offer unprecedented opportunities for data scientists to contribute to advancing healthcare through the development, evaluation, and application of novel analytical solutions to explore these data to generate evidence at both the patient and population level. Statistical and computational challenges abound and methodological progress will draw on fields such as data mining, epidemiology, medical informatics, and biostatistics to name but a few. This special issue of Statistical Analysis and Data Mining seeks to capture the current state of the art in healthcare data sciences. We welcome contributions that focus on methodology for healthcare data and original research that demonstrates the application of data sciences to problems in public health.
</div>
<div>
<a href="http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1932-1872" target="_blank">http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1932-1872</a>
</div>
</blockquote>
Sunday data/statistics link roundup (3/17/13)
2013-03-17T10:06:16+00:00
http://simplystats.github.io/2013/03/17/sunday-datastatistics-link-roundup-31713
<ol>
<li><span style="line-height: 15.989583969116211px;"><a href="http://blog.revolutionanalytics.com/2013/03/a-map-of-worldwide-email-traffic-created-with-r.html">A post</a> on the Revolutions blog about an analysis of the worldwide email traffic patterns. The corresponding paper is also<a href="http://arxiv.org/pdf/1303.0045v1.pdf"> pretty interesting</a>. The best part is the whole analysis was done in R. </span></li>
<li><a href="http://www.nytimes.com/2013/03/13/education/california-bill-would-force-colleges-to-honor-online-classes.html?hpw&_r=0">A bill</a> in California that would require faculty approved online classes to be given credit. I think this is potentially game changing if it passes - depending on who has to do the approving. If there is local control within departments it could be huge. On the other hand, as I’ll discuss later this week, there is still some ground to be made up before I think MOOCs are ready for prime time credit in areas outside of the very basics.</li>
<li>A pretty amazing blog post about a survival analysis of <a href="http://badhessian.org/lipsyncing-for-your-life-a-survival-analysis-of-rupauls-drag-race/">RuPaul’s drag race</a>. Via Hadley.</li>
<li>If you are a statistician hiding under a rock you missed the <a href="http://www.nytimes.com/2013/03/12/science/putting-a-value-to-real-in-medical-research.html?_r=0">NY Times messing up P-values</a>. The statistical blogosphere came out swinging. <a href="http://andrewgelman.com/2013/03/12/misunderstanding-the-p-value/">Gelman</a>, <a href="http://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/">Wasserman</a>, <a href="http://hilaryparker.com/2013/03/12/about-that-pvalue-article/">Parker</a>, etc.</li>
<li>As a statistician who is pretty fired up about the tech community, I can get lost a bit in the hype as much as the next guy. I thought this article was <a href="http://www.sfgate.com/technology/dotcommentary/article/Innovation-and-the-face-of-capitalism-4342160.php">pretty sobering</a>. I think the way to make sure we keep innovating is having the will to fund long term companies and long term research. Look at how it paid off with Amazon…</li>
<li>Broman <a href="http://kbroman.wordpress.com/2013/03/16/why-arent-all-of-our-graphs-interactive/">on interactive graphics</a> is worth a read. I agree that more of our graphics should be interactive, but there is an inherent tension/tradeoff in graphics, similar to the bias variance tradeoff. I’m sure there is a real word for it but it is the flexibility vs. understandability tradeoff. Too much interaction and its hard to see what is going on, not enough and you might as well have made a static graph.</li>
</ol>
Postdoctoral fellow position in reproducible research
2013-03-14T10:00:41+00:00
http://simplystats.github.io/2013/03/14/postdoctoral-fellow-position-in-reproducible-research
<p>We are looking to recruit a postdoctoral fellow to work on developing tools to make scientific research more easily reproducible. We’re looking for someone who wants to work on (and solve!) real research problems in the biomedical sciences and address the growing need for reproducible research tools. The position would be in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health and would be jointly advised by Jeff and myself.</p>
<p><strong>Qualifications</strong>: PhD in statistics, biostatistics, computer science, or related field; strong programming skills in R and Perl/Python/C; excellent written and oral communication skills; serious moxie</p>
<p><strong>Additional Information</strong>: Informal questions about the position can be sent to Dr. Roger Peng at rpeng @ jhsph.edu. Applications will be considered as they arrive.</p>
<div title="Page 1">
<p>
To apply, send a cover letter describing your research interests and interest in the position, a CV, and the names of three references. In your application, please reference "Reproducible Research postdoctoral fellowship". Application materials should be emailed to Dr. Roger Peng at rpeng @ jhsph.edu.
</p>
<p>
Applications from minority and female candidates are especially encouraged. Johns Hopkins University is an AA/EOE.
</p>
</div>
Here's my #ENAR2013 Wednesday schedule
2013-03-13T07:00:13+00:00
http://simplystats.github.io/2013/03/13/heres-my-enar2013-wednesday-schedule
<p>Here are my picks for ENAR sessions today (Wednesday):</p>
<ul>
<li>8:30-10:15am: <strong>Large Data Visualization and Exploration</strong>, Grand Ballroom 4 (make sure you stay till the end to see Karl Broman); <strong>Innovative Methods in Causal Inference with Applications to Mediation, Neuroimaging, and Infectious Diseases</strong>, Grand Ballroom 8A; <strong>Next Generation Sequencing</strong>, Grand Ballroom 5</li>
<li>10:30am-12:15pm: <strong>Statistical Information Integration of -Omics Data</strong>, Grand Ballrooms 1 & 2</li>
</ul>
<p>Okay, so this schedule actually requires me to split myself in to three separate entities. However, if you find a way to do that, the 8:30-10:15am block is full of good stuff.</p>
<p>Have fun!</p>
If I were at #ENAR2013 today, here's where I'd go
2013-03-12T07:00:39+00:00
http://simplystats.github.io/2013/03/12/if-i-were-at-enar2013-today-heres-where-id-go
<p>This week is the annual ENAR meeting, the big biostatistics conference, in Orlando, Florida. It actually started on Sunday but I haven’t gotten around to looking at the program (obviously, I’m not there right now). Flipping through the <a href="http://www.enar.org/meetings2013/2013_program.pdf">program</a> now, here’s what looks good to me for Tuesday:</p>
<ul>
<li><span style="line-height: 16px">8:30-10:15am: <strong>Functional Neuroimaging Decompositions</strong>, Grand Ballroom 3 </span></li>
<li>10:30am-12:15pm: Hmm…I guess you should go to the <strong>Presidential Invited Address</strong>, Grand Ballroom 7</li>
<li>1:45-3:30pm: <strong>JABES Showcase</strong>, Grand Ballroom 8A; <strong>Statistical Body Language: Analytical Methods for Wearable Computing</strong>, Grand Ballroom 4</li>
<li>3:45-5:30pm: <strong>Big Data: Wearable Computing, Crowdsourcing, Space Telescopes, and Brain Imaging</strong>, Grand Ballroom 8A; <strong>Sample Size Planning for Clinical Development</strong>, Grand Ballroom 6</li>
</ul>
<p>That’s right, you can pack in <em>two</em> sessions on wearable computing today if you want. I’ll post tomorrow for what looks good on Wednesday.</p>
Sunday data/statistics link roundup (3/10/13)
2013-03-10T22:11:16+00:00
http://simplystats.github.io/2013/03/10/sunday-datastatistics-link-roundup-31013
<ol>
<li><a style="font-size: 16px;" href="http://aleadeum.wordpress.com/2013/03/11/14-to-40-percent-of-medical-research-are-false-positives-yet-another-calculation/">This</a> <span style="font-size: 16px;">is an outstanding follow up analysis to </span><a style="font-size: 16px;" href="http://arxiv.org/abs/1301.3718">our paper</a> <span style="font-size: 16px;">on the rate of false discoveries in the medical literature. I hope that the author of the blog post will consider submitting it for publication in a journal, I think it is worth having more methodology out there in this area. </span></li>
<li>If you are an academic in statistics and aren’t following <a href="https://twitter.com/kwbroman">Karl</a> and <a href="https://twitter.com/tslumley">Thomas</a> on Twitter, you should be. Also check out Karl’s (mostly) <a href="http://kbroman.wordpress.com/2013/03/10/towards-making-my-own-papers-reproducible/">reproducible paper</a>.</li>
<li><a href="http://online.wsj.com/article/SB10001424127887323478304578332850293360468.html">An article</a> in the WSJ that I think I received about 40 times this week. The <a href="http://blogs.wsj.com/numbersguy/the-upbeat-stats-on-statistics-1216/">online version</a> has a quote from our own <a href="http://www.bcaffo.com/">B-Caffo</a>. It is a really good read. If you are into this, it seems like the interviews with <a href="http://simplystatistics.org/2012/10/19/interview-with-rebecca-nugent-of-carnegie-mellon/">Rebecca Nugent</a> (where we discuss growing undergrad programs) and <a href="http://simplystatistics.org/2012/01/20/interview-with-joe-blitzstein/">Joe Blitzstein</a> where we discuss stats ed are relevant. I thought this quote was hugely relevant, “The bulk of the people coming out [with statistics degrees] are technically competent but they’re missing the consultative and the soft skills, everything else they need to be successful” We are focusing heavily on both components of these skills in the grad program here at Hopkins - so if people are looking for awesome data people, just let us know!</li>
<li><a href="http://www.fangraphs.com/blogs/index.php/sloan-analytics-farhan-zaidi-on-as-analytics/#more-116534">A cool discussion</a> of how the A’s look for players with “positive residuals” - positive value missed by the evaluations of other teams. (via Rafa)</li>
<li><a href="http://www.nytimes.com/2013/03/10/magazine/the-professor-the-bikini-model-and-the-suitcase-full-of-trouble.html?nl=todaysheadlines&emc=edit_th_20130310&_r=0">The physicist and the bikini model</a>. If you haven’t read it, you must be living under a rock. (via Alex N.)</li>
<li><a href="http://www.nytimes.com/2013/02/28/technology/ibm-exploring-new-feats-for-watson.html?hp&_r=1&pagewanted=all&">An interesting article</a> about how IBM is using Watson to come up with new recipes based on the data from old recipes. I’m a little suspicious of the Spanish crescent though - no butter?!</li>
<li>You should vote for Steven Salzberg for the <a href="http://www.bioinformatics.org/franklin/">Ben Franklin award</a>. The dude has come up huge for open software and we should come up huge for him. Gotta vote today though.</li>
<li><a href="http://www.youtube.com/results?search_query=harlem+shake&oq=harlem+shake&gs_l=youtube.3..35i39l2j0l2j0i3j0l2j0i3j0l2.135.2511.0.2623.17.13.2.0.0.0.490.1645.8j4j4-1.13.0...0.0...1ac.1.Eibxr2zw9B8">The Harlem Shake</a> has killed more than one of my lunch hours. <a href="http://www.youtube.com/watch?v=Vv3f0QNWvWQ">But this one is the best</a>. By far. How all simulation studies should be done (via <a href="http://www.statschat.org.nz/">StatsChat</a>).</li>
</ol>
Send me student/postdoc blogs in statistics and computational biology
2013-03-08T10:15:23+00:00
http://simplystats.github.io/2013/03/08/send-me-studentpostdoc-blogs-in-statistics-and-computational-biology
<p>I’ve been writing a blog for a few years now, but it started after I was already comfortably settled in a tenure track job. There have been some huge benefits of writing a scientific blog. It has certainly raised my visibility and given me opportunities to talk about issues that are a little outside of my usual research agenda. It has also inspired more than one research project that has ended up in a full blown peer-reviewed publication. I also frequently look to blogs/twitter accounts to see “what’s happening” in the world of statistics/data science.</p>
<p>One thing that gets me incredibly fired up is student blogs. A <a href="http://hilaryparker.com/">few</a> of <a href="http://alyssafrazee.wordpress.com/">my</a> <a href="http://fellgernon.tumblr.com/">students</a> have them and I read them whenever they post. But I have found it is hard to discover all of the blogs that might be written by students I’m not directly working with.</p>
<p>So this post is designed for two things:</p>
<p>(1) I’d really like it if you could please send me the links to twitter feeds/blogs/google+ pages etc. of students (undergrad, grad or postdoc) in statistics, computational biology, computational neuroscience, computational social science, etc. Anything that touches statistics and data is fair game.</p>
<p>(2) I plan to create a regularly-maintained page on the blog with links to student blogs with some kind of tagging system so other people can find all the cool stuff that students are thinking about/doing.</p>
<p>Please feel free to either post links in the comments, send them to us on twitter, or email them to me directly. I’ll follow up in a couple of weeks once I have things organized.</p>
The importance of simulating the extremes
2013-03-06T12:35:04+00:00
http://simplystats.github.io/2013/03/06/the-importance-of-simulating-the-extremes
<p>Simulation is commonly used by statisticians/data analysts (1) to <a href="http://en.wikipedia.org/wiki/Bootstrapping_(statistics)">estimate variability/improve predictors</a>, (2) to <a href="http://en.wikipedia.org/wiki/Monte_Carlo_method">evaluate the space of potential outcomes</a>, and (3) to evaluate the properties of new algorithms or procedures. Over the last couple of days, discussions of simulation have popped up in a couple of different places.</p>
<p>First, the reviewers of a paper that my student is working on had asked a question about the behavior of the method in different conditions. I mentioned in passing, that I thought it was a good idea to simulate some cases where our method will definitely break down.</p>
<p>I also saw this post by John Cook about simple/complex models. He <a href="http://www.johndcook.com/blog/2013/03/05/data-calls-the-models-bluff/">raises the really important point</a> that increasingly complex models built on a canonical, small, data set can fool you. You can make the model more and more complicated - but in other data sets the assumptions might not hold and the model won’t generalize. Of course, simple models can have the same problems, but generally simple models will fail on small data sets in the same way they would fail on larger data sets (in my experience) - either they work or they don’t.</p>
<p>These two ideas got me thinking about why I like simulation. Some statisticians, particularly applied statisticians, aren’t fond of simulation for evaluating methods. I think the reason is that you can always simulate a situation that meets all of your assumptions and make your approach look good. Real data rarely conform to model assumptions and so are harder to “trick”. On the other hand, I really like simulation, it can reveal a lot about how and when a method will work well and it allows you to explore scenarios - particularly for new or difficult to obtain data.</p>
<p>Here are the simulations I like to see:</p>
<ol>
<li><strong>Simulation where the assumptions are true</strong> There are a surprising number of proposed methods/analysis procedures/analyses that fail or perform poorly even when the model assumptions hold. This could be because the methods overfit, have a bug, are computationally unstable, are on the wrong place on the bias/variance tradeoff curve, etc. etc. etc. I always do at least one simulation for every method where the answer should be easy to get, because I know if I don’t get the right answer, it is back to the drawing board.</li>
<li><strong>Simulation where things should definitely fail</strong> I like to try out a few realistic scenarios where I’m pretty sure my model assumptions won’t hold and the method should fail. This kind of simulation is good for two reasons: (1) sometimes I’m pleasantly surprised and the model will hold up and (2) (the more common scenario) I can find out where the model assumption boundaries are so that I can give concrete guidance to users about when/where the method will work and when/where it will fail.</li>
</ol>
<p>The first type of simulation is easy to come up with - generally you can just simulate from the model. The second type is much harder. You have to creatively think about reasonable ways that your model can fail. I’ve found that using real data for simulations can be the best way to start coming up with ideas to try - but I usually find that it is worth building on those ideas to imagine even more extreme circumstances. Playing the <a href="http://en.wikipedia.org/wiki/Evil_demon">evil demon</a> for my own methods often leads me to new ideas/improvements I hadn’t thought of before. It also helps me to evaluate the work of other people - since I’ve tried to explore the contexts where methods likely fail.</p>
<p>In any case, if you haven’t simulated the extremes I don’t think you really know how your methods/analysis procedures are working.</p>
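<p>For what it’s worth, here is a minimal sketch (mine, not from any particular paper) of what the two kinds of simulations can look like, using ordinary least squares as the method being evaluated and confidence-interval coverage as the metric.</p>
<pre class="brush: r; title: ; notranslate" title=""># Simulation type 1: assumptions hold. Simulation type 2: deliberately break them.
set.seed(1)

# One simulated data set: does the 95% CI for the slope cover the truth (2)?
simulate_once <- function(error_fun) {
  x <- rnorm(100)
  y <- 2 * x + error_fun(x)
  ci <- confint(lm(y ~ x))["x", ]
  ci[1] <= 2 && 2 <= ci[2]
}

coverage <- function(error_fun, nsim = 2000) {
  mean(replicate(nsim, simulate_once(error_fun)))
}

# 1. Assumptions true: iid normal errors -- coverage should be close to 0.95
coverage(function(x) rnorm(length(x)))

# 2. Things should definitely fail: error variance grows with |x|
#    (heteroskedasticity) -- coverage typically drops below the nominal level
coverage(function(x) rnorm(length(x), sd = 1 + 3 * abs(x)))
</pre>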
Big Data - Context = Bad
2013-03-04T10:00:24+00:00
http://simplystats.github.io/2013/03/04/big-data-context-bad
<p>There’s a nice article by Nick Bilton in the New York Times Bits blog about <a href="http://bits.blogs.nytimes.com/2013/02/24/disruptions-google-flu-trends-shows-problems-of-big-data-without-context/?smid=pl-share">the need for context when looking at Big Data</a>. Actually, the article starts off by describing how Google’s Flu Trends model overestimated the number of people infected with flu in the U.S. this season, but then veers off into a more general discussion about Big Data.</p>
<p>My favorite quote comes from Mark Hansen:</p>
<blockquote>
<p>“Data inherently has all of the foibles of being human,” said <a href="http://www.journalism.columbia.edu/profile/428-mark-hansen/10" title="More about Dr. Hansen.">Mark Hansen</a>, director of the David and <a href="http://topics.nytimes.com/top/reference/timestopics/people/b/helen_gurley_brown/index.html?inline=nyt-per" title="More articles about Helen Gurley Brown.">Helen Gurley Brown</a> Institute for Media Innovation at <a href="http://topics.nytimes.com/top/reference/timestopics/organizations/c/columbia_university/index.html?inline=nyt-org" title="More articles about Columbia University.">Columbia University</a>. “Data is not a magic force in society; it’s an extension of us.”</p>
</blockquote>
<p>Bilton also talks about a course he taught where students built sensors to install in elevators and stairwells at NYU to see how often they were used. The idea was to explore how often and when the NYU students used the stairs versus the elevator.</p>
<blockquote>
<p>As I left campus that evening, one of the N.Y.U. security guards who had seen students setting up the computers in the elevators asked how our experiment had gone. I explained that we had found that students seemed to use the elevators in the morning, perhaps because they were tired from staying up late, and switch to the stairs at night, when they became energized.</p>
<p>“Oh, no, they don’t,” the security guard told me, laughing as he assured me that lazy college students used the elevators whenever possible. “One of the elevators broke down a few evenings last week, so they had no choice but to use the stairs.”</p>
</blockquote>
<p>I can see at least three problems here, not necessarily mutually exclusive:</p>
<ol>
<li><span style="line-height: 16px"><strong>Big Data are often “Wrong” Data</strong>. The students used the sensors measure something, but it didn’t give them everything they needed. Part of this is that the sensors were cheap, and budget was likely a big constraint here. But Big Data are often big <em>because</em> they are cheap. But of course, they still couldn’t tell that the elevator was broken.</span></li>
<li><strong>A failure of interrogation</strong>. With all the data the students collected with their multitude of sensors, they were unable to answer the question “What else could explain what I’m observing?”</li>
<li><strong>A strong desire to tell a story</strong>. Upon looking at the data, they seemed to “make sense” or to at least match a preconceived notion of what they should look like. This is related to #2 above, which is that you have to challenge what you see. It’s very easy and tempting to let the data tell an interesting story rather than the right story.</li>
</ol>
<p>I don’t mean to be unduly critical of some students in a class who were just trying to collect some data. I think there should be more of that going on. But my point is that it’s not as easy as it looks. Even trying to answer a seemingly innocuous question of how students use elevators and stairs requires some forethought, study design, and careful analysis.</p>
Sunday data/statistics link roundup (3/3/2013)
2013-03-03T08:32:05+00:00
http://simplystats.github.io/2013/03/03/sunday-datastatistics-link-roundup-332013
<ol>
<li><a href="http://www.nejm.org/doi/full/10.1056/NEJMoa1200303">A really nice example</a> where epidemiological studies are later confirmed by a randomized trial. From a statistician’s point of view, this is the idealized way that science would work. First, data that are relatively cheap (observational/retrospective studies) are used to identify potential associations of interest. After a number of these studies show a similar effect, a randomized study is performed to confirm what we suspected from the cheaper studies.</li>
<li>Joe Blitzstein talking about the “<a href="https://www.youtube.com/watch?feature=player_embedded&v=dzFf3r1yph8#">Soul of Statistics</a>”, <a href="http://simplystatistics.org/2012/01/20/interview-with-joe-blitzstein/">we interviewed</a> Joe a while ago. Teaching statistics is critical for modern citizenship. It is not just about learning which formula to plug a number into - <a href="http://citizen-statistician.org/2013/03/02/wall-street-journal/">it is about critical thinking with data</a>. Joe’s talk nails this issue.</li>
<li>Significance magazine has a <a href="http://www.significancemagazine.org/details/webexclusive/4374981/Writing-with-Significance-Writing-competition-to-celebrate-the-International-Yea.html#.USSXc5Hslzc.twitter">writing contest</a>. If you are a grad student in statistics/biostatistics this is an awesome way to (a) practice explaining your discipline to people who are not experts - a hugely important skill and (b) get your name out there, which will help when it comes time to look for jobs/apply for awards, etc.</li>
<li>A great post from David Spiegelhalter about the UK court’s <a href="http://understandinguncertainty.org/court-appeal-bans-bayesian-probability-and-sherlock-holmes">interpretation of probability</a>. It reminds me of the Supreme Court’s recent decision that also <a href="http://simplystatistics.org/2011/12/12/the-supreme-courts-interpretation-of-statistical/">hinged on a statistical interpretation</a>. This post brings up two issues I think are worth a more in-depth discussion. One is that it is pretty clear that many court decisions are going to <a href="http://www.huffingtonpost.com/2013/03/02/john-roberts-voting-rights-act_n_2797127.html">hinge on statistical arguments</a>. This suggests (among other things) that statistical training should be mandatory in legal education. The second issue is a minor disagreement I have with Spiegelhalter’s characterization that only Bayesians use epistemic uncertainty. I frequently discuss this type of uncertainty in my classes although I take a primarily frequentist/classical approach to teaching these courses.</li>
<li>Thomas Lumley is <a href="http://www.statistics.com/survey-r/">giving an online course</a> in complex surveys.</li>
<li><a href="http://www.biostat.wisc.edu/~kbroman/refs/umbrellas_and_lions.pdf">On the protective value of an umbrella</a> when encountering a lion. Seems like a nice way to wrap up a post that started with the power of epidemiology and clinical trials. (via <a href="https://twitter.com/kwbroman">Karl B.</a>)</li>
</ol>
Please save the unsolicited R01s
2013-02-27T10:21:21+00:00
http://simplystats.github.io/2013/02/27/please-save-the-unsolicited-r01s
<p><em><strong>Editor’s note</strong>: With the sequestration deadline hours away, the careers of many young US scientists are on the line. In this guest post, our colleague Steven Salzberg, an avid <a href="http://www.forbes.com/sites/stevensalzberg/2013/01/14/congress-is-killing-medical-research/">defender of NIH</a> and <a href="http://simplystatistics.org/2013/01/04/does-nih-fund-innovative-work-does-nature-care-about-publishing-accurate-articles/">its peer review process</a>, tells us why now more than ever the NIH should prioritize funding R01s over other project grants.</em></p>
<p>First let’s get the obvious facts out of the way: the federal budget is a mess, and Congress is completely dysfunctional. When it comes to NIH funding, this is not a good thing.</p>
<p>Hidden within the larger picture, though, is a serious menace to our decades-long record of incredibly successful research in the United States. The investigator-driven, basic research grant is in even worse shape than the overall NIH budget. A recent analysis by FASEB, shown in the figure here, reveals that the number of new R01s reached its peak in 2003 - ten years ago! - and has been steadily declining since. In 2003, 7,430 new R01s were awarded. In 2012, that number had dropped to 5,437, a 27% decline.</p>
<p><a href="http://simplystatistics.org/2013/02/27/please-save-the-unsolicited-r01s/number-of-new-r01s/" rel="attachment wp-att-1055"><img class="alignnone size-full wp-image-1055" alt="number-of-new-r01s" src="http://simplystatistics.org/wp-content/uploads/2013/02/number-of-new-r01s.jpg" width="720" height="540" srcset="http://simplystatistics.org/wp-content/uploads/2013/02/number-of-new-r01s-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2013/02/number-of-new-r01s.jpg 720w" sizes="(max-width: 720px) 100vw, 720px" /></a></p>
<p>For those who might not be familiar with the NIH system, the R01 grant is the crown jewel of research grants. R01s are awarded to individual scientists to pursue all varieties of biomedical research, from very basic science to clinical research. For R01s, NIH doesn’t tell the scientists what to do: we propose the ideas, we write them up, and then NIH organizes a rigorous peer review (which isn’t perfect, but it’s the best system anyone has). Only the top-scoring proposals get funded.</p>
<p>This process has gotten much tougher over the years. In 1995, <a href="http://www.faseb.org/Policy-and-Government-Affairs/Data-Compilations/NIH-Research-Funding-Trends.aspx" target="_blank">the success rate for R01s was 25.9%</a>. Today it is 18.4% and falling. This includes applications from everyone, even the most experienced and proven scientists. Thus no matter who you are, you can expect that there is more than an 80% chance that your grant application will be turned down. In some areas it is even worse: NIAID’s website announced that it is <a href="http://www.niaid.nih.gov/researchfunding/paybud/pages/paylines.aspx" target="_blank">currently funding only 6%</a> of R01s.</p>
<p>Why are R01s declining? Not for lack of interest: the number of applications last year was 29,627, an all-time high. Besides the overall budget problem, another problem is growing: the fondness of the NIH administration for big, top-down science projects, many times with the letters “ome” or “omics” attached.</p>
<p>Yes, the human genome was a huge success. Maybe the human microbiome will be too. But now NIH is pushing gigantic, top-down projects: ENCODE, 1000 Genomes, the cancer genome anatomy project (CGAP), the cancer genome atlas (TCGA), a new “brain-ome” project, and more. The more money is allocated to these big projects, the fewer R01s NIH can fund. For example, NIAID, with its 6% R01 success rate, has been spending tens of millions of dollars per year on 3 large <a href="http://www.niaid.nih.gov/labsandresources/resources/dmid/gsc/Pages/default.aspx" target="_blank">Microbial Genome Sequencing Center</a> contracts and tens of millions more on 5 large <a href="http://www.niaid.nih.gov/labsandresources/resources/dmid/brc/Pages/awards.aspx" target="_blank">Bioinformatics Resource Center</a> contracts. As far as I can tell, no one uses these bioinformatics resource centers for anything - in fact, virtually no one outside the centers even knows they exist. Furthermore, these large, top-down driven sequencing projects don’t address specific scientific hypotheses, but they produce something that the NIH administration seems to love: numbers. It’s impressive to see how many genomes they’ve sequenced, and it makes for nice press releases. But very often we simply don’t need these huge, top-down projects to answer scientific questions. Genome sequencing is cheap enough that we can include it in an R01 grant, if only NIH will stop pouring all its sequencing money into these huge, monolithic projects.</p>
<p>I’ll be the first person to cheer if Congress gets its act together and funds NIH at a level that allows reasonable growth. But whether or not that happens, the growth of big science projects, often created and run by administrators at NIH rather than scientists who have successfully competed for R01s, represents a major threat to the scientist-driven research that has served the world so well for the past 50 years. Many scientists are afraid to speak out against this trend, because by doing so we (yes, this includes me) are criticizing those same NIH administrators who manage our R01s. But someone has to say something. A 27% decline in the number of R01s over the past decade is not a good thing. Maybe it’s time to stop the omics train.</p>
Big data: Giving people what they want
2013-02-25T08:22:33+00:00
http://simplystats.github.io/2013/02/25/big-data-giving-people-what-they-want
<p>Netflix is <a href="http://www.nytimes.com/2013/02/25/business/media/for-house-of-cards-using-big-data-to-guarantee-its-popularity.html?smid=pl-share">using data to create original content for its subscribers</a>, the first example of which was <a href="http://en.wikipedia.org/wiki/House_of_Cards_(U.S._TV_series)">House of Cards</a>. Three main data points for this show were that (1) People like David Fincher (because they watch The Social Network, like, all the time); (2) People like Kevin Spacey; and (3) People liked the British version of House of Cards. Netflix obviously has tons of other data, including when you stop, pause, rewind certain scenes in a movie or TV show.</p>
<blockquote>
<p>Netflix has always used data to decide which shows to license, and now that expertise is extended to the first-run. And there was not one trailer for “House of Cards,” there were many. Fans of Mr. Spacey saw trailers featuring him, women watching “Thelma and Louise” saw trailers featuring the show’s female characters and serious film buffs saw trailers that reflected Mr. Fincher’s touch.</p>
</blockquote>
<p>Using data to program television content is about as new as Brylcreem, but Netflix has the Big Data and has direct interaction with its viewers (so does Amazon Prime, which apparently is also looking to create original content). So the question is, does it work? My personal opinion is that it’s probably not any worse than previous methods, but may not be a lot better. But I would be delighted to be proven wrong. From my walks around the hallway here it seems House of Cards is in fact a good show (I haven’t seen it). But one observation probably isn’t enough to draw a conclusion here.</p>
<p>John Landgraf of FX Networks thinks Big Data won’t help:</p>
<blockquote>
<p>“Data can only tell you what people have liked before, not what they don’t know they are going to like in the future,” he said. “A good high-end programmer’s job is to find the white spaces in our collective psyche that aren’t filled by an existing television show,” adding, those choices were made “in a black box that data can never penetrate.”</p>
</blockquote>
<p>I was a bit confused when I read this, but the use of the word “programmer” here is, I’m pretty sure, in reference to a television programmer. This quote is reminiscent of Steve Jobs’ line about how it’s not the consumer’s job to know what he/she wants. It also reminds me of financial markets where all the data in the world can only tell you about the past.</p>
<p>In the end, can any of it help you predict the future? Or do some people just get lucky?</p>
Sunday data/statistics link roundup (2/24/2013)
2013-02-24T10:00:00+00:00
http://simplystats.github.io/2013/02/24/sunday-datastatistics-link-roundup-2242013
<ol>
<li><span style="font-size: 16px">An attempt to create a version of </span><a style="font-size: 16px" href="https://github.com/amarder/stata-tutorial">knitr for stata</a><span style="font-size: 16px"> (via John M.)</span><span style="font-size: 16px">. I like the direction that reproducible research is moving - toward easier use and more widespread adoption. The success of </span><a style="font-size: 16px" href="http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html">iPython notebook</a><span style="font-size: 16px"> is another great sign for the whole research area.</span></li>
<li>Email is <a href="http://simplystatistics.org/2012/12/05/an-idea-for-killing-email/">always a problem</a> for me. In the last week I’ve been introduced to a couple of really nice apps that give me insight into my email habits (<a href="http://www.gmailmeter.com/">Gmail meter</a> - via John M.) and that help me to send reminders to myself with minimal hassle (<a href="http://www.boomeranggmail.com/">Boomerang</a> - via Brian C.)</li>
<li>Andrew Lo proposes a new model for <a href="http://www.businessinsider.com/qa-with-mit-finance-professor-andrew-lo-2013-2">cancer research funding</a> based on his research in financial engineering. In light of the <a href="http://simplystatistics.org/2013/02/13/im-a-young-scientist-and-sequestration-will-hurt-me/">impending sequester</a> I’m interested in alternative funding models for data science/statistics in biology. But the concerns I have about both crowd-funding and Lo’s idea are whether the basic scientists get hosed and whether sustained funding at a level that will continue to attract top scientists is possible.</li>
<li>This is a <a href="http://healthland.time.com/2013/02/20/bitter-pill-why-medical-bills-are-killing-us/">really nice rundown</a> of why medical costs are so high. The key things in the article to me are that: (1) he chased down the data about actual costs versus charges, (2) he highlights the role of the chargemaster - the price setter in medical centers - and how the prices are often set historically with yearly markups (not based on estimates of costs, etc.), and (3) he discusses key nuances like medical liability if the “best” tests aren’t run on everyone. Overall, it is definitely worth a read and this seems like a hugely important problem a statistician could really help with (if they could get their hands on the data).</li>
<li><a href="http://robohub.org/video-throwing-and-catching-an-inverted-pendulum-with-quadrocopters/">A really cool applied math project</a> where flying robot helicopters toss and catch a stick. Applied math can be super impressive, but they always still need a little boost from statistics, ““This also involved bringing the insights gained from their initial</li>
</ol>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>and many subsequent experiments to bear on their overall system
design. For example, a learning algorithm was added to account for
model inaccuracies." (via Rafa via MR). 6. We've talked about [trying to reduce meetings](http://simplystatistics.org/2011/09/19/meetings/) to increase producitivity before. Here is an article in the NYT talking about [the same issue](http://www.nytimes.com/2013/02/17/jobs/too-many-office-meetings-and-how-to-fight-back.html?_r=1&) (via Rafa via Karl B.). Brian C. made an interesting observation though, that in a soft money research environment there should be evolutionary pressure against anything that doesn't improve your ability to obtain research funding. Despite this, meetings proliferate in soft-money environments. So there must be some selective advantage to them! Another interesting project for a stats/evolutionary biology student. 7. If you have read all the Simply Statistics interviews and still want more, check out <http://www.analyticstory.com/>.
</code></pre></div></div>
<p> </p>
Tesla vs. NYT: Do the Data Really Tell All?
2013-02-18T09:10:51+00:00
http://simplystats.github.io/2013/02/18/tesla-vs-nyt-do-the-data-really-tell-all
<p>I’ve enjoyed so far the back and forth between Tesla Motors and New York Times reporter John Broder. The short version is</p>
<ul>
<li>Broder <a href="http://www.nytimes.com/2013/02/10/automobiles/stalled-on-the-ev-highway.html?smid=pl-share">tested one of Tesla’s new Model S all-electric sedans</a> on a drive from Washington, D.C. to Groton, CT. Part of the reason for this specific trip was to make use of Tesla’s new supercharger stations along the route (one in Delaware and one in Connecticut).</li>
<li>Broder’s trip appeared to have some bumps, including running out of electricity at one point and requiring a tow.</li>
<li>After the review was published in the New York Times, Elon Musk, the CEO/Founder of Tesla, was apparently livid. He published a <a href="http://www.teslamotors.com/blog/most-peculiar-test-drive">detailed response</a> on the Tesla blog explaining that what Broder wrote in his review was not true and that “he simply did not accurately capture what happened and worked very hard to force our car to stop running”.</li>
<li>Broder has since <a href="http://wheels.blogs.nytimes.com/2013/02/14/that-tesla-data-what-it-says-and-what-it-doesnt/">responded to Musk’s response</a> with further explanation.</li>
</ul>
<p>Of course, the most interesting aspect of Musk’s response on the Tesla blog was that he published the data collected by the car during Broder’s test drive. When revelations of this data came about, I thought it was a bit creepy, but Musk makes clear in his post that they require data collection for all reviewers because of a previous bad experience. So, the fact that data were being collected on speed, cabin temperature, battery charge %, and rated range remaining, was presumably known to all, especially Broder. Given that you know Big Brother Musk is watching, it seems odd to deliberately lie in a widely read publication like the Times.</p>
<p>Having read the original article, Musk’s response, and Broder’s rebuttal, one thing is clear to me–there’s more than one way to see the data. The challenge here is that Broder had the car, but not the data, so had to rely on his personal recollection and notes. Musk has the data, but wasn’t there, and so has to rely on peering at graphs to interpret what happened on the trip.</p>
<p>One graph in particular was fascinating. Musk shows a <a href="http://www.teslamotors.com/sites/default/files/blog_images/speedmph0.jpg">periodic-looking segment of the speed graph</a> and concludes</p>
<blockquote>
<p>Instead of plugging in the car, <span style="text-decoration: underline;">he drove in circles</span> for over half a mile in a tiny, 100-space parking lot. When the Model S valiantly refused to die, he eventually plugged it in.</p>
</blockquote>
<p>Broder claims</p>
<blockquote>
<p>I drove around the Milford service plaza in the dark looking for the Supercharger, which is not prominently marked. I was not trying to drain the battery. (It was already on reserve power.) As soon as I found the Supercharger, I plugged the car in.</p>
</blockquote>
<p>Okay, so who’s right? Isn’t the data supposed to settle this?</p>
<p>In a few other cases in this story, the data support both people. In particular, it seems that there was some serious miscommunication between Broder and Tesla’s staff. I’m sure they have recordings of those telephone calls too, but they were not reproduced in Musk’s response.</p>
<p>The bottom line here, in my opinion, is that sometimes the data don’t tell all, especially “big data”. In the end, data are one thing, interpretation is another. Tesla had reams of black-box data from the car and yet some of the data still appear to be open to interpretation. My guess is that the data Tesla collects is not collected specifically to root out liars, and so is maybe not optimized for this purpose. Which leads to another key point about big data–they are often used “off-label”, i.e. not for the purpose they were originally designed.</p>
<p>I read this story with interest because I actually think Tesla is a fascinating company that makes cool products (that sadly, I could never afford). This episode will surely not be the end of Tesla or of the New York Times, but it illustrates to me that simply “having the data” doesn’t necessarily give you what you want.</p>
Sunday data/statistics link roundup (2/17/2013)
2013-02-17T10:53:21+00:00
http://simplystats.github.io/2013/02/17/sunday-datastatistics-link-roundup-2172013
<ol>
<li><span style="line-height: 15.989583969116211px;"><a href="http://thewhyaxis.info/">The Why Axis</a> - discussion of important visualizations on the web. This is one I think a lot of people know about, but it is new to me. (via Thomas L. - p.s. I’m @leekgroup on Twitter, not @jtleek). </span></li>
<li><a href="http://arxiv.org/abs/0810.4672">This paper</a> says that people who “engage in outreach” (read: write blogs) tend to have higher academic output (hooray!) but that outreach itself doesn’t help their careers (boo!).</li>
<li>It is a little too late for this year, but next year you could <a href="http://blog.revolutionanalytics.com/2013/02/make-a-valentines-heart-with-r.html">make a Valentine with R</a>.</li>
<li>The <a href="http://emailcharter.org/">Email Charter</a> (via Rafa). This is pretty similar to my <a href="http://simplystatistics.org/2011/09/23/getting-email-responses-from-busy-people/">getting email responses from busy people</a>. Not sure who scooped who. I’m still waiting for my <a href="http://simplystatistics.org/2012/12/05/an-idea-for-killing-email/">to-do list app</a>. <a href="http://www.mailboxapp.com/">Mailbox</a> is close, but I still want actions to be multiple choice or yes/no or delegation rather than just snoozing emails for later.</li>
<li><a href="http://faculty.washington.edu/rjl/pubs/topten/topten.pdf">Top ten reasons not to share your code, and why you should anyway</a>.</li>
</ol>
Interview with Nick Chamandy, statistician at Google
2013-02-15T12:09:01+00:00
http://simplystats.github.io/2013/02/15/interview-with-nick-chamandy-statistician-at-google
<div dir="ltr">
<div>
<strong>Nick Chamandy</strong>
</div>
<div>
<a href="http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/person_photo/" rel="attachment wp-att-1029"><img class="alignnone size-full wp-image-1029" alt="person_photo" src="http://simplystatistics.org/wp-content/uploads/2013/02/person_photo.png" width="190" height="235" /></a>
</div>
<div>
Nick Chamandy received his M.S. in statistics from the University of Chicago, his Ph.D. in statistics at McGill University and joined Google as a statistician. We talked to him about how he ended up at Google, what software he uses, and how big the Google data sets are. To read more interviews - check out our <a href="http://simplystatistics.org/interviews/">interviews page</a>.
</div>
<div>
<strong>SS: Which term applies to you: data scientist, statistician, computer scientist, or something else?</strong>
</div>
<p>
NC: I usually use the term Statistician, but at Google we are also known as Data Scientists or Quantitative Analysts. All of these titles apply to some degree. As with many statisticians, my day to day job is a mixture of analyzing data, building models, thinking about experiments, and trying to figure out how to deal with large and complex data structures. When posting job opportunities, we are cognizant that people from different academic fields tend to use different language, and we don't want to miss out on a great candidate because he or she comes from a non-statistics background and doesn't search for the right keyword. On my team alone, we have had successful "statisticians" with degrees in statistics, electrical engineering, econometrics, mathematics, computer science, and even physics. All are passionate about data and about tackling challenging inference problems.
</p>
<div>
<p>
<strong>SS: How did you end up at Google?</strong>
</p>
</div>
<p>
Coming out of my PhD program at McGill, I was somewhat on the fence about the academia vs. industry decision. Ideally I wanted an opportunity that combined the intellectual freedom and stimulation of academia with the concreteness and real-world relevance of industrial problems. Google seemed to me at the time (and still does) to be by far the most exciting place to pursue that happy medium. The culture at Google emphasizes independent thought and idea generation, and the data are staggering in both size and complexity. That places us squarely on the "New Frontier" of statistical innovation, which is really motivating. I don't know of too many other places where you can both solve a research problem and have an impact on a multi-billion dollar business in the same day.
</p>
<div>
<p>
<strong>SS: Is your work related to the work you did as a Ph.D. student?</strong>
</p>
</div>
<p>
NC: Although I apply many of the skills I learned in grad school on a daily basis, my PhD research was on Gaussian random fields, with particular application to brain imaging data. The bulk of my work at Google is in other areas, since I work for the Ads Quality Team, whose goal is to quantify and improve the experience that users have interacting with text ads on the <a href="http://google.com/" target="_blank">google.com</a> search results page. Once in a while though, I come across data sets with a spatial or spatio-temporal component and I get the opportunity to leverage my experience in that area. Some examples are eye-tracking studies run by the user research lab (measuring user engagement on different parts of the search page), and click pattern data. These data sets typically violate many of the assumptions made in neuroimaging applications, notably smoothness and isotropy conditions. And they are predominantly 2-D applications, as opposed to 3-D or higher.
</p>
<div>
<p>
<strong>What is your programming language of choice, R, Python or something else? </strong>
</p>
</div>
<p>
I use R, and occasionally matlab, for data analysis. There is a large, active and extremely knowledgeable R community at Google. Because of the scale of Google data, however, R is typically only useful after a massive data aggregation step has been accomplished. Before that, the data are not only too large for R to handle, but are stored on many thousands of machines. This step is usually accomplished using the MapReduce parallel computing framework, and there are several Google-developed scripting languages that can be used for this purpose, including Go. We also have an interactive, ad hoc query language which can be applied to massive, "sharded" data sets (even those with a nested structure), and for which there is an R API. The engineers at Google have also developed a truly impressive package for massive parallelization of R computations on hundreds or thousands of machines. I typically use shell or python scripts for chaining together data aggregation and analysis steps into "pipelines".
</p>
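<p>To make the aggregate-then-model workflow described above concrete, here is a toy sketch in R. This is not Google’s tooling (which, as noted, is internal); the file and column names are invented for illustration:</p>
<pre><code class="language-r">
# Toy version of the workflow: the raw logs have already been aggregated
# (elsewhere, by a MapReduce-style job) into a file small enough for R.
# The file and column names here are made up for illustration.
events <- read.table("aggregated_clicks.txt", header = TRUE)
# Assume columns: day, experiment_arm, impressions, clicks

# Collapse further to one row per experiment arm and compute a click-through rate
arm_totals <- aggregate(cbind(impressions, clicks) ~ experiment_arm,
                        data = events, FUN = sum)
arm_totals$ctr <- arm_totals$clicks / arm_totals$impressions

# Fit a simple binomial model on the day-by-arm table
fit <- glm(cbind(clicks, impressions - clicks) ~ experiment_arm,
           data = events, family = binomial)
summary(fit)
</code></pre>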
<div>
<p>
<strong>SS: How big are the data sets you typically handle? Do you extract them yourself or does someone else extract them for you?</strong>
</p>
</div>
<p>
Our data sets contain billions of observations before any aggregation is done. Even after aggregating down to a more manageable size, they can easily consist of 10s of millions of rows, and on the order of 100s of columns. Sometimes they are smaller, depending on the problem of interest. In the vast majority of cases, the statistician pulls his or her own data -- this is an important part of the Google statistician culture. It is not purely a question of self-sufficiency. There is a strong belief that without becoming intimate with the raw data structure, and the many considerations involved in filtering, cleaning, and aggregating the data, the statistician can never truly hope to have a complete understanding of the data. For massive and complex data, there are sometimes as many subtleties in whittling down to the right data set as there are in choosing or implementing the right analysis procedure. Also, we want to guard against creating a class system among data analysts -- every statistician, whether BS, MS or PhD level, is expected to have competence in data pulling. That way, nobody becomes the designated data puller for a colleague. That said, we always feel comfortable asking an engineer or other statistician for help using a particular language, code library, or tool for the purpose of data-pulling. That is another important value of the Google culture -- sharing knowledge and helping others get "unstuck".
</p>
<div>
<p>
<strong>Do you work collaboratively with other statisticians/computer scientists at Google? How do projects you work on get integrated into Google's products, is there a process of approval?</strong>
</p>
</div>
<p>
Yes, collaboration with both statisticians and engineers is a huge part of working at Google. In the Ads Team we work on a variety of flavours of statistical problems, spanning but not limited to the following categories: (1) Retrospective analysis with the goal of understanding the way users and advertisers interact with our system; (2) Designing and running randomized experiments to measure the impact of changes to our systems; (3) Developing metrics, statistical methods and tools to help evaluate experiment data and inform decision-making; (4) Building models and signals which feed directly into our engineering systems. "Systems" here are things like the algorithms that decide which ads to display for a given query and context.
</p>
<p>
Clearly (2) and (4) require deep collaboration with engineers -- they can make the changes to our production codebase which deploy a new experiment or launch a new feature in a prediction model. There are multiple engineering and product approval steps involved here, meant to avoid introducing bugs or features which harm the user experience. We work with engineers and computer scientists on (1) and (3) as well, but to a lesser degree. Engineers and computer scientists tend to be extremely bright and mathematically-minded people, so their feedback on our analyses, methodology and evaluation tools is pretty invaluable!
</p>
<div>
<p>
<strong>Who have been good mentors to you during your career? Is there something in particular they did to help you?</strong>
</p>
</div>
<p>
I've had numerous important mentors at Google (in addition, of course, to my thesis advisors and professors at McGill). Largely they are statisticians who have worked in industry for a number of years and have mastered the delicate balance between deep-thinking a problem and producing something quick and dirty that can have an immediate impact. Grad school teaches us to spend weeks thinking about a problem and coming up with an elegant or novel methodology to solve it (sometimes without even looking at data). This process certainly has its place, but in some contexts a better outcome is to produce an unsophisticated but useful and data-driven answer, and then refine it further as needed. Sometimes the simple answer provides 80% of the benefit, and there is no reason to deprive the consumers of your method this short-term win while you optimize for the remaining 20%. By encouraging the "launch and iterate" mentality for which Google is well-known, my mentors have helped me produce analysis, models and methods that have a greater and more immediate impact.
</p>
<div>
<p>
<strong>What skills do you think are most important for statisticians/data scientists moving into the tech industry?</strong>
</p>
</div>
<p>
Broadly, statisticians entering the tech industry should do so with an open mind. Technically speaking, they should be comfortable with heavy-tailed, poorly-behaved distributions that fail to conform to assumptions or data structures underlying the models taught in most statistics classes. They should not be overly attached to the ways in which they currently interact with data sets, since most of these don't work for web-scale applications. They should be receptive to statistical techniques that require massive amounts of data or vast computing networks, since many tech companies have these resources at their disposal. That said, a statistician interested in the tech industry should not feel discouraged if he or she has not already mastered large-scale computing or the hottest programming languages. To me, it is less about what skills one must brush up on, and much more about a willingness to adaptively learn new skills and adjust one's attitude to be in tune with the statistical nuances and tradeoffs relevant to this New Frontier of statistics. Statisticians in the tech industry will be well-served by the classical theory and techniques they have mastered, but at times must be willing to re-learn things that they have come to regard as trivial. Standard procedures and calculations can quickly become formidable when the data are massive and complex.
</p>
</div>
I'm a young scientist and sequestration will hurt me
2013-02-13T14:07:13+00:00
http://simplystats.github.io/2013/02/13/im-a-young-scientist-and-sequestration-will-hurt-me
<p>I’m a biostatistician. That means that I help scientists and doctors analyze their medical data to try to figure out new screening tools, new therapies, and new ways to improve patients’ health. I’m also a professor. I spend a good fraction of my time teaching students about analyzing data in classes here at my university and <a href="https://www.coursera.org/course/dataanalysis">online</a>. Big data/data analysis is an area of growth for the U.S. economy and some have even suggested that there will be a <a href="http://online.wsj.com/article/SB10001424052702304723304577365700368073674.html">critical shortage</a> of trained data analysts.</p>
<p>I have other responsibilities but these are the two biggies - teaching and research. I work really hard to be good at them because I’m passionate about education and I’m passionate about helping people. I’m by no means the only (relatively) young person with this same drive. I would guess this is a big reason why a lot of people become scientists. They want to contribute to both our current knowledge (research) and the future of knowledge (teaching).</p>
<p>My salary comes from two places - the students who pay tuition at our school and, to a much larger extent, the federal government’s research funding through the NIH. So you are paying my salary. The way that the NIH distributes that funding is through a serious and very competitive process. I submit proposals of my absolute best ideas, so do all the other scientists in the U.S., and they are evaluated by yet another group of scientists who don’t have a vested interest in our grants. This system is the reason that only the best, most rigorously vetted science is funded by taxpayer money.</p>
<p>It is very hard to get a grant. In 2012, <a href="http://www.einstein.yu.edu/administration/grant-support/nih-paylines.aspx">between 7% and 16%</a> of new projects were funded. So you have to write a proposal that is better than 84-93% of all other proposals being submitted by other really, really smart and dedicated scientists. The practical result is that it is already very difficult for a good young scientist to get a grant. The NIH recognizes this and implements special measures for new scientists to get grants, but it still isn’t easy by any means.</p>
<p>Sequestration will likely dramatically reduce the fraction of grants that get funded. Already on that website, the “payline” or cutoff for funding, has dropped from 10% of grants in 2012 to 6% in 2013 for some NIH institutes. If sequestration goes through, it will be worse - maybe a lot worse. The result is that it will go from being really hard to get individual grants to nearly impossible. If that happens, many young scientists like me won’t be able to get grants. No matter how passionate we are about helping people or doing the right thing, many of us will have to stop being researchers and scientists and get other jobs to pay the bills - we have to eat.</p>
<p>So if sequestration or other draconian cuts to the NIH go through, they will hurt me and other junior scientists like me. It will make it harder - if not impossible - for me to get grants. It will affect whether I can afford to educate the future generation of students who will analyze all the data we are creating. It will create dramatic uncertainty/difficulty in the lives of the young biological scientists I work with who may not be able to rely on funding from collaborative grants to the extent that I can. In the end, this will hurt me, it will hurt my other scientific colleagues, and it could dramatically reduce our competitiveness in science, technology, engineering, and mathematics (STEM) for years to come. Steven <a href="http://genome.fieldofscience.com/2013/01/congress-is-killing-medical-research.html">wrote this up beautifully</a> on his blog.</p>
<p>I know that these cuts will also affect the lives of many other people from all walks of life, not just scientists. So I hope that Congress will do the right thing and decide that hurting all these people isn’t worth the political points they will score - on both sides. Sequestration isn’t the right choice - it is the choice that was most politically expedient when people’s backs were against the wall.</p>
<p>Instead of making dramatic, untested, and possibly disastrous cuts across the board for political reasons, let’s do what scientists and statisticians have been doing for years when deciding which drugs work and which don’t. Let’s run controlled studies and evaluate the impact of budget cuts to different programs - as Ben Goldacre and his colleagues so <a href="http://www.cabinetoffice.gov.uk/sites/default/files/resources/TLA-1906126.pdf">beautifully laid out in their proposal</a>. That way we can bring our spending into line, but sensibly and based on evidence, rather than the politics of the moment or untested economic models not based on careful experimentation.</p>
Sunday data/statistics link roundup (2/10/2013)
2013-02-10T20:29:06+00:00
http://simplystats.github.io/2013/02/10/sunday-datastatistics-link-roundup-2102013
<ol>
<li><a href="http://www.grantland.com/blog/the-triangle/post/_/id/50343/the-height-of-wonkery-an-in-depth-look-at-the-nba-with-the-most-innovative-technology-available">An article</a> about how NBA teams have installed cameras that allow their analysts to collect information on every movement/pass/play that is performed in a game. I think the most interesting part for me would be how you would define features. They talk about, for example, how many times a player drives. I wonder if they have an intern in the basement manually annotating those features or if they are using automatic detection algorithms (via Marginal Revolution).</li>
<li>Our friend Florian <a href="https://scientificbsides.wordpress.com/2013/02/10/maximal-information-coefficient-just-a-messed-up-estimate-of-mutual-information/">jumps into the MIC debate</a>. I haven’t followed the debate very closely, but I agree with Florian that if a theory paper <a href="http://simplystatistics.org/2012/01/26/when-should-statistics-papers-be-published-in-science/"> is published in a top journal</a>, later falling back on heuristics and hand waving seems somewhat unsatisfying.</li>
<li>An <a href="http://www.the-scientist.com/?articles.view/articleNo/33968/title/Opinion--Publish-Negative-Results/">opinion piece</a> pushing the Journal of Negative Results in Biomedicine. If you can’t get your negative result in there, think about <a href="http://simplystatistics.org/2011/09/28/the-p-0-05-journal/">our P > 0.05 journal</a> :-).</li>
<li><span style="line-height: 15.989583969116211px;">This has nothing to do with statistics/data but is a bit of nerd greatness. Run these commands from a terminal: traceroute <a href="tel:216.81.59.173" target="_blank">216.81.59.173</a>.</span></li>
<li><a href="http://www.viewtific.com/elections-performance-inde/">A data visualization</a> describing the effectiveness of each state’s election administrations. I think that it is a really cool idea, although I’m not sure I understand the index. A couple of related plots are <a href="http://www.elections.state.md.us/press_room/2012_stats_general/2012_general_election_day_turnout_and_distance.pdf">this one</a> that shows distance to polling place versus election day turnout and <a href="http://www.elections.state.md.us/press_room/2012_stats_general/2012_general_early_voting_turnout_and_distance.pdf">this one</a> that shows the same thing for early voting. It’s pretty interesting how dramatically different the plots are.</li>
<li>Postdoc Sherri Rose <a href="http://stattrak.amstat.org/2013/02/01/statisticians-place-in-big-data/">writes about big data and junior statisticians</a> at Stattrak. My favorite quote: “ We need to take the time to understand the science behind our projects before applying and developing new methods. The importance of defining our research questions will not change as methods progress and technology advances”.</li>
</ol>
Issues with reproducibility at scale on Coursera
2013-02-06T10:11:20+00:00
http://simplystats.github.io/2013/02/06/issues-with-reproducibility-at-scale-on-coursera
<p>As you know, we are <a href="http://simplystatistics.org/?s=reproducible+research">big fans of reproducible research</a> here at Simply Statistics. <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">The Duke saga</a> around the lack of reproducibility in the analyses performed by Anil Potti and the subsequent fallout drove the importance of this topic home.</p>
<p>So when I started teaching a course on <a href="https://www.coursera.org/course/dataanalysis">Data Analysis for Coursera</a>, of course I wanted to focus on reproducible research. The students in the class will be performing two data analyses during the course. They will be peer evaluated using a rubric specifically designed for evaluating data analyses at scale. One of the components of the rubric was to evaluate whether the code people submitted with their assignments reproduced all the numbers in the assignment.</p>
<p>Unfortunately, I just had to cancel the reproducibility component of the first data analysis assignment. Here are the things I realized while trying to set up the process that may seem obvious but weren’t to me when I was designing the rubric:</p>
<ol>
<li><strong>Security</strong> I realized (thanks to a very smart subset of the students in the class who posted on the message boards) that there is a major security issue with exchanging R code and data files with each other. Even if they use only the data downloaded from the official course website, it is possible that people could use the code to try to hack/do nefarious things to each other. The students in the class are great and the probability of this happening is small, but with a class this size, it isn’t worth the risk.</li>
<li><strong>Compatibility</strong> I’m requiring that people use R for the course. Even so, people are working on every possible operating system, with many different versions of R. In this scenario, it is entirely conceivable for a person to write totally reproducible code that works on their machine but won’t work on a random peer-reviewer’s machine (a small sketch of one way to document the environment appears after this list).</li>
<li><strong>Computing Resources</strong> The range of computing resources used by people in the class is huge, from modern clusters to a single old beat-up laptop. Inefficient code on a fast computer is fine, but on a slow computer with little memory it could mean the difference between reproducibility and a crashed computer.</li>
</ol>
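<p>Short of a standardized computing image, one small step that would have helped with the compatibility problem is asking every student to submit a record of their computing environment along with their code. A minimal sketch using only base R:</p>
<pre><code class="language-r">
# Write a snapshot of the computing environment next to the analysis code,
# so a peer reviewer can at least see which R version and packages were used.
writeLines(c(capture.output(sessionInfo()),
             paste("Run on:", format(Sys.time()))),
           con = "environment-snapshot.txt")

# Make any stochastic steps of the analysis repeatable
set.seed(20130206)
</code></pre>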
<p>Overall, I think the solution is to run some kind of EC2 instance with a standardized set of software. That is the only thing I can think of that would be scalable to a class this size. On the other hand, that would be expensive, a pain to maintain, and would require everyone to run code on EC2.</p>
<p>Regardless, it is a super interesting question. How do you do reproducibility at scale?</p>
Sunday data/statistics link roundup (2/3/2013)
2013-02-03T10:00:23+00:00
http://simplystats.github.io/2013/02/03/sunday-datastatistics-link-roundup-232013
<ol>
<li>My student, <a href="http://www.biostat.jhsph.edu/~hiparker/">Hilary</a>, wrote a post about how her name is the most <a href="http://hilaryparker.com/2013/01/30/hilary-the-most-poisoned-baby-name-in-us-history/">poisoned in history</a>. A poisoned name is a name that quickly loses popularity year over year. The post is awesome for the following reasons: (1) she is a good/funny writer and has lots of great links in the post, (2) she very clearly explains concepts that are widely used in biostatistics like relative risk, and (3) she took the time to try to really figure out all the trends she saw in the name popularity. I’m not the only one who thinks it is a good post, it was <a href="http://nymag.com/thecut/2013/01/hillary-most-poisoned-baby-name-in-us-history.html">reprinted in New York Magazine</a> and went viral this last week.</li>
<li>In honor of it being Super Bowl Sunday (go Ravens!) here is a post about the reasons why it often doesn’t make sense to consider <a href="http://www.footballperspective.com/what-are-the-odds-of-that/">the odds of an event retrospectively</a> due to the Wyatt Earp effect. Another way to think about it is, if you have a big tournament with tons of teams - someone will win. But at the very beginning, any team had a pretty small chance of winning all the games and taking the championship. If we wait until some team wins and calculate their pre-tournament odds of winning, it will probably be small. (via David S.)</li>
<li><a href="http://www.nytimes.com/2013/02/02/opinion/health-cares-trick-coin.html?smid=fb-share&_r=0">A new article</a> by Ben Goldacre in the NYT about unreported clinical trials. This is a major issue and Ben is all over it with his <a href="http://www.alltrials.net/">All Trials</a> project. This is another reason we need a <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">deterministic statistical machine</a>. Don’t worry, we are working on building it.</li>
<li>Even though it is Super Bowl Sunday, I’m still eagerly looking forward to spring and the real sport of baseball. Rafa sends along this link analyzing the effectiveness of patient hitters <a href="http://www.hardballtimes.com/main/article/game-theory-and-first-pitch/">when they swing at a first strike</a>. It looks like it is only a big advantage if you are an elite hitter.</li>
<li>An article in Wired on the <a href="http://www.wired.com/opinion/2013/01/forget-big-data-think-long-data/">importance of long data</a>. The article talks about how in addition to cross-sectional big data, we might also want to be looking at data over time - possibly large amounts of time. I think the title is maybe a little over the top, but the point is well taken. It turns out this is something a bunch of my colleagues in imaging and environmental health have been working on/talking about for a while. Longitudinal/time series big data seems like an important and wide-open field (via Nick R.).</li>
</ol>
paste0 is statistical computing's most influential contribution of the 21st century
2013-01-31T11:11:24+00:00
http://simplystats.github.io/2013/01/31/paste0-is-statistical-computings-most-influential-contribution-of-the-21st-century
<p>The day I discovered paste0 I literally cried. No more paste(bla,bla, sep=””). While looking through code written by a student who did not know about paste0 I started pondering about how many person hours it has saved humanity. So typing sep=”” takes about 1 second. We R users use paste about 100 times a day and there are about 1,000,000 R users in the world. That’s over 3 person years a day! Next up read.table0 (who doesn’t want as.is to be TRUE?).</p>
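<p>For the skeptics, here is the back-of-the-envelope arithmetic in R, using the rough guesses above:</p>
<pre><code class="language-r">
# The two calls produce the same string; paste0 just spares you the sep = ""
paste("simply", "statistics", sep = "")
paste0("simply", "statistics")

# Back-of-the-envelope arithmetic behind the "3 person years a day" claim,
# using the rough guesses in the post
seconds_per_call <- 1
calls_per_day    <- 100
n_users          <- 1e6
seconds_per_year <- 60 * 60 * 24 * 365
seconds_per_call * calls_per_day * n_users / seconds_per_year  # about 3.2
</code></pre>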
Data supports claim that if Kobe stops ball hogging the Lakers will win more
2013-01-28T11:33:26+00:00
http://simplystats.github.io/2013/01/28/data-supports-claim-that-if-kobe-stops-ball-hogging-the-lakers-will-win-more
<p>The Lakers recently snapped a four game losing streak. In that game Kobe, the league leader in field goal attempts and missed shots, had a season low of 14 points but a season high of 14 assists. This makes sense to me since Kobe shooting less means more efficient players are shooting more. Kobe has a lower career <a style="font-size: 16px;" href="http://www.basketball-reference.com/leaders/ts_pct_active.html">true shooting %</a> than Gasol, Howard and Nash (ranked 17, 3 and 2 respectively). Despite this he takes more than 1/4 of the shots. Commentators usually praise top scorers no matter what, but recently they <a href="http://espn.go.com/los-angeles/nba/story/_/id/8884925/los-angeles-lakers-coach-mike-dantoni-says-kobe-bryant-assists-looked-sacrificing">changed their tune</a> and noticed that the Lakers are 6-22 when Kobe has more than 19 field goal attempts and 12-3 in the rest of the games.</p>
<p><a href="http://simplystatistics.org/2013/01/28/data-supports-claim-that-if-kobe-stops-ball-hogging-the-lakers-will-win-more/kobelakers-2/" rel="attachment wp-att-978"><img class="alignnone size-medium wp-image-978" alt="kobelakers" src="http://simplystatistics.org/wp-content/uploads/2013/01/kobelakers1-300x300.png" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2013/01/kobelakers1-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2013/01/kobelakers1-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2013/01/kobelakers1-1024x1024.png 1024w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>This graph shows score differential versus % of shots taken by Kobe*. Linear regression suggests that an increase of 1% in % of shots taken by Kobe results in a drop of 1.16 points (+/- 0.22) in score differential. It also suggests that when Kobe takes 15% of the shots, the Lakers win by an average of about 10 points; when he takes 30% (not a rare occurrence) they lose by an average of about 5. Of course we should not take this regression analysis too seriously but it’s hard to ignore the fact that when Kobe takes less than <del>23</del> 23.25% of the shots the Lakers are 13-1.</p>
<p>I suspect that this relationship is not unique to Kobe and the Lakers. In general, teams with a more balanced attack probably do better. Testing this could be a good project for <a href="https://www.coursera.org/course/dataanalysis">Jeff’s class</a>.</p>
<p>* I approximated shots taken as field goal attempts + floor(0.5 x Free Throw Attempts).</p>
<p>Data is <a href="http://rafalab.jhsph.edu/simplystats/kobe2.txt">here</a>.</p>
<p><strong>Update</strong>: Commentator Sidney fixed some entries in the data file. Data and plot updated.</p>
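<p>For anyone who wants to poke at the numbers, here is a minimal sketch of the regression described above. The column names are my guesses, not necessarily what is in the file:</p>
<pre><code class="language-r">
# Read the game-level data linked above; the column names below are guesses,
# so adjust them to whatever the file actually contains
kobe <- read.table("http://rafalab.jhsph.edu/simplystats/kobe2.txt",
                   header = TRUE)

# Approximate shots taken as FGA + floor(0.5 * FTA), as in the footnote
kobe_shots <- kobe$kobe_fga + floor(0.5 * kobe$kobe_fta)
team_shots <- kobe$team_fga + floor(0.5 * kobe$team_fta)
pct_kobe   <- 100 * kobe_shots / team_shots

# Score differential versus % of shots taken by Kobe
fit <- lm(kobe$point_diff ~ pct_kobe)
summary(fit)$coefficients  # the post reports a slope of about -1.16 (+/- 0.22)
</code></pre>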
Sunday data/statistics link roundup (1/27/2013)
2013-01-27T10:26:20+00:00
http://simplystats.github.io/2013/01/27/sunday-datastatistics-link-roundup-1272013
<ol>
<li>Wisconsin is <a href="http://marginalrevolution.com/marginalrevolution/2013/01/the-wisconsin-revolution.html">decoupling the education and degree-granting components</a> of education. This means if you take a MOOC like <a href="https://www.coursera.org/course/dataanalysis">mine</a>, <a href="https://www.coursera.org/course/biostats">Brian’s</a> or <a href="https://www.coursera.org/course/compdata">Roger’s</a> and there is an equivalent class to pass at Wisconsin, you can take the exam and get credit. This is big. (via Rafa)</li>
<li><a href="http://cscheid.net/static/mlb-hall-of-fame-voting/#state=state%5Bshown_histograms%5D%5B%5D=-1&state%5Bshown_histograms%5D%5B%5D=2&state%5Bshown_histograms%5D%5B%5D=14&state%5Bshown_histograms%5D%5B%5D=12&state%5Bshown_histograms%5D%5B%5D=4&state%5Bshown_histograms%5D%5B%5D=11&state%5Bshown_histograms%5D%5B%5D=18"> 1. Wisconsin is d[ecoupling the education and degree granting components](http://marginalrevolution.com/marginalrevolution/2013/01/the-wisconsin-revolution.html) of education. This means if you take a MOOC like [mine](https://www.coursera.org/course/dataanalysis), [Brian’s](https://www.coursera.org/course/biostats) or [Roger’s](https://www.coursera.org/course/compdata) and there is an equivalent class to pass at Wisconsin, you can take the exam and get credit. This is big. (via Rafa)
2.</a> is a really cool MLB visualisation done with d3.js and Crossfilter. It was also prototyped in R, which makes it even cooler. (via Rafa via Chris V.)</li>
<li>Harvard is <a href="http://www.guardian.co.uk/science/2012/apr/24/harvard-university-journal-publishers-prices">encouraging their professors</a> to only publish in open access journals and to resign from closed access journals. This is another major change and bodes well for the future of open science (again via Rafa - noticing a theme this week?).</li>
<li>This deserves a post all to itself, but Greece is <a href="http://www.ekathimerini.com/4dcgi/_w_articles_wsite3_1_26/01/2013_480606">prosecuting a statistician</a> for analyzing data in a way that changed their deficit figure. I wonder what the folks at the International Year of Statistics think about that? (via Alex N.)</li>
<li>Be on the twitters at 10:30AM Tuesday and follow the hashtag <a href="https://twitter.com/search/realtime?q=%23jhsph753&src=typd">#jhsph753</a> if you want to hear all the crazy stuff I tell my students when I’m running on no sleep.</li>
<li>Thomas at StatsChat is <a href="http://www.statschat.org.nz/2013/01/24/enough-with-the-nobel-correlations-already/">fed up</a> with Nobel correlations. Although I’m still partial to the <a href="http://www.statschat.org.nz/2012/10/12/even-better-than-chocolate/">length of country name</a> association.</li>
</ol>
My advanced methods class is now being live-tweeted
2013-01-25T09:57:41+00:00
http://simplystats.github.io/2013/01/25/my-advanced-methods-class-is-now-being-live-tweeted
<p>A student in my class is going to be live-tweeting my (often silly/controversial) comments in the advanced/Ph.D. data analysis and methods class I’m teaching here at Hopkins. The hashtag is #jhsph753 and the class runs from 10:30am to 12:00PM EST. Check it out <a href="https://twitter.com/search/realtime?q=%23jhsph753&src=hash">here</a>.</p>
Why I disagree with Andrew Gelman's critique of my paper about the rate of false discoveries in the medical literature
2013-01-24T14:55:17+00:00
http://simplystats.github.io/2013/01/24/why-i-disagree-with-andrew-gelmans-critique-of-my-paper-about-the-rate-of-false-discoveries-in-the-medical-literature
<p>With a colleague, I wrote a paper titled, <a href="http://arxiv.org/abs/1301.3718">“Empirical estimates suggest most published medical research is true”</a> which we quietly posted to ArXiv a few days ago. I posted to the ArXiv in the interest of open science and because we didn’t want to delay the dissemination of our approach during the long review process. I didn’t email anyone about the paper or talk to anyone about it, except my friends here locally.</p>
<p>I underestimated the internet. Yesterday, the paper was covered in <a href="http://www.technologyreview.com/view/510126/the-statistical-puzzle-over-how-much-biomedical-research-is-wrong/">this piece</a> on the MIT Tech review. That exposure was enough for the paper to appear in a few different outlets. I’m totally comfortable with the paper, but was not anticipating all of the attention so quickly.</p>
<p>In particular, I was a little surprised to see it appear on Andrew Gelman’s blog with the disheartening title, <a href="http://andrewgelman.com/2013/01/i-dont-believe-the-paper-empirical-estimates-suggest-most-published-medical-research-is-true-that-is-the-claim-may-very-well-be-true-but-im-not-at-all-convinced-by-the-analysis-being-used/">“I don’t believe the paper, “Empirical estimates suggest most published medical research is true.” That is, most published medical research may well be true, but I’m not at all convinced by the analysis being used to support this claim.”</a> I responded briefly this morning to his post, but then had to run off to teach class. After thinking about it a little more, I realized I have some objections to his critique.</p>
<p>His main criticisms of our paper are: (1) that we worked with type I/type II errors instead of type S versus type M errors (paragraph 2), (2) that we didn’t look at replication, we performed inference (paragraph 4), (3) that there is p-value hacking going on (paragraph 4), and (4) he thinks that our model does not apply because p-value hacking may change the assumptions underlying this model in genomics.</p>
<p>I will handle each of these individually:</p>
<p>(1) This is primarily semantics. Andrew is concerned with interesting/uninteresting effects, as captured by his Type S and Type M errors. We are concerned with true/false positives as defined by type I and type II errors (and a null hypothesis). You might believe that the null is never true - but then by the standards of the original paper all published research is true. Or you might say that a non-null result might have an effect size too small to be interesting - but the framework being used here is hypothesis testing and we have stated how we defined a true positive in that framework explicitly. We define the error rate by the rate of classifying things as null when they should be classified as alternative and vice versa. We then estimate the false discovery rate, under the framework used to calculate those p-values. So this is not a criticism of our work with evidence, rather it is a stated difference of opinion about the philosophy of statistics not supported by conclusive data.</p>
<p>(2) Gelman says he originally thought we would follow up specific p-values to see if the results replicated and makes that a critique of our paper. That would definitely be another approach to the problem. Instead, we chose to perform statistical inference using justified and widely used statistical techniques. Others have taken the replication route, but of course that approach too would be fraught with difficulty - are the exact conditions replicable (e.g. for a clinical trial), can we sample from the same population (if it has changed or is hard to sample), and what do we mean by replicates (would two p-values less than 0.05 be convincing?). This again is not a criticism of our approach, but a statement of another, different analysis Gelman was wishing to see.</p>
<p>(3)-(4) Gelman states, “You don’t have to be Uri Simonsohn to know that there’s a lot of p-hacking going on.” Indeed Simonsohn <a href="http://pss.sagepub.com/content/22/11/1359.full.pdf+html">wrote a paper</a> where he talks about the potential for p-value hacking. He does not collect data from real experiments/analyses, but uses simulations, theoretical arguments, and prospective experiments designed to show specific problems. While these arguments are useful and informative, they give no indication of the extent of p-value hacking in the medical literature. So this argument is made on the basis of a supposition by Gelman that this happens broadly, rather than on data.</p>
<p>My objection to his criticism is that his critiques are based primarily on philosophy (1), a wish that we had done the study a different way (2), and assumptions about the way science works with only anecdotal evidence (3-4).</p>
<p>One thing you could very reasonably argue is how sensitive our approach is to violations of our assumptions (which Gelman implied with criticisms 3-4). To address this, my co-author and I have now performed a simulation analysis. In the first simulation, we considered a case where every p-value less than 0.05 was reported and the p-values were uniformly distributed, just as our assumptions would state. We then plot our estimates of the swfdr versus the truth. Here our estimator works pretty well.</p>
<p> </p>
<p><a href="http://simplystatistics.org/2013/01/24/why-i-disagree-with-andrew-gelmans-critique-of-my-paper-about-the-rate-of-false-discoveries-in-the-medical-literature/all-significant/" rel="attachment wp-att-940"><a href="http://simplystatistics.org/?attachment_id=942" rel="attachment wp-att-942"><img class="alignnone size-medium wp-image-942" alt="all-significant" src="http://simplystatistics.org/wp-content/uploads/2013/01/all-significant-300x228.jpg" width="300" height="228" srcset="http://simplystatistics.org/wp-content/uploads/2013/01/all-significant-300x228.jpg 300w, http://simplystatistics.org/wp-content/uploads/2013/01/all-significant-1024x779.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2013/01/all-significant.jpg 1266w" sizes="(max-width: 300px) 100vw, 300px" /></a></a></p>
<p>We also simulate a pretty serious p-value hacking scenario where people report only the minimum p-value they observe out of 20 p-values. Here our assumption of uniformity is strongly violated. But we still get pretty accurate estimates of the swfdr for the range of values (14%) we report in our paper.</p>
<p><a href="http://simplystatistics.org/2013/01/24/why-i-disagree-with-andrew-gelmans-critique-of-my-paper-about-the-rate-of-false-discoveries-in-the-medical-literature/only-min-2/" rel="attachment wp-att-944"><img class="alignnone size-medium wp-image-944" alt="only-min" src="http://simplystatistics.org/wp-content/uploads/2013/01/only-min-300x228.jpg" width="300" height="228" srcset="http://simplystatistics.org/wp-content/uploads/2013/01/only-min-300x228.jpg 300w, http://simplystatistics.org/wp-content/uploads/2013/01/only-min-1024x779.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2013/01/only-min.jpg 1266w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>Since I recognize this is only a couple of simulations, I have also put the code up on Github with the rest of our code for the paper so other people can test it out.</p>
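<p>To give a concrete sense of the second scenario, here is a stand-alone toy example (separate from the code in the repository just mentioned) of what reporting only the minimum of 20 p-values does when every null hypothesis is true:</p>
<pre><code class="language-r">
# Toy version of the "report only the minimum of 20 p-values" scenario,
# with every null hypothesis actually true
set.seed(1)
n_studies <- 10000
reported  <- replicate(n_studies, min(runif(20)))

# Fraction of purely null studies that still report p < 0.05
mean(reported < 0.05)   # close to 1 - 0.95^20, i.e. roughly 64%

# The reported p-values pile up near zero instead of being uniform,
# which is exactly the assumption violation the simulation above probes
hist(reported, breaks = 50, main = "Minimum of 20 null p-values", xlab = "p")
</code></pre>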
<p>Whether you are convinced by Gelman, or convinced by my response, I agree with him that it is pretty unlikely that “most published research is false” so I’m glad our paper is at least bringing that important point up. I also hope that by introducing a new estimator of the science-wise fdr we inspire more methodological development and that philosophical criticisms won’t prevent people from looking at the data in new ways.</p>
Statisticians and computer scientists - if there is no code, there is no paper
2013-01-23T11:25:05+00:00
http://simplystats.github.io/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper
<p>I think it has been beat to death that the incentives in academia lean heavily toward producing papers and less toward producing/maintaining software. There are people that are way, way more knowledgeable than me about building and maintaining software. For example, Titus Brown hit a lot of the key issues in his <a href="http://simplystatistics.org/2012/08/17/interview-with-c-titus-brown-computational-biologist/">interview</a>. The open source community is also filled with advocates and researchers who know way more about this than I do.</p>
<p>This post is more about my views on changing the perspective of code/software in the data analysis community. I have been frustrated often with statisticians and computer scientists who write papers where they develop new methods and seem to demonstrate that those methods blow away all their competitors. But then no software is available to actually test and see if that is true. Even worse, sometimes I just want to use their method to solve a problem in our pipeline, but I have to code it from scratch!</p>
<p>I have also had several cases where I emailed the authors for their software and they said it “wasn’t fit for distribution” or they “don’t have code” or the “code can only be run on our machines”. I totally understand the first and last, my code isn’t always pretty (I have zero formal training in computer science so messy code is actually the most likely scenario) but I always say, “I’ll take whatever you got and I’m willing to hack it out to make it work”. I often still am turned down.</p>
<p>So I have a new policy when evaluating CV’s of candidates for jobs, or when I’m reading a paper as a referee. If the paper is about a new statistical method or machine learning algorithm and there is no software available for that method - I simply mentally cross it off the CV. If I’m reading a data analysis and there isn’t code that reproduces their analysis - I mentally cross it off. In my mind, new methods/analyses without software are just <a href="http://en.wikipedia.org/wiki/Vaporware">vapor ware</a>. Now, you’d definitely have to cross a few papers off my CV, based on this principle. I do that. But I’m trying really hard going forward to make sure nothing gets crossed off.</p>
<p>In a future post I’ll talk about the new issue I’m struggling with: maintaining all that software I’m creating.</p>
Sunday data/statistics link roundup (1/20/2013)
2013-01-20T10:00:32+00:00
http://simplystats.github.io/2013/01/20/sunday-datastatistics-link-roundup-1202013
<ol>
<li>This might be short. I have a couple of classes starting on Monday. The first is our <a href="http://www.jhsph.edu/courses/course/140.753/01/2012/16424/">advanced methods</a> class. This is one of my favorite classes to teach; our Ph.D. students are pretty awesome and they always amaze me with what they can do. The other is my Coursera debut in <a href="https://www.coursera.org/course/dataanalysis">Data Analysis</a>. We are at about 88,000 enrolled. Tell your friends, maybe we can make it an even 100k! In related news, some California schools are <a href="http://chronicle.com/article/California-State-U-Will/136677/">experimenting</a> with offering credit for online courses. (via Sherri R.)</li>
<li><a href="http://espn.go.com/blog/truehoop/post/_/id/53534/where-have-all-the-gunners-gone">Some interesting numbers</a> on why there aren’t as many “gunners” in the NBA - players who score a huge number of points. I love the talk about hustling, rotating team defense. I have always enjoyed watching good defense more than good offense. It might not be the most popular thing to watch, but seeing the Spurs rotate perfectly to cover the open man is a thing of athletic beauty. <a href="http://www.utahstateaggies.com/sports/m-baskbl/ust-m-baskbl-body.html">My Aggies</a> aren’t too bad at it either…(via Rafa).</li>
<li>A <a href="http://journal.sjdm.org/12/12810/jdm12810.html">really interesting article</a> suggesting that nonsense math can make arguments seem more convincing to non-technical audiences. This is tangentially related to a <a href="http://www.pnas.org/content/early/2012/06/22/1205259109.full.pdf">previous study</a> which showed that more equations led to fewer citations in biology articles. Overall, my take-home message is that we don’t necessarily need fewer equations; we need to elevate statistical/quantitative literacy to the importance of reading literacy. (via David S.)</li>
<li>This has been posted elsewhere, but a reminder to send in your statistical stories for the <a href="http://statisticsforum.wordpress.com/2013/01/17/wanted-365-stories-of-statistics/">365 stories of statistics</a>.</li>
<li>Automatically generate a <a href="http://www.elsewhere.org/pomo/">postmodernism essay</a>. Hit refresh a few times. It’s pretty hilarious. It reminds me a lot of this <a href="http://nataliacecire.blogspot.com/2012/11/the-passion-of-nate-silver-sort-of.html">article about statisticians</a>. <a href="http://www.csse.monash.edu.au/cgi-bin/pub_search?104+1996+bulhak+Postmodernism">Here</a> is the technical paper describing how they simulate the essays. (via Rafa)</li>
</ol>
Comparing online and in-class outcomes
2013-01-18T11:31:44+00:00
http://simplystats.github.io/2013/01/18/comparing-online-and-in-class-outcomes
<p>My colleague John McGready has just <a href="http://www.sciencedirect.com/science/article/pii/S009174351200597X">published a study</a> he conducted comparing the outcomes of students in the online and in-class versions of his <em>Statistical Reasoning in Public Health</em> class that he teaches here in the fall. In this class the online and in-class portions are taught concurrently, so it’s basically one big class where some people are not in the building. Everything is the same for both groups–quizzes, tests, homework, instructor, lecture notes. From the article:</p>
<blockquote>
<p id="p0015">
The on-campus version employs twice-weekly 90 minute live lectures. Online students view pre-recorded narrated versions of the same materials. Narrated lecture slides are made available to on-campus students.
</p>
<p>The on-campus section has 5 weekly office hour sessions. Online students communicate with the course instructor asynchronously via email and a course bulletin board. The instructor communicates with online students in real time via weekly one-hour online sessions. Exams and quizzes are multiple choice. In 2005, on-campus students took timed quizzes and exams on paper in monitored classrooms. Online students took quizzes via a web-based interface with the same time limits. Final exams for the online students were taken on paper with a proctor.</p>
</blockquote>
<p>So how did the two groups fare in their final grades? Pretty much the same. First off, the two groups of students were not the same. Online students were 8 years older on average, more likely to have an MD degree, and more likely to be male. Final exam scores between online and in-class groups differed by -1.2 (out of 100, online group was lower) and after adjusting for student characteristics they differed by -1.5. In both cases, the difference was not statistically significant.</p>
<p>This was not a controlled trial and so there are possibly some problems with unmeasured confounding given that the populations appeared fairly different. It would be interesting to think about a study design that might allow a measure of control or perhaps get a better measure of the difference between online and on-campus learning. But the logistics and demographics of the students would seem to make this kind of experiment challenging.</p>
<p>Here’s the best I can think of right now: Take a large class (where all students are on-campus) and get a classroom that can fit roughly half the number of students in the class. Then randomize half the students to be in-class and the other half to be online up until the midterm. After the midterm, cross everyone over so that the online group comes into the classroom and the in-class group goes online to take the final. It’s not perfect. One issue is that course material tends to get harder as the term goes on and it may be that the “easier” material is better learned online and the harder material is better learned on-campus (or vice versa). Any thoughts?</p>
Review of R Graphics Cookbook by Winston Chang
2013-01-16T09:47:04+00:00
http://simplystats.github.io/2013/01/16/review-of-r-graphics-cookbook-by-winston-chang
<p>I just got a copy of Winston Chang’s book <em>R Graphics Cookbook</em>, published by O’Reilly Media. This book follows now a series of O’Reilly books on R, including an <em>R Cookbook.</em> Winston Chang is a graduate student at Northwestern University but is probably better known to R users as an active member of the ggplot2 mailing list and an active contributor to the ggplot2 source code.</p>
<p>The book has a typical cookbook format. After some preliminaries about how to install R packages and how to read data into R (Chapter 1), he quickly launches into exploratory data analysis and graphing. The basic outline of each section is:</p>
<ol>
<li>Statement of problem (“You want to make a histogram”)</li>
<li>Solution: If you can reasonably do it with base R graphics, here’s how you do it. Oh, and here’s how you do it in ggplot2. Notice how it’s better? (He doesn’t actually say that. He doesn’t have to.)</li>
<li>Discussion: This usually revolves around different options that might be set or alternative approaches.</li>
<li>See also: Other recipes in the book.</li>
</ol>
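<p>To make that problem/solution format concrete, here is a small example of my own (not taken from the book) showing the same basic histogram in base R and in ggplot2:</p>
<pre><code class="r"># Problem: "You want to make a histogram" of a numeric vector
set.seed(1)
x <- rnorm(200)

# Base R solution
hist(x, breaks = 20, main = "Histogram of x", xlab = "x")

# ggplot2 solution
library(ggplot2)
ggplot(data.frame(x = x), aes(x = x)) +
  geom_histogram(binwidth = 0.3)
</code></pre>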
<p>Interestingly, nowhere in the book is the lattice package mentioned (except in passing). But I suppose that’s because ggplot2 pretty much supersedes anything you might want to do in the lattice package. Recently, I’ve been wondering what the future of the lattice package is, given that it doesn’t seem to me to be under very active development. But I digress….</p>
<p>Overall, the book is great. I learned quite a few things just in my initial read of the book and as I dug in a bit more there were some functions that I was not familiar with. Much of the material is straight up ggplot2 stuff so if you’re an expert there you probably won’t get a whole lot more. But my guess is that most are not experts and so will be able to get something out of the book.</p>
<p>The meat of the book covers a lot of different plotting techniques, enough to make your toolbox quite full. If you pick up this book and think something is missing, my guess is that you’re making some pretty esoteric plots. I enjoyed the few sections on specifying colors as well as the recipes on making maps (one of ggplot2’s strong points). I wish there were more map recipes, but hey, that’s just me.</p>
<p>Towards the end there’s a nice discussion of graphics file formats (PDF, PNG, WMF, etc.) and the advantages and disadvantages of each (Chapter 14: Output for Presentation). I particularly enjoyed the discussion of fonts in R graphics since I find this to be a fairly confusing aspect of R, even for seasoned users.</p>
<p>The book ends with a series of recipes related to data manipulation. It’s funny how many recipes there are about modifying factor variables, but I guess this is just a function of how annoying it is to modify factor variables. There’s also some highlighting of the plyr and reshape2 packages.</p>
<p>Ultimately, I think this is a nice complement to Hadley Wickham’s <em>ggplot2</em> as most of the recipes focus on implementing plots in ggplot2. I don’t think you necessarily need to have a deep understanding of ggplot2 in order to use this book (there are some details in an appendix), but some people might want to grab Hadley’s book for more background. In fact, this may be a better book to use to get started with ggplot2 simply because it focuses on specific applications. I kept thinking that if the book had been written using base graphics only, it’d probably have to be 2 or 3 times longer just to fit all the code in, which is a testament to the power and compactness of the ggplot2 approach.</p>
<p>One last note: I got the e-book version of the book, but I would recommend the paper version. With books like these, I like to flip around constantly (since there’s no need to read it in a linear fashion) and I find e-readers like iBooks and Kindle Reader to be not so good at this.</p>
R package meme
2013-01-16T05:00:20+00:00
http://simplystats.github.io/2013/01/16/r-package-meme
<p>I just got this from a former student who is working on a project with me:</p>
<p><img class="alignnone" style="font-size: 16px;" alt="" src="http://cdn.memegenerator.net/instances/400x/33457759.jpg" width="400" height="400" /></p>
<p>Awesome.</p>
Welcome to the Smog-ocalypse
2013-01-14T16:23:50+00:00
http://simplystats.github.io/2013/01/14/welcome-to-the-smog-ocalypse
<p><img class="size-medium wp-image-876 alignleft" alt="Beijing fog, 2013" src="http://simplystatistics.org/wp-content/uploads/2013/01/BeijingFog1-225x300.jpg" width="225" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2013/01/BeijingFog1-225x300.jpg 225w, http://simplystatistics.org/wp-content/uploads/2013/01/BeijingFog1-768x1024.jpg 768w, http://simplystatistics.org/wp-content/uploads/2013/01/BeijingFog1.jpg 1536w" sizes="(max-width: 225px) 100vw, 225px" /></p>
<p>Recent reports of air pollution levels out of Beijing are <a href="http://www.nytimes.com/2013/01/15/world/asia/china-allows-media-to-report-alarming-air-pollution-crisis.html?smid=pl-share">very</a> <a href="http://bloom.bg/ZTFD9q">very</a> disturbing. Levels of fine particulate matter (PM2.5, or PM less than 2.5 microns in diameter) have reached unprecedented levels. So high are the levels that even the <a href="http://www.nytimes.com/2013/01/15/world/asia/china-allows-media-to-report-alarming-air-pollution-crisis.html?ref=global-home&_r=0">official media are allowed to mention it</a>.</p>
<p>Here is a photograph of downtown Beijing during the day (Thanks to Sarah E. Burton for the photograph). Hourly levels of PM2.5 hit over 900 micrograms per cubic meter in some parts of the city and 24-hour average levels (the basis for most air quality standards) reached over 500 micrograms per cubic meter. Just for reference, the US national ambient air quality standard for the 24-hour average level of PM2.5 is 35 micrograms per cubic meter.</p>
<p>Below is a plot of the PM2.5 data taken from the <a href="https://twitter.com/beijingair">US Embassy’s rooftop monitor</a>.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2013/01/Beijingair.png"><img class="alignright size-large wp-image-890" alt="Beijingair" src="http://simplystatistics.org/wp-content/uploads/2013/01/Beijingair-1024x737.png" width="640" height="460" srcset="http://simplystatistics.org/wp-content/uploads/2013/01/Beijingair-300x216.png 300w, http://simplystatistics.org/wp-content/uploads/2013/01/Beijingair-1024x737.png 1024w" sizes="(max-width: 640px) 100vw, 640px" /></a></p>
<p>The solid circles indicate the 24-hour average for the day. The red line is the median of the daily averages for the time period in the plot (about 6 weeks) and the dotted blue line is the US 24-hour national ambient air quality standard. The median for the period was about 91 micrograms per cubic meter.</p>
<p>First, it should be noted that a “typical” day of 91 micrograms per cubic meter is still <em>crazy.</em> But suppose we take 91 to be a typical day. Then in a city like Beijing, which has about 20 million people, if we assume that about 700 people die on a typical day, then the last 5 days alone would experience about 307 excess deaths from all causes. I get this from using a rough estimate of a 0.3% increase in all-cause mortality per 10 microgram per cubic meter increase in PM2.5 levels (studies from China and the US tend to report risks in roughly this area). The 700 deaths per day number is a fairly back-of-the-envelope number that I got simply using comparisons to other major cities. Numbers for things like excess hospitalizations will be higher because both the risks and the baselines are higher. For example, in the US, we estimate about a 1.28% increase in heart failure hospitalization for a 10 microgram per cubic meter increase in PM2.5.</p>
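<p>For readers who want to see the arithmetic behind the excess-deaths number, here is a rough sketch in R. The five daily averages below are made up for illustration (they are not the embassy monitor values), and the 700 deaths per day is the same back-of-the-envelope baseline used above:</p>
<pre><code class="r">baseline_pm     <- 91     # "typical" daily average PM2.5 (ug/m^3)
baseline_deaths <- 700    # assumed all-cause deaths per day in Beijing
risk_per_10     <- 0.003  # ~0.3% increase in mortality per 10 ug/m^3 of PM2.5

pm_last5 <- c(250, 350, 500, 450, 350)  # hypothetical 24-hour averages

# Excess deaths each day, relative to a "typical" 91 ug/m^3 day
excess <- baseline_deaths * risk_per_10 * (pm_last5 - baseline_pm) / 10
round(sum(excess))  # rough total over the 5 days, on the order of 300
</code></pre>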
<p>If you like, you can also <a href="http://simplystatistics.org/2011/12/14/smoking-is-a-choice-breathing-is-not/">translate current levels to numbers of cigarettes smoked</a>. If you assume a typical adult inhales about 18-20 cubic meters of air per day, then in the last 5 days, the average Beijinger smoked about 3 cigarettes just by getting out of bed in the morning.</p>
<p>Lastly, I want to point to a nice series of photos that the Guardian has collected on the (in)famous <a href="http://www.guardian.co.uk/environment/gallery/2012/dec/05/60-years-great-smog-london-in-pictures">London Fog of 1952</a>. Although the levels were quite a bit worse back then (about 2-3 times worse, if you can believe it), the photos bear a striking resemblance to today’s Beijing.</p>
<p>At least in the US, the infamous smog episodes that occurred regularly only 60 years ago are pretty much non-existent. But in many places around the world, “crazy bad” air pollution is part of everyday life.</p>
Sunday data/statistics link roundup (1/13/2013)
2013-01-13T15:30:40+00:00
http://simplystats.github.io/2013/01/13/sunday-datastatistics-link-roundup-1132012
<ol>
<li><a href="http://www.seaborg.ucla.edu/video2012.html">These</a> are some great talks. But definitely watch Michael Eisen’s talk on E-biomed and the history of open access publication. This is particularly poignant in light of <a href="http://www.nytimes.com/2013/01/13/technology/aaron-swartz-internet-activist-dies-at-26.html?_r=0">Aaron Swartz’s tragic suicide</a>. It’s also worth checking out the Twitter hashtag <a href="https://twitter.com/search?q=%23pdftribute&src=hash">#pdftribute</a>.</li>
<li>An <a href="http://civilstat.com/wp-content/uploads/2013/01/IMG_8807-1024x768.jpg">awesome flowchart</a> shown before a talk given by the creator of the R <a href="http://www.twotorials.com/">twotorials</a>. Roger gets a shoutout (via civilstat).</li>
<li><a href="http://stochasticplanet.tumblr.com/">This blog</a> selects a position at random on the planet earth every day and posts the picture taken closest to that point. Not much about the methodology on the blog, but totally fascinating and a clever idea.</li>
<li>A set of data giving a <a href="http://reportcard.studentsfirst.org/">“report card”</a> for each state on how that state does in improving public education for students. I’m not sure I believe the grades, but the underlying reports look interesting.</li>
</ol>
NSF should understand that Statistics is not Mathematics
2013-01-11T10:44:56+00:00
http://simplystats.github.io/2013/01/11/nsf-should-understand-that-statistics-in-not-mathematics
<p>NSF has realized that the role of Statistics is growing in all areas of science and engineering and has <a href="http://www.nsf.gov/attachments/124926/public/Request_to_form_MPSAC_Subcommittee_StatsNSF_8-15-2012_Final.pdf">formed a subcommittee</a> to examine the current structure of support of the statistical sciences. As <a href="http://simplystatistics.org/2012/08/21/nsf-recognizes-math-and-statistics-are-not-the-same/">Roger explained</a> in August, the NSF is divided into directorates composed of divisions. Statistics is in the Division of Mathematical Sciences (DMS) within the Directorate for Mathematical and Physical Sciences. Within this <a href="http://www.nsf.gov/div/index.jsp?div=dms">division</a> it is a <em>Disciplinary Research Program</em> along with Topology, Geometric Analysis, etc. To statisticians this does not make much sense, and my first thought when asked for recommendations was that we need a proper division. But the committee is seeking out recommendations that</p>
<blockquote>
<p>[do] not include renaming of the Division of Mathematical Sciences. Particularly desired are recommendations that can be implemented within the current divisional and directorate structure of NSF; Foundation (NSF) and to provide recommendations for NSF to consider.</p>
</blockquote>
<p>This clarification is there because a former director <a href="http://simplystatistics.org/2012/08/21/nsf-recognizes-math-and-statistics-are-not-the-same/">proposed changing the division’s</a> name to “Division of Mathematical and <strong>Statistical</strong> Sciences”. The NSF shot down this idea and listed this as one of the reasons:</p>
<blockquote>
<p>If the name change attracts more proposals to the Division from the statistics community, this could draw funding away from other subfields</p>
</blockquote>
<p>So NSF does not want to take away from the other math programs and this is understandable given the current levels of research funding for Mathematics. But this being the case, I can’t really think of a recommendation other than giving Statistics its own division or giving data-related sciences their own directorate. Increasing support for the statistical sciences means increasing funding. You secure the necessary funding either by asking Congress for a bigger budget (good luck with that) or by cutting from other divisions, not just Mathematics. A new division makes sense not only in practice but also in principle because Statistics is not Mathematics.</p>
<p>Statistics is analogous to other disciplines that use mathematics as a fundamental language, like Physics, Engineering, and Computer Science. But like those disciplines, Statistics contributes separate and fundamental scientific knowledge. While the field of applied mathematics tries to explain the world with deterministic equations, Statistics takes a dramatically different approach. In highly complex systems, such as the weather, Mathematicians battle <a href="http://en.wikipedia.org/wiki/Laplace's_demon">Laplace’s demon</a> and struggle to explain nature using mathematics derived from first principles. Statisticians accept that deterministic approaches are not always useful and instead develop and rely on random models. These two approaches are both important as demonstrated by the improvements in meteorological predictions achieved once data-driven statistical models were used to complement deterministic mathematical models.</p>
<p>Although Statisticians rely heavily on theoretical/mathematical thinking, another important distinction from Mathematics is that advances in our field are almost exclusively driven by empirical work. Statistics always starts with a specific, concrete real-world problem: we thrive in <a href="http://en.wikipedia.org/wiki/Pasteur's_quadrant">Pasteur’s quadrant</a>. Important theoretical work that generalizes our solutions always follows. This approach, built mostly by basic researchers, has yielded some of the most useful concepts relied upon by modern science: the p-value, randomization, analysis of variance, regression, the proportional hazards model, causal inference, Bayesian methods, and the Bootstrap, just to name a few examples. These ideas were instrumental in the most important genetic discoveries, improving agriculture, the inception of the empirical social sciences, and revolutionizing medicine via randomized clinical trials. They have also fundamentally changed the way we abstract quantitative problems from real data.</p>
<p>The 21st century brings the era of big data, and distinguishing Statistics from Mathematics becomes more important than ever. Many areas of science are now being driven by new measurement technologies. Insights are being made by discovery-driven, as opposed to hypothesis-driven, experiments. Although testing hypotheses developed theoretically will of course remain important to science, it is not hard to imagine that, just as Leeuwenhoek became the father of microbiology by looking through the microscope without theoretical predictions, the era of big data will enable discoveries that we have not yet even imagined. However, it is naive to think that these new datasets will be free of noise and unwanted variability. Deterministic models alone will almost certainly fail at extracting useful information from these data, just like they have failed at predicting complex systems like the weather. The advancement in science during the era of big data that the NSF wants to see will only happen if the field that specializes in making sense of data is properly defined as a separate field from Mathematics and appropriately supported.</p>
<p><strong>Addendum:</strong> On a related note, NIH just announced that they plan to recruit a new senior scientific position:<a href="http://www.nih.gov/news/health/jan2013/od-10a.htm"> the Associate Director for Data Science</a></p>
The landscape of data analysis
2013-01-10T09:11:26+00:00
http://simplystats.github.io/2013/01/10/the-landscape-of-data-analysis
<p>I have been getting some questions via email, LinkedIn, and Twitter about the content of the Data Analysis class I will be teaching for Coursera. Data Analysis and Data Science mean different things to different people. So I made a video describing how Data Analysis fits into the landscape of other quantitative classes here:</p>
<p><a href="http://prezi.com/fhumwa8tb3fs/the-lanscape-of-data-analysis/?kw=view-fhumwa8tb3fs&rc=ref-27684941">Here</a> is the corresponding presentation. I also made a tentative list of topics we will cover, subject to change at the instructor’s whim. Here it is:</p>
<ul>
<li>The structure of a data analysis (steps in the process, knowing when to quit, etc.)</li>
<li>Types of data (census, designed studies, randomized trials)</li>
<li>Types of data analysis questions (exploratory, inferential, predictive, etc.)</li>
<li>How to write up a data analysis (compositional style, reproducibility, etc.)</li>
<li>Obtaining data from the web (through downloads mostly)</li>
<li>Loading data into R from different file types</li>
<li>Plotting data for exploratory purposes (boxplots, scatterplots, etc.)</li>
<li>Exploratory statistical models (clustering)</li>
<li>Statistical models for inference (linear models, basic confidence intervals/hypothesis testing)</li>
<li>Basic model checking (primarily visually)</li>
<li>The prediction process</li>
<li>Study design for prediction</li>
<li>Cross-validation</li>
<li>A couple of simple prediction models</li>
<li>Basics of simulation for evaluating models</li>
<li>Ways you can fool yourself and how to avoid them (confounding, multiple testing, etc.)</li>
</ul>
<p>Of course that is a ton of material for 8 weeks and so obviously we will be covering just the very basics. I think it is really important to remember that being a good Data Analyst is like being a good surgeon or writer. There is no such thing as a prodigy in surgery or writing, because it requires long experience, trying lots of things out, and learning from mistakes. I hope to give people the basic information they need to get started and point to resources where they can learn more. I also hope to give them a chance to practice some of the basics a couple of times and to learn that in data analysis the first goal is to “do no harm”.</p>
By introducing competition open online education will improve teaching at top universities
2013-01-08T10:00:17+00:00
http://simplystats.github.io/2013/01/08/by-introducing-competition-open-online-education-will-improve-teaching-at-top-universities
<p>It is no secret that faculty evaluations at <a href="http://www.shanghairanking.com/ARWU2012.html">top universities</a> weigh research much more than teaching. This is not surprising given that, among other reasons, global visibility comes from academic innovation (think Nobel Prizes) not classroom instruction. Come promotion time the peer review system carefully examines your publication record and ability to raise research funds. External experts within your research area are asked if you are a leader in the field. Top universities maintain their status by imposing standards that lead to a highly competitive environment in which only the most talented researchers survive.</p>
<p>However, the assessment of teaching excellence is much less stringent. Unless they reveal utter incompetence, teaching evaluations are practically ignored; especially if you have graduated numerous PhD students. Certainly, outside experts are not asked about your teaching. This imbalance in incentives explains why faculty use research funding to buy-out of teaching and why highly recruited candidates negotiate low teaching loads.</p>
<p>Top researchers end up at top universities but being good at research does not necessarily mean you are a good teacher. Furthermore, the effort required to be a competitive researcher leaves limited time for class preparation. To make matters worse, within a university, faculty have a monopoly on the classes they teach. With few incentives and practically no competition it is hard to believe that top universities are doing the best they can when it comes to classroom instruction. By introducing competition, MOOCs might change this.</p>
<p>To illustrate, say you are a chair of a soft money department in 2015. Four of your faculty receive 25% funding to teach the big Stat 101 class and your graduate program’s three main classes. But despite being great researchers, these four are mediocre teachers. So why are they teaching if 1) a MOOC exists for each of these classes and 2) these professors can easily cover 100% of their salary with research funds? As chair, not only do you wonder why not let these four profs focus on what they do best, but also why your department is not creating MOOCs and getting global recognition for it. So instead of hiring 4 great researchers who are mediocre teachers, why not hire (for the same cost) 4 great researchers (fully funded by grants) and 1 great teacher (funded with tuition $)? I think in the future tenure track positions will be divided into top researchers doing mostly research and top teachers doing mostly classroom teaching and MOOC development. Because top universities will feel the pressure to compete and develop the courses that educate the world, there will be no room for mediocre teaching.</p>
Sunday data/statistics link roundup (1/6/2013)
2013-01-06T11:08:07+00:00
http://simplystats.github.io/2013/01/06/sunday-datastatistics-link-roundup-162013
<ol>
<li>Not really statistics, but this is <a href="http://www.science20.com/hammock_physicist/rational_suckers-99998">an interesting article</a> about how rational optimization by individual actors does not always lead to an optimal solution. Relatedly, here is the <a href="http://www.businessinsider.com/16-ways-asian-cities-are-making-their-us-counterparts-look-like-the-third-world-2013-1#some-japanese-street-signs-have-heat-maps-to-relay-congestion-information-to-drivers-and-directly-influence-traffic-patterns-3">coolest street sign</a> I think I’ve ever seen, with a heatmap of traffic density to try to influence commuters.</li>
<li>An <a href="http://arxiv.org/pdf/1205.4891v1.pdf">interesting paper</a> that talks about how clustering is only a really hard problem when there aren’t obvious clusters. I was a little disappointed in the paper, because it defines the “obviousness” of clusters only theoretically by a distance metric. There is very little discussion of the practical distance/visual distance metrics people use when looking at clustering dendrograms, etc.</li>
<li>A post about the <a href="http://norvig.com/chomsky.html">two cultures of statistical learning</a> and a <a href="http://www.r-bloggers.com/data-driven-science-is-a-failure-of-imagination/">related post</a> on how data-driven science is a failure of imagination. I think in both cases, it is worth pointing out that <a href="http://simplystatistics.org/2012/12/31/what-makes-a-good-data-scientist/">the only good data science is good science</a> - i.e. it seeks to answer a real, specific question through the scientific method. However, I think for many modern scientific problems it is pretty naive to think we will be able to come to a full, mechanistic understanding complete with tidy theorems that describe all the properties of the system. I think the real failure of imagination is to think that science/statistics/mathematics won’t change to tackle the realistic challenges posed in solving modern scientific problems.</li>
<li>A graph that shows the incredibly strong correlation ( > 0.99!) between the <a href="http://boingboing.net/2013/01/01/correlation-between-autism-dia.html">growth of autism diagnoses and organic food sales</a>. Another example where even really strong correlation does not imply causation.</li>
<li>The Buffalo Bills are going to start an <a href="http://www.nfl.com/news/story/0ap1000000121055/article/buffalo-bills-to-start-advanced-analytics-department">advanced analytics department</a> (via Rafa and Chris V.), maybe they can take advantage of all this <a href="http://www.advancednflstats.com/2010/04/play-by-play-data.html">free play-by-play data</a> from years of NFL games.</li>
<li>A <a href="https://www.youtube.com/watch?v=CJAIERgWhZQ">prescient interview</a> with Isaac Asimov on learning, predicting the Kahn Academy, MOOCs and other developments in online learning (via Rafa and Marginal Revolution).</li>
<li><a href="http://seanjtaylor.com/post/39573264781/the-statistics-software-signal">The statistical software signal</a> - what your choice of software says about you. Just another reason we need a <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">deterministic statistical machine</a>.</li>
</ol>
Does NIH fund innovative work? Does Nature care about publishing accurate articles?
2013-01-04T10:00:00+00:00
http://simplystats.github.io/2013/01/04/does-nih-fund-innovative-work-does-nature-care-about-publishing-accurate-articles
<p><em>Editor’s Note: In a recent post we <a href="http://simplystatistics.org/2012/12/20/the-nih-peer-review-system-is-still-the-best-at-identifying-innovative-biomedical-investigators/">disagreed</a> with a Nature article claiming that NIH doesn’t support innovation. Our colleague <a href="http://bioinformatics.igm.jhmi.edu/salzberg/Salzberg/Salzberg_Lab_Home.html">Steven Salzberg</a> actually looked at the data and wrote the guest post below. </em></p>
<p>Nature <a href="http://www.nature.com/nature/journal/v492/n7427/full/492034a.html">published an article last month</a> with the provocative title “Research grants: Conform and be funded.” The authors looked at papers with over 1000 citations to find out whether scientists “who do the most influential scientific work get funded by the NIH.” Their dramatic conclusion, widely reported, was that only 40% of such influential scientists get funding.</p>
<p>Dramatic, but wrong. I re-analyzed the authors’ data and wrote a letter to Nature, <a href="http://www.nature.com/nature/journal/v493/n7430/full/493026b.html">which was published today</a> along with the authors response, which more or less ignored my points. Unfortunately, Nature cut my already-short letter in half, so what readers see in the journal omits half my argument. My entire letter is published here, thanks to my colleagues at Simply Statistics. I titled it “NIH funds the overwhelming majority of highly influential original science results,” because that’s what the original study should have concluded from their very own data. Here goes:</p>
<p style="padding-left: 30px">
<em>To the Editors:</em>
</p>
<p style="padding-left: 30px">
<em>In their recent commentary, "Conform and be funded," Joshua Nicholson and John Ioannidis claim that "too many US authors of the most innovative and influential papers in the life sciences do not receive NIH funding." They support their thesis with an analysis of 200 papers sampled from 700 life science papers with over 1,000 citations. Their main finding was that only 40% of "primary authors" on these papers are PIs on NIH grants, from which they argue that the peer review system "encourage[s] conformity if not mediocrity."</em>
</p>
<p style="padding-left: 30px">
<em>While this makes for an appealing headline, the authors' own data does not support their conclusion. I downloaded the full text for a random sample of 125 of the 700 highly cited papers [data available upon request]. A majority of these papers were either reviews (63), which do not report original findings, or not in the life sciences (17) despite being included in the authors' database. For the remaining 45 papers, I looked at each paper to see if the work was supported by NIH. In a few cases where the paper did not include this information, I used the NIH grants database to determine if the corresponding author has current NIH support. 34 out of 45 (75%) of these highly-cited papers were supported by NIH. The 11 papers not supported included papers published by other branches of the U.S. government, including the CDC and the U.S. Army, for which NIH support would not be appropriate. Thus, using the authors' own data, one would have to conclude that NIH has supported a large majority of highly influential life sciences discoveries in the past twelve years.</em>
</p>
<p style="padding-left: 30px">
<em>The authors – and the editors at </em>Nature<em>, who contributed to the article – suffer from the same biases that Ioannidis himself has often criticized. Their inclusion of inappropriate articles and especially the choice to require that both the first and last author be PIs on an NIH grant, even when the first author was a student, produced an artificially low number that misrepresents the degree to which NIH supports innovative original research.</em>
</p>
<p>It seems pretty clear that <em>Nature</em> wanted a headline about how NIH doesn’t support innovation, and Ioannidis was happy to give it to them. Now, I’d love it if NIH had the funds to support more scientists, and I’d also be in favor of funding at least some work retrospectively - based on recent major achievements, for example, rather than proposed future work. But the evidence doesn’t support the “Conform and be funded” headline, however much <em>Nature</em> might want it to be true.</p>
The scientific reasons it is not helpful to study the Newtown shooter's DNA
2013-01-03T10:10:46+00:00
http://simplystats.github.io/2013/01/03/the-scientific-reasons-it-is-not-helpful-to-study-the-newtown-shooters-dna
<p>The Connecticut Medical Examiner <a href="http://www.theatlanticwire.com/technology/2012/12/adam-lanza-dna-test/60371/">has asked to sequence</a> and study the DNA of the recent Newtown shooter. I’ve been seeing this pop up over the last few days on a lot of <a href="http://www.businessinsider.com/plans-to-study-adam-lanzas-dna-splits-scientific-community-2012-12">popular media sites</a>, where they mention some objections scientists (or geneticists) may have to this “scientific” study. But I haven’t seen the objections explicitly laid out anywhere. So here are mine.</p>
<p><strong>Ignoring the fundamentals of the genetics of complex disease:</strong> If the violent behavior of the shooter has any genetic underpinning, it is complex. If you only look at one person’s DNA, without a clear behavior definition (violent? mental disorder? etc.?) it is impossible to assess important complications such as <a href="http://en.wikipedia.org/wiki/Penetrance">penetrance</a>, <a href="http://en.wikipedia.org/wiki/Epistasis">epistasis</a>, and <a href="http://en.wikipedia.org/wiki/Gene%E2%80%93environment_interaction">gene-environment interactions</a>, to name a few. These make statistical analysis incredibly complicated even in huge, well-designed studies.</p>
<p><strong>Small Sample Size</strong>: One person hit on the issue that is maybe the biggest reason this is a waste of time/likely to lead to incorrect results. <em>You can’t draw a reasonable conclusion about any population by <a href="https://twitter.com/drng/status/283692936930152448">looking at only one individual</a>.</em> This is actually a fundamental component of <a href="http://en.wikipedia.org/wiki/Statistical_inference">statistical inference</a>. The goal of statistical inference is to take a small, representative sample and use data from that sample to say something about the bigger population. In this case, there are two reasons that the usual practice of statistical inference can’t be applied: (1) only one individual is being considered, so we can’t measure anything about how variable (or accurate) the data are, and (2) we’ve picked one, incredibly high-profile, and almost certainly not representative, individual to study.</p>
<p><strong>Multiple testing/data dredging: </strong>The small sample size problem is compounded by the fact that we aren’t looking at just one or two of the shooter’s genes, but rather the whole genome. To see why making statements about violent individuals based on only one person’s DNA is a bad idea, think about the <a href="http://news.bbc.co.uk/2/hi/science/nature/3760766.stm">20,000 genes in a human body</a>. Let’s suppose that only one of the genes causes violent behavior (it is definitely more complicated than that) and that there is no environmental cause to the violent behavior (clearly false). Furthermore, suppose that if you have the bad version of the violent gene you will do something violent in your life (almost definitely not a sure thing).</p>
<p>Now, even with all these simplifying (and incorrect) assumptions, for each gene you flip a coin with a different chance of coming up heads. The violent gene turned up tails, but so did a large number of other genes. If we compare the set of genes that came up tails to another individual, they will have a huge number in common in addition to the violent gene. So based on this information, you would have no idea which gene causes violence, even in this hugely simplified scenario.</p>
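<p>A toy simulation makes the point. Under these deliberately oversimplified assumptions (one causal gene, everything independent), one genome still leaves you with hundreds of genes that look just like the “violent gene.” This sketch is mine, with made-up variant frequencies:</p>
<pre><code class="r">set.seed(2013)
n_genes <- 20000

# Made-up probability that each gene carries its "bad" version
p_bad    <- runif(n_genes, 0.01, 0.10)
genotype <- rbinom(n_genes, size = 1, prob = p_bad)  # 1 = "bad" version

violent_gene <- 1            # pretend gene #1 is the single causal gene
genotype[violent_gene] <- 1  # the shooter carries the bad version

# How many genes in this one genome look exactly like the causal gene?
sum(genotype == 1)  # typically over a thousand -- no way to pick out gene #1
</code></pre>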
<p><strong>Heavy reliance on prior information/intuition</strong>: This is a supposedly scientific study, but the small sample size/multiple testing problems mean any conclusions from the data will be very very weak. The only thing you could do is take the set of genes you found and then rely on previous studies to try to determine which one is the “violence gene”. But now you are being guided by intuition, guesswork, and a bunch of studies that may or may not be relevant. The result is that more than likely you’d end up on the wrong gene.</p>
<p>The result is that it is highly likely that no solid statistical information will be derived from this experiment. Sometimes, just because the technology exists to run an experiment, doesn’t mean that experiment will teach us anything.</p>
Fitbit, why can't I have my data?
2013-01-02T20:16:26+00:00
http://simplystats.github.io/2013/01/02/fitbit-why-cant-i-have-my-data
<p>I have a <a href="http://www.fitbit.com/">Fitbit</a>. I got it because I wanted to collect some data about myself and I liked the simplicity of the set-up. I also asked around and Fitbit seemed like the most “open” platform for collecting one’s own data. You have to pay $50 for a premium account, but after that, they allow you to download your data.</p>
<p>Or do they?</p>
<p>I looked into the details, asked a buddy or two, and found out that you actually can’t get the really interesting minute-by-minute data even with a premium account. You only get the daily summarized totals for steps/calories/stairs climbed. While this data is of some value, the minute-by-minute data are oh so much more interesting. I’d like to use it for personal interest, for teaching, for research, and for sharing interesting new ideas back to other Fitbit developers.</p>
<p>Since I’m not easily dissuaded, I tried another route. I created an application that accessed the <a href="http://dev.fitbit.com/">Fitbit API</a>. After fiddling around a bit with a few R packages, I was able to download my daily totals. But again, no minute-by-minute data. I looked into it and only <a href="https://wiki.fitbit.com/display/API/Fitbit+Partner+API">partner apps</a> have access to the intraday data. So I emailed Fitbit to ask if I could be a partner app. So far no word.</p>
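<p>For anyone curious what the API route looks like, here is a minimal sketch (my own, not my actual app) of pulling a single day’s summary with the httr package. It assumes you have already registered an app and obtained an access token, and the endpoint and auth details are worth double-checking against the current Fitbit documentation (the API used OAuth 1.0 when this post was written):</p>
<pre><code class="r">library(httr)

token <- "YOUR_ACCESS_TOKEN"   # placeholder; comes from the Fitbit OAuth flow
day   <- "2013-01-02"
url   <- paste0("https://api.fitbit.com/1/user/-/activities/date/", day, ".json")

res <- GET(url, add_headers(Authorization = paste("Bearer", token)))
stop_for_status(res)

daily <- content(res, as = "parsed")
daily$summary$steps  # daily step total; intraday series require partner access
</code></pre>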
<p>I guess it is true, if you aren’t paying for it, you are the product. But honestly, I’m just not that interested in being a product for Fitbit. So I think I’m bailing until I can download intraday data - I’m even happy to pay for it. If anybody has a suggestion of a more open self-monitoring device, I’d love to hear about it.</p>
Happy 2013: The International Year of Statistics
2013-01-01T09:00:25+00:00
http://simplystats.github.io/2013/01/01/happy-2013-the-international-year-of-statistics
<p>The ASA has <a href="http://www.statistics2013.org/">declared</a> 2013 to be the International Year of Statistics and I am ready to celebrate it in full force. It is a great time to be a statistician and I am hoping more people will join the fun. In fact, as we like to point out in this blog, Statistics has already been at the center of many exciting accomplishments of the 21st century. <a href="http://en.wikipedia.org/wiki/Sabermetrics">Sabermetrics</a> has become a standard approach and inspired the Hollywood movie <a href="http://www.imdb.com/title/tt1210166/">Moneyball</a>. Friend of the blog <a href="http://www2.research.att.com/~volinsky/">Chris Volinsky</a>, a PhD Statistician, led <a href="http://www.nytimes.com/2009/07/28/technology/internet/28netflix.html">the team</a> that won the <a href="http://www.netflixprize.com/">Netflix million dollar prize</a>. Nate Silver et al. <a href="http://simplystatistics.org/2012/11/07/nate-silver-does-it-again-will-pundits-finally-accept/">proved the pundits wrong</a> by, once again, using statistical models to <a href="http://mashable.com/2012/11/07/nate-silver-wins/">predict election results almost perfectly</a>. R has become one of the most <a href="http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all">widely used</a> programming languages in the world. Meanwhile, in academia, the number of statisticians becoming leaders in fields like environmental sciences, human genetics, genomics, and social sciences continues to grow. It is no surprise that stats majors at Harvard have <a href="http://nesterko.com/visuals/statconcpred2012-with-dm/">more than quadrupled</a> since 2000 and that statistics MOOCs are among <a href="http://edudemic.com/2012/12/the-11-most-popular-open-online-courses/">the most</a> popular.</p>
<p style="text-align: left;">
<img class="aligncenter" alt="" src="http://mope.amsiintern.org.au/wp-content/uploads/2012/09/IYSTAT-Logo-extended-620x350.jpg" width="372" height="210" />The unprecedented advances in digital technology during the second half of the 20th century have produced a measurement revolution that is transforming the world. Many areas of science are now being driven by new measurement technologies and many insights are being made by discovery-driven, as opposed to hypothesis-driven, experiments. Empiricism is back with a vengeance. The current scientific era is defined by its dependence on data, and the statistical methods and concepts developed during the 20th century provide an incomparable toolbox to help tackle current challenges. The toolbox, along with computer science, will also serve as a base for the methods of tomorrow. So I will gladly join the Year of Statistics' festivities during 2013 and beyond, during the era of data-driven science.
</p>
What makes a good data scientist?
2012-12-31T08:49:03+00:00
http://simplystats.github.io/2012/12/31/what-makes-a-good-data-scientist
<p>Apparently, New Year’s Eve is not a popular day to come to the office as it seems I’m the only one here. No matter, it just means I can blast Mahler 3 (Bernstein, NY Phil, 1980s recording) louder than I normally would.</p>
<p>Today’s post is inspired by a recent article in the NYT about big data. The article for the most part describes a conference that happened at MIT recently on the topic of big data. Towards the end of the article, it is noted that one of the participants (Rachel Schutt) was asked what makes a good data scientist.</p>
<blockquote>
<div>
<p>Obviously, she replied, the requirements include computer science and math skills, but you also want someone who has a deep, wide-ranging curiosity, is innovative and is guided by experience as well as data.</p>
<p itemprop="articleBody">
“I don’t worship the machine,” she said.
</p>
</div>
</blockquote>
<p>I think I agree, but I would have put it a different way. Mostly, I think what makes a good data scientist is the same thing that makes you a good [insert field here] scientist. In other words, a good data scientist is a good scientist.</p>
Sunday data/statistics link roundup (12/30/12)
2012-12-30T10:32:45+00:00
http://simplystats.github.io/2012/12/30/sunday-datastatistics-link-roundup-123012
<ol>
<li>An interesting new app called <a href="http://100plus.com/">100plus</a>, which looks like it uses public data to help determine how little decisions (walking more, one more glass of wine, etc.) lead to more or less health. <a href="http://www.healthdata.gov/blog/100plus-%E2%80%93-app-making-health-care-easier">Here’s a post</a> describing it on the healthdata.gov blog. As far as I can tell, the app is still in beta, so only the folks who have a code can download it.</li>
<li><a href="http://m.motherjones.com/politics/2012/12/mass-shootings-mother-jones-full-data">Data</a> on mass shootings from the Mother Jones investigation.</li>
<li>A post by Hilary M. on <a href="http://www.hilarymason.com/blog/getting-started-with-data-science/">“Getting Started with Data Science”</a>. I really like the suggestion of just picking a project and doing something, getting it out there. One thing I’d add to the list is that I would spend a little time learning about an area you are interested in. With all the free data out there, it is easy to just “do something”, without putting in the requisite work to know why what you are doing is good/bad. So when you are doing something, make sure you take the time to “know something”.</li>
<li>An <a href="http://xxx.lanl.gov/pdf/0902.2183v2.pdf">analysis of various measures of citation impact</a> (also via Hilary M.). I’m not sure I follow the reasoning behind all of the analyses performed (seems a little like throwing everything at the problem and hoping something sticks) but one interesting point is how citation/usage are far apart from each other on the PCA plot. This is likely just because the measures cluster into two big categories, but it makes me wonder. Is it better to have a lot of people read your paper (broad impact?) or cite your paper (deep impact?).</li>
<li>An <a href="https://twitter.com/hmason/status/285163907360899072">exchange</a> on Twitter about how big data does not mean you can ignore the scientific method. We have talked a little bit about this before, in terms of how one should <a href="http://simplystatistics.org/2012/06/28/motivating-statistical-projects/">motivate statistical projects</a>.</li>
</ol>
Make a Christmas Tree in R with random ornaments/presents
2012-12-24T11:09:18+00:00
http://simplystats.github.io/2012/12/24/make-a-christmas-tree-in-r-with-random-ornamentspresents
<p>Happy holidays!</p>
<p><a href="http://simplystatistics.org/2012/12/24/make-a-christmas-tree-in-r-with-random-ornamentspresents/xmas/" rel="attachment wp-att-768"><img class="alignnone size-medium wp-image-768" alt="xmas" src="http://simplystatistics.org/wp-content/uploads/2012/12/xmas-150x300.png" width="150" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2012/12/xmas-150x300.png 150w, http://simplystatistics.org/wp-content/uploads/2012/12/xmas.png 480w" sizes="(max-width: 150px) 100vw, 150px" /></a></p>
<p><a href="https://gist.github.com/4369771">Link to Gist</a></p>
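<p>If you just want the flavor without opening the Gist, here is a minimal sketch (not the original code) that draws a tree and scatters random ornaments with base graphics:</p>
<pre><code class="r">set.seed(25)
plot(0, 0, type = "n", xlim = c(-1, 1), ylim = c(0, 3),
     axes = FALSE, xlab = "", ylab = "", asp = 1)
polygon(c(-1, 0, 1), c(0.5, 3, 0.5), col = "darkgreen", border = NA)  # tree
rect(-0.15, 0, 0.15, 0.5, col = "tan4", border = NA)                  # trunk

# Random ornaments inside the triangle (simple rejection sampling)
n <- 40
x <- runif(3 * n, -1, 1)
y <- runif(3 * n, 0.5, 3)
keep <- abs(x) < (3 - y) / 2.5
xo <- head(x[keep], n)
yo <- head(y[keep], n)
points(xo, yo, pch = 19, cex = 1.2,
       col = sample(c("red", "gold", "blue", "white"), length(xo), replace = TRUE))
</code></pre>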
Sunday data/statistics link roundup 12/23/12
2012-12-23T09:44:38+00:00
http://simplystats.github.io/2012/12/23/sunday-datastatistics-link-roundup-122312
<ol>
<li>A <a href="http://diabetesvis.herokuapp.com/diabetes/dashboard">cool data visualization</a> for blood glucose levels for diabetic individuals. This kind of interactive visualization can help people see where/when major health issues arise for chronic diseases. This was a class project by Jeff Heer’s Stanford CS448B students Ben Rudolph and Reno Bowen (twitter @RenoBowen). Speaking of interactive visualizations, I also got <a href="http://dexvis.com/doku.php">this link</a> from Patrick M. It looks like a way to build interactive graphics and my understanding is it is compatible with R data frames, worth checking out (plus, Dex is a good name).</li>
<li>Here is an <a href="http://mathbabe.org/2012/12/20/nate-silver-confuses-cause-and-effect-ends-up-defending-corruption/">interesting review</a> of Nate Silver’s book. The interesting thing about the review is that it doesn’t criticize the statistical content, but criticizes the belief that people only use data analysis for good. This is an interesting theme we’ve seen before. Gelman also <a href="http://andrewgelman.com/2012/12/two-reviews-of-nate-silvers-new-book-from-kaiser-fung-and-cathy-oneil/">reviews the review</a>.</li>
<li>It’s a little late now, but this tool seems useful for folks who want to know <a href="http://www.whatdoineedonmyfinal.com/">whatdoineedonmyfinal</a>?</li>
<li>A list of the <a href="http://www.theatlanticcities.com/technology/2012/12/best-open-data-releases-2012/4200/">best open data releases of 2012</a>. I particularly like the rat sightings in New York and the Baltimore fixed speed cameras (which I have a habit of running afoul of).</li>
<li>A <a href="http://giladlotan.com/blog/mapping-twitters-python-data-science-communities/">map of data scientists</a> on Twitter. Unfortunately, since we don’t have “data scientist” in our Twitter description, Simply Statistics does not appear. I’m sure we would have been central….</li>
<li>Here is <a href="http://www.nature.com/ncomms/journal/v3/n12/full/ncomms2292.html?WT.mc_id=TWT_NatureComms">an interesting paper</a> where some investigators developed a technology that directly reads out a bar chart of the relevant quantities. They mention this means there is no need for statistical analysis. I wonder if the technology also reads out error bars.</li>
</ol>
The NIH peer review system is still the best at identifying innovative biomedical investigators
2012-12-20T10:00:15+00:00
http://simplystats.github.io/2012/12/20/the-nih-peer-review-system-is-still-the-best-at-identifying-innovative-biomedical-investigators
<p><a href="http://www.nature.com/nature/journal/v492/n7427/full/492034a.html">This</a> recent Nature paper makes the controversial claim that the most innovative (interpreted as best) scientists are not being funded by NIH. Not surprisingly, it is getting a lot of attention in the popular media. The title and introduction make it sound like there is a pervasive problem biasing the funding enterprise against innovative scientists. To me this appears counterintuitive given how much innovation, relative to other funding agencies around the world, comes out of NIH funded researchers (<a href="http://www.nytimes.com/2011/09/13/health/13gene.html?pagewanted=all&_r=0">here</a> is a recent example) and how many of the best biomedical investigators in the world elect to work for NIH funded institutions. The authors use data to justify their conclusions but I do not find it very convincing.</p>
<p>First, the paper defines innovative/non-conformist scientists as those with a first/last/single author paper with 1000+ citations in the years 2002-2012. Obvious problems with this definition are already pointed out in the comments of the original paper, but for argument’s sake I will accept it as a useful quantification. The key data point the authors use is that only 2/5 of people with a first/last/single author 1000+ citation paper are principal investigators on NIH grants. I would need to see the complete 2x2 table for people that actually applied for grants (1000+ citations or not x got NIH grant or not) to be convinced. The reported ratio is meaningful only if most people with 1000+ citation papers are applying for grants, but the authors don’t report how many are retired, are still postdocs, went into industry, or are one-hit wonders. Given that the payline is about 8%-15%, the 40% number may actually imply that NIH is in fact funding innovative people at a high rate.</p>
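<p>To make the arithmetic concrete, here is a purely hypothetical illustration in R. The numbers are invented (the paper does not report them); the point is only that the 40% figure tells you little without knowing how many of these investigators actually applied.</p>
<pre>
# Invented numbers for illustration only -- not reported in the paper.
n_highly_cited <- 100  # investigators with a 1000+ citation first/last/single author paper
n_applied      <- 50   # suppose only half of them ever applied for an NIH grant
n_funded       <- 40   # the reported 40% of the total who are funded PIs

n_funded / n_highly_cited  # 0.40: the ratio the paper emphasizes
n_funded / n_applied       # 0.80: the success rate among applicants, far above an 8%-15% payline
</pre>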
<p>The paper also implies that many of the undeserving funding recipients are connected individuals that serve on study sections. The evidence for this is that they are funded at a much higher rate than individuals with 1000+ citation papers. But as the authors themselves point out, study section members are often recruited from the subset of individuals who have NIH grants (it’s a way to give back to NIH). This does not suggest bias in the process, it just suggests that if you recruit funded people to be on a panel, that panel will have a higher rate of funded people.</p>
<p>NIH’s peer review system is far from perfect but it somehow manages to produce the best biomedical research in the world. How does this happen? Well, I think it’s because NIH is currently funding some of the most innovative biomedical researchers in the world. The current system can certainly improve, but perhaps we should focus on concrete proposals with hard evidence that they will actually make things better.</p>
<p>Disclaimers: I am a regular member of an NIH study section. I am PI on NIH grants. I am on several papers with more than 1000 citations.</p>
Rafa interviewed about statistical genomics
2012-12-19T11:47:10+00:00
http://simplystats.github.io/2012/12/19/rafa-interviewed-about-statistical-genomics
<p>He talks about the <a href="http://simplystatistics.tumblr.com/post/21914291274/people-in-positions-of-power-that-dont-understand">problems created by the speed of increase in data sizes</a> in molecular biology, the way that genomics is hugely driven by data analysis/statistics, how Bioconductor is an example of <a href="http://simplystatistics.org/2012/09/07/top-down-versus-bottom-up-science-data-analysis/">bottom up science</a>, Simply Statistics gets a shout out, how new data are going to lead to new modeling/statistical challenges, and gives an ode to boxplots. It’s worth watching the whole thing…</p>
The value of re-analysis
2012-12-18T11:58:28+00:00
http://simplystats.github.io/2012/12/18/the-value-of-re-analysis
<p>I just saw <a href="http://www.johndcook.com/blog/2012/12/18/the-value-of-typing-code/">this really nice post</a> over on John Cook’s blog. He talks about how it is a valuable exercise to re-type code for examples you find in a book or on a blog. I completely agree that this is a good way to learn through osmosis, learn about debugging, and often pick up the reasons for particular coding tricks (this is how I learned about vectorized calculations in Matlab, by re-typing and running my advisors code back in my youth).</p>
<p>In a more statistical version of this idea, Gary King has proposed <a href="http://gking.harvard.edu/gking/papers">reproducing the analysis</a> in a published paper as a way to get a paper of your own. You can figure out the parts that a person did well and the parts that you would do differently, maybe finding enough insight to come up with your own new paper. But I think this level of replication involves actually two levels of thinking:</p>
<ol>
<li>Can you actually reproduce the code used to perform the analysis?</li>
<li>Can you solve the “<a href="http://www.perlsteinlab.com/blog/papers-as-puzzles">paper as puzzle</a>” exercise proposed by Ethan Perlstein over at his site. Given the results in the paper, can you come up with the story?</li>
</ol>
<p>Both of these things require a bit more “higher level thinking” than just re-running the analysis if you have the code. But I think even the seemingly “low-level” task of just retyping and running the code that is used to perform a data analysis can be very enlightening. The problem is that this code, in many cases, does not exist. But that is starting to change. If you check out <a href="http://www.rpubs.com/">Rpubs</a> or <a href="http://www.runmycode.org/CompanionSite/">RunMyCode</a> or even the right parts of <a href="http://figshare.com/">Figshare</a> you can find data analyses you can run through and reproduce.</p>
<p>The only downside is there is currently no measure of quality on these published analyses. It would be great if people could focus their time re-typing only good data analyses, rather than one at random. Or, as a guy once (almost) <a href="http://www.quoteworld.org/quotes/8414">said</a>, “Data analysis practice doesn’t make perfect, perfect data analysis practice makes perfect.”</p>
Should the Cox Proportional Hazards model get the Nobel Prize in Medicine?
2012-12-17T15:26:16+00:00
http://simplystats.github.io/2012/12/17/should-the-cox-proportional-hazards-model-get-the-nobel-prize-in-medicine
<p><a href="http://www.ncbi.nlm.nih.gov/pubmed/12762435">I’m not the first one</a> to suggest that Biostatistics has been undervalued in the scientific community, and some of the shortcomings of epidemiology and biostatistics have been noted elsewhere. But this previous work focuses primarily on the contributions of statistics/biostatistics at the purely scientific level.</p>
<p>The <a href="http://en.wikipedia.org/wiki/Proportional_hazards_models">Cox Proportional Hazards model</a> is one of the most widely used statistical models in the analysis of data from clinical trials and other medical studies. The corresponding paper has been cited over <a href="http://scholar.google.com/scholar?q=Regression+models+and+life-tables&btnG=&hl=en&as_sdt=0%2C21">32,000 times</a>; this is a dramatically low estimate of the number of times the model has been used. It is one of “those methods” that doesn’t even require a reference to the original methods paper anymore.</p>
<p>Many of the most influential medical studies, including major studies like the <a href="http://jama.jamanetwork.com/article.aspx?articleid=1108397">Women’s Health Initiative</a>, have used these methods to answer some of our most pressing medical questions. Despite the incredible impact of this statistical technique on the world of medicine and public health, it has not received the Nobel Prize. This isn’t an aberration: statistical methods are not traditionally considered for Nobel Prizes in Medicine, which primarily focus on biochemical, genetic, or public health discoveries.</p>
<p>In contrast, many economics Nobel Prizes have been awarded primarily for the discovery of a new statistical or mathematical concept. One example is the <a href="http://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity">ARCH model</a>. The Nobel Prize in Economics in 2003 was awarded to <a href="http://www.nobelprize.org/nobel_prizes/economics/laureates/2003/">Robert Engle</a>, the person who proposed the original ARCH model. The model has gone on to have a major impact on financial analysis, much like the Cox model has had a major impact on medicine.</p>
<p>So why aren’t Nobel Prizes in medicine awarded to statisticians more often? Other methods such as ANOVA, P-values, etc. have also had an incredibly large impact on the way we measure and evaluate medical procedures. Maybe as medicine becomes increasingly driven by data, we will start to see more statisticians recognized for their incredible discoveries and the huge contributions they make to medical research and practice.</p>
Sunday data/statistics link roundup (12/16/12)
2012-12-16T10:01:36+00:00
http://simplystats.github.io/2012/12/16/sunday-datastatistics-link-roundup-121612
<ol>
<li>A <a href="http://www.doaj.org/doaj?func=home&uiLanguage=en">directory of open access journals</a>. Very cool idea to aggregate them. Here is a <a href="http://www.thejuliagroup.com/blog/?p=2898">blog post </a>from one of my favorite statistics bloggers about why open-access journals are so cool. Just like in a lot of other areas, open access journals can be thought of as an open data initiative.</li>
<li>Here is a website that <a href="http://www.richblockspoorblocks.com/">displays data on the relative wealth of neighborhoods</a>, broken down by census tract. It’s pretty fascinating to take a look and see what the income differences are, even between regions pretty close to each other.</li>
<li>More citizen science goodness. Zooniverse <a href="https://www.zooniverse.org/project/snapshotserengeti">has a new project</a> where you can look through a bunch of pictures in the Serengeti and see if you can find animals.</li>
<li>Nate Silver <a href="http://www.youtube.com/watch?feature=player_embedded&v=mYIgSq-ZWE0">talking about his new book</a> with Hal Varian. (<a href="http://www.youtube.com/watch?feature=player_embedded&v=mYIgSq-ZWE0">via</a>). I have skimmed the book and found that the parts about baseball/politics are awesome and the other parts seem a little light. But maybe that’s just my pre-conceived bias? I’d love to hear what other people thought…</li>
</ol>
Computing for Data Analysis Returns
2012-12-14T09:20:15+00:00
http://simplystats.github.io/2012/12/14/computing-for-data-analysis-returns
<p>I’m happy to announce that my course <a href="https://www.coursera.org/course/compdata">Computing for Data Analysis</a> will return to <a href="http://coursera.org">Coursera</a> on January 2nd, 2013. While I had previously announced that the course would be presented again right here, it made more sense to do it again on Coursera where it is (still) free and the platform there is much richer. For those of you who missed it the last time around, this is your chance to take it and learn a little R.</p>
<p>I’ve gotten a number of emails from people who were interested in watching the videos for the course. If you just want to sit around and watch videos of me talking, I’ve created a set of four YouTube playlists based on the four weeks of the course:</p>
<ul>
<li><a href="http://www.youtube.com/playlist?list=PLjTlxb-wKvXMUop9m0C8G5xLBzhsIDBC7">Background and getting started</a></li>
<li><a href="http://www.youtube.com/playlist?list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ&feature=view_all">Week 1</a>: Background on R, data types, reading/writing data</li>
<li><a href="http://www.youtube.com/playlist?list=PLjTlxb-wKvXNnjUTX4C8IeIhPBjPkng6B&feature=view_all">Week 2</a>: Control structures, functions, apply functions, debugging tools</li>
<li><span style="font-size: medium;"><a href="http://www.youtube.com/playlist?list=PLjTlxb-wKvXOzI2h0F2_rYZHIXz8GWBop&feature=view_all">Week 3</a>: Plotting and simulation</span></li>
<li><a href="http://www.youtube.com/playlist?list=PLjTlxb-wKvXOdzysAE6qrEBN_aSBC0LZS&feature=view_all">Week 4</a>: Regular expressions, classes and methods</li>
</ul>
<p>The content in the YouTube playlists reflects the content from the first iteration of the course and will not reflect any new material I add to the second iteration (at least not for a little while).</p>
<p>I encourage everyone who is interested to enroll in the course on Coursera because there you’ll have the benefit of in-video quizzes and other forms of assessment and will be able to interact with all of the great students who are also enrolled in the class. Also, if you’re interested in signing up for Jeff Leek’s <a href="https://www.coursera.org/course/dataanalysis">Data Analysis</a> course (starts on January 22, 2013) and are not very familiar with R, I encourage you to check out Computing for Data Analysis first to get yourself up to speed.</p>
<p>I look forward to seeing you there!</p>
Joe Blitzstein's free online stat course helps put a critical satellite in orbit
2012-12-10T11:15:47+00:00
http://simplystats.github.io/2012/12/10/joe-blitzsteins-free-online-stat-course-helps-put-a-critical-satellite-in-orbit
<p>As loyal readers know, we are <a href="http://simplystatistics.org/2012/08/10/why-we-are-teaching-massive-open-online-courses-moocs/">very</a> <a href="http://simplystatistics.org/2012/07/26/online-education-many-academics-are-missing-the-point/">enthusiastic</a> about MOOCs. One of the main reasons for this is the potential of teaching Statistics to students from all over the world, in particular those that can’t afford or don’t have access to college. However, it turns out that rocket scientists can also benefit. Check out the feedback <a href="http://simplystatistics.org/2012/01/20/interview-with-joe-blitzstein/">Joe Blitzstein</a>, professor of one of the most <a href="https://itunes.apple.com/us/course/statistics-110-probability/id502492375">popular online stat courses,</a> received from one of his students:</p>
<blockquote>
<p>As an “old bubba” aerospace engineer I watched your Stat 110 class and enjoyed it very much. It sure blew out a lot of cobwebs that had collected over the past 35 years working as an aerospace engineer. As you might guess, we deal with a lot of probability. Just recently I was involved in a study to see what a blocked Reaction Control System (RCS) might do to a satellite… I am a Spacecraft Attitude Control systems engineer and it was my job to simulate what would happen if a certain RCS engine was plugged. It was a weird problem and it inspired me to watch your class… Fortunately, the statistics showed that the RCS nozzles that could get plugged would have a low probability and would not affect our ability to adjust the vehicle’s orbit. And we launched it this past summer and everything went perfect! So I just wanted to tell you that when you teach your “kiddos” tell them that Stat 110 has real life implications. This satellite is a critical national defense asset that saves the lives of our soldiers on the ground.</p>
</blockquote>
<p>I doubt “Old Bubba” has time to go back to school to refresh his stats knowledge… but thanks to Joe’s online class, he no longer needs to. This is yet another advantage MOOCs offer: giving busy professionals a practical way to learn new skills or brush up on specific topics.</p>
Sunday data/statistics link roundup (12/9/12)
2012-12-09T10:14:57+00:00
http://simplystats.github.io/2012/12/09/sunday-datastatistics-link-roundup-12912
<ol>
<li><span style="line-height: 16px;">Some <a href="http://www.prana.com/life/2012/12/01/conscious-consumerism-how-do-your-brands-rate/">interesting data/data visualizations</a> about working conditions in the apparel industry. <a href="http://www.free2work.org/trends/apparel/?utm_source=Social%20Ventures&utm_medium=Hootsuite&utm_campaign=SV%20News%20Feed">Here</a> is the full report. Whenever I see reports like this, I wish the raw data were more clearly linked. I want to be able to get in, play with the data, and see if I notice something that doesn’t appear in the infographics. </span></li>
<li><span style="line-height: 16px;">This is an awesome <a href="http://wmbriggs.com/blog/?p=6465">plain-language discussion</a> of how a bunch of methods (CS and Stats) with fancy names relate to each other. It shows that CS/Machine Learning/Stats are converging in many ways and there isn’t much new under the sun. On the other hand, I think the really exciting thing here is to use these methods on new questions, once people <a href="http://simplystatistics.org/2012/12/08/dropping-the-stick-in-data-analysis/">drop the stick</a>. </span></li>
<li><span style="line-height: 16px;">If you are a reader of this blog and somehow do not read anything else on the internet, you will have missed Hadley Wickham’s <a href="https://github.com/hadley/devtools/wiki/Rcpp">Rcpp tutorial</a>. In my mind, this pretty much seals it, Julia isn’t going to overtake R anytime soon. In other news, Hadley is <a href="http://biostat.jhsph.edu/newsEvent/event/seminar/seminars.shtml">coming to visit</a> JHSPH Biostats this week! I’m psyched to meet him. </span></li>
<li><span style="line-height: 16px;">For those of us that live in Baltimore, this <a href="http://www.r-bloggers.com/visualizing-baltimore-with-r-and-ggplot2-crime-data/">interesting set of data visualizations</a> lets you in on the crime hotspots. This is a much fancier/more thorough analysis than <a href="http://simplystatistics.org/2012/01/03/baltimore-gun-offenders-and-where-academics-dont-live/">Rafa and I did</a> way back when. </span></li>
<li><span style="line-height: 16px;">Check out the new <a href="http://www.census.gov/easystats/">easy stats tool</a> from the Census (via Hilary M.) and read our interview with <a href="http://simplystatistics.org/2012/11/09/interview-with-tom-louis-new-chief-scientist-at-the/">Tom Louis</a> who is heading over there to the Census to do cool things. </span></li>
<li><span style="line-height: 16px;"><a href="http://www.slate.com/blogs/bad_astronomy/2012/12/07/ted_to_tedx_how_to_avoid_bad_science_in_talks.html?utm_source=tw&utm_medium=sm&utm_campaign=button_toolbar">Watch out</a>, some Tedx talks may be pseudoscience! More later this week on the politicization/glamourization of science, so stay tuned. </span></li>
</ol>
Dropping the Stick in Data Analysis
2012-12-08T14:57:26+00:00
http://simplystats.github.io/2012/12/08/dropping-the-stick-in-data-analysis
<p>When I was a kid growing up in rough-and-tumble suburban New York, one of the major summer activities was roller hockey, the kind with roller blades (remember roller blades?). My friends and I would be playing in some random parking lot and undoubtedly one of us would be just blowing it the whole game. This would usually lead to an impromptu intervention where the person screwing up (often me) would be told by everyone else on the team to “drop the stick”. The idea was you should stop playing, clear your head, skate around for a bit, and not try to do 20 things at once.</p>
<p>I don’t play much hockey now, but I do a bit more data analysis. Strangely, little has changed.</p>
<p>People come to me at various stages of data analysis. Close collaborators usually come to me with no data because they are planning a study and need some help. In those cases, I’m involved in the beginning and know how the data are generated. Usually, in those cases I analyze the data in the end so there’s less confusion.</p>
<p>Others usually come to me with data in hand wanting know what they should do now that they’ve got all this data. Often there’s confusion about where to start, what method to use, what program, what procedure, what function, what test, Bayesian or frequentist, mean or median, R or Stata, random effects or fixed effects, cat or dog, mice or men, etc. That’s usually the point where I tell them to “drop the stick”, or the data analysis version of that, which is “What question are you trying to answer?”</p>
<p>Usually, people know what question they’re trying to answer–they just forgot to tell me. But I’m always amazed at how this question can often be the subject of the entire discussion. We might end up answering a question the investigator hadn’t thought of yet, maybe a question that’s better suited to the data.</p>
<p>So, job #1 if you’re a statistician: Get more people to drop the stick. You’ll make everyone play better in the end.</p>
Email is a to-do list made by other people - can someone make it more efficient?!
2012-12-05T11:33:21+00:00
http://simplystats.github.io/2012/12/05/an-idea-for-killing-email
<p>This is a follow-up to one of our most popular posts: <a href="http://simplystatistics.org/post/10558246695/getting-email-responses-from-busy-people" target="_blank">getting email responses from busy people</a>. This post had been in the drafts for a few weeks, then this morning I saw this quote in our Twitter feed:</p>
<blockquote>
<p>Your email inbox is a to-do list created by other people (<a href="https://twitter.com/medriscoll/status/276352287230803968">via</a>)</p>
</blockquote>
<p>This is 100% true of my work email and I have to say, because of the way those emails are organized - as conversations rather than a prioritized, organized to-do list - I end up missing really important things or getting to them too late. This is happening to me often enough that I feel like I’m starting to cause serious problems for people.</p>
<p>So I am begging someone with way better skills than me to produce software that replaces gmail in the following ways. It is a to-do list that I can allow people to add tasks to. The software shows me the following types of messages.</p>
<ol>
<li>We have an appointment at x time on y date to discuss z. Next to this message is a checkbox. If I click “ok” it gets added to my calendar; if I click “no”, a message gets sent to the person who scheduled the meeting saying I’m unavailable.</li>
<li>A multiple choice question where they input the categories of answer I can give and I just pick one, it sends them the response.</li>
<li>A request to be added as a person who can assign me tasks with a yes/no answer.</li>
<li>A longer request email - this has three entry fields: (1) what do you want, (2) when do you want it by? and (3) a yes/no checkbox asking if I’m willing to perform the task. If I say yes, it gets added to my calendar with automated reminders.</li>
<li>It should interface with all the systems that send me reminder emails to organize the reminders.</li>
<li>You can assign quotas to people, where they can only submit a certain number of tasks per month.</li>
<li>It allows you to re-assign tasks to other people so when I am not the right person to ask, I can quickly move the task on to the right person.</li>
<li>It would collect data and generate automated reports for me about what kind of tasks I’m usually forgetting/being late on and what times of day I’m bad about responding so that I could improve my response times.</li>
</ol>
<p>The software would automatically reorganize events/to-dos to reflect changing deadlines/priorities, etc. This piece of software would revolutionize my life. Any takers?</p>
Advice for students on the academic job market (2013 edition)
2012-12-04T10:00:38+00:00
http://simplystats.github.io/2012/12/04/advice-for-students-on-the-academic-job-market-2013-edition
<p>Job hunting season is upon us. Those on the job market should be sending in applications already. Here I provide links to some of the related posts we published last year.</p>
<ul>
<li><a href="http://simplystatistics.org/2011/09/12/advice-for-stats-students-on-the-academic-job-market/">Advice for stats students on the academic job market</a></li>
<li><a href="http://simplystatistics.org/2011/09/15/another-academic-job-market-option-liberal-arts/">Another academic job market option: liberal arts colleges</a></li>
<li><a href="http://simplystatistics.org/2011/11/16/preparing-for-tenure-track-job-interviews/">Preparing for tenure track job interviews</a></li>
<li><a href="http://simplystatistics.org/2011/12/19/on-hard-and-soft-money/">On hard and soft money</a></li>
</ul>
Data analysis acquisition "worst deal ever"?
2012-12-03T09:16:52+00:00
http://simplystats.github.io/2012/12/03/data-analysis-acquisition-worst-deal-ever
<p>A little over a year ago I mentioned that <a href="http://simplystatistics.org/2011/09/08/data-analysis-companies-getting-gobbled-up/">data analysis companies were getting gobbled up</a> by larger technology companies. In particular, HP bought Autonomy, a British data analysis company, for about $11 billion. (By the way, can anyone tell me if it’s still called Hewlett-Packard, or is it just “HP”, like “AT&T”?) From an article a year ago:</p>
<blockquote>
<p>Autonomy, with headquarters in Cambridge, England, helps companies and governments store, process, search and analyze large electronic data sets. Its specialty lies in its sophisticated algorithms, which can make sense of unstructured information.</p>
</blockquote>
<p>At the time, the thinking was HP had overpaid (especially given HP’s recent high price for 3Par) but the deal went through anyway. Now, HP has discovered accounting problems at Autonomy and is writing down $8.8 billion.</p>
<p>Whoops.</p>
<p>James Stewart of the New York Times claims <a href="http://www.nytimes.com/2012/12/01/business/hps-autonomy-blunder-might-be-one-for-the-record-books.html?pagewanted=all">this is worse than the failed AOL-Time Warner merger</a> (although the absolute numbers involved here are smaller). With 3 CEOs in 2 years, it seems HP just can’t get anything right these days. But what intrigues me most is the question of what companies like Autonomy are worth and the possibility that HP made a huge mistake in the valuation of this company. Of course, if there was fraud at Autonomy (as it seems to be alleged), then all bets are off. But if not, then perhaps this is the first bubble popping in the realm of data analysis companies more generally?</p>
Sunday data/statistics link roundup (12/2/12)
2012-12-02T10:18:06+00:00
http://simplystats.github.io/2012/12/02/sunday-datastatistics-link-roundup-12212
<ol>
<li><span style="line-height: 16px;"><a href="http://sloanreview.mit.edu/feature/business-quandary-use-a-competition-to-crowdsource-best-answers/?non_mobile=1">An interview</a> with Anthony Goldbloom, CEO of Kaggle. I’m not sure I’d agree with the characterization that all data scientists are: creative, curious, and competitive and certainly those characteristics aren’t unique to data scientists. And I didn’t know this: “We have 65,000 data scientists signed up to Kaggle, and just like with golf tournaments, we have them all ranked from 1 to 65,000.” </span></li>
<li><span style="line-height: 16px;">Check it out, <a href="http://www.r-bloggers.com/images-as-voronoi-tesselations/">art with R</a>! It’s actually pretty interesting to see how they use statistical algorithms to generate different artistic styles. <a href="http://www.r-bloggers.com/dominant-color-palettes-with-k-means/">Here</a> are some more. </span></li>
<li><span style="line-height: 16px;">Now that Ethan Perlstein’s crowdfunding experiment </span><a style="line-height: 16px;" href="http://twitter.com/eperlste/status/273152039922565121">was successful</a><span style="line-height: 16px;">, other people are getting on the bandwagon. If you want to find out what kind of bacteria you have in your gut, for example, you could check out <a href="http://www.indiegogo.com/ubiome">this</a>. </span></li>
<li><span style="line-height: 16px;">I thought I had it rough, but apparently some data analysts spend all their time developing algorithms to <a href="http://www.p4rgaming.com/?p=481">detect penis drawings</a>!</span></li>
<li><span style="line-height: 16px;">Roger was on Anderson Cooper 360 as part of the Building America segment. We can’t find the video, but <a href="http://transcripts.cnn.com/TRANSCRIPTS/1211/27/acd.02.html">here</a> is the transcript. </span></li>
<li><span style="line-height: 16px;">An interesting article on the <a href="http://www.economist.com/blogs/babbage/2012/11/qa-samuel-arbesman?fsrc=scn/tw/te/bl/halflifeoffacts">half-life of facts</a>. I think the analogy is an interesting one and certainly there is research to be done there. But I think it jumps the shark a bit when they start talking about how the moon landing was predictable, etc. I completely believe in the retrospective analysis of knowledge, but predicting things is pretty hard, especially when it is the future. </span><span style="line-height: 16px;"> </span></li>
</ol>
Statistical illiteracy may lead to parents panicking about Autism.
2012-11-30T13:09:01+00:00
http://simplystats.github.io/2012/11/30/statistical-illiteracy-may-lead-to-parents-panicking-about-autism
<p>I just was doing my morning reading of a few news sources and stumbled across this <a href="http://www.huffingtonpost.com/2012/11/29/autism-risk-babies-cries_n_2211729.html">Huffington Post article</a> talking about research correlating babies’ cries with autism. It suggests that the sound of a baby’s cries may predict their future risk for autism. As the parent of a young son, this obviously caught my attention in a very lizard-brain, caveman sort of way. I couldn’t find a link to the research paper in the article so I did some searching and found out this result is also being covered by <a href="http://healthland.time.com/2012/11/28/can-a-babys-cry-be-a-clue-to-autism/">Time</a>, <a href="http://www.sciencedaily.com/releases/2012/11/121127111352.htm">Science Daily</a>, <a href="http://www.medicaldaily.com/articles/13324/20121129/baby-s-cry-reveal-autism-risk.htm">Medical Daily</a>, and a bunch of other news outlets.</p>
<p>Now thoroughly freaked, I looked online and found the pdf of the <a href="https://www.ewi-ssl.pitt.edu/psychology/admin/faculty-publications/201209041019040.Sheinkopf%20in%20press.pdf">original research article</a>. I started looking at the statistics and took a deep breath. Based on the analysis they present in the article there is absolutely no statistical evidence that a baby’s cries can predict autism. Here are the flaws with the study:</p>
<ol>
<li><strong>Small sample size</strong>. The authors only recruited 21 at-risk infants and 18 healthy infants. Then, because of data processing issues, they ended up analyzing only 7 high autistic-risk versus 5 low autistic-risk infants in one analysis and 10 versus 6 in another. That is nowhere near a representative sample and barely qualifies as a pilot study.</li>
<li><strong>Major and unavoidable confounding</strong>. The way the authors determined high autistic risk versus low risk was based on whether an older sibling had autism. Leaving aside the quality of this metric for measuring risk of autism, there is a major confounding factor: the families of the high risk children all had an older sibling with autism and the families of the low risk children did not! It would not be surprising at all if children with one autistic older sibling might get a different kind of attention and hence cry differently regardless of their potential future risk of autism.</li>
<li><strong>No correction for multiple testing</strong>. This is one of the oldest problems in statistical analysis. It is also one that is a consistent culprit of false positives in epidemiology studies. XKCD <a href="http://xkcd.com/882/">even did a cartoon</a> about it! They tested 9 variables measuring the way babies cry and tested each one with a statistical hypothesis test. They did not correct for multiple testing. So I gathered the resulting p-values and did the correction <a href="https://gist.github.com/4177366">for them</a>. It turns out that after adjusting for multiple comparisons, nothing is significant at the usual P < 0.05 level, which would probably have prevented publication. (A small sketch of this kind of adjustment appears right after this list.)</li>
</ol>
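<p>To see what this kind of adjustment does, here is a minimal sketch in R. The p-values below are placeholders I made up, not the values reported in the paper; the point is only that correcting for nine tests pushes borderline p-values back above 0.05.</p>
<pre>
# Hypothetical p-values for nine acoustic measures (NOT the paper's values),
# a few of them nominally "significant" at the 0.05 level.
pvals <- c(0.012, 0.034, 0.048, 0.07, 0.11, 0.23, 0.41, 0.66, 0.82)

# Adjust for testing nine hypotheses at once.
p.adjust(pvals, method = "holm")
p.adjust(pvals, method = "bonferroni")

# After adjustment, none of these hypothetical values fall below 0.05.
any(p.adjust(pvals, method = "holm") < 0.05)
</pre>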
<p>Taken together, these problems mean that the statistical analysis of these data do not show any connection between crying and autism.</p>
<p>The problem here exists on two levels. First, there was a failing in the statistical evaluation of this manuscript at the peer review level. Most statistical referees would have spotted these flaws and pointed them out for such a highly controversial paper. A second problem is that news agencies report on this result and, despite paying lip-service to potential limitations, are not statistically literate enough to point out the major flaws in the analysis that reduce the probability of a true positive. Should journalists have some minimal training in statistics that allows them to determine whether a result is likely to be a false positive, to save us parents a lot of panic?</p>
I give up, I am embracing pie charts
2012-11-27T20:53:20+00:00
http://simplystats.github.io/2012/11/27/i-give-up-i-am-embracing-pie-charts
<p>Most statisticians know that pie charts are a terrible way to plot percentages. You can find explanations <a href="http://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/">here</a>, <a href="http://blog.revolutionanalytics.com/2009/08/how-pie-charts-fail.html">here</a>, and <a href="https://www.google.com/search?q=why+do+pie+charts+suck&oq=why+do+pie+charts+suck&aqs=chrome.0.57j62.4254&sugexp=chrome,mod=3&sourceid=chrome&ie=UTF-8">here</a> as well as the R help file for the <code class="language-plaintext highlighter-rouge">pie</code> function which states:</p>
<blockquote>
<p>Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.</p>
</blockquote>
<p><img class="alignright" style="line-height: 24px; font-size: 16px;" src="http://rafalab.jhsph.edu/simplystats/pacman.gif" alt="pacman" width="181" height="181" /></p>
<p>I have only used the <code class="language-plaintext highlighter-rouge">pie</code> R function once and it was to make this plot (R code below):</p>
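<p>As a rough idea, a Pac-Man-style figure like the one above can be drawn with something along these lines (my own reconstruction, not the original code):</p>
<pre>
# Roughly reconstruct a Pac-Man with pie(): one yellow wedge for the body
# and a small white wedge for the mouth.
pie(c(mouth = 20, body = 340), col = c("white", "yellow"),
    labels = NA, border = NA, init.angle = 10)
</pre>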
<p>So why are they ubiquitous? The best explanation I’ve heard is that they are easy to make in Microsoft Excel. Regardless, after years of training, lay people are probably better at interpreting pie charts than any other graph. So I’m surrendering and embracing the pie chart. Jeff’s <a href="http://simplystatistics.org/2012/11/26/the-statisticians-at-fox-news-use-classic-and-novel-graphical-techniques-to-lead-with-data/">recent post</a> shows we have bigger fish to fry.</p>
The statisticians at Fox News use classic and novel graphical techniques to lead with data
2012-11-26T10:04:47+00:00
http://simplystats.github.io/2012/11/26/the-statisticians-at-fox-news-use-classic-and-novel-graphical-techniques-to-lead-with-data
<p>Depending on where you land in the political spectrum you may either love or despise Fox News. But regardless of your political affiliation, you have to recognize that their statisticians are well-trained in the art of using graphics to persuade folks of a particular viewpoint. I’m not the first to recognize that the graphics department uses some clever tricks to make certain points. But when flipping through the graphs I thought it was interesting to highlight some of the techniques they use to persuade. Some are clearly classics from the literature, but some are (as far as I can tell) newly developed graphical “persuasion” techniques.</p>
<p><strong>Truncating the y-axis</strong></p>
<p><img class="alignnone" src="http://mediamatters.org/static/images/item/fnc-an-20120809-welfarechart-2.jpg" alt="" width="354" height="266" /></p>
<p>(<a href="http://mediamatters.org/blog/2012/08/09/today-in-dishonest-fox-charts-government-aid-ed/189223">via</a>)</p>
<p>and</p>
<p><img class="alignnone" src="http://blogs-images.forbes.com/naomirobbins/files/2012/08/Bush_cuts2.png" alt="" width="386" height="286" /></p>
<p>(<a href="http://www.forbes.com/sites/naomirobbins/2012/08/04/another-misleading-graph-of-romneys-tax-plan/">via</a>)</p>
<p>This is a pretty common technique for leading the question in statistical graphics, as discussed <a href="http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728">here</a> and elsewhere.</p>
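<p>To see how much truncation changes the visual impression, here is a small illustration in R (the numbers are made up, not taken from the charts above):</p>
<pre>
# Two bars that differ by about 5%, drawn with a full axis and with a
# truncated axis that makes the same difference look enormous.
vals <- c(Before = 8.6, After = 9.0)
par(mfrow = c(1, 2))
barplot(vals, ylim = c(0, 10), main = "Axis starts at 0")
barplot(vals, ylim = c(8.5, 9.1), xpd = FALSE, main = "Axis starts at 8.5")
</pre>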
<p><strong>Numbers that don’t add up</strong></p>
<p>I’m not sure whether this one is intentional or not, but it crops up in several places and I think is a unique approach to leading information, at least I couldn’t find a reference in the literature. Basically the idea is to produce percentages that don’t add to one, allowing multiple choices to have closer percentages than they probably should:</p>
<p><img class="alignnone" src="http://24.media.tumblr.com/tumblr_m9xia70vbR1rfnvq8o1_500.jpg" alt="" width="300" height="150" /></p>
<p>(<a href="http://badgraphs.tumblr.com/">via</a>)</p>
<p>or to suggest that multiple options are all equally likely, but also supported by large percentages:</p>
<p><img class="alignnone" src="http://flowingdata.com/wp-content/uploads/yapb_cache/app15725951258947184.acq6gmp0hf4sowckg80ssc8wg.2xne1totli0w8s8k0o44cs0wc.th.png" alt="" width="329" height="247" /></p>
<p>(<a href="http://flowingdata.com/2009/11/26/fox-news-makes-the-best-pie-chart-ever/">via</a>)</p>
<p><strong>Changing the units of comparison</strong></p>
<p>When two things are likely to be very similar, one approach to leading information is to present variables in different units. Here is an example where total spending for 2010-2013 is compared to deficits in 2008. This can also be viewed as an example of <a href="http://www.sao.state.tx.us/resources/Manuals/Method/data/12DECEPD.pdf">not labeling the axes</a>.</p>
<p><img class="alignnone" src="http://mediamatters.org/static/images/item/fnc-ff-20120926-spending.jpg" alt="" width="270" height="215" /></p>
<p>(<a href="http://mediamatters.org/blog/2012/09/26/by-the-way-heres-another-dishonest-fox-news-gra/190141">via</a>)</p>
<p><span style="color: #000000;"><strong>Changing the magnitude of units at different x-values</strong></span></p>
<p>Here is a plot where the changes in magnitude at high x-values are higher than changes in magnitude at lower x-values. Again, I think this is actually a novel graphical technique for leading readers in one direction.</p>
<p><img class="alignnone" src="http://freethoughtblogs.com/lousycanuck/files/2011/12/121212_fox.jpg" alt="" width="257" height="155" /></p>
<p>(<a href="http://freethoughtblogs.com/lousycanuck/2011/12/14/im-better-at-graphs-than-fox-news/">via</a>)</p>
<p>To really see the difference, compare to the graph with common changes in magnitude at all x-values.</p>
<p><img class="alignnone" src="http://freethoughtblogs.com/lousycanuck/files/2011/12/us-unemployment2011.png" alt="" width="341" height="198" /></p>
<p>(<a href="http://freethoughtblogs.com/lousycanuck/2011/12/14/im-better-at-graphs-than-fox-news/">via</a>)</p>
<p><strong>Changing trends by sub-sampling x values</strong> (also misleading chart titles)</p>
<p>Here is a graph that shows unemployment rates over time and the corresponding chart with the x-axis appropriately laid out.</p>
<p><img class="alignnone" src="http://onlinestatbook.com/2/graphing_distributions/graphics/graph2.png" alt="" width="282" height="163" /></p>
<p><img class="alignnone" src="http://onlinestatbook.com/2/graphing_distributions/graphics/graph3.png" alt="" width="418" height="212" /></p>
<p>(<a href="http://onlinestatbook.com/2/graphing_distributions/graphing_distributionsSA.html">via</a>)</p>
<p>One could argue these are mistakes, but based on the consistent displays of data supporting one viewpoint, I think these are likely the result of someone with real statistical training who is using data in a very specific way to make a point. Obviously, Fox News isn’t the only organization that does this sort of thing, but it is interesting to see how much effort they put into statistical graphics.</p>
Sunday data/statistics link roundup (11/25/2012)
2012-11-25T09:11:03+00:00
http://simplystats.github.io/2012/11/25/sunday-datastatistics-link-roundup-11252012
<ol>
<li><span style="line-height: 16px;">My wife used to teach at Grinnell College, so we were psyched to see that a Grinnell player set the <a href="http://espn.go.com/mens-college-basketball/story/_/id/8658462/jack-taylor-grinnell-drops-138-points-collegiate-scoring-record">NCAA record for most points in a game</a>. We used to go to the games, which were amazing to watch, when we lived in Iowa. The system the coach has in place there is a ton of fun to watch and is <a href="http://science.slashdot.org/story/12/11/21/228242/statistics-key-to-success-in-run-and-gun-basketball?utm_source=slashdot&utm_medium=twitter">based on statistics</a>!</span></li>
<li><span style="line-height: 16px;">Someone has to vet the science writers at the Huffpo. <a href="http://www.huffingtonpost.com/dr-douglas-fields/50-shades-of-grey-in-scientific-publication-how-digital-publishing-is-harming-science_b_2155760.html?utm_hp_ref=tw">This</a> is out of control, basically claiming that open access publishing is harming science. I mean, I’m all about being a curmudgeon and all, but the internet exists now, so we might as well get used to it. </span></li>
<li><span style="line-height: 16px;">This one is probably better for <a href="http://blogs.forbes.com/stevensalzberg/">Steven’s blog</a>, but this is a <a href="http://www.forbes.com/sites/matthewherper/2012/11/20/one-of-my-favorite-charts-on-the-power-of-vaccines/">pretty powerful graph</a> about the life-saving potential of vaccines. </span></li>
<li><span style="line-height: 16px;">Roger <a href="http://simplystatistics.org/2012/11/24/computer-scientists-discover-statistics-and-find-it-useful/">posted yesterday</a> about the NY Times piece on deep learning. It is one of our most shared posts of all time, you should also check out the comments, which are exceedingly good. Two things I thought I’d point out in response to a lot of the reaction: (1) I think part of Roger’s post was suggesting that the statistics community should adopt some of CS’s culture of solving problems with already existing, really good methods and (2) I tried searching for a really clear example of “deep learning” yesterday so we could try some statistics on it and didn’t find any really clear explanations. Does anyone have a really simple example of deep learning (ideally with code) so we can see how it relates to statistical concepts? </span></li>
</ol>
Computer scientists discover statistics and find it useful
2012-11-24T15:53:34+00:00
http://simplystats.github.io/2012/11/24/computer-scientists-discover-statistics-and-find-it-useful
<p>This <a href="http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html?smid=pl-share">article in the New York Times today</a> describes some of the advances that computer scientists have made in recent years.</p>
<blockquote>
<p>The technology, called deep learning, has already been put to use in services like Apple’s Siri virtual personal assistant, which is based on Nuance Communications’ speech recognition service, and in Google’s Street View, which uses machine vision to identify specific addresses.</p>
<p>But what is new in recent months is the growing speed and accuracy of deep-learning programs, often called artificial neural networks or just “neural nets” for their resemblance to the neural connections in the brain.</p>
</blockquote>
<p>Deep learning? Really?</p>
<p>Okay, names aside, there are a few things to say here. First, the advances described in the article are real–I think that’s clear. There’s a lot of pretty cool stuff out there (including Siri, in my opinion) coming from the likes of Google, Microsoft, Apple, and many others and, frankly, I appreciate all of it. I hope to have my own self-driving car one day.</p>
<p>The question is how did we get here? What worries me about this article and many others is that you can get the impression that there were tremendous advances in the technology/methods used. But I find that hard to believe given that the methods that are often discussed in these advances are methods that have been around for quite a while (neural networks, anyone?). The real advance has been in the incorporation of data into these technologies and the use of <em>statistical models</em>. The interesting thing is not that the data are big, it’s that we’re using data at all.</p>
<p>Did Nate Silver produce a better prediction of the election than the pundits because he had better models or better technology? No, it’s because he bothered to use data at all. This is not to downplay the sophistication of Silver’s or others’ approach, but <a href="http://electoral-vote.com/">many</a> <a href="http://votamatic.org/">others</a> <a href="http://www.huffingtonpost.com/news/pollster/">did</a> <a href="http://www.realclearpolitics.com/epolls/2012/president/2012_elections_electoral_college_map.html">what</a> <a href="http://polltracker.talkingpointsmemo.com/">he</a> <a href="http://election.princeton.edu/">did</a> (presumably using different methods–I don’t think there was collaboration) and <a href="http://fivethirtyeight.blogs.nytimes.com/2012/10/31/oct-30-what-state-polls-suggest-about-the-national-popular-vote/">more or less got the same results</a>. So the variation across different models is small, but the variation between using data vs. not using data is, well, big. Peter Norvig notes this in his <a href="http://simplystatistics.org/2012/03/16/the-unreasonable-effectiveness-of-data-a-talk/">talk about how Google uses data for translation</a>. An area that computational linguists had been working on for decades was advanced dramatically by a ton of data and (a variation of) Bayes’ Theorem. I may be going out on a limb here, but I don’t think it was Bayes’ Theorem that did the trick. But there will probably be an article in the New York Times soon about how Bayes’ Theorem is revolutionizing artificial intelligence. Oh wait, <a href="http://www.nytimes.com/2008/05/03/technology/03koller.html">there already was one</a>.</p>
<p>It may sound like I’m trying to bash the computer scientists here, but I’m not. It would be too too easy for me to write a post complaining about how the computer scientists have stolen the ideas that statisticians have been using for decades and are claiming to have discovered new approaches to everything. But that’s exactly what is happening and <em>good for them</em>.</p>
<p>I don’t like to frame everything as an us-versus-them scenario, but the truth is the computer scientists are winning and the statisticians are losing. The reason is that they’ve taken our best ideas and used them to solve problems that matter to people. Meanwhile, we should have been stealing the computer scientists’ best ideas and using them to solve problems that matter to people. But we didn’t. And now we’re playing catch-up, and not doing a particularly good job of it.</p>
<p>That said, I believe there’s still time for statistics to play a big role in “big data”. We just have to choose to do it. Borrowing ideas from other fields is good–that’s why it’s called “re”search, right? Statisticians shouldn’t be shy about it. Otherwise, all we’ll have left to do is complain about how all those people took what we’d been working on for decades and…made it useful.</p>
Developing the New York Times Visual Election Outcome Explorer
2012-11-21T16:13:27+00:00
http://simplystats.github.io/2012/11/21/developing-the-new-york-times-visual-election-outcome-explorer
<p>Mike Bostock <a href="http://source.mozillaopennews.org/en-US/articles/nyts-512-paths-white-house/">talks about the design and construction</a> of the “<a href="http://www.nytimes.com/interactive/2012/11/02/us/politics/paths-to-the-white-house.html">512 Paths to the White House</a>” visualization for the New York Times. I found this visualization extremely useful on election night as it helped me understand the implications of each of the swing state calls as the night rolled on.</p>
<p>Regarding the use of outside information to annotate the graphic:</p>
<blockquote>
<p>Applying background knowledge to give the data greater context—such as the influence of the auto-industry bailout on Ohio’s economy—makes the visualization that much richer. After all, visualizations aren’t just about numbers, but about understanding the world we live in; qualitative information can add substantially to a quantitative graphic.</p>
</blockquote>
<p>While the technical details are fascinating, I was equally interested in the editorial decisions they had to make to build a usable visualization.</p>
A grand experiment in science funding
2012-11-20T10:33:50+00:00
http://simplystats.github.io/2012/11/20/a-grand-experiment-in-science-funding
<p>Among all the young scientists I know, I think <a href="http://www.perlsteinlab.com/">Ethan Perlstein</a> is one of the most innovative in the way he has adapted to the internet era. His website is incredibly unique among academic websites, he is all over the social media and his latest experiment in <a href="http://www.rockethub.com/projects/11106-crowdsourcing-discovery">crowd-funding his research</a> is something I’m definitely keeping an eye on.</p>
<p>The basic idea is that he has identified a project (giving meth to <del>yeast</del> mouse brains -see the comment by Ethan below-, I think) and put it up on <a href="http://www.rockethub.com/">Rockethub</a>, which is a crowd funding platform. He is looking for people to donate to his lab to fund the project. I would love it if this project succeeded, so if you have a few extra dollars lying around I’m sure he’d really appreciate it if <a href="http://www.rockethub.com/projects/11106-crowdsourcing-discovery/fuel/reward_selection">you’d donate</a>.</p>
<p>At the bigger picture level, I love the idea of crowd-funding for science in principle. But it isn’t clear that it is going to work in practice. Ethan has been tearing it up with this project, even ending up in <a href="http://www.economist.com/news/science-and-technology/21564824-these-days-anyone-can-be-scientific-philanthropist">the Economist</a>, but he has still had trouble getting to his goal for funding. In the grand scheme of things he is asking for a relatively small amount given how much he will do, so it isn’t clear to me that this is a viable option for most scientists.</p>
<p>The other key problem, as a statistician, is that many of the projects I work on will not be as easily understandable/cool as giving meth to yeast. So, for example, I’m not sure I’d be able to generate the kind of support I’d need for my group to work on statistical analysis of RNA-seq data or batch effect removal methods.</p>
<p>Still, I love the idea, and it would be great if there were alternative sources of revenue for the incredibly important work that scientists like Ethan and others are doing.</p>
Podcast #5: Coursera Debrief
2012-11-19T10:00:08+00:00
http://simplystats.github.io/2012/11/19/podcast-5-coursera-debrief-2
<p>Jeff and I talk with Brian Caffo about teaching MOOCs on Coursera.</p>
Welcome to Simply Statistics 2.0
2012-11-18T16:40:53+00:00
http://simplystats.github.io/2012/11/18/welcome-to-simply-statistics-2-0
<p>Welcome to the re-designed, re-hosted and re-platformed Simply Statistics blog. We have moved the blog over to the WordPress platform to give us some newer features that were lacking over at tumblr. So far the transition has gone okay but there may be a few bumps over the next 24 hours or so as we learn the platform. Remember, we’re not the young hackers that we used to be.</p>
<p>A few things have changed. First off, the search box <em>actually works</em>. Also, in moving the Disqus comments over, we seem to have lost all of the old comments. So unfortunately many of your gems from the past are now gone. If anyone knows how to retain old comments on Disqus, please let us know! I think Jeff’s been banging his head for a while now trying to figure this out.</p>
<p>We’re hoping to roll out a few new features over the next few months so keep an eye out and come back often.</p>
Sunday Data/Statistics Link Roundup (11/18/12)
2012-11-18T14:54:20+00:00
http://simplystats.github.io/2012/11/18/sunday-data-statistics-link-roundup-11-18-12
<ol>
<li><a href="http://www.youtube.com/watch?v=Ipk3HIIG9-o&feature=youtu.be" target="_blank">An interview</a> with Brad Efron about scientific writing. I haven’t watched the whole interview, but I do know that Efron is one of my favorite writers among statisticians.</li>
<li><a href="http://ramnathv.github.com/slidify/" target="_blank">Slidify,</a> another approach for making HTML5 slides directly from R. I love the idea of making HTML slides, I would definitely do this regularly. But there are a couple of issues I feel still aren’t resolved: (1) It is still just a little too hard to change the theme/feel of the slides in my opinion. It is just CSS, but that’s still just enough of a hurdle that it is keeping me away and (2) I feel that the placement/insertion of images is still a little clunky, Google Docs has figured this out, I’d love it if they integrated the best features of Slidify, Latex, etc. into that system. </li>
<li>Statistics is still the new hotness. Here is a Business Insider list about 5 statistics problems that will <a href="http://www.businessinsider.com/five-statistics-problems-that-will-change-the-way-you-see-the-world-2012-11" target="_blank">“change the way you think about the world”</a>. </li>
<li>I love this one in the <a href="http://www.newyorker.com/humor/2012/11/19/121119sh_shouts_rudnick" target="_blank">New Yorker</a>, especially the line, “statisticians are the new sexy vampires, only even more pasty” (via Brooke A.)</li>
<li><span>We’ve hit the big time! We have <a href="http://www.forbes.com/sites/stevensalzberg/2012/11/12/the-election-is-over-and-the-math-geeks-won/" target="_blank">been linked to</a> by a real (Forbes) blogger. </span></li>
<li><span>If you haven’t noticed, we have a <a href="http://simplystatistics.org/post/35842154215/logo-contest-winner" target="_blank">new logo</a>. We are going to be making a few other platform-related changes over the next week or so. If you have any trouble, let us know!</span></li>
</ol>
Logo Contest Winner
2012-11-16T15:00:32+00:00
http://simplystats.github.io/2012/11/16/logo-contest-winner
<p>Congratulations to Bradley Saul, the winner of the Simply Statistics Logo contest! We had some great entries which made it difficult to choose between them. You can see the new logo to the right of our home page or the full sized version here:</p>
<p><img src="http://media.tumblr.com/tumblr_mdl39pL5ua1r08wvg.png" alt="" /></p>
<p>I made some slight modifications to Bradley’s original code (apologies!). The code for his original version is here:</p>
<pre>#########################################################
# Project: Simply Statistics Logo Design
# Date: 10/17/12
# Version: 0.00001
# Author: Bradley Saul
# Built in R Version: 2.15.0
#########################################################

# Set graphical parameters
par(mar = c(0, 0, 0, 0), pty = 's', cex = 3.5, pin = c(6, 6))
# Note: I had to hard code the size, so that the text would scale
# on resizing the image. Maybe there is another way to get around font
# scaling issues - I couldn't figure it out.

make_logo <- function(color){
  x1   <- seq(0, 1, .001)
  ncps <- seq(0, 10, 1)
  shapes <- seq(5, 15, 1)

  # Plot Beta distributions to make purty lines.
  plot(x1, pbeta(x1, shape1 = 10, shape2 = .1, ncp = 0), type = 'l',
       xlab = '', ylab = '', frame.plot = FALSE, axes = FALSE)
  for(i in 1:length(ncps)){
    lines(x1, pbeta(x1, shape1 = .1, shape2 = 10, ncp = ncps[i]), col = color)
  }

  # Shade in area under curve.
  coord.x <- c(0, x1, 1)
  coord.y <- c(0, pbeta(x1, shape1 = .1, shape2 = 10, ncp = 10), 0)
  polygon(coord.x, coord.y, col = color, border = "white")

  # Lazy way to get area between curves shaded, rather than just area under curve.
  coord.y2 <- c(0, pbeta(x1, shape1 = 10, shape2 = .1, ncp = 0), 0)
  polygon(coord.x, coord.y2, col = "white", border = "white")

  # Add text
  text(.98, .4, 'Simply', col = "white", adj = 1, family = 'HersheySerif')
  text(.98, .25, 'St*atistics', col = "white", adj = 1, family = "HersheySerif")
}
</pre>
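<p>For anyone who wants to try it at home, here is a minimal sketch of how you might call the function and save the result to a file. The file name, device size, and color below are placeholders, not necessarily the settings used for the published logo.</p>
<pre># Hypothetical usage of make_logo(); the file name, device size, and color
# are examples, not necessarily what was used for the published logo.
png("simplystats_logo.png", width = 600, height = 600)
par(mar = c(0, 0, 0, 0), pty = 's', cex = 3.5, pin = c(6, 6))
make_logo("steelblue")
dev.off()
</pre>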
<p>Thanks to Bradley for the great logo and congratulations!</p>
Reproducible Research: With Us or Against Us?
2012-11-15T16:33:55+00:00
http://simplystats.github.io/2012/11/15/reproducible-research-with-us-or-against-us-3
<p>Last night this <a href="http://cogprints.org/8675/" target="_blank">article by Chris Drummond</a> of the Canadian National Research Council (Conseil national de recherches Canada) popped up in my Google Scholar alert. The title of the article, “Reproducible Research: a Dissenting Opinion” would seem to indicate that he disagrees with much that has been circulating out there about reproducible research.</p>
<p>Drummond singles out the <a href="http://www.stanford.edu/~vcs/papers/RoundtableDeclaration2010.pdf" target="_blank">Declaration published by a Yale Law School Roundtable on Data and Code Sharing</a> (I was not part of the roundtable) as an example of the main arguments in favor of reproducibility and has four main objections. What I found interesting about his piece is that I think I more or less agree with all his objections and yet draw the exact opposite conclusion from him. In his abstract, he concludes that “I would also contend that the effort necessary to meet the [reproducible research] movement’s aims, and the general attitude it engenders, would not serve any of the research disciplines well.”</p>
<div>
<span>Let’s take his objections one by one:</span>
</div>
<div>
<ol>
<li>
<strong>Reproducibility, at least in the form proposed, is not now, nor has it ever been, an essential part of science</strong>. I would say that with the exception of mathematics, this is true. In math, usually you state a theorem and provide the proof. The proof shows you how to obtain the result, so it is a form of reproducibility. But beyond that I would argue that the need for reproducibility is a more recent phenomenon arising from the great complexity and cost of modern data analyses and the lack of funding for full replication. The rise of “consortium science” (think ENCODE project) diminishes our ability to fully replicate (what he calls “Scientific Replication”) an experiment in any reasonable amount of time.
</li>
<li>
<strong>The idea of a single well defined scientific method resulting in an incremental, and cumulative, scientific process is highly debatable</strong>. He argues that the idea of a forward moving process by which science builds on top of previous results in an orderly and incremental fashion is a fiction. In particular, there is no single “scientific method” into which you can drop reproducibility as a key component. I think most scientists would agree with this. Science is not some orderly process—it’s messy, it can seem haphazard, and discoveries come at unpredictable times. But that doesn’t mean that people shouldn’t provide the details of what they’ve done so that others don’t have to essentially reverse engineer the process. I don’t see how the disorderly reality of science is an argument against reproducibility.
</li>
<li>
<strong>Requiring the submission of data and code will encourage a level of distrust among researchers and promote the acceptance of papers based on narrow technical criteria</strong>. I don’t agree with this statement at all. First, I don’t think it will happen. If a journal required code/data, it would be burdensome for some, but it would just be one of the many requirements that journals have. Second, I don’t think good science is about “trust”. Sure, it’s important to be civilized, but if you claim a finding, I’m not going to just trust it because we’re both scientists. Finally, he says “<span>Submitting code — in whatever language, for whatever system — will simply result in an accumulation of questionable software. There may be some cases where people would be able to use it but I would doubt that they would be frequent.” I think this is true, but it’s not necessarily an argument against submitting code. Think of all the open source/free software packages out there. I would bet that most of that code has only been looked at by one person—the developer. But does that mean open source software as a whole is not valuable?</span>
</li>
<li>
<strong>Misconduct has always been part of science with surprisingly little consequence. The public’s distrust likely has more to do with the apparent variability of scientific conclusions</strong>. I agree with the first part and am not sure about the second. I’ve tried to argue previously that <a href="http://simplystatistics.org/post/12421558195/reproducible-research-notes-from-the-field" target="_blank">reproducible research is not just about preventing fraud/misconduct</a>. If someone wants to commit fraud, it’s easy to make the fraud reproducible.
</li>
</ol>
<p>
In the end, I see reproducibility not as a new concept, but as an adaptation of an old one: describing materials and methods. The problem is that the standard format for publication—journal articles—has simply not caught up with the growing complexity of data analysis. And so we need to update the standards a bit.
</p>
<p>
I think the benefit of reproducibility is that if someone wants to question or challenge the findings of a study, they have the materials with which to do so. Providing people with the means to ask questions is how science moves forward.
</p>
</div>
Interview with Tom Louis - New Chief Scientist at the Census Bureau
2012-11-09T15:53:02+00:00
http://simplystats.github.io/2012/11/09/interview-with-tom-louis-new-chief-scientist-at-the
<div class="im">
<strong>Tom Louis</strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong><img height="225" src="http://biostat.jhsph.edu/~jleek/tom.jpg" width="150" /></strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<a href="http://www.biostat.jhsph.edu/~tlouis/" target="_blank">Tom Louis</a> is a professor of Biostatistics at Johns Hopkins and will be joining the Census Bureau through an <span>interagency personnel agreement as the new associate director for research and methodology and chief scientist.</span><span> Tom has an impressive history of accomplishment in developing statistical methods for everything from environmental science to genomics. We talked to Tom about his new role at the Census, how it relates to his impressive research career, and how young statisticians can get involved in the statistical work at the Census. </span>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong>SS: How did you end up being invited to lead the research branch of the Census?</strong>
</div>
<p><span>TL: Last winter, then-director Robert Groves (now Provost at Georgetown University) asked if I would be interested in the possibility of becoming the next Associate Director of Research and Methodology (R&M) and Chief Scientist, succeeding Rod Little (Professor of Biostatistics at the University of Michigan) in these roles. I expressed interest and after several discussions with Bob and Rod, decided that if offered, I would accept. It was offered and I did accept. </span></p>
<p><span>As background, components of my research, especially Bayesian methods, are Census-relevant. Furthermore, during my time as a member of the National Academies Committee on National Statistics I served on the panel that recommended improvements in small area income and poverty estimates, chaired the panel that evaluated methods for allocating federal and state program funds by formula, and chaired a workshop on facilitating innovation in the Federal statistical system.</span></p>
<p><span>Rod and I noted that it’s interesting and possibly not coincidental that with my appointment the first two associate directors are both former chairs of Biostatistics departments. It is the case that R&M’s mission is quite similar to that of a Biostatistics department: methods and collaborative research, consultation, and education. And there are many statisticians at the Census Bureau who are not in the R&M directorate, a sociology quite similar to that in a School of Public Health or a Medical campus. </span></p>
<div class="im">
<strong>SS: What made you interested in taking on this major new responsibility?</strong>
</div>
<p><span>TL: I became energized by the opportunity for national service, and excited by the scientific, administrative, and sociological responsibilities and challenges. I’ll be engaged in hiring and staff development, and increasing the visibility of the bureau’s pre- and post-doctoral programs. The position will provide the impetus to take a deep dive into finite-population statistical approaches, and contribute to the evolving understanding of the strengths and weaknesses of design-based, model-based, and hybrid approaches to inference. That I could remain a Hopkins employee by working via an Interagency Personnel Agreement sealed the deal. I will start in January 2013 and serve through 2015, and will continue to participate in some Hopkins-based activities.</span></p>
<p><span>In addition to activities within the Census Bureau, I’ll be increasing connections among statisticians in other federal statistical agencies and will have a role in relations with researchers funded through the NSF to conduct census-related research.</span></p>
<div class="im">
<p>
<strong>SS: What are the sorts of research projects the Census is involved in? </strong></div>
<p>
<span>TL: The Census Bureau designs and conducts the decennial Census, the Current Population Survey, the American Community Survey, many, many other surveys for other Federal Statistical Agencies including the Bureau of Labor Statistics, and a quite extraordinary portfolio of others. Each identifies issues in design and analysis that merit attention, many entail “Big Data” and many require combining information from a variety of sources. I give a few examples, and encourage exploration of </span><a href="http://www.census.gov/research" target="_blank">www.census.gov/research</a><span>.</span>
</p>
<p>
<span>You can get a flavor of the types of research from the titles of the six current centers within R&M: The Center for Adaptive Design, The Center for Administrative Records Research and Acquisition, The Center for Disclosure Avoidance Research, The Center for Economic Studies, The Center for Statistical Research and Methodology, and The Center for Survey Measurement. Projects include multi-mode survey approaches, stopping rules for household visits, methods of combining information from surveys and administrative records, provision of focused estimates while preserving identity protection, improved small area estimates of income and of limited English skills (used to trigger provision of election ballots in languages other than English), and continuing investigation of issues related to model-based and design-based inferences.</span>
</p>
<div class="im">
<p>
<br /><strong>SS: Are those projects related to your research?</strong></div>
<p>
<span>TL: Some are, some will be, some will never be. Small area estimation, hierarchical modeling with a Bayesian formalism, some aspects of adaptive design, some of combining evidence from a variety of sources, and general statistical modeling are in my power zone. I look forward to getting involved in these and contributing to other projects.</span>
</p>
<div class="im">
<p>
<strong>SS: How does research performed at the Census help the American Public?</strong></div>
<p>
<span>TL: Research innovations enable the bureau to produce more timely and accurate information at lower cost, improve validity (for example, new approaches have at least maintained respondent participation in surveys), and enhance the reputation of the Census Bureau as a trusted source of information. Estimates developed by Census are used to allocate billions of dollars in school aid, and they provide key planning information for businesses and governments.</span>
</p>
<div class="im">
<p>
<strong>SS: How can young statisticians get more involved in government statistical research?</strong></div>
<p>
<span>TL: The first step is to become aware of the wide variety of activities and their high impact. Visiting the Census website and those of other federal and state agencies, and the Committee on National Statistics (</span><a href="http://sites.nationalacademies.org/DBASSE/CNSTAT/" target="_blank">http://sites.nationalacademies.org/DBASSE/CNSTAT/</a><span>) and the National Institute of Statistical Sciences (</span><a href="http://www.niss.org/" target="_blank">http://www.niss.org/</a><span>) is a good start. Make contact with researchers at the JSM and other meetings and be on the lookout for pre- and post-doctoral positions at Census and other federal agencies.</span>
</p>
Some academic thoughts on the poll aggregators
2012-11-08T20:11:00+00:00
http://simplystats.github.io/2012/11/08/some-academic-thoughts-on-the-poll-aggregators
<p>The night of the presidential elections I wrote a <a href="http://simplystatistics.org/post/35187901781/nate-silver-does-it-again-will-pundits-finally-accept" target="_blank">post</a> celebrating the victory of data over punditry. I was motivated by the personal attacks made against Nate Silver by pundits that do not understand Statistics. The post generated a little bit of (justified) <em><a href="http://www.urbandictionary.com/define.php?term=nerdrage" target="_blank">nerdrage</a> </em>(see comment section). So here I clarify a couple of things not as a member of Nate Silver’s fan club (my <a href="http://www.urbandictionary.com/define.php?term=mancrush" target="_blank"><em>mancrush</em> </a>started with <a href="http://www.baseballprospectus.com/" target="_blank">PECOTA</a> not fivethirtyeight) but as an applied statistician.</p>
<p>The main reason <a href="http://fivethirtyeight.blogs.nytimes.com/" target="_blank">fivethirtyeight</a> predicts election results so well is the simple idea of averaging polls. This idea was around way before fivethirtyeight started. In fact, it’s a version of <a href="http://en.wikipedia.org/wiki/Meta-analysis" target="_blank">meta-analysis</a>, which has been around for hundreds of years and is commonly used to <a href="http://www.ncbi.nlm.nih.gov/pubmed/3802833" target="_blank">improve the results of clinical trials</a>. This election cycle several groups, including Sam Wang (<a href="http://election.princeton.edu/" target="_blank">Princeton Election Consortium</a>), Simon Jackman (<a href="http://www.huffingtonpost.com/simon-jackman/pollster-predictions_b_2081013.html" target="_blank">pollster</a>), and Drew Linzer (<a href="http://votamatic.org/" target="_blank">VOTAMATIC</a>), predicted the election perfectly using this trick.</p>
<p>While each group adds its own set of bells and whistles, most of the gains come from the aggregation of polls and understanding the concept of a standard error. Note that while each individual poll may be a bit biased, historical data show that these biases average out to 0, so by taking the average you obtain a close to unbiased estimate. Because there are so many pollsters, each one conducting several polls, you can also estimate the standard error of your estimate pretty well (empirically rather than theoretically). I include a plot below that provides evidence that bias is not an issue and that standard errors are well estimated. The dashed lines are at +/- 2 standard errors based on the average (across all states) standard error reported by fivethirtyeight. Note that the variability is smaller for the battleground states where more polls were conducted (this is consistent with the state-specific standard errors reported by fivethirtyeight).</p>
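<p>To make the averaging idea concrete, here is a toy simulation (made-up numbers, not the actual method used by fivethirtyeight or any other aggregator): each polling house has its own small bias, but because the biases average out to roughly zero, the mean of many polls lands close to the truth, and the spread of the polls gives a rough empirical standard error for that average.</p>
<pre># Toy simulation of poll averaging (made-up numbers, not any aggregator's model)
set.seed(538)
true_support <- 0.505                                     # true share for candidate A
n_pollsters  <- 20                                        # number of polling houses
polls_each   <- 5                                         # polls per house
house_bias   <- rnorm(n_pollsters, mean = 0, sd = 0.01)   # house effects that average ~0

polls <- unlist(lapply(house_bias, function(b) {
  true_support + b + rnorm(polls_each, mean = 0, sd = 0.015)  # add sampling noise
}))

mean(polls)                       # close to 0.505: the biases wash out
sd(polls) / sqrt(length(polls))   # rough empirical standard error of the average
</pre>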
<p>Finally, there is the issue of the use of the word “probability”. Obviously one can correctly state that there is a 90% chance of observing event A and then have it not happen: Romney could have won and the aggregators would still have been “right”. Also, frequentists complain when we talk about the probability of something that will only happen once. I actually don’t like getting into this philosophical discussion (<a href="http://andrewgelman.com/2012/10/is-it-meaningful-to-talk-about-a-probability-of-65-7-that-obama-will-win-the-election/" target="_blank">Gelman</a> has some thoughts worth reading) and I cut people who write for the masses some slack. If the aggregators consistently outperform the pundits in their predictions I have no problem with them using the word “probability” in their reports. I look forward to some of the post-election analysis of all this.</p>
<p><a href="http://rafalab.jhsph.edu/simplystats/silver3.png" target="_blank"><img height="500" src="http://rafalab.jhsph.edu/simplystats/silver3.png" width="500" /></a></p>
Nate Silver does it again! Will pundits finally accept defeat?
2012-11-07T05:54:00+00:00
http://simplystats.github.io/2012/11/07/nate-silver-does-it-again-will-pundits-finally-accept
<p>My favorite statistician did it again! Just like in 2008, he predicted the presidential election results almost perfectly. For those that don’t know, Nate Silver is the statistician that runs the <a href="http://fivethirtyeight.blogs.nytimes.com/" target="_blank">fivethirtyeight blog</a>. He combines data from hundreds of polls, uses historical data to weigh them appropriately and then uses a statistical model to run simulations and predict outcomes.</p>
<p>While the pundits were claiming the race was a “dead heat”, the day before the election Nate gave Obama a 90% chance of winning. Several pundits attacked Nate (some attacks were personal) for his predictions and demonstrated their ignorance of Statistics. Jeff wrote a <a href="http://simplystatistics.org/post/34635539704/on-weather-forecasts-nate-silver-and-the" target="_blank">nice post on this</a>. The plot below demonstrates how great Nate’s prediction was. Note that in each of the 45 states (including DC) for which he gave candidate A a 90% or higher probability of winning, candidate A won. For the other 6 states the range of percentages was 48-52%. If Florida goes for Obama he will have predicted every single state correctly.</p>
<p><strong>Update:</strong> Congratulations also to Sam Wang (<a href="http://election.princeton.edu/" target="_blank">Princeton Election Consortium</a>) and Simon Jackman (<a href="http://www.huffingtonpost.com/simon-jackman/pollster-predictions_b_2081013.html" target="_blank">pollster</a>), who also called the election perfectly. And thanks to the pollsters who provided the unbiased (on average) data used by all these folks. Data analysts won, “experts” lost.</p>
<p><del><strong>Update 2</strong>: New plot with data from <a href="http://www.foxnews.com/politics/elections/2012-election-results/" target="_blank">here</a>. Old graph <a href="http://rafalab.jhsph.edu/simplystats/silver.png" target="_blank">here</a>.</del></p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/silver3.png" alt="Observed versus predicted" /></p>
If we truly want to foster collaboration, we need to rethink the "independence" criteria during promotion
2012-11-05T15:00:44+00:00
http://simplystats.github.io/2012/11/05/if-we-truly-want-to-foster-collaboration-we-need-to
<p class="MsoNormal">
<span>When I talk about collaborative work, I don’t mean spending a day or two helping compute some p-values and ending up as middle author in a subject-matter paper. I mean spending months working on a project, </span>from start to finish, with experts from other disciplines to accomplish a goal that can only be accomplished with a diverse team. Many papers in genomics are like this (the ENCODE and 1000 Genomes papers, for example). Investigator A dreams up the biology, B develops the technology, C codes up algorithms to deal with massive data, while D analyzes the data and assesses uncertainty, with the results reported in one high-profile paper. I illustrate the point with genomics because it’s what I know best, but examples abound in other specialties as well.
</p>
<p class="MsoNormal">
<span>Fostering collaborative research seems to be a priority for most higher education institutions. Both funding agencies and universities are creating initiative after initiative to incentivize team science. But at the same time the appointments and promotions process rewards researchers that have demonstrated “independence”. If we are not careful it may seem like we are sending mixed signals. I know of young investigators that have been advised to set time aside to demonstrate independence by publishing papers without their regular collaborators. This advice assumes that one can easily balance collaborative and independent research. But here is the problem: truly collaborative work can take just as much time and intellectual energy as independent research, perhaps more. Because time is limited, we might inadvertently be hindering the team science we are supposed to be fostering. Time spent demonstrating independence is time not spent working on the next high impact project.</span>
</p>
<p class="MsoNormal">
I understand the argument for striving to hire and promote scholars who can excel no matter the context. But I also think it is unrealistic to compete in team science if we don’t find a better way to promote those who excel in collaborative research as well. It is a mistake to think that scholars who excel in solo research can easily succeed in team science. In fact, I have seen several examples of specializations that are important to the university in which the best work is being produced by a small team. At the same time, “independent” researchers all over the country are also working in these areas and publishing just as many papers. But the influential work is coming almost exclusively from the team. Whom should your university hire and promote in this particular area? To me it seems clear that it is the team. But for them to succeed we can’t get in their way by requiring each individual member to demonstrate “independence” in the traditional sense.
</p>
<p class="MsoNormal">
<span> </span>
</p>
<p class="MsoNormal">
<span> </span>
</p>
Sunday Data/Statistics Link Roundup (11/4/12)
2012-11-04T14:24:48+00:00
http://simplystats.github.io/2012/11/04/sunday-data-statistics-link-roundup-11-4-12
<ol>
<li>Brian Caffo <a href="http://www.washingtonpost.com/local/education/elite-education-for-the-masses/2012/11/03/c2ac8144-121b-11e2-ba83-a7a396e6b2a7_story.html?wpisrc=emailtoafriend" target="_blank">headlines the WaPo article</a> about massive open online courses. He is the driving force behind our department’s involvement in offering these massive courses. I think this sums it up: “I can’t use another word than unbelievable,” Caffo said. Then he found some more: “Crazy . . . surreal . . . heartwarming.”</li>
<li><span>A really interesting discussion of why <a href="http://marginalrevolution.com/marginalrevolution/2012/11/a-bet-is-a-tax-on-bullshit.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+marginalrevolution%2Ffeed+%28Marginal+Revolution%29" target="_blank">“A Bet is a Tax on B.S.”</a>. It nicely describes why intelligent betters must be disinterested in the outcome, otherwise they will end up losing money. The Nate Silver controversy just doesn’t seem to be going away, good news for his readership numbers, I bet. (via Rafa)</span></li>
<li><span>An<a href="http://www.businessweek.com/articles/2012-11-01/its-global-warming-stupid" target="_blank"> interesting article</a> on how scientists are not claiming global warming is the sole cause of the extreme weather events we are seeing, but that it does contribute to them being more extreme. The key quote: </span><span>“We can’t say that steroids caused any one home run by Barry Bonds, but steroids sure helped him hit more and hit them farther. Now we have </span><a href="http://www.businessweek.com/articles/2012-11-01/rising-tide" target="_blank">weather on steroids</a><span>.” —Eric Pooley. (via Roger)</span></li>
<li><span>The NIGMS is looking for a <a href="https://loop.nigms.nih.gov/index.php/2012/11/01/wanted-biomedical-technology-bioinformatics-and-computational-biology-division-director/" target="_blank">Biomedical technology, Bioinformatics, and Computational Biology Director</a>. I hope that it is someone who <a href="http://simplystatistics.org/post/21914291274/people-in-positions-of-power-that-dont-understand" target="_blank">understands statistics</a>! (via Karl B.)</span></li>
<li><span>Here is <a href="http://www.spiked-online.com/site/article/13016/" target="_blank">another article</a> that appears to misunderstand statistical prediction. This one is about the Italian scientists who were jailed for failing to predict an earthquake. No joke. </span></li>
<li><span>We talk a lot about how much the data revolution will change industries from social media to healthcare. But here is an<a href="http://www.nextgov.com/health/health-it/2012/10/patients-dont-show-much-interest-accessing-health-data-online/59104/" target="_blank"> important reality check</a>. Patients are not showing an interest in accessing their health care data. I wonder if part of the reason is that we haven’t come up with the right ways to explain, understand, and utilize what is inherently stochastic and uncertain information. </span></li>
<li><span>The BMJ is now <a href="http://www.nytimes.com/2012/11/01/business/british-medical-journal-to-require-detailed-clinical-trial-data.html?_r=0" target="_blank">going to require</a> all data from clinical trials published in their journal to be public. This is a brilliant, forward thinking move. I hope other journals will follow suit. (via Karen B.R.)</span></li>
<li><span><a href="http://marginalrevolution.com/marginalrevolution/2012/11/retractions.html" target="_blank">An interesting article</a> about the impact of retractions on citation rates, suggesting that papers in fields close to those of the retracted paper may show negative impact on their citation rates. I haven’t looked it over carefully, but how they control for confounding seems incredibly important in this case. (via Alex N.). </span></li>
</ol>
Elite education for the masses
2012-11-04T14:00:34+00:00
http://simplystats.github.io/2012/11/04/elite-education-for-the-masses
<p><a href="http://www.washingtonpost.com/local/education/elite-education-for-the-masses/2012/11/03/c2ac8144-121b-11e2-ba83-a7a396e6b2a7_story_2.html">Elite education for the masses</a></p>
The Year of the MOOC
2012-11-03T17:06:55+00:00
http://simplystats.github.io/2012/11/03/the-year-of-the-mooc
<p><a href="http://nyti.ms/TTn1E6">The Year of the MOOC</a></p>
Microsoft Seeks an Edge in Analyzing Big Data
2012-10-31T00:19:43+00:00
http://simplystats.github.io/2012/10/31/microsoft-seeks-an-edge-in-analyzing-big-data
<p><a href="http://www.nytimes.com/2012/10/30/technology/microsoft-renews-relevance-with-machine-learning-technology.html?smid=tu-share">Microsoft Seeks an Edge in Analyzing Big Data</a></p>
On weather forecasts, Nate Silver, and the politicization of statistical illiteracy
2012-10-30T14:00:35+00:00
http://simplystats.github.io/2012/10/30/on-weather-forecasts-nate-silver-and-the
<p>As you know, <a href="http://simplystatistics.org/post/34483703514/sunday-data-statistics-link-roundup-10-28-12" target="_blank">we</a> <a href="http://simplystatistics.org/post/33564003058/sunday-data-statistics-link-roundup-10-14-12" target="_blank">have</a> a <a href="http://simplystatistics.org/post/29407938554/statistics-statisticians-need-better-marketing" target="_blank">thing</a> for <a href="http://simplystatistics.org/post/13684264380/citizen-science-makes-statistical-literacy-critical" target="_blank">statistical literacy</a> here at Simply Stats. So of course this <a href="http://www.politico.com/blogs/media/2012/10/nate-silver-romney-clearly-could-still-win-147618.html" target="_blank">column over at Politico</a> got our attention (via Chris V. and others). The column is an attack on Nate Silver, <a href="http://fivethirtyeight.blogs.nytimes.com/" target="_blank">who has a blog</a> where he tries to predict the outcome of elections in the U.S., you may have heard of it…</p>
<p>The argument that Dylan Byers makes in the Politico column is that Nate Silver is likely to be embarrassed by the outcome of the election if Romney wins. The reason is that Silver’s predictions have recently given Obama about a 75% chance of winning the election, and that number has never dropped below 60% or so.</p>
<p>I don’t know much about Dylan Byers, but from reading this column and a quick scan of his twitter feed, it appears he doesn’t know much about statistics. Some people have gotten pretty upset at him on Twitter and elsewhere about this fact, but I’d like to take a different approach: education. So Dylan, here is a really simple example that explains how Nate Silver comes up with a number like the 75% chance of victory for Obama. </p>
<p>Let’s pretend, just to make the example really simple, that if Obama gets greater than 50% of the vote, he will win the election. Obviously, Silver doesn’t ignore the electoral college and all the other complications, but it makes our example simpler. Then assume that based on averaging a bunch of polls we estimate that Obama is likely to get about 50.5% of the vote.</p>
<p>Now, we want to know what is the “percent chance” Obama will win, taking into account what we know. So let’s run a bunch of “simulated elections” where on average Obama gets 50.5% of the vote, but there is variability because we don’t have the exact number. Since we have a bunch of polls and we averaged them, we can get an estimate for how variable the 50.5% number is. The usual measure of variance is the <a href="http://en.wikipedia.org/wiki/Standard_deviation" target="_blank">standard deviation</a>. Say we get a standard deviation of 1% for our estimate. That would be a pretty accurate number, but not totally unreasonable given the amount of polling data out there. </p>
<p>We can run 1,000 simulated elections like this in <a href="http://www.r-project.org/" target="_blank">R</a>* (a free software programming language, if you don’t know R, may I suggest Roger’s <a href="https://www.coursera.org/course/compdata" target="_blank">Computing for Data Analysis</a> class?). <a href="https://raw.github.com/gist/3979974/21e3aea5aad79f68c03bbc519c216ed35b2ecd8b/gistfile1.r" target="_blank">Here</a> is the code to do that. The last line of code calculates the percent of times, in our 1,000 simulated elections, that Obama wins. This is the number that Nate would report on his site. When I run the code, I get an Obama win 68% of the time (Obama gets greater than 50% of the vote). But if you run it again that number will vary a little, since we simulated elections. </p>
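<p>In case you don’t want to click through to the gist, here is a minimal sketch of the kind of simulation described above (a sketch of the idea, not necessarily identical to the linked code): draw 1,000 simulated vote shares centered at 50.5% with a standard deviation of 1%, then count how often Obama clears 50%.</p>
<pre># Minimal version of the simulated-elections idea described above
# (a sketch of the idea, not necessarily the code in the linked gist)
set.seed(1)
nsim       <- 1000
obama_mean <- 0.505   # estimated vote share from averaging polls
obama_sd   <- 0.01    # standard deviation of that estimate

sim_votes <- rnorm(nsim, mean = obama_mean, sd = obama_sd)
mean(sim_votes > 0.5)   # fraction of simulated elections Obama wins, roughly 0.68
hist(sim_votes)         # the spread of simulated vote shares
</pre>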
<p>The interesting thing is that even though we only estimate that Obama leads by about 0.5%, he wins 68% of the simulated elections. The reason is that we are pretty confident in that number, with our standard deviation being so low (1%). But that doesn’t mean that Obama will win 68% of the vote in any of the elections! In fact, here is a histogram of the percent of the vote that Obama wins: </p>
<p><img height="300" src="http://biostat.jhsph.edu/~jleek/obama.png" width="300" /></p>
<p>He never gets more than 54% or so and never less than 47% or so. So it is always a reasonably close election. Silver’s calculations are obviously more complicated, but the basic idea of simulating elections is the same. </p>
<p>Now, this might seem like a goofy way to come up with a “percent chance” with simulated elections and all. But it turns out it is actually a pretty important thing to know and relevant to those of us on the East Coast right now. It turns out weather forecasts (and projected hurricane paths) are based on the same <a href="http://en.wikipedia.org/wiki/Numerical_weather_prediction" target="_blank">sort of thing</a> - simulated versions of the weather are run and the “percent chance of rain” is the fraction of times it rains in a particular place. </p>
<p>So Romney may still win and Obama may lose - and Silver may still get a lot of it right. But regardless, the approach taken by Silver is not based on politics, it is based on statistics. Hopefully we can move away from politicizing statistical illiteracy and toward evaluating the models for the real, underlying assumptions they make. </p>
<p><em>* In this case, we could calculate the percent of times Obama would win with a formula (called an analytical calculation) since we have simplified so much. In Nate’s case it is much more complicated, so you have to simulate. </em></p>
Computing for Data Analysis (Simply Statistics Edition)
2012-10-29T14:00:26+00:00
http://simplystats.github.io/2012/10/29/computing-for-data-analysis-simply-statistics-edition
<p>As the entire East Coast gets soaked by Hurricane Sandy, I can’t help but think that this is the perfect time to…take a course online! Well, as long as you have electricity, that is. I live in a heavily tree-lined area and so it’s only a matter of time before the lights cut out on me (I’d better type quickly!). </p>
<p>I just finished teaching my course Computing for Data Analysis through Coursera. This was my first experience teaching a course online and definitely my first experience teaching a course to > 50,000 people. There were definitely some bumps along the road, but the students who participated were fantastic at helping me smooth the way. In particular, the interaction on the discussion forums was very helpful. I couldn’t have done it without the students’ help. So, if you took my course over the past 4 weeks, thanks for participating!</p>
<p>Here are a couple quick stats on the course participation (as of today) for the curious:</p>
<ul>
<li><span>50,899: Number of students enrolled</span></li>
<li><span>27,900: Number of users watching lecture videos</span></li>
<li><span>459,927: Total number of streaming views (over 4 weeks)</span></li>
<li><span>414,359: Total number of video downloads (not all courses allow this)</span></li>
<li><span>14,375: Number of users submitting the weekly quizzes (graded)</span></li>
<li><span>6,420: Number of users submitting the bi-weekly R programming assignments (graded)</span></li>
<li><span>6393+3291: Total number of posts+comments to the discussion forum</span></li>
<li><span>314,302: Total number of views in the discussion forum</span></li>
</ul>
<p>I’ve received a number of emails from people who signed up in the middle of the course or after the course finished. Given that it was a 4-week course, signing up in the middle of the course meant you missed quite a bit of material. I will eventually be closing down the Coursera version of the course—at this point it’s not clear when it will be offered again on that platform but I would like to do so—and so access to the course material will be restricted. However, I’d like to make that material more widely available even if it isn’t in the Coursera format.</p>
<p>So I’m announcing today that next month I’ll be offering the <strong>Simply Statistics Edition of Computing for Data Analysis</strong>. This will be a slightly simplified version of the course that was offered on Coursera since I don’t have access to all of the cool platform features that they offer. But all of the original content will be available, including some new material that I hope to add over the coming weeks.</p>
<p>If you are interested in taking this course or know of someone who is, please check back here soon for more details on how to sign up and get the course information.</p>
Sunday Data/Statistics Link Roundup (10/28/12)
2012-10-28T13:39:00+00:00
http://simplystats.github.io/2012/10/28/sunday-data-statistics-link-roundup-10-28-12
<ol>
<li>An important article about <a href="http://www.scientificamerican.com/article.cfm?id=antiscience-beliefs-jeopardize-us-democracy" target="_blank">anti-science sentiment</a> in the U.S. (via David S.). The politicization of scientific issues such as global warming, evolution, and healthcare (think vaccination) makes the U.S. less competitive. I think the lack of statistical literacy and training in the U.S. is one of the sources of the problem. People use/skew/mangle statistical analyses and experiments to support their view and without a statistically well trained public, it all looks “reasonable and scientific”. But when science seems to contradict itself, it loses credibility. Another reason to <a href="http://www.ted.com/talks/arthur_benjamin_s_formula_for_changing_math_education.html" target="_blank">teach statistics to everyone in high school.</a></li>
<li>Scientific American was loaded this last week; here is another <a href="http://blogs.scientificamerican.com/guest-blog/2012/10/18/nihmim12-the-spreading-shadow-of-cancer-angst-3-things-you-need-to-know-to-meet-it-rationally/" target="_blank">article on cancer screening</a>. The article covers several of the issues that make it hard to convince people that screening isn’t always good. Confusion about the predictive value of a positive test is a huge one in cancer screening right now. The author of the piece is someone worth following on Twitter: <a href="https://twitter.com/hildabast" target="_blank">@hildabast</a>.</li>
<li><span><a href="http://www.githubarchive.org/" target="_blank">A bunch of data</a> on the use of Github. Always cool to see new data sets that are worth playing with for student projects, etc. (via Hilary M.). </span></li>
<li><span>A really interesting post over at Stats Chat about <a href="http://www.statschat.org.nz/2012/10/28/why-we-study-the-obvious/?utm_source=feedburner&utm_medium=twitter&utm_campaign=Feed%3A+StatsChat+%28Stats+Chat%29" target="_blank">why we study seemingly obvious things</a>. Hint, the reason is that “obvious” things aren’t always true. </span></li>
<li><span>A <a href="http://www.npr.org/blogs/alltechconsidered/2012/10/23/163434283/how-much-is-a-like-on-facebook-worth-for-a-companys-share-price" target="_blank">story on “sentiment analysis” </a>by NPR that suggests that most of the variation in a stock’s price during the day can be explained by the number of Facebook likes. Obviously, this is an interesting correlation. Probably more interesting for hedge funders/stockpickers if the correlation was with the change in stock price the next day. (via Dan S.)</span></li>
<li><span>Yihui Xie visited our department this week. We had a great time chatting with him about knitr/animation and all the cool work he is doing. <a href="http://yihui.name/slides/2012-reproduce-homework.html" target="_blank">Here are his slides</a> from the talk he gave. Particularly check out his idea for a fast journal. You are seeing the future of publishing. </span></li>
<li><strong>Bonus Link:</strong> <a href="http://techcrunch.com/2012/10/27/big-data-right-now-five-trendy-open-source-technologies/" target="_blank">R is a trendy open source technology for big data</a>. </li>
</ol>
I love those first discussions about a new research project
2012-10-26T19:37:18+00:00
http://simplystats.github.io/2012/10/26/i-love-those-first-discussions-about-a-new-research
<p>That has got to be the best reason to <a href="http://simplystatistics.org/post/28335633068/why-im-staying-in-academia" target="_blank">stay in academia.</a> The meetings where it is just you and a bunch of really smart people thinking about tackling a new project, coming up with cool ideas, and dreaming about how you can really change the way the world works are so much fun.</p>
<p>There is no part of a research job that is better as far as I’m concerned. It is always downhill after that, you start <a href="http://simplystatistics.org/post/31281359451/the-pebbles-of-academia" target="_blank">running into pebbles</a>, your code doesn’t work, or <a href="http://simplystatistics.org/post/26977029850/my-worst-recent-experience-with-peer-review" target="_blank">your paper gets rejected</a>. But that first blissful planning meeting always seems so full of potential.</p>
<p>Just had a great one like that and am full of optimism. </p>
Let's make the Joint Statistical Meetings modular
2012-10-23T13:08:51+00:00
http://simplystats.github.io/2012/10/23/lets-make-the-joint-statistical-mettings-modular
<p>Have you ever met a statistician who enjoys the Joint Statistical Meetings (JSM)? I haven’t. With the exception of the one night we catch up with old friends, there are few positive things we can say about JSM. They are way too big, and the two talks I want to see are always somehow scheduled at the same time as mine.</p>
<p>But statisticians actually like conferences. Most of us have a favorite statistics conference, or session within a bigger subject matter conference, that we look forward to going to. But it’s never JSM. So why can’t JSM just be a collection of these conferences? For sure we should drop the current format and come up with something new.</p>
<p>I propose that we start by giving each ASA section two non-concurrent sessions scheduled on two consecutive days (perhaps more slots for bigger sections) and let them do whatever they want. Hopefully they would turn this into the conference that they want to go to. It’s our meeting, we pay for it, so let’s turn it into something we like.</p>
A statistical project bleg (urgent-ish)
2012-10-22T14:29:01+00:00
http://simplystats.github.io/2012/10/22/a-statistical-project-bleg-urgent-ish
<p>We all know that politicians can play it a little fast and loose with the truth. This is particularly true in debates, where politicians have to think on their feet and respond to questions from the audience or from each other. </p>
<p>Usually, we find out about how truthful politicians are in the “post-game show”. The discussion of the veracity of the claims is usually based on independent fact checkers such as <a href="http://www.politifact.com/" target="_blank">PolitiFact</a>. Some of these fact checkers (Politifact in particular) <a href="https://twitter.com/politifact" target="_blank">live-tweet</a> their reports on many of the issues discussed during the debate. This is possible, since both candidates have a pretty fixed set of talking points they use, so it is near real time fact-checking. </p>
<p>What would be awesome is if someone could write an R script that would scrape the live data off Politifact’s Twitter account and create a truthfulness meter that looks something like CNN’s <a href="http://politicalticker.blogs.cnn.com/2012/10/16/13-reasons-to-watch-the-debate-on-cnns-platforms-and-nowhere-else/comment-page-1/" target="_blank">instant reaction graph</a> (see #7) for independent voters. The line would show the moving average of how honest each politician was being. How cool would it be to show the two candidates and how truthful they are being? If you did this, tell me it wouldn’t be a feature one of the major news networks would pick up…</p>
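<p>The scraping piece would depend on whatever Twitter access you have, but the “truthfulness meter” itself is simple once the fact-check verdicts are in hand. Here is a rough sketch using made-up example data (the data frame below is hypothetical; plug in the real scraped verdicts): score each ruling as true or false and plot a running average for each candidate.</p>
<pre># Sketch of a running "truthfulness meter" using made-up example data;
# the scraping step (pulling PolitiFact's live tweets) is left out.
set.seed(2012)
n <- 60
checks <- data.frame(
  minute    = sort(sample(1:90, n, replace = TRUE)),  # time into the debate
  candidate = sample(c("A", "B"), n, replace = TRUE),
  score     = sample(0:1, n, replace = TRUE)          # 1 = rated true, 0 = rated false
)

running_avg <- function(x) cumsum(x) / seq_along(x)

plot(0, 0, type = "n", xlim = c(0, 90), ylim = c(0, 1),
     xlab = "Minutes into the debate", ylab = "Running share of true statements")
for (cand in c("A", "B")) {
  d <- checks[checks$candidate == cand, ]
  lines(d$minute, running_avg(d$score), type = "s",
        col = ifelse(cand == "A", "blue", "red"))
}
</pre>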
Sunday Data/Statistics Link Roundup (10/21/12)
2012-10-21T13:30:56+00:00
http://simplystats.github.io/2012/10/21/sunday-data-statistics-link-roundup-10-21-12
<ol>
<li>This is <a href="http://researchinprogress.tumblr.com/" target="_blank">scientific variant</a> on the <a href="http://whatshouldwecallme.tumblr.com/" target="_blank">#whatshouldwecallme</a> meme isn’t exclusive to statistics, but it is hilarious. </li>
<li>This is a <a href="http://www.wired.com/opinion/2012/10/passwords-and-hackers-security-and-practicality/" target="_blank">really interesting post</a> that is a follow-up to the XKCD <a href="http://xkcd.com/936/" target="_blank">password security comic</a>. The thing I find most interesting about this is that researchers realized the key problem with passwords was that we were looking at them purely from a computer science perspective. But _people _use passwords, so we need a person-focused approach to maximize security. This is a very similar idea to our previous post on <a href="http://simplystatistics.org/post/31460959187/an-experimental-foundation-for-statistics" target="_blank">an experimental foundation for statistics</a>. Looks like Di Cook and others are already <a href="http://www.r-bloggers.com/carl-morris-symposium-on-large-scale-data-inference-23/" target="_blank">way ahead of us</a> on this idea. It would be interesting to redefine optimality incorporating the knowledge that most of the time it is a person running the statistics. </li>
<li>This is another fascinating article about the <a href="http://www.insidehighered.com/news/2012/10/15/stanford-professor-goes-public-attacks-over-her-math-education-research" target="_blank">math education wars</a>. It starts off as the typical dueling schools issue in academia - two different schools of thought who routinely go after the other side. But the interesting thing here is it sounds like one side of this math debate is being waged by a person collecting data and the other is being waged by a side that isn’t. It is interesting how many areas are being touched by data - including what kind of math we should teach. </li>
<li>I’m going to visit Minnesota in a couple of weeks. I was so pumped up to be <a href="https://twitter.com/leekgroup/status/259597859639410688" target="_blank">an outlaw</a>. <a href="http://simplystatistics.org/post/33973041284/minnesota-clarifies-free-online-ed-is-ok" target="_blank">Looks like</a> I’m just a regular law abiding citizen though….</li>
<li>Here are outstanding summaries of what went on at the Carl Morris Big Data conference this last week. Tons of interesting stuff there. Parts <a href="http://civilstat.com/?p=745" target="_blank">one</a>, <a href="http://civilstat.com/?p=758" target="_blank">two</a>, and <a href="http://civilstat.com/?p=760" target="_blank">three</a>. </li>
</ol>
Minnesota clarifies: Free online ed is OK
2012-10-20T18:50:31+00:00
http://simplystats.github.io/2012/10/20/minnesota-clarifies-free-online-ed-is-ok
<p><a href="http://www.washingtonpost.com/blogs/college-inc/post/minnesota-clarifies-free-online-ed-is-ok/2012/10/19/456a0a3e-1a37-11e2-aa6f-3b636fecb829_blog.html">Minnesota clarifies: Free online ed is OK</a></p>
Free Online Education Is Now Illegal in Minnesota
2012-10-20T13:16:32+00:00
http://simplystats.github.io/2012/10/20/free-online-education-is-now-illegal-in-minnesota
<p><a href="http://www.slate.com/blogs/future_tense/2012/10/18/minnesota_bans_coursera_state_takes_bold_stand_against_free_education.html">Free Online Education Is Now Illegal in Minnesota</a></p>
Simply Statistics Podcast #4: Interview with Rebecca Nugent
2012-10-19T13:33:39+00:00
http://simplystats.github.io/2012/10/19/interview-with-rebecca-nugent-of-carnegie-mellon
<p>Interview with Rebecca Nugent of Carnegie Mellon University.</p>
<p>In this episode Jeff and I talk with <a href="http://www.stat.cmu.edu/~rnugent/" target="_blank">Rebecca Nugent</a>, Associate Teaching Professor in the Department of Statistics at Carnegie Mellon University. We talk with her about her work with the Census and the growing interest in statistics among undergraduates.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Statistics isn't math but statistics can produce math
2012-10-18T20:29:52+00:00
http://simplystats.github.io/2012/10/18/statistics-isnt-math-but-statistics-can-produce-math
<p><a href="http://thatsmathematics.com/mathgen/" target="_blank">Mathgen</a>, the web site that can produce randomly generated mathematics papers has apparently <a href="http://thatsmathematics.com/blog/archives/102" target="_blank">gotten a paper accepted in a peer-reviewed journal</a> (although perhaps not the most reputable one). I am not at all surprised this happened, but it’s fun to read both the paper and the reviewer’s comments. </p>
<p>(Thanks to Kasper H. for the pointer.)</p>
Comparing Hospitals
2012-10-17T13:05:38+00:00
http://simplystats.github.io/2012/10/17/there-was-a-story-a-few-weeks-ago-on-npr-about-how
<p>There was a story a few weeks ago on NPR about how <a href="http://www.npr.org/templates/story/story.php?storyId=162035632" target="_blank">Medicare will begin fining hospitals</a> that have 30-day readmission rates that are too high. This process was introduced in the Affordable Care Act and</p>
<blockquote>
<p><span>Under the health care law, the penalties gradually will rise until 3 percent of Medicare payments to hospitals are at risk. Medicare is considering holding hospitals accountable on four more measures: joint replacements, stenting, heart bypass and treatment of stroke.</span></p>
</blockquote>
<p>Those of you taking my <a href="https://class.coursera.org/compdata-2012-001/class/index" target="_blank">computing course on Coursera</a> have already seen some of the data used for this assessment, which can be obtained at the <a href="http://hospitalcompare.hhs.gov" target="_blank">hospital compare web site</a>. It’s also worth noting that underlying the analysis for this was a detailed and thoughtful report published by the Committee of Presidents of Statistical Societies (COPSS), which was chaired by <a href="http://www.biostat.jhsph.edu/~tlouis/" target="_blank">Tom Louis</a>, a Professor here at Johns Hopkins.</p>
<p>The report, titled <a href="http://www.cms.gov/Medicare/Quality-Initiatives-Patient-Assessment-Instruments/HospitalQualityInits/Downloads/Statistical-Issues-in-Assessing-Hospital-Performance.pdf" target="_blank">“Statistical Issues in Assessing Hospital Performance”</a>, covers much of the current methodology and its criticisms and has a number of recommendations. Of particular concern for hospitals is the issue of shrinkage targets—in a hierarchical model the estimate of the readmission rate for a hospital is shrunken towards the mean. But which mean? Hospitals with higher risk or sicker patient populations will look quite a bit worse than hospitals sitting amongst a healthy population if they are both compared to the same mean.</p>
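<p>To see why the shrinkage target matters, here is a toy empirical Bayes calculation with simulated hospitals (made-up data, not the actual CMS methodology): every hospital’s raw readmission rate gets pulled toward the overall mean, and the smaller the hospital, the harder the pull. If that overall mean is the wrong target for hospitals serving sicker populations, their shrunken estimates will look worse than they should, which is exactly the concern raised in the report.</p>
<pre># Toy shrinkage example with simulated hospitals (not the CMS methodology)
set.seed(30)
n_hosp    <- 50
volume    <- sample(50:2000, n_hosp)            # admissions per hospital
true_rate <- rbeta(n_hosp, 20, 80)              # true 30-day readmission rates (~20%)
readmits  <- rbinom(n_hosp, volume, true_rate)
raw_rate  <- readmits / volume

overall <- mean(raw_rate)                       # the shrinkage target
within  <- raw_rate * (1 - raw_rate) / volume   # sampling variance per hospital
between <- max(var(raw_rate) - mean(within), 0) # crude between-hospital variance
weight  <- between / (between + within)         # small hospitals get small weights
shrunk  <- weight * raw_rate + (1 - weight) * overall

plot(raw_rate, shrunk, cex = sqrt(volume) / 20,
     xlab = "Raw readmission rate", ylab = "Shrunken estimate")
abline(0, 1, lty = 2)          # no shrinkage
abline(h = overall, lty = 3)   # complete shrinkage to the overall mean
</pre>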
<p>The report is worth reading even if you’re just interested in the practical application of hierarchical models. And the web site is fun to explore if you want to know how the hospitals around you are faring.</p>
Johns Hopkins Grad Anthony Damico Shows How To
2012-10-17T12:43:00+00:00
http://simplystats.github.io/2012/10/17/johns-hopkins-grad-anthony-damico-shows-how-to
<p>[vimeo 43305640 w=500 h=281]</p>
<p>Johns Hopkins grad Anthony Damico shows how to make coffee with R (except not really). The BLS mug is what makes it for me.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://player.vimeo.com/">http://player.vimeo.com/</a>)
</div>
A statistician loves the #insurancepoll...now how do we analyze it?
2012-10-15T15:12:56+00:00
http://simplystats.github.io/2012/10/15/a-statistician-loves-the-insurancepoll-now-how-do-we
<p>Amanda Palmer broke Twitter yesterday <a href="http://www.amandapalmer.net/blog/20121015/" target="_blank">with her insurance poll</a>. She started off just talking about how hard it is for musicians who rarely have health insurance, but then wandered into polling territory. She sent out a request for people to respond with the following information:</p>
<blockquote>
<p><em>quick twitter poll. 1) COUNTRY?! 2) profession? 3) insured? 4) if not, why not, if so, at what cost per month (or covered by job)?</em></p>
</blockquote>
<p>This quick little poll struck a nerve with people and her Twitter feed blew up. Long story short, tons of interesting information was gathered from folks. This information is frequently kept semi-obscured, particularly the cost of health insurance for folks in different places. This isn’t the sort of info that insurance companies necessarily publicize widely, and it isn’t the sort of thing people talk about.</p>
<p>The results were really fascinating and it’s worth reading the above blog post or checking out the hashtag: <a href="https://twitter.com/search/realtime?q=%23insurancepoll&src=typd" target="_blank">#insurancepoll</a>. But the most fascinating thing for me as a statistician was thinking about how to analyze these data. <a href="https://twitter.com/aubreyjaubrey" target="_blank">@aubreyjaubrey</a> is apparently collecting the data someplace; hopefully she’ll make it public.</p>
<p>At least two key issues spring to mind:</p>
<ol>
<li>This is a massive convenience sample. </li>
<li>It is being collected through a social network</li>
</ol>
<p>Although I’m sure there are more. If a student is looking for an amazingly interesting/rich data set and some seriously hard stats problems, they should get in touch with Aubrey and see if they can make something of it!</p>
Sunday Data/Statistics Link Roundup (10/14/12)
2012-10-14T13:35:00+00:00
http://simplystats.github.io/2012/10/14/sunday-data-statistics-link-roundup-10-14-12
<ol>
<li>A fascinating article about <a href="http://www.theawl.com/2012/10/the-sugar-wars" target="_blank">the debate</a> on whether to regulate sugary beverages. One of the protagonists is David Allison, a statistical geneticist, among other things. It is fascinating to see the interplay of statistical analysis and public policy. Yet another example of how statistics/data will drive some of the most important policy decisions going forward. </li>
<li>A related article is <a href="http://bigthink.com/risk-reason-and-reality/how-the-media-put-us-at-risk-with-the-way-they-report-about-risk?page=2" target="_blank">this one</a> on the way risk is reported in the media. It is becoming <a href="http://simplystatistics.org/post/15774146480/in-the-era-of-data-what-is-a-fact" target="_blank">more and more clear</a> that to be an educated member of society now means that you absolutely have to have a basic understanding of the concepts of statistics. Both leaders and the general public are responsible for the danger that lies in misinterpreting/misleading with risk. </li>
<li>A <a href="http://biostat.jhsph.edu/~jleek/release.pdf" target="_blank">press release</a> from the Census Bureau about how the choice of college major can have a major impact on career earnings. More data breaking the results down by employment characteristics and major are <a href="http://biostat.jhsph.edu/~jleek/employment.pdf" target="_blank">here</a> and <a href="http://biostat.jhsph.edu/~jleek/degree.pdf" target="_blank">here.</a> These data update some of the data we have talked about before in calculating <a href="http://simplystatistics.org/post/12599452125/expected-salary-by-major" target="_blank">expected salaries by major</a>. (via Scott Z.)</li>
<li>An interesting article about <a href="http://www.npr.org/2012/10/08/162397787/predicting-the-future-fantasy-or-a-good-algorithm" target="_blank">Recorded Future</a> that describes how they are using social media data etc. to try to predict events that will happen. I think this isn’t an entirely crazy idea, but the thing that always strikes me about these sorts of projects is how hard it is to measure success. It is highly unlikely you will ever exactly predict a future event, so how do you define how close you were? For instance, if you predicted an uprising in Egypt, but missed by a month, is that a good or a bad prediction? </li>
<li>Seriously guys, this is getting embarrassing. An article appears in the New England Journal <a href="http://www.nejm.org/doi/full/10.1056/NEJMon1211064" target="_blank">“finding” an association</a> between chocolate consumption and Nobel prize winners. This is, of course, a horrible statistical analysis, and unless it was published as a joke, it was irresponsible of the NEJM to publish it. I’ll bet any student in Stat 101 could find the huge flaws with this analysis (a toy simulation of how such spurious country-level correlations arise follows this list). If the editors of the major scientific journals want to continue publishing statistical papers, they should get serious about statistical editing.</li>
</ol>
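<p>On the chocolate/Nobel item: here is a tiny simulation (all numbers invented) of how a strong country-level correlation can appear when two quantities merely track a common confounder such as national wealth, with no direct effect at all.</p>
<pre>
## Illustrative only: neither variable affects the other, both follow "wealth"
set.seed(2012)
n_countries <- 30
wealth    <- rnorm(n_countries)                      # latent confounder
chocolate <- 2 + 0.8 * wealth + rnorm(n_countries, sd = 0.5)
nobels    <- 1 + 0.8 * wealth + rnorm(n_countries, sd = 0.5)

cor(chocolate, nobels)                     # impressively large
summary(lm(nobels ~ chocolate))            # "significant" association
summary(lm(nobels ~ chocolate + wealth))   # adjusting for wealth removes most of it
</pre>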
What's wrong with the predicting h-index paper.
2012-10-10T13:47:04+00:00
http://simplystats.github.io/2012/10/10/whats-wrong-with-the-predicting-h-index-paper
<p><em>Editor’s Note: I recently posted about <a href="http://simplystatistics.org/post/31990205510/prediction-contest" target="_blank">a paper</a> in Nature that purported to predict the H-index. The authors contacted me to get my criticisms, then responded to those criticisms. They have requested the opportunity to respond publicly, and I think it is a totally reasonable request. Until there is a better comment generating mechanism at the journal level, this seems like as good a forum as any to discuss statistical papers. I will post an extended version of my criticisms here and give them the opportunity to respond publicly in the comments. </em></p>
<p><span>The paper in question is clearly a clever idea and the kind that would get people fired up. Quantifying researchers’ output is all the rage, and being able to predict this quantity in the future would obviously make a lot of evaluators happy. I think it was, in that sense, a really good idea to chase down these data, since it was clear that if they found anything at all it would be very widely covered in the scientific/popular press. </span></p>
<div>
My original post was inspired by my frustration with Nature, which has a history of publishing somewhat suspect statistical papers, <a href="http://www.ncbi.nlm.nih.gov/pubmed/15457248" target="_blank">such as this one</a>. I posted the prediction contest after reading another paper I consider flawed, both for statistical and for scientific reasons. I originally commented on the statistics in my post. The authors, being good sports, contacted me for my criticisms. I sent them the following criticisms, which I think are sufficiently major that a statistical referee or statistical journal would likely have rejected the paper:
</div>
<div>
<ol>
<li>
Lack of reproducibility. The code/data are not made available either through Nature or on your website. This is a critical component of papers based on computation and has led to serious problems before. It is also easily addressable.
</li>
<li>
No training/test set. You mention cross-validation (and maybe the R^2 is the R^2 using the held-out scientists?), but if you use the cross-validation step both to optimize the model parameters and to estimate the error rate, you could see some major overfitting (a small simulation illustrating this appears after the list).
</li>
<li>
The R^2 values are pretty low. An R^2 of 0.67 is obviously superior to the h-index alone, but (a) there is concern about overfitting, and (b) even without overfitting, an R^2 that low could lead to substantial variance in the predictions.
</li>
<li>
The prediction error is not reported in the paper (or in the online calculator). How far off could you be at 5 years, at 10? Would the results still be impressive with those errors reported?
</li>
<li>
You use model selection and show only the optimal model (as described in the last paragraph of the supplementary material), but no indication of the potential difficulties introduced by this model selection is given in the text.
</li>
<li>
You use a single regression model without any time variation in the coefficients and without any potential non-linearity. Clearly when predicting several years into the future there will be variation with time and non-linearity. There is also likely heavy variance in the types of individuals/career trajectories, and outliers may be important, etc.
</li>
</ol>
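<p>To make criticisms 2 and 5 concrete, here is a small simulation (not the authors’ data, just an illustration) of why screening predictors on the full data set and then reporting the fit can look impressive even when there is no signal at all, while an honest split, with the screening repeated inside the training data, tells a different story.</p>
<pre>
## 200 pure-noise predictors, outcome unrelated to any of them
set.seed(1)
n <- 100; p <- 200
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

## Optimistic analysis: screen on ALL the data, then report in-sample R^2
top <- order(abs(cor(x, y)), decreasing = TRUE)[1:10]
summary(lm(y ~ x[, top]))$r.squared          # misleadingly large

## Honest analysis: split first, screen and fit on the training half only
train  <- sample(n, n / 2)
top_tr <- order(abs(cor(x[train, ], y[train])), decreasing = TRUE)[1:10]
fit    <- lm(y[train] ~ x[train, top_tr])
pred   <- cbind(1, x[-train, top_tr]) %*% coef(fit)
1 - sum((y[-train] - pred)^2) / sum((y[-train] - mean(y[-train]))^2)  # about zero or negative
</pre>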
<div>
They carefully responded to these criticisms and hopefully they will post their responses in the comments. My impression based on their responses is that the statistics were not as flawed as I originally thought, but that the data aren’t sufficient to form a useful prediction.
</div>
<div>
However, I think the much bigger flaw is the basic scientific premise. The h-index has been identified as having major flaws, biases (including gender bias), and to be a generally poor summary of a scientist’s contribution. See <a href="http://blogs.nature.com/nautilus/2007/10/the_hindex_has_its_flaws.html" target="_blank">here</a>, the list of criticisms <a href="http://en.wikipedia.org/wiki/H-index" target="_blank">here</a>, and the discussion <a href="http://scholarlykitchen.sspnet.org/2008/06/30/the-h-index-an-objective-mismeasure/" target="_blank">here</a> for starters. The authors of the Nature paper propose a highly inaccurate predictor of this deeply flawed index. While that alone is sufficient to call into question the results in the paper, the authors also make bold claims about their prediction tool:
</div>
<blockquote>
<p>Our formula is particularly useful for funding agencies, peer reviewers and hiring committees who have to deal with vast numbers of applications and can give each only a cursory examination. Statistical techniques have the advantage of returning results instantaneously and in an unbiased way.</p>
</blockquote>
<div>
Suggesting that this type of prediction should be used to make important decisions on hiring, promotion, and funding is highly flawed scientifically. Coupled with the online calculator the authors handily provide (which reports no measure of uncertainty), it makes it all too easy for people to miss the real value of scientific publications: the science contained in them.
</div>
</div>
Why we should continue publishing peer-reviewed papers
2012-10-08T14:29:00+00:00
http://simplystats.github.io/2012/10/08/why-we-should-continue-publishing-peer-reviewed-papers
<p><span>Several bloggers are calling for the end of peer-reviewed journals as we know them. <a href="http://simplystatistics.org/post/32871552079/should-we-stop-publishing-peer-reviewed-papers" target="_blank">Jeff suggests</a> that we replace them with a system in which everyone posts their papers on their blog, pubmed aggregates the feeds, and peer review happens post publication via, for example, counting up like and dislike votes. In my view, many of these critiques conflate problems from different aspects of the process. Here I try to break down the current system into its key components and defend the one aspect I think we should preserve (at least for now): pre-publication peer review.</span></p>
<p>To avoid confusion let me start by enumerating some of the components for which I agree change is needed.</p>
<ul>
<li>There is no need to produce paper copies of our publications. Indulging our preference for reading hard copies does not justify keeping the price of disseminating our work twice as high as it should be. </li>
<li>There is no reason to be sending the same manuscript (adapted to fit guidelines) to several journals, until it gets accepted. This frustrating and time-consuming process adds very little value (we previously described <a href="http://simplystatistics.org/post/14218411483/dear-editors-associate-editors-referees-please-reject" target="_blank">Nick Jewell’s solution</a>). </li>
<li>There is no reason for publications to be static. As Jeff and many others suggest, readers should be able to comment and rate systematically on published papers and authors should be able to update them.</li>
</ul>
<p>However, all these changes can be implemented without doing away with pre-publication peer-review.</p>
<p><span>A key reason American and British universities consistently <a href="http://www.arwu.org/ARWU2010.jsp" target="_blank">lead the pack</a> of research institutions is their strict adherence to a peer-review system that minimizes cronyism and tolerance for mediocrity. At the center of this system is a promotion process in which outside experts evaluate a candidate’s ability to produce high-quality ideas. Peer-reviewed journal articles are the backbone of this evaluation. </span>When reviewing a candidate I familiarize myself with his or her work by reading 5-10 key papers. It’s true that I read these disregarding the journal, and blog posts would serve the same purpose. But I also use the publication section of the CV, not only because reading all the papers is logistically impossible, but because those papers have already been evaluated by ~ three referees plus an editor and so provide an assessment independent of my own. I also use the journal’s prestige because, although it is a highly noisy measure of quality, the law of large numbers starts kicking in after 10 papers or so. </p>
<p>So are three reviewers better than the entire Internet? Can a reddit-like system provide just as much signal as the current peer-reviewed journals? You can think of the current system as a cooperative in which we all agree to read each other’s papers thoroughly (we evaluate 2-3 for each one we publish), with journals taking care of the logistics. The result of a review is an estimate of quality ranging from highest (Nature, Science) to 0 (not published). This estimate is certainly noisy given the bias and quality variance of referees and editors. But across all the papers on a CV the variance is reduced and the bias averages out (I note that we complain vociferously when the bias keeps us from publishing in a good journal, but we rarely say a word when the bias helps us get into a better journal than we deserved). Jeff’s argument is that post-publication review will result in many more evaluations and therefore a stronger signal-to-noise ratio. I need to see evidence of this before being convinced. In the current system, ~ three referees commit to thoroughly reviewing the paper. If they do a sloppy job, they embarrass themselves with an editor or an AE (not a good thing). With a post-publication review system nobody is forced to review. I fear most papers will go without comments or votes, including really good ones. My feeling is that marketing and PR will matter even more than they do now, and that’s not a good thing.</p>
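<p>To make the “averages out across a CV” point concrete, here is a quick simulation (all numbers hypothetical): treat each paper’s review outcome as the candidate’s true quality plus reviewer/editor noise. Any single outcome is a poor measure, but the average over roughly ten papers tracks true quality much more closely.</p>
<pre>
## Hypothetical: true quality plus per-paper review noise (3 noisy referees)
set.seed(10)
n_candidates <- 1000
n_papers     <- 10
true_quality <- rnorm(n_candidates)
review   <- function(q) q + rnorm(length(q), sd = 2) / sqrt(3)
outcomes <- replicate(n_papers, review(true_quality))   # candidates x papers

cor(true_quality, outcomes[, 1])        # signal from a single paper: modest
cor(true_quality, rowMeans(outcomes))   # averaged over the CV: much stronger
</pre>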
<p>Dissemination of ideas is another important role of the literature. Jeff describes a couple of anecdotes to argue that dissemination can be sped up by simply posting papers on your blog.</p>
<blockquote>
<p><span>I posted a quick idea called </span><a href="http://simplystatistics.org/post/18132467723/prediction-the-lasso-vs-just-using-the-top-10" target="_blank">the Leekasso</a><span>, which led to some discussion on the blog, has nearly 2,000 page views</span></p>
</blockquote>
<p>But the typical junior investigator does not have a blog with hundreds of followers. Will their papers ever be read if even more papers are added to the already bloated scientific literature? The current peer-review system provides an important filter. There is an inherent trade-off between speed of dissemination and quality, and it’s not clear to me that we should swing the balance all the way over to the speed side. There are <a href="http://simplystatistics.org/post/14218411483/dear-editors-associate-editors-referees-please-reject" target="_blank">other ways</a> to speed up dissemination that we should try first. Also, there is nothing stopping us from posting our papers online before publication and promoting them via Twitter or an aggregator. In fact, as pointed out by <a href="http://twitter.com/janhjensen" target="_blank">Jan Jensen</a> on Jeff’s post, arXiv papers are indexed on Google Scholar within a week, and Google Scholar also keeps track of arXiv citations.</p>
<p><span>The Internet is bringing many changes that will improve our peer-review system. But the current pre-publication peer-review process does a decent job of </span></p>
<ol>
<li>providing signal for the promotion process and</li>
<li>reducing noise in the literature to make dissemination possible. </li>
</ol>
<p>Any alternative system should be evaluated carefully before we dismantle one that has helped keep our universities at the top of the world rankings.</p>
Sunday Data/Statistics Link Roundup (10/7/12)
2012-10-07T13:53:30+00:00
http://simplystats.github.io/2012/10/07/sunday-data-statistics-link-roundup-10-7-12
<ol>
<li>Jack Welch <a href="https://twitter.com/jack_welch/status/254198154260525057" target="_blank">got a little conspiracy-theory crazy</a> with the job numbers. Thomas Lumley over at StatsChat makes <a href="http://www.statschat.org.nz/2012/10/06/statistics-conspiracy-theories/" target="_blank">a pretty good case</a> for debunking the theory. I think the real take home message of Thomas’ post and one worth celebrating/highlighting is that agencies that produce the jobs report do so based on a fixed and well-defined study design. Careful efforts by government statistics agencies make it hard to fudge/change the numbers. This is an underrated and hugely important component of a well-run democracy. </li>
<li>On a similar note, Dan Gardner at the Ottawa Citizen points out that <a href="http://www.ottawacitizen.com/opinion/columnists/Evidence+comes+shapes+sizes/7351609/story.html" target="_blank">evidence-based policy</a> making is actually not enough. He points out the critical problem with evidence: <a href="http://simplystatistics.org/post/15774146480/in-the-era-of-data-what-is-a-fact" target="_blank">in the era of data, what is a fact</a>? “Facts” can come from flawed or biased studies just as easily as from strong studies. He suggests that a truly “evidence based” administration would invest more money in research/statistical agencies. I think this is a great idea. </li>
<li><a href="http://online.wsj.com/article/SB10000872396390444223104578038362388183092.html" target="_blank">An interesting article</a> by Ben Bernanke suggesting that an optimal approach (in baseball and in policy) is one based on statistical analysis, coupled with careful thinking about long-term versus short-term strategy. I think one of his arguments about allowing players to play even when they are struggling short term is actually a case for letting the weak law of large numbers play out. If you have a player with skill/talent, they will eventually converge to their “true” numbers. It’s also good for their confidence….(via David Santiago).</li>
<li>Here is another interesting <a href="http://svpow.com/2012/10/03/dear-royal-society-please-stop-lying-to-us-about-publication-times/?utm_source=social_media&utm_medium=hootsuite&utm_campaign=standard" target="_blank">peer review dust-up</a>. It explains why some journals “reject” papers when they really mean major/minor revision to be able to push down their review times. I think this highlights yet another problem with pre-publication peer review. The evidence is mounting, but I hear we may get a defense of the current system from one of the editors of this blog, so stay tuned…</li>
<li>Several people (Sherri R., Alex N., many folks on Twitter) have pointed me to <a href="http://www.pnas.org/content/early/2012/09/14/1211286109" target="_blank">this article</a> about gender bias in science. I initially was a bit skeptical of such a strong effect across a broad range of demographic variables. After reading the supplemental material carefully, it is clear I was wrong. It is a very well designed/executed study and suggests that there is still a strong gender bias in science, across ages and disciplines. Interestingly both men and women were biased against the female candidates. This is clearly a non-trivial problem to solve and needs a lot more work, maybe one step is to<a href="http://simplystatistics.org/post/25849875593/a-specific-suggestion-to-help-recruit-retain-women" target="_blank"> make recruitment packages more flexible</a> (see the comment by Allison T. especially). </li>
</ol>
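<p>A toy version of the convergence argument in item 3 (made-up numbers): a hitter whose true average is .300 can look awful over a short stretch, but the running average settles down near the true value with enough at bats.</p>
<pre>
## Simulate 600 at bats for a "true" .300 hitter
set.seed(42)
true_avg <- 0.300
hits <- rbinom(600, size = 1, prob = true_avg)
running_avg <- cumsum(hits) / seq_along(hits)

running_avg[c(20, 50, 100, 600)]   # noisy early, close to .300 by the end
plot(running_avg, type = "l", xlab = "At bat", ylab = "Running average")
abline(h = true_avg, lty = 2)
</pre>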
Fraud in the Scientific Literature
2012-10-07T00:28:00+00:00
http://simplystats.github.io/2012/10/07/fraud-in-the-scientific-literature
<p><a href="http://www.nytimes.com/2012/10/06/opinion/fraud-in-the-scientific-literature.html?smid=tu-share">Fraud in the Scientific Literature</a></p>
Not just one statistics interview...John McGready is the Jon Stewart of statistics
2012-10-05T14:20:54+00:00
http://simplystats.github.io/2012/10/05/not-just-one-statistics-interview-john-mcgready-is
<p><em>Editor’s Note: We usually reserve Friday’s for posting <a href="http://simplystatistics.org/interviews" target="_blank">Simply Statistics Interviews</a>. This week, we have a special guest post by <a href="http://www.biostat.jhsph.edu/~jmcgread/" target="_blank">John McGready</a>, a colleague of ours who has been doing interviews with many of us in the department and has some cool ideas about connecting students in their first statistics class with cutting edge researchers wrestling with many of the same concepts applied to modern problems. I’ll let him explain…</em></p>
<p><span>I teach a two quarter course in introductory biostatistics to master’s students in public health at Johns Hopkins. The majority of the class is composed of MPH students, but there are also students doing professional master’s degrees in environmental health, molecular biology, health policy and mental health. Despite the short length of the course, it covers the “greatest hits” of biostatistics, encompassing everything from exploratory data analysis up through and including multivariable proportional hazards regression. The course focus is more conceptual and less mathematical/computing centric than the other two introductory sequences taught at Hopkins: as such it has earned the unfortunate nickname “baby biostatistics” from some at the School. This, in my opinion, is an unfortunate misnomer: statistical reasoning is often the most difficult part of learning statistics. We spend a lot of time focusing on the current literature, and making sense of or critiquing research by considering not only the statistical methods employed and the numerical findings, but also the study design and the logic of the substantive conclusions made by the study authors.</span></p>
<p><span>Via the course, I always hope to demonstrate the importance of biostatistics as a core driver of public health discovery, the importance of statistical reasoning in the research process, and how the fundamentals that are covered are the framework for more advanced methodology. At some point it dawned on me that the best approach for doing this was to have my colleagues speak to my students about these ideas. Because of timing and scheduling constraints, this proved difficult to do in a live setting. However, in June of 2012 a video recording studio opened here at the Hopkins Bloomberg School. At this point, I knew that I had to get my colleagues on video so that I could share their wealth of experiences and expertise with my students, and give the students multiple perspectives. To my delight my colleagues are very amenable to being interviewed and have been very generous with their time. I plan to continue doing the interviews so long as my colleagues are willing and the studio is available.</span></p>
<p><span>I have created a <a href="http://www.youtube.com/user/StatReasoningJHSPH?feature=mhee" target="_blank">YouTube channel</a> for these interviews. At some point in the future, I plan to invite the biostatistics community as a whole to participate. This will include interviews with visitors to my department, and submissions by biostatistics faculty and students from other schools. (I realize I am very lucky to have these facilities and video expertise at Hopkins: but many folks are tech savvy enough to film their own videos on their cameras, phones etc… in fact you have seen such creativity by the editors of this here blog). </span><span>With the help of some colleagues I plan on making a complementary website that will allow for easy submission of videos for posting, so stay tuned!</span></p>
Statistics project ideas for students (part 2)
2012-10-04T17:56:43+00:00
http://simplystats.github.io/2012/10/04/statistics-project-ideas-for-students-part-2
<p>A little while ago I wrote a post on <a href="http://simplystatistics.org/post/18493330661/statistics-project-ideas-for-students" target="_blank">statistics projects ideas for students</a>. In honor of the first Simply Statistics Coursera offering, <a href="https://www.coursera.org/course/compdata" target="_blank">Computing for Data Analysis</a>, here is a new list of student projects for folks excited about trying out those new R programming skills. Again we have rated each project with my best guess difficulty and effort required. Happy computing!</p>
<p><strong>Data Analysis</strong></p>
<ol>
<li>Use city data to predict areas with the highest risk for parking tickets. <a href="https://data.baltimorecity.gov/Financial/Parking-Citations/n4ma-fj3m" target="_blank">Here </a>is the data for Baltimore. (<em>Difficulty: Moderate, Effort: Low/Moderate</em>)</li>
<li>If you have a Fitbit with a premium account, download the data into a spreadsheet (or <a href="https://www.dropbox.com/sh/gauvv2fzf623ia5/SOgEROC7jO/Current" target="_blank">get Chris’s data</a>). Then build various predictors using the data: (a) are you running or walking, (b) are you having a good day or not, (c) did you eat well that day or not, (d) etc. For special bonus points <a href="http://myyearofdata.wordpress.com/" target="_blank">create a blog</a> with your new discoveries and share your data with the world. (<em>Difficulty: Depends on what you are trying to predict, Effort: Moderate with Fitbit/Jawbone/etc.</em>)</li>
</ol>
<p><strong>Data Collection/Synthesis</strong></p>
<ol>
<li>Make a list of skills associated with each component of the <a href="http://www.drewconway.com/zia/?p=2378" target="_blank">Data Scientist Venn Diagram</a>. Then update the <a href="https://github.com/jtleek/datascientist/blob/master/dataScientist.R" target="_blank">data scientist R function</a> described in <a href="http://simplystatistics.org/post/11271228367/datascientist" target="_blank">this post</a> to ask a set of questions, then plot people on the diagram. Hint: check out the readline() function; a bare-bones sketch of the quiz idea follows this list. (<em>Difficulty: Moderately low, Effort: Moderate</em>)</li>
<li><a href="http://www.healthdata.gov/" target="_blank">HealthData.gov</a> has a ton of data from various sources about public health, medicines, etc. Some of this data is super useful for projects/analysis and some of it is just data dumps. Create an R package that downloads data from healthdata.gov and gives some measures of how useful/interesting it is for projects (e.g. number of samples in the study, number of variables measured, is it summary data or raw data, etc.) (<em>Difficulty: Moderately hard, Effort: High</em>)</li>
<li>Build an up-to-date aggregator of R tutorials/how-to videos, summarize/rate each one so that people know which ones to look at for learning which tasks. (<em>Difficulty: Low, Effort: Medium)</em></li>
</ol>
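<p>A bare-bones sketch of the quiz idea from the first item (the questions and scoring here are placeholders, not the real skill list): ask a couple of questions with readline() and score the answers along two of the Venn diagram axes.</p>
<pre>
## Placeholder questions; a real version would draw on the full skill list
data_scientist_quiz <- function() {
  questions <- c(hacking    = "Have you written a script to clean a messy data file? (y/n) ",
                 statistics = "Can you explain what a confidence interval is? (y/n) ")
  answers <- vapply(questions, readline, character(1))
  scores  <- as.numeric(tolower(substr(answers, 1, 1)) == "y")
  names(scores) <- names(questions)
  scores
}

## Run interactively, then place the person on the two axes:
## s <- data_scientist_quiz()
## plot(s["hacking"], s["statistics"], xlim = c(0, 1), ylim = c(0, 1),
##      xlab = "Hacking", ylab = "Statistics", pch = 19)
</pre>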
<p><strong>Tool building</strong></p>
<ol>
<li>Build software that creates a 2-d author list and averages people’s 2-d author lists. (<em>Difficulty: Medium, Effort: Low)</em></li>
<li>Create an R package that interacts with and downloads data from <a href="http://simplystatistics.org/post/15182715327/list-of-cities-states-with-open-data-help-me-find" target="_blank">government websites</a> and processes it in a way that is easy to analyze. <em>(Difficulty: Medium, Effort: High)</em></li>
</ol>
Should we stop publishing peer-reviewed papers?
2012-10-04T13:54:00+00:00
http://simplystats.github.io/2012/10/04/should-we-stop-publishing-peer-reviewed-papers
<p>Nate Silver, everyone’s favorite statistician made good, just gave an interview where he said he thinks <a href="http://techcrunch.com/2012/10/01/nyt-election-oracle-fivethirtyeight-on-why-blogging-is-great-for-science/" target="_blank">many journal articles should be blog posts</a>. I have been thinking about this same issue for a while now, and I’m not the only one. <a href="http://dienekes.blogspot.com/2012/06/how-journals-once-facilitated-and-now.html" target="_blank">This is</a> a really interesting post suggesting that although scientific journals once facilitated dissemination of ideas, they now impede the flow of information and make it more expensive. </p>
<p>Two recent examples really drove this message home for me. In the first example, I posted a quick idea called <a href="http://simplystatistics.org/post/18132467723/prediction-the-lasso-vs-just-using-the-top-10" target="_blank">the Leekasso</a>, which led to some discussion on the blog, has nearly 2,000 page views (a pretty decent number of downloads for a paper), and has been implemented in software by someone <a href="http://cran.r-project.org/web/packages/SuperLearner/NEWS" target="_blank">other than me</a>. If this were one of my papers, it would be one of my more reasonably high-impact papers. The <a href="http://simplystatistics.org/post/31990205510/prediction-contest" target="_blank">second example</a> is a post I put up about a recent Nature paper. The authors (who are really good sports) ended up writing to me to get my critiques. I wrote them out, and they responded. All of this happened after peer review and informally. All of the interaction also occurred over email, where no one but us can see it. </p>
<p>It wouldn’t take much to go to a blog-based system. What if everyone who was publishing scientific results started a blog (free), then there was a site, run by pubmed, that aggregated the feeds (this would be super cheap to set up/maintain). Then people could comment on blog posts and vote for ones they liked if they had verified accounts. We skipped peer review in favor of just producing results and discussing them. The results that were interesting were shared by email, Twitter, etc. </p>
<p>Why would we do this? Well, the current journal system: (1) significantly slows the publication of research, (2) costs thousands of dollars, and (3) costs significant labor that is not scientifically productive (such as resubmitting). </p>
<p>Almost every paper I have published has been rejected at least once, including the “good” ones. This means that the results of even the good papers have been delayed by months, or in the case of one paper, <a href="http://simplystatistics.org/post/26977029850/my-worst-recent-experience-with-peer-review" target="_blank">by a full year and a half</a>. Any time I publish open access, <a href="http://simplystatistics.org/post/12286350206/free-access-publishing-is-awesome-but-expensive-how" target="_blank">it costs me</a> at minimum around $1,500. I like open access because I think science funded by taxpayers should be free. But it is a significant drain on the resources of my group. Finally, most of the resubmission process is wasted labor. It generally doesn’t produce new science or improve the quality of the science. The effort goes into reformatting and re-entering information about authors.</p>
<p>So why not have everyone just post results on their blog/<a href="http://figshare.com/" target="_blank">figshare</a>? They’d have a DOI that could be cited. We’d reduce everyone’s labor in reviewing/editing/resubmitting by an order of magnitude or two and save the taxpayers a few thousand dollars each year in publication fees. We’d also increase the speed of updating/reacting to new ideas by an order of magnitude. </p>
<p>I still maintain we should be evaluating people based on reading their actual work, not highly subjective and error-prone indices. But if the powers that be insisted, it would be easy to evaluate people based on likes/downloads/citations/discussion of papers rather than on the basis of journal titles and the arbitrary decisions of editors. </p>
<p>So should we stop publishing peer review papers?</p>
<p><em>Edit: Titus points to a couple of good posts with interesting ideas about the peer review process that are worth reading, <a href="http://ivory.idyll.org/blog/blog-practicing-open-science.html" target="_blank">here </a>and <a href="http://www.genomesunzipped.org/2012/08/the-first-steps-towards-a-modern-system-of-scientific-publication.php" target="_blank">here</a>. Also, Joe Pickrell et al. are already on this for population genetics, having set up the aggregator <a href="http://haldanessieve.org/" target="_blank">Haldane’s Sieve</a>. It would be nice if this expanded to other areas (and people got credit for the papers published there, like they do for papers in journals). </em></p>
This is an awesome paper all students in statistics should read
2012-10-03T15:24:46+00:00
http://simplystats.github.io/2012/10/03/this-is-an-awesome-paper-all-students-in-statistics
<p><a href="http://arxiv.org/abs/1210.0530" target="_blank">The paper</a> is a review of how to do software development for academics. I saw it via C. Titus Brown (who <a href="http://simplystatistics.org/post/29620679415/interview-with-c-titus-brown-computational-biologist" target="_blank">we have interviewed</a>), he is also a co-author. How to write software (particularly for other people) is something that is under emphasized in many curricula. But it turns out this is also one of the more important components of disseminating your work in modern applied statistics. My only wish is that there was an accompanying website with resources/links for people to chase down. </p>
2-D author lists
2012-10-03T14:00:14+00:00
http://simplystats.github.io/2012/10/03/2-d-author-lists
<p>The order of authors on scientific papers matters a lot. The best places to be on a paper <a href="http://simplystatistics.org/post/11314293165/authorship-conventions" target="_blank">vary by field</a>. But typically the first and the corresponding (usually last) authors are the prime real estate. When people are evaluated on the job market, for promotion, or to get grants, the number of first/corresponding author papers can be the difference between success and failure. </p>
<p>At the same time, many journals list “authors contributions” at the end of the manuscript, but this is rarely prominently displayed. The result is that regardless of the true distribution of credit in a manuscript, the first and last authors get the bulk of the benefit. </p>
<p>This system is antiquated for a few reasons:</p>
<ol>
<li>In multidisciplinary science, there are often equal and very different contributions from people working in different disciplines. </li>
<li>Science is increasingly collaborative, even within a single discipline, and papers are rarely the effort of just two people anymore. </li>
</ol>
<p>How about a 2-D, resortable author list? Each author is a column and each kind of contribution is a row. The contributions are: (1) conceived the idea, (2) collected the data, (3) did the computational analysis, (4) wrote the paper (you could imagine adding others). Each category then gets a quantitative number: the fraction of the effort devoted to that component of the paper. Then you build an interactive graphic that allows you to sort the authors by each of the categories, so you can see who did what on the paper. </p>
<p>To get an overall impression of which activities an author performs, you could average their contributions across papers in each of the categories, creating a “heatmap of contributions.” Anyone want to build this? </p>
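<p>Here is a minimal mock-up in R (the numbers are invented): rows are contribution types, columns are authors, entries are fractions of the effort, and a basic heatmap already gives the sortable overview described above.</p>
<pre>
## Invented contribution fractions for a three-author paper
contrib <- rbind(
  conceived     = c(0.6, 0.1, 0.3),
  collected     = c(0.1, 0.8, 0.1),
  computational = c(0.2, 0.1, 0.7),
  wrote         = c(0.5, 0.2, 0.3)
)
colnames(contrib) <- c("Author A", "Author B", "Author C")

## Re-sort authors by any one kind of contribution...
contrib[, order(contrib["computational", ], decreasing = TRUE)]

## ...or view the whole thing as a heatmap of contributions
heatmap(contrib, Rowv = NA, Colv = NA, scale = "none",
        col = gray(seq(1, 0, length.out = 20)), margins = c(8, 10))
</pre>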
The more statistics blogs the better
2012-10-02T14:56:26+00:00
http://simplystats.github.io/2012/10/02/the-more-statistics-blogs-the-better
<p>Good friend and friend of the blog Rob Gould has started a statistics blog called <a href="http://citizen-statistician.org" target="_blank">Citizen Statistician</a>. What is a citizen statistician, you ask?</p>
<blockquote>
<p><span>What is a citizen statistician? A citizen statistician participates in formal and informal data gathering. A citizen statistician is aware of his or her data trail, and is aware of the harm that could be done to themselves or to others through data aggregation. Citizen statisticians recognize opportunities to improve their personal or professional lives through analyzing data, and know how to share data with others. They know that almost any question about the world can be answered using data, how to find relevant data sources on the web, and critically evaluate these sources. A citizen statistician also knows how to bring that data into an analysis package and how to start their investigation.</span></p>
</blockquote>
<p>What’s even better than having more statistics blogs? Having more statisticians.</p>
John McGready interviews Jeff Leek
2012-09-28T15:19:40+00:00
http://simplystats.github.io/2012/09/28/john-mcgready-interviews-the-esteemed-jeff-leek
<p>John McGready interviews the esteemed Jeff Leek. This is bearded Jeff, in case you were wondering.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
John McGready interviews Roger Peng
2012-09-27T17:55:46+00:00
http://simplystats.github.io/2012/09/27/john-mcgready-a-fellow-faculty-member-in-the
<p>John McGready, a fellow faculty member in the Department of Biostatistics, interviewed me for his Statistical Reasoning class. In the interview we talk about some statistical contributions to air pollution epidemiology.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Simply statistics logo contest
2012-09-27T13:52:02+00:00
http://simplystats.github.io/2012/09/27/simply-statistics-logo-contest
<p>Simply Statistics has had the same logo since Roger grabbed the first picture in his results folder that looked “statistics related”. While we do have some affection for the logo, we would like something a little more catchy.</p>
<p>So we would like to announce a contest to create our <strong>new</strong> logo. Here are the rules:</p>
<ol>
<li>All submissions must be sent to Roger with the email subject, “Simply Statistics Logo Contest”</li>
<li>The logo must be generated with reproducible R code. Here is <a href="http://blog.revolutionanalytics.com/2011/12/using-r-to-create-a-logo-simple.html" target="_blank">an example</a> of how Simple created their logo for inspiration, and a minimal proof-of-concept sketch follows these rules. </li>
<li>Ideally the logo will convey the “spirit of the blog”: we like data, we like keeping it simple, we like solving real problems, and we like to stir it up.</li>
</ol>
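<p>This is not an entry, just a proof of concept (everything here is made up) that a reproducible logo can be a handful of seeded lines of R:</p>
<pre>
## A seeded scatter fading through a color ramp, with the blog name on top
set.seed(2011)
n <- 400
x <- rnorm(n); y <- rnorm(n)
cols <- colorRampPalette(c("steelblue", "orange"))(n)

png("simplystats-logo.png", width = 600, height = 300)
par(mar = c(0, 0, 0, 0))
plot(x, y, col = cols[rank(x)], pch = 19, cex = 0.8,
     axes = FALSE, xlab = "", ylab = "")
text(0, 0, "Simply Statistics", cex = 2.5, font = 2)
dev.off()
</pre>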
<p>Have at it!</p>
Pro-tips for graduate students (Part 3)
2012-09-26T14:00:11+00:00
http://simplystats.github.io/2012/09/26/pro-tips-for-graduate-students-part-3
<p>This is part of the ongoing series of pro tips for graduate students, check out parts <a href="http://simplystatistics.org/post/25368234643/pro-tips-for-grad-students-in-statistics-biostatistics" target="_blank">one</a> and <a href="http://simplystatistics.org/post/25507941642/pro-tips-for-grad-students-in-statistics-biostatistics" target="_blank">two</a> for the original installments. </p>
<ol>
<li>Learn how to write papers in a very clear and simple style. Whenever you can, write in plain English, skip jargon as much as possible, and make the approach you are using understandable and clear. This can (sometimes) make it harder to get your papers into journals. But simple, clear language leads to much higher use/citation of your work. Examples of great writers are: <a href="http://www.genomine.org/" target="_blank">John Storey</a>, <a href="http://www-stat.stanford.edu/~tibs/" target="_blank">Rob Tibshirani</a>, <a href="http://en.wikipedia.org/wiki/Robert_May,_Baron_May_of_Oxford" target="_blank">Robert May</a>, <a href="http://www.ped.fas.harvard.edu/people/faculty/" target="_blank">Martin Nowak</a>, etc.</li>
<li>It is a great idea to start reviewing papers as a graduate student. Don’t do too many (you should focus on your research), but doing a few will give you a lot of insight into how the peer-review system works. Ask your advisor/research mentor; they will generally have a review or two they could use help with. When doing reviews, keep in mind that a person spent a large chunk of time working on the paper you are reviewing. Also, don’t forget to use Google.</li>
<li>Try to write your first paper as soon as you possibly can and try to do as much of it on your own as you can. You don’t have to wait for faculty to give you ideas, read papers and <a href="http://gking.harvard.edu/files/paperspub.pdf" target="_blank">think of what you think would have been better</a> (you might check with a faculty member first so you don’t repeat what’s done, etc.). You will learn more writing your first paper than in almost any/all classes.</li>
</ol>
NBC Unpacks Trove of Data From Olympics
2012-09-26T03:17:30+00:00
http://simplystats.github.io/2012/09/26/nbc-unpacks-trove-of-data-from-olympics
<p><a href="http://www.nytimes.com/2012/09/26/business/media/nbc-unpacks-trove-of-viewer-data-from-london-olympics.html?smid=tu-share">NBC Unpacks Trove of Data From Olympics</a></p>
Computing for Data Analysis starts today!
2012-09-24T14:54:52+00:00
http://simplystats.github.io/2012/09/24/computing-for-data-analysis-starts-today
<p>Today marks the first Simply Statistics course offering happening over at Coursera. I’ll be teaching <a href="https://class.coursera.org/compdata-2012-001/" target="_blank">Computing for Data Analysis</a> over the next four weeks. There’s still plenty of time to register if you are interested in learning about R, and the activity on the discussion forums is already quite vibrant.</p>
<p>Also starting today is my colleague Brian Caffo’s <a href="https://class.coursera.org/biostats-2012-001/class/index" target="_blank">Mathematical Biostatistics Bootcamp</a>, which I hear also has had an energetic start. With any luck, the students in that class may get to see Brian dressed in military fatigues.</p>
<p>This is my first MOOC so I have no idea how it will go. But I’m excited to start and am looking forward to the next four weeks.</p>
Sunday Data/Statistics Link Roundup (9/23/12)
2012-09-23T13:57:30+00:00
http://simplystats.github.io/2012/09/23/sunday-data-statistics-link-roundup-9-23-12
<ol>
<li>Harvard Business School is getting in on the fun, calling the data scientist the <a href="http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1" target="_blank">sexy profession for the 21st century</a>. Although I am a little worried that by the time it gets into a Harvard Business Review article, the hype may be outstripping the real promise of the discipline. Still, good news for statisticians! (via Rafa via Francesca D.’s Facebook feed). </li>
<li>The counterpoint is <a href="http://www.forbes.com/sites/gilpress/2012/08/31/the-data-scientist-will-be-replaced-by-tools/" target="_blank">this article</a> which suggests that data scientists might be able to be replaced by tools/software. I think this is also a bit too much hype for my tastes. Certain things will definitely be automated and we may even end up with <a href="http://simplystatistics.org/post/30315018436/a-deterministic-statistical-machine" target="_blank">a deterministic statistical machine</a> or two. But there will continually be new problems to solve which require the expertise of people with data analysis skills and good intuition (link via Samara K.)</li>
<li>A bunch of websites are popping up where you can sign up and have people take your online courses for you. I’m not going to give them the benefit of a link, but they aren’t hard to find these days. The thing I don’t understand is, if it is a free online course, why have someone else take it for you? It’s free, it’s in your spare time, and the bar for passing is pretty low (links via Sherri R. redacted)….</li>
<li>Maybe mostly useful for me, but for other people with Tumblr blogs, here is a way to <a href="http://is-r.tumblr.com/post/31792415116/embedding-latex-in-tumblr" target="_blank">insert Latex</a>.</li>
<li>Brian Caffo <a href="http://samsiatrtp.wordpress.com/2012/09/20/brian-caffo-shares-his-impression-of-the-massive-datasets-opening-workshop/" target="_blank">shares his impressions</a> of the SAMSI massive data workshop. He raises an important issue which definitely deserves more discussion: should we be focusing on specific or general problems? Worth a read. </li>
<li>For the people into self-tracking, Chris V. <a href="http://myyearofdata.wordpress.com/2012/09/18/bootytracking/" target="_blank">points to an app</a> created by the University of Indiana that lets people track their sexual activity. The most interesting thing about that app is how it highlights a key and I suppose often overlooked issue with analyzing self-tracking data. Despite the size of these data sets, they are still definitely biased samples. It’s only a brave few who will tell the University of Indiana all about their sex life. </li>
</ol>
Prediction contest
2012-09-21T17:00:00+00:00
http://simplystats.github.io/2012/09/21/prediction-contest
<p>I have been seeing <a href="http://www.nature.com/nature/journal/v489/n7415/full/489201a.html" target="_blank">this paper</a> all over Twitter/the blogosphere. It’s a sexy idea: can you predict how “high-impact” a scientist will be in the future? It is also a pretty flawed data analysis…so this week’s prediction contest is to identify why the statistics in this paper are so flawed. On my first read I noticed about 5 major flaws.</p>
<p><em>Editor’s note: I posted the criticisms and the authors respond here: <a href="http://disq.us/8bmrhl" target="_blank">http://disq.us/8bmrhl</a></em></p>
In data science - it's the problem, stupid!
2012-09-20T17:53:32+00:00
http://simplystats.github.io/2012/09/20/in-data-science-its-the-problem-stupid
<p>I just saw <a href="http://www.nature.com/nbt/journal/v30/n8/full/nbt.2301.html" target="_blank">this article </a>talking about how in the biotech world, you can’t get caught chasing the latest technology. You have to start with a problem you are solving for people and then work your way back. This reminds me a lot of <a href="http://simplystatistics.org/post/26068033590/motivating-statistical-projects" target="_blank">Type B problems</a> in data science/statistics. <a href="http://www.wired.com/science/discoveries/magazine/16-07/pb_theory" target="_blank">We have a pile of data, so we don’t need to have a problem to solve, it will come to us later</a>. I think the answer to the question, “Did you start with a scientific/business problem that needs solving regardless of whether the data was in place?” will end up being a near perfect classifier for separating the “Big Data” projects that are just hype from the ones that will pan out long term. </p>
Every professor is a startup
2012-09-20T13:55:58+00:00
http://simplystats.github.io/2012/09/20/every-professor-is-a-startup
<p>There has been a lot of discussion lately about whether to be in academia or industry. Some of it I think is a bit <a href="http://cs.unm.edu/~terran/academic_blog/?p=113" target="_blank">unfair to academia</a>. Then I saw <a href="http://www.quora.com/Data-Science/Why-is-Hilary-Mason-a-prominent-figure-within-the-big-data-community-What-are-her-notable-accomplishments" target="_blank">this post </a>on Quora asking what Hilary Mason’s contributions were to machine learning, like she hadn’t done anything. It struck me as a bit of academia hating on industry*. I don’t see why one has to be better/worse than the other, as Roger <a href="http://simplystatistics.org/post/28335633068/why-im-staying-in-academia" target="_blank">points out</a>, there is no perfect job and it just depends on what you want to do. </p>
<p>One thing that I think gets lost in all of this is the similarity between being an academic researcher and running a small startup. To be a successful professor at a research institution, you have to create a product (papers/software), network (sit on editorial boards/review panels), raise funds (by writing grants), advertise (by giving talks/presentations), identify and recruit talent (students and postdocs), manage people and personalities (students, postdocs, collaborators) and scale (you start as just yourself, and eventually grow to a <a href="http://rafalab.jhsph.edu/" target="_blank">group</a> with <a href="http://www.smart-stats.org/" target="_blank">lots of people</a>). </p>
<p>The goals are somewhat different. In a startup company, your goal is ultimately to become a profitable business. In academia, the goal is to create an enterprise that produces scientific knowledge. But in either enterprise it takes a huge amount of entrepreneurial spirit, passion, and hustle. It just depends on how you are spending your hustle. </p>
<p><em>*Sidenote: One reason I think she is so famous is that she helps people, even people that can’t necessarily do anything for her. One time I wrote her out of the blue to see if we could get some Bitly data to analyze for a class. She cheerfully helped us get it, even though the immediate payout for her was not obvious. But I tell you what, when people ask me about her, I’ll tell them she is awesome. </em></p>
Online Mentors to Guide Women Into the Sciences
2012-09-18T20:24:22+00:00
http://simplystats.github.io/2012/09/18/online-mentors-to-guide-women-into-the-sciences
<p><a href="http://www.nytimes.com/2012/09/17/education/online-mentoring-program-to-encourage-women-in-sciences.html?smid=tu-share">Online Mentors to Guide Women Into the Sciences</a></p>
Chinese Company to Acquire DNA Sequencing Firm
2012-09-18T10:30:45+00:00
http://simplystats.github.io/2012/09/18/chinese-company-to-acquire-dna-sequencing-firm
<p><a href="http://dealbook.nytimes.com/2012/09/17/chinese-company-to-acquire-dna-sequencing-firm/?smid=tu-share">Chinese Company to Acquire DNA Sequencing Firm</a></p>
Sunday Data/Statistics Link Roundup (9/16/12)
2012-09-16T13:59:53+00:00
http://simplystats.github.io/2012/09/16/sunday-data-statistics-link-roundup-9-16-12
<ol>
<li>There has been a lot of talk about the Michael Lewis (of Moneyball fame) <a href="http://www.vanityfair.com/politics/2012/10/michael-lewis-profile-barack-obama" target="_blank">profile of Obama</a> in Vanity Fair. One interesting quote I think deserves a lot more discussion is: “<span>On top of all of this, after you have made your decision, you need to feign total certainty about it. People being led do not want to think probabilistically.” This is a key issue that is only going to get worse going forward. All of public policy is probabilistic - we are even moving to <a href="http://www.guardian.co.uk/politics/2012/jun/20/test-policies-randomised-controlled-trials" target="_blank">clinical trials to evaluate public policy</a>. </span></li>
<li>It’s sort of amazing to me that I hadn’t heard about this before, but a <a href="http://www.forbes.com/sites/stevensalzberg/2012/08/25/uc-davis-threatens-professor-for-writing-about-psa-testing/" target="_blank">UC Davis professor was threatened</a> for discussing the reasons PSA screening may be overused. This same issue keeps coming up over and over - <a href="http://www.statschat.org.nz/2012/09/14/screening-isnt-treatment-or-prevention/" target="_blank">screening healthy populations for rare diseases is often not effective</a> (you need a ridiculously high specificity or a treatment with almost no side effects; the arithmetic is sketched after this list). What we need is John McGready to do a claymation public service video or something explaining the reasons screening might not be a good idea to the general public. </li>
<li>A bleg - I sometimes have a good week finding links myself, and there are a few folks who regularly send links (Andrew J., Alex N., etc.). I’d love it if people would send me cool links when they see them with the email title “Sunday Links” - I’m sure there is more cool stuff out there. </li>
<li>The ISCB has <a href="http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Computational_Biology/ISCB_competition_announcement" target="_blank">a competition</a> to improve the coverage of computational biology on Wikipedia. Someone should write a <a href="http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.0030161" target="_blank">surrogate variable analysis</a> or <a href="http://www.ncbi.nlm.nih.gov/pubmed/12925520" target="_blank">robust multiarray average article.</a> </li>
<li>I had not heard of the ASA’s <a href="http://stattrak.amstat.org/" target="_blank">Stattrak</a> until this week; it looks like there are some useful resources there for early-career statisticians. With the onset of fall, we are closing in on a new recruiting season. If you are a postdoc/student on the job market and you haven’t read Rafa’s post on <a href="http://simplystatistics.org/post/14454324191/on-hard-and-soft-money" target="_blank">soft vs. hard money</a>, now is the time to start brushing up! Stay tuned for more job market posts this fall from Simply Statistics. </li>
</ol>
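<p>For item 2, the arithmetic is worth seeing once (the numbers below are hypothetical): even a test with 90% sensitivity and 95% specificity has a dismal positive predictive value when the condition is rare, because false positives from the huge healthy population swamp the true positives.</p>
<pre>
## Positive predictive value from prevalence, sensitivity, specificity
ppv <- function(prevalence, sensitivity, specificity) {
  true_pos  <- prevalence * sensitivity
  false_pos <- (1 - prevalence) * (1 - specificity)
  true_pos / (true_pos + false_pos)
}

ppv(0.001, 0.90, 0.95)    # ~ 2% of positives are real
ppv(0.001, 0.90, 0.999)   # even at 99.9% specificity, only ~ 47%
</pre>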
Statistical analysis suggests the Washington Nationals were wrong to shut down Stephen Strasburg
2012-09-15T23:36:42+00:00
http://simplystats.github.io/2012/09/15/statistical-analysis-suggests-the-washington-nationals
<p><a href="http://www.grantland.com/story/_/id/8369941/history-shows-washington-nationals-shut-stephen-strasburg-too-soon">Statistical analysis suggests the Washington Nationals were wrong to shut down Stephen Strasburg</a></p>
The statistical method made me lie
2012-09-14T13:58:31+00:00
http://simplystats.github.io/2012/09/14/the-statistical-method-made-me-lie
<p>There’s a hubbub brewing over a recent study published in the Annals of Internal Medicine that compares organic food (as in ‘USDA Organic’) to non-organic food. The study, titled “Are Organic Foods Safer or Healthier Than Conventional Alternatives?: A Systematic Review,” is a meta-analysis of about 200 previous studies. Their conclusion, which I have cut-and-pasted below, is</p>
<blockquote>
<p><span>The published literature lacks strong evidence that organic foods are significantly more nutritious than conventional foods. Consumption of organic foods may reduce exposure to pesticide residues and antibiotic-resistant bacteria.</span></p>
</blockquote>
<p><span>When I first heard about this study on the radio, I thought the conclusion seemed kind of obvious. It’s not clear to me why, for example, an organic carrot would have more calcium than a non-organic carrot. At least, I couldn’t explain the mechanism by which this would happen. However, I would expect that an organic carrot would have less pesticide residue than a non-organic carrot. If not, then the certification isn’t really achieving its </span>goals. Lo and behold, that’s more or less what the study found. I don’t see the controversy.</p>
<p>But there’s a <a href="http://www.change.org/petitions/retract-the-flawed-organic-study-linked-to-big-tobacco-and-pro-gmo-corps" target="_blank">petition over at change.org</a> titled “Retract the Flawed ‘Organic Study’ Linked to Big Tobacco and Pro-GMO Corps”. It’s quite an interesting read. First, it’s worth noting that the study itself does not list any funding sources. Given that the authors are from Stanford, one could conclude that Stanford funded the study. The petition claims that Stanford has “deep financial ties to Cargill”, a large agribusiness company, but does not get into specifics.</p>
<p>More interesting is that the petition highlights the involvement in the study of Ingram Olkin, a renowned statistician at Stanford. The petition says</p>
<blockquote>
<p><span>The study was authored by the very many [sic] who invented a method of ‘lying with statistics’. Olkin </span><span>worked with Stanford</span><span> University to develop a “multivariate” statistical algorithm, which is essentially </span><strong>a way to lie with statistics</strong><strong>.</strong></p>
</blockquote>
<p>That’s right, the statistical method made them lie!</p>
<p>The petition is ridiculous. Interestingly, even as the petition claims conflict of interest on the part of the study authors, it seems one of the petition authors, Anthony Gucciardi, is “a<span> natural health advocate, and creator of the health news website NaturalSociety” according to his Twitter page. Go figure. </span>It worries me that people would claim the mere use of statistical methods is sufficient grounds for doubt. It also worries me that 3,386 people (as of this writing) would blindly agree.</p>
<p>By the way, can anyone propose an alternative to “multivariate statistics”? I need to stop all this lying….</p>
After Our Interview With Steven Salzberg Someone
2012-09-14T01:16:00+00:00
http://simplystats.github.io/2012/09/14/after-our-interview-with-steven-salzberg-someone
An experimental foundation for statistics
2012-09-13T13:55:16+00:00
http://simplystats.github.io/2012/09/13/an-experimental-foundation-for-statistics
<p>I recently had a conversation with Brian (<a href="http://simplystatistics.org/post/28840726358/in-which-brian-debates-abstraction-with-t-bone" target="_blank">of abstraction fame</a>) about the relationship between mathematics and statistics. Statistics, for historical reasons, has been treated as a mathematical sub-discipline (this is the <a href="http://simplystatistics.org/post/29899900125/nsf-recognizes-math-and-statistics-are-not-the-same" target="_blank">NSF’s view</a>).</p>
<p>One reason statistics is viewed as a sub-discipline of math is because the foundations of statistics are built on the basis of <a href="http://en.wikipedia.org/wiki/Deductive_reasoning" target="_blank">deductive reasoning</a>, where you start with a few general propositions or foundations that you assume to be true and then systematically prove more specific results. A similar approach is taken in most mathematical disciplines. </p>
<p>In contrast, scientific disciplines like biology are largely built on the basis of <a href="http://en.wikipedia.org/wiki/Inductive_reasoning" target="_blank">inductive reasoning</a> and the <a href="http://en.wikipedia.org/wiki/Scientific_method" target="_blank">scientific method</a>. Specific individual discoveries are described and used as a framework for building up more general theories and principles. </p>
<p>So the question Brian and I had was: what if you started over and built statistics from the ground up on the basis of inductive reasoning and experimentation? Instead of making mathematical assumptions and then proving statistical results, you would use experiments to identify core principles. This actually isn’t without precedent in the statistics community. Bill Cleveland and Robert McGill studied how people <a href="http://elibrary.unm.edu/courses/documents/ClevelandandMcGill1985-GraphicalPerceptionandGraphicalMethodsforAnalyzingScientificData.pdf" target="_blank">perceive graphical information</a> and produced some general recommendations about the use of area/linear contrasts, common axes, etc. There has also been a lot of work on experimental understanding of how humans <a href="http://www.sciencemag.org/content/333/6048/1393.short" target="_blank">understand uncertainty</a>. </p>
<p>So what if we put statistics on an experimental, rather than a mathematical, foundation? What if we performed experiments to see what kinds of regression models people were able to interpret most clearly, what the best ways were to evaluate confounding/outliers, or what measure of statistical significance people understood best? Basically, what if the “quality” of a statistical method did not rest on the mathematics behind the method, but on experimental results demonstrating how people actually used the method? So, instead of justifying lowess mathematically, we would justify it on the basis of its practical usefulness through specific, controlled experiments. Some of this is already happening when people do surveys of the most successful methods in Kaggle contests or with the <a href="http://www.fda.gov/ScienceResearch/BioinformaticsTools/MicroarrayQualityControlProject/default.htm" target="_blank">MAQC</a>.</p>
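<p>As a sketch of what such an experiment might look like in practice (everything below is hypothetical), you could randomize people to see the same data either as a plain scatterplot or with a lowess curve added, record whether they correctly describe the trend, and compare the groups:</p>
<pre>
## Hypothetical accuracy rates for the two displays; nothing here is real data
set.seed(7)
n_per_group <- 100
correct_plain  <- rbinom(n_per_group, 1, prob = 0.65)  # plain scatterplot
correct_lowess <- rbinom(n_per_group, 1, prob = 0.80)  # scatterplot + lowess

prop.test(x = c(sum(correct_lowess), sum(correct_plain)),
          n = c(n_per_group, n_per_group))
</pre>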
<p>I wonder what methods would survive the change in paradigm?</p>
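<p>To make this concrete, here is a minimal sketch (in Python, with entirely made-up numbers) of the kind of experiment I have in mind: randomize readers to one of two presentations of the same regression result, ask a comprehension question, and compare the rates of correct answers with a simple permutation test. The group names and response rates are hypothetical; the point is only that “which presentation works better” becomes an empirical question.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(42)

# Hypothetical outcomes (1 = answered a follow-up question correctly) for
# readers randomized to two presentations of the same regression result,
# e.g. a coefficient table vs. an effect plot. The rates are made up.
table_group = rng.binomial(1, 0.62, size=100)
plot_group = rng.binomial(1, 0.78, size=100)

observed_diff = plot_group.mean() - table_group.mean()

# Permutation test: shuffle the group labels and recompute the difference.
pooled = np.concatenate([table_group, plot_group])
n = len(table_group)
perm_diffs = []
for _ in range(10000):
    p = rng.permutation(pooled)
    perm_diffs.append(p[n:].mean() - p[:n].mean())
perm_diffs = np.array(perm_diffs)

p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(f"observed difference: {observed_diff:.3f}, permutation p-value: {p_value:.4f}")
</code></pre>
<p>Swap in p-values vs. confidence intervals, or two different diagnostic plots, and you have the kind of experiment that could, in principle, adjudicate between methods.</p>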
Coursera introduces three courses in statistics
2012-09-13T12:40:22+00:00
http://simplystats.github.io/2012/09/13/coursera-introduces-three-courses-in-statistics
<p><a href="http://www.significancemagazine.org/details/webexclusive/2539381/Coursera-introduces-three-courses-in-statistics.html">Coursera introduces three courses in statistics</a></p>
The pebbles of academia
2012-09-10T19:02:00+00:00
http://simplystats.github.io/2012/09/10/the-pebbles-of-academia
<p>I have just been awarded a certificate for successful completion of the Conflict of Interest Commitment training (I barely passed). Lately, I have been totally swamped by administrative duties and have had little time for actual research. The experience reminded me of something I read in this <a href="http://www.nytimes.com/2011/05/29/business/economy/29view.html?_r=1" target="_blank">NYTimes article</a> by <span><a href="http://marginalrevolution.com/" target="_blank">Tyler Cowen</a></span>:</p>
<blockquote>
<p><span>Michael Mandel, an economist with the Progressive Policy Institute, compares government regulation of innovation to the accumulation of pebbles in a stream. At some point too many pebbles block off the water flow, yet no single pebble is to blame for the slowdown. Right now the pebbles are limiting investment in future innovation.</span></p>
</blockquote>
<p>Here are some of the pebbles of my academic career (past and present): financial conflict of interest training, human subjects training, HIPAA training, safety training, ethics training, submitting papers online, filling out copyright forms, faculty meetings, center grant quarterly meetings, 2 hour oral exams, 2 hour thesis committee meetings, big project conference calls, retreats, JSM, anything with “strategic” in the title, admissions committee, affirmative action committee, faculty senate meetings, brown bag lunches, orientations, effort reporting, conflict of interest reporting, progress reports (can’t I just point to pubmed?), dbgap progress reports, people who ramble at study section, rambling at study section, buying airplane tickets for invited talks, filling out travel expense sheets, and organizing and turning in travel receipts. I know that some of these are somewhat important or take minimal time, but read the quote again.</p>
<p>I also acknowledge that I actually have it real easy compared to others, so I am interested in hearing: what are other people’s pebbles? </p>
<p><strong>Update</strong>: add changing my eRA commons password to the list!</p>
<p><img src="http://rafalab.jhsph.edu/simplystats/pebles4.jpg" width="400" /></p>
Sunday Data/Statistics Link Roundup (9/9/12)
2012-09-09T13:32:00+00:00
http://simplystats.github.io/2012/09/09/sunday-data-statistics-link-roundup-9-9-12
<ol>
<li>Not necessarily statistics related, but pretty appropriate now that the school year is starting. Here is a little introduction to <a href="http://i.imgur.com/ikDIW.gif" target="_blank">“how to google”</a> (via Andrew J.). Being able to “just google it” and find answers for oneself without having to resort to asking folks is maybe the #1 most useful skill as a statistician. </li>
<li>A <a href="http://dl.dropbox.com/u/7586336/RSS2012/googleVis_at_RSS_2012.html#(1)" target="_blank">really nice presentation</a> on interactive graphics with the googleVis package. I think one of the most interesting things about the presentation is that it was built with markdown/knitr/slidy (see slide 53). I am seeing more and more of these web-based presentations. I like them for a lot of reasons (ability to incorporate interactive graphics, easy sharing, etc.), although it is still harder than building a Powerpoint. I also wonder, what happens when you are trying to present somewhere that doesn’t have a good internet connection?</li>
<li>We talked a lot about the ENCODE project this week. We had an <a href="http://simplystatistics.org/post/31056769228/interview-with-steven-salzberg-about-the-encode" target="_blank">interview with Steven Salzberg</a>, then Rafa followed it up with a discussion of <a href="http://simplystatistics.org/post/31067828460/top-down-versus-bottom-up-science-data-analysis" target="_blank">top-down vs. bottom-up science</a>. Tons of data from the ENCODE project is <a href="http://genome.ucsc.edu/ENCODE/" target="_blank">now available</a>, there is even a <a href="http://scofield.bx.psu.edu/~dannon/encodevm/" target="_blank">virtual machine</a> with all the software used in the main analysis of the data that was just published. But my favorite quote/tweet/comment this week came from Leonid K. about a flawed/over the top piece trying to make a little too much of the ENCODE discoveries: “<a href="https://twitter.com/leonidkruglyak/status/244425345481183232" target="_blank">that’s a clown post, bro</a>”.</li>
<li>Another breathless post from the Chronicle about how there are “<a href="http://chronicle.com/article/Dozens-of-Plagiarism-Incidents/133697/" target="_blank">dozens of plagiarism cases being reported on Coursera</a>”. Given that tens of thousands of people are taking the course, it would be shocking if there wasn’t plagiarism, but my guess is it is about the same rate you see in in-person classes. I will be using peer grading in <a href="https://www.coursera.org/course/dataanalysis" target="_blank">my course</a>, hopefully plagiarism software will be in place by then. </li>
<li>A <a href="http://www.nytimes.com/2012/09/04/science/visual-strategies-transforms-data-into-art-that-speaks.html?_r=2&ref=science" target="_blank">New York Times article</a> about a new book on visualizing data for scientists/engineers. I love all the attention data visualization is getting. I’ll take a look at the book for sure. I bet it says a lot of the same things Tufte said and a lot of the things Nathan Yau <a href="http://www.amazon.com/gp/product/0470944889/?tag=flowingdata-20" target="_blank">says in his book.</a> This one may just be targeted at scientists/engineers. (link via Dan S.)</li>
<li>Edo and co. are putting together <a href="http://snap.stanford.edu/social2012/" target="_blank">a workshop on the analysis of social network data for NIPS</a> in December. If you do this kind of stuff, it should be a pretty awesome crowd, so get your paper in by the Oct. 15th deadline!</li>
</ol>
Big Data in Your Blood
2012-09-08T18:00:40+00:00
http://simplystats.github.io/2012/09/08/big-data-in-your-blood
<p><a href="http://bits.blogs.nytimes.com/2012/09/07/big-data-in-your-blood/?smid=tu-share">Big Data in Your Blood</a></p>
The Weatherman Is Not a Moron
2012-09-08T13:58:14+00:00
http://simplystats.github.io/2012/09/08/the-weatherman-is-not-a-moron
<p><a href="http://www.nytimes.com/2012/09/09/magazine/the-weatherman-is-not-a-moron.html?smid=tu-share">The Weatherman Is Not a Moron</a></p>
Top-down versus bottom-up science: data analysis edition
2012-09-07T18:56:00+00:00
http://simplystats.github.io/2012/09/07/top-down-versus-bottom-up-science-data-analysis
<p>In our most recent <a href="http://simplystatistics.org/post/31056769228/interview-with-steven-salzberg-about-the-encode" target="_blank">video</a>, <a href="http://bioinformatics.igm.jhmi.edu/salzberg/Salzberg/Salzberg_Lab_Home.html" target="_blank">Steven Salzberg</a> discusses the ENCODE project. Some of the advantages and disadvantages of top-down science are described. Here, top-down refers to big coordinated projects like the <a href="http://en.wikipedia.org/wiki/Human_Genome_Project" target="_blank">Human Genome Project</a> (HGP). In contrast, the approach of funding many small independent projects, via the <a href="http://grants.nih.gov/grants/funding/r01.htm" target="_blank">R01</a> mechanism, is referred to as bottom-up. Note that for the cost of HGP we could have funded thousands of R01s. However it is not clear that without the HGP we would have had public sequence data as early as we did. As Steven points out, when it comes to data generation the economies of scale make big projects more efficient. But the same is not necessarily true for data analysis.</p>
<p>Big projects like <a href="http://genome.ucsc.edu/ENCODE/" target="_blank">ENCODE</a> and <a href="http://www.1000genomes.org/" target="_blank">1000 genomes</a> include data analysis teams that work in coordination with the data producers. It is true that very good teams are assembled and very good tools developed. But what if instead of holding the data under embargo until the first analysis is done and a paper (or <a href="http://blogs.nature.com/news/2012/09/fighting-about-encode-and-junk.html" target="_blank">30</a>) is published, the data was made publicly available with no restrictions and the scientific community was challenged to compete for data analysis and biological discovery R01s? I have no evidence that this would produce better science, but my intuition is that, at least in the case of data analysis, better methods would be developed. Here is my reasoning. Think of the best 100 data analysts in academia and consider the following two approaches:</p>
<p>1- Pick the best among the 100 and have a small group carefully coordinate with the data producers to develop data analysis methods.</p>
<p>2- Let all 100 take a whack at it and see what falls out.</p>
<p>In scenario 1 the selected group has artificial protection from competing approaches and there are fewer brains generating novel ideas. In scenario 2 the competition would be fierce and after several rounds of sharing ideas (via publications and conferences), groups would borrow from others and generate even better methods.</p>
<p>Note that the big projects do make the data available and R01s are awarded to develop analysis tools for these data. But this only happens after giving the consortium’s group a substantial head start. </p>
<p>I have not participated in any of these consortia and perhaps I am being naive. So I am very interested to hear the opinions of others.</p>
Simply Statistics Podcast #3: Interview with Steven Salzberg
2012-09-07T14:12:34+00:00
http://simplystats.github.io/2012/09/07/interview-with-steven-salzberg-about-the-encode
<p>Interview with Steven Salzberg about the ENCODE Project.</p>
<p>In this episode Jeff and I have a discussion with Steven Salzberg, Professor of Medicine and Biostatistics at Johns Hopkins University, about the recent findings from the <a href="http://www.genome.gov/10005107" target="_blank">ENCODE Project</a> where he helps us separate fact from fiction. You’re going to want to watch to the end with this one.</p>
<p>Here are some excerpts from the interview.</p>
<p>Regarding why the data should have been released immediately without restriction:</p>
<blockquote>
<p>If this [ENCODE] were funded by a regular investigator-initiated grant, then I would say you have your own grant, you’ve got some hypotheses you’re pursuing, you’re collecting data, you’ve already demonstrated that…you have some special ability to do this work and you should get some time to look at your data that you just generated to publish it. This was not that kind of a project. These are not hypothesis-driven projects. They are data collection projects. The whole model is…they’re creating a resource and it’s more efficient to create the resource in one place…. So we all get this data that’s being made available for less money…. I think if you’re going to be funded that way, you should release the data right away, no restrictions, because you’re funded because you’re good at generating this data cheaply….But you may not be the best person to do the analysis.</p>
</blockquote>
<p>Regarding the problem with large-scale top-down funding approaches versus the individual investigator approach:</p>
<blockquote>
<p>Well, it’s inefficient because it’s anti-competitive. They have a huge amount of money going to a few centers, they’ll do tons of experiments of the same type—may not be the best place to do that. They could instead give that money to 20 times as many investigators who would be refining the techniques and developing better ones. And a few years from now, instead of having another set of ENCODE papers—which we’re probably going to have—we might have much better methods and I think we’d have just as much in terms of discovery, probably more.</p>
</blockquote>
<p>Regarding best way to make discoveries:</p>
<blockquote>
<p>I think a problem I have with it…is that the top-down approach to science isn’t the way you make discoveries. And NIH has sort of said we’re going to fund these data generation and data analysis groups—they’re doing both…and by golly we’re going to discover some things. Well, it doesn’t always work if you do that. You can’t just say…so the Human Genome [Project], even though, of course there were lots of promises about curing cancer, we didn’t say we were going to discover how a particular gene works, we said we’re going to discover what the sequence is. And we did! Really well. With these [ENCODE] projects they said we’re going to figure out the function of all the elements, and they haven’t figured that out, at all.</p>
</blockquote>
<p>[<a href="http://www.biostat.jhsph.edu/~rpeng/podcast/simplystatistics_HDvideo.xml" target="_blank">HD video RSS feed</a>]</p>
<p>[<a href="http://www.biostat.jhsph.edu/~rpeng/podcast/simplystatistics_audio.xml" target="_blank">Audio-only RSS feed</a>]</p>
<p>[NOTE: Due to a clumsy camera operator (who forgot to turn the camera on), we lost one of our three camera angles, so there’s no front-facing view. Sorry!]</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Simply Statistics Podcast #2
2012-09-06T13:58:16+00:00
http://simplystats.github.io/2012/09/06/in-this-episode-of-the-simply-statistics-podcast
<p>In this episode of the Simply Statistics podcast Jeff and I discuss the deterministic statistical machine and increasing the cost of data analysis. We decided to eschew the studio setup this time and attempt a more guerilla style of podcasting. Also, Rafa was nowhere to be found when we recorded so you’ll have to catch his melodious singing voice in the next episode.</p>
<p>And in case you’re wondering, Jeff’s office is in fact that clean.</p>
<p>As always, we welcome your feedback!</p>
<p>[<a href="http://www.biostat.jhsph.edu/~rpeng/podcast/simplystatistics_HDvideo.xml" target="_blank">HD video RSS feed</a>]</p>
<p>[<a href="http://www.biostat.jhsph.edu/~rpeng/podcast/simplystatistics_audio.xml" target="_blank">Audio-only RSS feed</a>]</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
How long should the next podcast be?
2012-09-05T13:53:01+00:00
http://simplystats.github.io/2012/09/05/how-long-should-the-next-podcast-be
<p>Here’s the survival curve for audience retention from the Youtube version of our <a href="http://youtu.be/FkSlaczE1vw" target="_blank">first podcast</a>.</p>
<p><img src="http://media.tumblr.com/tumblr_m9vnbcaVdP1r08wvg.png" alt="" /></p>
<p>So the question is: How long should our next podcast be?</p>
<p>By the way, Rafa, Jeff, and I all appreciate the little bump over at the 15 minute mark. However, you’re only encouraging us there!</p>
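<p>In case you are wondering how a curve like the one above gets computed: if you had the individual watch durations (YouTube only shows us the aggregated curve, so the numbers below are simulated), the retention at minute t is just the fraction of viewers whose watch time is at least t. A quick, purely illustrative sketch in Python:</p>
<pre><code>import numpy as np

# Simulated per-viewer watch times in minutes; the real raw data live with YouTube.
rng = np.random.default_rng(0)
watch_minutes = rng.exponential(scale=8.0, size=500).clip(max=28)

# Empirical retention ("survival") curve: fraction still watching at minute t.
for t in range(0, 29, 4):
    still_watching = (watch_minutes >= t).mean()
    print(f"minute {t:2d}: {100 * still_watching:5.1f}% of viewers still watching")
</code></pre>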
Online universities blossom in Asia
2012-09-03T17:21:23+00:00
http://simplystats.github.io/2012/09/03/online-universities-blossom-in-asia
<p><a href="http://news.yahoo.com/online-universities-blossom-asia-185953800.html"></a></p>
Sunday Data/Statistics Link Roundup (9/2/2012)
2012-09-02T13:55:03+00:00
http://simplystats.github.io/2012/09/02/sunday-data-statistics-link-roundup-9-2-2012
<ol>
<li>Just got back from IBC 2012 in Kobe Japan. I was in an awesome session (organized by the inimitable <a href="http://www.kuleuven.be/wieiswie/en/person/u0071934" target="_blank">Lieven Clement</a>) with great talks by <a href="http://www.mnmccall.com/" target="_blank">Matt McCall</a>, <a href="http://www.bioinf.jku.at/people/clevert/" target="_blank">Djork-Arne Clevert</a>, <a href="https://www.dur.ac.uk/wolfson.institute/contacts/staff/?id=8718" target="_blank">Adetayo Kasim</a>, and <a href="http://www.linkedin.com/pub/willem-talloen/13/755/207" target="_blank">Willem Talloen</a>. Willem’s talk nicely tied in our work and how it plays into the pharmaceutical development process and the bigger theme of big data. On the way home through SFO I <a href="http://biostat.jhsph.edu/~jleek/bigdata.jpg" target="_blank">saw this</a> hanging in the airport. A fitting welcome back to the states. Although, as we talked about in <a href="http://simplystatistics.org/post/30101719608/simply-statistics-podcast-1-to-mark-the" target="_blank">our first podcast</a>, I wonder how long the Big Data hype will last…</li>
<li>Simina B. sent <a href="http://analytics.ncsu.edu/?page_id=1799&gclid=CImx-pfBkrICFYSo4AodnQsArg" target="_blank">this link</a> along for a masters program in analytics at NC State. Interesting because it looks a lot like a masters in statistics program, but with a heavier emphasis on data collection/data management. I wonder what role the stat department down there is playing in this program and if we will see more like it pop up? Or if programs like this with more data management will be run by stats departments in other places. Maybe our friends down in Raleigh have some thoughts for us. </li>
<li>If one set of weekly links isn’t enough to fill your procrastination quota, go check out NextGenSeek’s <a href="http://nextgenseek.com/2012/09/nextgenseek-stories-this-week-3108/" target="_blank">weekly stories</a>. A bit genomics focused, but lots of cool data/statistics links in there too. Love the “extreme Venn diagrams”. </li>
<li><a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-STA4.html" target="_blank">This</a> seems almost like the fast statistics journal <a href="http://simplystatistics.org/post/19289280474/a-proposal-for-a-really-fast-statistics-journal" target="_blank">I proposed</a> earlier. Can’t seem to access the first issue/editorial board either. Doesn’t look like it is open access, so it’s still not perfect. But I love the sentiment of fast/single round review. We can do better though. I think Yihue X. has some <a href="http://yihui.name/en/2012/03/a-really-fast-statistics-journal/" target="_blank">really interesting</a> ideas on how. </li>
<li>My wife taught for a year at Grinnell in Iowa and loved it there. They just released this <a href="http://www.grinnell.edu/offices/institutionalresearch/CDS" target="_blank">cool data set</a> with a bunch of information about the college. If all colleges did this, we could really dig in and learn a lot about the American higher education system (link via Hilary M.). </li>
<li>From the way-back machine, a rant from Rafa <a href="http://simplystatistics.org/post/10402321009/meetings" target="_blank">about meetings</a>. Stay tuned this week for some Simply Statistics data about our first year on the <a href="http://en.wikipedia.org/wiki/Series_of_tubes" target="_blank">series of tubes</a>. </li>
</ol>
Drought Extends, Crops Wither
2012-09-01T13:58:22+00:00
http://simplystats.github.io/2012/09/01/drought-extends-crops-wither
<p><a href="http://www.nytimes.com/interactive/2012/08/24/us/drought-crops.html">Drought Extends, Crops Wither</a></p>
Most Americans Confused By Cloud Computing According to National Survey
2012-08-31T18:00:21+00:00
http://simplystats.github.io/2012/08/31/most-americans-confused-by-cloud-computing-according-to
<p><a href="http://www.citrix.com/English/NE/news/news.asp?newsID=2328309">Most Americans Confused By Cloud Computing According to National Survey</a></p>
Court Blocks E.P.A. Rule on Cross-State Pollution
2012-08-31T14:00:11+00:00
http://simplystats.github.io/2012/08/31/court-blocks-e-p-a-rule-on-cross-state-pollution
<p><a href="http://www.nytimes.com/2012/08/22/science/earth/appeals-court-strikes-down-epa-rule-on-cross-state-pollution.html?smid=tu-share">Court Blocks E.P.A. Rule on Cross-State Pollution</a></p>
Court Upholds Rule on Nitrogen Dioxide Emissions
2012-08-30T17:59:17+00:00
http://simplystats.github.io/2012/08/30/court-upholds-rule-on-nitrogen-dioxide-emissions
<p><a href="http://www.nytimes.com/2012/07/18/science/earth/court-upholds-rule-on-nitrogen-dioxide-emissions.html?smid=tu-share">Court Upholds Rule on Nitrogen Dioxide Emissions</a></p>
Green: Will Emissions Disclosure Mean Investor Pressure on Polluters?
2012-08-30T13:58:23+00:00
http://simplystats.github.io/2012/08/30/green-will-emissions-disclosure-mean-investor-pressure
<p><a href="http://green.blogs.nytimes.com/2012/08/24/will-emissions-disclosure-mean-investor-pressure-on-polluters/?smid=tu-share">Green: Will Emissions Disclosure Mean Investor Pressure on Polluters?</a></p>
I.B.M. Mainframe Evolves to Serve the Digital World
2012-08-29T17:59:30+00:00
http://simplystats.github.io/2012/08/29/i-b-m-mainframe-evolves-to-serve-the-digital-world
<p><a href="http://www.nytimes.com/2012/08/28/technology/ibm-mainframe-evolves-to-serve-the-digital-world.html?smid=tu-share">I.B.M. Mainframe Evolves to Serve the Digital World</a></p>
Increasing the cost of data analysis
2012-08-29T14:00:00+00:00
http://simplystats.github.io/2012/08/29/increasing-the-cost-of-data-analysis
<p>Jeff’s post about the <a href="http://simplystatistics.org/post/30315018436/a-deterministic-statistical-machine" target="_blank">deterministic statistical machine</a> got me thinking a bit about the cost of data analysis. The cost of data analysis these days is in many ways going up. The data being collected are getting bigger and more complex. Analyzing these data requires more expertise, more storage hardware, and more computing power. In fact, the analysis in some fields like genomics is now more expensive than the collection of the data [There’s a graph that shows this but I can’t seem to find it anywhere; I’ll keep looking and post later. For now <a href="http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html" target="_blank">see here</a>.].</p>
<p>However, that’s really about the dollars and cents kind of cost. The cost of data analysis has gone very far down in a different sense. For the vast majority of applications that look at moderate to large datasets, many, many statistical analyses can be conducted essentially at the push of a button. And so there’s no cost in continuing to analyze data until a desirable result is obtained. Correcting for multiple testing is one way to “fix” this problem. But I personally don’t find multiple testing corrections to be all that helpful because ultimately they still try to boil down a complex analysis into a simple yes/no answer.</p>
<p>In the old days (for example when <a href="http://web.archive.org/web/19970717063350/http://www.stat.berkeley.edu/users/rafa/index.html" target="_blank">Rafa was in grad school</a>), computing time was precious and things had to be planned out carefully, starting with the planning of the experiment and continuing with the data collection and the analysis. In fact, much of current statistical education is still geared around the idea that computing is expensive, which is why we use things like asymptotic theorems and approximations even when we don’t really have to. Nowadays, there’s a bit of a “we’ll fix it in post” mentality, which values collecting as much data as possible when given the chance and figuring out what to do with it later. This kind of thinking can lead to (1) <a href="http://simplystatistics.org/post/25924012903/the-problem-with-small-big-data" target="_blank">small big data problems</a>; (2) poorly designed studies; (3) data that don’t really address the question of interest to everyone.</p>
<p>What if the cost of data analysis were not paid in dollars but were paid in some general unit of credibility? For example, Jeff’s hypothetical machine would do some of this.</p>
<blockquote>
<p><span>By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around. </span></p>
</blockquote>
<p>So with each additional analysis of the data, you get an additional piece of paper added to your analysis paper trail. People can look at the analysis paper trail and make of it what they will. Maybe they don’t care. Maybe having a ton of analyses discredits the final results. The point is that it’s there for all to see.</p>
<p>I do <em>not</em> think what we need is better methods to deal with multiple testing. This is simply not a statistical issue. What we need is a way to increase the cost of data analysis by preserving the paper trail. So that people hesitate before they run all pairwise combinations of whatever. Reproducible research doesn’t really deal with this problem because reproducibility only really requires that the <em>final</em> analysis is documented.</p>
<p>In other words, let the paper trail be the price of pushing the button.</p>
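<p>For what it’s worth, here is a toy sketch (not a real tool, just an illustration of the idea) of what the “analysis paper trail” could look like in practice: every run appends a record, with a hash of the data and a one-line description of what was tried, to a log that is never edited, only grown. The file name and record fields are all made up.</p>
<pre><code>import hashlib, json, time
from pathlib import Path

TRAIL = Path("analysis_trail.jsonl")  # hypothetical append-only log

def log_analysis(data_bytes, description, result_summary):
    """Append one record per analysis run; the trail only ever grows."""
    record = {
        "timestamp": time.time(),
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "description": description,
        "result": result_summary,
    }
    with TRAIL.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Every attempt, including the ones that "didn't work", leaves a trace.
data = b"y,x\n1,2\n3,4\n"
log_analysis(data, "regression of y on x", {"p_value": 0.04})
log_analysis(data, "regression of y on x, one outlier dropped", {"p_value": 0.01})
print("analyses on record:", len(TRAIL.read_text().splitlines()))
</code></pre>
<p>The point is not the implementation; it is that each extra push of the button leaves a mark that others can see.</p>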
Genes Now Tell Doctors Secrets They Can’t Utter
2012-08-28T17:59:19+00:00
http://simplystats.github.io/2012/08/28/genes-now-tell-doctors-secrets-they-cant-utter
<p><a href="http://www.nytimes.com/2012/08/26/health/research/with-rise-of-gene-sequencing-ethical-puzzles.html?smid=tu-share">Genes Now Tell Doctors Secrets They Can’t Utter</a></p>
Active in Cloud, Amazon Reshapes Computing
2012-08-28T14:00:18+00:00
http://simplystats.github.io/2012/08/28/active-in-cloud-amazon-reshapes-computing
<p><a href="http://www.nytimes.com/2012/08/28/technology/active-in-cloud-amazon-reshapes-computing.html?smid=tu-share">Active in Cloud, Amazon Reshapes Computing</a></p>
A deterministic statistical machine
2012-08-27T14:00:06+00:00
http://simplystats.github.io/2012/08/27/a-deterministic-statistical-machine
<p>As Roger pointed out, the most recent batch of Y Combinator startups included a bunch of <a href="http://simplystatistics.org/post/29964925728/data-startups-from-y-combinator-demo-day" target="_blank">data-focused</a> companies. One of these companies, <a href="https://www.statwing.com/" target="_blank">StatWing</a>, is a web-based tool for data analysis that looks like an improvement on SPSS with more plain text, more visualization, and a lot of the technical statistical details “under the hood”. I first read about StatWing on TechCrunch, under the headline <a href="http://techcrunch.com/2012/08/16/how-statwing-makes-it-easier-to-ask-questions-about-data-so-you-dont-have-to-hire-a-statistical-wizard/" target="_blank">“How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”</a>.</p>
<p>StatWing looks super user-friendly, and the idea of democratizing statistical analysis so more people can access these ideas is something that appeals to me. But, as one of the aforementioned statistical wizards, I was freaked out for a minute. Once I looked at the software, though, I realized it suffers from the same problem that most “user-friendly” statistical software suffers from: it makes it really easy to screw up a data analysis. It will tell you when something is significant and, if you don’t like that it isn’t, you can keep slicing and dicing the data until it is. The key issue behind getting insight from data is knowing when you are fooling yourself with confounders, or small effect sizes, or overfitting. StatWing looks like an improvement on the UI experience of data analysis, but it won’t prevent the false positives that plague science and cost business big $$. </p>
<p>So I started thinking about what kind of software would prevent these sorts of problems while still being accessible to a big audience. My idea is a “deterministic statistical machine”. Here is how it works: you input a data set and then specify the question you are asking (is variable Y related to variable X? Can I predict Z from W?). Then, depending on your question, it uses a deterministic set of methods to analyze the data. Say regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, <a href="http://www.nature.com/news/the-data-detective-1.10937" target="_blank">maybe even data fudging</a>. It generates a report with a markdown tool and then immediately publishes the result to <a href="http://figshare.com/" target="_blank">figshare</a>. </p>
<p>The advantage is that people can get their data-related questions answered using a standard tool. It does a lot of the “heavy lifting” in checking for potential problems and produces nice reports. But it is a deterministic algorithm for analysis so overfitting, fudging the analysis, etc. are harder. By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around. </p>
<p>The DSM should be a web service that is easy to use. Anybody want to build it? Any suggestions for how to do it better? </p>
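<p>To make the shape of the idea a little more concrete, here is a minimal sketch in Python of what the deterministic core might look like. The function name, the question menu, and the checks are invented for illustration; a real DSM would obviously use better methods, render a proper report, and push it somewhere public like figshare.</p>
<pre><code>import numpy as np

def dsm(x, y, question):
    """Toy deterministic statistical machine: the question, not the analyst,
    picks the method, and the same pre-specified checks always run."""
    checks = {
        "n": int(len(y)),
        "missing": int(np.isnan(x).sum() + np.isnan(y).sum()),
        "possible_outliers": int((np.abs(y - np.nanmean(y)) > 3 * np.nanstd(y)).sum()),
    }
    if question == "is_y_related_to_x":
        slope, intercept = np.polyfit(x, y, 1)   # always plain least squares
        result = {"method": "simple linear regression", "slope": float(slope)}
    elif question == "predict_y_from_x":
        slope, intercept = np.polyfit(x, y, 1)   # always the same fixed predictor
        result = {"method": "linear predictor", "coef": [float(intercept), float(slope)]}
    else:
        raise ValueError("question is not on the fixed menu")
    return {"question": question, "checks": checks, "result": result}

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)
report = dsm(x, y, "is_y_related_to_x")
print(report)   # a real DSM would render this as a report and publish it
</code></pre>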
Sunday data/statistics link roundup (8/26/12)
2012-08-26T13:53:15+00:00
http://simplystats.github.io/2012/08/26/sunday-data-statistics-link-roundup-8-26-12
<p>First off, a quick apology for missing last week, and <a href="https://twitter.com/Augusto_Heink/status/237621283397984256" target="_blank">thanks to Augusto</a> for noticing! On to the links:</p>
<ol>
<li>Unbelievably the <a href="http://blogs.nature.com/news/2012/08/us-court-sides-with-gene-patents.html" target="_blank">BRCA gene patents were upheld</a> by the lower court despite the Supreme Court <a href="http://simplystatistics.org/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court" target="_blank">coming down pretty unequivocally</a> against patenting correlations between metabolites and health outcomes. I wonder if this one will be overturned if it makes it back up to the Supreme Court. </li>
<li>A <a href="http://thebrowser.com/interviews/david-spiegelhalter-on-statistics-and-risk" target="_blank">really nice interview</a> with David Spiegelhalter on Statistics and Risk. David runs the <a href="http://understandinguncertainty.org/" target="_blank">Understanding Uncertainty </a>blog and published a recent paper on <a href="http://www.sciencemag.org/content/333/6048/1393.abstract" target="_blank">visualizing uncertainty</a>. My favorite line from the interview might be: “<span>There is a nice quote from Joel Best that “all statistics are social products, the results of people’s efforts”. He says you should always ask, “Why was this statistic created?” Certainly statistics are constructed from things that people have chosen to measure and define, and the numbers that come out of those studies often take on a life of their own.”</span></li>
<li>For those of you who use Tumblr like we do, here is a <a href="http://adamlaiacano.tumblr.com/post/11272953536/tips-for-making-a-technical-blog-on-tumblr" target="_blank">cool post</a> on how to put technical content into your blog. My favorite thing I learned about is the <a href="https://gist.github.com/" target="_blank">Github Gist</a> that can be used to embed syntax-highlighted code.</li>
<li>A few <a href="http://www.grantland.com/story/_/id/8284393/breaking-best-nfl-stats" target="_blank">interesting and relatively simple stats</a> for projecting the success of NFL teams. One thing I love about sports statistics is that they are totally willing to be super ad-hoc and to be super simple. Sometimes this is all you need to be highly predictive (see for example, the results of Football’s Pythagorean Theorem). I’m sure there are tons of more sophisticated analyses out there, but if it ain’t broke… (via Rafa). </li>
<li>My student Hilary has a new blog that’s worth checking out. Here is a <a href="http://hilaryparker.com/2012/08/25/love-for-projecttemplate/" target="_blank">nice review</a> of ProjectTemplate she did. I think the idea of having an organizing principle behind your code is a great one. Hilary likes ProjectTemplate, I think there are a few others out there that might be useful. If you know about them, you should leave a comment on her blog!</li>
<li>This is ridiculously cool. Man City has <a href="http://www.epltalk.com/man-city-makes-player-statistics-data-available-to-public-small-step-towards-stat-nerd-nirvana-45877" target="_blank">opened up </a>their data/statistics to the data analytics community. After registering, you have access to many of the statistics the club uses to analyze their players. This is yet another example of open data taking over the world. It’s clear that data generators can create way more value for themselves by releasing cool data, rather than holding it all in house. </li>
<li>The Portland Public Library has created a website called <a href="http://www.bookpsychic.com/" target="_blank">Book Psychic</a>, basically a recommender system for books. I love this idea. It would be great to have a <a href="http://simplystatistics.org/post/10521062620/the-killer-app-for-peer-review" target="_blank">recommender system for scientific papers</a>. </li>
</ol>
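<p>A footnote on item 4 above: the “Pythagorean” formula mentioned there is just a ratio of powers of points scored and points allowed. A quick, hedged sketch in Python (the exponent is an empirical constant, usually quoted around 2.37 for football; the season totals in the example are invented):</p>
<pre><code>def pythagorean_win_pct(points_for, points_against, exponent=2.37):
    """Expected win fraction from season points scored and allowed."""
    pf, pa = points_for ** exponent, points_against ** exponent
    return pf / (pf + pa)

# A hypothetical team that scores 400 and allows 350 over a 16-game season:
expected_wins = 16 * pythagorean_win_pct(400, 350)
print(f"expected wins: {expected_wins:.1f} of 16")
</code></pre>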
Judge Rules Poker Is A Game Of Skill, Not Luck
2012-08-25T17:04:57+00:00
http://simplystats.github.io/2012/08/25/judge-rules-poker-is-a-game-of-skill-not-luck
<p><a href="http://www.npr.org/2012/08/22/159833145/judge-rules-poker-is-a-game-of-skill-not-luck">Judge Rules Poker Is A Game Of Skill, Not Luck</a></p>
Simply Statistics Podcast #1
2012-08-24T13:54:00+00:00
http://simplystats.github.io/2012/08/24/simply-statistics-podcast-1-to-mark-the
<p>Simply Statistics Podcast #1.</p>
<p>To mark the occasion of our 1-year anniversary of starting the blog, Jeff, Rafa, and I have recorded our first podcast. You can tell that it’s our very first podcast because we don’t appear to have any idea what we’re doing. However, we decided to throw caution to the wind.</p>
<p>In this episode we talk about why we started the blog and discuss our thoughts on statistics and big data. Be sure to watch to the end as Rafa provides a special treat.</p>
<p><strong>UPDATE</strong>: For those of you who can’t bear the sight of us, there is an <a href="http://www.biostat.jhsph.edu/~rpeng/podcast/SSPodcast1_audio.m4a" target="_blank">audio only version</a>.</p>
<p><strong>UPDATE 2</strong>: I have setup an RSS feed for the <a href="http://www.biostat.jhsph.edu/~rpeng/podcast/simplystatistics_audio.xml" target="_blank">audio-only version of the podcast</a>.</p>
<p><strong>UPDATE 3</strong>: Here is the RSS feed for <a href="feed://www.biostat.jhsph.edu/~rpeng/podcast/simplystatistics_HDvideo.xml" target="_blank">HD video version of the podcast</a>.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Science Exchange starts Reproducibility Initiative
2012-08-23T14:46:17+00:00
http://simplystats.github.io/2012/08/23/science-exchange-starts-reproducibility-initiative
<p>I’ve fallen behind and so haven’t had a chance to mention this, but <a href="https://www.scienceexchange.com" target="_blank">Science Exchange</a> has started its <a href="https://www.scienceexchange.com/reproducibility" target="_blank">Reproducibility Initiative</a>. The idea is that authors can submit their study to be reproduced and Science Exchange will match the study with a validator who will attempt to reproduce the results (for a fee).</p>
<blockquote>
<p><span>Validated studies will receive a Certificate of Reproducibility acknowledging that their results have been independently reproduced as part of the Reproducibility Initiative. Researchers have the opportunity to publish the replicated results as an independent publication in the PLOS Reproducibility Collection, and can share their data via the figshare Reproducibility Collection repository. The original study will also be acknowledged as independently reproduced if published in a </span><a class="about_link" href="https://www.scienceexchange.com/reproducibility" target="_blank">supporting journal</a><span>.</span></p>
</blockquote>
<p>This is a very interesting initiative and it’s one I and a number of others have been talking about doing. They have an excellent advisory board and seem to have all the right partners/infrastructure lined up. </p>
<p>The obvious question to me is if you’re going to submit your study to this service and get it reproduced, why would you ever want to submit it to a journal? The level of review you’d get here is quite a bit more rigorous than you’d receive at a journal and the submission process essentially involves writing a paper without the Introduction and the Discussion (usually the hardest and most annoying parts). At the moment, it seems the service is set up to work in parallel with standard publication or perhaps after the fact. But I could see it eventually replacing standard publication altogether.</p>
<p>The timing, of course, could be an issue. It’s not clear how long one should expect it to take to reproduce a study. But it’s probably not much longer than a review you’d get at a statistics journal.</p>
Data Startups from Y Combinator Demo Day
2012-08-22T13:54:58+00:00
http://simplystats.github.io/2012/08/22/data-startups-from-y-combinator-demo-day
<p>Y Combinator, the tech startup incubator, had its 15th demo day. Here are some of the data/statistics-related highlights (thanks to TechCrunch for doing the hard work):</p>
<ul>
<li>
<p><a href="http://www.everyday.me/" target="_blank">EVERYDAY.ME</a> — A PRIVATE, ONLINE RECORD OF YOUR LIFE. </p>
<p>This company seems to me like a meta-data company. It compiles your data from other sites.</p>
</li>
<li>
<p><a href="http://mthsense.com/" target="_blank">MTH SENSE</a>: IMPROVING MOBILE AD TARGETING
“Most [mobile] ads served are blind. Mth sense’s solution adds demographic data to ads through predictive modeling based on app and device usage. For example, if you have the Pinterest, and Vogue apps, you’re more likely to be a soccer mom.” Hmm, I guess I’d better delete those apps from my phone….</p>
</li>
<li>
<p><a href="http://www.survata.com/" target="_blank">SURVATA</a>: REPLACING PAYWALLS WITH SURVEYWALLS
Survata’s product replaces paywalls on premium content from online publishers with surveys that conduct market research.</p>
</li>
<li>
<p><a href="http://www.rent.io/" target="_blank">RENT.IO</a> — RENT PRICE PREDICTION
Rent.io says it wants to “optimize pricing of the single biggest recurring expense in lives of 100 million Americans.”</p>
</li>
<li>
<p><a href="http://www.bigcalc.com/" target="_blank">BIGCALC</a>: FAST NUMBER-CRUNCHING FOR MAKING FINANCIAL TRADING DECISIONS
BigCalc says its platform for financial modeling scales to enormous datasets, and purportedly does simulations that typically take 22 hours in 24 minutes.</p>
</li>
<li>
<p><a href="http://www.datanitro.com/" target="_blank">DATANITRO</a> — A BACKBONE FOR FINANCE-RELATED DATA
DataNitro’s founders have both worked in finance, and they say they know from experience that financial industry software is basically “held together with duct tape.” A big problem with the status quo is how data is exported from Excel.</p>
</li>
<li>
<p><a href="http://www.statwing.com/" target="_blank">STATWING</a>: EASY TO USE DATA ANALYSIS
Most existing data analysis tools (in particular SPSS) are built for statisticians. Statwing has created tools that make it easier for marketers and analysts to interact with data without dealing with arcane technical terminology. Those users only need a few core functions, Statwing says, so that’s what the company provides. With just a few clicks, users can get the graphs that they want. And the data is summarized in a single sentence of conversational English.</p>
</li>
</ul>
Harvard chooses statistician to lead Graduate School of Arts and Sciences
2012-08-22T12:29:40+00:00
http://simplystats.github.io/2012/08/22/harvard-chooses-statistician-to-lead-graduate-school-of
<p><a href="http://news.harvard.edu/gazette/story/2012/08/new-dean-for-gsas/">Harvard chooses statistician to lead Graduate School of Arts and Sciences</a></p>
NSF recognizes math and statistics are not the same thing...kind of
2012-08-21T15:19:38+00:00
http://simplystats.github.io/2012/08/21/nsf-recognizes-math-and-statistics-are-not-the-same
<p>There’s controversy brewing over at the National Science Foundation over names. Back in October 2011, Sastry Pantula, the Director of the Division of Mathematical Sciences at NSF (formerly the Chair of NC State Statistics Department and President of the ASA), proposed that the name of the Division be changed to the “Division of Mathematical and Statistical Sciences”. Excerpting from his <a href="http://imstat.org/pantulaletter10_6_11.pdf" target="_blank">original proposal</a>, Pantula says</p>
<blockquote>
<p>Extracting useful knowledge from the deluge of data is critical to the scientific successes of the future. Data-intensive research will drive many of the major scientific breakthroughs in the coming decades. There is a long-term need for research and workforce development in computational and data-enabled sciences. Statistics is broadly recognized as a data-centric discipline, thus having it in the Division’s name as proposed would be advantageous whenever “Big Data” and data-sciences investments are discussed internally and externally.</p>
</blockquote>
<p>This bureaucratic move by Pantula created quite a reaction. A sub-committee of the Math and Physical Sciences Advisory Committee (MPSAC) was formed to investigate the name change and to solicit feedback from the relevant communities. The sub-committee was chaired by Fred Roberts (Rutgers) and also included James Berger (Duke), Emery Brown (MIT), Kevin Corlette (U. of Chicago), Irene Fonseca (CMU), and Juan Meza (UC Merced). A number of organizations provided feedback to the sub-committee, including the American Statistical Association and the American Mathematical Society.</p>
<p>There was intense feedback both for and against the name change. Somewhat predictably, mathematicians were adamantly opposed to the name change and statisticians were for it. The <a href="http://nsf.gov/attachments/124926/public/DMS_Name_Change_Committee_Report_Final_4-1-12.pdf" target="_blank">final report of the sub-committee</a> is both interesting and enlightening for those not familiar with the arguments involved.</p>
<p>First a little background for people (like me) who are not familiar with NSF’s organizational structure. NSF has a number of Directorates, of which Mathematical and Physical Sciences (MPS) is one, and within MPS is the Division of Mathematical Sciences (DMS). DMS includes 11 program areas ranging from algebra and number theory to topology. Statistics is one of those program areas. </p>
<p>This should already give one pause. How exactly do statistics and topology end up in the same basket? I’m not exactly sure but I’m guessing it’s the result of bureaucratic inertia. Statistics came later and it had to be stuck somewhere. DMS is not the only place at NSF to get funding for statistics, but a <a href="http://nsf.gov/awardsearch/progSearch.do?SearchType=progSearch&page=2&QueryText=&ProgOrganization=&ProgOfficer=&ProgEleCode=1269&BooleanElement=false&ProgRefCode=&BooleanRef=false&ProgProgram=&ProgFoaCode=&Restriction=2&Search=Search" target="_blank">quick search through the currently active grants</a> shows that the vast majority of statistics-related grants go through DMS, with a smattering coming through other Divisions.</p>
<p>The primary issue here, and the only reason it’s an issue at all, is money. Statistics is one of 11 program areas in DMS, which means that it roughly gets 9% of the funding allocated to DMS. This is worth noting—the entire field of statistics gets roughly as much funding as, say, topology. For example, one of the arguments against the name change in the sub-committee’s report is</p>
<blockquote>
<p>3). Statistics constitutes a small (although significant) proportion of the DMS portfolio in terms of number of programs, number of grant applications, number of grants funded.</p>
</blockquote>
<p>Well, yes, but I would argue that the reason for this is the historically low prioritization of statistics in the Division. This is a choice, not a fact. I believe statistics could play a much bigger role in the Division and perhaps within NSF more generally if there were an agreement on its importance. A key argument comes next:</p>
<blockquote>
<p>If the name change attracts more proposals to the Division from the statistics community, this could draw funding away from other subfields and it could also increase the workload of the Division’s program officers.</p>
</blockquote>
<p>Okay, so money’s important too, but let’s get to the main attraction, which comes in comment number 5:</p>
<blockquote>
<p>5). Statistics is funded throughout the federal government. The traditional funding of statistics by DMS is appropriate: fund fundamental research in statistics. Broadening the mission of DMS to include more applied statistics would not benefit the overall funding of the mathematical sciences.</p>
</blockquote>
<p>The first sentence is a fact: Many government agencies fund statistics research. For example, the National Institutes of Health funds many statisticians who develop and apply methods to problems in the health sciences. The EPA will occasionally fund statisticians to develop methods for environmental health applications.</p>
<p>But who is charged with funding the development and application of statistical methods to every other scientific field? The problem now is that you essentially have a group of NIH-funded (bio)statisticians doing biomedical research and a group of NSF-funded statisticians doing “fundamental” research in statistics (note that “fundamental” equals “mathematical” here). But that hardly represents all of the statisticians out there. So for the rest of the statisticians who are not doing biomedical research and are not doing “fundamental” research, where do they go for funding?</p>
<p>These days, statistics is “applied” to <em>everything</em>. NSF itself has acknowledged that we are in an era of big data—it’s clear that statistics will play a big role whether we call it “statistics” or not. If NSF decided to fund research into the application of statistics to all areas, it would likely overwhelm the funding of every other program area in DMS. This is why the “solution” is to resort to what is informally understood as the mission of NSF, which is to fund “fundamental” research.</p>
<p>But it’s not clear to me that NSF should limit itself in this manner. In particular, if NSF got serious about funding the application of statistics to all scientific areas (either through DMS or some other Division), it would incentivize statisticians to build stronger and closer collaborations with scientists all over. I see this as a win-win for everyone involved. </p>
<p>As a statistician, I’m willing to admit I’m biased, but I think NSF should play a much bigger role in advancing statistics as one of the critical tools of the future. Perhaps the solution is not to rename the Division, but to create a separate division for statistical sciences independent of mathematics, one of the suggestions in the sub-committee report. This separation would mirror what has occurred in many universities over the past 50 years or so with the creation of independent departments of statistics and biostatistics. </p>
<p>Ultimately, the name of the Division was not changed. Here’s the <a href="http://www.nsf.gov/attachments/124926/public/Response_MPSAC_Subcommittee_Report_on_Name_of_Division_of_Mathematical_Sciences_8-16-2012.pdf" target="_blank">release from last week</a>:</p>
<blockquote>
<p>NSF is committed to supporting the research necessary to maximize the benefits to be derived from the age of data, and to promoting and funding research related to data-centric scientific discovery and innovation, and in particular, the growing role of the statistical sciences in all research areas. <span>Recognizing both the complex composition of the various communities and the support of statistical sciences throughout NSF, and taking into account the various community views described in the very thoughtful report of the MPSAC, I have decided to maintain the name “Division of Mathematical Sciences (DMS)” within MPS, but to affirm strong commitment to the statistical sciences.</span></p>
<p>To demonstrate this commitment, (a) whenever appropriate, we will specifically mention “statistics” alongside “mathematics” in budget requests and in solicitations in order to recognize the unique and pervasive role of statistical sciences, and to ensure that relevant solicitations reach the statistical sciences community….</p>
</blockquote>
<p>Well, I feel better already. I suppose this is progress of some sort.</p>
Recommended updates from Google Scholar
2012-08-21T00:53:19+00:00
http://simplystats.github.io/2012/08/21/recommended-updates-from-google-scholar
<p><a href="http://googlescholar.blogspot.com/2012/08/scholar-updates-making-new-connections.html">Recommended updates from Google Scholar</a></p>
Interview with C. Titus Brown - Computational biologist and open access champion
2012-08-17T13:45:57+00:00
http://simplystats.github.io/2012/08/17/interview-with-c-titus-brown-computational-biologist
<div class="im">
<strong>C. Titus Brown </strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<img height="300" src="http://biostat.jhsph.edu/~jleek/titus.jpg" width="300" />
</div>
<div class="im">
</div>
<div class="im">
C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering at Michigan State University. He develops computational software for next generation sequencing and is the author of the blog <a href="http://ivory.idyll.org/blog/" target="_blank">“Living in an Ivory Basement”</a>. We talked to Titus about open access (he publishes his unfunded grants online!), improving the reputation of PLoS One, his research in computational software development, and work-life balance in academia.
</div>
<!-- more -->
<div class="im">
</div>
<div class="im">
</div>
<div class="im">
<strong>Do you consider yourself a statistician, data scientist, computer scientist, or something else?</strong>
</div>
<p>Good question. Short answer: apparently somewhere along the way I became a biologist, but with a heavy dose of “computational scientist” in there.</p>
<p>The longer answer? Well, it’s a really long answer…</p>
<p>My first research was on Avida, a bottom-up model for evolution that Chris Adami, Charles Ofria and I wrote together at Caltech in 1993: <a href="http://en.wikipedia.org/wiki/Avida" target="_blank">http://en.wikipedia.org/wiki/Avida</a>. (Fun fact: Chris, Charles and I are now all faculty at Michigan State! Chris and I have offices one door apart, and Charles has an office one floor down.) Avida got me very interested in biology, but not in the undergrad “memorize stuff” biology — more in research. This was computational science: using simple models to study biological phenomena.</p>
<p>While continuing evolution research, I did my undergrad in pure math at Reed College, which was pretty intense; I worked in the Software Development lab there, which connected me to a bunch of reasonably well known hackers including Keith Packard, Mark Galassi, and Nelson Minar.</p>
<p><span>I also took a year off and worked on Earthshine:</span></p>
<p><a href="http://en.wikipedia.org/wiki/Planetshine#Earthshine" target="_blank"><a href="http://en.wikipedia.org/wiki/Planetshine#Earthshine" target="_blank">http://en.wikipedia.org/wiki/Planetshine#Earthshine</a></a></p>
<p>and then rebooted the project as an RA in 1997, the summer after graduation. This was mostly data analysis, although it included a fair amount of hanging off of telescopes adjusting things as the freezing winter wind howled through the Big Bear Solar Observatory’s observing room, aka “data acquisition”.</p>
<p>After Reed, I applied to a bunch of grad schools, including Princeton and Caltech in bio, UW in Math, and UT Austin and Ohio State in physics. I ended up at Caltech, where I switched over to developmental biology and (eventually) regulatory genomics and genome biology in Eric Davidson’s lab. My work there included quite a bit of wet bench biology, which is not something many people associate with me, but was nonetheless something I did!</p>
<p>Genomics was really starting to hit the fan in the early 2000s, and I was appalled by how biologists were handling the data — as one example, we had about $500k worth of sequences sitting on a shared Windows server, with no metadata or anything — just the filenames. As another example, I watched a postdoc manually BLAST a few thousand ESTs against the NCBI nr database; he sat there and did them three by three, having figured out that he could concatenate three sequences together and then manually deconvolve the results. As probably the most computationally experienced person in the lab, I quickly got involved in data analysis and Web site stuff, and ended up writing some comparative sequence analysis software that was mildly popular for a while.</p>
<p>As part of the sequence analysis Web site I wrote, I became aware that maintaining software was a <em>really</em> hard problem. So, towards the end of my 9 year stint in grad school, I spent a few years getting into testing, both Web testing and more generally automated software testing. This led to perhaps my most used piece of software, twill, a scripting language for Web testing. It also ended up being one of the things that got me elected into the Python Software Foundation, because I was doing everything in Python (which is a really great language, incidentally).</p>
<p>I also did some microbial genome analysis (which led to my first completely reproducible paper (Brown and Callan, 2004; <a href="http://www.ncbi.nlm.nih.gov/pubmed/14983022" target="_blank">http://www.ncbi.nlm.nih.gov/pubmed/14983022</a>)) and collaborated with the Orphan lab on some metagenomics: <a href="http://www.ncbi.nlm.nih.gov/pubmed?term=18467493" target="_blank">http://www.ncbi.nlm.nih.gov/pubmed?term=18467493</a>. This led to a fascination with the biological “dark matter” in nature that is the subject of some of my current work on metagenomics.</p>
<p>I landed my faculty position at MSU right out of grad school, because bioinformatics is sexy and CS departments are OK with hiring grad students as faculty. However, I deferred for two years to do a postdoc in Marianne Bronner-Fraser’s lab because I wanted to switch to the chick as a model organism, and so I ended up arriving at MSU in 2009. I had planned to focus a lot on developmental gene regulatory networks, but 2009 was when Illumina sequencing hit, and as one of the few people around who wasn’t visibly frightened by the term “gigabyte” I got inextricably involved in a lot of different sequence analysis projects. These all converged on assembly, and, well, that seems to be what I work on now :).</p>
<p><span>The two strongest threads that run through my research are these:</span></p>
<p><span>1. “better science through superior software” — so much of science</span><br />
<span>depends on computational inference these days, and so little of the</span><br />
<span>underlying software is “good”. Scientists <em>really</em> suck at software</span><br />
<span>development (for both good and bad reasons) and I worry that a lot of</span><br />
<span>our current science is on a really shaky foundation. This is one</span><br />
<span>reason I’m invested in Software Carpentry</span><br />
<span>(</span><a href="http://software-carpentry.org/" target="_blank"><a href="http://software-carpentry.org" target="_blank">http://software-carpentry.org</a></a><span>), a training program that Greg Wilson</span><br />
<span>has been developing — he and I agree that science is our best hope</span><br />
<span>for a positive future, and good software skills are going to be</span><br />
<span>essential for a lot of that science. More generally I hope to turn</span><br />
<span>good software development into a competitive advantage for my lab</span><br />
<span>and my students.</span></p>
<p><span>2. “better hypothesis generation is needed” — biologists, in</span><br />
<span>particular, tend to leap towards the first testable hypothesis they</span><br />
<span>find. This is a cultural thing stemming (I think) from a lot of</span><br />
<span>really bad interactions with theory: the way physicists and</span><br />
<span>mathematicians think about the world simply doesn’t fit with the Rube</span><br />
<span>Goldberg-esque features of biology (see</span><br />
<a href="http://ivory.idyll.org/blog/is-discovery-science-really-bogus.html" target="_blank"><a href="http://ivory.idyll.org/blog/is-discovery-science-really-bogus.html" target="_blank">http://ivory.idyll.org/blog/is-discovery-science-really-bogus.html</a></a><span>).</span></p>
<p><span>So getting back to the question, uh, yeah, I think I’m a computational</span><br />
<span>scientist who is working on biology? And if I need to write a little</span><br />
<span>(or a lot) of software to solve my problems, I’ll do that, and I’ll</span><br />
<span>try to do it with some attention to good software development</span><br />
<span>practice — not just out of ethical concern for correctness, but</span><br />
<span>because it makes our research move faster.</span></p>
<p><span>One thing I’m definitely <em>not</em> is a statistician. I have friends who</span><br />
<span>are statisticians, though, and they seem like perfectly nice people.</span></p>
<div class="im">
<strong>You have a pretty radical approach to open access, can you tell us a little bit about that?</strong></p>
</div>
<p><span>Ever since Mark Galassi introduced me to open source, I thought it</span><br />
<span>made sense. So I’ve been an open source-nik since … 1988?</span></p>
<p><span>From there it’s just a short step to thinking that open science makes</span><br />
<span>a lot of sense, too. When you’re a grad student or a postdoc, you</span><br />
<span>don’t get to make those decisions, though; it took until I was a PI</span><br />
<span>for me to start thinking about how to do it. I’m still conflicted</span><br />
<span>about <em>how</em> open to be, but I’ve come to the conclusion that posting</span><br />
<span>preprints is obvious</span><br />
<span>(</span><a href="http://ivory.idyll.org/blog/blog-practicing-open-science.html" target="_blank"><a href="http://ivory.idyll.org/blog/blog-practicing-open-science.html" target="_blank">http://ivory.idyll.org/blog/blog-practicing-open-science.html</a></a><span>).</span></p>
<p><span>The “radical” aspect that you’re referring to is probably my posting</span><br />
<span>of grants (</span><a href="http://ivory.idyll.org/blog/grants-posted.html" target="_blank">http://ivory.idyll.org/blog/grants-posted.html</a><span>). There are</span><br />
<span>two reasons I ended up posting all of my single-PI grants. Both have</span><br />
<span>their genesis in this past summer, when I spent about 5 months writing</span><br />
<span>6 different grants — 4 of which were written entirely by me. Ugh.</span></p>
<p><span>First, I was really miserable one day and joked on Twitter that “all</span><br />
<span>this grant writing is really cutting into my blogging” — a mocking</span><br />
<span>reference to the fact that grant writing (to get $$) is considered</span><br />
<span>academically worthwhile, while blogging (which communicates with the</span><br />
<span>public and is objectively quite valuable) counts for naught with my</span><br />
<span>employer. Jonathan Eisen responded by suggesting that I post all of</span><br />
<span>the grants and I thought, what a great idea!</span></p>
<p><span>Second, I’m sure it’s escaped most people (hah!), but grant funding</span><br />
<span>rates are in the toilet — I spent all summer writing grants while</span><br />
<span>expecting most of them to be rejected. That’s just flat-out</span><br />
<span>depressing! So it behooves me to figure out how to make them serve</span><br />
<span>multiple duties. One way to do that is to attract collaborators;</span><br />
<span>another is to serve as google bait for my lab; a third is to provide</span><br />
<span>my grad students with well-laid-out PhD projects. A fourth duty they</span><br />
<span>serve (and I swear this was unintentional) is to point out to people</span><br />
<span>that this is MY turf and I’m already solving these problems, so maybe</span><br />
<span>they should go play in less occupied territory. I know, very passive</span><br />
<span>aggressive…</span></p>
<p><span>So I posted the grants, and unknowingly joined a really awesome cadre</span><br />
<span>of folk who had already done the same</span><br />
<span>(</span><a href="http://jabberwocky.weecology.org/2012/08/10/a-list-of-publicly-available-grant-proposals-in-the-biological-sciences/" target="_blank"><a href="http://jabberwocky.weecology.org/2012/08/10/a-list-of-publicly-available-grant-proposals-in-the-biological-sciences/" target="_blank">http://jabberwocky.weecology.org/2012/08/10/a-list-of-publicly-available-grant-proposals-in-the-biological-sciences/</a></a><span>).</span><br />
<span>Most feedback I’ve gotten has been from grad students and undergrads</span><br />
<span>who really appreciate the chance to look at grants; some people told</span><br />
<span>me that they’d been refused the chance to look at grants from their</span><br />
<span>own PIs!</span></p>
<p><span>At the end of the day, I’d be lucky to be relevant enough that people</span><br />
<span>want to steal my grants or my software (which, by the way, is under a</span><br />
<span>BSD license — free for the taking, no “theft” required…). My</span><br />
<span>observation over the years is that most people will do just about</span><br />
<span>anything to avoid using other people’s software.</span></p>
<div class="im">
<strong>In theoretical statistics, there is a tradition of publishing pre-prints while papers are submitted. Why do you think biology is lagging behind?</strong></p>
</div>
<p><span>I wish I knew! There’s clearly a tradition of secrecy in biology;</span><br />
<span>just look at the Cold Spring Harbor rules re tweeting and blogging</span><br />
<span>(</span><a href="http://meetings.cshl.edu/report.html" target="_blank"><a href="http://meetings.cshl.edu/report.html" target="_blank">http://meetings.cshl.edu/report.html</a></a><span>) - this is a conference, for</span><br />
<span>chrissakes, where you go to present and communicate! I think it’s</span><br />
<span>self-destructive and leads to an insider culture where only those who</span><br />
<span>attend meetings and chat informally get to be members of the club,</span><br />
<span>which frankly slows down research. Given the societal and medical</span><br />
<span>challenges we face, this seems like a really bad way to continue doing</span><br />
<span>research.</span></p>
<p><span>One of the things I’m proudest of is our effort on the cephalopod</span><br />
<span>genome consortium’s white paper,</span><br />
<a href="http://ivory.idyll.org/blog/cephseq-cephalopod-genomics.html" target="_blank"><a href="http://ivory.idyll.org/blog/cephseq-cephalopod-genomics.html" target="_blank">http://ivory.idyll.org/blog/cephseq-cephalopod-genomics.html</a></a><span>, where a</span><br />
<span>group of bioinformaticians at the meeting pushed really hard to walk</span><br />
<span>the line between secrecy and openness. I came away from that effort</span><br />
<span>thinking two things: first, that biologists were erring on the side of</span><br />
<span>risk aversion; and second, that genome database folk were smoking</span><br />
<span>crack when they pushed for complete openness of data. (I have a blog</span><br />
<span>post on that last statement coming up at some point.)</span></p>
<p><span>The bottom line is that the incentives in academic biology are aligned</span><br />
<span>against openness. In particular, you are often rewarded for the first</span><br />
<span>observation, not for the most useful one; if your data is used to do</span><br />
<span>cool stuff, you don’t get much if any credit; and it’s all about</span><br />
<span>first/last authorship and who is PI on the grants. All too often this</span><br />
<span>means that people sit on their data endlessly.</span></p>
<p><span>This is getting particularly bad with next-gen data sets, because</span><br />
<span>anyone can generate them but most people have no idea how to analyze</span><br />
<span>their data, and so they just sit on it forever…</span></p>
<div class="im">
<strong>Do you think the ArXiv model will catch on in biology or just within the bioinformatics community?</strong></p>
</div>
<p><span>One of my favorite quotes is: “Making predictions is hard, especially</span><br />
<span>when they’re about the future.” I attribute it to Niels Bohr.</span></p>
<p><span>It’ll take a bunch of big, important scientists to lead the way. We</span><br />
<span>need key members of each subcommunity of biology to decide to do it on</span><br />
<span>a regular basis. (At this point I will take the obligatory cheap shot</span><br />
<span>and point out that Jonathan Eisen, noted open access fan, doesn’t post</span><br />
<span>his stuff to preprint servers very often. What’s up with that?) It’s</span><br />
<span>going to be a long road.</span></p>
<div class="im">
<strong>What is the reaction you most commonly get when you tell people you have posted your un-funded grants online?</strong></p>
</div>
<p><span>“Ohmigod what if someone steals them?”</span></p>
<p><span>Nobody has come up with a really convincing model for why posting</span><br />
<span>grants is a bad thing. They’re just worried that it <em>might</em> be. I</span><br />
<span>get the vague concerns about theft, but I have a hard time figuring</span><br />
<span>out exactly how it would work out well for the thief — reputation is</span><br />
<span>a big deal in science, and gossip would inevitably happen. And at</span><br />
<span>least in bioinformatics I’m aiming to be well enough known that</span><br />
<span>straight up ripping me off would be suicidal. Plus, if reviewers</span><br />
<span>do/did google searches on key concepts then my grants would pop up,</span><br />
<span>right? I just don’t see it being a path to fame and glory for anyone.</span></p>
<p><span>Revisiting the passive-aggressive nature of my grant posting, I’d like</span><br />
<span>to point out that most of my grants depend on preliminary results from</span><br />
<span>our own algorithms. So even if they want to compete on my turf, it’ll</span><br />
<span>be on a foundation I laid. I’m fine with that — more citations for</span><br />
<span>me, either way :).</span></p>
<p><span>More optimistically, I really hope that people read my grants and then</span><br />
<span>find new (and better!) ways of solving the problems posed in them. My</span><br />
<span>goal is to enable better science, not to hunker down in a tenured job</span><br />
<span>and engage in irrelevant science; if someone else can use my grants as</span><br />
<span>a positive or negative signpost to make progress, then broadly</span><br />
<span>speaking, my job is done.</span></p>
<p><span>Or, to look at it another way: I don’t have a good model for either</span><br />
<span>the possible risks OR the possible rewards of posting the grants, and</span><br />
<span>my inclinations are towards openness, so I thought I’d see what</span><br />
<span>happens.</span></p>
<div class="im">
<strong>How can junior researchers correct misunderstandings about open access/journals like PLoS One that separate correctness from impact? Do you have any concrete ideas for changing minds of senior folks who aren’t convinced?</strong></p>
</div>
<p><span>Render them irrelevant by becoming senior researchers who supplant them</span><br />
<span>when they retire. It’s the academic tradition, after all! And it’s</span><br />
<span>really the only way within the current academic system, which — for</span><br />
<span>better or for worse — isn’t going anywhere.</span></p>
<p><span>Honestly, we need fewer people yammering on about open access and more</span><br />
<span>people simply doing awesome science and submitting it to OA journals.</span><br />
<span>Conveniently, many of the high impact journals are shooting themselves</span><br />
<span>in the foot and encouraging this by rejecting good science that then</span><br />
<span>ends up in an OA journal; that wonderful ecology op-ed on PLoS One</span><br />
<span>citation rates shows this well</span><br />
<span>(</span><a href="http://library.queensu.ca/ojs/index.php/IEE/article/view/4351" target="_blank"><a href="http://library.queensu.ca/ojs/index.php/IEE/article/view/4351" target="_blank">http://library.queensu.ca/ojs/index.php/IEE/article/view/4351</a></a><span>).</span></p>
<div class="im">
<strong>Do you have any advice on what computing skills/courses statistics students interested in next generation sequencing should take?</strong></p>
</div>
<p><span>For courses, no — in my opinion 80% of what any good researcher</span><br />
<span>learns is self-motivated and often self-taught, and so it’s almost</span><br />
<span>silly to pretend that any particular course or set of skills is</span><br />
<span>sufficient or even useful enough to warrant a whole course. I’m not a</span><br />
<span>big fan of our current undergrad educational system :)</span></p>
<p><span>For skills? You need critical thinking coupled with an awareness that</span><br />
<span>a lot of smart people have worked in science, and odds are that there</span><br />
<span>are useful tricks and approaches that you can use. So talk to other</span><br />
<span>people, a lot! My lab has a mix of biologists, computer scientists,</span><br />
<span>graph theorists, bioinformaticians, and physicists; more labs should</span><br />
<span>be like that.</span></p>
<p><span>Good programming skills are going to serve you well no matter what, of</span><br />
<span>course. But I know plenty of good programmers who aren’t very</span><br />
<span>knowledgeable about biology, and who run into problems doing actual</span><br />
<span>science. So it’s not a panacea.</span></p>
<p><strong><span>How does replicable or reproducible research fit into your interests?</span></strong></p>
<p><span>I’ve wasted <em>so much time</em> reproducing other people’s work that when</span><br />
<span>the opportunity came up to put down a marker, I took it.</span></p>
<p><a href="http://ivory.idyll.org/blog/replication-i.html" target="_blank"><a href="http://ivory.idyll.org/blog/replication-i.html" target="_blank">http://ivory.idyll.org/blog/replication-i.html</a></a></p>
<p><span>The digital normalization paper shouldn’t have been particularly</span><br />
<span>radical; that it is tells you all you need to know about replication</span><br />
<span>in computational biology.</span></p>
<p><span>This is actually something I first did a long time ago, with what was</span><br />
<span>perhaps my favorite pre-faculty-job paper: if you look at the methods</span><br />
<span>for </span><span class="il">Brown</span><span> & Callan (2004) you’ll find a downloadable package that</span><br />
<span>contains all of the source code for the paper itself and the analysis</span><br />
<span>scripts. But back then I didn’t blog :).</span></p>
<p><span>Lack of reproducibility and openness in methods has serious</span><br />
<span>consequences — how much of cancer research has been useless, for</span><br />
<span>example? See <a href="http://online.wsj.com/article/SB10001424052970203764804577059841672541590.html" target="_blank">this horrific report</a>.</span><br />
<span>Again, the incentives are all wrong: you get grant money for</span><br />
<span>publishing, not for being useful. The two are not necessarily the</span><br />
<span>same…</span></p>
<p><strong><span>Do you have a family, and how do you balance work life and home life?</span></strong></p>
<p><span>Why, thank you for asking! I do have a family — my wife, Tracy Teal,</span><br />
<span>is a bioinformatician and microbial ecologist, and we have two</span><br />
<span>wonderful daughters, Amarie (4) and Jessie (1). It’s not easy being a</span><br />
<span>junior professor and a parent at the same time, and I keep on trying</span><br />
<span>to figure out how to balance the needs of travel with the need to be a</span><br />
<span>parent (hint: I’m not good at it). I’m increasingly leaning towards</span><br />
<span>blogging as being a good way to have an impact while being around</span><br />
<span>more; we’ll see how that goes.</span></p>
Statistics/statisticians need better marketing
2012-08-14T14:02:33+00:00
http://simplystats.github.io/2012/08/14/statistics-statisticians-need-better-marketing
<p>Statisticians have not always been great self-promoters. I think in part this comes from our tendency to be <a href="http://simplystatistics.org/post/25643791866/statistics-and-the-science-club" target="_blank">arbiters</a> rather than being involved in the scientific process. In some ways, I think this is a good thing. Self-promotion can quickly become really annoying. On the other hand, I think our advertising shortcomings are hurting our field in a number of different ways. </p>
<p>Here are a few:</p>
<ol>
<li>As Rafa <a href="http://simplystatistics.org/post/12241459446/we-need-better-marketing" target="_blank">points out</a> even though statisticians are ridiculously employable right now it seems like statistics M.S. and Ph.D. programs are flying under the radar in all the hype about data/data science (<a href="http://biostat.jhsph.edu/" target="_blank">here</a> is an awesome one if you are looking). Computer Science and Engineering, even the social sciences, are cornering the market on “big data”. This potentially huge and influential source of students may pass us by if we don’t advertise better. </li>
<li>A corollary to this is lack of funding. When the Big Data event happened at the White House with all the major funders in attendance to announce $200 million in new funding for big data, <a href="http://www.nsf.gov/news/news_videos.jsp?cntn_id=123607&media_id=72174&org=NSF" target="_blank">none of the invited panelists</a> were statisticians. </li>
<li>Our top awards don’t get the press they do in other fields. The Nobel Prize announcements are an international event. There is always speculation/intense interest in who will win. There is similar interest around the <a href="http://en.wikipedia.org/wiki/Fields_Medal" target="_blank">Fields medal</a> in mathematics. But the top award in statistics, the <a href="http://www.imstat.org/awards/copss_recipients.htm" target="_blank">COPSS award</a> doesn’t get nearly the attention it should. Part of the reason is lack of funding (the Fields is $15k, the COPSS is $1k). But part of the reason is that we, as statisticians, don’t announce it, share it, speculate about it, tell our friends about it, etc. The prestige of these awards can have a big impact on the visibility of a field. </li>
<li> A major component of visibility of a scientific discipline, for better or worse, is the popular press. The <a href="http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?_r=1&smid=tu-share" target="_blank">most recent article</a> in a long list of articles at the New York Times about the data revolution does not mention statistics/statisticians. Neither do the other articles. We need to cultivate relationships with the media. </li>
</ol>
<p>We are all busy solving real/hard scientific and statistical problems, so we don’t have a lot of time to devote to publicity. But here are a couple of easy ways we could rapidly increase the visibility of our field, ordered roughly by the degree of time commitment. </p>
<ol>
<li>All statisticians should have <a href="http://simplystatistics.org/post/15348632030/why-all-academics-should-have-professional-twitter" target="_blank">Twitter accounts</a> and we should share/discuss our work and ideas online. The more we help each other share, the more visibility our ideas will get. </li>
<li>We should make sure we let the ASA know about cool things that are happening with data/statistics in our organizations and they should spread the word through <a href="https://twitter.com/amstatnews" target="_blank">their Twitter account</a> and other social media. </li>
<li>We should start a conversation about who we think will win the next COPSS award in advance of the next JSM and try to get local media outlets to pick up our ideas and talk about the award. </li>
<li><a href="http://simplystatistics.org/post/20902656344/statistics-is-not-math" target="_blank">We should be more “big tent”</a> about statistics. ASA President Robert Rodriguez <a href="http://www.amstat.org/news/pdfs/RodriguezSpeech8_13_12.pdf" target="_blank">nailed this</a> in his speech at JSM. Whenever someone does something with data, we should claim them as a statistician. Sometimes this will lead to claiming people we don’t necessarily agree with. But the big tent approach is what is allowing CS and other disciplines to overtake us in the data era. </li>
<li>We should consider setting up a place for statisticians to donate money to build up the award fund for the COPSS/other statistics prizes. </li>
<li>We should try to forge relationships with start-up companies and encourage our students to pursue industry/start-up opportunities if they have interest. The less we are insular within the academic community, the more high-profile we will be. </li>
<li>It would be awesome if we started a statistical literacy outreach program in communities around the U.S. We could offer free courses in community centers to teach people how to understand polling data/the census/weather reports/anything touching data. </li>
</ol>
<p>Those are just a few of my ideas, but I have a ton more. I’m sure other people do too and I’d love to hear them. Let’s raise the tide and lift all of our boats!</p>
Johns Hopkins University Professor Louis Named to Lead Census Bureau Research Directorate
2012-08-13T22:48:13+00:00
http://simplystats.github.io/2012/08/13/johns-hopkins-university-professor-louis-named-to-lead
<p><a href="http://www.census.gov/newsroom/releases/archives/directors_corner/cb12-150.html"></a></p>
Big-Data Investing Gets Its Own Supergroup
2012-08-13T19:48:24+00:00
http://simplystats.github.io/2012/08/13/big-data-investing-gets-its-own-supergroup
<p><a href="http://bits.blogs.nytimes.com/2012/08/12/big-data-investing-gets-its-own-supergroup/?smid=tu-share">Big-Data Investing Gets Its Own Supergroup</a></p>
Sunday data/statistics link roundup (8/12/12)
2012-08-12T19:59:35+00:00
http://simplystats.github.io/2012/08/12/sunday-data-statistics-link-roundup-8-12-12
<ol>
<li>An interesting blog post about the <a href="http://caseybergman.wordpress.com/2012/07/31/top-n-reasons-to-do-a-ph-d-or-post-doc-in-bioinformaticscomputational-biology/" target="_blank">top N reasons</a> to do a Ph.D. in bioinformatics or computational biology. A couple of things that I find interesting and could actually be said of any program in biostatistics as well are: computing is the key skill of the 21st century and computational skills are highly transferrable. Via Andrew J. </li>
<li>Here is an interesting <a href="http://blog.noupsi.de/post/28896819324/why-are-americans-so" target="_blank">auto-complete map</a> of the United States where the prompt was, “Why is [state] so”. It seems like using the Google auto-complete functions can lead to all sorts of humorous data, <a href="http://xkcd.com/715/" target="_blank">xkcd</a> has used it as a data source a couple of times in the past. By the way, the person(s) who think Idaho is boring haven’t been to the right parts of Idaho. (via Rafa). </li>
<li>One of my all-time favorite statistics quotes appears <a href="http://mobile.nytimes.com/2012/08/03/opinion/brooks-the-credit-illusion.xml" target="_blank">in this column</a> by David Brooks: “…<span>what God hath woven together, even multiple regression analysis cannot tear asunder.” It seems like the perfect quote for any study that attempts to build a predictive model for a complicated phenomenon where only limited knowledge of the underlying mechanisms is available. </span></li>
<li><span>I’ve been reading up a lot on how to summarize and communicate risk. At the moment, I’ve been following a lot of David Spiegelhalter’s stuff, and really liked this <a href="http://plus.maths.org/content/understanding-uncertainty-2845-ways-spinning-risk-0" target="_blank">30,000 foot view summary</a>.</span></li>
<li><span>It is interesting how often you see R popping up in random places these days. Here is a <a href="http://www.businessinsider.com/what-actually-predicts-the-stock-market-2012-8" target="_blank">blog post</a> with some clearly R-created plots that appeared on Business Insider about predicting the stock-market. </span></li>
<li><span>Roger and I had a post on MOOC’s this week from the perspective of faculty teaching the courses. For a more departmental/administrative level view, be sure to re-read Rafa’s post on the <a href="http://simplystatistics.org/post/10764298034/the-future-of-graduate-education" target="_blank">future of graduate education</a>. </span></li>
</ol>
How Big Data Became So Big
2012-08-12T13:52:02+00:00
http://simplystats.github.io/2012/08/12/how-big-data-became-so-big
<p><a href="http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?smid=tu-share">How Big Data Became So Big</a></p>
When dealing with poop, it's best to just get your hands dirty
2012-08-11T13:21:47+00:00
http://simplystats.github.io/2012/08/11/when-dealing-with-poop-its-best-to-just-get-your
<p>I’m a relatively new dad. Before the kid we affectionately call the “tiny tornado” (TT) came into my life, I had relatively little experience dealing with babies and all the fluids they emit. So admittedly, I was a little squeamish dealing with the poopy explosions the TT would create. Inevitably, things would get much more messy than they had to be while I was being too delicate with the issue. It took me an embarrassingly long time for an educated man, but I finally realized you just have to get in there and change the thing even if it is messy, then wash your hands after. It comes off. </p>
<p>It is a similar situation in my professional life, but I’m having a harder time learning the lesson. There are frequently things that I’m not really excited to do: review a lot of papers, go to long meetings, revise a draft of that paper that has just been sitting around forever. Inevitably, once I get going they usually aren’t as difficult or as arduous as I thought. Even better, once they are done I feel a huge sense of accomplishment and relief. I used to have a metaphor for this, I’d tell myself, “Jeff, just rip off the band-aid”. Now, I think “Jeff, just get your hands dirty”. </p>
Why we are teaching massive open online courses (MOOCs) in R/statistics for Coursera
2012-08-10T14:49:18+00:00
http://simplystats.github.io/2012/08/10/why-we-are-teaching-massive-open-online-courses-moocs
<p class="MsoNormal">
<em>Editor’s Note: This post written by Roger Peng and Jeff Leek. </em>
</p>
<p class="MsoNormal">
A couple of weeks ago, we announced that we would be teaching free courses in <a href="https://www.coursera.org/course/compdata" target="_blank">Computing for Data Analysis</a> and <a href="https://www.coursera.org/course/dataanalysis" target="_blank">Data Analysis</a> on the Coursera platform. At the same time, a number of other universities also announced partnerships with Coursera leading to a <a href="https://www.coursera.org/courses" target="_blank">large number of new offerings</a>. That, coupled with a <a href="http://gigaom.com/2012/07/17/coursera-adds-first-international-university-partners-raises-additional-6m/" target="_blank">new round of funding</a> for Coursera, led to press coverage in the <a href="http://www.nytimes.com/2012/07/17/education/consortium-of-colleges-takes-online-education-to-new-level.html?pagewanted=all" target="_blank">New York Times</a>, the <a href="http://www.theatlantic.com/business/archive/2012/07/the-single-most-important-experiment-in-higher-education/259953/" target="_blank">Atlantic</a>, and other media outlets.
</p>
<p class="MsoNormal">
There was an ensuing explosion of blog posts and commentaries from academics. The opinions ranged from <a href="http://www.forbes.com/sites/susanadams/2012/07/17/is-coursera-the-beginning-of-the-end-for-traditional-higher-education/" target="_blank">dramatic</a>, to <a href="http://chronicle.com/blogs/innovations/going-public-the-uva-way/33623" target="_blank">negative</a>, to <a href="http://www.nytimes.com/2012/07/20/opinion/the-trouble-with-online-education.html?_r=2&smid=fb-share" target="_blank">critical</a>, to um…<a href="http://blogs.swarthmore.edu/burke/2012/07/20/listen-up-you-primitive-screwheads/" target="_blank">hilariously angry</a>. Rafa posted a few days ago that many of the folks freaking out are <a href="http://simplystatistics.org/post/28053129018/online-education-many-academics-are-missing-the-point" target="_blank">missing the point</a> - the opportunity to reach a much broader audience of folks with our course content.
</p>
<p class="MsoNormal">
[Before continuing, we’d like to make clear that at this point no money has been exchanged between Coursera and Johns Hopkins. Coursera has not given us anything and Johns Hopkins hasn’t given them anything. For now, it’s just a mutually beneficial partnership — we get their platform and they get to use our content. In the future, Coursera will need to figure out a way to make money, and they are currently considering a number of options.]
</p>
<p class="MsoNormal">
Now that the initial wave of hype has died down, we thought we’d outline why we are excited about participating in Coursera. We think it is only fair to start by saying this is definitely an experiment. Coursera is a newish startup and as such is still figuring out its plan/business model. Similarly, our involvement so far has been a bit of a whirlwind; we haven’t actually taught the courses yet, and we are happy to collect data and see how things turn out. So ask us again in 6 months when we are both done teaching.
</p>
<p class="MsoNormal">
But for now, this is why we are excited.
</p>
<ol>
<li><strong>Open Access.</strong> As Rafa alluded to in his post, this is an opportunity to reach a broad and diverse audience. As academics devoted to open science, we also think that opening up our courses to the biggest possible audience is, in principle, a good thing. That is why we are both basing our courses on free software and teaching the courses for free to anyone with an internet connection. </li>
<li><strong>Excitement about statistics.</strong> The data revolution means that there is a really intense interest in statistics right now. It’s so exciting that <a href="http://simplystatistics.org/post/16170052064/interview-with-joe-blitzstein" target="_blank">Joe Blitzstein’s</a> stat class on iTunes U has been one of the top courses on that platform. Our local superstar John McGready has also put his <a href="http://simplystatistics.org/post/27046976568/statistical-reasoning-on-itunes-u" target="_blank">statistical reasoning course</a> up on iTunes U to a similar explosion of interest. Rafa recently put his <a href="http://www.youtube.com/user/rafalabchannel?feature=results_main" target="_blank">statistics for genomics</a> lectures up on Youtube and they have already been viewed thousands of times. As people who are super pumped about the power and importance of statistics, we want to get in on the game. </li>
<li><strong>We work hard to develop good materials.</strong> We put effort into building materials that our students will find useful. We want to maximize the impact of these efforts. We have over 30,000 students enrolled in our two courses so far. </li>
<li><strong>It is an exciting experiment.</strong> Online teaching, including very very good online teaching, has been around for a long time. But the model of free courses at incredibly large scale is actually really new. Whether you think it is a gimmick or something here to stay, it is exciting to be part of the first experimental efforts to build courses at scale. Of course, this could flame out. We don’t know, but that is the fun of any new experiment. </li>
<li><strong>Good advertising.</strong> Every professor at a research school is a start-up of one. This idea deserves its own blog post. But if you accept that premise, to keep the operation going you need good advertising. One way to do that is writing good research papers, another is having awesome students, a third is giving talks at statistical and scientific conferences. This is an amazing new opportunity to showcase the cool things that we are doing. </li>
<li><strong>Coursera built some cool toys.</strong> As statisticians, we love new types of data. It’s like candy. Coursera has all sorts of cool toys for collecting data about drop out rates, participation, discussion board answers, peer review of assignments, etc. We are pretty psyched to take these out for a spin and see how we can use them to improve our teaching.</li>
<li><strong>Innovation is going to happen in education.</strong> The music industry spent years fighting a losing battle over music sharing. Mostly, this damaged their reputation and stopped them from developing new technology like iTunes/Spotify that became hugely influential/profitable. Education has been done the same way for hundreds (or thousands) of years. As new educational technologies develop, we’d rather be on the front lines figuring out the best new model than fighting to hold on to the old model. </li>
</ol>
<p>Finally, we’d like to say a word about why we think in-person education isn’t really threatened by MOOCs, at least for our courses. If you take one of our courses through Coursera you will get to see the lectures and do a few assignments. We will interact with students through message boards, videos, and tutorials. But there are only 2 of us and 30,000 people registered. So you won’t get much one on one interaction. On the other hand, if you come to the top <a href="http://www.biostat.jhsph.edu/" target="_blank">Ph.D. program in biostatistics</a> and take Data Analysis, you will now get 16 weeks of one-on-one interaction with Jeff in a classroom, working on tons of problems together. In other words, putting our lectures online now means at Johns Hopkins you get the most qualified TA you have ever had. Your professor. </p>
A non-exhaustive list of things I have failed to accomplish
2012-08-09T19:07:37+00:00
http://simplystats.github.io/2012/08/09/a-non-exhaustive-list-of-things-i-have-failed-to
<p>A few years ago I stumbled across a blog post that described a person’s complete cv. The idea was that the cv listed both the things they had accomplished and the things they had failed to accomplish. At the time, it really helped me to see that to be successful you have to be willing to fail over and over. </p>
<p>I use <a href="http://biostat.jhsph.edu/~jleek/" target="_blank">my website</a> to show the things I have accomplished career-wise. But I have also failed to achieve a lot of the things I set out to do. The reason was that there was strong competition for the awards/positions I was up for and other deserving people got them. </p>
<ol>
<li>Applied to MIT undergrad in 1999 - rejected</li>
<li>Donovan J. Thompson Award 2001 - did not receive</li>
<li>Applied for <a href="http://www.act.org/goldwater/" target="_blank">Barry Goldwater scholarship</a> 2002 - rejected</li>
<li>Applied for NSF Pre-Doctoral Fellowship 2003 - rejected</li>
<li>Applied for graduate school in math at MIT 2003 - rejected</li>
<li>One of my first 3 papers rejected at PLoS Biology 2005</li>
<li>Many subsequent rejections of papers - too many to list exhaustively but here is <a href="http://simplystatistics.org/post/26977029850/my-worst-recent-experience-with-peer-review" target="_blank">one example</a></li>
<li>Applied for <a href="http://www.amstat.org/committees/commdetails.cfm?txtComm=CCRAWD04" target="_blank">Youden Award</a> 2010 - rejected</li>
<li>Applied for Microsoft Faculty Fellowship 2012 - rejected</li>
<li>Applied for Sloan Fellowship 2012 - rejected</li>
<li>Many grants have been rejected, again too long to list exhaustively </li>
</ol>
On the relative importance of mathematical abstraction in graduate statistical education
2012-08-08T15:40:15+00:00
http://simplystats.github.io/2012/08/08/on-the-relative-importance-of-mathematical-abstraction
<p><em>Editor’s Note: This is the counterpoint in our series of posts on the value of abstraction in graduate education. See Brian’s <a href="http://simplystatistics.org/post/28840726358/in-which-brian-debates-abstraction-with-t-bone" target="_blank">defense of abstraction</a> on Monday and the comments on his post, as well as the comments on our <a href="http://simplystatistics.org/post/28125455811/how-important-is-abstract-thinking-for-graduate" target="_blank">original teaser post</a> for more. See below for a full description of the T-bone inside joke*.</em></p>
<p>Brian did a good job at defining abstraction. In a cagey debater’s move, he provided an incredibly broad definition of abstraction that includes the reason we call a :-) a smiley face, the reason why we can apply least squares to a variety of data types, and the reason we write functions when programming. At this very broad level, it is clear that abstract thinking is necessary for graduate students or any other data professional.</p>
<p>But our debate was inspired by a discussion of whether measure-theoretic probability was a key component of our graduate program. There was some agreement that for many biostatistics Ph.D. students, this exact topic may not be necessary for their research or careers. Brian suggested that measure-theoretic probability was a surrogate marker for something more important - abstract thinking and the ability to generalize ideas. This is a very specific form of generalization and abstraction that is used most commonly by statisticians: the ability that permits one to prove theorems and develop statistical models that can be applied to a variety of data types. I will therefore refocus the debate on the original topic. I have three main points:</p>
<ol>
<li><span>There is an overemphasis in statistical graduate programs on abstraction, defined as the ability to prove mathematical theorems and develop general statistical methods.</span></li>
<li><span>It is possible to create incredible statistical value without developing generalizable statistical methods.</span></li>
<li><span>While abstraction as defined generally is good, overemphasis on this specific type of abstraction limits our ability to include computing and real data analysis in our curriculum. It also takes away from the most important learning experience of graduate school: performing independent research.</span></li>
</ol>
<p><strong id="internal-source-marker_0.49558418267406523"><br /><span>There is an over emphasis in statistical graduate programs on abstraction defined as the ability to prove mathematical theorems and develop general statistical methods. </span></p></strong></p>
<p>
At a top program, you can expect to take courses in very theoretical statistics, measure-theoretic probability, and an applied (or methods) sequence. The first two courses are exclusively mathematical. The third (at the programs I have visited, graduated from, or taught in), despite its name, is most generally focused on the mathematical details underlying statistical methods. The result is that most Ph.D. students are heavily trained in the mathematical theory behind statistics.
</p>
<p>
At the same time, there is a long list of skills necessary to develop a successful Ph.D. statistician. These include creativity in applications, statistical programming skills, grit to power through the <a href="http://simplystatistics.org/post/23928890537/schlep-blindness-in-statistics" target="_blank">boring/hard parts of research</a>, interpretation of statistical results on real data, the ability to identify the most important scientific problems, and a deep understanding of the scientific problems you are working on. Abstraction is on that list, but it is just one of many skills. Graduate education is a zero-sum game over a finite period of time. Our strong focus on mathematical abstraction means there is less time for everything else.
</p>
<p>
Any hard quantitative course will measure the ability of a student to abstract in the general sense Brian defined. One of these courses would be very useful for our students. But it is not clear that we should focus on mathematical abstraction to the exclusion of other important characteristics of graduate students.
</p>
<p>
<strong>It is possible to create incredible statistical value without developing generalizable statistical methods.</strong>
</p>
<p>
A major standard for success in academia is the ability to generate solutions to problems that are widely read, cited, and used. A graduate student who produces these types of solutions is likely to have a high-impact and well-respected career. In general, it is not necessary to be able to prove theorems, understand measure theory, or develop generalizable statistical models to have this type of success.
</p>
<p>
One example is one of the co-authors of our blog, best known for his work in genomics. In this field, data is noisy and full of systematic errors, and for several technologies, he invented methods to correct them. For example, he developed the <a href="http://www.ncbi.nlm.nih.gov/pubmed/12925520" target="_blank">most popular method</a> for making measurements from different experiments comparable, for removing the dependence of measurements on <a href="http://amstat.tandfonline.com/doi/abs/10.1198/016214504000000683?journalCode=uasa20" target="_blank">the letters in a gene</a>, and for <a href="http://www.nature.com/nmeth/journal/v4/n11/abs/nmeth1102.html" target="_blank">reducing variability</a> due to operators who run the machine or the ozone levels. Each of these discoveries involved: (1) deep understanding of the specific technology used, (2) a good intuition of what signals were due to biology and which were due to technology, (3) application/development of specific, somewhat ad-hoc, statistical procedures to correct the mistakes, and (4) the development and distribution of good software. His work has been hugely influential on genomics, has been cited thousands of times, and has substantially improved the quality of both biological and statistical results.
</p>
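<p>
To make the first of those examples concrete: assuming the linked method is the familiar quantile normalization procedure, its statistical core fits in a few lines of R. The sketch below is only an illustration of the idea, not the published implementation; the hard part was knowing when and why to apply it to a specific technology.
</p>
<pre><code class="language-r"># Minimal sketch of quantile normalization: force every column (sample)
# of a measurement matrix to share the same empirical distribution.
quantile_normalize <- function(mat) {
  ranks  <- apply(mat, 2, rank, ties.method = "first")
  sorted <- apply(mat, 2, sort)
  ref    <- rowMeans(sorted)                     # common reference distribution
  out    <- apply(ranks, 2, function(r) ref[r])  # map each value to its reference quantile
  dimnames(out) <- dimnames(mat)
  out
}

set.seed(1)
x <- sapply(c(1, 2, 4), function(r) rexp(200, rate = r))  # three "arrays" on different scales
round(colMeans(x), 2)                       # very different means before normalization
round(colMeans(quantile_normalize(x)), 2)   # identical after normalization
</code></pre>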
<p>
But the work did not result in knowledge that was generalizable to other areas of application, it deals with problems that are highly specialized to genomics. If these were his only contributions (they are not), he’d be a hugely successful Ph.D. statistician. But had he focused on general solutions he would have never solved the problems at hand, since the problems were highly specific to a single application. And this is just one example I know well because I work in the area. <a href="http://www.ncbi.nlm.nih.gov/pubmed/2593165" target="_blank">There</a> <a href="http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html" target="_blank">are</a> <a href="http://biostatistics.oxfordjournals.org/content/8/1/118.abstract" target="_blank">a</a> <a href="http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.1001093" target="_blank">ton</a> <a href="http://www.ped.fas.harvard.edu/people/faculty/publications_nowak/MichelScience2011.pdf" target="_blank">more</a><a href="http://www.sysecol2.ethz.ch/Refs/EntClim/K/Ku076.pdf" target="_blank"> just</a> <a href="http://www.nature.com/nature/journal/v458/n7242/full/nature08017.html" target="_blank">like</a> <a href="http://ilpubs.stanford.edu:8090/422/" target="_blank">it</a>. <br /><strong id="internal-source-marker_0.49558418267406523"><br /><span>While abstraction as defined generally is good, overemphasis on a specific type of abstraction limits our ability to include computing and real data analysis in our curriculum. It also takes away from the most important learning experience of graduate school: performing independent research.</span></p>
<p>
One could argue that the choice of statistical techniques during data analysis is abstraction, or that one needs to abstract to develop efficient software. But the ability to abstract needed for these tasks can be measured by a wide range of classes, not just measure theoretic probability. Some of these classes might teach practically applicable skills like writing fast and efficient algorithms. Many results of high statistical value do not require mathematical proofs, abstract inductive reasoning, or asymptotic theory. It is a good idea to have some people who can abstract away the science behind statistical methods to the core mathematical philosophy. But our current curriculum is too heavily weighted in this direction. In some cases, statisticians are even being left behind because they do not have sufficient time in their curriculum to develop the computational skills and amass the necessary subject matter knowledge needed to compete with the increasingly diverse set of engineers, computer scientists, data scientists, and computational biologists tackling the same scientific problems.
</p>
<p>
We need to reserve a larger portion of graduate education for diving deeply into specific scientific problems, even if it means students spend less time developing generalizable/abstract statistical ideas.
</p>
<p>
<em>* Inside joke explanation: Two years ago at JSM I ran a footrace with <a href="http://www.biostat.jhsph.edu/~jgoldsmi/" target="_blank">this guy</a> for the rights to the name “Jeff” in the department of Biostatistics at Hopkins for the rest of 2011. Unfortunately, we did not pro-rate for age and he nipped me by about a half-yard. True to my word, I went by Tullis (my middle name) for a few months, including on the <a href="http://biostat.jhsph.edu/~jleek/jsm-2011-title-slide.pdf" target="_blank">title slide</a> of my JSM talk. This was, of course, immediately subjected to all sorts of nicknaming and B-Caffo loves to use “T-bone”. I apologize on behalf of those that brought it up.</em>
</p>
My worst nightmare...
2012-08-07T14:44:31+00:00
http://simplystats.github.io/2012/08/07/my-worst-nightmare
<p>I don’t know if you have <a href="http://www.emptyage.com/post/28679875595/yes-i-was-hacked-hard" target="_blank">seen this</a> about a person who’s iCloud account was hacked. But man does it freak me out. As a person who relies pretty heavily on cloud-based storage devices and does some cloud-computing based research as well, this is a pretty freaky scenario. Time to go back everything up again…</p>
In which Brian debates abstraction with T-Bone
2012-08-06T16:09:00+00:00
http://simplystats.github.io/2012/08/06/in-which-brian-debates-abstraction-with-t-bone
<p><em>Editor’s Note: This is the first in a set of point-counterpoint posts related to the value of abstract thinking in graduate education that <a href="http://simplystatistics.org/post/28125455811/how-important-is-abstract-thinking-for-graduate" target="_blank">we teased</a> a few days ago. <a href="http://www.bcaffo.com/" target="_blank">Brian Caffo</a>, recently installed Graduate Program Director at the <a href="http://www.biostat.jhsph.edu/" target="_blank">best Biostat department in the country</a>, has kindly agreed to lead off with the case for abstraction. We’ll follow up later in the week with my counterpoint. In the meantime, there have already been a number of really interesting and insightful comments inspired by our teaser post that are well worth reading. See the comments <a href="http://simplystatistics.org/post/28125455811/how-important-is-abstract-thinking-for-graduate" target="_blank">here</a>. </em></p>
<p>The impetus for writing this blog post came out of a particularly heady lunchroom discussion on the role of measure theoretic probability in our curriculum. We have a very mathematically rigorous program at Hopkins Biostatistics that includes a full academic year of measure theoretic probability. As elsewhere, many faculty dispute the necessity of this course. I am in favor of it. My principal reason is that I believe it is useful for building up and evaluating a student’s abilities in abstraction and generalization.</p>
<p>In our discussion, abstraction was the real point of contention. Emphasizing abstraction versus more immediately practical tools is an age-old argument of ivory tower stereotypes (the philosopher archetype) versus equally stereotypically scientific pragmatists (the engineering archetype).</p>
<p>So, let’s begin picking this scab. For your sake and mine, I’ll try to be brief.</p>
<p>
<strong>My definitions:</strong>
</p>
<p>
<strong id="internal-source-marker_0.6420917874202132"><span>Abstraction</span><span> -</span></strong> reducing a technique, idea or concept to its essence or core.
</p>
<p>
<strong id="internal-source-marker_0.6420917874202132"><span>Generalization</span><span> - </span><span> </span></strong>extending a technique, idea or concept to areas for which it was not originally intended.
</p>
<p>
<strong id="internal-source-marker_0.6420917874202132"><span>PhD</span><span> - </span></strong>a post baccalaureate degree that requires substantial new contributions to knowledge.
</p>
<p>
<span>The term “substantial new contributions” in my definition of a PhD is admittedly fuzzy. To tie it down, examples that I think do create new knowledge in the field of statistics include:</span>
</p>
<ol>
<li>
<span>applying existing techniques to data where they have not been used before (generalization of the application of the techniques),</span>
</li>
<li>
<span>developing statistical software (abstraction of statistical and mathematical thoughts into code),</span>
</li>
<li>
<span>developing new statistical methods from existing ones (generalization), </span>
</li>
<li>
<span>proving new theory (both abstraction and generalization) and</span>
</li>
<li>
<span>creating new data analysis pipelines (both abstraction and generalization). </span>
</li>
</ol>
<p>
In every one of these examples, generalization or abstraction is what differentiates it from a purely technical accomplishment.
</p>
<p>
To give a contrary activity, consider statistical technical specialization. That is, the application of an existing method to data where the method is already known to be effective and no new statistical thought is required. Regardless of how necessary, difficult or important applying that method is, such activity does not constitute the creation of new statistical knowledge, even if it is a <a href="http://simplystatistics.org/post/23928890537/schlep-blindness-in-statistics" target="_blank">necessary schlep</a> in the creation of new knowledge of another sort.
</p>
<p>
Though many statistics graduate-level activities require substantial technical specialization, to be doctoral statistical research in a way that satisfies my definition, generalization and abstraction are necessary components.
</p>
<p>
I further contend that abstraction is a key tool for obtaining meaningful generalization. A method, theory, analysis, etcetera can not be retooled to non-intended use without stripping away some of its specialization and abstracting it to its core utility.
</p>
<p>
Abstraction is constantly necessary when applying statistical methods. For example, whenever a statistician says “Method A really was designed for a different kind of data than mine. But at its core it’s really useful for finding out B, which I need to know. So I’ll use it anyway until (if ever) I come up with something better.”
</p>
<p>
As examples: A = CLT, B = distribution for normalized means, A = principal components, B = directions of variation, A = bootstrap, B = sampling distributions, A = linear models, B = mean relationships with covariates.
</p>
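<p>
A minimal R sketch of the bootstrap example (A = bootstrap, B = sampling distributions): the abstract core, resample the data and recompute the statistic, applies to essentially any statistic, here the sample median.
</p>
<pre><code class="language-r"># Approximate the sampling distribution of the sample median by resampling.
set.seed(2012)
x <- rexp(100, rate = 1)                                  # observed data
boot_medians <- replicate(10000, median(sample(x, replace = TRUE)))
sd(boot_medians)                                          # bootstrap standard error of the median
quantile(boot_medians, c(0.025, 0.975))                   # approximate 95% interval
</code></pre>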
<p>
Abstraction and generalization facilitate learning new areas. Knowledge of the abstract core of a discipline makes that knowledge much more portable. This is seen across every discipline. Musicians who know music theory can use their knowledge for any instrument; computer scientists who understand data structures and algorithms can switch languages easily; electrical engineers who understand signal processing can switch between technologies easily. Abstraction is what allows them to see past the concrete (instrument, syntax, technology) to the essence (music, algorithm, signal).
</p>
<p>
And statisticians learn statistical and probability theory. However, in statistics, abstraction is not represented only by mathematics and theory. As pointed out by <strong>the absolutely unimpeachable source, Simply Statistics</strong>, <a href="http://simplystatistics.org/post/24060354412/why-no-one-reads-the-statistics-literature-anymore" target="_blank">software is exactly an abstraction</a>.
</p>
<blockquote>
<p>
<span>I think abstraction is important and we need to continue publishing those kinds of ideas. However, I think there is one key point that the statistics community has had difficulty grasping, which is that </span><strong>software represents an important form of abstraction</strong><span>, if not the most important form …</span>
</p>
</blockquote>
<p>
<strong id="internal-source-marker_0.6420917874202132"><span>(A QED is in order, I believe.)</span></strong>
</p>
Samuel Kou wins COPSS Award
2012-08-06T13:55:00+00:00
http://simplystats.github.io/2012/08/06/samuel-kou-wins-copss-award
<p>At JSM this year we learned that <a href="http://www.people.fas.harvard.edu/~skou/" target="_blank">Samuel Kou</a> of Harvard’s Department of Statistics won the Committee of Presidents of Statistical Societies (<a href="http://nisla05.niss.org/copss/?q=copss" target="_blank">COPSS</a>) President’s award. The award is given annually to</p>
<blockquote>
<p><span>a young member of the statistical community in recognition of an outstanding contribution to the profession of statistics. </span><span>The recipient of the Presidents’ Award must be a member of at least one of the participating societies. The candidate may be chosen for a single contribution of extraordinary merit, or an outstanding aggregate of contributions, to the profession of statistics.</span><span> </span></p>
</blockquote>
<p>Samuel’s work spans a wide range of areas from biophysics to MCMC to model selection with contributions in the top journals in statistics and elsewhere. He is also a member of a highly selective group of people who have been promoted to full Professor at Harvard’s Department of Statistics. (Bonus points to those who can name the last person to achieve such a distinction.)</p>
<p>This is a well-deserved honor to an exemplary member of our field.</p>
NYC and Columbia to Create Institute for Data Sciences & Engineering
2012-07-31T17:57:01+00:00
http://simplystats.github.io/2012/07/31/nyc-and-columbia-to-create-institute-for-data-sciences
<p><a href="http://mikebloomberg.com/index.cfm?objectid=D867EFB0-C29C-7CA2-F4B1FEBC8B06249D">NYC and Columbia to Create Institute for Data Sciences & Engineering</a></p>
If I were at #JSM2012 today, here's where I'd go.
2012-07-31T13:58:01+00:00
http://simplystats.github.io/2012/07/31/if-i-were-at-jsm2012-today-heres-where-id-go
<p>Obviously, there are tons of sessions everyday at <a href="http://amstat.org/meetings/jsm/2012/index.cfm" target="_blank">JSM</a> this week and it’s physically impossible to go to everything that looks interesting. Alas, I am but one man, so choices had to be made. Here’s what looks good to me from the <a href="http://amstat.org/meetings/jsm/2012/onlineprogram/" target="_blank">JSM program</a>:</p>
<ul>
<li>8:30-10:20am: <strong>Contemporary Software Design Strategies for Statistical Methodologists</strong>, HQ-Sapphire B</li>
<li>10:30am-12:20pm: <strong>Stat-Us Update from Facebook</strong>, HQ-Sapphire EF</li>
<li>2:00-3:50pm: <strong>Astrostatistics</strong>, CC-Room 29A or <strong>Results from the 2010 Census Experimental Program</strong>, CC-Room 30A (perhaps you can run back and forth?)</li>
</ul>
<p>Lots of other good stuff out there, of course. I wouldn’t mind hearing some feedback on how these go.</p>
Why I'm Staying in Academia
2012-07-30T13:45:00+00:00
http://simplystats.github.io/2012/07/30/why-im-staying-in-academia
<p>Recently, I’ve seen a few blog posts/articles about professors leaving academia for industry or some other non-academic position. By my last count I think I’ve seen three from computer science professors leaving academia for Google. The most recent one being from <a href="http://cs.unm.edu/~terran/academic_blog/?p=113" target="_blank">Terran Lane</a> at University of New Mexico. At this point, Google should just start a recruiting office in middle of all the CS departments around the country. I think they’d get some good people.</p>
<p>Each of the “farewell” blog posts covers many of the same points—difficulty with having an impact, increasing specialization of academic research, difficult funding climate, increasing workloads—and, frankly, all of this is true to varying degrees. <a href="http://www.cc.gatech.edu/~beki/Beki.html" target="_blank">Beki Grinter</a> has already written a pretty good <a href="http://beki70.wordpress.com/2012/07/26/on-not-leaving-academia/" target="_blank">response</a>. One topic, massive open online courses (MOOCs), is something on which I’ll comment at a later date. For now, I thought I would add a few of my thoughts.</p>
<ul>
<li><strong>There’s no perfect job</strong>. Many of the problems affecting academia—difficult funding, increased workloads—are affecting other industries too. Right now we’re in the worst economic recession in decades and money is tight everywhere. I find it difficult to imagine that there’s a job out there that doesn’t suffer from some form of economic or other constraint. Academia needs to find some solutions, for sure, but times are tough everywhere unfortunately.</li>
<li><strong>This is about as close as it gets to the perfect job</strong>. Really, it’s a pretty good gig. Every day I come into work and I sit down and work on whatever I want. I’m surrounded by fantastic students and postdocs and when I walk the halls I can talk to great people who are smarter than I (even if they don’t necessarily appreciate me barging in). But that said, it’s not an easy job. The reality is that every professor is like a 1-person startup company, and you need to work pretty hard to stay afloat. (Okay, I’ve never worked at a startup, but I imagine they work pretty hard there.) They don’t tell you that in grad school but, then again, there’s a lot they don’t tell you in grad school.</li>
<li><strong>It helps to work at a medical institution where tenure is meaningless</strong>. Okay, I’m being a bit facetious here…but not really. Much of academic anxiety comes from the need to “get tenure”. At most medical institutions, while tenure exists, having it is fairly meaningless (getting it, of course, is still very tough). The reason is that most medical researchers are funded on soft money, so somewhere between 60% and 100% of their salary is paid from grants. Whether this is a good way or a terrible way to do things is worth discussing at a later date, but the end result is if you can’t fund your salary, getting tenure isn’t going to magically come up with the missing dollars. Universities can’t afford it using the current model. So while tenure is a tremendous privilege and honor and will secure your position at the University, it can’t secure your salary. In the end, what I really need to be focusing on is doing the best research. There’s really no “game” to play here.</li>
<li><strong>The best way to have an impact is to do it</strong>. Every University is different, for sure, and some put many more constraints on their professors than others. I consider myself lucky to be working at an institution that has substantial resources and is in relatively good financial condition. So in the end, if I want to have an impact on statistics or science, I just need to decide to do it. If one day someone comes to me and says “stop what you’re doing, you need to be doing something else”, then I might need to reconsider things. But until that day comes, I’m staying put. It might turn out I’m not good enough to have an impact, but we can’t all be above average.</li>
</ul>
<p>Ultimately, I don’t want the many grad students out there who may be considering a career in academia to feel discouraged by what they might be reading on the Internets these days. There’s good and bad with every job, but I think with academia the balance is fairly positive, and you get to hang out with <a href="http://www.biostat.jhsph.edu/~jleek/" target="_blank">cool</a> <a href="http://rafalab.jhsph.edu/" target="_blank">people</a>. </p>
<p>Of course, if you’re in computer science, you should just go to Google like everyone else.</p>
Statistician (@cocteau) to show journalists how it's done
2012-07-29T17:57:33+00:00
http://simplystats.github.io/2012/07/29/statistician-cocteau-to-show-journalists-how-its
<p>Mark Hansen, a Professor at UCLA’s Departments of Statistics and Media Arts, has been appointed as the inaugural Director of the <a href="http://www.journalism.columbia.edu/news/609" target="_blank">David and Helen Gurley Brown Institute for Media Innovation</a>. The Institute is a joint venture between Columbia University’s Graduate School of Journalism and Stanford’s School of Engineering.</p>
<blockquote>
<p><span>The Institute and the collaboration between the two schools is groundbreaking in that it is designed to encourage and support new endeavors with the potential to inform and entertain in transformative ways. It will recognize the increasingly important connection between journalism and technology, bringing the best from the East and West Coasts.</span></p>
</blockquote>
<p><span>Congratulations to Mark for this fantastic opportunity!</span></p>
In Sliding Internet Stocks, Some Hear Echo of 2000
2012-07-29T13:36:01+00:00
http://simplystats.github.io/2012/07/29/in-sliding-internet-stocks-some-hear-echo-of-2000
<p><a href="http://www.nytimes.com/2012/07/28/technology/as-social-sites-shares-fall-some-hear-echo-of-2000.html?smid=tu-share">In Sliding Internet Stocks, Some Hear Echo of 2000</a></p>
Tweet up #JSM2012
2012-07-28T23:15:43+00:00
http://simplystats.github.io/2012/07/28/tweet-up-jsm2012
<p>If only because I won’t be there this year and I need to know what’s going on! Where’s the action?</p>
Predictive analytics might not have predicted the Aurora shooter
2012-07-28T17:49:13+00:00
http://simplystats.github.io/2012/07/28/predictive-analytics-might-not-have-predicted-the
<p><a href="http://blogs.computerworld.com/business-intelligenceanalytics/20749/could-data-mining-stop-mass-murderers">Predictive analytics might not have predicted the Aurora shooter</a></p>
When Picking a C.E.O. Is More Random Than Wise
2012-07-28T13:46:01+00:00
http://simplystats.github.io/2012/07/28/when-picking-a-c-e-o-is-more-random-than-wise
<p><a href="http://dealbook.nytimes.com/2012/07/24/when-picking-a-c-e-o-is-more-random-than-wise/?smid=tu-share">When Picking a C.E.O. Is More Random Than Wise</a></p>
Congress to Examine Data Sellers
2012-07-27T17:59:02+00:00
http://simplystats.github.io/2012/07/27/congress-to-examine-data-sellers
<p><a href="http://www.nytimes.com/2012/07/25/technology/congress-opens-inquiry-into-data-brokers.html?smid=tu-share">Congress to Examine Data Sellers</a></p>
How important is abstract thinking for graduate students in statistics?
2012-07-27T13:57:01+00:00
http://simplystats.github.io/2012/07/27/how-important-is-abstract-thinking-for-graduate
<p>A recent lunchtime discussion here at Hopkins brought up the somewhat-controversial topic of abstract thinking in our graduate program. We, like a lot of other biostatistics/statistics programs, require our students to take measure theoretic probability as part of the curriculum. The discussion started as a conversation about whether we should require measure theoretic probability for our students. It evolved into a discussion of the value of abstract thinking (and whether measure theoretic probability was a good tool to measure abstract thinking).</p>
<p><a href="http://www.bcaffo.com/" target="_blank">Brian Caffo</a> and I decided an interesting idea would be a point-counterpoint with the prompt, “How important is abstract thinking for the education of statistics graduate students?” Next week Brian and I will provide a point-counterpoint response based on our discussion.</p>
<p>In the meantime we’d love to hear your opinions!</p>
Smartphones, Big Data Help Fix Boston's Potholes
2012-07-26T17:57:32+00:00
http://simplystats.github.io/2012/07/26/smartphones-big-data-help-fix-bostons-potholes
<p><a href="http://www.informationweek.com/news/software/info_management/240004303"></a></p>
Online education: many academics are missing the point
2012-07-26T13:45:00+00:00
http://simplystats.github.io/2012/07/26/online-education-many-academics-are-missing-the-point
<p>Many academics are complaining about online education and warning us about how it can lead to a lower quality product. For example, the New York Times recently published <a href="http://www.nytimes.com/2012/07/20/opinion/the-trouble-with-online-education.html?_r=1&smid=fb-share" target="_blank">this</a> op-ed piece wondering if “online education [will] ever be education of the very best sort?”. Although pretty much every controlled experiment comparing online and in-class education finds that students learn just about the same under both approaches, I do agree that in-person lectures are more enjoyable to both faculty and students. But who cares? My enjoyment and the enjoyment of the 30 privileged students that physically sit in my classes seem negligible compared to the potential of reaching and educating thousands of students all over the world. Also, using recorded lectures will free up time that I can spend on one-on-one interactions with tuition paying students. But what most excites me about online education is the possibility of being part of the movement that redefines existing disciplines as the number of people learning grows by orders of magnitude. How many <a href="http://en.wikipedia.org/wiki/Srinivasa_Ramanujan" target="_blank">Ramanujan</a>s are out there eager to learn Statistics? I would love it if they learned it from me. </p>
Voters Say They Are Wary of Ads Made Just for Them
2012-07-26T13:19:35+00:00
http://simplystats.github.io/2012/07/26/voters-say-they-are-wary-of-ads-made-just-for-them
<p><a href="http://www.nytimes.com/2012/07/24/business/media/survey-shows-voters-are-wary-of-tailored-political-ads.html?smid=tu-share">Voters Say They Are Wary of Ads Made Just for Them</a></p>
Buy your own analytics startup for $15,000 (at least as of now)
2012-07-25T17:55:50+00:00
http://simplystats.github.io/2012/07/25/buy-your-own-analytics-startup-for-15-000-at-least-as
<p><a href="http://techcrunch.com/2012/07/23/pinterest-analytics-site-pinreach-puts-itself-up-for-sale-as-co-founder-joins-google/">Buy your own analytics startup for $15,000 (at least as of now)</a></p>
Really Big Objects Coming to R
2012-07-25T13:56:55+00:00
http://simplystats.github.io/2012/07/25/really-big-objects-coming-to-r
<p>I noticed in the development version of R the following note in the NEWS file:</p>
<blockquote>
<p>There is a subtle change in behaviour for numeric index values 2^31 and larger. These used never to be legitimate and so were treated as NA, sometimes with a warning. They are now legal for long vectors so there is no longer a warning, and x[2^31] <- y will now extend the vector on a 64-bit platform and give an error on a 32-bit one.</p>
</blockquote>
<p>This is significant news indeed!</p>
<p>Some background: In the old days, when most of us worked on 32-bit machines, objects in R were limited to be about 4GB in size (and practically a lot less) because memory addresses were indexed using 32 bit numbers. When 64-bit machines became more common in the early 2000s, that limit was removed. Objects could theoretically take up more memory because of the dramatically larger address space. For the most part, this turned out to be true, although there were some growing pains as R was transitioned to be runnable on 64-bit systems (I remember many of those pains).</p>
<p>However, even with the 64-bit systems, there was a key limitation, which is that vectors, one of the fundamental objects in R, could only have a maximum of 2^31-1 elements, or roughly 2.1 billion elements. This was because array indices in R were stored internally as signed integers (specifically as ‘R_len_t’), which are 32 bits on most modern systems (take a look at .Machine$integer.max in R).</p>
<p>You might think that 2.1 billion elements is a lot, and for a single vector it still is. But you have to consider the fact that internally R stores all arrays, no matter how many dimensions there are, as just long vectors. So that would limit you, for example, to a square matrix that was no bigger than roughly 46,000 by 46,000. That might have seemed like a large matrix back in 2000 but it seems downright quaint now. And if you had a 3-way array, the limit gets even smaller. </p>
<p>Now it appears that change is a comin’. The details can be found in the R source starting at revision 59005 if you follow on subversion. </p>
<p>A new type called ‘R_xlen_t’ has been introduced with a maximum value of 4,503,599,627,370,496, which is 2^52. As they say where I grew up, that’s a lot of McNuggets. So if your computer has enough physical memory, you will soon be able to index vectors (and matrices) that are significantly longer than before.</p>
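<p>To make the numbers above concrete, here is a quick check you can run yourself (my own illustration, not from the R NEWS file):</p>
<pre><code># Checking the limits discussed above (illustration only)
.Machine$integer.max        # 2147483647, i.e. 2^31 - 1
2^31 - 1                    # old maximum number of elements in a vector
floor(sqrt(2^31 - 1))       # 46340: largest square matrix that fits under that limit
2^52                        # the new R_xlen_t limit: 4,503,599,627,370,496
</code></pre>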
A Contest for Sequencing Genomes Has Its First Entry in Ion Torrent
2012-07-24T17:59:01+00:00
http://simplystats.github.io/2012/07/24/a-contest-for-sequencing-genomes-has-its-first-entry-in
<p><a href="http://bits.blogs.nytimes.com/2012/07/23/cheaper-computer-power-leading-to-sequencing-genome/?smid=tu-share">A Contest for Sequencing Genomes Has Its First Entry in Ion Torrent</a></p>
Proof by example and letters of recommendation
2012-07-24T13:58:05+00:00
http://simplystats.github.io/2012/07/24/proof-by-example-and-letters-of-recommendation
<p>In math or statistics, proof by example does not work. One example of a phenomenon does not prove anything. For example, the fact that 2 is prime doesn’t mean that all even numbers are prime. In fact, no even numbers other than 2 are prime. </p>
<p>But in other areas proof by example is the best way to demonstrate something. One example is writing letters of recommendation. It is way more convincing when I get one example of something a person has achieved:</p>
<blockquote>
<p>Kyle created the first R package that can be used to analyze terabytes of sequencing data in under an hour.</p>
</blockquote>
<p>Than something much more general but with no details:</p>
<blockquote>
<p>Bryan is an excellent programmer with a mastery of six different programming languages. </p>
</blockquote>
<p>In mathematics it makes sense why proof by example does not work. There is a concrete result and even one example violating that result means it isn’t true. On the other hand, if most of the time Kyle crushes his work, but every once in a while he has an off day and doesn’t get it done, I can live with that. That’s true of a lot of applied statistical methods too. If it works 99% of the time and 1% of the time fails but you can discover how it failed, that is still a pretty good statistical method…</p>
I.B.M. Is No Longer a Tech Bellwether (It's too busy doing statistics)
2012-07-24T02:29:50+00:00
http://simplystats.github.io/2012/07/24/i-b-m-is-no-longer-a-tech-bellwether-its-too-busy
<p><a href="http://bits.blogs.nytimes.com/2012/07/23/ibm-no-longer-a-tech-bellwether/?smid=tu-share">I.B.M. Is No Longer a Tech Bellwether (It’s too busy doing statistics)</a></p>
Facebook's Real Big Data Problem
2012-07-23T18:00:38+00:00
http://simplystats.github.io/2012/07/23/facebooks-real-big-data-problem
<p>Facebook’s first quarterly earnings report as a public company is coming out this Thursday and <a href="http://www.nytimes.com/2012/07/23/technology/facebook-advertising-efforts-face-a-day-of-judgment.html" target="_blank">everyone’s wondering what will be in it</a>. One question is whether advertisers are going to Facebook over other sites like Google.</p>
<blockquote>
<p><span>“Advertisers need more proof that actual advertising on Facebook offers a return on investment,” said Debra Aho Williamson, an analyst with </span>the market research firm eMarketer<span>. “There is such disagreement over whether Facebook is the next big thing on the Internet or whether it’s going to fail miserably.”</span></p>
<p><span>Facebook’s unique asset is the pile of personal data it collects from 900 million users. But using that data to serve up effective, profitable advertisements is a daunting task. Google has been in the advertising game longer and has roughly $40 billion in annual revenue from advertising — 10 times that of Facebook. Since the public offering, Wall Street has tempered its expectations for Facebook’s advertising revenue, and shares closed Friday at $28.76, down from their initial price of $38.</span></p>
</blockquote>
<p>There’s a pretty fundamental question here: Does it work?</p>
<p>With all the data Facebook has at its fingertips, it would be a shame if they couldn’t answer that question.</p>
Medalball: Moneyball for the olympics
2012-07-23T16:10:52+00:00
http://simplystats.github.io/2012/07/23/medalball-moneyball-for-the-olympics
<p><a href="http://www.nytimes.com/2012/07/22/sports/olympics/how-much-for-an-olympic-medal.html?_r=1&ref=magazine">Medalball: Moneyball for the olympics</a></p>
We used, you know, that statistics thingy
2012-07-23T13:59:16+00:00
http://simplystats.github.io/2012/07/23/we-used-you-know-that-statistics-thingy
<p><a href="http://nsaunders.wordpress.com/2012/07/23/we-really-dont-care-what-statistical-method-you-used/">We used, you know, that statistics thingy</a></p>
Sunday Data/Statistics Link Roundup (7/22/12)
2012-07-22T14:24:00+00:00
http://simplystats.github.io/2012/07/22/sunday-data-statistics-link-roundup-7-22-12
<ol>
<li><a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2114571" target="_blank">This paper</a> is the paper describing how Uri Simonsohn identified academic misconduct using statistical analyses. This approach has received a <a href="http://news.sciencemag.org/scienceinsider/2012/06/fraud-detection-method-called-cr.html" target="_blank">huge</a> <a href="http://www.nature.com/news/the-data-detective-1.10937" target="_blank">amount</a> of <a href="http://www.crimeandconsequences.com/crimblog/2012/07/using-statistics-to-detect-sci.html" target="_blank">press</a> in the scientific literature. The basic approach is that he calculates the standard deviations of mean/standard deviation estimates across groups being compared. Then he simulates from a Normal distribution and shows that under the Normal model, it is unlikely that the means/standard deviations are so similar. I think the idea is clever, but I wonder if the Normal model is the best choice here…could the estimates be similar because it was the same experimenter, etc.? I suppose the proof is in the pudding though, several of the papers he identifies have been retracted. </li>
<li>This is an <a href="http://blogs.swarthmore.edu/burke/2012/07/20/listen-up-you-primitive-screwheads/" target="_blank">amazing rant</a> by a history professor at Swarthmore over the development of massive online courses, like the ones <a href="http://simplystatistics.org/post/27405330688/free-statistics-courses-on-coursera" target="_blank">Roger, Brian and I</a> are teaching. I think he makes some important points (especially about how we could do the same thing with open access in a heartbeat if universities/academics threw serious muscle behind it), but I have to say, I’m personally very psyched to be involved in teaching one of these big classes. I think that statistics is a field that a lot of people would like to learn something about and I’d like to make it easier for them to do that because I love statistics. I also see the strong advantage of in-person education. The folks who enroll at Hopkins and take our courses will obviously get way more one-on-one interaction, which is clearly valuable. I don’t see why it has to be one or the other…</li>
<li>An <a href="http://www.forbes.com/sites/davefeinleib/2012/07/16/6-insights-from-facebooks-former-head-of-big-data/" target="_blank">interesting discussion</a> with Facebook’s former head of big data. I think the first point is key. A lot of the “big data” hype has just had to do with the infrastructure needed to deal with all the data we are collecting. The bigger issue (and where statisticians will lead) is figuring out what to do with the data. </li>
<li>This is a <a href="http://techcrunch.com/2012/07/14/how-authoritarianism-will-lead-to-the-rise-of-the-data-smuggler/" target="_blank">great post</a> about data smuggling. The two key points that I think are raised are: (1) how when the data get big enough, they have their own mass and aren’t going to be moved, and (2) how physically mailing hard drives is still the fastest way of transferring big data sets. That is certainly true in genomics where it is called “sneaker net” when a collaborator walks a hard drive over to our office. Hopefully putting data in physical terms will drive home the point that the new scientists are folks that deal with/manipulate/analyze data. </li>
<li>Not statistics related, but here is a <a href="http://www.hhmi.org/biointeractive/evolution/kingsley.html" target="_blank">high bar</a> to hold your work to: the bus-crash test. If you died in a bus-crash tomorrow, would your discipline notice? Yikes. Via C.T. Brown. </li>
</ol>
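<p>As referenced in the first item above, here is a rough sketch (my own illustration with made-up numbers, not Simonsohn’s code) of the kind of simulation he describes: under a Normal model, how often would independent samples produce group summary statistics as similar as the ones reported?</p>
<pre><code># Rough sketch of the "too similar to be true" idea (illustration only;
# the reported SDs and group size below are hypothetical).
set.seed(1)
reported_sds <- c(2.01, 2.03, 2.02)   # hypothetical reported group SDs
n <- 15                               # hypothetical group size
obs_spread <- sd(reported_sds)
sim_spread <- replicate(10000, {
  sds <- sapply(1:3, function(i) sd(rnorm(n, mean = 0, sd = 2)))
  sd(sds)
})
mean(sim_spread <= obs_spread)        # a tiny value suggests the reported SDs are "too similar"
</code></pre>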
Big Data on Campus
2012-07-21T20:06:10+00:00
http://simplystats.github.io/2012/07/21/big-data-on-campus
<p><a href="http://www.nytimes.com/2012/07/22/education/edlife/colleges-awakening-to-the-opportunities-of-data-mining.html?smid=tu-share">Big Data on Campus</a></p>
Risks in Big Data Attract Big Law Firms
2012-07-21T19:49:01+00:00
http://simplystats.github.io/2012/07/21/risks-in-big-data-attract-big-law-firms
<p><a href="http://www.law.com/jsp/lawtechnologynews/PubArticleLTN.jsp?id=1202563911650">Risks in Big Data Attract Big Law Firms</a></p>
Interview with Lauren Talbot - Quantitative analyst for the NYC Financial Crime Task Force
2012-07-20T13:24:49+00:00
http://simplystats.github.io/2012/07/20/interview-with-lauren-talbot-quantitative-analyst-for
<div class="im">
<p>
<strong>Lauren Talbot</strong>
</p>
<p>
<img height="325" src="http://biostat.jhsph.edu/~jleek/lauren.png" width="250" />
</p>
<p>
<strong><br /></strong>Lauren Talbot is a quantitative analyst for the New York City Financial Crime Task Force. Before working for NYC she was an analyst at Acumen LLC and got her degree in economics from Stanford University. She is a key player turning spatial data in NYC into new tools for government management. We talked to Lauren about her work, how she is using open data to do things like predict where fires might occur, and how she got started in the Financial Crime Task Force.
</p>
<p>
<strong>SS: Do you consider yourself a statistician, computer scientist, or something else?</strong>
</p>
</div>
<p>LT: A lot of us can’t call ourselves statisticians or computer scientists, even if that is a large part of what we do, because we never studied those fields formally. Quantitative or Data Analyst are popular job titles, but don’t really do justice to all the code infrastructure/systems you have to build and cultivate — you aren’t simply analyzing, you are matching and automating and illustrating, too. There is also a large creative aspect, because you have to figure out how to present the data in a way that is useful and compelling to people, many of whom have no prior experience working with data. So I am glad people have started using the term “Data Scientist,” even if it makes me chuckle a little. Ideally I would call myself “Data Artist,” or “Data Whisperer,” but I don’t think people would take me seriously.</p>
<p><strong>SS: How did you end up in the NYC Mayor’s Financial Crimes Task Force?</strong></p>
<p>LT: I actually responded to a Craigslist posting. While I was still in the Bay Area (where I went to college), I was looking for jobs in NYC because I wanted to relocate back here, where I am originally from. I was searching for SAS programmer jobs, and finding a lot of stuff in healthcare that made me yawn a little. And then I had the idea to try the government jobs section. The Financial Crimes Task Force (now part of a broader citywide analytics effort under the Office of Policy and Strategic Planning) was one of two listings that popped up, and I read the description and immediately thought “dream job!” It has turned out to be even better than I imagined, because there is such a huge opportunity to make a difference — the Bloomberg administration is actually very interested in operationalizing insights from city data, so they are listening to the data people and using their work to inform agency resource allocation and even sometimes policy. My fellow analysts are also just really fun and intelligent. I’m constantly impressed by how quickly they pick up new skills, get to the bottom of things, and jump through hoops to get things done. We also amuse and entertain each other throughout the day, which is awesome. </p>
<div class="im">
<p>
<strong>SS: Can you tell us about one of the more interesting cases you have tackled and how data analysis/statistics played into the case?</strong>
</p>
</div>
<p>LT: Since this is the NYC Mayor’s Office, dealing with city data, almost all of our analyses are in some way location-based. We are trying to answer questions like, “what locations are most likely to have a catastrophic event (e.g. fire) in the near future?” This involves combining many disparate datasets such as fire data, buildings data, emergency calls data, city planning data, even garbage data. We use the tax lot ID as a common identifier, but many of the datasets do not come with this variable - they only have a text address or intersection. In many cases, the address is entered manually and has spelling mistakes. In the beginning, we were using a point-and-click geocoding tool that the city provides that reads the text field and assigns the tax lot ID. However, it was taking a long time to prepare the data so it could be used by the program, and the program was returning many errors. When we visually inspected the errors, we saw that they were caused by minor spelling differences and naming conventions. Now, almost every week we get new datasets in different structures, and we need to geocode them immediately before we can really work with them. So we needed a geocoding program that was automated and flexible, as well as capable of geocoding addresses and intersections with spelling errors and different conventions. Over the past few months, using publicly available city planning datasets and regular expressions, my side project has been creating such a program in SAS. My first test case was self-reported data created solely through user entry. This dataset, which could only be 40% geocoded using the original tool, is now 93% geocoded using the program we developed. The program is constantly evolving and improving. Now it is assigning block faces, spellchecking street and city names, and accounting for the occasional gaps in the data. We use it for everything.</p>
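<p>Her geocoding program is written in SAS, but to give a flavor of the kind of clean-up involved, here is a toy sketch (in R, purely illustrative, not her code) of regex-based address normalization before matching on a common identifier:</p>
<pre><code># Toy illustration only (the actual program described above is in SAS):
# normalize free-text addresses before matching them to tax lot IDs.
normalize_address <- function(x) {
  x <- toupper(x)
  x <- gsub("\\bSTREET\\b", "ST", x)       # standardize common street-type words
  x <- gsub("\\bAVENUE\\b", "AVE", x)
  x <- gsub("[[:punct:]]", "", x)          # drop stray punctuation
  x <- gsub("\\s+", " ", x)                # collapse repeated whitespace
  gsub("^\\s+|\\s+$", "", x)               # trim leading/trailing spaces
}
normalize_address("123  Main Street.")     # "123 MAIN ST"
</code></pre>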
<div class="im">
<p>
<strong>SS: What are the computational tools and ideas you use most frequently in your day to day work (R, databases, regression analysis, etc.)?</strong>
</p>
</div>
<p>LT: In the beginning, all of the data was sent to us in SQL or Excel, which was not very efficient. Now we are building a multi-agency SAS platform that can be used by programmers and non-programmers. Since there are so many data sources that can work together, having a unified platform creates new discoveries that agencies can use to be more efficient or effective. For example, a building investigator can use 311 noise complaints to uncover vacated properties that are being illegally occupied. The platform employs Palantir, which is an excellent front-end tool for playing around with the data and exploring many-to-many relationships. Internally, my team has also used R, Python, Java, even VBA. Whatever gets the job done. We use a good mix of statistical tools. The bread and butter is usually manipulating and understanding new data sources, which is necessary before we can start trying to do something like run a multiple regression, for example. In the end, it’s really a mashup: text parsing, name matching, summarizing/describing/reporting using comparative statistics, geomapping, graphing, logistic regression, even kernel density, can all be part of the mix. Our guiding principle is to use the tool/analysis/strategy that has the highest return on investment of time and analyst resources for the city.</p>
<div class="im">
<p>
<strong>SS: What are the challenges of working as a quantitative analyst in a regulatory role? Is it hard to make your analyses/discoveries understandable?</strong>
</p>
</div>
<p>LT: A lot of data analysts working in government have a difficult time getting agencies and policymakers to take action based on their work due to political priorities and organizational structures. We circumvent that issue by operating based on the needs and requests of the agencies, as well as paying attention to current events. An agency or official may come to us with a problem, and we figure out what we can deliver that will be of use to them. This starts a dialogue. It becomes an iterative process, and projects can grow and morph once we have feedback. Oftentimes, it is better to use a data-mining approach, which is more understandable to non-statisticians, rather than a regression, which can seem like a black box. For example, my colleague came up with an algorithm to target properties that were a high fire risk based on the presence of illegal conversion complaints and evidence that the property owner was under financial distress. He began with a simple list of properties for the Department of Buildings to focus on, and now they go out to inspect a list of places selected by his algorithm weekly. This video of the fire chief speaking about the project illustrates the challenges encountered and why the simpler approach was ultimately successful: <a href="http://www.youtube.com/watch?v=425QSx0U8lU&feature=youtube_gdata_player" target="_blank">http://www.youtube.com/watch?v=425QSx0U8lU&feature=youtube_gdata_player</a></p>
<div class="im">
<p>
<strong>SS: Do you have any advice for statisticians/data scientists who want to get involved with open government or government data analysis?</strong>
</p>
</div>
<p>LT: I’ve found that people in government are actually very open to and interested in using data. The first challenge is that they don’t know that the data they have is of value. To be the most effective, you should get in touch with the people who have subject matter expertise (usually employees who have been working on the ground for some time), interview them, check your assumptions, and share whatever you’re seeing in the data on an ongoing basis. Not only will both parties learn faster, but it helps build a culture of interest in the data. Once people see what is possible, they will become more creative and start requesting deliverables that are increasingly actionable. The second challenge is getting data, and the legal and social/political issues surrounding that. The big secret is that so much useful data is actually publicly available. Do your research — you may find what you need without having to fight for it. If what you need is protected, however, consider whether the data would still be useful to you if scrubbed of personally identifiable information. Location-based data is a good example of this. If so, see whether you can negotiate with the data owner to obtain only the parts needed to do your analysis. Finally, you may find that the cohort of data scientists in government is all too sparse, and too few people “speak your language.” Reach out and align yourself with people in other agencies who are also working with data. This is a great way to gain new insight into the goals and issues of your administration, as well as friends to support and advise you as you navigate “the system.”</p>
Help me find the good JSM talks
2012-07-19T13:43:00+00:00
http://simplystats.github.io/2012/07/19/help-me-find-the-good-jsm-talks
<p>I’m about to head out for JSM in a couple of weeks. The sheer magnitude of the conference means it is pretty hard to figure out what talks I should attend. One approach I’ve used in the past is to identify people who I know give good talks and go to their talks. But that isn’t a very good talk-discovery mechanism. So this year I’m trying a crowd-sourcing experiment. </p>
<p>First, some background on what kind of talks I like.</p>
<ul>
<li>I strongly prefer talks where someone is tackling a problem presented by a new kind of data, whether they got that data from a collaborator, they scraped it off the web, or they generated it themselves.</li>
<li> I am 100% ok if they only used linear regression to analyze the data if it led to interesting exploratory analysis, surprising results, or a cool conclusion. </li>
<li>Major bonus points if the method is being used to solve a real-world problem.</li>
<li>I also really like creative and informative plots.</li>
<li>I prefer pictures to text/equations in general</li>
</ul>
<p>On the other hand, I really am not a fan of talks where someone developed a method, no matter how cool, then started looking around for a data set to apply it to. </p>
<p>If you know of anyone who is going to give a talk like that, can you post it in the comments or tweet it to @simplystats with the hashtag #goodJSMtalks?</p>
<p>Also, if you know anyone who gives posters <a href="http://www.bioinformaticszen.com/post/genotype-from-phenotype/" target="_blank">like this</a>, lemme know so I can drop by. </p>
<p>Thanks!!!</p>
Big data is worth nothing without big science
2012-07-19T11:40:11+00:00
http://simplystats.github.io/2012/07/19/big-data-is-worth-nothing-without-big-science
<p><a href="http://news.cnet.com/8301-1001_3-57434736-92/big-data-is-worth-nothing-without-big-science/">Big data is worth nothing without big science</a></p>
Top Universities Test the Online Appeal of Free
2012-07-18T18:00:15+00:00
http://simplystats.github.io/2012/07/18/top-universities-test-the-online-appeal-of-free
<p><a href="http://www.nytimes.com/2012/07/18/education/top-universities-test-the-online-appeal-of-free.html?smid=tu-share">Top Universities Test the Online Appeal of Free</a></p>
A closer look at data suggests Johns Hopkins is still the #1 US hospital
2012-07-18T17:31:00+00:00
http://simplystats.github.io/2012/07/18/a-closer-look-at-data-suggests-johns-hopkins-is-still
<p>The <a href="http://health.usnews.com/health-news/best-hospitals/articles/2012/07/16/best-hospitals-2012-13-the-honor-roll" target="_blank">US News best hospital 2012-2013 rankings</a> are out. The big news is that Johns Hopkins has lost its throne. For 21 consecutive years Hopkins was ranked #1, but this year Mass General Hospital (MGH) took the top spot displacing Hopkins to #2. However, <a href="http://www.linkedin.com/pub/elisabet-pujadas/46/320/722" target="_blank">Elisabet Pujadas</a>, an MD-PhD student here at Hopkins, took a close look at the data used for the rankings and made <a href="http://rafalab.jhsph.edu/simplystats/pujadasversion.JPG" target="_blank">this plot</a> (by hand!). The plot shows histograms of the rankings by speciality and shows Hopkins outperforming MGH.</p>
<p><a href="http://rafalab.jhsph.edu/simplystats/hospitalrankings.png" target="_blank"><img height="263" src="http://rafalab.jhsph.edu/simplystats/hospitalrankings.png" width="525" /></a></p>
<p>I reproduced Elisabet’s figure using R (see plot on the left above… hers is way cooler). A quick look at the histograms shows that Hopkins has many more highly ranked specialities. For example, Hopkins has 5 specialities ranked as #1 while MGH has none. Hopkins has 2 specialities ranked #2 while MGH has none. The median rank for Hopkins is 3 while for MGH it’s 5. The plot on the right plots ranks, Hopkins’ versus MGH’s, and shows that Hopkins has a better ranking for 13 out of 16 specialities considered.</p>
<p>So how does MGH get ranked higher than Hopkins? Here is U.S. News’ explanation of how they rank: </p>
<blockquote>
<p><span>To make the Honor Roll, a hospital had to earn at least one point in each of six specialties. A hospital earned two points if it ranked among the top 10 hospitals in America in any of the 12 specialties in which the US News rankings are driven mostly by objective data, such as survival rates and patient safety. Being ranked in the next 10 in those specialties earned a hospital one point. In the other four specialties, where ranking is based on each hospital’s reputation among doctors who practice that specialty, the top five hospitals in the country received two Honor Roll points and the next five got one point.</span></p>
</blockquote>
<p>This actually results in a tie of 30 points, but according to the table <a href="http://health.usnews.com/health-news/best-hospitals/articles/2012/07/16/best-hospitals-2012-13-the-honor-roll" target="_blank">here</a>, Hopkins was ranked in 15 specialities to MGH’s 16. This was the tiebreaker. But, the <a href="http://health.usnews.com/best-hospitals/area/md/johns-hopkins-hospital-6320180" target="_blank">data they put up</a> shows Hopkins ranked in all 16 specialities. Did the specialty ranked 17th do Hopkins in? In any case, a closer look at the data does suggest Hopkins is still #1.</p>
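<p>To make the quoted scoring rule concrete, here is a small sketch of the point calculation (my own illustration with hypothetical ranks, not U.S. News’ code):</p>
<pre><code># Sketch of the Honor Roll scoring rule quoted above (illustration only).
honor_points <- function(rank, reputation = FALSE) {
  top    <- if (reputation) 5 else 10     # ranks worth 2 points
  second <- if (reputation) 10 else 20    # ranks worth 1 point
  ifelse(rank <= top, 2, ifelse(rank <= second, 1, 0))
}
honor_points(c(1, 7, 15))                  # data-driven specialties: 2 2 1
honor_points(c(3, 8), reputation = TRUE)   # reputation specialties:  2 1
</code></pre>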
<p>Disclaimer: I am a professor at Johns Hopkins University.</p>
<p>The data for Hopkins is <a href="http://health.usnews.com/best-hospitals/area/md/johns-hopkins-hospital-6320180" target="_blank">here</a> and I cleaned it up and put it <a href="http://rafalab.jhsph.edu/simplystats/hopkins.txt" target="_blank">here</a>. For MGH it’s <a href="http://health.usnews.com/best-hospitals/area/ma/massachusetts-general-hospital-6140430" target="_blank">here</a> and <a href="http://rafalab.jhsph.edu/simplystats/mgh.txt" target="_blank">here</a>. The script used to make the plots is <a href="http://rafalab.jhsph.edu/simplystats/hospitalrankings.R" target="_blank">here</a>. Thanks to Elisabet for the pointer and data.</p>
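<p>For anyone wanting to reproduce the comparison, the actual script is linked above; a minimal sketch might look like the following. Note that the column name <code>rank</code> and the assumption that the two files list the same specialties in the same order are mine, not necessarily how the posted files are laid out:</p>
<pre><code># Minimal sketch (hypothetical file layout; see the linked script for the real analysis).
hopkins <- read.table("http://rafalab.jhsph.edu/simplystats/hopkins.txt", header = TRUE)
mgh     <- read.table("http://rafalab.jhsph.edu/simplystats/mgh.txt", header = TRUE)
median(hopkins$rank)            # reported above as 3
median(mgh$rank)                # reported above as 5
sum(hopkins$rank < mgh$rank)    # specialties where Hopkins ranks better (13 of 16 reported)
</code></pre>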
Johns Hopkins Coursera Statistics Courses
2012-07-18T13:51:07+00:00
http://simplystats.github.io/2012/07/18/johns-hopkins-coursera-statistics-courses
<p><a href="https://www.coursera.org/course/compdata" target="_blank">Computing for Data Analysis</a></p>
<p>[youtube http://www.youtube.com/watch?v=gk6E57H6mTs]</p>
<p><a href="https://www.coursera.org/course/dataanalysis" target="_blank">Data Analysis</a></p>
<p>[youtube http://www.youtube.com/watch?v=-lutj1vrPwQ]</p>
<p><a href="https://www.coursera.org/course/biostats" target="_blank">Mathematical Biostatistics Bootcamp</a></p>
<p>[youtube http://www.youtube.com/watch?v=ekdpaf_WT_8]</p>
Universities Reshaping Education on the Web
2012-07-17T23:56:14+00:00
http://simplystats.github.io/2012/07/17/universities-reshaping-education-on-the-web
<p><a href="http://www.nytimes.com/2012/07/17/education/consortium-of-colleges-takes-online-education-to-new-level.html?smid=tu-share">Universities Reshaping Education on the Web</a></p>
Free Statistics Courses on Coursera
2012-07-17T13:22:45+00:00
http://simplystats.github.io/2012/07/17/free-statistics-courses-on-coursera
<p>Today, we’re very excited to announce that the Biostatistics Department at Johns Hopkins is offering three new online courses through <a href="http://www.coursera.org/" target="_blank">Coursera</a>. These courses are</p>
<ul>
<li><strong><a href="https://www.coursera.org/course/dataanalysis" target="_blank">Data Analysis</a></strong>: Data have never been easier or cheaper to come by. This course will cover how to collect, clean, interpret and analyze data, then communicate your results for maximum impact.<br />
<strong>Instructor</strong>: <a href="http://www.biostat.jhsph.edu/~jleek/" target="_blank">Jeff Leek</a></li>
<li><strong><a href="https://www.coursera.org/course/compdata" target="_blank">Computing for Data Analysis</a></strong>: This course is about learning the fundamental computing skills necessary for effective data analysis. You will learn to program in R and to use R for reading data, writing functions, making informative graphs, and applying modern statistical methods.<br />
<strong>Instructor</strong>: <a href="http://www.biostat.jhsph.edu/~rpeng/" target="_blank">Roger Peng</a></li>
<li><strong><a href="https://www.coursera.org/course/biostats" target="_blank">Mathematical Biostatistics Bootcamp</a></strong>: This course presents fundamental probability and statistical concepts used in biostatistical data analysis. It is taught at an introductory level for students with junior- or senior-college level mathematical training.<br />
<strong>Instructor</strong>: <a href="http://www.bcaffo.com/" target="_blank">Brian Caffo</a></li>
</ul>
<p>These courses will be offered free of charge through Coursera to anyone interested in signing up. Those who complete the course and meet a passing grade will get a certificate of completion from Coursera.</p>
<p>Computing for Data Analysis and Mathematical Biostatistics Bootcamp will start in the fall on September 24. Data Analysis will start in the spring on January 22, 2013.</p>
Sunday Data/Statistics Link Roundup (7/15/12)
2012-07-15T13:23:19+00:00
http://simplystats.github.io/2012/07/15/sunday-data-statistics-link-roundup-7-15-12
<ol>
<li>A <a href="http://ivory.idyll.org/blog/journal-data-policies.html" target="_blank">really nice list</a> of journals software/data release policies from <a href="http://ivory.idyll.org/blog/" target="_blank">Titus’ blog</a>. Interesting that he couldn’t find a data/release policy for the New England Journal of Medicine. I wonder if that is because it publishes mostly clinical studies, where the data are often protected for privacy reasons? It seems like there is going to eventually be a big discussion of the relative importance of privacy and open data in the clinical world. </li>
<li>Some <a href="http://www.mygrid.org.uk/" target="_blank">interesting software</a> that can be used to build virtual workflows for computational science. It seems like a lot of data analysis is still done via “drag and drop” programs. I can’t help but wonder if our effort should be focused on developing drag and drop or educating the next generation of scientists to have minimum scripting capabilities. </li>
<li>We added <a href="http://www.statschat.org.nz/" target="_blank">StatsChat</a> by Thomas L. and company to our blogroll. Lots of good stuff there, for example, this recent post on <a href="http://www.statschat.org.nz/2012/07/13/when-randomized-trials-dont-help/" target="_blank">when randomized trials don’t help</a>. You can also <a href="https://twitter.com/statschat" target="_blank">follow them</a> on twitter. </li>
<li>A <a href="http://www.premiersoccerstats.com/wordpress/?p=925&utm_source=rss&utm_medium=rss&utm_campaign=processing-public-data-with-r" target="_blank">really nice post</a> on processing public data with R. As more and more public data becomes available, from governments, companies, APIs, etc. the ability to quickly obtain, process, and visualize public data is going to be hugely valuable. </li>
<li>Speaking of public data, you could get it from <a href="http://simplystatistics.org/post/11237403492/apis" target="_blank">APIs</a> or from <a href="http://simplystatistics.org/post/15182715327/list-of-cities-states-with-open-data-help-me-find" target="_blank">government websites</a>. But beware those <a href="http://simplystatistics.org/post/26068033590/motivating-statistical-projects" target="_blank">category 2 problems</a>! </li>
</ol>
Bits: Betaworks Buys What's Left of Social News Site Digg
2012-07-15T02:22:21+00:00
http://simplystats.github.io/2012/07/15/bits-betaworks-buys-whats-left-of-social-news-site
<p><a href="http://bits.blogs.nytimes.com/2012/07/12/betaworks-buys-whats-left-of-social-news-site-digg/?smid=tu-share">Bits: Betaworks Buys What’s Left of Social News Site Digg</a></p>
Bits: Mobile App Developers Scoop Up Vast Amounts of Data, Reports Say
2012-07-14T13:56:28+00:00
http://simplystats.github.io/2012/07/14/bits-mobile-app-developers-scoop-up-vast-amounts-of
<p><a href="http://bits.blogs.nytimes.com/2012/07/12/mobile-app-developers-scoop-up-vast-amounts-of-data-reports-say/?smid=tu-share">Bits: Mobile App Developers Scoop Up Vast Amounts of Data, Reports Say</a></p>
GDP Figures in China are for "reference" only
2012-07-13T17:51:34+00:00
http://simplystats.github.io/2012/07/13/gdp-figures-in-china-are-for-reference-only
<p><a href="http://www.npr.org/2012/07/13/156710844/chinas-economy-slows-to-3-year-low">GDP Figures in China are for “reference” only</a></p>
This Is Not About Statistics But Its About
2012-07-13T13:53:50+00:00
http://simplystats.github.io/2012/07/13/this-is-not-about-statistics-but-its-about
<p>[youtube http://www.youtube.com/watch?v=p3Te_a-AGqM?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]</p>
<p>This is not about statistics, but it’s about Emacs, which I’ve been using for a long time. This guy is an Emacs virtuoso, and the crazy thing is that he’s only been using it for 8 months!</p>
<p>Best line: “Should I wait for the next version of Emacs? Hell no!”</p>
<p>(Thanks to Brian C. and Kasper H. for the pointer.)</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
What is the most important code you write?
2012-07-12T13:50:40+00:00
http://simplystats.github.io/2012/07/12/what-is-the-most-important-code-you-write
<p>These days, like most people, the research I do involves writing a lot of code. A lot of it. Usually, you need some code to</p>
<ol>
<li>Process the data to take it from its original format to the format that’s convenient for you</li>
<li>Run exploratory data analyses creating plots, calculating summary statistics, etc.</li>
<li>Try statistical model 1</li>
<li>Try statistical model 2</li>
<li>Try statistical model 3</li>
<li>…</li>
<li>Fit final statistical model; if this involves MCMC then there’s usually a ton of code to do this</li>
<li>Make some more plots of results, make tables, more summary statistics of output</li>
</ol>
<p>My question is, of all this code, which is the most important? The code that fits the final model? The code that summarizes the results? Often you just see the code that fit the final statistical model and maybe some of the code that summarizes the results. The code for fitting all of the previous models and doing the exploratory analysis is lost in the ether (or at least the version control ether). Now, I’m not saying I always want to see all that other code. Usually, I am just interested in the final model.</p>
<p>The point is that the code for the final model only represents a small fraction of the work that was done to get there. This work is the bread and butter of applied statistics and it is essentially thrown out. Of course, life would be much easier if someone would just <em>tell</em> me what the final model would be every time. Then I would just fit it! But nooooo, hundreds or thousands of lines of code and numerous judgment calls go into figuring out what that last model is going to be. </p>
<p>Yet when you read a paper, it more or less looks like the final model appeared out of thin air because there’s no space/time to tell the story about everything that came before. I would say the same is true for theoretical statistics too. It’s not as if theorems/proofs appear out of nowhere. Hard work goes into figuring out both the right theorem to prove and the right way to prove it.</p>
<p>But I would argue that there’s one key difference between theoretical and applied statistics in this regard: Everyone seems to accept that theoretical statistics is hard. So when you see a theorem/proof in a paper you consciously or unconsciously realize that it must have been hard work to arrive at that point. But in a great applied statistics paper, all you get is an interesting scientific question and some graphs/tables that provide an answer. Who cares about that?</p>
<p>Seriously though, even for a seasoned applied statistician, it’s sometimes easy to forget that everything looks easy once someone else has done all the work. It’s not clear to me whether we just need to change expectations or if we need a different method for communicating the effort involved (or both). Making research reproducible would be one approach as it would require all the code for the work be available. But that’s mostly just “final model” stuff plus some data processing code. Going one step further might require that a git repository be made available. That way you could see all the history in addition to the final stuff. I’m guessing there would be some resistance to universally adopting that approach!</p>
<p>Another approach might be to allow applied stat papers to go into more of the details about the process. With strict space limitations these days, it’s often hard enough to talk about the final model. But in some cases I think I would enjoy reading the story behind the story. Some of that “backstory” would make for good instructional material for applied stat classes.</p>
Statistical Reasoning on iTunes U
2012-07-12T12:38:19+00:00
http://simplystats.github.io/2012/07/12/statistical-reasoning-on-itunes-u
<p>Our colleague, the legendary John McGready has just put his <a href="http://itunes.apple.com/us/course/statistical-reasoning-i/id535928182" target="_blank">Statistical Reasoning I</a> and <a href="http://itunes.apple.com/us/course/statistical-reasoning-ii/id538088324" target="_blank">Statistical Reasoning II</a> courses on iTunes U. This course sequence is extremely popular here at Johns Hopkins and now the entire world can experience the joy.</p>
My worst (recent) experience with peer review
2012-07-11T14:10:00+00:00
http://simplystats.github.io/2012/07/11/my-worst-recent-experience-with-peer-review
<p>My colleagues and I just published a paper <a href="http://www.biomedcentral.com/1471-2105/13/150/abstract" target="_blank">on validation of genomic results</a> in BMC Bioinformatics. It is <a href="http://www.biomedcentral.com/bmcbioinformatics/mostviewed" target="_blank">“highly accessed”</a> and we are really happy with how it turned out. </p>
<p>But it was brutal getting it published. Here is the line-up of places I sent the paper. </p>
<ul>
<li><strong>Science</strong>: Submitted 10/6/10, rejected 10/18/10 without review. I know this seems like a long shot, but this <a href="http://www.sciencemag.org/content/334/6060/1230" target="_blank">paper on validation</a> was published in Science not too long after. </li>
<li><strong>Nature Methods</strong>: Submitted 10/20/10, rejected 10/28/10 without review. Not much to say here, moving on…</li>
<li><strong>Genome Biology</strong>: Submitted 11/1/10, rejected 1/5/11. 2/3 referees thought the paper was interesting, few specific concerns raised. I felt they could be addressed so appealed on 1/10/11, appeal accepted 1/20/11, paper resubmitted 1/21/11. Paper rejected 2/25/11. 2/3 referees were happy with the revisions. One still didn’t like it. </li>
<li><strong>Bioinformatics</strong>: Submitted 3/3/11, rejected 3/13/11 without review. I appealed again; it turns out “I have checked with the editors about this for you and their opinion was that there was already substantial work in validating gene lists based on random sampling.” If anyone knows about one of those papers let me know :-).</li>
<li><span><strong>Nucleic Acids Research</strong>: Submitted 3/18/11, rejected with invitation for revision 3/22/11. Resubmitted 12/15/11 (got delayed by a few projects here) rejected 1/25/12. Reason for rejection seemed to be one referee had major “philosophical issues” with the paper.<br /></span></li>
<li><span><strong>BMC Bioinformatics</strong>: Submitted 1/31/12, first review 3/23/12, resubmitted 4/27/12, second revision requested 5/23/12, revised version submitted 5/25/12, accepted 6/14/12. <br /></span></li>
</ul>
<div>
An interesting side note is that the really brief reviews from the Genome Biology submission inspired me to <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895" target="_blank">do this paper</a>. I had time to conceive the study, get IRB approval, build a web game for peer review, recruit subjects, collect the data, analyze the data, write the paper, submit the paper to 3 journals, and have it come out 6 months before the paper that inspired it was published!
</div>
<div>
Ok, glad I got that off my chest.
</div>
<div>
What is your worst peer-review story?
</div>
How Does the Film Industry Actually Make Money?
2012-07-11T12:56:04+00:00
http://simplystats.github.io/2012/07/11/how-does-the-film-industry-actually-make-money
<p><a href="http://www.nytimes.com/2012/07/01/magazine/how-does-the-film-industry-actually-make-money.html?smid=tu-share">How Does the Film Industry Actually Make Money?</a></p>
A Northwest Pipeline to Silicon Valley
2012-07-09T12:39:16+00:00
http://simplystats.github.io/2012/07/09/a-northwest-pipeline-to-silicon-valley
<p><a href="http://www.nytimes.com/2012/07/08/technology/u-of-washington-a-northwest-pipeline-to-silicon-valley.html?smid=tu-share">A Northwest Pipeline to Silicon Valley</a></p>
Skepticism+Ideas+Grit
2012-07-09T12:32:12+00:00
http://simplystats.github.io/2012/07/09/skepticism-ideas-grit
<p>A number of people seem to have objected to my <a href="http://simplystatistics.org/post/26020538368/the-price-of-skepticism" target="_blank">post quoting Carl Sagan about skepticism</a> (hi Paramita!) and I appreciate the comments. However, I wanted to clarify why I liked the quotation. I think in order to be successful in science three things are necessary:</p>
<ol>
<li>A healthy skepticism</li>
<li>An original idea</li>
<li>Quite a bit of <a href="http://simplystatistics.org/post/23928890537/schlep-blindness-in-statistics" target="_blank">grit and moxie</a></li>
</ol>
<p>I find that too often, people consciously or unconsciously stop at (1). In fact, some people make an entire career out of (1), but it’s not one that I can personally appreciate.</p>
<p>What we need more of is skepticism coupled with new ideas, not pure skepticism. </p>
The power of power
2012-07-04T13:54:01+00:00
http://simplystats.github.io/2012/07/04/the-power-of-power
<p>Those of you living in the mid-Atlantic region are probably not reading this right now because you don’t have power. I’ve been out of power in my house since last Friday and projections are it won’t come back until the end of the week. I am lucky because my family and I have some backup options, but not everyone has those options.</p>
<p>So that leads me to this question—do power outages affect health? There have been a number of papers examining this question, mostly looking at one-off episodes, as you might expect. One paper, written by Brooke Anderson (postdoctoral fellow here) and <a href="http://environment.yale.edu/bell/" target="_blank">Michelle Bell</a> at Yale University examined the <a href="http://www.ncbi.nlm.nih.gov/pubmed/22252408" target="_blank">effect of the massive 2003 power outage in New York City on all-cause mortality</a>. This was the first city-wide blackout since 1977 and the data from the time period are striking.</p>
<p>A key point with this paper is that often mortality is underestimated in these kinds of situations because deaths are only counted if they are identified as “disaster-related” (there may be other reasons, but I won’t get into that here). The NYC Department of Health and Mental Hygiene reported the total number of deaths to be 6 over the 2-day period of the blackout, mostly from carbon monoxide poisoning. However, the paper estimated a 28% increase in all-cause mortality which, in New York, translates to an excess mortality of about 90 deaths, an order of magnitude higher than the official count.</p>
<p>The power outage in the mid-Atlantic is ongoing but things appear to be improving by the day. According to <a href="http://www.bge.com/customerservice/stormsoutages/currentoutages/pages/default.aspx" target="_blank">BGE</a>, the primary electricity provider in Baltimore City, over half of its customers in the city were without power. On top of that the region is in the middle of a heat wave that has been going on for roughly the same amount of time as the power outage. If you figure the worst of it was in the first 3 days, and if New York’s relative risk could be applied here in Baltimore (a BIG if), then given a typical daily mortality of 17 deaths in the summer months, we would expect an excess mortality for the 3-day period of about 14 deaths from all causes.</p>
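<p>To make the back-of-the-envelope arithmetic explicit, here is a minimal R sketch. The relative risk comes from the New York study and the 17 deaths per day is the rough summer baseline for Baltimore quoted above, so treat this as an illustration of the calculation, not a formal estimate.</p>
<pre>
## Back-of-the-envelope excess mortality, assuming New York's relative
## risk applies to Baltimore (a BIG if, as noted above).
rr       = 1.28   # relative risk of all-cause mortality during the blackout
baseline = 17     # typical daily deaths in Baltimore in the summer months
days     = 3      # assume the worst of the outage/heat lasted about 3 days

excess = baseline * days * (rr - 1)
round(excess)     # about 14 excess deaths over the 3-day period
</pre>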
<p>Unfortunately, it seems power outages are likely to become more frequent because of increasing stress on an aging power grid and climate change causing more extreme weather (this outage was caused by a severe thunderstorm). It seems to me that the contribution of such infrastructure failures to health problems will be an interesting problem to study in the future.</p>
Replication and validation in -omics studies - just as important as reproducibility
2012-07-03T12:57:29+00:00
http://simplystats.github.io/2012/07/03/replication-and-validation-in-omics-studies-just-as
<p>The psychology/social psychology community has made <a href="http://simplystatistics.org/post/21326470429/replication-psychology-and-big-science" target="_blank">replication</a> a <a href="http://openscienceframework.org/" target="_blank">huge focus</a> over the last year. One reason is the recent, <a href="http://blogs.discovermagazine.com/notrocketscience/2012/03/10/failed-replication-bargh-psychology-study-doyen/" target="_blank">public blow-up</a> over a famous study that did not replicate. There are also concerns about the experimental and conceptual design of these studies that go beyond simple lack of replication. In genomics, a <a href="http://simplystatistics.org/post/18378666076/the-duke-saga-starter-set" target="_blank">similar scandal</a> occurred due to what amounted to “data fudging”. Although, in the genomics case, much of the blame and focus has been on <a href="http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1267453942" target="_blank">lack of reproducibility</a> or <a href="http://www.nature.com/nature/journal/v467/n7314/full/467401b.html" target="_blank">data availability</a>. </p>
<p>I think one of the reasons that the field of genomics has focused more on reproducibility is that replication is already more consistently performed in genomics. There are two forms of replication: validation and independent replication. Validation generally refers to a replication experiment performed by the same research lab or group - with a different technology or a different data set. On the other hand, independent replication of results is usually performed by an outside laboratory. </p>
<p>Validation is by far the more common form of replication in genomics. <a href="http://www.sciencemag.org/content/334/6060/1230.full" target="_blank">In this article</a> in Science, Ioannidis and Khoury point out that validation has different meaning depending on the subfield of genomics. In GWAS studies, it is now expected that every significant result will be validated in a second large cohort with genome-wide significance for the identified variants.</p>
<p>In gene expression/protein expression/systems biology analyses, there has been no similar definition of the “criteria for validation”. Generally the experiments are performed and if a few/a majority/most of the results are confirmed, the approach is considered validated. My colleagues and I just published a paper where we define <a href="http://www.biomedcentral.com/content/pdf/1471-2105-13-150.pdf" target="_blank">a new statistical sampling</a> approach for validating lists of features in genomics studies that is somewhat less ambiguous. But I think this is only a starting point. Just like in psychology, we need to focus not just on reproducibility, but also replicability of our results, and we need new statistical approaches for evaluating whether validation/replication have actually occurred. </p>
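<p>To give a flavor of what sampling-based validation can look like, here is a toy R sketch. This is <em>not</em> the method from the paper; it only illustrates the general idea of validating a randomly sampled subset of a significant-feature list and using it to estimate the validation rate of the whole list. The feature names and counts are made up.</p>
<pre>
## Toy sketch of sampling-based validation of a feature list.
set.seed(1)
sig_features = paste0("gene", 1:500)   # hypothetical list of significant features
k = 30                                 # number of features we can afford to validate
to_validate = sample(sig_features, k)  # features chosen at random for follow-up

## Suppose the follow-up experiment confirms 24 of the 30 sampled features
confirmed = 24
binom.test(confirmed, k)$conf.int      # 95% CI for the list-wide validation rate
</pre>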
Computing and Sustainability: What Can Be Done?
2012-07-02T14:20:25+00:00
http://simplystats.github.io/2012/07/02/computing-and-sustainability-what-can-be-done
<p>Last Friday, the National Research Council released a report titled <em><a href="http://www.nap.edu/catalog.php?record_id=13415" target="_blank">Computing Research for Sustainability</a></em>, written by the NRC’s Committee on Computing Research for Environmental and Societal Sustainability, on which I served (<a href="http://www8.nationalacademies.org/onpinews/newsitem.aspx?RecordID=13415" target="_blank">press release</a>). This was a novel experience for me given that I was the only non-computer scientist on the committee. That said, I think the report is quite interesting for a number of reasons. As a statistician, I took away a few lessons.</p>
<ul>
<li><strong>Sustainability presents many opportunities for CS</strong>. One of the first things the committee did was hold a workshop where researchers from all over presented their work on CS and sustainability, and it was impressive. Everything from Shwetak Patel’s clever use of data analysis to monitor home power usage to Bill Tomlinson’s work in human-computer interaction. Very educational for me. One thing I remember is that towards the end of the workshop <a href="http://www.cms.caltech.edu/people/2994/profile" target="_blank">John Doyle</a> made some comment about IPv6 and everyone laughed and…I didn’t get it. I still don’t get it.</li>
<li><strong>CS faces a number of statistical challenges</strong>. Many of the interesting areas posed by sustainability research come across, in my mind, as statistical problems. In particular, there is a need to develop better statistical models for understanding uncertainty in a variety of systems (e.g. electrical power grids, climate models, ecological dynamics). These are CS problems because they are “big data” systems but the underlying issues are largely statistical. Overall, it seems a lot of money has been put into collecting data but relatively little investment has been made (so far) in figuring out what to do with it.</li>
<li><strong>Statistics and CS will be crashing into each other at a theater near you</strong>. In many discussions the Committee had, I couldn’t help thinking that a lot of the challenges in CS are exactly the same as in statistics. Specifically, how integrated should computer scientists be in the other sciences? Being an outsider to that area, it seems there is a debate going on between those who do “pure” computer science, like compilers and programming languages, and those who do “applied” computer science, like computational biology. This debate sounds <a href="http://simplystatistics.org/post/25643791866/statistics-and-the-science-club" target="_blank">eerily familiar</a>.</li>
</ul>
<p>It was fun to hang out with the computer scientists for a while, and this group was really exceptional. But now, back to my day job.</p>
Meet the Skeptics: Why Some Doubt Biomedical Models - and What it Takes to Win Them Over
2012-07-02T13:10:12+00:00
http://simplystats.github.io/2012/07/02/meet-the-skeptics-why-some-doubt-biomedical-models
<p><a href="http://biomedicalcomputationreview.org/content/meet-skeptics-why-some-doubt-biomedical-models-and-what-it-takes-win-them-over-0">Meet the Skeptics: Why Some Doubt Biomedical Models - and What it Takes to Win Them Over</a></p>
Sunday data/statistics link roundup (7/1)
2012-07-01T13:59:38+00:00
http://simplystats.github.io/2012/07/01/sunday-data-statistics-link-roundup-7-1
<ol>
<li>A <a href="http://www.reddit.com/r/explainlikeimfive/comments/vb8vs/eli5_what_exactly_is_obamacare_and_what_did_it/c530lfx" target="_blank">really nice </a>explanation of the elements of Obamacare. <a href="http://simplystatistics.org/post/26138675180/obamacare-is-not-going-to-solve-the-health-care-crisis" target="_blank">Rafa’s post</a> on the new <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/" target="_blank">inHealth initiative</a> Scott is leading got a lot of <a href="http://www.reddit.com/r/Health/comments/vsltn/obamacare_is_not_going_to_solve_the_health_care/" target="_blank">comments on Reddit</a>. Some of them are funny (Rafa’s spelling got rocked) and if you get past the usual level of internet-commentary politeness, some of them seem to be really relevant - especially the comments about generalizability and the economics of health care. </li>
<li>From Andrew J., a <a href="http://www.yonderbiology.com/DNA_art/acgt" target="_blank">cool visualization of the human genome</a>: they are showing every base of the human genome over the course of a year. That turns out to be about 100 bases per second. I think this is a great way to show how much information is in just one human genome. It also puts the sequencing data deluge in perspective. We are now sequencing thousands of these genomes a year and it’s only going to get faster. </li>
<li>Cosma Shalizi has a <a href="http://cscs.umich.edu/~crshalizi/weblog/920.html%20" target="_blank">nice list</a> of unsolved problems in statistics on his blog (via Edo A.). These problems primarily fall into what I call Category 1 problems in my post on <a href="http://simplystatistics.org/post/26068033590/motivating-statistical-projects" target="_blank">motivating statistical projects</a>. I think he has some really nice insight though and some of these problems sound like a big deal if one was able to solve them.</li>
<li>A really provocative talk on why <a href="http://www.youtube.com/watch?v=bBx2Y5HhplI" target="_blank">consumers are the job creators</a>. The issue of who are the job creators seems absolutely ripe for a thorough statistical analysis. There are a thousand confounders here and my guess is that most of the work so far has been Category 2 - let’s use convenient data to make a stab at this. But a thorough and legitimate data analysis would be hugely impactful. </li>
<li>Your eReader is <a href="http://online.wsj.com/article/SB10001424052702304870304577490950051438304.html?mod=rss_Today's_Most_Popular" target="_blank">collecting data</a> about you.</li>
</ol>
Obamacare is not going to solve the health care crisis, but a new initiative, led by a statistician, may help
2012-06-29T13:00:00+00:00
http://simplystats.github.io/2012/06/29/obamacare-is-not-going-to-solve-the-health-care-crisis
<p>Obamacare may help protect a vulnerable section of our population, but it does nothing to solve the real problem with health care in the US: it is unsustainably expensive and getting <strike>worst</strike> worse. In the graph below (left), per capita medical expenditures for several countries are plotted against time. The US is the black curve; other countries are in grey. On the right we see life expectancy plotted against per capita medical expenditure. Note that the US spends $8,000 per person on healthcare, more than any other country and about 40% more than Norway, the runner-up. If the US spent the same as Norway per person, as a country we would save ~$1 trillion per year. Despite the massive investment, life expectancy in the US is comparable to Chile’s, a country that spends about $1,500 per person. To make matters worse, politicians and pundits greatly oversimplify this problem by trying to blame their favorite villains while experts agree: no obvious solution exists.</p>
<p><img height="265" src="http://rafalab.jhsph.edu/simplystats/healthcare.jpg" width="511" /></p>
<p>This past Tuesday Johns Hopkins announced the launching of the <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/" target="_blank">Individualized Health Initiative</a>. This effort will be led by <a href="http://scholar.google.es/citations?user=mSO6jtEAAAAJ&hl=es" target="_blank">Scott Zeger</a>, a statistician and former chair of our department. The graphs and analysis shown above are from a presentation Scott has <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/inHealth.Overview.SLZ.June%201.2012.pdf" target="_blank">shared on the web</a>. The initiative’s goal is <span>to “discover, test, and implement health information tools that allow the individual to understand, track, and guide his or her unique health state and its trajectory over time”. In other words, by tailoring treatments and prevention schemes for individuals we can improve their health more effectively.</span></p>
<p>So how is this going to help solve the health care crisis? Scott explains that when it comes to health care, Hopkins is a self-contained microcosm: we are the patients (all employees), the providers (hospital and health system), and the insurer (Hopkins is self-insured, we are not insured by for-profit companies). And just like the rest of the country, we spend way too much per person on health care. Now, because we are self-contained, it is much easier for us to try out and evaluate alternative strategies than it is for, say, a state or the federal government. Because we are large, we can gather enough data to learn about relatively small strata. And with a statistician in charge, we will evaluate strategies empirically as opposed to ideologically. </p>
<p>Furthermore, because we are a University, we also employ economists, public health specialists, ethicists, basic biologists, engineers, biomedical researchers, and other scientists with expertise that seems indispensable for solving this problem. Under Scott’s leadership, I expect Hopkins to collect data more systematically, run well thought-out experiments to test novel ideas, leverage technology to improve diagnostics, and use existing data to create knowledge. Successful strategies may then be exported to the rest of the country. Part of the new initiative’s mission is to incentivize our very creative community of academics to participate in this endeavor. </p>
Motivating statistical projects
2012-06-28T12:33:00+00:00
http://simplystats.github.io/2012/06/28/motivating-statistical-projects
<p>It seems like half of the battle in statistics is <a href="http://normaldeviate.wordpress.com/2012/06/21/90/" target="_blank">identifying an important/unsolved problem</a>. In math this is easy: <a href="http://www.claymath.org/millennium/" target="_blank">they have a list</a>. So why is it harder for statistics? Since I have to <a href="http://simplystatistics.org/post/18493330661/statistics-project-ideas-for-students" target="_blank">think up projects to work on</a> for my research group, for classes I teach, and for exams we give, I have spent some time thinking about ways that research problems in statistics arise.</p>
<p><img height="517" src="http://biostat.jhsph.edu/~jleek/stat-projects.jpg" width="400" /></p>
<p>I borrowed a page out of Roger’s book and made a little diagram to illustrate my ideas (actually I can’t even claim credit, it was Roger’s idea to make the diagram). The diagram shows the rough relationship of science, data, applied statistics, and theoretical statistics. Science produces data (although there are other sources), the data are analyzed using applied statistical methods, and theoretical statistics concerns the math behind statistical methods. The dotted line indicates that theoretical statistics ostensibly generalizes applied statistical methods so they can be applied in other disciplines. I do think that this type of generalization is becoming harder and harder as theoretical statistics becomes farther and farther removed from the underlying science.</p>
<p>Based on this diagram I see three major sources for statistical problems: </p>
<ol>
<li><strong>Theoretical statistical problems</strong> One component of statistics is developing the mathematical and foundational theory that proves we are doing sensible things. This type of problem often seems to be inspired by popular methods that exist or are being developed but lack mathematical detail. Not surprisingly, much of the work in this area is motivated by what is mathematically possible or convenient, rather than by concrete questions that are of concern to the scientific community. This work is important, but the current distance between theoretical statistics and science suggests that the impact will be limited primarily to the theoretical statistics community. </li>
<li><strong>Applied statistics motivated by convenient sources of data.</strong> The best examples of this type of problem are the analyses in <a href="http://www.freakonomics.com/" target="_blank">Freakonomics</a>. Since both big data and <a href="http://simplystatistics.org/post/25924012903/the-problem-with-small-big-data" target="_blank">small big data</a> are now abundant, anyone with a laptop and an internet connection can download the <a href="http://books.google.com/ngrams/datasets" target="_blank">Google n-gram data</a>, a <a href="http://www.ncbi.nlm.nih.gov/geo/" target="_blank">microarray from GEO</a>, <a href="http://simplystatistics.org/post/15182715327/list-of-cities-states-with-open-data-help-me-find" target="_blank">data about your city</a>, or really <a href="http://www.factual.com/" target="_blank">data about anything</a> and perform an applied analysis. These analyses may not be straightforward for computational/statistical reasons and may even require the development of new methods. These problems are often very interesting/clever and so are often the types of analyses you hear about in newspaper articles about “Big Data”. But they may often be misleading or incorrect, since the underlying questions are not necessarily scientifically well founded. </li>
<li><strong>Applied statistics problems motivated by scientific problems. </strong>The final category of statistics problems is made up of those motivated by concrete scientific questions. The new sources of big data don’t necessarily make these problems any easier. They still start with a specific question for which the data may not be convenient and the math is often intractable. But the potential impact of solving a concrete scientific problem is huge, especially if many people who are generating data have a similar problem. Some examples of problems like this are: can we tell if one <a href="http://en.wikipedia.org/wiki/Student's_t-test" target="_blank">batch of beer is better than another</a>, how are <a href="http://en.wikipedia.org/wiki/Analysis_of_variance" target="_blank">quantitative characteristics inherited from parent to child</a>, which <a href="http://en.wikipedia.org/wiki/Proportional_hazards_models" target="_blank">treatment is better when some people are censored</a>, how do we <a href="http://en.wikipedia.org/wiki/Bootstrapping_(statistics)" target="_blank">estimate variance when we don’t know the distribution of the data</a>, or how do we <a href="http://en.wikipedia.org/wiki/False_discovery_rate" target="_blank">know which variable is important when we have millions</a>? (The variance-estimation question is sketched in code just after this list.) </li>
</ol>
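<p>As a concrete illustration of the variance-estimation question above, here is a minimal R sketch of the bootstrap: estimating the sampling variability of a statistic without assuming a distribution for the data. The data are simulated purely for illustration.</p>
<pre>
## Bootstrap estimate of the standard error of the median, with no
## distributional assumptions. Data are simulated for illustration only.
set.seed(42)
x = rexp(100, rate = 1/5)   # data from a skewed, "unknown" distribution
B = 2000                    # number of bootstrap resamples
boot_medians = replicate(B, median(sample(x, replace = TRUE)))
sd(boot_medians)            # bootstrap standard error (square it for the variance)
</pre>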
<p>So this leads back to the question, what are the biggest open problems in statistics? I would define these problems as the “high potential impact” problems from category 3. To answer this question, I think we need to ask ourselves, what are the most common problems people are trying to solve with data but can’t with what is available right now? Roger nailed this when he talked about the role of <a href="http://simplystatistics.org/post/25643791866/statistics-and-the-science-club" target="_blank">statisticians in the science club</a>. </p>
<p>Here are a few ideas that could potentially turn into high-impact statistical problems, maybe our readers can think of better ones?</p>
<ol>
<li>How do we credential students taking online courses <a href="http://simplystatistics.org/post/16759359088/why-in-person-education-isnt-dead-yet-but-a" target="_blank">at a huge scale</a>?</li>
<li>How do we <a href="http://understandinguncertainty.org/" target="_blank">communicate risk</a> about personalized medicine (or anything else) to a general population without statistical training? </li>
<li>Can you use social media as a <a href="http://www.uvm.edu/~pdodds/files/papers/others/2011/moreno2011a.pdf" target="_blank">preventative health tool</a>?</li>
<li>Can we perform <a href="http://www.cabinetoffice.gov.uk/sites/default/files/resources/TLA-1906126.pdf" target="_blank">randomized trials to improve public policy</a>?</li>
</ol>
<div>
<em>Image Credits: The Science Logo is the old logo for the <a href="http://www.usu.edu/science/" target="_blank">USU College of Science</a>, the R is the logo for the <a href="http://www.r-project.org/" target="_blank">R statistical programming language</a>, the data image is a screenshot of <a href="http://www.gapminder.org/" target="_blank">Gapminder</a>, and the theoretical statistics image comes from the Wikipedia page on the <a href="http://en.wikipedia.org/wiki/Law_of_large_numbers" target="_blank">law of large numbers</a>.</em>
</div>
<div>
<strong>Edit</strong>: I just noticed <a href="http://www.pnas.org/content/early/2012/06/22/1205259109.abstract" target="_blank">this paper</a>, which seems to support some of the discussion above. On the other hand, I think just saying lots of equations = fewer citations falls into category 2 and doesn’t get at the heart of the problem.
</div>
The price of skepticism
2012-06-27T20:25:48+00:00
http://simplystats.github.io/2012/06/27/the-price-of-skepticism
<p>Thanks to <a href="http://www.johndcook.com/blog/" target="_blank">John Cook</a> for posting this:</p>
<blockquote>
<p><span>“If you’re only skeptical, then no new ideas make it through to you. You never can learn anything. You become a crotchety misanthrope convinced that nonsense is ruling the world.” – Carl Sagan</span></p>
</blockquote>
Follow up on "Statistics and the Science Club"
2012-06-27T12:58:40+00:00
http://simplystats.github.io/2012/06/27/follow-up-on-statistics-and-the-science-club
<p>I agree with Roger’s latest <a href="http://simplystatistics.org/post/25643791866/statistics-and-the-science-club" target="_blank">post</a>: “we<span> need to expand the tent of statistics and include people who are using their statistical training to lead the new science”. I am perhaps a bit more worried than Roger. </span>Specifically, I worry that talented go-getters interested in leading science via data analysis will achieve this without engaging our research community. </p>
<p>A quantitatively trained person (an engineer, computer scientist, physicist, etc.) with strong computing skills (knows Python, C, and shell scripting), who reads, for example, “Elements of Statistical Learning” and learns R, is well on their way. Eventually, many of these users of statistics will become developers, and if we don’t keep up, then what do they need from us? Our already-written books may be enough. In fact, in genomics, I know several people like this who are already developing novel statistical methods. I want these researchers to be part of our academic departments. Otherwise, I fear we will not be in touch with the problems and data that lead to, quoting Roger, “the most exciting developments of our lifetime.” </p>
The problem with small big data
2012-06-26T12:56:13+00:00
http://simplystats.github.io/2012/06/26/the-problem-with-small-big-data
<p>There’s lots of talk about “big data” these days and I think that’s great. I think it’s bringing statistics out into the mainstream (even if they don’t call it statistics) and it’s creating lots of opportunities for people with statistics training. It’s one of the reasons we created this blog.</p>
<p>One thing that I think gets missed in much of the mainstream reporting is that, in my opinion, the biggest problems aren’t with the truly massive datasets out there that need to be mined for important information. Sure, those types of problems pose interesting challenges with respect to hardware infrastructure and algorithm design.</p>
<p>I think a bigger problem is what I call “small big data”. Small big data is the dataset that is collected by an individual whose data collection skills are far superior to his/her data analysis skills. You can think of the size of the problem as being measured by the ratio of the dataset size to the investigator’s statistical skill level. For someone with no statistical skills, any dataset represents “big data”.</p>
<p>These days, any individual can create a massive dataset with relatively few resources. In some of the work I do, we send people out with portable air pollution monitors that record pollution levels every 5 minutes over a 1-week period. People with fitbits can get highly time-resolved data about their daily movements. A single MRI can produce millions of voxels of data.</p>
<p>One challenge here is that these examples all represent datasets that are large “on paper”. That is, there are a lot of bits to store, but that doesn’t mean there’s a lot of useful information there. For example, I find people are often impressed by data that are collected with very high temporal or spatial resolution. But often, you don’t need that level of detail and can get away with coarser resolution over a wider range of scenarios. For example, if you’re interested in changes in air pollution exposure across seasons but you only measure people in the summer, then it doesn’t matter if you measure levels down to the microsecond and produce terabytes of data. Another example might be the idea that <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137276/?tool=pubmed" target="_blank">sequencing technology doesn’t in fact remove biological variability</a>, no matter how large a dataset it produces.</p>
<p>Another challenge is that the person who collected the data is often not qualified/prepared to analyze it. If the data collector didn’t arrange beforehand to have someone analyze the data, then they’re often stuck. Furthermore, usually the grant that paid for the data collection didn’t budget (enough) for the analysis of the data. The result is that there’s a lot of “small big data” that just sits around unanalyzed. This is an unfortunate circumstance, but in my experience quite common.</p>
<p>One conclusion we can draw is that we need to get more statisticians out into the field, both helping to analyze the data and, perhaps more importantly, designing good studies so that useful data are collected in the first place (as opposed to merely “big” data). But the sad truth is that there aren’t enough of us on the planet to fill the demand. So we need to come up with more creative ways to get the skills out there without requiring our physical presence.</p>
Hilary Mason: From Tiny Links, Big Insights
2012-06-26T10:01:59+00:00
http://simplystats.github.io/2012/06/26/hilary-mason-from-tiny-links-big-insights
<p><a href="http://www.businessweek.com/articles/2012-04-26/hilary-mason-from-tiny-links-big-insights">Hilary Mason: From Tiny Links, Big Insights</a></p>
The Evolution of Music
2012-06-25T18:43:43+00:00
http://simplystats.github.io/2012/06/25/the-evolution-of-music
<p><a href="http://news.sciencemag.org/sciencenow/2012/06/computer-program-evolves-music.html?ref=hp#.T-iSbWmcsYs.email">The Evolution of Music</a></p>
A specific suggestion to help recruit/retain women faculty at Hopkins
2012-06-25T12:59:26+00:00
http://simplystats.github.io/2012/06/25/a-specific-suggestion-to-help-recruit-retain-women
<p><span>A recent </span><a href="http://www.theatlantic.com/magazine/archive/2012/07/why-women-still-can-8217-t-have-it-all/9020/" target="_blank">article</a><span> by a former Obama administration official has stirred up debate over the obstacles women face in balancing work and life. This reminded me of this </span><a href="http://web.jhu.edu/administration/jhuoie/resources/women.html" target="_blank">report</a> written by a committee here at Hopkins to help resolve the current gender-based career obstacles for women faculty. The report is great, but in practice we have a long way to go. For example, my department has not hired a woman at the tenure-track level in 15 years. This drought has not been for lack of trying, as we have made several offers, but none have been accepted. One issue that has come up multiple times is “spousal hires”. Anecdotal evidence strongly suggests that in academia the “two body” problem is more common with women than men. As hard as my department has tried to find jobs for spouses, efforts are ad hoc and we get close to no institutional support. As far as I know, as an institution, Hopkins allocates no resources to spousal hires. So, a tangible improvement we could make is changing this. Another specific improvement that many agree will help women is subsidized day care. The waiting list <a href="http://www.jhbrighthorizons.org/" target="_blank">here</a> is very long (as a result few of my colleagues use it) and one still has to pay more than $1,600 a month for infant care.</p>
<p>These two suggestions are of course easier said than done as they both require $. Quite a bit, actually, and Hopkins is not rich <a href="http://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_the_United_States_by_endowment" target="_blank">compared to other well-known universities</a>. My suggestion is to <strong>get rid of the college tuition remission benefit</strong> for faculty. Hopkins covers half the college tuition for the children of all their employees. This perk helps male faculty in their 50s much more than it helps potential female recruits. So I say get rid of this benefit and use the $ for spousal hires and to further subsidize childcare.</p>
<p>It might be argued the tuition remission perk helps retain faculty, but the institution can invest in that retention on a case-by-case basis as opposed to giving the subsidy to everybody independent of merit. I suspect spousal hires and subsidized day care will be more attractive at the time of recruitment. </p>
<p>Although this post is Hopkins-specific I am sure similar reallocation of funds is possible in other universities.</p>
Sunday data/statistics link roundup (6/24)
2012-06-24T14:16:23+00:00
http://simplystats.github.io/2012/06/24/sunday-data-statistics-link-roundup-6-24
<ol>
<li>We’ve got a new domain! You can still follow us on tumblr or here: <a href="http://simplystatistics.org/" target="_blank">http://simplystatistics.org/</a>. </li>
<li>A <a href="http://www.fastcompany.com/1824499/sports-data-analytics-mit-sloan-goldsberry" target="_blank">cool article</a> on MIT’s annual sports statistics conference (via <a href="https://twitter.com/storeylab" target="_blank">@storeylab</a>). I love how the guy they chose to highlight created what I would consider a pretty simple visualization with known tools - but it turns out it is potentially a really new way of evaluating the shooting range of basketball players. This is my favorite kind of creativity in statistics.</li>
<li>This is an interesting article calling higher education a “<a href="http://nplusonemag.com/death-by-degrees" target="_blank">credentials cartel</a>”. I don’t know if I’d go quite that far; there are a lot of really good reasons for higher education institutions beyond credentialing like research, putting smart students together in classes and dorms, broadening experiences etc. But I still think there is room for a smart group of statisticians/computer scientists to solve the <a href="http://simplystatistics.org/post/16759359088/why-in-person-education-isnt-dead-yet-but-a" target="_blank">credentialing problem</a> on a big scale and have a huge impact on the education industry. </li>
<li>Check out John Cook’s <a href="http://www.johndcook.com/blog/2012/06/18/methods-that-get-used/" target="_blank">conjecture</a> on statistical methods that get used: “<span>The probability of a method being used drops by at least a factor of 2 for every parameter that has to be determined by trial-and-error.” I’m with you. I wonder if there is a corollary related to how easy the documentation is to read? </span></li>
<li>If you haven’t read Roger’s post on <a href="http://simplystatistics.org/post/25643791866/statistics-and-the-science-club" target="_blank">Statistics and the Science Club</a>, I consider it a must-read for anyone who is affiliated with a statistics/biostatistics department. We’ve had feedback by email/on twitter from other folks who are moving toward a more science oriented statistical culture. We’d love to hear from more folks with this same attitude/inclination/approach. </li>
</ol>
Statistics and the Science Club
2012-06-22T13:24:59+00:00
http://simplystats.github.io/2012/06/22/statistics-and-the-science-club
<p>One of my favorite movies is Woody Allen’s <em>Annie Hall</em>. If you’re my age and you haven’t seen it, I usually tell people it’s like <em>When Harry Met Sally</em>, except really good. The movie <a href="http://www.youtube.com/watch?v=rrxlfvI17oY" target="_blank">opens with Woody Allen’s character Alvy Singer explaining</a> that he would “never want to belong to any club that would have someone like me for a member”, a quotation he attributes to Groucho Marx (or Freud).</p>
<p>Last week I <a href="http://simplystatistics.tumblr.com/post/25177731588/statisticians-asa-and-big-data" target="_blank">posted a link</a> to ASA President Robert Rodriguez’s column in Amstat News about big data. In the post I asked what was wrong with the column and there were a few good comments from readers. In particular, Alex wrote:</p>
<blockquote>
<p><span>When discussing what statisticians need to learn, he focuses on technological changes (distributed computing, Hadoop, etc.) and the use of unstructured text data. However, Big Data requires a change in perspective for many statisticians. Models must expand to address the levels of complexity that massive datasets can reveal, and many standard techniques are limited in utility.</span></p>
</blockquote>
<p><span>I agree with this, but I don’t think it goes nearly far enough. </span></p>
<p><span>The key element missing from the column was the notion that statistics should take a leadership role in this area. I was disappointed by the lack of a more expansive vision displayed by the ASA President and the ASA’s unwillingness to claim a leadership position for the field. Despite the name “big data”, big data is really about <em>statistics</em> and statisticians should really be out in front of the field. We should not be observing what is going on and adapting to it by learning some new technologies or computing techniques. If we do that, then as a field we are just leading from behind. Rather, we should be defining what is important and should be driving the field from both an educational and research standpoint. </span></p>
<p><span>However, the new era of big data poses a serious dilemma for the statistics community that needs to be addressed before real progress can be made, and that’s what brings me to Alvy Singer’s conundrum.</span></p>
<p><span>There’s a strong tradition in statistics of being the “outsiders” to whatever field we’re applying our methods to. In many cases, we are the outsiders to scientific investigation. Even if we are neck deep in collaborating with scientists and being involved in scientific work, we still maintain our ability to criticize and judge scientists because we are “outsiders” trained in a different set of (important) skills. In many ways, this is a Good Thing. </span>The outsider status is important because it gives us the freedom to be “arbiters” and to ensure that scientists are doing the “right” things. It’s our job to keep people honest. However, being an arbiter by definition means that you are merely observing what is going on. You cannot be leading what is going on without losing your ability to arbitrate in an unbiased manner.</p>
<p><span>Big data poses a challenge to this long-standing tradition because all of a sudden statistics and science are more intertwined than ever before and statistical methodology is absolutely critical to making inferences or gaining insight from data. Because now there are data in more places than ever before, the demand for statistics is in more places than ever before. We are discovering that we can either teach people to apply the statistical methods to their data, or we can just do it ourselves!</span></p>
<p><span>This development presents an enormous opportunity for statisticians to play a new leadership role in scientific investigations because we have the skills to extract information from the data that no one else has (at least <em>for the moment</em>). But now we have to choose between being “in the club” by leading the science or remaining outside the club to be unbiased arbiters. I think as an individual it’s very difficult to be both simply because there are only 24 hours in the day. It takes an enormous amount of time to learn the scientific background required to lead scientific investigations and this is piled on top of whatever statistical training you receive. </span></p>
<p><span>However, I think as a field, we desperately need to promote both kinds of people, if only because we are the best people for the job. We need to expand the tent of statistics and include people who are using their statistical training to lead the new science. They may not be publishing papers in the <em>Annals of Statistics</em> or in <em>JASA</em>, but they <em>are</em> statisticians. If we do not move more in this direction, we risk missing out on one of the most exciting developments of our lifetime.</span></p>
Pro Tips for Grad Students in Statistics/Biostatistics (Part 2)
2012-06-20T15:44:00+00:00
http://simplystats.github.io/2012/06/20/pro-tips-for-grad-students-in-statistics-biostatistics
<p>This is the second in my series on pro tips for graduate students in statistics/biostatistics. For more tips, see <a href="http://simplystatistics.tumblr.com/post/25368234643/pro-tips-for-grad-students-in-statistics-biostatistics" target="_blank">part 1</a>. </p>
<ol>
<li>Meet with seminar speakers. When you go on the job market, face recognition is priceless. I met Scott Zeger at UW when I was a student. When I came for an interview, I already knew him (and Ingo, and Rafa, and ….). An even better idea…<em>ask a question during the seminar</em>.</li>
<li>Be a finisher. The key to getting a Ph.D. (other than passing your quals) is the ability to sit down and just power through and get it done. This means sometimes you will have to work late or on a weekend. The people who are the most successful in grad school are the people who just find a way to get it done. If it were easy…anyone would do it.</li>
<li>Work on problems you genuinely enjoy thinking about/are passionate about. A lot of statistics (and science) is long periods of concentrated effort with no guarantee of success at the end. To be a really good statistician requires a lot of patience and effort. It is a lot easier to work hard on something you like or feel strongly about.</li>
</ol>
<div>
<span>More to come soon. </span>
</div>
E.P.A. Casts New Soot Standard as Easily Met
2012-06-19T15:53:47+00:00
http://simplystats.github.io/2012/06/19/e-p-a-casts-new-soot-standard-as-easily-met
<p><a href="http://green.blogs.nytimes.com/2012/06/15/e-p-a-casts-new-soot-standard-as-easily-met/?smid=tu-share">E.P.A. Casts New Soot Standard as Easily Met</a></p>
Pro Tips for Grad Students in Statistics/Biostatistics (Part 1)
2012-06-18T16:21:00+00:00
http://simplystats.github.io/2012/06/18/pro-tips-for-grad-students-in-statistics-biostatistics-2
<div>
I just finished teaching a Ph.D. level applied statistical methods course here at Hopkins. As part of the course, I gave one “pro-tip” a day; something I wish I had learned in graduate school that has helped me in becoming a practicing applied statistician. Here are the first three, more to come soon.
</div>
<ol>
<li>A major component of being a researcher is knowing what’s going on in the research community. Set up an RSS feed of journal articles. Google Reader is a good option, but there are others. Here are some good applied stat journals: Biostatistics, Biometrics, Annals of Applied Statistics…</li>
<li>Reproducible research is a hot topic, in part because of a couple of high-profile papers that were disastrously non-reproducible (see “<a href="http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1267453942" target="_blank">Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology</a>”). When you write code for statistical analysis, try to make sure that: (a) it is neat and well-commented - liberal and specific comments are your friend; (b) it can be run by someone other than you, to produce the same results that you report.</li>
<li>In data analysis - particularly for complex high-dimensional data - it is frequently better to choose simple models for clearly defined parameters. With a lot of data, there is a strong temptation to go overboard with statistically complicated models; the danger of overfitting/over-interpreting is extreme. The most reproducible results are often produced by sensible and statistically “simple” analyses (Note: being sensible and simple does not always lead to higher-profile results). A small simulation illustrating the overfitting danger follows this list.</li>
</ol>
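<p>To illustrate the overfitting point in tip 3, here is a small R simulation on made-up data comparing a simple linear fit with a needlessly flexible polynomial; on most runs the flexible model fits the training half better but predicts the held-out half worse. The data and model choices are purely illustrative.</p>
<pre>
## Simple vs. needlessly flexible models on simulated data.
set.seed(7)
n = 100
d = data.frame(x = runif(n))
d$y = 2 * d$x + rnorm(n, sd = 0.5)   # truth is a simple linear relationship
train = d[1:50, ]
test  = d[51:100, ]

fit_simple  = lm(y ~ x, data = train)
fit_complex = lm(y ~ poly(x, 15), data = train)   # over-flexible

mse = function(fit, newdata) mean((newdata$y - predict(fit, newdata))^2)
c(simple = mse(fit_simple, test), complex = mse(fit_complex, test))
</pre>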
Sunday data/statistics link roundup (6/17)
2012-06-17T17:07:12+00:00
http://simplystats.github.io/2012/06/17/sunday-data-statistics-link-roundup-6-17
<p>Happy Father’s Day!</p>
<ol>
<li>A <a href="http://www.cabinetoffice.gov.uk/sites/default/files/resources/Final-Test-Learn-Adapt.pdf" target="_blank">really interesting read</a> on randomized controlled trials (RCTs) and public policy. The examples in the boxes are fantastic. This seems to be one of the cases where the public policy folks are borrowing ideas from Biostatistics, which has been involved in randomized controlled trials for a long time. It’s a cool example of adapting good ideas in one discipline to the specific challenges of another. </li>
<li>Roger points <a href="http://www.nytimes.com/2012/06/17/technology/acxiom-the-quiet-giant-of-consumer-database-marketing.html?_r=1" target="_blank">to this link</a> in the NY Times about the “Consumer Genome”, which basically is a collection of information about your purchases and consumer history. On Twitter, Leonid K. <a href="https://twitter.com/leonidkruglyak/status/214365264886759426" target="_blank">asks</a>: ‘<span>Since when has “genome” become a generic term for “a bunch of information”?’. I completely understand the reaction against the “genome of x”, which is an over-used analogy. I actually think the analogy isn’t that unreasonable; like a genome, the information contained in your purchase/consumer history says something about you, but doesn’t tell the whole picture. I wonder how this information could be used for public health, since it is already being used for advertising….</span></li>
<li><span>This <a href="http://peerj.com/" target="_blank">PeerJ journal</a> looks like it has the potential to be good. They even encourage open peer review, <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895" target="_blank">which has some benefits</a>. Not sure if it is sustainable; see, for example, this <a href="http://scholarlykitchen.sspnet.org/2012/06/14/is-peerj-membership-publishing-sustainable/" target="_blank">breakdown of the costs</a>. I still think we <a href="http://simplystatistics.tumblr.com/post/19289280474/a-proposal-for-a-really-fast-statistics-journal" target="_blank">can do better</a>. </span></li>
<li>Elon Musk is one of my favorite entrepreneurs. He tackles what I consider to be some of the most awe-inspiring and important problems around. <a href="http://www.mercurynews.com/business/ci_20859341/tesla-model-s-motors-readies-launch-its-all-electric-sedan-preview?refresh=no" target="_blank">This article </a>about the Tesla S got me all fired up about how a person with vision can literally change the fuel we run on. Nothing to do with statistics, other than I think now is a similarly revolutionary time for our discipline. </li>
<li>There was some <a href="https://twitter.com/johnmyleswhite/status/212573099886002176" target="_blank">interesting</a> <a href="https://twitter.com/johnmyleswhite/status/212580188452683776" target="_blank">discussion</a> on Twitter of the usefulness of the Yelp dataset I posted for academic research. Not sure if this ever got resolved, but I think more and more as data sets from companies/startups become available, the terms of use for these data will be critical. </li>
<li>I’m still working on <a href="http://simplystatistics.tumblr.com/post/25177731588/statisticians-asa-and-big-data" target="_blank">Roger’s puzzle</a> from earlier this week. </li>
</ol>
Statisticians, ASA, and Big Data
2012-06-15T20:30:52+00:00
http://simplystats.github.io/2012/06/15/statisticians-asa-and-big-data
<p>Today I got my copy of Amstat News and eagerly opened it before I realized it was not the issue with the salary survey….</p>
<p>But the President’s Corner section had the <a href="http://magazine.amstat.org/blog/2012/05/31/prescorner/" target="_blank">following column on big data</a> by ASA president Robert Rodriguez.</p>
<blockquote>
<p><span>Big Data is big news. It is the focus of stories in </span><em>The New York Times</em><span> and the subject of technology blogs, business forums, and economic studies. This column describes how statisticians can prepare for opportunities in Big Data and explains the distinctive value our profession can provide.</span></p>
</blockquote>
<p>Here’s a homework assignment for you all: Please read the column and explain what’s wrong with it. I’ll post the answer in a (near) future post.</p>
Poison gas or...air pollution?
2012-06-12T15:07:33+00:00
http://simplystats.github.io/2012/06/12/poison-gas-or-air-pollution
<p>From our Beijing bureau, we have the following message from the U.S. embassy that was recently issued to U.S. citizens in China:</p>
<blockquote>
<p><span>The Embassy has received reports from U.S. citizens living and </span><span>traveling in Wuhan that the air quality in the city has been </span><span>particularly poor since yesterday morning. On June 11 at 16:20, the </span><span>Wuhan Environmental Protection Administrative Bureau posted information </span><span>about this on its website. Below is a translation of that information:</span></p>
<p><span>“Beginning on June 11, 2012 around 08:00 AM, the air quality inside </span><span>Wuhan appeared to worsen, with low visibility and burning smells. </span><span>According to city air data, starting at 07:00 AM this morning, the </span><span>density of the respiratory particulate matter increased in the air </span><span>downtown; it increased quickly after 08:00 AM. The density at 14:00 </span><span>approached 0.574mg/m3, a level that is deemed “serious” by national </span><span>standards. An analysis of the air indicates the pollution is caused </span><span>from burning of plant material northeast of Wuhan.</span></p>
</blockquote>
<p><span>It’s not immediately clear which pollutant they’re talking about, but it’s probably PM10 (particulate matter less than 10 microns in aerodynamic diameter). If so, that level is quite high—U.S. 24-hour average standards are at 0.15 mg/m3 (note that the reported level was an hourly level). </span></p>
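<p>For a rough sense of scale, here is a quick sketch using only the two numbers quoted above (and ignoring the fact that the averaging times differ):</p>
<pre>
## How the reported hourly level compares to the U.S. 24-hour standard.
## Note the averaging times differ, so this is only a rough comparison.
reported = 0.574    # mg/m3, hourly level reported in Wuhan
standard = 0.15     # mg/m3, U.S. 24-hour average PM10 standard
reported / standard # roughly 3.8 times the 24-hour standard
</pre>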
<blockquote>
<p><span>Our investigation of downtown’s districts, based on reports from all of Wuhan’s large industrial enterprises, has determined that there has not been any explosion, sewage release, leakage of any poisonous gas, or any other type of urgent environmental accident from large industrial enterprises. Nor is there burning of crops in the new city area. News spread online of a chlorine leak from Qingshan or a boiler explosion at Wuhan Iron and Steel Plant are rumors.</span></p>
</blockquote>
<p>So, this is not some terrible incident; it’s just the usual smell. Good to know.</p>
<blockquote>
<p><span>According to our investigation, the abnormal air quality in our city is </span><span>mainly caused by the burning of the crops northeast of Wuhan towards </span><span>Hubei province. Similar air quality is occurring in Jiangsu, Henan and </span><span>Anhui provinces, as well as in Xiaogan, Jingzhou, Jingmen and Xiantao, </span><span>cities nearby Wuhan.</span></p>
</blockquote>
<blockquote>
<p>The weather forecast authority of the city has advised that recent weather conditions have not been good for the dispersion of pollutants.”</p>
</blockquote>
<p>The embassy goes on to warn:</p>
<blockquote>
<p>U.S. citizens are reminded that air pollution is a significant problem in many cities and regions in China. Health effects are likely to be more severe for sensitive populations, including children and older adults. While the quality of air can differ greatly between cities or between urban and rural areas, U.S. citizens living in or traveling to China may wish to consult their doctor when living in or prior to traveling to areas with significant air pollution.</p>
</blockquote>
Big Data Needs May Create Thousands Of Tech Jobs
2012-06-12T13:01:00+00:00
http://simplystats.github.io/2012/06/12/big-data-needs-may-create-thousands-of-tech-jobs
<p><a href="http://www.npr.org/2012/06/07/154485152/big-data-may-create-thousands-of-industry-jobs">Big Data Needs May Create Thousands Of Tech Jobs</a></p>
Green: E.P.A. Soot Rules Expected This Week
2012-06-12T11:35:59+00:00
http://simplystats.github.io/2012/06/12/green-e-p-a-soot-rules-expected-this-week
<p><a href="http://green.blogs.nytimes.com/2012/06/11/e-p-a-soot-rules-expected-this-week/">Green: E.P.A. Soot Rules Expected This Week</a></p>
Chris Volinsky knows where you are
2012-06-11T14:52:32+00:00
http://simplystats.github.io/2012/06/11/chris-volinsky-knows-where-you-are
<p><a href="http://mobile.nj.com/advnj/db_272903/contentdetail.htm?contentguid=EkXn77Ya&full=true#display">Chris Volinsky knows where you are</a></p>
Getting a grant...or a startup
2012-06-11T12:47:01+00:00
http://simplystats.github.io/2012/06/11/getting-a-grant-or-a-startup
<p><a href="http://ycombinator.com/index.html" target="_blank">Y Combinator</a> is company that invests in startups and brings them to the San Francisco area to get them ready for prime time. One of the co-founders is <a href="http://paulgraham.com/" target="_blank">Paul Graham</a>, whose essays we’ve featured on this blog.</p>
<p>The Y Combinator web site itself is quite interesting, and in particular the section on <a href="http://ycombinator.com/howtoapply.html" target="_blank">how to apply to Y Combinator</a> caught my eye. Now, I don’t know the first thing about starting a startup (nor do I have any current interest in doing so), but I do know a little bit about applying for NIH grants and it struck me that the advice for the startups seemed very useful for writing grants. It surprised me because I always thought that the process of “marketing” a startup to someone would be quite different from applying for a grant: startups are supposed to be cool and innovative and futuristic, while grants are more about doing the usual thing. Just shows you how much I know about the startup world.</p>
<p>I thought I’d pluck out a few good parts from Graham’s long list of advice that I found useful. The full essay is definitely worth reading.</p>
<p>Here’s one that struck me immediately:</p>
<blockquote>
<p><span>If we get 1000 applications and have 10 days to read them, we have to read about 100 a day. That means a YC partner who reads your application will on average have already read 50 that day and have 50 more to go. Yours has to stand out. So you have to be exceptionally clear and concise. Whatever you have to say, give it to us right in the first sentence, in the simplest possible terms.</span></p>
</blockquote>
<p>In the past, I always thought that grant reviewers had all the time in the world to read my grant and probably dedicated a week of their life to reading it. Hah! Having served on study sections now, I realize there’s precious little time to dedicate to the tall pile of grants that need to be read. Grants that are well written are a pleasure to read. Ones that are poorly written (or take forever to get to the point) just make me angry.</p>
<blockquote>
<p>It’s a mistake to use marketing-speak to make your idea sound more exciting. We’re immune to marketing-speak; to us it’s just noise. So don’t begin…with something like</p>
<blockquote>
<p>We are going to transform the relationship between individuals and information.</p>
</blockquote>
<p><span>That sounds impressive, but it conveys nothing. It could be a description of any technology company. Are you going to build a search engine? Database software? A router? I have no idea.</span></p>
<p>One test of whether you’re explaining your idea effectively is to ask how close the reader is to reproducing it. After reading that sentence I’m no closer than I was before, so its content is effectively zero.</p>
</blockquote>
<p>I usually tell people if at any stage of writing a grant you have a choice between being more general and more specific, always be more specific. That way people can judge you based on the facts, not based on their imagination of the facts. This doesn’t always lead to success, of course, but it can remove an element of chance. If a reviewer has to fill in the details of your idea, who knows what they’ll think of?</p>
<blockquote>
<p><span>One reason [company] founders resist giving matter-of-fact descriptions [of their company] is that they seem to constrain your potential. “But [my product] is so much more than a database with a wiki UI!” The problem is, the less constraining your description, the less you’re saying. So it’s better to err on the side of matter-of-factness.</span></p>
</blockquote>
<p><span>Of course, there are some applications that specifically ask you to “think big” and there the rules may be a bit different. But still, I think it’s better to avoid broad and sweeping generalities. These days, given the relatively tight page limits, you need to convey the maximum amount of information possible.</span></p>
<blockquote>
<p><span>One good trick for describing a project concisely is to explain it as a variant of something the audience already knows. It’s like Wikipedia, but within an organization. It’s like an answering service, but for email. It’s eBay for jobs. This form of description is wonderfully efficient. Don’t worry that it will make your idea seem “derivative.” Some of the best ideas in history began by sticking together two existing ideas no one realized could be combined.</span></p>
</blockquote>
<p>Not sure this is so relevant to writing grants, but I thought it was interesting. My instinct was that this would make your idea seem derivative, but maybe not.</p>
<blockquote>
<p>…if we can see obstacles to your idea that you don’t seem to have considered, that’s a bad sign. This is your idea. You’ve had days, at least, to think about it, and we’ve only had a couple minutes. We shouldn’t be able to come up with objections you haven’t thought of.</p>
<p>Paradoxically, it is for this reason better to disclose all the flaws in your idea than to try to conceal them. If we think of a problem you don’t mention, we’ll assume it’s because you haven’t thought of it. </p>
</blockquote>
<p>This one is definitely true: better to reveal limitations/weaknesses than to look like you haven’t thought of them. Because if a reviewer finds one, then it’s all they’ll talk about. Often, a big problem is a lack of space to fit this in, but if you can do it, I think it’s always a good idea to include it.</p>
<p>Finally,</p>
<blockquote>
<p><span>You don’t have to sell us on you. We’ll sell ourselves, if we can just understand you. But every unnecessary word in your application subtracts from the effect of the necessary ones. So before submitting your application, print it out and take a red pen and cross out every word you don’t need. And in what’s left be as specific and as matter-of-fact as you can.</span></p>
</blockquote>
<p><span>I think there are quite a few differences between scientists reviewing grants and startup investors, and we probably shouldn’t take the parallels too seriously. In particular, I think investors are going to be more optimistic because, as Graham says, “they get equity”. Scientists are trained to be skeptical and so will be looking at applications with a slightly different eye. </span></p>
<p><span>However, I think the general advice to be specific and concise about what you’re doing is good. If anything, it may help you realize that you have no idea what you’re doing.</span></p>
Sunday data/statistics link roundup (6/10)
2012-06-10T21:31:47+00:00
http://simplystats.github.io/2012/06/10/sunday-data-statistics-link-roundup-6-10
<ol>
<li> Yelp put a <a href="http://www.yelp.com/academic_dataset" target="_blank">data set online</a> for people to play with, including reviews, star ratings, etc. This could be a really neat data set for a student project. The data they have made available focuses on the area around 30 universities. My <a href="http://www.washington.edu/" target="_blank">alma mater</a> is one of them. </li>
<li>A sort of <a href="http://fhuszar.blogspot.co.uk/2012/06/how-data-scientist-decides-when-to-get.html" target="_blank">goofy talk</a> about how to choose the optimal marriage partner when viewing the problem as an optimal stopping problem. The author suggests that you need to date around <span>196,132 partners to make sure you have made the optimal decision. Fortunately for the Simply Statistics authors, it took many fewer for us all to end up with our optimal matches. Via <a href="https://twitter.com/#!/fhuszar" target="_blank">@fhuszar</a>.</span></li>
<li>An <a href="http://www.nytimes.com/2012/06/10/business/essay-grading-software-as-teachers-aide-digital-domain.html?_r=1&emc=eta1" target="_blank">interesting article</a> on the recent <a href="http://www.kaggle.com/c/asap-aes" target="_blank">Kaggle contest</a> that sought to identify statistical algorithms that could accurately match human scoring of written essays. Several students in my advanced biostatistics course competed in this competition and did quite well. I understand the need for these kinds of algorithms, since it takes a huge amount of human labor to score these essays well. But it also makes me a bit sad since it still seems even the best algorithms will have a hard time scoring creativity. For example, this phrase from my favorite president doesn’t use big words, but it sure is clever: “I think there is only one quality worse than hardness of heart and that is softness of head.”</li>
<li><span class="huge">A really good article by friend of the blog, Steven, on the <a href="http://www.nature.com/clpt/journal/v91/n6/abs/clpt20126a.html" target="_blank">perils of gene patents</a>. This part sums it up perfectly, “</span><span>Genes are not inventions. This simple </span><span>fact, which no serious scientist would </span><span>dispute, should be enough to rule them </span><span>out as the subject of patents.” Simply Statistics has weighed in on this issue a <a href="http://www.nature.com/nature/journal/v484/n7394/full/484318a.html" target="_blank">couple</a> of <a href="http://simplystatistics.tumblr.com/post/14135999782/the-supreme-courts-interpretation-of-statistical" target="_blank">times</a> <a href="http://simplystatistics.tumblr.com/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court" target="_blank">before</a>. But I think in light of 23andMe’s recent Parkinson’s patent it bears repeating. <a href="http://www.genomicslawreport.com/index.php/2012/06/01/patenting-and-personal-genomics-23andme-receives-its-first-patent-and-plenty-of-questions/" target="_blank">Here</a> is an awesome summary of the issue from Genomics Lawyer.</span></li>
<li><span><a href="http://simplystatistics.tumblr.com/post/19289280474/a-proposal-for-a-really-fast-statistics-journal" target="_blank">A proposal</a> for a really fast statistics journal I wrote about a month or two ago. Expect more on this topic from me this week. </span></li>
</ol>
China Asks Embassies to Stop Measuring Air Pollution
2012-06-05T17:02:06+00:00
http://simplystats.github.io/2012/06/05/china-asks-embassies-to-stop-measuring-air-pollution
<p><a href="http://www.nytimes.com/2012/06/06/world/asia/china-asks-embassies-to-stop-measuring-air-pollution.html?smid=tu-share">China Asks Embassies to Stop Measuring Air Pollution</a></p>
How Big Data Gets Real
2012-06-04T14:24:25+00:00
http://simplystats.github.io/2012/06/04/how-big-data-gets-real
<p><a href="http://bits.blogs.nytimes.com/2012/06/04/how-big-data-gets-real/">How Big Data Gets Real</a></p>
Interview with Amanda Cox - Graphics Editor at the New York Times
2012-06-01T14:57:00+00:00
http://simplystats.github.io/2012/06/01/interview-with-amanda-cox-graphics-editor-at-the-new
<div class="im">
<strong>Amanda Cox </strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong><img height="294" src="http://biostat.jhsph.edu/~jleek/cox.jpg" width="200" /></strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong><br /></strong>Amanda Cox received her M.S. in statistics from the University of Washington in 2005. She then moved to the New York Times, where she is a graphics editor. She, and the graphics team at the New York Times, are responsible for many of the cool, informative, and interactive graphics produced by the Times. For example, <a href="http://www.nytimes.com/interactive/2009/11/06/business/economy/unemployment-lines.html" target="_blank">this</a>, <a href="http://www.nytimes.com/interactive/2009/07/02/business/economy/20090705-cycles-graphic.html" target="_blank">this</a> and <a href="http://www.nytimes.com/interactive/2010/02/26/sports/olympics/20100226-olysymphony.html" target="_blank">this</a> (the last one, Olympic Symphony, is one of my all time favorites).
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong>You have a background in statistics, do you consider yourself a statistician? Do you consider what you do statistics?</strong><br /><span></span>
</div>
<div class="im">
<span><br /></span>
</div>
<div class="im">
<span>I don’t deal with uncertainty in a formal enough way to call what I do statistics, or myself a statistician. (My technical title is “graphics editor,” but no one knows what this means. On the good days, what we do is “journalism.”) Mark Hansen, a statistician at UCLA, has possibly changed my thinking on this a little bit though, by asking who I want to be the best at visualizing data, if not statisticians.</span>
</div>
<div class="im">
<span><br /></span>
</div>
<div class="im">
<strong>How did you end up at the NY Times?</strong>
</div>
<p><span>In the middle of my first year of grad school (in statistics at the University of Washington), I started applying for random things. One of them was to be a </span><a href="http://www.nytimes-internship.com/" target="_blank">summer intern</a><span> in the graphics department at the Times.</span></p>
<p><strong><span>How are the graphics and charts you develop different than </span><span>producing graphs for a quantitative/scientific audience?</span></strong></p>
<div class="im">
<span><br /></span>
</div>
<div class="im">
<span>“Feels like homework” is a really negative reaction to a graphic or a story here. In practice, that means a few things: we don’t necessarily assume our audience already cares about a topic. We try to get rid of jargon, which can be useful shorthand for technical audiences, but doesn’t belong in a newspaper. Most of our graphics can stand on their own, meaning you shouldn’t need to read any accompanying text to understand the basic point. Finally, we probably pay more attention to things like typography and design, which, done properly, are really about hierarchy and clarity, and not just about making things cute. </span>
</div>
<p><strong><span><br /></span></strong></p>
<p><strong><span>How do you use R to prototype graphics? </span></strong></p>
<p><span>I sketch in R, which mostly just means reading data, and trying on different forms or subsets or levels of aggregation. It’s nothing fancy: usually just points and lines and text from base graphics. For print, I will sometimes clean up a pdf of R output in Illustrator. You can see some of that in practice at </span><a href="http://chartsnthings.tumblr.com/" target="_blank">chartsnthings.tumblr.com</a><span>, which is where one of my colleagues, Kevin Quealy, posts some of the department’s sketches. (Kevin and I are the only regular R users here, so the amount of R used on chartsnthings is not at all representative of NYT graphics as a whole.)</span></p>
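<p>To make the idea of “sketching” concrete, here is a minimal example of what that kind of base-graphics prototyping might look like. This is not Cox’s code; the data frame, groups, and values are made up for illustration.</p>
<pre><code>
## Hypothetical sketch: a quick multi-series line plot with base graphics
set.seed(1)
d <- data.frame(month = rep(1:24, times = 3),
                group = rep(c("construction", "retail", "finance"), each = 24),
                rate  = 8 + c(cumsum(rnorm(24)), cumsum(rnorm(24)), cumsum(rnorm(24))))

plot(d$month, d$rate, type = "n", xlim = c(1, 28),
     xlab = "Month", ylab = "Rate")
for (g in unique(d$group)) {
  sub <- d[d$group == g, ]
  lines(sub$month, sub$rate, col = "grey40")
  text(24, sub$rate[24], g, pos = 4, cex = 0.7)  # label each series directly
}
</code></pre>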
<p><strong><span>Do you have any examples where the R version and the eventual final web version are nearly identical?</span></strong></p>
<p><span>Real interactivity changes things, so my use of R for web graphics is mostly just a proof-of-concept thing. </span><span>(Sometimes I will also generate “poor-man’s interactivity,” which means hitting the pagedown key on a pdf of charts made in a for loop.) But here are a couple of proof-of-concept sketches, where the initial R output doesn’t look so different from the final web version.</span></p>
<p><a href="http://www.nytimes.com/interactive/2009/11/06/business/economy/unemployment-lines.html" target="_blank">The Jobless Rate for People Like You</a></p>
<p><img height="354" src="http://biostat.jhsph.edu/~jleek/jobless.png" width="400" /></p>
<p><a href="http://www.nytimes.com/interactive/2009/07/31/business/20080801-metrics-graphic.html" target="_blank">How Different Groups Spend Their Day</a></p>
<p><img src="http://biostat.jhsph.edu/~jleek/groups.png" alt="" /></p>
<p><strong><span>You consistently produce arresting and informative graphics about </span><span>a range of topics. How do you decide on which topics to tackle?</span></strong></p>
<p><span>News value and interestingness are probably the two most important criteria for deciding what to work on. In an ideal world, you get both, but sometimes, one is enough (or the best you can do).</span></p>
<p><strong><span>Are your project choices motivated by availability of data?</span></strong></p>
<p><span>Sure. The availability of data also affects the scope of many projects. For example, the guys who work on our live election results will probably map them by county, even though precinct-level results are </span><a href="http://www.stanford.edu/~jrodden/jrhome_files/electiondata.htm" target="_blank">so much better</a><span>. But precinct-level data isn’t generally available in real time.</span></p>
<p><strong><span>What is the typical turn-around time from idea to completed project?</span></strong></p>
<p><span>The department is most proud of some of its one-day, breaking news work, but very little of that is what I would think of as data-heavy. The real answer to “how long does it take?” is “how long do we have?” Projects always find ways to expand to fill the available space, which often ranges from a couple of days to a couple of weeks.</span></p>
<p><span><br /></span></p>
<p><strong><span>Do you have any general principles for how you make complicated </span><span>data understandable to the general public?</span></strong></p>
<div class="im">
</div>
<p><span>I’m a big believer in learning by example. If you annotate three points in a scatterplot, I’m probably good, even if I’m not super comfortable reading scatterplots. I also think the words in a graphic should highlight the relevant pattern, or an expert’s interpretation, and not merely say “Here is some data.” The annotation layer is critical, even in a newspaper (where the data is not usually super complicated).</span></p>
<p><strong><span>What do you consider to be the most informative graphical elements or interactive features that you consistently use?</span></strong></p>
<p><span>I like sliders, because there’s something about them that suggests story (beginning-middle-end), even if the thing you’re changing isn’t time. Using movement in a way that means something, like </span><a href="http://www.nytimes.com/packages/html/newsgraphics/pages/hp/2008/2008-06-03-1800.html" target="_blank">this</a><span> or </span><a href="http://www.nytimes.com/interactive/2009/07/02/business/economy/20090705-cycles-graphic.html" target="_blank">this</a><span>, is still also fun, because it takes advantage of one of the ways the web is different from print.</span></p>
Writing software for someone else
2012-05-31T12:27:06+00:00
http://simplystats.github.io/2012/05/31/writing-software-for-someone-else
<p><a href="http://www.johndcook.com/blog/2012/05/30/writing-software-for-someone-else/">Writing software for someone else</a></p>
Why "no one reads the statistics literature anymore"
2012-05-30T12:54:00+00:00
http://simplystats.github.io/2012/05/30/why-no-one-reads-the-statistics-literature-anymore
<p>Spurred by Rafa’s post on <a href="http://simplystatistics.tumblr.com/post/23674712262/how-do-we-evaluate-statisticians-working-in-genomics" target="_blank">evaluating statisticians working in genomics</a>, there’s an interesting <a href="https://groups.google.com/group/reproducible-research/browse_thread/thread/7a8da11209cec2f2" target="_blank">discussion</a> going on at the Scientists for Reproducible Research group on statistics journals. Evan Johnson kicks it off:</p>
<blockquote>
<p><span>…our statistics journals have little impact on how genomic data are analyzed. My group rarely looks to publish in statistics journals anymore because even IF we can get it published quickly, NO ONE will read it, so the only things we send there anymore are things that we don’t care if anyone ever uses.</span></p>
</blockquote>
<p>Evan continues:</p>
<blockquote>
<p><span>It’s crazy to me that all of our statistical journals are barely even noticed by bioinformaticians, computational biologists, and by people in genomics. Even worse, very few non-statisticians in genomics ever try to publish in our journals. Ultimately, this represents a major failure in the statistical discipline to be collectively influential on how genomic data are analyzed.</span></p>
</blockquote>
<p>I may agree with the first point but I’m not sure I agree with second. Regarding the first, I think <a href="http://simplystatistics.tumblr.com/post/23674712262/how-do-we-evaluate-statisticians-working-in-genomics#comment-538117393" target="_blank">Karl put it best</a> in that really the problem is that “the bulk of the people who might benefit from my method do not read the statistical literature”. For the second point, I think the issue is that the way science works is changing. Here’s my cartoon of how science worked in the “old days”, say, pre-computer era:</p>
<p><img src="http://media.tumblr.com/tumblr_m4u5m65EJP1r08wvg.png" alt="" /></p>
<p>The idea here is that scientists worked with statisticians (they may have been one and the same) to publish stat papers and scientific papers. If Scientist A saw a paper in a domain journal written by Scientist B using a method developed by Statistician C, how could Scientist A apply that method? He had to talk to Statistician D, who would read that statistics literature and find Statistician C’s paper to learn about the method. The point is that there is no direct link from Scientist A to Statistician C except through statistics journals. Therefore, it was critical for Statistician C to publish in the stat journals to ensure that there would be an impact on scientists.</p>
<p>My cartoon of the “new way” of doing things is below.</p>
<p><img src="http://media.tumblr.com/tumblr_m4u5s53fSy1r08wvg.png" alt="" /></p>
<p>Now, if Scientist A wants to use a method developed by Statistician C (and used by Scientist B), he simply finds the software developed by Statistician C and applies it to his data. Here, there is a direct connection between A and C through software. If Statistician C wants his method to have an impact on scientists, there are two options: publish in stat journals and hope that the method filters through other statisticians, or publish in domain journals <em>with software</em> so that other scientists may apply the method directly. It seems the latter approach is more popular in some areas.</p>
<p>Peter Diggle makes an <a href="http://simplystatistics.tumblr.com/post/23674712262/how-do-we-evaluate-statisticians-working-in-genomics#comment-538227946" target="_blank">important point</a> about generalized linear models and the seminal book written by McCullagh and Nelder:</p>
<blockquote>
<p><span>the book [by McCullagh and Nelder] would have been read by many fewer people if Nelder and colleague had not embedded the idea in software that (for the time) was innovative in being interactive rather than batch-oriented.</span></p>
</blockquote>
<p>For better or for worse (and probably very often for worse), the software allowed many many people access to the methods.</p>
<p>The supposed attraction of publishing a statistical method in a statistics journal like JASA or JRSS-B is that the methods are published in a more abstract manner (usually using mathematical symbols) in the hopes that the methods will be applicable to a wide array of problems, not just the problem for which it was developed. Of course, the flip side of this argument is, as <a href="http://simplystatistics.tumblr.com/post/23674712262/how-do-we-evaluate-statisticians-working-in-genomics#comment-538117393" target="_blank">Karl says</a>, again eloquently, “<span>if you don’t get down to the specifics of a particular data set, then you haven’t really solved </span><em>any</em><span> problem”.</span></p>
<p>I think abstraction is important and we need to continue publishing those kinds of ideas. However, I think there is one key point that the statistics community has had difficulty grasping, which is that <strong>software represents an important form of abstraction</strong>, if not the most important form. Anyone who has written software knows that there are many approaches to implementing your method in software and various levels of abstraction one can use. The variety of problems to which the software can be applied depends on how general the interface to your software is. This is why I always encourage people to write R packages because it often forces them to think a bit more abstractly about who might be using the software.</p>
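<p>A toy example of what “abstraction through the interface” means in practice (my illustration, not from the original discussion): the first function is welded to one lab’s file layout, while the second accepts any numeric matrix and so can be applied to problems its author never saw.</p>
<pre><code>
## Narrow interface: only works for one hypothetical CSV layout
normalize_mylab <- function(file) {
  x <- as.matrix(read.csv(file, row.names = 1))
  sweep(x, 2, colMeans(x))              # center each column
}

## More general interface: any matrix-like input, any centering vector
normalize <- function(x, center = colMeans(x)) {
  sweep(as.matrix(x), 2, center)
}

normalize(matrix(rnorm(20), nrow = 5))  # usable far beyond the original data set
</code></pre>
<p>Writing an R package forces exactly this kind of decision about what the input to your method really is.</p>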
<p>Whither the statistics journals? It’s hard to say. Having them publish more software probably won’t help as the audience remains the same. I’m a bit stumped here but I look forward to continued discussion!</p>
View my Statistics for Genomics lectures on Youtube and ask questions on facebook/twitter
2012-05-29T17:12:51+00:00
http://simplystats.github.io/2012/05/29/view-my-statistics-for-genomics-lectures-on-youtube-and
<p>This year I recorded my lectures during my Statistics for Genomics course. Slowly but surely I am putting all the videos on Youtube. Links will eventually be <a href="http://rafalab.jhsph.edu/688/" target="_blank">here</a> (all slides and the first lecture is already up). As new lectures become available I will post updates on <a href="https://www.facebook.com/pages/RafaLab/144709675562592" target="_blank">rafalab’s facebook page</a> and <a href="https://twitter.com/#!/rafalab" target="_blank">twitter feed</a> where I will answer questions posted as comments (time permitting). Guest lecturers include Jeff Leek, Ben Langmead, Kasper Hansen and Hongkai Ji.</p>
Schlep blindness in statistics
2012-05-28T14:19:24+00:00
http://simplystats.github.io/2012/05/28/schlep-blindness-in-statistics
<p>This is yet another <a href="http://paulgraham.com/schlep.html" target="_blank">outstanding post</a> by Paul Graham, this time on “Schlep Blindness”. He talks about how there are great startup ideas that no one considers because they are too much of a “schlep” (a tedious unpleasant task). He talks about how most founders of startups want to put up a clever bit of code they wrote and just watch the money flow in. But of course it doesn’t work like that, you need to advertise, interact with customers, raise money, go out and promote your work, fix bugs at 3am, etc. </p>
<p>In academia there is a similar tendency to avoid projects that involve a big schlep. For example, it is relatively straightforward to develop a mathematical model, work out the parameter estimates, and write a paper. But it is a big schlep to then write fast code that implements that method, debug the code, dummy proof the code, fix bugs submitted by users, etc. <a href="http://simplystatistics.tumblr.com/post/23674712262/how-do-we-evaluate-statisticians-working-in-genomics" target="_blank">Rafa’s post</a>, <a href="http://simplystatistics.tumblr.com/post/22844703875/ha" target="_blank">Hadley’s interview</a>, and the discussion Rafa <a href="https://groups.google.com/forum/?fromgroups#!topic/reproducible-research/eo2hEgnOwvI" target="_blank">linked to</a> all allude to this issue. Particularly the fact that the schlep, the long slow slog of going through a new data type or writing a piece of usable software is somewhat undervalued. </p>
<p>I think part of the problem is our academic culture and heritage, which has traditionally put a very high premium on being clever and a relatively low premium on being willing to go through the schlep. As applied statistics touches more areas and the number of users of statistical software and ideas grows, the schlep becomes just as important as the clever idea. If you aren’t willing to put in the time to code your methods up and make them accessible to other investigators, then who will be? </p>
<p>To bring this back to the discussion inspired by Rafa’s post, I wonder if applied statistics journals could increase their impact, encourage more readership from scientific folks, and support a broader range of applied statisticians if there was a re-weighting of the importance of cleverness and schlep? As Paul points out: </p>
<blockquote>
<p><span> In addition to their intrinsic value, they’re like undervalued stocks in the sense that there’s less demand for them among founders. If you pick an ambitious idea, you’ll have less competition, because everyone else will have been frightened off by the challenges involved.</span></p>
</blockquote>
Sunday data/statistics link roundup (5/27)
2012-05-27T17:09:06+00:00
http://simplystats.github.io/2012/05/27/sunday-data-statistics-link-roundup-5-27
<ol>
<li>Amanda Cox <a href="http://chartsnthings.tumblr.com/post/23348191031/amanda-cox-and-countrymen-chart-the-facebook-i-p-o" target="_blank">on the process</a> they went through to come up with <a href="http://www.nytimes.com/interactive/2012/05/17/business/dealbook/how-the-facebook-offering-compares.html" target="_blank">this graphic</a> about the Facebook IPO. So cool to see how R is used in the development process. A favorite quote of mine, “<span>But rather than bringing clarity, it just sort of looked chaotic, even to the seasoned chart freaks of 620 8th Avenue.” One of the more interesting things about posts like this is you get to see how statistics versus a deadline works. This is typically the role of the analyst, since they come in late and there is usually a deadline looming…</span></li>
<li><span>An interview <a href="http://www.readability.com/read?url=http%3A//m.theatlantic.com/business/archive/2012/05/the-golden-age-of-silicon-valley-is-over-and-were-dancing-on-its-grave/257401/" target="_blank">with Steve Blank</a> about Silicon valley and how venture capitalists (VC’s) are focused on social technologies since they can make a profit quickly. A depressing/fascinating quote from this one is, “</span><span>If I have a choice of investing in a blockbuster cancer drug that will pay me nothing for ten years, at best, whereas social media will go big in two years, what do you think I’m going to pick? If you’re a VC firm, you’re tossing out your life science division.” He also goes on to say thank goodness for the NIH, NSF, and Google who are funding interesting “real science” problems. This probably deserves its own post later in the week, the difference between analyzing data because it will make money and analyzing data to solve a hard science problem. The latter usually takes way more patience and the data take much longer to collect. </span></li>
<li><span><a href="http://blog.optimizely.com/how-obama-raised-60-million-by-running-an-exp" target="_blank">An interesting post</a> on how Obama’s analytics department <a href="http://en.wikipedia.org/wiki/A/B_testing" target="_blank">ran an A/B test </a>which improved the number of people who signed up for his mailing list. I don’t necessarily agree with their claim that they helped raise $60 million, there may be some confounding factors that mean that the individuals who sign up with the best combination of image/button don’t necessarily donate as much. But still, an interesting look into <a href="http://simplystatistics.tumblr.com/post/10809464773/why-does-obama-need-statisticians" target="_blank">why Obama needs statisticians</a>. </span></li>
<li><span>A <a href="https://twitter.com/kristin_linn/status/206778618016317441/photo/1" target="_blank">cute statistics cartoon</a> from <a href="https://twitter.com/#!/kristin_linn" target="_blank">@kristin_linn </a> via Chris V. Yes, we are now shamelessly reposting cute cartoons for retweets :-). </span></li>
<li><span><a href="http://simplystatistics.tumblr.com/post/23674712262/how-do-we-evaluate-statisticians-working-in-genomics" target="_blank">Rafa’s post</a> inspired some interesting conversation both on our blog and on some statistics mailing lists. It seems to me that everyone is making an effort to understand the increasingly diverse field of statistics, but we still have a ways to go. I’m particularly interested in discussion on how we evaluate the contribution/effort behind making good and usable academic software. I think the strength of the <a href="http://bioconductor.org/" target="_blank">Bioconductor</a> community and the <a href="https://github.com/" target="_blank">rise of Github </a>among academics are a good start. For example, it is really useful that Bioconductor now tracks the <a href="http://www.bioconductor.org/packages/stats/" target="_blank">number of package downloads</a>. </span></li>
</ol>
"How do we evaluate statisticians working in genomics? Why don't they publish in stats journals?" Here is my answer
2012-05-24T15:57:37+00:00
http://simplystats.github.io/2012/05/24/how-do-we-evaluate-statisticians-working-in-genomics
<p class="MsoNormal">
During the past couple of years I have been asked these questions by several department chairs and other senior statisticians interested in hiring or promoting faculty working in genomics. The main difficulty stems from the fact that we (statisticians working in genomics) publish in journals outside the mainstream statistical journals. This can be a problem during evaluation because a quick-and-dirty approach to evaluating an academic statistician is to count papers in the Annals of Statistics, JASA, JRSS and Biometrics. The evaluators feel safe counting these papers because they trust the fellow-statistician editors of these journals. However, statisticians working in genomics tend to publish in journals like Nature Genetics, Genome Research, PNAS, Nature Methods, Nucleic Acids Research, Genome Biology, and Bioinformatics. In general, these journals do not recruit statistical referees and a considerable number of papers with questionable statistics do get published in them. <strong>However, </strong>when the paper’s main topic is a statistical method or if it heavily relies on statistical methods, statistical referees are used. So, if the statistician is the corresponding or <a href="http://simplystatistics.tumblr.com/post/11314293165/authorship-conventions" target="_blank">last author</a> and it’s a stats paper, it is OK to assume the statistics are fine and you should go ahead and be impressed by the impact factor of the journal… it’s not easy getting statistics papers into these journals.
</p>
<p class="MsoNormal">
But we really should not be counting papers blindly. Instead we should be reading at least some of them. But here again the evaluators get stuck as we tend to publish papers with application/technology specific jargon and show-off by presenting results that are of interest to our potential users (biologists) and not necessarily to our fellow statisticians. Here all I can recommend is that you seek help. There are now a handful of us that are full professors and most of us are more than willing to help out with, for example, <a href="http://simplystatistics.tumblr.com/post/12181264937/advice-on-promotion-letters-bleg" target="_blank">promotion letters</a>.
</p>
<p class="MsoNormal">
So why don’t we publish in statistical journals? The fear of getting scooped due to the <a href="http://simplystatistics.tumblr.com/post/17317636444/an-example-of-how-sending-a-paper-to-a-statistics" target="_blank">slow turnaround of stats journals</a> is only one reason. New technologies that quickly became widely used (microarrays in 2000 and nextgen sequencing today) created a need for data analysis methods among large groups of biologists. Journals with large readerships and high impact factors, typically not interested in straight statistical methodology work, suddenly became amenable to publishing our papers, especially if they solved a data analytic problem faced by many biologists. The possibility of publishing in widely read journals is certainly seductive.
</p>
<p class="MsoNormal">
While in several other fields, data analysis methodology development is restricted to the statistics discipline, in genomics we compete with other quantitative scientists capable of developing useful solutions: computer scientists, physicists, and engineers were also seduced by the possibility of gaining notoriety with publications in high impact journals. Thus, in genomics, the competition for funding, citation and publication in the top scientific journals is fierce.
</p>
<p class="MsoNormal">
Then there is funding. Note that while most biostatistics methodology NIH proposals go to the Biostatistical Methods and Research Design (BMRD) study section, many of the genomics-related grants get sent to other sections such as the Genomics Computational Biology and Technology (GCAT) and Biodata Management and Analysis (BDMA) study sections. BDMA and GCAT are much more impressed by Nature Genetics and Genome Research than JASA and Biometrics. They also look for citations and software downloads.
</p>
<p class="MsoNormal">
To be considered successful by our peers in genomics, those who referee our papers and review our grant applications, our statistical methods need to be delivered as software and garner a user base. Publications in statistical journals, especially those not appearing in PubMed, are not rewarded. This lack of incentive combined with how <a href="http://simplystatistics.tumblr.com/post/22844703875/ha" target="_blank">time consuming it is to produce and maintain usable software</a>, has led many statisticians working in genomics to focus solely on the development of practical methods rather than generalizable mathematical theory. As a result, statisticians working in genomics do not publish much in the traditional statistical journals. You should not hold this against them, especially if they are developers and maintainers of widely used software.
</p>
<!--EndFragment-->
Sunday data/statistics link roundup (5/20)
2012-05-20T15:43:19+00:00
http://simplystats.github.io/2012/05/20/sunday-data-statistics-link-roundup-5-20
<div>
It’s grant season around here so I’ll be brief:
</div>
<ol>
<li>I love <a href="http://online.wsj.com/article/SB10001424052702303448404577410341236847980.html" target="_blank">this article</a> in the WSJ about the crisis at JP Morgan. The key point it highlights is that looking only at the high-level analysis and summaries can be misleading, you have to look at the raw data to see the potential problems. As data become more complex, I think its critical we stay in touch with the raw data, regardless of discipline. At least if I miss something in the raw data I don’t lose a couple billion. Spotted by Leonid K. </li>
<li>On the other hand, <a href="http://www.nytimes.com/2012/05/15/science/a-mathematical-challenge-to-obesity.html?_r=1" target="_blank">this article</a> in the Times drives me a little bonkers. It makes it sound like there is one mathematical model that will solve the obesity epidemic. Lines like this are ridiculous: “<span>Because to do this experimentally would take years. You could find out much more quickly if you did the math.” The obesity epidemic is due to a complex interplay of cultural, sociological, economic, and policy factors. The idea you could “figure it out” with a set of simple equations is laughable. If you check out <a href="http://bwsimulator.niddk.nih.gov/" target="_blank">their model</a> this is clearly not the answer to the obesity epidemic. Just another example of why <a href="http://simplystatistics.tumblr.com/post/20902656344/statistics-is-not-math" target="_blank">statistics is not math</a>. If you don’t want to hopelessly oversimplify the problem, you need careful data collection, analysis, and interpretation. For a broader look at this problem, check out this article on <a href="http://www.american.com/archive/2012/may/science-vs-pr" target="_blank">Science vs. PR</a>. Via Andrew J. </span></li>
<li><span>Some <a href="http://freakonometrics.blog.free.fr/index.php?post/2012/04/18/foundwaldo" target="_blank">cool applications</a> of the raster package in R. This kind of thing is fun for student projects because analyzing images leads to results that are easy to interpret/visualize.</span></li>
<li><span>Check out John C.’s <a href="http://www.johndcook.com/blog/2012/05/07/how-do-you-know-when-someone-is-great/" target="_blank">really fascinating post</a> on determining when a white-collar worker is great. Inspired by <a href="http://simplystatistics.tumblr.com/post/22585430491/how-do-you-know-if-someone-is-great-at-data-analysis" target="_blank">Roger’s post</a> on knowing when someone is good at data analysis. </span></li>
</ol>
The West Wing Was Always A Favorite Show Of Mine
2012-05-16T14:49:15+00:00
http://simplystats.github.io/2012/05/16/the-west-wing-was-always-a-favorite-show-of-mine
<p>[youtube http://www.youtube.com/watch?v=t7FJFuuvxpI?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]</p>
<p>The West Wing was always a favorite show of mine (at least, seasons 1-4, the Sorkin years) and I think this is a great scene which talks about the difference between evidence and interpretation. The topic is a 5-day waiting period for gun purchases and they’ve just received a poll in a few specific congressional districts showing weak support for this proposed policy.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Health by numbers: A statistician's challenge
2012-05-16T13:22:11+00:00
http://simplystats.github.io/2012/05/16/health-by-numbers-a-statisticians-challenge
<p><a href="http://www.reuters.com/article/2012/05/14/us-statistics-idUSBRE84D0KD20120514">Health by numbers: A statistician’s challenge</a></p>
Facebook Needs to Turn Data Into Investor Gold
2012-05-14T23:59:00+00:00
http://simplystats.github.io/2012/05/14/facebook-needs-to-turn-data-into-investor-gold
<p><a href="http://www.nytimes.com/2012/05/15/technology/facebook-needs-to-turn-data-trove-into-investor-gold.html?smid=tu-share">Facebook Needs to Turn Data Into Investor Gold</a></p>
Computational biologist blogger saves computer science department
2012-05-14T15:01:00+00:00
http://simplystats.github.io/2012/05/14/computational-biologist-blogger-saves-computer-science
<p>People who read the news should be aware by now that we are in the midst of a big data era. The New York Times, for example, has been writing about this frequently. One of their most recent <a href="http://www.nytimes.com/2012/05/01/science/simons-foundation-chooses-uc-berkeley-for-computing-center.html" target="_blank">articles</a> describes how UC Berkeley is getting $60 million for a new computer science center. Meanwhile, at the University of Florida the administration seems to be oblivious to all this and about a month ago announced it was dropping its computer science department to save money. <a href="http://blogs.forbes.com/stevensalzberg/" target="_blank">Blogger</a> <a href="http://bioinformatics.igm.jhmi.edu/salzberg/Salzberg/Salzberg_Lab_Home.html" target="_blank">Steven Salzberg</a>, a computational biologist known for his <a href="http://scholar.google.com/citations?user=sUVeH-4AAAAJ&hl=en" target="_blank">work in genomics</a>, wrote a post titled “<a href="http://genome.fieldofscience.com/2012/04/university-of-florida-eliminates.html" target="_blank">University of Florida eliminates Computer Science Department. At least they still have football</a>” ridiculing UF for the decision. Here are my favorite quotes:</p>
<blockquote>
<p><span> in the midst of a technology revolution, with a shortage of engineers and computer scientists, UF decides to cut computer science completely? </span></p>
</blockquote>
<blockquote>
<p>Computer scientist Carl de Boor, a member of the National Academy of Sciences and winner of the 2003 National Medal of Science, asked the UF president “What were you thinking?”</p>
</blockquote>
<p>Well, his post went viral and days later <a href="http://www.forbes.com/sites/stevensalzberg/2012/04/25/university-of-florida-announces-plan-to-save-computer-science-department/" target="_blank">UF reversed its decision</a>! So my point is this: statistics departments, be nice to bloggers who work in genomics… one of them might save your butt some day.</p>
<p><em>Disclaimer: Steven Salzberg has a joint appointment in my department and we have joint lab meetings.</em></p>
Sunday data/statistics link roundup (5/13)
2012-05-13T20:39:10+00:00
http://simplystats.github.io/2012/05/13/sunday-data-statistics-link-roundup-5-13
<ol>
<li>Patenting <a href="http://www.nytimes.com/2012/05/13/jobs/an-actuary-proves-patents-arent-only-for-engineers.html?_r=1" target="_blank">statistical sampling</a>? I’m pretty sure the Supreme Court who threw out the Mayo Patent wouldn’t have much trouble tossing this patent either. The properties of sampling are a “law of nature” right? via Leonid K.</li>
<li><a href="http://www.youtube.com/watch?v=aUaInS6HIGo" target="_blank">This video</a> has me all fired up, its called 23 1/2 hours and talks about how the best preventative health measure is getting 30 minutes of exercise - just walking - every day. He shows how in some cases this beats doing much more high-tech interventions. My favorite part of this video is how he uses a ton of statistical/epidemiological terms like “effect sizes”, “meta-analysis”, “longitudinal study”, “attributable fractions”, but makes them understandable to a broad audience. This is a great example of “statistics for good”.</li>
<li>A very nice collection of <a href="http://www.twotorials.com/2012/05/ninety-two-minute-r-tutorial-videos.html" target="_blank">2-minute tutorials</a> in R. This is a great way to teach the concepts, most of which don’t need more than 2 minutes, and it covers a lot of ground. One thing that drives me crazy is when I go into Rafa’s office with a hairy computational problem and he says, “Oh you didn’t know about function x?”. Of course this only happens after I’ve wasted an hour re-inventing the wheel. If more people put up 2-minute tutorials on all the cool tricks they know, we’d all be better off.</li>
<li>A plot using ggplot2, developed by this week’s interviewee <a href="http://simplystatistics.tumblr.com/post/22844703875/ha" target="_blank">Hadley Wickham</a> appears <a href="http://www.theatlantic.com/entertainment/archive/2012/03/the-foreign-language-of-mad-men/254668/" target="_blank">in the Atlantic</a>! Via David S.</li>
<li>I’m refusing to buy into Apple’s hegemony, so I’m still running OS 10.5. I’m having trouble getting github up and running. Anyone have this same problem/know a solution? I know, I know, I’m way behind the times on this…</li>
</ol>
Interview with Hadley Wickham - Developer of ggplot2
2012-05-11T16:11:20+00:00
http://simplystats.github.io/2012/05/11/ha
<div class="im">
<strong>Hadley Wickham</strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong><img height="365" src="http://biostat.jhsph.edu/~jleek/hw.jpg" width="244" /></strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong><br /></strong><a href="http://had.co.nz/" target="_blank">Hadley Wickham </a>is the Dobelman Family Junior Chair of Statistics at Rice University. Prior to moving to Rice, he completed his Ph.D. in Statistics from Iowa State University. He is the developer of the wildly popular <a href="http://had.co.nz/ggplot2/" target="_blank">ggplot2</a> software for data visualization and a contributor to the <a href="http://www.ggobi.org/" target="_blank">Ggobi </a>project. He has developed a number of really useful R packages touching everything from data processing, to data modeling, to visualization.
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong>Which term applies to you: data scientist, statistician, computer</strong><br /><strong>scientist, or something else?</strong></p>
</div>
<p><span>I’m an assistant professor of statistics, so I at least partly</span><br />
<span>associate with statistics :). But the idea of data science really</span><br />
<span>resonates with me: I like the combination of tools from statistics and</span><br />
<span>computer science, data analysis and hacking, with the core goal of</span><br />
<span>developing a better understanding of data. Sometimes it seems like not</span><br />
<span>much statistics research is actually about gaining insight into data.</span></p>
<div class="im">
<strong>You have created/maintain several widely used R packages. Can you</strong><br /><strong>describe the unique challenges to writing and maintaining packages</strong><br /><strong>above and beyond developing the methods themselves?</strong></p>
</div>
<p>I think there are two main challenges: turning ideas into code, and<br />
documentation and community building.</p>
<p>Compared to other languages, the software development infrastructure<br />
in R is weak, which sometimes makes it harder than necessary to turn<br />
my ideas into code. Additionally, I get less and less time to do<br />
software development, so I can’t afford to waste time recreating old<br />
bugs, or releasing packages that don’t work. Recently, I’ve been<br />
investing time in helping build better dev infrastructure; better<br />
tools for documentation <a href="http://github.com/klutometis/roxygen" target="_blank">[roxygen2]</a>, unit testing <a href="https://github.com/hadley/test_that" target="_blank">[testthat]</a>, package development <a href="https://github.com/hadley/devtools" target="_blank">[devtools]</a>, and creating package website <a href="https://github.com/hadley/staticdocs" target="_blank">[staticdocs]</a>. Generally, I’ve<br />
found unit tests to be a worthwhile investment: they ensure you never<br />
accidentally recreate an old bug, and give you more confidence when<br />
radically changing the implementation of a function.</p>
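<p>For readers who haven’t used <a href="https://github.com/hadley/test_that" target="_blank">testthat</a>, here is a toy regression test of the kind described above (the function is made up for illustration; it is not from ggplot2):</p>
<pre><code>
library(testthat)

## Function under test: rescale a numeric vector to [0, 1]
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

## Once a bug is fixed, a test like this keeps it from quietly coming back
test_that("rescale01 maps the range to [0, 1] and ignores NAs", {
  expect_equal(rescale01(c(0, 5, 10)), c(0, 0.5, 1))
  expect_equal(range(rescale01(c(1, NA, 3)), na.rm = TRUE), c(0, 1))
})
</code></pre>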
<p>Documenting code is hard work, and it’s certainly something I haven’t<br />
mastered. But documentation is absolutely crucial if you want people<br />
to use your work. I find the main challenge is putting yourself in the<br />
mind of the new user: what do they need to know to use the package<br />
effectively. This is really hard to do as a package author because<br />
you’ve internalised both the motivating problem and many of the common<br />
solutions.</p>
<p>Connected to documentation is building up a community around your<br />
work. This is important to get feedback on your package, and can be<br />
helpful for reducing the support burden. One of the things I’m most<br />
proud of about ggplot2 is something that I’m barely responsible for:<br />
the ggplot2 mailing list. There are now ggplot2 experts who answer far<br />
more questions on the list than I do. I’ve also found github to be<br />
great: there’s an increasing community of users proficient in both R<br />
and git who produce pull requests that fix bugs and add new features.</p>
<p>The flip side of building a community is that as your work becomes<br />
more popular you need to be more careful when releasing new versions.<br />
The last major release of ggplot2 (0.9.0) broke over 40 (!!) CRAN<br />
packages, and forced me to rethink my release process. Now I advertise<br />
releases a month in advance, and run `R CMD check` on all downstream<br />
dependencies (`devtools::revdep_check` in the development version), so<br />
I can pick up potential problems and give other maintainers time to<br />
fix any issues.</p>
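<p>A rough sketch of the reverse-dependency check described above (my approximation, not Wickham’s actual release script): list the CRAN packages that depend on or import ggplot2, then run <code>R CMD check</code> on each against the new version.</p>
<pre><code>
## Which CRAN packages could a new ggplot2 release break?
db <- available.packages()
revdeps <- tools::package_dependencies("ggplot2", db = db,
                                       which = c("Depends", "Imports"),
                                       reverse = TRUE)[["ggplot2"]]
length(revdeps)   # number of downstream packages to check before release
head(revdeps)
## In practice each of these would then be checked with `R CMD check`
## (the loop devtools::revdep_check is meant to automate).
</code></pre>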
<div class="im">
<strong>Do you feel that the academic culture has caught up with and supports</strong><br /><strong>non-traditional academic contributions (e.g. R packages instead of</strong><br /><strong>papers)?</strong></p>
</div>
<p><span>It’s hard to tell. I think it’s getting better, but it’s still hard to</span><br />
<span>get recognition that software development is an intellectual activity</span><br />
<span>in the same way that developing a new mathematical theorem is. I try</span><br />
<span>to hedge my bets by publishing papers to accompany my major packages:</span><br />
<span>I’ve also found the peer-review process very useful for improving the</span><br />
<span>quality of my software. Reviewers from both the R journal and the</span><br />
<span>Journal of Statistical Software have provided excellent suggestions</span><br />
<span>for enhancements to my code.</span></p>
<div class="im">
<strong>You have given presentations at several start-up and tech companies.</strong><br /><strong>Do the corporate users of your software have different interests than</strong><br /><strong>the academic users?</strong></p>
</div>
<p><span>By and large, no. Everyone, regardless of domain, is struggling to</span><br />
<span>understand ever larger datasets. Across both industry and academia,</span><br />
<span>practitioners are worried about reproducible research and thinking</span><br />
<span>about how to apply the principles of software engineering to data</span><br />
<span>analysis.</span></p>
<div class="im">
<strong>You gave one of my favorite presentations called Tidy Data/Tidy Tools</strong><br /><strong>at the NYC Open Statistical Computing Meetup. What are the key</strong><br /><strong>elements of tidy data that all applied statisticians should know?</strong></p>
</div>
<p>Thanks! Basically, make sure you store your data in a consistent<br />
format, and pick (or develop) tools that work with that data format.<br />
The more time you spend munging data in the middle of an analysis, the<br />
less time you have to discover interesting things in your data. I’ve<br />
tried to develop a consistent philosophy of data that means when you<br />
use my packages (particularly <a href="http://plyr.had.co.nz/" target="_blank">plyr</a> and <a href="http://had.co.nz/ggplot2/" target="_blank">ggplot2</a>), you can focus on the<br />
data analysis, not on the details of the data format. The principles<br />
of tidy data that I adhere to are that every column should be a<br />
variable, every row an observation, and different types of data should<br />
live in different data frames. (If you’re familiar with database<br />
normalisation this should sound pretty familiar!). I expound these<br />
principles in depth in my in-progress <a href="http://vita.had.co.nz/papers/tidy-data.html" target="_blank">[paper on the<br />topic]</a>. </p>
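<p>A small, made-up example of those principles in action, using Wickham’s reshape2 package (the data and column names are invented for illustration):</p>
<pre><code>
library(reshape2)

## "Messy": one row per patient, one column per treatment
messy <- data.frame(patient = c("a", "b", "c"),
                    treatment_x = c(6.2, 7.1, 5.9),
                    treatment_y = c(5.4, 6.8, 6.1))

## "Tidy": every column is a variable, every row is one observation
tidy <- melt(messy, id.vars = "patient",
             variable.name = "treatment", value.name = "response")
tidy
</code></pre>
<p>Once the data are in this shape, tools like plyr and ggplot2 can operate on them directly, with no further munging mid-analysis.</p>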
<div class="im">
<strong>How do you decide what project to work on next? Is your work inspired</strong><br /><strong>by a particular application or more general problems you are trying to</strong><br /><strong>tackle?</strong></p>
</div>
<p>Very broadly, I’m interested in the whole process of data analysis:<br />
the process that takes raw data and converts it into understanding,<br />
knowledge and insight. I’ve identified three families of tools<br />
(manipulation, modelling and visualisation) that are used in every<br />
data analysis, and I’m interested both in developing better individual<br />
tools, but also smoothing the transition between them. In every good<br />
data analysis, you must iterate multiple times between manipulation,<br />
modelling and visualisation, and anything you can do to make that<br />
iteration faster yields qualitative improvements to the final analysis<br />
(that was one of the driving reasons I’ve been working on tidy data).</p>
<p>Another factor that motivates a lot of my work is teaching. I hate<br />
having to teach a topic that’s just a collection of special cases,<br />
with no underlying theme or theory. That drive lead to <a href="http://cran.r-project.org/web/packages/stringr/index.html" target="_blank">[stringr]</a> (for<br />
string manipulation) and <a href="http://cran.r-project.org/web/packages/lubridate/index.html" target="_blank">[lubridate]</a> (with Garrett Grolemund for working<br />
with dates). I recently released the <a href="https://github.com/hadley/httr" target="_blank">[httr]</a> package which aims to do a similar thing for http requests - I think this is particularly important as more and more data starts living on the web and must be accessed through an API.</p>
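<p>A brief, invented illustration of the consistency those packages aim for (not taken from the interview):</p>
<pre><code>
library(stringr)
library(lubridate)

dates <- c("2012-05-11", "2012/06/01", "20120615")
posts <- c("Interview with Hadley Wickham", "Sunday data/statistics link roundup")

ymd(dates)                       # lubridate parses all three formats the same way
str_detect(posts, "roundup")     # stringr: always (string, pattern), always vectorised
str_replace(posts, "Sunday", "Weekly")
</code></pre>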
<div class="im">
<strong>What do you see as the biggest open challenges in data visualization</strong><br /><strong>right now? Do you see interactive graphics becoming more commonplace?</strong></p>
</div>
<p>I think one of the biggest challenges for data visualisation is just<br />
communicating what we know about good graphics. The first article<br />
decrying 3d bar charts was <a href="http://www.jstor.org/stable/2682265" target="_blank">published in 1951</a>! Many plots still use<br />
rainbow scales or red-green colour contrasts, even though we’ve known<br />
for decades that those are bad. How can we ensure that people<br />
producing graphics know enough to do a good job, without making them<br />
read hundreds of papers? It’s a really hard problem.</p>
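<p>For the colour point specifically, a quick sketch of the usual advice (my example, not Wickham’s): swap <code>rainbow()</code> for a sequential palette, such as those in RColorBrewer, when the variable being encoded is ordered.</p>
<pre><code>
library(RColorBrewer)

par(mfrow = c(1, 2))
image(volcano, col = rev(rainbow(9)),
      main = "rainbow: hue changes imply false boundaries")
image(volcano, col = brewer.pal(9, "Blues"),
      main = "sequential palette: ordered lightness")
</code></pre>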
<p>Another big challenge is balancing the tension between exploration and<br />
presentation. For exploratory graphics, you want to spend five seconds<br />
(or less) to create a plot that helps you understand the data, while you might spend<br />
five hours on a plot that’s persuasive to an audience who<br />
isn’t as intimately familiar with the data as you. To date, we have<br />
great interactive graphics solutions at either end of the spectrum<br />
(e.g. ggobi/iplots/manet vs d3) but not much that transitions from one<br />
end of the spectrum to the other. This summer I’ll be spending some<br />
time thinking about what ggplot2 + <a href="http://d3js.org/" target="_blank">[d3]</a> might<br />
equal, and how we can design something like an interactive grammar of<br />
graphics that lets you explore data in R, while making it easy to<br />
publish interaction presentation graphics on the web.</p>
What are the products of data analysis?
2012-05-10T12:55:41+00:00
http://simplystats.github.io/2012/05/10/what-are-the-products-of-data-analysis
<p>Thanks to everyone for the feedback on my post on <a href="http://simplystatistics.tumblr.com/post/22585430491/how-do-you-know-if-someone-is-great-at-data-analysis" target="_blank">knowing when someone is good at data analysis</a>. A couple people suggested I take a look <a href="http://www.kaggle.com/users" target="_blank">here</a> for a few people who have proven they’re good at data analysis. I think that’s a great idea and a good place to start.</p>
<p>But I also think that while demonstrating an ability to build good prediction models is impressive and definitely shows an understanding of the data, not all important problems can be easily posed as prediction problems. Most of my work does not involve prediction at all and the problems I face (i.e., estimating very small effects in the presence of large unmeasured confounding factors) would be difficult to formulate as a prediction challenge (at least, I can’t think of an easy way). In fact, part of <a href="http://www.scribd.com/full/28736728?access_key=key-21htoe67zs4rs9ecj1al" target="_blank">my</a> and <a href="http://www.ncbi.nlm.nih.gov/pubmed/22364439" target="_blank">my colleagues’</a> research involves showing how statistical methods designed for prediction problems can fail miserably when applied to other non-prediction settings.</p>
<p>The general question I have is what is a useful product that you can produce from a data analysis that demonstrates the quality of that analysis? So, a very small mean squared error from a prediction model would be one product (especially if it were smaller than everyone else’s). Maybe a cool graph with a story behind it? </p>
<p>If I were hiring a musician for an orchestra, I wouldn’t have to meet that person to have strong evidence that he/she were good. I could just listen to some recordings of that person playing and that would be a pretty good predictor of how that person would perform in the orchestra. In fact, some major orchestras do completely blind auditions so that although the person is present in the room, all you hear is the sound of the playing.</p>
<p>What seems to be true with music at least, is that even though the final performance doesn’t specifically reveal the important decisions that were made along the way to craft the interpretation of the music, somehow one is still able to appreciate the fact that all those decisions were made and they benefitted the performance. To me, it seems unlikely to arrive at a sublime performance either by chance or by some route that didn’t involve talent and hard work. Maybe it could happen once, but to produce a great performance over and over requires more than just luck.</p>
<p>What products could you send to someone to convince them you were good at data analysis? I raise this question primarily because when I look around at the products that I make (research papers, software, books, blogs), even if they are very good, I don’t think they necessarily convey any useful information about my ability to analyze data.</p>
<p>What’s the data analysis equivalent of a musician’s performance?</p>
DealBook: Glaxo to Make Hostile Bid for Human Genome Sciences
2012-05-09T13:48:59+00:00
http://simplystats.github.io/2012/05/09/dealbook-glaxo-to-make-hostile-bid-for-human-genome
<p><a href="http://dealbook.nytimes.com/2012/05/09/glaxosmithkline-to-make-hostile-bid-for-human-genome-sciences/">DealBook: Glaxo to Make Hostile Bid for Human Genome Sciences</a></p>
Data analysis competition combines statistics with speed
2012-05-08T00:12:20+00:00
http://simplystats.github.io/2012/05/08/data-analysis-competition-combines-statistics-with
<p><a href="http://www.dailybruin.com/index.php/article/2012/05/data_competition_combines_analysis_with_speed?_mo=1">Data analysis competition combines statistics with speed</a></p>
How do you know if someone is great at data analysis?
2012-05-07T13:17:16+00:00
http://simplystats.github.io/2012/05/07/how-do-you-know-if-someone-is-great-at-data-analysis
<p>Consider this exercise. Come up with a list of the top 5 people that you think are really good at data analysis.</p>
<p>There’s one catch: They have to be people that you’ve never met nor have had any sort of personal interaction with (e.g. email, chat, etc.). So basically people who have written papers/books you’ve read or have given talks you’ve seen or that you know through other publicly available information. Who comes to mind? It’s okay to include people who are no longer living.</p>
<p>The other day I was thinking about the people who I think are really good at data analysis and it occurred to me that they were all people I knew. So I started thinking about people that I don’t know (and there are many) but are equally good at data analysis. This turned out to be much harder than I thought. And I’m sure it’s not because they don’t exist, it’s just because I think good data analysis chops are hard to evaluate from afar using the standard methods by which we evaluate people.</p>
<p>I think there are a few reasons. First, people who are great at data analysis are likely not publishing papers or being productive in a manner that I, an outsider, would be able to observe. If they’re at a pharmaceutical company working on a new drug or at some fancy new startup, there’s no way I’m ever going to know about it unless I’m directly involved.</p>
<p>Another reason is that even for people who are well-known scientists or statisticians, the products they produce don’t really highlight the difficulties overcome in data analysis. For example, many good papers in the statistics literature will describe a new method with brief reference to the data that inspired the method’s development. In those cases, the data analysis usually appears obvious, as most things do <em>after</em> they’ve been done. Furthermore, papers usually exclude all the painful details about merging, cleaning, and inspecting the data as well as all the other things you tried that didn’t work. Papers in the substantive literature have a similar problem, which is that they focus on a scientific problem of interest and the analysis of the data is secondary.</p>
<p>As skills in data analysis become more important, it seems odd to me that we don’t have a great way to evaluate a person’s ability to do it as we do in other areas.</p>
Illumina stays independent, for now
2012-05-06T11:01:00+00:00
http://simplystats.github.io/2012/05/06/illumina-stays-independent-for-now
<p><a href="http://dealbook.nytimes.com/2012/04/20/the-escalation-in-hostile-takeover-offers/">Illumina stays independent, for now</a></p>
UCLA Data Fest 2012
2012-05-05T10:49:35+00:00
http://simplystats.github.io/2012/05/05/ucla-data-fest-2012
<p>The very very cool UCLA <a href="http://datafest.stat.ucla.edu/groups/datafest/" target="_blank">Data Fest</a> is going on as we speak. This is a statistical analysis marathon where teams of undergrads work through the night (and day) to address an important problem through data analysis. Last year they looked at crime data from the Los Angeles Police Department. I’m looking forward to seeing how this year goes.</p>
<p>Great work by <a href="http://www.stat.ucla.edu/~rgould/Home/About_Me.html" target="_blank">Rob Gould</a> and the <a href="http://www.stat.ucla.edu/" target="_blank">Department of Statistics</a> there.</p>
New National Academy of Sciences Members
2012-05-04T14:46:18+00:00
http://simplystats.github.io/2012/05/04/new-national-academy-of-sciences-members
<p>The National Academy of Sciences elected <a href="http://www.nasonline.org/news-and-multimedia/news/2012_05_01_NAS_Election.html" target="_blank">new members</a> a few days ago. Among them are statistician <a href="http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CGoQFjAA&url=http%3A%2F%2Fwww-stat.stanford.edu%2F~tibs%2F&ei=E9ijT9feLajo0gGo3byuCQ&usg=AFQjCNH9sYoebTZ858PQOmkuwC8XR7CZtA&sig2=H8W1CQVbC-ypebfWgFQCcQ" target="_blank">Robert Tibshirani</a> and sociologist <a href="http://sociology.uchicago.edu/people/faculty/raudenbush.shtml" target="_blank">Stephen Raudenbush</a>. Obviously well-deserved!</p>
<p>(Thanks to Karl Broman.)</p>
Hammer On The Importance Of Statistics Or As I
2012-05-04T12:52:39+00:00
http://simplystats.github.io/2012/05/04/hammer-on-the-importance-of-statistics-or-as-i
<p>[youtube http://www.youtube.com/watch?v=k6aBITJuSQA?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]</p>
<p>Hammer on the importance of statistics (or, as I used to know him, MC Hammer). The overlay of the video for “Can’t Touch This” really helps me understand what he’s talking about. (Thanks to Chris V. for the link.)</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Just like regular communism, dongle communism has failed
2012-05-02T17:09:37+00:00
http://simplystats.github.io/2012/05/02/just-like-regular-communism-dongle-communism-has
<p>Bad news comrades. <a href="http://simplystatistics.tumblr.com/post/10555655037/dongle-communism" target="_blank">Dongle communism</a> is under attack. Check out how this poor dongle has been subjugated. This is in our lab meeting room. To add insult to injury, this happened on <a href="http://en.wikipedia.org/wiki/International_Workers%27_Day" target="_blank">May 1st</a>! </p>
<p><img height="244" src="http://rafalab.jhsph.edu/simplystats/dongle-capitalism.jpg" width="320" /></p>
GE's Billion-Dollar Bet on Big Data
2012-05-02T14:57:00+00:00
http://simplystats.github.io/2012/05/02/ges-billion-dollar-bet-on-big-data
<p><a href="http://www.businessweek.com/articles/2012-04-26/ges-billion-dollar-bet-on-big-data">GE’s Billion-Dollar Bet on Big Data</a></p>
Sample mix-ups in datasets from large studies are more common than you think
2012-05-01T15:01:13+00:00
http://simplystats.github.io/2012/05/01/sample-mix-ups-in-datasets-from-large-studies-are-more
<p>If you have analyzed enough high throughput data you have seen it before: a male sample that is really a female, a liver that is a kidney, etc… As the datasets I analyze get bigger I see more and more sample mix-ups. When I find a couple of samples for which sex is incorrectly annotated (one can easily see this from examining data from X and Y chromosomes) I can’t help but wonder if there are more that are undetectable (e.g. swapping samples of same sex). Datasets that include two types of measurements, for example genotypes and gene expression, make it possible to detect sample swaps more generally. I recently attended a talk by <a href="http://www.biostat.wisc.edu/~kbroman/" target="_blank">Karl Broman</a> on this topic (one of best talks I’ve seen.. check out the slides <a href="http://www.biostat.wisc.edu/~kbroman/presentations/mousegenet2011.pdf" target="_blank">here</a>). Karl reports an example in which <span>it looks as if </span>whoever was pipetting skipped a sample and kept on going, introducing an off-by-one error for over 50 samples. As I sat through the talk, I wondered how many of the large GWAS studies have mix-ups like this?</p>
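<p>As a rough illustration of the X/Y chromosome check mentioned above, here is a minimal sketch in R (a toy example, not code from Karl’s talk or from MixupMapper). It assumes an expression matrix <code>expr</code> with genes in rows and samples in columns, containing the X-inactivation gene XIST (high in females) and the Y-linked gene RPS4Y1 (expressed essentially only in males), plus a vector of annotated sexes.</p>
<pre><code># Flag samples whose annotated sex disagrees with a crude expression-based guess.
# 'expr' (genes x samples) and 'annotated_sex' ("F"/"M" per sample) are assumed inputs.
flag_sex_mixups <- function(expr, annotated_sex) {
  xist  <- expr["XIST", ]    # X-inactivation transcript: high in females
  rps4y <- expr["RPS4Y1", ]  # Y-linked gene: expressed mainly in males
  predicted_sex <- ifelse(rps4y > xist, "M", "F")  # crude rule of thumb
  which(predicted_sex != annotated_sex)            # indices of suspect samples
}
</code></pre>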
<p>A <a href="http://www.ncbi.nlm.nih.gov/pubmed/21653519" target="_blank">recent paper</a> (gated) published by Lude Franke and colleagues describes MixupMapper: a method for detecting and correcting mix-ups. They examined several public datasets and discovered mix-ups in all of them. The worst performing study, <a href="http://www.ncbi.nlm.nih.gov/pubmed/19043577" target="_blank">published in PLoS Genetics</a>, was reported to have 23% of the samples swapped. I was surprised that the MixupMapper paper was not published in a higher impact journal. Turns out PLoS Genetics rejected the paper. I think this was a big mistake on their part: the paper is clear and well written, reports a problem with a PLoS Genetics papers, and describes a solution to a problem that should have us all quite worried. I think it’s important that everybody learn about this problem so I was happy to see that, eight months later, Nature Genetics <a href="http://www.ncbi.nlm.nih.gov/pubmed/22484626" target="_blank">published a paper reporting mix-ups</a> (gated)… but they didn’t cite the MixupMapper paper! Sorry Lude, welcome to the <a href="http://simplystatistics.tumblr.com/post/13680729270/reverse-scooping" target="_blank">reverse scooped</a> club. </p>
A disappointing response from @NatureMagazine about folks with statistical skills
2012-04-30T15:02:56+00:00
http://simplystats.github.io/2012/04/30/a-disappointing-response-from-naturemagazine-about
<p>Last week <a href="http://simplystatistics.tumblr.com/post/21845976361/nature-is-hiring-a-data-editor-how-will-they-make" target="_blank">I linked to</a> an ad for a Data Editor position at Nature Magazine. I was super excited that Nature was recognizing data as an important growth area. But the ad doesn’t mention anything about statistical analysis skills; it focuses exclusively on data management expertise. As I pointed out in the earlier post, managing data is only half the equation - figuring out what to do with the data is the other half. The second half requires knowledge of statistics.</p>
<p>The folks over at Nature <a href="https://twitter.com/#!/NatureMagazine/status/195523909771198464" target="_blank">responded to our post</a> on Twitter:</p>
<blockquote>
<p><span> it’s unrealistic to think this editor (or anyone) could do what you suggest. Curation & accessibility are key. ^ng</span></p>
</blockquote>
<p>I disagree with this statement for the following reasons:</p>
<ol>
<li>Is it really unrealistic to think someone could have data management and statistical expertise? Pick your favorite data scientist and you would have someone with those skills. Most students coming out of computer science, computational biology, bioinformatics, or statistical genomics programs would have a blend of those two skills in some proportion. </li>
</ol>
<p>But maybe the problem is this:</p>
<blockquote>
<p><span>Applicants must have a PhD in the biological sciences</span></p>
</blockquote>
<p>It is possible that there are few PhDs in the biological sciences who know both statistics and data management (although that is probably changing). But most computational biologists have a pretty good knowledge of biology and a <strong>very</strong> good knowledge of data - both managing and analyzing. If you are hiring a data editor, this might be the target audience. I’d replace “PhD in the biological sciences” in the ad with knowledge of biology, statistics, data analysis, and data visualization. There would be plenty of folks with those qualifications.</p>
<ol start="2">
<li>
<p>The response mentions curation, which is a critical issue. But good curation requires knowledge of two things: (i) the biological or scientific problem and (ii) how and in what way the data will be analyzed and used by researchers. As the <a href="http://simplystatistics.tumblr.com/post/18378666076/the-duke-saga-starter-set" target="_blank">Duke scandal</a> made clear, a statistician with technological and biological knowledge running through a data analysis will identify many critical issues in data curation that would be missed by someone who doesn’t actually analyze data. </p>
</li>
<li>
<p>The response says that “Curation and accessibility” are key. I agree that they are <em>part</em> of the key. It is critical that data can be properly accessed by researchers to perform new analyses, verify results in papers, and discover new results. But if the goal is to ensure the quality of science being published in Nature (the role of an editor) curation and accessibility are not enough. The editor should be able to evaluate statistical methods described in papers to identify potential flaws, or to rerun code and make sure that it performs the same/sensible analyses. A bad analysis that is reproducible will be discovered more quickly, but it is still a bad analysis. </p>
</li>
</ol>
<p>To be fair, I don’t think that Nature is the only organization that is missing the value of statistical skill in hiring data positions. It seems like many organizations are still just searching for folks who can handle/process the massive data sets being generated. But if they want to make accurate and informed decisions, statistical knowledge needs to be at the top of their list of qualifications. </p>
Sunday data/statistics link roundup (4/29)
2012-04-29T22:57:41+00:00
http://simplystats.github.io/2012/04/29/sunday-data-statistics-link-roundup-4-29
<ol>
<li>Nature genetics has <a href="http://www.nature.com/ng/journal/v44/n5/full/ng.2264.html" target="_blank">an editorial</a> on the Mayo and Myriad cases. I agree with this bit: “<span>In our opinion, it is not new judgments or legislation that are needed but more innovation. In the era of whole-genome sequencing of highly variable genomes, it is increasingly hard to justify exclusive ownership of particularly useful parts of the genome, and method claims must be more carefully described.” Via <a href="http://www.biostat.jhsph.edu/~ajaffe/" target="_blank">Andrew J.</a></span></li>
<li>One of Tech Review’s 10 emerging technologies from a February 2003 article? <a href="http://www.technologyreview.com/InfoTech/12256/" target="_blank">Data mining</a>. I think doing interesting things with data has probably always been a hot topic, it just gets press in cycles. Via Aleks J. </li>
<li>An infographic in the New York Times compares the profits and taxes of Apple <a href="http://www.nytimes.com/imagepages/2012/04/29/technology/29appletax-hp-graphic.html?ref=business" target="_blank">over time</a>, <a href="http://www.nytimes.com/2012/04/29/business/apples-tax-strategy-aims-at-low-tax-states-and-nations.html?_r=1&hp" target="_blank">here is an explanation</a> of how they do it. (Via Tim O.)</li>
<li>Saw <a href="https://twitter.com/#!/fivethirtyeight/status/192683954510364672" target="_blank">this tweet</a> via Joe B. I’m not sure if the frequentists or the Bayesians are winning, but it seems to me that the battle no longer matters to my generation of statisticians - there are too many data sets to analyze, better to just use what works!</li>
<li>Statistical and computational algorithms that <a href="http://www.wired.com/gadgetlab/2012/04/can-an-algorithm-write-a-better-news-story-than-a-human-reporter/" target="_blank">write news stories</a>. Simply Statistics remains 100% human written (for now). </li>
<li>The <a href="http://simplystatistics.tumblr.com/post/12076163379/the-5-most-critical-statistical-concepts" target="_blank">5 most critical</a> statistical concepts. </li>
</ol>
People in positions of power that don't understand statistics are a big problem for genomics
2012-04-27T15:16:29+00:00
http://simplystats.github.io/2012/04/27/people-in-positions-of-power-that-dont-understand
<p class="p1">
I finally got around to reading the <a href="http://www.iom.edu/Reports/2012/Evolution-of-Translational-Omics.aspx" target="_blank">IOM report on translational omics</a> and it is very good. The report lays out problems with current practices and how these led to undesired results such as the now infamous <a href="http://simplystatistics.tumblr.com/post/18378666076/the-duke-saga-starter-set" target="_blank">Duke trials</a> and the <a href="http://online.wsj.com/article/SB10001424052702303627104576411850666582080.html" target="_blank">growth in retractions</a> in the scientific literature. Specific recommendations are provided related to reproducibility and validation. I expect the report will improve things. Although I think bigger improvements will come as a result of retirements.
</p>
<p class="p1">
In general, I think the field of <em>genomics</em> (a label that is used quite broadly) is producing great discoveries and I strongly believe we are just getting started. But we can’t help but notice that retractions and questionable findings are particularly common in this field. In my view most of the problems we are currently suffering stem from the fact that a substantial number of the people in positions of power do not understand statistics and have no experience with computing. Nevins’s biggest mistake was not admitting to himself that he did not understand what Baggerly and Coombes were saying. The lack of reproducibility just exacerbated the problem. The same is true for the editors that rejected the letters written by this pair in their effort to expose a serious problem - a problem that was obvious to all the statistics-savvy biologists I talked to.
</p>
<p class="p1">
Unfortunately Nevins is not the only head of a large genomics lab that does not understand basic statistical principles and has no programming/data-management experience. So how do people without the necessary statistical and computing skills to be considered experts in genomics become leaders of the field? I think this is due to the speed at which Biology changed from a data-poor discipline to a data-intensive one. For example, before microarrays, the analysis of gene expression data amounted to spotting black dots on a piece of paper (see Figure A below). In the mid-90s this suddenly changed to sifting through tens of thousands of numbers (see Figure B).
</p>
<p><img src="http://simplystatistics.org/wp-content/uploads/2013/05/expression.jpg" alt="gene expression" /></p>
<p class="p1">
Note that typically, statistics is not a requirement of the Biology graduate programs associated with genomics. At Hopkins neither of the two major programs (<a href="http://cmm.jhu.edu/index.php?title=Home" target="_blank">CMM</a> and <a href="http://biolchem.bs.jhmi.edu/bcmb/Pages/index.aspx" target="_blank">BCMB</a>) require it. And this is expected, since outside of genomics one can do great Biology without quantitative skills and for most of the 20th century most Biology was like this. So when the genomics revolution first arrived, the great majority of powerful Biology lab heads had no statistical training whatsoever. Nonetheless, a few of these decided to delve into this “sexy” new field and using their copious resources were able to perform some of the first big experiments. Similarly, Biology journals that were not equipped to judge the data analytic component of genomics papers were eager to publish papers in this field, a fact that further compounded the problem.
</p>
<p class="p1">
But as I mentioned above, in general, the field of genomics is producing wonderful results. Several lab heads did have statistical and computational expertise, while others formed strong partnerships with quantitative types. Here I should mention that for these partnerships to be successful the statisticians also needed to expand their knowledge base. The quantitative half of the partnership needs to be biology- and technology-savvy or they too can make <a href="http://retractionwatch.wordpress.com/2011/07/21/sebastiani-group-retracts-genetics-of-aging-study-from-science/" target="_blank">mistakes that lead to retractions</a>.
</p>
<p class="p1">
Nevertheless, the field is riddled with problems; enough to prompt an IOM report. But although the present is somewhat grim, I am optimistic about the future. The new generation of biologists leading the genomics field are clearly more knowledgeable and appreciative of statistics and computing than the previous ones. Natural selection helps, as these new investigators can’t rely on pre-genomics-revolution accomplishments and those that do not possess these skills are simply outperformed by those that do. I am also optimistic because biology graduate programs are starting to incorporate statistics and computation into their curricula. For example, as of last year, our <a href="http://humangenetics.jhmi.edu/" target="_blank">Human Genetics</a> program requires our <a href="http://biostat.jhsph.edu/~iruczins/teaching/140.615/info.html" target="_blank">Biostats 615-616 course</a>.
</p>
Nature is hiring a data editor...how will they make sense of the data?
2012-04-26T13:02:00+00:00
http://simplystats.github.io/2012/04/26/nature-is-hiring-a-data-editor-how-will-they-make
<p>It looks like the journal Nature is <a href="http://www.nature.com/naturejobs/science/jobs/258826-Chief-Editor-Data" target="_blank">hiring a Chief Data Editor</a> (link via Hilary M.). The primary purpose of this editor appears to be developing tools for collecting, curating, and distributing data, with the goal of improving reproducible research.</p>
<p>The main duties of the editor, as described by the ad, are: </p>
<blockquote>
<p><span>Nature Publishing Group is looking for a Chief Editor to develop a product aimed at making research data more available, discoverable and interpretable.</span></p>
</blockquote>
<p>The ad also mentions having an eye for commercial potential; I wonder if this move was motivated by companies like <a href="http://figshare.com/" target="_blank">figshare</a> who are already providing a reproducible data service. I haven’t used figshare, but the early reports from friends who have are that it is great. </p>
<p>The thing that bothered me about the ad is that there is a strong focus on data collection/storage/management but absolutely no mention of the second component of the data science problem: making sense of the data. To make sense of piles of data requires training in applied statistics (<a href="http://simplystatistics.tumblr.com/post/20902656344/statistics-is-not-math" target="_blank">called by whatever name you like best</a>). The ad doesn’t mention any such qualifications. </p>
<p>Even if the goal of the position is just to build a competitor to figshare, it seems like a good idea for the person collecting the data to have some idea of what researchers are going to do with it. When dealing with data, those researchers will frequently be statisticians by one name or another. </p>
<p>Bottom line: I’m stoked Nature is recognizing the importance of data in this very prominent way. But I wish they’d realize that a data revolution also requires a revolution in statistics. </p>
How do I know if my figure is too complicated?
2012-04-25T17:01:36+00:00
http://simplystats.github.io/2012/04/25/how-do-i-know-if-my-figure-is-too-complicated
<p>One of the key things every statistician needs to learn is how to create informative figures and graphs. Sometimes, it is easy to use off-the-shelf plots like barplots, histograms, or, if one is truly desperate, a <a href="http://simplystatistics.tumblr.com/post/21611701077/sunday-data-statistics-link-roundup-4-22" target="_blank">pie chart</a>. </p>
<p>But sometimes the information you are trying to communicate requires the development of a new graphic. I am currently working on a project with a graduate student where the standard illustrations are <a href="http://en.wikipedia.org/wiki/Venn_diagram" target="_blank">Venn diagrams</a> - including complicated Venn diagrams with 5 or 10 circles. </p>
<p>As we were thinking about different ways of illustrating our data, I started thinking about what the key qualities of a graphic are and how I know if it is too complicated. I realized that:</p>
<ol>
<li>Ideally just looking at the graphic one can intuitively understand what is going on, but sometimes for more technical/involved displays this isn’t possible</li>
<li>Alternatively, I think a good plot should be able to be explained in 2 sentences or less. I think that is true for pretty much every plot I use regularly. </li>
<li>That doesn’t include describing what different colors/sizes/shapes specifically represent in any particular version of the graphic. </li>
</ol>
<p>I feel like there is probably something to this in the <a href="http://www.amazon.com/The-Grammar-Graphics-Leland-Wilkinson/dp/0387987746" target="_blank">Grammar of Graphics</a> or in some of <a href="http://www.stat.purdue.edu/~wsc/papersbooks.pdf" target="_blank">William Cleveland’s</a> work. But this is one of the first times I’ve come up with a case where a new, generalizable, type of graph needs to be developed. </p>
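<p>To make concrete why those 5- or 10-circle Venn diagrams get unwieldy, here is a minimal sketch with made-up sets (not the data from the project mentioned above). With 5 sets there are already 2^5 - 1 = 31 possible regions inside the circles, and tabulating their sizes directly is often the first step toward designing a simpler display.</p>
<pre><code># Toy example: 5 overlapping sets and the sizes of all membership patterns
set.seed(2)
universe <- paste0("gene", 1:200)
sets <- setNames(lapply(1:5, function(i) sample(universe, 80)), paste0("S", 1:5))
membership <- sapply(sets, function(s) universe %in% s)        # 200 x 5 logical matrix
pattern <- apply(membership, 1, function(x) paste(as.integer(x), collapse = ""))
head(sort(table(pattern), decreasing = TRUE))                  # biggest regions first
</code></pre>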
On the future of personalized medicine
2012-04-24T13:04:00+00:00
http://simplystats.github.io/2012/04/24/on-the-future-of-personalized-medicine
<p>Jeff Leek, Reeves Anderson, and I recently wrote a <a href="http://www.nature.com/nature/journal/v484/n7394/full/484318a.html" target="_blank">correspondence to <em>Nature</em></a> (subscription req.) regarding the <a href="http://simplystatistics.tumblr.com/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court" target="_blank">Supreme Court decision in <em>Mayo v. Prometheus</em></a> and the recent Institute of Medicine <a href="http://www.iom.edu/Activities/Research/OmicsBasedTests.aspx" target="_blank">report</a> related to the <a href="http://simplystatistics.tumblr.com/post/18378666076/the-duke-saga-starter-set" target="_blank">Duke Clinical Trials Saga</a>. </p>
<p>The basic gist of the correspondence is that the IOM report stresses the need for openness in the process of developing ‘omics based tests, but the Court decision suggests that patent protection will not be available to protect those details. So how will the future of personalized medicine look? There is a much larger, more general, discussion that could be had about patents in this arena and we do not get into that here (hey, we had to squeeze it into 300 words). But it seems that if biotech companies cannot make money from patented algorithms, then they will have to find a new avenue. </p>
<p>Here are some <a href="http://www.biostat.jhsph.edu/~rpeng/talks/MayoIOM.pdf" target="_blank">slides from a recent lecture</a> I gave outlining some of the ideas and providing some background.</p>
Sunday data/statistics link roundup (4/22)
2012-04-22T23:54:12+00:00
http://simplystats.github.io/2012/04/22/sunday-data-statistics-link-roundup-4-22
<ol>
<li>Now we know who is to blame for the <a href="http://www.nytimes.com/2012/04/22/magazine/who-made-that-pie-chart.html" target="_blank">pie chart</a>. I had no idea it had been around, straining our ability to compare relative areas, since 1801. However, the same guy (William Playfair) apparently also invented the bar chart. So he wouldn’t be totally shunned by statisticians. (via Leonid K.)</li>
<li>A <a href="http://www.guardian.co.uk/technology/2012/apr/22/academic-publishing-monopoly-challenged" target="_blank">nice article</a> in the Guardian about the current group of scientists that are boycotting Elsevier. I have to agree with the quote that leads the article, “All professions are conspiracies against the laity.” On the other hand, I agree with Rafa that academics are <a href="http://simplystatistics.tumblr.com/post/15756182268/academics-are-partly-to-blame-for-supporting-the-closed" target="_blank">partially to blame</a> for buying into the closed access hegemony. I think more than a boycott of a single publisher is needed; we need a change in culture. (first link also via Leonid K)</li>
<li>A blog post on how to <a href="http://menugget.blogspot.com/2012/04/adding-transparent-image-layer-to-plot.html#more" target="_blank">add a transparent image layer</a> to a plot. For some reason, I have wanted to do this several times over the last couple of weeks, so the serendipity of seeing it on R Bloggers merited a mention. </li>
<li>I agree the Earth Institute <a href="http://junkcharts.typepad.com/junk_charts/2012/04/the-earth-institute-needs-a-graphics-advisor.html" target="_blank">needs a better graphics advisor</a>. (via Andrew G.)</li>
<li><a href="http://www.nytimes.com/2012/04/22/opinion/sunday/taking-emotions-out-of-our-schools.html" target="_blank">A great article</a> on why multiple choice tests are used - they are an easy way to collect data on education. But that doesn’t mean they are the right data. This reminds me of the Tukey quote: “The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data<strong id="internal-source-marker_0.597119664773345"><span>”. </span></strong><span>It seems to me if you wanted to have a major positive impact on education right now, the best way would be to develop a new experimental design that collects the kind of data that really demonstrates mastery of reading/math/critical thinking. </span></li>
<li>Finally, a bit of a bleg… what is the best way to do the SVD of a huge (think 1e6 x 1e6), sparse matrix in R? Preferably without loading the whole thing into memory… (One possible approach is sketched just below this list.)</li>
</ol>
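<p>For what it’s worth, here is a minimal sketch of one possible answer to the bleg in item 6 (an assumption-laden sketch, not a definitive recommendation): if the sparse matrix itself fits in memory as a sparse <code>Matrix</code> object, the irlba package computes a truncated SVD using only matrix-vector products and never forms a dense matrix. For data that truly cannot fit in memory, you would additionally need some way to compute those matrix-vector products against data stored on disk.</p>
<pre><code># Truncated SVD of a large sparse matrix with irlba (toy dimensions here)
library(Matrix)
library(irlba)

set.seed(1)
A <- rsparsematrix(1e5, 1e5, density = 1e-5)  # stand-in for a huge sparse matrix
s <- irlba(A, nv = 5)                         # top 5 singular values/vectors
str(s[c("d", "u", "v")])
</code></pre>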
Replication, psychology, and big science
2012-04-18T15:29:00+00:00
http://simplystats.github.io/2012/04/18/replication-psychology-and-big-science
<p><a href="http://www.sciencemag.org/content/334/6060/1226.full" target="_blank">Reproducibility</a> <a href="http://simplystatistics.tumblr.com/post/12328728291/interview-with-victoria-stodden" target="_blank">has been</a> a <a href="http://simplystatistics.tumblr.com/post/13780369155/preventing-errors-through-reproducibility" target="_blank">hot topic</a> for the last several years among computational scientists. A study is reproducible if there is a specific set of computational functions/analyses (usually specified in terms of code) that exactly reproduce all of the numbers in a published paper from raw data. It is now recognized that a critical component of the scientific process is that data analyses can be reproduced. This point has been driven home particularly for personalized medicine applications, where irreproducible results <a href="http://www.nature.com/news/lapses-in-oversight-compromise-omics-results-1.10298?nc=1332884191164" target="_blank">can lead to delays</a> in evaluating new procedures that affect patients’ health. </p>
<p>But just because a study is reproducible does not mean that it is <em>replicable</em>. Replicability is stronger than reproducibility. A study is only replicable if you perform the exact same experiment (at least) twice, collect data in the same way both times, perform the same data analysis, and arrive at the same conclusions. The difference with reproducibility is that to achieve replicability, you have to perform the experiment and collect the data again. This of course introduces all sorts of new potential sources of error in your experiment (new scientists, new materials, new lab, new thinking, different settings on the machines, etc.)</p>
<p>Replicability is getting a lot of attention recently in psychology due to some high-profile studies that did not replicate. First, there was the highly-cited experiment that <a href="http://blogs.discovermagazine.com/notrocketscience/2012/03/10/failed-replication-bargh-psychology-study-doyen/" target="_blank">failed to replicate</a>, leading to a showdown between the author of the original experiment and the replicators. Now there is a psychology project that allows researchers to post the results of <a href="http://www.sciencemag.org/content/335/6076/1558" target="_blank">replications of experiments</a> - whether they succeeded or failed. Finally, the <a href="http://openscienceframework.org/project/shvrbV8uSkHewsfD4/wiki/index" target="_blank">Reproducibility Project</a>, probably better termed the Replicability Project, seeks to <a href="http://chronicle.com/blogs/percolator/is-psychology-about-to-come-undone/29045?sid=at&utm_source=at&utm_medium=en" target="_blank">replicate the results</a> of every experiment published in <em>Psychological Science</em>, the <em>Journal of Personality and Social Psychology</em>, or the <em>Journal of Experimental Psychology: Learning, Memory, and Cognition</em> in the year 2008.</p>
<p>Replicability raises important issues for “big science” projects, ranging from genomics (<a href="http://www.1000genomes.org/" target="_blank">The Thousand Genomes Project</a>) to physics (<a href="http://en.wikipedia.org/wiki/Large_Hadron_Collider" target="_blank">The Large Hadron Collider</a>). These experiments are too big and costly to actually replicate. So how do we know the results of these experiments aren’t just errors, that upon replication (if we could do it) would not show up again? Maybe smaller scale replications of sub-projects could be used to help convince us of discoveries in these big projects?</p>
<p>In the meantime, I love the idea that replication is getting the credit it deserves (at least in psychology). The incentives in science often only credit the first person to an idea, not the long tail of folks who replicate the results. For example, replications of experiments are often not considered interesting enough to publish. Maybe these new projects will start to change some of the <a href="http://blog.regehr.org/archives/632" target="_blank">perverse academic incentives</a>.</p>
Roche: Illumina Is No Apple
2012-04-16T17:37:19+00:00
http://simplystats.github.io/2012/04/16/roche-illumina-is-no-apple
<p><a href="http://dealbook.nytimes.com/2012/04/11/roche-illumina-is-no-apple/">Roche: Illumina Is No Apple</a></p>
Sunday data/statistics link roundup (4/15)
2012-04-15T17:30:14+00:00
http://simplystats.github.io/2012/04/15/sunday-data-statistics-link-roundup-4-15
<ol>
<li>Incredibly cool, dynamic real-time maps of <a href="http://hint.fm/wind/" target="_blank">wind patterns</a> in the United States. (Via Flowing Data)</li>
<li>A d3.js <a href="http://gabrielflor.it/water" target="_blank">coding tool</a> that updates automatically as you update the code. This is going to be really useful for beginners trying to learn about D3. <a href="http://gabrielflor.it/water" target="_blank">Real time coding</a> (Via Flowing Data)</li>
<li>An interesting <a href="http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html" target="_blank">blog post </a>describing why the winning algorithm in the Netflix prize hasn’t actually been implemented! It looks like it was too much of an engineering hassle. I wonder if this will make others think twice before offering big sums for prizes like this. Unless the real value is advertising…(via Chris V.)</li>
<li><a href="http://www.fastcoexist.com/1679654/using-big-data-to-predict-your-potential-heart-problems" target="_blank">An article </a>about a group at USC that plans to collect all the information from apps that measure heart beats. Their project is called everyheartbeat. I think this is a little bit pre-mature, given the technology, but certainly the quantified self field is heating up. I wonder how long until the target audience for these sorts of projects isn’t just wealthy young technofiles? </li>
<li>A <a href="http://sellthenews.tumblr.com/post/21067996377/noitdoesnot" target="_blank">really good deconstruction</a> of a <a href="http://arxiv.org/abs/1010.3003" target="_blank">recent paper</a> suggesting that the mood on Twitter could be used to game the stock market. The author illustrates several major statistical flaws, including not correcting for multiple testing, an implausible statistical model, and not using a big enough training set. The scary thing is apparently a hedge fund is teaming up with this group of academics to try to implement their approach. I wouldn’t put my money anywhere they can get their hands on it. This is just one more in the accelerating line of results that illustrate the critical need for statistical literacy both among scientists and in the general public.</li>
</ol>
Interview with Drew Conway - Author of "Machine Learning for Hackers"
2012-04-13T13:31:21+00:00
http://simplystats.github.io/2012/04/13/interview-with-drew-conway-author-of-machine
<p><strong>Drew Conway</strong></p>
<p><strong><img height="190" src="http://biostat.jhsph.edu/~jleek/drew-iav-color.jpg" width="230" /></strong></p>
<p>Drew Conway is a Ph.D. student in Politics at New York University and the co-ordinator of the <a href="http://www.meetup.com/nyhackr/" target="_blank">New York Open Statistical Programming Meetup</a>. He is the creator of the famous (or infamous) data science <a href="http://www.drewconway.com/zia/?p=2378" target="_blank">Venn diagram</a>, the basis for our <a href="http://simplystatistics.tumblr.com/post/11271228367/datascientist" target="_blank">R function</a> to determine if you’re a data scientist. He is also the co-author of <a href="http://shop.oreilly.com/product/0636920018483.do" target="_blank">Machine Learning for Hackers</a>, a book of case studies that illustrates data science from a hacker’s perspective. </p>
<div class="im">
<strong>Which term applies to you: data scientist, statistician, computer</strong><br /><strong>scientist, or something else?</strong>
<div>
</div>
</div>
<div>
Technically, my undergraduate degree is in computer science, so that term can be applied. I was actually a double major in CS and political science, however, so it wouldn’t tell the whole story. I have always been most interested in answering social science problems with the tools of computer science, math and statistics.
</div>
<div>
</div>
<div>
I have struggled a bit with the term “data scientist.” About a year ago, when it seemed to be gaining a lot of popularity, I bristled at it. Like many others, I complained that it was simply a corporate rebranding of other skills, and that the term “science” was appended to give some veil of legitimacy. Since then, I have warmed to the term, but—as is often the case—only when I can define what data science is in my own terms. Now, I do think of what I do as being data science, that is, the blending of technical skills and tools from computer science, with the methodological training of math and statistics, and my own substantive interest in questions about collective action and political ideology.
</div>
<div>
</div>
<div>
I think the term is very loaded, however, and when many people invoke it they often do so as a catch-all for talking about working with a certain set of tools: R, map-reduce, data visualization, etc. I think this actually hurts the discipline a great deal, because if it is meant to actually be a science the majority of our focus should be on questions, not tools.
</div>
<div class="im">
<div>
</div>
<p>
<strong>You are in the department of politics? How is it being a “data</strong><br /><strong>person” in a non-computational department?</strong>
</p>
<div>
</div>
</div>
<div>
Data has always been an integral part of the discipline, so in that sense many of my colleagues are data people. I think the difference between my work and the work that many other political scientist do is simply a matter of where and how I get my data.
</div>
<div>
</div>
<div>
For example, a traditional political science experiment might involve a small set of undergraduates taking a survey or playing a simple game on a closed network. That data would then be collected and analyzed as a controlled experiment. Alternatively, I am currently running an experiment wherein my co-authors and I are attempting to code text documents (political party manifestos) with ideological scores (very liberal to very conservative). To do this we have broken down the documents into small chunks of text and are having workers on Mechanical Turk code single chunks—rather than the whole document at once. In this case the data scale up very quickly, but by aggregating the results we are able to have a very different kind of experiment with much richer data.
</div>
<div>
</div>
<div>
At the same time, I think political science—and perhaps the social sciences more generally—suffers from a tradition of undervaluing technical expertise. In that sense, it is difficult to convince colleagues that developing software tools is important.
</div>
<div class="im">
<div>
</div>
<p>
<strong>Is that what inspired you to create the New York Open Statistical Meetup?</strong>
</p>
<div>
</div>
</div>
<div>
<div>
I actually didn’t create the New York Open Statistical Meetup (formerly the R meetup). Joshua Reich was the original founder, back in 2008, and shortly after the first meeting we partnered and ran the Meetup together. Once Josh became fully consumed by starting / running BankSimple I took it over by myself. I think the best part about the Meetup is how it brings people together from a wide range of academic and industry backgrounds, and we can all talk to each other in a common language of computational programming. The cross-pollination of ideas and talents is inspiring.
</div>
<div>
</div>
<div>
We are also very fortunate in that the community here is so strong, and that New York City is a well traveled place, so there is never a shortage of great speakers.
</div>
</div>
<div class="im">
<div>
</div>
<p>
<strong>You created the data science Venn diagram. Where do you fall on the diagram?</strong>
</p>
<div>
</div>
</div>
<div>
Right at the center, of course! Actually, before I entered graduate school, which is long before I drew the Venn diagram, I fell squarely in the danger zone. I had a lot of hacking skills, and my work (as an analyst in the U.S. intelligence community) afforded me a lot of substantive expertise, but I had little to no formal training in statistics. If you could describe my journey through graduate school within the framework of the data science Venn diagram, it would be about me trying to pull myself out of the danger zone by gaining as much math and statistics knowledge as I can.
</div>
<div class="im">
<div>
</div>
<p>
<strong>I see that a lot of your software (including R packages) are on Github. Do you post them on CRAN as well? Do you think R developers will eventually move to Github from CRAN?</strong>
</p>
<div>
</div>
</div>
<div>
<div>
I am a big proponent of open source development, especially in the context of sharing data and analyses, and creating reproducible results. I love Github because it creates a great environment for following the work of other coders, and participating in the development process. For data analysis, it is also a great place to upload data and R scripts and allow the community to see how you did things and comment. I also think, however, that there is a big opportunity for a new site—like Github—to be created that is more tailored for data analysis, and storing and disseminating data and visualizations.
</div>
<div>
</div>
<div>
I do post my R packages to CRAN, and I think that CRAN is one of the biggest strengths of the R language and community. I think ideally more package developers would open their development process, on Github or some other social coding platform, and then push their well-vetted packages to CRAN. This would allow for more people to participate, but maintain the great community resource that CRAN provides.
</div>
</div>
<div class="im">
<div>
</div>
<p>
<strong>What inspired you to write, “Machine Learning for Hackers”? Who</strong><br /><strong>was your target audience?</strong>
</p>
<div>
</div>
</div>
<div>
<div>
A little over a year ago John Myles White (my co-author) and I were having a lot of conversations with other members of the data community in New York City about what a data science curriculum would look like. During these conversations people would always cite the classic texts: Elements of Statistical Learning, Pattern Recognition and Machine Learning, etc., which are excellent and deep treatments of the foundational theories of machine learning. From these conversations it occurred to us that there was not a good text on machine learning for people who thought more algorithmically. That is, there was not a text for “hackers,” people who enjoy learning about computation by opening up black-boxes and getting their hands dirty with code.
</div>
<div>
</div>
<div>
It was from this idea that the book, and eventually the title, were born. We think the audience for the book is anyone who wants to get a relatively broad introduction to some of the basic tools of machine learning, and do so through code—not math. This can be someone working at a company with data that wants to add some of these tools to their belt, or it can be an undergraduate in a computer science or statistics program that can relate to the material more easily through this presentation than the more theoretically heavy texts they’re probably already reading for class.
</div>
<div>
</div>
</div>
The Problem with Universities
2012-04-12T14:49:04+00:00
http://simplystats.github.io/2012/04/12/the-problem-with-universities
<p>I have had the following conversation a number of times recently:</p>
<ol>
<li>I want to do X. X is a lot of fun and is really interesting. Doing X involves a little of A and a little of B.</li>
<li>We should get some students to do X also.</li>
<li>Okay, but from where should we get the students? Students in Department of A don’t know B. Students from Department of B don’t know A.</li>
<li>Fine, maybe we could start a program that specifically trains people in X. In this program we’ll teach them A and B. It’ll be the first program of its kind! Woohoo!</li>
<li>Sure that’s great, but because there aren’t any <em>other</em> departments of X, the graduates of our program now have to get jobs in departments of A or B. Those departments complain that students from Department of X only know a little of A (or B).</li>
<li>Grrr. Go away.</li>
</ol>
<p>Has anyone figured out a solution to this problem? Specifically, how do you train students to do something for which there’s no formal department/program without jeopardizing their career prospects?</p>
Statistics is not math...
2012-04-11T13:52:42+00:00
http://simplystats.github.io/2012/04/11/statistics-is-not-math
<p>Statistics depends on math, like a lot of other disciplines (physics, engineering, chemistry, computer science). But just like those other disciplines, statistics is not math; math is just a tool used to solve statistical problems. Unlike those other disciplines, statistics gets lumped in with math in headlines. Whenever people use statistical analysis to solve an interesting problem, the headline reads:</p>
<p>“Math can be used to solve amazing problem X”</p>
<p>or</p>
<p>“The Math of Y” </p>
<p>Here are some examples:</p>
<p><a href="http://www.wired.com/wiredscience/2012/01/the-mathematics-of-lego/" target="_blank">The Mathematics of Lego</a> - Using data on legos to estimate a distribution</p>
<p><a href="http://www.ted.com/talks/sean_gourley_on_the_mathematics_of_war.html" target="_blank">The Mathematics of War</a> - Using data on conflicts to estimate a distribution</p>
<p><a href="https://twitter.com/#!/Cambridge_Uni/status/187844697170001920" target="_blank">Usain Bolt can run faster with maths</a> (Tweet) - Turns out they analyzed data on start times to come to the conclusion</p>
<p><a href="http://blog.okcupid.com/index.php/the-mathematics-of-beauty/" target="_blank">The Mathematics of Beauty</a> - Analysis of data relating dating profile responses and photo attractiveness</p>
<p>These are just a few off the top of my head, but I regularly see headlines like this. I think there are a few reasons for math being grouped with statistics: (1) many of the founders of statistics were mathematicians first (<a href="http://en.wikipedia.org/wiki/Ronald_Fisher" target="_blank">but not all of them</a>), (2) many statisticians still identify themselves as mathematicians, and (3) in some cases statistics and statisticians define themselves pretty narrowly. </p>
<p>With respect to (3), consider the following list of disciplines:</p>
<ol>
<li>Biostatistics</li>
<li>Data science</li>
<li>Machine learning</li>
<li>Natural language processing</li>
<li>Signal processing</li>
<li>Business analytics</li>
<li>Econometrics</li>
<li>Text mining</li>
<li>Social science statistics</li>
<li>Process control</li>
</ol>
<p>All of these disciplines could easily be classified as “applied statistics”. But how many folks in each of those disciplines would classify themselves as statisticians? More importantly, how many would be claimed by statisticians? </p>
Evolution, Evolved
2012-04-10T14:58:45+00:00
http://simplystats.github.io/2012/04/10/evolution-evolved
<p><a href="http://magazine.jhu.edu/spring-2012/evolution-evolved">Evolution, Evolved</a></p>
What is a major revision?
2012-04-09T15:01:34+00:00
http://simplystats.github.io/2012/04/09/what-is-a-major-revision
<p>I posted a little while ago on a proposal for a <a href="http://simplystatistics.tumblr.com/post/19289280474/a-proposal-for-a-really-fast-statistics-journal" target="_blank">fast statistics journal</a>. It generated a bunch of comments and even a really nice <a href="http://yihui.name/en/2012/03/a-really-fast-statistics-journal/" target="_blank">follow up post</a> with some great ideas. Since then I’ve gotten reviews back on a couple of papers and I think I realized one of the key issues that is driving me nuts about the current publishing model. It boils down to one simple question: </p>
<p><em>What is a major revision? </em></p>
<p>I often get reviews back that suggest “major revisions” in one or many of the following categories:</p>
<ol>
<li>More/different simulations</li>
<li>New simulations</li>
<li>Re-organization of content</li>
<li>Re-writing language</li>
<li>Asking for more references</li>
<li>Asking me to include a new method</li>
<li>Asking me to implement someone else’s method for comparison</li>
</ol>
<div>
I don’t consider any of these major revisions. Personally, I have stopped asking for them as major revisions. In my opinion, major revisions should be reserved for issues with the manuscript that suggest that it may be reporting incorrect results. Examples include:
</div>
<div>
<ol>
<li>
No simulations
</li>
<li>
No real data
</li>
<li>
The math/computations look incorrect
</li>
<li>
The software didn’t work when I tried it
</li>
<li>
The methods/algorithms are unreadable and can’t be followed
</li>
</ol>
<div>
The first list is actually a list of minor/non-essential revisions in my opinion. They may <em>improve</em> my paper, but they won’t confirm whether or not it is correct. I find that they are often subjective and are up to the whims of referees. In my own personal refereeing I am making an effort to remove subjective major revisions and only include issues that are critical to evaluating the correctness of a manuscript. I also try to divorce the issues of whether an idea is interesting or not from whether an idea is correct or not.
</div>
</div>
<div>
</div>
<div>
I’d be curious to know what other peoples’ definitions of major/minor revisions are?
</div>
Sunday data/statistics link roundup (4/8)
2012-04-09T01:42:10+00:00
http://simplystats.github.io/2012/04/09/sunday-data-statistics-link-roundup-4-8
<ol>
<li>This is a <a href="http://arxiv.org/pdf/math/0606441.pdf" target="_blank">great article</a> about the illusion of progress in machine learning. In part, I think it explains why the <a href="http://simplystatistics.tumblr.com/post/18132467723/prediction-the-lasso-vs-just-using-the-top-10" target="_blank">Leekasso</a> (just using the top 10) isn’t a totally silly idea. I also love how he talks about sources of uncertainty in real prediction problems that aren’t part of the classical models when developing prediction algorithms. I think that this is a hugely underrated component of building an accurate classifier - just finding the quirks particular to a type of data. Via <a href="https://twitter.com/#!/chlalanne" target="_blank">@chlalanne</a>.</li>
<li>An <a href="http://www.michaeleisen.org/blog/?p=1009" target="_blank">interesting post</a> from Michael Eisen on a serious abuse of statistical ideas in the New York Times. The professor of genetics quoted in the story apparently wasn’t aware of the <a href="http://en.wikipedia.org/wiki/Birthday_problem" target="_blank">birthday problem</a>. Lack of statistical literacy, even among scientists, is becoming critical. I would love it if the Kahn academy (or some enterprising students) would come up with a set of videos that just explained a bunch of basic statistical concepts - skipping all the hard math and focusing on the ideas. </li>
<li> TechCrunch finally <a href="http://techcrunch.com/2012/04/08/patent-law-101-whats-wrong-and-ways-to-make-it-right/" target="_blank">caught up</a> to our <a href="http://simplystatistics.tumblr.com/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court" target="_blank">Mayo vs. Prometheus</a> <a href="http://simplystatistics.tumblr.com/post/19626747057/supreme-court-unanimously-rules-against-personalized" target="_blank">coverage</a>. This decision is going to affect more than just personalized medicine. Speaking of the decision, stay tuned for more on that topic from the folks over here at Simply Statistics. </li>
<li><a href="http://www.nytimes.com/2012/04/09/technology/how-to-budget-megabytes-becomes-more-urgent-for-users.html?_r=1&hpw" target="_blank">How much is a megabyte</a>? I love this question. They asked people on the street how much data was in a megabyte. The answers were pretty far ranging looks like. This question is hyper-critical for scientists in the new era, but the better question might be, “How much is a terabyte?”</li>
</ol>
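<p>As promised in item 2, here is a quick back-of-the-envelope calculation of the birthday problem (a minimal sketch in base R, not taken from Eisen’s post): the probability that at least two people in a group of n share a birthday, assuming 365 equally likely birthdays.</p>
<pre><code># P(at least two of n people share a birthday), with 365 equally likely birthdays
p_shared <- function(n) 1 - prod((365 - seq_len(n) + 1) / 365)
sapply(c(10, 23, 50), p_shared)   # roughly 0.12, 0.51, 0.97
pbirthday(23)                     # the same calculation via stats::pbirthday
</code></pre>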
Study Says DNA’s Power to Predict Illness Is Limited
2012-04-05T14:10:43+00:00
http://simplystats.github.io/2012/04/05/study-says-dnas-power-to-predict-illness-is-limited
<p><a href="http://www.nytimes.com/2012/04/03/health/research/dnas-power-to-predict-is-limited-study-finds.html">Study Says DNA’s Power to Predict Illness Is Limited</a></p>
Epigenetics: Marked for success
2012-04-04T12:22:30+00:00
http://simplystats.github.io/2012/04/04/epigenetics-marked-for-success
<p><a href="http://www.nature.com/nature/journal/v483/n7391/full/nj7391-637a.html">Epigenetics: Marked for success</a></p>
ENAR Meeting
2012-04-03T12:14:28+00:00
http://simplystats.github.io/2012/04/03/enar-meeting
<p>This is the <a href="http://enar.org/meetings.cfm" target="_blank">ENAR meeting</a> so posting will be intermittent. If you’re at the meeting I’ll be talking at 1:45 today in the Columbia B room in a session on climate change and health. I hear Rafa is roaming the halls too so make sure you say hi if you see him.</p>
R 2.15.0 is released
2012-03-30T12:26:58+00:00
http://simplystats.github.io/2012/03/30/r-2-15-0-is-released
<p><a href="https://stat.ethz.ch/pipermail/r-announce/2012/000551.html">R 2.15.0 is released</a></p>
New U.S. Research Will Aim at Flood of Digital Data
2012-03-30T01:04:57+00:00
http://simplystats.github.io/2012/03/30/new-u-s-research-will-aim-at-flood-of-digital-data
<p><a href="http://www.nytimes.com/2012/03/29/technology/new-us-research-will-aim-at-flood-of-digital-data.html">New U.S. Research Will Aim at Flood of Digital Data</a></p>
Big Data Meeting at AAAS
2012-03-29T13:56:00+00:00
http://simplystats.github.io/2012/03/29/big-data-meeting-at-aaas
<p>The White House Office of Science and Technology Policy is hosting a meeting that will discuss several new federal efforts relating to big data. The meeting is <strong>today</strong> from 2-3:45pm and there will be a <a href="http://live.science360.gov/bigdata/" target="_blank">live webcast</a>.</p>
<p>Participants include</p>
<ul>
<li>John Holdren, Assistant to the President and Director, White House Office of Science and Technology Policy</li>
<li><span>Subra Suresh, Director, National Science Foundation</span></li>
<li><span>Francis Collins, Director, National Institutes of Health</span></li>
<li><span>Marcia McNutt, Director, United States Geological Survey</span></li>
<li><span>William Brinkman, Director, Department of Energy Office of Science</span></li>
<li><span>Zach Lemnios, Assistant Secretary of Defense for Research & Engineering, Department of Defense</span></li>
<li><span>Kaigham “Ken” Gabriel, Deputy Director, Defense Advanced Research Projects Agency</span></li>
<li><span>Daphne Koller, Stanford University (machine learning and applications in biology and education)</span></li>
<li><span>James Manyika, McKinsey & Company (Co-author of major McKinsey report on Big Data)</span></li>
<li><span>Lucila Ohno-Machado, UC San Diego (NIH’s “Integrating Data for Analysis, Anonymization, and Sharing” initiative)</span></li>
<li><span>Alex Szalay, Johns Hopkins University (Big Data for astronomy)</span></li>
</ul>
<div>
<strong>Update</strong>: Some more information from the <a href="http://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal" target="_blank">White House itself</a>.
</div>
Roche Raises Illumina Bid to $51, Seeking Faster Deal
2012-03-29T11:57:55+00:00
http://simplystats.github.io/2012/03/29/roche-raises-illumina-bid-to-51-seeking-faster-deal
<p><a href="http://www.bloomberg.com/news/2012-03-29/roche-raises-illumina-bid-to-51-seeking-faster-deal.html">Roche Raises Illumina Bid to $51, Seeking Faster Deal</a></p>
Justices Send Back Gene Case
2012-03-27T00:18:57+00:00
http://simplystats.github.io/2012/03/27/justices-send-back-gene-case
<p><a href="http://www.nytimes.com/2012/03/27/business/high-court-orders-new-look-at-gene-patents.html">Justices Send Back Gene Case</a></p>
Supreme court vacates ruling on BRCA gene patent!
2012-03-26T15:53:56+00:00
http://simplystats.github.io/2012/03/26/supreme-court-vacates-ruling-on-brca-gene-patent
<p><span>As Reeves alluded to in <a href="http://simplystatistics.tumblr.com/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court" target="_blank">his post</a> about the Mayo personalized medicine case, the Supreme Court just vacated the lower court’s ruling in </span><em>Association for Molecular Pathology v. Myriad Genetics</em><span> (No. 11-725). The case has been sent back down to the Federal Circuit for reconsideration in light of the Court’s decision in </span><em>Mayo</em><span>. This means that the Supreme Court thought the two cases were sufficiently similar that the lower courts should take another look using the new direction from </span><em>Mayo</em><span>.</span></p>
<p><span> It’s looking more and more like the Supreme Court is strongly opposed to personalized medicine patents. </span></p>
R and the little data scientist's predicament
2012-03-26T15:00:05+00:00
http://simplystats.github.io/2012/03/26/r-and-the-little-data-scientists-predicament
<p>I just read this <a href="http://www.slate.com/articles/technology/technology/2012/03/ruby_ruby_on_rails_and__why_the_disappearance_of_one_of_the_world_s_most_beloved_computer_programmers_.single.html" target="_blank">fascinating post</a> on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was <a href="http://viewsourcecode.org/why/hacking/theLittleCodersPredicament.html" target="_blank">The Little Coder’s Predicament</a>, which essentially boils down to the argument that computer programming languages have grown too complex - so children/newbies can’t get instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a simple line or two of code they could make the computer <strong>do things</strong> like play some music or make a video. </p>
<p>I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax. </p>
<p>Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:</p>
<ol>
<li>It would have to be easy to load data sets, one line of not complicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load(“education-data”) and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background. </li>
<li>It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures. </li>
<li>It would be awesome if the functions would include some sort of dynamic graphics (with <a href="http://www.omegahat.org/SVGAnnotation/" target="_blank">svgAnnotation</a> or a wrapper for <a href="http://mbostock.github.com/d3/" target="_blank">D3.js</a>). Again, the syntax would have to be really accessible/not too much to learn. </li>
</ol>
<p>That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends. </p>
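<p>To make the idea concrete, here is a minimal sketch of what such a package’s interface might feel like. Everything in it is hypothetical - the function names, the tiny dataset registry, and the plotting defaults are illustrative, not an existing package - and a real version would point the registry at web APIs and handle the scraping and cleaning behind the scenes.</p>
<pre>
# Hypothetical sketch of a "little data scientist" package interface.
# The function names and the registry below are made up for illustration.
library(ggplot2)

# A tiny registry mapping friendly names to loaders; a real package would
# fetch, clean, and format data from the web here.
.registry <- list(
  "cars"    = function() mtcars,
  "flowers" = function() iris
)

# One-line data loading by friendly name
load_data <- function(name) {
  loader <- .registry[[name]]
  if (is.null(loader)) stop("Unknown dataset: ", name)
  loader()
}

# One-line pretty plot that hides pch/cex/col details behind sensible defaults
quick_plot <- function(data, x, y) {
  ggplot(data, aes(x = .data[[x]], y = .data[[y]])) +
    geom_point(size = 3, alpha = 0.7, colour = "steelblue") +
    theme_minimal()
}

# The two-line experience described above:
flowers <- load_data("flowers")
quick_plot(flowers, "Sepal.Length", "Petal.Length")
</pre>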
Sunday data/statistics link roundup (3/25)
2012-03-25T15:58:25+00:00
http://simplystats.github.io/2012/03/25/sunday-data-statistics-link-roundup-3-25
<ol>
<li>The psychologist whose experiment didn’t replicate then <a href="http://simplystatistics.tumblr.com/post/19190862781/sunday-data-statistics-link-roundup-3-11" target="_blank">went off</a> on the scientists who did the replication experiment is <a href="http://www.psychologytoday.com/blog/the-natural-unconscious/201203/angry-birds?page=2" target="_blank">at it again</a>. I don’t see a clear argument about the facts of the matter in his post, just more name calling. This seems to be a case study in what not to do when your study doesn’t replicate. More on “conceptual replication” in there too. </li>
<li>Berkeley is running a <a href="http://datascienc.es/" target="_blank">data science course</a> with instructors <span>Jeff Hammerbacher and Mike Franklin. I looked through the notes and it looks pretty amazing. Stay tuned for more info about my applied statistics class, which starts this week. </span></li>
<li><span>A <a href="http://nyti.ms/GXwvUe." target="_blank">cool article</a> about Factual, one of the companies whose sole mission in life is to collect and distribute data. We’ve <a href="http://simplystatistics.tumblr.com/post/10410458080/data-sources" target="_blank">linked</a> to them before. We are so out ahead of the Times on this one…</span></li>
<li><span>This isn’t statistics related, but I love <a href="http://articles.businessinsider.com/2012-03-20/tech/31212683_1_jeff-bezos-robot-bookstore" target="_blank">this post</a> about Jeff Bezos. If we all indulged our inner 11 year old a little more, it wouldn’t be a bad thing. </span></li>
<li><span>If you haven’t had a chance to read Reeves <a href="http://simplystatistics.tumblr.com/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court" target="_blank">guest post</a> on the Mayo Supreme Court decision yet, you should, it is really interesting. A fascinating intersection of law and statistics is going on in the personalized medicine world right now. </span></li>
</ol>
Some thoughts from Keith Baggerly on the recently released IOM report on translational omics
2012-03-25T01:44:00+00:00
http://simplystats.github.io/2012/03/25/some-thoughts-from-keith-baggerly-on-the-recently
<p>Shortly after the Duke trial scandal broke, the <a href="http://www.iom.edu/" target="_blank">Institute of Medicine</a> convened a committee to write a report on translational omics. Several statisticians (including one of our <a href="http://simplystatistics.tumblr.com/post/11436138110/interview-with-daniela-witten" target="_blank">interviewees</a>) either served on the committee or provided key testimony. The <a href="http://www.iom.edu/Reports/2012/Evolution-of-Translational-Omics.aspx" target="_blank">report</a> came out yesterday. <a href="http://www.nature.com/news/lapses-in-oversight-compromise-omics-results-1.10298" target="_blank">Nature</a>, <a href="http://blogs.nature.com/spoonful/2012/03/greater-oversight-needed-for-genomic-tests-experts-say.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+nm%2Frss%2Fspoonful_of_medicine+%28Spoonful+of+Medicine+-+Blog+Posts%29%20" target="_blank">Nature Medicine</a>, and <a href="http://news.sciencemag.org/scienceinsider/2012/03/panel-calls-for-closer-oversight.html?ref=hp%20" target="_blank">Science </a>had posts about the release. Keith Baggerly sent an email with his thoughts and he gave me permission to post it here. He starts by pointing out that the Science piece has a key new observation:</p>
<blockquote>
<p><span>The NCI’s Lisa McShane, who spent months herself trying to validate Duke results, says the IOM committee “did a really fine job” in laying out the issues. </span><span>NCI now plans to require that its cooperative groups who want to use omics tests follow a checklist similar to that in the IOM report.</span><span> NCI has not yet decided whether it should add new requirements for omics tests to its peer review process for investigator-initiated grants. But “our hope is that this report will heighten everyone’s awareness,” McShane says. </span></p>
</blockquote>
<p><span>Some further thoughts from Keith:</span></p>
<blockquote>
<p><span>First, the report helps clarify the regulatory landscape: if omics-based tests (which the FDA views as medical devices) will direct patient therapy, FDA approval in the form of an Investigational Device Exemption (IDE) is required. This is in keeping with increased guidance FDA has been providing over the past year and a half dealing with companion diagnostics. It seems likely that several of the problems identified with the Duke trials would have been caught by an FDA review, particularly if the agency already had cause for concern, such as a letter to the editor identifying analytical shortcomings.</span><span> </span></p>
<p><span> </span><span>Second, the report recommends the publication of the full data, code, and metadata used to construct the omics assays prior to their use to guide patient therapy. Had such data and code been available earlier, this would have greatly reduced the amount of effort required for others (including us) to check and potentially extend on the underlying results.</span></p>
<p><span>Third, the report emphasizes, repeatedly, that the test must be fully specified (“locked down”) before it is validated, let alone used to guide patient therapy. Quite a bit of effort is given to providing an explicit definition of locked down, in part (we suspect) because both Lisa McShane (NCI) and Robert Becker (FDA) provided testimony that incomplete specification was a problem their agencies encountered frequently. Such specification would have prevented problems such as that identified by the NCI for the Lung Metagene Score (LMS) in 2010, which led the NCI to remove the LMS evaluation as a goal of the Phase III cooperative group trial CALGB-30506.</span></p>
<p><span> </span><span>Finally, the very existence of the report is recognition that reproducibility is an important problem for the omics-test community. This is a necessary step towards fixing the problem.</span></p>
</blockquote>
This graph shows that President Obama's proposed budget treats the NIH even worse than G.W. Bush - Sign the petition to increase NIH funding!
2012-03-23T12:14:00+00:00
http://simplystats.github.io/2012/03/23/this-graph-shows-that-president-obamas-proposed-budget
<p>The NIH provides financial support for a large percentage of biological and medical research in the United States. This funding supports a large number of US jobs, creates new knowledge, and improves healthcare for everyone. So I am signing <a href="http://wh.gov/R3R" target="_blank">this petition</a>: </p>
<p><span><br /></span></p>
<blockquote>
<p><span>NIH funding is essential to our national research enterprise, to our local economies, to the retention and careers of talented and well-educated people, to the survival of our medical educational system, to our rapidly fading worldwide dominance in biomedical research, to job creation and preservation, to national economic viability, and to our national academic infrastructure. </span></p>
<p><span><br /></span></p>
</blockquote>
<p><span>The current administration is proposing a flat $30.7 billion FY 2013 NIH budget. The graph below (left) shows how small the NIH budget is in comparison to the Defense and Medicare budgets in absolute terms. The difference between the administration’s proposal and the petition’s proposal ($33 billion) is barely noticeable.</span><span> </span></p>
<p>The graph on the right shows how in 2003 growth in the NIH budget fell dramatically while Medicare and military spending kept growing. However, despite the decrease in rate, the NIH budget did continue to increase under Bush. If we follow Bush’s post-2003 rate (dashed line), the 2013 budget would be about what the petition asks for: $33 billion. </p>
<p><span><a href="http://rafalab.jhsph.edu/simplystats/nihbudget.png" target="_blank"><img src="http://rafalab.jhsph.edu/simplystats/nihbudget.png" width="500" /></a></span></p>
<p><span><br /></span></p>
<p>If you agree that the relatively modest increase in the NIH budget is worth the incredibly valuable biological, medical, and economic benefits this funding will provide, please consider signing the petition before April 15.</p>
Big Data for the Rest of Us, in One Start-Up
2012-03-22T15:00:05+00:00
http://simplystats.github.io/2012/03/22/big-data-for-the-rest-of-us-in-one-start-up
<p><a href="http://bits.blogs.nytimes.com/2012/03/19/all-about-big-data-in-one-startup/">Big Data for the Rest of Us, in One Start-Up</a></p>
More commentary on Mayo v. Prometheus
2012-03-21T15:00:05+00:00
http://simplystats.github.io/2012/03/21/more-commentary-on-mayo-v-prometheus
<p>Some more <a href="http://www.patentlyo.com/patent/2012/03/mayo-v-prometheus-natural-process-known-elements-normally-no-patent.html" target="_blank">commentary on Mayo v. Prometheus</a> via the Patently-O blog.</p>
<p>A summary of the <a href="http://www.scotusblog.com/case-files/cases/mayo-collaborative-services-v-prometheus-laboratories-inc/" target="_blank">various briefs and history of the case</a> can be found at the SCOTUS blog.</p>
<p>Some <a href="http://www.nytimes.com/2012/03/21/business/justices-reject-patents-for-medical-tests-relying-on-drug-dosages.html" target="_blank">actual news coverage</a> of the decision.</p>
<p>The <a href="http://www.supremecourt.gov/opinions/11pdf/10-1150.pdf" target="_blank">decision</a> is well-worth reading, if you’re that kind of nerd. Here, the Court uses the phrase “law of nature” a bit more loosely than perhaps I would use it. On the one hand, something like E=mc^2 might be considered a law of nature, but on the other hand I would consider the observation that certain blood metabolites are correlated with the occurrence of patient side effects as, well, a correlation. Einstein is referred to quite a few times in the opinion, no doubt in part because he himself worked in a patent office (and also discovered a few interesting laws of nature).</p>
<p>If one were to set aside the desire to do inference, then one could argue that in a given sample of people (random or not), any correlation observed within that sample is a “law of nature”, at least within that sample. Then if I draw a different sample and observe a different correlation, is that a different law of nature? Well, it might depend on whether it’s statistically significantly different.</p>
<p>In the end, maybe it doesn’t matter, because no law of nature is patentable, no matter how many there are. I do find it interesting that the Court considered, in some sense, the possibility of statistical variation.</p>
<p>The Court also noted that simply ordering a bunch of steps together did not make a procedure patentable, if the things that were put together were things that doctors (or people in the profession) were already doing. The question becomes, if you take away the statistical correlation in the patent, is there anything left? No, because doctors were already treating patients with immune-mediated gastrointestinal disorders and those patients were already being tested for blood metabolites. </p>
<p>This section of the decision caught my eye because it sounded a lot like the work of an applied statistician. Much of applied statistics involves taking methods and techniques that are already well known (lasso, anyone?) and applying them in new and interesting ways to new and interesting data. It seems taking a bunch of well-known processes/techniques and putting them together is not patentable, even if it is interesting. I don’t think I have a problem with that, but then again, getting patents isn’t my main goal.</p>
<p>Actual lawyers will be able to tell whether this case is significant. However, it seems there are many statistical correlations out there that are waiting to be turned into medical treatments. For example, take the <a href="http://simplystatistics.tumblr.com/post/18378666076/the-duke-saga-starter-set" target="_blank">Duke clinical trials saga</a>. I don’t think it’s the case that none of these are patentable, because there still is the option of adding an “inventive concept” on top. However, it seems the simple algorithmic approach of “If X do this, and if Y do that” isn’t going to fly.</p>
Laws of Nature and the Law of Patents: Supreme Court Rejects Patents for Correlations
2012-03-20T22:40:00+00:00
http://simplystats.github.io/2012/03/20/laws-of-nature-and-the-law-of-patents-supreme-court
<p class="MsoNormal">
This is a guest post by Reeves Anderson, an <a href="http://www.arnoldporter.com/professionals.cfm?action=view&id=5146" target="_blank">associate</a> at Arnold and Porter LLP. <em>Reeves Anderson is a member of the Appellate and Supreme Court practice group at Arnold & Porter LLP in Washington, D.C. The views expressed herein are those of the author alone and not of Arnold & Porter LLP or any of the firm’s clients. Stay tuned for follow-up posts by the Simply Statistics crowd on the implications of this ruling for statistics in general and personalized medicine in particular. </em>
</p>
<p class="MsoNormal">
With the country’s attention focused on next week’s arguments over the constitutionality of President Obama’s health care law, the Supreme Court slipped in an important decision today concerning personalized medicine patents. In <em>Mayo Collaborative Services v. Prometheus Laboratories</em>, the Court unanimously <a href="http://www.supremecourt.gov/opinions/11pdf/10-1150.pdf" target="_blank">struck down</a> medical diagnostic patents that concerned the use of thiopurine drugs in the treatment of autoimmune diseases. Prometheus’s patents, which provided that doctors should increase or decrease a treatment dosage depending on metabolite correlations, were ineligible for patent protection, the Court held, because the patents “simply stated a law of nature.”
</p>
<p class="MsoNormal">
As Jeff aptly <a href="http://simplystatistics.tumblr.com/post/14135999782/the-supreme-courts-interpretation-of-statistical" target="_blank">described the issue</a> in December, Prometheus’s patents sought to control a treatment process centered “on the basis of a statistical correlation.” Specifically, when a patient ingests a thiopurine drug, metabolites form in the patient’s bloodstream. Because the production of metabolites varies among patients, the same dosage of thiopurine causes different effects in different patients. This variation makes it difficult for doctors to determine optimal treatment for a particular patient. Too high of a dosage risks harmful side effects, whereas too low would be therapeutically ineffective.
</p>
<p class="MsoNormal">
But measurement of a patient’s <em>metabolite</em> levels—in particular, 6-thioguanine and its nucleotides (6-TG) and 6-methyl-mercaptopurine (6-MMP)—is more closely correlated with the likelihood that a particular dosage of a thiopurine drug could cause harm or prove ineffective. As the Court explained today, however, “those in the field did not know the precise correlations between metabolite levels and the likely harm or ineffectiveness.” This is where Prometheus stepped in. “The patent claims at issue here set forth processes embodying researchers’ findings that identified those correlations with some precision.” Prometheus contended that blood concentrations of 6-TG or of 6-MMP above 400 and 7,000 picomoles per 8x10<sup>8</sup> red blood cells, respectively, could be toxic, while a concentration of 6-TG metabolite less than 230 pmol per 8x10<sup>8</sup> red blood cells is likely too low to be effective.
</p>
<p class="MsoNormal">
Prometheus utilized this correlation by patenting a three-step method by which one (i) administers a drug providing 6-TG to a patient with an autoimmune disease; (ii) determines the level of 6-TG in the patient; and (iii) the administrator then can determine whether the thiopurine dosage should be adjusted accordingly. Significantly, Prometheus’s patents did not include a treatment protocol and thus applied regardless of whether a doctor actually altered his treatment decision in light of the test—in other words, even if the doctor thought the correlations were wrong, irrelevant, or inapplicable to a particular patient. And in fact, Mayo Clinic, the party challenging Prometheus’s patents, believed Prometheus’s correlations were wrong. (Mayo’s toxicity levels were 450 and 5700 pmol per 8x10<sup>8</sup> red blood cells for 6-TG and 6-MMP, respectively. At oral argument on December 7, 2011, Mayo insisted that its numbers were “more accurate” than Prometheus’s.)
</p>
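<p class="MsoNormal">
(To make the structure of the claimed method concrete, here is a purely illustrative R sketch of the decision logic, using the 6-TG thresholds quoted above. The function name and messages are hypothetical, and this is of course neither the patent’s claim language nor medical advice.)
</p>
<pre>
# Illustrative sketch only: encodes the correlation thresholds quoted above
# (6-TG below 230 pmol per 8x10^8 red blood cells is likely ineffective,
# above 400 is potentially toxic). Names and wording are hypothetical.
thiopurine_guidance <- function(six_tg_level) {
  if (six_tg_level < 230) {
    "dose likely too low to be effective"
  } else if (six_tg_level > 400) {
    "dose high enough to risk toxicity"
  } else {
    "level within the claimed therapeutic window"
  }
}

thiopurine_guidance(150)  # "dose likely too low to be effective"
thiopurine_guidance(500)  # "dose high enough to risk toxicity"
</pre>
<p class="MsoNormal">
As the Court’s reasoning below makes clear, the dispute was over whether wrapping a correlation like this in “administering” and “determining” steps adds enough to make it a patentable application of a natural law.
</p>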
<p class="MsoNormal">
Turning to the legal issues, both parties agreed that the correlations were “laws of nature,” which, by themselves, are not patentable. As the Supreme Court has explained repeatedly, laws of nature, like natural phenomena and abstract ideas, are “manifestations of … nature, free to all men and reserved exclusively to none.” This principle reflects a concern that patent law ought not inhibit further discovery and innovation by tying up the “basic tools of scientific and technological work.”
</p>
<p class="MsoNormal">
In contrast, the <em>application</em> of a law of nature <em>is</em> patentable. The question for the Court, then, was whether Prometheus’s patent claims “add <em>enough</em> to their statements of correlations to allow the process they describe to qualify as patent-eligible processes that <em>apply</em> natural laws.”
</p>
<p class="MsoNormal">
The Court’s answer was no. Distilled down, Prometheus’s “three steps simply tell doctors to gather data from which they may draw an inference in light of the correlations.” The Court determined that Prometheus’s method simply informed the relevant audience (doctors treating patients with autoimmune diseases) about a law of nature, and that the additional steps of “administering” a drug and “determining” metabolite levels were “well-understood, routine, conventional activity already engaged in by the scientific community.” “[T]he effect is simply to tell doctors to apply the law somehow when treating their patients.”
</p>
<p class="MsoNormal">
Although I leave it to Jeff & company to assess the impact of today’s decision on the practice of personalized medicine, I have two principal observations. First, it appears that the Court was disturbed by Mayo’s insistence that the correlations in Prometheus’s patents were wrong, and that patent protection would prevent Mayo from improving upon them. Towards the end of the opinion, Justice Breyer wrote that the patents “threaten to inhibit the development of more refined treatment recommendations (like that embodied in Mayo’s test), that combine Prometheus’s correlations with later discovered features of metabolites, human physiology or individual patient characteristics.” The worry of stifling future innovation applies to every patent, but the Court seemed especially attuned to that concern here, perhaps due in part to Mayo’s insistence that its “better” test could not be used to help patients.
</p>
<p class="MsoNormal">
Second, Mayo argued that a decision in its favor would reduce the costs of challenging similar patents that purported to “apply” a natural law. Mayo’s argument was in response to the position of the U.S. Government, which participated in the case as <em>amicus curiae</em> (“friend of the court”). The Government urged the Court not to rule on the threshold issue of whether Prometheus’s patents applied a law of nature, but rather to strike down the patents because they lacked “novelty” or were “obvious in light of prior art.” The questions of novelty and obviousness, Mayo argued, are much more fact-intensive and expensive to litigate. Whether or not the Court agreed with Mayo’s argument, it declined to follow the Government’s advice. To skip the threshold question, the Court concluded, “would make the ‘law of nature’ exception … a dead letter.”
</p>
<p class="MsoNormal">
Many Supreme Court watchers will now turn their attention to another patent case that has been waiting in the wings, <em>Association for Molecular Pathology v. Myriad Genetics</em>, which asks the Court to decide whether human genes are patentable. Predictions anyone?
</p>
Supreme court unanimously rules against personalized medicine patent!
2012-03-20T14:31:35+00:00
http://simplystats.github.io/2012/03/20/supreme-court-unanimously-rules-against-personalized
<p>Just a few minutes ago the Supreme Court released their <a href="http://biostat.jhsph.edu/~jleek/Mayo%20Opinion.pdf" target="_blank">decision</a> in the Mayo case, see <a href="http://simplystatistics.tumblr.com/post/14135999782/the-supreme-courts-interpretation-of-statistical" target="_blank">here</a> for the Simply Statistics summary of the case. The court ruled unanimously that the personalized medicine test could not be patented. Such a strong ruling likely has major implications going forward for the field of personalized medicine. At the end of the day, this decision was based on an interpretation of statistical correlation. Stay tuned for a special in-depth analysis in the next couple of days that will get into the details of the ruling and the implications for personalized medicine. </p>
Interview with Amy Heineike - Director of Mathematics at Quid
2012-03-19T14:00:00+00:00
http://simplystats.github.io/2012/03/19/interview-with-amy-heineike-director-of-mathematics
<div class="im">
<div>
<div>
<strong>Amy Heineike</strong>
</div>
<div>
<strong><img src="http://media.tumblr.com/tumblr_m1588osxOV1r08wvg.jpg" /></strong>
</div>
<div>
<strong><br /></strong>Amy Heineike is the Director of Mathematics at <a href="http://quid.com/" target="_blank">Quid</a>, a startup that seeks to understand technology development and dissemination through data analysis. She was the first employee at Quid, where she helped develop their technology early on. She has been recognized as one of the <a href="http://thephenomlist.com/lists/8/people/32" target="_blank">top Big Data Scientists</a>. As a part of our ongoing <a href="http://simplystatistics.tumblr.com/interviews" target="_blank">interview series</a> we talked to Amy about data science, Quid, and how statisticians can get involved in the tech scene.
</div>
<div>
<strong><br /></strong>
</div>
<div>
<strong>Which term applies to you: data scientist, statistician, computer scientist, or something else?</strong>
</div>
</div>
<div>
</div>
</div>
<div>
Data Scientist fits better than any, because it captures the mix of analytics, engineering and product management that is my current day to day.
</div>
<div>
</div>
<div>
</div>
<div>
When I started with Quid I was focused on R&D - developing the first prototypes of what are now our core analytics technologies, and working to define and QA new data streams. This required the analysis of lots of unstructured data, like news articles and patent filings, as well as the end visualisation and communication of the results.
</div>
<div>
</div>
<div>
</div>
<div>
After we raised VC funding last year I switched to building our data science and engineering teams out. These days I jump from conversations with the team about ideas for new analysis, to defining refinements to our data model, to questions about scalable architecture and filling out pivotal tracker tickets. The core challenge is translating the vision for the product back to the team so they can build it.
</div>
<div class="im">
<div>
</div>
<div>
<div>
<div>
<strong> How did you end up at Quid?</strong>
</div>
</div>
</div>
<div>
</div>
</div>
<div>
In my previous work I’d been building models to improve our understanding of complex human systems - in particular, the complex interaction of cities and their transportation networks in order to evaluate the economic impacts of Crossrail, a new train line across London, and the implications of social networks for public policy. Through this work it became clear that data was the biggest constraint - I became fascinated by a quest to find usable data for these questions - and that’s what led me to Silicon Valley. I knew the founders of Quid from university, and approached them with the idea of analysing their data according to ideas I’d had - especially around network analysis - and the initial work we collaborated on became core to the founding technology of Quid.
</div>
<div class="im">
<div>
</div>
<div>
<div>
<div>
</div>
</div>
</div>
<div>
<div>
<div>
<strong>Who were really good mentors to you? What were the qualities that helped you? </strong>
</div>
<div>
</div>
</div>
</div>
</div>
<div>
I’ve been fortunate to work with some brilliant people in my career so far. While I still worked in London I worked closely with two behavioural economists - Paul Ormerod, who’s written some fantastic books on the subject (most recently Why Things Fail), and Bridget Rosewell, until recently the Chief Economist to the Greater London Authority (the city government for London). At Quid I’ve had a very productive collaboration with Sean Gourley, our CTO.
</div>
<div>
</div>
<div>
</div>
<div>
One unifying characteristic of these three is their ability to communicate complex ideas in a powerful way to a broad audience. It’s an incredibly important skill; a core part of analytics work is taking the results to where they are needed, which is often beyond those who know the technical details, to those who care about the implications first.
</div>
<div class="im">
<div>
</div>
<div>
</div>
<div>
<strong>How does Quid determine relationships between organizations and develop insight based on data? </strong>
</div>
<div>
</div>
</div>
<div>
The core questions our clients ask us are around how technology is changing and how this impacts their business. That’s a really fascinating and huge question that requires not just discovering a document with the answer in it, but organizing lots and lots of pieces of data to paint a picture of the emergent change. What we can offer is not only being able to find a snapshot of that, but also being able to track how it changes over time.
</div>
<div>
</div>
<div>
</div>
<div>
We organize the data firstly through the insight that much disruptive technology emerges in organizations, and that the events that occur between and to organizations are a fantastic way to signal both the traction of technologies and to observe strategic decision making by key actors.
</div>
<div>
</div>
<div>
</div>
<div>
The first kind of relationship that’s important is of the transactional type - who is acquiring, funding, or partnering with whom - and the second is an estimate of the technological clustering of organizations - what trends particular organizations represent. Both of these can be discovered through documents about them, including government filings, press releases, and news, but this requires analysis of unstructured natural language.
</div>
<div>
</div>
<div>
</div>
<div>
We’ve experimented with some very engaging visualisations of the results, and have had particular success with network visualisations, which are a very powerful way of allowing people to interact with a large amount of data in a quite playful way. You can see some of our analyses in the press links at <a href="http://quid.com/in-the-news.php" target="_blank">http://quid.com/in-the-news.php</a>.
</div>
<div class="im">
<div>
</div>
<div>
<div>
<div>
<strong>What skills do you think are most important for statisticians/data scientists moving into the tech industry?</strong>
</div>
</div>
</div>
<div>
</div>
</div>
<div>
Technical statistical chops are the foundation. You need to be able to take a dataset and discover and communicate what’s interesting about it for your users. To turn this into a product requires understanding how to turn one-off analysis into something reliable enough to run day after day, even as the data evolves and grows, and as different users experience different aspects of it. A key part of that is being willing to engage with questions about where the data comes from (how it can be collected, stored, processed and QAed on an ongoing basis), how the analytics will be run (how will it be tested, distributed and scaled) and how people interact with it (through visualisations, UI features or static presentations?).
</div>
<div>
</div>
<div>
</div>
<div>
For your ideas to become great products, you need to become part of a great team though! One of the reasons that such a broad set of skills are associated with Data Science is that there are a lot of pieces that have to come together for it to all work out - and it really takes a team to pull it off. Generally speaking, the earlier stage the company that you join, the broader the range of skills you need, and the more scrappy you need to be about getting involved in whatever needs to be done. Later stage teams, and big tech companies may have roles that are purer statistics.
</div>
<div class="im">
<div>
</div>
<div>
</div>
<div>
<div>
<div>
<strong>Do you have any advice for grad students in statistics/biostatistics on how to get involved in the start-up community or how to find a job at a start-up? </strong>
</div>
</div>
</div>
<div>
</div>
</div>
<div>
There is a real opportunity for people who have good statistical and computational skills to get into the startup and tech scenes now. Many people in Data Science roles have statistics and biostatistics backgrounds, so you shouldn’t find it hard to find kindred spirits.
</div>
<div>
</div>
<div>
<div>
We’ve always been especially impressed with people who have built software in a group and shared or distributed that software in some way. Getting involved in an open source project, working with version control in a team, or sharing your code on github are all good ways to start on this.
</div>
<div>
</div>
</div>
<div>
</div>
<div>
It’s really important to be able to show that you want to build products though. Imagine the clients or users of the company and see if you get excited about building something that they will use. Reach out to people in the tech scene, explore who’s posting jobs - and then be able to explain to them what it is you’ve done and why it’s relevant, and be able to think about their business and how you’d want to help contribute towards it. Many companies offer internships, which could be a good way to contribute for a short period and find out if it’s a good fit for you.
</div>
Sunday data/statistics link roundup (3/18)
2012-03-18T14:58:00+00:00
http://simplystats.github.io/2012/03/18/sunday-data-statistics-link-roundup-3-18
<ol>
<li>A really interesting <a href="http://www.80grados.net/2012/03/una-upr-de-clase-mundial/" target="_blank">proposal</a> by Rafa (in Spanish - we’ll get on him to write a translation) for the University of Puerto Rico. The post concerns changing the focus from simply teaching to creating knowledge and the potential benefits to both the university and to Puerto Rico. It also has a really nice summary of the benefits that the university system in the United States has produced. Definitely worth a read. The comments are also interesting, it looks like Rafa’s post is pretty controversial…</li>
<li>An interesting <a href="http://motherboard.vice.com/2012/1/27/was-space-shuttle-challenger-a-casualty-of-bad-data-visualization" target="_blank">article</a> suggesting that the Challenger Space Shuttle disaster was at least in part due to bad data visualization. Via <a href="https://twitter.com/#!/DataInColour" target="_blank">@DatainColour</a></li>
<li>The <a href="http://news.sciencemag.org/sciencenow/2012/03/examining-his-own-body-stanford-.html" target="_blank">Snyderome</a> is getting a lot of attention in genomics circles. He used as many new technologies as he could to measure a huge amount of molecular information about his body over time. I am really on board with the excitement about measurement technologies, but this poses a huge challenge for statistics and and statistical literacy. If this kind of thing becomes commonplace, the potential for false positives and ghost diagnoses is huge without a really good framework for uncertainty. Via Peter S. </li>
<li>More <a href="http://www.wired.co.uk/news/archive/2012-03/16/nike-building-app-stunt" target="_blank">news</a> about the Nike API. Now that is how to unveil some data! </li>
<li>Add the Nike API to the <a href="http://simplystatistics.tumblr.com/post/18493330661/statistics-project-ideas-for-students" target="_blank">list of potential statistics projects</a> for students. </li>
</ol>
Peter Norvig on the "Unreasonable Effectiveness of Data"
2012-03-16T13:46:00+00:00
http://simplystats.github.io/2012/03/16/the-unreasonable-effectiveness-of-data-a-talk
<p>“The Unreasonable Effectiveness of Data”, a talk by Peter Norvig of Google. Sometimes, more data is more better. (Thanks to John C. for the link.)</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
A proposal for a really fast statistics journal
2012-03-14T13:53:20+00:00
http://simplystats.github.io/2012/03/14/a-proposal-for-a-really-fast-statistics-journal
<p>I know we need a new journal like we need a good poke in the eye. But I got fired up by the recent discussion of open science (by <a href="http://krugman.blogs.nytimes.com/2012/01/17/open-science-and-the-econoblogosphere/" target="_blank">Paul Krugman</a> and others) and the seriously misguided <a href="http://en.wikipedia.org/wiki/Research_Works_Act" target="_blank">Research Works Act</a> - which aimed to make it illegal to deposit published papers funded by the government in PubMed Central or other open access databases.</p>
<div>
<span>I also realized that I spend a huge amount of time/effort on the following things: (1) waiting for reviews (typically months), (2) addressing reviewer comments that are unrelated to the accuracy of my work - just adding citations to referees’ papers or doing additional simulations, and (3) resubmitting rejected papers to new journals - this is a huge time suck since I have to reformat, etc. Furthermore, if I want my papers to be published open-access I also realized I have to pay at minimum <a href="http://simplystatistics.tumblr.com/post/12286350206/free-access-publishing-is-awesome-but-expensive-how" target="_blank">$1,000 per paper</a>. </span></p>
<p>
<span>So I thought up my criteria for an ideal statistics journal. It would be accurate, have fast review times, and not discriminate based on how interesting an idea is. I have found that my most interesting ideas are the hardest ones to get published. This journal would:</span>
</p>
<ul>
<li>
Be open-access and free to publish your papers there. You own the copyright on your work.
</li>
<li>
The criteria for publication would be: (1) it has to do with statistics, computation, or data analysis, and (2) the work is technically correct.
</li>
<li>
We would accept manuals, reports of new statistical software, and full length research articles.
</li>
<li>
There would be no page limits/figure limits.
</li>
<li>
The journal would be published exclusively online.
</li>
<li>
We would guarantee reviews within 1 week and publication immediately upon review if criteria (1) and (2) are satisfied
</li>
<li>
Papers would receive a star rating from the editor - 0-5 stars. There would be a place for readers to also review articles
</li>
<li>
All articles would be published with a tweet/like button so they can be easily distributed
</li>
</ul>
<div>
</div>
<div>
To achieve such a fast review time, here is how it would work. We would have a large group of Associate Editors (hopefully 30 or more). When a paper was received, it would be assigned to an AE. The AEs would agree to referee papers within 2 days. They would use a form like this:
</div>
<div>
</div>
<blockquote>
<ul>
<li>
Review of: Jeff’s Paper
</li>
<li>
Technically Correct: Yes
</li>
<li>
About statistics/computation/data analysis: Yes
</li>
<li>
Number of Stars: 3 stars
</li>
<li>
3 Strengths of Paper (1 required): This paper revolutionizes statistics
</li>
<li>
3 Weaknesses of Paper (1 required): The proof that this paper revolutionizes statistics is pretty weak because he only includes one example.
</li>
</ul>
</blockquote>
<div>
</div>
<div>
That’s it, super quick, super simple, so it wouldn’t be hard to referee. As long as the answers to the first two questions were yes, it would be published.
</div>
<div>
</div>
<div>
So now here’s my questions:
</div>
<div>
</div>
<div>
<ol>
<li>
Would you ever consider submitting a paper to such a journal?
</li>
<li>
Would you be willing to be one of the AEs for such a journal?
</li>
<li>
Is there anything you would change?
</li>
</ol>
</div>
<div>
</div></div>
Frighteningly Ambitious Startup Ideas
2012-03-13T15:00:05+00:00
http://simplystats.github.io/2012/03/13/frighteningly-ambitious-startup-ideas
<p><a href="http://paulgraham.com/ambitious.html">Frighteningly Ambitious Startup Ideas</a></p>
Sunday Data/Statistics Link Roundup (3/11)
2012-03-12T19:40:00+00:00
http://simplystats.github.io/2012/03/12/sunday-data-statistics-link-roundup-3-11
<ol>
<li>This is the big one. ESPN has opened up access to their <a href="http://developer.espn.com/docs" target="_blank">API</a>! It looks like there may only be access to some of the data for the general public though, does anyone know more? </li>
<li>Looks like ESPN isn’t the only sports-related organization in the API mood, Nike plans to open up an <a href="http://techcrunch.com/2012/03/10/nike-apis-sxsw-backplane/" target="_blank">API too</a>. It would be great if they had better access to individual, downloadable data. </li>
<li>Via Leonid K.: a <a href="http://www.yale.edu/acmelab/articles/bargh_chen_burrows_1996.pdf" target="_blank">highly influential</a> psychology study failed to replicate in a study published in PLoS One. The <a href="http://en.wikipedia.org/wiki/John_Bargh" target="_blank">author</a> of the original study <a href="http://www.psychologytoday.com/blog/the-natural-unconscious/201203/nothing-in-their-heads" target="_blank">went off</a> on the author of the paper, on PLoS One, and on the reporter who broke the story (including personal attacks!). It looks like the authors of the PLoS One paper actually did a more careful study than the original authors to me. The authors of the PLoS One paper, the reporter, and the editor of PLoS One all replied in a much more reasonable way. See this excellent <a href="http://blogs.discovermagazine.com/notrocketscience/2012/03/10/failed-replication-bargh-psychology-study-doyen/" target="_blank">summary</a> for all the details. Here are a few choice quotes from the comments: </li>
</ol>
<blockquote>
<p><span>1. But there’s a long tradition in social psychology of experiments as parables,</span></p>
<p><span>2. I’d love to write a really long response, but let’s just say: priming methods like these fail to replicate all the time (frequently in my own studies), and the news that one of Bargh’s studies failed to replicate is not surprising to me at all.</span></p>
<p><span>3. This distinction between direct and conceptual replication helps to explain why a psychologist isn’t particularly concerned whether Bargh’s finding replicates or not.</span></p>
</blockquote>
<div>
D. Reproducible != Replicable in scientific research. But Roger’s <a href="http://simplystatistics.tumblr.com/post/13633695297/reproducible-research-in-computational-science" target="_blank">perspective on reproducible research</a> still seems appropriate here.
</div>
Answers in Medicine Sometimes Lie in Luck
2012-03-12T15:00:05+00:00
http://simplystats.github.io/2012/03/12/answers-in-medicine-sometimes-lie-in-luck
<p><a href="http://www.nytimes.com/2012/03/06/health/views/for-doctors-luck-can-explain-whatever-they-cant.html">Answers in Medicine Sometimes Lie in Luck</a></p>
Cost of Gene Sequencing Falls, Raising Hopes for Medical Advances
2012-03-11T16:00:05+00:00
http://simplystats.github.io/2012/03/11/cost-of-gene-sequencing-falls-raising-hopes-for
<p><a href="http://www.nytimes.com/2012/03/08/technology/cost-of-gene-sequencing-falls-raising-hopes-for-medical-advances.html">Cost of Gene Sequencing Falls, Raising Hopes for Medical Advances</a></p>
IBM’s Watson Gets Wall Street Job After ‘Jeopardy’ Win
2012-03-10T16:00:05+00:00
http://simplystats.github.io/2012/03/10/ibms-watson-gets-wall-street-job-after-jeopardy-win
<p><a href="http://www.bloomberg.com/news/2012-03-05/ibm-s-watson-computer-gets-wall-street-job-one-year-after-jeopardy-win.html">IBM’s Watson Gets Wall Street Job After ‘Jeopardy’ Win</a></p>
Mission Control, Built for Cities
2012-03-09T16:00:05+00:00
http://simplystats.github.io/2012/03/09/mission-control-built-for-cities
<p><a href="http://www.nytimes.com/2012/03/04/business/ibm-takes-smarter-cities-concept-to-rio-de-janeiro.html">Mission Control, Built for Cities</a></p>
A plot of my citations in Google Scholar vs. Web of Science
2012-03-08T16:00:05+00:00
http://simplystats.github.io/2012/03/08/a-plot-of-my-citations-in-google-scholar-vs-web-of
<p>There has <a href="http://www.functionalneurogenesis.com/blog/2012/02/google-scholar-vs-scopus-web-of-science/" target="_blank">been</a> <a href="http://www.nature.com/nature/journal/v483/n7387/full/483036c.html" target="_blank">some</a> <a href="http://www.nature.com/nature/journal/v483/n7387/full/483036d.html" target="_blank">discussion </a>about whether Google Scholar or one of the proprietary software companies numbers are better for citation counts. I personally think Google Scholar is better for a number of reasons:</p>
<ol>
<li>Higher numbers, but consistently/adjustably higher :-)</li>
<li>It’s free and the data are openly available. </li>
<li>It covers more ground (patents, theses, etc.) to give a better idea of global impact</li>
<li>It’s easier to use</li>
</ol>
<p>I haven’t seen a plot yet relating Web of Science citations to Google Scholar citations, so I made one for my papers.</p>
<p><img height="400" src="http://biostat.jhsph.edu/~jleek/citations.png" width="400" /></p>
<p>GS has about 41% more citations per paper than Web of Science. That is consistent with what other people have found. It also looks reasonably linearish. I wonder what other people’s plots would look like? </p>
<p>Here is the R code I used to generate the plot (the names are Pubmed IDs for the papers):</p>
<blockquote>
<p>library(ggplot2)</p>
<p># Pubmed IDs for the papers</p>
<p>names = c(16141318,16357033,16928955,17597765,17907809,19033188,19276151,19924215,20560929,20701754,20838408, 21186247,21685414,21747377,21931541,22031444,22087737,22096506,22257669) </p>
<p># Google Scholar citation counts</p>
<p>y = c(287,122,84,39,120,53,4,52,6,33,57,0,0,4,1,5,0,2,0)</p>
<p># Web of Science (Web of Knowledge) citation counts</p>
<p>x = c(200,92,48,31,79,29,4,51,2,18,44,0,0,1,0,2,0,1,0)</p>
<p>Year = c(2005,2006,2007,2007,2007,2008,2009,2009,2011,2010,2010,2011,2012,2011,2011,2011,2011,2011,2012)</p>
<p># Scatterplot colored by year, with a y = x reference line</p>
<p>q <- qplot(x,y,xlim=c(-20,300),ylim=c(-20,300),xlab="Web of Knowledge",ylab="Google Scholar") + geom_point(aes(colour=Year),size=5) + geom_line(aes(x = y, y = y),size=2)</p>
<p>print(q)</p>
</blockquote>
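<p>The post doesn’t say exactly how the 41% figure was computed, but one plausible way to estimate the average Google Scholar to Web of Science ratio from these data is a regression through the origin. The snippet below assumes the x and y vectors defined in the code above are already in your R session:</p>
<pre>
# One illustrative way to estimate the GS/WoS citation ratio from the vectors
# above (not necessarily how the 41% figure in the post was calculated).
fit <- lm(y ~ x - 1)   # no-intercept fit: y is roughly a constant multiple of x
coef(fit)              # slope gives GS citations per WoS citation
</pre>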
R.A. Fisher is the most influential scientist ever
2012-03-07T16:00:05+00:00
http://simplystats.github.io/2012/03/07/r-a-fisher-is-the-most-influential-scientist-ever
<p>You can now see profiles of famous scientists on Google Scholar citations. Here are links to a few of them (via Ben L.). <a href="http://scholar.google.com/citations?user=6kEXBa0AAAAJ&hl=en" target="_blank">Von Neumann</a>, <a href="http://scholar.google.com/citations?user=qc6CJjYAAAAJ&hl=en" target="_blank">Einstein</a>, <a href="http://scholar.google.com/citations?user=xJaxiEEAAAAJ&hl=en" target="_blank">Newton</a>, <a href="http://scholar.google.com/citations?user=B7vSqZsAAAAJ&hl=en" target="_blank">Feynman</a></p>
<p>But their impact on science pales in comparison (with the possible exception of Newton) to the impact of one statistician: <a href="http://en.wikipedia.org/wiki/Ronald_Fisher" target="_blank">R.A. Fisher</a>. Many of the concepts he developed are so common and are considered so standard, that he is never cited/credited. Here are some examples of things he invented along with a conservative number of citations they would have received calculated via Google Scholar*. </p>
<ol>
<li>P-values - <strong>3 million citations</strong></li>
<li>Analysis of variance (ANOVA) - <strong>1.57 million citations</strong></li>
<li>Maximum likelihood estimation - <strong>1.54 million citations</strong></li>
<li>Fisher’s linear discriminant - <strong>62,400 citations</strong></li>
<li>Randomization/permutation tests - <strong>37,940 citations</strong></li>
<li>Genetic linkage analysis - <strong>298,000 citations</strong></li>
<li>Fisher information - <strong>57,000 citations</strong></li>
<li>Fisher’s exact test - <strong>237,000 citations</strong></li>
</ol>
<p>A couple of notes:</p>
<ol>
<li>These are seriously conservative estimates, since I only searched for a few variants on some key words</li>
<li>These numbers are <strong>BIG</strong>, there isn’t another scientist in the ballpark. The guy who wrote the “<a href="http://www.jbc.org/content/280/28/e25.full" target="_blank">most highly cited paper</a>” got 228,441 citations on GS. His next most cited paper? <a href="http://scholar.google.com/citations?hl=en&user=YCS0XAcAAAAJ&oi=sra" target="_blank">3,000 citations</a>. Fisher has at least 5 papers with more citations than his best one. </li>
<li><a href="http://archive.sciencewatch.com/sept-oct2003/sw_sept-oct2003_page2.htm" target="_blank">This page</a> says Bert Vogelstein has the most citations of any person over the last 30 years. If you add up the number of citations to his top 8 papers on GS, you get 57,418. About as many as to the Fisher information matrix. </li>
</ol>
<p>I think this really speaks to a couple of things. One is that Fisher invented some of the most critical concepts in statistics. The other is the breadth of impact of statistical ideas across a range of disciplines. In any case, I would be hard pressed to think of another scientist who has influenced a greater range or depth of scientists with their work. </p>
<ul>
<li>
<p>*Calculations of citations:</p>
<ol>
<li><a href="http://simplystatistics.tumblr.com/post/15402808730/p-values-and-hypothesis-testing-get-a-bad-rap-but-we" target="_blank">As described</a> in a previous post</li>
<li># of GS results for “Analysis of Variance” + # for “ANOVA” - “Analysis of Variance”</li>
<li># of GS results for “maximum likelihood”</li>
<li># of GS results for “linear discriminant”</li>
<li># of GS results for “permutation test” + # for ”permutation tests” - “permutation test”</li>
<li># of GS results for “linkage analysis”</li>
<li># of GS results for “fisher information” + # for “information matrix” - “fisher information”</li>
<li># of GS results for “fisher’s exact test” + # for “fisher exact test” - “fisher’s exact test”</li>
</ol>
</li>
</ul>
Are banks being sidelined by retailers' data collection?
2012-03-06T16:00:06+00:00
http://simplystats.github.io/2012/03/06/are-banks-being-sidelined-by-retailers-data
<p><a href="http://dealbook.nytimes.com/2012/02/28/live-blog-investor-day-at-jpmorgan-chase/">Are banks being sidelined by retailers’ data collection?</a></p>
Characteristics of my favorite statistics talks
2012-03-05T16:02:05+00:00
http://simplystats.github.io/2012/03/05/characteristics-of-my-favorite-statistics-talks
<p>I’ve been going to/giving statistics talks for a few years now. I think everyone in our field has an opinion on the best structure/content/delivery of a talk. I am one of those people that has a pretty specific idea of what makes an amazing talk. Here are a few of the things I think are key, I try to do them and have learned many of these things from other people who I’ve seen speak. I’d love to hear what other people think. </p>
<p><strong>Structure</strong></p>
<ol>
<li>I don’t like outline slides. I think they take up space but don’t add to most talks. Instead I love it when talks start with a specific, concrete, unsolved problem. In my favorite talks, this problem is usually scientific/applied. Although I have also seen great theoretical talks where a person starts with a key and unsolved theoretical problem. </li>
<li>I like it when the statistical model is defined to solve the problem in the beginning, so it is easy to see the connection between the model and the purpose of the model. </li>
<li>I love it when talks end by showing how they solved the problem they described at the very beginning of the talk. </li>
</ol>
<p><strong>Content</strong></p>
<ol>
<li>I like it when people assume I’m pretty ignorant about their problem (I usually am) and explain everything in very simple language. I think some people worry about their research looking too trivial. I have almost never come away from a talk thinking that, but I frequently leave talks confused because the background material wasn’t clear. </li>
<li>I like it when talks cover enough technical detail so I can follow the basic algorithm, but not so much that I get lost in notation. I also struggle when talks go off on tangents, describing too many subproblems, rather than focusing on the main problem in the talk and just mentioning subproblems succinctly. </li>
<li>I like it when proposed methods are compared to the obvious straw man and one legitimate competitor (if it exists) on a realistic simulation/data set where the answer is known. </li>
<li>I love it when people give talks on work that isn’t totally finished. This type of talk is scary for two reasons: (1) you can be scooped and (2) you might not have all the answers. But I find that unfinished work leads to way more discussion/ideas than a talk about work that has been published and is “complete”. </li>
</ol>
<p><strong>Delivery</strong></p>
<ol>
<li>I like it when a talk runs short. I have never been disappointed when a talk ended 10-15 min early. On the other hand, when a talk is long, I almost always lose focus and don’t follow the last part. I’d love it if we moved to <a href="http://simplystatistics.tumblr.com/post/10686092687/25-minute-seminars" target="_blank">30 minute seminars</a> with more questions. </li>
<li>I like it when speakers have prepared their slides and they have a clear flow and don’t get bogged down in transitions. For this reason, I don’t mind it when people give the same talk a bunch of places. I usually find that the talk is very polished.</li>
</ol>
DealBook: Roche Extends Deadline for Illumina Offer
2012-03-04T16:03:06+00:00
http://simplystats.github.io/2012/03/04/dealbook-roche-extends-deadline-for-illumina-offer
<p><a href="http://dealbook.nytimes.com/2012/02/27/roche-extends-deadline-for-illumina-takeover/">DealBook: Roche Extends Deadline for Illumina Offer</a></p>
Sunday data/statistics link roundup (3/4)
2012-03-04T14:14:02+00:00
http://simplystats.github.io/2012/03/04/sunday-data-statistics-link-roundup-3-4
<ol>
<li>A <a href="http://www.wired.com/wiredenterprise/2012/02/github/all/1" target="_blank">cool article</a> on <a href="https://github.com/" target="_blank">Github </a>by the folks at Wired. I’m starting to think the fact that I’m not on Github is a serious dent in my nerd cred. </li>
<li><a href="http://www.visualisingdata.com/index.php/2012/02/datawrapper-open-source-data-visualisation-creator/" target="_blank">Datawrapper</a> - a less intensive, but less flexible open source data visualization creator. I have seen a few of these types of services starting to pop up. I think that some statistics training should be mandatory before people use them. </li>
<li>An interesting <a href="http://research.iheartanthony.com/2012/02/23/why-bother-publishing-in-a-journal-2/" target="_blank">blog post </a>with the provocative title, “Why bother publishing in a journal” The story he describes works best if you have a lot of people who are interested in reading what you put on the internet. </li>
<li>A <a href="http://stats.stackexchange.com/questions/6/the-two-cultures-statistics-vs-machine-learning" target="_blank">post</a> on stackexchange comparing the machine learning and statistics cultures. </li>
<li><a href="http://stackoverflow.com/questions/tagged/r" target="_blank">Stackoverflow</a> is a great place to look for R answers. It is the R mailing list, minus the flames…</li>
<li>Roger’s <a href="http://simplystatistics.tumblr.com/post/13897994725/plotting-beijingair-data" target="_blank">posts</a> on <a href="http://simplystatistics.tumblr.com/post/13601935082/beijing-air" target="_blank">Beijing</a> air pollution are worth another read if you missed them. Particularly <a href="http://simplystatistics.tumblr.com/post/14214147778/smoking-is-a-choice-breathing-is-not" target="_blank">this one</a>, where he computes the cigarette equivalent of the air pollution levels. </li>
</ol>
True Innovation
2012-03-03T16:05:06+00:00
http://simplystats.github.io/2012/03/03/true-innovation
<p><a href="http://www.nytimes.com/2012/02/26/opinion/sunday/innovation-and-the-bell-labs-miracle.html">True Innovation</a></p>
Confronting a Law Of Limits
2012-03-02T16:02:05+00:00
http://simplystats.github.io/2012/03/02/confronting-a-law-of-limits
<p><a href="http://www.nytimes.com/2012/02/25/business/apple-confronts-the-law-of-large-numbers-common-sense.html">Confronting a Law Of Limits</a></p>
An essay on why programmers need to learn statistics
2012-03-02T13:24:55+00:00
http://simplystats.github.io/2012/03/02/an-essay-on-why-programmers-need-to-learn-statistics
<p>This is <a href="http://zedshaw.com/essays/programmer_stats.html" target="_blank">awesome</a>. There are a few places with some strong language, but overall I think the message is pretty powerful. Via Tariq K. I agree with Tariq, one of the gems is:</p>
<blockquote>
<p><span>If you want to measure something, then don’t measure other sh**. </span></p>
</blockquote>
A cool profile of a human rights statistician
2012-03-01T14:02:05+00:00
http://simplystats.github.io/2012/03/01/a-cool-profile-of-a-human-rights-statistician
<p>Via <a href="http://aldaily.com/" target="_blank">AL Daily</a>, this dude <a href="http://www.foreignpolicy.com/articles/2012/02/27/the_body_counter?page=full" target="_blank">collects data and analyzes it</a> to put war criminals away. The idea of using statistics to quantify mass testimony is interesting. </p>
<blockquote>
<p>With statistical methods and the right kind of data, he can make what we know tell us what we don’t know. He has shown human rights groups, truth commissions, and international courts how to take a collection of thousands of testimonies and extract from them the magnitude and pattern of violence — to lift the fog of war.</p>
</blockquote>
<p>So how does he do it? With an idea from statistical ecology. This is a bit of a long quote but describes the key bit.</p>
<blockquote>
<p><span>Working on the Guatemalan data, Ball found the answer. He called Fritz Scheuren, a statistician with a long history of involvement in human rights projects. Scheuren reminded Ball that a solution to exactly this problem had been invented in the 19th century to count wildlife. “If you want to find out how many fish are in the pond, you can drain the pond and count them,” Scheuren explained, “but they’ll all be dead. Or you can fish, tag the fish you catch, and throw them back. Then you go another day and fish again. You count how many fish you caught the first day, and the second day, and the number of overlaps.”</span></p>
<p>The number of overlaps is key. It tells you how representative a sample is. From the overlap, you can calculate how many fish are in the whole pond. (The actual formula is this: Multiply the number of fish caught the first day by the number caught the second day. Divide the total by the overlap. That’s roughly how many fish are really in the pond.) It gets more accurate if you can fish not just twice, but many more times — then you can measure the overlap between every pair of days.</p>
<p>Guatemala had three different collections of human rights testimonies about what had happened during the country’s long, bloody civil war: from the U.N. truth commission, the Catholic Church’s truth commission, and the International Center for Research on Human Rights, an organization that worked with Guatemala’s human rights groups. Working for the official truth commission, Ball used the count-the-fish method, called <a href="https://www.hrdag.org/resources/mult_systems_est.shtml" target="_blank">multiple systems estimation</a> (MSE), to compare all three databases. He found that over the time covered by the commission’s mandate, from 1978 to 1996, 132,000 people were killed (not counting those disappeared), and that government forces committed 95.4 percent of the killings. He was also able to calculate killings by the ethnicity of the victim. Between 1981 and 1983, 8 percent of the nonindigenous population of the Ixil region was assassinated; in the Rabinal region, the figure was around 2 percent. In both those regions, though, more than 40 percent of the Mayan population was assassinated.</p>
</blockquote>
<p>Cool right? The article is worth a read. If you are inspired, check out <a href="http://datawithoutborders.cc/" target="_blank">Data Without Borders. </a></p>
The case for open computer programs
2012-02-29T14:02:06+00:00
http://simplystats.github.io/2012/02/29/the-case-for-open-computer-programs
<p><a href="http://arstechnica.com/science/news/2012/02/science-code-should-be-open-source-according-to-editorial.ars">The case for open computer programs</a></p>
Statistics project ideas for students
2012-02-29T13:50:05+00:00
http://simplystats.github.io/2012/02/29/statistics-project-ideas-for-students
<p>Here are a few ideas that might make for interesting student projects at all levels (from high-school to graduate school). I’d welcome ideas/suggestions/additions to the list as well. All of these ideas depend on free or scraped data, which means that anyone can work on them. I’ve given a ballpark difficulty for each project to give people some idea.</p>
<p>Happy data crunching!</p>
<p><strong>Data Collection/Synthesis</strong></p>
<ol>
<li>Creating a webpage that explains conceptual statistical issues like randomization, margin of error, overfitting, cross-validation, concepts in data visualization, sampling. The webpage should not use any math at all and should explain the concepts so a general audience could understand. Bonus points if you make short 30 second animated youtube clips that explain the concepts. (<em>Difficulty: Lowish; Effort: Highish</em>)</li>
<li>Building an aggregator for statistics papers across disciplines that can be the central resource for statisticians. Journals ranging from <em>PLoS Genetics</em> to <em>Neuroimage</em> now routinely publish statistical papers. But there is no one central resource that aggregates all the statistics papers published across disciplines. Such a resource would be <strong>hugely</strong> useful to statisticians. You could build it using blogging software like WordPress so articles could be tagged/you could put the resource in your RSS feeder. (<em>Difficulty: Lowish; Effort: Mediumish)</em></li>
</ol>
<p><strong>Data Analyses</strong></p>
<ol>
<li>Scrape the LivingSocial/Groupon sites for the daily deals and develop a prediction of how successful the deal will be based on location/price/type of deal. You could use either the RCurl R package or the XML R package to scrape the data. (<em>Difficulty: Mediumish; Effort: Mediumish</em>)</li>
<li>You could use the data from your city (<a href="http://simplystatistics.tumblr.com/post/15182715327/list-of-cities-states-with-open-data-help-me-find" target="_blank">here</a> are a few cities with open data) to: (a) identify the best and worst neighborhoods to live in based on different metrics like how many parks are within walking distance, crime statistics, etc. (b) identify concrete measures your city could take to improve different quality of life metrics like those described above - say where should the city put a park, or (c) see if you can predict when/where crimes will occur (like <a href="http://simplystatistics.tumblr.com/post/15628138349/statistical-crime-fighter" target="_blank">these guys did</a>). (<em>Difficulty: Mediumish; Effort: Highish</em>)</li>
<li>Download data on state of the union speeches from <a href="http://stateoftheunion.onetwothree.net/texts/index.html" target="_blank">here</a> and use the <a href="http://cran.r-project.org/web/packages/tm/index.html" target="_blank">tm package</a> in R to analyze the patterns of word use over time; a short sketch appears just after this list (<em>Difficulty: Lowish; Effort: Lowish</em>)</li>
<li>Use this <a href="http://www.factual.com/t/1fKxck/DonorsChooseorg_Projects" target="_blank">data set</a> from <a href="http://www.donorschoose.org/" target="_blank">Donors Choose</a> to determine the characteristics that make the funding of projects more likely. You could send your results to the Donors Choose folks to help them improve the funding rate for their projects. (<em>Difficulty: Mediumish; Effort: Mediumish</em>) </li>
<li>Which basketball player would you want on your team? <a href="http://simplystatistics.tumblr.com/post/16974142373/why-dont-we-hear-more-about-adrian-dantley-on-espn" target="_blank">Here</a> is a really simple analysis done by Rafa. But it doesn’t take into account things like defense. If you want to take on this project, you should take a look at this <a href="http://skepticalsports.com/?page_id=1222" target="_blank">Denis Rodman analysis</a> which is the gold standard. (<em>Difficulty: Mediumish; Effort: Highish</em>).</li>
</ol>
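<p>For project 3, here is a minimal sketch of the kind of analysis you could do, assuming the speeches are saved as one plain-text file per year (named by year) in a folder called sotu/ and that the word you track actually appears in the corpus. This is just a starting point, not a full solution:</p>
<pre>library(tm)

## Build a corpus from the downloaded speeches and do some standard cleaning
speeches <- VCorpus(DirSource("sotu/", encoding = "UTF-8"))
speeches <- tm_map(speeches, content_transformer(tolower))
speeches <- tm_map(speeches, removePunctuation)
speeches <- tm_map(speeches, removeWords, stopwords("english"))

## Document-term matrix: one row per speech, one column per word
m <- as.matrix(DocumentTermMatrix(speeches))
years <- as.numeric(gsub("\\D", "", rownames(m)))   # assumes the year is in each file name

## Relative frequency of one word (assumed to be present) in each speech
freq <- m[, "economy"] / rowSums(m)
plot(years, freq, xlab = "Year", ylab = "Relative frequency of 'economy'")
</pre>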
<p><strong>Data visualization</strong></p>
<ol>
<li>Creating an R package that wraps the <a href="http://www.omegahat.org/SVGAnnotation/" target="_blank">svgAnnotation</a> package. This package can be used to create dynamic graphics in R, but is still a bit too flexible for most people to use. Writing some wrapper functions that simplify the interface would be potentially high impact. Maybe something like svgPlot() to create simple, dynamic graphics with only a few options (<em>Difficulty: Mediumish; Effort: Mediumish</em>). </li>
<li>The same as project 1 but for <a href="http://mbostock.github.com/d3/" target="_blank">D3.js</a>. The impact could potentially be a bit higher, since the graphics are a bit more professional, but the level of difficulty and effort would also both be higher. (<em>Difficulty: Highish; Effort: Highish</em>)</li>
</ol>
Gulf on Open Access to Federally Financed Research
2012-02-29T01:19:49+00:00
http://simplystats.github.io/2012/02/29/gulf-on-open-access-to-federally-financed-research
<p><a href="http://www.nytimes.com/2012/02/28/science/a-wide-gulf-on-open-access-to-federally-financed-research.html">Gulf on Open Access to Federally Financed Research</a></p>
Duke Taking New Steps to Safeguard Research Integrity
2012-02-28T14:02:05+00:00
http://simplystats.github.io/2012/02/28/duke-taking-new-steps-to-safeguard-research-integrity
<p><a href="http://today.duke.edu/2012/02/acpotti">Duke Taking New Steps to Safeguard Research Integrity</a></p>
The Duke Saga Starter Set
2012-02-27T14:02:06+00:00
http://simplystats.github.io/2012/02/27/the-duke-saga-starter-set
<p>A <a href="http://simplystatistics.tumblr.com/post/10068195751/the-duke-saga" target="_blank">few</a> <a href="http://simplystatistics.tumblr.com/post/17563119490/the-duke-clinical-trials-saga-what-really-happened" target="_blank">of our</a> <a href="http://simplystatistics.tumblr.com/post/17370909057/duke-saga-on-60-minutes-this-sunday" target="_blank">recent</a> <a href="http://simplystatistics.tumblr.com/post/17550711561/duke-clinical-trials-saga-on-60-minutes-first" target="_blank">posts</a> relate to the Duke trial saga. For those that want to learn more, Baggerly and Coombes have put together a “<a href="http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/StarterSet/index.html" target="_blank">starter set</a>”. It includes</p>
<ol>
<li>
<p><a href="http://videolectures.net/cancerbioinformatics2010_baggerly_irrh/" target="_blank">a video of one of their talks</a></p>
</li>
<li>
<p><a href="http://www.cbsnews.com/8301-18560_162-57376073/deception-at-duke/" target="_blank">the 60 Minutes episode and clip</a></p>
</li>
<li>
<p><a href="http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/StarterSet/baggerly_nebraska12.pdf" target="_blank">slides from a recent presentation with some new details</a></p>
</li>
<li>
<p><a href="http://arxiv.org/pdf/1010.1092.pdf" target="_blank">their Annals of Applied Statistics paper</a></p>
</li>
<li>
<p><a href="http://www.clinchem.org/content/57/5/688.long" target="_blank">an editorial they wrote for Clinical Chemistry about what information should be required to support clinical “omics” publications</a> (gated)</p>
</li>
<li>
<p><a href="http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/index.html" target="_blank">links to the IOM session recordings and slides</a></p>
</li>
</ol>
<p>Enjoy!</p>
Graham & Dodd's Security Analysis: Moneyball for...Money
2012-02-27T01:51:40+00:00
http://simplystats.github.io/2012/02/27/graham-dodds-security-analysis-moneyball
<p>The last time I <a href="http://simplystatistics.tumblr.com/post/17152281502/an-r-script-for-estimating-future-inflation-via-the" target="_blank">posted something about finance</a> I got schooled by people who actually know stuff. So let me just say that I don’t claim to be an expert in this area, but I do have an interest in it and try to keep up the best I can.</p>
<p>One book I picked up a little while ago was <em>Security Analysis</em> by Benjamin Graham and David Dodd. This is the “bible of value investing” and so I mostly wanted to see what all the hubbub was about. In my mind, the hubbub is well-deserved. Given that it was originally written in 1934, the book has stood the test of time (the book has been updated a number of times since then). It’s quite readable and, I guess, still relevant to modern-day investing. In the 6th edition the out-of-date stuff has been relegated to an appendix. It also contains little essays (of varying quality) by modern-day value investing heroes like Seth Klarman and Glenn Greenberg. It’s a heavy book, though, and I wish I’d gotten it on the Kindle.</p>
<p>It occurred to me that with all the interest in data and analytics today, <em>Security Analysis</em> reads a lot like the <em>Moneyball</em> of investing. The two books make the same general point: find things that are underpriced/underappreciated and buy them when no one’s looking. Then profit!</p>
<p>One of the basic points made early on is that roughly speaking, you can’t judge a security by its cover. You need to look at the data. How novel! For example, at the time bonds were considered safe because they were bonds, while stocks (equity) were considered risky because they were stocks. There are technical reasons why this is true, but a careful look at the data might reveal that the bonds of one company are risky while the stock is safe, depending on the price at which they are trading. The question to ask for either type of security is what’s the chance of losing money? In order to answer that question you need to estimate the intrinsic value of the company. For that, you need data.</p>
<blockquote>
<p>The functions of security analysis may be described under three headings: descriptive, selective, and critical. In its more obvious form, descriptive analysis consists of marshalling the important facts relating to the issue [security] and presenting them in a coherent, readily intelligible manner…. A more penetrating type of description seeks to reveal the strong and weak points in the position of an issue, compare its exhibit with that of others of similar character, and appraise the factors which are likely to influence its future performance. Analysis of this kind is applicable to almost every corporate issue, and it may be regarded as an adjunct not only to investment but also to intelligent speculation in that it provides an organized factual basis for the application of judgment.</p>
</blockquote>
<p>Back in Graham & Dodd’s day it must have been quite a bit harder to get the data. Many financial reports that are routinely published today by public companies were not available back then. Today, we are awash in easily accessible financial data and, one might argue as a result of that, there are fewer opportunities to make money. </p>
'WaterBillWoman' pestered city for years over faulty bills
2012-02-25T15:23:00+00:00
http://simplystats.github.io/2012/02/25/waterbillwoman-pestered-city-for-years-over-faulty
<p><a href="http://www.baltimoresun.com/news/maryland/bs-md-ci-water-bills-20120224,0,7923036.story">‘WaterBillWoman’ pestered city for years over faulty bills</a></p>
Prediction: the Lasso vs. just using the top 10 predictors
2012-02-23T16:07:00+00:00
http://simplystats.github.io/2012/02/23/prediction-the-lasso-vs-just-using-the-top-10
<p>One incredibly popular tool for the analysis of high-dimensional data is the <a href="http://www-stat.stanford.edu/~tibs/lasso.html" target="_blank">lasso</a>. The lasso is commonly used in cases when you have many more predictors than independent samples (the n ≪ p problem). It is also often used in the context of prediction.</p>
<p>Suppose you have an outcome <strong>Y</strong> and several predictors <strong>X<sub>1</sub></strong>,…,<strong>X<sub>M</sub></strong>, the lasso fits a model:</p>
<p><strong>Y = B<sub>0</sub> + B<sub>1</sub> X<sub>1</sub> + B<sub>2</sub> X<sub>2</sub> + … + B<sub>M</sub> X<sub>M</sub> + E</strong></p>
<p>subject to a constraint on the sum of the absolute value of the <strong>B</strong> coefficients. The result is that: (1) some of the coefficients get set to zero, and those variables drop out of the model, (2) other coefficients are “shrunk” toward zero. Dropping some variables is good because there are a lot of potentially unimportant variables. Shrinking coefficients may be good, since the big coefficients might be just the ones that were really big by random chance (this is related to Andrew Gelman’s <a href="http://andrewgelman.com/2011/09/the-statistical-significance-filter/" target="_blank">type M errors</a>).</p>
<p>I work in genomics, where n ≪ p problems come up all the time. Whenever I use the lasso or when I read papers where the lasso is used for prediction, I always think: “How does this compare to just using the top 10 most significant predictors?” I have asked this out loud enough that <a href="http://www.biostat.jhsph.edu/~rpeng/" target="_blank">some</a> <a href="http://www.biostat.jhsph.edu/~iruczins/" target="_blank">people</a> <a href="http://www.bcaffo.com/" target="_blank">around</a> <a href="http://rafalab.jhsph.edu/" target="_blank">here</a> <a href="http://people.csail.mit.edu/mrosenblum/" target="_blank">started</a> calling it the “Leekasso” to poke fun at me. So I’m going to call it that in a thinly veiled attempt to avoid <a href="http://en.wikipedia.org/wiki/Stigler's_law_of_eponymy" target="_blank">Stigler’s law of eponymy</a> (actually Rafa points out that using this name is a perfect example of this law, since this feature selection approach has been proposed before <a href="http://www.stat.berkeley.edu/tech-reports/576.pdf" target="_blank">at least once</a>).</p>
<p>Here is how the Leekasso works. You fit each of the models:</p>
<p><strong>Y = B<sub>0</sub> + B<sub>k</sub> X<sub>k</sub> + E</strong></p>
<p>take the 10 variables with the smallest p-values from testing the <strong>B<sub>k</sub></strong> coefficients, then fit a linear model with just those 10 variables. You never use 9 or 11; the Leekasso is always 10.</p>
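<p>In case the recipe above is not clear, here is a minimal sketch of the idea (this is just an illustration, not the full code linked later in the post):</p>
<pre>## x: n-by-p matrix of predictors, y: outcome. Keep the 10 predictors with the
## smallest univariate p-values and fit an ordinary linear model with them.
leekasso <- function(x, y) {
  pvals <- apply(x, 2, function(xk) summary(lm(y ~ xk))$coefficients[2, 4])
  top10 <- order(pvals)[1:10]            # always 10, never 9 or 11
  lm(y ~ ., data = data.frame(y = y, x[, top10]))
}
</pre>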
<p>For fun I did an experiment to compare the accuracy of the Leekasso and the Lasso.</p>
<p>Here is the setup:</p>
<ol>
<li>I simulated 500 variables and 100 samples for each study, each N(0,1)</li>
<li>I created an outcome that was 0 for the first 50 samples, 1 for the last 50</li>
<li>I set a certain number of variables (between 5 and 50) to be associated with the outcome using the model with independent effects (this is an important choice; more on this later in the post)</li>
<li>I tried different levels of signal to the truly predictive features</li>
<li>I generated two data sets (training and test) from the exact same model for each scenario</li>
<li>I fit the Lasso using the <a href="http://cran.r-project.org/web/packages/lars/index.html" target="_blank">lars </a>package, choosing the shrinkage parameter as the value that minimized the cross-validation MSE in the training set</li>
<li>I fit the Leekasso and the Lasso on the training sets and evaluated accuracy on the test sets (a rough sketch of the lasso side of one such run appears just after this list).</li>
</ol>
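<p>Here is a rough, hedged sketch of what the lasso side of one such run might look like. This is not the actual code linked below; the scenario, effect size, and accuracy rule are made up for illustration, and the Leekasso sketched earlier would be fit on the same training set and scored the same way.</p>
<pre>library(lars)

set.seed(2)
n <- 100; p <- 500; nsignal <- 10
y <- rep(c(0, 1), each = n / 2)
makex <- function() {
  x <- matrix(rnorm(n * p), nrow = n)
  x[, 1:nsignal] <- x[, 1:nsignal] + y   # independent effects on the signal variables
  x
}
xtrain <- makex(); xtest <- makex()

## Pick the shrinkage fraction that minimizes cross-validated error on the training set
cv <- cv.lars(xtrain, y, plot.it = FALSE)
s  <- cv$index[which.min(cv$cv)]

fit  <- lars(xtrain, y, type = "lasso")
pred <- predict(fit, xtest, s = s, mode = "fraction")$fit
mean((pred > 0.5) == y)   # lasso test-set prediction accuracy
</pre>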
<p>The R code for this analysis is available <a href="http://biostat.jhsph.edu/~jleek/code/leekasso.R" target="_blank">here</a> and the resulting data is <a href="http://biostat.jhsph.edu/~jleek/code/lassodata.rda" target="_blank">here</a>.</p>
<p>The results show that for all configurations, using the top 10 has a higher out of sample prediction accuracy than the lasso. A larger version of the plot is <a href="http://biostat.jhsph.edu/~jleek/code/accuracy-plot.png" target="_blank">here</a>.</p>
<p><img height="240" src="http://biostat.jhsph.edu/~jleek/code/accuracy-plot.png" width="480" /></p>
<p>Interestingly, this is true even when there are fewer than 10 real features in the data or when there are many more than 10 real features (remember the Leekasso always picks 10).</p>
<p>Some thoughts on this analysis:</p>
<ol>
<li>This is only test-set prediction accuracy, it says nothing about selecting the “right” features for prediction.</li>
<li>The Leekasso took about 0.03 seconds to fit and test per data set compared to about 5.61 seconds for the Lasso.</li>
<li>The data generating model is the model underlying the top 10, so it isn’t surprising it has higher performance. Note that I simulated from the model: <strong>X<sub>i</sub> = b<sub>0i</sub> + b<sub>1i</sub>Y + e</strong>; this is the model commonly assumed in differential expression analysis (genomics) or voxel-wise analysis (fMRI). Alternatively I could have simulated from the model: <strong>Y = B<sub>0</sub> + B<sub>1</sub> X<sub>1</sub> + B<sub>2</sub> X<sub>2</sub> + … + B<sub>M</sub> X<sub>M</sub> + E</strong>, where most of the coefficients are zero. In this case, the Lasso would outperform the top 10 (data not shown). This is a key, and possibly obvious, issue raised by this simulation. When doing prediction, differences in the true “causal” model matter a lot. So if we believe the “top 10 model” holds in many high-dimensional settings, then it may be the case that regularization approaches don’t work well for prediction and vice versa.</li>
<li>I think what may be happening is that the Lasso is overshrinking the parameter estimates, in other words, you give up too much bias for a gain in variance. Alan Dabney and John Storey have a really nice <a href="http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0001002" target="_blank">paper</a> discussing shrinkage in the context of genomic prediction that I think is related.</li>
</ol>
Monitoring Your Health With Mobile Devices
2012-02-23T13:19:05+00:00
http://simplystats.github.io/2012/02/23/monitoring-your-health-with-mobile-devices
<p><a href="http://www.nytimes.com/2012/02/23/technology/personaltech/monitoring-your-health-with-mobile-devices.html">Monitoring Your Health With Mobile Devices</a></p>
Professional statisticians agree: the Knicks should start Steve Novak over Carmelo Anthony
2012-02-22T16:07:00+00:00
http://simplystats.github.io/2012/02/22/professional-statisticians-agree-the-knicks-should
<p><span>A week ago, Nate Silver tweeted this:</span></p>
<blockquote>
<p><span>Since Lin became starting PG, Knicks have outscored opponents by 63 with Novak on the floor. Been outscored by 8 when he isn’t.</span></p>
</blockquote>
<p>In a <a href="http://simplystatistics.tumblr.com/post/16974142373/why-dont-we-hear-more-about-adrian-dantley-on-espn" target="_blank">previous post</a> we showed the plot below. Note that Carmelo Anthony is in ball hog territory. Novak plays the same position as Anthony but is a three point specialist. His career three point FG% of 42% (253-603) puts him <strong>10th all time!</strong> It seems that with Lin in the lineup he is getting more open shots and helping his team. Should the Knicks start Novak? </p>
<p>Hat tip to David Santiago.</p>
<p><img height="300" src="http://rafalab.jhsph.edu/simplystats/melo.png" width="420" /></p>
Air Pollution Linked to Heart and Brain Risks
2012-02-22T14:02:06+00:00
http://simplystats.github.io/2012/02/22/air-pollution-linked-to-heart-and-brain-risks
<p><a href="http://well.blogs.nytimes.com/2012/02/15/air-pollution-tied-to-heart-and-brain-risks/">Air Pollution Linked to Heart and Brain Risks</a></p>
Interracial Couples Who Make the Most Money
2012-02-21T14:02:05+00:00
http://simplystats.github.io/2012/02/21/interracial-couples-who-make-the-most-money
<p><a href="http://economix.blogs.nytimes.com/2012/02/17/interracial-couples-who-make-the-most-money/">Interracial Couples Who Make the Most Money</a></p>
Scientists Find New Dangers in Tiny but Pervasive Particles in Air Pollution
2012-02-21T02:41:33+00:00
http://simplystats.github.io/2012/02/21/scientists-find-new-dangers-in-tiny-but-pervasive
<p><a href="http://www.nytimes.com/2012/02/19/science/earth/scientists-find-new-dangers-in-tiny-but-pervasive-particles-in-air-pollution.html">Scientists Find New Dangers in Tiny but Pervasive Particles in Air Pollution</a></p>
I don't think it means what ESPN thinks it means
2012-02-20T15:30:17+00:00
http://simplystats.github.io/2012/02/20/i-dont-think-it-means-what-espn-thinks-it-means
<p><img height="375" src="http://biostat.jhsph.edu/~jleek/espn.png" width="500" /></p>
<p>Given ESPN’s recent headline difficulties it seems like they might want a headline editor or something…</p>
60 Lives, 30 Kidneys, All Linked
2012-02-20T13:30:42+00:00
http://simplystats.github.io/2012/02/20/60-lives-30-kidneys-all-linked
<p><a href="http://www.nytimes.com/2012/02/19/health/lives-forever-linked-through-kidney-transplant-chain-124.html">60 Lives, 30 Kidneys, All Linked</a></p>
Company Unveils DNA Sequencing Device Meant to Be Portable, Disposable and Cheap
2012-02-20T02:44:11+00:00
http://simplystats.github.io/2012/02/20/company-unveils-dna-sequencing-device-meant-to-be
<p><a href="http://www.nytimes.com/2012/02/18/health/oxford-nanopore-unveils-tiny-dna-sequencing-device.html">Company Unveils DNA Sequencing Device Meant to Be Portable, Disposable and Cheap</a></p>
How Companies Learn Your Secrets
2012-02-16T19:30:00+00:00
http://simplystats.github.io/2012/02/16/how-companies-learn-your-secrets
<p><a href="http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html">How Companies Learn Your Secrets</a></p>
I.B.M.: Big Data, Bigger Patterns
2012-02-16T12:40:00+00:00
http://simplystats.github.io/2012/02/16/i-b-m-big-data-bigger-patterns
<p><a href="http://bits.blogs.nytimes.com/2012/02/15/i-b-m-big-data-bigger-patterns/">I.B.M.: Big Data, Bigger Patterns</a></p>
A Flat Budget for NIH in 2013 - ScienceInsider
2012-02-15T19:44:33+00:00
http://simplystats.github.io/2012/02/15/a-flat-budget-for-nih-in-2013-scienceinsider
<p><a href="http://news.sciencemag.org/scienceinsider/2012/02/a-flat-budget-for-nih-in-2013.html#.TzwKpXDLiUA.tumblr">A Flat Budget for NIH in 2013 - ScienceInsider</a></p>
Harvard's Stat 110 is now a course on iTunes
2012-02-15T14:02:06+00:00
http://simplystats.github.io/2012/02/15/harvards-stat-110-is-now-a-course-on-itunes
<p>Back in January we interviewed <a href="http://simplystatistics.tumblr.com/post/16170052064/interview-with-joe-blitzstein" target="_blank">Joe Blitzstein</a> and pointed out that he made his lectures freely available on iTunes. Now it is a course on <a href="http://itunes.apple.com/us/course/statistics-110-probability/id502492375" target="_blank">iTunes</a> and the format has been upgraded to work better with iPhones and iPads. Enjoy! </p>
Mathematicians Organize Boycott of a Publisher
2012-02-14T20:21:03+00:00
http://simplystats.github.io/2012/02/14/mathematicians-organize-boycott-of-a-publisher
<p><a href="http://www.nytimes.com/2012/02/14/science/researchers-boycott-elsevier-journal-publisher.html">Mathematicians Organize Boycott of a Publisher</a></p>
Mortimer Spiegelman Award: Call for Nominations. Deadline is April 1, 2012
2012-02-14T14:02:05+00:00
http://simplystats.github.io/2012/02/14/mortimer-spiegelman-award-call-for-nominations
<p>The Statistics Section of the American Public Health Association invites nominations for the 2012 Mortimer Spiegelman Award honoring a statistician aged 40 or younger who has made outstanding contributions to health statistics, especially public health statistics.</p>
<p>The award was established in 1970 and is presented annually at the APHA meeting. The award serves three purposes: to honor the outstanding achievements of both the recipient and Spiegelman, to encourage further involvement in public health of the finest young statisticians, and to increase awareness of APHA and the Statistics Section in the academic statistical community. More details about the award, including the list of past recipients and more information about the Statistics Section of APHA, may be found <a href="http://www.apha.org/membergroups/sections/aphasections/stats/about/spiegelman.htm" target="_blank">here</a>.</p>
<p>To be eligible for the 2012 Spiegelman Award, a candidate must have been born in 1972 or later. Please send electronic versions of the nominating letter and the candidate’s CV to the 2012 Spiegelman Award Committee Chair, Rafael A. Irizarry (<a href="mailto:rafa@jhu.edu" target="_blank">rafa@jhu.edu</a>).</p>
<p>Please state in the nominating letter the candidate’s birthday. The nominator should include one or two paragraphs in the nominating letter that describe how the nominee’s contributions relate to public health concerns. A maximum of three supporting letters per nomination can be provided. Nominations for the 2012 Award must be submitted by April 1, 2012.</p>
The Duke Clinical Trials Saga: What Really Happened
2012-02-13T20:07:38+00:00
http://simplystats.github.io/2012/02/13/the-duke-clinical-trials-saga-what-really-happened
<p><a href="http://videolectures.net/cancerbioinformatics2010_baggerly_irrh/">The Duke Clinical Trials Saga: What Really Happened</a></p>
Duke Clinical Trials Saga On 60 Minutes First
2012-02-13T14:02:06+00:00
http://simplystats.github.io/2012/02/13/duke-clinical-trials-saga-on-60-minutes-first
<p>Duke clinical trials saga on 60 Minutes. First, the back-to-back shot of Keith and Kevin is priceless. Second, I’ve never seen a cleaner desk in my life.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://cnettv.cnet.com/">http://cnettv.cnet.com/</a>)
</div>
At MSNBC, a Professor as TV Host
2012-02-13T12:22:00+00:00
http://simplystats.github.io/2012/02/13/at-msnbc-a-professor-as-tv-host
<p><a href="http://www.nytimes.com/2012/02/13/business/media/host-of-msnbcs-melissa-harris-perry-is-a-professor.html">At MSNBC, a Professor as TV Host</a></p>
Sunday Data/Statistics Link Roundup (2/12)
2012-02-12T14:28:21+00:00
http://simplystats.github.io/2012/02/12/sunday-data-statistics-link-roundup-2-12
<ol>
<li>An awesome alternative to <a href="http://mbostock.github.com/d3/" target="_blank">D3.js</a> - R’s <a href="http://www.omegahat.org/SVGAnnotation/" target="_blank">svgAnnotation package</a>. Here’s the paper in <a href="http://www.jstatsoft.org/v46/i01" target="_blank">JSS</a>. I feel like this is one step away from gaining broad use in the statistics community - it still feels a little complicated building the graphics, but there is plenty of flexibility there. I feel like a great project for a student at any level would be writing some easy wrapper functions for these functions. </li>
<li>How to <a href="http://rwiki.sciviews.org/doku.php?id=getting-started:installation:android" target="_blank">run R</a> on your Android device. This is very cool - can’t wait to start running simulations on my Nexus S.</li>
<li>Interactive <a href="http://www.jasondavies.com/wordcloud/" target="_blank">word clouds</a> via John C. and why word clouds <a href="http://www.niemanlab.org/2011/10/word-clouds-considered-harmful/" target="_blank">may be dangerous</a> via Jason D. </li>
<li>Trends in <a href="http://simplystatistics.tumblr.com/post/11237403492/apis" target="_blank">APIs</a> - there are <a href="http://techcrunch.com/2012/02/10/2011-api-trends-government-apis-quintuple-facebook-google-twitter-most-popular/?icid=tc_home_art" target="_blank">more of them</a>! Go get your free data. </li>
<li>A <a href="http://gking.harvard.edu/files/paperspub.pdf" target="_blank">really interesting paper </a>by Gary King on how to get a paper by exactly replicating, then building on or discussing, the results of a previous publication. </li>
<li><a href="http://simplystatistics.tumblr.com/post/10686092687/25-minute-seminars" target="_blank">25 minute seminars</a> - I love this post by Rafa, probably because my attention span is so short. But I think 25-30 minute talks are optimal for me to learn something, but not start to zone out…</li>
</ol>
The Age of Big Data
2012-02-12T04:17:22+00:00
http://simplystats.github.io/2012/02/12/the-age-of-big-data
<p><a href="http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html">The Age of Big Data</a></p>
Peter Thiel on Peer Review/Science
2012-02-11T23:07:07+00:00
http://simplystats.github.io/2012/02/11/peter-thiel-on-peer-review-science
<p>Peter Thiel gives his take on science funding/peer review:</p>
<blockquote>
<p>My libertarian views are qualified because I do think things worked better in the 1950s and 60s, but it’s an interesting question as to what went wrong with DARPA. It’s not like it has been defunded, so why has DARPA been doing so much less for the economy than it did forty or fifty years ago? Parts of it have become politicized. You can’t just write checks to the thirty smartest scientists in the United States. Instead there are bureaucratic processes, and I think the politicization of science—where a lot of scientists have to write grant applications, be subject to peer review, and have to get all these people to buy in—all this has been toxic, because the skills that make a great scientist and the skills that make a great politician are radically different. There are very few people who are both great scientists and great politicians. So a conservative account of what happened with science in the 20<sup>th</sup> century is that we had a decentralized, non-governmental approach all the way through the 1930s and early 1940s. At that point, the government could accelerate and push things tremendously, but only at the price of politicizing it over a series of decades. Today we have a hundred times more scientists than we did in 1920, but their productivity per capita is less than it used to be.</p>
</blockquote>
<p>Thiel has a history of making <a href="http://techcrunch.com/2011/04/10/peter-thiel-were-in-a-bubble-and-its-not-the-internet-its-higher-education/" target="_blank">controversial comments</a>, and I don’t always agree with him, but I think that his point about the politicization of the grant process is interesting. </p>
Data says Jeremy Lin is for real
2012-02-11T21:53:59+00:00
http://simplystats.github.io/2012/02/11/data-says-jeremy-lin-is-for-real
<p>Nate Silver <a href="http://fivethirtyeight.blogs.nytimes.com/2012/02/11/jeremy-lin-is-no-fluke/" target="_blank">makes a table</a> of all NBA players that have had four games in a row with 20+ points, 6+ assists, 50%+ shooting. The list is short (and it doesn’t include <a href="http://simplystatistics.tumblr.com/post/16817771482/this-graph-makes-me-think-kobe-is-not-that-good-he" target="_blank">Kobe</a>). </p>
Duke Saga on 60 Minutes this Sunday
2012-02-10T14:00:05+00:00
http://simplystats.github.io/2012/02/10/duke-saga-on-60-minutes-this-sunday
<p>This Sunday February 12, the news magazine <a href="http://www.cbsnews.com/sections/60minutes/main3415.shtml" target="_blank">60 Minutes</a> will have a feature on the <a href="http://simplystatistics.tumblr.com/post/10068195751/the-duke-saga" target="_blank">Duke Clinical Trials saga</a>. Will Dr. Potti himself make an appearance? This is from the 60 Minutes web site:</p>
<blockquote>
<p><strong>Deception at Duke -</strong><span> Scott Pelley reports on a Duke University oncologist whose supervisor says he manipulated the data in his study of a breakthrough cancer therapy. Kyra Darnton is the producer.</span></p>
</blockquote>
<p><span>The word on the street is that the segment will also feature statisticians Keith Baggerly and Kevin Coombes of the M.D. Anderson Cancer Center.</span></p>
<p><span>And that makes two posts this week about people at M.D. Anderson. What’s going on here?</span></p>
An example of how sending a paper to a statistics journal can get you scooped
2012-02-09T14:02:00+00:00
http://simplystats.github.io/2012/02/09/an-example-of-how-sending-a-paper-to-a-statistics
<p>In a <a href="http://simplystatistics.tumblr.com/post/14218411483/dear-editors-associate-editors-referees-please-reject" target="_blank">previous post</a> I complained about statistics journals taking way too long rejecting papers. Today I am complaining because even when everything goes right —better than <strike>above</strike> average review time (for statistics), useful and insightful comments from reviewers— we can come out losing.</p>
<p>In May 2011 we submitted a paper on <a href="http://biostatistics.oxfordjournals.org/content/early/2012/01/24/biostatistics.kxr054.long" target="_blank">removing GC bias from RNAseq</a> data to Biostatistics. It was published on December 27. However, we were scooped by <a href="http://www.biomedcentral.com/1471-2105/12/480/abstract" target="_blank">this BMC Bioinformatics paper</a> published ten days earlier despite being submitted three months later and accepted 11 days after ours. The competing paper has already earned the “highly accessed” distinction. The two papers, both statistics papers, are very similar, yet I am afraid more people will read the one that was finished second but published first.</p>
<p>Note that Biostatistics is one of the fastest stat journals out there. I don’t blame the journal at all here. We statisticians have to change our culture when it comes to reviews.</p>
<p><img height="375" src="http://rafalab.jhsph.edu/simplystats/scoop.png" width="500" /></p>
Statisticians and Clinicians: Collaborations Based on Mutual Respect
2012-02-08T17:09:34+00:00
http://simplystats.github.io/2012/02/08/statisticians-and-clinicians-collaborations-based-on
<p><a href="http://magazine.amstat.org/blog/2012/02/01/collaborationpolic/">Statisticians and Clinicians: Collaborations Based on Mutual Respect</a></p>
DealBook: Illumina Formally Rejects Roche's Takeover Bid
2012-02-08T02:11:22+00:00
http://simplystats.github.io/2012/02/08/dealbook-illumina-formally-rejects-roches-takeover
<p><a href="http://dealbook.nytimes.com/2012/02/07/illumina-formally-rejects-roches-takeover-bid/">DealBook: Illumina Formally Rejects Roche’s Takeover Bid</a></p>
Wolfram, a Search Engine, Finds Answers Within Itself
2012-02-07T02:12:56+00:00
http://simplystats.github.io/2012/02/07/wolfram-a-search-engine-finds-answers-within-itself
<p><a href="http://www.nytimes.com/2012/02/07/technology/wolfram-a-search-engine-finds-answers-within-itself.html">Wolfram, a Search Engine, Finds Answers Within Itself</a></p>
An R script for estimating future inflation via the Treasury market
2012-02-06T13:34:04+00:00
http://simplystats.github.io/2012/02/06/an-r-script-for-estimating-future-inflation-via-the
<p>One factor that is critical for any financial planning is estimating what future inflation will be. For example, if you’re saving money in an instrument that gains 3% per year, and inflation is estimated to be 4% per year, well then you’re losing money in real terms.</p>
<p>There are a variety of ways to estimate the rate of future inflation. You could, for example, use past rates as an estimate of future rates. However, the Treasury market provides an estimate of what the market thinks annual inflation will be over the next 5, 10, 20, and 30 years.</p>
<p>Basically, the Treasury issues two types of securities: nominal securities that pay a nominal interest rate (fixed percentage of your principal), and inflation-indexed securities (TIPS) that pay an interest rate that is applied to your principal adjusted by the consumer price index (CPI). As the CPI goes up and down, the payments for inflation-indexed securities go up and down (although they can’t go negative so you always get your principal back). As these securities trade throughout the day, their respective market-based interest rates go up and down continuously. The difference between the nominal interest rate and the real interest rate for a fixed period of time (5, 10, 20, or 30 years) can be used as a rough estimate of annual inflation over that time period.</p>
<p>The Treasury publishes data for its auctions everyday on the yield for both nominal and inflation-indexed securities. There is an XML feed for <a href="http://www.treasury.gov/resource-center/data-chart-center/interest-rates/Datasets/yield.xml" target="_blank">nominal yields</a> and for <a href="http://www.treasury.gov/resource-center/data-chart-center/interest-rates/Datasets/real_yield.xml" target="_blank">real yields</a>. Using these, I used the XML R package and wrote an <a href="http://www.biostat.jhsph.edu/~rpeng/inflation.R" target="_blank">R script to scrape the data and calculate the inflation estimate</a>. </p>
<p>As of today, the market’s estimate of annual inflation is:</p>
<pre>5-year Inflation: 1.88%
10-year Inflation: 2.18%
30-year Inflation: 2.38%
</pre>
<p>Basically, you just call the ‘inflation()’ function with no arguments and it produces the above print out.</p>
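<p>If you just want the arithmetic behind the breakeven estimate, here is a minimal sketch. This is not the inflation() function above; the yields below are made-up numbers, chosen only so that the differences match the estimates printed above:</p>
<pre>## Breakeven inflation = nominal Treasury yield - TIPS (real) yield, in percent
nominal <- c("5-year" = 0.89, "10-year" = 1.97, "30-year" = 3.11)
real    <- c("5-year" = -0.99, "10-year" = -0.21, "30-year" = 0.73)

round(nominal - real, 2)
##  5-year 10-year 30-year
##    1.88    2.18    2.38
</pre>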
Sunday Data/Statistics Link Roundup (2/5)
2012-02-06T00:06:54+00:00
http://simplystats.github.io/2012/02/06/sunday-data-statistics-link-roundup-2-5
<ol>
<li><a href="http://webdemo.visionobjects.com/equation.html?locale=default" target="_blank">Cool app</a>, you can write out an equation on the screen and it translates the equation to latex. Via Andrew G.</li>
<li>Yet another <a href="http://www.12devsofxmas.co.uk/2012/01/data-visualisation/" target="_blank">D3 tutorial</a>. Stay tuned for some cool stuff on this front here at Simply Stats in the near future. Via Vishal.</li>
<li><a href="http://simplystatistics.tumblr.com/post/14318537784/in-greece-a-statistician-faces-life-in-prison-for" target="_blank">Our favorite Greek statistician</a> in the news <a href="http://www.miamiherald.com/2012/02/01/2618716/blaming-the-messenger-greeces.html" target="_blank">again</a>. </li>
<li>How measurement of academic output <a href="http://www.int-res.com/articles/esep2008/8/e008p009.pdf" target="_blank">harms science</a>. Related: <a href="http://simplystatistics.tumblr.com/post/11059923583/submitting-scientific-papers-is-too-time-consuming" target="_blank">is submitting scientific papers too time consuming</a>? Stay tuned for more on this topic this week. Via Michael E. </li>
<li>One from the archives: <a href="http://simplystatistics.tumblr.com/post/10013120929/data-visualization-and-art" target="_blank">Data visualization and art</a>. </li>
</ol>
Why don't we hear more about Adrian Dantley on ESPN? This graph makes me think he was as good an offensive player as Michael Jordan.
2012-02-03T14:02:00+00:00
http://simplystats.github.io/2012/02/03/why-dont-we-hear-more-about-adrian-dantley-on-espn
<p>In <a href="http://simplystatistics.tumblr.com/post/16817771482/this-graph-makes-me-think-kobe-is-not-that-good-he" target="_blank">my last post</a> I complained about efficiency not being discussed enough by NBA announcers and commentators. I pointed out that some of the best scorers have relatively low FG% or <a href="http://www.ehow.com/how_2092829_calculate-true-shooting-percentage-basketball.html" target="_blank">TS%</a>. However, via the comments it was pointed out that top scorers need to take more difficult shots and thus are expected to have lower efficiency. The plot below (made with this <a href="http://rafalab.jhsph.edu/simplystats/nba.R" target="_blank">R script</a>) seems to confirm this (click image to enlarge). The dashed line is from a regression and the colors represent guards (green), forwards (orange) and centers (purple).</p>
<p><a href="http://rafalab.jhsph.edu/simplystats/kobe3.png" target="_blank"><img height="358" src="http://rafalab.jhsph.edu/simplystats/kobe3.png" width="500" /></a></p>
<p>Among this group TS% does trend down with points per game and centers tend to have higher TS%. Forwards and guards are not very different. However, the plot confirms that some of the supposed all time greats are more ball hogs than good scorers. </p>
<p>A couple of further observations. First, Adrian Dantley was way better than I thought. Why isn’t he more famous? Second, Kobe is no Jordan. Also note Jordan played several seasons past his prime which lowered his career averages. So I added points for five of these players using only data from their prime years (ages 24-29). Here Jordan really stands out. But so does Dantley! </p>
<p><a href="http://rafalab.jhsph.edu/simplystats/kobe4.png" target="_blank"><img height="358" src="http://rafalab.jhsph.edu/simplystats/kobe4.png" width="500" /></a></p>
<p>pd - Note that these plots say nothing about defense, rebounding, or passing. This <a href="http://skepticalsports.com/?page_id=1222" target="_blank">in-depth analysis</a> makes a convincing argument that Dennis Rodman is one of the most valuable players of all time.</p>
Cleveland's (?) 2001 plan for redefining statistics as "data science"
2012-02-02T12:36:58+00:00
http://simplystats.github.io/2012/02/02/clevelands-2001-plan-for-redefining-statistics-as
<p><a href="http://cm.bell-labs.com/cm/ms/departments/sia/doc/datascience.pdf" target="_blank">This plan</a> has been making the rounds on Twitter and is being attributed to William Cleveland in 2001 (thanks to Kasper for the link). I’m not sure of the provenance of the document but it has some really interesting ideas and is worth reading in its entirety. I actually think that many Biostatistics departments follow the proposed distribution of effort pretty closely. </p>
<p>One of the most interesting sections is the discussion of computing (emphasis mine): </p>
<blockquote>
<p>Data analysis projects today rely on databases, computer and network hardware, and computer and network software. A collection of models and methods for data analysis will be used only if the collection is implemented in a computing environment that makes the models and methods sufficiently efficient to use. In choosing competing models and methods, analysts will trade effectiveness for efficiency of use.</p>
<p>…..</p>
<p><strong>This suggests that statisticians should look to computing for knowledge today, just as data science looked to mathematics in the past.</strong></p>
</blockquote>
<p>I also found the theory section worth a read and figure it will definitely lead to some discussion: </p>
<blockquote>
<p>Mathematics is an important knowledge base for theory. It is far too important to take for granted by requiring the same body of mathematics for all. Students should study mathematics on an as-needed basis.</p>
<p>….</p>
<p>Not all theory is mathematical. In fact, the most fundamental theories of data science are distinctly nonmathematical. For example, the fundamentals of the Bayesian theory of inductive inference involve nonmathematical ideas about combining information from the data and information external to the data. Basic ideas are conveniently expressed by simple mathematical expressions, but mathematics is surely not at issue. </p>
</blockquote>
Evidence-based Music
2012-02-01T14:02:05+00:00
http://simplystats.github.io/2012/02/01/evidence-based-music
<p>There was recently a fascinating article published in PNAS that <a href="http://www.ncbi.nlm.nih.gov/pubmed/22215592" target="_blank">compared the sound quality of different types of violins</a>. In this study, researchers assembled a collection of six violins, three of which were made by Stradivari and Guarneri del Gesu and three made by modern luthiers (i.e. 20th century). The combined value of the “old” violins was $10 million, about 100 times greater than the combined value of the “new” violins. Also, they note:</p>
<blockquote>
<p><span>Numbers of subjects and instruments were small because it is difficult to persuade the owners of fragile, enormously valuable old violins to release them for extended periods into the hands of blindfolded strangers.</span></p>
</blockquote>
<p>Yeah, I’d say so.</p>
<p>They then got 21 professional violinists to try them all out wearing glasses to obscure their vision so they couldn’t see the violins. The researchers were also blinded to the type of violin as the study was being conducted.</p>
<p>The conclusions were striking:</p>
<blockquote>
<p>We found that (i) the most-preferred violin was new; (ii) the least-preferred was by Stradivari; (iii) there was scant correlation between an instrument’s age and monetary value and its perceived quality; and (iv) most players seemed unable to tell whether their most-preferred instrument was new or old.</p>
</blockquote>
<p><span>First, I’m glad the researchers got people to actually play the instruments. I don’t think it’s sufficient to just listen to some recordings because usually the recordings are by different performers and the quality of the recording itself may vary quite a bit. Second, the study was conducted in a hotel room for its “dry acoustics”, but I think changing the venue might have changed the results. Third, even though the authors don’t declare any specific financial conflict of interest, it’s worth noting that the second author is a violinmaker who could theoretically benefit if people decide they no longer need to focus on old Italian violins.</span></p>
<p><span>I was surprised, but not that surprised, at the results. As a lifelong violinist, I had always wondered whether the Strads and the Guarneris were that much better. I once played on a Guarneri (for about 30 seconds) and I think it’s fair to say that it was incredible. But I’ve also seen some amazing violins made by guys in Brooklyn and New Jersey. I’d always heard that Strads have a darker more mellow sound, which I suppose is nice, but I think these days people may prefer a brighter and bigger sound, especially for those larger modern-day concert halls. </span></p>
<p><span>I hope that this study and others like it will get people to focus on which violins sound good rather than where they came from. I’m glad to see the use of data pose a challenge to another long-standing convention.</span></p>
This graph makes me think Kobe is not that good, he just shoots a lot
2012-01-31T14:02:00+00:00
http://simplystats.github.io/2012/01/31/this-graph-makes-me-think-kobe-is-not-that-good-he
<p>I find it surprising that NBA commentators rarely talk about field goal percentage. Everybody knows that the more you shoot the more you score. But players that score a lot are admired without consideration of their FG%. Of course having a high FG% is not necessarily admirable as many players only take easy shots, while top-scorers need to take difficult ones. Regardless, missing is undesirable and players that miss more than usual are not criticized enough. Iverson, for example, had a lowly career FG% of 43% yet he regularly made the all-star team. But I am not surprised he never won an NBA championship: it’s hard to win when your top scorer misses so often.</p>
<p><img height="450" src="http://rafalab.jhsph.edu/simplystats/kobe.png" width="450" /></p>
<p>Experts consider Kobe to be one of the all time greats and compare him to Jordan. They never mention that he is consistently among league leaders in missed shots. So far this year, Kobe has missed a whopping 279 times for a league-leading 13.3 misses per game. In contrast, Lebron has missed 8.8 per game and has scored about the same per game. The plot above (made with this <a href="http://rafalab.jhsph.edu/simplystats/nba.R" target="_blank">R script</a>) shows career FG% for players considered to be superstars, top-scorers, and that have won multiple championships (red lines are 1st and 3rd quartiles). I also include Gasol, Lebron, Wade, and Dominique. Note that Kobe has the worst FG% in this group. So how does he win 5 championships? Well, perhaps Shaq and later Gasol made up for his misses. Note that the first year Kobe played without Shaq, the Lakers did not make the playoffs. Also, during Kobe’s career the Lakers’ record has been <a href="http://slumz.boxden.com/f16/lakers-cavs-records-without-kobe-lebron-1370997/" target="_blank">similar with and without him</a>. Experts may compare Kobe to Jordan, but perhaps we should be comparing him to Dominique.</p>
<p><strong>Update: </strong>Please see <span>Brunsloe87’s comment for a much better analysis than mine. He/she points out that it’s too simplistic to look at FG%. Instead we should look at something closer to points scored per shot taken. This rewards players, like Kobe, that draw many fouls and have a high FT%. There is a weighted statistic called true shooting % (TS%) that tries to summarize this, and below I include a plot of TS% for the same players. Kobe is no Jordan but he is not as bad as Dominique either. He is somewhere in the middle. </span></p>
<p><span><img height="500" src="http://rafalab.jhsph.edu/simplystats/kobe2.png" width="500" /></span></p>
<p>The comment also points out that Magic didn’t shoot as much as other superstars so it’s unfair to include him. A better plot would plot TS% versus shots taken (e.g. FGA+FTA/2) but I’ll let someone with more time make that one. Anyways, this plot explains why the early 80s Lakers (Magic+Kareem) were so good.</p>
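<p>For readers who want to compute it themselves, here is a tiny sketch of the TS% calculation using the commonly cited formula. This is not the nba.R script used for the plots, and the season totals below are made up:</p>
<pre>## True shooting percentage: points per shooting possession, where free-throw
## trips are counted as roughly 0.44 field-goal attempts
ts_pct <- function(pts, fga, fta) pts / (2 * (fga + 0.44 * fta))

ts_pct(pts = 2000, fga = 1500, fta = 600)   # about 0.57
</pre>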
Why in-person education isn't dead yet...but a statistician could finish it off
2012-01-30T14:02:05+00:00
http://simplystats.github.io/2012/01/30/why-in-person-education-isnt-dead-yet-but-a
<p>A growing trend in education is to put lectures online, for free. The <a href="http://www.khanacademy.org/" target="_blank">Khan Academy</a>, Stanford’s recent <a href="http://www.nytimes.com/2011/08/16/science/16stanford.html?_r=2" target="_blank">AI course</a>, and Gary King’s new <a href="http://projects.iq.harvard.edu/gov2001" target="_blank">quantitative government course</a> at Harvard are three of the more prominent examples. This new pedagogical format is more democratic, free, and helps people learn at their own pace. It has led some, including us here at Simply Statistics, to suggest that the future of graduate education lies in <a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">online courses</a>. Or to forecast the <a href="http://simplystatistics.tumblr.com/post/16474506346/the-end-of-in-class-lectures-is-closer-than-i-thought" target="_blank">end of in-class lectures</a>. </p>
<p>All this excitement led John Cook to ask, “<a href="http://www.johndcook.com/blog/2012/01/24/what-do-colleges-sell/" target="_blank">What do colleges sell?</a>”. The answers he suggested were: (1) real credentials, like a degree, (2) motivation to ensure you did the work, and (3) feedback to tell you how you are doing. As John suggests, online lectures really only target motivated and self-starting learners. For graduate students, this may work (maybe), but for the vast majority of undergrads or high-school students, self-guided learning won’t work due to a lack of motivation. </p>
<p>I would suggest that until the feedback, assessment, and credentialing problems have been solved, online lectures are still more edu-tainment than education. </p>
<p>Of these problems, I think we are closest to solving the feedback problem with online quizzes and tests to go with online lectures. What we haven’t solved are assessment and credentialing. The reason is that there is no good system for verifying a person taking a quiz/test online is who they say they are. This issue has two consequences: (1) it is difficult to require that a person do online quizzes/tests like we do with in-class quizzes/tests and (2) it is difficult to believe credentials given to people who take courses online. </p>
<p>What does this have to do with statistics? Well, what we need is a <strong>C</strong>ompletely <strong>A</strong>utomated <strong>O</strong>nline <strong>T</strong>est for <strong>S</strong>tudent <strong>I</strong>dentity (COATSI). People will notice a similarity between my acronym and the acronym for <a href="http://en.wikipedia.org/wiki/CAPTCHA" target="_blank">CAPTCHAs</a>, the simple online Turing tests used to prove that you are a human and not a computer. </p>
<p>The properties of a COATSI need to be:</p>
<ol>
<li>Completely automated</li>
<li>Provide tests that verify the identity of the student being assessed</li>
<li>Can be used throughout an online quiz/test/assessment</li>
<li>Are simple and easy to solve</li>
</ol>
<p>I can’t think of a deterministic system that can be used for this purpose. My suspicion is that a COATSI will need to be statistical. For example, one idea is to have people sign in with Facebook, then at random intervals while they are solving problems, they have to identify their friends by name. If they do this quickly/consistently enough, they are verified as the person taking the test. </p>
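<p>To make the Facebook idea concrete, here is a minimal sketch of the kind of calculation involved. The number of checks, the pass threshold, and the accuracy rates for a real student versus an impostor are all made-up assumptions, not estimates.</p>
<pre><code>## Suppose a test embeds k friend-identification checks and we "verify"
## anyone who gets at least m of them right.
k <- 10; m <- 8
p.true <- 0.97   # assumed accuracy of the real student naming their friends
p.fake <- 0.30   # assumed accuracy of an impostor guessing from profiles

## Binomial tail probabilities: chance each type of test-taker is verified
pass.true <- pbinom(m - 1, size = k, prob = p.true, lower.tail = FALSE)
pass.fake <- pbinom(m - 1, size = k, prob = p.fake, lower.tail = FALSE)
c(verified.real = pass.true, verified.impostor = pass.fake)
</code></pre>
<p>With these made-up rates the real student is verified essentially every time while the impostor almost never is; the statistical work is in estimating those rates and choosing k and m so the checks stay quick for honest test-takers.</p>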
<p>I don’t have a good solution to this problem yet; I’d love to hear more suggestions. I also think this seems like a potentially hugely important and very challenging problem for a motivated grad student or postdoc….</p>
Sunday data/statistics link roundup (1/29)
2012-01-29T17:52:08+00:00
http://simplystats.github.io/2012/01/29/sunday-data-statistics-link-roundup-1-29
<ol>
<li>A really nice <a href="http://alignedleft.com/tutorials/d3/" target="_blank">D3 tutorial</a>. I’m 100% on board with D3; if it could export graphics as PDFs, I think it would be the best visualization tool out there.</li>
<li>A <a href="http://populationaction.org/Articles/Whats_Your_Number/" target="_blank">personalized calculator</a> that tells you which number (of the 7 billion or so people alive) you are, based on your birthday. I’m person 4,590,743,884. Makes me feel so special….</li>
<li>An old post of ours, on <a href="http://simplystatistics.tumblr.com/post/10555655037/dongle-communism" target="_blank">dongle communism</a>. One of my favorite posts, it came out before we had much traffic but deserves more attention.</li>
<li>This isn’t statistics/data related but too good to pass up. From the Bones television show, <a href="http://www.myvidster.com/video/4132893/_A_new_low_for_TV_science_Malware_Fractals_in_Bones_ampbull_videosiftcom" target="_blank">malware fractals shaved into a bone</a>. I love TV science. Thanks to Dr. J for the link.</li>
<li><a href="http://bits.blogs.nytimes.com/2012/01/26/what-are-the-odds-that-stats-would-get-this-popular/" target="_blank">Stats are popular</a>…</li>
</ol>
This simple bar graph clearly demonstrates that the US can easily increase research funding
2012-01-27T15:09:00+00:00
http://simplystats.github.io/2012/01/27/this-simple-bar-graph-clearly-demonstrates-that-the-us
<p>Some NIH R01 paylines are <a href="http://www.einstein.yu.edu/ogs/page.aspx?id=21983" target="_blank">down to 10%</a>. This means only 10% of grants are being funded. The plot below highlights that all we need is a tiny little slice from Defense, Medicare, Medicaid or Social Security to bring that back up to 20%. The plot was taken from Alex Tabarrok’s <a href="http://www.theatlantic.com/business/archive/2012/01/the-innovation-nation-vs-the-warfare-welfare-state/251984/" target="_blank">great article</a> in the Atlantic.<img height="231" src="http://cdn.theatlantic.com/static/mt/assets/business/innovation%20welfarewarfare.png" width="377" /></p>
<p><strong>Update</strong>: The y-axis unit is billions of US dollars.</p>
When should statistics papers be published in Science and Nature?
2012-01-26T14:00:05+00:00
http://simplystats.github.io/2012/01/26/when-should-statistics-papers-be-published-in-science
<p>Like many statisticians, I was amped to see a <a href="http://www.sciencemag.org/content/334/6062/1518.abstract" target="_blank">statistics paper</a> appear in Science. Given the impact that statistics has on the scientific community, it is a shame that more statistics papers don’t appear in the glossy journals like <em>Science</em> or <em>Nature</em>. As I pointed out in <a href="http://simplystatistics.tumblr.com/post/15402808730/p-values-and-hypothesis-testing-get-a-bad-rap-but-we" target="_blank">the previous post</a>, if the paper that introduced the p-value was cited every time this statistic was used, the paper would have over 3 million citations!</p>
<p>But a couple of our readers* have pointed to a <a href="http://www-stat.stanford.edu/~tibs/reshef/comment.pdf" target="_blank">response</a> to the MIC paper published by Noah Simon and Rob Tibshirani. Simon and Tibshirani show that the MIC statistic is underpowered compared to another recently published statistic for the same purpose that came out in 2009 in the Annals of Applied Statistics. A nice <a href="http://scientificbsides.wordpress.com/2012/01/23/detecting-novel-associations-in-large-data-sets-let-the-giants-battle-it-out/" target="_blank">summary</a> of the discussion is provided by Florian over at his blog. </p>
<p><em>If the AoAS statistic came out first (by 2 years) and is more powerful (according to simulation), should the MIC statistic have appeared in Science? </em></p>
<p>The whole discussion reminds me of a recent blog post suggesting that journals need to pick one between <a href="http://spsptalks.wordpress.com/2011/12/31/groundbreaking-or-definitive-journals-need-to-pick-one/" target="_blank">groundbreaking and definitive</a>. The post points out that groundbreaking and definitive are in many ways in opposition to each other. </p>
<p>Again, I’d suggest that statistics papers get short shrift in the glossy journals and I would like to see more. And the MIC statistic is certainly groundbreaking, but it isn’t clear that it is definitive. </p>
<p>As a comparison, a slightly different story played out with another recent high-impact statistical method, the false discovery rate (FDR). The original papers were published in <a href="http://www.jstor.org/pss/2346101" target="_blank">statistics</a> <a href="http://www.genomine.org/papers/directfdr.pdf" target="_blank">journals</a>. Then when it was clear that the idea was going to be big, a more general-audience-friendly summary was published in <a href="http://www.pnas.org/content/100/16/9440.full" target="_blank">PNAS</a> (not <em>Science</em> or <em>Nature</em> but definitely glossy). This might be a better way for the glossy journals to know what is going to be a major development in statistics versus an exciting - but potentially less definitive - method. </p>
<p>* Florian M. and John S.</p>
The end of in-class lectures is closer than I thought
2012-01-25T19:18:28+00:00
http://simplystats.github.io/2012/01/25/the-end-of-in-class-lectures-is-closer-than-i-thought
<p>Our previous post on the <a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">future of (statistics) graduate education</a> was motivated by the Stanford <a href="http://www.nytimes.com/2011/08/16/science/16stanford.html?_r=1" target="_blank">online course</a> on Artificial Intelligence. Here is <a href="http://blogs.reuters.com/felix-salmon/2012/01/23/udacity-and-the-future-of-online-universities/" target="_blank">an update</a> on the class that had 160,000 people enroll. Some highlights:</p>
<ol>
<li>Sebastian Thrun has given up his tenure at Stanford and has started a new online university called <a href="http://www.udacity.com/" target="_blank">Udacity</a>.</li>
<li>248 students got a perfect score: they never got a single question wrong over the entire course of the class. All 248 took the course online; not one was enrolled at Stanford.</li>
<li>Students from Afghanistan completed the course. What do you think are the chances these students could afford Stanford’s tuition?</li>
<li>There were more students from Lithuania alone than there are students at Stanford altogether.</li>
</ol>
<p>The <a href="http://programming-puzzler.blogspot.com/2011/11/review-of-2011-free-stanford-online.html" target="_blank">class evaluations were not perfect</a>. Here is <a href="http://pennyhacks.com/2011/12/28/stanford-free-classes-a-review-from-a-stanford-student/" target="_blank">a particularly harsh one</a>. They also need to figure out how to evaluate online students. But I am sure there are plenty of people working on that problem. Here is an <a href="http://chronicle.com/article/MIT-Mints-a-Valuable-New-Form/130410/?sid=at" target="_blank">example</a>. Regardless, this was the first such experiment and for a first try it seems like a huge success to me. As more professors try this, for example Harvard’s Gary King is conducting a <a href="http://projects.iq.harvard.edu/gov2001" target="_blank">similar class </a>in Quantitative Research Methodology, it will become clearer that there is no future for in-class lectures as we know them today.</p>
<p>Thanks to Alex and Jeff for all the links. </p>
A wordcloud comparison of the 2011 and 2012 #SOTU
2012-01-25T04:02:39+00:00
http://simplystats.github.io/2012/01/25/a-wordcloud-comparison-of-the-2011-and-2012-sotu
<p>I wrote a quick (and very dirty) <a href="http://biostat.jhsph.edu/~jleek/code/sotu2011-2012comparison.R" target="_blank">R script</a> for creating a comparison cloud and a commonality cloud for President Obama’s 2011 and 2012 State of the Union speeches. The cloud on the left shows words that have different frequencies between the two speeches and the cloud on the right shows the words in common between the two speeches. <a href="http://biostat.jhsph.edu/~jleek/code/sotu-wordcloud.png" target="_blank">Here</a> is a higher resolution version.</p>
<p><img height="345" src="http://biostat.jhsph.edu/~jleek/code/sotu-wordcloud.png" width="600" /></p>
<p>The focus on jobs hasn’t changed much. But it is interesting how the 2012 speech seems to focus more on practical issues (tax, pay, manufacturing, oil) versus more emotional issues in 2011 (future, schools, laughter, success, dream).</p>
<p>The <a href="http://cran.r-project.org/web/packages/wordcloud/index.html" target="_blank">wordcloud</a> R package does all the heavy lifting.</p>
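<p>For anyone who wants to reproduce the idea without digging through my script, here is a minimal sketch of the workflow. The transcript file names are placeholders and the preprocessing choices are just reasonable defaults, so treat it as a starting point rather than the script linked above.</p>
<pre><code>library(tm)         # text preprocessing
library(wordcloud)  # comparison.cloud() and commonality.cloud()

## Read each speech transcript into a single character string (file names assumed)
speeches <- c(paste(readLines("sotu2011.txt"), collapse = " "),
              paste(readLines("sotu2012.txt"), collapse = " "))
corpus <- Corpus(VectorSource(speeches))
tdm <- TermDocumentMatrix(corpus,
                          control = list(tolower = TRUE,
                                         removePunctuation = TRUE,
                                         stopwords = TRUE))
m <- as.matrix(tdm)
colnames(m) <- c("SOTU 2011", "SOTU 2012")

par(mfrow = c(1, 2))
comparison.cloud(m, max.words = 100)   # words used with different frequencies
commonality.cloud(m, max.words = 100)  # words the two speeches share
</code></pre>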
Why statisticians should join and launch startups
2012-01-23T14:00:05+00:00
http://simplystats.github.io/2012/01/23/why-statisticians-should-join-and-launch-startups
<p>The tough economic times we live in, and the potential for big paydays, have made <a href="http://en.wikipedia.org/wiki/The_Social_Network" target="_blank">entrepreneurship cool</a>. From the <a href="http://www.whitehouse.gov/issues/startup-america" target="_blank">venture capitalist-in-chief</a>, to the javascript <a href="http://chats-blog.com/2012/01/08/michael-bloomberg-learning-to-code/" target="_blank">coding mayor of New York</a>, everyone is on board. No surprise there, successful startups lead to job creation which can have a major positive impact on the economy. </p>
<p>The game has been dominated for a long time by the folks over in CS. But the value of many recent startups is either based on, or can be magnified by, good data analysis. Here are a few startups that are based on data/data analysis: </p>
<ol>
<li>The <a href="http://www.climate.com/" target="_blank">Climate Corporation</a> -analyzes climate data to sell farmers weather insurance.</li>
<li><a href="http://flightcaster.com/" target="_blank">Flightcaster</a> - uses public data to predict flight delays</li>
<li><a href="http://quid.com/" target="_blank">Quid</a> - uses data on startups to predict success, among other things.</li>
<li><a href="http://100plus.com/" target="_blank">100plus</a> - personalized health prediction startup, predicting health based on public data</li>
<li><a href="http://www.hipmunk.com/" target="_blank">Hipmunk</a> - The main advantage of this site for travel is better data visualization and an algorithm to show you which flights have the worst “agony”.</li>
</ol>
<p>To launch a startup you need just a couple of things: (1) a good, valuable source of data (there are lots of these on the web) and (2) a good idea about how to analyze them to create something useful. The second step is obviously harder than the first, but the companies above prove you can do it. Then, once it is built, you can outsource/partner with developers - web and otherwise - to implement your idea. If you can build it in R, someone can make it an app. </p>
<p>These are just a few of the startups whose value is entirely derived from data analysis. But companies from LinkedIn, to Bitly, to Amazon, to Walmart are trying to mine the data they are generating to increase value. Data is now being generated at unprecedented scale by computers, cell phones, even <a href="http://www.nest.com/" target="_blank">thermostats</a>! With this onslaught of data, the need for people with analysis skills is becoming incredibly <a href="http://radar.oreilly.com/2011/12/data-science-carrieriq-datasift-twitter.html" target="_blank">acute</a>. </p>
<p>Statisticians, like computer scientists before them, are poised to launch, and make major contributions to, the next generation of startups. </p>
Sunday Data/Statistics Link Roundup (1/21)
2012-01-22T14:00:06+00:00
http://simplystats.github.io/2012/01/22/sunday-data-statistics-link-roundup-1-21
<ol>
<li><a href="http://jermdemo.blogspot.com/2012/01/when-can-we-expect-last-damn-microarray.html" target="_blank">Is the microarray dead</a>? Jeremey Leipzig seems to think that statistical methods for microarrays should be. I’m not convinced, the technology has finally matured to the point we can use it for personalized medicine and we abandon it for the next hot thing? Not to Andrew for the link.</li>
<li>Data from 5 billion webpages available from the <a href="http://www.commoncrawl.org/data/accessing-the-data/" target="_blank">Common Crawl</a>. Want to build your own search tool - or just find out whats on the web? Get your Hadoop on. Nod to Peter S. for the heads up. </li>
<li>Simon and Tibhsirani <a href="http://www-stat.stanford.edu/~tibs/reshef/" target="_blank">criticize</a> the greatly publicized <a href="http://www.sciencemag.org/content/334/6062/1518" target="_blank">MIC statistic</a>. Nod to John S. for the link.</li>
<li>A public/free <a href="http://projects.iq.harvard.edu/gov2001/" target="_blank">statistics class</a> being offered through the IQSS at Harvard. </li>
</ol>
Interview With Joe Blitzstein
2012-01-20T14:00:06+00:00
http://simplystats.github.io/2012/01/20/interview-with-joe-blitzstein
<div class="im">
<strong>Joe Blitzstein</strong>
</div>
<div class="im">
</div>
<div class="im">
<img height="200" src="http://biostat.jhsph.edu/~jleek/Blitzstein3.jpg" width="300" />
</div>
<div class="im">
</div>
<div class="im">
Joe Blitzstein is <a href="http://news.harvard.edu/gazette/story/2011/11/the-lasting-lure-of-logic/" target="_blank">Professor of the Practice in Statistics</a> at Harvard University and co-director of the graduate program. He moved to Harvard after obtaining his Ph.D. with Persi Diaconis at Stanford University. Since joining the faculty at Harvard, he has been immortalized in Youtube prank videos, been awarded a “favorite professor” distinction four times, and performed interesting research on the statistical analysis of social networks. Joe was also the first person to discover our blog on Twitter. You can find more information about him on his <a href="http://www.people.fas.harvard.edu/~blitz/Site/Home.html" target="_blank">personal website</a>. Or check out his Stat 110 class, now available <a href="http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=495213607" target="_blank">from iTunes</a>!
</div>
<div class="im">
</div>
<div class="im">
<strong>Which term applies to you: data scientist/statistician/</strong><strong>analyst?</strong>
</div>
<p>Statistician, but that should and does include working with data! I think statistics at its best interweaves modeling, inference, prediction, computing, exploratory data analysis (including visualization), and mathematical and scientific thinking. I don’t think “data science” should be a separate field, and I’m concerned about people working with data without having studied much statistics and, conversely, statisticians who don’t consider it important ever to look at real data. I enjoyed the discussions by Drew Conway and on your blog (at <a href="http://www.drewconway.com/zia/?p=2378" target="_blank">http://www.drewconway.com/zia/?p=2378</a> and <a href="http://simplystatistics.tumblr.com/post/11271228367/datascientist" target="_blank">http://simplystatistics.tumblr.com/post/11271228367/datascientist</a>) and think the relationships between statistics, machine learning, data science, and analytics need to be clarified.</p>
<div class="im">
<strong>How did you get into statistics/data science (e.g. your history)?</strong>
</div>
<p><span>I always enjoyed math and science, and became a math major as an</span><br />
<span>undergrad Caltech partly because I love logic and probability and</span><br />
<span>partly because I couldn’t decide which science to specialize in. One</span><br />
<span>of my favorite things about being a math major was that it felt so</span><br />
<span>connected to everything else: I could often help my friends who were</span><br />
<span>doing astronomy, biology, economics, etc. with problems, once they had</span><br />
<span>explained enough so that I could see the essential pattern/structure</span><br />
<span>of the problem. At the graduate level, there is a tendency for math to</span><br />
<span>become more and more disconnected from the rest of science, so I was</span><br />
<span>very happy to discover that statistics let me regain this, and have</span><br />
<span>the best of both worlds: you can apply statistical thinking and tools</span><br />
<span>to almost anything, and there are so many opportunities to do things</span><br />
<span>that are both beautiful and useful.</span></p>
<div class="im">
<strong>Who were really good mentors to you? What were the qualities that really</strong><br /><strong>helped you?</strong>
</div>
<p><span>I’ve been extremely lucky that I have had so many inspiring</span><br />
<span>colleagues, teachers, and students (far too numerous to list), so I</span><br />
<span>will just mention three. My mother, Steffi, taught me at an early age</span><br />
<span>to love reading and knowledge, and to ask a lot of “what if?”</span><br />
<span>questions. My PhD advisor, Persi Diaconis, taught me many beautiful</span><br />
<span>ideas in probability and combinatorics, about the importance of</span><br />
<span>starting with a simple nontrivial example, and to ask a lot of “who</span><br />
<span>cares?” questions. My colleague Carl Morris taught me a lot about how</span><br />
<span>to think inferentially (Brad Efron called Carl a “natural”</span><br />
<span>statistician in his interview at</span><br />
<a href="http://www-stat.stanford.edu/~ckirby/brad/other/2010Significance.pdf" target="_blank"><a href="http://www-stat.stanford.edu/~ckirby/brad/other/2010Significance.pdf" target="_blank">http://www-stat.stanford.edu/~ckirby/brad/other/2010Significance.pdf</a></a><span>,</span><br />
<span>by which I think he meant that valid inferential thinking does not</span><br />
<span>come naturally to most people), about parametric and hierarchical</span><br />
<span>modeling, and to ask a lot of “does that assumption make sense in the</span><br />
<span>real world?” questions.</span></p>
<div class="im">
<strong>How do you get students fired up about statistics in your classes?</strong>
</div>
<p><span>Statisticians know that their field is both incredibly useful in the</span><br />
<span>real world and exquisitely beautiful aesthetically. So why isn’t that</span><br />
<span>always conveyed successfully in courses? Statistics is often</span><br />
<span>misconstrued as a messy menagerie of formulas and tests, rather than a</span><br />
<span>coherent approach to scientific reasoning based on a few fundamental</span><br />
<span>principles. So I emphasize thinking and understanding rather than</span><br />
<span>memorization, and try to make sure everything is well-motivated and</span><br />
<span>makes sense both mathematically and intuitively. I talk a lot about</span><br />
<span>paradoxes and results which at first seem counterintuitive, since</span><br />
<span>they’re fun to think about and insightful once you figure out what’s</span><br />
<span>going on.</span></p>
<p>And I emphasize what I call “stories,” by which I mean an application/interpretation that does not lose generality. As a simple example, if X is Binomial(m,p) and Y is Binomial(n,p) independently, then X+Y is Binomial(m+n,p). A story proof would be to interpret X as the number of successes in m Bernoulli trials and Y as the number of successes in n different Bernoulli trials, so X+Y is the number of successes in the m+n trials. Once you’ve thought of it this way, you’ll always understand this result and never forget it. A misconception is that this kind of proof is somehow less rigorous than an algebraic proof; actually, rigor is determined by the logic of the argument, not by how many fancy symbols and equations one writes out.</p>
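<p>The convolution fact in that example is also easy to check numerically. Here is a minimal R sketch (arbitrary values of m, n, and p; not part of the interview) comparing the simulated distribution of X+Y with the Binomial(m+n, p) probabilities.</p>
<pre><code>## X ~ Binomial(m, p) and Y ~ Binomial(n, p), independent
set.seed(110)
m <- 7; n <- 12; p <- 0.3
x <- rbinom(1e5, size = m, prob = p)
y <- rbinom(1e5, size = n, prob = p)

## Simulated distribution of X + Y versus the claimed Binomial(m + n, p)
sim  <- as.vector(table(factor(x + y, levels = 0:(m + n)))) / 1e5
theo <- dbinom(0:(m + n), size = m + n, prob = p)
round(cbind(k = 0:(m + n), simulated = sim, theoretical = theo), 3)
</code></pre>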
<p>My undergraduate probability course, Stat 110, is now worldwide viewable for free on iTunes U at <a href="http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=495213607" target="_blank">http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=495213607</a> with 34 lecture videos and about 250 practice problems with solutions. I hope that will be a useful resource, but in any case looking through those materials says more about my teaching style than anything I can write here does.</p>
<p><strong>What are your main research interests these days?</strong></p>
<p>I’m especially interested in the statistics of networks, with applications to social network analysis and in public health. There is a tremendous amount of interest in networks these days, coming from so many different fields of study, which is wonderful but I think there needs to be much more attention devoted to the statistical issues. Computationally, most network models are difficult to work with since the space of all networks is so vast, and so techniques like Markov chain Monte Carlo and sequential importance sampling become crucial; but there remains much to do in making these algorithms more efficient and in figuring out whether one has run them long enough (usually the answer is “no” to the question of whether one has run them long enough). Inferentially, I am especially interested in how to make valid conclusions when, as is typically the case, it is not feasible to observe the full network. For example, respondent-driven sampling is a link-tracing scheme being used all over the world these days to study so-called “hard-to-reach” populations, but much remains to be done to know how best to analyze such data; I’m working on this with my student Sergiy Nesterko. With other students and collaborators I’m working on various other network-related problems. Meanwhile, I’m also finishing up a graduate probability book with Carl Morris, “Probability for Statistical Science,” which has quite a few new proofs and perspectives on the parts of probability theory that are most useful in statistics.</p>
<div class="im">
<strong>You have been immortalized in several Youtube videos. Do you think this</strong><br /><strong>helped make your class more “approachable”?</strong>
</div>
<p><span>There were a couple strange and funny pranks that occurred in my first</span><br />
<span>year at Harvard. I’m used to pranks since Caltech has a long history</span><br />
<span>and culture of pranks, commemorated in several “Legends of Caltech”</span><br />
<span>volumes (there’s even a movie in development about this), but pranks</span><br />
<span>are quite rare at Harvard. I try to make the class approachable</span><br />
<span>through the lectures and by making sure there is plenty of support,</span><br />
<span>help, and encouragement is available from the teaching assistants and</span><br />
<span>me, not through YouTube, but it’s fun having a few interesting</span><br />
<span>occasions from the history of the class commemorated there.</span></p>
Data Journalism Awards
2012-01-19T20:55:44+00:00
http://simplystats.github.io/2012/01/19/data-journalism-awards
<p><a href="http://googleblog.blogspot.com/2012/01/data-journalism-awards-now-accepting.html">Data Journalism Awards</a></p>
Fundamentals of Engineering Review Question Oops
2012-01-19T02:13:25+00:00
http://simplystats.github.io/2012/01/19/fundamentals-of-engineering-review-question-oops
<p>The <a href="http://www.ncees.org/Exams/FE_exam.php" target="_blank">Fundamentals of Engineering Exam</a> is the first licensing exam for engineers. You have to pass it on your way to becoming a professional engineer (PE). I was recently shown a problem from a review manual: </p>
<blockquote>
<p>When it is operating properly, a chemical plant has a daily production rate that is normally distributed with a mean of 880 tons/day and a standard deviation of 21 tons/day. During an analysis period, the output is measured with random sampling on 50 consecutive days, and the mean output is found to be 871 tons/day. With a 95 percent confidence level, determine if the plant is operating properly. </p>
<ol>
<li>There is at least a 5 percent probability that the plant is operating properly. </li>
<li>There is at least a 95 percent probability that the plant is operating properly. </li>
<li>There is at least a 5 percent probability that the plant is not operating properly. </li>
<li>There is at least a 95 percent probability that the plant is not operating properly. </li>
</ol>
</blockquote>
<p>Whoops…seems to be a problem there. I’m glad that engineers are expected to know some statistics; hopefully the engineering students taking the exam can spot the problem…but then how do they answer? </p>
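<p>For reference, here is a minimal sketch of the calculation the question seems to be after, assuming the standard one-sample z test of the null hypothesis that the plant is operating properly (mean 880 tons/day) with the stated standard deviation treated as known.</p>
<pre><code>xbar <- 871; mu0 <- 880; sigma <- 21; n <- 50
z <- (xbar - mu0) / (sigma / sqrt(n))   # about -3.03
p.value <- 2 * pnorm(-abs(z))           # about 0.002, two-sided
c(z = z, p.value = p.value)
</code></pre>
<p>At the 95 percent confidence level we would reject the null hypothesis that the plant is operating properly. None of the four answer choices says that; each of them treats the result as a probability that the plant is (or is not) operating properly, which is not what a hypothesis test gives you.</p>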
figshare and don't trust celebrities stating facts
2012-01-17T17:48:58+00:00
http://simplystats.github.io/2012/01/17/figshare-and-dont-trust-celebrities-stating-facts
<p>A couple of links:</p>
<ol>
<li><a href="http://figshare.com/" target="_blank">figshare</a> is a site where scientists can share data sets/figures/code. One of the goals is to encourage researchers to share negative results as well. I think this is a great idea - I often find negative results and this could be a place to put them. It also uses a tagging system, like Flickr. I think this is a great idea for scientific research discovery. They give you unlimited public space and 1GB of private space. This could be big, a place to help make <a href="http://simplystatistics.tumblr.com/post/13633695297/reproducible-research-in-computational-science" target="_blank">reproducible research efforts</a> user-friendly. Via <a href="http://techcrunch.com/2012/01/17/science-data-sharing-site-figshare-relaunches-adds-features/" target="_blank">TechCrunch</a></li>
<li>Don’t trust <a href="http://newsfeed.time.com/2012/01/03/celebrities-offering-scientific-facts-just-say-no/?xid=newsletter-weekly" target="_blank">celebrities stating facts</a> because they usually don’t know what they are talking about. I completely agree with this. Particularly because I have serious doubts about the <a href="http://simplystatistics.tumblr.com/post/15774146480/in-the-era-of-data-what-is-a-fact" target="_blank">statisteracy</a> of most celebrities. Nod to Alex for the link (our most active link finder!). </li>
</ol>
A Tribute To One Of The Most Popular Methods In
2012-01-16T14:02:05+00:00
http://simplystats.github.io/2012/01/16/a-tribute-to-one-of-the-most-popular-methods-in
<p>[youtube http://www.youtube.com/watch?v=oPzERmPlmw8?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]</p>
<p>A tribute to one of the most popular methods in statistics.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Sunday Data/Statistics Link Roundup
2012-01-15T15:38:00+00:00
http://simplystats.github.io/2012/01/15/sunday-data-statistics-link-roundup
<ol>
<li><a href="http://www.robertniles.com/stats/" target="_blank">Statistics help for journalists</a> (don’t forget to keep <a href="http://www.healthnewsrater.com/" target="_blank">rating stories</a>!) This is the kind of thing that could grow into a statisteracy page. The author also has a really nice <a href="http://www.sensibletalk.com/journals/robertniles/201110/84/" target="_blank">plug for public schools</a>. </li>
<li>An interactive graphic to determine <a href="http://www.nytimes.com/interactive/2012/01/15/business/one-percent-map.html" target="_blank">if you are in the 1%</a> from the New York Times (I’m not…).</li>
<li>Mike Bostock’s <a href="http://mbostock.github.com/d3/talk/20110921/#0" target="_blank">d3.js presentation</a>, this is some really impressive visualization software. You have to change the slide numbers manually but it is totally worth it. Check out <a href="http://mbostock.github.com/d3/talk/20110921/#10" target="_blank">slide 10</a> and <a href="http://mbostock.github.com/d3/talk/20110921/#14" target="_blank">slide 14</a>. This is the future of data visualization. Here is a beginners <a href="http://www.drewconway.com/zia/?p=2857" target="_blank">tutorial</a> to d3.js by Mike Dewar.</li>
<li>An online diagnosis prediction start-up (<a href="http://symcat.com/" target="_blank">Symcat</a>) based on data analysis from two Hopkins Med students.</li>
</ol>
<p>Finally, a bit of a bleg. I’m going to try to make this link roundup a regular post. If you have ideas for links I should include, tweet us @simplystats or send them to Jeff’s email. </p>
In the era of data what is a fact?
2012-01-13T14:00:06+00:00
http://simplystats.github.io/2012/01/13/in-the-era-of-data-what-is-a-fact
<p>The Twitter universe is abuzz about <a href="http://publiceditor.blogs.nytimes.com/2012/01/12/should-the-times-be-a-truth-vigilante/?pagewanted=all" target="_blank">this</a> article in the New York Times. Arthur Brisbane, who responds to readers’ comments, asks </p>
<blockquote>
<p><span>I’m looking for reader input on whether and when New York Times news reporters should challenge “facts” that are asserted by newsmakers they write about.</span></p>
</blockquote>
<p><span>He goes on to give a couple of examples of qualitative facts that reporters have used in stories without questioning the veracity of the claims. </span>As many people pointed out in the comments, this is completely absurd. Of course reporters should check facts and report when the facts in their stories, or stated by candidates, are not correct. That is the purpose of news reporting. </p>
<p>But I think the question is a little more subtle when it comes to quantitative facts and statistics. Depending on what subsets of data you look at, what summary statistics you pick, and the way you present information - you can say a lot of different things with the same data. As long as you report what you calculated, you are technically reporting a fact - but it may be deceptive. The classic example is calculating <a href="http://en.wikipedia.org/wiki/Real_estate_pricing" target="_blank">median vs. mean</a> home prices. If Bill Gates is in your neighborhood, no matter what the other houses cost, the mean price is going to be pretty high! </p>
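<p>The home-price example takes two lines of R to see; the prices below are made up, with nine ordinary houses and one extreme outlier standing in for the Bill Gates house.</p>
<pre><code>prices <- c(rep(2e5, 9), 1e8)   # nine $200,000 houses and one $100 million house
mean(prices)    # about $10.2 million -- technically a fact, but deceptive
median(prices)  # $200,000 -- much closer to the typical house
</code></pre>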
<p>Two concrete things can be done to deal with the malleability of facts in the data age.</p>
<p>First, we need to require that our reporters, policy makers, politicians, and decision makers report the context of numbers they state. It is tempting to use statistics as blunt instruments, punctuating claims. Instead, we should demand that people using statistics to make a point embed them in the broader context. For example, in the case of housing prices, if a politician reports the mean home price in a neighborhood, they should be required to state that potential outliers may be driving that number up. How do we make this demand? By not believing any isolated statistics - statistics will only be believed when the source is quoted and the statistic is described. </p>
<p>But this isn’t enough, since the context and statistics will be meaningless without raising overall statisteracy (statistical literacy, not to be confused with <a href="http://en.wikipedia.org/wiki/Numeracy" target="_blank">numeracy</a>). In the U.S. literacy campaigns have been promoted by library systems. Statisteracy is becoming just as critical; the same level of social pressure and assistance should be applied to individuals who don’t know basic statistics as those who don’t have basic reading skills. Statistical organizations, academic departments, and companies interested in analytics/data science/statistics all have a vested interest in raising the population statisteracy. Maybe a website dedicated to understanding the consequences of basic statistical concepts, rather than the concepts themselves?</p>
<p>And don’t forget to keep <a href="http://simplystatistics.tumblr.com/post/15669033251/healthnewsrater" target="_blank">rating health news stories</a>!</p>
Academics are partly to blame for supporting the closed and expensive access system of publishing
2012-01-13T02:54:25+00:00
http://simplystats.github.io/2012/01/13/academics-are-partly-to-blame-for-supporting-the-closed
<p>Michael Eisen recently published a <a href="http://www.nytimes.com/2012/01/11/opinion/research-bought-then-paid-for.html?_r=1" target="_blank">New York Times op-ed</a> arguing that a bill meant to protect publishers, <span>introduced in the House of Representatives, will result in tax payers paying twice for scientific research. According to Eisen</span></p>
<blockquote>
<p><span>If the bill passes, to read the results of federally funded research, most Americans would have to buy access to individual articles at a cost of $15 or $30 apiece. In other words, taxpayers who already paid for the research would have to pay again to read the results.</span></p>
</blockquote>
<p>We agree and encourage our readers to write Congress opposing the “<a href="http://thomas.loc.gov/cgi-bin/query/z?c112:H.R.3699:" target="_blank">Research Works Act</a>”. However, whereas many are vilifying the publishers that are lobbying for this act, I think us academics are the main culprits keeping open access from succeeding.</p>
<p>If this bill makes it into law, I do not think that the main issue will be US taxpayers paying twice for research, but rather that access will be even more restricted to the general scientific community. Interested parties outside the US -and in developing countries in particular- should have unrestricted access to scientific knowledge. Congresswoman Carolyn Maloney <a href="http://sistnek.blogspot.com/2012/01/time-to-terminate-research-works-act.html" target="_blank">gets it wrong</a> by not realizing that giving China (and other countries) access to scientific knowledge is beneficial to science in general and consequently to everyone. However, to maintain the high quality of research publications we currently enjoy, someone needs to pay for competent editors, copy editors, support staff, and computer servers. Open access journals shift the costs from the readers to authors, who have plenty of funds (grants, startups, etc.) to cover the charges. By charging the authors, papers can be made available online for free. Free to everyone. Open access. PLoS has demonstrated that the open access model is viable, but a paper in <em>PLoS Biology</em> will run you $2,900 (<a href="http://simplystatistics.tumblr.com/post/12286350206/free-access-publishing-is-awesome-but-expensive-how" target="_blank">see Jeff’s table</a>). Several non-profit societies and for-profit publishers, such as <a href="http://www.nature.com/press_releases/open15.html" target="_blank">Nature Publishing Group</a>, offer open access for about <a href="http://newsbreaks.infotoday.com/Digest/Nature-Publishing-Group-Expands-Open-Access-Choices-52372.asp" target="_blank">the same price</a>. </p>
<p>So given all the open access options, why do gated journals survive? I think the main reason is that <strong>we</strong> -the scientific community- through appointments and promotions committees, study sections, award committees, etc., use journal prestige to evaluate publication records, disregarding open access as a criterion (see Eisen’s <a href="http://www.michaeleisen.org/blog/?p=694" target="_blank">related post</a> on decoupling publication and assessment). Therefore, those who decide to publish only in open access journals may hinder not only their careers, but also the careers of their students and postdocs. The other reason is that, for authors, publishing gated papers is typically cheaper than publishing open access papers, and we don’t always make the more honorable decision. </p>
<p>Another important consideration is that a substantial proportion of publication costs comes from printing paper copies. My department continues to buy print copies of several stat journals as well as some of the general science magazines. The Hopkins library, on behalf of the faculty, buys print versions of hundreds of journals. As long as we continue to create a market for paper copies, the journals will continue to allocate resources to producing them. Somebody has to pay for this, yet with online versions already being produced the print versions are superfluous.</p>
<p>Apart from opposing the Research Works Act as Eisen proposes, there are two more things I intend to do in 2012: 1) lobby my department to stop buying print versions and 2) lobby my study section to give special consideration to open access publications when evaluating a biosketch or a progress report.</p>
Help us rate health news reporting with citizen-science powered http://www.healthnewsrater.com
2012-01-11T13:19:00+00:00
http://simplystats.github.io/2012/01/11/healthnewsrater
<p>We here at Simply Statistics are big fans of science news reporting. We read newspapers, blogs, and the news sections of scientific journals to keep up with the coolest new research. </p>
<p>But health science reporting, although exciting, can also be incredibly frustrating to read. Many articles have sensational titles, like <a href="http://www.dailymail.co.uk/health/article-1149207/How-using-Facebook-raise-risk-cancer.html" target="_blank">“How using Facebook could raise your risk of cancer”</a>. The articles go on to describe some research and interview a few scientists, then typically make fairly large claims about what the research means. This isn’t surprising - eye catching headlines are important in this era of short attention spans and information overload. </p>
<p>If just a few extra pieces of information were reported in news stories about science, it would be much easier to evaluate whether the cancer risk was serious enough to shut down our Facebook accounts. In particular, we thought any news story should report:</p>
<ol>
<li><strong>A link back to the original research article</strong> where the study (or studies) being described was published. Not just a link to another news story. </li>
<li><strong>A description of the study design</strong> (was it a randomized clinical trial? a cohort study? 3 mice in a lab experiment?)</li>
<li><strong>Who funded the study</strong> - if a study involving cancer risk was sponsored by a tobacco company, that might say something about the results.</li>
<li><strong>Potential financial incentives of the authors</strong> - if the study is reporting a new drug and the authors work for a drug company, that might say something about the study too. </li>
<li><strong>The sample size</strong> - many health studies are based on a very small sample size, only 10 or 20 people in a lab. Results from these studies are much weaker than results obtained from a large study of thousands of people. </li>
<li><strong>The organism</strong> - Many health science news reports are based on studies performed in lab animals and may not translate to human health. For example, here is a report with the headline <a href="http://www.msnbc.msn.com/id/44779621/ns/health-alzheimers_disease/t/alzheimers-may-be-transmissible-study-suggests/" target="_blank">“Alzheimers may be transmissible, study suggests”</a>. But if you read the story, scientists injected Alzheimer’s afflicted brain tissue from humans into mice. </li>
</ol>
<p>So we created a citizen-science website for evaluating health news reporting called <a href="http://healthnewsrater.com" target="_blank">HealthNewsRater</a>. It was built by <a href="http://www.biostat.jhsph.edu/~ajaffe/" target="_blank">Andrew Jaffe</a> and <a href="http://www.biostat.jhsph.edu/~jleek/research.html" target="_blank">Jeff Leek</a>, with Andrew doing the bulk of the heavy lifting. We would like you to help us collect data on the quality of health news reporting. When you read a health news story on the Nature website, at nytimes.com, or on a blog, we’d like you to take a second to report on the news. Just determine whether the 6 pieces of information above are reported and input the data at <a href="http://healthnewsrater.com" target="_blank">HealthNewsRater</a>.</p>
<p>We calculate a score for each story based on the formula:</p>
<p><strong>HNR-Score = (5 points for a link to the original article + 1 point each for the other criteria)/2</strong></p>
<p>The score weights the link to the original article very heavily, since this is the best source of information about the actual science underlying the story. </p>
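<p>In code, the scoring rule is a one-liner. The function name and the use of logical flags below are just for illustration; this is a sketch of the formula above, not the code behind HealthNewsRater.</p>
<pre><code>## One point per criterion, except the link to the original article (5 points)
hnr.score <- function(link, design, funding, incentives, sample.size, organism) {
  (5 * link + design + funding + incentives + sample.size + organism) / 2
}

hnr.score(link = TRUE, design = FALSE, funding = FALSE,
          incentives = FALSE, sample.size = FALSE, organism = FALSE)  # 2.5
hnr.score(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)                         # 5, the maximum
</code></pre>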
<p>In a future post we will analyze the data we have collected, make it publicly available, and let you know which news sources are doing the best job of reporting health science. </p>
<p><strong>Update:</strong> If you are a web-developer with an interest in health news <a href="mailto:healthnewsrater@gmail.com" target="_blank">contact us</a> to help make HealthNewsRater better! </p>
Statistical Crime Fighter
2012-01-10T19:23:00+00:00
http://simplystats.github.io/2012/01/10/statistical-crime-fighter
<p><a href="http://www-stat.wharton.upenn.edu/~berkr/" target="_blank">Dick Berk</a> is using his statistical superpowers to fight crime. <a href="http://www.theatlantic.com/magazine/archive/2012/01/misfortune-teller/8846/" target="_blank">Seriously</a>. Here is my favorite paragraph.</p>
<blockquote>
<p><span>Drawing from criminal databases dating to the 1960s, Berk initially modeled the Philadelphia algorithm on more than 100,000 old cases, relying on three dozen predictors, including the perpetrator’s age, gender, neighborhood, and number of prior crimes. To develop an algorithm that forecasts a particular outcome—someone committing murder, for example—Berk applied a subset of the data to “train” the computer on which qualities are associated with that outcome. “If I could use sun spots or shoe size or the size of the wristband on their wrist, I would,” Berk said. “If I give the algorithm enough predictors to get it started, it finds things that you wouldn’t anticipate.” Philadelphia’s parole officers were surprised to learn, for example, that the crime for which an offender was sentenced—whether it was murder or simple drug possession—does not predict whether he or she will commit a violent crime in the future. Far more predictive is the age at which he (yes, gender matters) committed his first crime, and the amount of time between other offenses and the latest one—the earlier the first crime and the more recent the last, the greater the chance for another offense.</span></p>
</blockquote>
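<p>The setup described in that paragraph is easy to mimic on fake data. The sketch below simulates a toy recidivism dataset and fits a plain logistic regression to forecast re-offense from age at first crime and time since the last offense; the variables, effect sizes, and model are illustrative assumptions and have nothing to do with Berk’s actual data or algorithm.</p>
<pre><code>set.seed(1)
n <- 5000
age.first  <- rnorm(n, mean = 22, sd = 5)    # age at first recorded crime
months.gap <- rexp(n, rate = 1/24)           # months since most recent offense
## assumed relationship: earlier first crime and shorter gap -> higher risk
risk <- plogis(-0.5 - 0.10 * (age.first - 22) - 0.03 * (months.gap - 24))
reoffend <- rbinom(n, 1, risk)

## "train" on most of the cases, forecast the rest
dat  <- data.frame(reoffend, age.first, months.gap)
fit  <- glm(reoffend ~ age.first + months.gap, family = binomial,
            data = dat[1:4000, ])
pred <- predict(fit, newdata = dat[4001:n, ], type = "response")

## forecasts should be higher, on average, for the held-out cases who re-offend
tapply(pred, dat$reoffend[4001:n], mean)
</code></pre>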
<p><span>Hat tip to Alex Nones.</span></p>
Do you own or rent?
2012-01-10T12:00:05+00:00
http://simplystats.github.io/2012/01/10/do-you-own-or-rent
<p>When it comes to computing, history has gone back and forth between what I would call the “owner model” and the “renter model”. The question is what’s the best approach and how do you determine that?</p>
<p>Back in the day when people like John von Neumann were busy inventing the computer to work out H-bomb calculations, there was more or less a renter model in place. Computers were obviously quite expensive and so not everyone could have one. If you wanted to do your calculation, you’d walk down to the computer room, give them your punch cards with your program written out, and they’d run it for you. Sometime later you’d get some print out with the results of your program. </p>
<p>A little later, with time-sharing types of machines, you could have dumb terminals login to a central server and run your calculations that way. I guess that saved you the walk to the computer room (and all the punch cards). I still remember some of these green-screen dumb terminals from my grad school days (yes, UCLA still had these monstrosities in 1999). </p>
<p>With personal computers in the 80s, you could own your own computer, so there was no need to depend on some central computer (and a connection to it) to do the work for you. As computing components got cheaper, these personal computers got more and more powerful and rivaled the servers of yore. It was difficult for me to imagine ever needing things like mainframes again except for some esoteric applications. Especially, with the development of Linux, you could have all the power of a Unix mainframe on your desk or lap (or now your palm). </p>
<p>But here we are, with <a href="http://simplystatistics.tumblr.com/post/15565843517/a-statistician-and-apple-fanboy-buys-a-chromebook-and" target="_blank">Jeff buying a Chromebook</a>. Have we just taken a step back in time? Is cloud computing and the renter model the way to go? I have to say that I was a big fan of “cloud computing” back in the day. But once Linux came around, I really didn’t think there was a need for the thin client/fat server model.</p>
<p>But it seems we are going back that way and the reason seems to be because of mobile devices. Mobile devices are now just small computers, so many people own at least two computers (a “real” computer and a phone). With multiple computers, it’s a pain to have to synchronize both the data and the applications on them. If they’re made by different manufacturers then you can’t even have the same operating system/applications on the devices. Also, no one cares about the operating system anymore, so why should it have to be managed? The cloud helps solve some of these problems, as does owning devices from the same company (as I do, Apple fanboy that I am).</p>
<p>I think the all-renter model of the Chromebook is attractive, but I don’t think it’s ready for prime time just yet. Two reasons I can think of are (1) Microsoft Office and (2) slow network connections. If you want to make Jeff very unhappy, you can either (1) send him a Word document that needs to be edited in Track Changes; or (2) invite him to an international conference on some remote island. The need for a strong network connection is problematic because I’ve yet to encounter a hotel that had a fast enough connection for me to work remotely over on our computing cluster. For that reason I’m sticking with my current laptop.</p>
A statistician and Apple fanboy buys a Chromebook...and loves it!
2012-01-09T14:00:06+00:00
http://simplystats.github.io/2012/01/09/a-statistician-and-apple-fanboy-buys-a-chromebook-and
<p>I don’t mean to brag, but I was an early Apple Fanboy - not sure that is something to brag about now that I write it down. I convinced my advisor to go to all Macs in our lab in 2004. Since then I have been pretty dedicated to the brand, dutifully shelling out almost 2g’s every time I need a new laptop. I love the way Macs just work (until they don’t and you need a new laptop).</p>
<p>But I hate the way Apple seems to be dedicated to bleeding <a href="http://simplystatistics.tumblr.com/post/13412260027/apple-this-is-ridiculous-you-gotta-upgrade-to" target="_blank">every last cent</a> out of me. So I saved up my Christmas gift money (thanks Grandmas!) and bought a <a href="https://www.google.com/intl/en/chromebook/" target="_blank">Chromebook</a>. It cost me $350 and I was at least in part inspired by <a href="http://www.youtube.com/watch?v=DazdIFMbC_4" target="_blank">these</a> <a href="http://www.youtube.com/watch?v=m0ISVHdzJsQ" target="_blank">clever</a> <a href="http://www.youtube.com/watch?v=EaI9hORJS4M" target="_blank">ads</a>.</p>
<p>So far I’m super pumped about the performance of the Chromebook. Things I love:</p>
<ol>
<li>About 10 seconds to boot from shutdown, instantly awake from sleep</li>
<li>Super long battery life - 8 hours a charge might be an underestimate</li>
<li>Size - it’s a 12 inch laptop and just right for sitting on my lap and typing</li>
<li>Since everything is cloud based, nothing to install/optimize</li>
</ol>
<p>It took me a while to get used to the Browser being the operating system. When I close the last browser window, I expect to see the Desktop. Instead, a new browser window pops up. But that discomfort only lasted a short time.</p>
<p>It turns out I can do pretty much everything I do on my Macbook on the Chromebook. I can access our department’s computing cluster by turning on developer mode and <a href="https://groups.google.com/forum/#!topic/chromebook-central/dZDs1GFdlzY" target="_blank">opening a shell</a> (thanks <a href="http://www.bcaffo.com/" target="_blank">Caffo</a>!). I can do all my word processing on Google Docs. Email is just gmail as usual. <a href="http://www.scribtex.com/" target="_blank">Scribtex</a> for latex (<a href="http://grants.nih.gov/grants/policy/pecase2010/BrianCaffo.jpg" target="_blank">Caffo</a> again). <a href="https://music.google.com/" target="_blank">Google Music</a> is so awesome I wish I had started my account before I got my Chromebook. The only thing I’m really trying to settle on is a cloud-based code editor with syntax highlighting. I’m open to suggestions (Caffo?).</p>
<p>I’m starting to think I could bail on Apple….</p>
Sunday Data/Statistics Link Roundup
2012-01-08T19:35:42+00:00
http://simplystats.github.io/2012/01/08/sunday-data-statistics-link-roundup-2
<p>A few data/statistics related links of interest:</p>
<ol>
<li><a href="http://www.nytimes.com/2012/01/03/science/broad-institute-director-finds-power-in-numbers.html?ref=science" target="_blank">Eric Lander Profile</a></li>
<li><a href="http://www.wired.com/wiredscience/2012/01/the-mathematics-of-lego/" target="_blank">The math of lego</a> (should be “The statistics of lego”)</li>
<li><a href="http://flowingdata.com/2012/01/06/where-people-are-looking-for-homes/" target="_blank">Where people are looking for homes.</a></li>
<li><a href="http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html" target="_blank">Hans Rosling’s Ted Talk on the Developing world</a> (an oldie but a goodie)</li>
<li><a href="http://www.michaeleisen.org/blog/?p=807" target="_blank">Elsevier is trying to make open-access illegal</a> (not strictly statistics related, but a hugely important issue for academics who believe government funded research should be freely accessible), more <a href="http://www.michaeleisen.org/blog/?p=837" target="_blank">here</a>. </li>
</ol>
Where do you get your data?
2012-01-08T16:30:18+00:00
http://simplystats.github.io/2012/01/08/where-do-you-get-your-data
<p>Here’s a question I get fairly frequently from various types of people: Where do you get your data? This is sometimes followed up quickly with “Can we use some of your data?”</p>
<p>My contention is that if someone asks you these questions, start looking for the exits.</p>
<!-- more -->
<p>There are of course legitimate reasons why someone might ask you this question. For example, they might be interested in the source of the data to verify its quality. But too often, they are interested in getting the data because they believe it would be a good fit to a method that they have recently developed. Even if that is in fact true, there are some problems.</p>
<p>Before I go on, I need to clarify that I don’t have a problem with data sharing per se, but I usually get nervous when a person’s <em>opening line</em> is “Where do you get your data?” This question presumes a number of things that are usually signs of a bad collaborator:</p>
<ul>
<li><strong>The data are just numbers</strong>. My method works on numbers, and these data are numbers, so my method should work here. If it doesn’t work, then I’ll find some other numbers where it does work.</li>
<li><strong>The data are all that are important</strong>. I’m not that interested in working with an actual scientist on an important problem that people care about, because that would be an awful lot of work and time (see <a href="http://simplystatistics.tumblr.com/post/11695813030/finding-good-collaborators" target="_blank">here</a>). I just care about getting the data from whomever will give it to me. I don’t care about the substantive context.</li>
<li><strong>Once I have the data, I’m good, thank you</strong>. In other words, the scientific process is modular. Scientists generate the data and once I have it I’ll apply my method until I get something that I think makes sense. There’s no need for us to communicate. That is unless I need you to help make the data pretty and nice for me.</li>
</ul>
<p>The real question that I think people should be asking is “Where do you find such great scientific collaborators?” Because it’s those great collaborators that generated the data and worked hand-in-hand with you to get intelligible results.</p>
<p>Niels Keiding wrote a <a href="http://biostatistics.oxfordjournals.org/content/11/3/376.long" target="_blank">provocative commentary</a> about the tendency for statisticians to ignore the substantive context of data and to use illustrative/toy examples over and over again. He argued that because of this tendency, we should not be so excited about reproducible research, because as more data become available, we will see more examples of people ignoring the science.</p>
<p>I disagree that this is an argument against reproducible research, but I agree that statisticians (and others) do have a tendency to overuse datasets simply because they are “out there” (stackloss data, anyone?). However, it’s probably impossible to stop people from conducting poor science in any field, and we shouldn’t use the possibility that this might happen in statistics to prevent research from being more reproducible in general. </p>
<p>But I digress…. My main point is that people who simply ask for “the data” are probably not interested in digging down and understanding the really interesting questions. </p>
Building the Team That Built Watson
2012-01-08T02:06:27+00:00
http://simplystats.github.io/2012/01/08/building-the-team-that-built-watson
<p><a href="http://www.nytimes.com/2012/01/08/jobs/building-the-watson-team-of-scientists.html">Building the Team That Built Watson</a></p>
Make us a part of your day - add Simply Statistics to your RSS feed
2012-01-08T00:56:34+00:00
http://simplystats.github.io/2012/01/08/make-us-a-part-of-your-day-add-simply-statistics-to
<p>You can add us to your RSS feed through <a href="http://feeds.feedburner.com/SimplyStatistics" target="_blank">feedburner</a>.</p>
P-values and hypothesis testing get a bad rap - but we sometimes find them useful.
2012-01-06T16:54:52+00:00
http://simplystats.github.io/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we
<p><em>This post written by Jeff Leek and Rafa Irizarry.</em></p>
<p>The <a href="http://en.wikipedia.org/wiki/P-value" target="_blank">p-value</a> is the most widely-known statistic. P-values are reported in a large majority of scientific publications that measure and report data. <a href="http://en.wikipedia.org/wiki/Ronald_Fisher" target="_blank">R.A. Fisher</a> is widely credited with inventing the p-value. If he were cited every time a p-value was reported, his paper would have, at the very least, 3 <strong>million</strong> citations* - making it the <a href="http://www.jbc.org/content/280/28/e25.full#" target="_blank">most highly cited paper</a> of all time. </p>
<!-- more -->
<p>However, the p-value has a large number of very vocal critics. The criticisms of p-values, and hypothesis testing more generally, range from philosophical to practical. There are even <a href="http://warnercnr.colostate.edu/~anderson/thompson1.html" target="_blank">entire websites</a> dedicated to “debunking” p-values! One issue many statisticians raise is that p-values are easily misinterpreted; another is that they are not calibrated by sample size; another is that they ignore existing information or knowledge about the parameter in question; and yet another is that very significant (small) p-values may result even when the value of the parameter of interest is scientifically uninteresting.</p>
<p>We agree with all these criticisms. Yet, in practice, we find p-values useful and, if used correctly, a powerful tool for the advancement of science. The fact that many misinterpret the p-value is not the p-value’s fault. If the statement “under the null the chance of observing something this convincing is 0.65” is correct, then why not use it? Why not explain to our collaborator that the observation they thought was so convincing can easily happen by chance in a setting that is uninteresting? In cases where p-values are <em>small enough</em>, the substantive experts can help decide whether the parameter of interest is scientifically interesting. In general, we find p-values to be superior to our collaborators’ intuition about which patterns are statistically interesting and which are not.</p>
<p>We also find p-values provide a simple way to construct decision algorithms. For example, a government agency can define general rules based on p-values that are applied equally to products needing a specific seal of approval. If the rule proves to be too lenient or too restrictive, we change the p-value cut-off appropriately. In this situation we view the p-value as part of a practical protocol, not a tool for statistical inference.</p>
<p>Moreover the p-value has the following useful properties for applied statisticians:</p>
<ol>
<li><strong>p-values are easy to calculate, even for complicated statistics</strong>. Many statistics do not lend themselves to easy analytic calculation, but using permutation and bootstrap procedures p-values can be calculated even for very complicated statistics (see the short sketch after this list). </li>
<li><strong>p-values are relatively easy to understand.</strong> The statistical interpretation of the p-value remains roughly the same no matter how complicated the underlying statistic, and p-values are also bounded between 0 and 1. This also means that p-values are easy to <em>mis</em>-interpret - they are not posterior probabilities. But this is a difficulty with education, not a difficulty with the statistic itself. </li>
<li><strong>p-values have simple, universal properties.</strong> Correct p-values are uniformly distributed under the null, regardless of how complicated the underlying statistic is. </li>
<li><strong>p-values are calibrated to error rates scientists care about.</strong> Regardless of the underlying statistic, calling all p-values less than 0.05 significant leads to, on average, about 5% false positives even if the null hypothesis is always true. If this property is ignored, things like publication bias can result, but again this is a problem with education and the scientific process, not with p-values. </li>
<li><strong>p-values are useful for multiple testing correction.</strong> The advent of new measurement technology has shifted much of science from hypothesis driven to discovery driven, making the existing multiple testing machinery useful. Using the simple, universal properties of p-values it is possible to easily calculate estimates of quantities like the false discovery rate - the rate at which discovered associations are false.</li>
<li><strong>p-values are reproducible.</strong> All statistics are reproducible with enough information. Given the simplicity of calculating p-values, it is relatively easy to communicate sufficient information to reproduce them. </li>
</ol>
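<p>To make the first and fifth points above concrete, here is a minimal R sketch (not from the original post, just base R) that computes a two-sided permutation p-value for a difference in means and then applies the Benjamini-Hochberg adjustment to a vector of p-values to see which results survive a 5% false discovery rate:</p>
<pre><code>## Minimal sketch: permutation p-value for a difference in means,
## then Benjamini-Hochberg FDR adjustment for many tests.
set.seed(1)
x <- rnorm(20)               # measurements in group 1
y <- rnorm(20, mean = 0.5)   # measurements in group 2

obs <- mean(x) - mean(y)     # observed statistic
pooled <- c(x, y)
B <- 10000
perm <- replicate(B, {
  idx <- sample(length(pooled), length(x))    # random relabeling of the groups
  mean(pooled[idx]) - mean(pooled[-idx])
})
pval <- (sum(abs(perm) >= abs(obs)) + 1) / (B + 1)   # two-sided permutation p-value
pval

## For many tests, estimate the false discovery rate with p.adjust():
pvals <- runif(1000)                   # stand-in for p-values from 1000 tests
fdr <- p.adjust(pvals, method = "BH")
sum(fdr < 0.05)                        # number of discoveries at FDR 5%
</code></pre>
<p>The same recipe works no matter how complicated the statistic is: just replace the difference in means with whatever you can compute from the relabeled data.</p>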
<p>We agree there are flaws with p-values, just like there are with any statistic one might choose to calculate. In particular, we do think that confidence intervals should be reported with p-values when possible. But we believe that any other decision-making statistic would lead to other problems. One thing we are sure about is that p-values beat scientists’ intuition about chance any day. So before bashing p-values too much we should be careful because, like democracy to government, p-values may be the worst form of statistical significance calculation except all those other forms that have been tried from time to time. </p>
<p>————————————————————————————————————</p>
<p><em>* Calculated using Google Scholar using the formula:</em></p>
<p><em>Number of P-value Citations = # of papers with exact phrase “P < 0.05” + (# of papers with exact phrase “P < 0.01” and not exact phrase “P < 0.05”) + (# of papers with exact phrase “P < 0.001” and not exact phrase “P < 0.05” or “P < 0.01”) </em></p>
<p><em>= 1,320,000 + 1,030,000 + 662,500</em></p>
<p><em>This is obviously an extremely conservative estimate. </em></p>
Why all #academics should have professional @twitter accounts
2012-01-05T16:24:14+00:00
http://simplystats.github.io/2012/01/05/why-all-academics-should-have-professional-twitter
<p>I started my professional Twitter account <a href="http://twitter.com/#!/leekgroup" target="_blank">@leekgroup</a> about a year and a half ago at the suggestion of a colleague of mine, John Storey (<a href="https://twitter.com/#!/storeylab" target="_blank">@storeylab</a>). I started using the account to post updates on papers/software my group was publishing. Basically, everything I used to report on my webpage as “News”. </p>
<p>I started to give talks where the title slide included my Twitter name, rather than my webpage. It frequently drew the biggest laugh in the talk, and I would get comments like, “Do you really think people care what you are thinking every moment of every day?” That is what some people use Twitter for, and no, I’m not really interested in making those kinds of updates. </p>
<p>So I started describing why I think Twitter is useful for academics at the beginning of talks:</p>
<ol>
<li>You can integrate it directly into your website (<a href="http://biostat.jhsph.edu/~jleek/research.html" target="_blank">like so</a>), using Twitter widgets. If you have a Twitter account you just go <a href="http://twitter.com/about/resources/widgets" target="_blank">here</a>, get the widget for your website, and add the code to your homepage. Now you don’t have to edit HTML to make news updates, you just login to Twitter and type the update in the box.</li>
<li>You can quickly gain a much broader audience for your software/papers. In the past, I had to rely on people actually coming to my website to find my papers or seeing them in journals. Now, when I announce a paper, my followers see it and if they like it, they pass it on to their followers, etc. I have noticed that my papers are being downloaded more and by a broader audience since I joined. </li>
<li>I can keep up on what other people are doing. <a href="http://simplystatistics.tumblr.com/post/12560072373/statisticians-on-twitter-help-me-find-more" target="_blank">Many statisticians</a> have Twitter accounts that they use professionally. I follow many of them and when they publish new papers, I see them pop up, rather than having to go to all their websites. It’s like an RSS feed of papers from people I want to follow. </li>
<li>You can connect with people outside academia. Particularly in my area, I’d like the statistical tools I’m developing to be used by folks in industry who work on genomics. It’s hard to get the word out about my methods through traditional channels, but a lot of those folks are on Twitter. </li>
</ol>
<p>The best part is, there is an amplification effect to this medium. So as more and more academics join and follow each other, it is easier and easier for us all to keep up with what is happening in the field. If you are intimidated by using any social media, you can get started with some really easy how-to’s like <a href="http://www.wikihow.com/Use-Twitter" target="_blank">this one</a>.</p>
<p>Alright, enough advertising for Twitter, I’m going back to work. </p>
Will Amazon Offer Analytics as a Service?
2012-01-05T15:50:16+00:00
http://simplystats.github.io/2012/01/05/will-amazon-offer-analytics-as-a-service
<p><a href="http://bits.blogs.nytimes.com/2012/01/04/will-amazon-offer-analytics-as-a-service/">Will Amazon Offer Analytics as a Service?</a></p>
Baltimore gun offenders and where academics don't live
2012-01-03T15:02:40+00:00
http://simplystats.github.io/2012/01/03/baltimore-gun-offenders-and-where-academics-dont-live
<p>Jeff recently posted <a href="http://simplystatistics.tumblr.com/post/15182715327/list-of-cities-states-with-open-data-help-me-find" target="_blank">links to data from cities and states</a>. He and I wrote <a href="http://rafalab.jhsph.edu/simplystats/gunviolations.R" target="_blank">R code</a> that plots gun offender locations for Baltimore. Specifically we plot the locations that appear on <a href="http://data.baltimorecity.gov/Crime/Gun-Offenders/aivj-4x23" target="_blank">this table</a>. I added locations of the Baltimore neighborhoods where most of our Hopkins colleagues live as well as the location of the medical institutions where we work. Note the corridor with no points between the West side (<a href="http://en.wikipedia.org/wiki/Barksdale_Organization" target="_blank">Barksdale</a> territory) and East side (<a href="http://rafalab.jhsph.edu/simplystats/propjoekima.jpg" target="_blank">Prop Joe</a> territory). Not surprisingly, academics don’t live near the gun offenders. </p>
<p><a href="http://rafalab.jhsph.edu/simplystats/baltimoreGunViolations.pdf" target="_blank"><img height="300" src="http://rafalab.jhsph.edu/simplystats/baltimoreGunViolations.png" width="300" /></a></p>
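<p>The linked R script does the real work; as a rough sketch of the general approach (not the original code), you can draw county outlines with the maps package and overlay longitude/latitude points from a data frame. The data frame below is hypothetical:</p>
<pre><code>## Sketch only: plot point locations over Maryland county outlines.
## `offenders` is a hypothetical data frame with columns `lon` and `lat`.
library(maps)
offenders <- data.frame(lon = c(-76.62, -76.58), lat = c(39.30, 39.31))
map("county", "maryland", col = "grey70")
points(offenders$lon, offenders$lat, pch = 20, col = "red", cex = 0.7)
</code></pre>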
List of cities/states with open data - help me find more!
2012-01-02T14:30:00+00:00
http://simplystats.github.io/2012/01/02/list-of-cities-states-with-open-data-help-me-find
<p>It’s the beginning of 2012 and statistics/data science has never been hotter. Some of the most important data is data collected about civic organizations. If you haven’t seen Bill Gates’s <a href="http://www.ted.com/talks/bill_gates_how_state_budgets_are_breaking_us_schools.html" target="_blank">TED Talk</a> about the importance of state budgets, you should watch it now. A major key to solving a lot of our economic problems lies in understanding and using data collected about cities and states. </p>
<p>U.S. cities and states are jumping on this idea and our own Baltimore was one of the <a href="http://www.americanprogress.org/issues/2007/04/citistat.html" target="_blank">earliest adopters</a>. I thought I’d make a list of all the cities that have made an effort to make civic data public. Here are a few I’ve found:</p>
<ul>
<li><a href="http://data.baltimorecity.gov/" target="_blank">Baltimore</a></li>
<li><a href="http://nycopendata.socrata.com/" target="_blank">New York City</a></li>
<li><a href="http://datasf.org/" target="_blank">San Francisco</a></li>
<li><a href="http://data.seattle.gov/" target="_blank">Seattle</a></li>
<li><a href="http://www.civicapps.org/" target="_blank">Portland</a></li>
<li><a href="http://www.cityofboston.gov/doit/databoston/app/data.aspx" target="_blank">Boston</a></li>
<li><a href="http://opensandiego.org/" target="_blank">San Diego</a> (not sure if this is official)</li>
<li><a href="http://data.cityofchicago.org/" target="_blank">Chicago</a></li>
<li><a href="http://data.austintexas.gov/" target="_blank">Austin</a></li>
<li><a href="http://data.dc.gov/" target="_blank">Washington D.C.</a></li>
<li><a href="http://opendataphilly.org/" target="_blank">Philadelphia</a> </li>
<li><a href="http://data.nola.gov/" target="_blank">New Orleans</a></li>
</ul>
<p>There are also open data sites for many states:</p>
<ul>
<li><a href="http://www.data.ca.gov/about" target="_blank">California</a></li>
<li><a href="http://data.wa.gov/" target="_blank">Washington</a></li>
<li><a href="http://data.oregon.gov/" target="_blank">Oregon</a></li>
<li><a href="http://data.illinois.gov/" target="_blank">Illinois</a></li>
<li><a href="http://www.utah.gov/data/" target="_blank">Utah</a></li>
<li><a href="http://maineopengov.org/" target="_blank">Maine</a></li>
</ul>
<p>Civic organizations are realizing that opening their data through <a href="http://simplystatistics.tumblr.com/post/11237403492/apis" target="_blank">APIs</a> or by hosting <a href="http://kaggle.com/" target="_blank">competitions</a> can lead to greater transparency, good advertising, and new and useful applications. If I had one data-related wish for 2012, it would be that the critical mass of data/statistics knowledge being developed could be used with these data to help solve some of our most pressing problems. </p>
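<p>Most of these portals can export any dataset as a CSV file, so pulling civic data into R is a one-liner. Here is a minimal sketch; the URL below is hypothetical and should be replaced with the export link for a real dataset on one of the sites above:</p>
<pre><code>## Hypothetical example: read a CSV export from an open data portal into R.
url <- "https://data.example.gov/api/views/abcd-1234/rows.csv"
civic <- read.csv(url, stringsAsFactors = FALSE)
head(civic)
</code></pre>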
<p><strong>Update:</strong> Oh Canada! In the comments <a href="http://twitter.com/#!/aruhil" target="_blank">Ani Ruhil</a> points to some Canadian cities/provinces with open data pages. </p>
<ul>
<li><a href="http://data.vancouver.ca/" target="_blank">Vancouver</a></li>
<li><a href="http://www1.toronto.ca/wps/portal/open_data/open_data_home?vgnextoid=b3886aa8cc819210VgnVCM10000067d60f89RCRD" target="_blank">Toronto</a></li>
<li><a href="http://www.citywindsor.ca/003713.asp" target="_blank">Windsor</a></li>
<li><a href="http://www.ottawa.ca/online_services/opendata/index_en.html" target="_blank">Ottawa</a></li>
<li><a href="http://data.edmonton.ca/" target="_blank">Edmonton</a></li>
<li><a href="http://www.data.gov.bc.ca/" target="_blank">British Columbia</a></li>
</ul>
Grad students in (bio)statistics - do a postdoc!
2011-12-28T19:25:00+00:00
http://simplystats.github.io/2011/12/28/grad-students-in-bio-statistics-do-a-postdoc
<p>Up until about 20 years ago, postdocs were scarce in Statistics. In contrast, during the same time period, it was rare for a Biology PhD to go straight into a tenure track position.</p>
<p>Driven mostly by the availability of research funding for those working in applied areas, postdocs are becoming much more common in our field and I think this is great. It is great for PhD students to expand their horizons during two years in which they don’t have to worry about teaching, committee meetings, or grant writing. It is also great for those of us fortunate enough to work with well-trained, independent, energetic, bright, and motivated fresh PhDs. Many of our best graduates are electing to postpone their entry into tenure track jobs in favor of postdocs. Also students from other fields, computer science and engineering in particular, are taking postdocs with statisticians. I think these are both good trends. If they continue, the result will be that, as a field, we will become more well-rounded and productive. </p>
<p>This trend has been particularly beneficial for me. Most of the postdocs I have hired have come to me with a CV worthy of a tenure track job. They have been independent and worked more as collaborators than advisees. So why pass on more $ and prestige? A PhD in Statistics/Computer Science/Engineering can be on a very specific topic and students may not gain any collaborative experience whatsoever. A postdoc at Hopkins Biostat provides a new experience in a highly collaborative environment, with access to world leaders in the biomedical sciences, and where we focus on development of applied tools. The experience can also improve a student’s visibility and job prospects, while delaying the tenure clock until they have more publications under their belts.</p>
<p>An important thing you should be aware of is that in many departments you can negotiate the start of a tenure track position. So seriously consider taking 1-2 years of almost 100% research time before commencing the grind of a tenure track job. </p>
<p>I’m not the only one who thinks postdocs are a good thing for our field and for biostatistics students. The column below was written by Terry Speed in November 2003 and is reprinted with permission from the IMS Bulletin, <a href="http://bulletin.imstat.org/" target="_blank">http://bulletin.imstat.org</a></p>
<p class="p1">
<strong>In Praise of Postdocs</strong>
</p>
<p class="p2">
<span class="s1">I don’t know what proportion </span>of IMS members have PhDs (or an equivalent) in probability or statistics, but I’d guess it’s fairly high. I don’t know what proportion of those that do have PhDs would also have formal post-doctoral research experience, but here I’d guess it’s rather low.
</p>
<p class="p3">
Why? One possible reason is that for much of the last 40 years, anyone completing a PhD in prob or stat and wanting a research career, could go straight into one. Prospective employers of people with PhDs in our field—be they universities, research institutes, national labs or companies—don’t require their novices to have completed a postdoc, and most graduating PhDs are only too happy to go straight into their first job.
</p>
<p class="p3">
This is in sharp contrast with the biological and physical sciences, where it is rare to appoint someone to a tenure-track faculty or research scientist position without their having completed one or more postdocs.
</p>
<p class="p3">
The number of people doing postdocs in probability or statistics has been growing over the last 15 years. This is in part due to the arrival on the scene of institutes such as the MSRI, IMA, IPAM, NISS, NCAR, and recently the MBI and SAMSI in the US, the Newton Institute in the UK, the Fields Institute in Canada, the Institut Henri Poincaré in France, and others elsewhere around the world. In such institutes short-term postdoc positions go with their current research programs, and there are usually a smaller number continuing for longer periods.
</p>
<p class="p3">
It is also the case that an increasing number of senior researchers are being awarded research funds to support postdocs in prob or stat, often in the newer, applied areas such as computational biology.
</p>
<p class="p3">
And finally, it has long been the case that many countries (Germany, Sweden, Switzerland, and the US, to name a few) have national grants supporting postdoctoral research in their own or, even better, another country. I think all of this is great, and would like to see this trend continue and strengthen.
</p>
<p class="p3">
Why do I think postdocs are a good thing? And why do I think young probabilists and statisticians should do one, even when they can get a good job without having done so?
</p>
<p class="p3">
For most of us, doing a PhD means getting totally absorbed in some relatively narrow research area for 2–3 years, treating that as the most important part of science for that time, and trying to produce some of the best work in that area. This is fine, and we get a PhD for our efforts, but is it good training for a lifelong research career? While it is obviously good preparation for doing more of the same, I don’t think it is adequate for research in general. I regard the successful completion of a PhD as (at least) evidence that the person in question can do research, but it doesn’t follow that they can go on and successfully do research in a new area, or in a different environment, or without close supervision.
</p>
<p class="p3">
Postdocs give you the chance to broaden, to learn new technical skills, to become acquainted with new areas, and to absorb the culture of a new institution, all at a time when your professional responsibilities are far fewer than they would have been had you taken that first “real” job. The postdoc period can be a wonderful time in your scientific life, one which sees you blossom, building on the confidence you gained by having completed your PhD, in what is still essentially a learning environment, but one where you can follow your own interests, explore new areas, and still make mistakes. At the worst, you have delayed your entry into the workforce two or three years, and you can still keep on working in your PhD area if you wish. The number of openings for researchers in prob or stat doesn’t fluctuate so much on this time scale, so you are unlikely to be worse off than the earnings foregone. At best, you will move into a completely new area of research, one much better suited to your personal interests and skills, perhaps also better suited to market demand, but either way, one chosen with your PhD experience behind you. This can greatly enhance your long-term career prospects and more than compensate for your delayed entry into the workforce.
</p>
<p class="p3">
<em>Students: </em>the time to think about this is <span class="s2">now [November]</span>, not just as you are about to file your dissertation. And the choice is not necessarily one between immediate security and career development: you might be able to have both. You shouldn’t shy from applying for tenure-track jobs and postdocs at the same time, and if offered the job you want, requesting (say) two years’ leave of absence to do the postdoc you want. Employers who care about your career development are unlikely to react badly to such a request.
</p>
An R function to map your Twitter Followers
2011-12-21T17:11:00+00:00
http://simplystats.github.io/2011/12/21/an-r-function-to-map-your-twitter-followers
<p>I wrote a little function to make a personalized map of who follows you or who you follow on Twitter. The idea for this function was inspired by some plots I discussed in a <a href="http://simplystatistics.tumblr.com/post/11614784508/spectacular-plots-made-entirely-in-r" target="_blank">previous post</a>. I also found a lot of really useful code over at flowing data <a href="http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/" target="_blank">here</a>. </p>
<p>The function uses the packages twitteR, maps, geosphere, and RColorBrewer. If you don’t have the packages installed, when you source the twitterMap code, it will try to install them for you. The code also requires you to have a working internet connection. </p>
<p><em>One word of warning is that if you have a large number of followers or people you follow, you may be rate limited by Twitter and unable to make the plot.</em></p>
<p>To make your personalized twitter map, first source the <a href="http://biostat.jhsph.edu/~jleek/code/twitterMap.R" target="_blank">function</a>:</p>
<blockquote>
<p>source("http://biostat.jhsph.edu/~jleek/code/twitterMap.R")</p>
</blockquote>
<p>The function has the following form: </p>
<p>twitterMap <- function(userName, userLocation=NULL, fileName="twitterMap.pdf", nMax = 1000, plotType=c("followers", "both", "following"))</p>
<p>with arguments:</p>
<ul>
<li>userName - the twitter username you want to plot</li>
<li>userLocation - an optional argument giving the location of the user, necessary when the location information you have provided Twitter isn’t sufficient for us to find latitude/longitude data</li>
<li>fileName - the file where you want the plot to appear</li>
<li>nMax - The maximum number of followers/following to get from Twitter; this is implemented to avoid rate limiting for people with large numbers of followers. </li>
<li>plotType - if “both” both followers/following are plotted, etc. </li>
</ul>
<p>Then you can create a plot with both followers/following like so: </p>
<blockquote>
<p> twitterMap("simplystats")</p>
</blockquote>
<p>Here is what the resulting plot looks like for our Twitter Account:</p>
<p><img height="550" src="http://biostat.jhsph.edu/~jleek/code/simplystats.png" width="500" /></p>
<p>If your location can’t be found or latitude/longitude can’t be calculated, you may have to choose a bigger city near you. The list of cities used by twitterMap can be found like so:</p>
<blockquote>
<p>library(maps)</p>
</blockquote>
<blockquote>
<p>data(world.cities)</p>
</blockquote>
<blockquote>
<p>grep("Baltimore", world.cities[,1])</p>
</blockquote>
<p>If your city is in the database, this will return the row number of the world.cities data frame corresponding to your city. </p>
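<p>If the grep above turns up a nearby big city, you can then pass its name directly to the function. A hypothetical example, assuming Baltimore is the closest large city in the database:</p>
<blockquote>
<p>twitterMap("simplystats", userLocation = "Baltimore")</p>
</blockquote>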
<div>
If you like this function you may also like our function to determine if you are a <a href="http://simplystatistics.tumblr.com/post/11271228367/datascientist" target="_blank">data scientist</a> or to analyze your <a href="http://simplystatistics.tumblr.com/post/13203811645/an-r-function-to-analyze-your-google-scholar-citations" target="_blank">Google Scholar citations page</a>.
</div>
<div>
<strong>Update</strong>: The bulk of the heavy lifting done by these functions is performed by Jeff Gentry’s very nice <a href="http://cran.r-project.org/web/packages/twitteR/" target="_blank">twitteR </a>package and <a href="http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/" target="_blank">code</a> put together by Nathan Yau over at FlowingData. This is really an example of standing on the shoulders of giants.
</div>
On Hard and Soft Money
2011-12-19T13:10:06+00:00
http://simplystats.github.io/2011/12/19/on-hard-and-soft-money
<p>As the academic job hunting season goes into effect many will be applying to a variety of different types of departments. In statistics, there is a pretty big separation between statistics departments, which tend to be in arts & sciences colleges, and biostatistics departments, which tend to be in medical or public health institutions. A key difference between these two types of departments is the funding model.</p>
<!-- more -->
<p>Statistics department faculty tend to be on 9- or 10-month salaries with funding primarily coming from teaching classes (research funding can be obtained for the summer months). Biostatistics departments faculty tend to have 12-month salaries with a large chunk of funding coming from research grants. Statistics departments are sometimes called “hard money” departments (i.e. tuition money is “hard”) while biostatistics departments are “soft money”. Grant money is considered “soft” because it has a tendency to go away a bit more easily. As long as students want to attend a university, there will always be tuition.</p>
<p>The biostatistics department at Johns Hopkins is a soft money department. We tend to get the bulk of our salaries from research project grants. Statisticians can play two roles on research grants: as a co-investigator/collaborator and as a principal investigator (PI). I guess that’s true of anyone, but statisticians are very commonly part of research projects as co-investigators because pretty much every research project these days will need statistical advice or methodological development. Researchers often have trouble getting their grants funded if they don’t have a statistician on board. So there’s often plenty of funding to go around for statisticians. But the real problem is getting enough time to do the research <em>you</em> want to do. If you’re spending all your time doing other people’s work, then sure you’re getting paid, but you’re not getting things done that will advance your career.</p>
<p>In a soft money department, I can think of two ways to go. The first is to write your own grants with you as the PI. That way you can guarantee funding for yourself to do the things you find interesting (assuming your grant is funded!). The other approach is to collaborate on a project where the work you need to do is work you would have done anyway. That can be a happy coincidence because then you don’t have to deal with the administrative burden of running a research project. But this approach relies a bit on luck and on the research environment at your institution.</p>
<p>Many job candidates tell me that they are worried about working in a soft money department because if they can’t get their grants funded then they will be in some sort of trouble. In hard money departments, at least the majority of their salary is guaranteed by the teaching they do. This is true to some extent, but I contend that they are worrying about the wrong thing, mainly money.</p>
<p>What job candidates should <em>really</em> be worried about is whether the department will support them in their career. Candidates should be looking for departments that mentor their junior faculty and create an environment in which it will be easy to succeed. If you’re in a department that routinely hangs their junior faculty out to dry, you can have all the hard money you want and you’ll still be unhappy. A soft money department that supports their junior faculty will make sure the right structure is in place for faculty to succeed. </p>
<p>Here are some things to look out for in any department, but perhaps more so in a soft money department:</p>
<ul>
<li>Is there administrative support staff to help with writing grants, i.e. for drafting budgets, assembling biosketches, and other paperwork?</li>
<li>Are there senior faculty around who have successfully written grants and would be willing to read your grants and give you feedback?</li>
<li>Is the environment there sufficient for you to do the things you want to do? For example, are there excellent collaborators for you to work with? Powerful computing support? All these things will help you get an edge over people who don’t have easy access to these resources.</li>
</ul>
<p>Besides having a good idea, the environment can play a key role in writing a good grant. For starters, if all your collaborators are in the same building as you, it makes it a lot easier to coordinate meetings to discuss ideas and to do the preparation. If you’re trying to work with 4 different people in 4 different institutions (maybe in different timezones), things just get a little harder and maybe you don’t get the feedback you need.</p>
<p>Similarly, if you have a strong computing infrastructure in place, then you can test it out beforehand and see what its capabilities are. If you need to purchase the same infrastructure for yourself as part of a grant, then you won’t know what it can do until you get and set it up. In our department, we are constantly buying new systems for our computing center and there are <em>always</em> glitches in the beginning with new equipment and new software. If you can avoid having to do this, it makes the grant a lot easier to write.</p>
<p>Lastly, I’ll just say that if you’re in the position of applying for tenure-track academic jobs, you’re probably not lazy. So you’re going to do your work no matter where you go. You just need to find a place where you can get things done. </p>
New features on Simply Statistics
2011-12-18T20:42:51+00:00
http://simplystats.github.io/2011/12/18/new-features-on-simply-statistics
<p>Check out our <a href="http://simplystatistics.tumblr.com/editorspicks" target="_blank">Editor’s Picks</a> and <a href="http://simplystatistics.tumblr.com/interviews" target="_blank">Interviews</a> pages. </p>
In Greece, a statistician faces life in prison for doing his job: calculating and reporting a statistic
2011-12-16T20:00:00+00:00
http://simplystats.github.io/2011/12/16/in-greece-a-statistician-faces-life-in-prison-for
<p>In a <a href="http://simplystatistics.tumblr.com/post/13945953822/interview-w-mario-marazzi-puerto-rico-institute-of" target="_blank">recent post</a> I described the importance of government statisticians. Well, apparently in Greece <a href="http://www.npr.org/blogs/money/2011/12/16/143766906/a-technocrat-in-trouble#more" target="_blank">it is a dangerous job</a>, as <span>Andreas Georgiou, the person in charge of the </span><span>Greek statistics office, found out.</span></p>
<blockquote>
<p><span>So far, though, his efforts have been met with resistance, strikes and a criminal investigation that could lead to life in prison for Georgiou.</span></p>
</blockquote>
<p><span>What are his efforts ?</span></p>
<blockquote>
<p><span>His first priority after he was appointed was to figure out how big Greece’s deficit really was back in 2009, when the crisis began. He looked through all the data and concluded that Greece’s deficit that year was 15.8 percent of GDP — higher than what had previously been reported.</span></p>
<p><span>Eurostat, the central authority in Brussels, praised Georgiou’s methodology and blessed the number as true. The hundreds of Greek people who work beneath Georgiou — the old guard — did not.</span></p>
</blockquote>
<p>So in response, the “old guard” decided to vote on the summary statistic:</p>
<blockquote>
<p><span>Skordas sits on a governing board for the statistics office. His board wanted to debate and vote on the deficit number before anyone in Brussels was allowed to see it. Georgiou, the technocrat, saw that as a threat to his independence. He refused. The number is the number, he said. It’s not something to be put up for a vote.</span></p>
</blockquote>
<p><span>Did they perform a Bayesian analysis based on the vote?</span></p>
Interview with Nathan Yau of FlowingData
2011-12-16T12:51:35+00:00
http://simplystats.github.io/2011/12/16/interview-with-nathan-yau-of-flowingdata
<div class="im">
<strong>Nathan Yau</strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<img height="400" src="http://directory.stat.ucla.edu/images/nathan-yau/1.jpg?1287095045" width="250" />
</div>
<div class="im">
Nathan Yau is a graduate student in statistics at UCLA and the author of the extremely popular data visualization blog <a href="http://flowingdata.com/" target="_blank">flowingdata.com</a>. He recently published a book <a href="http://book.flowingdata.com/" target="_blank">Visualize This</a>-a really nice guide to modern data visualization using R, Illustrator and Javascript - which should be on the bookshelf of any statistician working on data visualization.
</div>
<div class="im">
<strong>Do you consider yourself a statistician/data scientist/or something else?</strong>
</div>
<p><span>Statistician. I feel like statisticians can call themselves data scientists, but not the other way around. Although with data scientists there’s an implied knowledge of programming, which statisticians need to get better at.</span></p>
<div class="im">
<strong>Who have been good mentors to you and what qualities have been most helpful for you?</strong>
<p>
I’m visualization-focused, and I really got into the area during a summer internship at The New York Times. Before that, I mostly made graphs in R for reports. I learned a lot about telling stories with data and presenting data to a general audience, and that has stuck with me ever since.
</p>
</div>
<p><span>Similarly, my adviser Mark Hansen has shown me how data is more free-flowing and intertwined with everything. It’s hard to describe. I mean coming into graduate school, I thought in terms of datasets and databases, but now I see it as something more organic. I think that helps me see what the data is about more clearly.</span></p>
<div class="im">
<strong>How did you get into statistics/data visualization?</strong>
In undergrad, an introduction to statistics (for engineering) actually pulled me in. The professor taught with so much energy, and the material sort of clicked with me. My friends who were also taking the course complained and had trouble with it, but I wanted more for some reason. I eventually switched from electrical engineering to statistics.
</div>
<p><span>I got into visualization during my first year in grad school. My adviser gave a presentation on visualization, but from a media arts perspective rather than a charts-and-graphs-in-R-Tufte point of view. I went home after that class, googled visualization and that was that.</span></p>
<div class="im">
<strong>Why do you think there has been an explosion of interest in data visualization?</strong>
</div>
<p><span>The Web is a really visual place, so it’s easy for good visualization to spread. It’s also easier for a general audience to read a graph than it is to understand statistical concepts. And from a more analytical point of view, there’s just a growing amount of data and visualization is a good way to poke around.</span></p>
<div class="im">
<strong>Other than R, what tools should students learn to improve their data visualizations?</strong>
</div>
<p><span>For static graphics, I use Illustrator all the time to bring storytelling into the mix or to just provide some polish. For interactive graphics on the Web, it’s all about JavaScript nowadays. D3, Raphael.js, and Processing.js are all good libraries to get started.</span></p>
<div class="im">
<strong>Do you think the rise of infographics has led to a “watering down” of data visualization?</strong>
So I actually just wrote <a href="http://flowingdata.com/2011/12/08/on-low-quality-infographics" target="_blank">a post</a> along these lines. It’s true that there are a lot of low-quality infographics, but I don’t think that takes away from visualization at all. It makes good work more obvious. I think the flood of infographics is a good indicator of people’s eagerness to read data.</div>
<div class="im">
<strong>How did you decide to write your book “Visualize This”?</strong>
</div>
<div class="im">
Pretty simple. I get emails and comments all the time when I post graphics on FlowingData that ask how something was done. There aren’t many resources that show people how to do that. There are books that describe what makes good graphics but don’t say anything about how to actually go about doing it, and there are programming books for say, R, but are too technical for most and aren’t visualization-centric. I wanted to write a book that I wish I had in the early days.
</div>
<div class="im">
<strong>Any final thoughts on statistics, data and visualization? </strong>
</div>
<div class="im">
Keep an open mind. Oftentimes, statisticians seem to box themselves into positions of analysis and reports. Statistics is an applied field though, and now more than ever, there are opportunities to work anywhere there is data, which is practically everywhere.
</div>
Dear editors/associate editors/referees, Please reject my papers quickly
2011-12-14T16:36:41+00:00
http://simplystats.github.io/2011/12/14/dear-editors-associate-editors-referees-please-reject
<p>The review times for most journals in our field are ridiculous. Check out Figure 1 <a href="http://www.stat.tamu.edu/~carroll/ftp/2001.papers.directory/times.pdf" target="_blank">here</a>. A careful review takes time, but not six months. Let’s be honest, those papers are sitting on desks for the great majority of those six months. But here is what really kills me: waiting six months for a review basically saying the paper is not of sufficient interest to the readership of the journal. That decision you can come to in half a day. If you don’t have time, don’t accept the responsibility to review a paper.</p>
<p>I like sharing my work with my statistician colleagues, but the Biology journals never do this to me. When my paper is not of sufficient interest, these journals reject me in days, not months. I sometimes work on topics that are fast paced, and many of my competitors are not statisticians. If I have to wait six months for each rejection, I can’t compete. By the time the top three applied statistics journals reject the paper, more than a year goes by and the paper is no longer novel. Meanwhile I can go through Nature Methods, Genome Research, and Bioinformatics in less than 3 months.</p>
<p>Nick Jewell once shared an idea that I really liked. It goes something like this. Journals in our field will accept every paper that is correct. The editorial board, with the help of referees, assigns each paper into one of five categories A, B, C, D, E based on novelty, importance, etc… If you don’t like the category you are assigned, you can try your luck elsewhere. But before you go, note that the paper’s category can improve after publication based on readership feedback. While we wait for this idea to get implemented, I ask that if you get one of my papers and you don’t like it, please reject it quickly. You can write this review: “This paper rubbed me the wrong way and I heard you like being rejected fast so that’s all I am going to say.” Your comments and critiques are valuable, but not worth the six-month wait. </p>
<p>ps - I have to admit that the newer journals have not been bad to me in this regard. Unfortunately, for the sake of my students/postdocs going into the job market and my untenured junior colleagues, I feel I have to try the established top journals first as they still impress more on a CV.</p>
Smoking is a choice, breathing is not.
2011-12-14T13:58:52+00:00
http://simplystats.github.io/2011/12/14/smoking-is-a-choice-breathing-is-not
<p>Over the last week or so I’ve been posting about the air pollution levels in Beijing, China. The twitter feed from the US Embassy there makes it easy to track the hourly levels of fine particulate matter (PM2.5) and you can use <a href="http://www.biostat.jhsph.edu/~rpeng/makeBeijingAirGraph.R" target="_blank">this R code</a> to make a graph of the data.</p>
<p>One problem with talking about particulate matter levels is that the units are a bit abstract. We usually talk in terms of micrograms per cubic meter (mcg/m^3), which is a certain mass of particles per volume of air. The 24-hour national ambient air quality standard for fine PM in the US is 35 mcg/m^3. But what does that mean in reality?</p>
<p>C. Arden Pope III and colleagues recently wrote an interesting paper in <em>Environmental Health Perspectives</em> on the <a href="http://www.ncbi.nlm.nih.gov/pubmed/21768054" target="_blank">dose-response relationship between particles and lung cancer and cardiovascular disease</a>. They combined data from air pollution studies and smoking studies to estimate the dose-response curve for a very large range of PM levels. Ambient air pollution, not surprisingly, is on the low-end of PM exposure, followed by second hand smoke, followed by active smoking. One challenge they faced is putting everything on the same scale in terms of PM exposure so that the different studies could be compared.</p>
<p>Here are the important details: On average, actively smoking a cigarette generates a dose of about 12 milligrams (mg) of particulate matter. Daily inhalation rates obviously depend on your size, age, physical activity, health, and other factors, but in adults they range from about 13 to 23 cubic meters of air per day. For convenience, I’ll just take the midpoint of that range, which is 18 cubic meters per day.</p>
<p>If your city’s fine PM levels were compliant with the US national standard of 35 mcg/m^3, then in the worst case scenario you’d be breathing in about 630 micrograms of particles per day, which is about 0.05 cigarettes (1 cigarette every 20 days). Sounds like it’s not too bad, but keep in mind that most of the increase in risk from smoking is seen in the low range of the dose-response curve (although this is obviously very low).</p>
<p>If we move now to Beijing, where 24-hour average levels can easily reach up to 300 mcg/m^3 (and <a href="http://www.npr.org/2011/12/07/143214875/clean-air-a-luxury-in-beijings-pollution-zone" target="_blank">indoor levels can reach 200 mcg/m^3</a>), then we’re talking about a daily dose of almost half a cigarette. Now, half a cigarette might still seem like not that much, but keep in mind that <em>pretty much everyone is exposed</em>: old and young, sick and healthy. Not everyone gets the same dose because of variation in inhalation rates, but even the low end of the range gives you about 0.3 cigarettes. </p>
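<p>To make the conversion explicit, here is a small R helper (a sketch based on the numbers above: roughly 12 mg of particulate matter per cigarette and a default inhalation rate of 18 cubic meters per day) that turns an average PM2.5 concentration into cigarette-equivalents per day:</p>
<pre><code>## Sketch: convert an average PM2.5 level (mcg/m^3) into cigarette-equivalents
## per day, assuming ~12 mg of particulate matter per cigarette and a daily
## inhalation rate in cubic meters (18 m^3 is the midpoint used above).
cigaretteEquivalents <- function(pm25, inhalation = 18, mgPerCig = 12) {
  dailyDoseMcg <- pm25 * inhalation      # micrograms of PM inhaled per day
  dailyDoseMcg / (mgPerCig * 1000)       # cigarettes per day
}

cigaretteEquivalents(35)    # US 24-hour standard: about 0.05 cigarettes/day
cigaretteEquivalents(300)   # a bad day in Beijing: about 0.45 cigarettes/day
</code></pre>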
<p>Beijing is hardly alone here, as a number of studies in Asian cities show comparable levels of fine PM. I’ve redone my previous plot of PM2.5 in Beijing in terms of the number of cigarettes per day. Here’s the last 2 months in Beijing (for an average adult).</p>
<p><img src="http://media.tumblr.com/tumblr_lw59ogIhq81r08wvg.png" alt="" /></p>
The Supreme Court's interpretation of statistical correlation may determine the future of personalized medicine
2011-12-12T23:02:40+00:00
http://simplystats.github.io/2011/12/12/the-supreme-courts-interpretation-of-statistical
<p><strong>Summary/Background</strong></p>
<p>The Supreme Court heard <a href="http://www.supremecourt.gov/oral_arguments/argument_transcripts/10-1150.pdf" target="_blank">oral arguments</a> last week in the case Mayo Collaborative Services vs. Prometheus Laboratories (<a href="http://www.supremecourt.gov/Search.aspx?FileName=/docketfiles/10-1150.htm" target="_blank">No 10-1150</a>). At issue is a patent Prometheus Laboratories holds for making decisions about the treatment of disease on the basis of a measurement of a specific, naturally occurring molecule and a corresponding calculation. The specific language at issue is a little technical, but the key claim from the patent under dispute is:</p>
<blockquote>
<ol>
<li>A method of optimizing therapeutic efficacy for treatment of an immune-mediated gastrointestinal disorder, comprising:</li>
</ol>
<p>(a) administering a drug providing 6-thioguanine to a subject having said immune-mediated gastrointestinal disorder; and</p>
<p>(b) determining the level of 6-thioguanine in said subject having said immune-mediated gastrointestinal disorder,</p>
<p>wherein the level of 6-thioguanine less than about 230 pmol per 8x10^8 red blood cells indicates a need to increase the amount of said drug subsequently administered to said subject and</p>
<p>wherein the level of 6-thioguanine greater than about 400 pmol per 8x10^8 red blood cells indicates a need to decrease the amount of said drug subsequently administered to said subject.</p>
</blockquote>
<p>So basically the patent is on a decision made about treatment on the basis of a statistical correlation. When the levels of a specific molecule (6-thioguanine) are too high, the dose of a drug (thiopurine) should be decreased; if they are too low, the dose should be increased. Here (and throughout the post) correlation is interpreted loosely as a relationship between two variables, rather than in the strict sense of a linear relationship between two quantitative variables.</p>
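<p>As a purely illustrative sketch (mine, not language from the patent or the briefs), the disputed claim boils down to a simple threshold rule, the kind of thing that fits in a few lines of R:</p>
<pre><code>## Illustrative only: the threshold rule described in the patent claim,
## with the 6-thioguanine level measured in pmol per 8x10^8 red blood cells.
doseDecision <- function(level6TG) {
  if (level6TG < 230) {
    "increase dose"           # below about 230: increase the drug
  } else if (level6TG > 400) {
    "decrease dose"           # above about 400: decrease the drug
  } else {
    "no change indicated"     # in between: no change directed (my reading)
  }
}

doseDecision(180)  # "increase dose"
doseDecision(450)  # "decrease dose"
</code></pre>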
<p>This correlation between levels of 6-thioguanine and patient response was first reported by a group of academics in a <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1383347/" target="_blank">paper</a> in 1996. Prometheus developed a diagnostic test based on this correlation. Doctors (including those at the Mayo clinic) would draw blood, send it to Prometheus, who would calculate the levels of 6-thioguanine and report them back.</p>
<p>According to Mayo’s <a href="http://www.patents4life.com/wp-content/uploads/2011/11/10-1150_petitioner.authcheckdam.pdf" target="_blank">brief</a>, some Doctors at the Mayo, who used this test, decided it was possible to improve on the test. So they developed their own diagnostic test, based on a different measurement of 6-thioguanine (6-TGN) and reported different information including:</p>
<blockquote>
<ul>
<li>A blood reading greater than 235 picomoles of 6-TGN is a “target therapeutic range,” and a reading greater than 250 picomoles of 6-TGN is associated with remission in adult patients; and</li>
<li>A blood reading greater than 450 picomoles of 6-TGN indicates possible adverse health effects, but in some instances levels over 700 are associated with remission without significant toxicity, while a “clearly defined toxic level” has not been established; and</li>
<li>A blood reading greater than 5700 picomoles of 6-MMP is possibly toxic to the liver.</li>
</ul>
</blockquote>
<p>They subsequently created their own proprietary test and started to market that test. At which point Prometheus sued the Mayo Clinic for infringement. The most recent decision on the case was made by a federal circuit court who upheld Prometheus’ claim. A useful summary is <a href="http://www.scotusblog.com/2011/12/argument-preview-patients-reaction-patents-scope/" target="_blank">here</a>.</p>
<p>The arguments for the two sides are summarized in the briefs for each side; for <a href="http://www.patents4life.com/wp-content/uploads/2011/11/10-1150_petitioner.authcheckdam.pdf" target="_blank">Mayo</a>:</p>
<blockquote>
<p>Whether 35 U.S.C. § 101 is satisfied by a patent claim that covers observed correlations between blood test results and patient health, so that the patent effectively preempts use of the naturally occurring correlations, simply because well-known methods used to administer prescription drugs and test blood may involve “transformations” of body chemistry.</p>
</blockquote>
<p>and for <a href="http://www.patentlyo.com/files/2011-10-31_prometheus-merits-brief.pdf" target="_blank">Prometheus</a>:</p>
<blockquote>
<p>Whether the Federal Circuit correctly held that concrete methods for improving the treatment of patients suffering from autoimmune diseases by using individualized metabolite measurements to inform the calibration of the patient’s dosages of synthetic thiopurines are patentable processes under 35 U.S.C. §101.</p>
</blockquote>
<p>Basically, Prometheus claims that the patent covers cases where doctors observe a specific data point and make a decision about a specific drug on the basis of that data point and a known correlation with patient outcomes. Mayo, on the other hand, says that since the correlation between the data and the outcome are naturally occurring processes, they can not be patented.</p>
<p>In the oral arguments, the attorney for Mayo makes the claim that the test is only patentable if Prometheus specifies a specific level for 6-thioguanine and a specific treatment associated with that level (see page 21-24 of the <a href="http://www.supremecourt.gov/oral_arguments/argument_transcripts/10-1150.pdf" target="_blank">transcript</a>). He then goes on to suggest that the Mayo would then be free to pick another level and another treatment option for their diagnostic test. Justice Breyer disagrees even with this specific option (see page 38 of the transcript and his fertilizer example). He has made this view known before in his <a href="http://www.supremecourt.gov/opinions/05pdf/04-607.pdf" target="_blank">dissent</a> to the dismissal of the Labcorp writ of certori (a very similar case focusing on whether a correlation can be patented).</p>
<p><strong>Brief summary:</strong> <em>Prometheus is trying to patent a correlation between a molecule’s level and treatment decisions. Mayo is claiming this is a natural process and can’t be patented.</em></p>
<p><strong>Implications for Personalized Medicine (a statistician’s perspective)</strong></p>
<p>I believe this case has major potential consequences for the entire field of personalized medicine. The fundamental idea of personalized medicine is that treatment decisions for individual patients will be tailored on the basis of data collected about them and statistical calculations made on the basis of that data (i.e. correlations, or more complicated statistical functions).</p>
<p>According to my interpretation, if the Supreme Court rules in favor of Mayo in a broad sense, then this suggests that decisions about treatment made on the basis of data and correlation are not broadly patentable. In both the Labcorp dissent and the oral arguments for the Prometheus case, Justice Breyer argues that the process described by the patents:</p>
<blockquote>
<p>…instructs the user to (1) obtain test results and (2) think about them.</p>
</blockquote>
<p>He suggests that these are natural correlations and hence can not be patented, just the way a formula like E = mc^2 can not be patented. The distinction seems to be subtle, where E=mc^2 is a formula that exactly describes a property of nature, the observed correlation is an empirical estimate of a parameter calculated on the basis of noisy data.</p>
<p>From a statistical perspective, there is little difference between calculating a correlation and calculating something more complicated, like the Oncotype DX <a href="http://www.oncotypedx.com/" target="_blank">signature</a>. Both return a score that can be used to determine treatment or other health care decisions. In some sense, they are both “natural phenomena” - one is just more complicated to calculate than the other. So it is not surprising that Genomic Health, the developers of Oncotype, have filed an <a href="http://www.americanbar.org/content/dam/aba/publications/supreme_court_preview/briefs/10-1150_respondentamcu6personalizedmedicalgrps.authcheckdam.pdf" target="_blank">amicus</a> in favor of Prometheus.</p>
<p>Once a score is calculated, regardless of the level of complication in calculating that score, the personalized decision still comes down to a decision made by a doctor on the basis of a number. So if the court broadly decides in favor of Mayo, from a statistical perspective, this would seemingly pre-empt patenting any personalized medicine decision made on the basis of observing data and making a calculation.</p>
<p>Unlike traditional medical procedures like surgery, or treatment with a drug, these procedures are based on data and statistics. But in the same way, a very specific set of operations and decisions is taken with the goal of improving patient health. If these procedures are broadly ruled as simply “natural phenomena”, it suggests that the development of personalized decision making strategies is not, itself, patentable. This decision would also have implications for other companies that use data and statistics to make money, like software giant SAP, which has also filed an <a href="http://www.americanbar.org/content/dam/aba/publications/supreme_court_preview/briefs/10-1150_amcusapamerica.authcheckdam.pdf" target="_blank">amicus brief</a> in support of the federal circuit court opinion (and hence Prometheus).</p>
<p>A large component of medical treatment in the future will likely be made on the basis of data and statistical calculations on those data - that is the goal of personalized medicine. So the Supreme Court’s decision about the patentability of correlation has seemingly huge implications for any decision made on the basis of data and statistical calculations. Regardless of the outcome, this case lends even further weight to the idea that statistical literacy is critical, including for Supreme Court justices.</p>
<p>Simply Statistics will be following this case closely; look for more in depth analysis in future blog posts.</p>
Interview w/ Mario Marazzi, Puerto Rico Institute of Statistics Director, on the importance of Government Statisticians
2011-12-09T01:14:00+00:00
http://simplystats.github.io/2011/12/09/interview-w-mario-marazzi-puerto-rico-institute-of
<p class="MsoNormal">
[Scroll down for the Spanish translation]
</p>
<p class="MsoNormal">
In my opinion, the importance of government statisticians is underappreciated. In the US, agencies such as the CDC, the Census Bureau, and the Bureau of Labor Statistics employ statisticians to help collect and analyze data that contribute to important policy decisions. How many students will enroll in public schools this year? Is there a type II diabetes epidemic? Is unemployment rising? How many homeless people are in Los Angeles? The answers to these questions can guide policy and spending decisions and they can’t be answered without the help of the government statisticians that collect and analyze relevant data.
</p>
<p class="MsoNormal">
<img align="middle" height="181" src="http://www.primerahora.com/XStatic/primerahora/images/espanol/ph20100930marazzi.jpg" width="141" />
</p>
<p class="MsoNormal">
Until recently the Puerto Rican government had no formal mechanisms for collecting data. Puerto Rico, an unincorporated territory of the United States, has many serious economic and social problems. With a <a href="http://www.nytimes.com/2011/06/21/us/21crime.html?pagewanted=all" target="_blank">very high murder rate</a>, less than 50% of the working-age population in the <a href="http://grupocne.org/cneblog/?p=310" target="_blank">labor force</a>, an <a href="http://www.economist.com/blogs/dailychart/2011/01/gdp_forecasts" target="_blank">economy</a> that continues to worsen after 5 years of recession, and a <a href="http://www.elnuevodia.com/losboricuasperdemos32diasalanoentapones-1123907.html" target="_blank">substantial traffic problem</a>, Puerto Rico can certainly benefit from sound government statistics to better guide policy-making. Better measurement, information and knowledge can only improve the situation.
</p>
<p class="MsoNormal">
In 2007, the <a href="http://www.estadisticas.gobierno.pr" target="_blank">Puerto Rico Institute of Statistics</a> was founded. Mario Marazzi, who obtained his PhD in Economics from Cornell University, left a prestigious job at the Federal Reserve to become the first Executive Director of the Institute. Given the <a href="http://www.elnuevodia.com/blog-las_lecciones_de_marazzi-773367.html" target="_blank">complicated political landscape</a> in Puerto Rico, Mario made an admirable sacrifice for his home country. He was kind enough to answer some questions for Simply Statistics:
</p>
<p class="MsoNormal">
<strong>What is the biggest success story of the Institute?</strong>
</p>
<p class="MsoNormal">
I would say that our biggest success story has been to revive the idea that high-quality statistics are critical for the success of any organization in Puerto Rico. For too long, statistics were neglected and even abused in Puerto Rico. There is now a palpable sense in Puerto Rico that it is important to devote resources and time to ensure that data are produced with care.
</p>
<p class="MsoNormal">
We have also undertaken a number of critical statistical projects since our inauguration in 2007. For instance, the Institute completed the revision to Puerto Rico’s Consumer Price Index, after identifying that official inflation had been overestimated by more than double for 15 years. The Institute revised Puerto Rico’s Mortality Statistics, after detecting the use of an inconsistent selection methodology for the cause of death, as well as discovering thousands of deaths that had not been previously included in the official data. We also undertook Puerto Rico’s first-ever Science and Technology Survey that allowed us to measure the economic impact of Research and Development activities in Puerto Rico.
</p>
<p class="MsoNormal">
<strong>What discovery, made from collecting data in Puerto Rico, has most surprised you?</strong>
</p>
<p class="MsoNormal">
We performed a study on migration patterns during the last decade. From anecdotal evidence, it was fairly clear that in the last five years there had been an elevated level of migration out of Puerto Rico. Nevertheless, the data revealed a few stunning conclusions. For five consecutive years, about 1 percent of Puerto Rico’s population simply left Puerto Rico every year, even after taking into account the people who migrated to Puerto Rico. The demographic consequences were significant: migration had been accelerating the aging of Puerto Rico’s population, and people who left Puerto Rico had a greater level of educational achievement than those who arrived. In fact, for the first time ever in recorded history, Puerto Rico’s population actually declined between the 2000 and 2010 Census. Although fertility rates have also been declining, it is now clear that migration was the main cause of the overall population decrease.
</p>
<p class="MsoNormal">
<strong>Are government agencies usually willing to cooperate with the Institute? If not, what resources does the Institute have available to make them comply?</strong>
</p>
<p class="MsoNormal">
Frequently, statistical functions are not very high on policymakers’ lists of priorities. As a result, government statisticians are usually content to collaborate with the Institute, since we can bring resources to help solve the common problems they face.
</p>
<p class="MsoNormal">
At times, some agencies can be reluctant to undertake the changes needed to produce high-quality statistics. In these instances, the Institute is endowed with the authority by law to move the process along, through statistical policy mandates approved by the Board of Directors of the Institute.
</p>
<p class="MsoNormal">
<strong>If there is a particular agency that excels at collecting and sharing data, can others learn from them?</strong>
</p>
<p class="MsoNormal">
Definitely, we encourage agencies to share their best practices with one another. To facilitate this process, the Institute has the responsibility of organizing the Puerto Rico Statistical Coordination Committee, where representatives from each agency can share practical experiences, and enhance interagency coordination.
</p>
<p class="MsoNormal">
<strong>Do you think Puerto Rico needs more statisticians?</strong>
</p>
<p class="MsoNormal">
Absolutely. Some of our brightest minds in statistics work outside of Puerto Rico, both in Universities and in the Federal Government. Puerto Rico needs an injection of human resources to bring its statistical system up to global standards.
</p>
<p class="MsoNormal">
<strong>What can academic statisticians do to help institutes such as yours?</strong>
</p>
<p class="MsoNormal">
Academic statisticians are instrumental to furthering the mission of the Institute. Governments produce statistics in a wide array of disciplines. Each area can have very specific and unique methodologies. It is impossible for one to be an expert in every methodology.
</p>
<p class="MsoNormal">
As a result, the Institute depends on the collaboration of academic statisticians that can bring to bear their expertise in specific fields. For example, academic biostatisticians can help identify needed improvements to existing methodologies in health statistics. Index theorists can train government statisticians in the latest index methodologies. Computational statisticians can analyze large data sets to help us explain the otherwise unexplained behavior of the data.
</p>
<p class="MsoNormal">
We also host several Puerto Rico datasets on the Institute’s website, which were provided by professors from a number of different fields.
</p>
<hr />
<p class="MsoNormal">
<strong>Entrevista con Mario Marazzi (version en español)</strong>
</p>
<p class="MsoNormal">
En mi opinión, la importancia de los estadísticos que trabajan para el gobierno se subestima. En los EEUU, agencias como el Center for Disease Control, el Census Bureau y el Bureau of Labor Statistics emplean estadísticos para ayudar a recopilar y analizar datos que contribuyen a importantes decisiones de política pública. Por ejemplo, ¿cuántos estudiantes se matricularán en las escuelas públicas este año? ¿Hay una epidemia de diabetes tipo II? ¿El desempleo está aumentando? ¿Cuántos deambulantes viven en Los Ángeles? Las respuestas a estas preguntas ayudan a determinar las decisiones presupuestarias y de política pública y no se pueden contestar sin la ayuda de los estadísticos del gobierno que recogen y analizan los datos pertinentes.
</p>
<p class="MsoNormal">
Hasta hace poco el gobierno de Puerto Rico no tenía mecanismos formales de recolección de datos. Puerto Rico, un territorio no incorporado de Estados Unidos, tiene muchos problemas socioeconómicos. Con una <a href="http://www.nytimes.com/2011/06/21/us/21crime.html?pagewanted=all" target="_blank">tasa de asesinatos muy alta</a>, menos de 50% de la población con edad de trabajar en la <a href="http://grupocne.org/cneblog/?p=310" target="_blank">fuerza laboral</a>, una economía que <a href="http://www.economist.com/blogs/dailychart/2011/01/gdp_forecasts" target="_blank">sigue empeorando</a> después de 5 años de recesión y <a href="http://www.elnuevodia.com/losboricuasperdemos32diasalanoentapones-1123907.html" target="_blank">problemas serios de tráfico</a>, Puerto Rico se beneficiaría de estadísticas gubernamentales de alta calidad para guiar mejor la formulación de política pública. Mejores medidas, información y conocimientos sólo pueden mejorar la situación.
</p>
<p class="MsoNormal">
En 2007, se inauguró el Instituto de Estadísticas de Puerto Rico. Mario Marazzi, quien obtuvo su doctorado en Economía de la Universidad de Cornell, dejó un trabajo prestigioso en el Federal Reserve para convertirse en el primer Director Ejecutivo del Instituto.
</p>
<p class="MsoNormal">
Tomando en cuenta el <a href="http://www.elnuevodia.com/blog-las_lecciones_de_marazzi-773367.html" target="_blank">complicado panorama político</a> en Puerto Rico, Mario hizo un sacrificio admirable por su país y cordialmente aceptó contestar unas preguntas para nuestro blog:
</p>
<p class="MsoNormal">
<strong>¿Cuál ha sido el mayor éxito del Instituto?</strong>
</p>
<p class="MsoNormal">
Yo diría que nuestro mayor éxito ha sido revivir la idea de que las estadísticas de alta calidad son cruciales para el éxito de cualquier organización en Puerto Rico. Por mucho tiempo, las estadísticas fueron descuidadas e incluso abusadas en Puerto Rico. En la actualidad existe una sensación palpable en Puerto Rico de que es importante dedicar recursos y tiempo para asegurarse de que los datos se produzcan con cuidado.
</p>
<p class="MsoNormal">
También, desde nuestra inauguración en 2007, hemos realizado una serie de proyectos críticos de estadística. Por ejemplo, el Instituto concluyó la revisión del Índice de Precios al Consumidor de Puerto Rico, después de identificar que la inflación oficial había sido sobreestimada por más del doble durante 15 años. El Instituto revisó las Estadísticas de Mortalidad de Puerto Rico, después de detectar el uso de una metodología de selección inconsistente para determinar la causa de muerte y tras descubrir miles de muertes que no habían sido incluidas en los datos oficiales. Además, realizamos la primera Encuesta de Ciencia y Tecnología de Puerto Rico, que nos permitió medir el impacto económico de las actividades de investigación y desarrollo en Puerto Rico.
</p>
<p class="MsoNormal">
<strong>¿Qué descubrimiento, realizado a partir de la recopilación de datos en Puerto Rico, te ha sorprendido más?</strong>
</p>
<p class="MsoNormal">
Nosotros realizamos un estudio sobre los patrones de migración durante la última década. A partir de la evidencia anecdótica, era bastante claro que durante los últimos cinco años ha habido un nivel elevado de emigración de Puerto Rico. Sin embargo, los datos revelaron algunas conclusiones sorprendentes. Durante cinco años consecutivos, 1 por ciento de la población de Puerto Rico se ha ido de Puerto Rico todos los años, incluso después de tomar en cuenta la gente que emigró a Puerto Rico. Las consecuencias demográficas eran importantes: la migración ha acelerado el envejecimiento de la población de Puerto Rico y las personas que se fueron de Puerto Rico tienen un mayor nivel de preparación escolar que los que llegaron. De hecho, por primera vez en la historia, la población de Puerto Rico disminuyó entre el Censo de 2000 y el del 2010. A pesar de tasas de fecundidad que disminuyen, ahora está claro que la migración es la causa principal de la reducción de población.
</p>
<p class="MsoNormal">
<strong>¿Por lo general, las agencias gubernamentales están dispuestas a cooperar con el Instituto? Si no, ¿qué recursos tiene disponible el Instituto para obligarlos?</strong>
</p>
<p class="MsoNormal">
Frecuentemente, las estadísticas no aparecen muy altas en las listas de prioridades de los políticos. Como resultado, los estadísticos del gobierno por lo general están contentos de colaborar con el Instituto, ya que nosotros podemos aportar recursos para ayudar a resolver los problemas comunes a que se enfrentan.
</p>
<p class="MsoNormal">
A veces, algunas agencias pueden mostrarse reacias a emprender los cambios necesarios para producir estadísticas de alta calidad. En estos casos, el Instituto posee la autoridad legal de acelerar el proceso, a través de mandatos aprobados por el Consejo de Administración del Instituto.
</p>
<p class="MsoNormal">
<strong>Si hay un organismo en particular que se destaca en la recopilación y el intercambio de datos, ¿otros pueden aprender de ellos?</strong>
</p>
<p class="MsoNormal">
Definitivamente. Nosotros animamos a las agencias a compartir sus mejores prácticas con otros. Para facilitar este proceso, el Instituto tiene la responsabilidad de organizar el Comité de Coordinación Estadística de Puerto Rico, donde representantes de cada agencia pueden compartir experiencias prácticas y mejorar la coordinación interinstitucional.
</p>
<p class="MsoNormal">
<strong> ¿Cree usted que Puerto Rico necesita más estadísticos?</strong>
</p>
<p class="MsoNormal">
Por supuesto. Algunas de nuestras mentes más brillantes en estadísticas trabajan fuera de Puerto Rico, tanto en las universidades como en el Gobierno Federal. Puerto Rico necesita una inyección de recursos humanos para que su sistema estadístico llegue a los estándares mundiales.
</p>
<p class="MsoNormal">
<strong>¿Qué pueden hacer los estadísticos académicos para ayudar a instituciones como la suya?</strong>
</p>
<p class="MsoNormal">
Los estadísticos académicos son fundamentales para promover la misión del Instituto. Los gobiernos generan las estadísticas en una amplia gama de disciplinas. Cada área puede tener metodologías muy específicas y únicas. Es imposible que uno sea un experto en cada metodología.
</p>
<p class="MsoNormal">
Como resultado, el Instituto cuenta con la colaboración de estadísticos académicos que pueden ejercer sus conocimientos en campos específicos. Por ejemplo, los bioestadísticos académicos pueden ayudar a identificar las mejoras necesarias a las metodologías existentes en el contexto de la salud pública. Los “Index theorists” pueden entrenar a los estadísticos del gobierno en las últimas metodologías de índice. Los estadísticos computacionales pueden analizar grandes “datasets” que nos ayudan a explicar comportamientos de los datos que de otra manera quedarían sin explicación.
</p>
<p class="MsoNormal">
También organizamos varios datasets de Puerto Rico en la página web del Instituto, que fueron proporcionados por profesores en varios campos diferentes.
</p>
Plotting BeijingAir Data
2011-12-08T01:05:44+00:00
http://simplystats.github.io/2011/12/08/plotting-beijingair-data
<p>Here’s a bit of R code for scraping the BeijingAir Twitter feed and plotting the hourly PM2.5 values for the past 24 hours. The script defaults to the past 24 hours but you can modify that by simply changing the value of the variable ‘n’. </p>
<p>You can just grab the code from this <a href="http://www.biostat.jhsph.edu/~rpeng/makeBeijingAirGraph.R" target="_blank">R script</a>. Note that you need to use the latest version of the ‘twitteR’ package because the data structure has changed from previous versions.</p>
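<p>To give a sense of what the script does, here is a minimal sketch (this is not the linked script: the userTimeline() call, the getText() accessor, and especially the assumed tweet format are illustrative and should be checked against the live feed and your version of ‘twitteR’):</p>
<pre>
library(twitteR)

## pull the last n hourly tweets from @BeijingAir
n <- 24
tweets <- userTimeline("BeijingAir", n = n)
txt <- sapply(tweets, function(tw) tw$getText())

## assume each tweet looks roughly like
## "12-07-2011 22:00; PM2.5; 250.0; 300; Hazardous"
## so the timestamp is the 1st ';'-separated field and the concentration the 3rd
fields <- strsplit(txt, ";")
pm25 <- as.numeric(sapply(fields, function(f) f[3]))
time <- strptime(sapply(fields, function(f) f[1]), "%m-%d-%Y %H:%M")

plot(time, pm25, type = "l", xlab = "Hour", ylab = "PM2.5 (ug/m^3)",
     main = "Hourly PM2.5 from @BeijingAir")
abline(h = 35, lty = 2)  ## US 24-hour standard, for rough reference only
</pre>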
<p>Using a modified version of the code in the script, I made a plot of the 24-hour average PM2.5 levels in Beijing over the last 2 months or so. The dashed line shows the US national ambient air quality standard for 24-hour average PM2.5. Note that the plot below is 24-hour <em>averages</em> so it is comparable to the US standard and also looks (somewhat) less extreme than the hourly values.</p>
<p><img src="http://media.tumblr.com/tumblr_lvuhvyF8S71r08wvg.png" alt="" /></p>
Clean Air A 'Luxury' In Beijing's Pollution Zone
2011-12-07T17:37:55+00:00
http://simplystats.github.io/2011/12/07/clean-air-a-luxury-in-beijings-pollution-zone
<p><a href="http://www.npr.org/2011/12/07/143214875/clean-air-a-luxury-in-beijings-pollution-zone">Clean Air A ‘Luxury’ In Beijing’s Pollution Zone</a></p>
Outrage Grows Over Air Pollution and China’s Response
2011-12-07T14:12:45+00:00
http://simplystats.github.io/2011/12/07/outrage-grows-over-air-pollution-and-chinas-response
<p><a href="http://www.nytimes.com/2011/12/07/world/asia/beijing-journal-anger-grows-over-air-pollution-in-china.html">Outrage Grows Over Air Pollution and China’s Response</a></p>
Beijing Air (cont'd)
2011-12-06T13:49:00+00:00
http://simplystats.github.io/2011/12/06/beijing-air-contd
<p>Following up a bit on my previous post on <a href="http://simplystatistics.tumblr.com/post/13601935082/beijing-air" target="_blank">air pollution in Beijing, China</a>, my brother forwarded me a link to some <a href="http://www.chinadialogue.net/article/show/single/en/4661-Beijing-s-hazardous-blue-sky" target="_blank">work conducted by Steven Q. Andrews</a> on comparing particulate matter (PM) air pollution in China versus Europe and the US. China does not officially release fine PM measurements (PM2.5) and furthermore does not have an official standard for that metric. In the US, PM standards are generally focused on PM2.5 now as opposed to PM10 (which includes coarse thoracic particles). Apparently, China is proposing a standard for PM2.5 but it has not yet been implemented.</p>
<p>The main issue seems to be that China has a somewhat different opinion about what it means to be a “bad” pollution day. In the US, the daily average <a href="http://www.epa.gov/air/criteria.html" target="_blank">national ambient air quality standard for PM2.5</a> is 35 mcg/m^3, whereas the proposed standard in China is 75 mcg/m^3. <a href="http://www.who.int/phe/health_topics/outdoorair_aqg/en/" target="_blank">The WHO recommends PM2.5 levels be below 25 mcg/m^3</a>. In China, days under 35 would be considered “excellent” and days under 75 would be considered “good”.</p>
<p>It’s a bit difficult to understand what this means here because in the US we so rarely see days where the daily average is above 75 mcg/m^3. In fact, for the period 1999-2008, if you look across the entire PM2.5 monitoring network for the US, you see that 99% of days fell below the level of 75 mcg/m^3. So seeing a day like that would be quite a rare event indeed.</p>
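<p>For what it’s worth, that kind of summary is a one-liner once you have the daily data in hand. Here is a sketch, where pm25daily is a hypothetical data frame of 24-hour average PM2.5 values (one row per monitor-day, concentration in mcg/m^3); the actual figure above came from the full US monitoring network for 1999-2008:</p>
<pre>
## fraction of monitor-days with a 24-hour average below 75 mcg/m^3
mean(pm25daily$pm25 < 75)

## or, equivalently, look at the upper tail of the distribution
quantile(pm25daily$pm25, probs = c(0.95, 0.99))
</pre>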
<p>The Chinese government has consistently claimed that air pollution has improved over time. But Andrews notes</p>
<blockquote>
<p>…these so-called improvements are due to irregularities in the monitoring and reporting of air quality – and not to less polluted air. Most importantly, the government changed monitoring station locations twice. In 2006, it shut down the two most polluted stations and then, in 2008, began monitoring outside the city, beyond the sixth ring road, which is 15 to 20 kilometres from Beijing’s centre.</p>
</blockquote>
<p>Andrews has previously published on <a href="http://iopscience.iop.org/1748-9326/3/3/034009" target="_blank">inconsistencies between Beijing’s claims of “blue sky days” and the actual monitoring of PM</a> in a paper in <em>Environmental Research Letters</em>. That paper showed an unusually high number of measurements falling just below the cutoff for a “blue sky day”. The reason for this pattern is not clear but it raises questions about the quality of the official monitoring data.</p>
<p>China has a novel propagandistic approach to air pollution regulation, which is to separate the data from the interpretation. So a day that has PM2.5 levels at 75 mcg/m^3 is called “good” and as long as you have a lot of “good” or “excellent” days, then you are set. The problem is that you can call something a “blue sky day” or whatever you want, but people still have to suffer the real consequences of high PM days. It’s hard to “relabel” increased asthma attacks, irritated respiratory tracts, and myocardial infarctions.</p>
<p>Andrews notes</p>
<blockquote>
<p>As the <em>China Daily</em> recently wrote: “All of the residents in the city are aware of the poor air quality, so it does not make sense to conceal it for fear of criticism.”</p>
</blockquote>
<p>Maybe the best way to conceal the air pollution is to actually get rid of it?</p>
Who can resist Biostatistics Ryan Gosling?
2011-12-06T12:03:06+00:00
http://simplystats.github.io/2011/12/06/who-can-resist-biostatistics-ryan-gosling
<p><a href="http://biostatisticsryangosling.tumblr.com/">Who can resist Biostatistics Ryan Gosling?</a></p>
Preventing Errors through Reproducibility
2011-12-05T15:15:05+00:00
http://simplystats.github.io/2011/12/05/preventing-errors-through-reproducibility
<p>Checklist mania has hit clinical medicine thanks to people like Peter Pronovost and many others. The basic idea is that simple and short checklists along with changes to clinical culture can prevent major errors from occurring in medical practice. One particular success story is Pronovost’s central line checklist which <a href="http://www.ncbi.nlm.nih.gov/pubmed/15483409" target="_blank">dramatically reduced bloodstream infections</a> in hospital intensive care units. </p>
<p>There are three important points about the checklist. First, it neatly summarizes information, bringing the latest evidence directly to clinical practice. It is easy to follow because it is short. Second, it serves to slow you down from whatever you’re doing. Before you cut someone open for surgery, you stop for a second and run the checklist. Third, it is a kind of equalizer that subtly changes the culture: everyone has to follow the checklist, no exceptions. A number of studies have now shown that when clinical units follow checklists, infection rates go down and hospital stays are shorter compared to units using standard procedures. </p>
<p>Here’s a question: What would it take to convince you that an article’s results were reproducible, short of going in and reproducing the results yourself? I recently raised this question in a <a href="http://simplystatistics.tumblr.com/post/12243614318/i-gave-a-talk-on-reproducible-research-back-in" target="_blank">talk I gave</a> at the Applied Mathematics Perspectives conference. At the time I didn’t get any responses, but I’ve had some time to think about it since then.</p>
<p>I think most people are thinking of this issue along the lines of “The only way I can confirm that an analysis is reproducible is to reproduce it myself”. In order for that to work, everyone needs to have the data and code available to them so that they can do their own independent reproduction. Such a scenario would be sufficient (and perhaps ideal) to claim reproducibility, but is it strictly necessary? For example, if I reproduced a published analysis, would that satisfy you that the work was reproducible, or would you have to independently reproduce the results for yourself? If you had to choose someone to reproduce an analysis for you (not including yourself), who would it be?</p>
<p>This idea is embedded in the <a href="http://www.ncbi.nlm.nih.gov/pubmed/19535325" target="_blank">reproducible research policy at </a><em><a href="http://www.ncbi.nlm.nih.gov/pubmed/19535325" target="_blank">Biostatistics</a>,</em> but of course we make the data and code available too. There, a (hopefully) trusted third party (the Associate Editor for Reproducibility) reproduces the analysis and confirms that the code was runnable (at least at that moment in time). </p>
<p>It’s important to point out that reproducible research is not only about correctness and prevention of errors. It’s also about making research results available to others so that they may more easily build on the work. However, preventing errors is an important part and the question is then what is the best way to do that? Can we generate a reproducibility checklist?</p>
Online Learning, Personalized
2011-12-05T12:42:17+00:00
http://simplystats.github.io/2011/12/05/online-learning-personalized
<p><a href="http://www.nytimes.com/2011/12/05/technology/khan-academy-blends-its-youtube-approach-with-classrooms.html">Online Learning, Personalized</a></p>
Citizen science makes statistical literacy critical
2011-12-03T17:20:13+00:00
http://simplystats.github.io/2011/12/03/citizen-science-makes-statistical-literacy-critical
<p>In today’s Wall Street Journal, Amy Marcus has a <a href="http://online.wsj.com/article/SB10001424052970204621904577014330551132036.html" target="_blank">piece</a> on the <a href="http://en.wikipedia.org/wiki/Citizen_science" target="_blank">Citizen Science</a> movement, focusing on citizen science in health in particular. I am fully in support of this enthusiasm and a big fan of citizen science - if done properly. There have already been some pretty big <a href="http://www.wired.com/wiredscience/2010/08/citizen-scientist-pulsars/" target="_blank">success</a> <a href="http://depts.washington.edu/bakerpg/drupal/system/files/jiang08A.pdf" target="_blank">stories</a>. As more companies like <a href="http://www.fitbit.com/" target="_blank">Fitbit</a> and <a href="https://www.23andme.com/" target="_blank">23andMe</a> spring up, it is really easy to collect data about yourself (<a href="http://myyearofdata.wordpress.com/" target="_blank">right Chris?</a>). At the same time organizations like <a href="http://www.patientslikeme.com/" target="_blank">Patients Like Me</a> make it possible for people with specific diseases or experiences to self-organize. </p>
<p>But the thing that struck me the most in reading the article is the importance of statistical literacy for citizen scientists, reporters, and anyone reading these articles. For example the article says:</p>
<blockquote>
<p><span>The questions that most people have about their DNA—such as what health risks they face and how to prevent them—aren’t always in sync with the approach taken by pharmaceutical and academic researchers, who don’t usually share any potentially life-saving findings with the patients.</span></p>
</blockquote>
<p>I think it’s pretty unlikely that any organization would hide life-saving findings from the public. My impression from reading the article is that this statement refers to keeping results blinded from patients/doctors <em>during an experiment or clinical trial. </em><a href="http://en.wikipedia.org/wiki/Blind_experiment" target="_blank">Blinding</a> is a critical component of clinical trials, which reduces many potential sources of bias in the results of a study. Obviously, once the trial/study has ended (or been stopped early because a treatment is effective) then the results are quickly disseminated.</p>
<p>Several key statistical issues are then raised in bullet-point form without discussion: </p>
<blockquote>
<p><span>Amateurs may not collect data rigorously, they say, and may draw conclusions from sample sizes that are too small to yield statistically reliable results. </span></p>
<p>Having individuals collect their own data poses other issues. Patients may enter data only when they are motivated, or feeling well, rendering the data useless. In traditional studies, both doctors and patients are typically kept blind as to who is getting a drug and who is taking a placebo, so as not to skew how either group perceives the patients’ progress.</p>
</blockquote>
<p>The article goes on to describe an anecdotal example of citizen science - which suffers from a key statistical problem (small sample size):</p>
<blockquote>
<p>Last year, Ms. Swan helped to run a small trial to test what type of vitamin B people with a certain gene should take to lower their levels of homocysteine, an amino acid connected to heart-disease risk. (The gene affects the body’s ability to metabolize B vitamins.)</p>
<p>Seven people—one in Japan and six, including herself, in her local area—paid around $300 each to buy two forms of vitamin B and Centrum, which they took in two-week periods followed by two-week “wash-out” periods with no vitamins at all.</p>
</blockquote>
<p>The article points out the issue:</p>
<blockquote>
<p><span>The scientists clapped politely at the end of Ms. Swan’s presentation, but during the question-and-answer session, one stood up and said that the data was not statistically significant—and it could be harmful if patients built their own regimens based on the results.</span></p>
</blockquote>
<p>But the article doesn’t carefully explain the importance of sample size, suggesting instead that the only reason you need more people is to “insure better accuracy”.</p>
<p>It strikes me that statistical literacy is critical if the citizen science movement is going to go forward. Ideas like experimental design, randomization, blinding, placebos, and sample size need to be in the toolbox of any practicing citizen scientist. </p>
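<p>To make the sample size point concrete, here is a quick power calculation in R. The numbers are purely illustrative (a “moderate” effect of half a standard deviation, not anything estimated from the vitamin B study), but they show why a handful of participants is rarely enough:</p>
<pre>
## sample size per group needed to detect a half standard deviation difference
## with 80% power at the usual 0.05 significance level
power.t.test(delta = 0.5, sd = 1, power = 0.8)
## n comes out to roughly 64 per group - far more than 7 people total
</pre>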
<p>One major drawback is that there are very few places where the general public can learn about statistics. Mostly statistics is taught in university courses. Resources like the <a href="http://www.khanacademy.org/" target="_blank">Khan Academy</a> and the <a href="http://www.amazon.com/Cartoon-Guide-Statistics-Larry-Gonick/dp/0062731025" target="_blank">Cartoon Guide to Statistics</a> exist, but are only really useful if you are self-motivated and have some idea of math/statistics to begin with. </p>
<p>Since knowledge of basic statistical concepts is quickly becoming indispensable for citizen science or even basic life choices like deciding on healthcare options, do we need “adult statistical literacy courses”? These courses could focus on the basics of experimental design and how to understand results in stories about science in the popular press. It feels like it might be time to add a basic understanding of statistics and data to reading/writing/arithmetic as critical life skills. <a href="http://simplystatistics.tumblr.com/post/13684145814/the-worlds-has-changed-from-analogue-to-digital-and" target="_blank">I’m not the only one who thinks so.</a></p>
The world has changed from analogue to digital and it's time mathematical education makes the change too.
2011-12-03T17:17:16+00:00
http://simplystats.github.io/2011/12/03/the-worlds-has-changed-from-analogue-to-digital-and
<p><a href="http://www.youtube.com/watch?v=BhMKmovNjvc">The worlds has changed from analogue to digital and it’s time mathematical education makes the change too.</a></p>
Reverse scooping
2011-12-03T15:46:53+00:00
http://simplystats.github.io/2011/12/03/reverse-scooping
<p>I would like to define a new term: <em>reverse scooping</em> is when someone publishes your idea after you, and doesn’t cite you. It has happened to me a few times. What does one do? I usually send a polite message to the authors with a link to my related paper(s). These emails are usually ignored, but not always. Most times I don’t think it is malicious though. In fact, I almost reverse scooped a colleague recently. People arrive at the same idea a few months (or years) later and there is just too much literature to keep track of. And remember, the culprit authors were not the only ones who missed your paper; the referees and associate editor missed it as well. One thing I have learned is that if you want to claim an idea, try to include it in the title or abstract, as very few papers get read cover-to-cover.</p>
New S.E.C. Tactics Yield Actions Against Hedge Funds
2011-12-03T14:07:05+00:00
http://simplystats.github.io/2011/12/03/new-s-e-c-tactics-yield-actions-against-hedge-funds
<p><a href="http://dealbook.nytimes.com/2011/12/01/new-s-e-c-tactics-yields-actions-against-hedge-funds/">New S.E.C. Tactics Yield Actions Against Hedge Funds</a></p>
Reproducible Research in Computational Science
2011-12-02T14:12:59+00:00
http://simplystats.github.io/2011/12/02/reproducible-research-in-computational-science
<p>First of all, thanks to Rafa for <a href="http://simplystatistics.tumblr.com/post/13602648384/rogers-perspective-on-reproducible-research-published" target="_blank">scooping me with my own article</a>. Not sure if that’s reverse scooping or recursive scooping or….</p>
<p>The latest issue of <em>Science</em> has a special section on <a href="http://www.sciencemag.org/content/334/6060/1225.full" target="_blank">Data Replication and Reproducibility</a>. As part of the section I wrote a brief commentary on the need for <a href="http://www.sciencemag.org/content/334/6060/1226.full" target="_blank">reproducible research in computational science</a>. <em>Science</em> has a pretty tight word limit for its commentaries and so it was unfortunately necessary to omit a number of relevant topics.</p>
<p>The editorial introducing the special section, as well as a separate editorial in the same issue, seem to emphasize the errors/fraud angle. This might be because <em>Science</em> has once or twice been at the center of instances of scientific fraud. But as I’ve said previously (and a point I tried to make in the commentary), <a href="http://simplystatistics.tumblr.com/post/12421558195/reproducible-research-notes-from-the-field#disqus_thread" target="_blank">reproducibility is not needed solely to prevent fraud</a>, although that is an important objective. Another important objective is getting ideas across and disseminating knowledge. I think this second objective often gets lost because there’s a sense that knowledge dissemination already happens and that it’s the errors that are new and interesting. While the errors are perhaps new, there is a problem of ideas not getting across as quickly as they could because of a lack of code and/or data. The lack of published code/data is arguably holding up the advancement of science (if not <em>Science</em>).</p>
<p>One important idea I wanted to get across was that we can ramp up to achieve the ideal scenario, if getting there immediately is not possible. People often get hung up on making the data available but I think a substantial step could be made by simply making code available. Why doesn’t every journal just require it? We don’t have to start with a grand strategy involving funding agencies and large consortia. <a href="http://simplystatistics.tumblr.com/post/13454027393/reproducible-research-and-turkey" target="_blank">We can start modestly and make useful improvements</a>. </p>
<p>A final interesting question that came up as the issue was going to press was whether I was talking about “reproducibility” or “replication”. As I made clear in the commentary, I define “replication” as independent people going out and collecting new data and “reproducibility” as independent people analyzing the same data. Apparently, others have the reverse definitions for the two words. The confusion is unfortunate because one idea has a centuries long history whereas the importance of the other idea has only recently become relevant. I’m going to stick to my guns here but we’ll have to see how the language evolves.</p>
Roger's perspective on reproducible research published in Science
2011-12-01T21:32:14+00:00
http://simplystats.github.io/2011/12/01/rogers-perspective-on-reproducible-research-published
<p><a href="http://www.sciencemag.org/content/334/6060/1226"></a></p>
Beijing Air
2011-12-01T21:16:35+00:00
http://simplystats.github.io/2011/12/01/beijing-air
<p>If you’re interested in knowing what the air quality looks like in Beijing, China, the US Embassy there has a particulate matter monitor on its roof that tweets the level of fine particulate matter (PM2.5) every hour (see <a href="http://twitter.com/#!/beijingair" target="_blank">@BeijingAir</a>). In case you’re not used to staring at PM2.5 values all the time, let me provide some context.</p>
<p>The US National Ambient Air Quality Standard for the 24-hour average PM2.5 level is 35 mcg/m^3. The twitter feed shows hourly values, so you can’t compare it directly to the US NAAQS (you’d have to take the average of 24 values), but the levels are nevertheless pretty high.</p>
<p>For example, here’s the hourly time series plot of one 24-hour period in March of 2010:</p>
<p><img src="http://media.tumblr.com/tumblr_lvjafwLr0E1r08wvg.png" alt="" /></p>
<p>The red and blue lines show the average and maximum 24-hour value for Wake County, NC for the period 2000-2006 (I made this plot when I was giving a talk in Raleigh).</p>
<p>So, things could be worse here in the US, but remember that there’s no real evidence of a threshold for PM2.5, so even levels here are potentially harmful. But if you’re traveling to China anytime soon, might want to bring a respirator.</p>
DNA Sequencing Caught in Deluge of Data
2011-12-01T13:07:32+00:00
http://simplystats.github.io/2011/12/01/dna-sequencing-caught-in-deluge-of-data
<p><a href="http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html">DNA Sequencing Caught in Deluge of Data</a></p>
Selling the Power of Statistics
2011-11-30T14:12:06+00:00
http://simplystats.github.io/2011/11/30/selling-the-power-of-statistics
<p>A few weeks ago we learned that <a href="http://www.bloomberg.com/news/2011-11-15/buffett-s-stake-in-century-old-ibm-bolsters-berkshire-s-defense.html" target="_blank">Warren Buffett is a big IBM fan</a> (a $10 billion fan, that is). Having heard that I went over to the IBM web site to see what they’re doing these days. For starters, they’re not selling computers anymore! At least not the kind that I would use. One of the big things they do now is “Business Analytics and Optimization” (i.e. statistics), which is one of the reasons they <a href="http://simplystatistics.tumblr.com/post/9955104326/data-analysis-companies-getting-gobbled-up" target="_blank">bought SPSS and then later Algorithmics</a>.</p>
<p>Roaming around the IBM web site, I found this little video on how <a href="http://www-935.ibm.com/services/us/gbs/bao/?lnk=mhse#overlay-noscript" target="_blank">IBM is involved with tennis matches</a> like the US Open. It’s the usual promo video: a bit cheesy, but pretty interesting too. For example, they provide all the players an automatically generated post-game “match analysis DVD” that has summaries of all the data from their match with corresponding video.</p>
<p>It occurred to me that one of the challenges that a company like IBM faces is selling the “power of analytics” to other companies. They need to make these promo videos because, I guess, some companies are not convinced they need this whole analytics thing (or at least not from IBM). They probably need to do methods and software development too, but getting the deal in the first place is at least as important.</p>
<p>In contrast, here at Johns Hopkins, my experience has been that we don’t really need to sell the “power of statistics” to anyone. For the most part, researchers around here seem to be already “sold”. They understand that they are collecting a ton of data and they’re going to need statisticians to help them understand it. Maybe Hopkins is the exception, but I doubt it.</p>
<p>Good for us, I suppose, for now. But there is a danger that we take this kind of monopoly position for granted. Companies like IBM hire the same people we do (including <a href="https://researcher.ibm.com/researcher/view.php?person=us-aveen" target="_blank">one grad school classmate</a>) and there’s no reason why they couldn’t become direct competitors. We need to continuously show that we can make sense of data in novel ways. </p>
Contributions to the R source
2011-11-29T14:10:03+00:00
http://simplystats.github.io/2011/11/29/contributions-to-the-r-source
<p>One of the nice things about tracking the R subversion repository using git instead of subversion is you can do</p>
<pre>git shortlog -s -n</pre>
<p>which gives you</p>
<pre>19855 ripley
6302 maechler
5299 hornik
2263 pd
1153 murdoch
813 iacus
716 luke
661 jmc
614 leisch
472 ihaka
403 murrell
286 urbaneks
284 rgentlem
269 apache
253 bates
249 tlumley
164 duncan
92 r
43 root
40 paul
40 falcon
39 lyndon
34 thomas
33 deepayan
26 martyn
18 plummer
15 (no author)
14 guido
3 ligges
1 mike
</pre>
<p>These data go back to 1997, so for Brian Ripley that’s about 3.6 commits per day for the last 15 years. </p>
<p>I think that number 1 position will be out of reach for a while. </p>
<p>By the way, I highly recommend to anyone tracking subversion repositories that they use <a href="http://git-scm.com" target="_blank">git</a> to do it. You get all of the advantages of git and there are essentially no downsides.</p>
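<p>If you want to set this up yourself, the basic git-svn workflow looks roughly like the following. (The URL below is the usual location of the R subversion trunk; double-check it before cloning, and expect the initial clone to take a while since it fetches the full history.)</p>
<pre>
# one-time clone of the R sources via git-svn
git svn clone https://svn.r-project.org/R/trunk r-source
cd r-source

# later on, pull in new subversion commits
git svn rebase

# per-author commit counts, as above
git shortlog -s -n
</pre>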
Reproducible Research and Turkey
2011-11-28T14:50:41+00:00
http://simplystats.github.io/2011/11/28/reproducible-research-and-turkey
<p>Over the recent Thanksgiving break I naturally started thinking about reproducible research in between salting the turkey and making the turkey stock. Clearly, these things are all related. </p>
<!-- more -->
<p>I sometimes get the sense that many people see reproducibility as essentially binary. A published paper is either reproducible, as in you can compute every single last numerical result to within epsilon precision, or it’s not. My feeling is that there is a spectrum of reproducibility when it comes to published scientific findings. Some papers are more reproducible than others. And that’s where cooking comes in.</p>
<p>I do a bit of cooking and I am a shameless consumer of <a href="http://www.seriouseats.com/" target="_blank">food</a> <a href="http://ruhlman.com/" target="_blank">blogs</a>/<a href="http://www.cooksillustrated.com/" target="_blank">web</a> <a href="http://upstartkitchen.wordpress.com/" target="_blank">sites</a>. There seems pretty solid agreement (and my own experience essentially confirms) that the more you can make yourself and not have to rely on other people doing the cooking, the better. For example, for Thanksgiving, you could theoretically buy yourself a pre-roasted turkey that’s ready to eat. My <a href="http://vega.bac.pku.edu.cn/~peng/" target="_blank">brother</a> tells me this is what homesick Americans do in China because so few people have an oven (I suppose you could steam a turkey?). Or you could buy an un-cooked turkey that is “flavor injected”. Or you could buy a normal turkey and brine/salt it yourself. Or you could get yourself one of those heritage turkeys. Or you could raise your own turkeys…. I think in all of these cases, the turkey would definitely be edible and maybe even tasty. But some would probably be more tasty than others. </p>
<p>And that’s the point. There’s a spectrum when it comes to cooking and some methods result in better food than others. Similarly, when it comes to published research there is a spectrum of what authors can make available to reproduce their work. On the one hand, you have just the paper itself, which reveals quite a bit of information (i.e. the scientific question, the general approach) but usually too few details to actually reproduce (or even replicate) anything. Some authors might release the code, which allows you to study the algorithms and maybe apply them to your own work. Some might release the code and the data so that you can actually reproduce the published findings. Some might make a nice R package/vignette so that you barely have to lift a finger. Each case is better than the previous, but that’s not to say that I would only accept the last/best case. Some reproducibility is better than none.</p>
<p>That said, I don’t think we should shoot low. Ideally, we would have the best case, which would allow for full reproducibility and rapid dissemination of ideas. But while we wait for that best case scenario, it couldn’t hurt to have a few steps in between.</p>
Apple this is ridiculous - you gotta upgrade to upgrade!?
2011-11-27T19:34:11+00:00
http://simplystats.github.io/2011/11/27/apple-this-is-ridiculous-you-gotta-upgrade-to
<p>So along with a few folks here around Hopkins we have been kicking around the idea of developing an app for the iPhone/Android. I’ll leave the details out for now (other than to say stay tuned!).</p>
<p>But to start developing an app for the iPhone, you need a version of <a href="http://developer.apple.com/xcode/" target="_blank">Xcode</a>, Apple’s development environment. The latest version of Xcode is version 4, which can only be installed with the latest version of Mac OS X Lion (10.7, I think) and above. So I dutifully went off to download Lion. Except, whoops! You can only download Lion from the Mac App store.</p>
<p>Now this wouldn’t be a problem, if you didn’t need OS X Snow Leopard (10.6 and above) to access the App store. Turns out I only have version 10.5 (must be OS X Housecat or something). I did a little searching and it <a href="https://discussions.apple.com/thread/3102124?start=0" target="_blank">looks like</a> the only way I can get Lion is if I buy Snow Leopard first and upgrade to upgrade!</p>
<p>It isn’t the money so much (although it does suck to pay $60 for $30 worth of software), but the time and inconvenience this causes. Apple has done this a couple of times to me in the past with operating systems needing to be upgraded so I can buy things from iTunes. But this is getting out of hand….maybe I need to consider the <a href="http://www.google.com/chromebook/" target="_blank">alternatives</a>.</p>
An R function to analyze your Google Scholar Citations page
2011-11-23T14:07:53+00:00
http://simplystats.github.io/2011/11/23/an-r-function-to-analyze-your-google-scholar-citations
<p>Google scholar has now made Google Scholar Citations profiles available to anyone. You can read about these profiles and set one up for yourself <a href="http://scholar.google.com/intl/en/scholar/citations.html" target="_blank">here</a>.</p>
<p>I asked <a href="http://www.jhsph.edu/faculty/directory/profile/5110/Muschelli/John" target="_blank">John Muschelli</a> and <a href="http://www.biostat.jhsph.edu/~ajaffe/" target="_blank">Andrew Jaffe</a> to write me a function that would download my Google Scholar Citations data so I could play with it. Then they got all crazy on it and wrote a couple of really neat functions. All cool/interesting components of these functions are their ideas and any bugs were introduced by me when I was trying to fiddle with the code at the end.</p>
<p>So how does it work? <a href="http://biostat.jhsph.edu/~jleek/code/googleCite.r" target="_blank">Here</a> is the code. You can source the functions like so:</p>
<p>source("http://biostat.jhsph.edu/~jleek/code/googleCite.r")</p>
<p>This will install the following packages if you don’t have them: wordcloud, tm, sendmailR, RColorBrewer. Then you need to find the url of a google scholar citation page. Here is Rafa Irizarry’s:</p>
<p><a href="http://scholar.google.com/citations?user=nFW-2Q8AAAAJ" target="_blank"><a href="http://scholar.google.com/citations?user=nFW-2Q8AAAAJ" target="_blank">http://scholar.google.com/citations?user=nFW-2Q8AAAAJ</a></a></p>
<p>You can then call the googleCite function like this:</p>
<p>out = googleCite("http://scholar.google.com/citations?user=nFW-2Q8AAAAJ", pdfname = "rafa_wordcloud.pdf")</p>
<p>or search by name like this:</p>
<p>out = searchCite("Rafa Irizarry", pdfname = "rafa_wordcloud.pdf")</p>
<p>The function will download all of Rafa’s citation data and put it in the matrix out. It will also make wordclouds of (a) the co-authors on his papers and (b) the titles of his papers and save them in the pdf file specified (There is an option to turn off plotting if you want). Here is what Rafa’s clouds look like:</p>
<p><img height="250" src="http://biostat.jhsph.edu/~jleek/code/rafa_wordcloud.png" width="500" /></p>
<p>We have also written a little function to calculate many of the popular citation indices. You can call it on the output like so:</p>
<p>gcSummary(out)</p>
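<p>To give a flavor of what goes into these indices (this is not the actual gcSummary code, just a sketch), the h-index can be computed from a vector of per-paper citation counts in a couple of lines:</p>
<pre>
## h-index: the largest h such that h papers have at least h citations each
hIndex <- function(citations) {
  citations <- sort(citations, decreasing = TRUE)
  sum(citations >= seq_along(citations))
}

hIndex(c(50, 30, 22, 7, 3, 1))  ## returns 4
</pre>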
<p>When you download citation data, an email with the data table will also be sent to Simply Statistics so we can collect information on who is using the function and perform population-level analyses.</p>
<p>If you liked this function you might also be interested in our <a href="http://simplystatistics.tumblr.com/post/11271228367/datascientist">R function to determine if you are a data scientist</a>, or in some of the other stuff going on over at <a href="http://simplystatistics.tumblr.com/" target="_blank">Simply Statistics</a>.</p>
<p>Enjoy!</p>
Data Scientist vs. Statistician
2011-11-22T12:41:00+00:00
http://simplystats.github.io/2011/11/22/data-scientist-vs-statistician
<p>There’s an interesting discussion over at reddit on <a href="http://www.reddit.com/r/MachineLearning/comments/mhodz/data_scientist_vs_statistician/" target="_blank">the difference between a data scientist and a statistician</a>. My crude summary of the discussion seems to be that by and large they are the same but the phrase “data scientist” is just the hip new name for statistician that will probably sound stupid 5 years from now.</p>
<p>My question is why isn’t “statistician” hip? The comments don’t seem to address that much (although a few go in that direction). There are a few interesting comments about computing. For example from ByteMining:</p>
<blockquote>
<p>Statisticians typically don’t care about performance or coding style as long as it gets a result. A loop within a loop within a loop is all the same as an O(1) lookup.<br /></p>
</blockquote>
<p>Another more down-to-earth comment comes from marshallp:</p>
<blockquote>
<p>There is a real distinction between data scientist and statistician</p>
<ul>
<li>
<p>the statistician spent years banging his/her head against blackboards full of math notation to get a modestly paid job</p>
</li>
<li>
<p>the data scientist gets s—loads of cash after having learnt a scripting language and an api</p>
</li>
</ul>
<p>More people should be encouraged into data science and not pointless years of stats classes</p>
</blockquote>
<p>Not sure I fully agree but I see where he’s coming from!</p>
<p>[Note: See also our post on <a href="http://simplystatistics.tumblr.com/post/11271228367/datascientist" target="_blank">how determine whether you are a data scientist</a>.]</p>
Ozone rules
2011-11-20T18:35:00+00:00
http://simplystats.github.io/2011/11/20/ozone-rules
<p>A recent article in the New York Times describes the backstory behind the <a href="http://www.nytimes.com/2011/11/17/science/earth/policy-and-politics-collide-as-obama-enters-campaign-mode.html" target="_blank">decision to not revise the ozone national ambient air quality standard</a>. This article highlights the reality of balancing the need to set air pollution regulation to protect public health and the desire to get re-elected. Not having ever served in politics (does being elected to the faculty senate count?) I can’t comment on the political aspect. But I wanted to highlight some of the scientific evidence that goes into developing these standards. </p>
<!-- more -->
<p>A bit of background: the Clean Air Act of 1970 and its subsequent amendments require that national ambient air quality standards be set to protect public health with “an adequate margin of safety”. Ozone (usually referred to as smog in the press) is one of the pollutants for which standards are set, along with particulate matter, nitrogen oxides, sulfur dioxide, carbon monoxide, and airborne lead. Importantly, the Clean Air Act requires the EPA to set standards based on the best available scientific evidence.</p>
<p>The ozone standard was re-evaluated years ago under the (second) Bush administration. At the time, the EPA staff recommended a daily standard of between 60 and 70 ppb as providing an adequate margin of safety. Roughly speaking, if the standard is 70 ppb, this means that states cannot have levels of ozone higher than 70 ppb on any given day (that’s not exactly true but the real standard is a mouthful). Stephen Johnson, EPA administrator at the time, set the standard at 75 ppb, citing in part the lack of evidence showing a link between ozone and health at low levels.</p>
<p>We’ve conducted epidemiological analyses that show that <a href="http://www.ncbi.nlm.nih.gov/pubmed/16581541" target="_blank">ozone is associated with mortality even at levels far below 60 ppb</a> (See Figure 2). Note, this paper was not published in time to make into the previous EPA review. The study suggests that if a threshold exists below which ozone has no health effect, it is probably at a level lower than the current standard, possibly nearing natural background levels. Detecting thresholds at very low levels is challenging because you start running out of data quickly. But <a href="http://www.ncbi.nlm.nih.gov/pubmed/14757374" target="_blank">other</a> <a href="http://www.ncbi.nlm.nih.gov/pubmed/9541366" target="_blank">studies</a> that have attempted to do this have found results similar to ours.</p>
<p>The bottom line is pollution levels below current air quality standards should not be misinterpreted as safe for human health.</p>
Show 'em the data!
2011-11-20T01:59:00+00:00
http://simplystats.github.io/2011/11/20/show-em-the-data
<div>
<p>
In a previous <a href="http://simplystatistics.tumblr.com/post/12599452125/expected-salary-by-major" target="_blank">post</a> I argued that students entering college should be shown data on job prospects by major. This week I found out the American Bar Association might <a href="http://www.abajournal.com/news/article/aba_committee_appears_poised_to_adopt_new_jobs_placement_standard/" target="_blank">make it a requirement for law school accreditation.</a>
</p>
<p>
Hat tip to Willmai Rivera.
</p>
</div>
Interview with Héctor Corrada Bravo
2011-11-18T17:52:01+00:00
http://simplystats.github.io/2011/11/18/interview-with-h-ctor-corrada-bravo
<p><strong>Héctor Corrada Bravo</strong></p>
<p><strong><img height="200" src="http://biostat.jhsph.edu/~jleek/hcb.jpg" width="300" /></strong></p>
<p>Héctor Corrada Bravo is an assistant professor in the Department of Computer Science and the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park. He moved to College Park after finishing his Ph.D. in computer science at the University of Wisconsin and a postdoc in biostatistics at the Johns Hopkins Bloomberg School of Public Health. He has done outstanding work at the intersection of molecular biology, computer science, and statistics. For more info check out his <a href="http://www.cbcb.umd.edu/~hcorrada/" target="_blank">webpage</a>.</p>
<!-- more -->
<p><strong>Which term applies to you: statistician/data scientist/computer scientist/machine learner?</strong></p>
<p>I want to understand interesting phenomena (in my case mostly in biology and medicine) and I believe that our ability to collect a large number of relevant measurements and infer characteristics of these phenomena can drive scientific discovery and commercial innovation in the near future. Perhaps that makes me a data scientist and means that depending on the task at hand one or more of the other terms apply.</p>
<p>A lot of the distinctions many people make between these terms are vacuous and unnecessary, but some are nonetheless useful to think about. For example, both statisticians and machine learners [sic] know how to create statistical algorithms that compute interesting and informative objects using measurements (perhaps) obtained through some stochastic or partially observed process. These objects could be genomic tools for cancer screening, or statistics that better reflect the relative impact of baseball players on team success.</p>
<p>Both fields also give us ways to evaluate and characterize these objects. However, there are times when these objects are tools that fulfill an immediately utilitarian purpose and thinking like an engineer might (as many people in Machine Learning do) is the right approach. Other times, these objects are there to help us get insights about our world and thinking in ways that many statisticians do is the right approach. You need both of these ways of thinking to do interesting science and dogmatically avoiding either of them is a terrible idea.</p>
<p><strong>How did you get into statistics/data science (i.e. your history)?</strong></p>
<p>I got interested in Artificial Intelligence at one point, and found that my mathematics background was nicely suited to work on this. Once I got into it, thinking about statistics and how to analyze and interpret data was natural and necessary. I started working with two wonderful advisors at Wisconsin, Raghu Ramakrishnan (CS) and Grace Wahba (Statistics), who helped shape the way I approach problems from different angles and with different goals. The last piece was discovering that computational biology is a fantastic setting in which to apply and devise these methods to answer really interesting questions.</p>
<p><strong><span>What is the problem currently driving you?</span></strong></p>
<p>I’ve been working on cancer epigenetics to find specific genomic measurements for which increased stochasticity appears to be general across multiple cancer types. Right now, I’m really wondering how far into the clinic these discoveries can be taken, if at all. For example, can we build tools that use these genomic measurements to improve cancer screening?</p>
<p><strong><span>How do you see CS/statistics merging in the future?</span></strong></p>
<p>I think that future got here some time ago, but is about to get much more interesting.</p>
<p>Here is one example: Computer Science is about creating and analyzing algorithms and building the systems that can implement them. Some of what many computer scientists have done looks at problems concerning how to keep, find and ship around information (Operating Systems, Networks, Databases, etc.). Many times these have been driven by very specific needs, e.g., commercial transactions in databases. In some ways, companies have moved from asking “how do I use data to keep track of my activities?” to “how do I use data to decide which activities to do and how to do them?” Statistical tools should be used to answer these questions, and systems built by computer scientists have statistical algorithms at their core.</p>
<p><strong>Beyond R, what are some really useful computational tools for statisticians to know about?</strong></p>
<p>I think a computational tool that everyone can benefit a lot from understanding better is algorithm design and analysis. This doesn’t have to be at a particularly deep level, but just getting a sense of how long a particular process might take, and how to devise a different way of doing it that might make it more efficient, is really useful. I’ve been toying with the idea of creating a CS course called (something like) “Highlights of continuous mathematics for computer science” that reminds everyone of the cool stuff that one learns in math now that we can appreciate its usefulness. Similarly, I think statistics students can benefit from “Highlights of discrete mathematics for statisticians”.</p>
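<p>To make the point concrete, here is a minimal R sketch (our example, not Héctor’s) of the kind of reasoning he describes: the same running mean computed two ways, where a little algorithmic thinking replaces a roughly O(n^2) loop with an O(n) one.</p>
<pre><code># Minimal illustration of algorithmic thinking (example added by us, not from the interview).
# The naive version re-sums the vector at every step (roughly O(n^2) work);
# the second version reuses partial sums via cumsum() and does O(n) work.
x = rnorm(50000)

running_mean_naive = function(x) sapply(seq_along(x), function(i) mean(x[1:i]))
running_mean_fast  = function(x) cumsum(x) / seq_along(x)

system.time(running_mean_naive(x))   # noticeably slower
system.time(running_mean_fast(x))    # near-instant
all.equal(running_mean_naive(x), running_mean_fast(x))   # same answer
</code></pre>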
<p>Now a request for comments below from you and readers: (5a) Beyond R, what are some really useful statistical tools for computer scientists to know about?</p>
<p><strong>Review times in statistics journals are long, should statisticians move to conference papers?</strong></p>
<p>I don’t think so. Long review times (anything more than 3 weeks) are really not necessary. We tend to publish in journals with fairly quick review times that produce (for the most part) really useful and insightful reviews.</p>
<p>I was recently talking to senior members in my field who were telling me stories about the “old times” when CS was moving from mainly publishing in journals to now mainly publishing in conferences. But now, people working in collaborative projects (like computational biology) work in fields that primarily publish in journals, so the field needs to be able to properly evaluate their impact and productivity. There is no perfect system.</p>
<p>For instance, review requests in fields where conferences are the main publication venue come in waves (dictated by conference schedule). Reviewers have a lot of papers to go over in a relatively short time, which makes their job of providing really helpful and fair reviews not so easy. So, in that respect, the journal system can be better. The one thing that is universally true is that you don’t need long review times.</p>
<p><span><strong>Previous Interviews:</strong> <a href="http://simplystatistics.tumblr.com/post/11436138110/interview-with-daniela-witten" target="_blank">Daniela Witten</a>, <a href="http://simplystatistics.tumblr.com/post/11729003971/interview-with-chris-barr" target="_blank">Chris Barr</a>, <a href="http://simplystatistics.tumblr.com/post/12328728291/interview-with-victoria-stodden" target="_blank">Victoria Stodden</a></span></p>
Google Scholar Pages
2011-11-17T20:31:00+00:00
http://simplystats.github.io/2011/11/17/google-scholar-pages
<p>If you want to get to know more about what we’re working on, you can check out our Google Scholar pages:</p>
<ul>
<li><a href="http://scholar.google.com/citations?user=HI-I6C0AAAAJ" target="_blank">Jeff Leek</a></li>
<li><a href="http://scholar.google.com/citations?user=nFW-2Q8AAAAJ" target="_blank">Rafael Irizarry</a></li>
<li><a href="http://scholar.google.com/citations?user=h5wUydwAAAAJ" target="_blank">Roger Peng</a></li>
</ul>
<p>I’ve only been using it for a day but I’m pretty impressed by how much it picked up. My only problem so far is having to merge different versions of the same paper.</p>
The History Of Nonlinear Principal Components
2011-11-17T17:13:00+00:00
http://simplystats.github.io/2011/11/17/the-history-of-nonlinear-principal-components
<p>[Video: <a href="http://www.youtube.com/watch?v=V-hFORcBj44" target="_blank">The History of Nonlinear Principal Components Analysis (YouTube)</a>]</p>
<p>The History of Nonlinear Principal Components Analysis, a lecture given by Jan de Leeuw. For those that have ~45 minutes to spare, it’s a very nice talk given in Jan’s characteristic style.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Amazon EC2 is #42 on Top 500 supercomputer list
2011-11-16T17:05:00+00:00
http://simplystats.github.io/2011/11/16/amazon-ec2-is-42-on-top-500-supercomputer-list
<p><a href="http://www.top500.org/list/2011/11/100">Amazon EC2 is #42 on Top 500 supercomputer list</a></p>
Preparing for tenure track job interviews
2011-11-16T15:58:00+00:00
http://simplystats.github.io/2011/11/16/preparing-for-tenure-track-job-interviews
<p>If you are in the job market you will soon be receiving (or have already received) an invitation for an interview. So how should you prepare? You have two goals. The first is to make a good impression. Here are some tips:</p>
<!-- more -->
<p>1) During your talk, do NOT go over your allotted time. Practice your talk at least twice, both times in front of a live audience that asks questions.</p>
<p>2) Know your audience. If it’s a “math-y” department, give a more “math-y” talk. If it’s an applied department, give a more applied talk. But (sorry for the cliché) be yourself. Don’t pretend to be interested in something you are not. I remember one candidate who pretended to be interested in applications and it backfired badly during the talk.</p>
<p>3) Learn about the faculty’s research interests. This will help during the one-on-one interviews.</p>
<p>4) Be ready to answer the questions “what do you want to teach?” and “where do you see yourself in five years?”</p>
<div>
5) I can’t think of any department where it is necessary to wear a suit (correct me if I’m wrong in the comments). In some places you might feel uncomfortable wearing a suit while those interviewing you are in <a href="http://owpdb.mfo.de/photoNormal?id=7558" target="_blank">shorts and t-shirt</a>. But do <a href="http://apha.org/NR/rdonlyres/20123290-0DCC-4275-B7B6-8F1F609BC3EB/10147/IMG_1001.JPG" target="_blank">dress up</a>. Show them you care.
</div>
<p>Second, and just as important, you want to figure out if you like the department you are visiting. Do you want to spend the next 5, 10, 50 years there? Make sure to find out as much as you can to answer this question. Some questions are more appropriate for junior faculty, the more sensitive ones for the chair. Here are some example questions I would ask:</p>
<p>1) What are the expectations for promotion? Would you promote someone publishing exclusively in Nature? Somebody publishing exclusively in Annals of Statistics? Is being a PI on an R01 a requirement for tenure? </p>
<p>2) What are the expectations for teaching/service/collaboration? How are teaching and committee service assignments made? </p>
<p>3) How did you connect with your collaborators? How are these connections made?</p>
<p>4) What percent of my salary am I expected to cover? Is it possible to do this by being a co-investigator?</p>
<p>5) Where do you live? How are the schools? How is the commute? </p>
<p>6) How many graduate students does the department have? How are graduate students funded? If I want someone to work with me, do I have to cover their stipend/tuition?</p>
<p>Specific questions for the junior Faculty:</p>
<p>Are the expectations for promotion made clear to you? Do you get feedback on your progress? Do the senior faculty mentor you? Do the senior faculty get along? What do you like most about the department? What can be improved? In the last 10 years, what percent of junior faculty get promoted?</p>
<p>Questions for the chair:</p>
<p>What percent of my salary am I expected to cover? How soon? Is there bridge funding? What is a standard startup package? Can you describe the promotion process in detail? What space is available for postdocs? (For hard-money places) I love teaching, but can I buy out of teaching with grants?</p>
<p>I am sure I missed stuff, so please comment away….</p>
<p><strong>Update</strong>: I can’t believe I forgot computing! Make sure to ask about computing support. This varies a lot from place to place. Some departments share amazing systems; ask how costs are shared. How is the IT staff? Is R supported? At other places you might have to buy your own hardware. Get <strong>all</strong> the details.</p>
OK Cupid data on Infochimps - anybody got $1k for data?
2011-11-16T12:55:04+00:00
http://simplystats.github.io/2011/11/16/ok-cupid-data-on-infochimps-anybody-got-1k-for-data
<p>OK Cupid is an online dating site that has grown its visibility in part through a pretty awesome blog called <a href="http://blog.okcupid.com/" target="_blank">OK Trends</a>, where they have analyzed their online dating data to, for example, <a href="http://blog.okcupid.com/index.php/the-4-big-myths-of-profile-pictures/" target="_blank">show you what kind of profile picture works best</a>. Now, they have compiled <a href="http://www.infochimps.com/datasets/personality-insights-okcupid-questions-and-answers-by-gender-age" target="_blank">data</a> from their personality survey and made it available online through <a href="http://www.infochimps.com/" target="_blank">Infochimps</a>. We have talked about Infochimps before; it is basically a site for distributing/selling data. Unfortunately, the OK Cupid data costs $1000. I can think of some cool analyses we could do with this data, but unfortunately the price is a little steep for me. Anybody got a grand they want to give me to buy some data?</p>
<p><strong>Related Posts</strong>: Jeff on <a href="http://simplystatistics.tumblr.com/post/11237403492/apis" target="_blank">APIs</a>, Jeff on <a href="http://simplystatistics.tumblr.com/post/10410458080/data-sources" target="_blank">Data sources</a>, Roger on <a href="http://simplystatistics.tumblr.com/post/10441403664/private-health-insurers-to-release-data" target="_blank">Private health insurers to release data</a></p>
First 100 Posts
2011-11-15T17:20:06+00:00
http://simplystats.github.io/2011/11/15/first-100-posts
<p>In honor of us passing the 100 post milestone, I’ve collected a few of our more interesting posts from the past 3 months for those who have not been avid followers from Day 1. Enjoy!</p>
<ul>
<li>First Post: <a href="http://simplystatistics.tumblr.com/post/9954726952/data-science-hot-career-choice" target="_blank">Data Science = Hot Career Choice</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/10124797490/advice-for-stats-students-on-the-academic-job-market" target="_blank">Advice for stats students on the academic job market</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/10558246695/getting-email-responses-from-busy-people" target="_blank">Getting responses from busy people</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/10686092687/25-minute-seminars" target="_blank">25 minute seminars</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">The future of graduate education</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/11271228367/datascientist" target="_blank">An R function to determine if you are a data scientist</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/11988685443/computing-on-the-language" target="_blank">Computing on the language</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/11695813030/finding-good-collaborators" target="_blank">Finding good collaborators</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/11436138110/interview-with-daniela-witten" target="_blank">Interview with Daniela Witten</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/12469660993/is-statistics-too-darn-hard" target="_blank">Is statistics too darn hard?</a></li>
</ul>
The Cost of a U.S. College Education
2011-11-14T19:44:40+00:00
http://simplystats.github.io/2011/11/14/the-cost-of-a-u-s-college-education
<p>As a follow-up to my previous post on <a href="http://simplystatistics.tumblr.com/post/12599452125/expected-salary-by-major" target="_blank">expected salaries by major</a>, I want to share the following graph:</p>
<p><img src="http://www.tonybates.ca/wp-content/uploads/Tuition-costs-USA.jpg" alt="" /></p>
<p>So why is the cost of higher education going up at a faster rate than most everything else? Economists, please correct me if I’m wrong, but it must be that demand grew, right? Universities are non-profits, so they didn’t necessarily have to respond by increasing offers. <a href="http://nces.ed.gov/fastfacts/display.asp?id=98" target="_blank">But apparently they did</a>. So if the proportion of the population going to college grew, why is there a <a href="http://www.nytimes.com/2011/11/06/education/edlife/why-science-majors-change-their-mind-its-just-so-darn-hard.html?_r=3" target="_blank">shortage of STEM majors</a>? I think it’s because the proportion of the population that can complete such a degree has not changed since 1985 and most of those people were already going to college. If this is right, then it implies that to make more offers, the universities had to grow majors with higher graduation rates. The graph below (taken from <a href="http://marginalrevolution.com/marginalrevolution/2011/11/college-has-been-oversold.html" target="_blank">here</a>) seems to confirm this:</p>
<p><img src="http://marginalrevolution.com/wp-content/uploads/2011/11/EducationTabarrok-300x296.png" width="300" height="296" /></p>
<p>Unfortunately, in 1985 there was no dearth of psychologists, visual and performing artists, and journalists. So we should not be surprised that the increase in their numbers resulted in graduates from these fields having a harder time finding employment (see bottom of <a href="http://rafalab.jhsph.edu/images/salarytable.html" target="_blank">this table</a>). Meanwhile, the US has <a href="http://www.npr.org/blogs/thetwo-way/2011/06/15/137203549/two-million-open-jobs-yes-but-u-s-has-a-skills-mismatch" target="_blank">2 million job openings</a> that can’t be filled, many in vocational careers. So why aren’t more students opting for technical training with good job prospects? In this <a href="http://www.nytimes.com/2011/07/10/business/vocational-schools-face-deep-cuts-in-federal-funding.html?pagewanted=all" target="_blank">NYTimes article</a>, Motoko Rich explains that</p>
<blockquote>
<p>In European countries like Germany, Denmark and Switzerland, vocational programs have long been viable choices for a significant portion of teenagers. Yet in the United States, technical courses have often been viewed as the ugly stepchildren of education, backwaters for underachieving or difficult students.</p>
</blockquote>
<p>It’s hard not to think that universities have benefited from the social stigma associated with vocational degrees. In any case, as I said in my previous <a href="http://simplystatistics.tumblr.com/post/12599452125/expected-salary-by-major" target="_blank">post</a>, I am not interested in telling people what to study, but universities should show students the data.</p>
New O'Reilly book on parallel R computation
2011-11-14T17:14:00+00:00
http://simplystats.github.io/2011/11/14/new-oreilly-book-on-parallel-r-computation
<p><a href="http://shop.oreilly.com/product/0636920021421.do">New O’Reilly book on parallel R computation</a></p>
Cooperation between Referees and Authors Increases Peer Review Accuracy
2011-11-11T17:17:56+00:00
http://simplystats.github.io/2011/11/11/cooperation-between-referees-and-authors-increases-peer
<p>Jeff Leek and colleagues just published an article in PLoS ONE on the <a href="http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0026895" target="_blank">differences between anonymous (closed) and non-anonymous (open) peer review of research articles</a>. They developed a “peer review game” as a model system to track authors’ and reviewers’ behavior over time under open and closed systems.</p>
<p>Under the open system, it was possible for authors to see who was reviewing their work. They found that under the open system authors and reviewers tended to cooperate by reviewing each other’s work. Interestingly, they say:</p>
<blockquote>
<p><span>It was not immediately clear that cooperation between referees and authors would increase reviewing accuracy. Intuitively, one might expect that players who cooperate would always accept each others solutions - regardless of whether they were correct. However, we observed that when a submitter and reviewer acted cooperatively, reviewing accuracy actually increased by 11%.</span></p>
</blockquote>
Expected Salary by Major
2011-11-10T15:04:00+00:00
http://simplystats.github.io/2011/11/10/expected-salary-by-major
<p>In this <a href="http://www.thenation.com/article/164348/audacity-occupy-wall-street" target="_blank">recent editorial</a> about the Occupy Wall Street movement, Richard Kim profiles a protestor who, despite having a master’s degree, can’t find a job. This particular protestor quit his job as a school teacher three years ago and took out a $35K student loan to obtain a master’s degree in puppetry from the University of Connecticut. I wonder if, before taking his money, UConn showed this person data on job prospects for its puppetry graduates. More generally, I wonder if any university shows its idealist 18-year-old freshmen such data.</p>
<p><img height="600" width="480" src="http://rafalab.jhsph.edu/images/salaryvsrank.png" /></p>
<p>Georgetown’s <a href="http://cew.georgetown.edu/">Center for Education and the Workforce</a> has an informative <a href="http://cew.georgetown.edu/whatsitworth/">interactive webpage</a> that students can use to find out by-major salary information. I scraped data from this <a href="http://graphicsweb.wsj.com/documents/NILF1111/#term=">Wall Street Journal webpage</a>, which also provides, for each major, unemployment rates, salary quartiles, and its rank in popularity. I used these data to compute expected salaries by multiplying median salary by percent of employment. The graph above shows expected salary versus popularity rank (1 = most popular) for the 50 most popular majors (go <a href="http://rafalab.jhsph.edu/images/salarytable.html">here</a> for a complete table and <a href="http://rafalab.jhsph.edu/images/majors.zip">here</a> for the raw data and code). I also included Physics (the 70th). I used different colors to represent four categories: engineering, math/stat/computers, physical sciences, and the rest. As a baseline I added a horizontal line representing the average salary for a truck driver: $65K, a job currently with <a href="http://www.npr.org/2011/10/13/141325299/a-labor-mismatch-means-trucking-jobs-go-unfilled">plenty of openings</a>. Different font sizes are used only to make names fit. A couple of observations stand out. First, only one of the top 10 most popular majors, Computer Science, has a higher expected salary than truck drivers. Second, Psychology, the fifth most popular major, has an expected salary of $40K and, as seen in <a href="http://rafalab.jhsph.edu/images/salarytable.html" target="_blank">the table</a>, an unemployment rate of 6.1%, almost three times worse than nursing.</p>
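<p>For readers who want to reproduce the calculation, the raw data and code are linked above; a minimal sketch of the computation looks something like this (the column names are illustrative, not necessarily those in the scraped file):</p>
<pre><code># Sketch of the expected-salary calculation described above.
# Column names are illustrative; the actual data and code are linked in the post.
majors = read.csv("majors.csv", stringsAsFactors = FALSE)

# expected salary = median salary x probability of being employed
majors$expected_salary = majors$median_salary * (1 - majors$unemployment_rate)

# plot against popularity rank (1 = most popular), with a truck-driver baseline
plot(majors$popularity_rank, majors$expected_salary / 1000,
     xlab = "Popularity rank (1 = most popular)",
     ylab = "Expected salary (thousands of dollars)", pch = 19)
abline(h = 65, lty = 2)   # ~$65K average salary for a truck driver
</code></pre>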
<p><strong>A few editorial remarks:</strong> 1) I understand that being a truck driver is very hard and that there is little room for career development. 2) I am not advocating that people pick majors based on future salaries. 3) I think college freshmen deserve to know the data given how much money they fork over to us. 4) The graph is for bachelor’s degrees, not graduate education. The <a href="http://cew.georgetown.edu/whatsitworth/">CEW</a> website includes data for graduate degrees. Note that Biology shoots way up with a graduate degree. 5) For those interested in a PhD in Statistics I recommend you major in Math with a minor in a liberal arts subject, such as English, while taking as many programming classes as you can. We all know Math is the base for everything statisticians do, but why English? Students interested in academia tend to underestimate the <a href="http://bulletin.imstat.org/2011/09/terence%E2%80%99s-stuff-speaking-reading-writing/" target="_blank">importance of writing and communicating</a>.</p>
<p><strong>Related articles:</strong> <a href="http://www.nytimes.com/2011/11/06/education/edlife/why-science-majors-change-their-mind-its-just-so-darn-hard.html?_r=2" target="_blank">This</a> NY Times article describes how/why students are leaving the sciences. <a href="http://marginalrevolution.com/marginalrevolution/2011/11/college-has-been-oversold.html">Here</a>, Alex Tabarrok describes big changes in the balance of majors between 1985 and today, and <a href="http://marginalrevolution.com/marginalrevolution/2011/11/not-from-the-onion-3.html" target="_blank">here</a> he shares his thoughts on Richard Kim’s editorial. Matt Yglesias explains that <a href="http://thinkprogress.org/yglesias/2011/11/08/363587/unemployment-is-rising-across-the-board/" target="_blank">unemployment is rising across the board</a>. Finally, Peter Orszag shares <a href="http://www.bloomberg.com/news/2011-11-09/winds-of-economic-change-blow-away-college-degree-peter-orszag.html" target="_blank">his views</a> on how a changing world is changing the value of a college degree.</p>
<p>Hat tip to David Santiago for sending several of these links and to Harris Jaffee for help with scraping.</p>
Statisticians on Twitter...help me find more!
2011-11-09T17:10:06+00:00
http://simplystats.github.io/2011/11/09/statisticians-on-twitter-help-me-find-more
<p>In honor of our blog finally dragging itself into the 21st century and jumping onto Twitter/Facebook, I have been compiling a list of statistical people on Twitter. I couldn’t figure out an easy way to find statisticians in one go (which could be because I don’t have Twitter skills). </p>
<p>So here is my very informal list of statisticians I found in a half hour of searching. I know I missed a ton of people; let me know who I missed so I can update!</p>
<p><a href="http://twitter.com/#!/leekgroup" target="_blank">@leekgroup</a> - Jeff Leek (What, you thought I’d list someone else first?)</p>
<p><a href="http://twitter.com/#!/rdpeng" target="_blank">@rdpeng</a> - Roger Peng</p>
<p><a href="http://twitter.com/#!/rafalab" target="_blank">@rafalab</a> - Rafael Irizarry</p>
<p><a href="http://twitter.com/#!/storeylab" target="_blank">@storeylab</a> - John Storey</p>
<p><a href="http://twitter.com/#!/bcaffo" target="_blank">@bcaffo</a> - Brian Caffo</p>
<p><a href="http://twitter.com/#!/sherrirose" target="_blank">@sherrirose </a>- Sherri Rose</p>
<p><a href="http://twitter.com/#!/raphg" target="_blank">@raphg </a>- Raphael Gottardo</p>
<p><a href="http://twitter.com/#!/airoldilab" target="_blank">@airoldilab</a> - Edo Airoldi</p>
<p><a href="http://twitter.com/#!/stat110" target="_blank">@stat110</a> - Joe Blitzstein</p>
<p><a href="http://twitter.com/#!/tylermccormick" target="_blank">@tylermccormick</a> - Tyler McCormick</p>
<p><a href="http://twitter.com/#!/statpumpkin" target="_blank">@statpumpkin</a> - Chris Volinsky</p>
<p><a href="http://twitter.com/#!/fivethirtyeight" target="_blank">@fivethirtyeight</a> - Nate Silver</p>
<p><a href="http://twitter.com/#!/flowingdata" target="_blank">@flowingdata</a> - Nathan Yau</p>
<p><a href="http://twitter.com/#!/kinggary" target="_blank">@kinggary</a> - Gary King</p>
<p><a href="http://twitter.com/#!/StatModeling" target="_blank">@StatModeling</a> - Andrew Gelman</p>
<p><a href="http://twitter.com/#!/AmstatNews" target="_blank">@AmstatNews</a> - Amstat News</p>
<p><a href="http://twitter.com/#!/hadleywickham" target="_blank">@hadleywickham</a> - Hadley Wickham</p>
Coarse PM and measurement error paper
2011-11-08T17:05:05+00:00
http://simplystats.github.io/2011/11/08/coarse-pm-and-measurement-error-paper
<p><a href="http://www.sph.emory.edu/cms/departments_centers/bios/faculty/index.php?Network_ID=HHCHANG" target="_blank">Howard Chang</a>, a former PhD student of mine now at Emory, just published a paper on a <a href="http://www.ncbi.nlm.nih.gov/pubmed/21297159" target="_blank">measurement error model for estimating the health effects of coarse particulate matter (PM)</a>. This is a cool paper that deals with the problem that coarse PM tends to be very spatially heterogeneous. Coarse PM is a bit of a hot topic now because there is currently no national ambient air quality standard for coarse PM specifically. There is a standard for fine PM, but compared to fine PM, the scientific evidence for health effects of coarse PM is relatively less developed. </p>
<p>When you want to assign a coarse PM exposure level to people in a county (assuming you don’t have personal monitoring), there is a fair amount of uncertainty about the assignment because of the spatial variability. This is in contrast to pollutants like fine PM or ozone, which tend to be more spatially smooth. Standard approaches essentially ignore this uncertainty, which may lead to some bias in estimates of the health effects.</p>
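<p>To see why ignoring that uncertainty matters, here is a toy simulation (much simpler than the model in the paper): when the exposure assigned to a county is a noisy version of the true exposure, the naive regression estimate of the health effect is attenuated toward zero.</p>
<pre><code># Toy simulation (not the model in the paper): classical measurement error
# in the exposure attenuates the estimated health effect toward zero.
set.seed(1)
n      = 1000
true_x = rnorm(n)                     # true exposure (e.g., coarse PM)
obs_x  = true_x + rnorm(n, sd = 1)    # spatially noisy, monitor-based exposure
y      = 0.5 * true_x + rnorm(n)      # outcome driven by the true exposure

coef(lm(y ~ true_x))["true_x"]   # about 0.5, the true effect
coef(lm(y ~ obs_x))["obs_x"]     # about 0.25, biased toward zero
</code></pre>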
<p>Howard developed a measurement error model that uses observations from multiple monitors to estimate the spatial variability and correct for it in time series regression models estimating the health effects of coarse PM. Another nice thing about his approach is that it avoids any complex spatial-temporal modeling to do the correction.</p>
<p><strong>Related Posts:</strong> Jeff on “<a href="http://simplystatistics.tumblr.com/post/11024349209/cool-papers" target="_blank">Cool papers</a>” and “<a href="http://simplystatistics.tumblr.com/post/10204192286/dissecting-the-genomics-of-trauma" target="_blank">Dissecting the genomics of trauma</a>”</p>
Is Statistics too darn hard?
2011-11-07T15:33:00+00:00
http://simplystats.github.io/2011/11/07/is-statistics-too-darn-hard
<p>In <a href="http://www.nytimes.com/2011/11/06/education/edlife/why-science-majors-change-their-mind-its-just-so-darn-hard.html?_r=1" target="_blank">this</a> NY Times article, Christopher Drew points out that many students planning engineering and science majors end up switching to other subjects or fail to get any degree. He argues that this is partly due to the difficulty of the classes. In a <a href="http://simplystatistics.tumblr.com/post/12241459446/we-need-better-marketing" target="_blank">previous post</a> we lamented the anemic growth in math and statistics majors in comparison to other majors. I do not think we should make our classes easier just to keep these students. But we can certainly do a better job of motivating the material and teaching it in a more interesting way. After having fun in high school science classes, students entering college are faced with the reality that the first college science classes can be abstract and technical. But in Statistics we certainly can be teaching the practical aspects first. Learning the abstractions is so much easier and enjoyable when you understand the practical problem behind the math. And in Statistics there is always a practical aspect behind the math. The statistics class I took in college was so dry and removed from reality that I can see why it would turn students away from the subject. So, if you are teaching undergrads (or grads), I highly recommend the <a href="http://128.32.135.2/users/statlabs/" target="_blank">Stat Labs textbook</a> by Deb Nolan and Terry Speed, which teaches Mathematical Statistics through applications. If you know of other good books, please post them in the comments. Also, if you know of similar books for other science, technology, engineering, and math (STEM) subjects please share as well.</p>
<p><strong>Related Posts:</strong> Jeff on “<a href="http://simplystatistics.tumblr.com/post/12076163379/the-5-most-critical-statistical-concepts" target="_blank">The 5 most critical statistical concepts</a>”, Rafa on “<a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">The future of graduate education</a>”, Jeff on “<a href="http://simplystatistics.tumblr.com/post/11770724755/graduate-student-data-analysis-inspired-by-a" target="_blank">Graduate student data analysis inspired by a high-school teacher</a>”</p>
Reproducible research: Notes from the field
2011-11-06T16:13:05+00:00
http://simplystats.github.io/2011/11/06/reproducible-research-notes-from-the-field
<p>Over the past year, I’ve been doing a lot of talking about reproducible research. Talking to people, talking on panel discussions, and talking about <a href="http://simplystatistics.tumblr.com/post/12243614318/i-gave-a-talk-on-reproducible-research-back-in" target="_blank">some of my own work</a>. It seems to me that interest in the topic has exploded recently, in part due to some recent scandals, such as the <a href="http://simplystatistics.tumblr.com/post/10068195751/the-duke-saga" target="_blank">Duke clinical trials fiasco</a>.</p>
<p>If you are unfamiliar with the term “reproducible research”, the basic idea is that authors of published research should make available the necessary materials so that others may reproduce the published findings to a very high degree of similarity. If that definition seems imprecise, well, that’s because it is.</p>
<!-- more -->
<p>I think reproducibility becomes easier to define in the context of a specific field or application. Reproducibility often comes up in the context of computational science. In computational science fields, much of the work is done on the computer, often using very large amounts of data. In other words, the analysis of the data is of comparable difficulty to the collection of the data (maybe even more complicated). Then the notion of reproducibility typically comes down to the idea of making the analytic data and the computer code available to others. That way, knowledgeable people can run your code on your data and presumably get your results. If others do not get your results, then that may be a sign of a problem, or perhaps a misunderstanding. In either case, a resolution needs to be found. Reproducibility is key to science much the way it is key to programming. When bugs are found in software, being able to reproduce the bug is an important step to fixing it. Anyone learning to program in C knows the pain of dealing with a memory-related bug, which will often exhibit seemingly random and non-reproducible behavior.</p>
<p>My discussions with others about the need for reproducibility in science often range far and wide. One reason is that many people have very different ideas about (a) what reproducibility is and (b) why we need it. Here is my take on various issues.</p>
<ul>
<li><strong>Reproducibility is not replication</strong>. There’s often honest confusion between the notion of reproducibility and what I would call a “full replication”. A full replication doesn’t analyze the same dataset, but rather involves an independent investigator collecting an independent dataset conducting an independent analysis. Full replication has been a fundamental component of science for a long time now and will continue to be the primary yardstick for measuring the plausibility of scientific claims. I think most would agree that full replication is preferable, but often it is simply not possible.</li>
<li><strong>Reproducibility is not needed solely to prevent fraud</strong>. I’ve heard many people emphasize reproducibility as a means to prevent fraud. Journal editors seem to think this is the main reason for demanding reproducibility. It is <em>one</em> reason, but to be honest, I’m not sure it’s all that useful for detecting fraud. If someone truly wants to commit fraud, then it’s possible to make the fraud reproducible. If I just generate a bunch of numbers and claim it as data that I collected, any analysis from that dataset can be reproducible. While demanding reproducibility may be useful for ferreting out certain types of fraud, it’s not a general solution and it’s not the primary reason we need it.</li>
<li><strong>Reproducibility is not as easy as it sounds</strong>. Making one’s research reproducible is hard. It’s especially hard when you try to do it <em>after</em> the research has been done. In that case it’s more like an audit, and I’m guessing for most people the word “audit” is NOT synonymous with “fun”. Even if you set out to make your work reproducible from the get go, it’s easy to miss things. Code can get lost (even with a version control system) and metadata can slip through the cracks. Even when you’ve done everything right, computers and software can change. Virtual machines like Amazon EC2 and others seem to have some potential. The single most useful tool that I have found is a good version control system, like <a href="http://git-scm.com/" target="_blank">git</a>. </li>
<li><strong>At this point, anything would be better than nothing</strong>. Right now, I think the bar for reproducibility is quite low in the sense that most published work is not reproducible. Even if data are available, often the code that analyzed the data is not available. So if you’re publishing research and you want to make it at least partially reproducible, just put what you can out there. On the web, on <a href="http://github.com" target="_blank">github</a>, in a data repository, wherever you can. If you can’t publish the data, make your code available. Even that is better than nothing. In fact, I find reading someone’s code to be very informative and often questions can arise without looking at the data. Until we have a better infrastructure for distributing reproducible research, we will have to make do with what we have. But if we all start putting stuff out there, the conversation will turn from “Why should I make stuff available?” to “Why wouldn’t I make stuff available?”</li>
</ul>
New ways to follow Simply Statistics
2011-11-05T16:11:05+00:00
http://simplystats.github.io/2011/11/05/new-ways-to-follow-simply-statistics
<p>In case you prefer to follow Simply Statistics using some other platforms, we’ve added two new features. First, we have an official <a href="http://twitter.com/simplystats" target="_blank">Twitter feed</a> that you can follow. We also have a new <a href="http://facebook.com/simplystatistics" target="_blank">Facebook page</a> that you can like. Please follow us and join the discussion!</p>
Interview with Victoria Stodden
2011-11-04T16:06:05+00:00
http://simplystats.github.io/2011/11/04/interview-with-victoria-stodden
<p><strong>Victoria Stodden</strong></p>
<p><img height="300" width="250" src="http://biostat.jhsph.edu/~jleek/vcs.jpg" /></p>
<p>Victoria Stodden is an assistant professor of statistics at Columbia University in New York City. She moved to Columbia after getting her Ph.D. at Stanford University. Victoria has made major contributions to the area of reproducible research and has been appointed to the NSF’s Advisory Committee for Infrastructure. She is the recent recipient of an NSF grant for “Policy Design for Reproducibility and Data Sharing in Computational Science”.</p>
<!-- more -->
<p><strong>Which term applies to you: data scientist/statistician/analyst (or something else)?</strong></p>
<p>Definitely statistician. My PhD is from the stats department at Stanford University.</p>
<p><strong>How did you get into statistics/data science (e.g. your history)?</strong></p>
<p>Since my undergrad days I’ve been motivated by problems in what’s called ‘social welfare economics.’ I interpret that as studying how people can best reach their potential, particularly how the policy environment affects outcomes. This includes the study of regulatory design, economic growth, access to knowledge, development, and empowerment. My undergraduate degree was in economics, and I thought I would carry on with a PhD in economics as well. I realized that folks with my interests were mostly doing empirical work so I thought I should prepare myself with the best training I could in statistics. Hence I chose to do a PhD in statistics to augment my data analysis capabilities as much as I could since I envisioned myself immersed in empirical research in the future.</p>
<p><strong>What is the problem currently driving you?</strong></p>
<p>Right now I’m working on the problem of reproducibility in our body of published computational science. This ties into my interests because of the critical role of knowledge and reasoning in advancing social welfare. Scientific research is becoming heavily computational and as a result the empirical work scientists do is becoming more complex and yet less tacit: the myriad decisions made in data filtering, analysis, and modeling are all recordable in code. In computational research there are so many details in the scientific process it is nearly impossible to communicate them effectively in the traditional scientific paper – rendering our published computational results unverifiable, if there isn’t access to the code and data that generated them.</p>
<p>Access to the code and data permits readers to check whether the descriptions in the paper correspond to the published results, and allows people to understand why independent implementations of the methods in the paper might produce differing results. It also puts the tools of scientific reasoning into people’s hands – this is new. For much of scientific research today all you need is an internet connection to download the reasoning associated with a particular result. Wide availability of the data and code is still largely a dream, but one the scientific community is moving towards.</p>
<p><strong>Who were really good mentors to you? What were the qualities that really helped you?</strong></p>
<p>My advisor, David Donoho, is an enormous influence. He is the clearest scientific thinker I have ever been exposed to. I’ve been so very lucky with the people who have come into my life. Through his example, Dave is the one who has had the most impact on how I think about and prioritize problems and how I understand our role as statisticians and scientific thinkers. He’s given me an example of how to do this and it’s hard to overestimate his influence in my life.</p>
<p><strong>What do you think are the barriers to reproducible research?</strong></p>
<p>At this point, incentives. There are many concrete barriers, which I talk about in my papers and talks (available on my website <a href="http://stodden.net" target="_blank">http://stodden.net</a>), but they all stem from misaligned incentives. If you think about it, scientists do lots of things they don’t particularly like in the interest of research communication and scientific integrity. I don’t know any computational scientist who really loves writing up their findings into publishable articles, for example, but they do. This is because the right incentives exist. A big part of the work I am doing concerns the scientific reward structure. For example, my work on the Reproducible Research Standard is an effort to realign the intellectual property rules scientists are subject to, to be closer to our scientific norms. Scientific norms create the incentive structure for the production of scientific research, providing rewards for doing things people might not do otherwise. For example, scientists have a long established norm of giving up all intellectual property rights over their work in exchange for attribution, which is the currency of success. It’s the same for sharing the code and data that underlies published results – not part of the scientific incentive and reward structure today but becoming so, through adjusting a variety of other factors like funding agency policy, journal publication policy, and expectations at the institutional level.</p>
<p><strong>What have been some success stories in reproducible research?</strong></p>
<p>I can’t help but point to my advisor, David Donoho. An example he gives is his release of <a href="http://www-stat.stanford.edu/~wavelab" target="_blank"><a href="http://www-stat.stanford.edu/~wavelab" target="_blank">http://www-stat.stanford.edu/~wavelab</a></a> - the first implementation of wavelet routines in MATLAB, before MATLAB included their own wavelet toolbox. The release of the Wavelab code was a factor that he believes made him one of the top 5 highly cited authors in Mathematics in 2000.</p>
<p>Hiring and promotion committees seem to be starting to recognize the difference between candidates that recognize the importance of reproducibility and clear scientific communication, compared to others who seem to be wholly innocent of these issues.</p>
<p>There is a nascent community of scientific software developers that is achieving remarkable success. I co-organized a workshop this summer bringing some of these folks together, see <a href="http://www.stodden.net/AMP2011" target="_blank"><a href="http://www.stodden.net/AMP2011" target="_blank">http://www.stodden.net/AMP2011</a></a>. There are some wonderful projects underway to assist in reproducibility, from workflow tracking to project portability to unique identifiers for results reproducible in the cloud. Fascinating stuff.</p>
<p><strong>Can you tell us a little about the legal ramifications of distributing code/data?</strong></p>
<p>Sure. Many aspects of our current intellectual property laws are quite detrimental to the sharing of code and data. I’ll discuss the two most impactful ones. Copyright creates exclusive rights vested in the author for original expressions of ideas – and it’s a default. What this means is that your expression of your idea – your code, your writing, figures you create – are by default copyright to you. So for your lifetime and 70+ years after that, you (or your estate) need to give permission for the reproduction and re-use of the work – this is exactly counter to scientific norms of independent verification and building on others’ findings. The Reproducible Research Standard is a suite of licenses that permit scientists to set the terms of use of their code, data, and paper according to scientific norms: use freely but attribute. I have written more about this here: <a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4720221" target="_blank">http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4720221</a></p>
<p>In 1980 Congress passed the Bayh-Dole Act, which was designed to create incentives for access to federally funded scientific discoveries by securing ownership rights for universities with regard to inventions by their researchers. The idea was that these inventions could then be patented and licensed by the university, making the otherwise unavailable technology available for commercial development. Notice that Bayh-Dole was passed on the eve of the computer revolution and Congress could not have foreseen the future importance of code to scientific investigation and its subsequent susceptibility to patentability. The patentability of scientific code now creates incentives to keep the code hidden: to avoid creating prior art in order to maximize the chance of obtaining the patent, and to keep hidden from potential competitors any information that might be involved in commercialization. Bayh-Dole has created new incentives for computational scientists – that of startups and commercialization – that must be reconciled with traditional scientific norms of openness.</p>
<p><strong>Related Posts:</strong> Jeff’s interviews with <a href="http://simplystatistics.tumblr.com/post/11436138110/interview-with-daniela-witten" target="_blank">Daniela Witten</a> and <a href="http://simplystatistics.tumblr.com/post/11729003971/interview-with-chris-barr" target="_blank">Chris Barr</a>. Roger’s <a href="http://simplystatistics.tumblr.com/post/12243614318/i-gave-a-talk-on-reproducible-research-back-in" target="_blank">talk on reproducibility</a> </p>
Free access publishing is awesome...but expensive. How do we pay for it?
2011-11-03T16:05:06+00:00
http://simplystats.github.io/2011/11/03/free-access-publishing-is-awesome-but-expensive-how
<p>I am a huge fan of open access journals. I think open access is good both for moral reasons (science should be freely available) and for more selfish ones (I want people to be able to read my work). If given the choice, I would publish all of my work in journals that distribute results freely.</p>
<p>But it turns out that for most open/free access systems, the publishing charges are paid by the scientists publishing in the journals. I did a quick scan and compiled this little table of how much it costs to publish a paper in different journals (<a href="http://www.springeropen.com/about/apccomparison/" target="_blank">here</a> is a bigger table):</p>
<ul>
<li><strong>PLoS One</strong> $1,350.00</li>
<li><strong>PLoS Biology</strong>: $2,900.00</li>
<li><strong>BMJ Open</strong> $1,937.28</li>
<li><strong>Bioinformatics (Open Access Option)</strong> $3,000.00</li>
<li><strong>Genome Biology (Open Access Option)</strong> $2,500.00</li>
<li><strong>Biostatistics (Open Access Option)</strong> $3,000.00</li>
</ul>
<!-- more -->
<p>The first thing I noticed is that it costs a minimum of about $1,500 to get a paper published open access. That may not seem like a lot of money, and most journals offer discounts to people who can’t pay. But it still adds up: this last year my group published 7 papers. If I paid for all of them to be published open access, that would be at minimum $10,500! That is half the salary of a graduate student researcher for an entire year. For a senior scientist that may be no problem, but for early career scientists, or scientists with limited access to resources, it is a big challenge.</p>
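<p>The arithmetic is simple enough to check in a line or two of R (fees taken from the table above):</p>
<pre><code># Back-of-the-envelope version of the arithmetic above (fees from the table)
fees = c(PLoSOne = 1350, PLoSBiology = 2900, BMJOpen = 1937.28,
         Bioinformatics = 3000, GenomeBiology = 2500, Biostatistics = 3000)
min(fees)        # cheapest listed fee, $1,350
7 * 1500         # 7 papers at roughly $1,500 each: $10,500
7 * max(fees)    # 7 papers at the most expensive listed fee: $21,000
</code></pre>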
<p>Publishers who are solely dedicated to open access (PLoS, BMJ Open, etc.) seem to have on average lower publication charges than journals who only offer open access as an option. I think part of this is that the journals that aren’t open access in general have to make up some of the profits they lose by making the articles free. I certainly don’t begrudge the journals the costs. They have to maintain the websites, format the articles, and run the peer review process. That all costs money.</p>
<p><strong>A modest proposal</strong></p>
<p>What I wonder is whether there is a better place for that money to come from. Here is one proposal (hat tip to Rafa): academic and other libraries pay a ton of money for subscriptions to journals like Nature and Science. They also are required to pay for journals in a large range of disciplines. What if, instead of investing this money in subscriptions for their university, academic libraries pitched in and subsidized the publication costs of open/free access?</p>
<p>If all university libraries pitched in, the cost for any individual library would be relatively small. It would probably be less than paying for subscriptions to hundreds of journals. At the same time, it would be an investment that would benefit not only the researchers at their school, but also the broader scientific community by keeping research open. Then neither the people publishing the work, nor the people reading it would be on the hook for the bill.</p>
<p>This approach is the route taken by <a href="http://arxiv.org/" target="_blank">ArXiv</a>, a free database of unpublished papers. These papers haven’t been peer reviewed, so they don’t always carry the same weight as papers published in peer-reviewed journals. But there are a lot of really good and important papers in the database - it is an almost universally accepted pre-print server.</p>
<p>The other nice thing about ArXiv is that you don’t pay for article processing; the papers are published as is. The papers don’t look quite as pretty as they do in Nature/Science or even PLoS, but it is also much cheaper. The only costs associated with making this a full-fledged peer-reviewed journal would be refereeing (which scientists do for free anyway) and editorial responsibilities (again mostly volunteered by scientists).</p>
I Gave A Talk On Reproducible Research Back In
2011-11-02T16:05:00+00:00
http://simplystats.github.io/2011/11/02/i-gave-a-talk-on-reproducible-research-back-in
<p>[Video: <a href="http://www.youtube.com/watch?v=aH8dpcirW1U" target="_blank">Reproducible research talk (YouTube)</a>]</p>
<p>I gave a talk on reproducible research back in July at the Applied Mathematics Perspectives workshop in Vancouver, BC.</p>
<p>In addition to the YouTube version, there’s also a Silverlight version where you can <a href="http://mediasite.mediagroup.ubc.ca/MediaGroup/Viewer/?peid=1c8f6b5a331546ed9f28631239d8b24d1d" target="_blank">actually see the slides</a> while I’m talking.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Guest Post: SMART thoughts on the ADHD 200 Data Analysis Competition
2011-11-02T15:37:42+00:00
http://simplystats.github.io/2011/11/02/guest-post-smart-thoughts-on-the-adhd-200-data
<p><strong>Note</strong>: <em>This is a guest post by our colleagues Brian Caffo, Ani Eloyan, Fang Han, Han Liu, John Muschelli, Mary Beth Nebel, Tuo Zhao and Ciprian Crainiceanu. They <a href="http://fcon_1000.projects.nitrc.org/indi/adhd200/results.html" target="_blank">won</a> the <a href="http://fcon_1000.projects.nitrc.org/indi/adhd200/" target="_blank">ADHD 200</a> imaging data analysis competition. There has been <a href="http://www.reddit.com/r/cogsci/comments/lblqs/the_adhd200_competition_was_intended_to_exhibit/" target="_blank">some</a> <a href="http://www.talyarkoni.org/blog/2011/10/12/brain-based-prediction-of-adhd-now-with-100-fewer-brains/" target="_blank">controversy</a> around the results because one team obtained a higher score without using any of the imaging data. Our colleagues have put together a very clear discussion of the issues raised by the competition, so we are publishing it here to contribute to the discussion. Questions about this post should be directed to the Hopkins team leader, <a href="http://www.bcaffo.com/home/contacts" target="_blank">Brian Caffo</a>.</em></p>
<p><span><strong>Background</strong></span></p>
<p><span id="internal-source-marker_0.8358260081149638">Below we share some thoughts about the ADHD 200 competition, a landmark competition using functional and structural brain imaging data to predict ADHD status. </span></p>
<p><span id="internal-source-marker_0.8358260081149638"> </span></p>
<!-- more -->
<p>Note, we’re calling these “SMART thoughts” to draw attention to our working group, “Statistical Methods and Applications for Research in Technology” (<a href="http://www.smart-stats.org/" target="_blank"><a href="http://www.smart-stats.org" target="_blank">www.smart-stats.org</a></a>), though hopefully the acronym applies in the non-intended sense as well.</p>
<p>Our team was declared the official winners of the competition. However, a team from the University of Alberta scored a higher number of competition points, but was disqualified for not having used imaging data. We have been in email contact with a representative of that team and have enjoyed the discussion. We found those team members to be gracious and to embody an energy and scientific spirit that are refreshing to encounter.</p>
<p>We mentioned our sympathy to them, in that the process seemed unfair, especially given the vagueness of what qualifies as use of the imaging data. More on this thought below.</p>
<p>This brings us to the point of this note, concern over the narrative surrounding the competition based on our reading of web pages, social media and water cooler discussions.</p>
<p>We are foremost concerned with the unwarranted conclusion that because the team with the highest competition point total did not use imaging data, the overall scientific validity of using (f)MRI imaging data to study ADHD is now in greater doubt.</p>
<p>We stipulate that, like many others, we are skeptical of the utility of MRI data for tasks such as ADHD diagnoses. We are not arguing against such skepticism. Instead we are arguing against using the competition results as if they were strong evidence for such skepticism.</p>
<p>We raise four points to argue against overreacting to the competition outcome with respect to the use of structural and functional MRI in the study of ADHD.</p>
<h3 id="point-1-the-competition-points-are-not-an-accurate-measure-of-performance-and-scientific-value"><strong>Point 1. The competition points are not an accurate measure of performance and scientific value.</strong></h3>
<p>Because the majority of the training set, and hence presumably the test set, was typically developing (TD), the competition points favored specificity. In addition, a correct label of TD yielded 1 point, while a correct ADHD diagnosis with an incorrect subtype yielded 0.5 points.</p>
<p>These facts suggest a classifier that declares everyone TD as a starting point. For example, if 60% of the 197 test subjects are controls, this algorithm would yield 118 competition points, better than all but a few entrants. In fact, if 64.5% or more of the test set is TD, this algorithm wins over Alberta (and hence everyone else).</p>
<p>In addition, competition points are variables possessing randomness. It is human nature to interpret the anecdotal rankings of competitions as definitive evidence of superiority. This works fine as long as rankings are reasonably deterministic, but it is riddled with logical flaws when rankings are stochastic. Variability in rankings has a huge effect on the result of competitions, especially when highly tuned prediction methods from expert teams are compared. Indeed, in such cases the confidence intervals of the AUCs (or other competition criteria) overlap. The 5th or 10th place team may actually have had the most scientifically informative algorithm.</p>
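<p>For concreteness, the arithmetic above can be checked directly; this sketch simply restates the numbers in the text under the scoring rule as described.</p>
<pre><code># Back-of-the-envelope check of the "label everyone TD" baseline described above.
# Scoring rule as described: 1 point per correctly labeled TD subject;
# ADHD subjects score nothing under this trivial classifier.
n_test = 197
points_all_td = function(prop_td) n_test * prop_td

points_all_td(0.60)    # about 118 points if 60% of the test set is TD
points_all_td(0.645)   # about 127 points, the level at which this trivial
                       # classifier would beat every actual entrant
</code></pre>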
<h3 id="point-2-biologically-valueless-predictors-were-important"><strong>Point 2. Biologically valueless predictors were important.</strong></h3>
<p><span>Most importantly, contributing location (aka site) was a key determinant of prediction performance. Site is a proxy for many things: the demographics of the ADHD population in the site PI’s studies, the policies by which a PI chose to include data, scanner type, IQ measure, missing data patterns, data quality, and so on. </span><br />
<span></span><br />
<span>In addition to site, the presence of missing data and overall data quality also carried potentially important information for prediction, despite being (biologically) unrelated to ADHD. The likely causality, if it exists, would point in the reverse direction (i.e. that presence of ADHD results in a greater propensity for missing data and lower data quality, perhaps due to movement in the scanner).</span><br />
<span></span><br />
<span>This is a general fact regarding prediction algorithms, which do not intrinsically account for causal directions or biological significance.</span></p>
<h3 id="point-3-the-majority-of-the-imaging-data-is-not-prognostic"><strong>Point 3. The majority of the imaging data is not prognostic.</strong></h3>
<p><span>Likely every entrant, and the competition organizers, were aware that the majority of the imaging data is not useful for predicting ADHD. (Here we use the term “imaging data” loosely, meaning raw and/or processed data.) In addition, the imaging data are noisy. Therefore, use of these data introduced tens of billions of unnecessary numbers to predict 197 diagnoses. </span><br />
<span></span><br />
<span>As such, even if extremely important variables are embedded in the imaging data, (non-trivial) use of all of the imaging data could degrade performance, regardless of the ultimate value of the data. </span><br />
<span></span><br />
<span>To put this in other words, suppose all entrants were offered an additional 10 billion numbers, say genomic data, known to be noisy and, in aggregate, not predictive of disease. However, suppose that some unknown function of a small collection of variables was very meaningful for prediction, as is presumably the case with genomic data. If the competition did not require its use, a reasonable strategy would be to avoid using these data altogether. </span><br />
<span></span><br />
<span>Thus, in a scientific sense, we are sympathetic to the organizers’ choice to eliminate the Alberta team, since a primary motivation of the competition was to encourage a large set of eyes to sift through a large collection of very noisy imaging data. </span><br />
<span></span><br />
<span>Of course, as stated above, we believe that what constitutes a sufficient use of the imaging data is too vague to be an adequate rule to eliminate a team in a competition. </span><br />
<span></span><br />
<span>Thus our scientifically motivated support of the organizers conflicts with our procedural dispute of the decision made to eliminate the Alberta team.</span><span></span></p>
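<p><span>A small simulated example (ours, not drawn from the competition data) illustrates the earlier claim in this section: even when one genuinely informative variable is present, burying it among hundreds of pure-noise variables can drag a simple classifier back toward chance.</span></p>
<pre>## Toy illustration with a 1-nearest-neighbour classifier
## (knn() from the 'class' package, which ships with R)
library(class)
set.seed(2)
n <- 200
y <- rbinom(n, 1, 0.5)
signal <- 2 * y + rnorm(n)                   # one informative variable (2-SD separation)
noise  <- matrix(rnorm(n * 500), nrow = n)   # 500 variables carrying no signal
train <- 1:100; test <- 101:200
acc <- function(x) {
    pred <- knn(x[train, , drop = FALSE], x[test, , drop = FALSE],
                factor(y[train]))
    mean(pred == y[test])
}
acc(cbind(signal))          # signal alone: clearly better than chance
acc(cbind(signal, noise))   # signal plus noise: accuracy falls toward 0.5
</pre>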
<h3 id="point-4-accurate-prediction-of-a-response-is-neither-necessary-nor-sufficient-for-a-covariate-to-be-biologically-meaningful"><strong>Point 4. Accurate prediction of a response is neither necessary nor sufficient for a covariate to be biologically meaningful.</strong></h3>
<p><span>Accurate prediction of a response is an extremely high bar for a variable of interest. Consider drug development for ADHD. A drug </span><span>does not</span> <span>have to demonstrate that its application to a collection of symptomatic individuals would predict </span><span>with high accuracy</span> <span>a later abatement of symptoms. Instead, a successful drug would have to demonstrate a mild</span> <span>averaged</span> <span>improvement over a placebo or standard therapy when randomized. </span><br />
<span></span><br />
<span>As an example, consider randomly administering such a drug to 50 of 100 subjects who have ADHD at baseline. Suppose data are collected at 6 and 12 months. Further suppose that 8 out of 50 of those receiving the drug had no ADHD symptoms at 12 months, while 1 out of 50 of those receiving placebo had no ADHD symptoms at 12 months. The Fisher’s exact test P-value is .03, by the way. </span><br />
<span></span><br />
<span>The statistical evidence points to the drug being effective. Knowledge of drug status, however, would do little to improve prediction accuracy. That is, given a new data set of subjects with ADHD at baseline and knowledge of drug status, the most accurate classification for every subject would be to guess that they will continue to have ADHD symptoms at 12 months. Of course, our confidence in that prediction would be slightly lower for those having received the drug.</span><br />
<span></span><br />
<span>However, consider using ADHD status at 6 months as a predictor. This would be enormously effective at locating those subjects who have an abatement of symptoms whether they received the drug or not. In this thought experiment, one predictor (symptoms at 6 months) is highly predictive, but not meaningful (it simply suggests that Y is a good predictor of Y), while the other (presence of drug at baseline) is only mildly predictive, but is statistically and biologically significant.</span><br />
<span></span><br />
<span>As another example, consider the ADHD200 data set. Suppose that a small structural region is highly impacted in an unknown subclass of ADHD. Some kind of investigation of morphometry or volumetrics might detect an association with disease status. The association would likely be weak, given absence of a-priori knowledge of this region or the subclass. This weak association would not be useful in a prediction algorithm. However, digging into this association could potentially inform the biological basis of the disease and further refine the ADHD phenotype.</span><br />
<span></span><br />
<span>Thus, we argue that it is important to differentiate the ultimate goals of obtaining high prediction accuracy with that of biological discovery of complex mechanisms in the presence of high dimensional data. </span></p>
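<p><span>For readers who want to check the drug example above, here is the corresponding 2×2 table and Fisher’s exact test in R; the counts are exactly those in the thought experiment, nothing more.</span></p>
<pre>## 8 of 50 on drug vs. 1 of 50 on placebo were symptom-free at 12 months
tab <- matrix(c(8, 42, 1, 49), nrow = 2,
              dimnames = list(c("no symptoms", "symptoms"),
                              c("drug", "placebo")))
fisher.test(tab)$p.value   # about 0.03, as quoted above
</pre>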
<h3 id="conclusions"><strong>Conclusions</strong></h3>
<p><em>We urge caution in over-interpretation of the scientific impact of the University of Alberta’s strongest performance in the competition. </em><br />
<span></span><br />
<span>Ultimately, what Alberta’s having the highest point total established is that they are fantastic people to talk to if you want to achieve high prediction accuracy. (Looking over their work, this appears to have already been established prior to the competition :-).</span><br />
<span></span><br />
<span>It was not established that brain structure or resting state function, as measured by MRI, is a blind alley in the scientific exploration of ADHD.</span></p>
<p><span><strong>Related Posts: </strong>Roger on “<a href="http://simplystatistics.tumblr.com/post/11611102993/caffo-ninjas-awesome" target="_blank">Caffo + Ninjas = Awesome”</a>, Rafa on the “<a href="http://simplystatistics.tumblr.com/post/11732716036/the-self-assessment-trap" target="_blank">Self Assessment Trap</a>”, Roger on “<a href="http://simplystatistics.tumblr.com/post/10441403664/private-health-insurers-to-release-data" target="_blank">Private health insurers to release data</a>”</span></p>
We need better marketing
2011-11-02T14:45:30+00:00
http://simplystats.github.io/2011/11/02/we-need-better-marketing
<p>In <a href="http://marginalrevolution.com/marginalrevolution/2011/11/college-has-been-oversold.html" target="_blank">this post</a> Alex Tabarrok argues that not enough people are obtaining “degrees that pay” and that college has been oversold. It struck me that the number of students studying Visual and Performing Arts has more than doubled since 1985. Yet for Math and Statistics there has been no increase at all! We need to do a better job at marketing. The great majority (if not all) of the people I know with Statistics degrees have found a job related to Statistics. With a Master’s, salary can be as high as <a href="http://www.payscale.com/research/US/Degree=Master_of_Science_(MS),_Statistics/Salary" target="_blank">$110K</a>. So to those interested in Visual and Performing Arts who are good with numbers, I suggest you hedge your bets: do a double major and consider Statistics. My <a href="https://www.facebook.com/group.php?gid=5049582229" target="_blank">brother</a>, a <a href="http://www.youtube.com/watch?v=yx_sPV04Img" target="_blank">successful</a> musician, majored in Math. He uses his math skills to supplement his income by playing poker with other musicians.</p>
Computing on the Language Followup
2011-11-01T16:05:05+00:00
http://simplystats.github.io/2011/11/01/computing-on-the-language-followup
<p>My article on <a href="http://simplystatistics.tumblr.com/post/11988685443/computing-on-the-language" target="_blank">computing on the language</a> was unexpectedly popular and so I wanted to quickly follow up on my own solution. Many of you got the answer, and in fact many got solutions that were quite a bit shorter than mine. Here’s how I did it:</p>
<pre>makeList <- function(...) {
    args <- substitute(list(...))
    nms <- sapply(args[-1], deparse)
    vals <- list(...)
    names(vals) <- nms
    vals
}</pre>
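<p>A quick sanity check of the function above (nothing fancy, just confirming the names come through):</p>
<pre>x <- 1:3
y <- "hello"
makeList(x, y)   # returns list(x = 1:3, y = "hello"), named after the objects
</pre>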
<p><span>Baptiste</span> pointed out that Frank Harrell has already implemented this function in his Hmisc package as the ‘llist’ function (thanks for the pointer!). I’ll just note that this function actually does a bit more because each element of the returned list is an object of class “labelled”.</p>
<p>The shortest solution was probably Tony Breyal’s version:</p>
<pre>makeList <- function(...) {
    structure(list(...), names = names(data.frame(...)))
}
</pre>
<p>However, it’s worth noting that this function modifies the object’s name if the name is non-standard (i.e. if you’re using backticks like `r object name`). That’s just because the ‘data.frame’ function automatically modifies names if they are non-standard.</p>
<p>Thanks to everyone for the responses! I’ll try to come up with another one soon.</p>
Advice on promotion letters bleg
2011-11-01T01:13:56+00:00
http://simplystats.github.io/2011/11/01/advice-on-promotion-letters-bleg
<p>This fall I have been asked to write seven promotion letters. Writing these takes me at least 2 hours. If it’s someone I don’t know it takes me longer because I have to read some of their papers. Earlier this year, I wrote one for a Biology department that took me at least 6 hours. So how many are too many? Should I set a limit? Advice and opinions in the comments would be greatly appreciated.</p>
The 5 Most Critical Statistical Concepts
2011-10-29T16:05:05+00:00
http://simplystats.github.io/2011/10/29/the-5-most-critical-statistical-concepts
<p>It seems like everywhere we look, data is being generated - from politics, to biology, to publishing, to social networks. There are also diverse new computational tools, like GPGPU and cloud computing, that expand the statistical toolbox. Statistical theory is more advanced than it’s ever been, with exciting work in a range of areas. </p>
<p>With all the excitement going on around statistics, there is also increasing diversity. It is increasingly hard to define “statistician” since the definition ranges from <a href="http://www.stat.washington.edu/jaw/" target="_blank">very mathematical</a> to <a href="http://en.wikipedia.org/wiki/Nate_Silver" target="_blank">very applied</a>. An obvious question is: what are the most critical skills needed by statisticians? </p>
<!-- more -->
<p>So just for fun, I made up my list of the top 5 most critical skills for a statistician by my own definition. They are by necessity very general (I only gave myself 5). </p>
<ol>
<li><strong>The ability to manipulate/organize/work with data on computers</strong> - whether it is with excel, R, SAS, or Stata, to be a statistician you have to be able to work with data. </li>
<li><strong>A knowledge of exploratory data analysis</strong> - how to make plots, how to discover patterns with visualizations, how to explore assumptions</li>
<li><strong>Scientific/contextual knowledge</strong> - at least enough to be able to abstract and formulate problems. This is what separates statisticians from mathematicians. </li>
<li><strong>Skills to distinguish true from false patterns</strong> - whether with p-values, posterior probabilities, meaningful summary statistics, cross-validation or any other means. </li>
<li><strong>The ability to communicate results to people without math skills</strong> - a key component of being a statistician is knowing how to explain math/plots/analyses.</li>
</ol>
<p>What are your top 5? What order would you rank them in? Even though these are so general, I almost threw regression in there because of how often it pops up in various forms. </p>
<p><strong>Related Posts:</strong> Rafa on <a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">graduate education</a> and <a href="http://simplystatistics.tumblr.com/post/10021164565/what-is-a-statistician" target="_blank">What is a Statistician</a>? Roger on <a href="http://simplystatistics.tumblr.com/post/11655593971/do-we-really-need-applied-statistics-journals" target="_blank">“Do we really need applied statistics journals?”</a></p>
Computing on the Language
2011-10-27T12:17:24+00:00
http://simplystats.github.io/2011/10/27/computing-on-the-language
<p>And now for something a bit more esoteric….</p>
<p>I recently wrote a function to deal with a strange problem. Writing the function ended up being a fun challenge related to computing on the R language itself.</p>
<p>Here’s the problem: Write a function that takes any number of R objects as arguments and returns a list whose names are derived from the names of the R objects.</p>
<!-- more -->
<p>Perhaps an example provides a better description. Suppose the function is called ‘makeList’. Then </p>
<pre>x <- 1
y <- 2
z <- "hello"
makeList(x, y, z)
</pre>
<p>returns</p>
<pre>list(x = 1, y = 2, z = "hello")
</pre>
<p>It originally seemed straightforward to me, but it turned out to be very much not straightforward. </p>
<p>Note that a function like this is probably most useful during interactive sessions, as opposed to programming.</p>
<p>I challenge you to take a whirl at writing the function, you know, in all that spare time you have. I’ll provide my solution in a future post.</p>
Visualizing Yahoo Email
2011-10-26T16:23:44+00:00
http://simplystats.github.io/2011/10/26/visualizing-yahoo-email
<p><a href="http://visualize.yahoo.com/" target="_blank">Here</a> is a cool page where yahoo shows you the email it is processing in real time. It includes a visualization of the most popular words in emails at a given time. A pretty neat tool and definitely good for procrastination, but I’m not sure what else it is good for…</p>
Web-scraping
2011-10-24T16:05:05+00:00
http://simplystats.github.io/2011/10/24/web-scraping
<p>The internet is the greatest source of publicly available data. One of the key skills to being able to obtain data from the web is “web-scraping”, where you use a piece of software to run through a website and collect information. </p>
<p>This technique can be used for collecting data from databases or to collect data that is scattered across a website. Here is a very cool little <a href="http://thebiobucket.blogspot.com/2011/10/little-webscraping-exercise.html" target="_blank">exercise</a> in web-scraping that can be used as an example of the things that are possible. </p>
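<p>If you have never tried it, here is a bare-bones sketch in base R (my own toy example, not taken from the linked exercise): grab a page with readLines and pull out its links with a regular expression. For anything serious you will want a real HTML parser (for example the XML package), since regular expressions and HTML get along badly.</p>
<pre>## Fetch a page and extract the href targets -- crude, but shows the idea
url <- "http://www.r-project.org/"   # any publicly readable page will do
html <- readLines(url, warn = FALSE)
links <- regmatches(html, gregexpr('href="[^"]+"', html))
unique(unlist(links))
</pre>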
<p><strong>Related Posts</strong>: Jeff on <a href="http://simplystatistics.tumblr.com/post/11237403492/apis" target="_blank">APIs</a>, <a href="http://simplystatistics.tumblr.com/post/10410458080/data-sources" target="_blank">Data Sources</a>, <a href="http://simplystatistics.tumblr.com/post/11224744922/a-nice-presentation-on-regex-in-r" target="_blank">Regex</a>, and <a href="http://simplystatistics.tumblr.com/post/10766696449/the-open-data-movement" target="_blank">The Open Data Movement</a>.</p>
Archetypal Athletes
2011-10-24T13:37:00+00:00
http://simplystats.github.io/2011/10/24/archetypal-athletes
<p><a href="http://arxiv.org/PS_cache/arxiv/pdf/1110/1110.1972v1.pdf" target="_blank">Here</a> is a cool paper on the ArXiv about archetypal athletes. The basic idea is to look at a large number of variables for each player and identify multivariate outliers or extremes. These outliers are the archetypes talked about in the title. </p>
<p>According to his analysis, the author claims the best players (for different reasons, i.e. different archetypes) in the NBA in 2009/2010 were: Taj Gibson, Anthony Morrow, and Kevin Durant. The best soccer players were Wayne Rooney, Lionel Messi, and Cristiano Ronaldo.</p>
<p>Thanks to <a href="http://www.biostat.jhsph.edu/~ajaffe/" target="_blank">Andrew Jaffe</a> for pointing out the article. </p>
<p><strong>Related Posts</strong>: Jeff on “<a href="http://simplystatistics.tumblr.com/post/10989030989/innovation-and-overconfidence" target="_blank">Innovation and Overconfidence</a>”, Rafa on “<a href="http://simplystatistics.tumblr.com/post/10805255044/once-in-a-lifetime-collapse" target="_blank">Once in a lifetime collapse</a>”</p>
Graduate student data analysis inspired by a high-school teacher
2011-10-22T13:02:06+00:00
http://simplystats.github.io/2011/10/22/graduate-student-data-analysis-inspired-by-a
<p>I love watching TED talks. One of my absolute favorites is the <a href="http://www.ted.com/talks/dan_meyer_math_curriculum_makeover.html" target="_blank">talk</a> by Dan Meyer on how math class needs a makeover. Dan also has one of the more fascinating <a href="http://blog.mrmeyer.com/" target="_blank">blogs</a> I have read. He talks about math education, primarily K-12 education. His posts on <a href="http://blog.mrmeyer.com/?p=3055" target="_blank">curriculum design</a>, <a href="http://blog.mrmeyer.com/?p=811" target="_blank">assessment </a>, <a href="http://blog.mrmeyer.com/?p=154" target="_blank">work ethic</a>, and <a href="http://blog.mrmeyer.com/?p=133" target="_blank">homework</a> are really, really good. In fact, just go read all his <a href="http://blog.mrmeyer.com/?page_id=2716" target="_blank">author choices</a>. You won’t regret it. </p>
<p>The best quote from the talk is:</p>
<blockquote>
<p>Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and have to filter it out, or you didn’t have insufficient information and have to go find some?</p>
</blockquote>
<!-- more -->
<p>Many of the data analyses I have done in classes/assigned in class have focused on a problem with exactly the right information with relatively little extraneous data or missing information. But I have been slowly evolving these problems; as an example <a href="http://biostat.jhsph.edu/~jleek/qual2011.pdf" target="_blank">here</a> is a data analysis project that we developed last year for the qualifying exam at JHU. This project is what I consider a first step toward a “less helpful” project model. </p>
<p>The project was inspired by this <a href="http://marginalrevolution.com/marginalrevolution/2010/09/the-small-schools-myth.html" target="_blank">blog post</a> at marginal revolution which Rafa suggested. As with the homework problem Dan dissects in his talk, there are layers to this problem:</p>
<ol>
<li>Understanding the question</li>
<li>Downloading and filtering the data</li>
<li>Exploratory analysis</li>
<li>Fitting models/interpreting results</li>
<li>Synthesis and writing the results up</li>
<li>Reproducibility of the R code</li>
</ol>
<p>For this analysis, I was pretty specific with 1. Understanding the question:</p>
<blockquote>
<p class="MsoNormal">
<span>(1) The association between enrollment and the percent of students scoring “Advanced” on the MSA in Reading and Math in the 5<sup>th</sup> grade. </span>
</p>
<p class="MsoNormal">
<span>(2) The change in the number of students scoring “Advanced” in Reading and Math from one year to the next (at minimum consider the change from 2009-2010) versus enrollment. </span>
</p>
<p class="MsoNormal">
<span>(3) Potential reasons for results like those in <strong>Table 1</strong>. <span> </span></span>
</p>
</blockquote>
<p class="MsoNormal">
Although I didn’t mention the key idea from the Marginal Revolution post, I think this level of specificity is necessary for a qualifying exam; for an in-class project I would have removed this information so students had to “discover the question” themselves.
</p>
<p class="MsoNormal">
I was also pretty specific with the data source suggesting the Maryland Education department’s website. However, several students went above and beyond and found other data sources, or downloaded more data than I suggested. In the future, I think I will leave this off too. My google/data finding skills don’t hold a candle to those of my students.
</p>
<p class="MsoNormal">
Steps 3-5 were summed up with the statement:
</p>
<blockquote>
<p class="MsoNormal">
<span>Your project is to analyze data from the MSA and write a short letter either in favor of or against spending money to decrease school sizes.</span>
</p>
</blockquote>
<p class="MsoNormal">
<span>This is one part of the exam I’m happy with. It is sufficiently vague to let the students come to their own conclusions. It also suggests that the students <strong>should</strong> draw conclusions and support them with statistical analyses. One of the major difficulties I have struggled with in teaching this class is getting students to state a conclusion as a result of their analysis and to quantify how uncertain they are about that decision. In my mind, this is different from just the uncertainty associated with a single parameter estimate. </span>
</p>
<p class="MsoNormal">
It was surprising how much requiring reproducibility helped students focus their analyses. I think this is because they had to organize and collect their code, which in turn helped them organize their analysis. Also, there was a strong correlation between reproducibility and the quality of the written reports.
</p>
<p class="MsoNormal">
Going forward I have a couple of ideas of how I would change my data analysis projects:
</p>
<ol>
<li>Be less helpful - be less clear about the problem statement, data sources, etc. I definitely want students to get more practice formulating problems. </li>
<li>Focus on writing/synthesis - my students are typically very good at fitting models, but sometimes struggle with putting together the “story” of an analysis. </li>
<li>Stress much less about whether specific methods will work well on the data analyses I suggest. One of the more helpful things I think these messy problems produce is a chance to figure out what works and what doesn’t on real world problems. </li>
</ol>
<p><strong>Related Posts:</strong> Rafa on <a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">the future of graduate education</a>, <a href="http://simplystatistics.tumblr.com/post/11655593971/do-we-really-need-applied-statistics-journals" target="_blank">Roger on applied statistics journals</a>.</p>
The self-assessment trap
2011-10-21T14:35:00+00:00
http://simplystats.github.io/2011/10/21/the-self-assessment-trap
<p>Several months ago I was sitting next to my colleague <a href="http://www.cbcb.umd.edu/~langmead/" target="_blank">Ben Langmead</a> at the <a href="http://meetings.cshl.edu/meetings/info11.shtml" target="_blank">Genome Informatics meeting</a>. Various talks were presented on short read alignments and every single performance table showed the speaker’s method as #1 and Ben’s <a href="http://bowtie-bio.sourceforge.net/index.shtml" target="_blank">Bowtie</a> as #2 among a crowded field of lesser methods. It was fun to make fun of Ben for getting beat every time, but the reality was that all I could conclude was that Bowtie was best and speakers were falling into <a href="http://www.nature.com/msb/journal/v7/n1/full/msb201170.html#a1" target="_blank">the self-assessment trap</a>: each speaker had tweaked the assessment to make their method look best. This practice is pervasive in Statistics, where easy-to-tweak Monte Carlo simulations are commonly used to assess performance. In a recent <a href="http://www.nature.com/msb/journal/v7/n1/full/msb201170.html#a1" target="_blank">paper</a>, a team at IBM described how the problem in the systems biology literature is pervasive as well. Co-author <a href="https://researcher.ibm.com/researcher/view.php?person=us-gustavo" target="_blank">Gustavo Stolovitzky</a> is a co-developer of the <a href="http://wiki.c2b2.columbia.edu/dream/index.php/The_DREAM3_Challenges" target="_blank">DREAM challenge</a> in which the assessments are fixed and developers are asked to submit. About 7 years ago we developed <a href="http://bioinformatics.oxfordjournals.org/content/20/3/323.long" target="_blank">affycomp</a>, a comparison <a href="http://affycomp.jhsph.edu/" target="_blank">webtool</a> for microarray preprocessing methods. I encourage others involved in fields where methods are constantly being compared to develop such tools. It’s a lot of work, but journals are usually friendly to papers describing the results of such competitions.</p>
<p><strong>Related Posts:</strong> Roger on <a href="http://simplystatistics.tumblr.com/post/11573348494/colors-in-r" target="_blank">colors in R</a>, Jeff on <a href="http://simplystatistics.tumblr.com/post/10852070603/battling-bad-science" target="_blank">battling bad science</a></p>
Interview With Chris Barr
2011-10-21T11:13:00+00:00
http://simplystats.github.io/2011/10/21/interview-with-chris-barr
<p class="MsoNormal">
<strong>Chris Barr</strong>
</p>
<p class="MsoNormal">
<span>Chris Barr is an assistant professor of biostatistics at the Harvard School of Public Health in Boston. He moved to Boston after getting his Ph.D. at UCLA and then doing a postdoc at Johns Hopkins Bloomberg School of Public Health. Chris has done important work in environmental biostatistics and is also the co-founder of <a href="http://www.openintro.org/" target="_blank"><span>OpenIntro</span></a>, a very cool open-source (and free!) educational resource for statistics.<span> </span></span>
</p>
<!-- more -->
<p class="MsoNormal">
<span> </span><strong><span>Which term applies to you: data scientist/statistician/analyst?</span></strong>
</p>
<p class="MsoNormal">
I’m a “statistician” by training. One day, I hope to graduate to “scientist”. The distinction, in my mind, is that a scientist can bring real insight to a tough problem, even when the circumstances take them far beyond their training.
</p>
<p class="MsoNormal">
<span><span> </span></span>Statisticians get a head start on becoming scientists. Like chemists and economists and all the rest, we were trained to think hard as independent researchers. Unlike other specialists, however, we are given the opportunity, from a young age, to see all types of different problems posed from a wide range of perspectives.
</p>
<p class="MsoNormal">
<strong><span>How did you get into statistics/data science (e.g. your history)?</span></strong>
</p>
<p class="MsoNormal">
I studied economics in college, and I had planned to pursue a doctorate in the same field. One day a senior professor of statistics asked me about my future, and in response to my stated ambition, said: “Whatever an economist can do, a statistician can do better.” I started looking at graduate programs in statistics and noticed UCLA’s curriculum. It was equal parts theory, application, and computing, and that sounded like how I wanted to spend my next few years. I couldn’t have been luckier. The program and the people were fantastic.
</p>
<p class="MsoNormal">
<strong><span>What is the problem currently driving you?</span></strong>
</p>
<p class="MsoNormal">
I’m working on so many projects, it’s difficult to single out just one. Our work on smoking bans (joint with Diez, Wang, Samet, and Dominici) has been super exciting. It is a great example about how careful modeling can really make a big difference. I’m also soloing a methods paper on residual analysis for point process models that is bolstered by a simple idea from physics. When I’m not working on research, I spend as much time as I can on OpenIntro.
</p>
<p class="MsoNormal">
<strong><span>What is your favorite paper/idea you have had? Why?</span></strong>
</p>
<p class="MsoNormal">
<span> </span>I get excited about a lot of the problems and ideas. I like the small teams (one, two, or three authors) that generally take on theory and methods problems; I also like the long stretches of thinking time that go along with those papers. That said, big science papers, where I get to team up with smart folks from disciplines and destinations far and wide, really get me fired up. Last, but not least, I really value the work we do on open source education and reproducible research. That work probably has the greatest potential for introducing me to people, internationally and in small local communities, that I’d never know otherwise.
</p>
<p class="MsoNormal">
<strong><span>Who were really good mentors to you? What were the qualities that really helped you?</span></strong>
</p>
<p class="MsoNormal">
Identifying key mentors is such a tough challenge, so I’ll adhere to a self-imposed constraint by picking just one: <a href="http://www.stat.ucla.edu/~frederic/" target="_blank">Rick Schoenberg</a>. Rick was my doctoral advisor, and has probably had the single greatest impact on my understanding of what it means to be a scientist and colleague. I could tell you a dozen stories about the simple kindness and encouragement that Rick offered. Most importantly, Rick was positive and professional in every interaction we ever had. He was diligent, but relaxed. He offered structure and autonomy. He was all the things a student needs, and none of the things that make students want to read those xkcd comics. Now that I’m starting to make my own way, I’m grateful to Rick for his continuing friendship and collaboration.
</p>
<p class="MsoNormal">
I know you asked about mentors, but if I could mention somebody who, even though not my mentor, has taught me a ton, it would be <a href="http://www.ddiez.com/" target="_blank">David Diez</a>. David was my classmate at UCLA and colleague at Harvard. We are also cofounders of OpenIntro. David is probably the hardest working person I know. He is also the most patient and clear thinking. These qualities, like Rick’s, are often hard to find in oneself and can never be too abundant.
</p>
<p class="MsoNormal">
<span><span> </span></span><strong><span>What is OpenIntro?</span></strong>
</p>
<p class="MsoNormal">
<span>OpenIntro is part of the growing movement in open source education. Our goal, with the help of community involvement, is to improve the quality and reduce the cost of educational materials at the introductory level. Founded by two statisticians (Diez, Barr), our early activities have generated a full length textbook (OpenIntro Statistics: Diez, Barr, Cetinkaya-Rundel) that is available for free in PDF and at cost ($9.02) in paperback. People can also use openintro.org to manage their course materials for free, whether they are using our book or not. The software, developed almost entirely by David Diez, makes it easy for people to post lecture notes, assignments, and other resources. Additionally, it gives people access to our online question bank and quiz utility. Last but not least, we are sponsoring a student project competition. The first round will be this semester, and interested people can visit <a href="http://www.openintro.org/stat/comp.php" target="_blank">openintro.org/stat/comp</a> for additional information. We are little fish, but with the help of our friends (<a href="http://openintro.org/about.php" target="_blank">openintro.org/about.php</a>) and involvement from the community, we hope to do a good thing.</span>
</p>
<p class="MsoNormal">
<strong><span>How did you get the idea for OpenIntro?</span></strong>
</p>
<p class="MsoNormal">
<span> </span>
</p>
<p class="MsoNormal">
<span><span> </span></span>Regarding the book and webpage - David and I had both started writing a book on our own; David was keen on an introductory text, and I was working on one about statistical computing. We each realized that trying to solo a textbook while finishing a PhD was nearly impossible, so we teamed up. As the project began to grow, we were very lucky to be joined by Mine Cetinkaya-Rundel, who became our co-author on the text and has since played a big role in developing the kinds of teaching supplements that instructors find so useful (labs and lecture notes to name a few). Working with the people at OpenIntro has been a blast, and a bucket full of nights and weekends later, here we are!
</p>
<p class="MsoNormal">
<span><span> </span></span>Regarding making everything free - David and I started the OpenIntro project during the peak of the global financial crisis. With kids going to college while their parents’ house was being foreclosed, it seemed timely to help out the best way we knew how. Three years later, as I write this, the daily news is running headline stories about the Occupy Wall Street movement featuring hard times for young people in America and around the world. Maybe “free” will always be timely.
</p>
<p class="MsoNormal">
<strong><span>For More Information</span></strong>
</p>
<p class="MsoNormal">
<span>Check out Chris’ </span><a href="http://www.hsph.harvard.edu/faculty/christopher-barr/publications/" target="_blank">webpage</a><span>, his really nice publications including </span><a href="http://jama.ama-assn.org/content/303/1/69.extract" target="_blank">this one</a><span> on the public health benefits of cap and trade, and the </span><a href="http://www.openintro.org/" target="_blank">OpenIntro</a><span> project website. Keep your eye open for the paper on </span><span>cigarette</span><span> bans Chris mentions in the interview; it is sure to be good. </span>
</p>
<p class="MsoNormal">
<strong>Related Posts: </strong>Jeff’s interview with <a href="http://simplystatistics.tumblr.com/post/11436138110/interview-with-daniela-witten" target="_blank">Daniela Witten</a>, Rafa on <a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">the future of graduate education</a>, Roger on <a href="http://simplystatistics.tumblr.com/post/11573348494/colors-in-r" target="_blank">colors in R</a>.
</p>
Anthropology of the Tribe of Statisticians
2011-10-20T20:58:01+00:00
http://simplystats.github.io/2011/10/20/anthropology-of-the-tribe-of-statisticians
<p>From the BBC a pretty fascinating radio <a href="http://www.bbc.co.uk/iplayer/episode/b013851z/The_Tribes_of_Science_More_Tribes_of_Science_The_Statisticians/" target="_blank">piece.</a></p>
<blockquote>
<p>…in the same way that a telescope enables you to see things that are too far away to see with the naked eye, a microscope enables you to see things that are too small to see with the naked eye, <em>statistics</em> enables you to see things in masses of data which are too complex for you to see with the naked eye. </p>
</blockquote>
Finding good collaborators
2011-10-20T16:05:00+00:00
http://simplystats.github.io/2011/10/20/finding-good-collaborators
<p>The job of the statistician is almost entirely about collaboration. Sure, there’s theoretical work that we can do by ourselves, but most of the impact that we have on science comes from our work with scientists in other fields. Collaboration is also what makes the field of statistics so much fun.</p>
<p>So one question I get a lot from people is “how do you find good collaborations”? Or, put another way, how do you find good collaborators? It turns out this distinction is more important than it might seem.</p>
<!-- more -->
<p>My approach to developing collaborations has evolved over time and I consider myself fairly lucky to have developed a few very productive and very enjoyable collaborations. These days my strategy for finding good collaborations is to look for good collaborators. I personally find it important to work with people that I like as well as respect as scientists, because a good collaboration is going to involve a lot of personal interaction. A place like Johns Hopkins has no shortage of very intelligent and very productive researchers that are doing interesting things, but that doesn’t mean you want to work with all of them.</p>
<p>Here’s what I’ve been telling people lately about finding collaborations, which is a mish-mash of a lot of advice I’ve gotten over the years.</p>
<ol>
<li><strong>Find people you can work with</strong>. I sometimes see situations where a statistician will want to work with someone because he/she is working on an important problem. Of course, you want to be working on a problem that interests you, but it’s only partly about the specific project. It’s very much about the person. If you can’t develop a strong working relationship with a collaborator, both sides will suffer. If you don’t feel comfortable asking (stupid) questions, pointing out problems, or making suggestions, then chances are the science won’t be as good as it could be. </li>
<li><strong>It’s going to take some time</strong>. I sometimes half-jokingly tell people that good collaborations are what you’re left with after getting rid of all your bad ones. Part of the reasoning here is that you actually may not know what kinds of people you are most comfortable working with. So it takes time and a series of interactions to learn these things about yourself and to see what works and doesn’t work. Of course, you can’t take forever, particularly in academic settings where the tenure clock might be ticking, but you also can’t rush things either. One rule I heard once was that a collaboration is worth doing if it will likely end up with a published paper. That’s a decent rule of thumb, but see my next comment.</li>
<li><strong>It’s going to take some time</strong>. Developing good collaborations will usually take some time, even if you’ve found the right person. You might need to learn the science, get up to speed on the latest methods/techniques, learn the jargon, etc. So it might be a while before you can start having intelligent conversations about the subject matter. Then it takes time to understand how the key scientific questions translate to statistical problems. Then it takes time to figure out how to develop new methods to address these statistical problems. So a good collaboration is a serious long-term investment which has some risk of not working out. There may not be a lot of papers initially, but the idea is to make the early investment so that truly excellent papers can be published later.</li>
<li><strong>Work with people who are getting things done</strong>. Nothing is more frustrating than collaborating on a project with someone who isn’t that interested in bringing it to a close (i.e. a published paper, completed software package). Sometimes there isn’t a strong incentive for the collaborator to finish (i.e she/he is already tenured) and other times things just fall by the wayside. So finding a collaborator who is continuously getting things done is key. One way to determine this is to check out their CV. Is there a steady stream of productivity? Papers in good journals? Software used by lots of other people? Grants? Web site that’s not in total disrepair?</li>
<li><strong>You’re not like everyone else</strong>. One thing that surprised me was discovering that just because someone you know works well with a specific person doesn’t mean that <em>you</em> will work well with that person. This sounds obvious in retrospect, but there were a few situations where a collaborator was recommended to me by a source that I trusted completely, and yet the collaboration didn’t work out. The bottom line is to trust your mentors and friends, but realize that differences in personality and scientific interests may determine a different set of collaborators with whom you work well.</li>
</ol>
<p>These are just a few of my thoughts on finding good collaborators. I’d be interested in hearing others’ thoughts and experiences along these lines.</p>
<p><strong>Related Posts:</strong> Rafa on <a href="http://simplystatistics.tumblr.com/post/11314293165/authorship-conventions" target="_blank">authorship conventions</a>, <a href="http://simplystatistics.tumblr.com/post/10440612965/finish-and-publish" target="_blank">finish and publish</a></p>
Caffo's Theorem
2011-10-20T02:35:03+00:00
http://simplystats.github.io/2011/10/20/caffos-theorem
<p>Brian Caffo from the comments:</p>
<blockquote>
<p>Personal theorem: the application of statistics in any new field will be labeled “Technical sounding word” + ics. Examples: Sabermetrics, analytics, econometrics, neuroinformatics, bioinformatics, informatics, chemometrics.</p>
<p>It’s like how adding mayonnaise to anything turns it into salad (eg: egg salad, tuna salad, ham salad, pasta salad, …)</p>
<p>I’d like to be the first to propose the statistical study of turning things into salad. So called mayonaisics.</p>
</blockquote>
Do we really need applied statistics journals?
2011-10-19T16:05:06+00:00
http://simplystats.github.io/2011/10/19/do-we-really-need-applied-statistics-journals
<p>All statisticians in academia are constantly confronted with the question of where to publish their papers. Sometimes it’s obvious: A theoretical paper might go to the <em>Annals of Statistics</em> or <em>JASA Theory & Methods</em> or <em>Biometrika</em>. A more “methods-y” paper might go to <em>JASA</em> or <em>JRSS-B</em> or <em>Biometrics</em> or maybe even <em>Biostatistics</em> (where all three of us are or have been associate editors).</p>
<p>But where should the applied papers go? I think this is an increasingly large category of papers being produced by statisticians. These are papers that do not necessarily develop a brand new method or uncover any new theory, but apply statistical methods to an interesting dataset in a not-so-obvious way. Some papers might combine a set of existing methods that have never been combined before in order to solve an important <em>scientific</em> problem.</p>
<p>Well, there are some official applied statistics journals: <em>JASA Applications & Case Studies</em> or <em>JRSS-C</em> or <em>Annals of Applied Statistics</em>. At least they have the word “application” or “applied” in their title. But the question we should be asking is if a paper is published in one of those journals, <em>will it reach the right audience</em>?</p>
<p>What is the audience for an applied stat paper? Perhaps it depends on the subject matter. If the application is biology, then maybe biologists. If it’s an air pollution and health application, maybe environmental epidemiologists. My point is that the key audience is probably not a bunch of other statisticians.</p>
<p>The fundamental conundrum of applied stat papers comes down to this question: <strong>If your application of statistical methods is truly addressing an important scientific question, then shouldn’t the scientists in the relevant field want to hear about it?</strong> If the answer is yes, then we have two options: Force other scientists to read our applied stat journals, or publish our papers in their journals. There doesn’t seem to be much momentum for the former, but the latter is already being done rather frequently.</p>
<p>Across a variety of fields we see statisticians making direct contributions to science by publishing in non-statistics journals. Some examples are this <a href="http://www.ncbi.nlm.nih.gov/pubmed/21706001">recent paper in <em>Nature Genetics</em></a> or a paper I published a few years ago in the <a href="http://www.ncbi.nlm.nih.gov/pubmed/18477784">Journal of the American Medical Association</a>. I think there are two key features that these papers (and many others like them) have in common:</p>
<ol>
<li><strong>There was an important scientific question addressed</strong>. The first paper investigates variability of methylated regions of the genome and its relation to cancer tissue and the second paper addresses the problem of whether ambient coarse particles have an acute health effect. In both cases, scientists in the respective substantive areas were interested in the problem and so it was natural to publish the “answer” in their journals.</li>
<li><strong>The problem was well-suited to be addressed by statisticians</strong>. Both papers involved large and complex datasets for which training in data analysis and statistics was important. In the analysis of coarse particles and hospitalizations, we used a national database of air pollution concentrations and obtained health status data from Medicare. Linking these two databases together and conducting the analysis required enormous computational effort and statistical sophistication. While I doubt we were the only people who could have done that analysis, we were very well-positioned to do so.</li>
</ol>
<p>So when statisticians are confronted by scientific problems that are both (1) important and (2) well-suited for statisticians, what should we do? My feeling is we should skip the applied statistics journals and bring the message straight to the people who want/need to hear it.</p>
<p>There are two problems that come to mind immediately. First, sometimes the paper ends up being so statistically technical that a scientific journal won’t accept it. And of course, in academia, there is the sticky problem of how do you get promoted in a statistics department when your CV is filled with papers in non-statistics journals. This entry is already long enough so I’ll address these issues in a future post.</p>
Spectacular Plots Made Entirely in R
2011-10-18T16:05:00+00:00
http://simplystats.github.io/2011/10/18/spectacular-plots-made-entirely-in-r
<p>When doing data analysis, I often create a set of plots quickly just to explore the data and see what the general trends are. Later I go back and fiddle with the plots to make them look pretty for publication. But some people have taken this to the next level. Here are two plots made entirely in R:</p>
<p><img align="middle" height="280" width="500" src="http://nzprimarysectortrade.files.wordpress.com/2011/10/weapon-export-2010.png" /></p>
<p><img align="middle" src="http://paulbutler.org/wp-content/uploads/2010/12/163413_479288597199_9445547199_5658562_14158417_n.png" width="500" height="280" /></p>
<p>The descriptions of how they were created are <a href="http://paulbutler.org/archives/visualizing-facebook-friends/" target="_blank">here</a> and <a href="http://nzprimarysectortrade.wordpress.com/2011/10/16/r-tells-you-where-weapons-go/" target="_blank">here</a>.</p>
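<p>To give a rough flavor of how such maps begin, here is a minimal sketch of my own (not the code behind either image above), assuming the <a href="http://cran.r-project.org/package=maps" target="_blank">maps</a> package is installed: draw a world map and then add connection lines with ordinary base graphics.</p>
<pre>library(maps)
map("world", col = "grey80", fill = TRUE, lwd = 0.2)
## a few (lon, lat) pairs; straight segments rather than great circles
pts <- cbind(lon = c(-77, 2.3, 139.7), lat = c(38.9, 48.9, 35.7))
segments(pts[-3, "lon"], pts[-3, "lat"], pts[-1, "lon"], pts[-1, "lat"],
         col = "steelblue", lwd = 2)
</pre>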
<p><strong>Related:</strong> Check out Roger’s post on <a href="http://simplystatistics.tumblr.com/post/11573348494/colors-in-r" target="_blank">R colors</a> and my post on <a href="http://simplystatistics.tumblr.com/post/11237403492/apis" target="_blank">APIs</a></p>
Caffo + Ninjas = Awesome
2011-10-18T13:10:12+00:00
http://simplystats.github.io/2011/10/18/caffo-ninjas-awesome
<p>Our colleague <a href="http://www.biostat.jhsph.edu/~bcaffo/" target="_blank">Brian Caffo</a> and his team of statistics ninjas won the “Imaging-Based Diagnostic Classification Contest” as part of the <a href="http://fcon_1000.projects.nitrc.org/indi/adhd200/results.html" target="_blank">ADHD 200 Global Competition</a>. From the prize citation:</p>
<blockquote>
<p><span>The method developed by the team from Johns Hopkins University excelled in its <strong>specificity</strong>, or its ability to identify typically developing children (TDC) without falsely classifying them as ADHD-positive. They correctly classified 94% of TDC, showing that a diagnostic imaging methodology can be developed with a very low risk of false positives, a fantastic result. Their method was not as effective in terms of <strong>sensitivity</strong>, or its ability to identify true positive ADHD diagnoses. They only identified 21% of cases; however, among those cases, they discerned the subtypes of ADHD with 89.47% accuracy. Other teams demonstrated that there is ample room to improve sensitivity scores. </span></p>
</blockquote>
<p><span>Congratulations to Brian and his team!</span></p>
Colors in R
2011-10-17T16:05:06+00:00
http://simplystats.github.io/2011/10/17/colors-in-r
<p>One of my favorite R packages that I use all the time is the <a href="http://cran.r-project.org/package=RColorBrewer" target="_blank">RColorBrewer</a> package. The package has been around for a while now and is written/maintained by Erich Neuwirth. The guts of the package are based on <a href="http://www.personal.psu.edu/cab38/" target="_blank">Cynthia Brewer’s</a> very cool work on the use of color in cartography (check out the <a href="http://colorbrewer2.org/" target="_blank">colorbrewer web site)</a>.</p>
<p>As a side note, I think the ability to manipulate colors in plots/graphs/maps is one of R’s many great strengths. My personal experience is that getting the right color scheme can make a difference in how data are perceived in a plot.</p>
<!-- more -->
<p>RColorBrewer basically provides one function, brewer.pal, that generates different types of color palettes. There are three types of palettes: sequential, diverging, and qualitative. Roughly speaking, sequential palettes are for continuous data where low is less important and high is more important, diverging palettes are for continuous data where both low and high are important (i.e. deviation from some reference point), and qualitative palettes are for categorical data where there is no logical order (i.e. male/female).</p>
<p>To use the brewer.pal function, it’s often useful to combine it with another R function, colorRampPalette. This function is built into R and is part of the grDevices package. It takes a palette of colors and interpolates between the colors to give you an entire spectrum. Think of a painter’s palette with 4 or 5 color blotches on it, and then think of the painter taking a brush and blending the colors together. That’s what colorRampPalette does. So brewer.pal gives you the colors and colorRampPalette mashes them together. It’s a happy combination.</p>
<p>So, how do we use these functions? My basic approach is to first set the palette depending on the type of data. Suppose we have continuous sequential data and we want the “Blue-Purple” palette</p>
<pre>library(RColorBrewer)
colors <- brewer.pal(4, "BuPu")
</pre>
<p>Here, I’ve taken 4 colors from the “BuPu” palette, so there are now 4 blotches on my palette. To interpolate these colors, I can call colorRampPalette, which actually returns a <em>function</em>.</p>
<pre>pal <- colorRampPalette(colors)
</pre>
<p>Now, pal is a function that takes a positive integer argument and returns that number of colors from the palette. So for example</p>
<pre>> pal(5)
[1] "#EDF8FB" "#C1D7E9" "#9FB1D4" "#8B80BB" "#88419D"
</pre>
<p>I got 5 different colors from the palette, with their red, green, and blue values coded in hexadecimal. If I wanted 20 colors I could have called pal(20).</p>
<p>The pal function is useful in other functions like image or wireframe (in the lattice package). In both of those functions, the ‘col’ argument can be given a set of colors generated by the pal function. For example, you could call</p>
<pre>data(volcano)
image(volcano, col = pal(30))
</pre>
<p>and you would plot the ‘volcano’ data using 30 colors from the “BuPu” palette.</p>
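<p>As one more quick example (my own, not from the text above), a diverging palette is a natural fit for data centered on a reference value, such as a correlation matrix, using the same brewer.pal/colorRampPalette combination:</p>
<pre>div <- colorRampPalette(brewer.pal(7, "RdBu"))
cors <- cor(matrix(rnorm(300), ncol = 6))   # a small made-up correlation matrix
image(cors, col = div(50), zlim = c(-1, 1))
</pre>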
<p>If you’re wondering what all the different palettes are and what colors are in them, here’s a handy reference:</p>
<p><img src="http://media.tumblr.com/tumblr_lsyvc6tZ9U1r08wvg.jpg" alt="" /></p>
<p>Or you can just call</p>
<pre>display.brewer.all()</pre>
<p>There’s been a lot of interesting work done on colors in R and this is just scratching the surface. I’ll probably return to this subject in a future post.</p>
Competing through data: Three experts offer their game plan
2011-10-17T01:53:00+00:00
http://simplystats.github.io/2011/10/17/competing-through-data-three-experts-offer-their-game
<p><a href="https://www.facebook.com/video/video.php?v=10150407246723134">Competing through data: Three experts offer their game plan</a></p>
Where would we be without Dennis Ritchie?
2011-10-16T16:05:06+00:00
http://simplystats.github.io/2011/10/16/where-would-we-be-without-dennis-ritchie
<p>Most have probably seen this already since it happened a few days ago, but <a target="_blank" href="http://www.nytimes.com/2011/10/14/technology/dennis-ritchie-programming-trailblazer-dies-at-70.html">Dennis Ritchie died</a>. It just blows my mind how influential his work was — developing the C language, Unix — and how so many pieces of technology bear his fingerprints. </p>
<p>My first encounter with K&R was in college when I learned C programming in the “Data Structures and Programming Techniques” class at Yale (taught by <a href="http://www.cs.yale.edu/people/eisenstat.html" target="_blank">Stan “the man” Eisenstat</a>). Looking back, his book seems fairly easy to read and understand, but I must have cursed that book a million times when I took that course!</p>
Interview With Daniela Witten
2011-10-14T14:37:00+00:00
http://simplystats.github.io/2011/10/14/interview-with-daniela-witten
<p><strong>Note</strong>: This is the first in a series of posts where we will be interviewing junior, up-and-coming statisticians/data scientists. Our goal is to build visibility for people who are at the early stages of their careers.</p>
<p><strong>Daniela Witten</strong></p>
<p><img src="http://www.biostat.washington.edu/~dwitten/DanielaWittenSmall.jpg" width="230" height="308" /></p>
<p>Daniela is an assistant professor of Biostatistics at the University of Washington in Seattle. She moved to Seattle after getting her Ph.D. at Stanford. Daniela has been developing exciting new statistical methods for analyzing high dimensional data and is a recipient of the NIH Director’s Early Independence Award.</p>
<p><strong>Which term applies to you: data scientist/statistician/data analyst?</strong></p>
<p>Statistician! We have to own the term. Some of us have a tendency to try to sugarcoat what we do. But I say that I’m a statistician with pride! It means that I have been rigorously trained, that I have a broadly applicable skill set, and that I’m always open to new and interesting problems. Also, I sometimes get surprised reactions from people at cocktail parties, which is funny.</p>
<p>To the extent that there is a stigma associated with being a statistician, we statisticians need to face the problem and overcome it. The future of our field depends on it.</p>
<p><strong>How did you get into statistics/data science?</strong></p>
<p>I definitely did not set out to become a statistician. Before I got to college, I was planning to study foreign languages. Like most undergrads, I changed my mind, and eventually I majored in biology and math. I spent a summer in college doing experimental biology, but quickly discovered that I had neither the hand-eye coordination nor the patience for lab work. When I was nearing the end of college, I wasn’t sure what was next. I wanted to go to grad school, but I didn’t want to commit to one particular area of study for the next five years and potentially for my entire career.</p>
<p>I was lucky to be at Stanford and to stumble upon the Stat department there. Initially, statistics appealed to me because it was a good way to combine my interests in math and biology from the safety of a computer terminal instead of a lab bench. After spending more time in the department, I realized that if I studied statistics, I could develop a broad skill set that could be applied to a variety of areas, from cancer research to movie recommendations to the stock market.</p>
<p><strong>What is the problem currently driving you?</strong></p>
<p>My research involves the development of statistical methods for the analysis of very large data sets. Recently, I’ve been interested in better understanding networks and their applications to biology. In the past few years there has been a lot of work in the statistical community on network estimation, or graphical modeling. In parallel, biologists have been interested in taking network-based approaches to understanding large-scale biological data sets. There is a real need for these two areas of research to be brought closer together, so that statisticians can develop useful tools for rigorous network-based analysis of biological data sets.</p>
<p>For example, the standard approach for analyzing a gene expression data set with samples from two classes (like cancer and normal tissue) involves testing each gene for differential expression between the two classes, for instance using a two-sample t-statistic. But we know that an individual gene does not drive the differences between cancer and normal tissue; rather, sets of genes work together in pathways in order to have an effect on the phenotype. Instead of testing individual genes for differential expression, can we develop an approach to identify aspects of the gene network that are perturbed in cancer?</p>
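<p>To make the standard gene-by-gene approach concrete, here is a minimal sketch in R. The expression matrix and group labels are simulated, so this is only an illustration of the idea, not an analysis from the interview:</p>
<pre># simulated data: 1000 genes, 10 cancer and 10 normal samples
set.seed(1)
expr <- matrix(rnorm(1000 * 20), nrow = 1000)
group <- factor(rep(c("cancer", "normal"), each = 10))

# two-sample t-test p-value for each gene
pvals <- apply(expr, 1, function(y)
  t.test(y[group == "cancer"], y[group == "normal"])$p.value)

# count genes that survive a Benjamini-Hochberg correction
sum(p.adjust(pvals, method = "BH") < 0.05)
</pre>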
<p><strong>What are the top 3 skills you look for in a student who works with you?</strong></p>
<p>I look for a student who is intellectually curious, self-motivated, and a good personality fit. Intellectual curiosity is a prerequisite for grad school, self-motivation is needed to make it through the 2 years of PhD-level coursework and 3 years of research that make up a typical Stat/Biostat PhD, and a good personality fit is needed because grad school is long and sometimes frustrating (but ultimately very rewarding), and it’s important to have an advisor who can be a friend along the way!</p>
<p><strong>Who were really good mentors to you? What were the qualities that really helped you?</strong></p>
<p>My PhD advisor, Rob Tibshirani, has been a great mentor. In addition to being a top statistician, he is also an enthusiastic advisor, a tireless advocate for his students, and a loyal friend. I learned from him the value of good collaborations and of simple solutions to complicated problems. I also learned that it is important to maintain a relaxed attitude and to occasionally play pranks on students.</p>
<p><strong>For more information:</strong></p>
<p>Check out her <a href="http://www.biostat.washington.edu/~dwitten/" target="_blank">website</a>. Or read her really nice papers on <a href="http://www.biostat.washington.edu/~dwitten/Papers/WittenTibsPenalizedLDA2010-FINAL-MARCH252011.pdf" target="_blank">penalized classification</a> and <a href="http://www.biostat.washington.edu/~dwitten/Papers/pmd.pdf" target="_blank">penalized matrix decompositions</a>.</p>
Moneyball for Academic Institutes
2011-10-13T13:26:00+00:00
http://simplystats.github.io/2011/10/13/moneyball-for-academic-institutes
<p>A way that universities grow in research fields for which they have no department is by creating institutes. Millions of dollars are invested to promote collaboration between existing faculty interested in the new field. But do they work? Does the university get its investment back? Through the years I have noticed that many institutes are nothing more than a webpage, while others are so successful they practically become self-sustained entities. <a href="http://www.itmat.upenn.edu/docs/Hughes_et_al_ScienceTranslationalMedicine_2010.pdf" target="_blank">This paper</a> (published in <a href="http://stm.sciencemag.org/content/2/53/53ps49.short" target="_blank">STM</a>), led by <a href="http://bioinf.itmat.upenn.edu/hogeneschlab/" target="_blank">John Hogenesch</a>, uses data from papers and grants to evaluate an institute at Penn. Among other things, they present a method that uses network analysis to objectively evaluate the effect of the institute on collaboration. The findings are fascinating. </p>
<p>The use of data to evaluate academics is becoming more and more popular, especially among administrators. Is this a good thing? I am not sure yet, but statisticians had better get involved before a biased analysis gets some of us fired.</p>
Benford's law
2011-10-12T13:44:00+00:00
http://simplystats.github.io/2011/10/12/benfords-law
<p>Am I the only one who didn’t know about <a href="http://en.wikipedia.org/wiki/Benford's_law" target="_blank">Benford’s law</a>? It says that for many datasets, the probability that the first digit of a random element is <em>d</em> is given by P(d) = log_10(1 + 1/d). <a href="http://econerdfood.blogspot.com/2011/10/benfords-law-and-decreasing-reliability.html" target="_blank">This post</a> by <a href="http://apps.olin.wustl.edu/faculty/wang/" target="_blank">Jialan Wang</a> explores financial report data and, using Benford’s law, notices that something fishy is going on… </p>
<p>Hat tip to David Santiago.</p>
<p>Update: A link has been fixed. </p>
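<p>Here is a quick sketch in R of the law mentioned above and of how you might check it against a dataset (the amounts below are simulated, so this is only an illustration of the mechanics):</p>
<pre># Benford's law: P(d) = log10(1 + 1/d) for leading digits d = 1, ..., 9
d <- 1:9
round(log10(1 + 1 / d), 3)

# compare to the leading digits of a simulated vector of amounts
x <- rlnorm(10000, meanlog = 5, sdlog = 2)
first <- as.integer(substr(formatC(x, format = "e"), 1, 1))
round(table(first) / length(first), 3)
</pre>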
Errors in Biomedical Computing
2011-10-11T14:30:00+00:00
http://simplystats.github.io/2011/10/11/errors-in-biomedical-computing
<p>Biomedical Computation Review has a <a href="http://www.biomedicalcomputationreview.org/7/2/9.pdf" target="_blank">nice summary</a> (in which I am quoted briefly) by Kristin Sainani about the many different types of errors in computational research, including the infamous Duke incident and some other recent examples. The reproducible research policy at <em>Biostatistics</em> is described as an example for how the publication process might need to change to prevent errors from persisting (or occurring).</p>
Authorship conventions
2011-10-11T12:20:00+00:00
http://simplystats.github.io/2011/10/11/authorship-conventions
<p>The main role of academics is the creation of knowledge. In science, publications are the main venue by which we share our accomplishments, our ideas. Not surprisingly, publications are heavily weighted in hires and promotions. But with multiple author papers how do we know how much each author contributed? Here are some related links from <a href="http://sciencecareers.sciencemag.org/career_magazine/previous_issues/articles/2010_04_16/caredit.a1000039%20%20" target="_blank">Science</a> and <a href="http://www.nature.com/embor/journal/v8/n11/full/7401095.html%20%20" target="_blank">Nature</a> and below I share some thoughts specific to Applied Statistics.</p>
<p>It is common for theoretical statisticians to publish <a href="http://pubs.amstat.org/doi/abs/10.1198/016214501750332875" target="_blank">solo papers</a>. For these it is clear who takes the credit for the idea. In contrast, papers by applied statisticians typically include various authors. Examples include the postdoc that did most of the work, the graduate student that helped, the programmer that wrote associated software, and the biologists that created the data. So what position do we assign ourselves so that those that evaluate us know our role? Many of us working with lab scientists have adopted their convention: the main knowledge creator, usually the lab head, goes last and is the corresponding author. Here are examples from <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945785/?tool=pubmed" target="_blank">Jeff</a>, <a href="http://bioinformatics.oxfordjournals.org/content/27/10/1447.long" target="_blank">Hongkai</a>, <a href="http://onlinelibrary.wiley.com/doi/10.1111/j.1469-1809.2010.00623.x/abstract?systemMessage=Wiley+Online+Library+will+be+disrupted+8+Oct+from+10-14+BST+for+monthly+maintenance" target="_blank">Ingo</a>, and <a href="http://www.ncbi.nlm.nih.gov/pubmed/21706001" target="_blank">myself</a>. Through conversations with senior Biostatistics and Statistics faculty I have been surprised to learn that many are not aware of this. In some cases they went as far as advising junior faculty to publish more first author papers. This is somewhat concerning because junior faculty could be faced with study sections (where our grants are evaluated) that look for last author papers. Study sections are not going to change, so I am hoping this post will help educate the statistical community about the meaning of last author papers for those of us working in genomics and other lab-science related fields. Here is a summary of authorship conventions in these fields:</p>
<ul>
<li><span class="il">Last</span> and corresponding <span class="il">author</span> is associated with the major contributor of ideas and leadership. This is the most desirable position.</li>
<li>First <span class="il">author</span> is associated with the person who did most of the implementation and computing work. Very good for a postdoc or junior faculty. Excellent for a graduate student.</li>
<li>First and corresponding is sometimes used when the person not only had the ideas, but also did half or more of the work. This is rare.</li>
<li>Big collaborative projects will have two or more corresponding authors and two or more “first” authors. I included <a href="http://www.ncbi.nlm.nih.gov/pubmed/21706001" target="_blank">an example</a> above.</li>
</ul>
Government data collection vortex
2011-10-11T11:50:00+00:00
http://simplystats.github.io/2011/10/11/government-data-collection-vortex
<p><a href="http://nyti.ms/pbYl6K">Government data collection vortex</a></p>
Terence’s Stuff: Speaking, reading, writing
2011-10-11T04:19:00+00:00
http://simplystats.github.io/2011/10/11/terences-stuff-speaking-reading-writing
<p><a href="http://bulletin.imstat.org/2011/09/terence%E2%80%99s-stuff-speaking-reading-writing/">Terence’s Stuff: Speaking, reading, writing</a></p>
An R function to determine if you are a data scientist
2011-10-10T13:05:54+00:00
http://simplystats.github.io/2011/10/10/datascientist
<p>“Data scientist” is one of the buzzwords in the running for rebranding applied statistics mixed with some computing. David Champagne, over at Revolution Analytics, <a href="http://tdwi.org/articles/2011/01/05/Rise-of-Data-Science.aspx" target="_blank">described</a> the skills for being a data scientist with a Venn Diagram. Just for fun, I wrote a little R function for determining where you land on the data science Venn Diagram. Here is an example of a plot the function makes using the Simply Statistics bloggers as examples. </p>
<p><img src="http://www.biostat.jhsph.edu/~jleek/datascience2.png" alt="" /></p>
<p>The code can be found <a href="http://biostat.jhsph.edu/~jleek/code/dataScientist.R" target="_blank">here</a>. You will need the <a href="http://cran.r-project.org/web/packages/png/index.html" target="_blank">png</a> and <a href="http://cran.r-project.org/web/packages/klaR/index.html" target="_blank">klaR</a> R packages to run the script. You also need to either download the file <a href="http://biostat.jhsph.edu/~jleek/datascience.png" target="_blank">datascience.png</a> or be connected to the internet. </p>
<p>Here is the function definition:</p>
<pre>dataScientist(names = c("D. Scientist"), skills = matrix(rep(1/3, 3), nrow = 1),
              addSS = TRUE, just = NULL)
</pre>
<ul>
<li>names = a character vector of the names of the people to plot</li>
<li>addSS = if TRUE will add the blog authors to the plot</li>
<li>just = whether to write the name on the right or the left of the point, just = “left” prints on the left and just = “right” prints on the right. If just = NULL, then all names will print to the right. </li>
<li>skills = a matrix with one row for each person you are plotting, the first column corresponds to “hacking”, the second column is “substantive expertise”, and the third column is “math and statistics knowledge”</li>
</ul>
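<p>Assuming the function behaves as the argument list above describes, a minimal usage sketch looks like this (the person and the skill percentages are made up):</p>
<pre># source the script linked above (needs the png and klaR packages, plus
# the datascience.png file or an internet connection)
source("http://biostat.jhsph.edu/~jleek/code/dataScientist.R")

# one hypothetical person: 20% hacking, 50% substantive, 30% math/statistics
skills <- matrix(c(0.2, 0.5, 0.3), nrow = 1)
dataScientist(names = "J. Doe", skills = skills, addSS = TRUE)
</pre>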
<p>So how do you define your skills? Here is how it works:</p>
<p><strong>If you are an academic</strong></p>
<p>You calculate your skills by adding papers in journals. The classification scheme is the following:</p>
<ul>
<li>Hacking = sum of papers in journals that are primarily dedicated to software/computation/methods for very specific problems. Examples are: Bioinformatics, Journal of Statistical Software, IEEE Computing in Science and Engineering, or a software article in Genome Biology.</li>
<li>Substantive = sum of papers in journals that primarily publish scientific results such as JAMA, New England Journal of Medicine, Cell, Sleep, Circulation</li>
<li>Math and Statistics = sum of papers in primarily statistical journals including Biostatistics, Biometrics, JASA, JRSSB, Annals of Statistics</li>
</ul>
<p>Some journals are general, like Nature, Science, the Nature sub-journals, PNAS, and PLoS One. For papers in those journals, assess which of the areas the paper falls in by determining the main contribution of the paper in terms of the non-academic classification below. </p>
<p><strong>If you are a non-academic</strong></p>
<p>Since papers aren’t involved, determine the percent of your time you spend on the following things:</p>
<ul>
<li>Hacking = downloading/transferring data, cleaning data, writing software, combining previously used software</li>
<li>Substantive = time you spend learning about the scientific problem, discussing with scientists, working in the lab/field.</li>
<li>Math and Statistics = time you spend formalizing a problem in mathematical notation, time you spend developing new mathematical/statistical theory, time you spend developing general methods.</li>
</ul>
<p>Enjoy!</p>
Excuse our mess...
2011-10-10T01:09:00+00:00
http://simplystats.github.io/2011/10/10/excuse-our-mess
<p>…we are in the process of changing themes. The spammers got to us in the notes. I tried to fix the html and that didn’t go so well. New theme up shortly. </p>
<p><strong>Update</strong>: Done! We are back in business - minus the spammers. </p>
APIs!
2011-10-09T19:05:05+00:00
http://simplystats.github.io/2011/10/09/apis
<p>Application programming interfaces (<a href="http://en.wikipedia.org/wiki/Application_programming_interface" target="_blank">API</a>s) are tools that are built by companies/governments/organizations to allow software engineers to interact with their websites. One of the main uses of these APIs is to allow software engineers to build apps on top of <a href="http://developers.facebook.com/" target="_blank">Facebook</a>/<a href="https://dev.twitter.com/" target="_blank">Twitter</a>/etc. Many APIs are really helpful for statisticians/data scientists as well. Using APIs, it is generally very easy to collect large amounts of interesting data. <a href="http://www.programmableweb.com/apis/directory" target="_blank">Here </a>are some examples of APIs (you may need to sign up for accounts to get access to some of these). They vary in how easy/useful it is to obtain data from them. If people know of other good ones, I’d love to see them in the comments. </p>
<p><strong>Web 2.0</strong></p>
<ol>
<li><a href="https://dev.twitter.com/docs/using-search" target="_blank">Twitter</a> and associated <a href="http://cran.r-project.org/web/packages/twitteR/" target="_blank">R package</a></li>
<li><a href="http://code.google.com/apis/analytics/docs/gdata/home.html" target="_blank">Google analytics</a></li>
<li><a href="http://code.google.com/apis/blogger/index.html" target="_blank">Blogger</a></li>
<li><a href="http://www.indeed.com/jsp/apiinfo.jsp" target="_blank">Indeed</a></li>
<li><a href="https://sites.google.com/site/grouponapiv2/api-resources/deals" target="_blank">Groupon</a></li>
</ol>
<p><strong>Publishing</strong></p>
<ol>
<li><a href="http://developer.nytimes.com/docs" target="_blank">New York Times</a></li>
<li><span><a href="http://arxiv.org/help/api/index" target="_blank">ArXiv</a></span></li>
<li><a href="http://www.ncbi.nlm.nih.gov/books/NBK25500/" target="_blank">Pubmed</a></li>
<li><a href="http://api.plos.org/" target="_blank">PLoS</a></li>
<li><a href="http://dev.mendeley.com/" target="_blank">Mendeley</a></li>
</ol>
<p><strong>Government</strong></p>
<ol>
<li><span><a href="http://www.fedspending.org/apidoc.php" target="_blank">FedSpending</a> </span></li>
<li><span><a href="http://data.ed.gov/" target="_blank">Department of Education</a></span></li>
<li><a href="http://tools.cdc.gov/register/" target="_blank">CDC</a></li>
</ol>
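<p>As a quick illustration of how little code it takes to pull data from one of these, here is a minimal sketch in R that queries the arXiv API listed above (the query string follows arXiv’s public API documentation; adjust the search terms to taste):</p>
<pre># fetch the Atom XML for five arXiv papers matching "statistics"
url <- paste0("http://export.arxiv.org/api/query?",
              "search_query=all:statistics&max_results=5")
res <- readLines(url, warn = FALSE)

# crude peek at the title lines (a real analysis would parse the XML properly)
grep("title", res, value = TRUE)
</pre>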
A nice presentation on regex in R
2011-10-09T13:17:03+00:00
http://simplystats.github.io/2011/10/09/a-nice-presentation-on-regex-in-r
<p>Over at Recology here is a nice <a href="http://r-ecology.blogspot.com/2011/10/r-tutorial-on-regular-expressions-regex.html" target="_blank">presentation</a> on regular expressions. I found this on the R bloggers site. </p>
Hello world!
2011-10-09T00:13:34+00:00
http://simplystats.github.io/2011/10/09/hello-world-2
<p>Welcome to <a href="http://wordpress.com/">WordPress.com</a>. After you read this, you should delete and write your own post, with a new title above. Or hit <a href="/wp-admin/post-new.php" title="Direct link to the Add New in the Admin Dashboard">Add New</a> on the left (of the <a href="/wp-admin" title="Direct link to this blog's admin dashboard">admin dashboard</a>) to start a fresh post.</p>
<p><a href="http://learn.wordpress.com/" title="Learn WordPress.com—From zero to hero.">Here</a> are some suggestions for your first post.</p>
<ol>
<li>You can find new ideas for what to blog about by reading <a href="http://dailypost.wordpress.com/" title="The Daily Post at WordPress.com—post something every day">the Daily Post</a>.</li>
<li>Add <a href="/wp-admin/tools.php" title="Click the "Press This" link on this page to activate the Press this bookmark feature.">PressThis</a> to your browser. It creates a new blog post for you about any interesting page you read on the web.</li>
<li><a href="/wp-admin/post.php?post=1&action=edit" title="Edit the first post on this blog.">Make some changes to this page</a>, and then hit preview on the right. You can always preview any post or edit it before you share it to the world.</li>
</ol>
Single Screen Productivity
2011-10-08T14:25:00+00:00
http://simplystats.github.io/2011/10/08/single-screen-productivity
<p>Here’s a claim for which I have absolutely no data: I believe I am more productive with a smaller screen/monitor. I have a 13” MacBook Air that I occasionally hook up to a 21-inch external monitor. Sometimes, when I want to read a document I’ll hook up the external monitor so that I can see a whole page at a time. Other times, when I’m using R, I’ll have the graphics window on the external and then the R console and Emacs on the main screen.</p>
<p>But my feeling is that when I’ve got more monitor real estate I’m less productive. I think it’s because I have the freedom to open more windows and to have more things going on. When I’ve got my laptop, I can only really afford to have 1 or 2 windows open. So I’m more focused on whatever I’m supposed to be doing. I also think this is one of the (small) reasons that people like things like the iPad. It’s a single application/single window device.</p>
<p>A quick Google search will find some <a href="http://www.unplggd.com/unplggd/roundup/roundup-multiple-monitor-homes-052915" target="_blank">pretty crazy multiple-monitor setups</a> out there. For some of them you’d think they were head of security at Los Angeles International Airport or something. And most people I know would scoff at the idea of working solely on your laptop while in the office. Partially, it’s an ergonomic issue. But maybe they just need an external monitor that’s 13 inches? I think I have one sitting in my basement somewhere….</p>
R Workshop: Reading in Large Data Frames
2011-10-07T15:54:00+00:00
http://simplystats.github.io/2011/10/07/r-workshop-reading-in-large-data-frames
<p>One question I get a lot is how to read large data frames into R. There are some useful tricks that can save you both time and memory when reading large data frames, but I find that many people are not aware of them. Of course, your ability to read data is limited by your available memory. I usually do a rough calculation along the lines of</p>
<p><span># rows * # columns * 8 bytes / 2^20</span></p>
<p>This gives you the number of megabytes of the data frame (roughly speaking, it could be less). If this number is more than half the amount of memory on your computer, then you might run into trouble.</p>
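<p>For example, applying that formula to a hypothetical table with 1,000,000 rows and 100 numeric columns:</p>
<pre>rows <- 1e6
cols <- 100
rows * cols * 8 / 2^20   # roughly 763 megabytes
</pre>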
<!-- more -->
<p>First, read the help page for ‘read.table’. It contains many hints for how to read in large tables. Of course, help pages tend to be a little confusing so I’ll try to distill the relevant details here.</p>
<p>The following options to ‘read.table()’ can affect R’s ability to read large tables:</p>
<p><strong>colClasses</strong></p>
<p>This option takes a vector whose length is equal to the number of columns in your table. Specifying this option instead of using the default can make ‘read.table’ run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are “numeric”, for example, then you can just set ‘colClasses = “numeric”’. If the columns are all different classes, or perhaps you just don’t know, then you can have R do some of the work for you.</p>
<p>You can read in just a few rows of the table and then create a vector of classes from just the few rows. For example, if I have a file called “datatable.txt”, I can read in the first 100 rows and determine the column classes from that:</p>
<pre>tab5rows <- read.table("datatable.txt", header = TRUE, nrows = 100)
classes <- sapply(tab5rows, class)
tabAll <- read.table("datatable.txt", header = TRUE, colClasses = classes)
</pre>
<p>Always try to use ‘colClasses’, it will make a very big difference. In particular, if one of the column classes is “character”, “integer”, “numeric”, or “logical”, then things will be optimal (because those are the basic classes).</p>
<p><strong>nrows</strong></p>
<p>Specifying the ‘nrows’ argument doesn’t necessarily make things go faster, but it can help a lot with memory usage. R doesn’t know how many rows it’s going to read in so it first makes a guess, and then when it runs out of room it allocates more memory. The constant allocations can take a lot of time, and if R overestimates the amount of memory it needs, your computer might run out of memory. Of course, you may not know how many rows your table has. The easiest way to find this out is to use the ‘wc’ command in Unix. So if you run ‘wc datafile.txt’ in Unix, then it will report to you the number of lines in the file (the first number). You can then pass this number to the ‘nrows’ argument of ‘read.table()’. If you can’t use ‘wc’ for some reason, but you know that there are definitely less than, say, N rows, then you can specify ‘nrows = N’ and things will still be okay. A mild overestimate for ‘nrows’ is better than none at all.</p>
<p><strong>comment.char</strong></p>
<p>If your file has no comments in it (e.g. lines starting with ‘#’), then setting ‘comment.char = “”’ will sometimes make ‘read.table()’ run faster. In my experience, the difference is not dramatic.</p>
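<p>Putting these hints together, a typical call might look like the following sketch (the file name and row count are made up; the row count would come from running ‘wc -l datatable.txt’ in Unix):</p>
<pre># peek at the first 100 rows to learn the column classes
initial <- read.table("datatable.txt", header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# full read: known classes, a mild overestimate of the row count, no comments
tab <- read.table("datatable.txt", header = TRUE, colClasses = classes,
                  nrows = 150000, comment.char = "")
</pre>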
A Really Cool Paper on the "Hot Hand" in Sports
2011-10-06T14:23:00+00:00
http://simplystats.github.io/2011/10/06/a-really-cool-paper-on-the-hot-hand-in-sports
<p>I just found <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0024532#pone-0024532-g001" target="_blank">this</a> really cool paper on the phenomenon of the “hot hand” in sports. The idea behind the <a href="http://en.wikipedia.org/wiki/Hot-hand_fallacy" target="_blank">“hot hand”</a> (also called the <a href="http://en.wikipedia.org/wiki/Clustering_illusion" target="_blank">“clustering illusion”</a>) is that success breeds success. In other words, when you are successful (you win games, you make free throws, you get hits) you will continue to be successful. In sports, it has frequently been observed that events are close to independent, meaning that the “hot hand” is just an illusion. </p>
<!-- more -->
<p>In the paper, the authors downloaded all the data on NBA free throws for the 2005/2006 through the 2009/2010 seasons. <span>They cleaned up the data, then analyzed changes in conditional probability. Their analysis suggested that successive free throw outcomes were not independent. They go on to explain: </span></p>
<blockquote>
<p><span>However, while statistical traces of this phenomenon are observed in the data, an open question still remains: are these non random patterns a result of “success breeds success” and “failure breeds failure” mechanisms or simply “better” and “worse” periods? Although free throws data is not adequate to answer this question in a definite way, we speculate based on it, that the latter is the dominant cause behind the appearance of the “hot hand” phenomenon in the data.</span></p>
</blockquote>
<p>The things I like about the paper are that they explain things very simply, use a lot of real data they obtained themselves, and are very careful in their conclusions. </p>
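<p>For readers who want to see the basic idea, here is a toy sketch in R (simulated shots, not the authors’ data or their analysis): compare the chance of making a free throw after a make with the chance after a miss. Under independence the two estimates should be close.</p>
<pre>set.seed(1)
shots <- rbinom(1000, 1, 0.75)   # simulated free throws, truly independent
prev <- shots[-length(shots)]
curr <- shots[-1]
mean(curr[prev == 1])            # estimated P(make | previous make)
mean(curr[prev == 0])            # estimated P(make | previous miss)
</pre>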
R Workshop
2011-10-06T12:58:32+00:00
http://simplystats.github.io/2011/10/06/r-workshop
<p>I am going to start a continuing “R Workshop” series of posts with R tips and tricks. If you have questions you’d like answered or were wondering about certain aspects, please leave them in the comments.</p>
Prezi
2011-10-06T11:09:00+00:00
http://simplystats.github.io/2011/10/06/prezi
<p><a href="http://www.biostat.jhsph.edu/~ajaffe/" target="_blank">Andrew Jaffe</a> pointed me to <a href="http://prezi.com/" target="_blank">prezi.com</a>. It looks like a new way of making presentations. Andrew made an example <a href="http://prezi.com/ft-_thkdllaf/dna-methylation/?auth_key=d9ae396295709050aa1bcb5f40b1665483f9c414" target="_blank">here</a> in just a couple of minutes. <a href="http://prezi.com/ftv9hvziwqi2/coca-cola-company/" target="_blank">Here</a> is one about Coca-Cola.</p>
<p>Things I like: </p>
<ol>
<li>I go to a lot of Beamer/Powerpoint talks; these presentations at least look different and could be interesting. </li>
<li>It is cool how it is easy to arrange slides in a non-linear order and potentially avoid clicking forward a few slides then back a few slides</li>
<li>I also like how the “global picture” of the talk can be shown in a display. </li>
</ol>
<p>Things I’m worried about:</p>
<ol>
<li>All the zooming and panning might start to drive people nuts, like slide transitions in powerpoint. </li>
<li>There is serious potential for confusing presentations, organization is already a problem with some talks. </li>
<li>There is potential for people to spend too much time on making the prezi look cool and less on content. </li>
</ol>
<p><strong>Update:</strong> From the comments <span>Abhijit points out that David Smith put together a presentation on the R ecosystem using Prezi. Check it out <a href="http://prezi.com/s1qrgfm9ko4i/the-r-ecosystem/" target="_blank">here</a>.</span></p>
Submitting scientific papers is too time consuming
2011-10-05T13:46:00+00:00
http://simplystats.github.io/2011/10/05/submitting-scientific-papers-is-too-time-consuming
<p>As an academic who does a lot of research for a living, I spend a lot of my time writing and submitting papers. Before my time, this process involved sending multiple physical copies of a paper by snail mail to the editorial office. New technology has changed this process. Now to submit a paper you generally have to: (1) find a Microsoft Word or Latex template for the journal and use it for your paper and (2) upload the manuscript and figures (usually separately). This is a big improvement over snail mail submission! But it still takes a huge amount of time. Some simple changes would give academics back huge blocks of time to focus on teaching and research.</p>
<p>Just to give an idea of how complicated the current system is here is an outline of what it takes to submit a paper.</p>
<p>To complete step (1) you go to the webpage of the journal you are submitting to, find their template files, and wrestle your content into the template. Sometimes this requires finding additional files which are not on the website of the journal you are submitting to. It always requires a large amount of tweaking of the text and content to fit the template.</p>
<p>To complete step (2) you have to go to the webpage of the journal and start an account with their content management system. There are frequently different requirements for usernames and passwords, leading to a proliferation of both. Then you have to upload the files and fill out five to seven web forms with information about the authors, information about the paper, information about the funding, information about human subjects research, etc. If the files aren’t in the right format you may have to reformat them before they will be accepted. Some journals even have editorial assistants who will go over your submission and find problems that have to be resolved before your paper can even be reviewed.</p>
<p>This whole process can take anywhere from one to ten hours, depending on the journal. If you have to revise your paper for that journal, you have to go through the process again. If your paper is rejected, then you have to start all over with a new template and a new content management system at a new journal.</p>
<p>It seems like a much simpler system would be for people to submit their papers in pdf/word format with all the figures embedded. If the paper is accepted to a journal, then of course you might need to reformat the submission to make it easier for typesetters to reformat your article. But that could happen just one time, once a paper is accepted.</p>
<p>This seems like a small thing. But suppose you submit a paper between 10 and 15 times a year (very common for academics in my field). Suppose it takes on average 3 hours to submit a paper. That is 3 x 10 = 30 hours a year, almost an entire workweek, just dealing with reformatting papers!</p>
<p>In the comments, I’d love to hear about the best/worst experiences you have had submitting papers. Where is good? Where is bad?</p>
Cool papers
2011-10-04T16:52:00+00:00
http://simplystats.github.io/2011/10/04/cool-papers
<ol>
<li><a href="http://www.sciencemag.org/content/333/6051/1878" target="_blank">Here</a> is a paper where they scraped Twitter data over a year and showed how the tweets corresponded with sleep patterns and diurnal rhythms. The coolest part of this paper is that these two guys just went out and collected the data for free. I wish they had focused on more interesting questions, though; it seems like you could do a lot with data like this. </li>
<li>Since flu season is upon us, <a href="http://arxiv.org/abs/1109.0262" target="_blank">here</a> is an interesting paper where the authors used data on friendship networks and class structure in a high school to study flu transmission. They show targeted treatment isn’t as effective as people had thought when using random mixing models. </li>
<li>This one is a little less statistical. Over the last few years there were some pretty high profile papers that suggested that over-expressing just one protein could double or triple the lifetime of flies or worms. Obviously, that is a pretty crazy/interesting result. But in <a href="http://www.nature.com/nature/journal/v477/n7365/full/nature10296.html" target="_blank">this</a> paper some of those results are called into question. </li>
</ol>
Defining data science
2011-10-04T14:27:00+00:00
http://simplystats.github.io/2011/10/04/defining-data-science
<p>Rebranding of statistics as a field seems to be a popular topic these days and “data science” is one of the potential rebranding options. This <a href="http://blog.revolutionanalytics.com/2011/09/data-science-a-literature-review.html" target="_blank">article</a> over at Revolutions is a nice summary of where the term comes from and what it means. This quote seems pretty accurate:</p>
<blockquote>
<p><span>My own take is that Data Science is a valuable rebranding of computer science and applied statistics skills.</span></p>
</blockquote>
Innovation and overconfidence
2011-10-03T20:08:00+00:00
http://simplystats.github.io/2011/10/03/innovation-and-overconfidence
<p>I posted a while ago on how <a href="http://simplystatistics.tumblr.com/post/10241004305/when-overconfidence-is-good" target="_blank">overconfidence may be a good thing</a>. I just read this fascinating <a href="http://www.worldpolicy.org/journal/fall2011/innovation-starvation" target="_blank">article </a>by Neal Stephenson (via <a href="http://aldaily.com" target="_blank">aldaily.com</a>) about innovation starvation. The article focuses a lot on how science fiction inspires people to work on big/hard/impossible problems in science. It’s a great read for the nerds in the audience. But one quote stuck out for me:</p>
<blockquote>
<p><span>Most people who work in corporations or academia have witnessed something like the following: A number of engineers are sitting together in a room, bouncing ideas off each other. Out of the discussion emerges a new concept that seems promising. Then some laptop-wielding person in the corner, having performed a quick Google search, announces that this “new” idea is, in fact, an old one—or at least vaguely similar—and has already been tried. Either it failed, or it succeeded. If it failed, then no manager who wants to keep his or her job will approve spending money trying to revive it. If it succeeded, then it’s patented and entry to the market is presumed to be unattainable, since the first people who thought of it will have “first-mover advantage” and will have created “barriers to entry.” The number of seemingly promising ideas that have been crushed in this way must number in the millions.</span></p>
</blockquote>
<p>This has to be the single biggest killer of ideas for me. I come up with an idea, google it, find something that is close, and think, well, it has already been done, so I will skip it. I wonder how many of those ideas would have actually turned into something interesting if I had just had a little more overconfidence and skipped the googling? </p>
OracleWorld Claims and Sensations
2011-10-03T12:40:00+00:00
http://simplystats.github.io/2011/10/03/oracleworld-claims-and-sensations
<p>Larry Ellison, the CEO of Oracle, like most technology CEOs, has a tendency for the over-the-top sales pitch. But it’s fun to keep track of what these companies are up to just to see what they think the trends are. It seems clear that companies like IBM, Oracle, and HP, which focus substantially on the enterprise (or try to), think the future is data data data. One piece of evidence is the list of <a href="http://simplystatistics.tumblr.com/post/9955104326/data-analysis-companies-getting-gobbled-up" target="_blank">companies that they’ve acquired</a> recently.</p>
<p>Ellison claims that they’ve <a href="http://bits.blogs.nytimes.com/2011/10/02/larry-ellison-stares-into-the-sun/" target="_blank">developed a new computer</a> that integrates hardware with software to produce an overall faster machine. Why do we need this kind of integration? Well, for data analysis, of course!</p>
<p>I was intrigued by this line from the article:</p>
<blockquote>
<p><span>On Sunday Mr. Ellison mentioned a machine that he claimed would do data analysis 18 to 23 times faster than could be done on existing machines using Oracle databases. The machine would be able to compute both standard Oracle structured data as well as unstructured data like e-mails, he said.</span></p>
</blockquote>
<p><span>It’s always a bit hard in these types of articles to figure out what they mean by “data analysis”, but even still, there’s an important idea here. </span></p>
<p><span><a href="http://www.sdss.jhu.edu/~szalay/" target="_blank">Alex Szalay</a> talks about the need to “bring the computation to the data”. This comes from his experience working with ridiculous amounts of data from the Sloan Digital Sky Survey. There, the traditional model of pulling the data on to your computer, running some analyses, and then producing results just does not work. </span><span>But the opposite is often reasonable. If the data are sitting in an Oracle/Microsoft/etc. database, you bring the analysis to the database and operate on the data there. Presumably, the analysis program is smaller than the dataset, or this doesn’t quite work.</span></p>
<p><span>So if Oracle’s magic computer is real, it and others like it could be important as we start bringing more computations to the data.</span></p>
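<p>A toy illustration of “bringing the computation to the data” (this is not Oracle’s system; it just uses the DBI and RSQLite packages with a made-up in-memory table): instead of pulling a huge table into R, ask the database for the summary you actually need.</p>
<pre>library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), ":memory:")   # stand-in for a big remote database
dbWriteTable(con, "requests",
             data.frame(site = c("a", "a", "b"), response_time = c(120, 80, 200)))

# the aggregation happens where the data live; only the tiny summary comes back
res <- dbGetQuery(con, "SELECT site, AVG(response_time) AS avg_time
                        FROM requests GROUP BY site")
res
dbDisconnect(con)
</pre>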
Karl's take on meetings
2011-10-02T21:47:00+00:00
http://simplystats.github.io/2011/10/02/karls-take-on-meetings
<p><a href="http://kbroman.wordpress.com/2011/09/28/meetings-vs-work/">Karl’s take on meetings</a></p>
Department of Analytics, anyone?
2011-10-02T01:51:38+00:00
http://simplystats.github.io/2011/10/02/department-of-analytics-anyone
<p>This <a href="http://www.nytimes.com/2011/10/02/business/after-moneyball-data-guys-are-triumphant.html" target="_blank">article following up on the Moneyball PR</a> demonstrates one of the reasons why statistics might be doomed:</p>
<blockquote>
<p>Julia Rozovsky is a Yale M.B.A. student who studied economics and math as an undergraduate, a background that prepared her for a traditional — and lucrative — consulting career. Instead, partly as a result of reading “Moneyball” and finding like-minded people, she pointed herself toward work in analytics.</p>
</blockquote>
<p>Why can’t they call it statistics?? The message, of course, is statistics is boring. Analytics is awesome. We probably need to start changing the names of our departments.</p>
Bits: Big Data: Sorting Reality From the Hype
2011-10-01T13:43:00+00:00
http://simplystats.github.io/2011/10/01/bits-big-data-sorting-reality-from-the-hype
<p><a href="http://bits.blogs.nytimes.com/2011/09/30/big-data-sorting-reality-from-the-hype/">Bits: Big Data: Sorting Reality From the Hype</a></p>
Battling Bad Science
2011-09-30T17:16:00+00:00
http://simplystats.github.io/2011/09/30/battling-bad-science
<p><a href="http://www.ted.com/talks/ben_goldacre_battling_bad_science.html" target="_blank">Here</a> is a pretty awesome TED talk by epidemiologist Ben Goldacre where he highlights how science can be used to deceive/mislead. It’s sort of like epidemiology 101 in 15 minutes. </p>
<p>This seems like a highly topical talk. Over on his blog, Steven Salzberg has <a href="http://genome.fieldofscience.com/2011/09/dr-oz-tries-to-do-science.html" target="_blank">pointed out</a> that Dr. Oz has recently been engaging in some of these shady practices on his show. Too bad he didn’t check out the video first. </p>
<p>In the comments section of the TED talk, one viewer points out that Dr. Goldacre doesn’t talk about the role of the FDA and other regulatory agencies. I think that regulatory agencies are under-appreciated and deserve credit for addressing many of these potential problems in the conduct of clinical trials. </p>
<p>Maybe there should be an agency regulating how science is reported in the news? </p>
Why does Obama need statisticians?
2011-09-29T16:21:00+00:00
http://simplystats.github.io/2011/09/29/why-does-obama-need-statisticians
<p>It’s worth following up a little on why the Obama campaign is recruiting statisticians (note to Karen: I am not looking for a new job!). Here’s the blurb for the position of “Statistical Modeling Analyst”:</p>
<blockquote>
<p>The Obama for America Analytics Department analyzes the campaign’s data to guide election strategy and develop quantitative, actionable insights that drive our decision-making. Our team’s products help direct work on the ground, online and on the air. We are a multi-disciplinary team of statisticians, mathematicians, software developers, general analysts and organizers - all striving for a single goal: re-electing President Obama. We are looking for staff at all levels to join our department from now through Election Day 2012 at our Chicago, IL headquarters.</p>
<p>Statistical Modeling Analysts are charged with predicting electoral outcomes using statistical models. These models will be instrumental in helping the campaign determine how to most effectively use its resources.</p>
</blockquote>
<p>I wonder if there’s a bonus for predicting the correct outcome, win or lose?</p>
<p>The Obama campaign didn’t invent the idea of heavy data analysis in campaigns, but they seem to be heavy adopters. There are 3 openings in the “Analytics” category as of today.</p>
<p>Now, can someone tell me why they don’t just call it simply “Statistics”?</p>
Kindle Fire and Machine Learning
2011-09-29T14:05:00+00:00
http://simplystats.github.io/2011/09/29/kindle-fire-and-machine-learning
<p>Amazon released its new iPad competitor, the <a href="http://www.amazon.com/gp/product/B0051VVOB2" target="_blank">Kindle Fire</a>, today. A quick read through the description shows it has some interesting features, including a custom-built web browser called Silk. One claimed innovation is that the browser works in conjunction with Amazon’s EC2 cloud computing platform to speed up the web-surfing experience by doing some computing on your end and some on their end. Seems cool, if it really does make things faster.</p>
<p>Also there’s this interesting bit:</p>
<blockquote>
<p><strong>Machine Learning</strong></p>
<p>Finally, Silk leverages the collaborative filtering techniques and machine learning algorithms Amazon has built over the last 15 years to power features such as “customers who bought this also bought…” As Silk serves up millions of page views every day, it learns more about the individual sites it renders and where users go next. By observing the aggregate traffic patterns on various web sites, it refines its heuristics, allowing for accurate predictions of the next page request. For example, Silk might observe that 85 percent of visitors to a leading news site next click on that site’s top headline. With that knowledge, EC2 and Silk together make intelligent decisions about pre-pushing content to the Kindle Fire. As a result, the next page a Kindle Fire customer is likely to visit will already be available locally in the device cache, enabling instant rendering to the screen.</p>
</blockquote>
<p>That seems like a logical thing for Amazon to do. While the idea of pre-fetching pages is not particularly new, I haven’t yet heard of the idea of doing data analysis on web pages to predict which things to pre-fetch. One issue this raises in my mind is that in order to do this, Amazon needs to combine information across browsers, which means your surfing habits will become part of one large mega-dataset. Is that what we want?</p>
<p>On the one hand, Amazon already does some form of this by keeping track of what you buy. But keeping track of every web page you go to and what links you click on seems like a much wider scope.</p>
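<p>Just to make the prediction idea concrete, here is a toy sketch in R (this is not Amazon’s algorithm; the click log is made up): from aggregate current-page/next-page pairs you can estimate which page is most likely to be requested next and pre-fetch it.</p>
<pre># made-up aggregate click log: current page and the page visited next
clicks <- data.frame(
  current = c("home", "home", "home", "home", "article1"),
  nextpg  = c("article1", "article1", "article2", "article1", "home")
)

# row-normalized transition table: P(next page | current page)
trans <- prop.table(table(clicks$current, clicks$nextpg), margin = 1)

# page to pre-fetch from each current page
apply(trans, 1, function(p) names(which.max(p)))
</pre>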
Once in a lifetime collapse
2011-09-29T13:28:00+00:00
http://simplystats.github.io/2011/09/29/once-in-a-lifetime-collapse
<p><img src="http://media.tumblr.com/tumblr_lsadp9X52w1r085xo.jpg" alt="" /></p>
<p><a href="http://www.baseballprospectus.com/odds/" target="_blank">Baseball Prospectus</a> uses Monte Carlo simulation to predict which teams will make the postseason. According to this page, on Sept 1st, the probability of the Red Sox making the playoffs was 99.5%. They were ahead of the Tampa Bay Rays by 9 games. Before last night’s game, the Red Sox had lost 19 of 26 games in September and were tied with the Rays for the wild card (the last spot for the playoffs). To make this event even more improbable, the Red Sox were up by one in the ninth with two outs and no one on for the last-place Orioles. In this situation the team that’s winning wins more than 95% of the time. The Rays were in exactly the same situation as the Orioles, losing to the first-place Yankees (well, their subs). So guess what happened? The Red Sox lost, the Rays won. But perhaps the most amazing event is that these two games, both lasting much longer than usual (one due to rain, the other to extra innings), ended within seconds of each other. </p>
<p>Update: Nate Silver beat me to it. And has <a href="http://fivethirtyeight.blogs.nytimes.com/2011/09/29/bill-buckner-strikes-again/" target="_blank">much more</a>!</p>
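<p>For the curious, here is a toy Monte Carlo sketch in the spirit of those playoff odds (the 9-game lead is from the post, but the number of remaining games is made up and the schedule is grossly simplified; this only shows the mechanics): give each team a coin flip for each remaining game and see how often the leader stays ahead.</p>
<pre>set.seed(1)
lead <- 9            # games ahead, as on Sept 1st
left <- 25           # made-up number of games remaining for each team
sims <- replicate(100000,
  lead + rbinom(1, left, 0.5) - rbinom(1, left, 0.5) > 0)
mean(sims)           # estimated probability the leader holds on (ties ignored)
</pre>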
Obama recruiting analysts who know R
2011-09-28T19:41:00+00:00
http://simplystats.github.io/2011/09/28/obama-recruiting-analysts-who-know-r
<p><a href="http://rdatamining.wordpress.com/2011/09/27/obama-recruiting-analysts-and-r-is-one-preferred-skill/">Obama recruiting analysts who know R</a></p>
The Open Data Movement
2011-09-28T14:11:00+00:00
http://simplystats.github.io/2011/09/28/the-open-data-movement
<p>I’m not sure which of the <a href="http://simplystatistics.tumblr.com/post/10524782074/most-popular-infographics" target="_blank">categories</a> this <a href="http://visually.visually.netdna-cdn.com/TheOpenDataMovement_4e80e4e7c6495.jpg" target="_blank">infographic</a> on open data falls into, but I find it pretty exciting anyway. It shows the rise of APIs and how data are increasingly open. It seems like APIs are all over the place in the web development community, but less so in health statistics. In the comments, though, John M. lists places to find free government data, including some health data: </p>
<blockquote>
<p><span>1) CDC’s National Center for Health Statistics, <a href="http://www.cdc.gov/nchs/" target="_blank">http://www.cdc.gov/nchs/</a><br />2) NHANES (National Health and Nutrition Examination Survey) <a href="http://www.cdc.gov/nchs/nhanes.htm" target="_blank">http://www.cdc.gov/nchs/nhanes.htm</a><br />3) National Health Interview Survey: <a href="http://www.cdc.gov/nchs/nhis.htm" target="_blank">http://www.cdc.gov/nchs/nhis.htm</a><br />4) World Health Organization: <a href="http://www.who.gov/" target="_blank">www.who.gov</a><br />5) US Census Bureau: <a href="http://www.uscensus.gov/" target="_blank">www.uscensus.gov</a><br />6) Emory maintains a repository of links related to stats/biostat, including online databases: </span></p>
<p><span><a href="http://www.sph.emory.edu/cms/departments_centers/bios/resources.html#govlist" target="_blank">http://www.sph.emory.edu/cms/departments_centers/bios/resources.html#govlist</a></span></p>
</blockquote>
The future of graduate education
2011-09-28T11:49:00+00:00
http://simplystats.github.io/2011/09/28/the-future-of-graduate-education
<p>Stanford is offering a free <a href="http://www.nytimes.com/2011/08/16/science/16stanford.html?_r=1" target="_blank">online course</a> and more than 100,000 students have registered. This got the blogosphere talking about the future of universities. Matt Yglesias thinks that “<a href="http://bit.ly/qm97hI" target="_blank">colleges are the next newspaper</a> and are destined for some very uncomfortable adjustments”. Tyler Cowen reminded us that since 2003 he has been saying that <a href="http://marginalrevolution.com/marginalrevolution/2011/08/the-coming-education-revolution.html" target="_blank">professors are becoming obsolete</a>. His main point is that thanks to the internet, the need for lecturers will greatly diminish. He goes on to predict that</p>
<blockquote>
<p><span>the market was moving towards <a href="http://marginalrevolution.com/marginalrevolution/2009/12/online-education-and-the-market-for-superstar-teachers.html" target="_blank">superstar teachers</a>, who teach hundreds at a time or even thousands online. Today, we have the <a href="http://www.khanacademy.org/" target="_blank">Khan Academy</a>, a huge increase in online education, electronic textbooks and <a href="http://marginalrevolution.com/marginalrevolution/2011/05/sword-for-peer-grading.html" target="_blank">peer grading systems</a> and highly successful superstar teachers with Michael Sandel and his popular course <a href="http://www.justiceharvard.org/" target="_blank">Justice</a>, serving as example number one.</span></p>
</blockquote>
<p>I think this is particularly true for stat and biostat graduate programs, especially in <a href="http://simplystatistics.tumblr.com/post/10124797490/advice-for-stats-students-on-the-academic-job-market" target="_blank">hard money</a> environments.</p>
<!-- more -->
<p>A typical Statistics department will admit five to ten PhD students. In most departments we teach probability theory, statistical theory, and applied statistics. Highly paid professors teach these three courses for these five to ten students, which means that the university ends up spending hundreds of thousands of dollars on them. Where does this money come from? From those that teach hundreds at a time. The stat 101 courses are full of tuition-paying students. These students are subsidizing the teaching of our graduate courses. In hard money institutions, they are also subsidizing some of the research conducted by the professors that teach the small graduate courses. Note that 75% of their salaries are covered by the University, yet they are expected to spend much less than 75% of their time preparing and teaching these relatively tiny classes. The leftover time they spend on research for which they have no external funding. This isn’t a bad thing as a lot of good theoretical and basic knowledge has been created this way. However, outside pressure to lower tuition costs has University administrators looking for ways to save, and graduate education might be a target. “If you want to teach a class, fill it up with 50 students. If you want to do research, get a grant,” the administrator might say.</p>
<p>Note that, for example, the stat theory class is pretty much the same every year and across universities. So we can pick a couple of superstar stat theory teachers and have them lead an online course for all the stat and biostat graduate students in the world. Then each department hires an energetic instructor, paying him/her 1/4 what they pay a tenured professor, to sit in a room discussing the online lectures with the five to ten PhD students in the program. Currently there are no incentives for the tenured professor to teach well, but the instructor would be rewarded solely by their teaching performance. Not only does this scheme cut costs, but it can also increase revenue as faculty will have more time to write grant proposals, etc..</p>
<p>So, with teaching out of the equation, why even have departments? Well, for now the internet can’t substitute for the one-on-one interactions needed during PhD thesis supervision. As long as NIH and NSF are around, research faculty will be around. The apprenticeship system that has worked for centuries will survive the uncomfortable adjustments that are coming. Special topic seminars will also survive as faculty will use them as part of their research agenda. Rotations, similar to those implemented in Biology programs, can serve as matchmakers between professors and students. But classroom teaching is due for some “uncomfortable adjustments”.</p>
<p>I agree with Tyler Cowen and Matt Yglesias: the number of cushy professor jobs per department will drop dramatically in the future, especially in hard money institutions. So let’s get ready. Maybe Biostat departments should start planning for the future now. Harvard, Seattle, Michigan, Emory, etc.: want to teach stat theory with us?</p>
<p>PS - I suspect this all applies to liberal arts and hard science graduate programs.</p>
The p>0.05 journal
2011-09-28T01:49:00+00:00
http://simplystats.github.io/2011/09/28/the-p-0-05-journal
<p>I want to start a journal called “P>0.05”. This journal will publish all the negative results in science. These would also be stored in a database. Think of all the great things we could do with this. We could, for example, plot p-value histograms for different disciplines. I bet most would have a flat distribution. We could also do it by specific association. A paper comes out saying <a href="http://www.nhs.uk/news/2007/January08/Pages/Chocolatecausesweakbones.aspx" target="_blank">chocolate is linked to weaker bones</a>? Check the histogram and keep eating chocolate. Any publishers interested? </p>
Some cool papers
2011-09-27T14:55:00+00:00
http://simplystats.github.io/2011/09/27/some-cool-papers
<ol>
<li>A cool article on the <a href="http://www.pnas.org/content/108/31/12647.abstract" target="_blank">regulator’s dilemma</a>. It turns out that the best risk profile for preventing one bank from failing is not the best risk profile for preventing all banks from failing. </li>
<li><a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0024914" target="_blank">Persistence of web resources</a> for computational biology. I think this one is particularly relevant for academic statisticians since a lot of academic software/packages are developed by graduate students. Once they move on, a large chunk of “institutional knowledge” is lost. </li>
<li><a href="http://kevindenny.wordpress.com/2011/08/08/are-private-schools-better-than-public-schools-evidence-for-ireland/" target="_blank">Are private schools better than public schools</a>? A quote from the paper: “Indeed when comparing the average score in the two types of schools after adjusting for the enrollment effects, we find quite surprisingly that public schools perform better on average.</li>
</ol>
Unoriginal genius
2011-09-26T16:11:00+00:00
http://simplystats.github.io/2011/09/26/unoriginal-genius
<blockquote>
<p><span>“The world is full of texts, more or less interesting; I do not wish to add any more”</span></p>
</blockquote>
<p><span>This quote is from an <a href="http://chronicle.com/article/Uncreative-Writing/128908/" target="_blank">article</a> in the Chronicle Review. </span><span>I highly recommend reading the article, particularly check out the section on the author’s “Uncreative writing” class at UPenn. </span><span>The article is about how there is a trend in literature toward combining/using other people’s words to create new content. </span></p>
<p><span><!-- more --></span></p>
<blockquote>
<p><span>The prominent literary critic Marjorie Perloff has recently begun using the term “unoriginal genius” to describe this tendency emerging in literature. Her idea is that, because of changes brought on by technology and the Internet, our notion of the genius—a romantic, isolated figure—is outdated. An updated notion of genius would have to center around one’s mastery of information and its dissemination. Perloff has coined another term, “moving information,” to signify both the act of pushing language around as well as the act of being emotionally moved by that process. She posits that today’s writer resembles more a programmer than a tortured genius, brilliantly conceptualizing, constructing, executing, and maintaining a writing machine.</span></p>
</blockquote>
<p><span>It is fascinating to see this happening in the world of literature; a similar trend seems to be happening in statistics. A ton of exciting and interesting work is done by people combining known ideas and tools and applying them to new problems. I wonder if we need a new definition of “creative”? </span></p>
25 minute seminars
2011-09-26T13:32:00+00:00
http://simplystats.github.io/2011/09/26/25-minute-seminars
<p>Most Statistics and Biostatistics departments have weekly seminars. We usually invite outside speakers to share their knowledge via a 50 minute powerpoint (or beamer) presentation. This gives us the opportunity to meet colleagues from other Universities and pick their brains in small group meetings. This is all great. But, giving a good one hour seminar is hard. Really hard. Few people can pull it off. I propose to the statistical community that we cut the seminars to 25 minutes with 35 minutes for questions and further discussion. We can make exceptions of course. But in general, I think we would all benefit from shorter seminars. </p>
By poring over statistics ignored by conventional scouts, - 05.12.03 - SI Vault
2011-09-26T00:13:00+00:00
http://simplystats.github.io/2011/09/26/by-poring-over-statistics-ignored-by-conventional
<p><a href="http://sportsillustrated.cnn.com/vault/article/magazine/MAG1028746/1/index.htm">By poring over statistics ignored by conventional scouts, - 05.12.03 - SI Vault</a></p>
How do you spend your day?
2011-09-24T20:02:00+00:00
http://simplystats.github.io/2011/09/24/how-do-you-spend-your-day
<p>I’ve seen visualizations of how people spend their time a couple of places. <a href="http://flowingdata.com/2011/09/20/how-do-americans-spend-their-days/" target="_blank">Here</a> is a good one over at Flowing Data. </p>
Getting email responses from busy people
2011-09-23T15:39:00+00:00
http://simplystats.github.io/2011/09/23/getting-email-responses-from-busy-people
<p>I’ve had the good fortune of working with some really smart and successful people during my career. One problem for a young person working with really successful people is that they get a <em>ton</em> of email. Some only see the subject lines on their phone before deleting them. </p>
<p>I’ve picked up a few tricks for getting email responses from important/successful people: </p>
<p><strong>The SI Rules</strong></p>
<ol>
<li>Try to send no more than one email a day. </li>
<li>Emails should be 3 sentences or less. Better if you can get the whole email in the subject line. </li>
<li>If you need information, ask yes or no questions whenever possible. Never ask a question that requires a full sentence response.</li>
<li>When something is time sensitive, state the action you will take if you don’t get a response by a time you specify. </li>
<li>Be as specific as you can while conforming to the length requirements. </li>
<li>Bonus: include obvious keywords people can use to search for your email. </li>
</ol>
<p>Anecdotally, SI emails have a 10-fold higher response probability. The rules are designed around the fact that busy people who get lots of email love checking things off their list. SI emails are easy to check off! That will make them happy and get you a response. </p>
<p>It takes more work on your end when writing an SI email. You often need to think more carefully about what to ask, how to phrase it succinctly, and how to minimize the number of emails you write. A surprising side effect of applying SI principles is that I often figure out answers to my questions on my own. I have to decide which questions to include in my SI emails and they have to be yes/no answers, so I end up taking care of simple questions on my own. </p>
<p>Here are examples of SI emails just to get you started: </p>
<p><strong>Example 1</strong></p>
<p>Subject: Is my response to reviewer 2 ok with you?</p>
<p>Body: I’ve attached the paper/responses to referees.</p>
<p><strong>Example 2</strong></p>
<p>Subject: Can you send my letter of recommendation to john.doe@someplace.com?</p>
<p>Body:</p>
<p>Keywords = recommendation, Jeff, John Doe.</p>
<p><strong>Example 3</strong></p>
<p>Subject: I revised the draft to include your suggestions about simulations and language</p>
<p>Revisions attached. Let me know if you have any problems, otherwise I’ll submit Monday at 2pm. </p>
Dongle communism
2011-09-23T13:30:00+00:00
http://simplystats.github.io/2011/09/23/dongle-communism
<p>If you have a Mac and give talks or teach, chances are you have embarrassed yourself by forgetting your dongle. Our lab meetings and classes were constantly delayed due to missing dongles. Communism solved this problem. We bought 10 dongles, sprinkled them around the department, and declared all dongles public property. All dongles, not just the 10. No longer do we have to ask to borrow dongles because they have no owner. Please join the revolution. P.S. I think this should apply to pens too!<img src="http://media.tumblr.com/tumblr_lrxprsU5Yq1r085xo.jpg" alt="" /></p>
Most popular infographics
2011-09-22T18:33:00+00:00
http://simplystats.github.io/2011/09/22/most-popular-infographics
<p>Thanks to <a href="http://kbroman.wordpress.com/" target="_blank">Karl Broman</a> via <a href="http://andrewgelman.com/" target="_blank">Andrew Gelman</a>.</p>
<p><a title="MOST POPULAR INFOGRAPHICS by theonlyone, on Flickr" href="http://www.flickr.com/photos/smoy/6143338263/" target="_blank"><img alt="MOST POPULAR INFOGRAPHICS" height="500" width="393" src="http://farm7.static.flickr.com/6190/6143338263_d2497c02fe.jpg" /></a></p>
The Killer App for Peer Review
2011-09-22T16:10:00+00:00
http://simplystats.github.io/2011/09/22/the-killer-app-for-peer-review
<p>A little while ago, over at Genomes Unzipped, Joe Pickrell asked, “<a href="http://www.genomesunzipped.org/2011/07/why-publish-science-in-peer-reviewed-journals.php" target="_blank">Why publish science in peer reviewed journals?</a>” He points out the flaws with the current peer review system and suggests how we can do better. What he suggests is missing is the killer app for peer review. </p>
<p>Well, PLoS has now developed an <a href="http://api.plos.org/" target="_blank">API</a> where you can easily access tons of data on the papers published in those journals, including downloads, citations, number of social bookmarks, and mentions in major science blogs. Along with <a href="http://www.mendeley.com/" target="_blank">Mendeley</a>, a free reference manager, they have launched a <a href="http://dev.mendeley.com/api-binary-battle/" target="_blank">competition</a> to build cool apps with their free data. </p>
<p>It seems like, with the right statistical analysis and features, a recommender system for, say, <a href="http://www.plosone.org/" target="_blank">PLoS One</a> could have most of the features Joe suggests in his article. One idea would be an RSS feed modeled on the Pandora music service: you input a couple of papers you like from the journal, and it creates a feed of papers similar to them. </p>
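<p>To make the idea concrete, here is a minimal sketch of such a recommender in R. Everything in it is made up for illustration: the feature matrix stands in for whatever per-paper metrics the PLoS API exposes (downloads, citations, bookmarks, blog mentions), and a real system would first need to pull and clean those data.</p>

```r
## Toy content-based recommender: rank papers by cosine similarity to a few
## "seed" papers the reader liked. The feature matrix is simulated; in practice
## each row would hold API-derived metrics for one paper.
set.seed(1)
features <- matrix(rnorm(100 * 4), nrow = 100,
                   dimnames = list(paste0("paper", 1:100),
                                   c("downloads", "citations", "bookmarks", "blog_mentions")))
features <- scale(features)   # put metrics on a common scale

recommend <- function(features, seeds, n = 5) {
  profile <- colMeans(features[seeds, , drop = FALSE])        # average seed profile
  sims <- (features %*% profile) /
    (sqrt(rowSums(features^2)) * sqrt(sum(profile^2)))        # cosine similarity
  ranked <- rownames(features)[order(sims, decreasing = TRUE)]
  setdiff(ranked, seeds)[1:n]                                 # top matches, minus the seeds
}

recommend(features, seeds = c("paper3", "paper17"))
```

<p>Feeding the top matches into an RSS feed would give something with the Pandora flavor described above.</p>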
StatistiX
2011-09-22T12:01:00+00:00
http://simplystats.github.io/2011/09/22/statistix
<p>I think our field would attract more students if we changed the name to something ending with X or K. I’ve joked about this for years, but someone has actually done it (kind of):</p>
<p><a href="http://www.bitlifesciences.com/AnalytiX2012/" target="_blank">http://www.bitlifesciences.com/AnalytiX2012/</a></p>
Small ball is a bad strategy
2011-09-21T11:55:00+00:00
http://simplystats.github.io/2011/09/21/small-ball-is-a-bad-strategy
<p>Bill James pointed this out a long time ago. If you don’t know Bill James, you should <a href="http://en.wikipedia.org/wiki/Bill_James" target="_blank">look him up</a>. I consider him to be one of the most influential statisticians of all times. This post relates to one of his first conjectures: sacrificing outs for runs, referred to as small ball, is a bad strategy. </p>
<p>ESPN’s Gamecast, a webtool that gives you pitch-by-pitch updates of baseball games, also gives you a pitch-by-pitch “probability” of winning. Gamecast confirms the conjecture with data. How do they calculate this “probability”? I am pretty sure it is based only on historical data. No modeling. For example, if the away team is up 4-2 in the bottom of the 7th with no outs and runners on 1st and 2nd, they look at all the instances exactly like this one that have ever happened in the digitally recorded history of baseball and report the proportion of times the home team wins. In this situation that proportion is 45%. If the next batter successfully bunts, moving the runners over, the proportion <strong>drops</strong> to 41%. Furthermore, if after the successful bunt the run from third scores on a sacrifice fly, the proportion <strong>drops</strong> again from 41% to 39%. The extra out hurts you more than the extra run helps you. That was Bill James’ intuition: you only have three outs, so the last thing you want to do is give 33% of them away. </p>
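<p>For readers who want to try this themselves, here is a minimal sketch in R of the purely empirical estimate described above. The <code>plays</code> data frame and its column names are hypothetical; real play-by-play data (Retrosheet, for example) would have to be wrangled into this shape first.</p>

```r
## Empirical win "probability": among all historical plate appearances with the
## same game state, what fraction of those games did the home team go on to win?
win_prob <- function(plays, inn, half, out, r1, r2, r3, diff) {
  same <- subset(plays,
                 inning == inn & half_inning == half & outs == out &
                 runner_1b == r1 & runner_2b == r2 & runner_3b == r3 &
                 score_diff == diff)
  mean(same$home_won)   # proportion of matching games won by the home team
}

## Home team trailing 4-2 (diff = home - away = -2), bottom of the 7th,
## no outs, runners on 1st and 2nd:
## win_prob(plays, inn = 7, half = "bottom", out = 0,
##          r1 = TRUE, r2 = TRUE, r3 = FALSE, diff = -2)
```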
MacArthur Fellow Shwetak Patel
2011-09-20T18:24:05+00:00
http://simplystats.github.io/2011/09/20/macarthur-fellow-shwetak-patel
<p>The new <a href="http://www.macfound.org/site/c.lkLXJ8MQKrH/b.7728991/k.12E8/Meet_the_2011_Fellows.htm" target="_blank">MacArthur Fellows</a> list is out and, as usual, they are an interesting bunch. One person that I thought was worth pointing out is <a href="http://www.macfound.org/site/c.lkLXJ8MQKrH/b.7730995/k.96C7/Shwetak_Patel.htm" target="_blank">Shwetak Patel</a>. I had the privilege of meeting Shwetak at a National Research Council meeting on sustainability and computer science. Basically, he’s working on devices that you can install in your home to monitor your resource usage. He’s already spun off a startup company to make and sell some of these devices. </p>
<p>In the writeup for the award, they mention</p>
<blockquote>
<p><span>When coupled with a machine learning algorithm that analyzes patterns of activity and the signature noise produced by each appliance, the sensors enable users to measure and disaggregate their energy and water consumption and to detect inefficiencies more effectively.</span></p>
</blockquote>
<p><span>Now that’s statistics at work!</span></p>
Private health insurers to release data
2011-09-20T13:37:00+00:00
http://simplystats.github.io/2011/09/20/private-health-insurers-to-release-data
<p>It looks like four major private health insurance companies will be <a href="http://www.nytimes.com/2011/09/20/health/policy/20health.html" target="_blank">releasing data for use by academic researchers</a>. They will create a non-profit institute called the <a href="http://healthcostinstitute.org/" target="_blank">Health Care Cost Institute</a> and deposit the data there. Researchers can request the data from the institute by (I’m guessing) writing a short proposal.</p>
<p>Health insurance billing claims data might not sound all that exciting, but they are a gold mine of very interesting information about population health. In my group, we use billing claims from Medicare Part A to explore the relationships between ambient air pollutants and hospital admissions for various cardiovascular and respiratory diseases. The advantage of using a database like Medicare is that the population is very large (about 48 million people) and highly relevant. Furthermore, the data are just sitting there, already collected. The disadvantage is that you get relatively little information about those people. For example, you can’t find out what a particular Medicare enrollee’s blood pressure is on a given day. Also, it requires some pretty sophisticated data analysis skills to go through these large databases and extract the information you need to address a scientific question. But this “disadvantage” is what allows statisticians to play an important role in making scientific discoveries.</p>
<p>It’s not clear what kind of information will be made available from the private insurers—it looks like it’s mostly geared towards doing economic/cost analysis. However, I’m guessing that there will be a host of other uses for the data that will be revealed as time goes on. </p>
Finish and publish
2011-09-20T12:50:00+00:00
http://simplystats.github.io/2011/09/20/finish-and-publish
<p>Roger pointed us to this Amstat News <a href="http://magazine.amstat.org/blog/2011/09/01/nextstop/" target="_blank">profile of statisticians</a>, including one on <a href="http://www.hsph.harvard.edu/faculty/francesca-dominici/" target="_blank">Francesca Dominici</a>. Francesca has used her statistics skills to become a top environmental scientist. She had this advice for young [academic] statisticians:</p>
<blockquote>
<p>First, I would say find a good mentor in or outside the department. Prioritize, manage your time, and identify the projects you would like to lead. Focus the most productive time of day on those projects. Take ownership of projects. The biggest danger is getting pulled in very different directions; focus on one main project. Finish everything you start. Always publish. Even if it is not revolutionary, publish.</p>
</blockquote>
<p>I think this is great advice. And I want to add to the last two sentences. If you are smart and it took you time to figure out the solution to a problem you find interesting, chances are others will want to read about it. So follow Francesca’s advice: finish and publish. Remember Voltaire’s quote “perfection is the enemy of the good”.</p>
Statistician Profiles
2011-09-20T12:14:00+00:00
http://simplystats.github.io/2011/09/20/statistician-profiles
<p>Just in case you forgot to renew your subscription to Amstat News, there’s a nice little <a href="http://magazine.amstat.org/blog/2011/09/01/nextstop/" target="_blank">profile of statisticians</a> (including my good colleague <a href="http://www.hsph.harvard.edu/faculty/francesca-dominici/" target="_blank">Francesca Dominici</a>) in the latest issue explaining how they ended up where they are.</p>
<p>I remember a few years ago I was at a dinner for our MPH program and the director at the time, Ron Brookmeyer, told all the students to ask the faculty how they ended up in public health. The implication, of course, was that the route was likely to be highly nonlinear. It was definitely that way for me.</p>
<p>Statisticians in particular, I think, have the ability to lead interesting careers simply because we have the ability to operate in a variety of substantive fields. I started out developing point process models for predicting wildfire occurrence. Perhaps to the chagrin of my <a href="http://www.stat.ucla.edu/~frederic/" target="_blank">advisor</a>, I’m not doing much point process modeling now, but rather am working in environmental health doing quite a bit of air pollution epidemiology.</p>
<p>So ask a statistician how they ended up where they are. It’ll probably be an interesting story.</p>
Data Sources
2011-09-19T19:26:00+00:00
http://simplystats.github.io/2011/09/19/data-sources
<p>Here are places you can get data sets to analyze (for class projects, fun and profit!)</p>
<ol>
<li><a href="http://datamarket.com/" target="_blank">Data Market</a></li>
<li><a href="http://www.infochimps.com/" target="_blank">Infochimps</a></li>
<li><a href="http://www.data.gov/" target="_blank">Data.gov</a></li>
<li><a href="http://www.factual.com/" target="_blank">Factual.com</a></li>
</ol>
<p>I’m sure there are a ton more…would love to hear from people. </p>
Meetings
2011-09-19T13:50:00+00:00
http://simplystats.github.io/2011/09/19/meetings
<p>In <a href="http://www.ted.com/talks/jason_fried_why_work_doesn_t_happen_at_work.html" target="_blank">this</a> TED talk Jason Fried explains why work doesn’t happen at work. He describes the evils of meetings. Meetings are particularly disruptive for applied statisticians, especially for those of us that hack data files, explore data for systematic errors, get inspiration from visual inspection, and thoroughly test our code. Why? Before I become productive I go through a ramp-up/boot-up stage. Scripts need to be found, data loaded into memory, and most importantly, my brain needs to re-familiarize itself with the data and the essence of the problem at hand. I need a similar ramp up for writing as well. It usually takes me between 15 and 60 minutes before I am in full-productivity mode. But once I am in “the zone”, I become very focused and I can stay in this mode for hours. There is nothing worse than interrupting this state of mind to go to a meeting. I lose much more than the hour I spend at the meeting. Put another way: ten scattered hours of work time add up to very little, while ten uninterrupted hours in the zone are when I get things done.</p>
<!-- more -->
<p>Of course not all meetings are a waste of time. Academic leaders and administrators need to consult and get advice before making important decisions. I find lab meetings very stimulating and, generally, productive: we unstick the stuck and realign the derailed. But before you go and set up a standing meeting, consider this calculation: a weekly one hour meeting with 20 people translates into 1 hour x 20 people x 52 weeks/year = 1040 person hours of potentially lost production per year. Assuming 40 hour weeks, that translates into six months. How many grants, papers, and lectures can we produce in six months? And this does not take into account the non-linear effect described above. Jason Fried suggests you cancel your next meeting, notice that nothing bad happens, and enjoy the extra hour of work.</p>
<p>I know many others that are like me in this regard, and for you I have these recommendations:</p>
<ol>
<li>Avoid unnecessary meetings, especially if you are already in full-productivity mode. Don’t be afraid to use this as an excuse to cancel. If you are in a soft-money institution, remember who pays your salary.</li>
<li>Try to bunch the necessary meetings together into one day.</li>
<li>Set aside at least one day a week to stay home and work for 10 hours straight.</li>
</ol>
<p>Jason Fried also recommends that every workplace declare a day on which no one talks. No meetings, no chit-chat, no friendly banter, etc. No Talk Thursdays, anyone? </p>
Ideas/Data blogs I read
2011-09-18T15:52:00+00:00
http://simplystats.github.io/2011/09/18/ideas-data-blogs-i-read
<ol>
<li><a href="http://www.r-bloggers.com/" target="_blank">R bloggers</a> - good R blogs aggregator</li>
<li><a href="http://flowingdata.com/" target="_blank">Flowing Data</a> - interesting data visualizations</li>
<li><a href="http://marginalrevolution.com/" target="_blank">Marginal Revolution </a>- an econ blog with lots of interesting ideas</li>
<li><a href="http://blog.revolutionanalytics.com/" target="_blank">Revolutions</a> - another news about R blog</li>
<li><a href="http://genome.fieldofscience.com/" target="_blank">Steven Salzberg’s blog</a></li>
<li><a href="http://andrewgelman.com/" target="_blank">Andrew Gelman’s blog</a></li>
</ol>
<p>I’m sure there are a ton more good blogs like this out there. Any suggestions of what I should be reading? </p>
Google Fusion Tables
2011-09-16T11:37:00+00:00
http://simplystats.github.io/2011/09/16/google-fusion-tables
<p>Thanks to <a href="http://www.biostat.jhsph.edu/~hiparker/" target="_blank">Hilary Parker</a> for pointing out <a href="http://www.google.com/fusiontables/public/tour/index.html#" target="_blank">Google Fusion Tables</a>. The coolest thing here, from my self-centered spatial statistics point of view, is that it automatically geocodes locations for you. So you can upload a spreadsheet of addresses and it will map them for you on Google Maps.</p>
<p>Unfortunately, there doesn’t seem to be an easy way to extract the latitude/longitude values, but I’m hoping that’s just a quick hack away….</p>
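<p>One possible hack, assuming the geocoded layer can be exported as KML (the file name below is hypothetical): the coordinates then sit in plain XML and are easy to pull out with the <code>XML</code> package in R.</p>

```r
## Pull longitude/latitude out of a KML export of point placemarks.
library(XML)

doc    <- xmlParse("my_addresses.kml")   # hypothetical KML export of the table
coords <- xpathSApply(doc, "//*[local-name()='coordinates']", xmlValue)
parts  <- strsplit(trimws(coords), ",")  # KML stores "lon,lat[,alt]" for each point
lonlat <- t(sapply(parts, function(p) as.numeric(p[1:2])))
colnames(lonlat) <- c("lon", "lat")
head(lonlat)
```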
Communicating uncertainty visually
2011-09-15T23:12:00+00:00
http://simplystats.github.io/2011/09/15/communicating-uncertainty-visually
<p>From a cool <a href="http://www.sciencemag.org/content/333/6048/1393.full" target="_blank">review </a>about communicating risk to people without statistical/probabilistic training.</p>
<blockquote>
<p><span>Despite the burgeoning interest in infographics, there is limited experimental evidence on how different types of visualizations are processed and understood, although the effectiveness of some graphics clearly depends on the relative numeracy of an audience. </span></p>
</blockquote>
Another academic job market option: liberal arts colleges
2011-09-15T18:30:00+00:00
http://simplystats.github.io/2011/09/15/another-academic-job-market-option-liberal-arts
<p>Liberal arts colleges are an option that falls close to the 75% hard/25% soft option described by Rafa in his advice for folks on the job market. At these schools the teaching load may be even a little heavier than at schools like Berkeley or Duke, and the students will usually be exclusively undergraduates. Examples of this kind of place are Pomona College, Carleton College, and Grinnell College. The teaching load is the focus at places like this, but research plays an increasingly important role for faculty. In a recent Nature <a href="http://www.nature.com/nature/journal/v477/n7363/full/nj7363-239a.html" target="_blank">editorial</a>, Amy Cheng Vollmer offers an apt analogy for the differences in responsibilities. </p>
<blockquote>
<p>“It’s like comparing the winter Olympics to the summer Olympics,” says Vollmer, who frequently gives talks on career issues. “It’s not easier, it’s different”</p>
</blockquote>
When overconfidence is good
2011-09-15T15:40:00+00:00
http://simplystats.github.io/2011/09/15/when-overconfidence-is-good
<p>A paper came out in the latest issue of Nature called the “<a href="http://www.nature.com/nature/journal/v477/n7364/full/nature10384.html" target="_blank">Evolution of Confidence</a>”. The authors describe a simple model where two participants are competing for a resource. They can either both claim the resource, only one can claim it, or neither can. If the ratio of the value of the resource to the cost of competition is large enough, it makes sense to be overconfident about your ability to obtain it. </p>
<p>The amazing thing about this paper is that it explains a really old idea “why are people overconfident” with really simple models and simulations (<a href="http://www.nature.com/nature/journal/v477/n7364/extref/nature10384-s1.pdf" target="_blank">done in R</a>!). Based on my own experience, I feel like they may be on to something. You can’t get a paper in Nature if you don’t send it there…</p>
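<p>Since the appeal here is precisely that the models are simple, here is a toy version of the claim-or-fight game in R. It is a simplified sketch for intuition only, not the authors’ exact model: capabilities and assessment errors are standard normals, a bias <code>k</code> inflates a player’s self-assessment, unopposed claims win the resource, and if both claim, the truly stronger player wins while both pay the cost of the fight.</p>

```r
## Average payoff to player 1 over many random pairings.
avg_payoff <- function(k1, k2, r, cost, n = 1e5) {
  a1 <- rnorm(n); a2 <- rnorm(n)    # true capabilities
  e1 <- rnorm(n); e2 <- rnorm(n)    # errors in assessing the opponent
  claim1 <- (a1 + k1) > (a2 + e1)   # claim if (biased) self-assessment beats noisy view of rival
  claim2 <- (a2 + k2) > (a1 + e2)
  pay <- numeric(n)
  pay[claim1 & !claim2] <- r                          # unopposed claim takes the resource
  fight <- claim1 & claim2
  pay[fight] <- r * (a1[fight] > a2[fight]) - cost    # fight: stronger wins r, both pay cost
  mean(pay)
}

set.seed(1)
avg_payoff(k1 = 1, k2 = 0, r = 3, cost = 1)   # overconfident player, cheap conflict
avg_payoff(k1 = 0, k2 = 0, r = 3, cost = 1)   # unbiased player, cheap conflict
avg_payoff(k1 = 1, k2 = 0, r = 1, cost = 3)   # overconfident player, costly conflict
avg_payoff(k1 = 0, k2 = 0, r = 1, cost = 3)   # unbiased player, costly conflict
```

<p>In runs of this toy version, a positive bias raises the average payoff when the resource is valuable relative to the cost of conflict and lowers it when fights are expensive, which is the flavor of the paper’s result.</p>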
Dissecting the genomics of trauma
2011-09-14T16:13:00+00:00
http://simplystats.github.io/2011/09/14/dissecting-the-genomics-of-trauma
<p>Today the results of a study I’ve been involved with for a long time (read: since my early graduate school days) came out in <a href="http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.1001093" target="_blank">PLoS Medicine</a> (also Princeton News <a href="http://www.princeton.edu/main/news/archive/S31/59/38O07/index.xml?section=topstories" target="_blank">coverage</a>, Eurekalert <a href="http://www.eurekalert.org/pub_releases/2011-09/plos-cig090711.php" target="_blank">press release</a>).</p>
<p>We looked at gene expression profiles - how much each of your 20,000 genes is turned on or turned off - in patients who had experienced blunt force trauma. Using these profiles we were able to distinguish very early on which of the patients were going to have positive or negative health trajectories. The idea was to compare patients to themselves and see how much their genomic profiles deviated from the earliest measurements.</p>
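<p>A cartoon version of that idea, in R: for each patient, measure how far every later expression profile drifts from that patient’s own earliest sample. The <code>expr</code> matrix and annotation vectors are hypothetical, and the paper’s actual analysis is considerably more sophisticated.</p>

```r
## Distance of each sample's expression profile from the same patient's baseline.
baseline_drift <- function(expr, patient, time) {
  # expr: genes x samples matrix; patient, time: one entry per sample (column)
  drift <- numeric(ncol(expr))
  for (p in unique(patient)) {
    idx  <- which(patient == p)
    base <- idx[which.min(time[idx])]   # earliest sample for this patient
    drift[idx] <- sqrt(colSums((expr[, idx, drop = FALSE] - expr[, base])^2))
  }
  drift   # zero at baseline, growing as the profile deviates from it
}
```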
<p>I’m excited about this paper for a couple of reasons: (1) like we say in the paper, “Trauma is the number one killer of individuals 1-44y of age in the United States”; (2) the communicating <a href="http://www.genomine.org/research.html" target="_blank">author</a> and the joint first authors, Keyur Desai and Chuen Seng Tan, were statisticians, highlighting the important role statistics played in the scientific process. </p>
<p><strong>Update:</strong> If you want to check out the data/analyze them yourself, there is a website explaining how to access the data & code <a href="http://genomine.org/trauma/" target="_blank">here</a>. </p>
Advice for stats students on the academic job market
2011-09-12T13:34:00+00:00
http://simplystats.github.io/2011/09/12/advice-for-stats-students-on-the-academic-job-market
<p>Job hunting season is upon us. Openings are already being posted <a href="http://www.stat.ufl.edu/vlib/Index.html" target="_blank">here</a>, <a href="http://www.stat.washington.edu/jobs/" target="_blank">here</a>, and <a href="http://jobs.amstat.org/" target="_blank">here</a>. So you should have your CV, research statement, and web page ready. I highly recommend having a web page. It doesn’t have to be fancy. <a href="http://jkp-mac1.uchicago.edu/~pickrell/Site/Home.html" target="_blank">Here</a>, <a href="http://www.biostat.jhsph.edu/~khansen/" target="_blank">here</a>, and <a href="http://www.biostat.jhsph.edu/~jleek/research.html" target="_blank">here</a> are some good ones ranging from simple to a bit over the top. Minimum requirements are a list of publications and a link to a CV. If you have written software, link to that as well.</p>
<p>The earlier you submit the better. Don’t wait for your letters. Keep in mind two things: 1) departments have a limit of how many people they can invite and 2) admissions committee members get tired after reading 200+ CVs. </p>
<p>If you are seeking an academic job your CV should focus on the following: PhD granting institution, advisor (including postdoc advisor if you have one), and papers. Be careful not to drown out these most important features with superfluous entries. For papers, include three sections: 1) published, 2) under review, and 3) in preparation. For 2, include the journal names and if possible have tech reports available on your web page. For 3, be ready to give updates during the interview. If you have papers for which you are co-first author, be sure to highlight that fact somehow. </p>
<p>So what are the different types of jobs? Before listing the options I should explain the concept of hard versus soft money. Revenue in academia comes from tuition (in public schools the state kicks in some extra $), external funding (e.g. NIH grants), services (e.g. patient care), and philanthropy (endowment). The money that comes from tuition, services, and philanthropy is referred to as hard money. Every year roughly the same amount is available and the way it’s split among departments rarely changes. When it does, it’s because your chair has either lost or won a long hard-fought zero-sum battle. Research money comes from NIH, NSF, DoD, etc., and one has to write grants to <em>raise</em> funding (which pays part or all of your salary). These days about 10% of grant applications are funded, so it is certainly not guaranteed. Although at the school level the law of large numbers kicks in, at the individual level it certainly doesn’t. Note that the breakdown of revenue varies widely from institution to institution. Liberal arts colleges are almost 100% hard money while research institutes are almost 100% soft money.</p>
<p>So to simplify, your salary will come from teaching (tuition) and research (grants). The percentages will vary depending on the department. Here are four types of jobs:</p>
<p>1) Soft money university positions: examples are Hopkins and Harvard Biostat. A typical breakdown is 75% soft/25% hard. To earn the hard money you will have to teach, but not that much. In my dept we teach 48 classroom hours a year (equivalent to one one-semester class). To earn the soft money you have to write, and eventually get, grants. As a statistician you don’t necessarily have to write your own grants, you can partner up with other scientists that need help. And there are many! Salaries are typically higher in these positions. Stress levels are also higher given the uncertainty of funding. I personally like this as it keeps me motivated, focused, and forces me to work on problems important enough to receive NIH funding.</p>
<p>1a) Some schools of medicine have Biostatistics units that are 100% soft money. One does not have to teach, but, unless you have a joint appointment, you won’t have access to grad students. Still, these are tenure track jobs. Although at 100% soft money, what does tenure mean? The Oncology Biostat division at Hopkins is an example. I should mention that at MD Anderson one only needs to raise 50% of one’s salary; the other 50% is earned via service (statistical consulting to the institution). I imagine there are other places like this, as well as institutions that use endowments to provide some hard money. </p>
<p>2) Hard money positions: examples are Berkeley and Stanford Stat. A typical breakdown is 75% hard/25% soft. You get paid a 9 month salary. If you want to get paid in the summer and pay students, you need a grant. Here you typically teach two classes a semester, but many places let you “buy out” of teaching if you can get grants to pay your salary. Some tension exists when chairs decide who teaches the big undergraduate courses (lots of grunt work) and who teaches the small seminar classes where you talk about your own work.</p>
<p>3) Research associate positions: examples are jobs in schools of medicine in departments other than Stat/Biostat. These positions are typically 100% soft and are created because someone at the institution has a grant to pay for you. These are usually not tenure track positions and you rarely have to teach. You also have less independence since you have to work on the grant that funds you.</p>
<p>4) Industry: typically 100% hard. There are plenty of for-profit companies where one can have a fruitful research career. AT&amp;T, Google, IBM, Microsoft, and Genentech are all examples of companies with great research groups. Note that S, the language that R is based on, was born at Bell Labs, and one of the co-creators of R now does his research at Genentech. Salaries are typically higher in industry and the cafeteria food can be quite awesome. The drawbacks are no access to students and a lack of independence (although not always!).</p>
<p><strong>Update:</strong> A reader points out that I forgot:</p>
<p>5) Government jobs: The FDA and NIH are examples of agencies that have research positions. The NCI’s Biometric Research Branch is an example. I would classify these as 100% hard. But it is different than other hard money places in that you have to justify your budget every so often. Service, collaborative, and independent research is expected. A drawback is that you don’t have access to students although you can get joint appointments. At Hopkins we have a couple of NCI researchers with joint appointments. </p>
<p>Ok, that is it for now. Sometime in December we will <a href="http://simplystatistics.tumblr.com/" target="_blank">blog</a> about job interviews. </p>
The Duke Saga
2011-09-11T04:15:00+00:00
http://simplystats.github.io/2011/09/11/the-duke-saga
<p>For those of you that don’t know about the saga involving genomic signatures, I highly recommend reading this <a href="http://www.economist.com/node/21528593" target="_blank">very good summary</a> published in The Economist. Baggerly and Coombes are two statisticians that can confidently say they have made an impact on clinical research and actually saved lives. A paper by this pair describing the details was published in the <a href="http://www.e-publications.org/ims/submission/index.php/AOAS/user/submissionFile/5816?confirm=cfad51b7" target="_blank">Annals of Applied Statistics</a> as most of the Biology journals refused to publish their letters to the editor. Baggerly is also a fantastic public speaker as seen in <a href="http://vimeo.com/16698764" target="_blank">this video</a> and <a href="http://www.youtube.com/watch?v=j1MT0oZqPXY" target="_blank">this one</a>. </p>
What is a Statistician?
2011-09-10T03:07:00+00:00
http://simplystats.github.io/2011/09/10/what-is-a-statistician
<p><span></span></p>
<p>This column was written by Terry Speed in 2006 and is reprinted with permission from the IMS Bulletin, <a href="http://bulletin.imstat.org" target="_blank">http://bulletin.imstat.org</a></p>
<p class="p1">
<span class="s1">I</span>n the generation of my teachers, say from 1935 to 1960, relatively few statisticians were trained for the profession. The majority seemed to come from mathematics, without any specialized statistical training. There was also a sizeable minority coming from other areas, such as astronomy (I can think of one prominent example), chemistry or chemical engineering (three), economics (several), history (one), medicine (several), physics (two), and psychology (several). In those days, PhD programs in statistics were few and far between, and many, perhaps most people moved into statistics because they were interested in the subject, or were responding to a perceived need. They learned the subject on the job, either in government, industry or academia. I also think statistics benefited disproportionately from the minority coming from outside mathematics and statistics, but that may be a personal bias.
</p>
<p class="p1">
This diversity of backgrounds seems to have diminished from the mid-1960s. Almost all of my colleagues in statistics over the last 40 years had some graduate training in statistics. Typically they had a PhD in statistics, probability or mathematics, the last two with some exposure to statistics. A few had masters degrees or diplomas in statistics. My experience probably reflects that of most of you.
</p>
<p class="p1">
By the 1960s our subject had become professional, there was a ticket of entry into it — a PhD or equivalent — and many graduate programs handing them out. I know many statistics departments now include people with joint appointments, for example in the biological, engineering or social sciences, but I have the impression that the majority are people who trained in statistics and moved ‘away’ through their interest in applications there, rather than people from these other areas who were embraced by the statisticians. As is to be expected, there are plenty of exceptions.
</p>
<p class="p1">
Why am I presenting this made-up history of the recent origins of statisticians? Because I have the sense that the situation which has prevailed for about 40 years is changing again. I see a steady trickle, which I predict will grow substantially, of people not trained in statistics moving into our profession. Many have noticed, and I have previously remarked on, the current shortage of bright young people going into our subject. We probably all know universities, institutes or industries trying hard to recruit statisticians, and coming up empty handed. On the other hand, there has been substantial growth in areas which, while not generally regarded as mainstream statistics, might well have been, had things gone differently. My unoriginal observation is that some people from these areas are starting to see statistics as a worthwhile career, not beating but joining us. Computer science, machine learning, image analysis, information theory and bioinformatics, to name a few, have all provided future statisticians to statistics departments around the world in recent years, and I think there will be much more of this.
</p>
<p class="p1">
Recently there was a call for applications for the new United Kingdom EPSRC Statistics Mobility Fellowships, whose aim is “to attract new researchers into the statistics discipline at an early stage in their career”. Is this “mobility” a good idea? In my view, unquestionably yes. Not only do we need an influx of talent to swell our numbers, we also need it to broaden and enrich our subject, so that much of the related activity we now see taking place outside of statistics, and threatening its future, comes inside. In his highly stimulating polemic “Statistical Modelling: The Two Cultures” published in <em>Statistical Science </em>just 5 years ago (16:199–231, 2001), my late colleague Leo Breiman argued that “the focus in the statistical community on data models has:
</p>
<ul>
<li>led to irrelevant theory and questionable scientific conclusions; </li>
<li>kept statisticians from using more suitable algorithmic models; </li>
<li>prevented statisticians from working on exciting new problems.”</li>
</ul>
<p class="p1">
His view was that “we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.”
</p>
<p class="p1">
One, perhaps over-optimistic, view is that the reform that Leo so desired will come automatically as mainstream statistics is joined by “outsiders” from fields like those mentioned above. Are there risks in this trend? There must be. We want statistics broadened and enriched; we don’t want to see it fragmented, trivialized, or otherwise weakened. We need our theorists working hard to incorporate all these new ideas into our long-standing big picture, we need the newcomers to become familiar with the best we have to offer, and we all need to work together in answering the questions of all the people outside our discipline needing our involvement.
</p>
Data visualization and art
2011-09-09T23:58:00+00:00
http://simplystats.github.io/2011/09/09/data-visualization-and-art
<p><a title="Mark Hansen" href="http://www.stat.ucla.edu/~cocteau/" target="_blank">Mark Hansen</a> is easily one of my favorite statisticians today. He is a Professor of Statistics at <a href="http://www.stat.ucla.edu/" target="_blank">UCLA</a> and his collaborations with artists have brought data visualization to a whole new place, one that is both informative and moving. </p>
<p>Here is a video of his project with Ben Rubin called Listening Post. The installation grabs conversations from unrestricted chat rooms and processes them in real-time to create interesting “themes” or “movements”. I believe this one is called “I am” and the video is taken from the Whitney Museum of American Art.</p>
<p>[youtube http://www.youtube.com/watch?v=dD36IajCz6A&w=420&h=345]</p>
<p>Here is some pretty cool time-lapse footage of the installation of Listening Post at the San Jose Museum of Art.</p>
<p>[youtube http://www.youtube.com/watch?v=cClHQU6Fqro]</p>
Any Other Team Wins The World Series Good For
2011-09-08T23:12:00+00:00
http://simplystats.github.io/2011/09/08/any-other-team-wins-the-world-series-good-for
<p>[youtube http://www.youtube.com/watch?v=_tvh5edD22c?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]</p>
<p><em>“Any other team wins the World Series, good for them…if we win, with this team … we’ll have changed the game.”</em></p>
<p>Moneyball! Maybe the start of the era of data. Plus it is a feel good baseball movie where a statistician is the hero. I haven’t been this stoked for a movie in a long time.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Data Science = Hot Career Choice
2011-09-08T15:00:00+00:00
http://simplystats.github.io/2011/09/08/data-science-hot-career-choice
<p>Not only are data analytics companies getting scooped up left and right, “data science” is blowing up as a career. Data science is sort of an amorphous term, like any hot topic (e.g., <a href="http://en.wikipedia.org/wiki/Cloud_computing" target="_blank">cloud computing</a>). Regardless, people who can crunch numbers and find patterns are in high demand, and I’m not the only <a href="http://www.nytimes.com/2009/08/06/technology/06stats.html" target="_blank">one</a> <a href="http://www.cio.com/article/684344/The_6_Hottest_New_Jobs_in_IT" target="_blank">saying</a> <a href="http://tech.fortune.cnn.com/2011/09/06/data-scientist-the-hot-new-gig-in-tech/" target="_blank">so</a>.</p>
<p>Don’t believe the hype? Search for “data” on the career site of Amazon, Google, Facebook, Groupon, Livingsocial, Square, ….</p>
Data analysis companies getting gobbled up
2011-09-08T12:36:00+00:00
http://simplystats.github.io/2011/09/08/data-analysis-companies-getting-gobbled-up
<p>Companies that specialize in data analysis, or essentially, statistics, are getting gobbled up by larger companies. IBM bought <a href="http://dealbook.nytimes.com/2009/07/28/ibm-to-pay-12-billion-for-software-maker/" target="_blank">SPSS</a>, then later <a href="http://dealbook.nytimes.com/2011/09/01/ibm-to-buy-algorithmics-for-387-million/" target="_blank">Algorithmics</a>. MSCI bought <a href="http://dealbook.nytimes.com/2010/03/01/msci-buys-riskmetrics-for-1-55-billion/" target="_blank">RiskMetrics</a>. HP bought <a href="http://dealbook.nytimes.com/2011/08/19/after-h-p-s-rich-offer-deal-making-spotlight-swings-to-data-analysis/" target="_blank">Autonomy</a>. Who’s next? SAS?</p>
Build your own pre-cog
2011-09-07T17:38:00+00:00
http://simplystats.github.io/2011/09/07/build-your-own-pre-cog
<p>Okay, this is not really about <a href="http://simplystatistics.tumblr.com/post/9916412456/pre-cog-and-stats" target="_blank">pre-cog</a>, but just a pointer to some data that might be of interest to people. A number of cities post their crime data online, ready for scraping and data analysis. For example, the Baltimore Sun has a <a title="Baltimore homicide data" target="_blank" href="http://essentials.baltimoresun.com/micro_sun/homicides/index.php">Google map of homicides</a> in the city of Baltimore. There’s also some data for <a title="Oakland homicide data" target="_blank" href="http://www.sfgate.com/maps/oaklandhomicides/">Oakland</a>.</p>
<p>Looking at the map is fun, but not particularly useful from a data analysis standpoint. However, with a little fiddling (and some knowledge of XML), you can pull the data from the map and use it for data analysis.</p>
<p>Why not build your own model to predict crime?</p>
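<p>As a starting point, a smoothed intensity map of the event locations already tells you a lot about where crime concentrates. Here is a minimal sketch with simulated coordinates (the real exercise would use the scraped Baltimore or Oakland data); <code>kde2d</code> comes from the MASS package that ships with R.</p>

```r
## 2D kernel density estimate of event locations as a crude intensity surface.
library(MASS)

set.seed(1)
lon <- rnorm(500, -76.61, 0.05)   # fake locations, roughly Baltimore-sized spread
lat <- rnorm(500,  39.30, 0.04)

dens <- kde2d(lon, lat, n = 100)
image(dens, xlab = "longitude", ylab = "latitude",
      main = "Smoothed event intensity (simulated data)")
contour(dens, add = TRUE)
```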
<p>I’ll just add that the model used in the pre-cog program was published in the Journal of the American Statistical Association in <a href="http://pubs.amstat.org/doi/abs/10.1198/jasa.2011.ap09546" target="_blank">this article</a>.</p>
Awesome Stat Ed Links
2011-09-07T13:58:00+00:00
http://simplystats.github.io/2011/09/07/awesome-stat-ed-links
<ol>
<li><a href="http://openintro.org/" target="_blank">Openintro</a> - A free online introduction to stats textbook, even the latex is free! One of the authors is Chris Barr, a former postdoc at Hopkins.</li>
<li><a href="https://sites.google.com/site/undergraduateguidetor/" target="_blank">The undergraduate guide to R</a> - A free intro to R at a super-beginners level, the most popular (and free) statistical programming language. Written by an undergrad at Princeton. </li>
</ol>
Pre-cog and stats
2011-09-07T13:11:00+00:00
http://simplystats.github.io/2011/09/07/pre-cog-and-stats
<p>A cool article <a href="http://singularityhub.com/2011/08/29/pre-cog-is-real-%E2%80%93-new-software-stops-crime-before-it-happens/" target="_blank">here</a> on a group predicting the place and time where crime is going to happen. It looks like they are using a Poisson process, and they liken it to predicting the aftershocks of an earthquake. More details on the math behind the pre-cog software can be found <a href="http://math.scu.edu/~gmohler/crime_project.html" target="_blank">here</a>. I wonder what their prediction accuracy is? Thanks to Rafa for pointing out the link. </p>
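<p>The aftershock analogy suggests a self-exciting point process: a background Poisson stream of “parent” events, each of which temporarily raises the rate of follow-up events. Below is a minimal one-dimensional simulation in R using the branching representation; it is illustrative only and not the model in the article.</p>

```r
## Simulate a self-exciting process: background events at rate mu, and each event
## triggers Poisson(alpha) offspring after Exp(beta) delays (alpha < 1 keeps it stable).
simulate_selfexciting <- function(mu = 0.5, alpha = 0.6, beta = 1, tmax = 100) {
  times      <- runif(rpois(1, mu * tmax), 0, tmax)   # background ("parent") events
  generation <- times
  while (length(generation) > 0) {
    n_kids     <- rpois(length(generation), alpha)
    kids       <- rep(generation, n_kids) + rexp(sum(n_kids), beta)
    generation <- kids[kids < tmax]                   # keep offspring inside the window
    times      <- c(times, generation)
  }
  sort(times)
}

set.seed(1)
ev <- simulate_selfexciting()
hist(ev, breaks = 50, main = "Clustered event times", xlab = "time")
```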
Where are the Case Studies?
2011-09-07T12:48:00+00:00
http://simplystats.github.io/2011/09/07/where-are-the-case-studies
<p>Many case studies I find interesting don’t appear in JASA Applications and <strong>Case Studies </strong>or other applied statistics journals for that matter. Some because the technical skill needed to satisfy reviewers is not sufficiently impressive, others because they lack mathematical rigor. But perhaps the main reason for this disconnect is that many interesting case studies are developed by people outside our field or outside academia.</p>
<p>In this blog we will try to introduce readers to some of these case studies. I’ll start it off by pointing readers to Nate Silver’s <a href="http://fivethirtyeight.blogs.nytimes.com" target="_blank">FiveThirtyEight</a> blog. Mr. Silver (yes, Mr., not Prof. nor Dr.) is one of my favorite statisticians. He first became famous for <a href="http://en.wikipedia.org/wiki/PECOTA" target="_blank">PECOTA</a>, a system that uses data and statistics to predict the performance of baseball players. In FiveThirtyEight he uses a rather sophisticated meta-analysis approach to predicting election outcomes.</p>
<p>For example, for the 2008 election he used data from the primaries to calibrate pollsters and then properly weighted these pollsters’ predictions to give a more precise estimate of election results. He predicted Obama would win 349 to 189 with a 6.1% difference in the popular vote. The actual result was 365 to 173 with a difference of 7.2%. His website included graphs that very clearly illustrated the uncertainty of his prediction. These were updated daily and I had a ton of fun visiting his blog at least once a day. I also learned quite a bit, used his data in class, and gained insights that I have used in my own projects.</p>
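<p>The weighting step, at its most basic, is just precision weighting. Here is a hedged sketch in R with made-up numbers; Silver’s actual method (pollster calibration from the primaries, house effects, trend adjustments) is far more involved.</p>

```r
## Combine several poll estimates of the same margin with inverse-variance weights.
polls <- data.frame(
  pollster = c("A", "B", "C"),
  margin   = c(6.5, 8.0, 7.1),   # estimated margin in percentage points (made up)
  se       = c(2.0, 3.5, 1.5)    # hypothetical standard error for each pollster
)

w <- 1 / polls$se^2
c(estimate = sum(w * polls$margin) / sum(w),   # precision-weighted average
  se       = sqrt(1 / sum(w)))                 # its standard error
```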
Seek Simplicity And Distrust It
2011-09-07T03:28:02+00:00
http://simplystats.github.io/2011/09/07/seek-simplicity-and-distrust-it
<p>Seek simplicity and distrust it.</p>
<p>A. N. Whitehead</p>
First things first
2011-09-07T03:25:00+00:00
http://simplystats.github.io/2011/09/07/first-things-first
<p><strong>About us:</strong></p>
<p>We are three professors who are fired up about the new era where data is abundant and statisticians are scientists. </p>
<p><strong>About this blog:</strong></p>
<p>We’ll be posting ideas we find interesting, contributing to discussion of science/popular writing, and linking to articles that inspire us. </p>
<p><strong>Why “Simply Statistics”:</strong></p>
<p>We needed a title. Plus, we like the idea of using simple statistics to solve real, important problems. We aren’t fans of unnecessary complication - that just leads to lies, damn lies and something else. </p>