Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Big Data - Context = Bad

There’s a nice article by Nick Bilton in the New York Times Bits blog about the need for context when looking at Big Data. Actually, the article starts off by describing how Google’s Flu Trends model overestimated the number of people infected with flu in the U.S. this season, but then veers off into a more general discussion about Big Data.

My favorite quote comes from Mark Hansen:

“Data inherently has all of the foibles of being human,” said Mark Hansen, director of the David and Helen Gurley Brown Institute for Media Innovation at Columbia University. “Data is not a magic force in society; it’s an extension of us.”

Bilton also talks about a course he taught where students built sensors to install in elevators and stairwells at NYU to see how often they were used. The idea was to explore how often and when the NYU students used the stairs versus the elevator.

As I left campus that evening, one of the N.Y.U. security guards who had seen students setting up the computers in the elevators asked how our experiment had gone. I explained that we had found that students seemed to use the elevators in the morning, perhaps because they were tired from staying up late, and switch to the stairs at night, when they became energized.

“Oh, no, they don’t,” the security guard told me, laughing as he assured me that lazy college students used the elevators whenever possible. “One of the elevators broke down a few evenings last week, so they had no choice but to use the stairs.”

I can see at least three problems here, not necessarily mutually exclusive:

  1. Big Data are often “Wrong” Data. The students used the sensors to measure something, but the sensors didn’t give them everything they needed. Part of this is that the sensors were cheap, and budget was likely a big constraint here; Big Data are often big precisely because they are cheap. But of course, the sensors still couldn’t tell that the elevator was broken.
  2. A failure of interrogation. With all the data the students collected with their multitude of sensors, they were unable to answer the question “What else could explain what I’m observing?” (see the sketch after this list).
  3. A strong desire to tell a story. Upon looking at the data, they seemed to “make sense” or at least to match a preconceived notion of what they should look like. This is related to #2 above: you have to challenge what you see. It’s very easy and tempting to let the data tell an interesting story rather than the right story.
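To make point #2 concrete, here is a minimal sketch, not the students’ actual data: a small simulation where overall traffic and the preference for elevators stay constant all day, but the elevator is out of service on a couple of evenings. The outage schedule, the traffic levels, and the 80% elevator preference are all assumptions for illustration; the point is that an unmeasured cause like a broken elevator can produce exactly the “elevators in the morning, stairs at night” pattern in the raw counts.

```python
# Hypothetical simulation (assumed numbers, not the NYU sensor data):
# hourly elevator/stair counts when the elevator is down on some evenings.
import random

random.seed(1)

hours = list(range(8, 23))        # 8am to 10pm
days = 5
broken_evenings = {2, 4}          # assumed: evenings the elevator was out

elevator_counts = {h: 0 for h in hours}
stair_counts = {h: 0 for h in hours}

for day in range(days):
    for h in hours:
        trips = random.randint(20, 30)              # traffic roughly constant all day
        elevator_down = day in broken_evenings and h >= 18
        for _ in range(trips):
            prefers_elevator = random.random() < 0.8  # assumed constant preference
            if prefers_elevator and not elevator_down:
                elevator_counts[h] += 1
            else:
                stair_counts[h] += 1

# Without knowing about the outage, the raw counts suggest students
# "switch" to the stairs in the evening, even though behavior never changed.
for h in hours:
    print(f"{h:02d}:00  elevator={elevator_counts[h]:4d}  stairs={stair_counts[h]:4d}")
```

The simulated counts show fewer elevator trips in the evening hours, which a sensor alone can’t distinguish from a genuine behavioral shift.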

I don’t mean to be unduly critical of some students in a class who were just trying to collect some data. I think there should be more of that going on. But my point is that it’s not as easy as it looks. Even trying to answer a seemingly innocuous question of how students use elevators and stairs requires some forethought, study design, and careful analysis.