Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Measuring the importance of data privacy: embarrassment and cost

We live in an era when it is inexpensive and easy to collect data about ourselves or about other people. These data can take the form of health information - like medical records, or they could be financial data - like your online bank statements, or they could be social data - like your friends on Facebook. We can also easily collect information about our genetic makeup or our fitness (although it can be hard to get).

All of these data types are now stored electronically. There are obvious reasons why this is both economical and convenient. The downside, of course, is that the data can be used by the government or other entities in ways that you may not like. Whether it is to track your habits to sell you new products or to use your affiliations to make predictions about your political leanings, these data are not just “numbers”.

Data protection and data privacy are major issues in a variety of fields. In some areas, laws are in place to govern how your data can be shared and used (e.g. HIPAA). In others it is a bit more of a wild west mentality (see this interesting series of posts, “Know your data” by junkcharts talking about some data issues). I think most people have some idea that they would like to keep at least certain parts of their data private (from the government, from companies, or from their friends/family), but I’m not sure how most people think about data privacy.

For me there are two scales on which I measure the importance of the privacy of my own data:

  1. Embarrassment - Data about my personal habits, whether I let my son watch too much TV, or what kind of underwear I buy could be embarrassing if it were out in public.
  2. Financial - Data about my social security number, my bank account numbers, or my credit card account could be used to cost me either my current money or potential future money.

My concerns about data privacy can almost always be measured primarily on these two scales. For example, I don’t want my medical records to be public because: (1) it might be embarrassing for people to know how bad my blood pressure is and (2) insurance companies might charge me more if they knew. On the other hand, I don’t want my bank account information to get out primarily because it could cost me financially, so that mostly registers on only one scale.

One option, of course, would be to make all of my data totally private. But the problem is I want to share some of it with other people - I want my doctor to know my medical history and my parents to get to see pictures of my son. Usually I just make these choices about data sharing without even thinking about them, but after a little reflection I think these are the main considerations that go into my data sharing choices:

  1. Where does it rate on the two scales above?
  2. How much do I trust the person I’m sharing with? For example, my wife knows my bank account info, but I wouldn’t give it to a random stranger on the street. Google has my email and uses it to market to me, but that doesn’t bother me too much. I do trust them (I think) not to, say, tell people I’m negotiating with about my plans based on emails I sent to my wife (this goes with #4 below).
  3. How hard would it be to use the information? I give my credit card to waiters at restaurants all the time, but I also monitor my account - so it would be relatively hard to run up a big bill before I (or the bank) noticed. I put my email address online, but it is a couple of steps between that and anything that is embarrassing/financially dubious for me. You’d have to be able to use that to hack some account.
  4. Is there incentive for someone to use the information? I’m not fabulously wealthy or famous. So most of the time, even if financial/embarrassing stuff is online about me, it probably wouldn’t get used. On the other hand, if I were an actor, a politician, or a billionaire there would be a lot more people incentivized to use my data against me. For example, if Google used my info to blow up a negotiation they would gain very little. I, on the other hand, would lose a lot and would probably sue them.*

These ideas make it a little easier for me to (at least personally) classify how much I care about different kinds of privacy breaches.

For example, suppose my health information was posted on the web. I would consider this a problem because of both financial and embarrassment potential. It is also on the web, so I basically don’t trust the vast majority of people who would have access. On the other hand, it would be at least reasonably hard to use this data directly against me unless you were an insurance provider, and most people wouldn’t have the incentive.

Take another example: someone tagging me in Facebook photos (I don’t have my own account). Here the financial considerations are only potential future employment problems, but the embarrassment considerations are quite high. I probably somewhat trust the person tagging me since I at least likely know them. On the other hand it would be super-easy to use the info against me - it is my face in a picture and would just need to be posted on the web. So in this case, it mostly comes down to incentive and I don’t think most people have an incentive to use pictures against me (except in jokes - which I’m mostly cool with).

I could do more examples, but you get the idea. I do wonder if there is an interesting statistical model to be built here on the basis of these axioms (or other more general ones) about when/how data should be used/shared.
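To make that a bit more concrete, here is a minimal sketch of what a toy version might look like. Everything in it is an assumption for illustration: the 0-to-1 ratings, the names `Exposure` and `concern`, and the choice to multiply potential harm by a rough chance of misuse. It is not a fitted or validated model, just one way to write the four considerations down.

```python
from dataclasses import dataclass


@dataclass
class Exposure:
    """Hypothetical 0-to-1 ratings for one piece of shared data."""
    embarrassment: float  # scale 1: how embarrassing if it got out
    financial: float      # scale 2: how costly if it got out
    trust: float          # how much I trust whoever holds the data
    ease: float           # how easy the data is to use against me
    incentive: float      # how much anyone would gain by using it


def concern(x: Exposure) -> float:
    """Toy score: potential harm times a rough chance of misuse."""
    harm = max(x.embarrassment, x.financial)
    misuse = (1 - x.trust) * x.ease * x.incentive
    return harm * misuse


# Made-up ratings, loosely following the two examples above.
health_on_web = Exposure(embarrassment=0.7, financial=0.7,
                         trust=0.1, ease=0.3, incentive=0.2)
tagged_photo = Exposure(embarrassment=0.8, financial=0.2,
                        trust=0.6, ease=0.9, incentive=0.1)

for name, x in [("health records on the web", health_on_web),
                ("tagged photo", tagged_photo)]:
    print(f"{name}: {concern(x):.3f}")
```

Taking the larger of the two scales as the harm, and multiplying the other three together, means that full trust, zero ease, or zero incentive each drive the score to zero. That matches how I reason in the examples, but a real model would need data on how people actually make these choices.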

* An interesting side note is that I did use my Gmail account when I was considering a position at Google fresh out of my Ph.D. I sent emails to my wife and my advisor discussing my plans/strategy. I always wondered if Google looked at those emails when they were negotiating with me - although I never had any reason to suspect they had.