Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

A non-comprehensive list of awesome female data people on Twitter

I was just talking to a student who mentioned she didn’t know Jenny Bryan was on Twitter. She is and she is an awesome person to follow. I also realized that I hadn’t seen a good list of women on Twitter who do stats/data. So I thought I’d make one. This list is what I could make in 15 minutes based on my own feed and will, with 100% certainty, miss really awesome people. Can you please add them in the comments and I’ll update the list?

  • @JennyBryan (Jenny Bryan) statistics professor at UBC, teaching a great intro to data science class right now.
  • @hspter (Hilary Parker) data analyst at Etsy (former Hopkins grad student!) and co-creator (I think) of #rcatladies, also wrote this nice post on writing an R package from scratch
  • @acfrazee (Alyssa Frazee) Ph.D. student at Hopkins, writes a great blog on data stuff, works on statistical genomics
  • @emsweene57 (Elizabeth Sweeney) - Hopkins Ph.D. student, developer of methods for neuroimaging.
  • @hmason (Hilary Mason) - currently running one of my favorite startups Fast Forward Labs, but basically needs no introduction, one of the biggest names in data science right now.
  • @sherrirose (Sherri Rose) - former Hopkins postdoc, now at Harvard. Literally wrote the book on targeted learning.
  • @eloyan_ani (Ani Eloyan) - Hopkins Biostat faculty, working on neuroimaging and EMRs. Led the team that won the ADHD-200 competition.
  • @mrogati (Monica Rogati) - Former Linkedin data scientist, now running the data team at Jawbone.
  • @annmariastat (AnnMaria De Mars) - runs the Julia group, also world class judoka, writes one of my favorite stats/education blogs
  • @kara_woo (Kara Woo) - Works at the National Center for Ecological Analysis and Synthesis and maintains their projections blog
  • @jhubiostat (Betsy Ogburn) - Hopkins biostat faculty, not technically her account. But she is the reason this is the funniest/best academic department twitter account out there.
  • @lovestats  (Annie Pettit) - Does surveys and data quality/MRX work. If you are into MRX, check out her blog.
  • @ProfEmilyOster (Emily Oster) - Econ professor at U Chicago. Has been my favorite writer for FiveThirtyEight since their relaunch.
  • @monachalabi (Mona Chalabi) - writer for FiveThirtyEight, I like her “Am I normal” series of posts.
  • @lisaczhang  (Lisa Zhang)- cofounder of Polychart.
  • @notawful (Jessica Hartnett) - professor at Gannon University, writes a great blog on teaching statistics.
  • @AliciaOshlack (Alicia Oshlack) - researcher at the Murdoch Children’s Research Institute, one of the real superstars in computational genomics.
  • @AmeliaMN (Amelia McNamara) - graduate student at UCLA, works on the Mobilize project and other awesome data education initiatives in the LA school system.
  • @leighadlr (Leigh Arino de la Rubia) - Editor in chief of DataScience.LA
  • @inesgn (Ines Germendia) - data scientist working on official statistics at Basque Statistics - Eustat
  • @sgrifter  (Sandy Griffith) - Biostat Ph.D., fellow #rcatladies creator, professor at the Cleveland Clinic in quantitative medicine
  • @ladamic (Lada Adamic) - professor at Michigan, teacher of really highly regarded social network analysis class on Coursera, now at Facebook (I think)
  • @stephaniehicks - (Stephanie Hicks) postdoc in compbio at Harvard, lead teaching assistant for Data Science course at Harvard.
  • @ansate - (Melissa Santos) manager of Hadoop infrastructure at Etsy, maintainer of the women in data list below.
  • @lauramclay (Laura McLay) - professor of operations research at UW Madison, writes a blog with an amazing name: Punk Rock Operations Research.
  • @bioannie (Laura Hatfield) - professor at Harvard, also has one of the best data titles I’ve ever heard: Princess of Bayesia
  • @kaythaney (Kaitlin Thaney) - director of the Mozilla Science Lab, also works with DataKind UK.
  • @laurieskelly (Laurie Skelly) - Data scientist at Datascope Analytics
  • @bo_p (Bo Peng) - Data scientist at Datascope Analytics
  • @siminaboca (Simina Boca) - former Hopkins Ph.D. student, now assistant professor at Georgetown in Biomedical informatics.
  • @HelenPowell01 (Helen Powell) - postdoc in Biostatistics at Hopkins, works on statistics for relationship between air pollution and health.
  • @victoriastodden (Victoria Stodden) - one of the leaders in the legal and sociological aspects of reproducible research.
  • @hannawallach (Hanna Wallach) - CS professor and researcher at Microsoft Research NY.
  • @kralljr (Jenna Krall) - postdoctoral fellow in environmental statistics at Emory (Hopkins grad!)
  • @LssLi (Shanshan Li) - professor of Biostatistics at IUPUI, works on neuroimaging, aging and epidemiology (Hopkins grad!)
  • @aheineike (Amy Heineike) - director of mathematics at Quid, also excellent interviewee.
  • @mathbabedotorg (Cathy O’Neil) program director of the Lede Program at Columbia’s J School, writes a very popular data science blog.
  • @ameliashowalter (Amelia Showalter) - Former director of digital analytics for Obama 2012. Data consultant.
  • @minebocek (Mine Cetinkaya-Rundel) - Professor at Duke, teaches their great statistics MOOC based on OpenIntro.
  • @YennyWebbV (Yenny Webb Vargas) Ph.D. student in Biostatistics at Johns Hopkins, one of the founders of Bmore Biostats and a blogger
  • @OMGannaks (Anna Smith) - former data scientist at Bitly, now analytics engineer at Rent the Runway.
  • @kristin_linn (Kristin Linn) - postdoc at UPenn, formerly NC State grad student, part of the awesome statistics band (!) @TheFifthMoment
  • @ledell (Erin LeDell) - grad student in Biostatistics at Berkeley working on machine learning, co-author of subsemble R package.
  • @atmccann (Allison McCann) - writer for FiveThirtyEight. Data viz person, my favorite post of hers is how to debug a jet
  • @ReginaNuzzo (Regina Nuzzo) - stats prof and freelance writer. Her piece on p-values in Nature just won the statistical reporting award.
  • @jrfAleks (Aleks Collingwood) - programme manager for the Joseph Rowntree Foundation. Working on poverty and aging.
  • @abarysh (Anastasia Baryshnikova) - Princeton Lewis-Sigler fellow, co-leader of a major project on a large international yeast knockout study.
  • @sharon000 (Sharon Machlis) - online managing editor at Computerworld.
  • @2plus2make5 (Emma Pierson) - Stanford undergrad, Rhodes Scholar, frequent contributor to FiveThirtyEight and other data blogs.
  • @mandyfmejia (Mandy Mejia) - Johns Hopkins PhD student, brain imaging analyzer, also writes a great blog!

I have also been informed that these Twitter lists are probably better than my post. But I’ll keep updating my list anyway because I want to know who all the right people to follow are!


Why the three biggest positive contributions to reproducible research are the IPython Notebook, knitr, and Galaxy

There is a huge amount of interest in reproducible research and replication of results. Part of this is driven by some of the pretty major mistakes in reproducibility we have seen in economics and genomics. This has spurred discussion at a variety of levels including at the level of the United States Congress.

To solve this problem we need the appropriate infrastructure. I think developing infrastructure is a lot like playing the lottery, except that buying a ticket takes a lot more work. You pour a huge amount of effort into building good infrastructure. I think it helps if you build it for yourself, like Yihui did for knitr:

(also make sure you go read the blog post over at Data Science LA)

If lots of people adopt it, you are set for life. If they don’t, you did all that work for nothing. So you have to applaud all the groups who have made efforts at building infrastructure for reproducible research.

I would contend that the largest positive contributions to reproducibility in sheer number of analyses made reproducible are:

  • The knitr R package (or, more recently, rmarkdown) for creating literate webpages and documents in R.
  • IPython notebooks for creating literate webpages and documents interactively in Python.
  • The Galaxy project for creating reproducible workflows (among other things) combining known tools.

There are similarities and differences between the platforms, but the one thing I think they all have in common is that they added essentially no extra effort to people’s data analytic workflows.

knitr and IPython notebooks have primarily increased reproducibility among folks who have some scripting experience. I think a major reason they are so popular is that you just write code like you normally would, but embed it in a simple-to-use document. The workflow doesn’t change much for the analyst, because they were going to write that code anyway; the document format just turns that code into something more shareable.
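To make that concrete, here is a minimal sketch of what this looks like in an R Markdown document rendered through knitr/rmarkdown. The data set and model are just placeholders for whatever analysis you were going to write anyway.

````
---
title: "A reproducible analysis"
output: html_document
---

The chunk below is ordinary R code; knitr runs it when the document is
rendered and weaves the output (estimates, tables, figures) into the report.

```{r wt-vs-mpg}
# Placeholder analysis on a built-in data set
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$coefficients

plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)
```
````

Calling rmarkdown::render("analysis.Rmd") re-runs every chunk from scratch, so the reported numbers and figures cannot silently drift away from the code that produced them.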

Galaxy has increased reproducibility for many folks, but my impression is that the primary user base is folks who have less experience scripting. They have worked hard to make it possible for these folks to analyze data they couldn’t before in a reproducible way. But the reproducibility is incidental in some sense. The main reason users come is that they would have had to stitch those pipelines together anyway. Now they have an easier way to do it (lowering workload) and they get reproducibility as a bonus.

If I were in charge of picking the next round of infrastructure projects that are likely to impact reproducibility or science in a positive way, I would definitely look for projects that have certain properties.

  • For scripters and experts I would look for projects that interface with what people are already doing (most data analysis is in R or Python these days), require almost no extra work, and provide some benefit (reproducibility or otherwise). I would also look for things that are agnostic to which packages/approaches people are using.
  • For non-experts I would look for projects that enable people to build pipelines they weren’t able to build before using already-standard tools, and that give them things like reproducibility for free.

Of course, I wouldn’t put myself in charge anyway; I’ve never won the lottery with any infrastructure I’ve tried to build.

A (very) brief review of published human subjects research conducted with social media companies

As I wrote the other day, more and more human subjects research is being performed by large tech companies. The best way to handle the ethical issues raised by this research is still unclear. The first step is to get some idea of what has already been published from these organizations. So here is a brief review of the papers I know about where human subjects experiments have been conducted by companies. I’m only counting experiments here that have (a) been published in the literature and (b) involved experiments on users. I realized I could come up with surprisingly few.  I’d be interested to see more in the comments if people know about them.

Paper: Experimental evidence of massive-scale emotional contagion through social networks

Company: Facebook

What they did: Randomized people to get different amounts of emotional content in their news feeds and observed whether they showed an emotional reaction in their own posts.

What they found: That there was almost no real effect on emotion. The effect was statistically significant but not scientifically or emotionally meaningful.
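As an aside on that last distinction: at this scale, even a trivially small shift clears the significance bar. The simulation below is purely illustrative (made-up numbers, not the study’s data), just to show how a negligible effect becomes “statistically significant” with hundreds of thousands of users per arm.

```r
# Illustrative only: simulated data, not the Facebook study's numbers.
# A shift of 0.02 standard deviations in an "emotion score" is practically
# negligible, but with ~350,000 users per arm it is highly "significant".
set.seed(1)
n <- 350000
control   <- rnorm(n, mean = 0.00, sd = 1)
treatment <- rnorm(n, mean = 0.02, sd = 1)

t.test(treatment, control)$p.value    # far below 0.05
mean(treatment) - mean(control)       # about 0.02 SD: tiny in practical terms
```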

Paper: Social influence bias: a randomized experiment

Company: Not stated but sounds like Reddit

What they did: Randomly upvoted, downvoted, or left alone posts on the social networking site. Then they observed whether there was a difference in the overall rating of posts within each treatment.

What they found: Posts that were upvoted ended up with a final rating score (total upvotes - total downvotes) that was 25% higher.

Paper: Identifying influential and susceptible members of social networks 

Company: Facebook

What they did: Using a commercial Facebook app,  they found users who adopted a product and randomized sending messages to their friends about the use of the product. Then they measured whether their friends decided to adopt the product as well.

What they found: Many interesting things. For example: susceptibility to influence decreases with age, people over 31 are stronger influencers, women are less susceptible to influence than men, etc. etc.


Paper: Inferring causal impact using Bayesian structural time-series models

Company: Google

What they did: They developed methods for inferring the causal impact of an ad in a time series situation. They used data from an advertiser who showed ads to people related to keywords and measured how many visits there were to the advertiser’s website through paid and organic (non-paid) clicks.

What they found: That the ads worked. But, more importantly, that they could estimate the causal effect of the ad using their methods.
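The approach from this paper is available as Google’s open-source CausalImpact R package. Here is a minimal sketch of how it is typically used; the series below is simulated and the pre/post periods are arbitrary, so treat it as an illustration of the interface rather than a reproduction of the paper’s analysis.

```r
# Sketch of the CausalImpact workflow on simulated data (illustrative only)
library(CausalImpact)

set.seed(1)
x <- 100 + arima.sim(model = list(ar = 0.5), n = 100)  # control series (e.g., related search volume)
y <- 1.2 * x + rnorm(100)                              # response series (e.g., site visits)
y[71:100] <- y[71:100] + 10                            # simulated lift after the campaign starts

data <- cbind(y, x)
pre.period  <- c(1, 70)     # before the ad campaign
post.period <- c(71, 100)   # after the ad campaign

impact <- CausalImpact(data, pre.period, post.period)
summary(impact)   # estimated absolute and relative causal effect
plot(impact)      # observed vs. counterfactual, pointwise and cumulative effects
```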

SwiftKey and Johns Hopkins partner for Data Science Specialization Capstone

I use [SwiftKey](http://swiftkey.com/en/) on my Android phone all the time. So I was super pumped up that they are partnering with us on the capstone course for the Johns Hopkins Data Science Specialization, set to run in October 2014. To enroll in the course you have to pass the other 9 courses in the Data Science Specialization.

The 9 courses have only been running for 4 months but already 200+ people have finished all 9! It has been unbelievable to see the response to the specialization and we are excited about taking it to the next level.

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
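As a toy illustration of the idea (this is not SwiftKey’s model or the course materials, just a sketch), the simplest predictive text models count which words tend to follow which, and offer the most frequent continuations:

```r
# Toy bigram next-word predictor -- illustrative sketch only
corpus <- c("i went to the gym",
            "i went to the store",
            "we went to the store",
            "she went to the restaurant",
            "i went to the store again")

words <- unlist(strsplit(tolower(corpus), "\\s+"))
pairs <- data.frame(w1 = head(words, -1),
                    w2 = tail(words, -1),
                    stringsAsFactors = FALSE)

# Given the last word typed, return the k most frequent next words
predict_next <- function(last_word, k = 3) {
  nxt <- pairs$w2[pairs$w1 == tolower(last_word)]
  counts <- sort(table(nxt), decreasing = TRUE)
  names(counts)[seq_len(min(k, length(counts)))]
}

predict_next("the")   # "store" "gym" "restaurant"
```

A real predictive text model adds longer contexts (trigrams and beyond), smoothing, and back-off for unseen phrases, which is the kind of thing the capstone digs into.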

This course will start with the basics, analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, students will use the knowledge gained in our  Data Products course to build a predictive text product they can show off to their family, friends, and potential employers.

We are really excited to work with SwiftKey to take our Specialization to the next level! Here is Roger’s intro video for the course to get you fired up too.

Interview with COPSS Award winner Martin Wainwright

Editor’s note: Martin Wainwright is the winner of the 2014 COPSS Award. This award is the most prestigious award in statistics, sometimes referred to as the Nobel Prize in Statistics. Martin received the award “for fundamental and groundbreaking contributions to high-dimensional statistics, graphical modeling, machine learning, optimization and algorithms, covering deep and elegant mathematical analysis as well as new methodology with wide-ranging implications for numerous applications.” He kindly agreed to be interviewed by Simply Statistics.

[Photo: Martin Wainwright]

SS: How did you find out you had received the COPSS prize?

It was pretty informal – I received an email in February from Raymond Carroll, who chaired the committee. But it had explicit instructions to keep the information private until the award ceremony in August.

SS: You are in Electrical Engineering & Computer Science (EECS) and Statistics at Berkeley: why that mix of departments?

Just to give a little bit of history, I did my undergraduate degree in math at the University of Waterloo in Canada, and then my Ph.D. in EECS at MIT, before coming to Berkeley to work as a postdoc in Statistics. So when it came time to look at faculty positions, having a joint position between these two departments made a lot of sense. Berkeley has always been at the forefront of having effective joint appointments of the “Statistics plus X” variety, whether X is EECS, Mathematics, Political Science, Computational Biology and so on.

For me personally, the EECS plus Statistics combination is terrific, as a lot of my interests lie at the boundary between these two areas, whether it is investigating tradeoffs between computational and statistical efficiency, connections between information theory and statistics, and so on. I hope that it is also good for my students! In any case, whether they enter in EECS or Statistics, they graduate with a strong background in both statistical theory and methods, as well as optimization, algorithms and so on. I think that this kind of mix is becoming increasingly relevant to the practice of modern statistics, and one can certainly see that Berkeley consistently produces students, whether from my own group or other people at Berkeley, with this kind of hybrid background.

SS: What do you see as the relationship between statistics and machine learning?

This is an interesting question, but tricky to answer, as it can really depend on the person. In my own view, statistics is a very broad and encompassing field, and in this context, machine learning can be viewed as a particular subset of it, one especially focused on algorithmic and computational aspects of statistics. But on the other hand, as things stand, machine learning has rather different cultural roots than statistics, certainly strongly influenced by computer science. In general, I think that both groups have lessons to learn from each other. For instance, in my opinion, anyone who wants to do serious machine learning needs to have a solid background in statistics. Statisticians have been thinking about data and inferential issues for a very long time now, and these fundamental issues remain just as important now, even though the application domains and data types may be changing. On the other hand, in certain ways, statistics is still a conservative field, perhaps not as quick to move into new application domains, experiment with new methods and so on, as people in machine learning do. So I think that statisticians can benefit from the playful creativity and unorthodox experimentation that one sees in some machine learning work, as well as the algorithmic and programming expertise that is standard in computer science.

SS: What sorts of things is your group working on these days?

I have fairly eclectic interests, so we are working on a range of topics. A number of projects concern the interface between computation and statistics. For instance, we have a recent pre-print (with postdoc Sivaraman Balakrishnan and colleague Bin Yu) that tries to address the gap between statistical and computational guarantees in applications of the expectation-maximization (EM) algorithm for latent variable models. In theory, we know that the global minimizer of the (nonconvex) likelihood has good properties, but in practice, the EM algorithm only returns local optima. How to resolve this gap between existing theory and actual practice? In this paper, we show that under pretty reasonable conditions (which hold for various types of latent variable models), the EM fixed points are as good as the global minima from the statistical perspective. This explains what is observed a lot in practice, namely that when the EM algorithm is given a reasonable initialization, it often returns a very good answer.

There are lots of other interesting questions at this computation/statistics interface. For instance, a lot of modern data sets (e.g., Netflix) are so large that they cannot be stored on a single machine, but must be split up into separate pieces. Any statistical task must then be carried out in a distributed way, with each processor performing local operations on a subset of the data, and then passing messages to other processors that summarize the results of its local computations. This leads to a lot of fascinating questions. What can be said about the statistical performance of such distributed methods for estimation or inference? How many bits do the machines need to exchange in order for the distributed performance to match that of the centralized “oracle method” that has access to all the data at once? We have addressed some of these questions in a recent line of work (with student Yuchen Zhang, former student John Duchi and colleague Michael Jordan).

So my students and postdocs are keeping me busy, and in addition, I am also busy writing a couple of books, one jointly with Trevor Hastie and Rob Tibshirani at Stanford University on the Lasso and related methods, and a second solo-authored effort, more theoretical in focus, on high-dimensional and non-asymptotic statistics.

SS: What role do you see statistics playing in the relationship between Big Data and Privacy?

Another very topical question: privacy considerations are certainly becoming more and more relevant as the scale and richness of data collection grows. Witness the recent controversies with the NSA, data manipulation on social media sites, etc. I think that statistics should have a lot to say about data and privacy. There has been a long line of statistical work on privacy, dating back at least to Warner’s work on survey sampling in the 1960s, but I anticipate seeing more of it over the next years. Privacy constraints bring a lot of interesting statistical questions – how to design experiments, how to perform inference, how data should be aggregated and what should be released, and so on – and I think that statisticians should be at the forefront of this discussion.

In fact, in some joint work with former student John Duchi and colleague Michael Jordan, we have examined some tradeoffs between privacy constraints and statistical utility. We adopt the framework of local differential privacy that has been put forth in the computer science community, and study how statistical utility (in the form of estimation accuracy) varies as a function of the privacy level. Obviously, preserving privacy means obscuring something, so that estimation accuracy goes down, but what is the quantitative form of this tradeoff? An interesting consequence of our analysis is that in certain settings, it identifies optimal mechanisms for preserving a certain level of privacy in data.

SS: What advice would you give young statisticians getting into the discipline right now?

It is certainly an exciting time to be getting into the discipline. For undergraduates thinking of going to graduate school in statistics, I would encourage them to build a strong background in basic mathematics (linear algebra, analysis, probability theory and so on), all of which is important for a deep understanding of statistical methods and theory. I would also suggest “getting their hands dirty”, that is, doing some applied work involving statistical modeling, data analysis and so on. Even for a person who ultimately wants to do more theoretical work, having some exposure to real-world problems is essential. As part of this, I would suggest acquiring some knowledge of algorithms, optimization, and so on, all of which are essential in dealing with large, real-world data sets.