Simply Statistics
2021-11-12T13:27:19+00:00
http://simplystats.github.io
Some default and debt restructuring data
2017-05-04T00:00:00+00:00
http://simplystats.github.io/2017/05/04/debt-haircuts
<p>Yesterday the government of Puerto Rico <a href="https://www.nytimes.com/2017/05/03/business/dealbook/puerto-rico-debt.html">asked for bankruptcy relief in federal court</a>. Puerto Rico owes about $70 billion to bondholders and about $50 billion in pension obligations. Before asking for protection the government offered to pay back some of the debt (50% according to some news reports) but bondholders refused. Bondholders will now fight in court to recover as much of what is owed as possible while the government and a federal oversight board will try to lower this amount. What can we expect to happen?</p>
<p>A case like this is unprecedented, but there are plenty of data on restructurings. An <a href="http://www.elnuevodia.com/opinion/columnas/ladeudaserenegociaraeneltribunal-columna-2317174/">op-ed</a> by Juan Lara pointed me to <a href="http://voxeu.org/article/argentina-s-haircut-outlier">this</a> blog post describing data on 180 debt restructurings. I am not sure how informative these data are with regard to Puerto Rico, but the plot below sheds some light on the variability of previous restructurings. Colors represent regions of the world and the lines join points from the same country. I added data from US cases shown in <a href="http://www.nfma.org/assets/documents/RBP/wp_statliens_julydraft.pdf">this paper</a>.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-05-04/haircuts.png" alt="" /></p>
<p>The cluster of points you see below the 30% mark appears to be cases involving particularly poor countries: Albania, Argentina, Bolivia, Ethiopia, Bosnia and Herzegovina, Guinea, Guyana, Honduras, Cameroon, Iraq, Republic of the Congo, Costa Rica, Mauritania, Sao Tome and Principe, Mozambique, Senegal, Nicaragua, Niger, Serbia and Montenegro, Sierra Leone, Tanzania, Togo, Uganda, Yemen, and Republic of Zambia. Note also that these restructurings happened after 1990.</p>
Science really is non-partisan: facts and skepticism annoy everybody
2017-04-24T00:00:00+00:00
http://simplystats.github.io/2017/04/24/march-for-science
<p>This is a short open letter to those who believe scientists have a “liberal bias” and question their objectivity. I suspect that for many conservatives, this Saturday’s March for Science served as confirmation of this fact. In this post I will try to convince you that this is not the case, specifically by pointing out how scientists often annoy the left as much as the right.</p>
<p>First, let me emphasize that scientists are highly appreciative of members of Congress and past administrations that have supported Science funding through the DoD, NIH and NSF. Although the current administration did propose a 20% cut to NIH, we are aware that, generally speaking, support for scientific research has traditionally been bipartisan.</p>
<p>It is true that the typical data-driven scientist will disagree, sometimes strongly, with many stances that are considered conservative. For example, most scientists will argue that:</p>
<ol>
<li>Climate change is real and is driven largely by increased carbon dioxide and other human-made emissions into the atmosphere.</li>
<li>Evolution needs to be part of children’s education and creationism has no place in Science class.</li>
<li>Homosexuality is not a choice.</li>
<li>Science must be publicly funded because the free market is not enough to make science thrive.</li>
</ol>
<p>But scientists will also hold positions that are often criticized heavily by some of those who identify as politically left wing:</p>
<ol>
<li>Current vaccination programs are safe and need to be enforced: without herd immunity thousands of children would die.</li>
<li>Genetically modified organisms (GMOs) are safe and are indispensable to fight world hunger. There is no need for warning labels.</li>
<li>Using nuclear energy to power our electrical grid is much less harmful than using natural gas, oil and coal and, currently, more viable than renewable energy.</li>
<li>Alternative medicine, such as homeopathy, naturopathy, faith healing, reiki, and acupuncture, is pseudo-scientific quackery.</li>
</ol>
<p>The timing of the announcement of the March for Science, along with the organizers’ focus on environmental issues and diversity, may have made it seem like a partisan or left-leaning event, but please also note that many scientists <a href="https://www.nytimes.com/2017/01/31/opinion/a-scientists-march-on-washington-is-a-bad-idea.html">criticized</a> the organizers for this very reason and there was much debate in general. Most scientists I know who went to the march did so not necessarily because they are against Republican administrations, but because they are legitimately concerned about some of the choices of this particular administration and the future of our country if we stop funding and trusting science.</p>
<p>If you haven’t already seen this <a href="https://www.youtube.com/watch?v=8MqTOEospfo">Neil deGrasse Tyson video</a> on the importance of Science to everyone, I highly recommend it.</p>
Redirect
2017-04-06T00:00:00+00:00
http://simplystats.github.io/2017/04/06/march-for-science
<p>This page was generated in error. The “Science really is non-partisan: facts and skepticism annoy everybody” blog post is <a href="http://simplystatistics.org/2017/04/24/march-for-science/">here</a></p>
<p>Apologies for the inconvenience.</p>
Enrollment, the cost per credit, and strikes at the UPR
2017-04-06T00:00:00+00:00
http://simplystats.github.io/2017/04/06/huelga
<p>The University of Puerto Rico (UPR) receives approximately 800 million
dollars from the state each year. This investment allows it to offer higher
salaries, which attracts the best professors, to have the best facilities
for research and teaching, and to keep the price per credit lower than that of private universities. Thanks to these great
advantages, the UPR is usually the first choice of Puerto Rican students, in
particular its two largest campuses, Río Piedras (UPRRP) and Mayagüez. A
student who makes good use of their time at the UPR, besides developing as a citizen, can
successfully enter the workforce or continue their studies at the best graduate schools. The
modest price per credit, combined with federal Pell grants, has
helped thousands of economically disadvantaged students complete their
studies without going into debt.</p>
<p>Over the past decade a worrying reality has emerged: while demand for
university education has grown, as shown by rising enrollment at private universities, the number of students enrolled at the UPR
has fallen.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-04-06/matricula.png" alt="" /></p>
<p>Why has enrollment at the UPR fallen?
<a href="http://www.elnuevodia.com/noticias/locales/nota/protestalauniondejuventudessocialistas-1331982/">A popular explanation</a>
is that “the drop in enrollment is caused by the increase in the cost of
tuition”. The theory that a rise in costs reduces enrollment is
commonly accepted because it makes economic sense: when the price goes up,
sales go down. But then why has enrollment grown at
private universities? Growth in the number of wealthy students does not explain it either
since, in 2012, <a href="http://www.80grados.net/hacia-una-universidad-mas-pequena-y-agil/">the median family income of young people enrolled at
a UPR campus was $32,379; in contrast, the median income of
those enrolled at a private university was $25,979</a>. Another problem with this theory is that, once we adjust for inflation, the cost per credit has remained more or less stable both at the UPR and at private universities.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-04-06/costo.png" alt="" /></p>
<p>Now, if we look closely at the enrollment data we notice that the largest drops occurred precisely in the strike years (2005, 2010, 2011). In 2005 a positive trend begins in enrollment at Sagrado, with the highest growth in 2010 and 2011.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-04-06/cambio-en-matricula.png" alt="" /></p>
<p>Currently, several campuses, including Río Piedras, <a href="http://www.elnuevodia.com/noticias/locales/nota/estudiantesapruebanvotodehuelgasistemicaenlaupr-2307616/">are closed
indefinitely</a>. In a national assembly attended by 10% of the system’s more than 50,000 students, an indefinite strike was approved by a vote of 4,522 to 1,154. To resume work the students demand that “no sanctions be imposed on students who participate in the strike, that a university reform plan drafted by the university community be presented, that the public debt be audited, and that the members of the public audit evaluation commission and its budget be reinstated”. This comes in response to the proposal by the <a href="https://en.wikipedia.org/wiki/PROMESA">Fiscal Oversight Board (Junta de Supervisión Fiscal, JSF)</a> and the governor to
<a href="http://www.elnuevodia.com/noticias/locales/nota/revelanelplanderecortesparaelsistemadelaupr-2302675/">reduce</a> the UPR budget as part of their attempts to
resolve a <a href="https://www.project-syndicate.org/commentary/puerto-rico-debt-plan-deep-depression-by-joseph-e--stiglitz-and-martin-guzman-2017-02">serious fiscal
crisis</a>.</p>
<p>During the closure, the striking students prevent the rest of the university
community from entering the campus, including those who do not consider the strike an effective form of protest. Those who oppose it and want to continue studying are accused of being selfish or of being allies of those who want to destroy the UPR. So far these students have not received explicit support from professors and administrators either. We should not be surprised if those who want to continue studying resort to paying more at a private university.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-04-06/IMG_7076.jpg" alt="portones2" style="width: 300px;" /></p>
<p>Although it is possible that the strike will exert enough political pressure for the demands set in the assembly to be met, there are other possibilities that are less favorable to the students:</p>
<ul>
<li>The lack of academic activity results in the exodus of thousands of students to private universities.</li>
<li>The JSF uses the closure to justify even more cuts: an institution does not need millions of dollars a day if it is closed.</li>
<li>The closed campuses lose their accreditation, since a university in which no classes are taught cannot meet the <a href="http://www.msche.org/?Nav1=About&Nav2=FAQ&Nav3=Question07">necessary standards</a>.</li>
<li>Pell grants are revoked for students on recess.</li>
</ul>
<p>There is ample empirical evidence demonstrating the importance of accessible university education. The same is not true of strikes as a strategy for defending that education. And it is possible that the indefinite strike will have the opposite effect and greatly harm students, in particular those who are forced to enroll at a private university.</p>
<p>Notes:</p>
<ol>
<li>
<p>Data provided by the <a href="http://www2.pr.gov/agencias/cepr/inicio/estadisticas_e_investigacion/Pages/Estadisticas-Educacion-Superior.aspx">Consejo de Educación de Puerto Rico (CEPR)</a>.</p>
</li>
<li>
<p>The 2011 cost per credit does not include the fee.</p>
</li>
</ol>
The Importance of Interactive Data Analysis for Data-Driven Discovery
2017-04-03T00:00:00+00:00
http://simplystats.github.io/2017/04/03/interactive-data-analysis
<p>Data analysis workflows and recipes are commonly used in science. They
are actually indispensable since reinventing the wheel for each
project would result in a colossal waste of time. On the other hand,
mindlessly applying a workflow can result in
totally wrong conclusions if the required assumptions don’t hold.
This is why successful data analysts rely heavily on interactive
data analysis (IDA). I write today because I am somewhat
concerned that the importance of IDA is not fully appreciated by many
of the policy makers and thought leaders that will influence how we
access and work with data in the future.</p>
<p>I start by constructing a very simple example to illustrate the
importance of IDA. Suppose that as
part of a demographic study you are asked to summarize male heights
across several counties. Since sample sizes are large and heights are
known to be well approximated by a normal distribution you feel
comfortable using a tried and tested recipe:
report the average and standard deviation as a summary. You are
surprised to find a county with average heights of 6.1 feet with a
standard deviation (SD) of 7.8 feet. Do you start writing a paper and a
press release to describe this very interesting finding? Here,
interactive data analysis saves us from naively reporting this.
First, we note that the standard deviation is impossibly big if the data are in
fact normally distributed: more than 15% of heights would be
negative. Given this nonsensical result, the next
obvious step for an experienced data analyst is to explore the data,
say with a boxplot (see below). This immediately reveals a problem: it
appears one value was reported in centimeters, 180 centimeters, rather than in
feet. After fixing this, the summary changes to an average height
of 5.75 feet with a 3-inch SD.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/heights-with-outlier.png" alt="European Outlier" /></p>
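<p>The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated (499 heights in feet plus one value entered as 180), the sample size of 500 is chosen so the summaries land near the numbers quoted above, and the "anything above 10 feet is really centimeters" rule is a hypothetical cleaning step, not a general recipe.</p>

```python
import math
import random
import statistics


def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# Sanity check on the reported summary: if heights were normal with
# mean 6.1 ft and SD 7.8 ft, what fraction would be below zero?
p_negative = normal_cdf(0.0, mean=6.1, sd=7.8)

# Hypothetical data: 499 heights in feet plus one value entered as 180 (cm).
rng = random.Random(1)
heights = [rng.gauss(5.75, 0.25) for _ in range(499)] + [180.0]
raw_mean = statistics.mean(heights)
raw_sd = statistics.stdev(heights)

# Assumed cleaning rule: any value above 10 "feet" is treated as centimeters.
CM_PER_FOOT = 30.48
cleaned = [h / CM_PER_FOOT if h > 10 else h for h in heights]
clean_mean = statistics.mean(cleaned)
clean_sd = statistics.stdev(cleaned)

print(f"P(height < 0) under N(6.1, 7.8): {p_negative:.2f}")
print(f"raw:     mean={raw_mean:.2f} ft, sd={raw_sd:.2f} ft")
print(f"cleaned: mean={clean_mean:.2f} ft, sd={clean_sd * 12:.1f} in")
```

<p>Under these assumptions the implied share of negative heights comes out a bit over 20%, which is the kind of nonsensical number that should send you to look at the data. A single mis-entered value among 500 is enough to produce it.</p>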
<p>Years of data analysis experience will show you that examples like this are
common. Unfortunately, as data and analyses get more complex, workflow
failures are harder to detect and often go unnoticed. An important
principle many of us teach our trainees is to carefully check for
hidden problems when data analysis leads you to unexpected results,
especially when those results holding up would benefit us
professionally, for example by leading to a publication.</p>
<p>Interactive data analysis is also indispensable for the
development of new methodology. For example, in my field of research, exploring
the data has led to the discovery of the need for new methods and
motivated new approaches that handle specific cases that existing
workflows can’t handle.</p>
<p>So why am I concerned?
As public datasets become larger and more
numerous, many funding agencies, policy makers and industry leaders are
advocating for using cloud computing to bring computing to the
data. If done correctly, this would provide a great improvement over
the current redundant and unsystematic approach of everybody downloading data and working with it locally. However, after
looking into the details of some of these plans, I have become a bit
concerned that perhaps the importance of IDA is not fully appreciated by decision makers.</p>
<p>As an example consider the NIH efforts to promote data-driven discovery
that center around plans for the
<a href="https://datascience.nih.gov/commons"><em>Data Commons</em></a>. The linked page
describes an ecosystem with four components, one of which is
“Software”. According to the description, the software component of
<em>The Commons</em> should provide “[a]ccess to and deployment of scientific analysis
tools and pipeline workflows”. There is no mention of a strategy that
will grant access to the
raw data. Without this, carefully checking the workflow output and
developing the analysis tools and pipeline workflows of the future
will be difficult.</p>
<p>I note that data analysis workflows are very popular in fields in which data
analysis is indispensable, as is the case in biomedical research, my
focus area. In this field, data generators, which typically
lead the scientific enterprise, are not always trained data
analysts. But the literature is overflowing with proposed workflows.
You can gauge the popularity of these by the vast number
published in the Nature journals as demonstrated by this
<a href="https://www.google.com/search?q=workflow+site:nature.com&biw=1706&bih=901&source=lnms&tbm=isch&sa=X&ved=0ahUKEwi3usL8-dDPAhUDMSYKHaBFBTAQ_AUIBigB#tbm=isch&q=analysis+workflow+site:nature.com">Google search</a>:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/many-workflows.png" alt="Nature workflows" /></p>
<p>In a field in which data generators are not data analysis experts, the
workflow has the added allure that it removes the need to think deeply about
data analysis and instead shifts the responsibility to pre-approved
software. Note that these workflows are not always described with the
mathematical language or computer code needed to truly understand them
but rather with a series of PowerPoint shapes. The gist of the typical
data analysis workflow can be simplified into the following:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/workflow.png" alt="workflows" /></p>
<p>This simplification of the data analysis process makes it particularly
worrisome that the intricacies of IDA are not fully appreciated.</p>
<p>As mentioned above, data analysis workflows are a necessary component of
the scientific enterprise. Without them the process would slow down to
a halt. However, workflows should only be implemented once consensus
is reached regarding their optimality. And even then, IDA is needed to
ensure that the process is performing as expected. The careers of many of my
colleagues have been dedicated mostly to the development of such
analysis tools. We have learned that rushing to implement workflows
before they are mature enough can have widespread negative
consequences. And, at least in my experience, developing rigorous tools is
impossible without interactive data analysis. So I hope that this post
helps make a case for the importance of interactive data analysis and
that it continues to be a part of the scientific enterprise.</p>
The levels of data science class
2017-03-16T00:00:00+00:00
http://simplystats.github.io/2017/03/16/evo-ds-class
<p>In a recent post, Nathan Yau <a href="http://flowingdata.com/2013/03/12/data-hackathon-challenges-and-why-questions-are-important/">points to a comment</a> by Jake Porway about data science hackathons. They both say that for data science/visualization projects to be successful you have to start with an important question, not with a pile of data. This is the <a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">problem forward not solution backward</a> approach to data science and big data. This is the approach also advocated in the really nice piece on teaching data science by <a href="https://arxiv.org/abs/1612.07140">Stephanie and Rafa</a>.</p>
<p>I have adopted a similar approach in the data science class here at Hopkins, largely inspired by Dan Meyer’s <a href="https://www.ted.com/talks/dan_meyer_math_curriculum_makeover/transcript">patient problem solving for middle school math class</a>. So instead of giving students a full problem description I give them project suggestions like:</p>
<ul>
<li><strong>Option 1</strong>: Develop a prediction algorithm for identifying and classifying users that are trolling or being mean on Twitter. If you want an idea of what I’m talking about just look at the responses to any famous person’s tweets.</li>
<li><strong>Option 2</strong>: Analyze the traffic fatality data to identify any geographic, time varying, or other characteristics that are associated with traffic fatalities: https://www.transportation.gov/fastlane/2015-traffic-fatalities-data-has-just-been-released-call-action-download-and-analyze.</li>
<li><strong>Option 3</strong>: Develop a model for predicting life expectancy in Baltimore down to single block resolution with estimates of uncertainty. You may need to develop an approach for “downsampling” since the outcome data you’ll be able to find is likely aggregated at the neighborhood level (http://health.baltimorecity.gov/node/231).</li>
<li><strong>Option 4</strong>: Develop a statistical model for inferring the variables you need to calculate the Gail score (http://www.cancer.gov/bcrisktool/) for a woman based on her Facebook profile. Develop a model for the Gail score prediction from Facebook and its uncertainty. You should include estimates of uncertainty in the predicted score due to your inferred variables.</li>
<li><strong>Option 5</strong>: Potentially fun but super hard project. Develop an algorithm for a self-driving car using the training data: http://research.comma.ai/. Build a model for predicting at every moment what direction the car should be going, whether it should be signalling, and what speed it should be going. You might consider starting with a small subsample of the (big) training set.</li>
</ul>
<p>Each of these projects shares the characteristic that there is an interesting question - but the data may or may not be available. If it is available it may or may not have to be processed/cleaned/organized. Moreover, with the data in hand you may need to think about how it was collected or go out and collect some more data. This kind of problem is inspired by this quote from Dan’s talk - he was talking about math but it could easily have been data science:</p>
<blockquote>
<p>Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and have to filter it out, or you didn’t have sufficient information and have to go
find some?</p>
</blockquote>
<p>I realize though that this is advanced data science. So I was thinking about the levels of data science courses and how you would build up a curriculum. I came up with the following courses/levels and would be interested in what others thought.</p>
<ul>
<li><strong>Level 0: Background</strong>: Basic computing, some calculus with a focus on optimization, basic linear algebra.</li>
<li><strong>Level 1: Data science thinking</strong>: How to define a question, how to turn a question into a statement about data, how to identify data sets that may be applicable, experimental design, critical thinking about data sets.</li>
<li><strong>Level 2: Data science communication</strong>: Teaching students how to write about data science, how to express models qualitatively and in mathematical notation, explaining how to interpret results of algorithms/models. Explaining how to make figures.</li>
<li><strong>Level 3: Data science tools</strong>: Learning the basic tools of R, loading data of various types, reading data, plotting data.</li>
<li><strong>Level 4: Real data</strong>: Manipulating different file formats, working with “messy” data, trying to organize multiple data sets into one data set.</li>
<li><strong>Level 5: Worked examples</strong>: Use real data examples, but work them through from start to finish as case studies, don’t make them easy clean data sets, but have a clear path from the beginning of the problem to the end.</li>
<li><strong>Level 6: Just the question</strong>: Give students a question where you have done a little research to know that it is possible to get at least some data, but aren’t 100% sure it is the right data or that the problem can be perfectly solved. Part of the learning process here is knowing how to define success or failure and when to keep going or when to quit.</li>
<li><strong>Level 7: The student is the scientist</strong>: Have the students come up with their own questions and answer them using data.</li>
</ul>
<p>I think that a lot of the thought right now in biostatistics has been on level 3 and 4 courses. These are courses where we have students work with real data sets and learn about tools. To be self-sufficient as a data scientist it is clear you need to be able to work with real world data.
But what Jake/Nathan are referring to is level 5 or level 6 - cases where you have a question but the data needs a ton of work and may not even be good enough without collecting new information. Jake and Nathan have perfectly identified the ability to translate murky questions into data answers as the most valuable data skill. If I had to predict the future of data courses I would see them trending in that direction.</p>
When do we need interpretability?
2017-03-08T00:00:00+00:00
http://simplystats.github.io/2017/03/08/when-do-we-need-interpretability
<p>I just saw a link to an <a href="https://arxiv.org/abs/1702.08608">interesting article</a> by Finale Doshi-Velez and Been Kim titled “Towards A Rigorous Science of Interpretable Machine Learning”. From the abstract:</p>
<blockquote>
<p>Unfortunately, there is little consensus on what interpretability in machine learning is and how to evaluate it for benchmarking. Current interpretability evaluation typically falls into two categories. The first evaluates interpretability in the context of an application: if the system is useful in either a practical application or a simplified version of it, then it must be somehow interpretable. The second evaluates interpretability via a quantifiable proxy: a researcher might first claim that some model class—e.g. sparse linear models, rule lists, gradient boosted trees—are interpretable and then present algorithms to optimize within that class.</p>
</blockquote>
<p>The paper raises a good point, which is that we don’t really have a definition of “interpretability”. We just know it when we see it. For the most part, there’s been some agreement that parametric models are “more interpretable” than other models, but that’s a relatively fuzzy statement.</p>
<p>I’ve long heard that complex machine learning models that lack any real interpretability are okay because there are many situations where we don’t care “how things work”. When Netflix is recommending my next movie based on my movie history and perhaps other data, the only thing that matters is that the recommendation is something I like. In other words, the <a href="http://simplystatistics.org/2017/01/23/ux-value/">user experience defines the value</a> to me. However, in other applications, such as when we’re assessing the relationship between air pollution and lung cancer, a more interpretable model may be required.</p>
<p>I think the dichotomization between these two kinds of scenarios will eventually go away for a few reasons:</p>
<ol>
<li>For some applications, lack of interpretability is fine…until it’s not. In other words, what happens when things go wrong? Interpretability can help us to decipher why things went wrong and how things can be <em>modified</em> to be fixed. In order to move the levers of a machine to fix it, we need to know exactly where the levers are. Yet another way to say this is that it’s possible to jump from one situation (interpretability not needed) to another situation (what the heck just happened?) very quickly.</li>
<li>I think interpretability will become the new <a href="http://simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important/">reproducible research</a>, transmogrified to the machine learning and AI world. In the scientific world, reproducibility took some time to catch on (and has not quite caught on completely), but it is not so controversial now and many people in many fields accept the notion that all studies should at least be reproducible (if <a href="http://www.pnas.org/content/112/6/1645.full">not necessarily correct</a>). There was a time when people differentiated between cases that needed reproducibility (big data, computational work), and cases where it wasn’t needed. But that differentiation is slowly going away. I believe interpretability in machine learning and statistical modeling will go the same way as reproducibility in science.</li>
</ol>
<p>Ultimately, I think it’s the success of machine learning that brings the requirement of interpretability on to the scene. Because machine learning has become ubiquitous, we as a society begin to develop expectations for what it is supposed to do. Thus, the <a href="http://simplystatistics.org/2017/01/23/ux-value/">value of the machine learning begins to be defined externally</a>. It will no longer be good enough to simply provide a great user experience.</p>
Model building with time series data
2017-03-07T00:00:00+00:00
http://simplystats.github.io/2017/03/07/time-series-model
<p>A nice post by Alex Smolyanskaya over at the <a href="http://multithreaded.stitchfix.com/blog/2017/02/28/whats-wrong-with-my-time-series/">Stitch Fix blog</a> about some of the unique challenges of model building in a time series context:</p>
<blockquote>
<p>Cross validation is the process of measuring a model’s predictive power by testing it on randomly selected data that was not used for training. However, autocorrelations in time series data mean that data points are not independent from each other across time, so holding out some data points from the training set doesn’t necessarily remove all their associated information. Further, time series models contain autoregressive components to deal with the autocorrelations. These models rely on having equally spaced data points; if we leave out random subsets of the data, the training and testing sets will have holes that destroy the autoregressive components.</p>
</blockquote>
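<p>One common way around the problem described in the quote is rolling-origin (forward-chaining) evaluation: always train on a contiguous block of the past and test on the block that immediately follows it. The sketch below is my own illustration of that idea, not code from the Stitch Fix post; the function name <code>rolling_origin_splits</code> and its parameters are hypothetical.</p>

```python
def rolling_origin_splits(n_obs, initial_train, horizon):
    """Yield (train_idx, test_idx) pairs of time-ordered indices.

    Each training set is the contiguous block 0..start-1 and each test set
    is the next `horizon` observations, so no random holes are punched into
    the series and the test data always lie in the future of the training data.
    """
    start = initial_train
    while start + horizon <= n_obs:
        yield list(range(0, start)), list(range(start, start + horizon))
        start += horizon


# Example: 10 equally spaced observations, first 4 used for the initial fit,
# then evaluate 2 steps ahead at a time.
splits = list(rolling_origin_splits(n_obs=10, initial_train=4, horizon=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]} -> test {test_idx}")
```

<p>Because the training window only ever grows forward in time, the equal spacing that autoregressive components rely on is preserved. The trade-off is that early folds are fit on much less data than later ones.</p>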
Reproducibility and replicability is a glossy science now so watch out for the hype
2017-03-02T00:00:00+00:00
http://simplystats.github.io/2017/03/02/rr-glossy
<p><a href="http://biorxiv.org/content/early/2016/07/29/066803">Reproducibility</a> is the ability to take the code and data from a previous publication, rerun the code and get the same results. <a href="http://biorxiv.org/content/early/2016/07/29/066803">Replicability</a> is the ability to rerun an experiment and get “consistent” results with the original study using new data. Results that are not reproducible are hard to verify and results that do not replicate in new studies are harder to trust. It is important that we aim for reproducibility and replicability in science.</p>
<p>Over the last few years there has been increasing concern about problems with reproducibility and replicability in science. There are a number of suggestions for why this might be:</p>
<ul>
<li>Papers published by scientists with lack of training in statistics and computation</li>
<li>Treating statistics as a second hand discipline that can be “tacked on” at the end of a science experiment</li>
<li>Financial incentives for companies and others to publish desirable results.</li>
<li>Academic incentives for scientists to publish desirable results so they can get their next grant.</li>
<li>Incentives for journals to publish surprising/eye catching/interesting results.</li>
<li>Over-hyped studies with limited statistical characteristics (small sample size, questionable study populations etc.)</li>
<li>TED-style sound bites of scientific results that are digested and repeated in the press despite limited scientific evidence.</li>
<li>Scientists who refuse to consider alternative explanations for their data</li>
</ul>
<p>Usually the targets of discussion about reproducibility and replicability are highly visible scientific studies. The targets are usually papers in what are considered “top journals” or the papers in journals like Science and Nature that seek to maximize visibility. Or, more recently, entire fields of science that are widely publicized - like psychology or cancer biology - are targeted for reproducibility and replicability studies.</p>
<p>These studies have pointed out serious issues with the statistics, study designs, code availability and methods descriptions in papers they have studied. These are fundamental issues that deserve attention and should be taught to all scientists. As more papers have come out pointing out potential issues, they have merged into what is being called “a crisis of reproducibility”, “a crisis of replicability”, “a crisis of confidence in science” or other equally strong statements.</p>
<p>As the interest around reproducibility and replicability has built to a fever pitch in the scientific community it has morphed into a glossy scientific field in its own right. All of the characteristics are in place:</p>
<ul>
<li>A big central “positive” narrative that all science is not replicable, reproducible, or correct.</li>
<li>Incentives to publish these types of results because they can appear in Nature/Science/other glossy journals. (<a href="http://www.pnas.org/content/112/6/1645.full">I’m not immune to this</a>)</li>
<li>Strong and aggressive responses to papers that provide alternative explanations or don’t fit the narrative.</li>
<li>Researchers whose careers depend on the narrative being true</li>
<li>TED-style talks and sound bites (“most published research is false”, “most papers don’t replicate”)</li>
<li>Press hype, including for papers with statistical weaknesses (small sample sizes, weaker study designs)</li>
</ul>
<p>Reproducibility and replicability have “arrived” and become a field in their own right. That has both positives and negatives. On the positive side it means critical statistical issues are now being talked about by a broader range of people. On the negative side, researchers now have to do the same sober evaluation of the claims in reproducibility and replicability papers that they do for any other scientific field. Papers on reproducibility and replicability must be judged with the same critical eye as we apply to any other scientific study. That way we can sift through the hype and move science forward.</p>
Learning about Machine Learning with an Earthquake Example
2017-02-23T00:00:00+00:00
http://simplystats.github.io/2017/02/23/ml-earthquakes
<p><em>Editor’s note: This is the fourth chapter of a book I’m working on called <a href="https://leanpub.com/demystifyai/">Demystifying Artificial Intelligence</a>. I’ve also added a co-author, <a href="https://twitter.com/data_divya">Divya Narayanan</a>, a master’s student here at Johns Hopkins! The goal of the book is to demystify what modern AI is and does for a general audience, so it is meant to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. We are developing the book over time - so if you buy the book on Leanpub know that there are only four chapters in there so far, but I’ll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this <a href="https://twitter.com/notajf/status/795717253505413122">amazing tweet</a> by Twitter user <a href="https://twitter.com/notajf/">@notajf</a>. Feedback is welcome and encouraged!</em></p>
<blockquote>
<p>“A learning machine is any device whose actions are influenced by past experience.” - Nils John Nilsson</p>
</blockquote>
<p>Machine learning describes exactly what you would think: a machine that learns. As we described in the previous chapter a machine “learns” just like humans from previous examples. With certain experiences that give them an understanding about a particular concept, machines can be trained to have similar experiences as well, or at least mimic them. With very routine tasks, our brains become attuned to characteristics that define different objects or activities.</p>
<p>Before we can dive into the algorithms - like neural networks - that are most commonly used for artificial intelligence, let’s consider a real example to understand how machine learning works in practice.</p>
<h2 id="the-big-one">The Big One</h2>
<p>Earthquakes occur when the surface of the Earth experiences a shake due to displacement of the ground, and can readily occur along fault lines where there have already been massive displacements of rock or ground (Wikipedia 2017a). For people living in places like California where earthquakes occur relatively frequently, preparedness and safety are major concerns. One famous fault in southern California, called the San Andreas Fault, is expected to produce the next big earthquake in the foreseeable future, often referred to as the “Big One”. Naturally, some residents are concerned and may like to know more so they are better prepared.</p>
<p>The following data are pulled from <strong>fivethirtyeight</strong>, a political and sports blogging site, and describe how worried people are about the “Big One” (Hickey 2015). Here’s an example of the first few observations in this dataset:</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: left">worry_general</th>
<th style="text-align: left">worry_bigone</th>
<th style="text-align: left">will_occur</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">1004</td>
<td style="text-align: left">Somewhat worried</td>
<td style="text-align: left">Somewhat worried</td>
<td style="text-align: left">TRUE</td>
</tr>
<tr>
<td style="text-align: left">1005</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1006</td>
<td style="text-align: left">Not so worried</td>
<td style="text-align: left">Not so worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1007</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1008</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1009</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1010</td>
<td style="text-align: left">Not so worried</td>
<td style="text-align: left">Somewhat worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1011</td>
<td style="text-align: left">Not so worried</td>
<td style="text-align: left">Extremely worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1012</td>
<td style="text-align: left">Not at all worried</td>
<td style="text-align: left">Not so worried</td>
<td style="text-align: left">FALSE</td>
</tr>
<tr>
<td style="text-align: left">1013</td>
<td style="text-align: left">Somewhat worried</td>
<td style="text-align: left">Not so worried</td>
<td style="text-align: left">FALSE</td>
</tr>
</tbody>
</table>
<p>Just by looking at this subset of the data, we can already get a feel for how many different ways it could be structured. Here, we see that there are 10 observations which represent 10 individuals. For each individual, we have information on 11 different aspects of earthquake preparedness and experience (only 3 of which are shown here). Data can be stored as text, logical responses (true/false), or numbers. Sometimes, and quite often at that, values may be missing; observation 1013, for example, is missing some of the features not shown here.</p>
<p>So what can we do with this data? For example, we could predict - or classify - whether or not someone was likely to have taken any precautions for an upcoming earthquake, like bolting their shelves to the wall or coming up with an evacuation plan. Using this idea, we have now found a question that we’re interested in analyzing: are you prepared for an earthquake or not? And now, based on this question and the data that we have, we can see that you can either be prepared (recorded in the data as “true”) or not (recorded as “false”).</p>
<blockquote>
<p>Our question: How well can we predict whether or not someone is prepared for an earthquake?</p>
</blockquote>
<h2 id="an-algorithm--whats-that">An Algorithm – what’s that?</h2>
<p>With our question in tow, we want to design a way for our machine to determine if someone is prepared for an earthquake or not. To do this, the machine goes through a flowchart-like set of instructions. At each fork in the flowchart, there are different answers which take the machine on a different path to get to the final answer. If you go through the correct series of questions and answers, it can correctly identify a person as being prepared. Here’s a small portion of the final flowchart for the San Andreas data which we will proceed to dissect (note: the ellipses on the right-hand side of the flowchart indicate where the remainder of the algorithm lies. This will be expanded later in the chapter):</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Flowchart-partial.png" alt="" /></p>
<p>The steps that we take through the flowchart, or <strong>tree</strong>, make up the <strong>classification algorithm</strong>. An algorithm is essentially a set of step-by-step instructions that we follow to organize, or in other words, to make a prediction about our data. In this case, our goal is to classify an individual as prepared or not by working our way through the different branches of the tree. So how did we establish this particular set of questions to be in our framework of identifying a prepared individual?</p>
<p><strong>CART</strong>, or a classification and regression tree, is one way to assess which of these characteristics is the most important in terms of splitting up the data into prepared and unprepared individuals (Wikipedia 2017b; Breiman et al. 1984). There are multiple ways of implementing this method, often with the earlier branches making larger splits in the data, and later branches making smaller splits.</p>
<p>Within an algorithm, there exists another level of organization composed of <strong>features</strong> and <strong>parameters</strong>.</p>
<p>In order to tell if someone is prepared for an earthquake or not, there have to be certain characteristics, known as <strong>features</strong>, that separate those who are prepared from those who are not. Features are basically the things you measured in your dataset that are chosen to give you insight into an individual and how to best classify them into groups. Looking at our sample data, we can see that some of the possible features are: whether or not an individual is worried about earthquakes in general, prior experiences with earthquakes, or their gender. As we will soon see, certain features will carry more weight in separating an individual into the two groups (prepared vs. unprepared).</p>
<p>If we were looking at how important previously experiencing an earthquake was in classifying someone as prepared, we’d say it plays a pretty big role, since it’s one of the first features that we encounter in our flowchart. The feature that seems to make a bigger split in our data is region, as it appears as the first feature in our algorithm shown above. We would expect that people in the Mountain and Pacific regions have more experience and knowledge about earthquakes, as that part of the country is more prone to experiencing an earthquake. However, someone’s age may not be as important in classifying a prepared individual. Since it doesn’t even show up in the top of our flowchart, it means it wasn’t as crucial to know this information to decide if a person is prepared or not and it didn’t separate the data much.</p>
<p>The second form of organization within an algorithm is the set of questions and cutoffs for moving one direction or another at each node. These are known as <strong>parameters</strong> of our algorithm. These parameters give us insight as to how the features we have established define the observation we are trying to identify. Let us consider an example using the feature of region. As we mentioned earlier, we would expect that those in the Mountain and Pacific regions would have more experience with earthquakes, which may reflect in their level of preparedness. Looking back at our abbreviated classification tree, the first node in our tree has a parameter of “Mountain or Pacific” for the feature region, which can be split into “yes” (those living in one of these regions) or “no” (living elsewhere in the US).</p>
<p>If we were looking at a continuous variable, say number of years living in a region, we may set a threshold instead, say greater than 5 years, as opposed to a yes/no distinction. In supervised learning, where we are teaching the machine to identify a prepared individual, we provide the machine multiple observations of prepared individuals and include different parameter values of features to show how a person could be prepared. To illustrate this point, let us consider the 10 sample observations below, and specifically examine the outcome, preparedness, with respect to the features: will_occur, female, and household income.</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: left">prepared</th>
<th style="text-align: left">will_occur</th>
<th style="text-align: left">female</th>
<th style="text-align: left">hhold_income</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">1004</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">$50,000 to $74,999</td>
</tr>
<tr>
<td style="text-align: left">1005</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">$10,000 to $24,999</td>
</tr>
<tr>
<td style="text-align: left">1006</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">$200,000 and up</td>
</tr>
<tr>
<td style="text-align: left">1007</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">$75,000 to $99,999</td>
</tr>
<tr>
<td style="text-align: left">1008</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">Prefer not to answer</td>
</tr>
<tr>
<td style="text-align: left">1009</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">Prefer not to answer</td>
</tr>
<tr>
<td style="text-align: left">1010</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">$50,000 to $74,999</td>
</tr>
<tr>
<td style="text-align: left">1011</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">Prefer not to answer</td>
</tr>
<tr>
<td style="text-align: left">1012</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">$50,000 to $74,999</td>
</tr>
<tr>
<td style="text-align: left">1013</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">NA</td>
<td style="text-align: left">NA</td>
</tr>
</tbody>
</table>
<p>Of these ten observations, 7 are not prepared for the next earthquake and 3 are. But to make this information more useful, we can look at some of the features to see if there are any similarities that the machine can use as a classifier. For example, of the 3 individuals that are prepared, two are female and only one is male. But notice we get the same distribution of males and females by looking at those who are not prepared: you have 4 females and 2 males (the seventh, observation 1013, has no recorded gender), the same 2:1 ratio. From such a small sample, the algorithm may not be able to tell how important gender is in classifying preparedness. But, by looking through the remaining features and a larger sample, it can start to classify individuals. This is what we mean when we say a machine learning algorithm <strong>learns</strong>.</p>
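<p>As a rough illustration of why gender does not help here, the sketch below cross-tabulates the ten sample rows and measures how much a split on <code>female</code> would purify the two groups. It uses Gini impurity, a criterion commonly used by CART; whether the tree in this chapter was fit with that exact criterion is an assumption, and the helper name <code>gini</code> is ours.</p>

```python
from collections import Counter

# (id, prepared, female) copied from the ten sample rows above.
# None marks the missing gender for observation 1013.
rows = [
    (1004, True, False), (1005, False, True), (1006, True, True),
    (1007, False, False), (1008, False, True), (1009, False, False),
    (1010, True, True), (1011, False, True), (1012, False, True),
    (1013, False, None),
]

def gini(labels):
    """Gini impurity of a list of True/False labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Drop the observation with missing gender, then cross-tabulate.
known = [(prepared, female) for _, prepared, female in rows if female is not None]
table = Counter(known)  # keys are (prepared, female) pairs

women = [prepared for prepared, female in known if female]
men = [prepared for prepared, female in known if not female]
everyone = [prepared for prepared, _ in known]

# Impurity before the split minus the size-weighted impurity after it.
weighted_child = (len(women) * gini(women) + len(men) * gini(men)) / len(known)
impurity_decrease = gini(everyone) - weighted_child

print(table)
print(round(impurity_decrease, 6))
```

<p>Because both groups have the same 2:1 ratio of females to males, the split leaves the impurity essentially unchanged, which is the numerical version of saying gender carries no signal in this tiny sample.</p>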
<p>Now, let us take a closer look at observations 1005, 1011, and 1012, and more specifically the household income feature:</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: left">prepared</th>
<th style="text-align: left">will_occur</th>
<th style="text-align: left">female</th>
<th style="text-align: left">hhold_income</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">1005</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">$10,000 to $24,999</td>
</tr>
<tr>
<td style="text-align: left">1011</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">Prefer not to answer</td>
</tr>
<tr>
<td style="text-align: left">1012</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">FALSE</td>
<td style="text-align: left">TRUE</td>
<td style="text-align: left">$50,000 to $74,999</td>
</tr>
</tbody>
</table>
<p>All three of these observations are females and believe that the “Big One” won’t occur in their lifetime. But despite the fact that they are all unprepared, they each report a different household income. Based on just these three observations, we may conclude that household income is not as important in determining preparedness. By showing a machine different examples of which features a prepared individual has (or unprepared, as in this case), it can start to recognize patterns and identify the features, or combination of features, and parameters that are most indicative of preparedness.</p>
<p>In summary, every flowchart will have the following components:</p>
<ol>
<li>
<p><strong>The algorithm</strong> - The general workflow or logic that dictates the path the machine travels, based on chosen features and parameter values. In turn, the machine classifies or predicts which group an observation belongs to</p>
</li>
<li><strong>Features</strong> - The variables or types of information we have about each observation</li>
<li><strong>Parameters</strong> - The questions and cutoff values at each node that decide which branch an observation follows, based on the value of a particular feature</li>
</ol>
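<p>To make the three components concrete, here is a minimal sketch of a flowchart written as code. Only the first node (region is “Mountain or Pacific”) comes from the tree described above; the second and third nodes, the dictionary keys, and the outcomes at the leaves are hypothetical and chosen purely for illustration.</p>

```python
def classify(person):
    """Walk one observation through a tiny, made-up flowchart.

    Returns True if the flowchart predicts "prepared", False otherwise.
    """
    # Node 1: feature = region, parameter = "Mountain or Pacific?"
    if person["region"] in {"Mountain", "Pacific"}:
        # Node 2 (hypothetical): feature = prior earthquake experience, parameter = yes/no
        if person["experienced_quake"]:
            return True
        # Node 3 (hypothetical): feature = years in region, parameter = cutoff of 5 years
        return person["years_in_region"] > 5
    # Everyone outside those regions is predicted unprepared in this toy tree.
    return False


newcomer = {"region": "Pacific", "experienced_quake": False, "years_in_region": 2}
veteran = {"region": "Mountain", "experienced_quake": True, "years_in_region": 1}
print(classify(newcomer), classify(veteran))
```

<p>The function as a whole plays the role of the algorithm, the dictionary keys are the features, and the set of regions and the 5-year cutoff are the parameters.</p>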
<p>Even with the experience of seeing numerous observations with various feature values, there is no way to show our machine information on every single person that exists in the world. What will it do when it sees a brand new observation that is not identified as prepared or unprepared? Is there a way to improve how well our algorithm performs?</p>
<h2 id="training-and-testing-data">Training and Testing Data</h2>
<p>You may have heard of the terms <em>sample</em> and <em>population</em>. In case these terms are unfamiliar, think of the population as the entire group of people we want to get information from, study, and describe. This would be like getting a piece of information, say income, from every single person in the world. Wouldn’t that be a fun exercise!</p>
<p>If we had the resources to do this, we could then take all those incomes and find out the average income of an individual in the world. But since this is not possible, it might be easier to get that information from a smaller number of people, or <em>sample</em>, and use the average income of that smaller pool of people to represent the average income of the world’s population. We could only say that the average income of the sample is <em>representative</em> of the population if the sample of people that we picked have the same characteristics as the population.</p>
<p>For example, suppose our population of interest was a company with 1,000 employees, where 200 of those employees make $60,000 each and 800 of them make $30,000 each. Since we have this information on everyone, we could easily calculate the average income of an employee in the company, which would be $36,000. Now, say we randomly picked a group of 100 individuals from the company as our sample. If all of those 100 individuals came from the group of employees that made $60,000, we might think that the average income for an employee was actually much higher than the true average of the population (the whole company). The opposite would be true if all 100 of those employees came from the group making less money - we would mistakenly think the average income of employees is lower. In order to accurately reflect the distribution of income of the company employees through our sample, or rather to have a <em>representative</em> sample, we would try to pick 20 individuals from the higher income group and 80 individuals from the lower income group to get an accurate representation of this company population.</p>
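<p>The arithmetic in the company example can be checked with a few lines of code. This is only a sketch of the numbers in the paragraph above; the variable names are ours.</p>

```python
from statistics import mean

# The whole company: 200 employees at $60,000 and 800 at $30,000.
high_earners = [60_000] * 200
low_earners = [30_000] * 800
population = high_earners + low_earners

population_mean = mean(population)

# Two unrepresentative samples of 100 people, and one representative sample
# that keeps the company's 20% / 80% mix of the two income groups.
all_high_mean = mean(high_earners[:100])
all_low_mean = mean(low_earners[:100])
representative_mean = mean(high_earners[:20] + low_earners[:80])

print(population_mean, all_high_mean, all_low_mean, representative_mean)
```

<p>Only the sample that preserves the 20/80 mix recovers the company-wide average; the other two overshoot or undershoot it.</p>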
<p>Now heading back to our earthquake example, our big picture goal is to be able to feed our algorithm a brand new observation of someone who answered information about themselves and earthquake preparedness, and have the machine be able to correctly identify whether or not they are prepared for a future earthquake.</p>
<p>One definition of a population could consist of all individuals in the world. However, since we can’t get access to data on all these individuals, we decide to sample 1013 respondents and ask them earthquake related questions. Remember that in order for our machine to be able to accurately identify an individual as prepared or unprepared, it needs to have had some experience seeing some observations where features identify the individual as prepared, as well as some observations that aren’t. This seems a little counterintuitive though. If we want our algorithm to identify a prepared individual, why wouldn’t we show it all the observations that are prepared?</p>
<p>By showing our machine observations of respondents that are not prepared, it can better strengthen its idea of what features identify a prepared individual. But we also want to make our algorithm as <em>robust</em> as possible. For an algorithm to be robust, it should be able to take in a wide range of values for each feature, and appropriately go through the algorithm to make a classification. If we only show our machine a narrow set of experiences, say people who have an income of between $10,000 and $25,000, it will be harder for the algorithm to correctly classify an individual who has an income of $50,000.</p>
<p>One way we can give our machine this set of experiences is to take all 1013 observations and randomly split them up into two groups. Note: for simplification, any observations that had missing data (total: 7) for the outcome variable were removed from the original dataset, leaving 1006 observations for our analysis.</p>
<ol>
<li>
<p><strong>Training data</strong> - This serves as the wide range of experiences that we want our machine to see to have a better understanding of preparedness</p>
</li>
<li>
<p><strong>Testing data</strong> - This data will allow us to evaluate our algorithm and see how well it was able to pick up on features and parameter values that are specific to prepared individuals and correctly label them as such</p>
</li>
</ol>
<p>So what’s the point of splitting up our data into training and testing? We could have easily fed all the data that we have into the algorithm and have it detect the most important features and parameters we have based on the provided observations. But there’s an issue with that, known as <strong>overfitting</strong>. When an algorithm has overfit the data, it means that it has been fit specifically to the data at hand, and only that data. It would be harder to give our algorithm data that does not fit within the bounds of our training data (though it would perform very well in this sample set). Moreover, the algorithm would only accurately classify a very narrow set of observations. This example nicely illustrates the concept we introduced earlier - <em>robustness</em> - and demonstrates the importance of exposing our algorithm to a wide range of experiences. We want our algorithm to be useful, which means it needs to be able to take in all kinds of data with different distributions, and still be able to accurately classify them.</p>
<p>To create training and testing sets, we can adopt the following idea:</p>
<ol>
<li>Split the 1006 observations in half: roughly 500 for training, and the remainder for testing</li>
<li>Feed the 500 training observations through the algorithm for the machine to understand what features best classify individuals as prepared or unprepared</li>
<li>Once the machine is trained, feed the remaining test observations through the algorithm and see how well it classifies them</li>
</ol>
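<p>A random split like the one in step 1 could be sketched as follows. The observation IDs are placeholders (the real data set has its own row labels), and the seed value is an arbitrary choice that only makes the shuffle repeatable.</p>

```python
import random

# Placeholder IDs standing in for the 1006 observations kept for analysis.
observation_ids = list(range(1, 1007))

# A seeded generator so that the same split is produced on every run.
rng = random.Random(2017)
shuffled = observation_ids[:]
rng.shuffle(shuffled)

# First half for training, the remainder for testing.
half = len(shuffled) // 2
train_ids = shuffled[:half]
test_ids = shuffled[half:]

print(len(train_ids), len(test_ids))
```

<p>Shuffling before cutting the list in half is what makes the split random; taking the first 503 rows in their original order could put systematically different people into the two sets.</p>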
<h2 id="algorithm-accuracy">Algorithm Accuracy</h2>
<p>Now that we’ve built up our algorithm and split our data into training and test sets, let’s take a look at the full classification algorithm:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Flowchart-full.png" alt="" /></p>
<p>Recall the question we set out to answer with respect to the earthquake data: <strong>How well can we predict whether or not someone is prepared for an earthquake?</strong> In a binary (yes/no) case like this, we can set up our results in a 2x2 table, where the rows represent predicted preparedness (based on the features of our algorithm) and the columns represent true preparedness (what their true label is).</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2x2-table-results.png" alt="" /></p>
<p>This simple 2x2 table carries quite a bit of information. Essentially, we can grade our machine on how well it learned to tell whether individuals are prepared or unprepared. We can see how well our algorithm did at classifying new observations by calculating the <strong>predictive accuracy</strong>, done by summing cells A and D (the cells where the predicted and true status agree) and dividing by the total number of observations, or more simply, (A + D) / N. Through this calculation, we see that the algorithm from our example correctly classified individuals as prepared or unprepared 77.9% of the time. Not bad!</p>
<p>When we feed the roughly 500 test observations through the algorithm, it is the first time the machine has seen those observations. As a result, there is a chance that despite going through the algorithm, the machine <strong>misclassified</strong> someone as prepared, when in fact they were unprepared. To determine how often this happens, we can calculate the <strong>test error rate</strong> from the 2x2 table from above. To calculate the test error rate, we take the total number of observations that are <em>discordant</em>, or dissimilar between true and predicted status, and divide this total by the total number of observations that were assessed. Based on the above table, the test error rate would be (B + C) / N, or 22.1%.</p>
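<p>Both quantities can be computed directly from the four cells of a 2x2 table. The sketch below assumes the usual layout in which A and D are the cells where prediction and truth agree and B and C are the discordant cells. The counts are made up for illustration and are not the results from the earthquake data.</p>

```python
def accuracy_and_error(a, b, c, d):
    """Return (predictive accuracy, test error rate) for a 2x2 table.

    a, d: observations where the predicted and true status agree
    b, c: observations where they disagree (discordant)
    """
    n = a + b + c + d
    accuracy = (a + d) / n
    error_rate = (b + c) / n
    return accuracy, error_rate


# Hypothetical counts for 100 test observations.
accuracy, error_rate = accuracy_and_error(a=30, b=10, c=12, d=48)
print(accuracy, error_rate)
```

<p>Since every observation is either concordant or discordant, the two numbers always add up to 1, just as 77.9% and 22.1% do in the example above.</p>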
<p>There are a number of reasons that a test error rate would be high. Depending on the data set, there might be different methods that are better for developing the algorithm. Additionally, despite randomly splitting our data into training and testing sets, there may be some inherent differences between the two (think back to the employee income example above), making it harder for the algorithm to correctly label an observation.</p>
<h2 id="references">References</h2>
<p>Breiman, Leo, Jerome H Friedman, Richard A Olshen, and Charles J Stone. 1984. <em>Classification and Regression Trees</em>. Monterey, CA: Wadsworth & Brooks.</p>
<p>Hickey, Walt. 2015. “The Rock Isn’t Alone: Lots of People Are Worried About ‘the Big One’.” <em>FiveThirtyEight</em>. FiveThirtyEight. <a href="https://fivethirtyeight.com/datalab/the-rock-isnt-alone-lots-of-people-are-worried-about-the-big-one/">https://fivethirtyeight.com/datalab/the-rock-isnt-alone-lots-of-people-are-worried-about-the-big-one/</a>.</p>
<p>Wikipedia. 2017a. “Earthquake — Wikipedia, the Free Encyclopedia.” <a href="http://en.wikipedia.org/w/index.php?title=Earthquake&oldid=762614740">http://en.wikipedia.org/w/index.php?title=Earthquake&oldid=762614740</a>.</p>
<p>———. 2017b. “Predictive analytics — Wikipedia, the Free Encyclopedia.” <a href="http://en.wikipedia.org/w/index.php?title=Predictive%20analytics&oldid=764577274">http://en.wikipedia.org/w/index.php?title=Predictive%20analytics&oldid=764577274</a>.</p>
My Podcasting Setup
2017-02-20T00:00:00+00:00
http://simplystats.github.io/2017/02/20/podcasting-setup
<p>I’ve gotten a number of inquiries over the last 2 years about my podcasting setup and I’ve been meaning to write about it but….</p>
<p>But here it is! I wanted to write this because I felt like there wasn’t a ton of good information on the Internet aimed at more “casual” podcasters rather than at people who want to do it professionally. So here’s what I’ve got.</p>
<p>There are two types of podcasts roughly: The kind you record with everyone in the same room and the kind you record where everyone is in different rooms. They both require slightly different setups so I’ll talk about both. For me, Elizabeth Matsui and I record <a href="http://effortreport.libsyn.com">The Effort Report</a> locally because we’re both at Johns Hopkins. But Hilary Parker and I record <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> remotely because she’s on the other side of the country most of the time.</p>
<h2 id="recording-equipment">Recording Equipment</h2>
<p>When Hilary and I first started we just used the microphone attached to the headphones you get with your iPhone or whatever. That’s okay but the sound feels very “narrow” to me. That said, it’s a good way to get started and it likely costs you nothing.</p>
<p>The next level up for many people is the <a href="https://www.amazon.com/Blue-Yeti-USB-Microphone-Silver/dp/B002VA464S/">Blue Yeti USB Microphone</a> which is a perfectly fine microphone and not too expensive. Also, it uses USB (as opposed to more professional XLR) so it connects to any computer, which is nice. However, it typically retails for $120, which isn’t nothing, and there are probably cheaper microphones that are just as good. For example, Jason Snell recommends the <a href="https://www.amazon.com/Audio-Technica-ATR2100-USB-Cardioid-Dynamic-Microphone/dp/B004QJOZS4/ref=as_li_ss_tl?ie=UTF8&qid=1479488629&sr=8-2&keywords=audio-technica+atr&linkCode=sl1&tag=incomparablepod-20&linkId=0919132824ac2090de45f2b1135b0163">Audio Technica ATR2100</a> which is only about $70.</p>
<p>If you’re willing to shell out a little more money, I’d highly recommend the <a href="https://www.zoom-na.com/products/field-video-recording/field-recording/zoom-h4n-handy-recorder">Zoom H4n</a> portable recorder. This is actually two things: a microphone <em>and</em> a recorder. It has a nice stereo microphone built into the top along with two XLR inputs on the bottom that allow you to record from external mics. It records to SD cards so it’s great for a portable setup where you don’t want to carry a computer around with you. It retails for about $200 so it’s <em>not</em> cheap, but in my opinion it is worth every penny. I’ve been using my H4n for years now.</p>
<p>Because we do a lot of recording for our online courses here, we’ve actually got a bit more equipment in the office. So for in-person podcasts I sometimes record using a <a href="https://en-us.sennheiser.com/short-gun-tube-microphone-camera-films-mkh-416-p48u3">Sennheiser MKH416-P48U3</a> attached to an <a href="https://www.amazon.com/gp/product/B00D4AGIBS/">Auray MS-5230T microphone stand</a> which is decidedly not cheap but is a great piece of hardware.</p>
<p>By the way, a microphone stand is great to have, if you can get one, so you don’t have to set the microphone on your desk or table. That way if you bump the table by accident or generally like to bang the table, it won’t get picked up on the microphone. It’s not something to get right away, but maybe later when you make the big time.</p>
<h2 id="recording-software">Recording Software</h2>
<p>If you’re recording by yourself, you can just hook up your microphone to your computer and record to any old software that records sound (on the Mac you can use Quicktime). If you have multiple people, you can either</p>
<ol>
<li>Speak into the same mic and have both your voices recorded on the same audio file</li>
<li>Use separate mics (and separate computers) and record separately on to separate audio files. This requires syncing the audio files in an editor, but that’s not too big a deal if you only have 2-3 people.</li>
</ol>
<p>For local podcasts, I actually just use the H4n and record directly to the SD card. This creates separate WAV files for each microphone that are already synced so you can just plop them in the editor.</p>
<p>For remote podcasts, you’ll need some communication software. Hilary and I use <a href="https://zencastr.com">Zencastr</a> which has its own VoIP system that allows you to talk to anyone by just sending your guests a link. So I create a session in Zencastr, send Hilary the link for the session, she logs in (without needing any credentials) and we just start talking. The web site records the audio directly off of your microphone and then uploads the audio files (one for each guest) to Dropbox. The service is really nice and there are now a few just like it. Zencastr costs $20 a month right now but there is a limited free tier.</p>
<p>The other approach is to use something like Skype and then use something like <a href="http://www.ecamm.com/mac/callrecorder/">ecamm call-recorder</a> to record the conversation. The downside with this approach is that if you have any network trouble that messes up the audio, then you will also record that. However, Zencastr (and related services) do not work on iOS devices and other devices that use WebKit based browsers. So if you have someone calling in on a mobile device via Skype or something, then you’ll have to use this approach. Otherwise, I prefer the Zencastr approach and can’t really see any downside except for the cost.</p>
<h2 id="editing-software">Editing Software</h2>
<p>There isn’t a lot of software that’s specifically designed for editing podcasts. I actually started off editing podcasts in Final Cut Pro X (nonlinear video editor) because that’s what I was familiar with. But now I use <a href="http://www.apple.com/logic-pro/">Logic Pro X</a>, which is not really designed for podcasts, but it’s a real digital audio workstation and has nice features (like <a href="https://support.apple.com/kb/PH13055?locale=en_US">strip silence</a>). But I think something like <a href="http://www.audacityteam.org">Audacity</a> would be fine for basic editing.</p>
<p>The main thing I need to do with editing is merge the different audio tracks together and cut off any extraneous material at the beginning or the end. I don’t usually do a lot of editing in the middle unless there’s a major mishap like a siren goes by or a cat jumps on the computer. Once the editing is done I bounce to an AAC or MP3 file for uploading.</p>
<h2 id="hosting">Hosting</h2>
<p>You’ll need a service for hosting your audio files if you don’t have your own server. You can technically host your audio files anywhere, but specific services have niceties like auto-generating the RSS feed. For Not So Standard Deviations I use <a href="https://soundcloud.com/stream">SoundCloud</a> and for The Effort Report I use <a href="https://www.libsyn.com">Libsyn</a>.</p>
<p>Of the two services, I think I prefer Libsyn, because it’s specifically designed for podcasting and has somewhat better analytics. The web site feels a little bit like it was designed in 2003, but otherwise it works great. Libsyn also has features for things like advertising and subscriptions, but I don’t use any of those. SoundCloud is fine but wasn’t really designed for podcasting and sometimes feels a little unnatural.</p>
<h2 id="summary">Summary</h2>
<p>If you’re interested in getting started in podcasting, here’s my bottom line:</p>
<ol>
<li>Get a partner. It’s more fun that way!</li>
<li>If you and your partner are remote, use Zencastr or something similar.</li>
<li>Splurge for the Zoom H4n if you can, otherwise get a reasonable cheap microphone like the Audio Technica or the Yeti.</li>
<li>Don’t focus too much on editing. Just clip off the beginning and the end.</li>
<li>Host on Libsyn.</li>
</ol>
Data Scientists Clashing at Hedge Funds
2017-02-15T00:00:00+00:00
http://simplystats.github.io/2017/02/15/Data-Scientists-Clashing-at-Hedge-Funds
<p>There’s an interesting article over at Bloomberg about how <a href="https://www.bloomberg.com/news/articles/2017-02-15/point72-shows-how-firms-face-culture-clash-on-road-to-quantland">data scientists have struggled at some hedge funds</a>:</p>
<blockquote>
<p>The firms have been loading up on data scientists and coders to deliver on the promise of quantitative investing and lift their ho-hum returns. But they are discovering that the marriage of old-school managers and data-driven quants can be rocky. Managers who have relied on gut calls resist ceding control to scientists and their trading signals. And quants, emboldened by the success of computer-driven funds like Renaissance Technologies, bristle at their second-class status and vie for a bigger voice in investing.</p>
</blockquote>
<p>There are some interesting tidbits in the article that I think hold lessons for any collaboration between a data scientist or analyst and a non-data scientist (for lack of a better word).</p>
<p>At Point72, the family office successor to SAC Capital, problems at the quant unit (known as Aperio):</p>
<blockquote>
<p>The divide between Aperio quants and fundamental money managers was also intellectual. They struggled to communicate about the basics, like how big data could inform investment decisions. [Michael] Recce’s team, which was stacked with data scientists and coders, developed trading signals but didn’t always fully explain the margin of error in the analysis to make them useful to fund managers, the people said.</p>
</blockquote>
<p>It’s hard to know the details of what actually happened, but for data scientists collaborating with others, there always needs to be an explanation of “what’s going on”. There’s a general feeling that it’s okay that machine learning techniques build complicated uninterpretable models because they work better. But in my experience that’s not enough. People want to know why they work better, when they work better, and when they <em>don’t</em> work.</p>
<p>On over-theorizing:</p>
<blockquote>
<p>Haynes, who joined Stamford, Connecticut-based Point72 in early 2014 after about two decades at McKinsey & Co., and other senior managers grew dissatisfied with Aperio’s progress and impact on returns, the people said. When the group obtained new data sets, it spent too much time developing theories about how to process them rather than quickly producing actionable results.</p>
</blockquote>
<p>I don’t necessarily agree with this “criticism”, but I only put it here because the land of hedge funds isn’t generally viewed on the outside as a place where lots of theorizing goes on.</p>
<p>At BlueMountain, another hedge fund:</p>
<blockquote>
<p>When quants showed their risk analysis and trading signals to fundamental managers, they sometimes were rejected as nothing new, the people said. Quants at times wondered if managers simply didn’t want to give them credit for their ideas.</p>
</blockquote>
<p>I’ve seen this quite a bit. When a data scientist presents results to collaborators, there are often two responses:</p>
<ol>
<li>“I knew that already” and so you haven’t taught me anything new</li>
<li>“I didn’t know that already” and so you must be wrong</li>
</ol>
<p>The common link here, of course, is the inability to admit that there are things you don’t know. Whether this is an inherent character flaw or something that can be overcome through teaching is not yet clear to me. But it is common when data is brought to bear on a problem that previously lacked data. One of the key tasks that a data scientist in any industry must prepare for is the task of giving people information that will make them uncomfortable.</p>
Not So Standard Deviations Episode 32 - You Have to Reinvent the Wheel a Few Times
2017-02-13T00:00:00+00:00
http://simplystats.github.io/2017/02/13/nssd-episode-32
<p>Hilary and I discuss training in PhD programs, estimating the variance vs. the standard deviation, the bias variance tradeoff, and explainable machine learning.</p>
<p>We’re also introducing a new level of support on our Patreon page, where you can get access to some of the outtakes from our episodes. Check out our <a href="https://www.patreon.com/NSSDeviations">Patreon page</a> for details.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://www.darpa.mil/program/explainable-artificial-intelligence">Explainable AI</a></p>
</li>
<li>
<p><a href="http://multithreaded.stitchfix.com/blog/2016/11/22/nba-rankings/">Stitch Fix Blog NBA Rankings</a></p>
</li>
<li>
<p><a href="http://varianceexplained.org/r/empirical-bayes-book/">David Robinson’s Empirical Bayes book</a></p>
</li>
<li>
<p><a href="https://warontherocks.com/2017/01/introducing-bombshell-the-explosive-first-episode/">War on the Rocks podcast</a></p>
</li>
<li>
<p><a href="https://twitter.com/rdpeng">Roger on Twitter</a></p>
</li>
<li>
<p><a href="https://twitter.com/hspter">Hilary on Twitter</a></p>
</li>
<li>
<p><a href="https://leanpub.com/conversationsondatascience/">Get the Not So Standard Deviations book</a></p>
</li>
<li>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a></p>
</li>
<li>
<p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a></p>
</li>
<li>
<p><a href="https://soundcloud.com/nssd-podcast">Find past episodes</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-32-you-have-to-reinvent-the-wheel-a-few-times">Download the audio for this episode</a></p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/306883468&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Reproducible Research Needs Some Limiting Principles
2017-02-01T00:00:00+00:00
http://simplystats.github.io/2017/02/01/reproducible-research-limits
<p>Over the past 10 years thinking and writing about reproducible research, I’ve come to the conclusion that much of the discussion is incomplete. While I think we as a scientific community have come a long way in changing people’s thinking about data and code and making them available to others, there are some key sticking points that keep coming up that are preventing further progress in the area.</p>
<p>When I used to write about reproducibility, I felt that the primary challenge/roadblock was a lack of tooling. Much has changed in just the last five years though, and many new tools have been developed to make life a lot easier. Packages like knitr (for R), markdown, and iPython notebooks, have made writing reproducible data analysis documents a lot easier. Web sites like GitHub and many others have made distributing analyses a lot simpler because now everyone effectively has a free web site (this was NOT true in 2005).</p>
<p>Even still, our basic definition of reproducibility is incomplete. Most people would say that a data analysis is reproducible if the analytic data and metadata are available and the code that did the analysis is available. Furthermore, it would be preferable to have some documentation to go along with both. But there are some key issues that need to be resolved to complete this general definition.</p>
<h2 id="reproducible-for-whom">Reproducible for Whom?</h2>
<p>In discussions about reproducibility with others, the topic of <strong>who</strong> should be able to reproduce the analysis only occasionally comes up. There’s a general sense, especially amongst academics, that <strong>anyone</strong> should be able to reproduce any analysis if they wanted to.</p>
<p>There is an analogy with free software here in the sense that free software can be free for some people and not for others. This made more sense in the days before the Internet when distribution was much more costly. The idea here was that I could write software for a client and give them the source code for that software (as they would surely demand). The software is free for them but not for anyone else. But free software ultimately only matters when it comes to distribution. Once I distribute a piece of software, that’s when all the restrictions come into play. However, if I only distribute it to a few people, I only need to guarantee that those few people have those freedoms.</p>
<p>Richard Stallman once said that something like 90% of software was free software because almost all software being written was custom software for individual clients (I have no idea where he got this number). Even if the number is wrong, the point still stands that if I write software for a single person, it can be free for that person even if no one in the world has access to the software.</p>
<p>Of course, now with the Internet, everything pretty much gets distributed to everyone because there’s nothing stopping someone from taking a piece of free software and posting it on a web site. But the idea still holds: Free software only needs to be free for the people who receive it.</p>
<p>That said, the analogy is not perfect. Software and research are not the same thing. The key difference is that you can’t call something research unless it is generally available and disseminated. If Pfizer comes up with the cure for cancer and never tells anyone about it, it’s not research. If I discover that there’s a 9th planet and only tell my neighbor about it, it’s not research. Many companies might call those activities research (particularly from a tax/accounting point of view) but since society doesn’t get to learn about them, it’s not research.</p>
<p>If research is by definition disseminated to all, then it should therefore be reproducible by all. However, there are at least two circumstances in which we do not even pretend to believe this is possible.</p>
<ol>
<li><strong>Imbalance of resources</strong>: If I conduct a data analysis that requires the <a href="https://www.top500.org/lists/2016/06/">world’s largest supercomputer</a>, I can make all the code and data available that I want–few people will be able to actually reproduce it. That’s an extreme case, but even if I were to make use of a <a href="https://jhpce.jhu.edu">dramatically smaller computing cluster</a> it’s unlikely that anyone would be able to recreate those resources. So I can distribute something that’s reproducible in theory but not in reality by most people.</li>
<li><strong>Protected data</strong>: Numerous analyses in the biomedical sciences make use of protected health information that cannot easily be disseminated. Privacy is an important issue, in part, because in many cases it allows us to collect the data in the first place. However, most would agree we cannot simply post that data for all to see in the name of reproducibility. First, it is against the law, and second it would likely deter anyone from agreeing to participate in any study in the future.</li>
</ol>
<p>We can pretend that we can make data analyses reproducible for all, but in reality it’s not possible. So perhaps it would make sense for us to consider whether a limiting principle should be applied. The danger of not considering it is that one may take things to the extreme—if it can’t be made reproducible for all, then why bother trying? A partial solution is needed here.</p>
<h2 id="for-how-long">For How Long?</h2>
<p>Another question that needs to be resolved for reproducibility to be a widely implemented and sustainable phenomenon is for how long should something be reproducible? Ultimately, this is a question about time and resources because ensuring that data and code can be made available and can run on current platforms <em>in perpetuity</em> requires substantial time and money. In the academic community, where projects are often funded off of grants or contracts with finite lifespans, often the money is long gone even though the data and code must be maintained. The question then is who pays for the maintenance and the upkeep of the data and code?</p>
<p>I’ve never heard a satisfactory answer to this question. If the answer is that data analyses should be reproducible forever, then we need to consider a different funding model. This position would require a perpetual funds model, essentially an endowment, for each project that is disseminated and claims to be reproducible. The endowment would pay for things like servers for hosting the code and data and perhaps engineers to adapt and adjust the code as the surrounding environment changes. While there are a number of <a href="http://dataverse.org">repositories</a> that have developed scalable operating models, it’s not clear to me that the funding model is completely sustainable.</p>
<p>If we look at how scientific publications are sustained, we see that it’s largely private enterprise that shoulders the burden. Journals house most of the publications out there and they charge a fee for access (some for profit, some not for profit). Whether the reader pays or the author pays is not relevant, the point is that a decision has been made about <em>who</em> pays.</p>
<p>The author-pays model is interesting though. Here, an author pays a publication charge of ~$2,000, and the reader never pays anything for access (in perpetuity, presumably). The $2,000 payment by the author is like a one-time capital expense for maintaining that one publication forever (a mini-endowment, in a sense). It works for authors because grant/contract supported research often budgets for one-time publication charges. There’s no need for continued payments after a grant/contract has expired.</p>
<p>The publication system is quite a bit simpler because almost all publications are the same size and require the same resources for access—basically a web site that can serve up PDF files and people to maintain it. For data analyses, one could see things potentially getting out of control. For a large analysis with terabytes of data, what would the one-time up-front fee be to house the data and pay for anyone to access it for free forever?</p>
<p>Using Amazon’s <a href="http://calculator.s3.amazonaws.com/index.html">monthly cost estimator</a> we can get a rough sense of what the pure data storage might cost. Suppose we have a 10GB dataset that we want to store and we anticipate that it might be downloaded 10 times per month. This would cost about $7.65 per month, or $91.80 per year. If we assume Amazon raises their prices about 3% per year and a discount rate of 5%, the total cost for the storage is $4,590. If we tack on 20% for other costs, that brings us to $5,508. This is perhaps not unreasonable, and the scenario would certainly include most people. For comparison a 1 TB dataset downloaded once a year, using the same formula gives us a one-time cost of about $40,000. This is real money when it comes to fixed research budgets and would likely require some discussion of trade-offs.</p>
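<p>The one-time figures above appear to follow from the standard formula for a growing perpetuity: the yearly cost divided by the gap between the discount rate and the growth rate. Here is a minimal Python sketch of that arithmetic, assuming the cost is paid once a year starting a year from now. The function and variable names are mine and purely illustrative; nothing here comes from the Amazon calculator itself.</p>

```python
def one_time_cost(annual_cost, growth=0.03, discount=0.05, overhead=0.20):
    """Present value of a yearly cost that grows forever, plus overhead.

    Assumes yearly payments, the first one a year from now, and a
    discount rate that is larger than the growth rate.
    """
    if discount <= growth:
        raise ValueError("discount rate must exceed growth rate")
    storage = annual_cost / (discount - growth)  # growing perpetuity
    return storage, storage * (1 + overhead)


monthly = 7.65  # monthly estimate quoted in the text for the 10GB dataset
storage, total = one_time_cost(12 * monthly)
print(f"storage: ${storage:,.0f}, with 20% overhead: ${total:,.0f}")
```

<p>Note how sensitive the result is to the gap between the two rates: with only two percentage points between them, every dollar of yearly cost turns into fifty dollars up front.</p>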
<h2 id="summary">Summary</h2>
<p>Reproducibility is a necessity in science, but it’s high time that we start considering the practical implications of actually doing the job. There are still holdouts when it comes to the basic idea of reproducibility, but they are fewer and farther between. If we do not seriously consider the details of how to implement reproducibility, perhaps by introducing some limiting principles, we may never be able to achieve any sort of widespread adoption.</p>
Turning data into numbers
2017-01-31T00:00:00+00:00
http://simplystats.github.io/2017/01/31/data-into-numbers
<p><em>Editor’s note: This is the third chapter of a book I’m working on called <a href="https://leanpub.com/demystifyai/">Demystifying Artificial Intelligence</a>. The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I’m developing the book over time - so if you buy the book on Leanpub know that there are only three chapters in there so far, but I’ll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this <a href="https://twitter.com/notajf/status/795717253505413122">amazing tweet</a> by Twitter user <a href="https://twitter.com/notajf/">@notajf</a>. Feedback is welcome and encouraged!</em></p>
<blockquote>
<p>“It is a capital mistake to theorize before one has data.” Arthur Conan Doyle</p>
</blockquote>
<h2 id="data-data-everywhere">Data, data everywhere</h2>
<p>I already have some data about you. You are reading this book. Does that seem like data? It’s just something you did, that’s not data is it? But if I collect that piece of information about you, it actually tells me a surprising amount. It tells me you have access to an internet connection, since the only place to get the book is online. That in turn tells me something about your socioeconomic status and what part of the world you live in. It also tells me that you like to read, which suggests a certain level of education.</p>
<p>Whether you know it or not, everything you do produces data - from the websites you read to the rate at which your heart beats. Until pretty recently, most of the data you produced wasn’t collected, it floated off unmeasured. Data were painstakingly gathered by scientists one number at a time in small experiments with a few people. This laborious process meant that data were expensive and time-consuming to collect. Yet many of the most amazing scientific discoveries over the last two centuries were squeezed from just a few data points. But over the last two decades, the unit price of data has dramatically dropped. New technologies touching every aspect of our lives from our money, to our health, to our social interactions have made data collection cheap and easy.</p>
<p>To give you an idea of how steep the drop in the price of data has been, in 1967 Stanley Milgram did an experiment to determine the number of degrees of separation between two people in the U.S. (Travers and Milgram 1969). In his experiment he sent 296 letters to people in Omaha, Nebraska and Wichita, Kansas. The goal was to get the letters to a specific person in Boston, Massachusetts. The trick was people had to send the letters to someone they knew, and they then sent it to someone they knew and so on. At the end of the experiment, only 64 letters made it to the individual in Boston. On average, the letters had gone through 6 people to get there.</p>
<p>This is an idea that is so powerful it even became part of the popular consciousness. For example, it is the foundation of the internet meme “the 6-degrees of Kevin Bacon” (Wikipedia contributors 2016a) - the idea that if you take any actor and look at the people they have been in movies with, then the people those people have been in movies with, it will take you at most six steps to end up at the actor Kevin Bacon. This idea, despite its popularity, was originally studied by Milgram using only 64 data points. A 2007 study updated that number to “7 degrees of Kevin Bacon”. The study was based on 30 billion instant messaging conversations collected over the course of a month or two with the same amount of effort (Leskovec and Horvitz 2008).</p>
<p>Once data started getting cheaper to collect, it got cheaper fast. Take another example, the human genome. The genome is the unique DNA code in every one of your cells. It consists of a set of 3 billion letters that is unique to you. By many measures, the race to be the first group to collect all 3 billion letters from a single person kicked off the data revolution in biology. The project was completed in 2000 after a decade of work and $3 billion to collect the 3 billion letters in the first human genome (Venter et al. 2001). This project was actually a stunning success; most people thought it would be much more expensive. But just over a decade later, new technology means that we can now collect all 3 billion letters from a person’s genome for about $1,000 in about a week (“The Cost of Sequencing a Human Genome,” n.d.), and soon it may be less than $100 (Buhr 2017).</p>
<p>You may have heard that this is the era of “big data” from The Economist or The New York Times. It is really the era of cheap data collection and storage. Measurements we never bothered to collect before are now so easy to obtain that there is no reason not to collect them. Advances in computer technology also make it easier to store huge amounts of data digitally. This may not seem like a big deal, but it is much easier to calculate the average of a bunch of numbers stored electronically than it is to calculate that same average by hand on a piece of paper. Couple these advances with the free and open distribution of data over the internet and it is no surprise that we are awash in data. But tons of data on their own are meaningless. It is understanding and interpreting the data where the real advances start to happen.</p>
<p>This explosive growth in data collection is one of the key driving influences behind interest in artificial intelligence. When teaching computers to do something that only humans could do previously, it helps to have lots of examples. You can then use statistical and machine learning models to summarize that set of examples and help a computer make decisions about what to do. The more examples you have, the more flexible your computer model can be in making decisions, and the more “intelligent” the resulting application.</p>
<h2 id="what-is-data">What is data?</h2>
<h3 id="tidy-data">Tidy data</h3>
<p>“What is data”? Seems like a relatively simple question. In some ways this question is easy to answer. According to <a href="https://en.wikipedia.org/wiki/Data">Wikipedia</a>:</p>
<blockquote>
<p>Data (/ˈdeɪtə/ day-tə, /ˈdætə/ da-tə, or /ˈdɑːtə/ dah-tə)[1] is a set of values of qualitative or quantitative variables. An example of qualitative data would be an anthropologist’s handwritten notes about her interviews with people of an Indigenous tribe. Pieces of data are individual pieces of information. While the concept of data is commonly associated with scientific research, data is collected by a huge range of organizations and institutions, ranging from businesses (e.g., sales data, revenue, profits, stock price), governments (e.g., crime rates, unemployment rates, literacy rates) and non-governmental organizations (e.g., censuses of the number of homeless people by non-profit organizations).</p>
</blockquote>
<p>When you think about data, you probably think of orderly sets of numbers arranged in something like an Excel spreadsheet. In the world of data science and machine learning this type of data has a name - “tidy data” (Wickham and others 2014). Tidy data has the property that all measured quantities are represented by numbers or character strings (think words). The data are organized such that:</p>
<ol>
<li>Each variable you measured is in one column</li>
<li>Each different measurement of that variable is in a different row</li>
<li>There is one data table for each “type” of variable.</li>
<li>If there are multiple tables then they are linked by a common ID.</li>
</ol>
<p>This idea is borrowed from data management schemas that have long been used for storing data in databases. Here is an example of a tidy data set of swimming world records.</p>
<table>
<thead>
<tr>
<th style="text-align: right">year</th>
<th style="text-align: right">time</th>
<th style="text-align: left">sex</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1905</td>
<td style="text-align: right">65.8</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1908</td>
<td style="text-align: right">65.6</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1910</td>
<td style="text-align: right">62.8</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1912</td>
<td style="text-align: right">61.6</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1918</td>
<td style="text-align: right">61.4</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1920</td>
<td style="text-align: right">60.4</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1922</td>
<td style="text-align: right">58.6</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1924</td>
<td style="text-align: right">57.4</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1934</td>
<td style="text-align: right">56.8</td>
<td style="text-align: left">M</td>
</tr>
<tr>
<td style="text-align: right">1935</td>
<td style="text-align: right">56.6</td>
<td style="text-align: left">M</td>
</tr>
</tbody>
</table>
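<p>As a rough illustration of these rules, the first three rows of the table could be held in Python as a list of records: one record per measurement, with the same three variables in every record. This is only a sketch using plain Python structures, and the variable names are mine.</p>

```python
# Each dict is one row (one measurement); each key is one variable (column).
records = [
    {"year": 1905, "time": 65.8, "sex": "M"},
    {"year": 1908, "time": 65.6, "sex": "M"},
    {"year": 1910, "time": 62.8, "sex": "M"},
]

# Because every row has the same variables, pulling out a column is simple.
times = [row["time"] for row in records]

# So is asking a question of the rows, such as which record time is lowest.
fastest = min(records, key=lambda row: row["time"])

print(times)
print(fastest["year"])
```

<p>The point of the layout is that questions about a variable become operations on a column, and questions about a measurement become operations on a row.</p>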
<p>This type of data, neat, organized, and nicely numeric, is not the kind of data people are talking about when they say the “era of big data”. Data almost never start their lives in such a neat and organized format.</p>
<h3 id="raw-data">Raw data</h3>
<p>The explosion of interest in AI has been powered by a variety of types of data that you might not even think of when you think of “data”. The data might be pictures you take and upload to social media, the text of the posts on that same platform, or the sound captured from your voice when you speak to your phone.</p>
<p>Social media and cell phones aren’t the only area where data is being collected more frequently. Speed cameras on roads collect data on the movement of cars, electronic medical records store information about people’s health, wearable devices like Fitbit collect information on the activity of people. GPS information stores the location of people, cars, boats, airplanes, and an increasingly wide array of other objects.</p>
<p>Images, voice recordings, text files, and GPS coordinates are what experts call “raw data”. To create an artificial intelligence application you need to begin with a lot of raw data. But as we discussed in the simple AI example from the previous chapter - a computer doesn’t understand raw data in its natural form. It is not always immediately obvious how the raw data can be turned into numbers that a computer can understand. For example, when an artificial intelligence works with a picture the computer doesn’t “see” the picture file itself. It sees a set of numbers that represent that picture and operates on those numbers. The first step in almost every artificial intelligence application is to “pre-process” the data - to take the image files or the movie files or the text of a document and turn it into numbers that a computer can understand. Then those numbers can be fed into algorithms that can make predictions and ultimately be used to make an interface look intelligent.</p>
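<p>To make “a set of numbers that represent that picture” a little more concrete, here is a toy sketch in Python. It assumes each pixel is stored as three numbers between 0 and 255 for red, green, and blue, which is one common encoding but not the only one. The tiny 2-by-2 “image” is made up for illustration.</p>

```python
# A made-up 2x2 "image": each pixel is a (red, green, blue) triple, 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],      # top row: a red pixel, a green pixel
    [(0, 0, 255), (255, 255, 255)],  # bottom row: a blue pixel, a white pixel
]

# Flatten row by row into one list of numbers an algorithm can work with.
numbers = [channel for row in image for pixel in row for channel in pixel]

print(len(numbers))
print(numbers[:6])
```

<p>A real photograph works the same way, just with many thousands of pixels instead of four.</p>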
<h2 id="turning-raw-data-into-numbers">Turning raw data into numbers</h2>
<p>So how do we convert raw data into a form we can work with? It depends on what type of measurement or data you have collected. Here I will use two examples to explain how you can convert images and the text of a document into numbers that an algorithm can be applied to.</p>
<h3 id="images">Images</h3>
<p>Suppose that we were developing an AI to identify pictures of the author of this book. We would need to collect a picture of the author - maybe an embarrassing one.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff.jpg" alt="An embarrassing picture of the author" /></p>
<p>This picture is made of pixels. You can see that if you zoom in very close on the image. The image consists of many hundreds of little squares, each square just one color. Those squares are called pixels and they are one step closer to turning the image into numbers.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-smile.png" alt="A zoomed in view of the author's smile - you can see that each little square corresponds to one pixel and has an individual color" /></p>
<p>You can think of each pixel like a dot of color. Let’s zoom in a little bit more and instead of showing each pixel as a square show each one as a colored dot.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-smile-dots.png" alt="A zoomed in view of the author's smile - now each of the pixels are little dots one for each pixel." /></p>
<p>Imagine we are going to build an AI application on the basis of lots of images. Then we would like to turn a set of images into “tidy data”. As described above a tidy data set is defined as the following.</p>
<ol>
<li>Each variable you measured is in one column</li>
<li>Each different measurement of that variable is in a different row</li>
<li>There is one data table for each “type” of variable.</li>
<li>If there are multiple tables then they are linked by a common ID.</li>
</ol>
<p>A translation of tidy data for a collection of images would be the following.</p>
<ol>
<li><em>Variables</em>: Are the pixels measured in the images. So the top left pixel is a variable, the bottom left pixel is a variable, and so on. So each pixel should be in a separate column.</li>
<li><em>Measurements</em>: The measurements are the values for each pixel in each image. So each row corresponds to the values of the pixels for one image.</li>
<li><em>Tables</em>: There would be two tables - one with the data from the pixels and one with the labels of each image (if we know them).</li>
</ol>
<p>To start to turn the image into a row of the data set we need to stretch the dots into a single row. One way to do this is to snake along the image going from top left corner to bottom right corner and creating a single line of dots.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-smile-lines.png" alt="Follow the path of the arrows to see how you can turn the two dimensional picture into a one dimensional picture" /></p>
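<p>As a rough sketch of this stretching step, here is one way it could look in Python. The tiny grid of dots is made up for illustration, and reading each row left to right is only one assumed path through the picture - any fixed path works as long as it is the same for every image.</p>

```python
# A tiny made-up "image": 2 rows by 3 columns of dots, named by position.
image = [
    ["dot_r1c1", "dot_r1c2", "dot_r1c3"],
    ["dot_r2c1", "dot_r2c2", "dot_r2c3"],
]


def flatten_image(grid):
    """Read the grid row by row, left to right, into one long list."""
    row_of_dots = []
    for row in grid:
        row_of_dots.extend(row)
    return row_of_dots


flat = flatten_image(image)
print(flat)
```

<p>The two dimensional grid becomes a single line of six dots, starting at the top left corner and ending at the bottom right corner.</p>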
<p>This still isn’t quite data a computer can understand - a computer doesn’t know about dots. But we could take each dot and label it with a color name.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-color-names.png" alt="Labeling each color with a name" /></p>
<p>We could take each color name and give it a number, something like <code class="language-plaintext highlighter-rouge">rosybrown = 1</code>, <code class="language-plaintext highlighter-rouge">mistyrose = 2</code>, and so on. This approach runs into some trouble because we don’t have names for every possible color and because it is pretty inefficient to have a different number for every hue we could imagine.</p>
<p>An alternative strategy that is often used is to encode the intensity of the red, green, and blue colors for each pixel. This is sometimes called the RGB color model (Wikipedia contributors 2016b). So for example we can take these dots and show how much red, green, and blue they have in them.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-rgb.png" alt="Breaking each color down into the amount of red, green and blue" /></p>
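<p>To make the idea concrete, here is a small sketch in Python that replaces color names with red, green, and blue intensities on a 0 to 255 scale. The two triples are the standard web definitions of the color names <code class="language-plaintext highlighter-rouge">rosybrown</code> and <code class="language-plaintext highlighter-rouge">mistyrose</code>. The lookup table, the function name, and the short sequence of dots are made up for illustration.</p>

```python
# Red, green, and blue intensities (0 to 255) for two named colors.
# These triples are the standard web (CSS) definitions of the names.
NAMED_COLORS = {
    "rosybrown": (188, 143, 143),
    "mistyrose": (255, 228, 225),
}


def to_rgb(color_name):
    """Look up the (red, green, blue) triple for a color name."""
    return NAMED_COLORS[color_name]


# A made-up line of dots, described by color name.
dots = ["rosybrown", "mistyrose", "rosybrown"]
rgb_values = [to_rgb(name) for name in dots]
print(rgb_values)
```

<p>Each dot is now three numbers instead of a name, so any color a camera can record can be written down without needing a name for it.</p>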
<p>Looking at it this way we now have three measurements for each pixel. So we need to update our tidy data definition to be:</p>
<ol>
<li><em>Variables</em>: Are the three colors for each pixel measured in the images. So the top left pixel red value is a variable, the top left pixel green value is a variable and so on. So each pixel/color combination should be in a separate column.</li>
<li><em>Measurements</em>: The measurements are the values for each pixel in each image. So each row corresponds to the values of the pixels for one image.</li>
<li><em>Tables</em>: There would be two tables - one with the data from the pixels and one with the labels of each image (if we know them).</li>
</ol>
<p>So a tidy data set might look something like this for just the image of Jeff.</p>
<table>
<thead>
<tr>
<th>id</th>
<th>label</th>
<th>p1red</th>
<th>p1green</th>
<th>p1blue</th>
<th>p2red</th>
<th>…</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>“jeff”</td>
<td>238</td>
<td>180</td>
<td>180</td>
<td>205</td>
<td>…</td>
</tr>
</tbody>
</table>
<p>Each additional image would then be another row in the data set. As we will see in the chapters that follow we can then feed this data into an algorithm for performing an artificial intelligence task.</p>
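<p>A minimal sketch of building one such row in Python might look like the following. The function name is hypothetical. The first pixel and the red value of the second pixel come from the table above, while the green and blue values of the second pixel are made up because the table does not show them.</p>

```python
def image_to_row(image_id, label, pixels):
    """Turn a list of (red, green, blue) pixels into one tidy row."""
    row = {"id": image_id, "label": label}
    for i, (red, green, blue) in enumerate(pixels, start=1):
        row[f"p{i}red"] = red
        row[f"p{i}green"] = green
        row[f"p{i}blue"] = blue
    return row


# First pixel from the table; the second pixel's green and blue
# values (140, 140) are made up for illustration.
pixels = [(238, 180, 180), (205, 140, 140)]
row = image_to_row(1, "jeff", pixels)
print(row)
```

<p>Calling the same function on a second image would give a second row with the same column names, which is what lets us stack many images into one table.</p>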
<h2 id="notes">Notes</h2>
<p>Parts of this chapter appeared in the Simply Statistics blog post <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">“The vast majority of statistical analysis is not performed by statisticians”</a> written by the author of this book.</p>
<h2 id="references">References</h2>
<p>Buhr, Sarah. 2017. “Illumina Wants to Sequence Your Whole Genome for $100.” <a href="https://techcrunch.com/2017/01/10/illumina-wants-to-sequence-your-whole-genome-for-100/">https://techcrunch.com/2017/01/10/illumina-wants-to-sequence-your-whole-genome-for-100/</a>.</p>
<p>Leskovec, Jure, and Eric Horvitz. 2008. “Planetary-Scale Views on an Instant-Messaging Network,” March 6.</p>
<p>“The Cost of Sequencing a Human Genome.” n.d. <a href="https://www.genome.gov/sequencingcosts/">https://www.genome.gov/sequencingcosts/</a>.</p>
<p>Travers, Jeffrey, and Stanley Milgram. 1969. “An Experimental Study of the Small World Problem.” <em>Sociometry</em> 32 (4). [American Sociological Association, Sage Publications, Inc.]: 425–43.</p>
<p>Venter, J Craig, Mark D Adams, Eugene W Myers, Peter W Li, Richard J Mural, Granger G Sutton, Hamilton O Smith, et al. 2001. “The Sequence of the Human Genome.” <em>Science</em> 291 (5507). American Association for the Advancement of Science: 1304–51.</p>
<p>Wickham, Hadley, and others. 2014. “Tidy Data.” <em>Under Review</em>.</p>
<p>Wikipedia contributors. 2016a. “Six Degrees of Kevin Bacon.” <a href="https://en.wikipedia.org/w/index.php?title=Six_Degrees_of_Kevin_Bacon&oldid=748831516">https://en.wikipedia.org/w/index.php?title=Six_Degrees_of_Kevin_Bacon&oldid=748831516</a>.</p>
<p>———. 2016b. “RGB Color Model.” <a href="https://en.wikipedia.org/w/index.php?title=RGB_color_model&oldid=756764504">https://en.wikipedia.org/w/index.php?title=RGB_color_model&oldid=756764504</a>.</p>
New class - Data App Prototyping for Public Health and Beyond
2017-01-26T00:00:00+00:00
http://simplystats.github.io/2017/01/26/new-prototyping-class
<p>Are you interested in building data apps to help save the world, start the next big business, or just to see if you can? We are running a data app prototyping class for people interested in creating these apps.</p>
<p>This will be a special topics class at JHU and is open to any undergrad student, grad student, postdoc, or faculty member at the university. We are also seeing if we can make the class available to people outside of JHU so even if you aren’t at JHU but are interested you should let us know below.</p>
<p>One of the principles of our approach is that anyone can prototype an app. Our class starts with some tutorials on Shiny and R. While we have no formal pre-reqs for the class you will have much more fun if you have the background equivalent to our Coursera classes:</p>
<ul>
<li><a href="https://www.coursera.org/learn/data-scientists-tools">Data Scientist’s Toolbox</a></li>
<li><a href="https://www.coursera.org/learn/r-programming">R programming</a></li>
<li><a href="https://www.coursera.org/learn/r-packages">Building R packages</a></li>
<li><a href="https://www.coursera.org/learn/data-products">Developing Data Products</a></li>
</ul>
<p>If you don’t have that background you can take the classes online starting now to get up to speed! To see some examples of apps we will be building check out our <a href="http://jhudatascience.org/data_app_gallery.html">gallery</a>.</p>
<p>We will mostly be able to support development with R and Shiny but would be pumped to accept people with other kinds of development background - we just might not be able to give a lot of technical assistance.</p>
<p>As part of the course we are also working with JHU’s <a href="https://ventures.jhu.edu/fastforward/">Fast Forward</a> program to streamline and ease the process of starting a company around the app you build for the class. So if you have entrepreneurial ambitions, this is the class for you!</p>
<p>We are in the process of setting up the course times, locations, and enrollment cap. The class will run from March to May (exact dates TBD). To sign up for announcements about the class please fill out your information <a href="http://jhudatascience.org/prototyping_students.html">here</a>.</p>
User Experience and Value in Products - What Regression and Surrogate Variables can Teach Us
2017-01-23T00:00:00+00:00
http://simplystats.github.io/2017/01/23/ux-value
<p>Over the past year, there have been a number of recurring topics in my global news feed that have a shared theme to them. Some examples of these topics are:</p>
<ul>
<li><strong>Fake news</strong>: Before and after the election in 2016, Facebook (or Facebook’s Trending News algorithm) was accused of promoting news stories that turned out to be completely false, promoted by dubious news sources in FYROM and elsewhere.</li>
<li><strong>Theranos</strong>: This diagnostic testing company promised to revolutionize the blood testing business and prevent disease for all by making blood testing simple and painless. This way people would not be afraid to get blood tests and would do them more often, presumably catching diseases while they were in the very early stages. Theranos lobbied to allow patients to order their own blood tests so that they wouldn’t need a doctor’s order.</li>
<li><strong>Homeopathy</strong>: This is a so-called <a href="https://nccih.nih.gov/health/homeopathy">alternative medical system</a> developed in the late 18th century based on notions such as “like cures like” and “law of minimum dose.”</li>
<li><strong>Online education</strong>: New companies like Coursera and Udacity promised to revolutionize education by making it accessible to a broader audience than conventional universities were able to reach.</li>
</ul>
<p>What exactly do these things have in common?</p>
<p>First, consumers love them. Fake news played to people’s biases by confirming to them, from a seemingly trustworthy source, what they always “knew to be true”. The fact that the stories weren’t actually true was irrelevant given that users enjoyed the experience of seeing what they agreed with. Perhaps the best explanation of the entire Facebook fake news issue was from Kim-Mai Cutler:</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">The best way to have the stickiest and most lucrative product? Be a systematic tool for confirmation bias. <a href="https://t.co/8uOHZLomhX">https://t.co/8uOHZLomhX</a></p>— Kim-Mai Cutler (@kimmaicutler) <a href="https://twitter.com/kimmaicutler/status/796560990854905857">November 10, 2016</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Theranos promised to revolutionize blood testing and change the user experience behind the whole industry. Indeed the company had some fans (particularly amongst its <a href="https://www.axios.com/tim-drapers-keeps-defending-theranos-2192078259.html">investor base</a>). However, after investigations by the Centers for Medicare and Medicaid Services, the FDA, and an independent laboratory, it was found that Theranos’s blood testing machine was wildly inconsistent and variable, leading to Theranos ultimately retracting all of its blood test results and cutting half its workforce.</p>
<p>Homeopathy is not company specific, but is touted by many as an “alternative” treatment for many diseases, with many claiming that it “works for them”. However, the NIH states quite clearly on its <a href="https://nccih.nih.gov/health/homeopathy">web site</a> that “There is little evidence to support homeopathy as an effective treatment for any specific condition.”</p>
<p>Finally, companies like Coursera and Udacity in the education space have indeed produced products that people like, but in some instances have hit bumps in the road. Udacity conducted a brief experiment/program with San Jose State University that failed due to the large differences between the population that took online courses and the one that took them in person. Coursera has massive offerings from major universities (including my own) but has run into continuing <a href="http://www.economist.com/news/special-report/21714173-alternative-providers-education-must-solve-problems-cost-and">challenges with drop out</a> and questions over whether the courses offered are suitable for job placement.</p>
<h2 id="user-experience-and-value">User Experience and Value</h2>
<p>In each of these four examples there is a consumer product that people love, often because it provides a great user experience. Take the fake news example–people love to read headlines from “trusted” news sources that agree with what they believe. With Theranos, people love to take a blood test that is not painful (maybe “love” is the wrong word here). With many consumer products companies, it is the user experience that defines the value of a product. Often when describing the user experience, you are simultaneously describing the value of the product.</p>
<p>Take for example Uber. With Uber, you open an app on your phone, click a button to order a car, watch the car approach you on your phone with an estimate of how long you will be waiting, get in the car and go to your destination, and get out without having to deal with paying. If someone were to ask me “What’s the value of Uber?” I would probably just repeat the description in the previous sentence. Isn’t it obvious that it’s better than the usual taxi experience? The same could be said for many companies that have recently come up: Airbnb, Amazon, Apple, Google. With many of the products from these companies, <em>the description of the user experience is a description of its value</em>.</p>
<h2 id="disruption-through-user-experience">Disruption Through User Experience</h2>
<p>In the example of Uber (and Airbnb, and Amazon, etc.) you could depict the relationship between the product, the user experience, and the value as such:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ux1.png" alt="" /></p>
<p>Any changes that you can make to the product to improve the user experience will then improve the value that the product offers. Another way to say it is that the user experience serves as a <em>surrogate outcome</em> for the value. We can influence the UX and know that we are improving value. Furthermore, any measurements that we take on the UX (surveys, focus groups, app data) will serve as direct observations on the value provided to customers.</p>
<p>New companies in these kinds of consumer product spaces can disrupt the incumbents by providing a much better user experience. When incumbents have gotten fat and lazy, there is often a sizable segment of the customer base that feels underserved. That’s when new companies can swoop in to specifically serve that segment, often with a “worse” product overall (as in fewer features) and usually much cheaper. The Internet has made the “swooping in” much easier by <a href="https://stratechery.com/2015/netflix-and-the-conservation-of-attractive-profits/">dramatically reducing transaction and distribution costs</a>. Once the new company has a foothold, they can gradually work their way up the ladder of customer segments to take over the market. It’s classic disruption theory a la <a href="http://www.claytonchristensen.com">Clayton Christensen</a>.</p>
<h2 id="when-value-defines-the-user-experience-and-product">When Value Defines the User Experience and Product</h2>
<p>There has been much talk of applying the classic disruption model to every space imaginable, but I contend that not all product spaces are the same. In particular, the four examples I described in the beginning of this post cover some of those different areas:</p>
<ul>
<li>Medicine (Theranos, homeopathy)</li>
<li>News (Facebook/fake news)</li>
<li>Education (Coursera/Udacity)</li>
</ul>
<p>One thing you’ll notice about these areas, particularly with medicine and education, is that they are all heavily regulated. The reason is that we as a community have decided that there is a minimum level of value that is required to be provided by entities in this space. That is, the value that a product offers is <em>defined first</em>, before the product can come to market. Therefore, the value of the product actually constrains the space of products that can be produced. We can depict this relationship as such:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ux2.png" alt="" /></p>
<p>In classic regression modeling language, the value of a product must be “adjusted for” before examining the relationship between the product and the user experience. Naturally, as in any regression problem, when you adjust for a variable that is related to the product and the user experience, you reduce the overall variation in the product.</p>
<p>In situations where the value defines the product and the user experience, there is much less room to maneuver for new entrants in the market. The reason is that they, like everyone else, are constrained by the value that is agreed upon by the community, usually in the form of regulations.</p>
<p>When Theranos comes in and claims that it’s going to dramatically improve the user experience of blood testing, that’s great, but they must be constrained by the value that society demands, which is a certain precision and accuracy in its testing results. Companies in the online education space are welcome to disrupt things by providing a better user experience. Online offerings in fact do this by allowing students to take classes according to their own schedule, wherever they may live in the world. But we still demand that the students learn an agreed-upon set of facts, skills, or lessons.</p>
<p>New companies will often argue that the things that we currently value are outdated or no longer valuable. Their incentive is to change the value required so that there is more room for new companies to enter the space. This is a good thing, but it’s important to realize that this cannot happen solely through changes in the product. Innovative features of a product may help us to understand that we should be valuing different things, but ultimately the change in what we perceive as value occurs independently of any given product.</p>
<p>When I see new companies enter the education, medicine, or news areas, I always hesitate a bit because I want some assurance that they will still provide the value that we have come to expect. In addition, with these particular areas, there is a genuine sense that failing to deliver on what we value could cause serious harm to individuals. However, I think the discussion that is provoked by new companies entering the space is always welcome because we need to constantly re-evaluate what we value and whether it matches the needs of our time.</p>
An example that isn't that artificial or intelligent
2017-01-20T00:00:00+00:00
http://simplystats.github.io/2017/01/20/not-artificial-not-intelligent
<p><em>Editor’s note: This is the second chapter of a book I’m working on called <a href="https://leanpub.com/demystifyai/">Demystifying Artificial Intelligence</a>. The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I’m developing the book over time - so if you buy the book on Leanpub know that there are only two chapters in there so far, but I’ll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this <a href="https://twitter.com/notajf/status/795717253505413122">amazing tweet</a> by Twitter user <a href="https://twitter.com/notajf/">@notajf</a>. Feedback is welcome and encouraged!</em></p>
<blockquote>
<p>“I am so clever that sometimes I don’t understand a single word of
what I am saying.” Oscar Wilde</p>
</blockquote>
<p>As we have described it, artificial intelligence applications consist of
three things:</p>
<ol>
<li>A large collection of data examples</li>
<li>An algorithm for learning a model from that training set.</li>
<li>An interface with the world.</li>
</ol>
<p>In the following chapters we will go into each of these components in
much more detail, but let’s start with a couple of very simple examples
to make sure that the components of an AI are clear. We will start with
a completely artificial example and then move to more complicated
examples.</p>
<h2 id="building-an-album">Building an album</h2>
<p>Let’s start with a very simple hypothetical example that can be
understood even if you don’t have a technical background. We can also
use this example to define some of the terms we will be discussing later
in the book.</p>
<p>In our simple example the goal is to make an album of photos for a
friend. For example, suppose I want to take the photos in my photobook
and find all the ones that include pictures of myself and my son Dex for
his grandmother.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/cartoon-phone-photos.png" alt="The author's drawing of the author's phone album. Don't make fun, he's
a data scientist, not an artist" /></p>
<p>If you are anything like the author of this book, then you probably have
a very large number of pictures of your family on your phone. So the
first step in making the photo album would be to sort through all of my
pictures and pick out the ones that should be part of the album.</p>
<p>This is a typical example of the type of thing we might want to train a
computer to do in an artificial intelligence application. Each of the
components of an AI application is there:</p>
<ol>
<li><strong>The data</strong>: all of the pictures on the author’s phone (a big
training set!)</li>
<li><strong>The algorithm</strong>: finding pictures of me and my son Dex</li>
<li><strong>The interface</strong>: the album to give to Dex’s grandmother.</li>
</ol>
<p>One way to solve this problem is for me to sort through the pictures one
by one and decide whether they should be in the album or not, then
assemble them together, and then put them into the album. If I did it
like this then I myself would be the AI! That wouldn’t be very
artificial though…imagine we instead wanted to teach a computer to
make this album.</p>
<blockquote>
<p>But what does it mean to “teach” a computer to do something?</p>
</blockquote>
<p>The terms “machine learning” and “artificial intelligence” evoke the
idea of teaching computers in the same way that we teach children. This
was a deliberate choice to make the analogy - both because in some ways
it is appropriate and because it is useful for explaining complicated
concepts to people with limited backgrounds. To teach a child to find
pictures of the author and his son, you would show her lots of examples
of that type of picture and maybe some examples of the author with other
kids who were not his son. You’d repeat to the child that the pictures
of the author and his son were the kinds you wanted and the others
weren’t. Eventually she would retain that information and if you gave
her a new picture she could tell you whether it was the right kind or
not.</p>
<p>To teach a machine to perform the same kind of recognition you go
through a similar process. You “show” the machine many pictures labeled
as either the ones you want or not. You repeat this process until the
machine “retains” the information and can correctly label a new photo.
Getting the machine to “retain” this information is a matter of getting
the machine to create a set of step by step instructions it can apply to
go from the image to the label that you want.</p>
<h2 id="the-data">The data</h2>
<p>The images are what people in the fields of artificial intelligence and
machine learning call <em>“raw data”</em> (Leek, n.d.). The categories of
pictures (a picture of the author and his son or a picture of something
else) are called the <em>“labels”</em> or <em>“outcomes”</em>. If the computer gets to
see the labels when it is learning then it is called <em>“supervised
learning”</em> (Wikipedia contributors 2016) and when the computer doesn’t
get to see the labels it is called <em>“unsupervised learning”</em> (Wikipedia
contributors 2017a).</p>
<p>Going back to our analogy with the child, supervised learning would be
teaching the child to recognize pictures of the author and his son
together. Unsupervised learning would be giving the child a pile of
pictures and asking her to sort them into groups. She might sort them
by color or subject or location - not necessarily into categories that
you care about. But probably one of the categories she would make would
be pictures of people - so she would have found some potentially useful
information even if it wasn’t exactly what you wanted. One whole field
of artificial intelligence is figuring out how to use the information
learned in this “unsupervised” setting and using it for supervised tasks -
this is sometimes called <em>“transfer learning”</em> (Raina et al. 2007) by
people in the field since you are transferring information from one task
to another.</p>
<p>Returning to the task of “teaching” a computer to retain information
about what kind of pictures you want we run into a problem - computers
don’t know what pictures are! They also don’t know what audio clips,
text files, videos, or any other kind of information is. At least not
directly. They don’t have eyes, ears, and other senses along with a
brain designed to decode the information from these senses.</p>
<p>So what can a computer understand? A good rule of thumb is that a
computer works best with numbers. If you want a computer to sort
pictures into an album for you, the first thing you need to do is to
find a way to turn all of the information you want to “show” the
computer into numbers. In the case of sorting pictures into albums - a
supervised learning problem - we need to turn the labels and the images
into numbers the computer can use.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/labels-to-numbers.png" alt="Label each picture as a one or a zero depending on whether it is the
kind of picture you want in the album" /></p>
<p>One way to do that would be for you to do it for the computer. You could
take every picture on your phone and label it with a 1 if it was a
picture of the author and his son and a 0 if not. Then you would have a
set of 1’s and 0’s corresponding to all of the pictures. This takes
something the computer can’t understand (the picture) and turns it into
something the computer can understand (the label).</p>
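<p>As a tiny sketch of this labeling step in Python, suppose we had written down by hand who appears in each photo. The descriptions below are made up for illustration.</p>

```python
# Made-up, hand-written descriptions of who or what is in each photo.
photos = ["author and son", "ocean", "author and son", "dog"]

# Label a photo 1 if it shows the author and his son and 0 if not.
labels = [1 if who == "author and son" else 0 for who in photos]
print(labels)
```

<p>The list of 1’s and 0’s lines up with the photos one for one, so the first number is the label for the first photo and so on.</p>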
<p>While this process would turn the labels into something a computer could
understand, it still isn’t something we could teach a computer to do.
The computer can’t actually “look” at the image and doesn’t know who the
author or his son are. So we need to figure out a way to turn the images
into numbers for the computer to use to generate those labels directly.</p>
<p>This is a little more complicated but you could still do it for the
computer. Let’s suppose that the author and his son always wear matching
blue shirts when they spend time together. Then you could go through and
look at each image and decide what fraction of the image is blue. So
each picture would get a number ranging from zero to one like 0.30 if
the picture was 30% blue and 0.53 if it was 53% blue.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/images-to-numbers.png" alt="Calculate the fraction of each image that is the color blue as a
&quot;feature&quot; of the image that is numeric" /></p>
<p>The fraction of the picture that is blue is called a <em>“feature”</em> and the
process of creating that feature is called <em>“feature engineering”</em>
(Wikipedia contributors 2017b). Until very recently feature engineering
of text, audio, or video files was best performed by an expert human. In
later chapters we will discuss how one of the most exciting parts about
AI applications is that it is now possible to have computers perform
feature engineering for you.</p>
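<p>Here is a rough sketch in Python of how the blue feature could be calculated once an image is a list of red, green, and blue values. The rule for deciding when a pixel counts as “blue” is an assumption made for this example, and the ten pixels are made up. A real application would need a more careful definition.</p>

```python
def is_blue(pixel):
    """Assumed rule: a pixel is "blue" when its blue intensity is
    larger than both its red and its green intensity."""
    red, green, blue = pixel
    return blue > red and blue > green


def fraction_blue(pixels):
    """Fraction of the pixels in one image that count as blue."""
    blue_count = sum(1 for pixel in pixels if is_blue(pixel))
    return blue_count / len(pixels)


# A made-up image with ten pixels, three of which are blue.
pixels = [
    (20, 40, 200), (238, 180, 180), (10, 10, 90), (255, 228, 225),
    (30, 60, 220), (188, 143, 143), (200, 200, 200), (90, 120, 60),
    (250, 240, 10), (120, 80, 40),
]
feature = fraction_blue(pixels)
print(feature)
```

<p>The whole image has been boiled down to one number between zero and one, which is the feature we will hand to the algorithm.</p>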
<h2 id="the-algorithm">The algorithm</h2>
<p>Now that we have converted the images to numbers and the labels to
numbers, we can talk about how to “teach” a computer to label the
pictures. A good rule of thumb when thinking about algorithms is that a
computer can’t “do” anything without being told very explicitly what to
do. It needs a step by step set of instructions. The instructions should
start with a calculation on the numbers for the image and should end
with a prediction of what label to apply to that image. The image
(converted to numbers) is the <em>“input”</em> and the label (also a number) is
the <em>“output”</em>. You may have heard the phrase:</p>
<blockquote>
<p>“Garbage in, garbage out”</p>
</blockquote>
<p>What this phrase means is that if the inputs (the images) are bad - say they
are all very dark or hard to see - then the output of the algorithm will
also be bad and the predictions won’t be very good.</p>
<p>A machine learning <em>“algorithm”</em> can be thought of as a set of
instructions with some of the parts left blank - sort of like mad-libs.
One example of a really simple algorithm for sorting pictures into the
album would be:</p>
<blockquote>
<ol>
<li>Calculate the fraction of blue in the image.</li>
<li>If the fraction of blue is above <em>X</em> label it 1</li>
<li>If the fraction of blue is less than <em>X</em> label it 0</li>
<li>Put all of the images labeled 1 in the album</li>
</ol>
</blockquote>
<p>The machine <em>“learns”</em> by using the examples to fill in the blanks in
the instructions. In the case of our really simple algorithm we need to
figure out what fraction of blue to use (<em>X</em>) for labeling the picture.</p>
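<p>The mad-lib instructions can be sketched in Python with the blank <em>X</em> left as an argument. The function names and the four example fractions are made up for illustration. The instructions do not say what to do when an image is exactly at the cutoff, so this sketch assumes it gets a 0.</p>

```python
def label_image(blue_fraction, x):
    """Label 1 if the fraction of blue is above the cutoff x, else 0.
    Assumption: an image exactly at the cutoff is labeled 0."""
    return 1 if blue_fraction > x else 0


def build_album(fractions, x):
    """Return the positions of the images that are labeled 1."""
    return [i for i, f in enumerate(fractions) if label_image(f, x) == 1]


# Made-up fractions of blue for four images.
fractions = [0.30, 0.53, 0.05, 0.12]
album = build_album(fractions, x=0.2)
print(album)
```

<p>Nothing has been learned yet. The cutoff of 0.2 here was picked by hand, and changing it changes which images end up in the album.</p>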
<p>To figure out a guess for <em>X</em> we need to decide what we want the
algorithm to do. If we set <em>X</em> to be too low then all of the images will
be labeled with a 1 and put into the album. If we set <em>X</em> to be too high
then all of the images will be labeled 0 and none will appear in the
album. In between there is some grey area - do we care if we
accidentally get some pictures of the ocean or the sky with our
algorithm?</p>
<p>But the number of images in the album isn’t even the thing we really
care about. What we might care about is making sure that the album is
mostly pictures of the author and his son. In the field of AI they
usually turn this statement around - we want to make sure the album has
a very small fraction of pictures that are not of the author and his
son. This fraction - the fraction that are incorrectly placed in the
album is called the <em>“loss”</em>. You can think about it like a game where
the computer loses a point every time it puts the wrong kind of picture
into the album.</p>
<p>Using our loss (how many pictures we incorrectly placed in the album) we
can now use the data we have created (the numbers for the labels and the
images) to fill in the blanks in our mad-lib algorithm (picking the
cutoff on the amount of blue). We have a large number of pictures where
we know what fraction of each picture is blue and whether it is a
picture of the author and his son or not. We can try each possible <em>X</em>
and calculate the fraction of pictures in the album that are incorrectly
placed into the album (the loss) and find the <em>X</em> that produces the
smallest fraction.</p>
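<p>The search described above can be sketched as a simple loop over candidate cutoffs. The names and the example numbers are invented for illustration. One assumption is added that the text does not cover: a cutoff that leaves the album empty is skipped, since the fraction of wrong pictures is undefined for an empty album. When several cutoffs tie, the sketch keeps the first one it tried.</p>

```python
def album_loss(examples, x):
    """Fraction of pictures placed in the album that do not belong there.

    examples: list of (blue_fraction, is_author_and_son) pairs.
    Returns None when the cutoff x leaves the album empty (an assumption
    made for this sketch, since the fraction is undefined in that case).
    """
    in_album = [truth for blue, truth in examples if blue > x]
    if not in_album:
        return None
    wrong = sum(1 for truth in in_album if not truth)
    return wrong / len(in_album)


def learn_cutoff(examples, candidates):
    """Try each candidate cutoff and return (cutoff, loss) with the smallest loss."""
    best = None
    for x in candidates:
        loss = album_loss(examples, x)
        if loss is None:
            continue  # empty album: nothing to score
        if best is None or loss < best[1]:
            best = (x, loss)
    return best


# Hypothetical labeled pictures: (fraction of blue, is it the author and his son?)
examples = [
    (0.02, False), (0.05, False), (0.12, True),
    (0.15, False), (0.30, True), (0.45, True),
]
best_cutoff, best_loss = learn_cutoff(examples, [0.0, 0.1, 0.2, 0.4, 0.5])
```

<p>With these made-up pictures a cutoff of 0.0 puts everything in the album and half of it is wrong, while a cutoff of 0.2 keeps only pictures of the author and his son, so the loop settles on 0.2. Real data would give a different answer, but the procedure is the same.</p>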
<p>Suppose that the value of <em>X</em> that gives the smallest fraction of wrong
pictures in the album is 0.1. Then our “learned” model would be:</p>
<blockquote>
<ol>
<li>Calculate the fraction of blue in the image</li>
<li>If the fraction of blue is above 0.1 label it 1</li>
<li>If the fraction of blue is less than 0.1 label it 0</li>
<li>Put all of the images labeled 1 in the album</li>
</ol>
</blockquote>
<h2 id="the-interface">The interface</h2>
<p>The last part of an AI application is the interface. In this case, the
interface would be the way that we share the pictures with Dex’s
grandmother. For example we could imagine uploading the pictures to
<a href="https://www.shutterfly.com/">Shutterfly</a> and having the album delivered
to Dex’s grandmother.</p>
<p>Putting this all together we could imagine an application using our
trained AI. The author uploads his unlabeled photos. The photos are then
passed to the computer program which calculates the fraction of the
image that is blue, then applies a label according to the algorithm we
learned, then takes all the images predicted to be of the author and his
son and sends them off to become a Shutterfly album mailed to the author’s
mother.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ai-album.png" alt="Whoa that computer is smart - from the author's picture to grandma's
hands!" /></p>
<p>If the algorithm was good, then from the perspective of the author the
website would look “intelligent”: he just uploaded pictures and it
created an album with the pictures that he wanted. But the steps
in the process were very simple and understandable behind the scenes.</p>
<h2 id="references">References</h2>
<p>Leek, Jeffrey. n.d. “The Elements of Data Analytic Style.”
<a href="https://leanpub.com/datastyle">https://leanpub.com/datastyle</a>.</p>
<p>Raina, Rajat, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y
Ng. 2007. “Self-Taught Learning: Transfer Learning from Unlabeled Data.”
In <em>Proceedings of the 24th International Conference on Machine
Learning</em>, 759–66. ICML ’07. New York, NY, USA: ACM.</p>
<p>Wikipedia contributors. 2016. “Supervised Learning.”
<a href="https://en.wikipedia.org/w/index.php?title=Supervised_learning&oldid=752493505">https://en.wikipedia.org/w/index.php?title=Supervised_learning&oldid=752493505</a>.</p>
<p>———. 2017a. “Unsupervised Learning.”
<a href="https://en.wikipedia.org/w/index.php?title=Unsupervised_learning&oldid=760556815">https://en.wikipedia.org/w/index.php?title=Unsupervised_learning&oldid=760556815</a>.</p>
<p>———. 2017b. “Feature Engineering.”
<a href="https://en.wikipedia.org/w/index.php?title=Feature_engineering&oldid=760758719">https://en.wikipedia.org/w/index.php?title=Feature_engineering&oldid=760758719</a>.</p>
What is artificial intelligence? A three part definition
2017-01-19T00:00:00+00:00
http://simplystats.github.io/2017/01/19/what-is-artificial-intelligence
<p><em>Editor’s note: This is the first chapter of a book I’m working on called <a href="https://leanpub.com/demystifyai/">Demystifying Artificial Intelligence</a>. The goal of the book is to demystify what modern AI is and does for a general audience. It is meant to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I’m developing the book over time - so if you buy the book on Leanpub know that there is only one chapter in there so far, but I’ll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this <a href="https://twitter.com/notajf/status/795717253505413122">amazing tweet</a> by Twitter user <a href="https://twitter.com/notajf/">@notajf</a>. Feedback is welcome and encouraged!</em></p>
<h1 id="what-is-artificial-intelligence">What is artificial intelligence?</h1>
<blockquote>
<p>“If it looks like a duck and quacks like a duck but it needs
batteries, you probably have the wrong abstraction” <a href="https://lostechies.com/derickbailey/2009/02/11/solid-development-principles-in-motivational-pictures/">Derick
Bailey</a></p>
</blockquote>
<p>This book is about artificial intelligence. The term “artificial
intelligence” or “AI” has a long and convoluted history (Cohen and
Feigenbaum 2014). It has been used by philosophers, statisticians,
machine learning experts, mathematicians, and the general public. This
historical context means that when people say <em>artificial intelligence</em>
the term is loaded with one of many potential different meanings.</p>
<h2 id="humanoid-robots">Humanoid robots</h2>
<p>Before we can demystify artificial intelligence it is helpful to have
some context for what the word means. When asked about artificial
intelligence, most people’s imagination leaps immediately to images of
robots that can act like and interact with humans. Near-human robots
have long been a source of fascination for humans and have appeared in
cartoons like the <em>Jetsons</em> and science fiction like <em>Star Wars</em>. More
recently, subtler forms of near-human robots with artificial
intelligence have played roles in movies like <em>Her</em> and <em>Ex Machina</em>.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/movie-ai.png" alt="People usually think of artificial intelligence as a human-like robot
performing all the tasks that a person could." /></p>
<p>The type of artificial intelligence that can think and act like a human
is something that experts call artificial general intelligence
(Wikipedia contributors 2017a).</p>
<blockquote>
<p>is the intelligence of a machine that could successfully perform any
intellectual task that a human being can</p>
</blockquote>
<p>There is an understandable fascination and fear associated with robots,
created by humans, but evolving and thinking independently. While this
is a major area of research (Laird, Newell, and Rosenbloom 1987) and of
course the center of most people’s attention when it comes to AI, there
is no near term possibility of this type of intelligence (Urban, n.d.).
There are a number of barriers to human-mimicking AI from difficulty
with robotics (Couden 2015) to needed speedups in computational power
(Langford, n.d.).</p>
<p>One of the key barriers is that most current forms of the computer
models behind AI are trained to do one thing really well, but can not be
applied beyond that narrow task. There are extremely effective
artificial intelligence applications for translating between languages
(Wu et al. 2016), for recognizing faces in images (Taigman et al. 2014),
and even for driving cars (Santana and Hotz 2016).</p>
<p>But none of these technologies are generalizable across the range of
tasks that most adult humans can accomplish. For example, the AI
application for recognizing faces in images can not be directly applied
to drive cars and the translation application couldn’t recognize a
single image. While some of the internal technology used in the
applications is the same, the final version of the applications can’t be
transferred. This means that when we talk about artificial intelligence
we are not talking about a general purpose humanoid replacement.
Currently we are talking about technologies that can typically
accomplish one or two specific tasks that a human could accomplish.</p>
<h2 id="cognitive-tasks">Cognitive tasks</h2>
<p>While modern AI applications couldn’t do everything that an adult could
do (Baciu and Baciu 2016), they can perform individual tasks nearly as
well as a human. There is a second commonly used definition of
artificial intelligence that is considerably more narrow (Wikipedia
contributors 2017b)</p>
<blockquote>
<p>… the term “artificial intelligence” is applied when a machine
mimics “cognitive” functions that humans associate with other human
minds, such as “learning” and “problem solving”.</p>
</blockquote>
<p>This definition encompasses applications like machine translation and
facial recognition. They are “cognitive” functions that are usually
performed only by humans. A difficulty with this definition is
that it is relative. People refer to machines that can do tasks that we
thought only humans could do as artificial intelligence. But over time,
as we become used to machines performing a particular task it is no
longer surprising and we stop calling it artificial intelligence. John
McCarthy, one of the leading early figures in artificial intelligence,
said (Vardi 2012):</p>
<blockquote>
<p>As soon as it works, no one calls it AI anymore…</p>
</blockquote>
<p>As an example, when you send a letter in the mail, there is a machine
that scans the writing on the letter. A computer then “reads” the
characters on the front of the letter. The computer reads the characters
in several steps - the color of each pixel in the picture of the letter
is stored in a data set on the computer. Then the computer uses an
algorithm that has been built using thousands or millions of other
letters to take the pixel data and turn it into predictions of the
characters in the image. Then the characters are identified as
addresses, names, zipcodes, and other relevant pieces of information.
Those are then stored in the computer as text which can be used for
sorting the mail.</p>
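<p>To make the “pixels in, characters out” idea concrete, here is a toy sketch. It is not how postal systems actually work: it swaps in a very simple nearest-neighbor rule, assumes each character has already been cut out as a tiny black-and-white bitmap (a flat list of 0s and 1s), and uses made-up function names and example bitmaps.</p>

```python
def pixel_distance(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(1 for p, q in zip(a, b) if p != q)


def predict_character(pixels, labeled_examples):
    """Return the label of the stored example closest to the new bitmap.

    labeled_examples: list of (bitmap, character) pairs seen before.
    """
    label, _ = min(
        ((lab, pixel_distance(pixels, ex)) for ex, lab in labeled_examples),
        key=lambda pair: pair[1],
    )
    return label


def read_characters(bitmaps, labeled_examples):
    """Turn a sequence of character bitmaps into a string of predictions."""
    return "".join(predict_character(b, labeled_examples) for b in bitmaps)


# Hypothetical 3x3 "training" characters, flattened row by row.
labeled_examples = [
    ([0, 1, 0, 0, 1, 0, 0, 1, 0], "1"),
    ([1, 1, 1, 0, 0, 1, 0, 0, 1], "7"),
    ([1, 1, 1, 1, 0, 1, 1, 1, 1], "0"),
]

# New, slightly smudged characters scanned from an envelope.
scanned = [
    [1, 1, 1, 0, 0, 1, 0, 1, 1],  # a "7" with one stray pixel
    [0, 1, 0, 0, 1, 0, 0, 1, 1],  # a "1" with one stray pixel
    [1, 1, 1, 1, 0, 1, 1, 1, 1],  # a clean "0"
]
text = read_characters(scanned, labeled_examples)
```

<p>Real systems learn from thousands or millions of examples and use far more flexible models, but the shape is the same: stored examples with known answers, a rule for comparing new pixel data to them, and text as the output.</p>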
<p>This task used to be considered “artificial intelligence” (Pavlidis,
n.d.). It was surprising that a computer could perform the tasks of
recognizing characters and addresses just based on a picture of the
letter. This task is now called “optical character recognition”
(Wikipedia contributors 2016). Many tutorials on the algorithms behind
machine learning begin with this relatively simple task (Google
Tensorflow Team, n.d.). Optical character recognition is now used in a
wide range of applications including in Google’s effort to digitize
millions of books (Darnton 2009).</p>
<p>Since this type of algorithm has become so common it is no longer called
“artificial intelligence”. This transition happened because we no longer
think it is surprising that computers can do this task - so it is no
longer considered intelligent. This process has played out with a number
of other technologies. Initially it is thought that only a human can do
a particular cognitive task. As computers become increasingly proficient
at that task they are called artificially intelligent. Finally, when
that task is performed almost exclusively by computers it is no longer
considered “intelligent” and the boundary moves.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/timeline-ai.png" alt="Timeline of tasks we were surprised that computers could do as well as
humans." /></p>
<p>Over the last two decades tasks from optical character recognition, to
facial recognition in images, to playing chess have started as
artificially intelligent applications. At the time of this writing there
are a number of technologies that are currently on the boundary between
doable only by a human and doable by a computer. These are the tasks
that are considered AI when you read about the term in the media.
Examples of tasks that are currently considered “artificial
intelligence” include:</p>
<ul>
<li>Computers that can drive cars</li>
<li>Computers that can identify human faces from pictures</li>
<li>Computers that can translate text from one language to another</li>
<li>Computers that can label pictures with text descriptions</li>
</ul>
<p>Just as it used to be with optical character recognition, self-driving
cars and facial recognition are tasks that still surprise us when
performed by a computer. So we still call them artificially intelligent.
Eventually, many or most of these tasks will be performed nearly
exclusively by computers and we will no longer think of them as
components of computer “intelligence”. To go a little further we can
think about any task that is repetitive and performed by humans. For
example, picking out music that you like or helping someone buy
something at a store. An AI can eventually be built to do those tasks
provided that: (a) there is a way of measuring and storing information
about the tasks and (b) there is technology in place to perform the task
if given a set of computer instructions.</p>
<p>The more narrow definition of AI is used colloquially in the news to
refer to new applications of computers to perform tasks previously
thought impossible. It is important to know both the definition of AI
used by the general public and the more narrow and relative definition
used to describe modern applications of AI by companies like Google and
Facebook. But neither of these definitions is satisfactory to help
demystify the current state of artificial intelligence applications.</p>
<h2 id="a-three-part-definition">A three part definition</h2>
<p>The first definition describes a technology that we are not currently
faced with - fully functional general purpose artificial intelligence.
The second definition suffers from the fact that it is relative to the
expectations of people discussing applications. For this book, we need a
definition that is concrete, specific, and doesn’t change with societal
expectations.</p>
<p>We will consider specific examples of human-like tasks that computers
can perform. So we will use the definition that artificial intelligence
requires the following components:</p>
<ol>
<li><em>The data set</em> : A set of data examples that can be used to train a
statistical or machine learning model to make predictions.</li>
<li><em>The algorithm</em> : An algorithm that can be trained based on the data
examples to take a new example and execute a human-like task.</li>
<li><em>The interface</em> : An interface for the trained algorithm to receive
a data input and execute the human like task in the real world.</li>
</ol>
<p>This definition encompasses optical character recognition and all the
more modern examples like self-driving cars. It is also intentionally
broad, covering even examples where the data set is not large or the
algorithm is not complicated. We will use our definition to break down
modern artificial intelligence applications into their constituent
parts and make it clear how the computer represents knowledge learned
from data examples and then applies that knowledge.</p>
<p>As one example, consider Amazon Echo and Alexa - an application
currently considered to be artificially intelligent (Nuñez, n.d.). This
combination meets our definition of artificially intelligent since each
of the components is in place.</p>
<ol>
<li><em>The data set</em> : The large set of data examples consists of all the
recordings that Amazon has collected of people talking to their
Amazon devices.</li>
<li><em>The machine learning algorithm</em> : The Alexa voice service (Alexa
Developers 2016) is a machine learning algorithm trained using the
previous recordings of people talking to Amazon devices.</li>
<li><em>The interface</em> : The interface is the Amazon Echo (Amazon Inc 2016)
a speaker that can record humans talking to it and respond with
information or music.</li>
</ol>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/alexa-ai.png" alt="The three parts of an artificial intelligence illustrated with Amazon
Echo and Alexa" /></p>
<p>When we break down artificial intelligence into these steps it makes it
clearer why there has been such a sudden explosion of interest in
artificial intelligence over the last several years.</p>
<p>First, the cost of data storage and collection has gone down steadily
(Irizarry, n.d.) but dramatically (Quigley, n.d.) over the last several
years. As the costs have come down, it is increasingly feasible for
companies, governments, and even individuals to store large collections
of data (Component 1 - <em>The Data</em>). To take advantage of these huge
collections of data requires incredibly flexible statistical or machine
learning algorithms that can capture most of the patterns in the data
and re-use them for prediction. The most common type of algorithms used
in modern artificial intelligence are something called “deep neural
networks”. These algorithms are so flexible they capture nearly all of
the important structure in the data. They can only be trained well if
huge data sets exist and computers are fast enough. Continual increases
in computing speed and power over the last several decades now make it
possible to apply these models to large collections of data (Component 2 -
<em>The Algorithm</em>).</p>
<p>Finally, the most underappreciated component of the AI revolution does
not have to do with data or machine learning. Rather it is the
development of new interfaces that allow people to interact directly
with machine learning models. For a number of years now, if you were an
expert with statistical and machine learning software it has been
possible to build highly accurate predictive models. But if you were a
person without technical training it was not possible to directly
interact with algorithms.</p>
<p>Or as statistical experts Diego Kuonen and Rafael Irizarry have put it:</p>
<blockquote>
<p>The big in big data refers to importance, not size</p>
</blockquote>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/importance-not-size.jpg" alt="It isn't about how much data you have, it is about how many people you
can get to use it." /></p>
<p>The explosion of interfaces for regular, non-technical people to
interact with machine learning is an underappreciated driver of the AI
revolution of the last several years. Artificial intelligence can now
power labeling friends on Facebook, parsing your speech when you talk to
a personal assistant like Siri, Google Assistant, or your Echo, and
providing you with directions in your car. More recently sensors and
devices make it possible for the instructions created by a computer to
steer and drive a car.</p>
<p>These interfaces now make it possible for hundreds of millions of people
to directly interact with machine learning algorithms. These algorithms
can range from exceedingly simple to mind bendingly complex. But the
common result is that the interface allows the computer to perform a
human-like action and makes it look like artificial intelligence to the
person on the other side. This interface explosion only promises to
accelerate as we are building sensors for both data input and behavior
output in objects from phones to refrigerators to cars (Component 3 -
<em>The interface</em>).</p>
<p>This definition of artificial intelligence in three components will
allow us to demystify artificial intelligence applications from self
driving cars to facial recognition. Our goal is to provide a high-level
interface to the current conception of AI and how it can be applied to
problems in real life. It will include discussion and references to the
sophisticated models and data collection methods used by Facebook,
Tesla, and other companies. However, the book does not assume a
mathematical or computer science background and will attempt to explain
these ideas in plain language. Of course, this means that some details
will be glossed over, so we will attempt to point the interested reader
toward more detailed resources throughout the book.</p>
<h2 id="references">References</h2>
<p>Alexa Developers. 2016. “Alexa Voice Service.”
<a href="https://developer.amazon.com/alexa-voice-service">https://developer.amazon.com/alexa-voice-service</a>.</p>
<p>Amazon Inc. 2016. “Amazon Echo.”
<a href="https://www.amazon.com/Amazon-Echo-Bluetooth-Speaker-with-WiFi-Alexa/dp/B00X4WHP5E">https://www.amazon.com/Amazon-Echo-Bluetooth-Speaker-with-WiFi-Alexa/dp/B00X4WHP5E</a>.</p>
<p>Baciu, Assaf, and Assaf Baciu. 2016. “Artificial Intelligence Is More
Artificial Than Intelligent.” <em>Wired</em>, 7 Dec.</p>
<p>Cohen, Paul R, and Edward A Feigenbaum. 2014. <em>The Handbook of
Artificial Intelligence</em>. Vol. 3. Butterworth-Heinemann.
<a href="https://goo.gl/wg5rMk">https://goo.gl/wg5rMk</a>.</p>
<p>Couden, Craig. 2015. “Why It’s so Hard to Make Humanoid Robots | Make:”
<a href="http://makezine.com/2015/06/15/hard-make-humanoid-robots/">http://makezine.com/2015/06/15/hard-make-humanoid-robots/</a>.</p>
<p>Darnton, Robert. 2009. <em>Google & the Future of Books</em>. na.</p>
<p>Google Tensorflow Team. n.d. “MNIST for ML Beginners | TensorFlow.”
<a href="https://www.tensorflow.org/tutorials/mnist/beginners/">https://www.tensorflow.org/tutorials/mnist/beginners/</a>.</p>
<p>Irizarry, Rafael. n.d. “The Big in Big Data Relates to Importance Not
Size · Simply Statistics.”
<a href="http://simplystatistics.org/2014/05/28/the-big-in-big-data-relates-to-importance-not-size/">http://simplystatistics.org/2014/05/28/the-big-in-big-data-relates-to-importance-not-size/</a>.</p>
<p>Laird, John E, Allen Newell, and Paul S Rosenbloom. 1987. “Soar: An
Architecture for General Intelligence.” <em>Artificial Intelligence</em> 33
(1). Elsevier: 1–64.</p>
<p>Langford, John. n.d. “AlphaGo Is Not the Solution to AI « Machine
Learning (Theory).” <a href="http://hunch.net/?p=3692542">http://hunch.net/?p=3692542</a>.</p>
<p>Nuñez, Michael. n.d. “Amazon Echo Is the First Artificial Intelligence
You’ll Want at Home.”
<a href="http://www.popsci.com/amazon-echo-first-artificial-intelligence-youll-want-home">http://www.popsci.com/amazon-echo-first-artificial-intelligence-youll-want-home</a>.</p>
<p>Pavlidis, Theo. n.d. “Computers Versus Humans - 2002 Lecture.”
<a href="http://www.theopavlidis.com/comphumans/comphuman.htm">http://www.theopavlidis.com/comphumans/comphuman.htm</a>.</p>
<p>Quigley, Robert. n.d. “The Cost of a Gigabyte over the Years.”
<a href="http://www.themarysue.com/gigabyte-cost-over-years/">http://www.themarysue.com/gigabyte-cost-over-years/</a>.</p>
<p>Santana, Eder, and George Hotz. 2016. “Learning a Driving Simulator,”
3 Aug.</p>
<p>Taigman, Y, M Yang, M Ranzato, and L Wolf. 2014. “DeepFace: Closing the
Gap to Human-Level Performance in Face Verification.” In <em>2014 IEEE
Conference on Computer Vision and Pattern Recognition</em>, 1701–8.</p>
<p>Urban, Tim. n.d. “The AI Revolution: How Far Away Are Our Robot
Overlords?”
<a href="http://gizmodo.com/the-ai-revolution-how-far-away-are-our-robot-overlords-1684199433">http://gizmodo.com/the-ai-revolution-how-far-away-are-our-robot-overlords-1684199433</a>.</p>
<p>Vardi, Moshe Y. 2012. “Artificial Intelligence: Past and Future.”
<em>Commun. ACM</em> 55 (1). New York, NY, USA: ACM: 5–5.</p>
<p>Wikipedia contributors. 2016. “Optical Character Recognition.”
<a href="https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=757150540">https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=757150540</a>.</p>
<p>———. 2017a. “Artificial General Intelligence.”
<a href="https://en.wikipedia.org/w/index.php?title=Artificial_general_intelligence&oldid=758867755">https://en.wikipedia.org/w/index.php?title=Artificial_general_intelligence&oldid=758867755</a>.</p>
<p>———. 2017b. “Artificial Intelligence.”
<a href="https://en.wikipedia.org/w/index.php?title=Artificial_intelligence&oldid=759177704">https://en.wikipedia.org/w/index.php?title=Artificial_intelligence&oldid=759177704</a>.</p>
<p>Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi,
Wolfgang Macherey, Maxim Krikun, et al. 2016. “Google’s Neural Machine
Translation System: Bridging the Gap Between Human and Machine
Translation,” 26 Sep.</p>
Got a data app idea? Apply to get it prototyped by the JHU DSL!
2017-01-18T00:00:00+00:00
http://simplystats.github.io/2017/01/18/data-prototyping-class
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/papr.png" alt="Get your app built" /></p>
<p>Last fall we ran the first iteration of a class at the <a href="http://jhudatascience.org/">Johns Hopkins Data Science Lab</a> where we teach students to build data web-apps using Shiny, R, GoogleSheets and a number of other technologies. Our goals were to teach students to build data products, to reduce friction for students who want to build things with data, and to help people solve important data problems with web and SMS apps.</p>
<p>We are going to be running a second iteration of our program from March-June this year. We are looking for awesome projects for students to build that solve real world problems. We are particularly interested in projects that could have a positive impact on health but are open to any cool idea. We generally build apps that are useful for:</p>
<ul>
<li><strong>Data donation</strong> - if you have a group of people you would like to donate data to your project.</li>
<li><strong>Data collection</strong> - if you would like to build an app for collecting data from people.</li>
<li><strong>Data visualization</strong> - if you have a data set and would like to have a web app for interacting with the data.</li>
<li><strong>Data interaction</strong> - if you have a statistical or machine learning model and you would like a web interface for it.</li>
</ul>
<p>But we are interested in any consumer-facing data product that you might be interested in having built. We want you to submit your wildest, most interesting ideas and we’ll see if we can get them built for you.</p>
<p>We are hoping to solicit a large number of projects and then build as many as possible. The best part is that we will build the prototype for you for free! If you have an idea of something you’d like built please submit it to this <a href="https://docs.google.com/forms/d/1UPl7h8_SLw4zNFl_I9li_8GN14gyAEtPHtwO8fJ232E/edit?usp=forms_home&ths=true">Google form</a>.</p>
<p>Students in the class will select projects they are interested in during early March. We will let you know if your idea was selected for the program by mid-March. If you aren’t selected you will have the opportunity to roll your submission over to our next round of prototyping.</p>
<p>I’ll be writing a separate post targeted at students, but if you are interested in being a data app prototyper, sign up <a href="http://jhudatascience.org/prototyping_students.html">here</a>.</p>
Interview with Al Sommer - Effort Report Episode 23
2017-01-17T00:00:00+00:00
http://simplystats.github.io/2017/01/17/effort-report-episode-23
<p>My colleague <a href="https://twitter.com/elizabethmatsui">Elizabeth Matsui</a> and I had a great opportunity to talk with Al Sommer on the <a href="http://effortreport.libsyn.com/23-special-guest-al-sommer">latest episode</a> of our podcast <a href="http://effortreport.libsyn.com">The Effort Report</a>. Al is the former Dean of the Johns Hopkins Bloomberg School of Public Health and is Professor of Epidemiology and International Health at the School. He is (among other things) world renowned for his pioneering research in vitamin A deficiency and mortality in children.</p>
<p>Al had some good bits of advice for academics and being successful in academia.</p>
<blockquote>
<p>What you are excited about and interested in at the moment, you’re much more likely to be successful at—because you’re excited about it! So you’re going to get up at 2 in the morning and think about it, you’re going to be putting things together in ways that nobody else has put things together. And guess what? When you do that you’re more successful [and] you actually end up getting academic promotions.</p>
</blockquote>
<p>On the slow rate of progress:</p>
<blockquote>
<p>It took ten years, after we had seven randomized trials already to show that you get this 1/3 reduction in child mortality by giving them two cents worth of vitamin A twice a year. It took ten years to convince the child survival Nawabs of the world, and there are still some that don’t believe it.</p>
</blockquote>
<p>On working overseas:</p>
<blockquote>
<p>It used to be true [that] it’s a lot easier to work overseas than it is to work here because the experts come from somewhere else. You’re never an expert in your own home.</p>
</blockquote>
<p>You can listen to the entire episode here:</p>
<iframe style="border: none" src="//html5-player.libsyn.com/embed/episode/id/4992405/height/90/width/700/theme/custom/autonext/no/thumbnail/yes/autoplay/no/preload/no/no_addthis/no/direction/forward/render-playlist/no/custom-color/87A93A/" height="90" width="700" scrolling="no" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" oallowfullscreen="" msallowfullscreen=""></iframe>
Not So Standard Deviations Episode 30 - Philately and Numismatology
2017-01-09T00:00:00+00:00
http://simplystats.github.io/2017/01/09/nssd-episode-30
<p>Hilary and I follow up on open data and data sharing in government. We also discuss artificial intelligence, self-driving cars, and doing your taxes in R.</p>
<p>If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p>Lucy D’Agostino McGowan (@LucyStats) made a <a href="http://www.lucymcgowan.com/hill-for-data-scientists.html">great translation of Hill’s criteria using XKCD comics</a></p>
</li>
<li>
<p><a href="http://www.lucymcgowan.com">Lucy’s web page</a></p>
</li>
<li>
<p><a href="https://www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf">Preparing for the Future of Artificial Intelligence</a></p>
</li>
<li>
<p><a href="http://12%20Dec%202016%20White%20House%20Special%20with%20DJ%20Patil,%20US%20Chief%20Data%20Scientist">Partially Derivative White House Special – with DJ Patil, US Chief Data Scientist</a></p>
</li>
<li>
<p><a href="https://soundcloud.com/nssd-podcast/episode-29-standards-are-like-toothbrushes">Not So Standard Deviations – Standards are Like Toothbrushes – with Daniel Morgan, Chief Data Officer for the U.S. Department of Transportation and Terah Lyons, Policy Advisor to the Chief Technology Officer of the U.S.</a></p>
</li>
<li>
<p><a href="http://www.hgitner.com">Henry Gitner Philatelists</a></p>
</li>
<li>
<p><a href="https://drive.google.com/file/d/0B678uTpUfn80a2RkOUc5LW51cVU/view?usp=sharing">Some Pioneers of Modern Statistical Theory: A Personal Reflection by Sir David R. Cox</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-30-philately-and-numismatology">Download the audio for this episode</a></p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/301065336&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Some things I've found help reduce my stress around science
2016-12-29T00:00:00+00:00
http://simplystats.github.io/2016/12/29/some-stress-reducers
<p>Being a scientist can be pretty stressful for any number of reasons, from the peer review process, to getting funding, to <a href="http://simplystatistics.org/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics/">getting blown up on the internet</a>.</p>
<p>Like a lot of academics I suffer from a lot of stress related to my own high standards and the imposter syndrome that comes from not meeting them on a regular basis. I was just reading through the excellent material in Lorena Barba’s class on <a href="https://barbagroup.github.io/essential_skills_RRC/">essential skills in reproducibility</a> and came across this <a href="http://www.stat.berkeley.edu/~stark/Seminars/reproNE16.htm#1">set of slides</a> by Philip Stark. The one that caught my attention said:</p>
<blockquote>
<p>If I say just trust me and I’m wrong, I’m untrustworthy.
If I say here’s my work and it’s wrong, I’m honest, human, and serving scientific progress.</p>
</blockquote>
<p>I love this quote because it shows how being open about both your successes and failures makes it less stressful to be a scientist. Inspired by this quote I decided to make a list of things that I’ve learned through hard experience do not help me with my own imposter syndrome and do help me to feel less stressed out about my science.</p>
<ol>
<li><em>Put everything out in the open.</em> We release all of our software, data, and analysis scripts. This has led to almost exclusively positive interactions with people as they help us figure out good and bad things about our work.</li>
<li><em>Admit mistakes quickly.</em> Since my code/data are out in the open I’ve had people find little bugs and big “whoa, this is bad” bugs in my code. I used to freak out when that happened. But I found the thing that minimizes my stress is to just quickly admit the error and submit updates/changes/revisions to code and papers as necessary.</li>
<li><em>Respond to requests for support at my own pace.</em> I try to be as responsive as I can when people email me about software/data/code/papers of mine. I used to stress about doing this <em>right away</em> when I would get the emails. I still try to be prompt, but I don’t let that dominate my attention/time. I also prioritize things that are wrong/problematic and then later handle the requests for free consulting every open source person gets.</li>
<li><em>Treat rejection as a feature not a bug.</em> This one is by far the hardest for me but preprints have helped a ton. The academic system is <em>designed</em> to be critical. That is a good thing; skepticism is one of the key tenets of the scientific process. It took me a while to just plan on one or two rejections for each paper, one or two or more rejections for each grant, etc. But now that I plan on the rejection I find I can just focus on how to steadily move forward and constructively address criticism rather than taking it as a personal blow.</li>
<li><em>Don’t argue with people on the internet, especially on Twitter.</em> This is a new one for me and one I’m having to practice hard every single day. But I’ve found that I’ve had very few constructive debates on Twitter. I also found that this is almost purely negative energy for me and doesn’t help me accomplish much.</li>
<li><em>Redefine success.</em> I’ve found that if I recalibrate what success means to include accomplishing tasks like peer reviewing papers, getting letters of recommendation sent at the right times, providing support to people I mentor, and the submission rather than the success of papers/grants then I’m much less stressed out.</li>
<li><em>Don’t compare myself to other scientists.</em> It is <a href="http://simplystatistics.org/2015/02/09/the-trouble-with-evaluating-anything/">very hard to get good evaluation in science</a> and I’m extra bad at self-evaluation. Scientists are good in many different dimensions and so whenever I pick a one dimensional summary and compare myself to others there are always people who are “better” than me. I find I’m happier when I set internal, short term goals for myself and only compare myself to them.</li>
<li><em>When comparing, at least pick a metric I’m good at.</em> I’d like to claim I never compare myself to others, but the reality is I do it more than I’d like. I’ve found one way to not stress myself out for my own internal comparisons is to pick metrics I’m good at - even if they aren’t the “right” metrics. That way at least if I’m comparing I’m not hurting my own psyche.</li>
<li><em>Let myself be bummed sometimes.</em> Some days despite all of that I still get the imposter syndrome feels and can’t get out of the funk. I used to beat myself up about those days, but now I try to just build that into the rhythm of doing work.</li>
<li><em>Try very hard to be positive in my interactions.</em> This is another hard one, because it is important to be skeptical/critical as a scientist. But I also try very hard to do that in as productive a way as possible. I try to assume other people are doing the right thing and I try very hard to stay positive or neutral when writing blog posts/opinion pieces, etc.</li>
<li><em>Realize that giving credit doesn’t take away from me.</em> In my research career I have worked with some extremely <a href="http://genomics.princeton.edu/storeylab/">generous</a> <a href="http://rafalab.github.io/">mentors</a>. They taught me to always give credit whenever possible. I also learned from <a href="http://www.biostat.jhsph.edu/~rpeng/">Roger</a> that you can give credit and not lose anything yourself, in fact you almost always gain. Giving credit is low cost but feels really good so is a nice thing to help me feel better.</li>
</ol>
<p>The last thing I’d say is that having a blog has helped reduce my stress, because sometimes I’m having a hard time getting going on my big project for the day and I can quickly write a blog post and still feel like I got something done…</p>
A non-comprehensive list of awesome things other people did in 2016
2016-12-20T00:00:00+00:00
http://simplystats.github.io/2016/12/20/noncomprehensive-list-of-awesome
<p><em>Editor’s note: For the last few years I have made a list of awesome things that other people did (<a href="http://simplystatistics.org/2015/12/21/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2015/">2015</a>, <a href="http://simplystatistics.org/2014/12/17/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2014/">2014</a>, <a href="http://simplystatistics.org/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year/">2013</a>). Like in previous years I’m making a list, again right off the top of my head. If you know of some, you should make your own list or add it to the comments! I have also avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I write this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data.</em></p>
<ul>
<li>Thomas Lin Pedersen created the <a href="https://github.com/thomasp85/tweenr">tweenr</a> package for interpolating graphs in animations. Check out this awesome <a href="https://twitter.com/thomasp85/status/809896220906897408">logo</a> he made with it.</li>
<li>Yihui Xie is still blowing away everything he does. First it was <a href="https://bookdown.org/yihui/bookdown/">bookdown</a> and then the yolo feature in the <a href="https://github.com/yihui/xaringan">xaringan</a> package.</li>
<li>J Alammar built this great <a href="https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/">visual introduction to neural networks</a></li>
<li>Jenny Bryan is working literal world wonders with legos to teach functional programming. I loved her <a href="https://speakerdeck.com/jennybc/data-rectangling">Data Rectangling</a> talk. The analogy between exponential families and data frames is so so good.</li>
<li>Hadley Wickham’s book on <a href="http://r4ds.had.co.nz/">R for data science</a> is everything you’d expect. Super clear, great examples, just a really nice book.</li>
<li>David Robinson is a machine put on this earth to create awesome data science stuff. Here he is <a href="http://varianceexplained.org/r/trump-tweets/">analyzing Trump’s tweets</a> and here he is on <a href="http://varianceexplained.org/r/hierarchical_bayes_baseball/">empirical Bayes modeling explained with baseball</a>.</li>
<li>Julia Silge and David created the <a href="https://cran.r-project.org/web/packages/tidytext/index.html">tidytext</a> package. This is a holy moly big contribution to NLP in R. They also have a killer <a href="http://tidytextmining.com/">book on tidy text mining</a>.</li>
<li>Julia used the package to do this <a href="http://juliasilge.com/blog/Reddit-Responds/">fascinating post</a> on mining Reddit after the election.</li>
<li>It would be hard to pick just five different major contributions from JJ Allaire (great interview <a href="https://www.rstudio.com/rviews/2016/10/12/interview-with-j-j-allaire/">here</a>), Joe Cheng, and the rest of the RStudio folks. RStudio is absolutely <em>churning</em> out awesome stuff at a rate that is hard to keep up with. I loved <a href="https://blog.rstudio.org/2016/10/05/r-notebooks/">R notebooks</a> and have used them extensively for teaching.</li>
<li>Konrad Kording and Brett Mensh full on mic dropped on how to write a paper with their <a href="http://biorxiv.org/content/early/2016/11/28/088278">10 simple rules piece</a>. Figure 1 from that paper should be affixed to the office of every student/faculty in the world permanently.</li>
<li>Yaniv Erlich just can’t stop himself from doing interesting things like <a href="https://seeq.io/">seeq.io</a> and <a href="https://dna.land/">dna.land</a>.</li>
<li>Thomaz Berisa and Joe Pickrell set up a freaking <a href="https://medium.com/the-seeq-blog/start-a-human-genomics-project-with-a-few-lines-of-code-dde90c4ef68#.g64meyjim">Python API for genomics projects</a>.</li>
<li>DataCamp continues to do great things. I love their <a href="https://www.datacamp.com/community/blog/an-interview-with-david-robinson-data-scientist-at-stack-overflow">DataChats</a> series and they have been rolling out tons of new courses.</li>
<li>Sean Rife and Michele Nuijten created <a href="http://statcheck.io/">statcheck.io</a> for checking papers for p-value calculation errors. This was all over the press, but I just like the site as a dummy proofing for myself.</li>
<li>This was the artificial intelligence <a href="https://twitter.com/notajf/status/795717253505413122">tweet of the year</a></li>
<li>I loved seeing PLoS Genetics start a policy of looking for papers in <a href="http://blogs.plos.org/plos/2016/10/the-best-of-both-worlds-preprints-and-journals/">biorxiv</a>.</li>
<li>Matthew Stephens’ <a href="https://medium.com/@biostatistics/guest-post-matthew-stephens-on-biostatistics-pre-review-and-reproducibility-a14a26d83d6f#.usisi7kd3">post</a> on his preprint getting pre-accepted and reproducibility is also awesome. Preprints are so hot right now!</li>
<li>Lorena Barba made this amazing <a href="https://hackernoon.com/barba-group-reproducibility-syllabus-e3757ee635cf#.2orb46seg">reproducibility syllabus</a> then <a href="https://twitter.com/LorenaABarba/status/809641955437051904">won the Leamer-Rosenthal prize</a> in open science.</li>
<li>Colin Dewey continues to do just stellar stellar work, this time on <a href="http://biorxiv.org/content/early/2016/11/30/090506">re-annotating genomics samples</a>. This is one of the key open problems in genomics.</li>
<li>I love FlowingData sooooo much. Here is one on <a href="http://flowingdata.com/2016/05/17/the-changing-american-diet/">the changing American diet</a>.</li>
<li>If you like computational biology and data science and like <em>super</em> detailed reports of meetings/talks then <a href="https://twitter.com/michaelhoffman">Michael Hoffman</a> is your man. How he actually summarizes that much information in real time is still beyond me.</li>
<li>I really really wish I had been at Alyssa Frazee’s talk at startup.ml but loved this <a href="http://www.win-vector.com/blog/2016/09/adversarial-machine-learning/">review of it</a>. Sampling, inverse probability weighting? Love that stats flavor!</li>
<li>I have followed Cathy O’Neil for a long time in her persona as <a href="https://twitter.com/mathbabedotorg">mathbabedotorg</a> so it is no surprise to me that her new book <a href="https://www.amazon.com/dp/B019B6VCLO/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1">Weapons of Math Destruction</a> is so good. One of the best works on the ethics of data out there.</li>
<li>A related and very important piece is on <a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">Machine bias in sentencing</a> by Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner at ProPublica.</li>
<li>Dimitris Rizopoulos created this stellar <a href="http://iprogn.blogspot.com/2016/03/an-integrated-shiny-app-for-course-on.html">integrated Shiny app</a> for his repeated measures class. I wish I could build things half this nice.</li>
<li>Daniel Engber’s piece on <a href="http://fivethirtyeight.com/features/who-will-debunk-the-debunkers/">Who will debunk the debunkers?</a> at fivethirtyeight just keeps getting more relevant.</li>
<li>I rarely am willing to watch a talk posted on the internet, but <a href="https://www.youtube.com/watch?v=hps9r7JZQP8">Amelia McNamara’s talk on seeing nothing</a> was an exception. Plus she talks so fast #jealous.</li>
<li>Sherri Rose’s post on <a href="http://drsherrirose.com/economic-diversity-and-the-academy-statistical-science">economic diversity in the academy</a> focuses on statistics but should be required reading for anyone thinking about diversity. Everything about it is impressive.</li>
<li>If you like your data science with a side of Python you should definitely be checking out Jake Vanderplas’s <a href="http://shop.oreilly.com/product/0636920034919.do">data science handbook</a> and the associated <a href="https://github.com/jakevdp/PythonDataScienceHandbook">Jupyter notebooks</a>.</li>
<li>I love Thomas Lumley <a href="http://www.statschat.org.nz/2016/12/19/sauna-and-dementia/">being snarky</a> about the stats news. It’s a guilty pleasure. If he ever collected them into a book I’d buy it (hint Thomas :)).</li>
<li>Dorothy Bishop’s blog is one of the ones I read super regularly. Her post on <a href="http://deevybee.blogspot.com/2016/12/when-is-replication-not-replication.html">When is a replication a replication</a> is just one example of her very clearly explaining a complicated topic in a sensible way. I find that so hard to do and she does it so well.</li>
<li>Ben Goldacre’s crowd is doing a bunch of interesting things. I really like their <a href="https://openprescribing.net/">OpenPrescribing</a> project.</li>
<li>I’m really excited to see what Elizabeth Rhodes does with the experimental design for the <a href="http://blog.ycombinator.com/moving-forward-on-basic-income/">Ycombinator Basic Income Experiment</a>.</li>
<li>Lucy D’Agostino McGowan made this <a href="http://www.lucymcgowan.com/hill-for-data-scientists.html">amazing explanation</a> of Hill’s criteria using XKCD.</li>
<li>It is hard to overstate how good Leslie McClure’s blog is. This post on <a href="https://statgirlblog.wordpress.com/2016/09/16/biostatistics-is-public-health/">biostatistics is public health</a> should be read aloud at every SPH in the US.</li>
<li>The ASA’s <a href="http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108">statement on p-values</a> is a really nice summary of all the issues around a surprisingly controversial topic. Ron Wasserstein and Nicole Lazar did a great job putting it together.</li>
<li>I really liked <a href="http://jama.jamanetwork.com/article.aspx?articleId=2513561&guestAccessKey=4023ce75-d0fb-44de-bb6c-8a10a30a6173">this piece</a> on the relationship between income and life expectancy by Raj Chetty and company.</li>
<li>Christie Aschwanden continues to be the voice of reason on the <a href="http://fivethirtyeight.com/features/failure-is-moving-science-forward/">statistical crises in science</a>.</li>
</ul>
<p>That’s all I have for now, I know I’m missing things. Maybe my New Year’s resolution will be to keep better track of the awesome things other people are doing :).</p>
The four eras of data
2016-12-16T00:00:00+00:00
http://simplystats.github.io/2016/12/16/the-four-eras-of-data
<p>I’m teaching <a href="http://jtleek.com/advdatasci16/">a class in data science</a> for our masters and PhD students here at Hopkins. I’ve been teaching a variation on this class since 2011 and over time I’ve introduced a number of new components to the class: high-dimensional data methods (2011), data manipulation and cleaning (2012), real, possibly not doable data analyses (2012,2013), peer reviews (2014), building <a href="http://swirlstats.com/">swirl tutorials</a> for data analysis techniques (2015), and this year building data analytic web apps/R packages.</p>
<p>I’m the least efficient teacher in the world, probably because I’m very self conscious about my teaching. So I always feel like I have to completely re-do my lecture materials every year I teach the class (I know, I know I’m a dummy). This year I was reviewing my notes on high-dimensional data and I was looking at this breakdown of the three eras of statistics from Brad Efron’s <a href="http://statweb.stanford.edu/~ckirby/brad/other/2010LSIexcerpt.pdf">book</a>:</p>
<blockquote>
<ol>
<li>The age of Quetelet and his successors, in which huge census-level data
sets were brought to bear on simple but important questions: Are there
more male than female births? Is the rate of insanity rising?</li>
<li>The classical period of Pearson, Fisher, Neyman, Hotelling, and their
successors, intellectual giants who developed a theory of optimal inference
capable of wringing every drop of information out of a scientific
experiment. The questions dealt with still tended to be simple — Is treatment
A better than treatment B? — but the new methods were suited to
the kinds of small data sets individual scientists might collect.</li>
<li>The era of scientific mass production, in which new technologies typified
by the microarray allow a single team of scientists to produce data
sets of a size Quetelet would envy. But now the flood of data is accompanied
by a deluge of questions, perhaps thousands of estimates or
hypothesis tests that the statistician is charged with answering together;
not at all what the classical masters had in mind.</li>
</ol>
</blockquote>
<p>While I think this is a useful breakdown, I realized I think about it in a slightly different way as a statistician. My breakdown goes more like this:</p>
<ol>
<li><strong>The era of not much data</strong> This is everything prior to about 1995 in my field. The era when we could only collect a few measurements at a time. The whole point of statistics was to try to optimally squeeze information out of a small number of samples - so you see methods like maximum likelihood and minimum variance unbiased estimators being developed.</li>
<li><strong>The era of lots of measurements on a few samples</strong> This one hit hard in biology with the development of the microarray and the ability to measure thousands of genes simultaneously. This is the same statistical problem as in the previous era but with a lot more noise added. Here you see the development of methods for multiple testing and regularized regression to separate signals from piles of noise.</li>
<li><strong>The era of a few measurements on lots of samples</strong> This era is overlapping to some extent with the previous one. Large scale collections of data from EMRs and Medicare are examples where you have a huge number of people (samples) but a relatively modest number of variables measured. Here there is a big focus on statistical methods for knowing how to model different parts of the data with hierarchical models and separating signals of varying strength with model calibration.</li>
<li><strong>The era of all the data on everything.</strong> This is an era that currently we as civilians don’t get to participate in. But Facebook, Google, Amazon, the NSA and other organizations have thousands or millions of measurements on hundreds of millions of people. Other than just sheer computing I’m speculating that a lot of the problem is in segmentation (like in era 3) coupled with avoiding crazy overfitting (like in era 2).</li>
</ol>
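<p>To make the “separate signals from piles of noise” idea in era 2 a bit more concrete, here is a minimal sketch of one widely used multiple testing procedure, the Benjamini-Hochberg step-up rule. It is only one example of the kind of method meant above, the function name <code>benjamini_hochberg</code> is my own, and the p-values are made up for illustration rather than taken from any real experiment.</p>

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a hypothesis is rejected
    while controlling the false discovery rate at level alpha."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject the hypotheses with the max_k smallest p-values.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Made-up p-values, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = [p <= 0.05 for p in p_values]          # test each hypothesis on its own
bh = benjamini_hochberg(p_values, alpha=0.05)  # adjust for testing many at once

print("rejected without correction:", sum(naive))
print("rejected with Benjamini-Hochberg:", sum(bh))
```

<p>With thousands of genes instead of ten p-values the gap between the two counts matters even more, since a fixed per-test cutoff lets many pure-noise results through.</p>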
<p>I’ve focused here on the implications of these eras from a statistical modeling perspective, but as we discussed in my class, era 4 coupled with advances in machine learning methods means that there are social, economic, and behavioral implications of these eras as well.</p>
Not So Standard Deviations Episode 28 - Writing is a lot Harder than Just Talking
2016-12-15T00:00:00+00:00
http://simplystats.github.io/2016/12/15/nssd-episode-28
<p>Hilary and I talk about building data science products that provide a good user experience while adhering to some kind of ground truth, whether it’s in medicine, education, news, or elsewhere. Also Gilmore Girls.</p>
<p>If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Show notes:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Bradford_Hill_criteria">Hill’s criteria for causation</a></li>
<li><a href="https://www.oreilly.com/topics/oreilly-bots-podcast">O’Reilly Bots Podcast</a></li>
<li><a href="http://www.nhtsa.gov/nhtsa/av/index.html">NHTSA’s Federal Automated Vehicles Policy</a></li>
<li>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>. And please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</li>
<li>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</li>
<li>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-28-writing-is-a-lot-harder-than-just-talking">Download the audio for this episode</a></p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/297930039&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
What is going on with math education in the US?
2016-12-09T00:00:00+00:00
http://simplystats.github.io/2016/12/09/pisa-us-math
<p>When colleagues with young children seeking information about schools
ask me if I like the Massachusetts public school my
children attend, my answer is always the same: “it’s great…except for
math”. The fact is that in our household we supplement our kids’ math
education with significant extra curricular work in order to ensure
that they receive a math education comparable to what we received as
children in the public system.</p>
<p>The latest Program for International Student
Assessment (PISA)
<a href="http://www.businessinsider.com/pisa-worldwide-ranking-of-math-science-reading-skills-2016-12">results</a>
show that there is a general problem with math education in the
US. Were it a country, Massachusetts would have been in second place
in reading, sixth in science, but 20th in math, only ten points above
the OECD average of 490. The US as a whole did not fare nearly as well
as MA, and the same discrepancy between math and the other two
subjects was present. In fact, among the top 30 performing
countries ranked by their average of science and reading scores, the
US has, by far, the largest discrepancy between math and
the other two subjects tested by PISA. The difference of 27 was
substantially greater than the second largest difference,
which came from Finland at 17. Massachusetts had a difference of 28.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/pisa-2015-math-v-others.png" alt="PISA 2015 Math minus average of science and reading" /></p>
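<p>The quantity plotted above is simply the math score minus the average of the science and reading scores. A minimal sketch of that calculation follows; the function name <code>math_gap</code> is my own, and the scores for “Country A” and “Country B” are hypothetical placeholders, not actual PISA results.</p>

```python
from statistics import mean


def math_gap(math, science, reading):
    """Math score minus the average of the science and reading scores."""
    return math - mean([science, reading])


# Hypothetical scores, made up for illustration; not actual PISA results.
scores = {
    "Country A": {"math": 480, "science": 500, "reading": 504},
    "Country B": {"math": 510, "science": 515, "reading": 509},
}

gaps = {name: math_gap(**s) for name, s in scores.items()}

# Sort so that the largest shortfall in math comes first.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:+.1f}")
```

<p>A negative value means math lags behind the other two subjects, which is the pattern described above for the US and Massachusetts.</p>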
<p>If we look at the trend of this difference since PISA was started 16
years ago, we see a disturbing progression. While science and reading
have
<a href="http://www.artofteachingscience.org/wp-content/uploads/2013/12/Screen-Shot-2013-12-17-at-9.28.38-PM.png">remained stable, math has declined</a>. In
2000 the difference between the results in math and the other subjects
was only 8.5. Furthermore,
the US is not performing exceptionally well in any subject:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/pisa-2015-scatter.png" alt="PISA 2015 Math versus average of science and reading" /></p>
<p>So what is going on? I’d love to read theories in the comment
section. From my experience comparing my kids’ public schools now
with those that I attended, I have one theory of my own. When I was a
kid there was a math textbook. Even when a teacher was bad, it
provided structure and an organized alternative for learning on your
own. Today this approach is seen as being “algorithmic” and has fallen
out of favor. “Project based learning” coupled with group activities have
become popular replacements.</p>
<p>Project based learning is great in principle. But, speaking from
experience, I can say it is very hard to come up with good projects,
even for highly trained mathematical minds. And it is certainly much
more time consuming for the instructor than following a
textbook. Teachers don’t have more time now than they did 30 years ago
so it is no surprise that this new more open approach leads to
improvisation and mediocre lessons. A recent example of a pointless
math project involved 5th graders picking a number and preparing a
colorful poster showing “interesting” facts about this number. To
make things worse in terms of math skills, students are often rewarded
for effort, while correctness is secondary and often disregarded.</p>
<p>Regardless of the reason for the decline, given the trends
we are seeing, we need to rethink the approach to math education. Math
education may have had its problems in the past, but recent evidence
suggests that the reforms of the past few decades seem to have
only worsened the situation.</p>
<p>Note: To make these plots I downloaded the data and read it into R as described <a href="https://www.r-bloggers.com/pisa-2015-how-to-readprocessplot-the-data-with-r/">here</a>.</p>
Not So Standard Deviations Episode 27 - Special Guest Amelia McNamara
2016-11-30T00:00:00+00:00
http://simplystats.github.io/2016/11/30/nssd-episode-27
<p>I had the pleasure of sitting down with Amelia McNamara, Visiting Assistant Professor of Statistical and Data Sciences at Smith College, to talk about data science, data journalism, visualization, the problems with R, and adult coloring books.</p>
<p>If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://www.science.smith.edu/~amcnamara/index.html">Amelia McNamara’s web site</a></p>
</li>
<li>
<p><a href="http://datascience.columbia.edu/mark-hansen">Mark Hansen</a></p>
</li>
<li>
<p><a href="https://www.youtube.com/watch?v=dD36IajCz6A">Listening Post</a></p>
</li>
<li>
<p><a href="http://www.nytimes.com/video/arts/1194817116105/moveable-type.html">Moveable Type</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Alan_Kay">Alan Kay</a></p>
</li>
<li>
<p><a href="https://harc.ycr.org/">HARC (Human Advancement Research Community)</a></p>
</li>
<li>
<p><a href="http://www.vpri.org/index.html">VPRI (Viewpoints Research Institute)</a></p>
</li>
<li>
<p><a href="https://www.youtube.com/watch?v=hps9r7JZQP8">Interactive essays</a></p>
</li>
<li>
<p><a href="https://rafaelaraujoart.com/products/golden-ratio-coloring-book">Golden Ratio Coloring Book</a></p>
</li>
<li>
<p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>. And please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</p>
</li>
<li>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
</li>
<li>
<p>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-27-special-guest-amelia-mcnamara">Download the audio for this episode</a></p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/295593774&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Help choose the Leek group color palette
2016-11-17T00:00:00+00:00
http://simplystats.github.io/2016/11/17/leekgroup-colors
<p>My research group just recently finished a paper where several different teams within the group worked on different analyses. If you are interested, the paper describes the <a href="http://biorxiv.org/content/early/2016/08/08/068478">recount resource</a> which includes processed versions of thousands of human RNA-seq data sets.</p>
<p>As part of this project each group had to contribute some plots to the paper. One thing that I noticed is that each person used their own color palette and theme when building the plots. When we wrote the paper this made it a little harder for the figures to all fit together - especially when different group members worked on a single panel of a multi-panel plot.</p>
<p>So I started thinking about setting up a Leek group theme for both base R and ggplot2 graphics. One of the first problems was that every group member had their own opinion about what the best color palette would be. So we are running a little competition to determine what the official Leek group color palette for plots will be in the future.</p>
<p>As part of that process, one of my awesome postdocs, Shannon Ellis, decided to collect some data on how people perceive different color palettes. The survey is here:</p>
<p>https://docs.google.com/forms/d/e/1FAIpQLSfHMXVsl7pxYGarGowJpwgDSf9lA2DfWJjjEON1fhuCh6KkRg/viewform?c=0&w=1</p>
<p>If you have a few minutes and have an opinion about colors (I know you do!) please consider participating in our little poll and helping to determine the future of Leek group plots!</p>
Open letter to my lab: I am not "moving to Canada"
2016-11-11T00:00:00+00:00
http://simplystats.github.io/2016/11/11/im-not-moving-to-canada
<p>Dear Lab Members,</p>
<p>I know that the results of Tuesday’s election have many of you
concerned about your future. You are not alone. I am concerned
about my future as well. But I want you to know that I have no plans
to go anywhere and I intend to dedicate as much time to our
projects as I always have. Meeting, discussing ideas and putting them
into practice with you is, by far, the best part of my job.</p>
<p>We are all concerned that if certain campaign promises are kept many
of our fellow citizens may need our help. If this happens, then we
will pause to do whatever we can to help. But I am currently
cautiously optimistic that we will be able to continue focusing on
helping society in the best way we know how: by doing scientific
research.</p>
<p>This week Dr. Francis Collins assured us that there is strong
bipartisan support for scientific research. As an example consider
<a href="http://www.nytimes.com/2015/04/22/opinion/double-the-nih-budget.html?_r=0">this op-ed</a>
in which Newt Gingrich advocates for doubling the NIH budget. There
also seems to be wide consensus in this country that scientific
research is highly beneficial to society and an understanding that to
do the best research we need the best of the best no matter their
gender, race, religion or country of origin. Nothing good comes from
creative, intelligent, dedicated people leaving science.</p>
<p>I know there is much uncertainty but, as of now, there is nothing stopping us
from continuing to work hard. My plan is to do just that and I hope
you join me.</p>
Not all forecasters got it wrong: Nate Silver does it again (again)
2016-11-09T00:00:00+00:00
http://simplystats.github.io/2016/11/09/not-all-forecasters-got-it-wrong
<p>Four years ago we
<a href="http://simplystatistics.org/2012/11/07/nate-silver-does-it-again-will-pundits-finally-accept/">posted</a>
on Nate Silver’s, and other forecasters’, triumph over pundits. In
contrast, after yesterday’s presidential election results contradicted
most polls and data-driven forecasters, several news articles came out
wondering how this happened. It is important to point
out that not all forecasters got it wrong. Statistically
speaking, Nate Silver, once again, got it right.</p>
<p>To show this, below I include a plot showing the expected margin of
victory for Clinton versus the actual results for the most competitive states provided by 538. It includes the uncertainty bands provided by 538 in
<a href="http://projects.fivethirtyeight.com/2016-election-forecast/">this site</a>
(I eyeballed the band sizes to make the plot in R, so they are not
exactly like 538’s).</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/us-election-2016-538-prediction.png" alt="538-2016-election" /></p>
<p>Note that if these are 95% confidence/credible intervals, 538 got 1
wrong. This is about what we expect since 15/16 is roughly 94%, close to
95%. Furthermore, judging by the plot <a href="http://projects.fivethirtyeight.com/2016-election-forecast/">here</a>, 538 estimated the popular vote margin to be 3.6%
with a confidence/credible interval of about 5%.
This too was an accurate
prediction since Clinton is going to win the popular vote by
about 1% <del>0.5%</del> (note this final result is in the margin of error of
several traditional polls as well). Finally, when other forecasters were
giving Trump between 14% and 0.1% chances of winning, 538 gave
him about a
30% chance which is slightly more than what a team has when down 3-2
in the World Series. In contrast, in 2012 538 gave Romney only a 9%
chance of winning. Also, remember, if in ten election cycles you
call it for someone with a 70% chance, you should expect to get it wrong about 3
times. If you get it right every time then your 70% statement was wrong.</p>
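<p>The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it treats the 16 states as independent, which is a simplifying assumption, and the function name <code>prob_misses</code> is made up for illustration.</p>

```python
from math import comb


def prob_misses(n_states: int, coverage: float, k: int) -> float:
    """Binomial probability that exactly k of n_states intervals miss.

    Assumes every interval has the same coverage and that misses are
    independent across states (a simplification).
    """
    miss = 1.0 - coverage
    return comb(n_states, k) * miss**k * (1.0 - miss) ** (n_states - k)


n_states = 16  # implied by the 15/16 figure in the text

# Expected number of states falling outside a 95% interval.
expected_misses_95 = n_states * (1 - 0.95)

# Probability of seeing at most one miss out of 16 with 95% intervals.
p_at_most_one_95 = prob_misses(n_states, 0.95, 0) + prob_misses(n_states, 0.95, 1)

# Ten calls at 70% each: expected wrong calls, and the chance of a perfect record.
expected_wrong_calls = 10 * (1 - 0.70)
p_all_ten_right = 0.70**10

print(f"expected misses with 95% intervals: {expected_misses_95:.2f}")
print(f"P(at most one miss): {p_at_most_one_95:.3f}")
print(f"expected wrong calls in ten 70% forecasts: {expected_wrong_calls:.1f}")
print(f"P(all ten 70% calls are right): {p_all_ten_right:.3f}")
```

<p>Under these assumptions fewer than one miss is expected with 95% intervals, so a single miss is unremarkable, and a forecaster whose ten 70% calls all came true would be seeing something that should happen less than 3% of the time.</p>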
<p>So how did 538 outperform all other forecasters? First, as far as I
can tell they model the possibility of an overall bias, modeled as a
random effect, that affects
every state. This bias can be introduced by systematic
lying to pollsters or undersampling some group. Note that this bias
can’t be estimated from data from
one election cycle but its variability can be estimated from
historical data. 538 appears
to estimate the standard error of this term to be
about 2%. More details on this are included <a href="http://simplystatistics.org/html/midterm2012.html">here</a>. In 2016 we saw this bias and you can see it in
the plot above (more points are above the line than below). The
confidence bands account for this source of variability and furthermore
their simulations account for the strong correlation you will see
across states: the chance of seeing an upset in Pennsylvania, Wisconsin,
and Michigan is <strong>not</strong> the product of the chances of an upset in each. In
fact it’s much higher. Another advantage 538 had is that they somehow
were able to predict a systematic, not random, bias against
Trump. You can see this by
comparing their adjusted data to the raw data (the adjustment favored
Trump by about 1.5 percentage points on average). We can clearly see this when comparing the 538
estimates to The Upshot’s:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/us-election-2016-538-v-upshot.png" alt="538-2016-election" /></p>
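<p>The point about a shared bias term can be illustrated with a small simulation. This is a sketch, not 538’s model: the three poll leads, the state-level standard deviation, and the function name <code>upset_probabilities</code> are all assumptions made up for illustration. Only the 2-point standard deviation for the shared bias comes from the discussion above.</p>

```python
import random


def upset_probabilities(leads, shared_sd=2.0, state_sd=2.0, n_sims=100_000, seed=1):
    """Simulate state results as poll lead + shared bias + state-specific noise.

    Returns the estimated probability that every state flips at once, the
    product of the estimated per-state flip probabilities, and those
    per-state probabilities themselves.
    """
    rng = random.Random(seed)
    joint = 0
    single = [0] * len(leads)
    for _ in range(n_sims):
        bias = rng.gauss(0.0, shared_sd)  # one draw shared by every state
        flips = [lead + bias + rng.gauss(0.0, state_sd) < 0 for lead in leads]
        for i, flipped in enumerate(flips):
            single[i] += flipped
        joint += all(flips)
    marginals = [count / n_sims for count in single]
    product = 1.0
    for p in marginals:
        product *= p
    return joint / n_sims, product, marginals


# Hypothetical poll leads in percentage points for three states (not 538's numbers).
joint, product, marginals = upset_probabilities([4.0, 5.0, 4.0])
print(f"per-state upset probabilities: {marginals}")
print(f"product of the three: {product:.5f}")
print(f"simulated probability of all three upsets: {joint:.5f}")
```

<p>Because the same bias draw moves every state in the same direction, the simulated chance of all three upsets should come out well above the product of the individual chances, which is the behavior the paragraph above describes.</p>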
<p>The fact that 538 did so much better than other forecasters should
remind us how hard it is to do data analysis in real life. Knowing
math, statistics and programming is not enough. It requires experience
and a deep understanding of the nuances related to the specific
problem at hand. Nate Silver and the 538 team seem to understand this
more than others.</p>
<p>Update: Jason Merkin points out (via Twitter) that 538 provides 80% credible
intervals.</p>
Data scientist on a chromebook take two
2016-11-08T00:00:00+00:00
http://simplystats.github.io/2016/11/08/chromebook-part2
<p>My friend Fernando showed me his collection of <a href="https://twitter.com/jtleek/status/795749713966497793">old Apple dongles</a> that no longer work with the latest generation of Apple devices. This, coupled with the announcement of the MacBook Pro that promises way more dongles and mostly the same computing, had me freaking out about my computing platform for the future. I’ve been using cloudy tools for more and more of what I do and so it had me wondering if it was time to go back and try my <a href="http://simplystatistics.org/2012/01/09/a-statistician-and-apple-fanboy-buys-a-chromebook-and/">Chromebook experiment</a> again. Basically the question is whether I can do everything I need to do comfortably on a Chromebook.</p>
<p>So to run the experiment I got a brand new <a href="https://www.asus.com/us/Notebooks/ASUS_Chromebook_Flip_C100PA/">ASUS Chromebook Flip</a> and the connector I need to plug it into HDMI monitors (there is no escaping at least one dongle I guess :(). Here is what that badboy looks like in my home office with Apple superfanboy Roger on the screen.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/chromebook2.jpg" alt="chromebook2" /></p>
<p>In terms of software there have been some major improvements since I last tried this experiment out. Some of these I talk about in my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a>. As of this writing this is my current setup:</p>
<ul>
<li>Music on <a href="https://play.google.com">Google Play</a></li>
<li>LaTeX on <a href="https://www.overleaf.com">Overleaf</a></li>
<li>Blog/website/code on <a href="https://github.com/">GitHub</a></li>
<li>R programming on an <a href="http://www.louisaslett.com/RStudio_AMI/">Amazon AMI with RStudio loaded</a> although <a href="https://twitter.com/earino/status/795750908457984000">I hear</a> there may be other good options that I should try.</li>
<li>Email/Calendar/Presentations/Spreadsheets/Docs with <a href="https://www.google.com/">Google</a> products</li>
<li>Twitter with <a href="https://tweetdeck.twitter.com/">Tweetdeck</a></li>
</ul>
<p>That handles the vast majority of my workload so far (it’s only been a day :)). But I would welcome suggestions and I’ll report back either when I give up or if things are still going strong in a little while….</p>
Not So Standard Deviations Episode 25 - How Exactly Do You Pronounce SQL?
2016-10-28T00:00:00+00:00
http://simplystats.github.io/2016/10/28/nssd-episode-25
<p>Hilary and I go through the overflowing mailbag to respond to listener questions! Topics include causal inference in trend modeling, regression model selection, using SQL, and data science certification.</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="https://www.amazon.com/gp/product/B0017LNHY2/">Professor Kobre’s Lightscoop Standard Version Bounce Flash Device</a></p>
</li>
<li>
<p><a href="https://www.speechpad.com">Speechpad</a></p>
</li>
<li>
<p><a href="https://www.amazon.com/gp/product/0544703391/">Speaking American by Josh Katz</a></p>
</li>
<li>
<p><a href="https://medium.com/@josh_nussbaum/data-sets-are-the-new-server-rooms-40fdb5aed6b0?_hsenc=p2ANqtz-8IHAReMPP2JjyYs6TqyMYCnjUapQdLQFEaQOjNX9BfUhZV2nzXWwy2NHJHrCs-VN67GxT4djKCUWq8tkhTyiQkb965bg&_hsmi=36470868#.wybl0l3p7">Data Sets Are The New Server Rooms</a></p>
</li>
<li>
<p><a href="http://simplystatistics.org/2016/10/26/datasets-new-server-rooms/">Are Datasets the New Server Rooms?</a></p>
</li>
<li>
<p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>. And please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</p>
</li>
<li>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
</li>
<li>
<p>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-25-how-exactly-do-you-pronounce-sql">Download the audio for this episode</a></p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/290164484&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Are Datasets the New Server Rooms?
2016-10-26T00:00:00+00:00
http://simplystats.github.io/2016/10/26/datasets-new-server-rooms
<p>Josh Nussbaum has an <a href="https://medium.com/@josh_nussbaum/data-sets-are-the-new-server-rooms-40fdb5aed6b0?_hsenc=p2ANqtz-8IHAReMPP2JjyYs6TqyMYCnjUapQdLQFEaQOjNX9BfUhZV2nzXWwy2NHJHrCs-VN67GxT4djKCUWq8tkhTyiQkb965bg&_hsmi=36470868#.wz8f23tak">interesting post</a> over at Medium about whether massive datasets are the new server rooms of tech business.</p>
<p>The analogy comes from the “old days” where in order to start an Internet business, you had to buy racks and servers, rent server space, buy network bandwidth, license expensive server software, backups, and on and on. In order to do all that up front, it required a substantial amount of capital just to get off the ground. As inconvenient as this might have been, it provided an immediate barrier to entry for any other competitors who weren’t able to raise similar capital.</p>
<p>Of course,</p>
<blockquote>
<p>…the emergence of open source software and cloud computing completely eviscerated the costs and barriers to starting a company, leading to deflationary economics where one or two people could start their company without the large upfront costs that were historically the hallmark of the VC industry.</p>
</blockquote>
<p>So if startups don’t have huge capital costs in the beginning, what costs <em>do</em> they have? Well, for many new companies that rely on machine learning, they need to collect data.</p>
<blockquote>
<p>As a startup collects the data necessary to feed their ML algorithms, the value the product/service provides improves, allowing them to access more customers/users that provide more data and so on and so forth.</p>
</blockquote>
<p>Collecting huge datasets ultimately costs money. The sooner a startup can raise money to get that data, the sooner they can defend themselves from competitors who may not yet have collected the huge datasets for training their algorithms.</p>
<p>I’m not sure the analogy between datasets and server rooms quite works. Even back when you had to pay a lot of up-front costs to set up servers and racks, a lot of that technology was already a commodity, and anyone could have access to it for a price.</p>
<p>I see massive datasets used to train machine learning algorithms as more like the new proprietary software. The startups of yore spent a lot of time writing custom software for what we might now consider mundane tasks. This was a time-consuming activity but the software that was developed had value and was a differentiator for the company. Today, many companies write complex machine learning algorithms, but those algorithms and their implementations are quickly becoming commodities. So the only thing that separates one company from another is the amount and quality of data that they have to train those algorithms.</p>
<p>Going forward, it will be interesting to see what these companies will do with those massive datasets once they no longer need them. Will they “open source” them and make them available to everyone? Could there be an open data movement analogous to the open source movement?</p>
<p>For the most part, I doubt it. While I think many today would perhaps sympathize with the sentiment that <a href="https://www.gnu.org/gnu/manifesto.en.html">software shouldn’t have owners</a>, those same people I think would argue vociferously that data most certainly do have owners. I’m not sure how I’d feel if Facebook made all their data available to anyone. That said, many datasets are made available by various businesses, and as these datasets grow in number and in usefulness, we may see a day where the collection of data is not a key barrier to entry, and that you can train your machine learning algorithm on whatever is out there.</p>
Distributed Masochism as a Pedagogical Model
2016-10-20T00:00:00+00:00
http://simplystats.github.io/2016/10/20/distributed-masochism-as-a-pedagogical-model
<p><em>Editor’s note: This is a guest post by
<a href="http://seankross.com/">Sean Kross</a>. Sean is a software developer in the
Department of Biostatistics at the Johns Hopkins Bloomberg School of Public
Health. Sean has contributed to several of our specializations including
<a href="https://www.coursera.org/specializations/jhu-data-science">Data Science</a>,
<a href="https://www.coursera.org/specializations/executive-data-science">Executive Data Science</a>,
and <a href="https://www.coursera.org/specializations/r">Mastering Software Development in R</a>.
He tweets <a href="https://twitter.com/seankross">@seankross</a>.</em></p>
<p>Over the past few months I’ve been helping Jeff develop the Advanced Data
Science class he’s teaching at the Johns Hopkins Bloomberg School of Public
Health. We’ve been trying to identify technologies that we can teach to
students which (we hope) will enable them to rapidly prototype data-based
software applications which will serve a purpose in public health. We started with
technologies that we’re familiar with (R, Shiny, static websites) but we’re
also trying to teach ourselves new technologies (the Amazon Alexa Skills API,
iOS and Swift). We’re teaching skills that we know intimately along with skills
that we’re learning on the fly which is a style of teaching that we’ve practiced
<a href="https://www.coursera.org/specializations/jhu-data-science">several</a>
<a href="https://www.coursera.org/specializations/r">times</a>.</p>
<p>Jeff and I have come to realize that while building new courses with
technologies that are new to us we experience particular pains and frustrations
which, when documented, become valuable learning resources for our students.
This process of documenting new-tech-induced pain is only a preliminary step.
When we actually launch classes either online or
in person our students run into new frustrations which we respond to with
changes to either documentation or course content. This process of quickly
iterating on course material is especially enhanced in online courses where the
time span for a course lasts a few weeks compared to a full semester, so kinks
in the course are ironed out at a faster rate compared to traditional in-person
courses. All of the material in our courses is open-source and available on
GitHub, and we teach our students how to use Git and GitHub. We can take
advantage of improvements and contributions the students think we should make
to our courses through pull requests that we receive. Student contributions
further reduce the overall start-up pain experienced by other students.</p>
<p>With students from all over the world participating in our online courses we’re
unable to anticipate every technical need considering different locales,
languages, and operating systems. Instead of being anxious about this reality
we depend on a system of “distributed masochism” whereby documenting every
student’s unique technical learning pains is an important aspect of improving
the online learning experience. Since we only have a few months head start
using some of these technologies compared to our students it’s likely that as
instructors we’ve recently climbed a similar learning curve which makes it
easier for us to help our students. We believe that this approach of teaching
new technologies by allowing any student to contribute to open course material
allows a course to rapidly adapt to students’ needs and to the inevitable
changes and upgrades that are made to new technologies.</p>
<p>I’m extremely interested in communicating with anyone else who is using similar techniques, so if you’re interested please contact me via Twitter (<a href="https://twitter.com/seankross">@seankross</a>) or send me an email: sean at seankross.com.</p>
Not So Standard Deviations Episode 24 - 50 Minutes of Blathering
2016-10-16T00:00:00+00:00
http://simplystats.github.io/2016/10/16/nssd-episode-24
<p>Another IRL episode! Hilary and I met at a Jimmy John’s to talk data
science, like you do. Topics covered include RStudio Conf, polling,
millennials, Karl Broman, and more!</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>. And please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="https://www.rstudio.com/conference/">rstudio::conf</a></p>
</li>
<li>
<p><a href="http://www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html?_r=0">We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Millennials">Millennials</a></p>
</li>
<li>
<p><a href="http://kbroman.org">Karl Broman</a></p>
</li>
<li>
<p><a href="https://www.rstudio.com/2016/10/12/interview-with-j-j-allaire/">Interview with J.J. Allaire</a></p>
</li>
<li>
<p><a href="http://varianceexplained.org/r/year_data_scientist/">One Year at Stack Overflow</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-24-50-minutes-of-blathering">Download the audio for this episode</a></p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/287815210&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Should I make a chatbot or a better FAQ?
2016-10-14T00:00:00+00:00
http://simplystats.github.io/2016/10/14/chatabot-or-faq
<p>Roger pointed me to this <a href="https://www.theinformation.com/behind-facebooks-messenger-missteps">interesting article</a> (paywalled, sorry!) about Facebook’s chatbot service. I think the article made a couple of interesting points. The first thing I thought was interesting was their explicit acknowledgement of the process I outlined in a previous post for building an AI startup - (1) convince (or in this case pay) some humans to be your training set, and (2) collect the data on the humans and then use it to build your AI.</p>
<p>The other point that is pretty fascinating is that they realized how many data points they would need before they could reasonably replace a human with an AI chatbot. The original estimate was tens of thousands and the ultimate number was millions or more. I have been thinking a lot that the AI “revolution” is just a tradeoff between parameters and data points. If you have a billion parameter prediction algorithm it may work pretty awesome - as long as you have a few hundred billion data points to train it with.</p>
<p>But the theme of the article was that chatbots may have had some mis-steps/may not be ready for prime time. I think the main reason is that at the moment most AI efforts can only report facts, not intuit intention and alter the question for the user or go beyond the facts/state of the world.</p>
<p>One example I’ve run into recently was booking a ticket on an airline. I wanted to know if I could make a certain change to my ticket. The airline didn’t have any information about the change I wanted to make online. After checking thoroughly I clicked on the “Chat with an agent” button and was directed to what was clearly a chatbot. The chatbot asked a question or two and then sent me to the “make changes to a ticket” page of the website.</p>
<p>I eventually had to call and get a person on the phone, because what I wanted to ask about didn’t apply to the public information. They set me straight and I booked the ticket. The chatbot wasn’t helpful because it could only respond with information it had available on the website. It couldn’t identify a new situation, realize it had to ask around, figure out there was an edge case, and then make a ruling/help out.</p>
<p>I would guess that most of the time if a person interacts with a chatbot they are doing it only because they already looked at all the publicly available information on the FAQ, etc. and couldn’t find it. So an alternative solution, which would require a lot less work and a much smaller training set, is to just have a more complete FAQ.</p>
<p>The question to me is does anyone other than Facebook or Google have a big enough training set to make a chatbot worth it?</p>
The Dangers of Weighting Up a Sample
2016-10-12T00:00:00+00:00
http://simplystats.github.io/2016/10/12/weighting-survey
<p>There’s a <a href="http://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html">great story</a> by Nate Cohn over at the New York Times’ Upshot
about the dangers of “weighting up” a sample from a survey. In this
case, it concerns a U.S.C./LA Times poll asking who people will
vote for in the presidential election:</p>
<blockquote>
<p>The U.S.C./LAT poll weights for many tiny categories: like 18-to-21-year-old men, which U.S.C./LAT estimates make up around 3.3 percent of the adult citizen population. Weighting simply for 18-to-21-year-olds would be pretty bold for a political survey; 18-to-21-year-old men is really unusual.</p>
</blockquote>
<p>The U.S.C./LA Times poll apparently goes even further:</p>
<blockquote>
<p>When you start considering the competing demands across multiple categories, it can quickly become necessary to give an astonishing amount of extra weight to particularly underrepresented voters — like 18-to-21-year-old black men. This wouldn’t be a problem with broader categories, like those 18 to 29, and there aren’t very many national polls that are weighting respondents up by more than eight or 10-fold. The extreme weights for the 19-year-old black Trump voter in Illinois are not normal.</p>
</blockquote>
<p>It’s worth noting (as a good thing) that the U.S.C./LA Times poll data is completely open, thus allowing the NYT to reproduce this entire analysis.</p>
<p>I haven’t done much in the way of survey analyses, but I’ve done some
inverse probability weighting and in my experience it can be a tricky
procedure in ways that are not always immediately obvious. The article
discusses weight trimming, but also notes the dangers of that
procedure. Overall, a good treatment of a complex issue.</p>
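<p>To see why a single heavily weighted respondent can move an estimate, and what trimming does about it, here is a toy example. All of the numbers are made up for illustration (a sample of 100 with one respondent weighted up 30-fold, and a cap of 8), and the helper names <code>weighted_share</code> and <code>trim</code> are hypothetical. It is not the U.S.C./LA Times weighting scheme.</p>

```python
def weighted_share(responses, weights):
    """Weighted proportion of respondents whose response is 1."""
    return sum(r * w for r, w in zip(responses, weights)) / sum(weights)


def trim(weights, cap):
    """Cap every weight at `cap`, a simple form of weight trimming."""
    return [min(w, cap) for w in weights]


# Made-up sample: 99 respondents with weight 1 (49 answer 1, 50 answer 0),
# plus one respondent from a tiny demographic cell who is weighted up 30-fold.
responses = [1] * 49 + [0] * 50 + [1]
weights = [1.0] * 99 + [30.0]

unweighted = sum(responses) / len(responses)           # 50 / 100
weighted = weighted_share(responses, weights)           # (49 + 30) / 129
trimmed = weighted_share(responses, trim(weights, 8))   # (49 + 8) / 107

print(f"unweighted share: {unweighted:.3f}")
print(f"weighted share:   {weighted:.3f}")
print(f"trimmed share:    {trimmed:.3f}")
```

<p>In this toy sample one person shifts the estimate from 50% to roughly 61%, and capping the weight at 8 pulls it back to about 53%. Trimming reduces that sensitivity, but the trimmed sample no longer matches the population targets for the tiny cell, which is one reason the procedure has dangers of its own.</p>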
Information and VC Investing
2016-10-03T00:00:00+00:00
http://simplystats.github.io/2016/10/03/the-information-vc
<p>Sam Lessin at The Information has a <a href="http://go.theinformation.com/xXfQ5plmVMI">nice post</a> (sorry, paywall, but it’s a great publication) about how increased measurement and analysis are changing the nature of venture capital investing.</p>
<blockquote>
<p>This brings me back to what is happening at series A financings. Investors have always, obviously, tried to do diligence at all financing rounds. But series A investments used to be an exercise in a few top-level metrics a company might know, some industry interviews and analysis, and a whole lot of trust. The data that would drive capital market efficiency usually just wasn’t there, so capital was expensive and there were opportunities for financiers. Now, I am seeing more and more that after a seed round to boot up most companies, the backbone of a series A financing is an intense level of detail in reporting and analytics. It can be that way because the companies have the data</p>
</blockquote>
<p>I’ve seen this happen in other areas where data comes in to disrupt the way things are done. Good analysis only gives you an advantage if no one else is doing it. Once everyone accepts the idea and everyone has the data (and a good analytics team), there’s no more value left in the market.</p>
<p>Time to search elsewhere.</p>
papr - it's like tinder, but for academic preprints
2016-10-03T00:00:00+00:00
http://simplystats.github.io/2016/10/03/papr
<p>As part of the <a href="http://jhudatascience.org/">Johns Hopkins Data Science Lab</a> we are setting up a web and mobile <a href="http://jhudatascience.org/prototyping/">data product prototyping shop</a>. As part of that process I’ve been working on different types of very cheap and easy to prototype apps. A few days ago I posted about creating a <a href="http://simplystatistics.org/2016/08/26/googlesheets/">distributed data collection app with Google Sheets</a>.</p>
<p>So for fun I built another kind of app. This one I’m calling <a href="https://jhubiostatistics.shinyapps.io/papr/">papr</a> and it’s sort of like “Tinder for preprints”. I scraped all of the papers out of the <a href="http://biorxiv.org/">http://biorxiv.org/</a> database. When you open the app you see one at random and you can rate it according to two axes:</p>
<ul>
<li><em>Is the paper interesting?</em> - a paper can be rated as exciting or boring. We leave the definitions of those terms up to you.</li>
<li><em>Is the paper correct or questionable?</em> - a paper can either be solidly correct or potentially questionable in its results. We leave the definitions of those terms up to you.</li>
</ul>
<p>When you click on your rating you are shown another randomly generated paper from bioRxiv. You can “level up” to different levels if you rate more papers. You can also download your ratings at any time.</p>
<p>If you have any feedback on the app I’d love to hear it and if anyone knows how to get custom domain names to work with shinyapps.io I’d also love to hear from you. I tried the instructions with no luck…</p>
<p>Try the app here:</p>
<p>https://jhubiostatistics.shinyapps.io/papr/</p>
Not So Standard Deviations Episode 23 - Special Guest Walt Hickey
2016-10-01T00:00:00+00:00
http://simplystats.github.io/2016/10/01/nssd-episode-23
<p>Hilary and Roger invite Walt Hickey of FiveThirtyEight.com on to the show to talk about polling, movies, and data analysis reproducibility (of course).</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</p>
<p>Show Notes:</p>
<ul>
<li>
<p><a href="http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/">FiveThirtyEight’s polling methodology</a></p>
</li>
<li>
<p><a href="https://twitter.com/walthickey">Walt Hickey on Twitter</a></p>
</li>
<li>
<p><a href="http://fivethirtyeight.com/features/the-20-most-extreme-cases-of-the-book-was-better-than-the-movie/">The 20 Most Extreme Cases Of ‘The Book Was Better Than The Movie’</a></p>
</li>
<li>
<p><a href="http://practicaltypography.com">Matthew Butterick Typography</a></p>
</li>
<li>
<p><a href="http://www.hoppstudios.com">Hopp</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-23-special-guest-walt-hickey">Download the audio for this episode</a>.</p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/285159790&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Statistical vitriol
2016-09-29T00:00:00+00:00
http://simplystats.github.io/2016/09/29/statistical-vitriol
<p>Over the last few months there has been a lot of vitriol around statistical ideas. First there were <a href="http://www.nejm.org/doi/full/10.1056/NEJMe1516564">data parasites</a> and then there were <a href="https://www.dropbox.com/s/9zubbn9fyi1xjcu/Fiske%20presidential%20guest%20column_APS%20Observer_copy-edited.pdf">methodological terrorists</a>. These epithets came from established scientists who have relatively little statistical training. There was the predictable backlash to these folks from their counterparties, typically statisticians or statistically trained folks who care about open source.</p>
<p>I’m a statistician who cares about open source but I also frequently collaborate with scientists from different fields. It makes me sad and frustrated that statistics - which I’m so excited about and have spent my entire professional career working on - is something that is causing so much frustration, anxiety, and anger.</p>
<p>I have been thinking a lot about the cause of this anger and division in the sciences. As a person who interacts with both groups pretty regularly I think that the reasons are some combination of the following.</p>
<ol>
<li>Data is now everywhere, so every single publication involves some level of statistical modeling and analysis. It can’t be escaped.</li>
<li>The deluge of scientific papers means that only big claims get your work noticed, get you into fancy journals, and get you attention.</li>
<li>Most senior scientists, the ones leading and designing studies, <a href="http://simplystatistics.org/2012/04/27/people-in-positions-of-power-that-dont-understand/">have little or no training in statistics</a>. There is a structural reason for this: data was sparse when they were trained and there wasn’t any reason for them to learn statistics. So statistics and data science weren’t (and still often aren’t) integrated into medical and scientific curricula.</li>
<li>There is an imbalance of power in the scientific process between statisticians/computational scientists and scientific investigators or clinicians. The clinicians/scientific investigators are “in charge” and the statisticians are often relegated to a secondary role. Statisticians with some control over their environment (think senior tenured professors of (bio)statistics) can avoid these imbalances and look for collaborators who respect statistical thinking, but not everyone can. There are a large number of <a href="http://www.opiniomics.org/a-guide-for-the-lonely-bioinformatician/">lonely bioinformaticians</a> out there.</li>
<li>Statisticians and computational scientists are also frustrated because there is often no outlet for them to respond to these papers in the formal scientific literature - those outlets are controlled by scientists and rarely have statisticians in positions of influence within the journals.</li>
</ol>
<p>Since statistics is everywhere (1) and only flashy claims get you into journals (2) and the people leading studies don’t understand statistics very well (3), you get many publications where the paper makes a big claim based on shaky statistics but it gets through. This then frustrates the statisticians because they have little control over the process (4) and can’t get their concerns into the published literature (5).</p>
<p>This used to just result in lots of statisticians and computational scientists complaining behind closed doors. The internet changed all that: everyone is an <a href="http://simplystatistics.org/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics/">internet scientist</a> now. So the statisticians and statistically savvy take to blogs, f1000research, and other outlets to get their point across.</p>
<p>Sometimes to get attention, statisticians start to have the same problem as scientists; they need their complaints to get attention to have any effect. So they go over the top. They accuse people of fraud, or being statistically dumb, or nefarious, or intentionally doing things with data, or cast a wide net and try to implicate a large number of scientists in poor statistics. The irony is that this is the same attention-seeking behavior by scientists that frustrated the statisticians in the first place.</p>
<p>Just to be 100% clear here I am also guilty of this. I have definitely fallen into the hype trap - talking about the “replicability crisis”. I also made the mistake earlier in my blogging career of trashing the statistics of a paper that frustrated me. I am embarrassed now that I did that; it wasn’t constructive and the author ended up being very responsive. I think if I had just emailed that person they would have resolved their problem.</p>
<p>I just recently had an experience where a very prominent paper hadn’t made their data public and I was having trouble getting the data. I thought about writing a blog post to get attention, but at the end of the day just did the work of emailing the authors, explaining myself over and over and finally getting the data from them. The result is the same (I have the data) but it cost me time and frustration. So I understand when people don’t want to deal with that.</p>
<p>The problem is that scientists see the attention the statisticians are calling down on them - primarily negative and often over-hyped. Then they get upset and call the statisticians/open scientists names, or push back on entirely sensible policies because they are worried about being humiliated or discredited. While I don’t agree with that response, I also understand the feeling of “being under attack”. I’ve had that happen to me too and it doesn’t feel good.</p>
<p>So where do we go from here? How do we end statistical vitriol and make statistics a positive force? Here is my six part plan:</p>
<ol>
<li>We should create continuing education for senior scientists and physicians in statistical and open data thinking so people who never got that training can understand the unique requirements of a data rich scientific world.</li>
<li>We should encourage journals and funders to incorporate statisticians and computational scientists at the highest levels of influence so that they can drive policy that makes sense in this new data driven time.</li>
<li>We should recognize that scientists and data generators have <a href="http://simplystatistics.org/2016/01/25/on-research-parasites-and-internet-mobs-lets-try-to-solve-the-real-problem/">a lot more on the line</a> when they produce a result or a scientific data set. We should give them appropriate credit for doing that even if they don’t get the analysis exactly right.</li>
<li>We should de-escalate the consequences of statistical mistakes. Right now the consequences are: retractions that hurt careers, blog posts that are aggressive and often too personal, and humiliation by the community. We should make it easy to acknowledge these errors without ruining careers. This will be hard - scientists’ careers often depend on the results they get (recall 2 above). So we need a way to pump up/give credit to/acknowledge scientists who are willing to sacrifice that to get the stats right.</li>
<li>We need to stop treating retractions/statistical errors/mistakes like a sport where there are winners and losers. Statistical criticism should be easy, allowable, publishable and not angry or personal.</li>
<li>Any paper where statistical analysis is part of the paper must have either a statistically trained author or a statistically trained reviewer, or both. I wouldn’t believe a paper on genomics that was performed entirely by statisticians with no biology training any more than I believe a paper with statistics in it performed entirely by physicians with no statistical training.</li>
</ol>
<p>I think scientists forget that statisticians feel un-empowered in the scientific process and statisticians forget that a lot is riding on any given study for a scientist. So being a little more sympathetic to the pressures we all face would go a long way to resolving statistical vitriol.</p>
<p>I’d be eager to hear other ideas too. It makes me sad that statistics has become so political on both sides.</p>
The Mystery of Palantir Continues
2016-09-28T00:00:00+00:00
http://simplystats.github.io/2016/09/28/mystery-palantir-continues
<p>Palantir, the secretive data science/consulting/software company, continues to be a mystery to most people, but recent reports have not been great. <a href="http://www.nytimes.com/reuters/2016/09/26/business/26reuters-palantir-tech-discrimination-lawsuit.html?smprod=nytcore-iphone&smid=nytcore-iphone-share&_r=0">Reuters reports</a> that the U.S. Department of Labor is suing it for employment discrimination:</p>
<blockquote>
<p>The lawsuit alleges Palantir routinely eliminated Asian applicants in the resume screening and telephone interview phases, even when they were as qualified as white applicants.</p>
</blockquote>
<p>Interestingly, the report indicates a statistical argument:</p>
<blockquote>
<p>In one example cited by the Labor Department, Palantir reviewed a pool of more than 130 qualified applicants for the role of engineering intern. About 73 percent of applicants were Asian. The lawsuit, which covers Palantir’s conduct between January 2010 and the present, said the company hired 17 non-Asian applicants and four Asians. “The likelihood that this result occurred according to chance is approximately one in a billion,” said the lawsuit, which was filed with the department’s Office of Administrative Law Judges.</p>
</blockquote>
<p><em>Update: Thanks to David Robinson for pointing out that (a) I read the numbers incorrectly and (b) I should have used the hypergeometric distribution to account for the sampling without replacement. The paragraph below is corrected accordingly.</em></p>
<p>Note the use of the phrase “qualified applicants” in reference to the 130.
Presumably, there was a screening process that removed
“unqualified applicants” and that led us to 130. Of the 130, 73% were
Asian. Presumably, there was a follow up selection process (interview,
exam) that led to 4 Asians being hired out of 21 (about 19%). Clearly
there’s a difference between 19% and 73% but the reasons may not be
nefarious. If you assume the number of Asians hired is proportional to
the number in the qualified pool, then the p-value for the observed
data is about 10^-8, which is not quite “1 in a billion” as the
report claims but it’s indeed small. But my guess is the Labor
Department has more than this test of binomial proportions in terms of
evidence if they were to go through with a suit.</p>
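<p>As a rough check on that figure, here is a minimal Python sketch of the hypergeometric tail probability. It assumes the pool held exactly 130 applicants, that 73% rounds to 95 Asian applicants, and that the 21 hires were drawn at random without replacement. The function name <code>hypergeom_cdf</code> is my own label for illustration, not anything taken from the lawsuit.</p>

```python
from math import comb

# Figures as reported; treating the pool as exactly 130 is an assumption.
POOL = 130
ASIAN_IN_POOL = round(0.73 * POOL)   # about 73% of the pool
HIRED = 21
ASIAN_HIRED = 4


def hypergeom_cdf(k, population, successes, draws):
    """P(X at most k) when drawing `draws` items without replacement
    from `population` items, `successes` of which are of interest."""
    total = comb(population, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(0, k + 1)
    )
    return favorable / total


# Probability of hiring 4 or fewer Asian applicants purely by chance.
p_value = hypergeom_cdf(ASIAN_HIRED, POOL, ASIAN_IN_POOL, HIRED)
print(f"p-value: {p_value:.2e}")
```

<p>Under these assumptions the probability comes out on the order of 10^-8, in line with the estimate above.</p>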
<p>Alfred Lee from <a href="http://go.theinformation.com/r958P12lLdw">The Information</a> reports that a mutual fund run by Valic sold its shares of Palantir for below the recent valuation:</p>
<blockquote>
<p>The Valic fund sold its stake at $4.50 per share, filings show, down from the $11.38 per share at which the company raised money in December. The value of the stake at the sale price was $621,000. Despite the price drop, Valic made money on the deal, as it had acquired stock in preferred fundraisings in 2012 and 2013 at between $3.06 and $3.51 per share.</p>
</blockquote>
<p>The valuation suggested in the article by the recent sale is $8 billion. In my <a href="http://simplystatistics.org/2016/05/11/palantir-struggles/">previous post on Palantir</a>, I noted that while other large-scale consulting companies certainly make a lot of money, none have the sky-high valuation that Palantir commands. However, a more “down-to-Earth” valuation of $8 billion might be more or less in line with these other companies. It may be bad news for Palantir, but should the company ever have an IPO, it would be good for the public for market participants to realize the intrinsic value of the company.</p>
Thinking like a statistician: this is not the election for progressives to vote third party
2016-09-27T00:00:00+00:00
http://simplystats.github.io/elections/2016/09/27/thinking-like-statistician-election-2016
<p>Democratic elections permit us to vote for whoever we perceive has
the highest expectation to do better with the issues we care about. Let’s
simplify and assume we can quantify how satisfied we are with an
elected official’s performance. Denote this quantity with <em>X</em>. Because
when we cast our vote we still don’t know for sure how the candidate
will perform, we base our decision on what we expect, denoted here with
<em>E(X)</em>. Thus we try to maximize <em>E(X)</em>. However, both political theory
and data tell us that in US presidential elections only two parties
have a non-negligible probability of winning. This implies that
<em>E(X)</em> is 0 for some candidates no matter how large <em>X</em> could
potentially be. So what we are really doing is deciding if <em>E(X-Y)</em> is
positive or negative with <em>X</em> representing one candidate and <em>Y</em> the
other.</p>
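<p>Here is a minimal sketch of one reading of this argument, with made-up numbers. It assumes a candidate's contribution to what we can expect is their satisfaction score weighted by their probability of winning. The scores, the probabilities, and the helper <code>expected_contribution</code> are all hypothetical.</p>

```python
def expected_contribution(satisfaction, p_win):
    """Satisfaction if elected, weighted by the chance of being elected."""
    return satisfaction * p_win


# Hypothetical values: (satisfaction if elected, probability of winning).
x = (6.0, 0.5)   # viable candidate X
y = (2.0, 0.5)   # viable candidate Y
z = (9.0, 0.0)   # third-party candidate with a negligible chance of winning

# However large the third-party score, its weighted contribution vanishes.
third_party_term = expected_contribution(*z)

# The decision then reduces to the sign of E(X - Y) for the viable pair.
difference = x[0] - y[0]

print(f"third-party contribution: {third_party_term}")
print(f"E(X - Y): {difference}")
```

<p>With these illustrative numbers the third-party term is zero, so the only quantity that matters for the decision is the difference between the two viable candidates.</p>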
<p>In past elections some progressives have argued that the difference
between candidates is negligible and have therefore supported the Green Party
ticket. The 2000 election is a notable example. The
<a href="https://en.wikipedia.org/wiki/United_States_presidential_election,_2000">2000 election</a>
was won by George W. Bush by just five <a href="https://en.wikipedia.org/wiki/Electoral_College_(United_States)">electoral votes</a>. In Florida,
which had 25 electoral votes, Bush beat Al
Gore by just 537 votes. Green Party candidate Ralph
Nader obtained 97,488 votes. Many progressive voters were OK with this
outcome because they perceived <em>E(X-Y)</em> to be practically 0.</p>
<p>In contrast, in 2016, I suspect few progressives think that
<em>E(X-Y)</em> is anywhere near 0. In the figures below I attempt to
quantify the progressive’s pre-election perception of consequences for
the last five contests. The first
figure shows <em>E(X)</em> and <em>E(Y)</em> and the second shows <em>E(X-Y)</em>. Note that
despite <em>E(X)</em> being the lowest of the past five elections,
<em>E(X-Y)</em> is by far the largest. So if these figures accurately depict
your perception and you think
like a statistician, it becomes clear that this is not the election to
vote third party.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/election.png" alt="election-2016" /></p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/election-diff.png" alt="election-diff-2016" /></p>
Facebook and left censoring
2016-09-26T00:00:00+00:00
http://simplystats.github.io/2016/09/26/facebook-left-censoring
<p>From the <a href="http://www.wsj.com/articles/facebook-overestimated-key-video-metric-for-two-years-1474586951">Wall Street Journal</a>:</p>
<blockquote>
<p>Several weeks ago, Facebook disclosed in a post on its “Advertiser Help Center” that its metric for the average time users spent watching videos was artificially inflated because it was only factoring in video views of more than three seconds. The company said it was introducing a new metric to fix the problem.</p>
</blockquote>
<p>A classic case of left censoring (in this case, by “accident”).</p>
<p>Also this:</p>
<blockquote>
<p>Ad buying agency Publicis Media was told by Facebook that the earlier counting method likely overestimated average time spent watching videos by between 60% and 80%, according to a late August letter Publicis Media sent to clients that was reviewed by The Wall Street Journal.</p>
</blockquote>
<p>What does this information tell us about the actual time spent watching Facebook videos?</p>
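<p>One way to get a feel for this question is a sketch under a strong assumption: that watch times follow an exponential distribution. This is my assumption, not something Facebook reported. For an exponential with mean <em>m</em>, the mean of the views longer than 3 seconds is <em>m</em> + 3, so an inflation ratio <em>r</em> implies <em>m</em> = 3 / (<em>r</em> - 1). The function names below are hypothetical.</p>

```python
import random

THRESHOLD = 3.0  # seconds; shorter views were left out of the average


def implied_true_mean(inflation, threshold=THRESHOLD):
    """True mean watch time implied by an inflation ratio, assuming
    exponentially distributed watch times (an assumption, not real data).

    For an exponential with mean m, views longer than the threshold
    average m + threshold, so inflation = (m + threshold) / m.
    """
    return threshold / (inflation - 1.0)


def simulate(true_mean, n=200_000, threshold=THRESHOLD, seed=1):
    """Return (mean of all views, mean of views above the threshold)."""
    rng = random.Random(seed)
    views = [rng.expovariate(1.0 / true_mean) for _ in range(n)]
    kept = [v for v in views if v > threshold]
    return sum(views) / len(views), sum(kept) / len(kept)


low = implied_true_mean(1.8)    # an 80% overestimate
high = implied_true_mean(1.6)   # a 60% overestimate
all_mean, kept_mean = simulate(high)

print(f"implied true mean: {low:.2f} to {high:.2f} seconds")
print(f"simulated: all views {all_mean:.2f}s, views over 3s {kept_mean:.2f}s")
```

<p>Under this assumption, the reported 60% to 80% overestimate would correspond to a true average of roughly 3.75 to 5 seconds. A different distribution of watch times would give a different answer, which is part of why the question is interesting.</p>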
Not So Standard Deviations Episode 22 - Number 1 Side Project
2016-09-19T00:00:00+00:00
http://simplystats.github.io/2016/09/19/nssd-episode-22
<p>Hilary and I celebrate our one year anniversary doing the podcast together by discussing whether there are cities that are good for data scientists, reproducible research, and professionalizing data science.</p>
<p>Also, Hilary and I have just published a new book, <a href="https://leanpub.com/conversationsondatascience?utm_source=SimplyStats&utm_campaign=NSSD&utm_medium=BlogPost">Conversations on Data Science</a>, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show Notes:</p>
<ul>
<li>
<p><a href="https://www.biostat.washington.edu/suminst/sisbid2016/modules/BD1603">Roger’s reproducible research workshop</a></p>
</li>
<li>
<p><a href="http://radar.oreilly.com/2013/06/theres-more-than-one-kind-of-data-scientist.html">There’s More Than One Kind of Data Scientist by Harlan Harris</a></p>
</li>
<li>
<p><a href="http://sf.curbed.com/maps/mapping-the-10-sf-homes-with-the-highest-property-taxes">Billionaire’s row in San Francisco</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Mindfulness-based_stress_reduction">Mindfulness-based stress reduction</a></p>
</li>
<li>
<p><a href="http://www.asteroidmission.org/">OSIRIS-REx</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-22-1-side-project">Download the audio for this episode</a>.</p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/282927998&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Mastering Software Development in R
2016-09-19T00:00:00+00:00
http://simplystats.github.io/2016/09/19/msdr-launch-announcement
<p>Today I’m happy to announce that we’re launching a new specialization on Coursera titled <a href="https://www.coursera.org/specializations/r/"><strong>Mastering Software Development in R</strong></a>. This is a 5-course sequence developed with <a href="https://twitter.com/seankross">Sean Kross</a> and <a href="http://csu-cvmbs.colostate.edu/academics/erhs/Pages/brooke-anderson.aspx">Brooke Anderson</a>.</p>
<p>This sequence differs from our previous Data Science Specialization because it focuses primarily on using R for developing <em>software</em>. We’ve found that as the field of data science evolves, it is becoming ever more clear that software development skills are essential for producing useful data science results and products. In addition, there is a tremendous need for tooling in the data science universe and we want to train people to build those tools.</p>
<p>The first course, <a href="https://www.coursera.org/learn/r-programming-environment">The R Programming Environment</a>, launches today. In the following months, we will launch the remaining courses:</p>
<ul>
<li>Advanced R Programming</li>
<li>Building R Packages</li>
<li>Building Data Visualization Tools</li>
</ul>
<p>In addition to the courses, we have a <a href="https://leanpub.com/msdr">companion textbook</a> that goes along with the sequence. The book is available from Leanpub and is currently in progress (if you get the book now, you will receive free updates as they are available). We will be releasing new chapters of the book alongside the launches of the other courses in the sequence.</p>
Interview With a Data Sucker
2016-09-07T00:00:00+00:00
http://simplystats.github.io/open%20science/2016/09/07/interview-with-a-data-sucker
<p>A few months ago Jill Sederstrom from ASH Clinical News interviewed
me for <a href="http://ashclinicalnews.org/attack-of-the-data-suckers/">this article</a> on the data sharing editorial published by The New England Journal of Medicine (NEJM) and the debate it generated.
The article presented a nice summary, but I thought the original
comprehensive set of questions was very good too. So, with permission from
ASH Clinical News, I am sharing them here along with my answers.</p>
<p>Before I answer the questions below, I want to make an important remark.
When writing these answers I am reflecting on data sharing in
general. Nuances arise in different contexts that need to be
discussed on an individual basis. For example, there are different
considerations to keep in mind when sharing publicly funded data in
genomics (my field) and sharing privately funded clinical trials data,
just to name two examples.</p>
<h3 id="in-your-opinion-what-do-you-see-as-the-biggest-pros-of-data-sharing">In your opinion, what do you see as the biggest pros of data sharing?</h3>
<p>The biggest pro of data sharing is that it can accelerate and improve
the scientific enterprise. This can happen in a variety of ways. For
example, competing experts may apply an improved statistical analysis
that finds a hidden discovery the original data generators missed.
Furthermore, examination of data by many experts can help correct
errors missed by the analyst of the original project. Finally, sharing
data facilitates the merging of datasets from different sources that
allow discoveries not possible with just one study.</p>
<p>Note that data sharing is not a radical idea. For example, thanks to
an organization called <a href="http://fged.org">The MGED Society</a>, most journals require all published
microarray gene expression data to be public in one of two
repositories: GEO or ArrayExpress. This has been an incredible
success, leading to new discoveries, new databases that combine
studies, and the development of widely used statistical methods and
software built with these data as practice examples.</p>
<h3 id="the-nejm-editorial-expressed-concern-that-a-new-generation-of-researchers-will-emerge-those-who-had-nothing-to-do-with-collecting-the-research-but-who-will-use-it-to-their-own-ends-it-referred-to-these-as-research-parasites-is-this-a-real-concern">The NEJM editorial expressed concern that a new generation of researchers will emerge, those who had nothing to do with collecting the research but who will use it to their own ends. It referred to these as “research parasites.” Is this a real concern?</h3>
<p>Absolutely not. If our goal is to facilitate scientific discoveries that
improve our quality of life, I would be much more concerned about
“data hoarders” than “research parasites”. If an important nugget of
knowledge is hidden in a dataset, don’t you want the best data
analysts competing to find it? Restricting the researchers who can
analyze the data to those directly involved with the generators cuts
out the great majority of experts.</p>
<p>To further illustrate this, let’s consider a very concrete example
with real life consequences. Imagine a loved one has a disease with
high mortality rates. Finding a cure is possible but only after
analyzing a very very complex genomic assay. If some of the best data
analysts in the world want to help, does it make any sense at all to
restrict the pool of analysts to, say, a freshly minted masters level
statistician working for the genomics core that generated the data?
Furthermore, what would be the harm of having someone double check
that analysis?</p>
<h3 id="the-nejm-editorial-also-presented-several-other-concerns-it-had-with-data-sharing-including-whether-researchers-would-compare-data-across-clinical-trials-that-is-not-in-fact-comparable-and-a-failure-to-provide-correct-attribution-do-you-see-these-as-being-concerns-what-cons-do-you-believe-there-may-be-to-data-sharing">The NEJM editorial also presented several other concerns it had with data sharing including whether researchers would compare data across clinical trials that is not in fact comparable and a failure to provide correct attribution. Do you see these as being concerns? What cons do you believe there may be to data sharing?</h3>
<p>If such mistakes are made, good peer reviewers will catch the error.
If it escapes peer review, we point it out in post publication
discussions. Science is constantly self correcting.</p>
<p>Regarding attribution, this is a legitimate, but in my opinion, minor
concern. Developers of open source statistical methods and software
see our methods used without attribution quite often. We survive. But
as I elaborate below, we can do things to alleviate this concern.</p>
<h3 id="is-data-stealing-a-real-worry-have-you-ever-heard-of-it-happening-before">Is data stealing a real worry? Have you ever heard of it happening before?</h3>
<p>I can’t say I can recall any case of data being stolen. But let’s
remember that most published data is paid for by taxpayers. They are the
actual owners. So there is an argument to be made that the public’s
data is being held hostage.</p>
<h3 id="does-data-sharing-need-to-happen-symbiotically-as-the-editorial-suggests-why-or-why-not">Does data sharing need to happen symbiotically as the editorial suggests? Why or why not?</h3>
<p>I think symbiotic sharing is the most effective approach to the
repurposing of data. But no, I don’t think we need to force it to happen this way.
Competition is one of the key ingredients of the scientific
enterprise. Having many groups competing almost always beats out a
small group of collaborators. And note that the data generators won’t
necessarily have time to collaborate with all the groups interested in
the data.</p>
<h3 id="in-a-recent-blog-post-you-suggested-several-possible-data-sharing-guidelines-what-would-the-advantage-be-of-having-guidelines-in-place-in-help-guide-the-data-sharing-process">In a recent blog post, you suggested several possible data sharing guidelines. What would the advantage be of having guidelines in place to help guide the data sharing process?</h3>
<p>I think you are referring to <a href="http://simplystatistics.org/2016/01/25/on-research-parasites-and-internet-mobs-lets-try-to-solve-the-real-problem/">a post by Jeff Leek</a> but I am happy to
answer. For data to be generated, we need to incentivize the endeavor.
Guidelines that assure patient privacy should of course be followed.
Some other simple guidelines related to those mentioned by Jeff are:</p>
<ol>
<li>Reward data generators when their data is used by others.</li>
<li>Penalize those that do not give proper attribution.</li>
<li>Apply the same critical rigor on critiques of the original analysis
as we apply to the original analysis.</li>
<li>Include data sharing ethics in scientific education</li>
</ol>
<h3 id="one-of-the-guidelines-suggested-a-new-designation-for-leaders-of-major-data-collection-or-software-generation-projects-why-do-you-think-this-is-important">One of the guidelines suggested a new designation for leaders of major data collection or software generation projects. Why do you think this is important?</h3>
<p>Again, this was Jeff, but I agree. This is important because we need
an incentive other than giving the generators exclusive rights to
publications emanating from said data.</p>
<h3 id="you-also-discussed-the-need-for-requiring-statisticalcomputational-co-authors-for-papers-written-by-experimentalists-with-no-statisticalcomputational-co-authors-and-vice-versa-what-role-do-you-see-the-referee-serving-why-is-this-needed">You also discussed the need for requiring statistical/computational co-authors for papers written by experimentalists with no statistical/computational co-authors and vice versa. What role do you see the referee serving? Why is this needed?</h3>
<p>I think the same rule should apply to referees. Every paper based on
the analysis of complex data needs to have a referee with
statistical/computational expertise. I also think biomedical journals
publishing data-driven research should start adding these experts to
their editorial boards. I should mention that NEJM actually has had
such experts on their editorial board for a while now.</p>
<h3 id="are-there-certain-guidelines-would-feel-would-be-most-critical-to-include">Are there certain guidelines you feel would be most critical to include?</h3>
<p>To me the most important ones are:</p>
<ol>
<li>
<p>The funding agencies and the community should reward data
generators when their data is used by others. Perhaps more than for
the papers they produce with these data.</p>
</li>
<li>
<p>Apply the same critical rigor on critiques of the original analysis
as we apply to the original analysis. Bashing published results and
talking about the “replication crisis”
has become fashionable. Although in some cases it is very well merited
(see Baggerly and Coombes’ <a href="http://projecteuclid.org/euclid.aoas/1267453942#info">work</a>, for example), in some circumstances critiques are made without much care, mainly for the attention. If we
are not careful about keeping a good balance, we may end up
paralyzing scientific progress.</p>
</li>
</ol>
<h3 id="you-mentioned-that-you-think-symbiotic-data-sharing-would-be-the-most-effective-approach-what-are-some-ways-in-which-scientists-can-work-symbiotically">You mentioned that you think symbiotic data sharing would be the most effective approach. What are some ways in which scientists can work symbiotically?</h3>
<p>I can describe my experience. I am trained as a statistician. I analyze
data on a daily basis both as a collaborator and method developer.
Experience has taught me that if one does not understand the
scientific problem at hand, it is hard to make a meaningful
contribution through data analysis or method development. Most
successful applied statisticians will tell you the same thing.</p>
<p>Most difficult scientific challenges have nuances that only the
subject matter expert can effectively describe. Failing to understand
these usually leads analysts to chase false leads, interpret results
incorrectly or waste time solving a problem no one cares about.
Successful collaborations usually involve a constant back and forth
between the data analysts and the subject matter experts.</p>
<p>However, in many circumstances the data generator is not necessarily
the only one that can provide such guidance. Some data analysts
actually become subject matter experts themselves, others download
data and seek out other collaborators that also understand the details
of the scientific challenge and data generation process.</p>
A Short Guide for Students Interested in a Statistics PhD Program
2016-09-06T00:00:00+00:00
http://simplystats.github.io/advice/2016/09/06/a-short-guide-for-phd-applicants
<p>This summer I had several conversations with undergraduate
students seeking career advice. All were interested in data analysis
and were considering graduate school. I also frequently receive
requests for advice via email. We have posted on this topic
before, for example
<a href="http://simplystatistics.org/2015/02/18/navigating-big-data-careers-with-a-statistics-phd/">here</a>
and
<a href="http://simplystatistics.org/2015/11/09/biostatistics-its-not-what-you-think-it-is/">here</a>, but
I thought it would be useful to share this short guide I put together based on my recent interactions.</p>
<h2 id="its-ok-to-be-confused">It’s OK to be confused</h2>
<p>When I was a college senior I didn’t really understand what Applied
Statistics was nor did I understand what one does as a researcher in
academia. Now I love being an academic doing research in applied statistics.
But it is hard to understand what being a researcher is like until you do
it for a while. Things become clearer as you gain more experience. One
important piece of advice is
to carefully consider advice from those with more
experience than you. It might not make sense at first, but I
can tell you today that I knew much less than I thought I did when I was 22.</p>
<h2 id="should-i-even-go-to-graduate-school">Should I even go to graduate school?</h2>
<p>Yes. An undergraduate degree in mathematics, statistics, engineering, or computer science
provides a great background, but some more training greatly increases
your career options. You may be able to learn on the job, but note
that a masters can be as short as a year.</p>
<h2 id="a-masters-or-a-phd">A masters or a PhD?</h2>
<p>If you want a career in academia or as a researcher in industry or
government you need a PhD. In general, a PhD will
give you more career options. If you want to become a data analyst or
research assistant, a masters may be enough. A masters is also a good way
to test out if this career is a good match for you. Many people do a
masters before applying to PhD Programs. The rest of this guide
focuses on those interested in a PhD.</p>
<h2 id="what-discipline">What discipline?</h2>
<p>There are many disciplines that can lead you to a career in data
science: Statistics, Biostatistics, Astronomy, Economics, Machine Learning, Computational
Biology, and Ecology are examples that come to mind. I did my PhD
in Statistics and got a job in a Department of Biostatistics. So this
guide focuses on Statistics/Biostatistics.</p>
<p>Note that once you finish your PhD you have a chance to become a
postdoctoral fellow and further focus your training. By then you will have a
much better idea of what you want to do and will have the opportunity
to choose a lab that closely matches your interests.</p>
<h2 id="what-is-the-difference-between-statistics-and-biostatistics">What is the difference between Statistics and Biostatistics?</h2>
<p>Short answer: very little. I treat them as the same in this guide. Long answer: read
<a href="http://simplystatistics.org/2015/11/09/biostatistics-its-not-what-you-think-it-is/">this</a>.</p>
<h2 id="how-should-i-prepare-during-my-senior-year">How should I prepare during my senior year?</h2>
<h3 id="math">Math</h3>
<p>Good grades in math and statistics classes
are almost a requirement. Good GRE scores help and you need to get a near perfect score in
the Quantitative Reasoning part of the GRE. Get yourself a practice
book and start preparing. Note that to survive the first two years of a statistics PhD program
you need to prove theorems and derive relatively complicated
mathematical results. If you can’t easily handle the math part of the GRE, this will be
quite challenging.</p>
<p>When choosing classes note that the area of math most related to your
stat PhD courses is Real
Analysis. The area of math most used in applied work is Linear
Algebra, specifically matrix theory including understanding
eigenvalues and eigenvectors. You might not make the connection between
what you learn in class and what you use in practice until much
later. This is totally normal.</p>
<p>If you don’t feel ready, consider doing a masters first. But also, get
a second opinion. You might be being too hard on yourself.</p>
<h3 id="programming">Programming</h3>
<p>You will be using a computer to analyze data so knowing some
programming is a must these days. At a minimum, take a basic
programming class. Other computer science classes will help especially
if you go into an area dealing with large datasets. In hindsight, I
wish I had taken classes on optimization and algorithm design.</p>
<p>Know that learning to program and learning a computer language are
different things. You need to learn to program. The choice of language
is up for debate. If you only learn one, learn R. If you learn three,
learn R, Python and C++.</p>
<p>Knowing Linux/Unix is an advantage. If you have a Mac try to use the
terminal as much as possible. On Windows get an emulator.</p>
<h3 id="writing-and-communicating">Writing and Communicating</h3>
<p>My biggest educational regret is that, as a college student, I underestimated the importance
of writing. To this day I am correcting that mistake.</p>
<p>Your success as a researcher greatly depends on how well
you write and communicate. Your thesis, papers, grant
proposals and even emails have to be well written. So practice as much as
possible. Take classes, read works by good writers, and
<a href="http://bulletin.imstat.org/2011/09/terence%E2%80%99s-stuff-speaking-reading-writing/">practice</a>. Consider
starting a blog even if you don’t make it public. Also note that in
academia, job interviews will
involve a 50 minute talk as well as several conversations about your
work and future plans. So communication skills are also a big plus.</p>
<h2 id="but-wait-why-so-much-math">But wait, why so much math?</h2>
<p>The PhD curriculum is indeed math heavy. Faculty often debate the
possibility of changing the curriculum. But regardless of
differing opinions on what is the right amount, math is the
foundation of our discipline. Although it is true that you will not
directly use much of what you learn, I don’t regret learning so much abstract
math because I believe it positively shaped the way I think and attack
problems.</p>
<p>Note that after the first two years you are
pretty much done with courses and you start on your research. If you work with an
applied statistician you will learn data analysis via the
apprenticeship model. You will learn the most, by far, during this
stage. So be patient. Watch
<a href="https://www.youtube.com/watch?v=R37pbIySnjg">these</a>
<a href="https://www.youtube.com/watch?v=Bg21M2zwG9Q">two</a> Karate Kid scenes
for some inspiration.</p>
<h2 id="what-department-should-i-apply-to">What department should I apply to?</h2>
<p>The top 20-30 departments are practically interchangeable in my
opinion. If you are interested in applied statistics make sure you
pick a department with faculty doing applied research. Note that some
professors focus their research on the mathematical aspects of
statistics. By reading some of their recent papers you will be able to
tell. An applied paper usually shows data (not simulated) and
motivates a subject area challenge in the abstract or introduction. A
theory paper shows no data at all or uses it only as an example.</p>
<h2 id="can-i-take-a-year-off">Can I take a year off?</h2>
<p>Absolutely. Especially if it’s to work in a data related job. In
general, maturity and life experiences are an advantage in grad school.</p>
<h2 id="what-should-i-expect-when-i-finish">What should I expect when I finish?</h2>
<p>You will have many, many options. The demand for your expertise is
great and growing. As a result there are many high-paying options. If you want to
become an academic I recommend doing a postdoc. <a href="http://simplystatistics.org/2011/12/28/grad-students-in-bio-statistics-do-a-postdoc/">Here</a> is why.
But there are many other options as we describe <a href="http://simplystatistics.org/2015/02/18/navigating-big-data-careers-with-a-statistics-phd/">here</a>
and <a href="http://simplystatistics.org/2011/09/12/advice-for-stats-students-on-the-academic-job-market/">here</a>.</p>
Not So Standard Deviations Episode 21 - This Might be the Future!
2016-08-26T00:00:00+00:00
http://simplystats.github.io/2016/08/26/nssd-episode-21
<p>Hilary and I are apart again and this time we’re talking about political polling. Also, we discuss Trump’s tweets, and the fact that Hilary owns a bowling ball.</p>
<p>Also, Hilary and I have just published a new book, <a href="https://leanpub.com/conversationsondatascience?utm_source=SimplyStats&utm_campaign=NSSD&utm_medium=BlogPost">Conversations on Data Science</a>, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show Notes:</p>
<ul>
<li>
<p><a href="http://projects.fivethirtyeight.com/2016-election-forecast/">FiveThirtyEight election dashboard</a></p>
</li>
<li>
<p><a href="http://www.nytimes.com/interactive/2016/upshot/presidential-polls-forecast.html">The Upshot’s election dashboard</a></p>
</li>
<li>
<p><a href="http://varianceexplained.org/r/trump-tweets/">David Robinson’s post on Trump’s tweets</a></p>
</li>
<li>
<p><a href="https://twitter.com/juliasilge">Julia Silge’s Twitter account</a></p>
</li>
<li>
<p><a href="http://thekateringshow.com">The Katering Show</a></p>
</li>
<li>
<p><a href="https://www.beomni.com">Omni</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-21-this-might-be-the-future">Download the audio for this episode</a>.</p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/279922412&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
How to create a free distributed data collection "app" with R and Google Sheets
2016-08-26T00:00:00+00:00
http://simplystats.github.io/2016/08/26/googlesheets
<p><a href="http://www.stat.ubc.ca/~jenny/">Jenny Bryan</a>, developer of the <a href="https://github.com/jennybc/googlesheets">googlesheets R package</a>, <a href="https://speakerdeck.com/jennybc/googlesheets-talk-at-user2015">gave a talk</a> at useR! 2015 about the package.</p>
<p>One of the things that got me most excited about the package was an example she gave in her talk of using the Google Sheets package for data collection at ultimate frisbee tournaments. One reason is that I used to play a little ultimate <a href="http://www.pbase.com/jmlane/image/76852417">back in the day</a>.</p>
<p>Another is that her idea is an amazing one for producing cool public health applications. One of the major issues with public health is being able to do distributed data collection cheaply, easily, and reproducibly. So I decided to write a little tutorial on how one could use <a href="https://www.google.com/sheets/about/">Google Sheets</a> and R to create a free distributed data collection “app” for public health (or anything else really).</p>
<h3 id="what-you-will-need">What you will need</h3>
<ul>
<li>A Google account and access to <a href="https://www.google.com/sheets/about/">Google Sheets</a></li>
<li><a href="https://www.r-project.org/">R</a> and the <a href="https://github.com/jennybc/googlesheets">googlesheets</a> package.</li>
</ul>
<h3 id="the-app">The “app”</h3>
<p>What we are going to do is collect data in a Google Sheet or sheets. This sheet can be edited by anyone with the link using their computer or a mobile phone. Then we will use the <code class="language-plaintext highlighter-rouge">googlesheets</code> package to pull the data into R and analyze it.</p>
<h3 id="making-the-google-sheet-work-with-googlesheets">Making the Google Sheet work with googlesheets</h3>
<p>Once you have a Google account, the first thing to do is to go to Google Sheets. I suggest bookmarking this page: https://docs.google.com/spreadsheets/u/0/ which skips the annoying splash screen.</p>
<p>Create a blank sheet and give it an appropriate title for whatever data you will be collecting.</p>
<p>Next, we need to make the sheet <em>public on the web</em> so that the <em>googlesheets</em> package can read it. This is different from the sharing settings you set with the big button on the right. To make the sheet public on the web, go to the “File” menu and select “Publish to the web…”. Like this:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_publishweb.png" alt="" /></p>
<p>Then it will ask you if you want to publish the sheet; just hit “Publish”.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_publish.png" alt="" /></p>
<p>Copy the link it gives you and you can use it to read in the Google Sheet. If you want to see all the Google Sheets you can read in, you can load the package and use the <code class="language-plaintext highlighter-rouge">gs_ls</code> function.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">googlesheets</span><span class="p">)</span><span class="w">
</span><span class="n">sheets</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_ls</span><span class="p">()</span><span class="w">
</span><span class="n">sheets</span><span class="p">[</span><span class="m">1</span><span class="p">,]</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## # A tibble: 1 x 10
##   sheet_title author  perm version             updated
##         <chr>  <chr> <chr>   <chr>              <time>
## 1 app_example jtleek    rw     new 2016-08-26 17:48:21
## # ... with 5 more variables: sheet_key <chr>, ws_feed <chr>,
## #   alternate <chr>, self <chr>, alt_key <chr>
</code></pre></div></div>
<p>It will pop up a dialog asking you to authorize the <code class="language-plaintext highlighter-rouge">googlesheets</code> package to read from your Google Sheets account. Then you should see a list of spreadsheets you have created.</p>
<p>In my example I created a sheet called “app_example” so I can load the Google Sheet like this:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Identifies the Google Sheet</span><span class="w">
</span><span class="n">example_sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_title</span><span class="p">(</span><span class="s2">"app_example"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Sheet successfully identified: "app_example"
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Reads the data</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_read</span><span class="p">(</span><span class="n">example_sheet</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Accessing worksheet titled 'Sheet1'.
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## No encoding supplied: defaulting to UTF-8.
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">head</span><span class="p">(</span><span class="n">dat</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## # A tibble: 3 x 5
##   who_collected at_work person  time       date
##           <chr>   <chr>  <chr> <chr>      <chr>
## 1          jeff      no   ingo 13:47 08/26/2016
## 2          jeff     yes  roger 13:47 08/26/2016
## 3          jeff     yes  brian 13:47 08/26/2016
</code></pre></div></div>
<p>In this case the data I’m collecting is about who is at work right now as I’m writing this post :). But you could collect whatever you want.</p>
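<p>As a minimal sketch of a first analysis step, assuming the five columns shown in the printout above, one could tabulate who is at work. The data frame is re-created by hand here so the snippet runs without a Google account; in practice <code class="language-plaintext highlighter-rouge">dat</code> would come from <code class="language-plaintext highlighter-rouge">gs_read</code>. The time zone used for the timestamp is an assumption made for illustration.</p>

```r
## Stand-in for the output of gs_read(example_sheet);
## the values are copied from the printout above
dat = data.frame(
  who_collected = c("jeff", "jeff", "jeff"),
  at_work = c("no", "yes", "yes"),
  person = c("ingo", "roger", "brian"),
  time = c("13:47", "13:47", "13:47"),
  date = c("08/26/2016", "08/26/2016", "08/26/2016"),
  stringsAsFactors = FALSE
)

## Count how many people are at work and how many are not
at_work_counts = table(dat$at_work)

## Proportion of reported people who are at work
prop_at_work = mean(dat$at_work == "yes")

## Combine date and time into one timestamp column
## (the time zone is an assumption, not something stored in the sheet)
dat$timestamp = as.POSIXct(paste(dat$date, dat$time),
                           format = "%m/%d/%Y %H:%M",
                           tz = "America/New_York")
```

<p>Because every column is read in as text, converting the date and time into a proper timestamp early makes later sorting and plotting easier.</p>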
<h3 id="distributing-the-data-collection">Distributing the data collection</h3>
<p>Now that you have the data published to the web, you can read it into R. Also, anyone with the link will be able to view the Google Sheet. But if you don’t change the sharing settings, you are the only one who can edit the sheet.</p>
<p>This is where you can make your data collection distributed if you want. If you go to the “Share” button and then click on “Advanced”, you will get a screen like this with some options.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_share_advanced.png" alt="" /></p>
<p><em>Private data collection</em></p>
<p>In the example I’m using I haven’t changed the sharing settings, so while you can <em>see</em> the sheet, you can’t edit it. This is nice if you want to collect some data and allow other people to read it, but you don’t want them to edit it.</p>
<p><em>Controlled distributed data collection</em></p>
<p>If you just enter people’s emails then you can open the data collection to just those individuals you have shared the sheet with. Be careful though: if they don’t have Google email addresses, they get a link which they could share with other people, and this could lead to open data collection.</p>
<p><em>Uncontrolled distributed data collection</em></p>
<p>Another option is to click on “Change” next to “Private - Only you can access”, then click on “On - Anyone with the link” and on “Can View”.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_can_view.png" alt="" /></p>
<p>Then you can modify it to say “Can Edit” and hit “Save”. Now anyone who has the link can edit the Google Sheet. This means that you can’t control who will be editing it (careful!) but you can really widely distribute the data collection.</p>
<h3 id="collecting-data">Collecting data</h3>
<p>Once you have distributed the link either to your collaborators or more widely, it is time to collect data. This is where I think that the “app” part of this is so cool. You can edit the Google Sheet from a desktop computer, but if you have the (free!) Google Sheets app for your phone then you can also edit the data on the go. There is even an offline mode if the internet connection isn’t available where you are working (more on this below).</p>
<h3 id="quality-control">Quality control</h3>
<p>One of the major issues with distributed data collection is quality control. If possible you want people to input data using (a) a controlled vocabulary/system and (b) the same controlled vocabulary/system. My suggestion here depends on whether you are using a controlled distributed system or an uncontrolled distributed system.</p>
<p>For the controlled distributed system you are specifically giving access to individual people - you can provide some training or a walk through before giving them access.</p>
<p>For the uncontrolled distributed system you should create a <em>very</em> detailed set of instructions. For example, for my sheet I would create a set of instructions like:</p>
<ol>
<li>Every data point must have a label of who collected it in the <code class="language-plaintext highlighter-rouge">who_collected</code> column. You should pick a username that does not currently appear in the sheet and stick with it. Use all lower case for your username.</li>
<li>You should either report “yes” or “no” in lowercase in the <code class="language-plaintext highlighter-rouge">at_work</code> column.</li>
<li>You should report the name of the person in all lower case in the <code class="language-plaintext highlighter-rouge">person</code> column. You should search and make sure that the person you are reporting on doesn’t appear before introducing a new name. If the name already exists, use the name spelled exactly as it is in the sheet already.</li>
<li>You should report the <code class="language-plaintext highlighter-rouge">time</code> in the format hh:mm on a 24 hour clock in the eastern time zone of the United States.</li>
<li>You should report the <code class="language-plaintext highlighter-rouge">date</code> in the mm/dd/yyyy format.</li>
</ol>
<p>You could be much more detailed depending on the case.</p>
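<p>Written instructions only go so far, so it may help to check the rules in R after reading the sheet. The following is a minimal sketch, not part of the <code class="language-plaintext highlighter-rouge">googlesheets</code> package: <code class="language-plaintext highlighter-rouge">check_rows</code> is a hypothetical helper that flags rows breaking the formatting parts of the five rules above. It cannot check whether a username or a person’s name was chosen consistently with the rest of the sheet.</p>

```r
## Hypothetical helper: returns TRUE for each row that follows the
## formatting rules above, FALSE otherwise
check_rows = function(dat) {
  ## Rules 1 and 3: non-empty and all lower case
  is_lower = function(x) !is.na(x) & nzchar(x) & x == tolower(x)
  ok_user = is_lower(dat$who_collected)
  ok_person = is_lower(dat$person)

  ## Rule 2: exactly "yes" or "no" in lowercase
  ok_work = dat$at_work %in% c("yes", "no")

  ## Rule 4: hh:mm on a 24 hour clock
  ok_time = grepl("^([01][0-9]|2[0-3]):[0-5][0-9]$", dat$time)

  ## Rule 5: mm/dd/yyyy and a real calendar date
  ok_date = grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}$", dat$date) &
    !is.na(as.Date(dat$date, format = "%m/%d/%Y"))

  ok_user & ok_person & ok_work & ok_time & ok_date
}

## Made-up rows for illustration: one clean row and two with problems
example = data.frame(
  who_collected = c("jeff", "Jeff", "jeff"),
  at_work = c("yes", "no", "no"),
  person = c("roger", "ingo", "brian"),
  time = c("13:47", "13:47", "25:00"),
  date = c("08/26/2016", "08/26/2016", "08/26/2016"),
  stringsAsFactors = FALSE
)
row_ok = check_rows(example)

## Rows that need to be sent back to the collector
needs_fixing = example[!row_ok, ]
```

<p>Returning a logical vector rather than stopping with an error keeps the good rows usable while the flagged ones are followed up.</p>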
<h3 id="offline-editing-and-conflicts">Offline editing and conflicts</h3>
<p>One of the cool things about Google Sheets is that they can even be edited without an internet connection. This is particularly useful if you are collecting data in places where internet connections may be spotty. But that may generate conflicts if you use only one sheet.</p>
<p>There may be different ways to handle this, but one I thought of is to just create one sheet for each person collecting data (if you are using controlled distributed data collection). Then each person only edits the data in their sheet, avoiding potential conflicts if multiple people are editing offline and non-synchronously.</p>
<h3 id="reading-the-data">Reading the data</h3>
<p>Anyone with the link can now read the most up-to-date data with the following simple code.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Identifies the Google Sheet</span><span class="w">
</span><span class="n">example_sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_url</span><span class="p">(</span><span class="s2">"https://docs.google.com/spreadsheets/d/177WyyzWOHGIQ9O5iUY9P9IVwGi7jL3f4XBY4d98CY_o/pubhtml"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Sheet-identifying info appears to be a browser URL.
## googlesheets will attempt to extract sheet key from the URL.
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Putative key: 177WyyzWOHGIQ9O5iUY9P9IVwGi7jL3f4XBY4d98CY_o
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Sheet successfully identified: "app_example"
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Reads the data</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_read</span><span class="p">(</span><span class="n">example_sheet</span><span class="p">,</span><span class="w"> </span><span class="n">ws</span><span class="o">=</span><span class="s2">"Sheet1"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Accessing worksheet titled 'Sheet1'.
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## No encoding supplied: defaulting to UTF-8.
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dat</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## # A tibble: 3 x 5
##   who_collected at_work person  time       date
##           <chr>   <chr>  <chr> <chr>      <chr>
## 1          jeff      no   ingo 13:47 08/26/2016
## 2          jeff     yes  roger 13:47 08/26/2016
## 3          jeff     yes  brian 13:47 08/26/2016
</code></pre></div></div>
<p>Here the url is the one I got when I went to the “File” menu and clicked on “Publish to the web…”. The argument <code class="language-plaintext highlighter-rouge">ws</code> in the <code class="language-plaintext highlighter-rouge">gs_read</code> command is the name of the worksheet. If you have multiple sheets assigned to different people, you can read them in one at a time and then merge them together.</p>
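<p>Merging the per-person sheets can be sketched as follows, assuming every worksheet uses the same column names. The two data frames below are made-up stand-ins so the snippet runs on its own; in practice each one would come from a call like <code class="language-plaintext highlighter-rouge">gs_read(example_sheet, ws = ...)</code> with your own worksheet names.</p>

```r
## Stand-ins for two worksheets, one per data collector (values are made up)
sheet_jeff = data.frame(who_collected = "jeff", at_work = c("no", "yes"),
                        person = c("ingo", "roger"), time = "13:47",
                        date = "08/26/2016", stringsAsFactors = FALSE)
sheet_roger = data.frame(who_collected = "roger", at_work = "yes",
                         person = "brian", time = "13:50",
                         date = "08/26/2016", stringsAsFactors = FALSE)

## Keep the per-person sheets together in a named list
per_person = list(jeff = sheet_jeff, roger = sheet_roger)

## Stack the sheets row-wise; this requires identical column names
combined = do.call(rbind, per_person)

## Drop the list-derived row names so rows are simply numbered
rownames(combined) = NULL
```

<p>Since each row already carries a <code class="language-plaintext highlighter-rouge">who_collected</code> label, the combined data frame still records which sheet a row came from.</p>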
<h3 id="conclusion">Conclusion</h3>
<p>So that’s it, it’s pretty simple. But as I gear up to teach advanced data science here at Hopkins I’m thinking a lot about Sean Taylor’s awesome post <a href="http://seanjtaylor.com/post/41463778912/real-scientists-make-their-own-data">Real scientists make their own data</a>.</p>
<p>I think this approach is a super cool/super lightweight system for collecting data either on your own or as a team. As I said I think it could be really useful in public health, but it could also be used for any data collection you want.</p>
Interview with COPSS award winner Nicolai Meinshausen.
2016-08-24T00:00:00+00:00
http://simplystats.github.io/2016/08/24/meinshausen-copss
<p><em>Editor’s Note: We are again pleased to interview the COPSS President’s award winner. The COPSS Award is one of the most prestigious in statistics, sometimes called the Nobel Prize in statistics. This year the award went to Nicolai Meinshausen from ETH Zurich. He is known for his work in causality, high-dimensional statistics, and machine learning. Also see our past COPSS award interviews with <a href="http://simplystatistics.org/2015/08/25/interview-with-copss-award-winner-john-storey/">John Storey</a> and <a href="http://simplystatistics.org/2014/08/18/interview-with-copss-award-winner-martin-wainright/">Martin Wainwright</a>.</em></p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/meinshausen.png" alt="Nicolai Meinshausen" /></p>
<h2 id="do-you-consider-yourself-to-be-a-statistician-data-scientist-machine-learner-or-something-else">Do you consider yourself to be a statistician, data scientist, machine learner, or something else?</h2>
<p>Perhaps all of the above. If you forced me to pick one, then statistician but I hope we will soon come to a point where these distinctions do not matter much any more.</p>
<h2 id="how-did-you-find-out-you-had-won-the-copss-award">How did you find out you had won the COPSS award?</h2>
<p>Jeremy Taylor called me. I know I am expected to say I did not expect it but that was indeed the case and it was a genuine surprise.</p>
<h2 id="how-do-you-see-the-fields-of-causal-inference-and-high-dimensional-statistics-merging">How do you see the fields of causal inference and high-dimensional statistics merging?</h2>
<p>Causal inference is already very challenging in the low-dimensional case - if understood as data for which the number of observations exceeds
the number of variables. There are commonalities between high-dimensional statistics and the subfield of causal discovery, however, as we try to recover a sparse underlying structure from data in both cases
(say when trying to reconstruct a gene network from
observational and intervention data). The interpretations are just slightly different. A further difference is the implicit optimization. High-dimensional estimators can often be framed as convex optimization problems and the question is whether causal discovery can or should be
pushed in this direction as well.</p>
<h2 id="can-you-explain-a-little-about-how-you-can-infer-causal-effects-from-inhomogeneous-data">Can you explain a little about how you can infer causal effects from inhomogeneous data?</h2>
<p>Why do we want a causal model in the first place? In most cases the benefit of a causal over a regression model
is that the predictions of a causal model continue to be valid even if we intervene on the variables we use for prediction.</p>
<p>The inference we proposed turns this around and is looking for all models that are invariant in the sense that they give the same prediction accuracy across a number of different settings or environments. If we just have observational data, then this invariance
holds for all models but if the data are inhomogeneous, certain models can be discarded since they work better in one environment than in another and can thus not be causal. If all models that show invariance use a certain variable, we can claim that the variable in question
has a causal effect (while controlling type I error rates) and construct confidence intervals for the strength of the effect.</p>
<h2 id="you-have-worked-on-studying-the-effects-of-climate-change---do-you-think-statisticians-can-play-an-important-role-in-this-debate">You have worked on studying the effects of climate change - do you think statisticians can play an important role in this debate?</h2>
<p>To a certain extent. I have worked on several projects with physicists and the general caveat is that physicists are in general quite advanced in their methodology already and have quite a good understanding of the relevant statistical concepts. Biology is thus maybe a field where even more external input is required. Then again, it saves one from having to calculate t-tests in collaborations with physicists and just the interesting and challenging problems are left.</p>
<h2 id="what-advice-would-you-give-young-statisticians-getting-into-the-discipline-right-now">What advice would you give young statisticians getting into the discipline right now?</h2>
<p>First I would say that they have made a good choice since these are interesting times for the field with many challenging and relevant problems still open and unsolved (but not completely out of reach either).
I think it’s important to keep an open mind and read as much literature as possible from neighbouring fields. My personal experience has been that it is very beneficial to get involved in some scientific collaborations.</p>
<h2 id="what-sorts-of-things-is-your-group-working-on-these-days">What sorts of things is your group working on these days?</h2>
<p>Two general themes: the first is what people would call more classical machine learning. For example, how can we detect interactions in large-scale datasets in sub-quadratic time? Secondly, we are trying to extend the invariance approach to causal inference
to more general settings, for example allowing for nonlinearities and hidden variables while at the same time
improving the computational aspects.</p>
A Simple Explanation for the Replication Crisis in Science
2016-08-24T00:00:00+00:00
http://simplystats.github.io/2016/08/24/replication-crisis
<p>By now, you’ve probably heard of the <a href="https://en.wikipedia.org/wiki/Replication_crisis">replication crisis in science</a>. In summary, many conclusions from experiments done in a variety of fields have been found to not hold water when followed up in subsequent experiments. There are now any number of famous examples, particularly from the fields of <a href="http://science.sciencemag.org/content/349/6251/aac4716">psychology</a> and <a href="http://biorxiv.org/content/early/2016/04/27/050575">clinical medicine</a>, showing that the rate of replication of findings is lower than the expected rate.</p>
<p>The reasons proposed for this crisis are wide ranging, but typically center on the preference for “novel” findings in science and the pressure on investigators (especially new ones) to “publish or perish”. My favorite reason places the blame for the entire crisis on <a href="http://www.nature.com/news/psychology-journal-bans-p-values-1.17001">p-values</a>.</p>
<p>I think to develop a better understanding of why there is a “crisis”, we need to step back and look across different fields of study. There is one key question you should be asking yourself: <em>Is the replication crisis evenly distributed across different scientific disciplines?</em> My reading of the literature would suggest “no”, but the reasons attributed to the replication crisis are common to all scientists in every field (i.e. novel findings, publishing, etc.). So why would there be any heterogeneity?</p>
<h2 id="an-aside-on-replication-and-reproducibility">An Aside on Replication and Reproducibility</h2>
<p>As Lorena Barba recently <a href="https://twitter.com/LorenaABarba/status/764836487212957696">pointed</a> <a href="https://github.com/ReScience/ReScience-article/issues/5#issuecomment-241232791">out</a>, there can be tremendous confusion over the use of the words <strong>replication</strong> and <strong>reproducibility</strong>, so it’s best that we sort that out now. Here’s how I use both words:</p>
<ul>
<li>
<p><em>replication</em>: This is the act of repeating an entire study, independently of the original investigator without the use of original data (but generally using the same methods).</p>
</li>
<li>
<p><em>reproducibility</em>: A study is reproducible if you can take the original data and the <em>computer code</em> used to analyze the data and reproduce all of the numerical findings from the study. This may initially sound like a trivial task but experience has shown that it’s not always easy to achieve this seemingly minimal standard.</p>
</li>
</ul>
<p>For more precise definitions of what I mean by these terms, you can take a look at <a href="http://biorxiv.org/content/early/2016/07/29/066803">my recent paper with Jeff Leek and Prasad Patil</a>.</p>
<p>One key distinction between replication and reproducibility is that with replication, there is no need to trust any of the original findings. In fact, you may be attempting to replicate a study <em>because</em> you do not trust the findings of the original study. A recent example includes the creation of stem cells from ordinary cells, a finding that was so extraordinary that it led several laboratories to quickly try to replicate the study. Ultimately, <a href="http://www.nature.com/nature/journal/v525/n7570/full/nature15513.html">seven separate laboratories could not replicate the finding</a> and the original study was retracted.</p>
<h2 id="astronomy-and-epidemiology">Astronomy and Epidemiology</h2>
<p>What do the fields of astronomy and epidemiology have in common? You might think nothing. Those two departments are often not even on the same campus! However, they have at least one common element, which is that the things that they study are generally reluctant to be controlled by human beings. As a result, both astronomers and epidemiologists rely heavily on one tool: the <strong>observational study</strong>.
Much has been written about observational studies of late, and I’ll spare you the literature search by saying that the bottom line is they can’t be trusted (particularly observational studies that have not been pre-registered!).</p>
<p>But that’s fine: we have a method for dealing with things we don’t trust. It’s called replication. Epidemiologists actually codified their understanding of when they believe a causal claim (see <a href="https://en.wikipedia.org/wiki/Bradford_Hill_criteria">Hill’s Criteria</a>), which is typically only after a claim has been replicated in numerous studies in a variety of settings. My understanding is that astronomers have a similar mentality as well: no single study will result in anyone believing something new about the universe. Rather, findings need to be replicated using different approaches, instruments, etc.</p>
<p>The key point here is that in both astronomy and epidemiology expectations are low with respect to individual studies. <strong>It’s difficult to have a replication crisis when nobody believes the findings in the first place</strong>. Investigators have a culture of distrusting individual one-off findings until they have been replicated numerous times. In my own area of research, the idea that ambient air pollution causes health problems was difficult to believe for decades, until we started seeing the same associations appear in numerous studies conducted all around the world. It’s hard to imagine any single study “proving” that connection, no matter how well it was conducted.</p>
<h2 id="theory-and-experimentation-in-science">Theory and Experimentation in Science</h2>
<p>I’ve already described the primary mode of investigation in astronomy and epidemiology, but there are of course other methods in other fields. One large category of methods includes the <strong>controlled experiment</strong>. Controlled experiments come in a variety of forms, from laboratory experiments on cells to randomized clinical trials with humans, but all of them involve intentional manipulation of some factor by the investigator in order to observe how such manipulation affects an outcome. In clinical medicine and the social sciences, controlled experiments are considered the “gold standard” of evidence. Meta-analyses and literature summaries generally weight publications with controlled experiments more highly than other approaches like observational studies.</p>
<p>The other aspect I want to look at here is whether a field has a strong basic theoretical foundation. The idea here is that some fields, like say physics, have a strong set of basic theories whose predictions have been consistently validated over time. Other fields, like medicine, lack even the most rudimentary theories that can be used to make basic predictions. Granted, the distinction between fields with or without “basic theory” is a bit arbitrary on my part, but I think it’s fair to say that different fields of study fall on a spectrum in terms of how much basic theory they can rely on.</p>
<p>Given the theoretical nature of different fields and the primary mode of investigation, we can develop the following crude 2x2 table, in which I’ve inserted some representative fields of study.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/replication_2x2.png" alt="Theory vs. Experimentation in Science" /></p>
<p>My primary contention here is</p>
<blockquote>
<p>The replication crisis in science is concentrated in areas where (1) there is a tradition of controlled experimentation and (2) there is relatively little basic theory underpinning the field.</p>
</blockquote>
<p>Further, in general, I don’t believe that there’s anything wrong with the people tirelessly working in the upper right box. At least, I don’t think there’s anything <em>more</em> wrong with them compared to the good people working in the other three boxes.</p>
<p>In case anyone is wondering where the state of clinical science is relative to, say, particle physics with respect to basic theory, I only point you to the web site for the <a href="https://nccih.nih.gov">National Center for Complementary and Integrative Health</a>. This is essentially a government agency with a budget of $124 million dedicated to <a href="http://www.forbes.com/sites/stevensalzberg/2011/08/29/nihs-350000-questionnaire/#1ee73d4d4fc6">advancing pseudoscience</a>. This is the state of “basic theory” in clinical medicine.</p>
<h2 id="the-bottom-line">The Bottom Line</h2>
<p>The people working in the upper right box have an uphill battle for at least two reasons</p>
<ol>
<li>The lack of strong basic theory makes it more difficult to guide investigation, leading to wider ranging efforts that may be less likely to replicate.</li>
<li>The tradition of controlled experimentation places <em>high expectations</em> that research produced here is “valid”. I mean, hey, they’re using the gold standard of evidence, right?</li>
</ol>
<p>The confluence of these two factors leads to a much greater disappointment when findings from these fields do not replicate. This leads me to believe that <strong>the replication crisis in science is largely attributable to a mismatch in our expectations of how often findings should replicate and how difficult it is to actually discover true findings in certain fields</strong>. Further, the reliance on controlled experiments in certain fields has lulled us into believing that we can place tremendous weight on a small number of studies. Ultimately, when someone asks, “Why <em>haven’t</em> we cured cancer yet?” the answer is “Because it’s hard”.</p>
<h2 id="the-silver-lining">The Silver Lining</h2>
<p>It’s important to remember that, as my colleague Rafa Irizarry <a href="http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/">pointed out</a>, findings from many of the fields in the upper right box, especially clinical medicine, can have tremendous positive impacts on our lives when they do work out. Rafa says</p>
<blockquote>
<p>…I argue that the rate of discoveries is higher in biomedical research than in physics. But, to achieve this higher true positive rate, biomedical research has to tolerate a higher false positive rate.</p>
</blockquote>
<p>It is certainly possible to reduce the rate of false positives—one way would be to do no experiments at all! But is that what we want? Would that most benefit us as a society?</p>
<h2 id="the-takeaway">The Takeaway</h2>
<p>What to do? Here are a few thoughts:</p>
<ul>
<li>We need to stop thinking that any single study is definitive or confirmatory, no matter if it was a controlled experiment or not. Science is always a cumulative business, and the value of a given study should be understood in the context of what came before it.</li>
<li>We have to recognize that some areas are more difficult to study and are less mature than other areas because of the lack of basic theory to guide us.</li>
<li>We need to think about what the tradeoffs are for discovering many things that may not pan out relative to discovering only a few things. What are we willing to accept in a given field? This is a discussion that I’ve not seen much of.</li>
<li>As Rafa pointed out in his article, we can definitely focus on things that make science better for everyone (better methods, rigorous designs, etc.).</li>
</ul>
A meta list of what to do at JSM 2016
2016-07-30T00:00:00+00:00
http://simplystats.github.io/2016/07/30/jsm2016
<p>I’m going to be heading out tomorrow for JSM 2016. If you want to catch up I’ll be presenting in the 6-8PM poster session on <a href="https://www.amstat.org/meetings/jsm/2016/onlineprogram/ActivityDetails.cfm?SessionID=213079">The Extraordinary Power of Data</a> on Sunday and on <a href="https://www.amstat.org/meetings/jsm/2016/onlineprogram/ActivityDetails.cfm?SessionID=212543">data visualization (and other things) in MOOCs</a> at 8:30am on Monday. Here is a little sneak preview, the first slide from my talk:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/firstslide.jpg" alt="Was too scared to use GIFs" /></p>
<p>This year I am so excited that other people have done all the work of going through the program for me and picking out what talks to see. Here is a list of lists.</p>
<ul>
<li><a href="https://kbroman.wordpress.com/2016/07/27/my-jsm-2016-itinerary/">Karl Broman</a> - if you like open source software, data viz, and genomics.</li>
<li><a href="https://blog.rstudio.org/2016/07/19/discover-r-and-rstudio-at-jsm-2016-chicago/">RStudio</a> - if you like RStudio.</li>
<li><a href="http://citizen-statistician.org/2016/07/29/my-jsm2016-itinerary/">Mine Cetinkaya Rundel</a> - if you like stat ed, data science, data viz, and data journalism.</li>
<li><a href="https://twitter.com/DrJWolfson/status/758990552754827264">Julian Wolfson</a> - if you like missing sessions and guilt.</li>
<li><a href="https://github.com/stephaniehicks/classroomNotes/blob/master/conferences/JSM2016.md">Stephanie Hicks</a> - if you like lots of sessions and can’t make up your mind (also stat genomics, open source software, stat computing, stats for social good…)</li>
</ul>
<p>If you know about more lists, please feel free to tweet at me or send pull requests.</p>
<p>I also saw the materials for this <a href="https://github.com/simonmunzert/rscraping-jsm-2016">awesome tutorial on webscraping</a> that I’m sorry I’ll miss.</p>
The relativity of raw data
2016-07-20T00:00:00+00:00
http://simplystats.github.io/2016/07/20/relativity-raw-data
<p>“Raw data” is one of those terms that everyone in statistics and data science uses but no one defines. For example, we all agree that we should be able to recreate results in scientific papers from the raw data and the code for that paper.</p>
<blockquote>
<p>But what do we mean when we say raw data?</p>
</blockquote>
<p>When working with collaborators or students I often find myself saying - could you just give me the raw data so I can do the normalization or processing myself. To give a concrete example, I work in the analysis of data from <a href="http://www.nature.com/nbt/journal/v26/n10/full/nbt1486.html">high-throughput genomic sequencing experiments</a>.</p>
<p>These experiments produce data by breaking up genomic molecules into short fragments of DNA - then reading off parts of those fragments to generate “reads” - usually 100 to 200 letters long per read. But the reads are just puzzle pieces that need to be fit back together and then quantified to produce measurements on DNA variation or gene expression abundances.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/sequencing.png" alt="High throughput sequencing" /></p>
<p><a href="http://cbcb.umd.edu/~hcorrada/CFG/lectures/lect22_seqIntro/seqIntro.pdf">Image from Hector Corrada Bravo’s lecture notes</a></p>
<p>When I say “raw data” when talking to a collaborator I mean the reads that are reported from the sequencing machine. To me that is the rawest form of the data I will look at. But to generate those reads the sequencing machine first (1) created a set of images for each letter in the sequence of reads, (2) measured the color at the spots on that image to get the quantitative measurement of which letter, and (3) calculated which letter was there with a confidence measure. The raw data I ask for only includes the confidence measure and the sequence of letters itself, but ignores the images and the colors extracted from them (steps 1 and 2).</p>
<p>So to me the “raw data” is the files of reads. But to the people who produce the machine for sequencing the raw data may be the images or the color data. To my collaborator the raw data may be the quantitative measurements I calculate from the reads. When thinking about this I realized an important characteristic of raw data.</p>
<blockquote>
<p>Raw data is relative to your reference frame.</p>
</blockquote>
<p>In other words, the raw data is raw to <em>you</em> if you have done no processing, manipulation, coding, or analysis of the data: the file you received from the person before you is untouched. But it may not be the <em>rawest</em> version of the data. The person who gave you the raw data may have done some computations. They have a different “raw data set”.</p>
<p>The implication for reproducibility and replicability is that we need a “chain of custody” just like with evidence collected by the police. As long as each person keeps a copy and a record of what counts as “raw data” to them, you can trace the provenance of the data back to the original source.</p>
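<p>One way to picture such a chain of custody is the minimal sketch below, in which each person records a fingerprint of the data they received and of the data they passed on. The names <code>fingerprint</code>, <code>CustodyRecord</code>, and <code>chain_is_unbroken</code>, the choice of SHA-256, and the toy byte strings are all assumptions made for illustration, not an established tool or workflow.</p>

```python
import hashlib
from dataclasses import dataclass
from typing import List


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies one exact version of the data."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class CustodyRecord:
    holder: str            # who held the data at this step (hypothetical labels)
    received_sha256: str   # fingerprint of the "raw data" as this person got it
    passed_on_sha256: str  # fingerprint of what they handed to the next person


def chain_is_unbroken(records: List[CustodyRecord]) -> bool:
    """True if each person's received data matches what the previous person passed on."""
    return all(
        earlier.passed_on_sha256 == later.received_sha256
        for earlier, later in zip(records, records[1:])
    )


# Hypothetical hand-off: images, then reads, then a table of counts.
images = b"pixel intensities from the sequencer"
reads = b"reads with confidence scores"
counts = b"gene,count\nGENE1,42\n"

chain = [
    CustodyRecord("sequencing core", fingerprint(images), fingerprint(reads)),
    CustodyRecord("analyst", fingerprint(reads), fingerprint(counts)),
]
print(chain_is_unbroken(chain))  # True
```

<p>Each record describes “raw data” only from one person’s reference frame, but linking the records lets you walk back from the final table to the original images.</p>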
Not So Standard Deviations Episode 19 - Divide by n-1, or n-2, or Whatever
2016-07-18T00:00:00+00:00
http://simplystats.github.io/2016/07/18/nssd-episode-19
<p>Hilary and I talk about statistical software in fMRI analyses, the differences between software testing differences in proportions (a must listen!), and a preview of JSM 2016.</p>
<p>Also, Hilary and I have just published a new book, <a href="https://leanpub.com/conversationsondatascience?utm_source=SimplyStats&utm_campaign=NSSD&utm_medium=BlogPost">Conversations on Data Science</a>, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes.</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show Notes:</p>
<ul>
<li>
<p><a href="http://www.theregister.co.uk/2016/07/03/mri_software_bugs_could_upend_years_of_research/?mt=1467760452040">fMRI bugs could upend years of research</a></p>
</li>
<li>
<p><a href="http://www.pnas.org/content/113/28/7900.full">Eklund et al. PNAS Paper</a></p>
</li>
<li>
<p><a href="https://www.amstat.org/meetings/jsm/2016/onlineprogram/index.cfm">JSM 2016 Program</a></p>
</li>
<li>
<p><a href="https://leanpub.com/conversationsondatascience">Conversations on Data Science</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-19-divide-by-n-1-or-n-2-or-whatever">Download the audio for this episode</a>.</p>
<p>Listen here:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/274214566&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Tuesday update
2016-07-11T00:00:00+00:00
http://simplystats.github.io/2016/07/11/tuesday-update
<h2 id="it-might-all-be-wrong">It Might All Be Wrong</h2>
<p>Tom Nichols and colleagues have published a paper on the software used to analyze fMRI data:</p>
<blockquote>
<p>Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.</p>
</blockquote>
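<p>To see what “5% false positives for a significance threshold of 5%” means in the simplest possible setting, here is a toy simulation. It is only a sketch under strong assumptions: it uses a two-sided z-test on independent standard normal data with known variance, which is far simpler than the fMRI methods evaluated in the paper, and the function name <code>false_positive_rate</code> and all parameter values are hypothetical.</p>

```python
import random
from statistics import NormalDist


def false_positive_rate(n_tests: int, n_per_test: int, alpha: float, seed: int) -> float:
    """Fraction of null datasets that a two-sided z-test declares significant."""
    rng = random.Random(seed)
    # Critical value for a two-sided test at level alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_tests):
        # Null data: the true mean is 0 and the standard deviation is 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_test)]
        sample_mean = sum(sample) / n_per_test
        z = sample_mean * n_per_test ** 0.5  # standard error is 1 / sqrt(n)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_tests


rate = false_positive_rate(n_tests=20000, n_per_test=25, alpha=0.05, seed=1)
print(rate)  # should land close to 0.05 when the test's assumptions hold
```

<p>When the assumptions behind a test hold, as they do by construction here, the rejection rate on null data sits near the nominal level. The paper’s finding is that some common fMRI analyses can land far above it on real null data.</p>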
<h2 id="criminal-justice-forecasts">Criminal Justice Forecasts</h2>
<p>The <a href="http://www.theatlantic.com/technology/archive/2016/06/when-algorithms-take-the-stand/489566/">ongoing discussion</a> over the use of prediction algorithms in the criminal justice system reminds me a bit of the introduction of DNA evidence decades ago. Ultimately, there is a technology that few people truly understand and there are questions as to whether the information it provides is fair or accurate.</p>
<h2 id="shameless-promotion">Shameless Promotion</h2>
<p>I have a <a href="https://leanpub.com/conversationsondatascience">new book</a> coming out with Hilary Parker, based on our <em>Not So Standard Deviations</em> podcast. Sign up to be notified of its release (which should be Real Soon Now).</p>
Not So Standard Deviations Episode 18 - Back on Planet Earth
2016-07-05T00:00:00+00:00
http://simplystats.github.io/2016/07/05/nssd-episode-18
<p>With Hilary fresh from Use R! 2016, Hilary and I discuss some of the highlights from the conference. Also, some followup about a previous Free Advertising and the NSSD drinking game.</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://www.vanityfair.com/hollywood/2016/06/jennifer-lawrence-theranos-elizabeth-holmes">Theranos movie with Jennifer Lawrence and Adam McKay</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Snowden_(film)">Snowden movie</a></p>
</li>
<li>
<p><a href="http://www.npr.org/2016/06/19/482514949/welcome-to-mongolias-new-postal-system-an-atlas-of-random-words">What3Words being used in Mongolia</a></p>
</li>
<li>
<p><a href="https://github.com/jimhester/lintr">lintr package</a></p>
</li>
<li>
<p><a href="https://youtu.be/dhh8Ao4yweQ">“The Electronic Coach” with Don Knuth</a></p>
</li>
<li>
<p><a href="http://alyssafrazee.com/gender-and-github-code.html">Exploring the data on gender and GitHub repo ownership</a></p>
</li>
<li>
<p><a href="https://blog.codinghorror.com/falling-into-the-pit-of-success/">Jeff Atwood “Falling Into the Pit of Success”</a></p>
</li>
<li>
<p><a href="https://research.googleblog.com/2014/08/doing-data-science-with-colaboratory.html">Google coLaboratory</a></p>
</li>
<li>
<p><a href="https://www.stickermule.com/marketplace/12936-number-rcatladies">#rcatladies stickers</a></p>
</li>
<li>
<p><a href="https://twitter.com/astrokatie/status/745529809669787649">Katie Mack time-lapse video</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-18-back-on-planet-earth">Download the audio for this episode</a>.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/272064450&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Tuesday Update
2016-06-28T00:00:00+00:00
http://simplystats.github.io/2016/06/28/tuesday-update
<h2 id="if-you-werent-sick-of-theranos-yet">If you weren’t sick of Theranos yet….</h2>
<p>Looks like there will be a movie version of the <a href="http://simplystatistics.org/2016/05/23/update-on-theranos/">Theranos saga</a> which, as far as I can tell, isn’t over yet, but no matter. It will be done by Adam McKay, the writer-director of The Big Short (excellent film), and will star Jennifer Lawrence as Elizabeth Holmes. From <a href="http://www.vanityfair.com/hollywood/2016/06/jennifer-lawrence-theranos-elizabeth-holmes">Vanity Fair</a>:</p>
<blockquote>
<p>Legendary Pictures snapped up rights to the hot-button biopic for a reported $3 million Thursday evening, after outbidding and outlasting a swarm of competition from Warner Bros., Twentieth Century Fox, STX Entertainment, Regency Enterprises, Cross Creek, Amazon Studios, AG Capital, the Weinstein Company, and, in the penultimate stretch, Paramount, among other studio suitors.</p>
</blockquote>
<blockquote>
<p>Based on a book proposal by two-time Pulitzer Prize-winning journalist John Carreyrou titled Bad Blood: Secrets and Lies in Silicon Valley, the project (reported to be in the $40 million to $50 million budget range) has made the rounds to almost every studio in town. It’s been personally pitched by McKay, who won an Oscar for best adapted screenplay for last year’s rollicking financial meltdown procedural The Big Short.</p>
</blockquote>
<p>Frankly, I think we all know how this movie will end.</p>
<h2 id="the-people-vs-oj-simpson-vsstatistics">The People vs. OJ Simpson vs….Statistics</h2>
<p>I’m in the middle of watching <a href="https://en.wikipedia.org/wiki/The_People_v._O._J._Simpson:_American_Crime_Story">The People vs. OJ Simpson</a> and so far it is fantastic—I highly recommend it. One thing that is not represented in the show is the important role that statistics played in the trial. The trial was just in the early days of using DNA as evidence in criminal trials and there were many questions about how likely it was to find DNA matches in blood.</p>
<p>Terry Speed ended up testifying for the defense (Simpson) and in this <a href="http://www.statisticsviews.com/details/feature/4915471/To-some-statisticians-a-number-is-a-number-but-to-me-a-number-is-packed-with-his.html">nice interview</a>, he explains how that came to be:</p>
<blockquote>
<p>At the beginning of the Simpson trial, there was going to be a pre-trial hearing and experts from both sides would argue in front of the judge as to what approaches should be accepted. Other pre-trial activities dragged on, and the one on DNA forensics was eventually scrapped. The DNA experts, including me were then asked whether they wanted to give evidence for the prosecution or defence, or leave. I did not initially plan to join the defence team, but wished to express my point of view in what was more or less a scientific environment before the trial started, but when the pre-trial DNA hearing was scrapped, I decided that I had no choice but to express my views in court on behalf of the defence, which I did.</p>
</blockquote>
<p>The full interview is well worth the read.</p>
<h2 id="ai-is-the-residual">AI is the residual</h2>
<p>I just recently found out about the <a href="https://en.m.wikipedia.org/wiki/AI_effect">AI effect</a> which I thought was interesting. Basically, “AI” is whatever can’t be explained, or in other words, the residuals of machine learning.</p>
A Year at Stack Overflow
2016-06-28T00:00:00+00:00
http://simplystats.github.io/2016/06/28/stack-overflow-drob
<p>David Robinson (<a href="https://twitter.com/drob">@drob</a>) has a great post on his blog about his <a href="http://varianceexplained.org/r/year_data_scientist/">first year as a data scientist at Stack Overflow</a>. This section in particular stood out for me:</p>
<blockquote>
<p>I like using R to learn interesting things about our data, but my longer term goal is to make it easy for any of our engineers to do so….Towards this goal, I’ve been focusing on building reliable tools and frameworks that people can apply to a variety of problems, rather than “one-off” analysis scripts. (There’s an awesome post by Jeff Magnusson at StitchFix about some of these general challenges). My approach has been building internal R packages, similar to AirBnb’s strategy (though our data team is quite a bit younger and smaller than theirs). These internal packages can query databases and parsing our internal APIs, including making various security and infrastructure issues invisible to the user.</p>
</blockquote>
<p>The world needs an army of David Robinsons.</p>
Ultimate AI battle - Apple vs. Google
2016-06-14T00:00:00+00:00
http://simplystats.github.io/2016/06/14/ultimate-ai-battle
<p>Yesterday, Apple launched its Worldwide Developers Conference (WWDC) and had its public keynote address. While many new things were announced, the one thing that caught my eye was the <a href="http://go.theinformation.com/HnOAdA6DQ7g">dramatic expansion</a> of Apple’s use of artificial intelligence (AI) tools. I talked a bit about AI with Hilary Parker on the latest <a href="http://simplystatistics.org/2016/06/09/nssd-episode-17/"><em>Not So Standard Deviations</em></a>, particularly in the context of Amazon’s Echo/Alexa, and I think it’s definitely going to be an area of intense competition between the major tech companies.</p>
<p>Pretty much every major tech player is involved in AI: Google, Facebook, Amazon, Apple, Microsoft, and the list goes on. Recently, <a href="https://marco.org/2016/05/21/avoiding-blackberrys-fate">some commentators</a> <a href="https://stratechery.com/2015/tim-cooks-unfair-and-unrealistic-privacy-speech-strategy-credits-the-privacy-priority-problem/">have suggested</a> that Apple in particular will never catch up with the likes of Google with respect to AI because of Apple’s strict stance on privacy and unwillingness to gather/aggregate data from all its users. However, yesterday at WWDC, Apple revealed a few clues about what it was up to in the AI world.</p>
<p>First, Apple mentioned deep learning more than a few times, including specifically calling out its use of <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">LSTM</a> in its Messages app and facial recognition in its Photos app. Previously, Apple had been rumored to be applying deep learning to its <a href="http://go.theinformation.com/4Z2WhEs9_Nc">Siri assistant and its fingerprint sensor</a>. At WWDC, Craig Federighi stressed Apple’s continued focus on privacy and how Apple does not need to develop “user profiles” server-side, but rather does most computation on-device (in this case, on the iPhone).</p>
<p>However, it can’t be that Apple does all its deep learning computation on the iPhone. These models tend to be enormous and take advantage of reams of data that can only be reasonably processed server-side. Unfortunately, because most companies (Apple in particular) release few details about what they do, we may never know how this works. But we can definitely speculate!</p>
<h2 id="apple-vs-google">Apple vs. Google</h2>
<p>I think the Apple/Google dichotomy provides an interesting opportunity to talk about how models can be learned using data in different ways. There are two approaches being represented here by Apple and Google:</p>
<ul>
<li><strong>Google way</strong> - Collect lots of data from users and store them on a server in the Googleplex somewhere. Then use that data to fit an enormous model that can predict when you’ve taken a picture of a cat. As users generate more data, bring that data back to the Googleplex and update/refine the model.</li>
<li><strong>Apple way</strong> - Build a “starter model” in the Apple <a href="http://9to5mac.com/2015/10/05/spaceship-campus-2-drone-video-october/">Mothership</a>. As users generate data on their phones, bring the model to the phone and update the model using just their data. Bring the updated model back to the Apple Mothership and leave the user’s data on the phone.</li>
</ul>
<p>Perhaps the easiest way to understand this difference is with the arithmetic mean, which is perhaps the simplest “model”. Suppose you have a bunch of users out there and you want to compute the average of some attribute that they have on their phones (or whatever device). The first approach would be to get all that data and compute the mean in the usual way.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/googleway.png" alt="Google way" /></p>
<p>Once all the data is in the Googleplex, we can just use the formula</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Googlemean.png" alt="Google mean" /></p>
<p>I’ll call this the “Google mean” because it requires that you get all the data X<sub>1</sub> through X<sub>n</sub>, then sum them up and divide by n. Here, each of the X<sub>i</sub>’s represents the ith user’s data. The general principle here is to gather all the data and then estimate the model parameters server-side.</p>
<p>What if you didn’t want to gather everyone’s data centrally? Can you still compute the mean?</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/appleway.png" alt="Apple way" /></p>
<p>Yes, because there’s a nice recurrence formula for the mean:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Applemean.png" alt="Apple mean" /></p>
<p>We can call this the “Apple mean”. With this strategy, we can send our current estimate of the mean to each user, update our estimate by taking the weighted average of the old value and the new value, and then move on to the next user. Here, you send the model parameters out to the users, update those parameters and then bring the parameters back.</p>
<p>Which method is better? Well, in this case, both give you the same answer. In general, for linear models (like the mean), you can usually rework the formulas to build out either “whole data” (Google) approaches or “streaming” (Apple) approaches and get pretty much the same answer. But for non-linear models, it’s not so simple and you usually cannot achieve this kind of equivalence.</p>
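<p>The equivalence of the two means can be checked with a short sketch. The function names <code>google_mean</code> and <code>apple_mean</code> and the example values are made up for illustration, and the streaming version uses the standard running-mean recurrence, which I am assuming is the same form as the formula pictured above.</p>

```python
from typing import Iterable, List


def google_mean(values: List[float]) -> float:
    """'Whole data' approach: gather every value centrally, then average."""
    return sum(values) / len(values)


def apple_mean(values: Iterable[float]) -> float:
    """'Streaming' approach: carry only the running mean and a count.

    Each step applies the recurrence
        mean_n = ((n - 1) / n) * mean_{n-1} + (1 / n) * x_n
    so no user's value needs to be kept after it has been used.
    Returns 0.0 if no values are seen.
    """
    mean = 0.0
    n = 0
    for x in values:
        n += 1
        mean = ((n - 1) / n) * mean + x / n
    return mean


# Hypothetical per-user values, one number per user.
user_data = [3.0, 5.0, 4.0, 8.0]

print(google_mean(user_data))
print(apple_mean(user_data))
```

<p>The two printed values agree up to floating-point rounding. The only state that travels between users in <code>apple_mean</code> is the current estimate and the count, which is the point of the “Apple way”.</p>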
<h2 id="clients-and-servers">Clients and Servers</h2>
<p>Balancing how much work is done on a server and how much is done on the client is an age-old computing problem and, over time, the balance of work between client and server seems to shift back and forth like a pendulum. When I was in grad school, we had so-called “dumb terminals” that were basically a screen that you used to log in to the server. Today, I use my laptop for computing/work and that’s it. But I use the cloud for many other tasks.</p>
<p>The Apple approach definitely requires a “fatter” client because the work of integrating current model parameters with new user data has to happen on the phone. With the Google approach, all the phone has to do is be able to collect the data and send it over the network to Google.</p>
<p>The Apple approach is also closely related to what my colleagues <a href="http://www.biostat.jhsph.edu/~mlindqui/">Martin Lindquist</a> and <a href="http://www.bcaffo.com">Brian Caffo</a> refer to as “fusion science”, whereby Big Data and “Small Data” can be fused together via models to improve inference, but without ever having to actually combine the data. In a Bayesian context, you might think of the Big Data as making up the prior distribution and the Small Data as the likelihood. The Small Data can be used to update the model parameters and produce the posterior distribution, after which the Small Data can be thrown out.</p>
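<p>As a rough illustration of that Bayesian picture, consider the textbook normal model with known noise variance, where the prior (standing in for what was learned from the Big Data) and the posterior are both normal. This is a sketch under that assumption only, not a description of what Apple or anyone else actually does; the function name <code>update_normal_mean</code> and the numbers are hypothetical.</p>

```python
from typing import List, Tuple


def update_normal_mean(
    prior_mean: float,
    prior_var: float,
    small_data: List[float],
    noise_var: float,
) -> Tuple[float, float]:
    """Conjugate update for a normal mean with known noise variance.

    Returns the posterior mean and variance. Only these two numbers need
    to be kept afterwards; small_data itself can be discarded.
    """
    n = len(small_data)
    sample_mean = sum(small_data) / n
    prior_precision = 1.0 / prior_var
    data_precision = n / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * sample_mean)
    return post_mean, post_var


# Hypothetical prior from the "Big Data" and two local observations.
post_mean, post_var = update_normal_mean(
    prior_mean=10.0, prior_var=4.0, small_data=[12.0, 14.0], noise_var=4.0
)
print(post_mean, post_var)
```

<p>The posterior mean is a precision-weighted average of the prior mean and the local sample mean, so it always falls between the two. That has the same shape as the running-mean update in the “Apple mean”.</p>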
<h2 id="and-the-winner-is">And the Winner is…</h2>
<p>It’s not clear to me which approach is better in terms of building a better model for prediction or inference. Sadly, we may never have enough details to find out, and will only be able to evaluate which approach is better by the performance of the systems in the marketplace. But perhaps that’s the way things should be evaluated in this case?</p>
Good list of good books
2016-06-13T00:00:00+00:00
http://simplystats.github.io/2016/06/13/good-books
<p>The MultiThreaded blog over at Stitch Fix (hat tip to Hilary Parker)
has posted a <a href="http://multithreaded.stitchfix.com/blog/2016/06/09/ds-books/">really nice list of data science books</a> (disclosure: one
of <a href="https://leanpub.com/artofdatascience/">my books</a> is on the list).</p>
<blockquote>
<p>We’ve queried our data science team for some of their favorite data science books. This list is by no means exhaustive, but should keep any data scientist/engineer new or old learning and entertained for many an evening.</p>
</blockquote>
<p>Enjoy!</p>
Not So Standard Deviations Episode 17 - Diurnal High Variance
2016-06-09T00:00:00+00:00
http://simplystats.github.io/2016/06/09/nssd-episode-17
<p>Hilary and I talk about Amazon Echo and Alexa as AI as a service, the COMPAS algorithm, criminal justice forecasts, and whether algorithms can introduce or remove bias (or both).</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/">In Two Moves, AlphaGo and Lee Sedol Redefined the Future</a></p>
</li>
<li>
<p><a href="http://qz.com/639952/googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/">Google’s AI won the game Go by defying millennia of basic human instinct</a></p>
</li>
<li>
<p><a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And it’s Biased Against Blacks</a></p>
</li>
<li>
<p><a href="https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm">ProPublica analysis of COMPAS</a></p>
</li>
<li>
<p><a href="http://www.amazon.com/Criminal-Justice-Forecasts-Risk-SpringerBriefs/dp/1461430844?ie=UTF8&*Version*=1&*entries*=0">Richard Berk’s <em>Criminal Justice Forecasts of Risk</em></a></p>
</li>
<li>
<p><a href="http://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815">Cathy O’Neil’s <em>Weapons of Math Destruction</em></a></p>
</li>
<li>
<p><a href="https://mathbabe.org/2016/04/07/ill-stop-calling-algorithms-racist-when-you-stop-anthropomorphizing-ai/">I’ll stop calling algorithms racist when you stop anthropomorphizing AI</a></p>
</li>
<li>
<p><a href="https://cran.r-project.org/web/packages/rmsfact/index.html">RMS Fact package</a></p>
</li>
<li>
<p><a href="http://user2016.org">Use R! 2016</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-17-diurnal-high-variance">Download the audio for this episode.</a></p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/268232081&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Defining success - Four secrets of a successful data science experiment
2016-06-03T00:00:00+00:00
http://simplystats.github.io/2016/06/03/defining-success
<p><em>Editor’s note: This post is excerpted from the book <a href="https://leanpub.com/eds">Executive Data Science: A Guide to Training and Managing the Best Data Scientists</a>, written by myself, Brian Caffo, and Jeff Leek. This particular section was written by Brian Caffo.</em></p>
<p>Defining success is a crucial part of managing a data science experiment. Of course, success is often context specific. However, some aspects of success are general enough to merit discussion. A list of hallmarks of success includes:</p>
<ol>
<li>New knowledge is created.</li>
<li>Decisions or policies are made based on the outcome of the experiment.</li>
<li>A report, presentation, or app with impact is created.</li>
<li>It is learned that the data can’t answer the question being asked of it.</li>
</ol>
<p>Some more negative outcomes include: Decisions being made that disregard clear evidence from the data, equivocal results that do not shed light in one direction or another, uncertainty which prevents new knowledge from being created.</p>
<p>Let’s discuss some of the successful outcomes first.</p>
<p>New knowledge seems ideal in many cases (especially since we are academics), but new knowledge isn’t necessarily important. If this new knowledge produces actionable decisions or policies, that’s even better. The idea of having evidence-based policy, while perhaps newer than the analogous evidence-based medicine movement that has transformed medical practice, has the potential to similarly transform public policy. Finally, it is of course ideal when our data science products have great (positive) impact on an audience that is much wider than a group of data scientists. Creating reusable code or apps is a great way to increase the impact of a project and to disseminate its findings.</p>
<p>The fourth point is perhaps the most controversial. I view it as a success if we can show that the data can’t answer the questions being asked. I am reminded of a friend who told a story of the company he worked at. They hired many expensive prediction consultants to help use their data to inform pricing. However, the prediction results weren’t helping. They were able to prove that the data couldn’t answer the hypothesis under study. There was too much noise and the measurements just weren’t accurately measuring what was needed. Sure, the result wasn’t optimal, as they still needed to know how to price things, but it did save money on consultants. I have since heard this story repeated nearly identically by friends in different industries.</p>
Sometimes the biggest challenge is applying what we already know
2016-05-31T00:00:00+00:00
http://simplystats.github.io/2016/05/31/barrier-to-medication
<p>There’s definitely a need to innovate and develop new treatments in
the area of asthma, but it’s easy to underestimate the barriers to
just doing what we already know, such as making sure that people are
following existing, well-established guidelines on how to treat
asthma. My colleague, Elizabeth Matsui, has <a href="http://skybrudeconsulting.com/blog/2016/05/31/barriers-medication.html">written about the
challenges</a> in a <a href="https://clinicaltrials.gov/ct2/show/NCT02251379?term=ecatch&rank=1">study</a> that we are collaborating on:</p>
<blockquote>
<p>Our group is currently conducting a study that includes implementation of national guidelines-based medical care for asthma, so that one process that we’ve had to get right is to <strong>prescribe an appropriate dose of medication and get it into the family’s hands</strong>. [emphasis added]</p>
</blockquote>
<p>Seems simple, right?</p>
Sometimes there's friction for a reason
2016-05-24T00:00:00+00:00
http://simplystats.github.io/2016/05/24/somtimes-theres-friction-for-a-reason
<p>Thinking about <a href="http://simplystatistics.org/2016/05/23/update-on-theranos/">my post on Theranos</a> yesterday it occurred to me that one thing that’s great about all of the innovation and technology coming out of places like Silicon Valley is the tremendous reduction of friction in our lives. With Uber it’s much easier to get a ride because of improvement in communication and an increase in the supply of cars. With Amazon, I can get that jug of <a href="http://www.amazon.com/Wesson-Pure-100%25-Natural-Vegetable/dp/B007F1KVX8/ref=sr_1_2_a_it?ie=UTF8&qid=1464092378&sr=8-2&keywords=vegetable+oil">vegetable oil</a> that I always wanted without having to leave the house, because Amazon.</p>
<p>So why is there all this friction? Sometimes it’s because of regulation, which may have made sense at an earlier time, but perhaps doesn’t make as much sense now. Sometimes, you need a company like Amazon to really master the logistics operation to be able to deliver anything anywhere. Otherwise, you’re just stuck driving to the grocery store to get that vegetable oil.</p>
<p>But sometimes there’s friction for a reason. For example, <a href="https://stratechery.com/2013/friction/">Ben Thompson talks about</a> how previously there was quite a bit more friction involved before law enforcement could listen in on our communications. Although wiretapping had long been around (as <a href="http://davidsimon.com/we-are-shocked-shocked/">noted</a> by David Simon of…<a href="http://www.hbo.com/the-wire">The Wire</a>) the removal of all friction by the NSA made the situation quite different. Related to this idea is the massive <a href="http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release">data release from OkCupid</a> a few weeks ago, as I discussed on the latest <a href="https://soundcloud.com/nssd-podcast/episode-16-the-silicon-valley-episode">Not So Standard Deviations</a> podcast episode. Sure, your OkCupid profile is visible to everyone with an account, but having someone vacuum up 70,000 profiles and dump them on the web for anyone to view is not what people signed up for; there is a qualitative difference there.</p>
<p>When it comes to Theranos and diagnostic testing in general, there is similarly a need for some friction in order to protect public health. John Ioannidis notes in his <a href="http://jama.jamanetwork.com/article.aspx?articleid=2524161#.Vz-lkeuAj9p.twitter">commentary for JAMA</a>:</p>
<blockquote>
<p>Even if the tests were accurate, when they are performed in massive scale and multiple times, the possibility of causing substantial harm from widespread testing is very real, as errors accumulate with multiple testing. Repeated testing of an individual is potentially a dangerous self-harm practice, and these individuals are destined to have some incorrect laboratory results and eventually experience harm, such as, for example, the anxiety of being labeled with a serious condition or adverse effects from increased testing and procedures to evaluate false-positive test results. Moreover, if the diagnostic testing process becomes dissociated from physicians, self-testing and self-interpretation could cause even more problems than they aim to solve.</p>
</blockquote>
<p>Unlike with the NSA, where the differences in scale may be difficult to quantify because the exact extent of the program is unknown to most people, with diagnostic testing, we can <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">precisely quantify</a> how a diagnostic test’s characteristics will change if we apply it to 1,000 people vs. 1,000,000 people. This is why organizations like the US Preventive Services Task Force so carefully consider recommendations for testing or screening (and why they have a really tough job).</p>
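<p>As a rough sketch of that calculation, the snippet below uses Bayes’ theorem to compute the expected true positives, false positives, and positive predictive value (PPV) at the two scales. The prevalence, sensitivity, and specificity are made-up values chosen for illustration; they do not describe any real test.</p>

```python
def expected_results(n_people, prevalence, sensitivity, specificity):
    """Expected true/false positives and PPV when n_people are each tested once."""
    diseased = n_people * prevalence
    healthy = n_people - diseased
    true_pos = diseased * sensitivity
    false_pos = healthy * (1.0 - specificity)
    # Bayes' theorem: P(disease | positive test)
    ppv = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, ppv


# Hypothetical test characteristics (assumptions, not real data).
PREVALENCE = 0.001   # 1 in 1,000 people has the condition
SENSITIVITY = 0.99   # P(positive | disease)
SPECIFICITY = 0.99   # P(negative | no disease)

for n in (1_000, 1_000_000):
    tp, fp, ppv = expected_results(n, PREVALENCE, SENSITIVITY, SPECIFICITY)
    print(f"n={n:>9,}: true positives={tp:,.1f}, "
          f"false positives={fp:,.1f}, PPV={ppv:.3f}")
```

<p>Under these assumptions the PPV stays the same at both scales, but the absolute number of people who receive a false positive grows in proportion to the number tested.</p>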
<p>I’ll admit that a lot of the friction in our daily lives is pointless and it would be great to reduce it if possible. But in many cases, it was us that put the friction there for a reason, and it’s sometimes good to think about why before we move to eliminate it.</p>
Update On Theranos
2016-05-23T00:00:00+00:00
http://simplystats.github.io/2016/05/23/update-on-theranos
<p>I think it’s fair to say that things for Theranos, the Silicon Valley blood testing company, are not looking up. From the Wall Street Journal (via <a href="http://www.theverge.com/2016/5/19/11711004/theranos-voids-edison-blood-test-results">The Verge</a>):</p>
<blockquote>
<p>Theranos has voided two years of results from its Edison blood-testing machines, issuing tens of thousands of corrected reports to patients and doctors and raising the possibility that many health care decisions may have been made based on inaccurate data. The Wall Street Journal first reported the news, saying that many of the corrected tests have been run using traditional machinery. One doctor told the Journal that she sent a patient to the emergency room after seeing abnormal results from a Theranos test; the corrected report returned normal readings.</p>
</blockquote>
<p>Furthermore, <a href="http://jama.jamanetwork.com/article.aspx?articleid=2524161#.Vz-lkeuAj9p.twitter">this commentary in JAMA</a> from John Ioannidis emphasizes the need for caution when implementing testing on a massive scale. In particular, “The notion of patients and healthy people being repeatedly tested in supermarkets and pharmacies, or eventually in cafeterias or at home, sounds revolutionary, but little is known about the consequences” and the consequences really matter here. In addition, as the title of the commentary would indicate, research done in secret is not research at all. For there to be credibility for a company like this, data needs to be made public.</p>
<p>I <a href="http://simplystatistics.org/2015/10/28/discussion-of-the-theranos-controversy-with-elizabeth-matsui/">continue to maintain</a> that the fundamental premise on which the company is built, as stated by its founder Elizabeth Holmes, is flawed. Two claims are repeatedly made in the context of Theranos:</p>
<ul>
<li><strong>More testing is better</strong>. Anyone who stayed awake in their introduction to Bayesian statistics lecture knows this is very difficult to make true in all circumstances, no matter how accurate a test is. With rare diseases, the number of false positives is overwhelming and can have very real harmful effects on people. Combine testing on a massive scale, with repeated application over time, and you get a recipe for confusion.</li>
<li><strong>People do not get tested because they are afraid of needles</strong>. Elizabeth Holmes makes a big deal about her personal fear of needles and its impact on her (not) getting blood tests done. I have no doubt that many people share this fear, but I have serious doubt that this is the reason people don’t get the medical testing done. There are <a href="http://www.rwjf.org/en/library/research/2012/02/special-issue-of-health-services-research-links-health-care-rese/nonfinancial-barriers-and-access-to-care-for-us-adults.html">many barriers</a> to people getting the medical care that they need, many that are non-financial in nature and do not include fear of needles. The problem of getting people the medical care that they need is one deserving of serious attention, but changing the manner in which blood is collected is not going to do it.</li>
</ul>
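<p>The effect of repeated application in the first point can be sketched with a one-line calculation. Assuming, as a simplification, that each test of a healthy person is independent and has the same specificity, the chance of at least one false positive grows quickly with the number of tests. The specificity value is hypothetical.</p>

```python
def prob_any_false_positive(specificity, n_tests):
    """P(at least one false positive) for a healthy person over n independent tests."""
    return 1.0 - specificity ** n_tests


# Hypothetical specificity of 99% (an assumption for illustration only).
for n_tests in (1, 10, 50):
    p = prob_any_false_positive(0.99, n_tests)
    print(f"{n_tests:>2} tests: P(at least one false positive) = {p:.3f}")
```

<p>Real repeated tests on the same person are unlikely to be fully independent, so this is an illustration of the direction of the effect rather than a prediction.</p>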
Not So Standard Deviations Episode 16 - The Silicon Valley Episode
2016-05-23T00:00:00+00:00
http://simplystats.github.io/2016/05/23/nssd-episode-16
<p>Roger and Hilary are back, with Hilary broadcasting from the west coast. Hilary and Roger discuss the possibility of scaling data analysis and how that may or may not work for companies like Palantir. Also, the latest on Theranos and the release of data from OkCupid.</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="https://www.buzzfeed.com/williamalden/inside-palantir-silicon-valleys-most-secretive-company">BuzzFeed Article on Palantir</a></p>
</li>
<li>
<p><a href="http://simplystatistics.org/2016/05/11/palantir-struggles/">Roger’s Simply Statistics post on Palantir</a></p>
</li>
<li>
<p><a href="https://looker.com">Looker</a></p>
</li>
<li>
<p><a href="http://simplystatistics.org/2015/03/17/data-science-done-well-looks-easy-and-that-is-a-big-problem-for-data-scientists/">Data science done well looks easy</a></p>
</li>
<li>
<p><a href="http://www.wsj.com/articles/theranos-voids-two-years-of-edison-blood-test-results-1463616976">Latest on Theranos</a></p>
</li>
<li>
<p><a href="http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release">OkCupid Data Release</a></p>
</li>
<li>
<p><a href="http://fr.slideshare.net/sblank/secret-history-why-stanford-and-not-berkeley">Secret history of Silicon Valley</a></p>
</li>
<li>
<p><a href="https://blog.wealthfront.com">Wealthfront blog</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-16-the-silicon-valley-episode">Download the audio for this episode.</a></p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/265158223&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
What is software engineering for data science?
2016-05-18T00:00:00+00:00
http://simplystats.github.io/2016/05/18/software-engineering-data-science
<p><em>Editor’s note: This post is a chapter from the book <a href="https://leanpub.com/eds">Executive Data Science: A Guide to Training and Managing the Best Data Scientists</a>, written by myself, Brian Caffo, and Jeff Leek.</em></p>
<p>Software is the generalization of a specific aspect of a data analysis.
If specific parts of a data analysis require implementing or applying a number of procedures or tools together, software is the encompassing of all these tools into a specific module or procedure that can be repeatedly applied in a variety of settings. Software allows for the systematizing and the standardizing of a procedure, so that different people can use it and understand what it’s going to do at any given time.</p>
<p>Software is useful because it formalizes and abstracts the functionality of a set of procedures or tools, by developing a well
defined interface to the analysis. Software will have an interface,
or a set of inputs and a set of outputs that are well understood. People can think about the inputs and the outputs without having to worry about the gory details of what’s going on underneath. Now, they may be interested in those details, but the application of the software at any given setting will not necessarily depend on the knowledge of those details. Rather, the knowledge of the <em>interface</em> to that software is important to using it in any given situation.</p>
<p>For example, most statistical packages will have a linear regression function which has a very well defined interface. Typically, you’ll have to input things like the outcome and the set of predictors, and maybe there will be some other inputs like the data set or weights. Most linear regression functions kind of work in that way. And importantly, the user does not have to know exactly how the linear regression calculation is done underneath the hood. Rather, they only need to know that they need to specify the outcome, the predictors, and a couple of other things. The linear regression function abstracts all the details that are required to implement linear regression, so that the user can apply the tool in a variety of settings.</p>
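<p>Here is a minimal sketch of what such an interface might look like, written in plain Python for a single predictor. The function name <code class="language-plaintext highlighter-rouge">fit_line</code> and its arguments are hypothetical and are not taken from any particular statistical package; the point is only that the caller supplies an outcome, a predictor, and optionally weights, and never sees the arithmetic.</p>

```python
from typing import Optional, Sequence, Tuple


def fit_line(outcome: Sequence[float],
             predictor: Sequence[float],
             weights: Optional[Sequence[float]] = None) -> Tuple[float, float]:
    """Return (intercept, slope) of a (weighted) least-squares line."""
    if len(outcome) != len(predictor):
        raise ValueError("outcome and predictor must have the same length")
    if weights is None:
        weights = [1.0] * len(outcome)
    if len(weights) != len(outcome):
        raise ValueError("weights must have the same length as outcome")

    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, predictor)) / total
    y_bar = sum(w * y for w, y in zip(weights, outcome)) / total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, predictor))
    if sxx == 0:
        raise ValueError("predictor must take at least two distinct values")
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, predictor, outcome))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# The user only needs to know the inputs and the outputs.
intercept, slope = fit_line(outcome=[1.0, 3.0, 5.0, 7.0],
                            predictor=[0.0, 1.0, 2.0, 3.0])
print(intercept, slope)
```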
<p>There are three levels of software that are important to consider, going from kind of the simplest to the most abstract.</p>
<ol>
<li>At the first level you might just have some code that you wrote, and you might want to encapsulate the automation of a set of procedures using a loop (or something similar) that repeats an operation multiple times.</li>
<li>The next step might be some sort of function. Regardless of what language you may be using, often there will be some notion of a function, which is used to encapsulate a set of instructions. And the key thing about a function is that you’ll have to define some sort of interface, which will be the inputs to the function. The function may also have a set of outputs, or it may have some side effect (for example, if it’s a plotting function). Now the user only needs to know those inputs and what the outputs will be. This is the first level of abstraction that you might encounter, where you have to actually define an interface to that function.</li>
<li>The highest level is an actual software package, which will often be a collection of functions and other things. That will be a little bit more formal because there’ll be a very specific interface or API that a user has to understand. Often for a software package there’ll be a number of convenience features for users, like documentation, examples, or tutorials that may come with it, to help the user apply the software to many different settings. A full on software package will be most general in the sense that it should be applicable to more than one setting.</li>
</ol>
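<p>The first two levels can be sketched side by side. The example below standardizes two made-up variables, first with an inline loop and then with a function that has an explicit interface. The variable names and values are hypothetical.</p>

```python
from statistics import mean, stdev

# Made-up measurements, for illustration only.
heights = [150.0, 160.0, 170.0]
weights = [50.0, 60.0, 70.0]

# Level 1: a loop that repeats the same operation inline.
standardized = {}
for name, values in (("heights", heights), ("weights", weights)):
    m, s = mean(values), stdev(values)
    standardized[name] = [(v - m) / s for v in values]


# Level 2: a function with a defined interface.
def standardize(values):
    """Input: a list of at least two numbers that are not all equal.
    Output: a list of z-scores of the same length."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


print(standardized["heights"])
print(standardize(heights))
```

<p>The third level would collect functions like <code class="language-plaintext highlighter-rouge">standardize</code> into a package with documentation and examples, which is more than a short snippet can show.</p>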
<p>One question that you’ll find yourself asking is at what point you need to systematize common tasks and procedures across projects versus recreating code or writing new code from scratch on every new project. It depends on a variety of factors and answering this question may require communication within your team, and with
people outside of your team. You may need to develop an understanding of how often a given process is repeated, or how often a given type of data analysis is done, in order to weigh the costs and benefits of investing in developing a software package or something similar.</p>
<p>Within your team, you may want to ask yourself, “Is the data analysis you’re going to do something that you are going to build upon for future work, or is it just going to be a one shot deal?” In our experience, there are relatively few one shot deals out there. Often you will have to do a certain analysis more than once, twice, or even three times, at which point you’ve reached the threshold where you want to write some code, write some software, or at least a function. The point at which you need to systematize a given set of procedures is going to be sooner than you think it is. The initial investment for developing more formal software will be higher, of course, but that will likely pay off in time savings down the road.</p>
<p>A basic rule of thumb is</p>
<ul>
<li>If you’re going to do something <strong>once</strong> (that does happen on occasion), just write some code and document it very well. The important thing is that you want to make sure that you understand what the code does, and so that requires both writing the code well and writing documentation. You want to be able to reproduce it later on if you ever come back to it, or if someone else comes back to it.</li>
<li>If you’re going to do something <strong>twice</strong>, write a function. This allows you to abstract a small piece of code, and it forces you to define an interface, so you have well defined inputs and outputs.</li>
<li>If you’re going to do something <strong>three</strong> times or more, you should think about writing a small package. It doesn’t have to be commercial level software, but a small package which encapsulates the set of operations that you’re going to be doing in a given analysis. It’s also important to write some real documentation so that people can understand what’s supposed to be going on, and can apply the software to a different situation if they have to.</li>
</ul>
Disseminating reproducible research is fundamentally a language and communication problem
2016-05-13T00:00:00+00:00
http://simplystats.github.io/2016/05/13/reproducible-research-language
<p>Just about 10 years ago, I wrote my <a href="http://www.ncbi.nlm.nih.gov/pubmed/16510544">first</a> of many articles about the importance of reproducible research. Since that article, one of the points I’ve made is that the key issue to resolve was one of tools and infrastructure. At the time, many people were concerned that people would not want to share data and that we had to spend a lot of energy finding ways to either compel or incentivize them to do so. But the reality was that it was difficult to answer the question “What should I do if I desperately want to make my work reproducible?” Back then, even if you could convince a clinical researcher to use R and LaTeX to create a <a href="https://en.wikipedia.org/wiki/Sweave">Sweave</a> document (!), it was not immediately obvious where they should host the document, code, and data files.</p>
<p>Much has happened since then. We now have knitr and Markdown for live documents (as well as iPython notebooks and the like). We also have git, GitHub, and friends, which provide free code sharing repositories in a distributed manner (unlike older systems like CVS and Subversion). With the recent announcement of the <a href="http://www.arfon.org/announcing-the-journal-of-open-source-software">Journal of Open Source Software</a>, posting code on GitHub can now be recognized within the current system of credits and incentives. Finally, the number of <a href="http://dataverse.org">data</a> <a href="https://osf.io">repositories</a> has grown, providing more places for researchers to deposit their data files.</p>
<p>Is the tools and infrastructure problem solved? I’d say yes. One thing that has become clear is that disseminating reproducible research is <strong>no longer a software problem</strong>. At least in R land, building live documents that can be executed by others is straightforward and not too difficult to pick up (thank you <a href="https://daringfireball.net/projects/markdown/">John Gruber</a>!). For other languages there are many equivalent (if not better) tools for writing documents that mix code and text. For this kind of thing, there’s just no excuse anymore. Could things be optimized a bit for some edge cases? Sure, but the tools are perfectly fine for the vast majority of use cases.</p>
<p>But now there is a bigger problem that needs to be solved, which is that <strong>we do not have an effective way to communicate data analyses</strong>. One might think that publishing the full code and datasets is the perfect way to communicate a data analysis, but in a way, it is too perfect. That approach can provide too much information.</p>
<p>I find the following analogy useful for discussing this problem. If you look at music, one way to communicate music is to provide an audio file, a standard WAV file or something similar. In a way, that is a near-perfect representation of the music—bit-for-bit—that was produced by the performer. If I want to experience a Beethoven symphony the way that it was meant to be experienced, I’ll listen to a <a href="https://itun.es/us/TudVe?i=79443286">recording of it</a>.</p>
<p>But if I want to understand how Beethoven wrote the piece—the process and the details—the recording is not a useful tool. What I look at instead is <a href="http://www.amazon.com/dp/0486260348">the score</a>. The recording is a serialization of the music. The score provides an expanded representation of the music that shows all of the different pieces and how they fit together. A person with a good ear can often reconstruct the score, but this is a difficult and time-consuming task. Better to start with what the composer wrote originally.</p>
<p>Over centuries, classical music composers developed a language and system for communicating their musical ideas so that</p>
<ol>
<li>there was enough detail that a 3rd party could interpret the music and perform it to a level of accuracy that satisfied the composer; but</li>
<li>it was not so prescriptive or constraining that different performers could not build on the work and incorporate their own ideas</li>
</ol>
<p>It would seem that traditional computer code satisfies those criteria, but I don’t think so. Traditional computer code (even R code) is designed to communicate programming concepts and constructs, not to communicate data analysis constructs. For example, a <code class="language-plaintext highlighter-rouge">for</code> loop is not a data analysis concept, even though we may use <code class="language-plaintext highlighter-rouge">for</code> loops all the time in data analysis.</p>
<p>Because of this disconnect between computer code and data analysis, I often find it difficult to understand what a data analysis is doing, even if it is coded very well. I imagine this is what programmers felt like when programming in processor-specific <a href="https://en.wikipedia.org/wiki/Assembly_language">assembly language</a>. Before languages like C were developed, where high-level concepts could be expressed, you had to know the gory details of how each CPU operated.</p>
<p>The closest thing that I can see to a solution emerging is the work that Hadley Wickham is doing with packages like <a href="https://github.com/hadley/dplyr">dplyr</a> and <a href="https://github.com/hadley/ggplot2">ggplot2</a>. The <code class="language-plaintext highlighter-rouge">dplyr</code> package’s verbs (<code class="language-plaintext highlighter-rouge">filter</code>, <code class="language-plaintext highlighter-rouge">arrange</code>, etc.) represent data manipulation concepts that are meaningful to analysts. But we still have a long way to go to cover all of data analysis in this way.</p>
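<p>The contrast between programming constructs and analysis verbs can be sketched even outside of R. Below, the same small task is written twice in Python: once with a loop and manual bookkeeping, and once with two helper functions named after the verbs. The helpers <code class="language-plaintext highlighter-rouge">filter_rows</code> and <code class="language-plaintext highlighter-rouge">arrange</code> are hypothetical stand-ins written for this example, not the dplyr API, and the data values are made up.</p>

```python
# Made-up records, for illustration only.
rows = [
    {"site": "A", "value": 12.1},
    {"site": "B", "value": 9.4},
    {"site": "C", "value": 14.8},
]

# Version 1: programming constructs. The reader has to infer the intent.
kept = []
for row in rows:
    if row["value"] > 10:
        kept.append(row)
kept.sort(key=lambda row: row["value"], reverse=True)


# Version 2: hypothetical verb-style helpers that name the analysis step.
def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def arrange(rows, key, descending=False):
    """Return the rows sorted by the given field."""
    return sorted(rows, key=lambda row: row[key], reverse=descending)


result = arrange(filter_rows(rows, lambda row: row["value"] > 10),
                 "value", descending=True)

print([row["site"] for row in kept])
print([row["site"] for row in result])
```

<p>Both versions compute the same thing; the second one simply reads closer to a description of the analysis.</p>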
<p>Reproducible research is important because it is fundamentally about communicating what you have done in your work. Right now we have a sub-optimal way to communicate what was done in a data analysis, via traditional computer code. I think developing a new approach to communicating data analysis could have a few benefits:</p>
<ol>
<li>It would provide greater transparency</li>
<li>It would allow others to more easily build on what was done in an analysis by extending or modifying specific elements</li>
<li>It would make it easier to understand what common elements there were across many different data analyses</li>
<li>It would make it easier to teach data analysis in a systematic and scalable way</li>
</ol>
<p>So, any takers?</p>
The Real Lesson for Data Science That is Demonstrated by Palantir's Struggles
2016-05-11T00:00:00+00:00
http://simplystats.github.io/2016/05/11/palantir-struggles
<p>Buzzfeed recently published a <a href="https://www.buzzfeed.com/williamalden/inside-palantir-silicon-valleys-most-secretive-company?utm_term=.ko2PLKaMJ#.wiPxJERyA">long article</a> on the struggles of the secretive data science company, Palantir.</p>
<blockquote>
<p>Over the last 13 months, at least three top-tier corporate clients have walked away, including Coca-Cola, American Express, and Nasdaq, according to internal documents. Palantir mines data to help companies make more money, but clients have balked at its high prices that can exceed $1 million per month, expressed doubts that its software can produce valuable insights over time, and even experienced difficult working relationships with Palantir’s young engineers. Palantir insiders have bemoaned the “low-vision” clients who decide to take their business elsewhere.</p>
</blockquote>
<p>Palantir’s origins are with PayPal, and its founders are part of the <a href="https://en.wikipedia.org/wiki/PayPal_Mafia">PayPal Mafia</a>. As Peter Thiel describes it in his book <a href="https://en.wikipedia.org/wiki/Zero_to_One">Zero to One</a>, PayPal was having a lot of trouble with fraud and the FBI was getting on its case. So PayPal developed some software to monitor the millions of transactions going through its systems and to flag transactions that were suspicious. Eventually, they realized that this kind of software might be useful to government agencies in a variety of contexts and the idea for Palantir was born.</p>
<p>Much of the press reaction to Buzzfeed’s article amounts to schadenfreude over the potential fall of <a href="http://simplystatistics.org/2015/10/16/thorns-runs-head-first-into-the-realities-of-diagnostic-testing/">another</a> so-called Silicon Valley unicorn. Indeed, Palantir is valued at $20 billion, a valuation only exceeded in the private markets by Airbnb and Uber. But to me, nothing in the article indicates that Palantir is necessarily more poorly run than your average startup. It looks like they are going through pretty standard growing pains, trying to scale the business and diversify the customer base. It’s not surprising to me that employees would leave at this point; going from startup to “real company” is often not that fun and just a lot of work.</p>
<p>However, a key question arises: if Palantir is having trouble scaling the business, why might that be? The Buzzfeed article doesn’t contain any answers but in this post I will attempt to speculate.</p>
<p>The real message from the Buzzfeed article goes beyond just Palantir and is highly relevant to the data science world. It ultimately comes down to the question of <strong>what is the value of data analysis?</strong>, and secondarily, <strong>how do you communicate that value?</strong></p>
<h2 id="the-data-science-spectrum">The Data Science Spectrum</h2>
<p>Data science activities live on a spectrum with <strong>software</strong> on one end and <strong>highly customized consulting</strong> on the other (I lump a lot of things into consulting, including methods development, modeling, etc.).</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/DS_Spectrum2.png" alt="Data Science Spectrum" /></p>
<p>The idea being that if someone comes to you with a data problem, there are two extremes that you might offer to them:</p>
<ol>
<li>Give them some software, some documentation, and maybe a brief tutorial on how to use the software, and then send them on their way. For example, if someone wants to see if two groups are different from each other, you could send them the <code class="language-plaintext highlighter-rouge">t.test()</code> function in R and explain how to use it. This could be done over email; you wouldn’t even have to talk to the person.</li>
<li>Meet with the person, talk about their problem and the question they’re trying to answer, develop an analysis plan, and build a custom software solution that produces the exact output that they’re looking for.</li>
</ol>
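<p>As a rough sketch of what the first option might look like in practice, here is the kind of snippet one could send along. The two groups are simulated and the names <code class="language-plaintext highlighter-rouge">group_a</code> and <code class="language-plaintext highlighter-rouge">group_b</code> are made up for illustration; only <code class="language-plaintext highlighter-rouge">t.test()</code> itself comes from R.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simulated, purely illustrative measurements for two groups
set.seed(1)
group_a <- rnorm(30, mean = 10, sd = 2)
group_b <- rnorm(30, mean = 12, sd = 2)

# Welch two-sample t-test, which is what t.test() does by default
result <- t.test(group_a, group_b)
print(result)

# The pieces most people want to report
result$estimate  # sample mean of each group
result$conf.int  # 95% confidence interval for the difference in means
result$p.value   # p-value for the null hypothesis of equal means
</code></pre></div></div>
<p>Nothing here is tailored to the person asking, which is exactly why this end of the spectrum is so cheap to replicate.</p>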
<p>The first option is cheap, simple, and if you had a good enough web site, the person probably wouldn’t even have to talk with you at all! For example, I use <a href="http://hedwig.mgh.harvard.edu/sample_size/size.html">this web site</a> for sample size calculations and I’ve never spoken with the author of the web site. Much of the labor is up front, for the development of the software, and then is amortized over the life of the product. Ultimately, a software product has zero marginal cost and so it can be easily replicated and is <em>infinitely scalable</em>.</p>
<p>The second option is labor intensive, time-consuming, ongoing in nature, and is only scalable to the extent that you are willing to forgo sleep and maybe bend the space-time continuum. By definition, a custom solution is unique and is difficult to replicate.</p>
<h2 id="selling-data-science">Selling Data Science</h2>
<p>An important question for Palantir and data scientists in general is “How do you communicate the value of data analysis?” Many people expect the result of a good data analysis to be something “surprising”, i.e. something that they didn’t already know. Because if they knew it already why bother looking at the data? Think Moneyball: if you can find that “diamond in the rough” it makes spending the time to analyze the data worthwhile. But <strong>the success of a data analysis can’t depend on the results</strong>. What if you go through the data and find nothing? Is the data analysis a failure? We as data scientists can only show what the data show. Otherwise, it just becomes a recipe for telling people what they want to hear.</p>
<p>It’s tempting for a client to say “well, the data didn’t show anything surprising so there’s no value there.” Also, a data analysis may reveal something that is perhaps interesting but doesn’t necessarily lead to any sort of decision. For example, there may be an aspect of a business process that is inefficient but is nevertheless unmodifiable. There may be little perceived value in discovering this with data.</p>
<h3 id="what-is-useful">What is Useful?</h3>
<p>Palantir apparently tried to develop a relationship with American Express, but ultimately failed.</p>
<blockquote>
<p>But some major firms have not found Palantir’s products and services that useful. In April 2015, employees were informed that American Express (codename: Charlie’s Angels) had dumped Palantir after 18 months of cybersecurity work, including a six-month pilot, an email shows. “We struggled from day 1 to make Palantir a sticky product for users and generate wins,” Sid Rajgarhia, a Palantir business development employee, said in the email.</p>
</blockquote>
<p>What does it mean for a data analysis product to be useful? It’s not necessarily clear to me in this case. Did Palantir not reveal new information? Did they not highlight something that could be modified?</p>
<h3 id="lack-of-deep-expertise">Lack of Deep Expertise</h3>
<p>A failed attempt at working with Coke reveals some other challenges of the data science business model.</p>
<blockquote>
<p>The beverage giant also had other concerns [in addition to the price]. Coke “wanted deeper industry expertise in a partner,” Jonty Kelt, a Palantir executive, told colleagues in the email. He added that Coca-Cola’s “working relationship” with the youthful Palantir employees was “difficult.” The Coke executive acknowledged that the beverage giant “needs to get better at working with millennials,” according to Kelt. Coke spokesperson Scott Williamson declined to comment.</p>
</blockquote>
<p>Annoying millennials notwithstanding, it’s clear that Coke didn’t feel comfortable collaborating with Palantir’s personnel. As in any data science collaboration, it’s key that the data scientist have some familiarity with the domain. In many cases, having “deep expertise” in an area can give a collaborator confidence that you will focus on the things that matter to them. But developing that expertise costs money and time and it may prevent you from working with other types of clients where you will necessarily have less expertise. For example, Palantir’s long experience working with the US military and intelligence agencies gave them deep expertise in those areas, but how does that help them with a consumer products company?</p>
<h3 id="harder-than-it-looks">Harder Than It Looks</h3>
<p>The final example of a client that backed out is Kimberly-Clark:</p>
<blockquote>
<p>But Kimberly-Clark was getting cold feet by early 2016. In January, a year after the initial pilot, Kimberly-Clark executive Anthony J. Palmer said he still wasn’t ready to sign a binding contract, meeting notes show. Palmer also “confirmed our suspicion” that a primary reason Kimberly-Clark had not moved forward was that “<em>they wanted to see if they could do it cheaper themselves</em>,” Kelt told colleagues in January. [emphasis added]</p>
</blockquote>
<p>This is a common problem confronted by anyone in the data science business. A good analysis often looks easy in retrospect: all you did was run some functions and put the data through some models! In fact, running the models probably <em>is</em> the easy part, but getting to the point where you can actually fit models can be incredibly hard. Many clients, not seeing the long and winding process leading to a model, will be tempted to think they can do it themselves.</p>
<h2 id="palantirs-valuation">Palantir’s Valuation</h2>
<p>Ultimately, what makes Palantir interesting is its astounding valuation. But what is the driver of this valuation? I think the key to answering this question lies in the description of the company itself:</p>
<blockquote>
<p>The company, based in Palo Alto, California, is essentially a hybrid software and consulting firm, placing what it calls “forward deployed engineers” on-site at client offices.</p>
</blockquote>
<p>What does it mean to be a “hybrid software and consulting firm”? And which one is the company more like? Consulting or software? Because ultimately, revealing which side of the spectrum Palantir is <em>really</em> on could have huge implications for its valuation and future prospects.</p>
<p>Consulting companies can surely make a lot of money, but none to my knowledge have the kind of valuation that Palantir currently commands. If it turns out that every customer of Palantir’s requires a custom solution, then I think they’re likely overvalued, because that model scales poorly. On the other hand, if Palantir has genuinely figured out a way to “software-ize” data analysis and to turn it into a commodity, then they are very likely undervalued.</p>
<p>Given the tremendous difficulty of turning data analysis into a software problem, my guess is that they are more akin to a consulting company and are overvalued. This is not to say that they won’t make money—they will likely make plenty—but that they won’t be the Silicon Valley darling that everyone wants them to be.</p>
A means not an end - building a social media presence as a junior scientist
2016-05-10T00:00:00+00:00
http://simplystats.github.io/2016/05/10/social-media
<p><em>Editor’s note - This is a chapter from my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a> where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before. 50% of all royalties from the book go to support <a href="http://www.datacarpentry.org/">Data Carpentry</a> to promote data science education.</em></p>
<h2 id="social-media---what-should-i-do-and-why">Social media - what should I do and why?</h2>
<p>Social media can serve a variety of roles for modern scientists. Here I am going to focus on the role of social
media for working scientists whose primary focus is not on scientific communication. Something that is often missed by people who are just getting started with social media is that there are two separate components to developing a successful social media presence.</p>
<p>The first is to develop a following and connections
to people in your community. This is achieved through being either a content curator, a content generator, or
being funny/interesting in some other way. This often has nothing to do with your scientific output.</p>
<p>The second component is using your social media presence to magnify the audience for your scientific work. You can
only do this if you have successfully developed a network and community in the first step. Then, when you post about
your own scientific papers they will be shared.</p>
<p>To most effectively achieve all of these goals you need to identify relevant communities and develop a network
of individuals who follow you and will help to share your ideas and work.</p>
<p><strong>Set up social media accounts and follow relevant people/journals</strong></p>
<p>One of the largest academic communities has developed around Twitter, but some scientists are also using Facebook for professional purposes. If you set up a Twitter account, you should then follow as many colleagues in your area of expertise as you can find, along with any journals in your area.</p>
<p><strong>Use your social media account to promote the work of other people</strong></p>
<p>If you just use your social media account to post links to any papers that you publish, it will be hard to develop much of a following. It is also hard to develop a following by constantly posting long form original content such as blog posts. Alternatively you can gain a large number of followers by being (a) funny, (b) interesting, or (c) a content curator. This last approach can be particularly useful for people new to social media. Just follow people and journals you find interesting and share anything that you think is important/creative/exciting.</p>
<p><strong>Share any work that you develop</strong></p>
<p>Any code, publications, data, or blog posts you create you can share from your social media account. This can help raise your profile as people notice your good work. But if you only post your own work it is rarely possible to develop a large following unless you are already famous for another reason.</p>
<h2 id="social-media---what-tools-should-i-use">Social media - what tools should I use?</h2>
<p>There are a large number of social media platforms that are available to scientists. Using any new social media platform creatively, provided it has a large number of users, can be a way to quickly jump into the consciousness of more people. That being said, the two largest communities of scientists have organized around two of the largest social media platforms.</p>
<ul>
<li><a href="https://twitter.com/">Twitter</a> - is a platform where you can post short (up to 140 characters) messages. This is a great platform for both discovering science and engaging in conversations about topics at a superficial level. It is not particularly useful for in depth scientific discussions.</li>
<li><a href="https://www.facebook.com/">Facebook</a> - some scientists post longer form scientific discussions on Facebook, but the community there is somewhat less organized and people tend to use it less for professional reasons. However, sharing content on Facebook, particularly when it is of interest to a general audience, can lead to a broader engagement in your work.</li>
</ul>
<p>There are also a large and growing number of academic-specific social networks. For the most part these social networks are not widely used by practicing scientists and so don’t represent the best use of your time.</p>
<p>You might also consider short videos on <a href="https://vine.co/">Vine</a>, longer videos on <a href="https://www.youtube.com/">Youtube</a>, more image intensive social media on <a href="https://www.tumblr.com/">Tumblr</a> or <a href="https://www.instagram.com">Instagram</a> if you have content that regularly fits those outlets. But they tend to have smaller communities of scientists with less opportunity for back and forth.</p>
<h2 id="social-media---further-tips-and-issues">Social media - further tips and issues</h2>
<h3 id="you-do-not-need-to-develop-original-content">You do not need to develop original content</h3>
<p>Social media can be a time suck, particularly if you are spending a lot of time engaging in conversations on your platform of choice. Generating long form content in particular can take up a lot of time. But you don’t need to do that to generate a decent following. Finding the right community and then sharing work within that community and adding brief commentary and ideas can often help you develop a large following which can then be useful for other reasons.</p>
<h3 id="add-your-own-commentary">Add your own commentary</h3>
<p>Once you are comfortable using the social media platform of your choice you can start to engage with other people in conversation or add comments when you share other people’s work. This will increase the interest in your social media account and help you develop followers. This can be as simple as one-liners copied straight from the text of papers or posts that you think are most important.</p>
<h3 id="make-online-friends---then-meet-them-offline">Make online friends - then meet them offline</h3>
<p>One of the biggest advantages of scientific social media is that it levels the playing field. Don’t be afraid to engage with members of your scientific community at all levels, from members of the National Academy (if they are online!) all the way down to junior graduate students. Getting to know a diversity of people can really help you during scientific meetings and visits. If you spend time cultivating online friendships, you’ll often meet a “familiar handle” at any conference or meeting you go to.</p>
<h3 id="include-images-when-you-can">Include images when you can</h3>
<p>If you see a plot from a paper you think is particularly compelling, copy it and attach it when you post or tweet a link to the paper. On social media, images are often better received than plain text.</p>
<h3 id="be-careful-of-hot-button-issues-unless-you-really-care">Be careful of hot button issues (unless you really care)</h3>
<p>One thing to keep in mind on social media is the amplification of opinions. There are a large number of issues that are of extreme interest and generate really strong opinions on multiple sides. Some of these issues are common societal issues (e.g., racism, feminism, economic inequality) and some are specific to science (e.g., open access publishing, open source development). If you are starting a social media account to engage in these topics then you should definitely do that. If you are using your account primarily for scientific purposes you should consider carefully the consequences of wading into these discussions. The debates run very hot on social media and you may post what you consider to be a relatively tangential or light message on one of these topics and find yourself the center of a lot of attention (positive and negative).</p>
Time Series Analysis in Biomedical Science - What You Really Need to Know
2016-05-05T00:00:00+00:00
http://simplystats.github.io/2016/05/05/timeseries-biomedical
<p>For a few years now I have given a guest lecture on time series analysis in our School’s <em>Environmental Epidemiology</em> course. The basic thrust of this lecture is that you should generally ignore what you read about time series modeling, either in papers or in books. The reason is that I find much of the time series literature is not particularly helpful when doing analyses in a biomedical or population health context, which is what I do almost all the time.</p>
<h2 id="prediction-vs-inference">Prediction vs. Inference</h2>
<p>First, most of the literature on time series models tends to assume that you are interested in doing prediction, that is, forecasting future values in a time series. I am almost never doing this. In my work looking at air pollution and mortality, the goal is never to find the best model that predicts mortality. In particular, if our goal were to predict mortality, we would probably <em>never include air pollution as a predictor</em>. This is because air pollution has an inherently weak association with mortality at the population level, whereas things like temperature and other seasonal factors tend to have a much stronger association.</p>
<p>What I <em>am</em> interested in doing is estimating an association between changes in air pollution levels and mortality and making some sort of inference about that association, either to a broader population or to other time periods. The challenges in these types of analyses include estimating weak associations in the presence of many stronger signals and appropriately adjusting for any potential confounding variables that similarly vary over time.</p>
<p>The reason the distinction between prediction and inference is important is that focusing on one vs. the other can lead you to very different model building strategies. Prediction modeling strategies will always want you to include in the model factors that are strongly correlated with the outcome and explain a lot of the outcome’s variation. If you’re trying to do inference and use a prediction modeling strategy, you may make at least two errors:</p>
<ol>
<li>You may conclude that your key predictor of interest (e.g. air pollution) is not important because the modeling strategy didn’t select it for inclusion.</li>
<li>You may omit important potential confounders because they have a weak relationship with the outcome (but maybe have a strong relationship with your key predictor). For example, one class of potential confounders in air pollution studies is other pollutants, which tend to be weakly associated with mortality but may be strongly associated with your pollutant of interest.</li>
</ol>
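<p>The second error can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the variables <code class="language-plaintext highlighter-rouge">pollutant</code> and <code class="language-plaintext highlighter-rouge">other</code>, the effect sizes, and the noise levels are all hypothetical and not taken from any real study.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(42)
n <- 20000

# Hypothetical co-pollutant: strongly related to the pollutant of interest,
# but only weakly related to the outcome
other <- rnorm(n)
pollutant <- 0.8 * other + rnorm(n, sd = 0.6)
outcome <- 50 + 0.10 * pollutant + 0.10 * other + rnorm(n, sd = 1)

# On its own, the co-pollutant explains very little of the outcome, so a
# prediction-driven strategy might be tempted to drop it
r2_other <- summary(lm(outcome ~ other))$r.squared
r2_other

# Dropping it changes the estimate for the pollutant of interest
fit_omit <- lm(outcome ~ pollutant)
fit_adjust <- lm(outcome ~ pollutant + other)
coef(fit_omit)["pollutant"]    # absorbs part of the co-pollutant's effect
coef(fit_adjust)["pollutant"]  # close to the simulated value of 0.10
</code></pre></div></div>
<p>In this setup the unadjusted coefficient is inflated because the pollutant stands in for the omitted co-pollutant, even though the co-pollutant looks unimportant as a predictor.</p>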
<h2 id="random-vs-fixed">Random vs. Fixed</h2>
<p>Another area where I feel much time series literature differs from my practice is on whether to focus on fixed effects or random effects. Most of what you might think of when you think of time series models (i.e. AR models, MA models, GARCH, etc.) focuses on modeling the <em>random</em> part of the model. Often, you end up treating time series data as random because you simply do not have any other data. But the reality is that in many biomedical and public health applications, patterns in time series data can be explained by clearly understood fixed patterns.</p>
<p>For example, take this time series here. It is lower at the beginning and at the end of the series, with higher levels in the middle of the period.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_fixed.png" alt="Time series with seasonal pattern 1" /></p>
<p>It’s possible that this time series could be modeled with an auto-regressive (AR) model or maybe an auto-regressive moving average (ARMA) model. Or it’s possible that the data are exhibiting a seasonal pattern. It’s impossible to tell from the data whether this is a random realization of this pattern or whether it’s something you’d expect every time. The problem is that we usually only have <em>one observation</em> from the time series. That is, we observe the entire series only once.</p>
<p>Now take a look at this time series. It exhibits some of the same properties as the first series: it’s low at the beginning and end and high in the middle.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_random.png" alt="Time series with seasonal pattern 2" /></p>
<p>Should we model this as a random process or as a process with a fixed pattern? That ultimately will depend on what type of data this is and what we know about it. If it’s air pollution data, we might do one thing, but if it’s stock market data, we might do a totally different thing.</p>
<p>If we were to see replicates of the time series, we’d be able to resolve the fixed vs. random question. For example, because I simulated the data above, I can simulate another replicate and see what happens. In the plot below I show two replications from each of the processes.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_both.png" alt="Fixed and random time series patterns" /></p>
<p>It’s clear now that the time series on the top row has a fixed “seasonal” pattern while the time series on the bottom row is random (in fact it is simulated from an AR(1) model).</p>
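<p>A simulation along these lines can be sketched as follows. This is not necessarily the code behind the figures above; the half-sine “seasonal” shape, the noise level, and the AR(1) coefficient of 0.9 are all assumptions chosen for illustration.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(2016)
n <- 200
time_index <- seq_len(n)

# Fixed pattern: the same half-sine mean in every replicate, plus independent noise
fixed_mean <- sin(pi * time_index / n)
sim_fixed <- function() fixed_mean + rnorm(n, sd = 0.2)

# Random pattern: an AR(1) process with no fixed mean structure at all
sim_ar1 <- function() as.numeric(arima.sim(model = list(ar = 0.9), n = n))

# Two replicates of each process, one per column
fixed_reps <- replicate(2, sim_fixed())
ar1_reps <- replicate(2, sim_ar1())

# Fixed replicates share their mean pattern, so they are clearly correlated;
# AR(1) replicates are independent, so any resemblance is coincidental
cor(fixed_reps[, 1], fixed_reps[, 2])
cor(ar1_reps[, 1], ar1_reps[, 2])

# Top row: fixed pattern; bottom row: AR(1)
op <- par(mfrow = c(2, 2))
for (j in 1:2) {
  plot(time_index, fixed_reps[, j], type = "l",
       main = paste("Fixed, replicate", j), xlab = "Time", ylab = "Value")
}
for (j in 1:2) {
  plot(time_index, ar1_reps[, j], type = "l",
       main = paste("AR(1), replicate", j), xlab = "Time", ylab = "Value")
}
par(op)
</code></pre></div></div>
<p>With only the first column of either matrix in hand, the two processes can look alike; it is the second column that separates them.</p>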
<p>The point here is that very often we know things about the time series we’re modeling that introduce fixed variation into the data: seasonal patterns, day-of-week effects, and long-term trends. Furthermore, there may be other time-varying covariates that can help predict whatever time series we’re modeling and can be put into the fixed part of the model (a.k.a. regression modeling). Ultimately, when many of these fixed components are accounted for, there’s relatively little of interest left in the residuals.</p>
<h2 id="what-to-model">What to Model?</h2>
<p>So the question remains: What should I do? The short answer is to try to incorporate everything that you know about the data into the fixed/regression part of the model. Then take a look at the residuals and see if you still care.</p>
<p>Here’s a quick example from my work in air pollution and mortality. The data are all-cause mortality and PM10 pollution from Detroit for the years 1987–2000. The question is whether daily mortality is associated with daily changes in ambient PM10 levels. We can try to answer that with a simple linear regression model:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = death ~ pm10, data = ds)
Residuals:
    Min      1Q  Median      3Q     Max
-26.978  -5.559  -0.386   5.109  34.022
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.978826   0.112284 418.394   <2e-16
pm10         0.004885   0.001936   2.523   0.0117
Residual standard error: 8.03 on 5112 degrees of freedom
Multiple R-squared: 0.001244, Adjusted R-squared: 0.001049
F-statistic: 6.368 on 1 and 5112 DF, p-value: 0.01165
</code></pre></div></div>
<p>PM10 appears to be positively associated with mortality, but when we look at the autocorrelation function of the residuals, we see</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-3-1.png" alt="ACF1" /></p>
<p>If we see a seasonal-like pattern in the auto-correlation function, then that means there’s a seasonal pattern in the residuals as well. Not good.</p>
<p>But okay, we can just model the seasonal component with an indicator of the season.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = death ~ season + pm10, data = ds)
Residuals:
    Min      1Q  Median      3Q     Max
-25.964  -5.087  -0.242   4.907  33.884
Coefficients:
             Estimate Std. Error t value    Pr(>|t|)
(Intercept) 50.830458   0.215679 235.676     < 2e-16
seasonQ2    -4.864167   0.304838 -15.957     < 2e-16
seasonQ3    -6.764404   0.304346 -22.226     < 2e-16
seasonQ4    -3.712292   0.302859 -12.258     < 2e-16
pm10         0.009478   0.001860   5.097 0.000000358
Residual standard error: 7.649 on 5109 degrees of freedom
Multiple R-squared: 0.09411, Adjusted R-squared: 0.09341
F-statistic: 132.7 on 4 and 5109 DF, p-value: < 2.2e-16
</code></pre></div></div>
<p>Note that the coefficient for PM10, the coefficient of real interest, roughly doubles (from about 0.0049 to about 0.0095) when we add the seasonal component.</p>
<p>When we look at the residuals now, we see</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-5-1.png" alt="ACF2" /></p>
<p>The seasonal pattern is gone, but we see that there’s positive autocorrelation at seemingly long distances (~100s of days). This is usually an indicator that there’s some sort of long-term trend in the data. Since we only care about the day-to-day changes in PM10 and mortality, it would make sense to remove any such long-term trend. I can do that by just including the date as a linear predictor.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Call:
lm(formula = death ~ season + date + pm10, data = ds)
Residuals:
    Min      1Q  Median      3Q     Max
-23.407  -5.073  -0.375   4.718  32.179
Coefficients:
               Estimate Std. Error t value    Pr(>|t|)
(Intercept) 60.04317325 0.64858433  92.576     < 2e-16
seasonQ2    -4.76600268 0.29841993 -15.971     < 2e-16
seasonQ3    -6.56826695 0.29815323 -22.030     < 2e-16
seasonQ4    -3.42007191 0.29704909 -11.513     < 2e-16
date        -0.00106785 0.00007108 -15.022     < 2e-16
pm10         0.00933871 0.00182009   5.131 0.000000299
Residual standard error: 7.487 on 5108 degrees of freedom
Multiple R-squared: 0.1324, Adjusted R-squared: 0.1316
F-statistic: 156 on 5 and 5108 DF, p-value: < 2.2e-16
</code></pre></div></div>
<p>Now we can look at the autocorrelation function one last time.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-7-1.png" alt="ACF3" /></p>
<p>The ACF trails to zero reasonably quickly now, but there’s still some autocorrelation at short lags up to about 15 days or so.</p>
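<p>Since the Detroit data are not included here, the following sketch mimics the sequence of regressions above on simulated daily data. Everything about the simulation (the variable names <code class="language-plaintext highlighter-rouge">y</code>, <code class="language-plaintext highlighter-rouge">exposure</code>, and <code class="language-plaintext highlighter-rouge">day</code>, the effect sizes, and the AR(1) noise) is an assumption for illustration, so the numbers will not match the output above.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(123)
n <- 3000
dates <- seq(as.Date("1990-01-01"), by = "day", length.out = n)
day <- as.numeric(dates - dates[1])       # days since the start of the series
season <- factor(quarters(dates))         # levels "Q1" to "Q4"
exposure <- rnorm(n, mean = 30, sd = 15)  # hypothetical pollutant level

# Fixed part: seasonal levels, a slow downward trend, a small exposure effect
season_effect <- unname(c(Q1 = 0, Q2 = -5, Q3 = -7, Q4 = -3)[as.character(season)])
mu <- 50 + season_effect - 0.002 * day + 0.01 * exposure

# Random part: mildly autocorrelated AR(1) noise
y <- mu + as.numeric(arima.sim(list(ar = 0.3), n = n, sd = 7))

# The same three steps as above: exposure only, then season, then trend
fit1 <- lm(y ~ exposure)
fit2 <- lm(y ~ season + exposure)
fit3 <- lm(y ~ season + day + exposure)

# Residual autocorrelation at a few chosen lags for each model
resid_acf <- function(fit, lags) {
  acf(resid(fit), lag.max = max(lags), plot = FALSE)$acf[lags + 1]
}
lags <- c(1, 180, 365)
acf_table <- sapply(list(fit1 = fit1, fit2 = fit2, fit3 = fit3),
                    resid_acf, lags = lags)
rownames(acf_table) <- paste("lag", lags)
round(acf_table, 3)

# Full residual ACF plot for the final model
acf(resid(fit3), lag.max = 400)
</code></pre></div></div>
<p>The table gives one column per model, so you can read across a row to see how the autocorrelation at a given lag changes as fixed components are added.</p>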
<p>Now we can engage in some traditional time series modeling. We might want to model the residuals with an auto-regressive model of order <em>p</em>. What should <em>p</em> be? We can check by looking at the partial autocorrelation function (PACF).</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-8-1.png" alt="PACF" /></p>
<p>The PACF seems to suggest we should fit an AR(6) or AR(7) model. Let’s use an AR(6) model and see how things look. We can use the <code class="language-plaintext highlighter-rouge">arima()</code> function for that.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Call:
arima(x = y, order = c(6, 0, 0), xreg = m, include.mean = FALSE)
Coefficients:
         ar1     ar2     ar3     ar4     ar5     ar6  (Intercept)
      0.0869  0.0933  0.0733  0.0454  0.0377  0.0489      59.8179
s.e.  0.0140  0.0140  0.0141  0.0141  0.0140  0.0140       1.0300
      seasonQ2  seasonQ3  seasonQ4     date    pm10
       -4.4635   -6.2778   -3.2878  -0.0011  0.0096
s.e.    0.4569    0.4624    0.4546   0.0001  0.0018
sigma^2 estimated as 53.69: log likelihood = -17441.84, aic = 34909.69
</code></pre></div></div>
<p>Note that the coefficient for PM10 hasn’t changed much from the previous two models (it stays between roughly 0.009 and 0.010). The usual concern with not accounting for residual autocorrelation is that the variance/standard error of the coefficient of interest will be affected. In this case, there does not appear to be much of a difference between using the AR(6) to account for the residual autocorrelation and ignoring it altogether. Here’s a comparison of the standard errors for each coefficient.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>               Naive AR model
(Intercept) 0.648584 1.030007
seasonQ2    0.298420 0.456883
seasonQ3    0.298153 0.462371
seasonQ4    0.297049 0.454624
date        0.000071 0.000114
pm10        0.001820 0.001819
</code></pre></div></div>
<p>The standard errors for the <code class="language-plaintext highlighter-rouge">pm10</code> variable are almost identical, while the standard errors for the other variables are somewhat bigger in the AR model.</p>
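<p>A comparison like this can be sketched on simulated data as follows. The data-generating setup (variable names, effect sizes, AR(2) noise with coefficients 0.3 and 0.15) is hypothetical, and the AR order of 2 is fixed by hand here rather than taken from the analysis above, so treat this as an outline of the mechanics and not a reproduction of the results.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(321)
n <- 2000
dates <- seq(as.Date("1990-01-01"), by = "day", length.out = n)
day <- as.numeric(dates - dates[1])       # days since the start of the series
season <- factor(quarters(dates))         # levels "Q1" to "Q4"
exposure <- rnorm(n, mean = 30, sd = 15)  # hypothetical pollutant level
season_effect <- unname(c(Q1 = 0, Q2 = -5, Q3 = -7, Q4 = -3)[as.character(season)])
y <- 50 + season_effect - 0.002 * day + 0.01 * exposure +
  as.numeric(arima.sim(list(ar = c(0.3, 0.15)), n = n, sd = 7))

# Fixed part first, ignoring any autocorrelation in the errors
fit_lm <- lm(y ~ season + day + exposure)

# Partial autocorrelations of the residuals help suggest an AR order
round(pacf(resid(fit_lm), lag.max = 10, plot = FALSE)$acf[1:10], 3)

# Refit with AR(2) errors. The design matrix already has an intercept column,
# so include.mean = FALSE avoids estimating a second one.
X <- model.matrix(fit_lm)
fit_ar <- arima(y, order = c(2, 0, 0), xreg = X, include.mean = FALSE)

# Standard errors side by side, matched by coefficient name
se_lm <- sqrt(diag(vcov(fit_lm)))
se_ar <- sqrt(diag(vcov(fit_ar)))[names(se_lm)]
round(cbind(Naive = se_lm, "AR model" = se_ar), 6)
</code></pre></div></div>
<p>Passing the regression design matrix through <code class="language-plaintext highlighter-rouge">xreg</code> keeps the fixed part of the model identical between the two fits, so any difference in the standard errors comes from how the random part is handled.</p>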
<h2 id="conclusion">Conclusion</h2>
<p>Ultimately, I’ve found that in many biomedical and public health applications, time series modeling is very different from what I read in the textbooks. The key takeaways are:</p>
<ol>
<li>
<p>Make sure you know if you’re doing <strong>prediction</strong> or <strong>inference</strong>. Most often you will be doing inference, in which case your modeling strategies will be quite different.</p>
</li>
<li>
<p>Focus separately on the <strong>fixed</strong> and <strong>random</strong> parts of the model. In particular, work with the fixed part of the model first, incorporating as much information as you can that will explain variability in your outcome.</p>
</li>
<li>
<p>Model the random part appropriately, after incorporating as much as you can into the fixed part of the model. Classical time series models may be of use here, but also simple robust variance estimators may be sufficient.</p>
</li>
</ol>
Not So Standard Deviations Episode 15 - Spinning Up Logistics
2016-05-04T00:00:00+00:00
http://simplystats.github.io/2016/05/04/nssd-episode-15
<p>This is Hilary’s and my last New York-Baltimore episode! In future
episodes, Hilary will be broadcasting from California. In this episode
we discuss collaboration tools and workflow management for data
science projects. To date, I have not found a project management tool
that I can actually use (besides email), but am open to suggestions
(from students).</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://twitter.com/hspter/status/725411087110299649">Hilary’s tweet on cats</a></p>
</li>
<li>
<p><a href="http://www.etsy.com/listing/185113916/…mug-coffee-cup-tea">Awesome vs. cats mug</a></p>
</li>
<li>
<p><a href="http://math.mit.edu/~urschel/">John Urschel’s web page</a></p>
</li>
<li>
<p><a href="http://www.ams.org/publications/journa…1602/rnoti-p148.pdf">Profile of John Urschel by the AMS</a></p>
</li>
<li>
<p><a href="http://en.wikipedia.org/wiki/Frank_Ryan_…merican_football">The other NFL player/mathematician</a></p>
</li>
<li>
<p><a href="http://guides.github.com/introduction/flow/">GitHub flow</a></p>
</li>
<li>
<p><a href="http://www.theinformation.com/articles/why-…a-product-fix">Problems with Slack</a></p>
</li>
<li>
<p><a href="http://www.astronomy.ohio-state.edu/~pogge/Ast…5/gps.html">Relativity and GPS</a></p>
</li>
<li>
<p><a href="http://www.theinformation.com/become-a-data…e-information">The Information is looking for a Data Storyteller</a></p>
</li>
<li>
<p><a href="http://www.stitchfix.com/careers?gh_jid=1…46?gh_jid=169746">Stitch Fix is looking for Data Scientists</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/nssd-episode-15-spinning-up-logistics">Download the audio for this episode.</a></p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/261374684&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
High school student builds interactive R class for the intimidated with the JHU DSL
2016-04-27T00:00:00+00:00
http://simplystats.github.io/2016/04/27/r-intimidated
<p>Annika Salzberg is currently an undergraduate at Haverford College majoring in biology. While in high school here in Baltimore she developed and taught an R class to her classmates at the <a href="http://www.parkschool.net/">Park School</a>. Her interest in R grew out of a project where she and her fellow students and teachers went to the Canadian sub-Arctic to collect data on permafrost depth and polar bears. When analyzing the data she learned R (with the help of a teacher) to be able to do the analyses, some of which she did on her laptop while out in the field.</p>
<p>Later she worked on developing a course that she felt was friendly and approachable enough for her fellow high-schoolers to benefit from. With the help of Steven Salzberg and the folks here at the JHU DSL, she built a class she calls <a href="https://www.datacamp.com/courses/r-for-the-intimidated">R for the intimidated</a>, which just launched on <a href="https://www.datacamp.com/courses/r-for-the-intimidated">DataCamp</a> and which you can take for free!</p>
<p>The class is a great introduction for people who are just getting started with R. It walks through R/RStudio, package installation, data visualization, data manipulation, and a final project. We are super excited about the content that Annika created working here at Hopkins and think you should go check it out!</p>
An update on Georgia Tech's MOOC-based CS degree
2016-04-27T00:00:00+00:00
http://simplystats.github.io/2016/04/27/georgia-tech-mooc-program
<p><a href="https://www.insidehighered.com/news/2016/04/27/georgia-tech-plans-next-steps-online-masters-degree-computer-science?utm_source=Inside+Higher+Ed&utm_campaign=d373e33023-DNU20160427&utm_medium=email&utm_term=0_1fcbc04421-d373e33023-197601005#.VyCmdfkGRPU.mailto">This article</a> in Inside Higher Ed discusses next steps for Georgia
Tech’s ground-breaking low-cost CS degree based on MOOCs run by
Udacity. With Sebastian Thrun <a href="http://blog.udacity.com/2016/04/udacity-has-a-new-___.html">stepping down</a> as CEO at Udacity, it seems both Georgia Tech and Udacity might be moving into a new phase.</p>
<p>One fact that surprised me about the Georgia Tech program concerned the demographics:</p>
<blockquote>
<p>Once the first applications for the online program arrived, Georgia Tech was surprised by how the demographics differed from the applications to the face-to-face program. The institute’s face-to-face cohorts tend to have more men than women and international students than U.S. citizens or residents. Applications to the online program, however, came overwhelmingly from students based in the U.S. (80 percent). The gender gap was even larger, with nearly nine out of 10 applications coming from men.</p>
</blockquote>
Write papers like a modern scientist (use Overleaf or Google Docs + Paperpile)
2016-04-21T00:00:00+00:00
http://simplystats.github.io/2016/04/21/writing
<p><em>Editor’s note - This is a chapter from my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a> where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before.</em></p>
<h2 id="writing---what-should-i-do-and-why">Writing - what should I do and why?</h2>
<p><strong>Write using collaborative software to avoid version control issues.</strong></p>
<p>On almost all modern scientific papers you will have co-authors. The traditional way of handling this was to
create a single working document and pass it around. Unfortunately this system always results in a long collection of
versions of a manuscript, which are often hard to distinguish and definitely hard to synthesize.</p>
<p>An alternative approach is to use formal version control systems like <a href="https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control">Git</a> and <a href="https://github.com/">GitHub</a>. However, the overhead for using these systems can be pretty heavy for paper authoring. They also require
all parties participating in the writing of the paper to be familiar with version control and the command line.
Alternative paper authoring tools are now available that provide some of the advantages of version control without the major overhead involved
in using base version control systems.</p>
<p><img src="https://imgs.xkcd.com/comics/documents.png" alt="The usual result of file naming by a group (image via https://xkcd.com/1459/)" /></p>
<p><strong>Make figures the focus of your writing</strong></p>
<p>Once you have a set of results and are ready to start writing up the paper the first thing is <em>not to write</em>. The first thing you should do is create a set of 1-10 publication-quality plots with 3-4 as the central focus (see Chapter 10 <a href="http://leanpub.com/datastyle">here</a> for more information on how to make plots). Show these to someone you trust to make sure they “get” your story before proceeding. Your writing should then be focused around explaining the story of those plots to your audience. Many people, when reading papers, read the title, the abstract, and then usually jump to the figures. If your figures tell the whole story you will dramatically increase your audience. It also helps you to clarify what you are writing about.</p>
<p><strong>Write clearly and simply even though it may make your papers harder to publish</strong>.</p>
<p>Learn how to write papers in a very clear and simple style. Whenever you can, write in plain English and make the approach you are using understandable and clear. This can (sometimes) make it harder to get your papers into journals. Referees are trained to find things to criticize and by simplifying your argument they will assume that what you have done is easy or just like what has been done before. This can be extremely frustrating during the peer review process. But the peer review process isn’t the end goal of publishing! The point of publishing is to communicate your results to your community and beyond so they can use them. Simple, clear language leads to much higher use/reading/citation/impact of your work.</p>
<p><strong>Include links to code, data, and software in your writing</strong></p>
<p>Not everyone recognizes the value of re-analysis, scientific software, or data and code sharing. But it is the fundamental cornerstone of the modern scientific process to make all of your materials easily accessible, re-usable and checkable. Include links to data, code, and software prominently in your abstract, introduction and methods and you will dramatically increase the use and impact of your work.</p>
<p><strong>Give credit to others</strong></p>
<p>In academics the main currency we use is credit for publication. In general assigning authorship and getting credit can be a very tricky component of the publication process. It is almost always better to err on the side of offering credit. A very useful test that my advisor <a href="http://www.genomine.org/">John Storey</a> taught me is if you are embarrassed to explain the authorship credit to anyone who was on the paper or not on the paper, then you probably haven’t shared enough credit.</p>
<h2 id="writing---what-tools-should-i-use">Writing - what tools should I use?</h2>
<h3 id="wysiwyg-software-google-docs-and-paperpile">WYSIWYG software: Google Docs and Paperpile.</h3>
<p>This system uses <a href="https://www.google.com/docs/about/">Google Docs</a> for writing and <a href="https://paperpile.com/app">Paperpile</a> for reference management. If you have a Google account you can easily create documents and share them with your collaborators either privately or publicly. Paperpile allows you to search for academic articles and insert references into the text using a system that will be familiar if you have previously used <a href="http://endnote.com/">Endnote</a> and <a href="https://products.office.com/en-us/word">Microsoft Word</a>.</p>
<p>This system has the advantage of being a what-you-see-is-what-you-get system - anyone with basic text processing skills should be immediately able to contribute. Google Docs also automatically saves versions of your work so that you can flip back to older versions if someone makes a mistake. You can also easily see which part of the document was written by which person and add comments.</p>
<p><em>Getting started</em></p>
<ol>
<li>Set up accounts with <a href="https://accounts.google.com/SignUp">Google</a> and with <a href="https://paperpile.com/">Paperpile</a>. If you are an
academic the Paperpile account will cost $2.99 a month, but there is a 30 day free trial.</li>
<li>Go to <a href="https://docs.google.com/document/u/0/">Google Docs</a> and create a new document.</li>
<li>Set up the <a href="https://paperpile.com/blog/free-google-docs-add-on/">Paperpile add-on for Google Docs</a></li>
<li>In the upper right hand corner of the document, click on the <em>Share</em> link and share the document with your collaborators</li>
<li>Start editing</li>
<li>When you want to include a reference, place the cursor where you want the reference to go, then using the <em>Paperpile</em> menu, choose
insert citation. This should give you a search box where you can search by Pubmed ID or on the web for the reference you want.</li>
<li>Once you have added some references use the <em>Citation style</em> option under <em>Paperpile</em> to pick the citation style for the journal you care about.</li>
<li>Then use the <em>Format citations</em> option under <em>Paperpile</em> to create the bibliography at the end of the document</li>
</ol>
<p>The nice thing about using this system is that everyone can easily directly edit the document simultaneously - which reduces conflict and difficulty of use. A disadvantage is getting the formatting just right for most journals is nearly impossible, so you will be sending in a version of your paper that is somewhat generic. For most journals this isn’t a problem, but a few journals are sticklers about this.</p>
<h3 id="typesetting-software-overleaf-or-sharelatex">Typesetting software: Overleaf or ShareLatex</h3>
<p>An alternative approach is to use typesetting software like LaTeX. This requires a little bit more technical expertise since you need
to understand the LaTeX typesetting language. But it allows for more precise control over what the document will look like. Using LaTeX
on its own you will have many of the same issues with version control that you would have for a Word document. Fortunately there are now
“Google Docs like” solutions for editing LaTeX code that are readily available. Two of the most popular are <a href="https://www.overleaf.com/">Overleaf</a> and <a href="https://www.sharelatex.com/">ShareLatex</a>.</p>
<p>In either system you can create a document, share it with collaborators, and edit it in a similar manner to a Google Doc, with simultaneous editing. Under both systems you can save versions of your document easily as you move along so you can quickly return to older versions if mistakes are made.</p>
<p>I have used both kinds of software, but now primarily use Overleaf because they have a killer feature. Once you have
finished writing your paper you can directly submit it to some preprint servers like <a href="http://arxiv.org/">arXiv</a> or <a href="http://biorxiv.org/">bioRxiv</a> and even some journals like <a href="https://peerj.com">PeerJ</a> or <a href="http://f1000research.com/">F1000Research</a>.</p>
<p><em>Getting started</em></p>
<ol>
<li>Create an <a href="https://www.overleaf.com/signup">Overleaf account</a>. There is a free version of the software. Paying $8/month will give you easy saving to Dropbox.</li>
<li>Click on <em>New Project</em> to create a new document and select from the available templates</li>
<li>Open your document and start editing</li>
<li>Share with colleagues by clicking on the <em>Share</em> button within the project. You can share either a read only version or a read and edit version.</li>
</ol>
<p>Once you have finished writing your document you can click on the <em>Publish</em> button to automatically submit your paper to the available preprint servers and journals. Or you can download a pdf version of your document and submit it to any other journal.</p>
<h2 id="writing---further-tips-and-issues">Writing - further tips and issues</h2>
<h3 id="when-to-write-your-first-paper">When to write your first paper</h3>
<p>As soon as possible! The purpose of graduate school is (in some order):</p>
<ul>
<li>Freedom</li>
<li>Time to discover new knowledge</li>
<li>Time to dive deep</li>
<li>Opportunity for leadership</li>
<li>Opportunity to make a name for yourself
<ul>
<li>R packages</li>
<li>Papers</li>
<li>Blogs</li>
</ul>
</li>
<li>Get a job</li>
</ul>
<p>The first couple of years of graduate school are typically focused on (1) teaching you all the technical skills you need and (2) data dumping as much hard-won practical experience from more experienced people into your head as fast as possible.</p>
<p>After that one of your main focuses should be on establishing your own program of research and reputation. Especially for Ph.D. students it cannot be emphasized enough that <em>no one will care about your grades in graduate school but everyone will care what you produced</em>. See, for example, Sherri’s excellent <a href="http://drsherrirose.com/academic-cvs-for-statistical-science-faculty-positions">guide on CVs for academic positions</a>.</p>
<p>I firmly believe that <a href="http://simplystatistics.org/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper/">R packages</a> and blog posts can be just as important as papers, but the primary signal to most traditional academic communities still remains published peer-reviewed papers. So you should get started on writing them as soon as you can (definitely before you feel comfortable enough to try to write one).</p>
<p>Even if you aren’t going to be in academics, papers are a great way to show off that you can (a) identify a useful project, (b) finish a project, and (c) write well. So the first thing you should be asking when you start a project is “what paper are we working on?”</p>
<h3 id="what-is-an-academic-paper">What is an academic paper?</h3>
<p>A scientific paper can be distilled into four parts:</p>
<ol>
<li>A set of methodologies</li>
<li>A description of data</li>
<li>A set of results</li>
<li>A set of claims</li>
</ol>
<p>When you (or anyone else) writes a paper the goal is to communicate clearly items 1-3 so that they can justify the set of claims you are making. Before you can even write down 4 you have to do 1-3. So that is where you start when writing a paper.</p>
<h3 id="how-do-you-start-a-paper">How do you start a paper?</h3>
<p>The first thing you do is you decide on a problem to work on. This can be a problem that your advisor thought of or it can be a problem you are interested in, or a combination of both. Ideally your first project will have the following characteristics:</p>
<ol>
<li>Concrete</li>
<li>Solves a scientific problem</li>
<li>Gives you an opportunity to learn something new</li>
<li>Something you feel ownership of</li>
<li>Something you want to work on</li>
</ol>
<p>Points 4 and 5 can’t be emphasized enough. Others can try to help you come up with a problem, but if you don’t feel like it is <em>your</em> problem it will make writing the first paper a total slog. You want to find an option where you are just insanely curious to know the answer at the end, to the point where you <em>just have to figure it out</em> and kind of don’t care what the answer is. That doesn’t always happen, but that makes the grind of writing papers go down a lot easier.</p>
<p>Once you have a problem the next step is to actually do the research. I’ll leave this for another guide, but the basic idea is that you want to follow the usual <a href="https://leanpub.com/datastyle/">data analytic process</a>:</p>
<ol>
<li>Define the question</li>
<li>Get/tidy the data</li>
<li>Explore the data</li>
<li>Build/borrow a model</li>
<li>Perform the analysis</li>
<li>Check/critique results</li>
<li>Write things up</li>
</ol>
<p>The hardest part for the first paper is often knowing where to stop and start writing.</p>
<h3 id="how-do-you-know-when-to-start-writing">How do you know when to start writing?</h3>
<p>Sometimes this is an easy question to answer. If you started with a very concrete question, then once you have done enough analysis to convince yourself that you have the answer, and that answer is interesting/surprising, it is time to stop and write.</p>
<p>If you started with a question that wasn’t so concrete then it gets a little trickier. The basic idea here is that you have convinced yourself you have a result that is worth reporting. Usually this takes the form of between 1 and 5 figures that show a coherent story that you could explain to someone in your field.</p>
<p>In general one thing you should be working on in graduate school is your own internal timer that tells you, “ok we have done enough, time to write this up”. I found this one of the hardest things to learn, but if you are going to stay in academics it is a critical skill. There are rarely deadlines for paper writing (unless you are submitting to CS conferences) so it will eventually be up to you when to start writing. If you don’t have a good clock, this can really slow down your ability to get things published and promoted in academics.</p>
<p>One good principle to keep in mind is “the perfect is the enemy of the very good.” Another one is that a published paper in a respectable journal beats a paper you just never submit because you want to get it into the “best” journal.</p>
<h3 id="a-note-on-negative-results">A note on “negative results”</h3>
<p>If the answer to your research problem isn’t interesting/surprising but you started with a concrete question it is also time to stop and write. But things often get more tricky with this type of paper as most journals when reviewing papers filter for “interest” so sometimes a paper without a really “big” result will be harder to publish. <strong>This is ok!!</strong> Even though it may take longer to publish the paper, it is important to publish even results that aren’t surprising/novel. I would much rather that you come to an answer you are comfortable with and we go through a little pain trying to get it published than you keep pushing until you get an “interesting” result, which may or may not be justifiable.</p>
<h3 id="how-do-you-start-writing">How do you start writing?</h3>
<ol>
<li>Once you have a set of results and are ready to start writing up the paper the first thing is <em>not to write</em>. The first thing you should do is create a set of 1-4 publication-quality plots (see Chapter 10 <a href="http://leanpub.com/datastyle">here</a>). Show these to someone you trust to make sure they “get” your story before proceeding.</li>
<li>Start a project on <a href="https://www.overleaf.com/">Overleaf</a> or <a href="https://www.google.com/docs/about/">Google Docs</a>.</li>
<li>Write up a story around the four plots in the simplest language you feel you can get away with, while still reporting all of the technical details that you can.</li>
<li>Go back and add references in only after you have finished the whole first draft.</li>
<li>Add in additional technical detail in the supplementary material if you need it.</li>
<li>Write up a reproducible version of your code that returns exactly the same numbers/figures in your paper with no input parameters needed.</li>
</ol>
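<p>Item 6 in the list above could look something like the following minimal sketch. It is written in Python using only the standard library. The seed value, the function names, and the simulated "data" are all hypothetical stand-ins for a real analysis. The point is only the shape: a single entry point that takes no arguments and fixes its own random seed, so every run returns the same numbers.</p>

```python
import random
import statistics

SEED = 20160421  # fixed seed (an arbitrary choice) so every run draws the same data


def simulate_measurements(n=100):
    """Stand-in for loading the paper's data: n draws from a normal distribution."""
    rng = random.Random(SEED)  # local generator, unaffected by other code
    return [rng.gauss(10.0, 2.0) for _ in range(n)]


def main():
    """Recompute every number reported in the (hypothetical) paper, taking no arguments."""
    data = simulate_measurements()
    return {
        "n": len(data),
        "mean": round(statistics.mean(data), 3),
        "sd": round(statistics.stdev(data), 3),
    }


if __name__ == "__main__":
    for name, value in main().items():
        print(f"{name}: {value}")
```

<p>Because nothing is read from the command line or from interactive input, a reader only has to run the script once to check the numbers against the paper.</p>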
<h3 id="what-are-the-sections-in-a-paper">What are the sections in a paper?</h3>
<p>Keep in mind that most people will read the title of your paper only, a small fraction of those people will read the abstract, a small fraction of those people will read the introduction, and a small fraction of those people will read your whole paper. So make sure you get to the point quickly!</p>
<p>The sections of a paper are always some variation on the following:</p>
<ol>
<li><strong>Title</strong>: Should be very short, no colons if possible, and state the main result. Example, “A new method for sequencing data that shows how to cure cancer”. Here you want to make sure people will read the paper without overselling your results - this is a delicate balance.</li>
<li><strong>Abstract</strong>: In (ideally) 4-5 sentences explain (a) what problem you are solving, (b) why people should care, (c) how you solved the problem, (d) what are the results and (e) a link to any data/resources/software you generated.</li>
<li><strong>Introduction</strong>: A more lengthy (1-3 pages) explanation of the problem you are solving, why people should care, and how you are solving it. Here you also review what other people have done in the area. The most critical thing is never underestimate how little people know or care about what you are working on. It is your job to explain to them why they should.</li>
<li><strong>Methods</strong>: You should state and explain your experimental procedures, how you collected results, your statistical model, and any strengths or weaknesses of your proposed approach.</li>
<li><strong>Comparisons (for methods papers)</strong>: Compare your proposed approach to the state of the art methods. Do this with simulations (where you know the right answer) and data you haven’t simulated (where you don’t know the right answer). If you can base your simulation on data, even better. Make sure you are <a href="http://simplystatistics.org/2013/03/06/the-importance-of-simulating-the-extremes/">simulating both the easy case (where your method should be great) and harder cases where your method might be terrible</a>.</li>
<li><strong>Your analysis</strong>: Explain what you did, what data you collected, how you processed it and how you analysed it.</li>
<li><strong>Conclusions</strong>: Summarize what you did and explain why what you did is important one more time.</li>
<li><strong>Supplementary Information</strong>: If there are a lot of technical computational, experimental or statistical details, you can include a supplement that has all of the details so folks can follow along. As far as possible, try to include the detail in the main text but explained clearly.</li>
</ol>
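<p>For the comparisons section, here is a rough sketch of what simulating both an easy and a hard case might look like. It is in Python with only the standard library. The two "methods" being compared (the sample mean and the sample median), the outlier rate, and all function names are illustrative assumptions rather than anything from a real paper. The true answer is known to be 0 because the data are simulated.</p>

```python
import random
import statistics


def draw_sample(rng, n, outlier_rate):
    """Draw n values centered at 0; a fraction outlier_rate comes from a much wider normal."""
    return [
        rng.gauss(0.0, 10.0) if rng.random() < outlier_rate else rng.gauss(0.0, 1.0)
        for _ in range(n)
    ]


def rmse(estimator, outlier_rate, reps=500, n=50, seed=1):
    """Root mean squared error of an estimator of the true center, which is 0 by construction."""
    rng = random.Random(seed)
    squared_errors = [estimator(draw_sample(rng, n, outlier_rate)) ** 2 for _ in range(reps)]
    return (sum(squared_errors) / reps) ** 0.5


def compare():
    """Evaluate two estimators on an easy scenario (no outliers) and a hard one (10% outliers)."""
    results = {}
    for scenario, rate in [("easy", 0.0), ("hard", 0.1)]:
        for name, estimator in [("mean", statistics.mean), ("median", statistics.median)]:
            results[(scenario, name)] = rmse(estimator, rate)
    return results


if __name__ == "__main__":
    for (scenario, name), error in compare().items():
        print(f"{scenario:>4} case, {name:>6}: RMSE = {error:.3f}")
```

<p>Reporting both scenarios side by side shows readers where a method does well and where it breaks down, rather than only the case that flatters it.</p>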
<p>The length of the paper will depend a lot on which journal you are targeting. In general the shorter/more concise the better. But unless you are shooting for a really glossy journal you should try to include the details in the paper itself. This means most papers will be in the 4-15 page range, but with a huge variance.</p>
<p><em>Note</em>: Part of this chapter appeared in the <a href="https://github.com/jtleek/firstpaper">Leek group guide to writing your first paper</a></p>
As a data analyst the best data repositories are the ones with the least features
2016-04-20T00:00:00+00:00
http://simplystats.github.io/2016/04/20/data-repositories
<p>Lately, for a range of projects I have been working on I have needed to obtain data from previous publications. There is a growing list of data repositories where data is made available. General purpose data sharing sites include:</p>
<ul>
<li>The <a href="https://osf.io/">open science framework</a></li>
<li>The <a href="https://dataverse.harvard.edu/">Harvard Dataverse</a></li>
<li><a href="https://figshare.com/">Figshare</a></li>
<li><a href="https://datadryad.org/">Datadryad</a></li>
</ul>
<p>There are also a host of field-specific data sharing sites. One thing that I find a little frustrating about these sites is that they add a lot of bells and whistles. For example I wanted to download a <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6FMTT3">p-value data set</a> from Dataverse (just to pick on one, but most repositories have similar issues). I go to the page and click <code class="language-plaintext highlighter-rouge">Download</code> on the data set I want.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-04-20/dataverse1.png" alt="Downloading a dataverse paper " /></p>
<p>Then I have to accept terms:</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-04-20/dataverse2.png" alt="Downloading a dataverse paper part 2 " /></p>
<p>Then the data set is downloaded. But it comes from a button that doesn’t allow me to get the direct link. There is an <a href="https://github.com/ropensci/dvn">R package</a> that you can use to download dataverse data, but again not with direct links to the data sets.</p>
<p>This is a similar system to many data repositories where there is a multi-step process to downloading data rather than direct links.</p>
<p>But as a data analyst I often find that I want:</p>
<ul>
<li>To be able to find a data set with some minimal search terms</li>
<li>To find the data set in .csv or tab delimited format via a direct link</li>
<li>To have the data set available in both raw and processed versions</li>
<li>The processed version to be one or many <a href="https://www.jstatsoft.org/article/view/v059i10">tidy data sets</a></li>
</ul>
<p>As a data analyst I would rather have all of the data stored as direct links and ideally as csv files. Then you don’t need to figure out a specialized package, an API, or anything else. You just use <code class="language-plaintext highlighter-rouge">read.csv</code> directly using the URL in R and you are off to the races. It also makes it easier to point to which data set you are using. So I find the best data repositories are the ones with the least features.</p>
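<p>To illustrate how little machinery a direct link needs, here is a minimal sketch of the same workflow in Python using only the standard library (in R it is the single <code class="language-plaintext highlighter-rouge">read.csv</code> call on the URL). The function name <code class="language-plaintext highlighter-rouge">read_csv_url</code> and the example URL in the comment are hypothetical, and the sketch assumes the file is UTF-8 encoded with a header row.</p>

```python
import csv
import io
import urllib.request


def read_csv_url(url):
    """Download a CSV file from a direct link and return its rows as a list of dicts."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")  # assumes UTF-8 content
    # The first row is treated as the header; all values are returned as strings.
    return list(csv.DictReader(io.StringIO(text)))


# Usage with a hypothetical direct link:
# rows = read_csv_url("https://example.org/path/to/data.csv")
```

<p>No repository-specific package, API key, or click-through step is involved, and the URL itself documents exactly which data set was used.</p>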
Junior scientists - you don't have to publish in open access journals to be an open scientist.
2016-04-11T00:00:00+00:00
http://simplystats.github.io/2016/04/11/publishing
<p><em>Editor’s note - This is a chapter from my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a> where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before.</em></p>
<h2 id="publishing---what-should-i-do-and-why">Publishing - what should I do and why?</h2>
<p>A modern scientific writing process goes as follows.</p>
<ol>
<li>You write a paper</li>
<li>You post a preprint
<ol type="a">
<li>Everyone can read and comment</li>
</ol>
</li>
<li>You submit it to a journal</li>
<li>It is peer reviewed privately</li>
<li>The paper is accepted or rejected
<ol type="a">
<li>If rejected go back to step 2 and start over</li>
<li>If accepted it will be published</li>
</ol>
</li>
</ol>
<p>You can take advantage of modern writing and publishing tools to
handle several steps in the process.</p>
<p><strong>Post preprints of your work</strong></p>
<p>Once you have finished writing your paper, you want to share it with others. Historically, this involved submitting the paper to a journal, waiting for reviews, revising the paper, resubmitting, and eventually publishing it. There is now very little reason to wait that long for your paper to appear in print. Generally you can post a paper to a preprint server and have it appear in 1-2 days. This is a dramatic improvement on the weeks or months it takes for papers to appear in peer reviewed journals even under optimal conditions. There are several advantages to posting preprints.</p>
<ul>
<li>Preprints establish precedence for your work so it reduces your risk of being scooped.</li>
<li>Preprints allow you to collect feedback on your work and improve it quickly.</li>
<li>Preprints can help you to get your work published in formal academic journals.</li>
<li>Preprints can get you attention and press for your work.</li>
<li>Preprints give junior scientists and other researchers gratification that helps them handle the stress and pressure of their
first publications.</li>
</ul>
<p>The last point is underappreciated and was first pointed out to me by <a href="http://giladlab.uchicago.edu/">Yoav Gilad</a>. It takes a really long time to write a scientific paper. For a student publishing their first paper, the first feedback they get is often (a) delayed by several months and (b) negative and in the form of a referee report. This can have a major impact on the motivation of those students to keep working on projects. Preprints allow students to have an immediate product they can point to as an accomplishment, allow them to get positive feedback along with constructive or negative feedback, and can ease the pain of difficult referee reports or rejections.</p>
<p><strong>Choose the journal that maximizes your visibility</strong></p>
<p>You should try to publish your work in the best journals for your field. There are a couple of reasons for this. First, being a
scientist is both a calling and a career. To advance your career, you need visibility among your scientific peers and among the scientists
who will be judging you for grants and promotions. The best place to do this is by publishing in the top journals in your field. The
important thing is to do your best to do good work and submit it to these journals, even if the results aren’t the most “sexy”. Don’t
adapt your workflow to the journal, but don’t ignore the career implications either. Do this even if the journals are closed access.
There are ways to make your work accessible and you will both raise your profile and disseminate your results to the broadest audience.</p>
<p><strong>Share your work on social media</strong></p>
<p>Academic journals are good for disseminating your work to the appropriate scientific community. As a modern scientist you have other avenues and other communities - like the general public - that you would like to reach with your work. Once your paper has been published in a preprint or in a journal, be sure to share your work through appropriate social media channels. This will also help you develop facility in coming up with one line or one figure that best describes what you think you have published so you can share it on social media sites like Twitter.</p>
<h3 id="preprints-and-criticism">Preprints and criticism</h3>
<p>See the section on scientific blogging for how to respond to criticism of your preprints online.</p>
<h2 id="publishing---what-tools-should-i-use">Publishing - what tools should I use?</h2>
<h3 id="preprint-servers">Preprint servers</h3>
<p>Here are a few preprint servers you can use.</p>
<ol>
<li><a href="http://arxiv.org/">arXiv</a> (free) - primarily takes math/physics/computer science papers. You can submit papers and they are reviewed and posted within a couple of days. It is important to note that once you submit a paper here, you can not take it down. But you can submit revisions
to the paper which are tracked over time. This outlet is followed by a large number of journalists and scientists.</li>
<li><a href="http://biorxiv.org/">bioRxiv</a> (free) - primarily takes biology focused papers. They are pretty strict about which categories you can submit to. You can submit papers and they are reviewed and posted within a couple of days. bioRxiv also allows different versions of manuscripts, but some folks have had trouble with their versioning system, which can be a bit tricky for keeping your paper coordinated with your publication. bioRxiv is pretty carefully followed by the biological and computational biology communities.</li>
<li><a href="https://peerj.com/preprints/">PeerJ</a> (free) - takes a wide range of different types of papers. They will again review your preprint quickly and post it online. You can also post different versions of your manuscript with this system. This system is newer and so has fewer followers, so you will need to do your own publicity if you publish your paper here.</li>
</ol>
<h3 id="journal-preprint-policies">Journal preprint policies</h3>
<p>This <a href="https://en.wikipedia.org/wiki/List_of_academic_journals_by_preprint_policy">list</a> provides information on which journals accept papers that were first posted as preprints. However, you shouldn’t</p>
<h2 id="publishing---further-tips-and-issues">Publishing - further tips and issues</h2>
<h3 id="open-vs-closed-access">Open vs. closed access</h3>
<p>Once your paper has been posted to a preprint server you need to submit it for publication. There are a number of considerations you should keep in mind when submitting papers. One of these considerations is closed versus open access. Closed access journals do not require you to pay to submit or publish your paper. But then people who want to read your paper either need to pay or have a subscription to the journal in question.</p>
<p>There has been a strong push for open access journals over the last couple of decades. There are some very good reasons justifying this type of publishing including (1) moral arguments based on using public funding for research, (2) ease of access to papers, and (3) benefits in terms of people being able to use your research. In general, most modern scientists want their work to be as widely accessible as possible. So modern scientists often opt for open access publishing.</p>
<p>Open access publishing does have a couple of disadvantages. First, it is often expensive, with fees for publication ranging between <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">$1,000 and $4,000</a> depending on the journal. Second, while science is often a calling, it is also a career. Sometimes the best journals in your field may be closed access. In general, one of the most important components of an academic career is being able to publish in journals that are read by a lot of people in your field so your work will be recognized and impactful.</p>
<p>However, modern systems make both closed and open access journals reasonable outlets.</p>
<h3 id="closed-access--preprints">Closed access + preprints</h3>
<p>If the top journals in your field are closed access and you are a junior scientist then you should try to submit your papers there. But to make sure your papers are still widely accessible you can use preprints. First, you can submit a preprint before you submit the paper to the journal. Second, you can update the preprint to keep it current with the published version of your paper. This system allows you to make sure that your paper is read widely within your field, but also allows everyone to freely read the same paper. On your website, you can then link to both the published and preprint version of your paper.</p>
<h3 id="open-access">Open access</h3>
<p>If the top journal in your field is open access you can submit directly to that journal. Even if the journal is open access it makes sense to submit the paper as a preprint during the review process. You can then keep the preprint up-to-date, but this system has the advantage that the formally published version of your paper is also available for everyone to read without constraints.</p>
<h3 id="responding-to-referee-comments">Responding to referee comments</h3>
<p>After your paper has been reviewed at an academic journal you will receive referee reports. If the paper has not been outright rejected, it is important to respond to the referee reports in a timely and direct manner. Referee reports are often maddening. There is little incentive for people to do a good job refereeing and the most qualified reviewers will likely be those with a <a href="http://simplystatistics.org/2015/02/09/the-trouble-with-evaluating-anything/">conflict of interest</a>.</p>
<p>The first thing to keep in mind is that the power in the refereeing process lies entirely with the editors and referees. The first thing to do when responding to referee reports is to eliminate the impulse to argue or respond with any kind of emotion. A step-by-step process for responding to referee reports is the following.</p>
<ol>
<li>Create a Google Doc. Put in all referee and editor comments in italics.</li>
<li>Break the comments up into each discrete criticism or request.</li>
<li>In bold, respond to each comment. Begin each response with “On page xx we did yy to address this comment.”</li>
<li>Perform the analyses and experiments that you need to fill in the yy.</li>
<li>Edit the document to reflect all of the experiments that you have performed.</li>
</ol>
<p>By actively responding to each comment you will ensure you are responsive to the referees and give your paper the best chance of success. If a comment is incorrect or nonsensical, think about how you can edit the paper to remove this confusion.</p>
<h3 id="finishing">Finishing</h3>
<p>While I have advocated here for using preprints to disseminate your work, it is important to follow the process all the way through to completion. Responding to referee reports is drudgery and no one likes to do it. But in terms of career advancement preprints are almost entirely valueless until they are formally accepted for publication. It is critical to see all papers all the way through to the end of the publication cycle.</p>
<h3 id="you-arent-done">You aren’t done!</h3>
<p>Publication of your paper is only the beginning of successfully disseminating your science. Once you have published the paper, it is important to use your social media, blog, and other resources to disseminate your results to the broadest audience possible. You will also give talks, discuss the paper with colleagues, and respond to requests for data and code. The most successful papers have a long half life and the responsibilities linger long after the paper is published. But the most successful scientists continue to stay on top of requests and respond to critiques long after their papers are published.</p>
<p><em>Note:</em> Part of this chapter appeared in the Simply Statistics blog post: <a href="http://simplystatistics.org/2016/02/26/preprints-and-pppr/">“Preprints are great, but post publication peer review isn’t ready for prime time”</a></p>
A Natural Curiosity of How Things Work, Even If You're Not Responsible For Them
2016-04-08T00:00:00+00:00
http://simplystats.github.io/2016/04/08/eecom
<p>I just read Karl’s <a href="https://kbroman.wordpress.com/2016/04/08/i-am-a-data-scientist/">great
post</a>
on what it means to be a data scientist. I can’t really add much to
it, but reading it got me thinking about the Apollo 12 mission, the
second moon landing.</p>
<p>This mission is actually famous because of its launch, where the
Saturn V was struck by lightning and <a href="https://en.wikipedia.org/wiki/John_Aaron">John
Aaron</a> (played wonderfully
by Loren Dean in the movie <a href="http://www.imdb.com/title/tt0112384/">Apollo
13</a>), the flight controller in
charge of environmental, electrical, and consumables (EECOM), had to
make a decision about whether to abort the launch.</p>
<p>In this great clip from the movie <em>Failure is Not An Option</em>, the real
John Aaron describes what makes for a good EECOM flight
controller. The bottom line is that</p>
<blockquote>
<p>A good EECOM has a natural curiosity for how things work, even if you…are not responsible for them</p>
</blockquote>
<p>I think a good data scientist or statistician also has that
property. The key part of that line is the “<em>even if you are not
responsible for them</em>” part. I’ve found that a lot of being a
statistician involves nosing around in places where you’re not
supposed to be, seeing how data are collected, handled, managed,
analyzed, and reported. Focusing on the development and implementation
of methods is not enough.</p>
<p>Here’s the clip, which describes the famous “SCE to AUX” call from
John Aaron:</p>
<iframe width="640" height="480" src="https://www.youtube.com/embed/eWQIryll8y8" frameborder="0" allowfullscreen=""></iframe>
Not So Standard Deviations Episode 13 - It's Good that Someone is Thinking About Us
2016-04-07T00:00:00+00:00
http://simplystats.github.io/2016/04/07/nssd-episode-13
<p>In this episode, Hilary and I talk about the difficulties of
separating data analysis from its context, and Feather, a new file
format for storing tabular data. Also, we respond to some listener
questions and Hilary announces her new job.</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at @NSSDeviations.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="https://www.patreon.com/NSSDeviations">NSSD Patreon page</a></p>
</li>
<li>
<p><a href="https://github.com/wesm/feather/">Feather git repository</a></p>
</li>
<li>
<p><a href="https://arrow.apache.org">Apache Arrow</a></p>
</li>
<li>
<p><a href="https://google.github.io/flatbuffers/">FlatBuffers</a></p>
</li>
<li>
<p><a href="http://simplystatistics.org/2016/03/31/feather/">Roger’s blog post on feather</a></p>
</li>
<li>
<p><a href="https://www.etsy.com/shop/NausicaaDistribution">NausicaaDistribution</a></p>
</li>
<li>
<p><a href="http://www.rstats.nyc">New York R Conference</a></p>
</li>
<li>
<p><a href="https://goo.gl/J2QAWK">Every Frame a Painting</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-13-its-good-that-someone-is-thinking-about-us">Download the audio for this episode.</a></p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/257851619&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Companies are Countries, Academia is Europe
2016-04-05T00:00:00+00:00
http://simplystats.github.io/2016/04/05/corporations-academia
<p>I’ve been thinking a lot recently about the practice of data analysis
in different settings and how the environment in which you work can
affect the view you have on how things should be done. I’ve been
working in academia for over 12 years now. I don’t have any industry
data science experience, but long ago I worked as a software engineer
at <a href="http://www.northropgrumman.com/Pages/default.aspx">two</a>
<a href="http://kencast.com">companies</a>. Obviously, my experience is biased on
the academic side.</p>
<p>I’ve seen an interesting divergence between what I see being written
from data scientists in industry and my personal experience doing data
science in academia. From the industry side, I see a lot of stuff
about tooling/software and processes. This makes sense to me. Often, a
company needs/wants to move quickly and doing so requires making
decisions on a reasonable time scale. If decisions are made with data,
then the process of collecting, organizing, analyzing, and
communicating data needs to be well thought-out, systematized,
rigorous, and streamlined. If every time someone at the company had a
question the data science team developed some novel custom
coded-from-scratch solution, decisions would be made at a glacial
pace, which is probably not good for business. In order to handle this
type of situation you need solid tools and flexible workflows. You
also need agreement within the company about how things are done and
the processes that are followed.</p>
<p>Now, I don’t mean to imply that life at a company is easy, that there
isn’t politics or bureaucracy to deal with. But I see companies as much
like individual countries, with a clear (hierarchical) leadership
structure and decision-making process (okay, maybe ideal
companies). Much like in a country, it might take some time to come to
a decision about a policy or problem (e.g. health insurance), with
much negotiation and horse-trading, but once consensus is arrived at,
often the policy can be implemented across the country at a reasonable
timescale. In a company, if a certain workflow or data process can be
shown to be beneficial and perhaps improve profitability down the
road, then a decision could be made to implement it. Ultimately,
everyone within a company is in the same boat and is interested in
seeing the company succeed.</p>
<p>When I worked at a company as a software developer, I’d sometimes run
into a problem that was confusing or difficult to code. So I’d walk
down to the systems engineer’s office (the guy who wrote the
specification) and talk to him about it. We’d hash things out for a
while and then figure out a way to go forward. Often the technical
writers who wrote the documentation would come and ask me what exactly
a certain module did and I’d explain it to them. Communication was
usually quick and efficient because it usually occurred
person-to-person and because we were all on the same team.</p>
<p>Academia is more like Europe, a somewhat loose federation of states
that only communicate with each other because they have to. Each
principal investigator is a country and s/he has to engage in constant
(sometimes contentious) negotiations with other investigators
(“countries”). As a data scientist, this can be tricky because unless
I collect/generate my own data (which sometimes, <a href="http://www.ncbi.nlm.nih.gov/pubmed/18477784">I
do</a>), I have to negotiate
with another investigator to obtain the data. Even if I were
collaborating with that investigator from the very beginning of a
study, I typically have very little direct control over the data
collection process because those people don’t work for me. The result
is often that the data come to me in some format over which I had little
input, and I just have to deal with it. Sometimes this is a nice CSV
file, but often it is not.</p>
<p>In good situations, I can talk with the investigator collecting the
data and we can hash out a plan to put the data into a <a href="https://www.jstatsoft.org/article/view/v059i10">certain
format</a>. But even if
we can agree on that, often the expertise will not be available on
their end to get the data into that format, so I’ll end up having to
do it myself anyway. In not-so-good situations, I can make all the
arguments I want for an organized data collection and analysis
workflow, but if the investigator doesn’t want to do it, can’t afford
it, or doesn’t see any incentive, then it’s not going to happen. Ever.</p>
<p>However, even in the good situations, every investigator works in
their own personal way. I mean, that’s why people go into academia,
because you can “be your own boss” and work on problems that interest
you. Most people develop a process for running their group/lab that
most suits their personality. If you’re a data scientist, you need to
figure out a way to mesh with each and every investigator you
collaborate with. In addition, you need to adapt yourself to whatever
data process each investigator has developed for their group. So if
you’re working with a genomics person, you might need to learn about
BAM files. For a neuroimaging collaborator, you’ll need to know about
SPM. If one person doesn’t like tidy data, then that’s too bad. You
need to deal with it (or don’t work with them). As a result, it’s
difficult to develop a useful “system” for data science because any
system that works for one collaborator is unlikely to work for another
collaborator. In effect, each collaboration often results in a custom
coded-from-scratch solution.</p>
<p>This contrast between companies and academia got me thinking about the
<a href="https://en.wikipedia.org/wiki/Theory_of_the_firm">Theory of the
Firm</a>. This is an
economic theory that tries to explain why firms/companies develop at
all, as opposed to individuals or small groups negotiating over an
open market. My understanding is that it all comes down to how well
you can write and enforce a contract between two parties. For example,
if I need to manufacture iPhones, I can go to a contract manufacturer,
give them the designs and the precise specifications/tolerances and
they can just produce millions of them. However, if I need to <em>design</em>
the iPhone, it’s a bit harder for me to go to another company and just
say “Design an awesome new phone!” That kind of contract is difficult
to write down, much less enforce. That other company will be operating
off of different incentives from me and will likely not produce what I
want. It’s probably better if I do the design work
in-house. Ultimately, once the transaction costs of having two
different companies work together become too high, it makes more sense
for a company to do the work in-house.</p>
<p>I think collaborating on data analysis is a high transaction cost
activity. Companies have an advantage in this realm to the extent that
they can hire lots of data scientists to work in-house. Academics that
are well-funded and have large labs can often hire a data analyst to
work for them. This is good because it makes a well-trained person
available at low transaction cost, but this setup is the
exception. PIs with smaller labs barely have enough funding to do
their experiments and so either have to analyze the data themselves
(for which they may not be appropriately trained) or collaborate with
someone willing to do it. Large academic centers often have research
cores that provide data analysis services, but this doesn’t change the
fact that data analysis that occurs “outside the company” dramatically
increases the transaction costs of doing the research. Because data
analysis is a highly iterative process, each time you have to go back
and forth with an outside entity, the costs go up.</p>
<p>I think it’s possible to see a time when data analysis can effectively
be made external. I mean, Apple used to manufacture all its products,
but has shifted to contract manufacturing to great success. But I
think we will have to develop a much better understanding of the data
analysis process before we see the transaction costs start to go down.</p>
New Feather Format for Data Frames
2016-03-31T00:00:00+00:00
http://simplystats.github.io/2016/03/31/feather
<p>This past Tuesday, Hadley Wickham and Wes McKinney
<a href="http://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/">announced</a>
a new binary file format specifically for storing data frames.</p>
<blockquote>
<p>One thing that struck us was that, while R’s data frames and Python’s pandas data frames utilize different internal memory representations, the semantics of their user data types are mostly the same. In both R and pandas, data frames contain lists of named, equal-length columns, which can be numeric, boolean, and date-and-time, categorical (factors), or string. Additionally, these columns must support missing (null) values.</p>
</blockquote>
<p>Their work builds on the Apache Arrow project, which specifies a
format for tabular data. There is currently a Python and R
implementation for reading/writing these files but other
implementations could easily be built as the file format looks pretty
straightforward. The git repository is
<a href="https://github.com/wesm/feather/">here</a>.</p>
<p>Initial thoughts:</p>
<ul>
<li>
<p>The possibility of passing data between languages is, I think, the
main point here. The potential for passing data through a pipeline
without worrying about the specifics of different languages could
make for much more powerful analyses where different tools are used
for whatever they tend to do best. Essentially, as long as data can
be made tidy going in and coming out, there should not be a
communication issue between languages.</p>
</li>
<li>
<p>R users might be wondering what the big deal is–we already have a
binary serialization format (XDR). But R’s serialization format is
meant to cover all possible R objects. Feather’s focus on data
frames allows for the removal of many of the annoying (but seldom
used) complexities of R objects and optimizing a very commonly used
data format.</p>
</li>
<li>
<p>In my testing, there’s a noticeable speed difference between reading
a feather file and reading an (uncompressed) R workspace file
(feather seems about 2x faster). I didn’t time writing files, but
the difference didn’t seem as noticeable there. That said, it’s not
clear to me that performance on files is the main point here.</p>
</li>
<li>
<p>Given the underlying framework and representation, there seem to be
some interesting possibilities for low-memory environments.</p>
</li>
</ul>
<p>I’ve only had a chance to quickly look at the code but I’m excited to
see what comes next.</p>
How to create an AI startup - convince some humans to be your training set
2016-03-30T00:00:00+00:00
http://simplystats.github.io/2016/03/30/humans-as-training-set
<p>The latest trend in data science is <a href="https://en.wikipedia.org/wiki/Artificial_intelligence">artificial intelligence</a>. It has been all over the news for tackling a bunch of interesting questions. For example:</p>
<ul>
<li><a href="https://deepmind.com/alpha-go.html">AlphaGo</a> <a href="http://www.techrepublic.com/article/how-googles-deepmind-beat-the-game-of-go-which-is-even-more-complex-than-chess/">beat</a> one of the top Go players in the world in what has been called a major advance for the field.</li>
<li>Microsoft created a chatbot <a href="http://techcrunch.com/2016/03/23/microsofts-new-ai-powered-bot-tay-answers-your-tweets-and-chats-on-groupme-and-kik/">Tay</a> that ultimately <a href="http://www.bbc.com/news/technology-35902104">went very very wrong</a>.</li>
<li>Google and a number of others are working on <a href="https://www.google.com/selfdrivingcar/">self driving cars</a>.</li>
<li>Facebook is creating an artificial-intelligence-based <a href="http://www.engadget.com/2015/08/26/facebook-messenger-m-assistant/">virtual assistant called M</a>.</li>
</ul>
<p>Almost all of these applications are based (at some level) on using variations on <a href="http://neuralnetworksanddeeplearning.com/">neural networks and deep learning</a>. These models are used like any other statistical or machine learning model. They involve a prediction function that is based on a set of parameters. Using a training data set, you estimate the parameters. Then when you get a new set of data, you push it through the prediction function using those estimated parameters and make your predictions.</p>
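<p>The fit-then-predict pattern described above can be sketched in a few lines. This is only an illustration, not how any of the systems listed here are built: a two-parameter linear model stands in for the neural network, the parameters are estimated by ordinary least squares, and the function names and toy data are hypothetical.</p>

```python
# Illustrative sketch only: a two-parameter linear model stands in for the
# "prediction function". A deep network follows the same pattern with far
# more parameters and a different estimation procedure.


def predict(x, params):
    """Prediction function: maps an input to an output given the parameters."""
    intercept, slope = params
    return intercept + slope * x


def estimate_params(xs, ys):
    """Estimate the parameters from a training set by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training data generated from y = 1 + 2x (no noise).
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

# Step 1: use the training set to estimate the parameters.
params = estimate_params(train_x, train_y)

# Step 2: push new data through the prediction function using those estimates.
new_x = [4.0, 5.0]
predictions = [predict(x, params) for x in new_x]  # [9.0, 11.0]
```

<p>The only things that change when moving to deep learning are the form of the prediction function and the number of parameters; the two steps stay the same.</p>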
<p>So why does deep learning do so well on problems like voice recognition, image recognition, and other complicated tasks? The main reason is that these models involve hundreds of thousands or millions of parameters that allow the model to capture even very subtle structure in large-scale data sets. This type of model can be fit now because (a) we have huge training sets (think all the pictures on Facebook or all voice recordings of people using Siri) and (b) we have fast computers that allow us to estimate the parameters.</p>
<p>Almost all of the high-profile examples of “artificial intelligence” we are hearing about involve this type of process. This means that the machine is “learning” from examples of how humans behave. The algorithm itself is a way to estimate subtle structure from collections of human behavior.</p>
<p>The result is that the typical trajectory for an AI business is:</p>
<ol>
<li>Get a large collection of humans to perform some repetitive but possibly complicated behavior (play thousands of games of Go, or answer requests from people on Facebook messenger, or label pictures and videos, or drive cars).</li>
<li>Record all of the actions the humans perform to create a training set.</li>
<li>Feed these data into a statistical model with a huge number of parameters - made possible by having a huge training set collected from the humans in steps 1 and 2.</li>
<li>Apply the algorithm to perform the repetitive task and cut the humans out of the process.</li>
</ol>
<p>The question is how do you get the humans to perform the task for you? One option is to collect data from humans who are using your product (think Facebook image tagging). The other, more recent phenomenon, is to farm the task out to a large number of contractors (think <a href="http://www.theguardian.com/commentisfree/2015/jul/26/will-we-get-by-gig-economy">gig economy</a> jobs like driving for Uber, or responding to queries on Facebook).</p>
<p>The interesting thing about the latter case is that in the short term it produces a market for gigs for humans. But in the long term, by performing those tasks, the humans are putting themselves out of a job. This played out in a relatively public way just recently with a service called <a href="http://www.fastcompany.com/3058060/this-is-what-it-feels-like-when-a-robot-takes-your-job">GoButler</a> that used its employees to train a model and then replaced them with that model.</p>
<p>It will be interesting to see how many areas of employment this type of approach takes over. It is also interesting to think about how much each task you perform for a company like that is worth to the training set. It will also be interesting to see whether the gig workers at these companies have a legal claim that their labor helped “create the value” at the companies that replace them.</p>
Not So Standard Deviations Episode 12 - The New Bayesian vs. Frequentist
2016-03-26T00:00:00+00:00
http://simplystats.github.io/2016/03/26/nssd-episode-12
<p>In this episode, Hilary and I discuss the new direction for the
journal Biostatistics, the recent fracas over ggplot2 and base
graphics in R, and whether collecting more data is always better than
collecting less (fewer?) data. Also, Hilary and Roger respond to some
listener questions and more free advertising.</p>
<p>If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at @NSSDeviations.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://goo.gl/am6I3r">Jeff Leek on why he doesn’t use ggplot2</a></p>
</li>
<li>
<p>David Robinson on <a href="http://varianceexplained.org/r/why-I-use-ggplot2/">why he uses ggplot2</a></p>
</li>
<li>
<p><a href="http://goo.gl/6iEB2I">Nathan Yau’s post comparing ggplot2 and base graphics</a></p>
</li>
<li>
<p><a href="https://goo.gl/YuhFgB">Biostatistics Medium post</a></p>
</li>
<li>
<p><a href="http://goo.gl/tXNdCA">Photoviz</a></p>
</li>
<li>
<p><a href="https://twitter.com/PigeonAir">PigeonAir</a></p>
</li>
<li>
<p><a href="https://goo.gl/jqlg0G">I just want to plot()</a></p>
</li>
<li>
<p><a href="https://goo.gl/vvCfkl">Hilary and Rush Limbaugh</a></p>
</li>
<li>
<p><a href="http://imgur.com/a/K4RWn">Deep learning training set</a></p>
</li>
<li>
<p><a href="http://patreon.com/NSSDeviations">NSSD Patreon Page</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-12-the-new-bayesian-vs-frequentist">Download the audio for this episode.</a></p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/255099493&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
The future of biostatistics
2016-03-24T00:00:00+00:00
http://simplystats.github.io/2016/03/24/the-future-of-biostatistics
<p>Starting in January my colleague <a href="https://twitter.com/drizopoulos">Dimitris Rizopoulos</a> and I took over as co-editors of the journal
Biostatistics. We are pretty fired up to try some new things with the journal and to make sure that the most important advances
in statistical methodology and application have a good home.</p>
<p>We started a blog for the journal and our first post is here: <a href="https://medium.com/@biostatistics/the-future-of-biostatistics-5aa8246e14b4#.uk1gat5sr">The future of Biostatistics</a>. Thanks to <a href="https://twitter.com/kwbroman/status/695306823365169154">Karl Broman
and his family</a> we also have the Twitter handle <a href="https://twitter.com/biostatistics">@biostatistics</a>. Follow us there to hear about all the new stuff we are rolling out.</p>
The Evolution of a Data Scientist
2016-03-21T00:00:00+00:00
http://simplystats.github.io/2016/03/21/dataScientistEvo-jaffe
<p><em>Editor’s note: This post is a guest post by <a href="http://aejaffe.com">Andrew Jaffe</a></em></p>
<p>“How do you get to Carnegie Hall? Practice, practice, practice.” (“The Wit Parade” by E.E. Kenyon on March 13, 1955)</p>
<p>“…an extraordinarily consistent answer in an incredible number of fields … you need to have practiced, to have apprenticed, for 10,000 hours before you get good.” (Malcolm Gladwell, on Outliers)</p>
<p>I have been a data scientist for the last seven or eight years, probably since before “data science” existed as a field. I work almost exclusively in the R statistical environment, which I first toyed with as a sophomore in college and used increasingly through graduate school. I write all of my code in Notepad++ and make all of my plots with base R graphics, rather than newer and probably easier approaches like RStudio, ggplot2, and R Markdown. Every so often, someone will email asking for code used in papers for analysis or plots, and I dig through old folders to track it down. Every time this happens, I come to two realizations: 1) I used to write fairly inefficient and not-so-great code as an early PhD student, and 2) I write a lot of R code.</p>
<p>I think there are some pretty good ways of measuring success and growth as a data scientist – you can count software packages and their user-bases, projects and papers, citations, grants, and promotions. But I wanted to calculate one more metric to add to the list – how much R code have I written in the last 8 years? I have been using the Joint High Performance Computing Exchange (JHPCE) at Johns Hopkins University since I started graduate school, so all of my R code was pretty much all in one place. I therefore decided to spend my Friday night drinking some Guinness and chronicling my journey using R and evolution as a data scientist.</p>
<p>I found all of the .R files across my four main directories on the computing cluster (after copying over my local scripts), and then removed files that came with packages, that belonged to other users, and that resulted from poorly designed simulation and permutation analyses (perm1.R,…,perm100.R) before I learned how to use array jobs, and then extracted the creation date, last modified date, file size, and line count for each R script. Based on this analysis, I have written 3257 R scripts across 13.4 megabytes and 432,753 lines of code (including whitespace and comments) since February 22, 2009.</p>
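<p>The post does not say which tools were used for this step, so the following is only a rough sketch of how such an inventory could be built, written here in Python using the standard library. The function names and record fields are hypothetical, and only the last-modified time is recorded because file creation time is not reliably available on every file system.</p>

```python
import os
from datetime import datetime
from pathlib import Path


def summarize_r_scripts(root):
    """Walk `root` recursively and return one record per .R file found."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".R"):
                continue
            path = Path(dirpath) / name
            info = path.stat()
            # Count every line, including blank lines and comments.
            with open(path, "rb") as handle:
                n_lines = sum(1 for _ in handle)
            records.append({
                "path": str(path),
                "size_bytes": info.st_size,
                "modified": datetime.fromtimestamp(info.st_mtime),
                "n_lines": n_lines,
            })
    return records


def totals(records):
    """Aggregate the script count, total size in bytes, and total line count."""
    return {
        "n_scripts": len(records),
        "size_bytes": sum(r["size_bytes"] for r in records),
        "n_lines": sum(r["n_lines"] for r in records),
    }
```

<p>Calling <code>totals(summarize_r_scripts(some_directory))</code> on each main directory would give counts comparable to those quoted above, and the <code>modified</code> field could then be grouped by month, hour, or weekday. Filtering out package files and other users' files would still need to be done separately.</p>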
<p>I found that my R coding output has generally increased over time when tabulated by month (number of scripts: p=6.3e-7, size of files: p=3.5e-9, and number of lines: p=5.0e-9). These metrics of coding – number, size, and lines – also suggest that, on average, I wrote the most code during my PhD (p-value range: 1.8e-7 to 1.7e-4). Interestingly, the changes in output over time were surprisingly consistent across the three phases of my academic career: PhD, postdoc, and faculty (see Figure 1) – you can see the initial dropoff in production during the first one or two months as I transitioned to a postdoc at the Lieber Institute for Brain Development (LIBD) after finishing my PhD. My output rate has dropped slightly as a faculty member as I started working with doctoral students who took over the analyses of some projects (month-by-output interaction p-value: 5.3e-4, 0.002, and 0.03, respectively, for number, size, and lines). The mean coding output – on average, how much code it takes for a single analysis – also increased over time and slightly decreased at LIBD, although to lesser extents (all p-values were between 0.01 and 0.05). I was actually surprised that coding output increased – rather than decreased – over time, as any gains in coding efficiency were probably canceled out by my oftentimes more modular analyses at LIBD.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-03-21/sizeVsMonth_rCode.jpg" alt="Figure 1: Coding output over time. Vertical bars separate my PhD, postdoc, and faculty jobs" /></p>
<p>I also looked at coding output by hour of the day to better characterize my working habits – the output per hour is shown stratified by two eras of about 3 years each (Figure 2). As expected, I never really work much in the morning – very little work gets done before 8AM – and little has changed since I was a second-year PhD student. As a faculty member, I have the highest output between 9AM and 3PM. The trough between 4PM and 7PM likely corresponds to walking the dog we got three years ago, working out, and cooking (and eating) dinner. The output then increases steadily from 8PM to 12AM, when I can work largely uninterrupted by meetings and people dropping by my office, with occasional days (or nights) working until 1AM.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-03-21/sizeVsHour_rCode.jpg" alt="Figure 2: Coding output by hour of day. X-axis starts at 5AM to divide the day into a more temporal order." /></p>
<p>Lastly, I examined R coding output by day of the week. As expected, the lowest output occurred over the weekend, especially on Saturdays. Interestingly, I tended to increase output later in the work week as a faculty member, and also work a little more on Sundays and Mondays, compared to when I was a PhD student.</p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-03-21/sizeVsDay_rCode.jpg" alt="Figure 3: Coding output by day of week." /></p>
<p>Looking at the code itself, of the 432,753 lines, 84,343 were newlines (19.5%), 66,900 were lines that were exclusively comments (15.5%), and an additional 6,994 lines contained comments following R code (1.6%). Some of my most used syntax and symbols, as line counts containing at least one symbol, were pretty much as expected (dropping commas and requiring whitespace between characters):</p>
<table>
<tbody>
<tr>
<td>Code</td>
<td>Count</td>
<td>Code</td>
<td>Count</td>
</tr>
<tr>
<td>=</td>
<td>175604</td>
<td>==</td>
<td>5542</td>
</tr>
<tr>
<td>#</td>
<td>48763</td>
<td><</td>
<td>5039</td>
</tr>
<tr>
<td><-</td>
<td>16492</td>
<td>for(i</td>
<td>5012</td>
</tr>
<tr>
<td>{</td>
<td>11879</td>
<td>&</td>
<td>4803</td>
</tr>
<tr>
<td>}</td>
<td>11612</td>
<td>the</td>
<td>4734</td>
</tr>
<tr>
<td>in</td>
<td>10587</td>
<td>function(x)</td>
<td>4591</td>
</tr>
<tr>
<td>##</td>
<td>8508</td>
<td>###</td>
<td>4105</td>
</tr>
<tr>
<td>~</td>
<td>6948</td>
<td>-</td>
<td>4034</td>
</tr>
<tr>
<td>></td>
<td>5621</td>
<td>%in%</td>
<td>3896</td>
</tr>
</tbody>
</table>
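<p>The line categories and token counts above could be computed along these lines. This is a sketch under assumptions, not the analysis code itself: the function names classify_lines and count_token are hypothetical, a "#" inside a string literal is miscounted as a comment, and tokens are defined as in the text (commas dropped, then split on whitespace).</p>
<pre class="r" style="font-family:monospace;"># Hypothetical helper: tally blank lines, comment-only lines,
# and lines with code followed by a comment
classify_lines = function(lines) {
  blank = grepl("^[[:space:]]*$", lines)
  comment_only = grepl("^[[:space:]]*#", lines)
  # First non-space character is not "#", and a "#" appears later on the line
  trailing_comment = grepl("^[[:space:]]*[^#[:space:]].*#", lines)
  c(total = length(lines),
    blank = sum(blank),
    comment_only = sum(comment_only),
    trailing_comment = sum(trailing_comment))
}

# Hypothetical helper: number of lines containing a token at least once,
# after dropping commas and splitting on whitespace
count_token = function(lines, token) {
  tokens = strsplit(gsub(",", "", lines, fixed = TRUE), "[[:space:]]+")
  sum(vapply(tokens, function(tok) token %in% tok, logical(1)))
}

# Small made-up example
example_lines = c("x = 1", "", "# note", "y = x + 1 # add", "for(i in 1:3) {")
classify_lines(example_lines)
count_token(example_lines, "=")
count_token(example_lines, "for(i")</pre>
<p>In practice the input would be the concatenated output of readLines() over every script, so each count is a number of lines rather than a number of occurrences.</p>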
<p>My code is available on GitHub: https://github.com/andrewejaffe/how-many-lines (after removing file paths and names, as many of the projects are currently unpublished and many files are placed in folders named by collaborator), so feel free to give it a try and see how much R code you’ve written over your career. While there are probably a lot more things to play around with and explore, this was about all the time I could commit to this, given other responsibilities (I’m not on sabbatical like <a href="http://jtleek.com">Jeff Leek</a>…). All in all, this was a pretty fun experience and largely reflected, with data, how my R skills and experience have progressed over the years.</p>
Not So Standard Deviations Episode 11 - Start and Stop
2016-03-14T00:00:00+00:00
http://simplystats.github.io/2016/03/14/nssd-episode-11
<p>We’ve started a Patreon page! Now you can support the podcast directly by going to <a href="http://patreon.com/NSSDeviations">our page</a> and making a pledge. This will help Hilary and me build the podcast, add new features, and get some better equipment.</p>
<p>Episode 11 is an all craft episode of <em>Not So Standard Deviations</em>, where Hilary and Roger discuss starting and ending a data analysis. What do you do at the very beginning of an analysis? Hilary and Roger talk about some of the things that seem to come up all the time. Also up for discussion is the American Statistical Association’s statement on <em>p</em> values, famous statisticians on Twitter, and evil data scientists on TV. Plus two new things for free advertising.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Show notes:</p>
<ul>
<li>
<p><a href="http://patreon.com/NSSDeviations">NSSD Patreon Page</a></p>
</li>
<li>
<p><a href="https://twitter.com/deleeuw_jan">Jan de Leeuw</a></p>
</li>
<li>
<p><a href="https://twitter.com/BatesDmbates">Douglas Bates</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Sports_Night">Sports Night</a></p>
</li>
<li>
<p><a href="http://goo.gl/JFz7ic">ASA’s statement on p values</a></p>
</li>
<li>
<p><a href="http://goo.gl/O8kL60">Basic and Applied Psychology Editorial banning p values</a></p>
</li>
<li>
<p><a href="http://www.seriouseats.com/vegan-experience">J. Kenji López-Alt’s Vegan Experience</a></p>
</li>
<li>
<p><a href="http://fieldworkfail.com/">fieldworkfail</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-11-start-and-stop">Download the audio for this episode</a>.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/251825714&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Not So Standard Deviations Episode 10 - It's All Counterexamples
2016-03-02T00:00:00+00:00
http://simplystats.github.io/2016/03/02/nssd-episode-10
<p>In the latest episode of Not So Standard Deviations Hilary and I talk about the motivation behind the <a href="https://github.com/hilaryparker/explainr">explainr</a> package and the general usefulness of automated reporting and interpretation of statistical tests. Also, Roger struggles to come up with a quick and easy way to explain why statistics is useful when it sometimes doesn’t produce any different results.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p>
<p>Show notes:</p>
<ul>
<li>
<p>The <a href="https://github.com/hilaryparker/explainr">explainr</a> package</p>
</li>
<li>
<p><a href="https://google.github.io/CausalImpact/CausalImpact.html">Google’s CausalImpact package</a></p>
</li>
<li>
<p><a href="http://www.wsj.com/articles/SB10001424053111903480904576512250915629460">Software is Eating the World</a></p>
</li>
<li>
<p><a href="http://allendowney.blogspot.com/2015/12/many-rules-of-statistics-are-wrong.html">Many Rules of Statistics are Wrong</a></p>
</li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-10-its-all-counterexamples">Download the audio for this episode</a>.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/249517993&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Preprints are great, but post publication peer review isn't ready for prime time
2016-02-26T00:00:00+00:00
http://simplystats.github.io/2016/02/26/preprints-and-pppr
<p>The current publication system works something like this:</p>
<h3 id="coupled-review-and-publication">Coupled review and publication</h3>
<ol>
<li>You write a paper</li>
<li>You submit it to a journal</li>
<li>It is peer reviewed privately</li>
<li>The paper is accepted or rejected
<ol type="a">
<li>If rejected go back to step 2 and start over</li>
<li>If accepted it will be published</li>
</ol></li>
<li>If published then people can read it</li>
</ol>
<p>This system has several major disadvantages that bother scientists. It means
all research appears on a lag (whatever the time in peer review is). It can be
a major lag time if the paper is sent to “top tier journals” and rejected then filters
down to “lower tier” journals before ultimate publication. Another disadvantage
is that there are two options for most people to publish their papers: (a) in closed access journals where
it doesn’t cost anything to publish but then the articles are behind paywalls and (b)
in open access journals where anyone can read them but it costs money to publish. Especially
for junior scientists or folks without resources, this creates a difficult choice because
they <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">might not be able to afford open access fees</a>.</p>
<p>For a number of years some fields like physics (with the <a href="http://arxiv.org/">arxiv</a>) and
economics (with <a href="http://www.nber.org/papers.html">NBER</a>) have solved this problem
by decoupling peer review and publication. In these fields the system works like this:</p>
<h3 id="decoupled-review-and-publication">Decoupled review and publication</h3>
<ol>
<li>You write a paper</li>
<li>You post a preprint
<ol type="a">
<li>Everyone can read and comment</li>
</ol></li>
<li>You submit it to a journal</li>
<li>It is peer reviewed privately</li>
<li>The paper is accepted or rejected
<ol type="a">
<li>If rejected go back to step 3 and start over</li>
<li>If accepted it will be published</li>
</ol></li>
</ol>
<p>Lately there has been a growing interest in this same system in molecular and computational biology. I think
this is a really good thing, because it makes it easier to publish papers more quickly and doesn’t cost researchers anything to publish. That is
why the papers my group publishes all show up on <a href="http://biorxiv.org/search/author1%3AJeffrey%2BLeek%2B">biorxiv</a> or <a href="http://arxiv.org/find/stat/1/au:+Leek_J/0/1/0/all/0/1">arxiv</a> first.</p>
<p>While I think this decoupling is great, there seems to be a push for this decoupling and at the same time
a move to post publication peer review.
I used to argue pretty strongly for <a href="http://simplystatistics.org/2012/10/04/should-we-stop-publishing-peer-reviewed-papers/">post-publication peer review</a> but Rafa <a href="http://simplystatistics.org/2012/10/08/why-we-should-continue-publishing-peer-reviewed-papers/">set me
straight</a> and pointed
out that at least with peer review every paper that gets submitted gets evaluated by <em>someone</em> even if the paper
is ultimately rejected.</p>
<p>One of the risks of post publication peer review is that there is no incentive to peer review in the current system. In a paper a
few years ago I actually showed that under an economic model for closed peer review the Nash equilibrium is for <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026895">no one to peer review at all</a>. We showed in that same paper that under
open peer review there is an increase in the amount of time spent reviewing, but the effect was relatively small. Moreover
the dangers of open peer review are clear (junior people reviewing senior people and being punished for it) while the
benefits (potentially being recognized for insightful reviews) are much hazier. Even the most vocal proponents of
post publication peer review <a href="http://www.ncbi.nlm.nih.gov/myncbi/michael.eisen.1/comments/">don’t do it that often</a> when given the chance.</p>
<p>The reason is that everyone in academia already has a lot of things they are asked to do. Many people review papers either out
of a sense of obligation or because they want to be in the good graces of a particular journal. Without this system in place
there is a strong chance that peer review rates will drop and only a few papers will get reviewed. This will ultimately decrease
the accuracy of science. In our (admittedly contrived/simplified) <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.002689">experiment</a> on peer review, accuracy went from 39% to 78% after solutions were reviewed. You might argue that only “important” papers should be peer reviewed but then you are back in the camp of glamour. Say what you want about glamour journals. They are for sure biased by the names of the people submitting the papers there. But it is <em>possible</em> for someone to get a paper in no matter who they are. If we go to a system where there is no curation through a journal-like mechanism then popularity/Twitter followers/etc. will drive readers. I’m not sure that is better than where we are now.</p>
<p>So while I think pre-prints are a great idea I’m still waiting to see a system that beats pre-publication review for maintaining scientific quality (even though it may just be an <a href="http://simplystatistics.org/2015/02/09/the-trouble-with-evaluating-anything/">impossible problem</a>).</p>
Spreadsheets: The Original Analytics Dashboard
2016-02-23T08:42:30+00:00
http://simplystats.github.io/2016/02/23/spreadsheets-the-original-analytics-dashboard
<p>Soon after my discussion with Hilary Parker and Jenny Bryan about spreadsheets on <em><a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a></em>, Brooke Anderson forwarded me <a href="https://backchannel.com/a-spreadsheet-way-of-knowledge-8de60af7146e#.gj4f2bod4">this article</a> written by Steven Levy about the original granddaddy of spreadsheets, <a href="https://en.wikipedia.org/wiki/VisiCalc">VisiCalc</a>. Actually, the real article was written back in 1984 as so-called microcomputers were just getting their start. VisiCalc was originally written for the Apple II computer and notable competitors at the time included <a href="https://en.wikipedia.org/wiki/Lotus_1-2-3">Lotus 1-2-3</a> and Microsoft <a href="https://en.wikipedia.org/wiki/Multiplan">Multiplan</a>, all since defunct.</p>
<p>It’s interesting to see Levy’s perspective on spreadsheets back then and to compare it to the current thinking about data, data science, and reproducibility in science. The problem back then was that “ledger sheets” (what we might now call spreadsheets), which contained numbers and calculations related to businesses, were tedious to make and keep up to date.</p>
<blockquote>
<p>Making spreadsheets, however necessary, was a dull chore best left to accountants, junior analysts, or secretaries. As for sophisticated “modeling” tasks – which, among other things, enable executives to project costs for their companies – these tasks could be done only on big mainframe computers by the data-processing people who worked for the companies Harvard MBAs managed.</p>
</blockquote>
<p>You can see one issue here: Spreadsheets/Ledgers were a “dull chore”, and best left to junior people. However, the “real” computation was done by the people in the “data processing” center on big mainframes. So what exactly does that leave for the business executive to do?</p>
<p>Note that the way of doing things back then was effectively reproducible, because the presentation (ledger sheets printed on paper) and the computation (data processing on mainframes) were separated.</p>
<p>The impact of the microcomputer-based spreadsheet program appears profound.</p>
<blockquote>
<p id="9424" class="graf--p graf-after--p">
Already, the spreadsheet has redefined the nature of some jobs; to be an accountant in the age of spreadsheet program is — well, almost sexy. And the spreadsheet has begun to be a forceful agent of decentralization, breaking down hierarchies in large companies and diminishing the power of data processing.
</p>
<p class="graf--p graf-after--p">
There has been much talk in recent years about an “entrepreneurial renaissance” and a new breed of risk-taker who creates businesses where none previously existed. Entrepreneurs and their venture-capitalist backers are emerging as new culture heroes, settlers of another American frontier. Less well known is that most of these new entrepreneurs depend on their economic spreadsheets as much as movie cowboys depend on their horses.
</p>
</blockquote>
<p class="graf--p graf-after--p">
If you replace "accountant" with "statistician" and "spreadsheet" with "big data", you are magically teleported into 2016.
</p>
<p class="graf--p graf-after--p">
The way I see it, in the early 80's, spreadsheets satisfied the never-ending desire that people have to interact with data. Now, with things like tablets and touch-screen phones, you can literally "touch" your data. But it took microcomputers to get to a certain point before interactive data analysis could really be done in a way that we recognize today. Spreadsheets tightened the loop between question and answer by cutting out the Data Processing department and replacing it with an Apple II (or an IBM PC, if you must) right on your desk.
</p>
<p class="graf--p graf-after--p">
Of course, the combining of presentation with computation comes at a cost of reproducibility and perhaps quality control. Seeing the description of how spreadsheets were originally used, it seems totally natural to me. It is not unlike today's analytic dashboards that give you a window into your business and allow you to "model" various scenarios by tweaking a few numbers or formulas. Over time, people took spreadsheets to all sorts of extremes, using them for purposes for which they were not originally designed, and problems naturally arose.
</p>
<p class="graf--p graf-after--p">
So now, we are trying to separate out the computation and presentation bits a little. Tools like knitr and R and shiny allow us to do this and to bring them together with a proper toolchain. The loss in interactivity is only slight because of the power of the toolchain and the speed of computers nowadays. Essentially, we've brought back the Data Processing department, but have staffed it with robots and high speed multi-core computers.
</p>
Non-tidy data
2016-02-17T15:47:23+00:00
http://simplystats.github.io/2016/02/17/non-tidy-data
<p>During the discussion that followed the ggplot2 posts from David and me last week, we started talking about tidy data. Hadley Wickham himself noted that matrices are often useful instead of <a href="http://vita.had.co.nz/papers/tidy-data.pdf">“tidy data”</a>, and I mentioned there might be other data that are usefully “non tidy”. Here I will be using tidy/non-tidy according to Hadley’s definition. So tidy data have:</p>
<ul>
<li>One variable per column</li>
<li>One observation per row</li>
<li>Each type of observational unit forms a table</li>
</ul>
<p>I push this approach in my <a href="https://github.com/jtleek/datasharing">guide to data sharing</a> and in a lot of my personal work. But note that non-tidy data can definitely be already processed, cleaned, organized and ready to use.</p>
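<p>To make the distinction concrete, here is a small sketch using base R only. The gene and sample names and the values are made up for illustration. The same measurements are held once as a matrix (genes in rows, samples in columns) and once as a tidy table with one row per gene/sample observation.</p>
<pre class="r" style="font-family:monospace;"># Hypothetical toy data: 2 genes measured in 3 samples, stored as a matrix
expr_mat = matrix(c(1.2, 0.4, 2.1, 0.9, 1.7, 0.3), nrow = 2,
                  dimnames = list(gene = c("geneA", "geneB"),
                                  sample = c("s1", "s2", "s3")))

# Tidy version: one row per (gene, sample) pair, one column per variable
tidy_expr = as.data.frame(as.table(expr_mat), responseName = "expression",
                          stringsAsFactors = FALSE)

# And back again: rebuild the gene-by-sample matrix from the tidy table
back = with(tidy_expr, tapply(expression, list(gene, sample), sum))

dim(expr_mat)   # 2 genes by 3 samples
dim(tidy_expr)  # 6 observations by 3 variables
all(back == expr_mat)</pre>
<p>Both objects are clean and ready to use; they differ only in which operations they make convenient.</p>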
<blockquote class="twitter-tweet" data-width="550">
<p lang="en" dir="ltr">
<a href="https://twitter.com/hadleywickham">@hadleywickham</a> <a href="https://twitter.com/drob">@drob</a> <a href="https://twitter.com/mark_scheuerell">@mark_scheuerell</a> I'm saying that not all data are usefully tidy (and not just matrices) so I care more abt flexibility
</p>
<p>
— Jeff Leek (@jtleek) <a href="https://twitter.com/jtleek/status/698247927706357760">February 12, 2016</a>
</p>
</blockquote>
<p>This led to a very specific blog request:</p>
<blockquote class="twitter-tweet" data-width="550">
<p lang="en" dir="ltr">
<a href="https://twitter.com/jtleek">@jtleek</a> <a href="https://twitter.com/drob">@drob</a> I want a blog post on non-tidy data!
</p>
<p>
— Hadley Wickham (@hadleywickham) <a href="https://twitter.com/hadleywickham/status/698251883685646336">February 12, 2016</a>
</p>
</blockquote>
<p>So I thought I’d talk about a couple of reasons why data are usefully non-tidy. The basic reason is that I usually take a <a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">problem forward, not solution backward</a> approach to my scientific research. In other words, the goal is to solve a particular problem and the format I choose is the one that makes it most direct/easy to solve that problem, rather than one that is theoretically optimal. To illustrate these points I’ll use an example from my area.</p>
<p><strong>Example data</strong></p>
<p>Often you want data in a matrix format. One good example is gene expression data or data from another high-dimensional experiment. David talks about one such example in <a href="http://varianceexplained.org/r/tidy-genomics/">his post here</a>. He makes the (valid) point that for students who aren’t going to do genomics professionally, it may be more useful to learn an abstract tool such as tidy data/dplyr. But for those working in genomics, this can make you do unnecessary work in the name of theory/abstraction.</p>
<p>He analyzes the data in that post by first tidying the data.</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">library(dplyr)
library(tidyr)
library(stringr)
library(readr)
library(broom)
# original_data is assumed to be already loaded (for example with readr)
cleaned_data = original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
mutate_each(funs(trimws), name:systematic_name) %>%
select(-number, -GID, -YORF, -GWEIGHT) %>%
gather(sample, expression, G0.05:U0.3) %>%
separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE)</pre>
</td>
</tr>
</table>
</div>
<p>It isn’t 100% tidy as data of different types are in the same data frame (gene expression and metadata/phenotype data belong in different tables). But it’s close enough for our purposes. Now suppose that you wanted to fit a model and test for association between the “rate” variable and gene expression for each gene. You can do this with David’s tidy data set, dplyr, and the broom package like so:</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">rate_coeffs = cleaned_data %>% group_by(name) %>%
do(fit = lm(expression ~ rate + nutrient, data = .)) %>%
tidy(fit) %>%
dplyr::filter(term=="rate")</pre>
</td>
</tr>
</table>
</div>
<p>On my computer we get something like:</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">system.time( cleaned_data %>% group_by(name) %>%
+ do(fit = lm(expression ~ rate + nutrient, data = .)) %>%
+ tidy(fit) %>%
+ dplyr::filter(term=="rate"))
|==========================================================|100% ~0 s remaining
user system elapsed
12.431 0.258 12.364</pre>
</td>
</tr>
</table>
</div>
<p>Let’s now try that analysis a little bit differently. As a first step, let’s store the data in two separate tables: a table of “phenotype information” and a matrix of “expression levels”. This is the more common format used for this type of data. Here is the code to do that:</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
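<pre class="r" style="font-family:monospace;"># Keep only the sample columns (their names contain a digit, e.g. G0.05)
expr = original_data %>%
select(grep("[0-9]",names(original_data)))
# Use the systematic gene name as the row name of the expression data
rownames(expr) = original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
select(systematic_name) %>% mutate_each(funs(trimws),systematic_name) %>% as.matrix()
# Phenotype table: split each sample name into nutrient (first character) and rate
vals = data.frame(vals=names(expr))
pdata = separate(vals,vals,c("nutrient", "rate"), sep = 1, convert = TRUE)
# Matrix of expression levels (genes in rows, samples in columns)
expr = as.matrix(expr)</pre>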
</td>
</tr>
</table>
</div>
<p>If we leave the data in this format we can get the model fits and the p-values using some simple linear algebra:</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">expr = as.matrix(expr)
mod = model.matrix(~ rate + as.factor(nutrient),data=pdata)
rate_betas = expr %*% mod %*% solve(t(mod) %*% mod)</pre>
</td>
</tr>
</table>
</div>
<p>This gives the same answer after re-ordering (here ind is the index that matches the genes between the two sets of results):</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">all(abs(rate_betas[,2]- rate_coeffs$estimate[ind]) < 1e-5,na.rm=T)
[1] TRUE</pre>
</td>
</tr>
</table>
</div>
<p>But this approach is much faster.</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;"> system.time(expr %*% mod %*% solve(t(mod) %*% mod))
user system elapsed
0.015 0.000 0.015</pre>
</td>
</tr>
</table>
</div>
<p>This requires some knowledge of linear algebra and isn’t pretty. But it brings us to the first general point: <strong>you might not use tidy data because some computations are more efficient if the data is in a different format. </strong></p>
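<p>To see why the matrix product used above reproduces the gene-by-gene regressions, recall that for one gene with expression vector y and design matrix X, the least squares coefficients are (X'X)^{-1} X'y. Stacking the genes as the rows of a matrix Y gives all coefficients at once as Y X (X'X)^{-1}, with one row per gene. The sketch below checks this on simulated data; the object names ending in _sim and the simulated values are made up for illustration and are not the yeast data from the post.</p>
<pre class="r" style="font-family:monospace;">set.seed(1)
n_genes = 5
n_samples = 12

# Hypothetical phenotype table with the same two covariates as in the post
pdata_sim = data.frame(rate = rep(c(0.05, 0.1, 0.2), 4),
                       nutrient = rep(c("G", "N"), each = 6))

# Simulated expression matrix: genes in rows, samples in columns
expr_sim = matrix(rnorm(n_genes * n_samples), nrow = n_genes)

# Design matrix X with columns: intercept, rate, nutrient indicator
mod_sim = model.matrix(~ rate + as.factor(nutrient), data = pdata_sim)

# All regressions in one shot: B = Y X (X'X)^{-1}, one row of coefficients per gene
betas_sim = expr_sim %*% mod_sim %*% solve(t(mod_sim) %*% mod_sim)

# Compare the first gene against an ordinary lm() fit
fit1 = lm(expr_sim[1, ] ~ rate + as.factor(nutrient), data = pdata_sim)
all.equal(unname(betas_sim[1, ]), unname(coef(fit1)))</pre>
<p>The speed advantage comes from replacing thousands of separate model fits with a handful of matrix multiplications on a single numeric matrix.</p>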
<p>Many examples from graphical models, to genomics, to neuroimaging, to social sciences rely on some kind of linear algebra based computations (matrix multiplication, singular value decompositions, eigen decompositions, etc.) which are all optimized to work on matrices, not tidy data frames. There are ways to improve performance with tidy data for sure, but they would require an equal amount of custom code to take advantage of say C, or vectorization properties in R.</p>
<p>Ok now the linear regressions here are all treated independently, but it is very well known that you get much better performance in terms of the false positive/true positive tradeoff if you use an empirical Bayes approach for this calculation where <a href="https://bioconductor.org/packages/release/bioc/html/limma.html">you pool variances</a>.</p>
<p>If the data are in this matrix format you can do it with R like so:</p>
<div class="wp_syntax">
<table>
<tr>
<td class="code">
<pre class="r" style="font-family:monospace;">library(limma)
fit_limma = lmFit(expr,mod)
ebayes_limma = eBayes(fit_limma)
topTable(ebayes_limma)</pre>
</td>
</tr>
</table>
</div>
<p>This approach is again very fast, optimized for the calculations being performed and performs much better than the one-by-one regression approach. But it requires the data in matrix or expression set format. Which brings us to the second general point: <strong>you might not use tidy data because many functions require a different, also very clean and useful data format, and you don’t want to have to constantly be switching back and forth.</strong> Again, this requires you to be more specific to your application, but the potential payoffs can be really big as in the case of limma.</p>
<p>I’m showing an example here with expression sets and matrices, but in NLP the data are often input in the form of lists, in graphical analyses as matrices, in genomic analyses as GRanges lists, etc. etc. etc. One option would be to rewrite all infrastructure in your area of interest to accept tidy data formats but that would be going against conventions of a community and would ultimately cost you a lot of work when most of that work has already been done for you.</p>
<p>The final point, which I won’t discuss here, is that data are often usefully represented in a non-tidy way. Examples include the aforementioned <a href="http://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_GRanges.html">GRanges list</a> which consists of (potentially) ragged arrays of intervals and quantitative measurements about them. You could <strong>force</strong> these data to be tidy by the definition above, but again most of the infrastructure is built around a different format that is much more intuitive for that type of data. Similarly data from other applications may be more suited to application specific formats.</p>
<p>In summary, tidy data is a useful conceptual idea and is often the right way to go for general, small data sets, but may not be appropriate for all problems. Here are some examples of data formats (biased toward my area, but there are others) that have been widely adopted, have a ton of useful software, but don’t meet the tidy data definition above. I will define these as “processed data” as opposed to “tidy data”.</p>
<ul>
<li><a href="http://bioconductor.org/packages/3.3/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf">Expression sets</a> for expression data</li>
<li><a href="http://kasperdanielhansen.github.io/genbioconductor/html/SummarizedExperiment.html">Summarized experiments</a> for a variety of genomic experiments</li>
<li><a href="http://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_GRanges.html">GRanges lists</a> for genomic intervals</li>
<li><a href="https://cran.r-project.org/web/packages/tm/tm.pdf">Corpus</a> objects for corpora of texts.</li>
<li><a href="http://igraph.org/r/doc/">igraph objects</a> for graphs</li>
</ul>
<p>I’m sure there are a ton more I’m missing and would be happy to get some suggestions on Twitter too.</p>
When it comes to science - it's the economy, stupid.
2016-02-16T14:57:14+00:00
http://simplystats.github.io/2016/02/16/when-it-comes-to-science-its-the-economy-stupid
<p>I read a lot of articles about what is going wrong with science:</p>
<ul>
<li><a href="http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble">The reproducibility/replicability crisis</a></li>
<li><a href="http://www.theatlantic.com/business/archive/2013/02/the-phd-bust-americas-awful-market-for-young-scientists-in-7-charts/273339/">Lack of jobs for PhDs</a></li>
<li><a href="https://theresearchwhisperer.wordpress.com/2013/11/19/academic-scattering/">The pressure on the families (or potential families) of scientists</a></li>
<li><a href="http://quillette.com/2016/02/15/the-unbearable-asymmetry-of-bullshit/?utm_content=buffer235f2&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer">Hype around specific papers and a more general abundance of BS</a></li>
<li><a href="http://www.michaeleisen.org/blog/?p=1179">Consortia and their potential evils</a></li>
<li><a href="http://www.vox.com/2015/12/7/9865086/peer-review-science-problems">Peer review not working well</a></li>
<li><a href="http://www.nejm.org/doi/full/10.1056/NEJMe1516564">Research parasites</a></li>
<li><a href="http://gmwatch.org/news/latest-news/16691-public-science-is-broken-says-professor-who-helped-expose-water-pollution-crisis">Not enough room for applications/public good</a></li>
<li><a href="http://www.statnews.com/2016/02/10/press-releases-stink/?s_campaign=stat:rss">Press releases that do evil</a></li>
<li><a href="https://twitter.com/Richvn/status/697725899404349440">Scientists don’t release enough data</a></li>
</ul>
<p>These articles always point to the “incentives” in science and how they don’t align with how we’d like scientists to work. These discussions often frustrate me because they almost always boil down to asking scientists (especially and often junior scientists) to make some kind of change for public good without any guarantee that they are going to be ok. I’ve seen an acceleration/accumulation of people who are focusing on these issues, I think largely because it is now possible to make a very nice career by pointing out how other people are doing science wrong.</p>
<p>The issue I have is that the people who propose unilateral moves seem to care less that science is both (a) a calling and (b) a career for most people. I do science because I love it. I do science because I want to discover new things about the world. It is a direct extension of the wonder and excitement I had about the world when I was a little kid. But science is also a career for me. It matters if I get my next grant, if I get my next paper. Why? Because I want to be able to support myself and my family.</p>
<p>The issue with incentives is that talking about them costs nothing, but actually changing them is expensive. Right now our system, broadly defined, rewards (a) productivity - lots of papers, (b) cleverness - coming up with an idea first, and (c) measures of prestige - journal titles, job titles, etc. This is because there are tons of people going for a relatively small amount of grant money. More importantly, the allocation of that money is decided by processes that are both peer reviewed and political.</p>
<p>Suppose that you wanted to change those incentives to something else. Here is a small list of things I would like:</p>
<ul>
<li>People can have stable careers and live in a variety of places without massive two body problems</li>
<li>Scientists shouldn’t have to move every couple of years 2-3 times right at the beginning of their career</li>
<li>We should distribute our money among the <a href="http://simplystatistics.org/2015/12/01/thinking-like-a-statistician-fund-more-investigator-initiated-grants/">largest number of scientists possible </a></li>
<li>Incentivizing long term thinking</li>
<li>Incentivizing objective peer review</li>
<li>Incentivizing openness and sharing</li>
</ul>
<div>
The key problem isn't publishing, or code, or reproducibility, or even data analysis.
</div>
<div>
</div>
<div>
<b>The key problem is that the fundamental model by which we fund science is completely broken. </b>
</div>
<div>
</div>
<div>
The model now is that you have to come up with an <span class="lG">idea</span> every couple of years then "sell" it to funders, your peers, etc. This is the source of the following problems:
</div>
<div>
</div>
<ul>
<li>An incentive to publish only positive results so your <span class="lG">ideas</span> look good</li>
<li>An incentive to be closed so people don’t discover flaws in your analysis</li>
<li> An incentive to publish in specific “<span class="lG">big</span> name” journals that skews the results (again mostly in the positive direction)</li>
<li> Pressure to publish quickly which leads to cutting corners</li>
<li>Pressure to stay in a single area and make incremental changes so you know things will work.</li>
</ul>
<div>
If we really want to have any measurable impact on science we need to solve the funding model. The solution is actually pretty simple. We need to give out 20+ year grants to people who meet minimum qualifications. These grants would cover their own salary plus one or two people and the minimum necessary equipment.
</div>
<div>
</div>
<div>
The criteria for getting or renewing these grants should not be things like Nature papers or number of citations. They have to be designed to incentivize the things that we want (mine are listed above). So if I were going to define the criteria, people meeting the standards would have to be:
</div>
<div>
</div>
<ul>
<li>Working on a scientific problem and trained as a scientist</li>
<li>Publishing all results immediately online as preprints/free code</li>
<li>Responding to queries about their data/code</li>
<li>Agreeing to peer review a number of papers per year</li>
</ul>
<p>More importantly these grants should be given out for a very long term (20+ years) and not be tied to a specific institution. This would allow people to have flexible careers and to target bigger picture problems. We saw the benefits of people working on problems they weren’t originally funded to work on with <a href="http://www.wired.com/2016/02/zika-research-utmb/">research on the Zika virus.</a></p>
<p>These grants need to be awarded using a rigorous peer review system just like the NIH, HHMI, and other organizations use to ensure we are identifying scientists with potential early in their careers and letting them flourish. But they’d be given out in a different manner. I’m very confident in the ability of peer review to detect the difference between pseudo-science and real science, or complete hype and realistic improvement. But I’m much less confident in the ability of peer review to accurately distinguish “important” from “not important” research. So I think we should <a href="http://www.wsj.com/articles/SB10001424052702303532704579477530153771424">consider seriously the lottery</a> for these grants.</p>
<p>Each year all eligible scientists who meet some minimum entry requirements submit proposals for what they’d like to do scientifically. Each year those proposals are reviewed to make sure they meet the very minimum bar (are they scientific? do they have relevant training at all?). Among the (very large) class of people who pass that bar we hold a lottery. We take the number of research dollars and divide it up to give the maximum number of these grants possible. These grants might be pretty small - just enough to fund the person’s salary and maybe one or two students/postdocs. To make this work for labs that require equipment there would have to be cooperative arrangements between multiple independent individuals to fund/sustain the equipment they need. Renewal of these grants would happen as long as you were posting your code/data online, meeting peer review requirements, and responding to inquiries about your work.</p>
<p>One thing we’d do to fund this model is eliminate/reduce large-scale projects and super well funded labs. Instead of having 30 postdocs in a well funded lab, you’d have some fraction of those people funded as independent investigators right from the get-go. If we wanted to run a massive large-scale program, that would come out of a very specific pot of money that would have to be saved up and spent, completely outside of the pot of money for investigator-initiated grants. That would reduce the hierarchy in the system, reduce pressure that leads to bad incentives, and give us the best chance to fund creative, long-term-thinking science.</p>
<p>Regardless of whether you like my proposal or not, I hope that people will start focusing on how to change the incentives, even when that means doing something big or potentially costly.</p>
<p> </p>
<p> </p>
Not So Standard Deviations Episode 9 - Spreadsheet Drama
2016-02-12T11:24:04+00:00
http://simplystats.github.io/2016/02/12/not-so-standard-deviations-episode-9-spreadsheet-drama
<p>For this episode, special guest Jenny Bryan (@jennybryan) joins us from the University of British Columbia! Jenny, Hilary, and I talk about spreadsheets and why some people love them and some people despise them. We also discuss blogging as part of scientific discourse.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Show notes:</p>
<ul>
<li><a href="http://stat545-ubc.github.io/">Jenny’s Stat 545</a></li>
<li><a href="http://goo.gl/VvFyXz">Coding is not the new literacy</a></li>
<li><a href="https://goo.gl/mC0Qz9">Goldman Sachs spreadsheet error</a></li>
<li><a href="https://goo.gl/hNloVr">Jingmai O’Connor episode</a></li>
<li><a href="http://goo.gl/IYDwn1">De-weaponizing reproducibility</a></li>
<li><a href="https://goo.gl/n02EGP">Vintage Space</a></li>
<li><a href="https://goo.gl/H3YgV6">Tabby Cats</a></li>
</ul>
<p><a href="https://soundcloud.com/nssd-podcast/episode-9-spreadsheet-drama">Download the audio for this episode</a>.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/246296744&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Why I don't use ggplot2
2016-02-11T13:25:38+00:00
http://simplystats.github.io/2016/02/11/why-i-dont-use-ggplot2
<p>Some of my colleagues think of me as super data-sciencey compared to other academic statisticians. But one place I lose tons of street cred in the data science community is when I talk about ggplot2. For the 3 data type people on the planet who still don’t know what that is, <a href="https://cran.r-project.org/web/packages/ggplot2/index.html">ggplot2</a> is an R package/phenomenon for data visualization. It was created by Hadley Wickham, who is (in my opinion) perhaps the most important statistician/data scientist on the planet. It is one of the best maintained, most important, and really well done R packages. Hadley also supports R software like few other people on the planet.</p>
<p>But I don’t use ggplot2 and I get nervous when other people do.</p>
<p>I get no end of grief for this from <a href="https://soundcloud.com/nssd-podcast/episode-9-spreadsheet-drama">Hilary and Roger</a> and especially from <a href="https://twitter.com/drob/status/625682366913228800">drob</a>, among many others. So I thought I would explain why and defend myself from the internet hordes. To understand why I don’t use it, you have to understand the three cases where I use data visualization.</p>
<ol>
<li>When creating exploratory graphics - graphs that are fast, not to be shown to anyone else and help me to explore a data set</li>
<li>When creating expository graphs - graphs that I want to put into a publication that have to be very carefully made.</li>
<li>When grading student data analyses.</li>
</ol>
<p>Let’s consider each case.</p>
<p><strong>Exploratory graphs</strong></p>
<p>Exploratory graphs don’t have to be pretty. I’m going to be the only one who looks at 99% of them. But I have to be able to make them <em>quickly</em> and I have to be able to make a <em>broad range of plots</em> <em>with minimal code</em>. There are a large number of types of graphs, including things like heatmaps, that don’t neatly fit into ggplot2 code and are therefore challenging to make. The flexibility of base R comes at a price, but it means you can make all sorts of things you need to without struggling against the system, which is a huge advantage for data analysts. There are some graphs (<a href="http://rafalab.dfci.harvard.edu/images/frontb300.png">like this one</a>) that are pretty straightforward in base, but require quite a bit of work in ggplot2. In many cases qplot can be used sort of interchangeably with plot, but then you really don’t get any of the advantages of the ggplot2 framework.</p>
<p><strong>Expository graphs</strong></p>
<p>When making graphs that are production ready or fit for publication, you can do this with any system. You can do it with ggplot2, with lattice, with base R graphics. But regardless of which system you use it will require about an equal amount of code to make a graph ready for publication. One perfect example of this is the <a href="http://motioninsocial.com/tufte/">comparison of different plotting systems</a> for creating Tufte-like graphs. To create this minimal barchart:</p>
<p><img class="aligncenter" src="" alt="" width="373" height="280" /></p>
<p> </p>
<p>The code they use in base graphics is this (super blurry sorry, you can also <a href="http://motioninsocial.com/tufte/">go to the website</a> for a better view).</p>
<p><img class="aligncenter wp-image-4646" src="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-300x82.png" alt="Screen Shot 2016-02-11 at 12.56.53 PM" width="483" height="132" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-300x82.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-768x209.png 768w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-1024x279.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-260x71.png 260w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM.png 1248w" sizes="(max-width: 483px) 100vw, 483px" /></p>
<p>in ggplot2 the code is:</p>
<p><img class="aligncenter wp-image-4647" src="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-300x73.png" alt="Screen Shot 2016-02-11 at 12.56.39 PM" width="526" height="128" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-300x73.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-768x187.png 768w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-1024x249.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-260x63.png 260w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM.png 1334w" sizes="(max-width: 526px) 100vw, 526px" /></p>
<p> </p>
<p>Both require a significant amount of coding. The ggplot2 plot also takes advantage of the ggthemes package here, which means that for a specific plot without such a package, it would require even more coding.</p>
<p>The bottom line is for production graphics, any system requires work. So why do I still use base R like an old person? Because I learned all the stupid little tricks for that system, it was a huge pain, and it would be a huge pain to learn it again for ggplot2, to make very similar types of plots. This is one where neither system is particularly better, but the time-optimal solution is to stick with whichever system you learned first.</p>
<p><strong>Grading student work</strong></p>
<p>People I seriously respect suggest teaching ggplot2 before base graphics as a way to get people up and going quickly making pretty visualizations. This is a good solution to the <a href="http://simplystatistics.org/2014/08/13/swirl-and-the-little-data-scientists-predicament/">little data scientist’s predicament</a>. The tricky thing is that the defaults in ggplot2 are just pretty enough that they might trick you into thinking the graph is production ready using defaults. Say for example you make a plot of the latitude and longitude of <a href="https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/quakes.html">quakes</a> data in R, colored by the number of stations reporting. This is one case where ggplot2 crushes base R for simplicity because of the automated generation of a color scale. You can make this plot with just the line:</p>
<p><code>ggplot() + geom_point(data=quakes,aes(x=lat,y=long,colour=stations))</code></p>
<p>And get this out:</p>
<p><img class="aligncenter wp-image-4649" src="http://simplystatistics.org/wp-content/uploads/2016/02/quakes-300x264.png" alt="quakes" width="420" height="370" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/quakes-300x264.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/quakes-227x200.png 227w, http://simplystatistics.org/wp-content/uploads/2016/02/quakes.png 627w" sizes="(max-width: 420px) 100vw, 420px" /></p>
<p>That is a pretty amazing plot in one line of code! What often happens with students in a first serious data analysis class is they think that plot is done. But it isn’t even close. Here are a few things you would need to do to make this plot production ready: (1) make the axes bigger, (2) make the labels bigger, (3) make the labels be full names (latitude and longitude, ideally with units when variables need them), (4) make the legend title be number of stations reporting. Those are the bare minimum. But a very common move by a person who knows a little R/data analysis would be to leave that graph as it is and submit it directly. I know this from lots of experience.</p>
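<p>As a rough sketch of what those four fixes could look like, the snippet below extends the one-liner with <code>labs()</code> and <code>theme()</code>. The specific font sizes and the "degrees" units are assumptions I picked for illustration, not recommended values.</p>

```r
library(ggplot2)

# Same data and aesthetics as the one-liner above, with the four fixes applied.
p <- ggplot(quakes, aes(x = lat, y = long, colour = stations)) +
  geom_point() +
  labs(
    x = "Latitude (degrees)",                 # (3) full names with units
    y = "Longitude (degrees)",
    colour = "Number of stations reporting"   # (4) readable legend title
  ) +
  theme(
    axis.text = element_text(size = 14),      # (1) bigger axis tick labels
    axis.title = element_text(size = 16),     # (2) bigger axis titles
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12)
  )

print(p)
```

<p>Even this is only the bare minimum; it is already several times longer than the line that produced the default plot.</p>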
<p>The one nice thing about teaching base R here is that the base version for this plot is either (a) a ton of work or (b) ugly. In either case, it makes the student think very hard about what they need to do to make the plot better, rather than just assuming it is ok.</p>
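<p>To give a sense of the "ton of work" option, here is a minimal base R sketch of the same plot. The colour scale has to be built by hand; the two hex colours, the 100-step ramp, and the text sizes are assumptions chosen for illustration, and the legend is left out entirely because it would take still more code.</p>

```r
# Build a 100-step dark-to-light blue ramp by hand.
pal <- colorRampPalette(c("#132B43", "#56B1F7"))(100)

# Assign each quake to one of 100 equal-width bins of the station count.
bins <- cut(quakes$stations, breaks = 100, labels = FALSE)

plot(
  quakes$lat, quakes$long,
  col = pal[bins], pch = 19,
  xlab = "Latitude (degrees)",
  ylab = "Longitude (degrees)",
  cex.axis = 1.3,   # bigger axis tick labels
  cex.lab = 1.5     # bigger axis titles
)
```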
<p><strong>Where ggplot2 is better for sure</strong></p>
<p>ggplot2 being compatible with piping, having a simple system for theming, having a good animation package, and in general being an excellent platform for developers who create <a href="https://ggplot2-exts.github.io/index.html">extensions</a> are all huge advantages. It is also great for getting absolute newbies up and making medium-quality graphics in a huge hurry. This is a great way to get more people engaged in data science and I’m psyched about the reach and power ggplot2 has had. Still, I probably won’t use it for my own work, even though it disappoints my data scientist friends.</p>
Data handcuffs
2016-02-10T15:38:37+00:00
http://simplystats.github.io/2016/02/10/data-handcuffs
<p>A few years ago, if you had asked me which skills I got asked about most for students going into industry, I’d definitely have said things like data cleaning, data transformation, database pulls, and other non-traditional statistical tasks. But as companies have progressed from the point of storing data to actually wanting to do something with it, I would say one of the hottest skills is understanding and dealing with data from randomized trials.</p>
<p>In particular I see data scientists talking more about <a href="https://medium.com/@InVisionApp/a-b-and-see-a-beginner-s-guide-to-a-b-testing-a16406f1a239#.p7hoxirwo">A/B testing</a>, <a href="http://varianceexplained.org/r/bayesian-ab-testing/">sequential stopping rules</a>, <a href="https://twitter.com/hspter/status/696820603945414656">hazard regression</a> and other ideas that are really common in Biostatistics, which has traditionally focused on the analysis of data from designed experiments in biology.</p>
<p>I think it is great that companies are choosing to do experiments, as this <a href="http://simplystatistics.org/2013/07/15/yes-clinical-trials-work/">still remains</a> the gold standard for how to generate knowledge about causal effects. One interesting new development though is the extreme lengths it appears some organizations are going to to be “data-driven”. They make all decisions based on data they have collected or experiments they have performed.</p>
<p>But data mostly tell you about small scale effects and things that happened in the past. To be able to make big discoveries/improvements requires (a) having creative ideas that are not data supported and (b) trying them in experiments to see if they work. If you get too caught up in experimenting on the same set of conditions you will inevitably asymptote to a maximum and quickly reach diminishing returns. This is where the data handcuffs come in. Data can only tell you about the conditions that existed in the past; they often can’t predict conditions in the future or which ideas will work out.</p>
<p>In an interesting parallel to academic research, a good strategy appears to be: (a) trying a bunch of things, including some things that have only a pretty modest chance of success, (b) doing experiments early and often when trying those things, and (c) getting very good at recognizing failure quickly and moving on to ideas that will be fruitful. The challenge in part (a) is that it is often difficult to generate really new ideas, especially if you are already doing something that has had any level of success. There will be extreme pressure not to change what you are doing. In part (c) the challenge is that if you discard ideas too quickly you might miss a big opportunity, but if you don’t discard them quickly enough you will sink a lot of time/cost into ultimately not very fruitful projects.</p>
<p>Regardless, almost all of the most <a href="http://simplystatistics.org/2013/09/25/is-most-science-false-the-titans-weigh-in/">interesting projects</a> I’ve worked on in my life were not driven by data that suggested they would be successful. They were often risks where the data either wasn’t in, or the data supported not doing them at all. But as a statistician I decided to straight up ignore the data and try anyway. Then again, these ideas have also been the sources of <a href="http://simplystatistics.org/2012/01/11/healthnewsrater/">my biggest flameouts</a>.</p>
Leek group guide to reading scientific papers
2016-02-09T13:59:53+00:00
http://simplystats.github.io/2016/02/09/leek-group-guide-to-reading-scientific-papers
<p>The other day on Twitter Amelia requested a guide for reading papers</p>
<blockquote class="twitter-tweet" data-width="550">
<p lang="en" dir="ltr">
I love <a href="https://twitter.com/jtleek">@jtleek</a>’s github guides to reviewing papers, writing R packages, giving talks, etc. Would love one on reading papers, for students.
</p>
<p>
— Amelia McNamara (@AmeliaMN) <a href="https://twitter.com/AmeliaMN/status/695633602751635456">February 5, 2016</a>
</p>
</blockquote>
<p> </p>
<p>So I came up with a guide which you can find here: <a href="https://github.com/jtleek/readingpapers">Leek group guide to reading papers</a>. I actually found this to be one that I had the hardest time with. I described how I tend to read a paper but I’m not sure that is really the optimal (or even a very good) way. I’d really appreciate pull requests if you have ideas on how to improve the guide.</p>
A menagerie of messed up data analyses and how to avoid them
2016-02-01T13:39:57+00:00
http://simplystats.github.io/2016/02/01/a-menagerie-of-messed-up-data-analyses-and-how-to-avoid-them
<p><em>Update: I realize this may seem like I’m picking on people. I really don’t mean to, I have for sure made all of these mistakes and many more. I can give many examples, but the one I always remember is the time Rafa saved me from “I got a big one here” when I made a huge mistake as a first year assistant professor.</em></p>
<p>In any introductory statistics or data analysis class they might teach you the basics, how to load a data set, how to munge it, how to do t-tests, maybe how to write a report. But there are a whole bunch of ways that a data analysis can be screwed up that often get skipped over. Here is my first crack at creating a “menagerie” of messed up data analyses and how you can avoid them. Depending on interest I could probably list a ton more, but as always I’m doing the non-comprehensive list :).</p>
<p> </p>
<p> </p>
<p><span style="text-decoration: underline;"><strong><img class="alignleft wp-image-4613" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png" alt="direction411" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction411-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png 256w" sizes="(max-width: 125px) 100vw, 125px" />Outcome switching</strong></span></p>
<p><em>What it is: </em>Outcome switching is where you collect data looking at say, the relationship between exercise and blood pressure. Once you have the data, you realize that blood pressure isn’t really related to exercise. So you change the outcome and ask if HDL levels are related to exercise and you find a relationship. It turns out that when you do this kind of switch you have now biased your analysis because you would have just stopped if you found the original relationship.</p>
<p style="text-align: left;">
<em>An example: </em><a href="http://www.vox.com/2015/12/29/10654056/ben-goldacre-compare-trials">In this article</a> they discuss how Paxil, an anti-depressant, was originally studied for several main outcomes, none of which showed an effect - but some of the secondary outcomes did. So they switched the outcome of the trial and used this result to market the drug.
</p>
<p style="text-align: left;">
<em>What you can do: </em>Pre-specify your analysis plan, including which outcomes you want to look at. Then very clearly state when you are analyzing a primary outcome or a secondary analysis. That way people know to take the secondary analyses with a grain of salt. You can even get paid $$ to pre-specify with the OSF's <a href="https://cos.io/prereg/">pre-registration challenge</a>.
</p>
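<p>To see how much outcome switching can matter, here is a small simulation sketch. Everything in it is hypothetical: 2000 simulated studies, 50 participants each, and 5 candidate outcomes that are all unrelated to exercise by construction. With independent outcomes tested at the 0.05 level, the chance that at least one looks "significant" is 1 - 0.95^5, or roughly 0.23, and the simulated rate should land near that.</p>

```r
set.seed(1)
n_sims <- 2000      # simulated studies (hypothetical)
n <- 50             # participants per study (hypothetical)
n_outcomes <- 5     # e.g. blood pressure, HDL, and three more (hypothetical)

# One column per study, one row per outcome. Exercise is unrelated to
# every outcome, so any "significant" result is a false positive.
p_values <- replicate(n_sims, {
  exercise <- rnorm(n)
  sapply(seq_len(n_outcomes), function(j) {
    outcome <- rnorm(n)
    cor.test(exercise, outcome)$p.value
  })
})

# Pre-specified analysis: always report the first outcome.
prespecified_rate <- mean(p_values[1, ] < 0.05)

# Outcome switching: report whichever outcome happened to "work".
switching_rate <- mean(apply(p_values, 2, min) < 0.05)

c(prespecified = prespecified_rate, switching = switching_rate)
```

<p>The pre-specified rate should sit near the nominal 0.05, while the switching rate should be several times larger, even though no single step looks unreasonable.</p>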
<p><img class="alignleft wp-image-4618" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png" alt="direction398" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398.png 512w" sizes="(max-width: 125px) 100vw, 125px" /></p>
<p><span style="text-decoration: underline;"><strong>Garden of forking paths</strong></span></p>
<p><em>What it is: </em>In this case you may or may not have specified your outcome and stuck with it. Let’s assume you have, so you are still looking at blood pressure and exercise. But it turns out a bunch of people had apparently erroneous measures of blood pressure. So you dropped those measurements and did the analysis with the remaining values. This is a totally sensible thing to do, but if you didn’t specify in advance how you would handle bad measurements, you can make a bunch of different choices here (the forking paths). You could drop them, impute them, multiply impute them, weight them, etc. Each of these gives a different result and you can accidentally pick the one that works best even if you are being “sensible”.</p>
<p><em>An example</em>: <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">This article</a> gives several examples of the forking paths. One is where authors report that at peak fertility women are more likely to wear red or pink shirts. They made several inclusion/exclusion choices (which women to include in which comparison group) that could easily have gone a different direction or were against stated rules.</p>
<p><em>What you can do: </em>Pre-specify every part of your analysis plan, down to which observations you are going to drop, transform, etc. To be honest this is super hard to do because almost every data set is messy in a unique way. So the best thing here is to point out steps in your analysis where you made a choice that wasn’t pre-specified and that you could have made differently. Or, even better, try some of the different choices and make sure your results aren’t dramatically different.</p>
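<p>Here is a rough sketch of what “try some of the different choices” could look like for the blood pressure example. The data are simulated toy values, not real measurements, and the cutoff in <code>is_bad</code> and the three handlers (<code>keep_all</code>, <code>drop_bad</code>, <code>impute_mean</code>) are assumptions chosen only for illustration. The point is to report the estimate under every path side by side rather than picking one.</p>

```python
import random
from statistics import fmean

rng = random.Random(0)

# Toy data (simulated, not real): weekly exercise hours and systolic blood pressure.
exercise = [rng.uniform(0, 10) for _ in range(200)]
pressure = [130 - 2.0 * hours + rng.gauss(0, 10) for hours in exercise]
# Overwrite ten readings with an implausible value to mimic measurement errors.
for i in rng.sample(range(200), 10):
    pressure[i] = 0.0


def is_bad(reading):
    """Hypothetical rule for flagging an erroneous reading."""
    return reading < 50


def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def keep_all(xs, ys):
    return xs, ys


def drop_bad(xs, ys):
    kept = [(x, y) for x, y in zip(xs, ys) if not is_bad(y)]
    return [x for x, _ in kept], [y for _, y in kept]


def impute_mean(xs, ys):
    good_mean = fmean(y for y in ys if not is_bad(y))
    return xs, [good_mean if is_bad(y) else y for y in ys]


# Run the same analysis down each fork and show every answer.
choices = {"keep_all": keep_all, "drop_bad": drop_bad, "impute_mean": impute_mean}
results = {name: slope(*handle(exercise, pressure)) for name, handle in choices.items()}
for name, estimate in results.items():
    print(f"{name:12s} slope = {estimate:.2f}")
```

<p>If the estimates agree across the forks, the choice probably didn’t matter much. If they don’t, that disagreement is worth reporting.</p>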
<p> </p>
<p><strong><img class="alignleft wp-image-4621" src="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png" alt="emoticon149" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">P-hacking</span></strong></p>
<p><em>What it is: </em>The nefarious cousin of the garden of forking paths. Basically here the person outcome switches, uses the garden of forking paths, intentionally doesn’t correct for multiple testing, or uses any of these other means to cheat and get a result that they like.</p>
<p><em>An example:</em> This one gets talked about a lot and there is <a href="http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106">some evidence that it happens</a>. But it is usually pretty hard to ascribe purely evil intentions to people and I’d rather not point the finger here. I think that often the garden of forking paths results in just as bad an outcome without people having to try.</p>
<p><em>What to do:</em> Know how to do an analysis well and don’t cheat.</p>
<p><em>Update: </em> Some <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2649230">define p-hacking</a> as “when honest researchers face ambiguity about what analyses to run, and convince themselves those leading to better results are the correct ones (see e.g., Gelman &amp; Loken, 2014; John, Loewenstein, &amp; Prelec, 2012; Simmons, Nelson, &amp; Simonsohn, 2011; Vazire, 2015).” This coincides with the definition of “garden of forking paths”. I have been asked to point this out <a href="https://twitter.com/talyarkoni/status/694576205089996800">on Twitter.</a> It was never my intention to accuse anyone of accusing people of fraud. That being said, I still think that the connotation many people have in mind when they hear “p-hacking” corresponds to my definition above, although I agree with folks that this isn’t helpful - which is why I prefer we call the non-nefarious version the garden of forking paths.</p>
<p> </p>
<p><strong><img class="alignleft wp-image-4623" src="http://simplystatistics.org/wp-content/uploads/2016/02/paypal15.png" alt="paypal15" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/paypal15-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/paypal15.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">Uncorrected multiple testing </span></strong></p>
<p><em>What it is: </em>This one is related to the garden of forking paths and outcome switching. Most statistical methods for measuring the potential for error assume you are only evaluating one hypothesis at a time. But in reality you might be measuring a ton either on purpose (in a big genomics or neuroimaging study) or accidentally (because you consider a bunch of outcomes). In either case, the expected error rate changes a lot if you consider many hypotheses.</p>
<p><em>An example: </em> The <a href="http://users.stat.umn.edu/~corbett/classes/5303/Bennett-Salmon-2009.pdf">most famous example</a> is when someone did an fMRI on a dead fish and showed that there were a bunch of significant regions at the P &lt; 0.05 level. The reason is that there is natural variation in the background of these measurements and if you consider each voxel independently, ignoring that you are looking at a bunch of them, a few will have P &lt; 0.05 just by chance.</p>
<p><em>What you can do</em>: Correct for multiple testing. When you calculate a large number of p-values make sure you <a href="http://varianceexplained.org/statistics/interpreting-pvalue-histogram/">know what their distribution</a> is expected to be and you use a method like Bonferroni, Benjamini-Hochberg, or q-value to correct for multiple testing.</p>
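<p>As a minimal sketch of two of the corrections named above, here are plain-Python versions of the Bonferroni and Benjamini-Hochberg adjustments, applied to 1,000 simulated tests where nothing is going on. The function names and the simulated p-values are assumptions for this example. In practice you would probably reach for a well-tested library routine instead.</p>

```python
import random


def bonferroni(pvalues):
    """Bonferroni-adjusted p-values (controls the family-wise error rate)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# 1,000 tests with no real signal: null p-values are uniform on [0, 1].
rng = random.Random(42)
null_pvalues = [rng.random() for _ in range(1000)]

raw_hits = sum(p < 0.05 for p in null_pvalues)
bonferroni_hits = sum(p < 0.05 for p in bonferroni(null_pvalues))
bh_hits = sum(p < 0.05 for p in benjamini_hochberg(null_pvalues))
print("uncorrected:", raw_hits)
print("Bonferroni: ", bonferroni_hits)
print("BH:         ", bh_hits)
```

<p>Without correction you should expect around 50 “significant” results out of 1,000 null tests, which is the dead fish problem in miniature. Either correction should bring that count down to zero or close to it.</p>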
<p> </p>
<p><strong><img class="alignleft wp-image-4625" src="http://simplystatistics.org/wp-content/uploads/2016/02/animal162.png" alt="animal162" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/animal162-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/animal162.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">I got a big one here</span></strong></p>
<p><em>What it is:</em> One of the most painful experiences for all new data analysts. You collect data and discover a huge effect. You are super excited so you write it up and submit it to one of the best journals or convince your boss to bet the farm. The problem is that huge effects are incredibly rare and are usually due to some combination of experimental artifacts and biases or mistakes in the analysis. Almost no effects you detect with statistics are huge. Even the relationship between smoking and cancer is relatively weak in observational studies and requires very careful calibration and analysis.</p>
<p><em>An example:</em> <a href="http://www.ncbi.nlm.nih.gov/pubmed/17206142">In a paper</a> authors claimed that 78% of genes were differentially expressed between Asians and Europeans. But it turns out that most of the Asian samples were measured in one batch and the Europeans in another. <a href="http://www.ncbi.nlm.nih.gov/pubmed/17597765">This explained</a> a large fraction of these differences.</p>
<p><em>What you can do</em>: Be deeply suspicious of big effects in data analysis. If you find something huge and counterintuitive, especially in a well established research area, spend <em>a lot</em> of time trying to figure out why it could be a mistake. If you don’t, others definitely will, and you might be embarrassed.</p>
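<p>One cheap check before believing a big group difference is to cross-tabulate the groups against how the samples were processed. The sketch below uses a made-up sample sheet (the counts are hypothetical, not taken from the paper above) and flags batches that are dominated by one group. The names <code>batch_composition</code> and <code>lopsided_batches</code> and the 0.9 threshold are assumptions for illustration.</p>

```python
from collections import Counter

# Hypothetical sample sheet (made-up counts): one (group, batch) pair per sample.
samples = (
    [("Asian", "batch1")] * 58 + [("Asian", "batch2")] * 2
    + [("European", "batch1")] * 3 + [("European", "batch2")] * 57
)


def batch_composition(samples):
    """Share of each batch made up by each group."""
    pair_counts = Counter(samples)
    batch_totals = Counter(batch for _, batch in samples)
    return {
        (group, batch): count / batch_totals[batch]
        for (group, batch), count in pair_counts.items()
    }


def lopsided_batches(samples, threshold=0.9):
    """Batches where a single group exceeds `threshold` of the samples."""
    composition = batch_composition(samples)
    return sorted({batch for (_, batch), share in composition.items()
                   if share > threshold})


for (group, batch), share in sorted(batch_composition(samples).items()):
    print(f"{batch}  {group:9s} {share:.2f}")
print("lopsided batches:", lopsided_batches(samples))
```

<p>If nearly every batch is lopsided, the group comparison and the batch comparison are almost the same comparison, and a huge “effect” deserves extra suspicion. This is a crude screen, not a substitute for looking at the design carefully.</p>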
<p><span style="text-decoration: underline;"><strong><img class="alignleft wp-image-4632" src="http://simplystatistics.org/wp-content/uploads/2016/02/man298.png" alt="man298" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/man298-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/man298.png 256w" sizes="(max-width: 125px) 100vw, 125px" />Double complication</strong></span></p>
<p><em>What it is</em>: When faced with a large and complicated data set, beginning analysts often feel compelled to use a big complicated method. Imagine you have collected data on thousands of genes or hundreds of thousands of voxels and you want to use this data to predict some health outcome. There is a severe temptation to use deep learning or blend random forests, boosting, and five other methods to perform the prediction. The problem is that complicated methods fail for complicated reasons, which will be extra hard to diagnose if you have a really big, complicated data set.</p>
<p><em>An example:</em> There are a large number of examples where people use very small training sets and complicated methods. One example (there were many other problems with this analysis, too) is when people <a href="http://www.nature.com/nm/journal/v12/n11/full/nm1491.html">tried to use complicated prediction algorithms</a> to predict which chemotherapy would work best using genomics. Ultimately this paper was retracted for many problems, but the complication of the methods plus the complication of the data made them hard to detect.</p>
<p><em>What you can do:</em> When faced with a big, messy data set, try simple things first. Use linear regression, make simple scatterplots, check to see if there are obvious flaws with the data. If you must use a really complicated method, ask yourself if there is a reason it is outperforming the simple methods because often with large data sets <a href="http://arxiv.org/pdf/math/0606441.pdf">even simple things work</a>.</p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>Image credits:</strong></span></p>
<ul>
<li>Outcome switching. Icon made by <a href="http://hananonblog.wordpress.com" title="Hanan">Hanan</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li>
<li>Forking paths. Icon made by <a href="http://iconalone.com" title="Popcic">Popcic</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li>
<li>P-hacking. Icon made by <a href="http://www.icomoon.io" title="Icomoon">Icomoon</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li>
<li>Uncorrected multiple testing. Icon made by <a href="http://www.freepik.com" title="Freepik">Freepik</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li>
<li>Big one here. Icon made by <a href="http://www.freepik.com" title="Freepik">Freepik</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li>
<li>Double complication. Icon made by <a href="http://www.freepik.com" title="Freepik">Freepik</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li>
</ul>
Exactly how risky is breathing?
2016-01-26T09:58:23+00:00
http://simplystats.github.io/2016/01/26/exactly-how-risky-is-breathing
<p>This <a href="http://nyti.ms/23nysp5">article by George Johnson</a> in the NYT describes a study by Kamen P. Simonov and Daniel S. Himmelstein that examines the hypothesis that people living at higher altitudes experience lower rates of lung cancer than people living at lower altitudes.</p>
<blockquote>
<p>All of the usual caveats apply. Studies like this, which compare whole populations, can be used only to suggest possibilities to be explored in future research. But the hypothesis is not as crazy as it may sound. Oxygen is what energizes the cells of our bodies. Like any fuel, it inevitably spews out waste — a corrosive exhaust of substances called “free radicals,” or “reactive oxygen species,” that can mutate DNA and nudge a cell closer to malignancy.</p>
</blockquote>
<p>I’m not so much focused on the science itself, which is perhaps intriguing, but rather on the way the article was written. First, George Johnson links to the <a href="https://peerj.com/articles/705/">paper</a> itself, <a href="http://simplystatistics.org/2015/01/15/how-to-find-the-science-paper-behind-a-headline-when-the-link-is-missing/">already a major victory</a>. Also, I thought he did a very nice job of laying out the complexity of doing a population-level study like this one–all the potential confounders, selection bias, negative controls, etc.</p>
<p>I remember particulate matter air pollution epidemiology used to have this feel. You’d try to do all these different things to make the effect go away, but for some reason, under every plausible scenario, in almost every setting, there was always some association between air pollution and health outcomes. Eventually you start to believe it….</p>
On research parasites and internet mobs - let's try to solve the real problem.
2016-01-25T14:34:08+00:00
http://simplystats.github.io/2016/01/25/on-research-parasites-and-internet-mobs-lets-try-to-solve-the-real-problem
<p>A couple of days ago one of the editors of the New England Journal of Medicine <a href="http://www.nejm.org/doi/full/10.1056/NEJMe1516564">posted an editorial</a> showing some moderate level of support for data sharing but also introducing the term “research parasite”:</p>
<blockquote>
<p>A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”</p>
</blockquote>
<p>While this is obviously the most inflammatory statement in the article, I think that there are several more important and overlooked misconceptions. The biggest problems are:</p>
<ol>
<li><strong>“The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters.” </strong>This almost certainly would be the fault of the investigators who published the data. If the authors adhere to good <a href="https://github.com/jtleek/datasharing">data sharing</a> policies and respond to queries from people using their data promptly then this should not be a problem at all.</li>
<li><strong>“… but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited.” </strong>The idea that no one should be able to try to disprove ideas with the authors’ data has been covered in other blogs/on Twitter. One thing I do think is worth considering here is the concern about credit. I think that the traditional way credit has accrued to authors has been citations. But if you get a major study funded, say for 50 million dollars, run that study carefully, sit on a million conference calls, and end up with a single major paper, that could be frustrating. That is why I think that a better policy would be to have the people who run massive studies get credit in a way that <em>is not papers</em>. They should get some kind of formal administrative credit. But then the data should be immediately and publicly available to anyone to publish on. That allows people who run massive studies to get credit and science to proceed normally.</li>
<li><strong>“</strong><strong>The new investigators arrived on the scene with their own ideas and worked symbiotically, rather than parasitically, with the investigators holding the data, moving the field forward in a way that neither group could have done on its own.” </strong> The story that follows about a group of researchers who collaborated with the NSABP to validate their gene expression signature is very encouraging. But it isn’t the only way science should work. Researchers shouldn’t be constrained to one model or another. Sometimes collaboration is necessary, sometimes it isn’t, but in neither case should we label the researchers “symbiotic” or “parasitic”, terms that have extreme connotations.</li>
<li><strong>“How would data sharing work best? We think it should happen symbiotically, not parasitically.”</strong> I think that it should happen <em>automatically</em>. If you generate a data set with public funds, you should be required to immediately make it available to researchers in the community. But you should <em>get credit for generating the data set and the hypothesis that led to the data set</em>. The problem is that people who generate data will almost never be as fast at analyzing it as people who know how to analyze data. But both deserve credit, whether they are working together or not.</li>
<li><strong>“Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant coauthorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested.”</strong> The trouble with this framework is that it preferentially accrues credit to data generators and doesn’t accurately describe the role of either party. To flip this argument around, you could just as easily say that anyone who uses <a href="http://salzberg-lab.org/">Steven Salzberg</a>’s software for aligning or assembling short reads should make him a co-author. I think Dr. Drazen would agree that not everyone who aligned reads should add Steven as co-author, despite his contribution being critical for the completion of their work.</li>
</ol>
<p>After the piece was posted there was predictable internet rage from <a href="https://twitter.com/dataparasite">data parasites</a>, a <a href="https://twitter.com/hashtag/researchparasite?src=hash">dedicated hashtag</a>, and half a dozen angry blog posts written about the piece. These inspired a <a href="http://www.nejm.org/doi/full/10.1056/NEJMe1601087">follow up piece</a> from Drazen. I recognize why these folks were upset - the “research parasites” thing was unnecessarily inflammatory. But <a href="http://simplystatistics.org/2014/03/05/plos-one-i-have-an-idea-for-what-to-do-with-all-your-profits-buy-hard-drives/">I also sympathize with data creators</a> who are also subject to a tough environment - particularly when they are junior scientists.</p>
<p>I think the response to the internet outrage also misses the mark and comes off as a defense of people with angry perspectives on data sharing. I would have much rather seen a more pro-active approach from a leading journal of medicine. I’d like to see something that acknowledges different contributions appropriately and doesn’t slow down science. Something like:</p>
<ol>
<li>We will require all data, including data from clinical trials, to be made public immediately on publication as long as it poses minimal risk to the patients involved or the patients have been consented to broad sharing.</li>
<li>When data are not made publicly available they are still required to be deposited with a third party such as the NIH or Figshare to be held available for request from qualified/approved researchers.</li>
<li>We will require that all people who use data give appropriate credit to the original data generators in terms of data citations.</li>
<li>We will require that all people who use software/statistical analysis tools give credit to the original tool developers in terms of software citations.</li>
<li>We will include a new designation for leaders of major data collection or software generation projects that can be included to demonstrate credit for major projects undertaken and completed.</li>
<li>When reviewing papers written by experimentalists with no statistical/computational co-authors we will require no fewer than 2 statistical/computational referees to ensure there has not been a mistake made by inexperienced researchers.</li>
<li>When reviewing papers written by statistical/computational authors with no experimental co-authors we will require no fewer than 2 experimental referees to ensure there has not been a mistake made by inexperienced researchers.</li>
</ol>
<p> </p>
Not So Standard Deviations Episode 8 - Snow Day
2016-01-24T21:41:44+00:00
http://simplystats.github.io/2016/01/24/not-so-standard-deviations-episode-8-snow-day
<p>Hilary and I were snowed in over the weekend, so we recorded Episode 8 of Not So Standard Deviations. In this episode, Hilary and I talk about how to get your foot in the door with data science, the New England Journal’s view on data sharing, Google’s “Cohort Analysis”, and trying to predict a movie’s box office returns based on the movie’s script.</p>
<p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p>
<p>Follow <a href="https://twitter.com/nssdeviations">@NSSDeviations</a> on Twitter!</p>
<p>Show notes:</p>
<ul>
<li><a href="http://goo.gl/eUU2AK">Remembrances of Peter Hall</a></li>
<li><a href="http://goo.gl/HbMu87">Research Parasites</a> (NEJM editorial by Dan Longo and Jeffrey Drazen)</li>
<li>Amazon <a href="http://goo.gl/83DvvO">review/data analysis</a> of Fifty Shades of Grey</li>
<li><a href="https://youtu.be/55psWVYSbrI">Time-lapse cats</a></li>
<li><a href="https://getpocket.com">Pocket</a></li>
</ul>
<p>Apologies for my audio on this episode. I had a bit of a problem calibrating my microphone. I promise to figure it out for the next episode!</p>
<p><a href="https://api.soundcloud.com/tracks/243634673/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio for this episode</a>.</p>
<p> </p>
Parallel BLAS in R
2016-01-21T11:53:07+00:00
http://simplystats.github.io/2016/01/21/parallel-blas-in-r
<p>I’m working on a new chapter for my R Programming book and the topic is parallel computation. So, I was happy to see this tweet from David Robinson (@drob) yesterday:</p>
<blockquote class="twitter-tweet" lang="en">
<p dir="ltr" lang="en">
How fast is this <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a> code? x <- replicate(5e3, rnorm(5e3)) x %*% t(x) For me, w/Microsoft R Open, 2.5sec. Wow. <a href="https://t.co/0SbijNxxVa">https://t.co/0SbijNxxVa</a>
</p>
<p>
— David Robinson (@drob) <a href="https://twitter.com/drob/status/689916280233562112">January 20, 2016</a>
</p>
</blockquote>
<p>What does this have to do with parallel computation? Briefly, the code generates 5,000 standard normal random variates, repeats this 5,000 times, and stores them in a 5,000 x 5,000 matrix (x). Then it computes x x’, the product of x with its transpose. The second part is key, because it involves a matrix multiplication.</p>
<p>Matrix multiplication in R is handled, at a very low level, by the library that implements the Basic Linear Algebra Subroutines, or BLAS. The stock R that you download from CRAN comes with what’s known as a reference implementation of BLAS. It works, it produces what everyone agrees are the right answers, but it is in no way optimized. Here’s what I get when I run this code on my Mac using RStudio and the CRAN version of R for Mac OS X:</p>
<pre>system.time({ x <- replicate(5e3, rnorm(5e3)); tcrossprod(x) })
user system elapsed
59.622 0.314 59.927
</pre>
<p>Note that the “user” time and the “elapsed” time are roughly the same. Note also that I use the tcrossprod() function instead of the otherwise equivalent expression x %*% t(x). Both crossprod() and tcrossprod() are generally faster than using the %*% operator.</p>
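<p>To make the equivalence concrete, here is a small sketch that checks that <code>tcrossprod(x)</code> and <code>x %*% t(x)</code> agree and times both. It uses a 500 x 500 matrix (my choice, purely so it runs quickly) rather than the 5,000 x 5,000 matrix in the tweet, and the timings will depend on your machine and on which BLAS your R is linked against.</p>

```r
# Build a smaller version of the matrix from the tweet (size chosen for speed)
set.seed(42)
n <- 500
x <- replicate(n, rnorm(n))   # each replicate is one column; result is n x n

# Two ways of computing x times its transpose
a <- x %*% t(x)
b <- tcrossprod(x)

# The results should agree up to floating point rounding
same_result <- isTRUE(all.equal(a, b))
same_result

# Time each approach; compare the "elapsed" entries on your own machine
time_matmul <- system.time(x %*% t(x))
time_tcross <- system.time(tcrossprod(x))
rbind(matmul = time_matmul[1:3], tcrossprod = time_tcross[1:3])
```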
<p>Now, when I run the same code on my built-from-source version of R (version 3.2.3), here’s what I get:</p>
<pre>system.time({ x <- replicate(5e3, rnorm(5e3)); tcrossprod(x) })
user system elapsed
14.378 0.276 3.344
</pre>
<p>Overall, it’s faster when I don’t run the code through RStudio (14s vs. 59s). Also on this version the elapsed time is about 1/4 the user time. Why is that?</p>
<p>The built-from-source version of R is linked to Apple’s Accelerate framework, which is a large library that includes an optimized BLAS library for Intel chips. This optimized BLAS, in addition to being optimized with respect to the code itself, is designed to be multi-threaded so that it can split work off into chunks and run them in parallel on multi-core machines. Here, the tcrossprod() function was run in parallel on my machine, and so the elapsed time was about a quarter of the time that was “charged” to the CPU(s).</p>
<p>David’s tweet indicated that with Microsoft R Open, which is a custom-built binary of R, the (I assume?) elapsed time is 2.5 seconds. Looking at the attached link, it appears that Microsoft’s R Open is linked against <a href="https://software.intel.com/en-us/intel-mkl">Intel’s Math Kernel Library</a> (MKL) which contains, among other things, an optimized BLAS for Intel chips. I don’t know what kind of computer David was running on, but assuming it was as high-powered as mine, it would suggest Intel’s MKL sees slightly better performance. But either way, both Accelerate and MKL achieve that speedup through custom-coding of the BLAS routines and multi-threading on multi-core systems.</p>
<p>If you’re going to be doing any linear algebra in R (and you will), it’s important to link to an optimized BLAS. Otherwise, you’re just wasting time unnecessarily. Besides Accelerate (Mac) and Intel MKL, there’s AMD’s <a href="http://developer.amd.com/tools-and-sdks/archive/amd-core-math-library-acml/">ACML</a> library for AMD chips and the <a href="http://math-atlas.sourceforge.net">ATLAS</a> library which is a general purpose tunable library. Also <a href="https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2">Goto’s BLAS</a> is optimized but is not under active development.</p>
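<p>One rough way to read these timings is to divide the “user” time by the “elapsed” time: a ratio near 1 suggests the work ran on a single core, while a ratio well above 1 suggests several cores were busy at once. The sketch below applies this to the two sets of numbers reported above and then to a small fresh run. The helper name <code>cpu_to_elapsed</code> is made up for illustration, and the ratio is only a heuristic, not an exact count of threads.</p>

```r
# Hypothetical helper: ratio of CPU ("user") time to wall-clock ("elapsed") time
cpu_to_elapsed <- function(user, elapsed) {
  user / elapsed
}

# Numbers from the two system.time() runs reported in this post
ratio_reference <- cpu_to_elapsed(59.622, 59.927)   # CRAN R with reference BLAS
ratio_accelerate <- cpu_to_elapsed(14.378, 3.344)   # R linked against Accelerate

# Roughly 1 for the reference BLAS and roughly 4.3 for Accelerate
round(c(reference = ratio_reference, accelerate = ratio_accelerate), 2)

# The same ratio for a small run on the current machine
timing <- system.time(tcrossprod(replicate(500, rnorm(500))))
if (timing[["elapsed"]] > 0) {
  cpu_to_elapsed(timing[["user.self"]], timing[["elapsed"]])
}
```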
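<p>If you are not sure which BLAS your copy of R is using, recent versions of R can report it. The sketch below assumes R 3.4.0 or later, where, as far as I know, <code>extSoftVersion()</code> includes a <code>BLAS</code> entry and <code>La_library()</code> is available; neither existed in the R 3.2.3 used above, and on some platforms the reported path may be an empty string.</p>

```r
# Path of the BLAS shared library R is using (assumes R >= 3.4.0)
blas_path <- extSoftVersion()[["BLAS"]]

# Path of the LAPACK shared library in use (assumes R >= 3.4.0)
lapack_path <- La_library()

# Print both; look for names such as Accelerate, MKL, OpenBLAS, or ATLAS
# in the paths, versus libRblas for the reference implementation
cat("BLAS:  ", blas_path, "\n")
cat("LAPACK:", lapack_path, "\n")
```

<p>On those versions, <code>sessionInfo()</code> also prints the same BLAS and LAPACK paths, which is handy when reporting timings to someone else.</p>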
Profile of Hilary Parker
2016-01-14T21:15:46+00:00
http://simplystats.github.io/2016/01/14/profile-of-hilary-parker
<p>If you’ve ever wanted to know more about my <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> co-host (and Johns Hopkins graduate) Hilary Parker, you can go check out the <a href="http://thisisstatistics.org/hilary-parker-gets-crafty-with-statistics-in-her-not-so-standard-job/">great profile of her</a> on the American Statistical Association’s This Is Statistics web site.</p>
<blockquote>
<p><strong>What advice would you give to high school students thinking about majoring in statistics?</strong></p>
<p>It’s such a great field! Not only is the industry booming, but more importantly, the disciplines of statistics teaches you to think analytically, which I find helpful for just about every problem I run into. It’s also a great field to be interested in as a generalist– rather than dedicating yourself to studying one subject, you are deeply learning a set of tools that you can apply to any subject that you find interesting. Just one glance at the topics covered on The Upshot or 538 can give you a sense of that. There’s politics, sports, health, history… the list goes on! It’s a field with endless possibility for growth and exploration, and as I mentioned above, the more I explore the more excited I get about it.</p>
</blockquote>
Not So Standard Deviations Episode 7 - Statistical Royalty
2016-01-12T08:45:24+00:00
http://simplystats.github.io/2016/01/12/not-so-standard-deviations-episode-7-statistical-royalty
<p>The latest episode of Not So Standard Deviations is out, and boy does Hilary have a story to tell.</p>
<p>We also talk about Theranos and the pitfalls of diagnostic testing, Spotify’s Discover Weekly playlist generation algorithm (and the need for human product managers), and of course, a little Star Wars. Also, Hilary and I start a new segment where we each give some “free advertising” to something interesting that we think other people should know about.</p>
<p>Show Notes:</p>
<ul>
<li><a href="http://goo.gl/JDk6ni">Gosset Icterometer</a></li>
<li>The <a href="http://skybrudeconsulting.com/blog/2015/10/16/theranos-healthcare.html">dangers</a> of <a href="https://www.fredhutch.org/en/news/center-news/2013/11/scientists-urge-caution-personal-genetic-screenings.html">entertainment</a> <a href="http://mobihealthnews.com/35444/the-rise-of-the-seemingly-serious-but-just-for-entertainment-purposes-medical-app/">medicine</a></li>
<li>Spotify’s Discover Weekly <a href="http://goo.gl/enzFeR">solves human curation</a>?</li>
<li>David Robinson’s <a href="http://varianceexplained.org">Variance Explained</a></li>
<li><a href="http://what3words.com">What3Words</a></li>
</ul>
<p><a href="https://api.soundcloud.com/tracks/241071463/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio for this episode</a>.</p>
Jeff, Roger and Brian Caffo are doing a Reddit AMA at 3pm EST Today
2016-01-11T09:29:28+00:00
http://simplystats.github.io/2016/01/11/jeff-roger-and-brian-caffo-are-doing-a-reddit-ama-at-3pm-est-today
<p>Jeff Leek, Brian Caffo, and I are doing a <a href="https://www.reddit.com/r/IAmA">Reddit AMA</a> TODAY at 3pm EST. We’re happy to answer questions about…anything…including our roles as Co-Directors of the <a href="https://www.coursera.org/specializations/jhu-data-science">Johns Hopkins Data Science Specialization</a> as well as the <a href="https://www.coursera.org/specializations/executive-data-science">Executive Data Science Specialization</a>.</p>
<p>This is one of the few pictures of the three of us together.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189.jpg"><img class="alignright size-large wp-image-4586" src="http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-1024x768.jpg" alt="IMG_0189" width="990" height="743" srcset="http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-120x90.jpg 120w, http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-1024x768.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-260x195.jpg 260w" sizes="(max-width: 990px) 100vw, 990px" /></a></p>
A non-comprehensive list of awesome things other people did in 2015
2015-12-21T11:22:07+00:00
http://simplystats.github.io/2015/12/21/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2015
<p><em>Editor’s Note: This is the third year I’m making a list of awesome things other people did this year. Just like the lists for <a href="http://simplystatistics.org/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year/">2013</a> and <a href="http://simplystatistics.org/2014/12/17/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2014/">2014</a> I am doing this off the top of my head. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. This year’s list is particularly “off the cuff” so I’d appreciate additions if you have ‘em. I have surely missed awesome things people have done.</em></p>
<ol>
<li>I hear the <a href="http://sml.princeton.edu/tukey">Tukey conference</a> put on by my former advisor John S. was amazing. Out of it came this really good piece by David Donoho on <a href="https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf">50 years of Data Science</a>.</li>
<li>Sherri Rose wrote really accurate and readable guides on <a href="http://drsherrirose.com/academic-cvs-for-statistical-science-faculty-positions">academic CVs</a>, <a href="http://drsherrirose.com/academic-cover-letters-for-statistical-science-faculty-positions">academic cover letters</a>, and <a href="http://drsherrirose.com/how-to-be-an-effective-phd-researcher">how to be an effective PhD researcher</a>.</li>
<li>I am not 100% sold on the deep learning hype, but Michael Nielsen wrote this awesome book on <a href="http://neuralnetworksanddeeplearning.com/">deep learning and neural networks</a>. I like how approachable it is and how un-hypey it is. I also thought Andrej Karpathy’s <a href="http://karpathy.github.io/2015/10/25/selfie/">blog post</a> on whether you have a good selfie or not was fun.</li>
<li>Thomas Lumley continues to be a must read regardless of which blog he writes for, with a ton of snarky fun posts debunking the latest ridiculous health headlines on <a href="http://www.statschat.org.nz/2015/11/27/to-find-the-minds-construction-near-the-face/">statschat</a> and more in depth posts like this one on pre-filtering multiple tests on <a href="http://notstatschat.tumblr.com/post/131478660126/prefiltering-very-large-numbers-of-tests">notstatschat</a>.</li>
<li>David Robinson is making a strong case for top data science blogger with his series of <a href="http://varianceexplained.org/r/bayesian_fdr_baseball/">awesome</a> <a href="http://varianceexplained.org/r/credible_intervals_baseball/">posts</a> on <a href="http://varianceexplained.org/r/empirical_bayes_baseball/">empirical Bayes</a>.</li>
<li>Hadley Wickham doing Hadley Wickham things again. <a href="https://github.com/hadley/readr">readr</a> is the biggie for me this year.</li>
<li>I’ve been really enjoying the solid coverage of science/statistics from the (not entirely statistics focused as the name would suggest) <a href="https://twitter.com/statnews">STAT</a>.</li>
<li>Ben Goldacre and co. launched <a href="http://opentrials.net/">OpenTrials</a> for aggregating all the clinical trial data in the world in an open repository.</li>
<li>Christie Aschwanden’s piece on why <a href="http://fivethirtyeight.com/features/science-isnt-broken/">Science Isn’t Broken </a> is a must read and one of the least polemic treatments of the reproducibility/replicability issue I’ve read. The p-hacking graphic is just icing on the cake.</li>
<li>I’m excited about the new <a href="http://blog.revolutionanalytics.com/2015/06/r-consortium.html">R Consortium</a> and the idea of having more organizations that support folks in the R community.</li>
<li>Emma Pierson’s blog and writeups in various national level news outlets continue to impress. I thought <a href="https://www.washingtonpost.com/news/grade-point/wp/2015/10/15/a-better-way-to-gauge-how-common-sexual-assault-is-on-college-campuses/">this one</a> on changing the incentives for sexual assault surveys was particularly interesting/good.</li>
<li>Amanda Cox and co. created <a href="http://www.nytimes.com/interactive/2015/05/28/upshot/you-draw-it-how-family-income-affects-childrens-college-chances.html">this</a>, which is an amazing way to teach people about pre-conceived biases in the way we think about relationships and correlations. I love the crowd-sourcing view on data analysis this suggests.</li>
<li>As usual Philip Guo was producing gold over on his blog. I appreciate this piece on <a href="http://www.pgbovine.net/tips-for-data-driven-research.htm">twelve tips for data driven research</a>.</li>
<li>I am really excited about the new field of adaptive data analysis. Basically understanding how we can let people be “real data analysts” and still get reasonable estimates at the end of the day. <a href="http://www.sciencemag.org/content/349/6248/636.abstract">This paper</a> from Cynthia Dwork and co was one of the initial salvos that came out this year.</li>
<li>Datacamp <a href="https://www.datacamp.com/courses/intro-to-python-for-data-science?utm_source=growth&utm_campaign=python&utm_medium=button">incorporated Python</a> into their platform. The idea of interactive education for R/Python/Data Science is a very cool one and has tons of potential.</li>
<li>I was really into the idea of <a href="http://projecteuclid.org/euclid.aoas/1430226098">Cross-Study validation</a> that got proposed this year. With the growth of public data in a lot of areas we can really start to get a feel for generalizability.</li>
<li>The Open Science Collaboration did this <a href="http://www.sciencemag.org/content/349/6251/aac4716">incredible replication of 100 different studies</a> in psychology with attention to detail and care that deserves a ton of attention.</li>
<li>Florian’s piece “<a href="http://www.ncbi.nlm.nih.gov/pubmed/26402330">You are not working for me; I am working with you.</a>” should be required reading for all students/postdocs/mentors in academia. This is something I still hadn’t fully figured out until I read Florian’s piece.</li>
<li>I think Karl Broman’s post on why <a href="https://kbroman.wordpress.com/2015/09/09/reproducibility-is-hard/">reproducibility is hard</a> is a great introduction to the real issues in making data analyses reproducible.</li>
<li>This was the year of the f1000 post-publication review paper. I thought <a href="http://f1000research.com/articles/4-121/v1">this one</a> from Yoav and the ensuing fallout was fascinating.</li>
<li>I love pretty much everything out of Di Cook/Heike Hofmann’s groups. This year I liked the paper on <a href="http://download.springer.com/static/pdf/611/art%253A10.1007%252Fs00180-014-0534-x.pdf?originUrl=http%3A%2F%2Flink.springer.com%2Farticle%2F10.1007%2Fs00180-014-0534-x&token2=exp=1450714996~acl=%2Fstatic%2Fpdf%2F611%2Fart%25253A10.1007%25252Fs00180-014-0534-x.pdf%3ForiginUrl%3Dhttp%253A%252F%252Flink.springer.com%252Farticle%252F10.1007%252Fs00180-014-0534-x*~hmac=3c5f5c7c1b2381685437659d8ffd64e1cb2c52d1dfd10506cad5d2af1925c0ac">visual statistical inference in high-dimensional low sample size settings</a>.</li>
<li>This is pretty recent, but Nathan Yau’s <a href="https://flowingdata.com/2015/12/15/a-day-in-the-life-of-americans/">day in the life graphic is mesmerizing</a>.</li>
</ol>
<p>This was a year where open source data people <a href="http://treycausey.com/emotional_rollercoaster_public_work.html">described</a> their <a href="https://twitter.com/johnmyleswhite/status/666429299327569921">pain</a> from people being demanding/mean to them for their contributions. As the year closes I just want to give a big thank you to everyone who did awesome stuff I used this year and have completely ungraciously failed to acknowledge.</p>
Not So Standard Deviations: Episode 6 - Google is the New Fisher
2015-12-18T13:08:10+00:00
http://simplystats.github.io/2015/12/18/not-so-standard-deviations-episode-6-google-is-the-new-fisher
<p>Episode 6 of Not So Standard Deviations is now posted. In this episode Hilary and I talk about the analytics of our own podcast, and analyses that seem easy but are actually hard.</p>
<p>If you haven’t already, you can subscribe to the podcast through <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a>.</p>
<p>This will be our last episode for 2015 so see you in 2016!</p>
<p>Notes</p>
<ul>
<li><a href="https://goo.gl/X0TFt9">Roger’s books on Leanpub</a></li>
<li><a href="https://goo.gl/VO0ckP">KPIs</a></li>
<li><a href="http://replyall.soy">Reply All</a>, a great podcast</li>
<li><a href="http://user2016.org">Use R! 2016 conference</a> where Don Knuth is an invited speaker!</li>
<li><a href="http://goo.gl/wUcTBT">Liz Stuart’s directory of propensity score software</a></li>
<li><a href="https://goo.gl/CibhJ0">A/B testing</a></li>
<li><a href="https://goo.gl/qMyksb">iid</a></li>
<li><a href="https://goo.gl/qHVzWQ">R 3.2.3 release notes</a></li>
<li><a href="http://www.pqr-project.org/">pqR</a></li>
<li><a href="https://goo.gl/pFOVkx">John Myles White’s tweet</a></li>
</ul>
<p><a href="https://api.soundcloud.com/tracks/237909534/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p>
Instead of research on reproducibility, just do reproducible research
2015-12-11T12:18:33+00:00
http://simplystats.github.io/2015/12/11/instead-of-research-on-reproducibility-just-do-reproducible-research
<p>Right now reproducibility, replicability, false positive rates, biases in methods, and other problems with science are the hot topic. As I mentioned in a previous post, pointing out a flaw in a scientific study is way easier to do correctly than generating a new scientific study. Some folks have noticed that right now there is a huge market for papers pointing out how science is flawed. The combination of the relative ease of pointing out flaws and the huge payout for writing these papers is helping to generate the hype around the “reproducibility crisis”.</p>
<p>I <a href="http://www.slideshare.net/jtleek/evidence-based-data-analysis-45800617">gave a talk</a> a little while ago at an NAS workshop where I stated that all the tools for reproducible research exist (the caveat being really large analyses - although that is changing as well). To make a paper completely reproducible, open, and available for post publication review you can use the following approach with no new tools/frameworks needed.</p>
<ol>
<li>Use <a href="https://github.com/">Github</a> for version control.</li>
<li>Use <a href="http://rmarkdown.rstudio.com/">rmarkdown</a> or <a href="http://ipython.org/notebook.html">iPython notebooks</a> for your analysis code.</li>
<li>When your paper is done post it to <a href="http://arxiv.org/">arxiv</a> or <a href="http://biorxiv.org/">biorxiv</a>.</li>
<li>Post your data to an appropriate repository like <a href="http://www.ncbi.nlm.nih.gov/sra">SRA</a> or a general purpose site like <a href="https://figshare.com/">figshare</a>.</li>
<li>Send any software you develop to a controlled repository like <a href="https://cran.r-project.org/">CRAN</a> or <a href="http://bioconductor.org/">Bioconductor</a>.</li>
<li>Participate in the <a href="http://simplystatistics.org/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics/">post publication discussion on Twitter and with a Blog</a>.</li>
</ol>
<p>This is also true of open science, open data sharing, reproducibility, replicability, post-publication peer review and all the other issues forming the “reproducibility crisis”. There is a lot of attention and heat that has focused on the “crisis” or on folks who make a point to take a stand on reproducibility or open science or post publication review. But in the background, outside of the hype, there is a large group of people who are quietly executing solid, open, reproducible science.</p>
<p>I wish that this group would get more attention so I decided to point out a few of them. Next time somebody asks me about the research on reproducibility or open science I’ll just point them here and tell them to just follow the lead of people doing it.</p>
<ul>
<li><strong>Karl Broman</strong> - posts all of his <a href="http://kbroman.org/pages/talks.html">talks online</a>, generates many widely used <a href="http://kbroman.org/pages/software.html">open source packages</a>, writes <a href="http://kbroman.org/pages/tutorials.html">free/open tutorials</a> on everything from knitr to making webpages, makes his <a href="http://www.ncbi.nlm.nih.gov/pubmed/26290572">papers</a> highly <a href="https://github.com/kbroman/Paper_SampleMixups">reproducible</a>.</li>
<li><strong>Jessica Li</strong> - <a href="http://www.stat.ucla.edu/~jingyi.li/software-and-data.html">posts her data online and writes open source software for her analyses</a>.</li>
<li><strong>Mark Robinson</strong> - posts many of his papers as <a href="http://biorxiv.org/search/author1%3Arobinson%252C%2Bmd%20numresults%3A10%20sort%3Arelevance-rank%20format_result%3Astandard">preprints on biorxiv</a>, makes his <a href="https://github.com/markrobinsonuzh/diff_splice_paper">analyses reproducible</a>, writes <a href="http://bioconductor.org/packages/release/bioc/html/Repitools.html">open source software</a>.</li>
<li><strong>Florian Markowetz</strong> - <a href="http://www.markowetzlab.org/software/">writes open source software</a>, provides <a href="http://www.markowetzlab.org/data.php">Bioconductor data for major projects</a>, links <a href="http://www.markowetzlab.org/publications.php">his papers with his code</a> nicely on his publications page.</li>
<li><strong>Raphael Gottardo</strong> - <a href="http://www.rglab.org/software.html">writes/maintains many open source software packages</a>, makes <a href="https://github.com/RGLab/BNCResponse">his analyses reproducible and available via Github</a>, posts <a href="http://biorxiv.org/content/early/2015/06/15/020842">preprints of his papers</a>.</li>
<li><strong>Genevera Allen</strong> - writes <a href="https://cran.r-project.org/web/packages/TCGA2STAT/index.html">software</a> to make data easier to access, posts <a href="http://biorxiv.org/content/early/2015/09/24/027516">preprints on biorxiv</a> and <a href="http://arxiv.org/pdf/1502.03853v1.pdf">on arxiv</a>.</li>
<li><strong>Lorena Barba</strong> - <a href="http://openedx.seas.gwu.edu/courses/GW/MAE6286/2014_fall/about">teaches open source moocs</a>, with lessons as <a href="https://github.com/barbagroup/CFDPython">open source iPython modules</a>, and <a href="https://github.com/barbagroup/pygbe">reproducible code for her analyses</a>.</li>
<li><strong>Alicia Oshlack - </strong>writes papers with <a href="http://www.genomemedicine.com/content/7/1/43">completely reproducible analyses</a>, <a href="http://bioconductor.org/packages/release/bioc/html/missMethyl.html">publishes lots of open source software</a> and publishes <a href="http://biorxiv.org/content/early/2015/01/23/013698">preprints</a> for her papers.</li>
<li><strong>Baggerly and Coombes</strong> - although they are famous for a <a href="https://projecteuclid.org/euclid.aoas/1267453942">highly public reproducible piece of research</a> they have also quietly implemented policies like <a href="http://magazine.amstat.org/blog/2011/01/01/scipolicyjan11/">making all reports reproducible for their consulting center</a>.</li>
</ul>
<p>This list was made completely haphazardly as all my lists are, but just to indicate there are a ton of people out there doing this. One thing that is clear too is that grad students and postdocs are adopting the approach I described at a very high rate.</p>
<p>Moreover, there are people who have been doing parts of this for a long time (like the <a href="http://arxiv.org/">physics</a> or <a href="http://biostats.bepress.com/jhubiostat/">biostatistics</a> communities with preprints, or how people have used <a href="https://projecteuclid.org/euclid.aoas/1267453942">Sweave for a long time</a>). I purposely left people off the list like Titus and Ethan who have gone all in, even posting their <a href="http://ivory.idyll.org/blog/grants-posted.html">grants</a> <a href="http://jabberwocky.weecology.org/2012/08/10/a-list-of-publicly-available-grant-proposals-in-the-biological-sciences/">online</a>. I did this because they are very loud advocates of open science, but I wanted to highlight quieter contributors and point out that while there is a lot of noise going on over in one corner, many people are quietly doing really good science in another.</p>
By opposing tracking well-meaning educators are hurting disadvantaged kids
2015-12-09T10:10:02+00:00
http://simplystats.github.io/2015/12/09/by-opposing-tracking-well-meaning-educators-are-hurting-disadvantaged-kids
<div class="page" title="Page 2">
<div class="layoutArea">
<div class="column">
<p>
An unfortunate fact about the US K-12 system is that the education gap between poor and rich is growing. One manifestation of this trend is that we rarely see US kids from disadvantaged backgrounds become tenure track faculty, especially in the STEM fields. In my experience, the ones that do make it, when asked how they overcame the suboptimal math education their school district provided, often respond "I was <a href="https://en.wikipedia.org/wiki/Tracking_(education)">tracked</a>" or "I went to a <a href="https://en.wikipedia.org/wiki/Magnet_school">magnet school</a>". Magnet schools filter students with admission tests and then teach at a higher level than an average school, so essentially the entire school is an advanced track.
</p>
</div>
</div>
</div>
<p>Twenty years of classroom instruction experience has taught me that classes with diverse academic abilities present one of the most difficult teaching challenges. Typically, one is forced to focus on only a sub-group of students, usually the second quartile. As a consequence the lower and higher quartiles are not properly served. At the university level, we minimize this problem by offering different levels: remedial math versus math for engineers, probability for the Masters program versus probability for PhD students, co-ed intramural sports versus the varsity basketball team, intro to World Music versus a spot in the orchestra, etc. In K-12, tracking seems like the obvious solution to teaching to an array of student levels.</p>
<p>Unfortunately, there has been a trend recently to move away from tracking and several school districts now forbid it. The motivation seems to be a series of <a href="http://www.tandfonline.com/doi/abs/10.1207/s15430421tip4501_9">observational</a> <a href="http://files.eric.ed.gov/fulltext/ED329615.pdf">studies</a> that note that “low-track classes tend to be primarily composed of low-income students, usually minorities, while upper-track classes are usually dominated by students from socioeconomically successful groups.” Tracking opponents infer that this unfortunate reality is due to bias (conscious or unconscious) in the informal referrals that are typically used to decide which students are advanced. However, <strong>this is a critique of the referral system, not of tracking itself.</strong> A simple fix is to administer an objective test or use the percentiles from <a href="http://www.doe.mass.edu/mcas/overview.html">state assessment tests</a>. In fact, such exams have been developed and implemented. A recent study (summarized <a href="http://www.vox.com/2015/11/23/9784250/card-giuliano-gifted-talented">here</a>) examined the data from a district that for a period of time implemented an objective assessment and found that</p>
<blockquote>
<p>[t]he number of Hispanic students [in the advanced track increased] by 130 percent and the number of black students by 80 percent.</p>
</blockquote>
<p>Unfortunately, instead of maintaining the placement criteria, which benefited underrepresented minorities without relaxing standards, these school districts reverted to the old, flawed system due to budget cuts.</p>
<p>Another argument against tracking is that students benefit more from being in classes with higher-achieving peers, rather than being in a class with students with similar subject mastery and a teacher focused on their level. However, a <a href="http://web.stanford.edu/~pdupas/Tracking_rev.pdf">randomized controlled trial</a> (and the only one of which I am aware) finds that tracking helps all students:</p>
<blockquote>
<p>We find that tracking students by prior achievement raised scores for all students, even those assigned to lower achieving peers. On average, after 18 months, test scores were 0.14 standard deviations higher in tracking schools than in non-tracking schools (0.18 standard deviations higher after controlling for baseline scores and other control variables). After controlling for the baseline scores, students in the top half of the pre-assignment distribution gained 0.19 standard deviations, and those in the bottom half gained 0.16 standard deviations. <strong>Students in all quantiles benefited from tracking. </strong></p>
</blockquote>
<p>I believe that without tracking, the achievement gap between disadvantaged children and their affluent peers will continue to widen since involved parents will seek alternative educational opportunities, including private schools or subject specific extracurricular acceleration programs. With limited or no access to advanced classes in the public system, disadvantaged students will be less prepared to enter the very competitive STEM fields. Note that competition comes not only from within the US, but from other countries including many with educational systems that track.</p>
<p>To illustrate the extreme gap, the following exercises are from a 7th grade public school math class (in a high performing school district):</p>
<table style="width: 100%;">
<tr>
<td>
<a href="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.49.41-AM.png"><img src="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.49.41-AM.png" alt="Screen Shot 2015-12-07 at 11.49.41 AM" width="275" /></a>
</td>
<td>
<a href="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-09-at-9.00.57-AM.png"><img src="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-09-at-9.00.57-AM.png" alt="Screen Shot 2015-12-09 at 9.00.57 AM" width="275" /></a>
</td>
</tr>
</table>
<p>(Click to enlarge). There is no tracking so all students must work on these problems. Meanwhile, in a 7th grade advanced, private math class, that same student can be working on problems like these:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM.png"><img class="alignnone size-full wp-image-4511" src="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM.png" alt="Screen Shot 2015-12-07 at 11.47.45 AM" width="1165" height="341" srcset="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM-300x88.png 300w, http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM-1024x300.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM-260x76.png 260w, http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM.png 1165w" sizes="(max-width: 1165px) 100vw, 1165px" /></a></p>
<p>Let me stress that there is nothing wrong with the first example if it is at the appropriate level for the student. However, a student who can work at the level of the second example should be provided with the opportunity to do so regardless of their family’s ability to pay. Poorer kids in districts that do not offer advanced classes will not only be less equipped to compete with their richer peers, but many of the academically advanced ones may, I suspect, dismiss academics due to lack of challenge and boredom.</p>
<p>Educators need to consider evidence when making decisions regarding policy. Tracking can be applied unfairly, but that aspect can be remedied. Eliminating tracking altogether takes away a crucial tool for disadvantaged students to move into the STEM fields and, according to the empirical evidence, hurts all students.</p>
Not So Standard Deviations: Episode 5 - IRL Roger is Totally With It
2015-12-03T09:52:47+00:00
http://simplystats.github.io/2015/12/03/not-so-standard-deviations-episode-5-irl-roger-is-totally-with-it
<p>I just posted Episode 5 of Not So Standard Deviations so check your feeds! Sorry for the long delay since the last episode but we got a bit tripped up by the Thanksgiving holiday.</p>
<p>In this episode, Hilary and I open up the mailbag and go through some of the feedback we’ve gotten on the previous episodes. The rest of the time is spent talking about the importance of reproducibility in data analysis both in academic research and in industry settings.</p>
<p>If you haven’t already, you can subscribe to the podcast through <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a>. Or you can use the <a href="http://feeds.soundcloud.com/users/soundcloud:users:174789515/sounds.rss">SoundCloud RSS feed</a> directly.</p>
<p>Notes:</p>
<ul>
<li>Hilary’s <a href="https://youtu.be/7B3n-5atLxM">talk on reproducible analysis in production</a> at the New York R Conference</li>
<li>Hilary’s <a href="https://youtu.be/zlSOckFpYqg">Ignite presentation</a> at Strata 2013</li>
<li>Roger’s <a href="https://youtu.be/aH8dpcirW1U">talk on “Computational and Policy Tools for Reproducible Research”</a> at the Applied Mathematics Perspectives Workshop in Vancouver, 2011</li>
<li>Duke Scandal <a href="http://goo.gl/rEO5QD">Starter Set</a></li>
<li><a href="https://youtu.be/7gYIs7uYbMo">Keith Baggerly’s talk</a> on Duke Scandal</li>
<li>The <a href="https://goo.gl/RtpBZa">Web of Trust</a></li>
<li><a href="https://goo.gl/MlM0gu">testdat</a> R package</li>
</ul>
<p><a href="https://api.soundcloud.com/tracks/235689361/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p>
Thinking like a statistician: the importance of investigator-initiated grants
2015-12-01T11:40:29+00:00
http://simplystats.github.io/2015/12/01/thinking-like-a-statistician-fund-more-investigator-initiated-grants
<p>A substantial amount of scientific research is funded by investigator-initiated grants. A researcher has an idea, writes it up and sends a proposal to a funding agency. The agency then elicits help from a group of peers to evaluate competing proposals. Grants are awarded to the most highly ranked ideas. The percent awarded depends on how much funding gets allocated to these types of proposals. At the NIH, the largest funding agency of these types of grants, the success rate recently <a href="https://nihdirectorsblog.files.wordpress.com/2013/09/sequestration-success-rates1.jpg">fell below 20% from a high above 35%</a>. Part of the reason these percentages have fallen is to make room for large collaborative projects. Large projects seem to be increasing, and not just at the NIH. In Europe, for example, the <a href="https://www.humanbrainproject.eu/">Human Brain Project</a> has an estimated cost of over 1 billion US$ over 10 years. To put this in perspective, 1 billion dollars can fund over 500 <a href="http://grants.nih.gov/grants/funding/r01.htm">NIH R01s</a>. R01 is the NIH mechanism most appropriate for investigator initiated proposals.</p>
<p>The merits of big science have been widely debated (for example <a href="http://www.michaeleisen.org/blog/?p=1179">here</a> and <a href="http://simplystatistics.org/2013/02/27/please-save-the-unsolicited-r01s/">here</a>), and most agree that some big projects have been successful. However, in this post I present a statistical argument highlighting the importance of investigator-initiated awards. The idea is summarized in the graph below.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/12/Rplot.png"><img class="alignnone size-full wp-image-4483" src="http://simplystatistics.org/wp-content/uploads/2015/12/Rplot.png" alt="Rplot" width="1112" height="551" srcset="http://simplystatistics.org/wp-content/uploads/2015/12/Rplot-300x149.png 300w, http://simplystatistics.org/wp-content/uploads/2015/12/Rplot-1024x507.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/12/Rplot-260x129.png 260w, http://simplystatistics.org/wp-content/uploads/2015/12/Rplot.png 1112w" sizes="(max-width: 1112px) 100vw, 1112px" /></a></p>
<p>The two panels above represent two different funding strategies: fund-many-R01s (left) or reduce R01s to fund several large projects (right). The grey crosses represent investigators and the gold dots represent potential paradigm-shifting geniuses. Location on the Cartesian plane represents research area, with the blue circles denoting areas that are prime for an important scientific advance. The largest scientific contributions occur when a gold dot falls in a blue circle. Large contributions also result from the accumulation of incremental work produced by grey crosses in the blue circles.</p>
<p>Although not perfect, the peer review approach implemented by most funding agencies appears to work quite well at weeding out unproductive researchers and unpromising ideas. Agencies also seem to do well at spreading funds across general areas. For example, NIH spreads funds across <a href="https://www.nih.gov/institutes-nih/list-nih-institutes-centers-offices">diseases and public health challenges</a> (for example cancer, mental health, genomics, and heart and lung disease) as well as <a href="https://www.nigms.nih.gov/Pages/default.aspx">general medicine</a>, <a href="https://www.genome.gov/">genomics</a> and <a href="https://www.nlm.nih.gov/">information</a>. However, precisely predicting who will be a gold dot or what specific area will be a blue circle seems like an impossible endeavor. Increasing the number of tested ideas and researchers therefore increases our chance of success. When a funding agency decides to invest big in a specific area (green dollar signs) it is predicting the location of a blue circle. As funding flows into these areas, so do investigators (note the clusters). The total number of funded lead investigators also drops. The risk here is that if the dollar sign lands far from a blue circle, we pull researchers away from potentially fruitful areas. If after 10 years of funding, the <a href="https://www.humanbrainproject.eu/">Human Brain Project</a> doesn’t <a href="https://www.humanbrainproject.eu/mission">“achieve a multi-level, integrated understanding of brain structure and function”</a> we will have missed out on trying 500 ideas by hundreds of different investigators. With a sample size this large, we expect at least a handful of these attempts to result in the type of impactful advance that justifies funding scientific research.</p>
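<p>As a back-of-the-envelope illustration of the sample-size argument (separate from the simulation behind the figure), here is a minimal sketch in R. The per-idea success probability <tt>p</tt> is a purely hypothetical value picked for illustration, not an estimate from any data, and ideas are assumed to succeed independently of one another. Only <tt>n = 500</tt> comes from the text above.</p>
<pre><code class="language-r">## Hypothetical inputs: p is an assumption, not estimated from data
n = 500    # number of R01-sized ideas that 1 billion dollars could fund
p = 0.01   # assumed chance that any single idea yields a major advance

## expected number of major advances among the n ideas
expected_hits = n * p

## probability of at least one major advance, assuming independence
p_at_least_one = 1 - (1 - p)^n

## probability of at least three ("a handful"), from the binomial distribution
## pbinom(2, ..., lower.tail = FALSE) is P(X > 2), that is P(X >= 3)
p_at_least_three = pbinom(2, size = n, prob = p, lower.tail = FALSE)

round(c(expected = expected_hits,
        at_least_one = p_at_least_one,
        at_least_three = p_at_least_three), 3)
</code></pre>
<p>Changing <tt>p</tt> shows how sensitive the conclusion is to that assumption: with many independent attempts, even a small per-idea chance adds up, whereas a single large bet succeeds or fails as a whole.</p>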
<p>The simulation presented here (code below) is clearly an oversimplification, but it does depict the statistical reason why I favor investigator-initiated grants: funding many of them is key for the continued success of scientific research.</p>
<p><tt><br /> set.seed(2)<br /> library(rafalib)<br /> library(MASS)<br /> thecol="gold3"<br /> mypar(1,2,mar=c(0.5,0.5,2,0.5))<br /> ###<br /> ## Start with the many R01s model<br /> ###<br /> ## generate location of 2,000 investigators<br /> N = 2000<br /> x = runif(N)<br /> y = runif(N)<br /> ## 1% are geniuses<br /> Ng = N*0.01<br /> g = rep(4,N);g[1:Ng]=16<br /> ## generate location of important areas of research<br /> M0 = 10<br /> x0 = runif(M0)<br /> y0 = runif(M0)<br /> r0 = rep(0.03,M0)<br /> ## make the plot<br /> nullplot(xaxt="n",yaxt="n",main="Many R01s")<br /> symbols(x0,y0,circles=r0,fg="black",bg="blue",<br /> lwd=3,add=TRUE,inches=FALSE)<br /> points(x,y,pch=g,col=ifelse(g==4,"grey",thecol))<br /> ## redraw the geniuses on top so they are not hidden<br /> points(x,y,pch=g,col=ifelse(g==4,NA,thecol))<br /> ### Generate the location of 5 big projects<br /> M1 = 5<br /> x1 = runif(M1)<br /> y1 = runif(M1)<br /> ## make initial plot<br /> nullplot(xaxt="n",yaxt="n",main="A Few Big Projects")<br /> symbols(x0,y0,circles=r0,fg="black",bg="blue",<br /> lwd=3,add=TRUE,inches=FALSE)<br /> ### Generate location of investigators attracted<br /> ### to location of big projects. There are 1000 total<br /> ### investigators (200 per project)<br /> Sigma = diag(2)*0.005<br /> N1 = 200<br /> Ng1 = round(N1*0.01)<br /> g1 = rep(4,N1);g1[1:Ng1]=16<br /> for(i in 1:M1){<br /> xy = mvrnorm(N1,c(x1[i],y1[i]),Sigma)<br /> points(xy[,1],xy[,2],pch=g1,col=ifelse(g1==4,"grey",thecol))<br /> }<br /> ### generate location of investigators that ignore big projects<br /> ### note now 500 instead of 2,000. Note overall total<br /> ## is also less because large projects result in fewer<br /> ## lead investigators<br /> N = 500<br /> x = runif(N)<br /> y = runif(N)<br /> Ng = N*0.01<br /> g = rep(4,N);g[1:Ng]=16<br /> points(x,y,pch=g,col=ifelse(g==4,"grey",thecol))<br /> points(x1,y1,pch="$",col="darkgreen",cex=2,lwd=2)<br /> </tt></p>
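<p>The figure shows a single random draw. One way to summarize the same toy model numerically is to repeat it many times and count how many geniuses land inside a blue circle under each strategy. The sketch below is an illustration under stated assumptions, not part of the original analysis: it tracks only the geniuses (20 under the many-R01s strategy; 2 per big project plus 5 independents under the big-project strategy, matching the 1% rates in the code above), the number of repetitions (1000) is arbitrary, and it uses <tt>rnorm</tt> in place of <tt>mvrnorm</tt>, which is equivalent here because <tt>Sigma</tt> is diagonal with equal variances. The helper names <tt>in_circle</tt> and <tt>one_run</tt> are made up for this example.</p>
<pre><code class="language-r">set.seed(1)

## TRUE for each point (x, y) that lies within distance r of any circle center
in_circle = function(x, y, x0, y0, r) {
  sapply(seq_along(x), function(i) any(r^2 >= (x[i] - x0)^2 + (y[i] - y0)^2))
}

## one repetition of the toy model, returning genius hits under each strategy
one_run = function() {
  ## promising areas: 10 circles of radius 0.03, as in the plot
  M0 = 10
  r0 = 0.03
  x0 = runif(M0)
  y0 = runif(M0)

  ## strategy 1: 20 geniuses (1% of 2,000) spread uniformly
  many = sum(in_circle(runif(20), runif(20), x0, y0, r0))

  ## strategy 2: 5 big projects attract 2 geniuses each (1% of 200),
  ## plus 5 geniuses (1% of 500) who ignore the big projects
  x1 = runif(5)
  y1 = runif(5)
  s = sqrt(0.005)  # standard deviation implied by Sigma = diag(2)*0.005
  gx = c(rep(x1, each = 2) + rnorm(10, sd = s), runif(5))
  gy = c(rep(y1, each = 2) + rnorm(10, sd = s), runif(5))
  few = sum(in_circle(gx, gy, x0, y0, r0))

  c(many_R01s = many, few_big_projects = few)
}

## repeat the experiment and average the number of hits per strategy
res = replicate(1000, one_run())
rowMeans(res)
</code></pre>
<p>In this sketch the big projects are placed independently of the blue circles, so any gap between the two averages mostly reflects the smaller number of funded lead investigators, which is the mechanism the post describes. A funding agency with real foresight about where the blue circles are would change the picture, and that is exactly the assumption the post questions.</p>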
A thanksgiving dplyr Rubik's cube puzzle for you
2015-11-25T12:14:06+00:00
http://simplystats.github.io/2015/11/25/a-thanksgiving-dplyr-rubiks-cube-puzzle-for-you
<p><a href="http://nickcarchedi.com/">Nick Carchedi</a> is back visiting from <a href="https://www.datacamp.com/">DataCamp</a> and for fun we came up with a <a href="https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html">dplyr</a> Rubik’s cube puzzle. Here is how it works. To solve the puzzle you have to make a 4 x 3 data frame that spells Thanksgiving like this:</p>
<div class="oembed-gist">
<noscript>
View the code on <a href="https://gist.github.com/jtleek/4d4b63a035973231e6d4">Gist</a>.
</noscript>
</div>
<p>To solve the puzzle you need to pipe this data frame in</p>
<div class="oembed-gist">
<noscript>
View the code on <a href="https://gist.github.com/jtleek/aae1218a8f4d1220e07d">Gist</a>.
</noscript>
</div>
<p>and pipe out the Thanksgiving data frame using only the dplyr commands <em>arrange</em>, <em>mutate</em>, <em>slice</em>, <em>filter</em> and <em>select</em>. For advanced users you can try our slightly more complicated puzzle:</p>
<div class="oembed-gist">
<noscript>
View the code on <a href="https://gist.github.com/jtleek/b82531d9dac78ba3c60a">Gist</a>.
</noscript>
</div>
<p>See if you can do it <a href="http://www.theguardian.com/technology/video/2015/nov/24/boy-completes-rubiks-cube-in-49-seconds-word-recordvideo">this fast</a>. Post your solutions in the comments and Happy Thanksgiving!</p>
20 years of Data Science: from Music to Genomics
2015-11-24T10:00:56+00:00
http://simplystats.github.io/2015/11/24/20-years-of-data-science-and-data-driven-discovery-from-music-to-genomics
<p>I finally got around to reading David Donoho’s <a href="https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf">50 Years of Data Science</a> paper. I highly recommend it. The following quote seems to summarize the sentiment that motivated the paper, as well as why it has resonated among academic statisticians:</p>
<div class="page" title="Page 5">
<div class="layoutArea">
<div class="column">
<blockquote>
<p>
The statistics profession is caught at a confusing moment: the activities which preoccupied it over centuries are now in the limelight, but those activities are claimed to be bright shiny new, and carried out by (although not actually invented by) upstarts and strangers.
</p>
</blockquote>
</div>
</div>
</div>
<p>We started this blog over four years ago because, as Jeff wrote in his inaugural post, we were “<a href="http://simplystatistics.org/2011/09/07/first-things-first/">fired up about the new era where data is abundant and statisticians are scientists</a>”. It was clear that many disciplines were becoming data-driven and that interest in data analysis was growing rapidly. We were further motivated because, despite this <a href="http://simplystatistics.org/2014/09/15/applied-statisticians-people-want-to-learn-what-we-do-lets-teach-them/">newfound interest in our work</a>, academic statisticians were, in general, more interested in the development of context-free methods than in leveraging applied statistics to take <a href="http://simplystatistics.org/2012/06/22/statistics-and-the-science-club/">leadership roles</a> in data-driven projects. Meanwhile, great and highly visible applied statistics work was occurring in other fields such as astronomy, computational biology, computer science, political science and economics. So it was not completely surprising that some (bio)statistics departments were being left out of larger university-wide data science initiatives. Some of <a href="http://simplystatistics.org/2014/07/25/academic-statisticians-there-is-no-shame-in-developing-statistical-solutions-that-solve-just-one-problem/">our</a> <a href="http://simplystatistics.org/2013/04/15/data-science-only-poses-a-threat-to-biostatistics-if-we-dont-adapt/">posts</a> exhorted academic departments to embrace larger numbers of applied statisticians:</p>
<blockquote>
<p>[M]any of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none. By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.</p>
</blockquote>
<p>Donoho points out that John Tukey had a similar preoccupation 50 years ago:</p>
<div class="page" title="Page 10">
<div class="layoutArea">
<div class="column">
<blockquote>
<p>
For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. ... All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data
</p>
</blockquote>
<p>
Many applied statisticians do the things Tukey mentions above. In the blog we have encouraged them to <a href="http://simplystatistics.org/2014/09/15/applied-statisticians-people-want-to-learn-what-we-do-lets-teach-them/">teach the gory details of what they do</a>, along with the general methodology we currently teach. With all this in mind, several months ago, when I was invited to give a talk at a department that was, at the time, deciphering its role in its university's data science initiative, I gave a talk titled <em>20 years of Data Science: from Music to Genomics</em>. The goal was to explain why <em>applied statistician</em> is not considered synonymous with <em>data scientist</em> even when we focus on the same goal: <a href="https://en.wikipedia.org/wiki/Data_science">extract knowledge or insights from data.</a>
</p>
<p>
The first example in the talk related to how academic applied statisticians tend to emphasize the parts that will be most appreciated by our math stat colleagues and ignore the aspects that are today being heralded as the linchpins of data science. I used my thesis papers as examples. <a href="http://archive.cnmat.berkeley.edu/Research/1998/Rafael/tesis.pdf">My dissertation work</a> was about finding a meaningful parametrization of musical sound signals that<img class="wp-image-4449 alignright" src="http://www.biostat.jhsph.edu/~ririzarr/Demo/img7.gif" alt="Spectrogram" width="380" height="178" /> my collaborators could use to manipulate sounds to create new ones. To do this, I prepared a database of sounds, wrote code to extract and import the digital representations from CDs into S-plus (yes, I'm that old), visualized the data to motivate models, wrote code in C (or was it Fortran?) to make the analysis go faster, and tested these models with residual analysis by ear (you can listen to them <a href="http://www.biostat.jhsph.edu/~ririzarr/Demo/">here</a>). None of these data science aspects were highlighted in the <a href="http://www3.stat.sinica.edu.tw/statistica/oldpdf/A10n42.pdf">papers</a> <a href="http://www.tandfonline.com/doi/abs/10.1198/000313001300339969#.Vk4_ht-rQUE">I</a> <a href="http://www.tandfonline.com/doi/abs/10.1198/016214501750332875#.Vk4_mN-rQUE">wrote</a> <a href="http://www.tandfonline.com/doi/abs/10.1198/016214501753168082#.Vk4_qt-rQUE">about</a> my <a href="http://onlinelibrary.wiley.com/doi/10.1111/1467-9892.01515/abstract?userIsAuthenticated=false&deniedAccessCustomisedMessage=">thesis</a>. Here is a screenshot from <a href="http://onlinelibrary.wiley.com/doi/10.1111/1467-9892.01515/abstract">this paper</a>:
</p>
</div>
</div>
</div>
<p><a href="http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM.png"><img class="wp-image-4449 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM.png" alt="Screen Shot 2015-04-15 at 12.24.40 PM" width="320" height="342" srcset="http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM-957x1024.png 957w, http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM-187x200.png 187w, http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM.png 1204w" sizes="(max-width: 320px) 100vw, 320px" /></a></p>
<p>I am actually glad I wrote out and published all the technical details of this work. It was great training. My point was simply that based on the focus of these papers, this work would not be considered data science.</p>
<p>The rest of my talk described some of the work I did once I transitioned into applications in biology. I was fortunate to have a <a href="http://www.jhsph.edu/faculty/directory/profile/3859/scott-zeger">department chair</a> who appreciated lead-author papers in the subject matter journals as much as statistical methodology papers. This opened the door for me to become a full-fledged applied statistician/data scientist. In the talk I described how <a href="http://bioinformatics.oxfordjournals.org/content/20/3/307.short">developing software packages,</a> <a href="http://www.nature.com/nmeth/journal/v2/n5/abs/nmeth756.html">planning</a> the <a href="http://www.nature.com/nmeth/journal/v4/n11/abs/nmeth1102.html">gathering of data</a> to <a href="http://www.ncbi.nlm.nih.gov/pubmed/?term=16108723">aid method development</a>, developing <a href="http://www.ncbi.nlm.nih.gov/pubmed/14960458">web tools</a> to assess data analysis techniques in the wild, and facilitating <a href="http://www.ncbi.nlm.nih.gov/pubmed/19151715">data-driven discovery</a> in biology have been very gratifying and, simultaneously, helped my career. However, at some point early in my career, senior members of my department encouraged me to write and submit a methods paper to a statistical journal to go along with every paper I sent to the subject matter journals. Although I do write methods papers when I think the ideas add to the statistical literature, I did not follow the advice to simply write papers for the sake of publishing in statistics journals. Note that if (bio)statistics departments require applied statisticians to do this, then it becomes harder to have an impact as data scientists. Departments that are not producing widely used methodology or successful and visible applied statistics projects (or both) should not be surprised when they are not included in data science initiatives.
So, applied statistician, read that Tukey quote again, listen to <a href="https://youtu.be/vbb-AjiXyh0">President Obama</a>, and go do some great data science.</p>
Some Links Related to Randomized Controlled Trials for Policymaking
2015-11-19T12:49:03+00:00
http://simplystats.github.io/2015/11/19/some-links-related-to-randomized-controlled-trials-for-policymaking
<div>
<p>
In response to <a href="http://simplystatistics.org/2015/11/17/why-are-randomized-trials-not-used-by-policymakers/">my previous post</a>, <a href="https://gspp.berkeley.edu/directories/faculty/avi-feller">Avi Feller</a> sent me these links related to efforts promoting the use of RCTs and evidence-based approaches for policymaking:
</p>
<ul>
<li>
The theme of this year's just-concluded APPAM conference (the national public policy research organization) was "evidence-based policymaking," with a headline panel on using experiments in policy (see <a href="http://www.appam.org/events/fall-research-conference/2015-fall-research-conference-information/" target="_blank">here</a> and <a href="http://www.appam.org/2015appam-student-summary-using-experiments-for-evidence-based-policy-lessons-from-the-private-sector/" target="_blank">here</a>).
</li>
<li>
Jeff Liebman has written extensively about the use of randomized experiments in policy (see <a href="http://govinnovator.com/ten_year_challenge/" target="_blank">here</a> for a recent interview).
</li>
<li>
The White House now has an entire office devoted to running randomized trials to improve government performance (the so-called "nudge unit"). Check out their recent annual report <a href="https://www.whitehouse.gov/sites/default/files/microsites/ostp/sbst_2015_annual_report_final_9_14_15.pdf" target="_blank">here</a>.
</li>
<li>
JPAL North America just launched a major initiative to help state and local governments run randomized trials (see <a href="https://www.povertyactionlab.org/about-j-pal/news/j-pal-north-america-state-and-local-innovation-initiative-release" target="_blank">here</a>).
</li>
</ul>
</div>
Given the history of medicine, why are randomized trials not used for social policy?
2015-11-17T10:42:24+00:00
http://simplystats.github.io/2015/11/17/why-are-randomized-trials-not-used-by-policymakers
<p>Policy changes can have substantial societal effects. For example, clean water and hygiene policies have saved millions, if not billions, of lives. But effects are not always positive. For example, <a href="https://en.wikipedia.org/wiki/Prohibition_in_the_United_States">prohibition</a>, or the “noble experiment”, boosted organized crime, slowed economic growth and increased deaths caused by tainted liquor. Good intentions do not guarantee desirable outcomes.</p>
<p>The medical establishment is well aware of the danger of basing decisions on the good intentions of doctors or biomedical researchers. For this reason, randomized controlled trials (RCTs) are the standard approach to determining if a new treatment is safe and effective. In these trials an objective assessment is achieved by assigning patients at random to a treatment or control group, and then comparing the outcomes in these two groups. Probability calculations are used to summarize the evidence in favor or against the new treatment. Modern RCTs are considered <a href="http://abcnews.go.com/Health/TenWays/story?id=3605442&page=1">one of the greatest medical advances of the 20th century</a>.</p>
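<p>To make the logic of random assignment and the accompanying probability calculation concrete, here is a minimal sketch in R. Everything in it is an assumption made for illustration: the outcomes are simulated with no true treatment effect, the sample size of 100 is arbitrary, and a simple re-randomization (permutation) test stands in for whatever analysis a real trial would pre-specify.</p>

```r
set.seed(1)
## hypothetical trial: 100 simulated patients, no true treatment effect
n = 100
outcome = rnorm(n)
## assign exactly half of the patients to treatment, at random
treated = sample(rep(c(TRUE, FALSE), each = n / 2))
## observed difference in mean outcome between the two groups
observed = mean(outcome[treated]) - mean(outcome[!treated])
## re-randomize many times to see what differences chance alone produces
null = replicate(2000, {
  shuffled = sample(treated)
  mean(outcome[shuffled]) - mean(outcome[!shuffled])
})
## proportion of re-randomizations at least as extreme as the observed difference
pval = mean(abs(null) >= abs(observed))
pval
```

<p>Because the groups are formed by chance rather than by anyone's judgment, the re-randomized differences describe what we would expect to see if the treatment did nothing, and the observed difference can be compared against them.</p>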
<p>Despite their unprecedented success in medicine, RCTs have not been fully adopted outside of scientific fields. In <a href="http://www.badscience.net/2011/05/we-should-so-blatantly-do-more-randomised-trials-on-policy/">this post</a>, Ben Goldacre advocates for politicians to learn from scientists and base policy decisions on RCTs. He provides several examples in which results contradicted conventional wisdom. In <a href="https://www.ted.com/talks/esther_duflo_social_experiments_to_fight_poverty?language=en">this TED talk</a> Esther Duflo convincingly argues that RCTs should be used to determine what interventions are best at fighting poverty. Although some RCTs are being conducted, they are still rare and oftentimes ignored by policymakers. For example, despite at least <a href="http://peabody.vanderbilt.edu/research/pri/VPKthrough3rd_final_withcover.pdf">two</a> <a href="http://www.acf.hhs.gov/sites/default/files/opre/executive_summary_final.pdf">RCT</a>s finding that universal pre-K programs are not effective, policymakers in New York <a href="http://www.npr.org/sections/ed/2015/09/08/438584249/new-york-city-mayor-goes-all-in-on-free-preschool">are implementing a $400 million a year program</a>. Supporters of this noble endeavor defend their decision by pointing to observational studies and “expert” opinion that support their preconceived views. Before the 1950s, indifference to RCTs was common among medical doctors as well, and the outcomes were at times devastating.</p>
<p>Today, when we <a href="http://www.ncbi.nlm.nih.gov/pubmed/7058834">compare conclusions from non-RCT studies to RCTs</a>, we note the unintended strong effects that preconceived notions can have. The first chapter in <a href="http://www.amazon.com/Statistics-4th-Edition-David-Freedman/dp/0393929728">this book</a> provides a summary and some examples. One example comes from <a href="http://www.jameslindlibrary.org/grace-nd-muench-h-chalmers-tc-1966/">a study</a> of 51 studies on the effectiveness of the portacaval shunt. Here is a table summarizing the conclusions of the 51 studies:</p>
<table>
<tr>
<td>
Design
</td>
<td>
Marked Improvement
</td>
<td>
Moderate Improvement
</td>
<td>
None
</td>
</tr>
<tr>
<td>
No control
</td>
<td>
24
</td>
<td>
7
</td>
<td>
1
</td>
</tr>
<tr>
<td>
Controls, but not randomized
</td>
<td>
10
</td>
<td>
3
</td>
<td>
2
</td>
</tr>
<tr>
<td>
Randomized
</td>
<td>
0
</td>
<td>
1
</td>
<td>
3
</td>
</tr>
</table>
<p>Compare the Marked Improvement and None columns to appreciate the importance of the randomized trials.</p>
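<p>The comparison can be made explicit by computing, for each design, the share of studies that reported marked improvement. This is a small sketch using the counts from the table; the blank Randomized / Marked Improvement cell is assumed to be 0, which is consistent with the rows adding up to 51 studies. The object names are made up for the example.</p>

```r
## counts copied from the table; the blank Randomized / Marked cell is assumed to be 0
shunt = matrix(c(24, 7, 1,
                 10, 3, 2,
                  0, 1, 3),
               nrow = 3, byrow = TRUE,
               dimnames = list(design = c("No control", "Controls, not randomized", "Randomized"),
                               conclusion = c("Marked", "Moderate", "None")))
## total number of studies
sum(shunt)
## share of studies of each design that reported marked improvement
prop_marked = shunt[, "Marked"] / rowSums(shunt)
round(prop_marked, 2)
```

<p>Under that assumption, marked improvement was reported by 24 of 32 studies with no controls, 10 of 15 studies with non-randomized controls, and 0 of 4 randomized studies.</p>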
<p>A particularly troubling example relates to the studies on Diethylstilbestrol (DES). DES is a drug that was used to prevent spontaneous abortions. Five out of five studies using historical controls found the drug to be effective, yet all three randomized trials found the opposite. Before the randomized trials convinced doctors to stop using this drug, it was given to thousands of women. This turned out to be a tragedy as later studies showed DES has <a href="http://diethylstilbestrol.co.uk/des-side-effects/">terrible side effects</a>. Despite the doctors having the best intentions in mind, ignoring the randomized trials resulted in unintended consequences.</p>
<p>Well-meaning experts are regularly implementing policies without really testing their effects. Although randomized trials are not always possible, it seems that they are rarely considered, in particular when the intentions are noble. Just as well-meaning turn-of-the-20th-century doctors, convinced that they were doing good, put their patients at risk by providing ineffective treatments, well-intentioned policies may end up hurting society.</p>
<p><strong>Update: </strong>A reader pointed me to <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2534811">these</a> <a href="http://eml.berkeley.edu//~crwalters/papers/kline_walters.pdf">preprints</a> which point out that the control group in <a href="http://www.acf.hhs.gov/sites/default/files/opre/executive_summary_final.pdf">one of the cited</a> early education RCTs included children who receive care in a range of different settings, not just staying at home. This implies that the signal is attenuated if what we want to know is whether the program is effective for children who would otherwise stay at home. In <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2534811">this preprint</a> they use statistical methodology (principal stratification framework) to obtain separate estimates: the effect for children who would otherwise go to other center-based care and the effect for children who would otherwise stay at home. They find no effect for the former group but a significant effect for the latter. Note that in this analysis the effect being estimated is no longer based on groups assigned at random. Instead, model assumptions are used to infer the two effects. To avoid dependence on these assumptions we will have to perform an RCT with better defined controls. Also note that the RCT data facilitated the principal stratification framework analysis. I also want to restate what <a href="http://simplystatistics.org/2014/04/17/correlation-does-not-imply-causation-parental-involvement-edition/">I’ve posted before</a>: “I am not saying that observational studies are uninformative. If properly analyzed, observational data can be very valuable. For example, the data supporting smoking as a cause of lung cancer is all observational. Furthermore, there is an entire subfield within statistics (referred to as causal inference) that develops methodologies to deal with observational data.
But unfortunately, observational data are commonly misinterpreted.”</p>
So you are getting crushed on the internet? The new normal for academics.
2015-11-16T09:49:04+00:00
http://simplystats.github.io/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics
<p>Roger and I were just talking about all the discussion around the <a href="http://www.pnas.org/content/early/2015/10/29/1518393112.full.pdf">Case and Deaton paper</a> on death rates for middle-aged white Americans. Andrew Gelman, among many others, <a href="http://www.slate.com/articles/health_and_science/science/2015/11/death_rates_for_white_middle_aged_americans_are_not_increasing.html">discussed it</a>. He noticed a potential bias in the analysis and did some re-analysis. Just yesterday <a href="http://noahpinionblog.blogspot.com/2015/11/gelman-vs-case-deaton-academics-vs.html">Noah Smith</a> wrote a piece about academics versus blogs and how many academics are taken by surprise when they see their paper being discussed so rapidly on the internet. Much of the debate comes down to the speed, tone, and ferocity of internet discussion of academic work - along with the fact that sometimes it isn’t fully fleshed out.</p>
<p>I have been seeing this play out not just in the case of this specific paper, but many times when folks have been confronted with blogs or the quick publication process of <a href="http://f1000research.com/">f1000Research</a>. I think it is pretty scary for folks who aren’t used to “internet speed” to see this play out and I thought it would be helpful to make a few points.</p>
<ol>
<li><strong>Everyone is an internet scientist now.</strong> The internet has arrived as part of academia and if you publish a paper that is of interest (or if you are a Nobel prize winner, or if you dispute a claim, etc.) you will see discussion of that paper within a day or two on the blogs. This is now a fact of life.</li>
<li><strong>The internet loves a fight</strong>. The internet responds best to personal/angry blog posts or blog posts about controversial topics like p-values, errors, and bias. Almost certainly if someone writes a blog post about your work or an f1000 paper it will be about an error/bias/correction or something personal.</li>
<li><strong>Takedowns are easier than new research and happen faster</strong>. It is much, much easier to critique a paper than to design an experiment, collect data, figure out what question to ask, ask it quantitatively, analyze the data, and write it up. This doesn’t mean the critique won’t be good/right; it just means it will happen much, much faster than it took you to publish the paper because it is easier to do. All it takes is noticing one little bug in the code or one error in the regression model. So be prepared for speed in the response.</li>
</ol>
<p>In light of these three things, you have a few options about how to react if you write an interesting paper and people are discussing it - which they will certainly do (point 1), in a way that will likely make you uncomfortable (point 2), and faster than you’d expect (point 3). The first thing to keep in mind is that the internet wants you to “fight back” and wants to declare a “winner”. Reading about amicable disagreements doesn’t build an audience. That is why there is reality TV. So there will be pressure for you to score points, be clever, be fast, and refute every point or be declared the loser. I have found from my own experience that this is what I feel like doing too. I think that resisting this urge is both (a) very, very hard and (b) the right thing to do. I find the best solution is to be proud of your work, but be humble, because no paper is perfect and that’s OK. If you do the best you can, sensible people will acknowledge that.</p>
<p>I think these are the three ways to respond to rapid internet criticism of your work.</p>
<ul>
<li><strong>Option 1: Respond on internet time.</strong> This means if you publish a big paper that you think might be controversial you should block off a day or two to spend time on the internet responding. You should be ready to do new analysis quickly, be prepared to admit mistakes quickly if they exist, and you should be prepared to make it clear when there aren’t. You will need social media accounts and you should probably have a blog so you can post longer form responses. GitHub/Figshare accounts make it easier to quickly share quantitative/new analyses. Again your goal is to avoid the personal and stick to facts, so I find that Twitter/Facebook are best for disseminating your more long form responses on blogs/GitHub/Figshare. If you are going to go this route you should try to respond to as many of the major criticisms as possible, but usually they cluster into one or two specific comments, which you can address all in one.</li>
<li><strong>Option 2: Respond in academic time.</strong> You might have spent a year writing a paper only to have people respond to it essentially instantaneously. Sometimes they will have good points, but they will rarely have carefully thought out arguments given the internet-speed response (although remember point 3 that good critiques can be faster than good papers). One approach is to collect all the feedback, ignore the pressure for an immediate response, and write a careful, scientific response which you can publish in a journal or in a fast outlet like f1000Research. I think this route can be the most scientific and productive if executed well. But this will be hard because people will treat that like “you didn’t have a good answer so you didn’t respond immediately”. The internet wants a quick winner/loser and that is terrible for science. Even if you choose this route though, you should make sure you have a way of publicizing your well thought out response - through blogs, social media, etc. once it is done.</li>
<li><strong>Option 3: Do not respond.</strong> This is what a lot of people do and I’m unsure if it is ok or not. Clearly internet facing commentary can have an impact on you/your work/how it is perceived for better or worse. So if you ignore it, you are ignoring those consequences. This may be ok, but depending on the severity of the criticism may be hard to deal with and it may mean that you have a lot of questions to answer later. Honestly, I think as time goes on if you write a big paper under a lot of scrutiny Option 3 is going to go away.</li>
</ul>
<p>All of this only applies if you write a paper that a ton of people care about/is controversial. Many technical papers won’t have this issue and if you keep your claims small, this also probably won’t apply. But I thought it was useful to try to work out how to act under this “new normal”.</p>
Prediction Markets for Science: What Problem Do They Solve?
2015-11-10T20:29:19+00:00
http://simplystats.github.io/2015/11/10/prediction-markets-for-science-what-problem-do-they-solve
<p>I’ve recently seen a bunch of press on <a href="http://www.pnas.org/content/early/2015/11/04/1516179112.abstract">this paper</a>, which describes an experiment with developing a prediction market for scientific results. From FiveThirtyEight:</p>
<blockquote>
<p>Although <a href="http://fivethirtyeight.com/datalab/psychology-is-starting-to-deal-with-its-replication-problem/">replication is essential for verifying results</a>, the <a href="http://fivethirtyeight.com/features/science-isnt-broken/">current scientific culture does little to encourage it in most fields</a>. That’s a problem because it means that misleading scientific results, like those from the “shades of gray” study, <a href="http://pss.sagepub.com/content/22/11/1359.short?rss=1&ssource=mfr">could be common in the scientific literature</a>. Indeed, a 2005 study claimed that <a href="http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124">most published research findings are false.</a></p>
<p>[…]</p>
<p>The researchers began by selecting some studies slated for replication in the <a href="https://osf.io/ezcuj/wiki/home/">Reproducibility Project: Psychology</a> — a project that aimed to reproduce 100 studies published in three high-profile psychology journals in 2008. They then recruited psychology researchers to take part in <a href="https://osf.io/yjmht/">two prediction markets</a>. These are the same types of markets that people use <a href="http://www.nytimes.com/2015/10/24/upshot/betting-markets-call-marco-rubio-front-runner-in-gop.html?_r=0">to bet on who’s going to be president</a>. In this case, though, researchers were betting on whether a study would replicate or not.</p>
</blockquote>
<p>There are all kinds of prediction markets these days–for politics, general ideas–so having one for scientific ideas is not too controversial. But I’m not sure I see exactly what problem is solved by having a prediction market for science. In the paper, they claim that the market-based bets were better predictors than the general survey that was administered to the scientists. I’ll admit that’s an interesting result, but I’m not yet convinced.</p>
<p>First off, it’s worth noting that this work comes out of the massive replication project conducted by the Center for Open Science, where I believe they <a href="http://simplystatistics.org/2015/10/01/a-glass-half-full-interpretation-of-the-replicability-of-psychological-science/">have a</a> <a href="http://simplystatistics.org/2015/10/20/we-need-a-statistically-rigorous-and-scientifically-meaningful-definition-of-replication/">fundamentally flawed definition of replication</a>. So I’m not sure I can really agree with the idea of basing a prediction market on such a definition, but I’ll let that go for now.</p>
<p>The purpose of most markets is some general notion of “price discovery”. One popular market is the stock market and I think it’s instructive to see how that works. Basically, people continuously bid on the shares of certain companies and markets keep track of all the bids/offers and the completed transactions. If you are interested in finding out what people are willing to pay for a share of Apple, Inc., then it’s probably best to look at…what people are willing to pay. That’s exactly what the stock market gives you. You only run into trouble when there’s no liquidity, so no one shows up to bid/offer, but that would be a problem for any market.</p>
<p>Now, suppose you’re interested in finding out what the “true fundamental value” of Apple, Inc. is. Some people think the stock market gives you that at every instant, while <a href="http://www.econ.yale.edu/~shiller/">others</a> think that the stock market can behave irrationally for long periods of time. Perhaps in the very long run, you get a sense of the fundamental value of a company, but that may not be useful information at that point.</p>
<p>What does the market for scientific hypotheses give you? Well, it would be one thing if granting agencies participated in the market. Then, we would never have to write grant applications. The granting agencies could then signal what they’d be willing to pay for different ideas. But that’s not what we’re talking about.</p>
<p>Here, we’re trying to get at whether a given hypothesis is <em>true or not</em>. The only real way to get information about that is to conduct an experiment. How many people betting in the markets will have conducted an experiment? Likely the minority, given that the whole point is to save money by not having people conduct experiments investigating hypotheses that are likely false.</p>
<p>But if market participants aren’t contributing real information about an hypothesis, what are they contributing? Well, they’re contributing their <em>opinion</em> about an hypothesis. How is that related to science? I’m not sure. Of course, participants could be experts in the field (although not necessarily) and so their opinions will be informed by past results. And ultimately, it’s consensus amongst scientists that determines, after repeated experiments, whether an hypothesis is true or not. But at the early stages of investigation, it’s not clear how valuable people’s opinions are.</p>
<p>In a way, this reminds me of a time a while back when the EPA was soliciting “expert opinion” about the health effects of outdoor air pollution, as if that were a reasonable substitute for collecting actual data on the topic. At least it cost less money–just the price of a conference call.</p>
<p>There’s a version of this playing out in the health tech market right now. Companies like <a href="http://simplystatistics.org/2015/10/28/discussion-of-the-theranos-controversy-with-elizabeth-matsui/">Theranos</a> and 23andMe are selling health products that they claim are better than some current benchmark. In particular, Theranos claims its blood tests are accurate when only using a tiny sample of blood. Is this claim true or not? No one outside Theranos knows for sure, but we can look to the financial markets.</p>
<p>Theranos can point to the marketplace and show that people are willing to pay for its products. Indeed, the $9 billion valuation of the private company is another indicator that people…highly value the company. But ultimately, <em>we still don’t know if their blood tests are accurate</em> because we don’t have any data. If we were to go by the financial markets alone, we would necessarily conclude that their tests are good, because why else would anyone invest so much money in the company?</p>
<p>I think there may be a role to play for prediction markets in science, but I’m not sure discovering the truth about nature is one of them.</p>
Biostatistics: It's not what you think it is
2015-11-09T10:00:20+00:00
http://simplystats.github.io/2015/11/09/biostatistics-its-not-what-you-think-it-is
<p><a href="http://www.hsph.harvard.edu/biostatistics">My department</a> recently sent me on a recruitment trip for our graduate program. I had the opportunity to chat with undergrads interested in pursuing a career related to data analysis. I found that several did not know about the existence of Departments of <em>Biostatistics</em> and most of the rest thought <em>Biostatistics</em> was the study of clinical trials. We <a href="http://simplystatistics.org/2012/08/14/statistics-statisticians-need-better-marketing/">have</a> <a href="http://simplystatistics.org/2011/11/02/we-need-better-marketing/">posted</a> on the need for better marketing for Statistics, but Biostatistics needs it even more. So this post is for students who are considering a career as applied statisticians or data scientists and are looking at PhD programs.</p>
<p>There are dozens of Biostatistics departments and most run PhD programs. As an undergraduate, you may have never heard of them because they are usually in schools that undergrads don’t regularly frequent: Public Health and Medicine. However, they are very active in research and teaching graduate students. In fact, the 2014 US News & World Report <a href="http://US News and R">ranking of Statistics Departments</a> includes three Biostat departments in the top five spots. Although clinical trials are a popular area of interest in these departments, there are now many other areas of research. With so many fields of science shifting to data-intensive research, Biostatistics has adapted to work in these areas. Today pretty much any Biostat department will have people working on projects related to genetics, genomics, computational biology, electronic medical records, neuroscience, environmental sciences, epidemiology, health-risk analysis, and clinical decision making. Through collaborations, academic biostatisticians have early access to the cutting edge datasets produced by public health scientists and biomedical researchers. Our research usually revolves around either developing statistical methods that are used by researchers working in these fields or working directly with a collaborator in data-driven discovery.</p>
<p><strong>How is it different from Statistics? </strong>In the grand scheme of things, they are not very different. As implied by the name, Biostatisticians focus on data related to biology while statisticians tend to be more general. However, the underlying theory and skills we learn are similar. In my view, the major difference is that Biostatisticians, in general, tend to be more interested in data and the subject matter, while in Statistics Departments more emphasis is given to the mathematical theory.</p>
<p><strong>What type of job can I get with a PhD in Biostatistics? </strong><a href="http://fortune.com/2015/04/27/best-worst-graduate-degrees-jobs/">A well paying one</a>. And you will have many options to choose from. Our graduates tend to go to academia, industry or government. Also, the <strong>Bio </strong>in the name does not keep our graduates from landing non-bio related jobs, such as in high tech. The reason for this is that the training our students receive and what they learn from research experiences can be widely applied to data analysis challenges.</p>
<p><strong>How should I prepare if I want to apply to a PhD program?</strong> First you need to decide if you are going to like it. One way to do this is to participate in one of the <a href="http://www.nhlbi.nih.gov/research/training/summer-institute-biostatistics-t15">summer programs</a> where you get a glimpse of what we do. My department runs <a href="http://www.hsph.harvard.edu/biostatistics/diversity/summer-program/">one of these as well</a>. However, as an undergrad I would mainly focus on courses. Undergraduate research experiences are a good way to get an idea of what it’s like, but it is difficult to do real research unless you can set aside several hours a week for several consecutive months. This is difficult as an undergrad because you have to make sure to do well in your courses, prepare for the GRE, and get a solid mathematical and computing foundation in order to conduct research later. This is why these programs are usually in the summer. If you decide to apply to a PhD program, I recommend you take advanced math courses such as Real Analysis and Matrix Algebra. If you plan to develop software for complex datasets, I recommend CS courses that cover algorithms and optimization. Note that programming skills are not the same thing as the theory taught in these CS courses. Programming skills in R will serve you well if you plan to analyze data regardless of what academic route you follow. Python and a low-level language such as C++ are more powerful languages that many biostatisticians use these days.</p>
<p>I think the demand for well-trained researchers that can make sense of data will continue to be on the rise. If you want a fulfilling job where you analyze data for a living, you should consider a PhD in Biostatistics.</p>
Not So Standard Deviations: Episode 4 - A Gajillion Time Series
2015-11-07T11:46:49+00:00
http://simplystats.github.io/2015/11/07/not-so-standard-deviations-episode-4-a-gajillion-time-series
<p>Episode 4 of Not So Standard Deviations is hot off the audio editor. In this episode Hilary first explains to me what the heck DevOps is and then we talk about the statistical challenges in detecting rare events in an enormous set of time series data. There’s also some discussion of Ben and Jerry’s and the t-test, so you’ll want to hang on for that.</p>
<p>Notes:</p>
<ul>
<li><a href="https://goo.gl/259VKI">Nobody Loves Graphite Anymore</a></li>
<li><a href="http://goo.gl/zB7wM9">A response</a></li>
<li><a href="https://goo.gl/7PgLKY">Why Gosset is awesome</a></li>
</ul>
How I decide when to trust an R package
2015-11-06T13:41:02+00:00
http://simplystats.github.io/2015/11/06/how-i-decide-when-to-trust-an-r-package
<p>One thing that I’ve given a lot of thought to recently is the process that I use to decide whether I trust an R package or not. Kasper Hansen took a break from <a href="https://twitter.com/KasperDHansen/status/657589509975076864">trolling me</a> <a href="https://twitter.com/KasperDHansen/status/621315346633519104">on Twitter</a> to talk about how he trusts packages on Github less than packages that are on CRAN and particularly Bioconductor. He makes a couple of points that I think are very relevant. First, that having a package on CRAN/Bioconductor raises trust in that package:</p>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
.<a href="https://twitter.com/michaelhoffman">@michaelhoffman</a> But it's not on Bioconductor or CRAN. This decreases trust substantially.
</p>
<p>
— Kasper Daniel Hansen (@KasperDHansen) <a href="https://twitter.com/KasperDHansen/status/659777449098637312">October 29, 2015</a>
</p>
</blockquote>
<p>The primary reason is that Bioc/CRAN demonstrate something about the developer’s willingness to do the boring but critically important parts of package development like documentation, vignettes, minimum coding standards, and being sure that their code isn’t just a rehash of something else. The other big point Kasper made was the difference between a repository, which is user oriented and should provide certain guarantees, and Github, which is a developer platform that makes things easier/better for developers but doesn’t have a user guarantee system in place.</p>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
.<a href="https://twitter.com/StrictlyStat">@StrictlyStat</a> CRAN is a repository, not a development platform. It is user oriented, not developer oriented. GH is the reverse.
</p>
<p>
— Kasper Daniel Hansen (@KasperDHansen) <a href="https://twitter.com/KasperDHansen/status/661746848437243904">November 4, 2015</a>
</p>
</blockquote>
<p>This discussion got me thinking about when/how I depend on R packages and how I make that decision. The scenarios where I depend on R packages are:</p>
<ol>
<li>Quick and dirty analyses for myself</li>
<li>Shareable data analyses that I hope are reproducible</li>
<li>As dependencies of R packages I maintain</li>
</ol>
<p>As you move from 1 to 3 it is more and more of a pain if the package I’m depending on breaks. If it is just something I was doing for fun, it’s not that big of a deal. But if it means I have to rewrite/recheck/rerelease my R package then that is a much bigger headache.</p>
<p>So my scale for how stringent I am about relying on packages varies by the type of activity, but what are the criteria I use to measure how trustworthy a package is? For me, the criteria are in this order:</p>
<ol>
<li><strong>People prior </strong></li>
<li><strong>Forced competence</strong></li>
<li><strong>Indirect data</strong></li>
</ol>
<p>I’ll explain each criterion in a minute, but the main purpose of using these criteria is (a) to ensure that I’m using a package that works and (b) to ensure that if the package breaks I can trust it will be fixed or at least I can get some help from the developer.</p>
<p><strong>People prior</strong></p>
<p>The first thing I do when I look at a package I might depend on is look at who the developer is. If that person is someone I know has developed widely used, reliable software and who quickly responds to requests/feedback then I immediately trust the package. I have a list of people like <a href="https://en.wikipedia.org/wiki/Brian_D._Ripley">Brian</a>, or <a href="https://github.com/hadley">Hadley,</a> or <a href="https://github.com/jennybc">Jenny</a>, or <a href="http://rafalab.dfci.harvard.edu/index.php/software-and-data">Rafa</a>, who could post their package just as a link to their website and I would trust it. It turns out almost all of these folks end up putting their packages on CRAN/Bioconductor anyway. But even if they didn’t I assume that the reason is either (a) the package is very new or (b) they have a really good reason for not distributing it through the normal channels.</p>
<p><strong>Forced competence</strong></p>
<p>For people who I don’t know about or whose software I’ve never used, I have very little confidence in the package a priori. This is because there are a ton of people developing R packages now with highly variable levels of commitment to making them work. So as a placeholder for all the variables I don’t know about them, I use the repository they choose as a surrogate. My personal prior on the trustworthiness of a package from someone I don’t know goes something like:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM.png"><img class="aligncenter wp-image-4410 size-full" src="http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM.png" alt="Screen Shot 2015-11-06 at 1.25.01 PM" width="843" height="197" srcset="http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM-300x70.png 300w, http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM-260x61.png 260w, http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM.png 843w" sizes="(max-width: 843px) 100vw, 843px" /></a></p>
<p>This prior is based on the idea of forced competence. In general, you have to do more to get a package approved on Bioconductor than on CRAN (for example you have to have a good vignette) and you have to do more to get a package on CRAN (pass R CMD check and survive the review process) than to put it on Github.</p>
<p>This prior isn’t perfect, but it does tell me something about how much the person cares about their package. If they go to the work of getting it on CRAN/Bioc, then at least they cared enough to document it. They are at least forced to be minimally competent - at least at the time of submission and enough for the packages to still pass checks.</p>
<p><strong>Indirect data</strong></p>
<p>After I’ve applied my priors I then typically look at the data. For Bioconductor I look at the badges, like how often it is downloaded, whether it passes the checks, and how well it is covered by tests. I’m already inclined to trust it a bit since it is on that platform, but I use the data to adjust my prior a bit. For CRAN I might look at the <a href="http://cran-logs.rstudio.com/">download stats</a> provided by RStudio. The interesting thing is that as John Muschelli points out, Github actually has the most indirect data available for a package:</p>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
.<a href="https://twitter.com/KasperDHansen">@KasperDHansen</a> Flipside: CRAN has no issue pages, stars/ratings, outdated limits on size, and limited development cycle/turnover.
</p>
<p>
— John Muschelli (@StrictlyStat) <a href="https://twitter.com/StrictlyStat/status/661746348409114624">November 4, 2015</a>
</p>
</blockquote>
<p>If I’m going to use a package that is on Github from a person who isn’t on my prior list of people to trust then I look at a few things. The number of stars/forks/watchers is one thing that is a quick and dirty estimate of how used a package is. I also look very carefully at how many commits the person has submitted to both the package in question and in general all other packages over the last couple of months. If the person isn’t actively developing either the package or anything else on Github, that is a bad sign. I also look to see how quickly they have responded to issues/bug reports on the package in the past if possible. One idea I haven’t used but I think is a good one is to submit an issue for a trivial change to the package and see if I get a response very quickly. Finally I look and see if they have some demonstration their package works across platforms (say with a <a href="https://travis-ci.org/">travis badge</a>). If the package is highly starred, frequently maintained, all issues are responded to and up-to-date, and it passes checks on all platforms then that data might overwhelm my prior and I’d go ahead and trust the package.</p>
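<p>To make the idea of data overwhelming a prior concrete, here is a minimal sketch in Python. It is not a procedure I actually run: the prior values, the evidence names, and the weights are all made-up assumptions for illustration, and the only thing taken from the discussion above is the ordering Bioconductor, then CRAN, then Github. Each piece of indirect data simply shifts the repository prior up or down on the log-odds scale.</p>

```python
import math

# Hypothetical prior probabilities that a package from an unknown developer
# is trustworthy. Only the ordering (Bioconductor > CRAN > Github) follows
# the post; the numbers themselves are illustrative assumptions.
REPO_PRIOR = {"bioconductor": 0.7, "cran": 0.5, "github": 0.2}

# Hypothetical log-odds adjustments for each piece of indirect data.
EVIDENCE_WEIGHT = {
    "widely_used": 1.0,         # many downloads, stars, forks, or watchers
    "recent_commits": 1.0,      # developer has been active in recent months
    "responds_to_issues": 1.0,  # issues and bug reports get timely replies
    "passes_checks": 1.0,       # e.g. a passing CI badge across platforms
}


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def trust_score(repo, evidence):
    """Shift the repository prior up or down on the log-odds scale.

    `evidence` maps names in EVIDENCE_WEIGHT to True (observed) or
    False (observed to be absent). Names that are left out do not
    change the score.
    """
    score = logit(REPO_PRIOR[repo])
    for name, present in evidence.items():
        weight = EVIDENCE_WEIGHT[name]
        score += weight if present else -weight
    return inv_logit(score)
```

<p>Under these assumed numbers, a Github package with all four kinds of evidence in its favor ends up with a higher score than a CRAN package about which nothing else is known, which is the sense in which the data can overwhelm the prior.</p>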
<p><strong>Summary</strong></p>
<p>In general one of the best things about the R ecosystem is being able to rely on other packages so that you don’t have to write everything from scratch. But there is a hard balance to strike with keeping the dependency list small. One way I maintain this balance is using the strategy I’ve outlined to worry less about trustworthy dependencies.</p>
The Statistics Identity Crisis: Am I a Data Scientist
2015-10-30T14:21:08+00:00
http://simplystats.github.io/2015/10/30/the-statistics-identity-crisis-am-i-a-data-scientist
<p>The joint ASA/Simply Statistics webinar on the statistics identity crisis is now live!</p>
Faculty/postdoc job opportunities in genomics across Johns Hopkins
2015-10-30T10:33:06+00:00
http://simplystats.github.io/2015/10/30/facultypostdoc-job-opportunities-in-genomics-across-johns-hopkins
<p>It’s pretty exciting to be in genomics at Hopkins right now with three new Bloomberg professors in genomics areas, a ton of stellar junior faculty, and a really fun group of students/postdocs. If you want to get in on the action here is a non-comprehensive list of great opportunities.</p>
<h2 id="faculty-jobs"><span style="text-decoration: underline;"><strong>Faculty Jobs</strong></span></h2>
<p><strong>Job: </strong>Multiple tenure track faculty positions in all areas including in genomics</p>
<p><strong>Department: </strong> Biostatistics</p>
<p><strong>To apply</strong>: <a href="http://www.jhsph.edu/departments/biostatistics/_docs/faculty-ad-2016-combined-large-final.pdf">http://www.jhsph.edu/departments/biostatistics/_docs/faculty-ad-2016-combined-large-final.pdf</a></p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> Tenure track position in data intensive biology</p>
<p><strong>Department: </strong> Biology</p>
<p><strong>To apply</strong>: <a href="http://apply.interfolio.com/31146">http://apply.interfolio.com/31146</a></p>
<p><strong>Deadline: </strong>Nov 1st and ongoing</p>
<p><strong>Job:</strong> Tenure track positions in bioinformatics, with focus on proteomics or sequencing data analysis</p>
<p><strong>Department: </strong> Oncology Biostatistics</p>
<p><strong>To apply</strong>: <a href="https://www.research-it.onc.jhmi.edu/DBB/PhD_Statistician.pdf">https://www.research-it.onc.jhmi.edu/DBB/PhD_Statistician.pdf</a></p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p> </p>
<h2 id="postdoc-jobs"><span style="text-decoration: underline;"><strong>Postdoc Jobs</strong></span></h2>
<p><strong>Job:</strong> Postdoc(s) in statistical methods/software development for RNA-seq</p>
<p><strong>Employer: </strong> Jeff Leek</p>
<p><strong>To apply</strong>: email Jeff (<a href="http://jtleek.com/jobs/">http://jtleek.com/jobs/</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> Data scientist for integrative genomics in the human brain (MS/PhD)</p>
<p><strong>Employer: </strong> Andrew Jaffe</p>
<p><strong>To apply</strong>: email Andrew (<a href="http://www.aejaffe.com/jobs.html">http://www.aejaffe.com/jobs.html</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> Research associate for genomic data processing and analysis (BA+)</p>
<p><strong>Employer: </strong> Andrew Jaffe</p>
<p><strong>To apply</strong>: email Andrew (<a href="http://www.aejaffe.com/jobs.html">http://www.aejaffe.com/jobs.html</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> PhD developing scalable software and algorithms for analyzing sequencing data</p>
<p><strong>Employer: </strong> Ben Langmead</p>
<p><strong>To apply</strong>: <a href="http://www.cs.jhu.edu/graduate-studies/phd-program/">http://www.cs.jhu.edu/graduate-studies/phd-program/</a></p>
<p><strong>Deadline:</strong> See site</p>
<p><strong>Job:</strong> Postdoctoral researcher developing scalable software and algorithms for analyzing sequencing data</p>
<p><strong>Employer: </strong> Ben Langmead</p>
<p><strong>To apply</strong>: email Ben (<a href="http://www.langmead-lab.org/open-positions/">http://www.langmead-lab.org/open-positions/</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> Postdoctoral researcher developing algorithms for challenging problems in large-scale genomics: whole-genome assembly, RNA-seq analysis, and microbiome analysis</p>
<p><strong>Employer: </strong> Steven Salzberg</p>
<p><strong>To apply</strong>: email Steven (<a href="http://salzberg-lab.org/">http://salzberg-lab.org/</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> Research associate for genomic data processing and analysis (BA+) in cancer</p>
<p><strong>Employer: </strong> Luigi Marchionni (with Don Geman)</p>
<p><strong>To apply</strong>: email Luigi (<a href="http://luigimarchionni.org/">http://luigimarchionni.org/</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job: </strong>Postdoctoral researcher developing algorithms for biomarkers development and precision medicine application in cancer</p>
<p><strong>Employer: </strong> Luigi Marchionni (with Don Geman)</p>
<p><strong>To apply</strong>: email Luigi (<a href="http://luigimarchionni.org/">http://luigimarchionni.org/</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job:</strong> Postdoctoral researcher developing methods in machine learning, genomics, and regulatory variation</p>
<p><strong>Employer: </strong> Alexis Battle</p>
<p><strong>To apply</strong>: email Alexis (<a href="http://battlelab.jhu.edu/join_us.html">http://battlelab.jhu.edu/join_us.html</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job: </strong>Postdoctoral fellow with interests in biomarker discovery for Alzheimer’s disease</p>
<p><strong>Employer: </strong> Madhav Thambisetty / Ingo Ruczinski</p>
<p><strong>To apply</strong>: <a href="http://www.alzforum.org/jobs/postdoctoral-research-fellow-alzheimers-disease-biomarkers"> http://www.alzforum.org/jobs/postdoctoral-research-fellow-alzheimers-disease-biomarkers</a></p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job: </strong>Postdoctoral positions for research in the interface of statistical genetics, precision medicine and big data</p>
<p><strong>Employer: </strong> Nilanjan Chatterjee</p>
<p><strong>To apply</strong>: <a href="http://www.jhsph.edu/departments/biostatistics/_docs/postdoc-ad-chatterjee.pdf">http://www.jhsph.edu/departments/biostatistics/_docs/postdoc-ad-chatterjee.pdf</a></p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job: </strong>Postdoctoral research developing algorithms and software for time course pattern detection in genomics data</p>
<p><strong>Employer: </strong> Elana Fertig</p>
<p><strong>To apply</strong>: email Elana (ejfertig@jhmi.edu)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<p><strong>Job: </strong>Postdoctoral fellow to develop novel methods for large-scale DNA and RNA sequence analysis related to human and/or plant genetics, such as developing methods for discovering structural variations in cancer or for assembling and analyzing large complex plant genomes.</p>
<p><strong>Employer: </strong> Mike Schatz</p>
<p><strong>To apply</strong>: email Mike (<a href="http://schatzlab.cshl.edu/apply/">http://schatzlab.cshl.edu/apply/</a>)</p>
<p><strong>Deadline:</strong> Review ongoing</p>
<h2 id="students"><span style="text-decoration: underline;"><strong>Students</strong></span></h2>
<p>We are all always on the hunt for good Ph.D. students. At Hopkins students are admitted to specific departments. So if you find a faculty member you want to work with, you can apply to their department. Here are the application details for the various departments admitting students to work on genomics:<a href="https://ccb.jhu.edu/students.shtml"> https://ccb.jhu.edu/students.shtml</a></p>
The statistics identity crisis: am I really a data scientist?
2015-10-29T13:32:13+00:00
http://simplystats.github.io/2015/10/29/the-statistics-identity-crisis-am-i-really-a-data-scientist
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/crisis.png"><img class="aligncenter wp-image-4397" src="http://simplystatistics.org/wp-content/uploads/2015/10/crisis-300x75.png" alt="crisis" width="508" height="127" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/crisis-300x75.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/crisis-260x65.png 260w, http://simplystatistics.org/wp-content/uploads/2015/10/crisis.png 720w" sizes="(max-width: 508px) 100vw, 508px" /></a></p>
<p> </p>
<p><em>Tl;dr: We will host a Google Hangout of our popular JSM session October 30th 2-4 PM EST. </em></p>
<p> </p>
<p>I organized a session at JSM 2015 called <em>“The statistics identity crisis: am I really a data scientist?”</em> The session turned out to be pretty popular:</p>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
Packed room of statisticians with identity crises at <a href="https://twitter.com/hashtag/JSM2015?src=hash">#JSM2015</a> session: are we really data scientists? <a href="http://t.co/eLsGosoTCt">pic.twitter.com/eLsGosoTCt</a>
</p>
<p>
— Dr Ruth Etzioni (@retzioni) <a href="https://twitter.com/retzioni/status/631134032357502978">August 11, 2015</a>
</p>
</blockquote>
<p>but it turns out not everyone fit in the room:</p>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
This is the closest I can get to <a href="https://twitter.com/statpumpkin">@statpumpkin</a>'s talk. <a href="https://twitter.com/hashtag/jsm2015?src=hash">#jsm2015</a> still had no clue how to predict session attendance. <a href="http://t.co/gTb4OqdAo3">pic.twitter.com/gTb4OqdAo3</a>
</p>
<p>
— sandy griffith (@sgrifter) <a href="https://twitter.com/sgrifter/status/631134590229442560">August 11, 2015</a>
</p>
</blockquote>
<p>Thankfully, Steve Pierson at the ASA had the awesome idea to re-run the session for people who couldn’t be there. So we will be hosting a Google Hangout with the following talks:</p>
<table width="100%" cellspacing="0" cellpadding="4" bgcolor="white">
<tr>
<td align="right" valign="top" width="110">
</td>
<td>
<a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314339">'Am I a Data Scientist?': The Applied Statistics Student's Identity Crisis</a> — <b>Alyssa Frazee, Stripe</b>
</td>
</tr>
<tr>
<td align="right" valign="top" width="110">
</td>
<td>
<a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314376">How Industry Views Data Science Education in Statistics Departments</a> — <b>Chris Volinsky, AT&T</b>
</td>
</tr>
<tr>
<td align="right" valign="top" width="110">
</td>
<td>
<a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314414">Evaluating Data Science Contributions in Teaching and Research</a> — <b>Lance Waller, Emory University</b>
</td>
</tr>
<tr>
<td align="right" valign="top" width="110">
</td>
<td>
<a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314641">Teach Data Science and They Will Come</a> — <b>Jennifer Bryan, The University of British Columbia</b>
</td>
</tr>
</table>
<p>You can watch it on Youtube or Google Plus. Here is the link:</p>
<p>https://plus.google.com/events/chuviltukohj2inbqueap9h7228</p>
<p>The session will be held October 30th (tomorrow!) from 2-4PM EST. You can watch it live and discuss the talks using the hashtag <a href="https://twitter.com/search?q=%23jsm2015">#jsm2015</a> or you can watch later as the video will remain on Youtube.</p>
Discussion of the Theranos Controversy with Elizabeth Matsui
2015-10-28T14:54:50+00:00
http://simplystats.github.io/2015/10/28/discussion-of-the-theranos-controversy-with-elizabeth-matsui
<p>Theranos is a Silicon Valley diagnostic testing company that has been in the news recently. The story of Theranos has fascinated me because I think it represents a perfect collision of the tech startup culture and the health care culture, and it shows how combining them can generate unique problems.</p>
<p>I talked with Elizabeth Matsui, a Professor of Pediatrics in the Division of Allergy and Immunology here at Johns Hopkins, to discuss Theranos, the realities of diagnostic testing, and the unique challenges that a health-tech startup faces with respect to doing good science and building products people want to buy.</p>
<p>Notes:</p>
<ul>
<li>Original <a href="http://www.wsj.com/articles/theranos-has-struggled-with-blood-tests-1444881901">Wall Street Journal story</a> on Theranos (paywalled)</li>
<li>Related stories in <a href="http://www.wired.com/2015/10/theranos-scandal-exposes-the-problem-with-techs-hype-cycle/">Wired</a> and NYT’s <a href="http://www.nytimes.com/2015/10/28/business/dealbook/theranos-under-fire.html">Dealbook</a> (not paywalled)</li>
<li>Theranos <a href="https://www.theranos.com/news/posts/custom/theranos-facts">response</a> to WSJ story</li>
</ul>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/230510705%3Fsecret_token%3Ds-WbZX8&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
Not So Standard Deviations: Episode 3 - Gilmore Girls
2015-10-24T23:17:18+00:00
http://simplystats.github.io/2015/10/24/not-so-standard-deviations-episode-3-gilmore-girls
<p>I just uploaded Episode 3 of <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> so check your feeds. In this episode Hilary and I talk about our jobs and the life of the data scientist in both academia and the tech industry. It turns out that they’re not as different as I would have thought.</p>
<p><a href="https://api.soundcloud.com/tracks/229957578/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p>
We need a statistically rigorous and scientifically meaningful definition of replication
2015-10-20T10:05:22+00:00
http://simplystats.github.io/2015/10/20/we-need-a-statistically-rigorous-and-scientifically-meaningful-definition-of-replication
<p>Replication and confirmation are indispensable concepts that help define scientific facts. However, the way in which we reach scientific consensus on a given finding is rather complex. Although <a href="http://simplystatistics.org/2015/06/24/how-public-relations-and-the-media-are-distorting-science/">some press releases try to convince us otherwise</a>, rarely is one publication enough. In fact, most published results go unnoticed and no attempts to replicate them are made. These are not debunked either; they simply get consigned to the dustbin of history. The very few results that garner enough attention for others to spend time and energy on them are assessed by an ad-hoc process involving a community of peers. The assessments are usually a combination of deductive reasoning, direct attempts at replication, and indirect checks obtained by attempting to build on the result in question. This process eventually leads to a result either being accepted by consensus or not. For particularly important cases, an official scientific consensus report may be commissioned by a national academy or an established scientific society. Examples of results that have become part of the scientific consensus in this way include smoking causing lung cancer, HIV causing AIDS, and climate change being caused by humans. In contrast, the published result that vaccines cause autism has been thoroughly debunked by several follow-up studies. In none of these four cases was a simple definition of replication used to confirm or falsify a result. The same is true for most results for which there is consensus. Yet science moves on, and continues to be an incomparable force for improving our quality of life.</p>
<p>Regulatory agencies, such as the FDA, are an exception since they clearly spell out a <a href="http://www.fda.gov/downloads/Drugs/.../Guidances/ucm078749.pdf">definition</a> of replication. For example, to approve a drug they may require two independent clinical trials, adequately powered, to show statistical significance at some predetermined level. They also require a large enough effect size to justify the cost and potential risks associated with treatment. This is not to say that FDA approval is equivalent to scientific consensus, but they do provide a clear-cut definition of replication.</p>
<p>In response to a growing concern over a <em><a href="http://www.nature.com/news/reproducibility-1.17552">reproducibility crisis</a></em>, projects such as the <a href="http://osc.centerforopenscience.org/">Open Science Collaboration</a> have commenced to systematically try to replicate published results. In a <a href="http://simplystatistics.org/2015/10/01/a-glass-half-full-interpretation-of-the-replicability-of-psychological-science/">recent post</a>, Jeff described one of their <a href="http://www.sciencemag.org/content/349/6251/aac4716">recent papers</a> on estimating the reproducibility of psychological science (they really mean replicability; see note below). This Science paper led to lay press reports with eye-catching headlines such as “only 36% of psychology experiments replicate”. Note that the 36% figure comes from a definition of replication that mimics the definition used by regulatory agencies: results are considered replicated if a p-value &lt; 0.05 was reached in both the original study and the replicated one. Unfortunately, this definition ignores both effect size and statistical power. If power is not controlled, then the expected proportion of correct findings that replicate can be quite small. For example, if I try to replicate the smoking-causes-lung-cancer result with a sample size of 5, there is a good chance it will not replicate. In his post, Jeff notes that for several of the studies that did not replicate, the 95% confidence intervals intersected. So should intersecting confidence intervals be our definition of replication? This too has a flaw since it favors imprecise studies with very large confidence intervals. If effect size is ignored, we may waste our time trying to replicate studies reporting practically meaningless findings. In general, defining replication for published studies is not as easy as for highly controlled clinical trials.
However, one clear improvement over what is currently being done is to consider statistical power and effect sizes.</p>
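<p>To make the point about power concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a calculation from the Science paper: it assumes a two-sided, two-sample z-test with known unit variance, and the standardized effect size of 0.5 and the sample sizes are hypothetical values chosen only for the example. The function names are made up for this sketch.</p>

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_z(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample z-test (unit variance assumed known)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Mean of the z statistic when the true standardized effect is effect_size
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability the statistic lands in either rejection region
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


def prob_both_significant(effect_size, n_original, n_replication, alpha=0.05):
    """Chance that two independent studies of a real effect both reach p < alpha."""
    return (power_two_sample_z(effect_size, n_original, alpha)
            * power_two_sample_z(effect_size, n_replication, alpha))


# Hypothetical true effect of 0.5 standard deviations
for n in (5, 20, 100):
    print(n, round(power_two_sample_z(0.5, n), 3))
```

<p>Under these assumptions, a real effect studied with 5 subjects per group reaches p &lt; 0.05 only a small fraction of the time, so a "failure to replicate" by the p-value definition says little about whether the original finding was correct.</p>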
<p>To further illustrate this, let’s consider a very concrete example with real life consequences. Imagine a loved one has a disease with high mortality rates and asks for your help in evaluating the scientific evidence on treatments. Four experimental drugs are available, all with promising clinical trials resulting in p-values &lt;0.05. However, a replication project redoes the experiments and finds that only the drug A and drug B studies replicate (p&lt;0.05). So which drug do you take? Let’s give a bit more information to help you decide. Here are the p-values for both original and replication trials:</p>
<table style="width: 100%;">
<tr>
<td>
Drug
</td>
<td>
Original p-value
</td>
<td>
Replication p-value
</td>
<td>
Replicated (p &lt; 0.05)
</td>
</tr>
<tr>
<td>
A
</td>
<td>
0.0001
</td>
<td>
0.001
</td>
<td>
Yes
</td>
</tr>
<tr>
<td>
B
</td>
<td>
<0.000001
</td>
<td>
0.03
</td>
<td>
Yes
</td>
</tr>
<tr>
<td>
C
</td>
<td>
0.03
</td>
<td>
0.06
</td>
<td>
No
</td>
</tr>
<tr>
<td>
D
</td>
<td>
<0.000001
</td>
<td>
0.10
</td>
<td>
No
</td>
</tr>
</table>
<p>Which drug would you take now? The information I have provided is based on p-values and therefore is missing a key piece of information: the effect sizes. Below I show the confidence intervals for the four original studies (left) and the four replication studies (right). Note that except for drug B, all confidence intervals intersect. In light of the figure below, which one would you choose?</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/replication.png"><img class=" wp-image-4368 alignright" src="http://simplystatistics.org/wp-content/uploads/2015/10/replication.png" alt="replication" width="359" height="338" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/replication-300x283.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/replication-212x200.png 212w, http://simplystatistics.org/wp-content/uploads/2015/10/replication.png 617w" sizes="(max-width: 359px) 100vw, 359px" /></a></p>
<p>I would be inclined to go with drug D because it has a large effect size, a small p-value, and the replication experiment effect estimate fell inside a 95% confidence interval. I would definitely not go with A since it provides marginal benefits, even if the trial found a statistically significant effect and was replicated. So the p-value based definition of replication is worthless from a practical standpoint.</p>
<p>It seems that before continuing the debate over replication, and certainly before declaring that we are in a <a href="http://www.nature.com/news/reproducibility-1.17552">reproducibility crisis</a>, we need a statistically rigorous and scientifically meaningful definition of replication. This definition does not necessarily need to be dichotomous (replicated or not) and it will probably require more than one replication experiment and more than one summary statistic: one for effect size and one for uncertainty. In the meantime, we should be careful not to dismiss the current scientific process, which seems to be working rather well at either ignoring or debunking false positive results while producing useful knowledge and discovery.</p>
<hr />
<p>Footnote on reproducibility versus replication: As Jeff pointed out, the cited Open Science Collaboration paper is about replication, not reproducibility. A study is considered reproducible if an independent researcher can recreate the tables and figures from the original raw data. Replication is not nearly as simple to define because it involves probability. To replicate the experiment it has to be performed again, with a different random sample and a new set of measurement errors.</p>
Theranos runs head first into the realities of diagnostic testing
2015-10-16T08:42:11+00:00
http://simplystats.github.io/2015/10/16/thorns-runs-head-first-into-the-realities-of-diagnostic-testing
<p>The Wall Street Journal has published a <a href="http://www.wsj.com/articles/theranos-has-struggled-with-blood-tests-1444881901">lengthy investigation</a> into the diagnostic testing company Theranos.</p>
<blockquote>
<p>The company offers more than 240 tests, ranging from cholesterol to cancer. It claims its technology can work with just a finger prick. Investors have poured more than $400 million into Theranos, valuing it at $9 billion and her majority stake at more than half that. The 31-year-old Ms. Holmes’s bold talk and black turtlenecks draw comparisons to Apple<span class="company-name-type"> Inc.</span> cofounder Steve Jobs.</p>
</blockquote>
<p>If ever there were a warning sign, the comparison to Steve Jobs has got to be it.</p>
<blockquote>
<p>But Theranos has struggled behind the scenes to turn the excitement over its technology into reality. At the end of 2014, the lab instrument developed as the linchpin of its strategy handled just a small fraction of the tests then sold to consumers, according to four former employees.</p>
<div class=" media-object wrap scope-web|mobileapps " data-layout="wrap ">
One former senior employee says Theranos was routinely using the device, named Edison after the prolific inventor, for only 15 tests in December 2014. Some employees were leery about the machine’s accuracy, according to the former employees and emails reviewed by The Wall Street Journal.
</div>
<div class=" media-object wrap scope-web|mobileapps " data-layout="wrap ">
In a complaint to regulators, one Theranos employee accused the company of failing to report test results that raised questions about the precision of the Edison system. Such a failure could be a violation of federal rules for laboratories, the former employee said.
</div>
</blockquote>
<div class=" media-object wrap scope-web|mobileapps " data-layout="wrap ">
With these kinds of stories, it's always hard to tell whether there's reality here or it's just a bunch of axe grinding. But one thing that's for sure is that people are talking, and probably not for good reasons.
</div>
Minimal R Package Check List
2015-10-14T08:21:48+00:00
http://simplystats.github.io/2015/10/14/minimal-r-package-check-list
<p>A little while back I had the pleasure of flying in a small Cessna with a friend and for the first time I got to see what happens in the cockpit with a real pilot. One thing I noticed was that basically you don’t lift a finger without going through some sort of check list. This starts before you even roll the airplane out of the hangar. It makes sense because flying is a pretty dangerous hobby and you want to prevent problems from occurring when you’re in the air.</p>
<p>That experience got me thinking about what might be the minimal check list for building an R package, a somewhat less dangerous hobby. First off, much has changed (for the better) since I started making R packages and I wanted to have some clean documentation of the process, particularly with using RStudio’s tools. So I wiped off my installations of both R and RStudio and started from scratch to see what it would take to get someone to build their first R package.</p>
<p>The list is basically a “pre-flight” list: the presumption here is that you actually know the important details of building packages, but need to make sure that your environment is set up correctly so that you don’t run into errors or problems. I find this is often a problem for me when teaching students to build packages because I focus on the details of actually making the packages (i.e. DESCRIPTION files, Roxygen, etc.) and forget that way back when I actually configured my environment to do this.</p>
<p><strong>Pre-flight Procedures for R Packages</strong></p>
<ol>
<li>Install most recent version of R</li>
<li>Install most recent version of RStudio</li>
<li>Open RStudio</li>
<li>Install <strong>devtools</strong> package</li>
<li>Click on Project → New Project… → New Directory → R package</li>
<li>Enter package name</li>
<li>Delete boilerplate code and “hello.R” file</li>
<li>Go to the “man” directory and delete the “hello.Rd” file</li>
<li>In File browser, click on package name to go to the top level directory</li>
<li>Click “Build” tab in environment browser</li>
<li>Click “Configure Build Tools…”</li>
<li>Check “Generate documentation with Roxygen”</li>
<li>Check “Build &amp; Reload” when the Roxygen Options window opens, then click OK</li>
<li>Click OK in Project Options window</li>
</ol>
<p>At this point, you’re clear to build your package, which obviously involves writing R code, Roxygen documentation, writing package metadata, and building/checking your package.</p>
<p>If I’m missing a step or have too many steps, I’d like to hear about it. But I think this is the minimum number of steps you need to configure your environment for building R packages in RStudio.</p>
<p>UPDATE: I’ve made some changes to the check list and will be posting future updates/modifications to my <a href="https://github.com/rdpeng/daprocedures/blob/master/lists/Rpackage_preflight.md">GitHub repository</a>.</p>
Profile of Data Scientist Shannon Cebron
2015-10-03T09:32:20+00:00
http://simplystats.github.io/2015/10/03/profile-of-data-scientist-shannon-cebron
<p>The “This is Statistics” campaign has a nice <a href="http://thisisstatistics.org/interview-with-shannon-cebron-from-pegged-software/">profile of Shannon Cebron</a>, a data scientist working at the Baltimore-based Pegged Software.</p>
<blockquote>
<p><strong>What advice would you give to someone thinking of a career in data science?</strong></p>
<p>Take some advanced statistics courses if you want to see what it’s like to be a statistician or data scientist. By that point, you’ll be familiar with enough statistical methods to begin solving real-world problems and understanding the power of statistical science. I didn’t realize I wanted to be a data scientist until I took more advanced statistics courses, around my third year as an undergraduate math major.</p>
</blockquote>
Not So Standard Deviations: Episode 2 - We Got it Under 40 Minutes
2015-10-02T09:00:29+00:00
http://simplystats.github.io/2015/10/02/not-so-standard-deviations-episode-2-we-got-it-under-40-minutes
<p>Episode 2 of my podcast with Hilary Parker, <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a>, is out! In this episode, we talk about user testing for statistical methods, navigating the Hadleyverse, the crucial significance of rename(), and the secret reason for creating the podcast (hint: it rhymes with “bee”). Also, I erroneously claim that <a href="http://www.stat.purdue.edu/~wsc/">Bill Cleveland</a> is <em>way</em> older than he actually is. Sorry Bill.</p>
<p>In other news, <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">we are finally on iTunes</a> so you can subscribe from there directly if you want (just search for “Not So Standard Deviations” or paste the link directly into your podcatcher).</p>
<p><a href="https://api.soundcloud.com/tracks/226538106/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p>
<p>Notes:</p>
<ul>
<li><a href="http://www.sciencemag.org/content/229/4716/828.short">Bill Cleveland’s paper in Science</a>, on graphical perception, <strong>published in 1985</strong></li>
<li><a href="https://www.eventbrite.com/e/statistics-making-a-difference-a-conference-in-honor-of-tom-louis-tickets-16248614042">TomFest</a></li>
</ul>
A glass half full interpretation of the replicability of psychological science
2015-10-01T10:00:53+00:00
http://simplystats.github.io/2015/10/01/a-glass-half-full-interpretation-of-the-replicability-of-psychological-science
<p style="line-height: 18.0pt;">
<em>tl;dr: 77% of replication effects from the psychology replication study were in (or above) the 95% prediction interval based on the original effect size. This isn't perfect and suggests (a) there is still room for improvement, (b) the scientists who did the replication study are pretty awesome at replicating, (c) we need a better definition of replication that respects uncertainty but (d) the scientific sky isn't falling. We wrote this up in a <a href="http://arxiv.org/abs/1509.08968">paper on arxiv</a>; <a href="https://github.com/jtleek/replication_paper">the code is here.</a> </em>
</p>
<p style="line-height: 18.0pt;">
<span style="font-size: 12.0pt; font-family: Georgia; color: #333333;">A week or two ago a paper came out in Science on <a href="http://www.sciencemag.org/content/349/6251/aac4716">Estimating the reproducibility of psychological science</a>. The basic idea behind the study was to take a sample of studies that appeared in a particular journal in 2008 and try to replicate each of these studies. Here I'm using the definition that reproducibility is the ability to recalculate all results given the raw data and code from a study and replicability is the ability to re-do the study and get a consistent result. </span>
</p>
<p style="line-height: 18.0pt;">
<span style="font-size: 12.0pt; font-family: Georgia; color: #333333;">The paper is pretty incredible and the authors did an amazing job of going back to the original sources and trying to be faithful to the original study designs. I have to admit when I first heard about the study design I was incredibly pessimistic about the results (I suppose grouchy is a natural default state for many statisticians, especially those with sleep deprivation). I mean 2008 was well before the push toward reproducibility had really taken off (Biostatistics was one of the first journals to adopt a policy on reproducible research and that didn't happen <a href="http://biostatistics.oxfordjournals.org/content/10/3/405.full">until 2009</a>). More importantly, the student researchers from those studies had possibly moved on, study populations may have changed, and there could be any number of minor variations in the study design and so forth. I thought the chances of getting any effects in the same range were probably pretty low. </span>
</p>
<p style="line-height: 18.0pt;">
So when the results were published I was pleasantly surprised. I wasn’t the only one:
</p>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
Someone has to say it, but this plot shows that science is, in fact, working. <a href="http://t.co/JUy10xHfbH">http://t.co/JUy10xHfbH</a> <a href="http://t.co/lJSx6IxPw2">pic.twitter.com/lJSx6IxPw2</a>
</p>
<p>
— Roger D. Peng (@rdpeng) <a href="https://twitter.com/rdpeng/status/637009904289452032">August 27, 2015</a>
</p>
</blockquote>
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
Looks like psychologists are in a not-too-bad spot on the ROC curves of science (<a href="http://t.co/fPsesCn2yK">http://t.co/fPsesCn2yK</a>) <a href="http://t.co/9rAOdZWvzv">http://t.co/9rAOdZWvzv</a>
</p>
<p>
— Joe Pickrell (@joe_pickrell) <a href="https://twitter.com/joe_pickrell/status/637304244538896384">August 28, 2015</a>
</p>
</blockquote>
<p>But that was definitely not the prevailing impression that the paper left on social and mass media. A lot of the discussion around the paper focused on the <a href="https://github.com/jtleek/replication_paper/blob/gh-pages/in_the_media.md">idea that only 36% of the studies</a> had a p-value less than 0.05 in both the original and replication study. But many of the sample sizes were small and the effects were modest. So the first question I asked myself was, “Well what would we expect to happen if we replicated these studies?” The original paper measured replicability in several ways and tried hard to calibrate expected coverage of confidence intervals for the measured effects.</p>
<p>With <a href="http://www.biostat.jhsph.edu/~rpeng/">Roger</a> and <a href="http://www.biostat.jhsph.edu/~prpatil/">Prasad</a> we tried a slightly different approach. We estimated the 95% prediction interval for the replication effect given the original effect size.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter.png"><img class="aligncenter wp-image-4337" src="http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-300x300.png" alt="pi_figure_nofilter" width="397" height="397" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter.png 1050w" sizes="(max-width: 397px) 100vw, 397px" /></a></p>
<p>72% of the replication effects were within the 95% prediction interval and 2 were above the interval (showed a stronger signal in the replication than predicted from the original study). This definitely shows that there is still room for improvement in replication of these studies: we would expect 95% of the effects to fall into the 95% prediction interval. But at least my opinion is that 72% (or 77% if you count the 2 above the P.I.) of studies falling in the prediction interval is (a) not bad and (b) a testament to the authors of the reproducibility paper and their efforts to get the studies right.</p>
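<p>For readers who want to see what a prediction interval of this kind can look like, here is a minimal sketch in Python. It assumes the effects are correlation coefficients and uses the standard Fisher z-transformation; this is one common way to build such an interval and is not necessarily the exact calculation in our paper. The function names and the numbers in the example are hypothetical.</p>

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def replication_prediction_interval(r_original, n_original, n_replication, level=0.95):
    """Prediction interval for a replication correlation, given the original one.

    Works on the Fisher z scale, where a sample correlation from n observations
    has approximate variance 1 / (n - 3).
    """
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    z_original = atanh(r_original)
    # The uncertainty combines sampling error from both the original and the replication
    se = sqrt(1 / (n_original - 3) + 1 / (n_replication - 3))
    return tanh(z_original - z_crit * se), tanh(z_original + z_crit * se)


def classify(r_replication, interval):
    """Label a replication effect as below, inside, or above the interval."""
    lower, upper = interval
    if r_replication < lower:
        return "below"
    if r_replication > upper:
        return "above"
    return "inside"


# Hypothetical study: original r = 0.5 with n = 30, replication with n = 60
interval = replication_prediction_interval(0.5, 30, 60)
print(interval, classify(0.2, interval))
```

<p>Note how wide the interval is when the sample sizes are small: a replication effect well below the original estimate can still be consistent with it.</p>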
<p>An important point here is that replication and reproducibility aren’t the same thing. When reproducing a study we expect the numbers and figures to be <em>exactly the same</em>. But a replication involves re-collection of data and is subject to variation and so <em>we don’t expect the answer to be exactly the same in the replication</em>. This is of course made more confusing by regression to the mean, publication bias, and <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">the garden of forking paths</a>. Our use of a prediction interval accounts for the variation expected in both the original study and the replication. One thing we noticed when re-analyzing the data is how many of the studies had very low sample sizes. <a href="http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter.png"><img class="aligncenter wp-image-4339" src="http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-300x300.png" alt="samplesize_figure_nofilter" width="450" height="450" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter.png 1050w" sizes="(max-width: 450px) 100vw, 450px" /></a></p>
<p>Sample sizes were generally bigger in the replication, but often very low regardless. This makes it more difficult to disentangle what didn’t replicate from what is just expected variation for a small sample size study. The question remains whether those small studies should be trusted in general, but for the purposes of measuring replication it makes the problem more difficult.</p>
<p>One thing I have been thinking about a lot, and that this study drove home, is that if we are measuring replication we need a definition that incorporates uncertainty directly. Suppose that you collect a data set <strong>D0</strong> from an original study and <strong>D1</strong> from a replication. Then the study replicates if <strong>D0 ~ F</strong> and <strong>D1 ~ F</strong>. Informally, if the data are generated from the same distribution in both experiments then the study replicates. To get an estimate you apply a pipeline <strong>p()</strong> to the data set: <strong>e0 = p(D0)</strong>. If the study is also reproducible then <strong>p()</strong> is the same for both studies and <strong>p(D0) ~ G</strong> and <strong>p(D1) ~ G</strong>, subject to some conditions on <strong>p()</strong>.</p>
<p>One interesting consequence of this definition is that each complete replication data set represents <em>only a single data point</em> for measuring replication. To measure replication with this definition you either need to make assumptions about the data generating distribution for <strong>D0</strong> and <strong>D1</strong> or you need to perform a complete replication of a study many times to determine if it replicates. However, it does mean that we can define replication even for studies with very small number of replicates as the data generating distribution may be arbitrarily variable in each case.</p>
<p>Regardless of this definition I was excited that the <a href="https://osf.io/">OSF</a> folks did the study and pulled it off as well as they did, and I was a bit bummed about the most common reaction. I think there is an easy narrative that “science is broken” which I think isn’t a positive thing for a number of reasons. I love the way that {reproducibility/replicability/open science/open publication} are becoming more and more common, but I often think we fall into the same trap of wanting to report these results as clear cut, just as we do when reporting exaggerations or oversimplifications of scientific discoveries in headlines. I’m excited to see how these kinds of studies look in 10 years when Github/open science/pre-prints/etc. are all standard.</p>
Apple Music's Moment of Truth
2015-09-30T07:38:08+00:00
http://simplystats.github.io/2015/09/30/apple-musics-moment-of-truth
<p>Today is the day when Apple, Inc. learns whether its brand new streaming music service, Apple Music, is going to be a major contributor to the bottom line or just another streaming service (JASS?). Apple Music launched 3 months ago and all new users are offered a 3-month free trial. Today, that free trial ends and the big question is how many people will start to <strong>pay</strong> for their subscription, as opposed to simply canceling it. My guess is that most people (> 50%) will opt to pay, but that’s a complete guess. For what it’s worth, I’ll be paying for my subscription. After adding all this music to my library, I’d hate to see it all go away.</p>
<p>Back on August 18, 2015, consumer market research firm MusicWatch <a href="http://www.businesswire.com/news/home/20150818005755/en#.VddbR7Scy6F">released a study</a> that claimed, among other things, that</p>
<blockquote>
<p>Among people who had tried Apple Music, 48 percent reported they are not currently using the service.</p>
</blockquote>
<p>This would suggest that almost half of people who had signed up for the free trial period of Apple Music were not interested in using it further and would likely not pay for it once the trial ended. If it were true, it would be a blow to the newly launched service.</p>
<p>But how did MusicWatch arrive at its number? It claimed to have surveyed 5,000 people in its study. Shortly before the survey by MusicWatch was released, Apple claimed that about 11 million people had signed up for their new Apple Music service (because the service had just launched, everyone who had signed up was in the free trial period). Clearly, 5,000 people do not make up the entire population, so we have but a small sample of users.</p>
<p>What was the target that MusicWatch was trying to estimate? It seems that they wanted to know the percentage of <strong>all people who had signed up for Apple Music</strong> that were still using the service. Can they make inference about the entire population from the sample of 5,000?</p>
<p>If the sample is representative and the individuals are independent, we could use the number 48% as an estimate of the percentage in the population who no longer use the service. The press release from MusicWatch did not indicate any measure of uncertainty, so we don’t know how reliable the number is.</p>
<p>Interestingly, soon after the MusicWatch survey was released, Apple released a statement to the publication <em>The Verge</em>, stating that 79% of users who had signed up were still using the service (i.e. only 21% had stopped using it, as opposed to 48% reported by MusicWatch). In other words, Apple just came out and <em>gave us the truth</em>! This was unusual because Apple typically does not make public statements about newly launched products. I just found this amusing because I’ve never been in a situation where I was trying to estimate a parameter and then someone later just told me what its value was.</p>
<p>If we believe that Apple and MusicWatch were measuring the same thing in their analyses (and it’s not clear that they were), then it would suggest that MusicWatch’s estimate of the population percentage (48%) was quite far off from the true value (21%). What would explain this large difference?</p>
<ol>
<li><strong>Random variation</strong>. It’s true that MusicWatch’s survey was a small sample relative to the full population, but the sample was still big with 5,000 people. Furthermore, the analysis was fairly simple (just taking the proportion of users still using the service), so the uncertainty associated with that estimate is unlikely to be that large.</li>
<li><strong>Selection bias</strong>. Recall that it’s not clear how MusicWatch sampled its respondents, but it’s possible that the way that they did it led them to capture a set of respondents who were less inclined to use Apple Music. Beyond this, we can’t really say more without knowing the details of the survey process.</li>
<li><strong>Respondents are not independent</strong>. It’s possible that the survey respondents are not independent of each other. This would primarily affect the uncertainty about the estimate, making it larger than we might expect if the respondents were all independent. However, since we do not know what MusicWatch’s uncertainty about their estimate was in the first place, it’s difficult to tell if dependence between respondents could play a role. Apple’s number, of course, has no uncertainty.</li>
<li><strong>Measurement differences</strong>. This is the big one, in my opinion. We don’t know how either MusicWatch or Apple defined “still using the service”. You could imagine a variety of ways to determine whether a person was still using the service. You could ask “Have you used it in the last week?” or perhaps “Did you use it yesterday?” Responses to these questions would be quite different and would likely lead to different overall percentages of usage.</li>
</ol>
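<p>To put a rough number on the random variation and dependence points above, here is a minimal sketch in Python of the usual margin-of-error calculation for a sample proportion. It assumes a simple random sample and treats all 5,000 respondents as the denominator of the 48% figure, which may not match how MusicWatch computed it. The <code>design_effect</code> argument and the value 4.0 are hypothetical stand-ins for dependence between respondents, not anything reported by MusicWatch.</p>

```python
import math


def margin_of_error(p_hat, n, z=1.96, design_effect=1.0):
    """Approximate 95% margin of error for a sample proportion.

    Assumes a simple random sample. design_effect > 1 is a hypothetical
    variance inflation factor standing in for dependent respondents.
    """
    standard_error = math.sqrt(design_effect * p_hat * (1 - p_hat) / n)
    return z * standard_error


# Reported figures: 48% had stopped using the service; 5,000 people surveyed.
# Using all 5,000 as the denominator is an assumption.
moe = margin_of_error(0.48, 5000)
print(f"independent respondents: +/- {100 * moe:.1f} percentage points")

# Hypothetical: dependence between respondents that quadruples the variance.
moe_dependent = margin_of_error(0.48, 5000, design_effect=4.0)
print(f"design effect of 4:      +/- {100 * moe_dependent:.1f} percentage points")

# Gap between the MusicWatch estimate (48%) and Apple's figure (21%).
gap = 0.48 - 0.21
print(f"gap to explain:          {100 * gap:.0f} percentage points")
```

<p>Under these assumptions the margin of error is a point or two, even with substantial dependence, while the gap between the two figures is 27 percentage points. That is why random variation alone is an unlikely explanation.</p>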
We Used Data to Improve our HarvardX Courses: New Versions Start Oct 15
2015-09-29T09:53:31+00:00
http://simplystats.github.io/2015/09/29/we-used-data-to-improve-our-harvardx-courses-new-versions-start-oct-15
<p>You can sign up following links <a href="http://genomicsclass.github.io/book/pages/classes.html">here</a>.</p>
<p>Last semester we successfully <a href="http://simplystatistics.org/2014/11/25/harvardx-biomedical-data-science-open-online-training-curriculum-launches-on-january-19/">launched the second version</a> of my <a href="http://simplystatistics.org/2014/03/31/data-analysis-for-genomic-edx-course/">Data Analysis course</a>. To create the second version, the first was split into eight courses. Over 2,000 students successfully completed the first of these, but, as expected, the numbers were lower for the more advanced courses. We wanted to remove any structural problems keeping students from maximizing what they get from our courses, so we studied the assessment questions data, which included completion rate and time, and used the findings to make improvements. We also used qualitative data from the discussion board. The major changes to version 3 are the following:</p>
<ul>
<li>We no longer use R packages that Microsoft Windows users had trouble installing in the first course.</li>
<li>All courses are now designed to be completed in 4 weeks.</li>
<li>We added new assessment questions.</li>
<li>We improved the assessment questions determined to be problematic.</li>
<li>We split the two courses that students took the longest to complete into smaller modules. Students now have twice as much time to complete these.</li>
<li>We consolidated the case studies into one course.</li>
<li>We combined the materials from the statistics courses into a <a href="http://simplystatistics.org/2015/09/23/data-analysis-for-the-life-sciences-a-book-completely-written-in-r-markdown/">book</a>, which you can download <a href="https://leanpub.com/dataanalysisforthelifesciences">here</a>. The material in the book matches the material taught in class, so you can use it to follow along.</li>
</ul>
<p>You can enroll in any of the seven courses by following the links below. We will be on the discussion boards starting October 15, and we hope to see you there.</p>
<ol>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-1-statistics-harvardx-ph525-1x">Statistics and R for the Life Sciences</a> starts October 15.</li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-2-harvardx-ph525-2x">Introduction to Linear Models and Matrix Algebra</a> starts November 15.</li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-3-harvardx-ph525-3x">Statistical Inference and Modeling for High-throughput Experiments</a> starts December 15.</li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-4-harvardx-ph525-4x">High-Dimensional Data Analysis</a> starts January 15.</li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-5-harvardx-ph525-5x">Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays</a> starts February 15.</li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-6-high-harvardx-ph525-6x">High-performance Computing for Reproducible Genomics</a> starts March 15.</li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-7-case-harvardx-ph525-7x">Case Studies in Functional Genomics</a> starts April 15.</li>
</ol>
<p>The landing page for the series continues to be <a href="http://genomicsclass.github.io/book/pages/classes.html">here</a>.</p>
Data Analysis for the Life Sciences - a book completely written in R markdown
2015-09-23T09:37:27+00:00
http://simplystats.github.io/2015/09/23/data-analysis-for-the-life-sciences-a-book-completely-written-in-r-markdown
<p class="p1">
The book <em>Data Analysis for the Life Sciences</em> is now available on <a href="https://leanpub.com/dataanalysisforthelifesciences">Leanpub</a>.
</p>
<p class="p1">
<span class="s1"><img class="wp-image-4313 alignright" src="http://simplystatistics.org/wp-content/uploads/2015/09/title_page-232x300.jpg" alt="title_page" width="222" height="287" srcset="http://simplystatistics.org/wp-content/uploads/2015/09/title_page-232x300.jpg 232w, http://simplystatistics.org/wp-content/uploads/2015/09/title_page-791x1024.jpg 791w" sizes="(max-width: 222px) 100vw, 222px" />Data analysis is now part of practically every research project in the life sciences. In this book we use data and computer code to teach the necessary statistical concepts and programming skills to become a data analyst. Following in the footsteps of <a href="https://www.stat.berkeley.edu/~statlabs/">Stat Labs</a>, instead of showing theory first and then applying it to toy examples, we start with actual applications and describe the theory as it becomes necessary to solve specific challenges.<span class="Apple-converted-space"> We use simulations and data analysis examples to teach statistical concepts. </span></span><span class="s1">The book includes links to computer code that readers can use to program along as they read the book.</span>
</p>
<p class="p1">
It includes the following chapters: Inference, Exploratory Data Analysis, Robust Statistics, Matrix Algebra, Linear Models, Inference for High-Dimensional Data, Statistical Modeling, Distance and Dimension Reduction, Practical Machine Learning, and Batch Effects.
</p>
<p class="p1">
The text was completely written in R markdown and every section contains a link to the document that was used to create that section. This means that you can use <a href="http://yihui.name/knitr/">knitr</a> to reproduce any section of the book on your own computer. You can also access all these markdown documents directly from <a href="https://github.com/genomicsclass/labs">GitHub</a>. Please send a pull request if you fix a typo or other mistake! For now we are keeping the R markdowns for the exercises private since they contain the solutions. But you can see the solutions if you take our <a href="http://genomicsclass.github.io/book/pages/classes.html">online course</a> quizzes. If we find that most readers want access to the solutions, we will open them up as well.
</p>
<p class="p1">
The material is based on the online courses I have been teaching with <a href="http://mikelove.github.io/">Mike Love</a>. As we created the course, Mike and I wrote R markdown documents for the students and put them on GitHub. We then used <a href="http://www.stephaniehicks.com/githubPages_tutorial/pages/githubpages-jekyll.html">jekyll</a> to create a <a href="http://genomicsclass.github.io/book/">webpage</a> with html versions of the markdown documents. Jeff then convinced us to publish it on <del>Leanbup</del><a href="https://leanpub.com/dataanalysisforthelifesciences">Leanpub</a>. So we wrote a shell script that compiled the entire book into a Leanpub directory, and after countless hours of editing and tinkering we have a 450+ page book with over 200 exercises. The entire book compiles from scratch in about 20 minutes. We hope you like it.
</p>
The Leek group guide to writing your first paper
2015-09-18T10:57:26+00:00
http://simplystats.github.io/2015/09/18/the-leek-group-guide-to-writing-your-first-paper
<blockquote class="twitter-tweet" width="550">
<p lang="en" dir="ltr">
The <a href="https://twitter.com/jtleek">@jtleek</a> guide to writing your first academic paper <a href="https://t.co/APLrEXAS46">https://t.co/APLrEXAS46</a>
</p>
<p>
— Stephen Turner (@genetics_blog) <a href="https://twitter.com/genetics_blog/status/644540432534368256">September 17, 2015</a>
</p>
</blockquote>
<p>I have written guides on <a href="https://github.com/jtleek/reviews">reviewing papers</a>, <a href="https://github.com/jtleek/datasharing">sharing data</a>, and <a href="https://github.com/jtleek/rpackages">writing R packages</a>. One thing I haven’t touched on until now has been writing papers. Certainly for me, and I think for a lot of students, the hardest transition in graduate school is between taking classes and doing research.</p>
<p>There are several hard parts to this transition including trying to find a problem, trying to find an advisor, and having a ton of unstructured time. One of the hardest things I’ve found is knowing (a) when to start writing your first paper and (b) how to do it. So I wrote a guide for students in my group:</p>
<p><a href="https://github.com/jtleek/firstpaper">https://github.com/jtleek/firstpaper</a></p>
<p>On how to write your first paper. It might be useful for other folks as well so I put it up on Github. Just like with the other guides I’ve written this is a very opinionated (read: doesn’t apply to everyone) guide. I also would appreciate any feedback/pull requests people have.</p>
Not So Standard Deviations: The Podcast
2015-09-17T10:57:45+00:00
http://simplystats.github.io/2015/09/17/not-so-standard-deviations-the-podcast
<p>I’m happy to announce that I’ve started a brand new podcast called <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> with Hilary Parker at Etsy. Episode 1 “RCatLadies Origin Story” is available through SoundCloud. In this episode we talk about the origins of RCatLadies, evidence-based data analysis, my new book, and the Python vs. R debate.</p>
<p>You can subscribe to the podcast using the <a href="http://feeds.soundcloud.com/users/soundcloud:users:174789515/sounds.rss">RSS feed</a> from SoundCloud. We’ll be getting it up on iTunes hopefully very soon.</p>
<p><a href="https://api.soundcloud.com/tracks/224180667/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&oauth_token=1-138878-174789515-deb24181d01af">Download the audio file</a>.</p>
<p>Show Notes:</p>
<ul>
<li><a href="https://twitter.com/rcatladies">RCatLadies Twitter account</a></li>
<li>Hilary’s <a href="http://hilaryparker.com/2013/01/30/hilary-the-most-poisoned-baby-name-in-us-history/">analysis of the name Hilary</a></li>
<li><a href="https://leanpub.com/artofdatascience">The Art of Data Science</a></li>
<li>What is <a href="http://www.amstat.org/meetings/jsm.cfm">JSM</a>?</li>
<li><a href="https://en.wikipedia.org/wiki/A_rising_tide_lifts_all_boats">A rising tide lifts all boats</a></li>
</ul>
Interview with COPSS award Winner John Storey
2015-08-25T09:25:28+00:00
http://simplystats.github.io/2015/08/25/interview-with-copss-award-winner-john-storey
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey.jpg"><img class="aligncenter wp-image-4289 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-198x300.jpg" alt="jdstorey" width="198" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-198x300.jpg 198w, http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-132x200.jpg 132w" sizes="(max-width: 198px) 100vw, 198px" /></a></p>
<p><em>Editor’s Note: We are again pleased to interview the COPSS Presidents’ Award winner. The <a href="https://en.wikipedia.org/wiki/COPSS_Presidents%27_Award">COPSS Award</a> is one of the most prestigious in statistics, sometimes called the Nobel Prize in statistics. This year the award went to <a href="http://www.genomine.org/">John Storey</a> who also won the <a href="http://sml.princeton.edu/news/john-storey-receives-2015-mortimer-spiegelman-award">Mortimer Spiegelman award</a> for his outstanding contribution to public health statistics. This interview is a <a href="https://twitter.com/simplystats/status/631607146572988417">particular pleasure</a> since John was my Ph.D. advisor and has been a major role model and incredibly supportive mentor for me throughout my career. He also <a href="https://github.com/jdstorey/simplystatistics">did the whole interview in markdown and put it under version control at Github</a> so it is fully reproducible. </em></p>
<p><strong>SimplyStats: Do you consider yourself to be a statistician, data scientist, machine learner, or something else?</strong></p>
<p>JS: For the most part I consider myself to be a statistician, but I’m also very serious about genetics/genomics, data analysis, and computation. I was trained in statistics and genetics, primarily statistics. I was also exposed to a lot of machine learning during my training since Rob Tibshirani was my <a href="http://genealogy.math.ndsu.nodak.edu/id.php?id=69303">PhD advisor</a>. However, I consider my research group to be a data science group. We have the <a href="http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram">Venn diagram</a> reasonably well covered: experimentalists, programmers, data wranglers, and developers of theory and methods; biologists, computer scientists, and statisticians.</p>
<p><strong>SimplyStats: How did you find out you had won the COPSS Presidents’ Award?</strong></p>
<p>JS: I received a phone call from the chairperson of the awards committee while I was visiting the Department of Statistical Science at Duke University to <a href="https://stat.duke.edu/events/15731.html">give a seminar</a>. It was during the seminar reception, and I stepped out into the hallway to take the call. It was really exciting to get the news!</p>
<p><strong>SimplyStats: One of the areas where you have had a big impact is inference in massively parallel problems. How do you feel high-dimensional inference is different from more traditional statistical inference?</strong></p>
<p>JS: My experience is that the most productive way to approach high-dimensional inference problems is to first think about a given problem in the scenario where the parameters of interest are random, and the joint distribution of these parameters is incorporated into the framework. In other words, I first gain an understanding of the problem in a Bayesian framework. Once this is well understood, it is sometimes possible to move in a more empirical and nonparametric direction. However, I have found that I can be most successful if my first results are in this Bayesian framework.</p>
<p>As an example, Theorem 1 from <a href="http://genomics.princeton.edu/storeylab/papers/Storey_Annals_2003.pdf">Storey (2003) Annals of Statistics</a> was the first result I obtained in my work on false discovery rates. This paper <a href="https://statistics.stanford.edu/research/false-discovery-rate-bayesian-interpretation-and-q-value">first appeared as a technical report in early 2001</a>, and the results spawned further work on a <a href="http://genomics.princeton.edu/storeylab/papers/directfdr.pdf">point estimation approach</a> to false discovery rates, the <a href="http://genomics.princeton.edu/storeylab/papers/ETST_JASA_2001.pdf">local false discovery rate</a>, <a href="http://www.bioconductor.org/packages/release/bioc/html/qvalue.html">q-value</a> and its <a href="http://www.pnas.org/content/100/16/9440.full">application to genomics</a>, and a <a href="http://genomics.princeton.edu/storeylab/papers/623.pdf">unified theoretical framework</a>.</p>
<p>Besides false discovery rates, this approach has been useful in my work on the <a href="http://genomics.princeton.edu/storeylab/papers/Storey_JRSSB_2007.pdf">optimal discovery procedure</a> as well as <a href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161">surrogate variable analysis</a> (in particular, <a href="http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2011.645777#.VdxderxVhBc">Desai and Storey 2012</a> for surrogate variable analysis). For high-dimensional inference problems, I have also found it is important to consider whether there are any plausible underlying causal relationships among variables, even if causal inference is not the goal. For example, causal model considerations provided some key guidance in a <a href="http://www.nature.com/ng/journal/v47/n5/full/ng.3244.html">recent paper of ours</a> on testing for genetic associations in the presence of arbitrary population structure. I think there is a lot of insight to be gained by considering what is the appropriate approach for a high-dimensional inference problem under different causal relationships among the variables.</p>
<p><strong>SimplyStats: Do you have a process when you are tackling a hard problem or working with students on a hard problem?</strong></p>
<p>JS: I like to work on statistics research that is aimed at answering a specific scientific problem (usually in genomics). My process is to try to understand the why in the problem as much as the how. The path to success is often found in the former. I try first to find solutions to research problems by using simple tools and ideas. I like to get my hands dirty with real data as early as possible in the process. I like to incorporate some theory into this process, but I prefer methods that work really well in practice over those that have beautiful theory justifying them without demonstrated success on real-world applications. In terms of what I do day-to-day, listening to music is integral to my process, for both concentration and creative inspiration: typically <a href="https://en.wikipedia.org/wiki/King_Crimson">King Crimson</a> or some <a href="http://www.metal-archives.com/">variant of metal</a> or <a href="https://en.wikipedia.org/wiki/Brian_Eno">ambient</a> – which Simply Statistics co-founder <a href="http://jtleek.com/">Jeff Leek</a> got to <del>endure</del> enjoy for years during his PhD in my lab.</p>
<p><strong>SimplyStats: You are the founding Director of the Center for Statistics and Machine Learning at Princeton. What parts of the new gig are you most excited about?</strong></p>
<p>JS: Princeton closed its Department of Statistics in the early 1980s. Because of this, the style of statistician and machine learner we have here today is one who’s comfortable being appointed in a field outside of statistics or machine learning. Examples include myself in genomics, Kosuke Imai in political science, Jianqing Fan in finance and economics, and Barbara Engelhardt in computer science. Nevertheless, statistics and machine learning here is strong, albeit too small at the moment (which will be changing soon). This is an interesting place to start, very different from most universities.</p>
<p>What I’m most excited about is that we get to answer the question: “What’s the best way to build a faculty, educate undergraduates, and create a PhD program starting now, focusing on the most important problems of today?”</p>
<p>For those who are interested, we’ll be releasing a <a href="http://www.princeton.edu/strategicplan/taskforces/sml/">public version of our strategic plan</a> within about six months. We’re trying to do something unique and forward-thinking, which will hopefully make Princeton an influential member of the statistics, machine learning, and data science communities.</p>
<p><strong>SimplyStats: You are organizing the Tukey conference at Princeton (to be held September 18, <a href="http://csml.princeton.edu/tukey">details here</a>).</strong> <strong>Do you think Tukey’s influence will affect your vision for re-building statistics at Princeton?</strong></p>
<p>JS: Absolutely, Tukey has been and will be a major influence in how we re-build. He made so many important contributions, and his approach was extremely forward thinking and tied into real-world problems. I strongly encourage everyone to read Tukey’s 1962 paper titled <a href="https://projecteuclid.org/euclid.aoms/1177704711">The Future of Data Analysis</a>. Here he’s 50 years into the future, foreseeing the rise of data science. This paper has truly amazing insights, including:</p>
<blockquote>
<p>For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.</p>
<p>All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.</p>
<p>Data analysis is a larger and more varied field than inference, or incisive procedures, or allocation.</p>
<p>By and large, the great innovations in statistics have not had correspondingly great effects upon data analysis. . . . Is it not time to seek out novelty in data analysis?</p>
</blockquote>
<p>In this regard, another paper that has been influential in how we are re-building is Leo Breiman’s <a href="http://projecteuclid.org/euclid.ss/1009213726">Statistical Modeling: The Two Cultures</a>. We’re building something at Princeton that includes both cultures and seamlessly blends them into a bigger picture community concerned with data-driven scientific discovery and technology development.</p>
<p><strong>SimplyStats:</strong> <strong>What advice would you give young statisticians getting into the discipline now?</strong></p>
<p>JS: My most general advice is don’t isolate yourself within statistics. Interact with and learn from other fields. Work on problems that are important to practitioners of science and technology development. I recommend that students should master both “traditional statistics” and at least one of the following: (1) computational and algorithmic approaches to data analysis, especially those more frequently studied in machine learning or data science; (2) a substantive scientific area where data-driven discovery is extremely important (e.g., social sciences, economics, environmental sciences, genomics, neuroscience, etc.). I also recommend that students should consider publishing in scientific journals or computer science conference proceedings, in addition to traditional statistics journals. I agree with a lot of the constructive advice and commentary given on the Simply Statistics blog, such as encouraging students to learn about reproducible research, problem-driven research, software development, improving data analyses in science, and outreach to non-statisticians. These things are very important for the future of statistics.</p>
The Next National Library of Medicine Director Can Help Define the Future of Data Science
2015-08-24T10:00:26+00:00
http://simplystats.github.io/2015/08/24/the-next-national-library-of-medicine-director-can-help-define-the-future-of-data-science
<p>The main motivation for starting this blog was to share our enthusiasm about the increased importance of data and data analysis in science, industry, and society in general. Based on recent initiatives, such as <a href="https://datascience.nih.gov/bd2k">BD2k</a>, it is clear that the NIH is also enthusiastic and very much interested in supporting data science. For those that don’t know, the National Institutes of Health (NIH) is the largest public funder of biomedical research in the world. This federal agency has an annual budget of about $30 billion.</p>
<p>The NIH has <a href="http://www.nih.gov/icd/icdirectors.htm">several institutes</a>, each with its own budget and capability to guide funding decisions. Currently, the missions of most of these institutes relate to a specific disease or public health challenge. Many of them fund research in statistics and computing because these topics are important components of achieving their specific mission. Currently, however, there is no institute directly tasked with supporting data science per se. This is about to change.</p>
<p>The National Library of Medicine (NLM) is one of the few NIH institutes that is not focused on a particular disease or public health challenge. Apart from the important task of maintaining an actual library, it supports, among many other initiatives, indispensable databases such as PubMed, GenBank and GEO. After over 30 years of successful service as NLM director, Dr. Donald Lindberg stepped down this year and, as is customary, an advisory board was formed to advise the NIH on what’s next for NLM. One of the main recommendations of <a href="http://acd.od.nih.gov/reports/Report-NLM-06112015-ACD.pdf">the report</a> is the following:</p>
<blockquote>
<p>NLM should be the intellectual and programmatic epicenter for data science at NIH and stimulate its advancement throughout biomedical research and application.</p>
</blockquote>
<p>Data science features prominently throughout the report, making it clear the NIH is very much interested in further supporting this field. The next director can therefore have an enormous influence on the future of data science. So, if you love data, have administrative experience, and a vision about the future of data science as it relates to the medical and related sciences, consider this exciting opportunity.</p>
<p>Here is the <a href="http://www.jobs.nih.gov/vacancies/executive/nlm_director.htm">ad</a>.</p>
Interview with Sherri Rose and Laura Hatfield
2015-08-21T13:20:14+00:00
http://simplystats.github.io/2015/08/21/interview-with-sherri-rose-and-laura-hatfied
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose.png"><img class="aligncenter wp-image-4273 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose-300x200.png" alt="Sherri Rose and Laura Hatfield" width="300" height="200" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose-300x200.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose-260x173.png 260w, http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose.png 975w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p style="text-align: center;">
Rose/Hatfield © Savannah Bergquist
</p>
<p><em><a href="http://www.hcp.med.harvard.edu/faculty/core/laura-hatfield-phd">Laura Hatfield</a> and <a href="http://www.drsherrirose.com/">Sherri Rose</a> are Assistant Professors specializing in biostatistics at Harvard Medical School in the <a href="http://www.hcp.med.harvard.edu">Department of Health Care Policy</a>. Laura received her PhD in Biostatistics from the University of Minnesota and Sherri completed her PhD in Biostatistics at UC Berkeley. They are developing novel statistical methods for health policy problems.</em></p>
<p><strong><em>SimplyStats: Do you consider yourselves statisticians, data scientists, machine learners, or something else?</em></strong></p>
<p><strong>Rose</strong>: I’d definitely say a statistician. Even when I’m working on things that fall into the categories of data science or machine learning, there’s underlying statistical theory guiding that process, be it for methods development or applications. Basically, there’s a statistical foundation to everything I do.</p>
<p><strong>Hatfield</strong>: When people ask what I do, I start by saying that I do research in health policy. Then I say I’m a statistician by training and I work with economists and physicians. People have mistaken ideas about what a statistician or professor does, so describing my context and work seems more informative. If I’m at a party, I usually wrap it up in a bow as, “I crunch numbers to study how Obamacare is working.” [laughs]</p>
<p> </p>
<p><strong><em>SimplyStats: What is the</em></strong> <a href="http://www.healthpolicydatascience.org/"><strong><em>Health Policy Data Science Lab</em></strong></a><strong><em>? How did you decide to start that?</em></strong></p>
<p><strong>Hatfield</strong>: We wanted to give our trainees a venue to promote their work and get feedback from their peers. And it helps me keep up on the cool projects Sherri and her students are working on.</p>
<p><strong>Rose</strong>: This grew out of us starting to jointly mentor trainees. It’s been a great way for us to make intellectual contributions to each other’s work through Lab meetings. Laura and I approach statistics from <em>completely</em> different frameworks, but work on related applications, so that’s a unique structure for a lab.</p>
<p> </p>
<p><strong><em>SimplyStats: What kinds of problems are your groups working on these days? Are they mostly focused on health policy?</em></strong></p>
<p><strong>Rose</strong>: One of the fun things about working in health policy is that it is quite expansive. Statisticians can have an even bigger impact on science and public health if we take that next step: thinking about the policy implications of our research. And then, who needs to see the work in order to influence relevant policies. A couple projects I’m working on that demonstrate this breadth include a machine learning framework for risk adjustment in insurance plan payment and a new estimator for causal effects in a complex epidemiologic study of chronic disease. The first might be considered more obviously health policy, but the second will have important policy implications as well.</p>
<p><strong>Hatfield</strong>: When I start an applied collaboration, I’m also thinking, “Where is the methods paper?” Most of my projects use messy observational data, so there is almost always a methods paper. For example, many studies here need to find a control group from an administrative data source. I’ve been keeping track of challenges in this process. One of our Lab students is working with me on a pathological case of a seemingly benign control group selection method gone bad. I love the creativity required in this work; my first 10 analysis ideas may turn out to be infeasible given the data, but that’s what makes this fun!</p>
<p> </p>
<p><strong><em>SimplyStats: What are some particular challenges of working with large health data?</em></strong></p>
<p><strong>Hatfield</strong>: When I first heard about the huge sample sizes, I was excited! Then I learned that data not collected for research purposes…</p>
<p><strong>Rose</strong>: This was going to be my answer!</p>
<p><strong>Hatfield</strong>: …are <em>very</em> hard to use for research! In a recent project, I’ve been studying how giving people a tool to look up prices for medical services changes their health care spending. But the data set we have leaves out [painful pause] a lot of variables we’d like to use for control group selection and… a lot of the prices. But as I said, these gaps in the data are begging to be filled by new methods.</p>
<p><strong>Rose</strong>: I think the fact that we have similar answers is important. I’ve repeatedly seen “big data” not have a strong signal for the research question, since they weren’t collected for that purpose. It’s easy to get excited about thousands of covariates in an electronic health record, but so much of it is noise, and then you end up with an R<sup>2</sup> of 10%. It can be difficult enough to generate an effective prediction function, even with innovative tools, let alone try to address causal inference questions. It goes back to basics: what’s the research question and how can we translate that into a statistical problem we can answer given the limitations of the data.</p>
<p><strong><em>SimplyStats: You both have very strong data science skills but are in academic positions. Do you have any advice for students considering the tradeoff between academia and industry?</em></strong></p>
<p><strong>Hatfield</strong>: I think there is more variance within academia and within industry than between the two.</p>
<p><strong>Rose</strong>: Really? That’s surprising to me…</p>
<p><strong>Hatfield</strong>: I had stereotypes about academic jobs, but my current job defies those.</p>
<p><strong>Rose</strong>: What if a larger component of your research platform included programming tools and R packages? My immediate thought was about computing and its role in academia. Statisticians in genomics have navigated this better than some other areas. It can surely be done, but there are still challenges folding that into an academic career.</p>
<p><strong>Hatfield</strong>: I think academia imposes few restrictions on what you can disseminate compared to industry, where there may be more privacy and intellectual property concerns. But I take your point that R packages do not impress most tenure and promotion committees.</p>
<p><strong>Rose</strong>: You want to find a good match between how you like spending your time and what’s rewarded. Not all academic jobs are the same and not all industry jobs are alike either. I wrote a more detailed <a href="http://simplystatistics.org/2015/02/18/navigating-big-data-careers-with-a-statistics-phd/">guest post</a> on this topic for <em>Simply Statistics</em>.</p>
<p><strong>Hatfield</strong>: I totally agree you should think about how you’d actually spend your time in any job you’re considering, rather than relying on broad ideas about industry versus academia. Do you love writing? Do you love coding? etc.</p>
<p> </p>
<p><strong><em>SimplyStats: You are both adopters of social media as a mechanism of disseminating your work and interacting with the community. What do you think of social media as a scientific communication tool? Do you find it is enhancing your careers?</em></strong></p>
<p><strong>Hatfield</strong>: Sherri is my social media mentor!</p>
<p><strong>Rose</strong>: I think social media can be a useful tool for networking, finding and sharing neat articles and news, and putting your research out there to a broader audience. I’ve definitely received speaking invitations and started collaborations because people initially “knew me from Twitter.” It’s become a way to recruit students as well. Prospective students are more likely to “know me” from a guest post or Twitter than traditional academic products, like journal articles.</p>
<p><strong>Hatfield</strong>: I’m grateful for our <a href="https://twitter.com/HPDSLab">Lab’s new Twitter</a> because it’s a purely academic account. My personal account has been awkwardly transitioning to include professional content; I still tweet silly things there.</p>
<p><strong>Rose</strong>: My timeline might have <a href="https://twitter.com/sherrirose/status/569613197600272386">a cat picture</a> or <a href="https://twitter.com/sherrirose/status/601822958491926529">two</a>.</p>
<p><strong>Hatfield</strong>: My very favorite thing about academic Twitter is discovering things I wouldn’t have even known to search for, especially packages and tricks in R. For example, that’s how I got converted to tidy data and dplyr.</p>
<p><strong>Rose</strong>: I agree. I think it’s a fantastic place to become exposed to work that’s incredibly related to your own but in another field, and you wouldn’t otherwise find it preparing a typical statistics literature review.</p>
<p> </p>
<p><strong><em>SimplyStats: What would you change in the statistics community?</em></strong></p>
<p><strong>Rose</strong>: Mentoring. I was tremendously lucky to receive incredible mentoring as a graduate student and now as a new faculty member. Not everyone gets this, and trainees don’t know where to find guidance. I’ve actively reached out to trainees during conferences and university visits, erring on the side of offering too much unsolicited help, because I feel there’s a need for that. I also have a <a href="http://drsherrirose.com/resources">resources page</a> on my website that I continue to update. I wish I had a more global solution beyond encouraging statisticians to take an active role in mentoring not just your own trainees. We shouldn’t lose good people because they didn’t get the support they needed.</p>
<p><strong>Hatfield</strong>: I think we could make conferences much better! Being in the same physical space at the same time is very precious. I would like to take better advantage of that at big meetings to do work that requires face time. Talks are not an example of this. Workshops and hackathons and panels and working groups – these all make better use of face-to-face time. And are a lot more fun!</p>
<p> </p>
If you ask different questions you get different answers - one more way science isn't broken it is just really hard
2015-08-20T14:52:34+00:00
http://simplystats.github.io/2015/08/20/if-you-ask-different-quetions-you-get-different-asnwers-one-more-way-science-isnt-broken-it-is-just-really-hard
<p>If you haven’t already read the amazing piece by Christie Aschwanden on why <a href="http://fivethirtyeight.com/features/science-isnt-broken/">Science isn’t Broken</a> you should do so immediately. It does an amazing job of capturing the nuance of statistics as applied to real data sets and how that can be misconstrued as science being “broken” without falling for the easy “everything is wrong” meme.</p>
<p>One thing that caught my eye was how the piece highlighted a crowd-sourced data analysis of soccer red cards. The key figure for that analysis is this one:</p>
<p> </p>
<p><a href="http://fivethirtyeight.com/features/science-isnt-broken/"><img class="aligncenter" src="https://espnfivethirtyeight.files.wordpress.com/2015/08/truth-vigilantes-soccer-calls2.png?w=1024&h=597" alt="" width="1024" height="597" /></a></p>
<p>I think the figure and <a href="https://osf.io/qix4g/">underlying data</a> for this figure are fascinating in that they really highlight the human behavioral variation in data analysis and you can even see some <a href="http://simplystatistics.org/2015/04/29/data-analysis-subcultures/">data analysis subcultures</a> emerging from the descriptions of how people did the analysis and justified, or did not justify, the use of covariates.</p>
<p>One subtlety of the figure that I missed on the original reading is that not all of the estimates being reported are measuring the same thing. For example, if some groups adjusted for the country of origin of the referees and some did not, then the estimates for those two groups are measuring different things (the association conditional on country of origin or not, respectively). In this case the estimates may be different, but entirely consistent with each other, since they are just measuring different things.</p>
<p>If you ask two people to do the analysis and you only ask them the simple question: <em>Are referees more likely to give red cards to dark skinned players?</em> then you may get a different answer based on those two estimates. But the reality is the answers the analysts are reporting are actually to the questions:</p>
<ol>
<li>Are referees more likely to give red cards to dark skinned players holding country of origin fixed?</li>
<li>Are referees more likely to give red cards to dark skinned players averaging over country of origin (and everything else)?</li>
</ol>
<p>The subtlety lies in the fact that changes to covariates in the analysis are actually changing the hypothesis you are studying.</p>
<p>So in fact the conclusions in that figure may all be entirely consistent after you condition on asking the same question. I’d be interested to see the same plot, but only for the groups that conditioned on the same set of covariates, for example. This is just one more reason that science is really hard and why I’m so impressed at how well the FiveThirtyEight piece captured this nuance.</p>
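<p>A toy calculation makes the distinction between the two questions concrete. The sketch below uses hypothetical counts, invented purely for illustration (they are not taken from the red card data set), for two made-up referee countries labelled "A" and "B". Within each country the red card rate is the same for dark and light skinned players, yet averaging over countries gives a ratio well above 1, because in this invented example dark skinned players appear more often in front of the stricter referees. The function names are also just illustrative choices.</p>

```python
# Hypothetical counts, invented purely for illustration (NOT the red card data).
# Keys are (referee_country, skin_tone); values are (red_cards, player_games).
counts = {
    ("A", "dark"): (80, 800),
    ("A", "light"): (20, 200),
    ("B", "dark"): (4, 200),
    ("B", "light"): (16, 800),
}


def red_card_rate(cells):
    """Pooled red card rate over a list of (red_cards, player_games) cells."""
    reds = sum(r for r, _ in cells)
    games = sum(n for _, n in cells)
    return reds / games


def conditional_ratio(country):
    """Dark/light rate ratio holding the referee country fixed."""
    dark = red_card_rate([counts[(country, "dark")]])
    light = red_card_rate([counts[(country, "light")]])
    return dark / light


def marginal_ratio():
    """Dark/light rate ratio averaging over referee country."""
    dark = [v for (_, tone), v in counts.items() if tone == "dark"]
    light = [v for (_, tone), v in counts.items() if tone == "light"]
    return red_card_rate(dark) / red_card_rate(light)


for country in ("A", "B"):
    print("ratio within country", country, "=", conditional_ratio(country))
print("ratio averaging over countries =", round(marginal_ratio(), 2))
```

<p>In this made-up table the two conditional ratios are 1 and the marginal ratio is about 2.3. Neither analysis is wrong; they answer different questions.</p>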
<p> </p>
<p> </p>
P > 0.05? I can make any p-value statistically significant with adaptive FDR procedures
2015-08-19T10:38:31+00:00
http://simplystats.github.io/2015/08/19/p-0-05-i-can-make-any-p-value-statistically-significant-with-adaptive-fdr-procedures
<p>Everyone knows now that you have to correct for multiple testing when you calculate many p-values; otherwise this can happen:</p>
<div style="width: 550px" class="wp-caption aligncenter">
<a href="http://xkcd.com/882/"><img class="" src="http://imgs.xkcd.com/comics/significant.png" alt="" width="540" height="1498" /></a>
<p class="wp-caption-text">
http://xkcd.com/882/
</p>
</div>
<p> </p>
<p>One of the most popular ways to correct for multiple testing is to estimate or control the <a href="https://en.wikipedia.org/wiki/False_discovery_rate">false discovery rate</a>. The false discovery rate attempts to quantify the fraction of discoveries made that are false. If we call all p-values less than some threshold <em>t</em> significant, then, borrowing notation from this <a href="http://www.ncbi.nlm.nih.gov/pubmed/12883005">great introduction to false discovery rates</a>, we can write:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr3.gif"><img class="aligncenter size-full wp-image-4246" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr3.gif" alt="fdr3" width="285" height="40" /></a></p>
<p> </p>
<p>So <em>F(t)</em> is the (unknown) total number of null hypotheses called significant and <em>S(t)</em> is the total number of hypotheses called significant. The FDR is the expected ratio of these two quantities, which, under certain assumptions, can be approximated by the ratio of the expectations.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr4.gif"><img class="aligncenter size-full wp-image-4247" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr4.gif" alt="fdr4" width="246" height="44" /></a></p>
<p> </p>
<p>To get an estimate of the FDR we just need estimates for <em>E[F(t)]</em> and <em>E[S(t)]</em>. The latter is pretty easy to estimate as just the total number of rejections (the number of <em>p &lt; t</em>). If you assume that the null p-values follow their expected uniform distribution, then <em>E[F(t)]</em> can be approximated by multiplying together the fraction of null hypotheses, the total number of hypotheses, and <em>t</em>. To do this, we need an estimate for <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_d4c98d75e25f5d28461f1da221eb7a95.gif" style="vertical-align: middle; border: none; padding-bottom:1px;" class="tex" alt="\pi_0" /></span>, the proportion of null hypotheses. There are a large number of ways to estimate this quantity, but it is almost always estimated using the full distribution of computed p-values in an experiment. The most popular estimator compares the number of p-values greater than some cutoff to the number you would expect if every single hypothesis were null. This ratio is approximately the fraction of null hypotheses.</p>
<p>Combining the above equation with our estimates for <em>E[F(t)]</em> and <em>E[S(t)]</em> we get:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr5.gif"><img class="aligncenter size-full wp-image-4250" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr5.gif" alt="fdr5" width="238" height="42" /></a></p>
<p> </p>
<p>The q-value is a multiple testing analog of the p-value and is defined as:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr61.gif"><img class="aligncenter size-full wp-image-4258" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr61.gif" alt="fdr6" width="163" height="26" /></a></p>
<p> </p>
<p>This is of course a very loose description and you can get a more technical one <a href="http://www.genomine.org/papers/directfdr.pdf">here</a>. But the main thing to notice is that the q-value depends on the estimated proportion of null hypotheses, which depends on the distribution of the observed p-values. The smaller the estimated fraction of null hypotheses, the smaller the FDR estimate and the smaller the q-value. This suggests a way to make any p-value significant by altering its “testing partners”. Here is a quick example. Suppose that we have done a test and have a p-value of 0.8. Not super significant. Suppose we perform this test in conjunction with a number of hypotheses that are null, generating a p-value distribution like this.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals.png"><img class="aligncenter size-medium wp-image-4260" src="http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals-300x300.png" alt="uniform-pvals" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>Then you get a q-value greater than 0.99 as you would expect. But if you test that exact same p-value with a ton of other non-null hypotheses that generate tiny p-values in a distribution that looks like this:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals.png"><img class="aligncenter size-medium wp-image-4261" src="http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals-300x300.png" alt="significant-pvals" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p>Then you get a q-value of 0.0001 for that same p-value of 0.8. The reason is that the estimate of the fraction of null hypotheses goes essentially to zero, which drives down the q-value. You can do this with any p-value: if you make its testing partners have sufficiently low p-values, then the q-value will be as small as you like.</p>
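<p>The mechanics can be sketched in a few lines of Python. This is a simplified illustration of the adaptive estimate described above, not the code used to produce the numbers in this post. The function names, the fixed cutoff <code>lam = 0.5</code> for estimating the null fraction, and the two deterministic sets of testing partners (an evenly spaced grid standing in for uniform null p-values, and a block of identical tiny p-values) are all assumptions made for the example. The exact q-values depend on the estimator and on the number of tests, so they need not match the figures quoted above.</p>

```python
def estimate_pi0(pvals, lam=0.5):
    """Estimate the fraction of null hypotheses from the p-values above lam."""
    m = len(pvals)
    n_above = sum(1 for p in pvals if p > lam)
    # If every hypothesis were null we would expect m * (1 - lam) p-values above lam.
    return min(1.0, n_above / (m * (1.0 - lam)))


def q_values(pvals, lam=0.5):
    """q-value for each p-value: the smallest estimated FDR over thresholds t >= p."""
    m = len(pvals)
    pi0 = estimate_pi0(pvals, lam)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the smallest FDR estimate seen so far.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        fdr_hat = pi0 * m * pvals[i] / rank  # pi0 * m * t / (number of p-values <= t)
        running_min = min(running_min, fdr_hat)
        qvals[i] = running_min
    return qvals


target = 0.8

# Assumed partners 1: an evenly spaced grid, standing in for uniform null p-values.
null_partners = [(i + 0.5) / 999 for i in range(999)]
# Assumed partners 2: many tiny p-values, standing in for strongly non-null tests.
tiny_partners = [1e-6] * 9999

q_with_nulls = q_values(null_partners + [target])[-1]
q_with_tiny = q_values(tiny_partners + [target])[-1]
print("q-value of p = 0.8 among null-like partners:", q_with_nulls)
print("q-value of p = 0.8 among tiny partners:", q_with_tiny)
```

<p>With these assumed inputs the same p-value of 0.8 gets a q-value above 0.99 in the first setting and far below 0.05 in the second, purely because the estimated null fraction collapses.</p>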
<p>A couple of things to note:</p>
<ul>
<li>Obviously doing this on purpose to change the significance of a calculated p-value is cheating and shouldn’t be done.</li>
<li>For correctly calculated p-values on a related set of hypotheses this is actually a sensible property to have - if you have almost all very small p-values and one very large p-value, you are doing a set of tests where almost everything appears to be alternative and you should weight that in some sensible way.</li>
<li>This is the reason that sometimes a “multiple testing adjusted” p-value (or q-value) is smaller than the p-value itself.</li>
<li>This doesn’t affect non-adaptive FDR procedures - but those procedures still depend on the “testing partners” of any p-value through the total number of tests performed. This is why people talk about the so-called “multiple testing burden”. But that is a subject for a future post. It is also the reason non-adaptive procedures can be severely underpowered compared to adaptive procedures when the p-values are correct.</li>
<li>I’ve appended the code to generate the histograms and calculate the q-values in this post in the following gist.</li>
</ul>
<p> </p>
UCLA Statistics 2015 Commencement Address
2015-08-12T10:34:03+00:00
http://simplystats.github.io/2015/08/12/ucla-statistics-2015-commencement-address
<p>I was asked to speak at the <a href="http://www.stat.ucla.edu">UCLA Department of Statistics</a> Commencement Ceremony this past June. As one of the first graduates of that department back in 2003, I was tremendously honored to be invited to speak to the graduates. When I arrived I was just shocked at how much the department had grown. When I graduated I think there were no more than 10 of us between the PhD and Master’s programs. Now they have ~90 graduates per year with undergrad, Master’s and PhD. It was just stunning.</p>
<p>Here’s the text of what I said, which I think I mostly stuck to in the actual speech.</p>
<p> </p>
<p><strong>UCLA Statistics Graduation: Some thoughts on a career in statistics</strong></p>
<p>When I asked Rick [Schoenberg] what I should talk about, he said to “talk for 95 minutes on asymptotic properties of maximum likelihood estimators under nonstandard conditions”. I thought this is a great opportunity! I busted out Tom Ferguson’s book and went through my old notes. Here we go. Let X be a complete normed vector space….</p>
<p>I want to thank the department for inviting me here today. It’s always good to be back. I entered the UCLA stat department in 1999, only the second entering class, and graduated from UCLA Stat in 2003. Things were different then. Jan was the chair and there were not many classes so we could basically do whatever we wanted. Things are different now and that’s a good thing. Since 2003, I’ve been at the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, where I was first a postdoctoral fellow and then joined the faculty. It’s been a wonderful place for me to grow up and I’ve learned a lot there.</p>
<p>It’s just an incredible time to be a statistician. You guys timed it just right. I’ve been lucky enough to witness two periods like this, the first time being when I graduated from college at the height of the dot com boom. Today, it’s not computer programming skills that the world needs, but rather it’s statistical skills. I wish I were in your shoes today, just getting ready to start up. But since I’m not, I figured the best thing I could do is share some of the things I’ve learned and talk about the role that these things have played in my own life.</p>
<p>Know your edge: What’s the one thing that you know that no one else seems to know? You’re not a clone—you have original ideas and skills. You might think they’re not valuable but you’re wrong. Be proud of these ideas and use them to your advantage. As an example, I’ll give you my one thing. Right now, I believe the greatest challenge facing the field of statistics today is getting the entire world to know what we in this room already know. Data are everywhere today and the biggest barrier to progress is our collective inability to process and analyze those data to produce useful information. The need for the things that we know has absolutely exploded and we simply have not caught up. That’s why I created, along with Jeff Leek and Brian Caffo, the Johns Hopkins Data Science Specialization, which is currently the most successful massive open online course program ever. Our goal is to teach the entire world statistics, which we think is an essential skill. We’re not quite there yet, but—assuming you guys don’t steal my idea—I’m hopeful that we’ll get there sometime soon.</p>
<p>At some point the edge you have will no longer work: That sounds like a bad thing, but it’s actually good. If what you’re doing really matters, then at some point everyone will be doing it. So you’ll need to find something else. I’ve been confronted with this problem at least 3 times in my life so far. Before college, I was pretty good at the violin, and it opened a lot of doors for me. It got me into Yale. But when I got to Yale, I quickly realized that there were a lot of really good violinists here. Suddenly, my talent didn’t have so much value. This was when I started to pick up computer programming and in 1998 I learned an obscure little language called R. When I got to UCLA I realized I was one of the only people who knew R. So I started a little brown bag lunch series where I’d talk about some feature of R to whomever would show up (which wasn’t many people usually). Picking up on R early on turned out to be really important because it was a small community back then and it was easy to have a big impact. Also, as more and more people wanted to learn R, they’d usually call on me. It’s always nice to feel needed. Over the years, the R community exploded and R’s popularity got to the point where it was being talked about in the New York Times. But now you see the problem. Saying that you know R doesn’t exactly distinguish you anymore, so it’s time to move on again. These days, I’m realizing that the one useful skill that I have is the ability to make movies. Also, my experience being a performer on the violin many years ago is coming in handy. My ability to quickly record and edit movies was one of the key factors that enabled me to create an entire online data science program in 2 months last year.</p>
<p>Find the right people, and stick with them forever. Being a statistician means working with other people. Choose those people wisely and develop a strong relationship. It doesn’t matter how great the project is or how famous or interesting the other person is, if you can’t get along then bad things will happen. Statistics and data analysis is a highly verbal process that requires constant and very clear communication. If you’re uncomfortable with someone in any way, everything will suffer. Data analysis is unique in this way—our success depends critically on other people. I’ve only had a few collaborators in the past 12 years, but I love them like family. When I work with these people, I don’t necessarily know what will happen, but I know it will be good. In the end, I honestly don’t think I’ll remember the details of the work that I did, but I’ll remember the people I worked with and the relationships I built.</p>
<p>So I hope you weren’t expecting a new asymptotic theorem today, because this is pretty much all I’ve got. As you all go on to the next phase of your life, just be confident in your own ideas, be prepared to change and learn new things, and find the right people to do them with. Thank you.</p>
Correlation is not a measure of reproducibility
2015-08-12T10:33:25+00:00
http://simplystats.github.io/2015/08/12/correlation-is-not-a-measure-of-reproducibility
<p>Biologists make wide use of correlation as a measure of reproducibility. Specifically, they quantify reproducibility with the correlation between measurements obtained from replicated experiments. For example, <a href="https://genome.ucsc.edu/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf">the ENCODE data standards document</a> states</p>
<blockquote>
<p>A typical R<sup>2</sup> (Pearson) correlation of gene expression (RPKM) between two biological replicates, for RNAs that are detected in both samples using RPKM or read counts, should be between 0.92 to 0.98. Experiments with biological correlations that fall below 0.9 should be either be repeated or explained.</p>
</blockquote>
<p>However, for reasons I will explain here, correlation is not necessarily informative with regards to reproducibility. The mathematical results described below are not inconsequential theoretical details, and understanding them will help you assess new technologies, experimental procedures and computation methods.</p>
<p>Suppose you have collected data from an experiment</p>
<p style="text-align: center;">
<em>x</em><sub>1</sub>, <em>x</em><sub>2</sub>,..., <em>x</em><sub>n</sub>
</p>
<p>and want to determine if a second experiment replicates these findings. For simplicity, we represent data from the second experiment as adding unbiased (averages out to 0) and statistically independent measurement error <em>d</em> to the first:</p>
<p style="text-align: center;">
<em>y</em><sub>1</sub>=<em>x</em><sub>1</sub>+<em>d</em><sub>1</sub>, <em>y</em><sub>2</sub>=<em>x</em><sub>2</sub>+<em>d</em><sub>2</sub>, ... <em>y</em><sub>n</sub>=<em>x</em><sub>n</sub>+<em>d</em><sub>n</sub>.
</p>
<p>For us to claim reproducibility we want the differences</p>
<p style="text-align: center;">
<em>d</em><sub>1</sub>=<em>y</em><sub>1</sub>-<em>x</em><sub>1</sub>, <em>d</em><sub>2</sub>=<em>y</em><sub>2</sub>-<em>x</em><sub>2</sub>,<em>... </em>,<em>d</em><sub>n</sub>=<em>y</em><sub>n</sub>-<em>x</em><sub>n</sub>
</p>
<p>to be “small”. To give this some context, imagine the <em>x</em> and <em>y</em> are log scale (base 2) gene expression measurements which implies the <em>d</em> represent log fold changes. If these differences have a standard deviation of 1, it implies that fold changes of 2 are typical between replicates. If our replication experiment produces measurements that are typically twice as big or twice as small as the original, I am not going to claim the measurements are reproduced. However, as it turns out, such terrible reproducibility can still result in correlations higher than 0.92.</p>
<p>To someone basing their definition of correlation on the current common language usage this may seem surprising, but to someone basing it on math, it is not. To see this, note that the mathematical definition of correlation tells us that because <em>d</em> and <em>x</em> are independent:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/pearsonformula.png"><img class=" aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/pearsonformula-300x55.png" alt="pearsonformula" width="300" height="55" /></a></p>
<p>This tells us that correlation summarizes the variability of <em>d</em> relative to the variability of <em>x</em>. Because of the wide range of gene expression values we observe in practice, the standard deviation of <em>x</em> can easily be as large as 3 (variance is 9). This implies we expect to see correlations as high as 1/sqrt(1+1/9) = 0.95, despite the lack of reproducibility when comparing <em>x</em> to <em>y</em>.</p>
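<p>A quick simulation can be used to check this arithmetic. The sketch below is only an illustration under assumed settings: normally distributed <em>x</em> with standard deviation 3 (the mean of 8 is arbitrary), independent normal errors <em>d</em> with standard deviation 1, a fixed random seed, and a hand-written <code>pearson</code> helper. None of these choices come from real expression data.</p>

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(1)  # fixed seed so the run is repeatable
n = 100_000
sd_x, sd_d = 3.0, 1.0  # assumed spread of expression values and of replicate error

x = [random.gauss(8.0, sd_x) for _ in range(n)]  # first experiment (log2 scale)
d = [random.gauss(0.0, sd_d) for _ in range(n)]  # unbiased, independent error
y = [a + b for a, b in zip(x, d)]                # second experiment

expected = 1 / math.sqrt(1 + sd_d ** 2 / sd_x ** 2)
observed = pearson(x, y)
print("correlation from the formula:", round(expected, 3))
print("correlation in the simulation:", round(observed, 3))
```

<p>Both numbers come out close to 0.95 even though, by construction, typical differences between replicates are a full unit on the log2 scale.</p>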
<p>Note that using Spearman correlation does not fix this problem. A Spearman correlation of 1 tells us that the ranks of <em>x</em> and <em>y</em> are preserved, yet does not summarize the actual differences. The problem comes down to the fact that we care about the variability of <em>d</em> and correlation, Pearson or Spearman, does not provide an optimal summary. While correlation relates to the preservation of ranks, a much more appropriate summary of reproducibility is the distance between <em>x</em> and <em>y</em> which is related to the standard deviation of the differences <em>d</em>. A very simple R command you can use to generate this summary statistic is:</p>
<pre>sqrt(mean(d^2))</pre>
<p>or the robust version:</p>
<pre>median(abs(d)) ##multiply by 1.4826 for unbiased estimate of true sd
</pre>
<p>The equivalent suggestion for plots is to make an <a href="https://en.wikipedia.org/wiki/MA_plot">MA-plot</a> instead of a scatterplot.</p>
<p>But aren’t correlations and distances directly related? Sort of, and this actually brings up another problem. If the <em>x</em> and <em>y</em> are standardized to have average 0 and standard deviation 1 then, yes, correlation and distance are directly related:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/distcorr.png"><img class=" size-medium wp-image-4202 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/distcorr-300x51.png" alt="distcorr" width="300" height="51" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/distcorr-300x51.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/distcorr-260x44.png 260w, http://simplystatistics.org/wp-content/uploads/2015/08/distcorr.png 878w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>However, if instead <em>x</em> and <em>y</em> have different average values, which would put into question reproducibility, then distance is sensitive to this problem while correlation is not. If the standard deviation is 1, the formula is:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/distcor2.png"><img class=" size-medium wp-image-4204 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/distcor2-300x27.png" alt="distcor2" width="300" height="27" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/distcor2-300x27.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/distcor2-1024x94.png 1024w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>Once we consider units (standard deviations different from 1) then the relationship becomes even more complicated. Two advantages of distance you should be aware of are:</p>
<ol>
<li>it is in the same units as the data, while correlations have no units making it hard to interpret and select thresholds, and</li>
<li>distance accounts for bias (differences in average), while correlation does not.</li>
</ol>
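<p>The second point can be illustrated with a short sketch on simulated data (the shift of 2 units and the noise level are arbitrary choices for illustration): adding a constant to <em>y</em> leaves the correlation unchanged, while the distance summary picks up the bias.</p>
<pre>set.seed(1)
x = rnorm(1000, mean = 8, sd = 3)
y = x + rnorm(1000, sd = 0.5)    ## a reasonably good replicate
y_shifted = y + 2                ## the same replicate with a constant bias
c(cor(x, y), cor(x, y_shifted))  ## the two correlations are the same
sqrt(mean((y - x)^2))            ## distance without the bias
sqrt(mean((y_shifted - x)^2))    ## distance grows once the bias is added
</pre>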
<p>A final important point relates to the use of correlation with data that is not approximately normal. The useful interpretation of correlation as a summary statistic stems from the bivariate normal approximation: for every standard unit increase in the first variable, the second variable increases <em>r</em> standard units, with <em>r</em> the correlation. A summary of this is <a href="http://genomicsclass.github.io/book/pages/exploratory_data_analysis_2.html">here</a>. However, when data is not normal this interpretation no longer holds. Furthermore, heavy-tailed distributions, which are common in genomics, can lead to instability. Here is an example of uncorrelated data with a single point added that leads to correlations close to 1. This is quite common with RNAseq data.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2.png"><img class=" size-medium wp-image-4208 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-300x300.png" alt="supp_figure_2" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-200x200.png 200w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
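<p>A similar picture is easy to reproduce. The sketch below uses simulated data rather than the data in the figure, and the outlier value of 1000 is an arbitrary choice:</p>
<pre>set.seed(1)
x = rnorm(100)
y = rnorm(100)          ## generated independently of x
cor(x, y)               ## sample correlation is small
x_out = c(x, 1000)      ## add a single extreme point
y_out = c(y, 1000)
cor(x_out, y_out)       ## Pearson correlation is now close to 1
cor(x_out, y_out, method = "spearman")  ## ranks are much less affected
</pre>
<p>In this simulated case the rank-based version is more stable, but as noted above it still does not summarize the size of the differences.</p>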
<p> </p>
rafalib package now on CRAN
2015-08-10T10:00:26+00:00
http://simplystats.github.io/2015/08/10/rafalib-package-now-on-cran
<p>For the last several years I have been <a href="https://github.com/ririzarr/rafalib">collecting functions</a> I routinely use during exploratory data analysis in a private R package. <a href="http://mike-love.net/">Mike Love</a> and I used some of these in our HarvardX course and now, due to popular demand, I have created man pages and added the <a href="https://cran.r-project.org/web/packages/rafalib/">rafalib</a> package to CRAN. Mike has made several improvements and added some functions of his own. Here are quick descriptions of the rafalib functions I use most:</p>
<p>mypar - Before making a plot in R I almost always type <tt>mypar()</tt>. This basically gets around the suboptimal defaults of <tt>par</tt>. For example, it makes the margins (<tt>mar</tt>, <tt>mgp</tt>) smaller and defines RColorBrewer colors as defaults. It is optimized for the RStudio window. Another advantage is that you can type <tt>mypar(3,2)</tt> instead of <tt>par(mfrow=c(3,2))</tt>. <tt>bigpar()</tt> is optimized for R presentations or PowerPoint slides.</p>
<p>as.fumeric - This function turns characters into factors and then into numerics. This is useful, for example, if you want to plot values <tt>x,y</tt> with colors defined by their corresponding categories saved in a character vector <tt>labs</tt>: <tt>plot(x,y,col=as.fumeric(labs))</tt>.</p>
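<p>For readers without the package installed, the idea can be sketched in base R. This is an approximation of the behavior described above, not the package source, the data are made up, and the exact numbering of the categories may differ from what <tt>as.fumeric</tt> returns:</p>
<pre>labs = c("control", "treated", "control", "treated", "other")
cols = as.numeric(as.factor(labs))  ## characters to factor to numeric codes
cols                                ## 1 3 1 3 2 (levels sorted alphabetically)
x = 1:5
y = c(2.1, 3.9, 2.4, 4.2, 3.0)
plot(x, y, col = cols, pch = 16)    ## one color per category
</pre>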
<p>shist (smooth histogram, pronounced <em>shitz</em>) - I wrote this function because I have a hard time interpreting the y-axis of <tt>density</tt>. The height of the curve drawn by <tt>shist</tt> can be interpreted as the height of a histogram if you used the units shown on the plot. Also, it automatically draws a smooth histogram for each entry in a matrix on the same plot.</p>
<p>splot (subset plot) - The datasets I work with are typically large enough that <tt>plot(x,y)</tt> involves millions of points, which is <a href="http://stackoverflow.com/questions/7714677/r-scatterplot-with-too-many-points">a problem</a>. Several solutions are available to avoid overplotting, such as alpha-blending, hexbinning and 2d kernel smoothing. For reasons I won’t explain here, I generally prefer subsampling over these solutions. <tt>splot</tt> automatically subsamples. You can also specify an index that defines the subset.</p>
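<p>The subsampling idea itself is simple to sketch in base R. This is not the <tt>splot</tt> source, and the subsample size of 10,000 is an arbitrary choice for illustration:</p>
<pre>set.seed(1)
n = 1000000
x = rnorm(n)
y = x + rnorm(n)
ind = sample(n, 10000)   ## random subset of the indexes, without replacement
plot(x[ind], y[ind])     ## plot 10,000 points instead of a million
</pre>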
<p>sboxplot (smart boxplot) - This function draws points, boxplots or outlier-less boxplots depending on sample size. Coming soon is the kaboxplot (Karl Broman box-plots) for when you have too many boxplots.</p>
<p>install_bioc - For Bioconductor users, this function simply does the <tt>source("http://www.bioconductor.org/biocLite.R")</tt> for you and then uses <tt>biocLite</tt> to install.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1.png"><img class="alignnone size-large wp-image-4190" src="http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-1024x773.png" alt="unnamed" width="990" height="747" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-300x226.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-1024x773.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-260x196.png 260w, http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1.png 1035w" sizes="(max-width: 990px) 100vw, 990px" /></a></p>
Interested in analyzing images of brains? Get started with open access data.
2015-08-09T21:29:17+00:00
http://simplystats.github.io/2015/08/09/interested-in-analyzing-images-of-brains-get-started-with-open-access-data
<div>
<i>Editor's note: This is a guest post by <a href="http://www.anieloyan.com/" target="_blank"><span class="lG">Ani</span> Eloyan</a>. She is an Assistant Professor of Biostatistics at Brown University. Dr. Eloyan’s work focuses on</i> <i>semi-parametric likelihood based methods for matrix decompositions, statistical analyses of brain images, and the integration of various types of complex data structures for analyzing health care data</i><i>. She received her PhD in statistics from North Carolina State University and subsequently completed a postdoctoral fellowship in the <a href="http://www.biostat.jhsph.edu/">Department of Biostatistics at Johns Hopkins University</a>. Dr. Eloyan and her team won the <a>ADHD200 Competition</a></i> <i>discussed in <a href="http://journal.frontiersin.org/article/10.3389/fnsys.2012.00061/abstract" target="_blank">this</a> article. She tweets <a href="https://twitter.com/eloyan_ani">@eloyan_ani</a>.</i>
</div>
<div>
<i> </i>
</div>
<div>
<div>
Neuroscience is one of the exciting new fields for biostatisticians interested in real world applications where they can contribute novel statistical approaches. Most research in brain imaging has historically included studies run for small numbers of patients. While justified by the costs of data collection, the claims based on analyzing data for such small numbers of subjects often do not hold for our populations of interest. As discussed in <a href="http://www.huffingtonpost.com/american-statistical-association/wanted-neuroquants_b_3749363.html" target="_blank">this</a> article, there is a huge demand for biostatisticians in the field of quantitative neuroscience: so-called neuroquants or neurostatisticians. However, while more statisticians are interested in the field, we are far from competing with other substantive domains. For instance, a quick search of abstract keywords in the online program of the upcoming <a href="https://www.amstat.org/meetings/jsm/2015/" target="_blank">JSM2015</a> conference for “brain imaging” and “neuroscience” results in 15 records, while a search for the words “genomics” and “genetics” generates 76 <a>records</a>.
</div>
<div>
</div>
<div>
Assuming you are trained in statistics and an aspiring neuroquant, how would you go about working with brain imaging data? As a graduate student in the <a href="http://www.stat.ncsu.edu/" target="_blank">Department of Statistics at NCSU</a> several years ago, I was very interested in working on statistical methods that would be directly applicable to solve problems in neuroscience. But I had this same question: “Where do I find the data?” I soon learned that to <i>really</i> approach substantial relevant problems I also needed to learn about the subject matter underlying these complex data structures.
</div>
<div>
</div>
<div>
In recent years, several leading groups have uploaded their lab data with the common goal of fostering the collection of high dimensional brain imaging data to build powerful models that can give generalizable results. <a href="http://www.nitrc.org/" target="_blank">Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC)</a>, founded in 2006, is a platform for public data sharing that facilitates streamlining data processing pipelines and compiling high dimensional imaging datasets for crowdsourcing the analyses. It includes data for people with neurological diseases and neurotypical children and adults. If you are interested in Alzheimer’s disease, you can check out <a href="http://adni.loni.usc.edu/" target="_blank">ADNI</a>. <a href="http://fcon_1000.projects.nitrc.org/indi/abide/" target="_blank">ABIDE</a> provides data for people with Autism Spectrum Disorder and neurotypical peers. <a href="http://fcon_1000.projects.nitrc.org/indi/adhd200/" target="_blank">ADHD200</a> was released in 2011 as a part of a competition to motivate building predictive methods for disease diagnoses using functional magnetic resonance imaging (fMRI) in addition to demographic information to predict whether a child has attention deficit hyperactivity disorder (ADHD). While the competition ended in 2011, the dataset has been widely utilized afterwards in studies of ADHD. According to Google Scholar, the <a href="http://www.nature.com/mp/journal/v19/n6/abs/mp201378a.html" target="_blank">paper</a> introducing the ABIDE set has been cited 129 times since 2013 while the <a href="http://journal.frontiersin.org/article/10.3389/fnsys.2012.00062/full" target="_blank">paper</a> discussing the ADHD200 has been cited 51 times since <span style="font-family: Arial;">2012. These are only a few examples from the list of open access datasets that could be utilized by statisticians. </span>
</div>
<div>
</div>
<div>
Anyone can download these datasets (you may need to register and complete some paperwork in some cases); however, there are several data processing and cleaning steps to perform before the final statistical analyses. These preprocessing steps can be daunting for a statistician new to the field, especially as the tools used for preprocessing may not be available in R. <a href="https://hopstat.wordpress.com/2014/08/27/statisticians-in-neuroimaging-need-to-learn-preprocessing/" target="_blank">This</a> discussion makes the case as to why statisticians need to be involved in every step of preprocessing the data, while <u><a href="https://hopstat.wordpress.com/2014/06/17/fslr-an-r-package-interfacing-with-fsl-for-neuroimaging-analysis/" target="_blank">this R package</a></u> contains new tools linking R to a commonly used platform <a href="http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/" target="_blank">FSL</a>. However, as a newcomer, it can be easier to start with data that are already processed. <a href="http://projecteuclid.org/euclid.ss/1242049389" target="_blank">This</a> excellent overview by Dr. Martin Lindquist provides an introduction to the different types of analyses for brain imaging data from a statistician’s point of view, while our <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0089470" target="_blank">paper</a> provides tools in R and example datasets for implementing some of these methods. At least one course on Coursera can help you get started with <a href="https://www.coursera.org/course/fmri" target="_blank">functional MRI</a> data. Talking to and reading the papers of biostatisticians working in the field of quantitative neuroscience and scientists in the field of neuroscience is the key.
</div>
</div>
Statistical Theory is our "Write Once, Run Anywhere"
2015-08-09T11:19:53+00:00
http://simplystats.github.io/2015/08/09/statistical-theory-is-our-write-once-run-anywhere
<p>Having followed the software industry as a casual bystander, I periodically see the tension flare up between the idea of writing “native apps”, software that is tuned to a particular platform (Windows, Mac, etc.) and more cross-platform apps, which run on many platforms without too much modification. Over the years it has come up in many different forms, but the fundamentals are the same. Back in the day, there was Java, which was supposed to be the platform that ran on any computing device. Sun Microsystems originated the phrase “<a href="https://en.wikipedia.org/wiki/Write_once,_run_anywhere">Write Once, Run Anywhere</a>” to illustrate the cross-platform strengths of Java. More recently, Steve Jobs famously <a href="https://www.apple.com/hotnews/thoughts-on-flash/">banned Flash</a> from any iOS device. Apple is also moving away from standards like OpenGL and towards its own Metal platform.</p>
<p>What’s the problem with “write once, run anywhere”, or of cross-platform development more generally, assuming it’s possible? Well, there are a <a href="https://en.wikipedia.org/wiki/Cross-platform#Challenges_to_cross-platform_development">number of issues</a>: often there are performance penalties, it may be difficult to use the native look and feel of a platform, and you may be reduced to using the “lowest common denominator” of feature sets. It seems to me that anytime a new meta-platform comes out that promises to relieve programmers of the burden of having to write for multiple platforms, it eventually gets modified or subsumed by the need to optimize apps for a given platform as much as possible. The need to squeeze as much juice out of an app seems to be too important an opportunity to pass up.</p>
<p>In statistics, theory and theorems are our version of “write once, run anywhere”. The basic idea is that theorems provide an abstract layer (a “virtual machine”) that allows us to reason across a large number of specific problems. Think of the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a>, probably our most popular theorem. It could be applied to any problem/situation where you have a notion of sample size that could in principle be increasing.</p>
<p>But can it be applied to every situation, or even any situation? This might be more of a philosophical question, given that the CLT is stated asymptotically (maybe we’ll find out the answer eventually). In practice, my experience is that many people attempt to apply it to problems where it likely is not appropriate. Think, large-scale studies with a sample size of 10. Many people will use Normal-based confidence intervals in those situations, but they probably have very poor coverage.</p>
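<p>A small simulation gives a feel for this. The setup is a sketch under assumptions chosen for illustration: exponential data with true mean 1, a sample size of 10, and the usual Normal-based 95% interval.</p>
<pre>set.seed(1)
n = 10
B = 10000
covered = replicate(B, {
  x = rexp(n, rate = 1)                    ## skewed data, true mean is 1
  z = (mean(x) - 1) / (sd(x) / sqrt(n))    ## standardized error of the mean
  qnorm(0.975) >= abs(z)                   ## TRUE if the interval covers 1
})
mean(covered)                              ## empirical coverage, nominal is 0.95
</pre>
<p>In runs like this the empirical coverage tends to fall noticeably below the nominal 95%, and other distributions or sample sizes will give different answers.</p>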
<p>Because the CLT doesn’t apply in many situations (small sample, dependent data, etc.), variations of the CLT have been developed, as well as entirely different approaches to achieving the same ends, like confidence intervals, p-values, and standard errors (think bootstrap, jackknife, permutation tests). While the CLT can provide beautiful insight in a large variety of situations, in reality, one must often resort to a custom solution when analyzing a given dataset or problem. This should be a familiar conclusion to anyone who analyzes data. The promise of “write once, run anywhere” is always tantalizing, but the reality never seems to meet that expectation.</p>
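<p>As one example of such a custom solution, here is a minimal sketch of a percentile bootstrap interval for a mean. The data are simulated, the number of resamples (2000) is an arbitrary choice, and nothing here guarantees good coverage with a sample this small:</p>
<pre>set.seed(1)
x = rexp(10, rate = 1)                 ## a small, skewed sample
boot_means = replicate(2000, mean(sample(x, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))  ## percentile bootstrap interval
</pre>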
<p>Ironically, if you look across history and all programming languages, probably the most “cross-platform” language is C, which was originally considered to be too low-level to be broadly useful. C programs run on basically every existing platform and the language has been completely standardized so that compilers can be written to produce well-defined output. The keys to C’s success I think are that it’s a very simple/small language which gives enormous (sometimes dangerous) power to the programmer, and that an enormous toolbox (compiler toolchains, IDEs) has been developed over time to help developers write applications on all platforms.</p>
<p>In a sense, we need “compilers” that can help us translate statistical theory for specific data analysis problems. In many cases, I’d imagine the compiler would “fail”, meaning the theory was not applicable to that problem. This would be a Good Thing, because right now we have no way of really enforcing the appropriateness of a theorem for specific problems.</p>
<p>More practically (perhaps), we could develop <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">data analysis pipelines</a> that could be applied to broad classes of data analysis problems. Then a “compiler” could be employed to translate the pipeline so that it worked for a given dataset/problem/toolchain.</p>
<p>The key point is to recognize that there is a “translation” process that occurs when we use theory to justify certain data analysis actions, but this translation process is often not well documented or even thought through. Having an explicit “compiler” for this would help us to understand the applicability of certain theorems and may serve to prevent bad data analysis from occurring.</p>
Autonomous killing machines won't look like the Terminator...and that is why they are so scary
2015-07-30T11:09:22+00:00
http://simplystats.github.io/2015/07/30/autonomous-killing-machines-wont-look-like-the-terminator-and-that-is-why-they-are-so-scary
<p>Just a few days ago many of the most incredible minds in science and technology <a href="http://www.theguardian.com/technology/2015/jul/27/musk-wozniak-hawking-ban-ai-autonomous-weapons">urged governments to avoid using artificial intelligence</a> to create autonomous killing machines. One thing that always happens when such a warning is put into place is you see the inevitable Terminator picture:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/terminator.jpeg"><img class="aligncenter wp-image-4160 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/07/terminator-300x180.jpeg" alt="terminator" width="300" height="180" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/terminator-300x180.jpeg 300w, http://simplystatistics.org/wp-content/uploads/2015/07/terminator-260x156.jpeg 260w, http://simplystatistics.org/wp-content/uploads/2015/07/terminator.jpeg 620w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p>The reality is that robots that walk and talk are getting better but still have a ways to go:</p>
<p> </p>
<p> </p>
<p>Does this mean that I think all those really smart people are silly for making this plea about AI now though? No, I think they are probably just in time.</p>
<p>The reason is that the first autonomous killing machines will definitely not look anything like the Terminator. They will more likely than not be drones, which are already in widespread use by the military, and will soon be flying over our heads <a href="http://money.cnn.com/2015/07/29/technology/amazon-drones-air-space/">delivering Amazon products</a>.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/drone.jpg"><img class="aligncenter size-medium wp-image-4161" src="http://simplystatistics.org/wp-content/uploads/2015/07/drone-300x238.jpg" alt="drone" width="300" height="238" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/drone-300x238.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/07/drone-1024x814.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/07/drone.jpg 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p>I also think that when people think about “artificial intelligence” they think about robots that can mimic the behaviors of a human being, including the ability to talk, hold a conversation, <a href="https://en.wikipedia.org/wiki/Turing_test">or pass the Turing test</a>. But it turns out that the “artificial intelligence” you would need to create an automated killing system is much much simpler than that and is mostly some basic data science. The things you would need are:</p>
<ol>
<li>A drone with the ability to fly on its own</li>
<li>The ability to make decisions about what people to target</li>
<li>The ability to find those people and attack them</li>
</ol>
<p> </p>
<p>The first issue, being able to fly on autopilot, is something that has existed for a while. You have probably flown on a plane that has <a href="https://en.wikipedia.org/wiki/Autopilot">used autopilot</a> for at least some of the flight. I won’t get into the details on this one because I think it is the least interesting - it has been around a while and we didn’t get the dire warnings about autonomous agents.</p>
<p>The second issue, deciding which people to target, is already in existence as well. We have already seen programs like <a href="https://en.wikipedia.org/wiki/PRISM_(surveillance_program)">PRISM</a> and others that collect individual level metadata and presumably use those to make predictions. While the true and false positive rates are probably messed up by the fact that there are very very few “true positives”, these programs are being developed and even relatively simple statistical models can be used to build a predictor - even if those don’t work.</p>
<p>The third issue is being able to find people to attack them. This is where the real “artificial intelligence” comes into play. But it isn’t artificial intelligence like you might think about. It could be just as simple as having the drone fly around and take people’s pictures. Then we could use those pictures to match up with the people identified through metadata and attack them. Facebook has a <a href="file:///Users/jtleek/Downloads/deepface.pdf">paper</a> that demonstrates an algorithm that can identify people with near human level accuracy. This approach is based on something called deep neural nets, which sounds very intimidating, but is actually just a set of nested nonlinear <a href="https://en.wikipedia.org/wiki/Deep_learning">logistic regression models</a>. These models have gotten very good because (a) we are getting better at fitting them mathematically and computationally but mostly (b) we have much more data to train them with than we ever did before. The speed at which this part of the process is developing is (I think) why there is so much recent concern about potentially negative applications like autonomous killing machines.</p>
<p>The scary thing is that these technologies could be combined <em>right now</em> to create such a system that was not controlled directly by humans but made automated decisions and flew drones to carry out those decisions. The technology to shrink these types of deep neural net systems to identify people is so good it can even be made simple enough to <a href="http://googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html">run on a phone</a> for things like language translation and could easily be embedded in a drone.</p>
<p>So I am with Musk, Hawking, and others who would urge caution by governments in developing these systems. Just because we can make it doesn’t mean it will do what we want. Just look at how well Facebook/Amazon/Google make suggestions for “other things you might like” to get an idea about how potentially disastrous automated killing systems could be.</p>
<p> </p>
Announcing the JHU Data Science Hackathon 2015
2015-07-28T13:31:04+00:00
http://simplystats.github.io/2015/07/28/announcing-the-jhu-data-science-hackathon-2015
<p>We are pleased to announce that the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health will be hosting the first ever <a href="https://www.regonline.com/jhudash">JHU Data Science Hackathon</a> (DaSH) on <strong>September 21-23, 2015</strong> at the Baltimore Marriott Waterfront.</p>
<p>This event will be an opportunity for data scientists and data scientists-in-training to get together and hack on real-world problems collaboratively and to learn from each other. The DaSH will feature data scientists from government, academia, and industry presenting problems and describing challenges in their respective areas. There will also be a number of networking opportunities where attendees can get to know each other. We think this will be a fun event and we encourage people from all areas, including students (graduate and undergraduate), to attend.</p>
<p>To get more details and to sign up for the hackathon, you can go to the <a href="https://www.regonline.com/jhudash">DaSH web site</a>. We will be posting more information as the event nears.</p>
<p>Organizers:</p>
<ul>
<li>Jeff Leek</li>
<li>Brian Caffo</li>
<li>Roger Peng</li>
<li>Leah Jager</li>
</ul>
<p>Funding:</p>
<ul>
<li>National Institutes of Health</li>
<li>Johns Hopkins University</li>
</ul>
<p> </p>
stringsAsFactors: An unauthorized biography
2015-07-24T11:04:20+00:00
http://simplystats.github.io/2015/07/24/stringsasfactors-an-unauthorized-biography
<p>Recently, I was listening in on the conversation of some colleagues who were discussing a bug in their R code. The bug was ultimately traced back to the well-known phenomenon that functions like ‘read.table()’ and ‘read.csv()’ in R convert columns that are detected to be character/strings to be factor variables. This led to the spontaneous outcry from one colleague of</p>
<blockquote>
<p>Why does stringsAsFactors not default to FALSE????</p>
</blockquote>
<p>The argument ‘stringsAsFactors’ is an argument to the ‘data.frame()’ function in R. It is a logical that indicates whether strings in a data frame should be treated as factor variables or as just plain strings. The argument also appears in ‘read.table()’ and related functions because of the role these functions play in reading in table data and converting them to data frames. By default, ‘stringsAsFactors’ is set to TRUE.</p>
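<p>Here is a minimal sketch of what the argument does. The toy table and the names ‘df_factor’ and ‘df_string’ are made up for illustration, and the argument is passed explicitly so the result does not depend on the default of whichever R version is installed:</p>
<pre><code class="language-r"># Build the same small table twice, changing only stringsAsFactors
df_factor &lt;- data.frame(sex = c("male", "female", "female"),
                        age = c(31, 45, 27),
                        stringsAsFactors = TRUE)
df_string &lt;- data.frame(sex = c("male", "female", "female"),
                        age = c(31, 45, 27),
                        stringsAsFactors = FALSE)

class(df_factor$sex)  # "factor"
class(df_string$sex)  # "character"
</code></pre>
<p>The same argument can be passed to ‘read.table()’ and ‘read.csv()’ to control the conversion when reading a file.</p>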
<p>This argument dates back to May 20, 2006 when it was originally introduced into R as the ‘charToFactor’ argument to ‘data.frame()’. Soon afterwards, on May 24, 2006, it was changed to ‘stringsAsFactors’ to be compatible with S-PLUS by request from Bill Dunlap.</p>
<p>Most people I talk to today who use R are completely befuddled by the fact that ‘stringsAsFactors’ is set to TRUE by default. First of all, it should be noted that before the ‘stringsAsFactors’ argument even existed, the behavior of R was to coerce all character strings to be factors in a data frame. If you didn’t want this behavior, you had to manually coerce each column to be character.</p>
<p>So here’s the story:</p>
<p>In the old days, when R was primarily being used by statisticians and statistical types, setting strings to be factors made total sense. In most tabular data, if there were a column of the table that was non-numeric, it almost certainly encoded a categorical variable. Think sex (male/female), country (U.S./other), region (east/west), etc. In R, categorical variables are represented by ‘factor’ vectors and so character columns got converted to factor.</p>
<p>Why do we need factor variables to begin with? Because of modeling functions like ‘lm()’ and ‘glm()’. Modeling functions need to expand categorical variables into individual dummy variables, so that a categorical variable with 5 levels will be expanded into 4 different columns in your modeling matrix. There’s no way for R to know it should do this unless it has some extra information in the form of the factor class. From this point of view, setting ‘stringsAsFactors = TRUE’ when reading in tabular data makes total sense. If the data is just going to go into a regression model, then R is doing the right thing.</p>
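<p>The dummy-variable expansion can be seen directly with ‘model.matrix()’, which ‘lm()’ calls internally to build its design matrix. This is only a sketch with a made-up five-level variable called ‘region’, assuming R’s default treatment contrasts:</p>
<pre><code class="language-r"># A hypothetical categorical variable with 5 levels
region &lt;- factor(c("north", "south", "east", "west", "central", "north"))
nlevels(region)  # 5

# Expand it the way a modeling function would
X &lt;- model.matrix(~ region)
colnames(X)      # "(Intercept)" plus one dummy column per non-reference level
ncol(X)          # 5: the intercept and 4 dummy columns
</code></pre>
<p>One level (here the alphabetically first one) serves as the reference category, which is why 5 levels produce 4 dummy columns rather than 5.</p>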
<p>There’s also a more obscure reason. Factor variables are encoded as integers in their underlying representation. So a variable with the values “disease” and “non-disease” will be encoded as 1 and 2 in the underlying representation. Roughly speaking, since integers only require 4 bytes on most systems, the conversion from string to integer actually saved some space for long strings. All that had to be stored was the integer codes and the labels. That way you didn’t have to repeat the strings “disease” and “non-disease” for as many observations as you had, which would have been wasteful.</p>
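<p>The integer encoding is easy to inspect. A small sketch, using a made-up vector named ‘status’:</p>
<pre><code class="language-r"># A factor built from repeated strings
status &lt;- factor(c("disease", "non-disease", "disease", "disease"))

levels(status)      # the labels, stored once: "disease" "non-disease"
as.integer(status)  # the per-observation codes: 1 2 1 1
typeof(status)      # "integer"
</code></pre>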
<p>Around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code thanks to Seth Falcon. What this meant was that effectively, character strings were hashed to an integer representation and stored in a global table in R. Anytime a given string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factor. Of course, you still needed to use ‘factors’ for the modeling functions.</p>
<p>The difference nowadays is that R is being used by a very wide variety of people doing all kinds of things the creators of R never envisioned. This is, of course, wonderful, but it introduces lots of use cases that were not originally planned for. I find that most often, the people complaining about ‘stringsAsFactors’ not being FALSE are people who are doing things that are not the traditional statistical modeling things (things that old-time statisticians like me used to do). In fact, I would argue that if you’re upset about ‘stringsAsFactors = TRUE’, then it’s a pretty good indicator that you’re either not a statistician by training, or you’re doing non-traditional statistical things.</p>
<p>For example, in genomics, you might have the names of the genes in one column of data. It really doesn’t make sense to encode these as factors because they won’t be used in any modeling function. They’re just labels, essentially. And because of CHARSXP hashing, you don’t gain anything from an efficiency standpoint by converting them to factors either.</p>
<p>But of course, given the long-standing behavior of R, many people depend on the default conversion of characters to factors when reading in tabular data. Changing this default would likely result in an equal number of people complaining about ‘stringsAsFactors’.</p>
<p>I fully expect that this blog post will now make all R users happy. If you think I’ve missed something from this unauthorized biography, please let me know on Twitter (@rdpeng).</p>
The statistics department Moneyball opportunity
2015-07-17T09:21:16+00:00
http://simplystats.github.io/2015/07/17/the-statistics-department-moneyball-opportunity
<p><a href="https://en.wikipedia.org/wiki/Moneyball">Moneyball</a> is a book and a movie about Billy Beane. It makes statisticians look awesome and I loved the movie. I loved it so much I’m putting the movie trailer right here:</p>
<p>The basic idea behind Moneyball was that the Oakland Athletics were able to build a very successful baseball team on a tight budget by valuing skills that many other teams undervalued. In baseball those skills were things like on-base percentage and slugging percentage. By correctly valuing these skills and their impact on a team’s winning percentage, the A’s were able to build one of the most successful regular season teams on a minimal budget. This graph shows what an outlier they were, from a nice <a href="http://fivethirtyeight.com/features/billion-dollar-billy-beane/">fivethirtyeight analysis</a>.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/oakland.png"><img class="aligncenter wp-image-4146" src="http://simplystatistics.org/wp-content/uploads/2015/07/oakland-1024x818.png" alt="oakland" width="500" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/oakland-1024x818.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/07/oakland-250x200.png 250w, http://simplystatistics.org/wp-content/uploads/2015/07/oakland.png 1150w" sizes="(max-width: 500px) 100vw, 500px" /></a></p>
<p>I think that the data science/data analysis revolution that we have seen over the last decade has created a similar moneyball opportunity for statistics and biostatistics departments. Traditionally in these departments the highest value activities have been publishing in a select number of important statistics journals (JASA, JRSS-B, Annals of Statistics, Biometrika, Biometrics and more recently journals like Biostatistics and Annals of Applied Statistics). But there are some hugely valuable ways to contribute to statistics/data science that don’t necessarily end with papers in those journals like:</p>
<ol>
<li>Creating good, well-documented, and widely used software</li>
<li>Being primarily an excellent collaborator who brings in grant money and is a major contributor to science through statistics</li>
<li>Publishing in top scientific journals rather than statistics journals</li>
<li>Being a good scientific communicator who can attract talent</li>
<li>Being a statistics educator who can build programs</li>
</ol>
<p>Another thing that is undervalued is not having a Ph.D. in statistics or biostatistics. The fact that these skills are undervalued right now means that up and coming departments could identify and recruit talented people that might be missed by other departments and have a huge impact on the world. One tricky thing is that the rankings of departments are based on the votes of people from other departments who may or may not value these same skills. Another tricky thing is that many industry data science positions put incredibly high value on these skills and so you might end up competing with them for people - a competition that will definitely drive up the market value of these data scientists/statisticians. But for the folks that want to stay in academia, now is a prime opportunity.</p>
The Mozilla Fellowship for Science
2015-07-10T11:10:26+00:00
http://simplystats.github.io/2015/07/10/the-mozilla-fellowship-for-science
<p>This looks like an <a href="https://www.mozillascience.org/fellows">interesting opportunity</a> for grad students, postdocs, and early career researchers:</p>
<blockquote>
<p>We’re looking for researchers with a passion for open source and data sharing, already working to shift research practice to be more collaborative, iterative and open. Fellows will spend 10 months starting September 2015 as community catalysts at their institutions, mentoring the next generation of open data practitioners and researchers and building lasting change in the global open science community.</p>
<p>Throughout their fellowship year, chosen fellows will receive training and support from Mozilla to hone their skills around open source and data sharing. They will also craft code, curriculum and other learning resources that help their local communities learn open data practices, and teach forward to their peers.</p>
</blockquote>
<p>Here’s what you get:</p>
<blockquote>
<p>Fellows will receive:</p>
<ul>
<li>A stipend of $60,000 USD, paid in 10 monthly installments.</li>
<li>One-time health insurance supplement for Fellows and their families, ranging from $3,500 for single Fellows to $7,000 for a couple with two or more children.</li>
<li>One-time childcare allotment for families with children of up to $6,000.</li>
<li>Allowance of up to $3,000 towards the purchase of laptop computer, digital cameras, recorders and computer software; fees for continuing studies or other courses, research fees or payments, to the extent related to the fellowship.</li>
<li>All approved fellowship trips – domestic and international – are covered in full.</li>
</ul>
</blockquote>
<p>Deadline is August 14.</p>
JHU, UMD researchers are getting a really big Big Data center
2015-07-08T16:26:45+00:00
http://simplystats.github.io/2015/07/08/jhu-umd-researchers-are-getting-a-really-big-big-data-center
<p>From <a href="http://technical.ly/baltimore/2015/07/07/jhu-umd-big-data-maryland-advanced-research-computing-center-marcc/">Technical.ly Baltimore</a>:</p>
<blockquote>
<p>A nondescript, 3,700-square-foot building on Johns Hopkins’ Bayview campus will house a new data storage and computing center for university researchers. The $30 million Maryland Advanced Research Computing Center (MARCC) will be available to faculty from JHU and the University of Maryland, College Park.</p>
</blockquote>
<p>The web site has a pretty cool time-lapse video of the construction of the computing center. There’s also a bit more detail at the <a href="http://hub.jhu.edu/2015/07/06/computing-center-bayview">JHU Hub</a> site.</p>
The Massive Future of Statistics Education
2015-07-03T10:17:24+00:00
http://simplystats.github.io/2015/07/03/the-massive-future-of-statistics-education
<p><em>NOTE: This post was written as a chapter for the not-yet-released Handbook on Statistics Education. </em></p>
<p>Data are eating the world, but our collective ability to analyze data is going on a starvation diet.</p>
<div id="content">
<p>
Everywhere you turn, data are being generated somehow. By the time you read this piece, you’ll probably have collected some data. (For example, this piece has 2,072 words.) You can’t avoid data—it’s coming from all directions.
</p>
<p>
So what do we do with it? For the most part, nothing. There’s just too much data being spewed about. But for the data that we <em>are</em> interested in, we need to know the appropriate methods for thinking about and analyzing them. And by “we”, I mean pretty much everyone.
</p>
<p>
In the future, everyone will need some data analysis skills. People are constantly confronted with data and the need to make choices and decisions from the raw data they receive. Phones deliver information about traffic, we have ratings about restaurants or books, and even rankings of hospitals. High school students can obtain complex and rich information about the colleges to which they’re applying while admissions committees can get real-time data on applicants’ interest in the college.
</p>
<p>
Many people already have heuristic algorithms to deal with the data influx—and these algorithms may serve them well—but real statistical thinking will be needed for situations beyond choosing which restaurant to try for dinner tonight.
</p>
<p>
<strong>Limited Capacity</strong>
</p>
<p>
The McKinsey Global Institute, in a <a href="http://www.mckinsey.com/insights/americas/us_game_changers">highly cited report</a>, predicted that there would be a shortage of “data geeks” and that by 2018 there would be between 140,000 and 190,000 unfilled positions in data science. In addition, there will be an estimated 1.5 million people in managerial positions who will need to be trained to manage data scientists and to understand the output of data analysis. If history is any guide, it’s likely that these positions will get filled by people, regardless of whether they are properly trained. The potential consequences are disastrous as untrained analysts interpret complex big data coming from myriad sources of varying quality.
</p>
<p>
Who will provide the necessary training for all these unfilled positions? The field of statistics’ current system of training people and providing them with master’s degrees and PhDs is woefully inadequate to the task. In 2013, the top 10 largest statistics master’s degree programs in the U.S. graduated a total of <a href="http://community.amstat.org/blogs/steve-pierson/2014/02/09/largest-graduate-programs-in-statistics">730 people</a>. At this rate we will never train the people needed. While statisticians have greatly benefited from the sudden and rapid increase in the amount of data flowing around the world, our capacity for scaling up the needed training for analyzing those data is essentially nonexistent.
</p>
<p>
On top of all this, I believe that the McKinsey report is a gross underestimation of how many people will need to be trained in <em>some</em> data analysis skills in the future. Given how much data is being generated every day, and how critical it is for everyone to be able to intelligently interpret these data, I would argue that it’s necessary for <em>everyone</em> to have some data analysis skills. Needless to say, it’s foolish to suggest that everyone go get a master’s or even a bachelor’s degree in statistics. We need an alternate approach that is both high-quality and scalable to a large population over a short period of time.
</p>
<p>
<strong>Enter the MOOCs</strong>
</p>
<p>
In April of 2014, Jeff Leek, Brian Caffo, and I launched the <a href="https://www.coursera.org/specialization/jhudatascience/1">Johns Hopkins Data Science Specialization</a> on the Coursera platform. This is a sequence of nine courses that is intended to provide a “soup-to-nuts” training in data science for people who are highly motivated and have some basic mathematical and computing background. The sequence of the nine courses follows what we believe is the essential “data science process”, which is
</p>
<ol>
<li>
Formulating a question that can be answered with data
</li>
<li>
Assembling, cleaning, tidying data relevant to a question
</li>
<li>
Exploring data, checking, eliminating hypotheses
</li>
<li>
Developing a statistical model
</li>
<li>
Making statistical inference
</li>
<li>
Communicating findings
</li>
<li>
Making the work reproducible
</li>
</ol>
<p>
We took these basic steps and designed courses around each one of them.
</p>
<p>
Each course is provided in a massive open online format, which means that many thousands of people typically enroll in each course every time it is offered. The learners in the courses do homework assignments, take quizzes, and peer assess the work of others in the class. All grading and assessment is handled automatically so that the process can scale to arbitrarily large enrollments. As an example, the April 2015 session of the R Programming course had nearly 45,000 learners enrolled. Each class is exactly 4 weeks long and every class runs every month.
</p>
<p>
We developed this sequence of courses in part to address the growing demand for data science training and education across the globe. Our background as biostatisticians was very closely aligned with the training needs of people interested in data science because, essentially, data science is <em>what we do every single day</em>. Indeed, one curriculum rule that we had was that we couldn’t include something if we didn’t in fact use it in our own work.
</p>
<p>
The sequence has a substantial amount of standard statistics content, such as probability and inference, linear models, and machine learning. It also has non-standard content, such as git, GitHub, R programming, Shiny, and Markdown. Together, the sequence covers the full spectrum of tools that we believe will be needed by the practicing data scientist.
</p>
<p>
For those who complete the nine courses, there is a capstone project at the end that involves taking all of the skills in the course and developing a data product. For our first capstone project we partnered with <a href="http://swiftkey.com/en/">SwiftKey</a>, a predictive text analytics company, to develop a project where learners had to build a statistical model for predicting words in a sentence. This project involves taking unstructured, messy data, processing it into an analyzable form, developing a statistical model while making tradeoffs for efficiency and accuracy, and creating a Shiny app to show off their model to the public.
</p>
<p>
<strong>Degree Alternatives</strong>
</p>
<p>
The Data Science Specialization is not a formal degree program offered by Johns Hopkins University—learners who complete the sequence do not get any Johns Hopkins University credit—and so one might wonder what the learners get out of the program (besides, of course, the knowledge itself). To begin with, the sequence is completely portfolio based, so learners complete projects that are immediately viewable by others. This allows others to evaluate a learner’s ability on the spot with real code or data analysis.
</p>
<p>
All of the lecture content is openly available and hosted on GitHub, so outsiders can view the content and see for themselves what is being taught. This gives outsiders an opportunity to evaluate the program directly rather than have to rely on the sterling reputation of the institution teaching the courses.
</p>
<p>
Each learner who completes a course using Coursera’s “Signature Track” (which currently costs $49 per course) can get a badge on their LinkedIn profile, which shows that they completed the course. This can often be as valuable as a degree or other certification as recruiters scouring LinkedIn for data scientist positions will be able to see our completers’ certifications in various data science courses.
</p>
<p>
Finally, the scale and reach of our specialization immediately creates a large alumni social network that learners can take advantage of. As of March 2015, there were approximately 700,000 people who had taken at least one course in the specialization. These 700,000 people have a shared experience that, while not quite at the level of a college education, still is useful for forging connections between people, especially when people are searching around for jobs.
</p>
<p>
<strong>Early Numbers</strong>
</p>
<p>
So far, the sequence has been wildly successful. It averaged 182,507 enrollees a month for the first year in existence. The overall course completion rate was about 6% and the completion rate amongst those in the “Signature Track” (i.e. paid enrollees) was 67%. In October of 2014, barely 7 months since the start of the specialization, we had 663 learners enroll in the capstone project.
</p>
<p>
<strong>Some Early Lessons</strong>
</p>
<p>
From running the Data Science Specialization for over a year now, we have learned a number of lessons, some of which were unexpected. Here, I summarize the highlights of what we’ve learned.
</p>
<p>
<strong>Data Science as Art and Science. </strong>Ironically, although the word “Science” appears in the name “Data Science”, there’s actually quite a bit about the practice of data science that doesn’t really resemble science at all. Much of what statisticians do in the act of data analysis is intuitive and ad hoc, with each data analysis being viewed as a unique flower.
</p>
<p>
When attempting to design data analysis assignments that could be graded at scale with tens of thousands of people, we discovered that designing the rubrics for grading these assignments was not trivial. The reason is that our understanding of what makes a “good” analysis different from a bad one is not well-articulated. We could not identify any community-wide understanding of what the components of a good analysis are. What are the “correct” methods to use in a given data analysis situation? What is definitely the “wrong” approach?
</p>
<p>
Although each one of us had been doing data analysis for the better part of a decade, none of us could succinctly write down what the process was and how to recognize when it was being done wrong. To paraphrase Daryl Pregibon from his <a href="http://www.nap.edu/catalog/1910/the-future-of-statistical-software-proceedings-of-a-forum">1991 talk at the National Academies of Science</a>, we had a process that we regularly espoused but barely understood.
</p>
<p>
<strong>Content vs. Curation</strong>. Much of the content that we put online is available elsewhere. With YouTube, you can find high-quality videos on almost any topic, and our videos are not really that much better. Furthermore, the subject matter that we were teaching was in no way proprietary. The linear models that we teach are the same linear models taught everywhere else. So what exactly was the value we were providing?
</p>
<p>
Searching on YouTube requires that you know what you are looking for. This is a problem for people who are just getting into an area. Effectively, what we provided was a <em>curation</em> of all the knowledge that’s out there on the topic of data science (we also added our own quirky spin). Curation is hard, because you need to make definitive choices between what is and is not a core element of a field. But curation is essential for learning a field for the uninitiated.
</p>
<p>
<strong>Skill sets vs. Certification</strong>. Because we knew that we were not developing a true degree program, we knew we had to develop the program in a manner so that the learners could quickly see for themselves the value they were getting out of it. This led us to take a portfolio approach where learners produced things that could be viewed publicly.
</p>
<p>
In part because of the self-selection of the population seeking to learn data science skills, our learners were more interested in being able to demonstrate the skills taught in the course rather than an abstract (but official) certification as might be gotten in a degree program. This is not unlike going to a music conservatory, where the output is your ability to play an instrument rather than the piece of paper you receive upon graduation. We feel that giving people the ability to demonstrate skills and skill sets is perhaps more important than official degrees in some instances because it gives employers a concrete sense of what a person is capable of doing.
</p>
<p>
<strong>Conclusions</strong>
</p>
<p>
As of April 2015, we had a total of 1,158 learners complete the entire specialization, including the capstone project. Given these numbers and our rate of completion for the specialization as a whole, we believe we are on our way to achieving our goal of creating a highly scalable program for training people in data science skills. Of course, this program alone will not be sufficient for all of the data science training needs of society. But we believe that the approach that we’ve taken, using non-standard MOOC channels, focusing on skill sets instead of certification, and emphasizing our role in curation, is a rich opportunity for the field of statistics to explore in order to educate the masses about our important work.
</p>
</div>
Looks like this R thing might be for real
2015-07-02T10:01:45+00:00
http://simplystats.github.io/2015/07/02/looks-like-this-r-thing-might-be-for-real
<p>Not sure how I missed this, but the Linux Foundation just announced the <a href="http://www.linuxfoundation.org/news-media/announcements/2015/06/linux-foundation-announces-r-consortium-support-millions-users">R Consortium</a> for supporting the “world’s most popular language for analytics and data science and support the rapid growth of the R user community”. From the Linux Foundation:</p>
<blockquote>
<p>The R language is used by statisticians, analysts and data scientists to unlock value from data. It is a free and open source programming language for statistical computing and provides an interactive environment for data analysis, modeling and visualization. The R Consortium will complement the work of the R Foundation, a nonprofit organization based in Austria that maintains the language. The R Consortium will focus on user outreach and other projects designed to assist the R user and developer communities.</p>
<p>Founding companies and organizations of the R Consortium include The R Foundation, Platinum members Microsoft and RStudio; Gold member TIBCO Software Inc.; and Silver members Alteryx, Google, HP, Mango Solutions, Ketchum Trading and Oracle.</p>
</blockquote>
How Airbnb built a data science team
2015-07-01T08:39:29+00:00
http://simplystats.github.io/2015/07/01/how-airbnb-built-a-data-science-team
<p>From <a href="http://venturebeat.com/2015/06/30/how-we-scaled-data-science-to-all-sides-of-airbnb-over-5-years-of-hypergrowth/">Venturebeat</a>:</p>
<blockquote>
<p>Back then we knew so little about the business that any insight was groundbreaking; data infrastructure was fast, stable, and real-time (I was querying our production MySQL database); the company was so small that everyone was in the loop about every decision; and the data team (me) was aligned around a singular set of metrics and methodologies.</p>
<p>But five years and 43,000 percent growth later, things have gotten a bit more complicated. I’m happy to say that we’re also more sophisticated in the way we leverage data, and there’s now a lot more of it. The trick has been to manage scale in a way that brings together the magic of those early days with the growing needs of the present — a challenge that I know we aren’t alone in facing.</p>
</blockquote>
How public relations and the media are distorting science
2015-06-24T10:07:45+00:00
http://simplystats.github.io/2015/06/24/how-public-relations-and-the-media-are-distorting-science
<p>Throughout history, engineers, medical doctors and other applied scientists have helped convert basic science discoveries into products, public goods and policy that have greatly improved our quality of life. With rare exceptions, it has taken years if not decades to establish these discoveries. And even the exceptions stand on the shoulders of incremental contributions. The researchers that produce this knowledge go through a slow and painstaking process to reach these achievements.</p>
<p>In contrast, most science related media reports that grab the public’s attention fall into three categories:</p>
<ol>
<li>The <em>exaggerated big discovery</em>: Recent examples include the discovery of <a href="http://www.cbsnews.com/news/dangerous-pathogens-and-mystery-microbes-ride-the-subway/">the bubonic plague in the NYC subway</a>, <a href="http://www.bbc.com/news/science-environment-32287609">liquid water on Mars</a>, and <a href="http://www.nytimes.com/2015/05/24/opinion/sunday/infidelity-lurks-in-your-genes.html?ref=opinion&_r=3">the infidelity gene</a>.</li>
<li><em>Over-promising</em>: These try to explain a complicated basic science finding and, in the case of biomedical research, then speculate without much explanation that the finding will “lead to a deeper understanding of diseases and new ways to treat or cure them”.</li>
<li><em>Science is broken</em>: These tend to report an anecdote about an allegedly corrupt scientist, maybe cite the “Why Most Published Research Findings are False” paper, and then extrapolate speculatively.</li>
</ol>
<p>In my estimation, despite the attention-grabbing headlines, the great majority of the subject matter included in these reports will not have an impact on our lives and will not even make it into scientific textbooks. So does science still have anything to offer? Reports of the third category have even scientists particularly worried. I, however, remain optimistic. First, I do not see any empirical evidence showing that the negative effects of the lack of reproducibility are worse now than 50 years ago. Furthermore, although not widely reported in the lay press, I continue to see bodies of work built by several scientists over several years or decades with much promise of leading to tangible improvements to our quality of life. Recent advances that I am excited about include topological <a href="http://physics.gmu.edu/~pnikolic/articles/Topological%20insulators%20(Physics%20World,%20February%202011).pdf">insulators</a>, <a href="http://www.ncbi.nlm.nih.gov/pubmed/24955707">PD-1 pathway inhibitors</a>, <a href="https://en.wikipedia.org/wiki/CRISPR">clustered regularly interspaced short palindromic repeats</a>, advances in solar energy technology, and prosthetic robotics.</p>
<p>However, there is one general aspect of science that I do believe has become worse. Specifically, it’s a shift in how much scientists jockey for media attention, even if it’s short-lived. Instead of striving for having a sustained impact on our field, which may take decades to achieve, an increasing number of scientists seem to be placing more value on appearing in the New York Times, giving a TED Talk or having a blog or tweet go viral. As a consequence, too many of us end up working on superficial short-term challenges that, with the help of a professionally crafted press release, may result in an attention-grabbing media report. NB: I fully support science communication efforts, but not when the primary purpose is garnering attention, rather than educating.</p>
<p>My concern spills over to funding agencies and philanthropic organizations as well. Consider the following two options. Option 1: be the funding agency representative tasked with organizing a big science project with a well-oiled PR machine. Option 2: be the funding agency representative in charge of several small projects, one of which may with low, but non-negligible, probability result in a Nobel Prize 30 years down the road. In the current environment, I see a preference for option 1.</p>
<p>I am also concerned about how this atmosphere may negatively affect societal improvements within science. Publicly shaming transgressors on Twitter or expressing one’s outrage on a blog post can garner many social media clicks. However, these may have a smaller positive impact than mundane activities such as serving on a committee that, after several months of meetings, implements incremental, yet positive, changes. Time and energy spent on trying to increase internet clicks is time and energy we don’t spend on the tedious administrative activities that are needed to actually effect change.</p>
<p>Because so many of the scientists that thrive in this atmosphere of short-lived media reports are disproportionately rewarded, I imagine investigators starting their careers feel some pressure to garner some media attention of their own. Furthermore, their view of how they are evaluated may be highly biased because evaluators that ignore media reports and focus more on the specifics of the scientific content tend to be less visible. So if you want to spend your academic career slowly building a body of work with the hopes of being appreciated decades from now, you should not think that it is hopeless based on what is, perhaps, a distorted view of how we are currently being evaluated.</p>
<p>Update: changed topological insulators links to <a href="http://scienceblogs.com/principles/2010/07/20/whats-a-topological-insulator/">these</a> <a href="http://physics.gmu.edu/~pnikolic/articles/Topological%20insulators%20(Physics%20World,%20February%202011).pdf">two</a>. <a href="http://spectrum.ieee.org/semiconductors/materials/topological-insulators">Here</a> is one more. Via David S.</p>
Interview at Leanpub
2015-06-16T21:49:33+00:00
http://simplystats.github.io/2015/06/16/interview-at-leanpub
<p>A few weeks ago I sat down with Len Epp over at Leanpub to talk about my recently published book <em><a href="https://leanpub.com/rprogramming">R Programming for Data Science</a></em>. So far, I’ve only published one book through Leanpub but I’m a huge fan. They’ve developed a system that is, in my opinion, perfect for academic publishing. The book’s written in Markdown and they compile it into PDF, ePub, and mobi formats automatically.</p>
<p>The full interview transcript is over at the <a href="http://blog.leanpub.com/2015/06/roger-peng.html">Leanpub blog</a>. If you want to listen to the audio of the interview, you can subscribe to the Leanpub <a href="https://itunes.apple.com/ca/podcast/id517117137?mt=2">podcast on iTunes</a>.</p>
<p><a href="https://leanpub.com/rprogramming"><em>R Programming for Data Science</em></a> is available at Leanpub for a suggested price of $15 (but you can get it for free if you want). R code files, datasets, and video lectures are available through the various add-on packages. Thanks to all of you who’ve already bought a copy!</p>
Johns Hopkins Data Science Specialization Capstone 2 Top Performers
2015-06-10T14:33:09+00:00
http://simplystats.github.io/2015/06/10/johns-hopkins-data-science-specialization-captsone-2-top-performers
<p><em>The second capstone session of the <a href="https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage">Johns Hopkins Data Science Specialization</a> concluded recently. This time, we had 1,040 learners sign up to participate in the session, which again featured a project developed in collaboration with the amazingly innovative folks at <a href="http://swiftkey.com/en/">SwiftKey</a>. </em></p>
<p><em>We’ve identified the learners listed below as the top performers in this capstone session. This is an incredibly talented group of people who have worked very hard throughout the entire nine-course specialization. Please take some time to read their stories and look at their work. </em></p>
<h1 id="ben-apple">Ben Apple</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple.jpg"><img class="aligncenter size-medium wp-image-4091" src="http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple-300x285.jpg" alt="Ben_Apple" width="300" height="285" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple-300x285.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple.jpg 360w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>Ben Apple is a Data Scientist and Enterprise Architect with the Department of Defense. Mr. Apple holds a MS in Information Assurance and is a PhD candidate in Information Sciences.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization"><strong>Why did you take the JHU Data Science Specialization?</strong></h4>
<p>As a self-trained data scientist I was looking for a program that would formalize my established skills while expanding my data science knowledge and toolbox.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4>
<p>The capstone project was the most demanding aspect of the program. As such, I am most proud of the final project. The project stretched each of us beyond the standard course work of the program and was quite satisfying.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4>
<p>To open doors so that I may further my research into the operational value of applying data science thought and practice to analytics of my domain.</p>
<p><strong>Final Project: </strong><a href="https://bengapple.shinyapps.io/coursera_nlp_capstone">https://bengapple.shinyapps.io/coursera_nlp_capstone</a></p>
<p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/bengapple/71376">http://rpubs.com/bengapple/71376</a></p>
<h1 id="ivan-corneillet">Ivan Corneillet</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet.jpg"><img class="aligncenter size-medium wp-image-4092" src="http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet-300x300.jpg" alt="Ivan.Corneillet" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet-300x300.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet-200x200.jpg 200w, http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet.jpg 400w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>A technologist, thinker, and tinkerer, Ivan facilitates the establishment of start-up companies by advising these companies about the hiring process, product development, and technology development, including big data, cloud computing, and cybersecurity. In his 17-year career, Ivan has held a wide range of engineering and management positions at various Silicon Valley companies. Ivan is a recent Wharton MBA graduate, and he previously earned his master’s degree in computer science from the Ensimag, and his master’s degree in electrical engineering from Université Joseph Fourier, both located in France.</p>
<p><strong>Why did you take the JHU Data Science Specialization?</strong></p>
<p>There are three reasons why I decided to enroll in the JHU Data Science Specialization. First, fresh from college, my formal education was best suited for scaling up the Internet’s infrastructure. However, because every firm in every industry now creates products and services from analyses of data, I challenged myself to learn about Internet-scale datasets. Second, I am a big supporter of MOOCs. I do not believe that MOOCs should replace traditional education; however, I do believe that MOOCs and traditional education will eventually coexist in the same way that open-source and closed-source software does (read my blog post for more information on this topic: http://ivantur.es/16PHild). Third, the Johns Hopkins University brand certainly motivated me to choose their program. With a great name comes a great curriculum and fantastic professors, right?</p>
<p>Once I had completed the program, I was not disappointed at all. I had read a blog post that explained that the JHU Data Science Specialization was only a start to learning about data science. I certainly agree, but I would add that this program is a great start, because the curriculum emphasizes information that is crucial, while providing additional resources to those who wish to deepen their understanding of data science. My thanks to Professors Caffo, Leek, and Peng; the TAs, and Coursera for building and delivering this track!</p>
<p><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></p>
<p>The capstone project made for a very rich and exhilarating learning experience, and was my favorite course in the specialization. Because I did not have prior knowledge in natural language processing (NLP), I had to conduct a fair amount of research. However, the program’s minimal-guidance approach mimicked a real-world environment, and gave me the opportunity to leverage my experience with developing code and designing products to get the most out of the skillset taught in the track. The result was that I created a data product that implemented state-of-the-art NLP algorithms using what I think are the best technologies (i.e., C++, JavaScript, R, Ruby, and SQL), given the choices that I had made. Bringing everything together is what made me the most proud. Additionally, my product capabilities are a far cry from IBM’s Watson, but while I am well versed in supercomputer hardware, this track helped me to gain a much deeper appreciation of Watson’s AI.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-1"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4>
<p>Thanks to the broad skillset that the specialization covered, I feel confident wearing a data science hat. The concepts and tools covered in this program helped me to better understand the concerns that data scientists have and the challenges they face. From a business standpoint, I am also better equipped to identify the opportunities that lie ahead.</p>
<p><strong>Final Project: </strong><a href="https://paspeur.shinyapps.io/wordmaster-io/">https://paspeur.shinyapps.io/wordmaster-io/</a></p>
<p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/paspeur/wordmaster-io">http://rpubs.com/paspeur/wordmaster-io</a></p>
<h1 id="oscar-de-león">Oscar de León</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon.jpg"><img class="aligncenter size-medium wp-image-4093" src="http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-300x225.jpg" alt="Oscar_De_Leon" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-120x90.jpg 120w, http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-1024x768.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-260x195.jpg 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>Oscar is an assistant researcher at a research institute in a developing country. He graduated as a licentiate in biochemistry and microbiology in 2010 from the same university that hosts the institute. He has always loved technology, programming and statistics and has engaged in self-learning of these subjects from an early age, finally using his abilities in the health-related research in which he has been involved since 2008. He is now working on the design, execution and analysis of various research projects, consulting for other researchers and students, and is looking forward to developing his academic career in biostatistics.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-1"><strong>Why did you take the JHU Data Science Specialization?</strong></h4>
<p>I wanted to integrate my R experience into a more comprehensive data analysis workflow, which is exactly what this specialization offers. This was in line with the objectives of my position at the research institute in which I work, so I presented a study plan to my supervisor and she approved it. I also wanted to engage in an activity which enabled me to document my abilities in a verifiable way, and a Coursera Specialization seemed like a good option.</p>
<p>Additionally, I’ve followed the JHSPH group’s courses since the first offering of Mathematical Biostatistics Bootcamp in November 2012. They have proved the standards and quality of education at their institution, and it was not something to let go by.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-1"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4>
<p>I’m not one to usually interact with other students, and certainly didn’t do it during most of the specialization courses, but I decided to try out the fora on the Capstone project. It was wonderful; sharing ideas with, and receiving criticism from, my peers provided a very complete learning experience. After all, my contributions ended up being appreciated by the community, and a few posts saying so were very rewarding. This re-kindled my passion for teaching, and I’ll try to engage in it more from now on.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-2"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4>
<p>First, I’ll file it with HR at my workplace, since our research projects paid for the specialization <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>I plan to use the certificate as a credential for data analysis with R when it is relevant. For example, I’ve been interested in offering an R workshop for life sciences students and researchers at my University, and this certificate (and the projects I prepared during the specialization) could help me show I have a working knowledge of the subject.</p>
<p><strong>Final Project: </strong><a href="https://odeleon.shinyapps.io/ngram/">https://odeleon.shinyapps.io/ngram/</a></p>
<p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/chemman/n-gram">http://rpubs.com/chemman/n-gram</a></p>
<h1 id="jeff-hedberg">Jeff Hedberg</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Jeff_Hedberg.jpg"><img class="aligncenter size-full wp-image-4094" src="http://simplystatistics.org/wp-content/uploads/2015/06/Jeff_Hedberg.jpg" alt="Jeff_Hedberg" width="200" height="200" /></a></p>
<p>I am passionate about turning raw data into actionable insights that solve relevant business problems. I also greatly enjoy leading large, multi-functional projects with impact in areas pertaining to machine and/or sensor data. I have a Mechanical Engineering Degree and an MBA, in addition to a wide range of Data Science (IT/Coding) skills.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-2"><strong>Why did you take the JHU Data Science Specialization?</strong></h4>
<p>I was looking to gain additional exposure into Data Science as a current practitioner, and thought this would be a great program.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-2"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4>
<p>I am most proud of completing all courses with distinction (top of peers). Also, I’m proud to have achieved full points on my Capstone project having no prior experience in Natural Language Processing.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-3"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4>
<p>I am going to add this to my Resume and LinkedIn Profile. I will use it to solidify my credibility as a data science practitioner of value.</p>
<p><strong>Final Project: </strong><a href="https://hedbergjeffm.shinyapps.io/Next_Word_Prediction/">https://hedbergjeffm.shinyapps.io/Next_Word_Prediction/</a></p>
<p><strong>Project Slide Deck: </strong><a href="https://rpubs.com/jhedbergfd3s/74960">https://rpubs.com/jhedbergfd3s/74960</a></p>
<h1 id="hernán-martínez-foffani">Hernán Martínez-Foffani</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani.jpg"><img class="aligncenter size-medium wp-image-4095" src="http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-300x225.jpg" alt="Hernán_Martínez-Foffani" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-120x90.jpg 120w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-1024x768.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-260x195.jpg 260w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani.jpg 1256w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>I was born in Argentina but now I’m settled in Spain. I’ve been working in computer technology since the eighties, in digital networks, programming, consulting, project management. Now, as CTO in a software company, I lead a small team of programmers developing a supply chain management app.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-3"><strong>Why did you take the JHU Data Science Specialization?</strong></h4>
<p>In my opinion the curriculum is carefully designed with a nice balance between theory and practice. The JHU authoring and the teachers’ widely known prestige ensure the content quality. The ability to choose the learning pace, one per month in my case, fits everyone’s schedule.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-3"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4>
<p>The capstone, definitely. It turned out to be a fresh and interesting challenge. I sweated a lot, learned much more and in the end had a lot of fun.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-4"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4>
<p>While for the time being I don’t have any specific plan for the certificate, it’s a beautiful reward for the effort done.</p>
<p><strong>Final Project: </strong><a href="https://herchu.shinyapps.io/shinytextpredict/">https://herchu.shinyapps.io/shinytextpredict/</a></p>
<p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/herchu1/shinytextprediction">http://rpubs.com/herchu1/shinytextprediction</a></p>
<h1 id="francois-schonken">Francois Schonken</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Francois-Schonken1.jpg"><img class="aligncenter size-medium wp-image-4097" src="http://simplystatistics.org/wp-content/uploads/2015/06/Francois-Schonken1-197x300.jpg" alt="Francois Schonken" width="197" height="300" /></a></p>
<p>I’m a 36-year-old South African male, born and raised. I recently (4 years ago now) immigrated to lovely Melbourne, Australia. I wrapped up a BSc (Hons) in Computer Science with specialization in Computer Systems back in 2001. Next, I co-founded a small boutique Software Development house operating from South Africa. I wrapped up my MBA, from Melbourne Business School, in 2013 and now I consult for my small boutique Software Development house and 2 (very) small internet start-ups.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-4"><strong>Why did you take the JHU Data Science Specialization?</strong></h4>
<p>One of the core subjects in my MBA was Data Analysis, basically an MBA take on undergrad Statistics with a focus on application over theory (not that there was any shortage of theory). Waiting in a lobby room some 6 months later, I was paging through the financial section of a business-focused weekly. I came across an article explaining how a Melbourne local applied a language called R to solve a grammatically and statistically challenging issue. The rest, as they say, is history.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-4"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4>
<p>I’m quite proud of both my Developing Data Products and Capstone projects, but for me these tangible outputs merely served as a vehicle to better understand a different way of thinking about data. I’ve spent most of my Software Development life dealing with one form or another of RDBMS (Relational Database Management System). This, in my experience, leads to a very set-oriented way of thinking about data.</p>
<p>I’m most proud of developing a new tool in my “Skills Toolbox” that I consider highly complementary to both my Software and Business outlook on projects.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-5"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4>
<p>Honestly, I had not planned on using my Certificate in and of itself. The skills I’ve acquired have already helped shape my thinking in designing an in-house web-based consulting collaboration platform.</p>
<p>I do not foresee this being the last time I’ll be applying Data Science thinking moving forward on my journey.</p>
<p><strong>Final Project: </strong><a href="https://schonken.shinyapps.io/WordPredictor">https://schonken.shinyapps.io/WordPredictor</a></p>
<p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/schonken/sentence-builder">http://rpubs.com/schonken/sentence-builder</a></p>
<h1 id="david-j-tagler">David J. Tagler</h1>
<p>David is passionate about solving the world’s most important and challenging problems. His expertise spans chemical/biomedical engineering, regenerative medicine, healthcare technology management, information technology/security, and data science/analysis. David earned his Ph.D. in Chemical Engineering from Northwestern University and B.S. in Chemical Engineering from the University of Notre Dame.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-5"><strong>Why did you take the JHU Data Science Specialization?</strong></h4>
<p>I enrolled in this specialization in order to advance my statistics, programming, and data analysis skills. I was interested in taking a series of courses that covered the entire data science pipeline. I believe that these skills will be critical for success in the future.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-5"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4>
<p>I am most proud of the R programming and modeling skills that I developed throughout this specialization. Previously, I had no experience with R. Now, I can effectively manage complex data sets, perform statistical analyses, build prediction models, create publication-quality figures, and deploy web applications.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-6"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4>
<p>I look forward to utilizing these skills in future research projects. Furthermore, I plan to take additional courses in data science, machine learning, and bioinformatics.</p>
<p><strong>Final Project: </strong><a href="http://dt444.shinyapps.io/next-word-predict">http://dt444.shinyapps.io/next-word-predict</a></p>
<p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/dt444/next-word-predict">http://rpubs.com/dt444/next-word-predict</a></p>
<h1 id="melissa-tan">Melissa Tan</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan.png"><img class="aligncenter size-medium wp-image-4099" src="http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan-300x198.png" alt="MelissaTan" width="300" height="198" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan-300x198.png 300w, http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan-260x172.png 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>I’m a financial journalist from Singapore. I did philosophy and computer science at the University of Chicago, and I’m keen on picking up more machine learning and data viz skills.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-6"><strong>Why did you take the JHU Data Science Specialization?</strong></h4>
<p>I wanted to keep up with coding, while learning new tools and techniques for wrangling and analyzing data that I could potentially apply to my job. Plus, it sounded fun. <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-6"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4>
<p>Building a word prediction app pretty much from scratch (with a truckload of forum reading). The capstone project seemed insurmountable initially and ate up all my weekends, but getting the app to work passably was worth it.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-7"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4>
<p>It’ll go on my CV, but I think it’s more important to be able to actually do useful things. I’m keeping an eye out for more practical opportunities to apply and sharpen what I’ve learnt.</p>
<p><strong>Final Project: </strong><a href="https://melissatan.shinyapps.io/word_psychic/">https://melissatan.shinyapps.io/word_psychic/</a></p>
<p><strong>Project Slide Deck: </strong><a href="https://rpubs.com/melissatan/capstone">https://rpubs.com/melissatan/capstone</a></p>
<h1 id="felicia-yii">Felicia Yii</h1>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii.jpg"><img class="aligncenter size-medium wp-image-4100" src="http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii-232x300.jpg" alt="FeliciaYii" width="232" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii-232x300.jpg 232w, http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii-793x1024.jpg 793w" sizes="(max-width: 232px) 100vw, 232px" /></a></p>
<p>Felicia likes to dream, think and do. With over 20 years in the IT industry, her current fascination is at the intersection of people, information and decision-making. Ever inquisitive, she has acquired an expertise in subjects as diverse as coding to cookery to costume making to cosmetics chemistry. It’s not apparent that there is anything she can’t learn to do, apart from housework. Felicia lives in Wellington, New Zealand with her husband, two children and two cats.</p>
<h4 id="why-did-you-take-the-jhu-data-science-specialization-7"><strong>Why did you take the JHU Data Science Specialization?</strong></h4>
<p>Well, I love learning and the JHU Data Science Specialization appealed to my thirst for a new challenge. I’m really interested in how we can use data to help people make better decisions. There’s so much data out there these days that it is easy to be overwhelmed by it all. Data visualisation was at the heart of my motivation when starting out. As I got into the nitty gritty of the course, I really began to see the power of making data accessible and appealing to the data-agnostic world. There’s so much potential for data science thinking in my professional work.</p>
<h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-7"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4>
<p>Getting through it for starters while also working and looking after two children. Seriously though, being able to say I know what ‘practical machine learning’ is all about. Before I started the course, I had limited knowledge of statistics, let alone knowing how to apply them in a machine learning context. I was thrilled to be able to use what I learned to test a cool game concept in my final project.</p>
<h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-8"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4>
<p>I want to use what I have learned in as many ways as possible. Firstly, I see opportunities to apply my skills to my day-to-day work in information technology. Secondly, I would like to help organisations that don’t have the skills or expertise in-house to apply data science thinking to help their decision making and communication. Thirdly, it would be cool one day to have my own company consulting on data science. I’ve more work to do to get there though!</p>
<p><strong>Final Project: </strong><a href="https://micasagroup.shinyapps.io/nwpgame/">https://micasagroup.shinyapps.io/nwpgame/</a></p>
<p><strong>Project Slide Deck: </strong><a href="https://rpubs.com/MicasaGroup/74788">https://rpubs.com/MicasaGroup/74788</a></p>
Batch effects are everywhere! Deflategate edition
2015-06-09T11:47:27+00:00
http://simplystats.github.io/2015/06/09/batch-effects-are-everywhere-deflategate-edition
<p>In my opinion, batch effects are the biggest challenge faced by genomics research, especially in precision medicine. As we point out in <a href="http://www.ncbi.nlm.nih.gov/pubmed/20838408">this review</a>, they are everywhere among high-throughput experiments. But batch effects are not specific to genomics technology. In fact, in <a href="http://amstat.tandfonline.com/doi/abs/10.1080/00401706.1972.10488878">this 1972 paper</a> (paywalled), <a href="http://en.wikipedia.org/wiki/William_J._Youden">WJ Youden</a> describes batch effects in the context of measurements made by physicists. Check out this plot of <a href="https://en.wikipedia.org/wiki/Astronomical_unit">astronomical unit</a> <del>speed of light</del> estimates <strong>with an estimate of spread <del>confidence intervals</del></strong> (red and green are same lab).</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Rplot.png"><img class=" wp-image-4295 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/06/Rplot.png" alt="Rplot" width="467" height="290" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Rplot-300x186.png 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Rplot.png 903w" sizes="(max-width: 467px) 100vw, 467px" /></a></p>
<p style="text-align: center;">
<p>
Sometimes you find batch effects where you least expect them, for example in the <a href="http://en.wikipedia.org/wiki/Deflategate">deflategate</a> debate. Here is a quote from the New England Patriots' deflategate <a href="http://www.boston.com/sports/football/patriots/2015/05/14/key-takeaways-from-the-patriots-deflategate-report-rebuttal/hK0J0J9abNgtGyhTwlW53L/story.html">rebuttal</a> (written with help from Nobel Prize winner <a href="http://en.wikipedia.org/wiki/Roderick_MacKinnon">Roderick MacKinnon</a>):
</p>
<blockquote>
<p>
in other words, the Colts balls were measured after the Patriots balls and had warmed up more. For the above reasons, the Wells Report conclusion that physical law cannot explain the pressures is incorrect.
</p>
</blockquote>
<p style="text-align: left;">
Here is another one:
</p>
<blockquote>
<p style="text-align: left;">
In the pressure measurements physical conditions were not very well-defined and major uncertainties, such as which gauge was used in pre-game measurements, affect conclusions.
</p>
</blockquote>
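<p>To make the first quote concrete, here is a minimal sketch of how much a gauge reading can depend on temperature alone, using Gay-Lussac's law (at fixed volume, absolute pressure is proportional to absolute temperature). The starting pressure and all temperatures are illustrative assumptions, not values taken from the rebuttal, and the function names are made up for this example.</p>

```python
# Approximate sea-level atmospheric pressure, used to convert gauge <-> absolute.
ATMOSPHERIC_PSI = 14.7


def fahrenheit_to_kelvin(temp_f):
    """Convert degrees Fahrenheit to kelvin."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


def gauge_pressure_after_temperature_change(gauge_psi, start_f, end_f):
    """Gauge pressure after the air inside goes from start_f to end_f.

    Assumes fixed volume and no leaks, so absolute pressure scales with
    absolute temperature (Gay-Lussac's law).
    """
    absolute_start = gauge_psi + ATMOSPHERIC_PSI
    ratio = fahrenheit_to_kelvin(end_f) / fahrenheit_to_kelvin(start_f)
    return absolute_start * ratio - ATMOSPHERIC_PSI


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    initial_psi = 12.5   # gauge pressure when inflated indoors
    indoor_f = 70.0      # assumed indoor temperature
    field_f = 50.0       # assumed field temperature
    warming_f = 60.0     # assumed temperature after partly warming back up

    measured_first = gauge_pressure_after_temperature_change(
        initial_psi, indoor_f, field_f)
    measured_later = gauge_pressure_after_temperature_change(
        initial_psi, indoor_f, warming_f)

    print(f"measured while still cold: {measured_first:.2f} psi")
    print(f"measured after warming up: {measured_later:.2f} psi")
```

<p>Under these assumed numbers, a ball measured while still cold reads about 1 psi below its starting value, while an identical ball measured after warming part of the way back reads about half a psi higher than that. Measurement order alone produces a difference between the two groups, which is exactly what a batch effect looks like.</p>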
<p style="text-align: left;">
So NFL, please read <a href="http://www.ncbi.nlm.nih.gov/pubmed/20838408">our paper</a> before you accuse a player of cheating.
</p>
<p style="text-align: left;">
Disclaimer: I live in New England but I am a <a href="http://www.urbandictionary.com/define.php?term=Ball+so+Hard+University">Ravens</a> fan.
</p>
</p>
I'm a data scientist - mind if I do surgery on your heart?
2015-06-08T14:15:39+00:00
http://simplystats.github.io/2015/06/08/im-a-data-scientist-mind-if-i-do-surgery-on-your-heart
<p>There has been a lot of recent interest from scientific journals and from other folks in creating checklists for data science and data analysis. The idea is that the checklist will help prevent results that won’t reproduce or replicate from the literature. One analogy that I’m frequently hearing is the analogy with checklists for surgeons that <a href="http://www.nejm.org/doi/full/10.1056/NEJMsa0810119">can help reduce patient mortality</a>.</p>
<p>The one major difference between checklists for surgeons and checklists I’m seeing for research purposes is the difference in credentialing between people allowed to perform surgery and people allowed to perform complex data analysis. You would never let me do surgery on you. I have no medical training at all. But I’m frequently asked to review papers that include complicated and technical data analyses, but have no trained data analysts or statisticians. The most common approach is that a postdoc or graduate student in the group is assigned to do the analysis, even if they don’t have much formal training. Whenever this happens red flags are up all over the place. Just like I wouldn’t trust someone without years of training and a medical license to do surgery on me, I wouldn’t let someone without years of training and credentials in data analysis make major conclusions from complex data analysis.</p>
<p>You might argue that the consequences for surgery and for complex data analysis are on completely different scales. I’d agree with you, but not in the direction that you might think. I would argue that high pressure and complex data analysis can have much larger consequences than surgery. In surgery there is usually only one person that can be hurt. But if you do a bad data analysis, say claiming that <a href="http://www.ncbi.nlm.nih.gov/pubmed/9500320">vaccines cause autism</a>, that can have massive consequences for hundreds or even thousands of people. So complex data analysis, especially for important results, should be treated with at least as much care as surgery.</p>
<p>The reason why I don’t think checklists alone will solve the problem is that they are likely to be used by people without formal training. One obvious (and recent) example that I think makes this really clear is the <a href="https://developer.apple.com/healthkit/">HealthKit</a> data we are about to start seeing. A ton of people signed up for studies on their iPhones and it has been all over the news. The checklist will (almost certainly) say to have a big sample size. HealthKit studies will certainly pass the checklist, but they are going to get <a href="http://en.wikipedia.org/wiki/Dewey_Defeats_Truman">Truman/Deweyed</a> big time if they aren’t careful about biased sampling.</p>
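<p>To make the sample size point concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration (the step-count population, the enrollment rule, the sample sizes); it is not HealthKit data. It compares a huge self-selected sample with a small simple random sample.</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population of daily step counts (illustrative numbers only).
population = rng.normal(6000, 2000, 1_000_000)

# Assumed biased enrollment: more active people are more likely to sign up.
enroll_prob = 1 / (1 + np.exp(-(population - 8000) / 1000))
enrolled = population[rng.random(population.size) < enroll_prob]

# A much smaller simple random sample from the same population.
small_random = rng.choice(population, size=500, replace=False)

true_mean = population.mean()
biased_error = abs(enrolled.mean() - true_mean)
random_error = abs(small_random.mean() - true_mean)

print(f"enrolled n = {enrolled.size}, error = {biased_error:.0f}")
print(f"random   n = {small_random.size}, error = {random_error:.0f}")
```

<p>Under these assumptions the self-selected sample is far larger yet lands much further from the population mean, which is the Truman/Dewey problem in miniature.</p>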
<div>
If I walked into an operating room and said I'm going to start dabbling in surgery I would be immediately thrown out. But people do that with statistics and data analysis all the time. What they really need is to require careful training and expertise in data analysis on each paper that analyzes data. Until we treat it as a first class component of the scientific process we'll continue to see retractions, falsifications, and irreproducible results flourish.
</div>
Interview with Class Central
2015-06-04T09:27:20+00:00
http://simplystats.github.io/2015/06/04/4063
<p>Recently I sat down with Class Central to do an interview about the Johns Hopkins Data Science Specialization. We talked about the motivation for designing the sequence and the capstone project. With the demand for data science skills greater than ever, the importance of the specialization is only increasing.</p>
<p>See the <a href="https://www.class-central.com/report/data-science-specialization/">full interview</a> at the Class Central site. Below is a short excerpt.</p>
Interview with Chris Wiggins, chief data scientist at the New York Times
2015-06-01T09:00:27+00:00
http://simplystats.github.io/2015/06/01/interview-with-chris-wiggins-chief-data-scientist-at-the-new-york-times
<p><em>Editor’s note: We are trying something a little new here and doing an interview with Google Hangouts on Air. The interview will be live at 11:30am EST. I have some questions lined up for Chris, but if you have others you’d like to ask, you can tweet them @simplystats and I’ll see if I can work them in. After the livestream we’ll leave the video on Youtube so you can check out the interview if you can’t watch the live stream. I’m embedding the Youtube video here but if you can’t see the live stream when it is running go check out the event page: <a href="https://plus.google.com/events/c7chrkg0ene47mikqrvevrg3a4o">https://plus.google.com/events/c7chrkg0ene47mikqrvevrg3a4o</a>.</em></p>
Science is a calling and a career, here is a career planning guide for students and postdocs
2015-05-28T10:16:47+00:00
http://simplystats.github.io/2015/05/28/science-is-a-calling-and-a-career-here-is-a-career-planning-guide-for-students-and-postdocs
<p><em>Editor’s note: This post was inspired by</em> <a href="https://github.com/BenLangmead/langmead-lab/blob/master/postdoc_questionnaire.md"><em>a really awesome career planning guide by Ben Langmead</em></a><em>, which you should go check out right now. You can also find the slightly adapted</em> <a href="https://github.com/jtleek/careerplanning"><em>Leek group career planning guide</em></a> <em>here.</em></p>
<p>The most common reason that people go into science is altruistic. They loved dinosaurs and spaceships when they were kids and that never wore off. On some level this is one of the reasons I love this field so much: it is an area that, if you can get past all the hard parts, can really keep introducing wonder into what you work on every day.</p>
<p>Sometimes I feel like this altruism has negative consequences. For example, I think that there is less emphasis on the career planning and development side in the academic community. I don’t think this is malicious, but I do think that sometimes people think of the career part of science as unseemly. But if you have any job that you want people to pay you to do, then there will be parts of that job that will be career oriented. So if you want to be a professional scientist, being brilliant and good at science is not enough. You also need to pay attention to and carefully plan your career trajectory.</p>
<p>A colleague of mine, Ben Langmead, created a really nice guide for his postdocs to thinking about and planning the career side of a postdoc <a href="https://github.com/BenLangmead/langmead-lab/blob/master/postdoc_questionnaire.md">which he has over on Github</a>. I thought it was such a good idea that I immediately modified it and asked all of my graduate students and postdocs to fill it out. It is kind of long so there was no penalty if they didn’t finish it, but I think it is an incredibly useful tool for thinking about how to strategize a career in the sciences. I think that the more we are concrete about the career side of graduate school and postdocs, including being honest about all the realistic options available, the better prepared our students will be to succeed on the market.</p>
<p>You can find the <a href="https://github.com/jtleek/careerplanning">Leek Group Guide to Career Planning</a> here and make sure you also go <a href="https://github.com/BenLangmead/langmead-lab/blob/master/postdoc_questionnaire.md">check out Ben’s</a> since it was his idea and his is great.</p>
Is it species or is it batch? They are confounded, so we can't know
2015-05-20T11:11:18+00:00
http://simplystats.github.io/2015/05/20/is-it-species-or-is-it-batch-they-are-confounded-so-we-cant-know
<p>In a 2005 OMICS <a href="http://online.liebertpub.com/doi/abs/10.1089/153623104773547462" target="_blank">paper</a>, an analysis of human and mouse gene expression microarray measurements from several tissues led the authors to conclude that “any tissue is more similar to any other human tissue examined than to its corresponding mouse tissue”. Note that this was a rather surprising result given how similar tissues are between species. For example, both mice and humans see with their eyes, breathe with their lungs, pump blood with their hearts, etc… Two follow-up papers (<a href="http://mbe.oxfordjournals.org/content/23/3/530.abstract?ijkey=2c3d98666afbc99949fdcf514f10e3fedadee259&keytype2=tf_ipsecsha" target="_blank">here</a> and <a href="http://mbe.oxfordjournals.org/content/24/6/1283.abstract?ijkey=366fdf09da56a5dd0cfdc5f74082d9c098ae7801&keytype2=tf_ipsecsha" target="_blank">here</a>) demonstrated that platform-specific technical variability was the cause of this apparent dissimilarity. The arrays used for the two species were different and thus measurement platform and species were completely <strong>confounded</strong>. In a 2010 paper, we confirmed that once this technical variability was accounted for, the number of genes expressed in common between the same tissue across the two species was much higher than the number expressed in common between different tissues within a species (see Figure 2 <a href="http://nar.oxfordjournals.org/content/39/suppl_1/D1011.full" target="_blank">here</a>).</p>
<p>So <a href="http://genomicsclass.github.io/book/pages/confounding.html">what is confounding</a> and <a href="http://www.nature.com/ng/journal/v39/n7/full/ng0707-807.html">why is it a problem</a>? This topic has been discussed broadly. We wrote a <a href="http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html">review</a> some time ago. But based on recent discussions I’ve participated in, it seems that there is still some confusion. Here I explain, aided by some math, how confounding leads to problems in the context of estimating species effects in genomics. We will use</p>
<ul>
<li><em>X<sub>i</sub></em> to represent the gene expression measurements for human tissue <em>i,</em></li>
<li><em>a<sub>X</sub></em> to represent the level of expression that is specific to humans and</li>
<li><em>b<sub>X</sub></em> to represent the batch effect introduced by the use of the human microarray platform.</li>
<li>Therefore <em>X<sub>i</sub></em> = <em>a<sub>X</sub></em> + <em>b<sub>X</sub></em> + <em>e<sub>i</sub></em>, with <em>e<sub>i</sub></em> the tissue <em>i</em> effect and other uninteresting sources of variability.</li>
</ul>
<p>Similarly, we will use:</p>
<ul>
<li><em>Y<sub>i</sub></em> to represent the measurements for mouse tissue <em>i</em></li>
<li><em>a<sub>Y</sub></em> to represent the mouse specific level and</li>
<li><em>b<sub>Y</sub></em> the batch effect introduced by the use of the mouse microarray platform.</li>
<li>Therefore <em>Y<sub>i</sub></em> = <em>a<sub>Y</sub></em> + <em>b<sub>Y</sub></em> + <em>f<sub>i</sub></em>, with <em>f<sub>i</sub></em> the tissue <em>i</em> effect and other uninteresting sources of variability.</li>
</ul>
<p>If we are interested in estimating a species effect that is general across tissues, then we are interested in the following quantity:</p>
<p style="text-align: center;">
<em>a<sub>Y</sub> - a<sub>X</sub></em>
</p>
<p>Naively, we would think that we can estimate this quantity using the observed differences between the species that cancel out the tissue effect. We observe a difference for each tissue: <em>Y<sub>1 </sub></em> - <em>X<sub>1 </sub></em>, <em>Y<sub>2</sub></em> - <em>X<sub>2 </sub></em>, etc… The problem is that <em>a<sub>X</sub></em> and <em>b<sub>X</sub></em> are always together as are <em>a<sub>Y</sub></em> and <em>b<sub>Y</sub></em>. We say that the batch effect <em>b<sub>X</sub></em> is <strong>confounded</strong> with the species effect <em>a<sub>X</sub></em>. Therefore, on average, the observed differences include both the species and the batch effects. To estimate the difference above we would write a model like this:</p>
<p style="text-align: center;">
<em>Y<sub>i</sub></em> - <em>X<sub>i</sub></em> = (<em>a<sub>Y</sub> - a<sub>X</sub></em>) + (<em>b<sub>Y</sub> - b<sub>X</sub></em>) + other sources of variability
</p>
<p style="text-align: left;">
and then estimate the unknown quantities of interest: (<em>a<sub>Y</sub> - a<sub>X</sub></em>) and (<em>b<sub>Y</sub> - b<sub>X</sub></em>) from the observed data <em>Y<sub>1</sub></em> - <em>X<sub>1</sub></em>, <em>Y<sub>2</sub></em> - <em>X<sub>2</sub></em>, etc... The problem is that we can estimate the aggregate effect (<em>a<sub>Y</sub> - a<sub>X</sub></em>) + (<em>b<sub>Y</sub> - b<sub>X</sub></em>), but, mathematically, we can't tease apart the two differences. To see this, note that if we are using least squares, the estimates (<em>a<sub>Y</sub> - a<sub>X</sub></em>) = 7, (<em>b<sub>Y</sub> - b<sub>X</sub></em>) = 3 will fit the data exactly as well as (<em>a<sub>Y</sub> - a<sub>X</sub></em>) = 3, (<em>b<sub>Y</sub> - b<sub>X</sub></em>) = 7 since
</p>
<p style="text-align: center;">
<em>{(Y-X) - (7+3)}<sup>2</sup> = {(Y-X) - (3+7)}<sup>2</sup>.</em>
</p>
<p style="text-align: left;">
In fact, under these circumstances, there are an infinite number of solutions to the standard statistical estimation approaches. A simple analogy is trying to find a unique solution to the single equation m+n = 0. If batch and species are not confounded then we are able to tease apart differences just as if we were given another equation: m+n=0; m-n=2. You can learn more about this in <a href="https://www.edx.org/course/introduction-linear-models-matrix-harvardx-ph525-2x">this linear models course</a>.
</p>
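<p>The argument above can be checked numerically. The sketch below is only an illustration under assumed values: the effects 7 and 3 are the ones used in the text, while the number of tissues, the noise level and all variable names are made up. It shows that the confounded design matrix has rank 1, that the two candidate estimates give identical residual sums of squares, and that adding the second equation from the analogy yields a unique solution.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "true" values, taken from the example in the text.
species_effect = 7.0   # a_Y - a_X
batch_effect = 3.0     # b_Y - b_X

# Simulated observed differences Y_i - X_i for 20 hypothetical tissues.
n_tissues = 20
d = species_effect + batch_effect + rng.normal(0, 1, n_tissues)

# Perfect confounding: every difference contains both effects, so the
# species column and the batch column of the design matrix are identical.
design = np.column_stack([np.ones(n_tissues), np.ones(n_tissues)])
rank = np.linalg.matrix_rank(design)


def rss(species, batch):
    """Residual sum of squares for a candidate pair of estimates."""
    return float(np.sum((d - (species + batch)) ** 2))


print("rank of design matrix:", rank)
print("RSS(7, 3) == RSS(3, 7):", rss(7.0, 3.0) == rss(3.0, 7.0))

# Without confounding we effectively gain a second equation:
# m + n = 0 and m - n = 2 have a unique solution.
A = np.array([[1.0, 1.0], [1.0, -1.0]])
b = np.array([0.0, 2.0])
m, n = np.linalg.solve(A, b)
print("m, n =", m, n)
```

<p>Any pair of estimates with the same sum fits equally well here, which is why least squares cannot pick one of them.</p>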
<p style="text-align: left;">
Note that the above derivation applies to each gene affected by the batch effect. In practice we commonly see hundreds of genes affected. As a consequence, when we compute distances between two samples from different species we may see large differences even where there is no species effect. This is because the <em>b<sub>Y</sub> - b<sub>X</sub></em> differences for each gene are squared and added up.
</p>
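<p>Here is a small simulation sketch of that accumulation. All numbers are assumptions chosen for illustration (1,000 genes, 300 of them hit by a batch shift, the noise level), and the simulated samples share the same underlying expression, so there is no species effect at all.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

n_genes = 1000
n_affected = 300      # hypothetical number of genes hit by the batch effect
noise_sd = 0.1        # hypothetical measurement noise

# Same underlying expression for every sample: no species effect.
base = rng.normal(8, 1, n_genes)

# Batch shift b_Y - b_X, non-zero only for the affected genes.
batch_shift = np.zeros(n_genes)
batch_shift[:n_affected] = rng.normal(0, 1, n_affected)

human_1 = base + rng.normal(0, noise_sd, n_genes)
human_2 = base + rng.normal(0, noise_sd, n_genes)
mouse_1 = base + batch_shift + rng.normal(0, noise_sd, n_genes)

# Euclidean distances: per-gene differences are squared and added up.
within = np.linalg.norm(human_1 - human_2)
between = np.linalg.norm(mouse_1 - human_1)

print(f"within-platform distance:  {within:.2f}")
print(f"between-platform distance: {between:.2f}")
```

<p>In this sketch the between-platform distance is much larger than the within-platform one even though the only thing that differs is the batch shift.</p>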
<p style="text-align: left;">
In summary, if you completely confound your variable of interest, in this case species, with a batch effect, you will not be able to estimate the effect of either. In fact, in a <a href="http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html">2010 Nature Reviews Genetics review</a> about batch effects we warned about "cases in which batch effects are confounded with an outcome of interest and result in misleading biological or clinical conclusions". We also warned that none of the existing solutions for batch effects (ComBat, SVA, RUV, etc...) can save you from a situation with perfect confounding. Because we can't always predict what will introduce unwanted variability, we recommend randomization as an experimental design approach.
</p>
<p style="text-align: left;">
Almost a decade after the OMICS paper was published, the same surprising conclusion was reached in <a href="http://www.pnas.org/content/111/48/17224.abstract" target="_blank">this PNAS paper</a>: "tissues appear more similar to one another within the same species than to the comparable organs of other species". This time RNAseq was used for both species and therefore the different platform issue was not considered<sup>*</sup>. Therefore, the authors implicitly assumed that (<em>b<sub>Y</sub> - b<sub>X</sub></em>)=0. However, in a recent F1000 Research <a href="http://f1000research.com/articles/4-121/v1" target="_blank">publication</a> Gilad and Mizrahi-Man describe an exercise in <a href="http://projecteuclid.org/euclid.aoas/1267453942">forensic bioinformatics</a> that led them to discover that mouse and human samples were run in different lanes or on different instruments. The confounding was near perfect (see <a href="https://f1000researchdata.s3.amazonaws.com/manuscripts/7019/9f5f4330-d81d-46b8-9a3f-d8cb7aaf577e_figure1.gif">Figure 1</a>). As pointed out by these authors, with this experimental design we can't simply accept that (<em>b<sub>Y</sub> - b<sub>X</sub></em>)=0, which implies that we can't estimate a species effect. Gilad and Mizrahi-Man then apply a <a href="http://biostatistics.oxfordjournals.org/content/8/1/118.abstract">linear model</a> (ComBat) to account for the batch/species effect and find that <a href="https://f1000researchdata.s3.amazonaws.com/manuscripts/7019/9f5f4330-d81d-46b8-9a3f-d8cb7aaf577e_figure3.gif">samples cluster almost perfectly by tissue</a>. However, Gilad and Mizrahi-Man correctly note that, due to the confounding, if there is in fact a species effect, this approach will remove it along with the batch effect. Unfortunately, due to the experimental design it will be hard or impossible to determine if it's batch or if it's species. More data and more analyses are needed.
</p>
<p>Confounded designs ruin experiments. Current batch effect removal methods will not save you. If you are designing a large genomics experiment, learn about randomization.</p>
<p style="text-align: left;">
* The fact that RNAseq was used does not necessarily mean there is no platform effect. The species have different genomes, with different sequences and thus can lead to different biases during experimental protocols.
</p>
<p style="text-align: left;">
<strong>Update: </strong>Shin Lin has repeated a small version of the experiment described in the <a href="http://www.pnas.org/content/111/48/17224.abstract" target="_blank">PNAS paper</a>. The new experimental design does not confound lane/instrument with species. The new data confirm their original results, indicating that lane/instrument does not explain the clustering by species. You can see his response in the comments <a href="http://f1000research.com/articles/4-121/v1" target="_blank">here</a>.
</p>
Residual expertise - or why scientists are amateurs at most of science
2015-05-18T10:21:18+00:00
http://simplystats.github.io/2015/05/18/residual-expertise
<p><em>Editor’s note: I have been unsuccessfully attempting to finish a book I started 3 years ago about how and why everyone should get pumped about reading and understanding scientific papers. I’ve adapted part of one of the chapters into this blogpost. It is pretty raw but hopefully gets the idea across. </em></p>
<p>An episode of <em>The Daily Show with Jon Stewart</em> featured Lisa Randall, an incredible physicist and noted scientific communicator, as the invited guest.</p>
<p>Near the end of the interview, Stewart asked Randall why, with all the scientific progress we have made, we have been unable to move away from fossil fuel-based engines. The question led to the exchange:</p>
<blockquote>
<p><em>Randall: “So this is part of the problem, because I’m a scientist doesn’t mean I know the answer to that question.”</em></p>
</blockquote>
<blockquote>
<p><em>Stewart: “Oh is that true? Here’s the thing, here’s what’s part of the answer. You could say anything and I would have no idea what you are talking about.”</em></p>
</blockquote>
<p>Professor Randall is a world-leading physicist, the first woman to achieve tenure in physics at Princeton, Harvard, and MIT, and a member of the National Academy of Sciences. But when it comes to the science of fossil fuels, she is just an amateur. Her response to this question is just perfect - it shows that even brilliant scientists can just be interested amateurs on topics outside of their expertise. Despite Professor Randall’s over-the-top qualifications, she is an amateur on a whole range of scientific topics from medicine, to computer science, to nuclear engineering. Being an amateur isn’t a bad thing, and recognizing where you are an amateur may be the truest indicator of genius. That doesn’t mean Professor Randall can’t know a little bit about fossil fuels or be curious about why we don’t all have nuclear-powered hovercrafts yet. It just means she isn’t the authority.</p>
<p>Stewart’s response is particularly telling and indicative of what a lot of people think about scientists. It takes years of experience to become an expert in a scientific field - some have suggested as many as 10,000 hours of dedicated time. Professor Randall is a scientist - so she must have more information about any scientific problem than an informed amateur like Jon Stewart. But of course this isn’t true: Jon Stewart (and you) could quickly learn as much about fossil fuels as a scientist who wasn’t already an expert in the area. Sure, a background in physics would help, but there are a lot of moving parts in our dependence on fossil fuels, including social, political, and economic problems in addition to the physics involved.</p>
<p>This is an example of “residual expertise” - when people without deep scientific training are willing to attribute expertise to scientists even if it is outside their primary area of focus. It is closely related to the logical fallacy behind the <a href="http://en.wikipedia.org/wiki/Argument_from_authority">argument from authority</a>:</p>
<blockquote>
<p>A is an authority on a particular topic</p>
<p>A says something about that topic</p>
<p>A is probably correct</p>
</blockquote>
<p>The difference is that with residual expertise you assume that since A is an authority on a particular topic, if they say something about another, potentially related topic, they will probably be correct. This idea is critically important; it is how quacks make their living. The logical leap of faith from “that person is a doctor” to “that person is a doctor so of course they understand epidemiology, or vaccination, or risk communication” is exactly the leap empowered by the idea of residual expertise. It is also how you can line up scientific experts against any well-established doctrine like evolution or climate change. Experts in the field will know all of the relevant information that supports key ideas in the field and what it would take to overturn those ideas. But experts outside of the field can be lined up and their residual expertise used to call into question even the most supported ideas.</p>
<p>What does this have to do with you?</p>
<p>Most people aren’t necessarily experts in scientific disciplines they care about. But becoming a successful amateur requires a much smaller time commitment than becoming an expert and can still be incredibly satisfying, fun, and useful. This book is designed to help you become a fired-up amateur in the science of your choice. Think of it like a hobby, but one where you get to learn about some of the coolest new technologies and ideas coming out in the scientific literature. If you can ignore the way residual expertise makes you feel silly for reading scientific papers you don’t fully understand - you can still learn a ton and have a pretty fun time doing it.</p>
The tyranny of the idea in science
2015-05-08T11:58:51+00:00
http://simplystats.github.io/2015/05/08/the-tyranny-of-the-idea-in-science
<p>There are a lot of analogies between <a href="http://simplystatistics.org/2012/09/20/every-professor-is-a-startup/">startups and academic science labs</a>. One thing that is definitely very different is the relative value of ideas in the startup world and in the academic world. For example, <a href="http://simplystatistics.org/2012/09/20/every-professor-is-a-startup/">Paul Graham has said:</a></p>
<blockquote>
<p>Actually, startup ideas are not million dollar ideas, and here’s an experiment you can try to prove it: just try to sell one. Nothing evolves faster than markets. The fact that there’s no market for startup ideas suggests there’s no demand. Which means, in the narrow sense of the word, that startup ideas are worthless.</p>
</blockquote>
<p>In academics, almost the opposite is true. There is huge value to being first with an idea, even if you haven’t gotten all the details worked out or stable software in place. Here are a couple of extreme examples illustrated with Nobel prizes:</p>
<ol>
<li><strong>Higgs Boson</strong> - Peter Higgs <a href="http://journals.aps.org/pr/abstract/10.1103/PhysRev.145.1156">postulated the Boson in 1964</a> and <a href="http://www.symmetrymagazine.org/article/october-2013/nobel-prize-in-physics-honors-prediction-of-higgs-boson">he won the Nobel Prize in 2013 for that prediction</a>. In between, tons of people did follow-on work, someone convinced Europe to build one of the <a href="http://en.wikipedia.org/wiki/Large_Hadron_Collider">most expensive pieces of scientific equipment ever built</a>, and conservatively thousands of scientists and engineers had to do a ton of work to get the equipment to (a) work and (b) confirm the prediction.</li>
<li><strong>Human genome</strong> - <a href="http://en.wikipedia.org/wiki/Molecular_Structure_of_Nucleic_Acids:_A_Structure_for_Deoxyribose_Nucleic_Acid">Watson and Crick postulated the structure of DNA</a> in 1953, <a href="http://www.nobelprize.org/nobel_prizes/medicine/laureates/1962/">they won the Nobel Prize in medicine in 1962</a> for this work. But the real value of the human genome was realized when the <a href="http://en.wikipedia.org/wiki/Human_Genome_Project">largest biological collaboration in history sequenced the human genome</a>, along with all of the subsequent work in the genomics revolution.</li>
</ol>
<p>These are two large-scale examples where the academic scientific community (as represented by the Nobel committee, mostly because it is a concrete example) rewards the original idea and not the hard work to achieve that idea. I call this “the tyranny of the idea.” I notice a similar issue on a much smaller scale, for example when people <a href="http://ivory.idyll.org/blog/2015-software-as-a-primary-product-of-science.html">don’t recognize software as a primary product of science</a>. I feel like these decisions devalue the real work it takes to make any scientific idea a reality. Sure the ideas are good, but it isn’t clear that some ideas wouldn’t be discovered by someone else - but surely we aren’t going to build another Large Hadron Collider. I’d like to see the scales correct back the other way a little bit so we put at least as much emphasis on the science it takes to follow through on an idea as on discovering it in the first place.</p>
Mendelian randomization inspires a randomized trial design for multiple drugs simultaneously
2015-05-07T11:30:09+00:00
http://simplystats.github.io/2015/05/07/mendelian-randomization-inspires-a-randomized-trial-design-for-multiple-drugs-simultaneously
<p>Joe Pickrell has an interesting new paper out about <a href="http://biorxiv.org/content/early/2015/04/16/018150.full-text.pdf+html">Mendelian randomization.</a> He discusses some of the interesting issues that come up with these studies and performs a mini-review of previously published studies using the technique.</p>
<p>The basic idea behind Mendelian Randomization is the following. In a simple, randomly mating population Mendel’s laws tell us that at any genomic locus (a measured spot in the genome) the allele (genetic material you got) you get is assigned at random. At the chromosome level this is very close to true due to properties of meiosis (here is an example of how this looks in very cartoonish form in yeast). A very famous example of this was an experiment performed by Leonid Kruglyak’s group where they took two strains of yeast and repeatedly mated them, then measured genetics and gene expression data. The experimental design looked like this:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/05/Slide06.jpg"><img class="aligncenter wp-image-4009 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/05/Slide06-300x224.jpg" alt="Slide06" width="300" height="224" srcset="http://simplystatistics.org/wp-content/uploads/2015/05/Slide06-300x224.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/05/Slide06-260x194.jpg 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>If you look at the allele inherited from the two parental strains (BY, RM) at two separate genes on different chromosomes in each of the 112 segregants (yeast offspring) they do appear to be random and independent:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/05/Screen-Shot-2015-05-07-at-11.20.46-AM.png"><img class="aligncenter wp-image-4010 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/05/Screen-Shot-2015-05-07-at-11.20.46-AM-235x300.png" alt="Screen Shot 2015-05-07 at 11.20.46 AM" width="235" height="300" /></a></p>
<p>So this is a randomized trial in yeast where the yeast were each randomized to many many genetic “treatments” simultaneously. Now this isn’t strictly true, since genes on the same chromosomes near each other aren’t exactly random and in humans it is definitely not true since there is population structure, non-random mating and a host of other issues. But you can still do cool things to try to infer causality from the genetic “treatments” to downstream things like gene expression ( <a href="http://genomebiology.com/2007/8/10/r219">and even do a reasonable job in the model organism case</a>).</p>
<p>In my mind this raises a potentially interesting study design for clinical trials. Suppose that there are 10 treatments for a disease that we know about. We design a study where each of the patients in the trial was randomized to receive treatment or placebo for each of the 10 treatments. So on average each person would get 5 treatments. Then you could try to tease apart the effects using methods developed for the Mendelian randomization case. Of course, this is ignoring potential interactions, side effects of taking multiple drugs simultaneously, etc. But I’m seeing lots of <a href="http://www.nature.com/news/personalized-medicine-time-for-one-person-trials-1.17411">interesting proposals</a> for new trial designs (<a href="http://notstatschat.tumblr.com/post/118102423391/precise-answers-but-not-necessarily-to-the-right">which may or may not work</a>), so I thought I’d contribute my own interesting idea.</p>
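<p>As a rough illustration of the design, here is a simulation sketch. It rests on strong assumptions that are mine, not part of the proposal: the treatment effects are invented, they are purely additive with no interactions, and ordinary least squares stands in for the Mendelian randomization methods mentioned above.</p>

```python
import numpy as np

rng = np.random.default_rng(2)

n_patients, n_drugs = 2000, 10

# Each patient is independently randomized to drug (1) or placebo (0)
# for each of the 10 treatments.
assignment = rng.integers(0, 2, size=(n_patients, n_drugs))

# Hypothetical true effects; additive, with no interactions (an assumption).
true_effects = np.linspace(-1.0, 1.0, n_drugs)
outcome = assignment @ true_effects + rng.normal(0, 1, n_patients)

# Ordinary least squares with an intercept, used here as a simple stand-in.
design = np.column_stack([np.ones(n_patients), assignment])
coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
estimated_effects = coef[1:]

mean_treatments = assignment.sum(axis=1).mean()
print("average number of active treatments per patient:", mean_treatments)
print("largest estimation error:", np.abs(estimated_effects - true_effects).max())
```

<p>Because the assignments are independent, the ten columns are close to orthogonal and each effect can be estimated from the same patients. Interactions or side effects of combined drugs would break the additive assumption this sketch relies on.</p>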
Rafa's citations above replacement in statistics journals is crazy high.
2015-05-01T11:18:47+00:00
http://simplystats.github.io/2015/05/01/rafas-citations-above-replacement-in-statistics-journals-is-crazy-high
<p><em>Editor’s note: I thought it would be fun to do some bibliometrics on a Friday. This is super hacky and the CAR/Y stat should not be taken seriously. </em></p>
<p>I downloaded data on the 400 most cited papers between 2000-2010 in some statistical journals from <a href="http://webofscience.com/">Web of Science</a>. Here is a boxplot of the average number of citations per year (from publication date to 2015) to these papers in the journals Annals of Statistics, Biometrics, Biometrika, Biostatistics, JASA, Journal of Computational and Graphical Statistics, Journal of Machine Learning Research, and Journal of the Royal Statistical Society Series B.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/05/journals.png"><img class="aligncenter wp-image-4001" src="http://simplystatistics.org/wp-content/uploads/2015/05/journals-300x300.png" alt="journals" width="500" height="500" srcset="http://simplystatistics.org/wp-content/uploads/2015/05/journals-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/05/journals-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/05/journals-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/05/journals.png 1050w" sizes="(max-width: 500px) 100vw, 500px" /></a></p>
<p>There are several interesting things about this graph right away. One is that JASA has the highest median number of citations, but has fewer “big hits” (papers with 100+ citations/year) than Annals of Statistics, JMLR, or JRSS-B. Another thing is how much of a lottery developing statistical methods seems to be. Most papers, even among the 400 most cited, have around 3 citations/year on average. But a few lucky winners have 100+ citations per year. One interesting thing for me is the papers that get 10 or more citations per year but aren’t huge hits. I suspect these are the papers that <a href="http://simplystatistics.org/2014/07/25/academic-statisticians-there-is-no-shame-in-developing-statistical-solutions-that-solve-just-one-problem/">solve one problem well but don’t solve the most general problem ever</a>.</p>
<p>Something that jumps out from that plot is the outlier for the journal Biostatistics. One of their papers is cited 367.85 times per year. The next nearest competitor is at 67.75, and the top paper is 19 standard deviations above the mean! The paper in question is: “Exploration, normalization, and summaries of high density oligonucleotide array probe level data”, which is the paper that introduced RMA, one of the most popular methods for pre-processing microarrays ever created. It was written by Rafa and colleagues. It made me think of the statistic “<a href="http://www.fangraphs.com/library/misc/war/">wins above replacement</a>” which quantifies how many extra wins a baseball team gets by playing a specific player in place of a league average replacement.</p>
<p>What about a “citations /year above replacement” statistic where you calculate for each journal:</p>
<blockquote>
<p>Median number of citations to a paper/year with Author X - Median number of citations/year to an average paper in that journal</p>
</blockquote>
<p>Then average this number across journals. This attempts to quantify how many extra citations/year a person’s papers generate compared to the “average” paper in that journal. For Rafa the numbers look like this:</p>
<ul>
<li>Biostatistics: Rafa = 15.475, Journal = 1.855, CAR/Y = 13.62</li>
<li>JASA: Rafa = 74.5, Journal = 5.2, CAR/Y = 69.3</li>
<li>Biometrics: Rafa = 4.33, Journal = 3.38, CAR/Y = 0.95</li>
</ul>
<p>So Rafa’s citations above replacement is (13.62 + 69.3 + 0.95)/3 = 27.96! There are a couple of reasons why this isn’t a completely accurate picture. One is the low sample size; the second is that I only took the 400 most cited papers in each journal. Rafa has a few papers that didn’t make the top 400 for journals like JASA, which would bring down his CAR/Y.</p>
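<p>For anyone who wants to redo the arithmetic, here is a minimal sketch of the CAR/Y calculation in Python. The numbers are the ones listed above; the dictionary layout and the function name <code>car_per_year</code> are just illustrative choices, not part of any real bibliometrics package.</p>

```python
from statistics import mean

# (author median, journal median) citations/year, copied from the list above.
medians = {
    "Biostatistics": (15.475, 1.855),
    "JASA": (74.5, 5.2),
    "Biometrics": (4.33, 3.38),
}

def car_per_year(medians_by_journal):
    """Return per-journal differences and their average (the CAR/Y)."""
    per_journal = {
        journal: author - baseline
        for journal, (author, baseline) in medians_by_journal.items()
    }
    return per_journal, mean(per_journal.values())

per_journal, overall = car_per_year(medians)
for journal, value in per_journal.items():
    print(f"{journal}: CAR/Y = {value:.2f}")
print(f"Average CAR/Y = {overall:.2f}")  # prints 27.96
```

<p>With only three journals the average is dominated by the JASA entry, which is one more reason not to take the number seriously.</p>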
<p> </p>
Figuring Out Learning Objectives the Hard Way
2015-04-30T11:10:06+00:00
http://simplystats.github.io/2015/04/30/figuring-out-learning-objectives-the-hard-way
<p>When building the <a href="https://www.coursera.org/specialization/genomics/41" title="Genomic Data Science Specialization">Genomic Data Science Specialization</a> (which starts in June!) we had to figure out the learning objectives for each course. We initially set our ambitions high, but as you can see in this video below, Steven Salzberg brought us back to Earth.</p>
Data analysis subcultures
2015-04-29T10:23:57+00:00
http://simplystats.github.io/2015/04/29/data-analysis-subcultures
<p>Roger and I responded to the controversy around the journal that banned p-values today <a href="http://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412">in Nature.</a> A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper:</p>
<blockquote>
<p>Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time ‘panel data’, to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as ‘longitudinal data’, and often go at it with generalized estimating equations.</p>
</blockquote>
<p>I think this is one of the least appreciated components of modern data analysis. Data analysis is almost entirely taught through an apprenticeship culture with completely different behaviors taught in different disciplines. All of these disciplines agree about the mathematical optimality of specific methods under very specific conditions. That is why you see <a href="http://psychclassics.yorku.ca/Peirce/small-diffs.htm">methods</a> like <a href="http://en.wikipedia.org/wiki/Statistical_Methods_for_Research_Workers">randomized trials</a> <a href="http://www.ted.com/talks/esther_duflo_social_experiments_to_fight_poverty?language=en">used</a> across <a href="http://www.badscience.net/category/evidence-based-policy/">multiple disciplines</a>.</p>
<p>But any real data analysis is always a multi-step process involving data cleaning and tidying, exploratory analysis, model fitting and checking, summarization and communication. If you gave someone from economics, biostatistics, statistics, and applied math an identical data set they’d give you back <strong>very</strong> different reports on what they did, why they did it, and what it all meant. Here are a few examples I can think of off the top of my head:</p>
<ul>
<li>Economics calls longitudinal data panel data and uses mostly linear mixed effects models, while generalized estimating equations are more common in biostatistics (this is the example from Roger/my paper).</li>
<li>In genome wide association studies the family wise error rate is the most common error rate to control. In gene expression studies people frequently use the false discovery rate.</li>
<li>This is changing a bit, but if you learned statistics at Duke you are probably a Bayesian and if you learned at Berkeley you are probably a frequentist.</li>
<li>Psychology has a history of using <a href="http://en.wikipedia.org/wiki/Psychological_statistics">parametric statistics</a>, genomics is big into <a href="http://www.bioconductor.org/packages/release/bioc/html/limma.html">empirical Bayes</a>, and you see a lot of Bayesian statistics in <a href="https://www1.ethz.ch/iac/people/knuttir/papers/meinshausen09nat.pdf">climate studies</a>.</li>
<li>You see <a href="http://en.wikipedia.org/wiki/White_test">tests for heteroskedasticity</a> used a lot in econometrics, but that is hardly ever done through formal hypothesis testing in biostatistics.</li>
<li>Training sets and test sets are used in machine learning for prediction, but rarely used for inference.</li>
</ul>
<p>This is just a partial list I thought of off the top of my head; there are a ton more. These decisions matter <strong>a lot</strong> in a data analysis. The problem is that the behavioral component of a data analysis is incredibly strong, no matter how much we’d like to think of the process as mathematico-theoretical. Until we acknowledge that the most common reason a method is chosen is “I saw it in a widely-cited paper in journal XX from my field”, it is likely that little progress will be made on resolving the statistical problems in science.</p>
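<p>To make one of the examples above concrete, here is a rough sketch of how much the choice of error rate can matter. It compares a Bonferroni correction (which controls the family wise error rate) with the Benjamini-Hochberg procedure (which controls the false discovery rate) on the same p-values. The p-values are made up for illustration, and the function names are my own.</p>

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Reject when p <= alpha / m (controls the family wise error rate)."""
    p_values = np.asarray(p_values, dtype=float)
    return p_values <= alpha / p_values.size

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    p_values = np.asarray(p_values, dtype=float)
    m = p_values.size
    order = np.argsort(p_values)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p_values[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Reject every hypothesis up to the largest rank that meets its bound.
        largest_rank = np.nonzero(below)[0].max()
        reject[order[: largest_rank + 1]] = True
    return reject

# Made-up p-values, purely for illustration.
p_values = [0.001, 0.008, 0.012, 0.03, 0.2, 0.5, 0.7, 0.9]
n_fwer = int(bonferroni_reject(p_values).sum())
n_fdr = int(benjamini_hochberg_reject(p_values).sum())
print(n_fwer, n_fdr)  # prints: 1 3
```

<p>Same data, same significance level, and the two subcultures would report a different number of discoveries.</p>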
Why is there so much university administration? We kind of asked for it.
2015-04-13T17:13:16+00:00
http://simplystats.github.io/2015/04/13/why-is-there-so-much-university-administration-we-kind-of-asked-for-it
<p>The latest commentary on the rising cost of college tuition is by Paul F. Campos and is titled <a href="http://www.nytimes.com/2015/04/05/opinion/sunday/the-real-reason-college-tuition-costs-so-much.html">The Real Reason College Tuition Costs So Much</a>. There has been much debate about this article and whether Campos is right or wrong…and I don’t plan to add to that. However, I wanted to pick up on a major point of the article that I felt got left hanging out there: The rising levels of administrative personnel at universities.</p>
<p>Campos argues that the reason college tuition is on the rise is not that colleges get less and less money from the government (mostly state government for state schools), but rather that there is an increasing number of administrators at universities that need to be paid in dollars and cents. He cites a study that shows that for the California State University system, in a 34-year period, the number of faculty rose by about 3% whereas the number of administrators rose by 221%.</p>
<p>My initial thinking when I saw the 221% number was “only that much?” I’ve been a faculty member at Johns Hopkins now for about 10 years, and just in that short period I’ve seen the amount of administrative work I need to do go up what feels like at least 221%. Partially, of course, that is a result of climbing up the ranks. As you get more qualified to do administrative work, you get asked to do it! But even adjusting for that, there are quite a few things that faculty need to do now that they weren’t required to do before. Frankly, I’m grateful for the few administrators that we do have around here to help me out with various things.</p>
<p>Campos seems to imply (but doesn’t come out and say) that the bulk of administrators are not necessary, and that if we were to cut these people from the payrolls, we could reduce tuition down to what it was in the old days. Or at least, it would be cheaper. This argument reminds me of debates over the federal budget: Everyone thinks the budget is too big, but no one wants to suggest something to cut.</p>
<p>My point here is that the reason there are so many administrators is that there’s actually quite a bit of administration to do. And the amount of administration that needs to be done has increased over the past 30 years.</p>
<p>Just for fun, I decided to go to the <a href="http://webapps.jhu.edu/jhuniverse/administration/">Johns Hopkins University Administration</a> web site to see who all these administrators were. This site shows the President’s Cabinet and the Deans of the individual schools, which isn’t everybody, but it represents a large chunk. I don’t know all of these people, but I have met and worked with a few of them.</p>
<p>For the moment I’m going to skip over individual people because, as much as you might think they are overpaid, no individual’s salary is large enough to move the needle on college tuition. So I’ll stick with people who actually represent large offices with staff. Here’s a sample.</p>
<ul>
<li><strong>University President</strong>. Call me crazy, but I think the university needs a President. In the U.S. the university President tends to focus on outward facing activities like raising money from various sources, liaising with the government(s), and pushing university initiatives around the world. This is not something I want to do (but I think it’s necessary), so I’d rather have the President take care of it for me.</li>
<li>
<p><strong>University Provost</strong>. At most universities in the U.S. the Provost is the “senior academic officer”, which means that he/she runs the university. This is a big job, especially at big universities, and requires coordinating across a variety of constituencies. Also, at JHU, the Provost’s office deals with a number of compliance related issues like Title IX, accreditation, Americans with Disabilities Act, and many others. I suppose we could save some money by violating federal law, but that seems short-sighted.
The people in this office do tough work involving a ton of paper. One example involves online education. Most states in the U.S. say that if you’re going to run an education program in their state, it needs to be approved by some regulatory body. Some states have essentially a reciprocal agreement, so if it’s okay in your state, then it’s okay in their state. But many states require an entire approval process for a program to run in that state. And by “a program” I mean something like an M.S. in Mathematics. If you want to run an M.S. in English that’s another approval, etc. So someone has to go to all the 50 states and D.C. and get approval for every online program that JHU runs in order to enroll students into that program from that state. I think Arkansas actually requires that someone come to Arkansas and testify in person about a program asking for approval.</p>
<p>I support online education programs, and I’m glad the Provost’s office is getting all those approvals for us.</p>
</li>
<li><strong>Corporate Security</strong>. This may be a difficult one for some people to understand, but bear in mind that much of Johns Hopkins is located in East Baltimore. If you’ve ever seen the TV show <a href="http://en.wikipedia.org/wiki/The_Wire">The Wire</a>, then you know why we need corporate security.</li>
<li><strong>Facilities and Real Estate</strong>. Johns Hopkins owns and deals with a lot of real estate; it’s a big organization. Who is supposed to take care of all that? For example, we just installed a brand new supercomputer jointly with the University of Maryland, called <a href="http://marcc.jhu.edu">MARCC</a>. I’m really excited to use this supercomputer for research, but systems like this require a bit of space. A lot of space actually. So we needed to get some land to put it on. If you’ve ever bought a house, you know how much paperwork is involved.</li>
<li><strong>Development and Alumni Relations</strong>. I have a new appreciation for this office now that I co-direct a <a href="https://www.coursera.org/specialization/jhudatascience/1">program</a> that has enrolled over 1.5 million people in just over a year. It’s critically important that we keep track of our students for many reasons: tracking student careers and success, tapping them to mentor current students, developing relationships with organizations that they’re connected to are just a few.</li>
<li><strong>General Counsel</strong>. I’m not the lawbreaking type, so I need lawyers to help me out.</li>
<li><strong>Enterprise Development</strong>. This office involves, among other things, technology transfer, which I have recently been involved with quite a bit for my role in the Data Science Specialization offered through Coursera. This is just to say that I personally benefit from this office. I’ve heard people say that universities shouldn’t be involved in tech transfer, but Bayh-Dole is what it is and I think Johns Hopkins should play by the same rules as everyone else. I’m not interested in filing patents, trademarks, and copyrights, so it’s good to have people doing that for me.</li>
</ul>
<p>Okay, that’s just a few offices, but you get the point. These administrators seem to be doing a real job (imagine that!) and actually helping out the university. Many of these people are actually helping <em>me</em> out. Some of these jobs are essentially required by the existence of federal laws, and so we need people like this.</p>
<p>So, just to recap, I think there are in fact more administrators in universities than there used to be. Is this causing an increase in tuition? It’s possible, but it’s probably not the only cause. If you believe the CSU study, the number of administrators increased by about 3.5% per year from 1975 to 2008. College tuition during that time period went up <a href="http://trends.collegeboard.org/college-pricing/figures-tables/average-rates-growth-published-charges-decade">around 4% per year</a> (inflation adjusted). But even so, much of this administration needs to be done (because faculty don’t want to do it), so this is a difficult path to go down if you’re looking for ways to lower tuition.</p>
<p>Even if we’ve found the smoking gun, the question is what do we do about it?</p>
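<p>As a quick check on the numbers quoted from the CSU study, here is a small sketch that converts a total percentage increase into a compound annual rate. It assumes steady compound growth over the 34-year period, which is my simplification and not something the study states; the function name is illustrative.</p>

```python
def annual_growth_rate(total_increase_pct, years):
    """Compound annual growth rate (in %) implied by a total % increase."""
    growth_factor = 1 + total_increase_pct / 100
    return (growth_factor ** (1 / years) - 1) * 100

# Figures quoted above: +221% administrators and +3% faculty over 34 years.
admin_rate = annual_growth_rate(221, 34)
faculty_rate = annual_growth_rate(3, 34)
print(f"Administrators: {admin_rate:.2f}% per year")  # roughly 3.5%
print(f"Faculty: {faculty_rate:.2f}% per year")       # well under 0.1%
```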
Genomics Case Studies Online Courses Start in Two Weeks (4/27)
2015-04-13T10:00:29+00:00
http://simplystats.github.io/2015/04/13/genomics-case-studies-online-courses-start-in-two-weeks-427
<p>The last month of the <a href="http://genomicsclass.github.io/book/pages/classes.html">HarvardX Data Analysis for Genomics series</a> starts on 4/27. We will cover case studies on RNAseq, Variant calling, ChipSeq and DNA methylation. Faculty includes Shirley Liu, Mike Love, Oliver Hofmann and the HSPH Bioinformatics Core. Although taking the previous courses in the series will help, the four case study courses were developed as stand-alone courses and you can obtain a certificate for each one without taking any other course.</p>
<p>Each course is presented over two weeks but will remain open until June 13 to give students an opportunity to take them all if they wish. For more information follow the links listed below.</p>
<ol>
<li><a href="https://www.edx.org/course/case-study-rna-seq-data-analysis-harvardx-ph525-5x">RNA-seq data analysis</a> will be led by Mike Love</li>
<li><a href="https://www.edx.org/course/case-study-variant-discovery-and-genotyping-harvardx-ph525-6x">Variant Discovery and Genotyping</a> will be taught by Shannan Ho Sui, Oliver Hofmann, Radhika Khetani and Meeta Mistry (from the HSPH Bioinformatics Core)</li>
<li><a href="https://www.edx.org/course/case-study-chip-seq-data-analysis-harvardx-ph525-7x">ChIP-seq data analysis</a> will be led by Shirley Liu</li>
<li><a href="https://www.edx.org/course/case-study-dna-methylation-data-analysis-harvardx-ph525-8x">DNA methylation data analysis</a> will be led by Rafael Irizarry</li>
</ol>
A blessing of dimensionality often observed in high-dimensional data sets
2015-04-09T15:19:13+00:00
http://simplystats.github.io/2015/04/09/a-blessing-of-dimensionality-often-observed-in-high-dimensional-data-sets
<p><a href="http://www.jstatsoft.org/v59/i10/paper">Tidy data sets</a> have one observation per row and one variable per column. Using this definition, big data sets can be either:</p>
<ol>
<li><strong>Wide</strong> - a wide data set has a large number of measurements per observation, but fewer observations. This type of data set is typical in neuroimaging, genomics, and other biomedical applications.</li>
<li><strong>Tall</strong> - a tall data set has a large number of observations, but fewer measurements. This is the typical setting in a large clinical trial or in a basic social network analysis.</li>
</ol>
<p>The <a href="http://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> tells us that estimating some quantities gets harder as the number of dimensions of a data set increases - as the data gets taller or wider. An example of this was <a href="http://simplystatistics.org/2014/10/24/an-interactive-visualization-to-teach-about-the-curse-of-dimensionality/">nicely illustrated</a> by my student Prasad (although it looks like his quota may be up on Rstudio).</p>
<p>For wide data sets there is also a blessing of dimensionality. The basic reason for the blessing of dimensionality is that:</p>
<blockquote>
<p>No matter how many new measurements you take on a small set of observations, the number of observations and all of their characteristics are fixed.</p>
</blockquote>
<p>As an example, suppose that we make measurements on 10 people. We start out by making one measurement (blood pressure), then another (height), then another (hair color) and we keep going and going until we have one million measurements on those same 10 people. The blessing occurs because the measurements on those 10 people will all be related to each other. If 5 of the people are women and 5 are men, then any measurement that has a relationship with sex will be highly correlated with any other measurement that has a relationship with sex. So by knowing one small bit of information, you can learn a lot about many of the different measurements.</p>
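<p>Here is a small simulation of that thought experiment, as a sketch under some assumptions of my own: 10 people, a binary grouping variable standing in for sex, and measurements that each depend on the group with a random effect size plus independent noise. The function name <code>group_recovery</code> is illustrative. The idea is that with enough measurements, the first principal component across people lines up with the grouping even though we never tell the algorithm who is in which group.</p>

```python
import numpy as np

def group_recovery(n_measurements, n_people=10, seed=0):
    """Absolute correlation between the first principal component and the group."""
    rng = np.random.default_rng(seed)
    # Assumed design: first half of the people in one group, second half in the other.
    group = np.array([0] * (n_people // 2) + [1] * (n_people - n_people // 2))

    # Each measurement gets its own (random) relationship with the group,
    # plus independent noise for every person.
    effects = rng.normal(0.0, 1.0, size=n_measurements)
    noise = rng.normal(0.0, 1.0, size=(n_people, n_measurements))
    data = np.outer(group, effects) + noise  # people x measurements

    # First principal component across people, via the SVD of the centered data.
    centered = data - data.mean(axis=0)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    return float(abs(np.corrcoef(u[:, 0], group)[0, 1]))

for n in (10, 100, 10_000):
    print(n, round(group_recovery(n), 3))
```

<p>The people are fixed while the measurements pile up, so the shared structure becomes easier to see as the data set gets wider.</p>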
<p>This blessing of dimensionality is the key idea behind many of the statistical approaches to wide data sets whether it is stated explicitly or not. I thought I’d make a very short list of some of these ideas:</p>
<p><strong>1. Idea: </strong><a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3841439/">De-convolving mixed observations from high-dimensional data. </a></p>
<p><strong>How the blessing plays a role: </strong>The measurements for each observation are assumed to be a mixture of values measured from different observation types. The proportion of each observation type is assumed to be fixed across measurements, so you can take advantage of the multiple measurements to estimate the mixing percentage and perform the deconvolution. (<a href="http://odin.mdacc.tmc.edu/~wwang7/">Wenyi Wang</a> came and gave an excellent seminar on this idea at JHU a couple of days ago, which inspired this post).</p>
<p><strong>2. Idea:</strong> <a href="http://biostatistics.oxfordjournals.org/content/5/2/155.short">The two groups model for false discovery rates</a>.</p>
<p><strong>How the blessing plays a role: </strong> The models assume that a hypothesis test is performed for each observation and that the probability any observation is drawn from the null, the null distribution, and the alternative distributions are common across observations. If the null is assumed known, then it is possible to use the known null distribution to estimate the common probability that an observation is drawn from the null.</p>
<p> </p>
<p><strong>3. Idea: </strong><a href="http://www.degruyter.com/view/j/sagmb.2004.3.issue-1/sagmb.2004.3.1.1027/sagmb.2004.3.1.1027.xml">Empirical Bayes variance shrinkage for linear models</a></p>
<p><strong>How the blessing plays a role: </strong> A linear model is fit for each observation and the means and variances of the log ratios calculated from the model are assumed to follow a common distribution across observations. The method estimates the hyper-parameters of these common distributions and uses them to adjust any individual measurement’s estimates.</p>
<p> </p>
<p><strong>4. Idea: </strong><a href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161">Surrogate variable analysis</a></p>
<p><strong>How the blessing plays a role: </strong> Each observation is assumed to be influenced by a single variable of interest (a primary variable) and multiple unmeasured confounders. Since the observations are fixed, the values of the unmeasured confounders are the same for each measurement and a supervised PCA can be used to estimate surrogates for the confounders. (<a href="http://www.slideshare.net/jtleek/jhu-feb2009">see my JHU job talk for more on the blessing</a>)</p>
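<p>As a rough illustration of the second idea, the sketch below simulates z-statistics from a two groups model and estimates the shared null probability from the p-values. The estimator is a simple Storey-style one (the fraction of p-values above a cutoff, rescaled), which I am swapping in for illustration; it is not necessarily the method of the linked paper. The null proportion of 0.8, the effect size of 3, and the cutoff of 0.5 are all assumptions.</p>

```python
import math
import numpy as np

def simulate_p_values(m=10_000, pi0=0.8, effect=3.0, seed=0):
    """Two-sided p-values from a two groups model with a known N(0, 1) null."""
    rng = np.random.default_rng(seed)
    is_null = rng.random(m) < pi0
    z = rng.normal(0.0, 1.0, size=m) + np.where(is_null, 0.0, effect)
    # Two-sided p-value for a standard normal null: erfc(|z| / sqrt(2)).
    return np.array([math.erfc(abs(value) / math.sqrt(2.0)) for value in z])

def estimate_pi0(p_values, cutoff=0.5):
    """Storey-style estimate: p-values above the cutoff are mostly nulls."""
    p_values = np.asarray(p_values, dtype=float)
    return float(np.mean(p_values > cutoff) / (1.0 - cutoff))

p_values = simulate_p_values()
pi0_hat = estimate_pi0(p_values)
print(f"Estimated null probability: {pi0_hat:.3f} (simulated with 0.8)")
```

<p>No single test tells you anything about the null probability; it only becomes estimable because thousands of tests share it.</p>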
<p> </p>
<p>The blessing of dimensionality I’m describing here is related to the idea that <a href="http://andrewgelman.com/2004/10/27/the_blessing_of/">Andrew Gelman refers to in this 2004 post.</a> Basically, since an increasingly large number of measurements are made on the same observations there is an inherent structure to those observations. If you take advantage of that structure, then as the dimensionality of your problem increases you actually get <strong>better estimates</strong> of the structure in your high-dimensional data - a nice blessing!</p>
How to Get Ahead in Academia
2015-04-09T13:38:01+00:00
http://simplystats.github.io/2015/04/09/how-to-get-ahead-in-academia
<p>This video on how to make it in academia was produced over 10 years ago by Steven Goodman for the ENAR Junior Researchers Workshop. Now the whole world can benefit from its wisdom.</p>
<p>The movie features current and former JHU Biostatistics faculty, including Francesca Dominici, Giovanni Parmigiani, Scott Zeger, and Tom Louis. You don’t want to miss Scott Zeger’s secret formula for getting promoted!</p>
Why You Need to Study Statistics
2015-04-02T21:42:06+00:00
http://simplystats.github.io/2015/04/02/why-you-need-to-study-statistics
<p>The American Statistical Association is continuing its campaign to get you to study statistics, if you haven’t already. I have to agree with them that being a statistician is a pretty good job. Their latest video highlights a wide range of statisticians working in industry, government, and academia. You can check it out here:</p>
Teaser trailer for the Genomic Data Science Specialization on Coursera
2015-03-26T10:06:43+00:00
http://simplystats.github.io/2015/03/26/teaser-trailer-for-the-genomic-data-science-specialization-on-coursera
<p> </p>
<p>We have been hard at work in the studio putting together our next specialization to launch on Coursera. It will be called the “Genomic Data Science Specialization” and includes a spectacular line up of instructors: <a href="http://salzberg-lab.org/">Steven Salzberg</a>, <a href="http://ccb.jhu.edu/people/mpertea/">Ela Pertea</a>, <a href="http://jamestaylor.org/">James Taylor</a>, <a href="http://ccb.jhu.edu/people/florea/">Liliana Florea</a>, <a href="http://www.hansenlab.org/">Kasper Hansen</a>, and me. The specialization will cover command line tools, statistics, Galaxy, Bioconductor, and Python. There will be a capstone course at the end of the sequence featuring an in-depth genomic analysis. If you are a grad student, postdoc, or principal investigator in a group that does genomics this specialization is for you. If you are a person looking to transition into one of the hottest areas of research with the new precision medicine initiative this is for you. Get pumped and share the teaser-trailer with your friends!</p>
Introduction to Bioconductor HarvardX MOOC starts this Monday March 30
2015-03-24T09:24:27+00:00
http://simplystats.github.io/2015/03/24/introduction-to-bioconductor-harvardx-mooc-starts-this-monday-march-30
<p>Bioconductor is one of the most widely used open source toolkits for biological high-throughput data. In this four week course, co-taught with Vince Carey and Mike Love, we will introduce you to Bioconductor’s general infrastructure and then focus on two specific technologies: next generation sequencing and microarrays. The lectures and assessments will be annotated in case you want to focus only on one of these two technologies. If you plan to be a bioinformatician, though, we recommend you learn both.</p>
<p>Topics covered include:</p>
<ul>
<li>A short introduction to molecular biology and measurement technology</li>
<li>An overview of how to leverage the platform and genome annotation packages and experimental archives</li>
<li>GenomicRanges: the infrastructure for storing, manipulating and analyzing next generation sequencing data</li>
<li>Parallel computing and cloud concepts</li>
<li>Normalization, preprocessing and bias correction</li>
<li>Statistical inference in practice: including hierarchical models and gene set enrichment analysis</li>
<li>Building statistical analysis pipelines of genome-scale assays including the creation of reproducible reports</li>
</ul>
<p>Throughout the class we will be using data examples from both next generation sequencing and microarray experiments.</p>
<p>We will assume <a href="https://www.edx.org/course/statistics-r-life-sciences-harvardx-ph525-1x">basic knowledge of Statistics and R</a>.</p>
<p>For more information visit the <a href="https://www.edx.org/course/introduction-bioconductor-harvardx-ph525-4x">course website</a>.</p>
A surprisingly tricky issue when using genomic signatures for personalized medicine
2015-03-19T13:06:32+00:00
http://simplystats.github.io/2015/03/19/a-surprisingly-tricky-issue-when-using-genomic-signatures-for-personalized-medicine
<p>My student Prasad Patil has a really nice paper that <a href="http://bioinformatics.oxfordjournals.org/content/early/2015/03/18/bioinformatics.btv157.full.pdf?keytype=ref&ijkey=loVpUJfJxG2QjoE">just came out in Bioinformatics</a> (<a href="http://biorxiv.org/content/early/2014/06/06/005983">preprint</a> in case paywalled). The paper is about a surprisingly tricky normalization issue with genomic signatures. Genomic signatures are basically statistical/machine learning functions applied to the measurements for a set of genes to predict how long patients will survive, or how they will respond to therapy. The issue is that usually when building and applying these signatures, people normalize across samples in the training and testing set.</p>
<p>An example of this normalization is to mean-center the measurements for each gene in the testing/application stage, then apply the prediction rule. The problem is that if you use a different set of samples when calculating the mean you can get a totally different prediction function. The basic problem is illustrated in this graphic.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-12.58.03-PM.png"><img class="aligncenter wp-image-3947 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-12.58.03-PM-300x227.png" alt="Screen Shot 2015-03-19 at 12.58.03 PM" width="300" height="227" srcset="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-12.58.03-PM-300x227.png 300w, http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-12.58.03-PM-260x197.png 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
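<p>To make the issue concrete, here is a minimal sketch in Python. The thresholding rule and the expression values are hypothetical stand-ins, not the signature or data from the paper; the point is only that the same patient, with the same measurement, gets a different call depending on which samples are used to compute the mean.</p>

```python
from statistics import mean

def predict_high_risk(value, batch):
    """Mean-center one gene's expression against its batch, then threshold at 0.

    The rule (positive centered value => high risk) is a toy stand-in for a
    genomic signature; it is not the prediction rule from the paper.
    """
    return (value - mean(batch)) > 0

# Hypothetical expression values for a single gene.
patient = 5.0
mostly_low_batch = [2.0, 3.0, 4.0, patient]    # batch mean = 3.5
mostly_high_batch = [6.0, 7.0, 8.0, patient]   # batch mean = 6.5

# Same patient, same measurement, two different normalization batches.
call_a = predict_high_risk(patient, mostly_low_batch)
call_b = predict_high_risk(patient, mostly_high_batch)
print(call_a, call_b)
```

<p>The two calls disagree even though nothing about the patient changed, which is the behavior described above in miniature.</p>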
<p> </p>
<p>This seems like a pretty esoteric statistical issue, but it turns out that this one simple normalization problem can dramatically change the results of the predictions. In particular, we show that the predictions for the same patient, with the exact same data, can change dramatically if you just change the subpopulations of patients within the testing set. In this plot, Prasad made predictions for the exact same set of patients two times when the patient population varied in ER status composition. As many as 30% of the predictions were different for the same patient with the same data if you just varied who they were being predicted with.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-1.02.25-PM.png"><img class="aligncenter wp-image-3948 size-full" src="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-1.02.25-PM.png" alt="Screen Shot 2015-03-19 at 1.02.25 PM" width="494" height="277" srcset="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-1.02.25-PM-300x168.png 300w, http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-1.02.25-PM.png 494w" sizes="(max-width: 494px) 100vw, 494px" /></a></p>
<p> </p>
<p>This paper highlights how tricky statistical issues can slow down the process of translating ostensibly really useful genomic signatures into clinical practice and lends even more weight to the idea that precision medicine is a statistical field.</p>
A simple (and fair) way all statistics journals could drive up their impact factor.
2015-03-18T16:32:10+00:00
http://simplystats.github.io/2015/03/18/a-simple-and-fair-way-all-statistics-journals-could-drive-up-their-impact-factor
<p>Hypothesis:</p>
<blockquote>
<p>If every method in every stats journal was implemented in a corresponding R package (<a href="http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/">easy</a>), was required to have a companion document that was a tutorial on how to use the software (<a href="http://www.bioconductor.org/help/package-vignettes/">easy</a>), included a reference to how to cite the paper if you used the software (<a href="http://www.inside-r.org/r-doc/utils/citation">easy</a>) and the paper/tutorial was posted to the relevant message boards for the communities of interest (<a href="http://seqanswers.com/forums/showthread.php?t=42018">easy</a>) that journal would see a dramatic bump in its impact factor.</p>
</blockquote>
Data science done well looks easy - and that is a big problem for data scientists
2015-03-17T10:47:12+00:00
http://simplystats.github.io/2015/03/17/data-science-done-well-looks-easy-and-that-is-a-big-problem-for-data-scientists
<p>Data science has a ton of different definitions. For the purposes of this post I’m going to use the definition of data science we used when creating our Data Science program online:</p>
<blockquote>
<p>Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.</p>
</blockquote>
<p>In general the data science process is iterative and the different components blend together a little bit. But for simplicity let’s discretize the tasks into the following 7 steps:</p>
<ol>
<li>Define the question of interest</li>
<li>Get the data</li>
<li>Clean the data</li>
<li>Explore the data</li>
<li>Fit statistical models</li>
<li>Communicate the results</li>
<li>Make your analysis reproducible</li>
</ol>
<p>A good data science project answers a real scientific or business analytics question. In almost all of these projects the vast majority of the analyst’s time is spent on getting and cleaning the data (steps 2-3) and communication and reproducibility (6-7). In most cases, if the data scientist has done her job right, the statistical models don’t need to be incredibly complicated to identify the important relationships the project is trying to find. In fact, if a complicated statistical model seems necessary, it often means that you don’t have the right data to answer the question you really want to answer. One option is to spend a huge amount of time trying to tune a statistical model to try to answer the question, but serious data scientists usually go back and get the right data instead.</p>
<p>The result of this process is that most well executed and successful data science projects don’t (a) use super complicated tools or (b) fit super complicated statistical models. The characteristics of the most successful data science projects I’ve evaluated or been a part of are: (a) a laser focus on solving the scientific problem, (b) careful and thoughtful consideration of whether the data is the right data and whether there are any lurking confounders or biases and (c) relatively simple statistical models applied and interpreted skeptically.</p>
<p>It turns out doing those three things is actually surprisingly hard and very, very time consuming. It is my experience that data science projects take a solid 2-3 times as long to complete as a project in theoretical statistics. The reason is that inevitably the data are a mess and you have to clean them up, then you find out the data aren’t quite what you wanted to answer the question, so you go find a new data set and clean it up, etc. After a ton of work like that, you have a nice set of data to which you fit simple statistical models and then it looks <strong>super easy </strong>to someone who either doesn’t know about the data collection and cleaning process or doesn’t care.</p>
<p>This poses a major public relations problem for serious data scientists. When you show someone a good data science project they almost invariably think “oh that is easy” or “that is just a trivial statistical/machine learning model” and don’t see all of the work that goes into solving the real problems in data science. A concrete example of this is in academic statistics. It is customary for people to show theorems in their talks and maybe even some of the proof. This gives people working on theoretical projects an opportunity to “show their stuff” and demonstrate how good they are. The equivalent for a data scientist would be showing how they found and cleaned multiple data sets, merged them together, checked for biases, and arrived at a simplified data set. Showing the “proof” would be equivalent to showing how they matched IDs. These things often don’t look nearly as impressive in talks, particularly if the audience doesn’t have experience with how incredibly delicate real data analysis is. I imagine versions of this problem play out in industry as well (candidate X did a good analysis but it wasn’t anything special, candidate Y used Hadoop to do BIG DATA!).</p>
<p>The really tricky twist is that bad data science looks easy too. You can scrape a data set off the web and slap a machine learning algorithm on it no problem. So how do you judge whether a data science project is really “hard” and whether the data scientist is an expert? Just like with anything, there is no easy shortcut to evaluating data science projects. You have to ask questions about the details of how the data were collected, what kind of biases might exist, why they picked one data set over another, etc. In the meantime, don’t be fooled by what looks like simple data science - <a href="http://fivethirtyeight.com/interactives/senate-forecast/">it can often be pretty effective</a>.</p>
<p> </p>
<p><em>Editor’s note: If you like this post, you might like my pay-what-you-want book Elements of Data Analytic Style: <a href="https://leanpub.com/datastyle">https://leanpub.com/datastyle</a></em></p>
<p> </p>
π day special: How to use Bioconductor to find empirical evidence in support of π being a normal number
2015-03-14T10:15:10+00:00
http://simplystats.github.io/2015/03/14/%cf%80-day-special-how-to-use-bioconductor-to-find-empirical-evidence-in-support-of-%cf%80-being-a-normal-number
<p><em>Editor’s note: Today 3/14/15 at some point between 9:26:53 and 9:26:54 it was the most π day of them all. Below is a repost from last year. </em></p>
<p>Happy π day everybody!</p>
<p>I wanted to write some simple code (included below) to test the parallelization capabilities of my new cluster. So, in honor of π day, I decided to check for <a href="http://www.davidhbailey.com/dhbpapers/normality.pdf">evidence that π is a normal number</a>. A <a href="http://en.wikipedia.org/wiki/Normal_number">normal number</a> is a real number whose infinite sequence of digits has the property that the probability of picking any given m-digit pattern at random is 10<sup>−m</sup>. For example, using the Poisson approximation, we can predict that the pattern “123456789” should show up between 0 and 3 times in the <a href="http://stuff.mit.edu/afs/sipb/contrib/pi/">first billion digits of π</a> (it actually shows up twice, starting at the 523,551,502-th and 773,349,079-th decimal places).</p>
<p>To test our hypothesis, let Y<sub>1</sub>, …, Y<sub>100</sub> be the number of “00”, “01”, …, “99” in the first billion digits of π. If π is in fact normal then the Ys should be approximately IID binomials with N = 1 billion and p = 0.01. In the qq-plot below I show Z-scores (Y − 10,000,000) / √9,900,000, which appear to follow a normal distribution as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3, 4, 5, 6, and 7 digit patterns (for 5, 6, and 7 I sampled 10,000 patterns). Note that we can perform a chi-square test for the uniform distribution as well. For patterns of size 1, 2, and 3 the p-values were 0.84, <del>0.89,</del> 0.92, and 0.99.</p>
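<p>The Poisson prediction can be checked with a few lines of arithmetic. This is a rough sketch in Python that assumes each of the roughly one billion starting positions matches a fixed 9-digit pattern with probability 10<sup>−9</sup>, so the expected count is about 1; the variable names are just for illustration.</p>

```python
import math

n_digits = 10**9          # number of digits searched
m = 9                     # length of the pattern "123456789"
lam = n_digits / 10**m    # expected number of matches, about 1 (edge effects ignored)

def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson(lam) distribution."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Probability that the pattern shows up 0, 1, 2, or 3 times.
prob_0_to_3 = sum(poisson_pmf(k, lam) for k in range(4))
print(round(prob_0_to_3, 3))
```

<p>Under these assumptions roughly 98% of the probability sits on counts between 0 and 3, which is why that range is the natural prediction.</p>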
<p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/pi-3/" rel="attachment wp-att-2792"><img class="alignnone size-full wp-image-2792" src="http://simplystatistics.org/wp-content/uploads/2014/03/pi2.png" alt="pi" width="4800" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/pi2-300x187.png 300w, http://simplystatistics.org/wp-content/uploads/2014/03/pi2-1024x640.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/pi2.png 4800w" sizes="(max-width: 4800px) 100vw, 4800px" /></a></p>
<p>Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.</p>
<p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/pi2/" rel="attachment wp-att-2793"><img class="alignnone size-full wp-image-2793" src="http://simplystatistics.org/wp-content/uploads/2014/03/pi21.png" alt="pi2" width="5600" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/pi21-1024x548.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/pi21.png 5600w" sizes="(max-width: 5600px) 100vw, 5600px" /></a></p>
<p>These observed counts should also be independent, and to explore this we can look at autocorrelation plots:</p>
<p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/piacf-2/" rel="attachment wp-att-2794"><img class="alignnone size-full wp-image-2794" src="http://simplystatistics.org/wp-content/uploads/2014/03/piacf1.png" alt="piacf" width="5600" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/piacf1-1024x548.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/piacf1.png 5600w" sizes="(max-width: 5600px) 100vw, 5600px" /></a></p>
<p>To do this in about an hour and with just a few lines of code (included below), I used the <a href="http://www.bioconductor.org/">Bioconductor</a> <a href="http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html">Biostrings</a> package to match strings and the <em>foreach</em> function to parallelize.</p>
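<p>As a small-scale illustration of the Z-score calculation, here is a sketch in Python rather than the R, Biostrings, and <em>foreach</em> setup described above. It is not the code used for the plots: it computes only the first 10,000 decimals of π (with Machin’s formula, an assumption made so the example is self-contained) and counts overlapping two-digit patterns.</p>

```python
from collections import Counter
from math import sqrt

def pi_decimals(n):
    """Return the first n decimal digits of pi as a string (Machin's formula)."""
    guard = 10                       # extra digits to absorb truncation error
    unity = 10 ** (n + guard)

    def arctan_inv(x):
        # arctan(1/x) * unity via the alternating Taylor series, in integers
        total = term = unity // x
        x2, k, sign = x * x, 3, -1
        while term:
            term //= x2
            total += sign * (term // k)
            k += 2
            sign = -sign
        return total

    # pi = 16 * arctan(1/5) - 4 * arctan(1/239)
    pi_scaled = 4 * (4 * arctan_inv(5) - arctan_inv(239))
    return str(pi_scaled // 10 ** guard)[1:]   # drop the leading "3"

digits = pi_decimals(10_000)
windows = len(digits) - 1                      # overlapping 2-digit windows
counts = Counter(digits[i:i + 2] for i in range(windows))

# Binomial mean and standard deviation for each pattern, with p = 0.01.
p = 0.01
expected = windows * p
sd = sqrt(windows * p * (1 - p))
z = {f"{j:02d}": (counts[f"{j:02d}"] - expected) / sd for j in range(100)}

print(max(abs(v) for v in z.values()))
```

<p>With a billion digits the same calculation gives the mean of 10,000,000 and variance of 9,900,000 used for the qq-plot; the string matching is the part that benefits from parallelization.</p>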
<p>NB: A normal number has the above stated property in any base. The examples above are for base 10.</p>
De-weaponizing reproducibility
2015-03-13T10:24:05+00:00
http://simplystats.github.io/2015/03/13/de-weaponizing-reproducibility
<div>
A couple of weeks ago Roger and I went to a <a href="http://sites.nationalacademies.org/DEPS/BMSA/DEPS_153236">conference on statistical reproducibility</a> held at the National Academy of Sciences. The discussion was pretty wide ranging and I love that the thinking about reproducibility is coming back to statistics. There was pretty widespread support for the idea that prevention is the <a href="http://arxiv.org/abs/1502.03169">right way to approach reproducibility</a>.
</div>
<div>
</div>
<div>
It turns out I was the last speaker of the whole conference. This is an unenviable position to be in with so many bright folks speaking first as they covered a huge amount of what I wanted to say. <a href="http://www.slideshare.net/jtleek/evidence-based-data-analysis">My talk focused on three key points:</a>
</div>
<div>
</div>
<ol>
<li>The tools for reproducibility already exist, the barrier isn’t tools</li>
<li>We need to de-weaponize reproducibility</li>
<li>Prevention is the right approach to reproducibility</li>
</ol>
<p> </p>
<p>In terms of the first point, <a href="http://simplystatistics.org/2014/09/04/why-the-three-biggest-positive-contributions-to-reproducible-research-are-the-ipython-notebook-knitr-and-galaxy/">tools like iPython, knitr, and Galaxy</a> can be used to make all but the absolutely largest analyses reproducible right now. Our group does this all the time with our papers and so do many others. The problem isn’t a lack of tools.</p>
<p>Speaking to point two, I think many people would agree that part of the issue is culture change. One issue that is increasingly concerning to me is the “weaponization” of reproducibility. I have been noticing that some of us (like me, my students, other folks at JHU, and lots of particularly junior computational people elsewhere) are trying really hard to be reproducible. Most of the time this results in really positive reactions from the community. But when a co-author of mine and I wrote that paper about the <a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt007.abstract">science-wise false discovery rate</a>, one of the discussants used our code (great), improved on it (great), identified a bug (great), and then did his level best to humiliate us both in front of the editor and the general public because of that bug (<a href="http://simplystatistics.org/2013/09/26/how-could-code-review-discourage-code-disclosure-reviewers-with-motivation/">not so great</a>).</p>
<div>
</div>
<div>
I have seen this happen several times. Most of the time if a paper is reproducible the authors get a pat on the back and their code is either ignored, or used in a positive way. But for high-profile and important problems, people largely use reproducibility to:
</div>
<div>
</div>
<ol>
<li> Impose regulatory hurdles in the short term while people transition to reproducibility. One clear example of this is the <a href="https://www.congress.gov/bill/113th-congress/house-bill/4012">Secret Science Reform Act</a> which is a bill that imposes strict reproducibility conditions on all science before it can be used as evidence for regulation.</li>
<li>Humiliate people who aren’t good coders or who make mistakes in their code. This is what happened in my paper when I produced reproducible code for my analysis, but has also happened <a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/">to other people</a>.</li>
<li>Take advantage of people’s code to plagiarize/straight up steal work. I have stories about this I’d rather not put on the internet</li>
</ol>
<p> </p>
<p>Of the three, I feel like (1) and (2) are the most common. Plagiarism and scooping by theft I think are actually relatively rare based on my own anecdotal experience. But I think that the “weaponization” of reproducibility to block regulation or to humiliate folks who are new to computational sciences is more common than I’d like it to be. Until reproducibility is the standard for everyone - which I think is possible now and will happen as the culture changes - the people who are the early adopters are at risk of being bludgeoned with their own reproducibility. As a community, if we want widespread reproducibility adoption we have to be ferocious about not allowing this to happen.</p>
The elements of data analytic style - so much for a soft launch
2015-03-03T11:22:28+00:00
http://simplystats.github.io/2015/03/03/the-elements-of-data-analytic-style-so-much-for-a-soft-launch
<p><em>Editor’s note: I wrote a book called Elements of Data Analytic Style. Buy it on <a href="https://leanpub.com/datastyle">Leanpub</a> or <a href="http://www.amazon.com/Elements-Data-Analytic-Style-ebook/dp/B00U6D80YY/ref=sr_1_1?ie=UTF8&qid=1425397222&sr=8-1&keywords=elements+of+data+analytic+style">Amazon</a>! If you buy it on Leanpub, you get all updates (there are likely to be some) for free and you can pay what you want (including zero) but the author would be appreciative if you’d throw a little scratch his way. </em></p>
<p>So uh, I was going to soft launch my new book The Elements of Data Analytic Style yesterday. I figured I’d just quietly email my Coursera courses to let them know I created a new reference. It turns out that that wasn’t very quiet. First this happened:</p>
<blockquote class="twitter-tweet" width="550">
<p>
<a href="https://twitter.com/jtleek">@jtleek</a> <a href="https://twitter.com/albertocairo">@albertocairo</a> <a href="https://twitter.com/simplystats">@simplystats</a> Instabuy. And apparently not just for me: it looks like you just Slashdotted <a href="https://twitter.com/leanpub">@leanpub</a>'s website.
</p>
<p>
— Andrew Janke (@AndrewJanke) <a href="https://twitter.com/AndrewJanke/status/572474567467401216">March 2, 2015</a>
</p>
</blockquote>
<p> </p>
<p>and sure enough the website was down:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM.png"><img class="aligncenter wp-image-3919 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM-300x202.png" alt="Screen Shot 2015-03-02 at 2.14.05 PM" width="300" height="202" srcset="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM-300x202.png 300w, http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM-1024x690.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM-260x175.png 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p> </p>
<p>then overnight it did something like 6,000+ units:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera.png"><img class="aligncenter wp-image-3920 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera-300x300.png" alt="whoacoursera" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p> </p>
<p> </p>
<p>So lesson learned, there is no soft open with Coursera. Here is the post I was going to write though:</p>
<p> </p>
<p><strong>Post I was gonna write</strong></p>
<p>I have been doing data analysis for something like 10 years now (gulp!) and teaching data analysis in person for 6+ years. One of the things we do in <a href="https://github.com/jtleek/jhsph753and4">my data analysis class at Hopkins</a> is to perform a complete data analysis (from raw data to written report) every couple of weeks. Then I grade each assignment for everything from data cleaning to the written report and reproducibility. I’ve noticed over the course of teaching this class (and classes online) that there are many common elements of data analytic style that I don’t often see in textbooks, or when I do, I see them spread across multiple books.</p>
<p>I’ve written about some of these issues in open source guides I’ve posted to GitHub and elsewhere, like:</p>
<ul>
<li><a href="http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/" target="_self">10 things statistics taught us about big data analysis</a></li>
<li><a href="https://github.com/jtleek/rpackages" target="_self">The Leek Group Guide to R packages</a></li>
<li><a href="https://github.com/jtleek/datasharing" target="_self">How to share data with a statistician</a></li>
</ul>
<p>But I decided that it might be useful to have a more complete guide to the “art” part of data analysis. One goal is to summarize in a succinct way the most common difficulties encountered by practicing data analysts. It may be a useful guide for peer reviewers who could refer to section numbers when evaluating manuscripts, for instructors who have to grade data analyses, as a supplementary text for a data analysis class, or just as a useful reference. It is modeled loosely in format and aim on the <a href="http://www.bartleby.com/141/">Elements of Style</a> by William Strunk. Just as with the EoS, both the checklist and my book cover a small fraction of the field of data analysis, but my experience is that once these elements are mastered, data analysts benefit most from hands on experience in their own discipline of application, and that many principles may be non-transferable beyond the basics. But just as with writing, new analysts would do better to follow the rules until they know them well enough to violate them.</p>
<ul>
<li><a href="https://leanpub.com/datastyle/">Buy EDAS on Leanpub</a></li>
<li><a href="http://www.amazon.com/Elements-Data-Analytic-Style-ebook/dp/B00U6D80YY/ref=sr_1_1?ie=UTF8&qid=1425397222&sr=8-1&keywords=elements+of+data+analytic+style">Buy EDAS on Amazon</a></li>
</ul>
<p>The book includes a basic checklist that may be useful as a guide for beginning data analysts or as a rubric for evaluating data analyses. I’m reproducing it here so you can comment on/hate/enjoy it.</p>
<p> </p>
<p><em><strong>The data analysis checklist</strong></em></p>
<p>This checklist provides a condensed look at the information in this book. It can be used as a guide during the process of a data analysis, as a rubric for grading data analysis projects, or as a way to evaluate the quality of a reported data analysis.</p>
<p><strong>I Answering the question</strong></p>
<ol>
<li>
<p>Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?</p>
</li>
<li>
<p>Did you define the metric for success before beginning?</p>
</li>
<li>
<p>Did you understand the context for the question and the scientific or business application?</p>
</li>
<li>
<p>Did you record the experimental design?</p>
</li>
<li>
<p>Did you consider whether the question could be answered with the available data?</p>
</li>
</ol>
<p><strong>II Checking the data</strong></p>
<ol>
<li>
<p>Did you plot univariate and multivariate summaries of the data?</p>
</li>
<li>
<p>Did you check for outliers?</p>
</li>
<li>
<p>Did you identify the missing data code?</p>
</li>
</ol>
<p><strong>III Tidying the data</strong></p>
<ol>
<li>
<p>Is each variable one column?</p>
</li>
<li>
<p>Is each observation one row?</p>
</li>
<li>
<p>Does each different type of data appear in its own table?</p>
</li>
<li>
<p>Did you record the recipe for moving from raw to tidy data?</p>
</li>
<li>
<p>Did you create a code book?</p>
</li>
<li>
<p>Did you record all parameters, units, and functions applied to the data?</p>
</li>
</ol>
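<p>As a small illustration of the first two tidying questions, here is a sketch in Python with a made-up table (the patient and blood pressure values are hypothetical). The “wide” layout hides the visit variable in the column names; the tidy layout gives each variable a column and each observation a row, and the code itself doubles as the recorded recipe.</p>

```python
# Hypothetical "wide" table: one row per patient, one column per visit.
wide = [
    {"patient": "A", "visit1_bp": 120, "visit2_bp": 118},
    {"patient": "B", "visit1_bp": 135, "visit2_bp": 130},
]

# Tidy form: each variable (patient, visit, bp) is a column,
# each observation (one patient at one visit) is a row.
tidy = [
    {"patient": row["patient"], "visit": visit, "bp": row[f"visit{visit}_bp"]}
    for row in wide
    for visit in (1, 2)
]

for row in tidy:
    print(row)
```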
<p><strong>IV Exploratory analysis</strong></p>
<ol>
<li>
<p>Did you identify missing values?</p>
</li>
<li>
<p>Did you make univariate plots (histograms, density plots, boxplots)?</p>
</li>
<li>
<p>Did you consider correlations between variables (scatterplots)?</p>
</li>
<li>
<p>Did you check the units of all data points to make sure they are in the right range?</p>
</li>
<li>
<p>Did you try to identify any errors or miscoding of variables?</p>
</li>
<li>
<p>Did you consider plotting on a log scale?</p>
</li>
<li>
<p>Would a scatterplot be more informative?</p>
</li>
</ol>
<p><strong>V Inference</strong></p>
<ol>
<li>
<p>Did you identify what large population you are trying to describe?</p>
</li>
<li>
<p>Did you clearly identify the quantities of interest in your model?</p>
</li>
<li>
<p>Did you consider potential confounders?</p>
</li>
<li>
<p>Did you identify and model potential sources of correlation such as measurements over time or space?</p>
</li>
<li>
<p>Did you calculate a measure of uncertainty for each estimate on the scientific scale?</p>
</li>
</ol>
<p><strong>VI Prediction</strong></p>
<ol>
<li>
<p>Did you identify in advance your error measure?</p>
</li>
<li>
<p>Did you immediately split your data into training and validation?</p>
</li>
<li>
<p>Did you use cross validation, resampling, or bootstrapping only on the training data?</p>
</li>
<li>
<p>Did you create features using only the training data?</p>
</li>
<li>
<p>Did you estimate parameters only on the training data?</p>
</li>
<li>
<p>Did you fix all features, parameters, and models before applying to the validation data?</p>
</li>
<li>
<p>Did you apply only one final model to the validation data and report the error rate?</p>
</li>
</ol>
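<p>The workflow in this list can be sketched as follows: choose the error measure up front, split once, estimate everything on the training rows, and touch the validation rows a single time. This is a minimal Python illustration with simulated data; the function names, the 30% validation fraction, and the choice of root mean squared error are all assumptions made for the example.</p>

```python
import math
import random


def split_train_validation(rows, validation_fraction=0.3, seed=42):
    """Shuffle a copy of the rows once and split off a validation set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def fit_line(train):
    """Least-squares slope and intercept, estimated from training rows only."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model, rows):
    """Root mean squared error: the error measure chosen in advance."""
    slope, intercept = model
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in rows]
    return math.sqrt(sum(squared) / len(rows))


# Simulated data (hypothetical): y = 2x + 1 plus small Gaussian noise.
noise = random.Random(0)
rows = [(float(x), 2.0 * x + 1.0 + noise.gauss(0, 0.1)) for x in range(50)]

train, validation = split_train_validation(rows)
model = fit_line(train)                    # parameters fixed using training data
validation_rmse = rmse(model, validation)  # the one and only validation pass
print(model, validation_rmse)
```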
<p><strong>VII Causality</strong></p>
<ol>
<li>
<p>Did you identify whether your study was randomized?</p>
</li>
<li>
<p>Did you identify potential reasons that causality may not be appropriate such as confounders, missing data, non-ignorable dropout, or unblinded experiments?</p>
</li>
<li>
<p>If not, did you avoid using language that would imply cause and effect?</p>
</li>
</ol>
<p><strong>VIII Written analyses</strong></p>
<ol>
<li>
<p>Did you describe the question of interest?</p>
</li>
<li>
<p>Did you describe the data set, experimental design, and question you are answering?</p>
</li>
<li>
<p>Did you specify the type of data analytic question you are answering?</p>
</li>
<li>
<p>Did you specify in clear notation the exact model you are fitting?</p>
</li>
<li>
<p>Did you explain on the scale of interest what each estimate and measure of uncertainty means?</p>
</li>
<li>
<p>Did you report a measure of uncertainty for each estimate on the scientific scale?</p>
</li>
</ol>
<p><strong>IX Figures</strong></p>
<ol>
<li>
<p>Does each figure communicate an important piece of information or address a question of interest?</p>
</li>
<li>
<p>Do all your figures include plain language axis labels?</p>
</li>
<li>
<p>Is the font size large enough to read?</p>
</li>
<li>
<p>Does every figure have a detailed caption that explains all axes, legends, and trends in the figure?</p>
</li>
</ol>
<p><strong>X Presentations</strong></p>
<ol>
<li>
<p>Did you lead with a brief, understandable to everyone statement of your problem?</p>
</li>
<li>
<p>Did you explain the data, measurement technology, and experimental design before you explained your model?</p>
</li>
<li>
<p>Did you explain the features you will use to model data before you explain the model?</p>
</li>
<li>
<p>Did you make sure all legends and axes were legible from the back of the room?</p>
</li>
</ol>
<p><strong>XI Reproducibility</strong></p>
<ol>
<li>
<p>Did you avoid doing calculations manually?</p>
</li>
<li>
<p>Did you create a script that reproduces all your analyses?</p>
</li>
<li>
<p>Did you save the raw and processed versions of your data?</p>
</li>
<li>
<p>Did you record all versions of the software you used to process the data?</p>
</li>
<li>
<p>Did you try to have someone else run your analysis code to confirm they got the same answers?</p>
</li>
</ol>
<p><strong>XII R packages</strong></p>
<ol>
<li>
<p>Did you make your package name “Googleable”?</p>
</li>
<li>
<p>Did you write unit tests for your functions?</p>
</li>
<li>
<p>Did you write help files for all functions?</p>
</li>
<li>
<p>Did you write a vignette?</p>
</li>
<li>
<p>Did you try to reduce dependencies to actively maintained packages?</p>
</li>
<li>
<p>Have you eliminated all errors and warnings from R CMD check?</p>
</li>
</ol>
<p> </p>
Advanced Statistics for the Life Sciences MOOC Launches Today
2015-03-02T09:37:39+00:00
http://simplystats.github.io/2015/03/02/advanced-statistics-for-the-life-sciences-mooc-launches-today
<p>In <a href="https://www.edx.org/course/advanced-statistics-life-sciences-harvardx-ph525-3x#.VPRzYSnffwc">Advanced Statistics for the Life Sciences</a> we will teach statistical techniques that are commonly used in the analysis of high-throughput data and their corresponding R implementations. In Week 1 we will explain inference in the context of high-throughput data and introduce the concept of error controlling procedures. We will describe the strengths and weaknesses of the Bonferroni correction, FDR, and q-values. We will show how to implement these in cases in which thousands of tests are conducted, as is typically done with genomics data. In Week 2 we will introduce the concept of mathematical distance and how it is used in exploratory data analysis, clustering, and machine learning. We will describe how techniques such as principal component analysis (PCA) and the singular value decomposition (SVD) can be used for dimension reduction in high dimensional data. During Week 3 we will describe confounding, latent variables, and factor analysis in the context of high dimensional data and how this relates to batch effects. We will show how to implement methods such as SVA to perform inference on data affected by batch effects. Finally, during Week 4 we will show how statistical modeling, and empirical Bayes modeling in particular, are powerful techniques that greatly improve precision in high-throughput data. We will be using R code to explain concepts throughout the course. We will also be using exploratory data analysis and data visualization to motivate the techniques we teach during each week.</p>
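<p>To make the Week 1 topic concrete, here is a small sketch of the two error-controlling procedures named above: the Bonferroni correction and the Benjamini-Hochberg step-up procedure for FDR control. The course itself uses R; this Python version, its function names, and the ten p-values are illustrative assumptions rather than course material.</p>

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject when p <= alpha / m; controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; controls the FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonferroni = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print(sum(bonferroni), sum(bh))
```

<p>On these made-up values Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two, which illustrates the usual trade-off: controlling the FDR is less conservative than controlling the family-wise error rate.</p>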
Navigating Big Data Careers with a Statistics PhD
2015-02-18T10:12:29+00:00
http://simplystats.github.io/2015/02/18/navigating-big-data-careers-with-a-statistics-phd
<div>
<em>Editor's note: This is a guest post by <a href="http://www.drsherrirose.com/" target="_blank">Sherri Rose</a>. She is an Assistant Professor of Biostatistics in the Department of Health Care Policy at Harvard Medical School. Her work focuses on nonparametric estimation, causal inference, and machine learning in health settings. Dr. Rose received her BS in statistics from The George Washington University and her PhD in biostatistics from the University of California, Berkeley, where she coauthored a book on <a href="http://drsherrirose.com/targeted-learning-book/" target="_blank">Targeted Learning</a>. She tweets <a href="https://twitter.com/sherrirose" target="_blank">@sherrirose</a>.</em>
</div>
<div>
</div>
<div>
A quick scan of the science and technology headlines often yields two words: big data. The amount of information we collect has continued to increase, and this data can be found in varied sectors, ranging from social media to genomics. Claims are made that big data will solve an array of problems, from understanding devastating diseases to predicting political outcomes. There is substantial “big data” hype in the press, as well as business and academic communities, but how do upcoming, current, and recent statistical science PhDs handle the array of training opportunities and career paths in this new era? <a href="http://www.amstat.org/newsroom/pressreleases/2015-StatsFastestGrowingSTEMDegree.pdf" target="_blank">Undergraduate interest in statistics degrees is exploding</a>, bringing new talent to graduate programs and the post-PhD job pipeline. Statistics training is diversifying, with students focusing on theory, methods, computation, and applications, or a blending of these areas. A few years ago, Rafa outlined the academic career options for statistics PhDs in <a href="http://simplystatistics.org/2011/09/12/advice-for-stats-students-on-the-academic-job-market/" target="_blank">two</a> <a href="http://simplystatistics.org/2011/09/15/another-academic-job-market-option-liberal-arts/" target="_blank">posts</a>, which cover great background material I do not repeat here. The landscape for statistics PhD careers is also changing quickly, with a variety of companies attracting top statistics students in new roles. As a <a href="http://www.drsherrirose.com/" target="_blank">new faculty member</a> at the intersection of machine learning, causal inference, and health care policy, I've already found myself frequently giving career advice to trainees. The choices have become much more nuanced than just academia vs. industry vs. government.
</div>
<div>
</div>
<div>
</div>
<div>
So, you find yourself inspired by big data problems and fascinated by statistics. While you are a student, figuring out what you enjoy working on is crucial. This exploration could involve engaging in internship opportunities or collaborating with multiple faculty on different types of projects. Both positive and negative experiences can help you identify your preferences.
</div>
<div>
</div>
<div>
</div>
<div>
Undergraduates may wish to spend a couple months at a <a href="http://www.nhlbi.nih.gov/research/training/summer-institute-biostatistics-t15" target="_blank">Summer Institute for Training in Biostatistics</a> or <a href="http://www.nsf.gov/crssprgm/reu/" target="_blank">National Science Foundation Research Experience for Undergraduates</a>. There are <a href="https://www.udacity.com/course/st101" target="_blank">also</a> <a href="https://www.coursera.org/course/casebasedbiostat" target="_blank">many</a> <a href="https://www.coursera.org/specialization/jhudatascience/1" target="_blank">MOOC</a> <a href="https://www.edx.org/course/statistics-r-life-sciences-harvardx-ph525-1x#.VJOhXsAAPe" target="_blank">options</a> <a href="https://www.coursera.org/course/maththink" target="_blank">to</a> <a href="https://www.udacity.com/course/ud120" target="_blank">get</a> <a href="https://www.udacity.com/course/ud359" target="_blank">a</a> <a href="https://www.udacity.com/course/ud651" target="_blank">taste</a> <a href="https://www.edx.org/course/foundations-data-analysis-utaustinx-ut-7-01x#.VNpQRd4bakA" target="_blank">of</a> <a href="https://www.edx.org/course/introduction-linear-models-matrix-harvardx-ph525-2x#.VNpQS94bakA" target="_blank">different</a> <a href="https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x#.VNpQU94bakA" target="_blank">areas</a> <a href="https://www.edx.org/course/introduction-computational-thinking-data-mitx-6-00-2x-0#.VNpQWd4bakA" target="_blank">of</a> <a href="https://www.edx.org/course/fundamentals-clinical-trials-harvardx-hsph-hms214x#.VNpQt94bakA" target="_blank">statistics</a>. Selecting a graduate program for PhD study can be a difficult choice, especially when your interests within statistics have yet to be identified, as is often the case for undergraduates.
However, if you know that you have interests in software and programming, it can be easy to sort which statistical science PhD programs have a curricular or research focus in this area by looking at department websites. Similarly, if you know you want to work in epidemiologic methods, genomics, or imaging, specific programs are going to jump right to the top as good fits. Getting advice from faculty in your department will be important. Competition for admissions into statistics and biostatistics PhD programs has continued to increase, and most faculty advise applying to as many relevant programs as is reasonable given the demands on your time and finances. If you end up sitting on multiple (funded) offers come April, talking to current students, student alums, and looking at alumni placement can be helpful. Don't hesitate to contact these people, selectively. Most PhD programs genuinely do want you to end up in the place that is best for you, even if it is not with them.
</div>
<div>
</div>
<div>
</div>
<div>
Once you're in a PhD program, internship opportunities for graduate students are listed each year by the <a href="http://www.amstat.org/education/internships.cfm" target="_blank">American Statistical Association</a>. Your home department may also have ties with local research organizations and companies with openings. Internships can help you identify future positions and the types of environments where you will flourish in your career. <a href="https://www.linkedin.com/pub/lauren-kunz/a/aab/293" target="_blank">Lauren Kunz</a>, a recent PhD graduate in biostatistics from Harvard University, is currently a Statistician at the National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health. Dr. Kunz said, "As a previous summer intern at the NHLBI, I was able to get a feel for the day to day life of a biostatistician at the NHLBI. I found the NHLBI Office of Biostatistical Research to be a collegial, welcoming environment, and I soon learned that NHLBI biostatisticians have the opportunity to work on a variety of projects, very often collaborating with scientists and clinicians. Due to the nature of these collaborations, the biostatisticians are frequently presented with scientifically interesting and important statistical problems. This work often motivates methodological research which in turn has immediate, practical applications. These factors matched well with my interest in collaborative research that is both methodological and applied."
</div>
<div>
</div>
<div>
</div>
<div>
<span style="font-family: Helvetica;">Industry is also enticing to statistics PhDs, particularly those with an applied or computational focus, like <a href="http://www.stephaniesapp.com/" target="_blank">Stephanie Sapp</a> and</span> <a href="http://alyssafrazee.com/" target="_blank">Alyssa Frazee</a><span style="font-family: Helvetica;">. Dr. Sapp has a PhD in statistics from the University of California, Berkeley, and is currently a Quantitative Analyst at <a href="http://www.google.com/" target="_blank">Google</a>. She also completed an internship there the summer before she graduated. In commenting about her choice to join Google, Dr. Sapp said, "</span>I really enjoy both academic research and seeing my work used in practice. Working at Google allows me to continue pursuing new and interesting research topics, as well as see my results drive more immediate impact." <span style="font-family: Helvetica;">Dr. Frazee just finished her PhD in biostatistics at Johns Hopkins University and previously spent a summer exploring her interests in <a href="https://www.hackerschool.com/" target="_blank">Hacker School</a>. While she applied to both academic and industry positions, receiving multiple offers, she ultimately chose to go into industry and work for <a href="https://stripe.com/" target="_blank">Stripe</a>: "</span>I accepted a tech company's offer for many reasons, one of them being that I really like programming and writing code. There are tons of opportunities to grow as a programmer/engineer at a tech company, but building an academic career on that foundation would be more of a challenge. I'm also excited about seeing my statistical work have more immediate impact. At smaller companies, much of the work done there has visible/tangible bearing on the product. 
Academic research in statistics is operating a lot closer to the boundaries of what we know and discovering a lot of cool stuff, which means researchers get to try out original ideas more often, but the impact is less immediately tangible. A new method or estimator has to go through a lengthy peer review/publication process and be integrated into the community's body of knowledge, which could take several years, before its impact can be fully observed." One of Dr. Frazee, Dr. Sapp, and Dr. Kunz's considerations in choosing a job reflects many of those in the early career statistics community: having an impact.
</div>
<div>
</div>
<div>
</div>
<div>
<span style="font-family: Helvetica;">Interest in both developing methods </span><i>and</i> <span style="font-family: Helvetica;">translating statistical advances into practice is a common theme in the big data statistics world, but not one that always leads to an industry or government career. There are also academic opportunities in statistics, biostatistics, and interdisciplinary departments like my own where your work can have an impact on current science. The <a href="http://www.hcp.med.harvard.edu/" target="_blank">Department of Health Care Policy</a> (HCP) at Harvard Medical School has 5 tenure-track/tenured statistics faculty members, including myself, among a total of about 20 core faculty members. The statistics faculty work on a range of theoretical and methodological problems while collaborating with HCP faculty (health economists, clinician <wbr />researchers, and sociologists) and leading our own substantive projects in health care policy (e.g., <a href="http://www.massdac.org/" target="_blank">Mass-DAC</a>). I find it to be a unique and exciting combination of roles, and love that the science truly informs my statistical research, giving it broader impact. Since joining the department a year and a half ago, I've worked in many new areas, such as plan payment risk adjustment methodology. I have also applied some of my previous work in machine learning to predicting adverse health outcomes in large datasets. Here, I immediately saw a need for new avenues of statistical research to make the optimal approach based on statistical theory align with an optimal approach in practice. 
My current research portfolio is diverse; example projects include the development of a double robust estimator for the study of chronic disease, leading an evaluation of a new state-wide health plan initiative, and collaborating with department colleagues on statistical issues in all-payer claims databases, physician prescribing intensification behavior, and predicting readmissions. The <a href="http://statistics.fas.harvard.edu/" target="_blank">larger</a> <a href="http://www.hsph.harvard.edu/biostatistics/" target="_blank">statistics</a> <a href="http://www.iq.harvard.edu/" target="_blank">community</a> <a href="http://bcb.dfci.harvard.edu/" target="_blank">at</a> Harvard also affords many opportunities to interact with statistics faculty across the campus, and <a href="http://www.faculty.harvard.edu/" target="_blank">university-wide junior faculty events</a> have connected me with professors in computer science and engineering. I feel an immense sense of research freedom to pursue my interests at HCP, which was a top priority when I was comparing job offers.</span>
</div>
<div>
</div>
<div>
</div>
<div>
<a href="http://had.co.nz/" target="_blank">Hadley Wickham</a>, of <a href="http://www.amazon.com/dp/0387981403/" target="_blank">ggplot2</a> and <a href="http://www.amazon.com/dp/1466586966/" target="_blank">Advanced R</a> fame, took on a new role as Chief Scientist at <a href="http://www.rstudio.com/" target="_blank">RStudio</a> in 2013. Freedom was also a key component in his choice to move sectors: "For me, the driving motivation is freedom: I know what I want to work on, I just need the freedom (and support) to work on it. It's pretty unusual to find an industry job that has more freedom than academia, but I've been noticeably more productive at RStudio because I don't have any meetings, and I can spend large chunks of time devoted to thinking about hard problems. It's not possible for everyone to get that sort of job, but everyone should be thinking about how they can negotiate the freedom to do what makes them happy. I really like the thesis of Cal Newport's book <a href="http://www.amazon.com/dp/1455509124/" target="_blank"><i>So </i></a><a href="http://www.amazon.com/dp/1455509124/" target="_blank"><i>Good They Can't Ignore You</i></a> - the better you are at your job, the greater your ability to negotiate for what you want."
</div>
<div>
</div>
<div>
</div>
<div>
There continues to be a strong emphasis in the work force on the vaguely defined field of “data science,” which incorporates the collection, storage, analysis, and interpretation of big data. Statisticians not only work in and lead teams with other scientists (e.g., clinicians, biologists, computer scientists) to attack big data challenges, but with each other. Your time as a statistics trainee is an amazing opportunity to explore your strengths and preferences, and which sectors and jobs appeal to you. Do your due diligence to figure out which employers are interested in and supportive of the type of career you want to create for yourself. Think about how you want to spend your time, and remember that you're the only person who has to live your life once you get that job. Other people's opinions are great, but your values and instincts matter too. Your definition of "best" doesn't have to match someone else's. Ask questions! Try new things! The potential for breakthroughs with novel flexible methods is strong. Statistical science training has progressed to the point where trainees are armed with thorough knowledge in design, methodology, theory, and, increasingly, data collection, applications, and computation. Statisticians working in data science are poised to continue making important contributions in all sectors for years to come. Now, you just need to decide where you fit.
</div>
Introduction to Linear Models and Matrix Algebra MOOC starts this Monday Feb 16
2015-02-13T09:00:11+00:00
http://simplystats.github.io/2015/02/13/introduction-to-linear-models-and-matrix-algebra-mooc-starts-this-monday-feb-16
<p>Matrix algebra is the language of modern data analysis. We use it to develop and describe statistical and machine learning methods, and to code efficiently in languages such as R, MATLAB, and Python. Concepts such as principal component analysis (PCA) are best described with matrix algebra. It is particularly useful for describing linear models.</p>
<p>Linear models are everywhere in data analysis. ANOVA, linear regression, limma, edgeR, DESeq, most smoothing techniques, and batch correction methods such as SVA and ComBat are based on linear models. In this two-week MOOC we will describe the basics of matrix algebra, demonstrate how linear models are used in the life sciences, and show how to implement these efficiently in R.</p>
<p>Update: Here is <a href="https://www.edx.org/course/introduction-linear-models-matrix-harvardx-ph525-2x">the link</a> to the class</p>
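<p>As a small illustration of how matrix algebra describes a linear model, the least-squares estimate solves the normal equations (X'X)b = X'y, where X is the design matrix and y the outcome vector. The sketch below works this out in plain Python on made-up data. The course itself uses R, and the helper functions here (<code>transpose</code>, <code>matmul</code>, <code>solve</code>, <code>least_squares</code>) are hypothetical names written for this example.</p>

```python
def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a x = b (b a column vector) by Gaussian elimination with pivoting."""
    n = len(a)
    aug = [row[:] + [b[i][0]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def least_squares(X, y):
    """Solve the normal equations (X'X) b = X'y for the coefficients b."""
    Xt = transpose(X)
    return solve(matmul(Xt, X), matmul(Xt, [[v] for v in y]))


# Made-up data lying exactly on y = 1 + 2x; the first column of X is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta = least_squares(X, y)
print(beta)
```

<p>In practice a numerical library would use a QR decomposition rather than forming X'X explicitly, since that is more stable; the normal equations are shown here because they map directly onto the matrix notation.</p>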
Is Reproducibility as Effective as Disclosure? Let's Hope Not.
2015-02-12T10:21:35+00:00
http://simplystats.github.io/2015/02/12/is-reproducibility-as-effective-as-disclosure-lets-hope-not
<p>Jeff and I just this week published a <a href="http://www.pnas.org/content/112/6/1645.full">commentary</a> in the <em>Proceedings of the National Academy of Sciences</em> on our latest thinking on reproducible research and its ability to solve the reproducibility/replication “crisis” in science (there’s a version on <a href="http://arxiv.org/abs/1502.03169">arXiv</a> too). In a nutshell, we believe reproducibility (making data and code available so that others can recompute your results) is an essential part of science, but it is not going to end the crisis of confidence in science. In fact, I don’t think it’ll even make a dent. The problem is that reproducibility, as a tool for preventing poor research, comes in at the wrong stage of the research process (the end). While requiring reproducibility may deter people from committing outright fraud (a small group), it won’t stop people who just don’t know what they’re doing with respect to data analysis (a much larger group).</p>
<p>In an eerie coincidence, Jesse Eisinger of the investigative journalism non-profit ProPublica has just published a piece on the New York Times Dealbook site discussing how <a href="http://dealbook.nytimes.com/2015/02/11/an-excess-of-sunlight-a-paucity-of-rules/">requiring disclosure rules in the financial industry has produced meager results</a>. He writes</p>
<blockquote>
<p class="story-body-text">
Over the last century, disclosure and transparency have become our regulatory crutch, the answer to every vexing problem. We require corporations and government to release reams of information on food, medicine, household products, consumer financial tools, campaign finance and crime statistics. We have a booming “report card” industry for a range of services, including hospitals, public schools and restaurants.
</p>
</blockquote>
<p class="story-body-text">
The rationale for all this disclosure is that
</p>
<blockquote>
<p class="story-body-text">
someone, somewhere reads the fine print in these contracts and keeps corporations honest. It turns out what we laymen intuit is true: <a href="http://www.law.nyu.edu/news/ideas/Marotta-Wurgler-standard-form-contracts-fine-print">No one reads them</a>, according to research by a New York University law professor, Florencia Marotta-Wurgler.
</p>
</blockquote>
<p class="story-body-text">
But disclosure is nevertheless popular because how could you be against it?
</p>
<blockquote>
<p class="story-body-text">
The disclosure bonanza is easy to explain. Nobody is against it. It’s politically expedient. Companies prefer such rules, especially in lieu of actual regulations that would curtail bad products or behavior. The opacity lobby — the <a href="http://en.wikipedia.org/wiki/Remora">remora fish</a> class of lawyers, lobbyists and consultants in New York and Washington — knows that disclosure requirements are no bar to dodgy practices. You just have to explain what you’re doing in sufficiently incomprehensible language, a task that earns those lawyers a hefty fee.
</p>
</blockquote>
<p class="story-body-text">
In the now infamous <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">Duke Saga</a>, Keith Baggerly was able to reproduce the work of Potti et al. after roughly 2,000 hours of work because the data were publicly available (although the code was not). It's not clear how much time would have been saved if the code had been available, but it seems reasonable to assume that it would have taken some amount of time to <em>understand</em> the analysis, if not reproduce it. Once the errors in Potti's work were discovered, it took 5 years for the original Nature Medicine paper to be retracted.
</p>
<p class="story-body-text">
Although you could argue that the process worked in some sense, it came at tremendous cost of time and money. Wouldn't it have been better if the analysis had been done right in the first place?
</p>
The trouble with evaluating anything
2015-02-09T19:24:22+00:00
http://simplystats.github.io/2015/02/09/the-trouble-with-evaluating-anything
<p>It is very hard to evaluate people’s productivity or work in any meaningful way. This problem is the source of:</p>
<ol>
<li><a href="http://simplystatistics.org/2013/09/26/how-could-code-review-discourage-code-disclosure-reviewers-with-motivation/">Consternation about peer review</a></li>
<li><a href="http://simplystatistics.org/2014/02/21/heres-why-the-scientific-publishing-system-can-never-be-fixed/">The reason why post publication peer review doesn’t work</a></li>
<li><a href="http://simplystatistics.org/2012/05/24/how-do-we-evaluate-statisticians-working-in-genomics/">Consternation about faculty evaluation</a></li>
<li>Major problems at companies like <a href="http://www.bloomberg.com/bw/articles/2013-11-12/yahoos-latest-hr-disaster-ranking-workers-on-a-curve">Yahoo</a> and <a href="http://www.bloomberg.com/bw/articles/2013-11-13/microsoft-kills-its-hated-stack-rankings-dot-does-anyone-do-employee-reviews-right">Microsoft</a>.</li>
</ol>
<p>Roger and I were just talking about this problem in the context of evaluating the impact of software as a faculty member and Roger suggested the problem is that:</p>
<blockquote>
<p>Evaluating people requires real work and so people are always looking for shortcuts</p>
</blockquote>
<p>To evaluate a person’s work or their productivity requires three things:</p>
<ol>
<li>To be an expert in what they do</li>
<li>To have absolutely no reason to care whether they succeed or not</li>
<li>To have time available to evaluate them</li>
</ol>
<p>These three fundamental things are at the heart of why it is so hard to get good evaluations of people and why peer review and other systems are under such fire. The main source of the problem is the conflict between 1 and 2. The group of people in any organization or on any scale that is truly world class at any given topic from software engineering to history is small. It has to be by definition. This group of people inevitably has some reason to care about the success of the other people in that same group. Either they work with the other world class people and want them to succeed or they either intentionally or unintentionally are competing with them.</p>
<p>The conflict between being an expert and having no stake wouldn’t be such a problem if it weren’t for issue number 3: the time to evaluate people. To truly get good evaluations what you need is for someone who <em>isn’t an expert in a field and so has no stake</em> to take the time to become an expert and then evaluate the person/software. But this requires a huge amount of effort on the part of a reviewer who has to become expert in a new field. Given that reviewing is often considered the least important task in people’s workflow, as evidenced by the little value we place on peer reviewing for journals, or the little credit people get for carefully evaluating colleagues for promotion in companies, it is no wonder people don’t take the time to become experts.</p>
<p>I actually think that tenure review committees at forward thinking places may be the best at this (<a href="http://simplystatistics.org/2012/12/20/the-nih-peer-review-system-is-still-the-best-at-identifying-innovative-biomedical-investigators/">Rafa said the same thing about NIH study section</a>). They at least attempt to get outside reviews from people who are unbiased about the work that a faculty member is doing before they are promoted. This system, of course, has large and well-documented problems, but I think it is better than having a person’s direct supervisor - who clearly has a stake - being the only person evaluating them. It is also better than only using the quantifiable metrics like number of papers and impact factor of the corresponding journals. I also think that most senior faculty who evaluate people take the job very seriously despite the only incentive being good citizenship.</p>
<p>Since real evaluation requires hard work and expertise, most of the time people are looking for a short cut. These short cuts typically take the form of quantifiable metrics. In the academic world these shortcuts are things like:</p>
<ol>
<li>Number of papers</li>
<li>Citations to academic papers</li>
<li>The impact factor of a journal</li>
<li>Downloads to a person’s software</li>
</ol>
<p>I think all of these things are associated with quality but none define quality. You could try to model the relationship, but it is very hard to come up with a universal definition for the outcome you are trying to model. In academics, some people have suggested that <a href="http://www.michaeleisen.org/blog/?p=694">open review or post-publication review</a> solves the problem. But this is only true for a very small subset of cases that violate rule number 2. The only papers that get serious post-publication review are where people have an incentive for the paper to go one way or the other. This means that papers in Science will be post-pub reviewed much, much more often than equally important papers in discipline specific journals - just because people care more about Science. This will leave the vast majority of papers unreviewed - as evidenced by the relatively modest number of papers reviewed by <a href="https://pubpeer.com/">PubPeer</a> or <a href="http://www.ncbi.nlm.nih.gov/pubmedcommons/">PubMed Commons</a>.</p>
<p>I’m beginning to think that the only way to do evaluation well is to hire people whose <em>only job is to evaluate something well</em>. In other words, peer reviewers who are paid to review papers full time and are only measured by how often those papers are retracted or proved false. Or tenure reviewers who are paid exclusively to evaluate tenure cases and are measured by how well the post-tenure process goes for the people they evaluate and whether there is any measurable bias in their reviews.</p>
<p>The trouble with evaluating anything is that it is hard work and right now we aren’t paying anyone to do it.</p>
<p> </p>
Johns Hopkins Data Science Specialization Top Performers
2015-02-05T10:40:14+00:00
http://simplystats.github.io/2015/02/05/johns-hopkins-data-science-specialization-top-performers
<p><em>Editor’s note: The Johns Hopkins Data Science Specialization is the largest data science program in the world. <a href="http://www.bcaffo.com/">Brian</a>, <a href="http://www.biostat.jhsph.edu/~rpeng/">Roger</a>, and <a href="http://jtleek.com/">I</a> conceived the program at the beginning of January 2014, then built, recorded, and launched the classes starting in April 2014 with the help of <a href="https://twitter.com/iragooding">Ira</a>. Since April 2014 we have enrolled 1.76 million students and awarded 71,589 Signature Track verified certificates. The first capstone class ran in October - just 7 months after the first classes launched and 4 months after all classes were running. Despite this incredibly short time frame, 917 students finished all 9 classes and enrolled in the Capstone Course. 478 successfully completed the course.</em></p>
<p>When we first announced the Data Science Specialization, we said that the top performers would be profiled here on Simply Statistics. Well, that time has come, and we’ve got a very impressive group of participants that we want to highlight. These folks have successfully completed all nine MOOCs in the specialization and earned top marks in our first capstone session with <a href="http://swiftkey.com/en/">SwiftKey</a>. We had the pleasure of meeting some of them last week in a video conference, and we were struck by their insights and expertise. Check them out below.</p>
<h2 id="sasa-bogdanovic"><strong>Sasa Bogdanovic</strong></h2>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/Sasa-Bogdanovic.jpg"><img class="size-thumbnail wp-image-3874 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/Sasa-Bogdanovic-120x90.jpg" alt="Sasa-Bogdanovic" width="120" height="90" /></a></p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>Sasa Bogdanovic is passionate about everything data. For the last 6 years, he’s been working in the iGaming industry, providing data products (integrations, data warehouse architectures and models, business intelligence tools, analyst reports and visualizations) for clients, helping them make better, data-driven, business decisions.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>Although I’ve been working with data for many years, I wanted to take a different perspective and learn more about data science concepts and get insights into the whole pipeline from acquiring data to developing final data products. I also wanted to learn more about statistical models and machine learning.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>I am very happy to have discovered the data science field. It is a whole new world that I find fascinating and inspiring to explore. I am looking forward to my new career in data science. This will allow me to combine all my previous knowledge and experience with my new insights and methods. I am very proud of every single quiz, assignment and project. For sure, the capstone project was a culmination, and I am very proud and happy to have succeeded in making a solid data product and to be one of the top performers in the group. For this I am very grateful to the instructors, community TAs, all other peers for their contributions in the forums, and Coursera for putting it all together and making it possible.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>I have already put the certificate in motion. My company is preparing new projects, and I expect the certificate to add weight to our proposals.</p>
<h2 id="alejandro-morales-gallardo">Alejandro Morales Gallardo</h2>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/Alejandro.png"><img class="size-thumbnail wp-image-3875 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/Alejandro-120x90.png" alt="Alejandro" width="120" height="90" /></a></p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>I’m a trained physicist with strong coding skills. I have a passion for dissecting datasets to find the hidden stories in data and produce insights through creative visualizations. A hackathon and open-data aficionado, I have an interest in using data (and science) to improve our lives.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-1"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>I wanted to close a gap in my skills and transition into becoming a full-blown Data Scientist by learning key concepts and practices in the field. Learning R, an industry-relevant language, while creating a portfolio to showcase my abilities in the entire data science pipeline seemed very attractive.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-1"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>I’m most proud of the Predictive Text App I developed. With the Capstone Project, it was extremely rewarding to be able to tackle a brand new data type and learn about text mining and natural language processing while building a fun and attractive data product. I was particularly proud that the accuracy of my app was not that far off from the SwiftKey smartphone app. I’m also proud of being a top performer!</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-1"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>I want to apply my new set of skills to develop other products, analyze new datasets and keep growing my portfolio. It is also helpful to have Verified Certificates to show prospective employers.</p>
<h2 id="nitin-gupta">Nitin Gupta</h2>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/NitinGupta.jpg"><img class="size-thumbnail wp-image-3876 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/NitinGupta-120x90.jpg" alt="NitinGupta" width="120" height="90" /></a></p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>Nitin is an independent trader and quant strategist with over 13 years of multi-faceted experience in the investment management industry. In the past he worked for a leading investment management firm where he built automated trading and risk management systems and gained complete life-cycle expertise in creating systematic investment products. He has a background in computer science with a strong interest in machine learning and its applications in quantitative modeling.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-2"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>I was fortunate to have done the first Machine Learning course taught by Prof. Andrew Ng at the launch of Coursera in 2012, which really piqued my interest in the topic. The next course I did on Coursera was Prof. Roger Peng’s Computing For Data Analysis which introduced me to R. I realized that R was ideally suited for the quantitative modeling work I was doing. When I learned about the range of topics that the JHU DSS would cover - from the best practices in tidying and transforming data to modeling, analysis and visualization - I did not hesitate to sign up. Learning how to do all of this in an ecosystem built around R has been a huge plus.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-2"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>I am quite pleased with the web apps I built which utilize the concepts learned during the track. One of my apps visualizes and compares historical stock performance with other stocks and market benchmarks after querying the data directly from web resources. Another one showcases a predictive typing engine that dynamically predicts the next few words to use and append, as the user types a sentence. The process of building these apps provided a fantastic learning experience. Also, for the first time I built something that even my near and dear ones could use and appreciate, which is terrific.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-2"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>The broad skill set developed through this specialization could be applied across multiple domains. My current focus is on building robust quantitative models for systematic trading strategies that could learn and adapt to changing market environments. This would involve the application of machine learning techniques among other skills learned during the specialization. Using R and Shiny to interactively analyze the results would be tremendously useful.</p>
<h2 id="marc-kreyer">Marc Kreyer</h2>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/Marc-Kreyer.jpeg"><img class="size-thumbnail wp-image-3877 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/Marc-Kreyer-120x90.jpeg" alt="Marc Kreyer" width="120" height="90" /></a></p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>Marc Kreyer is an expert business analyst and software engineer with extensive experience in financial services in Austria and Liechtenstein. He successfully finishes complex projects by not only using broad IT knowledge but also outstanding comprehension of business needs. Marc loves combining his programming and database skills with his affinity for mathematics to transform data into insight.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-3"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>There are many data science MOOCs, but usually they are independent 4-6 week courses. The JHU Data Science Specialization was the first offering of a series of courses that build upon each other.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-3"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>Creating a working text prediction app without any prior NLP knowledge and only minimal assistance from instructors.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-3"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>Knowledge and experience are the most valuable things gained from the Data Science Specialization. As they can’t be easily shown to future employers, the certificate can be a good indicator for them. Unfortunately there is neither an issue date nor a verification link on the certificate, so it will be interesting to see how valuable it really will be.</p>
<h2 id="hsing-liu">Hsing Liu</h2>
<p> </p>
<p style="text-align: left;">
<a href="http://simplystatistics.org/wp-content/uploads/2015/02/Paul_HsingLiu.jpeg"><img class="size-thumbnail wp-image-3878" src="http://simplystatistics.org/wp-content/uploads/2015/02/Paul_HsingLiu-120x90.jpeg" alt="Paul_HsingLiu" width="120" height="90" /></a>
</p>
<p>I studied in the U.S. for a number of years, and received my M.S. in mathematics from NYU before returning to my home country, Taiwan. I’m most interested in how people think and learn, and education in general. This year I’m starting a new career as an iOS app engineer.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-4"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>In my brief past job as an instructional designer, I read a lot about the new wave of online education, and was especially intrigued by how Khan Academy’s data science division is using data to help students learn. It occurred to me that to leverage my math background and make a bigger impact in education (or otherwise), data science could be an exciting direction to take.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-4"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>It may sound boring, but I’m proud of having done my best for each course in the track, going beyond the bare requirements when I’m able. The parts of the Specialization fit into a coherent picture of the discipline, and I’m glad to have put in the effort to connect the dots and gained a new perspective.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-4"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>I’m listing the certificate on my resume and LinkedIn, and I expect to be applying what I’ve learned once my company’s e-commerce app launches.</p>
<h2 id="yichen-liu">Yichen Liu</h2>
<p> </p>
<p>Yichen Liu is a business analyst at Toyota Western Australia where he is responsible for business intelligence development, data analytics and business improvement. His prior experience includes working as a sessional lecturer and tutor at Curtin University in finance and econometrics units.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-5"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>Recognising the trend that the world is more data-driven than before, I felt it was necessary to gain further understanding in data analysis to tackle both current and future challenges at work.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-5"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>What I am most proud of in the program is that I have gained some basic knowledge in a totally new area, natural language processing. Though its connection with my current working area is limited, I see the future of data analysis becoming more driven by unstructured data and am willing to develop more knowledge in this area.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-5"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>I see the certificate as a stepping stone into the data science world, and would like to conduct more advanced studies in data science especially for unstructured data analysis.</p>
<h2 id="johann-posch">Johann Posch</h2>
<p style="text-align: left;">
<a href="http://simplystatistics.org/wp-content/uploads/2015/02/PictureJohannPosch.png"><img class="size-thumbnail wp-image-3879" src="http://simplystatistics.org/wp-content/uploads/2015/02/PictureJohannPosch-120x90.png" alt="PictureJohannPosch" width="120" height="90" /></a>
</p>
<p>After graduating from Vienna University of Technology with a specialization in Artificial Intelligence I joined Microsoft. There I worked as a developer on various products, but the majority of the time as a Windows OS developer. After venturing into start-ups for a few years I joined GE Research to work on the Predix Big Data Platform, and recently I joined the Industrial Data Science team.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-6"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>Ever since I wrote my masters thesis in Neural Networks I have been intrigued with machine learning. I see data science as a field where great advances will happen over the next decade and as an opportunity to positively impact millions of lives. I like how JHU structured the course series.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-6">What are you most proud of doing as part of the JHU Data Science Specialization?</h3>
<p>Being able to complete the JHU Data Science Specialization in 6 months and to get a distinction in every one of the courses was a great success. However, the best moment was probably the way my capstone project (next word prediction) turned out. The model could be trained in incremental steps, and it was able to provide meaningful options in real time.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-6"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>The course covered the concepts and tools needed to successfully address data science problems. It gave me the confidence and knowledge to apply for a data science position. I am now working in the field at GE Research. I am grateful to all who made this Specialization happen!</p>
<h2 id="jason-wilkinson">Jason Wilkinson</h2>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/JasonWilkinson.jpg"><img class="size-thumbnail wp-image-3880 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/JasonWilkinson-120x90.jpg" alt="JasonWilkinson" width="120" height="90" /></a></p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>Jason Wilkinson is a trader of commodity futures and other financial securities at a small proprietary trading firm in New York City. He and his wife, Katie, and dog, Charlie, can frequently be seen at the Jersey shore. And no, it’s nothing like the tv show, aside from the fist pumping.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-7"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>The JHU Data Science Specialization helped me to prepare as I begin working on a Masters of Computer Science specializing in Machine Learning at Georgia Tech and also in researching algorithmic trading ideas. I also hope to find ways of using what I’ve learned in philanthropic endeavors, applying data science for social good.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-7"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3>
<p>I’m most proud of going from knowing zero R code to being able to apply it in the capstone and other projects in such a short amount of time.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-7"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>The knowledge gained in pursuing the specialization certificate alone was worth the time put into it. A certificate is just a piece of paper. It’s what you can do with the knowledge gained that counts.</p>
<h2 id="uli-zellbeck">Uli Zellbeck</h2>
<p> </p>
<p style="text-align: left;">
<a href="http://simplystatistics.org/wp-content/uploads/2015/02/Uli.jpg"><img class="size-thumbnail wp-image-3881" src="http://simplystatistics.org/wp-content/uploads/2015/02/Uli-120x90.jpg" alt="Uli" width="120" height="90" /></a>
</p>
<p> </p>
<p>I studied economics in Berlin with focus on econometrics and business informatics. I am currently working as a Business Intelligence / Data Warehouse Developer in an e-commerce company. I am interested in recommender systems and machine learning.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-8"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>I wanted to learn about Data Science because it provides a different approach to solving business problems with data. I chose the JHU Data Science Specialization on Coursera because it promised a wide range of topics and I like the idea of online courses. Also, I had experience with R and I wanted to deepen my knowledge of this tool.</p>
<h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-8">What are you most proud of doing as part of the JHU Data Science Specialization?</h3>
<p>There are two things. I successfully took all nine courses in 4 months and the capstone project was really hard work.</p>
<h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-8"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3>
<p>I might get the chance to develop a Data Science department at my company. I would like to use the certificate as a basis for gaining deeper knowledge in the many parts of Data Science.</p>
<h2 id="fred-zhengzhenhao">Fred Zheng Zhenhao</h2>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/ZHENG-Zhenhao.jpeg"><img class="size-thumbnail wp-image-3882 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/ZHENG-Zhenhao-120x90.jpeg" alt="ZHENG Zhenhao" width="120" height="90" /></a></p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>When I enrolled in the JHU Data Science Specialization, I was an undergraduate student at The Hong Kong Polytechnic University. Before that, I had read some data mining books and felt excited about the content, but I never got to implement any of the algorithms because I barely had any programming skills. After taking this series of courses, I am now able to analyze the web content related to my research using R.</p>
<h3 id="why-did-you-take-the-jhu-data-science-specialization-9"><strong>Why did you take the JHU Data Science Specialization?</strong></h3>
<p>I took this series of courses as a challenge to myself. I wanted to see whether my interest could carry me through 9 courses and 1 capstone project. And I do want to learn more in this field. This specialization is different from other data mining or machine learning classes in that it covers the entire process, including Git, R, R Markdown, Shiny, etc., and I think these are necessary skills too.</p>
<p><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></p>
<p>Getting my word prediction app to respond in 0.05 seconds is already exciting, and one of the reviewers said, “congratulations your engine came up with the most correct prediction among those I reviewed: 3 out of 5, including one that stumped everyone else: ‘child might stick her finger or a foreign object into an electrical (outlet)’.” I guess that’s the part I am most proud of.</p>
<p><strong>How are you planning on using your Data Science Specialization Certificate?</strong></p>
<p>It definitely goes in my CV for future job hunting.</p>
<p> </p>
<p> </p>
Early data on knowledge units - atoms of statistical education
2015-02-05T09:44:49+00:00
http://simplystats.github.io/2015/02/05/early-data-on-knowledge-units-atoms-of-statistical-education
<p>Yesterday I posted <a href="http://simplystatistics.org/2015/02/04/knowledge-units-the-atoms-of-statistical-education/">about atomizing statistical education into knowledge units</a>. You can try out the first knowledge unit here: <a href="https://jtleek.typeform.com/to/jMPZQe">https://jtleek.typeform.com/to/jMPZQe</a>. The early data is in and it is consistent with many of our hypotheses about the future of online education.</p>
<p>Namely:</p>
<ol>
<li>Completion rates are high when segments are shorter</li>
<li>You can learn something about statistics in a short amount of time (2 minutes to complete, many people got all questions right)</li>
<li>People will consume educational material on tablets/smartphones more and more.</li>
</ol>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/Screen-Shot-2015-02-05-at-9.34.51-AM.png"><img class="aligncenter wp-image-3863" src="http://simplystatistics.org/wp-content/uploads/2015/02/Screen-Shot-2015-02-05-at-9.34.51-AM.png" alt="Screen Shot 2015-02-05 at 9.34.51 AM" width="500" height="402" srcset="http://simplystatistics.org/wp-content/uploads/2015/02/Screen-Shot-2015-02-05-at-9.34.51-AM-300x241.png 300w, http://simplystatistics.org/wp-content/uploads/2015/02/Screen-Shot-2015-02-05-at-9.34.51-AM.png 1004w" sizes="(max-width: 500px) 100vw, 500px" /></a></p>
<p> </p>
Knowledge units - the atoms of statistical education
2015-02-04T16:45:21+00:00
http://simplystats.github.io/2015/02/04/knowledge-units-the-atoms-of-statistical-education
<p><em>Editor’s note: This idea is <a href="http://www.bcaffo.com/">Brian’s idea</a> and based on conversations with him and Roger, but I just executed it.</em></p>
<p>The length of academic courses has traditionally ranged between a few days for a short course to a few months for a semester-long course. Lectures are typically either 30 minutes or one hour. Term and lecture lengths have been dictated by tradition and the relative inconvenience of coordinating schedules of the instructors and students for shorter periods of time. As classes have moved online the barrier of inconvenience to varying the length of an academic course has been removed. Despite this flexibility, most academic online courses adhere to the traditional semester-long format. For example, the first massive online open courses were simply semester-long courses directly recorded and offered online.</p>
<p>Data collected from massive online open courses suggest that <a href="https://onlinelearninginsights.wordpress.com/2014/04/28/mooc-design-tips-maximizing-the-value-of-video-lectures/">shorter video lectures</a> and <a href="https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop">shorter courses</a> lead to higher student retention. These results line up with data on other online activities such as YouTube video watching or form completion, which also show that shorter activities lead to higher completion rates.</p>
<p>We have some of the earliest and most highly subscribed massive online open courses through the Coursera platform: Data Analysis, Computing for Data Analysis, and Mathematical Biostatistics Bootcamp. Our original courses were translated from courses we offered locally and were therefore closer to semester long with longer lectures ranging from 15-30 minutes. Based on feedback from our students and the data we observed about completion rates, we made the decision to break our courses down into smaller, one-month courses with no more than two hours of lecture material per week. Since then, we have enrolled more than a million students in our MOOCs.</p>
<p>The data suggest that the shorter you can make an academic unit online, the higher the completion percentage. The question then becomes “How short can you make an online course?” To answer this question requires a definition of a course. For our purposes we will define a course as an educational unit consisting of the following three components:</p>
<ul>
<li>
<p><strong>Knowledge delivery</strong> - the distribution of educational material through lectures, audiovisual materials, and course notes.</p>
</li>
<li>
<p><strong>Knowledge evaluation</strong> - the evaluation of how much of the knowledge delivered to a student is retained.</p>
</li>
<li>
<p><strong>Knowledge certification</strong> - an independent claim or representation that a student has learned some set of knowledge.</p>
</li>
</ul>
<p> </p>
<p>A typical university class delivers 36 hours = 12 weeks x 3 hours/week of content knowledge, evaluates that knowledge with on the order of 10 homework assignments and 2 tests, and results in a certification equivalent to 3 university credits. With this definition, what is the smallest possible unit that satisfies all three components of a course? We will call this smallest possible unit one knowledge unit. The smallest knowledge unit that satisfies all three components is a course that:</p>
<ul>
<li>
<p><strong>Delivers a single unit of content</strong> - We will define a single unit of content as a text, image, or video describing a single concept.</p>
</li>
<li>
<p><strong>Evaluates that single unit of content</strong> - The smallest unit of evaluation possible is a single question to evaluate a student’s knowledge.</p>
</li>
<li>
<p><strong>Certifies knowledge</strong> - Provides the student with a statement of successful evaluation of the knowledge in the knowledge unit.</p>
</li>
</ul>
<p>An example of a knowledge unit appears here: <a href="https://jtleek.typeform.com/to/jMPZQe">https://jtleek.typeform.com/to/jMPZQe</a>. The knowledge unit consists of a short (less than 2 minute) video and 3 quiz questions. When completed, the unit sends the completer an email verifying that the quiz has been completed. Just as an atom is the smallest unit of mass that defines a chemical element, the knowledge unit is the smallest unit of education that defines a course.</p>
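<p>To make the three components concrete, here is a minimal sketch of a knowledge unit as a data structure. It is only an illustration, not how the Typeform unit above is actually built: the class names, the field names, the placeholder URL, and the sample question are all assumptions made up for this example.</p>

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Question:
    """One multiple-choice quiz question (hypothetical structure)."""
    prompt: str
    choices: List[str]
    answer_index: int  # position of the correct choice in `choices`


@dataclass
class KnowledgeUnit:
    """Smallest course-like unit: delivery, evaluation, certification."""
    title: str
    content_url: str            # knowledge delivery: a single text, image, or video
    questions: List[Question]   # knowledge evaluation: at least one question

    def evaluate(self, responses: List[int]) -> float:
        """Return the fraction of questions answered correctly."""
        if len(responses) != len(self.questions):
            raise ValueError("one response per question is required")
        correct = sum(
            response == question.answer_index
            for response, question in zip(responses, self.questions)
        )
        return correct / len(self.questions)

    def certify(self, learner: str, responses: List[int]) -> str:
        """Knowledge certification: a statement that the unit was completed."""
        score = self.evaluate(responses)
        return f"{learner} completed '{self.title}' with a score of {score:.0%}."


# Hypothetical unit with a single question; the URL is a placeholder.
unit = KnowledgeUnit(
    title="Example knowledge unit",
    content_url="https://example.org/short-video",
    questions=[
        Question(
            prompt="Which choice is the sample mean of 1, 2, 3?",
            choices=["1", "2", "3"],
            answer_index=1,
        )
    ],
)
print(unit.certify("A. Student", [1]))
```

<p>In this sketch the certificate is just a string; in the unit described above, the same role is played by the email confirming that the quiz was completed.</p>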
<p>Shrinking the units down to this scale opens up some ideas about how you can connect them together into courses and credentials. I’ll leave that for a future post.</p>
Precision medicine may never be very precise - but it may be good for public health
2015-01-30T14:24:17+00:00
http://simplystats.github.io/2015/01/30/precision-medicine-will-never-be-very-precise-but-it-may-be-good-for-public-health
<p><em>Editor’s note: This post was originally titled: <a href="http://simplystatistics.org/2013/06/12/personalized-medicine-is-primarily-a-population-health-intervention/">Personalized medicine is primarily a population health intervention</a>. It has been updated with the graph of odds ratios/betas from GWAS studies.</em></p>
<p>There has been a lot of discussion of <a href="http://en.wikipedia.org/wiki/Personalized_medicine">personalized medicine</a>, <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/">individualized health</a>, and <a href="http://www.ucsf.edu/welcome-to-ome">precision medicine</a> in the news and in the medical research community, and President Obama just <a href="http://www.whitehouse.gov/the-press-office/2015/01/30/fact-sheet-president-obama-s-precision-medicine-initiative">announced a brand new initiative in precision medicine</a>. Despite this recent attention, it is clear that healthcare has always been personalized to some extent. For example, men are rarely pregnant and heart attacks occur more often among older patients. In these cases, easily collected variables such as sex and age can be used to predict health outcomes and therefore used to “personalize” healthcare for those individuals.</p>
<p>So why the recent excitement around personalized medicine? The reason is that it is increasingly cheap and easy to collect more precise measurements about patients that might be able to predict their health outcomes. An example that <a href="http://www.nytimes.com/2013/05/14/opinion/my-medical-choice.html?_r=0">has recently been in the news</a> is the measurement of mutations in the BRCA genes. Angelina Jolie made the decision to undergo a prophylactic double mastectomy based on her family history of breast cancer and measurements of mutations in her BRCA genes. Based on these measurements, previous studies had suggested she might have a lifetime risk as high as 80% of developing breast cancer.</p>
<p>This kind of scenario will become increasingly common as newer and more accurate genomic screening and predictive tests are used in medical practice. When I read these stories there are two points I think of that sometimes get obscured by the obviously fraught emotional, physical, and economic considerations involved with making decisions on the basis of new measurement technologies:</p>
<ol>
<li><strong>In individualized health/personalized medicine the “treatment” is information about risk</strong>. In <a href="http://en.wikipedia.org/wiki/Gleevec">some cases</a> treatment will be personalized based on assays. But in many other cases, we still do not (and likely will not) have perfect predictors of therapeutic response. In those cases, the healthcare will be “personalized” in the sense that the patient will get more precise estimates of their likelihood of survival, recurrence etc. This means that patients and physicians will increasingly need to think about/make decisions with/act on information about risks. But communicating and acting on risk is a notoriously challenging problem; personalized medicine will dramatically raise the importance of <a href="http://understandinguncertainty.org/">understanding uncertainty</a>.</li>
<li><strong>Individualized health/personalized medicine is a population-level treatment.</strong> Assuming that the 80% lifetime risk estimate was correct for Angelina Jolie, it still means there is a 1 in 5 chance she was never going to develop breast cancer. If that had been the case, then the surgery was unnecessary. So while her decision was based on personal information, there is still uncertainty in that decision for her. So the “personal” decision may not always be the “best” decision for any specific individual. It may, however, be the best thing to do for everyone in a population with the same characteristics.</li>
</ol>
<p>The first point bears serious consideration in light of President Obama’s new proposal. We have already collected a massive amount of genetic data about a large number of common diseases. In almost all cases, the amount of predictive information that we can glean from genetic studies is modest. One paper pointed this issue out in a rather snarky way by comparing two approaches to predicting people’s heights: (1) averaging their parents’ heights, an approach from the Victorian era, and (2) combining the latest information on the best genetic markers at the time. It turns out, all the genetic information we gathered isn’t as good as <a href="http://www.nature.com/ejhg/journal/v17/n8/full/ejhg20095a.html">averaging parents’ heights</a>. Another way to see this is to download data on all genetic variants associated with disease from the <a href="http://www.genome.gov/gwastudies/">GWAS catalog</a> that have a P-value less than 1 x 10<sup>-8</sup>. If you do that and look at the distribution of effect sizes, you see that 95% have an odds ratio or beta coefficient less than about 4. Here is a histogram of the effect sizes:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall.png"><img class="aligncenter size-full wp-image-3852" src="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall.png" alt="gwas-overall" width="480" height="480" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall.png 480w" sizes="(max-width: 480px) 100vw, 480px" /></a></p>
<p> </p>
<p> </p>
<p>This means that nearly all identified genetic effects are small. The ones that are really large (effect size greater than 100) are not for common disease outcomes; they are for <a href="http://en.wikipedia.org/wiki/Birdshot_chorioretinopathy">Birdshot chorioretinopathy</a> and hippocampal volume. You can really see this if you zoom in on the x-axis and look at the bulk of the distribution of effect sizes, which are mostly less than 2:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed.png"><img class="aligncenter size-full wp-image-3853" src="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed.png" alt="gwas-zoomed" width="480" height="480" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed.png 480w" sizes="(max-width: 480px) 100vw, 480px" /></a></p>
<p> </p>
<p> </p>
<p>These effect sizes translate into very limited predictive capacity for most identified genetic biomarkers. The implication is that personalized medicine, at least for common diseases, is highly likely to be inaccurate for any individual person. But if we can take advantage of the population-level improvements in health from precision medicine by increasing risk literacy, improving our use of uncertain markers, and understanding that precision medicine isn’t precise for any one person, it could be a really big deal.</p>
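<p>To make “very limited predictive capacity” concrete, here is a minimal sketch that converts an odds ratio into an individual’s risk. The 10% baseline risk and the function name <code>risk_given_odds_ratio</code> are assumptions chosen for illustration; they are not taken from the GWAS catalog.</p>

```python
def risk_given_odds_ratio(baseline_risk: float, odds_ratio: float) -> float:
    """Risk for a carrier, given the non-carrier risk and an odds ratio."""
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    carrier_odds = baseline_odds * odds_ratio
    return carrier_odds / (1.0 + carrier_odds)


# Hypothetical baseline risk of 10% for people without the variant.
BASELINE_RISK = 0.10

for odds_ratio in (1.2, 1.5, 2.0, 4.0):
    risk = risk_given_odds_ratio(BASELINE_RISK, odds_ratio)
    print(f"odds ratio {odds_ratio}: carrier risk {risk:.3f}")
```

<p>Under that assumed baseline, an odds ratio of 1.5 moves a carrier’s risk from 10% to about 14%, so most carriers would still never develop the disease.</p>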
Reproducible Research Course Companion
2015-01-26T16:22:36+00:00
http://simplystats.github.io/2015/01/26/reproducible-research-course-companion
<p><a href="https://itunes.apple.com/us/book/reproducible-research/id961495566?ls=1&mt=13" rel="https://itunes.apple.com/us/book/reproducible-research/id961495566?ls=1&mt=13"><img class="alignright wp-image-3838" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM-779x1024.png" alt="Screen Shot 2015-01-26 at 4.14.26 PM" width="331" height="435" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM-228x300.png 228w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM-779x1024.png 779w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM-152x200.png 152w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM.png 783w" sizes="(max-width: 331px) 100vw, 331px" /></a>I’m happy to announce that you can now get a copy of the <a title="Reproducible Research Course Companion" href="https://itunes.apple.com/us/book/reproducible/id961495566?ls=1&mt=13" target="_blank">Reproducible Research Course Companion</a> from the Apple iBookstore. The purpose of this e-book is pretty simple. The book provides all of the key video lectures from my <a title="JHU/Coursera Reproducible Research Course " href="https://www.coursera.org/course/repdata" target="_blank">Reproducible Research course</a> offered on Coursera, in a simple offline e-book format. The book can be viewed on a Mac, iPad, or iPad mini.</p>
<p>If you’re interested in taking my Reproducible Research course on Coursera and would like a flavor of what the course will be like, then you can view the lectures through the book (the free sample contains three lectures). On the other hand, if you already took the course and would like access to the lecture material afterwards, then this might be a useful add-on. If you are currently enrolled in the course, then this could be a handy way for you to take the lectures on the road with you.</p>
<p>Please note that all of the lectures are still available for free on YouTube via my <a href="https://www.youtube.com/channel/UCZA0RbbSK1IXeeJysKYRWuQ" target="_blank">YouTube channel</a>. Also, the book provides content only. If you wish to actually complete the course, you must take it through the Coursera web site.</p>
Data as an antidote to aggressive overconfidence
2015-01-21T11:58:07+00:00
http://simplystats.github.io/2015/01/21/data-as-an-antidote
<p>A recent <a href="http://www.nytimes.com/2014/12/07/opinion/sunday/adam-grant-and-sheryl-sandberg-on-discrimination-at-work.html?_r=0">NY Times op-ed</a> reminded us of the many biases faced by women at work. A <a href="http://time.com/3666135/sheryl-sandberg-talking-while-female-manterruptions/">follow-up piece</a> gave specific recommendations for how to conduct ourselves in meetings. In general, I found these very insightful, but I don’t necessarily agree with the recommendation that women should “Practice Assertive Body Language”. Instead, we should make an effort to judge ideas by their content and not be impressed by body language. More generally, it is a problem that many of the characteristics that help advance careers contribute nothing to intellectual output. One of these is what I call <em>aggressive overconfidence</em>.</p>
<p>Here is an example (based on a true story). A data scientist finds a major flaw with the data analysis performed by a prominent data-producing scientist’s lab. Both are part of a large collaborative project. A meeting is held among the project leaders to discuss the disagreement. The data producer is very self-confident in defending his approach. The data scientist, who is not nearly as aggressive, is <a href="http://time.com/3666135/sheryl-sandberg-talking-while-female-manterruptions/">interrupted</a> so much that she barely gets her point across. The project leaders decide that this seems to be simply a difference of opinion and, for all practical purposes, ignore the data scientist. I imagine this story sounds familiar to many. While in many situations this story ends here, when the results are data driven we can actually fact check opinions that are pronounced as fact. In this example, the data is public and anybody with the right expertise can download the data and corroborate the flaw in the analysis. This is typically quite tedious, but it can be done. Because the key flaws are rather complex, the project leaders, lacking expertise in data analysis, can’t make this determination. But eventually, a chorus of fellow data analysts will be too loud to ignore.</p>
<p>That aggressive overconfidence is generally rewarded in academia is a problem. And if this trait is <a href="http://scholar.google.com/scholar?hl=en&as_sdt=0,22&q=overconfidence+gender">highly correlated with being male</a>, then a manifestation of this is a worsened gender gap. My experience (including reading internet discussions among scientists on controversial topics) has convinced me that this trait is in fact correlated with gender. But the solution is not to help women become more aggressively overconfident. Instead we should continue to strive to judge work based on content rather than style. I am optimistic that more and more, data, rather than who sounds more sure of themselves, will help us decide who wins a debate.</p>
<p> </p>
Gorging ourselves on "free" health care: Harvard's dilemma
2015-01-20T09:00:56+00:00
http://simplystats.github.io/2015/01/20/gorging-ourselves-on-free-health-care-harvards-dilemma
<p><em>Editor’s note: This is a guest post by <a href="http://www.hcp.med.harvard.edu/faculty/core/laura-hatfield-phd">Laura Hatfield</a>. Laura is an Assistant Professor of Health Care Policy at Harvard Medical School, with a specialty in Biostatistics. Her work focuses on understanding trade-offs and relationships among health outcomes. Dr. Hatfield received her BS in genetics from Iowa State University and her PhD in biostatistics from the University of Minnesota. She tweets <a href="https://twitter.com/bioannie">@bioannie</a></em></p>
<p>I didn’t imagine when I joined Harvard’s Department of Health Care Policy that the New York Times would be <a href="http://www.nytimes.com/2015/01/06/us/health-care-fixes-backed-by-harvards-experts-now-roil-its-faculty.html">writing about my benefits package</a>. Then a vocal and aggrieved group of faculty <a href="http://www.thecrimson.com/article/2014/11/12/harvards-health-benefits-unfairness/">rebelled against health benefits changes</a> for 2015, and commentators responded by gleefully <a href="http://www.thefiscaltimes.com/2015/01/07/Harvards-Whiny-Profs-Could-Get-Obamacare-Bonus">skewering</a> entitled-sounding Harvard professors. But I’m a statistician, so I want to talk data.</p>
<p>Health care spending is tremendously right-skewed. The figure below shows the annual spending distribution among people with any spending (~80% of the total population) in two data sources on people covered by employer-sponsored insurance, such as the Harvard faculty. Notice that the y axis is on the log scale. More than half of people spend $1000 or less, but a few very unfortunate folks top out near half a million.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/spending_distribution.jpg"><img class="alignnone size-full wp-image-3814" src="http://simplystatistics.org/wp-content/uploads/2015/01/spending_distribution.jpg" alt="spending_distribution" width="600" height="400" /></a></p>
<p>Source: <a href="https://www.bea.gov/papers/working_papers.htm">Measuring health care costs of individuals with employer-sponsored health insurance in the US: A comparison of survey and claims data</a>. A. Aizcorbe, E. Liebman, S. Pack, D.M. Cutler, M.E. Chernew, A.B. Rosen. BEA working paper. WP2010-06. June 2010.</p>
<p>If, instead of contributing to my premiums, Harvard gave me the $1000/month premium contribution in the form of wages, I would be on the hook for my own health care expenses. If I stay healthy, I pocket the money, minus income taxes. If I get sick, I have the extra money available to cover the expenses…provided I’m not one of the unlucky 10% of people spending more than $12,000/year. In that case, the additional wages would be insufficient to cover my health care expenses. This “every woman for herself” system lacks the key benefit of insurance: risk pooling. The sickest among us would be bankrupted by health costs. Another good reason for an employer to give me benefits is that I do not pay taxes on this part of my compensation (more on that later).</p>
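<p>The arithmetic behind the “every woman for herself” scenario can be sketched in a few lines. The $1000/month figure comes from the paragraph above; the 25% tax rate and the spending levels are made-up values for illustration only.</p>

```python
MONTHLY_WAGE_SUPPLEMENT = 1000  # dollars per month, the figure quoted in the text


def net_after_health_costs(annual_spending, tax_rate=0.25):
    """Extra take-home pay minus health spending when nothing is pooled.

    tax_rate is a hypothetical income tax rate, not a number from the post.
    """
    take_home = 12 * MONTHLY_WAGE_SUPPLEMENT * (1 - tax_rate)
    return take_home - annual_spending


# Made-up spending levels spanning the right-skewed range described earlier.
for spending in (0, 1000, 12000, 500000):
    print(spending, net_after_health_costs(spending))
```

<p>A healthy year leaves money in my pocket, while a year near the top of the spending distribution leaves a shortfall that the extra wages cannot come close to covering.</p>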
<p>At the opposite end of the spectrum is the Harvard faculty health insurance plan. Last year, the university paid ~$1030/month toward my premium and I put in ~$425 (tax-free). In exchange for this ~$17,000 of premiums, my family got first-dollar insurance coverage with very low co-pays. Faculty contributions to our collective health care expenses were distributed fairly evenly among all of us, with only minimal cost sharing to reflect how much care each person consumed. The sickest among us were in no financial peril. My family didn’t use much care and thus didn’t get our (or Harvard’s) money’s worth for all that coverage, but I’m ok with it. I still prefer risk pooling.</p>
<p>Here’s the problem: moral hazard. It’s a term I learned when I started hanging out with health economists. It describes the tendency of people to over-consume goods that feel free, such as health care paid through premiums or desserts at an all-you-can-eat buffet. Just look at this array: how much cake do <em>you</em> want to eat for $9.99?</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/buffet.jpg"><img class="alignnone size-large wp-image-3815" src="http://simplystatistics.org/wp-content/uploads/2015/01/buffet-1024x768.jpg" alt="buffet" width="500" height="380" /></a></p>
<p>Source: https://www.flickr.com/photos/jimmonk/5687939526/in/photostream/</p>
<p>One way to mitigate moral hazard is to expose people to more of their cost of care at the point of service instead of through premiums. You might think twice about that fifth tiny cake if you were paying per morsel. This is what the new Harvard faculty plans do: our premiums actually go down, but now we face a modest deductible, $250 per person or $750 max for a family. This is meant to encourage faculty to use their health care more efficiently, but it still affords good protection against catastrophic costs. The out-of-pocket max remains low at $1500 per individual or $4500 per family, with recent announcements to further protect individuals who pay more than 3% of salary in out-of-pocket health costs through a reimbursement program.</p>
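<p>Here is a minimal sketch of how point-of-service cost sharing works for one person. The $250 deductible and $1500 out-of-pocket maximum are the individual figures quoted above; the 10% coinsurance rate after the deductible is purely an assumption, since the plan’s actual cost-sharing rules are not described here.</p>

```python
def out_of_pocket(annual_spending, deductible=250.0, coinsurance=0.10, oop_max=1500.0):
    """Amount one person pays at the point of service in a year.

    The person pays everything up to the deductible, then a share
    (coinsurance, a hypothetical rate) of the rest, capped at oop_max.
    """
    below_deductible = min(annual_spending, deductible)
    above_deductible = max(annual_spending - deductible, 0.0)
    return min(below_deductible + coinsurance * above_deductible, oop_max)


# Illustrative spending levels, not data from the post.
for spending in (100, 1000, 500000):
    print(spending, out_of_pocket(spending))
```

<p>Small bills are paid in full by the patient, which is the nudge toward efficiency, while the cap keeps a catastrophic year from becoming a catastrophic bill.</p>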
<p>The allocation of individuals’ contributions between premiums and point-of-service costs is partly a question of how we cross-subsidize each other. If Harvard’s total contribution remains the same and health care costs do not grow faster than wages (ha!), then increased cost sharing decreases the amount by which people who use less care subsidize those who use more. How you feel about the “right” level of cost sharing may depend on whether you’re paying or receiving a subsidy from your fellow employees. And maybe your political leanings.</p>
<p>What about the argument that it is better for an employer to “pay” workers by health insurance premium contributions rather than wages because of the tax benefits? While we might prefer to get our compensation in the form of tax-free health benefits vs taxed wages, the university, like all employers, is looking ahead to the <a href="http://www.forbes.com/sites/sallypipes/2014/12/01/a-cadillac-tax-for-everyone/">Cadillac tax provision of the ACA</a>. So they have to do some re-balancing of our overall compensation. If Harvard reduces its health insurance contributions to avoid the tax, we might reasonably <a href="http://www.washingtonpost.com/blogs/wonkblog/wp/2013/08/30/youre-spending-way-more-on-your-health-benefits-than-you-think/">expect to make up that difference</a> in higher wages. The empirical evidence is <a href="http://www.hks.harvard.edu/fs/achandr/JLE_LaborMktEffectsRisingHealthInsurancePremiums_2006.pdf">complicated</a> and suggests that employers may not immediately return savings on health benefits dollar-for-dollar in the form of wages.</p>
<p>As far as I can tell, Harvard is contributing roughly the same amount as last year toward my health benefits, but exact numbers are difficult to find. I switched plan types (into a high-deductible plan, but that’s a topic for another post!), so I can’t find and directly compare Harvard’s contributions in the same plan type this year and last. Peter Ubel <a href="http://www.peterubel.com/health_policy/how-behavioral-economics-could-have-prevented-the-harvard-meltdown-over-healthcare-costs/">argues</a> that if the faculty <em>had</em> seen these figures, we might not have revolted. The actuarial value of our plans remains very high (91%, just a bit better than the expensive Platinum plans on the exchanges) and Harvard’s spending on health care has grown from 8% to 12% of the university’s budget over the past few years. Would these data have been sufficient to quell the insurrection? Good question.</p>
If you were going to write a paper about the false discovery rate you should have done it in 2002
2015-01-16T10:58:04+00:00
http://simplystats.github.io/2015/01/16/if-you-were-going-to-write-a-paper-about-the-false-discovery-rate-you-should-have-done-it-in-2002
<p>People often talk about academic superstars as people who have written highly cited papers. Some of that has to do with people’s genius, or ability, or whatever. But one factor that I think sometimes gets lost is luck and timing. So I wrote a little script to get the first 30 papers that appear when you search Google Scholar for the terms:</p>
<ul>
<li>empirical processes</li>
<li>proportional hazards model</li>
<li>generalized linear model</li>
<li>semiparametric</li>
<li>generalized estimating equation</li>
<li>false discovery rate</li>
<li>microarray statistics</li>
<li>lasso shrinkage</li>
<li>rna-seq statistics</li>
</ul>
<p>Google Scholar sorts by relevance, but that relevance is driven to a large degree by citations. For example, here are the first 10 papers you get when you search for “false discovery rate”:</p>
<ul>
<li>Controlling the false discovery rate: a practical and powerful approach to multiple testing</li>
<li>Thresholding of statistical maps in functional neuroimaging using the false discovery rate</li>
<li>The control of the false discovery rate in multiple testing under dependency</li>
<li>Controlling the false discovery rate in behavior genetics research</li>
<li>Identifying differentially expressed genes using false discovery rate controlling procedures</li>
<li>The positive false discovery rate: A Bayesian interpretation and the q-value</li>
<li>On the adaptive control of the false discovery rate in multiple testing with independent statistics</li>
<li>Implementing false discovery rate control: increasing your power</li>
<li>Operating characteristics and extensions of the false discovery rate procedure</li>
<li>Adaptive linear step-up procedures that control the false discovery rate</li>
</ul>
<p>People who work in this area will recognize that many of these papers are the most important/most cited in the field.</p>
<p>Now we can make a plot that shows for each term when these 30 highest ranked papers appear. There are some missing values, because of the way the data are scraped, but this plot gives you some idea of when the most cited papers on these topics were published:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot.png"><img class="aligncenter size-full wp-image-3798" src="http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot.png" alt="citations-boxplot" width="600" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot-300x200.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot-260x173.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot.png 600w" sizes="(max-width: 600px) 100vw, 600px" /></a></p>
<p>You can see from the plot that the median publication year of the top 30 hits for “empirical processes” was 1990 and for “RNA-seq statistics” was 2010. The values for all of the topics were:</p>
<ul>
<li>Emp. Proc. 1990.241</li>
<li>Prop. Haz. 1990.929</li>
<li>GLM 1994.433</li>
<li>Semi-param. 1994.433</li>
<li>GEE 2000.379</li>
<li>FDR 2002.760</li>
<li>microarray 2003.600</li>
<li>lasso 2004.900</li>
<li>rna-seq 2010.765</li>
</ul>
<p>I think this pretty much matches up with the intuition most people have about the relative timing of fields, with a few exceptions (GEE in particular seems a bit late). There are a bunch of reasons this analysis isn’t perfect, but it does suggest that luck and timing in choosing a problem can play a major role in the “success” of academic work as measured by citations. It also suggests a reason for success in science other than individual brilliance. Given the potentially negative consequences the <a href="http://www.sciencemag.org/content/347/6219/262.abstract">expectation of brilliance has on certain subgroups</a>, it is important to recognize the role of timing and luck. The median publication year of the top “false discovery rate” papers was 2002, and almost none of the 30 top hits were published after about 2008.</p>
<p><a href="https://gist.github.com/jtleek/c5158965d77c21ade424">The code for my analysis is here</a>. It is super hacky so have mercy.</p>
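<p>For readers who just want the gist of the summary step, here is a minimal sketch in Python. It is not the script linked above; the function name <code>median_year_by_term</code> and the records are placeholders made up for illustration. It shows one way to take the median publication year per search term while skipping hits whose year could not be scraped.</p>

```python
from statistics import median


def median_year_by_term(records):
    """Median publication year per search term, ignoring missing years."""
    years_by_term = {}
    for term, year in records:
        if year is None:  # the scrape did not recover a year for this hit
            continue
        years_by_term.setdefault(term, []).append(year)
    return {term: median(years) for term, years in years_by_term.items()}


# Placeholder (term, year) pairs; None marks a missing value.
records = [
    ("false discovery rate", 1995),
    ("false discovery rate", 2001),
    ("false discovery rate", None),
    ("false discovery rate", 2003),
    ("lasso shrinkage", 1996),
    ("lasso shrinkage", 2005),
]

print(median_year_by_term(records))
```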
How to find the science paper behind a headline when the link is missing
2015-01-15T13:35:42+00:00
http://simplystats.github.io/2015/01/15/how-to-find-the-science-paper-behind-a-headline-when-the-link-is-missing
<p>I just saw a pretty wild statistic on Twitter that less than 60% of university news releases link to the papers they are describing.</p>
<p> </p>
<blockquote class="twitter-tweet" width="550">
<p>
Amazingly, less than 60% of university news releases link to the papers they're describing <a href="http://t.co/daN11xYvKs">http://t.co/daN11xYvKs</a> <a href="http://t.co/QtneZUAeFD">pic.twitter.com/QtneZUAeFD</a>
</p>
<p>
— Justin Wolfers (@JustinWolfers) <a href="https://twitter.com/JustinWolfers/status/555782983429677056">January 15, 2015</a>
</p>
</blockquote>
<p>Before you believe anything you read about science in the news, you need to go and find the original article. When the article isn’t linked in the press release, sometimes you need to do a bit of sleuthing. Here is an example of how I do it for a news article. In general the press-release approach is very similar, but you skip the first step I describe below.</p>
<p><strong>Here is the news article (<a href="http://www.huffingtonpost.com/2015/01/14/online-avatar-personality_n_6463484.html?utm_hp_ref=science">link</a>):</strong></p>
<p> </p>
<p><img class="aligncenter wp-image-3787" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.11.22-PM.png" alt="Screen Shot 2015-01-15 at 1.11.22 PM" width="300" height="405" /></p>
<p> </p>
<p> </p>
<p><strong>Step 1: Look for a link to the article</strong></p>
<p>Usually it will be linked near the top or the bottom of the article. In this case, the article links to the press release about the paper. <em>This is not the original research article</em>. If you don’t get to a scientific journal you aren’t finished. In this case, the press release actually gives the full title of the article, but that will happen less than 60% of the time according to the statistic above.</p>
<p> </p>
<p><strong>Step 2: Look for names of the authors, scientific key words and journal name if available</strong></p>
<p>You are going to use these terms to search in a minute. In this case the only two things we have are the journal name:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png"><img class="aligncenter size-full wp-image-3791" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png" alt="Untitled presentation (2)" width="949" height="334" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2-300x105.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2-260x91.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png 949w" sizes="(max-width: 949px) 100vw, 949px" /></a></p>
<p> </p>
<p>And some key words:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png"><img class="aligncenter size-full wp-image-3792" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png" alt="Untitled presentation (3)" width="933" height="343" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3-300x110.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png 933w" sizes="(max-width: 933px) 100vw, 933px" /></a></p>
<p> </p>
<p><strong>Step 3: Use Google Scholar</strong></p>
<p>You could just google those words and sometimes you get the real paper, but often you just end up back at the press release/news article. So instead the best way to find the article is to go to <a href="https://scholar.google.com/">Google Scholar</a> and then click on the little triangle next to the search box.</p>
<p> </p>
<p> </p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png"><img class="aligncenter size-full wp-image-3793" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png" alt="Untitled presentation (4)" width="960" height="540" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4-260x146.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png 960w" sizes="(max-width: 960px) 100vw, 960px" /></a></p>
<p>Fill in whatever information you can: the same year as the press release, information about the journal, the university, and key words.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.31.38-PM.png"><img class="aligncenter size-full wp-image-3794" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.31.38-PM.png" alt="Screen Shot 2015-01-15 at 1.31.38 PM" width="509" height="368" /></a></p>
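<p>If you do this often, the same search can be assembled as a URL. The sketch below is an assumption-heavy illustration: the parameter names <code>as_q</code>, <code>as_publication</code>, <code>as_ylo</code> and <code>as_yhi</code> are what I believe the advanced search form puts in the address bar, so check them against a URL the form generates for you. The key words and journal name are placeholders.</p>

```python
from urllib.parse import urlencode

SCHOLAR_SEARCH_URL = "https://scholar.google.com/scholar"


def scholar_search_url(keywords, journal="", year=None):
    """Build an advanced-search URL (parameter names are assumptions)."""
    params = {"as_q": keywords}
    if journal:
        params["as_publication"] = journal  # restrict to a journal name
    if year is not None:
        params["as_ylo"] = year  # earliest publication year
        params["as_yhi"] = year  # latest publication year
    return SCHOLAR_SEARCH_URL + "?" + urlencode(params)


# Placeholder key words and journal name for illustration.
url = scholar_search_url("avatar personality", journal="Example Journal", year=2015)
print(url)
```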
<p> </p>
<p><strong>Step 4: Victory</strong></p>
<p>Often this will come up with the article you are looking for:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png"><img class="aligncenter size-full wp-image-3795" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png" alt="Screen Shot 2015-01-15 at 1.32.45 PM" width="813" height="658" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM-247x200.png 247w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png 813w" sizes="(max-width: 813px) 100vw, 813px" /></a></p>
<p> </p>
<p>Unfortunately, the article may be paywalled, so if you don’t work at a university or institute with a subscription, you can always tweet the article name with a hashtag asking for a copy.</p>
<p> </p>
<blockquote class="twitter-tweet" width="550">
<p>
Amazingly, less than 60% of university news releases link to the papers they're describing <a href="http://t.co/daN11xYvKs">http://t.co/daN11xYvKs</a> <a href="http://t.co/QtneZUAeFD">pic.twitter.com/QtneZUAeFD</a>
</p>
<p>
— Justin Wolfers (@JustinWolfers) <a href="https://twitter.com/JustinWolfers/status/555782983429677056">January 15, 2015</a>
</p>
</blockquote>
<p>Before you believe anything your read about science in the news, you need to go and find the original article. When the article isn’t linked in the press release, sometimes you need to do a bit of sleuthing. Here is an example of how I do it for a news article. In general the press-release approach is very similar, but you skip the first step I describe below.</p>
<p><strong>Here is the news article (<a href="http://www.huffingtonpost.com/2015/01/14/online-avatar-personality_n_6463484.html?utm_hp_ref=science">link</a>):</strong></p>
<p> </p>
<p><img class="aligncenter wp-image-3787" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.11.22-PM.png" alt="Screen Shot 2015-01-15 at 1.11.22 PM" width="300" height="405" /></p>
<p> </p>
<p> </p>
<p><strong>Step 1: Look for a link to the article</strong></p>
<p>Usually it will be linked near the top or the bottom of the article. In this case, the article links to the press release about the paper. <em>This is not the original research article</em>. If you don’t get to a scientific journal you aren’t finished. In this case, the press release actually gives the full title of the article, but that will happen less than 60% of the time according to the statistic above.</p>
<p> </p>
<p><strong>Step 2: Look for names of the authors, scientific key words and journal name if available</strong></p>
<p>You are going to use these terms to search in a minute. In this case the only two things we have are the journal name:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png"><img class="aligncenter size-full wp-image-3791" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png" alt="Untitled presentation (2)" width="949" height="334" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2-300x105.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2-260x91.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png 949w" sizes="(max-width: 949px) 100vw, 949px" /></a></p>
<p> </p>
<p>And some key words:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png"><img class="aligncenter size-full wp-image-3792" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png" alt="Untitled presentation (3)" width="933" height="343" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3-300x110.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png 933w" sizes="(max-width: 933px) 100vw, 933px" /></a></p>
<p> </p>
<p><strong>Step 3: Use Google Scholar</strong></p>
<p>You could just google those words and sometimes you get the real paper, but often you just end up back at the press release/news article. So instead the best way to find the article is to go to <a href="https://scholar.google.com/">Google Scholar</a>, then click on the little triangle next to the search box.</p>
<p> </p>
<p> </p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png"><img class="aligncenter size-full wp-image-3793" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png" alt="Untitled presentation (4)" width="960" height="540" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4-260x146.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png 960w" sizes="(max-width: 960px) 100vw, 960px" /></a></p>
<p>Fill in information while you can. Fill in the same year as the press release, information about the journal, university and key words.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.31.38-PM.png"><img class="aligncenter size-full wp-image-3794" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.31.38-PM.png" alt="Screen Shot 2015-01-15 at 1.31.38 PM" width="509" height="368" /></a></p>
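<p>If you do this often, the same form can be filled in from a script. Here is a minimal sketch, not an official API: it assumes the query parameters that currently show up in Google Scholar result URLs (<code>as_q</code> for key words, <code>as_publication</code> for the journal, <code>as_ylo</code> and <code>as_yhi</code> for the year range), which could change at any time. The function name and the example key words are made up for illustration.</p>

```python
from urllib.parse import urlencode


def scholar_search_url(keywords, journal=None, year=None):
    """Build a Google Scholar search URL from the clues in a news article.

    The parameter names (as_q, as_publication, as_ylo, as_yhi) are an
    assumption based on the URLs the advanced search form produces.
    """
    params = {"as_q": " ".join(keywords)}
    if journal:
        params["as_publication"] = journal
    if year:
        # Restrict results to the year of the press release.
        params["as_ylo"] = year
        params["as_yhi"] = year
    return "https://scholar.google.com/scholar?" + urlencode(params)


# Illustrative key words only; use whatever the press release gives you.
url = scholar_search_url(["avatar", "personality"], year=2015)
print(url)
```

<p>Paste the printed URL into a browser and you should land on the same results page as the filled-in form.</p>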
<p> </p>
<p><strong>Step 4: Victory</strong></p>
<p>Often this will come up with the article you are looking for:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png"><img class="aligncenter size-full wp-image-3795" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png" alt="Screen Shot 2015-01-15 at 1.32.45 PM" width="813" height="658" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM-247x200.png 247w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png 813w" sizes="(max-width: 813px) 100vw, 813px" /></a></p>
<p> </p>
<p>Unfortunately, the article may be paywalled. If you don’t work at a university or institute with a subscription, you can always tweet the article name with the hashtag <a href="https://twitter.com/hashtag/icanhazpdf">#icanhazpdf</a> and your contact info. Then you just have to hope that someone will send it to you (they often do).</p>
<p> </p>
<p> </p>
Statistics and R for the Life Sciences: New HarvardX course starts January 19
2015-01-12T10:30:08+00:00
http://simplystats.github.io/2015/01/12/statistics-and-r-for-the-life-sciences-new-harvardx-course-starts-january-19
<p>The first course of our Biomedical Data Science online curriculum starts next week. You can sign up <a href="https://www.edx.org/course/statistics-r-life-sciences-harvardx-ph525-1x">here</a>. Instead of relying on mathematical formulas to teach statistical concepts, students can program along as we show computer code for simulations that illustrate the main ideas of exploratory data analysis and statistical inference (p-values, confidence intervals and power calculations, for example). By doing this, students will learn Statistics and R simultaneously and will not be bogged down by having to memorize formulas. We have three types of learning modules: lectures (see picture below), screencasts and assessments. After each video students will have the opportunity to assess their understanding through homework involving coding in R. A big improvement over the first version is that we have added dozens of assessments.</p>
<p>Note that this course is the first in an <a href="http://simplystatistics.org/2014/03/31/data-analysis-for-genomic-edx-course/">eight part series</a> on Data Analysis for Genomics. Updates will be provided via twitter <a href="https://twitter.com/rafalab">@rafalab</a>.</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2.png"><img class="alignnone size-large wp-image-3773" src="http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2-1024x603.png" alt="edx_screenshot_v2" width="495" height="291" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2-300x176.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2-1024x603.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2-260x153.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2.png 1298w" sizes="(max-width: 495px) 100vw, 495px" /></a></p>
Beast mode parenting as shown by my Fitbit data
2015-01-07T11:22:57+00:00
http://simplystats.github.io/2015/01/07/beast-mode-parenting-as-shown-by-my-fitbit-data
<p>This weekend was one of those hardcore parenting weekends that any parent of little kids will understand. We were up and actively taking care of kids for a huge fraction of the weekend. (Un)fortunately I was wearing my Fitbit, so I can quantify exactly how little we were sleeping over the weekend.</p>
<p>Here is Saturday:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/saturday.png"><img class="aligncenter wp-image-3762 size-full" src="http://simplystatistics.org/wp-content/uploads/2015/01/saturday.png" alt="saturday" width="500" height="500" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/saturday-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/saturday-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/01/saturday.png 500w" sizes="(max-width: 500px) 100vw, 500px" /></a></p>
<p> </p>
<p> </p>
<p>There you can see that I rocked about midnight-4am without running around chasing a kid or bouncing one to sleep. But Sunday was the real winner:</p>
<p> </p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/sunday.png"><img class="aligncenter wp-image-3763 size-full" src="http://simplystatistics.org/wp-content/uploads/2015/01/sunday.png" alt="sunday" width="500" height="500" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/sunday-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/sunday-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/01/sunday.png 500w" sizes="(max-width: 500px) 100vw, 500px" /></a></p>
<p>Check that out. I was totally asleep from like 4am-6am there. Nice.</p>
<p>Stay tuned for much more from my Fitbit data over the next few weeks.</p>
<p> </p>
<p> </p>
Sunday data/statistics link roundup (1/4/15)
2015-01-04T14:45:19+00:00
http://simplystats.github.io/2015/01/04/sunday-datastatistics-link-roundup-1415
<ol>
<li>I am digging <a href="http://waitbutwhy.com/2014/05/life-weeks.html">this visualization of your life in weeks</a>. I might have to go so far as to actually make one for myself.</li>
<li>I’m very excited about the new podcast <a href="http://www.thetalkingmachines.com/">TalkingMachines</a> and what an awesome name! I wish someone would do that same thing for applied statistics (Roger?)</li>
<li>I love that they call Ben Goldacre the <a href="http://www.vox.com/2014/12/27/7423229/ben-goldacre">anti-Dr. Oz in this piece</a>, especially given how often <a href="http://www.bmj.com/content/349/bmj.g7346">Dr. Oz is telling the truth</a>.</li>
<li>If you haven’t read it yet, <a href="http://www.economist.com/news/christmas-specials/21636589-how-statisticians-changed-war-and-war-changed-statistics-they-also-served">this piece in the Economist</a> on statisticians during the war is really good.</li>
<li>The arXiv <a href="http://www.nature.com/news/the-arxiv-preprint-server-hits-1-million-articles-1.16643">celebrated its 1 millionth paper upload</a>. It costs less to run than the <a href="https://twitter.com/joe_pickrell/status/549762678160625664">top 2 executives at PLoS make</a>. It is <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">too darn expensive</a> to publish open access right now.</li>
</ol>
Ugh ... so close to one million page views for 2014
2014-12-31T13:16:14+00:00
http://simplystats.github.io/2014/12/31/ugh-so-close-to-one-million-page-views-for-2014
<p>In my <a href="http://simplystatistics.org/2014/12/21/sunday-datastatistics-link-roundup-122114/">last Sunday Links roundup</a> I mentioned we were going to be really close to 1 million page views this year. Chris V. tried to rally the troops:</p>
<p> </p>
<blockquote class="twitter-tweet" width="550">
<p>
Lets get them over the hump // “<a href="https://twitter.com/simplystats">@simplystats</a>: Sunday data/statistics link roundup (12/21/14) <a href="http://t.co/X1WDF9zZc1">http://t.co/X1WDF9zZc1</a> <a href="https://twitter.com/hashtag/simplystats1e6?src=hash">#simplystats1e6</a>”
</p>
<p>
— Chris Volinsky (@statpumpkin) <a href="https://twitter.com/statpumpkin/status/546872078730010624">December 22, 2014</a>
</p>
</blockquote>
<p> </p>
<p>but alas we are probably not going to make it (unless by some miracle one of our posts goes viral in the next 12 hours):</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/12/soclose.png"><img class="aligncenter wp-image-3752" src="http://simplystatistics.org/wp-content/uploads/2014/12/soclose-1024x1024.png" alt="soclose" width="400" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2014/12/soclose-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2014/12/soclose-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/12/soclose-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2014/12/soclose.png 1050w" sizes="(max-width: 400px) 100vw, 400px" /></a></p>
<p> </p>
<p>Stay tuned for a bunch of cool new stuff from Simply Stats in 2015, including a new podcasting idea, more interviews, another unconference, and <a href="https://github.com/jtleek/simplystats">a new plotting theme</a>!</p>
On how meetings and conference calls are disruptive to a data scientist
2014-12-22T10:00:51+00:00
http://simplystats.github.io/2014/12/22/meetings-2
<p><em>Editor’s note: The week of Xmas eve is usually my most productive of the year. This is because there are fewer emails and zero meetings (I do take a break, but only after this great week for work). Here is a repost of one of our first entries explaining how meetings and conference calls are particularly disruptive in data science.</em></p>
<p>In <a href="http://www.ted.com/talks/jason_fried_why_work_doesn_t_happen_at_work.html" target="_blank">this</a> TED talk Jason Fried explains why work doesn’t happen at work. He describes the evils of meetings. Meetings are particularly disruptive for applied statisticians, especially for those of us who hack data files, explore data for systematic errors, get inspiration from visual inspection, and thoroughly test our code. Why? Before I become productive I go through a ramp-up/boot-up stage. Scripts need to be found, data loaded into memory, and most importantly, my brain needs to re-familiarize itself with the data and the essence of the problem at hand. I need a similar ramp-up for writing as well. It usually takes me between 15 and 60 minutes before I am in full-productivity mode. But once I am in “the zone”, I become very focused and I can stay in this mode for hours. There is nothing worse than interrupting this state of mind to go to a meeting. I lose much more than the hour I spend at the meeting. A short way to explain this is that having 10 separate hours to work is basically nothing, while having 10 hours in the zone is when I get stuff done.</p>
<p>Of course not all meetings are a waste of time. Academic leaders and administrators need to consult and get advice before making important decisions. I find lab meetings very stimulating and, generally, productive: we unstick the stuck and realign the derailed. But before you go and set up a standing meeting consider this calculation: a weekly one-hour meeting with 20 people translates into 1 hour x 20 people x 52 weeks/year = 1040 person-hours of potentially lost production per year. Assuming 40-hour weeks, that is 26 work weeks, or six months. How many grants, papers, and lectures can we produce in six months? And this does not take into account the non-linear effect described above. Jason Fried suggests you cancel your next meeting, notice that nothing bad happens and enjoy the extra hour of work.</p>
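<p>If you want to plug in the numbers for your own standing meeting, here is a minimal sketch of the same back-of-the-envelope calculation. The function name and defaults are just illustrative, and it deliberately ignores the non-linear ramp-up cost, so treat the result as a lower bound.</p>

```python
def meeting_cost(hours_per_meeting, attendees, weeks_per_year=52, hours_per_work_week=40):
    """Return (person-hours per year, equivalent full work weeks) for a weekly meeting."""
    person_hours = hours_per_meeting * attendees * weeks_per_year
    work_weeks = person_hours / hours_per_work_week
    return person_hours, work_weeks


# The example from the text: a weekly one-hour meeting with 20 people.
person_hours, work_weeks = meeting_cost(hours_per_meeting=1, attendees=20)
print(f"{person_hours} person-hours per year, about {work_weeks:.0f} work weeks")
```

<p>With the numbers above this gives 1040 person-hours, or 26 work weeks, which is where the six months comes from.</p>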
<p>I know many others who are like me in this regard and for you I have these recommendations: 1- Avoid unnecessary meetings, especially if you are already in full-productivity mode. Don’t be afraid to use this as an excuse to cancel. If you are in a soft $ institution, remember who pays your salary. 2- Try to bunch all the necessary meetings together into one day. 3- Set aside at least one day a week to stay home and work for 10 hours straight. Jason Fried also recommends that every workplace declare a day in which no one talks. No meetings, no chit-chat, no friendly banter, etc… No talk Thursdays anyone?</p>
Sunday data/statistics link roundup (12/21/14)
2014-12-21T22:00:33+00:00
http://simplystats.github.io/2014/12/21/sunday-datastatistics-link-roundup-122114
<p>James Stewart, author of the most popular Calculus textbook in the world, <a href="http://classic.slashdot.org/story/14/12/20/0036210">passed away</a>. In case you wonder if there is any money in textbooks, he had a $32 million house in Toronto. Maybe I should get out of MOOCs and into textbooks.</p>
<ol>
<li><a href="https://medium.com/the-physics-arxiv-blog/cause-and-effect-the-revolutionary-new-statistical-test-that-can-tease-them-apart-ed84a988e">This post</a> on Medium about a new test for causality is making the rounds. The authors <a href="http://arxiv.org/pdf/1412.3773v1.pdf">of the original paper</a> make clear that their assumptions make the results basically unrealistic for any real analysis, for example: “<a href="http://arxiv.org/pdf/1412.3773v1.pdf">We simplify the causal discovery problem by assuming no confounding, selection bias and feedback.</a>” The Medium article is too bold and, as I replied to an economist who tweeted there was a new test that could distinguish causality: “<a href="https://twitter.com/simplystats/status/545769855564398593">Nope</a>”.</li>
<li>I’m excited that Rafa + the ASA have started a section <a href="https://twitter.com/rafalab/status/543115692770607104">on Genomics and Genetics</a>. It is nice to have a place to belong within our community. I hope it can be a place where folks who aren’t into the hype (a lot of those in genomics), but really care about applications, can meet each other and work together.</li>
<li><a href="https://medium.com/@hannawallach/big-data-machine-learning-and-the-social-sciences-927a8e20460d">Great essay</a> by Hanna W. about data, machine learning and fairness. I love this quote: “in order to responsibly articulate and address issues relating to bias, fairness, and inclusion, we need to stop thinking of big data sets as being homogeneous, and instead shift our focus to the many diverse data sets nested within these larger collections.” (via Hilary M.)</li>
<li>Over at Flowing Data they ran down <a href="http://flowingdata.com/2014/12/19/the-best-data-visualization-projects-of-2014-2/">the best data visualizations</a> of the year.</li>
<li><a href="http://dirk.eddelbuettel.com/blog/2014/12/21/#sorry_julia_2014-12">This rant</a> from Dirk E. perfectly encapsulates every annoying thing about the Julia versus R comparisons I see regularly.</li>
<li>We are tantalizingly close to 1 million page views for the year for Simply Stats. Help get us over the edge, share your favorite simply stats article with all your friends using the hashtag <a href="https://twitter.com/search?f=realtime&q=%23simplystats1e6&src=typd">#simplystats1e6</a></li>
</ol>
Interview with Emily Oster
2014-12-19T09:39:38+00:00
http://simplystats.github.io/2014/12/19/interview-with-emily-oster
<div>
<div class="nD">
<div dir="ltr">
<div>
<a href="http://simplystatistics.org/wp-content/uploads/2014/12/Emily_Oster_Photo.jpg"><img class="aligncenter wp-image-3714 " src="http://simplystatistics.org/wp-content/uploads/2014/12/Emily_Oster_Photo-198x300.jpg" alt="Emily Oster" width="121" height="184" /></a>
</div>
<div>
</div>
<div>
</div>
<div>
<em><a href="http://en.wikipedia.org/wiki/Emily_Oster">Emily Oster</a> is an Associate Professor of Economics at Brown University. She is a frequent and highly respected <a href="http://fivethirtyeight.com/contributors/emily-oster/">contributor to 538</a>, where she brings clarity to areas of interest to parents, pregnant women, and the general public where empirical research is conflicting or difficult to interpret. She is also the author of the popular new book about pregnancy: <a href="http://www.amazon.com/Expecting-Better-Conventional-Pregnancy-Wrong/dp/0143125702">Expecting Better: Why the Conventional Pregnancy Wisdom Is Wrong--and What You Really Need to Know</a>. We interviewed Emily as part of our <a href="http://simplystatistics.org/interviews/">ongoing interview series</a> with exciting empirical data scientists. </em>
</div>
<div>
<em> </em>
</div>
<div>
</div>
<div>
<b>SS: Do you consider yourself an economist, econometrician, statistician, data scientist or something else?</b>
</div>
<div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
EO: I consider myself an empirical economist. I think my econometrics colleagues would have a hearty laugh at the idea that I'm an econometrician! The questions I'm most interested in tend to have a very heavy empirical component - I really want to understand what we can learn from data. In this sense, there is a lot of overlap with statistics. But at the end of the day, the motivating questions and the theories of behavior I want to test come straight out of economics.
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div dir="ltr">
<div>
<div>
</div>
<div>
<b>SS: You are a frequent contributor to 538. Many of your pieces are attempts to demystify often conflicting sets of empirical research (about concussions and suicide, or the dangers of water fluoridation). What would you say are the issues that make empirical research about these topics most difficult?</b>
</div>
<div>
<b> </b>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
<div>
EO: In nearly all the cases, I'd summarize the problem as: "The data isn't good enough." Sometimes this is because we only see observational data, not anything randomized. A large share of studies using observational data that I discuss have serious problems with either omitted variables or reverse causality (or both). This means that the results are suggestive, but really not conclusive. A second issue is even when we do have some randomized data, it's usually on a particular population, or a small group, or in the wrong time period. In the fluoride case, the studies which come closest to being "randomized" are from 50 years ago. How do we know they still apply now? This makes even these studies challenging to interpret.
</div>
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div dir="ltr">
<div>
<div>
</div>
<div>
<b>SS: Your recent book "Expecting Better: Why the Conventional Pregnancy Wisdom Is Wrong--and What You Really Need to Know" takes a similar approach to pregnancy. Why do you think there are so many conflicting studies about pregnancy? Is it because it is so hard to perform randomized studies?</b>
</div>
<div>
<b> </b>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
<div>
EO: I think the inability to run randomized studies is a big part of this, yes. One area of pregnancy where the data is actually quite good is labor and delivery. If you want to know the benefits and consequences of pain medication in labor, for example, it is possible to point you to some reasonably sized randomized trials. For various reasons, there has been more willingness to run randomized studies in this area. When pregnant women want answers to less medical questions (like, "Can I have a cup of coffee?") there is typically no randomized data to rely on. Because the possible benefits of drinking coffee while pregnant are pretty much nil, it is difficult to conceptualize a randomized study of this type of thing.
</div>
<div>
</div>
<div>
Another big issue I found in writing the book was that even in cases where the data was quite good, data often diverges from practice. This was eye-opening for me and convinced me that in pregnancy (and probably in other areas of health) people really do need to be their own advocates and know the data for themselves.
</div>
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div dir="ltr">
<div>
<div>
</div>
<div>
<b>SS: Have you been surprised about the backlash to your book for your discussion of the zero-alcohol policy during pregnancy? </b>
</div>
<div>
<b> </b>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
<div>
EO: A little bit, yes. This backlash has died down a lot as pregnant women actually read the book and use it. As it turns out, the discussion of alcohol makes up a tiny fraction of the book and most pregnant women are more interested in the rest of it! But certainly when the book came out this got a lot of focus. I suspected it would be somewhat controversial, although the truth is that every OB I actually talked to told me they thought it was fine. So I was surprised that the reaction was as sharp as it was. I think in the end a number of people felt that even if the data were supportive of this view, it was important not to say it because of the concern that some women would over-react. I am not convinced by this argument.
</div>
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div dir="ltr">
<div>
<div>
</div>
<div>
<b>SS: What are the three most important statistical concepts for new mothers to know? </b>
</div>
<div>
<b> </b>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
<div>
EO: I really only have two!
</div>
<div>
</div>
<div>
I think the biggest thing is to understand the difference between randomized and non-randomized data and to have some sense of the pitfalls of non-randomized data. I reviewed studies of alcohol where the drinkers were twice as likely as non-drinkers to use cocaine. I think people (pregnant or not) should be able to understand why one is going to struggle to draw conclusions about alcohol from these data.
</div>
<div>
</div>
<div>
A second issue is the concept of probability. It is easy to say, "There is a 10% chance of the following" but do we really understand that? If someone quotes you a 1 in 100 risk from a procedure, it is important to understand the difference between 1 in 100 and 1 in 400. For most of us, those seem basically the same - they are both small. But they are not, and people need to think of ways to structure decision-making that acknowledge these differences.
</div>
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div dir="ltr">
<div>
<div>
</div>
<div>
<b>SS: What computer programming language is most commonly taught for data analysis in economics? </b>
</div>
<div>
<b> </b>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
<div>
EO: So, I think the majority of empirical economists use Stata. I have been seeing more R, as well as a variety of other things, but more commonly among people who work in more computationally heavy fields.
</div>
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div dir="ltr">
<div>
<div>
</div>
<div>
<b>SS: Do you have any advice for young economists/statisticians who are interested in empirical research? </b>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div dir="ltr">
<div>
</div>
<div>
EO:
</div>
<div>
1. Work on topics that interest you. As an academic you will ultimately have to motivate yourself to work. If you aren't interested in your topic (at least initially!), you'll never succeed.
</div>
<div>
2. One project which is 100% done is way better than five projects at 80%. You need to actually finish things, something which many of us struggle with.
</div>
<div>
3. Presentation matters. Yes, the substance is the most important thing, but don't discount the importance of conveying your ideas well.
</div>
</div>
</div>
</div>
Repost: Statistical illiteracy may lead to parents panicking about Autism
2014-12-18T12:09:24+00:00
http://simplystats.github.io/2014/12/18/repost-statistical-illiteracy-may-lead-to-parents-panicking-about-autism
<p><em>Editor’s Note: This is a repost of a <a href="http://simplystatistics.org/2012/11/30/statistical-illiteracy-may-lead-to-parents-panicking-about-autism/">previous post on our blog from 2012</a>. The repost is inspired by similar issues with statistical illiteracy that are coming up in <a href="http://skybrudeconsulting.com/blog/2014/12/12/diagnostic-testing.html">allergy screening</a> and <a href="http://www.bostonglobe.com/metro/2014/12/14/oversold-and-unregulated-flawed-prenatal-tests-leading-abortions-healthy-fetuses/aKFAOCP5N0Kr8S1HirL7EN/story.html">pregnancy screening</a>. </em></p>
<p>I was just doing my morning reading of a few news sources and stumbled across this <a href="http://www.huffingtonpost.com/2012/11/29/autism-risk-babies-cries_n_2211729.html">Huffington Post article</a> talking about research correlating babies’ cries to autism. It suggests that the sound of a baby’s cries may predict their future risk for autism. As the parent of a young son, this obviously caught my attention in a very lizard-brain, caveman sort of way. I couldn’t find a link to the research paper in the article so I did some searching and found out this result is also being covered by <a href="http://healthland.time.com/2012/11/28/can-a-babys-cry-be-a-clue-to-autism/">Time</a>, <a href="http://www.sciencedaily.com/releases/2012/11/121127111352.htm">Science Daily</a>, <a href="http://www.medicaldaily.com/articles/13324/20121129/baby-s-cry-reveal-autism-risk.htm">Medical Daily</a>, and a bunch of other news outlets.</p>
<p>Now thoroughly freaked, I looked online and found the pdf of the <a href="https://www.ewi-ssl.pitt.edu/psychology/admin/faculty-publications/201209041019040.Sheinkopf%20in%20press.pdf">original research article</a>. I started looking at the statistics and took a deep breath. Based on the analysis they present in the article there is absolutely no statistical evidence that a baby’s cries can predict autism. Here are the flaws with the study:</p>
<ol>
<li><strong>Small sample size</strong>. The authors only recruited 21 at-risk infants and 18 healthy infants. Then, because of data processing issues, they only ended up analyzing 7 high autistic-risk versus 5 low autistic-risk infants in one analysis and 10 versus 6 in another. That is nowhere near a representative sample and barely qualifies as a pilot study.</li>
<li><strong>Major and unavoidable confounding</strong>. The way the authors determined high autistic risk versus low risk was based on whether an older sibling had autism. Leaving aside the quality of this metric for measuring risk of autism, there is a major confounding factor: the families of the high risk children all had an older sibling with autism and the families of the low risk children did not! It would not be surprising at all if children with one autistic older sibling might get a different kind of attention and hence cry differently regardless of their potential future risk of autism.</li>
<li><strong>No correction for multiple testing</strong>. This is one of the oldest problems in statistical analysis. It is also one that is a consistent culprit of false positives in epidemiology studies. XKCD <a href="http://xkcd.com/882/">even did a cartoon</a> about it! The authors tested 9 variables measuring the way babies cry and tested each one with a statistical hypothesis test. They did not correct for multiple testing. So I gathered the resulting p-values and did the correction <a href="https://gist.github.com/4177366">for them</a>. It turns out that after adjusting for multiple comparisons, nothing is significant at the usual P &lt; 0.05 level, which would probably have prevented publication.</li>
</ol>
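<p>To make the multiple testing point concrete, here is a minimal sketch of two standard corrections, Bonferroni and Holm, written in plain Python. This is not the code from the gist linked above, and the nine p-values below are made up for illustration; they are not the values from the paper.</p>

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for 9 tests (NOT taken from the paper).
raw_p = [0.01, 0.04, 0.03, 0.20, 0.50, 0.08, 0.60, 0.35, 0.90]

print("raw p < 0.05:       ", sum(p < 0.05 for p in raw_p))
print("Bonferroni p < 0.05:", sum(p < 0.05 for p in bonferroni(raw_p)))
print("Holm p < 0.05:      ", sum(p < 0.05 for p in holm(raw_p)))
```

<p>In this made-up example three of the nine raw p-values fall below 0.05, but none survive either correction. That is the same pattern as in the study: a few nominally significant results among several tests can disappear once you account for how many tests were run.</p>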
<p>Taken together, these problems mean that the statistical analysis of these data does not show any connection between crying and autism.</p>
<p>The problem here exists on two levels. First, there was a failing in the statistical evaluation of this manuscript at the peer review level. Most statistical referees would have spotted these flaws and pointed them out for such a highly controversial paper. A second problem is that news agencies report on this result and, despite paying lip-service to potential limitations, are not statistically literate enough to point out the major flaws in the analysis that reduce the probability of a true positive. Should journalists have some minimal training in statistics that allows them to determine whether a result is likely to be a false positive, to save us parents a lot of panic?</p>
<p> </p>
A non-comprehensive list of awesome things other people did in 2014
2014-12-17T13:08:43+00:00
http://simplystats.github.io/2014/12/17/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2014
<p><em>Editor’s Note: Last year <a href="http://simplystatistics.org/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year/">I made a list</a> off the top of my head of awesome things other people did. I loved doing it so much that I’m doing it again for 2014. Like last year, I have surely missed awesome things people have done. If you know of some, you should make your own list or add it to the comments! The rules remain the same. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. Update: I missed pipes in R, now added!</em></p>
<p> </p>
<ol>
<li>I’m copying everything about Jenny Bryan’s amazing <a href="http://stat545-ubc.github.io/">Stat 545 class</a> in my data analysis classes. It is one of my absolute favorite open online sets of notes on data analysis.</li>
<li>Ben Baumer, Mine Cetinkaya-Rundel, Andrew Bray, Linda Loi, Nicholas J. Horton wrote <a href="http://arxiv.org/abs/1402.1894">this awesome paper</a> on integrating R markdown into the curriculum. I love the stuff that Mine and Nick are doing to push data analysis into undergrad stats curricula.</li>
<li>Speaking of those folks, the undergrad <a href="file:///Users/jtleek/Downloads/Report%20on%20Undergrad%20Ed_final3.pdf">guidelines for stats programs put out by the ASA</a> do an impressive job of balancing the advantages of statistics and the excitement of modern data analysis.</li>
<li>Somebody tell Hector Corrada Bravo to stop writing so many awesome papers. He is making us all look bad. His <a href="http://www.nature.com/nmeth/journal/v11/n9/abs/nmeth.3038.html">epiviz paper is great</a> and you should go start using the <a href="http://www.bioconductor.org/packages/release/bioc/html/epivizr.html">Bioconductor package</a> if you do genomics.</li>
<li>Hilary Mason founded <a href="http://www.fastforwardlabs.com/">Fast Forward Labs</a>. I love the business model of translating cutting edge academic (and otherwise) knowledge to practice. I am really pulling for this model to work.</li>
<li>As far as I can tell 2014 was the year that causal inference became the new hotness. One example of that is this awesome paper from the Google folks on trying to <a href="http://google.github.io/CausalImpact/CausalImpact.html">infer causality from related time series</a>. <a href="http://google.github.io/CausalImpact/CausalImpact.html">The R package</a> has some <a href="https://twitter.com/hspter/status/496689866953224192">cool features too</a>. I definitely am excited to see all the new innovation in this area.</li>
<li><a href="http://r-pkgs.had.co.nz/">Hadley</a> was <a href="https://github.com/hadley/dplyr">Hadley</a>.</li>
<li>Rafa and <a href="http://www.mike-love.net/">Mike </a>taught an awesome class on data analysis for genomics. They also created a <a href="http://genomicsclass.github.io/book/">book on Github</a> that I think is one of the best introductions to the statistics of genomics that exists so far.</li>
<li>Hilary Parker [<em>Editor’s Note: Last year</em> <em><a href="http://simplystatistics.org/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year/">_Editor’s Note: Last year_ _</a> off the top of my head of awesome things other people did. I loved doing it so much that I’m doing it again for 2014. Like last year, I have surely missed awesome things people have done. If you know of some, you should make your own list or add it to the comments! The rules remain the same. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. Update: I missed pipes in R, now added!</em></li>
</ol>
<p> </p>
<ol>
<li>I’m copying everything about Jenny Bryan’s amazing <a href="http://stat545-ubc.github.io/">Stat 545 class</a> in my data analysis classes. It is one of my absolute favorite open online sets of notes on data analysis.</li>
<li>Ben Baumer, Mine Cetinkaya-Rundel, Andrew Bray, Linda Loi, Nicholas J. Horton wrote <a href="http://arxiv.org/abs/1402.1894">this awesome paper</a> on integrating R markdown into the curriculum. I love the stuff that Mine and Nick are doing to push data analysis into undergrad stats curricula.</li>
<li>Speaking of those folks, the undergrad <a href="file:///Users/jtleek/Downloads/Report%20on%20Undergrad%20Ed_final3.pdf">guidelines for stats programs put out by the ASA</a> do an impressive job of balancing the advantages of statistics and the excitement of modern data analysis.</li>
<li>Somebody tell Hector Corrada Bravo to stop writing so many awesome papers. He is making us all look bad. His <a href="http://www.nature.com/nmeth/journal/v11/n9/abs/nmeth.3038.html">epiviz paper is great</a> and you should go start using the <a href="http://www.bioconductor.org/packages/release/bioc/html/epivizr.html">Bioconductor package</a> if you do genomics.</li>
<li>Hilary Mason founded <a href="http://www.fastforwardlabs.com/">Fast Forward Labs</a>. I love the business model of translating cutting edge academic (and otherwise) knowledge to practice. I am really pulling for this model to work.</li>
<li>As far as I can tell 2014 was the year that causal inference became the new hotness. One example of that is this awesome paper from the Google folks on trying to <a href="http://google.github.io/CausalImpact/CausalImpact.html">infer causality from related time series</a>. <a href="http://google.github.io/CausalImpact/CausalImpact.html">The R package</a> has some <a href="https://twitter.com/hspter/status/496689866953224192">cool features too</a>. I definitely am excited to see all the new innovation in this area.</li>
<li><a href="http://r-pkgs.had.co.nz/">Hadley</a> was <a href="https://github.com/hadley/dplyr">Hadley</a>.</li>
<li>Rafa and <a href="http://www.mike-love.net/">Mike</a> taught an awesome class on data analysis for genomics. They also created a <a href="http://genomicsclass.github.io/book/">book on Github</a> that I think is one of the best introductions to the statistics of genomics that exists so far.</li>
<li>Hilary Parker wrote <a href="http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/">a post on writing an R package from scratch</a> that took the twitterverse by storm. It is perfectly written for people who are just at the point of being able to create their own R package. I think it probably generated 100+ R packages just by being so easy to follow.</li>
<li>Oh you’re <a href="http://www.statschat.org.nz/2014/12/10/spin-and-manipulation-in-science-reporting/">not reading StatsChat yet</a>? <a href="http://www.statschat.org.nz/2014/12/13/blaming-mothers-again/">For real</a>?</li>
<li>FiveThirtyEight launched. Despite <a href="http://fivethirtyeight.com/features/a-formula-for-decoding-health-news/">some early bumps</a> they have done some really cool stuff. Loved the recent <a href="http://fivethirtyeight.com/tag/beer-mile/">piece on the beer mile</a> and I read every piece that <a href="http://fivethirtyeight.com/contributors/emily-oster/">Emily Oster writes</a>. She does an amazing job of explaining pretty complicated statistical topics to a really broad audience.</li>
<li>David Robinson’s <a href="https://github.com/dgrtwo/broom">broom package</a> is one of my absolute favorite R packages that was built this year. One of the most annoying things about R is the variety of outputs different models give and this tidy version makes it really easy to do lots of neat stuff.</li>
<li>Chung and Storey <a href="http://bioinformatics.oxfordjournals.org/content/early/2014/10/21/bioinformatics.btu674.full.pdf">introduced the jackstraw</a> which is both a very clever idea and the perfect name for a method that can be used to identify variables associated with principal components in a statistically rigorous way.</li>
<li>I rarely dig excel-type replacements, but the <a href="http://www.charted.co/">simplicity of charted.co</a> makes me love it. It does one thing and one thing really well.</li>
<li>The <a href="http://kbroman.wordpress.com/2014/05/15/hipster-re-educating-people-who-learned-r-before-it-was-cool/">hipsteR package</a> for teaching old R dogs new tricks is one of the many cool things Karl Broman did this year. I read all of his tutorials and never cease to learn stuff. In related news if I was 1/10th as organized as that dude I’d actually you know, get stuff done.</li>
<li>Whether or not I agree that they should be allowed to do unregulated human subjects research, statistics at tech companies, and randomized experiments in particular, have never been hotter. The boldest of the bunch is OKCupid, which writes blog posts with titles like, “<a href="http://blog.okcupid.com/index.php/we-experiment-on-human-beings/">We experiment on human beings</a>!”</li>
<li>In related news, I love the <a href="https://facebook.github.io/planout/">PlanOut project</a> by the folks over at Facebook, so cool to see an open source approach to experimentation at web scale.</li>
<li>No wonder <a href="http://www.cs.berkeley.edu/~jordan/">Mike Jordan</a> (no, not that <a href="http://en.wikipedia.org/wiki/Michael_Jordan">Mike Jordan</a>) is such a superstar. His <a href="http://www.reddit.com/r/MachineLearning/comments/2fxi6v/ama_michael_i_jordan">reddit AMA</a> raised my respect for him from already super high levels. First, it’s awesome that he did it, and second, it is amazing how well he articulates the relationship between CS and Stats.</li>
<li>I’m trying to figure out a way to get Matthew Stephens to <a href="http://stephens999.github.io/blog/">write more blog posts.</a> He teased us with the <a href="http://stephens999.github.io/blog/2014/11/dscr.html">Dynamic Statistical Comparisons</a> post and then left us hanging. The people demand more Matthew.</li>
<li>Di Cook also <a href="http://dicook.github.io/blog.html">started a new blog</a> in 2014. She was also <a href="https://unite.un.org/techevents/eda">part of this cool exploratory data analysis event</a> for the UN. They have a monster program going over there at Iowa State, producing some amazing research and a bunch of students that are recognizable by one name (Yihui, Hadley, etc.).</li>
<li>Love <a href="http://arxiv-web3.library.cornell.edu/pdf/1407.7819v1.pdf">this paper on sure screening of graphical models</a> out of Daniela Witten’s group at UW. It is so cool when a simple idea ends up being really well justified theoretically, it makes the world feel right.</li>
<li>I’m sure this actually happened before 2014, but the Bioconductor folks are still the best open source data science project that exists in my opinion. My favorite development I started using in 2014 is the <a href="http://www.bioconductor.org/developers/how-to/git-svn/">git-subversion bridge</a> that lets me update my Bioc packages with pull requests.</li>
<li>rOpenSci <a href="https://github.com/ropensci/hackathon">ran an awesome hackathon</a>. The lineup of people they invited was great and I loved the commitment to a diverse group of junior R programmers. I really, really hope they run it again.</li>
<li>Dirk Eddelbuettel and Carl Boettiger continue to make bigtime contributions to R. This time it is <a href="http://dirk.eddelbuettel.com/blog/2014/10/23/">Rocker</a>, with Docker containers for R. I think this could be a reproducibility/teaching gamechanger.</li>
<li>Regina Nuzzo <a href="http://www.nature.com/news/scientific-method-statistical-errors-1.14700">brought the p-value debate to the masses</a>. She is also incredible at communicating pretty complicated statistical ideas to a broad audience and I’m looking forward to more stats pieces by her in the top journals.</li>
<li>Barbara Engelhardt keeps <a href="http://arxiv.org/abs/1411.2698">rocking out great papers</a>. But she is also one of the best AEs I have ever had handle a paper for me at PeerJ. Super efficient, super fair, and super demanding. People don’t get enough credit for being amazing in the peer review process and she deserves it.</li>
<li>Ben Goldacre and Hans Rosling continue to be two of the best advocates for statistics and the statistical discipline - I’m not sure either claims the title of statistician but they do a great job anyway. <a href="http://news.sciencemag.org/africa/2014/12/star-statistician-hans-rosling-takes-ebola?rss=1&utm_source=dlvr.it&utm_medium=twitter">This piece</a> about Professor Rosling in Science gives some idea about the impact a statistician can have on the most current problems in public health. Meanwhile, I think Dr. Goldacre <a href="http://www.bmj.com/content/348/bmj.g3306/rr/759401">does a great job</a> of explaining how personalized medicine is an information science in this piece on statins in the BMJ.</li>
<li>Michael Lopez’s <a href="http://statsbylopez.com/2014/07/23/so-you-want-a-graduate-degree-in-statistics/">series of posts</a> on graduate school in statistics should be 100% required reading for anyone considering graduate school in statistics. He really nails it.</li>
<li>Trey Causey has an equally awesome <a href="http://treycausey.com/getting_started.html">Getting Started in Data Science</a> post that I read about 10 times.</li>
<li>Drop everything and <a href="http://www.pgbovine.net/writings.htm">go read all of Philip Guo’s posts</a>. Especially <a href="http://www.pgbovine.net/academia-industry-junior-employee.htm">this one</a> about industry versus academia or this one on <a href="http://www.pgbovine.net/practical-reason-to-pursue-PhD.htm">the practical reason to do a PhD</a>.</li>
<li>The top new Twitter feed of 2014 has to be <a href="https://twitter.com/ResearchMark">@ResearchMark</a> (incidentally I’m still mourning the disappearance of <a href="https://twitter.com/STATSHULK">@STATSHULK</a>).</li>
<li>Stephanie Hicks’ blog <a href="http://statisticalrecipes.blogspot.com/">combines recipes for delicious treats and statistics</a>, also I thought she had <a href="http://statisticalrecipes.blogspot.com/2014/05/inaugural-women-in-statistics-2014.html">a great summary</a> of the Women in Stats (<a href="https://twitter.com/search?q=%23WiS2014%20&src=typd">#WiS2014</a>) conference.</li>
<li>Emma Pierson is a Rhodes Scholar who wrote for 538, 23andMe, and a bunch of other major outlets as an undergrad. Her blog, <a href="http://obsessionwithregression.blogspot.com/">obsessionwithregression.blogspot.com</a> is another must read. <a href="http://qz.com/302616/see-how-red-tweeters-and-blue-tweeters-ignore-each-other-on-ferguson/">Here is an example</a> of her awesome work on how different communities ignored each other on Twitter during the Ferguson protests.</li>
<li>The RStudio crowd continues to be on fire. I think they are a huge part of the reason that R is gaining momentum. It wouldn’t be possible to list all their contributions (or it would be an RStudio exclusive list) but I really like <a href="http://blog.rstudio.org/2014/07/22/announcing-packrat-v0-4/">Packrat</a> and <a href="http://blog.rstudio.org/2014/06/18/r-markdown-v2/">R markdown v2</a>.</li>
<li>Another huge reason for the movement with R has been the outreach and development efforts of the <a href="http://www.revolutionanalytics.com/">Revolution Analytics folks.</a> The <a href="http://blog.revolutionanalytics.com/">Revolutions blog</a> has been a must read this year.</li>
<li>Julian Wolfson and Joe Koopmeiners at University of Minnesota are straight up gamers. <a href="http://sph.umn.edu/site/docs/biostats/OpenHouseFlyer2014.pdf">They live streamed their recruiting event</a> this year. One way I judge good ideas is by how mad I am I didn’t think of it and this one had me seeing bright red.</li>
<li>This is <a href="http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf">just an awesome paper</a> comparing lots of machine learning algorithms on lots of data sets. Random forests win, and this is a nice update of one of my favorite papers of all time: <a href="http://arxiv.org/pdf/math/0606441.pdf">Classifier technology and the illusion of progress</a>.</li>
<li><a href="http://www.r-statistics.com/2014/08/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/">Pipes in R</a>! This stuff is for real. The piping functionality created by Stefan Milton Bache and Hadley is one of the few inventions over the last several years that immediately changed whole workflows for me.</li>
</ol>
<p> </p>
<p>I’ll let <a href="https://twitter.com/ResearchMark">@ResearchMark</a> take us out:</p>
<p><a href="https://pbs.twimg.com/media/B2NC5c7IYAAt_j-.jpg"><img class="aligncenter" src="https://pbs.twimg.com/media/B2NC5c7IYAAt_j-.jpg" alt="" width="308" height="308" /></a></p>
Sunday data/statistics link roundup (12/14/14)
2014-12-14T12:54:50+00:00
http://simplystats.github.io/2014/12/14/sunday-datastatistics-link-roundup-121414
<ol>
<li><a href="http://www.motherjones.com/kevin-drum/2014/12/economists-are-almost-inhumanly-impartial">This post</a> suggests that economists are impartial when it comes to their liberal/conservative views. That being said, I’m not sure the regression line says what they think it does, particularly if you pay attention to the variance around the line (via Rafa).</li>
<li>I am digging the simplicity of <a href="http://www.charted.co/">charted.co</a> from the folks at Medium. But I worry about spurious correlations everywhere. I guess I should just let that ship sail.</li>
<li>FiveThirtyEight <a href="http://fivethirtyeight.com/features/beer-mile-chug-run-repeat/">does a run down of the beer mile</a>. If they set up a data crunchers beer mile, we are in.</li>
<li>I love it when Thomas Lumley interviews himself about silly research studies and particularly their associated press releases. I can actually hear his voice in my head when I read them. This time the <a href="http://www.statschat.org.nz/2014/12/13/blaming-mothers-again/">lipstick/IQ silliness gets Lumleyed</a>.</li>
<li><a href="http://fivethirtyeight.com/datalab/michael-jordan-kobe-bryant/">Jordan was better than Kobe</a>. Surprise. Plus <a href="http://simplystatistics.org/2014/12/12/kobe-data-says-stop-blaming-your-teammates/">Rafa always takes the Kobe bait</a>.</li>
<li><a href="http://mathesaurus.sourceforge.net/matlab-python-xref.pdf">Matlab/Python/R translation cheat sheet</a> (via Stephanie H.).</li>
<li>If I’ve said it once, I’ve said it a million times: statistical thinking is now as important as reading and writing. <a href="http://www.bostonglobe.com/metro/2014/12/14/oversold-and-unregulated-flawed-prenatal-tests-leading-abortions-healthy-fetuses/aKFAOCP5N0Kr8S1HirL7EN/story.html">The latest example</a> is that parents not understanding the difference between sensitivity and the predictive value of a positive test may be leading to unnecessary abortions (via Dan M./Rafa).</li>
</ol>
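<p>The gap between sensitivity and the predictive value of a positive test in the last item is easy to see with Bayes’ rule. Here is a minimal sketch; the function name and the numbers (99% sensitivity, 99% specificity, a condition affecting 1 in 1,000) are hypothetical and chosen only for illustration, not taken from the linked article.</p>

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Return P(condition | positive test) using Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs, for illustration only.
ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.001)
print(f"Predictive value of a positive test: {ppv:.1%}")
```

<p>With these made-up inputs, a test that catches 99% of true cases still yields a positive predictive value of only about 9%, because false positives from the large unaffected group swamp the true positives.</p>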
Kobe, data says stop blaming your teammates
2014-12-12T10:00:20+00:00
http://simplystats.github.io/2014/12/12/kobe-data-says-stop-blaming-your-teammates
<p>This year, Kobe leads the league in missed shots (<a href="http://ftw.usatoday.com/2014/11/kobe-bryant-lakers-shot-stats">by a lot</a>), has an abysmal FG% of 39 and his team plays better <a href="http://bleacherreport.com/articles/2292515-how-much-blame-does-kobe-bryant-deserve-for-los-angeles-lakers-pathetic-start">when he is on the bench</a>. Yet he <a href="http://espn.go.com/los-angeles/nba/story/_/id/12016979/los-angeles-lakers-star-kobe-bryant-critical-teammates-heated-scrimmage">blames his teammates</a> for the Lakers’ 6-16 record. Below is a plot showing that 2014 is not the first time the Lakers have been mediocre during Kobe’s tenure. It shows the percentage points above .500 per season with the Shaq and twin towers eras highlighted. I include the same plot for Lebron as a control.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/12/Rplot.png"><img class="alignnone size-large wp-image-3679" src="http://simplystatistics.org/wp-content/uploads/2014/12/Rplot-1024x511.png" alt="Rplot" width="525" srcset="http://simplystatistics.org/wp-content/uploads/2014/12/Rplot-1024x511.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/12/Rplot.png 1106w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></p>
<p>So stop blaming your teammates!</p>
<p>And here is my <a href="http://rafalab.jhsph.edu/simplystats/kobe2014.R">hastily written code</a> (don’t judge me!).</p>
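<p>For readers who want the metric spelled out, here is a small sketch of “percentage points above .500” as I read it: winning percentage minus 50. This is not the linked R code, and the function name is made up for illustration; the only input used is the 6-16 record quoted above.</p>

```python
def points_above_500(wins, losses):
    """Winning percentage minus 50, in percentage points."""
    return 100.0 * wins / (wins + losses) - 50.0


# The Lakers' 6-16 record mentioned in the post.
lakers_2014 = points_above_500(6, 16)
print(round(lakers_2014, 1))
```

<p>A negative value means a losing record, so 6-16 sits a little under 23 points below .500.</p>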
Genetically, there is no such thing as the Puerto Rican race
2014-12-08T09:09:59+00:00
http://simplystats.github.io/2014/12/08/geneticamente-no-hay-tal-cosa-como-la-raza-puertorriquena
<p><em>Editor’s note: Last week the Latin American media picked up a blog post with the eye-catching title “<a href="http://liorpachter.wordpress.com/2014/12/02/the-perfect-human-is-puerto-rican/">The perfect human is Puerto Rican</a>”. More attention appears to have been given to the title than the post itself. The coverage and comments on social media have demonstrated the need for scientific education on the topic of genetics and race. Here I will try to explain, in layman’s terms, why the interpretation I read in the main Puerto Rican paper is scientifically incorrect and somewhat concerning. The original post is in Spanish.</em></p>
<p>In a recent article titled “<a href="http://www.elnuevodia.com/serhumanoperfectoseriapuertorriqueno-1903858.html">Ser humano perfecto sería puertorriqueño</a>” (“The perfect human would be Puerto Rican”), El Nuevo Día summarized a blog post (erroneously called a study) by the mathematician Lior Pachter. The author of the blog, trying to ridicule racist comments he heard James Watson make, describes a thought experiment in which he finds that the “perfect” human (the quotation marks are important), if such a person existed, would belong to a genetically admixed group. Of the people studied, the one genetically closest to his “perfect” human turned out to be a Puerto Rican woman. The motivation for this exercise was to ridicule the idea that one race can be superior to another. El Nuevo Día seems to miss this point and tells us that “the expert concluded that in any case it is not surprising that the person closest to such perfection would be a Puerto Rican woman, due to the combination of good genes that the Puerto Rican race has.” Here I describe why this interpretation is scientifically wrong.</p>
<p><strong>What is the genome?</strong></p>
<p>The human genome encodes (in <a href="http://es.wikipedia.org/wiki/%C3%81cido_desoxirribonucleico">DNA</a> molecules) the genetic information needed for our biological development. We can think of the genome as two series of 3,000,000,000 letters (A, T, C or G) concatenated. We receive one from our father and the other from our mother. Different pieces (the genes) encode proteins needed for the thousands of functions our cells perform and that lead to some of our physical characteristics. With a few exceptions, all the cells in our body contain an exact copy of these two series of letters. The sperm and the egg have only one series of letters, a mix of the other two. When the sperm and the egg join, the new cell, the zygote, combines the two series, and that is how we inherit characteristics from each parent.</p>
<p><strong>What is genetic variation?</strong></p>
<p>If we all descend from the first human, how is it that we are different? Although it is very rare, these letters sometimes mutate at random. For example, a C can change to a T. Over hundreds of thousands of years, enough mutations have occurred to create variation among humans. The theory of natural selection tells us that if a mutation confers a survival advantage, the person who carries it is more likely to pass it on to their descendants. For example, light skin is more advantageous in Europe, because of its ability to absorb vitamin D when there is little sun, than in West Africa, where the melanin in dark skin protects against intense sun. It is estimated that differences between humans can be found in at least 10 million of the 3 billion letters (note that this is less than 1%).</p>
<p><strong>Genetically, what is a “race”?</strong></p>
<p>This is a controversial question. What is not controversial is that if we compare the series of letters of northern Europeans with those of West Africans or of the indigenous peoples of the Americas, we find pieces of the code that are unique to each region. If we study the parts of the code that vary among humans, we can easily distinguish the three groups. This should not surprise us given that, for example, differences in eye color and skin pigmentation are encoded by different letters in the genes associated with these traits. In this sense we could create a genetic definition of “race” based on the letters that distinguish these groups. Now, can we do the same to distinguish a Puerto Rican from a Dominican? Can we create a genetic definition that includes Carlos Delgado and Mónica Puig, but not Robinson Canó and Juan Luis Guerra? The scientific literature tells us no.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/12/PCAfinal.png"><img class="alignnone wp-image-3636 size-large" src="http://simplystatistics.org/wp-content/uploads/2014/12/PCAfinal-914x1024.png" alt="PCAfinal" width="411" height="461" srcset="http://simplystatistics.org/wp-content/uploads/2014/12/PCAfinal-267x300.png 267w, http://simplystatistics.org/wp-content/uploads/2014/12/PCAfinal-914x1024.png 914w, http://simplystatistics.org/wp-content/uploads/2014/12/PCAfinal-178x200.png 178w" sizes="(max-width: 411px) 100vw, 411px" /></a></p>
<p>In a <a href="http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1003925">series</a> of <a href="http://www.pnas.org/content/107/Supplement_2/8954">articles</a>, the geneticist Carlos Bustamante and his colleagues have studied the genomes of people from several ethnic groups. They define a genetic distance that they summarize with two dimensions in the plot above. Each point is a person and the color represents their group. Note the three extremes of the plot with many points of the same color piled together. These are white Europeans (red points), West Africans (green) and Native Americans (blue). The more scattered points in the middle are the admixed populations. Between the Europeans and the Native Americans we see the Mexicans, and between the Europeans and the Africans, the African Americans. Puerto Ricans are the orange points. I have highlighted three of them with numbers. <strong>1</strong> is close to the supposed “perfect” human. <strong>2</strong> is indistinguishable from a European and <strong>3</strong> is indistinguishable from an African American. The rest of us cover a wide spectrum. I also highlight with the number <strong>4</strong> a Dominican who is as close to “perfection” as the Puerto Rican woman. The main observation is that there is a lot of genetic variation among Puerto Ricans. In those Bustamante studied, African ancestry ranges from 5-60%, European from 35-95% and Taíno from 0-20%. How then can we speak of a Puerto Rican “race” when our genomes span a space so large that it can include, among others, Europeans, African Americans and Dominicans?</p>
<p><strong>What are “good” genes?</strong></p>
<p>Some mutations are lethal. Others result in changes to proteins that cause diseases such as <a href="http://es.wikipedia.org/wiki/Fibrosis_qu%C3%ADstica">cystic fibrosis</a> and require that both parents carry the mutation. Therefore, mixing different genomes decreases the probability of these diseases. Recently a series of studies has found advantages for some combinations of letters related to common diseases such as hypertension. A genetic mix that avoids having two copies of these higher-risk genes can be advantageous. But the supposed advantages are tiny and specific to diseases, not to other characteristics we associate with “perfection”. The concept of “good genes” is a vestige of <a href="http://en.wikipedia.org/wiki/Eugenics">eugenics</a>.</p>
<p>Despite our current social and economic problems, Puerto Rico has much to be proud of. In particular, we produce excellent engineers, athletes and musicians. Attributing their success to the “good genes” of our “race” is not only scientific nonsense but also disrespectful to these individuals, who through hard work, discipline and dedication have achieved what they have achieved. If you want to know whether Puerto Rico had anything to do with the success of these individuals, ask a historian, an anthropologist or a sociologist, not a geneticist. Now, if you want to learn about the potential of studying genomes to improve medical treatments and the importance of studying a diversity of individuals, a geneticist will have plenty to share.</p>
Sunday data/statistics link roundup (12/7/14)
2014-12-07T10:00:43+00:00
http://simplystats.github.io/2014/12/07/sunday-datastatistics-link-roundup-12714
<ol>
<li><a href="http://www.apa.org/news/press/releases/2014/11/airport-security.aspx">A randomized controlled trial</a> shows that using conversation to detect suspicious behavior is much more effective than just monitoring body language (via Ann L. on Twitter). This comes as a crushing blow to those of us who enjoyed the now-cancelled <a href="http://en.wikipedia.org/wiki/Lie_to_Me">Lie to Me</a> and assumed it was all real.</li>
<li>Check out this awesome <a href="http://map.ipviking.com/">real-time visualization</a> of different types of network attacks. Rafa says if you watch long enough you will almost certainly observe a “storm” of attacks. A cool student project would be modeling the distribution of these attacks if you could collect the data (via David S.).</li>
<li><a href="http://goodstrat.com/2014/12/03/consider-this-did-big-data-kill-the-statistician/">Consider this: Did Big Data Kill the Statistician?</a> I understand the sentiment, that statistical thinking and applied statistics have been around a long time and have <a href="http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/">produced some good ideas</a>. On the other hand, there is definitely a large group of statisticians who aren’t willing to expand their thinking beyond a really narrow set of ideas (via Rafa).</li>
<li><a href="http://www.huffingtonpost.com/2014/12/03/gangnam-style-youtube_n_6261332.html">Gangnam Style viewership creates integers too big for Youtube</a> (via Rafa)</li>
<li>A couple of interviews worth reading: <a href="http://simplystatistics.org/2014/12/05/interview-with-cole-trapnell-of-uw-genome-sciences/">ours with Cole Trapnell</a> and <a href="http://samsiatrtp.wordpress.com/2014/11/18/samsi-postdoctoral-profile-jyotishka-datta/">SAMSI’s with Jyotishka Datta</a> (via Jamie N.)</li>
<li> <a href="http://www.theguardian.com/technology/2014/dec/05/when-data-gets-creepy-secrets-were-giving-away">A piece on the secrets we don’t know we are giving away</a> through giving our data to [companies/the government/the internet].</li>
</ol>
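<p>The Gangnam Style item above is a nice excuse to look at what “too big” means for an integer. Here is a minimal sketch, assuming the view counter was stored as a signed 32-bit integer (as was widely reported); the helper name is made up, and this only imitates two’s-complement wraparound rather than anything YouTube actually runs.</p>

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, the largest signed 32-bit value


def as_int32(n):
    """Interpret n as a signed 32-bit two's-complement integer."""
    n &= 0xFFFFFFFF  # keep only the low 32 bits
    return n - 2**32 if n >= 2**31 else n


print(as_int32(INT32_MAX))      # still fits
print(as_int32(INT32_MAX + 1))  # one more view wraps around to a negative number
```

<p>Moving to a 64-bit counter pushes the limit to 2**63 - 1, which is a lot of horse dancing.</p>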
Interview with Cole Trapnell of UW Genome Sciences
2014-12-05T12:06:57+00:00
http://simplystats.github.io/2014/12/05/interview-with-cole-trapnell-of-uw-genome-sciences
<div id="mO" class="">
<div class="tNsA5e-nUpftc nUpftc ja xpv2f">
<div class="pf">
<div class="nXx3q">
<div class="cA">
<div class="cl ac">
<div class="yDSKFc viy5Tb">
<div class="rt">
<div class="DsPmj">
<div class="scroll-list-section-body scroll-list-section-body-0">
<div class="scroll-list-item top-level-item scroll-list-item-open scroll-list-item-highlighted" tabindex="0" data-item-id="Bs#gmail:thread-f:1463549268702220125" data-item-id-qs="qsBs-gmail-thread-f-1463549268702220125-0">
<div class="ah V T qX V-M">
<div class="af qX af-M">
<div class="fB qX">
<div class="ag qX" tabindex="0" data-msg-id="Bs#msg-f:1463577765776057801" data-msg-id-qs="qsBs-msg-f-1463577765776057801">
<div class="nI qX">
<div class="gm qX">
<div class="bK xJNT8d">
<div>
<div class="nD">
<blockquote>
<div dir="ltr">
<div>
<a href="http://simplystatistics.org/wp-content/uploads/2014/12/cole_cropped.jpg"><img class="aligncenter wp-image-3624" src="http://simplystatistics.org/wp-content/uploads/2014/12/cole_cropped-278x300.jpg" alt="cole_cropped" width="186" height="200" /></a>
</div>
</div>
</blockquote>
<div dir="ltr">
</div>
<div dir="ltr">
<div style="text-align: left;">
<em><a href="http://cole-trapnell-lab.github.io/">Cole Trapnell</a> is an Assistant Professor of Genome Sciences at the University of Washington. He is the developer of multiple incredibly widely used tools for genomics including Tophat, Cufflinks, and Monocle. His lab at UW studies cell differentiation, reprogramming, and other transitions between stable or metastable cellular states using a combination of computational and experimental techniques. We talked to Cole as part of our <a href="http://simplystatistics.org/interviews/">ongoing interview series</a> with exciting junior data scientists. </em>
</div>
<div style="text-align: left;">
</div>
<div style="text-align: left;">
</div>
<div style="text-align: left;">
<strong>SS: Do you consider yourself a computer scientist, a statistician, a computational biologist, or something else?</strong>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div>
<p>
CT: The questions that get me up and out of bed in the morning the fastest are biology questions. I work on cell differentiation - I want to know how to define the state of a cell and how to predict transitions between states. That said, my approach to these questions so far has been to use new technologies to look at previously hard to access aspects of gene regulation. For example, I’ve used RNA-Seq to look beyond gene expression into finer layers of regulation like splicing. Analyzing sequencing experiments often involves some pretty non-trivial math, computer science, and statistics. These data sets are huge, so you need fast algorithms to even look at them. They all involve transforming reads into a useful readout of biology, and the technical and biological variability in that transformation needs to be understood and controlled for, so you see cool mathematical and statistical problems all the time. So I guess you could say that I’m a biologist, both experimental and computational. I have to do some computer science and statistics in order to do biology.
</p>
<div>
</div>
</div>
</div>
</div>
<div>
<div class="nD">
<div>
<div>
<div>
<div>
<div>
<div dir="ltr">
<div>
<strong>SS: You got a Ph.D. in computer science but have spent the last several years in a wet lab learning to be a bench biologist - why did you make that choice?</strong>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div>
<div>
<p>
CT: Three reasons, mainly:
</p>
<p>
1) I thought learning to do bench work would make me a better overall scientist. It has, in many ways, I think. It’s fundamentally changed the way I approach the questions I work on, but it’s also made me more effective in lots of tiny ways. I remember when I first got to John Rinn’s lab, we needed some way to track lots of libraries and other material. I came up with some scheme where each library would get an 8-digit alphanumeric code generated by a hash function or something like that (we’d never have to worry about collisions!). My lab mate handed me a marker and said, “OK, write that on the side of these 12 micro centrifuge tubes”. I threw out my scheme and came up with something like “JR_1”, “JR_2”, etc. That’s a silly example, but I mention it because it reminds me of how completely clueless I was about where biological data really comes from.
</p>
<p>
2) I wanted to establish an independent, long-term research program investigating differentiation, and I didn’t want to have to rely on collaborators to generate data. I knew at the end of grad school that I wanted to have my own wet lab, and I doubted that anyone would trust me with that kind of investment without doing some formal training. Despite the now-common recognition by experimental biologists that analysis is incredibly important, there’s still a perception out there that computational biologists aren’t “real biologists”, and that computational folks are useful tools, but not the drivers of the intellectual agenda. That's of course not true, but I didn’t want to fight the stigma.
</p>
<p>
3) It sounded fun. I had one or two friends who had followed the “dry to wet” training trajectory, and they were having a blast. Seeing a result live under the microscope is satisfying in a way that I’ve rarely experienced looking at a computer screen.
</p>
<div>
</div>
</div>
</div>
</div>
</div>
<div>
<strong>SS: Do you plan to have both a wet lab and a dry lab when you start your new group? </strong>
</div>
<div>
<div class="F3hlO">
<div>
<div>
<div>
<p>
CT: Yes. I’m going to be starting my lab at the University of Washington in the department of Genome Sciences this summer, and it’s going to be a roughly 50/50 operation, I hope. Many of the labs there are set up that way, and there’s a real culture of valuing both sides. As a postdoc, I’ve been extremely fortunate to collaborate with grad students and postdocs who were trained as cell or molecular biologists but wanted to learn sequencing analysis. We’d train each other, often at great cost in terms of time spent solving “somebody else’s problem”. I’m going to do my best to create an environment like that, the way John did for me and my lab mates.
</p>
<div>
</div>
<div>
<strong>SS: You are frequently on the forefront of new genomic technologies. As data sets get larger and more complicated how do we ensure reproducibility and replicability of computational results? </strong>
</div>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div>
<div>
<div>
<div>
<div>
<p>
CT: That’s a good question, and I don’t really have a good answer. You’ve talked a lot on this blog about the importance of making science more reproducible and how journals could change to make it so. I agree wholeheartedly with a lot of what you’ve said. I like the idea of “papers as packages”, but I don’t see it happening soon, because it’s a huge amount of extra work and there’s not a big incentive to do so. Doing so might make it easier to be attacked, so there could even be a disincentive! Scientists do well when they publish papers and those papers are cited widely. We have lots of ways to quantify “impact” - h-index, total citation count, how many times your paper is shared via twitter on a given day, etc. (Say what you want about whether these are meaningful measures).
</p>
<p>
We don’t have a good way to track who’s right and who’s wrong, or whose results are reproducible and whose aren’t, short of full blown paper retraction. Most papers aren’t even checked in a serious way. Worse, the papers that are checked are the ones that a lot of people see - few people spend precious time following up on tangential observations in low circulation journals. So there’s actually an incentive to publish “controversial” results in highly visible journals because at least you’re getting attention.
</p>
<p>
Maybe we need a Yelp for papers and data sets? One where in order to dispute the reproducibility of the analysis, you’d have to provide the code <em>you</em> ran to generate a contradictory result? There needs to be a genuine and tangible <em>reward</em> (read: funding and career advancement) for putting up an analysis that others can dive into, verify, extend, and learn from.
</p>
<p>
In any case, I think it’s worth noting that reproducibility is not a problem unique to computation - experimentalists have a hard time reproducing results they got last week, much less results that came from some other lab! There’s all kinds of harmless reasons for that. Experiments are hard. Reagents come in bad lots. You had too much coffee that morning and can’t steady your pipet hand to save your life. But I worry a bit that we could spend a lot of effort making our analysis totally automated and perfectly reproducible and still be faced with the same problem.
</p>
<div>
</div>
<div>
<strong>SS: What are the interesting statistical challenges in single-cell RNA-sequencing? </strong>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div>
<div>
<div>
<div>
<div>
<div>
<p>
CT:
</p>
<p>
Oh man, there are many. Here’s a few:
</p>
<p>
1) There are some very interesting questions about variability in expression across cells, or within one cell across time. There’s clearly a lot of variability in the expression level of a given gene across cells. But there’s really no way right now to take “replicate” measurements of a single cell. What would that mean? With current technology, to make an RNA-Seq library from a cell, you have to lyse it. So that’s it for that cell. Even if you had a non-invasive way to measure the whole transcriptome, the cell is a living machine that’s always changing in ways large and small, even in culture. Would you consider repeated measurements “replicates”? Furthermore, how can you say that two different cells are “replicate” measurements of a single, defined cell state? Do such states even really exist?
</p>
<p>
For that matter, we don’t have a good way of assessing how much variability stems from technical sources as opposed to biological sources. One common way of assessing technical variability is to spike some alien transcripts at known concentrations into purified RNA before making the library, so you can see how variable your endpoint measurements are for those alien transcripts. But to do that for single-cell RNA-Seq, we’d have to actually spike transcripts <em>into</em> the nucleus of a cell before we lyse it and put it through the library prep process. Just doping it into the lysate’s not good enough, because the lysis itself might (and likely does) destroy a substantial fraction of the endogenous RNA in the cell. So there are some real barriers to overcome in order to get a handle on how much variability is really biological.
</p>
<p>
2) A second challenge is writing down what a biological process looks like at single cell resolution. I mean we want to write down a model that predicts the expression levels of each gene in a cell as it goes through some biological process. We want to be able to say this gene comes on first, then this one, then these genes, and so on. In genomics up until now, we’ve been in the situation where we are measuring many variables (P) from few measurements (N). That is, N &lt;&lt; P, typically, which has made this problem extremely difficult. With single cell RNA-Seq, that may no longer be the case. We can already easily capture hundreds of cells, and thousands of cells per capture is just around the corner, so soon, N will be close to P, and maybe someday greater.
</p>
<p>
Assume for the moment that we are capturing cells that are either resting at or transiting between well defined states. You can think of each cell as a point in a high-dimensional geometric space, where each gene is a different dimension. We’d like to find those equilibrium states and figure out which genes are correlated with which other genes. Even better, we’d like to study the transitions between states and identify the genes that drive them. The curse of dimensionality is always going to be a problem (we’re not likely to capture millions or billions of cells anytime soon), but maybe we have enough data to make some progress. There’s interesting literature out there for tackling problems at this scale, but to my knowledge these methods haven’t yet been widely applied in biology. I guess you can think of cell differentiation viewed at whole-transcriptome, single-cell resolution as one giant manifold learning problem. Same goes for oncogenesis, tissue homeostasis, reprogramming, and on and on. It’s going to be very exciting to see the convergence of large scale statistical machine learning and cell biology.
</p>
<p>
<strong>SS: If you could do it again would you do computational training then wet lab training or the other way around? </strong>
</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<p>
CT: I’m happy with how I did things, but I’ve seen folks go the other direction very successfully. My labmates Loyal Goff and Dave Hendrickson started out as molecular biologists, but they’re wizards at the command line now.
</p>
<div>
</div>
</div>
</div>
<div>
<div class="nD">
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div dir="ltr">
<div>
<strong>SS: What is your programming language of choice? </strong>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div>
<div class="F3hlO">
<div>
<div>
<div>
<div>
<div>
<p>
CT: Oh, I’d say I hate them all equally 😉
</p>
<p>
Just kidding. I’ll always love C++. I work in R a lot these days, as my work has veered away from developing tools for other people towards analyzing data I’ve generated. I still find lots of things about R to be very painful, but ggplot2, plyr, and a handful of other godsend packages make the juice worth the squeeze.
</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
Repost: A deterministic statistical machine
2014-12-04T13:13:45+00:00
http://simplystats.github.io/2014/12/04/repost-a-deterministic-statistical-machine
<p><em>Editor’s note: This is a repost of our previous post about deterministic statistical machines. It is inspired by the <a href="https://gigaom.com/2014/12/02/google-is-funding-an-artificial-intelligence-for-data-science/">recent announcement</a> that the <a href="http://www.automaticstatistician.com/">Automatic Statistician</a> received funding from Google. In 2012 we also applied to Google for a small research award to study this same problem, but didn’t get it. In the interest of extreme openness like Titus Brown or Ethan White, <a href="https://docs.google.com/document/d/1ERL40_LYt4U_vYx2rUxPvIhCrxnpld3dcrtEiCeWn8U/edit">here is the application</a> we submitted to Google. I showed this to a friend who told me the reason we didn’t get it is that our proposal was missing two words: “artificial”, “intelligence”.</em></p>
<p>As Roger pointed out, the most recent batch of Y Combinator startups included a bunch of <a href="http://simplystatistics.org/post/29964925728/data-startups-from-y-combinator-demo-day" target="_blank">data-focused</a> companies. One of these companies, <a href="https://www.statwing.com/" target="_blank">StatWing</a>, is a web-based tool for data analysis that looks like an improvement on SPSS with more plain text, more visualization, and a lot of the technical statistical details “under the hood”. I first read about StatWing on TechCrunch, where the title was <a href="http://techcrunch.com/2012/08/16/how-statwing-makes-it-easier-to-ask-questions-about-data-so-you-dont-have-to-hire-a-statistical-wizard/" target="_blank">“How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”</a>.</p>
<p>StatWing looks super user-friendly and the idea of democratizing statistical analysis so more people can access these ideas is something that appeals to me. But, as one of the aforementioned statistical wizards, this had me freaked out for a minute. Once I looked at the software though, I realized it suffers from the same problem that most “user-friendly” statistical software suffers from. It makes it really easy to screw up a data analysis. It will tell you when something is significant and if you don’t like that it isn’t, you can keep slicing and dicing the data until it is. The key issue behind getting insight from data is knowing when you are fooling yourself with confounders, or small effect sizes, or overfitting. StatWing looks like an improvement on the UI experience of data analysis, but it won’t prevent false positives that plague science and cost business big $$.</p>
<p>So I started thinking about what kind of software would prevent these sorts of problems while still being accessible to a big audience. My idea is a “deterministic statistical machine”. Here is how it works: you input a data set and then specify the question you are asking (is variable Y related to variable X? can I predict Z from W?). Then, depending on your question, it uses a deterministic set of methods to analyze the data: say, regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, <a href="http://www.nature.com/news/the-data-detective-1.10937" target="_blank">maybe even data fudging</a>. It generates a report with a markdown tool and then immediately publishes the result to <a href="http://figshare.com/" target="_blank">figshare</a>.</p>
<p>The advantage is that people can get their data-related questions answered using a standard tool. It does a lot of the “heavy lifting” in checking for potential problems and produces nice reports. But it is a deterministic algorithm for analysis so overfitting, fudging the analysis, etc. are harder. By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around.</p>
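<p>To make the idea concrete, here is a minimal Python sketch of the workflow described above, not an actual implementation of the DSM. Every name in it (<code>METHODS</code>, <code>AUDIT_LOG</code>, <code>analyze</code>) is hypothetical, the only supported question is an association question answered with simple least-squares regression, the outlier rule (|z| &gt; 3) is an arbitrary illustrative choice, and an in-memory list stands in for publishing reports to figshare.</p>

```python
import hashlib
import json
from statistics import mean, pstdev

# Hypothetical fixed menu: exactly one pre-specified method per question type.
METHODS = {"association": "simple linear regression (least squares)"}

# Stand-in for publishing every report to a public archive.
AUDIT_LOG = []


def fingerprint(x, y):
    """Stable hash of the data, so repeated analyses of one data set are linked."""
    payload = json.dumps({"x": x, "y": y}, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_checks(x, y):
    """Pre-specified checks: count missing rows and crude outliers (|z| > 3)."""
    complete = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    ys = [b for _, b in complete]
    center, spread = mean(ys), pstdev(ys)
    n_outliers = sum(1 for b in ys if spread > 0 and abs(b - center) / spread > 3)
    return complete, {"missing_rows": len(x) - len(complete), "outliers": n_outliers}


def least_squares(pairs):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mx, my = mean(xs), mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    slope = sxy / sxx
    return slope, my - slope * mx


def analyze(question, x, y):
    """Run the one fixed method for this question and log the analysis."""
    if question not in METHODS:
        raise ValueError(f"unsupported question: {question}")
    complete, checks = run_checks(x, y)
    slope, intercept = least_squares(complete)
    data_id = fingerprint(x, y)
    AUDIT_LOG.append((data_id, question))
    n_runs = sum(1 for d, _ in AUDIT_LOG if d == data_id)
    report = "\n".join([
        "# DSM report",
        f"- data id: {data_id}",
        f"- method: {METHODS[question]}",
        f"- slope: {slope:.3f}, intercept: {intercept:.3f}",
        f"- rows dropped for missing values: {checks['missing_rows']}",
        f"- flagged outliers: {checks['outliers']}",
        f"- analyses run on this data so far: {n_runs}",
    ])
    return {"slope": slope, "intercept": intercept, "checks": checks,
            "analyses_on_this_data": n_runs, "report": report}
```

<p>Calling <code>analyze("association", x, y)</code> twice on the same data returns the same fit both times, but the second report records that two analyses have now been run on that data set, which is the “paper trail” idea in miniature.</p>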
<p>The DSM should be a web service that is easy to use. Anybody want to build it? Any suggestions for how to do it better?</p>
Thinking Like a Statistician: Social Media and the ‘Spiral of Silence’
2014-12-02T10:00:39+00:00
http://simplystats.github.io/2014/12/02/thinking-like-a-statistician-social-media-and-the-spiral-of-silence
<p>A few months ago the Pew Research Internet Project published a <a href="http://www.pewinternet.org/2014/08/26/social-media-and-the-spiral-of-silence/">paper</a> on social media and the ‘<a href="http://en.wikipedia.org/wiki/Spiral_of_silence">spiral of silence</a>’. Their main finding is that people are less likely to discuss a controversial topic on social media than in person. Unlike others, I did not find this result surprising, perhaps because I think like a statistician.</p>
<p>Shares or retweets of published opinions on controversial political topics - religion, abortion rights, gender inequality, immigration, income inequality, race relations, the role of government, foreign policy, education, climate change - are ubiquitous in social media. These are usually accompanied by passionate statements of strong support or outraged disagreement. Because these are posted by people we elect to follow, we generally agree with what we see on our feeds. Here is a statistical explanation for why many keep silent when they disagree.</p>
<p>We will summarize the <em>political view</em> of an individual as their opinions on the 10 topics listed above. For simplicity I will assume these opinions can be quantified with a left (liberal) to right (conservative) scale. Every individual can therefore be defined by a point in a 10-dimensional space. Once quantified in this way, we can define a political distance between any pair of individuals. In the American landscape there are two clear clusters which I will call the Fox News and MSNBC clusters. As seen in the illustration below, the cluster centers are very far from each other and individuals within the clusters are very close. Each cluster has a very low opinion of the other. A glance through a social media feed will quickly reveal individuals squarely inside one of these clusters. Members of the clusters fearlessly post their opinions on controversial topics as this behavior is rewarded by likes, retweets or supportive comments from others in their cluster. Based on the uniformity of opinion inferred from the comments, one would think that everybody is in one of these two groups. But this is obviously not the case.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/12/plotforpost.png"><img class="aligncenter wp-image-3602 size-large" src="http://simplystatistics.org/wp-content/uploads/2014/12/plotforpost-1024x1007.png" alt="plotforpost" width="396" height="389" srcset="http://simplystatistics.org/wp-content/uploads/2014/12/plotforpost-300x295.png 300w, http://simplystatistics.org/wp-content/uploads/2014/12/plotforpost-1024x1007.png 1024w" sizes="(max-width: 396px) 100vw, 396px" /></a></p>
<p>In the illustration above I include an example of an individual (the green dot) who is outside the two clusters. Although not shown, there are many of these <em>independent thinkers</em>. In our example, this individual is very close to the MSNBC cluster, but not in it. The controversial topic posts in this person’s feed are mostly posted by those in the cluster of closest proximity, and the spiral of silence is due in part to the fact that independent thinkers are uniformly averse to disagreeing publicly. For the mathematical explanation of why, we introduce the concept of a <a href="http://en.wikipedia.org/wiki/Projection_%28mathematics%29"><em>projection</em></a>.</p>
<p>In mathematics, a projection can map a multidimensional point to a smaller, simpler subset. In our illustration, the independent thinker is very close to the MSNBC cluster on all dimensions except one. To use education as an example, let’s say this person supports <a href="http://www.foxnews.com/opinion/2014/10/10/florida-senator-why-am-fighting-for-school-choice-lifeline-for-poor-kids/">school choice</a>. As seen in the illustration, in the projection to the education dimension, that mostly liberal person is squarely in the Fox News cluster. Now imagine that a friend shares an article on <a href="http://www.huffingtonpost.com/diann-woodard/the-corporate-takeover_b_3397091.html">The Corporate Takeover of Public Education</a> along with a passionate statement of approval. Independent thinkers have a feeling that by voicing their dissent, dozens, perhaps hundreds, of strangers on social media (friends of friends for example) will judge them solely on this projection. To make matters worse, public shaming of the independent thinker, for supposedly being a member of the Fox News cluster, will then be rewarded by increased social standing among the MSNBC cluster as evidenced by retweets, likes and supportive comments. In a worst-case scenario for this person, and best-case scenario for the critics, this public shaming goes viral. While the short term rewards for preaching to the echo chamber are clear, there are no apparent incentives for dissent.</p>
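<p>The projection argument can be checked with a few lines of Python. This is only a sketch under made-up assumptions: opinions are scored from -1 (liberal) to +1 (conservative), the two cluster centers sit at -1 and +1 on every topic, and the independent thinker’s scores are invented for illustration. The function names are hypothetical.</p>

```python
import math

# The 10 topics listed earlier in the post, one dimension each.
TOPICS = ["religion", "abortion rights", "gender inequality", "immigration",
          "income inequality", "race relations", "role of government",
          "foreign policy", "education", "climate change"]

# Assumed scale: -1 is fully liberal, +1 is fully conservative.
MSNBC_CENTER = [-1.0] * len(TOPICS)
FOX_CENTER = [1.0] * len(TOPICS)


def distance(a, b, dims=None):
    """Euclidean distance, optionally restricted to a subset of dimensions.

    Restricting to a subset is the same as measuring distance after
    projecting both points onto those dimensions.
    """
    dims = range(len(a)) if dims is None else dims
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in dims))


def nearest_cluster(person, dims=None):
    """Name of the closer cluster center, in the full space or a projection."""
    d_msnbc = distance(person, MSNBC_CENTER, dims)
    d_fox = distance(person, FOX_CENTER, dims)
    return "MSNBC" if d_msnbc < d_fox else "Fox News"


# Invented independent thinker: liberal on nine topics, conservative on education.
edu = TOPICS.index("education")
independent = [-0.9] * len(TOPICS)
independent[edu] = 0.9

full_view = nearest_cluster(independent)             # all 10 dimensions
projected_view = nearest_cluster(independent, [edu])  # education only
print(full_view, "|", projected_view)
```

<p>In the full 10-dimensional space this person is far closer to the MSNBC center, but strangers who see only the education dimension would place the same person in the Fox News cluster.</p>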
<p>The superficial and fast-paced nature of social media is not amenable to nuances and subtleties. Disagreement with the groupthink on one specific topic can therefore get a person labeled as a “neoliberal corporate shill” by the MSNBC cluster or a “godless liberal” by the Fox News one. The irony is that in social media, those politically closest to you will be the ones attaching the unwanted label.</p>
HarvardX Biomedical Data Science Open Online Training Curriculum launches on January 19
2014-11-25T14:01:47+00:00
http://simplystats.github.io/2014/11/25/harvardx-biomedical-data-science-open-online-training-curriculum-launches-on-january-19
<p>We recently received <a href="http://bd2k.nih.gov/FY14/COE/COE.html#sthash.ESkvsyrj.dpbs">funding from the NIH BD2K</a> initiative to develop MOOCs for biomedical data science. Our first offering will be version 2 of my <a href="http://simplystatistics.org/2014/03/31/data-analysis-for-genomic-edx-course/">Data Analysis for Genomics course</a> which will launch on January 19. In this version, the course will be turned into an 8-course series and you can get a certificate in each one of them. The motivation for doing this is to go more in-depth into the different topics and to provide different entry points for students with different levels of expertise. We provide four courses on concepts and skills and four case-study-based courses. We basically broke the original class into the following eight parts:</p>
<ol>
<li><a href="https://www.edx.org/course/statistics-with-r-for-life-sciences-harvardx-ph525-1x#.VHTQgmTF86B">Statistics and R for the Life Sciences</a></li>
<li><a href="https://www.edx.org/course/introduction-to-linear-models-and-matrix-algebra-harvardx-ph525-2x#.VHTQxGTF86B">Introduction to Linear Models and Matrix Algebra</a></li>
<li><a href="https://www.edx.org/course/advanced-statistics-for-the-life-sciences-harvardx-ph525-3x#.VHTQ0GTF86B">Advanced Statistics for the Life Sciences</a></li>
<li><a href="https://www.edx.org/course/introduction-to-bioconductor-harvardx-ph525-4x#.VHTQ22TF86B">Introduction to Bioconductor</a></li>
<li><a href="https://www.edx.org/course/case-study-rna-seq-data-analysis-harvardx-ph525-5x#.VHTQ5mTF86B">Case study: RNA-seq data analysis</a></li>
<li><a href="https://www.edx.org/course/case-study-variant-discovery-and-genotyping-harvardx-ph525-6x#.VHTQ-WTF86B">Case study: Variant Discovery and Genotyping</a></li>
<li><a href="https://www.edx.org/course/case-study-chip-seq-data-analysis-harvardx-ph525-7x#.VHTRBWTF86B">Case study: ChIP-seq data analysis</a></li>
<li><a href="https://www.edx.org/course/case-study-dna-methylation-data-analysis-harvardx-ph525-8x#.VHTREmTF86B">Case study: DNA methylation data analysis</a></li>
</ol>
<p>You can follow the links to enroll. While not required, some familiarity with R and RStudio will serve you well, so consider taking <a href="https://www.coursera.org/course/rprog">Roger’s R course</a> and Jeff’s <a href="https://www.coursera.org/course/datascitoolbox">Toolbox</a> course before delving into this class.</p>
<p>In years 2 and 3 we plan to introduce several other courses covering topics such as Python for data analysis, probability, software engineering, and data visualization, which will be taught by a collaboration between the departments of Biostatistics, Statistics and Computer Science at Harvard.</p>
<p>Announcements will be made here and on twitter: <a href="https://twitter.com/rafalab">@rafalab</a></p>
Data Science Students Predict the Midterm Election Results
2014-11-12T13:37:36+00:00
http://simplystats.github.io/2014/11/12/data-science-students-predict-the-midterm-election-results
<p>As explained in an <a href="http://simplystatistics.org/2014/11/04/538-election-forecasts-made-simple/">earlier post</a>, one of the homework assignments of my <a href="http://cs109.github.io/2014/">CS109</a> class was to predict the results of the midterm election. We created a competition that 49 students entered. The most interesting challenge was to provide intervals for the Republican - Democrat difference in each of the 35 Senate races. Anybody missing more than 2 was eliminated. The average size of the intervals was the tie breaker.</p>
<p>The main teaching objective here was to get students thinking about how to evaluate prediction strategies when chance is involved. To a naive observer, a biased strategy that favored Democrats and correctly called, say, Virginia may look good in comparison to strategies that called it a toss-up. However, a look at the other 34 states would reveal the weakness of this biased strategy. I wanted students to think of procedures that can help distinguish lucky guesses from strategies that universally perform well.</p>
<p>One of the concepts we discussed in class was the systematic bias of polls, which we modeled as a random effect. One can’t infer this bias from polls until after the election passes. By studying previous elections students were able to estimate the SE of this random effect and incorporate it into the calculation of intervals. The realization of this random effect was <a href="http://fivethirtyeight.com/features/the-polls-were-skewed-toward-democrats/">very large</a> in these elections (about +4 for the Democrats), which clearly showed the importance of modeling this source of variability. Strategies that restricted standard error measures to sample estimates from this year’s polls did very poorly. The <a href="http://fivethirtyeight.com/interactives/senate-forecast/">90% credible intervals</a> provided by 538, which I believe does incorporate this, missed 8 of the 35 races (23%). This suggests that they underestimated the variance. Several of our students compared favorably to 538:</p>
<div class="table-responsive">
<table style="width:100%; " class="easy-table easy-table-default " border="0">
<tr>
<th>
name
</th>
<th>
avg bias
</th>
<th>
MSE
</th>
<th>
avg interval size
</th>
<th>
# missed
</th>
</tr>
<tr>
<td>
Manuel Andere
</td>
<td>
-3.9
</td>
<td>
6.9
</td>
<td>
24.1
</td>
<td>
3
</td>
</tr>
<tr>
<td>
Richard Lopez
</td>
<td>
-5.0
</td>
<td>
7.4
</td>
<td>
26.9
</td>
<td>
3
</td>
</tr>
<tr>
<td>
Daniel Sokol
</td>
<td>
-4.5
</td>
<td>
6.4
</td>
<td>
24.1
</td>
<td>
4
</td>
</tr>
<tr>
<td>
Isabella Chiu
</td>
<td>
-5.3
</td>
<td>
9.6
</td>
<td>
26.9
</td>
<td>
6
</td>
</tr>
<tr>
<td>
Denver Mosigisi Ogaro
</td>
<td>
-3.2
</td>
<td>
6.6
</td>
<td>
18.9
</td>
<td>
7
</td>
</tr>
<tr>
<td>
Yu Jiang
</td>
<td>
-5.6
</td>
<td>
9.6
</td>
<td>
22.6
</td>
<td>
7
</td>
</tr>
<tr>
<td>
David Dowey
</td>
<td>
-3.5
</td>
<td>
6.2
</td>
<td>
16.3
</td>
<td>
8
</td>
</tr>
<tr>
<td>
Nate Silver
</td>
<td>
-4.2
</td>
<td>
6.6
</td>
<td>
16.4
</td>
<td>
8
</td>
</tr>
<tr>
<td>
Filip Piasevoli
</td>
<td>
-3.5
</td>
<td>
7.4
</td>
<td>
22.1
</td>
<td>
8
</td>
</tr>
<tr>
<td>
Yapeng Lu
</td>
<td>
-6.5
</td>
<td>
8.2
</td>
<td>
16.5
</td>
<td>
10
</td>
</tr>
<tr>
<td>
David Jacob Lieb
</td>
<td>
-3.7
</td>
<td>
7.2
</td>
<td>
17.1
</td>
<td>
10
</td>
</tr>
<tr>
<td>
Vincent Nguyen
</td>
<td>
-3.8
</td>
<td>
5.9
</td>
<td>
11.1
</td>
<td>
14
</td>
</tr>
</table>
</div>
<p>It is important to note that 538 would have probably increased their interval size had they actively participated in a competition requiring 95% of the intervals to cover. But all in all, students did very well. The majority correctly predicted the Republican takeover. The median mean square error across all 49 participants was 8.2, which was not much worse than 538’s 6.6. Other examples of strategies that I think helped some of these students perform well were the use of creative weighting schemes (based on previous elections) to average polls and the use of splines to estimate trends, which in this particular election were going in the Republicans’ favor.</p>
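<p>As a rough illustration of why modeling the poll bias as a random effect matters, here is a small Python sketch. It is not the students’ code or 538’s method: it assumes a normal approximation, treats the sampling error and the election-wide bias as independent so their variances add, and uses invented numbers (a poll average of +3, a sampling SE of 1.5, a bias SD of 2 estimated from past elections, and an actual result of +7). The function names are hypothetical.</p>

```python
import math
from statistics import NormalDist


def interval(poll_diff, se_polls, sd_bias=0.0, level=0.95):
    """Interval for the Republican - Democrat difference in one race.

    The total standard error combines the sampling error of this year's
    polls with the SD of the election-wide poll bias (the random effect).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    total_se = math.sqrt(se_polls ** 2 + sd_bias ** 2)
    return poll_diff - z * total_se, poll_diff + z * total_se


def count_missed(intervals, results):
    """Number of races whose actual result falls outside its interval."""
    return sum(1 for (lo, hi), r in zip(intervals, results) if not lo <= r <= hi)


# Invented example: polls say +3, but a 4-point poll bias makes the result +7.
actual = 7.0
narrow = interval(3.0, se_polls=1.5)             # ignores the random effect
wide = interval(3.0, se_polls=1.5, sd_bias=2.0)  # includes the random effect

print("narrow interval misses:", count_missed([narrow], [actual]))
print("wide interval misses:", count_missed([wide], [actual]))
```

<p>With these made-up numbers the interval based only on this year’s sampling error misses the result, while the interval that also accounts for the bias term covers it, at the cost of being wider. That is the trade-off the competition’s tie breaker was designed to expose.</p>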
<p>Here are some plots showing results from two of our top performers:</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/11/Rplot.png"><img class="alignnone wp-image-3560" src="http://simplystatistics.org/wp-content/uploads/2014/11/Rplot.png" alt="Rplot" width="714" height="233" srcset="http://simplystatistics.org/wp-content/uploads/2014/11/Rplot-300x98.png 300w, http://simplystatistics.org/wp-content/uploads/2014/11/Rplot-1024x334.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/11/Rplot.png 1674w" sizes="(max-width: 714px) 100vw, 714px" /></a> <a href="http://simplystatistics.org/wp-content/uploads/2014/11/Rplot01.png"><img class="alignnone wp-image-3561" src="http://simplystatistics.org/wp-content/uploads/2014/11/Rplot01.png" alt="Rplot01" width="714" height="233" srcset="http://simplystatistics.org/wp-content/uploads/2014/11/Rplot01-300x98.png 300w, http://simplystatistics.org/wp-content/uploads/2014/11/Rplot01-1024x334.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/11/Rplot01.png 1674w" sizes="(max-width: 714px) 100vw, 714px" /></a></p>
<p>I hope this exercise helped students realize that data science can be both fun and useful. I can’t wait to do this again in 2016.</p>
<p> </p>
<p> </p>
<p> </p>
Sunday data/statistics link roundup (11/9/14)
2014-11-10T01:30:00+00:00
http://simplystats.github.io/2014/11/10/sunday-datastatistics-link-roundup-11914
<p>So I’m a day late, but you know, I got a new kid and stuff…</p>
<ol>
<li><a href="http://www.newyorker.com/science/maria-konnikova/moocs-failure-solutions">The New Yorker hating on MOOCs</a>, they mention all the usual stuff. Including the <a href="http://simplystatistics.org/2013/07/19/the-failure-of-moocs-and-the-ecological-fallacy/">really poorly designed San Jose State experiment</a>. I think this deserves a longer post, but this is definitely a case where people are looking at MOOCs on the <a href="http://en.wikipedia.org/wiki/Hype_cycle">wrong part of the hype curve</a>. MOOCs won’t solve all possible education problems, but they are hugely helpful to many people and writing them off is a little silly (via Rafa).</li>
<li>My colleague Dan S. is <a href="http://www.eventzilla.net/web/event?eventid=2139054537">teaching a missing data workshop</a> here at Hopkins next week (via Dan S.)</li>
<li>A couple of cool YouTube videos explaining <a href="http://www.youtube.com/watch?v=YmOsDTczOFs">how the normal distribution sounds</a> and the <a href="http://www.youtube.com/watch?v=F-I-BVqMiNI">Pareto principle with paperclips</a> (via Presh T., pair with the <a href="http://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/">80/20 rule of statistical methods development</a>)</li>
<li>If you aren’t following <a href="https://twitter.com/ResearchMark">Research Wahlberg</a>, you aren’t on academic twitter.</li>
<li>I followed <a href="https://twitter.com/hashtag/biodata14?src=hash">#biodata14</a> closely. I think having a meeting on Biological Big Data is a great idea and many of the discussion leaders are people I admire a ton. I also am a big fan of Mike S. I have to say I was pretty bummed that more statisticians weren’t invited (we like to party too!).</li>
<li>Our data science specialization generates <a href="http://rpubs.com/hadley/39122">almost 1,000 new R GitHub repos a month</a>! Roger and I are in a neck and neck race to be the person who has taught the most people statistics/data science in the history of the world.</li>
<li>The RStudio guys have also put together what looks like a <a href="http://blog.rstudio.org/2014/11/06/introduction-to-data-science-with-r-video-workshop/">great course</a> similar in spirit to our Data Science Specialization. The RStudio folks have been <em>super</em> supportive of the DSS and we assume anything they make will be awesome.</li>
<li><a href="http://datacarpentry.github.io/blog/2014/11/05/announce/">Congrats to Data Carpentry</a> and <a href="https://twitter.com/tracykteal">@tracykteal</a> on their funding from the Moore Foundation!</li>
</ol>
<blockquote class="twitter-tweet" width="550">
<p>
Sup. Party's over. Keep moving. <a href="http://t.co/R8sTbKzpF8">pic.twitter.com/R8sTbKzpF8</a>
</p>
<p>
— Research Wahlberg (@ResearchMark) <a href="https://twitter.com/ResearchMark/status/530109209543999489">November 5, 2014</a>
</p>
</blockquote>
Time varying causality in n=1 experiments with applications to newborn care
2014-11-05T13:13:11+00:00
http://simplystats.github.io/2014/11/05/time-varying-causality-in-n1-experiments-with-applications-to-newborn-care
<p>We just had our second son about a week ago and I’ve been hanging out at home with him and the rest of my family. It has reminded me of a few things from when we had our first son. First, newborns are tiny and super-duper adorable. Second, daylight saving time means gaining an extra hour of sleep for many people, but for people with young children it is more like (via Reddit):</p>
<p><a href="http://www.reddit.com/r/funny/comments/2l25vx/gain_an_extra_hour_of_sleep_waityou_have_toddlers/"><img class="aligncenter" src="http://i.imgur.com/1HWQIPa.gif" alt="" width="480" height="270" /></a></p>
<p> </p>
<p>Third, taking care of a newborn is like performing a series of n=1 experiments where the causal structure of the problem changes every time you perform an experiment.</p>
<p>Suppose, hypothetically, that your newborn has just had something to eat and it is 2am (again, just hypothetically). You are hoping he’ll go back down to sleep so you can catch some shut-eye yourself. But your baby just can’t sleep and seems uncomfortable. Here is a partial list of causes for this: (1) dirty diaper, (2) needs to burp, (3) still hungry, (4) not tired, (5) over tired, (6) has gas, (7) just chillin. So you start going down the list and trying to address each of the potential causes of late-night sleeplessness: (1) check diaper, (2) try burping, (3) feed him again, etc. etc. Then, miraculously, one works and the little guy falls asleep.</p>
<p>It is interesting how the natural human reaction to this is to reorder the potential causes of sleeplessness and start with the thing that worked next time. Then we often get frustrated when the same thing doesn’t work the next time. You can’t help it: you did an experiment, you have some data, you want to use it. But the reality is that the next time may have nothing to do with the first.</p>
<p>I’m in the process of collecting some very poorly annotated data collected exclusively at night if anyone wants to write a dissertation on this problem.</p>
538 election forecasts made simple
2014-11-04T17:12:16+00:00
http://simplystats.github.io/2014/11/04/538-election-forecasts-made-simple
<p>Nate Silver does a <a href="http://fivethirtyeight.com/features/how-the-fivethirtyeight-senate-forecast-model-works/">great job</a> of explaining his forecast model to laypeople. However, as a statistician I’ve always wanted to know more details. After preparing a “<span class="s2"><a href="http://cs109.github.io/2014/pages/homework.html">predict the midterm elections</a>“ </span>homework for my <a href="http://cs109.github.io/2014"><span class="s2">data science class</span></a> I have a better idea of what is going on.</p>
<p><a href="http://simplystatistics.org/html/midterm2012.html">Here</a> is my best attempt at explaining the ideas of 538 using formulas and data. <del>And <a href="http://rafalab.jhsph.edu/simplystats/midterm2012.Rmd">here</a> is the R markdown.</del></p>
<p> </p>
<p> </p>
<p> </p>
Sunday data/statistics link roundup (11/2/14)
2014-11-02T19:16:22+00:00
http://simplystats.github.io/2014/11/02/sunday-datastatistics-link-roundup-11214
<p>Better late than never! If you have something cool to share, please continue to email it to me with subject line “Sunday links”.</p>
<ol>
<li><a href="http://www.drivendata.org/">DrivenData</a> is a Kaggle-like site but for social good. I like the principle of using data for societal benefit, since there are so many ways it seems to be used for nefarious purposes (via Rafa).</li>
<li>This article <a href="http://www.nytimes.com/2014/11/02/opinion/sunday/academic-science-isnt-sexist.html?ref=opinion&_r=2">claiming academic science isn’t sexist</a> has been widely panned; Emily Willingham <a href="http://www.emilywillinghamphd.com/2014/11/academic-science-is-sexist-we-do-have.html">pretty much destroys it here</a> (via Sherri R.). The thing that is interesting about this article is the way that it tries to use data to give the appearance of empiricism, while using language to try to skew the results. Is it just me or is this totally bizarre in light of the NYT also <a href="http://www.nytimes.com/2014/11/02/us/handling-of-sexual-harassment-case-poses-larger-questions-at-yale.html?smid=tw-share">publishing this piece</a> about academic sexual harassment at Yale?</li>
<li>Noah Smith, an economist, <a href="http://www.bloombergview.com/articles/2014-10-29/bad-data-can-make-us-smarter">tries to summarize</a> the problem with “most research being wrong”. It is an interesting take, I wonder if he read Roger’s piece <a href="http://simplystatistics.org/2014/10/15/dear-laboratory-scientists-welcome-to-my-world/">saying almost exactly the same thing </a> like the week before? He also mentions it is hard to quantify the rate of false discoveries in science, maybe he should <a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt007.abstract">read our paper</a>?</li>
<li>Nature <a href="http://www.nature.com/news/code-share-1.16232">now requests</a> that code sharing occur “where possible” (via Steven S.)</li>
<li>Great <a href="http://imgur.com/gallery/ZpgQz">cartoons</a>, I particularly like the one about replication (via Steven S.).</li>
</ol>
Why I support statisticians and their resistance to hype
2014-10-28T10:19:01+00:00
http://simplystats.github.io/2014/10/28/why-i-support-statisticians-and-their-resistance-to-hype
<p>Despite Statistics being the most mature data related discipline, statisticians <a href="http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/">have not fared well</a> in terms of being selected for funding or leadership positions in the new initiatives brought about by the increasing interest in data. Just to give one example (<a href="http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/">Jeff</a> and <a href="http://www.chalmers.se/en/areas-of-advance/ict/calendar/Pages/Terry-Speed.aspx">Terry Speed</a> give many more) the <a href="http://www.nitrd.gov/nitrdgroups/index.php?title=White_House_Big_Data_Partners_Workshop">White House Big Data Partners Workshop</a> had 19 members of which 0 were statisticians. The statistical community is clearly worried about this predicament and there is widespread consensus that we need to be <a href="http://simplystatistics.org/2012/08/14/statistics-statisticians-need-better-marketing/">better at marketing</a>. Although I agree that only good can come from better communicating what we do, it is also important to continue doing one of the things we do best: resisting the hype and being realistic about data.</p>
<p>This week, after reading Mike Jordan’s <a href="http://www.reddit.com/r/MachineLearning/comments/2fxi6v/ama_michael_i_jordan">reddit ask me anything</a>, I was reminded of exactly how much I admire this quality in statisticians. From reading the interview one learns about instances where hype has led to confusion, and how getting past this confusion helps us better understand and consequently appreciate the importance of his field. For the past 30 years, Mike Jordan has been one of the most prolific academics working in the areas that today are receiving increased attention. Yet, you won’t find a hyped-up press release coming out of his lab. In fact when a <a href="http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts">journalist tried to hype up Jordan’s critique of hype</a>, Jordan <a href="https://amplab.cs.berkeley.edu/2014/10/22/big-data-hype-the-media-and-other-provocative-words-to-put-in-a-title/">called out the author</a>.</p>
<p>Assessing the current situation with data initiatives it is hard not to conclude that hype is being rewarded. Many statisticians have come to the sad realization that by being cautious and skeptical, we may be losing out on funding possibilities and leadership roles. However, I remain very much upbeat about our discipline. First, being skeptical and cautious has actually led to many important contributions. An important example is how randomized controlled experiments changed how medical procedures are evaluated. A more recent one is the concept of FDR, which helps control false discoveries in, for example, high-throughput experiments. Second, many of us continue to work in the interface with real world applications placing us in a good position to make relevant contributions. Third, despite the failures alluded to above, we continue to successfully find ways to fund our work. Although resisting the hype has cost us in the short term, we will continue to produce methods that will be useful in the long term, as we have been doing for decades. Our methods will still be used when today’s hyped up press releases are long forgotten.</p>
<p> </p>
<p> </p>
Return of the sunday links! (10/26/14)
2014-10-26T10:00:31+00:00
http://simplystats.github.io/2014/10/26/return-of-the-sunday-links-102614
<p>New look for the blog and bringing back the links. If you have something that you’d like included in the Sunday links, email me and let me know. If you use “Sunday Links” as the title of the message, I’ll be more likely to find it when I search my Gmail.</p>
<ol>
<li>Thomas L. does a more technical post on <a href="http://notstatschat.tumblr.com/post/100893932596/semiparametric-efficiency-and-nearly-true-models">semi-parametric efficiency</a>, normally I’m a data n’ applications guy, but I love these in depth posts, especially when the papers remind me of all the things I studied at my <a href="http://www.biostat.washington.edu/">alma mater</a>.</li>
<li>I am one of those people who only knows a tiny bit about Docker, but hears about it all the time. That being said, after I read about <a href="http://dirk.eddelbuettel.com/blog/2014/10/23/#introducing_rocker">Rocker</a>, I got pretty excited.</li>
<li>Hadley W.’s <a href="https://www.biostars.org/p/115481/">favorite tools</a>, seems like that dude likes RStudio for some reason….(me too)</li>
<li><a href="http://priorprobability.com/2014/10/22/chess-piece-survival-rates/">A cool visualization</a> of chess piece survival rates.</li>
<li><a href="http://espn.go.com/video/clip?id=11694550">A short movie by 538</a> about statistics and the battle between Deep Blue and Garry Kasparov. Where’s the popcorn?</li>
<li>Twitter engineering released an R package for <a href="https://blog.twitter.com/2014/breakout-detection-in-the-wild">detecting breakouts</a>. I wonder how <a href="http://www.bioconductor.org/packages/release/bioc/html/DNAcopy.html">circular binary segmentation</a> would do?</li>
</ol>
<p> </p>
<p> </p>
An interactive visualization to teach about the curse of dimensionality
2014-10-24T11:14:43+00:00
http://simplystats.github.io/2014/10/24/an-interactive-visualization-to-teach-about-the-curse-of-dimensionality
<p>I recently was contacted for an interview about the curse of dimensionality. During the course of the conversation, I realized how hard it is to explain the curse to a general audience. One of the best descriptions I could come up with was sampling points from a unit line, square, cube, etc. and then keeping only the points that fall inside a region with a fixed side length: as the dimension grows, you capture fewer and fewer points. As I was saying this, I realized it is a pretty bad way to explain the curse of dimensionality in words. But there was potentially a cool data visualization that would illustrate the idea. I went to my student <a href="http://www.biostat.jhsph.edu/~prpatil/">Prasad</a>, our resident interactive viz design expert, to see if he could build it for me. He came up with this cool Shiny app where you can simulate a number of points (n) and then fix a side length for 1-D, 2-D, 3-D, and 4-D and see how many points you capture in a cube of that length in that dimension. You can find the <a href="https://prpatil.shinyapps.io/cod_app/">full app here</a>.</p>
<p> </p>
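<p>For readers who want to try the idea without the app, here is a minimal sketch in Python. It is not the code behind the Shiny app; the function name <code>fraction_captured</code>, the default seed, and the choice of a sub-cube anchored at the origin are assumptions made for illustration. It simulates n uniform points in the unit cube and reports the fraction that land inside a sub-cube of fixed side length, next to the expected fraction, which is the side length raised to the power of the dimension.</p>

```python
import random


def fraction_captured(n, side, dim, seed=0):
    """Simulate n uniform points in the dim-dimensional unit cube and
    return the fraction that fall inside the sub-cube [0, side]^dim."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    captured = 0
    for _ in range(n):
        point = [rng.random() for _ in range(dim)]
        # A point is captured only if every coordinate is within the side length.
        if all(coord <= side for coord in point):
            captured += 1
    return captured / n


if __name__ == "__main__":
    side = 0.5
    for dim in (1, 2, 3, 4):
        simulated = fraction_captured(10_000, side, dim)
        expected = side ** dim  # volume of the sub-cube
        print(f"{dim}-D: simulated {simulated:.3f}, expected {expected:.4f}")
```

<p>With the side length held fixed, the captured fraction shrinks geometrically with the dimension, which is the effect the app lets you explore interactively.</p>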
Vote on simply statistics new logo design
2014-10-22T10:38:10+00:00
http://simplystats.github.io/2014/10/22/vote-on-simply-statistics-new-logo-design
<p>As you can tell, we have given the Simply Stats blog a little style update. It should be more readable on phones or tablets now. We are also about to get a new logo. We are down to the last couple of choices and can’t decide. Since we are statisticians, we thought we’d collect some data. <a href="http://99designs.com/logo-design/vote-3datw8">Here is the link</a> to the poll. Let us know</p>
Thinking like a statistician: don't judge a society by its internet comments
2014-10-20T13:59:03+00:00
http://simplystats.github.io/2014/10/20/thinking-like-a-statistician-dont-judge-a-society-by-its-internet-comments
<p>In a previous <a href="http://simplystatistics.org/2014/01/17/missing-not-at-random-data-makes-some-facebook-users-feel-sad/">post</a> I explained how thinking like a statistician can help you avoid <a href="http://www.npr.org/2014/01/09/261108836/many-younger-facebook-users-unfriend-the-network">feeling sad after using Facebook.</a> The basic point was that <em>missing not at random</em> (MNAR) data on your friends’ profiles (showing only the best parts of their life) can result in the biased view that your life is boring and uninspiring in comparison. A similar argument can be made to avoid losing faith in humanity after reading internet comments or anonymous tweets, one of the most depressing activities that I have voluntarily engaged in. If you want to see proof that racism, xenophobia, sexism and homophobia are still very much alive, read the unfiltered comments sections of articles related to race, immigration, gender or gay rights. However, as a statistician, I remain optimistic about our society after realizing how extremely biased these particular MNAR data can be.</p>
<p>Assume we could summarize an individual’s “righteousness” with a numerical index. I realize this is a gross oversimplification, but bear with me. Below is my view on the distribution of this index across all members of our society.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/10/IMG_5842.jpg"><img class="aligncenter wp-image-3409" src="http://simplystatistics.org/wp-content/uploads/2014/10/IMG_5842.jpg" alt="IMG_5842" width="442" height="463" srcset="http://simplystatistics.org/wp-content/uploads/2014/10/IMG_5842-286x300.jpg 286w, http://simplystatistics.org/wp-content/uploads/2014/10/IMG_5842-977x1024.jpg 977w, http://simplystatistics.org/wp-content/uploads/2014/10/IMG_5842.jpg 2139w" sizes="(max-width: 442px) 100vw, 442px" /></a></p>
<p>Note that the distribution is not bimodal. This means there is no gap between good and evil, instead we have a continuum. Although there is variability, and we do have some extreme outliers on both sides of the distribution, most of us are much closer to the median than we like to believe. The offending internet commentators represent a very small proportion (the “bad” tail shown in red). But in a large population, such as internet users, this extremely small proportion can be quite numerous and gives us a biased view.</p>
<p>There is one more level of variability here that introduces biases. Since internet comments can be anonymous, we get an unprecedentedly large glimpse into people’s opinions and thoughts. We assign a “righteousness” index to our thoughts and opinions and include it in the scatter plot shown in the figure above. Note that this index exhibits variability within individuals: even the best people have the occasional bad thought. The points in red represent thoughts so awful that no one, not even the worst people, would ever express them publicly. The red points give us an overly pessimistic estimate of the individuals that are posting these comments, which exacerbates our already pessimistic view due to a non-representative sample of individuals.</p>
<p>I hope that thinking like a statistician will help the media and social networks put in statistical perspective the awful tweets or internet comments that represent the worst of the worst. These actually provide little to no information on humanity’s distribution of righteousness, which I think is moving consistently, albeit slowly, towards the good.</p>
<p> </p>
<p> </p>
Bayes Rule in an animated gif
2014-10-17T10:00:41+00:00
http://simplystats.github.io/2014/10/17/bayes-rule-in-a-gif
<p>Say Pr(A)=5% is the prevalence of a disease (% of red dots on top fig). Each individual is given a test with accuracy Pr(B|A)=Pr(no B|no A) = 90%. The O in the middle turns into an X when the test fails. The rate of Xs is 1-Pr(B|A). We want to know the probability of having the disease if you tested positive: Pr(A|B). Many find it counterintuitive that this probability is much lower than 90%; this animated gif is meant to help.</p>
<p><img src="http://rafalab.jhsph.edu/simplystats/bayes.gif" alt="" width="600" /></p>
<p>The individual being tested is highlighted with a moving black circle. Pr(B) of these will test positive: we put these in the bottom left and the rest in the bottom right. The proportion of red points that end up in the bottom left is the proportion of red points Pr(A) with a positive test Pr(B|A), thus Pr(B|A) x Pr(A). Pr(A|B), or the proportion of reds in the bottom left, is therefore Pr(B|A) x Pr(A) divided by Pr(B): Pr(A|B)=Pr(B|A) x Pr(A) / Pr(B)</p>
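<p>To put numbers on the formula, here is a small Python sketch using the values from this post: a prevalence of 5% and a test that is right 90% of the time for both diseased and healthy individuals. The function name <code>posterior</code> and its argument names are assumptions chosen for illustration. Pr(B) is not given directly, so the sketch gets it from the law of total probability: true positives plus false positives.</p>

```python
def posterior(prevalence, sensitivity, specificity):
    """Pr(A|B): probability of disease given a positive test.

    prevalence  = Pr(A)
    sensitivity = Pr(B|A)
    specificity = Pr(no B|no A)
    """
    # Pr(B) = Pr(B|A) Pr(A) + Pr(B|no A) Pr(no A)
    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    # Bayes rule: Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
    return sensitivity * prevalence / p_positive


if __name__ == "__main__":
    p = posterior(prevalence=0.05, sensitivity=0.90, specificity=0.90)
    print(f"Pr(A|B) = {p:.3f}")  # 0.045 / 0.14, about 0.321
```

<p>So with these numbers only about a third of the individuals who test positive actually have the disease, because the healthy 95% of the population produces more false positives (0.095) than the diseased 5% produces true positives (0.045).</p>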
<p>ps - Is this a <a href="http://simplystatistics.org/2014/10/13/as-an-applied-statistician-i-find-the-frequentists-versus-bayesians-debate-completely-inconsequential/">frequentist or Bayesian</a> gif?</p>
Creating the field of evidence based data analysis - do people know what a p-value looks like?
2014-10-16T15:00:34+00:00
http://simplystats.github.io/2014/10/16/creating-the-field-of-evidence-based-data-analysis-do-people-know-what-a-p-value-looks-like
<p>In the medical sciences, there is a discipline called “<a href="http://en.wikipedia.org/wiki/Evidence-based_medicine">evidence based medicine</a>”. The basic idea is to study the actual practice of medicine using experimental techniques. The reason is that while we may have good experimental evidence about specific medicines or practices, the global behavior and execution of medical practice may also matter. There have been some success stories from this approach and also backlash from physicians who <a href="http://onlinelibrary.wiley.com/doi/10.1111/j.1523-536X.1996.tb00491.x/abstract">don’t like to be told how to practice medicine.</a> However, on the whole it is a valuable and interesting scientific exercise.</p>
<p>Roger introduced the idea of <a href="http://simplystatistics.org/2013/08/28/evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-2/">evidence based data analysis</a> in a previous post. The basic idea is to study the actual practice and behavior of data analysts to identify how analysts behave. There is a strong history of this type of research within the data visualization community <a href="http://www.stat.purdue.edu/~wsc/">starting with Bill Cleveland</a> and extending forward to work by <a href="http://dicook.github.io/cv.html">Diane Cook</a>, <a href="http://vis.stanford.edu/papers/crowdsourcing-graphical-perception">Jeffrey Heer</a>, and others.</p>
<p><a href="https://peerj.com/articles/589/">Today we published</a> a large-scale evidence based data analysis randomized trial. Two of the most common data analysis tasks (for better or worse) are exploratory analysis and the identification of statistically significant results. Di Cook’s group calls this idea <a href="http://stat.wharton.upenn.edu/~buja/PAPERS/Wickham-Cook-Hofmann-Buja-IEEE-TransVizCompGraphics_2010-Graphical%20Inference%20for%20Infovis.pdf">“graphical inference” or “visual significance”</a> and they have studied humans’ ability to detect significance in <a href="http://www.tandfonline.com/doi/abs/10.1080/01621459.2013.808157">this paper</a> and how it <a href="http://arxiv.org/abs/1408.1974">associates with demographics and visual characteristics of the plot.</a></p>
<p>We performed a randomized study to determine if data analysts with basic training could identify statistically significant relationships. Or as the first author put it in a tweet:</p>
<blockquote class="twitter-tweet" width="550">
<p>
First paper just dropped! Can you tell the difference between these two plots? <a href="https://t.co/Lng0FWI0XY">https://t.co/Lng0FWI0XY</a> <a href="http://t.co/zFCwwcxaAX">pic.twitter.com/zFCwwcxaAX</a>
</p>
<p>
— Aaron Fisher (@PrfFarnsworth) <a href="https://twitter.com/PrfFarnsworth/status/522790724774141952">October 16, 2014</a>
</p>
</blockquote>
<p>What we found was that people were pretty bad at detecting statistically significant results, but that over multiple trials they could improve. This is a tentative first step toward understanding how the general practice of data analysis works. If you want to play around and see how good you are at seeing p-values we also built this interactive Shiny app. If you don’t see the app you can also go to the <a href="http://glimmer.rstudio.com/afisher/EDA/">Shiny app page here.</a></p>
<p> </p>
Dear Laboratory Scientists: Welcome to My World
2014-10-15T19:42:03+00:00
http://simplystats.github.io/2014/10/15/dear-laboratory-scientists-welcome-to-my-world
<p>Consider the following question: Is there a reproducibility/replication crisis in epidemiology?</p>
<p>I think there are only two possible ways to answer that question:</p>
<ol>
<li>No, there is no replication crisis in epidemiology because no one ever believes the result of an epidemiological study unless it has been replicated a minimum of 1,000 times in every possible population.</li>
<li>Yes, there is a replication crisis in epidemiology, and it started in 1854 when <a href="http://www.ph.ucla.edu/epi/snow/snowbook2.html">John Snow</a> inferred, from observational data, that cholera was spread via contaminated water obtained from public pumps.</li>
</ol>
<p>If you chose (2), then I don’t think you are allowed to call it a “crisis” because I think by definition, a crisis cannot last 160 years. In that case, it’s more of a chronic disease.</p>
<p>I had an interesting conversation last week with a prominent environmental epidemiologist over the replication crisis that has been reported about extensively in the scientific and popular press. In his view, he felt this was less of an issue in epidemiology because epidemiologists never really had the luxury of people (or at least fellow scientists) believing their results because of their general inability to conduct controlled experiments.</p>
<p>Given the observational nature of most environmental epidemiological studies, it’s generally accepted in the community that no single study can be considered causal, and that many replications of a finding are needed to establish a causal connection. Even the popular press knows now to include the phrase “correlation does not equal causation” when reporting on an observational study. The work of <a href="http://en.wikipedia.org/wiki/Bradford_Hill_criteria">Sir Austin Bradford Hill</a> essentially codifies the standard of evidence needed to draw causal conclusions from observational studies.</p>
<p>So if “correlation does not equal causation”, it begs the question, what <em>does</em> equal causation? Many would argue that a controlled experiment, whether it’s a randomized trial or a laboratory experiment, equals causation. But people who work in this area have long known that while controlled experiments do assign the treatment or exposure, there are still many other elements of the experiment that are <em>not</em> controlled.</p>
<p>For example, if subjects drop out of a randomized trial, you now essentially have an observational study (or at least a <a href="http://amstat.tandfonline.com/doi/abs/10.1198/016214503000071#.VD8EqL5DuoY">“broken” randomized trial</a>). If you are conducting a laboratory experiment and all of the treatment samples are measured with one technology and all of the control samples are measured with a different technology (perhaps because of a lack of blinding), then you still have confounding.</p>
<p>The correct statement is not “correlation does not equal causation” but rather “no single study equals causation”, regardless of whether it was an observational study or a controlled experiment. Of course, a very tightly controlled and rigorously conducted controlled experiment will be more valuable than a similarly conducted observational study. But in general, all studies should simply be considered as further evidence for or against an hypothesis. We should not be lulled into thinking that any single study about an important question can truly be definitive.</p>
I declare the Bayesian vs. Frequentist debate over for data scientists
2014-10-13T10:45:44+00:00
http://simplystats.github.io/2014/10/13/as-an-applied-statistician-i-find-the-frequentists-versus-bayesians-debate-completely-inconsequential
<p>In a recent New York Times <a href="http://www.nytimes.com/2014/09/30/science/the-odds-continually-updated.html?_r=1">article</a> the “Frequentists versus Bayesians” debate was brought up once again. I agree with Roger:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
NYT wants to create a battle b/w Bayesians and Frequentists but it's all crap. Statisticians develop techniques. <a href="http://t.co/736gbqZGuq">http://t.co/736gbqZGuq</a>
</p>
<p>
— Roger D. Peng (@rdpeng) <a href="https://twitter.com/rdpeng/status/516739602024267776">September 30, 2014</a>
</p>
</blockquote>
<p>Because the real story (or non-story) is way too boring to sell newspapers, the author resorted to a sensationalist narrative that went something like this: “Evil and/or stupid frequentists were ready to let a fisherman die; the persecuted Bayesian heroes saved him.” This piece adds to the growing number of writings blaming frequentist statistics for the so-called reproducibility crisis in science. If there is something Roger, <a href="http://simplystatistics.org/2013/11/26/statistical-zealots/">Jeff</a> and <a href="http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/">I</a> agree on, it is that this debate is <a href="http://noahpinionblog.blogspot.com/2013/01/bayesian-vs-frequentist-is-there-any.html">not constructive</a>. As <a href="http://arxiv.org/pdf/1106.2895v2.pdf">Rob Kass</a> suggests, it’s time to move on to pragmatism. Here I follow up Jeff’s <a href="http://simplystatistics.org/2014/09/30/you-think-p-values-are-bad-i-say-show-me-the-data/">recent post</a> by sharing related thoughts brought about by two decades of practicing applied statistics, and I hope they help put this unhelpful debate to rest.</p>
<p>Applied statisticians help answer questions with data. How should I design a roulette so my casino makes $? Does this fertilizer increase crop yield? Does streptomycin cure pulmonary tuberculosis? Does smoking cause cancer? What movie would this user enjoy? Which baseball player should the Red Sox give a contract to? Should this patient receive chemotherapy? Our involvement typically means analyzing data and designing experiments. To do this we use a variety of techniques that have been successfully applied in the past and that we have mathematically shown to have desirable properties. Some of these tools are frequentist, some of them are Bayesian, some could be argued to be both, and some don’t even use probability. The Casino will do just fine with frequentist statistics, while the baseball team might want to apply a Bayesian approach to avoid overpaying for players that have simply been lucky.</p>
<p>It is also important to remember that good applied statisticians also <strong>think</strong>. They don’t apply techniques blindly or religiously. If applied statisticians, regardless of their philosophical bent, are asked if the sun just exploded, they would not design an experiment like the one depicted in this popular XKCD cartoon.</p>
<p><a href="http://xkcd.com/1132/"><img class="aligncenter" src="http://imgs.xkcd.com/comics/frequentists_vs_bayesians.png" alt="" width="234" height="355" /></a></p>
<p>Only someone that does not know how to think like a statistician would act like the frequentists in the cartoon. Unfortunately we do have such people analyzing data. But their choice of technique is not the problem, it’s their lack of critical thinking. However, even the most frequentist-appearing applied statistician understands Bayes’ rule and will adopt the Bayesian approach when appropriate. In the above XKCD example, any respectable applied statistician would not even bother examining the data (the dice roll), because they would assign a probability of 0 to the sun exploding (the empirical prior based on the fact that they are alive). However, superficial propositions arguing for wider adoption of Bayesian methods fail to realize that using these techniques in an actual data analysis project is very different from simply thinking like a Bayesian. To do this we have to represent our intuition or prior knowledge (or whatever you want to call it) with mathematical formulae. When theoretical Bayesians pick these priors, they mainly have mathematical/computational considerations in mind. In practice we can’t afford this luxury: a bad prior will render the analysis useless regardless of its convenient mathematical properties.</p>
<p>Despite these challenges, applied statisticians regularly use Bayesian techniques successfully. In one of the fields I work in, Genomics, empirical Bayes techniques are widely used. In <a href="http://www.ncbi.nlm.nih.gov/pubmed/16646809">this</a> popular application of empirical Bayes we use data from all genes to improve the precision of estimates obtained for specific genes. However, the most widely used output of the software implementation is not a posterior probability. Instead, an empirical Bayes technique is used to improve the estimate of the standard error used in a good ol’ fashioned t-test. This idea has changed the way thousands of Biologists search for differentially expressed genes and is, in my opinion, one of the most important contributions of Statistics to Genomics. Is this approach frequentist? Bayesian? To this applied statistician it doesn’t really matter.</p>
<p>For those arguing that simply switching to a Bayesian philosophy will improve the current state of affairs, let’s consider the smoking and cancer example. Today there is wide agreement that smoking causes lung cancer. Without a clear deductive biochemical/physiological argument and without the possibility of a randomized trial, this connection was established with a series of observational studies. Most, if not all, of the associated data analyses were based on frequentist techniques. None of the reported confidence intervals on their own established the consensus. Instead, as usually happens in science, a long series of studies supporting this conclusion was needed. How exactly would this have been different with a strictly Bayesian approach? Would a single paper have been enough? Would using priors have helped given the “expert knowledge” at the time (see below)?</p>
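<p>To make the shrinkage idea concrete, here is a minimal sketch in Python of a “moderated” t-statistic. It is not the software’s actual implementation: the prior variance and prior degrees of freedom are hypothetical values supplied by hand (in practice they would be estimated from all genes), the three log-ratios are made up, and the function name <code>moderated_t</code> is mine. The only point it illustrates is that the variance in the denominator becomes a weighted average of the gene’s own variance and a value shared across genes.</p>

```python
import math
import statistics


def moderated_t(values, prior_var, prior_df):
    """One-sample t-statistic with a shrunken variance estimate.

    prior_var and prior_df are assumed to be given; with prior_df = 0
    this reduces to the ordinary t-statistic.
    """
    n = len(values)
    df = n - 1
    sample_var = statistics.variance(values)
    # Weighted average of the shared prior variance and this gene's variance.
    shrunk_var = (prior_df * prior_var + df * sample_var) / (prior_df + df)
    return statistics.mean(values) / math.sqrt(shrunk_var / n)


# Hypothetical log-ratios for one gene whose sample variance is tiny by chance.
gene = [1.00, 1.01, 0.99]

ordinary = moderated_t(gene, prior_var=0.0, prior_df=0)
moderated = moderated_t(gene, prior_var=0.25, prior_df=4)

print(f"ordinary t: {ordinary:.1f}")
print(f"moderated t: {moderated:.1f}")
```

<p>With only three measurements, the ordinary statistic is huge simply because the sample variance happens to be small. Borrowing information from the (assumed) prior pulls it back to a far more modest value, yet the final number is still read like a t-statistic.</p>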
<p><img src="http://cdn.saveourbones.com/wp-content/uploads/smoking_doctor.jpg" width="234" height="355" class="aligncenter" alt="" /></p>
<p>And how would the Bayesian analysis performed by tobacco companies shape the debate? Ultimately, I think applied statisticians would have made an equally convincing case against smoking with Bayesian posteriors as opposed to frequentist confidence intervals. Going forward I hope applied statisticians continue to be free to use whatever techniques they see fit and that critical thinking about data continues to be what distinguishes us. Imposing Bayesian or frequentist philosophy on us would be a disaster.</p>
Data science can't be point and click
2014-10-09T16:16:17+00:00
http://simplystats.github.io/2014/10/09/data-science-cant-be-point-and-click
<p>As data becomes cheaper and cheaper there are more people that want to be able to analyze and interpret that data. I see more and more that people are creating tools to accommodate folks who aren’t trained but who still want to look at data <em>right now</em>. While I admire the principle of this approach - we need to democratize access to data - I think it is the most dangerous way to solve the problem.</p>
<p>The reason is that, especially with big data, it is very easy to find things like this with point and click tools:</p>
<div style="width: 670px" class="wp-caption aligncenter">
<a href="http://www.tylervigen.com/view_correlation?id=1597"><img class="" src="http://www.tylervigen.com/correlation_project/correlation_images/us-spending-on-science-space-and-technology_suicides-by-hanging-strangulation-and-suffocation.png" alt="" width="660" height="230" /></a>
<p class="wp-caption-text">
US spending on science, space, and technology correlates with Suicides by hanging, strangulation and suffocation (http://www.tylervigen.com/view_correlation?id=1597)
</p>
</div>
<p>The danger with using point and click tools is that it is very hard to automate the identification of warning signs that seasoned analysts get when they have their hands in the data. These may be spurious correlations like the plot above, issues with data quality, missing confounders, or implausible results. These things are much easier to spot when analysis is being done interactively. Point and click software is also getting better about reproducibility, but it is still a major problem for many interfaces.</p>
<p>Despite these issues, point and click software is still all the rage. I understand the sentiment: there is a bunch of data just lying there and <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">there aren’t enough people to analyze it expertly</a>. But you wouldn’t want me to operate on you using point and click surgery software. You’d want a surgeon who has practiced on real people and knows what to do when she has an artery in her hand. In the same way, I think point and click software allows untrained people to do awful things to big data.</p>
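<p>A small simulation shows how easily a striking correlation turns up when you search through enough variables. This is only a sketch under stated assumptions: the series are independent Gaussian noise rather than real yearly data like the plot above, and the sizes (10 time points, 200 candidate series) and the seed are arbitrary choices of mine.</p>

```python
import math
import random
import statistics


def correlation(x, y):
    """Pearson correlation computed from first principles."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(1)  # arbitrary seed so the run is repeatable
n_points, n_series = 10, 200

# One "outcome" and many candidate "explanations", all pure noise.
target = [random.gauss(0, 1) for _ in range(n_points)]
candidates = [
    [random.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)
]

best = max(abs(correlation(target, c)) for c in candidates)
print(f"largest absolute correlation among {n_series} unrelated series: {best:.2f}")
```

<p>None of the candidate series has anything to do with the target, yet the best match looks impressive. A tool that lets you click through hundreds of variables is running this search for you, without the warning.</p>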
<p>The ways to solve this problem are:</p>
<ol>
<li>More data analysis training</li>
<li>Encouraging people to do their analysis interactively</li>
</ol>
<p>I have a few more tips which I have summarized in this talk on <a href="http://www.slideshare.net/jtleek/10-things-statistics-taught-us-about-big-data">things statistics taught us about big data</a>.</p>
The Leek group guide to genomics papers
2014-10-08T14:16:00+00:00
http://simplystats.github.io/2014/10/08/the-leek-group-guide-to-genomics-papers
<p><a href="https://github.com/jtleek/genomicspapers/">Leek group guide to genomics papers</a></p>
<p>When I was a student, my advisor, <a href="http://www.genomine.org/">John Storey</a>, made a list of papers for me to read on nights and weekends. That list was incredibly helpful for a couple of reasons.</p>
<ul class="task-list">
<li>
It got me caught up on the field of computational genomics
</li>
<li>
It was expertly curated, so it filtered a lot of papers I didn't need to read
</li>
<li>
It gave me my first set of ideas to try to pursue as I was reading the papers
</li>
</ul>
<p>I have often thought I should make a similar list for folks who may want to work with me (or who want to learn about statistical genomics). So this is my first attempt at that list. I’ve tried to separate the papers into categories and I’ve probably missed important papers. I’m happy to take suggestions for the list, but this is primarily designed for people in my group so I might be a little bit parsimonious.</p>
An economic model for peer review
2014-10-06T10:00:36+00:00
http://simplystats.github.io/2014/10/06/an-economic-model-for-peer-review
<p>I saw this tweet the other day:</p>
<blockquote class="twitter-tweet" width="550">
<p>
Has anyone applied game theory to the issue of anonymous peer review in academia?
</p>
<p>
— Mick Watson (@BioMickWatson) <a href="https://twitter.com/BioMickWatson/status/517715981104590848">October 2, 2014</a>
</p>
</blockquote>
<p>It reminded me that a few years ago <a href="http://simplystatistics.org/2012/07/11/my-worst-recent-experience-with-peer-review/">I had a paper that went through the peer review wringer</a>. It drove me completely bananas. One thing that drove me so crazy about the process was how long the referees waited before reviewing and how terrible the reviews were after that long wait. So I started thinking about the “economics of peer review”. Basically, what is the incentive for scientists to contribute to the system?</p>
<p>To get a handle on this idea, I designed a “peer review game” where there are a fixed number of players N. The players play the game for a fixed period of time. During that time, they can submit papers or they can review papers. For each person, their final score at the end of the time is <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_499ac624b20f322a6a85792cf12b0c13.gif" style="vertical-align: middle; border: none; " class="tex" alt="S_i = \sum {\rm Submitted \; Papers \; Accepted}" /></span>.</p>
<p>Based on this model, under closed peer review, there is one Nash equilibrium under the strategy that <strong>no one reviews any papers</strong>. Basically, no one can hope to improve their score by reviewing, they can only hope to improve their score by submitting more papers (sound familiar?). Under open peer review, there are more potential equilibria, based on the relative amount of goodwill you earn from your fellow reviewers by submitting good reviews.</p>
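<p>The closed-review argument can be sketched in a few lines of Python. This is a toy illustration and not the model from the paper: the time budget, the time cost of a submission or a review, and the acceptance probability are all hypothetical numbers, and other players’ behavior is held fixed. It only shows why, when the score counts accepted papers alone, every minute spent reviewing lowers a player’s expected score.</p>

```python
def expected_score(n_reviews, time_budget=60, submit_cost=5,
                   review_cost=5, accept_prob=0.5):
    """Expected number of accepted papers for one player.

    All parameter values are hypothetical. Reviewing uses up time but
    earns no points, mirroring the closed-review score in the text.
    """
    time_left = max(time_budget - n_reviews * review_cost, 0)
    n_submissions = time_left // submit_cost
    return accept_prob * n_submissions


# Expected score for every possible number of reviews in the hour.
scores = {k: expected_score(k) for k in range(13)}
best = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"{k:2d} reviews: expected score {s:.1f}")
print("best number of reviews:", best)
```

<p>Under these assumptions the best response is to review nothing. Open review changes the picture only if reviewing earns something back, which is the “goodwill” term described above.</p>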
<p>We then built a model system for testing out our theory. The system involved having groups of students play a “peer review game” where they submitted solutions to SAT problems like:</p>
<p><img class="aligncenter" src="http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0026895.g005&representation=PNG_M" alt="" width="390" height="335" /></p>
<p>Each solution was then randomly assigned to another player to review. Those players could (a) review it and reject it, (b) review it and accept it, or (c) not review it. The person with the most points at the end of the time (one hour) won.</p>
<p>We found some cool things:</p>
<ol>
<li>In closed review, reviewing gave no benefit.</li>
<li>In open review, reviewing gave a small positive benefit.</li>
<li>Both systems gave comparable accuracy.</li>
<li>All peer review increased the overall accuracy of responses.</li>
</ol>
<p><a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895">The paper is here</a> and all of the <a href="http://www.biostat.jhsph.edu/~jleek/peerreview/">data and code are here</a>.</p>
The Drake index for academics
2014-10-02T13:30:52+00:00
http://simplystats.github.io/2014/10/02/the-drake-index-for-academics
<p>I think academic indices are pretty silly; maybe we should introduce so many academic indices that people can’t even remember which one is which. There are pretty serious flaws with both citation indices and social media indices that I think render them pretty meaningless in a lot of ways.</p>
<p>Regardless of these obvious flaws I want in on the game. Instead of the <a href="http://genomebiology.com/2014/15/7/424">K-index</a> for academics I propose the <a href="http://www.drakeofficial.com/">Drake</a> index. Drake has achieved <a href="http://en.wikipedia.org/wiki/Drake_(rapper)">both critical and popular success</a>. His song “Honorable Mentions” from the ESPYs (especially the first verse) reminds me of the motivation of the K-index paper.</p>
<p>To quantify both the critical and popular success of a scientist, I propose the Drake Index (TM). The Drake Index is defined as follows</p>
<blockquote>
<p>(# Twitter Followers)/(Max Twitter Followers by a Person in your Field) + (#Citations)/(Max Citations by a Person in your Field)</p>
</blockquote>
<p>Let’s break the index down. There are two main components (Twitter followers and Citations) measuring popular and critical acclaim. But they are measured on different scales. So we attempt to normalize them to the maximum in their field so the indices will both be between 0 and 1. This means that your Drake index score is between 0 and 2. Let’s look at a few examples to see how the index works.</p>
<ol>
<li><a href="https://twitter.com/Drake">Drake</a> = (16.9M followers)/(55.5 M followers for Justin Bieber) + (0 citations)/(134 <a href="http://scholar.google.com/scholar?hl=en&q=+Natalie+Hershlag&btnG=&as_sdt=1%2C21&as_sdtp=">Citations for Natalie Portman</a>) = 0.30</li>
<li>Rafael Irizarry = (1.1K followers)/(17.6K followers for Simply Stats) + (<a href="http://scholar.google.com/citations?user=nFW-2Q8AAAAJ&hl=en&oi=ao">38,194 citations</a>)/(<a href="http://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:biostatistics">185,740 citations for Doug Altman</a>) = 0.27</li>
<li>Roger Peng = (4.5K followers)/(17.6K followers for Simply Stats) + (<a href="http://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=roger+peng">4,011 citations</a>)/(<a href="http://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:biostatistics">185,740 citations for Doug Altman</a>) = 0.28</li>
<li>Jeff Leek = (2.6K followers)/(17.6K followers for Simply Stats) + (<a href="http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en">2,348 citations</a>)/(<a href="http://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:biostatistics">185,740 citations for Doug Altman</a>) = 0.16</li>
</ol>
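<p>If you want to compute your own, the definition above fits in a few lines of Python. This is just a sketch: the function name <code>drake_index</code> is mine, and the inputs are the figures quoted for three of the examples in the list (follower counts entered in the units given there).</p>

```python
def drake_index(followers, max_followers, citations, max_citations):
    """Popular share plus critical share; each term lies between 0 and 1."""
    return followers / max_followers + citations / max_citations


# Figures quoted in the list above (followers in millions for Drake,
# in thousands for the statisticians).
examples = {
    "Drake": drake_index(16.9, 55.5, 0, 134),
    "Rafael Irizarry": drake_index(1.1, 17.6, 38_194, 185_740),
    "Jeff Leek": drake_index(2.6, 17.6, 2_348, 185_740),
}

for name, score in examples.items():
    print(f"{name}: {score:.2f}")
```

<p>Both terms are ratios, so the follower counts only need to be in the same units as the field maximum they are divided by.</p>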
<p>In the interest of this not being taken any more seriously than an afternoon blog post should be, I won’t calculate any other people’s Drake index. But you can :-).</p>
You think P-values are bad? I say show me the data.
2014-09-30T12:00:44+00:00
http://simplystats.github.io/2014/09/30/you-think-p-values-are-bad-i-say-show-me-the-data
<div class="page" title="Page 1">
<div class="layoutArea">
<div class="column">
<p>
Both the scientific community and the popular press are freaking out about reproducibility right now. I think they have good reason to, because even the US Congress is now <a href="http://web.stanford.edu/~vcs/talks/Testimony-STODDEN.pdf">investigating the transparency of science</a>. It has been driven by the very public reproducibility disasters in <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">genomics</a> and <a href="http://simplystatistics.org/2013/04/19/podcast-7-reinhart-rogoff-reproducibility/">economics</a>.
</p>
<p>
There are three major components to a reproducible and replicable study from a computational perspective: (1) the raw data from the experiment must be available, (2) the statistical code and documentation to reproduce the analysis must be available and (3) a correct data analysis must be performed.
</p>
<p>
There have been successes and failures in releasing all the data, but <a href="http://blogs.plos.org/everyone/2014/02/24/plos-new-data-policy-public-access-data-2/">PLoS' policy on data availability</a> and the <a href="http://www.alltrials.net/">alltrials</a> initiative hold some hope. The most progress has been made on making code and documentation available. Galaxy, knitr, and iPython make it easier to distribute literate programs than it has ever been previously and people are actually using them!
</p>
<p>
The trickiest part of reproducibility and replicability is ensuring that people perform a good data analysis. The first problem is that we actually don't know which statistical methods lead to higher reproducibility and replicability in users' hands. Articles like <a href="http://www.nytimes.com/2014/09/30/science/the-odds-continually-updated.html?_r=0">the one that just came out in the NYT</a> suggest that using one type of method (Bayesian approaches) over another (p-values) will address the problem. But the real story is that those are still 100% philosophical arguments. We actually have very little good data on whether analysts will perform better analyses using one method or another. <a href="http://simplystatistics.org/2014/02/14/on-the-scalability-of-statistical-procedures-why-the-p-value-bashers-just-dont-get-it/">I agree with Roger</a> in his tweet storm (quick, someone is wrong on the internet, Roger, fix it!):
</p>
<blockquote class="twitter-tweet" width="550">
<p>
5/If using Bayesian methods made you a better scientist, that would be great. But I need to see the evidence on that first.
</p>
<p>
— Roger D. Peng (@rdpeng) <a href="https://twitter.com/rdpeng/status/516958707859857409">September 30, 2014</a>
</p>
</blockquote>
<p>
This is even more of a problem because the data deluge demands that <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">almost all data analysis be performed by people with basic to intermediate statistics training</a> at best. There is no way around this in the short term. There just aren't enough trained statisticians/data scientists to go around. So we need to study statistics just like any other human behavior to figure out which methods work best in the hands of the people most likely to be using them.
</p>
</div>
</div>
</div>
Unbundling the educational package
2014-09-22T11:46:44+00:00
http://simplystats.github.io/2014/09/22/unbundling-the-educational-package
<p>I just got back from the World Economic Forum’s summer meeting in Tianjin, China and there was much talk of disruption and innovation there. Basically, if you weren’t disrupting, you were furniture. Perhaps not surprisingly, one topic area that was universally considered ripe for disruption was Education.</p>
<p>There are many ideas bandied about with respect to “disrupting” education and some are interesting to consider. MOOCs were the darlings of…last year…but they’re old news now. Sam Lessin has a <a href="https://www.theinformation.com/Why-Universities-Should-Wise-Up-and-Retain-Their-Users">nice piece</a> in The Information (total paywall, sorry, but it’s worth it) about building a subscription model for universities. Aswath Damodaran has what I think is a <a href="http://www.aswathdamodaran.blogspot.com/2014/09/the-education-business-road-map-for.html">nice framework for thinking about the “education business”</a>.</p>
<p>One thing that I latched on to in Damodaran’s piece is the idea of education as a “bundled product”. Indeed, I think the key aspect of traditional on-site university education is the simultaneous offering of</p>
<ol>
<li>Subject matter content (i.e. course material)</li>
<li>Mentoring and guidance by faculty</li>
<li>Social and professional networking</li>
<li>Other activities (sports, arts ensembles, etc.)</li>
</ol>
<p>MOOCs have attacked #1 for many subjects, typically large introductory courses. Endeavors like the <a href="http://www.minervaproject.com">Minerva project</a> are attempting to provide lower-cost seminar-style courses (i.e. anti-MOOCs).</p>
<p>I think the extent to which universities will truly be disrupted will hinge on how well we can unbundle the four (or maybe more?) elements described above and provide them separately but at roughly the same level of quality. Is it possible? I don’t know.</p>
Applied Statisticians: people want to learn what we do. Let's teach them.
2014-09-15T10:00:04+00:00
http://simplystats.github.io/2014/09/15/applied-statisticians-people-want-to-learn-what-we-do-lets-teach-them
<p>In <a href="http://bulletin.imstat.org/2014/09/data-science-how-is-it-different-to-statistics%E2%80%89/">this</a> recent opinion piece, <a href="http://had.co.nz/">Hadley Wickham</a> explains how data science goes beyond Statistics and that data science is not promoted in academia. He defines data science as follows:</p>
<blockquote>
<p>I think there are three main steps in a data science project: you <em>collect</em> data (and questions), <em>analyze</em> it (using visualization and models), then <em>communicate</em> the results.</p>
</blockquote>
<p>and makes the important point that</p>
<blockquote>
<p>Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling.</p>
</blockquote>
<p>The above describes what I have been doing since I became an academic applied statistician about 20 years ago. It describes what several of my colleagues do as well. For example, 15 years ago Karl Broman, in <a href="https://www.biostat.wisc.edu/~kbroman/presentations/interference.pdf">his excellent job talk</a>, covered all the items in Hadley’s definition. The arc of the talk revolved around the scientific problem and not the statistical models. He spent a considerable amount of time describing how the data was acquired and how he used perl scripts to clean up microsatellite data. More than half of <a href="https://www.biostat.wisc.edu/~kbroman/presentations/interference.pdf">his slides</a> contained visualizations, either illustrative cartoons or data plots. This research eventually led to his widely used “data product” <a href="http://www.rqtl.org/">R/qtl</a>. Although not described in the talk, Karl used <a href="http://kbroman.org/minimal_make/">make</a> to help make the results reproducible.</p>
<p>So why then does Hadley think that “Statistics research focuses on data collection and modeling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products”? I suspect one reason is that most applied work is published outside the flagship statistical journals. For example, Karl’s work was published in the <a href="http://www.ncbi.nlm.nih.gov/pubmed/9718341#" title="American journal of human genetics.">American Journal of Human Genetics.</a> A second reason may be that most of us academic applied statisticians don’t teach what we do. Despite writing a <a href="http://www.biostat.jhsph.edu/~ririzarr/Demo/index.html">thesis</a> that involved much data wrangling (reading music aiff files into Splus) and data visualization (including listening to fitted signals and residuals), the first few courses I taught as an assistant professor were almost solely on GLM theory.</p>
<p>About five years ago I tried changing the Methods course for our PhD students from one focusing on the math behind statistical methods to a problem and data-driven course. This was not very successful as many of our students were interested in the mathematical aspects of statistics and did not like the open-ended assignments. Jeff Leek built on that class by incorporating question development, much more vague problem statements, data wrangling, and peer grading. He also found it challenging to teach the more messy parts of applied statistics. It often requires exploration and failure which can be frustrating for new students.</p>
<p>This story has a happy ending though. Last year Jeff created a data science Coursera course that enrolled over 180,000 students with 6,000+ completing. This year I am subbing for <a href="http://www.people.fas.harvard.edu/~blitz/Site/Home.html">Joe Blitzstein</a> (talk about filling in big shoes) in <a href="http://cs109.github.io/2014/">CS109</a>: the Data Science undergraduate class <a href="http://vcg.seas.harvard.edu/">Hanspeter Pfister</a> and Joe created last year at Harvard. We have over 300 students registered, making it one of the largest classes on campus. I am not teaching them GLM theory.</p>
<p>So if you are an experienced applied statistician in academia, consider developing a data science class that teaches students what you do.</p>
A non-comprehensive list of awesome female data people on Twitter
2014-09-09T09:59:39+00:00
http://simplystats.github.io/2014/09/09/a-non-comprehensive-list-of-awesome-female-data-people-on-twitter
<p>I was just talking to a student who mentioned she didn’t know Jenny Bryan was on Twitter. She is and she is an awesome person to follow. I also realized that I hadn’t seen a good list of women on Twitter who do stats/data. So I thought I’d make one. This list is what I could make in 15 minutes based on my own feed and will, with 100% certainty, miss people. Can you please add them in the comments and I’ll update the list?</p>
<ul>
<li><a href="https://twitter.com/JennyBryan">@JennyBryan</a> (Jenny Bryan) statistics professor at UBC, teaching a great <a href="http://stat545-ubc.github.io/">intro to data science class</a> right now.</li>
<li><a href="http://twitter.com/hspter">@hspter</a> (Hilary Parker) data analyst at Etsy (former Hopkins grad student!) and co-creator (I think) of #rcatladies, also wrote this nice post on <a href="http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/">writing an R package from scratch</a></li>
<li><a href="https://twitter.com/acfrazee">@acfrazee</a> (Alyssa Frazee) Ph.D. student at Hopkins, <a href="http://alyssafrazee.com/">writes a great blog</a> on data stuff, works on statistical genomics</li>
<li><a href="https://twitter.com/emsweene57" target="_blank">@emsweene57</a> (Elizabeth Sweeney) - Hopkins Ph.D. student, developer of methods for neuroimaging.</li>
<li><a href="https://twitter.com/hmason">@hmason</a> (Hilary Mason) - currently running one of my favorite startups <a href="http://www.fastforwardlabs.com/">Fast Forward Labs</a>, but basically needs no introduction, one of the biggest names in data science right now.</li>
<li><a href="https://twitter.com/sherrirose">@sherrirose</a> (Sherri Rose) - former Hopkins postdoc, now at Harvard. Literally <a href="http://drsherrirose.com/targeted-learning-book/">wrote the book on targeted learning</a>.</li>
<li><a href="https://twitter.com/eloyan_ani">@eloyan_ani</a> (Ani Eloyan) - Hopkins Biostat faculty, working on neuroimaging and EMRs. Led the team that won the <a href="http://www.ncbi.nlm.nih.gov/pubmed/23060754">ADHD-200 competition</a>.</li>
<li><a href="https://twitter.com/mrogati">@mrogati</a> (Monica Rogati) - Former Linkedin data scientist, now running the data team at <a href="https://jawbone.com/">Jawbone</a>.</li>
<li><a href="https://twitter.com/annmariastat">@annmariastat</a> (<span style="color: #333333;">AnnMaria De Mars) - runs the Julia group, also world class judoka, writes one of <a href="http://www.thejuliagroup.com/blog/">my favorite stats/education blogs</a>. </span></li>
<li><a href="https://twitter.com/kara_woo">@kara_woo</a> (Kara Woo) - Works at the<span style="color: #36312d;"> </span><a style="color: #e94f1d;" href="http://www.nceas.ucsb.edu/" target="_blank">National Center for Ecological Analysis and Synthesis</a> and maintains their <a href="http://baikaldimensions.wordpress.com/">projections blog</a></li>
<li><a href="https://twitter.com/jhubiostat" target="_blank">@jhubiostat</a> (Betsy Ogburn) - Hopkins biostat faculty, not technically her account. But she is the reason this is the funniest/best academic department twitter account out there.</li>
<li><a href="https://twitter.com/lovestats" target="_blank">@lovestats</a> (Annie Pettit) - Does surveys and data quality/MRX work. If you are into MRX, <a href="http://lovestats.wordpress.com/" target="_blank">check out her blog</a>.</li>
<li><a href="https://twitter.com/ProfEmilyOster" target="_blank">@ProfEmilyOster</a> (Emily Oster) - Econ professor at U Chicago. Has been my favorite <a href="http://fivethirtyeight.com/contributors/emily-oster/" target="_blank">writer for FiveThirtyEight</a> since their relaunch.</li>
<li><a href="https://twitter.com/MonaChalabi" target="_blank">@monachalabi</a> (Mona Chalabi) - writer for FiveThirtyEight, I like her “Am I normal” series of posts.</li>
<li><a href="https://twitter.com/lisaczhang" target="_blank">@lisaczhang</a> (Lisa Zhang)- cofounder of Polychart.</li>
<li><a href="https://twitter.com/notawful">@notawful</a> (Jessica Hartnett) - professor at Gannon University, writes a <a href="http://notawfulandboring.blogspot.com/">great blog on teaching statistics</a>.</li>
<li><a href="https://twitter.com/AliciaOshlack">@AliciaOshlack</a> (Alicia Oshlack) - researcher at Murdoch Children’s Research Institute, one of the real superstars in computational genomics.</li>
<li><a href="https://twitter.com/AmeliaMN">@AmeliaMN</a> (Amelia McNamara) - graduate student at UCLA, works on the <a href="http://www.mobilizingcs.org/">Mobilize project</a> and other awesome data education initiatives in the LA school system.</li>
<li><a href="https://twitter.com/leighadlr">@leighadlr</a> (Leigh Arino de la Rubia) - Editor in chief of <a href="http://datascience.la/">DataScience.LA</a></li>
<li><a href="https://twitter.com/inesgn">@inesgn</a> (Ines Germendia) - data scientist working on official statistics at Basque Statistics - Eustat</li>
<li><a href="https://twitter.com/sgrifter">@sgrifter</a> (Sandy Griffith) - Biostat Ph.D., fellow #rcatladies creator, professor at the Cleveland Clinic in quantitative medicine</li>
<li><a href="https://twitter.com/ladamic">@ladamic</a> (Lada Adamic) - professor at Michigan, teacher of really highly regarded <a href="https://www.coursera.org/course/sna">social network analysis class</a> on Coursera, now at Facebook (I think)</li>
<li><a href="https://twitter.com/stephaniehicks">@stephaniehicks</a> - (Stephanie Hicks) postdoc in compbio at Harvard, <a href="http://www.stephaniehicks.com/pages/teaching.html">lead teaching assistant for Data Science course at Harvard</a>.</li>
<li><a href="https://twitter.com/ansate">@ansate</a> - (Melissa Santos) manager of Hadoop infrastructure at Etsy, maintainer of the women in data list below.</li>
<li><a href="https://twitter.com/lauramclay">@lauramclay</a> (Laura McLay) - professor of operations research at UW Madison, writes a blog with an amazing name: <a href="http://punkrockor.wordpress.com/">Punk Rock Operations Research</a>.</li>
<li><a href="https://twitter.com/bioannie">@bioannie</a> (Laura Hatfield) - professor at Harvard, also has one of the best data titles I’ve ever heard: <a href="https://twitter.com/bioannie">Princess of Bayesia</a></li>
<li><a href="https://twitter.com/kaythaney">@kaythaney</a> (Kaitlin Thaney) - director of the Mozilla Science Lab, also works with Data Kind UK.</li>
<li><a href="https://twitter.com/laurieskelly">@laurieskelly</a> (Laurie Skelly) - Data scientist at Data Scope analytics</li>
<li><a href="https://twitter.com/bo_p">@bo_p</a> (Bo Peng) - Data scientist at Data Scope analytics</li>
<li><a href="https://twitter.com/siminaboca">@siminaboca</a> (Simina Boca) - former Hopkins Ph.D. student, now assistant professor at Georgetown in Biomedical informatics.</li>
<li><a href="https://twitter.com/HelenPowell01">@HelenPowell01</a> (Helen Powell) - postdoc in Biostatistics at Hopkins, works on statistics for relationship between air pollution and health.</li>
<li><a href="https://twitter.com/victoriastodden">@victoriastodden</a> (Victoria Stodden) - one of the leaders in the legal and sociological aspects of reproducible research.</li>
<li><a href="https://twitter.com/hannawallach">@hannawallach</a> (Hanna Wallach) - CS professor and researcher at Microsoft Research NY.</li>
<li><a href="https://twitter.com/kralljr">@kralljr</a> (Jenna Krall) - postdoctoral fellow in environmental statistics at Emory (Hopkins grad!)</li>
<li><a href="https://twitter.com/lssli">@LssLi</a> (Shanshan Li) - professor of Biostatistics at IUPUI, works on neuroimaging, aging and epidemiology (Hopkins grad!)</li>
<li><a href="https://twitter.com/aheineike">@aheineike</a> (Amy Heineike) - director of mathematics at Quid, also <a href="http://simplystatistics.org/2012/03/19/interview-with-amy-heineike-director-of-mathematics/">excellent interviewee</a>.</li>
<li><a href="https://twitter.com/mathbabedotorg">@mathbabedotorg</a> (Cathy O’Neil) program director of the Lede Program at Columbia’s J School, <a href="http://mathbabe.org/">writes a very popular data science blog</a>.</li>
<li><a href="https://twitter.com/ameliashowalter">@ameliashowalter</a> (Amelia Showalter) Former director of digital analytics for Obama2012. Data consultant.</li>
<li><a href="https://twitter.com/minebocek">@minebocek</a> (Mine Çetinkaya-Rundel) Professor at Duke, teaches Duke’s <a href="https://www.coursera.org/course/statistics">great statistics MOOC</a> based on OpenIntro.</li>
<li><a href="https://twitter.com/@YennyWebbV">@YennyWebbV</a> (Yenny Webb Vargas) Ph.D. student in Biostatistics at Johns Hopkins, one of the founders of Bmore Biostats and <a href="http://yennywebbv.weebly.com/blog/data-sciences">a blogger</a></li>
<li><a href="https://twitter.com/@OMGannaks">@OMGannaks</a> (Anna Smith) - former data scientist at Bitly, now analytics engineer at Rent the Runway.</li>
<li><a href="https://twitter.com/@kristin_linn">@kristin_linn</a> (Kristin Linn) - postdoc at UPenn, formerly NC State grad student, part of the awesome statistics band (!) <a href="https://twitter.com/TheFifthMoment">@TheFifthMoment</a></li>
<li><a href="http://www.stat.berkeley.edu/~ledell/">@ledell</a> (Erin LeDell) - grad student in Biostatistics at Berkeley working on machine learning, <a href="http://cran.r-project.org/web/packages/subsemble/index.html">co-author of subsemble R package</a>.</li>
<li><a href="https://twitter.com/atmccann">@atmccann</a> (Allison McCann) - writer for FiveThirtyEight. Data viz person, my favorite post of hers is <a href="http://www.businessweek.com/articles/2014-02-13/how-airbus-is-debugging-the-a350#p2">how to debug a jet</a></li>
<li><a href="https://twitter.com/@ReginaNuzzo">@ReginaNuzzo</a> (Regina Nuzzo) - stats prof and freelance writer. Her piece on <a href="http://www.nature.com/news/scientific-method-statistical-errors-1.14700">p-values in Nature</a> just won the statistical reporting award.</li>
<li><a href="https://twitter.com/jrfAleks">@jrfAleks</a> (Aleks Collingwood) - programme manager for the Joseph Rowntree Foundation. Working on poverty and aging.</li>
<li><a href="https://twitter.com/@abarysh">@abarysh</a> (Anastasia Baryshnikova) - Princeton Lewis-Sigler fellow, co-leader of a major international yeast knockout study.</li>
<li><a href="https://twitter.com/sharon000">@sharon000</a> (Sharon Machlis) - online managing editor at Computerworld.</li>
<li><a href="https://twitter.com/2plus2make5">@2plus2make5</a> (Emma Pierson) - Stanford undergrad, Rhodes Scholar, frequent contributor to FiveThirtyEight and other data blogs.</li>
<li><a href="https://twitter.com/mandyfmejia">@mandyfmejia</a> (Mandy Mejia) - Johns Hopkins PhD student, brain imaging analyzer, also <a href="http://mandymejia.wordpress.com/">writes a great blog</a>!</li>
</ul>
<p>I have also been informed that these Twitter lists are probably better than my post. But I’ll keep updating my list anyway cause I want to know who all the right people to follow are!</p>
<ul>
<li>
<div>
<a href="https://twitter.com/ansate/lists/women-in-data" target="_blank">https://twitter.com/ansate/<wbr />lists/women-in-data</a>
</div>
</li>
<li>
<div>
<a href="https://twitter.com/BecomingDataSci/lists/women-in-data-science" target="_blank">https://twitter.com/<wbr />BecomingDataSci/lists/women-<wbr />in-data-science</a>
</div>
</li>
</ul>
Why the three biggest positive contributions to reproducible research are the iPython Notebook, knitr, and Galaxy
2014-09-04T14:08:42+00:00
http://simplystats.github.io/2014/09/04/why-the-three-biggest-positive-contributions-to-reproducible-research-are-the-ipython-notebook-knitr-and-galaxy
<p>There is a huge amount of interest in reproducible research and replication of results. Part of this is driven by some of the pretty major mistakes in reproducibility we have seen in <a href="http://simplystatistics.org/2013/04/19/podcast-7-reinhart-rogoff-reproducibility/">economics </a>and <a href="http://simplystatistics.org/2011/09/11/the-duke-saga/">genomics</a>. This has spurred discussion at a variety of levels including at the level of the <a href="http://simplystatistics.org/2014/04/01/this-is-how-an-important-scientific-debate-is-being-used-to-stop-epa-regulation/">United States Congress.</a></p>
<p>To solve this problem we need the appropriate infrastructure. I think developing infrastructure is a lot like playing the lottery, only if the lottery required a lot more work to buy a ticket. You pour a huge amount of effort into building good infrastructure. I think it helps if you build it for yourself like Yihui did for knitr:</p>
<p>(also <a href="http://datascience.la/yihui-xie-the-user-2014-interview/">make sure you go read the blog post</a> over at Data Science LA)</p>
<p>If lots of people adopt it, you are set for life. If they don’t, you did all that work for nothing. So you have to applaud all the groups who have made efforts at building infrastructure for reproducible research.</p>
<p>I would contend that the largest positive contributions to reproducibility in sheer number of analyses made reproducible are:</p>
<ul>
<li> The <a href="http://yihui.name/knitr/">knitr</a> R package (or more recently <a href="http://rmarkdown.rstudio.com/">rmarkdown</a>) for creating literate webpages and documents in R.</li>
<li><a href="http://ipython.org/notebook.html">iPython notebooks </a> for creating literate webpages and documents interactively in Python.</li>
<li>The <a href="http://galaxyproject.org/">Galaxy project</a> for creating reproducible work flows (among other things) combining known tools.</li>
</ul>
<p>There are similarities and differences between the different platforms but the one thing I think they all have in common is that they added either no or negligible effort to people’s data analytic workflows.</p>
<p>knitr and iPython notebooks have primarily increased reproducibility among folks who have some scripting experience. I think a major reason they are so popular is that you just write code like you normally would, but embed it in a simple-to-use document. The workflow doesn’t change much for the analyst because they were going to write that code anyway. The document just packages the code and its results in a more shareable form.</p>
<p>Galaxy has increased reproducibility for many folks, but my impression is that its primary user base is folks who have less experience scripting. The Galaxy team has worked hard to make it possible for these folks to analyze data they couldn’t before in a reproducible way. But the reproducibility is incidental in some sense. The main reason users come is that they would have had to stitch those pipelines together anyway. Now they have an easier way to do it (lowering workload) and they get reproducibility as a bonus.</p>
<p>If I was in charge of picking the next round of infrastructure projects that are likely to impact reproducibility or science in a positive way, I would definitely look for projects that have certain properties.</p>
<ul>
<li>For scripters and experts I would look for projects that interface with what people are already doing (most data analysis is in R or Python these days), require almost no extra work, and provide some benefit (reproducibility or otherwise). I would also look for things that are agnostic to which packages/approaches people are using.</li>
<li>For non-experts I would look for projects that enable people to build pipelines they weren’t able to build before, using already standard tools, and that give them things like reproducibility for free.</li>
</ul>
<p>Of course I wouldn’t put me in charge anyway, I’ve never won the lottery with any infrastructure I’ve tried to build.</p>
A (very) brief review of published human subjects research conducted with social media companies
2014-08-20T10:32:02+00:00
http://simplystats.github.io/2014/08/20/a-very-brief-review-of-published-human-subjects-research-conducted-with-social-media-companies
<p>As I wrote the other day, more and more human subjects research is being performed by large tech companies. The best way to handle the ethical issues raised by this research <a href="http://simplystatistics.org/2014/08/05/do-we-need-institutional-review-boards-for-human-subjects-research-conducted-by-big-web-companies/">is still unclear</a>. The first step is to get some idea of what has already been published from these organizations. So here is a brief review of the papers I know about where human subjects experiments have been conducted by companies. I’m only counting experiments here that have (a) been published in the literature and (b) involved experiments on users. I realized I could come up with surprisingly few. I’d be interested to see more in the comments if people know about them.</p>
<p><strong>Paper</strong>: <a href="http://www.pnas.org/content/111/24/8788.full">Experimental evidence of massive-scale emotional contagion through social networks</a></p>
<p><strong>Company</strong>: Facebook</p>
<p><strong>What they did</strong>: Randomized people to get different emotions in their news feed and observed if they showed an emotional reaction.</p>
<p><strong>What they found</strong>: That there was almost no real effect on emotion. The effect was statistically significant but not scientifically or emotionally meaningful.</p>
<p><strong>Paper: </strong><a href="http://www.sciencemag.org/content/341/6146/647.abstract">Social influence bias: a randomized experiment</a></p>
<p><strong>Company</strong>: Not stated but sounds like Reddit</p>
<p><strong>What they did</strong>: Randomly up-voted, down voted, or left alone posts to the social networking site. Then they observed whether there was a difference in the overall rating of posts within each treatment.</p>
<p><strong>What they found</strong>: Posts that were upvoted ended up with a final rating score (total upvotes - total downvotes) that was 25% higher.</p>
<p><strong>Paper: <a href="http://www.sciencemag.org/content/337/6092/337.full">Identifying influential and susceptible members of social networks</a> </strong></p>
<p><strong>Company</strong>: Facebook</p>
<p><strong>What they did</strong>: Using a commercial Facebook app, they found users who adopted a product and randomized sending messages to their friends about the use of the product. Then they measured whether their friends decided to adopt the product as well.</p>
<p><strong>What they found</strong>: Many interesting things. For example: susceptibility to influence decreases with age, people over 31 are stronger influencers, women are less susceptible to influence than men, etc. etc.</p>
<p><strong>Paper: </strong><a href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41854.pdf">Inferring causal impact using Bayesian structural time-series models</a></p>
<p><strong>Company</strong>: Google</p>
<p><strong>What they did</strong>: They developed methods for inferring the causal impact of an ad in a time series situation. They used data from an advertiser who showed ads to people related to keywords and measured how many visits there were to the advertiser’s website through paid and organic (non-paid) clicks.</p>
<p><strong>What they found</strong>: That the ads worked. But more importantly that they could predict the causal effect of the ad using their methods.</p>
SwiftKey and Johns Hopkins partner for Data Science Specialization Capstone
2014-08-19T12:18:05+00:00
http://simplystats.github.io/2014/08/19/swiftkey-and-johns-hopkins-partner-for-data-science-specialization-capstone
<p>I use <a href="http://swiftkey.com/en/">SwiftKey</a> on my Android phone all the time. So I was super pumped up when <a href="http://www.jhsph.edu/news/news-releases/2014/johns-hopkins-bloomberg-school-of-public-healths-data-science-specialization-mooc-series-launches-industry-collaboration-with-swiftkey.html">SwiftKey agreed to partner with us on the Data Science Specialization Capstone</a>, set to run in October 2014. To enroll in the course you have to pass the other 9 courses in the <a href="https://www.coursera.org/specialization/jhudatascience/1">Data Science Specialization</a>.</p>
<p>The 9 courses have only been running for 4 months but already 200+ people have finished all 9! It has been unbelievable to see the response to the specialization and we are excited about taking it to the next level.</p>
<p>Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:</p>
<p><em>I went to the</em></p>
<p>the keyboard presents three options for what the next word might be. For example, the three words might be <em>gym, store, restaurant</em>. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.</p>
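<p>To make the idea concrete, here is a minimal sketch of a next-word predictor. It is not SwiftKey’s model: it assumes a simple bigram approach (rank candidates by how often they followed the last typed word in the training text), and the tiny corpus and the function names <code>train_bigram_model</code> and <code>suggest_next_words</code> are made up for illustration.</p>

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for every word, which words follow it in the training text."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word][next_word] += 1
    return followers


def suggest_next_words(followers, text, k=3):
    """Return up to k of the most frequent followers of the last word typed."""
    words = text.lower().split()
    if not words:
        return []
    # .get avoids adding an empty entry for words never seen in training.
    candidates = followers.get(words[-1], Counter())
    return [word for word, _ in candidates.most_common(k)]


# Tiny made-up corpus, for illustration only.
corpus = [
    "i went to the gym",
    "i went to the gym",
    "i went to the gym",
    "we went to the store",
    "we went to the store",
    "she went to the restaurant",
]
model = train_bigram_model(corpus)
suggestions = suggest_next_words(model, "I went to the")
print(suggestions)
```

<p>A real predictive keyboard would presumably use longer contexts, smoothing for unseen word pairs, and far more data, but the core step of ranking candidate words by how likely they are given what was just typed is the same.</p>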
<p>This course will start with the basics, analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, students will use the knowledge gained in our Data Products course to build a predictive text product they can show off to their family, friends, and potential employers.</p>
<p>We are really excited to work with SwiftKey to take our Specialization to the next level! Here is Roger’s intro video for the course to get you fired up too.</p>
Interview with COPSS Award winner Martin Wainwright
2014-08-18T10:00:15+00:00
http://simplystats.github.io/2014/08/18/interview-with-copss-award-winner-martin-wainright
<p><em>Editor’s note: <a href="http://www.cs.berkeley.edu/~wainwrig/">Martin Wainwright</a> is the winner of the 2014 COPSS Award. This award is the most prestigious award in statistics, sometimes referred to as the <a href="http://en.wikipedia.org/wiki/COPSS_Presidents'_Award">Nobel Prize in Statistics</a>. Martin received the award for: “<span style="color: #222222;">For fundamental and groundbreaking contributions to high-dimensional statistics, graphical modeling, machine learning, optimization and algorithms, covering deep and elegant mathematical analysis as well as new methodology with wide-ranging implications for numerous applications.” He kindly agreed to be interviewed by Simply Statistics.</span></em></p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/08/wainwright.jpg"><img class="alignnone wp-image-3232" src="http://simplystatistics.org/wp-content/uploads/2014/08/wainwright.jpg" alt="wainwright" width="250" height="333" srcset="http://simplystatistics.org/wp-content/uploads/2014/08/wainwright-225x300.jpg 225w, http://simplystatistics.org/wp-content/uploads/2014/08/wainwright-768x1024.jpg 768w, http://simplystatistics.org/wp-content/uploads/2014/08/wainwright.jpg 3000w" sizes="(max-width: 250px) 100vw, 250px" /></a></p>
<p><strong>SS: How did you find out you had received the COPSS prize?</strong></p>
<p>It was pretty informal: I received an email in February from Raymond Carroll, who chaired the committee. But it had explicit instructions to keep the information private until the award ceremony in August.</p>
<p><strong>SS: You are in Electrical Engineering &amp; Computer Science (EECS) and Statistics at Berkeley: why that mix of departments?</strong></p>
<p>Just to give a little bit of history, I did my undergraduate degree in math at the University of Waterloo in Canada, and then my Ph.D. in EECS at MIT, before coming to Berkeley to work as a postdoc in Statistics. So when it came time to look at faculty positions, having a joint position between these two departments made a lot of sense. Berkeley has always been at the forefront of having effective joint appointments of the “Statistics plus X” variety, whether X is EECS, Mathematics, Political Science, Computational Biology and so on. For me personally, the EECS plus Statistics combination is terrific, as a lot of my interests lie at the boundary between these two areas, whether it is investigating tradeoffs between computational and statistical efficiency, connections between information theory and statistics, and so on. I hope that it is also good for my students! In any case, whether they enter in EECS or Statistics, they graduate with a strong background in both statistical theory and methods, as well as optimization, algorithms and so on. I think that this kind of mix is becoming increasingly relevant to the practice of modern statistics, and one can certainly see that Berkeley consistently produces students, whether from my own group or other people at Berkeley, with this kind of hybrid background.</p>
<p><strong>SS: What do you see as the relationship between statistics and machine learning?</strong></p>
<p>This is an interesting question, but tricky to answer, as it can really depend on the person. In my own view, statistics is a very broad and encompassing field, and in this context, machine learning can be viewed as a particular subset of it, one especially focused on algorithmic and computational aspects of statistics. But on the other hand, as things stand, machine learning has rather different cultural roots than statistics, certainly strongly influenced by computer science. In general, I think that both groups have lessons to learn from each other. For instance, in my opinion, anyone who wants to do serious machine learning needs to have a solid background in statistics. Statisticians have been thinking about data and inferential issues for a very long time now, and these fundamental issues remain just as important now, even though the application domains and data types may be changing. On the other hand, in certain ways, statistics is still a conservative field, perhaps not as quick as machine learning to move into new application domains, experiment with new methods and so on. So I think that statisticians can benefit from the playful creativity and unorthodox experimentation that one sees in some machine learning work, as well as the algorithmic and programming expertise that is standard in computer science.</p>
<p><strong>SS: What sorts of things is your group working on these days?</strong></p>
<p>I have fairly eclectic interests, so we are working on a range of topics. A number of projects concern the interface between computation and statistics. For instance, we have a recent pre-print (with postdoc Sivaraman Balakrishnan and colleague Bin Yu) that tries to address the gap between statistical and computational guarantees in applications of the expectation-maximization (EM) algorithm for latent variable models. In theory, we know that the global minimizer of the (nonconvex) likelihood has good properties, but in practice, the EM algorithm only returns local optima. How to resolve this gap between existing theory and actual practice? In this paper, we show that under pretty reasonable conditions (which hold for various types of latent variable models), the EM fixed points are as good as the global minima from the statistical perspective. This explains what is observed a lot in practice, namely that when the EM algorithm is given a reasonable initialization, it often returns a very good answer.</p>
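<p><em>Editor’s note: the role of initialization can be seen in a toy example. The sketch below is our own illustration, not code or a setting taken from the pre-print. It assumes a symmetric mixture of two Gaussians, 0.5·N(θ, σ²) + 0.5·N(−θ, σ²), with σ known, for which the EM update has the closed form θ ← mean(tanh(θx/σ²)·x). The function name and the simulated data are hypothetical.</em></p>

```python
import math
import random


def em_symmetric_mixture(data, theta_init, sigma=1.0, iterations=50):
    """EM for the mixture 0.5*N(theta, sigma^2) + 0.5*N(-theta, sigma^2).

    Only the mean parameter theta is unknown; sigma is treated as known.
    """
    theta = theta_init
    for _ in range(iterations):
        # E-step: the posterior weight w of the +theta component satisfies
        # 2*w - 1 = tanh(theta * x / sigma^2).
        # M-step: the new theta is the average of (2*w - 1) * x.
        theta = sum(math.tanh(theta * x / sigma**2) * x for x in data) / len(data)
    return theta


# Simulated data (assumption for illustration): true theta = 2, sigma = 1.
rng = random.Random(0)
true_theta = 2.0
data = [rng.choice([-1.0, 1.0]) * true_theta + rng.gauss(0.0, 1.0) for _ in range(2000)]

good_start = em_symmetric_mixture(data, theta_init=0.5)  # reasonable initialization
bad_start = em_symmetric_mixture(data, theta_init=0.0)   # stuck at a poor fixed point
print(good_start, bad_start)
```

<p><em>Started at 0.5, the iterations settle near the true value of 2. Started at exactly 0, every update returns 0, so the algorithm never leaves that uninformative fixed point.</em></p>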
<p>There are lots of other interesting questions at this computation/statistics interface. For instance, a lot of modern data sets (e.g., Netflix) are so large that they cannot be stored on a single machine, but must be split up into separate pieces. Any statistical task must then be carried out in a distributed way, with each processor performing local operations on a subset of the data, and then passing messages to other processors that summarize the results of its local computations. This leads to a lot of fascinating questions. What can be said about the statistical performance of such distributed methods for estimation or inference? How many bits do the machines need to exchange in order for the distributed performance to match that of the centralized “oracle method” that has access to all the data at once? We have addressed some of these questions in a recent line of work (with student Yuchen Zhang, former student John Duchi and colleague Michael Jordan).</p>
<p>So my students and postdocs are keeping me busy, and in addition, I am also busy writing a couple of books, one jointly with Trevor Hastie and Rob Tibshirani at Stanford University on the Lasso and related methods, and a second solo-authored effort, more theoretical in focus, on high-dimensional and non-asymptotic statistics.</p>
<p><strong>SS: What role do you see statistics playing in the relationship between Big Data and Privacy?</strong></p>
<p>Another very topical question: privacy considerations are certainly becoming more and more relevant as the scale and richness of data collection grows. Witness the recent controversies with the NSA, data manipulation on social media sites, etc. I think that statistics should have a lot to say about data and privacy. There has been a long line of statistical work on privacy, dating back at least to Warner’s work on survey sampling in the 1960s, but I anticipate seeing more of it over the next years. Privacy constraints bring a lot of interesting statistical questions (how to design experiments, how to perform inference, how data should be aggregated, what should be released and so on), and I think that statisticians should be at the forefront of this discussion.</p>
<p>In fact, in some joint work with former student John Duchi and colleague Michael Jordan, we have examined some tradeoffs between privacy constraints and statistical utility. We adopt the framework of local differential privacy that has been put forth in the computer science community, and study how statistical utility (in the form of estimation accuracy) varies as a function of the privacy level. Obviously, preserving privacy means obscuring something, so that estimation accuracy goes down, but what is the quantitative form of this tradeoff? An interesting consequence of our analysis is that in certain settings, it identifies optimal mechanisms for preserving a certain level of privacy in data.</p>
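<p><em>Editor’s note: the privacy/accuracy tradeoff can be made concrete with the classical randomized response design for a yes/no survey question, in the spirit of Warner’s work mentioned above. The sketch below is our own illustration, not the mechanisms or results from the joint work. It assumes each respondent reports their true bit with probability p = e^ε / (1 + e^ε), which gives ε-local differential privacy for ε &gt; 0. All function names are hypothetical.</em></p>

```python
import math
import random


def truth_probability(epsilon):
    """Probability of reporting the true bit under epsilon-local privacy."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomize(bit, epsilon, rng):
    """Report the true bit with probability p, otherwise report its flip."""
    return bit if rng.random() < truth_probability(epsilon) else 1 - bit


def debiased_proportion(reports, epsilon):
    """Unbiased estimate of the true proportion of ones (requires epsilon > 0).

    Since E[report] = (1 - p) + (2p - 1) * proportion, we invert that relation.
    """
    p = truth_probability(epsilon)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


def estimator_variance_bound(n, epsilon):
    """Upper bound on the estimator's variance: 1 / (4 n (2p - 1)^2)."""
    p = truth_probability(epsilon)
    return 1.0 / (4.0 * n * (2.0 * p - 1.0) ** 2)


# Made-up population: 30% of 10,000 respondents have the sensitive attribute.
rng = random.Random(1)
true_bits = [1] * 3000 + [0] * 7000
for epsilon in (0.5, 2.0):
    reports = [randomize(bit, epsilon, rng) for bit in true_bits]
    estimate = debiased_proportion(reports, epsilon)
    print(epsilon, estimate, estimator_variance_bound(len(true_bits), epsilon))
```

<p><em>Smaller ε means stronger privacy, but the factor (2p − 1)² in the variance bound shrinks toward zero, so many more respondents are needed for the same accuracy.</em></p>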
<p><strong>SS: What advice would you give young statisticians getting into the discipline right now?</strong></p>
<p>It is certainly an exciting time to be getting into the discipline. For undergraduates thinking of going to graduate school in statistics, I would encourage them to build a strong background in basic mathematics (linear algebra, analysis, probability theory and so on), all of which is important for a deep understanding of statistical methods and theory. I would also suggest “getting their hands dirty”, that is, doing some applied work involving statistical modeling, data analysis and so on. Even for a person who ultimately wants to do more theoretical work, having some exposure to real-world problems is essential. As part of this, I would suggest acquiring some knowledge of algorithms, optimization, and so on, all of which are essential in dealing with large, real-world data sets.</p>
Crowdsourcing resources for the Johns Hopkins Data Science Specialization
2014-08-15T15:55:37+00:00
http://simplystats.github.io/2014/08/15/crowdsourcing-resources-for-the-johns-hopkins-data-science-specialization
<p style="color: #222222;">
Since we began offering the <a href="https://www.coursera.org/specialization/jhudatascience/1">Johns Hopkins Data Science Specialization</a> we've noticed the unbelievable passion that our students have about our courses and the generosity they show toward each other on the course forums. Many students have created quality content around the subjects we discuss, and many of these materials are so good we feel that they should be shared with all of our students. We also know there are tons of other great organizations creating material (looking at you <a href="http://software-carpentry.org/">Software Carpentry folks</a>).
</p>
<p style="color: #222222;">
We're excited to announce that we've created a site using GitHub Pages: <a style="color: #1155cc;" href="http://datasciencespecialization.github.io/" target="_blank">http://<wbr />datasciencespecialization.<wbr />github.io/</a> to serve as a directory for content that the community has created. If you've created materials relating to any of the courses in the Data Science Specialization please send us a pull request and we will add a link to your content on our site. You can find out more about contributing here: <a style="color: #1155cc;" href="https://github.com/DataScienceSpecialization/DataScienceSpecialization.github.io#contributing" target="_blank">https://github.com/<wbr />DataScienceSpecialization/<wbr />DataScienceSpecialization.<wbr />github.io#contributing</a>
</p>
<p style="color: #222222;">
We can't wait to see what you've created and where the community can take this site!
</p>
swirl and the little data scientist's predicament
2014-08-13T15:41:58+00:00
http://simplystats.github.io/2014/08/13/swirl-and-the-little-data-scientists-predicament
<p style="color: #333333;">
<em>Editor's note: This is a repost of "<a href="http://simplystatistics.org/2012/03/26/r-and-the-little-data-scientists-predicament/">R and the little data scientist's predicament</a>". A brief idea for an update is presented at the end in italics. </em>
</p>
<p style="color: #333333;">
I just read this <a href="http://www.slate.com/articles/technology/technology/2012/03/ruby_ruby_on_rails_and__why_the_disappearance_of_one_of_the_world_s_most_beloved_computer_programmers_.single.html" target="_blank">fascinating post</a> on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was <a href="http://viewsourcecode.org/why/hacking/theLittleCodersPredicament.html" target="_blank">The Little Coder’s Predicament</a>, which, boiled down, essentially says that computer programming languages have grown too complex, so children/newbies can’t get instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a simple line of code or two they could make the computer <strong>do things</strong> like play some music or make a video.
</p>
<p style="color: #333333;">
I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax.
</p>
<p style="color: #333333;">
Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:
</p>
<ol style="color: #333333;">
<li>
It would have to be easy to load data sets, in one line of uncomplicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load("education-data") and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background.
</li>
<li>
It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures.
</li>
<li>
It would be awesome if the functions would include some sort of dynamic graphics (with <a href="http://www.omegahat.org/SVGAnnotation/" target="_blank">SVGAnnotation</a> or a wrapper for <a href="http://mbostock.github.com/d3/" target="_blank">D3.js</a>). Again, the syntax would have to be really accessible/not too much to learn.
</li>
</ol>
<p style="color: #333333;">
That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends.
</p>
<p style="color: #333333;">
<em>Update: Now that Nick and co. have created <a href="http://swirlstats.com/">swirl</a>, the technology is absolutely in place to have people do something awesome quickly. You could imagine taking the airplane data and immediately having them make a plot of all the flights using ggplot. Or taking any number of awesome government data sets and going straight to ggvis. Solving this problem is no longer a technical challenge; it is just a matter of someone coming up with an amazing swirl module that immediately sucks students in. This would be a really awesome project for a grad student or even an undergrad with an interest in teaching. If you do do it, you should absolutely send it our way and we'll advertise the heck out of it!</em>
</p>
The Leek group guide to giving talks
2014-08-12T14:19:53+00:00
http://simplystats.github.io/2014/08/12/the-leek-group-guide-to-giving-talks
<p>I wrote a little guide to giving talks that goes along with my <a href="https://github.com/jtleek/datasharing">data sharing</a>, <a href="https://github.com/jtleek/rpackages">R packages</a>, and <a href="https://github.com/jtleek/reviews">reviewing</a> guides. I posted it to GitHub and would be really happy to take any feedback/pull requests that folks might have. If you send a pull request please be sure to add yourself to the contributor list.</p>
<ul>
<li><a href="https://github.com/jtleek/talkguide/blob/master/README.md">Leek group guide to giving talks</a></li>
</ul>
Stop saying "Scientists discover..." instead say, "Prof. Doe's team discovers..."
2014-08-11T12:54:53+00:00
http://simplystats.github.io/2014/08/11/stop-saying-scientists-discover-instead-say-prof-does-team-discovers
<p>I was just reading an <a href="http://online.wsj.com/articles/academic-researchers-find-lucrative-work-as-big-data-scientists-1407543088">article about data science in the WSJ</a>. They were talking about how data scientists with just 2 years experience can earn a whole boatload of money*. I noticed a description that seemed very familiar:</p>
<blockquote>
<p>At e-commerce site operator Etsy Inc., for instance, a biostatistics Ph.D. who spent years mining medical records for early signs of breast cancer now writes statistical models to figure out the terms people use when they search Etsy for a new fashion they saw on the street.</p>
</blockquote>
<p>This perfectly describes the resume of a student who worked with me here at Hopkins and is now tearing it up in industry. But it made me a little bit angry that they didn’t publicize her name. Now she may have requested her name not be used, but I think it is more likely that it is a case of the standard, “Scientists discover…” (see e.g. <a href="http://news.yahoo.com/scientists-discover-why-thrive-less-sleep-others-163712171.html">this article</a> or <a href="http://www.science20.com/writer_on_the_edge/blog/scientists_discover_that_atheists_might_not_exist_and_thats_not_a_joke-139982">this one</a> or <a href="http://www.huffingtonpost.com/2014/08/10/hummingbird-helicopters-technology-video_n_5659013.html?utm_hp_ref=science">this one</a>).</p>
<p>There is always a lot of discussion about how to push people to get into STEM fields, including a ton of misguided attempts that waste time and money. But here is one way that would cost basically nothing and dramatically raise the profile of scientists in the eyes of the public: <strong>use their names when you describe their discoveries</strong>.</p>
<p>The value of this simple change could be huge. In an era of selfies, reality TV, and the power of social media, emphasizing the value that individual scientists bring could have a huge impact on STEM recruiting. That paragraph above is a lot more inspiring to potential young data scientists when rewritten:</p>
<blockquote>
<p><span style="font-style: italic;">At e-commerce site operator Etsy Inc., for instance, Dr Hilary Parker, a biostatistics Ph.D. who spent years mining medical records for early signs of breast cancer now writes statistical models to figure out the terms people use when they search Etsy for a new fashion they saw on the street.</span></p>
</blockquote>
<p>* <em>Incidentally, I think it is a bit overhyped. I have rarely heard of anyone making $200k-$300k with that little experience, but maybe I’m wrong? I’d be interested to hear if people really were making that kind of $$ at that stage in their careers. </em></p>
It's like Tinder, but for peer review.
2014-08-07T16:29:47+00:00
http://simplystats.github.io/2014/08/07/its-like-tinder-but-for-peer-review
<p>I have an idea for an app. You input the title and authors of a preprint (maybe even the abstract). The app shows the title/authors/abstract to people who work in a similar area to you. You could estimate this based on papers they have published that have similar key words to start.</p>
<p>Then you swipe left if you think the paper is interesting and right if you think it isn’t. We could then aggregate the data on how many “likes” a paper gets as a measure of how “interesting” it is. I wonder if this would be a better measure of later citations/interestingness than the opinion of a small number of editors and referees.</p>
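<p>As a rough illustration of the matching and aggregation steps, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the use of Jaccard overlap between keyword sets as the similarity measure, the 0.25 threshold, and the made-up readers and keywords. Swipe direction is left out; each vote is just recorded as interesting or not.</p>

```python
def jaccard(a, b):
    """Keyword overlap between two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_readers(preprint_keywords, readers, threshold=0.25):
    """Names of readers whose published keywords overlap enough with the preprint."""
    return [name for name, keywords in readers.items()
            if jaccard(preprint_keywords, keywords) >= threshold]


def interest_score(votes):
    """Fraction of readers who marked the paper interesting (None if nobody voted)."""
    return sum(votes) / len(votes) if votes else None


# Hypothetical readers and the keywords from their published papers.
readers = {
    "reader_a": {"genomics", "batch effects", "rna-seq"},
    "reader_b": {"causal inference", "clinical trials"},
    "reader_c": {"rna-seq", "genomics", "normalization"},
}
preprint = {"rna-seq", "genomics", "differential expression"}

matched = pick_readers(preprint, readers)            # who gets shown the preprint
score = interest_score([True, True, False, True])    # True = "interesting"
```

<p>A real version would need a better notion of "similar area" than raw keyword overlap, but the aggregate is the simple part: a count or fraction of likes per preprint.</p>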
<p>This is obviously taking my proposal of a <a href="http://simplystatistics.org/2012/03/14/a-proposal-for-a-really-fast-statistics-journal/">fast statistics journal</a> to the extreme and would provide no measure of how scientifically sound the paper was. But in an age when scientific soundness is only one part of the equation for top journals, a measure of interestingness that was available before review could be of huge value to journals.</p>
<p>If done properly, it would encourage people to publish preprints. If you posted a preprint and it was immediately “interesting” to many scientists, you could use that to convince editors to get past that stage and consider your science. More things like this could happen:</p>
<blockquote class="twitter-tweet" width="550">
<p>
Is this the future? "We saw with interest your preprint on <a href="https://twitter.com/biorxivpreprint">@biorxivpreprint</a>. We encourage you to submit it to [well-established journal]."
</p>
<p>
— Leonid Kruglyak (@leonidkruglyak) <a href="https://twitter.com/leonidkruglyak/status/466254954261254144">May 13, 2014</a>
</p>
</blockquote>
<p>So anyone want to build it?</p>
If you like A/B testing here are some other Biostatistics ideas you may like
2014-08-06T10:35:59+00:00
http://simplystats.github.io/2014/08/06/if-you-like-ab-testing-here-are-some-other-biostatistics-ideas-you-may-like
<p>Web companies are using A/B testing and experimentation regularly now to determine which features to push for advertising or improving user experience. A/B testing is a form of <a href="http://en.wikipedia.org/wiki/Randomized_controlled_trial">randomized controlled trial</a> that was originally employed in psychology but first adopted on a massive scale in Biostatistics. Since then a large amount of work on trials and trial design has been performed in the Biostatistics community. Some of these ideas may be useful in the same context within web companies; a lot of them are probably already being used and I just haven’t seen published examples. Here are some examples:</p>
<ol>
<li><a href="http://en.wikipedia.org/wiki/Sequential_analysis">Sequential study designs</a>. Here the sample size isn’t fixed in advance (fixing it is something I imagine is pretty hard to do with web experiments) but as the experiment goes on, the data are evaluated and a stopping rule that controls appropriate error rates is used. Here are a couple of good (if a bit dated) reviews of sequential designs <a href="http://smm.sagepub.com/content/9/5/497.full.pdf">[1]</a> <a href="http://www.ncbi.nlm.nih.gov/pubmed/18663761">[2]</a>.</li>
<li><a href="http://en.wikipedia.org/wiki/Randomized_controlled_trial#Adaptive">Adaptive study designs</a>. These are study designs that use covariates or responses to adapt the treatment assignments of people over time. With careful design and analysis choices, you can still control the relevant error rates. Here are a couple of reviews on adaptive trial designs <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2422839/">[1]</a> <a href="http://www.trialsjournal.com/content/13/1/145">[2]</a>.</li>
<li><a href="http://en.wikipedia.org/wiki/Randomized_controlled_trial#By_hypothesis_.28superiority_vs._noninferiority_vs._equivalence.29">Noninferiority trials</a>. These are trials designed to show that one treatment is at least as good as the standard of care. They are often implemented when a good placebo group is not available, often for ethical reasons. In light of the <a href="http://simplystatistics.org/2014/08/05/do-we-need-institutional-review-boards-for-human-subjects-research-conducted-by-big-web-companies/">ethical concerns for human subjects research at tech companies</a> this could be a useful trial design. Here is a systematic review for noninferiority trials <a href="http://www.ncbi.nlm.nih.gov/pubmed/22317762">[1]</a>.</li>
</ol>
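<p>To make the first idea concrete, here is a minimal Python sketch of a sequential A/B test with a fixed number of planned interim looks. It is an illustration under stated assumptions, not how any particular company or trial does it: the function names and the conversion counts are made up, the test statistic is a pooled two-proportion z statistic, and the stopping rule simply splits the overall alpha evenly across looks (a Bonferroni correction). That rule keeps the overall type I error at or below alpha by the union bound, but it is conservative; designs used in practice typically rely on boundaries such as Pocock or O'Brien-Fleming.</p>

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic for arm B versus arm A."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (conv_b / n_b - conv_a / n_a) / se


def sequential_ab_test(looks, alpha=0.05):
    """Test cumulative data at each planned look; stop at the first significant one.

    `looks` is a list of (conv_a, n_a, conv_b, n_b) cumulative counts.
    Each look is tested two-sided at alpha / len(looks), so the overall
    type I error is at most alpha (Bonferroni / union bound).
    Returns (look_number, z) at the stopping look, or (None, last_z).
    """
    per_look_alpha = alpha / len(looks)
    critical = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    z = 0.0
    for i, (conv_a, n_a, conv_b, n_b) in enumerate(looks, start=1):
        z = two_proportion_z(conv_a, n_a, conv_b, n_b)
        if abs(z) >= critical:
            return i, z
    return None, z


# Hypothetical cumulative (conversions, visitors) for arms A and B at three looks.
looks = [
    (100, 1000, 115, 1000),
    (200, 2000, 260, 2000),
    (300, 3000, 400, 3000),
]
stop_look, z_stat = sequential_ab_test(looks)
```

<p>With these made-up counts the difference is not significant at the first look but crosses the corrected threshold at the second, so the experiment stops there without collecting the third batch.</p>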
<p>It is also probably useful to read about <a href="http://en.wikipedia.org/wiki/Proportional_hazards_model">proportional hazards models</a> and <a href="http://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.20.1.145">time varying coefficients</a>. Obviously these are just a few ideas that might be useful, but talking to a Biostatistician who works on clinical trials (not me!) would be a great way to get more information.</p>
Do we need institutional review boards for human subjects research conducted by big web companies?
2014-08-05T12:34:28+00:00
http://simplystats.github.io/2014/08/05/do-we-need-institutional-review-boards-for-human-subjects-research-conducted-by-big-web-companies
<p>Web companies have been doing human subjects research for a while now. Companies like Facebook and Google have employed statisticians for almost a decade (or more) and part of the culture they have introduced is the idea of randomized experiments to identify ideas that work and that don’t. They have figured out that experimentation and statistical analysis often beat out the opinion of the highest paid person at the company for identifying features that “work”. Here “work” may mean features that cause people to <a href="https://www.youtube.com/watch?v=E_F5GxCwizc">read advertising</a>, or <a href="http://www.cnet.com/news/google-starts-placing-ads-directly-in-gmail-inboxes/">click on ads</a>, or <a href="http://blog.okcupid.com/index.php/we-experiment-on-human-beings/">match up with more people</a>.</p>
<p>This has created a huge amount of value and definitely a big interest in the statistical community. For example, today’s session on “Statistics: The Secret Weapon of Successful Web Giants” was standing room only.</p>
<blockquote class="twitter-tweet" width="550">
<p>
Can't get into the session I wanted to attend, "Statistics: The Secret Weapon of Successful Web Giants" <img src="http://simplystatistics.org/wp-includes/images/smilies/frownie.png" alt=":(" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <a href="https://twitter.com/hashtag/JSM2014?src=hash">#JSM2014</a> <a href="http://t.co/y6KNnPfDe2">pic.twitter.com/y6KNnPfDe2</a>
</p>
<p>
— Hilary Parker (@hspter) <a href="https://twitter.com/hspter/status/496671469473370114">August 5, 2014</a>
</p>
</blockquote>
<p>But at the same time, these experiments have raised some issues. Recently scientists from Cornell and Facebook <a href="http://www.pnas.org/content/111/24/8788.full">published a study</a> where they experimented with the news feeds of users. This turned into a PR problem for Facebook and Cornell because people were pretty upset they were being experimented on and weren’t being told about it. This has led defenders of the study to say: (a) Facebook is doing the experiments anyway, they just published it this time, (b) in this case very little harm was done, (c) most experiments done by Facebook are designed to increase profitability, at least this experiment had a more public good focused approach, and (d) there was a small effect size so what’s the big deal?</p>
<p>OK Cupid then published a very timely blog post with the title, “<a href="http://blog.okcupid.com/index.php/we-experiment-on-human-beings/">We experiment on human beings!</a>”, probably at least in part to take advantage of the press around the Facebook experiment. This post was received with less vitriol than the Facebook study, but really drove home the point that <strong>large web companies perform as much human subjects research as most universities and with little or no oversight. </strong></p>
<p>Academic research used to work the same way. Scientists used their common sense and their scientific sense to decide on what experiments to run. Most of the time this worked fine, but then things like the <a href="http://en.wikipedia.org/wiki/Tuskegee_Syphilis_Study">Tuskegee Syphilis Study</a> happened. These really unethical experiments led to the <a href="http://en.wikipedia.org/wiki/National_Research_Act">National Research Act of 1974</a>, which codified rules about <a href="http://en.wikipedia.org/wiki/Institutional_review_board">institutional review boards</a> to oversee research conducted on human subjects, to guarantee their protection. The IRBs are designed to consider the ethical issues involved with performing research on humans to balance protection of rights with advancing science.</p>
<p>Facebook, OK Cupid, and other companies are not subject to IRB approval. Yet they are performing more and more human subjects experiments. Obviously the studies described in the Facebook paper and the OK Cupid post pale in comparison to the Tuskegee study. I also know scientists at these companies and know they are ethical and really trying to do the right thing. But it raises interesting questions about oversight. Given the emotional, professional, and economic value that these websites control for individuals around the globe, it may be time to consider the equivalent of “institutional review boards” for human subjects research conducted by companies.</p>
<p>Companies that test drugs on humans, such as Merck, are subject to careful oversight and regulation to prevent potential harm to patients during the discovery process. This is obviously not the optimal solution for speed - understandably a major advantage and goal of tech companies. But there are issues that deserve serious consideration. For example, I think it is nowhere near sufficient to claim that by signing the terms of service people have given informed consent to be part of an experiment. That being said, they could just stop using Facebook if they don’t like that they are being experimented on.</p>
<p>Our reliance on these tools for all aspects of our lives means that it isn’t easy to just tell people, “Well if you don’t like being experimented on, don’t use that tool.” You would have to give up at minimum Google, Gmail, Facebook, Twitter, and Instagram to avoid being experimented on. But you’d also have to give up using smaller sites like OK Cupid, because almost all web companies are recognizing the <a href="http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/">importance of statistics</a>. One good place to start might be in considering <a href="http://biorxiv.org/content/biorxiv/early/2014/06/25/006601.full.pdf">new and flexible forms of consent</a> that make it possible to opt in and out of studies in an informed way, but with enough speed and flexibility not to slow down innovation in tech companies.</p>
Introducing people to R: 14 years and counting
2014-07-29T17:17:09+00:00
http://simplystats.github.io/2014/07/29/introducing-people-to-r-14-years-and-counting
<p>I’ve been introducing people to R for quite a long time now and I’ve been doing some reflecting today on how that process has changed quite a bit over time. I first started using R around 1998–1999, I think. I first started talking about R informally to my fellow classmates (and some faculty) back when I was in graduate school at UCLA. There, the department was officially using <a href="http://www.stat.uiowa.edu/~luke/xls/xlsinfo/xlsinfo.html">Lisp-Stat</a> (which I loved) and only later <a href="http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAA&url=http%3A%2F%2Fwww.jstatsoft.org%2Fv13%2Fi07%2Fpaper&ei=uADYU_rQMrLfsATQ44L4Bw&usg=AFQjCNFymozQrHQaHHRdLkF3QGIqamvCSQ&sig2=Twlq_KFfvTekZgJSr9SS1g&bvm=bv.71778758,d.cWc">converted its courses over to R</a>. Through various brown-bag lunches and seminars I would talk about R, and the main selling point at the time was “It’s just like S-PLUS but it’s free!” As it turns out, S-PLUS was basically abandoned by academics and its ownership changed hands a number of times over the years (it is currently owned by TIBCO). I still talk about S-PLUS when <a href="http://youtu.be/kzxHxFHW6hs">I talk about the history of R</a> but I’m not sure many people nowadays actually have any memories of the product.</p>
<p>When I got to Johns Hopkins in 2003 there wasn’t really much of a modern statistical computing class, so <a href="http://kbroman.org">Karl Broman</a>, <a href="http://rafalab.dfci.harvard.edu">Rafa Irizarry</a>, <a href="http://www.bcaffo.com">Brian Caffo</a>, <a href="http://www.biostat.jhsph.edu/~iruczins/">Ingo Ruczinski</a>, and I got together and started what we called the “KRRIB” class, which was basically a weekly seminar where one of us talked about a computing topic of interest. I gave some of the R lectures in that class and when I asked people who had heard of R before, almost no one raised their hand. And no one had actually used it before. My approach was pretty much the same at the time, although I left out the part about S-PLUS because no one had used that either. A lot of people had experience with SAS or Stata or SPSS. A number of people had used something like Java or C/C++ before and so I often used that as a reference frame. No one had ever used a functional style of programming language like Scheme or Lisp.</p>
<p>Over time, the population of students (mostly first-year graduate students) slowly shifted to the point where many of them had been introduced to R while they were undergraduates. This trend mirrored the overall trend with statistics where we are seeing more and more students do undergraduate majors in statistics (as opposed to, say, mathematics). Eventually, by 2008–2009, when I’d ask how many people had heard of or used R before, everyone raised their hand. However, even at that late date, I still felt the need to convince people that R was a “real” language that could be used for real tasks.</p>
<p>R has grown a lot in recent years, and is being used in so many places now, that I think it’s essentially impossible for a person to keep track of everything that is going on. That’s fine, but it makes “introducing” people to R an interesting experience. Nowadays in class, students are often teaching me something new about R that I’ve never seen or heard of before (they are quite good at Googling around for themselves). I feel no need to “bring people over” to R. In fact it’s quite the opposite–people might start asking questions if I <em>weren’t</em> teaching R.</p>
<p>Even though my approach to introducing R has evolved over time, with the topics that I emphasize or de-emphasize changing, I’ve found there are a few topics that I always stress to people who are generally newcomers to R. For whatever reason, these topics are always new or at least a little unfamiliar.</p>
<ul>
<li><strong>R is a functional-style language</strong>. Back when most people primarily saw something like C as a first programming language, it made sense to me that the functional style of programming would seem strange. I came to R from Lisp-Stat so the functional aspect was pretty natural for me. But many people seem to get tripped up over the idea of passing a function as an argument or not being able to modify the state of an object in place. Also, it sometimes takes people a while to get used to doing things like lapply() and map-reduce types of operations. Everyone still wants to write a for loop!</li>
<li><strong>R is both an interactive system and a programming language</strong>. Yes, it’s a floor wax and a dessert topping–get used to it. Most people seem to expect one or the other. SAS users are wondering why you need to write 10 lines of code to do what SAS can do in one massive PROC statement. C programmers are wondering why you don’t write more for loops. C++ programmers are confused by the weird system for object orientation. In summary, no one is ever happy.</li>
<li><strong>Visualization/plotting capabilities are state-of-the-art</strong>. One of the big selling points back in the “old days” was that from the very beginning R’s plotting and graphics capabilities were far more elegant than the ASCII-art that was being produced by other statistical packages (true for S-PLUS too). I find it a bit strange that this point has largely remained true. While other statistical packages have definitely improved their output (and R certainly has some areas where it is perhaps deficient), R still holds its own quite handily against those other packages. If the community can continue to produce things like ggplot2 and rgl, I think R will remain at the forefront of data visualization.</li>
</ul>
<p>I’m looking forward to teaching R to people as long as people will let me, and I’m interested to see how the next generation of students will approach it (and how my approach to them will change). Overall, it’s been just an amazing experience to see the widespread adoption of R over the past decade. I’m sure the next decade will be just as amazing.</p>
Academic statisticians: there is no shame in developing statistical solutions that solve just one problem
2014-07-25T11:27:39+00:00
http://simplystats.github.io/2014/07/25/academic-statisticians-there-is-no-shame-in-developing-statistical-solutions-that-solve-just-one-problem
<p dir="ltr">
I think that the main distinction between academic statisticians and those calling themselves data scientists is that the latter are very much willing to invest most of their time and energy into solving specific problems by analyzing specific data sets. In contrast, most academic statisticians strive to develop methods that can be very generally applied across problems and data types. There is a reason for this of course: historically statisticians have had enormous influence by developing general theory/methods/concepts such as the p-value, maximum likelihood estimation, and linear regression. However, these types of success stories are becoming more and more rare while data scientists are becoming increasingly influential in their respective areas of applications by solving important context-specific problems. The success of Moneyball and the prediction of election results are two recent widely publicized examples.
</p>
<p dir="ltr">
A survey of papers published in our flagship journals makes it quite clear that context-agnostic methodology is valued much more than detailed descriptions of successful solutions to specific problems. These applied papers tend to get published in subject matter journals and do not usually receive the same weight in appointments and promotions. This culture has therefore kept most statisticians holding academic positions away from collaborations that require substantial time and energy investments in understanding and attacking the specifics of the problem at hand. Below I argue that to remain relevant as a discipline we need a cultural shift.
</p>
<p dir="ltr">
It is of course understandable that to remain a discipline academic statisticians can’t devote all our effort to solving specific problems and none to trying to generalize these solutions. It is the development of these abstractions that defines us as an academic discipline and not just a profession. However, if our involvement with real problems is too superficial, we run the risk of developing methods that solve no problem at all, which will eventually render us obsolete. We need to accept that as data and problems become more complex, more time will have to be devoted to understanding the gory details.
</p>
<p>But what should the balance be?</p>
<p dir="ltr">
Note that many of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none. By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.
</p>
Jan de Leeuw owns the Internet
2014-07-16T11:22:51+00:00
http://simplystats.github.io/2014/07/16/jan-de-leeuw-owns-the-internet
<p>One of the best things to happen on the Internet recently is that <a href="http://gifi.stat.ucla.edu">Jan de Leeuw</a> has decided to own the Twitter/Facebook universe. If you do not already, you should be <a href="https://twitter.com/deleeuw_jan">following him</a>. Among his many accomplishments, he founded the Department of Statistics at UCLA (my <em>alma mater</em>), which is currently thriving. On the occasion of the Department’s 10th birthday, there was a small celebration, and I recall Don Ylvisaker mentioning that the reason they invited Jan to UCLA way back when was because he “knew everyone and knew everything”. Pretty accurate description, in my opinion.</p>
<p>Jan’s been tweeting quite a bit of late, but recently had this gem:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
As long as statistics continues to emphasize assumptions, models, and inference it will remain a minor subfield of data science.
</p>
<p>
— Jan de Leeuw (@deleeuw_jan) <a href="https://twitter.com/deleeuw_jan/statuses/488835963297087488">July 15, 2014</a>
</p>
</blockquote>
<p>followed by</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/j8feng">@j8feng</a> <a href="https://twitter.com/rdpeng">@rdpeng</a> Statistics is the applied science that constructs and studies techniques for data analysis.
</p>
<p>
— Jan de Leeuw (@deleeuw_jan) <a href="https://twitter.com/deleeuw_jan/statuses/488889040771354625">July 15, 2014</a>
</p>
</blockquote>
<p>I’m not sure what Jan’s thinking behind the first tweet was, but I think many in statistics would consider it a “good thing” to be a minor subfield of data science. Why get involved in that messy thing called data science where people are going wild with data in an unprincipled manner?</p>
<p>This is a situation where I think there is a large disconnect between what “should be” and what “is reality”. What should be is that statistics should include the field of data science. Honestly, that would be beneficial to the field of statistics and would allow us to provide a home to many people who don’t necessarily have one (primarily, people working on the border between two fields). Nate Silver made reference to this in his keynote address to the Joint Statistical Meetings last year when he said data science was just a fancy term for statistics.</p>
<p>The reality, though, is the opposite. Statistics has chosen to limit itself to a few areas, such as inference, as Jan mentions, and to willfully ignore other important aspects of data science as “not statistics”. This is unfortunate, I think, because unlike many in the field of statistics, I believe data science is here to stay. The reason is that statistics has decided not to fill the spaces that have been created by the increasing complexity of modern data analysis. The needs of modern data analyses (reproducibility, computing on large datasets, data preprocessing/cleaning) didn’t fall into the usual statistics curriculum, and so they were ignored. In my view, data science is about stringing together many different tools for many different purposes into an analytic whole. Traditional statistical modeling is a part of this (often a small part), but statistical thinking plays a role in all of it.</p>
<p>Statisticians should take on the challenge of data science and own it. We may not be successful in doing so, but we certainly won’t be if we don’t try.</p>
Piketty in R markdown - we need some help from the crowd
2014-06-30T09:45:02+00:00
http://simplystats.github.io/2014/06/30/piketty-in-r-markdown-we-need-some-help-from-the-crowd
<p>Thomas Piketty’s book <a href="http://www.amazon.com/Capital-Twenty-First-Century-Thomas-Piketty/dp/067443000X">Capital in the 21st Century</a> was a surprise best seller and the subject of intense scrutiny. A few weeks ago the <a href="http://www.ft.com/cms/s/2/e1f343ca-e281-11e3-89fd-00144feabdc0.html#axzz33PSo6ySt">Financial Times claimed</a> that the analysis was riddled with errors, leading to a firestorm of discussion. A few days ago the <a href="http://blogs.lse.ac.uk/impactofsocialsciences/2014/05/22/thomas-piketty-data-make-it-open/">London School of Economics posted</a> a similar call to make the data open and machine readable, saying:</p>
<blockquote>
<p>None of this data is explicitly open for everyone to reuse, clearly licenced and in machine-readable formats.</p>
</blockquote>
<p>A few friends of Simply Stats had started on a project to translate his work from the Excel files where the <a href="http://piketty.pse.ens.fr/en/capital21c2">original analysis resides</a> into R. The people that helped were <a href="http://alyssafrazee.com/">Alyssa Frazee</a>, <a href="http://aaronjfisher.wordpress.com/">Aaron Fisher</a>, <a href="http://www.biostat.jhsph.edu/~bswihart/">Bruce Swihart</a>, <a href="http://www.biostat.jhsph.edu/people/postdocs/nellore.shtml">Abhinav Nellore</a>, <a href="http://www.cbcb.umd.edu/~hcorrada/">Hector Corrada Bravo</a>, <a href="http://biostat.jhsph.edu/~jmuschel/">John Muschelli</a>, and me. We haven’t finished translating all chapters, so we are asking anyone who is interested to help contribute to translating the book’s technical appendices into R markdown documents. If you are interested, please send pull requests to the <a href="https://github.com/jtleek/capitalIn21stCenturyinR/tree/gh-pages">gh-pages branch of this Github repo</a>.</p>
<p>As a way to entice you to participate, here is one interesting thing we found. We don’t know enough economics to know if what we are finding is “right” or not, but one thing I noticed is that the x-axes in the Excel files are really distorted. For example, here is Figure 1.1 from the Excel files, where the ticks on the x-axis are separated by 20, 50, 43, 37, 20, 20, and 22 years.</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/06/30/piketty-in-r-markdown-we-need-some-help-from-the-crowd/fig11/" rel="attachment wp-att-3189"><img class=" wp-image-3189 aligncenter" alt="fig11" src="http://simplystatistics.org/wp-content/uploads/2014/06/fig11.png" width="503" height="346" srcset="http://simplystatistics.org/wp-content/uploads/2014/06/fig11-300x206.png 300w, http://simplystatistics.org/wp-content/uploads/2014/06/fig11-1024x704.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/06/fig11.png 1396w" sizes="(max-width: 503px) 100vw, 503px" /></a>
</p>
<p> </p>
<p>Here is the same plot with an equally spaced x-axis.</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/06/30/piketty-in-r-markdown-we-need-some-help-from-the-crowd/f11-us/" rel="attachment wp-att-3190"><img class=" wp-image-3190 aligncenter" alt="f11-us" src="http://simplystatistics.org/wp-content/uploads/2014/06/f11-us.png" width="450" height="393" srcset="http://simplystatistics.org/wp-content/uploads/2014/06/f11-us-300x262.png 300w, http://simplystatistics.org/wp-content/uploads/2014/06/f11-us.png 576w" sizes="(max-width: 450px) 100vw, 450px" /></a>
</p>
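<p>To get a rough sense of how large that distortion is, here is a minimal sketch in Python (not the code used for the figures above). It uses only the tick gaps quoted in this post and assumes the Excel chart draws those ticks equally spaced; the variable names are illustrative.</p>

```python
# Gaps (in years) between consecutive x-axis ticks, as quoted in the post.
gaps = [20, 50, 43, 37, 20, 20, 22]

# Assumption: the ticks are drawn equally spaced, so every interval gets the
# same width (1 unit) and one year gets 1/gap units of horizontal space.
width_per_year = [1 / g for g in gaps]

# A year in the shortest interval is stretched relative to a year in the
# longest interval by the ratio of the two gap lengths.
distortion = max(gaps) / min(gaps)
total_years = sum(gaps)

for g, w in zip(gaps, width_per_year):
    print(f"{g:3d}-year interval: {w:.3f} axis units per year")
print(f"total span: {total_years} years, distortion ratio: {distortion}")
```

<p>Under that assumption, a year inside a 20-year interval takes up 2.5 times as much horizontal space as a year inside the 50-year interval, which is why slopes look different once the axis is spaced evenly.</p>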
<p style="text-align: center;">
<p style="text-align: left;">
I'm not sure if it makes any difference but it is interesting. It sounds like, on balance, the Piketty analysis <a href="http://simplystatistics.org/2014/06/03/post-piketty-lessons/">was mostly reproducible and reasonable</a>. But having the data available in a more readily analyzable format will allow for more concrete discussion based on the data. So consider <a href="https://github.com/jtleek/capitalIn21stCenturyinR/tree/gh-pages">contributing to our GitHub repo</a>.
</p>
</p>
Privacy as a function of sample size
2014-06-25T14:41:09+00:00
http://simplystats.github.io/2014/06/25/privacy-as-a-function-of-sample-size
<p>The U.S. Supreme Court just made a unanimous ruling in <a href="http://www.docstoc.com/docs/document-preview.aspx?doc_id=171429294">Riley v. California</a> making it clear that police officers must get a warrant before searching through the contents of a cell phone obtained incident to an arrest. The message was put pretty clearly in the decision:</p>
<blockquote>
<p> Our answer to the question of what police must do before searching a cell phone seized incident to an arrest is accordingly simple — get a warrant.</p>
</blockquote>
<p>But I was more fascinated by this quote:</p>
<blockquote>
<p>The sum of an individual’s private life can be reconstructed through a thousand photographs labeled with dates, locations, and descriptions; the same cannot be said of a photograph or two of loved ones tucked into a wallet.</p>
</blockquote>
<p>So n = 2 is not enough to recreate a private life, but n = 1,000 (with associated annotation) is enough. I wonder what the minimum sample size needed is to officially violate someone’s privacy. I’d be curious to get <a href="http://mathbabe.org/">Cathy O’Neil</a>’s opinion on that question; she seems to have thought very hard about the relationship between data and privacy.</p>
<p>This is another case where I think that, to some extent, the Supreme Court made a decision <a href="http://simplystatistics.org/2011/12/12/the-supreme-courts-interpretation-of-statistical/">on the basis of a statistical concept</a>. Last time it was correlation, this time it is inference. As I read the opinion, part of the argument hinged on how much information do you get by searching a cell phone versus a wallet? Importantly, how much can you infer from those two sets of data?</p>
<p>If any of the Supremes want a primer in statistics, I’m available.</p>
New book on implementing reproducible research
2014-06-24T13:24:34+00:00
http://simplystats.github.io/2014/06/24/new-book-on-implementing-reproducible-research
<p><a href="http://simplystatistics.org/wp-content/uploads/2014/06/9781466561595.jpg"><img class="alignright" alt="9781466561595" src="http://simplystatistics.org/wp-content/uploads/2014/06/9781466561595.jpg" width="180" height="281" /></a>I have mentioned this in a few places but my book edited with Victoria Stodden and Fritz Leisch, <em><a href="http://www.crcpress.com/product/isbn/9781466561595">Implementing Reproducible Research</a></em>, has just been published by CRC Press. Although it is technically in their “R Series”, the chapters contain information on a wide variety of useful tools, not just R-related tools.</p>
<p>There is also a <a href="http://www.implementingRR.org">supplementary web site</a> hosted through Open Science Framework that contains a lot of additional information, including the list of chapters.</p>
The difference between data hype and data hope
2014-06-23T13:14:12+00:00
http://simplystats.github.io/2014/06/23/the-difference-between-data-hype-and-data-hope
<p>I was reading one of my favorite stats blogs, <a href="http://www.statschat.org.nz/">StatsChat</a>, where Thomas points <a href="http://www.theatlantic.com/technology/archive/2014/05/virtual-clinical-trials-doctors-could-use-algorithms-instead-of-people-to-test-new-drugs/371902/?mkt_tok=3RkMMJWWfF9wsRovuaTLZKXonjHpfsX86O8oW6Sg38431UFwdcjKPmjr1YIBSMFrI%2BSLDwEYGJlv6SgFSrnAMaxlzLgNXRk%3D">to this article</a> in the Atlantic and highlights this quote:</p>
<blockquote>
<p>Dassault Systèmes is focusing on that level of granularity now, trying to simulate propagation of cholesterol in human cells and building oncological cell models. “It’s data science and modeling,” Charlès told me. “Coupling the two creates a new environment in medicine.”</p>
</blockquote>
<p>I think that is a perfect example of data hype. This is a cool idea and if it worked would be completely revolutionary. But the reality is we are not even close to this. In very simple model organisms we can predict <a href="http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2724.html">very high level phenotypes some of the time</a> with whole cell modeling. We aren’t anywhere near the resolution we’d need to model the behavior of human cells, let alone the complex genetic, epigenetic, genomic, and environmental components that likely contribute to complex diseases. It is awesome that people are thinking about the future, and the fastest way to the future of science is usually through science fiction, but this is way overstating the power of current or even currently achievable data science.</p>
<p>So does that mean data science for improving clinical trials right now should be abandoned?</p>
<p>No.</p>
<p>There is a ton of currently applicable, real-world data science being done in <a href="http://en.wikipedia.org/wiki/Sequential_analysis">sequential analysis</a>, <a href="http://en.wikipedia.org/wiki/Adaptive_clinical_trial">adaptive clinical trials</a>, and <a href="http://en.wikipedia.org/wiki/Dynamic_treatment_regime">dynamic treatment regimes</a>. These are important contributions that are impacting clinical trials <em>right now</em> and where advances can reduce costs, prevent patient harm, and speed the implementation of clinical trials. I think that is the hope of data science: using statistics and data to make steady, realizable improvement in the way we treat patients.</p>
Heads up if you are going to submit to the Journal of the National Cancer Institute
2014-06-18T12:08:53+00:00
http://simplystats.github.io/2014/06/18/heads-up-if-you-are-going-to-submit-to-the-journal-of-the-national-cancer-institute
<p><strong>Update (6/19/14):</strong> <em>The folks at JNCI and OUP have kindly confirmed that they will consider manuscripts that have been posted to preprint servers. </em></p>
<p>I just got this email about a paper we submitted to JNCI</p>
<blockquote>
<p>Dear Dr. Leek:</p>
<p>I am sorry that we will not be able to use the above-titled manuscript. Unfortunately, the paper was published online on a site called bioRXiv, The Preprint Server for Biology, hosted by Cold Spring Harbor Lab. JNCI does not publish previously published work.</p>
<p>Thank you for your submission to the Journal.</p>
</blockquote>
<p>I have to say I’m not totally surprised, but I am a little disappointed. The future of academic publishing <a href="http://simplystatistics.org/2014/06/16/the-future-of-academic-publishing-is-here-it-just-isnt-evenly-distributed/">is definitely not evenly distributed</a>.</p>
The future of academic publishing is here, it just isn't evenly distributed
2014-06-16T10:10:34+00:00
http://simplystats.github.io/2014/06/16/the-future-of-academic-publishing-is-here-it-just-isnt-evenly-distributed
<p>Academic publishing has always been a slow process. Typically you would submit a paper for publication and then wait a few months to more than a year (statistics journals can be slow!) for a review. Then you’d revise the paper in a process that would take another couple of months, resubmit it and potentially wait another few months while this second set of reviews came back.</p>
<p>Lately statistics and statistical genomics have been doing more of what math does and posting papers to the <a href="http://arxiv.org/">arxiv</a> or to <a href="http://biorxiv.org/">biorxiv</a>. I don’t know if it is just me, but using this process has led to a massive speedup in the rate that my academic work gets used/disseminated. Here are a few examples of how crazy it is out there right now.</p>
<p>I <a href="https://github.com/jtleek/talkguide">started a post</a> on giving talks on Github. It was tweeted before I even finished!</p>
<blockquote class="twitter-tweet" lang="en">
<p>
(not a joke) If <a href="https://twitter.com/jtleek">@jtleek</a>'s new guide turns out like any of the others in the series, it will be one to bookmark <a href="https://t.co/WGKjn6MINH">https://t.co/WGKjn6MINH</a>
</p>
<p>
— Stephen Turner (@genetics_blog) <a href="https://twitter.com/genetics_blog/statuses/450980566369067008">April 1, 2014</a>
</p>
</blockquote>
<p>I really appreciate the compliment, especially coming from someone whose posts I read all the time, but it was wild to me that I hadn’t even finished the post yet (still haven’t) and it was already public.</p>
<p>Another example is that we have posted several papers on biorxiv and they all get tweeted/read. When we posted the <a href="http://biorxiv.org/content/early/2014/03/30/003665">Ballgown paper</a> it was rapidly discussed. The day after it was posted, there were already <a href="http://nextgenseek.com/2014/03/ballgown-for-estimating-differential-expression-of-genes-transcripts-or-exons-from-rna-seq/">blog posts</a> about the paper up.</p>
<p>We also have been working on another piece of software on Github that hasn’t been published yet, but have already had <a href="https://github.com/lcolladotor/derfinder/graphs/contributors">multiple helpful contributions</a> from people outside our group.</p>
<p>While all of this is going on, we have a paper out to review that we have been waiting to hear about for multiple months. So while open science is dramatically speeding up the rate at which we disseminate our results, the speed isn’t evenly distributed.</p>
What I do when I get a new data set as told through tweets
2014-06-13T09:06:18+00:00
http://simplystats.github.io/2014/06/13/what-i-do-when-i-get-a-new-data-set-as-told-through-tweets
<p>Hilary Mason asked a really interesting question yesterday:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
Data people: What is the very first thing you do when you get your hands on a new data set?
</p>
<p>
— Hilary Mason (@hmason) <a href="https://twitter.com/hmason/statuses/476905839035305984">June 12, 2014</a>
</p>
</blockquote>
<p>You should really consider reading the whole discussion <a href="https://twitter.com/hmason/status/476905839035305984">here</a>; it is amazing. But it also inspired me to write a post about what I do, as told by other people on Twitter. I apologize in advance if I missed your tweet; there was way too much good stuff to get them all.</p>
<p><strong>Step 0: Figure out what I’m trying to do with the data</strong></p>
<p>At least for me I come to a new data set in one of three ways: (1) I made it myself, (2) a collaborator created a data set with a specific question in mind, or (3) a collaborator created a data set and just wants to explore it. In the first case and the second case I already know what the question is, although sometimes in case (2) I still spend a little more time making sure I understand the question before diving in. @visualisingdata and I think alike here:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> this will sound textbooky but I stop, look and think about "what's it about (phenomena, activity, entity etc). Look before see.
</p>
<p>
— Andy Kirk (@visualisingdata) <a href="https://twitter.com/visualisingdata/statuses/476958934528704512">June 12, 2014</a>
</p>
</blockquote>
<p> Usually this involves figuring out what the variables mean like @_jden does:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> try to figure out what the fields mean and how it's coded — :sandwich emoji: (@_jden) <a href="https://twitter.com/_jden/statuses/476907686307430400">June 12, 2014</a>
</p>
</blockquote>
<p>If I’m working with a collaborator I do what @evanthomaspaul does:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> Interview the source, if possible, to know all of the problems with the data, use limitations, caveats, etc. — Evan Thomas Paul (@evanthomaspaul) <a href="https://twitter.com/evanthomaspaul/statuses/476924149852827648">June 12, 2014</a>
</p>
</blockquote>
<p>If the data don’t have a question yet, I usually start thinking right away about what questions can actually be answered with the data and what can’t. This prevents me from wasting a lot of time later chasing trends. @japerk does something similar:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> figure out the format & how to read it. Then ask myself, what can be learned from this data? — Jacob (@japerk) <a href="https://twitter.com/japerk/statuses/476909485651279872">June 12, 2014</a>
</p>
</blockquote>
<p><strong>Step 1: Learn about the elephant</strong></p>
<p>Unless the data is something I’ve analyzed a lot before, I usually feel like the <a href="http://en.wikipedia.org/wiki/Blind_men_and_an_elephant">blind men and the elephant.</a></p>
<p><a href="http://changeprocessdesign.files.wordpress.com/2009/11/6-blind-men-hans.jpg"><img class="aligncenter" alt="" src="http://changeprocessdesign.files.wordpress.com/2009/11/6-blind-men-hans.jpg" width="293" height="188" /></a></p>
<p>So the first thing I do is fool around a bit to try to figure out what the data set “looks” like by doing things like what @jasonpbecker does: looking at the types of variables I have and what the first few and last few observations look like.</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> sapply(df, class); head(df); tail(df) — Jason Becker (@jasonpbecker) <a href="https://twitter.com/jasonpbecker/statuses/476907832718397440">June 12, 2014</a>
</p>
</blockquote>
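<p>For readers who do not work in R, here is a rough analogue of that one-liner as a minimal Python sketch using only the standard library. The toy CSV, the column names, and the <code>guess_type</code> helper are all hypothetical stand-ins, not anything from the tweet.</p>

```python
import csv
import io

# Hypothetical toy data standing in for a freshly received file.
raw = io.StringIO(
    "id,age,city\n"
    "1,34,Baltimore\n"
    "2,29,Seattle\n"
    "3,41,Austin\n"
    "4,37,Boston\n"
)
rows = list(csv.DictReader(raw))


def guess_type(values):
    """Guess a column type by trying int, then float, else falling back to str."""
    for caster in (int, float):
        try:
            for v in values:
                caster(v)
            return caster.__name__
        except ValueError:
            continue
    return "str"


# Rough analogue of sapply(df, class): one guessed type per column.
col_types = {name: guess_type([r[name] for r in rows]) for name in rows[0]}

# Rough analogues of head(df) and tail(df), here with two rows each.
head, tail = rows[:2], rows[-2:]

print(col_types)
print(head)
print(tail)
```

<p>The type guess is deliberately crude; the point is only to see at a glance which columns are numeric and what the ends of the file look like.</p>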
<p>If it is medical/social data I usually use this to look for personally identifiable information and then do what @peteskomoroch does:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> remove PII and burn it with fire — Peter Skomoroch (@peteskomoroch) <a href="https://twitter.com/peteskomoroch/statuses/476910403348209665">June 12, 2014</a>
</p>
</blockquote>
<p>If the data set is really big, I usually take a carefully chosen random subsample to make it possible to do my exploration interactively like @richardclegg</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> unless it is big data in which case sample then import to R and look for NAs... <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":-)" class="wp-smiley" style="height: 1em; max-height: 1em;" /> — Richard G. Clegg (@richardclegg) <a href="https://twitter.com/richardclegg/statuses/477113022658641920">June 12, 2014</a>
</p>
</blockquote>
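<p>One way to draw such a subsample when the file is too large to load at once is reservoir sampling. This is a technique swapped in here for illustration, not something mentioned in the tweet; the function name, sample size, and seed below are assumptions.</p>

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed so the subsample can be regenerated
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical "big" data: in practice this would be rows read lazily from disk.
big = range(100_000)
subset = reservoir_sample(big, k=1000)
print(len(subset), min(subset), max(subset))
```

<p>Fixing the seed keeps the exploration reproducible, since the same subsample comes back on every run.</p>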
<p>After doing that I look for weird quirks, like if there are missing values or outliers like @feralparakeet</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> ALL THE DESCRIPTIVES. Well, after reviewing the codebook, of course. — Vickie Edwards (@feralparakeet) <a href="https://twitter.com/feralparakeet/statuses/476913969962053634">June 12, 2014</a>
</p>
</blockquote>
<p>and like @cpwalker07</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> count # rows, read every column header — Chris Walker (@cpwalker07) <a href="https://twitter.com/cpwalker07/statuses/476922532596289536">June 12, 2014</a>
</p>
</blockquote>
<p>and like @toastandcereal</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a><a href="https://twitter.com/mispagination">@mispagination</a> jot down number of rows. That way I can assess right away whether I've done something dumb later on. — Jessica Balsam (@toastandcereal) <a href="https://twitter.com/toastandcereal/statuses/476949846377914368">June 12, 2014</a>
</p>
</blockquote>
<p>and like @cld276</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> run a bunch of count/groupby statements to gauge if I think it's corrupt. — Carol Davidsen (@cld276) <a href="https://twitter.com/cld276/statuses/476908703493677056">June 12, 2014</a>
</p>
</blockquote>
<p>and @adamlaiacano</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> summary() — Adam Laiacano (@adamlaiacano) <a href="https://twitter.com/adamlaiacano/statuses/476906966049374208">June 12, 2014</a>
</p>
</blockquote>
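<p>The checks in these tweets (row counts, missing values, group counts, quick summaries) can be sketched in a few lines. This is a minimal Python illustration on made-up records; the field names and values are hypothetical.</p>

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
rows = [
    {"state": "MD", "age": 34},
    {"state": "MD", "age": None},
    {"state": "WA", "age": 29},
    {"state": None, "age": 41},
    {"state": "WA", "age": 250},  # suspicious outlier
]

# Jot down the number of rows so later steps can be checked against it.
n_rows = len(rows)

# Read every column header and count missing values per column.
columns = list(rows[0])
missing = {c: sum(r[c] is None for r in rows) for c in columns}

# A count/group-by style check on a categorical column.
by_state = Counter(r["state"] for r in rows)

# A crude stand-in for summary(): the range of a numeric column.
ages = [r["age"] for r in rows if r["age"] is not None]
age_range = (min(ages), max(ages))

print(n_rows, columns, missing, dict(by_state), age_range)
```

<p>Here the maximum age of 250 is the kind of quirk that stands out immediately in a range or summary table.</p>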
<p><strong>Step 2: Clean/organize</strong></p>
<p>I usually use the first exploration to figure out things that need to be fixed so that I can mess around with a <a href="http://vita.had.co.nz/papers/tidy-data.pdf">tidy data set</a>. This includes fixing up missing value encoding like @chenghlee:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
.<a href="https://twitter.com/hmason">@hmason</a> Often times, "fix" various codings, esp. for missing data (e.g., mixed strings & ints for coded vals; decide if NAs, "" are equiv.) — Cheng H. Lee (@chenghlee) <a href="https://twitter.com/chenghlee/statuses/476919091056226306">June 12, 2014</a>
</p>
</blockquote>
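<p>A minimal sketch of that kind of recoding, in Python: every spelling of "missing" is mapped to a single value. The set of codes below is an assumption made for illustration; in a real data set it should come from the codebook or the collaborator.</p>

```python
# Assumed set of spellings that all mean "missing" in this hypothetical data set.
MISSING_CODES = {"", "NA", "N/A", "null", "-999"}


def clean_value(value):
    """Map any assumed missing-value code to None; otherwise return stripped text."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text in MISSING_CODES else text


# Mixed strings and ints for coded values, as described in the tweet.
raw_ages = ["34", "NA", "", " 29 ", -999, None, "41"]
clean_ages = [clean_value(v) for v in raw_ages]
print(clean_ages)
```

<p>Deciding whether an empty string and an explicit NA really are equivalent is a judgment call about the data, so it is worth making that decision in one visible place like this.</p>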
<p>or more generically like: @RubyChilds</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> clean it — Ruby ˁ˚ᴥ˚ˀ (@RubyChilds) <a href="https://twitter.com/RubyChilds/statuses/476932385913569282">June 12, 2014</a>
</p>
</blockquote>
<p>I usually do a fair amount of this, like @the_turtle too:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> Spend the next two days swearing because nobody cleaned it. — The Turtle (@the_turtle) <a href="https://twitter.com/the_turtle/statuses/476907578404786176">June 12, 2014</a>
</p>
</blockquote>
<p>When I’m done I do a bunch of sanity checks and data integrity checks like @deaneckles, and if things are screwed up I go back and fix them:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/treycausey">@treycausey</a> <a href="https://twitter.com/hmason">@hmason</a> Test really boring hypotheses. Like num_mobile_comments <= num_comments. — Dean Eckles (@deaneckles) <a href="https://twitter.com/deaneckles/statuses/476911179361972224">June 12, 2014</a>
</p>
</blockquote>
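<p>The "really boring hypotheses" in that tweet translate directly into code. Below is a minimal Python sketch; <code>num_mobile_comments</code> and <code>num_comments</code> come from the tweet, while the <code>user</code> field, the toy rows, and the <code>failed_checks</code> helper are hypothetical.</p>

```python
# Hypothetical toy rows; the third one violates the invariant on purpose.
rows = [
    {"user": "a", "num_comments": 10, "num_mobile_comments": 4},
    {"user": "b", "num_comments": 3, "num_mobile_comments": 3},
    {"user": "c", "num_comments": 2, "num_mobile_comments": 5},
]


def failed_checks(rows):
    """Return (user, message) pairs for rows that break a boring invariant."""
    problems = []
    for r in rows:
        # Counts can never be negative.
        if r["num_comments"] < 0 or r["num_mobile_comments"] < 0:
            problems.append((r["user"], "negative count"))
        # A subset count can never exceed the total.
        if r["num_mobile_comments"] > r["num_comments"]:
            problems.append((r["user"], "mobile comments exceed total comments"))
    return problems


problems = failed_checks(rows)
print(problems)
```

<p>Collecting the failures instead of stopping at the first one makes it easier to see whether a problem is a single bad row or a systematic issue.</p>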
<p><strong>Step 3: Plot. That. Stuff.</strong></p>
<p>After getting a handle on the data with mostly text-based tables and output (things that don’t require a graphics device) and cleaning things up a bit, I start plotting everything like @hspter:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> usually head(data) then straight to visualization. Have been working on some "unit tests" for data as well <a href="https://t.co/6Qd3URmzpe">https://t.co/6Qd3URmzpe</a> — Hilary Parker (@hspter) <a href="https://twitter.com/hspter/statuses/476915876927520768">June 12, 2014</a>
</p>
</blockquote>
<p>At this stage my goal is to get the maximum amount of information about the data set in the minimal amount of time. So I do not make the graphs pretty (I think there is a distinction between exploratory and expository graphics). I do histograms and jittered one-dimensional plots to look at variables one by one like @FisherDanyel:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/TwoHeadlines">@TwoHeadlines</a><a href="https://twitter.com/hmason">@hmason</a> After looking at a few hundred random rows? Histograms & scatterplots of columns to understand what I have. — Danyel Fisher (@FisherDanyel) <a href="https://twitter.com/FisherDanyel/statuses/477206626558951425">June 12, 2014</a>
</p>
</blockquote>
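<p>As a stand-in for a real plotting library, the binning behind a quick exploratory histogram can be sketched in plain Python. The values and bin width are made up, and a text histogram is only an approximation of the graphics described here.</p>

```python
from collections import Counter

# Hypothetical values for one variable; 9.6 is a deliberate outlier.
values = [1.2, 1.9, 2.1, 2.4, 2.5, 2.8, 3.3, 3.7, 4.1, 9.6]
bin_width = 1.0

# Assign each value to the left edge of its bin, then count values per bin.
counts = Counter(int(v // bin_width) * bin_width for v in values)

# Print one row of '#' marks per non-empty bin, ugly but fast to read.
for left in sorted(counts):
    print(f"[{left:4.1f}, {left + bin_width:4.1f}) {'#' * counts[left]}")
```

<p>Even in this crude form, the gap between the bin starting at 4.0 and the one starting at 9.0 makes the outlier obvious.</p>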
<p>To compare the distributions of variables I usually use overlaid density plots like @sjwhitworth:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> density plot all the things!
</p>
<p>
— Stephen Whitworth (@sjwhitworth) <a href="https://twitter.com/sjwhitworth/statuses/476953907424657408">June 12, 2014</a>
</p>
</blockquote>
<p>I make tons of scatterplots to look at relationships between variables like @wduyck</p>
<blockquote class="twitter-tweet" lang="en">
<p>
<a href="https://twitter.com/hmason">@hmason</a> plot scatterplots and distributions
</p>
<p>
— Wouter Duyck (@wduyck) <a href="https://twitter.com/wduyck/statuses/476979620706013184">June 12, 2014</a>
</p>
</blockquote>
<p>I usually color/size the dots in the scatterplots by other variables to see if I can identify any confounding relationships that might screw up analyses downstream. Then, if the data are multivariate, I do some dimension reduction to get a feel for high dimensional structure. Nobody mentioned principal components or hierarchical clustering in the Twitter conversation, but I end up using these a lot to just figure out if there are any weird multivariate dependencies I might have missed.</p>
<p><strong>Step 4: Get a quick and dirty answer to the question from Step 0</strong></p>
<p>After I have a feel for the data I usually try to come up with a quick and dirty answer to the question I care about. This might be a simple predictive model (I usually use 60% training, 40% test) or a really basic regression model when possible, just to see if the signal is huge, medium or subtle. I use this as a place to start when doing the rest of the analysis. I also often check this against the intuition of the person who generated the data to make sure something hasn’t gone wrong in the data set.</p>
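<p>A minimal sketch of that quick-and-dirty pass, in Python with only the standard library: a 60%/40% split followed by a one-variable least squares fit. The simulated data, the seed, and the helper name <code>fit_line</code> are all assumptions for illustration, not the analysis from any real project.</p>

```python
import random

rng = random.Random(1)  # fixed seed so the split and noise are repeatable

# Hypothetical data: y is roughly 2*x + 1 plus Gaussian noise.
xs = [i / 10 for i in range(100)]
data = [(x, 2 * x + 1 + rng.gauss(0, 0.5)) for x in xs]

# 60% training, 40% held out for testing.
rng.shuffle(data)
cut = len(data) * 6 // 10
train, held_out = data[:cut], data[cut:]


def fit_line(points):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b


a, b = fit_line(train)

# Root mean squared error on the held-out 40%.
rmse = (sum((y - (a + b * x)) ** 2 for x, y in held_out) / len(held_out)) ** 0.5
print(f"intercept={a:.2f} slope={b:.2f} held-out RMSE={rmse:.2f}")
```

<p>Comparing the held-out error with the spread of the outcome gives a first impression of whether the signal is huge, medium, or subtle before investing in anything more elaborate.</p>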
The Real Reason Reproducible Research is Important
2014-06-06T06:19:31+00:00
http://simplystats.github.io/2014/06/06/the-real-reason-reproducible-research-is-important
<p>Reproducible research has been on my mind a bit these days, partly because it has been in the news with the <a href="http://simplystatistics.org/2014/06/03/post-piketty-lessons/">Piketty stuff</a>, and also perhaps because I just <a href="http://www.amazon.com/Implementing-Reproducible-Research-Chapman-Series/dp/1466561599/ref=sr_1_1?ie=UTF8&qid=1402049601&sr=8-1&keywords=roger+peng">published a book on it</a> and I’m <a href="https://www.coursera.org/course/repdata">teaching a class on it</a> as we speak (as well as next month and the month after…).</p>
<p>However, as I watch and read many discussions over the role of reproducibility in science, I often feel that many people miss the point. Now, just to be clear, when I use the word “reproducibility” or say that a study is reproducible, I do not mean “independent verification” as in a separate investigator conducted an independent study and came to the same conclusion as the original study (that is what I refer to as “replication”). By using the word reproducible, I mean that the original data (and original computer code) can be analyzed (by an independent investigator) to obtain the same results as the original study. In essence, it is the notion that the <em>data analysis</em> can be successfully repeated. Reproducibility is particularly important in large computational studies where the data analysis can often play an outsized role in supporting the ultimate conclusions.</p>
<p>Many people seem to conflate the ideas of reproducibility and correctness, but they are not the same thing. One must always remember that <strong>a study can be reproducible and still be wrong</strong>. By “wrong”, I mean that the conclusion or claim can be wrong. If I claim that X causes Y (think “sugar causes cancer”), my data analysis might be reproducible, but my claim might ultimately be incorrect for a variety of reasons. If my claim has any value, then others will attempt to replicate it and the correctness of the claim will be determined by whether others come to similar conclusions.</p>
<p>Then why is reproducibility so important? Reproducibility is important because <strong>it is the only thing that an investigator can guarantee about a study</strong>.</p>
<p>Contrary to what most press releases would have you believe, an investigator cannot guarantee that the claims made in a study are correct (unless they are purely descriptive). This is because in the history of science, no meaningful claim has ever been proven by a single study. (The one exception might be mathematics, where they are literally proving things in their papers.) So reproducibility is important not because it ensures that the results are correct, but rather because it ensures transparency and gives us confidence in understanding exactly what was done.</p>
<p>These days, with the complexity of data analysis and the subtlety of many claims (particularly about complex diseases), reproducibility is pretty much the only thing we can hope for. Time will tell whether we are ultimately right or wrong about any claims, but reproducibility is something we can know right now.</p>
Post-Piketty Lessons
2014-06-03T07:04:14+00:00
http://simplystats.github.io/2014/06/03/post-piketty-lessons
<p>The latest crisis in data analysis comes to us (once again) from the field of Economics. Thomas Piketty, a French economist, recently published a book titled <em>Capital in the 21st Century</em> that has been a best-seller. I have not read the book, but based on media reports, it appears to make the claim that inequality has increased in recent years and will likely increase into the future. The book argues that this increase in inequality is driven by capitalism’s tendency to reward capital more than labor. This is my non-economist’s understanding of the book, but the specific claims of the book are not what I want to discuss here (there is much discussion elsewhere).</p>
<p>An interesting aspect of Piketty’s work, from my perspective, is that <a href="http://piketty.pse.ens.fr/en/capital21c2">he has made all of his data and analysis available on the web</a>. From what I can tell, his analysis was not trivial—data were collected and merged from multiple disparate sources and adjustments were made to different data series to account for various incompatibilities. To me, this sounds like a standard data analysis, in the sense that all meaningful data analyses are complicated. As noted by Nate Silver, data do not arise from a “<a href="http://fivethirtyeight.com/features/be-skeptical-of-both-piketty-and-his-skeptics/">virgin birth</a>”, and in any example worth discussing, much work has to be done to get the data into a state in which statistical models can be fit, or even more simply, plots can be made.</p>
<p>Chris Giles, a journalist for the Financial Times, recently published a column (unfortunately blocked by paywall) in which he claimed that much of the analysis that Piketty had done was flawed or incorrect. In particular, he claimed that based on his (Giles’) analysis, inequality was not growing as much over time as Piketty claimed. Among other points, Giles claims that numerous errors were made in assembling the data and in Piketty’s original analysis.</p>
<p>This episode smacked of the recent <a href="http://simplystatistics.org/2013/04/16/i-wish-economists-made-better-plots/">Reinhart-Rogoff kerfuffle</a> in which some fairly basic errors were discovered in those economists’ Excel spreadsheets. Some of those errors only made small differences to the results, but a critical methodological component, in which the data were weighted in a special way, appeared to have a significant impact on the results if alternate approaches were taken.</p>
<p>Piketty has since <a href="http://www.nytimes.com/2014/05/30/upshot/thomas-piketty-responds-to-criticism-of-his-data.html?_r=0">responded forcefully</a> to the FT’s column, defending all of the work he has done and addressing the criticisms one by one. To me, the most important result of the FT analysis is that <em>Piketty’s work appears to be largely reproducible</em>. Piketty made his data available, with reasonable documentation (in addition to his book), and Giles was able to come up with the same numbers Piketty came up with. This is a <em>good thing</em>. Piketty’s work was complex, and the only way to communicate the entirety of it was to make the data and code available.</p>
<p>The other aspects of Giles’ analysis are, from an academic standpoint, largely irrelevant to me, particularly because I am not an economist. The reason I find them irrelevant is because the objections are largely over <em>whether he is correct or not</em>. This is an obviously important question, but in any field, no single study or even synthesis can be determined to be “correct” at that instance. Time will tell, and if his work is “correct”, his predictions will be borne out by nature. It’s not so satisfying to have to wait many years to know if you are correct, but that’s how science works.</p>
<p>In the meantime, economists will have a debate over the science and the appropriate methods and data used for analysis. This is also how science works, and it is only (really) possible because Piketty made his work reproducible. Otherwise, the debate would be largely uninformed.</p>
The Big in Big Data relates to importance not size
2014-05-28T11:31:15+00:00
http://simplystats.github.io/2014/05/28/the-big-in-big-data-relates-to-importance-not-size
<p>In the past couple of years several non-statisticians have asked me “what is Big Data exactly?” or “How big is Big Data?”. My answer has been “I think Big Data is much more about ‘data’ than ‘big’.” I explain below.</p>
<table>
<tr>
<td>
<a href="http://simplystatistics.org/2014/05/28/the-big-in-big-data-relates-to-importance-not-size/screen-shot-2014-05-28-at-10-14-53-am/" rel="attachment wp-att-3096"><img class="alignnone size-full wp-image-3096" alt="Screen Shot 2014-05-28 at 10.14.53 AM" src="http://simplystatistics.org/wp-content/uploads/2014/05/Screen-Shot-2014-05-28-at-10.14.53-AM.png" width="262" height="230" /></a>
</td>
<td>
<a href="http://simplystatistics.org/2014/05/28/the-big-in-big-data-relates-to-importance-not-size/screen-shot-2014-05-28-at-10-15-04-am/" rel="attachment wp-att-3097"><img class="alignnone size-full wp-image-3097" alt="Screen Shot 2014-05-28 at 10.15.04 AM" src="http://simplystatistics.org/wp-content/uploads/2014/05/Screen-Shot-2014-05-28-at-10.15.04-AM.png" width="265" height="233" /></a>
</td>
</tr>
</table>
<p>Since 2011 Big Data has been all over the news. The New York Times, The Economist, Science, Nature, etc. have told us that the Big Data Revolution is upon us (see Google Trends figure above). But was this really a revolution? What happened to the Massive Data Revolution (see figure above)? For this to be called a revolution, there must be some drastic change, a discontinuity, or a quantum leap of some kind. So has there been such a discontinuity in the rate of growth of data? Although this may be true for some fields (for example in genomics, next generation sequencing <a href="http://www.genome.gov/sequencingcosts/">did introduce a discontinuity around 2007</a>), overall, data size seems to have been growing at a steady rate for decades. For example, in the <a href="http://www.singularity.com/charts/page80.html">graph below</a> (see <a href="http://www.dtc.umn.edu/~odlyzko/doc/oft.internet.growth.pdf">this paper</a> for source) note the trend in internet traffic data (which btw dwarfs genomics data). There does seem to be a change of rate, but it occurred during the 1990s, which brings me to my main point.</p>
<p><img alt="internet data traffic" src="http://www.singularity.com/images/charts/InternetDataTraffic2b.jpg" width="500" /></p>
<p>Although several fields (including Statistics) are having to innovate to keep up with growing data size, I don’t see this as something that new. But I do think that we are in the midst of a Big Data revolution. Although the media only noticed it recently, it started about 30 years ago. The discontinuity is not in the size of data, but in the percent of fields (across academia, industry and government) that use data. At some point in the 1980s, with the advent of cheap computers, data were moved from the file cabinet to the disk drive. Then in the 1990s, with the democratization of the internet, these data started to become easy to share. All of a sudden, people could use data to answer questions that were previously answered only by experts, theory or intuition.</p>
<p>In this blog we like to point out examples but let me review a few. Credit card companies started using purchase data to detect fraud. Baseball teams started scraping data and evaluating players without ever seeing them. Financial companies started analyzing stock market data to develop investment strategies. Environmental scientists started to gather and analyze data from air pollution monitors. Molecular biologists started quantifying outcomes of interest into matrices of numbers (as opposed to looking at stains on nylon membranes) to discover new tumor types and develop diagnostic tools. Cities started using crime data to guide policing strategies. Netflix started using customer ratings to recommend movies. Retail stores started mining bonus card data to deliver targeted advertisements. Note that all the data sets mentioned were tiny in comparison to, for example, sky survey data collected by astronomers. But I still call this phenomenon Big Data because the percent of people using data was in fact Big.</p>
<p><img src="http://simplystatistics.org/wp-content/uploads/2014/05/IMG_5053.jpg" alt="bigdata" /></p>
<p>I borrowed the title of this talk from a <a href="http://www.slideshare.net/kuonen/big-datadatascience-may2014">very nice presentation</a> by Diego Kuonen.</p>
10 things statistics taught us about big data analysis
2014-05-22T11:37:41+00:00
http://simplystats.github.io/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis
<p>In <a href="http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/">my previous post</a> I pointed out a major problem with big data is that applied statistics have been left out. But many cool ideas in applied statistics are really relevant for big data analysis. So I thought I’d try to answer the second question in my previous post: <em>“When thinking about the big data era, what are some statistical ideas we’ve already figured out?”</em> Because the internet loves top 10 lists I came up with 10, but there are more if people find this interesting. Obviously mileage may vary with these recommendations, but I think they are generally not a bad idea.</p>
<ol>
<li><strong>If the goal is prediction accuracy, average many prediction models together</strong>. In general, the prediction algorithms that most frequently win Kaggle competitions or the Netflix prize <a href="http://en.wikipedia.org/wiki/Ensemble_learning">blend multiple models together</a>. The idea is that by averaging (or majority voting) multiple good prediction algorithms you can reduce variability without increasing bias. One of the earliest descriptions of this idea was of a much simplified version based on <a href="http://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrapping samples</a> and building multiple prediction functions - a <a href="http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf">process called bagging</a> (short for bootstrap aggregating). <a href="http://en.wikipedia.org/wiki/Random_forest">Random forests</a>, another incredibly successful prediction algorithm, is based on a similar idea with classification trees.</li>
<li><strong>When testing many hypotheses, correct for multiple testing</strong> <a href="http://xkcd.com/882/">This comic</a> points out the problem with standard hypothesis testing when many tests are performed. Classic hypothesis tests are designed to call a set of data significant 5% of the time, even when the null is true (i.e. nothing is going on). One really common choice for correcting for multiple testing is to use <a href="http://en.wikipedia.org/wiki/False_discovery_rate">the false discovery rate</a> to control the rate at which things you call significant are false discoveries. People like this measure because you can think of it as the rate of noise among the signals you have discovered. Benjamini and Hochberg gave the <a href="http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf">first definition of the false discovery rate and provided a procedure to control the FDR</a>. There is also a really readable introduction to FDR by <a href="http://www.pnas.org/content/100/16/9440.full">Storey and Tibshirani</a>.</li>
<li><strong>When you have data measured over space, distance, or time, you should smooth</strong> This is one of the oldest ideas in statistics (regression is a form of smoothing and Galton <a href="http://en.wikipedia.org/wiki/Regression_toward_the_mean">popularized that a while ago</a>). I personally like locally weighted scatterplot smoothing a lot. <a href="http://www.people.fas.harvard.edu/~gov2000/Handouts/lowess.pdf">This paper</a> is a good one by Cleveland about loess. Here it is in a gif. <a href="http://simplystatistics.org/2014/02/13/loess-explained-in-a-gif/" rel="attachment wp-att-3069"><img class=" wp-image-3069 aligncenter" alt="loess" src="http://simplystatistics.org/wp-content/uploads/2014/05/loess.gif" width="202" height="202" /></a> But people also like <a href="http://en.wikipedia.org/wiki/Smoothing_spline">smoothing splines</a>, <a href="http://en.wikipedia.org/wiki/Hidden_Markov_model">Hidden Markov Models</a>, <a href="http://en.wikipedia.org/wiki/Moving_average">moving averages</a> and many other smoothing choices.</li>
<li><strong>Before you analyze your data with computers, be sure to plot it</strong> A common mistake made by amateur analysts is to immediately jump to fitting models to big data sets with the fanciest computational tool. But you can miss pretty obvious things <a href="http://en.wikipedia.org/wiki/Anscombe's_quartet">like this </a>if you don’t plot your data. <a href="http://en.wikipedia.org/wiki/File:Bland-altman_plot.png" rel="attachment wp-att-3068"><img class=" wp-image-3068 aligncenter" alt="ba" src="http://simplystatistics.org/wp-content/uploads/2014/05/ba.png" width="288" height="288" srcset="http://simplystatistics.org/wp-content/uploads/2014/05/ba-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2014/05/ba-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2014/05/ba.png 600w" sizes="(max-width: 288px) 100vw, 288px" /></a>There are too many plots to talk about individually, but one example of an incredibly important plot is the <a href="http://en.wikipedia.org/wiki/Bland%E2%80%93Altman_plot">Bland-Altman plot,</a> (called an MA-plot in genomics) when comparing measurements from multiple technologies. R provides tons of graphics for a reason and <a style="font-size: 16px;" href="http://ggplot2.org/">ggplot2</a> makes them pretty.</li>
<li><strong>Interactive analysis is the best way to really figure out what is going on in a data set</strong> This is related to the previous point; if you want to understand a data set you have to be able to play around with it and explore it. You need to make tables, make plots, identify quirks, outliers, missing data patterns and problems with the data. To do this you need to interact with the data quickly. One way to do this is to analyze the whole data set at once using tools like Hive, Hadoop, or Pig. But an often easier, better, and more cost effective approach is to use random sampling. As Robert Gentleman put it, “<a href="https://twitter.com/EllieMcDonagh/status/469184554549248000">make big data as small as possible as quick as possible</a>”.</li>
<li><strong>Know what your real sample size is. </strong> It can be easy to be tricked by the size of a data set. Imagine you have an image of a simple black circle on a white background stored as pixels. As the resolution increases the size of the data increases, but the amount of information may not (hence <a href="http://en.wikipedia.org/wiki/Vector_graphics">vector graphics</a>). Similarly in genomics, the number of reads you measure (which is a main determinant of data size) is not the sample size, it is the number of individuals. In social networks, the number of people in the network may not be the sample size. If the network is very dense, the sample size <a href="http://arxiv.org/pdf/1112.0840.pdf">might be much less</a>. In general the bigger the sample size the better, but sample size and data size aren’t always tightly correlated.</li>
<li><strong>Unless you ran a randomized trial, potential confounders should keep you up at night </strong>Confounding is maybe the most fundamental idea in statistical analysis. It is behind the <a href="http://www.tylervigen.com/">spurious correlations</a> like these and the reason why nutrition studies <a href="http://fivethirtyeight.com/features/eat-more-nuts-and-vegetables-and-dont-forget-to-exercise-and-quit-smoking/">are so hard</a>. It is very hard to hold people to a randomized diet and people who eat healthy diets might be different than people who don’t in other important ways. In big data sets confounders might be <a href="http://www.cis.jhu.edu/publications/papers_in_database/GEMAN/Geman_NatureReviews_2010.pdf">technical variables</a> about how the data were measured or they could be <a href="http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf">differences over time in Google search terms</a>. Any time you discover a cool new result, your first thought should be, “what are the potential confounders?”<a href="http://xkcd.com/552/" rel="attachment wp-att-3067"><img class=" wp-image-3067 aligncenter" alt="correlation" src="http://simplystatistics.org/wp-content/uploads/2014/05/correlation.png" width="275" height="111" /></a></li>
<li><strong>Define a metric for success up front</strong> Maybe the simplest idea, but one that is critical in statistics and <a href="http://en.wikipedia.org/wiki/Decision_theory">decision theory</a>. Sometimes your goal is to discover new relationships and that is great if you define that up front. One thing that applied statistics has taught us is that changing the criteria you are going for after the fact is really dangerous. So when you find a correlation, don’t assume you can predict a new result or that you have discovered which way a causal arrow goes.</li>
<li><strong>Make your code and data available and have smart people check it</strong> As several people pointed out about my last post, the Reinhart and Rogoff problem did not involve big data. But even in this small data example, there was a bug in the code used to analyze them. With big data and complex models this is even more important. Mozilla Science is <a href="http://mozillascience.org/code-review-for-science-what-we-learned/">doing interesting work</a> on code review for data analysis in science. But in general if you just get a friend to look over your code it will catch a huge fraction of the problems you might have.</li>
<li><strong><a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">Problem first not solution backward </a></strong>One temptation in applied statistics is to take a tool you know well (regression) and use it to hit all the nails (epidemiology problems). <a href="http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/hitnails/" rel="attachment wp-att-3066"><img class=" wp-image-3066 aligncenter" alt="hitnails" src="http://simplystatistics.org/wp-content/uploads/2014/05/hitnails.png" width="288" height="216" srcset="http://simplystatistics.org/wp-content/uploads/2014/05/hitnails-300x225.png 300w, http://simplystatistics.org/wp-content/uploads/2014/05/hitnails.png 800w" sizes="(max-width: 288px) 100vw, 288px" /></a>There is a similar temptation in big data to get fixated on a tool (hadoop, pig, hive, nosql databases, distributed computing, gpgpu, etc.) and ignore the problem of can we infer x relates to y or that x predicts y.</li>
</ol>
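<p>To make the multiple testing point (item 2) a bit more concrete, here is a minimal Python sketch of the Benjamini-Hochberg step-up procedure. The function name, the FDR level <code>q</code>, and the p-values are all made up for illustration; they are assumptions, not taken from any of the papers linked above.</p>

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Sketch of the Benjamini-Hochberg step-up rule: sort the p-values,
    find the largest rank k with p_(k) <= k * q / m, and reject the
    k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten tests (invented for this example).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive_hits = sum(p < 0.05 for p in p_values)      # uncorrected threshold
bh_hits = sum(benjamini_hochberg(p_values, q=0.05))  # FDR-controlled
print(naive_hits, bh_hits)
```

<p>With these invented numbers, five tests fall under the usual 0.05 cutoff but only two survive the correction, which is the kind of gap the comic is poking fun at.</p>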
Why big data is in trouble: they forgot about applied statistics
2014-05-07T10:08:32+00:00
http://simplystats.github.io/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics
<p>This year the idea that statistics is important for big data has exploded into the popular media. Here are a few examples, starting with the Lazer et al. paper in Science that got the ball rolling on this idea.</p>
<ul>
<li><a href="http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf">The parable of Google Flu: traps in big data analysis</a></li>
<li><a href="http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz30INfAyMi">Big data are we making a big mistake?</a></li>
<li><a href="http://bits.blogs.nytimes.com/2014/03/28/google-flu-trends-the-limits-of-big-data/">Google Flu Trends: the limits of big data</a></li>
<li><a href="http://www.nytimes.com/2014/04/07/opinion/eight-no-nine-problems-with-big-data.html">Eight (No, Nine!) Problems with Big Data</a></li>
</ul>
<p>All of these articles warn about issues that statisticians have been thinking about for a very long time: sampling populations, confounders, multiple testing, bias, and overfitting. In the rush to take advantage of the hype around big data, these ideas were ignored or not given sufficient attention.</p>
<p>One reason is that when you actually take the time to do an analysis right, with careful attention to all the sources of variation in the data, it is almost a law that you will have to make smaller claims than you could if you just shoved your data in a machine learning algorithm and reported whatever came out the other side.</p>
<p>The prime example in the press is Google Flu Trends. Google Flu Trends was originally developed as a machine learning algorithm for predicting the number of flu cases based on Google search terms. While the underlying data management and machine learning algorithms were correct, a misunderstanding about the uncertainties in the data collection and modeling process has led to highly inaccurate estimates over time. A statistician would have thought carefully about the sampling process, identified time series components to the spatial trend, investigated <em>why</em> the search terms were predictive and tried to understand the likely reason that Google Flu Trends was working.</p>
<p>As we have seen, lack of expertise in statistics has led to fundamental errors in both <a style="font-size: 16px;" href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">genomic science</a> and <a style="font-size: 16px;" href="http://simplystatistics.org/2013/04/19/podcast-7-reinhart-rogoff-reproducibility/">economics</a>. In the first case a team of scientists led by Anil Potti created an algorithm for predicting the response to chemotherapy. This solution was widely praised in both the scientific and popular press. Unfortunately the researchers did not correctly account for all the sources of variation in the data set and had misapplied statistical methods and ignored major data integrity problems. The lead author and the editors who handled this paper didn’t have the necessary statistical expertise, which led to major consequences and cancelled clinical trials.</p>
<p>Similarly, two economists, Reinhart and Rogoff, published a paper claiming that GDP growth was slowed by high governmental debt. Later it was discovered that there was an error in an Excel spreadsheet they used to perform the analysis. But more importantly, the choice of weights they used in their regression model was questioned as being unrealistic and leading to dramatically different conclusions than the authors espoused publicly. The primary failing was a lack of sensitivity analysis to data analytic assumptions that any well-trained applied statistician would have performed.</p>
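<p>As a rough illustration of what a sensitivity analysis to weighting can look like, here is a small Python sketch. It is a deliberate simplification (weighted averages rather than a regression model), the growth numbers are invented, and the country labels and function names are hypothetical; none of it is Reinhart and Rogoff's data or method.</p>

```python
from statistics import mean

# Hypothetical growth rates (percent) for high-debt country-years.
# These numbers are invented purely for illustration.
growth = {
    "A": [2.0, 3.0, 2.0, 3.0, 2.0, 3.0, 2.0, 3.0],  # many high-debt years
    "B": [-8.0],                                     # a single bad year
    "C": [1.0, 2.0],
}


def country_weighted(data):
    """Average of country averages: each country counts once."""
    return mean(mean(years) for years in data.values())


def year_weighted(data):
    """Pooled average: each country-year counts once."""
    return mean(rate for years in data.values() for rate in years)


# Report the same summary under both weighting assumptions.
for name, estimate in [("country-weighted", country_weighted(growth)),
                       ("year-weighted", year_weighted(growth))]:
    print(f"{name}: {estimate:.2f}")
```

<p>In this toy example the two weightings give estimates with opposite signs, so a conclusion that only holds under one of them deserves a caveat. Reporting both is the cheap version of the check described above.</p>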
<p>Statistical thinking has also been conspicuously absent from major public big data efforts so far. Here are some examples:</p>
<ul>
<li><a href="http://www.nitrd.gov/nitrdgroups/index.php?title=White_House_Big_Data_Partners_Workshop">White House Big Data Partners Workshop</a> - 0/19 statisticians</li>
<li><a href="http://sites.nationalacademies.org/DEPS/DEPS_087192">National Academy of Sciences Big Data Workshop</a> - 2/13 speakers statisticians</li>
<li><a href="http://news.cs.washington.edu/2013/11/12/uw-berkeley-nyu-collaborate-on-37-8m-data-science-initiative/">Moore Foundation Data Science Environments</a> - 0/3 directors from statistical background, 1/25 speakers at <a href="http://lazowska.cs.washington.edu/MS/OSTP.release.pdf">OSTP event</a> about the environments was a statistician</li>
<li><a href="http://acd.od.nih.gov/Data-and-Informatics-Implementation-Plan.pdf">Original group that proposed NIH BD2K</a> - 0/18 participants statisticians</li>
<li><a href="http://nsf.gov/news/news_videos.jsp?cntn_id=123607&media_id=72174&org=NSF">Big Data rollout from the White House</a> - 0/4 thought leaders statisticians, 0/n participants statisticians.</li>
</ul>
<p>One example of this kind of thinking is this insane table from the alumni magazine of the University of California, which I found from this <a href="http://www.chalmers.se/en/areas-of-advance/ict/calendar/Pages/Terry-Speed.aspx">talk by Terry Speed</a> (via Rafa, go watch his talk right now, it gets right to the heart of the issue). It shows a fundamental disrespect for applied statisticians who have developed serious expertise in a range of scientific disciplines.</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/screen-shot-2014-05-06-at-9-06-38-pm/" rel="attachment wp-att-3032"><img class=" wp-image-3032 aligncenter" alt="Screen Shot 2014-05-06 at 9.06.38 PM" src="http://simplystatistics.org/wp-content/uploads/2014/05/Screen-Shot-2014-05-06-at-9.06.38-PM.png" width="362" height="345" /></a>
</p>
<p>All of this leads to two questions:</p>
<ol>
<li>Given the importance of statistical thinking why aren’t statisticians involved in these initiatives?</li>
<li>When thinking about the big data era, what are some statistical ideas we’ve already figured out?</li>
</ol>
JHU Data Science: More is More
2014-05-05T10:09:09+00:00
http://simplystats.github.io/2014/05/05/jhu-data-science-more-is-more
<p>Today Jeff Leek, Brian Caffo, and I are launching 3 new courses on Coursera as part of the <a href="https://www.coursera.org/specialization/jhudatascience/1">Johns Hopkins Data Science Specialization</a>. These courses are</p>
<ul>
<li><a href="https://www.coursera.org/course/exdata">Exploratory Data Analysis</a></li>
<li><a href="https://www.coursera.org/course/repdata">Reproducible Research</a></li>
<li><a href="https://www.coursera.org/course/statinference">Statistical Inference</a></li>
</ul>
<p>I’m particularly excited about Reproducible Research, not just because I’m teaching it, but because I think it’s essentially the first of its kind being offered in a massive open format. Given the rich discussions about reproducibility that have occurred over the past few years, I’m happy to finally be able to offer this course for free to a large audience.</p>
<p>These courses are launching in addition to the first 3 courses in the sequence: <a href="https://www.coursera.org/course/datascitoolbox">The Data Scientist’s Toolbox</a>, <a href="https://www.coursera.org/course/rprog">R Programming</a>, and <a href="https://www.coursera.org/course/getdata">Getting and Cleaning Data</a>, which are also running this month in case you missed your chance in April.</p>
<p>All told we have 6 of the 9 courses in the Specialization available as of today. We’re really looking forward to next month, when we will be launching the final 3 courses: <a href="https://www.coursera.org/course/regmods">Regression Models</a>, <a href="https://www.coursera.org/course/predmachlearn">Practical Machine Learning</a>, and <a href="https://www.coursera.org/course/predmachlearn">Developing Data Products</a>. We also have some exciting announcements coming soon regarding the Capstone Projects.</p>
<p>Every course will be available every month, so don’t worry about missing a session. You can always come back next month.</p>
Confession: I sometimes enjoy reading the fake journal/conference spam
2014-04-30T10:00:06+00:00
http://simplystats.github.io/2014/04/30/confession-i-sometimes-enjoy-reading-the-fake-journalconference-spam
<p style="text-align: left">
I've spent a considerable amount of time setting up filters to avoid getting spam from fake <a href="http://www.nytimes.com/2013/04/08/health/for-scientists-an-exploding-world-of-pseudo-academia.html?pagewanted=all">journals and conferences</a>. Unfortunately, they are exceptionally good at thwarting my defenses. This does not annoy me as much as I pretend because, secretly, I enjoy reading some of these emails. Here are three of my favorites.
</p>
<p style="text-align: left">
1) Over-the-top robot:
</p>
<blockquote>
<p style="text-align: left">
<span style="font-style: italic">It gives us immense pleasure to invite you and your research allies to submit a manuscript for the journal “REDACTED”. The expertise of you in the never ending field of Gene Technology is highly appreciable. The level of intricacy shown by you in your work makes us even more proud, and </span><strong style="font-style: italic">we believe that your works should be known to mankind of science.</strong>
</p>
</blockquote>
<p>2) Sarcastic robot?</p>
<blockquote>
<p>First of all, congratulations on the publication of your highly cited original article &lt; The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores &gt; in the field of colon cancer, <strong>which has been cited more than 1 times and is in the world’s top one percent of papers</strong>. Such high number of citations reflects the high quality and influence of your paper.</p>
</blockquote>
<div>
<p>
3) Intimidating robot:
</p>
<blockquote>
<p>
This is Rocky.... Recently we have mailed you about the details of the conference. But we still have not received your response. So today we contact you again.
</p>
</blockquote>
<p>
NB: Although I am joking in this post, I do think these fake journals and conferences are a very serious problem. The fact that they are still around means enough money (mostly taxpayer money) is being spent to keep them in business. If you want to learn more, <a href="http://scholarlyoa.com/">this blog</a> does a good job of reporting on them and includes a <a href="http://scholarlyoa.com/publishers/">list of culprits</a>.
</p>
</div>
Picking a (bio)statistics thesis topic for real world impact and transferable skills
2014-04-22T14:39:27+00:00
http://simplystats.github.io/2014/04/22/picking-a-biostatistics-thesis-topic-for-real-world-impact-and-transferable-skills
<p>One of the things that was hardest for me in graduate school was starting to think about my own research projects and not just the ideas my advisor fed me. I remember that it was stressful because I didn’t quite know where to start. After having done this for a while and particularly after having read a bunch of papers by people who are way more successful than I am, I have come to the following algorithm as a means for finding a topic that will have real world impact and also give you skills to take on new problems in a flexible way.</p>
<ol>
<li> Find a scientific problem that hasn’t been solved with data (by far hardest part)</li>
<li>Define your metric for success</li>
<li> Collect data/partner up with someone with data for that problem.</li>
<li> Create a good solution to the problem</li>
<li> Only invent new methods if you must</li>
<li>(Optional) Write software and document the hell out of it</li>
<li>(Optional) Respond to users and update as needed</li>
<li>Don’t get (meanly) competitive</li>
</ol>
<p>The first step is definitely the most important and the hardest. The balance is between big important problems that lots of people are working on but where the potential for innovation is low and small detailed problems where you won’t have serious competition but you will have limited impact. In general, good ways to find scientific problems are the following. (1) Find close and real scientific/applications collaborators. Not real like you talk to them once a month, real like you have a weekly meeting, you try to understand how their data are collected or generated, and you ask them specifically what problems prevent them from doing their job well, then solve those problems. (2) Come up with a scientific question of your own. In mature research areas like genomics this requires a huge amount of reading to know what people have done before you, or to at least know what new technologies/data are becoming available. (3) Read a ton of papers and find one that produces interesting data you think could answer a question the authors haven’t asked. In general, <a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">the key is to put the problem first</a>, before you even think about how to quantify or answer the question.</p>
<p>Next you have to define your metric for success. This metric should be scientific. You should try to say, “if I could predict x at 70% accuracy I could solve scientific problem y” or “if I could infer the relationship between x and y I would know something about z”. The metric should be compared to the scientific standards in the field. As an example, screening tests for the general population often must be 99% sensitive and specific (or more) due to low prevalence. But in a subpopulation, sensitivity and specificity of 70% or 80% may be really useful.</p>
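<p>To see why prevalence matters so much for the metric, here is a minimal sketch that applies Bayes’ rule to compute the chance that a positive test is a true positive. The prevalences (0.1% in the general population, 30% in the subpopulation) are hypothetical numbers chosen only for illustration, and the function name is my own.</p>

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive test is a true positive (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical prevalences, chosen only for illustration.
general_population = positive_predictive_value(0.99, 0.99, prevalence=0.001)
high_risk_group = positive_predictive_value(0.80, 0.80, prevalence=0.30)

print(f"General population (99%/99% test): {general_population:.3f}")
print(f"High-risk subgroup (80%/80% test): {high_risk_group:.3f}")
```

<p>Under these assumed numbers, only about 9% of positives from the 99% test in the general population are real, while about 63% of positives from the 80% test in the subpopulation are real.</p>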
<p>Then you find the data. Here the key quote comes from Tukey:</p>
<blockquote>
<p>The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.</p>
</blockquote>
<p>My experience is that when you start with the problem first, the data are often hard to come by, have quirks, or are not quite right for the problem you want to solve. Generating the perfect data is often very expensive, so a huge amount of the effort you will spend is either (a) generating the perfect data or (b) determining if the data you collected are “good enough” to answer the question. One important point is that knowing when you have failed is the entire name of the game here. If you get stuck once, you should try again. If you get stuck 100 times, it might be time to look for a different data set or figure out why the problem is unanswerable with current data. Incidentally, this is the most difficult part of the approach I’m proposing for coming up with topics. Failure is both likely and frequent, but that is a good thing when you are in grad school if you can learn from it and learn to predict when you are going to fail.</p>
<p>Since you’ve identified a problem that hasn’t been solved before in step 1, the first thing to try is to come up with a sensible solution using only the methods that already exist. In many cases, these existing methods <a href="http://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/">will work pretty well</a>. If they don’t, invent only as much statistical methodology and theory as you need to solve the problem. If you invent something new here, you should try it out <a href="http://simplystatistics.org/2013/03/06/the-importance-of-simulating-the-extremes/">on simple simulated examples and complex data</a> where you either know the answer or can perform cross-validation/replication analysis.</p>
<p>At this point, if you have a basic solution to the problem, even if it is just the t-test, you are in great shape! You have solved a problem that is new and you are ready to publish. If you have invented some methods along the way, publish those, too!</p>
<p>In some cases the problems you solve will be focused on an area where lots of other people can collect similar data to answer similar problems. In this case, your most direct route to maximum impact is to write simple, usable, and really well documented software other people can use. Write it in R, make it free, give it a vignette and advertise it! If people use your software they will send you bug reports, patches, typos, fixes, and wish lists of things they want your software to do. The more you help people and respond, the more your software will get used and the more impact your method will have.</p>
<p>Step 8 is often the hardest part. If you do something interesting, you will have a ton of competitors. People will write better and more precise methods down and will “beat” your method. That’s ok, in fact it is good! The more people that compare to your approach, the more you know you picked a good problem. In some cases, people will genuinely create better methods than you will. Learn from them and make your methods and software better. But try not to be upset that they wrote a paper about how their idea is so much better than yours; it is a high compliment that they thought your idea was worth comparing to. This is one the author of the post hasn’t nailed down perfectly, but I think the more you can do it the happier you will be.</p>
<p>The best part of this algorithm is that it gives you the problem-first focus that will make it easy to transition if you do a postdoc with a different kind of data, or move to industry, or start with new collaborators.</p>
Correlation does not imply causation (parental involvement edition)
2014-04-17T10:00:24+00:00
http://simplystats.github.io/2014/04/17/correlation-does-not-imply-causation-parental-involvement-edition
<p>The New York Times recently published <a href="http://opinionator.blogs.nytimes.com/2014/04/12/parental-involvement-is-overrated/?rref=opinion&module=ArrowsNav&contentCollection=Opinion&action=keypress&region=FixedLeft&pgtype=Blogs">an article</a> on education titled “Parental Involvement Is Overrated”. Most research in this area supports the opposite view, but the authors claim that “evidence from our research suggests otherwise”. Before you stop helping your children understand long division or correcting their grammar, you should learn about one of the most basic statistical concepts: correlation does not imply causation. The first two chapters of this <a href="http://www.amazon.com/Statistics-4th-Edition-David-Freedman/dp/0393929728">very popular textbook</a> describe the problem and even <a href="https://www.khanacademy.org/math/probability/regression/regression-correlation/v/correlation-and-causality">Khan Academy</a> has a class on it. As several of the commenters on the NYT article point out, the authors fail to make this distinction.</p>
<p>To illustrate the problem, imagine you want to know how effective tutoring is for students in a math class you are teaching. So you compare the test scores of students that received tutoring to those that didn’t. You find that receiving tutoring is correlated with lower test scores. So do you conclude that tutoring causes lower grades? Of course not! In this particular case we are confusing cause and effect: students that have trouble with math are much more likely to seek out tutoring and this is what drives the observed correlation. With that example in mind, consider this quote from the New York Times article:</p>
<blockquote>
<p>When we examined whether regular help with homework had a positive impact on children’s academic performance, we were quite startled by what we found. Regardless of a family’s social class, racial or ethnic background, or a child’s grade level, consistent homework help almost never improved test scores or grades…. Even more surprising to us was that when parents regularly helped with homework, kids usually performed worse.</p>
</blockquote>
<p>A first question we would ask here is: how do we know that the children’s performance would not have been even worse had they not received help? I imagine the authors made use of <em>controls</em>: we compare the group that received the treatment (regular help with homework) to a control group that did not. But this brings up a more difficult question: how do we know that the treatment and control groups are comparable?</p>
<p>In a randomized controlled experiment, we would take a group of kids and randomly assign each one to the treatment group (will be helped with their homework) or control group (no help with homework). By doing this we can use probability calculations to determine the range of differences we expect to see by chance when the treatment has no effect. Note that by chance one group may end up with a few more “better testers” than the other. However, if we see a big enough difference that can’t be explained by chance, then the alternative that the treatment is responsible for the observed differences becomes more believable.</p>
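<p>As a rough illustration of those probability calculations, here is a minimal sketch of a permutation test. It reshuffles the group labels many times to see how large a difference in means shows up by chance alone when the treatment has no effect. The test scores are made up, and the function name and number of reshuffles are my own choices, not anything from the study.</p>

```python
import random


def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """Return the observed difference in means and a two-sided p-value."""
    rng = random.Random(seed)
    n_treated = len(treated)
    n_control = len(control)
    observed = sum(treated) / n_treated - sum(control) / n_control

    pooled = list(treated) + list(control)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign children to groups at random, as if help had no effect.
        rng.shuffle(pooled)
        shuffled_diff = (sum(pooled[:n_treated]) / n_treated
                         - sum(pooled[n_treated:]) / n_control)
        if abs(shuffled_diff) >= abs(observed):
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_permutations


# Made-up test scores, for illustration only.
helped = [78, 85, 90, 72, 88, 95, 81, 79]
not_helped = [70, 75, 82, 68, 74, 80, 77, 71]

observed_diff, p_value = permutation_p_value(helped, not_helped)
print(f"Observed difference in means: {observed_diff:.3f}")
print(f"Share of reshuffles at least as extreme: {p_value:.4f}")
```

<p>A small share means the observed difference is hard to explain by the luck of the assignment alone, which is exactly the argument that randomization makes possible.</p>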
<p>Given all the prior research (and common sense) suggesting that parent involvement, in its many manifestations, is in fact helpful to students, many would consider it unethical to run a randomized controlled trial on this issue (you would knowingly hurt the control group). Therefore, the authors are left with no choice but to use an <em>observational study</em> to reach their conclusions. In this case, we have no control over who receives help and who doesn’t. Kids that require regular help with their homework are different in many ways from kids that don’t, even after correcting for all the factors mentioned. For example, one can envision how kids that have a mediocre teacher or have trouble with tests are more likely to be in the treatment group, while kids who naturally test well or go to schools that offer in-school tutoring are more likely to be in the control group.</p>
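<p>A small simulation makes the point concrete. This is only a sketch under assumptions I made up: help truly raises every child’s score by 5 points, but children who would otherwise score below 65 are much more likely to receive help. None of these numbers come from the study.</p>

```python
import random

TRUE_EFFECT = 5.0  # assumed: help raises every child's score by 5 points


def naive_difference(n_children=20_000, seed=1):
    """Mean score of helped children minus mean score of the rest."""
    rng = random.Random(seed)
    helped, not_helped = [], []
    for _ in range(n_children):
        baseline = rng.gauss(70, 10)  # score the child would get with no help
        # Assumed selection mechanism: struggling children get help more often.
        p_help = 0.2 if baseline >= 65 else 0.8
        if rng.random() >= p_help:
            not_helped.append(baseline)
        else:
            helped.append(baseline + TRUE_EFFECT)
    return sum(helped) / len(helped) - sum(not_helped) / len(not_helped)


difference = naive_difference()
print(f"True effect of help: +{TRUE_EFFECT:.1f} points")
print(f"Observed difference (helped minus not helped): {difference:.1f} points")
```

<p>Even though help improves every child’s score in this made-up world, the helped group scores lower on average, because the groups were never comparable to begin with.</p>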
<p>I am not an expert on education, but as a statistician I am skeptical of the conclusions of this data-driven article. In fact, I would recommend parents actually do get involved early on by, for example, teaching children that correlation does not imply causation.</p>
<p>Note that I am not saying that observational studies are uninformative. If properly analyzed, observational data can be very valuable. For example, the data supporting smoking as a cause of lung cancer is all observational. Furthermore, there is an entire subfield within statistics (referred to as causal inference) that develops methodologies to deal with observational data. But unfortunately, observational data are commonly misinterpreted.</p>
The #rOpenSci hackathon #ropenhack
2014-04-10T09:41:09+00:00
http://simplystats.github.io/2014/04/10/the-ropensci-hackathon-ropenhack
<p><em>Editor’s note: This is a guest post by <a href="http://alyssafrazee.com/">Alyssa Frazee</a>, a graduate student in the Biostatistics department at Johns Hopkins and a participant in the recent rOpenSci hackathon. </em></p>
<p>Last week, I took a break from my normal PhD student schedule to participate in a <a href="https://github.com/ropensci/hackathon">hackathon</a> in San Francisco. The two-day event was hosted by <a href="http://ropensci.org/">rOpenSci</a>, an organization committed to developing R tools for open science. Working with <a href="https://github.com/ropensci/hackathon/wiki/Confirmed-attendees">several wonderful people</a> from the R community was inspiring, humbling, and incredibly fun. So many great things happened in a two-day whirlwind: it would be impossible now to capture the whole thing in a narrative that would do it justice. So instead of a play-by-play, here are some of the quotes from the event that I’ve recently been reflecting on:</p>
<h3 id="the-enemy-isnt-r-python-or-julia-the-enemy-is-closed-source-science"><strong>“The enemy isn’t R, Python, or Julia. The enemy is closed-source science.”</strong></h3>
<p dir="ltr">
There have been some lively internet debates recently about mathematical and scientific computing languages. While conversations about these languages are interesting and necessary, the forest often gets lost for the trees: in the end, we are here to do good science, and we should use whatever makes that easiest. We should build strong, collaborative communities, both within languages and across them. A closed-source science mentality hinders this kind of collaboration. I thought one of the hackathon projects, an <a href="https://github.com/takluyver/IRkernel">R kernel for the IPython notebook</a>, especially exemplified a commitment to open science and to cross-language collaboration. It was so awesome to spend two days with R folks like this who genuinely enjoy working together, in any language, to make scientific computing better.
</p>
<h3 id="pair-debugging-is-fun"><strong>“Pair debugging is fun!”</strong></h3>
<p dir="ltr">
This quote perfectly captures one of my favorite things about hackathons: genuine group work! During my time in graduate school, I've done most of my programming solo. I think this is the nature of getting a PhD: the projects have to be yours, and all the other PhD students are working on their solo projects. So I really enjoyed the hackathon because it facilitated true pair/group work: two or more peers working on the same project, in the same room, at the same time. I like this work strategy for many reasons:
</p>
<p dir="ltr">
• The rate at which I learn new things is high, since it's so easy to ask a question. Lots of time is saved by not having to sift through internet search results.
</p>
<p dir="ltr">
• Sometimes I find solo debugging to be <a href="https://twitter.com/irqed/status/358212928404586498">pretty painful</a>. But I think pair debugging is fun and satisfying: it's like an inspirational sports movie. It's you and me, the ragtag underdogs, against the computer, the evil bully from across town. Relatedly, the sweet sweet taste of victory is also shared.
</p>
<p dir="ltr">
• It's easier to stay focused on the task at hand. I'm not as easily distracted by email/Twitter/Facebook/blogs/the rest of the internet when I'm not coding alone.
</p>
<p dir="ltr">
My <a href="http://en.wikipedia.org/wiki/Academic_genealogy">academic sister</a>, <a href="http://hilaryparker.com/">Hilary</a>, and I did a good amount of pair debugging during the hackathon, and I kept finding myself thinking "I wish this would have been possible while we were both grad students!" I think we both had lots of fun working together. For a short discussion of more fun aspects of pairing, <a href="http://jvns.ca/blog/2014/03/02/pair-programming/">here's a blog post I like</a>. At the rOpenSci hackathon in particular, group work was especially awesome because we could ask questions in person to people who have written the libraries our projects depend on, or to RStudio developers, or to GitHub employees, or to potential users of the projects. Just some of the many joys of having lots of <a href="https://github.com/ropensci/hackathon/wiki/Confirmed-attendees">talented, friendly R programmers</a> all in the same space!
</p>
<h3 id="want-me-to-write-some-unit-tests-for-your-unit-tests"><strong>“Want me to write some unit tests for your unit tests?”</strong></h3>
<p dir="ltr">
During the hackathon, I primarily worked on a unit-testing package called <a href="https://github.com/ropensci/testdat">testdat</a>. Testdat provides functions that check for and fix common problems with tabular data, like UTF-8 characters and inconsistent missing data codes, with the overall goal of making data processing/cleaning more reproducible. The project was really good for a two-day hackathon, since it was small enough to almost finish in two days, and it was very modular: one person worked on the missing data checking functions, another worked on UTF-8 checking, a third wrote the tests for the finished functions (unit tests for unit tests!), etc. Also, it didn't require a lot of background knowledge in a specific subject area or a deep dive into an existing codebase: all it required were some coding skills and perhaps a frustrating experience with messy data in the past (for motivation).
</p>
<p dir="ltr">
Finding an appropriate project to work on was probably my biggest challenge at this hackathon. I spent the summer at <a href="https://www.hackerschool.com/">Hacker School</a>, where the days were structured similarly to how they were at the rOpenSci hackathon: there wasn't really any structure. In both scenarios, the minimal structure was intentional. Lots of great collaborative work can happen with a few free days of hacking. But with two free days at the hackathon (versus Hacker School's 50), it was much more important to choose a good project quickly and get coding. One way to do this would have been to arrive at the hackathon with a small project in hand (<a href="https://github.com/ropensci/hackathon/issues?state=open">many people did this</a>). My strategy, however, was to chat with a few different project groups for the first hour or two on day 1, and then stick with one of those groups for the rest of the time. It worked well -- as I mentioned above, testdat was a great project -- but I did feel some time pressure (internally!) to choose a small project quickly.
</p>
<p dir="ltr">
For a look at some of the other hackathon projects, check out <a href="https://github.com/ropensci">rOpenSci's GitHub page</a>, the <a href="https://github.com/ropensci/hackathon">hackathon GitHub page</a>, project-specific posts on the <a href="http://ropensci.org/blog/">rOpenSci blog</a>, or the hackathon's live-tweet hashtag, <a href="https://twitter.com/search?src=typd&q=%23ropenhack">#ropenhack</a>.
</p>
<h3 id="why-are-there-so-many-minnesotans-here"><strong>“Why are there so many Minnesotans here?”</strong></h3>
<p dir="ltr">
There were at least four hackathon attendees (out of 35-40 total) that either currently live in or hail from Minnesota. Talk about overrepresentation! We are everywhere.
</p>
<h3 id="i-love-my-job"><strong>“I love my job.”</strong></h3>
<p dir="ltr">
I'm a late-stage PhD student, so the job market is looming closer with every passing day. When I meet new people working in statistics, genomics, data science, or another related field, I like to ask them whether they like their current work, how it compares to other jobs they've had, etc. Hackathon attendees had all kinds of jobs: academic researcher, industry scientist, freelancer, student, etc. The majority of the responses to my inquiries about how they liked their work was "I love it." The situation made the job market seem exciting, rather than intimidating: among the hackathon attendees and folks from the SF data science community that hung out with us for a dinner, the jobs themselves were pretty heterogeneous, but the general enjoyment of the work seemed consistently high.
</p>
<h3 id="whats-the-future-of-r"><strong>“What’s the future of R?”</strong></h3>
<p dir="ltr">
I suppose we should have known that existential questions like this would come up when 40 passionate R people spend two straight days together. Our discussion of the future of R didn't really yield any definitive answers or predictions, but I think we have big dreams for what R's future will look like: vibrant, open, collaborative, and scientifically driven. If the hackathon atmosphere was any indication of R's future, I'm feeling pretty optimistic about where things are going.
</p>
<p>In closing: we’re really grateful to the people and organizations that made the hackathon possible: <a href="http://ropensci.org/">rOpenSci</a>, <a href="http://inundata.org/">Karthik Ram</a>, <a href="http://github.com">GitHub</a>, the <a href="http://www.sloan.org/">Sloan Foundation</a>, and <a href="http://f1000research.com/">F1000 Research</a>. Thanks for strengthening the R community, giving us the chance to meet each other outside of the internet, and helping us have a great time doing R, for science, together!</p>
Writing good software can have more impact than publishing in high impact journals for genomic statisticians
2014-04-07T10:46:16+00:00
http://simplystats.github.io/2014/04/07/writing-good-software-can-have-more-impact-than-publishing-in-high-impact-journals-for-genomic-statisticians
<p>Every once in a while we see computational papers published in science journals with high impact factors. Genomics related methods appear quite often in these journals. Several of my junior colleagues express frustration that all their papers get rejected from these journals. I tell them that the same is true for most of my papers and remind them of these examples:</p>
<table border="1">
<tr>
<th>
Method
</th>
<th>
Journal
</th>
<th>
Year
</th>
<th>
#Citations
</th>
</tr>
<tr>
<td>
PLINK
</td>
<td>
AJHG
</td>
<td align="right">
2007
</td>
<td align="right">
6481
</td>
</tr>
<tr>
<td>
Bioconductor
</td>
<td>
Genome Biology
</td>
<td align="right">
2004
</td>
<td align="right">
5973
</td>
</tr>
<tr>
<td>
RMA
</td>
<td>
Biostatistics
</td>
<td align="right">
2003
</td>
<td align="right">
5674
</td>
</tr>
<tr>
<td>
limma
</td>
<td>
SAGMB
</td>
<td align="right">
2004
</td>
<td align="right">
5637
</td>
</tr>
<tr>
<td>
quantile normalization
</td>
<td>
Bioinformatics
</td>
<td align="right">
2003
</td>
<td align="right">
4646
</td>
</tr>
<tr>
<td>
Bowtie
</td>
<td>
Genome Biology
</td>
<td align="right">
2009
</td>
<td align="right">
3849
</td>
</tr>
<tr>
<td>
BWA
</td>
<td>
Bioinformatics
</td>
<td align="right">
2009
</td>
<td align="right">
3327
</td>
</tr>
<tr>
<td>
Loess normalization
</td>
<td>
NAR
</td>
<td align="right">
2002
</td>
<td align="right">
3313
</td>
</tr>
<tr>
<td>
qvalues
</td>
<td>
JRSS-B
</td>
<td align="right">
2002
</td>
<td align="right">
2758
</td>
</tr>
<tr>
<td>
tophat
</td>
<td>
Bioinformatics
</td>
<td align="right">
2008
</td>
<td align="right">
1868
</td>
</tr>
<tr>
<td>
vsn
</td>
<td>
Bioinformatics
</td>
<td align="right">
2002
</td>
<td align="right">
1398
</td>
</tr>
<tr>
<td>
GCRMA
</td>
<td>
JASA
</td>
<td align="right">
2004
</td>
<td align="right">
1397
</td>
</tr>
<tr>
<td>
MACS
</td>
<td>
Genome Biology
</td>
<td align="right">
2008
</td>
<td align="right">
1277
</td>
</tr>
<tr>
<td>
deseq
</td>
<td>
Genome Biology
</td>
<td align="right">
2010
</td>
<td align="right">
1264
</td>
</tr>
<tr>
<td>
CBS
</td>
<td>
Biostatistics
</td>
<td align="right">
2004
</td>
<td align="right">
1051
</td>
</tr>
<tr>
<td>
R/qtl
</td>
<td>
Bioinformatics
</td>
<td align="right">
2003
</td>
<td align="right">
1027
</td>
</tr>
</table>
<p>Let me know of other examples in the comments.</p>
<p>update: I added one more to the list.</p>
This is how an important scientific debate is being used to stop EPA regulation
2014-04-01T09:13:08+00:00
http://simplystats.github.io/2014/04/01/this-is-how-an-important-scientific-debate-is-being-used-to-stop-epa-regulation
<p dir="ltr">
Environmental regulation in the United States has protected human health for over 40 years. Since the Clean Air Act was enacted in 1970, levels of outdoor air pollution have dropped dramatically, changing the landscape of once heavily-polluted cities like Los Angeles and Pittsburgh. A 2011 <a href="http://www.epa.gov/air/sect812/prospective2.html">cost-benefit analysis</a> conducted by the U.S. Environmental Protection Agency estimated that the 1990 amendments to the CAA prevented 160,000 deaths and 13 million lost work days in the year 2010 alone. They estimated that the monetary benefits of the CAA were 30 times greater than the costs of implementing the regulations.
</p>
<p dir="ltr">
The benefits of environmental regulations like the CAA significantly outweigh their costs. But there are still costs, and those costs must be borne by someone. The burden is usually put on the polluters, such as the automobile and power generation industries, which have long fought any notion of air pollution regulation as a threat to their existence. Initially, as air pollution and health studies were still emerging, opponents of regulation often challenged the science itself, claiming flaws in the methodology, the measurements, or the interpretation. But when study after study demonstrated a connection between outdoor air pollution and a variety of health problems, it became increasingly difficult for critics to mount a credible challenge. Lawsuits are another tactic used by industry, with one case brought by the American Trucking Association going all the way to the <a href="http://www.oyez.org/cases/2000-2009/2000/2000_99_1257">U.S. Supreme Court</a>.
</p>
<p>The latest attack comes from the House of Representatives in the form of the <a href="http://beta.congress.gov/bill/113th-congress/house-bill/4012">Secret Science Reform Act</a>, or H.R. 4012. In summary, the proposed bill requires that every scientific paper cited by the EPA to justify a new rule or regulation be reproducible. What exactly does this mean? To answer that question we need to take a brief diversion into some recent important developments in statistical science.</p>
<p>The idea behind reproducibility is simple. All the data used in a scientific paper and all the computer code used to analyze that data should be made available to other researchers and the public. It may be surprising that much of this data actually isn’t already available. The primary reason most data isn’t available is that, until recently, most people didn’t ask scientists for their data. The data was often small and collected for a specific purpose so other scientists and the general public just weren’t that interested. If a scientist were interested in checking the truth of a claim, she could simply repeat the experiment in her lab to see if the claim could be replicated.</p>
<p>The nature of science has changed quickly over the last three decades. There has been an explosion of data, fueled by the decreasing cost of data collection technologies and computing power. At the same time, increased access to sophisticated computing power has let scientists conduct more sophisticated analyses on their data. The massive growth in data and the increasing sophistication of the analyses has made communicating what was done in a scientific study more complicated.</p>
<p>The traditional medium of journal publications has proven to be inadequate for describing the important details of a data analysis. As a result, it has been said that scientific articles are merely the “advertising” for the research that was conducted. The real research is buried in the data and the computer code actually used to compute the results. Journals have traditionally not required that data or computer code be published along with papers. As a result, many important details may be lost and prevent key studies from being fully reproducible.</p>
<p>The explosion of data has also made completely replicating a large study by an independent scientist much more difficult and costly. A large study is expensive to conduct in the first place; there is usually little appetite or funding to repeat it. The result is that much of published scientific research cannot be reproduced by other scientists because the necessary data and analytic details are not available to others.</p>
<p>The scientific community is currently engaged in a debate over how to improve reproducibility across all of science. You might be tempted to ask, why not just share the data? Even if we could get everyone to agree with that in principle, it’s not clear how to do it.</p>
<p>Imagine if everyone in the U.S. decided we were all going to share our movie collections, and suppose for the sake of this example that the movie industry did not object. How would it work? Numerous questions immediately arise. Where would all these movies be stored? How would they be transferred from one person to another? How would I know what movies everyone else had? If my movies are all on the old DVD format, do I need to convert them to some other format before I can share? My Internet connection is very slow, how can I download a 3 hour HD movie? My mother doesn’t use computers much, but she has a great movie collection that I think others should have access to. What should she do? And who is going to pay for all of this? While each question may have a reasonable answer, it’s not clear what is the optimal combination and how you might scale it to the entire country.</p>
<p>Some of you may recall that the music industry had a brilliant sharing service that essentially allowed everyone to share their music collections. It was called Napster. Napster solved many of the problems raised above except for one – they failed to survive. So even when a decent solution is found, there’s no guarantee that it will always be there.</p>
<p>As outlandish as this example may seem, minor variations on these exact questions come up when we discuss how to share scientific data. The volume of data being produced today is enormous and making all of it available to everyone is not an easy task. That’s not to say it is impossible. If smart people get together and work constructively, it is entirely possible that a reasonable approach could be found. But at this point, a credible long-term solution has yet to emerge.</p>
<p>This brings us back to the Secret Science Reform Act. The latest tactic by opponents of air quality regulation is to force the EPA to ensure that all of the studies that it cites to support new regulations are reproducible. A cursory reading of the bill gives the impression that the sponsors are genuinely concerned about making science more transparent to the public. But when one reads the language of the bill in the context of ongoing discussions about reproducibility, it becomes clear that the sponsors of the bill have no such goal in mind. The purpose of H.R. 4102 is to prevent the Environmental Protection Agency from proposing new regulations.</p>
<p>The EPA develops rules and regulations on the basis of scientific evidence. For example, the Clean Air Act requires EPA to periodically review the scientific literature for the latest evidence on the health effects of air pollution. The science the EPA considers needs to be published in peer-reviewed journals. This makes the EPA a key consumer of scientific knowledge and it uses this knowledge to make informed decisions about protecting public health. What the EPA is not is a large funder of scientific studies. The entire budget for the Office of Research and Development at EPA is roughly $550 million (<a href="http://nepis.epa.gov/Exe/ZyNET.exe/P100GCS2.TXT?ZyActionD=ZyDocument&Client=EPA&Index=2011+Thru+2015&Docs=&Query=&Time=&EndTime=&SearchMethod=1&TocRestrict=n&Toc=&TocEntry=&QField=&QFieldYear=&QFieldMonth=&QFieldDay=&IntQFieldOp=0&ExtQFieldOp=0&XmlQuery=&File=D%3A%5Czyfiles%5CIndex%20Data%5C11thru15%5CTxt%5C00000007%5CP100GCS2.txt&User=ANONYMOUS&Password=anonymous&SortMethod=h%7C-&MaximumDocuments=1&FuzzyDegree=0&ImageQuality=r75g8/r75g8/x150y150g16/i425&Display=p%7Cf&DefSeekPage=x&SearchBack=ZyActionL&Back=ZyActionS&BackDesc=Results%20page&MaximumPages=1&ZyEntry=1&SeekPage=x&ZyPURL">fiscal 2014</a>), or less than 2 percent of the budget for the National Institutes of Health (about $30 billion for fiscal 2014). This means EPA has essentially no influence over the scientists behind many of the studies it cites because it funds very few of those studies. The best the EPA can do is politely ask scientists to make their data available. If a scientist refuses, there’s not much the EPA can use as leverage.</p>
<p dir="ltr">
The latest controversy to come up involves the <a href="http://www.ncbi.nlm.nih.gov/pubmed/8179653">Harvard Six Cities study</a> published in 1993. This landmark study found a large difference in mortality rates comparing cities with high and low air pollution, even after adjusting for smoking and other factors. The House committee has been trying to make the data for this study publicly available so that it can ensure that regulations are “<a href="http://online.wsj.com/news/articles/SB10001424127887323829104578624562008231682">backed by good science</a>”. However, the Committee has either forgotten or never knew that this particular study <a href="http://www.ncbi.nlm.nih.gov/pubmed/16020032">has been fully reproduced by independent investigators</a>. In 2005, independent investigators found that they were “...<a href="http://www.ncbi.nlm.nih.gov/pubmed/16020032">able to reproduce virtually all of the original numerical results</a>, including the 26 percent increase in all-cause mortality in the most polluted city (Stubenville, OH) as compared to the least polluted city (Portage, WI). The audit and validation of the Harvard Six Cities Study conducted by the reanalysis team generally confirmed the quality of the data and the numerical results reported by the original investigators.”
</p>
<p>It would be hard to find an air pollution study that has been subject to more scrutiny than the Six Cities study. Even if you believed the Six Cities study was totally wrong, its original findings have been replicated numerous times since its publication, with different investigators, in different populations, using different analysis techniques, and in different countries. If you’re looking for an example where the science was either not reproducible or not replicable, sorry, but this is not your case study.</p>
<p>Ultimately, it is clear that the sponsors of this bill are cynically taking advantage of a genuine (but difficult) scientific debate over reproducibility to push a political agenda. Scientists are in agreement that reproducibility is important, but there is no consensus yet on how to make it happen for everyone. By forcing the EPA to ensure reproducibility of the science on which it bases regulation, lawmakers are asking the EPA to solve a problem that the entire scientific community has yet to figure out. The end result of passing a bill like H.R. 4102 is that the EPA will be forced to stop proposing any new regulation, handing a major victory to opponents of air quality standards and dealing a major blow to public health in the U.S.</p>
Data Analysis for Genomics edX Course
2014-03-31T10:10:37+00:00
http://simplystats.github.io/2014/03/31/data-analysis-for-genomic-edx-course
<p>Mike Love (@mikelove) and I have been working hard the past couple of months preparing a free online <a href="https://www.edx.org/">edX</a> course on <a href="https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401">data analysis for genomics</a>. Our target audience is the postdocs, graduate students and research scientists who are tasked with analyzing genomics data, but don’t have any formal training. The eight-week course will start with the very basics, but will ramp up rather quickly and end with real-life workflows for genome variation, RNA-seq, DNA methylation, and ChIP-seq.</p>
<p>Throughout the course students will learn skills and concepts that provide a foundation for analyzing genomics data. Specifically, we will cover exploratory data analysis, basic statistical inference, linear regression, modeling with parametric distributions, empirical Bayes, multiple comparison corrections and smoothing techniques.</p>
<p>In the class we will make heavy use of computer labs. Almost every lecture is accompanied by an R markdown document that students can use to recreate the plots shown in the lectures. The html documents generated from these R markdown files will serve as a textbook for the class.</p>
<p>Questions will be discussed on online forums led by Stephanie Hicks (@stephaniehicks) and Jim MacDonald.</p>
<p>If you want to sign up, <a href="https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401">here</a> is the link.</p>
A non-comprehensive comparison of prominent data science programs on cost and frequency.
2014-03-26T10:07:45+00:00
http://simplystats.github.io/2014/03/26/a-non-comprehensive-comparison-of-prominent-data-science-programs-on-cost-and-frequency
<p><a href="http://simplystatistics.org/2014/03/26/a-non-comprehensive-comparison-of-prominent-data-science-programs-on-cost-and-frequency/screen-shot-2014-03-26-at-9-29-53-am/" rel="attachment wp-att-2872"><img class="alignnone size-full wp-image-2872" alt="Screen Shot 2014-03-26 at 9.29.53 AM" src="http://simplystatistics.org/wp-content/uploads/2014/03/Screen-Shot-2014-03-26-at-9.29.53-AM.png" width="743" height="226" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/Screen-Shot-2014-03-26-at-9.29.53-AM-300x91.png 300w, http://simplystatistics.org/wp-content/uploads/2014/03/Screen-Shot-2014-03-26-at-9.29.53-AM.png 743w" sizes="(max-width: 743px) 100vw, 743px" /></a></p>
<p>We did a really brief comparison of a few notable data science programs for a grant submission we were working on. I thought it was pretty fascinating, so I’m posting it here. A couple of notes about the table.</p>
<ol>
<li>
<p>Our program can be taken for free, which includes assessments. If you want the official certificate and to take the capstone you pay the above costs.</p>
</li>
<li>
<p>Udacity’s program can also be taken for free, but if you want the official certificate, assessments, or tutoring you pay the above costs.</p>
</li>
<li>
<p>The asterisks denote programs where you get an official master’s degree.</p>
</li>
<li>
<p>The MOOC programs (Udacity’s and ours) offer the most flexibility in terms of student schedules. Ours is the most flexible, with courses running every month. The in-person programs have the least flexibility but obviously the most direct instructor time.</p>
</li>
<li>
<p>The programs are all quite different in terms of focus, design, student requirements, admissions, instruction, cost and value.</p>
</li>
<li>
<p>As far as we know, ours is the only one where every bit of lecture content has been open sourced (<a href="https://github.com/DataScienceSpecialization">https://github.com/DataScienceSpecialization</a>).</p>
</li>
</ol>
The fact that data analysts base their conclusions on data does not mean they ignore experts
2014-03-24T10:00:42+00:00
http://simplystats.github.io/2014/03/24/the-fact-that-data-analysts-base-their-conclusions-on-data-does-not-mean-they-ignore-experts
<p>Paul Krugman recently <a href="http://krugman.blogs.nytimes.com/2014/03/23/tarnished-silver/?_php=true&_type=blogs&_php=true&_type=blogs&_r=1&">joined</a> the new FiveThirtyEight hating <a href="http://www.salon.com/2014/03/18/nate_silvers_new_fivethirtyeight_is_getting_some_high_profile_bad_reviews/">bandwagon</a>. I am not crazy about the new website either (although I’ll wait more than one week before judging) but in a recent post Krugman creates a false dichotomy that is important to correct. Krugman states that “[w]hat [Nate Silver] seems to have concluded is that there are no experts anywhere, that a smart data analyst can and should ignore all that.” I don’t think that is what Nate Silver, or any other smart data scientist or applied statistician, has concluded. Note that to build his election prediction model, Nate had to understand how the electoral college works, how polls work, how different polls are different, and the relationship between primaries and presidential elections, among many other details specific to polls and US presidential elections. He learned all of this by reading and talking to experts. The same is true for PECOTA, where data analysts who know quite a bit about baseball collect data to create meaningful and predictive summary statistics. As Jeff said before, <a href="http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/">the key word in “Data Science” is not Data, it is Science</a>.</p>
<p>The <a href="http://fivethirtyeight.com/features/disasters-cost-more-than-ever-but-not-because-of-climate-change/">one example</a> Krugman points too as ignoring experts appears to be written by someone who, <a href="http://thinkprogress.org/climate/2014/03/19/3416369/538-climate-article/">according to the article that Krugman links to</a>, was biased by his own opinions, not by data analysis that ignored experts. However, in Nate’s analysis of polls and baseball data it is hard to argue that he let his bias affect his analysis. Furthermore, it is important to point out that he did not simply stick data into a black box prediction algorithm. Instead he did what most of us applied statisticians do: we build empirically inspired models but guided by expert knowledge.</p>
<p>ps - Krugman links to a <a href="http://www.nytimes.com/2014/03/22/opinion/egan-creativity-vs-quants.html?src=me&ref=general">piece</a> which has another false dichotomy as the title: “Creativity vs. Quants”. He should try doing it before assuming there is no creativity involved in extracting information from data.</p>
The 80/20 rule of statistical methods development
2014-03-20T11:10:33+00:00
http://simplystats.github.io/2014/03/20/the-8020-rule-of-statistical-methods-development
<p>Developing statistical methods is hard and often frustrating work. One of the underappreciated rules in statistical methods development is what I call the 80/20 rule (maybe it could even be the 90/10 rule). The basic idea is that the first <em>reasonable</em> thing you can do to a set of data often is 80% of the way to the optimal solution. Everything after that is working on getting the last 20%. (<em>Edit: Rafa points out that once again I’ve <a href="http://simplystatistics.org/2011/12/03/reverse-scooping/">reverse-scooped</a> a bunch of people and this is already a thing that has been pointed out many times. See for example the <a href="http://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a> and <a href="http://c2.com/cgi/wiki?EightyTwentyRule">this post</a> also called the 80:20 rule</em>)</p>
<p>Sometimes that extra 20% is really important and sometimes it isn’t. In a clinical trial, where each additional patient may cost a large amount of money to recruit and enroll, it is definitely worth the effort. For more exploratory techniques like those often used when analyzing high-dimensional data it may not. This is particularly true because the extra 20% usually comes at a cost of additional assumptions about the way the world works. If your assumptions are right, you get the 20%, if they are wrong, you may lose and it isn’t always clear how much.</p>
<p>Here is a very simple example of the 80/20 rule from frequentist statistics - in my experience similar ideas hold in machine learning and Bayesian inference as well. Suppose that I collect some observations <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_062d0284d7f45f7d93ff7d4b0fb8886f.gif" style="vertical-align: middle; border: none; " class="tex" alt=" X_1,\ldots, X_n" /></span> and want to test whether the mean of the observations is greater than 0. Suppose I know that the data are normal and that the variance is equal to 1. Then the absolute best statistical test (called the <a href="http://en.wikipedia.org/wiki/Uniformly_most_powerful_test">uniformly most powerful test</a>) you could do rejects the hypothesis that the mean is zero if <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_8bc8d8fb0a6ff9570c357dab881a2891.gif" style="vertical-align: middle; border: none; " class="tex" alt=" \bar{X} > z_{\alpha}\left(\frac{1}{\sqrt{n}}\right) " /></span>.</p>
<p>There are a bunch of other tests you could do though. If you assume the distribution is symmetric you could also use the sign test to test the same hypothesis by creating the random variables <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_ef677414efbbce1f018925a0af81fed4.gif" style="vertical-align: middle; border: none; " class="tex" alt=" Y_i = 1(X_i > 0) " /></span> and testing the hypothesis <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_7ee857b7d44d932477dffae81af3056f.gif" style="vertical-align: middle; border: none; " class="tex" alt=" H_0: Pr(Y_i = 1) = 0.5 " /></span> versus the alternative that the probability is greater than 0.5. Or you could use the one-sided t-test. Or you could use the Wilcoxon test. These are suboptimal if you <em>know</em> the data are Normal with variance one.</p>
<p>I tried each of these tests with a sample of size <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_1e6b5b7ea03ea0c5cf784222e040a049.gif" style="vertical-align: middle; border: none; padding-bottom:1px;" class="tex" alt=" n=20 " /></span> at the <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_062331e0001151470fd8ce434608411c.gif" style="vertical-align: middle; border: none; padding-bottom:1px;" class="tex" alt=" \alpha=0.05 " /></span> level. In the plot below I show the ratio of power between each non-optimal test and the optimal z-test (you could do this theoretically but I’m lazy so did it with simulation, <a href="https://gist.github.com/jtleek/9665572">code here</a>, colors by <a href="http://alyssafrazee.com/RSkittleBrewer.html">RSkittleBrewer</a>).</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/relpower-3/" rel="attachment wp-att-2830"><img class=" wp-image-2830 aligncenter" alt="relpower" src="http://simplystatistics.org/wp-content/uploads/2014/03/relpower2.png" width="504" height="469" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/relpower2-300x279.png 300w, http://simplystatistics.org/wp-content/uploads/2014/03/relpower2.png 720w" sizes="(max-width: 504px) 100vw, 504px" /></a>
</p>
<p>The tests get to 80% of the power of the z-test for different sizes of the true mean (0.6 for Wilcoxon, 0.5 for the t-test, and 0.85 for the sign test). Overall, these methods very quickly catch up to the optimal method.</p>
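<p>For readers who want to try this without opening the gist, here is a minimal sketch of such a simulation. It is not the code linked above: the names <code>power_at</code> and <code>n_sim</code>, the grid of true means, and the choice of <code>binom.test</code> for the sign test are all assumptions made for illustration.</p>

```r
## Sketch only: estimate power by simulation for n = 20, alpha = 0.05.
## power_at, n_sim and the grid of means are illustrative choices.
set.seed(2014)
n <- 20
alpha <- 0.05
n_sim <- 1000

power_at <- function(mu) {
  rejections <- replicate(n_sim, {
    x <- rnorm(n, mean = mu, sd = 1)
    c(
      ## optimal z-test: known variance 1, reject if mean > z_alpha / sqrt(n)
      z = mean(x) > qnorm(1 - alpha) / sqrt(n),
      ## one-sided t-test: variance estimated from the data
      t = t.test(x, alternative = "greater")$p.value < alpha,
      ## Wilcoxon signed-rank test
      wilcoxon = wilcox.test(x, alternative = "greater")$p.value < alpha,
      ## sign test: number of positive observations vs Binomial(n, 0.5)
      sign = binom.test(sum(x > 0), n, p = 0.5,
                        alternative = "greater")$p.value < alpha
    )
  })
  rowMeans(rejections) ## fraction of simulations in which each test rejects
}

mus <- c(0.25, 0.5, 0.75, 1)
power <- sapply(mus, power_at)
colnames(power) <- mus

## power of each non-optimal test relative to the z-test
rel_power <- sweep(power[-1, ], 2, power["z", ], "/")
print(round(rel_power, 2))
```

<p>Each column of <code>rel_power</code> corresponds to one true mean, so reading across a row shows how quickly that test catches up with the z-test. With only 1,000 simulations per mean the ratios are noisy; increase <code>n_sim</code> for smoother curves.</p>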
<p>In this case, the non-optimal methods aren’t much easier to implement than the optimal solution. But in many cases, the optimal method requires significantly more computation, memory, assumptions, theory, or some combination of the four. The hard decision when creating a new method is whether the 20% is worth it. This is obviously application specific.</p>
<p>An important corollary of the 80/20 rule is that you can have a huge impact on new technologies if you are the first to suggest an already known 80% solution. For example, the first person to suggest <a href="http://www.pnas.org/content/95/25/14863.long">hierarchical clustering</a> or the <a href="http://www.pnas.org/content/97/18/10101.abstract">singular value decomposition</a> for a new high-dimensional data type will often get a large number of citations. But that is a hard way to make a living - you aren’t the only person who knows about these methods and the person who says it first soaks up a huge fraction of the credit. So the only way to take advantage of this corollary is to spend your time constantly trying to figure out what the next big technology will be. And you know what they say about prediction being hard, especially about the future.</p>
The time traveler's challenge.
2014-03-19T09:50:00+00:00
http://simplystats.github.io/2014/03/19/the-end-of-the-world-challenge
<p><em>Editor’s note: This has nothing to do with statistics. </em></p>
<p>I do a lot of statistics for a living and would claim to know a relatively large amount about it. I also know a little bit about a bunch of other scientific disciplines, a tiny bit of engineering, a lot about pointless sports trivia, some current events, the geography of the world (vaguely) and the geography of places I’ve lived (pretty well).</p>
<p>I have often wondered, if I was transported back in time to a point before the discovery of say, how to make a fire, how much of human knowledge I could recreate. In other words, what would be the marginal effect on the world of a single person (me) being transported back in time? I could propose Newton’s Laws, write down many of the basics of calculus, and discover the central limit theorem. I probably couldn’t build an internal combustion engine - I know the concept but don’t know enough of the details. So the challenge is this.</p>
<p><em> If you were transported back 4,000 or 5,000 years, how much could you accelerate human knowledge?</em></p>
<p>When I told Leah J. about this idea she came up with an even more fascinating variant.</p>
<p><em>Suppose that I told you that in 5 days you were going to be transported back 4,000 or 5,000 years but you couldn’t take anything with you. What would you read about on Wikipedia? </em></p>
ENAR is in Baltimore - Here's What To Do
2014-03-14T14:18:20+00:00
http://simplystats.github.io/2014/03/14/enar-is-in-baltimore-heres-what-to-do
<p>This year’s meeting of the Eastern North American Region of the International Biometric Society (ENAR) is in lovely Baltimore, Maryland. As local residents Jeff and I thought we’d put down a few suggestions for what to do during your stay here in case you’re not familiar with the area.</p>
<p><strong>Venue</strong></p>
<p>The conference is being held at the Marriott in the Harbor East area of the city, which is relatively new and a great location. There are a number of good restaurants right in the vicinity, including <a href="http://www.witandwisdombaltimore.com">Wit & Wisdom</a> in the Four Seasons hotel across the street and <a href="http://www.pabuizakaya.com">Pabu</a>, an excellent Japanese restaurant that I personally believe is the best restaurant in Baltimore (a very close second is <a href="http://www.woodberrykitchen.com">Woodberry Kitchen</a>, which is a bit farther away near Hampden). If you go to Pabu, just don’t get sushi; try something new for a change. Around Harbor East you’ll also find <a href="http://www.cgeno.com">Cinghiale</a> (excellent northern Italian restaurant), <a href="http://www.charlestonrestaurant.com">Charleston</a> (expensive southern food), <a href="http://www.lebanesetaverna.com/lebanese-restaurant-baltimore-md.html">Lebanese Taverna</a>, and <a href="http://www.ouzobay.com">Ouzo Bay</a>. If you’re sick of restaurants, there’s also a Whole Foods. If you want a great breakfast, you can walk just a few blocks down Aliceanna Street to the <a href="http://bluemoonbaltimore.com">Blue Moon Cafe</a>. Get the eggs Benedict. If you get the Cap’n Crunch French toast, you will need a nap afterwards.</p>
<p>Just east of Harbor East is an area called Fell’s Point. This is commonly known as the “bar district” and it lives up to its reputation. <a href="http://www.maxs.com">Max’s</a> in Fell’s Point (on the square) has an obscene number of beers on tap. The <a href="http://heavyseasalehouse.com">Heavy Seas Alehouse</a> on Central Avenue has some excellent beers from the local Heavy Seas brewery and also has great food from chef <a href="https://twitter.com/Matt_Seeber">Matt Seeber</a>. Finally, the <a href="http://www.fellsgrind.com/Index.aspx">Daily Grind</a> coffee shop is a local institution.</p>
<p><strong>Around the Inner Harbor</strong></p>
<p>Outside of the immediate Harbor East area, there are a number of things to do. For kids, there’s <a href="http://www.portdiscovery.org/index.cfm?">Port Discovery</a>, which my 3-year-old son seems to really enjoy. There’s also the <a href="http://aqua.org">National Aquarium</a> where the Tuesday networking event will be held. This is also a great place for kids if you’re bringing family. There’s a neat <a href="https://www.google.com/maps/place/39°17'07.5%22N+76°36'20.4%22W/@39.285406,-76.605667,15z/data=!3m1!4b1!4m2!3m1!1s0x0:0x0">little park on Pier 6</a> that is small, but has a number of kid-related things to do. It’s a nice place to hang out when the weather is nice. Around the other side of the harbor is the <a href="http://www.mdsci.org">Maryland Science Center</a>, another kid-fun place, and just west of the Harbor down Pratt Street is the <a href="http://www.borail.org">B&O Railroad Museum</a>, which I think is good for both kids and adults (I like trains).</p>
<p>Unfortunately, at this time there’s no football or baseball to watch.</p>
<p><strong>Around Baltimore</strong></p>
<p>There are a lot of really interesting things to check out around Baltimore if you have the time. If you need to get around downtown and the surrounding areas there’s the <a href="http://www.charmcitycirculator.com">Charm City Circulator</a> which is a free bus that runs every 15 minutes or so. The Mt. Vernon district has a number of cultural things to do. For classical music fans there’s the wonderful <a href="http://bsomusic.org">Baltimore Symphony Orchestra</a> directed by Marin Alsop. The <a href="http://www.peabody.jhu.edu">Peabody Institute</a> often has some interesting concerts going on given by the students there. There’s the <a href="http://thewalters.org">Walters Art Museum</a>, which is free, and has a very interesting collection. There are also a number of good restaurants and coffee shops in Mt. Vernon, like <a href="http://www.doobyscoffee.com">Dooby’s</a> (excellent dinner) and <a href="https://redemmas.org">Red Emma’s</a> (lots of Noam Chomsky).</p>
<p>That’s all I can think of right now. If you have other questions about Baltimore while you’re here for ENAR tweet us up at @simplystats.</p>
How to use Bioconductor to find empirical evidence in support of π being a normal number
2014-03-14T10:00:19+00:00
http://simplystats.github.io/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number
<p>Happy π day everybody!</p>
<p>I wanted to write some simple code (included below) to test the parallelization capabilities of my new cluster. So, in honor of π day, I decided to check for <a href="http://www.davidhbailey.com/dhbpapers/normality.pdf">evidence that π is a normal number</a>. A <a href="http://en.wikipedia.org/wiki/Normal_number">normal number</a> is a real number whose infinite sequence of digits has the property that the probability of finding any given m-digit pattern at a randomly picked position is 10<sup>−m</sup>. For example, using the Poisson approximation, we can predict that the pattern “123456789” should show up between 0 and 3 times in the <a href="http://stuff.mit.edu/afs/sipb/contrib/pi/">first billion digits of π</a> (it actually shows up twice, starting at the 523,551,502-th and 773,349,079-th decimal places).</p>
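<p>The Poisson arithmetic behind that prediction is short enough to check directly. This is a sketch under the assumption that every starting position is an independent trial with success probability 10<sup>−9</sup>; the variable names are mine.</p>

```r
## Poisson approximation for the count of one 9-digit pattern
## among the first billion digits (illustrative variable names).
n_digits <- 1e9
m <- 9                                   ## pattern length
lambda <- (n_digits - m + 1) * 10^(-m)   ## expected count, essentially 1

prob_0_to_3 <- ppois(3, lambda)          ## P(count <= 3), about 0.98
prob_exactly_2 <- dpois(2, lambda)       ## P(count == 2), about 0.18

print(c(lambda = lambda, prob_0_to_3 = prob_0_to_3,
        prob_exactly_2 = prob_exactly_2))
```

<p>So a count between 0 and 3 covers roughly 98% of the probability, and seeing the pattern exactly twice is entirely unremarkable.</p>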
<p>To test our hypothesis, let Y<sub>1</sub>, …, Y<sub>100</sub> be the number of times “00”, “01”, …, “99” appear in the first billion digits of π. If π is in fact normal then the Ys should be approximately IID binomials with N=1 billion and p=0.01. In the qq-plot below I show Z-scores (Y − 10,000,000) / √9,900,000, which appear to follow a normal distribution as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3,4,5,6, and 7 digit patterns (for 5,6 and 7 I sampled 10,000 patterns). Note that we can perform a chi-square test for the uniform distribution as well. For patterns of size 1,2,3 the p-values were 0.84, 0.92, and 0.99.</p>
<p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/pi-3/" rel="attachment wp-att-2792"><img class="alignnone size-full wp-image-2792" alt="pi" src="http://simplystatistics.org/wp-content/uploads/2014/03/pi2.png" width="4800" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/pi2-300x187.png 300w, http://simplystatistics.org/wp-content/uploads/2014/03/pi2-1024x640.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/pi2.png 4800w" sizes="(max-width: 4800px) 100vw, 4800px" /></a></p>
<p>Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.</p>
<p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/pi2/" rel="attachment wp-att-2793"><img class="alignnone size-full wp-image-2793" alt="pi2" src="http://simplystatistics.org/wp-content/uploads/2014/03/pi21.png" width="5600" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/pi21-1024x548.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/pi21.png 5600w" sizes="(max-width: 5600px) 100vw, 5600px" /></a></p>
<p>These observed counts should also be independent, and to explore this we can look at autocorrelation plots:</p>
<p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/piacf-2/" rel="attachment wp-att-2794"><img class="alignnone size-full wp-image-2794" alt="piacf" src="http://simplystatistics.org/wp-content/uploads/2014/03/piacf1.png" width="5600" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/piacf1-1024x548.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/piacf1.png 5600w" sizes="(max-width: 5600px) 100vw, 5600px" /></a></p>
<p>To do this in about an hour and with just a few lines of code (included below), I used the <a href="http://www.bioconductor.org/">Bioconductor</a> <a href="http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html">Biostrings</a> package to match strings and the <code class="language-plaintext highlighter-rouge">foreach</code> function to parallelize.</p>
<pre>library(Biostrings)
library(doParallel) ##also attaches foreach
registerDoParallel(cores = 48)
x=scan("pi-billion.txt",what="c")
x=substr(x,3,nchar(x)) ##remove 3.
x=BString(x)
n<-length(x)
par(mfrow=c(2,3))
for(d in 2:4){ ##use 2:7 to cover all pattern lengths discussed in the text
p <- 1/(10^d) ##probability of any one d-digit pattern; needs d, so set it inside the loop
if(d<5){
##all 10^d patterns, zero-padded to d digits
patterns<-sprintf(paste0("%0",d,"d"),seq(0,10^d-1))
} else{
##too many patterns to count them all: sample 10,000
patterns<-sprintf(paste0("%0",d,"d"),sample(10^d,10^4)-1)
}
##count occurrences of each pattern in parallel
res <- foreach(pat=patterns,.combine=c) %dopar% countPattern(pat,x)
z <- (res - n*p ) / sqrt( n*p*(1-p) )
qqnorm(z,xlab="Theoretical quantiles",ylab="Observed z-scores",main=paste(d,"digits"))
abline(0,1)
##correction: original post had length(res)
if(d<5) print(1-pchisq(sum ((res - n*p)^2/(n*p)),length(res)-1))
}
###Now count in segments
d <- 1
m <-10^5 ##segment length, giving n/m segments
patterns <-sprintf(paste0("%0",d,"d"),seq(0,10^d-1))
res <- foreach(pat=patterns,.combine=cbind) %dopar% {
tmp<-start(matchPattern(pat,x)) ##start position of every match
tmp2<-floor( (tmp-1)/m) ##zero-based index of the segment holding each match
tabulate(tmp2+1,nbins=n/m) ##matches per segment
}
##qq-plots
par(mfrow=c(2,5))
p <- 1/(10^d)
for(i in 1:ncol(res)){
z <- (res[,i] - m*p) / sqrt( m*p*(1-p) )
qqnorm(z,xlab="Theoretical quantiles",ylab="Observed z-scores",main=paste(i-1))
abline(0,1)
}
##ACF plots
par(mfrow=c(2,5))
for(i in 1:ncol(res)) acf(res[,i])</pre>
<p>NB: A normal number has the above stated property in any base. The examples above are for base 10.</p>
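<p>The segment test can also be tried on a small scale without Biostrings or a multi-core machine. The following is a minimal sketch, not the code used for the plots above: it assumes simulated uniform digits as a stand-in for the digits of π, and the sizes <code>n</code> and <code>m</code> are scaled-down, hypothetical values chosen so it runs in a second or two.</p>

```r
set.seed(1)
n <- 10^6   # number of digits (assumption: simulated, not the real digits of pi)
m <- 10^4   # segment length
digits <- sample(0:9, n, replace = TRUE)

# Label each digit with the non-overlapping segment it falls in
segment <- rep(seq_len(n / m), each = m)

# Rows are segments, columns are the digits 0-9
counts <- table(segment, factor(digits, levels = 0:9))

# z-scores under the binomial model, as in the qq-plots above
p <- 1 / 10
z <- (counts - m * p) / sqrt(m * p * (1 - p))

# Chi-square goodness-of-fit test on the overall digit counts
total <- colSums(counts)
pval <- 1 - pchisq(sum((total - n * p)^2 / (n * p)), df = length(total) - 1)

# Autocorrelation of the per-segment counts for the digit 0
ac <- acf(as.numeric(counts[, 1]), plot = FALSE)
```

<p>Calling <code>qqnorm(z[, 1])</code> or <code>plot(ac)</code> afterwards gives small-scale versions of the two kinds of plots shown in the post.</p>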
Oh no, the Leekasso....
2014-03-12T09:38:31+00:00
http://simplystats.github.io/2014/03/12/oh-no-the-leekasso
<p>An astute reader (Niels Hansen, who is visiting our department today) caught a bug in <a href="https://github.com/jtleek/leekasso">my code</a> on Github for the Leekasso. I had:</p>
<p><em>lm1 = lm(y ~ leekX)</em></p>
<p><em>predict.lm(lm1,as.data.frame(<wbr />leekX2))</em></p>
<p>Unfortunately, this meant that I was getting predictions for the training set on the test set. Since I set up the test/training sets the same, this meant that I was actually getting training set error rates for the Leekasso. Niels Hansen noticed the bug and reran the fixed code with this term instead:</p>
<p><em>lm1 = lm(y ~ ., data = as.data.frame(leekX))</em></p>
<p><em>predict.lm(lm1,as.data.frame(<wbr />leekX2))</em></p>
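<p>To see why the first version silently returns training-set predictions, here is a minimal sketch on simulated data (the data and sizes are assumptions for illustration, not the Leekasso simulation itself). With <code>lm(y ~ leekX)</code> the model term is the matrix <code>leekX</code>; the new data frame has no column with that name, so <code>predict</code> finds the training matrix in the calling environment instead.</p>

```r
set.seed(1)
n <- 50
leekX  <- matrix(rnorm(n * 3), n, 3)  # training predictors (simulated)
leekX2 <- matrix(rnorm(n * 3), n, 3)  # test predictors (simulated)
y <- rnorm(n)

# Buggy version: newdata has columns V1..V3, none named leekX,
# so predict() falls back to the training matrix leekX
lm_bug <- lm(y ~ leekX)
pred_bug <- predict(lm_bug, as.data.frame(leekX2))

# Fixed version: the model terms are the data frame columns V1..V3,
# which predict() can find in newdata
lm_fix <- lm(y ~ ., data = as.data.frame(leekX))
pred_fix <- predict(lm_fix, as.data.frame(leekX2))

# The buggy "test" predictions are just the training fitted values
same_as_fitted <- isTRUE(all.equal(unname(pred_bug), unname(fitted(lm_bug))))
```

<p>Because the training and test sets have the same number of rows here, R gives no warning about the mismatch, which is what made the bug easy to miss.</p>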
<p>He created a heatmap subtracting the average accuracy of the Leekasso/Lasso and showed they are essentially equivalent.</p>
<p><a href="http://simplystatistics.org/2014/03/12/oh-no-the-leekasso/leekassolasso/" rel="attachment wp-att-2553"><img alt="LeekassoLasso" src="http://simplystatistics.org/wp-content/uploads/2014/01/LeekassoLasso-300x300.png" width="300" height="300" /></a></p>
<p>This is a bummer: the Leekasso isn’t a world-crushing algorithm. On the other hand, I’m happy that just choosing the top 10 is still competitive with the optimized lasso on average. More importantly, although I hate being wrong, I appreciate people taking the time to look through my code.</p>
<p>Just out of curiosity I’m taking a survey. Do you think I should publish this top10 predictor thing as a paper? Or do you think it is too trivial?</p>
Per capita GDP versus years since women received right to vote
2014-03-07T10:00:10+00:00
http://simplystats.github.io/2014/03/07/per-capita-gdp-versus-years-since-women-received-right-to-vote
<p>Below is a plot of per capita GDP (in log scale) against years since women received the right to vote for 42 countries. Is this cause, effect, both or neither? We all know correlation does not imply causation, but I see many (non statistical) arguments to support both cause and effect here. Happy <a href="http://en.wikipedia.org/wiki/International_Women's_Day">International Women’s Day</a>! <a href="http://simplystatistics.org/2014/03/07/per-capita-gdp-versus-years-since-women-received-right-to-vote/rplot/" rel="attachment wp-att-2766"><img class="alignnone size-full wp-image-2766" alt="Rplot" src="http://simplystatistics.org/wp-content/uploads/2014/03/Rplot.png" width="983" height="591" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/Rplot-300x180.png 300w, http://simplystatistics.org/wp-content/uploads/2014/03/Rplot.png 983w" sizes="(max-width: 983px) 100vw, 983px" /></a></p>
<p>The data is from <a href="http://www.infoplease.com/ipa/A0931343.html">here</a> and <a href="http://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita">here</a>. I removed countries where women have had the right to vote for less than 20 years.</p>
<p>P.S. - What’s with Switzerland?</p>
<p>update - R^2 and p-value added to graph</p>
PLoS One, I have an idea for what to do with all your profits: buy hard drives
2014-03-05T11:07:03+00:00
http://simplystats.github.io/2014/03/05/plos-one-i-have-an-idea-for-what-to-do-with-all-your-profits-buy-hard-drives
<p>I’ve been closely following the fallout from PLoS One’s <a href="http://www.plos.org/data-access-for-the-open-access-literature-ploss-data-policy/">new policy for data sharing</a>. The policy says, basically, that if you publish a paper, all data and code to go with that paper should be made publicly available at the time of publishing, and authors should include an explicit data sharing policy in the paper they submit.</p>
<p>I think the reproducibility debate is over. Data should be made available when papers are published. The <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">Potti scandal</a> and the <a href="http://simplystatistics.org/2013/04/19/podcast-7-reinhart-rogoff-reproducibility/">Reinhart/Rogoff scandal</a> have demonstrated the extreme consequences of lack of reproducibility and the reproducibility advocates have taken this one home. The question with reproducibility isn’t “if” anymore; it is “how”.</p>
<p>The transition toward reproducibility is likely to be rough for two reasons. One is that many people who generate data lack training in <a href="http://simplystatistics.org/2012/04/27/people-in-positions-of-power-that-dont-understand/">handling and analyzing data</a>, even in a data saturated field like genomics. The story is even more grim in areas that haven’t been traditionally considered “data rich” fields.</p>
<p>The second problem is a cultural and economic problem. It involves the fundamental disconnect between (1) the incentives of our system for advancement, grant funding, and promotion and (2) the policies that will benefit science and improve reproducibility. Most of the debate on social media seems to conflate these two issues. I think it is worth breaking the debate down into three main constituencies: journals, data creators, and data analysts.</p>
<p><strong>Journals with requirements for data sharing</strong></p>
<p>Data sharing, especially for large data sets, isn’t easy and it isn’t cheap. Not knowing how to share data is not an excuse - to be a modern scientist this is one of the skills you have to have. But if you are a journal that <a href="http://www.nature.com/news/plos-profits-prompt-revamp-1.14205">makes huge profits</a> and you want open sharing, you should put up or shut up. The best way to do that would be to pay for storage on something like AWS for all data sets submitted to comply with your new policy. In the era of cheap hosting and standardized templates, charging $1,000 or more for an open access paper is way too much. It costs essentially nothing to host that paper online and you are getting peer review for free. So you should spend some of your profits paying for the data sharing that will benefit your journal and the scientific community.</p>
<p><strong>Data creators</strong></p>
<p>It is really hard to create a serious, research quality data set in almost any scientific discipline. If you are studying humans, it requires careful adherence to rules and procedures for handling human data. If you are in ecology, it may involve extensive field work. If you are in behavioral research, it may involve careful review of thousands of hours of video tape.</p>
<p>The value of one careful, rigorous, and interesting data set is hard to overstate. In my field, the data Leonid Kruglyak’s group generated measuring <a href="http://www.pnas.org/content/102/5/1572.long">gene expression and genetics</a> in a careful yeast experiment spawned an entirely new discipline within both genomics and statistics.</p>
<p>The problem is that to generate one really good data set can take months or even years. It is definitely possible to publish more than one paper on a really good data set. But after the data are generated, most of these papers will have to do with data analysis, not data generation. If there are ten papers that could be published on your data set and your group publishes the data with the first one, you may get to the second or third, but someone else might publish 4-10.</p>
<p>This may be good for science, but it isn’t good for the careers of data generators. Ask anyone in academia whether they’d rather have 6 citations from awesome papers or 6 awesome papers and 100% of them will take the papers.</p>
<p>I’m completely sympathetic to data generators who spend a huge amount of time creating a data set and are worried they may be scooped on later papers. This is a place where the culture of credit hasn’t caught up with the culture of science. If you write a grant and generate an amazing data set that 50 different people use - you should absolutely get major credit for that in your next grant. However, you probably shouldn’t get authorship unless you intellectually contributed to the next phase of the analysis.</p>
<p>The problem is we don’t have an intermediate form of credit for data generators that is weighted more heavily than a citation. In the short term, this lack of a proper system of credit will likely lead data generators to make the following (completely sensible) decision: hold their data close and then publish multiple papers at once - <a href="http://www.nature.com/encode/#/threads">like ENCODE did</a>. This will drive everyone crazy and slow down science - but it is the appropriate career choice for data generators until our system of credit has caught up.</p>
<p><strong>Data analysts</strong></p>
<p>I think that data analysts who are pushing for reproducibility are genuine in their desire for reproducibility. I also think that the debate is over. I think we can contribute to the success of the reproducibility transition by figuring out ways to give stronger and more appropriate credit to data generators. I don’t think authorship is the right approach. But I do think that it is the right approach to loudly and vocally give credit to people who generated the data you used in your purely data analytic paper. That includes making sure the people that are responsible for their promotion and grants know just how incredibly critical it is that they keep generating data so you can keep doing your analysis.</p>
<p>Finally, I think that we should be more sympathetic to the career concerns of folks who generate data. I have written methods and made the code available. I have then seen people write very similar papers using my methods and code - then getting credit/citations for producing a very similar method to my own. Being <a href="http://simplystatistics.org/2011/12/03/reverse-scooping/">scooped</a> like this is incredibly frustrating. If you’ve ever had that experience imagine what it would feel like to spend a whole year creating a data set and then only getting one publication.</p>
<p>I also think that the primary use of reproducibility so far has been as a weapon. It has been used (correctly) to point out critical flaws in research. It has also been used as a way to embarrass authors who don’t (<a href="http://simplystatistics.org/2013/09/26/how-could-code-review-discourage-code-disclosure-reviewers-with-motivation/">and even some who do</a>) have training in data analysis. The transition to fully reproducible science can either be a painful fight or a smoother transition. One thing that would go a long way would be to think of code review/reproducibility not like peer review, but more like pull requests and issues on Github. The goal isn’t to show how the other person did it wrong, the goal is to help them do it right.</p>
Data Science is Hard, But So is Talking
2014-02-26T09:01:07+00:00
http://simplystats.github.io/2014/02/26/data-science-is-hard-but-so-is-talking
<p>Jeff, Brian, and I had to record nine separate introductory videos for our <a href="http://jhudatascience.org">Data Science Specialization</a> and, well, some of us were better at it than others. It takes a bit of practice to read effectively from a teleprompter, something that is exceedingly obvious from this video.</p>
Here's why the scientific publishing system can never be "fixed"
2014-02-21T12:41:55+00:00
http://simplystats.github.io/2014/02/21/heres-why-the-scientific-publishing-system-can-never-be-fixed
<p>There’s been much discussion recently about how the scientific publishing system is “broken”. Just the latest one that I saw was a tweet from Princeton biophysicist Josh Shaevitz:</p>
<blockquote class="twitter-tweet" lang="en">
<p>
Editor at a ‘fancy’ journal to my postdoc “This is amazing work that will change the field. No one will publish it.” Sci. pubs are broken.
</p>
<p>
— Joshua Shaevitz (@shaevitz) <a href="https://twitter.com/shaevitz/statuses/433990986457284608">February 13, 2014</a>
</p>
</blockquote>
<p>On this blog, we’ve talked quite a bit about the publishing system, including in this interview with <a href="http://simplystatistics.org/2013/12/12/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science/">Michael</a> <a href="http://simplystatistics.org/2013/12/13/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science-part-22/">Eisen</a>. Jeff recently posted about <a href="http://simplystatistics.org/2014/02/05/just-a-thought-on-peer-reviewing-i-cant-help-myself/">changing the reviewing system</a> (again). We have a <a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/">few</a> <a href="http://simplystatistics.org/2013/10/23/the-leek-group-guide-to-reviewing-scientific-papers/">other</a> <a href="http://simplystatistics.org/2013/10/22/blog-posts-that-impact-real-science-software-review-and-gtex/">posts</a> <a href="http://simplystatistics.org/2013/09/04/repost-a-proposal-for-a-really-fast-statistics-journal/">on</a> <a href="http://simplystatistics.org/2012/01/26/when-should-statistics-papers-be-published-in-science/">this</a> <a href="http://simplystatistics.org/2011/12/14/dear-editors-associate-editors-referees-please-reject/">topic</a>. Yes, we like to complain like the best of them.</p>
<p>But there’s a simple fact: The scientific publishing system, as broken as you may find it to be, can never truly be fixed.</p>
<p>Here’s the tl;dr</p>
<ul>
<li>The collection of scientific publications out there makes up a marketplace of ideas, hypotheses, theorems, conjectures, and comments about nature.</li>
<li>Each member of society has an algorithm for placing a value on each of those publications. Valuation methodologies vary, but they often include factors like the reputation of the author(s), the journal in which the paper was published, the source of funding, as well as one’s own personal beliefs about the quality of the work described in the publication.</li>
<li>Given a valuation methodology, each scientist can rank order the publications from “most valuable” to “least valuable”.</li>
<li>Fixing the scientific publication system would require forcing everyone to agree on the same valuation methodology for all publications.</li>
</ul>
<p><strong>The Marketplace of Publications</strong></p>
<p>The first point is that the collection of scientific publications makes up a kind of market of ideas. Although we don’t really “trade” publications in this market, we do estimate the value of each publication and label some as “important” and some as not important. I think this is important because it allows us to draw analogies with other types of markets. In particular, consider the following question: Can you think of a market in any item where each item was priced perfectly, so that every (rational) person agreed on its value? I can’t.</p>
<p>Consider the stock market, which might be the most analyzed market in the world. Professional investors make their entire living analyzing the companies that are listed on stock exchanges and buying and selling their shares based on what they believe is the value of those companies. And yet, there can be huge disagreements over the valuation of these companies. Consider the current <a href="http://dealbook.nytimes.com/?s=herbalife">Herbalife drama</a>, where investors William Ackman and Carl Icahn (and Daniel Loeb) are taking complete opposite sides of the trade (Ackman is short and Icahn is long). They can’t both be right about the valuation; they must have different valuation strategies. Every day, the market’s collective valuation of different companies changes, reacting to new information and perhaps to <a href="http://www.nber.org/papers/w0456">irrational behavior</a>. In the long run, <a href="http://apple.com">good companies</a> survive while <a href="http://en.wikipedia.org/wiki/Pets.com">others</a> do not. In the meantime, everyone will argue about the appropriate price.</p>
<p>Journals are in some ways like the stock exchanges of yore. There are very prestigious ones (e.g. NYSE, the “big board”) and there are less prestigious ones (e.g. NASDAQ) and everyone tries to get their publication into the prestigious journals. Journals have listing requirements–you can’t just put any publication in the journal. It has to meet certain standards set by the journal. The importance of being listed on a prestigious stock exchange has diminished somewhat over the years. The most <a href="https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=oGEHU7jbE8TF6gH9Ew">valuable company in the world</a> trades on the NASDAQ. Similarly, although Science, Nature, and the New England Journal of Medicine are still quite sought after by scientists, competition is increasing from journals (such as those from the Public Library of Science) who are willing to publish papers that are technically correct and let readers determine their importance.</p>
<p><strong>What’s the “Fix”?</strong></p>
<p>Now let’s consider a world where we obliterate journals like Nature and Science and there’s only the “one true journal”. Suppose this journal accepts any publication that satisfies some basic technical requirements (i.e. not content-based) and then has a sophisticated rating system that allows readers to comment on, rate, and otherwise evaluate each publication. There is no pre-publication peer review. Everything is immediately published. Problem solved? Not really, in my opinion. Here’s what I think would end up happening:</p>
<ul>
<li>People would have to (slightly) alter their methodology for ranking individual scientists. They would not be able to say “so-and-so has 10 Nature papers, so he must be good”. But most likely, another proxy for actually reading the papers would arise. For example, “My buddy from University of Whatever put this paper in his top-ten list, so it must be good”. As <a href="http://simplystatistics.org/2013/12/13/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science-part-22/">Michael Eisen said in our interview</a>, the ranking system induced by journals like Science and Nature is just an abstract hierarchy; we can still reproduce the hierarchy even if Science/Nature don’t exist.</li>
<li>In the current system, certain publications often “get stuck” with overly inflated valuations and it is often difficult to effectively criticize such publications because there does not exist an equivalent venue for informed criticism on par with Science and Nature. These publications achieve such high valuations partly because they are published in high-end journals like Nature and Science, but partly it is because some people actually believe they are valuable. In other words, it is possible to create a “bubble” where people irrationally believe a publication is valuable, just because everyone believes it’s valuable. If you destroy the current publication system, there will still be publications that are “over-valued”, just like in every other market. And furthermore, it will continue to be difficult to criticize such publications. Think of all the analysts that were yelling about how the housing market was dangerously inflated back in 2007. Did anyone listen? Not until it was too late.</li>
</ul>
<p><strong>What Can be Done?</strong></p>
<p>I don’t mean for this post to be depressing, but I think there’s a basic reality about publication that perhaps is not fully appreciated. That said, I believe there are things that can be done to improve science itself, as well as the publication system.</p>
<ul>
<li><strong><span style="line-height: 16px;">Raise the </span><a style="line-height: 16px;" href="http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/">ROC curves of science</a></strong><span style="line-height: 16px;">. Efforts in this direction make everyone better and improve our ability to make more important discoveries.</span></li>
<li><strong>Increase the reproducibility of science</strong>. This is kind of the “<a href="http://en.wikipedia.org/wiki/Sarbanes–Oxley_Act">Sarbanes-Oxley</a>” of science. For the most part, I think the debate about <em>whether</em> science should be made more reproducible is coming to a close (or it is for me). The real question is how do we do it, for all scientists? I don’t think there are enough people thinking about this question. It will likely be a mix of different strategies, policies, incentives, and tools.</li>
<li><strong>Develop more sophisticated evaluation technologies for publications</strong>. Again, to paraphrase Michael Eisen, we are better able to judge the value of a pencil on Amazon than we are able to judge a scientific publication. The technology exists for improving the system, but someone has to implement it. I think a useful system along these lines would go a long way towards de-emphasizing the importance of “vanity journals” like Nature and Science.</li>
<li><strong>Make open access more accessible</strong>. Open access journals have been an important addition to the publication universe, but they are still very expensive (the cost has just been shifted). We need to think more about lowering the overall cost of publication so that it is truly open access.</li>
</ul>
<p>Ultimately, in a universe where there are finite resources, a system has to be developed to determine how those resources should be distributed. Any system that we can come up with will be flawed as there will by necessity have to be winners and losers. I think there are serious efforts that need to be made to make the system more fair and more transparent, but the problem will never truly be “fixed” to everyone’s satisfaction.</p>
Why do we love R so much?
2014-02-19T10:05:59+00:00
http://simplystats.github.io/2014/02/19/why-do-we-love-r-so-much
<p>When Jeff, Brian, and I started the <a href="http://jhudatascience.org">Johns Hopkins Data Science Specialization</a> we decided early on to organize the program around using R. Why? Because we love R, we use it everyday, and it has an incredible community of developers and users. The R community has created an ecosystem of packages and tools that lets R continue to be relevant and useful for real problems.</p>
<p>We created a short video to talk about one of the reasons we love R so much.</p>
k-means clustering in a GIF
2014-02-18T13:09:21+00:00
http://simplystats.github.io/2014/02/18/k-means-clustering-in-a-gif
<p><a href="http://en.wikipedia.org/wiki/K-means_clustering">k-means</a> is a simple and intuitive clustering approach. Here is a movie showing how it works:</p>
<p><a href="http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/kmeans/" rel="attachment wp-att-2716"><img class="alignnone size-full wp-image-2716" alt="kmeans" src="http://simplystatistics.org/wp-content/uploads/2014/02/kmeans.gif" width="480" height="480" /></a></p>
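<p>The standard k-means iteration (Lloyd’s algorithm) alternates between assigning each point to its nearest center and moving each center to the mean of its points. Here is a minimal sketch in base R, assuming toy simulated data with two groups; it is not the code or data behind the GIF, and in practice the built-in <code>kmeans</code> function is the better choice.</p>

```r
set.seed(1)
# Toy data: two simulated groups in 2D (an assumption for illustration)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
k <- 2

# Start from k randomly chosen data points
centers <- x[sample(nrow(x), k), , drop = FALSE]

for (iter in 1:20) {
  # Assignment step: squared distance from every point to every center
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  cluster <- max.col(-d, ties.method = "first")  # index of the nearest center

  # Update step: each center moves to the mean of its assigned points
  # (this sketch does not handle the rare case of an empty cluster)
  new_centers <- t(sapply(seq_len(k), function(j)
    colMeans(x[cluster == j, , drop = FALSE])))

  # Stop once the centers no longer move
  if (all(abs(new_centers - centers) < 1e-8)) break
  centers <- new_centers
}
```

<p>Plotting <code>x</code> colored by <code>cluster</code> after each pass through the loop gives the frames of an animation like the one above.</p>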
Repost: Ronald Fisher is one of the few scientists with a legit claim to most influential scientist ever
2014-02-17T12:03:55+00:00
http://simplystats.github.io/2014/02/17/repost-ronald-fisher-is-one-of-the-few-scientists-with-a-legit-claim-to-most-influential-scientist-ever
<p><em>Editor’s Note: This is a repost of the post “<a href="http://simplystatistics.org/2012/03/07/r-a-fisher-is-the-most-influential-scientist-ever/">R.A. Fisher is the most influential scientist ever</a>” with a picture of my pilgrimage to his gravesite in Adelaide, Australia. </em></p>
<p>You can now see profiles of famous scientists on Google Scholar citations. Here are links to a few of them (via Ben L.). <a href="http://scholar.google.com/citations?user=6kEXBa0AAAAJ&hl=en" target="_blank">Von Neumann</a>, <a href="http://scholar.google.com/citations?user=qc6CJjYAAAAJ&hl=en" target="_blank">Einstein</a>, <a href="http://scholar.google.com/citations?user=xJaxiEEAAAAJ&hl=en" target="_blank">Newton</a>, <a href="http://scholar.google.com/citations?user=B7vSqZsAAAAJ&hl=en" target="_blank">Feynman</a></p>
<p>But their impact on science pales in comparison (with the possible exception of Newton) to the impact of one statistician: <a href="http://en.wikipedia.org/wiki/Ronald_Fisher" target="_blank">R.A. Fisher</a>. Many of the concepts he developed are so common and are considered so standard, that he is never cited/credited. Here are some examples of things he invented along with a conservative number of citations they would have received calculated via Google Scholar*.</p>
<ol>
<li>P-values - <strong>3 million citations</strong></li>
<li>Analysis of variance (ANOVA) - <strong>1.57 million citations</strong></li>
<li>Maximum likelihood estimation - <strong>1.54 million citations</strong></li>
<li>Fisher’s linear discriminant - <strong>62,400 citations</strong></li>
<li>Randomization/permutation tests - <strong>37,940 citations</strong></li>
<li>Genetic linkage analysis - <strong>298,000 citations</strong></li>
<li>Fisher information - <strong>57,000 citations</strong></li>
<li>Fisher’s exact test - <strong>237,000 citations</strong></li>
</ol>
<p>A couple of notes:</p>
<ol>
<li>These are seriously conservative estimates, since I only searched for a few variants on some key words</li>
<li>These numbers are <strong>BIG</strong>, there isn’t another scientist in the ballpark. The guy who wrote the “<a href="http://www.jbc.org/content/280/28/e25.full" target="_blank">most highly cited paper</a>” got 228,441 citations on GS. His next most cited paper? <a href="http://scholar.google.com/citations?hl=en&user=YCS0XAcAAAAJ&oi=sra" target="_blank">3,000 citations</a>. Fisher has at least 5 papers with more citations than his best one.</li>
<li><a href="http://archive.sciencewatch.com/sept-oct2003/sw_sept-oct2003_page2.htm" target="_blank">This page</a> says Bert Vogelstein has the most citations of any person over the last 30 years. If you add up the number of citations to his top 8 papers on GS, you get 57,418. About as many as to the Fisher information matrix.</li>
</ol>
<p>I think this really speaks to a couple of things. One is that Fisher invented some of the most critical concepts in statistics. The other is the breadth of impact of statistical ideas across a range of disciplines. In any case, I would be hard pressed to think of another scientist who has influenced a greater range or depth of scientists with their work.</p>
<p><strong>Update:</strong> I recently went to Adelaide to give a couple of talks on Bioinformatics, Statistics and MOOCs. My host Gary informed me that Fisher was buried in Adelaide. I went to the cathedral to see the memorial and took this picture. I couldn’t get my face in the picture because the plaque was on the ground. You’ll have to trust me that these are my shoes.</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/02/17/repost-ronald-fisher-is-one-of-the-few-scientists-with-a-legit-claim-to-most-influential-scientist-ever/2013-12-03-16-27-07/" rel="attachment wp-att-2710"><img class="alignnone size-medium wp-image-2710" alt="2013-12-03 16.27.07" src="http://simplystatistics.org/wp-content/uploads/2014/02/2013-12-03-16.27.07-300x225.jpg" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2014/02/2013-12-03-16.27.07-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2014/02/2013-12-03-16.27.07-1024x768.jpg 1024w" sizes="(max-width: 300px) 100vw, 300px" /></a>
</p>
<ul>
<li>
<p>Calculations of citations</p>
<ol>
<li><a href="http://simplystatistics.tumblr.com/post/15402808730/p-values-and-hypothesis-testing-get-a-bad-rap-but-we" target="_blank">As described</a> in a previous post</li>
<li># of GS results for “Analysis of Variance” + # for “ANOVA” - “Analysis of Variance”</li>
<li># of GS results for “maximum likelihood”</li>
<li># of GS results for “linear discriminant”</li>
<li># of GS results for “permutation test” + # for “permutation tests” - “permutation test”</li>
<li># of GS results for “linkage analysis”</li>
<li># of GS results for “fisher information” + # for “information matrix” - “fisher information”</li>
<li># of GS results for “fisher’s exact test” + # for “fisher exact test” - “fisher’s exact test”</li>
</ol>
</li>
</ul>
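<p>One reading of the “A + B - A” notation in the list above is: count the results for phrase A, then add the results for phrase B that do not also mention phrase A, so nothing is counted twice. Here is a minimal sketch of that bookkeeping. The paper identifiers are made up for illustration and are not real Google Scholar results.</p>

```python
def count_without_double_counting(docs_a, docs_b):
    """Size of A plus size of (B minus A), which equals the size of A union B."""
    a, b = set(docs_a), set(docs_b)
    return len(a) + len(b - a)

# Hypothetical search results (invented identifiers, not real data).
matches_long_name = {"paper1", "paper2", "paper3"}   # mention "Analysis of Variance"
matches_short_name = {"paper3", "paper4"}            # mention "ANOVA"

total = count_without_double_counting(matches_long_name, matches_short_name)
print(total)  # 4: paper3 appears in both sets but is counted once
```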
On the scalability of statistical procedures: why the p-value bashers just don't get it.
2014-02-14T12:40:06+00:00
http://simplystats.github.io/2014/02/14/on-the-scalability-of-statistical-procedures-why-the-p-value-bashers-just-dont-get-it
<p><strong>Executive Summary</strong></p>
<ol>
<li>The problem is not p-values; it is a fundamental shortage of data analytic skill.</li>
<li>In general it makes sense to reduce researcher degrees of freedom for non-experts, but any choice of statistic, when used by many untrained people, will be flawed.</li>
<li>The long term solution is to require training in <strong>both statistics and data analysis</strong> for anyone who uses data but particularly journal editors, reviewers, and scientists in molecular biology, medicine, physics, economics, and astronomy.</li>
<li><a href="https://www.coursera.org/specialization/jhudatascience/1">The Johns Hopkins Specialization in Data Science</a> runs every month and can be easily integrated into any program. Other, more specialized, online courses and short courses make it possible to round this training out in ways that are appropriate for each discipline.</li>
</ol>
<p><strong>Scalability of Statistical Procedures</strong></p>
<p>The P-value is in the news again. <a href="http://www.nature.com/news/scientific-method-statistical-errors-1.14700?WT.mc_id=PIN_NatureNews">Nature came out with a piece</a> talking about how scientists are naive about the use of P-values <a href="https://twitter.com/leonidkruglyak/status/433747859414872065">among other things</a>. P-values have known flaws which have been regularly discussed. If you want to see some criticisms just Google “NHST”. Despite their flaws, from a practical perspective it is an oversimplification to point to the use of P-values as the critical flaw in scientific practice. The problem is not that people use P-values poorly; it is that <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">the vast majority of data analysis is not performed by people properly trained to perform data analysis.</a></p>
<p>Data are now abundant in nearly every discipline from astrophysics, to biology, to the social sciences, and even in qualitative disciplines like literature. By scientific standards, the growth of data came on at a breakneck pace. Over a period of about 40 years we went from a scenario where data were measured in bytes to one where they are measured in terabytes in almost every discipline. Training programs haven’t adapted to this new era. This is particularly true in genomics where within one generation we went from a data poor environment to a data rich environment. <a href="http://simplystatistics.org/2012/04/27/people-in-positions-of-power-that-dont-understand/">Many people in positions of power</a> were trained before data were widely available and used.</p>
<p>The result is that the vast majority of people performing statistical and data analysis are people with only one or two statistics classes and little formal data analytic training under their belt. Many of these scientists would happily work with a statistician, but as any applied statistician at a research university will tell you, it is impossible to keep up with the demand from our scientific colleagues. Everyone is collecting major data sets or analyzing public data sets; there just aren’t enough hours in the day.</p>
<p>Since most people performing data analysis are not statisticians there is a lot of room for error in the application of statistical methods. This error is magnified enormously when naive analysts are given too many “researcher degrees of freedom”. If a naive analyst can pick any of a range of methods and does not understand how they work, they will generally pick the one that gives them maximum benefit.</p>
<p>The short-term solution is to find a balance <a href="http://simplystatistics.org/2013/07/31/the-researcher-degrees-of-freedom-recipe-tradeoff-in-data-analysis/">between researcher degrees of freedom and “recipe book” style approaches</a> that require a specific method to be applied. In general, for naive analysts, it makes sense to lean toward less flexible methods that have been shown to work across a range of settings. The key idea here is to evaluate methods in the hands of naive users and see which ones work best most frequently, an idea we have previously called “<a href="http://simplystatistics.org/2013/08/28/evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-2/">evidence based data analysis</a>”.</p>
<p>An incredible success story of evidence based data analysis in genomics is the use of the <a href="http://www.bioconductor.org/packages/release/bioc/html/limma.html">limma package</a> for differential expression analysis of microarray data. Limma <a href="http://biostatistics.oxfordjournals.org/content/8/2/414.full.pdf">can be beat</a> in certain specific scenarios, but it is robust to such a wide number of study designs, sample sizes, and data types that the choice to use something other than limma should only be exercised by experts.</p>
<p><strong>The trouble with criticizing p-values without an alternative</strong></p>
<p>P-values are an obvious target of wrath by people who don’t do day to day statistical analysis because the P-value is the most successful statistical procedure ever invented. If every person who used a P-value cited the inventor, P-values would have, <em>very conservatively</em>, <a href="http://simplystatistics.org/2012/03/07/r-a-fisher-is-the-most-influential-scientist-ever/">3 million citations</a>. That’s an insane amount of use for one statistic.</p>
<p>Why would such a terrible statistic be used by so many people? The reason is that it is critical that we have some measure of uncertainty we can assign to data analytic results. Without such a measure, the only way to determine if results are real or not is to rely on people’s intuition, which is a <a href="http://psiexp.ss.uci.edu/research/teaching/Tversky_Kahneman_1974.pdf">notoriously unreliable metric</a> when uncertainty is involved. It is pretty clear science would be much worse off if we decided if results were reliable based on peoples’ gut feeling about the data.</p>
<p>P-values can be and are misinterpreted, misused, and abused both by naive analysts and by statisticians. Sometimes these problems are due to statistical naiveté, sometimes they are due to wishful thinking and career pressure, and sometimes they are malicious. The reason is that P-values are complicated and require training to understand.</p>
<p>Critics of the P-value argue in favor of a large number of procedures to be used in place of P-values. But when considering the scale at which the methods must be used to address the demands of the current data rich world, many alternatives would result in similar flaws. <em>This in no way proves the use of P-values is a good idea, but it does prove that coming up with an alternative is hard.</em> Here are a few potential alternatives.</p>
<ol>
<li><strong>Methods should only be chosen and applied by true data analytic experts. Pros:</strong> This is the best case scenario. <strong>Cons:</strong> Impossible to implement broadly given the level of statistical and data analytic expertise in the community<strong> </strong></li>
<li><strong>The full prior, likelihood and posterior should be detailed and complete sensitivity analysis should be performed. </strong><strong>Pros: </strong>In cases where this can be done this provides much more information about the model and uncertainty being considered. <strong>Cons</strong>: The model requires more advanced statistical expertise, is computationally much more demanding, and can not be applied in problems where model based approaches have not been developed. Yes/no decisions about credibility of results still come down to picking a threshold or allowing more researcher degrees of freedom.</li>
<li><strong>A direct Bayesian approach should be used reporting credible intervals and Bayes estimators. </strong><strong>Pros:</strong> In cases where the model can be fit, can be used by non-experts, provides scientific measures of uncertainty like confidence intervals. <strong>Cons</strong>: The prior allows a large number of degrees of freedom when not used by an expert, sensitivity analysis is required to determine the effect of the prior, many more complex models can not be implemented, results are still sample size dependent.</li>
<li><strong>Replace P-values with likelihood ratios. </strong><strong>Pros:</strong> In cases where it is available may reduce some of the conceptual difficulty with the null hypothesis. <strong>Cons:</strong> Likelihood ratios can usually only be computed exactly for cases with few or no nuisance parameters, likelihood ratios run into trouble for complex alternatives, they are still sample size dependent, a likelihood ratio threshold is equivalent to a p-value threshold in many cases.</li>
<li><strong>We should use Confidence Intervals exclusively in the place of p-values. Pros: </strong>A measure and its variability on the scale of interest will be reported. We can evaluate effect sizes on a scientific scale. <strong>Cons: </strong>Confidence intervals are still sample size dependent and can be misleading for large samples, significance levels can be chosen to make intervals artificially wide/narrow, if used as a decision making tool there is a one-to-one mapping between a confidence interval and a p-value threshold.</li>
<li><strong>We should use Bayes Factors instead of p-values. </strong><strong>Pros</strong>: They can compare the evidence (loosely defined) for both the null and alternative. They can incorporate prior information. <strong>Cons:</strong> Priors provide researcher degrees of freedom, cutoffs may still lead to false/true positives, BF’s still depend on sample size.</li>
</ol>
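<p>To make the one-to-one mapping in the confidence interval item above concrete, here is a minimal sketch for the simplest case: a two-sided z-test of “no effect” with a known standard error. The estimates and the standard error of 1.0 are invented for illustration. Under those assumptions, the 95% interval excludes zero exactly when the P-value falls below 0.05, so using the interval as a yes/no rule is the same decision as thresholding the P-value.</p>

```python
from statistics import NormalDist

def z_test(estimate, std_error, alpha=0.05):
    """Two-sided z-test of H0: effect = 0, plus the matching (1 - alpha) interval."""
    std_normal = NormalDist()
    z = estimate / std_error
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    half_width = std_normal.inv_cdf(1 - alpha / 2) * std_error
    return p_value, (estimate - half_width, estimate + half_width)

def interval_excludes_zero(interval):
    low, high = interval
    return low > 0 or high < 0

# Hypothetical effect estimates, each with standard error 1.0.
for estimate in (0.5, 1.5, 2.5, -3.0):
    p_value, interval = z_test(estimate, std_error=1.0)
    print(estimate, round(p_value, 4),
          "CI excludes 0:", interval_excludes_zero(interval),
          "p < 0.05:", p_value < 0.05)
```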
<p>This is not to say that these methods have no advantages over P-values; many do. But at scale any of these methods will be prone to abuse, misinterpretation and error. For example, none of them by default deals with multiple testing. Reducing researcher degrees of freedom is good when dealing with a lack of training, but the consequence is potential for mistakes and all of these methods would be ferociously criticized if used as frequently as p-values.</p>
<p><strong>The difference between data analysis and statistics</strong></p>
<p>Many disciplines including medicine and molecular biology usually require an introductory statistics or machine learning class during their program. This is a great start, but is not sufficient for the modern data saturated era. The introductory statistics or machine learning class is enough to teach someone the language of data analysis, but not how to use it. For example, you learn about the t-statistic and how to calculate it. You may also learn the asymptotic properties of the statistic. But you rarely learn about what happens to the t-statistic when there is <a href="http://en.wikipedia.org/wiki/Confounding">an unmeasured confounder</a>. You also don’t learn how to handle non iid data, sample mixups, reproducibility, most of scripting, etc.</p>
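<p>As a rough illustration of the confounder point, here is a small simulation sketch. Everything in it is an assumption made for the example: the variable names, the sample size, the noise levels, and the seed. The treatment has no effect on the outcome at all, but when an unmeasured variable drives both who gets treated and the outcome, the t-statistic comes out far from zero anyway. With coin-flip assignment it behaves as the textbook says it should.</p>

```python
import random
from statistics import mean, variance

def welch_t(x, y):
    """Welch two-sample t-statistic for the difference in means."""
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    return (mean(x) - mean(y)) / se

def simulate(n, confounded, rng):
    """Return the t-statistic comparing treated vs control when treatment does nothing."""
    treated, control = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)              # unmeasured confounder
        outcome = u + rng.gauss(0, 1)    # outcome depends on u only, not on treatment
        if confounded:
            gets_treatment = u + rng.gauss(0, 1) > 0   # u also drives who is treated
        else:
            gets_treatment = rng.random() < 0.5        # coin-flip assignment
        (treated if gets_treatment else control).append(outcome)
    return welch_t(treated, control)

rng = random.Random(2014)
t_confounded = simulate(2000, confounded=True, rng=rng)
t_randomized = simulate(2000, confounded=False, rng=rng)
print(f"confounded: t = {t_confounded:.1f}; randomized: t = {t_randomized:.1f}")
```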
<p>It is therefore critical that if you plan to use or understand data analysis you take both the introductory course and at least one data analysis course. The data analysis course should cover study design, more general data analytic reasoning, non-iid data, biased sampling, basics of non-parametrics, training vs test sets, prediction error, sources of likely problems in data sets (like sample mixups), and reproducibility. These are the concepts that appear regularly when analyzing real data that don’t usually appear in the first course in statistics that most medical and molecular biology professionals see. There are awesome statistical educators who are trying hard to bring more of this into the introductory stats world, but it is just too much to cram into one class.</p>
<p><strong>What should we do</strong></p>
<p>The thing that is the most frustrating about the frequent and loud criticisms of P-values is that they usually point out what is wrong with P-values, but don’t suggest what we should do about it. When they do make suggestions, they frequently ignore the fundamental problems:</p>
<ol>
<li>Statistics are complicated and require careful training to understand properly. This is true regardless of the choice of statistic, philosophy, or algorithm.</li>
<li>Data is incredibly abundant in all disciplines and shows no sign of slowing down.</li>
<li>There is a fundamental shortage of training in statistics <em>and data analysis</em>.</li>
<li>Giving untrained analysts extra researcher degrees of freedom is dangerous.</li>
</ol>
<p>The most direct solution to this problem is increased training in statistics and data analysis. Every major or program in a discipline that regularly analyzes data (molecular biology, medicine, finance, economics, astrophysics, etc.) should require at minimum an introductory statistics class and a data analysis class. If the expertise doesn’t exist to create these sorts of courses there are options. For example, we have introduced a series of 9 courses that run every month that cover most of the basic topics that are common across disciplines.</p>
<p><a href="http://jhudatascience.org/">http://jhudatascience.org/</a></p>
<p><a href="https://www.coursera.org/specialization/jhudatascience/1">https://www.coursera.org/specialization/jhudatascience/1</a></p>
<p>I think of particular interest given the <a href="http://www.nature.com/news/policy-nih-plans-to-enhance-reproducibility-1.14586">NIH Director’s recent comments</a> on reproducibility is our course on <a href="https://www.coursera.org/course/repdata">Reproducible Research</a>. There are also many more specialized resources that are very good and widely available that will build on the base we created with the data science specialization.</p>
<ol>
<li>For scientific software engineering/reproducibility: <a href="http://software-carpentry.org/">Software Carpentry</a>.</li>
<li>For data analysis in genomics: Rafa’s <a href="https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401">Data Analysis for Genomics Class</a>.</li>
<li>For Python and computing: <a href="https://www.coursera.org/specialization/fundamentalscomputing/9/courses">The Fundamentals of Computing Specialization</a></li>
</ol>
<p>Enforcing education and practice in data analysis is the only way to resolve the problems that people usually attribute to P-values. In the short term, we should at minimum require all the editors of journals who regularly handle data analysis to show competency in statistics and data analysis.</p>
<p><em>Correction:</em> After seeing Katie K.’s comment on Facebook I concur that P-values were not directly referred to as “worse than useless”, so to more fairly represent the article, I have deleted that sentence.</p>
loess explained in a GIF
2014-02-13T10:53:58+00:00
http://simplystats.github.io/2014/02/13/loess-explained-in-a-gif
<p><a href="http://en.wikipedia.org/wiki/Local_regression">Local regression</a> (loess) is one of the statistical procedures I use most. Here is a movie showing how it works.</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/02/13/loess-explained-in-a-gif/loess/" rel="attachment wp-att-2661"><img class="size-full wp-image-2661 aligncenter" alt="loess" src="http://simplystatistics.org/wp-content/uploads/2014/02/loess.gif" width="480" height="480" /></a>
</p>
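<p>For readers who prefer code to a movie, here is a minimal sketch of the idea: at each point, fit a straight line by weighted least squares to the nearest neighbours, weighting closer points more heavily with a tricube kernel, and keep the fitted value at that point. This is a simplified local linear version written for illustration, not the full procedure implemented by R’s <code>loess</code> function (no robustness iterations, no local quadratic fit). The function names, the default <code>span</code>, and the toy data are assumptions. It also assumes at least three points with mostly distinct x values.</p>

```python
import math
import random

def tricube(u):
    """Tricube kernel: weight 1 at u = 0, shrinking smoothly to 0 at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def loess_point(x0, xs, ys, span=0.5):
    """Local linear estimate at x0 from the points of (xs, ys) nearest to x0."""
    k = max(3, round(span * len(xs)))       # how many neighbours define the window
    bandwidth = sorted(abs(x - x0) for x in xs)[k - 1]
    # Weighted least squares on x centred at x0, so the intercept is the fit at x0.
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in zip(xs, ys):
        dx = x - x0
        w = tricube(dx / bandwidth)
        sw += w
        swx += w * dx
        swy += w * y
        swxx += w * dx * dx
        swxy += w * dx * y
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    return (swy - slope * swx) / sw

def loess(xs, ys, span=0.5):
    """Smooth curve evaluated at every observed x."""
    return [loess_point(x0, xs, ys, span) for x0 in xs]

# Toy data: a noisy sine wave (simulated, for illustration only).
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
smooth = loess(xs, ys, span=0.3)
print([round(v, 2) for v in smooth[:5]])
```

<p>A smaller <code>span</code> uses fewer neighbours per fit and follows the data more closely; a larger one gives a smoother curve.</p>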
Monday data/statistics link roundup (2/10/14)
2014-02-10T05:44:14+00:00
http://simplystats.github.io/2014/02/10/monday-datastatistics-link-roundup-11014
<p>I’m going to try Mondays for the links. Let me know what you think.</p>
<ol>
<li>The Guardian is reading our blog. A week after <a href="http://simplystatistics.org/2014/01/29/not-teaching-computing-and-statistics-in-our-public-schools-will-make-upward-mobility-even-harder/">Rafa posts</a> that everyone should learn to code for career preparedness, <a href="http://www.theguardian.com/technology/2014/feb/07/year-of-code-dan-crow-songkick">the Guardian gets on the bandwagon</a>.</li>
<li>Nature Methods published a paper on a <a href="http://blogs.nature.com/methagora/2014/01/bring-on-the-box-plots-boxplotr.html/">webtool for creating boxplots</a> (via Simina B.). The nerdrage rivaled the quilt plot. <a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/">I’m not opposed to papers like this being published</a>, in fact it is an important part of making sure we don’t miss out on the good software when it comes. There are two important things to keep in mind though: (a) Nature Methods grades on a heavy “innovative” curve which makes it pretty hard to publish papers there, so publishing papers like this could cause frustration among people who would submit there and (b) if you use boxplots made with this tool you <strong>must</strong> cite the relevant software that generated the boxplot.</li>
<li><a href="http://grantland.com/features/expected-value-possession-nba-analytics/">This story about Databall</a> (via Rafa.) is great, I love the way that it talks about statisticians as the leaders on a new data type. I also enjoyed <a href="http://www.sloansportsconference.com/wp-content/uploads/2014/02/2014_SSAC_Pointwise-Predicting-Points-and-Valuing-Decisions-in-Real-Time.pdf">reading the paper </a>the story is about. One interesting thing about that paper and many of the papers at the Sloan Sports Conference is that the <a href="http://regressing.deadspin.com/here-are-this-years-sloan-finalist-papers-and-their-bi-1518317761">data are proprietary</a> (via Chris V.) so the code/data/methods are actually not available for most papers (including this one). In the short term this isn’t a big deal, the papers are fun to read. In the long term, it will dramatically slow progress. It is almost always a bad long term strategy to make data private if the goal is to maximize value.</li>
<li><a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2377290">The P-value curve</a> for fixing publication bias (via Rafa). I think it is an interesting idea, very similar to our approach for the <a href="http://simplystatistics.org/swfdr/">science-wise false discovery rate</a>. People who liked our paper will like the P-value curve paper. People who hated our paper for the uniformity under the null assumption will hate that paper for the same reason (via David S.)</li>
<li><a href="http://www.theonion.com/video/new-study-shows-that-bones-are-incredibly-cool,35111/">Hopkins discovers bones are the best</a> (via Michael R.).</li>
<li><a href="http://tex.stackexchange.com/questions/158668/nice-scientific-pictures-show-off">Awesome scientific diagrams in tex</a>. Some of these are ridiculous.</li>
<li><a href="https://www.youtube.com/watch?v=cZDn0U0w78k&feature=youtube_gdata_player">Mary Carillo goes crazy on backyard badminton</a>. This is awesome. If you love the Olympics and the internet, you will love this (via Hilary P.)</li>
<li><a href="http://bmorebiostat.com/">B’more Biostats</a> has been on a tear lately. I’ve been reading <a href="http://lcolladotor.github.io/2014/02/05/DropboxAndGoogleDocsFromR/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+FellgernonBit+%28Fellgernon+Bit%29">a post</a> on uploading files to Dropbox/Google drive from R, <a href="http://mandymejia.wordpress.com/2014/02/04/what-is-quantitative-mri-and-why-does-it-matter/">Mandy’s post</a> explaining quantitative MRI, <a href="http://yennywebbv.weebly.com/1/post/2014/01/data-sciences.html">Yenny’s post</a> on data sciences, <a href="http://hopstat.wordpress.com/2014/01/31/a-graduate-school-open-house-words-from-a-student/">John’s post</a> on graduate school open houses, and <a href="http://alyssafrazee.com/vectorization.html">Alyssa’s post on vectorization.</a> If you like Simply Stats you should be following them <a href="https://twitter.com/bmorebiostats">on Twitter here</a>.</li>
</ol>
Just a thought on peer reviewing - I can't help myself.
2014-02-05T20:47:09+00:00
http://simplystats.github.io/2014/02/05/just-a-thought-on-peer-reviewing-i-cant-help-myself
<p>Today I was thinking about reviewing, probably because I was handling a couple of papers as AE and doing tasks associated with reviewing several other papers. I know that this is idle thinking, but suppose peer review was just a drop down ranking with these 6 questions.</p>
<ol>
<li>How close is this paper to your area of expertise?</li>
<li>Does the paper appear to be technically right?</li>
<li>Does the paper use appropriate statistics/computing?</li>
<li>Is the paper interesting to people in your area?</li>
<li>Is the paper interesting to a broad audience?</li>
<li>Are the appropriate data and code available?</li>
</ol>
<p>Each question would be rated on a 1-5 star scale. 1 star = completely inadequate, 3 stars = acceptable, 5 stars = excellent. There would be an optional comments box that would only be used for major/interesting thoughts, and anything that got above 3 stars for questions 2, 3, and 6 would be published. Incidentally, you could do this for free on Github if the papers were written in markdown; that would reduce the <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">substantial costs of open-access publishing</a>.</p>
<p>No doubt peer review would happen faster this way. I was wondering, would it be any worse?</p>
My Online Course Development Workflow
2014-02-04T09:30:00+00:00
http://simplystats.github.io/2014/02/04/my-online-course-development-workflow
<p>One of the nice things about developing <a href="http://jhudatascience.org">9 new courses</a> for the JHU Data Science Specialization in a short period of time is that you get to learn all kinds of cool and interesting tools. One of the ways that we were able to push out so much content in just a few months was that we did most of the work ourselves, rather than outsourcing things like video production and editing. You could argue that this results in a poorer quality final product but (a) I disagree; and (b) even if that were true, I think the content is still valuable.</p>
<p>The advantage of learning all the tools was that it allowed for a quick turn-around from the creation of the lecture to the final exporting of the video (often within a single day). For a hectic schedule, it’s nice to be able to write slides in the morning, record some video in between two meetings in the afternoon, and then combine/edit all the video in the evening. Then if you realize something doesn’t work, you can start over the next day and have another version done in less than 24 hours.</p>
<p>I thought it might be helpful to someone out there to detail the workflow and tools that I use to develop the content for my online courses.</p>
<ul>
<li><span style="line-height: 16px;">I use <a href="http://www.techsmith.com/camtasia-mac-features.html">Camtasia for Mac</a> to do all my screencasting/recording. This is a nice tool and I think has more features than your average screen recorder. That said, if you just want to record your screen on your Mac, you can actually use the built-in Quicktime software. I used to do all of my video editing in Camtasia but now it’s pretty much glorified screencasting software for me.</span></li>
<li>For talking head type videos I use my <a href="http://www.apple.com/iphone-5s/">iPhone 5S</a> <a href="http://www.amazon.com/gp/product/B00AAKERD6/ref=oh_details_o03_s00_i00?ie=UTF8&psc=1">mounted</a> on a <a href="http://www.amazon.com/gp/product/B000V7AF8E/ref=oh_details_o02_s00_i00?ie=UTF8&psc=1">tripod</a>. The iPhone produces surprisingly good 1080p HD 30 fps video and is definitely sufficient for my purposes (see <a href="http://www.apple.com/30-years/1-24-14-film/#video-1242014-film">here</a> for a much better example of what can be done). I attach the phone to an <a href="http://apogeedigital.com/products/mic.php">Apogee microphone</a> to pick up better sound. For some of the <a href="http://simplystatistics.org/interviews/">interviews</a> that we do on Simply Statistics I use two iPhones (A 5S and a 4S, my older phone).</li>
<li>To record my primary sound (i.e. me talking), I use the <a href="http://www.amazon.com/gp/product/B001QWBM62/ref=oh_details_o00_s00_i01?ie=UTF8&psc=1">Zoom H4N portable recorder</a>. This thing is not cheap but it records very high-quality stereo sound. I can connect it to my computer via USB or it can record to an SD card.</li>
<li>For simple sound recording (no video or screen) I use <a href="http://audacity.sourceforge.net">Audacity</a>.</li>
<li>All of my lecture videos are run through <a href="http://www.apple.com/final-cut-pro/">Final Cut Pro X</a> on my <a href="http://www.apple.com/macbook-pro/">15-inch MacBook Pro with Retina Display</a>. Videos from Camtasia are exported in Apple ProRes format and then imported into Final Cut. Learning FCPX is not for the faint-of-heart if you’re not used to a nonlinear editor (as I was not). I bought this <a href="http://www.amazon.com/gp/product/0321774671/ref=wms_ohs_product?ie=UTF8&psc=1">excellent book</a> to help me learn it, but I still probably only use 1% of the features. In the end using a real editor was worth it because it makes merging multiple videos much easier (i.e. multicam shots for screencasts + talking head) and editing out mistakes (e.g. typos on slides) much simpler. The editor in Camtasia is pretty good but if you have more than one camera/microphone it becomes infeasible.</li>
<li>I have an <a href="http://store.apple.com/us/product/HD816ZM/A/wd-8tb-my-book-thunderbolt-duo-dual-drive-storage-system?fnode=5f&fs=f%3Dthunderbolt%26fh%3D3783%252B309a">8TB Western Digital Thunderbolt drive</a> to store the raw video for all my classes (and some backups). I also use two <a href="http://store.apple.com/us/product/HE965VC/A/g-tech-1tb-g-drive-mobile-thunderboltusb-30-hard-drive?fnode=5f&fs=f%3Dthunderbolt%26fh%3D3783%252B309a">1TB Thunderbolt drives</a> to store video for individual classes (each 4-week class borders on 1TB of raw video). These smaller drives are nice because I can just throw them in my bag and edit video at home or on the weekend if I need to.</li>
<li>Finished videos are shared with a <a href="https://www.dropbox.com/business">Dropbox for Business</a> account so that Jeff, Brian, and I can all look at each other’s stuff. Videos are exported to H.264/AAC and uploaded to Coursera.</li>
<li>For developing slides, Jeff, Brian, and I have standardized around using <a href="http://slidify.org">Slidify</a>. The beauty of using slidify is that it lets you write everything in <a href="http://daringfireball.net/projects/markdown/">Markdown</a>, a super simple text format. It also makes it simpler to manage all the course material in <a href="https://github.com/DataScienceSpecialization/courses">Git/GitHub</a> because you don’t have to lug around huge PowerPoint files. Everything is a light-weight text file. And thanks to <a href="http://people.mcgill.ca/ramnath.vaidyanathan/">Ramnath’s</a> incredible grit and moxie, we have handy tools to easily export everything to PDF and HTML slides (HTML slides hosted via <a href="http://pages.github.com">GitHub Pages</a>).</li>
</ul>
<p>The first courses for the <a href="https://www.coursera.org/specialization/jhudatascience/1">Data Science Specialization</a> start on April 7th. Don’t forget to sign up!</p>
The three tables for genomics collaborations
2014-02-03T10:12:19+00:00
http://simplystats.github.io/2014/02/03/the-three-tables-for-genomics-collaborations
<p>Collaborations between biologists and statisticians are very common in genomics. For the data analysis to be fruitful, the statistician needs to understand what samples are being analyzed. For the analysis report to make sense to the biologist, it needs to be properly annotated with information such as gene names, genomic location, etc… In a recent post, Jeff laid out <a href="http://simplystatistics.org/2013/11/14/the-leek-group-guide-to-sharing-data-with-a-statistician-to-speed-collaboration/">his guide</a> for such collaborations; here I describe an approach that has helped me in mine.</p>
<p>In many of my past collaborations, sharing the experiment’s key information, in a way that facilitates data analysis, turned out to be more time consuming than the analysis itself. One reason is that the data producers annotated samples in ways that were impossible to decipher without direct knowledge of the experiment (e.g. using lab-specific codes in the filenames, or colors in Excel files). In the early days of microarrays, a group of researchers noticed this problem and created a <a href="http://www.mged.org/Workgroups/MAGE/mage-ml.html">markup language</a> to describe and communicate information about experiments electronically.</p>
<p>The <a href="http://www.bioconductor.org/">Bioconductor project</a> took a less ambitious approach and created <a href="http://www.bioconductor.org/packages/2.14/bioc/vignettes/Biobase/inst/doc/BiobaseDevelopment.pdf">classes</a> specifically designed to store the minimal information needed to perform an analysis. These classes can be thought of as three tables, stored as <strong>flat text files</strong>, all of which are relatively easy to create for biologists.</p>
<p>The first table contains the <strong>experimental data</strong> with rows representing features (e.g. genes) and the columns representing samples.</p>
<p>The second table contains the <strong>sample information</strong>. This table contains a row for each column in the experimental data table. This table contains at least two columns. The first contains an identifier that can be used to match the rows of this table to the columns of the first table. The second contains the main outcome of interest, e.g. case or control, cancer or normal. Other commonly included columns are the filename of the original raw data associated with each row, the date the experiment was processed, and other information about the samples.</p>
<p>The third table contains the <strong>feature information</strong>. This table contains a row for each row in the experimental table. The table contains at least two columns. The first contains an identifier that can be used to match the rows of this table to the rows of the first table. The second will contain an annotation that makes sense to biologists, e.g. a gene name. For technologies that are widely used (e.g. Affymetrix gene expression arrays) these tables are readily available.</p>
<p>With these three relatively simple files in place less time is spent “figuring out” the data and the statisticians can focus their energy on data analysis while the biologists can focus their energy on interpreting the results. This approach seems to have been the inspiration for the <a href="http://www.mged.org/mage-tab/">MAGE-TAB</a> format.</p>
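<p>As a minimal sketch of how the three tables fit together, here is a toy version in base R. The values are random and all identifiers (<code>f1</code>, <code>s1</code>, <code>GENE_A</code>, the filenames) are hypothetical; this uses plain matrices and data frames rather than the Bioconductor classes themselves.</p>
<pre class="brush: r; title: ; notranslate" title="">set.seed(1)

# Table 1: experimental data, features in rows and samples in columns
exprs = matrix(rnorm(12), nrow = 4,
               dimnames = list(c("f1", "f2", "f3", "f4"), c("s1", "s2", "s3")))

# Table 2: sample information, one row per column of table 1
samples = data.frame(id = c("s1", "s2", "s3"),
                     outcome = c("case", "control", "case"),
                     filename = c("s1.txt", "s2.txt", "s3.txt"),
                     stringsAsFactors = FALSE)

# Table 3: feature information, one row per row of table 1
features = data.frame(id = c("f1", "f2", "f3", "f4"),
                      gene = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
                      stringsAsFactors = FALSE)

# The identifier columns are what tie the three tables together
stopifnot(identical(colnames(exprs), samples$id))
stopifnot(identical(rownames(exprs), features$id))

# Example use: average each feature within each outcome group
group_means = sapply(split(samples$id, samples$outcome),
                     function(ids) rowMeans(exprs[, ids, drop = FALSE]))
</pre>
<p>If the two identifier checks pass, the analysis can start right away; if they fail, that is usually the first thing worth sorting out with the collaborator.</p>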
<p>Note that with newer technologies, statisticians prefer to get access to <strong>the raw data</strong>. In this case, instead of an experimental data table (table 1), they will want the original raw data files. The sample information then must contain a column with the filenames so that sample annotation can be properly matched.</p>
<p><strong>NB</strong>: These three tables are not a complete description of an experiment and are not intended as an alternative to standards such as MAGE and MIAME. But in many cases, they provide the very minimum information needed to carry out a rudimentary analysis. Note that Bioconductor provides <a href="http://www.bioconductor.org/packages/2.3/bioc/html/RMAGEML.html">tools</a> to import information from MAGE-ML and other related formats.</p>
Not teaching computing and statistics in our public schools will make upward mobility even harder
2014-01-29T10:40:51+00:00
http://simplystats.github.io/2014/01/29/not-teaching-computing-and-statistics-in-our-public-schools-will-make-upward-mobility-even-harder
<p>In his book <a href="http://www.amazon.com/dp/0525953736/?tag=slatmaga-20" target="_blank">Average Is Over</a>, Tyler Cowen predicts that as automatization becomes more common, modern economies will eventually be composed of two groups: <a href="http://en.wikipedia.org/wiki/Average_is_Over">1) a highly educated minority involved in the production of automated services and 2) a vast majority earning very little but enough to consume some of the low-priced products created by group 1</a>. Not everybody will agree with this view, but we can’t ignore the fact that automatization has already eliminated many middle class jobs in, for example, manufacturing and the automotive industries. New technologies, such as <a href="http://www.youtube.com/watch?v=cdgQpa1pUUE">driverless cars</a> and online retailers, will very likely eliminate many more jobs (e.g. drivers and retail clerks) than they create (programmers and engineers).</p>
<p>Computer literacy is essential for working with automatized systems. Programming and learning from data are perhaps the most useful skills for creating these systems. Yet the current default curriculum includes neither computer science nor statistics. At the same time, there are plenty of resources for motivated parents with means to get their children to learn these subjects. Kids whose parents don’t have the wherewithal to take advantage of these educational resources will be at an even greater disadvantage than they are today. This disadvantage is made worse by the fact that many of the aforementioned resources are free and open to the world (<a href="http://www.codecademy.com/">Codecademy</a>, <a href="https://www.khanacademy.org/">Khan Academy</a>, <a href="https://www.edx.org/">EdX</a>, and <a href="https://www.coursera.org/">Coursera</a> for example) which means that a large pool of students that previously had no access to this learning material will also be competing for group 1 jobs. If we want to level the playing field we should start by updating the public school curriculum so that, in principle, everybody has the opportunity to compete.</p>
Announcing the Release of swirl 2.0
2014-01-28T09:44:06+00:00
http://simplystats.github.io/2014/01/28/swirl-2
<p><em>Editor’s note: This post was written by Nick Carchedi, a Master’s degree student in the Department of Biostatistics at Johns Hopkins. He is working with us to develop the <a href="http://jhudatascience.org">Data Science Specialization</a> as well as software for interactive learning of R and statistics.</em></p>
<p>Official swirl website: <a href="http://swirlstats.com">swirlstats.com</a></p>
<p>On September 27, 2013, I wrote a guest <a href="http://simplystatistics.org/2013/09/27/announcing-statistics-with-interactive-r-learning-software-environment/">blog post</a> on Simply Statistics to announce the creation of Statistics with Interactive R Learning (swirl), an R package for teaching and learning statistics and R simultaneously and interactively. Over the next several months, I received a tremendous amount of feedback from all over the world. Two things became clear: 1) there were many opportunities for improvement on the original design and 2) there’s an incredible demand globally for new and better ways of learning statistics and R.</p>
<p>In the spirit of R and open source software, I shared the source code for swirl on GitHub. As a result, I quickly came in contact with several very talented individuals, without whom none of what I’m about to share with you would have been possible. Armed with invaluable feedback and encouragement from early adopters of swirl 1.0, my new team and I pursued a complete overhaul of the original design.</p>
<p>Today, I’m happy to announce the result of our efforts: swirl 2.0.</p>
<p>Like the first version of the software, swirl 2.0 guides students through interactive tutorials in the R console on a variety of topics related to statistics and R. The user selects from a menu of courses, each of which is broken up by topic into shorter lessons. Lessons, in turn, are a dialog between swirl and the user and are composed of text output, multiple choice and text-based questions, and (most importantly) questions that require the user to enter actual R code at the prompt. Responses are evaluated for correctness based on instructor-specified answer tests and appropriate feedback is given immediately to the user.</p>
<p>It’s helpful to think of swirl as the synthesis of two separate parts: content and platform. Content is authored by instructors in R Markdown files. The platform is then responsible for delivering this content to the user and interpreting the user’s responses in an interactive and engaging way.</p>
<p>Our primary focus for swirl 2.0 was to build a more robust and extensible platform for delivering content. Here’s a (nontechnical) summary of new and revised features:</p>
<ul>
<li>A library of answer tests an instructor can deploy to check user input for correctness</li>
<li>If stuck, a user can skip a question, causing swirl to enter the correct answer on their behalf</li>
<li>During a lesson, a user can pause instruction to play around or practice something they just learned, then use a special keyword to regain swirl’s attention when ready to resume</li>
<li>swirl “sees” user input the same way R “sees” it, which allows swirl to understand the composition of a user’s input on a much deeper level (thanks, Hadley)</li>
<li>User progress is saved between sessions</li>
<li>More readable output that adjusts to the width of the user’s display (thanks again, Hadley)</li>
<li>Extensible framework allows others to easily extend swirl’s functionality</li>
<li>Instructors can author content in a special flavor of R markdown</li>
</ul>
<p>(For a more technical understanding of swirl’s features and inner workings, we encourage readers to consult our <a href="https://github.com/swirldev/swirl">GitHub repository</a>.)</p>
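<p>To make the idea of an answer test a bit more concrete, here is a minimal sketch in base R. It is not swirl’s actual code: <code>check_answer</code> is a hypothetical helper written for illustration, and it only shows the general idea of parsing what the user typed, looking at its composition, and comparing its value to what the instructor expects.</p>
<pre class="brush: r; title: ; notranslate" title=""># Hypothetical helper, not part of swirl: check one line of R code typed
# by a student against the value the instructor expects
check_answer = function(user_code, expected) {
  # Parse the text into an expression, i.e. the input as R "sees" it
  expr = parse(text = user_code)[[1]]
  # Evaluate in a fresh environment; keep the error object if it fails
  value = tryCatch(eval(expr, envir = new.env()), error = function(e) e)
  list(calls = all.names(expr),   # functions and variables in the input
       correct = isTRUE(all.equal(value, expected)))
}

res = check_answer("mean(c(1, 2, 3))", 2)
res$correct
res$calls
</pre>
<p>Because the input is parsed rather than compared as a string, a test written this way could also ask whether the student used a particular function, not just whether they got the right number.</p>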
<p>Although improving the platform was our first priority for this release, we’ve made some improvements to existing content and, more importantly, added the beginnings of a new course: Intro to R. Intro to R is our response to the overwhelming demand for a more accessible and interactive way to learn the R language. We’ve included the first three lessons of the course and plan to add many more over the coming months as our focus turns to creating more high quality content.</p>
<p>Our ultimate goal is to have the statistics and R communities use swirl as a platform to deliver their own content to students interactively. We’ve heard from many people who have an interest in creating their own content and we’re working hard to make the process of creating content as simple and enjoyable as possible.</p>
<p>The goal of swirl is not to be flashy, but rather to provide the most authentic learning environment possible. We accomplish this by placing students directly on the R prompt, within the very same environment they’ll use for data analysis when they are not using swirl. We hope you find swirl to be a valuable tool for learning and teaching statistics and R.</p>
<p>It’s important to stress that, as with any new software, we expect there will be bugs. At this point, users should still consider themselves “early adopters”. For bug reports, suggested enhancements, or to learn more about swirl, please visit <a href="http://swirlstats.com">our website</a>.</p>
<h2 id="contributors">Contributors:</h2>
<p>Many people have contributed to this project, either directly or indirectly, since its inception. I will attempt to list them all here, in no particular order. I’m sincerely grateful to each and every one of you.</p>
<ul>
<li>Bill & Gina: swirl is as much theirs as it is mine at this point. Their contributions are the only reason the project has evolved so much since the release of swirl 1.0.</li>
<li>Brian: Challenged me to turn my idea for swirl into a working prototype. Coined the “swirl” acronym. swirl would still be an idea in my head without his encouragement.</li>
<li>Jeff: Pushes me to think big picture and provides endless encouragement. Reminds me that a great platform is worthless without great content.</li>
<li>Roger: Encouraged me to separate platform and content, a key paradigm that allowed swirl to mature from a messy prototype to something of real value. Introduced me to Git and GitHub.</li>
<li>Lauren & Ethan: Helped with development of the earliest instructional content.</li>
<li>Ramnath: Provided a model for content authoring via slidify “flavor” of R Markdown.</li>
<li>Hadley: Made key suggestions for improvement and provided an important <a href="https://gist.github.com/hadley/6734404">proof of concept</a>. His work has had a profound influence on swirl’s development.</li>
<li>Peter: Our discussions led to a better understanding of some key ideas behind swirl 2.0.</li>
<li>Sally & Liz: Beta testers and victims of my endless rants during stats tutoring sessions.</li>
<li>Kelly: Most talented graphic designer I know and mastermind behind the swirl logo. First line of defense against bad ideas, poor design, and crappy websites. Visit <a href="http://kellynealon.com">her website</a>.</li>
<li>Mom & Dad: Beta testers and my #1 fans overall.</li>
</ul>
Marie Curie says stop hating on quilt plots already.
2014-01-28T08:50:37+00:00
http://simplystats.github.io/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already
<blockquote>
<p>“There are sadistic scientists who hurry to hunt down error instead of establishing the truth.” -Marie Curie (http://en.wikiquote.org/wiki/Marie_Curie)</p>
</blockquote>
<p>Thanks to Kasper H. for that quote. I think it is a perfect fit for today’s culture of academic put down as academic contribution. One perfect example is the explosion of hate against the quilt plot. A quilt plot is a heatmap with several parameters selected in advance; that’s it. <a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0085047">This simplification</a> of R’s heatmap function appeared in the journal PLoS One. They say (though not up front and not clearly enough for my personal taste) that they know it is just a heatmap.</p>
<p>Over the course of the next several weeks quilt plots went viral. Here <a href="https://twitter.com/EvolOdonata/status/427657216154664960">are a</a> <a href="https://twitter.com/BioMickWatson/status/426780957279281152">few</a> <a href="https://twitter.com/rvimieiro/status/423418772368547840">example</a> tweets. It was also <a href="http://liorpachter.wordpress.com/2014/01/19/why-do-you-look-at-the-speck-in-your-sisters-quilt-plot-and-pay-no-attention-to-the-plank-in-your-own-heat-map/">discussed</a> on <a href="http://eagereyes.org/series/peer-review/1-quilt-plots">people’s blogs</a> and <a href="http://www.the-scientist.com/?articles.view/articleNo/38919/title/Not-So-New-/">even in The Scientist</a>. So I did an experiment. I built a table of frequencies in R like this and applied the heatmap function in R, then the quilt.plot function in fields, then the function written by the authors of the paper with as little tweaking as possible.</p>
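<pre class="brush: r; title: ; notranslate" title="">set.seed(12345)
library(fields)  # provides quilt.plot()
# 5 x 5 table of counts, converted to proportions
x = matrix(rbinom(25,size=4,prob=0.5),nrow=5)
pt = prop.table(x)
heatmap(pt)  # base R heatmap
quilt.plot(x=rep(1:5,5),y=rep(1:5,5),z=pt)  # from fields
# quilt() is the function from the paper, retyped by hand; it is not in a package
quilt(pt,1:5,1:5,zlabel="Proportion")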
</pre>
<p>Here are the results:</p>
<p><strong>heatmap</strong></p>
<p><a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/heatmap/" rel="attachment wp-att-2588"><img class="alignnone size-medium wp-image-2588" alt="heatmap" src="http://simplystatistics.org/wp-content/uploads/2014/01/heatmap-300x300.png" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2014/01/heatmap-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2014/01/heatmap-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2014/01/heatmap.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p><strong>quilt.plot</strong></p>
<p><a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/quilt-plot/" rel="attachment wp-att-2589"><img class="alignnone size-medium wp-image-2589" alt="quilt.plot" src="http://simplystatistics.org/wp-content/uploads/2014/01/quilt.plot_-300x300.png" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2014/01/quilt.plot_-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2014/01/quilt.plot_-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2014/01/quilt.plot_.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p><strong>quilt</strong></p>
<p><strong><a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/quilt/" rel="attachment wp-att-2590"><img class="alignnone size-medium wp-image-2590" alt="quilt" src="http://simplystatistics.org/wp-content/uploads/2014/01/quilt-300x300.png" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2014/01/quilt-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2014/01/quilt-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2014/01/quilt.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></strong></p>
<p>It is clear that out of the box and with no tinkering, the new plot makes something nicer/more interpretable. The columns/rows are where I expect and the scale is there and nicely labeled. Everyone who has ever made heatmaps in R has some bit of code that looks like this:</p>
<pre class="brush: r; title: ; notranslate" title="">image(t(bdat)[,nrow(bdat):1],col=colsb(9),breaks=quantile(as.vector(as.matrix(dat)),probs=seq(0,1,length=10)),xaxt="n",yaxt="n",xlab="",ylab="")
</pre>
<p>to hack together a heatmap in R that looks the way you expect. It is a total pain. Obviously the quilt plot paper has a few flaws:</p>
<ol>
<li>It tries to introduce the quilt plot as a new idea.</li>
<li>It doesn’t just come out and say it is a hack of the heatmap function, but tries to dance around it.</li>
<li>It produces code, but only as images in word files. I had to retype the code to make my plot.</li>
</ol>
<p>That being said here are a couple of other true things about the paper:</p>
<ol>
<li>The code works if you type it out and apply it.</li>
<li>They produced code.</li>
<li>The paper is open access.</li>
<li>The paper is correct technically.</li>
<li>The hack is useful for users with few R skills.</li>
</ol>
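<p>To give a feel for how small a hack of this kind can be, here is a minimal sketch using only base R. It is not the authors’ function: <code>simple_quilt</code> is a hypothetical name, the color palette is an arbitrary choice, and it leaves out the color scale legend.</p>
<pre class="brush: r; title: ; notranslate" title=""># Hypothetical helper (not the function from the paper): draw a matrix as
# colored cells, laid out the same way the matrix prints
simple_quilt = function(m, main = "") {
  nr = nrow(m)
  nc = ncol(m)
  # image() draws z[i, j] at x = i, y = j, so reverse the rows and
  # transpose to put row 1 at the top and column 1 on the left
  z = t(m[nr:1, , drop = FALSE])
  image(1:nc, 1:nr, z, col = heat.colors(12), axes = FALSE,
        xlab = "", ylab = "", main = main)
  # Fall back to row/column numbers when the matrix has no names
  cl = if (is.null(colnames(m))) 1:nc else colnames(m)
  rl = if (is.null(rownames(m))) 1:nr else rownames(m)
  axis(1, at = 1:nc, labels = cl)
  axis(2, at = 1:nr, labels = rev(rl))
  box()
  invisible(z)
}

set.seed(12345)
x = matrix(rbinom(25, size = 4, prob = 0.5), nrow = 5)
simple_quilt(prop.table(x), main = "Proportion")
</pre>
<p>All of the decisions that make the image() call above painful are made once inside the function, which is the whole point of a hack like this.</p>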
<p>So why exactly isn’t it a paper? It smacks of academic elitism to claim that this isn’t good enough because it isn’t a “new idea”. Not every paper discovers radium. Some papers are better than others and that is ok. I think the quilt plot being published isn’t a problem. Maybe I don’t like the way it is written exactly, but they do acknowledge the heat map, they do produce correct, relevant code, and it does solve a problem people actually have. That is better than a lot of papers that appear in more prestigious journals. <a href="http://www.nature.com/news/arsenic-life-bacterium-prefers-phosphorus-after-all-1.11520#/b1">Arsenic life</a> anyone?</p>
<p>I think it is useful to have a forum where people can post correct, useful, but not necessarily ground breaking results and get credit for them, even if the credit is modest. Otherwise we might miss out on useful bits of code. Frank Harrell has a <a href="http://cran.r-project.org/web/packages/Hmisc/index.html">bunch of functions</a> that tons of people use but he doesn’t get citations for them; you have probably heard of the Hmisc package if you use R.</p>
<p>But did you know Karl Broman has a bunch of really useful functions in his <a href="https://github.com/kbroman/broman">personal R package</a>? <a href="https://github.com/kbroman/broman/blob/master/R/qqline2.R">qqline2</a> is great. I know Rafa has a bunch of functions he has never published because they seem “too trivial” but I use them all the time. Every scientist who touches code has a personal library like this. I’m not saying the quilt plot is in that category. But I am saying that it is stupid not to have a public forum for making these functions available to other scientists. But that won’t happen if the “quilt plot backlash” is what people see when they try to get published credit for simple code that solves real problems.</p>
<p>Hacks like the quilt plot can help people who aren’t comfortable with R write reproducible scripts without having to figure out every plotting parameter. Keeping in mind that <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">the vast majority of data analysis is not done by statisticians</a>, it seems like these little hacks are an important part of science. If you believe in figshare, github, open science, and shareable code, you shouldn’t be making fun of the quilt plotters.</p>
<p>Marie Curie says so.</p>
The Johns Hopkins Data Science Specialization on Coursera
2014-01-21T11:06:47+00:00
http://simplystats.github.io/2014/01/21/the-johns-hopkins-data-science-specialization-on-coursera
<p>We are very proud to announce the Johns Hopkins Data Science Specialization on Coursera. You can see the official announcement from the Coursera folks <a href="http://blog.coursera.org/post/73994272513/coursera-specializations-focused-programs-in-popular">here</a>. This is the main reason Simply Statistics has been a little quiet lately.</p>
<p>The three of us (Brian Caffo, Roger Peng, and Jeff Leek) along with a couple of incredibly hard working graduate students (Nick Carchedi of <a href="http://swirlstats.com/">swirl</a> fame and Sean Kross) have put together <em>nine</em> new one-month classes to run on the Coursera platform. The classes are:</p>
<ol>
<li><strong>The Data Scientist’s Toolbox</strong> - A basic introduction to data and data science and a basic guide to R/Rstudio/Github/Command Line Interface.</li>
<li><strong>R Programming </strong> - Introduction to R programming, from installing R to types, to functions, to control structures.</li>
<li><strong>Getting and Cleaning Data</strong> - An introduction to getting data from the web, from images, from APIs, and from databases. The course also covers how to go from raw data to tidy data.</li>
<li>
<div>
<strong>Exploratory Data Analysis</strong> - This course covers plotting in base graphics, lattice, ggplot2 and clustering and other exploratory techniques. It also covers how to think about exploring data you haven't seen.
</div>
</li>
<li>
<div>
<strong>Reproducible Research </strong> - This is one of the courses unique to our sequence. It covers how to think about reproducible research, evidence based data analysis, reproducible research checklists and knitr, markdown, R markdown, etc.
</div>
</li>
<li>
<div>
<strong>Statistical Inference </strong> - This course covers the fundamentals of statistical inference from a practical perspective. The course covers both the technical details and important ideas like confounding.
</div>
</li>
<li>
<div>
<strong>Regression Models </strong> - This course covers the fundamentals of linear and generalized linear regression modeling. It also serves as an introduction to how to "think about" relating variables to each other quantitatively.
</div>
</li>
<li>
<div>
<strong>Practical Machine Learning </strong> - This course will cover the basic conceptual ideas in machine learning like in/out of sample errors, cross validation, and training and test sets. It will also cover a range of machine learning algorithms and their practical implementation.
</div>
</li>
<li>
<div>
<strong>Developing Data Products </strong> - This course will cover how to develop tools for communicating data, methods, and analyses with other people. It will cover building R packages, Shiny, and Slidify, among other things.
</div>
</li>
</ol>
<p>There will also be a specialization project - consisting of a 10th class where students will work on projects conducted with industry, government, and academic partners.</p>
<p>The classes represent some of the content we have previously covered in our popular Coursera classes and a ton of brand new content for this specialization. Here are some things that I think make our program stand out:</p>
<ul>
<li>We will roll out 3 classes at a time starting in April. Once a class is running, it will run every single month concurrently.</li>
<li>The specialization offers a bunch of unique content, particularly in the courses Getting and Cleaning Data, Reproducible Research, and Developing Data Products.</li>
<li>All of the content is being developed open source and open-access on Github. You are welcome to check it out as we develop it and contribute!</li>
<li>You can take the first 9 courses of the specialization entirely for free.</li>
<li>You can choose to pay a very modest fee to get “Signature Track” certification in every course.</li>
</ul>
<p>I have also created a little page that summarizes some of the unique aspects of our program. Scroll through it and you’ll find sharing links at the bottom. Please share with your friends, we think this is pretty cool: <a href="http://jhudatascience.org/">http://jhudatascience.org</a></p>
Sunday data/statistics link roundup (1/19/2014)
2014-01-19T22:57:42+00:00
http://simplystats.github.io/2014/01/19/sunday-datastatistics-link-roundup-1192014
<ol>
<li><a href="http://ch.tbe.taleo.net/CH07/ats/careers/requisition.jsp?org=TESLA&cws=1&rid=12268">Tesla is hiring a data scientist</a>. That is all.</li>
<li>I’m not sure I buy <a href="http://www.talyarkoni.org/blog/2013/11/18/the-homogenization-of-scientific-computing-or-why-python-is-steadily-eating-other-languages-lunch/">the idea</a> <a href="http://readwrite.com/2013/11/25/python-displacing-r-as-the-programming-language-for-data-science?utm_medium=readwr.it-twitter&utm_source=t.co&utm_content=awesmsharetools-sharebuttons&awesm=readwr.it_p0jm&utm_campaign=#awesm=~osWcapOVQuLAaP">that Python</a> is taking over for R among people who actually do regular data science. I think it is still context dependent. A huge fraction of genomics happens in R and there is a steady stream of new packages that allow R users to push farther and farther back into the processing pipeline. On the other hand, I think language diversity is clearly a plus for someone who works with data. Not that I’d know…</li>
<li>This is an awesome talk on <a href="http://vimeo.com/80236275">why to pursue a Ph.D.</a> It gives a really level-headed and measured discussion, specifically focused on computational programs (I think I got to it via Alyssa F.’s blog).</li>
<li>En Español - <a href="http://www.elnuevodia.com/paraentenderlageneticalatina-1689123.html?fb_action_ids=10100207969753748">A blog post</a> about a study of genetic risk factors among Hispanic/Latino populations (via Rafa).</li>
<li><a href="http://magazine.amstat.org/blog/2014/01/01/tenured-women/">Where have all the tenured women gone?</a> This is a major issue and deserves much more press than it gets (via Sherri R.).</li>
<li>Not related to statistics really, <a href="http://9-eyes.com/">but these image captures</a> from Google Street View are wild.</li>
</ol>
Missing not at random data makes some Facebook users feel sad
2014-01-17T10:22:20+00:00
http://simplystats.github.io/2014/01/17/missing-not-at-random-data-makes-some-facebook-users-feel-sad
<p><a href="http://www.npr.org/2014/01/09/261108836/many-younger-facebook-users-unfriend-the-network">This article</a>, published last week, explained how “some younger users of Facebook say that using the site often leaves them feeling sad, lonely and inadequate”. Being a statistician gives you an advantage here because we know that naive estimates from missing not at random (MNAR) data can be very biased. The posts you see on Facebook are not a random sample from your friends’ lives. We see pictures of their vacations, abnormally flattering pictures of themselves, reports on their major achievements, etc., but no view of the mundane, typical daily occurrences. Here is a simple cartoon explanation of how MNAR data can give you a biased view of what’s really going on. Suppose your life occurrences are rated from 1 (worst) to 5 (best). This table compares what you see to what is really going on after 15 occurrences:</p>
<p><a href="http://simplystatistics.org/2014/01/17/missing-not-at-random-data-makes-some-facebook-users-feel-sad/screen-shot-2014-01-17-at-10-16-32-am/" rel="attachment wp-att-2516"><img class="alignnone size-full wp-image-2516" alt="Screen Shot 2014-01-17 at 10.16.32 AM" src="http://simplystatistics.org/wp-content/uploads/2014/01/Screen-Shot-2014-01-17-at-10.16.32-AM.png" width="1105" height="197" srcset="http://simplystatistics.org/wp-content/uploads/2014/01/Screen-Shot-2014-01-17-at-10.16.32-AM-300x53.png 300w, http://simplystatistics.org/wp-content/uploads/2014/01/Screen-Shot-2014-01-17-at-10.16.32-AM-1024x182.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/01/Screen-Shot-2014-01-17-at-10.16.32-AM.png 1105w" sizes="(max-width: 1105px) 100vw, 1105px" /></a></p>
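<p>The same cartoon can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the post: ratings are drawn uniformly from 1 to 5, and an occurrence gets posted only if it is rated at or above a hypothetical <code>share_threshold</code> of 4.</p>

```python
import random
from statistics import mean


def simulate_feed(n_events=15, share_threshold=4, seed=1):
    """Rate n_events occurrences from 1 (worst) to 5 (best) and keep only
    the ones that get posted. The posting rule (rating >= share_threshold)
    is an assumption made for illustration."""
    rng = random.Random(seed)
    ratings = [rng.randint(1, 5) for _ in range(n_events)]
    posted = [r for r in ratings if r >= share_threshold]
    return ratings, posted


ratings, posted = simulate_feed()
print("what is really going on:", ratings, "mean =", round(mean(ratings), 2))
if posted:
    # The naive estimate uses only the occurrences that were shared.
    print("what you see:", posted, "mean =", round(mean(posted), 2))
```

<p>Because whether an occurrence is observed depends on its rating, the mean of what you see can never fall below the threshold, no matter what the full set of ratings looks like.</p>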
edge.org asks famous scientists what scientific concept to throw out & they say statistics
2014-01-16T10:10:00+00:00
http://simplystats.github.io/2014/01/16/edge-org-asks-famous-scientists-what-scientific-concept-to-throw-out-they-say-statistics
<p>I don’t think I’ve ever been forwarded one link on the web more than I have been forwarded the edge.org post on <a href="http://www.edge.org/responses/what-scientific-idea-is-ready-for-retirement">“What scientific idea is ready for retirement?”</a>. Here are a few of the comments with my responses. I’m going to keep them brief because I think the edge.org crowd pushes people to say outrageous things, so it isn’t even clear they mean what they say.</p>
<p>I think the whole conceit of the question is a little silly. If you are going to retire a major scientific idea you better have a replacement or at least a guess at what we could do next. The question totally ignores the key question of: “Suppose we actually did what you suggested, what would we do instead?”</p>
<p><strong>On getting rid of big clinical trials</strong></p>
<blockquote>
<p>It is a commonly held but erroneous belief that a larger study is always more rigorous or definitive than a smaller one, and a randomized controlled trial is always the gold standard. However, there is a growing awareness that size does not always matter and a randomized controlled trial may introduce its own biases. We need more creative experimental designs.</p>
</blockquote>
<p><strong>My response: </strong><a href="http://simplystatistics.org/2013/07/15/yes-clinical-trials-work/">Yes clinical trials work</a>. Yes bigger trials and randomized trials are more definitive. There is currently no good alternative for generating causal statements that doesn’t require quite severe assumptions. The call for “creative experimental designs” has serious potential to be abused by folks who say things like “Well my friend Susie totally said that diet worked for her…”. The author says we should throw out RCTs, with all the benefits they have provided, because it is hard to get women to adhere to a pretty serious behavioral intervention over an 8-year period. If anything, this should make us reconsider what a reasonable intervention is, not the randomized trial part.</p>
<p><strong>On bailing on statistical independence assumptions</strong></p>
<blockquote>
<p>It is time for science to retire the fiction of statistical independence…..So the overwhelming common practice is simply to assume that sampled events are independent. An easy justification for this is that almost everyone else does it and it’s in the textbooks. This assumption has to be one of the most widespread instances of groupthink in all of science.</p>
</blockquote>
<p><strong>My response: </strong>There are a huge number of statistical methods for dealing with non-independent data. Statisticians have been working on this for decades with <a href="http://en.wikipedia.org/wiki/Blocking_(stage)">blocking</a>, <a href="http://en.wikipedia.org/wiki/Stratified_sampling">stratification</a>, <a href="http://en.wikipedia.org/wiki/Random_effects_model">random effects</a>, <a href="http://en.wikipedia.org/wiki/Deep_learning">deep learning</a>, <a href="http://en.wikipedia.org/wiki/Multilevel_model">multilevel models</a>, <a href="http://en.wikipedia.org/wiki/Generalized_estimating_equation">GEE</a>, <a href="http://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity">Garch models</a>, etc. etc., etc. It’s a fact that <a href="http://en.wikiquote.org/wiki/George_E._P._Box">statistical independence is a fiction, but sometimes it is a useful one</a>.</p>
<p><strong>On bailing on the p-value (or any other standardized statistical procedure)</strong></p>
<blockquote>
<p>Not for a minute should anyone think that this procedure has much to do with statistics proper… A 2011 paper in <em>Nature Neuroscience</em> presented an analysis of neuroscience articles in <em>Science, Nature, Nature Neuroscience, Neuron</em> and <em>The Journal of Neuroscience</em> and showed that although 78 did as they should, 79 used the incorrect procedure.</p>
</blockquote>
<p><strong>My response: </strong>P-values on their own and P-values en masse are both annoying and not very helpful. But we need a way to tell whether those effect sizes you observed are going to replicate or not. P-values are probably not the best thing for measuring that (<a href="http://www.biomedcentral.com/1471-2105/14/360">maybe you should try to estimate it directly?</a>). But any procedure you scale up to hundreds of thousands of users is going to cause all sorts of problems. If you give people more dimensions to call their result “real” or “significant” you aren’t going to reduce false positives. <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">At scale we need fewer researcher degrees of freedom not more</a>.</p>
<p><strong>On science not being self-correcting</strong></p>
<blockquote>
<p>The pace of scientific production has quickened, and self-correction has suffered. Findings that might correct old results are considered less interesting than results from more original research questions. Potential corrections are also more contested. As the competition for space in prestigious journals has become increasingly frenzied, doing and publishing studies that would confirm the rapidly accumulating new discoveries, or would correct them, became a losing proposition. Public registration of the design and analysis plan of a study before it is begun. Clinical trials researchers have done this for decades, and in 2013 researchers in other areas rapidly followed suit. Registration includes the details of the data analyses that will be conducted, which eliminates the former practice of presenting the inevitable fluctuations of multifaceted data as robust results. Reviewers assessing the associated manuscripts end up focusing more on the soundness of the study’s registered design rather than disproportionately favoring the findings. This helps reduce the disadvantage that confirmatory studies usually have relative to fishing expeditions. Indeed, a few journals have begun accepting articles from well-designed studies even before the results come in.</p>
</blockquote>
<p>Wait, I thought there was a big rise in retraction rates that has everyone freaking out. Isn’t there a website just dedicated to <a href="http://retractionwatch.com/">outing and shaming people who retract stuff</a>? I think registry of study designs for confirmatory research is a great idea. But I wonder what the effect would be on reducing the <a href="http://www.acs.org/content/acs/en/education/whatischemistry/landmarks/flemingpenicillin.html">opportunity for scientific mistakes that turn into big ideas</a>. This person needs to read the <a href="http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/">ROC curves of science</a>. Any basic research system that doesn’t allow for a lot of failure is never going to discover anything interesting.</p>
<p><strong>Big effects are due to multiple small effects</strong></p>
<blockquote>
<p>So, do big effects tend to have big explanations, or many explanations? There is probably no single, simple and uniformly correct answer to this question. (It’s a hopeless tree!) But, we can use a simple model to help make an educated guess.</p>
</blockquote>
<p>The author simulates 200 variables each drawn from a N(0,i) for i=1…5. The author finds that most of the largest values come from the N(0,5) not the N(0,1). This says nothing about simple or complex phenomena. It says a lot about how a N(0,5) is more variable than a N(0,1). This does not address the issue of whether hypotheses are correct or not.</p>
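<p>A rough sketch of that kind of simulation shows why the result is unsurprising. The details here are assumptions rather than the edge.org author's actual code: I read the second argument of N(0,i) as a standard deviation, draw 200 values per group, and look at the 20 largest in absolute value.</p>

```python
import random
from collections import Counter


def largest_by_source(n_per_group=200, top=20, seed=0):
    """Draw n_per_group values from N(0, sd) for sd = 1..5 and count which
    group each of the `top` largest absolute values came from.
    Treating i in N(0, i) as a standard deviation is an assumption."""
    rng = random.Random(seed)
    draws = [(abs(rng.gauss(0, sd)), sd)
             for sd in range(1, 6)
             for _ in range(n_per_group)]
    draws.sort(reverse=True)  # largest absolute values first
    return Counter(sd for _, sd in draws[:top])


print(largest_by_source())
```

<p>The extreme values are dominated by the widest distributions simply because those distributions are wider; nothing in the setup encodes whether an explanation is simple or complex.</p>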
<p><strong>Bonus round: On abandoning evolution</strong></p>
<blockquote>
<p>Intelligent design and other Creationist critiques have been easily shrugged off and the facts of evolution well established in the laboratory, fossil record, DNA record and computer simulations. If evolutionary biologists are really Seekers of the Truth, they need to focus more on finding the mathematical regularities of biology, following in the giant footsteps of Sewall Wright, JBS Haldane, Ronald Fisher and so on.</p>
</blockquote>
<p>Among many other things, this person needs a course in statistics. The people he is talking about focused on quantifying uncertainty about biology, not certainty or mathematical regularity.</p>
<p><strong>One I actually agree with: putting an end to the idea that Big Data solves all problems</strong></p>
<blockquote>
<p>No, I don’t literally mean that we should stop believing in, or collecting, Big Data. But we should stop pretending that Big Data is magic.</p>
</blockquote>
<p>That guy must be reading our blog. <a href="http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/">The key word in data science is science</a>, after all.</p>
<p><strong>On focusing on the variance rather than the mean</strong></p>
<blockquote>
<p>Our focus on averages should be retired. Or, if not retired, we should give averages an extended vacation. During this vacation, we should catch up on another sort of difference between groups that has gotten short shrift: we should focus on comparing the difference in variance (which captures the spread or range of measured values) between groups.</p>
</blockquote>
<p>I actually like most of this article, but the format for the edge.org pieces killed it. The author says we should stop caring about the mean or make it secondary. I completely agree we should consider the variance - the examples he points out are great. But we should also always keep in mind the first moment before we move on to the second, so not “retire” just “add to”.</p>
<p><strong> No one asked me but here is what I’d throw out</strong></p>
<ul>
<li>Sweeping generalizations without careful theory, experimentation, and good data</li>
<li>Oversimplifying questions that don’t ask for potential solutions that deal with the complexity of the real world.</li>
<li>Sensationalism by scientists about science</li>
<li>Sensationalism by journalists about science</li>
<li>Absolutist claims about uncertain data</li>
</ul>
Sunday data/statistics link roundup (1/12/2014)
2014-01-13T04:59:04+00:00
http://simplystats.github.io/2014/01/13/sunday-datastatistics-link-roundup-1132014
<p>Well it technically is Monday, but I never went to sleep so that still counts as Sunday, right?</p>
<ol>
<li>As a person who has taught a couple of MOOCs I’m used to getting some pushback from people who don’t like the whole concept. But I’m still happy that I’m not the only one who thinks they are a <a href="http://www.eduwire.com/technology/more-not-or-fear-and-loathing-the-world-of-moocs/">pretty good idea</a> and still worth doing. I think that both the hype and the backlash are too much. The hype claimed it would completely end the university as we know it. The backlash says it will have no impact. I think more likely it will have a major impact on people who traditionally don’t attend colleges. That’s ok with me. I think <a href="http://science-and-food.blogspot.com/2013/06/on-super-professors-and-mooc-pushback.html">this post</a> gets it about right.</li>
<li>The Leekasso is finally dethroned! <a href="http://strimmerlab.org/korbinian.html">Korbinian Strimmer</a> used my simulation code and compared it to CAT scores in the sda package coupled with Higher Criticism feature selection. <a href="https://github.com/jtleek/leekasso/blob/master/cat-vs-leekasso.png">Here is the accuracy plot</a>. Looks like Leekasso is competitive with CAT-Leekasso, but CAT+HC wins. Big win for Github there and thanks to Korbinian for taking the time to do the simulation!</li>
<li>Jack Andraka is <a href="http://www.forbes.com/sites/matthewherper/2014/01/08/why-biotech-whiz-kid-jack-andraka-is-not-on-the-forbes-30-under-30-list/">getting some pushback</a> from serious scientists on the draft of his paper describing the research he <a href="http://www.ted.com/talks/jack_andraka_a_promising_test_for_pancreatic_cancer_from_a_teenager.html">outlined in his TED talk</a>. He is taking the criticism like a pro, which says a lot about the guy. From reading the second hand reviews, it sounds like his project was like most good science projects - it made some interesting progress but needs a lot of grinding before it turns into something real. The hype made it sound too good to be true. I hope that he will just ignore the hype machine from here on in and keep grinding (via Rafa).</li>
<li>I’ve probably posted this before, but here is the <a href="http://matt.might.net/articles/phd-school-in-pictures/">illustrated guide to a Ph.D.</a> Lest you think that little bump doesn’t matter, don’t forget to scroll to the bottom and <a href="http://matt.might.net/articles/my-sons-killer/">read this</a>.</li>
<li>The bmorebiostat bloggers (<a href="http://bmorebiostat.com/">http://bmorebiostat.com/</a>), if you aren’t following them, you should be.</li>
<li><a href="http://source.opennews.org/en-US/articles/introducing-treasuryio/">Potentially cool website</a> for accessing treasury data.</li>
<li>Ok it’s 5am. I need a <a href="https://twitter.com/rdpeng/status/422382846041665537">githug</a> and then off to bed.</li>
</ol>
The top 10 predictor takes on the debiased Lasso - still the champ!
2014-01-08T10:39:30+00:00
http://simplystats.github.io/2014/01/08/the-top-10-predictor-takes-on-the-debiased-lasso-still-the-champ
<p>After reposting on the comparison between the lasso and the always top 10 predictor (leekasso) I got some feedback that the problem could be I wasn’t debiasing the Lasso (thanks Tim T. on Twitter!). The idea behind debiasing (as I understand it) is to use the Lasso to do feature selection and then fit a model without shrinkage to “debias” the coefficients. The debiased model is then used for prediction. <a href="http://faculty.washington.edu/nrsimon/">Noah Simon</a>, who knows approximately infinitely more about this than I do, kindly provided some code for fitting a debiased Lasso. He is not responsible for any mistakes/silliness in the simulation, he was just nice enough to provide some debiased Lasso code. He mentions a similar idea appears in the <a href="http://cran.r-project.org/web/packages/relaxo/relaxo.pdf">relaxo package</a> if you set <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_0b92f8c2972983f15725fd66e4a72066.gif" style="vertical-align: middle; border: none; " class="tex" alt="\phi=0" /></span>.</p>
<p>I used the <a href="http://simplystatistics.org/2014/01/04/repost-prediction-the-lasso-vs-just-using-the-top-10-predictors/">same simulation setup</a> as before and tried out the Leekasso, the Lasso and the Debiased Lasso. Here are the accuracy results (more red = higher accuracy):</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2014/01/08/the-top-10-predictor-takes-on-the-debiased-lasso-still-the-champ/accuracy-plot-2/" rel="attachment wp-att-2412"><img class="size-medium wp-image-2412 aligncenter" alt="accuracy-plot" src="http://simplystatistics.org/wp-content/uploads/2014/01/accuracy-plot1-300x100.png" width="300" height="100" srcset="http://simplystatistics.org/wp-content/uploads/2014/01/accuracy-plot1-300x100.png 300w, http://simplystatistics.org/wp-content/uploads/2014/01/accuracy-plot1-1024x341.png 1024w" sizes="(max-width: 300px) 100vw, 300px" /></a>
</p>
<p style="text-align: left;">
The results suggest the debiased Lasso still doesn't work well under this design. Keep in mind as I mentioned in my previous post that the Lasso may perform better under a different causal model.
</p>
<p style="text-align: left;">
<strong>Update: </strong> <a href="https://github.com/jtleek/leekasso">Code available here on Github</a> if you want to play around.
</p>
Preparing for tenure track job interviews
2014-01-07T10:00:55+00:00
http://simplystats.github.io/2014/01/07/preparing-for-tenure-track-job-interviews-2
<p><em>Editor’s note: This is a slightly modified version of a previous post.</em></p>
<p>If you are in the job market you will soon be receiving (or already received) an invitation for an interview. So how should you prepare? You have two goals. The first is to make a good impression. Here are some tips:</p>
<p>1) During your talk, do NOT go over your allotted time. Practice your talk at least twice, both times in front of a live audience that asks questions.</p>
<p>2) Know your audience. If it’s a “math-y” department, give a more “math-y” talk. If it’s an applied department, give a more applied talk. But (sorry for the cliché) be yourself. Don’t pretend to be interested in something you are not as this almost always backfires.</p>
<p>3) Learn about the faculty’s research interests. This will help during the one-on-one meetings.</p>
<p>4) Be ready to answer the questions “what do you want to teach?” and “where do you see yourself in five years?”</p>
<p>5) I can’t think of any department where it is necessary to wear a suit (correct me if I’m wrong in the comments). In some places you might feel uncomfortable wearing a suit while those interviewing you are in <a href="http://owpdb.mfo.de/photoNormal?id=7558" target="_blank">shorts and t-shirt</a>.</p>
<p>Second, and just as important, you want to figure out if you like the department you are visiting. Do you want to spend the next 5, 10, 50 years there? Make sure to find out as much as you can to answer this question. Some questions are more appropriate for junior faculty, the more sensitive ones for the chair. Here are some example questions I would ask:</p>
<p>1) What are the expectations for promotion? Would you promote someone publishing exclusively in subject matter journals such as Nature, Science, Cell, PLoS Biology, American Journal of Epidemiology? Somebody publishing exclusively in Annals of Statistics? Is being a PI on an R01 a requirement for tenure?</p>
<p>2) What are the expectations for teaching/service/collaboration? How are teaching and committee service assignments made?</p>
<p>3) How did you connect with your collaborators? How are these connections made?</p>
<p>4) What percent of my salary am I expected to cover? Is it possible to do this by being a co-investigator?</p>
<p>5) Where do you live? How are the schools? How is the commute?</p>
<p>6) How many graduate students does the department have? How are graduate students funded? If I want someone to work with me, do I have to cover their stipend/tuition?</p>
<p>7) How is computing supported? This varies a lot from place to place. Some departments share amazing systems. Ask how costs are shared. How is the IT staff? Is R supported? In others you might have to buy your own hardware. Get <strong>all</strong> the details.</p>
<p>Specific questions for the junior Faculty:</p>
<p>Are the expectations for promotion made clear to you? Do you get feedback on your progress? Do the senior faculty mentor you? Do the senior faculty get along? What do you like most about the department? What can be improved? In the last 10 years, what percent of junior faculty get promoted?</p>
<p>Questions for the chair:</p>
<p>What percent of my salary am I expected to cover? How soon? Is there bridge funding? What is a standard startup package? Can you describe the promotion process in detail? What space is available for postdocs? (for hard money places) I love teaching, but can I buy out teaching with grants?</p>
<p>I am sure I missed stuff, so please comment away….</p>
Sunday data/statistics link roundup (1/5/14)
2014-01-05T10:57:59+00:00
http://simplystats.github.io/2014/01/05/sunday-datastatistics-link-roundup-1514
<ol>
<li>If you haven’t seen <a href="http://lolmythesis.com/">lolmythesis</a> it is pretty incredible. 1-2 line description of thesis projects. I think every student should be required to make one of these up before they defend. The best I could come up with for mine is, “We built a machine sensitive enough to measure the abundance of every gene in your body at once; turns out it measures other stuff too.”</li>
<li><a href="http://www.nytimes.com/2013/12/31/science/i-had-my-dna-picture-taken-with-varying-results.html?_r=0">An interesting article</a> about how different direct to consumer genetic tests give different results. It doesn’t say, but it would be interesting if the raw data were highly replicable and the interpretations were different. If the genotype calls themselves didn’t match up that would be much worse on some level. I agree people have a right to their genetic data. On the other hand, I think it is important to remember that even people with Ph.D’s and 15 years experience have trouble interpreting the results of a GWAS. To assume the average individual will understand their genetic risk is seriously optimistic (via Rafa).</li>
<li>The <a href="http://www.codinghorror.com/blog/2006/05/the-ten-commandments-of-egoless-programming.html">10 commandments of egoless programming.</a> These are so important on big collaborative projects like the ones my group has been working on the last year or so. Fortunately my students and postdocs are much better at being egoless than I am (I am an academic with a blog so it isn’t like you couldn’t see the ego coming <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":-)" class="wp-smiley" style="height: 1em; max-height: 1em;" />).</li>
<li><a href="http://cnr.lwlss.net/GarminR/?utm_content=buffer66fff&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer">This is a neat post</a> on parsing and analyzing data from a Garmin. The analysis even produces an automated report! I love it when people do cool things like this with their own data in R.</li>
<li><a href="http://biology.duke.edu/johnsenlab/advice.html">Super interesting advice page</a> for potential graduate students from a faculty member at Duke Biology. This is particularly interesting in light of the ongoing debate about the viability of the graduate education pipeline <a href="http://www.bloomberg.com/news/2014-01-03/can-t-get-tenure-then-get-a-real-job.html">highlighted in this recent article</a>. I think it is important for graduate students in Ph.D. programs to know that not every student goes to an academic position. This has been true for a long time in Biostatistics, where many people end up in industry positions. That also means it is the obligation of Ph.D. programs to prepare students for a variety of jobs. Fortunately, most Ph.D.s in Biostatistics have experience processing data, working with collaborators, and developing data products so are usually also really prepared for industry.</li>
<li><a href="http://stat-graphics.org/movies/prim9.html">This old video</a> of Tukey and Friedman is awesome and mind-blowing (via Mike L.).</li>
<li><a href="http://balancedbudget.baltimorecity.gov/exercise/index.php">Cool site</a> that lets you try to balance Baltimore’s budget. This type of thing would be even cooler if there were Github like pull requests where you could make new suggestions as well.</li>
<li><a href="http://alyssafrazee.com/introducing-R.html">My student Alyssa</a> has a very interesting post on teaching R to a non-programmer in one hour. Take the Frazee Challenge and list what you would teach.</li>
</ol>
Repost: Prediction: the Lasso vs. just using the top 10 predictors
2014-01-04T14:37:24+00:00
http://simplystats.github.io/2014/01/04/repost-prediction-the-lasso-vs-just-using-the-top-10-predictors
<p><em>Editor’s note: This is a previously published post of mine from a couple of years ago (!). I always thought about turning it into a paper. The interesting idea (I think) is how the causal model matters for whether the lasso or the marginal regression approach works better. Also <a href="https://github.com/ecpolley/SuperLearner/blob/master/R/SL.leekasso.R">check it out</a>, the Leekasso is part of the SuperLearner package.</em></p>
<p>One incredibly popular tool for the analysis of high-dimensional data is the <a href="http://www-stat.stanford.edu/~tibs/lasso.html" target="_blank">lasso</a>. The lasso is commonly used in cases when you have many more predictors than independent samples (the n « p problem). It is also often used in the context of prediction.</p>
<p>Suppose you have an outcome <strong>Y</strong> and several predictors <strong>X1</strong>,…,<strong>XM</strong>. The lasso fits the model:</p>
<p><strong>Y = B0 + B1 X1 + B2 X2 + … + BM XM + E</strong></p>
<p>subject to a constraint on the sum of the absolute value of the <strong>B</strong> coefficients. The result is that: (1) some of the coefficients get set to zero, and those variables drop out of the model, (2) other coefficients are “shrunk” toward zero. Dropping some variables is good because there are a lot of potentially unimportant variables. Shrinking coefficients may be good, since the big coefficients might be just the ones that were really big by random chance (this is related to Andrew Gelman’s <a href="http://andrewgelman.com/2011/09/the-statistical-significance-filter/" target="_blank">type M errors</a>).</p>
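<p>Both effects are easiest to see in a special case. The sketch below assumes an orthonormal design and the penalized form of the lasso, in which each lasso coefficient is the least-squares estimate soft-thresholded by a penalty. The numbers and the penalty value <code>lam</code> are hypothetical.</p>

```python
def soft_threshold(b, lam):
    """Lasso estimate of one coefficient under an orthonormal design:
    move the least-squares estimate b toward zero by lam, stopping at zero."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


ols = [3.0, -1.5, 0.25, -0.125, 2.0]  # hypothetical least-squares estimates
lam = 0.5                             # hypothetical penalty
lasso = [soft_threshold(b, lam) for b in ols]
print(lasso)  # [2.5, -1.0, 0.0, 0.0, 1.5]
```

<p>The two small coefficients are set exactly to zero and drop out, while the three large ones survive but are each pulled toward zero by the same amount.</p>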
<p>I work in genomics, where n«p problems come up all the time. Whenever I use the lasso or when I read papers where the lasso is used for prediction, I always think: “How does this compare to just using the top 10 most significant predictors?” I have asked this out loud enough that <a href="http://www.biostat.jhsph.edu/~rpeng/" target="_blank">some</a> <a href="http://www.biostat.jhsph.edu/~iruczins/" target="_blank">people</a> <a href="http://www.bcaffo.com/" target="_blank">around</a> <a href="http://rafalab.jhsph.edu/" target="_blank">here</a> <a href="http://people.csail.mit.edu/mrosenblum/" target="_blank">started</a> calling it the “Leekasso” to poke fun at me. So I’m going to call it that in a thinly veiled attempt to avoid <a href="http://en.wikipedia.org/wiki/Stigler's_law_of_eponymy" target="_blank">Stigler’s law of eponymy</a> (actually Rafa points out that using this name is a perfect example of this law, since this feature selection approach has been proposed before <a href="http://www.stat.berkeley.edu/tech-reports/576.pdf" target="_blank">at least once</a>).</p>
<p>Here is how the Leekasso works. You fit each of the models:</p>
<p><strong>Y = B0 + BkXk + E</strong></p>
<p>take the 10 variables with the smallest p-values from testing the <strong>Bk</strong> coefficients, then fit a linear model with just those 10 variables. You never use 9 or 11; the Leekasso is always 10.</p>
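<p>Here is a minimal Python sketch of the selection step only (it is not the R code used for the post). It relies on the fact that, for one-predictor regressions fit on the same samples, the p-value for the slope is a decreasing function of the absolute correlation between the predictor and the outcome, so ranking by |correlation| gives the same top 10 as ranking by p-value. The function names and the toy data are made up for illustration.</p>

```python
import random
from math import sqrt


def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / sqrt(sxx * syy)


def leekasso_select(columns, y, k=10):
    """Indices of the k predictors with the smallest marginal p-values,
    found by ranking on |correlation| (same ordering when n is shared)."""
    ranked = sorted(range(len(columns)),
                    key=lambda j: abs_corr(columns[j], y),
                    reverse=True)
    return ranked[:k]


# Toy data (hypothetical): 200 predictors, only the first 5 track the outcome.
rng = random.Random(42)
y = [0] * 50 + [1] * 50
columns = [[rng.gauss(0, 1) + (2.0 if j < 5 else 0.0) * yi for yi in y]
           for j in range(200)]

picked = leekasso_select(columns, y)
print(sorted(picked))
```

<p>The final step, an ordinary least-squares fit of the outcome on the 10 selected columns, can be done with any linear model routine and is left out here.</p>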
<p>For fun I did an experiment to compare the accuracy of the Leekasso and the Lasso.</p>
<p>Here is the setup:</p>
<ul>
<li>I simulated 500 variables and 100 samples for each study, each N(0,1)</li>
<li>I created an outcome that was 0 for the first 50 samples, 1 for the last 50</li>
<li>I set a certain number of variables (between 5 and 50) to be associated with the outcome using the model <strong>Xi = b0i + b1iY + e </strong>(this is an important choice, more later in the post)</li>
<li>I tried different levels of signal to the truly predictive features</li>
<li>I generated two data sets (training and test) from the exact same model for each scenario</li>
<li>I fit the Lasso using the <a href="http://cran.r-project.org/web/packages/lars/index.html" target="_blank">lars </a>package, choosing the shrinkage parameter as the value that minimized the cross-validation MSE in the training set</li>
<li>I fit the Leekasso and the Lasso on the training sets and evaluated accuracy on the test sets.</li>
</ul>
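<p>The data-generating part of this setup can be sketched as follows. This is an illustrative Python version, not the R code from the post, and it fills in details the list leaves open: the intercepts <strong>b0i</strong> are taken to be zero, every associated variable shares one slope <code>b1</code>, the associated variables are the first <code>n_signal</code> columns, and the noise is N(0,1).</p>

```python
import random


def simulate_study(n_vars=500, n_samples=100, n_signal=5, b1=1.0, rng=None):
    """Simulate one data set from Xi = b0i + b1i * Y + e with b0i = 0 (assumed).
    The first n_signal variables have slope b1; the rest are pure N(0,1) noise.
    Returns (x, y) where x is a list of n_samples rows of n_vars values."""
    if rng is None:
        rng = random.Random()
    half = n_samples // 2
    y = [0] * half + [1] * (n_samples - half)  # 0 for first half, 1 for second
    x = [[(b1 if j < n_signal else 0.0) * yi + rng.gauss(0, 1)
          for j in range(n_vars)]
         for yi in y]
    return x, y


# Training and test sets come from the exact same model for a given scenario.
rng = random.Random(2014)
train_x, train_y = simulate_study(n_signal=20, b1=0.5, rng=rng)
test_x, test_y = simulate_study(n_signal=20, b1=0.5, rng=rng)
print(len(train_x), len(train_x[0]), sum(train_y))  # 100 500 50
```

<p>Varying <code>n_signal</code> between 5 and 50 and changing <code>b1</code> gives the grid of scenarios, with one training set and one test set drawn per scenario.</p>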
<p>The R code for this analysis is available <a href="http://biostat.jhsph.edu/~jleek/code/leekasso.R" target="_blank">here</a> and the resulting data is <a href="http://biostat.jhsph.edu/~jleek/code/lassodata.rda" target="_blank">here</a>.</p>
<p>The results show that for all configurations, using the top 10 has a higher out of sample prediction accuracy than the lasso. A larger version of the plot is <a href="http://biostat.jhsph.edu/~jleek/code/accuracy-plot.png" target="_blank">here</a>.</p>
<p><img alt="" src="http://biostat.jhsph.edu/~jleek/code/accuracy-plot.png" width="480" height="240" /></p>
<p>Interestingly, this is true even when there are fewer than 10 real features in the data or when there are many more than 10 real features (remember the Leekasso always picks 10).</p>
<p>Some thoughts on this analysis:</p>
<ol>
<li>This is only test-set prediction accuracy, it says nothing about selecting the “right” features for prediction.</li>
<li>The Leekasso took about 0.03 seconds to fit and test per data set compared to about 5.61 seconds for the Lasso.</li>
<li>The data generating model is the model underlying the top 10, so it isn’t surprising it has higher performance. Note that I simulated from the model: <strong>Xi = b0i + b1iY + e</strong>, this is the model commonly assumed in differential expression analysis (genomics) or voxel-wise analysis (fMRI). Alternatively I could have simulated from the model: <strong>Y = B0 + B1 X1 + B2 X2 + … + BM XM + E</strong>, where most of the coefficients are zero. In this case, the Lasso would outperform the top 10 (data not shown). This is a key, and possibly obvious, issue raised by this simulation. When doing prediction differences in the true “causal” model matter a lot. So if we believe the “top 10 model” holds in many high-dimensional settings, then it may be the case that regularization approaches don’t work well for prediction and vice versa.</li>
<li>I think what may be happening is that the Lasso is overshrinking the parameter estimates; in other words, you accept too much bias in exchange for a reduction in variance. Alan Dabney and John Storey have a really nice <a href="http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0001002" target="_blank">paper</a> discussing shrinkage in the context of genomic prediction that I think is related.</li>
</ol>
The Supreme Court takes on Pollution Source Apportionment...and Realizes It's Hard
2014-01-03T09:00:43+00:00
http://simplystats.github.io/2014/01/03/the-supreme-court-takes-on-pollution-source-apportionment-and-realizes-its-hard
<p>Recently, the U.S. Supreme Court heard arguments in the cases <em><a href="http://www.scotusblog.com/case-files/cases/environmental-protection-agency-v-eme-homer-city-generation/">EPA v. EME Homer City Generation</a></em> and <em><a href="http://www.scotusblog.com/case-files/cases/american-lung-association-v-eme-homer-city-generation/">American Lung Association v. EME Homer City Generation</a></em>. SCOTUSblog has a nice <a href="http://www.scotusblog.com/2013/12/argument-recap-a-good-day-for-epa/#more-201950">summary of the legal arguments</a>, for the law buffs out there.</p>
<p>The basic problem is that the way air pollution is regulated, the EPA and state and local agencies monitor the air pollution in each state. When the levels of pollution are above the national ambient air quality standards at the monitors in that state, the state is considered in “non-attainment” (i.e. they have not attained the standard). Otherwise, they are in attainment.</p>
<p>But what if your state doesn’t actually generate any pollution, but there’s all this pollution blowing in from another state? Pollution knows no boundaries and in that case, the monitors in your state will be in non-attainment, and it isn’t even your fault! The Clean Air Act has something called the “good neighbor” policy that was designed to address this issue. From SCOTUSblog:</p>
<blockquote>
<p>One of the obligations that states have, in drafting implementation plans [to reduce pollution], is imposed by what is called the “good neighbor” policy. It dates from 1963, in a more elemental form, but its most fully developed form requires each state to include in its plan the measures necessary to prevent the migration of their polluted air to their neighbors, if that would keep the neighbors from meeting EPA’s quality standards.</p>
</blockquote>
<p>The problem is that if you live in a state like Maryland, your air pollution is coming from a bunch of states (Pennsylvania, Ohio, etc.). So who do you blame? Well, the logical thing would be to say that if Pennsylvania contributes to 90% of Maryland’s interstate air pollution and Ohio contributes 10%, then Pennsylvania should get 90% of the blame and Ohio 10%. But it’s not so easy because air pollution doesn’t have any special identifiers on it to indicate what state it came from. This is the <em>source apportionment problem</em> in air pollution and it involves trying to back-calculate where a given amount of pollution came from (or what was its source). It’s not an easy problem.</p>
<p>EPA realized the unfairness here and devised the Cross-State Air Pollution Rule, also known as the “Transport Rule”. From SCOTUSblog:</p>
<blockquote>
<p>What the Transport Rule sought to do is to set up a regime to limit cross-border movement of emissions of nitrogen oxides and sulfur dioxide. Those substances, sent out from coal-fired power plants and other sources, get transformed into ozone and “fine particular matter” (basically, soot), and both are harmful to human health, contributing to asthma and heart attacks. They also damage natural terrain such as forests, destroy farm crops, can kill fish, and create hazes that reduce visibility.</p>
<p>Both of those pollutants are carried by the wind, and they can be transported very large distances — a phenomenon that is mostly noticed in the eastern states.</p>
</blockquote>
<p>There are actually a few versions of this problem. One common one involves identifying the source of a particle (i.e. automobile, power plants, road dust) based on its chemical composition. The idea here is that at any given monitor, there are particles blowing in from all different types of sources and so the pollution you measure is a mixture of all these sources. Making some assumptions about chemical mass balance, there are ways to statistically separate out the contributions from individual sources based on the chemical composition of the total mass measurement. If the particles that we measure, say, have a lot of ammonium ions and we know that particles generated by coal-burning power plants have a lot of ammonium ions, then we might infer that the particles came from a coal-burning power plant.</p>
<p>The key idea here is that different sources of particles have “chemical signatures” that can be used to separate out their various contributions. This is already a difficult problem, but at least here, we have some knowledge of the chemical makeup of various sources and can incorporate that knowledge into the statistical analysis.</p>
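<p>To make the “chemical signature” idea concrete, here is a minimal sketch of a chemical mass balance calculation. Everything numeric in it is made up for illustration: the two source profiles, the species list, and the measured concentrations are hypothetical, and real analyses involve many more sources, measurement error, and uncertain profiles. The sketch assumes the measured amount of each species is a sum of source contributions weighted by each source's profile, and recovers the two contributions by least squares.</p>

```python
def apportion(profile_a, profile_b, measured):
    """Least-squares contributions of two sources to a measured mixture.

    Each profile maps species -> fraction of that source's mass.
    Model assumed here: measured[s] = a * profile_a[s] + b * profile_b[s].
    """
    species = sorted(measured)
    fa = [profile_a[s] for s in species]
    fb = [profile_b[s] for s in species]
    c = [measured[s] for s in species]
    # Normal equations for the two unknown contributions (a, b).
    aa = sum(x * x for x in fa)
    ab = sum(x * y for x, y in zip(fa, fb))
    bb = sum(y * y for y in fb)
    ac = sum(x * z for x, z in zip(fa, c))
    bc = sum(y * z for y, z in zip(fb, c))
    det = aa * bb - ab * ab
    return (ac * bb - bc * ab) / det, (aa * bc - ab * ac) / det


# Hypothetical source profiles (mass fraction of each species).
power_plant = {"sulfate": 0.50, "ammonium": 0.30, "silicon": 0.02}
road_dust = {"sulfate": 0.05, "ammonium": 0.01, "silicon": 0.40}

# Hypothetical measurements, constructed as 8 units of power plant mass
# plus 3 units of road dust mass with no measurement noise.
measured = {"sulfate": 4.15, "ammonium": 2.43, "silicon": 1.36}

plant_mass, dust_mass = apportion(power_plant, road_dust, measured)
print("power plant contribution:", round(plant_mass, 3))
print("road dust contribution:", round(dust_mass, 3))
```

<p>The calculation only works because the two profiles differ. If two sources had the same profile, <code>det</code> would be zero and their contributions could not be separated.</p>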
<p>In the problem at the Supreme Court, we’re not concerned with particles from various types of sources, but rather from different locations. But, for the most part, different states don’t have “chemical signatures” or tracer elements, so it’s hard to identify whether a given particle (or other pollutant) blowing in the wind came from Pennsylvania versus Ohio.</p>
<p>So what did EPA do? Well, instead of figuring out where the pollution came from, they decided that states would reduce emissions based on how much it would cost to control those emissions. The states objected because the cost of controlling emissions may well have nothing to do with how much pollution is actually being contributed downwind.</p>
<p>The legal question involves whether or not EPA has the authority to devise a regulatory plan based on costs as opposed to actual pollution contribution. I will let people who actually know the law address that question, but given the general difficulty of source apportionment, I’m not sure EPA could have come up with a much better plan.</p>
Some things R can do you might not be aware of
2013-12-30T16:04:24+00:00
http://simplystats.github.io/2013/12/30/some-things-r-can-do-you-might-not-be-aware-of
<p>There is a lot of noise around the “R versus Contender X” for Data Science. I think the two main competitors right now that I hear about are Python and Julia. I’m not going to weigh into the debates because I go by the motto: “Why not just use something that works?”</p>
<p>R offers a lot of benefits if you are interested in statistical or predictive modeling. It is basically unrivaled in terms of the breadth of packages for applied statistics. But I think sometimes it isn’t obvious that R can handle some tasks that you used to have to do with other languages. This misconception is particularly common among people who regularly code in a different language and are moving to R. So I thought I’d point out a few cool things that R can do. Please add to the list in the comments if I’ve missed things that R can do people don’t expect.</p>
<ol>
<li><strong>R can do regular expressions/text processing:</strong> Check out <a href="http://cran.r-project.org/web/packages/stringr/index.html">stringr</a>, <a href="http://cran.r-project.org/web/packages/tm/index.html">tm</a>, and a large number of other <a href="http://cran.r-project.org/web/views/NaturalLanguageProcessing.html">natural language processing packages</a>.</li>
<li><strong>R can get data out of a database:</strong> Check out <a href="http://cran.r-project.org/web/packages/RMySQL/index.html">RMySQL</a>, <a href="http://cran.r-project.org/web/packages/rmongodb/index.html">RMongoDB</a>, <a href="http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html">rhdf5</a>, <a href="http://cran.r-project.org/web/packages/ROracle/index.html">ROracle</a>, <a href="http://monetr.r-forge.r-project.org/">MonetDB.R</a> (via Anthony D.).</li>
<li><strong>R can process nasty data: </strong>Check out <a href="http://cran.r-project.org/web/packages/plyr/index.html">plyr</a>, <a href="http://cran.r-project.org/web/packages/reshape2/index.html">reshape2</a>, <a href="http://cran.r-project.org/web/packages/Hmisc/index.html">Hmisc</a></li>
<li><strong>R can process images: </strong><a href="http://www.bioconductor.org/packages/2.13/bioc/html/EBImage.html">EBImage</a> is a good general purpose tool, but there are also packages for various file types like <a href="http://cran.fhcrc.org/web/packages/jpeg/index.html">jpeg</a>.</li>
<li><strong>R can handle different data formats: </strong><a href="http://cran.r-project.org/web/packages/XML/index.html">XML</a> and <a href="http://cran.r-project.org/web/packages/RJSONIO/index.html">RJSONIO</a> handle two common types, but you can also read from Excel files with <a href="http://cran.r-project.org/web/packages/xlsx/index.html">xlsx</a> or handle pretty much every common data storage type (you’ll have to search <a href="http://lmgtfy.com/?q=R+%2B+data+type">R + data type</a> to find the package).</li>
<li><strong>R can interact with APIs</strong>: Check out <a href="http://cran.r-project.org/web/packages/RCurl/index.html">RCurl</a> and <a href="http://cran.r-project.org/web/packages/httr/">httr</a> for general purpose software, or you could try some specific examples like <a href="http://cran.r-project.org/web/packages/twitteR/index.html">twitteR</a>. You can create an API from R code using <a href="http://yhathq.com/">yhat</a>.</li>
<li><strong>R can build apps/interactive graphics: </strong>Some pretty cool things have already been built with <a href="http://www.rstudio.com/shiny/">shiny</a>, and <a href="http://rcharts.io/">rCharts</a> interfaces with a ton of interactive graphics packages.</li>
<li><strong>R can create dynamic documents: </strong>Try out <a href="http://yihui.name/knitr/">knitr</a> or <a href="http://slidify.org/">slidify</a>.</li>
<li><strong>R can play with Hadoop: </strong>Check out the <a href="https://github.com/RevolutionAnalytics/RHadoop/wiki">rhadoop wiki</a>.</li>
<li><strong>R can create interactive teaching modules:</strong> You can do it in the console with <a href="http://swirlstats.com/">swirl</a> or on the web with <a href="http://www.datacamp.com/">Datamind</a>.</li>
<li><strong>R interfaces very nicely with C if you need to be hardcore (also maybe? interfaces with Python): </strong><a href="http://dirk.eddelbuettel.com/code/rcpp.html">Rcpp</a>, enough said. Also <a href="http://adv-r.had.co.nz/Rcpp.html">read the tutorial</a>. I haven’t tried the <a href="http://cran.r-project.org/web/packages/rPython/rPython.pdf">rPython</a> library, but it looks like a great idea.</li>
</ol>
A non-comprehensive list of awesome things other people did this year.
2013-12-20T10:46:28+00:00
http://simplystats.github.io/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year
<p><em>Editor’s Note:</em> <em>I made this list off the top of my head and have surely missed awesome things people have done this year. If you know of some, you should make your own list or add it to the comments! I have also avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. </em></p>
<ul>
<li>I emailed Hadley Wickham about some trouble we were having memory profiling. He wrote back immediately, <a href="https://github.com/hadley/lineprof">then wrote an R package</a>, then wrote <a href="http://adv-r.had.co.nz/memory.html">this awesome guide</a>. That guy is ridiculous.</li>
<li>Jared Horvath <a href="http://blogs.scientificamerican.com/guest-blog/2013/12/04/the-replication-myth-shedding-light-on-one-of-sciences-dirty-little-secrets/">wrote this</a> incredibly well-written and compelling argument for the scientific system that has given us a wide range of discoveries.</li>
<li>Yuwen Liu and colleagues wrote <a href="http://bioinformatics.oxfordjournals.org/content/early/2013/12/06/bioinformatics.btt688.short">this really interesting paper</a> on power for RNA-seq studies comparing biological replicates and sequencing depth. Shows pretty conclusively to go for more replicates (music to a statistician’s ears!).</li>
<li>Yoav Benjamini and Yotam Hechtlinger <a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt032.extract">wrote an amazing discussion</a> of the paper we wrote about the science-wise false discovery rate. It contributes new ideas about estimation/control in that context.</li>
<li>Sherri Rose <a href="http://static.squarespace.com/static/5006630b24ac4eefa45a0d3e/t/5027310ee4b09eb28be16153/1344745742832/">wrote a fascinating article</a> about statisticians’ role in big data. One thing I really liked was this line: “This may require implementing commonly used methods, developing a new method, or integrating techniques from other fields to answer our problem.” I really like the idea that integrating and applying standard methods in new and creative ways can be viewed as a statistical contribution.</li>
<li>Karl Broman gave his now legendary talk (<a href="http://www.biostat.wisc.edu/~kbroman/presentations/IowaState2013/graphs_combined.pdf">part1</a>/<a href="http://www.biostat.wisc.edu/~kbroman/presentations/IowaState2013/index.html">part2</a>) on statistical graphics on a Google Hangout with the Iowa State data viz crowd. I think it should be required viewing for anyone who will ever plot data. They had some technical difficulties during the broadcast so Karl B. took it down. Join me in begging him to put it back up again despite the warts.</li>
<li>Everything Thomas Lumley wrote on <a href="http://notstatschat.tumblr.com/">notstatschat</a>, I follow that blog super closely. I love <a href="http://notstatschat.tumblr.com/post/66056322820/from-labhacks-the-25-scrunchable-scientific-poster">this scrunchable poster</a> he pointed to and <a href="http://notstatschat.tumblr.com/post/62048763550/statins-and-the-causal-markov-property">this post</a> on Statins and the Causal Markov property.</li>
<li>I wish I could take Joe Blitzstein’s <a href="http://cs109.org/">data science class</a>. Particularly check out the reading list, which I think is excellent.</li>
<li>Lev Muchnik, Sinan Aral, and Sean Taylor <a href="http://www.sciencemag.org/content/341/6146/647.abstract">brought the randomized controlled trial</a> to social influence bias on a massive scale. I love how RCTs are finding their way into the new, sexy areas.</li>
<li>Genevera Allen taught a congressman about statistical brain mapping and holy crap <a href="http://www.c-spanvideo.org/clip/4465538">he talked about it on the floor of the house.</a></li>
<li>Lior Pachter started <a href="http://liorpachter.wordpress.com/">mixing it up on his blog</a>. I don’t necessarily agree with all of his posts but it is hard to deny the influence that his posts have had on real science. I definitely read it regularly.</li>
<li>Marie Davidian, President of the ASA, has been on a tear this year, doing tons of cool stuff, including landing the big fish, <a href="http://blog.revolutionanalytics.com/2013/08/nate-silver-jsm.html">Nate Silver</a>, for JSM. Super impressive to watch the energy. I’m also really excited to see what Bin Yu works on this year as <a href="http://imstat.org/officials/current_officials.html">president of IMS</a>.</li>
<li>The <a href="http://www.statistics2013.org/">Stats 2013</a> crowd has done a ridiculously good job of getting the word out about statistics this year. I keep seeing statistics pop up in places like the <a href="http://online.wsj.com/news/articles/SB10001424052702303559504579197942777726778">WSJ</a>, which warms my heart.</li>
<li>One way I judge a paper is by how angry/jealous I am that I didn’t think of or write that paper. <a href="http://www.nature.com/nbt/journal/v31/n11/full/nbt.2702.html">This paper</a> on the reproducibility of RNA-seq experiments was so good I was seeing red. I’ll be reading everything that Tuuli Lappalainen’s new group at the <a href="http://www.nygenome.org/">New York Genome Center</a> writes.</li>
<li>Hector Corrada Bravo and the crowd at UMD wrote this paper about <a href="http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2658.html">differential abundance in microbial communities</a> that also made me crazy jealous. Just such a good idea done so well.</li>
<li>Chad Myers and Curtis Huttenhower continue to absolutely tear it up on <a href="http://www.sciencedirect.com/science/article/pii/S109727651300405X">networks</a> and <a href="http://www.nature.com/nbt/journal/v31/n9/abs/nbt.2676.html">microbiome</a> stuff. Just stop guys, you are making the rest of us look bad…</li>
<li><a href="http://www.youtube.com/watch?v=oH7rt2GZnW8&feature=youtu.be">I don’t want to go to Stanford I want to go to Johns Hopkins</a>.</li>
<li>Ramnath keeps Ramnathing (def. to build incredible things at a speed that we can’t keep up with by repurposing old tools in the most creative way possible) with <a href="http://rcharts.io/">rCharts</a>.</li>
<li>Neo Chung and John Storey invented the <a href="http://arxiv.org/pdf/1308.6013v1.pdf">jackstraw</a> for testing the association between measured variables and principal components. It is an awesome idea and a descriptive name.</li>
<li>I wasn’t at <a href="https://secure.bioconductor.org/BioC2013/">Bioc 2013</a>, but I heard from two people whom I highly respect, and who take a lot to impress, that Levi Waldron gave one of the best talks they’d ever seen. The paper isn’t up yet (I think) but <a href="http://database.oxfordjournals.org/content/2013/bat013.abstract">here is the R package</a> with the data he described. His <a href="https://bitbucket.org/lwaldron/survhd">survHd</a> package for fast coxph fits (think rowFtests but with Cox) is also worth checking out.</li>
<li>John Cook kept cranking out interesting posts, as usual. <a href="http://www.johndcook.com/blog/2013/09/17/to-err-is-human-to-catch-an-error-shows-expertise/">One of my favorites</a> talks about how one major component of expertise is the ability to quickly find and correct inevitable errors (for example, in code).</li>
<li>Larry Wasserman’s <a href="http://normaldeviate.wordpress.com/2013/06/20/simpsons-paradox-explained/">Simpson’s Paradox post</a> should be required reading. He is shutting down Normal Deviate, which is a huge bummer.</li>
<li>Andrew Gelman and I don’t always agree on scientific issues, but there is no arguing that he and the stan team have made a pretty impressive piece of software with <a href="http://mc-stan.org/">stan</a>. Richard McElreath also <a href="https://github.com/rmcelreath/glmer2stan">wrote a slick interface</a> that makes fitting a fully Bayesian model match the syntax of lmer.</li>
<li>Steve Pierson and Ron Wasserstein from ASA are also doing a huge service for our community in tackling the big issues like interfacing statistics to government funding agencies. <a href="https://twitter.com/ASA_SciPol">Steve’s Twitter feed</a> has been a great resource for keeping track of deadlines for competitions, grants, and other opportunities.</li>
<li><a href="http://spark-1590165977.us-west-2.elb.amazonaws.com/jkatz/SurveyMaps/">Joshua Katz built these amazing dialect maps</a> that have been all over the news. Shiny Apps are getting to be serious business.</li>
<li>Speaking of RStudio, they keep rolling out the goodies, my favorite recent addition is <a href="http://www.rstudio.com/ide/docs/debugging/overview">interactive debugging</a>.</li>
<li>I’ll close with <a href="http://mlg.eng.cam.ac.uk/duvenaud/">David Duvenaud</a>’s HarlMCMC shake:</li>
</ul>
A summary of the evidence that most published research is false
2013-12-16T09:50:58+00:00
http://simplystats.github.io/2013/12/16/a-summary-of-the-evidence-that-most-published-research-is-false
<p>One of the hottest topics in science has two main conclusions:</p>
<ul>
<li>Most published research is false</li>
<li>There is a reproducibility crisis in science</li>
</ul>
<p>The first claim is often stated in a slightly different way: that most results of scientific experiments do not replicate. I recently <a href="http://simplystatistics.org/2013/09/25/is-most-science-false-the-titans-weigh-in/">got caught up in this debate</a> and I frequently get asked about it.</p>
<p>So I thought I’d do a very brief review of the reported evidence for the two perceived crises. An important point is all of the scientists below have made the best effort they can to tackle a fairly complicated problem and this is early days in the study of science-wise false discovery rates. But the take home message is that there is currently no definitive evidence one way or another about whether most results are false.</p>
<ol>
<li><strong>Paper:</strong> <a href="http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124">Why most published research findings are false</a>. <strong>Main idea: </strong>People use hypothesis testing to determine if specific scientific discoveries are significant. This significance calculation is used as a screening mechanism in the scientific literature. Under assumptions about the way people perform these tests and report them it is possible to construct a universe where most published findings are false positive results. <strong>Important drawback:</strong> The paper contains no real data, it is purely based on conjecture and simulation.</li>
<li><strong>Paper: </strong><a href="http://www.nature.com/nature/journal/v483/n7391/full/483531a.html">Drug development: Raise standards for preclinical research</a>. <strong>Main idea</strong><strong>: </strong>Many drugs fail when they move through the development process. Amgen scientists tried to replicate 53 high-profile basic research findings in cancer and could only replicate 6. <strong>Important drawback:</strong> This is not a scientific paper. The study design, replication attempts, selected studies, and the statistical methods to define “replicate” are not defined. No data is available or provided.</li>
<li><strong>Paper:</strong> <a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt007.abstract">An estimate of the science-wise false discovery rate and application to the top medical literature</a>. <strong>Main idea:</strong> The paper collects P-values from published abstracts of papers in the medical literature and uses a statistical method to estimate the false discovery rate proposed in paper 1 above. <strong>Important drawback:</strong> The paper only collected data from major medical journals and the abstracts. P-values can be manipulated in many ways that could call into question the statistical results in the paper.</li>
<li><strong>Paper:</strong> <a href="http://www.pnas.org/content/early/2013/10/28/1313476110.abstract">Revised standards for statistical evidence</a>. <strong>Main idea: </strong>The P-value cutoff of 0.05 is used by many journals to determine statistical significance. This paper proposes an alternative method for screening hypotheses based on Bayes factors. <strong>Important drawback</strong>: The paper is a theoretical and philosophical argument for simple hypothesis tests. The data analysis recalculates Bayes factors for reported t-statistics and plots the Bayes factor versus the t-test then makes an argument for why one is better than the other.</li>
<li><strong>Paper: </strong><a href="http://jama.jamanetwork.com/article.aspx?articleid=201218">Contradicted and initially stronger effects in highly cited research</a>. <strong>Main idea: </strong>This paper looks at studies that attempted to answer the same scientific question where the second study had a larger sample size or more robust (e.g. randomized trial) study design. Some effects reported in the second study do not match the results exactly from the first. <strong>Important drawback: </strong>The title does not match the results. 16% of studies were contradicted (meaning effect in a different direction). 16% reported smaller effect size, 44% were replicated and 24% were unchallenged. So 44% + 24% + 16% = 84% were not contradicted. <a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt038.full">Lack of replication is also not proof of error</a>.</li>
<li><strong>Paper</strong><strong>: </strong><a href="http://www.nature.com/nature/journal/vaop/ncurrent/full/nature12786.html">Modeling the effects of subjective and objective decision making in scientific peer review</a>. <strong>Main idea:</strong> This paper considers a theoretical model for how referees of scientific papers may behave socially. They use simulations to point out how an effect called “herding” (basically peer-mimicking) may lead to biases in the review process. <strong>Important drawback:</strong> The model makes major simplifying assumptions about human behavior and supports these conclusions entirely with simulation. No data is presented.</li>
<li><strong>Paper: </strong><a href="http://www.nature.com/ng/journal/v41/n2/abs/ng.295.html">Repeatability of published microarray gene expression analyses</a>. <strong>Main idea: </strong>This paper attempts to collect the data used in published papers and to repeat one randomly selected analysis from the paper. For many of the papers the data was either not available or available in a format that made it difficult/impossible to repeat the analysis performed in the original paper. The types of software used were also not clear. <strong>Important drawback</strong><strong>: </strong>This paper was written about 18 data sets in 2005-2006. This is both early in the era of reproducibility and not comprehensive in any way. This says nothing about the rate of false discoveries in the medical literature but does speak to the reproducibility of genomics experiments 10 years ago.</li>
<li><a href="https://osf.io/wx7ck/"><strong>Paper: </strong>Investigating variation in replicability: The “Many Labs” replication project.</a> (not yet published) <strong>Main idea</strong><strong>: </strong>The idea is to take a bunch of published high-profile results and try to get multiple labs to replicate the results. They successfully replicated 10 out of 13 results and the distribution of results you see is about what you’d expect (see embedded figure below). <strong>Important drawback:</strong> The paper isn’t published yet and it only covers 13 experiments. That being said, this is by far the strongest, most comprehensive, and most reproducible analysis of replication among all the papers surveyed here.</li>
</ol>
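<p>The “universe where most published findings are false” from the first paper on the list can be illustrated with a short calculation. This sketch uses the basic relationship between pre-study odds, power, and the significance threshold, ignoring the bias and multiple-team terms that the paper also discusses. The specific input values are arbitrary choices for illustration, not estimates from any data.</p>

```python
def ppv(prior_odds, power, alpha):
    """Share of significant findings that are true.

    prior_odds: ratio of true to null relationships being tested.
    power: probability a true relationship reaches significance.
    alpha: probability a null relationship reaches significance.
    """
    true_positives = power * prior_odds  # per null relationship tested
    false_positives = alpha
    return true_positives / (true_positives + false_positives)


# Illustrative (made-up) scenarios.
long_shots = ppv(prior_odds=0.1, power=0.2, alpha=0.05)
well_powered = ppv(prior_odds=1.0, power=0.8, alpha=0.05)
print("underpowered tests of unlikely hypotheses:", round(long_shots, 3))
print("well-powered tests of plausible hypotheses:", round(well_powered, 3))
```

<p>Whether most findings come out false in this calculation depends entirely on the assumed inputs, which is why the lack of real data in that paper matters.</p>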
<p>I do think that the reviewed papers are important contributions because they draw attention to real concerns about the modern scientific process. Namely</p>
<ul>
<li>We need more statistical literacy</li>
<li>We need more computational literacy</li>
<li>We need to require code be published</li>
<li>We need mechanisms of peer review that deal with code</li>
<li>We need a culture that doesn’t use reproducibility as a weapon</li>
<li>We need increased transparency in review and evaluation of papers</li>
</ul>
<p>Some of these have simple fixes (more statistics courses, publishing code); others are much, much harder (changing publication/review culture).</p>
<p>The Many Labs project (Paper 8) points out that statistical research is proceeding in a fairly reasonable fashion. Some effects are overestimated in individual studies, some are underestimated, and some are just about right. Regardless, no single study should stand alone as the last word about an important scientific issue. It obviously won’t be possible to replicate every study as intensely as those in the Many Labs project, but this is a reassuring piece of evidence that things aren’t as bad as some paper titles and headlines may make it seem.</p>
<div style="width: 379px" class="wp-caption aligncenter">
<img alt="" src="http://2.bp.blogspot.com/-iEeV4FlwKsE/UpdaFts3fzI/AAAAAAAAAqw/OJjvoXG2e6g/s1600/Picture+73.png" width="369" height="244" />
<p class="wp-caption-text">
Many labs data. Blue x's are original effect sizes. Other dots are effect sizes from replication experiments (http://rolfzwaan.blogspot.com/2013/11/what-can-we-learn-from-many-labs.html)
</p>
</div>
<p>The Many Labs results suggest that the hype about the failures of science is, at the very least, premature. I think an equally important idea is that science has pretty much always worked with some number of false positive and irreplicable studies. This was beautifully described by Jared Horvath in this <a href="http://blogs.scientificamerican.com/guest-blog/2013/12/04/the-replication-myth-shedding-light-on-one-of-sciences-dirty-little-secrets/">blog post from Scientific American</a>. I think the take-home message is that regardless of the rate of false discoveries, the scientific process has led to amazing and life-altering discoveries.</p>
Sunday data/statistics link roundup (12/15/13)
2013-12-15T23:02:51+00:00
http://simplystats.github.io/2013/12/15/sunday-datastatistics-link-roundup-121513
<ol>
<li>Rafa (in Spanish) <a href="http://www.80grados.net/desmitificando-los-gmos/">clarifying some of the problems</a> with the anti-GMO crowd.</li>
<li>Joe Blitzstein, most recently of <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">#futureofstats</a> fame, talks up data science in the <a href="http://www.thecrimson.com/article/2013/12/11/big-data-joe-blitzstein/">Harvard Crimson</a> (via Rafa). As <a href="http://simplystatistics.org/2012/10/19/interview-with-rebecca-nugent-of-carnegie-mellon/">Rebecca Nugent</a> pointed out when she stopped to visit us, class sizes in undergrad stats programs are blowing up!</li>
<li>If you missed it, Michael Eisen dropped by to chat about open access (<a href="http://simplystatistics.org/2013/12/12/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science/">part 1</a>/<a href="http://simplystatistics.org/2013/12/13/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science-part-22/">part 2</a>). We talked about Randy Schekman, a recent Nobel prize winner who says he <a href="http://www.theguardian.com/commentisfree/2013/dec/09/how-journals-nature-science-cell-damage-science">isn’t publishing in Nature/Science/Cell anymore</a>. Professor Schekman did a Reddit AMA where <a href="http://www.reddit.com/r/IAmA/comments/1sq4vd/im_randy_schekman_corecipient_of_the_2013_nobel/">he got grilled pretty hard</a> about pushing a glamour open access journal eLife, while dissing N/S/C, where he published a lot of stuff before winning the Nobel.</li>
<li>The article I received most often over the last couple of weeks <a href="http://www.theguardian.com/science/2013/dec/06/peter-higgs-boson-academic-system">is this one</a>. In it, Peter Higgs says that in the modern publish-or-perish academic system he wouldn’t have had time to think deeply enough to perform the research that led to the Boson discovery. But he got the prize, at least in part, because of the people who conceived/built/tested the theory in the Large Hadron Collider. I’m much more inclined to believe someone would have come up with the Boson theory in our current system than someone would have built the LHC in a system without competitive pressure.</li>
<li>I think <a href="http://www.biasedtransmission.org/2013/12/is-the-obesity-paradox-for-diabetes-simply-bad-statistics.html">this post</a> raises some interesting questions about the <a href="http://jama.jamanetwork.com/article.aspx?articleid=1309174">Obesity Paradox</a> that says overweight people with diabetes may have lower risk of death than normal weight people. The analysis is obviously tongue-in-cheek, but I’d be interested to hear what other people think about whether it is a serious issue or not.</li>
</ol>
Simply Statistics Interview with Michael Eisen, Co-Founder of the Public Library of Science (Part 2/2)
2013-12-13T09:02:08+00:00
http://simplystats.github.io/2013/12/13/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science-part-22
<p>Here is Part 2 of Jeff’s and my interview with Michael Eisen, Co-Founder of the Public Library of Science.</p>
The key word in "Data Science" is not Data, it is Science
2013-12-12T15:24:42+00:00
http://simplystats.github.io/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science
<p>One of my colleagues was just at a conference where they saw a presentation about using data to solve a problem where data had previously not been abundant. The speaker claimed the data were “big data” and a question from the audience was: “Well, that isn’t really big data is it, it is only X Gigabytes”.</p>
<p>While that exact question would elicit groans from most people who work with data, I think it highlights one of the key problems with the thinking around data science. Most people hyping data science have focused on the first word: data. They care about volume and velocity and whatever other buzzwords describe data that is too big for you to analyze in Excel. This hype about the size (relative or absolute) of the data being collected fed into the second category of hype - hype about tools. People threw around EC2, Hadoop, Pig, and had huge debates about Python versus R.</p>
<p>But the key word in data science is not “data”; it is “science”. Data science is only useful when the data are used to answer a question. That is the science part of the equation. The problem with this view of data science is that it is much harder than the view that focuses on data size or tools. It is much, much easier to calculate the size of a data set and say “My data are bigger than yours” or to say, “I can code in Hadoop, can you?” than to say, “I have this really hard question, can I answer it with my data?”.</p>
<p>A few reasons it is harder to focus on the science than the data/tools are:</p>
<ol>
<li><span style="line-height: 16px;">John Tukey’s quote: “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” You may have 100 GB of data of which <a href="http://simplystatistics.org/2013/09/23/the-limiting-reagent-for-big-data-is-often-small-well-curated-data/">only 3 KB are useful</a> for answering the real question you care about. </span></li>
<li>When you start with the question you often discover that you need to collect new data or design an experiment to confirm you are getting the right answer.</li>
<li>It is easy to discover “structure” or “networks” in a data set. There will always be correlations for a thousand reasons if you collect enough data. Understanding whether these correlations matter for specific, interesting questions is much harder.</li>
<li>Often the structure you found on the first pass is due to a phenomenon (measurement error, artifacts, data processing) that doesn’t answer an interesting question.</li>
</ol>
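<p>As a rough illustration of the third point (a sketch, not part of the original argument): the Python snippet below generates variables that are pure noise and counts how many pairs are nonetheless noticeably correlated. The number of variables, the sample size, the 0.4 cutoff, and the helper names <code>pearson</code> and <code>count_spurious</code> are all arbitrary choices made for illustration.</p>

```python
import random
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def count_spurious(n_vars=50, n_obs=30, threshold=0.4, seed=1):
    """Count pairs of independent noise variables with |r| above threshold."""
    rng = random.Random(seed)
    # Every variable is independent standard normal noise: no real structure.
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(combinations(range(n_vars), 2))
    hits = sum(
        1 for i, j in pairs if abs(pearson(data[i], data[j])) > threshold
    )
    return hits, len(pairs)


hits, total = count_spurious()
print(f"{hits} of {total} pairs of pure-noise variables have |r| > 0.4")
```

<p>Every “hit” here is an accident of sampling, since the variables are independent by construction. Deciding whether a correlation found in real data is any more meaningful than these is the hard part.</p>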
<p>The issue is that the hype around big data/data science will flame out (it already is) if data science is only about “data” and not about “science”. The long term impact of data science will be measured by the scientific questions we can answer with the data.</p>
Simply Statistics Interview with Michael Eisen, Co-Founder of the Public Library of Science (Part 1/2)
2013-12-12T11:42:02+00:00
http://simplystats.github.io/2013/12/12/simply-statistics-interview-with-michael-eisen-co-founder-of-the-public-library-of-science
<p>Jeff and I had a chance to interview <a href="http://www.michaeleisen.org/blog/">Michael Eisen</a>, a co-founder of the <a href="http://plos.org">Public Library of Science</a>, HHMI Investigator, and a Professor at UC Berkeley. We talked with him about publishing in open access and how young investigators might publish in open access journals under the current system. Watch part 1 of the interview above.</p>
Are MOOC's fundamentally flawed? Or is it a problem with statistical literacy?
2013-12-11T13:58:56+00:00
http://simplystats.github.io/2013/12/11/are-moocs-fundamentally-flawed-or-is-it-a-problem-with-statistical-literacy
<p>People know I have taught a MOOC on Data Analysis, so I frequently get emails about updates on the “state of MOOCs”. It definitely feels like the wild west of education is happening right now. If you make an analogy to air travel, I would say we are about here:</p>
<p style="text-align: center;">
<a href="http://charliekennedy.files.wordpress.com/2009/12/1217-wright-bros-1903.jpg"><img class="aligncenter" alt="" src="http://charliekennedy.files.wordpress.com/2009/12/1217-wright-bros-1903.jpg" width="422" height="263" /></a>
</p>
<p style="text-align: center;">
<p style="text-align: left;">
So of course I feel like it is a bit premature for quotes like this:
</p>
<blockquote>
<p style="text-align: left;">
Two years after a Stanford professor drew 160,000 students from around the globe to a free online course on artificial intelligence, starting what was widely viewed as a revolution in higher education, early results for such large-scale courses are disappointing, forcing a rethinking of how college instruction can best use the Internet.
</p>
</blockquote>
<p>
These headlines are being driven in large part by Sebastian Thrun, the founder of Udacity, which has had some trouble with their business model. One reason is that they seem to have had the most trouble with luring instructors from the top schools to their platform.
</p>
<p>
But the main reason that gets cited for the "failure" of MOOCs is <a href="http://www.sjsu.edu/chemistry/People/Faculty/Collins_Research_Page/AOLE%20Report%20-September%2010%202013%20final.pdf">this experiment</a> performed at San Jose State. I previously pointed out one major flaw with the study design: <a href="http://simplystatistics.org/2013/07/19/the-failure-of-moocs-and-the-ecological-fallacy/">that the students in the two comparison groups were not comparable</a>.
</p>
<p>
Here are a few choice quotes from the study:
</p>
<p>
<strong>Poor response rate:</strong>
</p>
<blockquote>
<p>
While a major effort was made to increase participation in the survey research within this population, the result was disappointing (response rates of 32% for Survey 1; 34% for Survey 2, and 32% for Survey 3).
</p>
</blockquote>
<p>
<strong>Not a representative sample:</strong>
</p>
<blockquote>
<p>
The research team compared the survey participants to the entire student population and found significant differences. Most importantly, students who succeeded are over-represented among the survey respondents.
</p>
</blockquote>
<p>
<strong>Difficulties with data collection/processing:</strong>
</p>
<blockquote>
<p>
While most of the data were provided by the end of the Spring 2013 semester, clarifications, corrections and data transformations had to be made for many weeks thereafter, including resolving accuracy questions that arose once the analysis of the Udacity platform data began
</p>
</blockquote>
<p>
These quotes alone point to an incredibly suspect study, although that is not the fault of the researchers in question. They were working with the data the best they could, but the study design and data are deeply flawed. The most egregious flaw, of course, is the difference in populations between the students who matriculated and those who didn't (Tables 1-4 show the <em>dramatic</em> differences in population).
</p>
<p>
My take-home message is that if this study were submitted to a journal it would be seriously questioned on both scientific and statistical grounds. Before we rush to claim that the whole idea of MOOCs is flawed, I think we should wait until more thorough, larger, and well-designed studies are performed.
</p>
</p>
NYC crime rates by year/commissioner
2013-12-05T13:54:00+00:00
http://simplystats.github.io/2013/12/05/nyc-crime-rates-by-yearcommissioner
<p>NYC mayor-elect Bill de Blasio <a href="http://www.nytimes.com/2013/12/06/nyregion/de-blasio-to-name-bratton-as-new-york-police-commissioner.html?smid=fb-nytimes&WT.z_sma=NY_DBT_20131205&bicmp=AD&bicmlukp=WT.mc_id&bicmst=%20201385874000000&bicmet=%20201388638800000">is expected</a> to name William J. Bratton to lead the NYPD. Bratton has been commissioner before (1994-1996) so I was curious to see the crime rates during his tenure, which fell within the period that saw an impressive drop (1990-2010). Here is the graph of violent crimes per 100,000 inhabitants for <del>NYC</del> NY state for the years 1965-2012 (divided by commissioner). Will Bratton be able to continue the trend? The graph suggests to me that they have hit a “floor” (1960s levels!).</p>
<p><a href="http://simplystatistics.org/2013/12/05/nyc-crime-rates-by-yearcommissioner/nycrimes/" rel="attachment wp-att-2278"><img class="alignnone size-full wp-image-2278" alt="nycrimes" src="http://simplystatistics.org/wp-content/uploads/2013/12/nycrimes.png" width="1036" height="492" srcset="http://simplystatistics.org/wp-content/uploads/2013/12/nycrimes-300x142.png 300w, http://simplystatistics.org/wp-content/uploads/2013/12/nycrimes-1024x486.png 1024w, http://simplystatistics.org/wp-content/uploads/2013/12/nycrimes.png 1036w" sizes="(max-width: 1036px) 100vw, 1036px" /></a></p>
<p>Data is <a href="http://www.disastercenter.com/crime/nycrime.htm">here</a>.</p>
Advice for students on the academic job market
2013-12-04T10:00:26+00:00
http://simplystats.github.io/2013/12/04/advice-for-stats-students-on-the-academic-job-market-2
<p><em>Editor’s note: This is a slightly modified version of a previous post.</em></p>
<p>Job hunting season is upon us. Openings are already being posted <a href="http://www.stat.ufl.edu/vlib/Index.html" target="_blank">here</a>, <a href="http://www.stat.washington.edu/jobs/" target="_blank">here</a>, and <a href="http://jobs.amstat.org/" target="_blank">here</a>. So you should have your CV, research statement, and web page ready. I highly recommend having a web page. It doesn’t have to be fancy. <a href="http://jkp-mac1.uchicago.edu/~pickrell/Site/Home.html" target="_blank">Here</a>, <a href="http://www.biostat.jhsph.edu/~khansen/" target="_blank">here</a>, and <a href="http://www.biostat.jhsph.edu/~jleek/research.html" target="_blank">here</a> are some good ones ranging from simple to a bit over the top. Minimum requirements are a list of publications and a link to a CV. If you have written software, link to that as well.</p>
<p>The earlier you submit the better. Don’t wait for your letters. Keep in mind two things: 1) departments have a limit on how many people they can invite and 2) admissions committee members get tired after reading 200+ CVs.</p>
<p>If you are seeking an academic job your CV should focus on the following: PhD granting institution, advisor (including postdoc advisor if you have one), and papers. Be careful not to drown out these most important features with superfluous entries. For papers, include three sections: 1-published, 2-under review, and 3-in preparation. For 2, include the journal names and if possible have tech reports available on your web page. For 3, be ready to give updates during the interview. If you have papers for which you are co-first author be sure to highlight that fact somehow.</p>
<p>So what are the different types of jobs? Before listing the options I should explain the concept of hard versus soft money. Revenue in academia comes from tuition (in public schools the state kicks in some extra $), external funding (e.g. NIH grants), services (e.g. patient care), and philanthropy (endowment). The money that comes from tuition, services, and philanthropy is referred to as hard money. Within an institution, roughly the same amount is available every year and the way it’s split among departments rarely changes. When it does, it’s because your chair has either lost or won a long hard-fought zero-sum battle. Research money, referred to as soft money, comes from NIH, NSF, DoD, etc., and one has to write grants to <em>raise</em> funding (which pays part or all of your salary). These days about 10% of grant applications are funded, so it is certainly not guaranteed. Although at the institution level the law of large numbers kicks in, at the individual level it certainly doesn’t. Note that the breakdown of revenue varies widely from institution to institution. Liberal arts colleges are almost 100% hard money while research institutes are almost 100% soft money.</p>
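<p>To make the law-of-large-numbers remark concrete, here is a back-of-the-envelope sketch in Python. It takes the roughly 10% funding rate quoted above and assumes, purely for illustration, that applications are funded independently with the same probability; real funding decisions are neither independent nor equally likely. The function names and the counts of 3 and 300 applications are hypothetical.</p>

```python
from math import sqrt

FUNDING_RATE = 0.10  # the "about 10%" figure quoted in the text


def prob_at_least_one(n_apps, p=FUNDING_RATE):
    """Chance that at least one of n_apps independent applications is funded."""
    return 1 - (1 - p) ** n_apps


def relative_sd(n_apps, p=FUNDING_RATE):
    """Standard deviation of the number funded, divided by its mean (binomial)."""
    return sqrt((1 - p) / (p * n_apps))


# Hypothetical scales: one investigator (3 applications) vs. an institution (300).
for n_apps in (3, 300):
    print(
        f"{n_apps:>3} applications: "
        f"P(at least one funded) = {prob_at_least_one(n_apps):.3f}, "
        f"relative sd of number funded = {relative_sd(n_apps):.2f}"
    )
```

<p>Under these assumptions the variability relative to the expected number of funded grants shrinks like one over the square root of the number of applications, which is why an institution can budget around soft money much more comfortably than any single investigator can.</p>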
<p>So to simplify, your salary will come from teaching (tuition) and research (grants). The percentages will vary depending on the department. Here are five types of jobs:</p>
<p>1) Soft money university positions: examples are Hopkins and Harvard Biostat. A typical breakdown is 75% soft/25% hard. To earn the hard money you will have to teach, but not that much. In my dept we teach 48 classroom hours a year (equivalent to one one-semester class). To earn the soft money you have to write, and eventually get, grants. As a statistician you don’t necessarily have to write your own grants, you can partner up with other scientists that need help with their data. And there are many! Salaries are typically higher in these positions. Stress levels are also higher given the uncertainty of funding. I personally like this as it keeps me motivated, focused, and forces me to work on problems important enough to receive NIH funding.</p>
<p>1a) Some schools of medicine have Biostatistics units that are 100% soft money. One does not have to teach, but, unless you have a joint appointment, you won’t have access to grad students. Still these are tenure track jobs. Although at 100% soft what does tenure mean? I should mention that at MD Anderson, one only needs to raise 50% of one’s salary and the other 50% is earned via service (statistical consulting to the institution). I imagine there are other places like this, as well as institutions that use endowments to provide some hard money.</p>
<p>2) Hard money positions: examples are Berkeley and Stanford Stat. A typical breakdown is 75% hard/25% soft. You get paid a 9 month salary. If you want to get paid in the summer and pay students, you need a grant. Here you typically teach two classes a semester but many places let you “buy out” of teaching if you can get grants to pay your salary. Some tension exists when chairs decide who teaches the big undergrad courses (lots of grunt work) and who teaches the small seminar classes where you talk about your own work.</p>
<p>2a) Hard money positions: Liberal arts colleges will cover as much as 100% of your salary from tuition. As a result, you are expected to teach much more. Most liberal arts colleges weigh teaching as much as (or more than) research during promotion although there is a trend towards weighing research more.</p>
<p>3) Research associate positions: examples are jobs in schools of medicine in departments other than Stat/Biostat. These positions are typically 100% soft and are created because someone at the institution has a grant to pay for you. These are usually not tenure track positions and you rarely have to teach. You also have less independence since you have to work on the grant that funds you.</p>
<p>4) Industry: typically 100% hard. There are plenty of for-profit companies where one can have a fruitful research career. AT&amp;T, Google, IBM, Microsoft, and Genentech are all examples of companies with great research groups. Note that S, the language that R is based on, was born in Bell Labs. And one of the co-creators of R now does his research at Genentech. Salaries are typically higher in industry and <a href="http://gawker.com/375460/facebook-hires-away-googles-top-chef">cafeteria food</a> can be quite awesome. The drawbacks are no access to students and lack of independence (although not always).</p>
<p>5) Government jobs: The FDA and NIH are examples of agencies that have research positions. The NCI’s Biometric Research Branch is an example. I would classify these as 100% hard. But they differ from other hard money places in that you have to justify your budget every so often. Service, collaborative research, and independent research are all expected. A drawback is that you don’t have access to students although you can get joint appointments. Hopkins Biostat has a couple of NCI researchers with joint appointments.</p>
<p>Ok, that is it for now. Later this month we will blog about job interviews.</p>
On the future of the textbook
2013-12-03T13:11:42+00:00
http://simplystats.github.io/2013/12/03/on-the-future-of-the-textbook
<p>The latest issue of <a href="http://escholarship.org/uc/uclastat_cts_tise">Technological Innovations in Statistics Education</a> is focused on the future of the textbook. Editor Rob Gould has put together an interesting list of contributions as well as discussions from the leaders in the field of statistics education. Articles include</p>
<ul>
<li><span style="line-height: 16px;"><a href="http://escholarship.org/uc/item/12q2z58x">The Course as Textbook: A Symbiotic Relationship in the Introductory Statistics Class</a> by Zieffler, Isaak, and Garfield<br /> </span></li>
<li><a href="http://escholarship.org/uc/item/6ms0x5nf">OpenIntro Statistics: an Open-source Textbook</a> by Cetinkaya-Rundel, Diez, and Barr</li>
<li><a href="http://escholarship.org/uc/item/8mv5b3zt">Textbooks 2.0</a> by Webster West</li>
</ul>
<p>Go check it out!</p>
Academics should not feel guilty for maximizing their potential by leaving their homeland
2013-12-02T10:07:32+00:00
http://simplystats.github.io/2013/12/02/academics-should-not-feel-guilty-for-maximizing-their-potential-by-leaving-their-homeland
<p>In a New York Times op-ed titled <a href="http://www.nytimes.com/2013/11/30/opinion/migration-hurts-the-homeland.html">Migration Hurts the Homeland</a>, Paul Collier tells us that</p>
<blockquote>
<p dir="ltr">
What’s good for migrants from poor places is not always good for the countries they’re leaving behind.
</p>
</blockquote>
<p dir="ltr">
He makes the argument that those that favor open immigration don't realize that they are actually hurting "the poor" more than they are helping. This post is not about the issue of whether migration is bad for the homeland (I know of others that <a href="http://essential.metapress.com/content/cv2115573n315017/">make the opposite claim</a>) but rather about the opinions I have formed by leaving my <a href="http://www.caribbeanbusinesspr.com/news/census-pr-brain-drain-picking-up-80281.html">homeland</a> to become an academic in a US research university.
</p>
<p dir="ltr">
Let me start by pointing out that an outstanding <a href="http://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country">470 Nobel prizes</a> have been handed out to residents of the US or the UK. About 25% of these went to immigrants. These Nobel laureates include academics born in Egypt, Venezuela, and Mexico. In contrast, only one of the 20 prizes handed to Italy went to an immigrant (none in the last 50 years). I view my university as international, not American.
</p>
<p dir="ltr">
Throughout my career I have encountered several foreign graduate students/postdocs who ponder passing on academic jobs in the US to go back and help the homeland. I was one of them and I admire the commitment of those who decide to go back. However, I think it's important to point out that the accomplishments of those who take jobs in American research universities are in large part due to the unique support that these universities provide. This is particularly true in the sciences, where research success depends on low teaching loads, lab infrastructure, high-performance computers, administrative support for grant submission, and talented collaborators.
</p>
<p dir="ltr">
The latter is by far the most important for applied statisticians like myself who depend on subject matter experts that provide quantitative challenges. Having a critical mass of such innovators is key. Although I will never know for sure, I am quite certain that most of what I have accomplished would not have happened had I returned home.
</p>
<p dir="ltr">
It is also important to point out that my homeland benefits from what I have learned during 15 years working in top research universities. I am always looking for an excuse to visit my friends and family and I also enjoy giving back to my <a href="http://www.uprrp.edu/">alma mater</a>. This has greatly increased my interactions through workshops, academic talks, participation in advisory boards, and many other informal exchanges.
</p>
<p dir="ltr">
So, if you are an up-and-coming academic deciding if you should go back or not, do not let guilt factor into the decision. Humanity benefits from you maximizing your potential. Your homeland will benefit in indirect ways as well.
</p>
<p dir="ltr">
ps - Do <a href="http://www.biostat.jhsph.edu/~jleek/">people from Idaho</a> feel guilty for leaving their <a href="http://www.hcn.org/blogs/range/western-brain-drain">brain-drained state</a>?
</p>
<p dir="ltr">
</p>
Sunday data/statistics link roundup (12/2/13)
2013-12-01T13:26:09+00:00
http://simplystats.github.io/2013/12/01/sunday-datastatistics-link-roundup-12213
<ol>
<li><span style="line-height: 16px;">I’m in Australia for <a href="http://www.maths.adelaide.edu.au/biosummer2013/index.html">Bioinfo Summer 2013</a>! First time in Australia and excited about the great lineup of speakers and to meet a bunch of people at the University of Adelaide. </span></li>
<li><a href="http://www.bostonglobe.com/business/2013/11/26/computer-science-course-breaks-stereotypes-and-fills-halls-harvard/7XAXko7O392DiO1nAhp7dL/story.html">An interesting post</a> about how CS has become the de facto language of our times. They specifically talk about <a href="https://cs50.harvard.edu/">CS50</a> at Harvard. I think in terms of being an informed citizen CS and Statistics are quickly being added to Reading, Writing, and Arithmetic as the required baseline knowledge (link via Alex N.)</li>
<li><a href="http://gking.harvard.edu/files/gking/files/psc47_1-1300153-king-2.pdf">A long but fascinating</a> read by Gary King about restructuring the social sciences with a focus on ending the quantitative/qualitative divide. I think a similar restructuring has been going on in biology for a while. It is nearly impossible to be a modern molecular biologist without at least some basic training in statistics. Similarly statisticians are experiencing an inverted revolution where we are refocusing on applications and some basic scientific experience is becoming a required component of being a statistician (link via Rafa).</li>
<li><a href="http://www.rochester.edu/rocdata/recruit/interdisciplinary.html">This is how you make a splash in data science</a>. Rochester is hiring 20! faculty across multiple disciplines. It will be interesting to see how that works out (link via Rafa). This goes along with the recent announcement of the Moore foundation funding <a href="http://www.moore.org/programs/science/data-driven-discovery/data-science-environments">Berkeley, UW, and NYU to build data science cultures/environments</a>.</li>
<li><a href="http://www.nature.com/news/plos-profits-prompt-revamp-1.14205">PLoS is rich and they have to figure out what to do</a>! They are a non-profit, but their journal PLoS One publishes about 30k papers a year at about 1k a pop. That is some serious money, which they need to figure out how to spend pronto. My main suggestion: fund research to figure out a way to put peer reviewing on the same level as publishing in terms of academic credit (link via Simina B.)</li>
<li>A group of psychologists got together and performed replication experiments for 13 major effects. <a href="https://openscienceframework.org/project/WX7Ck/">They replicated 11/13</a> (of course depending on your definition of replication). Hopefully these results are a good first step toward reducing the mania around the “replication crisis” and refocusing attention back on real solutions.</li>
</ol>
Statistical zealots
2013-11-26T10:51:21+00:00
http://simplystats.github.io/2013/11/26/statistical-zealots
<p>Yesterday <a href="https://github.com/jtleek/datasharing">my data sharing policy</a> went a little bit viral. It hit the front page of Hacker News and was a trending repo on Github. I was <a href="https://news.ycombinator.com/item?id=6793291">reading the comments on Hacker News</a> and came across this gem:</p>
<blockquote>
<p>So, while I can imagine there are good Frequentists Statisticians out there, I insist that frequentism itself is bogus.</p>
</blockquote>
<p>This is the extension of a long standing debate about the relative merits of <a href="http://en.wikipedia.org/wiki/Frequentist_inference">frequentist</a> and <a href="http://en.wikipedia.org/wiki/Bayesian_inference">Bayesian</a> statistical methods. It is interesting that I largely only see one side of the debate being played out these days. The Bayesian zealots have it in for the frequentists in a big way. The Hacker News comments are one example, but <a href="http://www.johnmyleswhite.com/notebook/2012/05/10/criticism-1-of-nhst-good-tools-for-individual-researchers-are-not-good-tools-for-research-communities/">here</a> <a href="http://www.entsophy.net/blog/?p=200">are</a> a <a href="http://wmbriggs.com/blog/?p=5062">few</a> <a href="http://www.nature.com/news/weak-statistical-standards-implicated-in-scientific-irreproducibility-1.14131">more</a>. Interestingly, even the “popular geek press” is getting in the game.</p>
<p style="text-align: center;">
<img class="aligncenter" alt="" src="http://imgs.xkcd.com/comics/frequentists_vs_bayesians.png" width="281" height="425" />
</p>
<p style="text-align: left;">
I think it probably deserves a longer post but here are my thoughts on statistical zealotry:
</p>
<ol>
<li><span style="line-height: 16px;">User effect &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Philosophy effect. The person doing the statistics probably matters more than the statistical philosophy. I would rather have Andrew Gelman analyze my data than a lot of frequentists. Similarly, I’d rather have John Storey analyze my data than a lot of Bayesians. </span></li>
<li>I agree with Noahpinion that this is likely more a <a href="http://noahpinionblog.blogspot.com/2013/01/bayesian-vs-frequentist-is-there-any.html">philosophy battle</a> than a real practical applications battle.</li>
<li>I like <a href="http://arxiv.org/pdf/1106.2895v2.pdf">Rob Kass’s idea</a> that we should move away from frequentist vs. Bayesian to pragmatism. I think most real applied statisticians have already done this, if for no other reason than being pragmatic helps you get things done.</li>
<li><a href="http://www.pnas.org/content/early/2013/10/28/1313476110">Papers like this one</a> that claim total victory for one side or the other all have one thing in common: they rarely use real data to verify their claims. The real world is messy and one approach never wins all the time.</li>
</ol>
<p>My final thought on this matter is: never trust people with an agenda bearing extreme counterexamples.</p>
Simply Statistics interview with Daphne Koller, Co-Founder of Coursera
2013-11-22T10:13:49+00:00
http://simplystats.github.io/2013/11/22/simply-statistics-interview-with-daphnekoller-co-founder-of-coursera
<p>Jeff and I had an opportunity to sit down with Daphne Koller, Co-Founder of <a href="http://coursera.org">Coursera</a> and Rajeev Motwani Professor of Computer Science at Stanford University. Jeff and I both teach massive open online courses using the Coursera platform and it was great to be able to talk with Professor Koller about the changing nature of education today.</p>
<p>Some highlights:</p>
<ul>
<li>[1:35] <strong>On the origins of Coursera</strong>: “I actually came to that realization when listening to talk about YouTube, and realizing that, why does it make sense for me to come and deliver the same lecture year after year after year where I could package it in much smaller bite size chunks that were much more fun and much more cohesive and then use the class time for engaging with students in more meaningful ways.”</li>
<li>[7:22] <strong>On the role of MOOCs in academia</strong>: “Sometimes I have these discussions with some people in academic institutions who say that they feel that by engaging, for example, with MOOCs or blogs or social media they are diverting energy from what is their primary function which is teaching of their registered students…. But I think for most academic institutions, if I had to say what the primary function of an academic institution is, it’s the creation and dissemination of knowledge…. The only way society is going to move forward is if more people are better educated.”</li>
<li>[10:15] <strong>On teaching</strong>: “I think that teaching is a scholarly work as well, a kind of distillation of knowledge that has to occur in order to put together a really great course.”</li>
<li>[11:19] <strong>On teaching to the world</strong>: “Teaching, and quality of teaching, that used to be something that you could hide away from everyone…here, we’re suddenly in a world where teaching is really visible to everyone, and as a consequence, good teaching is going to be visible as a role model.”</li>
<li>[13:33] <strong>On work/life balance</strong>: “It’s been insane. It’s also been somewhat surreal…. Sometimes I look at my life and I’m saying really, I mean, whose life is this?”</li>
</ul>
<iframe width="560" height="315" src="https://www.youtube.com/embed/6Mx3_9fo_aE" frameborder="0" allowfullscreen=""></iframe>
You must be at least 20 years old for this job
2013-11-21T20:28:44+00:00
http://simplystats.github.io/2013/11/21/you-must-be-at-least-20-years-old-for-this-job
<p>The New York Times is recruiting a <a href="http://jobs.nytco.com/mobile/job/New-York-Chief-Data-Scientist-Job-NY-10001/27577700/?utm_content=buffer1c5a2&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer">chief data scientist</a>.</p>
Future of Statistics take home messages. #futureofstats
2013-11-21T10:20:25+00:00
http://simplystats.github.io/2013/11/21/future-of-statistics-take-home-messages-futureofstats
<p>A couple weeks ago we had the Future of Statistics Unconference. <a href="http://www.youtube.com/watch?v=Y4UJjzuYjfM">You can still watch it online here</a>. Rafa also attended the <a href="http://www.statistics2013.org/presentations-and-panelists/">Future of Statistical Sciences Workshop</a> and wrote a <a href="http://simplystatistics.org/2013/11/18/feeling-optimistic-after-the-future-of-the-statistical-sciences-workshop/">great summary which you can read here</a>.</p>
<p>I decided to write a summary of take home messages from our speakers at the Unconference. <a href="https://github.com/jtleek/futureofstats/blob/master/README.md">You can read it on Github here</a>. I put it on Github for two reasons:</p>
<ol>
<li>I agree with Hadley’s statement that the future of statistics is on Github.</li>
<li>I summarized them based on my interpretation and would love collaboration on the document. If you want to add your new thoughts/summaries, add a new section with your bullet pointed ideas and send me a pull request!</li>
</ol>
<p>I sent our speakers a gift for presenting in the Unconference (if you were a speaker and didn’t get yours, email me!). Hadley posted the front on Twitter. Here is the back:</p>
<p style="text-align: center;">
<a href="http://simplystatistics.org/2013/11/21/future-of-statistics-take-home-messages-futureofstats/2013-11-21-10-16-54/" rel="attachment wp-att-2221"><img class="alignnone size-medium wp-image-2221" alt="2013-11-21 10.16.54" src="http://simplystatistics.org/wp-content/uploads/2013/11/2013-11-21-10.16.54-300x225.jpg" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2013/11/2013-11-21-10.16.54-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2013/11/2013-11-21-10.16.54-1024x768.jpg 1024w" sizes="(max-width: 300px) 100vw, 300px" /></a>
</p>
<p style="text-align: left;">
P.S. Stay tuned for the future of Simply Statistics Unconferences.
</p>
Feeling optimistic after the Future of the Statistical Sciences Workshop
2013-11-18T10:24:17+00:00
http://simplystats.github.io/2013/11/18/feeling-optimistic-after-the-future-of-the-statistical-sciences-workshop
<p>Last week I participated in the <a href="http://www.statistics2013.org/presentations-and-panelists/">Future of the Statistical Sciences Workshop</a>. I arrived feeling somewhat pessimistic about the future of our discipline. My pessimism stemmed from the emergence of the term <em>Data Science</em> and the small role academic (bio)statistics departments are playing in the excitement and initiatives surrounding it. Data Science centers/departments/initiatives are popping up in universities without much interaction with (bio)statistics departments. Funding agencies, interested in supporting Data Science, are not always including academic statisticians in the decision making process.</p>
<p>About <a href="http://www.statistics2013.org/workshop-invited-participants/">100 participants,</a> including many of our discipline’s leaders, attended the workshop. It was organized into <a href="http://www.statistics2013.org/presentations-and-panelists/">sessions</a> with about a dozen talks; some were about the future, while others featured collaborations between statisticians and subject matter experts. The collaborative talks provided great examples of the best our field has to offer and the rest generated provocative discussions. In most of these discussions the disconnect between our discipline and Data Science was raised as cause for concern.</p>
<p>Some participants thought <em>Data Science</em> is just another fad like Data Mining was 10-20 years ago. I actually disagree because I view the recent increase in the number of fields that have suddenly become data-driven <a href="http://simplystatistics.org/2013/05/15/the-bright-future-of-applied-statistics/">as a historical discontinuity</a>. For example, we <a href="http://simplystatistics.org/2011/11/22/data-scientist-vs-statistician/">first posted about statistics versus data science</a> back in 2011.</p>
<p>At the workshop, Mike Jordan explained that the term was coined by industry for practical reasons: emerging companies needed a work force that could solve problems with data and statisticians were not fitting the bill. However, at the workshop there was consensus that our discipline needs a jolt to meet these new challenges. The take-away messages were all in line with ideas we have been promoting here in Simply Statistics (here is a good <a href="http://simplystatistics.org/2013/04/15/data-science-only-poses-a-threat-to-biostatistics-if-we-dont-adapt/">summary post from Jeff</a>):</p>
<ol>
<li>
<p>We need to engage in real present-day problems (<a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">problem forward, not solution backward</a>)</p>
</li>
<li>
<p>Computing should be a big part of our PhD curriculum (<a href="http://simplystatistics.org/2013/04/15/data-science-only-poses-a-threat-to-biostatistics-if-we-dont-adapt/">here are some suggestions</a>)</p>
</li>
<li>
<p>We need <a href="http://simplystatistics.org/2012/05/28/schlep-blindness-in-statistics/">to deliver solutions</a> (and stop whining about not being listened to); be more like engineers than mathematicians. (Here is a <a href="http://simplystatistics.org/2012/06/22/statistics-and-the-science-club/">related post by Roger</a>; in statistical genomics <a href="http://simplystatistics.org/2012/05/24/how-do-we-evaluate-statisticians-working-in-genomics/">this has been the de facto rule</a> for a while.)</p>
</li>
<li>
<p>We need to improve our communication skills (<a href="http://simplystatistics.org/2012/03/05/characteristics-of-my-favorite-statistics-talks/">in talks</a> or <a href="http://simplystatistics.org/2012/01/05/why-all-academics-should-have-professional-twitter/">on Twitter</a>)</p>
</li>
</ol>
<p>The fact that there was consensus on these four points gave me reason
to feel optimistic about our future.</p>
What should statistics do about massive open online courses?
2013-11-16T19:49:44+00:00
http://simplystats.github.io/2013/11/16/what-should-statistics-do-about-massive-open-online-courses
<p>Marie Davidian, the President of the American Statistical Association, writes about the <a href="http://magazine.amstat.org/blog/2013/11/01/prescolumnnov2013/">JHU Biostatistics effort to deliver massive open online courses</a>. She interviewed Jeff, Brian Caffo, and me and summarized our thoughts.</p>
<blockquote>
<p>All acknowledge that the future is unknown. How MOOCs will affect degree programs remains to be seen. Roger notes that the MOOCs he, Jeff, Brian, and others offer seem to attract many students who would likely not enter a degree program at Hopkins, regardless, so may be filling a niche that will not result in increased degree enrollments. But Brian notes that their MOOC involvement has brought extensive exposure to the Hopkins Department of Biostatistics—for many people the world over, Hopkins biostatistics is statistics.</p>
</blockquote>
What's the future of inference?
2013-11-15T09:56:27+00:00
http://simplystats.github.io/2013/11/15/whats-the-future-of-inference
<p>Rob Gould reports on what appears to have been an interesting <a href="http://citizen-statistician.org/2013/11/14/the-future-of-inference/">panel discussion on the future of statistics</a> hosted by the UCLA Statistics Department. The panelists were Songchun Zhu (UCLA Statistics), Susan Paddock (RAND Corp.), and Jan de Leeuw (UCLA Statistics).</p>
<p>He describes Jan’s thoughts on the future of inference in the field of statistics:</p>
<blockquote>
<p>Jan said that inference as an activity belongs in the substantive field that raised the problem. Statisticians should not do inference. Statisticians might, he said, design tools to help specialists have an easier time doing inference. But the inferential act itself requires intimate substantive knowledge, and so the statistician can assist, but not do.</p>
</blockquote>
<p>I found this comment to be thought provoking. First of all, it sounds exactly like something Jan would say, which makes me smile. In principle, I agree with the premise. In order to make a reasonable (or intelligible) inference you have to have some knowledge of the substantive field. I don’t think that’s too controversial. However, I think it’s incredibly short-sighted to conclude therefore that statisticians should not be engaged in inference. To me, it seems more logical that statisticians should go learn some science. After all, we keep telling the scientists to learn some statistics.</p>
<p>In my experience, it’s not so easy to draw a clean line between the person analyzing the data and the person drawing the inferences. It’s generally not possible to say to someone, “Hey, I just analyze the data, I don’t care about your science.” For starters, that tends to make for bad collaborations. But more importantly, that kind of attitude assumes that you can effectively analyze the data without any substantive knowledge. That you can just “crunch the numbers” and produce a useful product.</p>
<p>Ultimately, I can see how statisticians would want to stay away from the inference business. That part is hard, it’s controversial, it involves messy details about sampling, and opens one up to criticism. And statisticians love to criticize <em>other</em> people. Why would anyone want to get mixed up with that? This is why machine learning is so attractive–it’s all about prediction and in-sample learning.</p>
<p>However, I think I agree with <a href="http://www.biostat.washington.edu/~dwitten/">Daniela Witten</a>, who at our recent <a href="http://simplystatistics.org/unconference/">Unconference</a>, said that the future of statistics <em>is</em> inference. That’s where statisticians are going to earn their money.</p>
The Leek group guide to sharing data with a data analyst to speed collaboration
2013-11-14T11:16:36+00:00
http://simplystats.github.io/2013/11/14/the-leek-group-guide-to-sharing-data-with-a-statistician-to-speed-collaboration
<p>My group collaborates with many different scientists and the number one determinant of how fast we can turn around results is the status of the data we receive from our collaborators. If the data are well organized and all the important documentation is there, it dramatically speeds up the analysis time.</p>
<p>I recently had the experience where a postdoc requesting help with an analysis provided an amazing summary of the data she wanted analyzed. It has made me want to prioritize her analysis in my queue and it inspired me to write a how-to guide that will help scientific/business collaborators get speedier results from their statistician colleagues.</p>
<p>Here is the <a href="https://github.com/jtleek/datasharing">Leek group guide to sharing data with statisticians/data analysts</a>.</p>
<p>As usual I put it on Github because I’m sure this first draft will have mistakes or less than perfect ideas. I would love help in making the guide more comprehensive and useful. If you issue a pull request make sure you add yourself to the list of contributors at the end.</p>
Original source code for Apple II DOS
2013-11-13T08:21:24+00:00
http://simplystats.github.io/2013/11/13/original-source-code-for-apple-ii-dos
<p>Someone needs to put <a href="http://www.digibarn.com/collections/business-docs/apple-II-DOS/index.html">this</a> on GitHub right now.</p>
<blockquote>
<p>Thanks Paul Laughton for your donation of this superb collection of early to mid-1978 documents including the letters, agreements, specifications (including hand-written code and schematics), and two original source code listing for the creation of the Apple II “DOS” (Disk Operating System). This was, of course, Apple’s first operating system, written not by Steve Wozniak (“Woz”) but by an external contractor (Paul Laughton working for Shepardson Microsystems). Woz lacked the skills to write an OS (as did anyone then at Apple). Paul authored the actual Apple II DOS to its release in the fall of 1978.</p>
</blockquote>
<p>Update: At this point I see some GitHub stub accounts, but no real code (yet).</p>
Survival analysis for hard drives
2013-11-12T16:29:40+00:00
http://simplystats.github.io/2013/11/12/survival-analysis-for-hard-drives
<p><a href="http://www.extremetech.com/computing/170748-how-long-do-hard-drives-actually-live-for">How long do hard drives last</a>?</p>
<blockquote>
<p>Backblaze has kept up to 25,000 hard drives constantly online for the last four years. Every time a drive fails, they note it down, then slot in a replacement. After four years, Backblaze now has some amazing data and graphs that detail the failure rate of hard drives over the first four years of their life.</p>
</blockquote>
<p>I guess it’s easier to do this with hard drives than it is for people.</p>
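<p>For readers who have not seen survival analysis before, here is a minimal sketch of the Kaplan-Meier estimator, one standard way to turn “time in service, and did it fail?” records into a survival curve. The drive lifetimes below are made up for illustration and are not Backblaze’s data; the function name <code>kaplan_meier</code> is my own, and I am not claiming this is how Backblaze produced their graphs.</p>

```python
def kaplan_meier(durations, failed):
    """Return [(time, survival_probability)] at each distinct failure time.

    durations: how long each drive was observed (e.g. years in service)
    failed:    True if the drive failed at that time, False if it was still
               running when observation stopped (right-censored)
    """
    survival = 1.0
    curve = []
    failure_times = sorted({d for d, f in zip(durations, failed) if f})
    for t in failure_times:
        # Drives still under observation just before time t.
        at_risk = sum(d >= t for d in durations)
        # Drives that failed exactly at time t.
        deaths = sum(1 for d, f in zip(durations, failed) if d == t and f)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical drives: years observed, and whether each one failed.
years = [1, 2, 2, 3, 4, 4, 4, 4]
failed = [True, True, False, True, False, False, False, False]

for t, s in kaplan_meier(years, failed):
    print(f"year {t}: estimated survival {s:.3f}")
```

<p>The censored drives (the ones still spinning when you stop watching) count toward the at-risk total up to their last observed time but never count as failures, which is what separates this from simply dividing failures by drives.</p>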
Future of Statistical Sciences Workshop is happening right now #FSSW2013
2013-11-12T10:07:30+00:00
http://simplystats.github.io/2013/11/12/future-of-statistical-sciences-workshop-is-happening-right-now-fssw2013
<p>ASA Executive Director Ron Wasserstein is tweeting like a mad man. If you’re not in London, catch up on what’s happening at the hashtag #FSSW2013.</p>
Apple's Touch ID and a worldwide lesson in sensitivity and specificity
2013-11-11T14:06:57+00:00
http://simplystats.github.io/2013/11/11/apples-touch-id-and-a-worldwide-lesson-in-sensitivity-and-specificity
<p>I’ve been playing with my new iPhone 5s for the last few weeks, and first let me just say that it’s an awesome phone. Don’t listen to whatever Jeff says. It’s probably worth it just for the camera, but I’ve been particularly interested in the behavior of Apple’s fingerprint sensor (a.k.a. Touch ID). Before the phone came out, there were persistent rumors of a fingerprint sensor from now-defunct AuthenTec, and I wondered how the sensor would work given that it was unlikely to be perfect.</p>
<p>Apple reportedly sold 9 million iPhone 5c and 5s models over the opening weekend alone. Of those, about 7 million were estimated to be the 5s model which includes the fingerprint sensor (the 5c does not include it). So now millions of people have been using this thing and I’m getting the sense that many people are experiencing the same behavior I’ve observed over the last few weeks.</p>
<ul>
<li><strong>The sensor appears to have a high specificity</strong>. If you put the wrong finger, or the wrong person’s finger, on the sensor, it will not let you unlock the phone. I haven’t seen a single instance of a false positive here, which seems like a good thing.</li>
<li><strong>The sensor’s sensitivity is modest</strong>. Given the correct finger, the sensor seems to have a sensitivity of between 50% and 80% based on my completely unscientific guesstimation. It seems to depend a little on the finger. I don’t know if this is high or low compared with other fingerprint sensors, but it’s mildly annoying to have to switch fingers or type in the passcode more often than I was expecting.</li>
<li><strong>Behavior seems to change depending on the task</strong>. This is pure speculation, but it seems the sensor is a bit more open to false positives if you’re using it to buy a song on iTunes. Although I haven’t actually seen it happen, it feels like I don’t have to place my finger on the sensor so perfectly if I’m just purchasing a song or an app.</li>
</ul>
<p>If my experiences in any way reflect reality, it seems to make sense. Apple had to make some choices about where to set the cutoffs for false positives and negatives, and I think they erred on the side of security. Having a high specificity is critical because that prevents a bad guy from accessing the phone. A low sensitivity is annoying, but not critical because the correct user could always type in a passcode. As for modifying the behavior based on the task, that seems to make sense too because you can’t buy songs or apps without first unlocking the phone.</p>
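<p>To make the cutoff tradeoff concrete, here is a small sketch. It assumes the sensor produces a match score between 0 and 1 and accepts a finger when the score clears a threshold; that model, the scores, and the two thresholds are all hypothetical and chosen for illustration, since I have no idea how Touch ID actually scores a fingerprint.</p>

```python
def sensitivity_specificity(genuine_scores, impostor_scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are accepted.

    sensitivity: fraction of correct-finger attempts that are accepted
    specificity: fraction of wrong-finger attempts that are rejected
    """
    true_pos = sum(s >= threshold for s in genuine_scores)
    true_neg = sum(s < threshold for s in impostor_scores)
    return true_pos / len(genuine_scores), true_neg / len(impostor_scores)


# Hypothetical match scores; made up for illustration, not Apple data.
genuine = [0.95, 0.90, 0.85, 0.80, 0.72, 0.65, 0.60, 0.55, 0.50, 0.40]
impostor = [0.45, 0.35, 0.30, 0.25, 0.20, 0.15, 0.12, 0.10, 0.05, 0.02]

# A strict cutoff (say, for unlocking the phone) and a lenient one
# (say, for a purchase made after the phone is already unlocked).
strict = sensitivity_specificity(genuine, impostor, threshold=0.70)
lenient = sensitivity_specificity(genuine, impostor, threshold=0.42)

print("strict  (sensitivity, specificity):", strict)
print("lenient (sensitivity, specificity):", lenient)
```

<p>With these made-up numbers the strict cutoff never lets an impostor in but rejects the right finger half the time, while the lenient cutoff accepts the right finger far more often at the cost of an occasional false positive. You can’t move one number without moving the other unless the sensor itself gets better.</p>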
<p>Overall, I think Apple did a good job with the fingerprint sensor, especially for version 1.0. I’m guessing they’re making improvements in the technology/software as we speak and will want to improve the sensitivity before they start using it for more tasks or applications.</p>
Out with Big Data, in with Hyperdata
2013-11-11T09:53:55+00:00
http://simplystats.github.io/2013/11/11/out-with-big-data-in-with-hyperdata
<p>Big data is so <a href="http://www.nytimes.com/2013/11/11/technology/gathering-more-data-faster-to-produce-more-up-to-date-information.html">last year</a>.</p>
<blockquote>
<p itemprop="articleBody">
Collecting data from all sorts of odd places and analyzing it much faster than was possible even a couple of years ago has become one of the hottest areas of the technology industry. The idea is simple: With all that processing power and a little creativity, researchers should be able to find novel patterns and relationships among different kinds of information.
</p>
<p itemprop="articleBody">
For the last few years, insiders have been calling this sort of analysis Big Data. Now Big Data is evolving, becoming more “hyper” and including all sorts of sources. Start-ups like Premise and ClearStory Data, as well as larger companies like General Electric, are getting into the act.
</p>
<p itemprop="articleBody">
...
</p>
<p itemprop="articleBody">
“Hyperdata comes to you on the spot, and you can analyze it and act on it on the spot,” said Bernt Wahl, an industry fellow at the Center for Entrepreneurship and Technology at the University of California, Berkeley. “It will be in regular business soon, with everyone predicting and acting the way Amazon instantaneously changes its prices around.”
</p>
</blockquote>
How to Host a Conference on Google Hangouts on Air
2013-11-05T09:41:33+00:00
http://simplystats.github.io/2013/11/05/how-to-host-a-conference-on-google-hangouts-on-air
<p>We recently hosted the first ever <a href="http://simplystatistics.org/unconference">Simply Statistics Unconference on the Future of Statistics</a>. In preparing for the event, we learned a lot about how to organize such an event and frankly we wished there had been a bit more organized documentation on how to do this. The various Google web sites were full of nice videos demonstrating how cool the technology is, but not much in the way of specific instructions on how to get it done.</p>
<p>I posted on GitHub my <a href="https://github.com/rdpeng/ConferenceGHOA">step-by-step list of instructions</a> for how to set up and run a conference on Google Hangouts on Air in the hopes that someone would find it useful. I’m also happy to accept corrections if something there is not right.</p>
Sunday data/statistics link roundup (11/3/13)
2013-11-03T21:32:57+00:00
http://simplystats.github.io/2013/11/03/sunday-datastatistics-link-roundup-11313
<ol>
<li>There has been a big knockdown-dragout battle in the blogosphere over how GTEX is doing their analysis. Read the original post <a href="http://liorpachter.wordpress.com/2013/10/21/gtex/">here</a>, my summary <a href="http://simplystatistics.org/2013/10/22/blog-posts-that-impact-real-science-software-review-and-gtex/">here</a>, and the <a href="http://liorpachter.wordpress.com/2013/10/31/response-to-gtex-is-throwing-away-90-of-their-data/">response from GTEX here.</a> I agree that the criticism bordered on hyperbolic but also think that methods matter. I also think that consortia are under pressure to get something out and have to use software that works. I’m sympathetic because that must be a tough position to be in, but it is important to remember that software runs != software works well.</li>
<li>Chris Bosh <a href="http://www.businessinsider.com/chris-bosh-thinks-you-should-learn-how-to-code-2013-10">thinks you should learn to code</a>. Me too. I wonder if the Heat will give me a contract now?</li>
<li>Terry Speed wins the Prime Minister’s Prize for science. Here is an <a href="http://www.abc.net.au/news/2013-10-31/prime-ministers-prize-for-science-award-winner-terry-speed/5059718">awesome interview</a> with him. Watch to the end to find out how he is gonna spend all the money.</li>
<li>Learn faster with the <a href="http://www.youtube.com/watch?v=FrNqSLPaZLc">Feynman technique</a>. tl;dr = practice teaching what you are trying to learn.</li>
<li>Via Tim T. Jr. check out this <a href="http://vudlab.com/simpsons/">interactive version</a> of Simpson’s paradox. Super slick and educational.</li>
<li><a href="http://bleacherreport.com/articles/1830249-golden-glove-awards-2013-full-list-of-winners-and-analysis">Stats used to determine</a> the Gold Glove (in part).</li>
<li><a href="https://github.com/tdsmith/aRrgh">An angry newcomer’s guide</a> to data types in R, dangit!</li>
<li><a href="http://accidental-art.tumblr.com/">Accidental aRt</a> - accidentally beautiful creations in R.</li>
</ol>
Unconference on the Future of Statistics (Live Stream) #futureofstats
2013-10-30T11:36:34+00:00
http://simplystats.github.io/2013/10/30/unconference-on-the-future-of-statistics-live-stream-futureofstats
<p>The Unconference on the Future of Statistics will begin at 12pm EDT today. Watch the live stream here.</p>
How to participate in #futureofstats Unconference
2013-10-29T09:52:21+00:00
http://simplystats.github.io/2013/10/29/how-to-participate-in-futureofstats-unconference
<p>Tomorrow is the <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">Unconference on the Future of Statistics</a> from 12PM-1PM EDT. There are two ways that you can get in the game:</p>
<ol>
<li><span style="line-height: 16px;">Ask questions for our speakers on Twitter with the hashtag #futureofstats. Don’t wait, start right now, Roger, Rafa, and I are monitoring the hashtag and collecting questions. We will pick some to ask the speakers tomorrow during the Unconference. </span></li>
<li>If you have an idea about the future of statistics write it up, post it on Github, on Blogger, on WordPress, on your personal website, then tweet it with the hashtag #futureofstats. We will do our best to collect these and post them with the video so your contributions will be part of the Unconference.</li>
</ol>
Tukey Talks Turkey #futureofstats
2013-10-29T09:38:00+00:00
http://simplystats.github.io/2013/10/29/tukey-talks-turkey-futureofstats
<p>I’ve been digging up old “future of statistics” writings from the past in anticipation of our <a href="http://simplystatistics.org/unconference">Unconference on the Future of Statistics</a> this Wednesday 12-1pm EDT. Last week I mentioned Daryl Pregibon’s experience trying to <a href="http://simplystatistics.org/2013/10/25/back-to-the-future-of-statistical-software-futureofstats/">build statistical expertise into software</a>. One classic is <a href="http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoms/1177704711">“The Future of Data Analysis”</a> by John Tukey published in the <em>Annals of Mathematical Statistics</em> in 1962.</p>
<p>Perhaps the most surprising aspect of this paper is how relevant it remains today. I think perhaps with just a few small revisions it could easily be published in a journal today and few people would find it out of place.</p>
<p>In Section 3 titled “How can new data analysis be initiated?” he describes directions in which statisticians should go to grow the field of data analysis. But the advice itself is quite general and probably should be heeded by any junior statistician just starting out in research.</p>
<blockquote>
<p>How is novelty most likely to begin and grow? Not through work on familiar problems, in terms of familiar frameworks, and starting with the results of applying familiar processes to the observations. Some or all of these familiar constraints must be given up in each piece of work which may contribute novelty.</p>
</blockquote>
<p>Tukey’s article serves as a coherent and comprehensive roadmap for the development of data analysis as a field. He suggests that we should study how people analyze data and uncover “what works” and what doesn’t. However, he appears to draw the line at suggesting that such study should result in a single way of analyzing a given type of data. Rather, statisticians should maintain some flexibility in modeling and analysis. I personally think the reality should be somewhere in the middle. Too much flexibility can lead to problems, but rigidity is not the solution.</p>
<p>It is interesting, from my perspective, how much of Tukey’s roadmap was essentially ignored, given how clear and coherent it was in 1962. In fact, the field pretty much went the other direction towards more mathematical elegance (I’m guessing Tukey sensed this would happen). His article is uncomfortable to read, because it’s full of problems that arise in real data that are difficult to handle with standard approaches. He has an uncanny ability to make up methods that look totally bizarre at first glance but are totally reasonable after some thought.</p>
<p>I honestly can’t think of a better way to end this post than to quote Tukey himself.</p>
<blockquote>
<p>The future of data analysis can involve great progress, the overcoming of real difficulties, and the provision of a great service to all fields of science and technology. Will it? That remains to us, to our willingness to take up the rocky road of real problems in preference to the smooth road of unreal assumptions, arbitrary criteria, and abstract results without real attachments. Who is for the challenge?</p>
</blockquote>
<p>Read the paper. And then come join us at 12pm EDT tomorrow.</p>
Simply Statistics Future of Statistics Speakers - Two Truths, One Lie #futureofstats
2013-10-28T10:28:19+00:00
http://simplystats.github.io/2013/10/28/simply-statistics-future-of-statistics-speakers-two-truths-one-lie-futureofstats
<p>Our online conference live-streamed on Youtube is going to happen on October 30th 12PM-1PM Baltimore (UTC-4:00) time. You can find more information <a href="http://simplystatistics.org/unconference">here</a> or sign up for email alerts <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">here</a>. I get bored with the usual speaker bios at conferences so I am turning our speaker bios into a game. Below you will find three bullet pointed items of interest about each of our speakers. Two of them are truths and one is a lie. See if you can spot the lies and <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">sign up for the unconference</a>!</p>
<p><strong><a href="http://had.co.nz/">Hadley Wickham</a></strong></p>
<ul>
<li>Created the ggplot2/devtools packages.</li>
<li>Developed R’s first class system.</li>
<li>Is chief scientist at RStudio.</li>
</ul>
<p><a href="http://www.biostat.washington.edu/~dwitten/"><strong>Daniela Witten</strong></a></p>
<ul>
<li>Developed the most popular method for inferring Facebook connections.</li>
<li>Created the Spacejam algorithm for inferring networks.</li>
<li>Made the Forbes 30 under 30 list twice as a rising scientific star.</li>
</ul>
<p><strong><a href="http://www.people.fas.harvard.edu/~blitz/Site/Home.html">Joe Blitzstein </a></strong></p>
<ul>
<li>A Professor of the Practice of Statistics at Harvard University.</li>
<li>Created the first statistical method for automatically teaching the t-test.</li>
<li>His statistics 101 course is frequently in the top 10 courses on iTunes U.</li>
</ul>
<p><a href="http://www.biostat.jhsph.edu/~hji/"><strong>Hongkai Ji</strong></a></p>
<ul>
<li>Developed the hmChIP database of over 2,000 ChIP-Seq and ChIP-Chip data samples.</li>
<li>Coordinated the analysis of the orangutan genome project.</li>
<li>Analyzed data to help us understand sonic-hedgehog mediated neural patterning.</li>
</ul>
<p><a href="http://web.mit.edu/sinana/www/"><strong>Sinan Aral</strong></a></p>
<ul>
<li>Coined the phrase “social networking potential”.</li>
<li>Ran a large randomized study that determined the value of upvotes.</li>
<li>Discovered that peer influence is dramatically overvalued in product adoption.</li>
</ul>
<p><strong><a href="http://www.hilarymason.com/">Hilary Mason</a></strong></p>
<ul>
<li>Is a co-founder of DataGotham and HackNY.</li>
<li>Developed computational algorithms for identifying the optimal cheeseburger.</li>
<li>Founded the first company to create link sorting algorithms.</li>
</ul>
Sunday data/statistics link roundup (10/27/13)
2013-10-27T13:16:51+00:00
http://simplystats.github.io/2013/10/27/sunday-datastatistics-link-roundup-102713
<ol>
<li><a href="http://www.ncbi.nlm.nih.gov/pubmedcommons/">Pubmed Commons</a> is a new post-publication commenting system. I think this is a great idea and I hope it succeeds. Right now it is in “private beta” so only people with Pubmed Commons accounts can post/view comments. But you can follow along with who is making comments via <a href="https://twitter.com/pmctrawler">this neat twitter bot</a>. I think the main feature this lacks to be a hugely successful experiment is some way to give real, tangible academic credit to commenters. One very obvious way would be by assigning DOIs to every comment and making the comments themselves Pubmed searchable. Then they could be listed as contributions on CVs - a major incentive.</li>
<li><a href="http://countaleph.wordpress.com/2013/10/20/dear-startups-stop-asking-me-math-puzzles-to-figure-out-if-i-can-code/">A post</a> on the practice of asking potential hires tricky math problems - even if they are going to be hired to do something else (like software engineering). This happens all the time in academia as well - often the exams we give/questions we ask aren’t neatly aligned with the ultimate goals of a program (producing innovative/determined researchers).</li>
<li>This is going to be a short Sunday Links because my <a href="https://www.coursera.org/course/dataanalysis">Coursera class</a> is starting again tomorrow.</li>
<li>Don’t forget that next week is the <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">Unconference on the Future of Statistics</a> on Wednesday, October 30th at noon Baltimore time!</li>
</ol>
(Back to) The Future of Statistical Software #futureofstats
2013-10-25T08:24:52+00:00
http://simplystats.github.io/2013/10/25/back-to-the-future-of-statistical-software-futureofstats
<p>In anticipation of the upcoming <a href="http://simplystatistics.org/unconference/">Unconference on the Future of Statistics</a> next Wednesday at 12-1pm EDT, I thought I’d dig up what people in the past had said about the future so we can see how things turned out. In doing this I came across an old National Academy of Sciences report from 1991 on the <a href="http://www.nap.edu/catalog.php?record_id=1910">Future of Statistical Software</a>. This was a panel discussion hosted by the National Research Council and summarized in this volume. I believe you can download the entire volume as a PDF for free from the NAS web site.</p>
<p>The entire volume is a delight to read but I was particularly struck by Daryl Pregibon’s presentation on “Incorporating Statistical Expertise into Data Analysis Software” (starting on p. 51). Pregibon describes his (unfortunate) experience trying to develop statistical software which has the ability to incorporate expert knowledge into data analysis. In his description of his goals, it’s clear in retrospect that he was incredibly ambitious to attempt to build a kind of general-purpose statistical analysis machine. In particular, it was not clear how to incorporate subject matter information.</p>
<blockquote>
<div title="Page 63">
<div>
<div>
<div>
<p>
[T]he major factor limiting the number of people using these tools was the recognition that (subject matter) context was hard to ignore and even harder to incorporate into software than the statistical methodology itself. Just how much context is required in an analysis? When is it used? How is it used? The problems in thoughtfully integrating context into software seemed overwhelming.
</p>
</div>
</div>
</div>
</div>
</blockquote>
<p>Pregibon skirted the problem of integrating subject matter context into statistical software.</p>
<blockquote>
<div title="Page 64">
<div>
<div>
<div>
<p>
I am not talking about integrating context into software. That is ultimately going to be important, but it cannot be done yet. The expertise of concern here is that of carrying out the plan, the sequence of steps used once the decision has been made to do, say, a regression analysis or a one-way analysis of variance. Probably the most interesting things statisticians do take place before that.
</p>
</div>
</div>
</div>
</div>
</blockquote>
<p>Statisticians (and many others) tend to focus on the application of the “real” statistical method: the regression model, lasso shrinkage, or support vector machine. But as much painful experience in a variety of fields has demonstrated, much of what happens before the application of the key model is as important, or even more important.</p>
<p>Pregibon makes an important point that although statisticians are generally resistant to incorporating their own expertise into software, they have no problem writing textbooks about the same topic. I’ve observed the same attitude when I talk about <a href="http://simplystatistics.org/2013/08/28/evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-2/">evidence-based data analysis</a>. If I were to guess, the problem is that textbooks are still to a certain extent abstract, while software is 100% concrete.</p>
<blockquote>
<div title="Page 62">
<p>
Initial efforts to incorporate statistical expertise into software were aimed at helping inexperienced users navigate through the statistical software jungle that had been created…. Not surprisingly, such ideas were not enthusiastically embraced by the statistics community. Few of the criticisms were legitimate, as most were concerned with the impossibility of automating the “art” of data analysis. <strong>Statisticians seemed to be making a distinction between providing statistical expertise in textbooks as opposed to via software</strong>. [emphasis added]
</p>
</div>
</blockquote>
<p>In short, Pregibon wanted to move data analysis from an <em>art</em> to a <em>science</em>, more than 20 years ago! He stressed that data analysis, at that point in time, was not considered a process worth studying. I found the following paragraph interesting and worth considering now, over 20 years later. He talks about the reasons for incorporating statistical expertise into software.</p>
<blockquote>
<div title="Page 64">
<div>
<div>
<div>
<p>
The third [reason] is to study the data analysis process itself, and that is my motivating interest. Throughout American or even global industry, there is much advocacy of statistical process control and of understanding processes. <strong>Statisticians have a process they espouse but do not know anything about</strong>. It is the process of putting together many tiny pieces, the process called data analysis, and is not really understood. Encoding these pieces provides a platform from which to study this process that was invented to tell people what to do, and about which little is known. [emphasis added]
</p>
</div>
</div>
</div>
</div>
</blockquote>
<div title="Page 64">
<p>
I believe we have come quite far since 1991, but I don't think we know much more about the process of data analysis, especially in newer areas that involve newer data. The reason is that the field has not put much effort into studying the whole data analysis process. I think there is still a resistance to studying this process, in part because it involves "stooping" to analyze data and in part because it is difficult to model with mathematics. In his presentation, Pregibon suggests that resampling methods like the bootstrap might allow us to skirt the mathematical difficulties in studying data analysis processes.
</p>
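<p>
To make the bootstrap idea concrete, here is a minimal sketch in Python. It is not Pregibon's method or anything from the report: the <code>analysis_pipeline</code> function is a hypothetical stand-in for a multi-step analysis (trim the extremes, then average), and the data values are made up for illustration. The point is only that the entire procedure, not just a single estimator, is re-run on each resampled data set.
</p>

```python
import random
import statistics

def analysis_pipeline(values):
    """Hypothetical stand-in for a multi-step analysis:
    drop the lowest and highest 10% of values, then take the mean."""
    ordered = sorted(values)
    k = len(ordered) // 10
    trimmed = ordered[k:len(ordered) - k] if k > 0 else ordered
    return statistics.mean(trimmed)

def bootstrap_pipeline(values, n_boot=1000, seed=0):
    """Re-run the whole pipeline on data sets resampled with replacement."""
    rng = random.Random(seed)
    n = len(values)
    results = []
    for _ in range(n_boot):
        resample = [values[rng.randrange(n)] for _ in range(n)]
        results.append(analysis_pipeline(resample))
    return results

# Made-up measurements, for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 9.7, 3.1, 4.0, 3.6, 2.5]
estimates = bootstrap_pipeline(data)
print("pipeline estimate on original data:", round(analysis_pipeline(data), 3))
print("bootstrap standard error:", round(statistics.stdev(estimates), 3))
```

<p>
The spread of <code>estimates</code> describes how much the output of the full procedure varies from sample to sample, without requiring a mathematical model of each step.
</p>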
<p>
One interesting lesson Pregibon relates during the development of REX, an early system that failed, involves the difference between the end-goals of statisticians and non-statisticians:
</p>
<blockquote>
<div title="Page 68">
<p>
Several things were learned from the work on REX. The first was that statisticians wanted more control. There were no users, rather merely statisticians looking over my shoulder to see how it was working. Automatically, people reacted negatively. They would not have done it that way. In contrast, non-statisticians to whom it was shown loved it. They wanted less control. In fact they did not want the system--they wanted answers.
</p>
</div>
</blockquote>
</div>
The Leek group guide to reviewing scientific papers
2013-10-23T11:14:03+00:00
http://simplystats.github.io/2013/10/23/the-leek-group-guide-to-reviewing-scientific-papers
<p>There has been a lot of discussion of peer review on this blog and elsewhere. One thing I realized is that no one ever formally taught me the point of peer review or how to write a review.</p>
<p>Like a lot of other people, I have been <a href="http://simplystatistics.org/2012/07/11/my-worst-recent-experience-with-peer-review/">frustrated by the peer review process</a>. I also now frequently turn to my students to perform supervised peer review of papers, both for their education and because I can’t handle the large number of peer review requests I get on my own.</p>
<p>So I wrote <a href="https://github.com/jtleek/reviews">this guide</a>, hosted on Github, on how to write a review of a scientific paper. Last time I did this <a href="http://simplystatistics.org/2013/10/07/the-leek-group-policy-for-developing-sustainable-r-packages/">with R packages</a>, a bunch of people contributed to make the guide better. I hope that the same thing will happen this time.</p>
Blog posts that impact real science - software review and GTEX
2013-10-22T11:53:56+00:00
http://simplystats.github.io/2013/10/22/blog-posts-that-impact-real-science-software-review-and-gtex
<p>There was a flurry of activity on social media yesterday surrounding a blog post by <a href="http://liorpachter.wordpress.com/2013/10/21/gtex/">Lior Pachter</a>. He was speaking about the <a href="http://commonfund.nih.gov/GTEx/">GTEX project</a> - a large NIH funded project that has the goal of understanding expression variation within and among human beings. The project has measured gene expression in multiple tissues of over 900 individuals.</p>
<p>In the post, the author claims that the GTEX project is “throwing away” 90% of its data. The basis for this claim is a simulation study using the parameters from one of the author’s papers. The claim of 90% is based on the fact that increasing the number of mRNA fragments leads to increasing correlation in abundance measurements in the simulation study. In order to get the same Spearman correlation as other methodologies have at 10M fragments, the software being used by GTEX needs 100M fragments.</p>
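<p>The arithmetic behind the 90% figure, as I read it from the post, can be sketched as follows. The function name is hypothetical, and the only inputs are the two fragment counts quoted above (10M and 100M).</p>

```python
def implied_fraction_wasted(fragments_needed, fragments_reference):
    """Fraction of sequencing depth effectively lost when one method needs
    `fragments_needed` fragments to reach the correlation that another
    method reaches with `fragments_reference` fragments."""
    if fragments_needed <= 0 or fragments_reference <= 0:
        raise ValueError("fragment counts must be positive")
    return 1 - fragments_reference / fragments_needed

# 100M fragments needed to match what other methods achieve at 10M.
wasted = implied_fraction_wasted(100_000_000, 10_000_000)
print(f"{wasted:.0%}")
```

<p>Note that the result depends entirely on which agreement measure is used to define “the same” performance, which is one reason the number deserves scrutiny.</p>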
<p>This post and the associated furor raise three issues:</p>
<ol>
<li>The power and advantage of blog posts and social media as a form of academic communication.</li>
<li>The importance of using published software.</li>
<li>Extreme critiques deserve as much scrutiny as extreme claims.</li>
</ol>
<p>The first point is obvious; the post was rapidly disseminated and elicited responses from the leaders of the GTEX project. Interestingly, I think the authors got an early view of the criticisms they would face from reviewers through the blog post. The short term criticism is probably not fun to deal with but it might save them time later.</p>
<p>I think the criticism about using software that has not been fully vetted through the publication/peer review process is an important one. For such a large scale project, you’d like to see the primary analysis being done with “community approved” software. The reason is that we just don’t know if it is better or worse because no one published a study on the software. It would be interesting to see how the <a href="http://simplystatistics.org/2012/09/07/top-down-versus-bottom-up-science-data-analysis/">bottom up approach</a> would have fared here. The good news for GTEX here is that for future papers they will either get out a more comprehensive comparison or they will switch software - either of which will improve their work.</p>
<p>Regarding point 2, Pachter did a “back of the envelope” calculation that suggested the Flux software wasn’t performing well. These back of the envelope calculations are very important - <a href="http://simplystatistics.org/2013/03/06/the-importance-of-simulating-the-extremes/">if you can’t solve the easy case, how can you expect to solve the hard case?</a> Lost in all of the publicity about the 90% number is that Pachter’s blog post hasn’t been vetted, either. Here are a few questions that immediately jumped to my mind when reading the blog post:</p>
<ol>
<li>Why use Spearman correlation as the important measure of agreement?</li>
<li>What is the correlation between replicates?</li>
<li>What parameters did he use for the Flux calculation?</li>
<li>Where is his code so we can see if there were any bugs (I’m sure he is willing to share, but I don’t see a link)?</li>
<li>That 90% number seems very high. I wonder if varying the simulation approach/parameter settings/etc. would show it isn’t quite that bad.</li>
<li>Throwing away 90% of your data might not matter if you get the right answer to the question you care about at the end. Can we evaluate something closer to what we care about? A list of DE genes/transcripts, for example?</li>
</ol>
<p>Whenever a scientist sees a claim as huge as “throwing away 90% of the data” they should be skeptical. This is particularly true in genomics, where huge effects are often due to bugs or artifacts. So in general, it is important that we apply the same level of scrutiny to extreme critiques as we do to extreme claims.</p>
<p>My guess is that, ultimately, the 90% number may end up being an overestimate of how bad the problem is. On the other hand, I think it was hugely useful for Pachter to point out the potential issue and give GTEX the chance to respond. If nothing else, it points out (1) the danger of using unpublished methods when good published alternatives exist and (2) that science moves faster in the era of blog posts and social media.</p>
<p><em>Disclaimers: I work on RNA-seq analysis although I’m not an author on any of the methods being considered. I have spoken at a GTEX meeting, but am not involved in the analysis of the data. Most importantly, I have not analyzed any data and am in no position to make claims about any of the software in question. I’m just making observations about the sociology of this interaction.</em></p>
PubMed commons is launching
2013-10-22T11:00:47+00:00
http://simplystats.github.io/2013/10/22/pubmed-commons-is-launching
<p><a href="http://www.ncbi.nlm.nih.gov/pubmed">PubMed</a>, the main database of life sciences and biomedical literature, is now allowing comments and upvotes. <a href="http://www.ncbi.nlm.nih.gov/pubmedcommons">Here</a> is more information and the twitter handle is @PubMedCommons.</p>
Why are the best relievers not used when they are most needed?
2013-10-21T10:00:51+00:00
http://simplystats.github.io/2013/10/21/why-are-the-best-relievers-not-used-when-they-are-most-needed
<p>During Saturday’s ALCS game 6 the Red Sox’s manager John Farrell took out his starter in the 6th inning. They were leading by 1, but had runners on first and second with no outs. This is a hard situation to get out of without giving up a run. The chances of scoring with an average pitcher are about <a href="http://www.nssl.noaa.gov/users/brooks/public_html/feda/datasets/expectedruns.html">64</a>%. I am sure that with a top of the line pitcher, like <a href="http://www.baseball-reference.com/players/u/ueharko01.shtml">Koji Uehara</a>, this number goes down substantially. So what does a typical manager do in this situation? Because managers like to save their better relievers for the end, and it’s only the 6th inning, they will bring in a mediocre one instead. This is what Farrell did and 2 batters later the score was 2-1 Tigers. To really understand why this is a bad move, consider that the chance of a mediocre pitcher giving up runs when starting an inning is about 28%. So why not bring in your best reliever when the game is actually on the line? <a href="http://www.billjamesonline.com/stats26/">Here</a> is an article by John Dewan with a good in-depth discussion. Note that the Red Sox won the game 5-2 and Koji Uehara was brought in in the ninth inning to get 3 outs with the bases empty and a 3 run lead.</p>
Platforms and Integration in Statistical Research (Part 2/2)
2013-10-18T08:40:43+00:00
http://simplystats.github.io/2013/10/18/platforms-and-integration-in-statistical-research-part-22
<p>In my <a href="http://simplystatistics.org/2013/10/15/platforms-and-integration-in-statistical-research-part-12/">last post</a>, I talked about two general approaches to conducting statistical research: platforms and integration. In this followup I thought I’d describe the characteristics of certain fields that suggest taking one approach over another.</p>
<p>I think in practice, most statisticians will dedicate some time to both the platform and integrative approaches to doing statistical research because different approaches work better in different situations. The question then is not “Which approach is better?” but rather “What characteristics of a field suggest one should take a platform / integrative approach in order to have the greatest impact?” I think one way to answer this question is to make an analogy with transaction costs a la the <a href="http://en.wikipedia.org/wiki/Theory_of_the_firm">theory of the firm</a>. (This kind of analogy also plays a role in determining who best to collaborate with but that’s a different post).</p>
<p>In the context of an academic community, I think if it’s easy to exchange information, for example, about data, then building platforms that are widely used makes sense. For example, if everyone uses a standardized technology for collecting a certain kind of data, then it’s easy to develop a platform that applies some method to that data. Regression methodology works in any field that can organize its data into a rectangular table. On the other hand, if information exchange is limited, then building platforms is more difficult and closer collaboration may be required with individual investigators. For example, if there is no standard data collection method or if everyone uses a different proprietary format, then it’s difficult to build a platform that generalizes to many different areas.</p>
<p>There are two case studies with which I am somewhat familiar that I think are useful for demonstrating these characteristics.</p>
<ul>
<li><strong>Genomics</strong>. I think genomics is an area where you can see statisticians definitely taking both approaches. However, I’m struck by the intense focus on the development of methods and data analysis pipelines, particularly in order to adapt to the ever-changing ‘omics technologies that are being developed. Part of the reason is that for a given type of data, there are relatively few companies developing the technology for collecting the data. Here, it is possible to develop a method or pipeline to deal with a new kind of data generated by a new technology in the early stages of when that data are being produced. If your method works well relative to others, then it’s possible for your method to become essentially a standard approach that everyone uses for that technology. So there’s a pretty big incentive to be the person who develops a platform for a data collection technology on which everyone builds their research. It is helpful if you can get early access to new technologies so you can get a peek at the data before everyone else and get a head start on developing the methods. Another aspect of genomics is that the field is quite open relative to others, in that there is quite a bit of information sharing. With the enormous amount of publicly available data out there, there’s a very large population of potential users of your method/software. Those people who don’t collect their own primary data can still take your method and apply it to data that’s already out there. Therefore, I think from a statistician’s point of view, genomics is a field that presents many opportunities to build platforms that will be used by many people addressing many different types of questions.</li>
<li><strong>Environmental Health</strong>. The area of environmental health, where I generally operate, is a much smaller field than genomics. You can see this by looking at things like journal impact factors and h-indices. It does not have the same culture as genomics: relatively little data is shared openly, and there are typically no requirements from journals to make data available upon publication. Data are often very expensive and time-consuming to collect, particularly if you are running large cohorts and are monitoring things like personal exposure. There are no real standardized methods for data collection and many formats are proprietary. Statisticians in this area tend to be attached to larger groups who run studies or collect human health and exposure data. It’s relatively hard to be an independent statistician here because you need access to a collaborator who has relevant expertise, resources, and data. The lack of publicly available health data severely limits the participation of statisticians outside biomedical research institutions where the data are collected primarily. I would argue that in environmental health, the integrative approach is more fruitful because (1) in order to do the work in the first place you already need to be working closely with people collecting the health data; (2) there is a general lack of information sharing and standardization with respect to data collection; (3) if you develop a new tool, there is not a particularly large audience available to adopt it; (4) because studies are not unified by shared technologies, as in genomics, it’s often difficult to usefully generalize methodology from one study to the next. While I think it’s possible to develop general methodology for a certain type of study in this field, the impact is inherently limited due to the small size of the field.</li>
</ul>
<p>In the end I think areas that are ripe for the platform approach to statistical research are those that are very open and have a culture of information sharing, have a large community of active methodologists, and have a lot of useful publicly available data. Areas that do not have these qualities might be better served by an integrative approach where statisticians work more directly with scientific collaborators and focus on the specific questions and problems of a given study.</p>
The @fivethirtyeight effect - watching @walthickey gain Twitter followers in real time
2013-10-17T10:16:31+00:00
http://simplystats.github.io/2013/10/17/the-fivethirtyeight-effect-watching-walthickey-gain-twitter-followers-in-real-time
<p>Last night Nate Silver announced that he had hired Walt Hickey away from Business Insider to be an editor for the new http://www.fivethirtyeight.com/ website with a couple of tweets:</p>
<blockquote class="twitter-tweet" width="550">
<p>
Super excited to announce that 538 is hiring <a href="https://twitter.com/WaltHickey">@WaltHickey</a>, the talented young writer/journalist/data geek from Business Insider.
</p>
<p>
— Nate Silver (@NateSilver538) <a href="https://twitter.com/NateSilver538/status/390608474763063296">October 16, 2013</a>
</p>
</blockquote>
<blockquote class="twitter-tweet" width="550">
<p>
.<a href="https://twitter.com/WaltHickey">@WaltHickey</a> will have a similarly broad range for 538, bringing a data-driven view toward all types of things. Give him a follow!
</p>
<p>
— Nate Silver (@NateSilver538) <a href="https://twitter.com/NateSilver538/status/390608971280551936">October 16, 2013</a>
</p>
</blockquote>
<p>I knew about Walt because he <a href="http://www.businessinsider.com/fox-news-charts-tricks-data-2012-11">syndicated one of my posts</a> about Fox News Graphics on Business Insider. But he clearly wasn’t as well known as Nate S. who is probably the face of statistical analysis to most people in the world. So I figured the announcement might increase Walt’s following on Twitter.</p>
<p>After goofing around a bit with the <a href="https://dev.twitter.com/">Twitter API</a> and the <a href="http://cran.r-project.org/web/packages/twitteR/index.html">twitteR</a> R package, I managed to start sampling the number of followers for Walt H. This started about an hour or so (I think) after the announcement was made. Here is a plot of Walt’s followers over about two hours.</p>
<p><a href="http://simplystatistics.org/2013/10/17/the-fivethirtyeight-effect-watching-walthickey-gain-twitter-followers-in-real-time/walthickey-followers-3/" rel="attachment wp-att-2048"><img class="alignnone size-full wp-image-2048" alt="walthickey-followers" src="http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-followers.png" width="400" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-followers-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-followers-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-followers.png 400w" sizes="(max-width: 400px) 100vw, 400px" /></a></p>
<p>Over the two hours he gained almost 1,000 followers! We can also take a look at the rate he was gaining followers.</p>
<p><a href="http://simplystatistics.org/2013/10/17/the-fivethirtyeight-effect-watching-walthickey-gain-twitter-followers-in-real-time/walthickey-rate/" rel="attachment wp-att-2049"><img class="alignnone size-full wp-image-2049" alt="walthickey-rate" src="http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-rate.png" width="400" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-rate-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-rate-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2013/10/walthickey-rate.png 400w" sizes="(max-width: 400px) 100vw, 400px" /></a></p>
<p>So he was gaining followers at around 10-15 per minute on average at 7:30 yesterday. It cooled off over those two hours, but he was still getting a few followers a minute. To put those numbers in perspective, our Twitter account, @simplystats, gets on average about 10 new followers <em>a day</em>.</p>
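<p>For anyone who wants to compute this kind of rate from their own samples, here is a rough sketch in Python rather than the R code I actually used. It assumes you already have a list of (timestamp, follower count) pairs; the function name is hypothetical and the counts below are made up for illustration, not Walt’s real numbers.</p>

```python
from datetime import datetime

def followers_per_minute(samples):
    """Given (datetime, follower_count) pairs sorted by time, return
    (end_time, followers gained per minute) for each consecutive pair."""
    rates = []
    for (t0, n0), (t1, n1) in zip(samples, samples[1:]):
        minutes = (t1 - t0).total_seconds() / 60
        if minutes <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((t1, (n1 - n0) / minutes))
    return rates

# Made-up follower counts sampled every 10 minutes, for illustration only.
samples = [
    (datetime(2013, 10, 16, 19, 30), 5000),
    (datetime(2013, 10, 16, 19, 40), 5120),
    (datetime(2013, 10, 16, 19, 50), 5200),
    (datetime(2013, 10, 16, 20, 0), 5240),
]
for when, rate in followers_per_minute(samples):
    print(when.strftime("%H:%M"), f"{rate:.1f} followers/min")
```

<p>Differencing consecutive samples like this is noisy when the sampling interval is short, so in practice you may want to smooth the rates before plotting them.</p>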
<p>So there you have it, the real time (albeit two hours too late) 538 bump in Twitter followers.</p>
Platforms and Integration in Statistical Research (Part 1/2)
2013-10-15T10:21:50+00:00
http://simplystats.github.io/2013/10/15/platforms-and-integration-in-statistical-research-part-12
<p>In the technology world today, one of the common topics of interest is the so-called “war” between Apple and Google (or Android). This war is ostensibly over dominance of the mobile phone industry, where Apple sells the most popular phone but Google/Android (as an operating system) controls over half the market. (Android phones themselves are manufactured by a variety of companies and no one of those companies sells more phones than Apple.)</p>
<p><strong>Apple vs. Google (vs. Microsoft)</strong></p>
<p>Apple’s model is to own the entire (or most of the) development of the phone. They build the hardware, the software, and create the design. They also control the App Store for selling their own software and third party software, distribute the music from their iTunes store, and distribute the e-books through their iBookstore. They even have their own proprietary messaging platform. This “walled-garden” approach is a hallmark of Apple and its famously controlling founder Steve Jobs. Rather than “walled-garden”, I would call it more of an “integrative” approach, where Apple essentially has its fingers in all the relevant pies, controlling every aspect.</p>
<p>The Google/Android approach is more modular and involves controlling the platform on which pretty much every phone could theoretically be built. Until recently, Google did not build their own phones, but rather let other companies build the phones and use Google’s operating system as the software for the phone. The model here is similar to the <a href="http://en.wikipedia.org/wiki/Unix_philosophy">Unix philosophy</a>, which is to “do one thing well”. Google is really good at developing Android and handset manufacturers are really good at building phones. There’s no point in one company doing two things moderately well when you could have two companies each do one thing really well. Here, Google focuses on the platform, the Android operating system, and tries to spread it as far and wide as possible to cover the most possible phones, tablets, watches, whatever mobile device is relevant.</p>
<p>For us older people, the more relevant “war” is between Microsoft and everyone else. Microsoft built one of the most legendary platforms in computer history: the Windows operating system. For decades this platform was (and continues to be) the dominant operating system for personal desktop computers. Although Microsoft never really focused on building its own hardware, Microsoft’s impact on the PC world through its control of Windows is undeniable. Unfortunately, an asterisk must be put next to all of this history because we now know that much of this dominance was achieved through <a href="http://en.wikipedia.org/wiki/United_States_v._Microsoft_Corp.">illegal anticompetitive activity</a>.</p>
<p><strong>Theory and Methods vs. Applications</strong></p>
<p>There’s much debate in the technology world over which approach is better, the Apple integrative model or the Google/Microsoft modular platform model. I think this “debate” exists because it’s fun to argue about Apple vs. Google and it gives technology reporters something to write about. When the dust settles (if ever) I think the answer will be “it depends”.</p>
<p>In the statistical community I find there’s often an analogous debate that goes on regarding which is the more important form of statistical activity, theory/methods or applications. In a nutshell (perhaps even a cartoon nutshell) there’s a sense that theoretical or abstract methodological development has a greater impact because it is broadly generalizable to many different areas. Applications work is less impactful because it is focused on a specific area and any lessons learned that might be applicable to other areas would only be realized much later.</p>
<p>We could spend a lot of time debating the specific arguments here (and I have already spent that time!) but I think a better way to frame this debate is to use the analogy of Apple and Google, that is between integrative statistical research and platforms research. In particular, I think the “theory vs. applications” moniker is a bit outdated and does not cover many of the recent developments in the field of statistics.</p>
<p><strong>Platforms in Statistics</strong></p>
<p>When I was in graduate school and learning about being a statistician, it was pretty much hammered into my brain that the ultimate goal of a statistician is to build a platform. It was not described to me in those words, but that was the essential message. The basic idea was that you would develop a new method that was as general as possible so that it could be applied to a wide variety of fields, from agriculture to zoology. Ideally, you would demonstrate that this method was better than any other method through some sort of theorem.</p>
<p>The ultimate platform in statistics might be the <em>t</em>-test, or perhaps the <em>p</em>-value. Those two statistical methods are used in some form in almost any scientific context you could possibly imagine. I’d argue that the <em>p</em>-value is the Microsoft Windows of science. (Much like with Windows, you could argue this is for better or for worse.) Other essential platforms in statistics might be linear regression, generalized linear models, the bootstrap, the EM algorithm, etc. If you could be the developer of one of these platforms your impact would be tremendous because everyone in every discipline would use it. That’s why Ronald Fisher should be the <a href="http://simplystatistics.org/2012/03/07/r-a-fisher-is-the-most-influential-scientist-ever/">most influential scientist ever</a>.</p>
<p>I think the notion of a platform, rather than theory/methods, is a much more useful context here because it more accurately describes why these things are so important. Generalized linear models may be interesting because they represent an abstract concept of linear relationships, but they’re useful because they’re a platform on which a ton of other research in many other fields can be built. If we accept the idea that something is important because it serves as a platform on which many other things can be built, then I think this idea encompasses more than what might be published in the pages of the <em>Journal of the American Statistical Association</em> or the <em>Annals of Statistics</em>.</p>
<p>In particular, I think one of the greatest statistical platforms developed in the last 10 to 15 years is <a href="http://www.r-project.org/">R</a>. If you consider what R really is, yes it’s a software package that does statistical calculations, but primarily it’s a platform on which an enormous community of people can build things. The Comprehensive R Archive Network is the “App Store” through which statisticians can develop and distribute their tools. R itself has been extended (through packages) and applied to almost every imaginable scientific discipline. Take one look at the <a href="http://cran.r-project.org/web/views/">Task Views</a> section to get a sense of the diversity of areas to which R has been applied. Entire sub-projects (e.g. <a href="http://bioconductor.org">Bioconductor</a>) have been developed around using R in specific fields of research. At this point the impact of R on both the sciences and on statistics is as undeniable as the <em>t</em>-test.</p>
<p><strong>Integrative Statistical Research</strong></p>
<p>Integrative research in statistics is something that I think harks back to a much earlier era in the history of statistics, the era in which the field of statistics didn’t really exist. Before the field really had solidified itself as a separate discipline, people “doing statistics” <a href="http://simplystatistics.org/2011/09/10/what-is-a-statistician/">came from all areas of science as well as mathematics</a>. Here, the statistician was involved in all aspects of research and not just walled-off in a separate area dreaming up abstract methods. Many methods were later abstracted and generalized, but this largely grew out of an initial need to solve a specific problem.</p>
<p>As the field matured and separate Departments of Statistics started to appear, the discipline moved more towards a platform approach by focusing more on abstraction and generalizable approaches. It’s easy to see why this move would occur. If you’re trying to distinguish your discipline as being separate from other disciplines (and therefore deserving of separate resources), you need to demonstrate a unique contribution that is separate from the other fields and, in a sense, wall yourself off a little from the others. Given that computers were not generally available at the time this move began, mathematics was the most useful and easily accessible tool to build new statistical platforms.</p>
<p>Today, I think the field of statistics is moving back towards the old model of integrating closer with scientists in other disciplines. In particular, we are seeing more and more people “invading” the field from other related areas like computer science, just like the old days. Personally, I think these “outsiders” should be welcomed under our tent as they bring unique insights to our field and provide a richness not otherwise obtainable.</p>
<p>With the integrative statistical research model we see more statisticians “embedded” into the sciences, in the thick of it, so to speak, with involvement in every aspect. They publish in discipline-specific journals and in some cases are flat-out leading large-scale scientific collaborations. The reasons for this are many, but I think they center on advances in computer technology that have allowed for the rapid collection of large and complex datasets and the easy implementation of sophisticated models. The heterogeneity and unique complexity of these different datasets has required statisticians to dig deeper into the field and learn more of the substantive details before a useful contribution can be made. This accumulation of deep knowledge of a field occurs at the expense of being able to work in many different fields at once, or as John Tukey said, to “play in everyone’s backyard”.</p>
<p>The integrative approach to statistical research is exciting because it allows for the statistician to have a direct impact on a scientific discipline rather than a more indirect one through developing platforms. However, the approach is resource intensive in that it requires an interdisciplinary research environment with <a href="http://simplystatistics.org/2011/10/20/finding-good-collaborators/">good collaborators</a> in the relevant disciplines. As such, it may only be possible to take the integrative approach in certain institutions and environments. I think a similar argument could be made with respect to conducting platform research but I think there are many cases there where it was not strictly necessary.</p>
<p>In the next post, I’ll talk a bit (and give examples) about where I think the platform and integrative approaches may be more or less fruitful.</p>
Teaching least squares to a 5th grader by calibrating a programmable robot
2013-10-15T10:07:16+00:00
http://simplystats.github.io/2013/10/15/teaching-least-squares-to-a-5th-grader-by-calibrating-a-programmable-robot
<p>The Lego Mindstorms kit provides software and hardware to create programmable robots. A very simple first task is figuring out how to make the robot move any given distance. You get to program the number of wheel rotations. The video below shows how one can use this to motivate and teach least squares. We assumed the formula was distance = K × rotations, collected data for 1, 2, …, 10 rotations, then used R to motivate (via plots) and calculate the least squares estimate of K.</p>
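<p>For readers who want to try the calculation themselves, here is a minimal sketch of the least squares step. It is written in Python rather than the R we used, the distances are made-up numbers for illustration (not the measurements from the video), and the function name is hypothetical. For a line through the origin, the least squares estimate of K is the sum of rotations × distance divided by the sum of rotations squared.</p>

```python
# Least squares estimate of K in the model: distance = K * rotations
# (a line through the origin, so there is no intercept term).

def least_squares_slope(x, y):
    """Return the K that minimizes sum((y_i - K * x_i)**2)."""
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    return sum_xy / sum_xx

# Hypothetical data: distances in centimeters for 1, 2, ..., 10 rotations.
# These values are invented for illustration only.
rotations = list(range(1, 11))
distance_cm = [17.2, 35.1, 52.0, 70.3, 87.4, 105.9, 122.6, 140.8, 157.5, 175.6]

K_hat = least_squares_slope(rotations, distance_cm)
print(f"Estimated K: {K_hat:.2f} cm per rotation")

# To travel a target distance, divide it by the estimate of K.
target_cm = 100.0
print(f"Rotations needed for {target_cm} cm: {target_cm / K_hat:.2f}")
```

<p>Plotting distance against rotations before computing anything is still the best way to motivate the estimate, since the points visibly scatter around a line.</p>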
<p>Not shown in the video is my explanation of how we could also use the formula circumference = π × diameter to figure out K, and a discussion about which approach is better. The next project will be to calibrate turns, which are achieved by rotating the wheels in opposite directions. This time I will use both the geometric approach (compute the wheel circumference and the circumference defined by robot turns) and the statistical approach.</p>
A general audience friendly explanation for why Lars Peter Hansen won the Nobel Prize
2013-10-14T10:54:33+00:00
http://simplystats.github.io/2013/10/14/why-did-lars-peter-hansen-win-the-nobel-prize-generalized-method-of-moments-explained
<p><em>Lars Peter Hansen won the Nobel Prize in economics for creating the generalized method of moments. <a href="http://en.wikipedia.org/wiki/Generalized_method_of_moments">A rather technical</a> explanation of the idea appears on Wikipedia. <a href="http://lipas.uwasa.fi/~sjp/Teaching/gmm/lectures/gmmc3.pdf">These</a> are a good set of lecture notes on GMM if you like math. I went over to Marginal Revolution to see what was being written about the Nobel Prize winners. Clearly a bunch of other people were doing the same thing as the site was pretty slow to load. <a href="http://marginalrevolution.com/marginalrevolution/2013/10/lars-peter-hansen-nobel-laureate.html">Here is what Tyler C. says about Hansen</a>. In describing Hansen’s work he says:</em></p>
<blockquote>
<p>For years now journalists have asked me if Hansen might win, and if so, how they might explain his work to the general reading public. Good luck with that one.</p>
</blockquote>
<p><em><a href="http://marginalrevolution.com/marginalrevolution/2013/10/lars-peter-hansen-nobelist.html">Alex T. does a good job</a> of explaining the idea, but it still seems a bit technical for my tastes. <a href="http://noahpinionblog.blogspot.com/2013/10/lars-peter-hansen-explained-kind-of.html">Guan Y.</a> gives another good, and slightly less technical, explanation here, but it is still a little rough if you aren’t an economist. So I took a shot at an even more “general audience friendly” version below.</em></p>
<p>A very common practice in economics (and most other scientific disciplines) is to collect experimental data on two (or more) variables and to try to figure out if the variables are related to each other. A huge amount of statistical research is dedicated to this relatively simple-sounding problem. Lars Hansen won the Nobel Prize for his research on this problem because:</p>
<ol>
<li><strong>Economists (and scientists) hate assumptions they can’t justify with data and want to use the fewest number possible. </strong>The recent <a href="http://www.newyorker.com/online/blogs/johncassidy/2013/04/the-rogoff-and-reinhart-controversy-a-summing-up.html">Rogoff and Reinhart controversy</a> illustrates this idea. They wrote a paper that suggested public debt was bad for growth. But when they estimated the relationship between variables they made assumptions (chose weights) that have been questioned widely - suggesting that public debt might not be so bad after all. But not before a bunch of politicians used this result to justify austerity measures which had a huge impact on the global economy.</li>
<li><strong>Economists (and mathematicians) love to figure out the “one true idea” that encompasses many ideas.</strong> When you show something about the really general solution, you get all the particular cases for free. This means that all the work you do to show some statistical procedure is good helps not just you in a general sense, but all the specific cases that are examples of the general things you are talking about.</li>
</ol>
<p>I’m going to use a really silly example to illustrate the idea. Suppose that you collect information on the weight of animals’ bodies and the weight of their brains. You want to find out how body weight and brain weight are related to each other. You collect the data, and they might look something like this:<a href="http://simplystatistics.org/2013/10/14/why-did-lars-peter-hansen-win-the-nobel-prize-generalized-method-of-moments-explained/weights-2/" rel="attachment wp-att-1990"><img class="alignnone size-full wp-image-1990" alt="weights" src="http://simplystatistics.org/wp-content/uploads/2013/10/weights1.png" width="445" height="427" /></a></p>
<p>So it looks like if you have a bigger body you have a bigger brain (except for poor old Triceratops who is big but has a small brain). Now you want to say something quantitative about this. For example:</p>
<blockquote>
<p>Animals that are 1 kilogram larger have a brain that is on average k kilograms larger.</p>
</blockquote>
<p>How do you figure that out? Well one problem is that you don’t have infinite money so you only collected information on a few animals. But you don’t want to say something just about the animals you measured - you want to change the course of science forever and say something about the relationship between the two variables <em>for all animals</em>.</p>
<p>The best way to do this is to make some assumptions about what the measurements of brain and body weight look like if you could collect all of the measurements. It turns out if you assume that you know the complete shape of the distribution in this way, it becomes pretty straightforward (with a little math) to estimate the relationship between brain and body weight using something called <a href="http://en.wikipedia.org/wiki/Maximum_likelihood">maximum likelihood estimation</a>. This is probably the most common way that economists or scientists relate one variable to another (<a href="http://en.wikipedia.org/wiki/Ronald_Fisher">the inventor</a> of this approach is still waiting for his Nobel).</p>
<p>The problem is you assumed a lot to get your answer. For example, here are the data from just the brains that we have collected. It is pretty hard to guess exactly what shape the data from the whole world would look like.</p>
<p><a href="http://simplystatistics.org/2013/10/14/why-did-lars-peter-hansen-win-the-nobel-prize-generalized-method-of-moments-explained/brains/" rel="attachment wp-att-1995"><img class="alignnone size-full wp-image-1995" alt="brains" src="http://simplystatistics.org/wp-content/uploads/2013/10/brains.png" width="445" height="427" /></a></p>
<p>This presents the next problem: how do we know that we have the “right one”?</p>
<p>We don’t.</p>
<p>One way to get around this problem is to use a very old idea called the <a href="http://en.wikipedia.org/wiki/Method_of_moments_(statistics)">method of moments</a>. Suppose we believe the equation:</p>
<p style="text-align: center;">
<em>Average<strong> in World</strong> Body Weight = k * Average <strong>in World</strong> Brain Weight</em>
</p>
<p style="text-align: left;">
In other words, if we take any animal in the world and on average its brain weighs 5 kilos, then its body will on average weigh (k * 5) kilos. The relationship is only "on average" because there are a bunch of variables we didn't measure and they may affect the relationship between brain and body weight. You can see this in the scatterplot because the points don't all lie on a single line.
</p>
<p style="text-align: left;">
One way to estimate k is to just replace the numbers you wish you knew with the numbers you have in your sample:
</p>
<p><em>Average <strong>in Data you Have</strong> Body Weight = k * Average <strong>in Data you Have</strong> Brain Weight</em></p>
<p>Since you have the data the only thing you don’t know in the equation is k, so you can solve the equation and get an estimate. The nice thing here is we don’t have to say much about the shape of the data we expect for body weight or brain weight. <em>We just have to believe this one equation</em>. The key insight here is that you don’t have to know the whole shape of the data, just one part of it (the average). An important point to remember is that you are still making some assumptions here (that the average is a good thing to estimate, for example) but they are definitely fewer assumptions than you make if you go all the way and specify the whole shape, or distribution, of the data.</p>
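<p>As a minimal sketch of that calculation, assuming some made-up animal weights (the numbers and the function name below are hypothetical and are not the data in the plots), solving the equation for k just means dividing one sample average by the other:</p>

```python
# Method of moments for the single equation
#   average body weight = k * average brain weight
# Replace the world averages with sample averages and solve for k.

def method_of_moments_k(body, brain):
    """Estimate k as (sample mean of body) / (sample mean of brain)."""
    mean_body = sum(body) / len(body)
    mean_brain = sum(brain) / len(brain)
    return mean_body / mean_brain

# Hypothetical weights in kilograms, invented for illustration only.
body_kg = [3.5, 60.0, 520.0, 4500.0, 25.0]
brain_kg = [0.03, 1.30, 0.65, 5.00, 0.10]

k_hat = method_of_moments_k(body_kg, brain_kg)
print(f"Estimated k: {k_hat:.1f}")
```

<p>Notice that nothing in the function says anything about the shape of the distribution of either variable. The only ingredient is the pair of averages.</p>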
<p>This is a pretty oversimplified version of the problem that Hansen solved. In reality when you make assumptions about the way the world works you often get more equations like the one above than variables you want to estimate. Solving all of those equations is now complicated because the answers from different equations might contradict each other (the technical word is <a href="http://en.wikipedia.org/wiki/Overdetermination">overdetermined</a>).</p>
<p>Hansen showed that in this case you can take the equations and multiply them by a set of weights. You put more weight on equations you are more sure about, then add them up. If you choose the weights well, you avoid the problem of having too many equations for too few variables. This is the thing he won the prize for - the <strong>generalized method of moments</strong>.</p>
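<p>Here is a toy sketch of the weighting idea, under assumptions that are mine and not Hansen's. I assume two equations for the single unknown k: the average equation from above, plus a second one, Average (Body × Brain) = k × Average (Brain × Brain). I also assume simple non-negative weights, one per equation, and pick the k that makes the weighted sum of squared equation errors as small as possible. The data and function name are hypothetical, and this does not reproduce how the weights are chosen optimally in the real method.</p>

```python
# Toy generalized method of moments with two equations and one unknown k.
# Equation 1: mean(body)         = k * mean(brain)
# Equation 2: mean(body * brain) = k * mean(brain * brain)
# Each equation has the form a_j = k * b_j. We choose k to minimize
#   sum_j w_j * (a_j - k * b_j)**2
# which has the closed form k = sum(w_j * b_j * a_j) / sum(w_j * b_j**2).

def mean(values):
    return sum(values) / len(values)

def weighted_moments_k(body, brain, weights):
    """Combine two moment equations using one weight per equation."""
    a = [mean(body), mean([y * x for y, x in zip(body, brain)])]
    b = [mean(brain), mean([x * x for x in brain])]
    numerator = sum(w * bj * aj for w, aj, bj in zip(weights, a, b))
    denominator = sum(w * bj * bj for w, bj in zip(weights, b))
    return numerator / denominator

# Hypothetical weights in kilograms, invented for illustration only.
body_kg = [3.5, 60.0, 520.0, 4500.0, 25.0]
brain_kg = [0.03, 1.30, 0.65, 5.00, 0.10]

# All weight on equation 1 gives the plain ratio of averages.
print(weighted_moments_k(body_kg, brain_kg, weights=(1.0, 0.0)))
# All weight on equation 2 gives the least squares slope through the origin.
print(weighted_moments_k(body_kg, brain_kg, weights=(0.0, 1.0)))
# Mixed weights give a compromise between the two equations.
print(weighted_moments_k(body_kg, brain_kg, weights=(1.0, 1.0)))
```

<p>The two equations disagree about k on real data, which is the “too many equations” problem. The weights decide how the disagreement gets settled.</p>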
<p>This is all a big deal because the variables that economists measure frequently aren’t very pretty. One common way they aren’t pretty is that they are often measured over time, with complex relationships between values at different time points. That means it is hard to come up with realistic assumptions about what the data may look like.</p>
<p>By proposing an approach that doesn’t require as many assumptions Hansen satisfied criterion (1) for things economists like. And, if you squint just right at the equations he proposed, you can see they actually are a general form of a bunch of other estimation techniques like <a href="http://en.wikipedia.org/wiki/Maximum_likelihood">maximum likelihood estimation</a> and <a href="http://en.wikipedia.org/wiki/Instrumental_variable">instrumental variables</a>, which made it easier to prove theoretical results and satisfied criterion (2) for things economists like.</p>
<p><em>---</em></p>
<p><em>Disclaimer: This post was written for a general audience and may cause nerd-rage in those who see (important) details I may have skimmed over. </em></p>
<p><em>Disclaimer #2: I’m not an economist. So I can’t talk about economics. There are reasons GMM is useful economically that I didn’t even talk about here.</em></p>
Sunday data/statistics link roundup (10/13/13)
2013-10-13T14:31:17+00:00
http://simplystats.github.io/2013/10/13/sunday-datastatistics-link-roundup-101313
<ol>
<li>A really interesting comparison <a href="http://marginalrevolution.com/marginalrevolution/2013/10/online-education-and-the-tivo-revolution.html">between educational and TV menus</a> (via Rafa). On a related note, it will be interesting to see <a href="http://www.slate.com/articles/technology/education/2013/09/edx_mit_and_online_certificates_how_non_degree_certificates_are_disrupting.html">how/whether the traditional educational system will be disrupted</a>. I’m as into the MOOC thing as the next guy, but I’m not sure I buy a series of pictures from your computer as “validation” you took/know the material for a course. Also I’m not 100% sure about what this is, but it has the potential to be kind of awesome - <a href="https://www.moocdemic.com/">the Moocdemic</a>.</li>
<li><a href="http://www.sciencemag.org/content/342/6154/60.full">This piece</a> of “investigative journalism” had the open-access internet up in arms. The piece shows pretty clearly that there are bottom-feeding journals who will use unscrupulous tactics and claim peer review while doing no such thing. But it says basically nothing about open access as far as I can tell. On a related note, a couple of years ago we <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895">developed an economic model for peer review</a>, then tested the model out. In a very contrived/controlled system we showed peer review improves accuracy, even when people aren’t incentivized to review.</li>
<li>Related to <a href="http://simplystatistics.org/2013/10/10/cancelled-nih-study-sections-a-subtle-yet-disastrous-effect-of-the-government-shutdown/">our guest post</a> on NIH study sections is this <a href="http://www.nature.com/news/nih-campus-endures-slow-decay-1.13942">pretty depressing piece in Nature</a>.</li>
<li>One of JHU Biostat’s NSF graduate research fellows <a href="http://stattrak.amstat.org/2013/10/01/fellowship-experience/">was interviewed by Amstat News</a>.</li>
<li>Jenny B. has <a href="http://www.stat.ubc.ca/~jenny/STAT545A/2012-lectures/">some great EDA lectures</a> you should check out.</li>
</ol>
Why do we still teach a semester of trigonometry? How about engineering instead?
2013-10-11T10:54:19+00:00
http://simplystats.github.io/2013/10/11/why-do-we-still-teach-a-semester-of-trigonometry-how-about-engineering-instead
<p>Arthur Benjamin says we should <a href="http://www.ted.com/talks/arthur_benjamin_s_formula_for_changing_math_education.html">teach statistics before calculus</a>. He points out that most of what we do in high school math is preparing us for calculus. He makes the point that while physicists, engineers and economists need calculus, in the digital age, discrete math, probability and statistics are much more relevant to everyone else. I agree with him and was happy to see Statistics as part of the <a href="http://www.corestandards.org/Math">common core</a>. However, other topics I wish were there, such as engineering, programming, and finance, are missing.</p>
<p>This Saturday I took my 5th grader to a 3 hour robotics workshop. We both enjoyed it thoroughly. We built and programmed two-wheeled robots to, among <a href="http://www.youtube.com/watch?v=24sX9MtqQNA">other things</a>, go around a table. To make this happen we learned about measurement error, how to use a protractor, that C = π d, a bit of algebra, how to use grid searches, if-else conditionals, and for-loops. Meanwhile during a semester of high school trigonometry we learn <a href="http://www.sosmath.com/trig/Trig5/trig5/trig5.html">this</a> (do you remember that 2 sin^2 x = 1 - cos 2x?). Of course it is important to know trigonometry, but do we really need to learn to derive and memorize these identities that are rarely used and are readily available from a smartphone? One could easily teach the fundamentals as part of an applied class such as robotics. We can ask questions like: if while turning you make a mistake of 0.5 degrees, by how much will your robot miss its mark after traveling one meter? We can probably teach the fundamentals of trigonometry in about 2 weeks, later using these concepts in applied problems.</p>
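<p>The robot question above is a one-line trigonometry problem. Here is a small sketch, assuming the robot travels in a perfectly straight line along the wrong heading and that “miss” means the sideways distance from the intended path (the function name is hypothetical):</p>

```python
import math

def sideways_miss(distance_m, heading_error_deg):
    """Sideways offset from the intended path after driving distance_m
    along a heading that is off by heading_error_deg degrees."""
    return distance_m * math.sin(math.radians(heading_error_deg))

miss_m = sideways_miss(1.0, 0.5)
print(f"Miss after one meter: {miss_m * 1000:.1f} mm")
```

<p>With these assumptions the robot ends up a bit under 9 millimeters off its mark, which is the kind of number a 5th grader can check with a ruler.</p>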
Cancelled NIH study sections: a subtle, yet disastrous, effect of the government shutdown
2013-10-10T10:00:05+00:00
http://simplystats.github.io/2013/10/10/cancelled-nih-study-sections-a-subtle-yet-disastrous-effect-of-the-government-shutdown
<p><em>Editor’s note:</em> <em>This post is contributed by <a href="http://stat.psu.edu/people/dug10">Debashis Ghosh</a>. Debashis is the chair of the Biostatistical Methods and Research Design (BMRD) study section at the National Institutes of Health (NIH). BMRD’s focus is statistical methodology.</em></p>
<p>I write today to discuss an effect of the government shutdown that will likely have disastrous long-term consequences for the state of biomedical and scientific research: the cancellation of NIH study sections. A list of the study sections can be found at <a href="http://public.csr.nih.gov/StudySections/Standing/Pages/default.aspx">http://public.csr.nih.gov/StudySections/Standing/Pages/default.aspx</a>. These are panels of distinguished scientists in their fields that meet three times a year to review grant submissions to the NIH by investigators. For most professors and scientists that work in academia, these grants provide the means of conducting research and funding for staff such as research associates, postdocs and graduate students. At most universities and medical schools in the U.S., having an independent research grant is necessary for assistant professors to be promoted and get tenure (of course, there is some variation in this across all universities).</p>
<p>Yesterday, I was notified by NIH that the BMRD October meeting was cancelled and postponed until further notice. I could not communicate with NIH staff about this because they are on furlough, meaning that they are not able to send or receive email or other communications. This means that our study section will not be reviewing grants in October. People who receive funding from NIH grants are familiar with the usual routine of submitting grants three times a year and getting reviewed approximately 6 months after submission. This process has now stopped because of the government shutdown, and it is unclear when it will restart. The section I chair is but one of 160 regular study sections and many of them would be meeting in October. In fact, I was involved with a grant submitted to another study section that would have met on October 8, but this meeting did not happen.</p>
<p>The stoppage has many detrimental consequences. Because BMRD will not be reviewing the submitted grants at the scheduled time, they will lack a proper scientific evaluation. The NIH review process separates the scientific evaluation of grants from the actual awarding of funding. While there have been many criticisms of the process, it has also been acknowledged that the U.S. scientific research community has been the leader in the world, and NIH grant review has played a role in this status. With the suspension of activities, the status that the U.S. currently enjoys is in peril. It is interesting to note that now many countries are attempting to install a review process similar to the one at NIH (R. Nakamura, personal communication).</p>
<p>The effects of the shutdown are perilous for the investigators that are submitting grants. Without the review, their grants cannot be evaluated and funded. This lag in the funding timeline stalls research, and in scientific research a slow stall now is more disastrous in the long term. The type of delay described here will mean layoffs for lab technicians and research associates that are funded by grants needing renewal as well as a hiring freeze for new lab personnel using newly funded grants. This delay and loss of labor will diminish the existing scientific knowledge base in the U.S., which leads to a loss of the competitive advantage we have enjoyed as a nation for decades in science.</p>
<p>Economically, the delay has a huge impact as well. Suppose there is a delay of three months in funding decisions. In the case of NIH grants, this is tens of millions of dollars that are not being given out for scientific research for a period of three months. The rate of return of these grants has been estimated to be 25 to 40 percent a year (<a href="http://www.faseb.org/portals/0/pdfs/opa/2008/nih_research_benefits.pdf">http://www.faseb.org/portals/0/pdfs/opa/2008/nih_research_benefits.pdf</a>), and the findings from these grants have the potential to benefit thousands of patients a year by increasing their survival or improving the quality of their lives. In the starkest possible terms, more medical patients will die and suffer because the government shutdown is forcing the research that provides new methods of diagnosis and treatment to grind to a halt.</p>
<p>Note: The opinions expressed here represent my own and not those of my employer, Penn State University, nor those of the National Institutes of Health nor the Center for Scientific Review.</p>
The Care and Feeding of Your Scientist Collaborator
2013-10-09T08:55:52+00:00
http://simplystats.github.io/2013/10/09/the-care-and-feeding-of-your-scientist-collaborator
<p><em>Editor’s Note: This post written by Roger Peng is part of a two-part series on Scientist-Statistician interactions. The <a href="http://simplystatistics.org/2013/10/08/the-care-and-feeding-of-the-biostatistician/">first post</a> was written by <a href="http://www.hopkinschildrens.org/elizabeth-matsui-md.aspx">Elizabeth C. Matsui</a>, an Associate Professor in the Division of Allergy and Immunology at the Johns Hopkins School of Medicine.</em></p>
<p>This post is a followup to Elizabeth Matsui’s <a href="http://simplystatistics.org/2013/10/08/the-care-and-feeding-of-the-biostatistician/">previous post</a> for scientists/clinicians on collaborating with biostatisticians. Elizabeth and I have been working together for over half a decade and I think the story of how we started working together is perhaps a brief lesson on collaboration in and of itself. Basically, she emailed someone who didn’t have time, so that person emailed someone else who didn’t have time, so that person emailed someone else who didn’t have time, so that person emailed me, who as a mere assistant professor had plenty of time! A few people I’ve talked to are irked by this process because it feels like you’re someone’s fourth choice. But personally, I don’t care. I’d say almost all my good collaborations have come about this way. To me, it either works or it doesn’t work, regardless of where on the list you were when you were contacted.</p>
<p>I’ve written before about <a href="http://simplystatistics.org/2011/10/20/finding-good-collaborators/">how to find good collaborators</a> (although I neglected to mention the process described above), but this post tries to answer the question, “Now that I’ve found this good collaborator, what do I do with her/him?” Here are some thoughts I’ve accumulated over the years.</p>
<ul>
<li>
<p><strong>Understand that a scientist is not a fountain from which “the numbers” flow</strong>. Most statisticians like to work with data, and some even need it to demonstrate the usefulness of their methods or theory. So there’s a temptation to go “find a scientist” to “<a href="http://simplystatistics.org/2012/01/08/where-do-you-get-your-data/">give you some data</a>”. This is starting off on the wrong foot. If you picture your collaborator as a person who hands over the data and then you never talk to that person again (because who needs a clinician for a JASA paper?), then things will probably not end up so great. And I think there are two ways in which the experience will be sub-optimal. First, your scientist collaborator may feel miffed that you basically went off and did your own thing, making her/him less inclined to work with you in the future. Second, the product you end up with (paper, software, etc.) might not have the same impact on science as it would have had if you’d worked together more closely. This is the bigger problem: see #5 below.</p>
</li>
<li>
<p><strong>All good collaborations involve some teaching: Be patient, not patronizing</strong>. Statisticians are often annoyed that “So-and-so didn’t even know this” or “they tried to do this with a sample size of 3!” True, there are egregious cases of scientists with a lack of basic statistical knowledge, but in my experience, all good collaborations involve some teaching. Otherwise, why would you collaborate with someone who knows exactly the same things that you know? Just like it’s important to take some time to learn the discipline that you’re applying statistical methods to, it’s important to take some time to describe to your collaborator how those statistical methods you’re using really work. Where does the information in the data come from? What aspects are important; what aspects are not important? What do parameter estimates mean in the context of this problem? If you find you can’t actually explain these concepts, or become very impatient when they don’t understand, that may be an indication that there’s a problem with the method itself that may need rethinking. Or maybe you just need a simpler method.</p>
</li>
<li>
<p><strong>Go to where they are</strong>. This bit of advice I got from <a href="http://www.biostat.jhsph.edu/~szeger/">Scott Zeger</a> when I was just starting out at Johns Hopkins. His bottom line was that if you understand where the data come from (as in literally, the data come from this organ in this person’s body), then you might not be so flippant about asking for an extra 100 subjects to have a sufficient sample size. In biomedical science, the data usually come from people. Real people. And the job of collecting that data, the scientist’s job, is usually not easy. So if you have a chance, go see how the data are collected and what needs to be done. Even just going to their office or lab for a meeting rather than having them come to you can be helpful in understanding the environment in which they work. I know it can feel nice (and convenient) to have everyone coming to you, but that’s crap. Take the time and go to where they are.</p>
</li>
<li>
<p><strong>Their business is your business, so pitch in</strong>. A lot of research (and actually most jobs) involves doing things that are not specifically relevant to your primary goal (a paper in a good journal). But sometimes you do those things to achieve broader goals, like building better relationships and networks of contacts. This may involve, say, doing a sample size calculation once in a while for a new grant that’s going in. That may not be pertinent to your current project, but it’s not that hard to do, and it’ll help your collaborator a lot. You’re part of a team here, so everyone has to pitch in. In a restaurant kitchen, even the Chef works the line once in a while. Another way to think of this is as an investment. Particularly in the early stages there’s going to be a lot of ambiguity about what should be done and what is the best way to proceed. Sometimes the ideal solution won’t show itself until much later (the so-called “j-shaped curve” of investment). In the meantime, pitch in and keep things going.</p>
</li>
<li>
<p><strong>Your job is to advance the science</strong>. In a good collaboration, everyone should be focused on the same goal. In my area, that goal is improving public health. If I have to prove a theorem or develop a new method to do that, then I will (or at least try). But if I’m collaborating with a biomedical scientist, there has to be an alignment of long-term goals. Otherwise, if the goals are scattered, the science tends to be scattered, and ultimately sub-optimal with respect to impact. I actually think that if you think of your job in this way (to advance the science), then you end up with better collaborations. Why? Because you start looking for people who are similarly advancing the science and having an impact, rather than looking for people who have “good data”, whatever that means, for applying your methods.</p>
</li>
</ul>
<p>In the end, I think statisticians need to focus on two things: Go out and find the best people to work with and then help them advance the science.</p>
The Care and Feeding of the Biostatistician
2013-10-08T10:33:10+00:00
http://simplystats.github.io/2013/10/08/the-care-and-feeding-of-the-biostatistician
<p><em>Editor’s Note: This guest post was written by <a href="http://www.hopkinschildrens.org/elizabeth-matsui-md.aspx">Elizabeth C. Matsui</a>, an Associate Professor in the Division of Pediatric Allergy and Immunology at the Johns Hopkins School of Medicine.</em></p>
<p>I’ve been collaborating with Roger for several years now and we have had quite a few discussions about characteristics of a successful collaboration between a clinical investigator and a biostatistician. I can’t remember for certain, but I think that <a href="http://www.youtube.com/watch?v=Hz1fyhVOjr4">this cartoon</a> may have been the impetus for some of our discussions. I have joked that I should write a guide for clinical investigators entitled, “The Care and Feeding of the Biostatistician.” Fortunately, Roger has a good sense of humor and appreciates the ironic title, so he asked me to write down a few thoughts for Simply Statistics. Forging successful collaborations may seem less important than other skills such as grant writing, but successful collaboration is an important determinant of career success, and for many people, an enormous source of career satisfaction. And in the current scientific environment in which large, complex datasets and sophisticated quantitative and data visualization methods are becoming increasingly common, collaboration with biostatisticians is necessary to harness the full potential of your data and to have the greatest scientific impact. In some cases, not engaging a biostatistical collaborator may put you at risk of making statistical missteps that could result in erroneous results.</p>
<ul>
<li>
<p><strong>Be respectful of time</strong>. This tenet, of course, is applicable to all collaborations, but may be a more common stumbling block for clinical investigators working with biostatisticians. Most power estimates and sample size calculations, for example, are more complex than appreciated by the clinical investigator. A discussion about the research question, primary outcome, etc. is required and some thought has to go into determining the most appropriate approach before your biostatistician collaborator has even laid hands on the keyboard and fired up R. At a minimum, engage your biostatistician collaborator earlier than you might think necessary, and ideally, solicit their input during the planning stages. Engaging a biostatistician sooner rather than later not only fosters good will, but will also improve your science. A biostatistician’s time, like yours, is valuable, so respect their time by allocating an appropriate level of salary support on grants. Most academicians I come across appreciate that budgets are tight, so they understand that they may not get the level of salary support that they think is most appropriate. However, “finding room” in the budget for 1% salary support for a biostatistician sends the message that the biostatistician is an afterthought, a necessity for a sample size calculation and a competitive grant application, but in the end, just a formality. Instead, dedicate sufficient salary support in your grant to support the level of biostatistical effort that will be needed. This sends the message that you would like your biostatistician collaborator to be an integral part of the investigator team and provides an opportunity for the kind of regular, ongoing interactions that are needed for productive collaborations.</p>
</li>
<li>
<p><strong>Understand that a biostatistician is not a computational tool</strong>. Although sample size and power calculations are probably the most common service solicited from biostatisticians, and biostatisticians can be enormously helpful in this arena, they have the most impact when they are engaged in discussions about study designs and analytic approaches for a scientific question. Their quantitative approach to scientific problems provides a fresh perspective that can increase the scientific impact of your work. My sense is that this is also much more interesting work for a biostatistician than sample size and power calculations, and engaging them in interesting work goes a long way towards cementing a mutually productive collaboration.</p>
</li>
<li>
<p><strong>Make an effort to learn the language of biostatistics</strong>. Technical jargon is a serious impediment to successful collaboration. Again, this is true of all cross-discipline collaborations, but may be particularly true in collaborations with biostatisticians. The field has a penchant for eponymous methods (Hosmer-Lemeshow, Wald, etc.) and terminology that is entertaining, but not intuitive (jackknife, bootstrapping, lasso). While I am not suggesting that a clinical investigator needs to enroll in biostatistics courses (why gain expertise in a field when your collaborator provides this expertise), I am advocating for educating yourself about the basic concepts and terminology of statistics. Know what is meant by: distribution of a variable, predictor variable, outcome variable, and variance, for example. There are some terrific “Biostatistics 101”-type lectures and course materials online that are excellent resources. But also lean on your biostatistician collaborator by asking him/her to explain terminology and teach you these basics and do not be afraid to ask questions.</p>
</li>
<li>
<p><strong>When all else fails (and even when all else doesn’t fail), draw pictures</strong>. In truth, this is often the place where I start when I first engage a biostatistician. Showing your biostatistician collaborator what you expect your data to look like in a figure or conceptual diagram simplifies communication because it avoids jargon, and biostatisticians can readily grasp from a figure or diagram the key information they need to come up with a sample size estimate or analytic approach.</p>
</li>
<li>
<p><strong>Teach them your language</strong>. Clinical medicine is also rife with jargon, and just as biostatistical jargon can make it difficult to communicate clearly with a biostatistician, so can clinical jargon. Avoid technical jargon where possible, and define terminology where it is not possible. Educate your collaborator about the background, context and rationale for your scientific question and encourage questions.</p>
</li>
<li>
<p><strong>Generously share your data and ideas</strong>. In many organizations, biostatisticians are very interested in developing new methods, applying more sophisticated methods to an “old” problem, and/or answering their own scientific questions. Do what you can to support these career interests, such as sharing your data and your ideas. Sharing data opens up avenues for increasing the impact of your work, as your biostatistician collaborator has opportunities to develop quantitative approaches to answering research questions related to your own interests. Sharing data alone is not sufficient, though. Discussions about what you see as the important, unanswered questions will help provide the necessary background and context for the biostatistician to make the most of the available data. As highlighted in a recent <a href="http://www.nytimes.com/2013/03/31/magazine/is-giving-the-secret-to-getting-ahead.html?ref=magazine&pagewanted=all&_r=0">book</a>, giving may be an important and overlooked component of success, and I would argue, also a cornerstone of a successful collaboration.</p>
</li>
</ul>
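<p>To make the point about sample size calculations concrete, here is a minimal sketch of the simplest textbook case: a two-sided comparison of means between two equal-sized groups, using the normal approximation. The function name <code>n_per_group</code> and the example inputs are made up for illustration, and a real study will usually need more than this (dropout, clustering, multiple outcomes, and so on). Even this toy version cannot be run until the clinically meaningful difference (<code>delta</code>) and the outcome’s variability (<code>sigma</code>) have been agreed on, which is exactly the conversation described above.</p>

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate subjects per arm for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2.
    delta is the smallest difference in means worth detecting and sigma is the
    assumed common standard deviation; both are inputs the investigator supplies.
    """
    if delta == 0:
        raise ValueError("delta must be non-zero")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical inputs: how the answer moves as the target difference shrinks.
for delta in (1.0, 0.5, 0.25):
    print(f"delta={delta}: {n_per_group(delta=delta, sigma=1.0)} per group")
```

<p>Under these assumptions, a difference of half a standard deviation needs roughly 63 subjects per arm, and halving the target difference roughly quadruples that number. The arithmetic is the easy part; choosing defensible values for the inputs is where your biostatistician’s time goes.</p>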
The Leek group policy for developing sustainable R packages
2013-10-07T11:56:40+00:00
http://simplystats.github.io/2013/10/07/the-leek-group-policy-for-developing-sustainable-r-packages
<p>As my group has grown over the past few years and I have more people writing software, I have started to progressively freak out more and more about how to make sure that the software is sustainable as students graduate and move on to bigger and better things. I am also concerned with maintaining quality of the software we are developing in a field where the pace of development/discovery is so high.</p>
<p>As a person who simultaneously (a) has no formal training in CS or software development and (b) believes that <a href="http://simplystatistics.org/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper/">if there is no software there is no paper</a>, I am worried about creating a bunch of unsustainable software. So I solicited the advice of people around here who know more about it than I do and I collected my past experience with creating software and how I screwed it up. I put it all together in the <a href="https://github.com/jtleek/rpackages">Leek group guide to building and maintaining software packages</a>.</p>
<p>The guide covers (among other things):</p>
<ul>
<li>When to start building a package</li>
<li>How to version the package</li>
<li>How to document the package</li>
<li>What not to include</li>
<li>How to build unit tests</li>
<li>How to create a vignette</li>
<li>The commitment I expect in terms of software maintenance</li>
</ul>
<p>I put it on GitHub because I’m still not 100% sure I got it right. The policy takes effect as of now. But I would welcome feedback/pull requests on how we can improve the policy and reduce the probability that I end up with a bunch of broken packages when all my awesome students, who are much better coders than I am, eventually graduate.</p>
Sunday data/statistics link roundup (10/6/2013)
2013-10-06T14:52:47+00:00
http://simplystats.github.io/2013/10/06/sunday-datastatistics-link-roundup-1062013
<ol>
<li><span style="line-height: 16px;"><a href="http://www.gwern.net/The%20Existential%20Risk%20of%20Mathematical%20Error">A fascinating read</a> about applying decision theory to mathematical proofs. They talk about Type I and Type II errors and everything. </span></li>
<li>Statistical concepts <a href="http://www.youtube.com/watch?v=VFjaBh12C6s&list=PLCkLQOAPOtT2H1hJRUxUYOxThRwfVI9jI&index=1">explained through dance</a>. Even for a pretty culture-deficient dude like me this is cool.</li>
<li>Lots of good talks from the <a href="http://www.winworkshop.net/videos.php">WIN Workshop</a>, including by one of our speakers for the <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">Unconference on the Future of Statistics</a>.</li>
<li>The best advice for graduate students (or any academics) I have seen in my time writing the Sunday Links. <a href="https://blogs.akamai.com/2013/10/you-must-try-and-then-you-must-ask.html">You must try, and then you must ask</a> (via Seth F.).</li>
<li>Alberto C. has a MOOC on <a href="http://www.thefunctionalart.com/2013/09/the-third-introduction-to-infographics.html">infographics and visualization</a> that looks pretty cool. That way you can <a href="http://xkcd.com/1273/">avoid this kind of thing</a>.</li>
<li><a href="https://twitter.com/AstroKatie/status/386757429351813120/photo/1">This picture is awesome</a>. Nothing to do with statistics. (via @AstroKatie).</li>
<li>If you aren’t reading Thomas L.’s <a href="http://notstatschat.tumblr.com/">notstatschat</a>, you should be.</li>
<li>Karl B. has an interesting <a href="http://www.biostat.wisc.edu/~kbroman/presentations/openaccess.pdf">presentation on open access</a> that is itself open access. First Beamer theme I’ve seen that didn’t make me want to cover my eyes in sadness. My only problem is I wish open access publishing <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">wasn’t so expensive</a>. Can’t we just use a blog/<a href="http://figshare.com/">figshare</a> to publish journals that are almost as good? <a href="http://www.sciencemag.org/content/342/6154/66.full?sid=cb2de807-61a8-4dda-ba15-3b4c76e0c627&utm_content=buffer8aaf1&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer">This dude</a> says peer review is old news anyway.</li>
</ol>
Repost: Finding good collaborators
2013-10-04T14:02:39+00:00
http://simplystats.github.io/2013/10/04/repost-finding-good-collaborators
<p><em>Editor’s note: Simply Statistics is still freaking out about the government shut down and potential impending economic catastrophe if the debt ceiling isn’t raised. Since anything new we might write seems trivial compared to what is going on in Washington, we are reposting an awesome old piece by Roger on finding good collaborators. </em></p>
<p>The job of the statistician is almost entirely about collaboration. Sure, there’s theoretical work that we can do by ourselves, but most of the impact that we have on science comes from our work with scientists in other fields. Collaboration is also what makes the field of statistics so much fun.</p>
<p>So one question I get a lot from people is “how do you find good collaborations”? Or, put another way, how do you find good collaborators? It turns out this distinction is more important than it might seem.</p>
<p>My approach to developing collaborations has evolved over time and I consider myself fairly lucky to have developed a few very productive and very enjoyable collaborations. These days my strategy for finding good collaborations is to look for good collaborators. I personally find it important to work with people that I like as well as respect as scientists, because a good collaboration is going to involve a lot of personal interaction. A place like Johns Hopkins has no shortage of very intelligent and very productive researchers that are doing interesting things, but that doesn’t mean you want to work with all of them.</p>
<p>Here’s what I’ve been telling people lately about finding collaborations, which is a mish-mash of a lot of advice I’ve gotten over the years.</p>
<ol>
<li><strong>Find people you can work with</strong>. I sometimes see situations where a statistician will want to work with someone because he/she is working on an important problem. Of course, you want to be working on a problem that interests you, but it’s only partly about the specific project. It’s very much about the person. If you can’t develop a strong working relationship with a collaborator, both sides will suffer. If you don’t feel comfortable asking (stupid) questions, pointing out problems, or making suggestions, then chances are the science won’t be as good as it could be.</li>
<li><strong>It’s going to take some time</strong>. I sometimes half-jokingly tell people that good collaborations are what you’re left with after getting rid of all your bad ones. Part of the reasoning here is that you actually may not know what kinds of people you are most comfortable working with. So it takes time and a series of interactions to learn these things about yourself and to see what works and doesn’t work. Of course, you can’t take forever, particularly in academic settings where the tenure clock might be ticking, but you also can’t rush things either. One rule I heard once was that a collaboration is worth doing if it will likely end up with a published paper. That’s a decent rule of thumb, but see my next comment.</li>
<li><strong>It’s going to take some more time</strong>. Developing good collaborations will usually take some time, even if you’ve found the right person. You might need to learn the science, get up to speed on the latest methods/techniques, learn the jargon, etc. So it might be a while before you can start having intelligent conversations about the subject matter. Then it takes time to understand how the key scientific questions translate to statistical problems. Then it takes time to figure out how to develop new methods to address these statistical problems. So a good collaboration is a serious long-term investment which has some risk of not working out. There may not be a lot of papers initially, but the idea is to make the early investment so that truly excellent papers can be published later.</li>
<li><strong>Work with people who are getting things done</strong>. Nothing is more frustrating than collaborating on a project with someone who isn’t that interested in bringing it to a close (e.g., a published paper or a completed software package). Sometimes there isn’t a strong incentive for the collaborator to finish (e.g., she/he is already tenured) and other times things just fall by the wayside. So finding a collaborator who is continuously getting things done is key. One way to determine this is to check out their CV. Is there a steady stream of productivity? Papers in good journals? Software used by lots of other people? Grants? Web site that’s not in total disrepair?</li>
<li><strong>You’re not like everyone else</strong>. One thing that surprised me was discovering that just because someone you know works well with a specific person doesn’t mean that <em>you</em> will work well with that person. This sounds obvious in retrospect, but there were a few situations where a collaborator was recommended to me by a source that I trusted completely, and yet the collaboration didn’t work out. The bottom line is to trust your mentors and friends, but realize that differences in personality and scientific interests may determine a different set of collaborators with whom you work well.</li>
</ol>
<p>These are just a few of my thoughts on finding good collaborators. I’d be interested in hearing others’ thoughts and experiences along these lines.</p>
Statistical Ode to Mariano Rivera
2013-09-30T10:00:06+00:00
http://simplystats.github.io/2013/09/30/statistical-ode-to-mariano-rivera
<p>Mariano Rivera is an outlier in many ways. The plot below shows one of them: top 10 pitchers ranked by postseason saves.</p>
<p><a href="http://simplystatistics.org/?attachment_id=1922" rel="attachment wp-att-1922"><img class="alignnone wp-image-1922" alt="mariano" src="http://simplystatistics.org/wp-content/uploads/2013/09/mariano.png" width="500" height="500" srcset="http://simplystatistics.org/wp-content/uploads/2013/09/mariano-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2013/09/mariano-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2013/09/mariano-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2013/09/mariano.png 4200w" sizes="(max-width: 500px) 100vw, 500px" /></a></p>
Sunday data/statistics link roundup (9/29/13)
2013-09-29T13:19:26+00:00
http://simplystats.github.io/2013/09/29/sunday-datastatistics-link-roundup-92913
<p>The links are back! Read on.</p>
<ol>
<li><span style="line-height: 15.994318008422852px;">Susan Murphy - a statistician - <a href="http://ns.umich.edu/new/multimedia/videos/21711-u-m-professor-susan-murphy-earns-prestigious-macarthur-fellowship">wins a MacArthur Fellowship</a>. Great for the field of statistics (via Dan S. and Simina B., among others).</span></li>
<li>Related: an <a style="font-size: 16px;" href="http://www.youtube.com/watch?v=heWEDx1gbB0">Interview with David Donoho</a> about the Shaw Prize. Statisticians are blowing up! (via Rafa)</li>
<li>Hope that the award winners <a href="http://www.nber.org/papers/w19445">don’t lose momentum</a>! (via Andrew J.)</li>
<li>Hopkins grad students <a href="http://www.baltimoresun.com/news/opinion/oped/bs-ed-biomedical-research,0,6244826.story">take to the Baltimore Sun</a> to report yet more ongoing negative effects of sequestration. Particularly appropriate in light of the current mayhem around keeping the government open. (via Rafa)</li>
<li>Great <a href="http://www.youtube.com/watch?v=1OQvGvQAI7A">BBC piece</a> featuring David Spiegelhalter on the science of chance. I rarely watch YouTube videos that long all the way through, but I made it to the end of this one.</li>
<li>Love how Yahoo finance has recognized the agonized cries of statisticians and <a href="http://finance.yahoo.com/news/where-americans-rich-poor-spent-193609849.html">is converting pie charts to bar charts</a>. (via Rafa - <a href="http://simplystatistics.org/2012/11/27/i-give-up-i-am-embracing-pie-charts/">who has actually given up on the issue</a>).</li>
<li><a href="http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html">Don’t use Hadoop - your data aren’t that big</a>.</li>
<li>Don’t forget to <a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">sign up</a> for the Future of Statistics Unconference on October 30th, noon-1pm Eastern. We have an awesome lineup of speakers and over 500 people have RSVP’d on Google Plus alone. It’s going to be a thing.</li>
</ol>
Announcing Statistics with Interactive R Learning Software Environment
2013-09-27T11:26:31+00:00
http://simplystats.github.io/2013/09/27/announcing-statistics-with-interactive-r-learning-software-environment
<p dir="ltr">
<em>Editor's note: This post was written by Nick Carchedi, a Master's degree student in the Department of Biostatistics at Johns Hopkins. He is working with us to develop software for interactive learning of R and statistics. </em>
</p>
<p dir="ltr">
Inspired by the relative lack of computer-based platforms for learning statistics and the R programming language, we at <a href="http://www.biostat.jhsph.edu">Johns Hopkins Biostatistics</a> have created a new R package designed to teach both topics simultaneously and interactively. Accordingly, we’ve named the package swirl, which stands for “Statistics with Interactive R Learning”. We sought to model swirl after other highly successful interactive learning platforms such as Codecademy, Code School, and Khan Academy, but with a specific focus on teaching statistics and R. Additionally, we wanted users to learn these topics within the same environment in which they would be applying them, namely the R console.
</p>
<p dir="ltr">
If you’re reading this article, then you probably already have an appreciation for the R language and there’s no need to beat that drum any further. Staying true to the R culture, the swirl package is totally open-source and free for anyone to use, modify, or improve. Furthermore, anyone with something to teach can use the platform to create their own interactive content for the world to use.
</p>
<p dir="ltr">
A typical swirl session has a user load the package from the R console, choose from a menu of options the course he or she would like to take, then work through 10-15 minute interactive modules, each covering a particular topic. A module generally alternates between instructional text output to the user and prompts for the user to answer questions. One question may ask for the result of a simple numerical calculation, while another requires the user to enter an actual R command (which is parsed and executed, if correct) to perform a requested task. Multiple choice, text-based and approximate numerical answers are also fair game. Whenever the user answers a question incorrectly, immediate feedback is given in the form of a hint before prompting her to try again. Finally, plots, figures, and even videos may be incorporated into a module for the sake of reinforcing the methods or concepts being taught.
</p>
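<p dir="ltr">
As a rough illustration of the flow just described (instructional text, then a prompt, then a hint and another try after a wrong answer), here is a small sketch. It is written in Python purely for illustration and is not swirl’s actual implementation; the dictionary layout and the names <code>check_answer</code> and <code>run_module</code> are hypothetical.
</p>

```python
from math import isclose


def check_answer(question, response):
    """Return True if the response satisfies the question's answer rule."""
    kind = question["kind"]
    if kind == "numeric":
        # Approximate numerical answers: accept anything within a tolerance.
        try:
            value = float(response)
        except ValueError:
            return False
        return isclose(value, question["answer"], abs_tol=question.get("tol", 1e-6))
    if kind == "choice":
        # Multiple choice: compare ignoring case and surrounding whitespace.
        return response.strip().lower() == question["answer"].lower()
    raise ValueError(f"unknown question kind: {kind}")


def run_module(module, responses):
    """Walk through a module, reading learner responses from an iterable.

    Returns the transcript of everything shown to the learner.
    """
    transcript = []
    responses = iter(responses)
    for unit in module:
        if unit["kind"] == "text":
            transcript.append(unit["text"])
            continue
        transcript.append(unit["prompt"])
        # After each wrong answer, show the hint and then prompt again.
        while not check_answer(unit, next(responses)):
            transcript.append("Hint: " + unit["hint"])
            transcript.append(unit["prompt"])
    return transcript


# A made-up two-question module and a learner who gets the first answer wrong once.
module = [
    {"kind": "text", "text": "The mean of a sample is the sum divided by the count."},
    {"kind": "numeric", "prompt": "What is the mean of 2, 4 and 9?",
     "answer": 5.0, "tol": 0.01, "hint": "Add the values and divide by 3."},
    {"kind": "choice", "prompt": "Is the mean sensitive to outliers? (yes/no)",
     "answer": "yes", "hint": "Think about what one huge value does to the sum."},
]
transcript = run_module(module, ["4", "5", "Yes"])
print("\n".join(transcript))
```

<p dir="ltr">
In this sketch the learner’s responses are passed in as a list so the example can run unattended; an interactive version would read them from the console instead.
</p>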
<p dir="ltr">
We believe that this form of interactive learning, or learning by doing, is essential for true mastery of topics as challenging and complex as statistics and statistical computing. While we are aware of a handful of other platforms for learning R interactively, our goal was to focus on the teaching of R and statistics simultaneously. As far as we know, swirl is the only platform of its kind and almost certainly the only one that takes place within the R console.
</p>
<p dir="ltr">
When we developed the swirl package, we wanted from the start to allow other people to extend and customize it to their particular needs. The beauty of the swirl platform is that anyone can create their own content and have it included in the package for all users to access. We have designed pre-formatted templates (color-coded spreadsheets) that instructors can fill out with their own content according to a fairly simple set of instructions. Once instructors send us the completed templates, we then load the content into the package so that anyone with the most recent version of swirl on their computer can access the content. We’ve tried to make the process of content creation as simple and painless as possible so that the statistics and computing communities are encouraged to share their knowledge with the world through our platform.
</p>
<p dir="ltr">
The package currently includes only a few sample modules that we’ve created in-house, primarily serving as demonstrations of how the platform works and how a typical module may appear to users. In the future, we envision a vibrant and dynamic collection of full courses and short modules that users can vote up or down based on the quality of their experience with each. In such a scenario, the very best courses would naturally float to the top and the less effective courses would fall out of favor and perhaps be recommended for revision.
</p>
<p dir="ltr">
In addition to making more content available to future users, we hope to one day transition swirl from being an interactive learning environment to one that is truly adaptive to the individual needs of each user. Perhaps this future version of our software would support a more intricate web of content, intelligently navigating users among topics based on a dynamic, data-driven interpretation of their strengths, weaknesses, competencies, and knowledge gaps. With the right people on board, this could become a reality.
</p>
<p dir="ltr">
We’ve created this package with the hope that the statistics and computing communities find it to be a valuable educational tool. We’ve got the basic infrastructure in place, but we recognize that there is a great deal of room for improvement. The swirl package is still very much in development and we are actively seeking feedback on how we can make it better. Please visit the swirl website to download the package or for more information on the project. We’d love for you to give it a try and let us know what you think.
</p>
<p dir="ltr">
Go to swirl website: <a href="http://swirlstats.com">http://swirlstats.com</a>
</p>
How could code review discourage code disclosure? Reviewers with motivation.
2013-09-26T11:08:00+00:00
http://simplystats.github.io/2013/09/26/how-could-code-review-discourage-code-disclosure-reviewers-with-motivation
<p><a href="http://www.nature.com/news/mozilla-plan-seeks-to-debug-scientific-code-1.13812">An article</a> appeared a couple of days ago in Nature describing Mozilla’s efforts to implement code review for scientific papers. As anyone who follows our blog knows, we are in favor of reproducible research, in favor of disclosing code, and in favor of open science.</p>
<p>So people were surprised when they saw this quote from Roger at the end of the Nature piece:</p>
<blockquote>
<p>“One worry I have is that, with reviews like this, scientists will be even more discouraged from publishing their code. We need to get more code out there, not improve how it looks.”</p>
</blockquote>
<p>Not surprisingly a bunch of reproducible research/open science people were quick to jump on this quote:</p>
<blockquote class="twitter-tweet" width="550">
<p>
.<a href="https://twitter.com/kaythaney">@kaythaney</a> re code review story, <a href="http://t.co/7rlAsmLuPw">http://t.co/7rlAsmLuPw</a> comment by <a href="https://twitter.com/simplystats">@simplystats</a> seems off to me... must be more nuance there <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" />
</p>
<p>
— Titus Brown (@ctitusbrown) <a href="https://twitter.com/ctitusbrown/status/382562811039064064">September 24, 2013</a>
</p>
</blockquote>
<blockquote class="twitter-tweet" width="550">
<p>
<a href="https://twitter.com/nickbarnes">@nickbarnes</a> <a href="https://twitter.com/cboettig">@cboettig</a> <a href="https://twitter.com/ctitusbrown">@ctitusbrown</a> agree. comment lead with this backfiring / discouraging others to make code available, which seemed off.
</p>
<p>
— Kaitlin Thaney (@kaythaney) <a href="https://twitter.com/kaythaney/status/382819174206423040">September 25, 2013</a>
</p>
</blockquote>
<p>Now, Roger’s quote was actually a little more nuanced and it was posted after a pretty in-depth discussion on Twitter:</p>
<blockquote class="twitter-tweet" width="550">
<p>
<a href="https://twitter.com/ctitusbrown">@ctitusbrown</a> <a href="https://twitter.com/cboettig">@cboettig</a> <a href="https://twitter.com/kaythaney">@kaythaney</a> <a href="https://twitter.com/nickbarnes">@nickbarnes</a> see whole <a href="https://twitter.com/simplystats">@simplystats</a> quote on prof. code review discouraging sharing <a href="http://t.co/pNQWT9Safz">pic.twitter.com/pNQWT9Safz</a>
</p>
<p>
— Erika Check Hayden (@Erika_Check) <a href="https://twitter.com/Erika_Check/status/382911015358181376">September 25, 2013</a>
</p>
</blockquote>
<p>But I think the real source of confusion was best summed up by Titus B.:</p>
<blockquote class="twitter-tweet" width="550">
<p>
.<a href="https://twitter.com/cboettig">@cboettig</a> <a href="https://twitter.com/kaythaney">@kaythaney</a> <a href="https://twitter.com/nickbarnes">@nickbarnes</a> As one of my grad students said to me, "I don't understand why 'must share code' is a radical opinion."
</p>
<p>
— Titus Brown (@ctitusbrown) <a href="https://twitter.com/ctitusbrown/status/382904483102982145">September 25, 2013</a>
</p>
</blockquote>
<p>That is the key issue. People are surprised that sharing code would be anything but an obvious thing to do. To people who share code all the time, this is an obvious no-brainer. My bias is clearly in that camp as well. I require reproducibility of my students’ analyses, I discuss reproducible research when I teach, I take my own medicine by making my analyses reproducible, and I frequently state in reviews that papers are only acceptable after the code is available.</p>
<p><em>So what’s the big deal?</em></p>
<p>In an incredibly interesting coincidence, I <a href="http://simplystatistics.org/2013/09/25/is-most-science-false-the-titans-weigh-in/">had a paper</a> come out the same week in Biostatistics that has been, uh… a little controversial.</p>
<p>In this case, our paper was published with discussion. For people outside of statistics, a discussant and a reviewer are different things. The paper first goes through peer review in the usual way. Then, once it is accepted for publication, it is sent out to discussants to read and comment on.</p>
<p>A couple of discussants were very, very motivated to discredit our approach. Despite this, because we believe in open science, stating our assumptions, and being reproducible, we made all of the code we used and data we collected available for the discussants (and for everyone else). In an awesome win for open science, many of the discussants used/evaluated our code in their discussions.</p>
<p>One of the very motivated discussants identified an actual bug in the code. This bug caused the journal names to be scrambled in Figures 3 and 4. The bug (thank goodness!) did not substantively alter the methods, the results or the conclusions of our paper. On top of that, the cool thing about having our code on GitHub was that we could carefully look it over, fix the bug, and push the changes to the repository (and update the paper) so the discussant could see the revised version as soon as we pushed it.</p>
<p>We were happy that the discussant didn’t find any more substantial bugs (because we knew they were motivated to review our code for errors as carefully as possible). We were also happy to make the changes, admit our mistake and move on.</p>
<p>An interesting thing happened though. The motivated discussant wanted to discredit our approach. So they included in the supplement how they noticed the bug (totally fair game, it was a bug). But they also included their email exchange with the editor about the bug and this quote:</p>
<blockquote>
<p>As all seasoned methodologists know, minor coding errors causing total havoc is quite common (I have seen it happen in my own work). I think that it is ironic that a paper that claims to prove the reliability of the literature had completely messed up the two main figures that represent the core of all its data and its main results.</p>
</blockquote>
<p>A couple of points here: (1) the minor bug didn’t wreak havoc with our results (it didn’t change any conclusions and it didn’t affect our statistics), and (2) the statement is clearly designed for the sole purpose of embarrassing us (the authors) and discrediting our work.</p>
<p>The problem here is that the code reviewer deeply cares about us being wrong. This incident highlights one reason for Roger’s concerns. I feel we acted in pretty good faith here to try to be honest about our assumptions and open with our code. We also responded quickly and thoroughly to the report of a bug. But the discussant used the fact that we had a bug at all to try to discredit our whole analysis with sarcasm. This sort of thing could absolutely discourage a person from releasing code.</p>
<p>One thing the discussant is absolutely right about is that most code will have minor bugs. Personally, I’m very grateful to the discussant for catching the bug before the work was published and I’m happy that we made the code available and corrected our mistake.</p>
<p><em>But the key risk here is that people who demand reproducible code do so only so they can try to embarrass analysts and discredit science they don’t like. </em></p>
<p>If we want people to make code available, be willing to admit mistakes, and continuously update their code then we don’t just need code review. We need a policy and commitment from the community to not just use reproducible research as a vehicle for embarrassment and discrediting each other. We need a policy that:</p>
<ol>
<li>Doesn’t discourage people from putting code up before papers are published for fear of embarrassment.</li>
<li>Acknowledges minor bugs happen and doesn’t penalize people for admitting them/fixing them.</li>
<li>Prevents people from publishing when they have major typos, but doesn’t humiliate them.</li>
<li>Defines specific, positive ways that code sharing can benefit the community (collaboration) rather than only reporting errors that are discovered when code is made available.</li>
<li>Recognizes that most scientists are not professional software developers and focuses review on the scientific correctness/reproducibility of code, rather than technical software development skills.</li>
</ol>
<p>One way I think we could address a lot of these issues is not to think of it as code review, but as code evaluation and update. That is one thing I really like about Mozilla’s approach - they report their findings to the authors and let them respond. The only thing that would be better is if Mozilla actually created patches/bug fixes for the code and issued pull requests that the authors could incorporate.</p>
<p>Ultimately, I hope we can focus on a way to make scientific software correct, not just point out how it is wrong.</p>
Is most science false? The titans weigh in.
2013-09-25T11:06:11+00:00
http://simplystats.github.io/2013/09/25/is-most-science-false-the-titans-weigh-in
<p>Some of you may recall that a few months ago my colleague and I posted a <a href="http://arxiv.org/pdf/1301.3718.pdf">paper</a> to the arXiv on estimating the rate of false discoveries in the scientific literature. The paper was picked up by the <a href="http://m.technologyreview.com/view/510126/the-statistical-puzzle-over-how-much-biomedical-research-is-wrong/">Tech Review</a> and led to a post on <a href="http://andrewgelman.com/2013/01/24/i-dont-believe-the-paper-empirical-estimates-suggest-most-published-medical-research-is-true-that-is-the-claim-may-very-well-be-true-but-im-not-at-all-convinced-by-the-analysis-being-used/">Andrew G.’s blog</a>, <a href="http://blogs.discovermagazine.com/neuroskeptic/2013/01/24/is-medical-science-really-86-true/#.UkLqWWTXis0">on Discover blogs</a>, and <a href="http://simplystatistics.org/2013/01/24/why-i-disagree-with-andrew-gelmans-critique-of-my-paper-about-the-rate-of-false-discoveries-in-the-medical-literature/">on our blog</a>. One other interesting feature of our paper was that we put all the <a href="https://github.com/jtleek/swfdr">code/data we collected on GitHub</a>.</p>
<p>At the time this whole thing blew up, our paper still wasn’t published. After the explosion of interest we submitted the paper to Biostatistics. They liked the paper and actually solicited formal discussion of our approach by other statisticians. We were then allowed to respond to the discussions.</p>
<p>Overall, it was an awesome experience at Biostatistics - they did a great job of doing a thorough, but timely, review. They got some amazing discussants. Finally, they made our paper open-access. So much goodness. (conflict of interest disclaimer - I am an associate editor for Biostatistics)</p>
<p>Here are the papers that came out which I think are all worth reading:</p>
<ul>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt007.full">Our paper</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt032.full">Discussion by Benjamini and colleagues</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt033.full">Discussion by D.R. Cox (!)</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt034.full">Discussion by Gelman and colleagues</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt035.full">Discussion by Goodman</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt036.full">Discussion by Ioannidis</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt037.full">Discussion by Schuemie and colleagues</a></li>
<li><a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt038.full">Our rejoinder</a></li>
</ul>
<p>I’m very proud of our paper and the rejoinder. The discussants were very passionate and added a huge amount of value, particularly in the collection/analysis of our data and additional data they collected.</p>
<p>I think it is 100% worth reading all of the papers over at Biostatistics, but for the tl;dr crowd here are some take-home messages from the experience that summarize the discussion above:</p>
<ol>
<li>Posting to arXiv can be a huge advantage for a paper like ours, but be ready for the heat.</li>
<li>Biostatistics (the journal) is awesome. Great job of reviewing/editing in a timely way and great job of organizing the discussion!</li>
<li>When talking about the science-wise false discovery rate you have to bring data.</li>
<li>We proposed the first formal framework for evaluating the science-wise false discovery rate which lots of people care about (and there are a ton of ideas in the discussion about ways to estimate it better).</li>
<li>I think based on our paper and the discussion that it is pretty unlikely that most published research is false. But that probably varies by your definition of false/what you mean by most/the journal type/the field you are considering/the analysis type/etc.</li>
<li>This is a question people care about. <em>A lot</em>.</li>
</ol>
<p>Finally, I think this is the most important quote from our rejoinder:</p>
<blockquote>
<p>We are encouraged, however, that several of the discussants collected additional data to evaluate the impact of the above decisions on the SWFDR estimates. The discussion illustrates the powerful way that data collection can be used to move the theoretical and philosophical discussion on to a more concrete, scientific footing—discussing the specific strengths and weaknesses of a particular empirical approach. Moreover, the interesting additional data collected by the discussants on study types, journals, and endpoints demonstrate that data beget data and lead to a stronger and more directed conversation.</p>
</blockquote>
How I view an academic talk: like a sports game
2013-09-24T10:32:55+00:00
http://simplystats.github.io/2013/09/24/how-i-view-an-academic-talk-like-a-sports-game
<p>I know this is a little random/non-statisticsy but I have been thinking about it a lot lately. Over the last couple of weeks I have been giving a bunch of talks and guest lectures here locally around the Baltimore/DC area. Each one of them was to a slightly <a style="font-size: 16px;" href="http://www.meetup.com/Data-Science-MD/events/135629022/">different</a> <a style="font-size: 16px;" href="http://www.cbcb.umd.edu/~langmead/teaching/f2013_439/syllabus.pdf">audience.</a></p>
<p>As I was preparing/giving all of these talks I realized I have a few habits that I have developed in the way I view the talks and in the way that I give them. I 100% agree with Hilary M. that a talk <a href="http://www.hilarymason.com/speaking/speaking-entertain-dont-teach/">should entertain</a> more than it should teach. I also try to give talks that <a href="http://simplystatistics.org/2012/03/05/characteristics-of-my-favorite-statistics-talks/">I would like to see myself</a>.</p>
<p>Another thing I realized is that I view talks in a very specific way. I see them as a sports game. From the time I was a kid until the end of <a href="http://biostat.jhsph.edu/~jleek/ultimate.png">graduate school</a> I was on sports teams. I love playing/watching all kinds of sports and I definitely miss playing competitively.</p>
<p>Unfortunately, being a faculty member doesn’t leave much time for sports. So now, the only chance I have to get up and play is during a talk. Here are the ways that I see the two activities as being similar:</p>
<ol>
<li>They both require practice. I played a lot of sports with <a href="http://www.biostat.umn.edu/~rudser/Images/IMG_0220.JPG">this guy</a> who liked the quote, “Practice doesn’t make perfect, perfect practice makes perfect”. I feel the same way.</li>
<li>They are both a way to entertain. I rarely played in front of crowds as big as the groups I speak to these days, but whenever there was an audience I would always get way more pumped up.</li>
<li>There is some competition to both. In terms of talks, there is always at least one audience member who wants to challenge your ideas. I see this exchange as a game, rather than something I dread. Sometimes I win (my answers cover all the questions) and sometimes I lose (I missed something important). Usually, being prepared is associated with better practice.</li>
<li>I get a rush off of both playing in games and giving talks. Part of that is self-fueled. I like to listen to pump-up music right before I give a talk or play a game.</li>
</ol>
<p>One thing I wish is that more talks were joint talks. One thing I love about sports is playing on a team. The preparation of a talk is always done with a team - usually the students/postdocs/collaborators working on the project. But I wish presentations were more often a team activity. It makes it more fun to celebrate if the talk went well and less painful if I flub when I give a talk with someone else. Plus it is fun to cheer on your teammate.</p>
<p>Does anyone else think of talks this way? Or do you have another way of thinking about talks?</p>
The limiting reagent for big data is often small, well-curated data
2013-09-23T10:32:29+00:00
http://simplystats.github.io/2013/09/23/the-limiting-reagent-for-big-data-is-often-small-well-curated-data
<p>I’ve been working on “big” data in genomics since I was a first-year student in graduate school (a longer time than I’d rather admit). At the time, “big” meant <a href="http://genomics.princeton.edu/storeylab/trauma/">microarray studies with a couple of hundred patients</a>. Of course, that is now a really small drop in the bucket compared to the huge sequencing data sets, <a href="http://www.nature.com/nature/journal/vaop/ncurrent/full/nature12531.html">like the one</a> published recently in Nature.</p>
<p>Despite the exploding size of these genomic data sets, the discovery process is almost always limited by the quality and quantity of useful metadata that go along with them. In the trauma study I referenced above, the genomic data was both costly and hard to collect. But the bigger, more impressive feat was to collect the data from trauma patients at relatively precise time points after they had been injured. Along with the genomic data a host of clinical data was also collected and aligned with the genomic data.</p>
<p><em>The key insights derived from the data were the relationships between low-dimensional and high-dimensional measurements. </em></p>
<p>This is actually relatively common:</p>
<ul>
<li>In computer vision you need quality labeled images to use as a training set (this type of manual labeling is so common it forms the basis for major citizen science projects like <a href="https://www.zooniverse.org/">zooniverse</a>)</li>
<li>In genome-wide association studies you need accurate phenotypes.</li>
<li>In the analysis of social networks like the Framingham Heart Study, you need to <a href="http://www.nejm.org/doi/full/10.1056/NEJMsa066082">collect data on obesity levels</a>, etc.</li>
</ul>
<p>One common feature of these studies is that they are examples of what computer scientists call <em><a href="http://en.wikipedia.org/wiki/Supervised_learning">supervised learning</a></em>. Most hypothesis-driven research falls into this type of study. It is important to recognize that these studies can only work with painstaking and careful collection of small data. So in many cases, the limits to the insights we can obtain from big data are imposed by <a href="http://simplystatistics.org/2012/05/28/schlep-blindness-in-statistics/">how much schlep</a> we are willing to put in to get small data.</p>
Announcing the Simply Statistics Unconference on the Future of Statistics #futureofstats
2013-09-17T11:22:39+00:00
http://simplystats.github.io/2013/09/17/announcing-the-simply-statistics-unconference-on-the-future-of-statistics-futureofstats
<p><a href="https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358">Sign up here!</a></p>
<p dir="ltr">
We here at Simply Statistics are pumped about the <a href="http://www.statistics2013.org/introduction-to-the-future-of-statistical-sciences-workshop/">Statistics 2013 Future of Statistical Sciences Workshop (Nov. 11-12)</a>. It is a great time to be a statistician and discussing the future of our discipline is of utmost importance to us. In fact, we liked the idea so much that we decided to get in the game ourselves.
</p>
<p dir="ltr">
We are super excited to announce the first ever “Unconference” hosted by Simply Statistics. The unconference will focus on the Future of Statistics and will be held October 30th from 12-1pm EST. The unconference will be hosted on Google Hangouts and will be simultaneously live-streamed on YouTube. After the unconference is over we will maintain a recorded version for viewing on YouTube. Our goal is to complement and continue the discussion inspired by the Statistics 2013 Workshop.
</p>
<p dir="ltr">
This unconference will feature some of the most exciting and innovative statistical thinkers in the field discussing their views on the future of the field and focusing on issues that affect junior statisticians the most: education, new methods, software development, collaborations with natural sciences/social sciences, and the relationship between statistics and industry.
</p>
<p dir="ltr">
The confirmed presenters are:
</p>
<ul>
<li><strong>Daniela Witten</strong>, Assistant Professor, Department of Biostatistics, University of Washington</li>
<li><strong>Hongkai Ji</strong>, Assistant Professor, Department of Biostatistics, Johns Hopkins University</li>
<li><strong>Joe Blitzstein</strong>, Professor of the Practice, Department of Statistics, Harvard University</li>
<li><strong>Sinan Aral</strong>, Associate Professor, MIT Sloan School of Management</li>
<li><strong>Hadley Wickham</strong>, Chief Scientist, RStudio</li>
<li><strong>Hilary Mason</strong>, Chief Data Scientist at Accel Partners</li>
</ul>
<p><a href="https://twitter.com/simplystats">Follow us on Twitter</a> or sign up for the Unconference at <a style="font-size: 16px;" href="http://simplystatistics.org/unconference">http://simplystatistics.org/unconference</a>. In the month or so leading up to the unconference we would also love to hear your thoughts on the future of statistics. Let us know about your ideas on Twitter with the hashtag #futureofstats; we’ll be compiling the information and will make it available along with the talks so that you can tell us what you think the future is.</p>
<p>Tell your friends, tell your family, it is on!</p>
Data Analysis in the top 9 courses in lifetime enrollment at Coursera!
2013-09-16T14:41:26+00:00
http://simplystats.github.io/2013/09/16/data-analysis-in-the-top-9-courses-in-lifetime-enrollment-at-coursera
<p>Holy cow I just saw this, my Coursera class is in the top 9 by all time enrollment!</p>
<blockquote class="twitter-tweet" width="550">
<p>
Top 9 courses on <a href="https://twitter.com/hashtag/coursera?src=hash">#coursera</a> by lifetime enrollment current as of 9/16- check out &amp; enroll: <a href="http://t.co/2X0EJoetoC">http://t.co/2X0EJoetoC</a>! <a href="http://t.co/d0Sko1KhoD">pic.twitter.com/d0Sko1KhoD</a>
</p>
<p>
— Coursera (@coursera) <a href="https://twitter.com/coursera/status/379650980154863617">September 16, 2013</a>
</p>
</blockquote>
<p>Only problem is those pesky other classes ahead of me. Help me take down Creativity, Innovation and Change (what good is all that anyway <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":-)" class="wp-smiley" style="height: 1em; max-height: 1em;" /> by <a href="https://www.coursera.org/course/dataanalysis">signing up here</a>!</p>
So you're moving to Baltimore
2013-09-13T15:21:59+00:00
http://simplystats.github.io/2013/09/13/so-youre-moving-to-baltimore
<p><em>Editor’s Note: This post was written by Brian Caffo, occasional Simply Statistics contributor and Director of Graduate Studies in the Department of Biostatistics at Johns Hopkins. This was written primarily for incoming graduate students, but if you’re planning on moving to Baltimore anyway, feel free to use it to your advantage!</em></p>
<p>Congratulations on picking Hopkins Biostatistics for your graduate studies. Now that you’re either here or coming to Baltimore, I’m guessing that you’ll need some start-up knowledge for this quirky, fun city. Here’s a guide to some of my favorite Baltimore places and traditions.</p>
<p>Put more in the comments!</p>
<h2 dir="ltr">
Events
</h2>
<p dir="ltr">
First, let me discuss some sporting events that you should be aware of. Absolutely top on the list is going to a baseball game at <a href="http://baltimore.orioles.mlb.com/bal/ballpark/index.jsp">Camden Yards</a> to watch the <a href="http://baltimore.orioles.mlb.com/">Orioles</a>. There are lots of games on days, nights and weekends and, for the most part, tickets are easy to get and relatively cheap. Going to see the (two-time Super Bowl champion) <a href="http://www.baltimoreravens.com/">NFL Ravens</a> is a bit harder and more expensive, but well worth the splurge once during your studies. Then you can come back to your research investigating the long-term impact of football head trauma.
</p>
<p>The <a href="http://www.preakness.com/">Preakness</a> horse race is another that’s worth going to at least once. The Preakness takes place on a Saturday and is a very popular event; this can translate to big crowds. If you don’t like big crowds but would like to see what all the fuss is about, you may enjoy the Black-Eyed Susan Stakes; this is a day of racing at Pimlico on the Friday before the Preakness where the crowds are smaller, it costs $5 to get into the track and you can enjoy the celebratory atmosphere of the Preakness. Another fun event is the <a href="http://www.grandprixofbaltimore.com/">Baltimore Grand Prix</a> which happens every Labor Day weekend (at least for the next few years). Since you’re at Hopkins, try to go catch a lacrosse game. The Hopkins team is consistently among the best. If you’re a distance runner, there’s the <a href="http://www.thebaltimoremarathon.com/">Baltimore Marathon</a>. Also, I hesitate to include this with sports, but I can’t get enough of the <a href="http://www.kineticbaltimore.com/">Kinetic Sculpture “Race</a>”, the most fun Baltimore event that I can think of. And we would be doing Hilary Parker a disservice if we failed to mention the <a href="http://www.charmcityrollergirls.com/">Charm City Roller Girls</a>!</p>
<p dir="ltr">
The main non-sporting events that I like are the festivals. Every year, especially during the summer, every neighborhood has a festival. <a href="http://www.youtube.com/watch?v=q-p5wqCA-ao">Honfest</a> in <a href="http://en.wikipedia.org/wiki/Hampden,_Baltimore">Hampden</a> is surely the one not to be missed (but there are festivals in every notable neighborhood including the <a href="http://fellspointfest.com/">Fells Point Festival</a>). At Christmas time, there’s the <a href="http://en.wikipedia.org/wiki/Miracle_on_34th_Street_(Baltimore)">Miracle on 34th Street</a> right nearby and <a href="http://www.destinationmainstreets.com/maryland/hampden.php">36th Street (the Avenue)</a> is a fun place to go out for shopping and eating, regardless of whether Honfest is going on. During the summer months, a local radio station sponsors “<a href="http://wtmd.org/radio/first-thursday-concerts-in-the-park/">First Thursdays</a>” where they put on a free concert series at the Washington Monument in Mt. Vernon.
</p>
<h2 dir="ltr">
Things to do during the day
</h2>
<p dir="ltr">
Probably you’ll visit the <a href="http://baltimore.org/about-baltimore/inner-harbor">Harbor</a> as one of the first things you do. Make sure to hit the <a href="http://www.aqua.org/">National Aquarium</a>, the <a href="http://www.avam.org/">Visionary Arts Museum</a> and the <a href="http://www.mdsci.org/">Maryland Science Center</a> (not all in one day). Downtown there’s the <a href="http://thewalters.org/">Walters Art Museum</a> and the <a href="http://www.artbma.org/">Baltimore Museum of Art</a> on the Johns Hopkins Homewood campus. Go see <a href="http://www.nps.gov/fomc/">Fort McHenry</a>, where Francis Scott Key wrote the National Anthem. The <a href="http://www.rflewismuseum.org/">Museum of African American History and Culture</a> is right near the Inner Harbor on Pratt Street.
</p>
<p dir="ltr">
If you’re outdoorsy, <a href="http://www.dnr.state.md.us/publiclands/central/patapsco.asp">Patapsco</a> and <a href="http://www.dnr.state.md.us/publiclands/central/gunpowder.asp">Gunpowder Falls</a> appear to be good places nearby. <a href="http://www.nps.gov/cato/">Catoctin Park</a> is nearby with Camp David tucked in it somewhere; you’ll know you’ve found it when the Secret Service tackles you. If you don’t want to travel too far, just outside the northern border of the city is <a href="http://www.baltimorecountymd.gov/Agencies/recreation/programdivision/naturearea/relpark/">Robert E. Lee Park</a> which has a nice hiking trail and a dog park. When you’re done there you can grab lunch at the <a href="http://www.hautedogcarte.com">Haute Dog</a>.
</p>
<p>If you have kids, the <a href="http://www.marylandzoo.org/">Baltimore Zoo</a> is a really nice outdoor zoo that is a great place to go if the weather is nice. It’s in <a href="http://www.druidhillpark.org/">Druid Hill Park</a>, which is also a great place to go running or biking. If you’re willing to drive an hour or more, the outdoor options are basically endless.</p>
<p>DC and Philly are easy day trips using the train and Annapolis is an easy drive. If you go to DC, only schedule a few museums right near one another, otherwise you’ll spend the whole day walking. On a nice day, the <a href="http://nationalzoo.si.edu/">National Zoo</a> is fantastic (and free). The MARC train goes to DC from Penn Station and is under $10 each way, but it only runs in the morning and evening. Outside of those times you can take the Amtrak train. If you drive, it’s usually about an hour one-way, depending on where you’re going.</p>
<h2 dir="ltr">
Things to do during the night
</h2>
<p dir="ltr">
I have little kids. How would I know? My answer is, fight about bedtime and collapse. However, if I was forced to come up with something, I would say go to <a href="http://www.pattersonbowl.com/">Patterson Park Lanes</a> and do Duckpin Bowling. Make sure to reserve a lane earlier on in the week if you want to go at night on a weekend.
</p>
<p>From my outside vantage point, there appears to be tons of nightlife. The best places appear to be in upscale city areas, like Fells Point, Canton, downtown, Harbor East, and Federal Hill. Also, catch a show at the [</p>
<p>Congratulations on picking Hopkins Biostatistics for your graduate studies. Now that you’re either here or coming to to Baltimore, I’m guessing that you’ll need some start-up knowledge for this quirky, fun city. Here’s a guide of to some of my favorite Baltimore places and traditions.</p>
<p>Put more in the comments!</p>
<h2 dir="ltr">
Events
</h2>
<p dir="ltr">
First, let me discuss some sporting events that you should be aware of. Absolutely top on the list is going to a baseball game at <a href="http://baltimore.orioles.mlb.com/bal/ballpark/index.jsp">Camden Yards </a>to watch the <a href="http://baltimore.orioles.mlb.com/">Orioles</a>. There’s lots of games on days, nights and weekends and for the most part, tickets are easy to get and relatively cheap. Going to the (twice Super Bowl champion) <a href="http://www.baltimoreravens.com/">NFL Ravens</a> is a bit harder and more expensive, but well worth the splurge once during your studies. Then you can come back to your research on investigating the long term impact of football head trauma.
</p>
<p><strong>** </strong>**The [<em>Editor’s Note: This post was written by Brian Caffo, occasional Simply Statistics contributor and Director of Graduate Studies in the Department of Biostatistics at Johns Hopkins. This was written primarily for incoming graduate students, but if you’re planning on moving to Baltimore anyway, feel free to use it to your advantage!</em></p>
<p>Congratulations on picking Hopkins Biostatistics for your graduate studies. Now that you’re either here or coming to to Baltimore, I’m guessing that you’ll need some start-up knowledge for this quirky, fun city. Here’s a guide of to some of my favorite Baltimore places and traditions.</p>
<p>Put more in the comments!</p>
<h2 dir="ltr">
Events
</h2>
<p dir="ltr">
First, let me discuss some sporting events that you should be aware of. Absolutely top on the list is going to a baseball game at <a href="http://baltimore.orioles.mlb.com/bal/ballpark/index.jsp">Camden Yards </a>to watch the <a href="http://baltimore.orioles.mlb.com/">Orioles</a>. There’s lots of games on days, nights and weekends and for the most part, tickets are easy to get and relatively cheap. Going to the (twice Super Bowl champion) <a href="http://www.baltimoreravens.com/">NFL Ravens</a> is a bit harder and more expensive, but well worth the splurge once during your studies. Then you can come back to your research on investigating the long term impact of football head trauma.
</p>
<p><strong>** </strong>**The](http://www.preakness.com/) horse race is another that’s worth going to at least once. The Preakness takes place on a Saturday and is a very popular event; this can translate to big crowds. If you don’t like big crowds but would like to see what all the fuss is about, you may enjoy the Black Eye Susan Stakes; this is a day of racing at Pimlico on Friday before the Preakness where the crowds are smaller, it costs $5 to get into the track and you can enjoy the celebratory atmosphere of the Preakness. Another fun event is the <a href="http://www.grandprixofbaltimore.com/">Baltimore Grand Prix</a> which happens every Labor day weekend (at least for the next few years). Since you’re at Hopkins, try to go catch a lacrosse game. The Hopkins team is consistently among the best. If you’re a distance runner, there’s the <a href="http://www.thebaltimoremarathon.com/">Baltimore Marathon</a>. Also, I hesitate to include this with sports, but I can’t get enough of the <a href="http://www.kineticbaltimore.com/">Kinetic Sculpture “Race</a>”, the most fun Baltimore event that I can think of. And we would be doing Hilary Parker a disservice if we failed to mention the <a href="http://www.charmcityrollergirls.com/">Charm City Roller Girls</a>!</p>
<p dir="ltr">
The main non-sporting events that I like are the festivals. Every year, especially during the summer, every neighborhood has a festival. <a href="http://www.youtube.com/watch?v=q-p5wqCA-ao">Honfest</a> in <a href="http://en.wikipedia.org/wiki/Hampden,_Baltimore">Hampden</a> is surely the one not to be missed (but there are festivals in every notable neighborhood, including the <a href="http://fellspointfest.com/">Fells Point Festival</a>). At Christmas time, there’s the <a href="http://en.wikipedia.org/wiki/Miracle_on_34th_Street_(Baltimore)">Miracle on 34th Street</a> right nearby, and <a href="http://www.destinationmainstreets.com/maryland/hampden.php">36th Street (the Avenue)</a> is a fun place to go out for shopping and eating, regardless of whether Honfest is going on. During the summer months, a local radio station sponsors “<a href="http://wtmd.org/radio/first-thursday-concerts-in-the-park/">First Thursdays</a>”, a free concert series at the Washington Monument in Mt. Vernon.
</p>
<h2 dir="ltr">
Things to do during the day
</h2>
<p dir="ltr">
You’ll probably visit the <a href="http://baltimore.org/about-baltimore/inner-harbor">Harbor</a> as one of the first things you do. Make sure to hit the <a href="http://www.aqua.org/">National Aquarium</a>, the <a href="http://www.avam.org/">Visionary Arts Museum</a> and the <a href="http://www.mdsci.org/">Maryland Science Center</a> (not all in one day). Downtown there’s the <a href="http://thewalters.org/">Walters Art Museum</a>, and the <a href="http://www.artbma.org/">Baltimore Museum of Art</a> is on the Johns Hopkins Homewood campus. Go see <a href="http://www.nps.gov/fomc/">Fort McHenry</a>, where Francis Scott Key wrote the National Anthem. The <a href="http://www.rflewismuseum.org/">Museum of African American History and Culture</a> is right near the Inner Harbor on Pratt Street.
</p>
<p dir="ltr">
If you’re outdoorsy, <a href="http://www.dnr.state.md.us/publiclands/central/patapsco.asp">Patapsco</a> and <a href="http://www.dnr.state.md.us/publiclands/central/gunpowder.asp">Gunpowder Falls</a> appear to be good places nearby. <a href="http://www.nps.gov/cato/">Catoctin Park</a> is nearby with Camp David tucked in it somewhere; you’ll know you’ve found it when the Secret Service tackles you. If you don’t want to travel too far, just outside the northern border of the city is <a href="http://www.baltimorecountymd.gov/Agencies/recreation/programdivision/naturearea/relpark/">Robert E. Lee park</a>, which has a nice hiking trail and a dog park. When you’re done there you can grab lunch at the <a href="http://www.hautedogcarte.com">Haute Dog</a>.
</p>
<p>If you have kids, the <a href="http://www.marylandzoo.org/">Baltimore Zoo</a> is a really nice outdoor zoo that is a great place to go if the weather is nice. It’s in <a href="http://www.druidhillpark.org/">Druid Hill Park</a>, which is also a great place to go running or biking. If you’re willing to drive an hour or more, the outdoor options are basically endless.</p>
<p>DC and Philly are easy day trips using the train and Annapolis is an easy drive. If you go to DC, only schedule a few museums right near one another, otherwise you’ll spend the whole day walking. On a nice day, the <a href="http://nationalzoo.si.edu/">National Zoo</a> is fantastic (and free). The MARC train goes to DC from Penn Station and is under $10 each way, but it only runs in the morning and evening. Outside of those times you can take the Amtrak train. If you drive, it’s usually about an hour one-way, depending on where you’re going.</p>
<h2 dir="ltr">
Things to do during the night
</h2>
<p dir="ltr">
I have little kids. How would I know? My answer is, fight about bedtime and collapse. However, if I were forced to come up with something, I would say go to <a href="http://www.pattersonbowl.com/">Patterson Park Lanes</a> and do duckpin bowling. Make sure to reserve a lane earlier in the week if you want to go at night on a weekend.
</p>
<p>From my outside vantage point, there appears to be tons of nightlife. The best places appear to be in upscale city areas, like Fells Point, Canton, downtown, Harbor East and Federal Hill. Also, catch a show at the <a href="http://www.france-merrickpac.com/home.html">France-Merrick Performing Arts Center</a> or <a href="https://www.centerstage.org">Center Stage</a> or any one of the many local theatres. The best places to go to movies are the <a href="http://www.thesenatortheatre.com/">Senator</a>, <a href="http://www.fandango.com/rotundacinemas_aabot/theaterpage">Rotunda</a>, <a href="http://www.thecharles.com/">the Charles</a> and the <a href="http://articles.washingtonpost.com/2012-10-11/entertainment/35499403_1_kwame-kwei-armah-strand-theater-rain-pryor">Landmark at Harbor East</a>.</p>
<p>The <a href="http://www.bsomusic.org">Baltimore Symphony</a> is one of the top orchestras in the country and usually has interesting programs. You can usually just show up a few minutes before the concert and get a good (cheap) ticket. There’s also opera at the <a href="http://www.lyricoperahouse.com/page_img.php?cms_id=2">Lyric Opera House</a>, but Ingo will tell you that the real stuff is in DC at the <a href="http://www.kennedy-center.org/wno/">National Opera</a>.</p>
<h2 dir="ltr">
Things to eat
</h2>
<p dir="ltr">
There are too many restaurants to discuss, so I’ll just give some recommendations. If you have to have deli food, go to <a href="http://www.attmansdeli.com/">Attman’s on Lombard Street</a>. If you need authentic Chinese food, go to <a href="http://hunantastemd.com/">Hunan Taste</a> in Catonsville. All of the Korean restaurants are just north of North Avenue on Charles; try <a href="http://www.yelp.com/biz/jong-kak-baltimore-3">Jong Kak</a>. If you’re a locavore and want to go out for a nice dinner, there are a lot of choices. I like the <a href="http://www.woodberrykitchen.com/">Woodberry Kitchen</a> and <a href="http://bmoreclementine.com/">Clementine</a>. If you want to break the bank, go to the <a href="http://www.charlestonrestaurant.com/">Charleston</a>, probably the fanciest restaurant in the city. Also, make sure to hit the big <a href="http://www.promotionandarts.com/index.cfm?page=events&id=3">Farmer’s Market</a> on Sunday at least once. The best place to go drink beer and eat crabs is <a href="http://www.lpsteamers.com/">LP Steamers</a>. If you want a crab cake the size of a softball, go to <a href="http://www.faidleyscrabcakes.com/">Faidley’s</a> in <a href="http://www.lexingtonmarket.com/">Lexington Market</a>. Lexington Market is its own spectacle that you should try at least once. If you need an Italian deli, <a href="http://baltimore.cbslocal.com/top-lists/best-italian-delis-in-the-baltimore-area/">there are several</a> (<a href="http://www.mastellones.com/">Mastellone’s</a> is my favorite, but this list at least omits Isabella’s in Little Italy and Ceriello in Belvedere Square).
</p>
<h2 dir="ltr">
What you eat
</h2>
<p dir="ltr">
You’re a Baltimoron now, so you drink <a href="http://nationalbohemian.com/">Natty Boh</a>, eat <a href="http://www.utzsnacks.com/">Utz Potato chips</a> and <a href="http://bergercookies.com/">Berger cookies</a>. (Don’t question; this is what you do now.) In the summer, go get an <a href="http://www2.citypaper.com/bob/story.asp?id=8153">egg cream snowball with marshmallow</a>. If you want high end local beer, I like <a href="http://www.hsbeer.com/#">Heavy Seas</a> and <a href="http://unioncraftbrewing.com/">Union Craft</a>. If you’re a coffee drinker, you drink <a href="http://www.zekescoffee.com/">Zeke’s coffee</a> now.
</p>
<h2 dir="ltr">
Baltimore stuff
</h2>
<p dir="ltr">
So you need to know a few things so you don’t look the fool. I’ve created a Baltimore cheat sheet. Normally I wouldn’t suggest cheating, but feel free to write this on your hand or something.
</p>
<li dir="ltr">
<p dir="ltr">
The O’s are the baseball team (Orioles, named after a species of bird that lives around here); they have a rich history and are in a division with poser glamour bankroll teams: the Yankees and Red Sox. You do not like the Yankees or Red Sox now.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
Cal Ripken Jr is a former O’s player who broke a famous record for number of consecutive games played.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
The Ravens are the football team (named after the poem by Edgar Allan Poe; see below). They have been very good for a while. There was an issue where the old team, the Baltimore Colts, left Baltimore for Indianapolis and Baltimore subsequently got Cleveland’s team and named it the Ravens. So, now you don’t like Indianapolis Colts fans and people from Cleveland don’t like you.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
Lacrosse is a sport that exists and Hopkins is good at it.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
Thurgood Marshall, the first black US Supreme Court justice, was born here. The airport is named after him.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
The author Edgar Allan Poe lived, worked, died and was buried here. You can go visit his grave.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
The most famous baseball player ever, Babe Ruth, was born, grew up and started in baseball here. He really liked duckpin bowling, so the story goes.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
Olympic swimmer Michael Phelps grew up, lives and trains here.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
John Waters, a famous film director of cult classics, is from Baltimore, and the city is prominent in many of his movies.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
HL Mencken was a celebrated intellectual and writer.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
Frederick Douglass, the abolitionist and intellectual, was born and lived near here.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
There was a wonderfully done and controversial television program from HBO, <a href="http://www.imdb.com/title/tt0306414/">The Wire</a>, by David Simon, that everyone talks about around here. It’s filmed in and is about Baltimore.
</p>
</li>
<li dir="ltr">
<p dir="ltr">
There is a Baltimore accent, but you may miss it at first. People say hon as a term of endearment, pronounce Baltimore as Bawlmer and Washington as Warshington, among other things. Think about all of the time you can save for research now, by omitting several pesky syllables.
</p>
</li>
<p dir="ltr">
That’s it for now. We’ll do another one on Hopkins and research in the area.
</p>
Help needed for establishing an ASA statistical genetics and genomics section
2013-09-12T11:32:57+00:00
http://simplystats.github.io/2013/09/12/help-needed-for-establishing-an-asa-statistical-genetics-and-genomics-section
<p>To promote research and education in statistical genetics and genomics, some of us in the community would like to establish a statistical genetics and genomics section of the American Statistical Association (ASA). Having an ASA section gives us certain advantages, such as having allocated invited sessions at JSM, young investigator and student awards, and senior investigator awards in statistical genetics and genomics, as well as a community to interact and exchange information.</p>
<p>We need at least 100 ASA members to pledge that they will join the section (if you are in more than 3 sections already you will be asked to pay a nominal fee of less than $10). If you are interested, please fill in a row in the following Google doc by November 1st:</p>
<p><a href="https://docs.google.com/spreadsheet/ccc?key=0AtD3gd8kGN45dE9BZ1pTYWtCa0M2VWhKckRoUE9KLVE#gid=0" target="_blank">https://docs.google.com/<wbr />spreadsheet/ccc?key=<wbr />0AtD3gd8kGN45dE9BZ1pTYWtCa0M2V<wbr />WhKckRoUE9KLVE#gid=0</a></p>
Implementing Evidence-based Data Analysis: Treading a New Path for Reproducible Research (Part 3)
2013-09-05T16:30:47+00:00
http://simplystats.github.io/2013/09/05/implementing-evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-3
<p><a href="http://simplystatistics.org/2013/08/28/evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-2/">Last week</a> I talked about how we might be able to improve data analyses by moving towards “evidence-based” data analysis and using data analytic techniques that are proven to be useful based on statistical research. My feeling was that this approach attacks the most “upstream” aspect of data analysis, before problems have the chance to filter down into things like publications or, even worse, clinical decision-making.</p>
<p>In this third (and final!) post on this topic I wanted to describe a little how we could implement evidence-based data analytic pipelines. Depending on your favorite software system you could imagine a number of ways to do this. If the pipeline were implemented in R, you could imagine it as an R package. The precise platform is not critical at this point; I would imagine most complex pipelines would involve multiple different software systems tied together.</p>
<p>Below is a rough diagram of how I think the various pieces of an evidence-based data analysis pipeline would fit together.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2013/09/dsmpic.png"><img class="alignright size-large wp-image-1800" alt="dsmpic" src="http://simplystatistics.org/wp-content/uploads/2013/09/dsmpic-1024x608.png" width="640" height="380" srcset="http://simplystatistics.org/wp-content/uploads/2013/09/dsmpic-300x178.png 300w, http://simplystatistics.org/wp-content/uploads/2013/09/dsmpic-1024x608.png 1024w" sizes="(max-width: 640px) 100vw, 640px" /></a>There are a few key elements of this diagram that I’d like to stress:</p>
<ol>
<li><span style="line-height: 16px;"> <strong>Inputs are minimal</strong>. You don’t want to allow for a lot of inputs or arguments that can be fiddled with. This reduces the number of degrees of freedom and hopefully reduces the amount of hacking. Basically, you want to be able to input the data and perhaps some metadata.</span></li>
<li><strong>Analysis comes in stages</strong>. There are multiple stages in any analysis, not just the part where you fit a model. Everything is important and every stage should use the best available method.</li>
<li><strong>The stuff in the red box does not involve manual intervention</strong>. The point is to not allow tweaking, fudging, and fiddling. Once the data goes in, we just wait for something to come out the other end.</li>
<li><strong>Methods should be benchmarked</strong>. For each stage of the analysis, there is a set of methods that are applied. These methods should, at a minimum, be benchmarked via a standard group of datasets. That way, if another method comes along, we have an objective way to evaluate whether the new method is better than the older methods. New methods that improve on the benchmarks can replace the existing methods in the pipeline.</li>
<li><strong>Output includes a human-readable report</strong>. This report summarizes what the analysis was and what the results were (including results of any sensitivity analysis). The material in this report could be included in the “Methods” section of a paper and perhaps in the “Results” or “Supplementary Materials”. The goal would be to allow someone who was not intimately familiar with all of the methods used in the pipeline to be able to walk away with a report that they could understand and interpret. At a minimum, this person could take the report and share it with their local statistician for help with interpretation.</li>
<li><strong>There is a defined set of output parameters</strong>. Each analysis pipeline should, in a sense, have an “API” so that we know what outputs to expect (not the exact values, of course, but what kinds of values). For example, if a pipeline fits a regression model at the end and the regression parameters are the key objects of interest, then the output could be defined as a vector of regression parameters. There are two reasons to have this: (1) the outputs, if the pipeline is deterministic, could be used for regression testing in case the pipeline is modified; and (2) the outputs could serve as inputs into another pipeline or algorithm.</li>
</ol>
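<p>To make a few of these elements concrete, here is a minimal sketch in Python of what such a pipeline could look like. It is only an illustration under assumptions of my own: the stage names (<code>clean</code>, <code>fit_line</code>), the <code>PipelineOutput</code> type, and the rule of dropping records with missing values are all hypothetical, and a real pipeline would use benchmarked methods at each stage. The point is that the only input is the data, the stages run without manual intervention, and the output has a declared shape that can be checked in a regression test.</p>

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PipelineOutput:
    """Declared output 'API': the kinds of values to expect, not exact values."""
    intercept: float
    slope: float
    n_used: int


def clean(pairs):
    """Stage 1 (hypothetical rule): drop records with a missing x or y."""
    return [(x, y) for x, y in pairs if x is not None and y is not None]


def fit_line(pairs):
    """Stage 2: ordinary least squares for y = intercept + slope * x."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def run_pipeline(pairs):
    """The only input is the data; no tuning arguments are exposed."""
    cleaned = clean(pairs)
    intercept, slope = fit_line(cleaned)
    return PipelineOutput(intercept=intercept, slope=slope, n_used=len(cleaned))


# Toy data lying exactly on y = 1 + 2x, plus one record with a missing x.
data = [(0, 1.0), (1, 3.0), (2, 5.0), (None, 9.0), (3, 7.0)]
result = run_pipeline(data)
print(result)
```

<p>Because the pipeline is deterministic, the stored output from one run can be compared against the output after any modification, which is the regression-testing use described in the last item of the list.</p>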
<p>Clearly, one pipeline is not enough. We need many of them for different problems. So what do we do with all of them?</p>
<p>I think we could organize them in a central location (kind of a specialized GitHub) where people could search for, download, create, and contribute to existing data analysis pipelines. An analogy (but not exactly a model) is the <a href="http://www.cochrane.org">Cochrane Collaboration</a> which serves as a repository for evidence-based medicine. There are already a number of initiatives along these lines, such as the <a href="http://galaxyproject.org">Galaxy Project</a> for bioinformatics. I don’t know whether it’d be ideal to have everything in one place or have a number of sub-projects for specialized areas.</p>
<p>Each pipeline would have a leader (or “friendly dictator”) who would organize the contributions and determine which components would go where. This could obviously be contentious, more so in some areas than in others, but I don’t think any more contentious than your average open source project (check the archives of the Linux Kernel or Git mailing lists and you’ll see what I mean).</p>
<p>So, to summarize, I think we need to organize lots of evidence-based data analysis pipelines and make them widely available. If I were writing this 5 or 6 years ago, I’d be complaining about a lack of infrastructure out there to support this. But nowadays, I think we have pretty much everything we need in terms of infrastructure. So what are we waiting for?</p>
Repost: A proposal for a really fast statistics journal
2013-09-04T14:54:13+00:00
http://simplystats.github.io/2013/09/04/repost-a-proposal-for-a-really-fast-statistics-journal
<p><em>Editor’s note: This is a repost of a previous Simply Statistics column that seems to be relevant again in light of Marie Davidian’s <a href="http://magazine.amstat.org/blog/2013/09/01/peerreview/">really important column</a> on the peer review process. You should also check out <a href="http://yihui.name/en/2012/03/a-really-fast-statistics-journal/">Yihui’s thoughts on this</a>, which verge on the creation of a very fast/dynamic stats journal. </em></p>
<p>I know we need a new journal like we need a good poke in the eye. But I got fired up by the recent discussion of open science (by <a href="http://krugman.blogs.nytimes.com/2012/01/17/open-science-and-the-econoblogosphere/" target="_blank">Paul Krugman</a> and others) and the seriously misguided <a href="http://en.wikipedia.org/wiki/Research_Works_Act" target="_blank">Research Works Act</a>, which aimed to make it illegal to deposit published papers funded by the government in PubMed Central or other open access databases.</p>
<div>
<p>I also realized that I spend a huge amount of time/effort on the following things: (1) waiting for reviews (typically months), (2) addressing reviewer comments that are unrelated to the accuracy of my work - just adding citations to referees’ papers or doing additional simulations, and (3) resubmitting rejected papers to new journals - this is a huge time suck since I have to reformat, etc. Furthermore, if I want my papers to be published open-access I also realized I have to pay at minimum <a href="http://simplystatistics.tumblr.com/post/12286350206/free-access-publishing-is-awesome-but-expensive-how" target="_blank">$1,000 per paper</a>. So I thought up my criteria for an ideal statistics journal. It would be accurate, have fast review times, and not discriminate based on how interesting an idea is. I have found that my most interesting ideas are the hardest ones to get published. This journal would:</p>
<ul>
<li>
Be open-access and free to publish your papers there. You own the copyright on your work.
</li>
<li>
The criteria for publication would be: (1) it has to do with statistics, computation, or data analysis, and (2) the work is technically correct.
</li>
<li>
We would accept manuals, reports of new statistical software, and full length research articles.
</li>
<li>
There would be no page limits/figure limits.
</li>
<li>
The journal would be published exclusively online.
</li>
<li>
We would guarantee reviews within 1 week and publication immediately upon review if criteria (1) and (2) are satisfied
</li>
<li>
Papers would receive a star rating from the editor - 0-5 stars. There would be a place for readers to also review articles
</li>
<li>
All articles would be published with a tweet/like button so they can be easily distributed
</li>
</ul>
<div>
</div>
<div>
To achieve such a fast review time, here is how it would work. We would have a large group of Associate Editors (hopefully 30 or more). When a paper was received, it would be assigned to an AE. The AEs would agree to referee papers within 2 days. They would use a form like this:
</div>
<div>
</div>
<blockquote>
<ul>
<li>
Review of: Jeff’s Paper
</li>
<li>
Technically Correct: Yes
</li>
<li>
About statistics/computation/data analysis: Yes
</li>
<li>
Number of Stars: 3 stars
</li>
</ul>
<ul>
<li>
3 Strengths of Paper (1 required):
</li>
<li>
This paper revolutionizes statistics
</li>
</ul>
<ul>
<li>
3 Weaknesses of Paper (1 required):
</li>
<li>
The proof that this paper revolutionizes statistics is pretty weak because he only includes one example.
</li>
</ul>
</blockquote>
<div>
</div>
<div>
That’s it, super quick, super simple, so it wouldn’t be hard to referee. As long as the answers to the first two questions were yes, it would be published.
</div>
<div>
</div>
<div>
So now here are my questions:
</div>
<div>
</div>
<div>
<ol>
<li>
Would you ever consider submitting a paper to such a journal?
</li>
<li>
Would you be willing to be one of the AEs for such a journal?
</li>
<li>
Is there anything you would change?
</li>
</ol>
</div>
</div>
Sunday data/statistics link roundup (9/1/13)
2013-09-01T15:00:01+00:00
http://simplystats.github.io/2013/09/01/sunday-datastatistics-link-roundup-9113
<ol>
<li><span style="line-height: 16px;">There has been a lot of discussion of the importance of open access on Twitter. I am 100% in favor of open access (<a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">I do wish it was less expensive</a>), but I also think that sometimes people lose sight of other important issues for junior scientists that go beyond open access. Dr. Isis has a <a href="http://isisthescientist.com/2013/08/28/the-morality-of-open-access-vs-increasing-diversity/">great example</a> of this on her blog. </span></li>
<li>Sherri R. has a <a href="http://drsherrirose.com/resources">great list of resources</a> for stats minded folks at the undergrad, grad, and faculty levels.</li>
<li>There he goes again. Another <a href="http://www.80grados.net/arce-y-la-estrella/">awesome piece by Rafa</a> on someone else’s blog. It is in Spanish but Google Translate does OK. Be sure to check out questions 3 and 4.</li>
<li>A really nice summary of Nate Silver’s talk at JSM and a post-talk interview (in video format) <a href="http://www.statisticsviews.com/details/feature/5133141/Nate-Silver-Wha%20t-I-need-from-statisticians.html">are available here.</a> Pair with this <a href="http://www.theonion.com/articles/nate-silver-vows-to-teach-chris-berman-how-to-read,33610/">awesome Onion piece</a> (both links via Marie D.)</li>
<li><a href="http://noahpinionblog.blogspot.com/2013/08/a-few-words-about-math.html">A really nice post</a> that made the rounds in the economics blogosphere talking about the use of mathematics in econ. This seems like a pretty relevant quote, “Instead, it was just some random thing that someone made up and wrote down because A) it was tractable to work with, and B) it sounded plausible enough so that most other economists who looked at it tended not to make too much of a fuss.”</li>
<li>More on <a href="http://blog.alinelerner.com/silicon-valley-hiring-is-not-a-meritocracy/">hiring technical people</a>. This is related to Google saying their <a href="http://simplystatistics.org/2013/06/20/googles-brainteasers-that-dont-work-and-johns-hopkins-biostatistics-data-analysis/">brainteaser interview questions don’t work</a>. Check out <a href="http://blog.alinelerner.com/lessons-from-a-years-worth-of-hiring-data/">the list</a> here of things that this person found useful in hiring technical people that could be identified easily. I like how typos and grammatical errors were one of the best predictors.</li>
</ol>
AAAS S&T Fellows for Big Data and Analytics
2013-08-30T10:00:51+00:00
http://simplystats.github.io/2013/08/30/aaas-st-fellows-for-big-data-and-analytics
<p>Thanks to <a href="https://twitter.com/ASA_SciPol">Steve Pierson</a> of the ASA for letting us know that the AAAS <a href="http://fellowships.aaas.org/02_Areas/02_index.shtml">Science and Technology Fellowship program</a> has a new category for “<a href="http://fellowships.aaas.org/02_Areas/02_index.shtml#data">Big Data and Analytics</a>”. For those not familiar, AAAS organizes the S&T Fellowship program to get scientists involved in the policy-making process in Washington and at the federal agencies. In general, the requirements for the program are</p>
<blockquote>
<p>Applicants must have a PhD or an equivalent doctoral-level degree at the time of application. Individuals with a master’s degree in engineering and at least three years of post-degree professional experience also may apply. Some programs require additional experience. Applicants must be U.S. citizens. Federal employees are not eligible for the fellowships.</p>
</blockquote>
<p>Further details are on the <a href="http://fellowships.aaas.org/04_Become/04_Eligibility.shtml">AAAS web site</a>.</p>
<p>I’ve met a number of current and former AAAS fellows working on Capitol Hill and at the various agencies and I have to say I’ve been universally impressed. I personally think getting more scientists into the federal government and involved with the policy-making process is a Good Thing. If you’re a statistician looking to have a different kind of impact, this might be for you.</p>
The return of the stat - Computing for Data Analysis & Data Analysis back on Coursera!
2013-08-29T09:31:08+00:00
http://simplystats.github.io/2013/08/29/the-return-of-the-stat-computing-for-data-analysis-data-analysis-back-on-coursera
<p>It’s the <a href="http://www.tubechop.com/watch/1390349">return of the stat</a>. Roger and I are going to be re-offering our Coursera courses:</p>
<p><strong>Computing for Data Analysis (starts Sept 23)</strong></p>
<p><a href="https://www.coursera.org/course/compdata">Sign up here</a>.</p>
<p><strong>Data Analysis (starts Oct 28)</strong></p>
<p><a href="https://www.coursera.org/course/dataanalysis">Sign up here</a>.</p>
Evidence-based Data Analysis: Treading a New Path for Reproducible Research (Part 2)
2013-08-28T10:14:32+00:00
http://simplystats.github.io/2013/08/28/evidence-based-data-analysis-treading-a-new-path-for-reproducible-research-part-2
<p dir="ltr">
Last week I <a href="http://simplystatistics.org/2013/08/21/treading-a-new-path-for-reproducible-research-part-1/">posted</a> about how I thought the notion of reproducible research did not go far enough to address the question of whether you could trust that a given data analysis was conducted appropriately. From some of the discussion on the post, it seems some of you thought I believed therefore that reproducibility had no value. That’s definitely not true and I’m hoping I can clarify my thinking in this followup post.
</p>
<p dir="ltr">
Just to summarize a bit from last week, one key problem I find with requiring reproducibility of a data analysis is that it comes only at the most “downstream” part of the research process, the post-publication part. So anything that was done incorrectly has already happened and the damage has been done to the analysis. Having code and data available, importantly, makes it possible to discover these problems, but only after the fact. I think this results in two problems: (1) It may take a while to figure out what exactly the problems are (even with code/data) and how to fix them; and (2) the problems in the analysis may have already caused some sort of harm.
</p>
<p dir="ltr">
<strong>Open Source Science?</strong>
</p>
<p dir="ltr">
For the first problem, I think a reasonable analogy for reproducible research is open source software. There the idea is that source code is available for all computer programs so that we can inspect and modify how a program runs. With open source software “<a href="http://en.wikipedia.org/wiki/Linus's_Law">all bugs are shallow</a>”. But the key here is that as long as all programmers have the requisite tools, they can modify the source code on their own, publish their corrected version (if they are fixing a bug), others can review it and accept or modify, and on and on. All programmers are more or less on the same footing, as long as they have the ability to hack the code. With distributed source code management systems like <a href="http://git-scm.com">git</a>, people don’t even need permission to modify the source tree. In this environment, the best idea wins.
</p>
<p dir="ltr">
The analogy with open source software breaks down a bit with scientific research because not all players are on the same footing. Typically, the original investigator is much better equipped to modify the “source code”, in this case the data analysis, and to fix any problems. Some types of analyses may require tremendous resources that are not available to all researchers. Also, it might take a long time for others who were not involved in the research to fully understand what is going on and how to make reasonable modifications. That may involve, for example, learning the science in the first place, or learning how to program a computer for that matter. So I think making changes to a data analysis and having them accepted is a slow process in science, much more so than with open source software. There are definitely things we can do to improve our ability to make rapid changes/updates, but the implementation of those changes is only just getting started.
</p>
<p dir="ltr">
<strong>First Do No Harm</strong>
</p>
<p dir="ltr">
The second problem, that some sort of harm may have already occurred before an analysis can be fully examined, is an important one. As I mentioned in the previous post, merely stating that an analysis is reproducible doesn’t say a whole lot about whether it was done correctly. In order to verify that, someone knowledgeable has to go into the details and muck around to see what is going on. If someone is not available to do this, then we may never know what actually happened. Meanwhile, the science still stands and others may build off of it.
</p>
<p dir="ltr">
In the Duke saga, one of the most concerning aspects of the whole story was that some of Potti’s research was going to be used to guide therapy in a clinical trial. The fact that a series of flawed data analyses was going to be used as the basis of choosing what cancer treatments people were going to get was very troubling. In particular, one of these flawed analyses reversed the labeling of the cancer and control cases!
</p>
<p>To me, it seems that waiting around for someone like Keith Baggerly to come around and spend close to 2,000 hours reproducing, inspecting, and understanding a series of analyses is not an efficient system. In particular, when actual human lives may be affected, it would be preferable if the analyses were done right in the first place, without the “statistics police” having to come in and check that everything was done properly.</p>
<p><strong>Evidence-based Data Analysis</strong></p>
<p dir="ltr">
What I think the statistical community needs to invest time and energy into is what I call “evidence-based data analysis”. What do I mean by this? Most data analyses are not the simple classroom exercises that we’ve all done involving linear regression or two-sample t-tests. Most of the time, you have to obtain the data, clean that data, remove outliers, impute missing values, transform variables and on and on, even before you fit any sort of model. Then there’s model selection, model fitting, diagnostics, sensitivity analysis, and more. So a data analysis is really a pipeline of operations where the output of one stage becomes the input of another.
</p>
<p dir="ltr">
The basic idea behind evidence-based data analysis is that for each stage of that pipeline, we should be using the best method, justified by appropriate statistical research that provides evidence favoring one method over another. If we cannot reasonably agree on a best method for a given stage in the pipeline, then we have a gap that needs to be filled. So we fill it!
</p>
<p dir="ltr">
Just to clarify things before moving on too far, here’s a simple example.
</p>
<p dir="ltr">
<strong>Evidence-based Histograms</strong>
</p>
<p dir="ltr">
Consider the following simple histogram.
</p>
<p dir="ltr">
<a href="http://simplystatistics.org/wp-content/uploads/2013/08/hist.png"><img class="alignright size-large wp-image-1773" alt="hist" src="http://simplystatistics.org/wp-content/uploads/2013/08/hist-1024x703.png" width="640" height="439" srcset="http://simplystatistics.org/wp-content/uploads/2013/08/hist-300x206.png 300w, http://simplystatistics.org/wp-content/uploads/2013/08/hist-1024x703.png 1024w" sizes="(max-width: 640px) 100vw, 640px" /></a>
</p>
<p dir="ltr">
The histogram was created in R by calling hist(x) on some Normal random deviates (I don’t remember the seed so unfortunately it is not reproducible). Now, we all know that a histogram is a kind of smoother, and with any smoother, the critical parameter is the smoothing parameter or the bandwidth. Here, it’s the size of the bin or the number of bins.
</p>
<p>Notice that when I call ‘hist’ I don’t actually specify the number of bins. Why not? Because in R, the default is to use Sturges’ formula for the number of bins. Where does that come from? Well, there is a <a href="http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1926.10502161?journalCode=uasa20#.Uh3_FBbHKZY">paper</a> in the <em>Journal of the American Statistical Association</em> in 1926 by H. A. Sturges that justifies why such a formula is reasonable for a histogram (it is a very short paper, those were the days). R provides other choices for choosing the number of bins. For example, David Scott <a href="http://biomet.oxfordjournals.org/content/66/3/605.short">wrote a paper</a> in <em>Biometrika</em> that justified the bandwidth/bin size based on integrated mean squared error criteria.</p>
<p>The point is that R doesn’t just choose the default number of bins willy-nilly; there’s actual research behind that choice and evidence supporting why it’s a good choice. Now, we may not all agree that this default is the best choice at all times, but personally I rarely modify the default number of bins. Usually I just want to get a sense of what the distribution looks like and the default is fine. If there’s a problem, transforming the variable somehow often is more productive than modifying the number of bins. What’s the best transformation? Well, it turns out there’s <a href="http://en.wikipedia.org/wiki/Power_transform">research on that too</a>.</p>
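<p>To make the two rules concrete, here is a minimal sketch in Python (not R, and not the code behind the histogram above). The function names are made up for illustration. Sturges’ rule is taken as ceiling(log2(n) + 1) bins, and Scott’s rule as a bin width of 3.49 × s × n^(-1/3), where s is the sample standard deviation; I am assuming the constant from Scott’s paper here, and software implementations may round it differently.</p>

```python
import math
import random
import statistics


def sturges_bins(n):
    """Number of bins under Sturges' rule: ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n) + 1)


def scott_bin_width(x):
    """Bin width under Scott's rule: 3.49 * s * n^(-1/3).

    s is the sample standard deviation of x; 3.49 is the constant
    assumed from Scott's paper (an assumption of this sketch).
    """
    return 3.49 * statistics.stdev(x) * len(x) ** (-1 / 3)


# Illustrative data: 100 Normal deviates with a fixed (arbitrary) seed,
# so that this example, unlike the histogram in the post, is reproducible.
rng = random.Random(2013)
x = [rng.gauss(0, 1) for _ in range(100)]

print("Sturges bins for n = 100:", sturges_bins(len(x)))  # 8
print("Scott bin width:", scott_bin_width(x))
```

<p>Note that Sturges’ rule depends only on the sample size, while Scott’s rule also uses the spread of the data, which is one reason the two can disagree.</p>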
<p><strong>Evidence-based Reproducible Research</strong></p>
<p dir="ltr">
Now why can’t we extend the idea behind the histogram bandwidth to all data analysis? I think we can. For every stage of a given data analysis pipeline, we can have the “best practices” and back up those practices with statistical research. Of course it’s possible that such best practices have not yet been developed. This is common in emerging areas like genomics where the data collection technology is constantly changing. That’s fine, but in more mature areas, I think it’s possible for the community to agree on a series of practices that work, say, 90% of the time.
</p>
<p dir="ltr">
There are a few advantages to evidence-based reproducible research.
</p>
<ol>
<li>It reduces the “researcher degrees of freedom”. Researchers would be disincentivized from choosing the method that produces the “best” results if there is already a generally agreed upon approach. If a given data analysis required a different approach, the burden would be on the analyst to justify why a deviation from the generally accepted approach was made.</li>
<li>The methodology would be transparent because the approach would have been vetted by the community. I call this “transparent box” analysis, as opposed to black box analysis. The analysis would be transparent so you would know exactly what is going on, but it would be “<a href="http://www.hulu.com/watch/284761">locked in a box</a>” so that you couldn’t tinker with it to game the results.</li>
<li>You would not have the <a href="http://simplystatistics.org/2013/08/09/embarrassing-typos-reveal-the-dangers-of-the-lonely-data-analyst/">lonely data analyst</a> coming up with their own magical method to analyze the data. If a researcher claimed to have conducted an analysis using an evidence-based pipeline, you could at least have a sense that something reasonable was done. You would still need reproducibility to ensure that the researcher was not misrepresenting him/herself, but now we would have two checks on the analysis, not just one.</li>
<li>Most importantly, evidence-based reproducible research attacks the furthest upstream aspect of the research, which is the analysis itself. It guarantees that generally accepted approaches are used to analyze the data from the very beginning and hopefully prevents problems from occurring rather than letting them propagate through the system.</li>
</ol>
<p dir="ltr">
What can we do to bring evidence-based data analysis practices to all of the sciences? I’ll write about what I think we can do in the next post.
</p>
Interview with Ani Eloyan and Betsy Ogburn
2013-08-27T10:25:48+00:00
http://simplystats.github.io/2013/08/27/interview-with-ani-eloyan-and-betsy-ogburn
<p>Jeff and I interview Ani Eloyan and Betsy Ogburn, two new Assistant Professors in the Department of Biostatistics here.</p>
<p>Jeff and I talk to Ani and Betsy about their research interests and finally answer the burning question: “What is the future of statistics?”</p>
Statistics meme: Sad p-value bear
2013-08-26T12:43:11+00:00
http://simplystats.github.io/2013/08/26/statistics-meme-sad-p-value-bear
<div style="width: 470px" class="wp-caption aligncenter">
<img alt="" src="http://i.imgflip.com/37w9c.jpg" width="460" height="480" />
<p class="wp-caption-text">
Sad p-value bear wishes you had a bigger sample size.
</p>
</div>
<p>I was just at a conference where the idea for a sad p-value bear meme came up (in the spirit of <a href="http://biostatisticsryangoslingreturns.tumblr.com/">Biostatistics Ryan Gosling</a>). This should not be considered an endorsement of p-values or p-value hacking.</p>
Did Faulty Software Shut Down the NASDAQ?
2013-08-24T10:00:33+00:00
http://simplystats.github.io/2013/08/24/did-faulty-software-shut-down-the-nasdaq
<p>This past Thursday, the NASDAQ stock exchange shut down for just over 3 hours due to some technical problems. It’s still not clear what the problem was because NASDAQ officials are being tight-lipped. NASDAQ has had a bad run of problems recently; the most visible was the botching of the Facebook initial public offering.</p>
<p>Stock trading these days is a highly technical business involving complex algorithms and multiple exchanges spread across the country. Poorly coded software or just plain old bugs have the potential to take down an entire exchange and paralyze parts of the financial system for hours.</p>
<p>Mary Jo White, the Chairman of the SEC, is apparently getting involved.</p>
<blockquote>
<p>Thursday evening, Ms. White said in a statement that the paralysis at the Nasdaq was “serious and should reinforce our collective commitment to addressing technological vulnerabilities of exchanges and other market participants.”</p>
<p>She said she would push ahead with recently proposed rules that would add testing requirements and safeguards for trading software. So far, those rules have faced resistance from the exchange companies. Ms. White said that she would “shortly convene a meeting of the leaders of the exchanges and other major market participants to accelerate ongoing efforts to further strengthen our markets.”</p>
</blockquote>
<p>Having testing requirements for trading software is an interesting idea. It’s easy to see why the industry would be against it. Trading is a fast moving business and my guess is software is updated/modified constantly to improve performance or to provide people an edge. If you had to get approval or run a bunch of tests every time you wanted to deploy something, you’d quickly get behind the curve.</p>
<p>But is there an issue of safety here? If a small bug in the computer code on which the exchange relies can take down the entire system for hours, isn’t that a problem of “financial safety”? Other problems, like the notorious “flash crash” of 2010 where the Dow Jones Industrial Average dropped 700 points in minutes, have the potential to affect regular people, not just hedge fund traders.</p>
<p>It’s not unprecedented to subject computer code to higher scrutiny. Code that flies airplanes or runs air-traffic control systems is all tested and reviewed rigorously before being put into production and I think most people would consider that reasonable. Are financial markets the next area? What about scientific research?</p>
Stratifying PISA scores by poverty rates suggests imitating Finland is not necessarily the way to go for US schools
2013-08-23T10:01:31+00:00
http://simplystats.github.io/2013/08/23/stratifying-pisa-scores-by-poverty-rates-suggests-imitating-finland-is-not-necessarily-the-way-to-go-for-us-schools
<p>For the past several years a <a href="http://www.businessinsider.com/finlands-education-system-best-in-world-2012-11?op=1">steady</a> <a href="http://www.nytimes.com/2011/12/13/education/from-finland-an-intriguing-school-reform-model.html?pagewanted=all">stream</a> of <a href="http://www.smithsonianmag.com/people-places/Why-Are-Finlands-Schools-Successful.html">news articles</a> and <a href="http://www.greatschools.org/students/2453-finland-education.gs">opinion pieces</a> have been praising the virtues of Finnish schools and exhorting the US to imitate this system. One data point supporting this view comes from the most recent PISA scores (2009) in which Finland outscored the US 536 to 500. Several people have pointed out that this is an apples (huge and diverse) to oranges (small and homogeneous) comparison. One of the many differences that make the comparison complicated is that Finland has fewer students living in poverty (3%) than the US (20%). <a href="http://nasspblogs.org/principaldifference/2010/12/pisa_its_poverty_not_stupid_1.html">This post</a> defending US public school teachers makes this point with data. Here I show these data in graphical form. The plot on the left shows <a href="http://nces.ed.gov/surveys/pisa/">PISA scores</a> versus the percent of students living in poverty for several countries. There is a pattern suggesting that higher poverty rates are associated with lower PISA scores. In the plot on the right, US schools are stratified by % poverty (orange points). The regression line is the same. Some countries are added (purple) for comparative purposes (the <a href="http://nasspblogs.org/principaldifference/2010/12/pisa_its_poverty_not_stupid_1.html">post</a> does not provide their poverty rates). Note that US schools with poverty rates comparable to Finland’s (below 10%) outperform Finland and schools in the 10-24% range aren’t far behind. So why should these schools change what they are doing? 
Schools with poverty rates above 25% are another story. Clearly the US has lots of work to do in trying to improve performance in these schools, but is it safe to assume that Finland’s system would work for these student populations?</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2013/08/pisa2.png"><img src="http://simplystatistics.org/wp-content/uploads/2013/08/pisa2.png" alt="pisa scores plotted against percent poverty" /></a></p>
<p>Note that I scraped data from <a href="http://nasspblogs.org/principaldifference/2010/12/pisa_its_poverty_not_stupid_1.html">this post</a> and not the original source.</p>
If you are near DC/Baltimore, come see Jeff talk about Coursera
2013-08-23T08:53:08+00:00
http://simplystats.github.io/2013/08/23/if-you-are-near-dcbaltimore-come-see-jeff-talk-about-coursera
<p>I’ll be speaking at the Data Science Maryland meetup. The title of my presentation is “Teaching Data Science to the Masses”. The talk is at 6pm on Thursday, Sept. 19th. More info <a href="http://www.meetup.com/Data-Science-MD/events/135629022/">here</a>.</p>
Chris Lane, U.S. tourism boycotts, and large relative risks on small probabilities
2013-08-22T09:45:57+00:00
http://simplystats.github.io/2013/08/22/chris-lane-u-s-tourism-boycotts-and-large-relative-risks-on-small-probabilities
<p>Chris Lane <a href="http://www.cnn.com/2013/08/21/justice/australia-student-killed-oklahoma/index.html?hpt=hp_t2">was tragically killed</a> (link via Leah J.) in a shooting in Duncan, Oklahoma. According to the reports, it appears to have been a random and completely senseless act of violence. It is horrifying to think that those kids were just looking around for someone to kill because they were bored.</p>
<p>Gun violence in the U.S. is way too common and I’m happy about efforts to reduce the chance of this type of event. But I noticed this quote in the above linked CNN article from the former prime minister of Australia, Tim Fischer:</p>
<blockquote>
<p>People thinking of going to the USA for business or tourist trips should think carefully about it given the statistical fact you are 15 times more likely to be shot dead in the USA than in Australia per capita per million people.</p>
</blockquote>
<p>The CNN article suggests he is calling for a boycott of U.S. tourism. I’m guessing he got his data from a table <a href="http://en.wikipedia.org/wiki/List_of_countries_by_firearm-related_death_rate">like this</a>. According to the table, the total firearm related deaths per one million in Australia is 10.6 and in the U.S. 103. So the ratio is something like 10 times. If you restrict to homicides, the rates are 1.3 per million for Australia and 36 per million for the U.S. Here the ratio is almost 28 times.</p>
<p>So the question is, should you boycott the U.S. if you are an Australian tourist? Well, the percentage of people killed in firearm related homicides is 0.0036% in the U.S. and 0.00013% for Australia. So it is incredibly unlikely that you will be killed by a firearm in either country. The issue here is that with small probabilities, you can get huge relative risks, even when both outcomes are very unlikely in an absolute sense. The Chris Lane killing is tragic and horrifying, but I’m not sure a tourism boycott for the purposes of safety is justified.</p>
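<p>The arithmetic behind this comparison can be checked with a short Python sketch. The rates are the per-million figures quoted above from the linked table; the function and variable names are made up for illustration.</p>

```python
def per_million_to_percent(rate_per_million):
    """Convert a rate per one million people into a percentage."""
    return rate_per_million / 1_000_000 * 100


def relative_risk(rate_a, rate_b):
    """Ratio of two rates: how many times larger rate_a is than rate_b."""
    return rate_a / rate_b


# Firearm-related deaths per one million people, as quoted in the post.
all_deaths = {"Australia": 10.6, "U.S.": 103.0}
homicides = {"Australia": 1.3, "U.S.": 36.0}

# Relative risks (U.S. versus Australia).
print("All firearm deaths, ratio:", relative_risk(all_deaths["U.S."], all_deaths["Australia"]))
print("Firearm homicides, ratio:", relative_risk(homicides["U.S."], homicides["Australia"]))

# Absolute risks, as percentages of the population.
for country, rate in homicides.items():
    print(country, "homicide risk (%):", per_million_to_percent(rate))

# Absolute difference in homicide risk, per one million people.
print("Difference per million:", homicides["U.S."] - homicides["Australia"])
```

<p>The ratios come out at roughly 10 and 28, while the absolute difference in homicide risk is about 35 per million people, which is the gap between relative and absolute risk that the post is pointing at.</p>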
Treading a New Path for Reproducible Research: Part 1
2013-08-21T14:34:04+00:00
http://simplystats.github.io/2013/08/21/treading-a-new-path-for-reproducible-research-part-1
<p dir="ltr">
Discussions about reproducibility in scientific research have been on the rise lately, including <a href="http://simplystatistics.org/2013/07/09/repost-preventing-errors-through-reproducibility/">on</a> <a href="http://simplystatistics.org/2013/04/30/reproducibility-and-reciprocity/">this</a> <a href="http://simplystatistics.org/2011/12/02/reproducible-research-in-computational-science/">blog</a>. There are many underlying trends that have produced this increased interest in reproducibility: larger and larger studies being harder to replicate independently, cheaper data collection technologies/methods producing larger datasets, cheaper computing power allowing for more sophisticated analyses (even for small datasets), and the rise of general computational science (for every “X” we now have “Computational X”).
</p>
<p>For those that haven’t been following, here’s a brief review of what I mean when I say “reproducibility”. For the most part in science, we focus on what I and some others call “replication”. The purpose of replication is to address the validity of a scientific claim. If I conduct a study and conclude that “X is related to Y”, then others may be encouraged to replicate my study–with independent investigators, data collection, instruments, methods, and analysis–in order to determine whether my claim of “X is related to Y” is in fact true. If many scientists replicate the study and come to the same conclusion, then there’s evidence in favor of the claim’s validity. If other scientists cannot replicate the same finding, then one might conclude that the original claim was false. In either case, this is how science has always worked and how it will continue to work.</p>
<p dir="ltr">
Reproducibility, on the other hand, focuses on the validity of the data analysis. In the past, when datasets were small and the analyses were fairly straightforward, the idea of being able to reproduce a data analysis was perhaps not that interesting. But now, with computational science, where data analyses can be extraordinarily complicated, there’s great interest in whether certain data analyses can in fact be reproduced. By this I mean: is it possible to take someone’s dataset and come to the same numerical/graphical/whatever output that they came to? While this seems theoretically trivial, in practice it’s very complicated because a given data analysis, which typically will involve a long pipeline of analytic operations, may be difficult to keep track of without proper organization, training, or software.
</p>
<p><strong>What Problem Does Reproducibility Solve?</strong></p>
<p dir="ltr">
In my opinion, reproducibility cannot really address the validity of a scientific claim as well as replication. Of course, if a given analysis is not reproducible, that may call into question any conclusions drawn from the analysis. However, if an analysis is reproducible, that says practically nothing about the validity of the conclusion or of the analysis itself.
</p>
<p dir="ltr">
In fact, there are numerous examples in the literature of analyses that were reproducible but just wrong. Perhaps the most nefarious recent example is the <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">Potti scandal at Duke</a>. Given the amount of effort (somewhere close to 2000 hours) Keith Baggerly and his colleagues had to put into figuring out what Potti and others did, I think it’s reasonable to say that their work was not reproducible. But in the end, Baggerly was able to reproduce some of the results--this was how he was able to figure out that the analyses were incorrect. If the Potti analysis had not been reproducible from the start, it would have been impossible for Baggerly to come up with the laundry list of errors that they made.
</p>
<p dir="ltr">
The <a href="http://simplystatistics.org/2013/04/21/nevins-potti-reinhart-rogoff/">Reinhart-Rogoff kerfuffle</a> is another example of analysis that ultimately was reproducible but nevertheless questionable. While Herndon did have to do a little reverse engineering to figure out the original analysis, it was nowhere near the years-long effort of Baggerly and colleagues. However, it was Reinhart-Rogoff’s unconventional weighting scheme (fully reproducible, mind you) that drew all of the attention and strongly influenced the analysis.
</p>
<p dir="ltr">
I think the key question we want to answer when seeing the results of any data analysis is “Can I trust this analysis?” It’s not possible to go into every data analysis and check everything, even if all the data and code were available. In most cases, we want to have a sense that the analysis was done appropriately (if not optimally). I would argue that requiring that analyses be reproducible does not address this key question.
</p>
<p>With reproducibility you get a number of important benefits: transparency, data and code for others to analyze, and an increased rate of transfer of knowledge. These are all very important things. Data sharing in particular may be important independent of the need to reproduce a study if others want to aggregate datasets or do meta-analyses. But reproducibility does not guarantee validity or correctness of the analysis.</p>
<p><strong>Prevention vs. Medication</strong></p>
<p dir="ltr">
One key problem with the notion of reproducibility is the point in the research process at which we can apply it as an intervention. Reproducibility plays a role only in the most downstream aspect of the research process--post-publication. Only after a paper is published (and after any questionable analyses have been conducted) can we check to see if an analysis was reproducible or conducted in error.
</p>
<p dir="ltr">
<a href="http://simplystatistics.org/wp-content/uploads/2013/08/rrpipeline.png"><img class="alignright size-large wp-image-1705" alt="rrpipeline" src="http://simplystatistics.org/wp-content/uploads/2013/08/rrpipeline-1024x463.png" width="640" height="289" srcset="http://simplystatistics.org/wp-content/uploads/2013/08/rrpipeline-300x135.png 300w, http://simplystatistics.org/wp-content/uploads/2013/08/rrpipeline-1024x463.png 1024w" sizes="(max-width: 640px) 100vw, 640px" /></a>
</p>
<p dir="ltr">
At this point it may be difficult to correct any mistakes if they are identified. Grad students have graduated, postdocs have left, people have moved on. In the Potti case, letters to the journal editors were ignored. While it may be better to check the research process at the end rather than to never check it, intervening at the post-publication phase is arguably the most expensive place to do it. At this phase of the research process, you are merely “medicating” the problem, to draw an analogy with chronic diseases. But fundamental data analytic damage may have already been done.
</p>
<p dir="ltr">
This medication aspect of reproducibility reminds me of a famous quotation from R. A. Fisher:
</p>
<blockquote>
<p>To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.</p>
</blockquote>
<p dir="ltr">
Reproducibility allows for the statistician to conduct the post mortem of a data analysis. But wouldn’t it have been better to have prevented the analysis from dying in the first place?
</p>
<p dir="ltr">
<strong>Moving Upstream</strong>
</p>
<p dir="ltr">
There has already been much discussion of changing the role of reproducibility in the publication/dissemination process. What if a paper had to be deemed reproducible before it was published? The question here is who will reproduce the analysis? We can't trust the authors to do it so we have to get an independent third party. What about peer reviewers? I would argue that this is a pretty big burden to place on a peer reviewer who is already working for free. How about one of the Editors? Well, at the journal <em>Biostatistics</em>, that’s <a href="http://biostatistics.oxfordjournals.org/content/10/3/405.long">exactly what we do</a>. However, our policy is voluntary and only plays a role after a paper has been accepted through the usual peer review process. At any rate, from a business perspective, most journal owners will be reluctant to implement any policy that might reduce the number of submissions to the journal.
</p>
<p><strong>What Then?</strong></p>
<p dir="ltr">
To summarize, I believe reproducibility of computational research is very important, primarily to increase transparency and to improve knowledge sharing. However, I don’t think reproducibility in and of itself addresses the fundamental question of “Can I trust this analysis?”. Furthermore, reproducibility plays a role at the most downstream part of the research process (post-publication) where it is costliest to fix any mistakes that may be discovered. Ultimately, we need to think beyond reproducibility and to consider developing ways to ensure the quality of data analysis from the start.
</p>
<p dir="ltr">
How can we address the key problem concerning the validity of a data analysis? I’ll talk about what I think we should do in Part 2 of this post.
</p>
A couple of requests for the @Statistics2013 future of statistics workshop
2013-08-20T10:22:50+00:00
http://simplystats.github.io/2013/08/20/a-couple-of-requests-for-the-statistics2013-future-of-statistics-workshop
<p>Statistics 2013 is hosting a workshop on the <a href="http://www.statistics2013.org/about-the-future-of-the-statistical-sciences-workshop/">future of statistics</a>. Given the timing and the increasing popularity of our discipline I think it’s a great idea to showcase the future of our field.</p>
<p>I just have two requests:</p>
<div id=":27d">
<div role="chatMessage">
<div dir="ltr" id=":2o3">
<ol>
<li>
Please invite more junior people to speak who are doing cutting edge work that will define the future of our field.
</li>
<li>
Please focus the discussion on some of the real and very urgent issues facing our field.
</li>
</ol>
</div>
</div>
</div>
<p>Regarding #1 the list of speakers appears to be only <a href="http://www.statistics2013.org/presentations-and-panelists/">very senior people</a>. I wish there were more junior speakers because: (1) the future of statistics will be defined by people who are just starting their careers now and (2) there are some awesome superstars who are making huge advances in, among other things, the <a href="http://www.princeton.edu/~hanliu/">theory of machine learning</a>, <a href="http://www.biostat.wisc.edu/~cdewey/">high-throughput data analysis</a>, <a href="http://simplystatistics.org/2012/06/01/interview-with-amanda-cox-graphics-editor-at-the-new/">data visualization</a>, and <a href="http://www.biostat.jhsph.edu/~khansen/software.html">software creation</a>. I think including at least one person under 40 on the speaking list would bring some fresh energy.*</p>
<p>Regarding #2 I think there are a few issues that are incredibly important for our field as we move forward. I hope that the discussion will cover some of these:</p>
<ol>
<li><a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">Problem first not solution backward</a>. It would be awesome if there was a whole panel filled with people from industry/applied statistics talking about the major problems where statisticians are needed and how we can train statisticians to tackle those problems. In particular it would be cool to see discussion of: (1) should we remove some math and add some software development to our curriculum?, (2) should we rebalance our curriculum to include more machine learning?, (3) should we require all students to do rotations in scientific or business internships?, (4) should we make presentation skills a high priority skill along with the required courses in math stats/applied stats?</li>
<li>Straight up embracing online education. <a href="http://simplystatistics.org/2012/08/10/why-we-are-teaching-massive-open-online-courses-moocs/">We are teaching MOOCs</a> here at Simply Stats. But that is only one way to embrace online education. What about <a href="https://github.com/hadley/devtools/wiki/Rcpp">online</a> <a href="http://kbroman.github.io/minimal_make/">tutorials</a> on GitHub? Or how about making educational videos for <a href="http://www.youtube.com/watch?v=znaO6OHLTeY">software packages</a>?</li>
<li><a href="http://simplystatistics.org/2012/05/28/schlep-blindness-in-statistics/">Good software is now the most important contribution of statisticians</a>. The most glaring absence from the list of speakers and panels is that there is no discussion of software! I have gone so far as to say if you (or someone else) aren’t writing software for your methods, <a href="http://simplystatistics.org/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper/">they don’t really exist</a>. We need to have a serious discussion as a field about how the future of version control, reproducibility, data sharing, etc. are going to work. This seems like the perfect venue.</li>
<li>How can we forge better partnerships with industry and other data generators? Facebook, Google, Bitly, Twitter, Fitbit etc. are all collecting huge amounts of data. But <a href="http://simplystatistics.org/2013/01/02/fitbit-why-cant-i-have-my-data/">there is no data sharing protocol</a> like there was for genomics. Similarly, much of the imaging data in the world is tied up in academic and medical institutes. Fresh statistical eyes can’t be placed on these problems until the data are available in easily accessible, analyzable formats. How can we forge partnerships that make the data more valuable to the companies/institutes creating them and add immense value to young statisticians?</li>
</ol>
<p>These next two are primarily targeted at academics:</p>
<ol>
<li><a href="http://simplystatistics.org/2012/03/14/a-proposal-for-a-really-fast-statistics-journal/">How can we speed up our publication process</a>? For academic statisticians this is a killer and major problem. I regularly wait 3-5 months for papers to be reviewed for the first time at the fastest stat journals. Some people still wait years. By then, the highest impact applied problems have moved on with better technology, newer methodology etc.</li>
<li>How can we make our promotion process/awards process more balanced between theoretical and applied contributions? I think both are very important, but right now, on balance, papers in JASA are much more highly rated than Bioconductor packages with 10,000+ users. Both are hard work, both represent important contributions and both should be given strong weight (for example in <a href="http://www.amstat.org/fellows/nominations/pdfs/RatingofNominees.pdf">rating ASA Fellows</a>).</li>
</ol>
<p>Anyway, I hope the conference is a huge success. I was pumped to see all the chatter on Twitter when Nate Silver spoke at JSM. That was a huge win for the organizers of the event. I am really hopeful that with the important efforts of the organizers of these big events we will see a continued trend toward a bigger and bigger impact of statistics.</p>
<p><em>* Rafa is invited, but he’s over 40 :-).**</em></p>
<p><em>** Rafa told me to mention he’s barely over 40.</em></p>
WANTED: Neuro-quants
2013-08-13T15:16:31+00:00
http://simplystats.github.io/2013/08/13/wanted-neuro-quants
<p>Our good colleagues <a href="http://www.bcaffo.com">Brian Caffo</a>, Martin Lindquist, and Ciprian Crainiceanu have written a nice editorial for the HuffPo on the <a href="http://www.huffingtonpost.com/american-statistical-association/wanted-neuroquants_b_3749363.html">need for statisticians in neuroimaging</a>.</p>
Embarrassing typos reveal the dangers of the lonely data analyst
2013-08-09T09:33:53+00:00
http://simplystats.github.io/2013/08/09/embarrassing-typos-reveal-the-dangers-of-the-lonely-data-analyst
<p><a href="http://lifesciencephdadventures.wordpress.com/2013/08/07/a-failure-of-authorship-and-peer-review/">A silly, but actually very serious, error</a> in the supplementary material of a recent paper in Organometallics is causing a stir on the internets (I saw it on <a href="http://andrewgelman.com/2013/08/08/for-chrissake-just-make-up-an-analysis-already-we-have-a-lab-here-to-run-yknow/">Andrew G.’s</a> blog). The error in question is a comment in the <a href="http://pubs.acs.org/doi/suppl/10.1021/om4000067/suppl_file/om4000067_si_002.pdf">supplementary material</a> of the paper:</p>
<blockquote>
<p>Emma, please insert NMR data here! where are they? and for this compound, just make up an elemental analysis . . .</p>
</blockquote>
<p><a href="http://blog.chembark.com/2013/08/06/a-disturbing-note-in-a-recent-si-file/">As has been pointed out</a> on the chemistry blogs, this is actually potentially a pretty serious problem. Apparently, the type of analysis in question is relatively easy to make up or at minimum, there are a lot of <a href="http://simplystatistics.org/2013/07/31/the-researcher-degrees-of-freedom-recipe-tradeoff-in-data-analysis/">researcher degrees of freedom</a>.</p>
<p>This error reminds me of another slip-up, this one from <a href="http://nsaunders.wordpress.com/2012/07/23/we-really-dont-care-what-statistical-method-you-used/">a paper in BMC Bioinformatics</a>. Here is the key bit, from the abstract:</p>
<blockquote>
<p>In this study, we have used (insert statistical method here) to compile unique DNA methylation signatures from normal human heart, lung, and kidney using the</p>
</blockquote>
<p>These slip-ups seem pretty embarrassing/funny at first pass. I will also admit that in some ways, I’m pretty sympathetic as a person who advises students and analysts. The comments on intermediate drafts of papers frequently say things like, “put this analysis here” or “fill in details here”. I think if one slipped through the cracks and ended up in the abstract or supplement of a paper I was publishing, I’d look pretty silly too.</p>
<p>But there are some more important issues here that relate to the issue of analysts/bioinformaticians/computing experts being directed by scientists. In some cases <a href="http://simplystatistics.org/2012/04/27/people-in-positions-of-power-that-dont-understand/">the scientists might not understand statistics</a>, which has its own set of problems. But often the scientists know exactly what they are talking about; the analyst and their advisor/boss just need to communicate about what is acceptable and what isn’t acceptable in practice. This is beautifully covered in this post on advice for <a href="http://biomickwatson.wordpress.com/2013/04/23/a-guide-for-the-lonely-bioinformatician/">lonely bioinformaticians</a>. I would extend that to all students/lonely analysts in any field. Finally, in the era of open science and collaboration, it is pretty clear that it is important to make sure that statements made in the margins of drafts can’t be misinterpreted and to check for typos in final submitted drafts of papers. <strong>Always double check for typos. </strong></p>
Data scientist is just a sexed up word for statistician
2013-08-08T10:28:16+00:00
http://simplystats.github.io/2013/08/08/data-scientist-is-just-a-sexed-up-word-for-statistician
<p>A couple of cool things happened at this year’s JSM.</p>
<ol>
<li>Twitter adoption went way up and it was much easier for people (like me) who weren’t there to keep track of all the action by monitoring the <a href="https://twitter.com/search?q=%23jsm2013&src=typd">#JSM2013</a> hashtag.</li>
<li>Nate Silver gave the keynote and <a href="https://twitter.com/rafalab/status/364480835577073664/photo/1">a huge crowd</a> showed up.</li>
</ol>
<p>Nate Silver is hands down the <a href="http://simplystatistics.org/2013/07/26/statistics-takes-center-stage-in-the-independent/">rockstar of our field</a>. I mean, no other statistician changing jobs would make the news at the Times, at ESPN, and on pretty much every other major news source.</p>
<p>Silver’s talk at JSM focused on 11 principles of statistical journalism, which are <a href="http://blog.revolutionanalytics.com/2013/08/nate-silver-jsm.html">covered really nicely</a> here by Joseph Rickert from Revolution. After his talk, he answered questions tweeted from the audience. He brought the house down (I’m sure in person, but definitely on Twitter) when he answered a question about data scientists versus statisticians with a response perfectly weighted for the audience:</p>
<blockquote>
<p>Data scientist is just a sexed up word for statistician</p>
</blockquote>
<p>Of course statisticians love to hear this but data scientists didn’t necessarily agree.</p>
<blockquote class="twitter-tweet" width="550">
<p>
Not at <a href="https://twitter.com/hashtag/JSM2013?src=hash">#JSM2013</a>, but intersect of self-ID’ed statisticians w/ self-ID’ed data scis is ~ null. Not sure who’s losing in the “sexed up” dept.
</p>
<p>
— Drew Conway (@drewconway) <a href="https://twitter.com/drewconway/status/364493993117507584">August 5, 2013</a>
</p>
</blockquote>
<blockquote class="twitter-tweet" width="550">
<p>
<a href="https://twitter.com/hspter">@hspter</a> not sure that describes what I do.
</p>
<p>
— josh attenberg (@jattenberg) <a href="https://twitter.com/jattenberg/status/364550740506710016">August 6, 2013</a>
</p>
</blockquote>
<blockquote class="twitter-tweet" width="550">
<p>
<a href="https://twitter.com/jattenberg">@jattenberg</a> <a href="https://twitter.com/hspter">@hspter</a> Me either. <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" />
</p>
<p>
— Hilary Mason (@hmason) <a href="https://twitter.com/hmason/status/364551047445884928">August 6, 2013</a>
</p>
</blockquote>
<p>I’ve talked about the statistician/data scientist divide before and how I think that we need better marketing as statisticians. I think it is telling that some of the very accomplished, very successful people tweeting about Nate’s quote are uncomfortable being labeled statistician. The reason, I think, is that statisticians have a reputation for focusing primarily on theory and not being willing to <a href="http://simplystatistics.org/2012/05/28/schlep-blindness-in-statistics/">do the schlep</a>.</p>
<p>I do think there is some cachet to having the “hot job title” but eventually solving real problems matters more. Which leads me to my favorite part of Nate’s quote, the part that isn’t getting nearly as much play as it should:</p>
<blockquote>
<p>Just do good work and call yourself whatever you want.</p>
</blockquote>
<p>I think that as statisticians we should embrace a <a href="http://simplystatistics.org/2012/08/14/statistics-statisticians-need-better-marketing/">“big tent” approach</a> to labeling. But rather than making it competitive by saying data scientists aren’t that great they are just “sexed up” statisticians, we should make it inclusive, “data scientists are statisticians because being a statistician is awesome and anyone who does cool things with data is a statistician”. People who build websites, or design graphics, or make reproducible documents, or build pipelines, or hack low-level data are all statisticians and we should respect them all for their unique skills.</p>
Simply Statistics #JSM2013 Picks for Wednesday
2013-08-07T08:48:04+00:00
http://simplystats.github.io/2013/08/07/simply-statistics-jsm2013-picks-for-wednesday
<p>Sorry for the delay with my session picks for Wednesday. Here’s what I’m thinking of:</p>
<ul>
<li><span style="line-height: 16px;">8:30-10:20am: <b>Bayesian Methods for Causal Inference in Complex Settings </b>(CC-520a) or <b>Developments in Statistical Methods for Functional and Imaging Data </b>(CC-522bc)</span></li>
<li>10:30am-12:20pm: <strong>Spatial Statistics for Environmental Health Studies</strong> (CC-510c) or <strong>Big Data Exploration with Amazon</strong> (CC-516c)</li>
<li>2-3:50pm: There are some future stars in the session <strong>Environmental Impacts on Public and Ecological Health</strong> (CC-512h) and <strong>Statistical Challenges in Cancer Genomics with Next-Generation Sequencing and Microarrays</strong> (CC-514a)</li>
<li>4-5:50pm: Find out who won the COPSS award! (CC-517ab)</li>
</ul>
<p> </p>
Simply Statistics #JSM2013 Picks for Tuesday
2013-08-06T07:40:32+00:00
http://simplystats.github.io/2013/08/06/simply-statistics-jsm2013-picks-for-tuesday
<p>It seems like Monday was a big hit at JSM with Nate Silver’s talk and all. Rafa estimates that there were about <a href="https://twitter.com/rafalab/status/364480835577073664">1 million people there</a> (+/- 1 million). Ramnath Vaidyanathan has a nice summary of the <a href="http://storify.com/ramnathv/jsm-2013?utm_campaign=&utm_source=t.co&utm_content=storify-pingback&awesm=sfy.co_hPE1&utm_medium=sfy.co-twitter">talk</a> and the <a href="https://gist.github.com/ramnathv/975b7d5df642c3804fc5">Q&A</a> afterwards. Among other things, Silver encouraged people to start a blog and communicate directly with the public. Couldn’t agree more! Thanks to all who live-tweeted at #JSM2013. I felt like I was there.</p>
<p>On to Tuesday! Here’s where I’d like to go:</p>
<ul>
<li><span style="line-height: 16px;">8:30-10:20am: <b>Spatial Uncertainty in Public Health Problems </b>(CC-513b); and since Nate says education is the next important area, <b>Statistical Knowledge for Teaching: Research Results and Implications for Professional Development</b> (CC-520d)</span></li>
<li>10:30am-12:20pm: Check out the latest in causal inference at <strong>Fresh Perspectives on Causal Inference</strong> (CC-512f) and come see the future of statistics at the <strong>SBSS Student Paper Travel Award Winners II</strong> (CC-520d)</li>
<li>2-3:50pm: There’s a cast of all-stars over in the <strong>Biased Epidemiological Study Designs: Opportunities and Challenges</strong> (CC-511c) session and a visualization session with an interesting premise <strong>Painting a Picture of Life in the United States</strong> (CC-510a)</li>
<li>4-5:50pm: Only two choices here, so take your pick (or flip a coin).</li>
</ul>
Simply Statistics #JSM2013 Picks for Monday
2013-08-05T07:55:48+00:00
http://simplystats.github.io/2013/08/05/simply-statistics-jsm2013-picks-for-monday
<p>I’m sadly not able to attend the Joint Statistical Meetings this year (where Nate Silver is the keynote speaker!) in the great city of Montreal. I’m looking forward to checking out the chatter on #JSM2013 but in the meantime, here are the sessions I would have attended if I’d been there. If I pick more than one session for a given time slot, I assume you can run back and forth between the two.</p>
<ul>
<li>8:30-10:20am: Kasper Hansen is presenting in <strong>Statistical Methods for High-Dimensional Data: Presentations by Junior Researchers</strong> (CC-515c) and there are some great people in <strong>The Profession of Statistics and Its Impact on the Media</strong> (CC-516d)</li>
<li>10:30am-12:20pm: There are some heavy hitters in the <strong>Showcase of Analysis of Correlated Measurements</strong> (CC-511d); this session has a great title <strong>Herd Immunity: Teaching Techniques for the Health Sciences</strong> (CC-515b)</li>
<li>2-3:50pm: I have a soft spot in my heart for a good MCMC session like <strong>Challenges in Using Markov Chain Monte Carlo in Modern Applications</strong> (CC-510d); I also have a soft spot for visualization and Simon Urbanek - <strong>Visualizing Big Data Interactively </strong>(CC-510b)</li>
<li>4-5:50pm: I would check out Nate Silver’s talk (CC-517ab)</li>
</ul>
<p>Have fun!</p>
Sunday data/statistics link roundup (8/4/13)
2013-08-04T11:58:53+00:00
http://simplystats.github.io/2013/08/04/sunday-datastatistics-link-roundup-8413
<ol>
<li><a href="http://m.us.wsj.com/articles/a/SB10001424127887324635904578639780253571520?mg=reno64-wsj">The $4 million teacher</a>. I love the idea that teaching is becoming a competitive industry where the best will get the kind of pay they really really deserve. I can’t think of another profession where the ratio of (how much influence you have on the world if you are good)/(salary) is so incredibly large. <a href="http://marginalrevolution.com/marginalrevolution/2013/08/competition-in-higher-education-continues-to-grow.html">MOOCs may contribute</a> to this, that is if they aren’t felled by the <a href="http://simplystatistics.org/2013/07/19/the-failure-of-moocs-and-the-ecological-fallacy/">ecological fallacy</a> (via Alex N.).</li>
<li>The NIH is considering <a href="http://www.nature.com/news/nih-mulls-rules-for-validating-key-results-1.13469">requiring replication of results</a> (via Rafa). Interestingly, the article talks about <a href="http://simplystatistics.org/2012/04/18/replication-psychology-and-big-science/">reproducibility, as opposed to replication</a>, throughout most of the text.</li>
<li><a href="http://www.r-bloggers.com/demand-for-r-jobs-on-the-rise-while-sas-jobs-decline/">R jobs on the rise</a>! Pair that with this <a href="http://www.r-bloggers.com/statisticians-an-endangered-species/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+RBloggers+%28R+bloggers%29">rather intense critique</a> of Marie Davidian’s interview about big data because she didn’t mention R. I think R/software development is definitely coming into its own as a critical part of any statistician’s toolbox. As that happens we need to take more and more care to include relevant training in version control, software development, and documentation for our students.</li>
<li>Not technically statistics, but holy crap a <a href="http://www.oddly-even.com/2013/07/31/the-largest-photo-ever-taken-of-tokyo-is-zoomable-and-it-is-glorious/">600,000 megapixel picture</a>?</li>
<li>A short <a href="http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/2/">history of data science</a>. Not many card-carrying statisticians make the history, which is a shame, given <a href="http://www.huffingtonpost.com/american-statistical-association/statistical-thinking-the-bedrock-of-data-science_b_3651121.html">all the good</a> they have contributed to the development of the foundations of this exciting discipline (via Rafa).</li>
<li>For those of you at JSM 2013, make sure you wear out that hashtag (<a href="https://twitter.com/search?q=%23jsm2013&src=typd">#JSM2013</a>) for those of us on the outside looking in. Watch out for the <a href="http://notstatschat.tumblr.com/post/57329870050/some-failure-modes-of-statistics-research-talks">Lumley 12</a> and make sure you check out <a href="http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208498">Shirley’s talk</a>, <a href="http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208765">Lumley and Hadley together</a>, this interesting looking <a href="http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208599">ethics session</a>, and Martin <a href="http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208786">doing his fMRI thang</a>, among others….</li>
</ol>
That causal inference came out of nowhere
2013-08-02T09:43:52+00:00
http://simplystats.github.io/2013/08/02/that-causal-inference-came-out-of-nowhere
<p><a href="http://archpedi.jamanetwork.com/article.aspx?articleid=1720224">This</a> is a study of breastfeeding and its impact on IQ that has been making the rounds on a number of different media outlets. I first saw it on the <a href="http://online.wsj.com/article/SB10001424127887324809004578635783141433600.html">Wall Street Journal</a> where I was immediately drawn to this quote:</p>
<blockquote>
<p>They then subtracted those factors using a statistical model. Dr. Belfort said she hopes that “what we have left is the true connection” with nursing and IQ.</p>
</blockquote>
<p>As the father of a young child this was of course pretty interesting to me so I thought I’d go and <a href="http://archpedi.jamanetwork.com/article.aspx?articleid=1720224">check out the paper</a> itself. I was pretty stunned to see this line right there in the conclusions:</p>
<blockquote>
<p>Our results support a causal relationship of breastfeeding duration with receptive language and verbal and nonverbal intelligence later in life.</p>
</blockquote>
<p>I immediately thought: “man, how did they run a clinical trial of breastfeeding?” It seems like it would be a huge challenge to get past the IRB. So then I read a little bit more carefully how they performed the analysis. It was a prospective study, where they followed the children over time, then performed a linear regression analysis to adjust for a number of other factors that might influence childhood intelligence. Some examples include mother’s IQ, socio-demographic information, and questionnaires about delivery.</p>
<p>They then fit a number of regression models with different combinations of covariates and outcomes. They did not attempt to perform any sort of causal inference to make up for the fact that the study was not randomized. Moreover, they did not perform multiple hypothesis testing correction for all of the combinations of effects they observed. The actual reported connections represent just a small fraction of all the possible connections they tested.</p>
<p>So I was pretty surprised when they said:</p>
<blockquote>
<p>In summary, our results support a causal relationship of breastfeeding in infancy with receptive language at age 3 and with verbal and nonverbal IQ at school age.</p>
</blockquote>
<p style="text-align: left;">
<a href="http://simplystatistics.org/2013/07/15/yes-clinical-trials-work/">I'm</a> as <a href="http://simplystatistics.org/2013/05/06/why-the-current-over-pessimism-about-science-is-the-perfect-confirmation-bias-vehicle-and-we-should-proceed-rationally/">optimistic</a> about <a href="http://arxiv.org/abs/1301.3718">science</a> as they come. But where did that causal inference come from?
</p>
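<p>To make the multiple testing concern concrete, here is a minimal sketch (not the paper’s analysis). It assumes a hypothetical 100 covariate/outcome combinations in which every null hypothesis is true, so each p-value is uniform on [0, 1], and it uses the Bonferroni correction as one simple example of a correction. The function names and the number of tests are made up for illustration.</p>

```python
import random


def count_rejections(p_values, alpha=0.05, bonferroni=False):
    """Count p-values below the threshold, optionally Bonferroni-adjusted."""
    threshold = alpha / len(p_values) if bonferroni else alpha
    return sum(p < threshold for p in p_values)


def simulate_null_pvalues(n_tests, rng):
    """Under a true null hypothesis, a valid p-value is uniform on [0, 1]."""
    return [rng.random() for _ in range(n_tests)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_tests = 100           # hypothetical number of combinations tested
p_values = simulate_null_pvalues(n_tests, rng)

# Every rejection here is a false positive, because all nulls are true.
print("uncorrected rejections:", count_rejections(p_values))
print("Bonferroni rejections:", count_rejections(p_values, bonferroni=True))
```

<p>With no correction you expect about alpha times the number of tests (here roughly 5) spurious “connections” even when nothing is going on, which is why reporting only a small fraction of the tested combinations is worrying.</p>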
The ROC curves of science
2013-08-01T10:21:33+00:00
http://simplystats.github.io/2013/08/01/the-roc-curves-of-science
<p>Andrew Gelman’s <a href="http://andrewgelman.com/2013/07/24/too-good-to-be-true-the-scientific-mass-production-of-spurious-statistical-significance/">recent post</a> on what he calls the “scientific mass production of spurious statistical significance” reminded me of a thought I had back when I read John Ioannidis’ paper claiming that <a href="http://simplystatistics.org/2013/05/06/why-the-current-over-pessimism-about-science-is-the-perfect-confirmation-bias-vehicle-and-we-should-proceed-rationally/">most published research findings are false</a>. Many authors, whom I will refer to as <em>the pessimists</em>, have joined Ioannidis in making similar claims and repeatedly blaming the current state of affairs on the mindless use of frequentist inference. The gist of my thought is that, for some scientific fields, the pessimists’ criticism is missing a critical point: that in practice, there is a tradeoff between increasing the rate of true discoveries and decreasing the rate of false discoveries, and that true discoveries from fields such as the biomedical sciences provide an enormous benefit to society. Before I explain this in more detail, I want to be very clear that I do think that reducing false discoveries is an important endeavor and that some of these false discoveries are completely avoidable. But, as I describe below, a general solution that improves the current situation is much more complicated than simply abandoning the frequentist inference that currently dominates.</p>
<p>Few will deny that our current system, with all its flaws, still produces important discoveries. Many of the pessimists’ proposals for reducing false positives seem to be, in one way or another, a call for being more conservative in reporting findings. Examples of recommendations include that we require larger effect sizes or smaller p-values, that we correct for the “<a href="http://www.slate.com/articles/health_and_science/science/2013/07/statistics_and_psychology_multiple_comparisons_give_spurious_results.html">researcher degrees of freedom</a>”, and that we use Bayesian analyses with pessimistic priors. I tend to agree with many of these recommendations but I have yet to see a specific proposal on exactly how conservative we should be. Note that we could easily bring the false positives all the way down to 0 by simply taking this recommendation to its extreme and no longer publishing biomedical research results altogether. This absurd proposal brings me to receiver operating characteristic (ROC) curves.</p>
<p><a href="http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/slide1-2/" rel="attachment wp-att-1627"><img class="alignnone size-full wp-image-1627" alt="Slide1" src="http://simplystatistics.org/wp-content/uploads/2013/07/Slide11.png" width="515" height="427" /></a></p>
<p>ROC curves plot true positive rates (TPR) versus false positive rates (FPR) for a given classifying procedure. For example, suppose a regulatory agency that runs randomized trials on drugs (e.g. FDA) classifies a drug as effective when a pre-determined statistical test produces a p-value < 0.05 or a posterior probability > 0.95. This procedure will have a historical false positive rate and true positive rate pair: one point in an ROC curve. We can change the 0.05 to, say, 0.2 (or the 0.95 to 0.80) and we would move up the ROC curve: higher FPR and TPR. Not doing research would put us at the useless bottom left corner. It is important to keep in mind that biomedical science is done by imperfect humans on imperfect and stochastic measurements so to make discoveries the field has to tolerate some false discoveries (ROC curves don’t shoot straight up from 0% to 100%). Also note that it can take years to figure out which publications report important true discoveries.</p>
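<p>As a rough illustration of how changing the threshold moves you along an ROC curve, here is a minimal sketch. It assumes a one-sided z-test whose test statistic is standard normal under the null and shifted by a hypothetical standardized effect of 2 under the alternative; the function name and the effect size are illustrative assumptions, not anything from an actual regulatory procedure.</p>

```python
from statistics import NormalDist


def roc_point(alpha, effect_size):
    """Return (FPR, TPR) for a one-sided z-test at significance level alpha.

    effect_size is the assumed mean of the test statistic when the drug
    really works (a hypothetical value chosen for illustration).
    """
    std_normal = NormalDist()
    critical_value = std_normal.inv_cdf(1 - alpha)
    fpr = alpha  # chance of rejecting when the null is true
    tpr = 1 - std_normal.cdf(critical_value - effect_size)  # power
    return fpr, tpr


# Loosening the threshold from 0.05 to 0.2 raises both rates together.
for alpha in (0.01, 0.05, 0.2):
    fpr, tpr = roc_point(alpha, effect_size=2.0)
    print(f"alpha={alpha:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

<p>Each threshold gives one (FPR, TPR) pair, and sweeping the threshold traces out the curve for that fixed effect size. A larger assumed effect size (better measurements, say) would give a different, higher curve.</p>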
<p>I am going to use the concept of an ROC curve to distinguish between reducing FPR by being statistically more conservative and reducing FPR via more general improvements. In my ROC curve the y-axis represents the number of important discoveries per decade and the x-axis the number of false positives per decade (to avoid confusion I will continue to use the acronyms TPR and FPR). The current state of biomedical research is represented by one point on the red curve: one (TPR, FPR) pair. The pessimists argue that the FPR is close to 100% of all results but they rarely comment on the TPR. Being more conservative lowers our FPR, which saves us time and money, but it also lowers our TPR, which could reduce the number of important discoveries that improve human health. So what is the optimal balance and how far are we from it? I don’t think this is an easy question to answer.</p>
<p>Now, one thing we can all agree on is that moving the ROC curve up is a good thing, since it means that we get a higher TPR for any given FPR. Examples of ways we can achieve this are developing better measurement technologies, statistically improving the quality of these measurements, augmenting the statistical training of researchers, thinking harder about the hypotheses we test, and making fewer coding or experimental mistakes. However, applying a more conservative procedure does not move the ROC up, it moves our point left on the existing ROC: we reduce our FPR but reduce our TPR as well.</p>
<p>In the plot above I draw two imagined ROC curves: one for physics and one for biomedical research. The physicists’ curve looks great. Note that it shoots up really fast which means they can make most available discoveries with very few false positives. Perhaps due to the maturity of the field, physicists can afford and tend to use <a href="http://www.guardian.co.uk/science/2012/jul/04/higgs-boson-cern-scientists-discover">very stringent criteria</a>. The biomedical research curve does not look as good. This is mainly due to the fact that biology is way more complex and harder to model mathematically than physics. However, because there is a larger uncharted territory and more research funding, I argue that the rate of discoveries is higher in biomedical research than in physics. But, to achieve this higher TPR, biomedical research has to tolerate a higher FPR. According to my imaginary ROC curves, if we become as stringent as physicists our TPR would be five times smaller. It is not obvious to me that this would result in a better situation than the current one. At the same time, note that the red ROC suggests that increasing the FPR, with the hopes of increasing our TPR, is not a good idea because the curve is quite flat beyond our current location on the curve.</p>
<p>Clearly I am oversimplifying a very complicated issue, but I think it is important to point out that there are two discussions to be had: 1) where should we be on the ROC curve (keeping in mind the relationship between FPR and TPR)? and 2) what can we do to improve the ROC curve? My own view is that we can probably move down the ROC curve some and reduce the FPR without much loss in TPR (for example, by raising awareness of the <a href="http://www.slate.com/articles/health_and_science/science/2013/07/statistics_and_psychology_multiple_comparisons_give_spurious_results.html">researcher degrees of freedom</a>). But I also think that most of our efforts should go to reducing the FPR by improving the ROC. In general, I think statisticians can add to the conversation about 1) while at the same time continuing to collaborate to move the red ROC curve up.</p>
The researcher degrees of freedom - recipe tradeoff in data analysis
2013-07-31T10:25:34+00:00
http://simplystats.github.io/2013/07/31/the-researcher-degrees-of-freedom-recipe-tradeoff-in-data-analysis
<p>An important concept that is only recently gaining the <a href="http://andrewgelman.com/2012/11/01/researcher-degrees-of-freedom/">attention</a> <a href="http://theness.com/neurologicablog/index.php/publishing-false-positives/">it</a> <a href="http://duncanlaw.wordpress.com/2012/04/09/researcher-degrees-of-freedom/">deserves</a> is researcher degrees of freedom. From <a href="http://people.psych.cornell.edu/~jec7/pcd%20pubs/simmonsetal11.pdf">Simmons et al</a>.:</p>
<blockquote>
<p>The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?</p>
</blockquote>
<p>So far, researcher degrees of freedom has primarily been used with <a href="http://www.slate.com/articles/health_and_science/science/2013/07/statistics_and_psychology_multiple_comparisons_give_spurious_results.html">negative connotations</a>. This probably stems from the original definition of the idea which focused on how analysts could “manufacture” statistical significance by changing the way the data was processed without disclosing those changes. Reproducible research and distributed code would of course address these issues to some extent. But it is still relatively easy to obfuscate dubious analysis by <a href="http://petewarden.com/2013/07/18/why-you-should-never-trust-a-data-scientist/">dressing it up in technical language</a>.</p>
<p>One interesting point that I think sometimes gets lost in all of this is the researcher degrees of freedom - recipe tradeoff. You could think of this as the <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">bias-variance tradeoff</a> for big data.</p>
<p>At one end of the scale you can allow the data analyst full freedom, in which case researcher degrees of freedom may lead to overfitting and open you up to the manufacture of statistical results (optimistic significance or point estimates or confidence intervals). Or you can <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">require a recipe</a> for every data analysis, which means that it isn’t possible to adapt to the unanticipated quirks (missing data mechanism, outliers, etc.) that may be present in an individual data set.</p>
<p>As with the bias-variance tradeoff, the optimal approach probably depends on your optimality criteria. You could imagine fitting a model that minimizes the mean squared error for fitting a linear model where you do not constrain the degrees of freedom in any way (that might represent an analysis where the researcher tries all possible models, including all types of data munging, choices of which observations to drop, how to handle outliers, etc.) to get the absolute best fit. Of course, this would likely be a strongly overfit/biased model. Alternatively you could penalize the flexibility allowed to the analyst. For example, you minimize a weighted criterion like:</p>
<p style="text-align:center;">
<span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_e18e41b63161ab4018790b295f7fb05d.gif" style="vertical-align: middle; border: none;" class="tex" alt=" \sum_{i=1}^n (y_i - b_0 x_{i1} + b_1 x_{i2})^2 + Researcher \; Penalty(\vec{y},\vec{x})" /></span>
</p>
<p>Some examples of the penalties could be:</p>
<ul>
<li><span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_7a7fd71819b3694b995fbd1fafc903fe.gif" style="vertical-align: middle; border: none; " class="tex" alt=" \lambda \times \sum_{i=1}^n 1_{researcher\; dropped \; ?y_i , x_i?\ ; from \; analysis}" /></span></li>
<li><span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_d77e8e36192d96f2d5f700d8b9b66be9.gif" style="vertical-align: middle; border: none; " class="tex" alt="\lambda \times \#\{of\;transforms\;tried\}" /></span></li>
<li><span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_08fc8b9e8ab22b50767a77c4d74b9739.gif" style="vertical-align: middle; border: none; " class="tex" alt=" \lambda \times \#{Outliers \; removed \; ad-hoc}" /></span></li>
</ul>
<p>You could also combine all of the penalties together into the “elastic researcher net” type approach. Then as the collective penalty <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_61b2f994e9f9cea7af386ccb914f2ed7.gif" style="vertical-align: middle; border: none; padding-bottom:1px;" class="tex" alt=" \lambda \rightarrow \infty" /></span> you get the <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">DSM</a>, like you have in a clinical trial for example. As <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_b6df1ca955358221c80c622cfdbe6912.gif" style="vertical-align: middle; border: none; " class="tex" alt="\lambda \rightarrow 0" /></span> you get fully flexible data analysis, which you might want for discovery.</p>
<p>Of course if you allow researchers to choose the penalty you are right back to a scenario where you have degrees of freedom in the analysis (the problem you always get with any penalized approach). On the other hand it would make it easier to disclose how those degrees of freedom were applied.</p>
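<p>The penalized criterion above can be sketched in a few lines. This is only a toy version under stated assumptions: the residual is taken to be y_i minus (b_0 x_i1 + b_1 x_i2), a single lambda weights the sum of the three example penalties, and the function name, argument names, and data are all made up for illustration.</p>

```python
def penalized_criterion(y, x1, x2, b0, b1, lam,
                        n_dropped=0, n_transforms=0, n_outliers_removed=0):
    """Residual sum of squares plus a penalty on analyst flexibility.

    The counts are hypothetical bookkeeping of choices the analyst made:
    observations dropped, transforms tried, outliers removed ad hoc.
    """
    rss = sum((yi - (b0 * a + b1 * b)) ** 2 for yi, a, b in zip(y, x1, x2))
    penalty = lam * (n_dropped + n_transforms + n_outliers_removed)
    return rss + penalty


# Toy data: the last observation looks like an outlier for b0=0, b1=1.
y = [1.0, 2.0, 3.0, 10.0]
x1 = [1.0, 1.0, 1.0, 1.0]
x2 = [1.0, 2.0, 3.0, 4.0]

for lam in (0.0, 100.0):
    # "Recipe" analysis: keep every observation, no penalty incurred.
    recipe = penalized_criterion(y, x1, x2, 0.0, 1.0, lam)
    # "Flexible" analysis: drop the last observation and pay for it.
    flexible = penalized_criterion(y[:3], x1[:3], x2[:3], 0.0, 1.0, lam,
                                   n_dropped=1)
    print(f"lambda={lam}: recipe={recipe}, flexible={flexible}")
```

<p>With lambda at 0 the flexible analysis wins because dropping the awkward point costs nothing; with a large lambda the recipe wins, which is the tradeoff described above. Recording the counts is also one way to disclose how the degrees of freedom were used.</p>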
Sunday data/statistics link roundup (7/28/13)
2013-07-28T10:53:56+00:00
http://simplystats.github.io/2013/07/28/sunday-datastatistics-link-roundup-72813
<ol>
<li><span style="line-height: 16px;"><a href="http://www.huffingtonpost.com/2013/07/23/women-in-physics-statistics-hiring-bias-female-faculty_n_3635710.html">An article</a> in the Huffpo about a report claiming there is no gender bias in the hiring of physics faculty. I didn’t read the paper carefully but I definitely agree with the quote from Prof. Dame Athene Donald that the comparison should be made to the number of faculty candidates on the market. I’d also be a little careful about touting my record of gender equality if only 13% of faculty in my discipline were women (via Alex N.).</span></li>
<li>If you are the only person who hasn’t seen the upwardly mobile by geography article yet, <a href="http://www.nytimes.com/2013/07/22/business/in-climbing-income-ladder-location-matters.html?hp&_r=1&">here it is</a> (via Rafa). Also covered over at the great “<a href="http://chartsnthings.tumblr.com/post/56193905994/winning">charts n things</a>” blog.</li>
<li>Finally <a href="http://news.sciencemag.org/scientific-community/2013/07/senate-panel-gives-nsf-8-budget-boost">some good news</a> on the science funding front; a Senate panel raises NSF’s budget by 8% (the link worked for me earlier but I was having a little trouble today). I think that this is of course a positive development. I think that article pairs very well with <a href="http://www.businessinsider.com/a-private-university-might-have-saved-detroit-2013-7">this provocative piece</a> suggesting Detroit might have done better if they had a private research school.</li>
<li>I’m going to probably talk about this more later in the week because it gets my blood pressure up, but I thought I’d just say again that <a href="http://andrewgelman.com/2013/07/24/too-good-to-be-true-the-scientific-mass-production-of-spurious-statistical-significance/">hyperbolic takedowns</a> of the statistical methods in specific papers in the popular press <a href="http://simplystatistics.org/2013/05/06/why-the-current-over-pessimism-about-science-is-the-perfect-confirmation-bias-vehicle-and-we-should-proceed-rationally/">lead in only one direction</a>.</li>
</ol>
Statistics takes center stage in the Independent
2013-07-26T16:18:11+00:00
http://simplystats.github.io/2013/07/26/statistics-takes-center-stage-in-the-independent
<p>Check out <a href="http://www.independent.co.uk/news/world/americas/heroes-of-zeroes-nate-silver-his-rivals-and-the-big-electoral-data-revolution-8734380.html">this really good piece</a> over at the Independent. It talks about the rise of statisticians as rockstars, naming Hans Rosling, Nate Silver, and Chris Volinsky among others. I think that those guys are great and deserve all the attention they get.</p>
<p>I only hope that more of the superstars that fly under the radar of the general public but have made huge contributions to science/medicine (like Ross Prentice, Terry Speed, Scott Zeger, or others that were highlighted in the comments <a href="http://simplystatistics.org/2013/07/17/name-5-statisticians-now-name-5-young-statisticians/">here</a>) get the same kind of attention (although I suspect they might not want it).</p>
<p>I think one of the best parts of the article (which you should read in its entirety) is Marie Davidian’s quote:</p>
<blockquote>
<p>There are rock stars, and then there are rock bands: statisticians frequently work in teams</p>
</blockquote>
What are the 5 most influential statistics papers of 2000-2010?
2013-07-22T10:52:45+00:00
http://simplystats.github.io/2013/07/22/what-are-the-5-most-influential-statistics-papers-of-2000-2010
<p>A few folks here at Hopkins were just reading the comments of our post on <a href="http://simplystatistics.org/2013/07/17/name-5-statisticians-now-name-5-young-statisticians/">awesome young/senior statisticians</a>. It was cool to see the diversity of opinions and all the impressive people working in our field. We realized that another question we didn’t have a great answer to was:</p>
<blockquote>
<p>What are the 5 most influential statistics papers of the aughts (2000-2010)?</p>
</blockquote>
<p>Now that the auggies or aughts or whatever are a few years behind us, we have the benefit of a little hindsight and can get a reasonable measure of retrospective impact.</p>
<p>Since this is a pretty broad question I thought I’d lay down some generic ground rules for nominations:</p>
<ol>
<li>Papers must have been published in 2000-2010.</li>
<li>Papers must primarily report a statistical method or analysis (the impact shouldn’t be only because of the scientific result).</li>
<li>Papers may be published in either statistical or applied journals.</li>
</ol>
<p>For extra credit, along with your list give your definition of impact. Mine would be something like:</p>
<ul>
<li>Has been cited at a high rate in scientific papers (in other words, it is used by science, not just cited by statisticians trying to beat it)</li>
<li>Has corresponding software that has been used</li>
<li>Made simpler/changed the way we did a specific type of analysis</li>
</ul>
<p>I don’t have my list yet (I know, a cop-out) but I’m working on it.</p>
Sunday data/statistics link roundup (7/21/2013)
2013-07-21T20:23:56+00:00
http://simplystats.github.io/2013/07/21/sunday-datastatistics-link-roundup-7212013
<ol>
<li><a href="http://www.nytimes.com/2013/07/21/opinion/sunday/lets-shake-up-the-social-sciences.html?hp&_r=2&">Let’s shake up the social sciences</a> is a piece in the New York Times by Nicholas Christakis who rose to fame by claiming that <a href="http://www.nejm.org/doi/full/10.1056/NEJMsa066082">obesity is contagious</a>. <a href="http://andrewgelman.com/2013/07/21/defensive-political-science-responds-defensively-to-an-attack-on-social-science/">Gelman responds</a> that he thinks maybe Christakis got a little ahead of himself. I’m going to stay out of this one as it is all pretty far outside my realm - but I will say that I think quantitative social sciences is a hot area and all hot areas bring both interesting new results and hype. You just have to figure out which is which (via Rafa).</li>
<li><a href="http://www.aclu.org/blog/technology-and-liberty-national-security/police-documents-license-plate-scanners-reveal-mass">This</a> is both creepy and proves <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">my point</a> about the ubiquity of data. Basically police departments are storing tons of information about where we drive because, well, it is easy to do so why not?</li>
<li>I mean, I’m not an actuary and I don’t run cities, but <a href="http://dealbook.nytimes.com/2013/07/19/detroit-gap-reveals-industry-dispute-on-pension-math/?hp">this</a> strikes me as a little insane. How do you not just keep track of all the pensions you owe people and add them up to know your total obligation? Why predict it when you could actually just collect the data? Maybe an economist can explain this one to me. (via Andrew J.)</li>
<li><a href="http://www.nytimes.com/2013/07/19/opinion/in-defense-of-clinical-drug-trials.html?src=recg&gwh=9D33ABD1323113EF3AC9C48210900171">The New York Times</a> reverse scoops <a href="http://simplystatistics.org/2013/07/15/yes-clinical-trials-work/">our clinical trials post</a>! In all seriousness, there are a lot of nice responses there to the original article.</li>
<li><a href="http://touch.baltimoresun.com/#section/-1/article/p2p-76681838/">JH Hospital back to #1</a>. Order is restored. Read <a href="http://simplystatistics.org/2012/07/18/a-closer-look-at-data-suggests-johns-hopkins-is-still/">our analysis</a> of Hopkins’ ignominious drop to #2 last year (via Sherri R.).</li>
</ol>
The "failure" of MOOCs and the ecological fallacy
2013-07-19T10:52:41+00:00
http://simplystats.github.io/2013/07/19/the-failure-of-moocs-and-the-ecological-fallacy
<p>At first blush <a href="http://www.sfgate.com/news/article/San-Jose-State-suspends-online-courses-4672870.php">the news out of San Jose State</a> that the partnership with Udacity is being temporarily suspended is bad news for MOOCs. It is particularly bad news since the main reason for the suspension is poor student performance on exams. I think in the PR game there is certainly some reason to be disappointed in the failure of this first big experiment, but as someone who loves the idea of high-throughput education, I think that this is primarily a good learning experience.</p>
<p>The money quote in my mind is:</p>
<blockquote>
<p>Officials say the data suggests many of the students had little college experience or held jobs while attending classes. Both populations traditionally struggle with college courses.</p>
<p>“We had this element that we picked, student populations who were not likely to succeed,” Thrun said.</p>
</blockquote>
<p>I think it was a really nice idea to try to expand educational opportunities to students who traditionally don’t have time for college or have struggled with college. But this represents a pretty major confounder in the analysis comparing educational outcomes between students in the online and in person classes. There is a lot of room for the <a href="http://en.wikipedia.org/wiki/Ecological_fallacy">ecological fallacy</a> to make it look like online classes are failing. They could very easily address this problem by using a subset of students randomized in the right way. There are even really good papers - <a href="http://scholar.harvard.edu/aglynn/publications/alleviating-linear-ecological-bias-and-optimal-design-subsample-data">like this one by Glynn</a> - on the optimal way to do this.</p>
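<p>To make the confounding point concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the pass probabilities (0.4 for students with little college experience or a job, 0.8 for everyone else), the enrollment split, and the function names are made up, and the course format is set to have no effect at all. It is not the San Jose State data and it is not the method in the Glynn paper.</p>

```python
import random


def pass_rate(outcomes):
    """Fraction of students who passed."""
    return sum(outcomes) / len(outcomes)


def simulate(n=20000, seed=1, randomized=False):
    """Return (online_pass_rate, in_person_pass_rate) for hypothetical students.

    Assumed numbers: at-risk students pass with probability 0.4, others
    with 0.8, and the course format itself has no effect at all.
    """
    rng = random.Random(seed)
    online, in_person = [], []
    for _ in range(n):
        at_risk = rng.random() < 0.5
        if randomized:
            # A coin flip decides the format, independent of background.
            goes_online = rng.random() < 0.5
        else:
            # Self-selection: at-risk students mostly end up online.
            goes_online = rng.random() < (0.8 if at_risk else 0.2)
        # Passing depends only on background, never on the format.
        passed = rng.random() < (0.4 if at_risk else 0.8)
        (online if goes_online else in_person).append(passed)
    return pass_rate(online), pass_rate(in_person)


if __name__ == "__main__":
    print("self-selected (online, in person):", simulate())
    print("randomized    (online, in person):", simulate(randomized=True))
```

<p>Under these made-up numbers the self-selected comparison makes the online class look much worse even though the format does nothing, while the randomized comparison gives roughly equal pass rates in both formats.</p>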
<p>I think there are some potential lessons learned here from this PR problem:</p>
<ol>
<li><span style="line-height: 15.994318008422852px;"><strong>We need good study design in high-throughput education</strong>. I don’t know how rigorous the study design was in the case of the San Jose State experiment, but if the comparison is just whoever signed up in class versus whoever signed up online we have a long way to go in evaluating these classes.<br /> </span></li>
<li><strong>We need coherent programs online</strong>. It looks like they offered a scattered collection of mostly lower level courses online (elementary statistics, college algebra, entry level math, introduction to programming and introduction to psychology). These courses are obvious ones for picking off with MOOCs since they are usually large lecture-style courses in person as well. But they are also hard classes to “get motivated for” if there isn’t a clear end goal in mind. If you are learning college algebra online but don’t have a clear path to using that education it might make more sense to start with the <a href="https://www.khanacademy.org/math/algebra">Khan Academy</a>.</li>
<li><strong>We need to parse variation in educational </strong><span style="color: #000000;"><b>attainment</b></span>. It makes sense to evaluate in class and online students with similar instruments. But I wonder if there is a way to estimate the components of variation: motivation, prior skill, time dedicated to the course, learning from course materials, learning from course discussion, and learning for different types of knowledge (e.g. vocational versus theoretical) using statistical models. I think that kind of modeling would offer a much more clear picture of whether these programs are “working”.</li>
</ol>
Defending clinical trials
2013-07-19T08:16:47+00:00
http://simplystats.github.io/2013/07/19/defending-clinical-trials
<p>The New York Times has published some <a href="http://www.nytimes.com/2013/07/19/opinion/in-defense-of-clinical-drug-trials.html?src=recg">letters to the Editor</a> in response to the piece by Clifton Leaf on clinical trials. You can also see <a href="http://simplystatistics.org/2013/07/15/yes-clinical-trials-work/">our response here</a>.</p>
Name 5 statisticians, now name 5 young statisticians
2013-07-17T11:31:51+00:00
http://simplystats.github.io/2013/07/17/name-5-statisticians-now-name-5-young-statisticians
<p>I have been thinking for a while how hard it is to find statisticians to interview for the blog. When I started the interview series, it was targeted at interviewing statisticians at the early stages of their careers. It is relatively easy, if you work in academic statistics, to name 5 famous statisticians. If you asked me to do that, I’d probably say something like: Efron, Tibshirani, Irizarry, Prentice, and Storey. I could also name 5 famous statisticians in industry with relative ease: Mason, Volinsky, Heineike, Patil, Conway.</p>
<p>Most of that is because of where I went to school (Storey/Prentice), the area I work in (Tibshirani/Irizarry/Storey), my advisor (Storey), or the bootstrap (Efron) and the people I see on Twitter (all the industry folks). I could, of course, name a lot of other famous statisticians, but almost all of them would be biased by my education or the books I read.</p>
<p>But almost surely I will miss people who work outside my area or didn’t go to school where I did. This is particularly true in applied statistics, where people might not even spend most of their time in statistics departments. It is doubly true of people who are young and just getting started, as I haven’t had a chance to hear about them.</p>
<p>So if you have a few minutes in the comments name five statisticians you admire. Then name five junior statisticians you think will be awesome. They don’t have to be famous (in fact it is better if they are good but <em>not</em> famous so I can learn something). Plus it will be interesting to see the responses.</p>
Yes, Clinical Trials Work
2013-07-15T11:20:06+00:00
http://simplystats.github.io/2013/07/15/yes-clinical-trials-work
<p>This Saturday the New York Times published an opinion piece wondering “<a style="font-size: 16px;" href="http://www.nytimes.com/2013/07/14/opinion/sunday/do-clinical-trials-work.html?pagewanted=all&_r=0">do clinical trials work?</a>”. The answer, of course, is: absolutely. For those that don’t know the history, randomized controlled trials (RCTs) are one of the reasons why life spans skyrocketed in the 20th century. Before RCTs, wishful thinking and arrogance led numerous well-meaning scientists and doctors to incorrectly believe their treatments worked. RCTs are so successful that they have been adopted with much fanfare in far flung arenas like poverty alleviation (see, e.g., this discussion by <a style="font-size: 16px;" href="http://www.effectivephilanthropy.org/blog/2011/06/esther-duflo-explains-why-she-believes-randomized-controlled-trials-are-so-vital/">Esther Duflo</a>), where wishful thinking also led many to incorrectly believe their interventions helped.</p>
<p>The first chapter of <a href="http://www.amazon.com/Statistics-4th-Edition-David-Freedman/dp/0393929728">this book</a> contains several examples and <a href="http://clinicaltrials.gov/ct2/about-studies/learn">this is a really nice introduction</a> to clinical studies. A very common problem was that the developers of the treatment would create treatment groups that were healthier to start with. Randomization takes care of this. To understand the importance of controls I quote the opinion piece to demonstrate a common mistake we humans make: “Some patients did do better on the drug, and indeed, doctors and patients insist that some who take Avastin significantly beat the average.” The problem is that the fact that Avastin did not do better on average means that the exact same statement can be made about the control group! It also means that some patients did worse than average too. The use of a control points to the possibility that Avastin has nothing to do with the observed improvements.</p>
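<p>A quick simulation sketch shows why “some patients beat the average” is not evidence that a drug works. The numbers below (1,000 patients per arm, survival times with mean 12 months and standard deviation 4) are hypothetical and are not taken from any Avastin trial; the normal distribution is a crude stand-in for real survival data, and the function name is made up for this example. Both arms are drawn from the same distribution, so the “drug” does nothing by construction.</p>

```python
import random
import statistics


def simulate_null_trial(n_per_arm=1000, mean_survival=12.0, sd=4.0, seed=0):
    """Simulate survival (in months) for two arms when the drug has no effect.

    All numbers here are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    # Both arms are drawn from the same distribution: the drug does nothing.
    treated = [rng.gauss(mean_survival, sd) for _ in range(n_per_arm)]
    control = [rng.gauss(mean_survival, sd) for _ in range(n_per_arm)]
    overall_mean = statistics.mean(treated + control)
    # Fraction of each arm that "beat the average".
    frac_treated = sum(x > overall_mean for x in treated) / n_per_arm
    frac_control = sum(x > overall_mean for x in control) / n_per_arm
    return frac_treated, frac_control


if __name__ == "__main__":
    frac_treated, frac_control = simulate_null_trial()
    print(f"treated patients above average: {frac_treated:.2f}")
    print(f"control patients above average: {frac_control:.2f}")
```

<p>In both arms roughly half of the patients beat the overall average, which is exactly the statement the opinion piece offers as support for the drug. Without the control arm there would be no way to see that the same thing happens to patients who never took it.</p>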
<p>The opinion piece is very critical of current clinical trials work and complains about the “dismal success rate for drug development”. But what is the author comparing to? Dismal compared to what? We are talking about developing complicated compounds that must be both safe and efficacious in often critically ill populations. It would be surprising if our success rate was incredibly high. Or is the author comparing the current state of affairs to the pre-clinical-trials days when procedures such as <a style="font-size: 16px;" href="http://en.wikipedia.org/wiki/Bloodletting">bloodletting</a> were popular?</p>
<p>A better question might be, “how can we make clinical trials more efficient?” To answer this question there is definitely a lively and ongoing research area. In some cases they can definitely be better by adapting to new developments such as biomarkers and the advent of personalized medicine. This is why there are dozens of statisticians working in this area.</p>
<p>The article says that</p>
<blockquote>
<p>“[p]art of the novelty lies in a statistical technique called Bayesian analysis that lets doctors quickly glean information about which therapies are working best. “</p>
</blockquote>
<p>As <a style="font-size: 16px;" href="http://simplystatistics.org/2013/07/14/sunday-datastatistics-link-roundup-7142013/">Jeff pointed out</a> this is a pretty major oversimplification of all of the hard work that it takes to maintain scientific integrity and patient safety when studying new compounds. The fact that the analysis is Bayesian is ancillary to other issues like <a href="http://www.trialsjournal.com/content/13/1/145">adaptive trials</a> (as Julian <a href="http://simplystatistics.org/2013/07/14/sunday-datastatistics-link-roundup-7142013/#comment-962395470">pointed out in the comments</a>), <a href="http://en.wikipedia.org/wiki/Dynamic_treatment_regime">dynamic treatment regimes</a>, or even more established ideas <a href="http://en.wikipedia.org/wiki/Sequential_analysis">like group sequential trials</a>. The basic principle underlying these ideas is the same: <em>can we run a trial more efficiently while achieving reasonable estimates of effect sizes and uncertainties?</em> You could imagine doing this by focusing on treatments that seem to work well for subpopulations with specific biomarkers, or by stopping trials early if drugs are strongly (in)effective, or by picking optimal paths through multiple treatments. Whether the statistical methodology is Bayesian or Frequentist has little to do with the ways that clinical trials are adapting to be more efficient.</p>
<p>This is a wide open area and deserves a much more informed conversation. I’m providing here a list of resources that would be a good place to start:</p>
<ol>
<li><a href="http://www.clinicaltrials.gov/ct2/info/resources">An introduction to clinical trials</a></li>
<li><a href="http://people.csail.mit.edu/mrosenblum/Teaching/adaptive_designs_2010.html">Michael Rosenblum’s adaptive trial design page. </a></li>
<li><a href="http://clinicaltrials.gov/">Clinicaltrials.gov</a> - registry of clinical trials</li>
<li><a href="https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/62529/TLA-1906126.pdf">Test, learn, adapt</a> - a white paper on using clinical trials for public policy</li>
<li><a href="http://www.alltrials.net/">Alltrials</a> - an initiative to make all clinical trial data public</li>
<li><a href="http://www.asco.org/advocacy-practice/clinical-trial-resources">ASCO clinical trials resources</a> - on clinical trials ethics and standards</li>
<li><a href="http://jco.ascopubs.org/content/29/6/606">Don Berry’s paper on adaptive design</a>.</li>
<li><a href="http://www.amazon.com/dp/1441915850">Fundamentals of clinical trials</a> - a good general book (via David H.)</li>
<li><a href="http://www.amazon.com/Clinical-Trials-Methodologic-Perspective-Probability/dp/0471727814">Clinical trials, a methodological perspective</a> - a more regulatory take (via David H.)</li>
</ol>
<p><em>This post is by Rafa and Jeff. </em></p>
Sunday data/statistics link roundup (7/14/2013)
2013-07-14T12:10:53+00:00
http://simplystats.github.io/2013/07/14/sunday-datastatistics-link-roundup-7142013
<ol>
<li><span style="line-height: 15.994318008422852px;">Question: <a href="http://www.nytimes.com/2013/07/14/opinion/sunday/do-clinical-trials-work.html?pagewanted=all&_r=1&">Do clinical trials work</a>? Answer: Yes. Clinical trials are one of the defining success stories in the process of scientific inquiry. Do they work as fast/efficiently as a pharma company with potentially billions on the line would like? That is definitely much more up for debate. Most of the article is a good summary of how drug development works - although I think the statistics reporting is a little prone to hyperbole. I also think this sentence is misleading, wrong, and way over the top: <em>“Part of the novelty lies in a statistical technique called Bayesian analysis that lets doctors quickly glean information about which therapies are working best. There’s no certainty in the assessment, but doctors get to learn during the process and then incorporate that knowledge into the ongoing trial.” </em><br /> </span></li>
<li><a href="http://www.nytimes.com/2013/07/11/business/2-competitors-sued-by-genetics-company-for-patent-infringement.html?src=rechp&_r=0">The fun begins</a> in the grim world of patenting genes. Two companies are being sued by Myriad even though they just lost the case on their main patent. Myriad is claiming violation of one of their 500 or so other patents. Can someone with legal expertise give me an idea - is Myriad now a patent troll?</li>
<li><a href="http://thomaslevine.com/!/r-spells-for-data-wizards/">R spells for data wizards</a> from Thomas Levine. I also like the pink on grey look.</li>
<li>Larry W. takes on <a href="http://normaldeviate.wordpress.com/2013/07/13/lost-causes-in-statistics-ii-noninformative-priors/">non-informative priors</a>. Worth the read, particularly the discussion of how non-informative priors can be informative in different parameterizations. The problem Larry points out here is one I think that is critical - in big data applications where the goal is often discovery, we rarely have enough prior information to make reasonable informative priors either. Not to say some regularization can’t be helpful, but I think there is danger in putting an even weakly informative prior on a poorly understood, high dimensional space and then claiming victory when we discover something.</li>
<li>Statistics and actuarial science are jumping into a politically fraught situation by <a href="http://www.nytimes.com/2013/07/08/us/schools-seeking-to-arm-employees-hit-hurdle-on-insurance.html?hp&pagewanted=all&_r=0">raising the insurance on schools that allow teachers to carry guns</a>. Fiscally, this is clearly going to be the right move. I wonder what the political fallout will be for the insurance company and for the governments that passed these laws (via Rafa via Marginal Revolution).</li>
<li>Timmy!! Tim Lincecum <a href="http://scores.espn.go.com/mlb/recap?gameId=330713125">throws his first no hitter.</a> I know this isn’t strictly data/stats but he went to UW like me!</li>
</ol>
What are the iconic data graphs of the past 10 years?
2013-07-10T10:00:56+00:00
http://simplystats.github.io/2013/07/10/what-are-the-iconic-data-graphs-of-the-past-10-years
<p>This article in the New York Times about the supposed <a href="http://bits.blogs.nytimes.com/2013/07/05/the-death-of-photography-has-been-greatly-exaggerated/?smid=pl-share">death of photography</a> got me thinking about statistics. Apparently, the death of photography has been around the corner for some time now:</p>
<blockquote>
<p>For years, photographers have been bracing for this moment, warned that the last rites will be read for photography when video technology becomes good enough for anyone to record. But as this Fourth of July showed me, I think the reports of the death of photography have been greatly exaggerated.</p>
</blockquote>
<p>Yet, photography has not died and, says <a href="http://www.fas.harvard.edu/~amciv/faculty/kelsey.shtml">Robin Kelsey</a>, a professor of photography at Harvard,</p>
<blockquote>
<p>The fact that we can commit a single image to memory in a way that we cannot with video is a big reason photography is still used so much today.</p>
</blockquote>
<p>This got me thinking about data graphics. One long-time gripe about data graphics in R has been its horrible lack of support for dynamic or interactive graphics. This is an indisputable fact, especially in the early years. Nowadays there are quite a few extensions and packages that allow R to create dynamic graphics, but it still doesn’t feel like part of the “core”. I still feel like when I talk to people about R, the first criticism they jump to is the poor support for dynamic/interactive graphics.</p>
<p>But personally, I’ve never thought it was a big deal. Why? Because I don’t really find such graphics useful for truly <em>thinking</em> about data. I’ve definitely enjoyed viewing some of them (especially some of the D3 stuff), and it’s often fun to move sliders around and see how things change (perhaps my favorite is the <a href="http://www.babynamewizard.com/voyager">Baby Name Voyager</a> or maybe <a href="http://www.businessweek.com/articles/2013-07-09/jay-z-is-right-most-rappers-are-lying-about-their-money">this one showing rapper wealth</a>).</p>
<p>But in the end, what are you supposed to walk away with? As a creator of such a graphic, how are you supposed to communicate the evidence in the data? The key element of dynamic/interactive graphics is that they allow viewers to explore the data in their own way, not in some prescribed static way that you’ve explicitly set out. Ultimately, I think that aspect makes dynamic graphics useful for presenting <em>data</em>, but not that useful for presenting <em>evidence</em>. If you want to present evidence, you have to tell a story with the data; you can’t just let the viewer tell their own story.</p>
<p>This got me thinking about what are the iconic data “photos” of the past 10 years (or so). The NYT article mentions the famous “<a href="http://en.wikipedia.org/wiki/Raising_the_Flag_on_Iwo_Jima">Raising the Flag on Iwo Jima</a>” by AP photographer Joe Rosenthal as an image that many would recognize (and perhaps remember). What are the data graphics that are burned in your memory?</p>
<p>I’ll give one example. I remember seeing Richard Peto give a talk here about the benefits of smoking cessation and its effect on life expectancy. He found that according to large population surveys, people who quit smoking by the age of 40 or so had more or less the same life expectancy as those who never smoked at all. The graph he showed was one very similar to <a href="http://www.nejm.org/action/showImage?doi=10.1056%2FNEJMsa1211128&iid=f03">Figure 3 from this article</a>. Although I already knew that smoking was bad for you, this picture really crystalized it for me in a specific way.</p>
<p>Of course, sometimes data graphics are <a href="http://simplystatistics.org/2012/11/26/the-statisticians-at-fox-news-use-classic-and-novel-graphical-techniques-to-lead-with-data/">memorable for other reasons</a>, but I’d like to try and stay positive here. Which data graphics have made a big impression on you?</p>
Repost: Preventing Errors Through Reproducibility
2013-07-09T10:00:39+00:00
http://simplystats.github.io/2013/07/09/repost-preventing-errors-through-reproducibility
<p>Checklist mania has hit clinical medicine thanks to people like Peter Pronovost and many others. The basic idea is that simple and short checklists along with changes to clinical culture can prevent major errors from occurring in medical practice. One particular success story is Pronovost’s central line checklist which <a href="http://www.ncbi.nlm.nih.gov/pubmed/15483409" target="_blank">dramatically reduced bloodstream infections</a> in hospital intensive care units.</p>
<p>There are three important points about the checklist. First, it neatly summarizes information, bringing the latest evidence directly to clinical practice. It is easy to follow because it is short. Second, it serves to slow you down from whatever you’re doing. Before you cut someone open for surgery, you stop for a second and run the checklist. Third, it is a kind of equalizer that subtly changes the culture: everyone has to follow the checklist, no exceptions. A number of studies have now shown that when clinical units follow checklists, infection rates go down and hospital stays are shorter compared to units using standard procedures.</p>
<p>Here’s a question: What would it take to convince you that an article’s results were reproducible, short of going in and reproducing the results yourself? I recently raised this question in a <a href="http://simplystatistics.tumblr.com/post/12243614318/i-gave-a-talk-on-reproducible-research-back-in" target="_blank">talk I gave</a> at the Applied Mathematics Perspectives conference. At the time I didn’t get any responses, but I’ve had some time to think about it since then.</p>
<p>I think most people are thinking of this issue along the lines of “The only way I can confirm that an analysis is reproducible is to reproduce it myself”. In order for that to work, everyone needs to have the data and code available to them so that they can do their own independent reproduction. Such a scenario would be sufficient (and perhaps ideal) to claim reproducibility, but is it strictly necessary? For example, if I reproduced a published analysis, would that satisfy you that the work was reproducible, or would you have to independently reproduce the results for yourself? If you had to choose someone to reproduce an analysis for you (not including yourself), who would it be?</p>
<p>This idea is embedded in the <a href="http://www.ncbi.nlm.nih.gov/pubmed/19535325" target="_blank">reproducible research policy at</a> <em><a href="http://www.ncbi.nlm.nih.gov/pubmed/19535325" target="_blank">Biostatistics</a></em>, but of course we make the data and code available too. There, a (hopefully) trusted third party (the Associate Editor for Reproducibility) reproduces the analysis and confirms that the code was runnable (at least at that moment in time).</p>
<p>It’s important to point out that reproducible research is not only about correctness and prevention of errors. It’s also about making research results available to others so that they may more easily build on the work. However, preventing errors is an important part and the question is then what is the best way to do that? Can we generate a reproducibility checklist?</p>
Use R! 2014 to be at UCLA
2013-07-08T16:33:00+00:00
http://simplystats.github.io/2013/07/08/use-r-2014-to-be-at-ucla
<p>The <a href="http://user2014.stat.ucla.edu">2014 Use R! conference</a> will be in Los Angeles, CA and will be hosted by the <a href="http://www.stat.ucla.edu">UCLA Department of Statistics</a> (an excellent department, I must say) and the newly created <a href="http://www.foastat.org">Foundation for Open Access Statistics</a>. This is basically <em>the</em> meeting for R users and developers and has grown to be quite an event.</p>
Fourth of July data/statistics link roundup (7/4/2013)
2013-07-04T13:49:11+00:00
http://simplystats.github.io/2013/07/04/fourth-of-july-datastatistics-link-roundup-742013
<ol>
<li><a href="http://www.slate.com/blogs/moneybox/2013/07/01/science_majors_are_hard_that_s_why_people_don_t_do_them.html">An interesting post</a> about how lots of people start out in STEM majors but eventually bail because they are too hard. They recommend either: (1) we better prepare high school students or (2) we make STEM majors easier. I like the idea of making STEM majors more interactive and self-paced. There is a bigger issue here of weed-out classes and barrier classes that deserves a longer discussion (via Alex N.)</li>
<li>This is <a href="http://www.gpo.gov/fdsys/pkg/FR-2013-06-04/html/2013-13083.htm">an incredibly interesting FDA proposal</a> to share all clinical data. I didn’t know this, but apparently right now all FDA data is proprietary. That is stunning to me, given the openness that we have, say, in genomic data - where most data are public. This goes beyond even the alltrials idea of reporting all results. I think we need full open disclosure of data and need to think hard about the privacy/consent implications this may have (via Rima I.).</li>
<li>This is a <a href="http://insightdatascience.com/">pretty cool data science</a> fellowship program for people who want to transition from academia to industry, post PhD. I have no idea if the program is any good, but certainly the concept is a great one. (via Sherri R.)</li>
<li><a href="http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2530.html?utm_content=buffercf9e7&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer">A paper in Nature Methods</a> about data visualization and understanding the levels of uncertainty in data analysis. I love seeing that journals are recognizing the importance of uncertainty in analysis. Sometimes I feel like the “biggies” want perfect answers with no uncertainty - which never happens.</li>
</ol>
<p>That’s it, just a short set of links today. Enjoy your 4th!</p>
Repost: The 5 Most Critical Statistical Concepts
2013-07-03T13:56:24+00:00
http://simplystats.github.io/2013/07/03/repost-the-5-most-critical-statistical-concepts
<p>(Editor’s Note: This is an old post but a good one from Jeff.)</p>
<p>It seems like everywhere we look, data is being generated - from politics, to biology, to publishing, to social networks. There are also diverse new computational tools, like GPGPU and cloud computing, that expand the statistical toolbox. Statistical theory is more advanced than it’s ever been, with exciting work in a range of areas.</p>
<p>With all the excitement going on around statistics, there is also increasing diversity. It is increasingly hard to define “statistician” since the definition ranges from <a href="http://www.stat.washington.edu/jaw/" target="_blank">very mathematical</a> to <a href="http://en.wikipedia.org/wiki/Nate_Silver" target="_blank">very applied</a>. An obvious question is: what are the most critical skills needed by statisticians?</p>
<p>So just for fun, I made up my list of the top 5 most critical skills for a statistician by my own definition. They are by necessity very general (I only gave myself 5).</p>
<ol>
<li><strong>The ability to manipulate/organize/work with data on computers</strong> - whether it is with Excel, R, SAS, or Stata, to be a statistician you have to be able to work with data.</li>
<li><strong>A knowledge of exploratory data analysis</strong> - how to make plots, how to discover patterns with visualizations, how to explore assumptions</li>
<li><strong>Scientific/contextual knowledge</strong> - at least enough to be able to abstract and formulate problems. This is what separates statisticians from mathematicians.</li>
<li><strong>Skills to distinguish true from false patterns</strong> - whether with p-values, posterior probabilities, meaningful summary statistics, cross-validation or any other means.</li>
<li><strong>The ability to communicate results to people without math skills</strong> - a key component of being a statistician is knowing how to explain math/plots/analyses.</li>
</ol>
<p>What are your top 5? What order would you rank them in? Even though these are so general, I almost threw regression in there because of how often it pops up in various forms.</p>
<p><strong>Related Posts: </strong>Rafa on <a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">graduate education</a> and <a href="http://simplystatistics.tumblr.com/post/10021164565/what-is-a-statistician" target="_blank">What is a Statistician</a>? Roger on <a href="http://simplystatistics.tumblr.com/post/11655593971/do-we-really-need-applied-statistics-journals" target="_blank">“Do we really need applied statistics journals?”</a></p>
Measuring the importance of data privacy: embarrassment and cost
2013-07-01T15:52:38+00:00
http://simplystats.github.io/2013/07/01/measuring-the-importance-of-data-privacy-embarrassment-and-cost
<p>We <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">live in an era</a> when it is inexpensive and easy to collect data about ourselves or about other people. These data can take the form of health information - like medical records, or they could be financial data - like your online bank statements, or they could be social data - like your friends on Facebook. We can also easily collect information about our <a href="https://www.23andme.com/">genetic makeup</a> or our <a href="http://www.fitbit.com/">fitness</a> (although it can be <a href="http://simplystatistics.org/2013/01/02/fitbit-why-cant-i-have-my-data/">hard to get</a>).</p>
<p>All of these data types are now stored electronically. There are obvious reasons why this is both economical and convenient. The downside, of course, is that the data can be <a href="http://en.wikipedia.org/wiki/PRISM_(surveillance_program)">used by the government</a> or <a href="http://junkcharts.typepad.com/numbersruleyourworld/2013/07/know-your-data-11-facebook-and-you.html">other entities</a> in ways that you may not like. Whether it is to track your habits to sell you new products or to use your affiliations to make predictions about your political leanings, these data are not just “numbers”.</p>
<p>Data protection and data privacy are major issues in a variety of fields. In some areas, laws are in place to govern how your data can be shared and used (e.g. HIPAA). In others it is a bit more of a wild west mentality (see this interesting series of posts, “<a href="http://junkcharts.typepad.com/numbersruleyourworld/know-your-data/">Know your data</a>” by junkcharts talking about some data issues). I think most people have some idea that they would like to keep at least certain parts of their data private (from the government, from companies, or from their friends/family), but I’m not sure how most people think about data privacy.</p>
<p>For me there are two scales on which I measure the importance of the privacy of my own data:</p>
<ol>
<li><strong>Embarrassment</strong> - Data about my personal habits, whether I let my son watch too much TV, or what kind of underwear I buy could be embarrassing if it was out in public.</li>
<li><strong>Financial </strong> - Data about my social security number, my bank account numbers, or my credit card account could be used to cost me either my current money or potential future money.</li>
</ol>
<p>My concerns about data privacy can almost always be measured primarily on these two scales. For example, I don’t want my medical records to be public because: (1) it might be embarrassing for people to know how bad my blood pressure is and (2) insurance companies might charge me more if they knew. On the other hand, I don’t want my bank account to get out primarily because it could cost me financially. So that mostly only registers on one scale.</p>
<p>One option, of course, would be to make all of my data totally private. But the problem is I want to share some of it with other people - I want my doctor to know my medical history and my parents to get to see pictures of my son. Usually I just make these choices about data sharing without even thinking about them, but after a little reflection I think these are the main considerations that go into my data sharing choices:</p>
<ol>
<li><strong>Where does it rate on the two scales above?</strong></li>
<li><strong>How much do I trust the person I’m sharing with?</strong> For example, my wife knows my bank account info, but I wouldn’t give it to a random stranger on the street. Google has my email and uses it to market to me, but that doesn’t bother me too much. But I trust them (I think) not to say - tell people I’m negotiating with my plans based on emails I sent to my wife (this goes with #4 below).</li>
<li><strong>How hard would it be to use the information? </strong>I give my credit card to waiters at restaurants all the time, but I also monitor my account - so it would be relatively hard to run up a big bill before I (or the bank) noticed. I put my email address online, but it is a couple of steps between that and anything that is embarrassing/financially dubious for me. You’d have to be able to use that to hack some account.</li>
<li><strong>Is there incentive for someone to use the information? </strong>I’m not fabulously wealthy or famous. So most of the time, even if financial/embarrassing stuff is online about me, it probably wouldn’t get used. On the other hand, if I was an actor, a politician, or a billionaire there would be a lot more people incentivized to use my data against me. For example, if Google used my info to blow up a negotiation they would gain very little. I, on the other hand, would lose a lot and would probably sue them.*</li>
</ol>
<p>With these ideas in mind it makes it a little easier for me to (at least personally) classify how much I care about different kinds of privacy breaches.</p>
<p>For example, suppose my health information was posted on the web. I would consider this a problem because of both financial and embarrassment potential. It is also on the web, so I basically don’t trust the vast majority of people that would have access. On the other hand, it would be at least reasonably hard to use this data directly against me unless you were an insurance provider and most people wouldn’t have the incentive.</p>
<p>Take another example: someone tagging me in Facebook photos (I don’t have my own account). Here the financial considerations are only potential future employment problems, but the embarrassment considerations are quite high. I probably somewhat trust the person tagging me since I at least likely know them. On the other hand it would be super-easy to use the info against me - it is my face in a picture and would just need to be posted on the web. So in this case, it mostly comes down to incentive and I don’t think most people have an incentive to use pictures against me (except in jokes - which I’m mostly cool with).</p>
<p>I could do more examples, but you get the idea. I do wonder if there is an interesting statistical model to be built here on the basis of these axioms (or other more general ones) about when/how data should be used/shared.</p>
<p>* <em style="font-size: 16px;">An interesting side note is that I did use my gmail account when I was considering a position at Google fresh out of my Ph.D. I sent emails to my wife and my advisor discussing my plans/strategy. I always wondered if they looked at those emails when they were negotiating with me - although I never had any reason to suspect they had. </em></p>
What is the Best Way to Analyze Data?
2013-06-27T16:41:20+00:00
http://simplystats.github.io/2013/06/27/what-is-the-best-way-to-analyze-data
<p>One topic I’ve been thinking about recently is the extent to which data analysis is an art versus a science. In my thinking about art and science, I rely on Don Knuth’s distinction, from his 1974 lecture “Computer Programming as an Art”:</p>
<blockquote>
<p>Science is knowledge that we understand so well that we can teach it to a computer; and if we don’t fully understand something, it is an art to deal with it. Since the notion of an algorithm or a computer program provides us with an extremely useful test for the depth of our knowledge about any given subject, the process of going from an art to a science means that we learn how to automate something.</p>
</blockquote>
<p>Of course, the phrase “analyze data” is far too general; it needs to be placed in a much more specific context. So choose your favorite specific context and consider this question: Is there a way to teach a computer how to analyze the data generated in that context? Jeff wrote about this a while back and he called this magical program the <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">deterministic statistical machine</a>.</p>
<p>For example, one area where I’ve done some work is in estimating short-term/acute population-level effects of ambient air pollution. These are typically done using time series data of ambient pollution from central monitors and community-level counts of some health outcome (e.g. deaths, hospitalizations). The basic question is if pollution goes up on a given day, do we also see health outcomes go up on the same day, or perhaps in the few days afterwards. This is a fairly well-worn question in the air pollution literature and there have been hundreds of time series studies published. Similarly, there has been a lot of research into the statistical methodology for conducting time series studies and I would wager that as a result of that research we actually know something about what <em>to</em> do and what <em>not</em> to do.</p>
<p>But is our level of knowledge about the methodology for analyzing air pollution time series data to the point where we could program a computer to do the whole thing? Probably not, but I believe there are aspects of the analysis that we could program.</p>
<p>Here’s how I might break it down. Assume we basically start with a rectangular dataset with time series data on a health outcome (say, daily mortality counts in a major city), daily air pollution data, and daily data on other relevant variables (e.g. weather). Typically, the target of analysis is the association between the air pollution variable and the outcome, adjusted for everything else.</p>
<ol>
<li><strong>Exploratory analysis</strong>. Not sure this can be fully automated. Need to check for missing data and maybe stop analysis if proportion of missing data is too high? Check for high leverage points as pollution data tends to be skewed. Maybe log-transform if that makes sense in this context. Check for other outliers and note them for later (we may want to do a sensitivity analysis without those observations).</li>
<li><strong>Model fitting</strong>. This is already fully automated. If the outcome is a count, then typically a Poisson regression model is used. We already know that maximum likelihood is an excellent approach and better than most others under reasonable circumstances. There’s plenty of GLM software out there so we don’t even have to program the IRLS algorithm.</li>
<li><strong>Model building</strong>. Since this is not a prediction model, the main concern we have is that we properly adjusted for measured and unmeasured confounding. <a href="http://www.hsph.harvard.edu/francesca-dominici/">Francesca Dominici</a> and some of her colleagues have done <a href="http://www.ncbi.nlm.nih.gov/pubmed/18552590">some</a> <a href="http://www.ncbi.nlm.nih.gov/pubmed/22364439">interesting</a> <a href="http://www.tandfonline.com/doi/abs/10.1198/016214504000000656#.Ucye6BbHKZY">work</a> regarding how best to do this via Bayesian model averaging and other approaches. I would say that in principle this can be automated, but the lack of easy-to-use software at the moment makes it a bit complicated. That said, I think simpler versions of the “ideal approach” can be easily implemented.</li>
<li><strong>Sensitivity analysis</strong>. There are a number of key sensitivity analyses that need to be done in all time series analyses. If there were outliers during EDA, maybe re-run model fit and see if regression coefficient for pollution changes much. How much is too much? (Not sure.) For time series models, unmeasured temporal confounding is a big issue so this is usually checked using spline smoothers on the time variable with different degrees of freedom. This can be automated by fitting the model many different times with different degrees of freedom in the spline.</li>
<li><strong>Reporting</strong>. Typically, some summary statistics for the data are reported along with the estimate + confidence interval for the air pollution association. Estimates from the sensitivity analysis should be reported (probably in an appendix), and perhaps estimates from different lags of exposure, if that’s a question of interest. It’s slightly more complicated if you have a multi-city study.</li>
</ol>
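<p>To make the model fitting and sensitivity analysis steps concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data are simulated, the names (<code>fit_poisson</code>, <code>pollution_estimate</code>, <code>rpois</code>) are hypothetical, and a polynomial time trend of varying degree stands in for the spline smoothers with varying degrees of freedom described above. In practice one would use existing GLM software rather than a hand-rolled fitter.</p>

```python
import math
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_poisson(X, y, iters=25):
    """Poisson regression with log link via Newton-Raphson (the same updates as IRLS).
    Assumes the first column of X is the intercept."""
    p = len(X[0])
    beta = [0.0] * p
    beta[0] = math.log(sum(y) / len(y))  # start from the intercept-only fit
    for _ in range(iters):
        mu = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]
        grad = [sum((yi - mi) * row[j] for yi, mi, row in zip(y, mu, X)) for j in range(p)]
        hess = [[sum(mi * row[j] * row[k] for mi, row in zip(mu, X)) for k in range(p)]
                for j in range(p)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def rpois(lam):
    """Draw one Poisson variate (Knuth's multiplication method; fine for moderate lam)."""
    limit, k, prod = math.exp(-lam), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

# Simulated data (an assumption, not real data): one year of daily counts where
# both pollution and mortality share a seasonal pattern, so time confounds the association.
random.seed(1)
n, true_beta = 365, 0.005
season = [math.cos(2 * math.pi * d / n) for d in range(n)]
pm = [8 * s + random.gauss(0, 5) for s in season]  # centered pollution level
deaths = [rpois(math.exp(math.log(50) + 0.2 * s + true_beta * x))
          for s, x in zip(season, pm)]
t = [2 * d / (n - 1) - 1 for d in range(n)]  # time rescaled to [-1, 1]

def pollution_estimate(degree):
    """Pollution coefficient after adjusting for a polynomial time trend of the given degree."""
    X = [[1.0, x] + [ti ** k for k in range(1, degree + 1)] for x, ti in zip(pm, t)]
    return fit_poisson(X, deaths)[1]

# Sensitivity analysis: refit the model with increasingly flexible time adjustment.
estimates = {deg: pollution_estimate(deg) for deg in (0, 2, 4, 6)}
for deg, est in estimates.items():
    print(f"time-trend degree {deg}: pollution coefficient = {est:.4f}")
```

<p>In this simulated setup the unadjusted fit (degree 0) overstates the pollution coefficient because of the shared seasonal pattern, and the estimate should settle closer to the simulated truth as the time adjustment becomes more flexible. How much change across fits counts as "too much" is still a judgment call.</p>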
<p>So I’d say that of the five major steps listed above, the one that I find most difficult to automate is EDA. There, a lot of choices have to be made that are not easy to program into a computer. But I think the rest of the analysis could be automated. I’ve left out the cleaning and preparation of the data here, which also involves making many choices. But in this case, much of that is often outside the control of the investigator. These analyses typically use publicly available data where the data are available “as-is”. For example, the investigator would likely have no control over how the mortality counts were created.</p>
<p>What’s the point of all this? Well, I would argue that if we cannot completely automate a data analysis for a given context, then either we need to narrow the context, or we have some more statistical research to do. Thinking about how one might automate a data analysis process is a useful way to identify where the major statistical gaps are in a given area. Here, there may be some gaps in how best to automate the exploratory analyses. Whether those gaps can be filled (or more importantly, whether <em>you</em> are interested in filling them) is not clear. But most likely it’s not a good idea to think about better ways to fit Poisson regression models.</p>
<p>So what do you do when all of the steps of the analysis have been fully automated? Well, I guess time to move on then….</p>
Art from Data
2013-06-26T08:39:23+00:00
http://simplystats.github.io/2013/06/26/art-from-data
<p>There’s a nice piece by Mark Hansen about <a href="http://bits.blogs.nytimes.com/2013/06/19/data-driven-aesthetics/">data-driven aesthetics</a> in the New York Times special section on big data.</p>
<blockquote>
<p>From a speedometer to a weather map to a stock chart, we routinely interpret and act on data displayed visually. With a few exceptions, data has no natural “look,” no natural “visualization,” and choices have to be made about how it should be displayed. Each choice can reveal certain kinds of patterns in the data while hiding others.</p>
</blockquote>
<p>I think drawing a line between a traditional statistical graphic and a pure work of art would be somewhat difficult. You can find examples of both that might fall in the opposite category: traditional graphics that transcend their utilitarian purposes and “pure art” works that tell you something new about your world.</p>
<p>Indeed, I think Mark Hansen’s own work with Ben Rubin falls into the latter category–art pieces that perhaps had their beginnings as purely works of art but ended up giving you new insight into the world. For example, <a href="http://earstudio.com/2010/09/29/listening-post/">Listening Post</a> was a highly creative installation that simultaneously gave you an emotional connection to random people chatting on the Internet as well as insight into what the Internet was “saying” at any given time (I wonder if NSA employees took a field trip to the Whitney Museum of American Art!).</p>
Doing Statistical Research
2013-06-25T09:12:36+00:00
http://simplystats.github.io/2013/06/25/doing-statistical-research
<p>There’s a wonderful article over at the STATtr@k web site by Terry Speed on <a href="http://stattrak.amstat.org/2013/06/01/how-to-do-statistical-research/">How to Do Statistical Research</a>. There is a lot of good advice there, but the column is most notable because it’s pretty much the exact opposite of the advice that I got when I first started out.</p>
<p>To quote the article:</p>
<blockquote>
<p>The ideal research problem in statistics is “do-able,” interesting, and one for which there is not much competition. My strategy for getting there can be summed up as follows:</p>
<ul>
<li>Consulting: Do a very large amount</li>
<li>Collaborating: Do quite a bit</li>
<li>Research: Do some</li>
</ul>
</blockquote>
<p>For the most part, I was told to flip the research and consulting bits. That is, you want to spend most of your time doing “research” and very little of your time doing “consulting”. Why? Because ostensibly, the consulting work doesn’t involve new problems, only solving old problems with existing techniques. The research work by definition involves addressing new problems.</p>
<p>But,</p>
<blockquote>
<p>A strategy I discourage is “develop theory/model/method, seek application.” Developing theory, a model, or a method suggests you have done some context-free research; already a bad start. The existence of proof (Is there a problem?) hasn’t been given. If you then seek an application, you don’t ask, “What is a reasonable way to answer this question, given this data, in this context?” Instead, you ask, “Can I answer the question with this data; in this context; with my theory, model, or method?” Who then considers whether a different (perhaps simpler) answer would have been better?</p>
</blockquote>
<p>The truth is, most problems can be solved with an existing method. They may not be 100% solvable with existing tools, but usually 90% is good enough and it’s not worth developing a new statistical method to cover the remaining 10%. What you really want to be doing is working on the problem that is 0% solvable with existing methods. Then there’s a pretty big payback if you develop a new method to address it and it’s more likely that your approach will be adopted by others simply because there’s no alternative. But in order to find these 0% problems, you have to see a lot of problems, and that’s where the consulting and collaboration come in. Exposure to lots of problems lets you see the universe of possibilities and gives you a sense of where scientists really need help and where they’re more or less doing okay.</p>
<p>Even if you agree with Terry’s advice, implementing it may not be so straightforward. It may be easier/harder to do consulting and collaboration depending on where you work. Also, <a href="http://simplystatistics.org/2011/10/20/finding-good-collaborators/">finding good collaborators</a> can be tricky and may involve some trial and error.</p>
<p>But it’s useful to keep this advice in mind, especially when looking for a job. The places you want to be on the lookout for are places that give you the most exposure to interesting scientific problems, the 0% problems. These places will give you the best opportunities for collaboration and for having a real impact on science.</p>
Does fraud depend on my philosophy?
2013-06-24T10:00:30+00:00
http://simplystats.github.io/2013/06/24/does-fraud-depend-on-my-philosophy
<p>Ever since my <a href="http://simplystatistics.org/2013/05/17/when-does-replication-reveal-fraud/">last post on replication and fraud</a> I’ve been doing some more thinking about why people consider some things “scientific fraud”. (First of all, let me just say that I was a bit surprised by the discussion in the comments for that post. Some people apparently thought I was asking about the actual probability that the study was a fraud. This was not the case. I just wanted people to think about how they would react when confronted with the scenario.)</p>
<p>I often find that when I talk to people about the topic of scientific fraud, especially statisticians, there is a sense that much work that goes on out there is fraudulent, but the precise argument for why is difficult to pin down.</p>
<p>Consider the following three cases:</p>
<ol>
<li>I conduct a randomized clinical trial comparing a new treatment and a control and their effect on outcome Y1. I also collect data on outcomes Y2, Y3, … Y10. After conducting the trial I see that there isn’t a significant difference for Y1 so I test the other 9 outcomes and find a significant effect (defined as p-value equal to 0.04) for Y7. I then publish a paper about outcome Y7 and state that it’s significant with p=0.04. I make no mention of the other outcomes.</li>
<li>I conduct the same clinical trial with the 10 different outcomes and look at the difference between the treatment groups for all outcomes. I notice that the largest standardized effect size is for Y7 with a standardized effect of 3, suggesting the treatment is highly effective in this trial. I publish a paper about outcome Y7 and state that the standardized effect size was 3 for comparing treatment vs. control. I note that a difference of 3 is highly significant, but I make no mention of <em>statistical</em> significance or p-values. I also make no mention of the other outcomes.</li>
<li>I conduct the same clinical trial with the 10 outcomes. Now I look at all 10 outcomes and calculate the posterior probability that the effect is greater than zero (favoring the new treatment), given a pre-specified diffuse prior on the effect (assume it’s the same prior for each effect). Of the 10 outcomes I see that Y7 has the largest posterior probability of 0.98. I publish a paper about Y7 stating that my posterior probability for a positive effect is 0.98. I make no mention of the other outcomes.</li>
</ol>
<p>Which one of these cases constitutes scientific fraud?</p>
<ol>
<li>I think most people would object to Case 1. This is the classic multiple testing scenario where the end result is that the stated p-value is not correct. Rather than a p-value of 0.04 the real p-value is more like 0.4. A simple Bonferroni correction fixes this but obviously would have resulted in not finding any significant effects based on a 0.05 threshold. The real problem is that in Case 1 you are clearly trying to make an inference about future studies. You’re saying that if there’s truly no difference, then in 100 other studies just like this one, you’d expect only 4 to detect a difference under the same criteria that you used. But it’s incorrect to say this and perhaps fraudulent (or negligent) depending on your underlying intent. In this case a relevant detail that is missing is the number of other outcomes that were tested.</li>
<li>Case 2 differs from case 1 only in that no p-values are used but rather the measure of significance is the standardized effect size. Therefore, no probability statements are made and no inference is made about future studies. Although the information about the other outcomes is similarly omitted in this case as in case 1, it’s difficult for me to identify what is wrong with this paper.</li>
<li>Case 3 takes a Bayesian angle and is more or less like case 2 in my opinion. Here, probability is used as a measure of belief about a parameter but no explicit inferential statements are made (i.e. there is no reference to some population of other studies). In this case I just state my belief about whether an effect/parameter is greater than 0. Although I also omit the other 9 outcomes in the paper, revealing that information would not have changed anything about my posterior probability.</li>
</ol>
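<p>The arithmetic behind the "more like 0.4" figure in Case 1 can be sketched in a few lines of Python. The variable names are hypothetical, and the second and third quantities assume the 10 outcomes are independent, which a real trial would rarely satisfy exactly.</p>

```python
p_reported = 0.04   # p-value reported for Y7 in Case 1
n_outcomes = 10     # Y1 through Y10 were all examined
alpha = 0.05        # conventional significance threshold

# Bonferroni: multiply the p-value by the number of tests, capped at 1.
bonferroni_p = min(1.0, p_reported * n_outcomes)

# Assuming independent outcomes: chance that the smallest of 10 null
# p-values is at or below 0.04.
min_p_adjusted = 1 - (1 - p_reported) ** n_outcomes

# Assuming independent outcomes: chance of at least one "significant"
# result at the 0.05 level when no outcome has a true effect.
familywise_error = 1 - (1 - alpha) ** n_outcomes

print(f"Bonferroni-adjusted p-value: {bonferroni_p:.2f}")
print(f"P(min p <= 0.04 under the null): {min_p_adjusted:.3f}")
print(f"P(at least one p < 0.05 under the null): {familywise_error:.3f}")
```

<p>Either way of accounting for the 10 looks puts the honest figure far above the 0.05 threshold, which is why the omitted number of outcomes is the relevant missing detail.</p>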
<p>In each of these three scenarios, the underlying data were generated in the exact same way (let’s assume for the moment that the trial itself was conducted with complete integrity). In each of the three scenarios, 10 outcomes were examined and outcome Y7 was in some sense the most interesting.</p>
<p>Of course, the analyses and the interpretation of the data were <em>not</em> the same in each scenario. Case 1 makes an explicit inference whereas Cases 2 and 3 essentially do not. However, I would argue the <em>evidence</em> about the new treatment compared to the control treatment in each scenario was identical.</p>
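<p>One way to see how closely the Case 1 and Case 3 summaries can track each other is a small Python sketch under simplifying assumptions that are not part of the scenarios as stated: the effect estimate is approximately normal with a known standard error, and the diffuse prior is taken to be flat. Under those assumptions the posterior for the effect is normal, centered at the estimate, so a two-sided p-value of 0.04 and a posterior probability of 0.98 are two readings of the same z-score. The function names are hypothetical.</p>

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution

def two_sided_p(estimate, se):
    """Two-sided p-value for testing effect = 0 (normal approximation)."""
    z = abs(estimate / se)
    return 2 * (1 - std_normal.cdf(z))

def posterior_prob_positive(estimate, se):
    """P(effect > 0 | data) under a flat prior, where the posterior is N(estimate, se^2)."""
    return std_normal.cdf(estimate / se)

# An estimate roughly 2.054 standard errors above zero (illustrative numbers).
estimate, se = 2.054, 1.0
p_value = two_sided_p(estimate, se)
posterior = posterior_prob_positive(estimate, se)

print(f"two-sided p-value: {p_value:.3f}")
print(f"posterior P(effect > 0): {posterior:.3f}")
print(f"1 - p/2: {1 - p_value / 2:.3f}")
```

<p>Neither number accounts for the fact that Y7 was picked as the most extreme of 10 outcomes; the sketch only shows that, for a single outcome, the two summaries carry the same information in different clothing.</p>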
<p>I don’t believe that the investigator in Case 1 should be allowed to engage in such shenanigans with p-values, but should he/she be pilloried simply because the p-value was the chosen metric of significance? I guess the answer would be “yes” for many of you, but keep in mind that the investigator in Case 1 still generated the same evidence as the others. Should the investigators in Case 2 and Case 3 be thrown in the slammer? If so, on what basis?</p>
<p>My feeling is not that people should be allowed to do whatever they please, but we need a better way to separate the “stuff” from the stuff. This is both a methodological and a communications issue. For example, Case 3 may not be fraud but I’m not necessarily interested in what the investigator’s opinion about a parameter is. I want to know what the data say about that parameter (or treatment difference in this case). Is it fraud to make any inferences in the first place (as in Case 1)? I mean, how could you possibly know that your inference is “correct”? If “all models are wrong, but some are useful”, does that mean that everyone is committing fraud?</p>
Sunday data/statistics link roundup (6/23/13)
2013-06-23T22:24:40+00:00
http://simplystats.github.io/2013/06/23/sunday-datastatistics-link-roundup-62313
<ol>
<li><a href="http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0066463">An interesting study</a> describing the potential benefits of significance testing and a scenario where the file drawer effect may even be beneficial. Granted this is all simulation so you have to take it with a grain of salt, but I like the pushback against the hypothesis testing haters. In all things moderation, including hypothesis testing.</li>
<li><a href="http://www.npr.org/blogs/codeswitch/2013/06/21/193881290/jeah-we-mapped-out-the-four-basic-aspects-of-being-a-bro">Venn Diagrams for the win, bro</a>.</li>
<li><a href="http://www.youtube.com/watch?v=E-gpSQQe3w8">The new basketball positions</a>. The idea is to cluster players based on the positions on the floor where they shoot, etc. I like the idea of data driven position definitions; I am a little worried about “reading ideas in” to a network picture.</li>
<li><a href="http://qz.com/95516/an-start-ups-plan-to-make-us-health-care-cheaper-tell-people-what-it-costs/">A really cool idea</a> about a startup that makes data on healthcare procedures available to patients. I’m all about data transparency, but it makes me wonder, how often do people with health insurance negotiate the prices of procedures? (via Leah J.)</li>
<li>Another interesting article <a href="http://www.nytimes.com/2013/06/23/opinion/sunday/theres-a-fly-in-my-tweets.html?emc=eta1&_r=0">about using tweets</a> (and other social media) to improve public health. I do wonder about potential sampling issues, like what happened with <a href="http://bits.blogs.nytimes.com/2013/02/24/disruptions-google-flu-trends-shows-problems-of-big-data-without-context/">google flu trends</a> (via Nick C.)</li>
</ol>
Interview with Miriah Meyer - Microsoft Faculty Fellow and Visualization Expert
2013-06-21T10:39:24+00:00
http://simplystats.github.io/2013/06/21/interview-with-miriah-meyer-microsoft-faculty-fellow-and-visualization-expert
<p><a href="http://simplystatistics.org/2013/06/21/interview-with-miriah-meyer-microsoft-faculty-fellow-and-visualization-expert/miriah-2/" rel="attachment wp-att-1424"><img class="alignnone wp-image-1424" alt="miriah" src="http://simplystatistics.org/wp-content/uploads/2013/06/miriah1.jpg" width="311" height="256" /></a></p>
<p><em><a href="http://www.cs.utah.edu/~miriah/">Miriah Meyer</a> received her Ph.D. in computer science from the University of Utah, then did a postdoctoral fellowship at Harvard University and was a visiting fellow at MIT and the Broad Institute. Her research focuses on developing visualization tools in close collaboration with biological scientists. She has been recognized as a Microsoft Faculty Fellow, a TED Fellow, and appeared on the TR35. We talked with Miriah about visualization, collaboration, and her influences during her career as part of the <a href="http://simplystatistics.org/interviews/">Simply Statistics Interview Series</a>.</em></p>
<p><strong>SS: Which term applies to you: data scientist, statistician, computer scientist, or something else?</strong></p>
<p>MM: My training is as a computer scientist and much of the way I problem solve is grounded in computational thinking. I do, however, sometimes think of myself as a data counselor, as a big part of what I do is help my collaborators move towards a deeper and more articulate statement about what they want/need to do with their data.</p>
<p><strong>SS: Most data analysis is done by scientists, not trained statisticians. How does data visualization help/hurt scientists when looking at piles of complicated data?</strong></p>
<p>MM: In the sciences, visualization is particularly good for hypothesis generation and early stage exploration. With many fields turning toward data-driven approaches, scientists are often not sure of exactly what they will find in a mound of data. Visualization allows them to look into the data without having to specify a specific question, query, or model. This early, exploratory analysis is very difficult to do strictly computationally. Exploration via interactive visualization can lead a scientist towards establishing a more specific question of the data that could then be addressed algorithmically.</p>
<p><strong>SS: </strong><strong>What are the steps in developing a visualization with a scientific collaborator?</strong></p>
<p>MM: The first step is finding good collaborators <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>The beginning of a project is spent in discussions with the scientists, trying to understand their needs, data, and mental models of a problem. I find this part to be the most interesting, and also the most challenging. The goal is to develop a clear, abstract understanding of the problem and set of design requirements. We do this through interviews and observations, with a focus on understanding how people currently solve problems and what they want to do but can’t with current tools.</p>
<p>Next is to take this understanding and prototype ideas for visualization designs. Rapid prototyping on paper is usually first, followed by more sophisticated, software prototypes after getting feedback from the collaborators. Once a design is sufficiently fleshed out and validated, a (relatively) full-featured visualization tool is developed and deployed.</p>
<p>At this point, the scientists tend to realize that the problem they initially thought was most interesting isn’t… and the cycle continues.</p>
<p>Fast iteration is really essential in this process. In the past I’ve gone through as many as three cycles of this process before finding the right problem abstractions and designs.</p>
<p><strong>SS: You have tackled some diverse visualizations (from synteny to poems); what are the characteristics of a problem that make it a good candidate for new visualizations?</strong></p>
<p>MM: For me, the most important thing is to find good collaborators. It is essential to find partners that are willing to give lots of their time up front, are open-minded about research directions, and are working on cutting-edge problems in their field. This latter characteristic helps to ensure that there will be something novel needed from a data analysis and visualization perspective.</p>
<p>The other thing is to test whether a problem passes the Tableau/R/Matlab test: if the problem can’t be solved using one of these environments, then that is probably a good start.</p>
<p><strong>SS: What is the four-level nested model for design validation and how did you extend it?</strong></p>
<p>MM: This is a design decision model that helps to frame the different kinds of decisions made in the visualization design process, such as decisions about data derivations, visual representations, and algorithms. The model helps to put any one decision in the context of other visualization ideas, methods, and techniques, and also helps a researcher generalize new ideas to a broader class of problems. We recently extended this model to specifically define what a visualization “guideline” is, and how to relate this concept to how we design and evaluate visualizations.</p>
<p><strong>SS: Who are the key people who have been positive influences on your career and how did they help you?</strong></p>
<p>MM: One influence that jumps out to me is a collaboration with a designer in Boston named Bang Wong. Working with Bang completely changed my approach to visualization development and got me thinking about iteration, rapid prototyping, and trying out many ideas before committing. Also important were two previous supervisors, Ross Whitaker and Tamara Munzner, who constantly pushed me to be precise and articulate about problems and approaches to them. I believe that precision is a hallmark of good data science, even when characterizing imprecise things <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p><strong>SS: Do you have any advice for computer scientists/statisticians who want to work on visualization as a research area?</strong></p>
<p>MM: Do it! Visualization is a really fun, vibrant, growing field. It relies on a broad spectrum of skills, from computer science, to design, to collaboration. I would encourage those interested to not get too infatuated with the engineering or the aesthetics and to instead focus on solving real-world problems. There is an unlimited supply of those!</p>
Google's brainteasers (that don't work) and Johns Hopkins Biostatistics Data Analysis
2013-06-20T10:10:15+00:00
http://simplystats.github.io/2013/06/20/googles-brainteasers-that-dont-work-and-johns-hopkins-biostatistics-data-analysis
<p><a href="http://www.nytimes.com/2013/06/20/business/in-head-hunting-big-data-may-not-be-such-a-big-deal.html?pagewanted=all&_r=0">This article</a> is getting some attention, because Google’s VP for people operations has made public a few insights that the Google HR team has come to over the last several years. The most surprising might be:</p>
<ol>
<li><span style="line-height: 16px;">They don’t collect GPAs except for new candidates</span></li>
<li>Test scores are worthless</li>
<li>Interview scores weren’t correlated with success.</li>
<li>Brainteasers that Google is so famous for are worthless</li>
<li>Behavioral interviews are the most effective</li>
</ol>
<p>The reason the article is getting so much attention is how surprising these facts may be to people who have little experience hiring/managing in technical fields. But I thought this quote was really telling:</p>
<blockquote>
<p> One of my own frustrations when I was in college and grad school is that you knew the professor was looking for a specific answer. You could figure that out, but it’s much more interesting to solve problems where there isn’t an obvious answer.</p>
</blockquote>
<p>Interestingly, <a href="http://simplystatistics.org/2011/10/22/graduate-student-data-analysis-inspired-by-a/">that is the whole point</a> of my data analysis course here at Hopkins. Over my relatively limited time as a faculty member I realized there were two key qualities that made students in biostatistics stand out: (1) that they were hustlers - willing to just work until the problem is solved even if it was frustrating and (2) that they were willing/able to try new approaches or techniques they weren’t comfortable with. I don’t have the quantitative data that Google does, but I would venture to guess those two traits explain 80%+ of the variation in success rates for graduate students in statistics/computing/data analysis.</p>
<p>Once that realization is made, it becomes clear pretty quickly that textbook problems or re-analysis of well known data sets measure something orthogonal to traits (1) and (2). So I went about redesigning the types of problems our students had to tackle. Instead of assigning problems out of a book I redesigned the questions to have the following characteristics:</p>
<ol>
<li><span style="line-height: 16px;">They were based on live data sets. I define a “live” data set as a data set that has not been used to answer the question of interest previously.</span></li>
<li>The questions are <a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">problem forward, not solution backward</a>. I would have an idea of what would likely work and what would likely not work. But I defined the question without thinking about what methods the students might use.</li>
<li>The answer was open ended (and often not known to me in advance).</li>
<li>The problems often had to do with unique scenarios not encountered frequently in statistics (e.g. you have a data census instead of just a sample).</li>
<li>The problems involved methods application/development, coding, and writing/communication.</li>
</ol>
<p>I have found that problems with these characteristics more precisely measure hustle and flexibility, like Google is looking for in their hiring practices. Of course, there are some downsides to this approach. I think it can be more frustrating for students, who don’t have as clearly defined a path through the homework. It also means dramatically more work for the instructor in terms of analyzing the data to find the quirks, creating personalized feedback for students, and being able to properly estimate the amount of work a project will take.</p>
<p><a href="http://simplystatistics.org/2013/03/26/an-instructors-thoughts-on-peer-review-for-data-analysis-in-coursera/">We have started thinking</a> about how to do this same thing at scale on Coursera. In the meantime, Google will just have to send their recruiters to Hopkins Biostats to find students who meet the characteristics they are looking for :-).</p>
Sunday data/statistics link roundup (6/16/13 - Father's day edition!)
2013-06-16T10:31:51+00:00
http://simplystats.github.io/2013/06/16/sunday-datastatistics-link-roundup-61613-fathers-day-edition
<ol>
<li><span style="line-height: 16px;"><a href="http://www.npr.org/blogs/health/2013/06/07/189565146/datapalooza-a-concept-a-conference-and-a-movement">Datapalooza</a>! I’m wondering where my invite is? I do health data stuff, pick me, pick me! Actually it does sound like a pretty good idea - in general giving a bunch of smart people access to interesting data and real science problems can produce some cool results (link via Dan S.)</span></li>
<li>This <a href="http://www.manhattan-institute.org/pdf/fda_06.pdf">report on precision medicine</a> from the Manhattan Institute is related to my post this week on <a href="http://simplystatistics.org/2013/06/12/personalized-medicine-is-primarily-a-population-health-intervention/">personalized medicine</a>. I like the idea that we should be focusing on developing new ideas for adaptive trials (my buddy <a href="http://people.csail.mit.edu/mrosenblum/Home.html">Michael</a> is all over that stuff). I did think that it was a little pie-in-the-sky with plenty of buzzwords like Bayesian causal networks and pattern recognition. I think these ideas are certainly applicable, but the report, I think, overstates the current level of applicability of these methods. We need more funding and way more research to support this area before we should automatically adopt it - big data can be used to confuse when methods aren’t well understood (link via Rafa via Marginal Revolution).</li>
<li><a href="http://ropensci.org/blog/2013/06/12/sloan/">rOpenSci</a> wins a grant from the Sloan Foundation! Psyched to see this kind of innovative open software development get the support it deserves. My favorite rOpenSci package is <a href="http://ropensci.org/packages/figshare.html">rFigshare</a>, what’s yours?</li>
<li>A <a href="http://snikolov.wordpress.com/2012/11/14/early-detection-of-twitter-trends/">k-means approach</a> to detecting what will be trending on Twitter. It always gets me so pumped up to see the creative ways that methods that have been around forever can be adapted to solve real, interesting problems.</li>
<li>Finally, I <a href="http://www.brainpickings.org/index.php/2013/06/14/einstein-letter-to-son/">thought this link</a> was very appropriate for father’s day. I couldn’t agree more that the best kind of learning happens when you are just so into something that you forget you are learning. Happy father’s day everyone!</li>
</ol>
The vast majority of statistical analysis is not performed by statisticians
2013-06-14T10:31:10+00:00
http://simplystats.github.io/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians
<p dir="ltr">
Whether you know it or not, everything you do produces data - from the websites you read to the rate at which your heart beats. Until pretty recently, most of the data you produced wasn’t collected; it floated off unmeasured. The only data that were collected were painstakingly gathered by scientists one number at a time in small experiments with a few people. This laborious process meant that data were expensive and time-consuming to collect. Yet many of the most amazing scientific discoveries over the last two centuries were squeezed from just a few data points. But over the last two decades, the unit price of data has dramatically dropped. New technologies touching every aspect of our lives from our money, to our health, to our social interactions have made data collection cheap and easy (see e.g. <a href="http://en.wikipedia.org/wiki/Camp_Williams">Camp Williams</a>).
</p>
<p dir="ltr">
To give you an idea of how steep the drop in the price of data has been, in 1967 Stanley Milgram <a href="http://en.wikipedia.org/wiki/Small_world_phenomenon">did an experiment</a> to determine the number of degrees of separation between two people in the U.S. In his experiment he sent 296 letters to people in Omaha, Nebraska and Wichita, Kansas. The goal was to get the letters to a specific person in Boston, Massachusetts. The trick was people had to send the letters to someone they knew, and they then sent it to someone they knew and so on. At the end of the experiment, only 64 letters made it to the individual in Boston. On average, the letters had gone through 6 people to get there. This is where the idea of “6-degrees of Kevin Bacon” comes from. Based on 64 data points. <a href="http://research.microsoft.com/en-us/um/people/horvitz/Messenger_graph_www.htm">A 2007 study</a> updated that number to “7 degrees of Kevin Bacon”. The study was based on 30 billion instant messaging conversations collected over the course of a month or two with the same amount of effort.
</p>
<p dir="ltr">
Once data started getting cheaper to collect, it got cheaper fast. Take another example, the human genome. The genome is the unique DNA code in every one of your cells. It consists of a set of 3 billion letters that is unique to you. By many measures, the race to be the first group to collect all 3 billion letters from a single person kicked off the data revolution in biology. The project was completed in 2000 after a decade of work and <a href="http://www.genome.gov/11006943">$3 billion</a> to collect the 3 billion letters in the first human genome. This project was actually <a href="http://www.nature.com/news/economic-return-from-human-genome-project-grows-1.13187">a stunning success</a>; most people thought it would be much more expensive. But just over a decade later, new technology means that we can now collect all 3 billion letters from a person’s genome for about $10,000 in about a week.
</p>
<p> As the price of data dropped so dramatically over the last two decades, the division of labor between analysts and everyone else became less and less clear. Data became so cheap that it couldn’t be confined to just a few highly trained people. So raw data started to trickle out in a number of different ways. It started with maps of temperatures across the U.S. in newspapers and quickly ramped up to information on how many friends you had on Facebook, the price of tickets on 50 airlines for the same flight, or measurements of your blood pressure, good cholesterol, and bad cholesterol at every doctor’s visit. Arguments about politics started focusing on the results of opinion polls and who was asking the questions. The doctor stopped telling you what to do and started presenting you with options and the risks that went along with each.</p>
<p dir="ltr">
That is when statisticians stopped being the primary data analysts. At some point, the trickle of data about you, your friends, and the world started impacting every component of your life. Now almost every decision you make is based on data you have about the world around you. Let’s take something simple, like where you are going to eat tonight. You might just pick the nearest restaurant to your house. But you could also ask your friends on Facebook where you should eat, or read reviews on Yelp, or check out menus on the restaurants’ websites. All of these are pieces of data that are collected and presented for you to "analyze".
</p>
<p dir="ltr">
This revolution demands a new way of thinking about statistics. It has precipitated <a href="http://flowingdata.com/">explosive growth in data visualization </a>- the most accessible form of data analysis. It has encouraged explosive growth in MOOCs like the ones <a href="http://simplystatistics.org/courses/">Roger, Brian and I taught. </a>It has created <a href="https://data.baltimorecity.gov/">open data initiatives in government</a>. It has also encouraged more accessible data analysis platforms in the form of startups like <a href="https://www.statwing.com/">StatWing</a> that make it easier for non-statisticians to analyze data.
</p>
<p dir="ltr">
What does this mean for statistics as a discipline? Well, it is great news in that we have a lot more people to train. It also really drives home the <a href="http://simplystatistics.org/tag/statistical-literacy/">importance of statistical literacy</a>. But it also means we need to adapt our thinking about what it means to teach and perform statistics. We need to focus increasingly on interpretation and critique and away from formulas and memorization (think English composition versus grammar). We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more foolproofing, <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">more time automating</a>, and <a href="http://simplystatistics.org/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper/">more time creating software</a>. The potential payout is huge for realizing that the tide has turned and most people who analyze data aren't statisticians.
</p>
False discovery rate regression (cc NSA's PRISM)
2013-06-13T10:36:43+00:00
http://simplystats.github.io/2013/06/13/false-discovery-rate-regression-cc-nsas-prism
<p><em>There is an idea I have been thinking about for a while now. It re-emerged at the top of my list after seeing this <a href="http://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/">really awesome post</a> on using metadata to identify “conspirators” in the American revolution. My first thought was: but how do you know that you aren’t just <a href="http://www.statsblogs.com/2013/06/07/how-likely-is-the-nsa-prism-program-to-catch-a-terrorist/">making lots of false discoveries</a>?</em></p>
<p>Hypothesis testing and significance analysis were originally developed to make decisions for single hypotheses. In many modern applications, it is more common to test hundreds or thousands of hypotheses. In the standard multiple testing framework, you perform a hypothesis test for each of the “features” you are studying (these are typically genes or voxels in high-dimensional problems in biology, but can be other things as well). Then the following outcomes are possible:</p>
<div class="table-responsive">
<table style="width:100%; " class="easy-table easy-table-default " border="0">
<tr>
<th>
</th>
<th>
Call Null True
</th>
<th>
Call Null False
</th>
<th>
Total
</th>
</tr>
<tr>
<td>
Null True
</td>
<td>
True Negatives
</td>
<td>
False Positives
</td>
<td>
True Nulls
</td>
</tr>
<tr>
<td>
Null False
</td>
<td>
False Negatives
</td>
<td>
True Positives
</td>
<td>
False Nulls
</td>
</tr>
<tr>
<td>
</td>
<td>
No Decisions
</td>
<td>
Rejections
</td>
<td>
</td>
</tr>
</table>
</div>
<p>The reason for “No Decisions” is that the way hypothesis testing is set up, one should technically never accept the null hypothesis. The number of rejections is the total number of times you claim that a particular feature shows a signal of interest.</p>
<p>A very common measure of embarrassment in multiple hypothesis testing scenarios is the <a href="http://www.pnas.org/content/100/16/9440.long">false discovery rate</a> defined as:</p>
<p style="text-align:center;">
<span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_8872a758f47a82b2e07e611dd230ce08.gif" style="vertical-align: middle; border: none;" class="tex" alt=" FDR = E\left[\frac{\# of False Positives}{\# of Rejections}\right] " /></span>
</p>
<p>There are some niceties that have to be dealt with here, like the fact that the <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_ffd8d02b433631d1f60bb73a1d273664.gif" style="vertical-align: middle; border: none; " class="tex" alt="\# of Rejections" /></span> may be equal to zero, inspiring things like the <a href="http://www.genomine.org/papers/directfdr.pdf">positive false discovery rate</a>, which has <a href="http://genomics.princeton.edu/storeylab/papers/Storey_Annals_2003.pdf">some nice Bayesian interpretations</a>.</p>
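<p>To make the definition concrete, here is a minimal simulation sketch. The function name <code>simFDP</code>, the 80% null fraction, and the 0.05 cutoff are all assumptions chosen for illustration. Each simulated experiment yields a false discovery proportion (FDP), set to 0 by convention when there are no rejections, and the FDR is approximated by averaging the FDP over many experiments.</p>
<pre class="brush: r; title: ; notranslate" title="">## Simulate one multiple-testing experiment and return the
## false discovery proportion (FDP) at a fixed p-value cutoff.
## simFDP and its default arguments are hypothetical choices.
simFDP <- function(ntest = 1000, pi0 = 0.8, cutoff = 0.05) {
  ## TRUE means the null hypothesis is true for that test
  nullI <- rbinom(ntest, size = 1, prob = pi0) > 0
  pValues <- rep(NA, ntest)
  pValues[nullI] <- runif(sum(nullI))
  pValues[!nullI] <- rbeta(sum(!nullI), 1, 50)
  reject <- pValues < cutoff
  nReject <- sum(reject)
  ## Convention: the FDP is 0 when nothing is rejected
  if (nReject == 0) return(0)
  sum(reject & nullI) / nReject
}
set.seed(1345)
## The FDR is the expectation of the FDP, so approximate it
## by averaging over repeated experiments
fdp <- replicate(200, simFDP())
fdrHat <- mean(fdp)
</pre>
<p>In a real analysis the true null indicators are unknown, which is why the FDR has to be estimated or controlled rather than computed directly like this.</p>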
<p>The way that the process usually works is that a test statistic is calculated for each hypothesis test, where a larger statistic means more significant, and then operations are performed on these ordered statistics. The two most common operations are: (1) pick a cutoff along the ordered list of p-values, call everything less than this threshold significant, and <em>estimate</em> the FDR for that cutoff; and (2) pick an acceptable FDR level and find an algorithm to pick the threshold that <em>controls</em> the FDR, where control is usually defined by saying something like the algorithm produces <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_05efdcd0b987f4b9c9667b1b0ffe3e9c.gif" style="vertical-align: middle; border: none; " class="tex" alt="E[FDP] \leq FDR" /></span>.</p>
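<p>The two operations can be sketched on simulated P-values. This is only an illustration under assumptions: the estimate in (1) is a simple conservative plug-in that treats every test as null, and the controlling procedure in (2) is Benjamini-Hochberg via <code>p.adjust</code>, which is one standard choice and not the only one. The cutoff of 0.01 and the FDR level of 0.05 are arbitrary.</p>
<pre class="brush: r; title: ; notranslate" title="">set.seed(1345)
ntest <- 1000
## Simulated truth: about 80% of the tests are null
nullI <- rbinom(ntest, size = 1, prob = 0.8) > 0
pValues <- rep(NA, ntest)
pValues[nullI] <- runif(sum(nullI))
pValues[!nullI] <- rbeta(sum(!nullI), 1, 50)
## (1) Estimation: fix a cutoff and estimate the FDR there.
## If every test were null we would expect ntest * cutoff
## P-values below the cutoff, which gives a conservative estimate
cutoff <- 0.01
nReject <- sum(pValues <= cutoff)
fdrEst <- min(1, ntest * cutoff / max(nReject, 1))
## (2) Control: fix an FDR level and let Benjamini-Hochberg
## choose the threshold
fdrLevel <- 0.05
bhReject <- p.adjust(pValues, method = "BH") <= fdrLevel
## Because the truth is simulated, the realized FDP can be checked
fdpBH <- sum(bhReject & nullI) / max(sum(bhReject), 1)
</pre>
<p>Both approaches order the tests by P-value only, which is exactly the nested rejection region assumption.</p>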
<p>Regardless of the approach these methods usually make an assumption that the rejection regions should be nested. In other words, if you call statistic $T_k$ significant and $T_j > T_k$ then your method should also call statistic $T_j$ significant. In the absence of extra information, this is a very reasonable assumption.</p>
<p>But in many situations you might have additional information you would like to use in the decision about whether to reject the null hypothesis for test $j$.</p>
<p><strong>Example 1 </strong>A common example is gene-set analysis. Here you have a group of hypotheses that you have tested individually and you want to say something about the level of noise <em>in the group</em>. In this case, you might want to know something about the level of noise if you call the whole set interesting.</p>
<p><strong>Example 2</strong> Suppose you are a <a href="http://www.nsa.gov/">mysterious government agency</a> and you want to identify potential terrorists. You observe some metadata on people and you want to predict who is a terrorist - <a href="http://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/">say using betweenness centrality</a>. You could calculate a P-value for each individual, <a href="http://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/">say using a randomization test</a>. Then estimate your FDR based on predictions using the metadata.</p>
<p><strong>Example 3 </strong>You are monitoring a system over time where observations are random. Say for example whether there is an outbreak of a particular disease in a particular region at a given time. So, is the rate of disease higher than background? How can you estimate the rate at which you make false claims?</p>
<p>For now I’m going to focus on the estimation scenario but you could imagine using these estimates to try to develop controlling procedures as well.</p>
<p>In each of these cases you have a scenario where you are interested in something like:</p>
<p style="text-align:center;">
<span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_6b9dec89fd15463535787dae9770a119.gif" style="vertical-align: middle; border: none;" class="tex" alt=" E\left[\frac{V}{R} | X=x\right] = fdr(x) " /></span>
</p>
<p>where <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_d050cb5977ffc8b20156a55d09523b37.gif" style="vertical-align: middle; border: none; " class="tex" alt="fdr(x)" /></span> is a covariate-specific estimator of the false discovery rate. Returning to our examples you could imagine:</p>
<p><strong>Example 1</strong></p>
<p style="text-align:center;">
<span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_1d320f0acb560fdcbb3a754382fdfff6.gif" style="vertical-align: middle; border: none;" class="tex" alt=" E\left[\frac{V}{R} | GS = k\right] =\beta_0 + \sum_{\ell=1}^K\beta_{\ell} 1(GS=\ell) " /></span>
</p>
<p><strong>Example 2</strong></p>
<p style="text-align:center;">
<span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_9098601d0984950edac65551f02c21d5.gif" style="vertical-align: middle; border: none;" class="tex" alt=" E\left[\frac{V}{R} | Person , Age\right] =\beta_0 + \gamma Age + \sum_{\ell=1}^K\beta_{\ell}1(Person = \ell)" /></span>
</p>
<p><strong>Example 3</strong></p>
<p style="text-align:center;">
<span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_f91ccbb594db69ad1f8c4a6ce55aafe7.gif" style="vertical-align: middle; border: none;" class="tex" alt=" E\left[\frac{V}{R} | Time \right] =\beta_0 + \sum_{\ell =1}^{K} s_{\ell}(time)" /></span>
</p>
<p>Where in the last case, we have parameterized the relationship between FDR and time with a flexible model like <a href="http://en.wikipedia.org/wiki/Spline_(mathematics)">cubic splines</a>.</p>
<p>The hard problem is fitting the regression models in Examples 1-3. Here I propose a basic estimator of the FDR regression model and leave it to others to be smart about it. Let’s focus on P-values because they are the easiest to deal with. Suppose that we calculate the random variables <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_950d9b6e1b9ec640c200f498edce462e.gif" style="vertical-align: middle; border: none; " class="tex" alt="Y_i = 1(P_i > \lambda)" /></span>. Then:</p>
<p style="text-align:center;">
<span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_3fab11118e2a7046c2f292c6b8c4b02c.gif" style="vertical-align: middle; border: none;" class="tex" alt=" E[Y_i] = Prob(P_i > \lambda) = (1-\lambda)*\pi_0 + (1-G(\lambda))*(1-\pi_0)" /></span>
</p>
<p>Where $G(\lambda)$ is the cumulative distribution function for the P-values under the alternative hypothesis. This may be a mixture distribution. If we assume reasonably powered tests and that $\lambda$ is large enough, then <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_bb81b1246a04daed40b29fbcb43af011.gif" style="vertical-align: middle; border: none; " class="tex" alt="G(\lambda) \approx 1" /></span>, and the second term is approximately zero. So</p>
<p style="text-align:center;">
<span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_23e567c80909c623fb219119e166648f.gif" style="vertical-align: middle; border: none;" class="tex" alt=" E[Y_i] \approx (1-\lambda) \pi_0" /></span>
</p>
<p>One obvious choice is then to try to model</p>
<p style="text-align:center;">
<span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_3bbbbca2879079148f4c2e876c22c16e.gif" style="vertical-align: middle; border: none;" class="tex" alt=" E[Y_i | X = x] \approx (1-\lambda) \pi_0(x) " /></span>
</p>
<p>We could, for example use the model:</p>
<p style="text-align:center;">
<span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_f5a2a8626283bb4ffee57f4a841bf587.gif" style="vertical-align: middle; border: none;" class="tex" alt=" logit(E[Y_i | X = x]) = f(x)" /></span>
</p>
<p>where <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_50bbd36e1fd2333108437a2ca378be62.gif" style="vertical-align: middle; border: none; " class="tex" alt="f(x)" /></span> is a linear model or spline, etc. Then we get the fitted values and calculate:</p>
<p style="text-align:center;">
<span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_73bdc9e0e61a7d559c44b24d2ecd74cf.gif" style="vertical-align: middle; border: none;" class="tex" alt="\hat{\pi}_0(x) = \hat{E}[Y_i | X=x] /(1-\lambda)" /></span>
</p>
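<p>As a sanity check, consider the special case with no covariates, where the model is just an intercept and the estimator reduces to the fraction of P-values above $\lambda$ divided by $(1-\lambda)$. The sketch below assumes a true null proportion of 0.7 and $\lambda = 0.8$; both values are arbitrary illustrations.</p>
<pre class="brush: r; title: ; notranslate" title="">set.seed(1345)
ntest <- 1000
## Assumed true proportion of nulls, for illustration only
pi0 <- 0.7
nullI <- rbinom(ntest, size = 1, prob = pi0) > 0
pValues <- rep(NA, ntest)
pValues[nullI] <- runif(sum(nullI))
pValues[!nullI] <- rbeta(sum(!nullI), 1, 50)
## Y_i indicates whether the P-value lands above lambda
lambda <- 0.8
y <- pValues > lambda
## E[Y_i] is approximately (1 - lambda) * pi0, so rescale the mean
pi0hat <- mean(y) / (1 - lambda)
</pre>
<p>The estimate should land near the assumed value of 0.7, up to sampling noise, because almost no alternative P-values from this beta distribution exceed $\lambda$.</p>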
<p>Here is a little simulated example where the goal is to estimate the probability of being a false positive as a smooth function of time.</p>
<pre class="brush: r; title: ; notranslate" title="">## Load libraries
library(splines)
## Define the number of tests
set.seed(1345)
ntest <- 1000
## Set up the time vector and the probability of being null
tme <- seq(-2,2,length=ntest)
pi0 <- pnorm(tme)
## Calculate a random variable indicating whether to draw
## the p-values from the null or alternative
nullI <- rbinom(ntest,prob=pi0,size=1) > 0
## Sample the null P-values from U(0,1) and the alternatives
## from a beta distribution
pValues <- rep(NA,ntest)
pValues[nullI] <- runif(sum(nullI))
pValues[!nullI] <- rbeta(sum(!nullI),1,50)
## Set lambda and calculate the estimate
lambda <- 0.8
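y <- pValues > lambda
## Note: glm() defaults to family=gaussian, so this fits a linear
## probability model; pass family=binomial for the logit model above
glm1 <- glm(y ~ ns(tme,df=3))
## Get the estimated pi0 values
pi0hat <- glm1$fitted/(1-lambda)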
## Plot the real versus fitted probabilities
plot(pi0,pi0hat,col="blue",type="l",lwd=3,xlab="Real pi0",ylab="Fitted pi0")
abline(c(0,1),col="grey",lwd=3)
</pre>
<p>The result is this plot:</p>
<p><a href="http://simplystatistics.org/2013/06/13/false-discovery-rate-regression-cc-nsas-prism/pi0/" rel="attachment wp-att-1312"><img class="alignnone size-full wp-image-1312" alt="pi0" src="http://simplystatistics.org/wp-content/uploads/2013/05/pi0.png" width="480" height="480" srcset="http://simplystatistics.org/wp-content/uploads/2013/05/pi0-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2013/05/pi0-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2013/05/pi0.png 480w" sizes="(max-width: 480px) 100vw, 480px" /></a></p>
<p><span style="color: #000000;"><b>Real versus estimated false discovery rate when calling all tests significant.</b></span></p>
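<p>For comparison, here is a sketch of the same simulation fit with the logit link described earlier. The names <code>glm2</code> and <code>pi0hatLogit</code> are mine, and truncating the rescaled estimate at 1 is an extra assumption on my part, made because a proportion of nulls cannot exceed 1.</p>
<pre class="brush: r; title: ; notranslate" title="">library(splines)
set.seed(1345)
ntest <- 1000
## Same setup as the simulation above
tme <- seq(-2, 2, length = ntest)
pi0 <- pnorm(tme)
nullI <- rbinom(ntest, prob = pi0, size = 1) > 0
pValues <- rep(NA, ntest)
pValues[nullI] <- runif(sum(nullI))
pValues[!nullI] <- rbeta(sum(!nullI), 1, 50)
lambda <- 0.8
y <- as.numeric(pValues > lambda)
## Logistic regression keeps the fitted E[Y | time] inside (0, 1)
glm2 <- glm(y ~ ns(tme, df = 3), family = binomial)
## Rescale by (1 - lambda) and truncate at 1
pi0hatLogit <- pmin(fitted(glm2) / (1 - lambda), 1)
</pre>
<p>Plotting <code>pi0</code> against <code>pi0hatLogit</code> with the same <code>plot</code> call as above gives a direct comparison with the linear fit.</p>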
<p>This estimate is obviously not guaranteed to estimate the FDR well. The operating characteristics need to be evaluated both theoretically and empirically, and the other examples need to be fleshed out. But isn’t the idea of FDR regression cool?</p>
Personalized medicine is primarily a population-health intervention
2013-06-12T11:06:11+00:00
http://simplystats.github.io/2013/06/12/personalized-medicine-is-primarily-a-population-health-intervention
<p>There has been a lot of discussion of <a href="http://en.wikipedia.org/wiki/Personalized_medicine">personalized medicine</a>, <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/">individualized health</a>, and <a href="http://www.ucsf.edu/welcome-to-ome">precision medicine</a> in the news and in the medical research community. Despite this recent attention, it is clear that healthcare has always been personalized to some extent. For example, men are rarely pregnant and heart attacks occur more often among older patients. In these cases, easily collected variables, such as sex and age, can be used to predict health outcomes and therefore to “personalize” healthcare for those individuals.</p>
<p>So why the recent excitement around personalized medicine? The reason is that it is increasingly cheap and easy to collect more precise measurements about patients that might be able to predict their health outcomes. An example that <a href="http://www.nytimes.com/2013/05/14/opinion/my-medical-choice.html?_r=0">has recently been in the news</a> is the measurement of mutations in the BRCA genes. Angelina Jolie made the decision to undergo a prophylactic double mastectomy based on her family history of breast cancer and measurements of mutations in her BRCA genes. Based on these measurements, previous studies had suggested she might have a lifetime risk as high as 80% of developing breast cancer.</p>
<p>This kind of scenario will become increasingly common as newer and more accurate genomic screening and predictive tests are used in medical practice. When I read these stories there are two points I think of that sometimes get obscured by the obviously fraught emotional, physical, and economic considerations involved with making decisions on the basis of new measurement technologies:</p>
<ol>
<li><strong>In individualized health/personalized medicine the “treatment” is information about risk</strong>. In <a href="http://en.wikipedia.org/wiki/Gleevec">some cases</a> treatment will be personalized based on assays. But in many other cases, we still do not (and likely will not) have perfect predictors of therapeutic response. In those cases, the healthcare will be “personalized” in the sense that the patient will get more precise estimates of their likelihood of survival, recurrence etc. This means that patients and physicians will increasingly need to think about/make decisions with/act on information about risks. But communicating and acting on risk is a notoriously challenging problem; personalized medicine will dramatically raise the importance of <a href="http://understandinguncertainty.org/">understanding uncertainty</a>.</li>
<li><strong>Individualized health/personalized medicine is a population-level treatment.</strong> Assuming that the 80% lifetime risk estimate was correct for Angelina Jolie, it still means there is a 1 in 5 chance she was never going to develop breast cancer. If that had been her case, then the surgery was unnecessary. So while her decision was based on personal information, there is still uncertainty in that decision for her. So the “personal” decision may not always be the “best” decision for any specific individual. It may, however, be the best thing to do for everyone in a population with the same characteristics.</li>
</ol>
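<p>To make the second point concrete, here is a minimal arithmetic sketch. It takes the 80% lifetime risk at face value and applies it to a hypothetical group of 1,000 patients who all received the same estimate; the group size and the function name are assumptions made for illustration only.</p>

```python
def expected_outcomes(n_patients, lifetime_risk):
    """Split a group sharing one risk estimate into expected cases and non-cases.

    Hypothetical helper: it only multiplies the group size by the risk.
    """
    would_develop = n_patients * lifetime_risk
    would_not_develop = n_patients - would_develop
    return would_develop, would_not_develop


# Assumed inputs: 1,000 patients, each told their lifetime risk is 80%.
cases, non_cases = expected_outcomes(1000, 0.80)
print(f"expected to develop the disease: {cases:.0f}")
print(f"expected never to develop it:    {non_cases:.0f}")
```

<p>Under these assumptions roughly one patient in five would never have developed the disease, yet nothing in the risk estimate says which ones. That is the sense in which the decision is evaluated at the level of the population rather than the individual.</p>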
Why not have a "future of the field" session at a conference with only young speakers?
2013-06-11T10:17:46+00:00
http://simplystats.github.io/2013/06/11/why-not-have-a-future-of-the-field-session-at-a-conference-with-only-young-speakers
<p>I’m in the process of trying to get together a couple of sessions to submit to ENAR 2014. I’m pretty psyched about the topics and am looking forward to hosting the conference in Baltimore. It is pretty awesome to have one of the bigger stats conferences on our home turf and we are going to try to be well represented at the conference.</p>
<p>While putting the sessions together I’ve been thinking about what my favorite characteristics of sessions at stats conferences are. Alyssa has a <a href="http://alyssafrazee.wordpress.com/2013/03/18/ideas-for-super-awesome-conferences/">few suggestions</a> for speakers which I’m completely in agreement with, but I’m talking about whole sessions. Since statistics is often concerned primarily with precision/accuracy the talks tend to be a little bit technical and sometimes dry. Even on topics I really am excited about, people try not to exaggerate. I think overall this is a great quality, but I’d <a href="http://www.hilarymason.com/speaking/speaking-entertain-dont-teach/">also like to be entertained</a> at a conference. I realized that one of my favorite kinds of sessions is the “future of statistics” session.</p>
<p>My only problem is that future of the field talks are always given by luminaries who have a lot of experience. This isn’t surprising, since (1) they are famous and their names are a big draw, (2) they have made lots of interesting/unique contributions, and (3) they are established so they don’t have to worry about being a little imprecise.</p>
<p>But I’d love to see a “future of the field” session with only people who are students/postdocs/first year assistant professors. These are the people who will really <em>be</em> the future of the field and are often more on top of new trends. It would be so cool to see four or five of the most creative young people in the field making bold predictions about where we will go as a discipline. Then you could have one senior person discuss the talks and give some perspective on how realistic the visions would be in light of past experience.</p>
<p>Tell me that wouldn’t be an awesome conference session.</p>
Sunday data/statistics link roundup (6/2/13)
2013-06-02T21:53:23+00:00
http://simplystats.github.io/2013/06/02/sunday-datastatistics-link-roundup-6213
<ol>
<li>Awesome, a <a href="https://plot.ly/plot">GUI for d3 graphs</a>. Via John M.</li>
<li>Tom L. on <a href="http://researchmatters.blogs.census.gov/2013/05/30/statistics-matter/">why statistics matter</a>, especially at the <a href="http://www.census.gov/research/">Census</a>!</li>
<li>I’ve been spending the last several weeks house hunting like crazy, so the idea of data on schools is high on my mind right now. So this link to <a href="http://greatergreatereducation.org/post/18992/osse-releases-more-school-data-on-students-neighborhoods/?utm_source=feedly">school data on students’ neighborhoods</a> seemed particularly interesting (via Rafa).</li>
<li><a href="http://www.nbcnews.com/technology/students-self-driving-car-tech-wins-intel-science-fair-1C9977186">A student dramatically reduces the cost</a> of the self-driving car. The big technological breakthrough? Sampling! (via Marginal Revolution).</li>
</ol>
What statistics should do about big data: problem forward not solution backward
2013-05-29T17:59:07+00:00
http://simplystats.github.io/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward
<p>There has been a lot of discussion among statisticians about big data and what statistics should do to get involved. Recently <a href="http://normaldeviate.wordpress.com/2013/05/28/steve-marron-on-big-data/">Steve M. and Larry W.</a> took up the same issue on their blog. I have been thinking about this for a while, since I work in genomics, which almost always comes with “big data”. It is also one area of big data where statistics and statisticians have played a huge role.</p>
<p>A question that naturally arises is, “why have statisticians been so successful in genomics?” I think a major reason is the phrase I borrowed from <a href="http://www.bcaffo.com/">Brian C.</a> (who may have borrowed it from <a href="http://www.biostat.ucla.edu/Directory/Brookmeyer">Ron B.</a>):</p>
<blockquote>
<p>problem first, not solution backward</p>
</blockquote>
<p>One of the reasons that “big data” is even a term is that data are less expensive than they were a few years ago. One example is the dramatic drop in the price of <a href="http://genomebiology.com/2010/11/5/207">DNA-sequencing</a>. But there are many, many more examples. The quantified self movement and Fitbits, Google Books, social network data from Twitter, etc. are all areas where data that cost us a huge amount to collect 10 years ago can now be collected and stored very cheaply.</p>
<p>As statisticians we look for generalizable principles; I would say that you have to zoom pretty far out to generalize from social networks to genomics but here are two:</p>
<ol>
<li>The data can’t be easily analyzed in an R session on a simple laptop (say low Gigs to Terabytes)</li>
<li>The data are generally quirky and messy (unstructured text, json files with lots of missing data, fastq files with quality metrics, etc.)</li>
</ol>
<p>So how does one end up at the “leading edge” of big data? By being willing to <a href="http://simplystatistics.org/2012/05/28/schlep-blindness-in-statistics/">deal with the schlep</a> and work out the nitty-gritty of how you apply even standard methods to data sets where taking the mean takes hours. Or by taking the time to learn all the kinks that are specific to, say, how one processes a microarray, and then taking the time to fix them. This is why statisticians were so successful in genomics: they focused on the practical problems, and this gave them access to data no one else had or could use properly.</p>
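<p>As a small illustration of what applying a standard method to data that do not fit in memory can look like, here is a sketch that computes a mean one chunk at a time. The chunk reader, the chunk size, and the toy data are all assumptions made for the example; in practice the chunks would come from a file or a database rather than from a list.</p>

```python
from typing import Iterable, Iterator, List


def read_in_chunks(values: List[float], chunk_size: int) -> Iterator[List[float]]:
    """Hypothetical stand-in for reading a large file piece by piece."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def streaming_mean(chunks: Iterable[List[float]]) -> float:
    """Compute the mean while keeping only a running sum and a count."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot take the mean of zero observations")
    return total / count


# Toy data standing in for a data set too large to load at once.
data = [float(x) for x in range(1, 11)]
print(streaming_mean(read_in_chunks(data, chunk_size=3)))
```

<p>The statistics here are trivial; the work is in the plumbing around them, which is the point of the paragraph above.</p>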
<p>Doing these things requires a lot of effort that isn’t elegant. It also isn’t “statistics” by the definition that only mathematical methodology is statistics. Steve alludes to this in his post when he says:</p>
<blockquote>
<p>Frankly I am a little disappointed that there does not seem to be any really compelling new idea (e.g. as in neural nets or the kernel embedding idea that drove machine learning).</p>
</blockquote>
<p>I think this is a view shared by many statisticians. That since there isn’t a new elegant theory yet, there aren’t “new ideas” in big data. That focus is solution backward. We want an elegant theory that we can then apply to specific problems if they happen to come up.</p>
<p>The alternative is problem forward. The fact that we can collect data so cheaply means we can measure and study things we never could before. Computer scientists, physicists, genome biologists, and others are leading in big data precisely because they aren’t thinking about the statistical solution. They are thinking about solving an important scientific problem and are willing to deal with all the dirty details to get there. This allows them to work on data sets and problems that haven’t been considered by other people.</p>
<p>In genomics, this has happened before. In that case, the invention of microarrays revolutionized the field and statisticians jumped on board, working closely with scientists, handling the dirty details, and <a href="http://www.bioconductor.org/">building software so others could too</a>. As a discipline if we want to be part of the “big data” revolution I think we need to focus on the scientific problems and let methodology come second. That requires a rethinking of what it means to be statistics. Things like parallel computing, data munging, reproducibility, and software development have to be accepted as equally important to methods development.</p>
<p>The good news is that there is plenty of room for statisticians to bring our unique skills in dealing with uncertainty to these new problems; but we will only get a seat at the table if we are willing to deal with <a href="http://simplystatistics.org/2012/06/22/statistics-and-the-science-club/">the mess that comes with doing real science</a>.</p>
<p>I’ll close by listing a few things I’d love to see:</p>
<ol>
<li>A Bioconductor-like project for social network data. Tyler M. and Ali S. <a href="http://www.csss.washington.edu/Papers/wp127.pdf">have a paper</a> that would make for an awesome package for this project.</li>
<li><a href="http://smart-stats.org/">Statistical pre-processing</a> for fMRI and other brain imaging data. Keep an eye on our smart group for that.</li>
<li>Data visualization for translational applications, dealing with all the niceties of human-data interfaces. See <a href="http://healthvis.org/">healthvis</a> or the stuff <a href="http://www.cs.utah.edu/~miriah/">Miriah Meyer</a> is doing.</li>
<li>Most importantly, starting with specific, unsolved scientific problems. Seeking novel ways to collect cheap data, and analyzing them, even with known and straightforward statistical methods to deepen our understanding about ourselves or the universe.</li>
</ol>
Sunday data/statistics link roundup (5/19/2013)
2013-05-19T12:01:52+00:00
http://simplystats.github.io/2013/05/19/sunday-datastatistics-link-roundup-5192013
<ol>
<li>This is a <a href="http://camdp.com/blogs/21st-century-problems">post</a> on 20th versus 21st century problems and the rise of the importance of empirical science. I particularly like the discussion of what it means to be a “solved” problem and how that has changed.</li>
<li><a href="http://www.sciencemag.org/content/340/6134/787.full">A discussion</a> in Science about the (arguably) most important statistics among academics, the impact factor and h-index. This comes on the heels of the <a href="http://am.ascb.org/dora/">San Francisco Declaration on Research Assessment</a>. I like the idea that we should focus on evaluating science for its own merit rather than focusing on summaries like impact factor. But I worry that the “gaming” people are worried about with quantitative numbers like IF will be replaced with “politicking” if it becomes too qualitative. (via Rafa)</li>
<li>A <a href="http://blogs.telegraph.co.uk/news/tomchiversscience/100217094/depressing-just-nine-per-cent-of-britons-trust-stats-over-our-own-experience-though-most-of-us-wont-believe-that/">write-up</a> about a survey in Britain that suggests people don’t believe statistics (surprise!). I think this is symptomatic of a bigger issue which is being raised over and over. In an era when scientific problems don’t have deterministic solutions, how do we determine if a problem has been solved? There is no good answer for this yet and it threatens to undermine a major fraction of the scientific enterprise going forward.</li>
<li>Businesses are confusing <a href="http://qz.com/81661/most-data-isnt-big-and-businesses-are-wasting-money-pretending-it-is/">data analysis and big data</a>. This is so important and true. Big data infrastructure is often critical for creating/running data products. But discovering new ideas from data often happens on much smaller data sets with good intuition and interactive data analysis.</li>
<li><a href="http://www.nytimes.com/2013/05/19/sports/topps-changes-baseball-card-numbering-to-criticism.html?_r=1&">Really interesting article</a> about how the baseball card numbering system matters and how changing it can upset collectors (via Chris V.).</li>
</ol>
When does replication reveal fraud?
2013-05-17T09:32:01+00:00
http://simplystats.github.io/2013/05/17/when-does-replication-reveal-fraud
<p>Here’s a little thought experiment for your weekend pleasure. Consider the following:</p>
<p>Joe Scientist decides to conduct a study (call it Study A) to test the hypothesis that a parameter <em>D</em> > 0 vs. the null hypothesis that <em>D</em> = 0. He designs a study, collects some data, conducts an appropriate statistical analysis and concludes that <em>D</em> > 0. This result is published in the Journal of Awesome Results along with all the details of how the study was done.</p>
<p>Jane Scientist finds Joe’s study very interesting and tries to replicate his findings. She conducts a study (call it Study B) that is similar to Study A but completely independent of it (and does not communicate with Joe). In her analysis she does not find strong evidence that <em>D</em> > 0 and concludes that she cannot rule out the possibility that <em>D</em> = 0. She publishes her findings in the Journal of Null Results along with all the details.</p>
<p>From these two studies, which of the following conclusions can we make?</p>
<ol>
<li>Study A is obviously a fraud. If the truth were that <em>D</em> > 0, then Jane should have concluded that <em>D</em> > 0 in her independent replication.</li>
<li>Study B is obviously a fraud. If Study A were conducted properly, then Jane should have reached the same conclusion.</li>
<li>Neither Study A nor Study B was a fraud, but the result for Study A was a Type I error, i.e. a false positive.</li>
<li>Neither Study A nor Study B was a fraud, but the result for Study B was a Type II error, i.e. a false negative.</li>
</ol>
<p>I realize that there are a number of subtle details concerning why things might happen but I’ve purposely left them out. My question is, based on the information that you <em>actually have</em> about the two studies, what would you consider to be the most likely case? What further information would you like to know beyond what was given here?</p>
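<p>One way to compare the last two options, under assumptions that are not part of the thought experiment: suppose neither study is fraudulent, both tests use the same significance level, and we are willing to put a prior probability on <em>D</em> > 0. Then Bayes’ rule gives the probability that <em>D</em> > 0 given that Study A rejected the null and Study B did not. The function name and every number below are hypothetical, chosen only to show how much the answer depends on information the two papers would need to report (power, significance level, prior plausibility).</p>

```python
def prob_effect_given_split(prior, alpha, power_a, power_b):
    """P(D > 0 | Study A rejects the null, Study B does not), assuming no fraud.

    prior   : assumed prior probability that D > 0
    alpha   : Type I error rate used by both studies
    power_a : probability Study A rejects when D > 0
    power_b : probability Study B rejects when D > 0
    """
    # If D = 0: A made a Type I error and B correctly failed to reject.
    p_split_if_null = alpha * (1 - alpha)
    # If D > 0: A correctly rejected and B made a Type II error.
    p_split_if_effect = power_a * (1 - power_b)
    numerator = prior * p_split_if_effect
    return numerator / (numerator + (1 - prior) * p_split_if_null)


# Hypothetical scenarios: same test level, different priors and power.
for prior, power in [(0.5, 0.8), (0.1, 0.8), (0.5, 0.3)]:
    p = prob_effect_given_split(prior, alpha=0.05, power_a=power, power_b=power)
    print(f"prior={prior}, power={power}: P(D > 0 | A rejects, B does not) = {p:.2f}")
```

<p>With these made-up inputs the answer moves a lot as the prior and the power change, which suggests that power and prior plausibility are among the first pieces of further information worth asking for.</p>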
The bright future of applied statistics
2013-05-15T10:00:33+00:00
http://simplystats.github.io/2013/05/15/the-bright-future-of-applied-statistics
<p>In 2013, the Committee of Presidents of Statistical Societies (COPSS) celebrates its 50th Anniversary. As part of its celebration, COPSS will publish a book, with contributions from past recipients of its awards, titled “Past, Present and Future of Statistical Science”. Below is my contribution titled <em>The bright future of applied statistics</em>.</p>
<p>When I was asked to contribute to this issue, titled Past, Present, and Future of Statistical Science, I contemplated my career while deciding what to write about. One aspect that stood out was how much I benefited from the right circumstances. I came to one clear conclusion: it is a great time to be an applied statistician. I decided to describe the aspects of my career that I have thoroughly enjoyed in the <em>past</em> and <em>present</em> and explain why this has led me to believe that the <em>future</em> is bright for applied statisticians.</p>
<p>I became an applied statistician while working with David Brillinger on my PhD thesis. When searching for an advisor I visited several professors and asked them about their interests. David asked me what I liked and all I came up with was “<em>I don’t know. Music?</em>”, to which he responded “<em>That’s what we will work on</em>”. Apart from the necessary theorems to get a PhD from the Statistics Department at Berkeley, my thesis summarized my collaborative work with researchers at the Center for New Music and Audio Technology. The work involved separating and parameterizing the harmonic and non-harmonic components of musical sound signals [<a href="#Xirizarry2001local">5</a>]. The sounds had been digitized into data. The work was indeed fun, but I also had my first glimpse into the incredible potential of statistics in a world becoming more and more data-driven.</p>
<p>Despite having expertise only in music, and a thesis that required a CD player to hear the data, <a href="http://www.biostat.jhsph.edu/~ririzarr/Demo/index.html">fitted models and residuals</a>, I was hired by the Department of Biostatistics at Johns Hopkins School of Public Health. Later I realized what was probably obvious to the School’s leadership: that regardless of the subject matter of my thesis, my time series expertise could be applied to several public health applications [<a href="#Xirizarry2001assessing">8</a>, <a href="#Xdipietro2001cross">2</a>, <a href="#Xcrone2001electrocorticographic">1</a>]. The public health and biomedical challenges surrounding me were simply too hard to resist and my new department knew this. It was inevitable that I would quickly turn into an applied biostatistician.</p>
<p>Since the day that I arrived at Hopkins 15 years ago, Scott Zeger, the department chair, fostered and encouraged faculty to leverage their statistical expertise to make a difference and to have an immediate impact in science. At that time, we were in the midst of a measurement revolution that was transforming several scientific fields into data-driven ones. By being located in a School of Public Health and next to a medical school, we were surrounded by collaborators working in such fields. These included environmental science, neuroscience, cancer biology, genetics, and molecular biology. Much of my work was motivated by collaborations with biologists who, for the first time, were collecting large amounts of data. Biology was changing from a data-poor discipline to a data-intensive one.</p>
<p>A specific example came from the measurement of gene expression. Gene expression is the process where DNA, the blueprint for life, is copied into RNA, the templates for the synthesis of proteins, the building blocks for life. Before microarrays were invented in the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper (see Figure 1A below). With microarrays, this suddenly changed to sifting through tens of thousands of numbers (see Figure 1B). Biologists went from using their eyes to categorize results to having thousands (and now millions) of measurements per sample to analyze. Furthermore, unlike genomic DNA, which is static, gene expression is a dynamic quantity: different tissues express different genes at different levels and at different times. The complexity was exacerbated by unpolished technologies that made measurements much noisier than anticipated. This complexity and level of variability made statistical thinking an important aspect of the analysis. The Biologists that used to say, “if I need statistics, the experiment went wrong” were now seeking out our help. The results of these collaborations have led to, among other things, the development of breast cancer recurrence gene expression assays making it possible to identify patients at risk of distant recurrence following surgery [<a href="#Xvan2002gene">9</a>].</p>
<div class="figure" style="text-align: left;">
<p class="noindent">
<a href="http://simplystatistics.org/2013/05/15/the-bright-future-of-applied-statistics/expression/" rel="attachment wp-att-1329"><img class="alignnone size-full wp-image-1329" alt="expression" src="http://simplystatistics.org/wp-content/uploads/2013/05/expression.jpg" /></a>
</p>
<div class="caption">
Figure 1: Illustration of gene expression data before and after microarrays.
</div>
</div>
<p>When biologists at Hopkins first came to our department for help with their microarray data, Scott put them in touch with me because I had experience with (what was then) large datasets (digitized music signals are represented by 44,100 points per second). The more I learned about the scientific problems and the more data I explored, the more motivated I became. The potential for statisticians having an impact in this nascent field was clear and my department was encouraging me to take the plunge. This institutional encouragement and support was crucial as successfully working in this field made it harder to publish in the mainstream statistical journals; an accomplishment that had traditionally been heavily weighted in the promotion process. The message was clear: having an immediate impact on specific scientific fields would be rewarded as much as mathematically rigorous methods with general applicability.</p>
<p>As with my thesis applications, it was clear that to solve some of the challenges posed by microarray data I would have to learn all about the technology. For this I organized a sabbatical with Terry Speed’s group in Melbourne where they helped me accomplish this goal. During this visit I reaffirmed my preference for attacking applied problems with simple statistical methods, as opposed to overcomplicated ones or developing new techniques. Learning that deciphering clever ways of putting the existing statistical toolbox to work was good enough for an accomplished statistician like Terry gave me the necessary confidence to continue working this way. More than a decade later this continues to be my approach to applied statistics. This approach has been instrumental for some of my current collaborative work. In particular, it led to important new biological discoveries made together with Andy Feinberg’s lab [<a href="#Xirizarry2009human">7</a>].</p>
<p>During my sabbatical we developed preliminary solutions that improved precision and aided in the removal of systematic biases for microarray data [<a href="#Xirizarry2003exploration">6</a>]. I was aware that hundreds, if not thousands, of other scientists were facing the same problematic data and were searching for solutions. Therefore I was also thinking hard about ways in which I could share whatever solutions I developed with others. During this time I received an email from Robert Gentleman asking if I was interested in joining a new software project for the delivery of statistical methods for genomics data. This new collaboration eventually became the <a href="http://www.bioconductor.org">Bioconductor project</a>, which to this day continues to grow its user and developer base [<a href="#Xgentleman2004bioconductor">4</a>]. Bioconductor was the perfect vehicle for having the impact that my department had encouraged me to seek. With Ben Bolstad and others we wrote an R package that has been downloaded tens of thousands of times [<a href="#Xgautier2004affy">3</a>]. Without the availability of software, the statistical method would not have received nearly as much attention. This lesson served me well throughout my career, as developing software packages has greatly helped disseminate my statistical ideas. The fact that my department and school rewarded software publications provided important support.</p>
<p>The impact statisticians have had in genomics is just one example of our field’s accomplishments in the 21st century. In academia, the number of statisticians becoming leaders in fields like environmental sciences, human genetics, genomics, and social sciences continues to grow. Outside of academia, Sabermetrics has become a standard approach in several sports (not just baseball) and inspired the Hollywood movie Moneyball. A PhD Statistician led the team that won the <a href="http://www.netflixprize.com">Netflix million dollar prize</a>. <a href="http://mashable.com/2012/11/07/nate-silver-wins">Nate Silver</a> proved the pundits wrong by once again using statistical models to predict election results almost perfectly. R has become a widely used programming language. It is no surprise that Statistics majors at Harvard have more than <a href="http://nesterko.com/visuals/statconcpred2012-with-dm/">quadrupled since 2000</a> and that statistics MOOCs are among the <a href="http://edudemic.com/2012/12/the-11-most-popular-open-online-courses/">most popular</a>.</p>
<p>The unprecedented advance in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming science. Scientific fields that have traditionally relied upon simple data analysis techniques have been turned on their heads by these technologies. Furthermore, advances such as these have brought about a shift from hypothesis-driven to discovery-driven research. However, interpreting information extracted from these massive and complex datasets requires sophisticated statistical skills as one can easily be fooled by patterns that arise by chance. This has greatly elevated the importance of our discipline in biomedical research.</p>
<p>I think that the data revolution is just getting started. Datasets are currently being, or have already been, collected that contain, hidden in their complexity, important truths waiting to be discovered. These discoveries will increase the scientific understanding of our world. Statisticians should be excited and ready to play an important role in the new scientific renaissance driven by the measurement revolution.</p>
<h2 class="likechapterHead" style="text-align: left;">
<a id="x1-20001"></a>Bibliography
</h2>
<div class="thebibliography">
<p class="bibitem" style="text-align: left;">
[1] <a id="Xcrone2001electrocorticographic"></a>NE Crone, L Hao, J Hart, D Boatman, RP Lesser, R Irizarry, and B Gordon. Electrocorticographic gamma activity during word production in spoken and sign language. Neurology, 57(11):2045–2053, 2001.
</p>
<p class="bibitem" style="text-align: left;">
[2] <a id="Xdipietro2001cross"></a>Janet A DiPietro, Rafael A Irizarry, Melissa Hawkins, Kathleen A Costigan, and Eva K Pressman. Cross-correlation of fetal cardiac and somatic activity as an indicator of antenatal neural development. American Journal of Obstetrics and Gynecology, 185(6):1421–1428, 2001.
</p>
<p class="bibitem" style="text-align: left;">
[3] <a id="Xgautier2004affy"></a>Laurent Gautier, Leslie Cope, Benjamin M Bolstad, and Rafael A Irizarry. affy: analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20(3):307–315, 2004.
</p>
<p class="bibitem" style="text-align: left;">
[4] <a id="Xgentleman2004bioconductor"></a>Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5(10):R80, 2004.
</p>
<p class="bibitem" style="text-align: left;">
[5] <a id="Xirizarry2001local"></a>Rafael A Irizarry. Local harmonic estimation in musical sound signals. Journal of the American Statistical Association, 96(454):357–367, 2001.
</p>
<p class="bibitem" style="text-align: left;">
[6] <a id="Xirizarry2003exploration"></a>Rafael A Irizarry, Bridget Hobbs, Francois Collin, Yasmin D Beazer-Barclay, Kristen J Antonellis, Uwe Scherf, and Terence P Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2):249–264, 2003.
</p>
<p class="bibitem" style="text-align: left;">
[7] <a id="Xirizarry2009human"></a>Rafael A Irizarry, Christine Ladd-Acosta, Bo Wen, Zhijin Wu, Carolina Montano, Patrick Onyango, Hengmi Cui, Kevin Gabo, Michael Rongione, Maree Webster, et al. The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nature Genetics, 41(2):178–186, 2009.
</p>
<p class="bibitem" style="text-align: left;">
[8] <a id="Xirizarry2001assessing"></a>Rafael A Irizarry, Clarke Tankersley, Robert Frank, and Susan Flanders. Assessing homeostasis through circadian patterns. Biometrics, 57(4):1228–1237, 2001.
</p>
<p class="bibitem" style="text-align: left;">
[9] <a id="Xvan2002gene"></a>Laura J van’t Veer, Hongyue Dai, Marc J Van De Vijver, Yudong D He, Augustinus AM Hart, Mao Mao, Hans L Peterse, Karin van der Kooy, Matthew J Marton, Anke T Witteveen, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):530–536, 2002.
</p>
</div>
Sunday data/statistics link roundup (5/12/2013, Mother's Day!)
2013-05-12T22:29:17+00:00
http://simplystats.github.io/2013/05/12/sunday-datastatistics-link-roundup-5122013-mothers-day
<ol>
<li><span style="line-height: 16px;">A tutorial on <a href="http://deeplearning.net/tutorial/">deep-learning</a>. I really enjoyed reading it, but I’m still trying to figure out how this is different from using non-linear logistic regression to estimate features and then doing supervised prediction with those features. Or maybe I’m just naive….</span></li>
<li>Rafa on <a href="http://www.80grados.net/la-importancia-de-la-autonomia-politica-para-las-ciencias/">political autonomy for science</a> for a blog in PR called <a href="http://www.80grados.net/">80 grados. </a> He writes about Rep. Lamar Smith and then focuses more closely on issues related to the University of Puerto Rico. A very nice read. (via Rafa)</li>
<li><a href="http://deadspin.com/infographic-is-your-states-highest-paid-employee-a-co-489635228">Highest paid employees by state</a>. I should have coached football…</li>
<li><a href="http://www.motherjones.com/kevin-drum/2013/05/groundbreaking-isaac-newton-invention-youve-never-heard">Newton took the mean.</a> It warms my empirical heart to hear about how the theoretical result was backed up by averaging (via David S.)</li>
<li>Reinhart and Rogoff <a href="http://www.cnbc.com/id/100721630">publish a correction but stand by their original claims</a>. I’m not sure whether this is a good or a bad thing. But it definitely is an overall win for reproducibility.</li>
<li>Statesy folks are getting some much-deserved attention. Terry Speed is a <a href="http://royalsociety.org/people/terence-speed/">Fellow of the Royal Society</a>, Peter Hall is a <a href="http://www.nasonline.org/news-and-multimedia/news/2013_04_30_NAS_Election.html">foreign associate of the NAS</a>, Gareth Roberts is <a href="http://royalsociety.org/people/gareth-roberts/">also a Fellow of the Royal Society</a> (via Peter H.)</li>
<li><a href="http://www.nytimes.com/2013/05/06/business/media/solving-equation-of-a-hit-film-script-with-data.html?src=rechp&_r=1&">Statisticians go to the movies </a>and <a href="http://well.blogs.nytimes.com/2013/05/08/are-hot-hands-in-sports-for-real/">the hot hand analysis makes the NY Times</a> (via Dan S.)</li>
</ol>
<p><strong>Bonus Link! </strong> Karl B.’s Github <a href="http://www.statsblogs.com/2013/05/10/tutorials-on-gitgithub-and-gnu-make/">tutorial is awesome</a> and every statistician should be required to read it. I only ask why he gives all the love to Nacho’s admittedly awesome <a href="https://github.com/nachocab/clickme">Clickme package</a> and no love to <a href="http://healthvis.org/">healthvis</a>, we are on <a href="https://github.com/hcorrada/healthvis">Github too</a>!</p>
A Shiny web app to find out how much medical procedures cost in your state.
2013-05-08T17:09:08+00:00
http://simplystats.github.io/2013/05/08/a-shiny-web-app-to-find-out-how-much-medical-procedures-cost-in-your-state
<p>Today the <a href="http://www.huffingtonpost.com/2013/05/08/hospital-prices-cost-differences_n_3232678.html">front page of the Huffington Post featured</a> the <a href="https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/index.html">Medicare provider charge data from CMS</a>, which show the cost of many popular procedures broken down by hospital. We here at Simply Statistics think you should be able to explore these data more easily. So we asked <a href="http://biostat.jhsph.edu/~jmuschel/">John Muschelli</a> to help us build a Shiny App that allows you to interact with these data. You can choose your state and your procedure and see how much the procedure costs at hospitals in your state. It takes a second to load because it is a lot of data….</p>
<p><a href="http://glimmer.rstudio.com/muschellij2/Shiny_Health_Data/">Here is the link to the app. </a></p>
<p>Here are some screenshots for intracranial hemorrhage for the US and for Idaho.</p>
<p><a href="http://simplystatistics.org/2013/05/08/a-shiny-web-app-to-find-out-how-much-medical-procedures-cost-in-your-state/screen-shot-2013-05-08-at-4-57-56-pm/" rel="attachment wp-att-1317"><img class="alignnone size-full wp-image-1317" alt="Screen Shot 2013-05-08 at 4.57.56 PM" src="http://simplystatistics.org/wp-content/uploads/2013/05/Screen-Shot-2013-05-08-at-4.57.56-PM.png" width="516" height="439" srcset="http://simplystatistics.org/wp-content/uploads/2013/05/Screen-Shot-2013-05-08-at-4.57.56-PM-300x255.png 300w, http://simplystatistics.org/wp-content/uploads/2013/05/Screen-Shot-2013-05-08-at-4.57.56-PM.png 516w" sizes="(max-width: 516px) 100vw, 516px" /></a><a href="http://simplystatistics.org/2013/05/08/a-shiny-web-app-to-find-out-how-much-medical-procedures-cost-in-your-state/screen-shot-2013-05-08-at-4-58-09-pm/" rel="attachment wp-att-1318"><img class="alignnone size-full wp-image-1318" alt="Screen Shot 2013-05-08 at 4.58.09 PM" src="http://simplystatistics.org/wp-content/uploads/2013/05/Screen-Shot-2013-05-08-at-4.58.09-PM.png" width="549" height="460" srcset="http://simplystatistics.org/wp-content/uploads/2013/05/Screen-Shot-2013-05-08-at-4.58.09-PM-300x251.png 300w, http://simplystatistics.org/wp-content/uploads/2013/05/Screen-Shot-2013-05-08-at-4.58.09-PM.png 549w" sizes="(max-width: 549px) 100vw, 549px" /></a></p>
<p><a href="https://github.com/muschellij2/Shiny_Health_Data">The R code is here</a> if you want to tweak/modify.</p>
Why the current over-pessimism about science is the perfect confirmation bias vehicle and we should proceed rationally
2013-05-06T14:30:41+00:00
http://simplystats.github.io/2013/05/06/why-the-current-over-pessimism-about-science-is-the-perfect-confirmation-bias-vehicle-and-we-should-proceed-rationally
<p>Recently there have been some high profile flameouts in scientific research. A few examples include <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">the Duke saga</a>, <a href="http://simplystatistics.org/2012/07/03/replication-and-validation-in-omics-studies-just-as/">the replication issues in social sciences</a>, <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704">p-value hacking</a>, <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2114571&http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2114571">fabricated data</a>, <a href="http://www.michaeleisen.org/blog/?p=1312">not enough open-access publication</a>, and on and on.</p>
<p>Some of these results have had major non-scientific consequences, which is the reason they have drawn so much attention both inside and outside of the academic community. For example, the Duke saga <a href="http://www.nytimes.com/2011/07/08/health/research/08genes.html?_r=0">was covered in the New York Times</a>, the lack of replication has led to high-profile arguments between scientists in <a href="http://blogs.discovermagazine.com/notrocketscience/?p=7765#.UYfhJitKnKo">Discover</a> and <a href="http://www.nature.com/news/replication-studies-bad-copy-1.10634">Nature</a> among other outlets, and the <a href="http://www.businessinsider.com/why-the-reinhart-rogoff-excel-debacle-could-be-devastating-for-the-austerity-movement-2013-4">Reinhart-Rogoff analysis has been widely criticized</a> (sometimes <a href="http://www.colbertnation.com/the-colbert-report-videos/425748/april-23-2013/austerity-s-spreadsheet-error">comically</a>) <a href="http://simplystatistics.org/2013/04/19/podcast-7-reinhart-rogoff-reproducibility/">because of a lack of reproducibility</a>.</p>
<p>The result of this high-profile attention is that there is a movement on to “<a href="http://www.newyorker.com/online/blogs/newsdesk/2012/12/cleaning-up-science.html">clean up science</a>”. As has been pointed out, there is a group of scientists who are making names for themselves primarily as critics of what is wrong with the scientific process. The good news is that these key players are calling attention to issues: reproducibility, replicability, and open access, among others, that are critically important for the scientific enterprise.</p>
<p>I too am concerned about these issues and have altered my own research process to try to address them for my own research group. I also think that the solutions others have proposed on a larger scale like <a href="http://www.alltrials.net/">alltrials.net</a> or <a href="http://www.plos.org/">PLoS</a> are great advances for the scientific community.</p>
<p>I am also very worried that people are using a few high-profile cases to hyperventilate about the real, solvable, and recognized problems in the scientific process. These people get credit and a lot of attention for pointing out how science is “failing”. But they aren’t giving proportional time to all of the incredible success stories we have had, both in performing research and in reforming research with reproducibility, open access, and replication initiatives.</p>
<p>We should recognize that science is hard and even dedicated, diligent, and honest scientists will make mistakes, perform irreproducible or irreplicable studies, or publish in closed access journals. Sometimes this is because of ignorance of good research principles, sometimes it is because people are new to working in a world where data/computation are a major player, and sometimes it is because it is legitimately, really hard to make real advances in science. I think people who participate in real science recognize these problems and are eager to solve them. I also have noticed that real scientists generally try to propose a solution when they complain about these issues.</p>
<p>But it seems like sometimes people use these high-profile mistakes out of context to push their own scientific pet peeves. For example:</p>
<ol>
<li><strong>I don’t like p-values and there are lots of results that fail to replicate so it must be the fault of p-values.</strong> Many studies fail to replicate not because the researchers used p-values, but because they performed studies that were either weak or had poorly understood scientific mechanisms.</li>
<li><strong>I don’t like not being able to access people’s code so lack of reproducibility is causing science to fail. </strong>Even in the two most infamous cases (Potti and Reinhart - Rogoff) the problem with the science wasn’t reproducibility - it was that the analysis was incorrect/flawed. Reproducibility compounded the problem but wasn’t the root cause of the problem.</li>
<li><strong>I don’t like not being able to access scientific papers so closed-access journals are evil. </strong>For whatever reason (I don’t know if I understand why) it is expensive to publish journals. That much is clear, because <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">publishing open access is expensive</a> and closed access journals are expensive. If I’m a junior researcher, I’ll definitely post my preprints online, but I also want papers in “good” journals and don’t have a ton of grant money, so sometimes I’ll choose closed access.</li>
<li><strong>I don’t like these crazy headlines from social psychology (substitute other field here) and there have been some that haven’t replicated, so none must replicate. </strong>Of course some papers won’t replicate, including even high profile papers. If you are doing statistics, then by definition some papers won’t replicate since you have to make a decision on noisy data.</li>
</ol>
<p>These are just a few examples where I feel like a basic, fixable flaw in science has been used to justify a hugely pessimistic view of science in general. I’m not saying it is all rainbows and unicorns. Of course we want to improve the process. But I’m worried that the rational, reasonable problems we have, with enough hyperbole, will make it look like the “sky is falling” on the scientific process and will leave the door open for individuals like Rep. Lamar Smith to come in and <a href="http://www.huffingtonpost.com/2013/04/30/lamar-smith-science-peer-review_n_3189107.html?utm_hp_ref=politics">turn the scientific process into a political one</a>.</p>
<p>P.S. <a href="http://andrewgelman.com/2013/05/06/against-optimism-about-social-science/#more-18943">Andrew Gelman</a> posted on a similar topic yesterday as well. He argues the case for less optimism and for making sure we don’t stay complacent. He added a P.S. and mentioned two points on which we can agree: (1) science is hard and is a human system and we are working to fix the flaws inherent in such systems and (2) that it is still easier to publish a splashy claim than to publish a correction. I do definitely agree with both. I think Gelman would also likely agree that we need to be careful about <a href="http://simplystatistics.org/2013/04/30/reproducibility-and-reciprocity/">reciprocity</a> with these issues. If earnest scientists work hard to address reproducibility, replicability, open access, etc. then people who criticize them should have to work just as hard to justify their critiques. Just because it is a critique doesn’t mean it should automatically get the same treatment as the original paper.</p>
Talking about MOOCs on MPT Direct Connection
2013-05-06T09:01:06+00:00
http://simplystats.github.io/2013/05/06/talking-about-moocs-on-mpt-direct-connection
<p style="font-size: 11px; font-family: Arial, Helvetica, sans-serif; color: #808080; margin-top: 5px; background: transparent; text-align: center; width: 512px;">
Watch <a style="text-decoration: none !important; font-weight: normal !important; height: 13px; color: #4eb2fe !important;" href="http://video.mpt.tv/video/2365006588" target="_blank">Monday, April 29, 2013</a> on PBS. See more from <a style="text-decoration: none !important; font-weight: normal !important; height: 13px; color: #4eb2fe !important;" href="http://www.mpt.org/dc" target="_blank">Direct Connection.</a>
</p>
<p>I appeared on Maryland Public Television’s Direct Connection with Jeff Salkin last Monday to talk about MOOCs (along with our Dean Mike Klag).</p>
Reproducibility at Nature
2013-05-02T17:22:32+00:00
http://simplystats.github.io/2013/05/02/reproducibility-at-nature
<p>Nature has jumped on to the reproducibility bandwagon and has <a href="http://www.nature.com/news/announcement-reducing-our-irreproducibility-1.12852">announced</a> a new approach to improving reproducibility of submitted papers. The new effort is focused primarily on methodology, including statistics, and on making sure that it is clear what an author has done.</p>
<blockquote>
<p>To ease the interpretation and improve the reliability of published results we will more systematically ensure that key methodological details are reported, and we will give more space to methods sections. We will examine statistics more closely and encourage authors to be transparent, for example by including their raw data.</p>
</blockquote>
<p>To this end they have created a <a href="http://www.nature.com/authors/policies/checklist.pdf">checklist</a> for highlighting key aspects that need to be clear in the manuscript. A number of these points are statistical, and two specifically highlight data deposition and computer code availability. I think an important change is the following:</p>
<blockquote>
<p>To allow authors to describe their experimental design and methods in as much detail as necessary, the participating journals, including <em>Nature</em>, will abolish space restrictions on the methods section.</p>
</blockquote>
<p>I think this is particularly important because of the message it sends. Most journals have overall space limitations and some journals even have specific limits on the Methods section. This sends a clear message that “methods aren’t important, results are”. Removing space limits on the Methods section will allow people to just say what they actually did, rather than figure out some tortured way to summarize years of work into a smattering of key words.</p>
<p>I think this is a great step forward by a leading journal. The next step will be for Nature to stick to it and make sure that authors live up to their end of the bargain.</p>
Reproducibility and reciprocity
2013-04-30T09:58:47+00:00
http://simplystats.github.io/2013/04/30/reproducibility-and-reciprocity
<p>One element of the entire discussion about reproducible research that I haven’t seen talked about very much is the potential for a lack of reciprocity. I think even if scientists were not concerned about the possibility of getting scooped by others by making their data/code available, this issue would be sufficient to give people pause about making their work reproducible.</p>
<p>What do I mean by reciprocity? Consider the following (made up) scenario:</p>
<ol>
<li>I conduct a study (say, a randomized controlled trial, for concreteness) that I register at clinicaltrials.gov beforehand and specify details about the study like the design, purpose, and primary and secondary outcomes.</li>
<li>I rigorously conduct the study, ensuring safety and privacy of subjects, collect the data, and analyze the data.</li>
<li>I publish the results for the primary and secondary outcomes in the peer-reviewed literature where I describe how the study was conducted and the statistical methods that were used. For the sake of concreteness, let’s say the results were “significant” by whatever definition of significant you care to use and that the paper was highly influential.</li>
<li>Along with publishing the paper I make the analytic dataset and computer code available so that others can look at what I did and, if they want, reproduce the result.</li>
</ol>
<p>So far so good right? It seems this would be a great result for any study. Now consider the following possible scenarios:</p>
<ol>
<li>Someone obtains the data and the code from the web site where it is hosted, analyzes it, and then publishes a note claiming that the intervention negatively affected a different outcome not described in the original study (i.e. not one of the primary or secondary outcomes).</li>
<li>A second person obtains the data, analyzes it, and then publishes a note on the web claiming that the intervention was ineffective for the primary outcome in the subset of participants that were male.</li>
<li>A third person obtains the data, analyzes the data, and then publishes a note on the web saying that the study is flawed and that the original results of the paper are incorrect. No code, data, or details of their methods are given.</li>
</ol>
<p>Now, how should one react to the follow-up note claiming the study was flawed? It’s easy to imagine a spectrum of possible responses ranging from accusations of fraud to staunch defenses of the original study. Because the original study was influential, there is likely to be a kerfuffle either way.</p>
<p>But what’s the problem with the three follow-up scenarios described? The one thing that they have in common is that none of the three responding people were subjected to the same standards to which the original investigator (me) was subjected. I was required to register my trial and state the outcomes in advance. In an ideal world you might argue I should have stated my hypotheses in advance too. That’s fine, but the point is that the people analyzing the data subsequently were not required to do any of this. Why should they be held to a lower standard of scrutiny?</p>
<p>The first person analyzed a different outcome that was not a primary or secondary outcome. How many outcomes did they test before they came to that one negatively significant one? The second person examined a subset of the participants. Was the study designed (or powered) to look at this subset? Probably not. The third person claims fraud, but does not provide any details of what they did.</p>
<p>I think it’s easy to take care of the third person–just require that they make their work reproducible too. That way we can all see what they did and verify that there was in fact fraud. But the first two people are a little more difficult. If there are no barriers to obtaining the data, then they can just get the data and run a bunch of analyses. If the results don’t go their way, they can just move on and no one would be the wiser. If they do, they can try to publish something.</p>
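<p>To put a rough number on the “how many outcomes did they test” question, here is a minimal Python sketch. It assumes every test is run at the same significance level and that the tests are independent, which outcomes measured in a single trial usually are not, so treat the output as a back-of-the-envelope guide. The function name and the 0.05 level are illustrative choices, not something from the scenario above.</p>

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance of one or more 'significant' results when no effect is real.

    Assumes independent tests, each run at level `alpha` (an illustrative
    simplification; real outcomes from one study are typically correlated).
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# How quickly the chance of a spurious finding grows with unreported tests.
for k in (1, 5, 20, 100):
    print(k, "tests:", round(prob_at_least_one_false_positive(k), 2))
```

<p>Under these assumptions, someone who quietly tries 20 outcomes has roughly a 64% chance of finding at least one that looks significant even if the intervention affects none of them.</p>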
<p>What I think a good reproducibility policy should have is a type of “viral” clause. For example, the GNU General Public License (GPL) is an open source software license that requires, among other things, that anyone who writes their own software, but links to or integrates software covered under the GPL, must publish their software under the GPL too. This “viral” requirement ensures that people cannot make use of the efforts of the open source community without also giving back to that community. There have been numerous heated discussions in the software community regarding the pros and cons of such a clause, with (large) commercial software developers often coming down against it. Open source developers have largely been skeptical of the arguments of large commercial developers, claiming that those companies simply want to “steal” open source software and/or maintain their dominance.</p>
<p>I think it is important that, if we are going to make reproducibility the norm in science, we have analogous “viral” clauses to ensure that everyone is held to the same standard. This is particularly important in policy-relevant or in politically sensitive subject areas where there are often parties involved who have essentially no interest (and are in fact paid to have no interest) in holding themselves to the same standard of scientific conduct.</p>
<p>Richard Stallman was right to assume that without the <a href="http://en.wikipedia.org/wiki/Copyleft">copyleft clause</a> in the GPL that large commercial interests would simply usurp the work of the free software community and essentially crush it before it got started. Reproducibility needs its own version of copyleft or else scientists will be left to defend themselves against unscrupulous individuals who are not held to the same standard.</p>
Sunday data/statistics link roundup (4/28/2013)
2013-04-28T22:31:21+00:00
http://simplystats.github.io/2013/04/28/sunday-datastatistics-link-roundup-4282013
<ol>
<li><a href="http://mathwithbaddrawings.com/2013/04/25/were-all-bad-at-math-1-i-feel-stupid-too/">What it feels like to be bad at math</a>. My personal experience like this culminated in some difficulties <a href="http://en.wikipedia.org/wiki/Green's_function">with Green’s functions</a> back in my early days at USU. I think almost everybody who does enough math eventually runs into a situation where they don’t understand what is going on and it stresses them out.</li>
<li><a href="http://www.nytimes.com/2013/04/28/technology/how-big-data-is-playing-recruiter-for-specialized-workers.html?_r=0">An article</a> about companies that are using data to try to identify people for jobs (via Rafa).</li>
<li><a href="http://www.forbes.com/sites/davidleinweber/2013/04/26/big-data-gets-bigger-now-google-trends-can-predict-the-market/">Google trends for predicting the market</a>. I’m not sure that “predicting” is the right word here. I think a better word might be “explaining/associating”. I also wonder if <a href="http://www.nature.com/news/when-google-got-flu-wrong-1.12413">this could go off the rails</a>.</li>
<li><a href="http://www.r-bloggers.com/faster-higher-stonger-a-guide-to-speeding-up-r-code-for-busy-people/?utm_source=feedly&utm_medium=feed&utm_campaign=Feed:+RBloggers+(R+bloggers)">This article</a> describes the ways that you can speed up R code. My favorite part of it is that it starts with the “why”. Exactly. <a href="http://en.wikiquote.org/wiki/Donald_Knuth">Premature optimization is the root of all evil</a>.</li>
<li><a href="http://blog.mortardata.com/post/47549853491/data-science-at-tumblr">A discussion of data science at Tumblr</a>. The author/speaker <a href="http://www.adamlaiacano.com/">also has a great blog</a>.</li>
</ol>
Mindlessly normalizing genomics data is bad - but ignoring unwanted variability can be worse
2013-04-26T10:49:08+00:00
http://simplystats.github.io/2013/04/26/mindlessly-normalizing-genomics-data-is-bad-but-ignoring-unwanted-variability-can-be-worse
<p>Yesterday, and bleeding over into today, <a href="http://www.ncbi.nlm.nih.gov/pubmed/12538238">quantile normalization</a> (QN) was being discussed on Twitter. This is the <a href="https://twitter.com/mbeisen/status/327563522185764864">tweet</a> that started the whole thing off. The conversation went in a bunch of different directions and then this happened:</p>
<blockquote>
<p>well, this happens all over bio-statistics - ie, naive use in seemingly undirected ways until you get a “good” pvalue. And then end</p>
</blockquote>
<p>So Jeff and I felt it was important to respond - since we are biostatisticians that work in genomics. We felt a couple of points were worth making:</p>
<ol>
<li><strong>Most statisticians we know, including us, know QN’s limitations and are always nervous about using QN</strong>. But with most datasets we see, unwanted variability is overwhelming and we are left with no choice but to normalize in order to extract anything useful from the data. In fact, many times QN is not enough and we have to apply further transformations, e.g., to remove <a href="http://www.ncbi.nlm.nih.gov/pubmed/20838408">batch effects</a>.</li>
<li><strong>We would be curious to know which biostatisticians were being referred to. </strong>We would like some examples, because most of the genomic statisticians we know work very closely with biologists to aid them in cleaning dirty data to help them find real sources of signal. Furthermore, we encourage biologists to validate their results. In many cases, quantile normalization (or another transform) is critical to finding results that validate and there is a long literature (both biological and statistical) supporting the importance of appropriate normalization.</li>
<li><strong>Assuming the data that you get (sequences, probe intensities, etc.) from high-throughput tech = direct measurement of abundance is incorrect.</strong> Before worrying about QN (or other normalization) being an arbitrary transformation that distorts the data, keep in mind that what you want to measure has already been distorted by PCR, the imperfections of the microarray, scanner measurement error, image bleeding, cross hybridization or alignment artifacts, ozone effects, etc…</li>
</ol>
<p>If you are interested, I have written up a little more detail below, with data, about the reasons that normalization may be important in many cases.</p>
<!--more-->
<p>Most, if not all, of the high-throughput data we have analyzed need some kind of normalization. This applies to both microarrays and next-gen sequencing. To demonstrate why, below I include 5 boxplots of log intensities from 5 microarrays that were hybridized to the same RNA (technical replicates).</p>
<p><a href="http://simplystatistics.org/2013/04/26/mindlessly-normalizing-genomics-data-is-bad-but-ignoring-unwanted-variability-can-be-worse/screen-shot-2013-04-25-at-11-12-20-pm/" rel="attachment wp-att-1216"><img class="wp-image-1216 alignleft" alt="Screen shot 2013-04-25 at 11.12.20 PM" src="http://simplystatistics.org/wp-content/uploads/2013/04/Screen-shot-2013-04-25-at-11.12.20-PM.png" width="285" height="271" srcset="http://simplystatistics.org/wp-content/uploads/2013/04/Screen-shot-2013-04-25-at-11.12.20-PM-300x285.png 300w, http://simplystatistics.org/wp-content/uploads/2013/04/Screen-shot-2013-04-25-at-11.12.20-PM.png 475w" sizes="(max-width: 285px) 100vw, 285px" /></a></p>
<p>See the problem? If we took the data at face value we would conclude that there is a large (almost 2 fold) global change in expression when comparing, say, samples C and E. But they are technical replicates so the observed difference is not biologically driven. Discrepancies like these are the rule rather than the exception. Biologists seem to underestimate the amount of unwanted variability present in the data they produce. Look at enough data and you will quickly learn that, in most cases, unwanted experimental variability dwarfs the biological differences we are interested in discovering. Normalization is the statistical technique that saves biologists millions of dollars a year by fixing this problem in silico rather than redoing the experiment.</p>
<p>For the data above you might be tempted to simply standardize the data by subtracting the median. But the problem is more complicated than that, as shown in the plot below. This plot shows the log ratio (M) versus the average of the log intensities (A) for two technical replicates in which 16 probes (red dots) have been “spiked-in” to have true fold changes of 2. The other ~20,000 probesets (blue streak) are supposed to be unchanged (M=0). See the curvature of the genes that are supposed to be at 0? Taken at face value, thousands of the lowly expressed probes exhibit larger differential expression than the 16 that are actually different. That’s a problem. And standardizing by subtracting the median won’t fix it. Non-linear biases such as this one are also quite common. <a href="http://simplystatistics.org/2013/04/26/mindlessly-normalizing-genomics-data-is-bad-but-ignoring-unwanted-variability-can-be-worse/screen-shot-2013-04-25-at-11-14-20-pm/" rel="attachment wp-att-1218"><img class=" wp-image-1218 alignright" alt="Screen shot 2013-04-25 at 11.14.20 PM" src="http://simplystatistics.org/wp-content/uploads/2013/04/Screen-shot-2013-04-25-at-11.14.20-PM.png" width="483" height="275" /></a></p>
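<p>To make the M and A quantities concrete, here is a minimal sketch in Python using only the standard library. The function names and the toy intensities are made up for illustration, and the additive offset of 16 is just one assumed way to produce an intensity-dependent bias; it is not the data in the plot. The point it illustrates is that subtracting the median shifts every M value by the same constant, so the curvature stays.</p>

```python
import math
from statistics import median

def ma_values(x, y):
    """Return (M, A) for two paired lists of positive intensities.

    M is the log2 ratio y/x and A is the average of the two log2 intensities.
    """
    m = [math.log2(b) - math.log2(a) for a, b in zip(x, y)]
    a_vals = [(math.log2(a) + math.log2(b)) / 2 for a, b in zip(x, y)]
    return m, a_vals

def median_center(m):
    """Subtract the median from every M value (a constant shift)."""
    med = median(m)
    return [v - med for v in m]

# Hypothetical technical replicates: rep2 equals rep1 plus a constant
# background, which inflates low intensities proportionally more.
rep1 = [2.0 ** k for k in range(4, 13)]  # 16, 32, ..., 4096
rep2 = [v + 16.0 for v in rep1]

m, a = ma_values(rep1, rep2)
centered = median_center(m)

# The spread of M across the intensity range is the same before and after
# median centering, so the intensity-dependent trend is still there.
print(max(m) - min(m), max(centered) - min(centered))
```

<p>Because the bias depends on A, any fix has to be allowed to depend on A too, which a single constant cannot do.</p>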
<p>QN offers one solution to this problem if you can assume that the true distribution of what you are measuring is roughly the same across samples. Briefly, QN forces each sample to have the same distribution. The after picture above is the result of QN. It removes the curvature but preserves most of the real differences.</p>
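<p>Here is a rough sketch of the basic QN recipe in Python, again with only the standard library. The function name <code>quantile_normalize</code> and the toy numbers are hypothetical, and the sketch assumes equal-length samples with no missing values and breaks ties by position, which production implementations handle more carefully. Each value is replaced by the average, across samples, of the values that share its rank.</p>

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equal-length samples (lists of numbers).

    Each value is replaced by the mean, across samples, of the values that
    have the same rank. Ties are broken by position (a simplification).
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same length")

    # For each sample, the indices that would sort it.
    orders = [sorted(range(n), key=s.__getitem__) for s in samples]

    # Reference distribution: mean of the k-th smallest value of each sample.
    reference = [
        sum(s[o[k]] for s, o in zip(samples, orders)) / len(samples)
        for k in range(n)
    ]

    # Put the reference value for each rank back in the original position.
    normalized = []
    for o in orders:
        out = [0.0] * n
        for rank, idx in enumerate(o):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical toy data: three "arrays" measuring the same four probes.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 6.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
qn = quantile_normalize(arrays)
for row in qn:
    print(row)
```

<p>After this step every sample has exactly the same set of values and only the within-sample ordering differs, which is why a real global shift between samples would be erased.</p>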
<p>So why should we be nervous? QN and other normalization techniques risk throwing the baby out with the bath water. What if there is a real global difference? If there is, and you use QN, you will miss it and you may introduce artifacts. <em>But the assumptions are no secret and it’s up to the biologists to decide if they are reasonable.</em> At the same time, we have to be very careful about interpreting large scale changes given that we see large scale changes when we know there are none. Other than cases where global differences are forced or simulated, I have yet to see a good example in which QN causes more harm than good. I’m sure there are some real data examples out there, so if you have one please share, as I would love to use it as an example in class.</p>
<p>Also note that statisticians (including me) are working hard at deciphering ways to normalize without the need for such strong assumptions. Although in their first incarnation they were useless, current control probes/transcripts techniques are promising. We have used them in the past to <a href="http://www.ncbi.nlm.nih.gov/pubmed/20858772">normalize methylation data</a> (a similar approach was used <a href="http://www.ncbi.nlm.nih.gov/pubmed/23101621">here</a> for gene expression data). And then there is <a style="font-size: 16px;" href="http://www.ncbi.nlm.nih.gov/pubmed/20976876">subset quantile normalization</a>. I am sure there are others and more to come. So Biologists, don’t worry, we have your backs and serve at your pleasure. In the meantime don’t be so afraid of QN: at least give it a try before you knock it.</p>
Interview at Yale Center for Environmental Law & Policy
2013-04-23T10:00:44+00:00
http://simplystats.github.io/2013/04/23/interview-at-yale-center-for-environmental-law-policy
<p><a href="http://vimeo.com/64067594">Interview with Roger Peng</a> from <a href="http://vimeo.com/ycelp">YCELP</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
<p>A few weeks ago I sat down with Angel Hsu of the Yale Center for Environmental Law and Policy to talk about some of their work on air pollution indicators.</p>
<p>(Note: I haven’t moved–I still work at the John<strong>s</strong> Hopkins School of Public Health.)</p>
Nevins-Potti, Reinhart-Rogoff
2013-04-21T21:35:41+00:00
http://simplystats.github.io/2013/04/21/nevins-potti-reinhart-rogoff
<p>There’s an interesting parallel between the <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">Nevins-Potti debacle</a> (a true debacle, in my mind) and the recent <a href="http://simplystatistics.org/2013/04/19/podcast-7-reinhart-rogoff-reproducibility/">Reinhart-Rogoff kerfuffle</a>. Both were exposed via some essentially small detail that had nothing to do with the real problem.</p>
<p>In the case of Reinhart-Rogoff, the Excel error was what made them look ridiculous, but it was in fact the “unconventional weighting” of the data that had the most dramatic effect. Furthermore, academic economists had been debating and challenging the paper’s conclusions from the get go. But even when legitimate scientific concerns were raised, policy-makers and other academics were not convinced. Only once the Excel error was revealed did everything need to be re-examined.</p>
<p>In the Nevins-Potti debacle, Baggerly and Coombes wrote article after article pointing out all the problems and, for the most part, no one in a position of power really cared. The Nevins-Potti errors were real zingers too (e.g. switching the labels between people with disease and people without disease), not some trivial Excel error. But in the end, it took Potti’s claim of being a Rhodes Scholar to bring him down. Clearly, the years of academic debate beforehand were meaningless compared to lying on a CV.</p>
<p>In the Reinhart-Rogoff case, reproducibility was an issue and if the data had been made available earlier, the problems would have been discovered earlier and perhaps that would have headed off years of academic debate (for better or for worse). In the Nevins-Potti example, reproducibility was not an issue–the original Nature Medicine study was done using public data and so was reproducible (although it would have been easier if code had been made available). The problem there is that no one listened.</p>
<p>One has to wonder if the academic system is working in this regard. In both cases, it took a minor, but <em>personal</em>, failing to bring down the entire edifice. But the protestations of reputable academics, challenging the research on the merits, were ignored. I’d say in both cases the original research conveniently said what people wanted to hear (debt slows growth, personalized gene signatures can predict response to chemotherapy), and so no amount of research would convince people to question the original findings.</p>
<p>One also has to wonder whether reproducibility is of any help here. I certainly don’t think it hurts, but in the case of Nevins-Potti, where the errors were shockingly obvious to anyone paying attention, the problems were deemed merely technical (i.e. statistical). The truth is, reproducibility will be most necessary in highly technical and complex analyses where it’s often not obvious how an analysis is done. If you can show a flaw in an analysis that is complicated, what’s the use if your work will be written off as merely concerned with technical details (as if those weren’t important)? Most of the news articles surrounding Reinhart-Rogoff characterized the problems as complex and statistical (i.e. not important) and not concerned with fundamental questions of interest.</p>
<p>In both cases, I think science was used to push an external agenda, and when the science was called into question, it was difficult to back down. I’ll write more in a future post about these kinds of situations and what, if anything, we can do to improve matters.</p>
Podcast #7: Reinhart, Rogoff, Reproducibility
2013-04-19T15:27:52+00:00
http://simplystats.github.io/2013/04/19/podcast-7-reinhart-rogoff-reproducibility
<p>Jeff and I talk about the recent Reinhart-Rogoff reproducibility kerfuffle and how it turns out that data analysis is really hard no matter how big the dataset.</p>
I wish economists made better plots
2013-04-16T18:14:59+00:00
http://simplystats.github.io/2013/04/16/i-wish-economists-made-better-plots
<p>I’m seeing lots of traffic on a big-time economics article that failed to reproduce and here are my quick thoughts. You can read a pretty good summary here by <a href="http://www.nextnewdeal.net/rortybomb/researchers-finally-replicated-reinhart-rogoff-and-there-are-serious-problems">Mike Konczal</a>.</p>
<p>Quick background: Carmen Reinhart and Kenneth Rogoff wrote an <a href="http://www.nber.org/papers/w15639.pdf">influential paper</a> that was used by many to justify the need for austerity measures taken by governments to reduce debts relative to GDP. Yesterday, Thomas Herndon, Michael Ash, and Robert Pollin (HAP) <a href="http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04deaa6388b1/publication/566/">released a paper</a> where they reproduced the Reinhart-Rogoff (RR) analysis and noted a few irregularities or errors. In their abstract, HAP claim that they “find that coding errors, selective exclusion of available data, and unconventional weighting of summary statistics [in the RR analysis] lead to serious errors that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies in the post-war period.”</p>
<p>It appears there were three points made by HAP: (1) RR excluded some important data from their final analysis; (2) RR weighted countries in a manner that was <em>not</em> proportional to the number of years they contributed to the dataset (RR used equal weighting of countries); and (3) there was an error in RR’s Excel formula which resulted in them inadvertently leaving out five countries from their final analysis.</p>
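<p>To see why the weighting in point (2) matters, here is a small sketch in Python. The growth numbers are entirely made up and the function names are hypothetical; this is not RR’s or HAP’s data or code, just an illustration of how averaging country means (equal weighting of countries) can differ from pooling all country-years.</p>

```python
from statistics import mean

# Hypothetical growth rates (percent) for the country-years that fall in a
# single debt/GDP category. Country "A" contributes only one (bad) year.
growth = {
    "A": [-8.0],
    "B": [2.0, 3.0, 2.5, 2.5],
    "C": [1.0, 2.0, 3.0, 2.0, 2.0],
}

def equal_country_weighting(data):
    """Average each country's years first, then average the country means."""
    return mean(mean(years) for years in data.values())

def country_year_weighting(data):
    """Pool all country-years so that every year counts once."""
    return mean(g for years in data.values() for g in years)

print(equal_country_weighting(growth))   # country A's single year gets 1/3 of the weight
print(country_year_weighting(growth))    # country A's single year gets 1/10 of the weight
```

<p>In this toy example the sign of the average flips depending on the scheme, purely because one country with a single year is given the same weight as countries with several years.</p>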
<p>The bottom line is shown in HAP’s Figure 1, which I reproduce below (on the basis of fair use):</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2013/04/Screen-Shot-2013-04-16-at-5.48.06-PM.png"><img class="alignright size-full wp-image-1173" alt="HAP Analysis" src="http://simplystatistics.org/wp-content/uploads/2013/04/Screen-Shot-2013-04-16-at-5.48.06-PM.png" width="768" height="992" srcset="http://simplystatistics.org/wp-content/uploads/2013/04/Screen-Shot-2013-04-16-at-5.48.06-PM-232x300.png 232w, http://simplystatistics.org/wp-content/uploads/2013/04/Screen-Shot-2013-04-16-at-5.48.06-PM.png 768w" sizes="(max-width: 768px) 100vw, 768px" /></a></p>
<p>From the plot you can see that HAP’s adjusted analysis (circles) more or less coincides with RR’s analysis (diamonds) except for the last category of countries, those with debt/GDP ratios over 90%. In that category RR’s analysis shows a large drop in growth whereas HAP’s analysis shows a more or less smooth decline (but still positive growth).</p>
<p>To me, it seems that the incorrect Excel formula is a real error, but easily fixed. It also seemed to have the least impact on the final analysis. The other two problems, which had far bigger impacts, might have some explanation that I’m not aware of. I am not an economist so I am waiting for others to weigh in. RR apparently do not comment on the exclusion of certain data points or on the weighting scheme so it’s difficult to say what the thinking was, whether it was inadvertent or purposeful.</p>
<p>In summary, so what? Here’s what I think:</p>
<ol>
<li><strong>Is there some fishiness?</strong> Sure, but this is not the Potti-Nevins scandal a la economics. I suppose it’s possible RR manipulated the analysis to get the answer austerity hawks were looking for, but we don’t have the evidence yet and this just doesn’t feel like that kind of thing.</li>
<li><strong>What’s the counterfactual?</strong> Or, what would have happened if the analysis had been done the way HAP propose? Would the world have embraced pro-growth policies by taking on a greater debt burden? My guess is no. Austerity hawks would have found some other study that supported their claims (and in fact there was at least one other).</li>
<li>RR’s original analysis did not contain a plot like Figure 1 in HAP’s analysis, which I personally find very illuminating. From HAP’s figure, you can see that there’s quite a bit of variation across countries and perhaps an overall downward trend. I’m not sure I would have dramatically changed my conclusion if I had done the HAP analysis instead of the RR analysis. My point is that <strong>plots like this, which <em>show the variability</em>, are very important</strong>.</li>
<li><strong>People see what they want to see</strong>. I would not be surprised to see some claim that HAP’s analysis supports the austerity conclusion because growth under high debt loads is much lower (almost 50%!) than under low debt loads.</li>
<li><strong>If RR’s analysis had been correct, should they have even made the conclusions they made?</strong> RR indicated that there was a “threshold” at 90% debt/GDP. My experience is that statements about thresholds are generally very hard to make, even with good data. I wonder what other more knowledgeable people think of the original conclusions.</li>
<li><strong>If the data had been made available sooner, this problem would have been fixed sooner</strong>. But in my opinion, that’s all that would have happened.</li>
</ol>
<p>The vibe on the Internets seems to be that if only this problem had been identified sooner, the world would be a better place. But my cynical mind says, uh, no. You can toss this incident in the very large bucket of papers with some technical errors that are easily fixed. Thankfully, someone found these errors and fixed them, and that’s a good thing. Science moves on.</p>
<p>UPDATE: Reinhart-Rogoff <a href="http://www.slate.com/blogs/moneybox/2013/04/16/reinhart_and_rogoff_respond_researchers_say_high_debt_is_associated_with.html">respond</a>.</p>
<p>UPDATE 2: Reinhart-Rogoff more <a href="http://blogs.wsj.com/economics/2013/04/17/reinhart-rogoff-admit-excel-mistake-rebut-other-critiques/">detailed response</a>.</p>
Data science only poses a threat to (bio)statistics if we don't adapt
2013-04-15T15:19:16+00:00
http://simplystats.github.io/2013/04/15/data-science-only-poses-a-threat-to-biostatistics-if-we-dont-adapt
<p>We have previously mentioned on this blog how <a href="http://simplystatistics.org/2012/08/14/statistics-statisticians-need-better-marketing/">statistics needs better marketing</a>. Recently, Karl B. has suggested that “<a href="http://kbroman.wordpress.com/2013/04/05/data-science-is-statistics/">Data science is statistics</a>” and Larry W. has wondered if “<a href="http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/">Data science is the end of statistics?</a>” I think there are a couple of types of data science and that each has a different relationship to the discipline of academic statistics:</p>
<ol>
<li><strong>Data science as marketing tool</strong>. Data analytics, data science, big data, etc. are terms that companies who already did something (IT infrastructure, consulting, database management, etc.) throw around to make them sound like they are doing the latest and greatest thing. These marketers are dabblers in what I would call the real “science of data” or maybe deal with just one part of the data pipeline. I think they pose no threat to the statistics community other than by generating backlash by over promising on the potential of data science or diluting the term to the point of being almost non-sensical.</li>
<li><strong>Data science as business analytics.</strong> Another common use of “data science” is to describe the exact same set of activities that used to be performed by business analytics people, maybe allowing for some growth in the size of the data sets. This might be a threat to folks who do statistics in business schools - although more likely it will be beneficial to those programs as there is growth in the need for business-oriented statisticians.</li>
<li><strong>Data science as big data engineer</strong> Sometimes data science refers to people who do stuff with huge amounts of data. Larry refers to this in his post when he talks about people <a href="http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/">working on billions of data points</a>. Most classically trained statisticians aren’t comfortable with data of this size. But at places like Google - where big data sets are routine - the infrastructure is built <a href="http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/">so that statisticians can access and compress the parts of the data that they need</a> to do their jobs. I don’t think this is necessarily a threat to statistics; but we should definitely be integrating data access into our curriculum.</li>
<li><strong>Data science as replacement for statistics </strong>Some people (and I think it is the minority) are exactly referring to things that statisticians do when they talk about data science. This means manipulating, collecting, and analyzing data, then making inferences to a population or predictions about what will happen next. This is, of course, a threat to statisticians. Some places, like <a href="http://analytics.ncsu.edu/">NC State</a> and <a href="http://idse.columbia.edu/">Columbia</a>, are tackling this by developing centers/institutes/programs with data science in the name. But I think that is a little dangerous. The data don’t matter - it is the problem you can solve with the data. So the key thing is that these institutes need to focus on solving real problems - not just churning out people who know a little R, a little SQL, and a little Python.</li>
</ol>
<p>So why is #4 happening? I think one reason is reputation. Larry mentions that a statistician produces an estimate and a confidence interval and maybe the confidence interval is too wide. I think he is on to something there, but I think it is a bigger problem. <a href="http://simplystatistics.org/2012/06/22/statistics-and-the-science-club/">As Roger has pointed out</a> - statisticians often see themselves as referees - rather than scientists/business people. So a lot of people have the experience of going to a statistician and feeling like they have been criticized for bad experimental design, too small a sample size, etc. These issues are hugely important - but sometimes you have to make do with what you have. I think data scientists in category 4 are taking advantage of a cultural tendency of statisticians to avoid making concrete decisions.</p>
<p>A second reason is that some statisticians have avoided getting their hands dirty. “Hands clean” statisticians don’t get the data from the database, or worry about the data munging, or match identifiers, etc. They wait until the data are nicely formatted in a matrix to apply their methods. To stay competitive, we need to produce more “hands dirty” statisticians who are willing to go beyond <a href="http://simplystatistics.org/2012/05/28/schlep-blindness-in-statistics/">schlep blindness</a> and handle all aspects of a data analysis. In academia, we can encourage this by incorporating more of those issues into our curriculum.</p>
<p>Finally, I think statisticians’ focus on optimality hurts us. Our field grew up in an era where data was sparse and we had to squeeze every last ounce of information out of what little data we had. Those constraints led to a cultural focus on optimality to a degree that is no longer necessary when data are abundant. In fact, an abundance of data is <a href="http://www.youtube.com/watch?v=yvDCzhbjYWs">often unreasonably effective even with suboptimal methods</a>. “Data scientists” understand this and shoot for the 80% solution that is good enough in most cases.</p>
<p>In summary I don’t think statistics will be killed off by data science. Most of the hype around data science is actually somewhat removed from our field (see above). But I do think that it is worth considering some potential changes that reposition our discipline as the most useful for answering questions with data. Here are some concrete proposals:</p>
<ol>
<li>Remove some theoretical requirements and add computing requirements to statistics curricula.</li>
<li>Focus on statistical writing, presentation, and communication as a main part of the curriculum.</li>
<li>Focus on positive interactions with collaborators (being a scientist) rather than immediately going to the referee attitude.</li>
<li>Add a unit on translating scientific problems to statistical problems.</li>
<li>Add a unit on data munging and getting data from databases.</li>
<li>Integrate real and live data analyses into our curricula.</li>
<li>Make all our students create an R package (a data product) before they graduate.</li>
<li>Most important of all, have a “big tent” attitude about what constitutes statistics.</li>
</ol>
Sunday data/statistics link roundup (4/14/2013)
2013-04-14T10:36:29+00:00
http://simplystats.github.io/2013/04/14/sunday-datastatistics-link-roundup-4142013
<ol>
<li><a href="http://storify.com/Kalido/most-influential-data-scientists-on-twitter">The most influential data scientists on Twitter</a>, featuring Amy Heineike, Hilary Mason, and a few other familiar names to readers of this blog. In other news, I love reading lists of the “Top K ___” as much as the next person. I love them even more when they are quantitative (the list above isn’t) - even when the quantification is totally bogus. (via John M.)</li>
<li>Rod Little and our own Tom Louis <a href="http://www.huffingtonpost.com/rod-little/decennial-census_b_3046611.html?utm_hp_ref=science">over at the Huffingtonpost</a> talking about the ways in which the U.S. Census supports our democracy. It is a very good piece and I think highlights the critical importance that statistics and data play in keeping government open and honest.</li>
<li><a href="http://www.nytimes.com/2013/04/08/health/for-scientists-an-exploding-world-of-pseudo-academia.html?src=me&ref=general&_r=1&">An article</a> about the growing number of fake academic journals and their potential predatory practices. I think I’ve been able to filter out the fake journals/conferences pretty well (if they’ve invited 30 Nobel Laureates - probably fake). But this poses big societal problems; how do we tell what is real science from what is fake if you don’t have inside knowledge about which journals are real? (via John H.)</li>
<li><a href="https://www.capitalbikeshare.com/trip-history-data">Trip history data</a> on the DC Capital Bikeshare. One of my favorite things is when a government organization just opens up its data. The best part is that the files are formatted as csv’s. Clearly someone who knows that the best data formats are open, free, and easy to read into statistical software. In other news, I think one of the most important classes that could be taught is “How to share data 101” (via David B.)</li>
<li>A slightly belated link to a <a href="http://blogs.sas.com/content/jmp/2013/03/29/george-box-a-remembrance/">remembrance of George Box.</a> He was the one who said, “All models are wrong, but some are useful.” An absolute titan of our field.</li>
<li>Check out these <a href="http://exp.lore.com/post/47740806673/mexico-based-designer-alan-betacourt-has-created">cool logotypes for famous scientists</a>. I want one! Also, see the article on these awesome <a href="http://www.brainpickings.org/index.php/2012/09/26/hydrogene-women-in-science-posters/">minimalist posters celebrating legendary women in science</a>. I want the Sally Ride poster on a t-shirt.</li>
<li>As an advisor, I aspire to treat my students/postdocs <a href="https://twitter.com/hunterwalk/status/323294179046326273/photo/1">like this</a>. (<a href="https://twitter.com/hunterwalk">@hunterwalk</a>). I’m not always so good at it, but those are some good ideals to try to live up to.</li>
</ol>
Great scientist - statistics = lots of failed experiments
2013-04-12T15:25:44+00:00
http://simplystats.github.io/2013/04/12/great-scientist-statistics-lots-of-failed-experiments
<p><a href="http://en.wikipedia.org/wiki/E._O._Wilson">E.O. Wilson</a> is a famous evolutionary biologist. He is currently an emeritus professor at Harvard and just this last week dropped <a href="http://online.wsj.com/article/SB10001424127887323611604578398943650327184.html">this little gem</a> in the Wall Street Journal. In the piece, he suggests that knowing mathematics is not important for becoming a great scientist. Wilson goes even further, suggesting that you can be mathematically semi-literate and still be an amazing scientist. There are two key quotes in the piece that I think deserve special attention:</p>
<blockquote>
<p>Fortunately, exceptional mathematical fluency is required in only a few disciplines, such as particle physics, astrophysics and information theory. Far more important throughout the rest of science is the ability to form concepts, during which the researcher conjures images and processes by intuition.</p>
</blockquote>
<p>I agree with this quote in general <a href="http://krugman.blogs.nytimes.com/2013/04/09/doing-the-math/">as does Paul Krugman</a>. Many scientific areas don’t require advanced measure theory, differential geometry, or number theory to make big advances. It seems like this is the kind of mathematics to which E.O. Wilson is referring, and on that point I think there is probably universal agreement that you can have a hugely successful scientific career without knowing about measurable spaces.</p>
<p>Wilson doesn’t stop there, however. He goes on to paint a much broader picture about how one can pursue science without the aid of even basic mathematics or statistics and this is where I think he goes off the rails a bit:</p>
<blockquote>
<p>Ideas in science emerge most readily when some part of the world is studied for its own sake. They follow from thorough, well-organized knowledge of all that is known or can be imagined of real entities and processes within that fragment of existence. When something new is encountered, the follow-up steps usually require mathematical and statistical methods to move the analysis forward. If that step proves too technically difficult for the person who made the discovery, a mathematician or statistician can be added as a collaborator.</p>
</blockquote>
<p>I see two huge problems with this statement:</p>
<ol>
<li>Poor design of experiments is one of the most common reasons, if not the most common reason, for an experiment to fail. It is so important that Fisher said, “To consult the statistician after an experiment is finished is often merely to ask him to conduct a <em>post mortem</em> examination. He can perhaps say what the experiment died of.” Wilson is suggesting that with careful conceptual thought and some hard work you can do good science, but without a fundamental understanding of basic math, statistics, and study design even the best conceived experiments are likely to fail.</li>
<li>While armchair science was likely the norm when Wilson was in his prime, huge advances have been made in both science and technology. Scientifically, it is difficult to synthesize and understand everything that has been done without some basic understanding of the statistical quality of previous experiments. Similarly, as data collection has evolved, statistics and computation are playing a more and more central role. As Rafa has pointed out, <a href="http://simplystatistics.tumblr.com/post/21914291274/people-in-positions-of-power-that-dont-understand">people in positions of power who don’t understand statistics are a big problem for science</a>.</li>
</ol>
<p>More importantly, as we live in an increasingly data rich environment both in the sciences and in the broader community - basic statistical and numerical literacy are becoming more and more important. While I agree with Wilson that we should try not to discourage people who have a difficult first encounter with math from pursuing careers in science, I think it is both disingenuous and potentially disastrous to downplay the importance of quantitative skill at the exact moment in history that those skills are most desperately needed.</p>
<p>As a counter proposal to Wilson’s idea that we should encourage people to disregard quantitative sciences I propose that we build a better infrastructure for ensuring all people interested in the sciences are able to improve their quantitative skills and literacy. Here at Simply Stats we are all about putting our money where our mouth is and we have already started by creating <a href="http://simplystatistics.org/courses/">free, online versions</a> of our quantitative courses. Maybe Wilson should take one….</p>
Climate Science Day on Capitol Hill
2013-04-10T10:00:36+00:00
http://simplystats.github.io/2013/04/10/climate-science-day-on-capitol-hill
<p>A few weeks ago I participated in the fourth annual Climate Science Day organized by the ASA and a host of other professional and scientific societies. There’s a nice write up of the event written by Steve Pierson over at <a href="http://magazine.amstat.org/blog/2013/04/01/csdapril2013/">Amstat News</a>. There were a number of statisticians there besides me, but the vast majority of people were climate modelers, atmospheric scientists, agronomists, and the like. Below is our crack team of scientists outside the office of (Dr.) Andy Harris. Might be the only time you see me wearing a suit.</p>
<p><img class="alignright size-medium wp-image-1149" alt="IMG_3783" src="http://simplystatistics.org/wp-content/uploads/2013/04/IMG_3783-300x225.jpg" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2013/04/IMG_3783-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2013/04/IMG_3783-1024x768.jpg 1024w" sizes="(max-width: 300px) 100vw, 300px" /></p>
<p>The basic idea behind the day is to get scientists who do climate-related research into the halls of Congress to introduce themselves to members of Congress and make themselves available for scientific consultations. I was there (with Brooke Anderson, the other JHU rep) because of some of my work on the health effects of heat. I was paired up with Tony Broccoli, a climate modeler at Rutgers, as we visited the various offices of New Jersey and Maryland legislators. We also talked to staff from the Senate Health, Education, Labor, and Pensions (HELP) committee.</p>
<p>Here are a few things I learned:</p>
<ul>
<li>It was fun. I’d never been to Congress before so it was interesting for me to walk around and see how people work. Everyone (regardless of party) was super friendly and happy to talk to us.</li>
<li>The legislature appears to be run by women. Seriously, I think every staffer we met with (but one) was a woman. Might have been a coincidence, but I was not expecting that. We only met with one actual member of Congress, and that was (Dr.) Andy Harris from Maryland’s first district.</li>
<li>Climate change is not really on anyone’s radar. Oh well, we were there 3 days before the sequester hit so there were understandably other things on their minds. Waxman-Markey was the most recent legislation taken up by the House and it went nowhere in the Senate.</li>
<li>The Senate HELP committee has PhDs working on its staff. Didn’t know that.</li>
<li>Staffers are working on like 90 things at once, probably none of which are related to each other. That’s got to be a tough job.</li>
<li>I used more business cards on this one day than in my entire life.</li>
<li>Senate offices are way nicer than House offices.</li>
<li>The people who write our laws are around 22 years old. Maybe 25 if they went to law school. I’m cool with that, I think.</li>
</ul>
NIH is looking for an Associate Director for Data Science: Statisticians should consider applying
2013-04-08T16:22:07+00:00
http://simplystats.github.io/2013/04/08/nih-is-looking-for-an-associate-director-for-data-science-statisticians-should-consider-applying
<p>NIH understands the importance of data and several months ago they announced this new position. Here is an excerpt from <a href="http://www.jobs.nih.gov/vacancies/executive/adds.htm">the ad</a>:</p>
<blockquote>
<p>The ADDS will focus on the urgent need and increased opportunities for capitalizing on the expanding collections of biomedical data to advance NIH’s mission. In doing so, the incumbent will provide programmatic NIH-wide leadership for areas of data science that relate to data emanating from many areas of study (e.g., genomics, imaging, and electronic heath records). This will require knowledge about multiple domains of study as well as familiarity with approaches for integrating data from these various domains.</p>
</blockquote>
<p>In my opinion, the person holding this job should have hands-on experience with data analysis and programming. The <del>nuisances</del> nuances of what a data analyst needs to successfully do his/her job shouldn’t be underestimated. This knowledge will help this director make the right decisions when it comes to choosing what data to make available and how to make it available. When it comes to creating data resources, good intentions don’t always translate into usable products.</p>
<p>In this new era of data-driven science, this position will be highly influential, which makes the job quite attractive. If you know a statistician who might be interested, please pass along the information.</p>
Introducing the healthvis R package - one line D3 graphics with R
2013-04-02T10:00:39+00:00
http://simplystats.github.io/2013/04/02/introducing-the-healthvis-r-package-one-line-d3-graphics-with-r
<p dir="ltr">
We have been a little slow on the posting for the last couple of months here at Simply Stats. That’s bad news for the blog, but good news for our research programs!
</p>
<p>Today I’m announcing the new <a style="font-size: 16px" href="http://healthvis.org/">healthvis</a> R package that is being developed by my student <a style="font-size: 16px" href="http://www.biostat.jhsph.edu/~prpatil/">Prasad Patil</a> (who needs a website like yesterday), <a style="font-size: 16px" href="http://www.cbcb.umd.edu/~hcorrada/">Hector Corrada Bravo</a>, and myself*. The basic idea is that I have loved <a style="font-size: 16px" href="http://d3js.org/">D3 interactive graphics</a> for a while. But they are hard to create from scratch, since they require knowledge of both JavaScript and the D3 library.</p>
<p>Even with those skills, it can take a while to develop a new graphic. On the other hand, I know a lot about R and am often analyzing biomedical data where interactive graphics could be hugely useful. There are a couple of really useful tools for creating interactive graphics in R, most notably <a style="font-size: 16px" href="http://www.rstudio.com/shiny/">Shiny</a>, which is awesome. But these tools still require a bit of development to get right and are designed for “stand alone” tools.</p>
<p>So we created an R package that builds specific graphs that come up commonly in the analysis of health data like survival curves, heatmaps, and <a style="font-size: 16px" href="http://library.mpib-berlin.mpg.de/ft/mg/MG_Using_2009.pdf">icon arrays</a>. For example, here is how you make an interactive survival plot comparing treated to untreated individuals with healthvis:</p>
<pre class="brush: r; title: ; notranslate" title=""># Load libraries
library(healthvis)
library(survival)
# Run a cox proportional hazards regression
cobj <- coxph(Surv(time, status)~trt+age+celltype+prior, data=veteran)
# Plot using healthvis - one line!
survivalVis(cobj, data=veteran, plot.title="Veteran Survival Data", group="trt", group.names=c("Treatment", "No Treatment"), line.col=c("#E495A5","#39BEB1"))
</pre>
<p>The “survivalVis” command above produces an interactive graphic <a style="font-size: 16px" href="http://healthviz.appspot.com/display/hs_20001">like this</a>. Here it is embedded (you may have to scroll to see the dropdowns on the right - we are working on resizing)</p>
<p>The advantage of this approach is that you can make common graphics interactive without a lot of development time. Here are some other unique features:</p>
<ul>
<li>
<p dir="ltr">
The graphics are hosted on Google App Engine. With one click you can get a permanent link and share it with collaborators.
</p>
</li>
<li>
<p dir="ltr">
With another click you can get the code to embed the graphics in your website.
</p>
</li>
<li>
<p dir="ltr">
If you have already created D3 graphics it only takes a few minutes to <a href="http://healthvis.wordpress.com/develop/">develop a healthvis version</a> to let R users create their own - email us and we will make it part of the healthvis package!
</p>
</li>
<li>
<p dir="ltr">
healthvis is totally general - you can develop graphics that don’t have anything to do with health with our framework. If you want to be a developer, just email us at <a href="mailto:healthvis@gmail.com">healthvis@gmail.com</a>.
</p>
</li>
</ul>
<p>We have started a blog over at <a style="font-size: 16px" href="http://healthvis.org/">healthvis.org</a> where we will be talking about the tricks we learn while developing D3 graphics, updates to the healthvis package, and generally talking about visualization for new technologies like those developed by the CCNE and individualized health. If you are interested in getting involved as a developer, user or tester, drop us a line and let us know. In the meantime, happy visualizing!</p>
<p><em>* This project is supported by the <a href="http://ccne.inbt.jhu.edu/">JHU CCNE</a> (U54CA151838) and the Johns Hopkins <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/">inHealth initiative</a>.</em></p>
An instructor's thoughts on peer-review for data analysis in Coursera
2013-03-26T10:52:42+00:00
http://simplystats.github.io/2013/03/26/an-instructors-thoughts-on-peer-review-for-data-analysis-in-coursera
<p>I used peer-review for the data analysis course I just finished. As I mentioned in the <a href="http://simplystatistics.org/2013/03/25/podcast-6-data-analysis-mooc-post-mortem/">post-mortem podcast</a> I knew in advance that it was likely to be the most controversial component of the class. So it wasn’t surprising that based on feedback in the discussion boards and on this blog, the peer review process is by far the thing students were most concerned about.</p>
<p>But to evaluate complete data analysis projects at scale, there is no other economically feasible alternative. To give you an idea, I have our local students perform 3 data analyses in an 8-week term here at Johns Hopkins. There are generally 10-15 students in that class and I estimate that I spend around an hour reading each analysis, digesting what was done, and writing up comments. That means I usually spend almost an entire weekend grading just for 10-15 data analyses. If you extrapolate that out to the 5,000 or so people who turned in data analysis assignments, it is clearly not possible for me to do all the grading.</p>
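<p>To put rough numbers on that extrapolation, here is a back-of-the-envelope sketch in R. The one hour per analysis and the 5,000 figure come from the estimates above. Treating each of those people as a single submission, and the 40-hour work week, are assumptions added purely for illustration.</p>

```r
# Estimates taken from the post
hours_per_analysis <- 1      # roughly one hour to read and comment on one analysis
n_submissions      <- 5000   # assumption: one submission per person who turned in work

# Assumption for illustration only: a 40-hour grading week
hours_per_week <- 40

total_hours <- hours_per_analysis * n_submissions
total_weeks <- total_hours / hours_per_week

total_hours  # 5000
total_weeks  # 125
```

<p>Under those assumptions, a single grader would need more than two years of full-time work for one round of assignments.</p>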
<p>Another alternative would be to pay trained data analysts to grade all the assignments. You couldn’t farm that out to Mechanical Turk: to get a better, more consistent grading scheme than peer review you’d need to hire highly trained data analysts, and that would be very expensive. While Johns Hopkins has been incredibly supportive in terms of technical support and giving me the flexibility to pursue the class, it is definitely something I did on my own time and with a lot of my own resources. It isn’t clear that it makes sense for Hopkins to pour huge resources into really high-quality grading. At the same time, I’m not sure Coursera could afford to do this for all of the classes where peer review is needed, as they are just a startup.</p>
<p>So I think that at least for the moment, peer review is the best option for grading. This has big implications for the value of the Coursera statements of accomplishment in classes where peer review is necessary. I think that it would benefit Coursera hugely to do some research on how to ensure/maintain quality in peer review (Coursera - if you are reading this and you have some $$ you want to send my way to support some students/postdocs I have some ideas on how to do that). The good news is that the amazing Coursera platform collects so much data that it is possible to do that kind of research.</p>
Podcast #6: Data Analysis MOOC Post-mortem
2013-03-25T13:34:01+00:00
http://simplystats.github.io/2013/03/25/podcast-6-data-analysis-mooc-post-mortem
<p>Jeff and I talk about Jeff’s recently completed MOOC on Data Analysis.</p>
Sunday data/statistics link roundup (3/24/2013)
2013-03-24T10:00:42+00:00
http://simplystats.github.io/2013/03/24/sunday-datastatistics-link-roundup-3242013
<ol>
<li><span style="font-size: 16px">My Coursera Data Analysis class is done for now! All the lecture notes </span><a style="font-size: 16px" href="https://github.com/jtleek/dataanalysis">are on Github</a><span style="font-size: 16px"> and all the videos </span><a style="font-size: 16px" href="http://www.youtube.com/user/jtleek2007/videos?sort=dd&tag_id=UC8xNPQ-3a5t9uMU7Vah-jWA.3.coursera&view=46">are on Youtube</a><span style="font-size: 16px">. They are tagged by week with tags “Week x”.</span></li>
<li>After ENAR the comments on how to have better stats conferences started flowing. Check out <a href="http://alyssafrazee.wordpress.com/2013/03/18/ideas-for-super-awesome-conferences/">Frazee</a>, <a href="http://yihui.name/en/2013/03/on-enar-or-statistical-meetings-in-general/">Xie</a>, and <a href="http://kbroman.wordpress.com/2013/03/19/enar-highs-and-lows/">Broman</a>. My favorite cherry picked ideas: conference app (frazee), giving the poster session more focus (frazee), free and announced wifi (broman), more social media (I loved following ENAR <a href="https://twitter.com/search/realtime?q=%23ENAR2013&src=hash">on twitter</a> but wish there had been more tweeting) (xie), add some jokes to talks (xie).</li>
<li>A related post is this one from Hilary M. on how a talk <a href="http://www.hilarymason.com/speaking/speaking-entertain-dont-teach/">should entertain, not teach</a>.</li>
<li>This is an <a href="http://blogs.spectator.co.uk/books/2013/03/interview-with-a-writer-jaron-lanier/">interview with Jaron Lanier</a> I found via AL Daily. My favorite lines? “You run into this attitude, that if ordinary people cannot set their Facebook privacy settings, then they deserve what is coming to them. There is a hacker superiority complex to this.” I think this is certainly something we have a lot of in statistics as well.</li>
<li>The CIA wants to <a href="http://www.rawstory.com/rs/2013/03/21/cias-big-data-mission-collect-everything-and-hang-onto-it-forever/">collect all the dataz</a>. Call me when cat videos become important for national security, ok guys?</li>
<li>Given I just completed my class, the <a href="http://www.katyjordan.com/MOOCproject.html">MOOC completion rates</a> graph is pretty appropriate. I think my #’s are right in line with what other people report. I’m still trying to figure out how to know how many people “completed” the class.</li>
</ol>
Youtube should check its checksums
2013-03-21T22:37:35+00:00
http://simplystats.github.io/2013/03/21/youtube-should-check-its-checksums
<p>I am in the process of uploading the video lectures for <a href="https://www.coursera.org/course/dataanalysis">Data Analysis</a>. I am getting ready to send out the course wrap-up email and I wanted to include the link to the Youtube playlist as well.</p>
<p>Unfortunately, Youtube keeps reporting that a pair of the videos in week 2 are duplicates. This is true despite them being different lengths (12:15 vs. 16:58), having different titles, and having dramatically different content. I found <a href="http://productforums.google.com/forum/#!topic/youtube/Yc7hHqwtBX0">this explanation</a> on the forums:</p>
<blockquote>
<p>YouTube uses a checksum to determine duplicates. The chances of having two different files containing different content but have the same checksum would be astronomical.</p>
</blockquote>
<p>That isn’t on the <a href="http://support.google.com/youtube/bin/answer.py?hl=en&answer=58139">official Google documentation page</a>, which is pretty sparse, but is the only description I can find of how Youtube checks for duplicate content. A <a href="http://en.wikipedia.org/wiki/Checksum">checksum</a> is a function you apply to the data from a video that (ideally) with high probability will yield different values when different videos are uploaded and the same value when the same video is uploaded. One possible checksum function could be the length of the video. Obviously that won’t work in general because many videos might be 2 minutes exactly.</p>
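<p>To make the idea concrete, here is a minimal sketch in R comparing a weak checksum (the length of the data) with a stronger one (an MD5 hash). The two “videos” are made-up byte strings, and MD5 is chosen only for illustration; nothing here is meant to describe what Youtube actually uses.</p>

```r
# Two made-up stand-ins for video files, as raw bytes of equal length.
video_a <- charToRaw("lecture A")
video_b <- charToRaw("lecture B")

# Weak checksum: the length of the data. Different content can easily collide.
length_checksum <- function(bytes) length(bytes)

# Stronger checksum: an MD5 hash. tools::md5sum() hashes files, so the bytes
# are written to a temporary file first.
md5_checksum <- function(bytes) {
  path <- tempfile()
  on.exit(unlink(path))
  writeBin(bytes, path)
  unname(tools::md5sum(path))
}

length_checksum(video_a) == length_checksum(video_b)  # TRUE: a false "duplicate"
md5_checksum(video_a) == md5_checksum(video_b)        # FALSE: contents differ
md5_checksum(video_a) == md5_checksum(video_a)        # TRUE: same content, same value
```

<p>The length-based checksum flags the two different inputs as duplicates, while the hash separates them and still gives the same value for identical content.</p>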
<p>Regardless, it looks like Youtube can’t distinguish my lecture videos. I’m thinking Vimeo or something else if I can’t get this figured out. Of course, if someone has a suggestion (short of re-exporting the videos from Camtasia) that would allow me to circumvent this problem I’d love to hear it!</p>
<p><strong>Update</strong>: <em>I ended up fiddling with the videos and got them to upload. Thanks to the helpful comments!</em></p>
Call for papers for a special issue of Statistical Analysis and Data Mining
2013-03-19T11:06:32+00:00
http://simplystats.github.io/2013/03/19/call-for-papers-for-a-special-issue-of-statistical-analysis-and-data-mining
<p>David Madigan sends the following. It looks like a really interesting place to submit papers for both statisticians and data scientists, so submit away!</p>
<blockquote>
<p>Statistical Analysis and Data Mining, An American Statistical Association Journal</p>
<div>
Call for Papers
</div>
<div>
Special Issue on Observational Healthcare Data
</div>
<div>
</div>
<div>
Guest Editors: Patrick Ryan, J&J and Marc Suchard, UCLA
</div>
<div>
</div>
<div>
Due date: July 1, 2013
</div>
<div>
</div>
<div>
Data sciences is the rapidly evolving field that integrates mathematical and statistical knowledge, software engineering and large-scale data management skills, and domain expertise to tackle difficult problems that typically cannot be solved by any one discipline alone. Some of the most difficult, and arguably most important, problems exist in healthcare. Knowledge about human biology has exponentially advanced in the past two decades with exciting progress in genetics, biophysics, and pharmacology. However, substantial opportunities exist to extend the evidence base about human disease, patient health and effects of medical interventions and translate knowledge into actions that can directly impact clinical care. The emerging availability of 'big data' in healthcare, ranging from prospective research with aggregated genomics and clinical trials to observational data from administrative claims and electronic health records through social media, offer unprecedented opportunities for data scientists to contribute to advancing healthcare through the development, evaluation, and application of novel analytical solutions to explore these data to generate evidence at both the patient and population level. Statistical and computational challenges abound and methodological progress will draw on fields such as data mining, epidemiology, medical informatics, and biostatistics to name but a few. This special issue of Statistical Analysis and Data Mining seeks to capture the current state of the art in healthcare data sciences. We welcome contributions that focus on methodology for healthcare data and original research that demonstrates the application of data sciences to problems in public health.
</div>
<div>
</div>
<div>
<a href="http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1932-1872" target="_blank">http://onlinelibrary.wiley.<wbr />com/journal/10.1002/(ISSN)<wbr />1932-1872</a>
</div>
</blockquote>
Sunday data/statistics link roundup (3/17/13)
2013-03-17T10:06:16+00:00
http://simplystats.github.io/2013/03/17/sunday-datastatistics-link-roundup-31713
<ol>
<li><span style="line-height: 15.989583969116211px;"><a href="http://blog.revolutionanalytics.com/2013/03/a-map-of-worldwide-email-traffic-created-with-r.html">A post</a> on the Revolutions blog about an analysis of the worldwide email traffic patterns. The corresponding paper is also <a href="http://arxiv.org/pdf/1303.0045v1.pdf">pretty interesting</a>. The best part is the whole analysis was done in R.</span></li>
<li><a href="http://www.nytimes.com/2013/03/13/education/california-bill-would-force-colleges-to-honor-online-classes.html?hpw&_r=0">A bill</a> in California that would require faculty approved online classes to be given credit. I think this is potentially game changing if it passes - depending on who has to do the approving. If there is local control within departments it could be huge. On the other hand, as I’ll discuss later this week, there is still some ground to be made up before I think MOOCs are ready for prime time credit in areas outside of the very basics.</li>
<li>A pretty amazing blog post about a survival analysis of <a href="http://badhessian.org/lipsyncing-for-your-life-a-survival-analysis-of-rupauls-drag-race/">RuPaul’s drag race</a>. Via Hadley.</li>
<li>If you are a statistician hiding under a rock you missed the <a href="http://www.nytimes.com/2013/03/12/science/putting-a-value-to-real-in-medical-research.html?_r=0">NY Times messing up P-values</a>. The statistical blogosphere came out swinging. <a href="http://andrewgelman.com/2013/03/12/misunderstanding-the-p-value/">Gelman</a>, <a href="http://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/">Wasserman</a>, <a href="http://hilaryparker.com/2013/03/12/about-that-pvalue-article/">Parker</a>, etc.</li>
<li>As a statistician who is pretty fired up about the tech community, I can get lost a bit in the hype as much as the next guy. I thought this article was <a href="http://www.sfgate.com/technology/dotcommentary/article/Innovation-and-the-face-of-capitalism-4342160.php">pretty sobering</a>. I think the way to make sure we keep innovating is having the will to fund long term companies and long term research. Look at how it paid off with Amazon…</li>
<li>Broman <a href="http://kbroman.wordpress.com/2013/03/16/why-arent-all-of-our-graphs-interactive/">on interactive graphics</a> is worth a read. I agree that more of our graphics should be interactive, but there is an inherent tension/tradeoff in graphics, similar to the bias-variance tradeoff. I’m sure there is a real word for it but it is the flexibility vs. understandability tradeoff. Too much interaction and it’s hard to see what is going on, not enough and you might as well have made a static graph.</li>
</ol>
Postdoctoral fellow position in reproducible research
2013-03-14T10:00:41+00:00
http://simplystats.github.io/2013/03/14/postdoctoral-fellow-position-in-reproducible-research
<p>We are looking to recruit a postdoctoral fellow to work on developing tools to make scientific research more easily reproducible. We’re looking for someone who wants to work on (and solve!) real research problems in the biomedical sciences and address the growing need for reproducible research tools. The position would be in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health and would be jointly advised by Jeff and myself.</p>
<p><strong>Qualifications</strong>: PhD in statistics, biostatistics, computer science, or related field; strong programming skills in R and Perl/Python/C; excellent written and oral communication skills; serious moxie</p>
<p><strong>Additional Information</strong>: Informal questions about the position can be sent to Dr. Roger Peng at rpeng @ jhsph.edu. Applications will be considered as they arrive.</p>
<div title="Page 1">
<p>
To apply, send a cover letter describing your research interests and interest in the position, a CV, and the names of three references. In your application, please reference "Reproducible Research postdoctoral fellowship". Application materials should be emailed to Dr. Roger Peng at rpeng @ jhsph.edu.
</p>
<p>
Applications from minority and female candidates are especially encouraged. Johns Hopkins University is an AA/EOE.
</p>
</div>
Here's my #ENAR2013 Wednesday schedule
2013-03-13T07:00:13+00:00
http://simplystats.github.io/2013/03/13/heres-my-enar2013-wednesday-schedule
<p>Here are my picks for ENAR sessions today (Wednesday):</p>
<ul>
<li>8:30-10:15am: <strong>Large Data Visualization and Exploration</strong>, Grand Ballroom 4 (make sure you stay till the end to see Karl Broman); <strong>Innovative Methods in Causal Inference with Applications to Mediation, Neuroimaging, and Infectious Diseases</strong>, Grand Ballroom 8A; <strong>Next Generation Sequencing</strong>, Grand Ballroom 5</li>
<li>10:30am-12:15pm: <strong>Statistical Information Integration of -Omics Data</strong>, Grand Ballrooms 1 & 2</li>
</ul>
<p>Okay, so this schedule actually requires me to split myself into three separate entities. However, if you find a way to do that, the 8:30-10:15am block is full of good stuff.</p>
<p>Have fun!</p>
If I were at #ENAR2013 today, here's where I'd go
2013-03-12T07:00:39+00:00
http://simplystats.github.io/2013/03/12/if-i-were-at-enar2013-today-heres-where-id-go
<p>This week is the annual ENAR meeting, the big biostatistics conference, in Orlando, Florida. It actually started on Sunday but I haven’t gotten around to looking at the program (obviously, I’m not there right now). Flipping through the <a href="http://www.enar.org/meetings2013/2013_program.pdf">program</a> now, here’s what looks good to me for Tuesday:</p>
<ul>
<li><span style="line-height: 16px">8:30-10:15am: <strong>Functional Neuroimaging Decompositions</strong>, Grand Ballroom 3 </span></li>
<li>10:30am-12:15pm: Hmm…I guess you should go to the <strong>Presidential Invited Address</strong>, Grand Ballroom 7</li>
<li>1:45-3:30pm: <strong>JABES Showcase</strong>, Grand Ballroom 8A; <strong>Statistical Body Language: Analytical Methods for Wearable Computing</strong>, Grand Ballroom 4</li>
<li>3:45-5:30pm: <strong>Big Data: Wearable Computing, Crowdsourcing, Space Telescopes, and Brain Imaging</strong>, Grand Ballroom 8A; <strong>Sample Size Planning for Clinical Development</strong>, Grand Ballroom 6</li>
</ul>
<p>That’s right, you can pack in <em>two</em> sessions on wearable computing today if you want. I’ll post tomorrow about what looks good on Wednesday.</p>
Sunday data/statistics link roundup (3/10/13)
2013-03-10T22:11:16+00:00
http://simplystats.github.io/2013/03/10/sunday-datastatistics-link-roundup-31013
<ol>
<li><a style="font-size: 16px;" href="http://aleadeum.wordpress.com/2013/03/11/14-to-40-percent-of-medical-research-are-false-positives-yet-another-calculation/">This</a> <span style="font-size: 16px;">is an outstanding follow-up analysis to </span><a style="font-size: 16px;" href="http://arxiv.org/abs/1301.3718">our paper</a> <span style="font-size: 16px;">on the rate of false discoveries in the medical literature. I hope that the author of the blog post will consider submitting it for publication in a journal; I think it is worth having more methodology out there in this area.</span></li>
<li>If you are an academic in statistics and aren’t following <a href="https://twitter.com/kwbroman">Karl</a> and <a href="https://twitter.com/tslumley">Thomas</a> on Twitter, you should be. Also check out Karl’s (mostly) <a href="http://kbroman.wordpress.com/2013/03/10/towards-making-my-own-papers-reproducible/">reproducible paper</a>.</li>
<li><a href="http://online.wsj.com/article/SB10001424127887323478304578332850293360468.html">An article</a> in the WSJ that I think I received about 40 times this week. The <a href="http://blogs.wsj.com/numbersguy/the-upbeat-stats-on-statistics-1216/">online version</a> has a quote from our own <a href="http://www.bcaffo.com/">B-Caffo</a>. It is a really good read. If you are into this, the interviews with <a href="http://simplystatistics.org/2012/10/19/interview-with-rebecca-nugent-of-carnegie-mellon/">Rebecca Nugent</a> (where we discuss growing undergrad programs) and <a href="http://simplystatistics.org/2012/01/20/interview-with-joe-blitzstein/">Joe Blitzstein</a> (where we discuss stats ed) are relevant. I thought this quote was hugely relevant: “The bulk of the people coming out [with statistics degrees] are technically competent but they’re missing the consultative and the soft skills, everything else they need to be successful.” We are focusing heavily on both components of these skills in the grad program here at Hopkins - so if people are looking for awesome data people, just let us know!</li>
<li><a href="http://www.fangraphs.com/blogs/index.php/sloan-analytics-farhan-zaidi-on-as-analytics/#more-116534">A cool discussion</a> of how the A’s look for players with “positive residuals” - positive value missed by the evaluations of other teams. (via Rafa)</li>
<li><a href="http://www.nytimes.com/2013/03/10/magazine/the-professor-the-bikini-model-and-the-suitcase-full-of-trouble.html?nl=todaysheadlines&emc=edit_th_20130310&_r=0">The physicist and the bikini model</a>. If you haven’t read it, you must be living under a rock. (via Alex N.)</li>
<li><a href="http://www.nytimes.com/2013/02/28/technology/ibm-exploring-new-feats-for-watson.html?hp&_r=1&pagewanted=all&">An interesting article</a> about how IBM is using Watson to come up with new recipes based on the data from old recipes. I’m a little suspicious of the Spanish crescent though - no butter?!</li>
<li>You should vote for Steven Salzberg for the <a href="http://www.bioinformatics.org/franklin/">Ben Franklin award</a>. The dude has come up huge for open software and we should come up huge for him. Gotta vote today though.</li>
<li><a href="http://www.youtube.com/results?search_query=harlem+shake&oq=harlem+shake&gs_l=youtube.3..35i39l2j0l2j0i3j0l2j0i3j0l2.135.2511.0.2623.17.13.2.0.0.0.490.1645.8j4j4-1.13.0...0.0...1ac.1.Eibxr2zw9B8">The Harlem Shake</a> has killed more than one of my lunch hours. <a href="http://www.youtube.com/watch?v=Vv3f0QNWvWQ">But this one is the best</a>. By far. How all simulation studies should be done (via <a href="http://www.statschat.org.nz/">StatsChat</a>).</li>
</ol>
Send me student/postdoc blogs in statistics and computational biology
2013-03-08T10:15:23+00:00
http://simplystats.github.io/2013/03/08/send-me-studentpostdoc-blogs-in-statistics-and-computational-biology
<p>I’ve been writing a blog for a few years now, but it started after I was already comfortably settled in a tenure track job. There have been some huge benefits of writing a scientific blog. It has certainly raised my visibility and given me opportunities to talk about issues that are a little outside of my usual research agenda. It has also inspired more than one research project that has ended up in a full blown peer-reviewed publication. I also frequently look to blogs/twitter accounts to see “what’s happening” in the world of statistics/data science.</p>
<p>One thing that gets me incredibly fired up is student blogs. A <a href="http://hilaryparker.com/">few</a> of <a href="http://alyssafrazee.wordpress.com/">my</a> <a href="http://fellgernon.tumblr.com/">students</a> have them and I read them whenever they post. But I have found it is hard to discover all of the blogs that might be written by students I’m not directly working with.</p>
<p>So this post is designed for two things:</p>
<p>(1) I’d really like it if you could please send me the links to twitter feeds/blogs/google+ pages etc. of students (undergrad, grad or postdoc) in statistics, computational biology, computational neuroscience, computational social science, etc. Anything that touches statistics and data is fair game.</p>
<p>(2) I plan to create a regularly-maintained page on the blog with links to student blogs with some kind of tagging system so other people can find all the cool stuff that students are thinking about/doing.</p>
<p>Please feel free to either post links in the comments, send them to us on twitter, or email them to me directly. I’ll follow up in a couple of weeks once I have things organized.</p>
The importance of simulating the extremes
2013-03-06T12:35:04+00:00
http://simplystats.github.io/2013/03/06/the-importance-of-simulating-the-extremes
<p>Simulation is commonly used by statisticians/data analysts to (1) <a href="http://en.wikipedia.org/wiki/Bootstrapping_(statistics)">estimate variability/improve predictors</a>, (2) <a href="http://en.wikipedia.org/wiki/Monte_Carlo_method">evaluate the space of potential outcomes</a>, and (3) evaluate the properties of new algorithms or procedures. Over the last couple of days, discussions of simulation have popped up in a couple of different places.</p>
<p>First, the reviewers of a paper that my student is working on had asked a question about the behavior of the method in different conditions. I mentioned in passing that I thought it was a good idea to simulate some cases where our method will definitely break down.</p>
<p>I also saw this post by John Cook about simple/complex models. He <a href="http://www.johndcook.com/blog/2013/03/05/data-calls-the-models-bluff/">raises the really important point</a> that increasingly complex models built on a canonical, small, data set can fool you. You can make the model more and more complicated - but in other data sets the assumptions might not hold and the model won’t generalize. Of course, simple models can have the same problems, but generally simple models will fail on small data sets in the same way they would fail on larger data sets (in my experience) - either they work or they don’t.</p>
<p>These two ideas got me thinking about why I like simulation. Some statisticians, particularly applied statisticians, aren’t fond of simulation for evaluating methods. I think the reason is that you can always simulate a situation that meets all of your assumptions and make your approach look good. Real data rarely conform to model assumptions and so are harder to “trick”. On the other hand, I really like simulation: it can reveal a lot about how and when a method will work well, and it allows you to explore scenarios - particularly for new or difficult to obtain data.</p>
<p>Here are the simulations I like to see:</p>
<ol>
<li><strong>Simulation where the assumptions are true</strong> There are a surprising number of proposed methods/analysis procedures/analyses that fail or perform poorly even when the model assumptions hold. This could be because the methods overfit, have a bug, are computationally unstable, are at the wrong place on the bias/variance tradeoff curve, etc. etc. etc. I always do at least one simulation for every method where the answer should be easy to get, because I know if I don’t get the right answer, it is back to the drawing board.</li>
<li><strong>Simulation where things should definitely fail</strong> I like to try out a few realistic scenarios where I’m pretty sure my model assumptions won’t hold and the method should fail. This kind of simulation is good for two reasons: (1) sometimes I’m pleasantly surprised and the model will hold up and (2) (the more common scenario) I can find out where the model assumption boundaries are so that I can give concrete guidance to users about when/where the method will work and when/where it will fail.</li>
</ol>
<p>The first type of simulation is easy to come up with - generally you can just simulate from the model. The second type is much harder. You have to creatively think about reasonable ways that your model can fail. I’ve found that using real data for simulations can be the best way to start coming up with ideas to try - but I usually find that it is worth building on those ideas to imagine even more extreme circumstances. Playing the <a href="http://en.wikipedia.org/wiki/Evil_demon">evil demon</a> for my own methods often leads me to new ideas/improvements I hadn’t thought of before. It also helps me to evaluate the work of other people - since I’ve tried to explore the contexts where methods likely fail.</p>
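<p>The two kinds of simulation described above can be sketched in a few lines. This is only a minimal illustration, not the method from any paper mentioned here: it assumes the "method" is a textbook normal-approximation 95% confidence interval for a mean, uses Python's standard library, and all function names (<code>ci_covers</code>, <code>iid_normal</code>, <code>ar1</code>, <code>coverage</code>) are hypothetical. The "assumptions hold" case draws independent normal data; the "should definitely fail" case draws strongly autocorrelated AR(1) data, which violates the independence assumption behind the interval.</p>

```python
import random
import statistics


def ci_covers(sample, true_mean, z=1.96):
    """True if the normal-approximation 95% CI for the mean covers true_mean."""
    n = len(sample)
    center = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / n ** 0.5
    return abs(center - true_mean) <= half_width


def iid_normal(rng, n):
    """Assumptions hold: independent draws from N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def ar1(rng, n, rho=0.9):
    """Assumptions fail: autocorrelated AR(1) draws with true mean 0."""
    # Start from the stationary distribution so every draw has mean 0.
    x = rng.gauss(0.0, 1.0 / (1.0 - rho ** 2) ** 0.5)
    out = []
    for _ in range(n):
        out.append(x)
        x = rho * x + rng.gauss(0.0, 1.0)
    return out


def coverage(generator, n=50, reps=2000, seed=1):
    """Fraction of simulated data sets whose CI covers the true mean (0)."""
    rng = random.Random(seed)
    hits = sum(ci_covers(generator(rng, n), 0.0) for _ in range(reps))
    return hits / reps


if __name__ == "__main__":
    print("coverage, i.i.d. normal:", coverage(iid_normal))
    print("coverage, AR(1), rho=0.9:", coverage(ar1))
```

<p>Under these assumptions the first number should land close to the nominal 95% and the second should fall well short of it. Varying <code>rho</code> between 0 and 0.9 is one way to look for the boundary where the interval stops being trustworthy.</p>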
<p>In any case, if you haven’t simulated the extremes I don’t think you really know how your methods/analysis procedures are working.</p>
Big Data - Context = Bad
2013-03-04T10:00:24+00:00
http://simplystats.github.io/2013/03/04/big-data-context-bad
<p>There’s a nice article by Nick Bilton in the New York Times Bits blog about <a href="http://bits.blogs.nytimes.com/2013/02/24/disruptions-google-flu-trends-shows-problems-of-big-data-without-context/?smid=pl-share">the need for context when looking at Big Data</a>. Actually, the article starts off by describing how Google’s Flu Trends model overestimated the number of people infected with flu in the U.S. this season, but then veers off into a more general discussion about Big Data.</p>
<p>My favorite quote comes from Mark Hansen:</p>
<blockquote>
<p>“Data inherently has all of the foibles of being human,” said <a href="http://www.journalism.columbia.edu/profile/428-mark-hansen/10" title="More about Dr. Hansen.">Mark Hansen</a>, director of the David and <a href="http://topics.nytimes.com/top/reference/timestopics/people/b/helen_gurley_brown/index.html?inline=nyt-per" title="More articles about Helen Gurley Brown.">Helen Gurley Brown</a> Institute for Media Innovation at <a href="http://topics.nytimes.com/top/reference/timestopics/organizations/c/columbia_university/index.html?inline=nyt-org" title="More articles about Columbia University.">Columbia University</a>. “Data is not a magic force in society; it’s an extension of us.”</p>
</blockquote>
<p>Bilton also talks about a course he taught where students built sensors to install in elevators and stairwells at NYU to see how often they were used. The idea was to explore how often and when the NYU students used the stairs versus the elevator.</p>
<blockquote>
<p>As I left campus that evening, one of the N.Y.U. security guards who had seen students setting up the computers in the elevators asked how our experiment had gone. I explained that we had found that students seemed to use the elevators in the morning, perhaps because they were tired from staying up late, and switch to the stairs at night, when they became energized.</p>
<p>“Oh, no, they don’t,” the security guard told me, laughing as he assured me that lazy college students used the elevators whenever possible. “One of the elevators broke down a few evenings last week, so they had no choice but to use the stairs.”</p>
</blockquote>
<p>I can see at least three problems here, not necessarily mutually exclusive:</p>
<ol>
<li><strong>Big Data are often “Wrong” Data</strong>. The students used the sensors to measure something, but it didn’t give them everything they needed. Part of this is that the sensors were cheap, and budget was likely a big constraint here. But Big Data are often big <em>because</em> they are cheap. And of course, they still couldn’t tell that the elevator was broken.</li>
<li><strong>A failure of interrogation</strong>. With all the data the students collected with their multitude of sensors, they were unable to answer the question “What else could explain what I’m observing?”</li>
<li><strong>A strong desire to tell a story</strong>. Upon looking at the data, they seemed to “make sense” or to at least match a preconceived notion of what they should look like. This is related to #2 above, which is that you have to challenge what you see. It’s very easy and tempting to let the data tell an interesting story rather than the right story.</li>
</ol>
<p>I don’t mean to be unduly critical of some students in a class who were just trying to collect some data. I think there should be more of that going on. But my point is that it’s not as easy as it looks. Even trying to answer a seemingly innocuous question of how students use elevators and stairs requires some forethought, study design, and careful analysis.</p>
Sunday data/statistics link roundup (3/3/2013)
2013-03-03T08:32:05+00:00
http://simplystats.github.io/2013/03/03/sunday-datastatistics-link-roundup-332013
<ol>
<li><a href="http://www.nejm.org/doi/full/10.1056/NEJMoa1200303">A really nice example</a> where epidemiological studies are later confirmed by a randomized trial. From a statistician’s point of view, this is the idealized way that science would work. First, data that are relatively cheap (observational/retrospective studies) are used to identify potential associations of interest. After a number of these studies show a similar effect, a randomized study is performed to confirm what we suspected from the cheaper studies.</li>
<li>Joe Blitzstein talking about the “<a href="https://www.youtube.com/watch?feature=player_embedded&v=dzFf3r1yph8#">Soul of Statistics</a>”. <a href="http://simplystatistics.org/2012/01/20/interview-with-joe-blitzstein/">We interviewed</a> Joe a while ago. Teaching statistics is critical for modern citizenship. It is not just about learning which formula to plug a number into - <a href="http://citizen-statistician.org/2013/03/02/wall-street-journal/">it is about critical thinking with data</a>. Joe’s talk nails this issue.</li>
<li>Significance magazine has a <a href="http://www.significancemagazine.org/details/webexclusive/4374981/Writing-with-Significance-Writing-competition-to-celebrate-the-International-Yea.html#.USSXc5Hslzc.twitter">writing contest</a>. If you are a grad student in statistics/biostatistics this is an awesome way to (a) practice explaining your discipline to people who are not experts - a hugely important skill and (b) get your name out there, which will help when it comes time to look for jobs/apply for awards, etc.</li>
<li>A great post from David Spiegelhalter about the UK court’s <a href="http://understandinguncertainty.org/court-appeal-bans-bayesian-probability-and-sherlock-holmes">interpretation of probability</a>. It reminds me of the Supreme Court’s recent decision that also <a href="http://simplystatistics.org/2011/12/12/the-supreme-courts-interpretation-of-statistical/">hinged on a statistical interpretation</a>. This post brings up two issues I think are worth a more in-depth discussion. One is that it is pretty clear that many court decisions are going to <a href="http://www.huffingtonpost.com/2013/03/02/john-roberts-voting-rights-act_n_2797127.html">hinge on statistical arguments</a>. This suggests (among other things) that statistical training should be mandatory in legal education. The second issue is a minor disagreement I have with Spiegelhalter’s characterization that only Bayesians use epistemic uncertainty. I frequently discuss this type of uncertainty in my classes although I take a primarily frequentist/classical approach to teaching these courses.</li>
<li>Thomas Lumley is <a href="http://www.statistics.com/survey-r/">giving an online course</a> in complex surveys.</li>
<li><a href="http://www.biostat.wisc.edu/~kbroman/refs/umbrellas_and_lions.pdf">On the protective value of an umbrella</a> when encountering a lion. Seems like a nice way to wrap up a post that started with the power of epidemiology and clinical trials. (via <a href="https://twitter.com/kwbroman">Karl B.</a>)</li>
</ol>
Please save the unsolicited R01s
2013-02-27T10:21:21+00:00
http://simplystats.github.io/2013/02/27/please-save-the-unsolicited-r01s
<p><em><strong>Editor’s note</strong>: With the sequestration deadline hours away, the careers of many young US scientists are on the line. In this guest post, our colleague Steven Salzberg, an avid <a href="http://www.forbes.com/sites/stevensalzberg/2013/01/14/congress-is-killing-medical-research/">defender of NIH</a> and <a href="http://simplystatistics.org/2013/01/04/does-nih-fund-innovative-work-does-nature-care-about-publishing-accurate-articles/">its peer review process</a>, tells us why now more than ever the NIH should prioritize funding R01s over other project grants.</em></p>
<p>First let’s get the obvious facts out of the way: the federal budget is a mess, and Congress is completely dysfunctional. When it comes to NIH funding, this is not a good thing.</p>
<p>Hidden within the larger picture, though, is a serious menace to our decades-long record of incredibly successful research in the United States. The investigator-driven, basic research grant is in even worse shape than the overall NIH budget. A recent analysis by FASEB, shown in the figure here, reveals that the number of new R01s reached its peak in 2003 - ten years ago! - and has been steadily declining since. In 2003, 7,430 new R01s were awarded. In 2012, that number had dropped to 5,437, a 27% decline.</p>
<p><a href="http://simplystatistics.org/2013/02/27/please-save-the-unsolicited-r01s/number-of-new-r01s/" rel="attachment wp-att-1055"><img class="alignnone size-full wp-image-1055" alt="number-of-new-r01s" src="http://simplystatistics.org/wp-content/uploads/2013/02/number-of-new-r01s.jpg" width="720" height="540" srcset="http://simplystatistics.org/wp-content/uploads/2013/02/number-of-new-r01s-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2013/02/number-of-new-r01s.jpg 720w" sizes="(max-width: 720px) 100vw, 720px" /></a></p>
<p>For those who might not be familiar with the NIH system, the R01 grant is the crown jewel of research grants. R01s are awarded to individual scientists to pursue all varieties of biomedical research, from very basic science to clinical research. For R01s, NIH doesn’t tell the scientists what to do: we propose the ideas, we write them up, and then NIH organizes a rigorous peer review (which isn’t perfect, but it’s the best system anyone has). Only the top-scoring proposals get funded.</p>
<p>This process has gotten much tougher over the years. In 1995, <a href="http://www.faseb.org/Policy-and-Government-Affairs/Data-Compilations/NIH-Research-Funding-Trends.aspx" target="_blank">the success rate for R01s was 25.9%</a>. Today it is 18.4% and falling. This includes applications from everyone, even the most experienced and proven scientists. Thus no matter who you are, you can expect that there is more than an 80% chance that your grant application will be turned down. In some areas it is even worse: NIAID’s website announced that it is <a href="http://www.niaid.nih.gov/researchfunding/paybud/pages/paylines.aspx" target="_blank">currently funding only 6%</a> of R01s.</p>
<p>Why are R01s declining? Not for lack of interest: the number of applications last year was 29,627, an all-time high. Besides the overall budget problem, another problem is growing: the fondness of the NIH administration for big, top-down science projects, many times with the letters “ome” or “omics” attached.</p>
<p>Yes, the human genome was a huge success. Maybe the human microbiome will be too. But now NIH is pushing gigantic, top-down projects: ENCODE, 1000 Genomes, the cancer genome anatomy project (CGAP), the cancer genome atlas (TCGA), a new “brain-ome” project, and more. The more money is allocated to these big projects, the fewer R01s NIH can fund. For example, NIAID, with its 6% R01 success rate, has been spending tens of millions of dollars per year on 3 large <a href="http://www.niaid.nih.gov/labsandresources/resources/dmid/gsc/Pages/default.aspx" target="_blank">Microbial Genome Sequencing Center</a> contracts and tens of millions more on 5 large <a href="http://www.niaid.nih.gov/labsandresources/resources/dmid/brc/Pages/awards.aspx" target="_blank">Bioinformatics Resource Center</a> contracts. As far as I can tell, no one uses these bioinformatics resource centers for anything - in fact, virtually no one outside the centers even knows they exist. Furthermore, these large, top-down driven sequencing projects don’t address specific scientific hypotheses, but they produce something that the NIH administration seems to love: numbers. It’s impressive to see how many genomes they’ve sequenced, and it makes for nice press releases. But very often we simply don’t need these huge, top-down projects to answer scientific questions. Genome sequencing is cheap enough that we can include it in an R01 grant, if only NIH will stop pouring all its sequencing money into these huge, monolithic projects.</p>
<p>I’ll be the first person to cheer if Congress gets its act together and funds NIH at a level that allows reasonable growth. But whether or not that happens, the growth of big science projects, often created and run by administrators at NIH rather than scientists who have successfully competed for R01s, represents a major threat to the scientist-driven research that has served the world so well for the past 50 years. Many scientists are afraid to speak out against this trend, because by doing so we (yes, this includes me) are criticizing those same NIH administrators who manage our R01s. But someone has to say something. A 27% decline in the number of R01s over the past decade is not a good thing. Maybe it’s time to stop the omics train.</p>
Big data: Giving people what they want
2013-02-25T08:22:33+00:00
http://simplystats.github.io/2013/02/25/big-data-giving-people-what-they-want
<p>Netflix is <a href="http://www.nytimes.com/2013/02/25/business/media/for-house-of-cards-using-big-data-to-guarantee-its-popularity.html?smid=pl-share">using data to create original content for its subscribers</a>, the first example of which was <a href="http://en.wikipedia.org/wiki/House_of_Cards_(U.S._TV_series)">House of Cards</a>. Three main data points for this show were that (1) People like David Fincher (because they watch The Social Network, like, all the time); (2) People like Kevin Spacey; and (3) People liked the British version of House of Cards. Netflix obviously has tons of other data, including when you stop, pause, or rewind certain scenes in a movie or TV show.</p>
<blockquote>
<p>Netflix has always used data to decide which shows to license, and now that expertise is extended to the first-run. And there was not one trailer for “House of Cards,” there were many. Fans of Mr. Spacey saw trailers featuring him, women watching “Thelma and Louise” saw trailers featuring the show’s female characters and serious film buffs saw trailers that reflected Mr. Fincher’s touch.</p>
</blockquote>
<p>Using data to program television content is about as new as Brylcreem, but Netflix has the Big Data and has direct interaction with its viewers (so does Amazon Prime, which apparently is also looking to create original content). So the question is, does it work? My personal opinion is that it’s probably not any worse than previous methods, but may not be a lot better. But I would be delighted to be proven wrong. From my walks around the hallway here it seems House of Cards is in fact a good show (I haven’t seen it). But one observation probably isn’t enough to draw a conclusion here.</p>
<p>John Landgraf of FX Networks thinks Big Data won’t help:</p>
<blockquote>
<p>“Data can only tell you what people have liked before, not what they don’t know they are going to like in the future,” he said. “A good high-end programmer’s job is to find the white spaces in our collective psyche that aren’t filled by an existing television show,” adding, those choices were made “in a black box that data can never penetrate.”</p>
</blockquote>
<p>I was a bit confused when I read this, but I’m pretty sure the word “programmer” here refers to a television programmer. This quote is reminiscent of Steve Jobs’ line about how it’s not the consumer’s job to know what he/she wants. It also reminds me of financial markets, where all the data in the world can only tell you about the past.</p>
<p>In the end, can any of it help you predict the future? Or do some people just get lucky?</p>
Sunday data/statistics link roundup (2/24/2013)
2013-02-24T10:00:00+00:00
http://simplystats.github.io/2013/02/24/sunday-datastatistics-link-roundup-2242013
<ol>
<li>An attempt to create a version of <a href="https://github.com/amarder/stata-tutorial">knitr for stata</a> (via John M.). I like the direction that reproducible research is moving - toward easier use and more widespread adoption. The success of <a href="http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html">iPython notebook</a> is another great sign for the whole research area.</li>
<li>Email is <a href="http://simplystatistics.org/2012/12/05/an-idea-for-killing-email/">always a problem</a> for me. In the last week I’ve been introduced to a couple of really nice apps that give me insight into my email habits (<a href="http://www.gmailmeter.com/">Gmail meter</a> - via John M.) and that help me to send reminders to myself with minimal hassle (<a href="http://www.boomeranggmail.com/">Boomerang</a> - via Brian C.)</li>
<li>Andrew Lo proposes a new model for <a href="http://www.businessinsider.com/qa-with-mit-finance-professor-andrew-lo-2013-2">cancer research funding</a> based on his research in financial engineering. In light of the <a href="http://simplystatistics.org/2013/02/13/im-a-young-scientist-and-sequestration-will-hurt-me/">impending sequester</a> I’m interested in alternative funding models for data science/statistics in biology. But the concerns I have about both crowd-funding and Lo’s idea are whether the basic scientists get hosed and whether sustained funding at a level that will continue to attract top scientists is possible.</li>
<li>This is a <a href="http://healthland.time.com/2013/02/20/bitter-pill-why-medical-bills-are-killing-us/">really nice rundown</a> of why medical costs are so high. The key things in the article to me are that (1) he chased down the data about actual costs versus charges, (2) he highlights the role of the chargemaster - the price setter in medical centers - and how the prices are often set historically with yearly markups (not based on estimates of costs, etc.), and (3) he discusses key nuances like medical liability if the “best” tests aren’t run on everyone. Overall, it is definitely worth a read and this seems like a hugely important problem a statistician could really help with (if they could get their hands on the data).</li>
<li><a href="http://robohub.org/video-throwing-and-catching-an-inverted-pendulum-with-quadrocopters/">A really cool applied math project</a> where flying robot helicopters toss and catch a stick. Applied math can be super impressive, but they always still need a little boost from statistics: “This also involved bringing the insights gained from their initial and many subsequent experiments to bear on their overall system design. For example, a learning algorithm was added to account for model inaccuracies.” (via Rafa via MR).</li>
<li>We’ve talked about <a href="http://simplystatistics.org/2011/09/19/meetings/">trying to reduce meetings</a> to increase productivity before. Here is an article in the NYT talking about <a href="http://www.nytimes.com/2013/02/17/jobs/too-many-office-meetings-and-how-to-fight-back.html?_r=1&">the same issue</a> (via Rafa via Karl B.). Brian C. made an interesting observation though, that in a soft money research environment there should be evolutionary pressure against anything that doesn’t improve your ability to obtain research funding. Despite this, meetings proliferate in soft-money environments. So there must be some selective advantage to them! Another interesting project for a stats/evolutionary biology student.</li>
<li>If you have read all the Simply Statistics interviews and still want more, check out <a href="http://www.analyticstory.com/">http://www.analyticstory.com/</a>.</li>
</ol>
Tesla vs. NYT: Do the Data Really Tell All?
2013-02-18T09:10:51+00:00
http://simplystats.github.io/2013/02/18/tesla-vs-nyt-do-the-data-really-tell-all
<p>So far I’ve enjoyed the back and forth between Tesla Motors and New York Times reporter John Broder. The short version is:</p>
<ul>
<li>Broder <a href="http://www.nytimes.com/2013/02/10/automobiles/stalled-on-the-ev-highway.html?smid=pl-share">tested one of Tesla’s new Model S all-electric sedans</a> on a drive from Washington, D.C. to Groton, CT. Part of the reason for this specific trip was to make use of Tesla’s new supercharger stations along the route (one in Delaware and one in Connecticut).</li>
<li>Broder’s trip appeared to have some bumps, including running out of electricity at one point and requiring a tow.</li>
<li>After the review was published in the New York Times, Elon Musk, the CEO/Founder of Tesla, was apparently livid. He published a <a href="http://www.teslamotors.com/blog/most-peculiar-test-drive">detailed response</a> on the Tesla blog explaining that what Broder wrote in his review was not true and that “he simply did not accurately capture what happened and worked very hard to force our car to stop running”.</li>
<li>Broder has since <a href="http://wheels.blogs.nytimes.com/2013/02/14/that-tesla-data-what-it-says-and-what-it-doesnt/">responded to Musk’s response</a> with further explanation.</li>
</ul>
<p>Of course, the most interesting aspect of Musk’s response on the Tesla blog was that he published the data collected by the car during Broder’s test drive. When revelations of this data came about, I thought it was a bit creepy, but Musk makes clear in his post that they require data collection for all reviewers because of a previous bad experience. So, the fact that data were being collected on speed, cabin temperature, battery charge %, and rated range remaining, was presumably known to all, especially Broder. Given that you know Big Brother Musk is watching, it seems odd to deliberately lie in a widely read publication like the Times.</p>
<p>Having read the original article, Musk’s response, and Broder’s rebuttal, one thing is clear to me: there’s more than one way to see the data. The challenge here is that Broder had the car, but not the data, so had to rely on his personal recollection and notes. Musk has the data, but wasn’t there, and so has to rely on peering at graphs to interpret what happened on the trip.</p>
<p>One graph in particular was fascinating. Musk shows a <a href="http://www.teslamotors.com/sites/default/files/blog_images/speedmph0.jpg">periodic-looking segment of the speed graph</a> and concludes</p>
<blockquote>
<p>Instead of plugging in the car, <span style="text-decoration: underline;">he drove in circles</span> for over half a mile in a tiny, 100-space parking lot. When the Model S valiantly refused to die, he eventually plugged it in.</p>
</blockquote>
<p>Broder claims</p>
<blockquote>
<p>I drove around the Milford service plaza in the dark looking for the Supercharger, which is not prominently marked. I was not trying to drain the battery. (It was already on reserve power.) As soon as I found the Supercharger, I plugged the car in.</p>
</blockquote>
<p>Okay, so who’s right? Isn’t the data supposed to settle this?</p>
<p>In a few other cases in this story, the data support both people. In particular, it seems that there was some serious miscommunication between Broder and Tesla’s staff. I’m sure they have recordings of those telephone calls too, but they were not reproduced in Musk’s response.</p>
<p>The bottom line here, in my opinion, is that sometimes the data don’t tell all, especially “big data”. In the end, data are one thing, interpretation is another. Tesla had reams of black-box data from the car and yet some of the data still appear to be open to interpretation. My guess is that the data Tesla collects are not collected specifically to root out liars, and so are maybe not optimized for this purpose. Which leads to another key point about big data: they are often used “off-label”, i.e. not for the purpose for which they were originally collected.</p>
<p>I read this story with interest because I actually think Tesla is a fascinating company that makes cool products (that sadly, I could never afford). This episode will surely not be the end of Tesla or of the New York Times, but it illustrates to me that simply “having the data” doesn’t necessarily give you what you want.</p>
Sunday data/statistics link roundup (2/17/2013)
2013-02-17T10:53:21+00:00
http://simplystats.github.io/2013/02/17/sunday-datastatistics-link-roundup-2172013
<ol>
<li><a href="http://thewhyaxis.info/">The Why Axis</a> - discussion of important visualizations on the web. This is one I think a lot of people know about, but it is new to me. (via Thomas L. - p.s. I’m @leekgroup on Twitter, not @jtleek).</li>
<li><a href="http://arxiv.org/abs/0810.4672">This paper</a> says that people who “engage in outreach” (read: write blogs) tend to have higher academic output (hooray!) but that outreach itself doesn’t help their careers (boo!).</li>
<li>It is a little too late for this year, but next year you could <a href="http://blog.revolutionanalytics.com/2013/02/make-a-valentines-heart-with-r.html">make a Valentine with R</a>.</li>
<li><a href="http://emailcharter.org/">An email charter</a> (via Rafa). This is pretty similar to my <a href="http://simplystatistics.org/2011/09/23/getting-email-responses-from-busy-people/">getting email responses from busy people</a>. Not sure who scooped who. I’m still waiting for my <a href="http://simplystatistics.org/2012/12/05/an-idea-for-killing-email/">to-do list app</a>. <a href="http://www.mailboxapp.com/">Mailbox</a> is close, but I still want actions to be multiple choice or yes/no or delegation rather than just snoozing emails for later.</li>
<li><a href="http://faculty.washington.edu/rjl/pubs/topten/topten.pdf">Top ten reasons not to share your code, and why you should anyway</a>.</li>
</ol>
Interview with Nick Chamandy, statistician at Google
2013-02-15T12:09:01+00:00
http://simplystats.github.io/2013/02/15/interview-with-nick-chamandy-statistician-at-google
<div dir="ltr">
<div>
<strong>Nick Chamandy</strong>
</div>
<div>
</div>
<div>
<a href="http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/person_photo/" rel="attachment wp-att-1029"><img class="alignnone size-full wp-image-1029" alt="person_photo" src="http://simplystatistics.org/wp-content/uploads/2013/02/person_photo.png" width="190" height="235" /></a>
</div>
<div>
</div>
<div>
Nick Chamandy received his M.S. in statistics from the University of Chicago, his Ph.D. in statistics at McGill University and joined Google as a statistician. We talked to him about how he ended up at Google, what software he uses, and how big the Google data sets are. To read more interviews - check out our <a href="http://simplystatistics.org/interviews/">interviews page</a>.
</div>
<div>
</div>
<div>
</div>
<div>
</div>
<div>
</div>
<div>
<strong>SS: Which term applies to you: data scientist, statistician, computer scientist, or something else?</strong>
</div>
<p>
NC: I usually use the term Statistician, but at Google we are also known as Data Scientists or Quantitative Analysts. All of these titles apply to some degree. As with many statisticians, my day to day job is a mixture of analyzing data, building models, thinking about experiments, and trying to figure out how to deal with large and complex data structures. When posting job opportunities, we are cognizant that people from different academic fields tend to use different language, and we don't want to miss out on a great candidate because he or she comes from a non-statistics background and doesn't search for the right keyword. On my team alone, we have had successful "statisticians" with degrees in statistics, electrical engineering, econometrics, mathematics, computer science, and even physics. All are passionate about data and about tackling challenging inference problems.
</p>
<div>
<p>
<strong>SS: How did you end up at Google?</strong>
</p>
</div>
<p>
Coming out of my PhD program at McGill, I was somewhat on the fence about the academia vs. industry decision. Ideally I wanted an opportunity that combined the intellectual freedom and stimulation of academia with the concreteness and real-world relevance of industrial problems. Google seemed to me at the time (and still does) to be by far the most exciting place to pursue that happy medium. The culture at Google emphasizes independent thought and idea generation, and the data are staggering in both size and complexity. That places us squarely on the "New Frontier" of statistical innovation, which is really motivating. I don't know of too many other places where you can both solve a research problem and have an impact on a multi-billion dollar business in the same day.
</p>
<div>
<p>
<strong>SS: Is your work related to the work you did as a Ph.D. student?</strong>
</p>
</div>
<p>
NC: Although I apply many of the skills I learned in grad school on a daily basis, my PhD research was on Gaussian random fields, with particular application to brain imaging data. The bulk of my work at Google is in other areas, since I work for the Ads Quality Team, whose goal is to quantify and improve the experience that users have interacting with text ads on the <a href="http://google.com/" target="_blank">google.com</a> search results page. Once in a while though, I come across data sets with a spatial or spatio-temporal component and I get the opportunity to leverage my experience in that area. Some examples are eye-tracking studies run by the user research lab (measuring user engagement on different parts of the search page), and click pattern data. These data sets typically violate many of the assumptions made in neuroimaging applications, notably smoothness and isotropy conditions. And they are predominantly 2-D applications, as opposed to 3-D or higher.
</p>
<div>
<p>
<strong>SS: What is your programming language of choice: R, Python, or something else?</strong>
</p>
</div>
<p>
NC: I use R, and occasionally MATLAB, for data analysis. There is a large, active and extremely knowledgeable R community at Google. Because of the scale of Google data, however, R is typically only useful after a massive data aggregation step has been accomplished. Before that, the data are not only too large for R to handle, but are stored on many thousands of machines. This step is usually accomplished using the MapReduce parallel computing framework, and there are several Google-developed scripting languages that can be used for this purpose, including Go. We also have an interactive, ad hoc query language which can be applied to massive, "sharded" data sets (even those with a nested structure), and for which there is an R API. The engineers at Google have also developed a truly impressive package for massive parallelization of R computations on hundreds or thousands of machines. I typically use shell or Python scripts for chaining together data aggregation and analysis steps into "pipelines".
</p>
<div>
<p>
<strong>SS: How big are the data sets you typically handle? Do you extract them yourself or does someone else extract them for you?</strong>
</p>
</div>
<p>
Our data sets contain billions of observations before any aggregation is done. Even after aggregating down to a more manageable size, they can easily consist of 10s of millions of rows, and on the order of 100s of columns. Sometimes they are smaller, depending on the problem of interest. In the vast majority of cases, the statistician pulls his or her own data -- this is an important part of the Google statistician culture. It is not purely a question of self-sufficiency. There is a strong belief that without becoming intimate with the raw data structure, and the many considerations involved in filtering, cleaning, and aggregating the data, the statistician can never truly hope to have a complete understanding of the data. For massive and complex data, there are sometimes as many subtleties in whittling down to the right data set as there are in choosing or implementing the right analysis procedure. Also, we want to guard against creating a class system among data analysts -- every statistician, whether BS, MS or PhD level, is expected to have competence in data pulling. That way, nobody becomes the designated data puller for a colleague. That said, we always feel comfortable asking an engineer or other statistician for help using a particular language, code library, or tool for the purpose of data-pulling. That is another important value of the Google culture -- sharing knowledge and helping others get "unstuck".
</p>
<div>
<p>
<strong>SS: Do you work collaboratively with other statisticians/computer scientists at Google? How do projects you work on get integrated into Google's products? Is there a process of approval?</strong>
</p>
</div>
<p>
Yes, collaboration with both statisticians and engineers is a huge part of working at Google. In the Ads Team we work on a variety of flavours of statistical problems, spanning but not limited to the following categories: (1) Retrospective analysis with the goal of understanding the way users and advertisers interact with our system; (2) Designing and running randomized experiments to measure the impact of changes to our systems; (3) Developing metrics, statistical methods and tools to help evaluate experiment data and inform decision-making; (4) Building models and signals which feed directly into our engineering systems. "Systems" here are things like the algorithms that decide which ads to display for a given query and context.
</p>
<p>
Clearly (2) and (4) require deep collaboration with engineers -- they can make the changes to our production codebase which deploy a new experiment or launch a new feature in a prediction model. There are multiple engineering and product approval steps involved here, meant to avoid introducing bugs or features which harm the user experience. We work with engineers and computer scientists on (1) and (3) as well, but to a lesser degree. Engineers and computer scientists tend to be extremely bright and mathematically-minded people, so their feedback on our analyses, methodology and evaluation tools is pretty invaluable!
</p>
<div>
<p>
<strong>SS: Who have been good mentors to you during your career? Is there something in particular they did to help you?</strong>
</p>
</div>
<p>
I've had numerous important mentors at Google (in addition, of course, to my thesis advisors and professors at McGill). Largely they are statisticians who have worked in industry for a number of years and have mastered the delicate balance between deep-thinking a problem and producing something quick and dirty that can have an immediate impact. Grad school teaches us to spend weeks thinking about a problem and coming up with an elegant or novel methodology to solve it (sometimes without even looking at data). This process certainly has its place, but in some contexts a better outcome is to produce an unsophisticated but useful and data-driven answer, and then refine it further as needed. Sometimes the simple answer provides 80% of the benefit, and there is no reason to deprive the consumers of your method this short-term win while you optimize for the remaining 20%. By encouraging the "launch and iterate" mentality for which Google is well-known, my mentors have helped me produce analysis, models and methods that have a greater and more immediate impact.
</p>
<div>
<p>
<strong>SS: What skills do you think are most important for statisticians/data scientists moving into the tech industry?</strong>
</p>
</div>
<p>
Broadly, statisticians entering the tech industry should do so with an open mind. Technically speaking, they should be comfortable with heavy-tailed, poorly-behaved distributions that fail to conform to assumptions or data structures underlying the models taught in most statistics classes. They should not be overly attached to the ways in which they currently interact with data sets, since most of these don't work for web-scale applications. They should be receptive to statistical techniques that require massive amounts of data or vast computing networks, since many tech companies have these resources at their disposal. That said, a statistician interested in the tech industry should not feel discouraged if he or she has not already mastered large-scale computing or the hottest programming languages. To me, it is less about what skills one must brush up on, and much more about a willingness to adaptively learn new skills and adjust one's attitude to be in tune with the statistical nuances and tradeoffs relevant to this New Frontier of statistics. Statisticians in the tech industry will be well-served by the classical theory and techniques they have mastered, but at times must be willing to re-learn things that they have come to regard as trivial. Standard procedures and calculations can quickly become formidable when the data are massive and complex.
</p>
</div>
<div>
</div>
I'm a young scientist and sequestration will hurt me
2013-02-13T14:07:13+00:00
http://simplystats.github.io/2013/02/13/im-a-young-scientist-and-sequestration-will-hurt-me
<p>I’m a biostatistician. That means that I help scientists and doctors analyze their medical data to try to figure out new screening tools, new therapies, and new ways to improve patients’ health. I’m also a professor. I spend a good fraction of my time teaching students about analyzing data in classes here at my university and <a href="https://www.coursera.org/course/dataanalysis">online</a>. Big data/data analysis is an area of growth for the U.S. economy and some have even suggested that there will be a <a href="http://online.wsj.com/article/SB10001424052702304723304577365700368073674.html">critical shortage</a> of trained data analysts.</p>
<p>I have other responsibilities but these are the two biggies - teaching and research. I work really hard to be good at them because I’m passionate about education and I’m passionate about helping people. I’m by no means the only (relatively) young person with this same drive. I would guess this is a big reason why a lot of people become scientists. They want to contribute to both our current knowledge (research) and the future of knowledge (teaching).</p>
<p>My salary comes from two places - the students who pay tuition at our school and, to a much larger extent, the federal government’s research funding through the NIH. So you are paying my salary. The way that the NIH distributes that funding is through a serious and very competitive process. I submit proposals of my absolute best ideas, so do all the other scientists in the U.S., and they are evaluated by yet another group of scientists who don’t have a vested interest in our grants. This system is the reason that only the best, most rigorously vetted science is funded by taxpayer money.</p>
<p>It is very hard to get a grant. In 2012, <a href="http://www.einstein.yu.edu/administration/grant-support/nih-paylines.aspx">between 7% and 16%</a> of new projects were funded. So you have to write a proposal that is better than 84-93% of all other proposals being submitted by other really, really smart and dedicated scientists. The practical result is that it is already very difficult for a good young scientist to get a grant. The NIH recognizes this and implements special measures for new scientists to get grants, but it still isn’t easy by any means.</p>
<p>Sequestration will likely dramatically reduce the fraction of grants that get funded. Already on that website, the “payline”, or cutoff for funding, has dropped from 10% of grants in 2012 to 6% in 2013 for some NIH institutes. If sequestration goes through, it will be worse - maybe a lot worse. The result is that it will go from being really hard to get individual grants to nearly impossible. If that happens, many young scientists like me won’t be able to get grants. No matter how passionate we are about helping people or doing the right thing, many of us will have to stop being researchers and scientists and get other jobs to pay the bills - we have to eat.</p>
<p>So if sequestration or other draconian cuts to the NIH go through, they will hurt me and other junior scientists like me. It will make it harder - if not impossible - for me to get grants. It will affect whether I can afford to educate the future generation of students who will analyze all the data we are creating. It will create dramatic uncertainty/difficulty in the lives of the young biological scientists I work with who may not be able to rely on funding from collaborative grants to the extent that I can. In the end, this will hurt me, it will hurt my other scientific colleagues, and it could dramatically reduce our competitiveness in science, technology, engineering, and mathematics (STEM) for years to come. Steven <a href="http://genome.fieldofscience.com/2013/01/congress-is-killing-medical-research.html">wrote this up beautifully</a> on his blog.</p>
<p>I know that these cuts will also affect the lives of many other people from all walks of life, not just scientists. So I hope that Congress will do the right thing and decide that hurting all these people isn’t worth the political points they will score - on both sides. Sequestration isn’t the right choice - it is the choice that was most politically expedient when people’s backs were against the wall.</p>
<p>Instead of making dramatic, untested, and possibly disastrous cuts across the board for political reasons, let’s do what scientists and statisticians have been doing for years when deciding which drugs work and don’t. Let’s run controlled studies and evaluate the impact of budget cuts to different programs - as Ben Goldacre and his colleagues have so <a href="http://www.cabinetoffice.gov.uk/sites/default/files/resources/TLA-1906126.pdf">beautifully laid out in their proposal</a>. That way we can bring our spending into line, but sensibly and based on evidence, rather than the politics of the moment or untested economic models not based on careful experimentation.</p>
Sunday data/statistics link roundup (2/10/2013)
2013-02-10T20:29:06+00:00
http://simplystats.github.io/2013/02/10/sunday-datastatistics-link-roundup-2102013
<ol>
<li><a href="http://www.grantland.com/blog/the-triangle/post/_/id/50343/the-height-of-wonkery-an-in-depth-look-at-the-nba-with-the-most-innovative-technology-available">An article</a> about how NBA teams have installed cameras that allow their analysts to collect information on every movement/pass/play that is performed in a game. I think the most interesting part for me would be how you would define features. They talk about, for example, how many times a player drives. I wonder if they have an intern in the basement manually annotating those features or if they are using automatic detection algorithms (via Marginal Revolution).</li>
<li>Our friend Florian <a href="https://scientificbsides.wordpress.com/2013/02/10/maximal-information-coefficient-just-a-messed-up-estimate-of-mutual-information/">jumps into the MIC debate</a>. I haven’t followed the debate very closely, but I agree with Florian that if a theory paper <a href="http://simplystatistics.org/2012/01/26/when-should-statistics-papers-be-published-in-science/"> is published in a top journal</a>, later falling back on heuristics and hand waving seems somewhat unsatisfying.</li>
<li>An <a href="http://www.the-scientist.com/?articles.view/articleNo/33968/title/Opinion--Publish-Negative-Results/">opinion piece</a> pushing the Journal of Negative Results in Biomedicine. If you can’t get your negative result in there, think about <a href="http://simplystatistics.org/2011/09/28/the-p-0-05-journal/">our P > 0.05 journal</a> :-).</li>
<li>This has nothing to do with statistics/data but is a bit of nerd greatness. Run this command from a terminal: traceroute 216.81.59.173.</li>
<li><a href="http://www.viewtific.com/elections-performance-inde/">A data visualization</a> describing the effectiveness of each state’s election administrations. I think that it is a really cool idea, although I’m not sure I understand the index. A couple of related plots are <a href="http://www.elections.state.md.us/press_room/2012_stats_general/2012_general_election_day_turnout_and_distance.pdf">this one</a> that shows distance to polling place versus election day turnout and <a href="http://www.elections.state.md.us/press_room/2012_stats_general/2012_general_early_voting_turnout_and_distance.pdf">this one</a> that shows the same thing for early voting. It’s pretty interesting how dramatically different the plots are.</li>
<li>Postdoc Sherri Rose <a href="http://stattrak.amstat.org/2013/02/01/statisticians-place-in-big-data/">writes about big data and junior statisticians</a> at Stattrak. My favorite quote: “We need to take the time to understand the science behind our projects before applying and developing new methods. The importance of defining our research questions will not change as methods progress and technology advances”.</li>
</ol>
Issues with reproducibility at scale on Coursera
2013-02-06T10:11:20+00:00
http://simplystats.github.io/2013/02/06/issues-with-reproducibility-at-scale-on-coursera
<p>As you know, we are <a href="http://simplystatistics.org/?s=reproducible+research">big fans of reproducible research</a> here at Simply Statistics. The <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">Duke saga</a> around the lack of reproducibility in the analyses performed by Anil Potti and the subsequent fallout drove the importance of this topic home.</p>
<p>So when I started teaching a course on <a href="https://www.coursera.org/course/dataanalysis">Data Analysis for Coursera</a>, of course I wanted to focus on reproducible research. The students in the class will be performing two data analyses during the course. They will be peer evaluated using a rubric specifically designed for evaluating data analyses at scale. One of the components of the rubric was to evaluate whether the code people submitted with their assignments reproduced all the numbers in the assignment.</p>
<p>Unfortunately, I just had to cancel the reproducibility component of the first data analysis assignment. Here are the things I realized while trying to set up the process that may seem obvious but weren’t to me when I was designing the rubric:</p>
<ol>
<li><strong>Security</strong> I realized (thanks to a very smart subset of the students in the class who posted on the message boards) that there is a major security issue with exchanging R code and data files with each other. Even if they use only the data downloaded from the official course website, it is possible that people could use the code to try to hack/do nefarious things to each other. The students in the class are great and the probability of this happening is small, but with a class this size, it isn’t worth the risk.</li>
<li><strong>Compatibility</strong> I’m requiring that people use R for the course. Even so, people are working on every possible operating system, with many different versions of R. In this scenario, it is entirely conceivable for a person to write totally reproducible code that works on their machine but won’t work on a random peer reviewer’s machine.</li>
<li><strong>Computing Resources</strong> The range of computing resources used by people in the class is huge, from modern clusters to a single old beat-up laptop. Inefficient code on a fast computer is fine, but on a slow computer with little memory it could mean the difference between reproducibility and crashed computers.</li>
</ol>
<p>Overall, I think the solution is to run some kind of EC2 instance with a standardized set of software. That is the only thing I can think of that would be scalable to a class this size. On the other hand, that would be expensive, a pain to maintain, and would require everyone to run code on EC2.</p>
<p>Regardless, it is a super interesting question. How do you do reproducibility at scale?</p>
Sunday data/statistics link roundup (2/3/2013)
2013-02-03T10:00:23+00:00
http://simplystats.github.io/2013/02/03/sunday-datastatistics-link-roundup-232013
<ol>
<li>My student, <a href="http://www.biostat.jhsph.edu/~hiparker/">Hilary</a>, wrote a post about how her name is the most <a href="http://hilaryparker.com/2013/01/30/hilary-the-most-poisoned-baby-name-in-us-history/">poisoned in history</a>. A poisoned name is a name that quickly loses popularity year over year. The post is awesome for the following reasons: (1) she is a good/funny writer and has lots of great links in the post, (2) she very clearly explains concepts that are widely used in biostatistics like relative risk, and (3) she took the time to try to really figure out all the trends she saw in the name popularity. I’m not the only one who thinks it is a good post, it was <a href="http://nymag.com/thecut/2013/01/hillary-most-poisoned-baby-name-in-us-history.html">reprinted in New York Magazine</a> and went viral this last week.</li>
<li>In honor of it being Super Bowl Sunday (go Ravens!) here is a post about the reasons why it often doesn’t make sense to consider <a href="http://www.footballperspective.com/what-are-the-odds-of-that/">the odds of an event retrospectively</a> due to the Wyatt Earp effect. Another way to think about it is, if you have a big tournament with tons of teams - someone will win. But at the very beginning, any team had a pretty small chance of winning all the games and taking the championship. If we wait until some team wins and calculate their pre-tournament odds of winning, it will probably be small. (via David S.)</li>
<li><a href="http://www.nytimes.com/2013/02/02/opinion/health-cares-trick-coin.html?smid=fb-share&_r=0">A new article</a> by Ben Goldacre in the NYT about unreported clinical trials. This is a major issue and Ben is all over it with his <a href="http://www.alltrials.net/">All Trials</a> project. This is another reason we need a <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">deterministic statistical machine</a>. Don’t worry, we are working on building it.</li>
<li>Even though it is Super Bowl Sunday, I’m still eagerly looking forward to spring and the real sport of baseball. Rafa sends along this link analyzing the effectiveness of patient hitters <a href="http://www.hardballtimes.com/main/article/game-theory-and-first-pitch/">when they swing at a first strike</a>. It looks like it is only a big advantage if you are an elite hitter.</li>
<li>An article in Wired on the <a href="http://www.wired.com/opinion/2013/01/forget-big-data-think-long-data/">importance of long data</a>. The article talks about how in addition to cross-sectional big data, we might also want to be looking at data over time - possibly large amounts of time. I think the title is maybe a little over the top, but the point is well taken. It turns out this is something a bunch of my colleagues in imaging and environmental health have been working on/talking about for a while. Longitudinal/time series big data seems like an important and wide-open field (via Nick R.).</li>
</ol>
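<p>The tournament argument in item 2 can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the linked post: the bracket size of 64 is hypothetical, every game is treated as a fair coin flip, and the function names are made up for this example.</p>

```python
import random

def simulate_champion(n_teams, rng):
    """Play a single-elimination bracket of evenly matched teams; return the winner."""
    teams = list(range(n_teams))  # n_teams is assumed to be a power of 2
    while len(teams) > 1:
        # Pair up adjacent teams; each game is a fair coin flip (an assumption).
        teams = [rng.choice(pair) for pair in zip(teams[::2], teams[1::2])]
    return teams[0]

def pre_tournament_odds(n_teams):
    """With evenly matched teams, every team starts with the same chance."""
    return 1 / n_teams

rng = random.Random(2013)
n_teams = 64  # hypothetical bracket size
champion = simulate_champion(n_teams, rng)
print(f"Champion: team {champion}, "
      f"pre-tournament odds {pre_tournament_odds(n_teams):.4f}")
```

<p>Some team always wins, yet whichever team it is started with odds of only 1/64. Looking back at the champion's small pre-tournament odds therefore says little about whether anything surprising happened.</p>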
paste0 is statistical computing's most influential contribution of the 21st century
2013-01-31T11:11:24+00:00
http://simplystats.github.io/2013/01/31/paste0-is-statistical-computings-most-influential-contribution-of-the-21st-century
<p>The day I discovered paste0 I literally cried. No more paste(bla,bla, sep=””). While looking through code written by a student who did not know about paste0 I started pondering about how many person hours it has saved humanity. So typing sep=”” takes about 1 second. We R users use paste about 100 times a day and there are about 1,000,000 R users in the world. That’s over 3 person years a day! Next up read.table0 (who doesn’t want as.is to be TRUE?).</p>
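<p>For anyone who wants to check the back-of-the-envelope arithmetic, here is a quick sketch in Python. The three inputs come straight from the paragraph above; the only added assumption is that a person year means a full calendar year of seconds.</p>

```python
SECONDS_PER_USE = 1      # time to type sep="" (from the post)
USES_PER_DAY = 100       # paste calls per R user per day (from the post)
R_USERS = 1_000_000      # rough number of R users (from the post)

# Assumption: a person year is a full calendar year, not a working year.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

seconds_saved_per_day = SECONDS_PER_USE * USES_PER_DAY * R_USERS
person_years_per_day = seconds_saved_per_day / SECONDS_PER_YEAR
print(f"{person_years_per_day:.2f} person-years saved per day")  # 3.17
```

<p>If a person year were counted as working hours only, the figure would be larger still.</p>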
Data supports claim that if Kobe stops ball hogging the Lakers will win more
2013-01-28T11:33:26+00:00
http://simplystats.github.io/2013/01/28/data-supports-claim-that-if-kobe-stops-ball-hogging-the-lakers-will-win-more
<p>The Lakers recently snapped a four game losing streak. In that game Kobe, the league leader in field goal attempts and missed shots, had a season low of 14 points but a season high of 14 assists. This makes sense to me since Kobe shooting less means more efficient players are shooting more. Kobe has a lower career <a style="font-size: 16px;" href="http://www.basketball-reference.com/leaders/ts_pct_active.html">true shooting %</a> than Gasol, Howard and Nash (ranked 17, 3 and 2 respectively). Despite this he takes more than 1/4 of the shots. Commentators usually praise top scorers no matter what, but recently they <a href="http://espn.go.com/los-angeles/nba/story/_/id/8884925/los-angeles-lakers-coach-mike-dantoni-says-kobe-bryant-assists-looked-sacrificing">have started talking about his assists</a> and noticed that the Lakers are 6-22 when Kobe has more than 19 field goal attempts and 12-3 in the rest of the games.</p>
<p><a href="http://simplystatistics.org/2013/01/28/data-supports-claim-that-if-kobe-stops-ball-hogging-the-lakers-will-win-more/kobelakers-2/" rel="attachment wp-att-978"><img class="alignnone size-medium wp-image-978" alt="kobelakers" src="http://simplystatistics.org/wp-content/uploads/2013/01/kobelakers1-300x300.png" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2013/01/kobelakers1-150x150.png 150w, http://simplystatistics.org/wp-content/uploads/2013/01/kobelakers1-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2013/01/kobelakers1-1024x1024.png 1024w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>This graph shows score differential versus % of shots taken by Kobe*. Linear regression suggests that an increase of 1% in % of shots taken by Kobe results in a drop of 1.16 points (+/- 0.22) in score differential. It also suggests that when Kobe takes 15% of the shots, the Lakers win by an average of about 10 points, when he takes 30% (not a rare occurrence) they lose by an average of about 5. Of course we should not take this regression analysis too seriously but it’s hard to ignore the fact that when Kobe takes less than <del>23</del> 23.25% of the shots the Lakers are 13-1.</p>
<p>I suspect that this relationship is not unique to Kobe and the Lakers. In general, teams with a more balanced attack probably do better. Testing this could be a good project for <a href="https://www.coursera.org/course/dataanalysis">Jeff’s class</a>.</p>
<p>* I approximated shots taken as field goal attempts + floor(0.5 x Free Throw Attempts).</p>
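<p>The approximation in the footnote can be written out as a short Python sketch. The function names and the box-score numbers are hypothetical and are not taken from the data file. Applying the same formula to the team totals to get the denominator is also an assumption, since the footnote only defines shots taken.</p>

```python
import math

def approx_shots(field_goal_attempts, free_throw_attempts):
    """Shots taken, approximated as FGA + floor(0.5 x FTA), per the footnote."""
    return field_goal_attempts + math.floor(0.5 * free_throw_attempts)

def shot_share(player_fga, player_fta, team_fga, team_fta):
    """Percent of the team's approximate shots taken by one player.

    Assumption: team shots use the same approximation as player shots.
    """
    return 100 * approx_shots(player_fga, player_fta) / approx_shots(team_fga, team_fta)

# Hypothetical box-score line, not taken from the data file.
share = shot_share(player_fga=20, player_fta=7, team_fga=85, team_fta=25)
print(f"{share:.1f}% of shots")  # 23 of 97 approximate shots, about 23.7%
```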
<p>Data is <a href="http://rafalab.jhsph.edu/simplystats/kobe2.txt">here</a>.</p>
<p><strong>Update</strong>: Commentator Sidney fixed some entries in the data file. Data and plot updated.</p>
Sunday data/statistics link roundup (1/27/2013)
2013-01-27T10:26:20+00:00
http://simplystats.github.io/2013/01/27/sunday-datastatistics-link-roundup-1272013
<ol>
<li>Wisconsin is <a href="http://marginalrevolution.com/marginalrevolution/2013/01/the-wisconsin-revolution.html">decoupling the education and degree granting components</a> of education. This means if you take a MOOC like <a href="https://www.coursera.org/course/dataanalysis">mine</a>, <a href="https://www.coursera.org/course/biostats">Brian’s</a> or <a href="https://www.coursera.org/course/compdata">Roger’s</a> and there is an equivalent class to pass at Wisconsin, you can take the exam and get credit. This is big. (via Rafa)</li>
<li><a href="http://cscheid.net/static/mlb-hall-of-fame-voting/#state=state%5Bshown_histograms%5D%5B%5D=-1&state%5Bshown_histograms%5D%5B%5D=2&state%5Bshown_histograms%5D%5B%5D=14&state%5Bshown_histograms%5D%5B%5D=12&state%5Bshown_histograms%5D%5B%5D=4&state%5Bshown_histograms%5D%5B%5D=11&state%5Bshown_histograms%5D%5B%5D=18">This</a> is a really cool MLB visualisation done with d3.js and Crossfilter. It was also prototyped in R, which makes it even cooler. (via Rafa via Chris V.)</li>
<li>Harvard is <a href="http://www.guardian.co.uk/science/2012/apr/24/harvard-university-journal-publishers-prices">encouraging their professors</a> to only publish in open access journals and to resign from closed access journals. This is another major change and bodes well for the future of open science (again via Rafa - noticing a theme this week?).</li>
<li>This deserves a post all to itself, but Greece is <a href="http://www.ekathimerini.com/4dcgi/_w_articles_wsite3_1_26/01/2013_480606">prosecuting a statistician</a> for analyzing data in a way that changed their deficit figure. I wonder what the folks at the International Year of Statistics think about that? (via Alex N.)</li>
<li>Be on the twitters at 10:30AM Tuesday and follow the hashtag <a href="https://twitter.com/search/realtime?q=%23jhsph753&src=typd">#jhsph753</a> if you want to hear all the crazy stuff I tell my students when I’m running on no sleep.</li>
<li>Thomas at StatsChat is <a href="http://www.statschat.org.nz/2013/01/24/enough-with-the-nobel-correlations-already/">fed up</a> with Nobel correlations. Although I’m still partial to the <a href="http://www.statschat.org.nz/2012/10/12/even-better-than-chocolate/">length of country name</a> association.</li>
</ol>
My advanced methods class is now being live-tweeted
2013-01-25T09:57:41+00:00
http://simplystats.github.io/2013/01/25/my-advanced-methods-class-is-now-being-live-tweeted
<p>A student in my class is going to be live-tweeting my (often silly/controversial) comments in the advanced/Ph.D. data analysis and methods class I’m teaching here at Hopkins. The hashtag is #jhsph753 and the class runs from 10:30am to 12:00PM EST. Check it out <a href="https://twitter.com/search/realtime?q=%23jhsph753&src=hash">here</a>.</p>
Why I disagree with Andrew Gelman's critique of my paper about the rate of false discoveries in the medical literature
2013-01-24T14:55:17+00:00
http://simplystats.github.io/2013/01/24/why-i-disagree-with-andrew-gelmans-critique-of-my-paper-about-the-rate-of-false-discoveries-in-the-medical-literature
<p>With a colleague, I wrote a paper titled, <a href="http://arxiv.org/abs/1301.3718">“Empirical estimates suggest most published medical research is true”</a> which we quietly posted to ArXiv a few days ago. I posted to the ArXiv in the interest of open science and because we didn’t want to delay the dissemination of our approach during the long review process. I didn’t email anyone about the paper or talk to anyone about it, except my friends here locally.</p>
<p>I underestimated the internet. Yesterday, the paper was covered in <a href="http://www.technologyreview.com/view/510126/the-statistical-puzzle-over-how-much-biomedical-research-is-wrong/">this piece</a> in the MIT Technology Review. That exposure was enough for the paper to appear in a few different outlets. I’m totally comfortable with the paper, but was not anticipating all of the attention so quickly.</p>
<p>In particular, I was a little surprised to see it appear on Andrew Gelman’s blog with the disheartening title, <a href="http://andrewgelman.com/2013/01/i-dont-believe-the-paper-empirical-estimates-suggest-most-published-medical-research-is-true-that-is-the-claim-may-very-well-be-true-but-im-not-at-all-convinced-by-the-analysis-being-used/">“I don’t believe the paper, “Empirical estimates suggest most published medical research is true.” That is, most published medical research may well be true, but I’m not at all convinced by the analysis being used to support this claim.”</a> I responded briefly this morning to his post, but then had to run off to teach class. After thinking about it a little more, I realized I have some objections to his critique.</p>
<p>His main criticisms of our paper are: (1) that we work with type I/type II errors instead of type S versus type M errors (paragraph 2), (2) that we didn’t look at replication but instead performed inference (paragraph 4), (3) that there is p-value hacking going on (paragraph 4), and (4) that our model does not apply because p-value hacking may change the assumptions underlying this model in genomics.</p>
<p>I will handle each of these individually:</p>
<p>(1) This is primarily semantics. Andrew is concerned with interesting/uninteresting with his Type S and Type M Errors. We are concerned with true/false positives as defined by type I and type II errors (and a null hypothesis). You might believe that the null is never true - but then by the standards of the original paper all published research is true. Or you might say that a non-null result might have an effect size too small to be interesting - but the framework being used here is hypothesis testing and we have stated how we defined a true positive in that framework explicitly. We define the error rate by the rate of classifying things as null when they should be classified as alternative and vice versa. We then estimate the false discovery rate, under the framework used to calculate those p-values. So this is not a criticism of our work with evidence, rather it is a stated difference of opinion about the philosophy of statistics not supported by conclusive data.</p>
<p>(2) Gelman says he originally thought we would follow up specific p-values to see if the results replicated and makes that a critique of our paper. That would definitely be another approach to the problem. Instead, we chose to perform statistical inference using justified and widely used statistical techniques. Others have taken the replication route, but of course that approach too would be fraught with difficulty - are the exact conditions replicable (e.g. for a clinical trial), can we sample from the same population (if it has changed or is hard to sample), and what do we mean by replicates (would two p-values less than 0.05 be convincing?). This again is not a criticism of our approach, but a statement of another, different analysis Gelman was wishing to see.</p>
<p>(3)-(4) Gelman states, “You don’t have to be Uri Simonsohn to know that there’s a lot of p-hacking going on.” Indeed Uri Simonsohn <a href="http://pss.sagepub.com/content/22/11/1359.full.pdf+html">wrote a paper</a> where he talks about the potential for p-value hacking. He does not collect data from real experiments/analyses, but uses simulations, theoretical arguments, and prospective experiments designed to show specific problems. While these arguments are useful and informative, they give no indication of the extent of p-value hacking in the medical literature. So this argument is made on the basis of a supposition by Gelman that this happens broadly, rather than on data.</p>
<p>My objection to his criticism is that his critiques are based primarily on philosophy (1), a wish that we had done the study a different way (2), and assumptions about the way science works with only anecdotal evidence (3-4).</p>
<p>One thing you could very reasonably argue is how sensitive our approach is to violations of our assumptions (which Gelman implied with criticisms 3-4). To address this, my co-author and I have now performed a simulation analysis. In the first simulation, we considered a case where every p-value less than 0.05 was reported and the p-values were uniformly distributed, just as our assumptions would state. We then plot our estimates of the swfdr versus the truth. Here our estimator works pretty well.</p>
<p><a href="http://simplystatistics.org/?attachment_id=942" rel="attachment wp-att-942"><img class="alignnone size-medium wp-image-942" alt="all-significant" src="http://simplystatistics.org/wp-content/uploads/2013/01/all-significant-300x228.jpg" width="300" height="228" srcset="http://simplystatistics.org/wp-content/uploads/2013/01/all-significant-300x228.jpg 300w, http://simplystatistics.org/wp-content/uploads/2013/01/all-significant-1024x779.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2013/01/all-significant.jpg 1266w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>We also simulate a pretty serious p-value hacking scenario where people report only the minimum p-value they observe out of 20 p-values. Here our assumption of uniformity is strongly violated. But we still get pretty accurate estimates of the swfdr for the range of values (14%) we report in our paper.</p>
<p><a href="http://simplystatistics.org/2013/01/24/why-i-disagree-with-andrew-gelmans-critique-of-my-paper-about-the-rate-of-false-discoveries-in-the-medical-literature/only-min-2/" rel="attachment wp-att-944"><img class="alignnone size-medium wp-image-944" alt="only-min" src="http://simplystatistics.org/wp-content/uploads/2013/01/only-min-300x228.jpg" width="300" height="228" srcset="http://simplystatistics.org/wp-content/uploads/2013/01/only-min-300x228.jpg 300w, http://simplystatistics.org/wp-content/uploads/2013/01/only-min-1024x779.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2013/01/only-min.jpg 1266w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>Since I recognize this is only a couple of simulations, I have also put the code up on Github with the rest of our code for the paper so other people can test it out.</p>
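<p>To see why the min-of-20 scenario violates uniformity, here is a minimal sketch. It is not the code from our repository and it does not compute the swfdr estimator; the function names, sample sizes, and seed are illustrative assumptions. Under the null each p-value is uniform on (0, 1), so the minimum of 20 independent p-values falls below 0.05 with probability 1 - 0.95^20 rather than 0.05.</p>

```python
import random

def simulate_reported_pvalues(n_studies, n_tests, seed=0):
    """Each study draws n_tests null p-values (uniform on (0, 1))
    and reports only the smallest one."""
    rng = random.Random(seed)
    return [min(rng.random() for _ in range(n_tests)) for _ in range(n_studies)]

def fraction_below(pvalues, threshold):
    """Share of reported p-values under the significance threshold."""
    return sum(p < threshold for p in pvalues) / len(pvalues)

honest = simulate_reported_pvalues(20000, n_tests=1)   # one test, always reported
hacked = simulate_reported_pvalues(20000, n_tests=20)  # report only the minimum of 20

analytic_hacked = 1 - (1 - 0.05) ** 20  # P(min of 20 uniforms < 0.05)

print("honest fraction below 0.05:", fraction_below(honest, 0.05))
print("hacked fraction below 0.05:", fraction_below(hacked, 0.05))
print("analytic value for hacked: ", analytic_hacked)
```

<p>Among the reported p-values below 0.05, the hacked ones also pile up near zero instead of spreading evenly over (0, 0.05), which is the departure from uniformity the second simulation stresses.</p>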
<p>Whether you are convinced by Gelman, or convinced by my response, I agree with him that it is pretty unlikely that “most published research is false” so I’m glad our paper is at least bringing that important point up. I also hope that by introducing a new estimator of the science-wise fdr we inspire more methodological development and that philosophical criticisms won’t prevent people from looking at the data in new ways.</p>
Statisticians and computer scientists - if there is no code, there is no paper
2013-01-23T11:25:05+00:00
http://simplystats.github.io/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper
<p>I think it has been beaten to death that the incentives in academia lean heavily toward producing papers and less toward producing/maintaining software. There are people who are way, way more knowledgeable than me about building and maintaining software. For example, Titus Brown hit a lot of the key issues in his <a href="http://simplystatistics.org/2012/08/17/interview-with-c-titus-brown-computational-biologist/">interview</a>. The open source community is also filled with advocates and researchers who know way more about this than I do.</p>
<p>This post is more about my views on changing the perspective of code/software in the data analysis community. I have been frustrated often with statisticians and computer scientists who write papers where they develop new methods and seem to demonstrate that those methods blow away all their competitors. But then no software is available to actually test and see if that is true. Even worse, sometimes I just want to use their method to solve a problem in our pipeline, but I have to code it from scratch!</p>
<p>I have also had several cases where I emailed the authors for their software and they said it “wasn’t fit for distribution” or they “don’t have code” or the “code can only be run on our machines”. I totally understand the first and last; my code isn’t always pretty (I have zero formal training in computer science, so messy code is actually the most likely scenario), but I always say, “I’ll take whatever you got and I’m willing to hack it out to make it work”. I am still often turned down.</p>
<p>So I have a new policy when evaluating CVs of candidates for jobs, or when I’m reading a paper as a referee. If the paper is about a new statistical method or machine learning algorithm and there is no software available for that method - I simply mentally cross it off the CV. If I’m reading a data analysis and there isn’t code that reproduces their analysis - I mentally cross it off. In my mind, new methods/analyses without software are just <a href="http://en.wikipedia.org/wiki/Vaporware">vaporware</a>. Now, you’d definitely have to cross a few papers off my CV, based on this principle. I do that. But I’m trying really hard going forward to make sure nothing gets crossed off.</p>
<p>In a future post I’ll talk about the new issue I’m struggling with - maintaining all that software I’m creating.</p>
Sunday data/statistics link roundup (1/20/2013)
2013-01-20T10:00:32+00:00
http://simplystats.github.io/2013/01/20/sunday-datastatistics-link-roundup-1202013
<ol>
<li>This might be short. I have a couple of classes starting on Monday. The first is our <a href="http://www.jhsph.edu/courses/course/140.753/01/2012/16424/">140.753</a> class. This is one of my favorite classes to teach, our Ph.D. students are pretty awesome and they always amaze me with what they can do. The other is my Coursera debut in <a href="https://www.coursera.org/course/dataanalysis">Data Analysis</a>. We are at about 88,000 enrolled. Tell your friends, maybe we can make it an even 100k! In related news, some California schools are <a href="http://chronicle.com/article/California-State-U-Will/136677/">experimenting</a> with offering credit for online courses. (via Sherri R.)</li>
<li><a href="http://espn.go.com/blog/truehoop/post/_/id/53534/where-have-all-the-gunners-gone">Some interesting numbers</a> on why there aren’t as many “gunners” in the NBA - players who score a huge number of points. I love the talk about hustling, rotating team defense. I have always enjoyed watching good defense more than good offense. It might not be the most popular thing to watch, but seeing the Spurs rotate perfectly to cover the open man is a thing of athletic beauty. <a href="http://www.utahstateaggies.com/sports/m-baskbl/ust-m-baskbl-body.html">My Aggies</a> aren’t too bad at it either…(via Rafa).</li>
<li>A <a href="http://journal.sjdm.org/12/12810/jdm12810.html">really interesting article</a> suggesting that nonsense math can make arguments seem more convincing to non-technical audiences. This is tangentially related to a <a href="http://www.pnas.org/content/early/2012/06/22/1205259109.full.pdf">previous study</a> which showed that more equations led to fewer citations in biology articles. Overall, my take home message is that we don’t necessarily need fewer equations; we need to elevate statistical/quantitative literacy to the importance of reading literacy. (via David S.)</li>
<li>This has been posted elsewhere, but a reminder to send in your statistical stories for the <a href="http://statisticsforum.wordpress.com/2013/01/17/wanted-365-stories-of-statistics/">365 stories of statistics</a>.</li>
<li>Automatically generate a <a href="http://www.elsewhere.org/pomo/">postmodernism essay</a>. Hit refresh a few times. It’s pretty hilarious. It reminds me a lot of this <a href="http://nataliacecire.blogspot.com/2012/11/the-passion-of-nate-silver-sort-of.html">article about statisticians</a>. <a href="http://www.csse.monash.edu.au/cgi-bin/pub_search?104+1996+bulhak+Postmodernism">Here</a> is the technical paper describing how they simulate the essays. (via Rafa)</li>
</ol>
Comparing online and in-class outcomes
2013-01-18T11:31:44+00:00
http://simplystats.github.io/2013/01/18/comparing-online-and-in-class-outcomes
<p>My colleague John McGready has just <a href="http://www.sciencedirect.com/science/article/pii/S009174351200597X">published a study</a> he conducted comparing the outcomes of students in the online and in-class versions of his <em>Statistical Reasoning in Public Health</em> class that he teaches here in the fall. In this class the online and in-class portions are taught concurrently, so it’s basically one big class where some people are not in the building. Everything is the same for both groups–quizzes, tests, homework, instructor, lecture notes. From the article:</p>
<blockquote>
<p id="p0015">
The on-campus version employs twice-weekly 90 minute live lectures. Online students view pre-recorded narrated versions of the same materials. Narrated lecture slides are made available to on-campus students.
</p>
<p>The on-campus section has 5 weekly office hour sessions. Online students communicate with the course instructor asynchronously via email and a course bulletin board. The instructor communicates with online students in real time via weekly one-hour online sessions. Exams and quizzes are multiple choice. In 2005, on-campus students took timed quizzes and exams on paper in monitored classrooms. Online students took quizzes via a web-based interface with the same time limits. Final exams for the online students were taken on paper with a proctor.</p>
</blockquote>
<p>So how did the two groups fare in their final grades? Pretty much the same. First off, the two groups of students were not the same. Online students were 8 years older on average, more likely to have an MD degree, and more likely to be male. Final exam scores between online and in-class groups differed by -1.2 (out of 100, online group was lower) and after adjusting for student characteristics they differed by -1.5. In both cases, the difference was not statistically significant.</p>
<p>This was not a controlled trial and so there are possibly some problems with unmeasured confounding given that the populations appeared fairly different. It would be interesting to think about a study design that might allow a measure of control or perhaps get a better measure of the difference between online and on-campus learning. But the logistics and demographics of the students would seem to make this kind of experiment challenging.</p>
<p>Here’s the best I can think of right now: Take a large class (where all students are on-campus) and get a classroom that can fit roughly half the number of students in the class. Then randomize half the students to be in-class and the other half to be online up until the midterm. After the midterm cross everyone over so that the online group comes into the classroom and the in-class group goes online to take the final. It’s not perfect–one issue is that course material tends to get harder as the term goes on and it may be that the “easier” material is better learned online and the harder material is better learned on-campus (or vice versa). Any thoughts?</p>
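<p>The assignment step of that crossover design could be sketched roughly as follows. This is only an illustration under assumed names: the roster, the seed, and the <code>randomize_crossover</code> function are hypothetical, and a real study would also need to handle consent, dropouts, and odd class sizes.</p>

```python
import random

def randomize_crossover(student_ids, seed=None):
    """Randomly split students into two arms and cross them over at the midterm.

    Returns a dict mapping each student to a
    (before_midterm, after_midterm) pair of delivery modes.
    """
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    in_class_first = set(shuffled[:half])

    schedule = {}
    for student in shuffled:
        if student in in_class_first:
            schedule[student] = ("in-class", "online")
        else:
            schedule[student] = ("online", "in-class")
    return schedule

# Hypothetical roster of 100 on-campus students.
roster = [f"student_{i:03d}" for i in range(1, 101)]
schedule = randomize_crossover(roster, seed=753)

n_in_class_first = sum(1 for before, _ in schedule.values() if before == "in-class")
print(n_in_class_first, "students start in the classroom")
```

<p>Because every student experiences both modes, each one serves as their own control, although the comparison is still confounded with the difficulty of the material in each half of the term.</p>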
Review of R Graphics Cookbook by Winston Chang
2013-01-16T09:47:04+00:00
http://simplystats.github.io/2013/01/16/review-of-r-graphics-cookbook-by-winston-chang
<p>I just got a copy of Winston Chang’s book <em>R Graphics Cookbook</em>, published by O’Reilly Media. This book now follows a series of O’Reilly books on R, including an <em>R Cookbook.</em> Winston Chang is a graduate student at Northwestern University but is probably better known to R users as an active member of the ggplot2 mailing list and an active contributor to the ggplot2 source code.</p>
<p>The book has a typical cookbook format. After some preliminaries about how to install R packages and how to read data into R (Chapter 1), he quickly launches into exploratory data analysis and graphing. The basic outline of each section is:</p>
<ol>
<li>Statement of problem (“You want to make a histogram”)</li>
<li>Solution: If you can reasonably do it with base R graphics, here’s how you do it. Oh, and here’s how you do it in ggplot2. Notice how it’s better? (He doesn’t actually say that. He doesn’t have to.)</li>
<li>Discussion: This usually revolves around different options that might be set or alternative approaches.</li>
<li>See also: Other recipes in the book.</li>
</ol>
<p>Interestingly, nowhere in the book is the lattice package mentioned (except in passing). But I suppose that’s because ggplot2 pretty much supersedes anything you might want to do in the lattice package. Recently, I’ve been wondering what the future of the lattice package is given that it doesn’t seem to me to be undergoing very active development. But I digress….</p>
<p>Overall, the book is great. I learned quite a few things just in my initial read of the book and as I dug in a bit more there were some functions that I was not familiar with. Much of the material is straight up ggplot2 stuff so if you’re an expert there you probably won’t get a whole lot more. But my guess is that most are not experts and so will be able to get something out of the book.</p>
<p>The meat of the book covers a lot of different plotting techniques, enough to make your toolbox quite full. If you pick up this book and think something is missing, my guess is that you’re making some pretty esoteric plots. I enjoyed the few sections on specifying colors as well as the recipes on making maps (one of ggplot2’s strong points). I wish there were more map recipes, but hey, that’s just me.</p>
<p>Towards the end there’s a nice discussion of graphics file formats (PDF, PNG, WMF, etc.) and the advantages and disadvantages of each (Chapter 14: Output for Presentation). I particularly enjoyed the discussion of fonts in R graphics since I find this to be a fairly confusing aspect of R, even for seasoned users.</p>
<p>The book ends with a series of recipes related to data manipulation. It’s funny how many recipes there are about modifying factor variables, but I guess this is just a function of how annoying it is to modify factor variables. There’s also some highlighting of the plyr and reshape2 packages.</p>
<p>Ultimately, I think this is a nice complement to Hadley Wickham’s <em>ggplot2</em> as most of the recipes focus on implementing plots in ggplot2. I don’t think you necessarily need to have a deep understanding of ggplot2 in order to use this book (there are some details in an appendix), but some people might want to grab Hadley’s book for more background. In fact, this may be a better book to use to get started with ggplot2 simply because it focuses on specific applications. I kept thinking that if the book had been written using base graphics only, it’d probably have to be 2 or 3 times longer just to fit all the code in, which is a testament to the power and compactness of the ggplot2 approach.</p>
<p>One last note: I got the e-book version of the book, but I would recommend the paper version. With books like these, I like to flip around constantly (since there’s no need to read it in a linear fashion) and I find e-readers like iBooks and Kindle Reader to be not so good at this.</p>
R package meme
2013-01-16T05:00:20+00:00
http://simplystats.github.io/2013/01/16/r-package-meme
<p>I just got this from a former student who is working on a project with me:</p>
<p><img class="alignnone" style="font-size: 16px;" alt="" src="http://cdn.memegenerator.net/instances/400x/33457759.jpg" width="400" height="400" /></p>
<p>Awesome.</p>
Welcome to the Smog-ocalypse
2013-01-14T16:23:50+00:00
http://simplystats.github.io/2013/01/14/welcome-to-the-smog-ocalypse
<p><img class="size-medium wp-image-876 alignleft" alt="Beijing fog, 2013" src="http://simplystatistics.org/wp-content/uploads/2013/01/BeijingFog1-225x300.jpg" width="225" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2013/01/BeijingFog1-225x300.jpg 225w, http://simplystatistics.org/wp-content/uploads/2013/01/BeijingFog1-768x1024.jpg 768w, http://simplystatistics.org/wp-content/uploads/2013/01/BeijingFog1.jpg 1536w" sizes="(max-width: 225px) 100vw, 225px" /></p>
<p>Recent reports of air pollution levels out of Beijing are <a href="http://www.nytimes.com/2013/01/15/world/asia/china-allows-media-to-report-alarming-air-pollution-crisis.html?smid=pl-share">very</a> <a href="http://bloom.bg/ZTFD9q">very</a> disturbing. Levels of fine particulate matter (PM2.5, or PM less than 2.5 microns in diameter) have reached unprecedented levels. So high are the levels that even the <a href="http://www.nytimes.com/2013/01/15/world/asia/china-allows-media-to-report-alarming-air-pollution-crisis.html?ref=global-home&_r=0">official media are allowed to mention it</a>.</p>
<p>Here is a photograph of downtown Beijing during the day (Thanks to Sarah E. Burton for the photograph). Hourly levels of PM2.5 hit over 900 micrograms per cubic meter in some parts of the city and 24-hour average levels (the basis for most air quality standards) reached over 500 micrograms per cubic meter. Just for reference, the US national ambient air quality standard for the 24-hour average level of PM2.5 is 35 micrograms per cubic meter.</p>
<p>Below is a plot of the PM2.5 data taken from the <a href="https://twitter.com/beijingair">US Embassy’s rooftop monitor</a>.</p>
<p><a href="http://simplystatistics.org/wp-content/uploads/2013/01/Beijingair.png"><img class="alignright size-large wp-image-890" alt="Beijingair" src="http://simplystatistics.org/wp-content/uploads/2013/01/Beijingair-1024x737.png" width="640" height="460" srcset="http://simplystatistics.org/wp-content/uploads/2013/01/Beijingair-300x216.png 300w, http://simplystatistics.org/wp-content/uploads/2013/01/Beijingair-1024x737.png 1024w" sizes="(max-width: 640px) 100vw, 640px" /></a></p>
<p>The solid circles indicate the 24-hour average for the day. The red line is the median of the daily averages for the time period in the plot (about 6 weeks) and the dotted blue line is the US 24-hour national ambient air quality standard. The median for the period was about 91 micrograms per cubic meter.</p>
<p>First, it should be noted that a “typical” day of 91 micrograms per cubic meter is still <em>crazy.</em> But suppose we take 91 to be a typical day. Then in a city like Beijing, which has about 20 million people, if we assume that about 700 people die on a typical day, then the last 5 days alone would experience about 307 excess deaths from all causes. I get this from using a rough estimate of a 0.3% increase in all-cause mortality per 10 microgram per cubic meter increase in PM2.5 levels (studies from China and the US tend to report risks in roughly this area). The 700 deaths per day number is a fairly back-of-the-envelope number that I got simply using comparisons to other major cities. Numbers for things like excess hospitalizations will be higher because both the risks and the baselines are higher. For example, in the US, we estimate about a 1.28% increase in heart failure hospitalization for a 10 microgram per cubic meter increase in PM2.5.</p>
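<p>The back-of-the-envelope calculation above can be sketched as follows. The 0.3% per 10 micrograms per cubic meter risk, the 700 deaths per day baseline, and the 91 microgram “typical” level come from the text; the function name, the linear approximation, and the five daily averages are illustrative assumptions, not the actual embassy readings, so the result will not reproduce the 307 figure exactly.</p>

```python
def excess_deaths(daily_pm25, baseline_pm25=91.0, baseline_deaths=700.0,
                  pct_increase_per_10=0.3):
    """Rough excess all-cause deaths over a run of days.

    Assumes mortality rises linearly by pct_increase_per_10 percent for
    every 10 micrograms per cubic meter of PM2.5 above the baseline level.
    """
    total = 0.0
    for level in daily_pm25:
        relative_increase = (pct_increase_per_10 / 100.0) * (level - baseline_pm25) / 10.0
        total += baseline_deaths * relative_increase
    return total

# Hypothetical 24-hour averages for five very polluted days (not real data).
hypothetical_levels = [300.0, 500.0, 450.0, 350.0, 250.0]
print(round(excess_deaths(hypothetical_levels)))
```

<p>Swapping in the 1.28% figure for heart failure hospitalizations, along with an appropriate baseline count, gives the analogous estimate for that outcome.</p>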
<p>If you like, you can also <a href="http://simplystatistics.org/2011/12/14/smoking-is-a-choice-breathing-is-not/">translate current levels to numbers of cigarettes smoked</a>. If you assume a typical adult inhales about 18-20 cubic meters of air per day, then in the last 5 days, the average Beijinger smoked about 3 cigarettes just by getting out of bed in the morning.</p>
<p>Lastly, I want to point to a nice series of photos that the Guardian has collected on the (in)famous <a href="http://www.guardian.co.uk/environment/gallery/2012/dec/05/60-years-great-smog-london-in-pictures">London Fog of 1952</a>. Although the levels were quite a bit worse back then (about 2-3 times worse, if you can believe it), the photos bear a striking resemblance to today’s Beijing.</p>
<p>At least in the US, the infamous smog episodes that occurred regularly only 60 years ago are pretty much non-existent. But in many places around the world, “crazy bad” air pollution is part of everyday life.</p>
Sunday data/statistics link roundup (1/13/2013)
2013-01-13T15:30:40+00:00
http://simplystats.github.io/2013/01/13/sunday-datastatistics-link-roundup-1132012
<ol>
<li><a href="http://www.seaborg.ucla.edu/video2012.html">These</a> are some great talks. But definitely watch Michael Eisen’s talk on E-biomed and the history of open access publication. This is particularly poignant in light of <a href="http://www.nytimes.com/2013/01/13/technology/aaron-swartz-internet-activist-dies-at-26.html?_r=0">Aaron Swartz’s tragic suicide</a>. It’s also worth checking out the twitter hashtag <a href="https://twitter.com/search?q=%23pdftribute&src=hash">#pdftribute</a>.</li>
<li>An <a href="http://civilstat.com/wp-content/uploads/2013/01/IMG_8807-1024x768.jpg">awesome flowchart</a> before a talk given by the creator of the R <a href="http://www.twotorials.com/">twotorials</a>. Roger gets a shoutout (via civilstat).</li>
<li><a href="http://stochasticplanet.tumblr.com/">This blog</a> selects a position at random on the planet earth every day and posts the picture taken closest to that point. Not much about the methodology on the blog, but totally fascinating and a clever idea.</li>
<li>A set of data giving a <a href="http://reportcard.studentsfirst.org/">“report card”</a> for each state on how that state does in improving public education for students. I’m not sure I believe the grades, but the underlying reports look interesting.</li>
</ol>
NSF should understand that Statistics is not Mathematics
2013-01-11T10:44:56+00:00
http://simplystats.github.io/2013/01/11/nsf-should-understand-that-statistics-in-not-mathematics
<p>NSF has realized that the role of Statistics is growing in all areas of science and engineering and <a href="http://www.nsf.gov/attachments/124926/public/Request_to_form_MPSAC_Subcommittee_StatsNSF_8-15-2012_Final.pdf">has formed a subcommittee</a> to examine the current structure of support of the statistical sciences. As <a href="http://simplystatistics.org/2012/08/21/nsf-recognizes-math-and-statistics-are-not-the-same/">Roger explained</a> in August, the NSF is divided into directorates composed of divisions. Statistics is in the Division of Mathematical Sciences (DMS) within the Directorate for Mathematical and Physical Sciences. Within this <a href="http://www.nsf.gov/div/index.jsp?div=dms">division</a> it is a <em>Disciplinary Research Program</em> along with Topology, Geometric Analysis, etc. To statisticians this does not make much sense, and my first thought when asked for recommendations was that we need a proper division. But the committee is seeking out recommendations that</p>
<blockquote>
<p>[do] not include renaming of the Division of Mathematical Sciences. Particularly desired are recommendations that can be implemented within the current divisional and directorate structure of NSF; Foundation (NSF) and to provide recommendations for NSF to consider.</p>
</blockquote>
<p>This clarification is there because a former director <a href="http://simplystatistics.org/2012/08/21/nsf-recognizes-math-and-statistics-are-not-the-same/">proposed changing the division’s name</a> to “Division of Mathematical and <strong>Statistical</strong> Sciences”. The NSF shot down this idea and listed this as one of the reasons:</p>
<blockquote>
<p>If the name change attracts more proposals to the Division from the statistics community, this could draw funding away from other subfields</p>
</blockquote>
<p>So NSF does not want to take away from the other math programs and this is understandable given the current levels of research funding for Mathematics. But this being the case, I can’t really think of a recommendation other than giving Statistics its own division or giving data-related sciences their own directorate. Increasing support for the statistical sciences means increasing funding. You secure the necessary funding either by asking Congress for a bigger budget (good luck with that) or by cutting from other divisions, not just Mathematics. A new division makes sense not only in practice but also in principle because Statistics is not Mathematics.</p>
<p>Statistics is analogous to other disciplines that use mathematics as a fundamental language, like Physics, Engineering, and Computer Science. But like those disciplines, Statistics contributes separate and fundamental scientific knowledge. While the field of applied mathematics tries to explain the world with deterministic equations, Statistics takes a dramatically different approach. In highly complex systems, such as the weather, Mathematicians battle <a href="http://en.wikipedia.org/wiki/Laplace's_demon">Laplace’s demon</a> and struggle to explain nature using mathematics derived from first principles. Statisticians accept that deterministic approaches are not always useful and instead develop and rely on random models. These two approaches are both important, as demonstrated by the improvements in meteorological predictions achieved once data-driven statistical models were used to complement deterministic mathematical models.</p>
<p>Although Statisticians rely heavily on theoretical/mathematical thinking, another important distinction from Mathematics is that advances in our field are almost exclusively driven by empirical work. Statistics always starts with a specific, concrete real world problem: we thrive in <a href="http://en.wikipedia.org/wiki/Pasteur's_quadrant">Pasteur’s quadrant</a>. Important theoretical work that generalizes our solutions always follows. This approach, built mostly by basic researchers, has yielded some of the most useful concepts relied upon by modern science: the p-value, randomization, analysis of variance, regression, the proportional hazards model, causal inference, Bayesian methods, and the Bootstrap, just to name a few examples. These ideas were instrumental in the most important genetic discoveries, improving agriculture, the inception of the empirical social sciences, and revolutionizing medicine via randomized clinical trials. They have also fundamentally changed the way we abstract quantitative problems from real data.</p>
<p>The 21st century brings the era of big data, and distinguishing Statistics from Mathematics becomes more important than ever. Many areas of science are now being driven by new measurement technologies. Insights are being made by discovery-driven, as opposed to hypothesis-driven, experiments. Although testing hypotheses developed theoretically will of course remain important to science, it is entirely conceivable that, just as Leeuwenhoek became the father of microbiology by looking through the microscope without theoretical predictions, the era of big data will enable discoveries that we have not yet even imagined. However, it is naive to think that these new datasets will be free of noise and unwanted variability. Deterministic models alone will almost certainly fail at extracting useful information from these data, just as they have failed at predicting complex systems like the weather. The advancement in science during the era of big data that the NSF wants to see will only happen if the field that specializes in making sense of data is properly defined as a separate field from Mathematics and appropriately supported.</p>
<p><strong>Addendum:</strong> On a related note, NIH just announced that they plan to recruit a new senior scientific position: <a href="http://www.nih.gov/news/health/jan2013/od-10a.htm">the Associate Director for Data Science</a>.</p>
The landscape of data analysis
2013-01-10T09:11:26+00:00
http://simplystats.github.io/2013/01/10/the-landscape-of-data-analysis
<p>I have been getting some questions via email, LinkedIn, and Twitter about the content of the Data Analysis class I will be teaching for Coursera. Data Analysis and Data Science mean different things to different people. So I made a video describing how Data Analysis fits into the landscape of other quantitative classes here:</p>
<p><a href="http://prezi.com/fhumwa8tb3fs/the-lanscape-of-data-analysis/?kw=view-fhumwa8tb3fs&rc=ref-27684941">Here</a> is the corresponding presentation. I also made a tentative list of topics we will cover, subject to change at the instructor’s whim. Here it is:</p>
<ul>
<li>The structure of a data analysis (steps in the process, knowing when to quit, etc.)</li>
<li>Types of data (census, designed studies, randomized trials)</li>
<li>Types of data analysis questions (exploratory, inferential, predictive, etc.)</li>
<li>How to write up a data analysis (compositional style, reproducibility, etc.)</li>
<li>Obtaining data from the web (through downloads mostly)</li>
<li>Loading data into R from different file types</li>
<li>Plotting data for exploratory purposes (boxplots, scatterplots, etc.)</li>
<li>Exploratory statistical models (clustering)</li>
<li>Statistical models for inference (linear models, basic confidence intervals/hypothesis testing)</li>
<li>Basic model checking (primarily visually)</li>
<li>The prediction process</li>
<li>Study design for prediction</li>
<li>Cross-validation</li>
<li>A couple of simple prediction models</li>
<li>Basics of simulation for evaluating models</li>
<li>Ways you can fool yourself and how to avoid them (confounding, multiple testing, etc.)</li>
</ul>
<p>Of course that is a ton of material for 8 weeks, so obviously we will be covering just the very basics. I think it is really important to remember that being a good Data Analyst is like being a good surgeon or writer. There is no such thing as a prodigy in surgery or writing, because it requires long experience, trying lots of things out, and learning from mistakes. I hope to give people the basic information they need to get started and point to resources where they can learn more. I also hope to give them a chance to practice some basics a couple of times and to learn that in data analysis the first goal is to “do no harm”.</p>
By introducing competition open online education will improve teaching at top universities
2013-01-08T10:00:17+00:00
http://simplystats.github.io/2013/01/08/by-introducing-competition-open-online-education-will-improve-teaching-at-top-universities
<p>It is no secret that faculty evaluations at <a href="http://www.shanghairanking.com/ARWU2012.html">top universities</a> weigh research much more than teaching. This is not surprising given that, among other reasons, global visibility comes from academic innovation (think Nobel Prizes) not classroom instruction. Come promotion time the peer review system carefully examines your publication record and ability to raise research funds. External experts within your research area are asked if you are a leader in the field. Top universities maintain their status by imposing standards that lead to a highly competitive environment in which only the most talented researchers survive.</p>
<p>However, the assessment of teaching excellence is much less stringent. Unless they reveal utter incompetence, teaching evaluations are practically ignored, especially if you have graduated numerous PhD students. Certainly, outside experts are not asked about your teaching. This imbalance in incentives explains why faculty use research funding to buy out of teaching and why highly recruited candidates negotiate low teaching loads.</p>
<p>Top researchers end up at top universities but being good at research does not necessarily mean you are a good teacher. Furthermore, the effort required to be a competitive researcher leaves limited time for class preparation. To make matters worse, within a university, faculty have a monopoly on the classes they teach. With few incentives and practically no competition it is hard to believe that top universities are doing the best they can when it comes to classroom instruction. By introducing competition, MOOCs might change this.</p>
<p>To illustrate, say you are a chair of a soft money department in 2015. Four of your faculty receive 25% funding to teach the big Stat 101 class and your graduate program’s three main classes. But despite being great researchers these four are mediocre teachers. So why are they teaching if 1) a MOOC exists for each of these classes and 2) these professors can easily cover 100% of their salary with research funds? As chair, not only do you wonder why not let these four profs focus on what they do best, but also why your department is not creating MOOCs and getting global recognition for it. So instead of hiring 4 great researchers who are mediocre teachers, why not hire (for the same cost) 4 great researchers (fully funded by grants) and 1 great teacher (funded with tuition $)? I think in the future tenure track positions will be divided into top researchers doing mostly research and top teachers doing mostly classroom teaching and MOOC development. Because top universities will feel the pressure to compete and develop the courses that educate the world, there will be no room for mediocre teaching.</p>
<p> </p>
Sunday data/statistics link roundup (1/6/2013)
2013-01-06T11:08:07+00:00
http://simplystats.github.io/2013/01/06/sunday-datastatistics-link-roundup-162013
<ol>
<li>Not really statistics, but this is <a href="http://www.science20.com/hammock_physicist/rational_suckers-99998">an interesting article</a> about how rational optimization by individual actors does not always lead to an optimal solution. Related, here is the <a href="http://www.businessinsider.com/16-ways-asian-cities-are-making-their-us-counterparts-look-like-the-third-world-2013-1#some-japanese-street-signs-have-heat-maps-to-relay-congestion-information-to-drivers-and-directly-influence-traffic-patterns-3">coolest street sign</a> I think I’ve ever seen, with a heatmap of traffic density to try to influence commuters.</li>
<li>An <a href="http://arxiv.org/pdf/1205.4891v1.pdf">interesting paper</a> that talks about how clustering is only a really hard problem when there aren’t obvious clusters. I was a little disappointed in the paper, because it defines the “obviousness” of clusters only theoretically by a distance metric. There is very little discussion of the practical distance/visual distance metrics people use when looking at clustering dendrograms, etc.</li>
<li>A post about the <a href="http://norvig.com/chomsky.html">two cultures of statistical learning</a> and a <a href="http://www.r-bloggers.com/data-driven-science-is-a-failure-of-imagination/">related post</a> on how data-driven science is a failure of imagination. I think in both cases, it is worth pointing out that <a href="http://simplystatistics.org/2012/12/31/what-makes-a-good-data-scientist/">the only good data science is good science</a> - i.e. it seeks to answer a real, specific question through the scientific method. However, I think for many modern scientific problems it is pretty naive to think we will be able to come to a full, mechanistic understanding complete with tidy theorems that describe all the properties of the system. I think the real failure of imagination is to think that science/statistics/mathematics won’t change to tackle the realistic challenges posed in solving modern scientific problems.</li>
<li>A graph that shows the incredibly strong correlation ( > 0.99!) between the <a href="http://boingboing.net/2013/01/01/correlation-between-autism-dia.html">growth of autism diagnoses and organic food sales</a>. Another example where even really strong correlation does not imply causation.</li>
<li>The Buffalo Bills are going to start an <a href="http://www.nfl.com/news/story/0ap1000000121055/article/buffalo-bills-to-start-advanced-analytics-department">advanced analytics department</a> (via Rafa and Chris V.). Maybe they can take advantage of all this <a href="http://www.advancednflstats.com/2010/04/play-by-play-data.html">free play-by-play data</a> from years of NFL games.</li>
<li>A <a href="https://www.youtube.com/watch?v=CJAIERgWhZQ">prescient interview</a> with Isaac Asimov on learning, predicting the Khan Academy, MOOCs and other developments in online learning (via Rafa and Marginal Revolution).</li>
<li><a href="http://seanjtaylor.com/post/39573264781/the-statistics-software-signal">The statistical software signal</a> - what your choice of software says about you. Just another reason we need a <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">deterministic statistical machine</a>.</li>
</ol>
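<p>The correlation point in item 4 can be sketched with a small simulation. This is only an illustration under stated assumptions: the two series below are made up (they are not the autism or organic food figures), and the starting values, growth rates, and noise levels are arbitrary choices. The idea is that any two quantities that both grow steadily over the same years will be strongly correlated, even when they are generated independently.</p>

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def trending_series(rng, start, growth, noise_sd, n_years):
    """Hypothetical yearly series: steady multiplicative growth plus noise."""
    values, level = [], start
    for _ in range(n_years):
        level *= growth
        values.append(level + rng.gauss(0, noise_sd))
    return values


def trend_correlation(seed=0, n_years=12):
    rng = random.Random(seed)
    # Two unrelated, made-up quantities that both happen to grow each year.
    series_a = trending_series(rng, start=100.0, growth=1.10, noise_sd=2.0, n_years=n_years)
    series_b = trending_series(rng, start=5.0, growth=1.12, noise_sd=0.2, n_years=n_years)
    return pearson(series_a, series_b)


if __name__ == "__main__":
    print(f"correlation between two independent trends: {trend_correlation():.3f}")
```

<p>Neither series influences the other in this sketch; the shared upward trend over time is enough to produce a correlation close to 1.</p>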
<p> </p>
Does NIH fund innovative work? Does Nature care about publishing accurate articles?
2013-01-04T10:00:00+00:00
http://simplystats.github.io/2013/01/04/does-nih-fund-innovative-work-does-nature-care-about-publishing-accurate-articles
<p><em>Editor’s Note: In a recent post we <a href="http://simplystatistics.org/2012/12/20/the-nih-peer-review-system-is-still-the-best-at-identifying-innovative-biomedical-investigators/">disagreed</a> with a Nature article claiming that NIH doesn’t support innovation. Our colleague <a href="http://bioinformatics.igm.jhmi.edu/salzberg/Salzberg/Salzberg_Lab_Home.html">Steven Salzberg</a> actually looked at the data and wrote the guest post below. </em></p>
<p>Nature <a href="http://www.nature.com/nature/journal/v492/n7427/full/492034a.html">published an article last month</a> with the provocative title “Research grants: Conform and be funded.” The authors looked at papers with over 1000 citations to find out whether scientists “who do the most influential scientific work get funded by the NIH.” Their dramatic conclusion, widely reported, was that only 40% of such influential scientists get funding.</p>
<p>Dramatic, but wrong. I re-analyzed the authors’ data and wrote a letter to Nature, <a href="http://www.nature.com/nature/journal/v493/n7430/full/493026b.html">which was published today</a> along with the authors’ response, which more or less ignored my points. Unfortunately, Nature cut my already-short letter in half, so what readers see in the journal omits half my argument. My entire letter is published here, thanks to my colleagues at Simply Statistics. I titled it “NIH funds the overwhelming majority of highly influential original science results,” because that’s what the original study should have concluded from their very own data. Here goes:</p>
<p style="padding-left: 30px">
<em>To the Editors:</em>
</p>
<p style="padding-left: 30px">
<em>In their recent commentary, "Conform and be funded," Joshua Nicholson and John Ioannidis claim that "too many US authors of the most innovative and influential papers in the life sciences do not receive NIH funding." They support their thesis with an analysis of 200 papers sampled from 700 life science papers with over 1,000 citations. Their main finding was that only 40% of "primary authors" on these papers are PIs on NIH grants, from which they argue that the peer review system "encourage[s] conformity if not mediocrity."</em>
</p>
<p style="padding-left: 30px">
<em>While this makes for an appealing headline, the authors' own data does not support their conclusion. I downloaded the full text for a random sample of 125 of the 700 highly cited papers [data available upon request]. A majority of these papers were either reviews (63), which do not report original findings, or not in the life sciences (17) despite being included in the authors' database. For the remaining 45 papers, I looked at each paper to see if the work was supported by NIH. In a few cases where the paper did not include this information, I used the NIH grants database to determine if the corresponding author has current NIH support. 34 out of 45 (75%) of these highly-cited papers were supported by NIH. The 11 papers not supported included papers published by other branches of the U.S. government, including the CDC and the U.S. Army, for which NIH support would not be appropriate. Thus, using the authors' own data, one would have to conclude that NIH has supported a large majority of highly influential life sciences discoveries in the past twelve years.</em>
</p>
<p style="padding-left: 30px">
<em>The authors – and the editors at </em>Nature<em>, who contributed to the article – suffer from the same biases that Ioannidis himself has often criticized. Their inclusion of inappropriate articles and especially the choice to require that both the first and last author be PIs on an NIH grant, even when the first author was a student, produced an artificially low number that misrepresents the degree to which NIH supports innovative original research.</em>
</p>
<p>It seems pretty clear that <em>Nature</em> wanted a headline about how NIH doesn’t support innovation, and Ioannidis was happy to give it to them. Now, I’d love it if NIH had the funds to support more scientists, and I’d also be in favor of funding at least some work retrospectively - based on recent major achievements, for example, rather than proposed future work. But the evidence doesn’t support the “Conform and be funded” headline, however much <em>Nature</em> might want it to be true.</p>
The scientific reasons it is not helpful to study the Newtown shooter's DNA
2013-01-03T10:10:46+00:00
http://simplystats.github.io/2013/01/03/the-scientific-reasons-it-is-not-helpful-to-study-the-newtown-shooters-dna
<p>The Connecticut Medical Examiner <a href="http://www.theatlanticwire.com/technology/2012/12/adam-lanza-dna-test/60371/">has asked to sequence</a> and study the DNA of the recent Newtown shooter. I’ve been seeing this pop up over the last few days on a lot of <a href="http://www.businessinsider.com/plans-to-study-adam-lanzas-dna-splits-scientific-community-2012-12">popular media sites</a>, where they mention some objections scientists (or geneticists) may have to this “scientific” study. But I haven’t seen the objections explicitly laid out anywhere. So here are mine.</p>
<p><strong>Ignoring the fundamentals of the genetics of complex disease:</strong> If the violent behavior of the shooter has any genetic underpinning, it is complex. If you only look at one person’s DNA, without a clear behavior definition (violent? mental disorder? etc.?) it is impossible to assess important complications such as <a href="http://en.wikipedia.org/wiki/Penetrance">penetrance</a>, <a href="http://en.wikipedia.org/wiki/Epistasis">epistasis</a>, and <a href="http://en.wikipedia.org/wiki/Gene%E2%80%93environment_interaction">gene-environment interactions</a>, to name a few. These make statistical analysis incredibly complicated even in huge, well-designed studies.</p>
<p><strong>Small Sample Size</strong>: One person hit on the issue that is maybe the biggest reason this is a waste of time/likely to lead to incorrect results. <em>You can’t draw a reasonable conclusion about any population by <a href="https://twitter.com/drng/status/283692936930152448">looking at only one individual</a>.</em> This is actually a fundamental component of <a href="http://en.wikipedia.org/wiki/Statistical_inference">statistical inference</a>. The goal of statistical inference is to take a small, representative sample and use data from that sample to say something about the bigger population. In this case, there are two reasons that the usual practice of statistical inference can’t be applied: (1) only one individual is being considered, so we can’t measure anything about how variable (or accurate) the data are, and (2) we’ve picked one, incredibly high-profile, and almost certainly not representative, individual to study.</p>
<p><strong>Multiple testing/data dredging: </strong>The small sample size problem is compounded by the fact that we aren’t looking at just one or two of the shooter’s genes, but rather the whole genome. To see why making statements about violent individuals based on only one person’s DNA is a bad idea, think about the <a href="http://news.bbc.co.uk/2/hi/science/nature/3760766.stm">20,000 genes in a human body</a>. Let’s suppose that only one of the genes causes violent behavior (it is definitely more complicated than that) and that there is no environmental cause to the violent behavior (clearly false). Furthermore, suppose that if you have the bad version of the violent gene you will do something violent in your life (almost definitely not a sure thing).</p>
<p>Now, even with all these simplifying (and incorrect) assumptions, for each gene you flip a coin with a different chance of being heads. The violent gene turned up tails, but so did a large number of other genes. If we compare the set of genes that came up tails with those of another individual, the two sets will have a huge number in common in addition to the violent gene. So based on this information, you would have no idea which gene causes violence even in this hugely simplified scenario.</p>
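<p>Here is a minimal sketch of that coin-flipping thought experiment, under the same deliberately unrealistic assumptions. Everything specific in it is hypothetical: the index chosen for the single causal gene, the range of per-gene coin probabilities, and the assumption that the second individual also carries the causal version. It simply counts how many "tails" genes two simulated individuals have in common.</p>

```python
import random

N_GENES = 20_000   # gene count quoted in the post
CAUSAL_GENE = 0    # hypothetical index of the single "violence gene"


def draw_tails(rng, p_tails):
    """Flip one biased coin per gene; return the set of genes that came up tails."""
    tails = {g for g, p in enumerate(p_tails) if rng.random() < p}
    tails.add(CAUSAL_GENE)  # by assumption, this individual carries the causal version
    return tails


def shared_candidates(seed=0):
    """Genes that came up tails in both of two simulated individuals."""
    rng = random.Random(seed)
    # Assumed: each gene has its own chance of tails, drawn uniformly from 5% to 95%.
    p_tails = [rng.uniform(0.05, 0.95) for _ in range(N_GENES)]
    first_person = draw_tails(rng, p_tails)
    second_person = draw_tails(rng, p_tails)
    return first_person & second_person


if __name__ == "__main__":
    shared = shared_candidates()
    print(f"genes shared by the two individuals: {len(shared)}")
    print(f"causal gene among them: {CAUSAL_GENE in shared}")
```

<p>With these made-up probabilities the shared set runs into the thousands of genes, and the causal gene is just one unremarkable member of it. Nothing in the comparison singles it out.</p>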
<p><strong>Heavy reliance on prior information/intuition</strong>: This is a supposedly scientific study, but the small sample size/multiple testing problems mean any conclusions from the data will be very weak. The only thing you could do is take the set of genes you found and then rely on previous studies to try to determine which one is the “violence gene”. But now you are being guided by intuition, guesswork, and a bunch of studies that may or may not be relevant. The result is that more than likely you’d end up on the wrong gene.</p>
<p>The result is that it is highly likely that no solid statistical information will be derived from this experiment. Just because the technology exists to run an experiment doesn’t mean that experiment will teach us anything.</p>
Fitbit, why can't I have my data?
2013-01-02T20:16:26+00:00
http://simplystats.github.io/2013/01/02/fitbit-why-cant-i-have-my-data
<p>I have a <a href="http://www.fitbit.com/">Fitbit</a>. I got it because I wanted to collect some data about myself and I liked the simplicity of the set-up. I also asked around and Fitbit seemed like the most “open” platform for collecting one’s own data. You have to pay $50 for a premium account, but after that, they allow you to download your data.</p>
<p>Or do they?</p>
<p>I looked into the details, asked a buddy or two, and found out that you actually can’t get the really interesting minute-by-minute data even with a premium account. You only get the daily summarized totals for steps/calories/stairs climbed. While these data are of some value, the minute-by-minute data are oh so much more interesting. I’d like to use them for personal interest, for teaching, for research, and for sharing interesting new ideas back to other Fitbit developers.</p>
<p>Since I’m not easily dissuaded, I tried another route. I created an application that accessed the <a href="http://dev.fitbit.com/">Fitbit API</a>. After fiddling around a bit with a few R packages, I was able to download my daily totals. But again, no minute-by-minute data. I looked into it and only <a href="https://wiki.fitbit.com/display/API/Fitbit+Partner+API">partner apps</a> have access to the intraday data. So I emailed Fitbit to ask if I could be a partner app. So far no word.</p>
<p>I guess it is true, if you aren’t paying for it, you are the product. But honestly, I’m just not that interested in being a product for Fitbit. So I think I’m bailing until I can download intraday data - I’m even happy to pay for it. If anybody has a suggestion of a more open self-monitoring device, I’d love to hear about it.</p>
Happy 2013: The International Year of Statistics
2013-01-01T09:00:25+00:00
http://simplystats.github.io/2013/01/01/happy-2013-the-international-year-of-statistics
<p>The ASA has <a href="http://www.statistics2013.org/">declared</a> 2013 to be the International Year of Statistics and I am ready to celebrate it in full force. It is a great time to be a statistician and I am hoping more people will join the fun. In fact, as we like to point out in this blog, Statistics has already been at the center of many exciting accomplishments of the 21st century. <a href="http://en.wikipedia.org/wiki/Sabermetrics">Sabermetrics</a> has become a standard approach and inspired the Hollywood movie <a href="http://www.imdb.com/title/tt1210166/">Moneyball</a>. Friend of the blog <a href="http://www2.research.att.com/~volinsky/">Chris Volinsky</a>, a PhD Statistician, led <a href="http://www.nytimes.com/2009/07/28/technology/internet/28netflix.html">the team</a> that won the <a href="http://www.netflixprize.com/">Netflix million dollar prize</a>. Nate Silver et al. <a href="http://simplystatistics.org/2012/11/07/nate-silver-does-it-again-will-pundits-finally-accept/">proved the pundits wrong</a> by, once again, using statistical models to <a href="http://mashable.com/2012/11/07/nate-silver-wins/">predict election results almost perfectly</a>. R has become one of the most <a href="http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all">widely used</a> programming languages in the world. Meanwhile, in academia, the number of statisticians becoming leaders in fields like environmental sciences, human genetics, genomics, and social sciences continues to grow. It is no surprise that stats majors at Harvard have <a href="http://nesterko.com/visuals/statconcpred2012-with-dm/">more than quadrupled</a> since 2000 and that statistics MOOCs are among <a href="http://edudemic.com/2012/12/the-11-most-popular-open-online-courses/">the most</a> popular.</p>
<p style="text-align: left;">
<img class="aligncenter" alt="" src="http://mope.amsiintern.org.au/wp-content/uploads/2012/09/IYSTAT-Logo-extended-620x350.jpg" width="372" height="210" />The unprecedented advances in digital technology during the second half of the 20th century have produced a measurement revolution that is transforming the world. Many areas of science are now being driven by new measurement technologies and many insights are being made by discovery-driven, as opposed to hypothesis-driven, experiments. Empiricism is back with a vengeance. The current scientific era is defined by its dependence on data, and the statistical methods and concepts developed during the 20th century provide an incomparable toolbox to help tackle current challenges. The toolbox, along with computer science, will also serve as a base for the methods of tomorrow. So I will gladly join the Year of Statistics' festivities during 2013 and beyond, during the era of data-driven science.
</p>
What makes a good data scientist?
2012-12-31T08:49:03+00:00
http://simplystats.github.io/2012/12/31/what-makes-a-good-data-scientist
<p>Apparently, New Year’s Eve is not a popular day to come to the office as it seems I’m the only one here. No matter, it just means I can blast Mahler 3 (Bernstein, NY Phil, 1980s recording) louder than I normally would.</p>
<p>Today’s post is inspired by this latest article in the NYT about big data. The article for the most part describes a conference that happened at MIT recently on the topic of big data. Towards the end of the article, it is noted that one of the participants (Rachel Schutt) was asked what makes a good data scientist.</p>
<blockquote>
<div>
<p>Obviously, she replied, the requirements include computer science and math skills, but you also want someone who has a deep, wide-ranging curiosity, is innovative and is guided by experience as well as data.</p>
<p itemprop="articleBody">
“I don’t worship the machine,” she said.
</p>
</div>
</blockquote>
<p>I think I agree, but I would have put it a different way. Mostly, I think what makes a good data scientist is the same thing that makes you a good [insert field here] scientist. In other words, a good data scientist is a good scientist.</p>
Sunday data/statistics link roundup (12/30/12)
2012-12-30T10:32:45+00:00
http://simplystats.github.io/2012/12/30/sunday-datastatistics-link-roundup-123012
<ol>
<li>An interesting new app called <a href="http://100plus.com/">100plus</a>, which looks like it uses public data to help determine how little decisions (walking more, one more glass of wine, etc.) lead to more or less health. <a href="http://www.healthdata.gov/blog/100plus-%E2%80%93-app-making-health-care-easier">Here’s a post</a> describing it on the healthdata.gov blog. As far as I can tell, the app is still in beta, so only the folks who have a code can download it.</li>
<li><a href="http://m.motherjones.com/politics/2012/12/mass-shootings-mother-jones-full-data">Data</a> on mass shootings from the Mother Jones investigation.</li>
<li>A post by Hilary M. on <a href="http://www.hilarymason.com/blog/getting-started-with-data-science/">“Getting Started with Data Science”</a>. I really like the suggestion of just picking a project and doing something, getting it out there. One thing I’d add to the list is that I would spend a little time learning about an area you are interested in. With all the free data out there, it is easy to just “do something”, without putting in the requisite work to know why what you are doing is good/bad. So when you are doing something, make sure you take the time to “know something”.</li>
<li>An <a href="http://xxx.lanl.gov/pdf/0902.2183v2.pdf">analysis of various measures of citation impact</a> (also via Hilary M.). I’m not sure I follow the reasoning behind all of the analyses performed (seems a little like throwing everything at the problem and hoping something sticks) but one interesting point is how citation/usage are far apart from each other on the PCA plot. This is likely just because the measures cluster into two big categories, but it makes me wonder. Is it better to have a lot of people read your paper (broad impact?) or cite your paper (deep impact?).</li>
<li>A <a href="https://twitter.com/hmason/status/285163907360899072">post</a> on Twitter about how big data does not mean you can ignore the scientific method. We have talked a little bit about this before, in terms of how one should <a href="http://simplystatistics.org/2012/06/28/motivating-statistical-projects/">motivate statistical projects</a>.</li>
</ol>
Make a Christmas Tree in R with random ornaments/presents
2012-12-24T11:09:18+00:00
http://simplystats.github.io/2012/12/24/make-a-christmas-tree-in-r-with-random-ornamentspresents
<p>Happy holidays!</p>
<p><a href="http://simplystatistics.org/2012/12/24/make-a-christmas-tree-in-r-with-random-ornamentspresents/xmas/" rel="attachment wp-att-768"><img class="alignnone size-medium wp-image-768" alt="xmas" src="http://simplystatistics.org/wp-content/uploads/2012/12/xmas-150x300.png" width="150" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2012/12/xmas-150x300.png 150w, http://simplystatistics.org/wp-content/uploads/2012/12/xmas.png 480w" sizes="(max-width: 150px) 100vw, 150px" /></a></p>
<p><a href="https://gist.github.com/4369771">Link to Gist</a></p>
Sunday data/statistics link roundup 12/23/12
2012-12-23T09:44:38+00:00
http://simplystats.github.io/2012/12/23/sunday-datastatistics-link-roundup-122312
<ol>
<li>A <a href="http://diabetesvis.herokuapp.com/diabetes/dashboard">cool data visualization</a> for blood glucose levels for diabetic individuals. This kind of interactive visualization can help people see where/when major health issues arise for chronic diseases. This was a class project by Jeff Heer’s Stanford CS448B students Ben Rudolph and Reno Bowen (twitter @RenoBowen). Speaking of interactive visualizations, I also got <a href="http://dexvis.com/doku.php">this link</a> from Patrick M. It looks like a way to build interactive graphics and my understanding is it is compatible with R data frames, worth checking out (plus, Dex is a good name).</li>
<li>Here is an <a href="http://mathbabe.org/2012/12/20/nate-silver-confuses-cause-and-effect-ends-up-defending-corruption/">interesting review</a> of Nate Silver’s book. The interesting thing about the review is that it doesn’t criticize the statistical content, but criticizes the belief that people only use data analysis for good. This is an interesting theme we’ve seen before. Gelman also <a href="http://andrewgelman.com/2012/12/two-reviews-of-nate-silvers-new-book-from-kaiser-fung-and-cathy-oneil/">reviews the review</a>.</li>
<li>It’s a little late now, but this tool seems useful for folks who want to know <a href="http://www.whatdoineedonmyfinal.com/">whatdoineedonmyfinal</a>?</li>
<li>A list of the <a href="http://www.theatlanticcities.com/technology/2012/12/best-open-data-releases-2012/4200/">best open data releases of 2012</a>. I particularly like the rat sightings in New York and the Baltimore fixed speed cameras (which I have a habit of running afoul of).</li>
<li>A <a href="http://giladlotan.com/blog/mapping-twitters-python-data-science-communities/">map of data scientists</a> on Twitter. Unfortunately, since we don’t have “data scientist” in our Twitter description, Simply Statistics does not appear. I’m sure we would have been central….</li>
<li>Here is <a href="http://www.nature.com/ncomms/journal/v3/n12/full/ncomms2292.html?WT.mc_id=TWT_NatureComms">an interesting paper</a> where some investigators developed a technology that directly reads out a bar chart of the relevant quantities. They mention this means there is no need for statistical analysis. I wonder if the technology also reads out error bars.</li>
</ol>
The NIH peer review system is still the best at identifying innovative biomedical investigators
2012-12-20T10:00:15+00:00
http://simplystats.github.io/2012/12/20/the-nih-peer-review-system-is-still-the-best-at-identifying-innovative-biomedical-investigators
<p><a href="http://www.nature.com/nature/journal/v492/n7427/full/492034a.html">This</a> recent Nature paper makes the controversial claim that the most innovative (interpreted as best) scientists are not being funded by NIH. Not surprisingly, it is getting a lot of attention in the popular media. The title and introduction make it sound like there is a pervasive problem biasing the funding enterprise against innovative scientists. To me this appears counterintuitive given how much innovation, relative to other funding agencies around the world, comes out of NIH funded researchers (<a href="http://www.nytimes.com/2011/09/13/health/13gene.html?pagewanted=all&_r=0">here</a> is a recent example) and how many of the best biomedical investigators in the world elect to work for NIH funded institutions. The authors use data to justify their conclusions but I do not find it very convincing.</p>
<p>First, the paper defines innovative/non-conformist scientists as those with a first/last/single author paper with 1000+ citations in the years 2002-2012. Obvious problems with this definition are already pointed out in the comments of the original paper, but for argument’s sake I will accept it as a useful quantification. The key data point the authors use is that only 2/5 of people with a first/last/single author 1000+ citation paper are principal investigators on NIH grants. I would need to see the complete 2x2 table for people that actually applied for grants (1000+ citations or not x got NIH grant or not) to be convinced. The reported ratio is meaningful only if most people with 1000+ citation papers are applying for grants, but the authors don’t report how many are retired, or are still postdocs, or went into industry, or are one-hit-wonders. Given that the payline is about 8%-15%, the 40% number may actually imply that NIH is in fact funding innovative people at a high rate.</p>
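<p>To make the 2x2 table argument concrete, here is a minimal sketch in Python. Every count below is made up for illustration (the paper does not report them), and the 12% comparison rate is simply an assumed value inside the 8%-15% payline range mentioned above. The only point is that the same 40 funded people give a very different rate depending on whether the denominator is everyone with a highly cited paper or only those who applied.</p>

```python
# All counts here are hypothetical, invented purely for illustration.

def funding_rate(funded, not_funded):
    """Fraction of a group of applicants who hold an NIH grant."""
    return funded / (funded + not_funded)

# Assumed: 100 authors with a 1000+ citation paper, 40 of them funded (the reported 2/5).
highly_cited_total = 100
highly_cited_funded = 40

# Assumed: only 50 of the 100 actually applied (the rest retired, left for industry, etc.).
highly_cited_applied = 50

# Rate using everyone with a highly cited paper as the denominator.
rate_all = highly_cited_funded / highly_cited_total

# Rate using only applicants as the denominator (one row of the 2x2 table).
rate_applicants = funding_rate(
    funded=highly_cited_funded,
    not_funded=highly_cited_applied - highly_cited_funded,
)

# Assumed comparison row: other applicants funded at a 12% rate.
other_rate = funding_rate(funded=120, not_funded=880)

print(f"all highly cited authors: {rate_all:.2f}")
print(f"highly cited applicants:  {rate_applicants:.2f}")
print(f"other applicants:         {other_rate:.2f}")
```

<p>Under these invented numbers the applicant-only rate is twice the headline rate, which is why the missing row and column totals matter.</p>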
<p>The paper also implies that many of the undeserving funding recipients are connected individuals that serve on study sections. The evidence for this is that they are funded at a much higher rate than individuals with 1000+ citation papers. But as the authors themselves point out, study section members are often recruited from the subset of individuals who have NIH grants (it’s a way to give back to NIH). This does not suggest bias in the process, it just suggests that if you recruit funded people to be on a panel, that panel will have a higher rate of funded people.</p>
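<p>A toy simulation shows the same selection effect. This is only a sketch under stated assumptions: the population size, the 15% award probability, and the 90/10 recruiting split are all invented, and grants are handed out by coin flip so that, by construction, there is no favoritism anywhere in the process.</p>

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

n_investigators = 10_000
award_probability = 0.15  # assumed chance of holding a grant, identical for everyone

# Grants are awarded by coin flip, so no one is favored.
funded = [random.random() < award_probability for _ in range(n_investigators)]

funded_ids = [i for i, is_funded in enumerate(funded) if is_funded]
unfunded_ids = [i for i, is_funded in enumerate(funded) if not is_funded]

# Assumed recruiting rule: 90 panel members come from grant holders, 10 from everyone else.
panel = random.sample(funded_ids, 90) + random.sample(unfunded_ids, 10)

overall_rate = sum(funded) / n_investigators
panel_rate = sum(funded[i] for i in panel) / len(panel)

print(f"funded rate overall:  {overall_rate:.2f}")
print(f"funded rate on panel: {panel_rate:.2f}")
```

<p>The panel's funded rate is set entirely by how the panel was recruited, not by anything the panel did.</p>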
<p>NIH’s peer review system is far from perfect but it somehow manages to produce the best biomedical research in the world. How does this happen? Well, I think it’s because NIH is currently funding some of the most innovative biomedical researchers in the world. The current system can certainly improve, but perhaps we should focus on concrete proposals with hard evidence that they will actually make things better.</p>
<p>Disclaimers: I am a regular member of an NIH study section. I am PI on NIH grants. I am on several papers with more than 1000 citations.</p>
Rafa interviewed about statistical genomics
2012-12-19T11:47:10+00:00
http://simplystats.github.io/2012/12/19/rafa-interviewed-about-statistical-genomics
<p>He talks about the <a href="http://simplystatistics.tumblr.com/post/21914291274/people-in-positions-of-power-that-dont-understand">problems created by the speed of increase in data sizes</a> in molecular biology, the way that genomics is hugely driven by data analysis/statistics, how Bioconductor is an example of <a href="http://simplystatistics.org/2012/09/07/top-down-versus-bottom-up-science-data-analysis/">bottom up science</a>, Simply Statistics gets a shout out, how new data are going to lead to new modeling/statistical challenges, and gives an ode to boxplots. It’s worth watching the whole thing…</p>
The value of re-analysis
2012-12-18T11:58:28+00:00
http://simplystats.github.io/2012/12/18/the-value-of-re-analysis
<p>I just saw <a href="http://www.johndcook.com/blog/2012/12/18/the-value-of-typing-code/">this really nice post</a> over on John Cook’s blog. He talks about how it is a valuable exercise to re-type code for examples you find in a book or on a blog. I completely agree that this is a good way to learn through osmosis, learn about debugging, and often pick up the reasons for particular coding tricks (this is how I learned about vectorized calculations in Matlab, by re-typing and running my advisor’s code back in my youth).</p>
<p>In a more statistical version of this idea, Gary King has proposed <a href="http://gking.harvard.edu/gking/papers">reproducing the analysis</a> in a published paper as a way to get a paper of your own. You can figure out the parts that a person did well and the parts that you would do differently, maybe finding enough insight to come up with your own new paper. But I think this level of replication involves actually two levels of thinking:</p>
<ol>
<li>Can you actually reproduce the code used to perform the analysis?</li>
<li>Can you solve the “<a href="http://www.perlsteinlab.com/blog/papers-as-puzzles">paper as puzzle</a>” exercise proposed by Ethan Perlstein over at his site? Given the results in the paper, can you come up with the story?</li>
</ol>
<p>Both of these things require a bit more “higher level thinking” than just re-running the analysis if you have the code. But I think even the seemingly “low-level” task of just retyping and running the code that is used to perform a data analysis can be very enlightening. The problem is that this code, in many cases, does not exist. But that is starting to change. If you check out <a href="http://www.rpubs.com/">Rpubs</a> or <a href="http://www.runmycode.org/CompanionSite/">RunMyCode</a> or even the right parts of <a href="http://figshare.com/">Figshare</a> you can find data analyses you can run through and reproduce.</p>
<p>The only downside is there is currently no measure of quality on these published analyses. It would be great if people could focus their time re-typing only good data analyses, rather than one at random. Or, as a guy once (almost) <a href="http://www.quoteworld.org/quotes/8414">said</a>, “Data analysis practice doesn’t make perfect, perfect data analysis practice makes perfect.”</p>
Should the Cox Proportional Hazards model get the Nobel Prize in Medicine?
2012-12-17T15:26:16+00:00
http://simplystats.github.io/2012/12/17/should-the-cox-proportional-hazards-model-get-the-nobel-prize-in-medicine
<p><a href="http://www.ncbi.nlm.nih.gov/pubmed/12762435">I’m not the first one</a> to suggest that Biostatistics has been undervalued in the scientific community, and some of the shortcomings of epidemiology and biostatistics have been noted elsewhere. But this previous work focuses primarily on the contributions of statistics/biostatistics at the purely scientific level.</p>
<p>The <a href="http://en.wikipedia.org/wiki/Proportional_hazards_models">Cox Proportional Hazards model</a> is one of the most widely used statistical models in the analysis of data from clinical trials and other medical studies. The corresponding paper has been cited over <a href="http://scholar.google.com/scholar?q=Regression+models+and+life-tables&btnG=&hl=en&as_sdt=0%2C21">32,000 times</a>; this is a dramatically low estimate of the number of times the model has been used. It is one of “those methods” that doesn’t even require a reference to the original methods paper anymore.</p>
<p>Many of the most influential medical studies, including major studies like the <a href="http://jama.jamanetwork.com/article.aspx?articleid=1108397">Women’s Health Initiative</a>, have used these methods to answer some of our most pressing medical questions. Despite the incredible impact of this statistical technique on the world of medicine and public health, it has not received the Nobel Prize. This isn’t an aberration: statistical methods are not traditionally considered for Nobel Prizes in Medicine. The prizes primarily focus on biochemical, genetic, or public health discoveries.</p>
<p>In contrast, many economics Nobel Prizes have been awarded primarily for the discovery of a new statistical or mathematical concept. One example is the <a href="http://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity">ARCH model</a>. The Nobel Prize in Economics in 2003 was awarded to <a href="http://www.nobelprize.org/nobel_prizes/economics/laureates/2003/">Robert Engle</a>, the person who proposed the original ARCH model. The model has gone on to have a major impact on financial analysis, much like the Cox model has had a major impact on medicine.</p>
<p>So why aren’t Nobel Prizes in medicine awarded to statisticians more often? Other methods such as ANOVA, P-values, etc. have also had an incredibly large impact on the way we measure and evaluate medical procedures. Maybe as medicine becomes increasingly driven by data, we will start to see more statisticians recognized for their incredible discoveries and the huge contributions they make to medical research and practice.</p>
Sunday data/statistics link roundup (12/16/12)
2012-12-16T10:01:36+00:00
http://simplystats.github.io/2012/12/16/sunday-datastatistics-link-roundup-121612
<ol>
<li>A <a href="http://www.doaj.org/doaj?func=home&uiLanguage=en">directory of open access journals</a>. Very cool idea to aggregate them. Here is a <a href="http://www.thejuliagroup.com/blog/?p=2898">blog post</a> from one of my favorite statistics bloggers about why open-access journals are so cool. Just like in a lot of other areas, open access journals can be thought of as an open data initiative.</li>
<li>Here is a website that <a href="http://www.richblockspoorblocks.com/">displays data on the relative wealth of neighborhoods</a>, broken down by census tract. It’s pretty fascinating to take a look and see what the income changes are, even in regions pretty close to each other.</li>
<li>More citizen science goodness. Zooniverse <a href="https://www.zooniverse.org/project/snapshotserengeti">has a new project</a> where you can look through a bunch of pictures in the Serengeti and see if you can find animals.</li>
<li>Nate Silver <a href="http://www.youtube.com/watch?feature=player_embedded&v=mYIgSq-ZWE0">talking about his new book</a> with Hal Varian. (<a href="http://www.youtube.com/watch?feature=player_embedded&v=mYIgSq-ZWE0">via</a>). I have skimmed the book and found that the parts about baseball/politics are awesome and the other parts seem a little light. But maybe that’s just my pre-conceived bias? I’d love to hear what other people thought…</li>
</ol>
Computing for Data Analysis Returns
2012-12-14T09:20:15+00:00
http://simplystats.github.io/2012/12/14/computing-for-data-analysis-returns
<p>I’m happy to announce that my course <a href="https://www.coursera.org/course/compdata">Computing for Data Analysis</a> will return to <a href="http://coursera.org">Coursera</a> on January 2nd, 2013. While I had previously announced that the course would be presented again right here, it made more sense to do it again on Coursera where it is (still) free and the platform there is much richer. For those of you who missed it the last time around, this is your chance to take it and learn a little R.</p>
<p>I’ve gotten a number of emails from people who were interested in watching the videos for the course. If you just want to sit around and watch videos of me talking, I’ve created a set of four YouTube playlists based on the four weeks of the course:</p>
<ul>
<li><a href="http://www.youtube.com/playlist?list=PLjTlxb-wKvXMUop9m0C8G5xLBzhsIDBC7">Background and getting started</a></li>
<li><a href="http://www.youtube.com/playlist?list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ&feature=view_all">Week 1</a>: Background on R, data types, reading/writing data</li>
<li><a href="http://www.youtube.com/playlist?list=PLjTlxb-wKvXNnjUTX4C8IeIhPBjPkng6B&feature=view_all">Week 2</a>: Control structures, functions, apply functions, debugging tools</li>
<li><a href="http://www.youtube.com/playlist?list=PLjTlxb-wKvXOzI2h0F2_rYZHIXz8GWBop&feature=view_all">Week 3</a>: Plotting and simulation</li>
<li><a href="http://www.youtube.com/playlist?list=PLjTlxb-wKvXOdzysAE6qrEBN_aSBC0LZS&feature=view_all">Week 4</a>: Regular expressions, classes and methods</li>
</ul>
<p>The content in the YouTube playlists reflects the first iteration of the course and will not reflect any new material I add to the second iteration (at least not for a little while).</p>
<p>I encourage everyone who is interested to enroll in the course on Coursera because there you’ll have the benefit of in-video quizzes and other forms of assessment and will be able to interact with all of the great students who are also enrolled in the class. Also, if you’re interested in signing up for Jeff Leek’s <a href="https://www.coursera.org/course/dataanalysis">Data Analysis</a> course (starts on January 22, 2013) and are not very familiar with R, I encourage you to check out Computing for Data Analysis first to get yourself up to speed.</p>
<p>I look forward to seeing you there!</p>
Joe Blitzstein's free online stat course helps put a critical satellite in orbit
2012-12-10T11:15:47+00:00
http://simplystats.github.io/2012/12/10/joe-blitzsteins-free-online-stat-course-helps-put-a-critical-satellite-in-orbit
<p>As loyal readers know, we are <a href="http://simplystatistics.org/2012/08/10/why-we-are-teaching-massive-open-online-courses-moocs/">very</a> <a href="http://simplystatistics.org/2012/07/26/online-education-many-academics-are-missing-the-point/">enthusiastic</a> about MOOCs. One of the main reasons for this is the potential of teaching Statistics to students from all over the world, in particular those that can’t afford or don’t have access to college. However, it turns out that rocket scientists can also benefit. Check out the feedback <a href="http://simplystatistics.org/2012/01/20/interview-with-joe-blitzstein/">Joe Blitzstein</a>, professor of one of the most <a href="https://itunes.apple.com/us/course/statistics-110-probability/id502492375">popular online stat courses,</a> received from one of his students:</p>
<blockquote>
<p>As an “old bubba” aerospace engineer I watched your Stat 110 class and enjoyed it very much. It sure blew out a lot of cobwebs that had collected over the past 35 years working as an aerospace engineer. As you might guess, we deal with a lot of probability. Just recently I was involved in a study to see what a blocked Reaction Control System (RCS) might do to a satellite… I am a Spacecraft Attitude Control systems engineer and it was my job to simulate what would happen if a certain RCS engine was plugged. It was a weird problem and it inspired me to watch your class… Fortunately, the statistics showed that the RCS nozzles that could get plugged would have a low probability and would not affect our ability to adjust the vehicle’s orbit. And we launched it this past summer and everything went perfect! So I just wanted to tell you that when you teach your “kiddos” tell them that Stat 110 has real life implications. This satellite is a critical national defense asset that saves the lives of our soldiers on the ground.</p>
</blockquote>
<p>I doubt “Old Bubba” has time to go back to school to refresh his stats knowledge… but thanks to Joe’s online class, he no longer needs to. This is yet another advantage MOOCs offer: giving busy professionals a practical way to learn new skills or brush up on specific topics.</p>
Sunday data/statistics link roundup (12/9/12)
2012-12-09T10:14:57+00:00
http://simplystats.github.io/2012/12/09/sunday-datastatistics-link-roundup-12912
<ol>
<li><span style="line-height: 16px;">Some <a href="http://www.prana.com/life/2012/12/01/conscious-consumerism-how-do-your-brands-rate/">interesting data/data visualizations</a> about working conditions in the apparel industry. <a href="http://www.free2work.org/trends/apparel/?utm_source=Social%20Ventures&utm_medium=Hootsuite&utm_campaign=SV%20News%20Feed">Here</a> is the full report. Whenever I see reports like this, I wish the raw data were more clearly linked. I want to be able to get in, play with the data, and see if I notice something that doesn’t appear in the infographics. </span></li>
<li><span style="line-height: 16px;">This is an awesome <a href="http://wmbriggs.com/blog/?p=6465">plain-language discussion</a> of how a bunch of methods (CS and Stats) with fancy names relate to each other. It shows that CS/Machine Learning/Stats are converging in many ways and there isn’t much new under the sun. On the other hand, I think the really exciting thing here is to use these methods on new questions, once people <a href="http://simplystatistics.org/2012/12/08/dropping-the-stick-in-data-analysis/">drop the stick</a>. </span></li>
<li><span style="line-height: 16px;">If you are a reader of this blog and somehow do not read anything else on the internet, you will have missed Hadley Wickham’s <a href="https://github.com/hadley/devtools/wiki/Rcpp">Rcpp tutorial</a>. In my mind, this pretty much seals it, Julia isn’t going to overtake R anytime soon. In other news, Hadley is <a href="http://biostat.jhsph.edu/newsEvent/event/seminar/seminars.shtml">coming to visit</a> JHSPH Biostats this week! I’m psyched to meet him. </span></li>
<li><span style="line-height: 16px;">For those of us that live in Baltimore, this <a href="http://www.r-bloggers.com/visualizing-baltimore-with-r-and-ggplot2-crime-data/">interesting set of data visualizations</a> lets you in on the crime hotspots. This is a much fancier/more thorough analysis than <a href="http://simplystatistics.org/2012/01/03/baltimore-gun-offenders-and-where-academics-dont-live/">Rafa and I did</a> way back when. </span></li>
<li><span style="line-height: 16px;">Check out the new <a href="http://www.census.gov/easystats/">easy stats tool</a> from the Census (via Hilary M.) and read our interview with <a href="http://simplystatistics.org/2012/11/09/interview-with-tom-louis-new-chief-scientist-at-the/">Tom Louis</a> who is heading over there to the Census to do cool things. </span></li>
<li><span style="line-height: 16px;"><a href="http://www.slate.com/blogs/bad_astronomy/2012/12/07/ted_to_tedx_how_to_avoid_bad_science_in_talks.html?utm_source=tw&utm_medium=sm&utm_campaign=button_toolbar">Watch out</a>, some Tedx talks may be pseudoscience! More later this week on the politicization/glamourization of science, so stay tuned. </span></li>
</ol>
Dropping the Stick in Data Analysis
2012-12-08T14:57:26+00:00
http://simplystats.github.io/2012/12/08/dropping-the-stick-in-data-analysis
<p>When I was a kid growing up in rough-and-tumble suburban New York, one of the major summer activities was roller hockey, the kind with roller blades (remember roller blades?). My friends and I would be playing in some random parking lot and undoubtedly one of us would be just blowing it the whole game. This would usually lead to an impromptu intervention where the person screwing up (often me) would be told by everyone else on the team to “drop the stick”. The idea was you should stop playing, clear your head, skate around for a bit, and not try to do 20 things at once.</p>
<p>I don’t play much hockey now, but I do a bit more data analysis. Strangely, little has changed.</p>
<p>People come to me at various stages of data analysis. Close collaborators usually come to me with no data because they are planning a study and need some help. In those cases, I’m involved in the beginning and know how the data are generated. Usually, in those cases I analyze the data in the end so there’s less confusion.</p>
<p>Others usually come to me with data in hand wanting to know what they should do now that they’ve got all this data. Often there’s confusion about where to start, what method to use, what program, what procedure, what function, what test, Bayesian or frequentist, mean or median, R or Stata, random effects or fixed effects, cat or dog, mice or men, etc. That’s usually the point where I tell them to “drop the stick”, or the data analysis version of that, which is “What question are you trying to answer?”</p>
<p>Usually, people know what question they’re trying to answer–they just forgot to tell me. But I’m always amazed at how this question can often be the subject of the entire discussion. We might end up answering a question the investigator hadn’t thought of yet, maybe a question that’s better suited to the data.</p>
<p>So, job #1 if you’re a statistician: Get more people to drop the stick. You’ll make everyone play better in the end.</p>
Email is a to-do list made by other people - can someone make it more efficient?!
2012-12-05T11:33:21+00:00
http://simplystats.github.io/2012/12/05/an-idea-for-killing-email
<p>This is a follow-up to one of our most popular posts: <a href="http://simplystatistics.org/post/10558246695/getting-email-responses-from-busy-people" target="_blank">getting email responses from busy people</a>. This post had been in the drafts for a few weeks, then this morning I saw this quote in our Twitter feed:</p>
<blockquote>
<p>Your email inbox is a to-do list created by other people (<a href="https://twitter.com/medriscoll/status/276352287230803968">via</a>)</p>
</blockquote>
<p>This is 100% true of my work email and I have to say, because of the way those emails are organized - as conversations rather than a prioritized, organized to-do list - I end up missing really important things or getting to them too late. This is happening to me with enough frequency that I feel like I’m starting to cause serious problems for people.</p>
<p>So I am begging someone with way better skills than me to produce software that replaces gmail in the following ways. It is a to-do list that I can allow people to add tasks to. The software shows me the following types of messages.</p>
<ol>
<li>We have an appointment at x time on y date to discuss z. Next to this message is a checkbox. If I click “ok” it gets added to my calendar, if I click “no” then a message gets sent to the person who scheduled the meeting saying I’m unavailable.</li>
<li>A multiple choice question where they input the categories of answer I can give and I just pick one, it sends them the response.</li>
<li>A request to be added as a person who can assign me tasks with a yes/no answer.</li>
<li>A longer request email - this has three entry fields: (1) what do you want, (2) when do you want it by? and (3) a yes/no checkbox asking if I’m willing to perform the task. If I say yes, it gets added to my calendar with automated reminders.</li>
<li>It should interface with all the systems that send me reminder emails to organize the reminders.</li>
<li>You can assign quotas to people, where they can only submit a certain number of tasks per month.</li>
<li>It allows you to re-assign tasks to other people so when I am not the right person to ask, I can quickly move the task on to the right person.</li>
<li>It would collect data and generate automated reports for me about what kind of tasks I’m usually forgetting/being late on and what times of day I’m bad about responding so that I could improve my response times.</li>
</ol>
<p>The software would automatically reorganize events/to-dos to reflect changing deadlines/priorities, etc. This piece of software would revolutionize my life. Any takers?</p>
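<p>For anyone tempted to take this on, here is a rough Python sketch of how a few of the items above (approved senders, monthly quotas, the three-field request, and re-assignment) might fit together. All of the names here (<code>Task</code>, <code>Inbox</code>, <code>submit</code>, <code>reassign</code>, <code>todo</code>) are hypothetical, and this is just one possible design, not a description of any existing product.</p>

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Task:
    """Hypothetical 'longer request': what, by when, and whether I said yes."""
    sender: str
    what: str
    due: date
    owner: str = "me"
    accepted: bool = False


@dataclass
class Inbox:
    """Hypothetical to-do list that approved people can add tasks to."""
    quotas: dict  # approved sender -> maximum tasks per month
    tasks: list = field(default_factory=list)
    submitted: Counter = field(default_factory=Counter)  # (sender, year, month) -> count

    def submit(self, task, today):
        """Add a task unless the sender is not approved or is over quota."""
        if task.sender not in self.quotas:
            return False
        key = (task.sender, today.year, today.month)
        if self.submitted[key] >= self.quotas[task.sender]:
            return False
        self.submitted[key] += 1
        self.tasks.append(task)
        return True

    def reassign(self, task, new_owner):
        """Move a task on to the right person."""
        task.owner = new_owner

    def todo(self):
        """Accepted tasks that are still mine, earliest deadline first."""
        mine = [t for t in self.tasks if t.accepted and t.owner == "me"]
        return sorted(mine, key=lambda t: t.due)


inbox = Inbox(quotas={"alice": 1})
review = Task(sender="alice", what="review draft", due=date(2012, 12, 20))
print(inbox.submit(review, today=date(2012, 12, 5)))  # within quota
extra = Task(sender="alice", what="second request", due=date(2012, 12, 21))
print(inbox.submit(extra, today=date(2012, 12, 6)))  # over the monthly quota
```

<p>Calendar integration, reminders, and the automated reports are left out of the sketch; those would be the hard parts.</p>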
Advice for students on the academic job market (2013 edition)
2012-12-04T10:00:38+00:00
http://simplystats.github.io/2012/12/04/advice-for-students-on-the-academic-job-market-2013-edition
<p>Job hunting season is upon us. Those on the job market should be sending in applications already. Here I provide links to some of the related posts we published last year.</p>
<ul>
<li><a href="http://simplystatistics.org/2011/09/12/advice-for-stats-students-on-the-academic-job-market/">Advice for stats students on the academic job market</a></li>
<li><a href="http://simplystatistics.org/2011/09/15/another-academic-job-market-option-liberal-arts/">Another academic job market option: liberal arts colleges</a></li>
<li><a href="http://simplystatistics.org/2011/11/16/preparing-for-tenure-track-job-interviews/">Preparing for tenure track job interviews</a></li>
<li><a href="http://simplystatistics.org/2011/12/19/on-hard-and-soft-money/">On hard and soft money</a></li>
</ul>
Data analysis acquisition "worst deal ever"?
2012-12-03T09:16:52+00:00
http://simplystats.github.io/2012/12/03/data-analysis-acquisition-worst-deal-ever
<p>A little over a year ago I mentioned that <a href="http://simplystatistics.org/2011/09/08/data-analysis-companies-getting-gobbled-up/">data analysis companies were getting gobbled up</a> by larger technology companies. In particular, HP bought Autonomy, a British data analysis company, for about $11 billion. (By the way, can anyone tell me if it’s still called Hewlett-Packard, or is it just “HP”, like “AT&T”?) From an article a year ago:</p>
<blockquote>
<p>Autonomy, with headquarters in Cambridge, England, helps companies and governments store, process, search and analyze large electronic data sets. Its specialty lies in its sophisticated algorithms, which can make sense of unstructured information.</p>
</blockquote>
<p>At the time, the thinking was HP had overpaid (especially given HP’s recent high price for 3Par) but the deal went through anyway. Now, HP has discovered accounting problems at Autonomy and is writing down $8.8 billion.</p>
<p>Whoops.</p>
<p>James Stewart of the New York Times claims <a href="http://www.nytimes.com/2012/12/01/business/hps-autonomy-blunder-might-be-one-for-the-record-books.html?pagewanted=all">this is worse than the failed AOL-Time Warner merger</a> (although the absolute numbers involved here are smaller). With 3 CEOs in 2 years, it seems HP just can’t get anything right these days. But what intrigues me most is the question of what companies like Autonomy are worth and the possibility that HP made a huge mistake in the valuation of this company. Of course, if there was fraud at Autonomy (as it seems to be alleged), then all bets are off. But if not, then perhaps this is the first bubble popping in the realm of data analysis companies more generally?</p>
Sunday data/statistics link roundup (12/2/12)
2012-12-02T10:18:06+00:00
http://simplystats.github.io/2012/12/02/sunday-datastatistics-link-roundup-12212
<ol>
<li><span style="line-height: 16px;"><a href="http://sloanreview.mit.edu/feature/business-quandary-use-a-competition-to-crowdsource-best-answers/?non_mobile=1">An interview</a> with Anthony Goldbloom, CEO of Kaggle. I’m not sure I’d agree with the characterization that all data scientists are: creative, curious, and competitive and certainly those characteristics aren’t unique to data scientists. And I didn’t know this: “We have 65,000 data scientists signed up to Kaggle, and just like with golf tournaments, we have them all ranked from 1 to 65,000.” </span></li>
<li><span style="line-height: 16px;">Check it out, <a href="http://www.r-bloggers.com/images-as-voronoi-tesselations/">art with R</a>! It’s actually pretty interesting to see how they use statistical algorithms to generate different artistic styles. <a href="http://www.r-bloggers.com/dominant-color-palettes-with-k-means/">Here</a> are some more. </span></li>
<li><span style="line-height: 16px;">Now that Ethan Perlstein’s crowdfunding experiment </span><a style="line-height: 16px;" href="http://twitter.com/eperlste/status/273152039922565121">was successful</a><span style="line-height: 16px;">, other people are getting on the bandwagon. If you want to find out what kind of bacteria you have in your gut, for example, you could check out <a href="http://www.indiegogo.com/ubiome">this</a>. </span></li>
<li><span style="line-height: 16px;">I thought I had it rough, but apparently some data analysts spend all their time developing algorithms to <a href="http://www.p4rgaming.com/?p=481">detect penis drawings</a>!</span></li>
<li><span style="line-height: 16px;">Roger was on Anderson Cooper 360 as part of the Building America segment. We can’t find the video, but <a href="http://transcripts.cnn.com/TRANSCRIPTS/1211/27/acd.02.html">here</a> is the transcript. </span></li>
<li><span style="line-height: 16px;">An interesting article on the <a href="http://www.economist.com/blogs/babbage/2012/11/qa-samuel-arbesman?fsrc=scn/tw/te/bl/halflifeoffacts">half-life of facts</a>. I think the analogy is an interesting one and certainly there is research to be done there. But I think it jumps the shark a bit when they start talking about how the moon landing was predictable, etc. I completely believe in the retrospective analysis of knowledge, but predicting things is pretty hard, especially when it is the future. </span><span style="line-height: 16px;"> </span></li>
</ol>
Statistical illiteracy may lead to parents panicking about Autism.
2012-11-30T13:09:01+00:00
http://simplystats.github.io/2012/11/30/statistical-illiteracy-may-lead-to-parents-panicking-about-autism
<p>I was just doing my morning reading of a few news sources and stumbled across this <a href="http://www.huffingtonpost.com/2012/11/29/autism-risk-babies-cries_n_2211729.html">Huffington Post article</a> talking about research correlating babies’ cries with autism. It suggests that the sound of a baby’s cries may predict the child’s future risk for autism. As the parent of a young son, this obviously caught my attention in a very lizard-brain, caveman sort of way. I couldn’t find a link to the research paper in the article so I did some searching and found out this result is also being covered by <a href="http://healthland.time.com/2012/11/28/can-a-babys-cry-be-a-clue-to-autism/">Time</a>, <a href="http://www.sciencedaily.com/releases/2012/11/121127111352.htm">Science Daily</a>, <a href="http://www.medicaldaily.com/articles/13324/20121129/baby-s-cry-reveal-autism-risk.htm">Medical Daily</a>, and a bunch of other news outlets.</p>
<p>Now thoroughly freaked, I looked online and found the pdf of the <a href="https://www.ewi-ssl.pitt.edu/psychology/admin/faculty-publications/201209041019040.Sheinkopf%20in%20press.pdf">original research article</a>. I started looking at the statistics and took a deep breath. Based on the analysis they present in the article there is absolutely no statistical evidence that a baby’s cries can predict autism. Here are the flaws with the study:</p>
<ol>
<li><strong>Small sample size</strong>. The authors only recruited 21 at-risk infants and 18 healthy infants. Then, because of data processing issues, they ended up analyzing only 7 high autistic-risk versus 5 low autistic-risk infants in one analysis and 10 versus 6 in another. That is nowhere near a representative sample and barely qualifies as a pilot study.</li>
<li><strong>Major and unavoidable confounding</strong>. The way the authors determined high autistic risk versus low risk was based on whether an older sibling had autism. Leaving aside the quality of this metric for measuring risk of autism, there is a major confounding factor: the families of the high risk children all had an older sibling with autism and the families of the low risk children did not! It would not be surprising at all if children with one autistic older sibling might get a different kind of attention and hence cry differently regardless of their potential future risk of autism.</li>
<li><strong>No correction for multiple testing</strong>. This is one of the oldest problems in statistical analysis. It is also one that is a consistent culprit of false positives in epidemiology studies. XKCD <a href="http://xkcd.com/882/">even did a cartoon</a> about it! They tested 9 variables measuring the way babies cry and tested each one with a statistical hypothesis test. They did not correct for multiple testing. So I gathered resulting p-values and did the correction <a href="https://gist.github.com/4177366">for them</a>. It turns out that after adjusting for multiple comparisons, nothing is significant at the usual P < 0.05 level, which would probably have prevented publication.</li>
</ol>
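<p>To make the multiple testing point concrete, here is a minimal sketch of two standard corrections, Bonferroni and Benjamini-Hochberg, written in plain Python. The nine p-values are hypothetical and are not the values from the paper, and I am not claiming this is the correction used in the gist linked above; the function names are just illustrative.</p>

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (false discovery rate control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for nine tests; NOT taken from the paper.
p_values = [0.008, 0.02, 0.03, 0.04, 0.11, 0.25, 0.40, 0.62, 0.90]

n_raw = sum(p < 0.05 for p in p_values)
print(f"Uncorrected: {n_raw} of {len(p_values)} below 0.05")
for name, adj in [("Bonferroni", bonferroni(p_values)),
                  ("Benjamini-Hochberg", benjamini_hochberg(p_values))]:
    n_sig = sum(a < 0.05 for a in adj)
    print(f"{name}: {n_sig} of {len(p_values)} below 0.05")
```

<p>With these made-up inputs, several tests look significant before correction and none do after either adjustment, which is the pattern described in the third point above.</p>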
<p>Taken together, these problems mean that the statistical analysis of these data does not show any connection between crying and autism.</p>
<p>The problem here exists on two levels. First, there was a failing in the statistical evaluation of this manuscript at the peer review level. Most statistical referees would have spotted these flaws and pointed them out for such a highly controversial paper. A second problem is that news agencies report on this result and, despite paying lip service to potential limitations, are not statistically literate enough to point out the major flaws in the analysis that reduce the probability of a true positive. Should journalists have some minimal training in statistics that allows them to determine whether a result is likely to be a false positive, to save us parents a lot of panic?</p>
I give up, I am embracing pie charts
2012-11-27T20:53:20+00:00
http://simplystats.github.io/2012/11/27/i-give-up-i-am-embracing-pie-charts
<p>Most statisticians know that pie charts are a terrible way to plot percentages. You can find explanations <a href="http://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/">here</a>, <a href="http://blog.revolutionanalytics.com/2009/08/how-pie-charts-fail.html">here</a>, and <a href="https://www.google.com/search?q=why+do+pie+charts+suck&oq=why+do+pie+charts+suck&aqs=chrome.0.57j62.4254&sugexp=chrome,mod=3&sourceid=chrome&ie=UTF-8">here</a> as well as the R help file for the <code class="language-plaintext highlighter-rouge">pie</code> function which states:</p>
<blockquote>
<p>Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.</p>
</blockquote>
<p><img class="alignright" style="line-height: 24px; font-size: 16px;" src="http://rafalab.jhsph.edu/simplystats/pacman.gif" alt="pacman" width="181" height="181" /></p>
<p>I have only used the <code class="language-plaintext highlighter-rouge">pie</code> R function once and it was to make this plot (R code below):</p>
<p>So why are they ubiquitous? The best explanation I’ve heard is that they are easy to make in Microsoft Excel. Regardless, after years of training, lay people are probably better at interpreting pie charts than any other graph. So I’m surrendering and embracing the pie chart. Jeff’s <a href="http://simplystatistics.org/2012/11/26/the-statisticians-at-fox-news-use-classic-and-novel-graphical-techniques-to-lead-with-data/">recent post</a> shows we have bigger fish to fry.</p>
The statisticians at Fox News use classic and novel graphical techniques to lead with data
2012-11-26T10:04:47+00:00
http://simplystats.github.io/2012/11/26/the-statisticians-at-fox-news-use-classic-and-novel-graphical-techniques-to-lead-with-data
<p>Depending on where you land in the political spectrum you may either love or despise Fox News. But regardless of your political affiliation, you have to recognize that their statisticians are well-trained in the art of using graphics to persuade folks of a particular viewpoint. I’m not the first to recognize that the graphics department uses some clever tricks to make certain points. But when flipping through the graphs I thought it was interesting to highlight some of the techniques they use to persuade. Some are clearly classics from the literature, but some are (as far as I can tell) newly developed graphical “persuasion” techniques.</p>
<p><strong>Truncating the y-axis</strong></p>
<p><img class="alignnone" src="http://mediamatters.org/static/images/item/fnc-an-20120809-welfarechart-2.jpg" alt="" width="354" height="266" /></p>
<p>(<a href="http://mediamatters.org/blog/2012/08/09/today-in-dishonest-fox-charts-government-aid-ed/189223">via</a>)</p>
<p>and</p>
<p><img class="alignnone" src="http://blogs-images.forbes.com/naomirobbins/files/2012/08/Bush_cuts2.png" alt="" width="386" height="286" /></p>
<p>(<a href="http://www.forbes.com/sites/naomirobbins/2012/08/04/another-misleading-graph-of-romneys-tax-plan/">via</a>)</p>
<p>This is a pretty common technique for leading the question in statistical graphics, as discussed <a href="http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728">here</a> and elsewhere.</p>
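<p>A quick way to see how much a truncated axis exaggerates a difference is to compare the ratio of the values with the ratio of the bar heights as drawn. The sketch below does that; the numbers are hypothetical and were not read off either chart above, and <code>drawn_ratio</code> is just an illustrative helper name.</p>

```python
def drawn_ratio(low, high, axis_start=0.0):
    """Ratio of bar heights as drawn when the y-axis starts at axis_start."""
    if axis_start >= low:
        raise ValueError("axis_start must be below the smaller value")
    return (high - axis_start) / (low - axis_start)


# Hypothetical values, chosen only to illustrate the effect.
low, high = 100.0, 110.0
honest = drawn_ratio(low, high)                      # axis starts at zero
truncated = drawn_ratio(low, high, axis_start=95.0)  # axis starts just below the data

print(f"ratio of values: {honest:.2f}")
print(f"ratio of drawn bar heights: {truncated:.2f}")
```

<p>In this made-up case a 10% difference is drawn as a bar three times taller than its neighbor.</p>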
<p><strong>Numbers that don’t add up</strong></p>
<p>I’m not sure whether this one is intentional or not, but it crops up in several places and I think it is a unique approach to leading information; at least I couldn’t find a reference in the literature. Basically the idea is to produce percentages that don’t add to 100%, allowing multiple choices to have closer percentages than they probably should:</p>
<p><img class="alignnone" src="http://24.media.tumblr.com/tumblr_m9xia70vbR1rfnvq8o1_500.jpg" alt="" width="300" height="150" /></p>
<p>(<a href="http://badgraphs.tumblr.com/">via</a>)</p>
<p>or to suggest that multiple options are all equally likely, but also supported by large percentages:</p>
<p><img class="alignnone" src="http://flowingdata.com/wp-content/uploads/yapb_cache/app15725951258947184.acq6gmp0hf4sowckg80ssc8wg.2xne1totli0w8s8k0o44cs0wc.th.png" alt="" width="329" height="247" /></p>
<p>(<a href="http://flowingdata.com/2009/11/26/fox-news-makes-the-best-pie-chart-ever/">via</a>)</p>
<p><strong>Changing the units of comparison</strong></p>
<p>When two things are likely to be very similar, one approach to leading information is to present variables in different units. Here is an example where total spending for 2010-2013 is compared to deficits in 2008. This can also be viewed as an example of <a href="http://www.sao.state.tx.us/resources/Manuals/Method/data/12DECEPD.pdf">not labeling the axes</a>.</p>
<p><img class="alignnone" src="http://mediamatters.org/static/images/item/fnc-ff-20120926-spending.jpg" alt="" width="270" height="215" /></p>
<p>(<a href="http://mediamatters.org/blog/2012/09/26/by-the-way-heres-another-dishonest-fox-news-gra/190141">via</a>)</p>
<p><span style="color: #000000;"><strong>Changing the magnitude of units at different x-values</strong></span></p>
<p>Here is a plot where the changes in magnitude at high x-values are higher than changes in magnitude at lower x-values. Again, I think this is actually a novel graphical technique for leading readers in one direction.</p>
<p><img class="alignnone" src="http://freethoughtblogs.com/lousycanuck/files/2011/12/121212_fox.jpg" alt="" width="257" height="155" /></p>
<p>(<a href="http://freethoughtblogs.com/lousycanuck/2011/12/14/im-better-at-graphs-than-fox-news/">via</a>)</p>
<p>To really see the difference, compare to the graph with common changes in magnitude at all x-values.</p>
<p><img class="alignnone" src="http://freethoughtblogs.com/lousycanuck/files/2011/12/us-unemployment2011.png" alt="" width="341" height="198" /></p>
<p>(<a href="http://freethoughtblogs.com/lousycanuck/2011/12/14/im-better-at-graphs-than-fox-news/">via</a>)</p>
<p><strong>Changing trends by sub-sampling x values</strong> (also misleading chart titles)</p>
<p>Here is a graph that shows unemployment rates over time and the corresponding chart with the x-axis appropriately laid out.</p>
<p><img class="alignnone" src="http://onlinestatbook.com/2/graphing_distributions/graphics/graph2.png" alt="" width="282" height="163" /></p>
<p><img class="alignnone" src="http://onlinestatbook.com/2/graphing_distributions/graphics/graph3.png" alt="" width="418" height="212" /></p>
<p>(<a href="http://onlinestatbook.com/2/graphing_distributions/graphing_distributionsSA.html">via</a>)</p>
<p>One could argue these are mistakes, but based on the consistent displays of data supporting one viewpoint, I think these are likely the result of someone with real statistical training who is using data in a very specific way to make a point. Obviously, Fox News isn’t the only organization that does this sort of thing, but it is interesting to see how much effort they put into statistical graphics.</p>
Sunday data/statistics link roundup (11/25/2012)
2012-11-25T09:11:03+00:00
http://simplystats.github.io/2012/11/25/sunday-datastatistics-link-roundup-11252012
<ol>
<li><span style="line-height: 16px;">My wife used to teach at Grinnell College, so we were psyched to see that a Grinnell player set the <a href="http://espn.go.com/mens-college-basketball/story/_/id/8658462/jack-taylor-grinnell-drops-138-points-collegiate-scoring-record">NCAA record for most points in a game</a>. We used to go to the games, which were amazing to watch, when we lived in Iowa. The system the coach has in place there is a ton of fun to watch and is <a href="http://science.slashdot.org/story/12/11/21/228242/statistics-key-to-success-in-run-and-gun-basketball?utm_source=slashdot&utm_medium=twitter">based on statistics</a>!</span></li>
<li><span style="line-height: 16px;">Someone has to vet the science writers at the Huffpo. <a href="http://www.huffingtonpost.com/dr-douglas-fields/50-shades-of-grey-in-scientific-publication-how-digital-publishing-is-harming-science_b_2155760.html?utm_hp_ref=tw">This</a> is out of control, basically claiming that open access publishing is harming science. I mean, I’m all about being a curmudgeon and all, but the internet exists now, so we might as well get used to it. </span></li>
<li><span style="line-height: 16px;">This one is probably better for <a href="http://blogs.forbes.com/stevensalzberg/">Steven’s blog</a>, but this is a <a href="http://www.forbes.com/sites/matthewherper/2012/11/20/one-of-my-favorite-charts-on-the-power-of-vaccines/">pretty powerful graph</a> about the life-saving potential of vaccines. </span></li>
<li><span style="line-height: 16px;">Roger <a href="http://simplystatistics.org/2012/11/24/computer-scientists-discover-statistics-and-find-it-useful/">posted yesterday</a> about the NY Times piece on deep learning. It is one of our most shared posts of all time; you should also check out the comments, which are exceedingly good. Two things I thought I’d point out in response to a lot of the reaction: (1) I think part of Roger’s post was suggesting that the statistics community should adopt some of CS’s culture of solving problems with already existing, really good methods and (2) I tried searching for a really clear example of “deep learning” yesterday so we could try some statistics on it and didn’t find any really clear explanations. Does anyone have a really simple example of deep learning (ideally with code) so we can see how it relates to statistical concepts? </span></li>
</ol>
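<p>As one possible starting point for the question in item 4, here is a toy sketch of a two-layer neural network in plain Python. It is not the kind of large model the NY Times piece describes, and the weights are hand-picked rather than learned from data. The point is only the structure: each unit is a logistic regression, and the output unit is a logistic regression on the outputs of the hidden units. All names here are illustrative.</p>

```python
import math


def sigmoid(z):
    """Logistic function, the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_unit(inputs, weights, bias):
    """One 'neuron': a logistic regression on its inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def tiny_network(x1, x2):
    # Hidden layer: two logistic units with hand-picked (not fitted) weights.
    h_or = logistic_unit([x1, x2], [20.0, 20.0], -10.0)     # roughly x1 OR x2
    h_nand = logistic_unit([x1, x2], [-20.0, -20.0], 30.0)  # roughly NOT (x1 AND x2)
    # Output layer: a logistic unit on the hidden units, roughly an AND.
    return logistic_unit([h_or, h_nand], [20.0, 20.0], -30.0)


# The composition behaves like XOR, which no single logistic unit can represent.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(tiny_network(x1, x2), 3))
```

<p>In practice the weights would be estimated from data, typically by maximizing a likelihood with gradient-based optimization, which is where the connection to ordinary statistical modeling is most direct.</p>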
Computer scientists discover statistics and find it useful
2012-11-24T15:53:34+00:00
http://simplystats.github.io/2012/11/24/computer-scientists-discover-statistics-and-find-it-useful
<p>This <a href="http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html?smid=pl-share">article in the New York Times today</a> describes some of the advances that computer scientists have made in recent years.</p>
<blockquote>
<p>The technology, called deep learning, has already been put to use in services like Apple’s Siri virtual personal assistant, which is based on Nuance Communications’ speech recognition service, and in Google’s Street View, which uses machine vision to identify specific addresses.</p>
<p>But what is new in recent months is the growing speed and accuracy of deep-learning programs, often called artificial neural networks or just “neural nets” for their resemblance to the neural connections in the brain.</p>
</blockquote>
<p>Deep learning? Really?</p>
<p>Okay, names aside, there are a few things to say here. First, the advances described in the article are real–I think that’s clear. There’s a lot of pretty cool stuff out there (including Siri, in my opinion) coming from the likes of Google, Microsoft, Apple, and many others and, frankly, I appreciate all of it. I hope to have my own self-driving car one day.</p>
<p>The question is how did we get here? What worries me about this article and many others is that you can get the impression that there were tremendous advances in the technology/methods used. But I find that hard to believe given that the methods that are often discussed in these advances are methods that have been around for quite a while (neural networks, anyone?). The real advance has been in the incorporation of data into these technologies and the use of <em>statistical models</em>. The interesting thing is not that the data are big, it’s that we’re using data at all.</p>
<p>Did Nate Silver produce a better prediction of the election than the pundits because he had better models or better technology? No, it’s because he bothered to use data at all. This is not to downplay the sophistication of Silver’s or others’ approach, but <a href="http://electoral-vote.com/">many</a> <a href="http://votamatic.org/">others</a> <a href="http://www.huffingtonpost.com/news/pollster/">did</a> <a href="http://www.realclearpolitics.com/epolls/2012/president/2012_elections_electoral_college_map.html">what</a> <a href="http://polltracker.talkingpointsmemo.com/">he</a> <a href="http://election.princeton.edu/">did</a> (presumably using different methods–I don’t think there was collaboration) and <a href="http://fivethirtyeight.blogs.nytimes.com/2012/10/31/oct-30-what-state-polls-suggest-about-the-national-popular-vote/">more or less got the same results</a>. So the variation across different models is small, but the variation between using data vs. not using data is, well, big. Peter Norvig notes this in his <a href="http://simplystatistics.org/2012/03/16/the-unreasonable-effectiveness-of-data-a-talk/">talk about how Google uses data for translation</a>. An area that computational linguists had been working on for decades was advanced dramatically by a ton of data and (a variation of) Bayes’ Theorem. I may be going out on a limb here, but I don’t think it was Bayes’ Theorem that did the trick. But there will probably be an article in the New York Times soon about how Bayes’ Theorem is revolutionizing artificial intelligence. Oh wait, <a href="http://www.nytimes.com/2008/05/03/technology/03koller.html">there already was one</a>.</p>
<p>It may sound like I’m trying to bash the computer scientists here, but I’m not. It would be too too easy for me to write a post complaining about how the computer scientists have stolen the ideas that statisticians have been using for decades and are claiming to have discovered new approaches to everything. But that’s exactly what is happening and <em>good for them</em>.</p>
<p>I don’t like to frame everything as an us-versus-them scenario, but the truth is the computer scientists are winning and the statisticians are losing. The reason is that they’ve taken our best ideas and used them to solve problems that matter to people. Meanwhile, we should have been stealing the computer scientists’ best ideas and using them to solve problems that matter to people. But we didn’t. And now we’re playing catch-up, and not doing a particularly good job of it.</p>
<p>That said, I believe there’s still time for statistics to play a big role in “big data”. We just have to choose to do it. Borrowing ideas from other fields is good–that’s why it’s called “re”search, right? Statisticians shouldn’t be shy about it. Otherwise, all we’ll have left to do is complain about how all those people took what we’d been working on for decades and…made it useful.</p>
Developing the New York Times Visual Election Outcome Explorer
2012-11-21T16:13:27+00:00
http://simplystats.github.io/2012/11/21/developing-the-new-york-times-visual-election-outcome-explorer
<p>Mike Bostock <a href="http://source.mozillaopennews.org/en-US/articles/nyts-512-paths-white-house/">talks about the design and construction</a> of the “<a href="http://www.nytimes.com/interactive/2012/11/02/us/politics/paths-to-the-white-house.html">512 Paths to the White House</a>” visualization for the New York Times. I found this visualization extremely useful on election night as it helped me understand the implications of each of the swing state calls as the night rolled on.</p>
<p>Regarding the use of outside information to annotate the graphic:</p>
<blockquote>
<p>Applying background knowledge to give the data greater context—such as the influence of the auto-industry bailout on Ohio’s economy—makes the visualization that much richer. After all, visualizations aren’t just about numbers, but about understanding the world we live in; qualitative information can add substantially to a quantitative graphic.</p>
</blockquote>
<p>While the technical details are fascinating, I was equally interested in the editorial decisions they had to make to build a usable visualization.</p>
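<p>The arithmetic behind the title is simple: nine swing states with two possible winners each give 2<sup>9</sup> = 512 combinations, and every state that gets called halves the number still in play. Here is a minimal sketch of that bookkeeping. The list of states is my assumption about which nine the graphic used, and the code ignores electoral vote totals entirely, so it is not a reconstruction of how the graphic itself was built.</p>

```python
from itertools import product

# Assumed list of the nine swing states; check against the graphic itself.
SWING_STATES = ["Florida", "Ohio", "North Carolina", "Virginia", "Wisconsin",
                "Colorado", "Iowa", "Nevada", "New Hampshire"]
CANDIDATES = ("Obama", "Romney")


def remaining_paths(called=None):
    """All win/lose combinations consistent with the states called so far."""
    called = called or {}
    paths = []
    for winners in product(CANDIDATES, repeat=len(SWING_STATES)):
        path = dict(zip(SWING_STATES, winners))
        if all(path[state] == winner for state, winner in called.items()):
            paths.append(path)
    return paths


print(len(remaining_paths()))                   # all combinations
print(len(remaining_paths({"Ohio": "Obama"})))  # after one state is called
```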
A grand experiment in science funding
2012-11-20T10:33:50+00:00
http://simplystats.github.io/2012/11/20/a-grand-experiment-in-science-funding
<p>Among all the young scientists I know, I think <a href="http://www.perlsteinlab.com/">Ethan Perlstein</a> is one of the most innovative in the way he has adapted to the internet era. His website is incredibly unique among academic websites, he is all over social media, and his latest experiment in <a href="http://www.rockethub.com/projects/11106-crowdsourcing-discovery">crowd-funding his research</a> is something I’m definitely keeping an eye on.</p>
<p>The basic idea is that he has identified a project (giving meth to <del>yeast</del> mouse brains -see the comment by Ethan below-, I think) and put it up on <a href="http://www.rockethub.com/">Rockethub</a>, which is a crowd funding platform. The basic idea is he is looking for people to donate to his lab to fund the project. I would love it if this project succeeded, so if you have a few extra dollars lying around I’m sure he’d really appreciate it if <a href="http://www.rockethub.com/projects/11106-crowdsourcing-discovery/fuel/reward_selection">you’d donate</a>.</p>
<p>At the bigger picture level, I love the idea of crowd-funding for science in principle. But it isn’t clear that it is going to work in practice. Ethan has been tearing it up with this project, even ending up in <a href="http://www.economist.com/news/science-and-technology/21564824-these-days-anyone-can-be-scientific-philanthropist">the Economist</a>, but he has still had trouble getting to his goal for funding. In the grand scheme of things he is asking for a relatively small amount given how much he will do, so it isn’t clear to me that this is a viable option for most scientists.</p>
<p>The other key problem, as a statistician, is that many of the projects I work on will not be as easily understandable/cool as giving meth to yeast. So, for example, I’m not sure I’d be able to generate the kind of support I’d need for my group to work on statistical analysis of RNA-seq data or batch effect removal methods.</p>
<p>Still, I love the idea, and it would be great if there were alternative sources of revenue for the incredibly important work that scientists like Ethan and others are doing.</p>
Podcast #5: Coursera Debrief
2012-11-19T10:00:08+00:00
http://simplystats.github.io/2012/11/19/podcast-5-coursera-debrief-2
<p>Jeff and I talk with Brian Caffo about teaching MOOCs on Coursera.</p>
Welcome to Simply Statistics 2.0
2012-11-18T16:40:53+00:00
http://simplystats.github.io/2012/11/18/welcome-to-simply-statistics-2-0
<p>Welcome to the re-designed, re-hosted and re-platformed Simply Statistics blog. We have moved the blog over to the WordPress platform to give us some newer features that were lacking over at tumblr. So far the transition has gone okay but there may be a few bumps over the next 24 hours or so as we learn the platform. Remember, we’re not the young hackers that we used to be.</p>
<p>A few things have changed. First off, the search box <em>actually works</em>. Also, in moving the Disqus comments over, we seem to have lost all of the old comments. So unfortunately many of your gems from the past are now gone. If anyone knows how to retain old comments on Disqus, please let us know! I think Jeff’s been banging his head for a while now trying to figure this out.</p>
<p>We’re hoping to roll out a few new features over the next few months so keep an eye out and come back often.</p>
Sunday Data/Statistics Link Roundup (11/18/12)
2012-11-18T14:54:20+00:00
http://simplystats.github.io/2012/11/18/sunday-data-statistics-link-roundup-11-18-12
<ol>
<li><a href="http://www.youtube.com/watch?v=Ipk3HIIG9-o&feature=youtu.be" target="_blank">An interview</a> with Brad Efron about scientific writing. I haven’t watched the whole interview, but I do know that Efron is one of my favorite writers among statisticians.</li>
<li><a href="http://ramnathv.github.com/slidify/" target="_blank">Slidify,</a> another approach for making HTML5 slides directly from R. I love the idea of making HTML slides; I would definitely do this regularly. But there are a couple of issues I feel still aren’t resolved: (1) It is still just a little too hard to change the theme/feel of the slides in my opinion. It is just CSS, but that’s still just enough of a hurdle that it is keeping me away and (2) I feel that the placement/insertion of images is still a little clunky. Google Docs has figured this out; I’d love it if they integrated the best features of Slidify, LaTeX, etc. into that system. </li>
<li>Statistics is still the new hotness. Here is a Business Insider list about 5 statistics problems that will <a href="http://www.businessinsider.com/five-statistics-problems-that-will-change-the-way-you-see-the-world-2012-11" target="_blank">“change the way you think about the world”</a>. </li>
<li>I love this one in the <a href="http://www.newyorker.com/humor/2012/11/19/121119sh_shouts_rudnick" target="_blank">New Yorker</a>, especially the line, “statisticians are the new sexy vampires, only even more pasty” (via Brooke A.)</li>
<li><span>We’ve hit the big time! We have <a href="http://www.forbes.com/sites/stevensalzberg/2012/11/12/the-election-is-over-and-the-math-geeks-won/" target="_blank">been linked to</a> by a real (Forbes) blogger. </span></li>
<li><span>If you haven’t noticed, we have a <a href="http://simplystatistics.org/post/35842154215/logo-contest-winner" target="_blank">new logo</a>. We are going to be making a few other platform-related changes over the next week or so. If you have any trouble, let us know!</span></li>
</ol>
Logo Contest Winner
2012-11-16T15:00:32+00:00
http://simplystats.github.io/2012/11/16/logo-contest-winner
<p>Congratulations to Bradley Saul, the winner of the Simply Statistics Logo contest! We had some great entries which made it difficult to choose between them. You can see the new logo to the right of our home page or the full sized version here:</p>
<p><img src="http://media.tumblr.com/tumblr_mdl39pL5ua1r08wvg.png" alt="" /></p>
<p>I made some slight modifications to Bradley’s original code (apologies!). The code for his original version is here:</p>
<pre>#########################################################
# Project: Simply Statistics Logo Design
# Date: 10/17/12
# Version: 0.00001
# Author: Bradley Saul
# Built in R Version: 2.15.0
#########################################################

# Set graphical parameters
par(mar=c(0, 0, 0, 0), pty='s', cex=3.5, pin=c(6,6))
# Note: I had to hard code the size, so that the text would scale
# on resizing the image. Maybe there is another way to get around font
# scaling issues - I couldn't figure it out.

make_logo <- function(color){
  x1 <- seq(0,1,.001)
  ncps <- seq(0,10,1)
  shapes <- seq(5,15,1)

  # Plot Beta distributions to make purty lines.
  plot(x1, pbeta(x1, shape1=10, shape2=.1, ncp=0), type='l', xlab='', ylab='',
       frame.plot=FALSE, axes=FALSE)
  for(i in 1:length(ncps)){
    lines(x1, pbeta(x1, shape1=.1, shape2=10, ncp=ncps[i]), col=color)
  }

  # Shade in area under curve.
  coord.x <- c(0, x1, 1)
  coord.y <- c(0, pbeta(x1, shape1=.1, shape2=10, ncp=10), 0)
  polygon(coord.x, coord.y, col=color, border="white")

  # Lazy way to get area between curves shaded, rather than just area under curve.
  coord.y2 <- c(0, pbeta(x1, shape1=10, shape2=.1, ncp=0), 0)
  polygon(coord.x, coord.y2, col="white", border="white")

  # Add text
  text(.98, .4, 'Simply', col="white", adj=1, family='HersheySerif')
  text(.98, .25, 'Statistics', col="white", adj=1, family="HersheySerif")
}
</pre>
<p>Thanks to Bradley for the great logo and congratulations!</p>
Reproducible Research: With Us or Against Us?
2012-11-15T16:33:55+00:00
http://simplystats.github.io/2012/11/15/reproducible-research-with-us-or-against-us-3
<p>Last night this <a href="http://cogprints.org/8675/" target="_blank">article by Chris Drummond</a> of the Canadian National Research Council (Conseil national de recherches Canada) popped up in my Google Scholar alert. The title of the article, “Reproducible Research: a Dissenting Opinion” would seem to indicate that he disagrees with much that has been circulating out there about reproducible research.</p>
<p>Drummond singles out the <a href="http://www.stanford.edu/~vcs/papers/RoundtableDeclaration2010.pdf" target="_blank">Declaration published by a Yale Law School Roundtable on Data and Code Sharing</a> (I was not part of the roundtable) as an example of the main arguments in favor of reproducibility and has four main objections. What I found interesting about his piece is that I think I more or less agree with all his objections and yet draw the exact opposite conclusion from him. In his abstract, he concludes that “I would also contend that the effort necessary to meet the [reproducible research] movement’s aims, and the general attitude it engenders, would not serve any of the research disciplines well.”</p>
<div>
<span>Let’s take his objections one by one:</span>
</div>
<div>
<ol>
<li>
<strong>Reproducibility, at least in the form proposed, is not now, nor has it ever been, an essential part of science</strong>. I would say that with the exception of mathematics, this is true. In math, usually you state a theorem and provide the proof. The proof shows you how to obtain the result, so it is a form of reproducibility. But beyond that I would argue that the need for reproducibility is a more recent phenomenon arising from the great complexity and cost of modern data analyses and the lack of funding for full replication. The rise of “consortium science” (think ENCODE project) diminishes our ability to fully replicate (what he calls “Scientific Replication”) an experiment in any reasonable amount of time.
</li>
<li>
<strong>The idea of a single well defined scientific method resulting in an incremental, and cumulative, scientific process is highly debatable</strong>. He argues that the idea of a forward moving process by which science builds on top of previous results in an orderly and incremental fashion is a fiction. In particular, there is no single “scientific method” into which you can drop in reproducibility as a key component. I think most scientists would agree with this. Science is not some orderly process: it’s messy and can seem haphazard, and discoveries come at unpredictable times. But that doesn’t mean that people shouldn’t provide the details of what they’ve done so that others don’t have to essentially reverse engineer the process. I don’t see how the disorderly reality of science is an argument against reproducibility.
</li>
<li>
<strong>Requiring the submission of data and code will encourage a level of distrust among researchers and promote the acceptance of papers based on narrow technical criteria</strong>. I don’t agree with this statement at all. First, I don’t think it will happen. If a journal required code/data, it would be burdensome for some, but it would just be one of the many requirements that journals have. Second, I don’t think good science is about “trust”. Sure, it’s important to be civilized but if you claim a finding, I’m not going to just trust it because we’re both scientists. Finally, he says “<span>Submitting code — in whatever language, for whatever system — will simply result in an accumulation of questionable software. There may be a some cases where people would be able to use it but I would doubt that they would be frequent.” I think this is true, but it’s not necessarily an argument against submitting code. Think of all the open source/free software packages out there. I would bet that most of that code has only been looked at by one person: the developer. But does that mean open source software as a whole is not valuable?</span>
</li>
<li>
<strong>Misconduct has always been part of science with surprisingly little consequence. The public’s distrust is likely more to do with the apparent variability of scientific conclusions</strong>. I agree with the first part and am not sure about the second. I’ve tried to argue previously that <a href="http://simplystatistics.org/post/12421558195/reproducible-research-notes-from-the-field" target="_blank">reproducible research is not just about preventing fraud/misconduct</a>. If someone wants to commit fraud, it’s easy to make the fraud reproducible.
</li>
</ol>
<p>
In the end, I see reproducibility as not necessarily a new concept, but really an adaptation of an old concept, that is, describing materials and methods. The problem is that the standard format for publication, the journal article, has simply not caught up with the growing complexity of data analysis. And so we need to update the standards a bit.
</p>
<p>
I think the benefit of reproducibility is that if someone wants to question or challenge the findings of a study, they have the materials with which to do so. Providing people with the means to ask questions is how science moves forward.
</p>
</div>
Interview with Tom Louis - New Chief Scientist at the Census Bureau
2012-11-09T15:53:02+00:00
http://simplystats.github.io/2012/11/09/interview-with-tom-louis-new-chief-scientist-at-the
<div class="im">
<strong>Tom Louis</strong>
</div>
<div class="im">
<strong><img height="225" src="http://biostat.jhsph.edu/~jleek/tom.jpg" width="150" /></strong>
</div>
<div class="im">
<a href="http://www.biostat.jhsph.edu/~tlouis/" target="_blank">Tom Louis</a> is a professor of Biostatistics at Johns Hopkins and will be joining the Census Bureau through an <span>interagency personnel agreement as the new associate director for research and methodology and chief scientist.</span><span> Tom has an impressive history of accomplishment in developing statistical methods for everything from environmental science to genomics. We talked to Tom about his new role at the Census, how it relates to his impressive research career, and how young statisticians can get involved in the statistical work at the Census. </span>
</div>
<div class="im">
<strong>SS: How did you end up being invited to lead the research branch of the Census?</strong>
</div>
<p><span>TL: Last winter, then-director Robert Groves (now Provost at Georgetown University) asked if I would be interested in the possibility of becoming the next Associate Director of Research and Methodology (R&M) and Chief Scientist, succeeding Rod Little (Professor of Biostatistics at the University of Michigan) in these roles. I expressed interest and after several discussions with Bob and Rod, decided that if offered, I would accept. It was offered and I did accept. </span></p>
<p><span>As background, components of my research, especially Bayesian methods, are Census-relevant. Furthermore, during my time as a member of the National Academies Committee on National Statistics I served on the panel that recommended improvements in small area income and poverty estimates, chaired the panel that evaluated methods for allocating federal and state program funds by formula, and chaired a workshop on facilitating innovation in the Federal statistical system.</span></p>
<p><span>Rod and I noted that it’s interesting and possibly not coincidental that with my appointment the first two associate directors are both former chairs of Biostatistics departments. It is the case that R&M’s mission is quite similar to that of a Biostatistics department; methods and collaborative research, consultation and education. And, there are many statisticians at the Census Bureau who are not in the R&M directorship, a sociology quite similar to that in a School of Public Health or a Medical campus. </span></p>
<div class="im">
<strong>SS: What made you interested in taking on this major new responsibility?</strong>
</div>
<p><span>TL: I became energized by the opportunity for national service, and excited by the scientific, administrative, and sociological responsibilities and challenges. I’ll be engaged in hiring and staff development, and increasing the visibility of the bureau’s pre- and post-doctoral programs. The position will provide the impetus to take a deep dive into finite-population statistical approaches, and contribute to the evolving understanding of the strengths and weaknesses of design-based, model-based and hybrid approaches to inference. That I could remain a Hopkins employee by working via an Interagency Personnel Agreement sealed the deal. I will start in January 2013 and serve through 2015, and will continue to participate in some Hopkins-based activities.</span></p>
<p><span>In addition to activities within the Census Bureau, I’ll be increasing connections among statisticians in other federal statistical agencies, and will have a role in relations with researchers funded through the NSF to conduct census-related research.</span></p>
<div class="im">
<p>
<strong>SS: What are the sorts of research projects the Census is involved in? </strong></div>
<p>
<span>TL: The Census Bureau designs and conducts the decennial Census, the Current Population Survey, the American Community Survey, many, many other surveys for other Federal Statistical Agencies including the Bureau of Labor Statistics, and a quite extraordinary portfolio of others. Each identifies issues in design and analysis that merit attention, many entail “Big Data” and many require combining information from a variety of sources. I give a few examples, and encourage exploration of </span><a href="http://www.census.gov/research" target="_blank">www.census.gov/research</a><span>.</span>
</p>
<p>
<span>You can get a flavor of the types of research from the titles of the six current centers within R&M: The Center for Adaptive Design, The Center for Administrative Records Research and Acquisition, The Center for Disclosure Avoidance Research, The Center for Economic Studies, The Center for Statistical Research and Methodology and The Center for Survey Measurement. Projects include multi-mode survey approaches, stopping rules for household visits, methods of combining information from surveys and administrative records, provision of focused estimates while preserving identity protection, improved small area estimates of income and of limited English skills (used to trigger provision of election ballots in languages other than English), and continuing investigation of issues related to model-based and design-based inferences.</span>
</p>
<div class="im">
<p>
<br /><strong>SS: Are those projects related to your research?</strong></div>
<p>
<span>TL: Some are, some will be, some will never be. Small area estimation, hierarchical modeling with a Bayesian formalism, some aspects of adaptive design, some of combining evidence from a variety of sources, and general statistical modeling are in my power zone. I look forward to getting involved in these and contributing to other projects.</span>
</p>
<div class="im">
<p>
<strong>SS: How does research performed at the Census help the American Public?</strong></div>
<p>
<span>TL: Research innovations enable the bureau to produce more timely and accurate information at lower cost, improve validity (for example, new approaches have at least maintained respondent participation in surveys), and enhance the reputation of the Census Bureau as a trusted source of information. Estimates developed by Census are used to allocate billions of dollars in school aid, and they provide key planning information for businesses and governments.</span>
</p>
<div class="im">
<p>
<strong>SS: How can young statisticians get more involved in government statistical research?</strong></div>
<p>
<span>TL: The first step is to become aware of the wide variety of activities and their high impact. Visiting the Census website and those of other federal and state agencies, and the Committee on National Statistics (</span><a href="http://sites.nationalacademies.org/DBASSE/CNSTAT/" target="_blank">http://sites.nationalacademies.org/DBASSE/CNSTAT/</a><span>) and the National Institute of Statistical Sciences (</span><a href="http://www.niss.org/" target="_blank">http://www.niss.org/</a><span>) is a good start. Make contact with researchers at the JSM and other meetings and be on the lookout for pre- and post-doctoral positions at Census and other federal agencies.</span>
</p>
</p></div></p></div></p></div></p></div>
Some academic thoughts on the poll aggregators
2012-11-08T20:11:00+00:00
http://simplystats.github.io/2012/11/08/some-academic-thoughts-on-the-poll-aggregators
<p>The night of the presidential elections I wrote a <a href="http://simplystatistics.org/post/35187901781/nate-silver-does-it-again-will-pundits-finally-accept" target="_blank">post</a> celebrating the victory of data over punditry. I was motivated by the personal attacks made against Nate Silver by pundits that do not understand Statistics. The post generated a little bit of (justified) <em><a href="http://www.urbandictionary.com/define.php?term=nerdrage" target="_blank">nerdrage</a> </em>(see comment section). So here I clarify a couple of things not as a member of Nate Silver’s fan club (my <a href="http://www.urbandictionary.com/define.php?term=mancrush" target="_blank"><em>mancrush</em> </a>started with <a href="http://www.baseballprospectus.com/" target="_blank">PECOTA</a> not fivethirtyeight) but as an applied statistician.</p>
<p>The main reason <a href="http://fivethirtyeight.blogs.nytimes.com/" target="_blank">fivethirtyeight</a> predicts election results so well is the idea of averaging polls. This idea was around way before fivethirtyeight started. In fact, it’s a version of <a href="http://en.wikipedia.org/wiki/Meta-analysis" target="_blank">meta-analysis</a> which has been around for hundreds of years and is commonly used to <a href="http://www.ncbi.nlm.nih.gov/pubmed/3802833" target="_blank">improve results of clinical trials</a>. This election cycle several groups, including Sam Wang (<a href="http://election.princeton.edu/" target="_blank">Princeton Election Consortium</a>), Simon Jackman (<a href="http://www.huffingtonpost.com/simon-jackman/pollster-predictions_b_2081013.html" target="_blank">pollster</a>), and Drew Linzer (<a href="http://votamatic.org/" target="_blank">VOTAMATIC</a>), predicted the election perfectly using this trick. </p>
<p>While each group adds their own set of bells and whistles, most of the gains come from the aggregation of polls and understanding the concept of a standard error. Note that <span>while each individual poll may be a bit biased, historical data shows that these biases average out to 0. So by taking the average you obtain a nearly unbiased estimate. Because there are so many pollsters, each one conducting several polls, you can also estimate the standard error of your estimate pretty well (empirically rather than theoretically). </span><span>I include a plot below that provides evidence that bias is not an issue and that standard errors are well estimated. The dashed lines are at +/- 2 standard errors based on the average (across all states) standard error reported by fivethirtyeight. Note that the variability is smaller for the battleground states where more polls were conducted (this is consistent with the state-specific standard errors reported by fivethirtyeight).</span></p>
<p><span>Finally, there is the issue of the use of the word “probability”. Obviously one can correctly state that there is a 90% chance of observing event A and then have it not happen: Romney could have won and the aggregators still been “right”. Also, </span>frequentists complain when we talk about the probability of something that will only happen once. I actually don’t like getting into this philosophical discussion (<a href="http://andrewgelman.com/2012/10/is-it-meaningful-to-talk-about-a-probability-of-65-7-that-obama-will-win-the-election/" target="_blank">Gelman</a> has some thoughts worth reading) and I cut people who write for the masses some slack. If the aggregators consistently outperform the pundits in their predictions I have no problem with them using the word “probability” in their reports. I look forward to some of the post-election analysis of all this.</p>
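<p>To make the averaging idea concrete, here is a minimal sketch in R. The poll numbers are made up for illustration (they are not data from any aggregator), and the calculation assumes the polls are independent and unbiased on average, which is a simplification of what the groups above actually do.</p>

```r
# Hypothetical results: one candidate's percent share in eight polls of one state.
polls <- c(49.1, 51.3, 50.2, 52.0, 48.7, 50.9, 51.5, 49.8)

# The aggregated estimate is simply the average of the polls.
poll_avg <- mean(polls)

# Empirical standard error of that average, estimated from the
# poll-to-poll variability rather than from sampling theory.
poll_se <- sd(polls) / sqrt(length(polls))

# Approximate interval: average +/- 2 standard errors.
c(lower = poll_avg - 2 * poll_se,
  estimate = poll_avg,
  upper = poll_avg + 2 * poll_se)
```

<p>With more polls the standard error of the average shrinks, which matches the smaller variability seen in heavily polled states.</p>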
<p><a href="http://rafalab.jhsph.edu/simplystats/silver3.png" target="_blank"><img height="500" src="http://rafalab.jhsph.edu/simplystats/silver3.png" width="500" /></a></p>
Nate Silver does it again! Will pundits finally accept defeat?
2012-11-07T05:54:00+00:00
http://simplystats.github.io/2012/11/07/nate-silver-does-it-again-will-pundits-finally-accept
<p>My favorite statistician did it again! Just like in 2008, he predicted the presidential election results almost perfectly. For those that don’t know, Nate Silver is the statistician that runs the <a href="http://fivethirtyeight.blogs.nytimes.com/" target="_blank">fivethirtyeight blog</a>. He combines data from hundreds of polls, uses historical data to weight them appropriately and then uses a statistical model to run simulations and predict outcomes.</p>
<p>While the pundits were claiming the race was a “dead heat”, the day before the election Nate gave Obama a 90% chance of winning. Several pundits attacked Nate (some attacks were personal) for his predictions and demonstrated their ignorance of Statistics. Jeff wrote a <a href="http://simplystatistics.org/post/34635539704/on-weather-forecasts-nate-silver-and-the" target="_blank">nice post on this</a>. The plot below demonstrates how great Nate’s prediction was. Note that in each of the 45 states (including DC) for which he predicted a 90% probability or higher of winning for candidate A, candidate A won. For the other 6 states the range of percentages was 48-52%. If Florida goes for Obama he will have predicted every single state correctly.</p>
<p><strong>Update</strong><strong>: </strong>Congratulations also to Sam Wang (<a href="http://election.princeton.edu/" target="_blank">Princeton Election Consortium</a>) and Simon Jackman (<a href="http://www.huffingtonpost.com/simon-jackman/pollster-predictions_b_2081013.html" target="_blank">pollster</a>), who also called the election perfectly. And thanks to the pollsters that provided the unbiased (on average) data used by all these folks. Data analysts won; “experts” lost.</p>
<p><del><strong>Update 2</strong>: New plot with data from <a href="http://www.foxnews.com/politics/elections/2012-election-results/" target="_blank">here</a>. Old graph <a href="http://rafalab.jhsph.edu/simplystats/silver.png" target="_blank">here</a>.</del></p>
<p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/silver3.png" alt="Observed versus predicted" /></p>
If we truly want to foster collaboration, we need to rethink the "independence" criteria during promotion
2012-11-05T15:00:44+00:00
http://simplystats.github.io/2012/11/05/if-we-truly-want-to-foster-collaboration-we-need-to
<p class="MsoNormal">
<span>When I talk about collaborative work, I don’t mean spending a day or two helping compute some p-values and ending up as middle author in a subject-matter paper. I mean spending months working on a project, </span>from start to finish, with experts from other disciplines to accomplish a goal that can only be accomplished with a diverse team. Many papers in genomics are like this (the ENCODE and 1000 Genomes papers for example). Investigator A dreams up the biology, B develops the technology, C codes up algorithms to deal with massive data, while D analyzes the data and assesses uncertainty, with the results reported in one high profile paper. I illustrate the point with genomics because it’s what I know best, but examples abound in other specialties as well.
</p>
<p class="MsoNormal">
<span>Fostering collaborative research seems to be a priority for most higher education institutions. Both funding agencies and universities are creating initiative after initiative to incentivize team science. But at the same time the appointments and promotions process rewards researchers that have demonstrated “independence”. If we are not careful it may seem like we are sending mixed signals. I know of young investigators that have been advised to set time aside to demonstrate independence by publishing papers without their regular collaborators. This advice assumes that one can easily balance collaborative and independent research. But here is the problem: truly collaborative work can take just as much time and intellectual energy as independent research, perhaps more. Because time is limited, we might inadvertently be hindering the team science we are supposed to be fostering. Time spent demonstrating independence is time not spent working on the next high impact project.</span>
</p>
<p class="MsoNormal">
I understand the argument for striving to hire and promote scholars that can excel no matter the context. But I also think it is unrealistic to compete in team science if we don’t find a better way to promote those that excel in collaborative research as well. It is a mistake to think that scholars that excel in solo research can easily succeed in team science. In fact, I have seen several examples of specializations, that are important to the university, in which the best work is being produced by a small team. At the same time, “independent” researchers all over the country are also working in these areas and publishing just as many papers. But the influential work is coming almost exclusively from the team. Whom should your university hire and promote in this particular area? To me it seems clear that it is the team. But for them to succeed we can’t get in their way by requiring each individual member to demonstrate “independence” in the traditional sense.
</p>
Sunday Data/Statistics Link Roundup (11/4/12)
2012-11-04T14:24:48+00:00
http://simplystats.github.io/2012/11/04/sunday-data-statistics-link-roundup-11-4-12
<ol>
<li>Brian Caffo <a href="http://www.washingtonpost.com/local/education/elite-education-for-the-masses/2012/11/03/c2ac8144-121b-11e2-ba83-a7a396e6b2a7_story.html?wpisrc=emailtoafriend" target="_blank">headlines the WaPo article</a> about massive online open courses. He is the driving force behind our department’s involvement in offering these massive courses. I think this sums it up: ‘<span>“I can’t use another word than unbelievable,” Caffo said. Then he found some more: “Crazy . . . surreal . . . heartwarming.”’</span></li>
<li><span>A really interesting discussion of why <a href="http://marginalrevolution.com/marginalrevolution/2012/11/a-bet-is-a-tax-on-bullshit.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+marginalrevolution%2Ffeed+%28Marginal+Revolution%29" target="_blank">“A Bet is a Tax on B.S.”</a>. It nicely describes why intelligent betters must be disinterested in the outcome, otherwise they will end up losing money. The Nate Silver controversy just doesn’t seem to be going away, good news for his readership numbers, I bet. (via Rafa)</span></li>
<li><span>An <a href="http://www.businessweek.com/articles/2012-11-01/its-global-warming-stupid" target="_blank">interesting article</a> on how scientists are not claiming global warming is the sole cause of the extreme weather events we are seeing, but that it does contribute to them being more extreme. The key quote: </span><span>“We can’t say that steroids caused any one home run by Barry Bonds, but steroids sure helped him hit more and hit them farther. Now we have </span><a href="http://www.businessweek.com/articles/2012-11-01/rising-tide" target="_blank">weather on steroids</a><span>.” —Eric Pooley. (via Roger)</span></li>
<li><span>The NIGMS is looking for a <a href="https://loop.nigms.nih.gov/index.php/2012/11/01/wanted-biomedical-technology-bioinformatics-and-computational-biology-division-director/" target="_blank">Biomedical technology, Bioinformatics, and Computational Biology Director</a>. I hope that it is someone who <a href="http://simplystatistics.org/post/21914291274/people-in-positions-of-power-that-dont-understand" target="_blank">understands statistics</a>! (via Karl B.)</span></li>
<li><span>Here is <a href="http://www.spiked-online.com/site/article/13016/" target="_blank">another article</a> that appears to misunderstand statistical prediction. This one is about the Italian scientists who were jailed for failing to predict an earthquake. No joke. </span></li>
<li><span>We talk a lot about how much the data revolution will change industries from social media to healthcare. But here is an <a href="http://www.nextgov.com/health/health-it/2012/10/patients-dont-show-much-interest-accessing-health-data-online/59104/" target="_blank">important reality check</a>. Patients are not showing an interest in accessing their health care data. I wonder if part of the reason is that we haven’t come up with the right ways to explain, understand, and utilize what is inherently stochastic and uncertain information. </span></li>
<li><span>The BMJ is now <a href="http://www.nytimes.com/2012/11/01/business/british-medical-journal-to-require-detailed-clinical-trial-data.html?_r=0" target="_blank">going to require</a> all data from clinical trials published in their journal to be public. This is a brilliant, forward thinking move. I hope other journals will follow suit. (via Karen B.R.)</span></li>
<li><span><a href="http://marginalrevolution.com/marginalrevolution/2012/11/retractions.html" target="_blank">An interesting article</a> about the impact of retractions on citation rates, suggesting that papers in fields close to those of the retracted paper may show negative impact on their citation rates. I haven’t looked it over carefully, but how they control for confounding seems incredibly important in this case. (via Alex N.). </span></li>
</ol>
Elite education for the masses
2012-11-04T14:00:34+00:00
http://simplystats.github.io/2012/11/04/elite-education-for-the-masses
<p><a href="http://www.washingtonpost.com/local/education/elite-education-for-the-masses/2012/11/03/c2ac8144-121b-11e2-ba83-a7a396e6b2a7_story_2.html">Elite education for the masses</a></p>
The Year of the MOOC
2012-11-03T17:06:55+00:00
http://simplystats.github.io/2012/11/03/the-year-of-the-mooc
<p><a href="http://nyti.ms/TTn1E6">The Year of the MOOC</a></p>
Microsoft Seeks an Edge in Analyzing Big Data
2012-10-31T00:19:43+00:00
http://simplystats.github.io/2012/10/31/microsoft-seeks-an-edge-in-analyzing-big-data
<p><a href="http://www.nytimes.com/2012/10/30/technology/microsoft-renews-relevance-with-machine-learning-technology.html?smid=tu-share">Microsoft Seeks an Edge in Analyzing Big Data</a></p>
On weather forecasts, Nate Silver, and the politicization of statistical illiteracy
2012-10-30T14:00:35+00:00
http://simplystats.github.io/2012/10/30/on-weather-forecasts-nate-silver-and-the
<p>As you know, <a href="http://simplystatistics.org/post/34483703514/sunday-data-statistics-link-roundup-10-28-12" target="_blank">we</a> <a href="http://simplystatistics.org/post/33564003058/sunday-data-statistics-link-roundup-10-14-12" target="_blank">have</a> a <a href="http://simplystatistics.org/post/29407938554/statistics-statisticians-need-better-marketing" target="_blank">thing</a> for <a href="http://simplystatistics.org/post/13684264380/citizen-science-makes-statistical-literacy-critical" target="_blank">statistical literacy</a> here at Simply Stats. So of course this <a href="http://www.politico.com/blogs/media/2012/10/nate-silver-romney-clearly-could-still-win-147618.html" target="_blank">column over at Politico</a> got our attention (via Chris V. and others). The column is an attack on Nate Silver, <a href="http://fivethirtyeight.blogs.nytimes.com/" target="_blank">who has a blog</a> where he tries to predict the outcome of elections in the U.S., you may have heard of it…</p>
<p>The argument that Dylan Byers makes in the Politico column is that Nate Silver is likely to be embarrassed by the outcome of the election if Romney wins. The reason is that Silver’s recent predictions have suggested Obama has a 75% chance of winning the election, and that number has never dropped below 60% or so. </p>
<p>I don’t know much about Dylan Byers, but from reading this column and a quick scan of his twitter feed, it appears he doesn’t know much about statistics. Some people have gotten pretty upset at him on Twitter and elsewhere about this fact, but I’d like to take a different approach: education. So Dylan, here is a really simple example that explains how Nate Silver comes up with a number like the 75% chance of victory for Obama. </p>
<p>Let’s pretend, just to make the example really simple, that if Obama gets greater than 50% of the vote, he will win the election. Obviously, Silver doesn’t ignore the electoral college and all the other complications, but it makes our example simpler. Then assume that based on averaging a bunch of polls we estimate that Obama is likely to get about 50.5% of the vote.</p>
<p>Now, we want to know what is the “percent chance” Obama will win, taking into account what we know. So let’s run a bunch of “simulated elections” where on average Obama gets 50.5% of the vote, but there is variability because we don’t have the exact number. Since we have a bunch of polls and we averaged them, we can get an estimate for how variable the 50.5% number is. The usual measure of variability is the <a href="http://en.wikipedia.org/wiki/Standard_deviation" target="_blank">standard deviation</a>. Say we get a standard deviation of 1% for our estimate. That would be a pretty accurate number, but not totally unreasonable given the amount of polling data out there. </p>
<p>We can run 1,000 simulated elections like this in <a href="http://www.r-project.org/" target="_blank">R</a>* (a free software programming language, if you don’t know R, may I suggest Roger’s <a href="https://www.coursera.org/course/compdata" target="_blank">Computing for Data Analysis</a> class?). <a href="https://raw.github.com/gist/3979974/21e3aea5aad79f68c03bbc519c216ed35b2ecd8b/gistfile1.r" target="_blank">Here</a> is the code to do that. The last line of code calculates the percent of times, in our 1,000 simulated elections, that Obama wins. This is the number that Nate would report on his site. When I run the code, I get an Obama win 68% of the time (Obama gets greater than 50% of the vote). But if you run it again that number will vary a little, since we simulated elections. </p>
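<p>For readers who cannot open the link, here is a minimal sketch of this kind of simulation. It is not necessarily identical to the linked code: it assumes Obama’s vote share is normally distributed around 50.5% with a standard deviation of 1%, and the variable names and the random seed are illustrative choices.</p>

```r
# Assumed inputs, taken from the simplified example in the text.
set.seed(2012)           # illustrative seed so a rerun gives the same draws
n_sims     <- 1000       # number of simulated elections
mean_share <- 50.5       # estimated Obama vote share (percent)
sd_share   <- 1          # standard deviation of that estimate (percent)

# One simulated vote share per election, drawn from a normal distribution.
obama_share <- rnorm(n_sims, mean = mean_share, sd = sd_share)

# Percent of simulated elections in which Obama gets more than 50% of the vote.
win_pct <- 100 * mean(obama_share > 50)
win_pct

# Smallest and largest simulated vote share.
range(obama_share)
```

<p>Removing the <code>set.seed</code> line makes the percentage change slightly from run to run.</p>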
<p>The interesting thing is that even though we only estimate that Obama leads by about 0.5%, he wins 68% of the simulated elections. The reason is that we are pretty confident in that number, with our standard deviation being so low (1%). But that doesn’t mean that Obama will win 68% of the vote in any of the elections! In fact, here is a histogram of the percent of the vote that Obama wins: </p>
<p><img height="300" src="http://biostat.jhsph.edu/~jleek/obama.png" width="300" /></p>
<p>He never gets more than 54% or so and never less than 47% or so. So it is always a reasonably close election. Silver’s calculations are obviously more complicated, but the basic idea of simulating elections is the same. </p>
<p>Now, this might seem like a goofy way to come up with a “percent chance” with simulated elections and all. But it turns out it is actually a pretty important thing to know and relevant to those of us on the East Coast right now. It turns out weather forecasts (and projected hurricane paths) are based on the same <a href="http://en.wikipedia.org/wiki/Numerical_weather_prediction" target="_blank">sort of thing</a> - simulated versions of the weather are run and the “percent chance of rain” is the fraction of times it rains in a particular place. </p>
<p>So Romney may still win and Obama may lose - and Silver may still get a lot of it right. But regardless, the approach taken by Silver is not based on politics, it is based on statistics. Hopefully we can move away from politicizing statistical illiteracy and toward evaluating the models for the real, underlying assumptions they make. </p>
<p><em>* In this case, we could calculate the percent of times Obama would win with a formula (called an analytical calculation) since we have simplified so much. In Nate’s case it is much more complicated, so you have to simulate. </em></p>
Computing for Data Analysis (Simply Statistics Edition)
2012-10-29T14:00:26+00:00
http://simplystats.github.io/2012/10/29/computing-for-data-analysis-simply-statistics-edition
<p>As the entire East Coast gets soaked by Hurricane Sandy, I can’t help but think that this is the perfect time to…take a course online! Well, as long as you have electricity, that is. I live in a heavily tree-lined area and so it’s only a matter of time before the lights cut out on me (I’d better type quickly!). </p>
<p>I just finished teaching my course Computing for Data Analysis through Coursera. This was my first experience teaching a course online and definitely my first experience teaching a course to > 50,000 people. There were definitely some bumps along the road, but the students who participated were fantastic at helping me smooth the way. In particular, the interaction on the discussion forums was very helpful. I couldn’t have done it without the students’ help. So, if you took my course over the past 4 weeks, thanks for participating!</p>
<p>Here are a couple quick stats on the course participation (as of today) for the curious:</p>
<ul>
<li><span>50,899: Number of students enrolled</span></li>
<li><span>27,900: Number of users watching lecture videos</span></li>
<li><span>459,927: Total number of streaming views (over 4 weeks)</span></li>
<li><span>414,359: Total number of video downloads (not all courses allow this)</span></li>
<li><span>14,375: Number of users submitting the weekly quizzes (graded)</span></li>
<li><span>6,420: Number of users submitting the bi-weekly R programming assignments (graded)</span></li>
<li><span>6393+3291: Total number of posts+comments to the discussion forum</span></li>
<li><span>314,302: Total number of views in the discussion forum</span></li>
</ul>
<p>I’ve received a number of emails from people who signed up in the middle of the course or after the course finished. Given that it was a 4-week course, signing up in the middle of the course meant you missed quite a bit of material. I will eventually be closing down the Coursera version of the course—at this point it’s not clear when it will be offered again on that platform but I would like to do so—and so access to the course material will be restricted. However, I’d like to make that material more widely available even if it isn’t in the Coursera format.</p>
<p>So I’m announcing today that next month I’ll be offering the <strong>Simply Statistics Edition of Computing for Data Analysis</strong>. This will be a slightly simplified version of the course that was offered on Coursera since I don’t have access to all of the cool platform features that they offer. But all of the original content will be available, including some new material that I hope to add over the coming weeks.</p>
<p>If you are interested in taking this course or know of someone who is, please check back here soon for more details on how to sign up and get the course information.</p>
Sunday Data/Statistics Link Roundup (10/28/12)
2012-10-28T13:39:00+00:00
http://simplystats.github.io/2012/10/28/sunday-data-statistics-link-roundup-10-28-12
<ol>
<li>An important article about <a href="http://www.scientificamerican.com/article.cfm?id=antiscience-beliefs-jeopardize-us-democracy" target="_blank">anti-science sentiment</a> in the U.S. (via David S.). The politicization of scientific issues such as global warming, evolution, and healthcare (think vaccination) makes the U.S. less competitive. I think the lack of statistical literacy and training in the U.S. is one of the sources of the problem. People use/skew/mangle statistical analyses and experiments to support their view and without a statistically well trained public, it all looks “reasonable and scientific”. But when science seems to contradict itself, it loses credibility. Another reason to <a href="http://www.ted.com/talks/arthur_benjamin_s_formula_for_changing_math_education.html" target="_blank">teach statistics to everyone in high school.</a></li>
<li>Scientific American was loaded this last week; here is another <a href="http://blogs.scientificamerican.com/guest-blog/2012/10/18/nihmim12-the-spreading-shadow-of-cancer-angst-3-things-you-need-to-know-to-meet-it-rationally/" target="_blank">article on cancer screening</a>. The article covers several of the issues that make it hard to convince people that screening isn’t always good. Confusion about the positive predictive value is a huge one in cancer screening right now. The author of the piece is someone worth following on Twitter: <a href="https://twitter.com/hildabast" target="_blank">@hildabast</a>.</li>
<li><span><a href="http://www.githubarchive.org/" target="_blank">A bunch of data</a> on the use of Github. Always cool to see new data sets that are worth playing with for student projects, etc. (via Hilary M.). </span></li>
<li><span>A really interesting post over at Stats Chat about <a href="http://www.statschat.org.nz/2012/10/28/why-we-study-the-obvious/?utm_source=feedburner&utm_medium=twitter&utm_campaign=Feed%3A+StatsChat+%28Stats+Chat%29" target="_blank">why we study seemingly obvious things</a>. Hint, the reason is that “obvious” things aren’t always true. </span></li>
<li><span>A <a href="http://www.npr.org/blogs/alltechconsidered/2012/10/23/163434283/how-much-is-a-like-on-facebook-worth-for-a-companys-share-price" target="_blank">story on “sentiment analysis”</a> by NPR that suggests that most of the variation in a stock’s price during the day can be explained by the number of Facebook likes. Obviously, this is an interesting correlation. It would probably be more interesting for hedge funders/stockpickers if the correlation were with the change in stock price the next day. (via Dan S.)</span></li>
<li><span>Yihui Xie visited our department this week. We had a great time chatting with him about knitr/animation and all the cool work he is doing. <a href="http://yihui.name/slides/2012-reproduce-homework.html" target="_blank">Here are his slides</a> from the talk he gave. Particularly check out his idea for a fast journal. You are seeing the future of publishing. </span></li>
<li><strong>Bonus Link:</strong> <a href="http://techcrunch.com/2012/10/27/big-data-right-now-five-trendy-open-source-technologies/" target="_blank">R is a trendy open source technology for big data</a>. </li>
</ol>
I love those first discussions about a new research project
2012-10-26T19:37:18+00:00
http://simplystats.github.io/2012/10/26/i-love-those-first-discussions-about-a-new-research
<p>That has got to be the best reason to <a href="http://simplystatistics.org/post/28335633068/why-im-staying-in-academia" target="_blank">stay in academia.</a> The meetings where it is just you and a bunch of really smart people thinking about tackling a new project, coming up with cool ideas, and dreaming about how you can really change the way the world works are so much fun.</p>
<p>There is no part of a research job that is better as far as I’m concerned. It is always downhill after that, you start <a href="http://simplystatistics.org/post/31281359451/the-pebbles-of-academia" target="_blank">running into pebbles</a>, your code doesn’t work, or <a href="http://simplystatistics.org/post/26977029850/my-worst-recent-experience-with-peer-review" target="_blank">your paper gets rejected</a>. But that first blissful planning meeting always seems so full of potential.</p>
<p>Just had a great one like that and am full of optimism. </p>
Let's make the Joint Statistical Meetings modular
2012-10-23T13:08:51+00:00
http://simplystats.github.io/2012/10/23/lets-make-the-joint-statistical-mettings-modular
<p>Have you ever met a statistician who enjoys the Joint Statistical Meetings (JSM)? I haven’t. With the exception of the one night we catch up with old friends, there are few positive things we can say about JSM. They are way too big, and the two talks I want to see are always somehow scheduled at the same time as mine.</p>
<p>But statisticians actually like conferences. Most of us have a favorite statistics conference, or session within a bigger subject matter conference, that we look forward to going to. But it’s never JSM. So why can’t JSM just be a collection of these conferences? For sure we should drop the current format and come up with something new.</p>
<p>I propose that we start by giving each ASA section two non-concurrent sessions scheduled on two consecutive days (perhaps more slots for bigger sections) and let them do whatever they want. Hopefully they would turn this into the conference that they want to go to. It’s our meeting, we pay for it, so let’s turn it into something we like.</p>
A statistical project bleg (urgent-ish)
2012-10-22T14:29:01+00:00
http://simplystats.github.io/2012/10/22/a-statistical-project-bleg-urgent-ish
<p>We all know that politicians can play it a little fast and loose with the truth. This is particularly true in debates, where politicians have to think on their feet and respond to questions from the audience or from each other. </p>
<p>Usually, we find out how truthful politicians are in the “post-game show”. The discussion of the veracity of the claims is usually based on independent fact checkers such as <a href="http://www.politifact.com/" target="_blank">PolitiFact</a>. Some of these fact checkers (PolitiFact in particular) <a href="https://twitter.com/politifact" target="_blank">live-tweet</a> their reports on many of the issues discussed during the debate. This is possible because both candidates have a pretty fixed set of talking points they use, so it is near real-time fact-checking.</p>
<p>What would be awesome is if someone could write an R script that would scrape the live data off of PolitiFact’s Twitter account and create a truthfulness meter that looks something like CNN’s <a href="http://politicalticker.blogs.cnn.com/2012/10/16/13-reasons-to-watch-the-debate-on-cnns-platforms-and-nowhere-else/comment-page-1/" target="_blank">instant reaction graph</a> (see #7) for independent voters. The line would show the moving average of how honest each politician was being. How cool would it be to show the two candidates and how truthful they are being? If you did this, tell me it wouldn’t be a feature one of the major news networks would pick up…</p>
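<p>The moving-average part of that idea is easy to sketch. The Python snippet below is only an illustration: it does no scraping, the numeric scores assigned to PolitiFact’s Truth-O-Meter labels are an assumption, and the sequence of ratings is made up.</p>

```python
from collections import deque

# Assumed numeric scores for PolitiFact's Truth-O-Meter labels (the mapping is hypothetical).
RATING_SCORES = {
    "True": 1.0,
    "Mostly True": 0.8,
    "Half True": 0.5,
    "Mostly False": 0.2,
    "False": 0.0,
    "Pants on Fire": 0.0,
}


def moving_average_truthfulness(ratings, window=3):
    """Return the moving average of the scores over the last `window` ratings."""
    recent = deque(maxlen=window)
    averages = []
    for rating in ratings:
        recent.append(RATING_SCORES[rating])
        averages.append(sum(recent) / len(recent))
    return averages


# Made-up sequence of fact-check ratings for one candidate, in the order they were tweeted.
example_ratings = ["True", "Half True", "False", "Mostly True", "True"]
meter = moving_average_truthfulness(example_ratings, window=3)
print([round(value, 2) for value in meter])
```

<p>Running one such meter per candidate and plotting the two lines against time would give something close to the instant reaction graph; the window length controls how quickly the line reacts to a new rating.</p>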
Sunday Data/Statistics Link Roundup (10/21/12)
2012-10-21T13:30:56+00:00
http://simplystats.github.io/2012/10/21/sunday-data-statistics-link-roundup-10-21-12
<ol>
<li>This <a href="http://researchinprogress.tumblr.com/" target="_blank">scientific variant</a> on the <a href="http://whatshouldwecallme.tumblr.com/" target="_blank">#whatshouldwecallme</a> meme isn’t exclusive to statistics, but it is hilarious.</li>
<li>This is a <a href="http://www.wired.com/opinion/2012/10/passwords-and-hackers-security-and-practicality/" target="_blank">really interesting post</a> that is a follow-up to the XKCD <a href="http://xkcd.com/936/" target="_blank">password security comic</a>. The thing I find most interesting about this is that researchers realized the key problem with passwords was that we were looking at them purely from a computer science perspective. But <em>people</em> use passwords, so we need a person-focused approach to maximize security. This is a very similar idea to our previous post on <a href="http://simplystatistics.org/post/31460959187/an-experimental-foundation-for-statistics" target="_blank">an experimental foundation for statistics</a>. Looks like Di Cook and others are already <a href="http://www.r-bloggers.com/carl-morris-symposium-on-large-scale-data-inference-23/" target="_blank">way ahead of us</a> on this idea. It would be interesting to redefine optimality incorporating the knowledge that most of the time it is a person running the statistics.</li>
<li>This is another fascinating article about the <a href="http://www.insidehighered.com/news/2012/10/15/stanford-professor-goes-public-attacks-over-her-math-education-research" target="_blank">math education wars</a>. It starts off as the typical dueling schools issue in academia - two different schools of thought who routinely go after the other side. But the interesting thing here is it sounds like one side of this math debate is being waged by a person collecting data and the other is being waged by a side that isn’t. It is interesting how many areas are being touched by data - including what kind of math we should teach. </li>
<li>I’m going to visit Minnesota in a couple of weeks. I was so pumped up to be <a href="https://twitter.com/leekgroup/status/259597859639410688" target="_blank">an outlaw</a>. <a href="http://simplystatistics.org/post/33973041284/minnesota-clarifies-free-online-ed-is-ok" target="_blank">Looks like</a> I’m just a regular law abiding citizen though….</li>
<li>Here are outstanding summaries of what went on at the Carl Morris Big Data conference this last week. Tons of interesting stuff there. Parts <a href="http://civilstat.com/?p=745" target="_blank">one</a>, <a href="http://civilstat.com/?p=758" target="_blank">two</a>, and <a href="http://civilstat.com/?p=760" target="_blank">three</a>. </li>
</ol>
Minnesota clarifies: Free online ed is OK
2012-10-20T18:50:31+00:00
http://simplystats.github.io/2012/10/20/minnesota-clarifies-free-online-ed-is-ok
<p><a href="http://www.washingtonpost.com/blogs/college-inc/post/minnesota-clarifies-free-online-ed-is-ok/2012/10/19/456a0a3e-1a37-11e2-aa6f-3b636fecb829_blog.html">Minnesota clarifies: Free online ed is OK</a></p>
Free Online Education Is Now Illegal in Minnesota
2012-10-20T13:16:32+00:00
http://simplystats.github.io/2012/10/20/free-online-education-is-now-illegal-in-minnesota
<p><a href="http://www.slate.com/blogs/future_tense/2012/10/18/minnesota_bans_coursera_state_takes_bold_stand_against_free_education.html">Free Online Education Is Now Illegal in Minnesota</a></p>
Simply Statistics Podcast #4: Interview with Rebecca Nugent
2012-10-19T13:33:39+00:00
http://simplystats.github.io/2012/10/19/interview-with-rebecca-nugent-of-carnegie-mellon
<p>Interview with Rebecca Nugent of Carnegie Mellon University.</p>
<p>In this episode Jeff and I talk with <a href="http://www.stat.cmu.edu/~rnugent/" target="_blank">Rebecca Nugent</a>, Associate Teaching Professor in the Department of Statistics at Carnegie Mellon University. We talk with her about her work with the Census and the growing interest in statistics among undergraduates.</p>
Statistics isn't math but statistics can produce math
2012-10-18T20:29:52+00:00
http://simplystats.github.io/2012/10/18/statistics-isnt-math-but-statistics-can-produce-math
<p><a href="http://thatsmathematics.com/mathgen/" target="_blank">Mathgen</a>, the web site that can produce randomly generated mathematics papers has apparently <a href="http://thatsmathematics.com/blog/archives/102" target="_blank">gotten a paper accepted in a peer-reviewed journal</a> (although perhaps not the most reputable one). I am not at all surprised this happened, but it’s fun to read both the paper and the reviewer’s comments. </p>
<p>(Thanks to Kasper H. for the pointer.)</p>
Comparing Hospitals
2012-10-17T13:05:38+00:00
http://simplystats.github.io/2012/10/17/there-was-a-story-a-few-weeks-ago-on-npr-about-how
<p>There was a story a few weeks ago on NPR about how <a href="http://www.npr.org/templates/story/story.php?storyId=162035632" target="_blank">Medicare will begin fining hospitals</a> that have 30-day readmission rates that are too high. This process was introduced in the Affordable Care Act and</p>
<blockquote>
<p><span>Under the health care law, the penalties gradually will rise until 3 percent of Medicare payments to hospitals are at risk. Medicare is considering holding hospitals accountable on four more measures: joint replacements, stenting, heart bypass and treatment of stroke.</span></p>
</blockquote>
<p>Those of you taking my <a href="https://class.coursera.org/compdata-2012-001/class/index" target="_blank">computing course on Coursera</a> have already seen some of the data used for this assessment, which can be obtained at the <a href="http://hospitalcompare.hhs.gov" target="_blank">hospital compare web site</a>. It’s also worth noting that underlying the analysis for this was a detailed and thoughtful report published by the Committee of Presidents of Statistical Societies (COPSS) which was chaired by <a href="http://www.biostat.jhsph.edu/~tlouis/" target="_blank">Tom Louis</a>, a Professor here at Johns Hopkins.</p>
<p>The report, titled <a href="http://www.cms.gov/Medicare/Quality-Initiatives-Patient-Assessment-Instruments/HospitalQualityInits/Downloads/Statistical-Issues-in-Assessing-Hospital-Performance.pdf" target="_blank">“Statistical Issues in Assessing Hospital Performance”</a>, covers much of the current methodology and its criticisms and has a number of recommendations. Of particular concern for hospitals is the issue of shrinkage targets: in a hierarchical model the estimate of the readmission rate for a hospital is shrunken towards the mean. But which mean? Hospitals with higher risk or sicker patient populations will look quite a bit worse than hospitals sitting amongst a healthy population if they are both compared to the same mean.</p>
<p>The report is worth reading even if you’re just interested in the practical application of hierarchical models. And the web site is fun to explore if you want to know how the hospitals around you are faring.</p>
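<p>To see why the choice of target matters, here is a minimal Python sketch of a generic shrinkage estimator. It is not the model from the report: the weighting formula, the <code>prior_strength</code> parameter, and all of the numbers are assumptions chosen for illustration.</p>

```python
def shrunken_rate(readmissions, discharges, target_mean, prior_strength=200):
    """Shrink a hospital's observed readmission rate toward `target_mean`.

    The weight on the observed rate grows with the number of discharges;
    `prior_strength` (hypothetical) acts like a number of pseudo-discharges
    at the target mean.
    """
    observed = readmissions / discharges
    weight = discharges / (discharges + prior_strength)
    return weight * observed + (1 - weight) * target_mean


# Made-up hospital serving a sicker population: 60 readmissions out of 200 discharges.
overall_mean = 0.20      # hypothetical mean over all hospitals
peer_group_mean = 0.28   # hypothetical mean over hospitals with similar patients

toward_overall = shrunken_rate(60, 200, overall_mean)
toward_peers = shrunken_rate(60, 200, peer_group_mean)
print(f"observed rate: {60 / 200:.2f}")
print(f"shrunk toward overall mean: {toward_overall:.2f}")
print(f"shrunk toward peer-group mean: {toward_peers:.2f}")
```

<p>With these made-up numbers the same hospital ends up 0.05 above its benchmark when shrunk toward the overall mean, but only 0.01 above when shrunk toward the mean of comparable hospitals.</p>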
Johns Hopkins Grad Anthony Damico Shows How To
2012-10-17T12:43:00+00:00
http://simplystats.github.io/2012/10/17/johns-hopkins-grad-anthony-damico-shows-how-to
<p>[vimeo 43305640 w=500 h=281]</p>
<p>Johns Hopkins grad Anthony Damico shows how to make coffee with R (except not really). The BLS mug is what makes it for me.</p>
A statistician loves the #insurancepoll...now how do we analyze it?
2012-10-15T15:12:56+00:00
http://simplystats.github.io/2012/10/15/a-statistician-loves-the-insurancepoll-now-how-do-we
<p>Amanda Palmer broke Twitter yesterday <a href="http://www.amandapalmer.net/blog/20121015/" target="_blank">with her insurance poll</a>. She started off just talking about how hard it is for musicians who rarely have health insurance, but then wandered into polling territory. She sent out a request for people to respond with the following information:</p>
<blockquote>
<p><em>quick twitter poll. 1) COUNTRY?! 2) profession? 3) insured? 4) if not, why not, if so, at what cost per month (or covered by job)?</em></p>
</blockquote>
<p>This quick little poll struck a nerve with people and her Twitter feed blew up. Long story short, tons of interesting information was gathered from folks. This information is frequently kept semi-obscured, particularly the cost of health insurance for folks in different places. This isn’t the sort of info that insurance companies necessarily publicize widely and isn’t the sort of thing people talk about.</p>
<p>The results were really fascinating and it’s worth reading the above blog post or checking out the hashtag: <a href="https://twitter.com/search/realtime?q=%23insurancepoll&src=typd" target="_blank">#insurancepoll</a>. But the most fascinating thing for me as a statistician was thinking about how to analyze these data. <a href="https://twitter.com/aubreyjaubrey" target="_blank">@<span>aubreyjaubrey</span></a> is apparently collecting the data someplace; hopefully she’ll make it public.</p>
<p>At least two key issues spring to mind:</p>
<ol>
<li>This is a massive convenience sample. </li>
<li>It is being collected through a social network.</li>
</ol>
<p>Although I’m sure there are more. If a student is looking for an amazingly interesting/rich data set and some seriously hard stats problems, they should get in touch with Aubrey and see if they can make something of it!</p>
Sunday Data/Statistics Link Roundup (10/14/12)
2012-10-14T13:35:00+00:00
http://simplystats.github.io/2012/10/14/sunday-data-statistics-link-roundup-10-14-12
<ol>
<li>A fascinating article about <a href="http://www.theawl.com/2012/10/the-sugar-wars" target="_blank">the debate</a> on whether to regulate sugary beverages. One of the protagonists is David Allison, a statistical geneticist, among other things. It is fascinating to see the interplay of statistical analysis and public policy. Yet another example of how statistics/data will drive some of the most important policy decisions going forward. </li>
<li>A related article is <a href="http://bigthink.com/risk-reason-and-reality/how-the-media-put-us-at-risk-with-the-way-they-report-about-risk?page=2" target="_blank">this one</a> on the way risk is reported in the media. It is becoming <a href="http://simplystatistics.org/post/15774146480/in-the-era-of-data-what-is-a-fact" target="_blank">more and more clear</a> that to be an educated member of society now means that you absolutely have to have a basic understanding of the concepts of statistics. Both leaders and the general public are responsible for the danger that lies in misinterpreting/misleading with risk. </li>
<li>A <a href="http://biostat.jhsph.edu/~jleek/release.pdf" target="_blank">press release</a> from the Census Bureau about how the choice of college major can have a major impact on career earnings. More data breaking the results down by employment characteristics and major are <a href="http://biostat.jhsph.edu/~jleek/employment.pdf" target="_blank">here</a> and <a href="http://biostat.jhsph.edu/~jleek/degree.pdf" target="_blank">here.</a> These data update some of the data we have talked about before in calculating <a href="http://simplystatistics.org/post/12599452125/expected-salary-by-major" target="_blank">expected salaries by major</a>. (via Scott Z.)</li>
<li>An interesting article about <a href="http://www.npr.org/2012/10/08/162397787/predicting-the-future-fantasy-or-a-good-algorithm" target="_blank">Recorded Future</a> that describes how they are using social media data etc. to try to predict events that will happen. I think this isn’t an entirely crazy idea, but the thing that always strikes me about these sorts of projects is how hard it is to measure success. It is highly unlikely you will ever exactly predict a future event, so how do you define how close you were? For instance, if you predicted an uprising in Egypt, but missed by a month, is that a good or a bad prediction?</li>
<li>Seriously guys, this is getting embarrassing. An article appears in the New England Journal <a href="http://www.nejm.org/doi/full/10.1056/NEJMon1211064" target="_blank">“finding” an association</a> between chocolate consumption and Nobel prize winners. This is, of course, a horrible statistical analysis and unless it was a joke to publish it, it is irresponsible of the NEJM to publish. I’ll bet any student in Stat 101 could find the huge flaws with this analysis. If the editors of the major scientific journals want to continue publishing statistical papers, they should get serious about statistical editing.</li>
</ol>
What's wrong with the predicting h-index paper.
2012-10-10T13:47:04+00:00
http://simplystats.github.io/2012/10/10/whats-wrong-with-the-predicting-h-index-paper
<p><em>Editor’s Note: I recently posted about <a href="http://simplystatistics.org/post/31990205510/prediction-contest" target="_blank">a paper</a> in Nature that purported to predict the H-index. The authors contacted me to get my criticisms, then responded to those criticisms. They have requested the opportunity to respond publicly, and I think it is a totally reasonable request. Until there is a better comment generating mechanism at the journal level, this seems like as good a forum as any to discuss statistical papers. I will post an extended version of my criticisms here and give them the opportunity to respond publicly in the comments. </em></p>
<p><span>The paper in question is clearly a clever idea and the kind that would get people fired up. Quantifying researchers’ output is all the rage and being able to predict this quantity in the future would obviously make a lot of evaluators happy. I think it was, in that sense, a really good idea to chase down these data, since it was clear that if they found anything at all it would be very widely covered in the scientific/popular press. </span></p>
<div>
My original post was inspired by my frustration with Nature, which has a history of publishing somewhat suspect statistical papers, <a href="http://www.ncbi.nlm.nih.gov/pubmed/15457248" target="_blank">such as this one</a>. I posted the prediction contest after reading another paper I consider to be flawed, both for statistical reasons and for scientific reasons. I originally commented on the statistics in my post. The authors, being good sports, contacted me for my criticisms. I sent them the following criticisms, which I think are sufficiently major that a statistical referee or statistical journal would have likely rejected the paper:
</div>
<div>
<ol>
<li>
Lack of reproducibility. The code/data are not made available either through Nature or on your website. This is a critical component of papers based on computation and has led to serious problems before. It is also easily addressable.
</li>
<li>
No training/test set. You mention cross-validation (and maybe the R^2 is the R^2 using the held out scientists?) but if you use the cross-validation step to optimize the model parameters and to estimate the error rate, you could see some major overfitting.
</li>
<li>
The R^2 values are pretty low. An R^2 of 0.67 is obviously superior to the h-index alone, but (a) there is concern about overfitting, and (b) even without overfitting, an R^2 that low could lead to substantial variance in predictions.
</li>
<li>
The prediction error is not reported in the paper (or in the online calculator). How far off could you be at 5 years, at 10? Would the results still be impressive with those errors reported?
</li>
<li>
You use model selection and show only the optimal model (as described in the last paragraph of the supplementary), but no indication of the potential difficulties with this model selection is given in the text.
</li>
<li>
You use a single regression model without any time variation in the coefficients and without any potential non-linearity. Clearly when predicting several years into the future there will be variation with time and non-linearity. There is also likely heavy variance in the types of individuals/career trajectories, and outliers may be important, etc.
</li>
</ol>
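<div>
To make the third and fourth criticisms concrete: since R^2 = 1 - SS_res / SS_tot, an R^2 of 0.67 leaves a residual standard deviation of about 57% of the standard deviation of the outcome. The Python sketch below works through that arithmetic. The outcome standard deviation of 10 is a hypothetical value, not a number from the paper, and the 95% half-width assumes roughly normal errors.
</div>

```python
import math


def residual_sd(r_squared, outcome_sd):
    """Residual standard deviation implied by R^2 = 1 - SS_res / SS_tot."""
    return math.sqrt(1 - r_squared) * outcome_sd


r_squared = 0.67   # value quoted in the criticisms above
outcome_sd = 10.0  # hypothetical standard deviation of the future h-index

sd_error = residual_sd(r_squared, outcome_sd)
# Rough 95% prediction half-width, assuming approximately normal errors.
half_width = 1.96 * sd_error
print(f"residual sd: {sd_error:.2f}")
print(f"approximate 95% prediction half-width: {half_width:.2f}")
```

<div>
Under these assumptions a predicted h-index would come with an uncertainty of roughly plus or minus 11 points, which is the kind of number the paper and the online calculator could have reported.
</div>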
<div>
They carefully responded to these criticisms and hopefully they will post their responses in the comments. My impression based on their responses is that the statistics were not as flawed as I originally thought, but that the data aren’t sufficient to form a useful prediction.
</div>
<div>
</div>
<div>
However, I think the much bigger flaw is the basic scientific premise. The h-index has been identified as having major flaws and biases (including gender bias) and as being a generally poor summary of a scientist’s contribution. See <a href="http://blogs.nature.com/nautilus/2007/10/the_hindex_has_its_flaws.html" target="_blank">here</a>, the list of criticisms <a href="http://en.wikipedia.org/wiki/H-index" target="_blank">here</a>, and the discussion <a href="http://scholarlykitchen.sspnet.org/2008/06/30/the-h-index-an-objective-mismeasure/" target="_blank">here</a> for starters. The authors of the Nature paper propose a highly inaccurate predictor of this deeply flawed index. While that alone is sufficient to call into question the results in the paper, the authors also make bold claims about their prediction tool:
</div>
<blockquote>
<div>
Our formula is particularly useful for funding agencies, peer reviewers and hiring committees who have to deal with vast
</div>
<div>
numbers of applications and can give each only a cursory examination. Statistical techniques have the advantage of returning
</div>
<div>
results instantaneously and in an unbiased way.
</div>
</blockquote>
<div>
Suggesting that this type of prediction should be used to make important decisions on hiring, promotion, and funding is highly scientifically flawed. Coupled with the online calculator the authors handily provide (which produces no measure of uncertainty) it makes it all too easy for people to miss the real value of scientific publications: the science contained in them.
</div>
</div>
Why we should continue publishing peer-reviewed papers
2012-10-08T14:29:00+00:00
http://simplystats.github.io/2012/10/08/why-we-should-continue-publishing-peer-reviewed-papers
<p><span>Several bloggers are calling for the end of peer-reviewed journals as we know them. <a href="http://simplystatistics.org/post/32871552079/should-we-stop-publishing-peer-reviewed-papers" target="_blank">Jeff suggests</a> that we replace them with a system in which everyone posts their papers on their blog, pubmed aggregates the feeds, and peer-review happens post publication via, for example, counting up like and dislike votes. In my view, many of these critiques seem to conflate problems from different aspects of the process. Here I try to break down the current system into its key components and defend the one aspect I think we should preserve (at least for now): pre-publication peer-review.</span></p>
<p>To avoid confusion let me start by enumerating some of the components for which I agree change is needed.</p>
<ul>
<li>There is no need to produce paper copies of our publications. Indulging our preference for reading hard copies does not justify keeping the price of disseminating our work twice as high as it should be. </li>
<li>There is no reason to be sending the same manuscript (adapted to fit guidelines) to several journals, until it gets accepted. This frustrating and time-consuming process adds very little value (we previously described <a href="http://simplystatistics.org/post/14218411483/dear-editors-associate-editors-referees-please-reject" target="_blank">Nick Jewell’s solution</a>). </li>
<li>There is no reason for publications to be static. As Jeff and many others suggest, readers should be able to comment on and rate published papers systematically, and authors should be able to update them.</li>
</ul>
<p>However, all these changes can be implemented without doing away with pre-publication peer-review.</p>
<p><span>A key reason American and British universities consistently<span class="apple-converted-space"> </span><a href="http://www.arwu.org/ARWU2010.jsp" target="_blank">lead the pack</a><span class="apple-converted-space"> </span>of research institutions is their strict adherence to a peer-review system that minimizes cronyism and tolerance for mediocrity. At the center of this system is a promotion process in which outside experts evaluate a candidate’s ability to produce high quality ideas. Peer-reviewed journal articles are the backbone of this evaluation. </span>When reviewing a candidate I familiarize myself with his or her work by reading 5-10 key papers. It’s true that I read these disregarding the journal and blog posts would serve the same purpose. But I also use the publication section of the CV not only because reading all papers is logistically impossible but because these have already been evaluated by ~ three referees plus an editor and provide an independent assessment to mine. I also use the journal’s prestige because although it is a highly noisy measure of quality, the law of large numbers starts kicking in after 10 papers or so. </p>
<p><span>So are three reviewers better than the entire Internet? Can a reddit-like system provide just as much signal as the current peer-reviewed journal? You can think of the current system as a c</span><span>ooperative in which we all agree to read each other’s papers thoroughly (we evaluate 2-3 for each one we publish) with journals taking care of the logistics. The result of a review is an estimate of quality ranging from highest (Nature, Science) to 0 (not published). This estimate is certainly noisy given the bias and quality variance of referees and editors. But, across all papers on a CV variance is reduced and bias averages out </span>(I note that we complain vociferously when the bias keeps us from publishing in a good journal, but we rarely say a word when the bias helps us get into a better journal than deserved).<span> </span><span>Jeff’s argument is that post-publication review will result in many more evaluations and therefore a stronger signal-to-noise ratio. I need to see evidence of this before being convinced. I</span><span>n the current system </span>~ three referees commit to thoroughly reviewing the paper. If they do a sloppy job, they will embarrass themselves with an editor or an AE (not a good thing). With the post-publication review system nobody is forced to review. I fear most papers will go without comment or votes, including really good ones. My feeling is that marketing and PR will matter even more than they do now and that’s not a good thing.</p>
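<p>The claim that noise shrinks across the papers on a CV can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the post: each paper's journal-based score is the candidate's true quality plus independent, zero-mean Gaussian noise, and the function name and parameters are hypothetical. A bias shared by every review would not average out this way.</p>

```python
import random
import statistics


def spread_of_cv_average(n_papers, noise_sd=1.0, n_candidates=2000, seed=0):
    """Standard deviation of the average score over n_papers noisy evaluations.

    Assumption (illustrative only): every candidate has true quality 0 and
    each paper's score adds independent Gaussian noise with sd = noise_sd.
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(n_candidates):
        scores = [rng.gauss(0.0, noise_sd) for _ in range(n_papers)]
        averages.append(statistics.fmean(scores))
    return statistics.pstdev(averages)


# The spread of the CV-level average falls roughly like noise_sd / sqrt(n_papers).
for n in (1, 10, 40):
    print(n, round(spread_of_cv_average(n), 3))
```

<p>Under these assumptions the average over 10 papers is about three times less variable than a single paper's score, which is the sense in which the law of large numbers "starts kicking in".</p>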
<p>Dissemination of ideas is another important role of the literature. Jeff describes a couple of anecdotes to argue it can be sped up by just posting it on your blog.</p>
<blockquote>
<p><span>I posted a quick idea called </span><a href="http://simplystatistics.org/post/18132467723/prediction-the-lasso-vs-just-using-the-top-10" target="_blank">the Leekasso</a><span>, which led to some discussion on the blog, has nearly 2,000 page views</span></p>
</blockquote>
<p>But the typical junior investigator does not have a blog with hundreds of followers. Will their papers ever be read if even more papers are added to the already bloated scientific literature? The current peer-review system provides an important filter. There is an inherent trade-off between speed of dissemination and quality and it’s not clear to me that we should swing the balance all the way over to the speed side. There are <a href="http://simplystatistics.org/post/14218411483/dear-editors-associate-editors-referees-please-reject" target="_blank">other ways</a> to speed up dissemination that we should try first. Also there is nothing stopping us from posting our papers online before publication and promoting them via twitter or an aggregator. In fact, as pointed out by <a href="http://twitter.com/janhjensen" target="_blank">Jan Jensen</a> on Jeff’s post, <span> </span><span>arXiv papers are indexed on Google Scholar within a week, which also keeps track of arXiv citations.</span></p>
<p><span>The Internet is bringing many changes that will improve our peer-review system. But the current pre-publication peer-review process does a decent job of </span></p>
<ol>
<li>providing signal for the promotion process and</li>
<li>reducing noise in the literature to make dissemination possible. </li>
</ol>
<p>Any alternative systems should be evaluated carefully before dismantling a system that has helped keep our Universities at the top of the world rankings.</p>
Sunday Data/Statistics Link Roundup (10/7/12)
2012-10-07T13:53:30+00:00
http://simplystats.github.io/2012/10/07/sunday-data-statistics-link-roundup-10-7-12
<ol>
<li>Jack Welch <a href="https://twitter.com/jack_welch/status/254198154260525057" target="_blank">got a little conspiracy-theory crazy</a> with the job numbers. Thomas Lumley over at StatsChat makes <a href="http://www.statschat.org.nz/2012/10/06/statistics-conspiracy-theories/" target="_blank">a pretty good case</a> for debunking the theory. I think the real take home message of Thomas’ post and one worth celebrating/highlighting is that agencies that produce the jobs report do so based on a fixed and well-defined study design. Careful efforts by government statistics agencies make it hard to fudge/change the numbers. This is an underrated and hugely important component of a well-run democracy. </li>
<li>On a similar note Dan Gardner at the Ottawa Citizen points out that <a href="http://www.ottawacitizen.com/opinion/columnists/Evidence+comes+shapes+sizes/7351609/story.html" target="_blank">evidence-based policy</a> making is actually not enough. He points out the critical problem with evidence: <a href="http://simplystatistics.org/post/15774146480/in-the-era-of-data-what-is-a-fact" target="_blank">in the era of data what is a fact</a>? “Facts” can come from flawed or biased studies just as easily as from strong studies. He suggests that a true “evidence based” administration would invest more money in research/statistical agencies. I think this is a great idea. </li>
<li><a href="http://online.wsj.com/article/SB10000872396390444223104578038362388183092.html" target="_blank">An interesting article</a> by Ben Bernanke suggesting that an optimal approach (in baseball and in policy) is one based on statistical analysis, coupled with careful thinking about long-term versus short-term strategy. I think one of his arguments about allowing players to play even when they are struggling short term is actually a case for letting the weak law of large numbers play out. If you have a player with skill/talent, they will eventually converge to their “true” numbers. It’s also good for their confidence….(via David Santiago).</li>
<li>Here is another interesting <a href="http://svpow.com/2012/10/03/dear-royal-society-please-stop-lying-to-us-about-publication-times/?utm_source=social_media&utm_medium=hootsuite&utm_campaign=standard" target="_blank">peer review dust-up</a>. It explains why some journals “reject” papers when they really mean major/minor revision to be able to push down their review times. I think this highlights yet another problem with pre-publication peer review. The evidence is mounting, but I hear we may get a defense of the current system from one of the editors of this blog, so stay tuned…</li>
<li>Several people (Sherri R., Alex N., many folks on Twitter) have pointed me to <a href="http://www.pnas.org/content/early/2012/09/14/1211286109" target="_blank">this article</a> about gender bias in science. I initially was a bit skeptical of such a strong effect across a broad range of demographic variables. After reading the supplemental material carefully, it is clear I was wrong. It is a very well designed/executed study and suggests that there is still a strong gender bias in science, across ages and disciplines. Interestingly both men and women were biased against the female candidates. This is clearly a non-trivial problem to solve and needs a lot more work; maybe one step is to <a href="http://simplystatistics.org/post/25849875593/a-specific-suggestion-to-help-recruit-retain-women" target="_blank">make recruitment packages more flexible</a> (see the comment by Allison T. especially). </li>
</ol>
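<p>The law-of-large-numbers point in the baseball item above can be sketched with a short simulation. The setup is an assumption for illustration, not something from the article: a hypothetical player gets a hit in each at-bat independently with a fixed "true" rate, and the function name is made up.</p>

```python
import random


def running_average(true_rate, n_at_bats, seed=1):
    """Running batting average for a player whose at-bats are independent
    coin flips with a fixed hit probability (an illustrative assumption)."""
    rng = random.Random(seed)
    hits = 0
    averages = []
    for at_bat in range(1, n_at_bats + 1):
        if rng.random() < true_rate:
            hits += 1
        averages.append(hits / at_bat)
    return averages


averages = running_average(true_rate=0.300, n_at_bats=20000)
# Early averages can wander far from the true rate; late ones settle near it.
for n in (20, 200, 2000, 20000):
    print(n, round(averages[n - 1], 3))
```

<p>A short slump looks dramatic over 20 at-bats but is expected noise, which is the case for letting a struggling but skilled player keep playing.</p>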
Fraud in the Scientific Literature
2012-10-07T00:28:00+00:00
http://simplystats.github.io/2012/10/07/fraud-in-the-scientific-literature
<p><a href="http://www.nytimes.com/2012/10/06/opinion/fraud-in-the-scientific-literature.html?smid=tu-share">Fraud in the Scientific Literature</a></p>
Not just one statistics interview...John McGready is the Jon Stewart of statistics
2012-10-05T14:20:54+00:00
http://simplystats.github.io/2012/10/05/not-just-one-statistics-interview-john-mcgready-is
<p><em>Editor’s Note: We usually reserve Fridays for posting <a href="http://simplystatistics.org/interviews" target="_blank">Simply Statistics Interviews</a>. This week, we have a special guest post by <a href="http://www.biostat.jhsph.edu/~jmcgread/" target="_blank">John McGready</a>, a colleague of ours who has been doing interviews with many of us in the department and has some cool ideas about connecting students in their first statistics class with cutting edge researchers wrestling with many of the same concepts applied to modern problems. I’ll let him explain…</em></p>
<p><span>I teach a two quarter course in introductory biostatistics to master’s students in public health at Johns Hopkins. The majority of the class is composed of MPH students, but there are also students doing professional master’s degrees in environmental health, molecular biology, health policy and mental health. Despite the short length of the course, it covers the “greatest hits” of biostatistics, encompassing everything from exploratory data analysis up through and including multivariable proportional hazards regression. The course focus is more conceptual and less mathematical/computing centric than the other two introductory sequences taught at Hopkins: as such it has earned the unfortunate nickname “baby biostatistics” from some at the School. This, in my opinion, is an unfortunate misnomer: statistical reasoning is often the most difficult part of learning statistics. We spend a lot of time focusing on the current literature, and making sense or critiquing research by considering not only the statistical methods employed and the numerical findings, but also the study design and the logic of the substantive conclusions made by the study authors.</span></p>
<p><span>Via the course, I always hope to demonstrate the importance of biostatistics as a core driver of public health discovery, the importance of statistical reasoning in the research process, and how the fundamentals that are covered are the framework for more advanced methodology. At some point it dawned on me that the best approach for doing this was to have my colleagues speak to my students about these ideas. Because of timing and scheduling constraints, this proved difficult to do in a live setting. However, in June of 2012 a video recording studio opened here at the Hopkins Bloomberg School. At this point, I knew that I had to get my colleagues on video so that I could share their wealth of experiences and expertise with my students, and give the students multiple perspectives. To my delight my colleagues are very amenable to being interviewed and have been very generous with their time. I plan to continue doing the interviews so long as my colleagues are willing and the studio is available.</span></p>
<p><span>I have created a <a href="http://www.youtube.com/user/StatReasoningJHSPH?feature=mhee" target="_blank">YouTube channel</a> for these interviews. At some point in the future, I plan to invite the biostatistics community as a whole to participate. This will include interviews with visitors to my department, and submissions by biostatistics faculty and students from other schools. (I realize I am very lucky to have these facilities and video expertise at Hopkins: but many folks are tech savvy enough to film their own videos on their cameras, phones etc… in fact you have seen such creativity by the editors of this here blog). </span><span>With the help of some colleagues I plan on making a complementary website that will allow for easy submission of videos for posting, so stay tuned!</span></p>
Statistics project ideas for students (part 2)
2012-10-04T17:56:43+00:00
http://simplystats.github.io/2012/10/04/statistics-project-ideas-for-students-part-2
<p>A little while ago I wrote a post on <a href="http://simplystatistics.org/post/18493330661/statistics-project-ideas-for-students" target="_blank">statistics projects ideas for students</a>. In honor of the first Simply Statistics Coursera offering, <a href="https://www.coursera.org/course/compdata" target="_blank">Computing for Data Analysis</a>, here is a new list of student projects for folks excited about trying out those new R programming skills. Again we have rated each project with my best guess difficulty and effort required. Happy computing!</p>
<p><strong>Data Analysis</strong></p>
<ol>
<li>Use city data to predict areas with the highest risk for parking tickets. <a href="https://data.baltimorecity.gov/Financial/Parking-Citations/n4ma-fj3m" target="_blank">Here </a>is the data for Baltimore. (<em>Difficulty: Moderate, Effort: Low/Moderate</em>)</li>
<li>If you have a Fitbit with a premium account, download the data into a spreadsheet (or <a href="https://www.dropbox.com/sh/gauvv2fzf623ia5/SOgEROC7jO/Current" target="_blank">get Chris’s data</a>). Then build various predictors using the data: (a) are you running or walking, (b) are you having a good day or not, (c) did you eat well that day or not, (d) etc. For special bonus points <a href="http://myyearofdata.wordpress.com/" target="_blank">create a blog</a> with your new discoveries and share your data with the world. (<em>Difficulty: Depends on what you are trying to predict, Effort: Moderate with Fitbit/Jawbone/etc</em>.)</li>
</ol>
<p><strong>Data Collection/Synthesis</strong></p>
<ol>
<li>Make a list of skills associated with each component of the <a href="http://www.drewconway.com/zia/?p=2378" target="_blank">Data Scientist Venn Diagram</a>. Then update the <a href="https://github.com/jtleek/datascientist/blob/master/dataScientist.R" target="_blank">data scientist R function</a> described in <a href="http://simplystatistics.org/post/11271228367/datascientist" target="_blank">this post</a> to ask a set of questions, then plot people on the diagram. Hint, check out the readline() function. (<em>Difficulty: Moderately low, Effort: Moderate</em>)</li>
<li><a href="http://www.healthdata.gov/" target="_blank">HealthData.gov</a> has a ton of data from various sources about public health, medicines, etc. Some of this data is super useful for projects/analysis and some of it is just data dumps. Create an R package that downloads data from healthdata.gov and gives some measures of how useful/interesting it is for projects (e.g. number of samples in the study, number of variables measured, is it summary data or raw data, etc.) (<em>Difficulty: Moderately hard, Effort: High</em>)</li>
<li>Build an up-to-date aggregator of R tutorials/how-to videos, summarize/rate each one so that people know which ones to look at for learning which tasks. (<em>Difficulty: Low, Effort: Medium)</em></li>
</ol>
<p><strong>Tool building</strong></p>
<ol>
<li>Build software that creates a 2-d author list and averages people’s 2-d author lists. (<em>Difficulty: Medium, Effort: Low)</em></li>
<li>Create an R package that interacts with and downloads data from <a href="http://simplystatistics.org/post/15182715327/list-of-cities-states-with-open-data-help-me-find" target="_blank">government websites</a> and processes it in a way that is easy to analyze. <em>(Difficulty: Medium, Effort: High)</em></li>
</ol>
Should we stop publishing peer-reviewed papers?
2012-10-04T13:54:00+00:00
http://simplystats.github.io/2012/10/04/should-we-stop-publishing-peer-reviewed-papers
<p>Nate Silver, everyone’s favorite statistician made good, just gave an interview where he said he thinks <a href="http://techcrunch.com/2012/10/01/nyt-election-oracle-fivethirtyeight-on-why-blogging-is-great-for-science/" target="_blank">many journal articles should be blog posts</a>. I have been thinking about this same issue for a while now, and I’m not the only one. <a href="http://dienekes.blogspot.com/2012/06/how-journals-once-facilitated-and-now.html" target="_blank">This is</a> a really interesting post suggesting that although scientific journals once facilitated dissemination of ideas, they now impede the flow of information and make it more expensive. </p>
<p>Two recent examples really drove this message home for me. In the first example, I posted a quick idea called <a href="http://simplystatistics.org/post/18132467723/prediction-the-lasso-vs-just-using-the-top-10" target="_blank">the Leekasso</a>, which led to some discussion on the blog, has nearly 2,000 page views (a pretty decent number of downloads for a paper), and has been implemented in software by someone <a href="http://cran.r-project.org/web/packages/SuperLearner/NEWS" target="_blank">other than me</a>. If this were one of my papers, it would be one of my more reasonably high-impact papers. The <a href="http://simplystatistics.org/post/31990205510/prediction-contest" target="_blank">second example</a> is a post I put up about a recent Nature paper. The authors (who are really good sports) ended up writing to me to get my critiques. I wrote them out, and they responded. All of this happened after peer review and informally. All of the interaction also occurred in email, where no one can see but us. </p>
<p>It wouldn’t take much to go to a blog-based system. What if everyone who was publishing scientific results started a blog (free), and there was a site, run by PubMed, that aggregated the feeds (this would be super cheap to set up/maintain)? Then people could comment on blog posts and vote for ones they liked if they had verified accounts. We would skip peer review in favor of just producing results and discussing them. The results that were interesting would be shared by email, Twitter, etc. </p>
<p>Why would we do this? Well, the current journal system: (1) significantly slows the publication of research, (2) costs thousands of dollars, and (3) costs significant labor that is not scientifically productive (such as resubmitting). </p>
<p>Almost every paper I have had published has been rejected in at least one place, including the “good” ones. This means that the results of even the good papers have been delayed by months. Or in the case of one paper - <a href="http://simplystatistics.org/post/26977029850/my-worst-recent-experience-with-peer-review" target="_blank">a full year and a half of delay</a>. Any time I publish open access, <a href="http://simplystatistics.org/post/12286350206/free-access-publishing-is-awesome-but-expensive-how" target="_blank">it costs me</a> at minimum around $1,500. I like open access because I think science funded by taxpayers should be free. But it is a significant drain on the resources of my group. Finally, most of the resubmission process is wasted labor. It generally doesn’t produce new science or improve the quality of the science. The effort is just in reformatting and re-inputting information about authors.</p>
<p>So why not have everyone just post results on their blog/<a href="http://figshare.com/" target="_blank">figshare</a>? They’d have a DOI that could be cited. We’d reduce everyone’s labor in reviewing/editing/resubmitting by an order of magnitude or two and save the taxpayers a few thousand dollars each a year in publication fees. We’d also increase the speed of updating/reacting to new ideas by an order of magnitude. </p>
<p>I still maintain we should be evaluating people based on reading their actual work, not highly subjective and error-prone indices. But if the powers that be insisted, it would be easy to evaluate people based on likes/downloads/citations/discussion of papers rather than on the basis of journal titles and the arbitrary decisions of editors. </p>
<p>So should we stop publishing peer-reviewed papers?</p>
<p><em>Edit: Titus points to a couple of good posts with interesting ideas about the peer review process that are worth reading, <a href="http://ivory.idyll.org/blog/blog-practicing-open-science.html" target="_blank">here </a>and <a href="http://www.genomesunzipped.org/2012/08/the-first-steps-towards-a-modern-system-of-scientific-publication.php" target="_blank">here</a>. Also, Joe Pickrell et al. are already on this for population genetics, having set up the aggregator <a href="http://haldanessieve.org/" target="_blank">Haldane’s Sieve</a>. It would be nice if this expanded to other areas (and people got credit for the papers published there, like they do for papers in journals). </em></p>
This is an awesome paper all students in statistics should read
2012-10-03T15:24:46+00:00
http://simplystats.github.io/2012/10/03/this-is-an-awesome-paper-all-students-in-statistics
<p><a href="http://arxiv.org/abs/1210.0530" target="_blank">The paper</a> is a review of how to do software development for academics. I saw it via C. Titus Brown (who <a href="http://simplystatistics.org/post/29620679415/interview-with-c-titus-brown-computational-biologist" target="_blank">we have interviewed</a>), who is also a co-author. How to write software (particularly for other people) is something that is underemphasized in many curricula. But it turns out this is also one of the more important components of disseminating your work in modern applied statistics. My only wish is that there was an accompanying website with resources/links for people to chase down. </p>
2-D author lists
2012-10-03T14:00:14+00:00
http://simplystats.github.io/2012/10/03/2-d-author-lists
<p>The order of authors on scientific papers matters a lot. The best places to be on a paper <a href="http://simplystatistics.org/post/11314293165/authorship-conventions" target="_blank">vary by field</a>. But typically the first and the corresponding (usually last) authors are the prime real estate. When people are evaluated on the job market, for promotion, or to get grants, the number of first/corresponding author papers can be the difference between success and failure. </p>
<p>At the same time, many journals list “authors contributions” at the end of the manuscript, but this is rarely prominently displayed. The result is that regardless of the true distribution of credit in a manuscript, the first and last authors get the bulk of the benefit. </p>
<p>This system is antiquated for a few reasons:</p>
<ol>
<li>In multidisciplinary science, there are often equal and very different contributions from people working in different disciplines. </li>
<li>Science is increasingly collaborative, even within a single discipline, and papers are rarely the effort of 2 people anymore. </li>
</ol>
<p>How about a 2-D, resortable author list? Each author is a column and each kind of contribution is a row. The contributions are: (1) conceived the idea, (2) collected the data, (3) did the computational analysis, (4) wrote the paper (you could imagine adding others). Each author then gets a number in each category: their fraction of the effort on that component of the paper. Then you build an interactive graphic that allows you to sort the authors by each of the categories. So you could see who did what on the paper. </p>
<p>To get an overall impression of which activities an author performs, you could average their contribution across papers in each of the categories, creating a “heatmap of contributions”. Anyone want to build this? </p>
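<p>A minimal sketch of the data structure behind such a list, assuming each paper is stored as a mapping from author to their fraction of effort in each of the four categories. The author names, numbers, and function names are made up for illustration, and the interactive graphic itself is left out.</p>

```python
CATEGORIES = ("conceived", "collected", "analyzed", "wrote")


def sort_authors(paper, category):
    """Order a paper's authors by their share of one category, largest first."""
    return sorted(paper, key=lambda author: paper[author][category], reverse=True)


def average_contributions(papers, author):
    """Average an author's share in each category over the papers they are on.

    The result is one row of the "heatmap of contributions".
    """
    rows = [paper[author] for paper in papers if author in paper]
    if not rows:
        raise ValueError(f"{author} is not an author on any of these papers")
    return {c: sum(row[c] for row in rows) / len(rows) for c in CATEGORIES}


# Hypothetical papers: within each category the shares sum to 1 across authors.
paper_a = {
    "Ana": {"conceived": 0.5, "collected": 0.0, "analyzed": 0.75, "wrote": 0.5},
    "Ben": {"conceived": 0.5, "collected": 1.0, "analyzed": 0.25, "wrote": 0.5},
}
paper_b = {
    "Ana": {"conceived": 0.0, "collected": 0.5, "analyzed": 0.25, "wrote": 0.25},
    "Cy": {"conceived": 1.0, "collected": 0.5, "analyzed": 0.75, "wrote": 0.75},
}

print(sort_authors(paper_a, "collected"))
print(average_contributions([paper_a, paper_b], "Ana"))
```

<p>Storing fractions per category rather than a single ordering is what makes the list resortable: the same paper yields a different author order for each kind of contribution.</p>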
The more statistics blogs the better
2012-10-02T14:56:26+00:00
http://simplystats.github.io/2012/10/02/the-more-statistics-blogs-the-better
<p>Good friend and friend of the blog Rob Gould has started a statistics blog called <a href="http://citizen-statistician.org" target="_blank">Citizen Statistician</a>. What is a citizen statistician, you ask?</p>
<blockquote>
<p><span>What is a citizen statistician? A citizen statistician participates in formal and informal data gathering. A citizen statistician is aware of his or her data trail, and is aware of the harm that could be done to themselves or to others through data aggregation. Citizen statisticians recognize opportunities to improve their personal or professional lives through analyzing data, and know how to share data with others. They know that almost any question about the world can be answered using data, how to find relevant data sources on the web, and critically evaluate these sources. A citizen statistician also knows how to bring that data into an analysis package and how to start their investigation.</span></p>
</blockquote>
<p>What’s even better than having more statistics blogs? Having more statisticians.</p>
John McGready interviews Jeff Leek
2012-09-28T15:19:40+00:00
http://simplystats.github.io/2012/09/28/john-mcgready-interviews-the-esteemed-jeff-leek
<p>John McGready interviews the esteemed Jeff Leek. This is bearded Jeff, in case you were wondering.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
John McGready interviews Roger Peng
2012-09-27T17:55:46+00:00
http://simplystats.github.io/2012/09/27/john-mcgready-a-fellow-faculty-member-in-the
<p>John McGready, a fellow faculty member in the Department of Biostatistics, interviewed me for his Statistical Reasoning class. In the interview we talk about some statistical contributions to air pollution epidemiology.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Simply statistics logo contest
2012-09-27T13:52:02+00:00
http://simplystats.github.io/2012/09/27/simply-statistics-logo-contest
<p>Simply Statistics has had the same logo since Roger grabbed the first picture in his results folder that looked “statistics related”. While we do have some affection for the logo, we would like something a little more catchy.</p>
<p>So we would like to announce a contest to create our <strong>new</strong> logo. Here are the rules:</p>
<ol>
<li>All submissions must be sent to Roger with the email subject, “Simply Statistics Logo Contest”</li>
<li>The logo must be generated with reproducible R code. Here is <a href="http://blog.revolutionanalytics.com/2011/12/using-r-to-create-a-logo-simple.html" target="_blank">an example</a> of how Simple created their logo for inspiration. </li>
<li>Ideally the logo will convey the “spirit of the blog”: we like data, we like keeping it simple, we like solving real problems, and we like to stir it up.</li>
</ol>
<p>Have at it!</p>
Pro-tips for graduate students (Part 3)
2012-09-26T14:00:11+00:00
http://simplystats.github.io/2012/09/26/pro-tips-for-graduate-students-part-3
<p>This is part of the ongoing series of pro tips for graduate students; check out parts <a href="http://simplystatistics.org/post/25368234643/pro-tips-for-grad-students-in-statistics-biostatistics" target="_blank">one</a> and <a href="http://simplystatistics.org/post/25507941642/pro-tips-for-grad-students-in-statistics-biostatistics" target="_blank">two</a> for the original installments. </p>
<ol>
<li>Learn how to write papers in a very clear and simple style. Whenever you can, write in plain English, skip jargon as much as possible, and make the approach you are using understandable and clear. This can (sometimes) make it harder to get your papers into journals. But simple, clear language leads to much higher use/citation of your work. Examples of great writers are: <a href="http://www.genomine.org/" target="_blank">John Storey</a>, <a href="http://www-stat.stanford.edu/~tibs/" target="_blank">Rob Tibshirani</a>, <a href="http://en.wikipedia.org/wiki/Robert_May,_Baron_May_of_Oxford" target="_blank">Robert May</a>, <a href="http://www.ped.fas.harvard.edu/people/faculty/" target="_blank">Martin Nowak</a>, etc.</li>
<li>It is a great idea to start reviewing papers as a graduate student. Don’t do too many, since you should focus on your research, but doing a few will give you a lot of insight into how the peer-review system works. Ask your advisor/research mentor; they will generally have a review or two they could use help with. When doing reviews, keep in mind a person spent a large chunk of time working on the paper you are reviewing. Also, don’t forget to use Google.</li>
<li>Try to write your first paper as soon as you possibly can and try to do as much of it on your own as you can. You don’t have to wait for faculty to give you ideas: read papers and <a href="http://gking.harvard.edu/files/paperspub.pdf" target="_blank">think of what you think would have been better</a> (you might check with a faculty member first so you don’t repeat what’s done, etc.). You will learn more writing your first paper than in almost any/all classes.</li>
</ol>
NBC Unpacks Trove of Data From Olympics
2012-09-26T03:17:30+00:00
http://simplystats.github.io/2012/09/26/nbc-unpacks-trove-of-data-from-olympics
<p><a href="http://www.nytimes.com/2012/09/26/business/media/nbc-unpacks-trove-of-viewer-data-from-london-olympics.html?smid=tu-share">NBC Unpacks Trove of Data From Olympics</a></p>
Computing for Data Analysis starts today!
2012-09-24T14:54:52+00:00
http://simplystats.github.io/2012/09/24/computing-for-data-analysis-starts-today
<p>Today marks the first Simply Statistics course offering happening over at Coursera. I’ll be teaching <a href="https://class.coursera.org/compdata-2012-001/" target="_blank">Computing for Data Analysis</a> over the next four weeks. There’s still plenty of time to register if you are interested in learning about R and the activity on the discussion forums is already quite vibrant.</p>
<p>Also starting today is my colleague Brian Caffo’s <a href="https://class.coursera.org/biostats-2012-001/class/index" target="_blank">Mathematical Biostatistics Bootcamp</a>, which I hear also has had an energetic start. With any luck, the students in that class may get to see Brian dressed in military fatigues.</p>
<p>This is my first MOOC so I have no idea how it will go. But I’m excited to start and am looking forward to the next four weeks.</p>
Sunday Data/Statistics Link Roundup (9/23/12)
2012-09-23T13:57:30+00:00
http://simplystats.github.io/2012/09/23/sunday-data-statistics-link-roundup-9-23-12
<ol>
<li>Harvard Business School is getting in on the fun, calling the data scientist the <a href="http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1" target="_blank">sexy profession for the 21st century</a>. Although I am a little worried that by the time it gets into a Harvard Business document, the hype may be outstripping the real promise of the discipline. Still, good news for statisticians! (via Rafa via Francesca D.’s Facebook feed). </li>
<li>The counterpoint is <a href="http://www.forbes.com/sites/gilpress/2012/08/31/the-data-scientist-will-be-replaced-by-tools/" target="_blank">this article</a> which suggests that data scientists might be able to be replaced by tools/software. I think this is also a bit too much hype for my tastes. Certain things will definitely be automated and we may even end up with <a href="http://simplystatistics.org/post/30315018436/a-deterministic-statistical-machine" target="_blank">a deterministic statistical machine</a> or two. But there will continually be new problems to solve which require the expertise of people with data analysis skills and good intuition (link via Samara K.)</li>
<li>A bunch of websites are popping up where you can sign up and have people take your online courses for you. I’m not going to give them the benefit of a link, but they aren’t hard to find these days. The thing I don’t understand is, if it is a free online course, why have someone else take it for you? It’s free, it’s in your spare time, and the bar for passing is pretty low (links via Sherri R. redacted)….</li>
<li>Maybe mostly useful for me, but for other people with Tumblr blogs, here is a way to <a href="http://is-r.tumblr.com/post/31792415116/embedding-latex-in-tumblr" target="_blank">insert LaTeX</a>.</li>
<li>Brian Caffo <a href="http://samsiatrtp.wordpress.com/2012/09/20/brian-caffo-shares-his-impression-of-the-massive-datasets-opening-workshop/" target="_blank">shares his impressions</a> of the SAMSI massive data workshop. He raises an important issue which definitely deserves more discussion: should we be focusing on specific or general problems? Worth a read. </li>
<li>For the people into self-tracking, Chris V. <a href="http://myyearofdata.wordpress.com/2012/09/18/bootytracking/" target="_blank">points to an app</a> created by Indiana University that lets people track their sexual activity. The most interesting thing about that app is how it highlights a key and I suppose often overlooked issue with analyzing self-tracking data. Despite the size of these data sets, they are still definitely biased samples. It’s only a brave few who will tell Indiana University all about their sex life. </li>
</ol>
Prediction contest
2012-09-21T17:00:00+00:00
http://simplystats.github.io/2012/09/21/prediction-contest
<p>I have been seeing <a href="http://www.nature.com/nature/journal/v489/n7415/full/489201a.html" target="_blank">this paper</a> all over Twitter/the blogosphere. It’s a sexy idea: can you predict how “high-impact” a scientist will be in the future? It is also a pretty flawed data analysis…so this week’s prediction contest is to identify why the statistics in this paper are so flawed. In my first pass read I noticed about 5 major flaws.</p>
<p><em>Editor’s note: I posted the criticisms and the authors respond here: <a href="http://disq.us/8bmrhl" target="_blank">http://disq.us/8bmrhl</a></em></p>
In data science - it's the problem, stupid!
2012-09-20T17:53:32+00:00
http://simplystats.github.io/2012/09/20/in-data-science-its-the-problem-stupid
<p>I just saw <a href="http://www.nature.com/nbt/journal/v30/n8/full/nbt.2301.html" target="_blank">this article</a> talking about how in the biotech world, you can’t get caught chasing the latest technology. You have to start with a problem you are solving for people and then work your way back. This reminds me a lot of <a href="http://simplystatistics.org/post/26068033590/motivating-statistical-projects" target="_blank">Type B problems</a> in data science/statistics. <a href="http://www.wired.com/science/discoveries/magazine/16-07/pb_theory" target="_blank">We have a pile of data, so we don’t need to have a problem to solve, it will come to us later</a>. I think the answer to the question, “Did you start with a scientific/business problem that needs solving regardless of whether the data was in place?” will end up being a near perfect classifier for separating the “Big Data” projects that are just hype from the ones that will pan out long term. </p>
Every professor is a startup
2012-09-20T13:55:58+00:00
http://simplystats.github.io/2012/09/20/every-professor-is-a-startup
<p>There has been a lot of discussion lately about whether to be in academia or industry. Some of it I think is a bit <a href="http://cs.unm.edu/~terran/academic_blog/?p=113" target="_blank">unfair to academia</a>. Then I saw <a href="http://www.quora.com/Data-Science/Why-is-Hilary-Mason-a-prominent-figure-within-the-big-data-community-What-are-her-notable-accomplishments" target="_blank">this post</a> on Quora asking what Hilary Mason’s contributions were to machine learning, as if she hadn’t done anything. It struck me as a bit of academia hating on industry*. I don’t see why one has to be better/worse than the other; as Roger <a href="http://simplystatistics.org/post/28335633068/why-im-staying-in-academia" target="_blank">points out</a>, there is no perfect job and it just depends on what you want to do. </p>
<p>One thing that I think gets lost in all of this is how similar being an academic researcher is to running a small startup. To be a successful professor at a research institution, you have to create a product (papers/software), network (sit on editorial boards/review panels), raise funds (by writing grants), advertise (by giving talks/presentations), identify and recruit talent (students and postdocs), manage people and personalities (students, postdocs, collaborators) and scale (you start as just yourself, and eventually grow to a <a href="http://rafalab.jhsph.edu/" target="_blank">group</a> with <a href="http://www.smart-stats.org/" target="_blank">lots of people</a>). </p>
<p>The goals are somewhat different. In a startup company, your goal is ultimately to become a profitable business. In academia, the goal is to create an enterprise that produces scientific knowledge. But in either enterprise it takes a huge amount of entrepreneurial spirit, passion, and hustle. It just depends on how you are spending your hustle. </p>
<p><em>*Sidenote: One reason I think she is so famous is that she helps people, even people that can’t necessarily do anything for her. One time I wrote her out of the blue to see if we could get some Bitly data to analyze for a class. She cheerfully helped us get it, even though the immediate payout for her was not obvious. But I tell you what, when people ask me about her, I’ll tell them she is awesome. </em></p>
Online Mentors to Guide Women Into the Sciences
2012-09-18T20:24:22+00:00
http://simplystats.github.io/2012/09/18/online-mentors-to-guide-women-into-the-sciences
<p><a href="http://www.nytimes.com/2012/09/17/education/online-mentoring-program-to-encourage-women-in-sciences.html?smid=tu-share">Online Mentors to Guide Women Into the Sciences</a></p>
Chinese Company to Acquire DNA Sequencing Firm
2012-09-18T10:30:45+00:00
http://simplystats.github.io/2012/09/18/chinese-company-to-acquire-dna-sequencing-firm
<p><a href="http://dealbook.nytimes.com/2012/09/17/chinese-company-to-acquire-dna-sequencing-firm/?smid=tu-share">Chinese Company to Acquire DNA Sequencing Firm</a></p>
Sunday Data/Statistics Link Roundup (9/16/12)
2012-09-16T13:59:53+00:00
http://simplystats.github.io/2012/09/16/sunday-data-statistics-link-roundup-9-16-12
<ol>
<li>There has been a lot of talk about the Michael Lewis (of Moneyball fame) <a href="http://www.vanityfair.com/politics/2012/10/michael-lewis-profile-barack-obama" target="_blank">profile of Obama</a> in Vanity Fair. One interesting quote I think deserves a lot more discussion is: “<span>On top of all of this, after you have made your decision, you need to feign total certainty about it. People being led do not want to think probabilistically.” This is a key issue that is only going to get worse going forward. All of public policy is probabilistic - we are even moving to <a href="http://www.guardian.co.uk/politics/2012/jun/20/test-policies-randomised-controlled-trials" target="_blank">clinical trials to evaluate public policy</a>. </span></li>
<li>It’s sort of amazing to me that I hadn’t heard about this before, but a <a href="http://www.forbes.com/sites/stevensalzberg/2012/08/25/uc-davis-threatens-professor-for-writing-about-psa-testing/" target="_blank">UC Davis professor was threatened</a> for discussing the reasons PSA screening may be overused. This same issue keeps coming up over and over - <a href="http://www.statschat.org.nz/2012/09/14/screening-isnt-treatment-or-prevention/" target="_blank">screening healthy populations for rare diseases is often not effective</a> (you need a ridiculously high specificity or a treatment with almost no side effects). What we need is John McGready to do a claymation public service video or something explaining the reasons screening might not be a good idea to the general public. </li>
<li>A bleg - I sometimes have a good week finding links myself and there are a few folks who regularly send links (Andrew J., Alex N., etc.). I’d love it if people would send me cool links when they see them with the email title, “Sunday Links” - I’m sure there is more cool stuff out there. </li>
<li>The ISCB has <a href="http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Computational_Biology/ISCB_competition_announcement" target="_blank">a competition</a> to improve the coverage of computational biology on Wikipedia. Someone should write a <a href="http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.0030161" target="_blank">surrogate variable analysis</a> or <a href="http://www.ncbi.nlm.nih.gov/pubmed/12925520" target="_blank">robust multiarray average article.</a> </li>
<li>I had not heard of the ASA’s <a href="http://stattrak.amstat.org/" target="_blank">Stattrak</a> until this week; it looks like there are some useful resources there for early career statisticians. With the onset of fall, it is closing in on a new recruiting season. If you are a postdoc/student on the job market and you haven’t read Rafa’s post on <a href="http://simplystatistics.org/post/14454324191/on-hard-and-soft-money" target="_blank">soft vs. hard money</a>, now is the time to start brushing up! Stay tuned for more job market posts this fall from Simply Statistics. </li>
</ol>
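<p>The screening point in item 2 above can be made concrete with Bayes’ rule. The sketch below computes the positive predictive value (the chance that a person who tests positive actually has the disease) from prevalence, sensitivity, and specificity. The function name and all of the numbers are illustrative assumptions, not figures from the linked articles or from any real screening test.</p>

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability of disease given a positive test, via Bayes' rule.

    All three inputs are probabilities in [0, 1]. The values used below
    are hypothetical and chosen only to illustrate the rare-disease case.
    """
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical rare disease: 1 in 1,000 people screened has it.
    prevalence = 0.001
    sensitivity = 0.99

    for specificity in (0.95, 0.99, 0.999):
        ppv = positive_predictive_value(prevalence, sensitivity, specificity)
        print(f"specificity={specificity:.3f}  PPV={ppv:.3f}")
```

<p>Under these assumed numbers, a test with 95% specificity yields a positive predictive value of roughly 2%, so nearly all positives are false alarms, and even 99.9% specificity only gets to about 50%. That is the sense in which the specificity needs to be “ridiculously high” when the disease is rare.</p>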
Statistical analysis suggests the Washington Nationals were wrong to shut down Stephen Strasburg
2012-09-15T23:36:42+00:00
http://simplystats.github.io/2012/09/15/statistical-analysis-suggests-the-washington-nationals
<p><a href="http://www.grantland.com/story/_/id/8369941/history-shows-washington-nationals-shut-stephen-strasburg-too-soon">Statistical analysis suggests the Washington Nationals were wrong to shut down Stephen Strasburg</a></p>
The statistical method made me lie
2012-09-14T13:58:31+00:00
http://simplystats.github.io/2012/09/14/the-statistical-method-made-me-lie
<p>There’s a hubbub brewing over a recent study published in the Annals of Internal Medicine that compares organic food (as in ‘USDA Organic’) to non-organic food. The study, titled “<span>Are Organic Foods Safer or Healthier Than Conventional Alternatives?</span><span class="titleSeparator">: </span><span class="subTitle">A Systematic Review” is a meta-analysis of about 200 previous studies. Their conclusion, which I have cut-and-pasted below, is</span></p>
<blockquote>
<p><span>The published literature lacks strong evidence that organic foods are significantly more nutritious than conventional foods. Consumption of organic foods may reduce exposure to pesticide residues and antibiotic-resistant bacteria.</span></p>
</blockquote>
<p><span>When I first heard about this study on the radio, I thought the conclusion seemed kind of obvious. It’s not clear to me why, for example, an organic carrot would have more calcium than a non-organic carrot. At least, I couldn’t explain the mechanism by which this would happen. However, I would expect that an organic carrot would have less pesticide residue than a non-organic carrot. If not, then the certification isn’t really achieving its </span>goals. Lo and behold, that’s more or less what the study found. I don’t see the controversy.</p>
<p>But there’s a <a href="http://www.change.org/petitions/retract-the-flawed-organic-study-linked-to-big-tobacco-and-pro-gmo-corps" target="_blank">petition over at change.org</a> titled “Retract the Flawed ‘Organic Study’ Linked to Big Tobacco and Pro-GMO Corps”. It’s quite an interesting read. First, it’s worth noting that the study itself does not list any funding sources. Given that the authors are from Stanford, one could conclude that therefore Stanford funded the study. The petition claims that Stanford has “deep financial ties to Cargill”, a large agribusiness company, but does not get into specifics.</p>
<p>More interesting is that the petition highlights the involvement in the study of Ingram Olkin, a renowned statistician at Stanford. The petition says</p>
<blockquote>
<p><span>The study was authored by the very many [sic] who invented a method of ‘lying with statistics’. Olkin </span><span>worked with Stanford</span><span> University to develop a “multivariate” statistical algorithm, which is essentially </span><strong>a way to lie with statistics</strong><strong>.</strong></p>
</blockquote>
<p>That’s right, the statistical method made them lie!</p>
<p>The petition is ridiculous. Interestingly, even as the petition claims conflict of interest on the part of the study authors, it seems one of the petition authors, Anthony Gucciardi, is “a<span> natural health advocate, and creator of the health news website NaturalSociety” according to his Twitter page. Go figure. </span>It worries me that people would claim the mere use of statistical methods is sufficient grounds for doubt. It also worries me that 3,386 people (as of this writing) would blindly agree.</p>
<p>By the way, can anyone propose an alternative to “multivariate statistics”? I need to stop all this lying….</p>
After Our Interview With Steven Salzberg Someone
2012-09-14T01:16:00+00:00
http://simplystats.github.io/2012/09/14/after-our-interview-with-steven-salzberg-someone
An experimental foundation for statistics
2012-09-13T13:55:16+00:00
http://simplystats.github.io/2012/09/13/an-experimental-foundation-for-statistics
<p>I had a recent conversation with Brian (<a href="http://simplystatistics.org/post/28840726358/in-which-brian-debates-abstraction-with-t-bone" target="_blank">of abstraction fame</a>) about the relationship between mathematics and statistics. Statistics, for historical reasons, has been treated as a mathematical sub-discipline (this is the <a href="http://simplystatistics.org/post/29899900125/nsf-recognizes-math-and-statistics-are-not-the-same" target="_blank">NSF’s view</a>).</p>
<p>One reason statistics is viewed as a sub-discipline of math is because the foundations of statistics are built on the basis of <a href="http://en.wikipedia.org/wiki/Deductive_reasoning" target="_blank">deductive reasoning</a>, where you start with a few general propositions or foundations that you assume to be true and then systematically prove more specific results. A similar approach is taken in most mathematical disciplines. </p>
<p>In contrast, scientific disciplines like biology are largely built on the basis of <a href="http://en.wikipedia.org/wiki/Inductive_reasoning" target="_blank">inductive reasoning</a> and the <a href="http://en.wikipedia.org/wiki/Scientific_method" target="_blank">scientific method</a>. Specific individual discoveries are described and used as a framework for building up more general theories and principles. </p>
<p>So the question Brian and I had was: what if you started over and built statistics from the ground up on the basis of inductive reasoning and experimentation? Instead of making mathematical assumptions and then proving statistical results, you would use experiments to identify core principles. This actually isn’t without precedent in the statistics community. Bill Cleveland and Robert McGill studied how people <a href="http://elibrary.unm.edu/courses/documents/ClevelandandMcGill1985-GraphicalPerceptionandGraphicalMethodsforAnalyzingScientificData.pdf" target="_blank">perceive graphical information</a> and produced some general recommendations about the use of area/linear contrasts, common axes, etc. There has also been a lot of work on experimental understanding of how humans <a href="http://www.sciencemag.org/content/333/6048/1393.short" target="_blank">understand uncertainty</a>. </p>
<p>So what if we put statistics on an experimental, rather than a mathematical, foundation? What if we performed experiments to see what kinds of regression models people were able to interpret most clearly, what the best ways to evaluate confounding/outliers were, or what measure of statistical significance people understood best? Basically, what if the “quality” of a statistical method did not rest on the mathematics behind the method, but on the basis of experimental results demonstrating how people used the methods? So, instead of justifying lowess mathematically, we would justify it on the basis of its practical usefulness through specific, controlled experiments. Some of this is already happening when people do surveys of the most successful methods in Kaggle contests or with the <a href="http://www.fda.gov/ScienceResearch/BioinformaticsTools/MicroarrayQualityControlProject/default.htm" target="_blank">MAQC</a>.</p>
<p>I wonder what methods would survive the change in paradigm?</p>
Coursera introduces three courses in statistics
2012-09-13T12:40:22+00:00
http://simplystats.github.io/2012/09/13/coursera-introduces-three-courses-in-statistics
<p><a href="http://www.significancemagazine.org/details/webexclusive/2539381/Coursera-introduces-three-courses-in-statistics.html">Coursera introduces three courses in statistics</a></p>
The pebbles of academia
2012-09-10T19:02:00+00:00
http://simplystats.github.io/2012/09/10/the-pebbles-of-academia
<p>I have just been awarded a certificate for successful completion of the Conflict of Interest Commitment training (I barely passed). Lately, I have been totally swamped by administrative duties and have had little time for actual research. The experience reminded me of something I read in this <a href="http://www.nytimes.com/2011/05/29/business/economy/29view.html?_r=1" target="_blank">NYTimes article</a> by <span><a href="http://marginalrevolution.com/" target="_blank">Tyler Cowen</a></span>:</p>
<blockquote>
<p><span>Michael Mandel, an economist with the Progressive Policy Institute, compares government regulation of innovation to the accumulation of pebbles in a stream. At some point too many pebbles block off the water flow, yet no single pebble is to blame for the slowdown. Right now the pebbles are limiting investment in future innovation.</span></p>
</blockquote>
<p>Here are some of the pebbles of my academic career (past and present): <span>financial conflict of interest training, human subjects training, HIPAA training, safety training, ethics training, submitting papers online, filling out copyright forms, faculty meetings, center grant quarterly meetings, 2 hour oral exams, 2 hour thesis committee meetings, big project conference calls, retreats, JSM, anything with “strategic” in the title, admissions committee, </span><span>affirmative action committee, faculty senate meetings, brown bag lunches, orientations, effort reporting, conflict of interest reporting, progress reports (can’t I just point to PubMed?), dbGaP progress reports, people who ramble at study section, rambling at study section, </span>buying airplane tickets for invited talks, filling out travel expense sheets, and organizing and turning in travel receipts. I know that some of these are somewhat important or take minimal time, but read the quote again.</p>
<p>I also acknowledge that I actually have it real easy compared to others, so I am interested in hearing about other people’s pebbles. </p>
<p><strong>Update</strong>: add changing my eRA commons password to list!</p>
<p><img src="http://rafalab.jhsph.edu/simplystats/pebles4.jpg" width="400" /></p>
Sunday Data/Statistics Link Roundup (9/9/12)
2012-09-09T13:32:00+00:00
http://simplystats.github.io/2012/09/09/sunday-data-statistics-link-roundup-9-9-12
<ol>
<li>Not necessarily statistics related, but pretty appropriate now that the school year is starting. Here is a little introduction to <a href="http://i.imgur.com/ikDIW.gif" target="_blank">“how to google”</a> (via Andrew J.). Being able to “just google it” and find answers for oneself without having to resort to asking folks is maybe the #1 most useful skill as a statistician. </li>
<li>A <a href="http://dl.dropbox.com/u/7586336/RSS2012/googleVis_at_RSS_2012.html#(1)" target="_blank">really nice presentation</a> on interactive graphics with the googleVis package. I think one of the most interesting things about the presentation is that it was built with markdown/knitr/slidy (see slide 53). I am seeing more and more of these web-based presentations. I like them for a lot of reasons (ability to incorporate interactive graphics, easy sharing, etc.), although it is still harder than building a Powerpoint. I also wonder, what happens when you are trying to present somewhere that doesn’t have a good internet connection?</li>
<li>We talked a lot about the ENCODE project this week. We had an <a href="http://simplystatistics.org/post/31056769228/interview-with-steven-salzberg-about-the-encode" target="_blank">interview with Steven Salzberg</a>, then Rafa followed it up with a discussion of <a href="http://simplystatistics.org/post/31067828460/top-down-versus-bottom-up-science-data-analysis" target="_blank">top-down vs. bottom-up science</a>. Tons of data from the ENCODE project is <a href="http://genome.ucsc.edu/ENCODE/" target="_blank">now available</a>, there is even a <a href="http://scofield.bx.psu.edu/~dannon/encodevm/" target="_blank">virtual machine</a> with all the software used in the main analysis of the data that was just published. But my favorite quote/tweet/comment this week came from Leonid K. about a flawed/over the top piece trying to make a little too much of the ENCODE discoveries: “<a href="https://twitter.com/leonidkruglyak/status/244425345481183232" target="_blank">that’s a clown post, bro</a>”.</li>
<li>Another breathless post from the Chronicle about how there are “<a href="http://chronicle.com/article/Dozens-of-Plagiarism-Incidents/133697/" target="_blank">dozens of plagiarism cases being reported on Coursera</a>”. Given that tens of thousands of people are taking the course, it would be shocking if there wasn’t plagiarism, but my guess is it is about the same rate you see in in-person classes. I will be using peer grading in <a href="https://www.coursera.org/course/dataanalysis" target="_blank">my course</a>; hopefully plagiarism software will be in place by then. </li>
<li>A <a href="http://www.nytimes.com/2012/09/04/science/visual-strategies-transforms-data-into-art-that-speaks.html?_r=2&ref=science" target="_blank">New York Times article</a> about a new book on visualizing data for scientists/engineers. I love all the attention data visualization is getting. I’ll take a look at the book for sure. I bet it says a lot of the same things Tufte said and a lot of the things Nathan Yau <a href="http://www.amazon.com/gp/product/0470944889/?tag=flowingdata-20" target="_blank">says in his book.</a> This one may just be targeted at scientists/engineers. (link via Dan S.)</li>
<li>Edo and co. are putting together <a href="http://snap.stanford.edu/social2012/" target="_blank">a workshop on the analysis of social network data for NIPS</a> in December. If you do this kind of stuff, it should be a pretty awesome crowd, so get your paper in by the Oct. 15th deadline!</li>
</ol>
Big Data in Your Blood
2012-09-08T18:00:40+00:00
http://simplystats.github.io/2012/09/08/big-data-in-your-blood
<p><a href="http://bits.blogs.nytimes.com/2012/09/07/big-data-in-your-blood/?smid=tu-share">Big Data in Your Blood</a></p>
The Weatherman Is Not a Moron
2012-09-08T13:58:14+00:00
http://simplystats.github.io/2012/09/08/the-weatherman-is-not-a-moron
<p><a href="http://www.nytimes.com/2012/09/09/magazine/the-weatherman-is-not-a-moron.html?smid=tu-share">The Weatherman Is Not a Moron</a></p>
Top-down versus bottom-up science: data analysis edition
2012-09-07T18:56:00+00:00
http://simplystats.github.io/2012/09/07/top-down-versus-bottom-up-science-data-analysis
<p>In our most recent <a href="http://simplystatistics.org/post/31056769228/interview-with-steven-salzberg-about-the-encode" target="_blank">video</a>, <a href="http://bioinformatics.igm.jhmi.edu/salzberg/Salzberg/Salzberg_Lab_Home.html" target="_blank">Steven Salzberg</a> discusses the ENCODE project. Some of the advantages and disadvantages of top-down science are described. Here, top-down refers to big coordinated projects like the <a href="http://en.wikipedia.org/wiki/Human_Genome_Project" target="_blank">Human Genome Project</a> (HGP). In contrast, the approach of funding many small independent projects, via the <a href="http://grants.nih.gov/grants/funding/r01.htm" target="_blank">R01</a> mechanism, is referred to as bottom-up. Note that for the cost of HGP we could have funded thousands of R01s. However it is not clear that without the HGP we would have had public sequence data as early as we did. As Steven points out, when it comes to data generation the economies of scale make big projects more efficient. But the same is not necessarily true for data analysis.</p>
<p>Big projects like <a href="http://genome.ucsc.edu/ENCODE/" target="_blank">ENCODE</a> and <a href="http://www.1000genomes.org/" target="_blank">1000 genomes</a> include data analysis teams that work in coordination with the data producers. It is true that very good teams are assembled and very good tools developed. But what if instead of holding the data under embargo until the first analysis is done and a paper (or <a href="http://blogs.nature.com/news/2012/09/fighting-about-encode-and-junk.html" target="_blank">30</a>) is published, the data was made publicly available with no restrictions and the scientific community was challenged to compete for data analysis and biological discovery R01s? I have no evidence that this would produce better science, but my intuition is that, at least in the case of data analysis, better methods would be developed. Here is my reasoning. Think of the best 100 data analysts in academia and consider the following two approaches:</p>
<p>1- Pick the best among the 100 and have a small group carefully coordinate with the data producers to develop data analysis methods.</p>
<p>2- Let all 100 take a whack at it and see what falls out.</p>
<p>In scenario 1 the selected group has artificial protection from competing approaches and there are fewer brains generating novel ideas. In scenario 2 the competition would be fierce and after several rounds of sharing ideas (via publications and conferences), groups would borrow from others and generate even better methods.</p>
<p>Note that the big projects do make the data available and R01s are awarded to develop analysis tools for these data. But this only happens after giving the consortium’s group a substantial head start. </p>
<p>I have not participated in any of these consortia and perhaps I am being naive. So I am very interested to hear the opinions of others.</p>
Simply Statistics Podcast #3: Interview with Steven Salzberg
2012-09-07T14:12:34+00:00
http://simplystats.github.io/2012/09/07/interview-with-steven-salzberg-about-the-encode
<p>Interview with Steven Salzberg about the ENCODE Project.</p>
<p>In this episode Jeff and I have a discussion with Steven Salzberg, Professor of Medicine and Biostatistics at Johns Hopkins University, about the recent findings from the <a href="http://www.genome.gov/10005107" target="_blank">ENCODE Project</a> where he helps us separate fact from fiction. You’re going to want to watch to the end with this one.</p>
<p>Here are some excerpts from the interview.</p>
<p>Regarding why the data should have been released immediately without restriction:</p>
<blockquote>
<p>If this [ENCODE] were funded by a regular investigator-initiated grant, then I would say you have your own grant, you’ve got some hypotheses you’re pursuing, you’re collecting data, you’ve already demonstrated that…you have some special ability to do this work and you should get some time to look at your data that you just generated to publish it. This was not that kind of a project. These are not hypothesis-driven projects. They are data collection projects. The whole model is…they’re creating a resource and it’s more efficient to create the resource in one place…. So we all get this data that’s being made available for less money…. I think if you’re going to be funded that way, you should release the data right away, no restrictions, because you’re funded because you’re good at generating this data cheaply….But you may not be the best person to do the analysis.</p>
</blockquote>
<p>Regarding the problem with large-scale top-down funding approaches versus the individual investigator approach:</p>
<blockquote>
<p>Well, it’s inefficient because it’s anti-competitive. They have a huge amount of money going to a few centers, they’ll do tons of experiments of the same type—may not be the best place to do that. They could instead give that money to 20 times as many investigators who would be refining the techniques and developing better ones. And a few years from now, instead of having another set of ENCODE papers—which we’re probably going to have—we might have much better methods and I think we’d have just as much in terms of discovery, probably more.</p>
</blockquote>
<p>Regarding the best way to make discoveries:</p>
<blockquote>
<p>I think a problem I have with it…is that the top-down approach to science isn’t the way you make discoveries. And NIH has sort of said we’re going to fund these data generation and data analysis groups—they’re doing both…and by golly we’re going to discover some things. Well, it doesn’t always work if you do that. You can’t just say…so the Human Genome [Project], even though, of course there were lots of promises about curing cancer, we didn’t say we were going to discover how a particular gene works, we said we’re going to discover what the sequence is. And we did! Really well. With these [ENCODE] projects they said we’re going to figure out the function of all the elements, and they haven’t figured that out, at all.</p>
</blockquote>
<p>[<a href="http://www.biostat.jhsph.edu/~rpeng/podcast/simplystatistics_HDvideo.xml" target="_blank">HD video RSS feed</a>]</p>
<p>[<a href="http://www.biostat.jhsph.edu/~rpeng/podcast/simplystatistics_audio.xml" target="_blank">Audio-only RSS feed</a>]</p>
<p>[NOTE: Due to a clumsy camera operator (who forgot to turn the camera on), we lost one of our three camera angles and so there’s no front-facing view. Sorry!]</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Simply Statistics Podcast #2
2012-09-06T13:58:16+00:00
http://simplystats.github.io/2012/09/06/in-this-episode-of-the-simply-statistics-podcast
<p>In this episode of the Simply Statistics podcast Jeff and I discuss the deterministic statistical machine and increasing the cost of data analysis. We decided to eschew the studio setup this time and attempt a more guerilla style of podcasting. Also, Rafa was nowhere to be found when we recorded so you’ll have to catch his melodious singing voice in the next episode.</p>
<p>And in case you’re wondering, Jeff’s office is in fact that clean.</p>
<p>As always, we welcome your feedback!</p>
<p>[<a href="http://www.biostat.jhsph.edu/~rpeng/podcast/simplystatistics_HDvideo.xml" target="_blank">HD video RSS feed</a>]</p>
<p>[<a href="http://www.biostat.jhsph.edu/~rpeng/podcast/simplystatistics_audio.xml" target="_blank">Audio-only RSS feed</a>]</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
How long should the next podcast be?
2012-09-05T13:53:01+00:00
http://simplystats.github.io/2012/09/05/how-long-should-the-next-podcast-be
<p>Here’s the survival curve for audience retention from the YouTube version of our <a href="http://youtu.be/FkSlaczE1vw" target="_blank">first podcast</a>.</p>
<p><img src="http://media.tumblr.com/tumblr_m9vnbcaVdP1r08wvg.png" alt="" /></p>
<p>So the question is: How long should our next podcast be?</p>
<p>By the way, Rafa, Jeff, and I all appreciate the little bump over at the 15 minute mark. However, you’re only encouraging us there!</p>
Online universities blossom in Asia
2012-09-03T17:21:23+00:00
http://simplystats.github.io/2012/09/03/online-universities-blossom-in-asia
<p><a href="http://news.yahoo.com/online-universities-blossom-asia-185953800.html">Online universities blossom in Asia</a></p>
Sunday Data/Statistics Link Roundup (9/2/2012)
2012-09-02T13:55:03+00:00
http://simplystats.github.io/2012/09/02/sunday-data-statistics-link-roundup-9-2-2012
<ol>
<li>Just got back from IBC 2012 in Kobe, Japan. I was in an awesome session (organized by the inimitable <a href="http://www.kuleuven.be/wieiswie/en/person/u0071934" target="_blank">Lieven Clement</a>) with great talks by <a href="http://www.mnmccall.com/" target="_blank">Matt McCall</a>, <a href="http://www.bioinf.jku.at/people/clevert/" target="_blank">Djork-Arne Clevert</a>, <a href="https://www.dur.ac.uk/wolfson.institute/contacts/staff/?id=8718" target="_blank">Adetayo Kasim</a>, and <a href="http://www.linkedin.com/pub/willem-talloen/13/755/207" target="_blank">Willem Talloen</a>. Willem’s talk nicely tied in our work and how it plays into the pharmaceutical development process and the bigger theme of big data. On the way home through SFO I <a href="http://biostat.jhsph.edu/~jleek/bigdata.jpg" target="_blank">saw this</a> hanging in the airport. A fitting welcome back to the States. Although, as we talked about in <a href="http://simplystatistics.org/post/30101719608/simply-statistics-podcast-1-to-mark-the" target="_blank">our first podcast</a>, I wonder how long the Big Data hype will last…</li>
<li>Simina B. sent <a href="http://analytics.ncsu.edu/?page_id=1799&gclid=CImx-pfBkrICFYSo4AodnQsArg" target="_blank">this link</a> along for a masters program in analytics at NC State. Interesting because it looks a lot like a masters in statistics program, but with a heavier emphasis on data collection/data management. I wonder what role the stat department down there is playing in this program and if we will see more like it pop up? Or if programs like this with more data management will be run by stats departments other places. Maybe our friends down in Raleigh have some thoughts for us. </li>
<li>If one set of weekly links isn’t enough to fill your procrastination quota, go check out NextGenSeek’s <a href="http://nextgenseek.com/2012/09/nextgenseek-stories-this-week-3108/" target="_blank">weekly stories</a>. A bit genomics focused, but lots of cool data/statistics links in there too. Love the “extreme Venn diagrams”. </li>
<li><a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-STA4.html" target="_blank">This</a> seems almost like the fast statistics journal <a href="http://simplystatistics.org/post/19289280474/a-proposal-for-a-really-fast-statistics-journal" target="_blank">I proposed</a> earlier. Can’t seem to access the first issue/editorial board either. Doesn’t look like it is open access, so it’s still not perfect. But I love the sentiment of fast/single round review. We can do better though. I think Yihui X. has some <a href="http://yihui.name/en/2012/03/a-really-fast-statistics-journal/" target="_blank">really interesting</a> ideas on how. </li>
<li>My wife taught for a year at Grinnell in Iowa and loved it there. They just released this <a href="http://www.grinnell.edu/offices/institutionalresearch/CDS" target="_blank">cool data set</a> with a bunch of information about the college. If all colleges did this, we could really dig in and learn a lot about the American higher education system (link via Hilary M.). </li>
<li>From the way-back machine, a rant from Rafa <a href="http://simplystatistics.org/post/10402321009/meetings" target="_blank">about meetings</a>. Stay tuned this week for some Simply Statistics data about our first year on the <a href="http://en.wikipedia.org/wiki/Series_of_tubes" target="_blank">series of tubes</a>. </li>
</ol>
Drought Extends, Crops Wither
2012-09-01T13:58:22+00:00
http://simplystats.github.io/2012/09/01/drought-extends-crops-wither
<p><a href="http://www.nytimes.com/interactive/2012/08/24/us/drought-crops.html">Drought Extends, Crops Wither</a></p>
Most Americans Confused By Cloud Computing According to National Survey
2012-08-31T18:00:21+00:00
http://simplystats.github.io/2012/08/31/most-americans-confused-by-cloud-computing-according-to
<p><a href="http://www.citrix.com/English/NE/news/news.asp?newsID=2328309">Most Americans Confused By Cloud Computing According to National Survey</a></p>
Court Blocks E.P.A. Rule on Cross-State Pollution
2012-08-31T14:00:11+00:00
http://simplystats.github.io/2012/08/31/court-blocks-e-p-a-rule-on-cross-state-pollution
<p><a href="http://www.nytimes.com/2012/08/22/science/earth/appeals-court-strikes-down-epa-rule-on-cross-state-pollution.html?smid=tu-share">Court Blocks E.P.A. Rule on Cross-State Pollution</a></p>
Court Upholds Rule on Nitrogen Dioxide Emissions
2012-08-30T17:59:17+00:00
http://simplystats.github.io/2012/08/30/court-upholds-rule-on-nitrogen-dioxide-emissions
<p><a href="http://www.nytimes.com/2012/07/18/science/earth/court-upholds-rule-on-nitrogen-dioxide-emissions.html?smid=tu-share">Court Upholds Rule on Nitrogen Dioxide Emissions</a></p>
Green: Will Emissions Disclosure Mean Investor Pressure on Polluters?
2012-08-30T13:58:23+00:00
http://simplystats.github.io/2012/08/30/green-will-emissions-disclosure-mean-investor-pressure
<p><a href="http://green.blogs.nytimes.com/2012/08/24/will-emissions-disclosure-mean-investor-pressure-on-polluters/?smid=tu-share">Green: Will Emissions Disclosure Mean Investor Pressure on Polluters?</a></p>
I.B.M. Mainframe Evolves to Serve the Digital World
2012-08-29T17:59:30+00:00
http://simplystats.github.io/2012/08/29/i-b-m-mainframe-evolves-to-serve-the-digital-world
<p><a href="http://www.nytimes.com/2012/08/28/technology/ibm-mainframe-evolves-to-serve-the-digital-world.html?smid=tu-share">I.B.M. Mainframe Evolves to Serve the Digital World</a></p>
Increasing the cost of data analysis
2012-08-29T14:00:00+00:00
http://simplystats.github.io/2012/08/29/increasing-the-cost-of-data-analysis
<p>Jeff’s post about the <a href="http://simplystatistics.org/post/30315018436/a-deterministic-statistical-machine" target="_blank">deterministic statistical machine</a> got me thinking a bit about the cost of data analysis. The cost of data analysis these days is in many ways going up. The data being collected are getting bigger and more complex. Analyzing these data requires more expertise, more storage hardware, and more computing power. In fact the analysis in some fields like genomics is now more expensive than the collection of the data [There’s a graph that shows this but I can’t seem to find it anywhere; I’ll keep looking and post later. For now <a href="http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html" target="_blank">see here</a>.].</p>
<p>However, that’s really about the dollars and cents kind of cost. The cost of data analysis has gone very far down in a different sense. For the vast majority of applications that look at moderate to large datasets, many many statistical analyses can be conducted essentially at the push of a button. And so there’s no cost in continuing to analyze data until a desirable result is obtained. Correcting for multiple testing is one way to “fix” this problem. But I personally don’t find multiple testing corrections to be all that helpful because ultimately they still try to boil down a complex analysis into a simple yes/no answer.</p>
<p>In the old days (for example when <a href="http://web.archive.org/web/19970717063350/http://www.stat.berkeley.edu/users/rafa/index.html" target="_blank">Rafa was in grad school</a>), computing time was precious and things had to be planned out carefully, starting with the planning of the experiment and continuing with the data collection and the analysis. In fact, much of current statistical education is still geared around the idea that computing is expensive, which is why we use things like asymptotic theorems and approximations even when we don’t really have to. Nowadays, there’s a bit of a “we’ll fix it in post” mentality, which values collecting as much data as possible when given the chance and figuring out what to do with it later. This kind of thinking can lead to (1) <a href="http://simplystatistics.org/post/25924012903/the-problem-with-small-big-data" target="_blank">small big data problems</a>; (2) poorly designed studies; (3) data that don’t really address the question of interest to everyone.</p>
<p>What if the cost of data analysis were not paid in dollars but were paid in some general unit of credibility? For example, Jeff’s hypothetical machine would do some of this.</p>
<blockquote>
<p><span>By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around. </span></p>
</blockquote>
<p>So with each additional analysis of the data, you get an additional piece of paper added to your analysis paper trail. People can look at the analysis paper trail and make of it what they will. Maybe they don’t care. Maybe having a ton of analyses discredits the final results. The point is that it’s there for all to see.</p>
<p>I do <em>not</em> think what we need is better methods to deal with multiple testing. This is simply not a statistical issue. What we need is a way to increase the cost of data analysis by preserving the paper trail. So that people hesitate before they run all pairwise combinations of whatever. Reproducible research doesn’t really deal with this problem because reproducibility only really requires that the <em>final</em> analysis is documented.</p>
<p>In other words, let the paper trail be the price of pushing the button.</p>
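<p>One way to picture such a paper trail is as an append-only log in which every analysis, not just the final one, leaves an entry. The sketch below is only an illustration under my own assumptions: the <code>PaperTrail</code> class, its field names, and the idea of chaining entries with SHA-256 hashes are all hypothetical and are not part of Jeff’s proposal or of figshare.</p>

```python
import hashlib
import json


class PaperTrail:
    """Append-only log of every analysis run (hypothetical sketch)."""

    def __init__(self):
        self.entries = []

    def record(self, data, description):
        """Log one analysis and return how many have been run so far."""
        # Fingerprint the data so readers can see which data set was used.
        data_hash = hashlib.sha256(
            json.dumps(data, sort_keys=True).encode("utf-8")
        ).hexdigest()
        # Chain each entry to the previous one, so that editing or dropping
        # an earlier entry no longer matches the hashes that follow it.
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else ""
        entry_hash = hashlib.sha256(
            (prev_hash + data_hash + description).encode("utf-8")
        ).hexdigest()
        self.entries.append({
            "number": len(self.entries) + 1,
            "description": description,
            "data_hash": data_hash,
            "prev_hash": prev_hash,
            "entry_hash": entry_hash,
        })
        return len(self.entries)

    def verify(self):
        """Return True if the chain of entries is intact.

        This catches edited entries and entries removed from the start or
        middle of the log; it cannot tell if the newest entries were dropped.
        """
        prev_hash = ""
        for entry in self.entries:
            expected = hashlib.sha256(
                (prev_hash + entry["data_hash"] + entry["description"]).encode("utf-8")
            ).hexdigest()
            if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
                return False
            prev_hash = entry["entry_hash"]
        return True
```

<p>The point of the design is that the tenth regression on the same data set shows up as entry number ten with the same data fingerprint as the first nine. Nothing stops you from pushing the button again, but the count is there for all to see.</p>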
Genes Now Tell Doctors Secrets They Can’t Utter
2012-08-28T17:59:19+00:00
http://simplystats.github.io/2012/08/28/genes-now-tell-doctors-secrets-they-cant-utter
<p><a href="http://www.nytimes.com/2012/08/26/health/research/with-rise-of-gene-sequencing-ethical-puzzles.html?smid=tu-share">Genes Now Tell Doctors Secrets They Can’t Utter</a></p>
Active in Cloud, Amazon Reshapes Computing
2012-08-28T14:00:18+00:00
http://simplystats.github.io/2012/08/28/active-in-cloud-amazon-reshapes-computing
<p><a href="http://www.nytimes.com/2012/08/28/technology/active-in-cloud-amazon-reshapes-computing.html?smid=tu-share">Active in Cloud, Amazon Reshapes Computing</a></p>
A deterministic statistical machine
2012-08-27T14:00:06+00:00
http://simplystats.github.io/2012/08/27/a-deterministic-statistical-machine
<p>As Roger pointed out, the most recent batch of Y Combinator startups included a bunch of <a href="http://simplystatistics.org/post/29964925728/data-startups-from-y-combinator-demo-day" target="_blank">data-focused</a> companies. One of these companies, <a href="https://www.statwing.com/" target="_blank">StatWing</a>, is a web-based tool for data analysis that looks like an improvement on SPSS with more plain text, more visualization, and a lot of the technical statistical details “under the hood”. I first read about StatWing on TechCrunch, under the title <a href="http://techcrunch.com/2012/08/16/how-statwing-makes-it-easier-to-ask-questions-about-data-so-you-dont-have-to-hire-a-statistical-wizard/" target="_blank">“How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”</a>.</p>
<p>StatWing looks super user-friendly and the idea of democratizing statistical analysis so more people can access these ideas is something that appeals to me. But, as one of the aforementioned statistical wizards, this had me freaked out for a minute. Once I looked at the software though, I realized it suffers from the same problem that most “user-friendly” statistical software suffers from. It makes it really easy to screw up a data analysis. It will tell you when something is significant and if you don’t like that it isn’t, you can keep slicing and dicing the data until it is. The key issue behind getting insight from data is knowing when you are fooling yourself with confounders, or small effect sizes, or overfitting. StatWing looks like an improvement on the UI experience of data analysis, but it won’t prevent false positives that plague science and cost business big $$. </p>
<p>So I started thinking about what kind of software would prevent these sorts of problems while still being accessible to a big audience. My idea is a “deterministic statistical machine”. Here is how it works: you input a data set and then specify the question you are asking (is variable Y related to variable X? can I predict Z from W?). Then, depending on your question, it uses a deterministic set of methods to analyze the data. Say regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, <a href="http://www.nature.com/news/the-data-detective-1.10937" target="_blank">maybe even data fudging</a>. It generates a report with a markdown tool and then immediately publishes the result to <a href="http://figshare.com/" target="_blank">figshare</a>. </p>
<p>The advantage is that people can get their data-related questions answered using a standard tool. It does a lot of the “heavy lifting” in checking for potential problems and produces nice reports. But it is a deterministic algorithm for analysis so overfitting, fudging the analysis, etc. are harder. By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around. </p>
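<p>To make the idea a bit more concrete, here is a rough sketch of what the core loop of such a machine might look like. Everything in it is an assumption made for illustration: the question name <code>is_y_related_to_x</code>, the 3-standard-deviation outlier rule, the report layout, and the <code>PUBLISHED</code> list, which merely stands in for a public archive and does not call any real figshare API.</p>

```python
import statistics


def check_data(x, y):
    """Run the same pre-specified checks on every data set."""
    # Drop rows with a missing value and count how many were dropped.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n_missing = len(x) - len(pairs)
    # Flag outcomes more than 3 standard deviations from the mean
    # (an arbitrary rule chosen for this sketch).
    ys = [b for _, b in pairs]
    mean_y, sd_y = statistics.mean(ys), statistics.stdev(ys)
    n_outliers = sum(1 for b in ys if sd_y > 0 and abs(b - mean_y) > 3 * sd_y)
    return pairs, {"missing": n_missing, "outliers": n_outliers}


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


# One fixed method per question type: the user picks the question,
# never the method.
METHODS = {"is_y_related_to_x": fit_line}

# Stand-in for a public archive; a real service would publish here.
PUBLISHED = []


def run_dsm(question, x, y):
    """Check the data, apply the fixed method, and publish the report."""
    if question not in METHODS:
        raise ValueError(f"unsupported question: {question}")
    pairs, checks = check_data(x, y)
    result = METHODS[question](pairs)
    report = "\n".join([
        f"# Report: {question}",
        f"- rows used: {len(pairs)}",
        f"- missing rows dropped: {checks['missing']}",
        f"- outliers flagged: {checks['outliers']}",
        f"- slope: {result['slope']:.3f}",
        f"- intercept: {result['intercept']:.3f}",
    ])
    PUBLISHED.append(report)  # every run is published, liked or not
    return report
```

<p>Calling <code>run_dsm("is_y_related_to_x", x, y)</code> twice on slightly different slices of the same data leaves two reports in <code>PUBLISHED</code>, which is exactly the “multiple testing paper trail” described above.</p>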
<p>The DSM should be a web service that is easy to use. Anybody want to build it? Any suggestions for how to do it better? </p>
Sunday data/statistics link roundup (8/26/12)
2012-08-26T13:53:15+00:00
http://simplystats.github.io/2012/08/26/sunday-data-statistics-link-roundup-8-26-12
<p>First off, a quick apology for missing last week, and <a href="https://twitter.com/Augusto_Heink/status/237621283397984256" target="_blank">thanks to Augusto</a> for noticing! On to the links:</p>
<ol>
<li>Unbelievably the <a href="http://blogs.nature.com/news/2012/08/us-court-sides-with-gene-patents.html" target="_blank">BRCA gene patents were upheld</a> by the lower court despite the Supreme Court <a href="http://simplystatistics.org/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court" target="_blank">coming down pretty unequivocally</a> against patenting correlations between metabolites and health outcomes. I wonder if this one will be overturned if it makes it back up to the Supreme Court. </li>
<li>A <a href="http://thebrowser.com/interviews/david-spiegelhalter-on-statistics-and-risk" target="_blank">really nice interview</a> with David Spiegelhalter on Statistics and Risk. David runs the <a href="http://understandinguncertainty.org/" target="_blank">Understanding Uncertainty </a>blog and published a recent paper on <a href="http://www.sciencemag.org/content/333/6048/1393.abstract" target="_blank">visualizing uncertainty</a>. My favorite line from the interview might be: “<span>There is a nice quote from Joel Best that “all statistics are social products, the results of people’s efforts”. He says you should always ask, “Why was this statistic created?” Certainly statistics are constructed from things that people have chosen to measure and define, and the numbers that come out of those studies often take on a life of their own.”</span></li>
<li>For those of you who use Tumblr like we do, here is a <a href="http://adamlaiacano.tumblr.com/post/11272953536/tips-for-making-a-technical-blog-on-tumblr" target="_blank">cool post</a> on how to put technical content into your blog. My favorite thing I learned about is the <a href="https://gist.github.com/" target="_blank">Github Gist</a> that can be used to embed syntax-highlighted code.</li>
<li>A few <a href="http://www.grantland.com/story/_/id/8284393/breaking-best-nfl-stats" target="_blank">interesting and relatively simple stats</a> for projecting the success of NFL teams. One thing I love about sports statistics is that they are totally willing to be super ad-hoc and to be super simple. Sometimes this is all you need to be highly predictive (see for example, the results of Football’s Pythagorean Theorem). I’m sure there are tons of more sophisticated analyses out there, but if it ain’t broke… (via Rafa). </li>
<li>My student Hilary has a new blog that’s worth checking out. Here is a <a href="http://hilaryparker.com/2012/08/25/love-for-projecttemplate/" target="_blank">nice review</a> of ProjectTemplate she did. I think the idea of having an organizing principle behind your code is a great one. Hilary likes ProjectTemplate, but I think there are a few others out there that might be useful. If you know about them, you should leave a comment on her blog!</li>
<li>This is ridiculously cool. Man City has <a href="http://www.epltalk.com/man-city-makes-player-statistics-data-available-to-public-small-step-towards-stat-nerd-nirvana-45877" target="_blank">opened up </a>their data/statistics to the data analytics community. After registering, you have access to many of the statistics the club uses to analyze their players. This is yet another example of open data taking over the world. It’s clear that data generators can create way more value for themselves by releasing cool data, rather than holding it all in house. </li>
<li>The Portland Public Library has created a website called <a href="http://www.bookpsychic.com/" target="_blank">Book Psychic</a>, basically a recommender system for books. I love this idea. It would be great to have a <a href="http://simplystatistics.org/post/10521062620/the-killer-app-for-peer-review" target="_blank">recommender system for scientific papers</a>. </li>
</ol>
Judge Rules Poker Is A Game Of Skill, Not Luck
2012-08-25T17:04:57+00:00
http://simplystats.github.io/2012/08/25/judge-rules-poker-is-a-game-of-skill-not-luck
<p><a href="http://www.npr.org/2012/08/22/159833145/judge-rules-poker-is-a-game-of-skill-not-luck">Judge Rules Poker Is A Game Of Skill, Not Luck</a></p>
Simply Statistics Podcast #1
2012-08-24T13:54:00+00:00
http://simplystats.github.io/2012/08/24/simply-statistics-podcast-1-to-mark-the
<p>Simply Statistics Podcast #1.</p>
<p>To mark the occasion of our 1-year anniversary of starting the blog, Jeff, Rafa, and I have recorded our first podcast. You can tell that it’s our very first podcast because we don’t appear to have any idea what we’re doing. However, we decided to throw caution to the wind.</p>
<p>In this episode we talk about why we started the blog and discuss our thoughts on statistics and big data. Be sure to watch to the end as Rafa provides a special treat.</p>
<p><strong>UPDATE</strong>: For those of you who can’t bear the sight of us, there is an <a href="http://www.biostat.jhsph.edu/~rpeng/podcast/SSPodcast1_audio.m4a" target="_blank">audio only version</a>.</p>
<p><strong>UPDATE 2</strong>: I have set up an RSS feed for the <a href="http://www.biostat.jhsph.edu/~rpeng/podcast/simplystatistics_audio.xml" target="_blank">audio-only version of the podcast</a>.</p>
<p><strong>UPDATE 3</strong>: Here is the RSS feed for the <a href="feed://www.biostat.jhsph.edu/~rpeng/podcast/simplystatistics_HDvideo.xml" target="_blank">HD video version of the podcast</a>.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Science Exchange starts Reproducibility Initiative
2012-08-23T14:46:17+00:00
http://simplystats.github.io/2012/08/23/science-exchange-starts-reproducibility-initiative
<p>I’ve fallen behind and so haven’t had a chance to mention this, but <a href="https://www.scienceexchange.com" target="_blank">Science Exchange</a> has started its <a href="https://www.scienceexchange.com/reproducibility" target="_blank">Reproducibility Initiative</a>. The idea is that authors can submit their study to be reproduced and Science Exchange will match the study with a validator who will attempt to reproduce the results (for a fee).</p>
<blockquote>
<p><span>Validated studies will receive a Certificate of Reproducibility acknowledging that their results have been independently reproduced as part of the Reproducibility Initiative. Researchers have the opportunity to publish the replicated results as an independent publication in the PLOS Reproducibility Collection, and can share their data via the figshare Reproducibility Collection repository. The original study will also be acknowledged as independently reproduced if published in a </span><a class="about_link" href="https://www.scienceexchange.com/reproducibility" target="_blank">supporting journal</a><span>.</span></p>
</blockquote>
<p>This is a very interesting initiative and it’s one I and a number of others have been talking about doing. They have an excellent advisory board and seem to have all the right partners/infrastructure lined up. </p>
<p>The obvious question to me is if you’re going to submit your study to this service and get it reproduced, why would you ever want to submit it to a journal? The level of review you’d get here is quite a bit more rigorous than you’d receive at a journal and the submission process essentially involves writing a paper without the Introduction and the Discussion (usually the hardest and most annoying parts). At the moment, it seems the service is set up to work in parallel with standard publication or perhaps after the fact. But I could see it eventually replacing standard publication altogether.</p>
<p>The timing, of course, could be an issue. It’s not clear how long one should expect it to take to reproduce a study. But it’s probably not much longer than a review you’d get at a statistics journal.</p>
Data Startups from Y Combinator Demo Day
2012-08-22T13:54:58+00:00
http://simplystats.github.io/2012/08/22/data-startups-from-y-combinator-demo-day
<p>Y Combinator, the tech startup incubator, had its 15th demo day. Here are some of the data/statistics-related highlights (thanks to TechCrunch for doing the hard work):</p>
<ul>
<li>
<p><a href="http://www.everyday.me/" target="_blank">EVERYDAY.ME</a> — A PRIVATE, ONLINE RECORD OF YOUR LIFE
This company seems to me like a meta-data company. It compiles your data from other sites.</p>
</li>
<li>
<p><a href="http://mthsense.com/" target="_blank">MTH SENSE</a>: IMPROVING MOBILE AD TARGETING
“Most [mobile] ads served are blind. Mth sense’s solution adds demographic data to ads through predictive modeling based on app and device usage. For example, if you have the Pinterest, and Vogue apps, you’re more likely to be a soccer mom.” Hmm, I guess I’d better delete those apps from my phone….</p>
</li>
<li>
<p><a href="http://www.survata.com/" target="_blank">SURVATA</a>: REPLACING PAYWALLS WITH SURVEYWALLS
Survata’s product replaces paywalls on premium content from online publishers with surveys that conduct market research.</p>
</li>
<li>
<p><a href="http://www.rent.io/" target="_blank">RENT.IO</a> — RENT PRICE PREDICTION
Rent.io says it wants to “optimize pricing of the single biggest recurring expense in lives of 100 million Americans.”</p>
</li>
<li>
<p><a href="http://www.bigcalc.com/" target="_blank">BIGCALC</a>: FAST NUMBER-CRUNCHING FOR MAKING FINANCIAL TRADING DECISIONS
BigCalc says its platform for financial modeling scales to enormous datasets, and purportedly does simulations that typically take 22 hours in 24 minutes.</p>
</li>
<li>
<p><a href="http://www.datanitro.com/" target="_blank">DATANITRO</a> — A BACKBONE FOR FINANCE-RELATED DATA
DataNitro’s founders have both worked in finance, and they say they know from experience that financial industry software is basically “held together with duct tape.” A big problem with the status quo is how data is exported from Excel.</p>
</li>
<li>
<p><a href="http://www.statwing.com/" target="_blank">STATWING</a>: EASY TO USE DATA ANALYSIS
Most existing data analysis tools (in particular SPSS) are built for statisticians. Statwing has created tools that make it easier for marketers and analysts to interact with data without dealing with arcane technical terminology. Those users only need a few core functions, Statwing says, so that’s what the company provides. With just a few clicks, users can get the graphs that they want. And the data is summarized in a single sentence of conversational English.</p>
</li>
</ul>
Harvard chooses statistician to lead Graduate School of Arts and Sciences
2012-08-22T12:29:40+00:00
http://simplystats.github.io/2012/08/22/harvard-chooses-statistician-to-lead-graduate-school-of
<p><a href="http://news.harvard.edu/gazette/story/2012/08/new-dean-for-gsas/">Harvard chooses statistician to lead Graduate School of Arts and Sciences</a></p>
NSF recognizes math and statistics are not the same thing...kind of
2012-08-21T15:19:38+00:00
http://simplystats.github.io/2012/08/21/nsf-recognizes-math-and-statistics-are-not-the-same
<p>There’s controversy brewing over at the National Science Foundation over names. Back in October 2011, Sastry Pantula, the Director of the Division of Mathematical Sciences at NSF (formerly the Chair of NC State Statistics Department and President of the ASA), proposed that the name of the Division be changed to the “Division of Mathematical and Statistical Sciences”. Excerpting from his <a href="http://imstat.org/pantulaletter10_6_11.pdf" target="_blank">original proposal</a>, Pantula says</p>
<blockquote>
<p>Extracting useful knowledge from the deluge of data is critical to the scientific successes of the future. Data-intensive research will drive many of the major scientific breakthroughs in the coming decades. There is a long-term need for research and workforce development in computational and data-enabled sciences. Statistics is broadly recognized as a data-centric discipline, thus having it in the Division’s name as proposed would be advantageous whenever “Big Data” and data-sciences investments are discussed internally and externally.</p>
</blockquote>
<p>This bureaucratic move by Pantula created quite a reaction. A sub-committee of the Math and Physical Sciences Advisory Committee (MPSAC) was formed to investigate the name change and to solicit feedback from the relevant communities. The sub-committee was chaired by Fred Roberts (Rutgers) and also included James Berger (Duke), Emery Brown (MIT), Kevin Corlette (U. of Chicago), Irene Fonseca (CMU), and Juan Meza (UC Merced). A number of organizations provided feedback to the sub-committee, including the American Statistical Association and the American Mathematical Society.</p>
<p>There was intense feedback both for and against the name change. Somewhat predictably, mathematicians were adamantly opposed to the name change and statisticians were for it. The <a href="http://nsf.gov/attachments/124926/public/DMS_Name_Change_Committee_Report_Final_4-1-12.pdf" target="_blank">final report of the sub-committee</a> is both interesting and enlightening for those not familiar with the arguments involved.</p>
<p>First a little background for people (like me) who are not familiar with NSF’s organizational structure. NSF has a number of Directorates, of which Mathematical and Physical Sciences (MPS) is one, and within MPS is the Division of Mathematical Sciences (DMS). DMS includes 11 program areas ranging from algebra and number theory to topology. Statistics is one of those program areas. </p>
<p>This should already give one pause. How exactly do statistics and topology end up in the same basket? I’m not exactly sure but I’m guessing it’s the result of bureaucratic inertia. Statistics came later and it had to be stuck somewhere. DMS is not the only place at NSF to get funding for statistics, but a <a href="http://nsf.gov/awardsearch/progSearch.do?SearchType=progSearch&page=2&QueryText=&ProgOrganization=&ProgOfficer=&ProgEleCode=1269&BooleanElement=false&ProgRefCode=&BooleanRef=false&ProgProgram=&ProgFoaCode=&Restriction=2&Search=Search" target="_blank">quick search through the currently active grants</a> shows that the vast majority of statistics-related grants go through DMS, with a smattering coming through other Divisions.</p>
<p>The primary issue here, and the only reason it’s an issue at all, is money. Statistics is one of 11 program areas in DMS, which, if funding were split evenly across program areas, means that it gets roughly 9% (1/11) of the funding allocated to DMS. This is worth noting—the entire field of statistics gets roughly as much funding as, say, topology. For example, one of the arguments against the name change in the sub-committee’s report is</p>
<blockquote>
<p>3). Statistics constitutes a small (although significant) proportion of the DMS portfolio in terms of number of programs, number of grant applications, number of grants funded.</p>
</blockquote>
<p>Well, yes, but I would argue that the reason for this is the historically low prioritization of statistics in the Division. This is a choice, not a fact. I believe statistics could play a much bigger role in the Division and perhaps within NSF more generally if there were an agreement on its importance. A key argument comes next, which is</p>
<blockquote>
<p>If the name change attracts more proposals to the Division from the statistics community, this could draw funding away from other subfields and it could also increase the workload of the Division’s program officers.</p>
</blockquote>
<p>Okay, so money’s important too, but let’s get to the main attraction, which comes in comment number 5:</p>
<blockquote>
<p>5). Statistics is funded throughout the federal government. The traditional funding of statistics by DMS is appropriate: fund fundamental research in statistics. Broadening the mission of DMS to include more applied statistics would not benefit the overall funding of the mathematical sciences.</p>
</blockquote>
<p>The first sentence is a fact: Many government agencies fund statistics research. For example, the National Institutes of Health funds many statisticians who develop and apply methods to problems in the health sciences. The EPA will occasionally fund statisticians to develop methods for environmental health applications.</p>
<p>But who is charged with funding the development and application of statistical methods to every other scientific field? The problem now is that you essentially have a group of NIH-funded (bio)statisticians doing biomedical research and a group of NSF-funded statisticians doing “fundamental” research in statistics (note that “fundamental” equals “mathematical” here). But that hardly represents all of the statisticians out there. So for the rest of the statisticians who are not doing biomedical research and are not doing “fundamental” research, where do they go for funding?</p>
<p>These days, statistics is “applied” to <em>everything</em>. NSF itself has acknowledged that we are in an era of big data—it’s clear that statistics will play a big role whether we call it “statistics” or not. If NSF decided to fund research into the application of statistics to all areas, it would likely overwhelm the funding of every other program area in DMS. This is why the “solution” is to resort to what is informally understood as the mission of NSF, which is to fund “fundamental” research.</p>
<p>But it’s not clear to me that NSF should limit itself in this manner. In particular, if NSF got serious about funding the application of statistics to all scientific areas (either through DMS or some other Division), it would incentivize statisticians to build stronger and closer collaborations with scientists all over. I see this as a win-win for everyone involved. </p>
<p>As a statistician, I’m willing to admit I’m biased, but I think NSF should play a much bigger role in advancing statistics as one of the critical tools of the future. Perhaps the solution is not to rename the Division, but to create a separate division for statistical sciences independent of mathematics, one of the suggestions in the sub-committee report. This separation would mirror what has occurred in many universities over the past 50 years or so with the creation of independent departments of statistics and biostatistics. </p>
<p>Ultimately, the name of the Division was not changed. Here’s the <a href="http://www.nsf.gov/attachments/124926/public/Response_MPSAC_Subcommittee_Report_on_Name_of_Division_of_Mathematical_Sciences_8-16-2012.pdf" target="_blank">release from last week</a>:</p>
<blockquote>
<p>NSF is committed to supporting the research necessary to maximize the benefits to be derived from the age of data, and to promoting and funding research related to data-centric scientific discovery and innovation, and in particular, the growing role of the statistical sciences in all research areas. <span>Recognizing both the complex composition of the various communities and the support of statistical sciences throughout NSF, and taking into account the various community views described in the very thoughtful report of the MPSAC, I have decided to maintain the name “Division of Mathematical Sciences (DMS)” within MPS, but to affirm strong commitment to the statistical sciences.</span></p>
<p>To demonstrate this commitment, (a) whenever appropriate, we will specifically mention “statistics” alongside “mathematics” in budget requests and in solicitations in order to recognize the unique and pervasive role of statistical sciences, and to ensure that relevant solicitations reach the statistical sciences community….</p>
</blockquote>
<p>Well, I feel better already. I suppose this is progress of some sort.</p>
Recommended updates from Google Scholar
2012-08-21T00:53:19+00:00
http://simplystats.github.io/2012/08/21/recommended-updates-from-google-scholar
<p><a href="http://googlescholar.blogspot.com/2012/08/scholar-updates-making-new-connections.html">Recommended updates from Google Scholar</a></p>
Interview with C. Titus Brown - Computational biologist and open access champion
2012-08-17T13:45:57+00:00
http://simplystats.github.io/2012/08/17/interview-with-c-titus-brown-computational-biologist
<div class="im">
<strong>C. Titus Brown </strong>
</div>
<div class="im">
<img height="300" src="http://biostat.jhsph.edu/~jleek/titus.jpg" width="300" />
</div>
<div class="im">
C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering at Michigan State University. He develops computational software for next generation sequencing and is the author of the blog <a href="http://ivory.idyll.org/blog/" target="_blank">“Living in an Ivory Basement”</a>. We talked to Titus about open access (he publishes his unfunded grants online!), improving the reputation of PLoS One, his research in computational software development, and work-life balance in academia.
</div>
<!-- more -->
<div class="im">
<strong>Do you consider yourself a statistician, data scientist, computer scientist, or something else?</strong>
</div>
<p><span>Good question. Short answer: apparently somewhere along the way I</span><br />
<span>became a biologist, but with a heavy dose of “computational scientist”</span><br />
<span>in there.</span></p>
<p><span>The longer answer? Well, it’s a really long answer…</span></p>
<p><span>My first research was on Avida, a bottom-up model for evolution that</span><br />
<span>Chris Adami, Charles Ofria and I wrote together at Caltech in 1993:</span><br />
<a href="http://en.wikipedia.org/wiki/Avida" target="_blank">http://en.wikipedia.org/wiki/Avida</a><span>. (Fun fact: Chris, Charles and I</span><br />
<span>are now all faculty at Michigan State! Chris and I have offices one</span><br />
<span>door apart, and Charles has an office one floor down.) Avida got me</span><br />
<span>very interested in biology, but not in the undergrad “memorize stuff”</span><br />
<span>biology — more in research. This was computational science: using</span><br />
<span>simple models to study biological phenomena.</span></p>
<p><span>While continuing evolution research, I did my undergrad in pure math at Reed</span><br />
<span>College, which was pretty intense; I worked in the Software Development</span><br />
<span>lab there, which connected me to a bunch of reasonably well known hackers</span><br />
<span>including Keith Packard, Mark Galassi, and Nelson Minar.</span></p>
<p><span>I also took a year off and worked on Earthshine:</span></p>
<p><a href="http://en.wikipedia.org/wiki/Planetshine#Earthshine" target="_blank">http://en.wikipedia.org/wiki/Planetshine#Earthshine</a></p>
<p><span>and then rebooted the project as an RA in 1997, the summer after</span><br />
<span>graduation. This was mostly data analysis, although it included a</span><br />
<span>fair amount of hanging off of telescopes adjusting things as the</span><br />
<span>freezing winter wind howled through the Big Bear Solar Observatory’s</span><br />
<span>observing room, aka “data acquisition”.</span></p>
<p><span>After Reed, I applied to a bunch of grad schools, including Princeton</span><br />
<span>and Caltech in bio, UW in Math, and UT Austin and Ohio State in</span><br />
<span>physics. I ended up at Caltech, where I switched over to</span><br />
<span>developmental biology and (eventually) regulatory genomics and genome</span><br />
<span>biology in Eric Davidson’s lab. My work there included quite a bit</span><br />
<span>of wet bench biology, which is not something many people associate with me,</span><br />
<span>but was nonetheless something I did!</span></p>
<p><span>Genomics was really starting to hit the fan in the early 2000s, and I</span><br />
<span>was appalled by how biologists were handling the data — as one</span><br />
<span>example, we had about $500k worth of sequences sitting on a shared</span><br />
<span>Windows server, with no metadata or anything — just the filenames.</span><br />
<span>As another example, I watched a postdoc manually BLAST a few thousand</span><br />
<span>ESTs against the NCBI nr database; he sat there and did them three by</span><br />
<span>three, having figured out that he could concatenate three sequences</span><br />
<span>together and then manually deconvolve the results. As probably the</span><br />
<span>most computationally experienced person in the lab, I quickly got</span><br />
<span>involved in data analysis and Web site stuff, and ended up writing</span><br />
<span>some comparative sequence analysis software that was mildly popular</span><br />
<span>for a while.</span></p>
<p><span>As part of the sequence analysis Web site I wrote, I became aware that</span><br />
<span>maintaining software was a <em>really</em> hard problem. So, towards the end</span><br />
<span>of my 9 year stint in grad school, I spent a few years getting into</span><br />
<span>testing, both Web testing and more generally automated software</span><br />
<span>testing. This led to perhaps my most used piece of software, twill, a</span><br />
<span>scripting language for Web testing. It also ended up being one of the</span><br />
<span>things that got me elected into the Python Software Foundation,</span><br />
<span>because I was doing everything in Python (which is a really great</span><br />
<span>language, incidentally).</span></p>
<p><span>I also did some microbial genome analysis (which led to my first</span><br />
<span>completely reproducible paper (</span><span class="il">Brown</span><span> and Callan, 2004;</span><br />
<a href="http://www.ncbi.nlm.nih.gov/pubmed/14983022" target="_blank">http://www.ncbi.nlm.nih.gov/pubmed/14983022</a><span>)) and collaborated with the</span><br />
<span>Orphan lab on some metagenomics:</span><br />
<a href="http://www.ncbi.nlm.nih.gov/pubmed?term=18467493" target="_blank">http://www.ncbi.nlm.nih.gov/pubmed?term=18467493</a><span>. This led to a</span><br />
<span>fascination with the biological “dark matter” in nature that is the</span><br />
<span>subject of some of my current work on metagenomics.</span></p>
<p><span>I landed my faculty position at MSU right out of grad school, because</span><br />
<span>bioinformatics is sexy and CS departments are OK with hiring grad</span><br />
<span>students as faculty. However, I deferred for two years to do a</span><br />
<span>postdoc in Marianne Bronner-Fraser’s lab because I wanted to switch to</span><br />
<span>the chick as a model organism, and so I ended up arriving at MSU in</span><br />
<span>2009. I had planned to focus a lot on developmental gene regulatory</span><br />
<span>networks, but 2009 was when Illumina sequencing hit, and as one of the</span><br />
<span>few people around who wasn’t visibly frightened by the term “gigabyte”</span><br />
<span>I got inextricably involved in a lot of different sequence analysis</span><br />
<span>projects. These all converged on assembly, and, well, that seems to</span><br />
<span>be what I work on now :).</span></p>
<p><span>The two strongest threads that run through my research are these:</span></p>
<p><span>1. “better science through superior software” — so much of science</span><br />
<span>depends on computational inference these days, and so little of the</span><br />
<span>underlying software is “good”. Scientists <em>really</em> suck at software</span><br />
<span>development (for both good and bad reasons) and I worry that a lot of</span><br />
<span>our current science is on a really shaky foundation. This is one</span><br />
<span>reason I’m invested in Software Carpentry</span><br />
<span>(</span><a href="http://software-carpentry.org/" target="_blank">http://software-carpentry.org</a><span>), a training program that Greg Wilson</span><br />
<span>has been developing — he and I agree that science is our best hope</span><br />
<span>for a positive future, and good software skills are going to be</span><br />
<span>essential for a lot of that science. More generally I hope to turn</span><br />
<span>good software development into a competitive advantage for my lab</span><br />
<span>and my students.</span></p>
<p><span>2. “better hypothesis generation is needed” — biologists, in</span><br />
<span>particular, tend to leap towards the first testable hypothesis they</span><br />
<span>find. This is a cultural thing stemming (I think) from a lot of</span><br />
<span>really bad interactions with theory: the way physicists and</span><br />
<span>mathematicians think about the world simply doesn’t fit with the Rube</span><br />
<span>Goldberg-esque features of biology (see</span><br />
<a href="http://ivory.idyll.org/blog/is-discovery-science-really-bogus.html" target="_blank">http://ivory.idyll.org/blog/is-discovery-science-really-bogus.html</a><span>).</span></p>
<p><span>So getting back to the question, uh, yeah, I think I’m a computational</span><br />
<span>scientist who is working on biology? And if I need to write a little</span><br />
<span>(or a lot) of software to solve my problems, I’ll do that, and I’ll</span><br />
<span>try to do it with some attention to good software development</span><br />
<span>practice — not just out of ethical concern for correctness, but</span><br />
<span>because it makes our research move faster.</span></p>
<p><span>One thing I’m definitely <em>not</em> is a statistician. I have friends who</span><br />
<span>are statisticians, though, and they seem like perfectly nice people.</span></p>
<div class="im">
<strong>You have a pretty radical approach to open access. Can you tell us a little bit about that?</strong>
</div>
<p><span>Ever since Mark Galassi introduced me to open source, I thought it</span><br />
<span>made sense. So I’ve been an open source-nik since … 1988?</span></p>
<p><span>From there it’s just a short step to thinking that open science makes</span><br />
<span>a lot of sense, too. When you’re a grad student or a postdoc, you</span><br />
<span>don’t get to make those decisions, though; it took until I was a PI</span><br />
<span>for me to start thinking about how to do it. I’m still conflicted</span><br />
<span>about <em>how</em> open to be, but I’ve come to the conclusion that posting</span><br />
<span>preprints is obvious</span><br />
<span>(</span><a href="http://ivory.idyll.org/blog/blog-practicing-open-science.html" target="_blank">http://ivory.idyll.org/blog/blog-practicing-open-science.html</a><span>).</span></p>
<p><span>The “radical” aspect that you’re referring to is probably my posting</span><br />
<span>of grants (</span><a href="http://ivory.idyll.org/blog/grants-posted.html" target="_blank">http://ivory.idyll.org/blog/grants-posted.html</a><span>). There are</span><br />
<span>two reasons I ended up posting all of my single-PI grants. Both have</span><br />
<span>their genesis in this past summer, when I spent about 5 months writing</span><br />
<span>6 different grants — 4 of which were written entirely by me. Ugh.</span></p>
<p><span>First, I was really miserable one day and joked on Twitter that “all</span><br />
<span>this grant writing is really cutting into my blogging” — a mocking</span><br />
<span>reference to the fact that grant writing (to get $$) is considered</span><br />
<span>academically worthwhile, while blogging (which communicates with the</span><br />
<span>public and is objectively quite valuable) counts for naught with my</span><br />
<span>employer. Jonathan Eisen responded by suggesting that I post all of</span><br />
<span>the grants and I thought, what a great idea!</span></p>
<p><span>Second, I’m sure it’s escaped most people (hah!), but grant funding</span><br />
<span>rates are in the toilet — I spent all summer writing grants while</span><br />
<span>expecting most of them to be rejected. That’s just flat-out</span><br />
<span>depressing! So it behooves me to figure out how to make them serve</span><br />
<span>multiple duties. One way to do that is to attract collaborators;</span><br />
<span>another is to serve as google bait for my lab; a third is to provide</span><br />
<span>my grad students with well-laid-out PhD projects. A fourth duty they</span><br />
<span>serve (and I swear this was unintentional) is to point out to people</span><br />
<span>that this is MY turf and I’m already solving these problems, so maybe</span><br />
<span>they should go play in less occupied territory. I know, very passive</span><br />
<span>aggressive…</span></p>
<p><span>So I posted the grants, and unknowingly joined a really awesome cadre</span><br />
<span>of folk who had already done the same</span><br />
<span>(</span><a href="http://jabberwocky.weecology.org/2012/08/10/a-list-of-publicly-available-grant-proposals-in-the-biological-sciences/" target="_blank">http://jabberwocky.weecology.org/2012/08/10/a-list-of-publicly-available-grant-proposals-in-the-biological-sciences/</a><span>).</span><br />
<span>Most feedback I’ve gotten has been from grad students and undergrads</span><br />
<span>who really appreciate the chance to look at grants; some people told</span><br />
<span>me that they’d been refused the chance to look at grants from their</span><br />
<span>own PIs!</span></p>
<p><span>At the end of the day, I’d be lucky to be relevant enough that people</span><br />
<span>want to steal my grants or my software (which, by the way, is under a</span><br />
<span>BSD license — free for the taking, no “theft” required…). My</span><br />
<span>observation over the years is that most people will do just about</span><br />
<span>anything to avoid using other people’s software.</span></p>
<div class="im">
<strong>In theoretical statistics, there is a tradition of publishing pre-prints while papers are submitted. Why do you think biology is lagging behind?</strong>
</div>
<p><span>I wish I knew! There’s clearly a tradition of secrecy in biology;</span><br />
<span>just look at the Cold Spring Harbor rules re tweeting and blogging</span><br />
<span>(</span><a href="http://meetings.cshl.edu/report.html" target="_blank">http://meetings.cshl.edu/report.html</a><span>) - this is a conference, for</span><br />
<span>chrissakes, where you go to present and communicate! I think it’s</span><br />
<span>self-destructive and leads to an insider culture where only those who</span><br />
<span>attend meetings and chat informally get to be members of the club,</span><br />
<span>which frankly slows down research. Given the societal and medical</span><br />
<span>challenges we face, this seems like a really bad way to continue doing</span><br />
<span>research.</span></p>
<p><span>One of the things I’m proudest of is our effort on the cephalopod</span><br />
<span>genome consortium’s white paper,</span><br />
<a href="http://ivory.idyll.org/blog/cephseq-cephalopod-genomics.html" target="_blank">http://ivory.idyll.org/blog/cephseq-cephalopod-genomics.html</a><span>, where a</span><br />
<span>group of bioinformaticians at the meeting pushed really hard to walk</span><br />
<span>the line between secrecy and openness. I came away from that effort</span><br />
<span>thinking two things: first, that biologists were erring on the side of</span><br />
<span>risk aversion; and second, that genome database folk were smoking</span><br />
<span>crack when they pushed for complete openness of data. (I have a blog</span><br />
<span>post on that last statement coming up at some point.)</span></p>
<p><span>The bottom line is that the incentives in academic biology are aligned</span><br />
<span>against openness. In particular, you are often rewarded for the first</span><br />
<span>observation, not for the most useful one; if your data is used to do</span><br />
<span>cool stuff, you don’t get much if any credit; and it’s all about</span><br />
<span>first/last authorship and who is PI on the grants. All too often this</span><br />
<span>means that people sit on their data endlessly.</span></p>
<p><span>This is getting particularly bad with next-gen data sets, because</span><br />
<span>anyone can generate them but most people have no idea how to analyze</span><br />
<span>their data, and so they just sit on it forever…</span></p>
<div class="im">
<strong>Do you think the ArXiv model will catch on in biology or just within the bioinformatics community?</strong>
</div>
<p><span>One of my favorite quotes is: “Making predictions is hard, especially</span><br />
<span>when they’re about the future.” I attribute it to Niels Bohr.</span></p>
<p><span>It’ll take a bunch of big, important scientists to lead the way. We</span><br />
<span>need key members of each subcommunity of biology to decide to do it on</span><br />
<span>a regular basis. (At this point I will take the obligatory cheap shot</span><br />
<span>and point out that Jonathan Eisen, noted open access fan, doesn’t post</span><br />
<span>his stuff to preprint servers very often. What’s up with that?) It’s</span><br />
<span>going to be a long road.</span></p>
<div class="im">
<strong>What is the reaction you most commonly get when you tell people you have posted your un-funded grants online?</strong>
</div>
<p><span>“Ohmigod what if someone steals them?”</span></p>
<p><span>Nobody has come up with a really convincing model for why posting</span><br />
<span>grants is a bad thing. They’re just worried that it <em>might</em> be. I</span><br />
<span>get the vague concerns about theft, but I have a hard time figuring</span><br />
<span>out exactly how it would work out well for the thief — reputation is</span><br />
<span>a big deal in science, and gossip would inevitably happen. And at</span><br />
<span>least in bioinformatics I’m aiming to be well enough known that</span><br />
<span>straight up ripping me off would be suicidal. Plus, if reviewers</span><br />
<span>do/did google searches on key concepts then my grants would pop up,</span><br />
<span>right? I just don’t see it being a path to fame and glory for anyone.</span></p>
<p><span>Revisiting the passive-aggressive nature of my grant posting, I’d like</span><br />
<span>to point out that most of my grants depend on preliminary results from</span><br />
<span>our own algorithms. So even if they want to compete on my turf, it’ll</span><br />
<span>be on a foundation I laid. I’m fine with that — more citations for</span><br />
<span>me, either way :).</span></p>
<p><span>More optimistically, I really hope that people read my grants and then</span><br />
<span>find new (and better!) ways of solving the problems posed in them. My</span><br />
<span>goal is to enable better science, not to hunker down in a tenured job</span><br />
<span>and engage in irrelevant science; if someone else can use my grants as</span><br />
<span>a positive or negative signpost to make progress, then broadly</span><br />
<span>speaking, my job is done.</span></p>
<p><span>Or, to look at it another way: I don’t have a good model for either</span><br />
<span>the possible risks OR the possible rewards of posting the grants, and</span><br />
<span>my inclinations are towards openness, so I thought I’d see what</span><br />
<span>happens.</span></p>
<div class="im">
<strong>How can junior researchers correct misunderstandings about open access/journals like PLoS One that separate correctness from impact? Do you have any concrete ideas for changing minds of senior folks who aren’t convinced?</strong>
</div>
<p><span>Render them irrelevant by becoming senior researchers who supplant them</span><br />
<span>when they retire. It’s the academic tradition, after all! And it’s</span><br />
<span>really the only way within the current academic system, which — for</span><br />
<span>better or for worse — isn’t going anywhere.</span></p>
<p><span>Honestly, we need fewer people yammering on about open access and more</span><br />
<span>people simply doing awesome science and submitting it to OA journals.</span><br />
<span>Conveniently, many of the high impact journals are shooting themselves</span><br />
<span>in the foot and encouraging this by rejecting good science that then</span><br />
<span>ends up in an OA journal; that wonderful ecology oped on PLoS One</span><br />
<span>citation rates shows this well</span><br />
<span>(</span><a href="http://library.queensu.ca/ojs/index.php/IEE/article/view/4351" target="_blank">http://library.queensu.ca/ojs/index.php/IEE/article/view/4351</a><span>).</span></p>
<div class="im">
<strong>Do you have any advice on what computing skills/courses statistics students interested in next generation sequencing should take?</strong>
</div>
<p><span>For courses, no — in my opinion 80% of what any good researcher</span><br />
<span>learns is self-motivated and often self-taught, and so it’s almost</span><br />
<span>silly to pretend that any particular course or set of skills is</span><br />
<span>sufficient or even useful enough to warrant a whole course. I’m not a</span><br />
<span>big fan of our current undergrad educational system :).</span></p>
<p><span>For skills? You need critical thinking coupled with an awareness that</span><br />
<span>a lot of smart people have worked in science, and odds are that there</span><br />
<span>are useful tricks and approaches that you can use. So talk to other</span><br />
<span>people, a lot! My lab has a mix of biologists, computer scientists,</span><br />
<span>graph theorists, bioinformaticians, and physicists; more labs should</span><br />
<span>be like that.</span></p>
<p><span>Good programming skills are going to serve you well no matter what, of</span><br />
<span>course. But I know plenty of good programmers who aren’t very</span><br />
<span>knowledgeable about biology, and who run into problems doing actual</span><br />
<span>science. So it’s not a panacea.</span></p>
<p><strong><span>How does replicable or reproducible research fit into your interests?</span></strong></p>
<p><span>I’ve wasted <em>so much time</em> reproducing other people’s work that when</span><br />
<span>the opportunity came up to put down a marker, I took it.</span></p>
<p><a href="http://ivory.idyll.org/blog/replication-i.html" target="_blank">http://ivory.idyll.org/blog/replication-i.html</a></p>
<p><span>The digital normalization paper shouldn’t have been particularly</span><br />
<span>radical; that it is tells you all you need to know about replication</span><br />
<span>in computational biology.</span></p>
<p><span>This is actually something I first did a long time ago, with what was</span><br />
<span>perhaps my favorite pre-faculty-job paper: if you look at the methods</span><br />
<span>for </span><span class="il">Brown</span><span> &amp; Callan (2004) you’ll find a downloadable package that</span><br />
<span>contains all of the source code for the paper itself and the analysis</span><br />
<span>scripts. But back then I didn’t blog :).</span></p>
<p><span>Lack of reproducibility and openness in methods has serious</span><br />
<span>consequences — how much of cancer research has been useless, for</span><br />
<span>example? (See </span><a href="http://online.wsj.com/article/SB10001424052970203764804577059841672541590.html" target="_blank">this horrific report</a><span>.)</span><br />
<span>Again, the incentives are all wrong: you get grant money for</span><br />
<span>publishing, not for being useful. The two are not necessarily the</span><br />
<span>same…</span></p>
<p><strong><span>Do you have a family, and how do you balance work life and home life?</span></strong></p>
<p><span>Why, thank you for asking! I do have a family — my wife, Tracy Teal,</span><br />
<span>is a bioinformatician and microbial ecologist, and we have two</span><br />
<span>wonderful daughters, Amarie (4) and Jessie (1). It’s not easy being a</span><br />
<span>junior professor and a parent at the same time, and I keep on trying</span><br />
<span>to figure out how to balance the needs of travel with the need to be a</span><br />
<span>parent (hint: I’m not good at it). I’m increasingly leaning towards</span><br />
<span>blogging as being a good way to have an impact while being around</span><br />
<span>more; we’ll see how that goes.</span></p>
Statistics/statisticians need better marketing
2012-08-14T14:02:33+00:00
http://simplystats.github.io/2012/08/14/statistics-statisticians-need-better-marketing
<p>Statisticians have not always been great self-promoters. I think in part this comes from our tendency to be <a href="http://simplystatistics.org/post/25643791866/statistics-and-the-science-club" target="_blank">arbiters</a> rather than being involved in the scientific process. In some ways, I think this is a good thing. Self-promotion can quickly become really annoying. On the other hand, I think our advertising shortcomings are hurting our field in a number of different ways. </p>
<p>Here are a few:</p>
<ol>
<li>As Rafa <a href="http://simplystatistics.org/post/12241459446/we-need-better-marketing" target="_blank">points out</a>, even though statisticians are ridiculously employable right now, statistics M.S. and Ph.D. programs seem to be flying under the radar in all the hype about data/data science (<a href="http://biostat.jhsph.edu/" target="_blank">here</a> is an awesome one if you are looking). Computer Science and Engineering, and even the social sciences, are cornering the market on “big data”. This potentially huge and influential source of students may pass us by if we don’t advertise better. </li>
<li>A corollary to this is lack of funding. When the Big Data event happened at the White House with all the major funders in attendance to announce $200 million in new funding for big data, <a href="http://www.nsf.gov/news/news_videos.jsp?cntn_id=123607&media_id=72174&org=NSF" target="_blank">none of the invited panelists</a> were statisticians. </li>
<li>Our top awards don’t get the press that top awards in other fields do. The Nobel Prize announcements are an international event. There is always speculation about, and intense interest in, who will win. There is similar interest around the <a href="http://en.wikipedia.org/wiki/Fields_Medal" target="_blank">Fields medal</a> in mathematics. But the top award in statistics, the <a href="http://www.imstat.org/awards/copss_recipients.htm" target="_blank">COPSS award</a>, doesn’t get nearly the attention it should. Part of the reason is lack of funding (the Fields is $15k, the COPSS is $1k). But part of the reason is that we, as statisticians, don’t announce it, share it, speculate about it, tell our friends about it, etc. The prestige of these awards can have a big impact on the visibility of a field. </li>
<li> A major component of visibility of a scientific discipline, for better or worse, is the popular press. The <a href="http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?_r=1&smid=tu-share" target="_blank">most recent article</a> in a long list of articles at the New York Times about the data revolution does not mention statistics/statisticians. Neither do the other articles. We need to cultivate relationships with the media. </li>
</ol>
<p>We are all busy solving real/hard scientific and statistical problems, so we don’t have a lot of time to devote to publicity. But here are a couple of easy ways we could rapidly increase the visibility of our field, ordered roughly by the degree of time commitment. </p>
<ol>
<li>All statisticians should have <a href="http://simplystatistics.org/post/15348632030/why-all-academics-should-have-professional-twitter" target="_blank">Twitter accounts</a> and we should share/discuss our work and ideas online. The more we help each other share, the more visibility our ideas will get. </li>
<li>We should make sure we let the ASA know about cool things that are happening with data/statistics in our organizations and they should spread the word through <a href="https://twitter.com/amstatnews" target="_blank">their Twitter account</a> and other social media. </li>
<li>We should start a conversation about who we think will win the next COPSS award in advance of the next JSM and try to get local media outlets to pick up our ideas and talk about the award. </li>
<li><a href="http://simplystatistics.org/post/20902656344/statistics-is-not-math" target="_blank">We should be more “big tent”</a> about statistics. ASA President Robert Rodriguez <a href="http://www.amstat.org/news/pdfs/RodriguezSpeech8_13_12.pdf" target="_blank">nailed this</a> in his speech at JSM. Whenever someone does something with data, we should claim them as a statistician. Sometimes this will lead to claiming people we don’t necessarily agree with. But the big tent approach is what is allowing CS and other disciplines to overtake us in the data era. </li>
<li>We should consider setting up a place for statisticians to donate money to build up the award fund for the COPSS/other statistics prizes. </li>
<li>We should try to forge relationships with start-up companies and encourage our students to pursue industry/start-up opportunities if they have interest. The less we are insular within the academic community, the more high-profile we will be. </li>
<li>It would be awesome if we started a statistical literacy outreach program in communities around the U.S. We could offer free courses in community centers to teach people how to understand polling data/the census/weather reports/anything touching data. </li>
</ol>
<p>Those are just a few of my ideas, but I have a ton more. I’m sure other people do too and I’d love to hear them. Let’s raise the tide and lift all of our boats!</p>
Johns Hopkins University Professor Louis Named to Lead Census Bureau Research Directorate
2012-08-13T22:48:13+00:00
http://simplystats.github.io/2012/08/13/johns-hopkins-university-professor-louis-named-to-lead
<p><a href="http://www.census.gov/newsroom/releases/archives/directors_corner/cb12-150.html">Johns Hopkins University Professor Louis Named to Lead Census Bureau Research Directorate</a></p>
Big-Data Investing Gets Its Own Supergroup
2012-08-13T19:48:24+00:00
http://simplystats.github.io/2012/08/13/big-data-investing-gets-its-own-supergroup
<p><a href="http://bits.blogs.nytimes.com/2012/08/12/big-data-investing-gets-its-own-supergroup/?smid=tu-share">Big-Data Investing Gets Its Own Supergroup</a></p>
Sunday data/statistics link roundup (8/12/12)
2012-08-12T19:59:35+00:00
http://simplystats.github.io/2012/08/12/sunday-data-statistics-link-roundup-8-12-12
<ol>
<li>An interesting blog post about the <a href="http://caseybergman.wordpress.com/2012/07/31/top-n-reasons-to-do-a-ph-d-or-post-doc-in-bioinformaticscomputational-biology/" target="_blank">top N reasons</a> to do a Ph.D. in bioinformatics or computational biology. A couple of things that I find interesting and could actually be said of any program in biostatistics as well are: computing is the key skill of the 21st century and computational skills are highly transferrable. Via Andrew J. </li>
<li>Here is an interesting <a href="http://blog.noupsi.de/post/28896819324/why-are-americans-so" target="_blank">auto-complete map</a> of the United States where the prompt was, “Why is [state] so”. It seems like using the Google auto-complete functions can lead to all sorts of humorous data; <a href="http://xkcd.com/715/" target="_blank">xkcd</a> has used it as a data source a couple of times in the past. By the way, the person(s) who think Idaho is boring haven’t been to the right parts of Idaho. (via Rafa). </li>
<li>One of my all-time favorite statistics quotes appears <a href="http://mobile.nytimes.com/2012/08/03/opinion/brooks-the-credit-illusion.xml" target="_blank">in this column</a> by David Brooks: “…<span>what God hath woven together, even multiple regression analysis cannot tear asunder.” It seems like the perfect quote for any study that attempts to build a predictive model for a complicated phenomenon where only limited knowledge of the underlying mechanisms is available. </span></li>
<li><span>I’ve been reading up a lot on how to summarize and communicate risk. At the moment, I’ve been following a lot of David Spiegelhalter’s stuff, and really liked this <a href="http://plus.maths.org/content/understanding-uncertainty-2845-ways-spinning-risk-0" target="_blank">30,000 foot view summary</a>.</span></li>
<li><span>It is interesting how often you see R popping up in random places these days. Here is a <a href="http://www.businessinsider.com/what-actually-predicts-the-stock-market-2012-8" target="_blank">blog post</a> with some clearly R-created plots that appeared on Business Insider about predicting the stock market. </span></li>
<li><span>Roger and I had a post on MOOCs this week from the perspective of faculty teaching the courses. For a more departmental/administrative level view, be sure to re-read Rafa’s post on the <a href="http://simplystatistics.org/post/10764298034/the-future-of-graduate-education" target="_blank">future of graduate education</a>. </span></li>
</ol>
How Big Data Became So Big
2012-08-12T13:52:02+00:00
http://simplystats.github.io/2012/08/12/how-big-data-became-so-big
<p><a href="http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?smid=tu-share">How Big Data Became So Big</a></p>
When dealing with poop, it's best to just get your hands dirty
2012-08-11T13:21:47+00:00
http://simplystats.github.io/2012/08/11/when-dealing-with-poop-its-best-to-just-get-your
<p>I’m a relatively new dad. Before the kid we affectionately call the “tiny tornado” (TT) came into my life, I had relatively little experience dealing with babies and all the fluids they emit. So admittedly, I was a little squeamish dealing with the poopy explosions the TT would create. Inevitably, things would get much more messy than they had to be while I was being too delicate with the issue. It took me an embarrassingly long time for an educated man, but I finally realized you just have to get in there and change the thing even if it is messy, then wash your hands after. It comes off. </p>
<p>It is a similar situation in my professional life, but I’m having a harder time learning the lesson. There are frequently things that I’m not really excited to do: review a lot of papers, go to long meetings, revise a draft of that paper that has just been sitting around forever. Inevitably, once I get going they usually aren’t as difficult or as arduous as I thought. Even better, once they are done I feel a huge sense of accomplishment and relief. I used to have a metaphor for this: I’d tell myself, “Jeff, just rip off the band-aid”. Now, I think “Jeff, just get your hands dirty”. </p>
Why we are teaching massive open online courses (MOOCs) in R/statistics for Coursera
2012-08-10T14:49:18+00:00
http://simplystats.github.io/2012/08/10/why-we-are-teaching-massive-open-online-courses-moocs
<p class="MsoNormal">
<em>Editor’s Note: This post was written by Roger Peng and Jeff Leek. </em>
</p>
<p class="MsoNormal">
A couple of weeks ago, we announced that we would be teaching free courses in <a href="https://www.coursera.org/course/compdata" target="_blank">Computing for Data Analysis</a> and <a href="https://www.coursera.org/course/dataanalysis" target="_blank">Data Analysis</a> on the Coursera platform. At the same time, a number of other universities also announced partnerships with Coursera leading to a <a href="https://www.coursera.org/courses" target="_blank">large number of new offerings</a>. That, coupled with a <a href="http://gigaom.com/2012/07/17/coursera-adds-first-international-university-partners-raises-additional-6m/" target="_blank">new round of funding</a> for Coursera, led to press coverage in the <a href="http://www.nytimes.com/2012/07/17/education/consortium-of-colleges-takes-online-education-to-new-level.html?pagewanted=all" target="_blank">New York Times</a>, the <a href="http://www.theatlantic.com/business/archive/2012/07/the-single-most-important-experiment-in-higher-education/259953/" target="_blank">Atlantic</a>, and other media outlets.
</p>
<p class="MsoNormal">
There was an ensuing explosion of blog posts and commentaries from academics. The opinions ranged from <a href="http://www.forbes.com/sites/susanadams/2012/07/17/is-coursera-the-beginning-of-the-end-for-traditional-higher-education/" target="_blank">dramatic</a>, to <a href="http://chronicle.com/blogs/innovations/going-public-the-uva-way/33623" target="_blank">negative</a>, to <a href="http://www.nytimes.com/2012/07/20/opinion/the-trouble-with-online-education.html?_r=2&smid=fb-share" target="_blank">critical</a>, to um…<a href="http://blogs.swarthmore.edu/burke/2012/07/20/listen-up-you-primitive-screwheads/" target="_blank">hilariously angry</a>. Rafa posted a few days ago that many of the folks freaking out are <a href="http://simplystatistics.org/post/28053129018/online-education-many-academics-are-missing-the-point" target="_blank">missing the point</a> - the opportunity to reach a much broader audience of folks with our course content.
</p>
<p class="MsoNormal">
[Before continuing, we’d like to make clear that at this point no money has been exchanged between Coursera and Johns Hopkins. Coursera has not given us anything and Johns Hopkins hasn’t given them anything. For now, it’s just a mutually beneficial partnership — we get their platform and they get to use our content. In the future, Coursera will need to figure out a way to make money, and they are currently considering a number of options.]
</p>
<p class="MsoNormal">
Now that the initial wave of hype has died down, we thought we’d outline why we are excited about participating in Coursera. We think it is only fair to start by saying this is definitely an experiment. Coursera is a newish startup and as such is still figuring out its plan/business model. Similarly, our involvement so far has been a little whirlwind and we haven’t actually taught courses yet, and we are happy to collect data and see how things turn out. So ask us again in 6 months when we are both done teaching.
</p>
<p class="MsoNormal">
But for now, this is why we are excited.
</p>
<ol>
<li><strong>Open Access.</strong> As Rafa alluded to in his post, this is an opportunity to reach a broad and diverse audience. As academics devoted to open science, we also think that opening up our courses to the biggest possible audience is, in principle, a good thing. That is why we are both basing our courses on free software and teaching the courses for free to anyone with an internet connection. </li>
<li><strong>Excitement about statistics.</strong> The data revolution means that there is a really intense interest in statistics right now. It’s so exciting that <a href="http://simplystatistics.org/post/16170052064/interview-with-joe-blitzstein" target="_blank">Joe Blitzstein’s</a> stat class on iTunes U has been one of the top courses on that platform. Our local superstar John McGready has also put his <a href="http://simplystatistics.org/post/27046976568/statistical-reasoning-on-itunes-u" target="_blank">statistical reasoning course</a> up on iTunes U to a similar explosion of interest. Rafa recently put his <a href="http://www.youtube.com/user/rafalabchannel?feature=results_main" target="_blank">statistics for genomics</a> lectures up on YouTube and they have already been viewed thousands of times. As people who are super pumped about the power and importance of statistics, we want to get in on the game. </li>
<li><strong>We work hard to develop good materials.</strong> We put effort into building materials that our students will find useful. We want to maximize the impact of these efforts. We have over 30,000 students enrolled in our two courses so far. </li>
<li><strong>It is an exciting experiment.</strong> Online teaching, including very very good online teaching, has been around for a long time. But the model of free courses at incredibly large scale is actually really new. Whether you think it is a gimmick or something here to stay, it is exciting to be part of the first experimental efforts to build courses at scale. Of course, this could flame out. We don’t know, but that is the fun of any new experiment. </li>
<li><strong>Good advertising.</strong> Every professor at a research school is a start-up of one. This idea deserves its own blog post. But if you accept that premise, to keep the operation going you need good advertising. One way to do that is writing good research papers, another is having awesome students, a third is giving talks at statistical and scientific conferences. This is an amazing new opportunity to showcase the cool things that we are doing. </li>
<li><strong>Coursera built some cool toys.</strong> As statisticians, we love new types of data. It’s like candy. Coursera has all sorts of cool toys for collecting data about dropout rates, participation, discussion board answers, peer review of assignments, etc. We are pretty psyched to take these out for a spin and see how we can use them to improve our teaching.</li>
<li><strong>Innovation is going to happen in education.</strong> The music industry spent years fighting a losing battle over music sharing. Mostly, this damaged their reputation and stopped them from developing new technology like iTunes/Spotify that became hugely influential/profitable. Education has been done the same way for hundreds (or thousands) of years. As new educational technologies develop, we’d rather be on the front lines figuring out the best new model than fighting to hold on to the old model. </li>
</ol>
<p>Finally, we’d like to say a word about why we think in-person education isn’t really threatened by MOOCs, at least for our courses. If you take one of our courses through Coursera you will get to see the lectures and do a few assignments. We will interact with students through message boards, videos, and tutorials. But there are only 2 of us and 30,000 people registered. So you won’t get much one-on-one interaction. On the other hand, if you come to the top <a href="http://www.biostat.jhsph.edu/" target="_blank">Ph.D. program in biostatistics</a> and take Data Analysis, you will now get 16 weeks of one-on-one interaction with Jeff in a classroom, working on tons of problems together. In other words, putting our lectures online now means at Johns Hopkins you get the most qualified TA you have ever had. Your professor. </p>
A non-exhaustive list of things I have failed to accomplish
2012-08-09T19:07:37+00:00
http://simplystats.github.io/2012/08/09/a-non-exhaustive-list-of-things-i-have-failed-to
<p>A few years ago I stumbled across a blog post that described a person’s complete cv. The idea was that the cv listed both the things they had accomplished and the things they had failed to accomplish. At the time, it really helped me to see that to be successful you have to be willing to fail over and over. </p>
<p>I use <a href="http://biostat.jhsph.edu/~jleek/" target="_blank">my website</a> to show the things I have accomplished career-wise. But I have also failed to achieve a lot of the things I set out to do. The reason was that there was strong competition for the awards/positions I was up for and other deserving people got them. </p>
<ol>
<li>Applied to MIT undergrad in 1999 - rejected</li>
<li>Donovan J. Thompson Award 2001 - did not receive</li>
<li>Applied for <a href="http://www.act.org/goldwater/" target="_blank">Barry Goldwater scholarship</a> 2002 - rejected</li>
<li>Applied for NSF Pre-Doctoral Fellowship 2003 - rejected</li>
<li>Applied for graduate school in math at MIT 2003 - rejected</li>
<li>One of my first 3 papers rejected at PLoS Biology 2005</li>
<li>Many subsequent rejections of papers - too many to list exhaustively but here is <a href="http://simplystatistics.org/post/26977029850/my-worst-recent-experience-with-peer-review" target="_blank">one example</a></li>
<li>Applied for <a href="http://www.amstat.org/committees/commdetails.cfm?txtComm=CCRAWD04" target="_blank">Youden Award</a> 2010 - rejected</li>
<li>Applied for Microsoft Faculty Fellowship 2012 - rejected</li>
<li>Applied for Sloan Fellowship 2012 - rejected</li>
<li>Many grants have been rejected, again too many to list exhaustively </li>
</ol>
On the relative importance of mathematical abstraction in graduate statistical education
2012-08-08T15:40:15+00:00
http://simplystats.github.io/2012/08/08/on-the-relative-importance-of-mathematical-abstraction
<p><em>Editor’s Note: This is the counterpoint in our series of posts on the value of abstraction in graduate education. See Brian’s <a href="http://simplystatistics.org/post/28840726358/in-which-brian-debates-abstraction-with-t-bone" target="_blank">defense of abstraction</a> on Monday and the comments on his post, as well as the comments on our <a href="http://simplystatistics.org/post/28125455811/how-important-is-abstract-thinking-for-graduate" target="_blank">original teaser post</a> for more. See below for a full description of the T-bone inside joke*.</em></p>
<p>Brian did a good job of defining abstraction. In a cagey debater’s move, he provided an incredibly broad definition of abstraction that includes the reason we call a <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":-)" class="wp-smiley" style="height: 1em; max-height: 1em;" /> a smiley face, the reason why we can apply least squares to a variety of data types, and the reason we write functions when programming. At this very broad level, it is clear that abstract thinking is necessary for graduate students or any other data professional.</p>
<p>But our debate was inspired by a discussion of whether measure-theoretic probability was a key component of our graduate program. There was some agreement that for many biostatistics Ph.D. students, this exact topic may not be necessary for their research or careers. Brian suggested that measure-theoretic probability was a surrogate marker for something more important - abstract thinking and the ability to generalize ideas. This is a very specific form of generalization and abstraction that is used most commonly by statisticians: the ability that permits one to prove theorems and develop statistical models that can be applied to a variety of data types. I will therefore refocus the debate on the original topic. I have three main points:</p>
<ol>
<li><span>There is an overemphasis in statistical graduate programs on abstraction defined as the ability to prove mathematical theorems and develop general statistical methods. </span></li>
<li><span>It is possible to create incredible statistical value without developing generalizable statistical methods</span></li>
<li><span>While abstraction as defined generally is good, overemphasis on this specific type of abstraction limits our ability to include computing and real data analysis in our curriculum. It also takes away from the most important learning experience of graduate school: performing independent research.</span></li>
</ol>
<p><strong id="internal-source-marker_0.49558418267406523"><span>There is an overemphasis in statistical graduate programs on abstraction defined as the ability to prove mathematical theorems and develop general statistical methods. </span></strong></p>
<p>
At a top program, you can expect to take courses in very theoretical statistics, measure-theoretic probability, and an applied (or methods) sequence. The first two courses are exclusively mathematical. The third (at the programs I have visited, graduated from, or taught in), despite its name, is generally focused on the mathematical details underlying statistical methods. The result is that most Ph.D. students are heavily trained in the mathematical theory behind statistics.
</p>
<p>
At the same time, there is a long list of skills necessary to develop a successful Ph.D. statistician. These include creativity in applications, statistical programming skills, grit to power through the <a href="http://simplystatistics.org/post/23928890537/schlep-blindness-in-statistics" target="_blank">boring/hard parts of research</a>, interpretation of statistical results on real data, ability to identify the most important scientific problems, and a deep understanding of the scientific problems you are working on. Abstraction is on that list, but it is just one of many skills on that list. Graduate education is a zero-sum game over a finite period of time. Our strong focus on mathematical abstraction means there is less time for everything else.
</p>
<p>
Any hard quantitative course will measure the ability of a student to abstract in the general sense Brian defined. One of these courses would be very useful for our students. But it is not clear that we should focus on mathematical abstraction to the exclusion of other important characteristics of graduate students.
</p>
<p><strong id="internal-source-marker_0.49558418267406523"><span>It is possible to create incredible statistical value without developing generalizable statistical methods</span></strong></p>
<p>
A major standard for success in academia is the ability to generate solutions to problems that are widely read, cited, and used. A graduate student who produces these types of solutions is likely to have a high-impact and well-respected career. In general, it is not necessary to be able to prove theorems, understand measure theory, or develop generalizable statistical models to have this type of success.
</p>
<p>
One example is one of the co-authors of our blog, best known for his work in genomics. In this field, data are noisy and full of systematic errors, and for several technologies, he invented methods to correct them. For example, he developed the <a href="http://www.ncbi.nlm.nih.gov/pubmed/12925520" target="_blank">most popular method</a> for making measurements from different experiments comparable, for removing the dependence of measurements on <a href="http://amstat.tandfonline.com/doi/abs/10.1198/016214504000000683?journalCode=uasa20" target="_blank">the letters in a gene</a>, and for <a href="http://www.nature.com/nmeth/journal/v4/n11/abs/nmeth1102.html" target="_blank">reducing variability</a> due to operators who run the machine or the ozone levels. Each of these discoveries involved: (1) deep understanding of the specific technology used, (2) good intuition about which signals were due to biology and which were due to technology, (3) application/development of specific, somewhat ad-hoc, statistical procedures to correct the mistakes, and (4) the development and distribution of good software. His work has been hugely influential on genomics, has been cited thousands of times, and has substantially improved the quality of both biological and statistical results.
</p>
<p>
But the work did not result in knowledge that was generalizable to other areas of application; it deals with problems that are highly specialized to genomics. If these were his only contributions (they are not), he’d be a hugely successful Ph.D. statistician. But had he focused on general solutions he would never have solved the problems at hand, since the problems were highly specific to a single application. And this is just one example I know well because I work in the area. <a href="http://www.ncbi.nlm.nih.gov/pubmed/2593165" target="_blank">There</a> <a href="http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html" target="_blank">are</a> <a href="http://biostatistics.oxfordjournals.org/content/8/1/118.abstract" target="_blank">a</a> <a href="http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.1001093" target="_blank">ton</a> <a href="http://www.ped.fas.harvard.edu/people/faculty/publications_nowak/MichelScience2011.pdf" target="_blank">more</a> <a href="http://www.sysecol2.ethz.ch/Refs/EntClim/K/Ku076.pdf" target="_blank">just</a> <a href="http://www.nature.com/nature/journal/v458/n7242/full/nature08017.html" target="_blank">like</a> <a href="http://ilpubs.stanford.edu:8090/422/" target="_blank">it</a>.
</p>
<p><strong id="internal-source-marker_0.49558418267406523"><span>While abstraction as defined generally is good, overemphasis on a specific type of abstraction limits our ability to include computing and real data analysis in our curriculum. It also takes away from the most important learning experience of graduate school: performing independent research.</span></strong></p>
<p>
One could argue that the choice of statistical techniques during data analysis is abstraction, or that one needs to abstract to develop efficient software. But the ability to abstract needed for these tasks can be measured by a wide range of classes, not just measure-theoretic probability. Some of these classes might teach practically applicable skills like writing fast and efficient algorithms. Many results of high statistical value do not require mathematical proofs, abstract inductive reasoning, or asymptotic theory. It is a good idea to have some people who can abstract away the science behind statistical methods to the core mathematical philosophy. But our current curriculum is too heavily weighted in this direction. In some cases, statisticians are even being left behind because they do not have sufficient time in their curriculum to develop the computational skills and amass the necessary subject matter knowledge needed to compete with the increasingly diverse set of engineers, computer scientists, data scientists, and computational biologists tackling the same scientific problems.
</p>
<p>
We need to reserve a larger portion of graduate education for diving deeply into specific scientific problems, even if it means students spend less time developing generalizable/abstract statistical ideas. <br /><br /><em>* Inside joke explanation: Two years ago at JSM I ran a footrace with <a href="http://www.biostat.jhsph.edu/~jgoldsmi/" target="_blank">this guy</a> for the rights to the name “Jeff” in the department of Biostatistics at Hopkins for the rest of 2011. Unfortunately, we did not pro-rate for age and he nipped me by about a half-yard. True to my word, I went by Tullis (my middle name) for a few months, including on the <a href="http://biostat.jhsph.edu/~jleek/jsm-2011-title-slide.pdf" target="_blank">title slide</a> of my JSM talk. This was, of course, immediately subjected to all sorts of nicknaming, and B-Caffo loves to use “T-bone”. I apologize on behalf of those that brought it up.</em>
</p>
My worst nightmare...
2012-08-07T14:44:31+00:00
http://simplystats.github.io/2012/08/07/my-worst-nightmare
<p>I don’t know if you have <a href="http://www.emptyage.com/post/28679875595/yes-i-was-hacked-hard" target="_blank">seen this</a> about a person whose iCloud account was hacked. But man does it freak me out. As a person who relies pretty heavily on cloud-based storage devices and does some cloud-computing based research as well, this is a pretty freaky scenario. Time to go back up everything again…</p>
In which Brian debates abstraction with T-Bone
2012-08-06T16:09:00+00:00
http://simplystats.github.io/2012/08/06/in-which-brian-debates-abstraction-with-t-bone
<p><em>Editor’s Note: This is the first in a set of point-counterpoint posts related to the value of abstract thinking in graduate education that <a href="http://simplystatistics.org/post/28125455811/how-important-is-abstract-thinking-for-graduate" target="_blank">we teased</a> a few days ago. <a href="http://www.bcaffo.com/" target="_blank">Brian Caffo</a>, recently installed Graduate Program Director at the <a href="http://www.biostat.jhsph.edu/" target="_blank">best Biostat department in the country</a>, has kindly agreed to lead off with the case for abstraction. We’ll follow up later in the week with my counterpoint. In the meantime, there have already been a number of really interesting and insightful comments inspired by our teaser post that are well worth reading. See the comments <a href="http://simplystatistics.org/post/28125455811/how-important-is-abstract-thinking-for-graduate" target="_blank">here</a>. </em></p>
<p>The impetus for writing this blog post came out of a particularly heady lunchroom discussion on the role of measure-theoretic probability in our curriculum. We have a very mathematically rigorous program at Hopkins Biostatistics that includes a full academic year of measure-theoretic probability. Similar to elsewhere, many faculty dispute the necessity of this course. I am in favor of it. My principal reason is that I believe it is useful for building up and evaluating a student’s abilities in abstraction and generalization.</p>
<p>In our discussion, abstraction was the real point of contention. Emphasizing abstraction versus more immediately practical tools is an age-old argument of ivory tower stereotypes (the philosopher archetype) versus equally stereotypical scientific pragmatists (the engineering archetype).</p>
<p>So, let’s begin picking this scab. For your sake and mine, I’ll try to be brief.</p>
<p>
<span>My definitions:</span>
</p>
<p>
<strong id="internal-source-marker_0.6420917874202132"><span>Abstraction</span><span> -</span></strong> reducing a technique, idea or concept to its essence or core.
</p>
<p>
<strong id="internal-source-marker_0.6420917874202132"><span>Generalization</span><span> - </span><span> </span></strong>extending a technique, idea or concept to areas for which it was not originally intended.
</p>
<p>
<strong id="internal-source-marker_0.6420917874202132"><span>PhD</span><span> - </span></strong>a post-baccalaureate degree that requires substantial new contributions to knowledge.
</p>
<p>
<span>The term “substantial new contributions” in my definition of a PhD is admittedly fuzzy. To tie it down, examples that I think do create new knowledge in the field of statistics include: </span>
</p>
<ol>
<li>
<span>applying existing techniques to data where they have not been used before (generalization of the application of the techniques),</span>
</li>
<li>
<span>developing statistical software (abstraction of statistical and mathematical thoughts into code),</span>
</li>
<li>
<span>developing new statistical methods from existing ones (generalization), </span>
</li>
<li>
<span>proving new theory (both abstraction and generalization) and</span>
</li>
<li>
<span>creating new data analysis pipelines (both abstraction and generalization). </span>
</li>
</ol>
<p>
In every one of these examples, generalization or abstraction is what differentiates it from a purely technical accomplishment.
</p>
<p>
To give a contrary activity, consider statistical technical specialization. That is, the application of an existing method to data where the method is already known to be effective and no new statistical thought is required. Regardless of how necessary, difficult or important applying that method is, such activity does not constitute the creation of new statistical knowledge, even if it is a <a href="http://simplystatistics.org/post/23928890537/schlep-blindness-in-statistics" target="_blank">necessary schlep</a> in the creation of new knowledge of another sort. <br /><br />Though many graduate-level statistics activities require substantial technical specialization, to be doctoral statistical research in a way that satisfies my definition, generalization and abstraction are necessary components.
</p>
<p>
I further contend that abstraction is a key tool for obtaining meaningful generalization. A method, theory, analysis, et cetera cannot be retooled to non-intended use without stripping away some of its specialization and abstracting it to its core utility.
</p>
<p>
Abstraction is constantly necessary when applying statistical methods. It is at work, for example, whenever a statistician says “Method A really was designed for a different kind of data than mine. But at its core it’s really useful for finding out B, which I need to know. So I’ll use it anyway until (if ever) I come up with something better.”
</p>
<p>
As examples: A = CLT, B = distribution for normalized means; A = principal components, B = directions of variation; A = bootstrap, B = sampling distributions; A = linear models, B = mean relationships with covariates.
</p>
<p>
Abstraction and generalization facilitate learning new areas. Knowledge of the abstract core of a discipline makes that knowledge much more portable. This is seen across every discipline. Musicians who know music theory can use their knowledge for any instrument; computer scientists who understand data structures and algorithms can switch languages easily; electrical engineers who understand signal processing can switch between technologies easily. Abstraction is what allows them to see past the concrete (instrument, syntax, technology) to the essence (music, algorithm, signal).
</p>
<p>
And statisticians learn statistical and probability theory. However, in statistics, abstraction is not represented only by mathematics and theory. As pointed out by <strong>the absolutely unimpeachable source, Simply Statistics</strong>, <a href="http://simplystatistics.org/post/24060354412/why-no-one-reads-the-statistics-literature-anymore" target="_blank">software is exactly an abstraction</a>.
</p>
<blockquote>
<p>
<span>I think abstraction is important and we need to continue publishing those kinds of ideas. However, I think there is one key point that the statistics community has had difficulty grasping, which is that </span><strong>software represents an important form of abstraction</strong><span>, if not the most important form …</span>
</p>
</blockquote>
<p>
<strong id="internal-source-marker_0.6420917874202132"><span>(A QED is in order, I believe.)</span></strong>
</p>
Samuel Kou wins COPSS Award
2012-08-06T13:55:00+00:00
http://simplystats.github.io/2012/08/06/samuel-kou-wins-copss-award
<p>At JSM this year we learned that <a href="http://www.people.fas.harvard.edu/~skou/" target="_blank">Samuel Kou</a> of Harvard’s Department of Statistics won the Committee of Presidents of Statistical Societies (<a href="http://nisla05.niss.org/copss/?q=copss" target="_blank">COPSS</a>) Presidents’ Award. The award is given annually to</p>
<blockquote>
<p><span>a young member of the statistical community in recognition of an outstanding contribution to the profession of statistics. </span><span>The recipient of the Presidents’ Award must be a member of at least one of the participating societies. The candidate may be chosen for a single contribution of extraordinary merit, or an outstanding aggregate of contributions, to the profession of statistics.</span><span> </span></p>
</blockquote>
<p>Samuel’s work spans a wide range of areas from biophysics to MCMC to model selection with contributions in the top journals in statistics and elsewhere. He is also a member of a highly selective group of people who have been promoted to full Professor at Harvard’s Department of Statistics. (Bonus points to those who can name the last person to achieve such a distinction.)</p>
<p>This is a well-deserved honor to an exemplary member of our field.</p>
NYC and Columbia to Create Institute for Data Sciences & Engineering
2012-07-31T17:57:01+00:00
http://simplystats.github.io/2012/07/31/nyc-and-columbia-to-create-institute-for-data-sciences
<p><a href="http://mikebloomberg.com/index.cfm?objectid=D867EFB0-C29C-7CA2-F4B1FEBC8B06249D">NYC and Columbia to Create Institute for Data Sciences & Engineering</a></p>
If I were at #JSM2012 today, here's where I'd go.
2012-07-31T13:58:01+00:00
http://simplystats.github.io/2012/07/31/if-i-were-at-jsm2012-today-heres-where-id-go
<p>Obviously, there are tons of sessions every day at <a href="http://amstat.org/meetings/jsm/2012/index.cfm" target="_blank">JSM</a> this week and it’s physically impossible to go to everything that looks interesting. Alas, I am but one man, so choices had to be made. Here’s what looks good to me from the <a href="http://amstat.org/meetings/jsm/2012/onlineprogram/" target="_blank">JSM program</a>:</p>
<ul>
<li>8:30-10:20am: <strong>Contemporary Software Design Strategies for Statistical Methodologists</strong>, HQ-Sapphire B</li>
<li>10:30am-12:20pm: <strong>Stat-Us Update from Facebook</strong>, HQ-Sapphire EF</li>
<li>2:00-3:50pm: <strong>Astrostatistics</strong>, CC-Room 29A or <strong>Results from the 2010 Census Experimental Program</strong>, CC-Room 30A (perhaps you can run back and forth?)</li>
</ul>
<p>Lots of other good stuff out there, of course. I wouldn’t mind hearing some feedback on how these go.</p>
Why I'm Staying in Academia
2012-07-30T13:45:00+00:00
http://simplystats.github.io/2012/07/30/why-im-staying-in-academia
<p>Recently, I’ve seen a few blog posts/articles about professors leaving academia for industry or some other non-academic position. By my last count I think I’ve seen three from computer science professors leaving academia for Google, the most recent one being from <a href="http://cs.unm.edu/~terran/academic_blog/?p=113" target="_blank">Terran Lane</a> at University of New Mexico. At this point, Google should just start a recruiting office in the middle of all the CS departments around the country. I think they’d get some good people.</p>
<p>Each of the “farewell” blog posts covers many of the same points (difficulty with having an impact, increasing specialization of academic research, difficult funding climate, increasing workloads) and, frankly, all of this is true to varying degrees. <a href="http://www.cc.gatech.edu/~beki/Beki.html" target="_blank">Beki Grinter</a> has already written a pretty good <a href="http://beki70.wordpress.com/2012/07/26/on-not-leaving-academia/" target="_blank">response</a>. One topic, massive open online courses (MOOCs), is something on which I’ll comment at a later date. For now, I thought I would add a few of my thoughts.</p>
<ul>
<li><strong>There’s no perfect job</strong>. Many of the problems affecting academia—difficult funding, increased workloads—are affecting other industries too. Right now we’re in the worst economic recession in decades and money is tight everywhere. I find it difficult to imagine that there’s a job out there that doesn’t suffer from some form of economic or other constraint. Academia needs to find some solutions, for sure, but times are tough everywhere unfortunately.</li>
<li><strong>This is about as close as it gets to the perfect job</strong>. Really, it’s a pretty good gig. Every day I come into work and I sit down and work on whatever I want. I’m surrounded by fantastic students and postdocs and when I walk the halls I can talk to great people who are smarter than I (even if they don’t necessarily appreciate me barging in). But that said, it’s not an easy job. The reality is that every professor is like a 1-person startup company, and you need to work pretty hard to stay afloat. (Okay, I’ve never worked at a startup, but I imagine they work pretty hard there.) They don’t tell you that in grad school but, then again, there’s a lot they don’t tell you in grad school.</li>
<li><strong>It helps to work at a medical institution where tenure is meaningless</strong>. Okay, I’m being a bit facetious here…but not really. Much of academic anxiety comes from the need to “get tenure”. At most medical institutions, while tenure exists, having it is fairly meaningless (getting it, of course, is still very tough). The reason is that most medical researchers are funded on soft money, so somewhere between 60% and 100% of their salary is paid from grants. Whether this is a good way or a terrible way to do things is worth discussing at a later date, but the end result is if you can’t fund your salary, getting tenure isn’t going to magically come up with the missing dollars. Universities can’t afford it using the current model. So while tenure is a tremendous privilege and honor and will secure your position at the University, it can’t secure your salary. In the end, what I really need to be focusing on is doing the best research. There’s really no “game” to play here.</li>
<li><strong>The best way to have an impact is to do it</strong>. Every University is different, for sure, and some put many more constraints on their professors than others. I consider myself lucky to be working at an institution that has substantial resources and is in relatively good financial condition. So in the end, if I want to have an impact on statistics or science, I just need to decide to do it. If one day someone comes to me and says “stop what you’re doing, you need to be doing something else”, then I might need to reconsider things. But until that day comes, I’m staying put. It might turn out I’m not good enough to have an impact, but we can’t all be above average.</li>
</ul>
<p>Ultimately, I don’t want the many grad students out there who may be considering a career in academia to feel discouraged by what they might be reading on the Internets these days. There’s good and bad with every job, but I think with academia the balance is fairly positive, and you get to hang out with <a href="http://www.biostat.jhsph.edu/~jleek/" target="_blank">cool</a> <a href="http://rafalab.jhsph.edu/" target="_blank">people</a>. </p>
<p>Of course, if you’re in computer science, you should just go to Google like everyone else.</p>
Statistician (@cocteau) to show journalists how it's done
2012-07-29T17:57:33+00:00
http://simplystats.github.io/2012/07/29/statistician-cocteau-to-show-journalists-how-its
<p>Mark Hansen, a Professor at UCLA’s Departments of Statistics and Media Arts, has been appointed as the inaugural Director of the <a href="http://www.journalism.columbia.edu/news/609" target="_blank">David and Helen Gurley Brown Institute for Media Innovation</a>. The Institute is a joint venture between Columbia University’s Graduate School of Journalism and Stanford’s School of Engineering.</p>
<blockquote>
<p><span>The Institute and the collaboration between the two schools is groundbreaking in that it is designed to encourage and support new endeavors with the potential to inform and entertain in transformative ways. It will recognize the increasingly important connection between journalism and technology, bringing the best from the East and West Coasts.</span></p>
</blockquote>
<p><span>Congratulations to Mark for this fantastic opportunity!</span></p>
In Sliding Internet Stocks, Some Hear Echo of 2000
2012-07-29T13:36:01+00:00
http://simplystats.github.io/2012/07/29/in-sliding-internet-stocks-some-hear-echo-of-2000
<p><a href="http://www.nytimes.com/2012/07/28/technology/as-social-sites-shares-fall-some-hear-echo-of-2000.html?smid=tu-share">In Sliding Internet Stocks, Some Hear Echo of 2000</a></p>
Tweet up #JSM2012
2012-07-28T23:15:43+00:00
http://simplystats.github.io/2012/07/28/tweet-up-jsm2012
<p>If only because I won’t be there this year and I need to know what’s going on! Where’s the action?</p>
Predictive analytics might not have predicted the Aurora shooter
2012-07-28T17:49:13+00:00
http://simplystats.github.io/2012/07/28/predictive-analytics-might-not-have-predicted-the
<p><a href="http://blogs.computerworld.com/business-intelligenceanalytics/20749/could-data-mining-stop-mass-murderers">Predictive analytics might not have predicted the Aurora shooter</a></p>
When Picking a C.E.O. Is More Random Than Wise
2012-07-28T13:46:01+00:00
http://simplystats.github.io/2012/07/28/when-picking-a-c-e-o-is-more-random-than-wise
<p><a href="http://dealbook.nytimes.com/2012/07/24/when-picking-a-c-e-o-is-more-random-than-wise/?smid=tu-share">When Picking a C.E.O. Is More Random Than Wise</a></p>
Congress to Examine Data Sellers
2012-07-27T17:59:02+00:00
http://simplystats.github.io/2012/07/27/congress-to-examine-data-sellers
<p><a href="http://www.nytimes.com/2012/07/25/technology/congress-opens-inquiry-into-data-brokers.html?smid=tu-share">Congress to Examine Data Sellers</a></p>
How important is abstract thinking for graduate students in statistics?
2012-07-27T13:57:01+00:00
http://simplystats.github.io/2012/07/27/how-important-is-abstract-thinking-for-graduate
<p>A recent lunchtime discussion here at Hopkins brought up the somewhat-controversial topic of abstract thinking in our graduate program. We, like a lot of other biostatistics/statistics programs, require our students to take measure theoretic probability as part of the curriculum. The discussion started as a conversation about whether we should require measure theoretic probability for our students. It evolved into a discussion of the value of abstract thinking (and whether measure theoretic probability was a good tool to measure abstract thinking).</p>
<p><a href="http://www.bcaffo.com/" target="_blank">Brian Caffo</a> and I decided an interesting idea would be a point-counterpoint with the prompt, “How important is abstract thinking for the education of statistics graduate students?” Next week Brian and I will provide a point-counterpoint response based on our discussion.</p>
<p>In the meantime we’d love to hear your opinions!</p>
Smartphones, Big Data Help Fix Boston's Potholes
2012-07-26T17:57:32+00:00
http://simplystats.github.io/2012/07/26/smartphones-big-data-help-fix-bostons-potholes
<p><a href="http://www.informationweek.com/news/software/info_management/240004303">Smartphones, Big Data Help Fix Boston’s Potholes</a></p>
Online education: many academics are missing the point
2012-07-26T13:45:00+00:00
http://simplystats.github.io/2012/07/26/online-education-many-academics-are-missing-the-point
<p>Many academics are complaining about online education and warning us about how it can lead to a lower quality product. For example, the New York Times recently published <a href="http://www.nytimes.com/2012/07/20/opinion/the-trouble-with-online-education.html?_r=1&smid=fb-share" target="_blank">this</a> op-ed piece wondering if “online education [will] ever be education of the very best sort?”. Although pretty much every controlled experiment comparing online and in-class education finds that students learn just about the same under both approaches, I do agree that in-person lectures are more enjoyable for both faculty and students. But who cares? My enjoyment and the enjoyment of the 30 privileged students that physically sit in my classes seem negligible compared to the potential of reaching and educating thousands of students all over the world. Also, using recorded lectures will free up time that I can spend on one-on-one interactions with tuition-paying students. But what most excites me about online education is the possibility of being part of the movement that redefines existing disciplines as the number of people learning grows by orders of magnitude. How many <a href="http://en.wikipedia.org/wiki/Srinivasa_Ramanujan" target="_blank">Ramanujan</a>s are out there eager to learn Statistics? I would love it if they learned it from me.</p>
Voters Say They Are Wary of Ads Made Just for Them
2012-07-26T13:19:35+00:00
http://simplystats.github.io/2012/07/26/voters-say-they-are-wary-of-ads-made-just-for-them
<p><a href="http://www.nytimes.com/2012/07/24/business/media/survey-shows-voters-are-wary-of-tailored-political-ads.html?smid=tu-share">Voters Say They Are Wary of Ads Made Just for Them</a></p>
Buy your own analytics startup for $15,000 (at least as of now)
2012-07-25T17:55:50+00:00
http://simplystats.github.io/2012/07/25/buy-your-own-analytics-startup-for-15-000-at-least-as
<p><a href="http://techcrunch.com/2012/07/23/pinterest-analytics-site-pinreach-puts-itself-up-for-sale-as-co-founder-joins-google/">Buy your own analytics startup for $15,000 (at least as of now)</a></p>
Really Big Objects Coming to R
2012-07-25T13:56:55+00:00
http://simplystats.github.io/2012/07/25/really-big-objects-coming-to-r
<p>I noticed in the development version of R the following note in the NEWS file:</p>
<blockquote>
<p>There is a subtle change in behaviour for numeric index values 2^31 and larger. These used never to be legitimate and so were treated as NA, sometimes with a warning. They are now legal for long vectors so there is no longer a warning, and x[2^31] <- y will now extend the vector on a 64-bit platform and give an error on a 32-bit one.</p>
</blockquote>
<p>This is significant news indeed!</p>
<p>Some background: In the old days, when most of us worked on 32-bit machines, objects in R were limited to about 4GB in size (and practically a lot less) because memory addresses were indexed using 32-bit numbers. When 64-bit machines became more common in the early 2000s, that limit was removed. Objects could theoretically take up more memory because of the dramatically larger address space. For the most part, this turned out to be true, although there were some growing pains as R was transitioned to be runnable on 64-bit systems (I remember many of those pains).</p>
<p>However, even with the 64-bit systems, there was a key limitation, which is that vectors, one of the fundamental objects in R, could only have a maximum of 2^31-1 elements, or roughly 2.1 billion elements. This was because array indices in R were stored internally as signed integers (specifically as ‘R_len_t’), which are 32 bits on most modern systems (take a look at .Machine$integer.max in R).</p>
<p>You might think that 2.1 billion elements is a lot, and for a single vector it still is. But you have to consider the fact that internally R stores all arrays, no matter how many dimensions there are, as just long vectors. So that would limit you, for example, to a square matrix that was no bigger than roughly 46,000 by 46,000. That might have seemed like a large matrix back in 2000 but it seems downright quaint now. And if you had a 3-way array, the limit gets even smaller.</p>
<p>Now it appears that change is a comin’. The details can be found in the R source starting at revision 59005 if you follow on subversion. </p>
<p>A new type called ‘R_xlen_t’ has been introduced with a maximum value of 4,503,599,627,370,496, which is 2^52. As they say where I grew up, that’s a lot of McNuggets. So if your computer has enough physical memory, you will soon be able to index vectors (and matrices) that are significantly longer than before.</p>
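<p>As a quick sanity check on the numbers above, here is a small sketch in Python (not R, and not taken from the R sources). The constant and function names are made up for illustration; the only inputs are the two limits quoted in the post, 2^31-1 and 2^52.</p>

```python
# Limits quoted in the post (the names here are illustrative, not R identifiers).
OLD_MAX_LENGTH = 2**31 - 1   # signed 32-bit index limit
NEW_MAX_LENGTH = 2**52       # maximum quoted for the new index type


def largest_side(max_elements, n_dims):
    """Largest integer s such that an array with n_dims equal sides of
    length s has at most max_elements elements (s**n_dims <= max_elements)."""
    side = int(round(max_elements ** (1.0 / n_dims)))
    # Floating-point roots can be off by one, so correct with exact integers.
    while side ** n_dims > max_elements:
        side -= 1
    while (side + 1) ** n_dims <= max_elements:
        side += 1
    return side


print(OLD_MAX_LENGTH)                    # 2147483647, roughly 2.1 billion
print(largest_side(OLD_MAX_LENGTH, 2))   # 46340, the "roughly 46,000" above
print(largest_side(OLD_MAX_LENGTH, 3))   # 1290 per side for a 3-way array
print(NEW_MAX_LENGTH)                    # 4503599627370496
print(largest_side(NEW_MAX_LENGTH, 2))   # 67108864
```

<p>Under the old limit a cube-shaped 3-way array tops out at about 1,290 per side, which is the "even smaller" limit mentioned above. Whether you could actually allocate the larger objects still depends on physical memory, as the post says.</p>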
A Contest for Sequencing Genomes Has Its First Entry in Ion Torrent
2012-07-24T17:59:01+00:00
http://simplystats.github.io/2012/07/24/a-contest-for-sequencing-genomes-has-its-first-entry-in
<p><a href="http://bits.blogs.nytimes.com/2012/07/23/cheaper-computer-power-leading-to-sequencing-genome/?smid=tu-share">A Contest for Sequencing Genomes Has Its First Entry in Ion Torrent</a></p>
Proof by example and letters of recommendation
2012-07-24T13:58:05+00:00
http://simplystats.github.io/2012/07/24/proof-by-example-and-letters-of-recommendation
<p>In math or statistics, proof by example does not work. One example of a phenomenon does not prove anything. For example, the fact that 2 is prime doesn’t mean that all even numbers are prime. In fact, no even numbers other than 2 are prime.</p>
<p>But in other areas proof by example is the best way to demonstrate something. One example is writing letters of recommendation. It is way more convincing when I get one example of something a person has achieved:</p>
<blockquote>
<p>Kyle created the first R package that can be used to analyze terabytes of sequencing data in under an hour.</p>
</blockquote>
<p>Than something much more general but with no details:</p>
<blockquote>
<p>Bryan is an excellent programmer with a mastery of six different programming languages. </p>
</blockquote>
<p>In mathematics it makes sense why proof by example does not work. There is a concrete result and even one example violating that result means it isn’t true. On the other hand, if most of the time Kyle crushes his work, but every once in a while he has an off day and doesn’t get it done, I can live with that. That’s true of a lot of applied statistical methods too. If it works 99% of the time and 1% of the time fails but you can discover how it failed, that is still a pretty good statistical method…</p>
I.B.M. Is No Longer a Tech Bellwether (It's too busy doing statistics)
2012-07-24T02:29:50+00:00
http://simplystats.github.io/2012/07/24/i-b-m-is-no-longer-a-tech-bellwether-its-too-busy
<p><a href="http://bits.blogs.nytimes.com/2012/07/23/ibm-no-longer-a-tech-bellwether/?smid=tu-share">I.B.M. Is No Longer a Tech Bellwether (It’s too busy doing statistics)</a></p>
Facebook's Real Big Data Problem
2012-07-23T18:00:38+00:00
http://simplystats.github.io/2012/07/23/facebooks-real-big-data-problem
<p>Facebook’s first quarterly earnings report as a public company is coming out this Thursday and <a href="http://www.nytimes.com/2012/07/23/technology/facebook-advertising-efforts-face-a-day-of-judgment.html" target="_blank">everyone’s wondering what will be in it</a>. One question is whether advertisers are going to Facebook over other sites like Google.</p>
<blockquote>
<p><span>“Advertisers need more proof that actual advertising on Facebook offers a return on investment,” said Debra Aho Williamson, an analyst with </span>the market research firm eMarketer<span>. “There is such disagreement over whether Facebook is the next big thing on the Internet or whether it’s going to fail miserably.”</span></p>
<p><span>Facebook’s unique asset is the pile of personal data it collects from 900 million users. But using that data to serve up effective, profitable advertisements is a daunting task. Google has been in the advertising game longer and has roughly $40 billion in annual revenue from advertising — 10 times that of Facebook. Since the public offering, Wall Street has tempered its expectations for Facebook’s advertising revenue, and shares closed Friday at $28.76, down from their initial price of $38.</span></p>
</blockquote>
<p>There’s a pretty fundamental question here: Does it work?</p>
<p>With all the data Facebook has at its fingertips, it would be a shame if they couldn’t answer that question.</p>
Medalball: Moneyball for the olympics
2012-07-23T16:10:52+00:00
http://simplystats.github.io/2012/07/23/medalball-moneyball-for-the-olympics
<p><a href="http://www.nytimes.com/2012/07/22/sports/olympics/how-much-for-an-olympic-medal.html?_r=1&ref=magazine">Medalball: Moneyball for the olympics</a></p>
We used, you know, that statistics thingy
2012-07-23T13:59:16+00:00
http://simplystats.github.io/2012/07/23/we-used-you-know-that-statistics-thingy
<p><a href="http://nsaunders.wordpress.com/2012/07/23/we-really-dont-care-what-statistical-method-you-used/">We used, you know, that statistics thingy</a></p>
Sunday Data/Statistics Link Roundup (7/22/12)
2012-07-22T14:24:00+00:00
http://simplystats.github.io/2012/07/22/sunday-data-statistics-link-roundup-7-22-12
<ol>
<li><a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2114571" target="_blank">This paper</a> describes how Uri Simonsohn identified academic misconduct using statistical analyses. This approach has received a <a href="http://news.sciencemag.org/scienceinsider/2012/06/fraud-detection-method-called-cr.html" target="_blank">huge</a> <a href="http://www.nature.com/news/the-data-detective-1.10937" target="_blank">amount</a> of <a href="http://www.crimeandconsequences.com/crimblog/2012/07/using-statistics-to-detect-sci.html" target="_blank">press</a> in the scientific literature. The basic approach is that he calculates the standard deviations of mean/standard deviation estimates across groups being compared. Then he simulates from a Normal distribution and shows that under the Normal model, it is unlikely that the means/standard deviations are so similar. I think the idea is clever, but I wonder if the Normal model is the best choice here…could the estimates be similar because it was the same experimenter, etc.? I suppose the proof is in the pudding, though: several of the papers he identifies have been retracted.</li>
<li>This is an <a href="http://blogs.swarthmore.edu/burke/2012/07/20/listen-up-you-primitive-screwheads/" target="_blank">amazing rant</a> by a history professor at Swarthmore over the development of massive online courses, like the ones <a href="http://simplystatistics.org/post/27405330688/free-statistics-courses-on-coursera" target="_blank">Roger, Brian and I</a> are teaching. I think he makes some important points (especially about how we could do the same thing with open access in a heartbeat if universities/academics threw serious muscle behind it), but I have to say, I’m personally very psyched to be involved in teaching one of these big classes. I think that statistics is a field that a lot of people would like to learn something about and I’d like to make it easier for them to do that because I love statistics. I also see the strong advantage of in-person education. The folks who enroll at Hopkins and take our courses will obviously get way more one-on-one interaction, which is clearly valuable. I don’t see why it has to be one or the other…</li>
<li>An <a href="http://www.forbes.com/sites/davefeinleib/2012/07/16/6-insights-from-facebooks-former-head-of-big-data/" target="_blank">interesting discussion</a> with Facebook’s former head of big data. I think the first point is key. A lot of the “big data” hype has just had to do with the infrastructure needed to deal with all the data we are collecting. The bigger issue (and where statisticians will lead) is figuring out what to do with the data. </li>
<li>This is a <a href="http://techcrunch.com/2012/07/14/how-authoritarianism-will-lead-to-the-rise-of-the-data-smuggler/" target="_blank">great post</a> about data smuggling. The two key points that I think are raised are: (1) how, when the data get big enough, they have their own mass and aren’t going to be moved, and (2) how physically mailing hard drives is still the fastest way of transferring big data sets. That is certainly true in genomics where it is called “sneaker net” when a collaborator walks a hard drive over to our office. Hopefully putting data in physical terms will drive home the point that the new scientists are folks that deal with/manipulate/analyze data.</li>
<li>Not statistics related, but here is a <a href="http://www.hhmi.org/biointeractive/evolution/kingsley.html" target="_blank">high bar</a> to hold your work to: the bus-crash test. If you died in a bus crash tomorrow, would your discipline notice? Yikes. Via C.T. Brown.</li>
</ol>
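<p>To make the simulation idea in the first item concrete, here is a deliberately simplified sketch in Python. It is not the procedure from Simonsohn's paper: the function name, the pooled-SD choice, and the example numbers are all assumptions made for illustration. It asks how often groups drawn from one Normal distribution would report standard deviations at least as similar to each other as the ones observed.</p>

```python
import random
import statistics


def similarity_p_value(observed_sds, n_per_group, n_sims=2000, seed=0):
    """Share of simulated studies whose group SDs are at least as similar
    (spread across groups no larger) as the observed group SDs.

    Assumes every group is an independent Normal sample of size n_per_group
    with a common SD equal to the pooled observed SD (an illustrative choice).
    """
    rng = random.Random(seed)
    k = len(observed_sds)
    pooled_sd = (sum(s ** 2 for s in observed_sds) / k) ** 0.5
    observed_spread = statistics.stdev(observed_sds)

    hits = 0
    for _ in range(n_sims):
        sim_sds = []
        for _ in range(k):
            sample = [rng.gauss(0.0, pooled_sd) for _ in range(n_per_group)]
            sim_sds.append(statistics.stdev(sample))
        # Count simulations in which the SDs agree as closely as observed.
        if statistics.stdev(sim_sds) <= observed_spread:
            hits += 1
    return hits / n_sims


# Made-up inputs: three groups of 15 reporting nearly identical SDs.
print(similarity_p_value([2.00, 2.01, 2.00], n_per_group=15))
```

<p>A very small value says the reported SDs are closer together than the Normal model would usually produce. As noted above, that conclusion is only as good as the model: a shared experimenter or rounding in the reported numbers could also make estimates look similar.</p>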
Big Data on Campus
2012-07-21T20:06:10+00:00
http://simplystats.github.io/2012/07/21/big-data-on-campus
<p><a href="http://www.nytimes.com/2012/07/22/education/edlife/colleges-awakening-to-the-opportunities-of-data-mining.html?smid=tu-share">Big Data on Campus</a></p>
Risks in Big Data Attract Big Law Firms
2012-07-21T19:49:01+00:00
http://simplystats.github.io/2012/07/21/risks-in-big-data-attract-big-law-firms
<p><a href="http://www.law.com/jsp/lawtechnologynews/PubArticleLTN.jsp?id=1202563911650">Risks in Big Data Attract Big Law Firms</a></p>
Interview with Lauren Talbot - Quantitative analyst for the NYC Financial Crime Task Force
2012-07-20T13:24:49+00:00
http://simplystats.github.io/2012/07/20/interview-with-lauren-talbot-quantitative-analyst-for
<div class="im">
<p>
<strong>Lauren Talbot</strong>
</p>
<p>
<img height="325" src="http://biostat.jhsph.edu/~jleek/lauren.png" width="250" />
</p>
<p>
Lauren Talbot is a quantitative analyst for the New York City Financial Crimes Task Force. Before working for NYC she was an analyst at Acumen LLC and got her degree in economics from Stanford University. She is a key player turning spatial data in NYC into new tools for government management. We talked to Lauren about her work, how she is using open data to do things like predict where fires might occur, and how she got started in the Financial Crimes Task Force.
</p>
<p>
<strong>SS: Do you consider yourself a statistician, computer scientist, or something else?</strong>
</p>
</div>
<p>LT: A lot of us can’t call ourselves statisticians or computer scientists, even if that is a large part of what we do, because we never studied those fields formally. Quantitative or Data Analyst are popular job titles, but don’t really do justice to all the code infrastructure/systems you have to build and cultivate: you aren’t simply analyzing, you are matching and automating and illustrating, too. There is also a large creative aspect, because you have to figure out how to present the data in a way that is useful and compelling to people, many of whom have no prior experience working with data. So I am glad people have started using the term “Data Scientist,” even if it makes me chuckle a little. Ideally I would call myself “Data Artist,” or “Data Whisperer,” but I don’t think people would take me seriously.</p>
<p><strong>SS: How did you end up in the NYC Mayor’s Financial Crimes Task Force?</strong></p>
<p>LT: I actually responded to a Craigslist posting. While I was still in the Bay Area (where I went to college), I was looking for jobs in NYC because I wanted to relocate back here, where I am originally from. I was searching for SAS programmer jobs, and finding a lot of stuff in healthcare that made me yawn a little. And then I had the idea to try the government jobs section. The Financial Crimes Task Force (now part of a broader citywide analytics effort under the Office of Policy and Strategic Planning) was one of two listings that popped up, and I read the description and immediately thought “dream job!” It has turned out to be even better than I imagined, because there is such a huge opportunity to make a difference: the Bloomberg administration is actually very interested in operationalizing insights from city data, so they are listening to the data people and using their work to inform agency resource allocation and even sometimes policy. My fellow analysts are also just really fun and intelligent. I’m constantly impressed by how quickly they pick up new skills, get to the bottom of things, and jump through hoops to get things done. We also amuse and entertain each other throughout the day, which is awesome.</p>
<div class="im">
<p>
<strong>SS: Can you tell us about one of the more interesting cases you have tackled and how data analysis/statistics played into the case?</strong>
</p>
</div>
<p>LT: Since this is the NYC Mayor’s Office, dealing with city data, almost all of our analyses are in some way location-based. We are trying to answer questions like, “what locations are most likely to have a catastrophic event (e.g. fire) in the near future?” This involves combining many disparate datasets such as fire data, buildings data, emergency calls data, city planning data, even garbage data. We use the tax lot ID as a common identifier, but many of the datasets do not come with this variable; they only have a text address or intersection. In many cases, the address is entered manually and has spelling mistakes. In the beginning, we were using a point-and-click geocoding tool that the city provides that reads the text field and assigns the tax lot ID. However, it was taking a long time to prepare the data so it could be used by the program, and the program was returning many errors. When we visually inspected the errors, we saw that they were caused by minor spelling differences and naming conventions. Now, almost every week we get new datasets in different structures, and we need to geocode them immediately before we can really work with them. So we needed a geocoding program that was automated and flexible, as well as capable of geocoding addresses and intersections with spelling errors and different conventions. Over the past few months, using publicly available city planning datasets and regular expressions, my side project has been creating such a program in SAS. My first test case was self-reported data created solely through user entry. This dataset, which could only be 40% geocoded using the original tool, is now 93% geocoded using the program we developed. The program is constantly evolving and improving. Now it is assigning block faces, spellchecking street and city names, and accounting for the occasional gaps in the data. We use it for everything.</p>
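<p>The program described here was written in SAS and is not shown in the interview. As a rough illustration of the general idea only (regular expressions to normalize conventions, then tolerant matching for spelling errors), here is a minimal sketch in Python. The suffix table, the function names, the lookup table, and the lot IDs are all hypothetical.</p>

```python
import difflib
import re

# Hypothetical abbreviation table; a real one would be much longer.
SUFFIXES = {"ST": "STREET", "AVE": "AVENUE", "AV": "AVENUE",
            "BLVD": "BOULEVARD", "RD": "ROAD"}


def normalize_address(raw):
    """Upper-case, drop punctuation, collapse whitespace, expand suffixes."""
    text = re.sub(r"[^\w\s]", " ", raw.upper())
    text = re.sub(r"\s+", " ", text).strip()
    return " ".join(SUFFIXES.get(token, token) for token in text.split(" "))


def match_lot_id(raw, lot_ids_by_address, cutoff=0.85):
    """Return the lot ID of the closest reference address, or None when
    nothing is similar enough (the cutoff value is an arbitrary choice)."""
    cleaned = normalize_address(raw)
    candidates = difflib.get_close_matches(
        cleaned, list(lot_ids_by_address), n=1, cutoff=cutoff)
    return lot_ids_by_address[candidates[0]] if candidates else None


# Made-up reference table keyed by normalized address.
reference = {"350 FIFTH AVENUE": "LOT-A", "1 CENTRE STREET": "LOT-B"}

print(normalize_address("350 fifth ave."))       # 350 FIFTH AVENUE
print(match_lot_id("350 Fith Ave", reference))   # LOT-A
print(match_lot_id("99 Nowhere Rd", reference))  # None
```

<p>The design choice worth noting is that normalization handles the systematic differences (case, punctuation, abbreviations) while the similarity cutoff absorbs one-off typos; setting the cutoff too low would start assigning the wrong lot to addresses that simply are not in the reference table.</p>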
<div class="im">
<p>
<strong>SS: What are the computational tools and ideas you use most frequently in your day to day work (R, databases, regression analysis, etc.)?</strong>
</p>
</div>
<p>LT: In the beginning, all of the data was sent to us in SQL or Excel, which was not very efficient. Now we are building a multi-agency SAS platform that can be used by programmers and non-programmers. Since there are so many data sources that can work together, having a unified platform creates new discoveries that agencies can use to be more efficient or effective. For example, a building investigator can use 311 noise complaints to uncover vacated properties that are being illegally occupied. The platform employs Palantir, which is an excellent front-end tool for playing around with the data and exploring many-to-many relationships. Internally, my team has also used R, Python, Java, even VBA. Whatever gets the job done. We use a good mix of statistical tools. The bread and butter is usually manipulating and understanding new data sources, which is necessary before we can start trying to do something like run a multiple regression, for example. In the end, it’s really a mashup: text parsing, name matching, summarizing/describing/reporting using comparative statistics, geomapping, graphing, logistic regression, even kernel density, can all be part of the mix. Our guiding principle is to use the tool/analysis/strategy that has the highest return on investment of time and analyst resources for the city.</p>
<div class="im">
<p>
<strong>SS: What are the challenges of working as a quantitative analyst in a regulatory role? Is it hard to make your analyses/discoveries understandable?</strong>
</p>
</div>
<p>LT: A lot of data analysts working in government have a difficult time getting agencies and policymakers to take action based on their work due to political priorities and organizational structures. We circumvent that issue by operating based on the needs and requests of the agencies, as well as paying attention to current events. An agency or official may come to us with a problem, and we figure out what we can deliver that will be of use to them. This starts a dialogue. It becomes an iterative process, and projects can grow and morph once we have feedback. Oftentimes, it is better to use a data-mining approach, which is more understandable to non-statisticians, rather than a regression, which can seem like a black box. For example, my colleague came up with an algorithm to target properties that were a high fire risk based on the presence of illegal conversion complaints and evidence that the property owner was under financial distress. He began with a simple list of properties for the Department of Buildings to focus on, and now they go out to inspect a list of places selected by his algorithm weekly. This video of the fire chief speaking about the project illustrates the challenges encountered and why the simpler approach was ultimately successful: <a href="http://www.youtube.com/watch?v=425QSx0U8lU&feature=youtube_gdata_player" target="_blank">http://www.youtube.com/watch?v=425QSx0U8lU&feature=youtube_gdata_player</a></p>
<div class="im">
<p>
<strong>SS: Do you have any advice for statisticians/data scientists who want to get involved with open government or government data analysis?</strong>
</p>
</div>
<p>LT: I’ve found that people in government are actually very open to and interested in using data. The first challenge is that they don’t know that the data they have is of value. To be the most effective, you should get in touch with the people who have subject matter expertise (usually employees who have been working on the ground for some time), interview them, check your assumptions, and share whatever you’re seeing in the data on an ongoing basis. Not only will both parties learn faster, but it helps build a culture of interest in the data. Once people see what is possible, they will become more creative and start requesting deliverables that are increasingly actionable. The second challenge is getting data, and the legal and social/political issues surrounding that. The big secret is that so much useful data is actually publicly available. Do your research — you may find what you need without having to fight for it. If what you need is protected, however, consider whether the data would still be useful to you if scrubbed of personally identifiable information. Location-based data is a good example of this. If so, see whether you can negotiate with the data owner to obtain only the parts needed to do your analysis. Finally, you may find that the cohort of data scientists in government is all too sparse, and too few people “speak your language.” Reach out and align yourself with people in other agencies who are also working with data. This is a great way to gain new insight into the goals and issues of your administration, as well as friends to support and advise you as you navigate “the system.”</p>
Help me find the good JSM talks
2012-07-19T13:43:00+00:00
http://simplystats.github.io/2012/07/19/help-me-find-the-good-jsm-talks
<p>I’m about to head out for JSM in a couple of weeks. The sheer magnitude of the conference means it is pretty hard to figure out what talks I should attend. One approach I’ve used in the past is to identify people who I know give good talks and go to their talks. But that isn’t a very good talk-discovery mechanism. So this year I’m trying a crowd-sourcing experiment. </p>
<p>First, some background on what kind of talks I like.</p>
<ul>
<li>I strongly prefer talks where someone is tackling a problem presented by a new kind of data, whether they got that data from a collaborator, they scraped it off the web, or they generated it themselves.</li>
<li> I am 100% ok if they only used linear regression to analyze the data if it led to interesting exploratory analysis, surprising results, or a cool conclusion. </li>
<li>Major bonus points if the method is being used to solve a real-world problem.</li>
<li>I also really like creative and informative plots.</li>
<li>I prefer pictures to text/equations in general.</li>
</ul>
<p>On the other hand, I really am not a fan of talks where someone developed a method, no matter how cool, then started looking around for a data set to apply it to. </p>
<p>If you know of anyone who is going to give a talk of the kind I like, can you post it in the comments or tweet it to @simplystats with the hashtag #goodJSMtalks?</p>
<p>Also, if you know anyone who gives posters <a href="http://www.bioinformaticszen.com/post/genotype-from-phenotype/" target="_blank">like this</a>, lemme know so I can drop by. </p>
<p>Thanks!!!</p>
Big data is worth nothing without big science
2012-07-19T11:40:11+00:00
http://simplystats.github.io/2012/07/19/big-data-is-worth-nothing-without-big-science
<p><a href="http://news.cnet.com/8301-1001_3-57434736-92/big-data-is-worth-nothing-without-big-science/">Big data is worth nothing without big science</a></p>
Top Universities Test the Online Appeal of Free
2012-07-18T18:00:15+00:00
http://simplystats.github.io/2012/07/18/top-universities-test-the-online-appeal-of-free
<p><a href="http://www.nytimes.com/2012/07/18/education/top-universities-test-the-online-appeal-of-free.html?smid=tu-share">Top Universities Test the Online Appeal of Free</a></p>
A closer look at data suggests Johns Hopkins is still the #1 US hospital
2012-07-18T17:31:00+00:00
http://simplystats.github.io/2012/07/18/a-closer-look-at-data-suggests-johns-hopkins-is-still
<p>The <a href="http://health.usnews.com/health-news/best-hospitals/articles/2012/07/16/best-hospitals-2012-13-the-honor-roll" target="_blank">US News best hospital 2012-2013<strike>2</strike> rankings</a> are out. The big news is that Johns Hopkins has lost its throne. For 21 consecutive years Hopkins was ranked #1, but this year Mass General Hospital (MGH) took the top spot displacing Hopkins to #2. However, <a href="http://www.linkedin.com/pub/elisabet-pujadas/46/320/722" target="_blank">Elisabet Pujadas</a>, an MD-PhD student here at Hopkins, took a close look at the data used for the rankings and made <a href="http://rafalab.jhsph.edu/simplystats/pujadasversion.JPG" target="_blank">this plot</a> (by hand!). The plot shows histograms of the rankings by speciality and shows Hopkins outperforming MGH.</p>
<p><a href="http://rafalab.jhsph.edu/simplystats/hospitalrankings.png" target="_blank"><img height="263" src="http://rafalab.jhsph.edu/simplystats/hospitalrankings.png" width="525" /></a></p>
<p>I reproduced Elisabet’s figure using R (see plot on the left above… hers is way cooler). A quick look at the histograms shows that Hopkins has many more highly ranked specialities. For example, Hopkins has 5 specialities ranked as #1 while MGH has none. Hopkins has 2 specialities ranked #2 while MGH has none. The median rank for Hopkins is 3 while for MGH it’s 5. The plot on the right plots ranks, Hopkins’ versus MGH’s, and shows that Hopkins has a better ranking for 13 out of 16 specialities considered.</p>
<p>So how does MGH get ranked higher than Hopkins? Here is U.S. News’ explanation of how they rank: </p>
<blockquote>
<p><span>To make the Honor Roll, a hospital had to earn at least one point in each of six specialties. A hospital earned two points if it ranked among the top 10 hospitals in America in any of the 12 specialties in which the US News rankings are driven mostly by objective data, such as survival rates and patient safety. Being ranked in the next 10 in those specialties earned a hospital one point. In the other four specialties, where ranking is based on each hospital’s reputation among doctors who practice that specialty, the top five hospitals in the country received two Honor Roll points and the next five got one point.</span></p>
</blockquote>
<p>This actually results in a tie of 30 points, but according to the table <a href="http://health.usnews.com/health-news/best-hospitals/articles/2012/07/16/best-hospitals-2012-13-the-honor-roll" target="_blank">here</a>, Hopkins was ranked in 15 specialities to MGH’s 16. This was the tiebreaker. But, the <a href="http://health.usnews.com/best-hospitals/area/md/johns-hopkins-hospital-6320180" target="_blank">data they put up</a> shows Hopkins ranked in all 16 specialities. Did the specialty ranked 17th do Hopkins in? In any case, a closer look at the data does suggest Hopkins is still #1.</p>
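<p>The Honor Roll rule quoted above can be written as a short scoring function. This is only a sketch of the rule as U.S. News describes it in the quote, not their actual code; the function names and the example ranks are hypothetical and are not the real Hopkins or MGH data.</p>

```python
def honor_roll_points(rank, reputation_based):
    """Points for one specialty under the quoted Honor Roll rule.

    rank: national rank in the specialty (1 is best), or None if unranked.
    reputation_based: True for the four reputation-only specialties.
    """
    if rank is None:
        return 0
    # Data-driven specialties: top 10 earns 2 points, next 10 earns 1.
    # Reputation-based specialties: top 5 earns 2 points, next 5 earns 1.
    top, nxt = (5, 10) if reputation_based else (10, 20)
    if top >= rank:
        return 2
    if nxt >= rank:
        return 1
    return 0


def honor_roll_total(ranks):
    """ranks: list of (rank, reputation_based) pairs, one per specialty.

    Returns (total_points, qualifies). A hospital qualifies if it earned
    at least one point in six or more specialties.
    """
    points = [honor_roll_points(r, rep) for r, rep in ranks]
    scoring_specialties = sum(1 for p in points if p > 0)
    return sum(points), scoring_specialties >= 6


# Hypothetical hospital: made-up ranks, for illustration only.
example = [(1, False), (3, False), (12, False), (25, False),
           (4, True), (7, True), (None, False)]
print(honor_roll_total(example))
```

<p>Note that under this rule a #1 rank and a #10 rank in a data-driven specialty earn the same two points, which is how a hospital with fewer top ranks can still tie on the Honor Roll.</p>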
<p>Disclaimer: I am a professor at Johns Hopkins University.</p>
<p>The data for Hopkins is <a href="http://health.usnews.com/best-hospitals/area/md/johns-hopkins-hospital-6320180" target="_blank">here</a> and I cleaned it up and put it <a href="http://rafalab.jhsph.edu/simplystats/hopkins.txt" target="_blank">here</a>. For MGH it’s <a href="http://health.usnews.com/best-hospitals/area/ma/massachusetts-general-hospital-6140430" target="_blank">here</a> and <a href="http://rafalab.jhsph.edu/simplystats/mgh.txt" target="_blank">here</a>. The script used to make the plots is <a href="http://rafalab.jhsph.edu/simplystats/hospitalrankings.R" target="_blank">here</a>. Thanks to Elisabet for the pointer and data.</p>
Johns Hopkins Coursera Statistics Courses
2012-07-18T13:51:07+00:00
http://simplystats.github.io/2012/07/18/johns-hopkins-coursera-statistics-courses
<p><a href="https://www.coursera.org/course/compdata" target="_blank">Computing for Data Analysis</a></p>
<p>[youtube http://www.youtube.com/watch?v=gk6E57H6mTs]</p>
<p><a href="https://www.coursera.org/course/dataanalysis" target="_blank">Data Analysis</a></p>
<p>[youtube http://www.youtube.com/watch?v=-lutj1vrPwQ]</p>
<p><a href="https://www.coursera.org/course/biostats" target="_blank">Mathematical Biostatistics Bootcamp</a></p>
<p>[youtube http://www.youtube.com/watch?v=ekdpaf_WT_8]</p>
Universities Reshaping Education on the Web
2012-07-17T23:56:14+00:00
http://simplystats.github.io/2012/07/17/universities-reshaping-education-on-the-web
<p><a href="http://www.nytimes.com/2012/07/17/education/consortium-of-colleges-takes-online-education-to-new-level.html?smid=tu-share">Universities Reshaping Education on the Web</a></p>
Free Statistics Courses on Coursera
2012-07-17T13:22:45+00:00
http://simplystats.github.io/2012/07/17/free-statistics-courses-on-coursera
<p>Today, we’re very excited to announce that the Biostatistics Department at Johns Hopkins is offering three new online courses through <a href="http://www.coursera.org/" target="_blank">Coursera</a>. These courses are</p>
<ul>
<li><strong><a href="https://www.coursera.org/course/dataanalysis" target="_blank">Data Analysis</a></strong>: Data have never been easier or cheaper to come by. This course will cover how to collect, clean, interpret and analyze data, then communicate your results for maximum impact.<br />
<strong>Instructor</strong>: <a href="http://www.biostat.jhsph.edu/~jleek/" target="_blank">Jeff Leek</a></li>
<li><strong><a href="https://www.coursera.org/course/compdata" target="_blank">Computing for Data Analysis</a></strong>: This course is about learning the fundamental computing skills necessary for effective data analysis. You will learn to program in R and to use R for reading data, writing functions, making informative graphs, and applying modern statistical methods.<br />
<strong>Instructor</strong>: <a href="http://www.biostat.jhsph.edu/~rpeng/" target="_blank">Roger Peng</a></li>
<li><strong><a href="https://www.coursera.org/course/biostats" target="_blank">Mathematical Biostatistics Bootcamp</a></strong>: This course presents fundamental probability and statistical concepts used in biostatistical data analysis. It is taught at an introductory level for students with junior- or senior-college level mathematical training.<br />
<strong>Instructor</strong>: <a href="http://www.bcaffo.com/" target="_blank">Brian Caffo</a></li>
</ul>
<p>These courses will be offered free of charge through Coursera to anyone interested in signing up. Those who complete the course and earn a passing grade will get a certificate of completion from Coursera.</p>
<p>Computing for Data Analysis and Mathematical Biostatistics Bootcamp will start in the fall on September 24. Data Analysis will start in the spring on January 22, 2013.</p>
Sunday Data/Statistics Link Roundup (7/15/12)
2012-07-15T13:23:19+00:00
http://simplystats.github.io/2012/07/15/sunday-data-statistics-link-roundup-7-15-12
<ol>
<li>A <a href="http://ivory.idyll.org/blog/journal-data-policies.html" target="_blank">really nice list</a> of journals’ software/data release policies from <a href="http://ivory.idyll.org/blog/" target="_blank">Titus’ blog</a>. Interesting that he couldn’t find a data/release policy for the New England Journal of Medicine. I wonder if that is because it publishes mostly clinical studies, where the data are often protected for privacy reasons? It seems like there is going to eventually be a big discussion of the relative importance of privacy and open data in the clinical world. </li>
<li>Some <a href="http://www.mygrid.org.uk/" target="_blank">interesting software</a> that can be used to build virtual workflows for computational science. It seems like a lot of data analysis is still done via “drag and drop” programs. I can’t help but wonder if our effort should be focused on developing drag and drop or educating the next generation of scientists to have minimum scripting capabilities. </li>
<li>We added <a href="http://www.statschat.org.nz/" target="_blank">StatsChat</a> by Thomas L. and company to our blogroll. Lots of good stuff there, for example, this recent post on <a href="http://www.statschat.org.nz/2012/07/13/when-randomized-trials-dont-help/" target="_blank">when randomized trials don’t help</a>. You can also <a href="https://twitter.com/statschat" target="_blank">follow them</a> on twitter. </li>
<li>A <a href="http://www.premiersoccerstats.com/wordpress/?p=925&utm_source=rss&utm_medium=rss&utm_campaign=processing-public-data-with-r" target="_blank">really nice post</a> on processing public data with R. As more and more public data becomes available, from governments, companies, APIs, etc. the ability to quickly obtain, process, and visualize public data is going to be hugely valuable. </li>
<li>Speaking of public data, you could get it from <a href="http://simplystatistics.org/post/11237403492/apis" target="_blank">APIs</a> or from <a href="http://simplystatistics.org/post/15182715327/list-of-cities-states-with-open-data-help-me-find" target="_blank">government websites</a>. But beware those <a href="http://simplystatistics.org/post/26068033590/motivating-statistical-projects" target="_blank">category 2 problems</a>! </li>
</ol>
Bits: Betaworks Buys What's Left of Social News Site Digg
2012-07-15T02:22:21+00:00
http://simplystats.github.io/2012/07/15/bits-betaworks-buys-whats-left-of-social-news-site
<p><a href="http://bits.blogs.nytimes.com/2012/07/12/betaworks-buys-whats-left-of-social-news-site-digg/?smid=tu-share">Bits: Betaworks Buys What’s Left of Social News Site Digg</a></p>
Bits: Mobile App Developers Scoop Up Vast Amounts of Data, Reports Say
2012-07-14T13:56:28+00:00
http://simplystats.github.io/2012/07/14/bits-mobile-app-developers-scoop-up-vast-amounts-of
<p><a href="http://bits.blogs.nytimes.com/2012/07/12/mobile-app-developers-scoop-up-vast-amounts-of-data-reports-say/?smid=tu-share">Bits: Mobile App Developers Scoop Up Vast Amounts of Data, Reports Say</a></p>
GDP Figures in China are for "reference" only
2012-07-13T17:51:34+00:00
http://simplystats.github.io/2012/07/13/gdp-figures-in-china-are-for-reference-only
<p><a href="http://www.npr.org/2012/07/13/156710844/chinas-economy-slows-to-3-year-low">GDP Figures in China are for “reference” only</a></p>
This Is Not About Statistics But Its About
2012-07-13T13:53:50+00:00
http://simplystats.github.io/2012/07/13/this-is-not-about-statistics-but-its-about
<p>[youtube http://www.youtube.com/watch?v=p3Te_a-AGqM?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]</p>
<p>This is not about statistics, but it’s about Emacs, which I’ve been using for a long time. This guy is an Emacs virtuoso, and the crazy thing is that he’s only been using it for 8 months!</p>
<p>Best line: “Should I wait for the next version of Emacs? Hell no!”</p>
<p>(Thanks to Brian C. and Kasper H. for the pointer.)</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
What is the most important code you write?
2012-07-12T13:50:40+00:00
http://simplystats.github.io/2012/07/12/what-is-the-most-important-code-you-write
<p>These days, like most people, the research I do involves writing a lot of code. A lot of it. Usually, you need some code to</p>
<ol>
<li>Process the data to take it from its original format to the format that’s convenient for you</li>
<li>Run exploratory data analyses creating plots, calculating summary statistics, etc.</li>
<li>Try statistical model 1</li>
<li>Try statistical model 2</li>
<li>Try statistical model 3</li>
<li>…</li>
<li>Fit final statistical model; if this involves MCMC then there’s usually a ton of code to do this</li>
<li>Make some more plots of results, make tables, more summary statistics of output</li>
</ol>
<p>My question is, of all this code, which is the most important? The code that fits the final model? The code that summarizes results? Often you just see the code that fit the final statistical model and maybe some of the code that summarizes the results. The code for fitting all of the previous models and doing the exploratory analysis is lost in the ether (or at least the version control ether). Now, I’m not saying I always want to see all that other code. Usually, I am just interested in the final model.</p>
<p>The point is that the code for the final model only represents a small fraction of the work that was done to get there. This work is the bread and butter of applied statistics and it is essentially thrown out. Of course, life would be much easier if someone would just <em>tell</em> me what the final model would be every time. Then I would just fit it! But nooooo, hundreds or thousands of lines of code and numerous judgment calls go into figuring out what that last model is going to be. </p>
<p>Yet when you read a paper, it more or less looks like the final model appeared out of thin air because there’s no space/time to tell the story about everything that came before. I would say the same is true for theoretical statistics too. It’s not as if theorems/proofs appear out of nowhere. Hard work goes into figuring out both the right theorem to prove and the right way to prove it.</p>
<p>But I would argue that there’s one key difference between theoretical and applied statistics in this regard: Everyone seems to accept that theoretical statistics is hard. So when you see a theorem/proof in a paper you consciously or unconsciously realize that it must have been hard work to arrive at that point. But in a great applied statistics paper, all you get is an interesting scientific question and some graphs/tables that provide an answer. Who cares about that?</p>
<p>Seriously though, even for a seasoned applied statistician, it’s sometimes easy to forget that everything looks easy once someone else has done all the work. It’s not clear to me whether we just need to change expectations or if we need a different method for communicating the effort involved (or both). Making research reproducible would be one approach as it would require all the code for the work be available. But that’s mostly just “final model” stuff plus some data processing code. Going one step further might require that a git repository be made available. That way you could see all the history in addition to the final stuff. I’m guessing there would be some resistance to universally adopting that approach!</p>
<p>Another approach might be to allow applied stat papers to go into more of the details about the process. With strict space limitations these days, it’s often hard enough to talk about the final model. But in some cases I think I would enjoy reading the story behind the story. Some of that “backstory” would make for good instructional material for applied stat classes.</p>
Statistical Reasoning on iTunes U
2012-07-12T12:38:19+00:00
http://simplystats.github.io/2012/07/12/statistical-reasoning-on-itunes-u
<p>Our colleague, the legendary John McGready has just put his <a href="http://itunes.apple.com/us/course/statistical-reasoning-i/id535928182" target="_blank">Statistical Reasoning I</a> and <a href="http://itunes.apple.com/us/course/statistical-reasoning-ii/id538088324" target="_blank">Statistical Reasoning II</a> courses on iTunes U. This course sequence is extremely popular here at Johns Hopkins and now the entire world can experience the joy.</p>
My worst (recent) experience with peer review
2012-07-11T14:10:00+00:00
http://simplystats.github.io/2012/07/11/my-worst-recent-experience-with-peer-review
<p>My colleagues and I just published a paper <a href="http://www.biomedcentral.com/1471-2105/13/150/abstract" target="_blank">on validation of genomic results</a> in BMC Bioinformatics. It is <a href="http://www.biomedcentral.com/bmcbioinformatics/mostviewed" target="_blank">“highly accessed”</a> and we are really happy with how it turned out. </p>
<p>But it was brutal getting it published. Here is the line-up of places I sent the paper. </p>
<ul>
<li><strong>Science</strong>: Submitted 10/6/10, rejected 10/18/10 without review. I know this seems like a long shot, but this <a href="http://www.sciencemag.org/content/334/6060/1230" target="_blank">paper on validation</a> was published in Science not too long after. </li>
<li><strong>Nature Methods</strong>: Submitted 10/20/10, rejected 10/28/10 without review. Not much to say here, moving on…</li>
<li><strong>Genome Biology</strong>: Submitted 11/1/10, rejected 1/5/11. 2/3 referees thought the paper was interesting, few specific concerns raised. I felt they could be addressed so appealed on 1/10/11, appeal accepted 1/20/11, paper resubmitted 1/21/11. Paper rejected 2/25/11. 2/3 referees were happy with the revisions. One still didn’t like it. </li>
<li><strong>Bioinformatics</strong>: Submitted 3/3/11, rejected 3/13/11 without review. I appealed again; it turns out “I have checked with the editors about this for you and their opinion was that there was already substantial work in validating gene lists based on random sampling.” If anyone knows about one of those papers let me know :-). </li>
<li><strong>Nucleic Acids Research</strong>: Submitted 3/18/11, rejected with invitation for revision 3/22/11. Resubmitted 12/15/11 (got delayed by a few projects here), rejected 1/25/12. Reason for rejection seemed to be one referee had major “philosophical issues” with the paper.</li>
<li><strong>BMC Bioinformatics</strong>: Submitted 1/31/12, first review 3/23/12, resubmitted 4/27/12, second revision requested 5/23/12, revised version submitted 5/25/12, accepted 6/14/12.</li>
</ul>
<div>
An interesting side note is that the really brief reviews from the Genome Biology submission inspired me to <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895" target="_blank">do this paper</a>. I had time to conceive the study, get IRB approval, build a web game for peer review, recruit subjects, collect the data, analyze the data, write the paper, submit the paper to 3 journals and have it come out 6 months before the paper that inspired it was published!
</div>
<div>
</div>
<div>
Ok, glad I got that off my chest.
</div>
<div>
</div>
<div>
What is your worst peer-review story?
</div>
How Does the Film Industry Actually Make Money?
2012-07-11T12:56:04+00:00
http://simplystats.github.io/2012/07/11/how-does-the-film-industry-actually-make-money
<p><a href="http://www.nytimes.com/2012/07/01/magazine/how-does-the-film-industry-actually-make-money.html?smid=tu-share">How Does the Film Industry Actually Make Money?</a></p>
A Northwest Pipeline to Silicon Valley
2012-07-09T12:39:16+00:00
http://simplystats.github.io/2012/07/09/a-northwest-pipeline-to-silicon-valley
<p><a href="http://www.nytimes.com/2012/07/08/technology/u-of-washington-a-northwest-pipeline-to-silicon-valley.html?smid=tu-share">A Northwest Pipeline to Silicon Valley</a></p>
Skepticism+Ideas+Grit
2012-07-09T12:32:12+00:00
http://simplystats.github.io/2012/07/09/skepticism-ideas-grit
<p>A number of people seem to have objected to my <a href="http://simplystatistics.org/post/26020538368/the-price-of-skepticism" target="_blank">post quoting Carl Sagan about skepticism</a> (hi Paramita!) and I appreciate the comments. However, I wanted to clarify why I liked the quotation. I think in order to be successful in science three things are necessary:</p>
<ol>
<li>A healthy skepticism</li>
<li>An original idea</li>
<li>Quite a bit of <a href="http://simplystatistics.org/post/23928890537/schlep-blindness-in-statistics" target="_blank">grit and moxie</a></li>
</ol>
<p>I find that too often, people consciously or unconsciously stop at (1). In fact some people make an entire career doing (1) but it’s not one that I can personally appreciate.</p>
<p>What we need more of is skepticism coupled with new ideas, not pure skepticism. </p>
The power of power
2012-07-04T13:54:01+00:00
http://simplystats.github.io/2012/07/04/the-power-of-power
<p>Those of you living in the mid-Atlantic region are probably not reading this right now because you don’t have power. I’ve been out of power in my house since last Friday and projections are it won’t come back until the end of the week. I am lucky because my family and I have some backup options, but not everyone has those options.</p>
<p>So that leads me to this question—do power outages affect health? There have been a number of papers examining this question, mostly looking at one-off episodes, as you might expect. One paper, written by Brooke Anderson (postdoctoral fellow here) and <a href="http://environment.yale.edu/bell/" target="_blank">Michelle Bell</a> at Yale University examined the <a href="http://www.ncbi.nlm.nih.gov/pubmed/22252408" target="_blank">effect of the massive 2003 power outage in New York City on all-cause mortality</a>. This was the first city-wide blackout since 1977 and the data from the time period are striking.</p>
<p>A key point with this paper is that often mortality is under-estimated in these kinds of situations because deaths are only counted if they are identified as “disaster-related” (there may be other reasons, but I won’t get into that here). The NYC Department of Health and Mental Hygiene reported the total number of deaths to be 6 over the 2-day period of the blackout, mostly from carbon monoxide poisoning. However, the paper estimated a 28% increase in all-cause mortality which, in New York, translates to an excess mortality of 90 deaths, an order of magnitude higher than official results.</p>
<p>The power outage in the mid-Atlantic is ongoing but things appear to be improving by the day. According to <a href="http://www.bge.com/customerservice/stormsoutages/currentoutages/pages/default.aspx" target="_blank">BGE</a>, the primary electricity provider in Baltimore City, over half of its customers in the city were without power. On top of that the region is in the middle of a heat wave that has been going on for roughly the same amount of time as the power outage. If you figure the worst of it was in the first 3 days, and if New York’s relative risk could be applied here in Baltimore (a BIG if), then given a typical daily mortality of 17 deaths in the summer months, we would expect an excess mortality for the 3-day period of about 14 deaths from all causes.</p>
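<p>The back-of-the-envelope calculation above is just baseline deaths times the relative increase. Here is a minimal sketch of that arithmetic; the function name is my own, and it carries the same BIG assumption that New York’s 28% increase applies to Baltimore.</p>

```python
def excess_deaths(daily_deaths, days, relative_increase):
    """Expected excess deaths: baseline deaths over the period times the
    relative increase in all-cause mortality (0.28 means a 28% increase)."""
    return daily_deaths * days * relative_increase


# Numbers from the post: 17 deaths per day, 3 days, 28% increase.
baltimore = excess_deaths(daily_deaths=17, days=3, relative_increase=0.28)
print(round(baltimore))  # 14
```

<p>The baseline over 3 days is 51 deaths, and 28% of that is 14.28, which rounds to the 14 quoted above.</p>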
<p>Unfortunately, it seems power outages are likely to become more frequent because of increasing stress on an aging power grid and climate change causing more extreme weather (this outage was caused by a severe thunderstorm). It seems to me that the contribution of such infrastructure failures to health problems will be an interesting problem to study for the future.</p>
Replication and validation in -omics studies - just as important as reproducibility
2012-07-03T12:57:29+00:00
http://simplystats.github.io/2012/07/03/replication-and-validation-in-omics-studies-just-as
<p>The psychology/social psychology community has made <a href="http://simplystatistics.org/post/21326470429/replication-psychology-and-big-science" target="_blank">replication</a> a <a href="http://openscienceframework.org/" target="_blank">huge focus</a> over the last year. One reason is the recent, <a href="http://blogs.discovermagazine.com/notrocketscience/2012/03/10/failed-replication-bargh-psychology-study-doyen/" target="_blank">public blow-up</a> over a famous study that did not replicate. There are also concerns about the experimental and conceptual design of these studies that go beyond simple lack of replication. In genomics, a <a href="http://simplystatistics.org/post/18378666076/the-duke-saga-starter-set" target="_blank">similar scandal</a> occurred due to what amounted to “data fudging”. In the genomics case, however, much of the blame and focus has been on <a href="http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1267453942" target="_blank">lack of reproducibility</a> or <a href="http://www.nature.com/nature/journal/v467/n7314/full/467401b.html" target="_blank">data availability</a>. </p>
<p>I think one of the reasons that the field of genomics has focused more on reproducibility is that replication is already more consistently performed in genomics. There are two forms of this replication: validation and independent replication. Validation generally refers to a replication experiment performed by the same research lab or group - with a different technology or a different data set. On the other hand, independent replication of results is usually performed by an outside laboratory. </p>
<p>Validation is by far the more common form of replication in genomics. <a href="http://www.sciencemag.org/content/334/6060/1230.full" target="_blank">In this article</a> in Science, Ioannidis and Khoury point out that validation has a different meaning depending on the subfield of genomics. In GWAS studies, it is now expected that every significant result will be validated in a second large cohort with genome-wide significance for the identified variants.</p>
<p>In gene expression/protein expression/systems biology analyses, there has been no similar definition of the “criteria for validation”. Generally the experiments are performed and if a few/a majority/most of the results are confirmed, the approach is considered validated. My colleagues and I just published a paper where we define <a href="http://www.biomedcentral.com/content/pdf/1471-2105-13-150.pdf" target="_blank">a new statistical sampling</a> approach for validating lists of features in genomics studies that is somewhat less ambiguous. But I think this is only a starting point. Just like in psychology, we need to focus not just on reproducibility, but also replicability of our results, and we need new statistical approaches for evaluating whether validation/replication have actually occurred. </p>
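<p>To give a flavor of what a sampling-based approach to validation could look like, here is a minimal sketch. It is not the procedure from our paper, just the generic idea: draw a random subset of the feature list, run the validation experiment on that subset, and report an interval for the fraction of the whole list that would validate. The gene names, the sample size, the number of successes, and the choice of a Wilson interval are all assumptions made for illustration.</p>

```python
import math
import random


def sample_for_validation(features, n, seed=0):
    """Draw n features uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(features, n)


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Hypothetical list of 500 significant genes; validate a random 20 of them.
gene_list = ["gene_%d" % i for i in range(500)]
to_validate = sample_for_validation(gene_list, n=20)

# Suppose (made-up number) 18 of the 20 sampled genes were confirmed.
low, high = wilson_interval(successes=18, n=20)
print(to_validate[:3], round(low, 2), round(high, 2))
```

<p>The point of sampling at random rather than hand-picking the top few hits is that the validated fraction then says something about the whole list, not just the features someone chose to check.</p>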
Computing and Sustainability: What Can Be Done?
2012-07-02T14:20:25+00:00
http://simplystats.github.io/2012/07/02/computing-and-sustainability-what-can-be-done
<p>Last Friday, the National Research Council released a report titled <em><a href="http://www.nap.edu/catalog.php?record_id=13415" target="_blank">Computing Research for Sustainability</a></em>, written by the NRC’s Committee on Computing Research for Environmental and Societal Sustainability, on which I served (<a href="http://www8.nationalacademies.org/onpinews/newsitem.aspx?RecordID=13415" target="_blank">press release</a>). This was a novel experience for me given that I was the only non-computer scientist on the committee. That said, I think the report is quite interesting for a number of reasons. As a statistician, I took away a few lessons.</p>
<ul>
<li><strong>Sustainability presents many opportunities for CS</strong>. One of the first things the committee did was hold a workshop where researchers from all over presented their work on CS and sustainability, and it was impressive. Everything from Shwetak Patel’s clever use of data analysis to monitor home power usage to Bill Tomlinson’s work in human computer interaction. Very educational for me. One thing I remember is that towards the end of the workshop <a href="http://www.cms.caltech.edu/people/2994/profile" target="_blank">John Doyle</a> made some comment about IPv6 and everyone laughed and…I didn’t get it. I still don’t get it.</li>
<li><strong>CS faces a number of statistical challenges</strong>. Many of the interesting areas posed by sustainability research come across, in my mind, as statistical problems. In particular, there is a need to develop better statistical models for understanding uncertainty in a variety of systems (e.g. electrical power grids, climate models, ecological dynamics). These are CS problems because they are “big data” systems but the underlying issues are largely statistical. Overall, it seems a lot of money has been put into collecting data but relatively little investment has been made (so far) in figuring out what to do with it.</li>
<li><strong>Statistics and CS will be crashing into each other at a theater near you</strong>. In many discussions the Committee had, I couldn’t help thinking that a lot of the challenges in CS are exactly the same as in statistics. Specifically, how integrated should computer scientists be in the other sciences? Being an outsider to that area, it seems there is a debate going on between those who do “pure” computer science, like compilers and programming languages, and those who do “applied” computer science, like computational biology. This debate sounds <a href="http://simplystatistics.org/post/25643791866/statistics-and-the-science-club" target="_blank">eerily familiar</a>.</li>
</ul>
<p>It was fun to hang out with the computer scientists for a while, and this group was really exceptional. But now, back to my day job.</p>
Meet the Skeptics: Why Some Doubt Biomedical Models - and What it Takes to Win Them Over
2012-07-02T13:10:12+00:00
http://simplystats.github.io/2012/07/02/meet-the-skeptics-why-some-doubt-biomedical-models
<p><a href="http://biomedicalcomputationreview.org/content/meet-skeptics-why-some-doubt-biomedical-models-and-what-it-takes-win-them-over-0">Meet the Skeptics: Why Some Doubt Biomedical Models - and What it Takes to Win Them Over</a></p>
Sunday data/statistics link roundup (7/1)
2012-07-01T13:59:38+00:00
http://simplystats.github.io/2012/07/01/sunday-data-statistics-link-roundup-7-1
<ol>
<li>A <a href="http://www.reddit.com/r/explainlikeimfive/comments/vb8vs/eli5_what_exactly_is_obamacare_and_what_did_it/c530lfx" target="_blank">really nice </a>explanation of the elements of Obamacare. <a href="http://simplystatistics.org/post/26138675180/obamacare-is-not-going-to-solve-the-health-care-crisis" target="_blank">Rafa’s post</a> on the new <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/" target="_blank">inHealth initiative</a> Scott is leading got a lot of <a href="http://www.reddit.com/r/Health/comments/vsltn/obamacare_is_not_going_to_solve_the_health_care/" target="_blank">comments on Reddit</a>. Some of them are funny (Rafa’s spelling got rocked) and if you get past the usual level of internet-commentary politeness, some of them seem to be really relevant - especially the comments about generalizability and the economics of health care. </li>
<li>From Andrew J., a <a href="http://www.yonderbiology.com/DNA_art/acgt" target="_blank">cool visualization of the human genome</a>: they are showing every base of the human genome over the course of a year. That turns out to be about 100 bases per second. I think this is a great way to show how much information is in just one human genome. It also puts the sequencing data deluge in perspective. We are now sequencing thousands of these genomes a year and it’s only going to get faster. </li>
<li>Cosma Shalizi has a <a href="http://cscs.umich.edu/~crshalizi/weblog/920.html%20" target="_blank">nice list</a> of unsolved problems in statistics on his blog (via Edo A.). These problems primarily fall into what I call Category 1 problems in my post on <a href="http://simplystatistics.org/post/26068033590/motivating-statistical-projects" target="_blank">motivating statistical projects</a>. I think he has some really nice insight though and some of these problems sound like a big deal if one was able to solve them.</li>
<li>A really provocative talk on why <a href="http://www.youtube.com/watch?v=bBx2Y5HhplI" target="_blank">consumers are the job creators</a>. The issue of who are the job creators seems absolutely ripe for a thorough statistical analysis. There are a thousand confounders here and my guess is that most of the work so far has been Category 2 - let’s use convenient data to make a stab at this. But a thorough and legitimate data analysis would be hugely impactful. </li>
<li>Your eReader is <a href="http://online.wsj.com/article/SB10001424052702304870304577490950051438304.html?mod=rss_Today's_Most_Popular" target="_blank">collecting data</a> about you.</li>
</ol>
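<p>As a rough check on the “about 100 bases per second” figure in item 2, here is a minimal sketch of the arithmetic. It assumes a genome of roughly 3 billion bases and a 365-day year; both are round-number assumptions for illustration, not figures taken from the visualization itself.</p>

```python
# Back-of-the-envelope check of the display rate.
# ASSUMED_GENOME_BASES is an assumed round figure (~3 billion bases),
# not a number taken from the visualization.
ASSUMED_GENOME_BASES = 3_000_000_000
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a 365-day year


def bases_per_second(n_bases: int, seconds: int) -> float:
    """Average number of bases that must be shown each second."""
    return n_bases / seconds


rate = bases_per_second(ASSUMED_GENOME_BASES, SECONDS_PER_YEAR)
print(f"{rate:.1f} bases per second")  # prints "95.1 bases per second"
```

<p>Under those assumptions the rate comes out just under 100 bases per second, in line with the figure quoted above.</p>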
Obamacare is not going to solve the health care crisis, but a new initiative, led by a statistician, may help
2012-06-29T13:00:00+00:00
http://simplystats.github.io/2012/06/29/obamacare-is-not-going-to-solve-the-health-care-crisis
<p>Obamacare may help protect a vulnerable section of our population, but it does nothing to solve the real problem with health care in the US: it is unsustainably expensive and getting <strike>worst</strike> worse. In the graph below (left) per capita medical expenditures for several countries are plotted against time. The US is the black curve, other countries are in grey. On the right we see life expectancy plotted against per capita medical expenditure. Note that the US spends $8,000 per person on healthcare, more than any other country and about 40% more than Norway, the runner up. If the US spent the same as Norway per person, as a country we would save ~ 1 trillion $ per year. Despite the massive investment, life expectancy in the US is comparable to Chile’s, a country that spends about $1,500 per person. To make matters worse, politicians and pundits greatly oversimplify this problem by trying to blame their favorite villains while experts agree: no obvious solution exists.</p>
<p><img height="265" src="http://rafalab.jhsph.edu/simplystats/healthcare.jpg" width="511" /></p>
<p>This past Tuesday Johns Hopkins announced the launching of the <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/" target="_blank">Individualized Health Initiative</a>. This effort will be led by <a href="http://scholar.google.es/citations?user=mSO6jtEAAAAJ&hl=es" target="_blank">Scott Zeger</a>, a statistician and former chair of our department. The graphs and analysis shown above are from a presentation Scott has <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/inHealth.Overview.SLZ.June%201.2012.pdf" target="_blank">shared on the web</a>. The initiative’s goal is <span>to “discover, test, and implement health information tools that allow the individual to understand, track, and guide his or her unique health state and its trajectory over time”. In other words, by tailoring treatments and prevention schemes for individuals we can improve their health more effectively.</span></p>
<p>So how is this going to help solve the health care crisis? Scott explains that when it comes to health care, Hopkins is a self-contained microcosm: we are the patients (all employees), the providers (hospital and health system), and the insurer (Hopkins is self-insured, we are not insured by for-profit companies). And just like the rest of the country, we spend way too much per person on health care. Now, because we are self-contained, it is much easier for us to try out and evaluate alternative strategies than it is for, say, a state or the federal government. Because we are large, we can gather enough data to learn about relatively small strata. And with a statistician in charge, we will evaluate strategies empirically as opposed to ideologically. </p>
<p>Furthermore, because we are a University, we also employ Economists, Public Health Specialists, Ethicists, Basic Biologists, Engineers, Biomedical Researchers, and other scientists with expertise that seems indispensable for solving this problem. Under Scott’s leadership, I expect Hopkins to collect data more systematically, run well thought-out experiments to test novel ideas, leverage technology to improve diagnostics, and use existing data to create knowledge. Successful strategies may then be exported to the rest of the country. Part of the new institute’s mission is to incentivize our very creative community of academics to participate in this endeavor. </p>
Motivating statistical projects
2012-06-28T12:33:00+00:00
http://simplystats.github.io/2012/06/28/motivating-statistical-projects
<p>It seems like half of the battle in statistics is <a href="http://normaldeviate.wordpress.com/2012/06/21/90/" target="_blank">identifying an important/unsolved problem</a>. In math, this is easy, <a href="http://www.claymath.org/millennium/" target="_blank">they have a list</a>. So why is it harder for statistics? Since I have to <a href="http://simplystatistics.org/post/18493330661/statistics-project-ideas-for-students" target="_blank">think up projects to work on</a> for my research group, for classes I teach, and for exams we give, I have spent some time thinking about ways that research problems in statistics arise.</p>
<p><img height="517" src="http://biostat.jhsph.edu/~jleek/stat-projects.jpg" width="400" /></p>
<p>I borrowed a page out of Roger’s book and made a little diagram to illustrate my ideas (actually I can’t even claim credit, it was Roger’s idea to make the diagram). The diagram shows the rough relationship of science, data, applied statistics, and theoretical statistics. Science produces data (although there are other sources), the data are analyzed using applied statistical methods, and theoretical statistics concerns the math behind statistical methods. The dotted line indicates that theoretical statistics ostensibly generalizes applied statistical methods so they can be applied in other disciplines. I do think that this type of generalization is becoming harder and harder as theoretical statistics becomes farther and farther removed from the underlying science.</p>
<p>Based on this diagram I see three major sources for statistical problems: </p>
<ol>
<li><strong>Theoretical statistical problems.</strong> One component of statistics is developing the mathematical and foundational theory that proves we are doing sensible things. This type of problem often seems to be inspired by popular methods that exist/are developed but lack mathematical detail. Not surprisingly, much of the work in this area is motivated by what is mathematically possible or convenient, rather than by concrete questions that are of concern to the scientific community. This work is important, but the current distance between theoretical statistics and science suggests that the impact will be limited primarily to the theoretical statistics community. </li>
<li><strong>Applied statistics motivated by convenient sources of data.</strong> The best examples of this type of problem are the analyses in <a href="http://www.freakonomics.com/" target="_blank">Freakonomics</a>. Since both big data and <a href="http://simplystatistics.org/post/25924012903/the-problem-with-small-big-data" target="_blank">small big data</a> are now abundant, anyone with a laptop and an internet connection can download the <a href="http://books.google.com/ngrams/datasets" target="_blank">Google n-gram data</a>, a <a href="http://www.ncbi.nlm.nih.gov/geo/" target="_blank">microarray from GEO</a>, <a href="http://simplystatistics.org/post/15182715327/list-of-cities-states-with-open-data-help-me-find" target="_blank">data about your city</a>, or really <a href="http://www.factual.com/" target="_blank">data about anything</a> and perform an applied analysis. These analyses may not be straightforward for computational/statistical reasons and may even require the development of new methods. These problems are often very interesting/clever and so are often the types of analyses you hear about in newspaper articles about “Big Data”. But they may often be misleading or incorrect, since the underlying questions are not necessarily well founded in scientific questions. </li>
<li><strong>Applied statistics problems motivated by scientific problems. </strong>The final category of statistics problems consists of those that are motivated by concrete scientific questions. The new sources of big data don’t necessarily make these problems any easier. They still start with a specific question for which the data may not be convenient and the math is often intractable. But the potential impact of solving a concrete scientific problem is huge, especially if many people who are generating data have a similar problem. Some examples of problems like this are: can we tell if one <a href="http://en.wikipedia.org/wiki/Student's_t-test" target="_blank">batch of beer is better than another</a>, how are <a href="http://en.wikipedia.org/wiki/Analysis_of_variance" target="_blank">quantitative characteristics inherited from parent to child</a>, which <a href="http://en.wikipedia.org/wiki/Proportional_hazards_models" target="_blank">treatment is better when some people are censored</a>, how do we <a href="http://en.wikipedia.org/wiki/Bootstrapping_(statistics)" target="_blank">estimate variance when we don’t know the distribution of the data</a>, or how do we <a href="http://en.wikipedia.org/wiki/False_discovery_rate" target="_blank">know which variable is important when we have millions</a>? </li>
</ol>
<p>So this leads back to the question, what are the biggest open problems in statistics? I would define these problems as the “high potential impact” problems from category 3. To answer this question, I think we need to ask ourselves, what are the most common problems people are trying to solve with data but can’t with what is available right now? Roger nailed this when he talked about the role of <a href="http://simplystatistics.org/post/25643791866/statistics-and-the-science-club" target="_blank">statisticians in the science club</a>. </p>
<p>Here are a few ideas that could potentially turn into high-impact statistical problems, maybe our readers can think of better ones?</p>
<ol>
<li>How do we credential students taking online courses <a href="http://simplystatistics.org/post/16759359088/why-in-person-education-isnt-dead-yet-but-a" target="_blank">at a huge scale</a>?</li>
<li>How do we <a href="http://understandinguncertainty.org/" target="_blank">communicate risk</a> about personalized medicine (or anything else) to a general population without statistical training? </li>
<li>Can you use social media as a <a href="http://www.uvm.edu/~pdodds/files/papers/others/2011/moreno2011a.pdf" target="_blank">preventative health tool</a>?</li>
<li>Can we perform <a href="http://www.cabinetoffice.gov.uk/sites/default/files/resources/TLA-1906126.pdf" target="_blank">randomized trials to improve public policy</a>?</li>
</ol>
<div>
<em>Image Credits: The Science Logo is the old logo for the <a href="http://www.usu.edu/science/" target="_blank">USU College of Science</a>, the R is the logo for the <a href="http://www.r-project.org/" target="_blank">R statistical programming language</a>, the data image is a screenshot of <a href="http://www.gapminder.org/" target="_blank">Gapminder</a>, and the theoretical statistics image comes from the Wikipedia page on the <a href="http://en.wikipedia.org/wiki/Law_of_large_numbers" target="_blank">law of large numbers</a>.</em>
</div>
<div>
<strong>Edit</strong>: I just noticed <a href="http://www.pnas.org/content/early/2012/06/22/1205259109.abstract" target="_blank">this paper</a>, which seems to support some of the discussion above. On the other hand, I think just saying lots of equations = fewer citations falls into category 2 and doesn’t get at the heart of the problem.
</div>
The price of skepticism
2012-06-27T20:25:48+00:00
http://simplystats.github.io/2012/06/27/the-price-of-skepticism
<p>Thanks to <a href="http://www.johndcook.com/blog/" target="_blank">John Cook</a> for posting this:</p>
<blockquote>
<p><span>“If you’re only skeptical, then no new ideas make it through to you. You never can learn anything. You become a crotchety misanthrope convinced that nonsense is ruling the world.” – Carl Sagan</span></p>
</blockquote>
Follow up on "Statistics and the Science Club"
2012-06-27T12:58:40+00:00
http://simplystats.github.io/2012/06/27/follow-up-on-statistics-and-the-science-club
<p>I agree with Roger’s latest <a href="http://simplystatistics.org/post/25643791866/statistics-and-the-science-club" target="_blank">post</a>: “we<span> need to expand the tent of statistics and include people who are using their statistical training to lead the new science”. I am perhaps a bit more worried than Roger. </span>Specifically, I worry that talented go-getters interested in leading science via data analysis will achieve this without engaging our research community. </p>
<p>A quantitatively trained person (engineer, computer scientist, physicist, etc.) with strong computing skills (knows Python, C, and shell scripting) who reads, for example, “Elements of Statistical Learning” and learns R is well on their way. Eventually, many of these users of Statistics will become developers and if we don’t keep up then what do they need from us? Our already-written books may be enough. In fact, in genomics, I know several people like this who are already developing novel statistical methods. I want these researchers to be part of our academic departments. Otherwise, I fear we will not be in touch with the problems and data that lead to, quoting Roger, “the most exciting developments of our lifetime.” </p>
The problem with small big data
2012-06-26T12:56:13+00:00
http://simplystats.github.io/2012/06/26/the-problem-with-small-big-data
<p>There’s lots of talk about “big data” these days and I think that’s great. I think it’s bringing statistics out into the mainstream (even if they don’t call it statistics) and it’s creating lots of opportunities for people with statistics training. It’s one of the reasons we created this blog.</p>
<p>One thing that I think gets missed in much of the mainstream reporting is that, in my opinion, the biggest problems aren’t with the truly massive datasets out there that need to be mined for important information. Sure, those types of problems pose interesting challenges with respect to hardware infrastructure and algorithm design.</p>
<p>I think a bigger problem is what I call “small big data”. Small big data is the dataset that is collected by an individual whose data collection skills are far superior to his/her data analysis skills. You can think of the size of the problem as being measured by the ratio of the dataset size to the investigator’s statistical skill level. For someone with no statistical skills, any dataset represents “big data”.</p>
<p>These days, any individual can create a massive dataset with relatively few resources. In some of the work I do, we send people out with portable air pollution monitors that record pollution levels every 5 minutes over a 1-week period. People with fitbits can get highly time-resolved data about their daily movements. A single MRI can produce millions of voxels of data.</p>
<p>One challenge here is that these examples all represent datasets that are large “on paper”. That is, there are a lot of bits to store, but that doesn’t mean there’s a lot of useful information there. For example, I find people are often impressed by data that are collected with very high temporal or spatial resolution. But often, you don’t need that level of detail and can get away with coarser resolution over a wider range of scenarios. For example, if you’re interested in changes in air pollution exposure across seasons but you only measure people in the summer, then it doesn’t matter if you measure levels down to the microsecond and produce terabytes of data. Another example might be the idea that <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137276/?tool=pubmed" target="_blank">sequencing technology doesn’t in fact remove biological variability</a>, no matter how large a dataset it produces.</p>
<p>Another challenge is that the person who collected the data is often not qualified/prepared to analyze it. If the data collector didn’t arrange beforehand to have someone analyze the data, then they’re often stuck. Furthermore, usually the grant that paid for the data collection didn’t budget (enough) for the analysis of the data. The result is that there’s a lot of “small big data” that just sits around unanalyzed. This is an unfortunate circumstance, but in my experience quite common.</p>
<p>One conclusion we can draw is that we need to get more statisticians out into the field, both helping to analyze the data and, perhaps more importantly, designing good studies so that useful data are collected in the first place (as opposed to merely “big” data). But the sad truth is that there aren’t enough of us on the planet to fill the demand. So we need to come up with more creative ways to get the skills out there without requiring our physical presence.</p>
Hilary Mason: From Tiny Links, Big Insights
2012-06-26T10:01:59+00:00
http://simplystats.github.io/2012/06/26/hilary-mason-from-tiny-links-big-insights
<p><a href="http://www.businessweek.com/articles/2012-04-26/hilary-mason-from-tiny-links-big-insights">Hilary Mason: From Tiny Links, Big Insights</a></p>
The Evolution of Music
2012-06-25T18:43:43+00:00
http://simplystats.github.io/2012/06/25/the-evolution-of-music
<p><a href="http://news.sciencemag.org/sciencenow/2012/06/computer-program-evolves-music.html?ref=hp#.T-iSbWmcsYs.email">The Evolution of Music</a></p>
A specific suggestion to help recruit/retain women faculty at Hopkins
2012-06-25T12:59:26+00:00
http://simplystats.github.io/2012/06/25/a-specific-suggestion-to-help-recruit-retain-women
<p><span>A recent </span><a href="http://www.theatlantic.com/magazine/archive/2012/07/why-women-still-can-8217-t-have-it-all/9020/" target="_blank">article</a><span> by a former Obama administration official has stirred up debate over the obstacles women face in balancing work/life. This reminded me of this </span><a href="http://web.jhu.edu/administration/jhuoie/resources/women.html" target="_blank">report</a> written by a committee here at Hopkins to help resolve the current gender-based career obstacles for women faculty. The report is great, but in practice we have a long way to go. For example, my department has not hired a woman at the tenure track level in 15 years. This drought has not been for lack of trying as we have made several offers, but none have been accepted. One issue that has come up multiple times is “spousal hires”. Anecdotal evidence strongly suggests that in academia the “two body” problem is more common with women than men. As hard as my department has tried to find jobs for spouses, efforts are ad-hoc and we get close to no institutional support. As far as I know, as an institution, Hopkins allocates no resources to spousal hires. So, a tangible improvement we could make is changing this. Another specific improvement that many agree will help women is subsidized day care. The waiting list <a href="http://www.jhbrighthorizons.org/" target="_blank">here</a> is very long (as a result few of my colleagues use it) and one still has to pay more than $1,600 a month for infants.</p>
<p>These two suggestions are of course easier said than done as they both require $. Quite a bit, actually, and Hopkins is not rich <a href="http://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_the_United_States_by_endowment" target="_blank">compared to other well-known universities</a>. My suggestion is to <strong>get rid of the college tuition remission benefit</strong> for faculty. Hopkins covers half the college tuition for the children of all their employees. This perk helps male faculty in their 50s much more than it helps potential female recruits. So I say get rid of this benefit and use the $ for spousal hires and to further subsidize childcare.</p>
<p>It might be argued the tuition remission perk helps retain faculty, but the institution can invest in that retention on a case-by-case basis as opposed to giving the subsidy to everybody independent of merit. I suspect spousal hires and subsidized day care will be more attractive at the time of recruitment. </p>
<p>Although this post is Hopkins-specific I am sure similar reallocation of funds is possible in other universities.</p>
Sunday data/statistics link roundup (6/24)
2012-06-24T14:16:23+00:00
http://simplystats.github.io/2012/06/24/sunday-data-statistics-link-roundup-6-24
<ol>
<li>We’ve got a new domain! You can still follow us on tumblr or here: <a href="http://simplystatistics.org/" target="_blank">http://simplystatistics.org/</a>. </li>
<li>A <a href="http://www.fastcompany.com/1824499/sports-data-analytics-mit-sloan-goldsberry" target="_blank">cool article</a> on MIT’s annual sports statistics conference (via <a href="https://twitter.com/storeylab" target="_blank">@storeylab</a>). I love how the guy they chose to highlight created what I would consider a pretty simple visualization with known tools - but it turns out it is potentially a really new way of evaluating the shooting range of basketball players. This is my favorite kind of creativity in statistics.</li>
<li>This is an interesting article calling higher education a “<a href="http://nplusonemag.com/death-by-degrees" target="_blank">credentials cartel</a>”. I don’t know if I’d go quite that far; there are a lot of really good reasons for higher education institutions beyond credentialing like research, putting smart students together in classes and dorms, broadening experiences etc. But I still think there is room for a smart group of statisticians/computer scientists to solve the <a href="http://simplystatistics.org/post/16759359088/why-in-person-education-isnt-dead-yet-but-a" target="_blank">credentialing problem</a> on a big scale and have a huge impact on the education industry. </li>
<li>Check out John Cook’s <a href="http://www.johndcook.com/blog/2012/06/18/methods-that-get-used/" target="_blank">conjecture</a> on statistical methods that get used: “<span>The probability of a method being used drops by at least a factor of 2 for every parameter that has to be determined by trial-and-error.” I’m with you. I wonder if there is a corollary related to how easy the documentation is to read? </span></li>
<li>If you haven’t read Roger’s post on <a href="http://simplystatistics.org/post/25643791866/statistics-and-the-science-club" target="_blank">Statistics and the Science Club</a>, I consider it a must-read for anyone who is affiliated with a statistics/biostatistics department. We’ve had feedback by email/on twitter from other folks who are moving toward a more science oriented statistical culture. We’d love to hear from more folks with this same attitude/inclination/approach. </li>
</ol>
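<p>To make the conjecture in item 4 concrete, here is a small sketch that reads it as an upper bound: if a method with no hand-tuned parameters is used with some baseline probability, each trial-and-error parameter at least halves that. The function name and the baseline value are hypothetical, chosen only for illustration.</p>

```python
def usage_upper_bound(n_tuned_params: int, baseline: float = 1.0) -> float:
    """Upper bound on the probability a method gets used, reading the
    conjecture as 'at least a factor of 2 lost per tuned parameter'.

    `baseline` is a hypothetical probability for a method with no
    trial-and-error parameters; it is an assumption, not an estimate.
    """
    if n_tuned_params < 0:
        raise ValueError("n_tuned_params must be non-negative")
    return baseline / (2 ** n_tuned_params)


# Show how quickly the bound shrinks as tuned parameters are added.
for k in range(5):
    print(k, usage_upper_bound(k))
```

<p>By this reading, a method with four such parameters would be used at most one sixteenth as often as a parameter-free one.</p>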
Statistics and the Science Club
2012-06-22T13:24:59+00:00
http://simplystats.github.io/2012/06/22/statistics-and-the-science-club
<p>One of my favorite movies is Woody Allen’s <em>Annie Hall</em>. If you’re my age and you haven’t seen it, I usually tell people it’s like <em>When Harry Met Sally</em>, except really good. The movie <a href="http://www.youtube.com/watch?v=rrxlfvI17oY" target="_blank">opens with Woody Allen’s character Alvy Singer explaining</a> that he would “never want to belong to any club that would have someone like me for a member”, a quotation he attributes to Groucho Marx (or Freud).</p>
<p>Last week I <a href="http://simplystatistics.tumblr.com/post/25177731588/statisticians-asa-and-big-data" target="_blank">posted a link</a> to ASA President Robert Rodriguez’s column in Amstat News about big data. In the post I asked what was wrong with the column and there were a few good comments from readers. In particular, Alex wrote:</p>
<blockquote>
<p><span>When discussing what statisticians need to learn, he focuses on technological changes (distributed computing, Hadoop, etc.) and the use of unstructured text data. However, Big Data requires a change in perspective for many statisticians. Models must expand to address the levels of complexity that massive datasets can reveal, and many standard techniques are limited in utility.</span></p>
</blockquote>
<p><span>I agree with this, but I don’t think it goes nearly far enough. </span></p>
<p><span>The key element missing from the column was the notion that statistics should take a leadership role in this area. I was disappointed by the lack of a more expansive vision displayed by the ASA President and the ASA’s unwillingness to claim a leadership position for the field. Despite the name “big data”, big data is really about <em>statistics</em> and statisticians should really be out in front of the field. We should not be observing what is going on and adapting to it by learning some new technologies or computing techniques. If we do that, then as a field we are just leading from behind. Rather, we should be defining what is important and should be driving the field from both an educational and research standpoint. </span></p>
<p><span>However, the new era of big data poses a serious dilemma for the statistics community that needs to be addressed before real progress can be made, and that’s what brings me to Alvy Singer’s conundrum.</span></p>
<p><span>There’s a strong tradition in statistics of being the “outsiders” to whatever field we’re applying our methods to. In many cases, we are the outsiders to scientific investigation. Even if we are neck deep in collaborating with scientists and being involved in scientific work, we still maintain our ability to criticize and judge scientists because we are “outsiders” trained in a different set of (important) skills. In many ways, this is a Good Thing. </span>The outsider status is important because it gives us the freedom to be “arbiters” and to ensure that scientists are doing the “right” things. It’s our job to keep people honest. However, being an arbiter by definition means that you are merely observing what is going on. You cannot be leading what is going on without losing your ability to arbitrate in an unbiased manner.</p>
<p><span>Big data poses a challenge to this long-standing tradition because all of a sudden statistics and science are more intertwined than ever before and statistical methodology is absolutely critical to making inferences or gaining insight from data. Because now there are data in more places than ever before, the demand for statistics is in more places than ever before. We are discovering that we can either teach people to apply the statistical methods to their data, or we can just do it ourselves!</span></p>
<p><span>This development presents an enormous opportunity for statisticians to play a new leadership role in scientific investigations because we have the skills to extract information from the data that no one else has (at least <em>for the moment</em>). But now we have to choose between being “in the club” by leading the science or remaining outside the club to be unbiased arbiters. I think as an individual it’s very difficult to be both simply because there are only 24 hours in the day. It takes an enormous amount of time to learn the scientific background required to lead scientific investigations and this is piled on top of whatever statistical training you receive. </span></p>
<p><span>However, I think as a field, we desperately need to promote both kinds of people, if only because we are the best people for the job. We need to expand the tent of statistics and include people who are using their statistical training to lead the new science. They may not be publishing papers in the <em>Annals of Statistics</em> or in <em>JASA</em>, but they <em>are</em> statisticians. If we do not move more in this direction, we risk missing out on one of the most exciting developments of our lifetime.</span></p>
Pro Tips for Grad Students in Statistics/Biostatistics (Part 2)
2012-06-20T15:44:00+00:00
http://simplystats.github.io/2012/06/20/pro-tips-for-grad-students-in-statistics-biostatistics
<p>This is the second in my series on pro tips for graduate students in statistics/biostatistics. For more tips, see <a href="http://simplystatistics.tumblr.com/post/25368234643/pro-tips-for-grad-students-in-statistics-biostatistics" target="_blank">part 1</a>. </p>
<ol>
<li>Meet with seminar speakers. When you go on the job market face recognition is priceless. I met Scott Zeger at UW when I was a student. When I came for an interview I already knew him (and Ingo, and Rafa, and ….). An even better idea…<em>ask a question during the seminar</em>.</li>
<li>Be a finisher. The key to getting a Ph.D. (other than passing your quals) is the ability to sit down and just power through and get it done. This means sometimes you will have to work late or on a weekend. The people who are the most successful in grad school are the people that just find a way to get it done. If it was easy…anyone would do it.</li>
<li>Work on problems you genuinely enjoy thinking about/are<br />
passionate about. A lot of statistics (and science) is long periods of concentrated effort with no guarantee of success at the end. To be a really good statistician requires a lot of patience and effort. It is a lot easier to work hard on something you like or feel strongly about.</li>
</ol>
<div>
<span>More to come soon. </span>
</div>
E.P.A. Casts New Soot Standard as Easily Met
2012-06-19T15:53:47+00:00
http://simplystats.github.io/2012/06/19/e-p-a-casts-new-soot-standard-as-easily-met
<p><a href="http://green.blogs.nytimes.com/2012/06/15/e-p-a-casts-new-soot-standard-as-easily-met/?smid=tu-share">E.P.A. Casts New Soot Standard as Easily Met</a></p>
Pro Tips for Grad Students in Statistics/Biostatistics (Part 1)
2012-06-18T16:21:00+00:00
http://simplystats.github.io/2012/06/18/pro-tips-for-grad-students-in-statistics-biostatistics-2
<div>
I just finished teaching a Ph.D. level applied statistical methods course here at Hopkins. As part of the course, I gave one “pro-tip” a day; something I wish I had learned in graduate school that has helped me in becoming a practicing applied statistician. Here are the first three, more to come soon.
</div>
<ol>
<li>A major component of being a researcher is knowing what’s going on in the research community. Set up an RSS feed with journal articles. Google Reader is a good one, but there are others. Here are some good applied stat journals: Biostatistics, Biometrics, Annals of Applied Statistics…</li>
<li>Reproducible research is a hot topic, in part because of a couple of high-profile papers that were disastrously non-reproducible (see “<a href="http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1267453942" target="_blank">Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology</a>”). When you write code for statistical analysis try to make sure that: (a) It is neat and well-commented - liberal and specific comments are your friend. (b) It can be run by someone other than you, to produce the same results that you report.</li>
<li>In data analysis - particularly for complex high-dimensional<br />
data - it is frequently better to choose simple models for clearly defined parameters. With a lot of data, there is a strong temptation to go overboard with statistically complicated models; the danger of overfitting/over-interpreting is extreme. The most reproducible results are often produced by sensible and statistically “simple” analyses (Note: being sensible and simple does not always lead to higher profile results).</li>
</ol>
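<p>As a small illustration of tip 2, here is a minimal Python sketch of an analysis step that is commented and gives the same answer every time it is run. The bootstrap interval, the function name <code>bootstrap_mean_ci</code>, the seed value, and the data are all assumptions invented for this example; they are not taken from the course.</p>

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, seed=20120618):
    """Return a 95% percentile bootstrap interval for the mean of `values`.

    The seed is fixed so that anyone who reruns this gets the same interval.
    """
    rng = random.Random(seed)  # local generator, so no hidden global state
    means = []
    for _ in range(n_resamples):
        # Resample the data with replacement, keeping the original sample size.
        resample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(resample))
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, invented purely for illustration.
data = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9]
print(bootstrap_mean_ci(data))
```

<p>Passing the seed as an argument, rather than relying on global random state, is one simple way to let someone else reproduce the reported numbers exactly.</p>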
Sunday data/statistics link roundup (6/17)
2012-06-17T17:07:12+00:00
http://simplystats.github.io/2012/06/17/sunday-data-statistics-link-roundup-6-17
<p>Happy Father’s Day!</p>
<ol>
<li>A <a href="http://www.cabinetoffice.gov.uk/sites/default/files/resources/Final-Test-Learn-Adapt.pdf" target="_blank">really interesting read</a> on randomized controlled trials (RCTs) and public policy. The examples in the boxes are fantastic. This seems to be one of the cases where the public policy folks are borrowing ideas from Biostatistics, which has been involved in randomized controlled trials for a long time. It’s a cool example of adapting good ideas in one discipline to the specific challenges of another. </li>
<li>Roger points <a href="http://www.nytimes.com/2012/06/17/technology/acxiom-the-quiet-giant-of-consumer-database-marketing.html?_r=1" target="_blank">to this link</a> in the NY Times about the “Consumer Genome”, which basically is a collection of information about your purchases and consumer history. On Twitter, Leonid K. <a href="https://twitter.com/leonidkruglyak/status/214365264886759426" target="_blank">asks</a>: ‘<span>Since when has “genome” become a generic term for “a bunch of information”?’. I completely understand the reaction against the “genome of x”, which is an over-used analogy. I actually think the analogy isn’t that unreasonable; like a genome, the information contained in your purchase/consumer history says something about you, but doesn’t tell the whole picture. I wonder how this information could be used for public health, since it is already being used for advertising….</span></li>
<li><span>This <a href="http://peerj.com/" target="_blank">PeerJ journal</a> looks like it has the potential to be good. They even encourage open peer review, <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895" target="_blank">which has some benefits</a>. Not sure if it is sustainable, see for example, this <a href="http://scholarlykitchen.sspnet.org/2012/06/14/is-peerj-membership-publishing-sustainable/" target="_blank">breakdown of the costs</a>. I still think we <a href="http://simplystatistics.tumblr.com/post/19289280474/a-proposal-for-a-really-fast-statistics-journal" target="_blank">can do better</a>. </span></li>
<li>Elon Musk is one of my favorite entrepreneurs. He tackles what I consider to be some of the most awe-inspiring and important problems around. <a href="http://www.mercurynews.com/business/ci_20859341/tesla-model-s-motors-readies-launch-its-all-electric-sedan-preview?refresh=no" target="_blank">This article </a>about the Tesla S got me all fired up about how a person with vision can literally change the fuel we run on. Nothing to do with statistics, other than I think now is a similarly revolutionary time for our discipline. </li>
<li>There was some <a href="https://twitter.com/johnmyleswhite/status/212573099886002176" target="_blank">interesting</a> <a href="https://twitter.com/johnmyleswhite/status/212580188452683776" target="_blank">discussion</a> on Twitter of the usefulness of the Yelp dataset I posted for academic research. Not sure if this ever got resolved, but I think more and more as data sets from companies/startups become available, the terms of use for these data will be critical. </li>
<li>I’m still working on <a href="http://simplystatistics.tumblr.com/post/25177731588/statisticians-asa-and-big-data" target="_blank">Roger’s puzzle</a> from earlier this week. </li>
</ol>
Statisticians, ASA, and Big Data
2012-06-15T20:30:52+00:00
http://simplystats.github.io/2012/06/15/statisticians-asa-and-big-data
<p>Today I got my copy of Amstat News and eagerly opened it before I realized it was not the issue with the salary survey….</p>
<p>But the President’s Corner section had the <a href="http://magazine.amstat.org/blog/2012/05/31/prescorner/" target="_blank">following column on big data</a> by ASA president Robert Rodriguez.</p>
<blockquote>
<p><span>Big Data is big news. It is the focus of stories in </span><em>The New York Times</em><span> and the subject of technology blogs, business forums, and economic studies. This column describes how statisticians can prepare for opportunities in Big Data and explains the distinctive value our profession can provide.</span></p>
</blockquote>
<p>Here’s a homework assignment for you all: Please read the column and explain what’s wrong with it. I’ll post the answer in a (near) future post.</p>
Poison gas or...air pollution?
2012-06-12T15:07:33+00:00
http://simplystats.github.io/2012/06/12/poison-gas-or-air-pollution
<p>From our Beijing bureau, we have the following message from the U.S. embassy that was recently issued to U.S. citizens in China:</p>
<blockquote>
<p><span>The Embassy has received reports from U.S. citizens living and </span><span>traveling in Wuhan that the air quality in the city has been </span><span>particularly poor since yesterday morning. On June 11 at 16:20, the </span><span>Wuhan Environmental Protection Administrative Bureau posted information </span><span>about this on its website. Below is a translation of that information:</span></p>
<p><span>“Beginning on June 11, 2012 around 08:00 AM, the air quality inside </span><span>Wuhan appeared to worsen, with low visibility and burning smells. </span><span>According to city air data, starting at 07:00 AM this morning, the </span><span>density of the respiratory particulate matter increased in the air </span><span>downtown; it increased quickly after 08:00 AM. The density at 14:00 </span><span>approached 0.574mg/m3, a level that is deemed “serious” by national </span><span>standards. An analysis of the air indicates the pollution is caused </span><span>from burning of plant material northeast of Wuhan.</span></p>
</blockquote>
<p><span>It’s not immediately clear which pollutant they’re talking about, but it’s probably PM10 (particulate matter less than 10 microns in aerodynamic diameter). If so, that level is quite high—U.S. 24-hour average standards are at 0.15 mg/m3 (note that the reported level was an hourly level). </span></p>
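<p>To put the two numbers on the same scale, here is a quick Python sketch of the arithmetic. It only uses the two figures quoted above; the variable names are made up for the example, and since it compares an hourly reading with a 24-hour average standard, the ratio is a rough guide and not a formal exceedance.</p>

```python
# Hourly level reported in the Wuhan notice, in mg/m^3.
reported_mg_m3 = 0.574
# U.S. 24-hour average standard quoted in the post, in mg/m^3.
us_24h_standard_mg_m3 = 0.15


def mg_to_ug(concentration_mg_m3):
    """Convert a concentration from mg/m^3 to micrograms/m^3."""
    return concentration_mg_m3 * 1000


# How many times larger the hourly reading is than the 24-hour standard.
ratio = reported_mg_m3 / us_24h_standard_mg_m3

print(f"Reported: {mg_to_ug(reported_mg_m3):.0f} ug/m^3")
print(f"Standard: {mg_to_ug(us_24h_standard_mg_m3):.0f} ug/m^3")
print(f"Ratio:    {ratio:.1f}x")
```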
<blockquote>
<p><span>Our investigation of downtown’s districts, and based on reports from all </span><span>of Wuhan’s large industrial enterprises, have determined that that there </span><span>has not been any explosion, sewage release, leakage of any poisoning </span><span>gas, or any other type of urgent environmental accident from large </span><span>industrial enterprises. Nor is there burning of crops in the new city </span><span>area. News spread online of a chlorine leak from Qingshan or a boiler </span><span>explosion at Wuhan Iron and Steel Plant are rumors.</span></p>
</blockquote>
<p><span></span>So, this is not some terrible incident, it’s just the usual smell. Good to know.</p>
<blockquote>
<p><span>According to our investigation, the abnormal air quality in our city is </span><span>mainly caused by the burning of the crops northeast of Wuhan towards </span><span>Hubei province. Similar air quality is occurring in Jiangsu, Henan and </span><span>Anhui provinces, as well as in Xiaogan, Jingzhou, Jingmen and Xiantao, </span><span>cities nearby Wuhan.</span></p>
</blockquote>
<blockquote>
<p><span></span>The weather forecast authority of the city has advised that recent weather conditions have not been good for the dispersion of pollutants.”</p>
</blockquote>
<p>The embassy goes on to warn:</p>
<blockquote>
<p>U.S. citizens are reminded that air pollution is a significant problem in many cities and regions in China. Health effects are likely to be more severe for sensitive populations, including children and older adults. While the quality of air can differ greatly between cities or between urban and rural areas, U.S. citizens living in or traveling to China may wish to consult their doctor when living in or prior to traveling to areas with significant air pollution.</p>
</blockquote>
Big Data Needs May Create Thousands Of Tech Jobs
2012-06-12T13:01:00+00:00
http://simplystats.github.io/2012/06/12/big-data-needs-may-create-thousands-of-tech-jobs
<p><a href="http://www.npr.org/2012/06/07/154485152/big-data-may-create-thousands-of-industry-jobs">Big Data Needs May Create Thousands Of Tech Jobs</a></p>
Green: E.P.A. Soot Rules Expected This Week
2012-06-12T11:35:59+00:00
http://simplystats.github.io/2012/06/12/green-e-p-a-soot-rules-expected-this-week
<p><a href="http://green.blogs.nytimes.com/2012/06/11/e-p-a-soot-rules-expected-this-week/">Green: E.P.A. Soot Rules Expected This Week</a></p>
Chris Volinsky knows where you are
2012-06-11T14:52:32+00:00
http://simplystats.github.io/2012/06/11/chris-volinsky-knows-where-you-are
<p><a href="http://mobile.nj.com/advnj/db_272903/contentdetail.htm?contentguid=EkXn77Ya&full=true#display">Chris Volinsky knows where you are</a></p>
Getting a grant...or a startup
2012-06-11T12:47:01+00:00
http://simplystats.github.io/2012/06/11/getting-a-grant-or-a-startup
<p><a href="http://ycombinator.com/index.html" target="_blank">Y Combinator</a> is a company that invests in startups and brings them to the San Francisco area to get them ready for prime time. One of the co-founders is <a href="http://paulgraham.com/" target="_blank">Paul Graham</a>, whose essays we’ve featured on this blog.</p>
<p>The Y Combinator web site itself is quite interesting and in particular, the section on <a href="http://ycombinator.com/howtoapply.html" target="_blank">how to apply to Y Combinator</a> caught my eye. Now, I don’t know the first thing about starting a startup (nor do I have any current interest in doing so), but I do know a little bit about applying for NIH grants and it struck me that the advice for the startups seemed very useful for writing grants. It surprised me because I always thought that the process of “marketing” a startup to someone would be quite different from applying for a grant: startups are supposed to be cool and innovative and futuristic while grants are more about doing the usual thing. Just shows you how much I know about the startup world.</p>
<p>I thought I’d pluck out a few good parts from Graham’s long list of advice that I found useful. The full essay is definitely worth reading.</p>
<p>Here’s one that struck me immediately:</p>
<blockquote>
<p><span>If we get 1000 applications and have 10 days to read them, we have to read about 100 a day. That means a YC partner who reads your application will on average have already read 50 that day and have 50 more to go. Yours has to stand out. So you have to be exceptionally clear and concise. Whatever you have to say, give it to us right in the first sentence, in the simplest possible terms.</span></p>
</blockquote>
<p>In the past, I always thought that grant reviewers had all the time in the world to read my grant and probably dedicated a week of their life to reading it. Hah! Having served on study sections now, I realize there’s precious little time to dedicate to the tall pile of grants that need to be read. Grants that are well written are a pleasure to read. Ones that are poorly written (or take forever to get to the point) just make me angry.</p>
<blockquote>
<p>It’s a mistake to use marketing-speak to make your idea sound more exciting. We’re immune to marketing-speak; to us it’s just noise. So don’t begin…with something like</p>
<blockquote>
<p>We are going to transform the relationship between individuals and information.</p>
</blockquote>
<p><span>That sounds impressive, but it conveys nothing. It could be a description of any technology company. Are you going to build a search engine? Database software? A router? I have no idea.</span></p>
<p>One test of whether you’re explaining your idea effectively is to ask how close the reader is to reproducing it. After reading that sentence I’m no closer than I was before, so its content is effectively zero.</p>
</blockquote>
<p>I usually tell people if at any stage of writing a grant you have a choice between being more general and more specific, always be more specific. That way people can judge you based on the facts, not based on their imagination of the facts. This doesn’t always lead to success, of course, but it can remove an element of chance. If a reviewer has to fill in the details of your idea, who knows what they’ll think of?</p>
<blockquote>
<p><span>One reason [company] founders resist giving matter-of-fact descriptions [of their company] is that they seem to constrain your potential. “But [my product] is so much more than a database with a wiki UI!” The problem is, the less constraining your description, the less you’re saying. So it’s better to err on the side of matter-of-factness.</span></p>
</blockquote>
<p><span>Of course, there are some applications that specifically ask you to “think big” and there the rules may be a bit different. But still, I think it’s better to avoid broad and sweeping generalities. These days, given the relatively tight page limits, you need to convey the maximum amount of information possible.</span></p>
<blockquote>
<p><span>One good trick for describing a project concisely is to explain it as a variant of something the audience already knows. It’s like Wikipedia, but within an organization. It’s like an answering service, but for email. It’s eBay for jobs. This form of description is wonderfully efficient. Don’t worry that it will make your idea seem “derivative.” Some of the best ideas in history began by sticking together two existing ideas no one realized could be combined.</span></p>
</blockquote>
<p>Not sure this is so relevant to writing grants, but I thought it was interesting. My instinct was to think that this would make your idea seem derivative also, but maybe not.</p>
<blockquote>
<p>…if we can see obstacles to your idea that you don’t seem to have considered, that’s a bad sign. This is your idea. You’ve had days, at least, to think about it, and we’ve only had a couple minutes. We shouldn’t be able to come up with objections you haven’t thought of.</p>
<p>Paradoxically, it is for this reason better to disclose all the flaws in your idea than to try to conceal them. If we think of a problem you don’t mention, we’ll assume it’s because you haven’t thought of it. </p>
</blockquote>
<p>This one is definitely true: it is better to reveal limitations/weaknesses than to look like you haven’t thought of them. Because if a reviewer finds one, then it’s all they’ll talk about. Oftentimes, a big problem is lack of space to fit this in, but if you can do it I think it’s always a good idea to include it.</p>
<p>Finally,</p>
<blockquote>
<p><span>You don’t have to sell us on you. We’ll sell ourselves, if we can just understand you. But every unnecessary word in your application subtracts from the effect of the necessary ones. So before submitting your application, print it out and take a red pen and cross out every word you don’t need. And in what’s left be as specific and as matter-of-fact as you can.</span></p>
</blockquote>
<p><span>I think there are quite a few differences between scientists reviewing grants and startup investors and we probably shouldn’t take the parallels too seriously. In particular, investors I think are going to be more optimistic because, as Graham says, “they get equity”. Scientists are trained to be skeptical and so will be looking at applications with a slightly different eye. </span></p>
<p><span>However, I think the general advice to be specific and concise about what you’re doing is good. If anything, it may help you realize that you have no idea what you’re doing.</span></p>
Sunday data/statistics link roundup (6/10)
2012-06-10T21:31:47+00:00
http://simplystats.github.io/2012/06/10/sunday-data-statistics-link-roundup-6-10
<ol>
<li> Yelp put a <a href="http://www.yelp.com/academic_dataset" target="_blank">data set online</a> for people to play with, including reviews, star ratings, etc. This could be a really neat data set for a student project. The data they have made available focuses on the area around 30 universities. My <a href="http://www.washington.edu/" target="_blank">alma mater</a> is one of them. </li>
<li>A sort of <a href="http://fhuszar.blogspot.co.uk/2012/06/how-data-scientist-decides-when-to-get.html" target="_blank">goofy talk</a> about how to choose the optimal marriage partner when viewing the problem as an optimal stopping problem. The author suggests that you need to date around <span>196,132 partners to make sure you have made the optimal decision. Fortunately for the Simply Statistics authors, it took many fewer for us all to end up with our optimal matches. Via <a href="https://twitter.com/#!/fhuszar" target="_blank">@fhuszar</a>.</span></li>
<li>An <a href="http://www.nytimes.com/2012/06/10/business/essay-grading-software-as-teachers-aide-digital-domain.html?_r=1&emc=eta1" target="_blank">interesting article</a> on the recent <a href="http://www.kaggle.com/c/asap-aes" target="_blank">Kaggle contest</a> that sought to identify statistical algorithms that could accurately match human scoring of written essays. Several students in my advanced biostatistics course competed in this competition and did quite well. I understand the need for these kinds of algorithms, since it takes a huge amount of human labor to score these essays well. But it also makes me a bit sad since it still seems even the best algorithms will have a hard time scoring creativity. For example, this phrase from my favorite president, doesn’t use big words, but it sure is clever, “<span class="huge">I think there is only one quality worse than hardness of heart and that is softness of head.”</span><span><br /></span></li>
<li><span class="huge">A really good article by friend of the blog, Steven, on the <a href="http://www.nature.com/clpt/journal/v91/n6/abs/clpt20126a.html" target="_blank">perils of gene patents</a>. This part sums it up perfectly, “</span><span>Genes are not inventions. This simple </span><span>fact, which no serious scientist would </span><span>dispute, should be enough to rule them </span><span>out as the subject of patents.” Simply Statistics has weighed in on this issue a <a href="http://www.nature.com/nature/journal/v484/n7394/full/484318a.html" target="_blank">couple</a> of <a href="http://simplystatistics.tumblr.com/post/14135999782/the-supreme-courts-interpretation-of-statistical" target="_blank">times</a> <a href="http://simplystatistics.tumblr.com/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court" target="_blank">before</a>. But I think in light of 23andMe’s recent Parkinson’s patent it bears repeating. <a href="http://www.genomicslawreport.com/index.php/2012/06/01/patenting-and-personal-genomics-23andme-receives-its-first-patent-and-plenty-of-questions/" target="_blank">Here</a> is an awesome summary of the issue from Genomics Lawyer.</span></li>
<li><span><a href="http://simplystatistics.tumblr.com/post/19289280474/a-proposal-for-a-really-fast-statistics-journal" target="_blank">A proposal</a> for a really fast statistics journal I wrote about a month or two ago. Expect more on this topic from me this week. </span></li>
</ol>
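<p>For readers curious about the optimal stopping framing in item 2, here is a minimal Python sketch of the classic “secretary problem” rule: skip roughly the first n/e candidates, then pick the first one better than everyone seen so far. This is the textbook version, which may well differ from the model in the linked talk (it does not reproduce the 196,132 figure); the function names, the seed, and the choice of 100 candidates are assumptions for the example.</p>

```python
import math
import random


def pick_with_cutoff(scores, cutoff):
    """Skip the first `cutoff` candidates, then take the first who beats them all."""
    best_seen = max(scores[:cutoff], default=float("-inf"))
    for score in scores[cutoff:]:
        if score > best_seen:
            return score
    return scores[-1]  # nobody beat the benchmark, so settle for the last one


def success_rate(n_candidates=100, n_trials=10000, seed=610):
    """Estimate how often the cutoff rule ends up with the single best candidate."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    cutoff = round(n_candidates / math.e)
    wins = 0
    for _ in range(n_trials):
        # A random arrival order of distinct scores 0 .. n_candidates - 1.
        scores = rng.sample(range(n_candidates), n_candidates)
        if pick_with_cutoff(scores, cutoff) == n_candidates - 1:
            wins += 1
    return wins / n_trials


print(success_rate())
```

<p>In theory this rule finds the best candidate a little more than a third of the time (the limit is 1/e), which the simulation should roughly match.</p>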
China Asks Embassies to Stop Measuring Air Pollution
2012-06-05T17:02:06+00:00
http://simplystats.github.io/2012/06/05/china-asks-embassies-to-stop-measuring-air-pollution
<p><a href="http://www.nytimes.com/2012/06/06/world/asia/china-asks-embassies-to-stop-measuring-air-pollution.html?smid=tu-share">China Asks Embassies to Stop Measuring Air Pollution</a></p>
How Big Data Gets Real
2012-06-04T14:24:25+00:00
http://simplystats.github.io/2012/06/04/how-big-data-gets-real
<p><a href="http://bits.blogs.nytimes.com/2012/06/04/how-big-data-gets-real/">How Big Data Gets Real</a></p>
Interview with Amanda Cox - Graphics Editor at the New York Times
2012-06-01T14:57:00+00:00
http://simplystats.github.io/2012/06/01/interview-with-amanda-cox-graphics-editor-at-the-new
<div class="im">
<strong>Amanda Cox </strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong><img height="294" src="http://biostat.jhsph.edu/~jleek/cox.jpg" width="200" /></strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong><br /></strong>Amanda Cox received her M.S. in statistics from the University of Washington in 2005. She then moved to the New York Times, where she is a graphics editor. She, and the graphics team at the New York Times, are responsible for many of the cool, informative, and interactive graphics produced by the Times. For example, <a href="http://www.nytimes.com/interactive/2009/11/06/business/economy/unemployment-lines.html" target="_blank">this</a>, <a href="http://www.nytimes.com/interactive/2009/07/02/business/economy/20090705-cycles-graphic.html" target="_blank">this</a> and <a href="http://www.nytimes.com/interactive/2010/02/26/sports/olympics/20100226-olysymphony.html" target="_blank">this</a> (the last one, Olympic Symphony, is one of my all time favorites).
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong>You have a background in statistics, do you consider yourself a statistician? Do you consider what you do statistics?</strong><br /><span></span>
</div>
<div class="im">
<span><br /></span>
</div>
<div class="im">
<span>I don’t deal with uncertainty in a formal enough way to call what I do statistics, or myself a statistician. (My technical title is “graphics editor,” but no one knows what this means. On the good days, what we do is “journalism.”) Mark Hansen, a statistician at UCLA, has possibly changed my thinking on this a little bit though, by asking who I want to be the best at visualizing data, if not statisticians.</span>
</div>
<div class="im">
<span><br /></span>
</div>
<div class="im">
<strong>How did you end up at the NY Times?</strong>
</div>
<p><span>In the middle of my first year of grad school (in statistics at the University of Washington), I started applying for random things. One of them was to be a </span><a href="http://www.nytimes-internship.com/" target="_blank">summer intern</a><span> in the graphics department at the Times.</span></p>
<p><strong><span>How are the graphics and charts you develop different than </span><span>producing graphs for a quantitative/scientific audience?</span></strong></p>
<div class="im">
<span><br /></span>
</div>
<div class="im">
<span>“Feels like homework” is a really negative reaction to a graphic or a story here. In practice, that means a few things: we don’t necessarily assume our audience already cares about a topic. We try to get rid of jargon, which can be useful shorthand for technical audiences, but doesn’t belong in a newspaper. Most of our graphics can stand on their own, meaning you shouldn’t need to read any accompanying text to understand the basic point. Finally, we probably pay more attention to things like typography and design, which, done properly, are really about hierarchy and clarity, and not just about making things cute. </span>
</div>
<p><strong><span><br /></span></strong></p>
<p><strong><span>How do you use R to prototype graphics? </span></strong></p>
<p><span>I sketch in R, which mostly just means reading data, and trying on different forms or subsets or levels of aggregation. It’s nothing fancy: usually just points and lines and text from base graphics. For print, I will sometimes clean up a pdf of R output in Illustrator. You can see some of that in practice at </span><a href="http://chartsnthings.tumblr.com/" target="_blank">chartsnthings.tumblr.com</a><span>, which is where one of my colleagues, Kevin Quealy, posts some of the department’s sketches. (Kevin and I are the only regular R users here, so the amount of R used on chartsnthings is not at all representative of NYT graphics as a whole.)</span></p>
<p><strong><span>Do you have any examples where the R version and the eventual final web version are nearly identical?</span></strong></p>
<p><span>Real interactivity changes things, so my use of R for web graphics is mostly just a proof-of-concept thing. </span><span>(Sometimes I will also generate “poor-man’s interactivity,” which means hitting the pagedown key on a pdf of charts made in a for loop.) But here are a couple of proof-of-concept sketches, where the initial R output doesn’t look so different from the final web version.</span></p>
<p><a href="http://www.nytimes.com/interactive/2009/11/06/business/economy/unemployment-lines.html" target="_blank">The Jobless Rate for People Like You</a></p>
<p><img height="354" src="http://biostat.jhsph.edu/~jleek/jobless.png" width="400" /></p>
<p><a href="http://www.nytimes.com/interactive/2009/07/31/business/20080801-metrics-graphic.html" target="_blank">How Different Groups Spend Their Day</a></p>
<p><img src="http://biostat.jhsph.edu/~jleek/groups.png" alt="" /></p>
<p><strong><span>You consistently produce arresting and informative graphics about </span><span>a range of topics. How do you decide on which topics to tackle?</span></strong></p>
<p><span>News value and interestingness are probably the two most important criteria for deciding what to work on. In an ideal world, you get both, but sometimes, one is enough (or the best you can do).</span></p>
<p><strong><span>Are your project choices motivated by availability of data?</span></strong></p>
<p><span>Sure. The availability of data also affects the scope of many projects. For example, the guys who work on our live election results will probably map them by county, even though precinct-level results are </span><a href="http://www.stanford.edu/~jrodden/jrhome_files/electiondata.htm" target="_blank">so much better</a><span>. But precinct-level data isn’t generally available in real time.</span></p>
<p><strong><span>What is the typical turn-around time from idea to completed project?</span></strong></p>
<p><span>The department is most proud of some of its one-day, breaking news work, but very little of that is what I would think of as data-heavy. The real answer to “how long does it take?” is “how long do we have?” Projects always find ways to expand to fill the available space, which often ranges from a couple of days to a couple of weeks.</span></p>
<p><span><br /></span></p>
<p><strong><span>Do you have any general principles for how you make complicated </span><span>data understandable to the general public?</span></strong></p>
<div class="im">
</div>
<p><span>I’m a big believer in learning by example. If you annotate three points in a scatterplot, I’m probably good, even if I’m not super comfortable reading scatterplots. I also think the words in a graphic should highlight the relevant pattern, or an expert’s interpretation, and not merely say “Here is some data.” The annotation layer is critical, even in a newspaper (where the data is not usually super complicated).</span></p>
<p><strong><span>What do you consider to be the most informative graphical elements or interactive features that you consistently use?</span></strong></p>
<p><span>I like sliders, because there’s something about them that suggests story (beginning-middle-end), even if the thing you’re changing isn’t time. Using movement in a way that means something, like </span><a href="http://www.nytimes.com/packages/html/newsgraphics/pages/hp/2008/2008-06-03-1800.html" target="_blank">this</a><span> or </span><a href="http://www.nytimes.com/interactive/2009/07/02/business/economy/20090705-cycles-graphic.html" target="_blank">this</a><span>, is still also fun, because it takes advantage of one of the ways the web is different from print.</span></p>
Writing software for someone else
2012-05-31T12:27:06+00:00
http://simplystats.github.io/2012/05/31/writing-software-for-someone-else
<p><a href="http://www.johndcook.com/blog/2012/05/30/writing-software-for-someone-else/">Writing software for someone else</a></p>
Why "no one reads the statistics literature anymore"
2012-05-30T12:54:00+00:00
http://simplystats.github.io/2012/05/30/why-no-one-reads-the-statistics-literature-anymore
<p>Spurred by Rafa’s post on <a href="http://simplystatistics.tumblr.com/post/23674712262/how-do-we-evaluate-statisticians-working-in-genomics" target="_blank">evaluating statisticians working in genomics</a>, there’s an interesting <a href="https://groups.google.com/group/reproducible-research/browse_thread/thread/7a8da11209cec2f2" target="_blank">discussion</a> going on at the Scientists for Reproducible Research group on statistics journals. Evan Johnson kicks it off:</p>
<blockquote>
<p><span>…our statistics journals have </span><span>little impact on how genomic data are analyzed. My group rarely looks </span><span>to publish in statistics journals anymore because even IF we can get </span><span>it published quickly, NO ONE will read it, so the only things we send </span><span>there anymore are things that we don’t care if anyone ever uses.</span></p>
</blockquote>
<p>Evan continues:</p>
<blockquote>
<p><span>It’s </span><span>crazy to me that all of our statistical journals are barely even </span><span>noticed by bioinformaticians, computational biologists, and by people </span><span>in genomics. Even worse, very few non-statisticians in genomics ever </span><span>try to publish in our journals. Ultimately, this represents a major </span><span>failure in the statistical discipline to be collectively influential </span><span>on how genomic data are analyzed. </span></p>
</blockquote>
<p>I may agree with the first point but I’m not sure I agree with the second. Regarding the first, I think <a href="http://simplystatistics.tumblr.com/post/23674712262/how-do-we-evaluate-statisticians-working-in-genomics#comment-538117393" target="_blank">Karl put it best</a> in that really the problem is that “the bulk of the people who might benefit from my method do not read the statistical literature”. For the second point, I think the issue is that the way science works is changing. Here’s my cartoon of how science worked in the “old days”, say, the pre-computer era:</p>
<p><img src="http://media.tumblr.com/tumblr_m4u5m65EJP1r08wvg.png" alt="" /></p>
<p>The idea here is that scientists worked with statisticians (they may have been one and the same) to publish stat papers and scientific papers. If Scientist A saw a paper in a domain journal written by Scientist B using a method developed by Statistician C, how could Scientist A apply that method? He had to talk to Statistician D, who would read the statistics literature and find Statistician C’s paper to learn about the method. The point is that there is no direct link from Scientist A to Statistician C except through statistics journals. Therefore, it was critical for Statistician C to publish in the stat journals to ensure that there would be an impact on scientists.</p>
<p>My cartoon of the “new way” of doing things is below.</p>
<p><img src="http://media.tumblr.com/tumblr_m4u5s53fSy1r08wvg.png" alt="" /></p>
<p>Now, if Scientist A wants to use a method developed by Statistician C (and used by Scientist B), he simply finds the software developed by Statistician C and applies it to his data. Here, there is a direct connection between A and C through software. If Statistician C wants his method to have an impact on scientists, there are two options: publish in stat journals and hope that the method filters through other statisticians, or publish in domain journals <em>with software</em> so that other scientists may apply the method directly. It seems the latter approach is more popular in some areas.</p>
<p>Peter Diggle makes an <a href="http://simplystatistics.tumblr.com/post/23674712262/how-do-we-evaluate-statisticians-working-in-genomics#comment-538227946" target="_blank">important point</a> about generalized linear models and the seminal book written by McCullagh and Nelder:</p>
<blockquote>
<p><span>the book [by McCullagh and Nelder] would have been read by many fewer people if Nelder and colleague had not embedded the idea in software that (for the time) was innovative in being interactive rather than batch-oriented.</span></p>
</blockquote>
<p>For better or for worse (and probably very often for worse), the software allowed many many people access to the methods.</p>
<p>The supposed attraction of publishing a statistical method in a statistics journal like JASA or JRSS-B is that the methods are published in a more abstract manner (usually using mathematical symbols) in the hopes that the methods will be applicable to a wide array of problems, not just the problem for which it was developed. Of course, the flip side of this argument is, as <a href="http://simplystatistics.tumblr.com/post/23674712262/how-do-we-evaluate-statisticians-working-in-genomics#comment-538117393" target="_blank">Karl says</a>, again eloquently, “<span>if you don’t get down to the specifics of a particular data set, then you haven’t really solved </span><em>any</em><span> problem”.</span></p>
<p>I think abstraction is important and we need to continue publishing those kinds of ideas. However, I think there is one key point that the statistics community has had difficulty grasping, which is that <strong>software represents an important form of abstraction</strong>, if not the most important form. Anyone who has written software knows that there are many approaches to implementing your method in software and various levels of abstraction one can use. The variety of problems to which the software can be applied depends on how general the interface to your software is. This is why I always encourage people to write R packages because it often forces them to think a bit more abstractly about who might be using the software.</p>
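<p>As a rough illustration of how the generality of an interface determines where software can be applied, here is a minimal Python sketch. Both function names and the data layout are hypothetical and chosen only for this example; they do not come from any package discussed here.</p>

```python
from statistics import mean, stdev


def standardize_gene_expression(rows):
    """Narrow interface: assumes a list of dicts with an 'expression' key."""
    values = [row["expression"] for row in rows]
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def standardize(values):
    """General interface: accepts any iterable of numbers."""
    values = list(values)
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical toy data: the same method, reached through each interface.
rows = [{"expression": 1}, {"expression": 2}, {"expression": 3}]
narrow = standardize_gene_expression(rows)
general = standardize(row["expression"] for row in rows)
```

<p>The arithmetic is identical in both functions. Only the second can be reused on data that is not laid out as gene-expression records, which is the sense in which the interface, rather than the method, limits the range of problems.</p>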
<p>Whither the statistics journals? It’s hard to say. Having them publish more software probably won’t help as the audience remains the same. I’m a bit stumped here but I look forward to continued discussion!</p>
View my Statistics for Genomics lectures on Youtube and ask questions on facebook/twitter
2012-05-29T17:12:51+00:00
http://simplystats.github.io/2012/05/29/view-my-statistics-for-genomics-lectures-on-youtube-and
<p>This year I recorded my lectures during my Statistics for Genomics course. Slowly but surely I am putting all the videos on Youtube. Links will eventually be <a href="http://rafalab.jhsph.edu/688/" target="_blank">here</a> (all slides and the first lecture are already up). As new lectures become available I will post updates on <a href="https://www.facebook.com/pages/RafaLab/144709675562592" target="_blank">rafalab’s facebook page</a> and <a href="https://twitter.com/#!/rafalab" target="_blank">twitter feed</a> where I will answer questions posted as comments (time permitting). Guest lecturers include Jeff Leek, Ben Langmead, Kasper Hansen and Hongkai Ji.</p>
Schlep blindness in statistics
2012-05-28T14:19:24+00:00
http://simplystats.github.io/2012/05/28/schlep-blindness-in-statistics
<p>This is yet another <a href="http://paulgraham.com/schlep.html" target="_blank">outstanding post</a> by Paul Graham, this time on “Schlep Blindness”. He talks about how there are great startup ideas that no one considers because they are too much of a “schlep” (a tedious unpleasant task). He talks about how most founders of startups want to put up a clever bit of code they wrote and just watch the money flow in. But of course it doesn’t work like that, you need to advertise, interact with customers, raise money, go out and promote your work, fix bugs at 3am, etc. </p>
<p>In academia there is a similar tendency to avoid projects that involve a big schlep. For example, it is relatively straightforward to develop a mathematical model, work out the parameter estimates, and write a paper. But it is a big schlep to then write fast code that implements that method, debug the code, dummy-proof the code, fix bugs submitted by users, etc. <a href="http://simplystatistics.tumblr.com/post/23674712262/how-do-we-evaluate-statisticians-working-in-genomics" target="_blank">Rafa’s post</a>, <a href="http://simplystatistics.tumblr.com/post/22844703875/ha" target="_blank">Hadley’s interview</a>, and the discussion Rafa <a href="https://groups.google.com/forum/?fromgroups#!topic/reproducible-research/eo2hEgnOwvI" target="_blank">linked to</a> all allude to this issue, particularly the fact that the schlep, the long slow slog of going through a new data type or writing a piece of usable software, is somewhat undervalued. </p>
<p>I think part of the problem is our academic culture and heritage, which has traditionally put a very high premium on being clever and a relatively low premium on being willing to go through the schlep. As applied statistics touches more areas and the number of users of statistical software and ideas grows, the schlep becomes just as important as the clever idea. If you aren’t willing to put in the time to code your methods up and make them accessible to other investigators, then who will be? </p>
<p>To bring this back to the discussion inspired by Rafa’s post, I wonder if applied statistics journals could increase their impact, encourage more readership from scientific folks, and support a broader range of applied statisticians if there was a re-weighting of the importance of cleverness and schlep? As Paul points out: </p>
<blockquote>
<p><span> In addition to their intrinsic value, they’re like undervalued stocks in the sense that there’s less demand for them among founders. If you pick an ambitious idea, you’ll have less competition, because everyone else will have been frightened off by the challenges involved.</span></p>
</blockquote>
Sunday data/statistics link roundup (5/27)
2012-05-27T17:09:06+00:00
http://simplystats.github.io/2012/05/27/sunday-data-statistics-link-roundup-5-27
<ol>
<li>Amanda Cox <a href="http://chartsnthings.tumblr.com/post/23348191031/amanda-cox-and-countrymen-chart-the-facebook-i-p-o" target="_blank">on the process</a> they went through to come up with <a href="http://www.nytimes.com/interactive/2012/05/17/business/dealbook/how-the-facebook-offering-compares.html" target="_blank">this graphic</a> about the Facebook IPO. So cool to see how R is used in the development process. A favorite quote of mine, “<span>But rather than bringing clarity, it just sort of looked chaotic, even to the seasoned chart freaks of 620 8th Avenue.” One of the more interesting things about posts like this is you get to see how statistics versus a deadline works. This is typically the role of the analyst, since they come in late and there is usually a deadline looming…</span></li>
<li><span>An interview <a href="http://www.readability.com/read?url=http%3A//m.theatlantic.com/business/archive/2012/05/the-golden-age-of-silicon-valley-is-over-and-were-dancing-on-its-grave/257401/" target="_blank">with Steve Blank</a> about Silicon Valley and how venture capitalists (VCs) are focused on social technologies since they can make a profit quickly. A depressing/fascinating quote from this one is, “</span><span>If I have a choice of investing in a blockbuster cancer drug that will pay me nothing for ten years, at best, whereas social media will go big in two years, what do you think I’m going to pick? If you’re a VC firm, you’re tossing out your life science division.” He also goes on to say thank goodness for the NIH, NSF, and Google who are funding interesting “real science” problems. This probably deserves its own post later in the week, the difference between analyzing data because it will make money and analyzing data to solve a hard science problem. The latter usually takes way more patience and the data take much longer to collect. </span></li>
<li><span><a href="http://blog.optimizely.com/how-obama-raised-60-million-by-running-an-exp" target="_blank">An interesting post</a> on how Obama’s analytics department <a href="http://en.wikipedia.org/wiki/A/B_testing" target="_blank">ran an A/B test </a>which improved the number of people who signed up for his mailing list. I don’t necessarily agree with their claim that they helped raise $60 million; there may be some confounding factors that mean that the individuals who sign up with the best combination of image/button don’t necessarily donate as much. But still, an interesting look into <a href="http://simplystatistics.tumblr.com/post/10809464773/why-does-obama-need-statisticians" target="_blank">why Obama needs statisticians</a>. </span></li>
<li><span>A <a href="https://twitter.com/kristin_linn/status/206778618016317441/photo/1" target="_blank">cute statistics cartoon</a> from <a href="https://twitter.com/#!/kristin_linn" target="_blank">@kristin_linn </a> via Chris V. Yes, we are now shamelessly reposting cute cartoons for retweets :-). </span></li>
<li><span><a href="http://simplystatistics.tumblr.com/post/23674712262/how-do-we-evaluate-statisticians-working-in-genomics" target="_blank">Rafa’s post</a> inspired some interesting conversation both on our blog and on some statistics mailing lists. It seems to me that everyone is making an effort to understand the increasingly diverse field of statistics, but we still have a ways to go. I’m particularly interested in discussion on how we evaluate the contribution/effort behind making good and usable academic software. I think the strength of the <a href="http://bioconductor.org/" target="_blank">Bioconductor</a> community and the <a href="https://github.com/" target="_blank">rise of Github </a>among academics are a good start. For example, it is really useful that Bioconductor now tracks the <a href="http://www.bioconductor.org/packages/stats/" target="_blank">number of package downloads</a>. </span></li>
</ol>
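<p>For readers who have not seen how an A/B comparison of sign-up rates like the one in the third item is usually analyzed, here is a minimal sketch of a pooled two-proportion z-test in Python. The function name and the counts are made up for illustration; they are not the campaign’s numbers, and this is not necessarily the analysis their team ran.</p>

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled two-sided z-test for a difference in two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 80 of 1000 visitors sign up with page A, 120 of 1000 with page B.
z, p_value = two_proportion_z_test(80, 1000, 120, 1000)
```

<p>A test like this only speaks to the sign-up rate. Whether the extra sign-ups donate at the same rate is a separate question, which is where the confounding concern above comes in.</p>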
"How do we evaluate statisticians working in genomics? Why don't they publish in stats journals?" Here is my answer
2012-05-24T15:57:37+00:00
http://simplystats.github.io/2012/05/24/how-do-we-evaluate-statisticians-working-in-genomics
<p class="MsoNormal">
During the past couple of years I have been asked these questions by several department chairs and other senior statisticians interested in hiring or promoting faculty working in genomics. The main difficulty stems from the fact that we (statisticians working in genomics) publish in journals outside the mainstream statistical journals. This can be a problem during evaluation because a quick-and-dirty approach to evaluating an academic statistician is to count papers in the Annals of Statistics, JASA, JRSS and Biometrics. The evaluators feel safe counting these papers because they trust the fellow-statistician editors of these journals. However, statisticians working in genomics tend to publish in journals like Nature Genetics, Genome Research, PNAS, Nature Methods, Nucleic Acids Research, Genome Biology, and Bioinformatics. In general, these journals do not recruit statistical referees and a considerable number of papers with questionable statistics do get published in them. <strong>However, </strong>when the paper’s main topic is a statistical method or if it heavily relies on statistical methods, statistical referees are used. So, if the statistician is the corresponding or <a href="http://simplystatistics.tumblr.com/post/11314293165/authorship-conventions" target="_blank">last author</a> and it’s a stats paper, it is OK to assume the statistics are fine and you should go ahead and be impressed by the impact factor of the journal… it’s not easy getting statistics papers in these journals.
</p>
<p class="MsoNormal">
But we really should not be counting papers blindly. Instead we should be reading at least some of them. But here again the evaluators get stuck as we tend to publish papers with application/technology-specific jargon and show off by presenting results that are of interest to our potential users (biologists) and not necessarily to our fellow statisticians. Here all I can recommend is that you seek help. There are now a handful of us that are full professors and most of us are more than willing to help out with, for example, <a href="http://simplystatistics.tumblr.com/post/12181264937/advice-on-promotion-letters-bleg" target="_blank">promotion letters</a>.
</p>
<p class="MsoNormal">
So why don’t we publish in statistical journals? The fear of getting scooped due to the <a href="http://simplystatistics.tumblr.com/post/17317636444/an-example-of-how-sending-a-paper-to-a-statistics" target="_blank">slow turnaround of stats journals</a> is only one reason. New technologies that quickly became widely used (microarrays in 2000 and nextgen sequencing today) created a need for data analysis methods among large groups of biologists. Journals with large readerships and high impact factors, typically not interested in straight statistical methodology work, suddenly became amenable to publishing our papers, especially if they solved a data analytic problem faced by many biologists. The possibility of publishing in widely read journals is certainly seductive.
</p>
<p class="MsoNormal">
While in several other fields, data analysis methodology development is restricted to the statistics discipline, in genomics we compete with other quantitative scientists capable of developing useful solutions: computer scientists, physicists, and engineers were also seduced by the possibility of gaining notoriety with publications in high impact journals. Thus, in genomics, the competition for funding, citation and publication in the top scientific journals is fierce.
</p>
<p class="MsoNormal">
Then there is funding. Note that while most biostatistics methodology NIH proposals go to the Biostatistical Methods and Research Design (BMRD) study section, many of the genomics-related grants get sent to other sections such as the Genomics Computational Biology and Technology (GCAT) and Biodata Management and Analysis (BDMA) study sections. BDMA and GCAT are much more impressed by Nature Genetics and Genome Research than JASA and Biometrics. They also look for citations and software downloads.
</p>
<p class="MsoNormal">
To be considered successful by our peers in genomics, those who referee our papers and review our grant applications, our statistical methods need to be delivered as software and garner a user base. Publications in statistical journals, especially those not appearing in PubMed, are not rewarded. This lack of incentive, combined with how <a href="http://simplystatistics.tumblr.com/post/22844703875/ha" target="_blank">time consuming it is to produce and maintain usable software</a>, has led many statisticians working in genomics to focus solely on the development of practical methods rather than generalizable mathematical theory. As a result, statisticians working in genomics do not publish much in the traditional statistical journals. You should not hold this against them, especially if they are developers and maintainers of widely used software.
</p>
Sunday data/statistics link roundup (5/20)
2012-05-20T15:43:19+00:00
http://simplystats.github.io/2012/05/20/sunday-data-statistics-link-roundup-5-20
<div>
It’s grant season around here so I’ll be brief:
</div>
<ol>
<li>I love <a href="http://online.wsj.com/article/SB10001424052702303448404577410341236847980.html" target="_blank">this article</a> in the WSJ about the crisis at JP Morgan. The key point it highlights is that looking only at the high-level analysis and summaries can be misleading; you have to look at the raw data to see the potential problems. As data become more complex, I think it’s critical we stay in touch with the raw data, regardless of discipline. At least if I miss something in the raw data I don’t lose a couple billion. Spotted by Leonid K. </li>
<li>On the other hand, <a href="http://www.nytimes.com/2012/05/15/science/a-mathematical-challenge-to-obesity.html?_r=1" target="_blank">this article</a> in the Times drives me a little bonkers. It makes it sound like there is one mathematical model that will solve the obesity epidemic. Lines like this are ridiculous: “<span>Because to do this experimentally would take years. You could find out much more quickly if you did the math.” The obesity epidemic is due to a complex interplay of cultural, sociological, economic, and policy factors. The idea you could “figure it out” with a set of simple equations is laughable. If you check out <a href="http://bwsimulator.niddk.nih.gov/" target="_blank">their model</a> this is clearly not the answer to the obesity epidemic. Just another example of why <a href="http://simplystatistics.tumblr.com/post/20902656344/statistics-is-not-math" target="_blank">statistics is not math</a>. If you don’t want to hopelessly oversimplify the problem, you need careful data collection, analysis, and interpretation. For a broader look at this problem, check out this article on <a href="http://www.american.com/archive/2012/may/science-vs-pr" target="_blank">Science vs. PR</a>. Via Andrew J. </span></li>
<li><span>Some <a href="http://freakonometrics.blog.free.fr/index.php?post/2012/04/18/foundwaldo" target="_blank">cool applications</a> of the raster package in R. This kind of thing is fun for student projects because analyzing images leads to results that are easy to interpret/visualize.</span></li>
<li><span>Check out John C.’s <a href="http://www.johndcook.com/blog/2012/05/07/how-do-you-know-when-someone-is-great/" target="_blank">really fascinating post</a> on determining when a white-collar worker is great. Inspired by <a href="http://simplystatistics.tumblr.com/post/22585430491/how-do-you-know-if-someone-is-great-at-data-analysis" target="_blank">Roger’s post</a> on knowing when someone is good at data analysis. </span></li>
</ol>
The West Wing Was Always A Favorite Show Of Mine
2012-05-16T14:49:15+00:00
http://simplystats.github.io/2012/05/16/the-west-wing-was-always-a-favorite-show-of-mine
<p>[youtube http://www.youtube.com/watch?v=t7FJFuuvxpI?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]</p>
<p>The West Wing was always a favorite show of mine (at least, seasons 1-4, the Sorkin years) and I think this is a great scene which talks about the difference between evidence and interpretation. The topic is a 5-day waiting period for gun purchases and they’ve just received a poll in a few specific congressional districts showing weak support for this proposed policy.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Health by numbers: A statistician's challenge
2012-05-16T13:22:11+00:00
http://simplystats.github.io/2012/05/16/health-by-numbers-a-statisticians-challenge
<p><a href="http://www.reuters.com/article/2012/05/14/us-statistics-idUSBRE84D0KD20120514">Health by numbers: A statistician’s challenge</a></p>
Facebook Needs to Turn Data Into Investor Gold
2012-05-14T23:59:00+00:00
http://simplystats.github.io/2012/05/14/facebook-needs-to-turn-data-into-investor-gold
<p><a href="http://www.nytimes.com/2012/05/15/technology/facebook-needs-to-turn-data-trove-into-investor-gold.html?smid=tu-share">Facebook Needs to Turn Data Into Investor Gold</a></p>
Computational biologist blogger saves computer science department
2012-05-14T15:01:00+00:00
http://simplystats.github.io/2012/05/14/computational-biologist-blogger-saves-computer-science
<p>People who read the news should be aware by now that we are in the midst of a big data era. The New York Times, for example, has been writing about this frequently. One of their most recent <a href="http://www.nytimes.com/2012/05/01/science/simons-foundation-chooses-uc-berkeley-for-computing-center.html" target="_blank">articles</a> describes how UC Berkeley is getting $60 million for a <span>new </span>computer science center. Meanwhile, at the University of Florida the administration seems to be oblivious to all this and about a month ago announced it was dropping its computer science department to save money. <a href="http://blogs.forbes.com/stevensalzberg/" target="_blank">Blogger</a> <a href="http://bioinformatics.igm.jhmi.edu/salzberg/Salzberg/Salzberg_Lab_Home.html" target="_blank">Steven Salzberg</a>, a computational biologist known for his <a href="http://scholar.google.com/citations?user=sUVeH-4AAAAJ&hl=en" target="_blank">work in genomics</a>, wrote a post titled “<a href="http://genome.fieldofscience.com/2012/04/university-of-florida-eliminates.html" target="_blank">University of Florida eliminates Computer Science Department. At least they still have football</a>” ridiculing UF for their decisions. Here are my favorite quotes:</p>
<blockquote>
<p><span> in the midst of a technology revolution, with a shortage of engineers and computer scientists, UF decides to cut computer science completely? </span></p>
</blockquote>
<blockquote>
<p>Computer scientist Carl de Boor, a member of the National Academy of Sciences and winner of the 2003 National Medal of Science, asked the UF president “What were you thinking?”</p>
</blockquote>
<p>Well, his post went viral and days later <a href="http://www.forbes.com/sites/stevensalzberg/2012/04/25/university-of-florida-announces-plan-to-save-computer-science-department/" target="_blank">UF reversed its decision</a>! So my point is this: statistics departments, be nice to bloggers that work in genomics… one of them might save your butt some day.</p>
<p><em>Disclaimer: Steven Salzberg has a joint appointment in my department and we have joint lab meetings.</em></p>
Sunday data/statistics link roundup (5/13)
2012-05-13T20:39:10+00:00
http://simplystats.github.io/2012/05/13/sunday-data-statistics-link-roundup-5-13
<ol>
<li>Patenting <a href="http://www.nytimes.com/2012/05/13/jobs/an-actuary-proves-patents-arent-only-for-engineers.html?_r=1" target="_blank">statistical sampling</a>? I’m pretty sure the Supreme Court that threw out the Mayo Patent wouldn’t have much trouble tossing this patent either. The properties of sampling are a “law of nature” right? via Leonid K.</li>
<li><a href="http://www.youtube.com/watch?v=aUaInS6HIGo" target="_blank">This video</a> has me all fired up. It’s called 23 1/2 hours and talks about how the best preventative health measure is getting 30 minutes of exercise - just walking - every day. He shows how in some cases this beats doing much more high-tech interventions. My favorite part of this video is how he uses a ton of statistical/epidemiological terms like “effect sizes”, “meta-analysis”, “longitudinal study”, “attributable fractions”, but makes them understandable to a broad audience. This is a great example of “statistics for good”.</li>
<li>A very nice collection of <a href="http://www.twotorials.com/2012/05/ninety-two-minute-r-tutorial-videos.html" target="_blank">2-minute tutorials</a> in R. This is a great way to teach the concepts, most of which don’t need more than 2 minutes, and it covers a lot of ground. One thing that drives me crazy is when I go into Rafa’s office with a hairy computational problem and he says, “Oh you didn’t know about function x?”. Of course this only happens after I’ve wasted an hour re-inventing the wheel. If more people put up 2-minute tutorials on all the cool tricks they know, we’d all be better off.</li>
<li>A plot using ggplot2, developed by this week’s interviewee <a href="http://simplystatistics.tumblr.com/post/22844703875/ha" target="_blank">Hadley Wickham</a> appears <a href="http://www.theatlantic.com/entertainment/archive/2012/03/the-foreign-language-of-mad-men/254668/" target="_blank">in the Atlantic</a>! Via David S.</li>
<li>I’m refusing to buy into Apple’s hegemony, so I’m still running OS 10.5. I’m having trouble getting github up and running. Anyone have this same problem/know a solution? I know, I know, I’m way behind the times on this…</li>
</ol>
Interview with Hadley Wickham - Developer of ggplot2
2012-05-11T16:11:20+00:00
http://simplystats.github.io/2012/05/11/ha
<div class="im">
<strong>Hadley Wickham</strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong><img height="365" src="http://biostat.jhsph.edu/~jleek/hw.jpg" width="244" /></strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong><br /></strong><a href="http://had.co.nz/" target="_blank">Hadley Wickham </a>is the Dobelman Family Junior Chair of Statistics at Rice University. Prior to moving to Rice, he completed his Ph.D. in Statistics from Iowa State University. He is the developer of the wildly popular <a href="http://had.co.nz/ggplot2/" target="_blank">ggplot2</a> software for data visualization and a contributor to the <a href="http://www.ggobi.org/" target="_blank">Ggobi </a>project. He has developed a number of really useful R packages touching everything from data processing, to data modeling, to visualization.
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<strong>Which term applies to you: data scientist, statistician, computer</strong><br /><strong>scientist, or something else?</strong></p>
</div>
<p><span>I’m an assistant professor of statistics, so I at least partly</span><br />
<span>associate with statistics :). But the idea of data science really</span><br />
<span>resonates with me: I like the combination of tools from statistics and</span><br />
<span>computer science, data analysis and hacking, with the core goal of</span><br />
<span>developing a better understanding of data. Sometimes it seems like not</span><br />
<span>much statistics research is actually about gaining insight into data.</span></p>
<div class="im">
<strong>You have created/maintain several widely used R packages. Can you</strong><br /><strong>describe the unique challenges to writing and maintaining packages</strong><br /><strong>above and beyond developing the methods themselves?</strong></p>
</div>
<p>I think there are two main challenges: turning ideas into code, and<br />
documentation and community building.</p>
<p>Compared to other languages, the software development infrastructure<br />
in R is weak, which sometimes makes it harder than necessary to turn<br />
my ideas into code. Additionally, I get less and less time to do<br />
software development, so I can’t afford to waste time recreating old<br />
bugs, or releasing packages that don’t work. Recently, I’ve been<br />
investing time in helping build better dev infrastructure; better<br />
tools for documentation <a href="http://github.com/klutometis/roxygen" target="_blank">[roxygen2]</a>, unit testing <a href="https://github.com/hadley/test_that" target="_blank">[testthat]</a>, package development <a href="https://github.com/hadley/devtools" target="_blank">[devtools]</a>, and creating package websites <a href="https://github.com/hadley/staticdocs" target="_blank">[staticdocs]</a>. Generally, I’ve<br />
found unit tests to be a worthwhile investment: they ensure you never<br />
accidentally recreate an old bug, and give you more confidence when<br />
radically changing the implementation of a function.</p>
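<p>Editorial aside: the regression-test idea in the answer above can be sketched in a few lines. The example below is in Python with plain assertions rather than R and testthat, and both the function <code>rescale01</code> and the "old bug" it guards against are hypothetical, invented only to show the pattern.</p>

```python
def rescale01(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    values = list(values)
    lo, hi = min(values), max(values)
    if hi == lo:
        # Hypothetical old bug: constant input used to trigger a division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_rescale01_basic():
    assert rescale01([0, 5, 10]) == [0.0, 0.5, 1.0]


def test_rescale01_constant_input_regression():
    # Pins the fix: if the guard above is ever removed, this test fails.
    assert rescale01([3, 3, 3]) == [0.0, 0.0, 0.0]


test_rescale01_basic()
test_rescale01_constant_input_regression()
```

<p>Once the second test exists, the implementation can be rewritten freely and the old bug cannot return unnoticed.</p>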
<p>Documenting code is hard work, and it’s certainly something I haven’t<br />
mastered. But documentation is absolutely crucial if you want people<br />
to use your work. I find the main challenge is putting yourself in the<br />
mind of the new user: what do they need to know to use the package<br />
effectively. This is really hard to do as a package author because<br />
you’ve internalised both the motivating problem and many of the common<br />
solutions.</p>
<p>Connected to documentation is building up a community around your<br />
work. This is important to get feedback on your package, and can be<br />
helpful for reducing the support burden. One of the things I’m most<br />
proud of about ggplot2 is something that I’m barely responsible for:<br />
the ggplot2 mailing list. There are now ggplot2 experts who answer far<br />
more questions on the list than I do. I’ve also found github to be<br />
great: there’s an increasing community of users proficient in both R<br />
and git who produce pull requests that fix bugs and add new features.</p>
<p>The flip side of building a community is that as your work becomes<br />
more popular you need to be more careful when releasing new versions.<br />
The last major release of ggplot2 (0.9.0) broke over 40 (!!) CRAN<br />
packages, and forced me to rethink my release process. Now I advertise<br />
releases a month in advance, and run `R CMD check` on all downstream<br />
dependencies (`devtools::revdep_check` in the development version), so<br />
I can pick up potential problems and give other maintainers time to<br />
fix any issues.</p>
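<p>Editorial aside: the first step of the release process described above is working out which packages sit downstream. Here is a toy Python sketch of that lookup. The dependency map and package names are invented for illustration, and this is not how <code>devtools</code> or CRAN compute reverse dependencies.</p>

```python
def reverse_dependencies(target, depends_on):
    """Return, sorted, the packages that list `target` among their dependencies."""
    return sorted(pkg for pkg, deps in depends_on.items() if target in deps)


# Hypothetical dependency map: package name -> packages it depends on.
depends_on = {
    "plotlib": [],
    "mapcharts": ["plotlib", "geo"],
    "quickreport": ["plotlib"],
    "geo": [],
}

# Each of these would need to be re-checked before a new plotlib release.
to_check = reverse_dependencies("plotlib", depends_on)
```

<p>In practice each package in the resulting list would then be run through its own checks against the release candidate.</p>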
<div class="im">
<strong>Do you feel that the academic culture has caught up with and supports</strong><br /><strong>non-traditional academic contributions (e.g. R packages instead of</strong><br /><strong>papers)?</strong></p>
</div>
<p><span>It’s hard to tell. I think it’s getting better, but it’s still hard to</span><br />
<span>get recognition that software development is an intellectual activity</span><br />
<span>in the same way that developing a new mathematical theorem is. I try</span><br />
<span>to hedge my bets by publishing papers to accompany my major packages.</span><br />
<span>I’ve also found the peer-review process very useful for improving the</span><br />
<span>quality of my software. Reviewers from both the R Journal and the</span><br />
<span>Journal of Statistical Software have provided excellent suggestions</span><br />
<span>for enhancements to my code.</span></p>
<div class="im">
<p><strong>You have given presentations at several start-up and tech companies.</strong><br /><strong>Do the corporate users of your software have different interests than</strong><br /><strong>the academic users?</strong></p>
</div>
<p><span>By and large, no. Everyone, regardless of domain, is struggling to</span><br />
<span>understand ever larger datasets. Across both industry and academia,</span><br />
<span>practitioners are worried about reproducible research and thinking</span><br />
<span>about how to apply the principles of software engineering to data</span><br />
<span>analysis.</span></p>
<div class="im">
<p><strong>You gave one of my favorite presentations called Tidy Data/Tidy Tools</strong><br /><strong>at the NYC Open Statistical Computing Meetup. What are the key</strong><br /><strong>elements of tidy data that all applied statisticians should know?</strong></p>
</div>
<p>Thanks! Basically, make sure you store your data in a consistent<br />
format, and pick (or develop) tools that work with that data format.<br />
The more time you spend munging data in the middle of an analysis, the<br />
less time you have to discover interesting things in your data. I’ve<br />
tried to develop a consistent philosophy of data that means when you<br />
use my packages (particularly <a href="http://plyr.had.co.nz/" target="_blank">plyr</a> and <a href="http://had.co.nz/ggplot2/" target="_blank">ggplot2</a>), you can focus on the<br />
data analysis, not on the details of the data format. The principles<br />
of tidy data that I adhere to are that every column should be a<br />
variable, every row an observation, and different types of data should<br />
live in different data frames. (If you’re familiar with database<br />
normalisation this should sound pretty familiar!). I expound these<br />
principles in depth in my in-progress <a href="http://vita.had.co.nz/papers/tidy-data.html" target="_blank">[paper on the<br />topic]</a>. </p>
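<p>As a rough illustration of the "every column a variable, every row an observation" principle, here is a minimal sketch that reshapes a wide table into a tidy one. The table contents and the helper name <code>melt</code> are assumptions made for this example; they are not taken from plyr or ggplot2.</p>

```python
# Hypothetical "wide" table: one row per person, one column per year.
# The year is a variable, but here it is hidden in the column names.
wide = [
    {"name": "A", "2010": 5, "2011": 7},
    {"name": "B", "2010": 3, "2011": 4},
]


def melt(rows, id_key, var_name, value_name):
    """Reshape wide rows into tidy rows: one observation per row."""
    tidy_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the identifier is repeated on every tidy row
            tidy_rows.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy_rows


# After melting, "name", "year" and "count" are each a column.
tidy = melt(wide, "name", "year", "count")
for row in tidy:
    print(row)
```

<p>Once the data are in this shape, grouping, modelling and plotting code can refer to <code>year</code> and <code>count</code> by name instead of special-casing each year column.</p>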
<div class="im">
<p><strong>How do you decide what project to work on next? Is your work inspired</strong><br /><strong>by a particular application or more general problems you are trying to</strong><br /><strong>tackle?</strong></p>
</div>
<p>Very broadly, I’m interested in the whole process of data analysis:<br />
the process that takes raw data and converts it into understanding,<br />
knowledge and insight. I’ve identified three families of tools<br />
(manipulation, modelling and visualisation) that are used in every<br />
data analysis, and I’m interested both in developing better individual<br />
tools and in smoothing the transition between them. In every good<br />
data analysis, you must iterate multiple times between manipulation,<br />
modelling and visualisation, and anything you can do to make that<br />
iteration faster yields qualitative improvements to the final analysis<br />
(that was one of the driving reasons I’ve been working on tidy data).</p>
<p>Another factor that motivates a lot of my work is teaching. I hate<br />
having to teach a topic that’s just a collection of special cases,<br />
with no underlying theme or theory. That drive led to <a href="http://cran.r-project.org/web/packages/stringr/index.html" target="_blank">[stringr]</a> (for<br />
string manipulation) and <a href="http://cran.r-project.org/web/packages/lubridate/index.html" target="_blank">[lubridate]</a> (with Garrett Grolemund for working<br />
with dates). I recently released the <a href="https://github.com/hadley/httr" target="_blank">[httr]</a> package which aims to do a similar thing for http requests - I think this is particularly important as more and more data starts living on the web and must be accessed through an API.</p>
<div class="im">
<p><strong>What do you see as the biggest open challenges in data visualization</strong><br /><strong>right now? Do you see interactive graphics becoming more commonplace?</strong></p>
</div>
<p>I think one of the biggest challenges for data visualisation is just<br />
communicating what we know about good graphics. The first article<br />
decrying 3d bar charts was <a href="http://www.jstor.org/stable/2682265" target="_blank">published in 1951</a>! Many plots still use<br />
rainbow scales or red-green colour contrasts, even though we’ve known<br />
for decades that those are bad. How can we ensure that people<br />
producing graphics know enough to do a good job, without making them<br />
read hundreds of papers? It’s a really hard problem.</p>
<p>Another big challenge is balancing the tension between exploration and<br />
presentation. For exploratory graphics, you want to spend five seconds<br />
(or less) to create a plot that helps you understand the data, while you might spend<br />
five hours on a plot that’s persuasive to an audience who<br />
isn’t as intimately familiar with the data as you. To date, we have<br />
great interactive graphics solutions at either end of the spectrum<br />
(e.g. ggobi/iplots/manet vs d3) but not much that transitions from one<br />
end of the spectrum to the other. This summer I’ll be spending some<br />
time thinking about what ggplot2 + <a href="http://d3js.org/" target="_blank">[d3]</a> might<br />
equal, and how we can design something like an interactive grammar of<br />
graphics that lets you explore data in R, while making it easy to<br />
publish interactive presentation graphics on the web.</p>
What are the products of data analysis?
2012-05-10T12:55:41+00:00
http://simplystats.github.io/2012/05/10/what-are-the-products-of-data-analysis
<p>Thanks to everyone for the feedback on my post on <a href="http://simplystatistics.tumblr.com/post/22585430491/how-do-you-know-if-someone-is-great-at-data-analysis" target="_blank">knowing when someone is good at data analysis</a>. A couple people suggested I take a look <a href="http://www.kaggle.com/users" target="_blank">here</a> for a few people who have proven they’re good at data analysis. I think that’s a great idea and a good place to start.</p>
<p>But I also think that while demonstrating an ability to build good prediction models is impressive and definitely shows an understanding of the data, not all important problems can be easily posed as prediction problems. Most of my work does not involve prediction at all and the problems I face (i.e., estimating very small effects in the presence of large unmeasured confounding factors) would be difficult to formulate as a prediction challenge (at least, I can’t think of an easy way). In fact, part of <a href="http://www.scribd.com/full/28736728?access_key=key-21htoe67zs4rs9ecj1al" target="_blank">my</a> and <a href="http://www.ncbi.nlm.nih.gov/pubmed/22364439" target="_blank">my colleagues’</a> research involves showing how statistical methods designed for prediction problems can fail miserably when applied to other non-prediction settings.</p>
<p>The general question I have is what is a useful product that you can produce from a data analysis that demonstrates the quality of that analysis? So, a very small mean squared error from a prediction model would be one product (especially if it were smaller than everyone else’s). Maybe a cool graph with a story behind it? </p>
<p>If I were hiring a musician for an orchestra, I wouldn’t have to meet that person to have strong evidence that he/she were good. I could just listen to some recordings of that person playing and that would be a pretty good predictor of how that person would perform in the orchestra. In fact, some major orchestras do completely blind auditions so that although the person is present in the room, all you hear is the sound of the playing.</p>
<p>What seems to be true with music at least, is that even though the final performance doesn’t specifically reveal the important decisions that were made along the way to craft the interpretation of the music, somehow one is still able to appreciate the fact that all those decisions were made and they benefitted the performance. To me, it seems unlikely to arrive at a sublime performance either by chance or by some route that didn’t involve talent and hard work. Maybe it could happen once, but to produce a great performance over and over requires more than just luck.</p>
<p>What products could you send to someone to convince them you were good at data analysis? I raise this question primarily because when I look around at the products that I make (research papers, software, books, blogs), even if they are very good, I don’t think they necessarily convey any useful information about my ability to analyze data.</p>
<p>What’s the data analysis equivalent of a musician’s performance?</p>
DealBook: Glaxo to Make Hostile Bid for Human Genome Sciences
2012-05-09T13:48:59+00:00
http://simplystats.github.io/2012/05/09/dealbook-glaxo-to-make-hostile-bid-for-human-genome
<p><a href="http://dealbook.nytimes.com/2012/05/09/glaxosmithkline-to-make-hostile-bid-for-human-genome-sciences/">DealBook: Glaxo to Make Hostile Bid for Human Genome Sciences</a></p>
Data analysis competition combines statistics with speed
2012-05-08T00:12:20+00:00
http://simplystats.github.io/2012/05/08/data-analysis-competition-combines-statistics-with
<p><a href="http://www.dailybruin.com/index.php/article/2012/05/data_competition_combines_analysis_with_speed?_mo=1">Data analysis competition combines statistics with speed</a></p>
How do you know if someone is great at data analysis?
2012-05-07T13:17:16+00:00
http://simplystats.github.io/2012/05/07/how-do-you-know-if-someone-is-great-at-data-analysis
<p>Consider this exercise. Come up with a list of the top 5 people that you think are really good at data analysis.</p>
<p>There’s one catch: They have to be people that you’ve never met nor have had any sort of personal interaction with (e.g. email, chat, etc.). So basically people who have written papers/books you’ve read or have given talks you’ve seen or that you know through other publicly available information. Who comes to mind? It’s okay to include people who are no longer living.</p>
<p>The other day I was thinking about the people who I think are really good at data analysis and it occurred to me that they were all people I knew. So I started thinking about people that I don’t know (and there are many) but are equally good at data analysis. This turned out to be much harder than I thought. And I’m sure it’s not because they don’t exist, it’s just because I think good data analysis chops are hard to evaluate from afar using the standard methods by which we evaluate people.</p>
<p>I think there are a few reasons. First, people who are great at data analysis are likely not publishing papers or being productive in a manner that I, an outsider, would be able to observe. If they’re working at a pharmaceutical company working on a new drug or at some fancy new startup company, there’s no way I’m ever going to know about it unless I’m directly involved.</p>
<p>Another reason is that even for people who are well-known scientists or statisticians, the products they produce don’t really highlight the difficulties overcome in data analysis. For example, many good papers in the statistics literature will describe a new method with brief reference to the data that inspired the method’s development. In those cases, the data analysis usually appears obvious, as most things do <em>after</em> they’ve been done. Furthermore, papers usually exclude all the painful details about merging, cleaning, and inspecting the data as well as all the other things you tried that didn’t work. Papers in the substantive literature have a similar problem, which is that they focus on a scientific problem of interest and the analysis of the data is secondary.</p>
<p>As skills in data analysis become more important, it seems odd to me that we don’t have a great way to evaluate a person’s ability to do it as we do in other areas.</p>
Illumina stays independent, for now
2012-05-06T11:01:00+00:00
http://simplystats.github.io/2012/05/06/illumina-stays-independent-for-now
<p><a href="http://dealbook.nytimes.com/2012/04/20/the-escalation-in-hostile-takeover-offers/">Illumina stays independent, for now</a></p>
UCLA Data Fest 2012
2012-05-05T10:49:35+00:00
http://simplystats.github.io/2012/05/05/ucla-data-fest-2012
<p>The very very cool UCLA <a href="http://datafest.stat.ucla.edu/groups/datafest/" target="_blank">Data Fest</a> is going on as we speak. This is a statistical analysis marathon where teams of undergrads work through the night (and day) to address an important problem through data analysis. Last year they looked at crime data from the Los Angeles Police Department. I’m looking forward to seeing how this year goes.</p>
<p>Great work by <a href="http://www.stat.ucla.edu/~rgould/Home/About_Me.html" target="_blank">Rob Gould</a> and the <a href="http://www.stat.ucla.edu/" target="_blank">Department of Statistics</a> there.</p>
New National Academy of Sciences Members
2012-05-04T14:46:18+00:00
http://simplystats.github.io/2012/05/04/new-national-academy-of-sciences-members
<p>The National Academy of Sciences elected <a href="http://www.nasonline.org/news-and-multimedia/news/2012_05_01_NAS_Election.html" target="_blank">new members</a> a few days ago. Among them are statistician <a href="http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CGoQFjAA&url=http%3A%2F%2Fwww-stat.stanford.edu%2F~tibs%2F&ei=E9ijT9feLajo0gGo3byuCQ&usg=AFQjCNH9sYoebTZ858PQOmkuwC8XR7CZtA&sig2=H8W1CQVbC-ypebfWgFQCcQ" target="_blank">Robert Tibshirani</a> and sociologist <a href="http://sociology.uchicago.edu/people/faculty/raudenbush.shtml" target="_blank">Stephen Raudenbush</a>. Obviously well-deserved!</p>
<p>(Thanks to Karl Broman.)</p>
Hammer On The Importance Of Statistics Or As I
2012-05-04T12:52:39+00:00
http://simplystats.github.io/2012/05/04/hammer-on-the-importance-of-statistics-or-as-i
<p>[youtube http://www.youtube.com/watch?v=k6aBITJuSQA?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]</p>
<p>Hammer on the importance of statistics (or, as I used to know him, MC Hammer). The overlay of the video for “Can’t Touch This” really helps me understand what he’s talking about. (Thanks to Chris V. for the link.)</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Just like regular communism, dongle communism has failed
2012-05-02T17:09:37+00:00
http://simplystats.github.io/2012/05/02/just-like-regular-communism-dongle-communism-has
<p>Bad news comrades. <a href="http://simplystatistics.tumblr.com/post/10555655037/dongle-communism" target="_blank">Dongle communism</a> is under attack. Check out how this poor dongle has been subjugated. This is in our lab meeting room. To add insult to injury, this happened on <a href="http://en.wikipedia.org/wiki/International_Workers%27_Day" target="_blank">May 1st</a>! </p>
<p><img height="244" src="http://rafalab.jhsph.edu/simplystats/dongle-capitalism.jpg" width="320" /></p>
GE's Billion-Dollar Bet on Big Data
2012-05-02T14:57:00+00:00
http://simplystats.github.io/2012/05/02/ges-billion-dollar-bet-on-big-data
<p><a href="http://www.businessweek.com/articles/2012-04-26/ges-billion-dollar-bet-on-big-data">GE’s Billion-Dollar Bet on Big Data</a></p>
Sample mix-ups in datasets from large studies are more common than you think
2012-05-01T15:01:13+00:00
http://simplystats.github.io/2012/05/01/sample-mix-ups-in-datasets-from-large-studies-are-more
<p>If you have analyzed enough high throughput data you have seen it before: a male sample that is really a female, a liver that is a kidney, etc… As the datasets I analyze get bigger I see more and more sample mix-ups. When I find a couple of samples for which sex is incorrectly annotated (one can easily see this from examining data from X and Y chromosomes) I can’t help but wonder if there are more that are undetectable (e.g. swapping samples of the same sex). Datasets that include two types of measurements, for example genotypes and gene expression, make it possible to detect sample swaps more generally. I recently attended a talk by <a href="http://www.biostat.wisc.edu/~kbroman/" target="_blank">Karl Broman</a> on this topic (one of the best talks I’ve seen; check out the slides <a href="http://www.biostat.wisc.edu/~kbroman/presentations/mousegenet2011.pdf" target="_blank">here</a>). Karl reports an example in which it looks as if whoever was pipetting skipped a sample and kept on going, introducing an off-by-one error for over 50 samples. As I sat through the talk, I wondered how many of the large GWAS studies have mix-ups like this?</p>
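<p>To make the off-by-one idea concrete, here is a toy sketch, not the method from Karl's talk or from any published tool. It assumes we have the annotated sex for each sample and a sex call inferred from the data (in practice from X and Y chromosome measurements; here the values are made up), and it asks which shift between the two lists gives the best agreement. The function names <code>agreement</code> and <code>best_shift</code> are hypothetical.</p>

```python
def agreement(annotated, inferred, shift):
    """Fraction of overlapping positions where annotated[i] == inferred[i + shift]."""
    pairs = [
        (annotated[i], inferred[i + shift])
        for i in range(len(annotated))
        if 0 <= i + shift < len(inferred)
    ]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def best_shift(annotated, inferred, max_shift=2):
    """Return the shift in [-max_shift, max_shift] with the highest agreement."""
    return max(
        range(-max_shift, max_shift + 1),
        key=lambda s: agreement(annotated, inferred, s),
    )


# Made-up annotation for 12 samples.
annotated = list("MFFMFMMFMFFM")
# Simulate a skipped position: every data-derived call is displaced by one
# slot, and the first slot has no usable call ("?").
inferred = ["?"] + annotated[:-1]

print("agreement with no shift:", agreement(annotated, inferred, 0))
print("best shift:", best_shift(annotated, inferred))
```

<p>A shift other than zero with near-perfect agreement would suggest a systematic displacement rather than a handful of isolated swaps. Sex alone cannot catch swaps between samples of the same sex, which is why a second, richer measurement type is needed for the general case.</p>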
<p>A <a href="http://www.ncbi.nlm.nih.gov/pubmed/21653519" target="_blank">recent paper</a> (gated) published by Lude Franke and colleagues describes MixupMapper: a method for detecting and correcting mix-ups. They examined several public datasets and discovered mix-ups in all of them. The worst performing study, <a href="http://www.ncbi.nlm.nih.gov/pubmed/19043577" target="_blank">published in PLoS Genetics</a>, was reported to have 23% of the samples swapped. I was surprised that the MixupMapper paper was not published in a higher impact journal. Turns out PLoS Genetics rejected the paper. I think this was a big mistake on their part: the paper is clear and well written, reports a problem with a PLoS Genetics paper, and describes a solution to a problem that should have us all quite worried. I think it’s important that everybody learn about this problem so I was happy to see that, eight months later, Nature Genetics <a href="http://www.ncbi.nlm.nih.gov/pubmed/22484626" target="_blank">published a paper reporting mix-ups</a> (gated)… but they didn’t cite the MixupMapper paper! Sorry Lude, welcome to the <a href="http://simplystatistics.tumblr.com/post/13680729270/reverse-scooping" target="_blank">reverse scooped</a> club. </p>
A disappointing response from @NatureMagazine about folks with statistical skills
2012-04-30T15:02:56+00:00
http://simplystats.github.io/2012/04/30/a-disappointing-response-from-naturemagazine-about
<p>Last week <a href="http://simplystatistics.tumblr.com/post/21845976361/nature-is-hiring-a-data-editor-how-will-they-make" target="_blank">I linked to</a> an ad for a Data Editor position at Nature Magazine. I was super excited that Nature was recognizing data as an important growth area. But the ad doesn’t mention anything about statistical analysis skills; it focuses exclusively on data management expertise. As I pointed out in the earlier post, managing data is only half the equation - figuring out what to do with the data is the other half. The second half requires knowledge of statistics.</p>
<p>The folks over at Nature <a href="https://twitter.com/#!/NatureMagazine/status/195523909771198464" target="_blank">responded to our post</a> on Twitter:</p>
<blockquote>
<p><span> it’s unrealistic to think this editor (or anyone) could do what you suggest. Curation & accessibility are key. ^ng</span></p>
</blockquote>
<p>I disagree with this statement for the following reasons:</p>
<ol>
<li>Is it really unrealistic to think someone could have data management and statistical expertise? Pick your favorite data scientist and you would have someone with those skills. Most students coming out of computer science, computational biology, bioinformatics, or statistical genomics programs would have a blend of those two skills in some proportion. </li>
</ol>
<p>But maybe the problem is this:</p>
<blockquote>
<p><span>Applicants must have a PhD in the biological sciences</span></p>
</blockquote>
<p>It is possible that there are few PhDs in the biological sciences who know both statistics and data management (although that is probably changing). But most computational biologists have a pretty good knowledge of biology and a <strong>very</strong> good knowledge of data - both managing and analyzing. If you are hiring a data editor, this might be the target audience. I’d replace “PhD in the biological sciences” in the ad with “knowledge of biology, statistics, data analysis, and data visualization”. There would be plenty of folks with those qualifications.</p>
<ol start="2">
<li>
<p>The response mentions curation, which is a critical issue. But good curation requires knowledge of two things: (i) the biological or scientific problem and (ii) how and in what way the data will be analyzed and used by researchers. As the <a href="http://simplystatistics.tumblr.com/post/18378666076/the-duke-saga-starter-set" target="_blank">Duke scandal</a> made clear, a statistician with technological and biological knowledge running through a data analysis will identify many critical issues in data curation that would be missed by someone who doesn’t actually analyze data. </p>
</li>
<li>
<p>The response says that “Curation and accessibility” are key. I agree that they are <em>part</em> of the key. It is critical that data can be properly accessed by researchers to perform new analyses, verify results in papers, and discover new results. But if the goal is to ensure the quality of science being published in Nature (the role of an editor) curation and accessibility are not enough. The editor should be able to evaluate statistical methods described in papers to identify potential flaws, or to rerun code and make sure that it performs the same/sensible analyses. A bad analysis that is reproducible will be discovered more quickly, but it is still a bad analysis. </p>
</li>
</ol>
<p>To be fair, I don’t think that Nature is the only organization that is missing the value of statistical skill in hiring data positions. It seems like many organizations are still just searching for folks who can handle/process the massive data sets being generated. But if they want to make accurate and informed decisions, statistical knowledge needs to be at the top of their list of qualifications. </p>
Sunday data/statistics link roundup (4/29)
2012-04-29T22:57:41+00:00
http://simplystats.github.io/2012/04/29/sunday-data-statistics-link-roundup-4-29
<ol>
<li>Nature Genetics has <a href="http://www.nature.com/ng/journal/v44/n5/full/ng.2264.html" target="_blank">an editorial</a> on the Mayo and Myriad cases. I agree with this bit: “<span>In our opinion, it is not new judgments or legislation that are needed but more innovation. In the era of whole-genome sequencing of highly variable genomes, it is increasingly hard to justify exclusive ownership of particularly useful parts of the genome, and method claims must be more carefully described.” Via <a href="http://www.biostat.jhsph.edu/~ajaffe/" target="_blank">Andrew J.</a></span></li>
<li>One of Tech Review’s 10 emerging technologies from a February 2003 article? <a href="http://www.technologyreview.com/InfoTech/12256/" target="_blank">Data mining</a>. I think doing interesting things with data has probably always been a hot topic, it just gets press in cycles. Via Aleks J. </li>
<li>An infographic in the New York Times compares the profits and taxes of Apple <a href="http://www.nytimes.com/imagepages/2012/04/29/technology/29appletax-hp-graphic.html?ref=business" target="_blank">over time</a>, <a href="http://www.nytimes.com/2012/04/29/business/apples-tax-strategy-aims-at-low-tax-states-and-nations.html?_r=1&hp" target="_blank">here is an explanation</a> of how they do it. (Via Tim O.)</li>
<li>Saw <a href="https://twitter.com/#!/fivethirtyeight/status/192683954510364672" target="_blank">this tweet</a> via Joe B. I’m not sure if the frequentists or the Bayesians are winning, but it seems to me that the battle no longer matters to my generation of statisticians - there are too many data sets to analyze, better to just use what works!</li>
<li>Statistical and computational algorithms that <a href="http://www.wired.com/gadgetlab/2012/04/can-an-algorithm-write-a-better-news-story-than-a-human-reporter/" target="_blank">write news stories</a>. Simply Statistics remains 100% human written (for now). </li>
<li>The <a href="http://simplystatistics.tumblr.com/post/12076163379/the-5-most-critical-statistical-concepts" target="_blank">5 most critical</a> statistical concepts. </li>
</ol>
People in positions of power that don't understand statistics are a big problem for genomics
2012-04-27T15:16:29+00:00
http://simplystats.github.io/2012/04/27/people-in-positions-of-power-that-dont-understand
<p class="p1">
I finally got around to reading the <a href="http://www.iom.edu/Reports/2012/Evolution-of-Translational-Omics.aspx" target="_blank">IOM report on translational omics</a> and it is very good. The report lays out problems with current practices and how these led to undesired results such as the now infamous <a href="http://simplystatistics.tumblr.com/post/18378666076/the-duke-saga-starter-set" target="_blank">Duke trials</a> and the <a href="http://online.wsj.com/article/SB10001424052702303627104576411850666582080.html" target="_blank">growth in retractions</a> in the scientific literature. Specific recommendations are provided related to reproducibility and validation. I expect the report will improve things. Although I think bigger improvements will come as a result of retirements.
</p>
<p class="p1">
In general, I think the field of <em>genomics</em> (a label that is used quite broadly) is producing great discoveries and I strongly believe we are just getting started. But we can’t help but notice that retractions and questionable findings are particularly common in this field. In my view most of the problems we are currently suffering stem from the fact that a substantial number of the people with positions of power do not understand statistics and have no experience with computing. Nevins’s biggest mistake was not admitting to himself that he did not understand what Baggerly and Coombes were saying. The lack of reproducibility just exacerbated the problem. The same is true for the editors that rejected the letters written by this pair in their effort to expose a serious problem - a problem that was obvious to all the statistics-savvy biologists I talked to.
</p>
<p class="p1">
Unfortunately Nevins is not the only head of a large genomics lab that does not understand basic statistical principles and has no programming/data-management experience. So how do people without the necessary statistical and computing skills to be considered experts in genomics become leaders of the field? I think this is due to the speed at which Biology changed from a data-poor discipline to a data-intensive one. For example, before microarrays, the analysis of gene expression data amounted to spotting black dots on a piece of paper (see Figure A below). In the mid 90s this suddenly changed to sifting through tens of thousands of numbers (see Figure B).
</p>
<p><img src="http://simplystatistics.org/wp-content/uploads/2013/05/expression.jpg" alt="gene expression" /></p>
<p class="p1">
Note that typically, statistics is not a requirement of the Biology graduate programs associated with genomics. At Hopkins neither of the two major programs (<a href="http://cmm.jhu.edu/index.php?title=Home" target="_blank">CMM</a> and <a href="http://biolchem.bs.jhmi.edu/bcmb/Pages/index.aspx" target="_blank">BCMB</a>) requires it. And this is expected, since outside of genomics one can do great Biology without quantitative skills and for most of the 20th century most Biology was like this. So when the genomics revolution first arrived, the great majority of powerful Biology lab heads had no statistical training whatsoever. Nonetheless, a few of these decided to delve into this “sexy” new field and using their copious resources were able to perform some of the first big experiments. Similarly, Biology journals that were not equipped to judge the data analytic component of genomics papers were eager to publish papers in this field, a fact that further compounded the problem.
</p>
<p class="p1">
But as I mentioned above, in general, the field of genomics is producing wonderful results. Several lab heads did have statistics and computational expertise, while others formed strong partnerships with quantitative types. Here I should mention that for these partnerships to be successful the statisticians also needed to expand their knowledge base. The quantitative half of the partnership needs to be biology and technology savvy or they too can make <a href="http://retractionwatch.wordpress.com/2011/07/21/sebastiani-group-retracts-genetics-of-aging-study-from-science/" target="_blank">mistakes that lead to retractions</a>.
</p>
<p class="p1">
Nevertheless, the field is riddled with problems; enough to prompt an IOM report. But although the present is somewhat grim, I am optimistic about the future. The new generation of biologists leading the genomics field are clearly more knowledgeable about and appreciative of statistics and computing than the previous ones. Natural selection helps, as these new investigators can’t rely on pre-genomics-revolution accomplishments and those that do not possess these skills are simply outperformed by those that do. I am also optimistic because biology graduate programs are starting to incorporate statistics and computation into their curricula. For example, as of last year, our <a href="http://humangenetics.jhmi.edu/" target="_blank">Human Genetics</a> program requires our <a href="http://biostat.jhsph.edu/~iruczins/teaching/140.615/info.html" target="_blank">Biostats 615-616 course</a>.
</p>
Nature is hiring a data editor...how will they make sense of the data?
2012-04-26T13:02:00+00:00
http://simplystats.github.io/2012/04/26/nature-is-hiring-a-data-editor-how-will-they-make
<p>It looks like the journal Nature is <a href="http://www.nature.com/naturejobs/science/jobs/258826-Chief-Editor-Data" target="_blank">hiring a Chief Data Editor</a> (link via Hilary M.). It looks like the primary purpose of this editor is to develop tools for collecting, curating, and distributing data with the goal of improving reproducible research.</p>
<p>The main duties of the editor, as described by the ad are: </p>
<blockquote>
<p><span>Nature Publishing Group is looking for a Chief Editor to develop a product aimed at making research data more available, discoverable and interpretable.</span></p>
</blockquote>
<p>The ad also mentions having an eye for commercial potential; I wonder if this move was motivated by companies like <a href="http://figshare.com/" target="_blank">figshare</a> who are already providing a reproducible data service. I haven’t used figshare, but the early reports from friends who have are that it is great. </p>
<p>The thing that bothered me about the ad is that there is a strong focus on data collection/storage/management but absolutely no mention of the second component of the data science problem: making sense of the data. To make sense of piles of data requires training in applied statistics (<a href="http://simplystatistics.tumblr.com/post/20902656344/statistics-is-not-math" target="_blank">called by whatever name you like best</a>). The ad doesn’t mention any such qualifications. </p>
<p>Even if the goal of the position is just to build a competitor to figshare, it seems like a good idea for the person collecting the data to have some idea of what researchers are going to do with it. When dealing with data, those researchers will frequently be statisticians by one name or another. </p>
<p>Bottom line: I’m stoked Nature is recognizing the importance of data in this very prominent way. But I wish they’d realize that a data revolution also requires a revolution in statistics. </p>
How do I know if my figure is too complicated?
2012-04-25T17:01:36+00:00
http://simplystats.github.io/2012/04/25/how-do-i-know-if-my-figure-is-too-complicated
<p>One of the key things every statistician needs to learn is how to create informative figures and graphs. Sometimes, it is easy to use off-the-shelf plots like barplots, histograms, or if one is truly desperate a <a href="http://simplystatistics.tumblr.com/post/21611701077/sunday-data-statistics-link-roundup-4-22" target="_blank">pie-chart</a>. </p>
<p>But sometimes the information you are trying to communicate requires the development of a new graphic. I am currently working on a project with a graduate student where the standard illustrations are <a href="http://en.wikipedia.org/wiki/Venn_diagram" target="_blank">Venn Diagrams</a> - including complicated Venn Diagrams with 5 or 10 circles. </p>
<p>As we were thinking about different ways of illustrating our data, I started thinking about what are the key qualities of a graphic and how do I know if it is too complicated. I realized that:</p>
<ol>
<li>Ideally just looking at the graphic one can intuitively understand what is going on, but sometimes for more technical/involved displays this isn’t possible</li>
<li>Alternatively, I think a good plot should be able to be explained in 2 sentences or less. I think that is true for pretty much every plot I use regularly. </li>
<li>That isn’t including describing what different colors/sizes/shapes specifically represent in any particular version of the graphic. </li>
</ol>
<p>I feel like there is probably something to this in the <a href="http://www.amazon.com/The-Grammar-Graphics-Leland-Wilkinson/dp/0387987746" target="_blank">Grammar of Graphics</a> or in some of <a href="http://www.stat.purdue.edu/~wsc/papersbooks.pdf" target="_blank">William Cleveland’s</a> work. But this is one of the first times I’ve come up with a case where a new, generalizable, type of graph needs to be developed. </p>
On the future of personalized medicine
2012-04-24T13:04:00+00:00
http://simplystats.github.io/2012/04/24/on-the-future-of-personalized-medicine
<p>Jeff Leek, Reeves Anderson, and I recently wrote a <a href="http://www.nature.com/nature/journal/v484/n7394/full/484318a.html" target="_blank">correspondence to <em>Nature</em></a> (subscription req.) regarding the <a href="http://simplystatistics.tumblr.com/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court" target="_blank">Supreme Court decision in <em>Mayo v. Prometheus</em></a> and the recent Institute of Medicine <a href="http://www.iom.edu/Activities/Research/OmicsBasedTests.aspx" target="_blank">report</a> related to the <a href="http://simplystatistics.tumblr.com/post/18378666076/the-duke-saga-starter-set" target="_blank">Duke Clinical Trials Saga</a>. </p>
<p>The basic gist of the correspondence is that the IOM report stresses the need for openness in the process of developing ‘omics based tests, but the Court decision suggests that patent protection will not be available to protect those details. So how will the future of personalized medicine look? There is a much larger, more general, discussion that could be had about patents in this arena and we do not get into that here (hey, we had to squeeze it into 300 words). But it seems that if biotech companies cannot make money from patented algorithms, then they will have to find a new avenue. </p>
<p>Here are some <a href="http://www.biostat.jhsph.edu/~rpeng/talks/MayoIOM.pdf" target="_blank">slides from a recent lecture</a> I gave outlining some of the ideas and providing some background.</p>
Sunday data/statistics link roundup (4/22)
2012-04-22T23:54:12+00:00
http://simplystats.github.io/2012/04/22/sunday-data-statistics-link-roundup-4-22
<ol>
<li>Now we know who is to blame for the <a href="http://www.nytimes.com/2012/04/22/magazine/who-made-that-pie-chart.html" target="_blank">pie chart</a>. I had no idea it had been around, straining our ability to compare relative areas, since 1801. However, the same guy (William Playfair) apparently also invented the bar chart. So he wouldn’t be totally shunned by statisticians. (via Leonid K.)</li>
<li>A <a href="http://www.guardian.co.uk/technology/2012/apr/22/academic-publishing-monopoly-challenged" target="_blank">nice article</a> in the Guardian about the current group of scientists that are boycotting Elsevier. I have to agree with the quote that leads the article, “All professions are conspiracies against the laity.” On the other hand, I agree with Rafa that academics are <a href="http://simplystatistics.tumblr.com/post/15756182268/academics-are-partly-to-blame-for-supporting-the-closed" target="_blank">partially to blame</a> for buying into the closed access hegemony. I think more than a boycott of a single publisher is needed; we need a change in culture. (first link also via Leonid K)</li>
<li>A blog post on how to <a href="http://menugget.blogspot.com/2012/04/adding-transparent-image-layer-to-plot.html#more" target="_blank">add a transparent image layer</a> to a plot. For some reason, I have wanted to do this several times over the last couple of weeks, so the serendipity of seeing it on R Bloggers merited a mention. </li>
<li>I agree the Earth Institute <a href="http://junkcharts.typepad.com/junk_charts/2012/04/the-earth-institute-needs-a-graphics-advisor.html" target="_blank">needs a better graphics advisor</a>. (via Andrew G.)</li>
<li><a href="http://www.nytimes.com/2012/04/22/opinion/sunday/taking-emotions-out-of-our-schools.html" target="_blank">A great article</a> on why multiple choice tests are used - they are an easy way to collect data on education. But that doesn’t mean they are the right data. This reminds me of the Tukey quote: “The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data<strong id="internal-source-marker_0.597119664773345"><span>”. </span></strong><span>It seems to me if you wanted to have a major positive impact on education right now, the best way would be to develop a new experimental design that collects the kind of data that really demonstrates mastery of reading/math/critical thinking. </span></li>
<li>Finally, a bit of a bleg…what is the best way to do the SVD of a huge (think 1e6 x 1e6), sparse matrix in R? Preferably without loading the whole thing into memory…</li>
</ol>
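<p>On the SVD question in the last item: the usual way to avoid forming a dense matrix is an iterative method that only needs matrix-vector products. The block below is a minimal sketch of that idea in Python rather than R, using plain power iteration on A<sup>T</sup>A to get only the leading singular value. The function name, the dictionary-of-coordinates storage, and the fixed iteration count are all illustrative assumptions; a real analysis would more likely rely on a Lanczos-type routine from a sparse linear algebra library, and a matrix that does not fit in memory would need its entries streamed from disk.</p>

```python
import math


def top_singular_value(entries, n_cols, n_iter=100):
    """Estimate the largest singular value of a sparse matrix A.

    entries: dict mapping (row, col) -> nonzero value (hypothetical storage format).
    n_cols:  number of columns of A.
    Only the nonzero entries are ever touched, so A is never densified.
    Assumes the starting vector is not orthogonal to the top right singular vector.
    """
    if n_cols == 0:
        return 0.0
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # unit-length starting vector
    sigma = 0.0
    for _ in range(n_iter):
        # u = A v, accumulated row by row from the nonzero entries
        u = {}
        for (i, j), a in entries.items():
            u[i] = u.get(i, 0.0) + a * v[j]
        # w = A^T u = (A^T A) v
        w = [0.0] * n_cols
        for (i, j), a in entries.items():
            w[j] += a * u[i]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        # At convergence ||A^T A v|| equals sigma^2 for unit v
        sigma = math.sqrt(norm)
    return sigma


if __name__ == "__main__":
    diagonal = {(0, 0): 3.0, (1, 1): 1.0}
    print(top_singular_value(diagonal, n_cols=2))
```

<p>Each pass costs time proportional to the number of nonzero entries, which is why this family of methods scales to matrices whose dense form would never fit in memory. Getting further singular values requires deflation or a block method, which this sketch does not attempt.</p>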
Replication, psychology, and big science
2012-04-18T15:29:00+00:00
http://simplystats.github.io/2012/04/18/replication-psychology-and-big-science
<p><a href="http://www.sciencemag.org/content/334/6060/1226.full" target="_blank">Reproducibility</a> <a href="http://simplystatistics.tumblr.com/post/12328728291/interview-with-victoria-stodden" target="_blank">has been</a> a <a href="http://simplystatistics.tumblr.com/post/13780369155/preventing-errors-through-reproducibility" target="_blank">hot topic</a> for the last several years among computational scientists. A study is reproducible if there is a specific set of computational functions/analyses (usually specified in terms of code) that exactly reproduce all of the numbers in a published paper from raw data. It is now recognized that a critical component of the scientific process is that data analyses can be reproduced. This point has been driven home particularly for personalized medicine applications, where irreproducible results <a href="http://www.nature.com/news/lapses-in-oversight-compromise-omics-results-1.10298?nc=1332884191164" target="_blank">can lead to delays</a> in evaluating new procedures that affect patients’ health. </p>
<p>But just because a study is reproducible does not mean that it is <em>replicable</em>. Replicability is stronger than reproducibility. A study is only replicable if you perform the exact same experiment (at least) twice, collect data in the same way both times, perform the same data analysis, and arrive at the same conclusions. The difference with reproducibility is that to achieve replicability, you have to perform the experiment and collect the data again. This of course introduces all sorts of new potential sources of error in your experiment (new scientists, new materials, new lab, new thinking, different settings on the machines, etc.).</p>
<p>Replicability is getting a lot of attention recently in psychology due to some high-profile studies that did not replicate. First, there was the highly-cited experiment that <a href="http://blogs.discovermagazine.com/notrocketscience/2012/03/10/failed-replication-bargh-psychology-study-doyen/" target="_blank">failed to replicate</a>, leading to a showdown between the author of the original experiment and the replicators. Now there is a psychology project that allows researchers to post the results of <a href="http://www.sciencemag.org/content/335/6076/1558" target="_blank">replications of experiments</a> - whether they succeeded or failed. Finally, the <a href="http://openscienceframework.org/project/shvrbV8uSkHewsfD4/wiki/index" target="_blank">Reproducibility Project</a>, probably better termed the Replicability Project, seeks to <a href="http://chronicle.com/blogs/percolator/is-psychology-about-to-come-undone/29045?sid=at&utm_source=at&utm_medium=en" target="_blank">replicate the results</a> of every experiment in the journals <em>Psychological Science</em>, the <em>Journal of Personality and Social Psychology</em>, or the <em>Journal of Experimental Psychology: Learning, Memory, and Cognition</em> in the year 2008.</p>
<p>Replicability raises important issues for “big science” projects, ranging from genomics (<a href="http://www.1000genomes.org/" target="_blank">The Thousand Genomes Project</a>) to physics (<a href="http://en.wikipedia.org/wiki/Large_Hadron_Collider" target="_blank">The Large Hadron Collider</a>). These experiments are too big and costly to actually replicate. So how do we know the results of these experiments aren’t just errors, that upon replication (if we could do it) would not show up again? Maybe smaller scale replications of sub-projects could be used to help convince us of discoveries in these big projects?</p>
<p>In the meantime, I love the idea that replication is getting the credit it deserves (at least in psychology). The incentives in science often only credit the first person to an idea, not the long tail of folks who replicate the results. For example, replications of experiments are often not considered interesting enough to publish. Maybe these new projects will start to change some of the <a href="http://blog.regehr.org/archives/632" target="_blank">perverse academic incentives</a>.</p>
Roche: Illumina Is No Apple
2012-04-16T17:37:19+00:00
http://simplystats.github.io/2012/04/16/roche-illumina-is-no-apple
<p><a href="http://dealbook.nytimes.com/2012/04/11/roche-illumina-is-no-apple/">Roche: Illumina Is No Apple</a></p>
Sunday data/statistics link roundup (4/15)
2012-04-15T17:30:14+00:00
http://simplystats.github.io/2012/04/15/sunday-data-statistics-link-roundup-4-15
<ol>
<li>Incredibly cool, dynamic real-time maps of <a href="http://hint.fm/wind/" target="_blank">wind patterns</a> in the United States. (Via Flowing Data)</li>
<li>A d3.js <a href="http://gabrielflor.it/water" target="_blank">coding tool</a> that updates automatically as you update the code. This is going to be really useful for beginners trying to learn about D3. <a href="http://gabrielflor.it/water" target="_blank">Real time coding</a> (Via Flowing Data)</li>
<li>An interesting <a href="http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html" target="_blank">blog post </a>describing why the winning algorithm in the Netflix prize hasn’t actually been implemented! It looks like it was too much of an engineering hassle. I wonder if this will make others think twice before offering big sums for prizes like this. Unless the real value is advertising…(via Chris V.)</li>
<li><a href="http://www.fastcoexist.com/1679654/using-big-data-to-predict-your-potential-heart-problems" target="_blank">An article </a>about a group at USC that plans to collect all the information from apps that measure heart beats. Their project is called everyheartbeat. I think this is a little bit premature, given the technology, but certainly the quantified self field is heating up. I wonder how long until the target audience for these sorts of projects isn’t just wealthy young technophiles? </li>
<li>A <a href="http://sellthenews.tumblr.com/post/21067996377/noitdoesnot" target="_blank">really good deconstruction</a> of a <a href="http://arxiv.org/abs/1010.3003" target="_blank">recent paper</a> suggesting that the mood on Twitter could be used to game the stock market. The author illustrates several major statistical flaws, including not correcting for multiple testing, an implausible statistical model, and not using a big enough training set. The scary thing is apparently a hedge fund is teaming up with this group of academics to try to implement their approach. I wouldn’t put my money anywhere they can get their hands on it. This is just one more in the accelerating line of results that illustrate the critical need for statistical literacy both among scientists and in the general public.</li>
</ol>
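<p>To see why skipping a multiple testing correction (item 5 above) matters, here is a small sketch of the arithmetic. It assumes independent tests where every null hypothesis is true, and it uses the Bonferroni correction as one simple remedy; the function names and the test counts are made up for illustration and are not taken from the paper or the critique.</p>

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests independent tests,
    when every null hypothesis is true and each test is run at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test level that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    for m in (1, 5, 20):
        uncorrected = familywise_error_rate(m)
        corrected = familywise_error_rate(m, bonferroni_threshold(m))
        print(m, round(uncorrected, 3), round(corrected, 3))
```

<p>With 20 uncorrected tests at the 0.05 level, the chance of at least one spurious "significant" result is roughly 64%, which is why an uncorrected search over many candidate predictors is so easy to fool yourself with.</p>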
Interview with Drew Conway - Author of "Machine Learning for Hackers"
2012-04-13T13:31:21+00:00
http://simplystats.github.io/2012/04/13/interview-with-drew-conway-author-of-machine
<p><strong>Drew Conway</strong></p>
<p><strong><img height="190" src="http://biostat.jhsph.edu/~jleek/drew-iav-color.jpg" width="230" /></strong></p>
<p>Drew Conway is a Ph.D. student in Politics at New York University and the co-ordinator of the <a href="http://www.meetup.com/nyhackr/" target="_blank">New York Open Statistical Programming Meetup</a>. He is the creator of the famous (or infamous) data science <a href="http://www.drewconway.com/zia/?p=2378" target="_blank">Venn diagram</a>, the basis for our <a href="http://simplystatistics.tumblr.com/post/11271228367/datascientist" target="_blank">R function</a> to determine if you’re a data scientist. He is also the co-author of <a href="http://shop.oreilly.com/product/0636920018483.do" target="_blank">Machine Learning for Hackers</a>, a book of case studies that illustrates data science from a hacker’s perspective. </p>
<div class="im">
<strong>Which term applies to you: data scientist, statistician, computer</strong><br /><strong>scientist, or something else?</strong>
<div>
</div>
</div>
<div>
Technically, my undergraduate degree is in computer science, so that term can be applied. I was actually a double major in CS and political science, however, so it wouldn’t tell the whole story. I have always been most interested in answering social science problems with the tools of computer science, math and statistics.
</div>
<div>
</div>
<div>
I have struggled a bit with the term “data scientist.” About a year ago, when it seemed to be gaining a lot of popularity, I bristled at it. Like many others, I complained that it was simply a corporate rebranding of other skills, and that the term “science” was appended to give some veil of legitimacy. Since then, I have warmed to the term, but, as is often the case, only when I can define what data science is in my own terms. Now, I do think of what I do as being data science, that is, the blending of technical skills and tools from computer science, with the methodological training of math and statistics, and my own substantive interest in questions about collective action and political ideology.
</div>
<div>
</div>
<div>
I think the term is very loaded, however, and when many people invoke it they often do so as a catch-all for talking about working with a certain set of tools: R, map-reduce, data visualization, etc. I think this actually hurts the discipline a great deal, because if it is meant to actually be a science the majority of our focus should be on questions, not tools.
</div>
<div class="im">
<div>
</div>
<p>
<strong>You are in the department of politics? How is it being a “data</strong><br /><strong>person” in a non-computational department?</strong>
</p>
<div>
</div>
</div>
<div>
Data has always been an integral part of the discipline, so in that sense many of my colleagues are data people. I think the difference between my work and the work that many other political scientists do is simply a matter of where and how I get my data.
</div>
<div>
</div>
<div>
For example, a traditional political science experiment might involve a small set of undergraduates taking a survey or playing a simple game on a closed network. That data would then be collected and analyzed as a controlled experiment. Alternatively, I am currently running an experiment wherein my co-authors and I are attempting to code text documents (political party manifestos) with ideological scores (very liberal to very conservative). To do this we have broken down the documents into small chunks of text and are having workers on Mechanical Turk code single chunks—rather than the whole document at once. In this case the data scale up very quickly, but by aggregating the results we are able to have a very different kind of experiment with much richer data.
</div>
<div>
</div>
<div>
At the same time, I think political science, and perhaps the social sciences more generally, suffer from a tradition of undervaluing technical expertise. In that sense, it is difficult to convince colleagues that developing software tools is important.
</div>
<div class="im">
<div>
</div>
<p>
<strong>Is that what inspired you to create the New York Open Statistical Meetup?</strong>
</p>
<div>
</div>
</div>
<div>
<div>
I actually didn’t create the New York Open Statistical Meetup (formerly the R meetup). Joshua Reich was the original founder, back in 2008, and shortly after the first meeting we partnered and ran the Meetup together. Once Josh became fully consumed by starting / running BankSimple I took it over by myself. I think the best part about the Meetup is how it brings people together from a wide range of academic and industry backgrounds, and we can all talk to each other in a common language of computational programming. The cross-pollination of ideas and talents is inspiring.
</div>
<div>
</div>
<div>
We are also very fortunate in that the community here is so strong, and that New York City is a well traveled place, so there is never a shortage of great speakers.
</div>
</div>
<div class="im">
<div>
</div>
<p>
<strong>You created the data science Venn diagram. Where do you fall on the diagram?</strong>
</p>
<div>
</div>
</div>
<div>
Right at the center, of course! Actually, before I entered graduate school, which is long before I drew the Venn diagram, I fell squarely in the danger zone. I had a lot of hacking skills, and my work (as an analyst in the U.S. intelligence community) afforded me a lot of substantive expertise, but I had little to no formal training in statistics. If you could describe my journey through graduate school within the framework of the data science Venn diagram, it would be about me trying to pull myself out of the danger zone by gaining as much math and statistics knowledge as I can.
</div>
<div class="im">
<div>
</div>
<p>
<strong>I see that a lot of your software (including R packages) are on Github. Do you post them on CRAN as well? Do you think R developers will eventually move to Github from CRAN?</strong>
</p>
<div>
</div>
</div>
<div>
<div>
I am a big proponent of open source development, especially in the context of sharing data and analyses and creating reproducible results. I love Github because it creates a great environment for following the work of other coders, and participating in the development process. For data analysis, it is also a great place to upload data and R scripts and allow the community to see how you did things and comment. I also think, however, that there is a big opportunity for a new site, like Github, to be created that is more tailored for data analysis, and storing and disseminating data and visualizations.
</div>
<div>
</div>
<div>
I do post my R packages to CRAN, and I think that CRAN is one of the biggest strengths of the R language and community. I think ideally more package developers would open their development process, on Github or some other social coding platform, and then push their well-vetted packages to CRAN. This would allow for more people to participate, but maintain the great community resource that CRAN provides.
</div>
</div>
<div class="im">
<div>
</div>
<p>
<strong>What inspired you to write, “Machine Learning for Hackers”? Who</strong><br /><strong>was your target audience?</strong>
</p>
<div>
</div>
</div>
<div>
<div>
A little over a year ago John Myles White (my co-author) and I were having a lot of conversations with other members of the data community in New York City about what a data science curriculum would look like. During these conversations people would always cite the classic texts: Elements of Statistical Learning, Pattern Recognition and Machine Learning, etc., which are excellent and deep treatments of the foundational theories of machine learning. From these conversations it occurred to us that there was not a good text on machine learning for people who thought more algorithmically. That is, there was not a text for “hackers,” people who enjoy learning about computation by opening up black-boxes and getting their hands dirty with code.
</div>
<div>
</div>
<div>
It was from this idea that the book, and eventually the title, were born. We think the audience for the book is anyone who wants to get a relatively broad introduction to some of the basic tools of machine learning, and do so through code, not math. This can be someone working at a company with data that wants to add some of these tools to their belt, or it can be an undergraduate in a computer science or statistics program that can relate to the material more easily through this presentation than the more theoretically heavy texts they’re probably already reading for class.
</div>
<div>
</div>
</div>
The Problem with Universities
2012-04-12T14:49:04+00:00
http://simplystats.github.io/2012/04/12/the-problem-with-universities
<p>I have had the following conversation a number of times recently:</p>
<ol>
<li>I want to do X. X is a lot of fun and is really interesting. Doing X involves a little of A and a little of B.</li>
<li>We should get some students to do X also.</li>
<li>Okay, but from where should we get the students? Students in Department of A don’t know B. Students from Department of B don’t know A.</li>
<li>Fine, maybe we could start a program that specifically trains people in X. In this program we’ll teach them A and B. It’ll be the first program of its kind! Woohoo!</li>
<li>Sure that’s great, but because there aren’t any <em>other</em> departments of X, the graduates of our program now have to get jobs in departments of A or B. Those departments complain that students from Department of X only know a little of A (or B).</li>
<li>Grrr. Go away.</li>
</ol>
<p>Has anyone figured out a solution to this problem? Specifically, how do you train students to do something for which there’s no formal department/program without jeopardizing their career prospects?</p>
Statistics is not math...
2012-04-11T13:52:42+00:00
http://simplystats.github.io/2012/04/11/statistics-is-not-math
<p>Statistics depends on math, like a lot of other disciplines (physics, engineering, chemistry, computer science). But just like those other disciplines, statistics is not math; math is just a tool used to solve statistical problems. Unlike those other disciplines, statistics gets lumped in with math in headlines. Whenever people use statistical analysis to solve an interesting problem, the headline reads:</p>
<p>“Math can be used to solve amazing problem X”</p>
<p>or</p>
<p>“The Math of Y” </p>
<p>Here are some examples:</p>
<p><a href="http://www.wired.com/wiredscience/2012/01/the-mathematics-of-lego/" target="_blank">The Mathematics of Lego</a> - Using data on legos to estimate a distribution</p>
<p><a href="http://www.ted.com/talks/sean_gourley_on_the_mathematics_of_war.html" target="_blank">The Mathematics of War</a> - Using data on conflicts to estimate a distribution</p>
<p><a href="https://twitter.com/#!/Cambridge_Uni/status/187844697170001920" target="_blank">Usain Bolt can run faster with maths</a> (Tweet) - Turns out they analyzed data on start times to come to the conclusion</p>
<p><a href="http://blog.okcupid.com/index.php/the-mathematics-of-beauty/" target="_blank">The Mathematics of Beauty</a> - Analysis of data relating dating profile responses and photo attractiveness</p>
<p>These are just a few off the top of my head, but I regularly see headlines like this. I think there are a couple of reasons for math being grouped with statistics: (1) many of the founders of statistics were mathematicians first (<a href="http://en.wikipedia.org/wiki/Ronald_Fisher" target="_blank">but not all of them</a>), (2) many statisticians still identify themselves as mathematicians, and (3) in some cases statistics and statisticians define themselves pretty narrowly. </p>
<p>With respect to (3), consider the following list of disciplines:</p>
<ol>
<li>Biostatistics</li>
<li>Data science</li>
<li>Machine learning</li>
<li>Natural language processing</li>
<li>Signal processing</li>
<li>Business analytics</li>
<li>Econometrics</li>
<li>Text mining</li>
<li>Social science statistics</li>
<li>Process control</li>
</ol>
<p>All of these disciplines could easily be classified as “applied statistics”. But how many folks in each of those disciplines would classify themselves as statisticians? More importantly, how many would be claimed by statisticians? </p>
Evolution, Evolved
2012-04-10T14:58:45+00:00
http://simplystats.github.io/2012/04/10/evolution-evolved
<p><a href="http://magazine.jhu.edu/spring-2012/evolution-evolved">Evolution, Evolved</a></p>
What is a major revision?
2012-04-09T15:01:34+00:00
http://simplystats.github.io/2012/04/09/what-is-a-major-revision
<p>I posted a little while ago on a proposal for a <a href="http://simplystatistics.tumblr.com/post/19289280474/a-proposal-for-a-really-fast-statistics-journal" target="_blank">fast statistics journal</a>. It generated a bunch of comments and even a really nice <a href="http://yihui.name/en/2012/03/a-really-fast-statistics-journal/" target="_blank">follow up post</a> with some great ideas. Since then I’ve gotten reviews back on a couple of papers and I think I realized one of the key issues that is driving me nuts about the current publishing model. It boils down to one simple question: </p>
<p><em>What is a major revision? </em></p>
<p>I often get reviews back that suggest “major revisions” in one or many of the following categories:</p>
<ol>
<li>More/different simulations</li>
<li>New simulations</li>
<li>Re-organization of content</li>
<li>Re-writing language</li>
<li>Asking for more references</li>
<li>Asking me to include a new method</li>
<li>Asking me to implement someone else’s method for comparison</li>
</ol>
<div>
I don’t consider any of these major revisions. Personally, I have stopped asking for them as major revisions. In my opinion, major revisions should be reserved for issues with the manuscript that suggest that it may be reporting incorrect results. Examples include:
</div>
<div>
<ol>
<li>
No simulations
</li>
<li>
No real data
</li>
<li>
The math/computations look incorrect
</li>
<li>
The software didn’t work when I tried it
</li>
<li>
The methods/algorithms are unreadable and can’t be followed
</li>
</ol>
<div>
The first list is actually a list of minor/non-essential revisions in my opinion. They may <em>improve</em> my paper, but they won’t confirm whether it is correct. I find that they are often subjective and are up to the whims of referees. In my own personal refereeing I am making an effort to remove subjective major revisions and only include issues that are critical to evaluate the correctness of a manuscript. I also try to divorce the issues of whether an idea is interesting or not from whether an idea is correct or not.
</div>
</div>
<div>
</div>
<div>
I’d be curious to know what other people’s definitions of major/minor revisions are.
</div>
Sunday data/statistics link roundup (4/8)
2012-04-09T01:42:10+00:00
http://simplystats.github.io/2012/04/09/sunday-data-statistics-link-roundup-4-8
<ol>
<li>This is a <a href="http://arxiv.org/pdf/math/0606441.pdf" target="_blank">great article</a> about the illusion of progress in machine learning. In part, I think it explains why the <a href="http://simplystatistics.tumblr.com/post/18132467723/prediction-the-lasso-vs-just-using-the-top-10" target="_blank">Leekasso</a> (just using the top 10) isn’t a totally silly idea. I also love how he talks about sources of uncertainty in real prediction problems that aren’t part of the classical models when developing prediction algorithms. I think that this is a hugely underrated component of building an accurate classifier - just finding the quirks particular to a type of data. Via <a href="https://twitter.com/#!/chlalanne" target="_blank">@chlalanne</a>.</li>
<li>An <a href="http://www.michaeleisen.org/blog/?p=1009" target="_blank">interesting post</a> from Michael Eisen on a serious abuse of statistical ideas in the New York Times. The professor of genetics quoted in the story apparently wasn’t aware of the <a href="http://en.wikipedia.org/wiki/Birthday_problem" target="_blank">birthday problem</a>. Lack of statistical literacy, even among scientists, is becoming critical. I would love it if the Khan Academy (or some enterprising students) would come up with a set of videos that just explained a bunch of basic statistical concepts - skipping all the hard math and focusing on the ideas. </li>
<li> TechCrunch finally <a href="http://techcrunch.com/2012/04/08/patent-law-101-whats-wrong-and-ways-to-make-it-right/" target="_blank">caught up</a> to our <a href="http://simplystatistics.tumblr.com/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court" target="_blank">Mayo vs. Prometheus</a> <a href="http://simplystatistics.tumblr.com/post/19626747057/supreme-court-unanimously-rules-against-personalized" target="_blank">coverage</a>. This decision is going to affect more than just personalized medicine. Speaking of the decision, stay tuned for more on that topic from the folks over here at Simply Statistics. </li>
<li><a href="http://www.nytimes.com/2012/04/09/technology/how-to-budget-megabytes-becomes-more-urgent-for-users.html?_r=1&hpw" target="_blank">How much is a megabyte</a>? I love this question. They asked people on the street how much data was in a megabyte. The answers look pretty wide-ranging. This question is hyper-critical for scientists in the new era, but the better question might be, “How much is a terabyte?”</li>
</ol>
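<p>For anyone who hasn’t met the birthday problem mentioned in item 2, the calculation is short enough to sketch. The version below assumes birthdays are independent and uniformly spread over 365 days (no leap years), which is the textbook simplification; the function name is illustrative.</p>

```python
def shared_birthday_probability(n_people, n_days=365):
    """Probability that at least two of n_people share a birthday,
    assuming independent, uniformly distributed birthdays over n_days."""
    if n_people > n_days:
        return 1.0  # pigeonhole: a shared birthday is certain
    p_all_distinct = 1.0
    for k in range(n_people):
        # The (k+1)-th person must avoid the k birthdays already taken
        p_all_distinct *= (n_days - k) / n_days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (10, 23, 50):
        print(n, round(shared_birthday_probability(n), 3))
```

<p>The well-known punchline is that the probability passes one half at just 23 people. The general lesson is that a coincidence which is rare for any one pair can be likely once you count all the pairs that could have produced it.</p>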
Study Says DNA’s Power to Predict Illness Is Limited
2012-04-05T14:10:43+00:00
http://simplystats.github.io/2012/04/05/study-says-dnas-power-to-predict-illness-is-limited
<p><a href="http://www.nytimes.com/2012/04/03/health/research/dnas-power-to-predict-is-limited-study-finds.html">Study Says DNA’s Power to Predict Illness Is Limited</a></p>
Epigenetics: Marked for success
2012-04-04T12:22:30+00:00
http://simplystats.github.io/2012/04/04/epigenetics-marked-for-success
<p><a href="http://www.nature.com/nature/journal/v483/n7391/full/nj7391-637a.html">Epigenetics: Marked for success</a></p>
ENAR Meeting
2012-04-03T12:14:28+00:00
http://simplystats.github.io/2012/04/03/enar-meeting
<p>This is the <a href="http://enar.org/meetings.cfm" target="_blank">ENAR meeting</a> so posting will be intermittent. If you’re at the meeting I’ll be talking at 1:45 today in the Columbia B room in a session on climate change and health. I hear Rafa is roaming the halls too so make sure you say hi if you see him.</p>
R 2.15.0 is released
2012-03-30T12:26:58+00:00
http://simplystats.github.io/2012/03/30/r-2-15-0-is-released
<p><a href="https://stat.ethz.ch/pipermail/r-announce/2012/000551.html">R 2.15.0 is released</a></p>
New U.S. Research Will Aim at Flood of Digital Data
2012-03-30T01:04:57+00:00
http://simplystats.github.io/2012/03/30/new-u-s-research-will-aim-at-flood-of-digital-data
<p><a href="http://www.nytimes.com/2012/03/29/technology/new-us-research-will-aim-at-flood-of-digital-data.html">New U.S. Research Will Aim at Flood of Digital Data</a></p>
Big Data Meeting at AAAS
2012-03-29T13:56:00+00:00
http://simplystats.github.io/2012/03/29/big-data-meeting-at-aaas
<p>The White House Office of Science and Technology Policy is hosting a meeting that will discuss several new federal efforts relating to big data. The meeting is <strong>today</strong> from 2-3:45pm and there will be a <a href="http://live.science360.gov/bigdata/" target="_blank">live webcast</a>.</p>
<p>Participants include</p>
<ul>
<li>John Holdren, Assistant to the President and Director, White House Office of Science and Technology Policy</li>
<li><span>Subra Suresh, Director, National Science Foundation</span></li>
<li><span>Francis Collins, Director, National Institutes of Health</span></li>
<li><span>Marcia McNutt, Director, United States Geological Survey</span></li>
<li><span>William Brinkman, Director, Department of Energy Office of Science</span></li>
<li><span>Zach Lemnios, Assistant Secretary of Defense for Research & Engineering, Department of Defense</span></li>
<li><span>Kaigham “Ken” Gabriel, Deputy Director, Defense Advanced Research Projects Agency</span></li>
<li><span>Daphne Koller, Stanford University (machine learning and applications in biology and education)</span></li>
<li><span>James Manyika, McKinsey & Company (Co-author of major McKinsey report on Big Data)</span></li>
<li><span>Lucila Ohno-Machado, UC San Diego (NIH’s “Integrating Data for Analysis, Anonymization, and Sharing” initiative)</span></li>
<li><span>Alex Szalay, Johns Hopkins University (Big Data for astronomy)</span></li>
</ul>
<div>
<strong>Update</strong>: Some more information from the <a href="http://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal" target="_blank">White House itself</a>.
</div>
Roche Raises Illumina Bid to $51, Seeking Faster Deal
2012-03-29T11:57:55+00:00
http://simplystats.github.io/2012/03/29/roche-raises-illumina-bid-to-51-seeking-faster-deal
<p><a href="http://www.bloomberg.com/news/2012-03-29/roche-raises-illumina-bid-to-51-seeking-faster-deal.html">Roche Raises Illumina Bid to $51, Seeking Faster Deal</a></p>
Justices Send Back Gene Case
2012-03-27T00:18:57+00:00
http://simplystats.github.io/2012/03/27/justices-send-back-gene-case
<p><a href="http://www.nytimes.com/2012/03/27/business/high-court-orders-new-look-at-gene-patents.html">Justices Send Back Gene Case</a></p>
Supreme court vacates ruling on BRCA gene patent!
2012-03-26T15:53:56+00:00
http://simplystats.github.io/2012/03/26/supreme-court-vacates-ruling-on-brca-gene-patent
<p><span>As Reeves alluded to in <a href="http://simplystatistics.tumblr.com/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court" target="_blank">his post</a> about the Mayo personalized medicine case, the Supreme Court just vacated the lower court’s ruling in </span><em>Association for Molecular Pathology v. Myriad Genetics</em><span> (No. 11-725). The case has been sent back down to the Federal Circuit for reconsideration in light of the Court’s decision in </span><em>Mayo</em><span>. This means that the Supreme Court thought the two cases were sufficiently similar that the lower courts should take another look using the new direction from </span><em>Mayo</em><span>.</span></p>
<p><span> It’s looking more and more like the Supreme Court is strongly opposed to personalized medicine patents. </span></p>
R and the little data scientist's predicament
2012-03-26T15:00:05+00:00
http://simplystats.github.io/2012/03/26/r-and-the-little-data-scientists-predicament
<p>I just read this <a href="http://www.slate.com/articles/technology/technology/2012/03/ruby_ruby_on_rails_and__why_the_disappearance_of_one_of_the_world_s_most_beloved_computer_programmers_.single.html" target="_blank">fascinating post</a> on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was <a href="http://viewsourcecode.org/why/hacking/theLittleCodersPredicament.html" target="_blank">The Little Coder’s Predicament</a>, which, boiled down, essentially says that computer programming languages have grown too complex, so children/newbies can’t get instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a simple line of code or two they could make the computer <strong>do things</strong> like play some music or make a video. </p>
<p>I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax. </p>
<p>Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:</p>
<ol>
<li>It would have to be easy to load data sets with one line of uncomplicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load(“education-data”) and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background. </li>
<li>It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures. </li>
<li>It would be awesome if the functions would include some sort of dynamic graphics (with <a href="http://www.omegahat.org/SVGAnnotation/" target="_blank">svgAnnotation</a> or a wrapper for <a href="http://mbostock.github.com/d3/" target="_blank">D3.js</a>). Again, the syntax would have to be really accessible/not too much to learn. </li>
</ol>
<p>That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends. </p>
Sunday data/statistics link roundup (3/25)
2012-03-25T15:58:25+00:00
http://simplystats.github.io/2012/03/25/sunday-data-statistics-link-roundup-3-25
<ol>
<li>The psychologist whose experiment didn’t replicate, and who then <a href="http://simplystatistics.tumblr.com/post/19190862781/sunday-data-statistics-link-roundup-3-11" target="_blank">went off</a> on the scientists who did the replication experiment, is <a href="http://www.psychologytoday.com/blog/the-natural-unconscious/201203/angry-birds?page=2" target="_blank">at it again</a>. I don’t see a clear argument about the facts of the matter in his post, just more name calling. This seems to be a case study in what not to do when your study doesn’t replicate. More on “conceptual replication” in there too. </li>
<li>Berkeley is running a <a href="http://datascienc.es/" target="_blank">data science course</a> with instructors <span>Jeff Hammerbacher and Mike Franklin. I looked through the notes and it looks pretty amazing. Stay tuned for more info about my applied statistics class, which starts this week. </span></li>
<li><span>A <a href="http://nyti.ms/GXwvUe." target="_blank">cool article</a> about Factual, one of the companies whose sole mission in life is to collect and distribute data. We’ve <a href="http://simplystatistics.tumblr.com/post/10410458080/data-sources" target="_blank">linked</a> to them before. We are so out ahead of the Times on this one…</span></li>
<li><span>This isn’t statistics related, but I love <a href="http://articles.businessinsider.com/2012-03-20/tech/31212683_1_jeff-bezos-robot-bookstore" target="_blank">this post</a> about Jeff Bezos. If we all indulged our inner 11 year old a little more, it wouldn’t be a bad thing. </span></li>
<li><span>If you haven’t had a chance to read Reeves’ <a href="http://simplystatistics.tumblr.com/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court" target="_blank">guest post</a> on the Mayo Supreme Court decision yet, you should; it is really interesting. A fascinating intersection of law and statistics is going on in the personalized medicine world right now. </span></li>
</ol>
Some thoughts from Keith Baggerly on the recently released IOM report on translational omics
2012-03-25T01:44:00+00:00
http://simplystats.github.io/2012/03/25/some-thoughts-from-keith-baggerly-on-the-recently
<p>Shortly after the Duke trial scandal broke, the <a href="http://www.iom.edu/" target="_blank">Institute of Medicine</a> convened a committee to write a report on translational omics. Several statisticians (including one of our <a href="http://simplystatistics.tumblr.com/post/11436138110/interview-with-daniela-witten" target="_blank">interviewees</a>) either served on the committee or provided key testimony. The <a href="http://www.iom.edu/Reports/2012/Evolution-of-Translational-Omics.aspx" target="_blank">report</a> came out yesterday. <a href="http://www.nature.com/news/lapses-in-oversight-compromise-omics-results-1.10298" target="_blank">Nature</a>, <a href="http://blogs.nature.com/spoonful/2012/03/greater-oversight-needed-for-genomic-tests-experts-say.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+nm%2Frss%2Fspoonful_of_medicine+%28Spoonful+of+Medicine+-+Blog+Posts%29%20" target="_blank">Nature Medicine</a>, and <a href="http://news.sciencemag.org/scienceinsider/2012/03/panel-calls-for-closer-oversight.html?ref=hp%20" target="_blank">Science </a>had posts about the release. Keith Baggerly sent an email with his thoughts and he gave me permission to post it here. He starts by pointing out that the Science piece has a key new observation:</p>
<blockquote>
<p><span>The NCI’s Lisa McShane, who spent months herself trying to validate Duke results, says the IOM committee “did a really fine job” in laying out the issues. </span><span>NCI now plans to require that its cooperative groups who want to use omics tests follow a checklist similar to that in the IOM report.</span><span> NCI has not yet decided whether it should add new requirements for omics tests to its peer review process for investigator-initiated grants. But “our hope is that this report will heighten everyone’s awareness,” McShane says. </span><br />
<span></span></p>
</blockquote>
<p><span>Some further thoughts from Keith:</span></p>
<blockquote>
<p><span>First, the report helps clarify the regulatory landscape: if omics-based tests (which the FDA views as medical devices) will direct patient therapy, FDA approval in the form of an Investigational Device Exemption (IDE) is required. This is in keeping with increased guidance FDA has been providing over the past year and a half dealing with companion diagnostics. It seems likely that several of the problems identified with the Duke trials would have been caught by an FDA review, particularly if the agency already had cause for concern, such as a letter to the editor identifying analytical shortcomings.</span><span> </span></p>
<p><span> </span><span>Second, the report recommends the publication of the full data, code, and metadata used to construct the omics assays prior to their use to guide patient therapy. Had such data and code been available earlier, this would have greatly reduced the amount of effort required for others (including us) to check and potentially extend on the underlying results.</span></p>
<p><span>Third, the report emphasizes, repeatedly, that the test must be fully specified (“locked down”) before it is validated, let alone used to guide patient therapy. Quite a bit of effort is given to providing an explicit definition of locked down, in part (we suspect) because both Lisa McShane (NCI) and Robert Becker (FDA) provided testimony that incomplete specification was a problem their agencies encountered frequently. Such specification would have prevented problems such as that identified by the NCI for the Lung Metagene Score (LMS) in 2010, which led the NCI to remove the LMS evaluation as a goal of the Phase III cooperative group trial CALGB-30506.</span></p>
<p><span> </span><span>Finally, the very existence of the report is recognition that reproducibility is an important problem for the omics-test community. This is a necessary step towards fixing the problem.</span></p>
</blockquote>
This graph shows that President Obama's proposed budget treats the NIH even worse than G.W. Bush - Sign the petition to increase NIH funding!
2012-03-23T12:14:00+00:00
http://simplystats.github.io/2012/03/23/this-graph-shows-that-president-obamas-proposed-budget
<p>The NIH provides financial support for a large percentage of biological and medical research in the United States. This funding supports a large number of US jobs, creates new knowledge, and improves healthcare for everyone. So I am signing <a href="http://wh.gov/R3R" target="_blank">this petition</a>: </p>
<p><span><br /></span></p>
<blockquote>
<p><span>NIH funding is essential to our national research enterprise, to our local economies, to the retention and careers of talented and well-educated people, to the survival of our medical educational system, to our rapidly fading worldwide dominance in biomedical research, to job creation and preservation, to national economic viability, and to our national academic infrastructure. </span></p>
<p><span><br /></span></p>
</blockquote>
<p><span>The current administration is proposing a flat $30.7 billion FY 2013 NIH budget. The graph below (left) shows how small the NIH budget is in comparison to the Defense and Medicare budgets in absolute terms. The difference between the administration’s proposal and the petition’s proposal ($33 billion) is barely noticeable.</span><span> </span></p>
<p>The graph on the right shows how in 2003 growth in the NIH budget fell dramatically while Medicare and military spending kept growing. However, despite the decrease in rate, the NIH budget did continue to increase under Bush. If we follow Bush’s post-2003 rate (dashed line), the 2013 budget will be about what the petition asks for: $33 billion. </p>
<p><span><a href="http://rafalab.jhsph.edu/simplystats/nihbudget.png" target="_blank"><img src="http://rafalab.jhsph.edu/simplystats/nihbudget.png" width="500" /></a></span></p>
<p><span><br /></span></p>
<p>If you agree that the relatively modest increase in the NIH budget is worth the incredibly valuable biological, medical, and economic benefits this funding will provide, please consider signing the petition before April 15.</p>
Big Data for the Rest of Us, in One Start-Up
2012-03-22T15:00:05+00:00
http://simplystats.github.io/2012/03/22/big-data-for-the-rest-of-us-in-one-start-up
<p><a href="http://bits.blogs.nytimes.com/2012/03/19/all-about-big-data-in-one-startup/">Big Data for the Rest of Us, in One Start-Up</a></p>
More commentary on Mayo v. Prometheus
2012-03-21T15:00:05+00:00
http://simplystats.github.io/2012/03/21/more-commentary-on-mayo-v-prometheus
<p>Some more <a href="http://www.patentlyo.com/patent/2012/03/mayo-v-prometheus-natural-process-known-elements-normally-no-patent.html" target="_blank">commentary on Mayo v. Prometheus</a> via the Patently-O blog.</p>
<p>A summary of the <a href="http://www.scotusblog.com/case-files/cases/mayo-collaborative-services-v-prometheus-laboratories-inc/" target="_blank">various briefs and history of the case</a> can be found at the SCOTUS blog.</p>
<p>Some <a href="http://www.nytimes.com/2012/03/21/business/justices-reject-patents-for-medical-tests-relying-on-drug-dosages.html" target="_blank">actual news coverage</a> of the decision.</p>
<p>The <a href="http://www.supremecourt.gov/opinions/11pdf/10-1150.pdf" target="_blank">decision</a> is well worth reading, if you’re that kind of nerd. Here, the Court uses the phrase “law of nature” a bit more loosely than perhaps I would use it. On the one hand, something like E=mc^2 might be considered a law of nature, but on the other hand I would consider the observation that certain blood metabolites are correlated with the occurrence of patient side effects as, well, a correlation. Einstein is referred to quite a few times in the opinion, no doubt in part because he himself worked in a patent office (and also discovered a few interesting laws of nature).</p>
<p>If one were to set aside the desire to do inference, then one could argue that in a given sample of people (random or not), any correlation observed within that sample is a “law of nature”, at least within that sample. Then if I draw a different sample and observe a different correlation, is that a different law of nature? Well, it might depend on whether it’s statistically significantly different.</p>
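<p>To make the “statistically significantly different” question concrete, here is a minimal sketch of one textbook way to compare two sample correlations, the Fisher z-transformation. It is an illustration only, not something the Court or the patents used: the function name is hypothetical, the samples are assumed to be independent and roughly bivariate normal, and the correlations and sample sizes in the usage lines are made up.</p>

```python
import math


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test of whether two independent sample correlations differ.

    Assumes independent samples and approximately bivariate-normal data.
    Returns the z statistic and the two-sided p-value.
    """
    # Fisher's transformation makes the sampling distribution roughly normal.
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    # Standard error of the difference between the transformed correlations.
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Made-up numbers: two samples of 50 people with quite different correlations.
z_stat, p_value = fisher_z_test(0.8, 50, 0.2, 50)
print(f"z = {z_stat:.2f}, p = {p_value:.2g}")
```

<p>With identical correlations the statistic is zero and the p-value is one, so in this framing a second sample only counts as a “different law” when the gap is large relative to the sampling noise.</p>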
<p>In the end, maybe it doesn’t matter, because no law of nature is patentable, no matter how many there are. I do find it interesting that the Court considered, in some sense, the possibility of statistical variation.</p>
<p>The Court also noted that simply ordering a bunch of steps together did not make a procedure patentable, if the things that were put together were things that doctors (or people in the profession) were already doing. The question becomes, if you take away the statistical correlation in the patent, is there anything left? No, because doctors were already treating patients with immune-mediated gastrointestinal disorders and those patients were already being tested for blood metabolites. </p>
<p>This section of the decision caught my eye because it sounded a lot like the work of an applied statistician. Much of applied statistics involves taking methods and techniques that are already well known (lasso, anyone?) and applying them in new and interesting ways to new and interesting data. It seems taking a bunch of well-known processes/techniques and putting them together is not patentable, even if it is interesting. I don’t think I have a problem with that, but then again, getting patents isn’t my main goal.</p>
<p>Actual lawyers will be able to tell whether this case is significant. However, it seems there are many statistical correlations out there that are waiting to be turned into medical treatments. For example, take the <a href="http://simplystatistics.tumblr.com/post/18378666076/the-duke-saga-starter-set" target="_blank">Duke clinical trials saga</a>. I don’t think it’s the case that none of these are patentable, because there still is the option of adding an “inventive concept” on top. However, it seems the simple algorithmic approach of “If X do this, and if Y do that” isn’t going to fly.</p>
Laws of Nature and the Law of Patents: Supreme Court Rejects Patents for Correlations
2012-03-20T22:40:00+00:00
http://simplystats.github.io/2012/03/20/laws-of-nature-and-the-law-of-patents-supreme-court
<p class="MsoNormal">
This is a guest post by Reeves Anderson, an <a href="http://www.arnoldporter.com/professionals.cfm?action=view&id=5146" target="_blank">associate</a> at Arnold and Porter LLP. <em>Reeves Anderson is a member of the Appellate and Supreme Court practice group at Arnold & Porter LLP in Washington, D.C. The views expressed herein are those of the author alone and not of Arnold & Porter LLP or any of the firm’s clients. Stay tuned for follow-up posts by the Simply Statistics crowd on the implications of this ruling for statistics in general and personalized medicine in particular. </em>
</p>
<p class="MsoNormal">
With the country’s attention focused on next week’s arguments over the constitutionality of President Obama’s health care law, the Supreme Court slipped in an important decision today concerning personalized medicine patents. In <em>Mayo Collaborative Services v. Prometheus Laboratories</em>, the Court unanimously <a href="http://www.supremecourt.gov/opinions/11pdf/10-1150.pdf" target="_blank">struck down</a> medical diagnostic patents that concerned the use of thiopurine drugs in the treatment of autoimmune diseases. Prometheus’s patents, which provided that doctors should increase or decrease a treatment dosage depending on metabolite correlations, were ineligible for patent protection, the Court held, because the patents “simply stated a law of nature.”
</p>
<p class="MsoNormal">
As Jeff aptly <a href="http://simplystatistics.tumblr.com/post/14135999782/the-supreme-courts-interpretation-of-statistical" target="_blank">described the issue</a> in December, Prometheus’s patents sought to control a treatment process centered “on the basis of a statistical correlation.” Specifically, when a patient ingests a thiopurine drug, metabolites form in the patient’s bloodstream. Because the production of metabolites varies among patients, the same dosage of thiopurine causes different effects in different patients. This variation makes it difficult for doctors to determine optimal treatment for a particular patient. Too high of a dosage risks harmful side effects, whereas too low would be therapeutically ineffective.
</p>
<p class="MsoNormal">
But measurement of a patient’s <em>metabolite</em> levels—in particular, 6-thioguanine and its nucleotides (6-TG) and 6-methyl-mercaptopurine (6-MMP)—is more closely correlated with the likelihood that a particular dosage of a thiopurine drug could cause harm or prove ineffective. As the Court explained today, however, “those in the field did not know the precise correlations between metabolite levels and the likely harm or ineffectiveness.” This is where Prometheus stepped in. “The patent claims at issue here set forth processes embodying researchers’ findings that identified those correlations with some precision.” Prometheus contended that blood concentrations of 6-TG or of 6-MMP above 400 and 7,000 picomoles per 8x10<sup>8</sup> red blood cells, respectively, could be toxic, while a concentration of 6-TG metabolite less than 230 pmol per 8x10<sup>8</sup> red blood cells is likely too low to be effective.
</p>
<p class="MsoNormal">
Prometheus utilized this correlation by patenting a three-step method by which one (i) administers a drug providing 6-TG to a patient with an autoimmune disease; (ii) determines the level of 6-TG in the patient; and (iii) the administrator then can determine whether the thiopurine dosage should be adjusted accordingly. Significantly, Prometheus’s patents did not include a treatment protocol and thus applied regardless of whether a doctor actually altered his treatment decision in light of the test—in other words, even if the doctor thought the correlations were wrong, irrelevant, or inapplicable to a particular patient. And in fact, Mayo Clinic, the party challenging Prometheus’s patents, believed Prometheus’s correlations were wrong. (Mayo’s toxicity levels were 450 and 5700 pmol per 8x10<sup>8</sup> red blood cells for 6-TG and 6-MMP, respectively. At oral argument on December 7, 2011, Mayo insisted that its numbers were “more accurate” than Prometheus’s.)
</p>
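<p>The competing cut-offs can be written down as a simple rule, which helps show how thin the “application” step is. The sketch below is an illustration only: the class and function names are hypothetical, it is not the language of the patent claims or a clinical tool, and the only numbers in it are the thresholds quoted above (in pmol per 8x10<sup>8</sup> red blood cells). A lower 6-TG bound for Mayo is not given in this post, so it is left unset.</p>

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Thresholds:
    """Cut-offs in pmol per 8x10^8 red blood cells, as quoted in the post."""
    tg_toxic: float            # 6-TG above this may be toxic
    mmp_toxic: float           # 6-MMP above this may be toxic
    tg_low: Optional[float]    # 6-TG below this may be ineffective (None if not stated)


PROMETHEUS = Thresholds(tg_toxic=400, mmp_toxic=7000, tg_low=230)
MAYO = Thresholds(tg_toxic=450, mmp_toxic=5700, tg_low=None)  # lower bound not given in the post


def flag_levels(tg: float, mmp: float, t: Thresholds) -> List[str]:
    """Return the warnings that a set of thresholds would raise for one patient."""
    flags = []
    if t.tg_low is not None and tg < t.tg_low:
        flags.append("6-TG may be too low to be effective")
    if tg > t.tg_toxic:
        flags.append("6-TG may be toxic")
    if mmp > t.mmp_toxic:
        flags.append("6-MMP may be toxic")
    return flags


# A made-up patient for whom the two sets of cut-offs disagree.
print(flag_levels(420, 6000, PROMETHEUS))
print(flag_levels(420, 6000, MAYO))
```

<p>For this hypothetical patient the Prometheus numbers flag only 6-TG while the Mayo numbers flag only 6-MMP, which is the kind of disagreement Mayo pointed to.</p>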
<p class="MsoNormal">
Turning to the legal issues, both parties agreed that the correlations were “laws of nature,” which, by themselves, are not patentable. As the Supreme Court has explained repeatedly, laws of nature, like natural phenomena and abstract ideas, are “manifestations of … nature, free to all men and reserved exclusively to none.” This principle reflects a concern that patent law ought not inhibit further discovery and innovation by tying up the “basic tools of scientific and technological work.”
</p>
<p class="MsoNormal">
In contrast, the <em>application</em> of a law of nature <em>is</em> patentable. The question for the Court, then, was whether Prometheus’s patent claims “add <em>enough</em> to their statements of correlations to allow the process they describe to qualify as patent-eligible processes that <em>apply</em> natural laws.”
</p>
<p class="MsoNormal">
The Court’s answer was no. Distilled down, Prometheus’s “three steps simply tell doctors to gather data from which they may draw an inference in light of the correlations.” The Court determined that Prometheus’s method simply informed the relevant audience (doctors treating patients with autoimmune diseases) about a law of nature, and that the additional steps of “administering” a drug and “determining” metabolite levels were “well-understood, routine, conventional activity already engaged in by the scientific community.” “[T]he effect is simply to tell doctors to apply the law somehow when treating their patients.”
</p>
<p class="MsoNormal">
Although I leave it to Jeff & company to assess the impact of today’s decision on the practice of personalized medicine, I have two principal observations. First, it appears that the Court was disturbed by Mayo’s insistence that the correlations in Prometheus’s patents were wrong, and that patent protection would prevent Mayo from improving upon them. Towards the end of the opinion, Justice Breyer wrote that the patents “threaten to inhibit the development of more refined treatment recommendations (like that embodied in Mayo’s test), that combine Prometheus’s correlations with later discovered features of metabolites, human physiology or individual patient characteristics.” The worry of stifling future innovation applies to every patent, but the Court seemed especially attuned to that concern here, perhaps due in part to Mayo’s insistence that its “better” test could not be used to help patients.
</p>
<p class="MsoNormal">
Second, Mayo argued that a decision in its favor would reduce the costs of challenging similar patents that purported to “apply” a natural law. Mayo’s argument was in response to the position of the U.S. Government, which participated in the case as <em>amicus curiae</em> (“friend of the court”). The Government urged the Court not to rule on the threshold issue of whether Prometheus’s patents applied a law of nature, but rather to strike down the patents because they lacked “novelty” or were “obvious in light of prior art.” The questions of novelty and obviousness, Mayo argued, are much more fact-intensive and expensive to litigate. Whether or not the Court agreed with Mayo’s argument, it declined to follow the Government’s advice. To skip the threshold question, the Court concluded, “would make the ‘law of nature’ exception … a dead letter.”
</p>
<p class="MsoNormal">
Many Supreme Court watchers will now turn their attention to another patent case that has been waiting in the wings, <em>Association for Molecular Pathology v. Myriad Genetics</em>, which asks the Court to decide whether human genes are patentable. Predictions anyone?
</p>
Supreme court unanimously rules against personalized medicine patent!
2012-03-20T14:31:35+00:00
http://simplystats.github.io/2012/03/20/supreme-court-unanimously-rules-against-personalized
<p>Just a few minutes ago the Supreme Court released its <a href="http://biostat.jhsph.edu/~jleek/Mayo%20Opinion.pdf" target="_blank">decision</a> in the Mayo case; see <a href="http://simplystatistics.tumblr.com/post/14135999782/the-supreme-courts-interpretation-of-statistical" target="_blank">here</a> for the Simply Statistics summary of the case. The court ruled unanimously that the personalized medicine test could not be patented. Such a strong ruling likely has major implications going forward for the field of personalized medicine. At the end of the day, this decision was based on an interpretation of statistical correlation. Stay tuned for a special in-depth analysis in the next couple of days that will get into the details of the ruling and the implications for personalized medicine. </p>
Interview with Amy Heineike - Director of Mathematics at Quid
2012-03-19T14:00:00+00:00
http://simplystats.github.io/2012/03/19/interview-with-amy-heineike-director-of-mathematics
<div class="im">
<div>
<div>
<strong>Amy Heineike</strong>
</div>
<div>
<strong><img src="http://media.tumblr.com/tumblr_m1588osxOV1r08wvg.jpg" /></strong>
</div>
<div>
<strong><br /></strong>Amy Heineike is the Director of Mathematics at <a href="http://quid.com/" target="_blank">Quid</a>, a startup that seeks to understand technology development and dissemination through data analysis. She was the first employee at Quid, where she helped develop their technology early on. She has been recognized as one of the <a href="http://thephenomlist.com/lists/8/people/32" target="_blank">top Big Data Scientists</a>. As a part of our ongoing <a href="http://simplystatistics.tumblr.com/interviews" target="_blank">interview series</a>, we talked to Amy about data science, Quid, and how statisticians can get involved in the tech scene.
</div>
<div>
<strong><br /></strong>
</div>
<div>
<strong>Which term applies to you: data scientist, statistician, computer scientist, or something else?</strong>
</div>
</div>
<div>
</div>
</div>
<div>
Data Scientist fits better than any, because it captures the mix of analytics, engineering and product management that is my current day to day.
</div>
<div>
</div>
<div>
</div>
<div>
When I started with Quid I was focused on R&D - developing the first prototypes of what are now our core analytics technologies, and working to define and QA new data streams. This required the analysis of lots of unstructured data, like news articles and patent filings, as well as the end visualisation and communication of the results.
</div>
<div>
</div>
<div>
</div>
<div>
After we raised VC funding last year I switched to building our data science and engineering teams out. These days I jump from conversations with the team about ideas for new analysis, to defining refinements to our data model, to questions about scalable architecture and filling out pivotal tracker tickets. The core challenge is translating the vision for the product back to the team so they can build it.
</div>
<div class="im">
<div>
</div>
<div>
<div>
<div>
<strong> How did you end up at Quid?</strong>
</div>
</div>
</div>
<div>
</div>
</div>
<div>
In my previous work I’d been building models to improve our understanding of complex human systems - in particular the complex interaction of cities and their transportation networks in order to evaluate the economic impacts of Crossrail, a new train line across London, and the implications of social networks on public policy. Through this work it became clear that data was the biggest constraint - I became fascinated by a quest to find usable data for these questions - and that’s what led me to Silicon Valley. I knew the founders of Quid from University, and approached them with the idea of analysing their data according to ideas I’d had - especially around network analysis - and the initial work we collaborated on became core to the founding technology of Quid.
</div>
<div class="im">
<div>
</div>
<div>
<div>
<div>
</div>
</div>
</div>
<div>
<div>
<div>
<strong>Who were really good mentors to you? What were the qualities that helped you? </strong>
</div>
<div>
</div>
</div>
</div>
</div>
<div>
I’ve been fortunate to work with some brilliant people in my career so far. While I still worked in London I worked closely with two behavioural economists - Paul Ormerod, who’s written some fantastic books on the subject (most recently Why Things Fail), and Bridget Rosewell, until recently the Chief Economist to the Greater London Authority (the city government for London). At Quid I’ve had a very productive collaboration with Sean Gourley, our CTO.
</div>
<div>
</div>
<div>
</div>
<div>
One unifying characteristic of these three is their ability to communicate complex ideas in a powerful way to a broad audience. It’s an incredibly important skill: a core part of analytics work is taking the results to where they are needed, which is often beyond those who know the technical details, to those who care about the implications first.
</div>
<div class="im">
<div>
</div>
<div>
</div>
<div>
<strong>How does Quid determine relationships between organizations and develop insight based on data? </strong>
</div>
<div>
</div>
</div>
<div>
The core questions our clients ask us are around how technology is changing and how this impacts their business. That’s a really fascinating and huge question that requires not just discovering a document with the answer in it, but organizing lots and lots of pieces of data to paint a picture of the emergent change. What we can offer is not only being able to find a snapshot of that, but also being able to track how it changes over time.
</div>
<div>
</div>
<div>
</div>
<div>
We organize the data firstly through the insight that much disruptive technology emerges in organizations, and that the events that occur between and to organizations are a fantastic way to signal both the traction of technologies and to observe strategic decision making by key actors.
</div>
<div>
</div>
<div>
</div>
<div>
The first kind of relationship that’s important is of the transactional type (who is acquiring, funding or partnering with whom), and the second is an estimate of the technological clustering of organizations (what trends particular organizations represent). Both of these can be discovered through documents about them, including government filings, press releases and news, but doing so requires analysis of unstructured natural language.
</div>
<div>
</div>
<div>
</div>
<div>
We’ve experimented with some very engaging visualisations of the results, and have had particular success with network visualisations, which are a very powerful way of allowing people to interact with a large amount of data in a quite playful way. You can see some of our analyses in the press links at <a href="http://quid.com/in-the-news.php" target="_blank">http://quid.com/in-the-news.php</a>
</div>
<div class="im">
<div>
</div>
<div>
<div>
<div>
<strong>What skills do you think are most important for statisticians/data scientists moving into the tech industry?</strong>
</div>
</div>
</div>
<div>
</div>
</div>
<div>
Technical statistical chops are the foundation. You need to be able to take a dataset and discover and communicate what’s interesting about it for your users. To turn this into a product requires understanding how to turn one-off analysis into something reliable enough to run day after day, even as the data evolves and grows, and as different users experience different aspects of it. A key part of that is being willing to engage with questions about where the data comes from (how it can be collected, stored, processed and QAed on an ongoing basis), how the analytics will be run (how will it be tested, distributed and scaled) and how people interact with it (through visualisations, UI features or static presentations?).
</div>
<div>
</div>
<div>
</div>
<div>
For your ideas to become great products, you need to become part of a great team though! One of the reasons that such a broad set of skills are associated with Data Science is that there are a lot of pieces that have to come together for it to all work out - and it really takes a team to pull it off. Generally speaking, the earlier stage the company that you join, the broader the range of skills you need, and the more scrappy you need to be about getting involved in whatever needs to be done. Later stage teams, and big tech companies may have roles that are purer statistics.
</div>
<div class="im">
<div>
</div>
<div>
</div>
<div>
<div>
<div>
<strong>Do you have any advice for grad students in statistics/biostatistics on how to get involved in the start-up community or how to find a job at a start-up? </strong>
</div>
</div>
</div>
<div>
</div>
</div>
<div>
There is a real opportunity for people who have good statistical and computational skills to get into the startup and tech scenes now. Many people in Data Science roles have statistics and biostatistics backgrounds, so you shouldn’t find it hard to find kindred spirits.
</div>
<div>
</div>
<div>
<div>
We’ve always been especially impressed with people who have built software in a group and shared or distributed that software in some way. Getting involved in an open source project, working with version control in a team, or sharing your code on github are all good ways to start on this.
</div>
<div>
</div>
</div>
<div>
</div>
<div>
It’s really important to be able to show that you want to build products though. Imagine the clients or users of the company and see if you get excited about building something that they will use. Reach out to people in the tech scene, explore who’s posting jobs - and then be able to explain to them what it is you’ve done and why it’s relevant, and be able to think about their business and how you’d want to help contribute towards it. Many companies offer internships, which could be a good way to contribute for a short period and find out if it’s a good fit for you.
</div>
<p></p></p>
Sunday data/statistics link roundup (3/18)
2012-03-18T14:58:00+00:00
http://simplystats.github.io/2012/03/18/sunday-data-statistics-link-roundup-3-18
<ol>
<li>A really interesting <a href="http://www.80grados.net/2012/03/una-upr-de-clase-mundial/" target="_blank">proposal</a> by Rafa (in Spanish - we’ll get on him to write a translation) for the University of Puerto Rico. The post concerns changing the focus from simply teaching to creating knowledge and the potential benefits to both the university and to Puerto Rico. It also has a really nice summary of the benefits that the university system in the United States has produced. Definitely worth a read. The comments are also interesting, it looks like Rafa’s post is pretty controversial…</li>
<li>An interesting <a href="http://motherboard.vice.com/2012/1/27/was-space-shuttle-challenger-a-casualty-of-bad-data-visualization" target="_blank">article</a> suggesting that the Challenger Space Shuttle disaster was at least in part due to bad data visualization. Via <a href="https://twitter.com/#!/DataInColour" target="_blank">@DatainColour</a></li>
<li>The <a href="http://news.sciencemag.org/sciencenow/2012/03/examining-his-own-body-stanford-.html" target="_blank">Snyderome</a> is getting a lot of attention in genomics circles. He used as many new technologies as he could to measure a huge amount of molecular information about his body over time. I am really on board with the excitement about measurement technologies, but this poses a huge challenge for statistics and statistical literacy. If this kind of thing becomes commonplace, the potential for false positives and ghost diagnoses is huge without a really good framework for uncertainty. Via Peter S. </li>
<li>More <a href="http://www.wired.co.uk/news/archive/2012-03/16/nike-building-app-stunt" target="_blank">news</a> about the Nike API. Now that is how to unveil some data! </li>
<li>Add the Nike API to the <a href="http://simplystatistics.tumblr.com/post/18493330661/statistics-project-ideas-for-students" target="_blank">list of potential statistics projects</a> for students. </li>
</ol>
Peter Norvig on the "Unreasonable Effectiveness of Data"
2012-03-16T13:46:00+00:00
http://simplystats.github.io/2012/03/16/the-unreasonable-effectiveness-of-data-a-talk
<p>“The Unreasonable Effectiveness of Data”, a talk by Peter Norvig of Google. Sometimes, more data is more better. (Thanks to John C. for the link.)</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
A proposal for a really fast statistics journal
2012-03-14T13:53:20+00:00
http://simplystats.github.io/2012/03/14/a-proposal-for-a-really-fast-statistics-journal
<p>I know we need a new journal like we need a good poke in the eye. But I got fired up by the recent discussion of open science (by <a href="http://krugman.blogs.nytimes.com/2012/01/17/open-science-and-the-econoblogosphere/" target="_blank">Paul Krugman</a> and others) and the seriously misguided <a href="http://en.wikipedia.org/wiki/Research_Works_Act" target="_blank">Research Works Act</a>, which aimed to make it illegal to deposit published papers funded by the government in PubMed Central or other open access databases.</p>
<div>
<span>I also realized that I spend a huge amount of time/effort on the following things: (1) waiting for reviews (typically months), (2) addressing reviewer comments that are unrelated to the accuracy of my work - just adding citations to referees’ papers or doing additional simulations, and (3) resubmitting rejected papers to new journals - this is a huge time suck since I have to reformat, etc. Furthermore, if I want my papers to be published open-access, I have to pay at minimum <a href="http://simplystatistics.tumblr.com/post/12286350206/free-access-publishing-is-awesome-but-expensive-how" target="_blank">$1,000 per paper</a>. </span></p>
<p>
<span>So I thought up my criteria for an ideal statistics journal. It would be accurate, have fast review times, and not discriminate based on how interesting an idea is. I have found that my most interesting ideas are the hardest ones to get published. This journal would:</span>
</p>
<ul>
<li>
Be open-access and free to publish your papers there. You own the copyright on your work.
</li>
<li>
The criteria for publication would be: (1) it has to do with statistics, computation, or data analysis, (2) the work is technically correct.
</li>
<li>
We would accept manuals, reports of new statistical software, and full length research articles.
</li>
<li>
There would be no page limits/figure limits.
</li>
<li>
The journal would be published exclusively online.
</li>
<li>
We would guarantee reviews within 1 week and publication immediately upon review if criteria (1) and (2) are satisfied
</li>
<li>
Papers would receive a star rating from the editor - 0-5 stars. There would be a place for readers to also review articles
</li>
<li>
All articles would be published with a tweet/like button so they can be easily distributed
</li>
</ul>
<div>
</div>
<div>
To achieve such a fast review time, here is how it would work. We would have a large group of Associate Editors (hopefully 30 or more). When a paper was received, it would be assigned to an AE. The AEs would agree to referee papers within 2 days. They would use a form like this:
</div>
<div>
</div>
<blockquote>
<ul>
<li>
Review of: Jeff’s Paper
</li>
<li>
Technically Correct: Yes
</li>
<li>
About statistics/computation/data analysis: Yes
</li>
<li>
Number of Stars: 3 stars
</li>
</ul>
<p>
<ul>
<li>
3 Strengths of Paper (1 required):
</li>
<li>
This paper revolutionizes statistics
</li>
</ul>
<p>
<ul>
<li>
3 Weaknesses of Paper (1 required):
</li>
<li>
The proof that this paper revolutionizes statistics is pretty weak because he only includes one example.
</li>
</ul></blockquote>
<div>
</div>
<div>
That’s it, super quick, super simple, so it wouldn’t be hard to referee. As long as the answers to the first two questions were yes, it would be published.
</div>
<div>
</div>
<div>
So now here are my questions:
</div>
<div>
</div>
<div>
<ol>
<li>
Would you ever consider submitting a paper to such a journal?
</li>
<li>
Would you be willing to be one of the AEs for such a journal?
</li>
<li>
Is there anything you would change?
</li>
</ol>
</div>
<div>
</div></div>
</p></p></blockquote></div>
Frighteningly Ambitious Startup Ideas
2012-03-13T15:00:05+00:00
http://simplystats.github.io/2012/03/13/frighteningly-ambitious-startup-ideas
<p><a href="http://paulgraham.com/ambitious.html">Frighteningly Ambitious Startup Ideas</a></p>
Sunday Data/Statistics Link Roundup (3/11)
2012-03-12T19:40:00+00:00
http://simplystats.github.io/2012/03/12/sunday-data-statistics-link-roundup-3-11
<ol>
<li>This is the big one. ESPN has opened up access to their <a href="http://developer.espn.com/docs" target="_blank">API</a>! It looks like there may only be access to some of the data for the general public though, does anyone know more? </li>
<li>Looks like ESPN isn’t the only sports-related organization in the API mood, Nike plans to open up an <a href="http://techcrunch.com/2012/03/10/nike-apis-sxsw-backplane/" target="_blank">API too</a>. It would be great if they had better access to individual, downloadable data. </li>
<li>Via Leonid K.: a <a href="http://www.yale.edu/acmelab/articles/bargh_chen_burrows_1996.pdf" target="_blank">highly influential</a> psychology study failed to replicate in a study published in PLoS One. The <a href="http://en.wikipedia.org/wiki/John_Bargh" target="_blank">author</a> of the original study <a href="http://www.psychologytoday.com/blog/the-natural-unconscious/201203/nothing-in-their-heads" target="_blank">went off</a> on the authors of the paper, on PLoS One, and on the reporter who broke the story (including personal attacks!). To me, it looks like the authors of the PLoS One paper actually did a more careful study than the original authors. The authors of the PLoS One paper, the reporter, and the editor of PLoS One all replied in a much more reasonable way. See this excellent <a href="http://blogs.discovermagazine.com/notrocketscience/2012/03/10/failed-replication-bargh-psychology-study-doyen/" target="_blank">summary</a> for all the details. Here are a few choice quotes from the comments: </li>
</ol>
<blockquote>
<p><span>1. But there’s a long tradition in social psychology of experiments as parables,</span></p>
<p><span>2. I’d love to write a really long response, but let’s just say: priming methods like these fail to replicate all the time (frequently in my own studies), and the news that one of Bargh’s studies failed to replicate is not surprising to me at all.</span></p>
<p><span>3. This distinction between direct and conceptual replication helps to explain why a psychologist isn’t particularly concerned whether Bargh’s finding replicates or not.</span></p>
</blockquote>
<div>
D. Reproducible != Replicable in scientific research. But Roger’s <a href="http://simplystatistics.tumblr.com/post/13633695297/reproducible-research-in-computational-science" target="_blank">perspective on reproducible research</a> still seems appropriate here.
</div>
Answers in Medicine Sometimes Lie in Luck
2012-03-12T15:00:05+00:00
http://simplystats.github.io/2012/03/12/answers-in-medicine-sometimes-lie-in-luck
<p><a href="http://www.nytimes.com/2012/03/06/health/views/for-doctors-luck-can-explain-whatever-they-cant.html">Answers in Medicine Sometimes Lie in Luck</a></p>
Cost of Gene Sequencing Falls, Raising Hopes for Medical Advances
2012-03-11T16:00:05+00:00
http://simplystats.github.io/2012/03/11/cost-of-gene-sequencing-falls-raising-hopes-for
<p><a href="http://www.nytimes.com/2012/03/08/technology/cost-of-gene-sequencing-falls-raising-hopes-for-medical-advances.html">Cost of Gene Sequencing Falls, Raising Hopes for Medical Advances</a></p>
IBM’s Watson Gets Wall Street Job After ‘Jeopardy’ Win
2012-03-10T16:00:05+00:00
http://simplystats.github.io/2012/03/10/ibms-watson-gets-wall-street-job-after-jeopardy-win
<p><a href="http://www.bloomberg.com/news/2012-03-05/ibm-s-watson-computer-gets-wall-street-job-one-year-after-jeopardy-win.html">IBM’s Watson Gets Wall Street Job After ‘Jeopardy’ Win</a></p>
Mission Control, Built for Cities
2012-03-09T16:00:05+00:00
http://simplystats.github.io/2012/03/09/mission-control-built-for-cities
<p><a href="http://www.nytimes.com/2012/03/04/business/ibm-takes-smarter-cities-concept-to-rio-de-janeiro.html">Mission Control, Built for Cities</a></p>
A plot of my citations in Google Scholar vs. Web of Science
2012-03-08T16:00:05+00:00
http://simplystats.github.io/2012/03/08/a-plot-of-my-citations-in-google-scholar-vs-web-of
<p>There has <a href="http://www.functionalneurogenesis.com/blog/2012/02/google-scholar-vs-scopus-web-of-science/" target="_blank">been</a> <a href="http://www.nature.com/nature/journal/v483/n7387/full/483036c.html" target="_blank">some</a> <a href="http://www.nature.com/nature/journal/v483/n7387/full/483036d.html" target="_blank">discussion </a>about whether Google Scholar’s or one of the proprietary software companies’ numbers are better for citation counts. I personally think Google Scholar is better for a number of reasons:</p>
<ol>
<li>Higher numbers, but consistently/adjustably higher <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":-)" class="wp-smiley" style="height: 1em; max-height: 1em;" /></li>
<li>It’s free and the data are openly available. </li>
<li>It covers more ground (patents, theses, etc.) to give a better idea of global impact</li>
<li>It’s easier to use</li>
</ol>
<p>I haven’t seen a plot yet relating Web of Science citations to Google Scholar citations, so I made one for my papers.</p>
<p><img height="400" src="http://biostat.jhsph.edu/~jleek/citations.png" width="400" /></p>
<p>GS has about 41% more citations per paper than Web of Science. That is consistent with what other people have found. It also looks reasonably linearish. I wonder what other people’s plots would look like? </p>
<p>Here is the R code I used to generate the plot (the names are Pubmed IDs for the papers):</p>
<blockquote>
<p>library(ggplot2)</p>
<p>names = c(16141318,16357033,16928955,17597765,17907809,19033188,19276151,19924215,20560929,20701754,20838408, 21186247,21685414,21747377,21931541,22031444,22087737,22096506,22257669) </p>
<p>y = c(287,122,84,39,120,53,4,52,6,33,57,0,0,4,1,5,0,2,0)</p>
<p>x = c(200,92,48,31,79,29,4,51,2,18,44,0,0,1,0,2,0,1,0)</p>
<p>Year = c(2005,2006,2007,2007,2007,2008,2009,2009,2011,2010,2010,2011,2012,2011,2011,2011,2011,2011,2012)</p>
<div>
<p>
q <- qplot(x,y,xlim=c(-20,300),ylim=c(-20,300),xlab="Web of Knowledge",ylab="Google Scholar") + geom_point(aes(colour=Year),size=5) + geom_line(aes(x = y, y = y),size=2)
</p>
</div>
</blockquote>
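<p>If you want to check the "about 41%" figure against the vectors above, here is a rough sketch. The post doesn't say how that number was computed, so treating it as the slope of a least-squares fit of Google Scholar on Web of Science counts is an assumption, and the names <code>gs</code>, <code>wos</code>, <code>total_ratio</code>, <code>fit</code> and <code>slope</code> are made up for this example.</p>

```r
# Citation counts copied from the plotting code above
gs  <- c(287,122,84,39,120,53,4,52,6,33,57,0,0,4,1,5,0,2,0)  # Google Scholar (y)
wos <- c(200,92,48,31,79,29,4,51,2,18,44,0,0,1,0,2,0,1,0)     # Web of Science (x)

# Summary 1: ratio of total citations across all 19 papers
total_ratio <- sum(gs) / sum(wos)

# Summary 2 (assumed reading of "per paper"): least-squares slope of GS on WoS
fit   <- lm(gs ~ wos)
slope <- unname(coef(fit)["wos"])

print(c(total_ratio = total_ratio, slope = slope))
```

<p>Both summaries should put Google Scholar roughly 40 to 45 percent above Web of Science for these papers, with the fitted slope being the one closer to the number quoted in the post.</p>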
R.A. Fisher is the most influential scientist ever
2012-03-07T16:00:05+00:00
http://simplystats.github.io/2012/03/07/r-a-fisher-is-the-most-influential-scientist-ever
<p>You can now see profiles of famous scientists on Google Scholar citations. Here are links to a few of them (via Ben L.). <a href="http://scholar.google.com/citations?user=6kEXBa0AAAAJ&hl=en" target="_blank">Von Neumann</a>, <a href="http://scholar.google.com/citations?user=qc6CJjYAAAAJ&hl=en" target="_blank">Einstein</a>, <a href="http://scholar.google.com/citations?user=xJaxiEEAAAAJ&hl=en" target="_blank">Newton</a>, <a href="http://scholar.google.com/citations?user=B7vSqZsAAAAJ&hl=en" target="_blank">Feynman</a></p>
<p>But their impact on science pales in comparison (with the possible exception of Newton) to the impact of one statistician: <a href="http://en.wikipedia.org/wiki/Ronald_Fisher" target="_blank">R.A. Fisher</a>. Many of the concepts he developed are so common and are considered so standard, that he is never cited/credited. Here are some examples of things he invented along with a conservative number of citations they would have received calculated via Google Scholar*. </p>
<ol>
<li>P-values - <strong>3 million citations</strong></li>
<li>Analysis of variance (ANOVA) - <strong>1.57 million citations</strong></li>
<li>Maximum likelihood estimation - <strong>1.54 million citations</strong></li>
<li>Fisher’s linear discriminant - <strong>62,400 citations</strong></li>
<li>Randomization/permutation tests - <strong>37,940 citations</strong></li>
<li>Genetic linkage analysis - <strong>298,000 citations</strong></li>
<li>Fisher information - <strong>57,000 citations</strong></li>
<li>Fisher’s exact test - <strong>237,000 citations</strong></li>
</ol>
<p>A couple of notes:</p>
<ol>
<li>These are seriously conservative estimates, since I only searched for a few variants on some key words</li>
<li>These numbers are <strong>BIG</strong>, there isn’t another scientist in the ballpark. The guy who wrote the “<a href="http://www.jbc.org/content/280/28/e25.full" target="_blank">most highly cited paper</a>” got 228,441 citations on GS. His next most cited paper? <a href="http://scholar.google.com/citations?hl=en&user=YCS0XAcAAAAJ&oi=sra" target="_blank">3,000 citations</a>. Fisher has at least 5 papers with more citations than his best one. </li>
<li><a href="http://archive.sciencewatch.com/sept-oct2003/sw_sept-oct2003_page2.htm" target="_blank">This page</a> says Bert Vogelstein has the most citations of any person over the last 30 years. If you add up the number of citations to his top 8 papers on GS, you get 57,418. About as many as to the Fisher information matrix. </li>
</ol>
<p>I think this really speaks to a couple of things. One is that Fisher invented some of the most critical concepts in statistics. The other is the breadth of impact of statistical ideas across a range of disciplines. In any case, I would be hard pressed to think of another scientist who has influenced a greater range or depth of scientists with their work. </p>
<ul>
<li>
<p>Calculations of citations:</p>
<ol>
<li><a href="http://simplystatistics.tumblr.com/post/15402808730/p-values-and-hypothesis-testing-get-a-bad-rap-but-we" target="_blank">As described</a> in a previous post</li>
<li># of GS results for “Analysis of Variance” + # for “ANOVA” - “Analysis of Variance”</li>
<li># of GS results for “maximum likelihood”</li>
<li># of GS results for “linear discriminant”</li>
<li># of GS results for “permutation test” + # for “permutation tests” - “permutation test”</li>
<li># of GS results for “linkage analysis”</li>
<li># of GS results for “fisher information” + # for “information matrix” - “fisher information”</li>
<li># of GS results for “fisher’s exact test” + # for “fisher exact test” - “fisher’s exact test”</li>
</ol>
</li>
</ul>
Are banks being sidelined by retailers' data collection?
2012-03-06T16:00:06+00:00
http://simplystats.github.io/2012/03/06/are-banks-being-sidelined-by-retailers-data
<p><a href="http://dealbook.nytimes.com/2012/02/28/live-blog-investor-day-at-jpmorgan-chase/">Are banks being sidelined by retailers’ data collection?</a></p>
Characteristics of my favorite statistics talks
2012-03-05T16:02:05+00:00
http://simplystats.github.io/2012/03/05/characteristics-of-my-favorite-statistics-talks
<p>I’ve been going to/giving statistics talks for a few years now. I think everyone in our field has an opinion on the best structure/content/delivery of a talk. I am one of those people who has a pretty specific idea of what makes an amazing talk. Here are a few of the things I think are key. I try to do them, and I have learned many of them from other people I’ve seen speak. I’d love to hear what other people think. </p>
<p><strong>Structure</strong></p>
<ol>
<li>I don’t like outline slides. I think they take up space but don’t add to most talks. Instead I love it when talks start with a specific, concrete, unsolved problem. In my favorite talks, this problem is usually scientific/applied. Although I have also seen great theoretical talks where a person starts with a key and unsolved theoretical problem. </li>
<li>I like it when the statistical model is defined to solve the problem in the beginning, so it is easy to see the connection between the model and the purpose of the model. </li>
<li>I love it when talks end by showing how they solved the problem they described at the very beginning of the talk. </li>
</ol>
<p><strong>Content</strong></p>
<ol>
<li>I like it when people assume I’m pretty ignorant about their problem (I usually am) and explain everything in very simple language. I think some people worry about their research looking too trivial. I have almost never come away from a talk thinking that, but I frequently leave talks confused because the background material wasn’t clear. </li>
<li>I like it when talks cover enough technical detail so I can follow the basic algorithm, but not so much that I get lost in notation. I also struggle when talks go off on tangents, describing too many subproblems, rather than focusing on the main problem in the talk and just mentioning subproblems succinctly. </li>
<li>I like it when proposed methods are compared to the obvious straw man and one legitimate competitor (if it exists) on a realistic simulation/data set where the answer is known. </li>
<li>I love it when people give talks on work that isn’t totally finished. This type of talk is scary for two reasons: (1) you can be scooped and (2) you might not have all the answers. But I find that unfinished work leads to way more discussion/ideas than a talk about work that has been published and is “complete”. </li>
</ol>
<p><strong>Delivery</strong></p>
<ol>
<li>I like it when a talk runs short. I have never been disappointed when a talk ended 10-15 min early. On the other hand, when a talk is long, I almost always lose focus and don’t follow the last part. I’d love it if we moved to <a href="http://simplystatistics.tumblr.com/post/10686092687/25-minute-seminars" target="_blank">30 minute seminars</a> with more questions. </li>
<li>I like it when speakers have prepared their slides and they have a clear flow and don’t get bogged down in transitions. For this reason, I don’t mind it when people give the same talk a bunch of places. I usually find that the talk is very polished.</li>
</ol>
DealBook: Roche Extends Deadline for Illumina Offer
2012-03-04T16:03:06+00:00
http://simplystats.github.io/2012/03/04/dealbook-roche-extends-deadline-for-illumina-offer
<p><a href="http://dealbook.nytimes.com/2012/02/27/roche-extends-deadline-for-illumina-takeover/">DealBook: Roche Extends Deadline for Illumina Offer</a></p>
Sunday data/statistics link roundup (3/4)
2012-03-04T14:14:02+00:00
http://simplystats.github.io/2012/03/04/sunday-data-statistics-link-roundup-3-4
<ol>
<li>A <a href="http://www.wired.com/wiredenterprise/2012/02/github/all/1" target="_blank">cool article</a> on <a href="https://github.com/" target="_blank">Github </a>by the folks at Wired. I’m starting to think the fact that I’m not on Github is a serious dent in my nerd cred. </li>
<li><a href="http://www.visualisingdata.com/index.php/2012/02/datawrapper-open-source-data-visualisation-creator/" target="_blank">Datawrapper</a> - a less intensive, but less flexible open source data visualization creator. I have seen a few of these types of services starting to pop up. I think that some statistics training should be mandatory before people use them. </li>
<li>An interesting <a href="http://research.iheartanthony.com/2012/02/23/why-bother-publishing-in-a-journal-2/" target="_blank">blog post </a>with the provocative title, “Why bother publishing in a journal”. The story he describes works best if you have a lot of people who are interested in reading what you put on the internet. </li>
<li>A <a href="http://stats.stackexchange.com/questions/6/the-two-cultures-statistics-vs-machine-learning" target="_blank">post</a> on stackexchange comparing the machine learning and statistics cultures. </li>
<li><a href="http://stackoverflow.com/questions/tagged/r" target="_blank">Stackoverflow</a> is a great place to look for R answers. It is the R mailing list, minus the flames…</li>
<li>Roger’s <a href="http://simplystatistics.tumblr.com/post/13897994725/plotting-beijingair-data" target="_blank">posts</a> on <a href="http://simplystatistics.tumblr.com/post/13601935082/beijing-air" target="_blank">Beijing</a> air pollution are worth another read if you missed them. Particularly <a href="http://simplystatistics.tumblr.com/post/14214147778/smoking-is-a-choice-breathing-is-not" target="_blank">this one</a>, where he computes the cigarette equivalent of the air pollution levels. </li>
</ol>
True Innovation
2012-03-03T16:05:06+00:00
http://simplystats.github.io/2012/03/03/true-innovation
<p><a href="http://www.nytimes.com/2012/02/26/opinion/sunday/innovation-and-the-bell-labs-miracle.html">True Innovation</a></p>
Confronting a Law Of Limits
2012-03-02T16:02:05+00:00
http://simplystats.github.io/2012/03/02/confronting-a-law-of-limits
<p><a href="http://www.nytimes.com/2012/02/25/business/apple-confronts-the-law-of-large-numbers-common-sense.html">Confronting a Law Of Limits</a></p>
An essay on why programmers need to learn statistics
2012-03-02T13:24:55+00:00
http://simplystats.github.io/2012/03/02/an-essay-on-why-programmers-need-to-learn-statistics
<p>This is <a href="http://zedshaw.com/essays/programmer_stats.html" target="_blank">awesome</a>. There are a few places with some strong language, but overall I think the message is pretty powerful. Via Tariq K. I agree with Tariq, one of the gems is:</p>
<blockquote>
<p><span>If you want to measure something, then don’t measure other sh**. </span></p>
</blockquote>
A cool profile of a human rights statistician
2012-03-01T14:02:05+00:00
http://simplystats.github.io/2012/03/01/a-cool-profile-of-a-human-rights-statistician
<p>Via <a href="http://aldaily.com/" target="_blank">AL Daily</a>, this dude <a href="http://www.foreignpolicy.com/articles/2012/02/27/the_body_counter?page=full" target="_blank">collects data and analyzes it</a> to put war criminals away. The idea of using statistics to quantify mass testimony is interesting. </p>
<blockquote>
<p>With statistical methods and the right kind of data, he can make what we know tell us what we don’t know. He has shown human rights groups, truth commissions, and international courts how to take a collection of thousands of testimonies and extract from them the magnitude and pattern of violence — to lift the fog of war.</p>
</blockquote>
<p>So how does he do it? With an idea from statistical ecology. This is a bit of a long quote but describes the key bit.</p>
<blockquote>
<p><span>Working on the Guatemalan data, Ball found the answer. He called Fritz Scheuren, a statistician with a long history of involvement in human rights projects. Scheuren reminded Ball that a solution to exactly this problem had been invented in the 19th century to count wildlife. “If you want to find out how many fish are in the pond, you can drain the pond and count them,” Scheuren explained, “but they’ll all be dead. Or you can fish, tag the fish you catch, and throw them back. Then you go another day and fish again. You count how many fish you caught the first day, and the second day, and the number of overlaps.”</span></p>
<p>The number of overlaps is key. It tells you how representative a sample is. From the overlap, you can calculate how many fish are in the whole pond. (The actual formula is this: Multiply the number of fish caught the first day by the number caught the second day. Divide the total by the overlap. That’s roughly how many fish are really in the pond.) It gets more accurate if you can fish not just twice, but many more times — then you can measure the overlap between every pair of days.</p>
<p>Guatemala had three different collections of human rights testimonies about what had happened during the country’s long, bloody civil war: from the U.N. truth commission, the Catholic Church’s truth commission, and the International Center for Research on Human Rights, an organization that worked with Guatemala’s human rights groups. Working for the official truth commission, Ball used the count-the-fish method, called <a href="https://www.hrdag.org/resources/mult_systems_est.shtml" target="_blank">multiple systems estimation</a> (MSE), to compare all three databases. He found that over the time covered by the commission’s mandate, from 1978 to 1996, 132,000 people were killed (not counting those disappeared), and that government forces committed 95.4 percent of the killings. He was also able to calculate killings by the ethnicity of the victim. Between 1981 and 1983, 8 percent of the nonindigenous population of the Ixil region was assassinated; in the Rabinal region, the figure was around 2 percent. In both those regions, though, more than 40 percent of the Mayan population was assassinated.</p>
</blockquote>
<p>Cool, right? The article is worth a read. If you are inspired, check out <a href="http://datawithoutborders.cc/" target="_blank">Data Without Borders</a>.</p>
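<p>The two-day version of the fish-counting formula in the quote is what ecologists usually call the Lincoln-Petersen estimator. Here is a minimal sketch of it. The function name <code>lincoln_petersen</code> and the counts are made up for illustration (they are not from the article), and the real multiple systems estimation analysis with three overlapping databases is more involved than this.</p>

```r
# Two-sample capture-recapture (Lincoln-Petersen) estimate of population size.
# n1: number of fish caught and tagged on day one
# n2: number of fish caught on day two
# m:  number of day-two fish that already carry a tag (the overlap)
lincoln_petersen <- function(n1, n2, m) {
  if (m <= 0) stop("need at least one recaptured fish to estimate the total")
  n1 * n2 / m
}

# Hypothetical counts: 100 tagged on day one, 80 caught on day two, 20 overlaps
estimate <- lincoln_petersen(n1 = 100, n2 = 80, m = 20)
print(estimate)  # 400
```

<p>A smaller overlap means each catch covers a smaller share of the pond, so the estimate goes up: with the same catches and only 10 overlaps it doubles to 800.</p>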
The case for open computer programs
2012-02-29T14:02:06+00:00
http://simplystats.github.io/2012/02/29/the-case-for-open-computer-programs
<p><a href="http://arstechnica.com/science/news/2012/02/science-code-should-be-open-source-according-to-editorial.ars">The case for open computer programs</a></p>
Statistics project ideas for students
2012-02-29T13:50:05+00:00
http://simplystats.github.io/2012/02/29/statistics-project-ideas-for-students
<p>Here are a few ideas that might make for interesting student projects at all levels (from high-school to graduate school). I’d welcome ideas/suggestions/additions to the list as well. All of these ideas depend on free or scraped data, which means that anyone can work on them. I’ve given a ballpark difficulty for each project to give people some idea.</p>
<p>Happy data crunching!</p>
<p><strong>Data Collection/Synthesis</strong></p>
<ol>
<li>Creating a webpage that explains conceptual statistical issues like randomization, margin of error, overfitting, cross-validation, concepts in data visualization, and sampling. The webpage should not use any math at all and should explain the concepts so a general audience could understand. Bonus points if you make short 30-second animated YouTube clips that explain the concepts. (<em>Difficulty: Lowish; Effort: Highish</em>)</li>
<li>Building an aggregator for statistics papers across disciplines that can be the central resource for statisticians. Journals ranging from <em>PLoS Genetics</em> to <em>Neuroimage</em> now routinely publish statistical papers. But there is no one central resource that aggregates all the statistics papers published across disciplines. Such a resource would be <strong>hugely</strong> useful to statisticians. You could build it using blogging software like WordPress so articles could be tagged and you could follow the resource in your RSS reader. (<em>Difficulty: Lowish; Effort: Mediumish</em>)</li>
</ol>
<p><strong>Data Analyses</strong></p>
<ol>
<li>Scrape the LivingSocial/Groupon sites for the daily deals and develop a prediction of how successful the deal will be based on location/price/type of deal. You could use either the RCurl R package or the XML R package to scrape the data. (<em>Difficulty: Mediumish; Effort: Mediumish</em>)</li>
<li>You could use the data from your city (<a href="http://simplystatistics.tumblr.com/post/15182715327/list-of-cities-states-with-open-data-help-me-find" target="_blank">here</a> are a few cities with open data) to: (a) identify the best and worst neighborhoods to live in based on different metrics like how many parks are within walking distance, crime statistics, etc. (b) identify concrete measures your city could take to improve different quality of life metrics like those described above - say where should the city put a park, or (c) see if you can predict when/where crimes will occur (like <a href="http://simplystatistics.tumblr.com/post/15628138349/statistical-crime-fighter" target="_blank">these guys did</a>). (<em>Difficulty: Mediumish; Effort: Highish</em>)</li>
<li>Download data on State of the Union speeches from <a href="http://stateoftheunion.onetwothree.net/texts/index.html" target="_blank">here</a> and use the <a href="http://cran.r-project.org/web/packages/tm/index.html" target="_blank">tm package</a> in R to analyze the patterns of word use over time. (<em>Difficulty: Lowish; Effort: Lowish</em>)</li>
<li>Use this <a href="http://www.factual.com/t/1fKxck/DonorsChooseorg_Projects" target="_blank">data set</a> from <a href="http://www.donorschoose.org/" target="_blank">Donors Choose</a> to determine the characteristics that make the funding of projects more likely. You could send your results to the Donors Choose folks to help them improve the funding rate for their projects. (<em>Difficulty: Mediumish; Effort: Mediumish</em>) </li>
<li>Which basketball player would you want on your team? <a href="http://simplystatistics.tumblr.com/post/16974142373/why-dont-we-hear-more-about-adrian-dantley-on-espn" target="_blank">Here</a> is a really simple analysis done by Rafa. But it doesn’t take into account things like defense. If you want to take on this project, you should take a look at this <a href="http://skepticalsports.com/?page_id=1222" target="_blank">Dennis Rodman analysis</a>, which is the gold standard. (<em>Difficulty: Mediumish; Effort: Highish</em>)</li>
</ol>
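<p>As a rough starting point for the State of the Union idea in the list above, here is a minimal sketch of tracking one word’s relative frequency across speeches. It uses plain Python rather than the tm package mentioned in the list, and the function names and the two tiny “speeches” are placeholders invented for illustration, not real speech text.</p>

```python
import re
from collections import Counter


def word_rates(text):
    """Relative frequency of each lower-cased word in one speech."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def track_word(speeches, word):
    """Map year -> relative frequency of `word`, in chronological order."""
    return {year: word_rates(text).get(word, 0.0)
            for year, text in sorted(speeches.items())}


# Hypothetical placeholder snippets standing in for full speech transcripts.
speeches = {
    1990: "freedom and peace and freedom",
    2010: "jobs and the economy and jobs and jobs",
}
print(track_word(speeches, "jobs"))  # {1990: 0.0, 2010: 0.375}
```

<p>Using rates rather than raw counts keeps long and short speeches comparable.</p>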
<p><strong>Data visualization</strong></p>
<ol>
<li>Creating an R package that wraps the <a href="http://www.omegahat.org/SVGAnnotation/" target="_blank">SVGAnnotation</a> package. This package can be used to create dynamic graphics in R, but is still a bit too flexible for most people to use. Writing some wrapper functions that simplify the interface would be potentially high impact. Maybe something like svgPlot() to create simple, dynamic graphics with only a few options. (<em>Difficulty: Mediumish; Effort: Mediumish</em>)</li>
<li>The same as project 1 but for <a href="http://mbostock.github.com/d3/" target="_blank">D3.js</a>. The impact could potentially be a bit higher, since the graphics are a bit more professional, but the level of difficulty and effort would also both be higher. (<em>Difficulty: Highish; Effort: Highish</em>)</li>
</ol>
Gulf on Open Access to Federally Financed Research
2012-02-29T01:19:49+00:00
http://simplystats.github.io/2012/02/29/gulf-on-open-access-to-federally-financed-research
<p><a href="http://www.nytimes.com/2012/02/28/science/a-wide-gulf-on-open-access-to-federally-financed-research.html">Gulf on Open Access to Federally Financed Research</a></p>
Duke Taking New Steps to Safeguard Research Integrity
2012-02-28T14:02:05+00:00
http://simplystats.github.io/2012/02/28/duke-taking-new-steps-to-safeguard-research-integrity
<p><a href="http://today.duke.edu/2012/02/acpotti">Duke Taking New Steps to Safeguard Research Integrity</a></p>
The Duke Saga Starter Set
2012-02-27T14:02:06+00:00
http://simplystats.github.io/2012/02/27/the-duke-saga-starter-set
<p>A <a href="http://simplystatistics.tumblr.com/post/10068195751/the-duke-saga" target="_blank">few</a> <a href="http://simplystatistics.tumblr.com/post/17563119490/the-duke-clinical-trials-saga-what-really-happened" target="_blank">of our</a> <a href="http://simplystatistics.tumblr.com/post/17370909057/duke-saga-on-60-minutes-this-sunday" target="_blank">recent</a> <a href="http://simplystatistics.tumblr.com/post/17550711561/duke-clinical-trials-saga-on-60-minutes-first" target="_blank">posts</a> relate to the Duke trial saga. For those who want to learn more, Baggerly and Coombes have put together a “<a href="http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/StarterSet/index.html" target="_blank">starter set</a>”. It includes:</p>
<ol>
<li>
<p><a href="http://videolectures.net/cancerbioinformatics2010_baggerly_irrh/" target="_blank">a video of one of their talks</a></p>
</li>
<li>
<p><a href="http://www.cbsnews.com/8301-18560_162-57376073/deception-at-duke/" target="_blank">the 60 Minutes episode and clip</a></p>
</li>
<li>
<p><a href="http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/StarterSet/baggerly_nebraska12.pdf" target="_blank">slides from a recent presentation with some new details</a></p>
</li>
<li>
<p><a href="http://arxiv.org/pdf/1010.1092.pdf" target="_blank">their Annals of Applied Statistics paper</a></p>
</li>
<li>
<p><a href="http://www.clinchem.org/content/57/5/688.long" target="_blank">an editorial they wrote for Clinical Chemistry about what information should be required to support clinical “omics” publications</a> (gated)</p>
</li>
<li>
<p><a href="http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/index.html" target="_blank">links to the IOM session recordings and slides</a></p>
</li>
</ol>
<p>Enjoy!</p>
Graham & Dodd's Security Analysis: Moneyball for...Money
2012-02-27T01:51:40+00:00
http://simplystats.github.io/2012/02/27/graham-dodds-security-analysis-moneyball
<p>The last time I <a href="http://simplystatistics.tumblr.com/post/17152281502/an-r-script-for-estimating-future-inflation-via-the" target="_blank">posted something about finance</a> I got schooled by people who actually know stuff. So let me just say that I don’t claim to be an expert in this area, but I do have an interest in it and try to keep up the best I can.</p>
<p>One book I picked up a little while ago was <em>Security Analysis</em> by Benjamin Graham and David Dodd. This is the “bible of value investing” and so I mostly wanted to see what all the hubbub was about. In my mind, the hubbub is well-deserved. Given that it was originally written in 1934, the book has stood the test of time (the book has been updated a number of times since then). It’s quite readable and, I guess, still relevant to modern-day investing. In the 6th edition the out-of-date stuff has been relegated to an appendix. It also contains little essays (of varying quality) by modern-day value investing heroes like Seth Klarman and Glenn Greenberg. It’s a heavy book though and I’m wishing I’d got it on the Kindle.</p>
<p>It occurred to me that with all the interest in data and analytics today, <em>Security Analysis</em> reads a lot like the <em>Moneyball</em> of investing. The two books make the same general point: find things that are underpriced/underappreciated and buy them when no one’s looking. Then profit!</p>
<p>One of the basic points made early on is that roughly speaking, you can’t judge a security by its cover. You need to look at the data. How novel! For example, at the time bonds were considered safe because they were bonds, while stocks (equity) were considered risky because they were stocks. There are technical reasons why this is true, but a careful look at the data might reveal that the bonds of one company are risky while the stock is safe, depending on the price at which they are trading. The question to ask for either type of security is what’s the chance of losing money? In order to answer that question you need to estimate the intrinsic value of the company. For that, you need data.</p>
<blockquote>
<p>The functions of security analysis may be described under three headings: descriptive, selective, and critical. In its more obvious form, descriptive analysis consists of marshalling the important facts relating to the issue [security] and presenting them in a coherent, readily intelligible manner…. A more penetrating type of description seeks to reveal the strong and weak points in the position of an issue, compare its exhibit with that of others of similar character, and appraise the factors which are likely to influence its future performance. Analysis of this kind is applicable to almost every corporate issue, and it may be regarded as an adjunct not only to investment but also to intelligent speculation in that it provides an organized factual basis for the application of judgment.</p>
</blockquote>
<p>Back in Graham & Dodd’s day it must have been quite a bit harder to get the data. Many financial reports that are routinely published today by public companies were not available back then. Today, we are awash in easily accessible financial data and, one might argue as a result of that, there are fewer opportunities to make money. </p>
'WaterBillWoman' pestered city for years over faulty bills
2012-02-25T15:23:00+00:00
http://simplystats.github.io/2012/02/25/waterbillwoman-pestered-city-for-years-over-faulty
<p><a href="http://www.baltimoresun.com/news/maryland/bs-md-ci-water-bills-20120224,0,7923036.story">‘WaterBillWoman’ pestered city for years over faulty bills</a></p>
Prediction: the Lasso vs. just using the top 10 predictors
2012-02-23T16:07:00+00:00
http://simplystats.github.io/2012/02/23/prediction-the-lasso-vs-just-using-the-top-10
<p>One incredibly popular tool for the analysis of high-dimensional data is the <a href="http://www-stat.stanford.edu/~tibs/lasso.html" target="_blank">lasso</a>. The lasso is commonly used when you have many more predictors than independent samples (the n ≪ p problem). It is also often used in the context of prediction.</p>
<p>Suppose you have an outcome <strong>Y</strong> and several predictors <strong>X<sub>1</sub></strong>,…,<strong>X<sub>M</sub></strong>. The lasso fits a model:</p>
<p><strong>Y = B<sub>0</sub> + B<sub>1</sub> X<sub>1</sub> + B<sub>2</sub> X<sub>2</sub> + … + B<sub>M</sub> X<sub>M</sub> + E</strong></p>
<p>subject to a constraint on the sum of the absolute value of the <strong>B</strong> coefficients. The result is that: (1) some of the coefficients get set to zero, and those variables drop out of the model, (2) other coefficients are “shrunk” toward zero. Dropping some variables is good because there are a lot of potentially unimportant variables. Shrinking coefficients may be good, since the big coefficients might be just the ones that were really big by random chance (this is related to Andrew Gelman’s <a href="http://andrewgelman.com/2011/09/the-statistical-significance-filter/" target="_blank">type M errors</a>).</p>
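<p>Written out, the constrained fit described above looks roughly like the following. The sample index <em>i</em>, the sample size <em>n</em>, and the bound <em>t</em> on the coefficients are notation added here for the sketch; the intercept is left out of the constraint, which is the usual convention.</p>

```latex
\hat{B} \;=\; \arg\min_{B_0,\dots,B_M}\;
  \sum_{i=1}^{n} \Bigl( Y_i - B_0 - \sum_{k=1}^{M} B_k X_{ik} \Bigr)^{2}
\quad \text{subject to} \quad
  \sum_{k=1}^{M} \lvert B_k \rvert \;\le\; t
```

<p>Smaller values of <em>t</em> force more coefficients to exactly zero and shrink the rest harder; a large enough <em>t</em> gives back ordinary least squares when that solution exists.</p>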
<p>I work in genomics, where n ≪ p problems come up all the time. Whenever I use the lasso or when I read papers where the lasso is used for prediction, I always think: “How does this compare to just using the top 10 most significant predictors?” I have asked this out loud often enough that <a href="http://www.biostat.jhsph.edu/~rpeng/" target="_blank">some</a> <a href="http://www.biostat.jhsph.edu/~iruczins/" target="_blank">people</a> <a href="http://www.bcaffo.com/" target="_blank">around</a> <a href="http://rafalab.jhsph.edu/" target="_blank">here</a> <a href="http://people.csail.mit.edu/mrosenblum/" target="_blank">started</a> calling it the “Leekasso” to poke fun at me. So I’m going to call it that in a thinly veiled attempt to avoid <a href="http://en.wikipedia.org/wiki/Stigler's_law_of_eponymy" target="_blank">Stigler’s law of eponymy</a> (actually Rafa points out that using this name is a perfect example of this law, since this feature selection approach has been proposed before <a href="http://www.stat.berkeley.edu/tech-reports/576.pdf" target="_blank">at least once</a>).</p>
<p>Here is how the Leekasso works. You fit each of the models:</p>
<p><strong>Y = B<sub>0</sub> + B<sub>k</sub>X<sub>k</sub> + E</strong></p>
<p>take the 10 variables with the smallest p-values from testing the <strong>B<sub>k</sub></strong> coefficients, then fit a linear model with just those 10 variables. You never use 9 or 11; the Leekasso is always 10.</p>
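<p>Here is a minimal sketch of that recipe in plain Python, not the R code used for the post. Two things in it are shortcuts of mine: variables are ranked by absolute correlation with the outcome, which gives the same ordering as the single-variable p-values because the slope’s t-statistic grows with |correlation| when the sample size is fixed; and the final linear model is fit by solving the normal equations directly. All function names and the toy data are hypothetical.</p>

```python
import random
import statistics


def correlation(x, y):
    """Pearson correlation of two equal-length sequences (x must not be constant)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def top_k_predictors(columns, y, k=10):
    """Indices of the k columns with the smallest single-variable p-values.

    Ranking by largest |correlation| is equivalent to ranking by smallest
    p-value for the slope in Y = B0 + Bk * Xk + E.
    """
    ranked = sorted(range(len(columns)), key=lambda j: -abs(correlation(columns[j], y)))
    return ranked[:k]


def least_squares(columns, y):
    """Coefficients [b0, b1, ...] of y = b0 + sum_j b_j * columns[j]."""
    design = [[1.0] * len(y)] + [list(c) for c in columns]  # one row per term
    p = len(design)
    # Augmented normal equations [X'X | X'y].
    a = [[sum(u * v for u, v in zip(design[r], design[c])) for c in range(p)]
         + [sum(u * v for u, v in zip(design[r], y))] for r in range(p)]
    for col in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        pivot_value = a[col][col]
        a[col] = [v / pivot_value for v in a[col]]
        for r in range(p):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [row[-1] for row in a]


def leekasso(columns, y, k=10):
    """Return (selected column indices, fitted coefficients) for the top-k model."""
    chosen = top_k_predictors(columns, y, k)
    return chosen, least_squares([columns[j] for j in chosen], y)


# Toy data (made up): 50 variables, 40 samples, the first 3 variables carry signal.
rng = random.Random(1)
y = [0] * 20 + [1] * 20
columns = [[rng.gauss(0, 1) + (2.0 * yi if j < 3 else 0.0) for yi in y] for j in range(50)]
chosen, coefs = leekasso(columns, y)
print(sorted(chosen), [round(b, 2) for b in coefs])
```

<p>Predictions on new data are then the intercept plus the weighted sum of the 10 selected columns; for a 0/1 outcome you would threshold that value, for example at 0.5.</p>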
<p>For fun I did an experiment to compare the accuracy of the Leekasso and the Lasso.</p>
<p>Here is the setup:</p>
<ol>
<li>I simulated 500 variables and 100 samples for each study, each N(0,1)</li>
<li>I created an outcome that was 0 for the first 50 samples, 1 for the last 50</li>
<li>I set a certain number of variables (between 5 and 50) to be associated with the outcome using the model with independent effects (this is an important choice, more later in the post)</li>
<li>I tried different levels of signal to the truly predictive features</li>
<li>I generated two data sets (training and test) from the exact same model for each scenario</li>
<li>I fit the Lasso using the <a href="http://cran.r-project.org/web/packages/lars/index.html" target="_blank">lars</a> package, choosing the shrinkage parameter as the value that minimized the cross-validation MSE in the training set</li>
<li>I fit the Leekasso and the Lasso on the training sets and evaluated accuracy on the test sets.</li>
</ol>
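<p>The data-generating part of the setup above (steps 1 through 5) can be sketched as follows. This is my reading of the list, not the linked R script: signal is added as a mean shift in the affected variables for the samples with outcome 1, the signal variables are simply the first few columns, and the effect size and seed are arbitrary choices for illustration.</p>

```python
import random


def simulate_study(n_samples=100, n_vars=500, n_signal=10, effect=1.0, rng=None):
    """Generate one data set: a list of variable columns plus a 0/1 outcome.

    Each variable is N(0, 1) noise; the first `n_signal` variables get an
    extra shift of `effect` in the samples whose outcome is 1.
    """
    rng = rng or random.Random()
    half = n_samples // 2
    y = [0] * half + [1] * (n_samples - half)  # 0 for the first half, 1 for the rest
    columns = []
    for j in range(n_vars):
        shift = effect if j < n_signal else 0.0
        columns.append([rng.gauss(0.0, 1.0) + shift * yi for yi in y])
    return columns, y


# Training and test sets drawn from the exact same model (one scenario).
rng = random.Random(2012)
train_x, train_y = simulate_study(n_signal=10, effect=1.0, rng=rng)
test_x, test_y = simulate_study(n_signal=10, effect=1.0, rng=rng)
print(len(train_x), len(train_x[0]), sum(train_y))  # 500 100 50
```

<p>Looping this over the number of signal variables (5 to 50) and over several effect sizes gives the grid of scenarios; the two fitting methods are then trained on the first data set of each pair and scored on the second.</p>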
<p>The R code for this analysis is available <a href="http://biostat.jhsph.edu/~jleek/code/leekasso.R" target="_blank">here</a> and the resulting data is <a href="http://biostat.jhsph.edu/~jleek/code/lassodata.rda" target="_blank">here</a>.</p>
<p>The results show that for all configurations, using the top 10 has a higher out of sample prediction accuracy than the lasso. A larger version of the plot is <a href="http://biostat.jhsph.edu/~jleek/code/accuracy-plot.png" target="_blank">here</a>.</p>
<p><img height="240" src="http://biostat.jhsph.edu/~jleek/code/accuracy-plot.png" width="480" /></p>
<p>Interestingly, this is true even when there are fewer than 10 real features in the data or when there are many more than 10 real features (remember the Leekasso always picks 10).</p>
<p>Some thoughts on this analysis:</p>
<ol>
<li>This is only test-set prediction accuracy; it says nothing about selecting the “right” features for prediction.</li>
<li>The Leekasso took about 0.03 seconds to fit and test per data set compared to about 5.61 seconds for the Lasso.</li>
<li>The data generating model is the model underlying the top 10, so it isn’t surprising it has higher performance. Note that I simulated from the model <strong>X<sub>i</sub> = b<sub>0i</sub> + b<sub>1i</sub>Y + e</strong>, which is the model commonly assumed in differential expression analysis (genomics) or voxel-wise analysis (fMRI). Alternatively, I could have simulated from the model <strong>Y = B<sub>0</sub> + B<sub>1</sub> X<sub>1</sub> + B<sub>2</sub> X<sub>2</sub> + … + B<sub>M</sub> X<sub>M</sub> + E</strong>, where most of the coefficients are zero. In this case, the Lasso would outperform the top 10 (data not shown). This is a key, and possibly obvious, issue raised by this simulation. When doing prediction, differences in the true “causal” model matter a lot. So if we believe the “top 10 model” holds in many high-dimensional settings, then it may be the case that regularization approaches don’t work well for prediction and vice versa.</li>
<li>I think what may be happening is that the Lasso is overshrinking the parameter estimates; in other words, you take on too much bias for the reduction in variance you get in return. Alan Dabney and John Storey have a really nice <a href="http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0001002" target="_blank">paper</a> discussing shrinkage in the context of genomic prediction that I think is related.</li>
</ol>
Monitoring Your Health With Mobile Devices
2012-02-23T13:19:05+00:00
http://simplystats.github.io/2012/02/23/monitoring-your-health-with-mobile-devices
<p><a href="http://www.nytimes.com/2012/02/23/technology/personaltech/monitoring-your-health-with-mobile-devices.html">Monitoring Your Health With Mobile Devices</a></p>
Professional statisticians agree: the Knicks should start Steve Novak over Carmelo Anthony
2012-02-22T16:07:00+00:00
http://simplystats.github.io/2012/02/22/professional-statisticians-agree-the-knicks-should
<p><span>A week ago, Nate Silver tweeted this:</span></p>
<blockquote>
<p><span>Since Lin became starting PG, Knicks have outscored opponents by 63 with Novak on the floor. Been outscored by 8 when he isn’t.</span></p>
</blockquote>
<p>In a <a href="http://simplystatistics.tumblr.com/post/16974142373/why-dont-we-hear-more-about-adrian-dantley-on-espn" target="_blank">previous post</a> we showed the plot below. Note that Carmelo Anthony is in ball hog territory. Novak plays the same position as Anthony but is a three point specialist. His career three point FG% of 42% (253-603) puts him <strong>10th all time!</strong> It seems that with Lin in the lineup he is getting more open shots and helping his team. Should the Knicks start Novak? </p>
<p>Hat tip to David Santiago.</p>
<p><img height="300" src="http://rafalab.jhsph.edu/simplystats/melo.png" width="420" /></p>
Air Pollution Linked to Heart and Brain Risks
2012-02-22T14:02:06+00:00
http://simplystats.github.io/2012/02/22/air-pollution-linked-to-heart-and-brain-risks
<p><a href="http://well.blogs.nytimes.com/2012/02/15/air-pollution-tied-to-heart-and-brain-risks/">Air Pollution Linked to Heart and Brain Risks</a></p>
Interracial Couples Who Make the Most Money
2012-02-21T14:02:05+00:00
http://simplystats.github.io/2012/02/21/interracial-couples-who-make-the-most-money
<p><a href="http://economix.blogs.nytimes.com/2012/02/17/interracial-couples-who-make-the-most-money/">Interracial Couples Who Make the Most Money</a></p>
Scientists Find New Dangers in Tiny but Pervasive Particles in Air Pollution
2012-02-21T02:41:33+00:00
http://simplystats.github.io/2012/02/21/scientists-find-new-dangers-in-tiny-but-pervasive
<p><a href="http://www.nytimes.com/2012/02/19/science/earth/scientists-find-new-dangers-in-tiny-but-pervasive-particles-in-air-pollution.html">Scientists Find New Dangers in Tiny but Pervasive Particles in Air Pollution</a></p>
I don't think it means what ESPN thinks it means
2012-02-20T15:30:17+00:00
http://simplystats.github.io/2012/02/20/i-dont-think-it-means-what-espn-thinks-it-means
<p><img height="375" src="http://biostat.jhsph.edu/~jleek/espn.png" width="500" /></p>
<p>Given ESPN’s recent headline difficulties, it seems like they might want a headline editor or something…</p>
60 Lives, 30 Kidneys, All Linked
2012-02-20T13:30:42+00:00
http://simplystats.github.io/2012/02/20/60-lives-30-kidneys-all-linked
<p><a href="http://www.nytimes.com/2012/02/19/health/lives-forever-linked-through-kidney-transplant-chain-124.html">60 Lives, 30 Kidneys, All Linked</a></p>
Company Unveils DNA Sequencing Device Meant to Be Portable, Disposable and Cheap
2012-02-20T02:44:11+00:00
http://simplystats.github.io/2012/02/20/company-unveils-dna-sequencing-device-meant-to-be
<p><a href="http://www.nytimes.com/2012/02/18/health/oxford-nanopore-unveils-tiny-dna-sequencing-device.html">Company Unveils DNA Sequencing Device Meant to Be Portable, Disposable and Cheap</a></p>
How Companies Learn Your Secrets
2012-02-16T19:30:00+00:00
http://simplystats.github.io/2012/02/16/how-companies-learn-your-secrets
<p><a href="http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html">How Companies Learn Your Secrets</a></p>
I.B.M.: Big Data, Bigger Patterns
2012-02-16T12:40:00+00:00
http://simplystats.github.io/2012/02/16/i-b-m-big-data-bigger-patterns
<p><a href="http://bits.blogs.nytimes.com/2012/02/15/i-b-m-big-data-bigger-patterns/">I.B.M.: Big Data, Bigger Patterns</a></p>
A Flat Budget for NIH in 2013 - ScienceInsider
2012-02-15T19:44:33+00:00
http://simplystats.github.io/2012/02/15/a-flat-budget-for-nih-in-2013-scienceinsider
<p><a href="http://news.sciencemag.org/scienceinsider/2012/02/a-flat-budget-for-nih-in-2013.html#.TzwKpXDLiUA.tumblr">A Flat Budget for NIH in 2013 - ScienceInsider</a></p>
Harvard's Stat 110 is now a course on iTunes
2012-02-15T14:02:06+00:00
http://simplystats.github.io/2012/02/15/harvards-stat-110-is-now-a-course-on-itunes
<p>Back in January we interviewed <a href="http://simplystatistics.tumblr.com/post/16170052064/interview-with-joe-blitzstein" target="_blank">Joe Blitzstein</a> and pointed out that he made his lectures freely available on iTunes. Now it is a course on <a href="http://itunes.apple.com/us/course/statistics-110-probability/id502492375" target="_blank">iTunes</a> and the format has been upgraded to work better with iPhones and iPads. Enjoy! </p>
Mathematicians Organize Boycott of a Publisher
2012-02-14T20:21:03+00:00
http://simplystats.github.io/2012/02/14/mathematicians-organize-boycott-of-a-publisher
<p><a href="http://www.nytimes.com/2012/02/14/science/researchers-boycott-elsevier-journal-publisher.html">Mathematicians Organize Boycott of a Publisher</a></p>
Mortimer Spiegelman Award: Call for Nominations. Deadline is April 1, 2012
2012-02-14T14:02:05+00:00
http://simplystats.github.io/2012/02/14/mortimer-spiegelman-award-call-for-nominations
<p>The Statistics Section of the American Public Health Association invites nominations for the 2012 Mortimer Spiegelman Award honoring a statistician aged 40 or younger who has made outstanding contributions to health statistics, especially public health statistics.</p>
<p>The award was established in 1970 and is presented annually at the APHA meeting. The award serves three purposes: to honor the outstanding achievements of both the recipient and Spiegelman, to encourage further involvement in public health of the finest young statisticians, and to increase awareness of APHA and the Statistics Section in the academic statistical community. More details about the award, including the list of past recipients, and more information about the Statistics Section of APHA may be found <a href="http://www.apha.org/membergroups/sections/aphasections/stats/about/spiegelman.htm" target="_blank">here</a>.</p>
<p>To be eligible for the 2012 Spiegelman Award, a candidate must have been born in 1972 or later. Please send electronic versions of the nominating letter and the candidate’s CV to the 2012 Spiegelman Award Committee Chair, Rafael A. Irizarry (<a href="mailto:rafa@jhu.edu" target="_blank">rafa@jhu.edu</a>).</p>
<p>Please state in the nominating letter the candidate’s birthday. The nominator should include one or two paragraphs in the nominating letter that describe how the nominee’s contributions relate to public health concerns. A maximum of three supporting letters per nomination can be provided. Nominations for the 2012 Award must be submitted by April 1, 2012.</p>
The Duke Clinical Trials Saga: What Really Happened
2012-02-13T20:07:38+00:00
http://simplystats.github.io/2012/02/13/the-duke-clinical-trials-saga-what-really-happened
<p><a href="http://videolectures.net/cancerbioinformatics2010_baggerly_irrh/">The Duke Clinical Trials Saga: What Really Happened</a></p>
Duke Clinical Trials Saga On 60 Minutes First
2012-02-13T14:02:06+00:00
http://simplystats.github.io/2012/02/13/duke-clinical-trials-saga-on-60-minutes-first
<p>Duke clinical trials saga on 60 Minutes. First, the back-to-back shot of Keith and Kevin is priceless. Second, I’ve never seen a cleaner desk in my life.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://cnettv.cnet.com/">http://cnettv.cnet.com/</a>)
</div>
At MSNBC, a Professor as TV Host
2012-02-13T12:22:00+00:00
http://simplystats.github.io/2012/02/13/at-msnbc-a-professor-as-tv-host
<p><a href="http://www.nytimes.com/2012/02/13/business/media/host-of-msnbcs-melissa-harris-perry-is-a-professor.html">At MSNBC, a Professor as TV Host</a></p>
Sunday Data/Statistics Link Roundup (2/12)
2012-02-12T14:28:21+00:00
http://simplystats.github.io/2012/02/12/sunday-data-statistics-link-roundup-2-12
<ol>
<li>An awesome alternative to <a href="http://mbostock.github.com/d3/" target="_blank">D3.js</a> - R’s <a href="http://www.omegahat.org/SVGAnnotation/" target="_blank">SVGAnnotation package</a>. Here’s the paper in <a href="http://www.jstatsoft.org/v46/i01" target="_blank">JSS</a>. I feel like this is one step away from gaining broad use in the statistics community - it still feels a little complicated building the graphics, but there is plenty of flexibility there. I feel like a great project for a student at any level would be writing some easy wrapper functions for this package.</li>
<li>How to <a href="http://rwiki.sciviews.org/doku.php?id=getting-started:installation:android" target="_blank">run R</a> on your Android device. This is very cool - can’t wait to start running simulations on my Nexus S.</li>
<li>Interactive <a href="http://www.jasondavies.com/wordcloud/" target="_blank">word clouds</a> via John C. and why word clouds <a href="http://www.niemanlab.org/2011/10/word-clouds-considered-harmful/" target="_blank">may be dangerous</a> via Jason D. </li>
<li>Trends in <a href="http://simplystatistics.tumblr.com/post/11237403492/apis" target="_blank">APIs</a> - there are <a href="http://techcrunch.com/2012/02/10/2011-api-trends-government-apis-quintuple-facebook-google-twitter-most-popular/?icid=tc_home_art" target="_blank">more of them</a>! Go get your free data. </li>
<li>A <a href="http://gking.harvard.edu/files/paperspub.pdf" target="_blank">really interesting paper</a> by Gary King on how to get a paper by exactly replicating, then building on or discussing, the results of a previous publication.</li>
<li><a href="http://simplystatistics.tumblr.com/post/10686092687/25-minute-seminars" target="_blank">25 minute seminars</a> - I love this post by Rafa, probably because my attention span is so short. But I think 25-30 minute talks are optimal for me to learn something, but not start to zone out…</li>
</ol>
The Age of Big Data
2012-02-12T04:17:22+00:00
http://simplystats.github.io/2012/02/12/the-age-of-big-data
<p><a href="http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html">The Age of Big Data</a></p>
Peter Thiel on Peer Review/Science
2012-02-11T23:07:07+00:00
http://simplystats.github.io/2012/02/11/peter-thiel-on-peer-review-science
<p>Peter Thiel gives his take on science funding/peer review:</p>
<blockquote>
<p><span>My libertarian views are qualified because I do think things worked better in the 1950s and 60s, but it’s an interesting question as to what went wrong with DARPA. It’s not like it has been defunded, so why has DARPA been doing so much less for the economy than it did forty or fifty years ago? Parts of it have become politicized. You can’t just write checks to the thirty smartest scientists in the United States. Instead there are bureaucratic processes, and I think the politicization of science—where a lot of scientists have to write grant applications, be subject to peer review, and have to get all these people to buy in—all this has been toxic, because the skills that make a great scientist and the skills that make a great politician are radically different. There are very few people who are both great scientists and great politicians. So a conservative account of what happened with science in the 20</span><sup>th</sup><span>century is that we had a decentralized, non-governmental approach all the way through the 1930s and early 1940s. At that point, the government could accelerate and push things tremendously, but only at the price of politicizing it over a series of decades. Today we have a hundred times more scientists than we did in 1920, but their productivity per capita is less that it used to be.</span></p>
</blockquote>
<p>Thiel has a history of making <a href="http://techcrunch.com/2011/04/10/peter-thiel-were-in-a-bubble-and-its-not-the-internet-its-higher-education/" target="_blank">controversial comments</a>, and I don’t always agree with him, but I think that his point about the politicization of the grant process is interesting. </p>
Data says Jeremy Lin is for real
2012-02-11T21:53:59+00:00
http://simplystats.github.io/2012/02/11/data-says-jeremy-lin-is-for-real
<p>Nate Silver <a href="http://fivethirtyeight.blogs.nytimes.com/2012/02/11/jeremy-lin-is-no-fluke/" target="_blank">makes a table</a> of all NBA players that have had four games in a row with 20+ points, 6+ assists, 50%+ shooting. The list is short (and it doesn’t include <a href="http://simplystatistics.tumblr.com/post/16817771482/this-graph-makes-me-think-kobe-is-not-that-good-he" target="_blank">Kobe</a>). </p>
Duke Saga on 60 Minutes this Sunday
2012-02-10T14:00:05+00:00
http://simplystats.github.io/2012/02/10/duke-saga-on-60-minutes-this-sunday
<p>This Sunday, February 12, the news magazine <a href="http://www.cbsnews.com/sections/60minutes/main3415.shtml" target="_blank">60 Minutes</a> will have a feature on the <a href="http://simplystatistics.tumblr.com/post/10068195751/the-duke-saga" target="_blank">Duke Clinical Trials saga</a>. Will Dr. Potti himself make an appearance? This is from the 60 Minutes web site:</p>
<blockquote>
<p><strong>Deception at Duke -</strong><span> Scott Pelley reports on a Duke University oncologist whose supervisor says he manipulated the data in his study of a breakthrough cancer therapy. Kyra Darnton is the producer.</span></p>
</blockquote>
<p><span>The word on the street is that the segment will also feature statisticians Keith Baggerly and Kevin Coombes of the M.D. Anderson Cancer Center.</span></p>
<p><span>And that makes two posts this week about people at M.D. Anderson. What’s going on here?</span></p>
An example of how sending a paper to a statistics journal can get you scooped
2012-02-09T14:02:00+00:00
http://simplystats.github.io/2012/02/09/an-example-of-how-sending-a-paper-to-a-statistics
<p>In a <a href="http://simplystatistics.tumblr.com/post/14218411483/dear-editors-associate-editors-referees-please-reject" target="_blank">previous post</a> I complained about statistics journals taking way too long to reject papers. Today I am complaining because even when everything goes right (better than <strike>above</strike> average review time for statistics, useful and insightful comments from reviewers) we can come out losing.</p>
<p>In May 2011 we submitted a paper on <a href="http://biostatistics.oxfordjournals.org/content/early/2012/01/24/biostatistics.kxr054.long" target="_blank">removing GC bias from RNAseq</a> data to Biostatistics. It was published on December 27. However, we were scooped by <a href="http://www.biomedcentral.com/1471-2105/12/480/abstract" target="_blank">this BMC Bioinformatics paper</a> published ten days earlier despite being submitted three months later and accepted 11 days after ours. The competing paper has already earned the “highly accessed” distinction. The two papers, both statistics papers, are very similar, yet I am afraid more people will read the one that was finished second but published first.</p>
<p><span>Note that Biostatistics is one of the fastest stat journals out there. I don’t blame the journal at all here. We statisticians have to change our culture when it comes to reviews.</span></p>
<p><img height="375" src="http://rafalab.jhsph.edu/simplystats/scoop.png" width="500" /></p>
Statisticians and Clinicians: Collaborations Based on Mutual Respect
2012-02-08T17:09:34+00:00
http://simplystats.github.io/2012/02/08/statisticians-and-clinicians-collaborations-based-on
<p><a href="http://magazine.amstat.org/blog/2012/02/01/collaborationpolic/">Statisticians and Clinicians: Collaborations Based on Mutual Respect</a></p>
DealBook: Illumina Formally Rejects Roche's Takeover Bid
2012-02-08T02:11:22+00:00
http://simplystats.github.io/2012/02/08/dealbook-illumina-formally-rejects-roches-takeover
<p><a href="http://dealbook.nytimes.com/2012/02/07/illumina-formally-rejects-roches-takeover-bid/">DealBook: Illumina Formally Rejects Roche’s Takeover Bid</a></p>
Wolfram, a Search Engine, Finds Answers Within Itself
2012-02-07T02:12:56+00:00
http://simplystats.github.io/2012/02/07/wolfram-a-search-engine-finds-answers-within-itself
<p><a href="http://www.nytimes.com/2012/02/07/technology/wolfram-a-search-engine-finds-answers-within-itself.html">Wolfram, a Search Engine, Finds Answers Within Itself</a></p>
An R script for estimating future inflation via the Treasury market
2012-02-06T13:34:04+00:00
http://simplystats.github.io/2012/02/06/an-r-script-for-estimating-future-inflation-via-the
<p>One factor that is critical for any financial planning is estimating what future inflation will be. For example, if you’re saving money in an instrument that gains 3% per year, and inflation is estimated to be 4% per year, well then you’re losing money in real terms.</p>
<p>There are a variety of ways to estimate the rate of future inflation. You could, for example, use past rates as an estimate of future rates. However, the Treasury market provides an estimate of what the market thinks annual inflation will be over the next 5, 10, 20, and 30 years.</p>
<p>Basically, the Treasury issues two types of securities: nominal securities that pay a nominal interest rate (fixed percentage of your principal), and inflation-indexed securities (TIPS) that pay an interest rate that is applied to your principal adjusted by the consumer price index (CPI). As the CPI goes up and down, the payments for inflation-indexed securities go up and down (although they can’t go negative so you always get your principal back). As these securities trade throughout the day, their respective market-based interest rates go up and down continuously. The difference between the nominal interest rate and the real interest rate for a fixed period of time (5, 10, 20, or 30 years) can be used as a rough estimate of annual inflation over that time period.</p>
<p>The Treasury publishes data for its auctions every day on the yield for both nominal and inflation-indexed securities. There is an XML feed for <a href="http://www.treasury.gov/resource-center/data-chart-center/interest-rates/Datasets/yield.xml" target="_blank">nominal yields</a> and for <a href="http://www.treasury.gov/resource-center/data-chart-center/interest-rates/Datasets/real_yield.xml" target="_blank">real yields</a>. Using these, I used the XML R package and wrote an <a href="http://www.biostat.jhsph.edu/~rpeng/inflation.R" target="_blank">R script to scrape the data and calculate the inflation estimate</a>. </p>
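<p>To make the arithmetic concrete, here is a minimal Python sketch of the "nominal minus real" calculation. It is not the R script linked in this post, and it does not read the Treasury XML feeds: the yields below are made-up illustrative numbers, and the function names are hypothetical. The second function uses the Fisher relation, which is a slightly more exact version of the same idea.</p>

```python
def breakeven_inflation(nominal_yield, real_yield):
    """Rough market-implied annual inflation, in percentage points.

    Both inputs are yields in percent for the same maturity.
    """
    return nominal_yield - real_yield


def fisher_inflation(nominal_yield, real_yield):
    """Same estimate via the Fisher relation (1 + n) = (1 + r)(1 + i)."""
    return ((1 + nominal_yield / 100) / (1 + real_yield / 100) - 1) * 100


# Hypothetical yields in percent: maturity -> (nominal, real).
# These are NOT actual Treasury data.
hypothetical_yields = {
    5: (0.80, -1.10),
    10: (1.95, -0.25),
    30: (3.10, 0.70),
}

for maturity, (nominal, real) in hypothetical_yields.items():
    estimate = breakeven_inflation(nominal, real)
    print(f"{maturity}-year Inflation: {estimate:.2f}%")
```

<p>At yields this low the simple difference and the Fisher version agree to within a few hundredths of a percentage point, which is why the simple difference is usually good enough as a rough estimate.</p>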
<p>As of today, the market’s estimate of annual inflation is:</p>
<pre>5-year Inflation: 1.88%
10-year Inflation: 2.18%
30-year Inflation: 2.38%
</pre>
<p>Basically, you just call the ‘inflation()’ function with no arguments and it produces the above print out.</p>
Sunday Data/Statistics Link Roundup (2/5)
2012-02-06T00:06:54+00:00
http://simplystats.github.io/2012/02/06/sunday-data-statistics-link-roundup-2-5
<ol>
<li><a href="http://webdemo.visionobjects.com/equation.html?locale=default" target="_blank">Cool app</a>, you can write out an equation on the screen and it translates the equation to LaTeX. Via Andrew G.</li>
<li>Yet another <a href="http://www.12devsofxmas.co.uk/2012/01/data-visualisation/" target="_blank">D3 tutorial</a>. Stay tuned for some cool stuff on this front here at Simply Stats in the near future. Via Vishal.</li>
<li><a href="http://simplystatistics.tumblr.com/post/14318537784/in-greece-a-statistician-faces-life-in-prison-for" target="_blank">Our favorite Greek statistician</a> in the news <a href="http://www.miamiherald.com/2012/02/01/2618716/blaming-the-messenger-greeces.html" target="_blank">again</a>. </li>
<li>How measurement of academic output <a href="http://www.int-res.com/articles/esep2008/8/e008p009.pdf" target="_blank">harms science</a>. Related: <a href="http://simplystatistics.tumblr.com/post/11059923583/submitting-scientific-papers-is-too-time-consuming" target="_blank">is submitting scientific papers too time consuming</a>? Stay tuned for more on this topic this week. Via Michael E. </li>
<li>One from the archives: <a href="http://simplystatistics.tumblr.com/post/10013120929/data-visualization-and-art" target="_blank">Data visualization and art</a>. </li>
</ol>
Why don't we hear more about Adrian Dantley on ESPN? This graph makes me think he was as good an offensive player as Michael Jordan.
2012-02-03T14:02:00+00:00
http://simplystats.github.io/2012/02/03/why-dont-we-hear-more-about-adrian-dantley-on-espn
<p>In <a href="http://simplystatistics.tumblr.com/post/16817771482/this-graph-makes-me-think-kobe-is-not-that-good-he" target="_blank">my last post</a> I complained about efficiency not being discussed enough by NBA announcers and commentators. I pointed out that some of the best scorers have relatively low FG% or <a href="http://www.ehow.com/how_2092829_calculate-true-shooting-percentage-basketball.html" target="_blank">TS%</a>. However, via the comments it was pointed out that top scorers need to take more difficult shots and thus are expected to have lower efficiency. The plot below (made with this <a href="http://rafalab.jhsph.edu/simplystats/nba.R" target="_blank">R script</a>) seems to confirm this (click image to enlarge) . The dashed line is from regression and the colors represent guards (green), forwards (orange) and centers (purple).</p>
<p><a href="http://rafalab.jhsph.edu/simplystats/kobe3.png" target="_blank"><img height="358" src="http://rafalab.jhsph.edu/simplystats/kobe3.png" width="500" /></a></p>
<p>Among this group TS% does trend down with points per game and centers tend to have higher TS%. Forwards and guards are not very different. However, the plot confirms that some of the supposed all time greats are more ball hogs than good scorers. </p>
<p>A couple of further observations. First, Adrian Dantley was way better than I thought. Why isn’t he more famous? Second, Kobe is no Jordan. Also note Jordan played several seasons past his prime which lowered his career averages. So I added points for five of these players using only data from their prime years (ages 24-29). Here Jordan really stands out. But so does Dantley! </p>
<p><a href="http://rafalab.jhsph.edu/simplystats/kobe4.png" target="_blank"><img height="358" src="http://rafalab.jhsph.edu/simplystats/kobe4.png" width="500" /></a></p>
<p>P.S. Note that these plots say nothing about defense, rebounding, or passing. This <a href="http://skepticalsports.com/?page_id=1222" target="_blank">in-depth analysis</a> makes a convincing argument that Dennis Rodman is one of the most valuable players of all time.</p>
Cleveland's (?) 2001 plan for redefining statistics as "data science"
2012-02-02T12:36:58+00:00
http://simplystats.github.io/2012/02/02/clevelands-2001-plan-for-redefining-statistics-as
<p><a href="http://cm.bell-labs.com/cm/ms/departments/sia/doc/datascience.pdf" target="_blank">This plan</a> has been making the rounds on Twitter and is being attributed to William Cleveland in 2001 (thanks to Kasper for the link). I’m not sure of the provenance of the document but it has some really interesting ideas and is worth reading in its entirety. I actually think that many Biostatistics departments follow the proposed distribution of effort pretty closely. </p>
<p>One of the most interesting sections is the discussion of computing (emphasis mine): </p>
<blockquote>
<p>Data analysis projects today rely on databases, computer and network hardware, and computer and network software. A collection of models and methods for data analysis will be used only if the collection is implemented in a computing environment that makes the models and methods sufficiently efficient to use. In choosing competing models and methods, analysts will trade effectiveness for efficiency of use.</p>
<p>…..</p>
<p><strong>This suggests that statisticians should look to computing for knowledge today, just as data science looked to mathematics in the past.</strong></p>
</blockquote>
<p>I also found the theory section worth a read and figure it will definitely lead to some discussion: </p>
<blockquote>
<p>Mathematics is an important knowledge base for theory. It is far too important to take for granted by requiring the same body of mathematics for all. Students should study mathematics on an as-needed basis.</p>
<p>….</p>
<p>Not all theory is mathematical. In fact, the most fundamental theories of data science are distinctly nonmathematical. For example, the fundamentals of the Bayesian theory of inductive inference involve nonmathematical ideas about combining information from the data and information external to the data. Basic ideas are conveniently expressed by simple mathematical expressions, but mathematics is surely not at issue. </p>
</blockquote>
Evidence-based Music
2012-02-01T14:02:05+00:00
http://simplystats.github.io/2012/02/01/evidence-based-music
<p>There was recently a fascinating article published in PNAS that <a href="http://www.ncbi.nlm.nih.gov/pubmed/22215592" target="_blank">compared the sound quality of different types of violins</a>. In this study, researchers assembled a collection of six violins, three of which were made by Stradivari and Guarneri del Gesu and three made by modern luthiers (i.e. 20th century). The combined value of the “old” violins was $10 million, about 100 times greater than the combined value of the “new” violins. Also, they note:</p>
<blockquote>
<p><span>Numbers of subjects and instruments were small because it is difficult to persuade the owners of fragile, enormously valuable old violins to release them for extended periods into the hands of blindfolded strangers.</span></p>
</blockquote>
<p>Yeah, I’d say so.</p>
<p>They then got 21 professional violinists to try them all out wearing glasses to obscure their vision so they couldn’t see the violins. The researchers were also blinded to the type of violin as the study was being conducted.</p>
<p>The conclusions were striking:</p>
<blockquote>
<p><span>We found that (i) the most-preferred </span><span class="highlight">violin</span><span> was new; (ii) the least-preferred was by Stradivari; (iii) there was scant correlation between an instrument’s age and monetary value and its perceived quality; and (iv) most players seemed unable to tell whether their most-preferred instrument was new or old.</span></p>
</blockquote>
<p><span>First, I’m glad the researchers got people to actually play the instruments. I don’t think it’s sufficient to just listen to some recordings because usually the recordings are by different performers and the quality of the recording itself may vary quite a bit. Second, the study was conducted in a hotel room for its “dry acoustics”, but I think changing the venue might have changed the results. Third, even though the authors don’t declare any specific financial conflict of interest, it’s worth noting that the second author is a violinmaker who could theoretically benefit if people decide they no longer need to focus on old Italian violins.</span></p>
<p><span>I was surprised, but not that surprised, at the results. As a lifelong violinist, I had always wondered whether the Strads and the Guarneris were that much better. I once played on a Guarneri (for about 30 seconds) and I think it’s fair to say that it was incredible. But I’ve also seen some amazing violins made by guys in Brooklyn and New Jersey. I’d always heard that Strads have a darker more mellow sound, which I suppose is nice, but I think these days people may prefer a brighter and bigger sound, especially for those larger modern-day concert halls. </span></p>
<p><span>I hope that this study and others like it will get people to focus on which violins sound good rather than where they came from. I’m glad to see the use of data pose a challenge to another long-standing convention.</span></p>
This graph makes me think Kobe is not that good, he just shoots a lot
2012-01-31T14:02:00+00:00
http://simplystats.github.io/2012/01/31/this-graph-makes-me-think-kobe-is-not-that-good-he
<p>I find it surprising that NBA commentators rarely talk about field goal percentage. Everybody knows that the more you shoot the more you score. But players that score a lot are admired without consideration of their FG%. Of course having a high FG% is not necessarily admirable as many players only take easy shots, while top scorers need to take difficult ones. Regardless, missing is undesirable and players that miss more than usual are not criticized enough. Iverson, for example, had a lowly career FG% of 43 yet he regularly made the All-Star team. But I am not surprised he never won an NBA championship: it’s hard to win when your top scorer misses so often.</p>
<p><img height="450" src="http://rafalab.jhsph.edu/simplystats/kobe.png" width="450" /></p>
<p>Experts consider Kobe to be one of the all-time greats and compare him to Jordan. They never mention that he is consistently among league leaders in missed shots. So far this year, Kobe has missed a whopping 279 times for a league-leading 13.3 misses per game. In contrast, Lebron has missed 8.8 per game and has scored about the same per game. The plot above (made with this <a href="http://rafalab.jhsph.edu/simplystats/nba.R" target="_blank">R script</a>) shows career FG% for players considered to be superstars, top scorers, and that have won multiple championships (red lines are 1st and 3rd quartiles). I also include Gasol, Lebron, Wade, and Dominique. Note that Kobe has the worst FG% in this group. So how does he win 5 championships? Well perhaps Shaq and later Gasol made up for his misses. Note that the first year Kobe played without Shaq, the Lakers did not make the playoffs. Also, during Kobe’s career the Lakers’ record has been <a href="http://slumz.boxden.com/f16/lakers-cavs-records-without-kobe-lebron-1370997/" target="_blank">similar with and without him</a>. Experts may compare Kobe to Jordan, but perhaps we should be comparing him to Dominique.</p>
<p><strong>Update: </strong>Please see <span>Brunsloe87’s comment for a much better analysis than mine. He/she points out that it’s too simplistic to look at FG%. Instead we should look at something closer to points scored per shot taken. This rewards players, like Kobe, that draw many fouls and have a high FT%. There is a weighted statistic called true shooting % (TS%) that tries to summarize this and below I include a plot of TS% for the same players. Kobe is no Jordan but he is not as bad as Dominique either. He is somewhere in the middle. </span></p>
<p><span><img height="500" src="http://rafalab.jhsph.edu/simplystats/kobe2.png" width="500" /></span></p>
<p>The comment also points out that Magic didn’t shoot as much as other superstars so it’s unfair to include him. A better plot would plot TS% versus shots taken (e.g. FGA+FTA/2) but I’ll let someone with more time make that one. Anyways, this plot explains why the early 80s Lakers (Magic+Kareem) were so good.</p>
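<p>For readers who want to see why FG% and TS% can rank players differently, here is a small Python sketch. It is not the R script linked above. It uses the commonly quoted definition of true shooting percentage, PTS / (2 × (FGA + 0.44 × FTA)); the 0.44 weight comes from that conventional definition rather than from this post, and the two stat lines are invented purely for illustration.</p>

```python
def field_goal_pct(fgm, fga):
    """Field goals made divided by field goals attempted."""
    return fgm / fga


def true_shooting_pct(points, fga, fta):
    """Points per scoring attempt, scaled so that making every
    two-point shot and taking no free throws gives 1.0."""
    return points / (2 * (fga + 0.44 * fta))


# Two hypothetical single-game stat lines (not real players).
# Player A: 10-of-20 shooting, all two-pointers, no free throws.
player_a = {"fgm": 10, "fga": 20, "fta": 0, "points": 20}
# Player B: 9-of-20 shooting on two-pointers, plus 9-of-10 free throws.
player_b = {"fgm": 9, "fga": 20, "fta": 10, "points": 27}

for name, s in (("A", player_a), ("B", player_b)):
    fg = field_goal_pct(s["fgm"], s["fga"])
    ts = true_shooting_pct(s["points"], s["fga"], s["fta"])
    print(f"Player {name}: FG% = {fg:.3f}, TS% = {ts:.3f}")
```

<p>Player B has the lower FG% but the higher TS%, because the free throws add points without adding missed field goals. That is the effect the comment describes for players who draw many fouls.</p>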
Why in-person education isn't dead yet...but a statistician could finish it off
2012-01-30T14:02:05+00:00
http://simplystats.github.io/2012/01/30/why-in-person-education-isnt-dead-yet-but-a
<p>A growing trend in education is to put lectures online, for free. The <a href="http://www.khanacademy.org/" target="_blank">Khan Academy</a>, Stanford’s recent <a href="http://www.nytimes.com/2011/08/16/science/16stanford.html?_r=2" target="_blank">AI course</a>, and Gary King’s new <a href="http://projects.iq.harvard.edu/gov2001" target="_blank">quantitative government course</a> at Harvard are three of the more prominent examples. This new pedagogical format is more democratic, free, and helps people learn at their own pace. It has led some, including us here at Simply Statistics, to suggest that the future of graduate education lies in <a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">online courses</a>. Or to forecast the <a href="http://simplystatistics.tumblr.com/post/16474506346/the-end-of-in-class-lectures-is-closer-than-i-thought" target="_blank">end of in-class lectures</a>. </p>
<p>All this excitement led John Cook to ask, “<a href="http://www.johndcook.com/blog/2012/01/24/what-do-colleges-sell/" target="_blank">What do colleges sell?</a>”. The answers he suggested were: (1) real credentials, like a degree, (2) motivation to ensure you did the work, and (3) feedback to tell you how you are doing. As John suggests, online lectures really only target motivated and self-starting learners. For graduate students, this may work (maybe), but for the vast majority of undergrads or high-school students, self-guided learning won’t work due to a lack of motivation. </p>
<p>I would suggest that until the feedback, assessment, and credentialing problems have been solved, online lectures are still more edu-tainment than education. </p>
<p>Of these problems, I think we are closest to solving the feedback problem with online quizzes and tests to go with online lectures. What we haven’t solved are assessment and credentialing. The reason is there is no good system for verifying a person taking a quiz/test online is who they say they are. This issue has two consequences: (1) it is difficult to require that a person do online quizzes/tests like we do with in-class quizzes/tests and (2) it is difficult to believe credentials given to people who take courses online. </p>
<p>What does this have to do with statistics? Well, what we need is a <strong>C</strong>ompletely <strong>A</strong>utomated <strong>O</strong>nline <strong>T</strong>est for <strong>S</strong>tudent <strong>I</strong>dentity (COATSI). People will notice a similarity between my acronym and the acronym for <a href="http://en.wikipedia.org/wiki/CAPTCHA" target="_blank">CAPTCHAs</a>, the simple online Turing tests used to prove that you are a human and not a computer. </p>
<p>The properties of a COATSI need to be:</p>
<ol>
<li>Completely automated</li>
<li>Provide tests that verify the identity of the student being assessed</li>
<li>Can be used throughout an online quiz/test/assessment</li>
<li>Are simple and easy to solve</li>
</ol>
<p>I can’t think of a deterministic system that can be used for this purpose. My suspicion is that a COATSI will need to be statistical. For example, one idea is to have people sign in with Facebook, then at random intervals while they are solving problems, they have to identify their friends by name. If they do this quickly/consistently enough, they are verified as the person taking the test. </p>
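<p>As a rough illustration of what "statistical" could mean here, the Python sketch below treats each friend-identification challenge as a multiple-choice question and asks how likely an impostor would be to do as well by guessing. Everything in it is an assumption made for the sake of the example: the four-option format (guess rate 0.25), the 1% threshold, and the function names. It also ignores response speed and any impostor who actually knows some of the student's friends.</p>

```python
from math import comb


def impostor_tail_prob(n, k, p_guess):
    """P(at least k correct out of n) for someone who answers each
    challenge correctly with probability p_guess (binomial model)."""
    return sum(
        comb(n, i) * p_guess**i * (1 - p_guess) ** (n - i)
        for i in range(k, n + 1)
    )


def verified(n, k, p_guess=0.25, alpha=0.01):
    """Accept the test taker if a guessing impostor would reach k or
    more correct answers with probability below alpha."""
    return impostor_tail_prob(n, k, p_guess) < alpha


# Hypothetical sessions with 10 challenges each.
for correct in (4, 7, 9):
    prob = impostor_tail_prob(10, correct, 0.25)
    print(f"{correct}/10 correct: impostor probability {prob:.5f}, "
          f"verified = {verified(10, correct)}")
```

<p>The point is only that a handful of cheap, randomly timed challenges can make a guessing impostor very unlikely to pass. Choosing realistic values for the guess rate and the threshold is the hard statistical part.</p>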
<p>I don’t have a good solution to this problem yet; I’d love to hear more suggestions. I also think this seems like a potentially hugely important and very challenging problem for a motivated grad student or postdoc….</p>
Sunday data/statistics link roundup (1/29)
2012-01-29T17:52:08+00:00
http://simplystats.github.io/2012/01/29/sunday-data-statistics-link-roundup-1-29
<ol>
<li>A really nice <a href="http://alignedleft.com/tutorials/d3/" target="_blank">D3 tutorial</a>. I’m 100% on board with D3, if they could figure out a way to export the graphics as pdfs, I think this would be the best visualization tool out there.</li>
<li>A <a href="http://populationaction.org/Articles/Whats_Your_Number/" target="_blank">personalized calculator</a> that tells you what number (of the 7 billion or so) that you are based on your birth day. I’m person 4,590,743,884. Makes me feel so special….</li>
<li>An old post of ours, on <a href="http://simplystatistics.tumblr.com/post/10555655037/dongle-communism" target="_blank">dongle communism</a>. One of my favorite posts, it came out before we had much traffic but deserves more attention.</li>
<li>This isn’t statistics/data related but too good to pass up. From the Bones television show, <a href="http://www.myvidster.com/video/4132893/_A_new_low_for_TV_science_Malware_Fractals_in_Bones_ampbull_videosiftcom" target="_blank">malware fractals shaved into a bone</a>. I love TV science. Thanks to Dr. J for the link.</li>
<li><a href="http://bits.blogs.nytimes.com/2012/01/26/what-are-the-odds-that-stats-would-get-this-popular/" target="_blank">Stats are popular</a>…</li>
</ol>
This simple bar graph clearly demonstrates that the US can easily increase research funding
2012-01-27T15:09:00+00:00
http://simplystats.github.io/2012/01/27/this-simple-bar-graph-clearly-demonstrates-that-the-us
<p>Some NIH R01 paylines are <a href="http://www.einstein.yu.edu/ogs/page.aspx?id=21983" target="_blank">down to 10%</a>. This means only 10% of grants are being funded. The plot below highlights that all we need is a tiny little slice from Defense, Medicare, Medicaid or Social Security to bring that back up to 20%. The plot was taken from Alex Tabarrok’s <a href="http://www.theatlantic.com/business/archive/2012/01/the-innovation-nation-vs-the-warfare-welfare-state/251984/" target="_blank">great article</a> in the Atlantic.<img height="231" src="http://cdn.theatlantic.com/static/mt/assets/business/innovation%20welfarewarfare.png" width="377" /></p>
<p><strong>Update</strong>: The y-axis unit is billions of US dollars.</p>
When should statistics papers be published in Science and Nature?
2012-01-26T14:00:05+00:00
http://simplystats.github.io/2012/01/26/when-should-statistics-papers-be-published-in-science
<p>Like many statisticians, I was amped to see a <a href="http://www.sciencemag.org/content/334/6062/1518.abstract" target="_blank">statistics paper</a> appear in Science. Given the impact that statistics has on the scientific community, it is a shame that more statistics papers don’t appear in the glossy journals like <em>Science</em> or <em>Nature</em>. As I pointed out in <a href="http://simplystatistics.tumblr.com/post/15402808730/p-values-and-hypothesis-testing-get-a-bad-rap-but-we" target="_blank">the previous post</a>, if the paper that introduced the p-value was cited every time this statistic was used, the paper would have over 3 million citations!</p>
<p>But a couple of our readers* have pointed to a <a href="http://www-stat.stanford.edu/~tibs/reshef/comment.pdf" target="_blank">response</a> to the MIC paper published by Noah Simon and Rob Tibshirani. Simon and Tibshirani show that the MIC statistic is underpowered compared to another recently published statistic for the same purpose that came out in 2009 in the Annals of Applied Statistics. A nice <a href="http://scientificbsides.wordpress.com/2012/01/23/detecting-novel-associations-in-large-data-sets-let-the-giants-battle-it-out/" target="_blank">summary</a> of the discussion is provided by Florian over at his blog. </p>
<p><em>If the AoAS statistic came out first (by 2 years) and is more powerful (according to simulation), should the MIC statistic have appeared in Science? </em></p>
<p>The whole discussion reminds me of a recent blog post suggesting that journals need to pick one between <a href="http://spsptalks.wordpress.com/2011/12/31/groundbreaking-or-definitive-journals-need-to-pick-one/" target="_blank">groundbreaking and definitive</a>. The post points out that groundbreaking and definitive are in many ways in opposition to each other. </p>
<p>Again, I’d suggest that statistics papers get short shrift in the glossy journals and I would like to see more. And the MIC statistic is certainly groundbreaking, but it isn’t clear that it is definitive. </p>
<p>As a comparison, a slightly different story played out with another recent high-impact statistical method, the false discovery rate (FDR). The original papers were published in <a href="http://www.jstor.org/pss/2346101" target="_blank">statistics</a> <a href="http://www.genomine.org/papers/directfdr.pdf" target="_blank">journals</a>. Then when it was clear that the idea was going to be big, a more general-audience-friendly summary was published in <a href="http://www.pnas.org/content/100/16/9440.full" target="_blank">PNAS</a> (not <em>Science</em> or <em>Nature</em> but definitely glossy). This might be a better way for the glossy journals to know what is going to be a major development in statistics versus an exciting - but potentially less definitive - method. </p>
<p>* Florian M. and John S.</p>
The end of in-class lectures is closer than I thought
2012-01-25T19:18:28+00:00
http://simplystats.github.io/2012/01/25/the-end-of-in-class-lectures-is-closer-than-i-thought
<p>Our previous post on the <a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">future of (statistics) graduate education</a> was motivated by the Stanford <a href="http://www.nytimes.com/2011/08/16/science/16stanford.html?_r=1" target="_blank">online course</a> on Artificial Intelligence. Here is <a href="http://blogs.reuters.com/felix-salmon/2012/01/23/udacity-and-the-future-of-online-universities/" target="_blank">an update</a> on the class that had 160,000 people enroll. Some highlights: 1- Sebastian Thrun has given up his tenure at Stanford and he’s started a new online university called <a href="http://www.udacity.com/" target="_blank">Udacity</a>. 2- 248 students got a perfect score: they never got a single question wrong, over the entire course of the class. All 248 took the course online; not one was enrolled at Stanford. 3- Students from Afghanistan completed the course. What do you think are the chances these students could afford Stanford’s tuition? 4 - There were more students from Lithuania alone than there are students at Stanford altogether.</p>
<p>The <a href="http://programming-puzzler.blogspot.com/2011/11/review-of-2011-free-stanford-online.html" target="_blank">class evaluations were not perfect</a>. Here is <a href="http://pennyhacks.com/2011/12/28/stanford-free-classes-a-review-from-a-stanford-student/" target="_blank">a particularly harsh one</a>. They also need to figure out how to evaluate online students. But I am sure there are plenty of people working on that problem. Here is an <a href="http://chronicle.com/article/MIT-Mints-a-Valuable-New-Form/130410/?sid=at" target="_blank">example</a>. Regardless, this was the first such experiment and for a first try it seems like a huge success to me. As more professors try this, for example Harvard’s Gary King is conducting a <a href="http://projects.iq.harvard.edu/gov2001" target="_blank">similar class </a>in Quantitative Research Methodology, it will become clearer that there is no future for in-class lectures as we know them today.</p>
<p>Thanks to Alex and Jeff for all the links. </p>
A wordcloud comparison of the 2011 and 2012 #SOTU
2012-01-25T04:02:39+00:00
http://simplystats.github.io/2012/01/25/a-wordcloud-comparison-of-the-2011-and-2012-sotu
<p>I wrote a quick (and very dirty) <a href="http://biostat.jhsph.edu/~jleek/code/sotu2011-2012comparison.R" target="_blank">R script</a> for creating a comparison cloud and a commonality cloud for President Obama’s 2011 and 2012 State of the Union speeches. The cloud on the left shows words that have different frequencies between the two speeches and the cloud on the right shows the words in common between the two speeches. <a href="http://biostat.jhsph.edu/~jleek/code/sotu-wordcloud.png" target="_blank">Here</a> is a higher resolution version.</p>
<p><img height="345" src="http://biostat.jhsph.edu/~jleek/code/sotu-wordcloud.png" width="600" /></p>
<p>The focus on jobs hasn’t changed much. But it is interesting how the 2012 speech seems to focus more on practical issues (tax, pay, manufacturing, oil) versus more emotional issues in 2011 (future, schools, laughter, success, dream).</p>
<p>The <a href="http://cran.r-project.org/web/packages/wordcloud/index.html" target="_blank">wordcloud</a> R package does all the heavy lifting.</p>
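<p>For anyone curious about the idea behind the two clouds, here is a minimal Python sketch of the underlying word-frequency comparison. It is not the linked R script and does not reproduce what the wordcloud package does internally. The two "speeches" are tiny made-up strings, the function names are hypothetical, and using the smaller of the two relative frequencies as the "commonality" score is just one simple choice.</p>

```python
import re
from collections import Counter


def word_freqs(text):
    """Relative frequency of each lowercase word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def compare(freq_a, freq_b):
    """Return (difference, commonality) scores for two frequency tables.

    difference[w] > 0 means w is relatively more frequent in text A.
    commonality[w] is the smaller relative frequency, for shared words only.
    """
    vocab = set(freq_a) | set(freq_b)
    difference = {w: freq_a.get(w, 0.0) - freq_b.get(w, 0.0) for w in vocab}
    commonality = {w: min(freq_a[w], freq_b[w]) for w in set(freq_a) & set(freq_b)}
    return difference, commonality


# Toy stand-ins for the two speeches (not the real transcripts).
speech_2011 = "jobs future schools dream jobs"
speech_2012 = "jobs tax oil manufacturing jobs tax"

difference, commonality = compare(word_freqs(speech_2011), word_freqs(speech_2012))
print(sorted(difference.items(), key=lambda item: item[1]))
print(commonality)
```

<p>Words with large positive or negative differences would be drawn large in a comparison cloud, and words with high commonality scores would be drawn large in a commonality cloud.</p>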
Why statisticians should join and launch startups
2012-01-23T14:00:05+00:00
http://simplystats.github.io/2012/01/23/why-statisticians-should-join-and-launch-startups
<p>The tough economic times we live in, and the potential for big paydays, have made <a href="http://en.wikipedia.org/wiki/The_Social_Network" target="_blank">entrepreneurship cool</a>. From the <a href="http://www.whitehouse.gov/issues/startup-america" target="_blank">venture capitalist-in-chief</a>, to the javascript <a href="http://chats-blog.com/2012/01/08/michael-bloomberg-learning-to-code/" target="_blank">coding mayor of New York</a>, everyone is on board. No surprise there, successful startups lead to job creation which can have a major positive impact on the economy. </p>
<p>The game has been dominated for a long time by the folks over in CS. But the value of many recent startups is either based on, or can be magnified by, good data analysis. Here are a few startups that are based on data/data analysis: </p>
<ol>
<li>The <a href="http://www.climate.com/" target="_blank">Climate Corporation</a> - analyzes climate data to sell farmers weather insurance.</li>
<li><a href="http://flightcaster.com/" target="_blank">Flightcaster</a> - uses public data to predict flight delays</li>
<li><a href="http://quid.com/" target="_blank">Quid</a> - uses data on startups to predict success, among other things.</li>
<li><a href="http://100plus.com/" target="_blank">100plus</a> - personalized health prediction startup, predicting health based on public data</li>
<li><a href="http://www.hipmunk.com/" target="_blank">Hipmunk</a> - The main advantage of this site for travel is better data visualization and an algorithm to show you which flights have the worst “agony”.</li>
</ol>
<p>To launch a startup you need just a couple of things: (1) a good, valuable source of data (there are lots of these on the web) and (2) a good idea about how to analyze them to create something useful. The second step is obviously harder than the first, but the companies above prove you can do it. Then, once it is built, you can outsource/partner with developers - web and otherwise - to implement your idea. If you can build it in R, someone can make it an app. </p>
<p>These are just a few of the startups whose value is entirely derived from data analysis. But companies from LinkedIn, to Bitly, to Amazon, to Walmart are trying to mine the data they are generating to increase value. Data is now being generated at unprecedented scale by computers, cell phones, even <a href="http://www.nest.com/" target="_blank">thermostats</a>! With this onslaught of data, the need for people with analysis skills is becoming incredibly <a href="http://radar.oreilly.com/2011/12/data-science-carrieriq-datasift-twitter.html" target="_blank">acute</a>. </p>
<p>Statisticians, like computer scientists before them, are poised to launch, and make major contributions to, the next generation of startups. </p>
Sunday Data/Statistics Link Roundup (1/21)
2012-01-22T14:00:06+00:00
http://simplystats.github.io/2012/01/22/sunday-data-statistics-link-roundup-1-21
<ol>
<li><a href="http://jermdemo.blogspot.com/2012/01/when-can-we-expect-last-damn-microarray.html" target="_blank">Is the microarray dead</a>? Jeremy Leipzig seems to think that statistical methods for microarrays should be. I’m not convinced: the technology has finally matured to the point we can use it for personalized medicine, and we abandon it for the next hot thing? Nod to Andrew for the link.</li>
<li>Data from 5 billion webpages available from the <a href="http://www.commoncrawl.org/data/accessing-the-data/" target="_blank">Common Crawl</a>. Want to build your own search tool - or just find out what’s on the web? Get your Hadoop on. Nod to Peter S. for the heads up. </li>
<li>Simon and Tibshirani <a href="http://www-stat.stanford.edu/~tibs/reshef/" target="_blank">criticize</a> the greatly publicized <a href="http://www.sciencemag.org/content/334/6062/1518" target="_blank">MIC statistic</a>. Nod to John S. for the link.</li>
<li>A public/free <a href="http://projects.iq.harvard.edu/gov2001/" target="_blank">statistics class</a> being offered through the IQSS at Harvard. </li>
</ol>
Interview With Joe Blitzstein
2012-01-20T14:00:06+00:00
http://simplystats.github.io/2012/01/20/interview-with-joe-blitzstein
<div class="im">
<strong>Joe Blitzstein</strong>
</div>
<div class="im">
</div>
<div class="im">
<img height="200" src="http://biostat.jhsph.edu/~jleek/Blitzstein3.jpg" width="300" />
</div>
<div class="im">
</div>
<div class="im">
Joe Blitzstein is <a href="http://news.harvard.edu/gazette/story/2011/11/the-lasting-lure-of-logic/" target="_blank">Professor of the Practice in Statistics</a> at Harvard University and co-director of the graduate program. He moved to Harvard after obtaining his Ph.D. with Persi Diaconis at Stanford University. Since joining the faculty at Harvard, he has been immortalized in Youtube prank videos, been awarded a “favorite professor” distinction four times, and performed interesting research on the statistical analysis of social networks. Joe was also the first person to discover our blog on Twitter. You can find more information about him on his <a href="http://www.people.fas.harvard.edu/~blitz/Site/Home.html" target="_blank">personal website</a>. Or check out his Stat 110 class, now available <a href="http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=495213607" target="_blank">from iTunes</a>!
</div>
<div class="im">
</div>
<div class="im">
<strong>Which term applies to you: data scientist/statistician/</strong><strong>analyst?</strong>
</div>
<p><span>Statistician, but that should and does include working with data! I</span><br />
<span>think statistics at its best interweaves modeling, inference,</span><br />
<span>prediction, computing, exploratory data analysis (including</span><br />
<span>visualization), and mathematical and scientific thinking. I don’t</span><br />
<span>think “data science” should be a separate field, and I’m concerned</span><br />
<span>about people working with data without having studied much statistics</span><br />
<span>and conversely, statisticians who don’t consider it important ever to</span><br />
<span>look at real data. I enjoyed the discussions by Drew Conway and on</span><br />
<span>your blog (at </span><a href="http://www.drewconway.com/zia/?p=2378" target="_blank">http://www.drewconway.com/zia/?p=2378</a><span> and</span><br />
<a href="http://simplystatistics.tumblr.com/post/11271228367/datascientist" target="_blank">http://simplystatistics.tumblr.com/post/11271228367/datascientist</a><span>)</span><br />
<span>and think the relationships between statistics, machine learning, data</span><br />
<span>science, and analytics need to be clarified.</span></p>
<div class="im">
<strong>How did you get into statistics/data science (e.g. your history)?</strong>
</div>
<p><span>I always enjoyed math and science, and became a math major as an</span><br />
<span>undergrad at Caltech partly because I love logic and probability and</span><br />
<span>partly because I couldn’t decide which science to specialize in. One</span><br />
<span>of my favorite things about being a math major was that it felt so</span><br />
<span>connected to everything else: I could often help my friends who were</span><br />
<span>doing astronomy, biology, economics, etc. with problems, once they had</span><br />
<span>explained enough so that I could see the essential pattern/structure</span><br />
<span>of the problem. At the graduate level, there is a tendency for math to</span><br />
<span>become more and more disconnected from the rest of science, so I was</span><br />
<span>very happy to discover that statistics let me regain this, and have</span><br />
<span>the best of both worlds: you can apply statistical thinking and tools</span><br />
<span>to almost anything, and there are so many opportunities to do things</span><br />
<span>that are both beautiful and useful.</span></p>
<div class="im">
<strong>Who were really good mentors to you? What were the qualities that really</strong><br /><strong>helped you?</strong>
</div>
<p><span>I’ve been extremely lucky that I have had so many inspiring</span><br />
<span>colleagues, teachers, and students (far too numerous to list), so I</span><br />
<span>will just mention three. My mother, Steffi, taught me at an early age</span><br />
<span>to love reading and knowledge, and to ask a lot of “what if?”</span><br />
<span>questions. My PhD advisor, Persi Diaconis, taught me many beautiful</span><br />
<span>ideas in probability and combinatorics, about the importance of</span><br />
<span>starting with a simple nontrivial example, and to ask a lot of “who</span><br />
<span>cares?” questions. My colleague Carl Morris taught me a lot about how</span><br />
<span>to think inferentially (Brad Efron called Carl a “natural”</span><br />
<span>statistician in his interview at</span><br />
<a href="http://www-stat.stanford.edu/~ckirby/brad/other/2010Significance.pdf" target="_blank">http://www-stat.stanford.edu/~ckirby/brad/other/2010Significance.pdf</a><span>,</span><br />
<span>by which I think he meant that valid inferential thinking does not</span><br />
<span>come naturally to most people), about parametric and hierarchical</span><br />
<span>modeling, and to ask a lot of “does that assumption make sense in the</span><br />
<span>real world?” questions.</span></p>
<div class="im">
<strong>How do you get students fired up about statistics in your classes?</strong>
</div>
<p><span>Statisticians know that their field is both incredibly useful in the</span><br />
<span>real world and exquisitely beautiful aesthetically. So why isn’t that</span><br />
<span>always conveyed successfully in courses? Statistics is often</span><br />
<span>misconstrued as a messy menagerie of formulas and tests, rather than a</span><br />
<span>coherent approach to scientific reasoning based on a few fundamental</span><br />
<span>principles. So I emphasize thinking and understanding rather than</span><br />
<span>memorization, and try to make sure everything is well-motivated and</span><br />
<span>makes sense both mathematically and intuitively. I talk a lot about</span><br />
<span>paradoxes and results which at first seem counterintuitive, since</span><br />
<span>they’re fun to think about and insightful once you figure out what’s</span><br />
<span>going on.</span></p>
<p><span>And I emphasize what I call “stories,” by which I mean an</span><br />
<span>application/interpretation that does not lose generality. As a simple</span><br />
<span>example, if X is Binomial(m,p) and Y is Binomial(n,p) independently,</span><br />
<span>then X+Y is Binomial(m+n,p). A story proof would be to interpret X as</span><br />
<span>the number of successes in m Bernoulli trials and Y as the number of</span><br />
<span>successes in n different Bernoulli trials, so X+Y is the number of</span><br />
<span>successes in the m+n trials. Once you’ve thought of it this way,</span><br />
<span>you’ll always understand this result and never forget it. A</span><br />
<span>misconception is that this kind of proof is somehow less rigorous than</span><br />
<span>an algebraic proof; actually, rigor is determined by the logic of the</span><br />
<span>argument, not by how many fancy symbols and equations one writes out.</span></p>
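<p>The Binomial result in the story above can also be checked numerically. The following is a minimal sketch in Python, not part of the interview: the function names and the values of m, n, and p are illustrative assumptions. It convolves the two Binomial PMFs and compares the result with the Binomial(m+n,p) PMF.</p>

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def sum_pmf(k: int, m: int, n: int, p: float) -> float:
    """P(X + Y = k) for independent X ~ Binomial(m, p) and Y ~ Binomial(n, p).

    This is the discrete convolution of the two PMFs.
    """
    return sum(binomial_pmf(j, m, p) * binomial_pmf(k - j, n, p) for j in range(k + 1))


# Illustrative values only (hypothetical, not from the interview).
m, n, p = 4, 6, 0.3

# Largest pointwise gap between the convolution and the Binomial(m + n, p) PMF.
max_gap = max(
    abs(sum_pmf(k, m, n, p) - binomial_pmf(k, m + n, p)) for k in range(m + n + 1)
)
print(f"largest difference between the two PMFs: {max_gap:.2e}")
```

<p>The gap should be zero up to floating-point rounding. The check only confirms the identity for one choice of values; the story proof is what explains why it holds in general.</p>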
<p><span>My undergraduate probability course, Stat 110, is now worldwide</span><br />
<span>viewable for free on iTunes U at</span><br />
<a href="http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=495213607" target="_blank">http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=495213607</a><br />
<span>with 34 lecture videos and about 250 practice problems with solutions.</span><br />
<span>I hope that will be a useful resource, but in any case looking through</span><br />
<span>those materials says more about my teaching style than anything I can</span><br />
<span>write here does.</span></p>
<p><span><strong>What are your main research interests these days?</strong></span></p>
<p><span>I’m especially interested in the statistics of networks, with</span><br />
<span>applications to social network analysis and in public health. There is</span><br />
<span>a tremendous amount of interest in networks these days, coming from so</span><br />
<span>many different fields of study, which is wonderful but I think there</span><br />
<span>needs to be much more attention devoted to the statistical issues.</span><br />
<span>Computationally, most network models are difficult to work with since</span><br />
<span>the space of all networks is so vast, and so techniques like Markov</span><br />
<span>chain Monte Carlo and sequential importance sampling become crucial;</span><br />
<span>but there remains much to do in making these algorithms more efficient</span><br />
<span>and in figuring out whether one has run them long enough (usually the</span><br />
<span>answer is “no” to the question of whether one has run them long</span><br />
<span>enough). Inferentially, I am especially interested in how to make</span><br />
<span>valid conclusions when, as is typically the case, it is not feasible</span><br />
<span>to observe the full network. For example, respondent-driven sampling</span><br />
<span>is a link-tracing scheme being used all over the world these days to</span><br />
<span>study so-called “hard-to-reach” populations, but much remains to be</span><br />
<span>done to know how best to analyze such data; I’m working on this with</span><br />
<span>my student Sergiy Nesterko. With other students and collaborators I’m</span><br />
<span>working on various other network-related problems. Meanwhile, I’m also</span><br />
<span>finishing up a graduate probability book with Carl Morris,</span><br />
<span>“Probability for Statistical Science,” which has quite a few new</span><br />
<span>proofs and perspectives on the parts of probability theory that are</span><br />
<span>most useful in statistics.</span></p>
<div class="im">
<strong>You have been immortalized in several Youtube videos. Do you think this</strong><br /><strong>helped make your class more “approachable”?</strong>
</div>
<p><span>There were a couple strange and funny pranks that occurred in my first</span><br />
<span>year at Harvard. I’m used to pranks since Caltech has a long history</span><br />
<span>and culture of pranks, commemorated in several “Legends of Caltech”</span><br />
<span>volumes (there’s even a movie in development about this), but pranks</span><br />
<span>are quite rare at Harvard. I try to make the class approachable</span><br />
<span>through the lectures and by making sure there is plenty of support,</span><br />
<span>help, and encouragement available from the teaching assistants and</span><br />
<span>me, not through YouTube, but it’s fun having a few interesting</span><br />
<span>occasions from the history of the class commemorated there.</span></p>
Data Journalism Awards
2012-01-19T20:55:44+00:00
http://simplystats.github.io/2012/01/19/data-journalism-awards
<p><a href="http://googleblog.blogspot.com/2012/01/data-journalism-awards-now-accepting.html">Data Journalism Awards</a></p>
Fundamentals of Engineering Review Question Oops
2012-01-19T02:13:25+00:00
http://simplystats.github.io/2012/01/19/fundamentals-of-engineering-review-question-oops
<p>The <a href="http://www.ncees.org/Exams/FE_exam.php" target="_blank">Fundamentals of Engineering Exam</a> is the first licensing exam for engineers. You have to pass it on your way to becoming a professional engineer (PE). I was recently shown a problem from a review manual: </p>
<blockquote>
<p>When it is operating properly, a chemical plant has a daily production rate that is normally distributed with a mean of 880 tons/day and a standard deviation of 21 tons/day. During an analysis period, the output is measured with random sampling on 50 consecutive days, and the mean output is found to be 871 tons/day. With a 95 percent confidence level, determine if the plant is operating properly. </p>
<ol>
<li>There is at least a 5 percent probability that the plant is operating properly. </li>
<li>There is at least a 95 percent probability that the plant is operating properly. </li>
<li>There is at least a 5 percent probability that the plant is not operating properly. </li>
<li>There is at least a 95 percent probability that the plant is not operating properly. </li>
</ol>
</blockquote>
<p>Whoops…seems to be a problem there. I’m glad that engineers are expected to know some statistics; hopefully the engineering students taking the exam can spot the problem…but then how do they answer? </p>
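<p>For reference, here is a minimal sketch of the calculation the review manual presumably has in mind. This is an assumption, since the manual's intended solution is not shown: a two-sided z-test of the sample mean against the stated mean, using only the numbers from the problem. The variable names are illustrative.</p>

```python
from math import sqrt
from statistics import NormalDist

mu0 = 880.0    # stated mean when operating properly (tons/day)
sigma = 21.0   # stated standard deviation (tons/day)
n = 50         # number of days sampled
xbar = 871.0   # observed mean output (tons/day)

standard_error = sigma / sqrt(n)
z = (xbar - mu0) / standard_error

# Two-sided p-value: probability of a sample mean at least this far from mu0
# if the plant really is operating properly.
p_value = 2 * NormalDist().cdf(-abs(z))

# 95 percent confidence interval for the true mean output.
half_width = NormalDist().inv_cdf(0.975) * standard_error
ci = (xbar - half_width, xbar + half_width)

print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
print(f"95% CI for the mean: ({ci[0]:.1f}, {ci[1]:.1f})")
```

<p>The interval excludes 880, so the test rejects at the 5 percent level. But neither the p-value nor the confidence level is a probability that the plant is or is not operating properly, which is presumably why none of the four answer choices works.</p>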
figshare and don't trust celebrities stating facts
2012-01-17T17:48:58+00:00
http://simplystats.github.io/2012/01/17/figshare-and-dont-trust-celebrities-stating-facts
<p>A couple of links:</p>
<ol>
<li><a href="http://figshare.com/" target="_blank">figshare</a> is a site where scientists can share data sets/figures/code. One of the goals is to encourage researchers to share negative results as well. I think this is a great idea - I often find negative results and this could be a place to put them. It also uses a tagging system, like Flickr. I think this is a great idea for scientific research discovery. They give you unlimited public space and 1GB of private space. This could be big, a place to help make <a href="http://simplystatistics.tumblr.com/post/13633695297/reproducible-research-in-computational-science" target="_blank">reproducible research efforts</a> user-friendly. Via <a href="http://techcrunch.com/2012/01/17/science-data-sharing-site-figshare-relaunches-adds-features/" target="_blank">TechCrunch</a></li>
<li>Don’t trust <a href="http://newsfeed.time.com/2012/01/03/celebrities-offering-scientific-facts-just-say-no/?xid=newsletter-weekly" target="_blank">celebrities stating facts</a> because they usually don’t know what they are talking about. I completely agree with this. Particularly because I have serious doubts about the <a href="http://simplystatistics.tumblr.com/post/15774146480/in-the-era-of-data-what-is-a-fact" target="_blank">statisteracy</a> of most celebrities. Nod to Alex for the link (our most active link finder!). </li>
</ol>
A Tribute To One Of The Most Popular Methods In
2012-01-16T14:02:05+00:00
http://simplystats.github.io/2012/01/16/a-tribute-to-one-of-the-most-popular-methods-in
<p>[youtube http://www.youtube.com/watch?v=oPzERmPlmw8?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]</p>
<p>A tribute to one of the most popular methods in statistics.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Sunday Data/Statistics Link Roundup
2012-01-15T15:38:00+00:00
http://simplystats.github.io/2012/01/15/sunday-data-statistics-link-roundup
<ol>
<li><a href="http://www.robertniles.com/stats/" target="_blank">Statistics help for journalists</a> (don’t forget to keep <a href="http://www.healthnewsrater.com/" target="_blank">rating stories</a>!) This is the kind of thing that could grow into a statisteracy page. The author also has a really nice <a href="http://www.sensibletalk.com/journals/robertniles/201110/84/" target="_blank">plug for public schools</a>. </li>
<li>An interactive graphic to determine <a href="http://www.nytimes.com/interactive/2012/01/15/business/one-percent-map.html" target="_blank">if you are in the 1%</a> from the New York Times (I’m not…).</li>
<li>Mike Bostock’s <a href="http://mbostock.github.com/d3/talk/20110921/#0" target="_blank">d3.js presentation</a>. This is some really impressive visualization software. You have to change the slide numbers manually but it is totally worth it. Check out <a href="http://mbostock.github.com/d3/talk/20110921/#10" target="_blank">slide 10</a> and <a href="http://mbostock.github.com/d3/talk/20110921/#14" target="_blank">slide 14</a>. This is the future of data visualization. Here is a beginner’s <a href="http://www.drewconway.com/zia/?p=2857" target="_blank">tutorial</a> to d3.js by Mike Dewar.</li>
<li>An online diagnosis prediction start-up (<a href="http://symcat.com/" target="_blank">Symcat</a>) based on data analysis from two Hopkins Med students.</li>
</ol>
<p>Finally, a bit of a bleg. I’m going to try to make this link roundup a regular post. If you have ideas for links I should include, tweet us @simplystats or send them to Jeff’s email. </p>
In the era of data what is a fact?
2012-01-13T14:00:06+00:00
http://simplystats.github.io/2012/01/13/in-the-era-of-data-what-is-a-fact
<p>The Twitter universe is abuzz about <a href="http://publiceditor.blogs.nytimes.com/2012/01/12/should-the-times-be-a-truth-vigilante/?pagewanted=all" target="_blank">this</a> article in the New York Times. Arthur Brisbane, who responds to readers’ comments, asks </p>
<blockquote>
<p><span>I’m looking for reader input on whether and when New York Times news reporters should challenge “facts” that are asserted by newsmakers they write about.</span></p>
</blockquote>
<p><span>He goes on to give a couple of examples of qualitative facts that reporters have used in stories without questioning the veracity of the claims. </span>As many people pointed out in the comments, this is completely absurd. Of course reporters should check facts and report when the facts in their stories, or stated by candidates, are not correct. That is the purpose of news reporting. </p>
<p>But I think the question is a little more subtle when it comes to quantitative facts and statistics. Depending on what subsets of data you look at, what summary statistics you pick, and the way you present information - you can say a lot of different things with the same data. As long as you report what you calculated, you are technically reporting a fact - but it may be deceptive. The classic example is calculating <a href="http://en.wikipedia.org/wiki/Real_estate_pricing" target="_blank">median vs. mean</a> home prices. If Bill Gates is in your neighborhood, no matter what the other houses cost, the mean price is going to be pretty high! </p>
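<p>A minimal sketch of the median vs. mean example, in Python. All prices here are invented for illustration, including the outlier standing in for the one very expensive house.</p>

```python
from statistics import mean, median

# Hypothetical home prices in dollars; invented for illustration only.
neighborhood = [250_000, 275_000, 300_000, 310_000, 325_000]

# Hypothetical outlier standing in for one extremely expensive house.
mansion = 125_000_000

with_outlier = neighborhood + [mansion]

for label, prices in [("without outlier", neighborhood), ("with outlier", with_outlier)]:
    print(f"{label}: mean = {mean(prices):,.0f}, median = {median(prices):,.0f}")
```

<p>Both summaries are "facts" about the same neighborhood, but the single outlier moves the mean by orders of magnitude while the median barely changes.</p>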
<p>Two concrete things can be done to deal with the malleability of facts in the data age.</p>
<p>First, we need to require that our reporters, policy makers, politicians, and decision makers report the context of numbers they state. It is tempting to use statistics as blunt instruments, punctuating claims. Instead, we should demand that people using statistics to make a point embed them in the broader context. For example, in the case of housing prices, if a politician reports the mean home price in a neighborhood, they should be required to state that potential outliers may be driving that number up. How do we make this demand? By not believing any isolated statistics - statistics will only be believed when the source is quoted and the statistic is described. </p>
<p>But this isn’t enough, since the context and statistics will be meaningless without raising overall statisteracy (statistical literacy, not to be confused with <a href="http://en.wikipedia.org/wiki/Numeracy" target="_blank">numeracy</a>). In the U.S., literacy campaigns have been promoted by library systems. Statisteracy is becoming just as critical; the same level of social pressure and assistance should be applied to individuals who don’t know basic statistics as to those who don’t have basic reading skills. Statistical organizations, academic departments, and companies interested in analytics/data science/statistics all have a vested interest in raising the population statisteracy. Maybe a website dedicated to understanding the consequences of basic statistical concepts, rather than the concepts themselves?</p>
<p>And don’t forget to keep <a href="http://simplystatistics.tumblr.com/post/15669033251/healthnewsrater" target="_blank">rating health news stories</a>!</p>
Academics are partly to blame for supporting the closed and expensive access system of publishing
2012-01-13T02:54:25+00:00
http://simplystats.github.io/2012/01/13/academics-are-partly-to-blame-for-supporting-the-closed
<p>Michael Eisen recently published a <a href="http://www.nytimes.com/2012/01/11/opinion/research-bought-then-paid-for.html?_r=1" target="_blank">New York Times op-ed</a> arguing that a bill meant to protect publishers, <span>introduced in the House of Representatives, will result in taxpayers paying twice for scientific research. According to Eisen:</span></p>
<blockquote>
<p><span>If the bill passes, to read the results of federally funded research, most Americans would have to buy access to individual articles at a cost of $15 or $30 apiece. In other words, taxpayers who already paid for the research would have to pay again to read the results.</span></p>
</blockquote>
<p>We agree and encourage our readers to write Congress opposing the “<a href="http://thomas.loc.gov/cgi-bin/query/z?c112:H.R.3699:" target="_blank">Research Works Act</a>”. However, whereas many are vilifying the publishers that are lobbying for this act, I think we academics are the main culprits keeping open access from succeeding.</p>
<p>If this bill makes it into law, I do not think that the main issue will be US taxpayers paying twice for research, but rather that access will be even more restricted to the general scientific community. Interested parties outside the US -and in developing countries in particular- should have unrestricted access to scientific knowledge. Congresswoman Carolyn Maloney <a href="http://sistnek.blogspot.com/2012/01/time-to-terminate-research-works-act.html" target="_blank">gets it wrong</a> by not realizing that giving China (and other countries) access to scientific knowledge is beneficial to science in general and consequently to everyone. However, to maintain the high quality of research publications we currently enjoy, someone needs to pay for competent editors, copy editors, support staff, and computer servers. Open access journals shift the costs from the readers to authors that have plenty of funds (grants, startups, etc.) to cover the charges. By charging the authors, papers can be made available online for free. Free to everyone. Open access. PLoS has demonstrated that the open access model is viable, but a paper in <em>PLoS Biology</em> will run you $2,900 (<a href="http://simplystatistics.tumblr.com/post/12286350206/free-access-publishing-is-awesome-but-expensive-how" target="_blank">see Jeff’s table</a>). Several non-profit societies and for profit publishers, such as <a href="http://www.nature.com/press_releases/open15.html" target="_blank">Nature Publishing Group</a>, offer open access for about <a href="http://newsbreaks.infotoday.com/Digest/Nature-Publishing-Group-Expands-Open-Access-Choices-52372.asp" target="_blank">the same price</a>. </p>
<p>So given all the open access options, why do gated journals survive? I think the main reason is that <strong>we</strong> -the scientific community<strong>- </strong>through appointments and promotions committees, study sections, award committees, etc. use journal prestige to evaluate publication records, disregarding open access as a criterion (see Eisen’s <a href="http://www.michaeleisen.org/blog/?p=694" target="_blank">related post</a> on decoupling publication and assessment). Therefore, those who decide to publish only in open access journals may hinder not only their careers, but also the careers of their students and postdocs. The other reason is that for authors, publishing gated papers is typically cheaper than open access papers, and we don’t always make the more honorable decision. </p>
<p>Another important consideration is that a substantial proportion of publication costs comes from printing paper copies. My department continues to buy print copies of several stat journals as well as some of the general science magazines. The Hopkins library, on behalf of the faculty, buys print versions of hundreds of journals. As long as we continue to create a market for paper copies, the journals will continue to allocate resources to producing them. Somebody has to pay for this, yet with online versions already being produced the print versions are superfluous.</p>
<p>Apart from opposing the Research Works Act as Eisen proposes, there are two more things I intend to do in 2012: 1) lobby my department to stop buying print versions and 2) lobby my study section to give special consideration to open access publications when evaluating a biosketch or a progress report.</p>
Help us rate health news reporting with citizen-science powered http://www.healthnewsrater.com
2012-01-11T13:19:00+00:00
http://simplystats.github.io/2012/01/11/healthnewsrater
<p>We here at Simply Statistics are big fans of science news reporting. We read newspapers, blogs, and the news sections of scientific journals to keep up with the coolest new research. </p>
<p>But health science reporting, although exciting, can also be incredibly frustrating to read. Many articles have sensational titles, like <a href="http://www.dailymail.co.uk/health/article-1149207/How-using-Facebook-raise-risk-cancer.html" target="_blank">“How using Facebook could raise your risk of cancer”</a>. The articles go on to describe some research and interview a few scientists, then typically make fairly large claims about what the research means. This isn’t surprising - eye catching headlines are important in this era of short attention spans and information overload. </p>
<p>If just a few extra pieces of information were reported in news stories about science, it would be much easier to evaluate whether the cancer risk was serious enough to shut down our Facebook accounts. In particular we thought any news story should report:</p>
<ol>
<li><strong>A link back to the original research article</strong> where the study (or studies) being described was published. Not just a link to another news story. </li>
<li><strong>A description of the study design</strong> (was it a randomized clinical trial? a cohort study? 3 mice in a lab experiment?)</li>
<li><strong>Who funded the study</strong> - if a study involving cancer risk was sponsored by a tobacco company, that might say something about the results.</li>
<li><strong>Potential financial incentives of the authors</strong> - if the study is reporting a new drug and the authors work for a drug company, that might say something about the study too. </li>
<li><strong>The sample size</strong> - many health studies are based on a very small sample size, only 10 or 20 people in a lab. Results from these studies are much weaker than results obtained from a large study of thousands of people. </li>
<li><strong>The organism</strong> - Many health science news reports are based on studies performed in lab animals and may not translate to human health. For example, here is a report with the headline <a href="http://www.msnbc.msn.com/id/44779621/ns/health-alzheimers_disease/t/alzheimers-may-be-transmissible-study-suggests/" target="_blank">“Alzheimers may be transmissible, study suggests”</a>. But if you read the story, scientists injected Alzheimer’s afflicted brain tissue from humans into mice. </li>
</ol>
<p>So we created a citizen-science website for evaluating health news reporting called <a href="http://healthnewsrater.com" target="_blank">HealthNewsRater</a>. It was built by <a href="http://www.biostat.jhsph.edu/~ajaffe/" target="_blank">Andrew Jaffe</a> and <a href="http://www.biostat.jhsph.edu/~jleek/research.html" target="_blank">Jeff Leek</a>, with Andrew doing the bulk of the heavy lifting. We would like you to help us collect data on the quality of health news reporting. When you read a health news story on the Nature website, at nytimes.com, or on a blog, we’d like you to take a second to report on the news. Just determine whether the 6 pieces of information above are reported and input the data at <a href="http://healthnewsrater.com" target="_blank">HealthNewsRater</a>.</p>
<p>We calculate a score for each story based on the formula:</p>
<p><strong>HNR-Score = (5 points for a link to the original article + 1 point each for the other criteria)/2</strong></p>
<p>The score weights the link to the original article very heavily, since this is the best source of information about the actual science underlying the story. </p>
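<p>The formula above can be sketched in a few lines of Python. This is an illustration only, not the code behind HealthNewsRater: the class and field names are hypothetical, chosen to mirror the six criteria listed earlier.</p>

```python
from dataclasses import dataclass


@dataclass
class StoryRating:
    """Whether a news story reports each of the six criteria (names are hypothetical)."""
    links_original_article: bool
    describes_study_design: bool
    names_funder: bool
    discloses_financial_incentives: bool
    reports_sample_size: bool
    identifies_organism: bool


def hnr_score(rating: StoryRating) -> float:
    """(5 points for a link to the original article + 1 point per other criterion) / 2."""
    link_points = 5 if rating.links_original_article else 0
    other_points = sum(
        [
            rating.describes_study_design,
            rating.names_funder,
            rating.discloses_financial_incentives,
            rating.reports_sample_size,
            rating.identifies_organism,
        ]
    )
    return (link_points + other_points) / 2


# A story that links to the paper and reports the sample size, nothing else.
example = StoryRating(True, False, False, False, True, False)
print(hnr_score(example))
```

<p>Under this reading of the formula, scores range from 0 to 5, and a link to the original article alone is worth as much as the other five criteria combined.</p>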
<p>In a future post we will analyze the data we have collected, make it publicly available, and let you know which news sources are doing the best job of reporting health science. </p>
<p><strong>Update:</strong> If you are a web-developer with an interest in health news <a href="mailto:healthnewsrater@gmail.com" target="_blank">contact us</a> to help make HealthNewsRater better! </p>
Statistical Crime Fighter
2012-01-10T19:23:00+00:00
http://simplystats.github.io/2012/01/10/statistical-crime-fighter
<p><a href="http://www-stat.wharton.upenn.edu/~berkr/" target="_blank">Dick Berk</a> is using his statistical superpowers to fight crime. <a href="http://www.theatlantic.com/magazine/archive/2012/01/misfortune-teller/8846/" target="_blank">Seriously</a>. Here is my favorite paragraph.</p>
<blockquote>
<p><span>Drawing from criminal databases dating to the 1960s, Berk initially modeled the Philadelphia algorithm on more than 100,000 old cases, relying on three dozen predictors, including the perpetrator’s age, gender, neighborhood, and number of prior crimes. To develop an algorithm that forecasts a particular outcome—someone committing murder, for example—Berk applied a subset of the data to “train” the computer on which qualities are associated with that outcome. “If I could use sun spots or shoe size or the size of the wristband on their wrist, I would,” Berk said. “If I give the algorithm enough predictors to get it started, it finds things that you wouldn’t anticipate.” Philadelphia’s parole officers were surprised to learn, for example, that the crime for which an offender was sentenced—whether it was murder or simple drug possession—does not predict whether he or she will commit a violent crime in the future. Far more predictive is the age at which he (yes, gender matters) committed his first crime, and the amount of time between other offenses and the latest one—the earlier the first crime and the more recent the last, the greater the chance for another offense.</span></p>
</blockquote>
<p><span>Hat tip to Alex Nones.</span></p>
Do you own or rent?
2012-01-10T12:00:05+00:00
http://simplystats.github.io/2012/01/10/do-you-own-or-rent
<p>When it comes to computing, history has gone back and forth between what I would call the “owner model” and the “renter model”. The question is what’s the best approach and how do you determine that?</p>
<p>Back in the day when people like John von Neumann were busy inventing the computer to work out H-bomb calculations, there was more or less a renter model in place. Computers were obviously quite expensive and so not everyone could have one. If you wanted to do your calculation, you’d walk down to the computer room, give them your punch cards with your program written out, and they’d run it for you. Sometime later you’d get a printout with the results of your program. </p>
<p>A little later, with time-sharing machines, you could have dumb terminals log in to a central server and run your calculations that way. I guess that saved you the walk to the computer room (and all the punch cards). I still remember some of these green-screen dumb terminals from my grad school days (yes, UCLA still had these monstrosities in 1999). </p>
<p>With personal computers in the 80s, you could own your own computer, so there was no need to depend on some central computer (and a connection to it) to do the work for you. As computing components got cheaper, these personal computers got more and more powerful and rivaled the servers of yore. It was difficult for me to imagine ever needing things like mainframes again except for some esoteric applications. Especially with the development of Linux, you could have all the power of a Unix mainframe on your desk or lap (or now your palm). </p>
<p>But here we are, with <a href="http://simplystatistics.tumblr.com/post/15565843517/a-statistician-and-apple-fanboy-buys-a-chromebook-and" target="_blank">Jeff buying a Chromebook</a>. Have we just taken a step back in time? Are cloud computing and the renter model the way to go? I have to say that I was a big fan of “cloud computing” back in the day. But once Linux came around, I really didn’t think there was a need for the thin client/fat server model.</p>
<p>But it seems we are going back that way, and the reason seems to be mobile devices. Mobile devices are now just small computers, so many people own at least two computers (a “real” computer and a phone). With multiple computers, it’s a pain to have to synchronize both the data and the applications on them. If they’re made by different manufacturers then you can’t even have the same operating system/applications on the devices. Also, no one cares about the operating system anymore, so why should it have to be managed? The cloud helps solve some of these problems, as does owning devices from the same company (as I do, Apple fanboy that I am).</p>
<p>I think the all-renter model of the Chromebook is attractive, but I don’t think it’s ready for prime time just yet. Two reasons I can think of are (1) Microsoft Office and (2) slow network connections. If you want to make Jeff very unhappy, you can either (1) send him a Word document that needs to be edited in Track Changes; or (2) invite him to an international conference on some remote island. The need for a strong network connection is problematic because I’ve yet to encounter a hotel that had a fast enough connection for me to work remotely on our computing cluster. For that reason I’m sticking with my current laptop.</p>
A statistician and Apple fanboy buys a Chromebook...and loves it!
2012-01-09T14:00:06+00:00
http://simplystats.github.io/2012/01/09/a-statistician-and-apple-fanboy-buys-a-chromebook-and
<p>I don’t mean to brag, but I was an early Apple Fanboy - not sure that is something to brag about now that I write it down. I convinced my advisor to go to all Macs in our lab in 2004. Since then I have been pretty dedicated to the brand, dutifully shelling out almost 2g’s every time I need a new laptop. I love the way Macs just work (until they don’t and you need a new laptop).</p>
<p>But I hate the way Apple seems to be dedicated to bleeding <a href="http://simplystatistics.tumblr.com/post/13412260027/apple-this-is-ridiculous-you-gotta-upgrade-to" target="_blank">every last cent</a> out of me. So I saved up my Christmas gift money (thanks Grandmas!) and bought a <a href="https://www.google.com/intl/en/chromebook/" target="_blank">Chromebook</a>. It cost me $350 and I was at least in part inspired by <a href="http://www.youtube.com/watch?v=DazdIFMbC_4" target="_blank">these</a> <a href="http://www.youtube.com/watch?v=m0ISVHdzJsQ" target="_blank">clever</a> <a href="http://www.youtube.com/watch?v=EaI9hORJS4M" target="_blank">ads</a>.</p>
<p>So far I’m super pumped about the performance of the Chromebook. Things I love:</p>
<ol>
<li>About 10 seconds to boot from shutdown, instantly awake from sleep</li>
<li>Super long battery life - 8 hours a charge might be an underestimate</li>
<li>Size - it’s a 12-inch laptop and just right for sitting on my lap and typing</li>
<li>Since everything is cloud based, nothing to install/optimize</li>
</ol>
<p>It took me a while to get used to the Browser being the operating system. When I close the last browser window, I expect to see the Desktop. Instead, a new browser window pops up. But that discomfort only lasted a short time.</p>
<p>It turns out I can do pretty much everything I do on my Macbook on the Chromebook. I can access our department’s computing cluster by turning on developer mode and <a href="https://groups.google.com/forum/#!topic/chromebook-central/dZDs1GFdlzY" target="_blank">opening a shell</a> (thanks <a href="http://www.bcaffo.com/" target="_blank">Caffo</a>!). I can do all my word processing on Google Docs. Email is just Gmail as usual. <a href="http://www.scribtex.com/" target="_blank">Scribtex</a> for LaTeX (<a href="http://grants.nih.gov/grants/policy/pecase2010/BrianCaffo.jpg" target="_blank">Caffo</a> again). <a href="https://music.google.com/" target="_blank">Google Music</a> is so awesome I wish I had started my account before I got my Chromebook. The only thing I’m really trying to settle on is a cloud-based code editor with syntax highlighting. I’m open to suggestions (Caffo?).</p>
<p>I’m starting to think I could bail on Apple….</p>
Sunday Data/Statistics Link Roundup
2012-01-08T19:35:42+00:00
http://simplystats.github.io/2012/01/08/sunday-data-statistics-link-roundup-2
<p>A few data/statistics related links of interest:</p>
<ol>
<li><a href="http://www.nytimes.com/2012/01/03/science/broad-institute-director-finds-power-in-numbers.html?ref=science" target="_blank">Eric Lander Profile</a></li>
<li><a href="http://www.wired.com/wiredscience/2012/01/the-mathematics-of-lego/" target="_blank">The math of lego</a> (should be “The statistics of lego”)</li>
<li><a href="http://flowingdata.com/2012/01/06/where-people-are-looking-for-homes/" target="_blank">Where people are looking for homes.</a></li>
<li><a href="http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html" target="_blank">Hans Rosling’s Ted Talk on the Developing world</a> (an oldie but a goodie)</li>
<li><a href="http://www.michaeleisen.org/blog/?p=807" target="_blank">Elsevier is trying to make open-access illegal</a> (not strictly statistics related, but a hugely important issue for academics who believe government funded research should be freely accessible), more <a href="http://www.michaeleisen.org/blog/?p=837" target="_blank">here</a>. </li>
</ol>
Where do you get your data?
2012-01-08T16:30:18+00:00
http://simplystats.github.io/2012/01/08/where-do-you-get-your-data
<p>Here’s a question I get fairly frequently from various types of people: Where do you get your data? This is sometimes followed up quickly with “Can we use some of your data?”</p>
<p>My contention is that if someone asks you these questions, start looking for the exits.</p>
<!-- more -->
<p>There are of course legitimate reasons why someone might ask you this question. For example, they might be interested in the source of the data to verify its quality. But too often, they are interested in getting the data because they believe it would be a good fit to a method that they have recently developed. Even if that is in fact true, there are some problems.</p>
<p>Before I go on, I need to clarify that I don’t have a problem with data sharing per se, but I usually get nervous when a person’s <em>opening line</em> is “Where do you get your data?” This question presumes a number of things that are usually signs of a bad collaborator:</p>
<ul>
<li><strong>The data are just numbers</strong>. My method works on numbers, and these data are numbers, so my method should work here. If it doesn’t work, then I’ll find some other numbers where it does work.</li>
<li><strong>The data are all that are important</strong>. I’m not that interested in working with an actual scientist on an important problem that people care about, because that would be an awful lot of work and time (see <a href="http://simplystatistics.tumblr.com/post/11695813030/finding-good-collaborators" target="_blank">here</a>). I just care about getting the data from whoever will give it to me. I don’t care about the substantive context.</li>
<li><strong>Once I have the data, I’m good, thank you</strong>. In other words, the scientific process is modular. Scientists generate the data and once I have it I’ll apply my method until I get something that I think makes sense. There’s no need for us to communicate. That is unless I need you to help make the data pretty and nice for me.</li>
</ul>
<p>The real question that I think people should be asking is “Where do you find such great scientific collaborators?” Because it’s those great collaborators that generated the data and worked hand-in-hand with you to get intelligible results.</p>
<p>Niels Keiding wrote a <a href="http://biostatistics.oxfordjournals.org/content/11/3/376.long" target="_blank">provocative commentary</a> about the tendency for statisticians to ignore the substantive context of data and to use illustrative/toy examples over and over again. He argued that because of this tendency, we should not be so excited about reproducible research, because as more data become available, we will see more examples of people ignoring the science.</p>
<p>I disagree that this is an argument against reproducible research, but I agree that statisticians (and others) do have a tendency to overuse datasets simply because they are “out there” (stackloss data, anyone?). However, it’s probably impossible to stop people from conducting poor science in any field, and we shouldn’t use the possibility that this might happen in statistics to prevent research from being more reproducible in general. </p>
<p>But I digress…. My main point is that people who simply ask for “the data” are probably not interested in digging down and understanding the really interesting questions. </p>
Building the Team That Built Watson
2012-01-08T02:06:27+00:00
http://simplystats.github.io/2012/01/08/building-the-team-that-built-watson
<p><a href="http://www.nytimes.com/2012/01/08/jobs/building-the-watson-team-of-scientists.html">Building the Team That Built Watson</a></p>
Make us a part of your day - add Simply Statistics to your RSS feed
2012-01-08T00:56:34+00:00
http://simplystats.github.io/2012/01/08/make-us-a-part-of-your-day-add-simply-statistics-to
<p>You can add us to your RSS feed through <a href="http://feeds.feedburner.com/SimplyStatistics" target="_blank">feedburner</a>.</p>
P-values and hypothesis testing get a bad rap - but we sometimes find them useful.
2012-01-06T16:54:52+00:00
http://simplystats.github.io/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we
<p><em>This post written by Jeff Leek and Rafa Irizarry.</em></p>
<p>The <a href="http://en.wikipedia.org/wiki/P-value" target="_blank">p-value</a> is the most widely-known statistic. P-values are reported in a large majority of scientific publications that measure and report data. <a href="http://en.wikipedia.org/wiki/Ronald_Fisher" target="_blank">R.A. Fisher</a> is widely credited with inventing the p-value. If he was cited every time a p-value was reported his paper would have, at the very least, 3 <strong>million</strong> citations* - making it the <a href="http://www.jbc.org/content/280/28/e25.full#" target="_blank">most highly cited paper</a> of all time. </p>
<!-- more -->
<p>However, the p-value has a large number of very vocal critics. The criticisms of p-values, and hypothesis testing more generally, range from philosophical to practical. There are even <a href="http://warnercnr.colostate.edu/~anderson/thompson1.html" target="_blank">entire websites</a> dedicated to “debunking” p-values! One issue many statisticians raise with p-values is that they are easily misinterpreted, another is that p-values are not calibrated by sample size, another is that they ignore existing information or knowledge about the parameter in question, and yet another is that very significant (small) p-values may result even when the value of the parameter of interest is scientifically uninteresting.</p>
<p>We agree with all these criticisms. Yet, in practice, we find p-values useful and, if used correctly, a powerful tool for the advancement of science. The fact that many misinterpret the p-value is not the p-value’s fault. If the statement “under the null the chance of observing something this convincing is 0.65” is correct, then why not use it? Why not explain to our collaborator that the observation they thought was so convincing can easily happen by chance in a setting that is uninteresting? In cases where p-values are <em>small enough</em>, the substantive experts can help decide if the parameter of interest is scientifically interesting. In general, we find p-values to be superior to our collaborators’ intuition of which patterns are statistically interesting and which ones are not.</p>
<p>We also find p-values provide a simple way to construct decision algorithms. For example, a government agency can define general rules based on p-values that are applied equally to products needing a specific seal of approval. If the rule proves to be too lenient or restrictive, we change the p-value cut-off appropriately. In this situation we view the p-value as part of a practical protocol, not a tool for statistical inference.</p>
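<p>To make the cut-off idea concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption for illustration: the "products" are simulated samples with no real effect, the test is a simple two-sided z-test with known standard deviation, and the function names are made up. It estimates how often a no-effect product would pass the rule at two different cut-offs.</p>

```python
import math
import random


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0 when the standard deviation is known."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n) / sigma
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def approval_rate_under_null(cutoff, n_products=20000, n_obs=10, seed=1):
    """Fraction of simulated no-effect products that pass the rule p < cutoff."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_products):
        # Hypothetical measurements for one product; the true effect is zero.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        if z_test_p_value(sample) < cutoff:
            passed += 1
    return passed / n_products


# Compare a lenient and a stricter rule.
for cutoff in (0.05, 0.01):
    print(cutoff, approval_rate_under_null(cutoff))
```

<p>In this sketch the fraction of no-effect products that pass should land close to the cut-off itself, so tightening or loosening the rule has a predictable effect.</p>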
<p>Moreover the p-value has the following useful properties for applied statisticians:</p>
<ol>
<li><strong>p-values are easy to calculate, even for complicated statistics</strong>. Many statistics do not lend themselves to easy analytic calculation; but using permutation and bootstrap procedures p-values can be calculated even for very complicated statistics. </li>
<li><strong>p-values are relatively easy to understand.</strong> The statistical interpretation of the p-value remains roughly the same no matter how complicated the underlying statistic, and they are also bounded between 0 and 1. This also means that p-values are easy to <em>mis</em>-interpret - they are not posterior probabilities. But this is a difficulty with education, not a difficulty with the statistic itself. </li>
<li><strong>p-values have simple, universal properties.</strong> Correct p-values are uniformly distributed under the null, regardless of how complicated the underlying statistic. </li>
<li><strong>p-values are calibrated to error rates scientists care about.</strong> Regardless of the underlying statistic, calling all p-values less than 0.05 significant leads, on average, to about 5% false positives even if the null hypothesis is always true. If this property is ignored things like publication bias can result, but again this is a problem with education and the scientific process, not with p-values. </li>
<li><strong>p-values are useful for multiple testing correction.</strong> The advent of new measurement technology has shifted much of science from hypothesis-driven to discovery-driven, making the existing multiple testing machinery useful. Using the simple, universal properties of p-values it is possible to easily calculate estimates of quantities like the false discovery rate - the rate at which discovered associations are false.</li>
<li><strong>p-values are reproducible.</strong> All statistics are reproducible with enough information. Given the simplicity of calculating p-values, it is relatively easy to communicate sufficient information to reproduce them. </li>
</ol>
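<p>Two of the properties above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors’ code: the data are made up, the statistic is a plain difference in means, and the Benjamini-Hochberg step-up procedure is used as one standard way to control the false discovery rate (the post does not name a specific method).</p>

```python
import random
from statistics import mean


def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_x]) - mean(pooled[n_x:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of tests rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every test ranked at or below that cutoff.
    return sorted(order[:cutoff_rank])


# Hypothetical measurements for two groups.
treated = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5]
control = [4.2, 4.0, 4.8, 4.5, 3.9, 4.4]
print(permutation_p_value(treated, control))

# Hypothetical p-values from ten separate tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values))
```

<p>The same permutation recipe works for any statistic you can compute on relabeled data, which is the sense in which p-values stay easy to calculate even when the statistic is complicated.</p>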
<p>We agree there are flaws with p-values, just like there are with any statistic one might choose to calculate. In particular, we do think that confidence intervals should be reported with p-values when possible. But we believe that any other decision-making statistic would lead to other problems. One thing we are sure about is that p-values beat scientists’ intuition about chance any day. So before bashing p-values too much we should be careful because, like democracy to government, p-values may be the worst form of statistical significance calculation except all those other forms that have been tried from time to time. </p>
<p>————————————————————————————————————</p>
<p><em>* Calculated using Google Scholar using the formula:</em></p>
<p><em>Number of P-value Citations = # of papers with exact phrase “P < 0.05” + (# of papers with exact phrase “P < 0.01” and not exact phrase “P < 0.05”) + (# of papers with exact phrase “P < 0.001” and not exact phrase “P < 0.05” or “P < 0.01”) </em></p>
<p><em>= 1,320,000 + 1,030,000 + 662,500</em></p>
<p><em>This is obviously an extremely conservative estimate. </em></p>
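<p>The counting logic in the footnote can be checked with a small Python sketch. The paper IDs below are hypothetical stand-ins for search hits, and the sketch assumes the third term is meant to exclude papers already counted by the first two terms. Only the three reported counts come from the footnote itself.</p>

```python
# Hypothetical paper IDs standing in for Google Scholar hits on each phrase.
hits_05 = {1, 2, 3, 4}      # papers containing "P < 0.05"
hits_01 = {3, 4, 5, 6}      # papers containing "P < 0.01"
hits_001 = {4, 6, 7}        # papers containing "P < 0.001"

# Each term keeps only the papers not already counted by an earlier term.
term_1 = len(hits_05)
term_2 = len(hits_01 - hits_05)
term_3 = len(hits_001 - hits_05 - hits_01)
disjoint_total = term_1 + term_2 + term_3

# Together the three terms count every paper exactly once.
assert disjoint_total == len(hits_05 | hits_01 | hits_001)

# The three counts reported in the footnote.
reported_terms = [1_320_000, 1_030_000, 662_500]
reported_total = sum(reported_terms)
print(reported_total)
```

<p>The reported terms add up to just over 3 million, which is where the figure in the post comes from.</p>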
Why all #academics should have professional @twitter accounts
2012-01-05T16:24:14+00:00
http://simplystats.github.io/2012/01/05/why-all-academics-should-have-professional-twitter
<p>I started my professional Twitter account <a href="http://twitter.com/#!/leekgroup" target="_blank">@leekgroup</a> about a year and a half ago at the suggestion of a colleague of mine, John Storey (<a href="https://twitter.com/#!/storeylab" target="_blank">@storeylab</a>). I started using the account to post updates on papers/software my group was publishing. Basically, everything I used to report on my webpage as “News”. </p>
<p>I started to give talks where the title slide included my Twitter name, rather than my webpage. It frequently drew the biggest laugh in the talk, and I would get comments like, “Do you really think people care what you are thinking every moment of every day?” That is what some people use Twitter for, and no, I’m not really interested in making those kinds of updates. </p>
<p>So I started describing why I think Twitter is useful for academics at the beginning of talks:</p>
<ol>
<li>You can integrate it directly into your website (<a href="http://biostat.jhsph.edu/~jleek/research.html" target="_blank">like so</a>), using Twitter widgets. If you have a Twitter account you just go <a href="http://twitter.com/about/resources/widgets" target="_blank">here</a>, get the widget for your website, and add the code to your homepage. Now you don’t have to edit HTML to make news updates, you just log in to Twitter and type the update in the box.</li>
<li>You can quickly gain a much broader audience for your software/papers. In the past, I had to rely on people actually coming to my website to find my papers or seeing them in journals. Now, when I announce a paper, my followers see it and if they like it, they pass it on to their followers, etc. I have noticed that my papers are being downloaded more and by a broader audience since I joined. </li>
<li>I can keep up on what other people are doing. <a href="http://simplystatistics.tumblr.com/post/12560072373/statisticians-on-twitter-help-me-find-more" target="_blank">Many statisticians</a> have Twitter accounts that they use professionally. I follow many of them and when they publish new papers, I see them pop up, rather than having to go to all their websites. It’s like an RSS feed of papers from people I want to follow. </li>
<li>You can connect with people outside academia. Particularly in my area, I’d like the statistical tools I’m developing to be used by folks in industry who work on genomics. It’s hard to get the word out about my methods through traditional channels, but a lot of those folks are on Twitter. </li>
</ol>
<p>The best part is, there is an amplification effect to this medium. So as more and more academics join and follow each other, it is easier and easier for us all to keep up with what is happening in the field. If you are intimidated by using any social media, you can get started with some really easy how-to’s like <a href="http://www.wikihow.com/Use-Twitter" target="_blank">this one</a>.</p>
<p>Alright, enough advertising for Twitter, I’m going back to work. </p>
Will Amazon Offer Analytics as a Service?
2012-01-05T15:50:16+00:00
http://simplystats.github.io/2012/01/05/will-amazon-offer-analytics-as-a-service
<p><a href="http://bits.blogs.nytimes.com/2012/01/04/will-amazon-offer-analytics-as-a-service/">Will Amazon Offer Analytics as a Service?</a></p>
Baltimore gun offenders and where academics don't live
2012-01-03T15:02:40+00:00
http://simplystats.github.io/2012/01/03/baltimore-gun-offenders-and-where-academics-dont-live
<p>Jeff recently posted <a href="http://simplystatistics.tumblr.com/post/15182715327/list-of-cities-states-with-open-data-help-me-find" target="_blank">links to data from cities and states</a>. He and I wrote <a href="http://rafalab.jhsph.edu/simplystats/gunviolations.R" target="_blank">R code</a> that plots gun offender locations for Baltimore. Specifically we plot the locations that appear on <a href="http://data.baltimorecity.gov/Crime/Gun-Offenders/aivj-4x23" target="_blank">this table</a>. I added locations of the Baltimore neighborhoods where most of our Hopkins colleagues live as well as the location of the medical institutions where we work. Note the corridor with no points between the West side (<a href="http://en.wikipedia.org/wiki/Barksdale_Organization" target="_blank">Barksdale</a> territory) and East side (<a href="http://rafalab.jhsph.edu/simplystats/propjoekima.jpg" target="_blank">Prop Joe</a> territory). Not surprisingly, academics don’t live near the gun offenders. </p>
<p><a href="http://rafalab.jhsph.edu/simplystats/baltimoreGunViolations.pdf" target="_blank"><img height="300" src="http://rafalab.jhsph.edu/simplystats/baltimoreGunViolations.png" width="300" /></a></p>
List of cities/states with open data - help me find more!
2012-01-02T14:30:00+00:00
http://simplystats.github.io/2012/01/02/list-of-cities-states-with-open-data-help-me-find
<p>It’s the beginning of 2012 and statistics/data science has never been hotter. Some of the most important data is data collected about civic organizations. If you haven’t seen Bill Gates’s <a href="http://www.ted.com/talks/bill_gates_how_state_budgets_are_breaking_us_schools.html" target="_blank">TED Talk</a> about the importance of state budgets, you should watch it now. A major key to solving a lot of our economic problems lies in understanding and using data collected about cities and states. </p>
<p>U.S. cities and states are jumping on this idea and our own Baltimore was one of the <a href="http://www.americanprogress.org/issues/2007/04/citistat.html" target="_blank">earliest adopters</a>. I thought I’d make a list of all the cities that have made an effort to make civic data public. Here are a few I’ve found:</p>
<ul>
<li><a href="http://data.baltimorecity.gov/" target="_blank">Baltimore</a></li>
<li><a href="http://nycopendata.socrata.com/" target="_blank">New York City</a></li>
<li><a href="http://datasf.org/" target="_blank">San Francisco</a></li>
<li><a href="http://data.seattle.gov/" target="_blank">Seattle</a></li>
<li><a href="http://www.civicapps.org/" target="_blank">Portland</a></li>
<li><a href="http://www.cityofboston.gov/doit/databoston/app/data.aspx" target="_blank">Boston</a></li>
<li><a href="http://opensandiego.org/" target="_blank">San Diego</a> (not sure if this is official)</li>
<li><a href="http://data.cityofchicago.org/" target="_blank">Chicago</a></li>
<li><a href="http://data.austintexas.gov/" target="_blank">Austin</a></li>
<li><a href="http://data.dc.gov/" target="_blank">Washington D.C.</a></li>
<li><a href="http://opendataphilly.org/" target="_blank">Philadelphia</a> </li>
<li><a href="http://data.nola.gov/" target="_blank">New Orleans</a></li>
</ul>
<p>There are also open data sites for many states:</p>
<ul>
<li><a href="http://www.data.ca.gov/about" target="_blank">California</a></li>
<li><a href="http://data.wa.gov/" target="_blank">Washington</a></li>
<li><a href="http://data.oregon.gov/" target="_blank">Oregon</a></li>
<li><a href="http://data.illinois.gov/" target="_blank">Illinois</a></li>
<li><a href="http://www.utah.gov/data/" target="_blank">Utah</a></li>
<li><a href="http://maineopengov.org/" target="_blank">Maine</a></li>
</ul>
<p>Civic organizations are realizing that opening their data through <a href="http://simplystatistics.tumblr.com/post/11237403492/apis" target="_blank">APIs</a> or by hosting <a href="http://kaggle.com/" target="_blank">competitions</a> can lead to greater transparency, good advertising, and new and useful applications. If I had one data-related wish for 2012, it would be that the critical mass of data/statistics knowledge being developed could be used with these data to help solve some of our most pressing problems. </p>
<p><strong>Update:</strong> Oh Canada! In the comments <a href="http://twitter.com/#!/aruhil" target="_blank">Ani Ruhil</a> points to some Canadian cities/provinces with open data pages. </p>
<ul>
<li><a href="http://data.vancouver.ca/" target="_blank">Vancouver</a></li>
<li><a href="http://www1.toronto.ca/wps/portal/open_data/open_data_home?vgnextoid=b3886aa8cc819210VgnVCM10000067d60f89RCRD" target="_blank">Toronto</a></li>
<li><a href="http://www.citywindsor.ca/003713.asp" target="_blank">Windsor</a></li>
<li><a href="http://www.ottawa.ca/online_services/opendata/index_en.html" target="_blank">Ottawa</a></li>
<li><a href="http://data.edmonton.ca/" target="_blank">Edmonton</a></li>
<li><a href="http://www.data.gov.bc.ca/" target="_blank">British Columbia</a></li>
</ul>
Grad students in (bio)statistics - do a postdoc!
2011-12-28T19:25:00+00:00
http://simplystats.github.io/2011/12/28/grad-students-in-bio-statistics-do-a-postdoc
<p>Up until about 20 years ago, postdocs were scarce in Statistics. In contrast, during the same time period, it was rare for a Biology PhD to go straight into a tenure track position.</p>
<p>Driven mostly by the availability of research funding for those working in applied areas, postdocs are becoming much more common in our field and I think this is great. It is great for PhD students to expand their horizons during two years in which they don’t have to worry about teaching, committee meetings, or grant writing. It is also great for those of us fortunate enough to work with well-trained, independent, energetic, bright, and motivated fresh PhDs. Many of our best graduates are electing to postpone their entry into tenure track jobs in favor of postdocs. Also students from other fields, computer science and engineering in particular, are taking postdocs with statisticians. I think these are both good trends. If they continue, the result will be that, as a field, we will become more well-rounded and productive. </p>
<p>This trend has been particularly beneficial for me. Most of the postdocs I have hired have come to me with a CV worthy of a tenure track job. They have been independent and worked more as collaborators than advisees. So why pass on more $ and prestige? A PhD in Statistics/Computer Science/Engineering can be on a very specific topic and students may not gain any collaborative experience whatsoever. A postdoc at Hopkins Biostat provides a new experience in a highly collaborative environment, with access to world leaders in the biomedical sciences, and where we focus on development of applied tools. The experience can also improve a student’s visibility and job prospects, while delaying the tenure clock until they have more publications under their belts.</p>
<p>An important thing you should be aware of is that in many departments you can negotiate the start of a tenure track position. So seriously consider taking 1-2 years of almost 100% research time before commencing the grind of a tenure track job. </p>
<p>I’m not the only one who thinks postdocs are a good thing for our field and for biostatistics students. The column below was written by Terry Speed in November 2003 and is reprinted with permission from the IMS Bulletin, <a href="http://bulletin.imstat.org" target="_blank">http://bulletin.imstat.org</a>.</p>
<p class="p1">
<strong>In Praise of Postdocs</strong>
</p>
<p class="p2">
<span class="s1">I don’t know what proportion </span>of IMS members have PhDs (or an equivalent) in probability or statistics, but I’d guess it’s fairly high. I don’t know what proportion of those that do have PhDs would also have formal post-doctoral research experience, but here I’d guess it’s rather low.
</p>
<p class="p3">
Why? One possible reason is that for much of the last 40 years, anyone completing a PhD in prob or stat and wanting a research career could go straight into one. Prospective employers of people with PhDs in our field—be they universities, research institutes, national labs or companies—don’t require their novices to have completed a postdoc, and most graduating PhDs are only too happy to go straight into their first job.
</p>
<p class="p3">
This is in sharp contrast with the biological and physical sciences, where it is rare to appoint someone to a tenure-track faculty or research scientist position without their having completed one or more postdocs.
</p>
<p class="p3">
The number of people doing postdocs in probability or statistics has been growing over the last 15 years. This is in part due to the arrival on the scene of institutes such as the MSRI, IMA, IPAM, NISS, NCAR, and recently the MBI and SAMSI in the US, the Newton Institute in the UK, the Fields Institute in Canada, the Institut Henri Poincaré in France, and others elsewhere around the world. In such institutes short-term postdoc positions go with their current research programs, and there are usually a smaller number continuing for longer periods.
</p>
<p class="p3">
It is also the case that an increasing number of senior researchers are being awarded research funds to support postdocs in prob or stat, often in the newer, applied areas such as computational biology.
</p>
<p class="p3">
And finally, it has long been the case that many countries (Germany, Sweden, Switzerland, and the US, to name a few) have national grants supporting postdoctoral research in their own or, even better, another country. I think all of this is great, and would like to see this trend continue and strengthen.
</p>
<p class="p3">
Why do I think postdocs are a good thing? And why do I think young probabilists and statisticians should do one, even when they can get a good job without having done so?
</p>
<p class="p3">
For most of us, doing a PhD means getting totally absorbed in some relatively narrow research area for 2–3 years, treating that as the most important part of science for that time, and trying to produce some of the best work in that area. This is fine, and we get a PhD for our efforts, but is it good training for a lifelong research career? While it is obviously good preparation for doing more of the same, I don’t think it is adequate for research in general. I regard the successful completion of a PhD as (at least) evidence that the person in question can do research, but it doesn’t follow that they can go on and successfully do research in a new area, or in a different environment, or without close supervision.
</p>
<p class="p3">
Postdocs give you the chance to broaden, to learn new technical skills, to become acquainted with new areas, and to absorb the culture of a new institution, all at a time when your professional responsibilities are far fewer than they would have been had you taken that first “real” job. The postdoc period can be a wonderful time in your scientific life, one which sees you blossom, building on the confidence you gained by having completed your PhD, in what is still essentially a learning environment, but one where you can follow your own interests, explore new areas, and still make mistakes. At the worst, you have delayed your entry into the workforce two or three years, and you can still keep on working in your PhD area if you wish. The number of openings for researchers in prob or stat doesn’t fluctuate so much on this time scale, so you are unlikely to be worse off than the earnings foregone. At best, you will move into a completely new area of research, one much better suited to your personal interests and skills, perhaps also better suited to market demand, but either way, one chosen with your PhD experience behind you. This can greatly enhance your long-term career prospects and more than compensate for your delayed entry into the workforce.
</p>
<p class="p3">
<em>Students: </em>the time to think about this is <span class="s2">now [November]</span>, not just as you are about to file your dissertation. And the choice is not necessarily one between immediate security and career development: you might be able to have both. You shouldn’t shy from applying for tenure-track jobs and postdocs at the same time, and if offered the job you want, requesting (say) two years’ leave of absence to do the postdoc you want. Employers who care about your career development are unlikely to react badly to such a request.
</p>
An R function to map your Twitter Followers
2011-12-21T17:11:00+00:00
http://simplystats.github.io/2011/12/21/an-r-function-to-map-your-twitter-followers
<p>I wrote a little function to make a personalized map of who follows you or who you follow on Twitter. The idea for this function was inspired by some plots I discussed in a <a href="http://simplystatistics.tumblr.com/post/11614784508/spectacular-plots-made-entirely-in-r" target="_blank">previous post</a>. I also found a lot of really useful code over at FlowingData <a href="http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/" target="_blank">here</a>. </p>
<p>The function uses the packages twitteR, maps, geosphere, and RColorBrewer. If you don’t have the packages installed, when you source the twitterMap code, it will try to install them for you. The code also requires you to have a working internet connection. </p>
<p><em>One word of warning is that if you have a large number of followers or people you follow, you may be rate limited by Twitter and unable to make the plot.</em></p>
<p>To make your personalized twitter map, first source the <a href="http://biostat.jhsph.edu/~jleek/code/twitterMap.R" target="_blank">function</a>:</p>
<blockquote>
<p>source("http://biostat.jhsph.edu/~jleek/code/twitterMap.R")</p>
</blockquote>
<p>The function has the following form: </p>
<p>twitterMap &lt;- function(userName, userLocation=NULL, fileName="twitterMap.pdf", nMax=1000, plotType=c("followers","both","following"))</p>
<p>with arguments:</p>
<ul>
<li>userName - the twitter username you want to plot</li>
<li>userLocation - an optional argument giving the location of the user, necessary when the location information you have provided Twitter isn’t sufficient for us to find latitude/longitude data</li>
<li>fileName - the file where you want the plot to appear</li>
<li>nMax - the maximum number of followers/following to get from Twitter; this is implemented to avoid rate limiting for people with large numbers of followers</li>
<li>plotType - if “both” both followers/following are plotted, etc. </li>
</ul>
<p>Then you can create a plot with both followers/following like so: </p>
<blockquote>
<p>twitterMap("simplystats", plotType="both")</p>
</blockquote>
<p>Here is what the resulting plot looks like for our Twitter Account:</p>
<p><img height="550" src="http://biostat.jhsph.edu/~jleek/code/simplystats.png" width="500" /></p>
<p>If your location can’t be found or its latitude/longitude can’t be calculated, you may have to choose a bigger city near you. The list of cities used by twitterMap can be found like so:</p>
<blockquote>
<p>library(maps)</p>
</blockquote>
<blockquote>
<p>data(world.cities)</p>
</blockquote>
<blockquote>
<p>grep("Baltimore", world.cities[,1])</p>
</blockquote>
<p>If your city is in the database, this will return the row number of the world.cities data frame corresponding to your city. </p>
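<p>As a slightly fuller sketch of the same lookup, you can pull out the matching rows together with their coordinates. This assumes the <code>maps</code> package is installed and uses the column names of its <code>world.cities</code> data frame. The helper name <code>findCity</code> is made up for illustration and is not part of twitterMap.</p>
<pre><code class="language-r">library(maps)
data(world.cities)

# Hypothetical helper (not part of twitterMap): return the rows of
# world.cities whose name matches a pattern, along with their coordinates.
findCity &lt;- function(pattern) {
  rows &lt;- grep(pattern, world.cities$name)
  world.cities[rows, c("name", "country.etc", "lat", "long")]
}

findCity("Baltimore")
</code></pre>
<p>If no rows come back, try a larger nearby city.</p>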
<div>
If you like this function you may also like our function to determine if you are a <a href="http://simplystatistics.tumblr.com/post/11271228367/datascientist" target="_blank">data scientist</a> or to analyze your <a href="http://simplystatistics.tumblr.com/post/13203811645/an-r-function-to-analyze-your-google-scholar-citations" target="_blank">Google Scholar citations page</a>.
</div>
<div>
</div>
<div>
<strong>Update</strong>: The bulk of the heavy lifting done by these functions is performed by Jeff Gentry’s very nice <a href="http://cran.r-project.org/web/packages/twitteR/" target="_blank">twitteR </a>package and <a href="http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/" target="_blank">code</a> put together by Nathan Yau over at FlowingData. This is really an example of standing on the shoulders of giants.
</div>
On Hard and Soft Money
2011-12-19T13:10:06+00:00
http://simplystats.github.io/2011/12/19/on-hard-and-soft-money
<p>As the academic job hunting season goes into effect many will be applying to a variety of different types of departments. In statistics, there is a pretty big separation between statistics departments, which tend to be in arts & sciences colleges, and biostatistics departments, which tend to be in medical or public health institutions. A key difference between these two types of departments is the funding model.</p>
<!-- more -->
<p>Statistics department faculty tend to be on 9- or 10-month salaries with funding primarily coming from teaching classes (research funding can be obtained for the summer months). Biostatistics department faculty tend to have 12-month salaries with a large chunk of funding coming from research grants. Statistics departments are sometimes called “hard money” departments (i.e. tuition money is “hard”) while biostatistics departments are “soft money”. Grant money is considered “soft” because it has a tendency to go away a bit more easily. As long as students want to attend a university, there will always be tuition.</p>
<p>The biostatistics department at Johns Hopkins is a soft money department. We tend to get the bulk of our salaries from research project grants. Statisticians can play two roles on research grants: as a co-investigator/collaborator and as a principal investigator (PI). I guess that’s true of anyone, but statisticians are very commonly part of research projects as co-investigators because pretty much every research project these days will need statistical advice or methodological development. Researchers often have trouble getting their grants funded if they don’t have a statistician on board. So there’s often plenty of funding to go around for statisticians. But the real problem is getting enough time to do the research <em>you</em> want to do. If you’re spending all your time doing other people’s work, then sure you’re getting paid, but you’re not getting things done that will advance your career.</p>
<p>In a soft money department, I can think of two ways to go. The first is to write your own grants with you as the PI. That way you can guarantee funding for yourself to do the things you find interesting (assuming your grant is funded!). The other approach is to collaborate on a project where the work you need to do is work you would have done anyway. That can be a happy coincidence because then you don’t have to deal with the administrative burden of running a research project. But this approach relies a bit on luck and on the research environment at your institution.</p>
<p>Many job candidates tell me that they are worried about working in a soft money department because if they can’t get their grants funded then they will be in some sort of trouble. In hard money departments, at least the majority of their salary is guaranteed by the teaching they do. This is true to some extent, but I contend that they are worrying about the wrong thing, namely money.</p>
<p>What job candidates should <em>really</em> be worried about is whether the department will support them in their career. Candidates should be looking for departments that mentor their junior faculty and create an environment in which it will be easy to succeed. If you’re in a department that routinely hangs their junior faculty out to dry, you can have all the hard money you want and you’ll still be unhappy. A soft money department that supports their junior faculty will make sure the right structure is in place for faculty to succeed. </p>
<p>Here are some things to look out for in any department, but perhaps more so in a soft money department:</p>
<ul>
<li>Is there administrative support staff to help with writing grants, i.e. for drafting budgets, assembling biosketches, and other paperwork?</li>
<li>Are there senior faculty around who have successfully written grants and would be willing to read your grants and give you feedback?</li>
<li>Is the environment there sufficient for you to do the things you want to do? For example, are there excellent collaborators for you to work with? Powerful computing support? All these things will help you get an edge over people who don’t have easy access to these resources.</li>
</ul>
<p>Besides having a good idea, the environment can play a key role in writing a good grant. For starters, if all your collaborators are in the same building as you, it makes it a lot easier to coordinate meetings to discuss ideas and to do the preparation. If you’re trying to work with 4 different people in 4 different institutions (maybe in different timezones), things just get a little harder and maybe you don’t get the feedback you need.</p>
<p>Similarly, if you have a strong computing infrastructure in place, then you can test it out beforehand and see what its capabilities are. If you need to purchase the same infrastructure for yourself as part of a grant, then you won’t know what it can do until you get it and set it up. In our department, we are constantly buying new systems for our computing center and there are <em>always</em> glitches in the beginning with new equipment and new software. If you can avoid having to do this, it makes the grant a lot easier to write.</p>
<p>Lastly, I’ll just say that if you’re in the position of applying for tenure-track academic jobs, you’re probably not lazy. So you’re going to do your work no matter where you go. You just need to find a place where you can get things done. </p>
New features on Simply Statistics
2011-12-18T20:42:51+00:00
http://simplystats.github.io/2011/12/18/new-features-on-simply-statistics
<p>Check out our <a href="http://simplystatistics.tumblr.com/editorspicks" target="_blank">Editor’s Picks</a> and <a href="http://simplystatistics.tumblr.com/interviews" target="_blank">Interviews</a> pages. </p>
In Greece, a statistician faces life in prison for doing his job: calculating and reporting a statistic
2011-12-16T20:00:00+00:00
http://simplystats.github.io/2011/12/16/in-greece-a-statistician-faces-life-in-prison-for
<p>In a <a href="http://simplystatistics.tumblr.com/post/13945953822/interview-w-mario-marazzi-puerto-rico-institute-of" target="_blank">recent post</a> I described the importance of government statisticians. Well, apparently in Greece <a href="http://www.npr.org/blogs/money/2011/12/16/143766906/a-technocrat-in-trouble#more" target="_blank">it is a dangerous job</a>, as <span>Andreas Georgiou, the person in charge of the </span><span>Greek statistics office, found out.</span></p>
<blockquote>
<p><span>So far, though, his efforts have been met with resistance, strikes and a criminal investigation that could lead to life in prison for Georgiou.</span></p>
</blockquote>
<p><span>What are his efforts?</span></p>
<blockquote>
<p><span>His first priority after he was appointed was to figure out how big Greece’s deficit really was back in 2009, when the crisis began. He looked through all the data and concluded that Greece’s deficit that year was 15.8 percent of GDP — higher than what had previously been reported.</span></p>
<p><span>Eurostat, the central authority in Brussels, praised Georgiou’s methodology and blessed the number as true. The hundreds of Greek people who work beneath Georgiou — the old guard — did not.</span></p>
</blockquote>
<p>So in response, the “old guard” decided to vote on the summary statistic:</p>
<blockquote>
<p><span>Skordas sits on a governing board for the statistics office. His board wanted to debate and vote on the deficit number before anyone in Brussels was allowed to see it. Georgiou, the technocrat, saw that as a threat to his independence. He refused. The number is the number, he said. It’s not something to be put up for a vote.</span></p>
</blockquote>
<p><span>Did they perform a Bayesian analysis based on the vote?</span></p>
Interview with Nathan Yau of FlowingData
2011-12-16T12:51:35+00:00
http://simplystats.github.io/2011/12/16/interview-with-nathan-yau-of-flowingdata
<div class="im">
<strong>Nathan Yau</strong>
</div>
<div class="im">
<strong><br /></strong>
</div>
<div class="im">
<img height="400" src="http://directory.stat.ucla.edu/images/nathan-yau/1.jpg?1287095045" width="250" />
</div>
<div class="im">
</div>
<div class="im">
</div>
<div class="im">
Nathan Yau is a graduate student in statistics at UCLA and the author of the extremely popular data visualization blog <a href="http://flowingdata.com/" target="_blank">flowingdata.com</a>. He recently published a book, <a href="http://book.flowingdata.com/" target="_blank">Visualize This</a>, a really nice guide to modern data visualization using R, Illustrator and JavaScript, which should be on the bookshelf of any statistician working on data visualization.
</div>
<div class="im">
</div>
<div class="im">
</div>
<div class="im">
</div>
<div class="im">
<strong>Do you consider yourself a statistician/data scientist/or something else?</strong>
</div>
<p><span>Statistician. I feel like statisticians can call themselves data scientists, but not the other way around. Although with data scientists there’s an implied knowledge of programming, which statisticians need to get better at.</span></p>
<div class="im">
<strong>Who have been good mentors to you and what qualities have been most helpful for you?</strong>
<p>
I’m visualization-focused, and I really got into the area during a summer internship at The New York Times. Before that, I mostly made graphs in R for reports. I learned a lot about telling stories with data and presenting data to a general audience, and that has stuck with me ever since.
</p>
</div>
<p><span>Similarly, my adviser Mark Hansen has shown me how data is more free-flowing and intertwined with everything. It’s hard to describe. I mean, coming into graduate school, I thought in terms of datasets and databases, but now I see it as something more organic. I think that helps me see what the data is about more clearly.</span></p>
<div class="im">
<strong>How did you get into statistics/data visualization?</strong>
In undergrad, an introduction to statistics (for engineering) actually pulled me in. The professor taught with so much energy, and the material sort of clicked with me. My friends who were also taking the course complained and had trouble with it, but I wanted more for some reason. I eventually switched from electrical engineering to statistics.
</div>
<p><span>I got into visualization during my first year in grad school. My adviser gave a presentation on visualization, but from a media arts perspective rather than a charts-and-graphs-in-R-Tufte point of view. I went home after that class, googled visualization and that was that.</span></p>
<div class="im">
<strong>Why do you think there has been an explosion of interest in data visualization?</strong>
</div>
<p><span>The Web is a really visual place, so it’s easy for good visualization to spread. It’s also easier for a general audience to read a graph than it is to understand statistical concepts. And from a more analytical point of view, there’s just a growing amount of data and visualization is a good way to poke around.</span></p>
<div class="im">
<strong>Other than R, what tools should students learn to improve their data visualizations?</strong>
</div>
<p><span>For static graphics, I use Illustrator all the time to bring storytelling into the mix or to just provide some polish. For interactive graphics on the Web, it’s all about JavaScript nowadays. D3, Raphael.js, and Processing.js are all good libraries to get started.</span></p>
<div class="im">
<strong>Do you think the rise of infographics has led to a “watering down” of data visualization?</strong>
So I actually just wrote <a href="http://flowingdata.com/2011/12/08/on-low-quality-infographics" target="_blank">a post</a> along these lines. It’s true that there are a lot of low-quality infographics, but I don’t think that takes away from visualization at all. It makes good work more obvious. I think the flood of infographics is a good indicator of people’s eagerness to read data.</div>
<div class="im">
<strong>How did you decide to write your book “Visualize This”?</strong>
</div>
<div class="im">
Pretty simple. I get emails and comments all the time when I post graphics on FlowingData that ask how something was done. There aren’t many resources that show people how to do that. There are books that describe what makes good graphics but don’t say anything about how to actually go about doing it, and there are programming books for, say, R, but they are too technical for most and aren’t visualization-centric. I wanted to write a book that I wish I had in the early days.
</div>
<div class="im">
<strong>Any final thoughts on statistics, data and visualization?</strong>
</div>
<div class="im">
Keep an open mind. Oftentimes, statisticians seem to box themselves into positions of analysis and reports. Statistics is an applied field though, and now more than ever, there are opportunities to work anywhere there is data, which is practically everywhere.
</div>
Dear editors/associate editors/referees, Please reject my papers quickly
2011-12-14T16:36:41+00:00
http://simplystats.github.io/2011/12/14/dear-editors-associate-editors-referees-please-reject
<p>The review times for most journals in our field are ridiculous. Check out Figure 1 <a href="http://www.stat.tamu.edu/~carroll/ftp/2001.papers.directory/times.pdf" target="_blank">here</a>. A careful review takes time, but not six months. Let’s be honest, those papers are sitting on desks for the great majority of those six months. But here is what really kills me: waiting six months for a review basically saying the paper is not of sufficient interest to the readership of the journal. That decision you can come to in half a day. If you don’t have time, don’t accept the responsibility to review a paper.</p>
<p>I like sharing my work with my statistician colleagues, but the biology journals never do this to me. When my paper is not of sufficient interest, these journals reject me in days, not months. I sometimes work on topics that are fast-paced and many of my competitors are not statisticians. If I have to wait six months for each rejection, I can’t compete. By the time the top three applied statistics journals reject the paper, more than a year goes by and the paper is no longer novel. Meanwhile I can go through Nature Methods, Genome Research, and Bioinformatics in less than 3 months.</p>
<p>Nick Jewell once shared an idea that I really liked. It goes something like this. Journals in our field will accept every paper that is correct. The editorial board, with the help of referees, assigns each paper into one of five categories A, B, C, D, E based on novelty, importance, etc. If you don’t like the category you are assigned, you can try your luck elsewhere. But before you go, note that the paper’s category can improve after publication based on readership feedback. While we wait for this idea to get implemented, I ask that if you get one of my papers and you don’t like it, reject it quickly. You can write this review: “This paper rubbed me the wrong way and I heard you like being rejected fast so that’s all I am going to say.” Your comments and critiques are valuable, but not worth the six-month wait.</p>
<p>ps - I have to admit that the newer journals have not been bad to me in this regard. Unfortunately, for the sake of my students/postdocs going into the job market and my untenured jr colleagues, I feel I have to try the established top journals first as they still impress more on a CV.</p>
Smoking is a choice, breathing is not.
2011-12-14T13:58:52+00:00
http://simplystats.github.io/2011/12/14/smoking-is-a-choice-breathing-is-not
<p>Over the last week or so I’ve been posting about the air pollution levels in Beijing, China. The twitter feed from the US Embassy there makes it easy to track the hourly levels of fine particulate matter (PM2.5) and you can use <a href="http://www.biostat.jhsph.edu/~rpeng/makeBeijingAirGraph.R" target="_blank">this R code</a> to make a graph of the data.</p>
<p>One problem with talking about particulate matter levels is that the units are a bit abstract. We usually talk in terms of micrograms per cubic meter (mcg/m^3), which is a certain mass of particles per volume of air. The 24-hour national ambient air quality standard for fine PM in the US is 35 mcg/m^3. But what does that mean in reality?</p>
<p>C. Arden Pope III and colleagues recently wrote an interesting paper in <em>Environmental Health Perspectives</em> on the <a href="http://www.ncbi.nlm.nih.gov/pubmed/21768054" target="_blank">dose-response relationship between particles and lung cancer and cardiovascular disease</a>. They combined data from air pollution studies and smoking studies to estimate the dose-response curve for a very large range of PM levels. Ambient air pollution, not surprisingly, is on the low-end of PM exposure, followed by second hand smoke, followed by active smoking. One challenge they faced is putting everything on the same scale in terms of PM exposure so that the different studies could be compared.</p>
<p>Here are the important details: On average, actively smoking a cigarette generates a dose of about 12 milligrams (mg) of particulate matter. Daily inhalation rates obviously depend on your size, age, physical activity, health, and other factors, but in adults they range from about 13 to 23 cubic meters of air per day. For convenience, I’ll just take the midpoint of that range, which is 18 cubic meters per day.</p>
<p>If your city’s fine PM levels were compliant with the US national standard of 35 mcg/m^3, then in the worst case scenario you’d be breathing in about 630 micrograms of particles per day, which is about 0.05 cigarettes (1 cigarette every 20 days). Sounds like it’s not too bad, but keep in mind that most of the increase in risk from smoking is seen in the low range of the dose-response curve (although this is obviously very low).</p>
<p>If we move now to Beijing, where 24-hour average levels can easily reach up to 300 mcg/m^3 (and <a href="http://www.npr.org/2011/12/07/143214875/clean-air-a-luxury-in-beijings-pollution-zone" target="_blank">indoor levels can reach 200 mcg/m^3</a>), then we’re talking about a daily dose of almost half a cigarette. Now, half a cigarette might still seem like not that much, but keep in mind that <em>pretty much everyone is exposed</em>: old and young, sick and healthy. Not everyone gets the same dose because of variation in inhalation rates, but even the low end of the range gives you about 0.3 cigarettes. </p>
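<p>For anyone who wants to redo this arithmetic, here is a minimal R sketch of the conversion. It uses only the figures quoted above (12 mg per cigarette, 13 to 23 cubic meters of air per day). The function name <code>cigarettes_per_day</code> is made up for illustration.</p>
<pre><code class="language-r"># Figures quoted in the post
pm_per_cigarette_mg &lt;- 12                              # dose from one cigarette, in mg
inhalation_m3_day   &lt;- c(low = 13, mid = 18, high = 23) # adult range, m^3 per day

# Hypothetical helper: convert a 24-hour PM2.5 level (mcg/m^3) into
# cigarette equivalents per day for a given daily inhalation volume.
cigarettes_per_day &lt;- function(pm_mcg_m3, inhaled_m3 = 18) {
  dose_mg &lt;- pm_mcg_m3 * inhaled_m3 / 1000  # mcg to mg
  dose_mg / pm_per_cigarette_mg
}

cigarettes_per_day(35)                      # US 24-hour standard
cigarettes_per_day(300)                     # Beijing, midpoint inhalation rate
cigarettes_per_day(300, inhalation_m3_day)  # Beijing, across the inhalation range
</code></pre>
<p>At the midpoint inhalation rate this gives about 0.05 cigarettes per day at the US standard and 0.45 at 300 mcg/m^3; across the inhalation range the Beijing figure runs from roughly 0.33 to 0.58.</p>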
<p>Beijing is hardly alone here, as a number of studies in Asian cities show comparable levels of fine PM. I’ve redone my previous plot of PM2.5 in Beijing in terms of the number of cigarettes per day. Here’s the last 2 months in Beijing (for an average adult).</p>
<p><img src="http://media.tumblr.com/tumblr_lw59ogIhq81r08wvg.png" alt="" /></p>
The Supreme Court's interpretation of statistical correlation may determine the future of personalized medicine
2011-12-12T23:02:40+00:00
http://simplystats.github.io/2011/12/12/the-supreme-courts-interpretation-of-statistical
<p><strong>Summary/Background</strong></p>
<p>The Supreme Court heard <a href="http://www.supremecourt.gov/oral_arguments/argument_transcripts/10-1150.pdf" target="_blank">oral arguments</a> last week in the case Mayo Collaborative Services vs. Prometheus Laboratories (<a href="http://www.supremecourt.gov/Search.aspx?FileName=/docketfiles/10-1150.htm" target="_blank">No 10-1150</a>). At issue is a patent Prometheus Laboratories holds for making decisions about the treatment of disease on the basis of a measurement of a specific, naturally occurring molecule and a corresponding calculation. The specific language at issue is a little technical, but the key claim from the patent under dispute is:</p>
<blockquote>
<ol>
<li>A method of optimizing therapeutic efficacy for treatment of an immune-mediated gastrointestinal disorder, comprising:</li>
</ol>
<p>(a) administering a drug providing 6-thioguanine to a subject having said immune-mediated gastrointestinal disorder; and</p>
<p>(b) determining the level of 6-thioguanine in said subject having said immune-mediated gastrointestinal disorder,</p>
<p>wherein the level of 6-thioguanine less than about 230 pmol per 8x10^8 red blood cells indicates a need to increase the amount of said drug subsequently administered to said subject and</p>
<p>wherein the level of 6-thioguanine greater than about 400 pmol per 8x10^8 red blood cells indicates a need to decrease the amount of said drug subsequently administered to said subject.</p>
</blockquote>
<p>So basically the patent is on a decision made about treatment on the basis of a statistical correlation. When the levels of a specific molecule (6-thioguanine) are too high, the dose of a drug (thiopurine) should be decreased; when they are too low, the dose should be increased. Here (and throughout the post) correlation is interpreted loosely as a relationship between two variables, rather than under the strict definition as the linear relationship between two quantitative variables.</p>
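<p>To make concrete how little computation is involved, the decision rule in the claim can be sketched in a few lines of R. The thresholds are the ones quoted in the claim (which says “about” 230 and 400). The function name and the label for the in-between range are my own assumptions, since the claim says nothing about that range.</p>
<pre><code class="language-r"># Thresholds as stated in the quoted claim, in pmol per 8x10^8 red blood cells
lower_threshold &lt;- 230
upper_threshold &lt;- 400

# Hypothetical helper: map a measured 6-thioguanine level to the dose
# adjustment that the claim says the level "indicates".
dose_signal &lt;- function(level) {
  ifelse(level &lt; lower_threshold, "increase",
         ifelse(level &gt; upper_threshold, "decrease", "no change indicated"))
}

dose_signal(c(150, 300, 450))
</code></pre>
<p>The rule is just a comparison of one number against two cutoffs; everything of scientific interest lies in how the cutoffs were estimated.</p>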
<p>This correlation between levels of 6-thioguanine and patient response was first reported by a group of academics in a <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1383347/" target="_blank">paper</a> in 1996. Prometheus developed a diagnostic test based on this correlation. Doctors (including those at the Mayo clinic) would draw blood, send it to Prometheus, who would calculate the levels of 6-thioguanine and report them back.</p>
<p>According to Mayo’s <a href="http://www.patents4life.com/wp-content/uploads/2011/11/10-1150_petitioner.authcheckdam.pdf" target="_blank">brief</a>, some doctors at Mayo who used this test decided it was possible to improve on it. So they developed their own diagnostic test, based on a different measurement of 6-thioguanine (6-TGN), which reported different information including:</p>
<blockquote>
<ul>
<li>A blood reading greater than 235 picomoles of 6-TGN is a “target therapeutic range,” and a reading greater than 250 picomoles of 6-TGN is associated with remission in adult patients; and</li>
<li>A blood reading greater than 450 picomoles of 6-TGN indicates possible adverse health effects, but in some instances levels over 700 are associated with remission without significant toxicity, while a “clearly defined toxic level” has not been established; and</li>
<li>A blood reading greater than 5700 picomoles of 6-MMP is possibly toxic to the liver.</li>
</ul>
</blockquote>
<p>They subsequently created their own proprietary test and started to market it, at which point Prometheus sued the Mayo Clinic for infringement. The most recent decision on the case was made by a federal circuit court, which upheld Prometheus’ claim. A useful summary is <a href="http://www.scotusblog.com/2011/12/argument-preview-patients-reaction-patents-scope/" target="_blank">here</a>.</p>
<p>The arguments for the two sides are summarized in the briefs for each side; for <a href="http://www.patents4life.com/wp-content/uploads/2011/11/10-1150_petitioner.authcheckdam.pdf" target="_blank">Mayo</a>:</p>
<blockquote>
<p>Whether 35 U.S.C. § 101 is satisfied by a patent claim that covers observed correlations between blood test results and patient health, so that the patent effectively preempts use of the naturally occurring correlations, simply because well-known methods used to administer prescription drugs and test blood may involve “transformations” of body chemistry.</p>
</blockquote>
<p>and for <a href="http://www.patentlyo.com/files/2011-10-31_prometheus-merits-brief.pdf" target="_blank">Prometheus</a>:</p>
<blockquote>
<p>Whether the Federal Circuit correctly held that concrete methods for improving the treatment of patients suffering from autoimmune diseases by using individualized metabolite measurements to inform the calibration of the patient’s dosages of synthetic thiopurines are patentable processes under 35 U.S.C. §101.</p>
</blockquote>
<p>Basically, Prometheus claims that the patent covers cases where doctors observe a specific data point and make a decision about a specific drug on the basis of that data point and a known correlation with patient outcomes. Mayo, on the other hand, says that since the correlation between the data and the outcome is a naturally occurring process, it cannot be patented.</p>
<p>In the oral arguments, the attorney for Mayo makes the claim that the test is only patentable if Prometheus specifies a specific level for 6-thioguanine and a specific treatment associated with that level (see pages 21-24 of the <a href="http://www.supremecourt.gov/oral_arguments/argument_transcripts/10-1150.pdf" target="_blank">transcript</a>). He then goes on to suggest that Mayo would then be free to pick another level and another treatment option for their diagnostic test. Justice Breyer disagrees even with this specific option (see page 38 of the transcript and his fertilizer example). He has made this view known before in his <a href="http://www.supremecourt.gov/opinions/05pdf/04-607.pdf" target="_blank">dissent</a> to the dismissal of the Labcorp writ of certiorari (a very similar case focusing on whether a correlation can be patented).</p>
<p><strong>Brief summary:</strong> <em>Prometheus is trying to patent a correlation between a molecule’s level and treatment decisions. Mayo is claiming this is a natural process and can’t be patented.</em></p>
<p><strong>Implications for Personalized Medicine (a statistician’s perspective)</strong></p>
<p>I believe this case has major potential consequences for the entire field of personalized medicine. The fundamental idea of personalized medicine is that treatment decisions for individual patients will be tailored on the basis of data collected about them and statistical calculations made on the basis of that data (i.e. correlations, or more complicated statistical functions).</p>
<p>According to my interpretation, if the Supreme Court rules in favor of Mayo in a broad sense, then this suggests that decisions about treatment made on the basis of data and correlation are not broadly patentable. In both the Labcorp dissent and the oral arguments for the Prometheus case, Justice Breyer argues that the process described by the patents:</p>
<blockquote>
<p>…instructs the user to (1) obtain test results and (2) think about them.</p>
</blockquote>
<p>He suggests that these are natural correlations and hence cannot be patented, just as a formula like E = mc^2 cannot be patented. The distinction seems to be subtle: E = mc^2 is a formula that exactly describes a property of nature, whereas the observed correlation is an empirical estimate of a parameter calculated on the basis of noisy data.</p>
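<p>To make the "empirical estimate" point concrete, here is a minimal Python sketch. It is an illustration under assumptions I chose, not anything from the case: the true correlation of 0.5, the 30 patients per study, and the names <code>pearson</code> and <code>simulate_estimates</code> are all made up. Each simulated study draws noisy data from the same underlying relationship, yet each returns a somewhat different correlation.</p>

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_estimates(true_rho=0.5, n_patients=30, n_studies=5, seed=1):
    """Estimate the same (hypothetical) true correlation in several noisy studies."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_studies):
        xs = [rng.gauss(0, 1) for _ in range(n_patients)]
        # y has correlation true_rho with x at the population level
        ys = [true_rho * x + math.sqrt(1 - true_rho ** 2) * rng.gauss(0, 1)
              for x in xs]
        estimates.append(pearson(xs, ys))
    return estimates


estimates = simulate_estimates()
print([round(r, 2) for r in estimates])
```

<p>The printed values scatter around the assumed true value rather than reproducing it exactly, which is the sense in which an observed correlation differs from a formula like E = mc^2.</p>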
<p>From a statistical perspective, there is little difference between calculating a correlation and calculating something more complicated, like the Oncotype DX <a href="http://www.oncotypedx.com/" target="_blank">signature</a>. Both return a score that can be used to determine treatment or other health care decisions. In some sense, they are both “natural phenomena” - one is just more complicated to calculate than the other. So it is not surprising that Genomic Health, the developer of Oncotype, has filed an <a href="http://www.americanbar.org/content/dam/aba/publications/supreme_court_preview/briefs/10-1150_respondentamcu6personalizedmedicalgrps.authcheckdam.pdf" target="_blank">amicus</a> in favor of Prometheus.</p>
<p>Once a score is calculated, regardless of the level of complication in calculating that score, the personalized decision still comes down to a decision made by a doctor on the basis of a number. So if the court broadly decides in favor of Mayo, from a statistical perspective, this would seemingly pre-empt patenting any personalized medicine decision made on the basis of observing data and making a calculation.</p>
<p>Unlike traditional medical procedures like surgery, or treatment with a drug, these procedures are based on data and statistics. But in the same way, a very specific set of operations and decisions is taken with the goal of improving patient health. If these procedures are broadly ruled as simply “natural phenomena”, it suggests that the development of personalized decision making strategies is not, itself, patentable. This decision would also have implications for other companies that use data and statistics to make money, like software giant SAP, which has also filed an <a href="http://www.americanbar.org/content/dam/aba/publications/supreme_court_preview/briefs/10-1150_amcusapamerica.authcheckdam.pdf" target="_blank">amicus brief</a> in support of the federal circuit court opinion (and hence Prometheus).</p>
<p>A large component of medical treatment in the future will likely be made on the basis of data and statistical calculations on those data - that is the goal of personalized medicine. So the Supreme Court’s decision about the patentability of correlation has seemingly huge implications for any decision made on the basis of data and statistical calculations. Regardless of the outcome, this case lends even further weight to the idea that statistical literacy is critical, including for Supreme Court justices.</p>
<p>Simply Statistics will be following this case closely; look for more in-depth analysis in future blog posts.</p>
Interview w/ Mario Marazzi, Puerto Rico Institute of Statistics Director, on the importance of Government Statisticians
2011-12-09T01:14:00+00:00
http://simplystats.github.io/2011/12/09/interview-w-mario-marazzi-puerto-rico-institute-of
<p class="MsoNormal">
[Scroll down for the Spanish translation]
</p>
<p class="MsoNormal">
In my opinion, the importance of government statisticians is underappreciated. In the US, agencies such as the CDC, the Census Bureau, and the Bureau of Labor Statistics employ statisticians to help collect and analyze data that contribute to important policy decisions. How many students will enroll in public schools this year? Is there a type II diabetes epidemic? Is unemployment rising? How many homeless people are in Los Angeles? The answers to these questions can guide policy and spending decisions and they can’t be answered without the help of the government statisticians that collect and analyze relevant data.
</p>
<p class="MsoNormal">
<img align="middle" height="181" src="http://www.primerahora.com/XStatic/primerahora/images/espanol/ph20100930marazzi.jpg" width="141" />
</p>
<p class="MsoNormal">
Until recently the Puerto Rican government had no formal mechanisms for collecting data. Puerto Rico, an unincorporated territory of the United States, has many serious economic and social problems. With a <a href="http://www.nytimes.com/2011/06/21/us/21crime.html?pagewanted=all" target="_blank">very high murder rate</a>, less than 50% of the working-age population in the <a href="http://grupocne.org/cneblog/?p=310" target="_blank">labor force</a>, an <a href="http://www.economist.com/blogs/dailychart/2011/01/gdp_forecasts" target="_blank">economy</a> that continues to worsen after 5 years of recession, and a <a href="http://www.elnuevodia.com/losboricuasperdemos32diasalanoentapones-1123907.html" target="_blank">substantial traffic problem</a>, Puerto Rico can certainly benefit from sound government statistics to better guide policy-making. Better measurement, information and knowledge can only improve the situation.
</p>
<p class="MsoNormal">
In 2007, the <a href="http://www.estadisticas.gobierno.pr" target="_blank">Puerto Rico Institute of Statistics</a> was founded. Mario Marazzi, who obtained his PhD in Economics from Cornell University, left a prestigious job at the Federal Reserve to become the first Executive Director of the Institute. Given the <a href="http://www.elnuevodia.com/blog-las_lecciones_de_marazzi-773367.html" target="_blank">complicated political landscape</a> in Puerto Rico, Mario made an admirable sacrifice for his home country. He was kind enough to answer some questions for Simply Statistics:
</p>
<p class="MsoNormal">
<strong>What is the biggest success story of the Institute?</strong>
</p>
<p class="MsoNormal">
I would say that our biggest success story has been to revive the idea that high-quality statistics are critical for the success of any organization in Puerto Rico. For too long, statistics were neglected and even abused in Puerto Rico. There is now a palpable sense in Puerto Rico that it is important to devote resources and time to ensure that data are produced with care.
</p>
<p class="MsoNormal">
We have also undertaken a number of critical statistical projects since our inauguration in 2007. For instance, the Institute completed the revision to Puerto Rico’s Consumer Price Index, after identifying that official inflation had been overestimated by more than double for 15 years. The Institute revised Puerto Rico’s Mortality Statistics, after detecting the use of an inconsistent selection methodology for the cause of death, as well as discovering thousands of deaths that had not been previously included in the official data. We also undertook Puerto Rico’s first-ever Science and Technology Survey that allowed us to measure the economic impact of Research and Development activities in Puerto Rico.
</p>
<p class="MsoNormal">
<strong>What discovery, made from collecting data in Puerto Rico, has most surprised you?</strong>
</p>
<p class="MsoNormal">
We performed a study on migration patterns during the last decade. From anecdotal evidence, it was fairly clear that in the last five years there had been an elevated level of migration out of Puerto Rico. Nevertheless, the data revealed a few stunning conclusions. For five consecutive years, about 1 percent of Puerto Rico’s population simply left Puerto Rico every year, even after taking into account the people who migrated to Puerto Rico. The demographic consequences were significant: migration had been accelerating the aging of Puerto Rico’s population, and people who left Puerto Rico had a greater level of educational achievement than those who arrived. In fact, for the first time ever in recorded history, Puerto Rico’s population actually declined between the 2000 and 2010 Census. Despite declining fertility rates, it is now clear migration was the cause of the overall population decrease.
</p>
<p class="MsoNormal">
<strong>Are government agencies usually willing to cooperate with the Institute? If not, what resources does the Institute have available to make them comply?</strong>
</p>
<p class="MsoNormal">
Frequently, statistical functions are not very high on policymakers’ lists of priorities. As a result, government statisticians are usually content to collaborate with the Institute, since we can bring resources to help solve the common problems they face.
</p>
<p class="MsoNormal">
At times, some agencies can be reluctant to undertake the changes needed to produce high-quality statistics. In these instances, the Institute is endowed with the authority by law to move the process along, through statistical policy mandates approved by the Board of Directors of the Institute.
</p>
<p class="MsoNormal">
<strong>If there is a particular agency that excels at collecting and sharing data, can others learn from them?</strong>
</p>
<p class="MsoNormal">
Definitely. We encourage agencies to share their best practices with one another. To facilitate this process, the Institute has the responsibility of organizing the Puerto Rico Statistical Coordination Committee, where representatives from each agency can share practical experiences and enhance interagency coordination.
</p>
<p class="MsoNormal">
<strong>Do you think Puerto Rico needs more statisticians?</strong>
</p>
<p class="MsoNormal">
Absolutely. Some of our brightest minds in statistics work outside of Puerto Rico, both in Universities and in the Federal Government. Puerto Rico needs an injection of human resources to bring its statistical system up to global standards.
</p>
<p class="MsoNormal">
<strong>What can academic statisticians do to help institutes such as yours?</strong>
</p>
<p class="MsoNormal">
Academic statisticians are instrumental to furthering the mission of the Institute. Governments produce statistics in a wide array of disciplines. Each area can have very specific and unique methodologies. It is impossible for one to be an expert in every methodology.
</p>
<p class="MsoNormal">
As a result, the Institute depends on the collaboration of academic statisticians that can bring to bear their expertise in specific fields. For example, academic biostatisticians can help identify needed improvements to existing methodologies in health statistics. Index theorists can train government statisticians in the latest index methodologies. Computational statisticians can analyze large data sets to help us explain the otherwise unexplained behavior of the data.
</p>
<p class="MsoNormal">
We also host several Puerto Rico datasets on the Institute’s website, which were provided by professors from a number of different fields.
</p>
<hr />
<p class="MsoNormal">
<strong>Entrevista con Mario Marazzi (versión en español)</strong>
</p>
<p class="MsoNormal">
En mi opinión, la importancia de los estadísticos que trabajan para el gobierno se subestima. En los EEUU, agencias como el Center for Disease Control, el Census Bureau y el Bureau of Labor Statistics emplean estadísticos para ayudar a recopilar y analizar datos que contribuyen a importantes decisiones de política pública. Por ejemplo, ¿cuántos estudiantes se matricularán en las escuelas públicas este año? ¿Hay una epidemia de diabetes tipo II? ¿El desempleo está aumentando? ¿Cuántos deambulantes viven en Los Ángeles? Las respuestas a estas preguntas ayudan a determinar las decisiones presupuestarias y de política pública y no se pueden contestar sin la ayuda de los estadísticos del gobierno que recogen y analizan los datos pertinentes.
</p>
<p class="MsoNormal">
Hasta hace poco el gobierno de Puerto Rico no tenía mecanismos formales de recolección de datos. Puerto Rico, un territorio no incorporado de Estados Unidos, tiene muchos problemas socioeconómicos. Con una <a href="http://www.nytimes.com/2011/06/21/us/21crime.html?pagewanted=all" target="_blank">tasa de asesinatos muy alta</a>, menos de 50% de la población con edad de trabajar en la <a href="http://grupocne.org/cneblog/?p=310" target="_blank">fuerza laboral</a>, una economía que <a href="http://www.economist.com/blogs/dailychart/2011/01/gdp_forecasts" target="_blank">sigue empeorando</a> después de 5 años de recesión y <a href="http://www.elnuevodia.com/losboricuasperdemos32diasalanoentapones-1123907.html" target="_blank">problemas serios de tráfico</a>, Puerto Rico se beneficiaría de estadísticas gubernamentales de alta calidad para guiar mejor la formulación de política pública. Mejores medidas, información y conocimientos sólo pueden mejorar la situación.
</p>
<p class="MsoNormal">
En 2007, se inauguró el Instituto de Estadísticas de Puerto Rico. Mario Marazzi, quien obtuvo su doctorado en Economía de la Universidad de Cornell, dejó un trabajo prestigioso en la Reserva Federal para convertirse en el primer Director Ejecutivo del Instituto.
</p>
<p class="MsoNormal">
Tomando en cuenta el <a href="http://www.elnuevodia.com/blog-las_lecciones_de_marazzi-773367.html" target="_blank">complicado panorama político</a> en Puerto Rico, Mario hizo un sacrificio admirable por su país y cordialmente aceptó contestar unas preguntas para nuestro blog:
</p>
<p class="MsoNormal">
<strong>¿Cuál ha sido el mayor éxito del Instituto?</strong>
</p>
<p class="MsoNormal">
Yo diría que nuestro mayor éxito ha sido revivir la idea de que las estadísticas de alta calidad son cruciales para el éxito de cualquier organización en Puerto Rico. Por mucho tiempo, las estadísticas fueron descuidadas e incluso abusadas en Puerto Rico. En la actualidad existe una sensación palpable en Puerto Rico que es importante dedicar recursos y tiempo para asegurarse de que los datos se produzcan con cuidado.
</p>
<p class="MsoNormal">
También, desde nuestra inauguración en 2007, hemos realizado una serie de proyectos críticos de estadística. Por ejemplo, el Instituto concluyó la revisión del Índice de Precios al Consumidor de Puerto Rico, después de identificar que la inflación oficial había sido sobreestimada por más del doble durante 15 años. El Instituto revisó las Estadísticas de Mortalidad de Puerto Rico, después de detectar el uso de una metodología de selección inconsistente para determinar la causa de muerte y tras descubrir miles de muertes que no habían sido incluidas en los datos oficiales. Además, realizamos la primera Encuesta de Ciencia y Tecnología de Puerto Rico, que nos permitió medir el impacto económico de las actividades de investigación y desarrollo en Puerto Rico.
</p>
<p class="MsoNormal">
<strong>¿Cuál descubrimiento, realizado a partir de la recopilación de datos en Puerto Rico, más te ha sorprendido?</strong>
</p>
<p class="MsoNormal">
Nosotros realizamos un estudio sobre los patrones de migración durante la última década. A partir de la evidencia anecdótica, era bastante claro que durante los últimos cinco años ha habido un nivel elevado de emigración de Puerto Rico. Sin embargo, los datos revelaron algunas conclusiones sorprendentes. Durante cinco años consecutivos, 1 por ciento de la población de Puerto Rico se ha ido de Puerto Rico todos los años, incluso después de tomar en cuenta la gente que emigró a Puerto Rico. Las consecuencias demográficas eran importantes: la migración ha acelerado el envejecimiento de la población de Puerto Rico y las personas que se fueron de Puerto Rico tienen un mayor nivel de preparación escolar que los que llegaron. De hecho, por primera vez en la historia, la población de Puerto Rico disminuyó entre el Censo de 2000 y el del 2010. A pesar de tasas de fecundidad que disminuyen, ahora está claro que la migración es la causa principal de la reducción de población.
</p>
<p class="MsoNormal">
<strong>¿Por lo general, las agencias gubernamentales están dispuestas a cooperar con el Instituto? Si no, ¿qué recursos tiene disponible el Instituto para obligarlos?</strong>
</p>
<p class="MsoNormal">
Frecuentemente, las estadísticas no aparecen muy altas en las listas de prioridades de los políticos. Como resultado, los estadísticos del gobierno por lo general están contentos de colaborar con el Instituto, ya que nosotros podemos aportar recursos para ayudar a resolver los problemas comunes a que se enfrentan.
</p>
<p class="MsoNormal">
A veces, algunas agencias pueden mostrarse reacias a emprender los cambios necesarios para producir estadísticas de alta calidad. En estos casos, el Instituto posee la autoridad legal de acelerar el proceso, a través de mandatos aprobados por el Consejo de Administración del Instituto.
</p>
<p class="MsoNormal">
<strong>Si hay un organismo en particular que se destaca en la recopilación y el intercambio de datos, ¿otros pueden aprender de ellos?</strong>
</p>
<p class="MsoNormal">
Definitivamente. Nosotros animamos a las agencias a compartir sus mejores prácticas con otros. Para facilitar este proceso, el Instituto tiene la responsabilidad de organizar el Comité de Coordinación Estadística de Puerto Rico, donde representantes de cada agencia pueden compartir experiencias prácticas y mejorar la coordinación interinstitucional.
</p>
<p class="MsoNormal">
<strong>¿Cree usted que Puerto Rico necesita más estadísticos?</strong>
</p>
<p class="MsoNormal">
Por supuesto. Algunas de nuestras mentes más brillantes en estadísticas trabajan fuera de Puerto Rico, tanto en las universidades como en el Gobierno Federal. Puerto Rico necesita una inyección de recursos humanos para que su sistema estadístico llegue a los estándares mundiales.
</p>
<p class="MsoNormal">
<strong>¿Qué pueden hacer los estadísticos académicos para ayudar a instituciones como la suya?</strong>
</p>
<p class="MsoNormal">
Los estadísticos académicos son fundamentales para promover la misión del Instituto. Los gobiernos generan las estadísticas en una amplia gama de disciplinas. Cada área puede tener metodologías muy específicas y únicas. Es imposible que uno sea un experto en cada metodología.
</p>
<p class="MsoNormal">
Como resultado, el Instituto cuenta con la colaboración de estadísticos académicos que pueden ejercer sus conocimientos en campos específicos. Por ejemplo, los bioestadísticos académicos pueden ayudar a identificar las mejoras necesarias a las metodologías existentes en el contexto de la salud pública. Los “Index theorists” pueden entrenar a los estadísticos del gobierno en las últimas metodologías de índice. Los estadísticos computacionales pueden analizar grandes “datasets” para ayudarnos a explicar comportamientos de los datos que de otra manera no tendrían explicación.
</p>
<p class="MsoNormal">
También organizamos varios datasets de Puerto Rico en la página web del Instituto, que fueron proporcionados por profesores en varios campos diferentes.
</p>
Plotting BeijingAir Data
2011-12-08T01:05:44+00:00
http://simplystats.github.io/2011/12/08/plotting-beijingair-data
<p>Here’s a bit of R code for scraping the BeijingAir Twitter feed and plotting the hourly PM2.5 values. The script defaults to the past 24 hours, but you can change that by modifying the value of the variable ‘n’.</p>
<p>You can just grab the code from this <a href="http://www.biostat.jhsph.edu/~rpeng/makeBeijingAirGraph.R" target="_blank">R script</a>. Note that you need to use the latest version of the ‘twitteR’ package because the data structure has changed from previous versions.</p>
<p>Using a modified version of the code in the script, I made a plot of the 24-hour average PM2.5 levels in Beijing over the last 2 months or so. The dashed line shows the US national ambient air quality standard for 24-hour average PM2.5. Note that the plot below shows 24-hour <em>averages</em>, so it is comparable to the US standard and also looks (somewhat) less extreme than the hourly values.</p>
<p><img src="http://media.tumblr.com/tumblr_lvuhvyF8S71r08wvg.png" alt="" /></p>
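<p>The averaging step behind a plot like this can be sketched in a few lines. This is not the linked R script; it is a minimal Python illustration with made-up hourly readings, and the names <code>daily_averages</code> and <code>US_24H_STANDARD</code> are hypothetical. The constant uses the US 24-hour PM2.5 standard of 35 mcg/m^3.</p>

```python
from statistics import mean

# US 24-hour PM2.5 standard, in micrograms per cubic meter
US_24H_STANDARD = 35.0


def daily_averages(hourly, hours_per_day=24):
    """Average consecutive blocks of hourly readings.

    An incomplete final block (fewer than hours_per_day readings) is dropped.
    """
    n_days = len(hourly) // hours_per_day
    return [
        mean(hourly[d * hours_per_day:(d + 1) * hours_per_day])
        for d in range(n_days)
    ]


# Made-up data: day 1 has a clean morning and a very polluted afternoon,
# day 2 is uniformly moderate.
hourly = [20.0] * 12 + [200.0] * 12 + [30.0] * 24

averages = daily_averages(hourly)
exceeds = [avg > US_24H_STANDARD for avg in averages]

for day, (avg, over) in enumerate(zip(averages, exceeds), start=1):
    print(f"day {day}: 24-hour average {avg:.1f}, above standard: {over}")
```

<p>In this toy series the first day's average is well below its hourly peak, which is why a plot of daily averages looks less extreme than a plot of hourly values while still being the right quantity to compare against a 24-hour standard.</p>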
Clean Air A 'Luxury' In Beijing's Pollution Zone
2011-12-07T17:37:55+00:00
http://simplystats.github.io/2011/12/07/clean-air-a-luxury-in-beijings-pollution-zone
<p><a href="http://www.npr.org/2011/12/07/143214875/clean-air-a-luxury-in-beijings-pollution-zone">Clean Air A ‘Luxury’ In Beijing’s Pollution Zone</a></p>
Outrage Grows Over Air Pollution and China’s Response
2011-12-07T14:12:45+00:00
http://simplystats.github.io/2011/12/07/outrage-grows-over-air-pollution-and-chinas-response
<p><a href="http://www.nytimes.com/2011/12/07/world/asia/beijing-journal-anger-grows-over-air-pollution-in-china.html">Outrage Grows Over Air Pollution and China’s Response</a></p>
Beijing Air (cont'd)
2011-12-06T13:49:00+00:00
http://simplystats.github.io/2011/12/06/beijing-air-contd
<p>Following up a bit on my previous post on <a href="http://simplystatistics.tumblr.com/post/13601935082/beijing-air" target="_blank">air pollution in Beijing, China</a>, my brother forwarded me a link to some <a href="http://www.chinadialogue.net/article/show/single/en/4661-Beijing-s-hazardous-blue-sky" target="_blank">work conducted by Steven Q. Andrews</a> on comparing particulate matter (PM) air pollution in China versus Europe and the US. China does not officially release fine PM measurements (PM2.5) and furthermore does not have an official standard for that metric. In the US, PM standards are generally focused on PM2.5 now as opposed to PM10 (which includes coarse thoracic particles). Apparently, China is proposing a standard for PM2.5 but it has not yet been implemented.</p>
<p>The main issue seems to be that China has a somewhat different opinion about what it means to be a “bad” pollution day. In the US, the daily average <a href="http://www.epa.gov/air/criteria.html" target="_blank">national ambient air quality standard for PM2.5</a> is 35 mcg/m^3, whereas the proposed standard in China is 75 mcg/m^3. <a href="http://www.who.int/phe/health_topics/outdoorair_aqg/en/" target="_blank">The WHO recommends PM2.5 levels be below 25 mcg/m^3</a>. In China, days under 35 would be considered “excellent” and days under 75 would be considered “good”.</p>
<p>It’s a bit difficult to understand what this means here because in the US we so rarely see days where the daily average is above 75 mcg/m^3. In fact, for the period 1999-2008, if you look across the entire PM2.5 monitoring network for the US, you see that 99% of days fell below the level of 75 mcg/m^3. So seeing a day like that would be quite a rare event indeed.</p>
<p>The Chinese government has consistently claimed that air pollution has improved over time. But Andrews notes:</p>
<blockquote>
<p><span>…these so-called improvements are due to irregularities in the monitoring and reporting of air quality – and not to less polluted air. Most importantly, the government </span><span>changed monitoring station locations</span><span> twice. In 2006, it shut down the two most polluted stations and then, in 2008, began monitoring outside the city, beyond the sixth ring road, which is 15 to 20 kilometres from Beijing’s centre.</span></p>
</blockquote>
<p><span>Andrews has previously published on <a href="http://iopscience.iop.org/1748-9326/3/3/034009" target="_blank">inconsistencies between Beijing’s claims of “blue sky days” and the actual monitoring of PM</a> in a paper in </span><span><em>Environmental Research Letters</em>. That paper showed an unusually high number of measurements falling just below the cutoff for a “blue sky day”. The reason for this pattern is not clear but it raises questions about the quality of the official monitoring data.</span></p>
<p><span>China has a novel propagandistic approach to air pollution regulation, which is to separate the data from the interpretation. So a day that has PM2.5 levels at 75 mcg/m^3 is called “good” and as long as you have a lot of “good” or “excellent” days, then you are set. The problem is that you can call something a “blue sky day” or whatever you want, but people still have to suffer the real consequences of high PM days. It’s hard to “relabel” increased asthma attacks, irritated respiratory tracts, and myocardial infarctions. </span></p>
<p><span>Andrews notes:</span></p>
<blockquote>
<p><span>As the </span><em>China Daily</em><span> recently </span><span>wrote</span><span>: “All of the residents in the city are aware of the poor air quality, so it does not make sense to conceal it for fear of criticism.”</span></p>
</blockquote>
<p><span>Maybe the best way to conceal the air pollution is to actually get rid of it?</span></p>
Who can resist Biostatistics Ryan Gosling?
2011-12-06T12:03:06+00:00
http://simplystats.github.io/2011/12/06/who-can-resist-biostatistics-ryan-gosling
<p><a href="http://biostatisticsryangosling.tumblr.com/">Who can resist Biostatistics Ryan Gosling?</a></p>
Preventing Errors through Reproducibility
2011-12-05T15:15:05+00:00
http://simplystats.github.io/2011/12/05/preventing-errors-through-reproducibility
<p>Checklist mania has hit clinical medicine thanks to people like Peter Pronovost and many others. The basic idea is that simple and short checklists along with changes to clinical culture can prevent major errors from occurring in medical practice. One particular success story is Pronovost’s central line checklist which <a href="http://www.ncbi.nlm.nih.gov/pubmed/15483409" target="_blank">dramatically reduced bloodstream infections</a> in hospital intensive care units. </p>
<p>There are three important points about the checklist. First, it neatly summarizes information, bringing the latest evidence directly to clinical practice. It is easy to follow because it is short. Second, it serves to slow you down from whatever you’re doing. Before you cut someone open for surgery, you stop for a second and run the checklist. Third, it is a kind of equalizer that subtly changes the culture: everyone has to follow the checklist, no exceptions. A number of studies have now shown that when clinical units follow checklists, infection rates go down and hospital stays are shorter compared to units using standard procedures. </p>
<p>Here’s a question: What would it take to convince you that an article’s results were reproducible, short of going in and reproducing the results yourself? I recently raised this question in a <a href="http://simplystatistics.tumblr.com/post/12243614318/i-gave-a-talk-on-reproducible-research-back-in" target="_blank">talk I gave</a> at the Applied Mathematics Perspectives conference. At the time I didn’t get any responses, but I’ve had some time to think about it since then.</p>
<p>I think most people are thinking of this issue along the lines of “The only way I can confirm that an analysis is reproducible is to reproduce it myself”. In order for that to work, everyone needs to have the data and code available to them so that they can do their own independent reproduction. Such a scenario would be sufficient (and perhaps ideal) to claim reproducibility, but is it strictly necessary? For example, if I reproduced a published analysis, would that satisfy you that the work was reproducible, or would you have to independently reproduce the results for yourself? If you had to choose someone to reproduce an analysis for you (not including yourself), who would it be?</p>
<p>This idea is embedded in the <a href="http://www.ncbi.nlm.nih.gov/pubmed/19535325" target="_blank">reproducible research policy at </a><em><a href="http://www.ncbi.nlm.nih.gov/pubmed/19535325" target="_blank">Biostatistics</a>,</em> but of course we make the data and code available too. There, a (hopefully) trusted third party (the Associate Editor for Reproducibility) reproduces the analysis and confirms that the code was runnable (at least at that moment in time). </p>
<p>It’s important to point out that reproducible research is not only about correctness and prevention of errors. It’s also about making research results available to others so that they may more easily build on the work. Still, preventing errors is an important part of it, so the question is: what is the best way to do that? Can we generate a reproducibility checklist?</p>
Online Learning, Personalized
2011-12-05T12:42:17+00:00
http://simplystats.github.io/2011/12/05/online-learning-personalized
<p><a href="http://www.nytimes.com/2011/12/05/technology/khan-academy-blends-its-youtube-approach-with-classrooms.html">Online Learning, Personalized</a></p>
Citizen science makes statistical literacy critical
2011-12-03T17:20:13+00:00
http://simplystats.github.io/2011/12/03/citizen-science-makes-statistical-literacy-critical
<p>In today’s Wall Street Journal, Amy Marcus has a <a href="http://online.wsj.com/article/SB10001424052970204621904577014330551132036.html" target="_blank">piece</a> on the <a href="http://en.wikipedia.org/wiki/Citizen_science" target="_blank">Citizen Science</a> movement, focusing on citizen science in health in particular. I am fully in support of this enthusiasm and a big fan of citizen science - if done properly. There have already been some pretty big <a href="http://www.wired.com/wiredscience/2010/08/citizen-scientist-pulsars/" target="_blank">success</a> <a href="http://depts.washington.edu/bakerpg/drupal/system/files/jiang08A.pdf" target="_blank">stories</a>. As more companies like <a href="http://www.fitbit.com/" target="_blank">Fitbit</a> and <a href="https://www.23andme.com/" target="_blank">23andMe</a> spring up, it is really easy to collect data about yourself (<a href="http://myyearofdata.wordpress.com/" target="_blank">right Chris?</a>). At the same time organizations like <a href="http://www.patientslikeme.com/" target="_blank">Patients Like Me</a> make it possible for people with specific diseases or experiences to self-organize. </p>
<p>But the thing that struck me the most in reading the article is the importance of statistical literacy for citizen scientists, reporters, and anyone reading these articles. For example the article says:</p>
<blockquote>
<p><span>The questions that most people have about their DNA—such as what health risks they face and how to prevent them—aren’t always in sync with the approach taken by pharmaceutical and academic researchers, who don’t usually share any potentially life-saving findings with the patients.</span></p>
</blockquote>
<p>I think it’s pretty unlikely that any organization would hide life-saving findings from the public. My impression from reading the article is that this statement refers to keeping results blinded from patients/doctors <em>during an experiment or clinical trial. </em><a href="http://en.wikipedia.org/wiki/Blind_experiment" target="_blank">Blinding</a> is a critical component of clinical trials, which reduces many potential sources of bias in the results of a study. Obviously, once the trial/study has ended (or been stopped early because a treatment is effective) then the results are quickly disseminated.</p>
<p>Several key statistical issues are then raised in bullet-point form without discussion: </p>
<blockquote>
<p><span>Amateurs may not collect data rigorously, they say, and may draw conclusions from sample sizes that are too small to yield statistically reliable results. </span></p>
<p>Having individuals collect their own data poses other issues. Patients may enter data only when they are motivated, or feeling well, rendering the data useless. In traditional studies, both doctors and patients are typically kept blind as to who is getting a drug and who is taking a placebo, so as not to skew how either group perceives the patients’ progress.</p>
</blockquote>
<p>The article goes on to describe an anecdotal example of citizen science - which suffers from a key statistical problem (small sample size):</p>
<blockquote>
<p>Last year, Ms. Swan helped to run a small trial to test what type of vitamin B people with a certain gene should take to lower their levels of homocysteine, an amino acid connected to heart-disease risk. (The gene affects the body’s ability to metabolize B vitamins.)</p>
<p>Seven people—one in Japan and six, including herself, in her local area—paid around $300 each to buy two forms of vitamin B and Centrum, which they took in two-week periods followed by two-week “wash-out” periods with no vitamins at all.</p>
</blockquote>
<p>The article points out the issue:</p>
<blockquote>
<p><span>The scientists clapped politely at the end of Ms. Swan’s presentation, but during the question-and-answer session, one stood up and said that the data was not statistically significant—and it could be harmful if patients built their own regimens based on the results.</span></p>
</blockquote>
<p><span>But it doesn’t carefully explain the importance of sample size, suggesting instead that the only reason why you need more people is to “insure better accuracy”. </span></p>
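<p>To make the sample-size point concrete, here is a minimal sketch in Python (it is not from the article or from the trial). It assumes each measurement has a standard deviation of 1 unit, which is an illustrative value only, and uses the usual normal approximation for the half-width of a 95% confidence interval for a mean, 1.96 × sd / √n.</p>

```python
import math

def ci_half_width(sd, n, z=1.96):
    """Approximate half-width of a 95% confidence interval for a mean.

    Uses the normal approximation z * sd / sqrt(n). The value of sd is an
    assumption chosen for illustration, not an estimate from any real trial.
    """
    return z * sd / math.sqrt(n)

# Compare a seven-person study with larger hypothetical studies.
for n in (7, 70, 700):
    print(n, round(ci_half_width(sd=1.0, n=n), 3))
```

<p>Under these assumptions the interval narrows only with the square root of the sample size, so it takes about 100 times as many people to make it 10 times narrower. For a study as small as seven people the normal approximation is itself optimistic, so the real uncertainty would be larger still.</p>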
<p>It strikes me that statistical literacy is critical if the citizen science movement is going to go forward. Ideas like experimental design, randomization, blinding, placebos, and sample size need to be in the toolbox of any practicing citizen scientist. </p>
<p>One major drawback is that there are very few places where the general public can learn about statistics. Mostly statistics is taught in university courses. Resources like the <a href="http://www.khanacademy.org/" target="_blank">Khan Academy</a> and the <a href="http://www.amazon.com/Cartoon-Guide-Statistics-Larry-Gonick/dp/0062731025" target="_blank">Cartoon Guide to Statistics</a> exist, but are only really useful if you are self-motivated and have some idea of math/statistics to begin with. </p>
<p>Since <span>knowledge of basic statistical concepts is quickly becoming indispensable for citizen science or even basic life choices like deciding on healthcare options, do we need “adult statistical literacy courses”? These courses could focus on the basics of experimental design and how to understand results in stories about science in the popular press. It feels like it might be time to add a basic understanding of statistics and data to reading/writing/arithmetic as critical life skills. <a href="http://simplystatistics.tumblr.com/post/13684145814/the-worlds-has-changed-from-analogue-to-digital-and" target="_blank">I’m not the only one who thinks so.</a></span></p>
The worlds has changed from analogue to digital and it's time mathematical education makes the change too.
2011-12-03T17:17:16+00:00
http://simplystats.github.io/2011/12/03/the-worlds-has-changed-from-analogue-to-digital-and
<p><a href="http://www.youtube.com/watch?v=BhMKmovNjvc">The worlds has changed from analogue to digital and it’s time mathematical education makes the change too.</a></p>
Reverse scooping
2011-12-03T15:46:53+00:00
http://simplystats.github.io/2011/12/03/reverse-scooping
<p>I would like to define a new term: <em>reverse scooping</em> is when someone publishes your idea after you, and doesn’t cite you. It has happened to me a few times. What does one do? I usually send a polite message to the authors with a link to my related paper(s). These emails are usually ignored, but not always. Most times I don’t think it is malicious though. In fact, I almost reverse scooped a colleague recently. People arrive at the same idea a few months (or years) later and there is just too much literature to keep track of. And remember the culprit authors were not the only ones that missed your paper, the referees and associate editor missed it as well. One thing I have learned is that if you want to claim an idea, try to include it in the title or abstract as very few papers get read cover-to-cover.</p>
New S.E.C. Tactics Yield Actions Against Hedge Funds
2011-12-03T14:07:05+00:00
http://simplystats.github.io/2011/12/03/new-s-e-c-tactics-yield-actions-against-hedge-funds
<p><a href="http://dealbook.nytimes.com/2011/12/01/new-s-e-c-tactics-yields-actions-against-hedge-funds/">New S.E.C. Tactics Yield Actions Against Hedge Funds</a></p>
Reproducible Research in Computational Science
2011-12-02T14:12:59+00:00
http://simplystats.github.io/2011/12/02/reproducible-research-in-computational-science
<p>First of all, thanks to Rafa for <a href="http://simplystatistics.tumblr.com/post/13602648384/rogers-perspective-on-reproducible-research-published" target="_blank">scooping me with my own article</a>. Not sure if that’s reverse scooping or recursive scooping or….</p>
<p>The latest issue of <em>Science</em> has a special section on <a href="http://www.sciencemag.org/content/334/6060/1225.full" target="_blank">Data Replication and Reproducibility</a>. As part of the section I wrote a brief commentary on the need for <a href="http://www.sciencemag.org/content/334/6060/1226.full" target="_blank">reproducible research in computational science</a>. <em>Science</em> has a pretty tight word limit for its commentaries and so it was unfortunately necessary to omit a number of relevant topics.</p>
<p>The editorial introducing the special section, as well as a separate editorial in the same issue, seem to emphasize the errors/fraud angle. This might be because <em>Science</em> has once or twice been at the center of instances of scientific fraud. But as I’ve said previously (and a point I tried to make in the commentary), <a href="http://simplystatistics.tumblr.com/post/12421558195/reproducible-research-notes-from-the-field#disqus_thread" target="_blank">reproducibility is not needed solely to prevent fraud</a>, although that is an important objective. Another important objective is getting ideas across and disseminating knowledge. I think this second objective often gets lost because there’s a sense that knowledge dissemination already happens and that it’s the errors that are new and interesting. While the errors are perhaps new, there is a problem of ideas not getting across as quickly as they could because of a lack of code and/or data. The lack of published code/data is arguably holding up the advancement of science (if not <em>Science</em>).</p>
<p>One important idea I wanted to get across was that we can ramp up to achieve the ideal scenario, if getting there immediately is not possible. People often get hung up on making the data available but I think a substantial step could be made by simply making code available. Why doesn’t every journal just require it? We don’t have to start with a grand strategy involving funding agencies and large consortia. <a href="http://simplystatistics.tumblr.com/post/13454027393/reproducible-research-and-turkey" target="_blank">We can start modestly and make useful improvements</a>. </p>
<p>A final interesting question that came up as the issue was going to press was whether I was talking about “reproducibility” or “replication”. As I made clear in the commentary, I define “replication” as independent people going out and collecting new data and “reproducibility” as independent people analyzing the same data. Apparently, others have the reverse definitions for the two words. The confusion is unfortunate because one idea has a centuries-long history whereas the importance of the other idea has only recently become relevant. I’m going to stick to my guns here but we’ll have to see how the language evolves.</p>
Roger's perspective on reproducible research published in Science
2011-12-01T21:32:14+00:00
http://simplystats.github.io/2011/12/01/rogers-perspective-on-reproducible-research-published
<p><a href="http://www.sciencemag.org/content/334/6060/1226"></a></p>
Beijing Air
2011-12-01T21:16:35+00:00
http://simplystats.github.io/2011/12/01/beijing-air
<p>If you’re interested in knowing what the air quality looks like in Beijing, China, the US Embassy there has a particulate matter monitor on its roof that tweets the level of fine particulate matter (PM2.5) every hour (see <a href="http://twitter.com/#!/beijingair" target="_blank">@BeijingAir</a>). In case you’re not used to staring at PM2.5 values all the time, let me provide some context.</p>
<p>The US National Ambient Air Quality Standard for the 24-hour average PM2.5 level is 35 mcg/m^3. The twitter feed shows hourly values, so you can’t compare it directly to the US NAAQS (you’d have to take the average of 24 values), but the levels are nevertheless pretty high.</p>
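<p>As a rough sketch of that comparison, the Python snippet below averages 24 hourly readings and checks the result against the 35 mcg/m^3 figure. The hourly values are made up for illustration and are not real readings from the feed; the function name is also just a placeholder.</p>

```python
NAAQS_24H_PM25 = 35.0  # US 24-hour PM2.5 standard in mcg/m^3, as quoted above

def daily_average(hourly_values):
    """Average exactly 24 hourly PM2.5 readings (mcg/m^3)."""
    if len(hourly_values) != 24:
        raise ValueError("expected 24 hourly readings")
    return sum(hourly_values) / 24

# Hypothetical readings for one day, NOT real data from the feed.
hourly = [80.0] * 12 + [120.0] * 12
avg = daily_average(hourly)
print(avg, avg > NAAQS_24H_PM25)
```

<p>The point of the length check is that a partial day of tweets should not be compared to a 24-hour standard as if it were a full day.</p>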
<p>For example, here’s the hourly time series plot of one 24-hour period in March of 2010:</p>
<p><img src="http://media.tumblr.com/tumblr_lvjafwLr0E1r08wvg.png" alt="" /></p>
<p>The red and blue lines show the average and maximum 24-hour value for Wake County, NC for the period 2000-2006 (I made this plot when I was giving a talk in Raleigh).</p>
<p>So, things could be worse here in the US, but remember that there’s no real evidence of a threshold for PM2.5, so even levels here are potentially harmful. But if you’re traveling to China anytime soon, you might want to bring a respirator.</p>
DNA Sequencing Caught in Deluge of Data
2011-12-01T13:07:32+00:00
http://simplystats.github.io/2011/12/01/dna-sequencing-caught-in-deluge-of-data
<p><a href="http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html">DNA Sequencing Caught in Deluge of Data</a></p>
Selling the Power of Statistics
2011-11-30T14:12:06+00:00
http://simplystats.github.io/2011/11/30/selling-the-power-of-statistics
<p>A few weeks ago we learned that <a href="http://www.bloomberg.com/news/2011-11-15/buffett-s-stake-in-century-old-ibm-bolsters-berkshire-s-defense.html" target="_blank">Warren Buffett is a big IBM fan</a> (a $10 billion fan, that is). Having heard that I went over to the IBM web site to see what they’re doing these days. For starters, they’re not selling computers anymore! At least not the kind that I would use. One of the big things they do now is “Business Analytics and Optimization” (i.e. statistics), which is one of the reasons they <a href="http://simplystatistics.tumblr.com/post/9955104326/data-analysis-companies-getting-gobbled-up" target="_blank">bought SPSS and then later Algorithmics</a>.</p>
<p>Roaming around the IBM web site, I found this little video on how <a href="http://www-935.ibm.com/services/us/gbs/bao/?lnk=mhse#overlay-noscript" target="_blank">IBM is involved with tennis matches</a> like the US Open. It’s the usual promo video: a bit cheesy, but pretty interesting too. For example, they provide all the players an automatically generated post-game “match analysis DVD” that has summaries of all the data from their match with corresponding video.</p>
<p>It occurred to me that one of the challenges that a company like IBM faces is selling the “power of analytics” to other companies. They need to make these promo videos because, I guess, some companies are not convinced they need this whole analytics thing (or at least not from IBM). They probably need to do methods and software development too, but getting the deal in the first place is at least as important.</p>
<p>In contrast, here at Johns Hopkins, my experience has been that we don’t really need to sell the “power of statistics” to anyone. For the most part, researchers around here seem to be already “sold”. They understand that they are collecting a ton of data and they’re going to need statisticians to help them understand it. Maybe Hopkins is the exception, but I doubt it.</p>
<p>Good for us, I suppose, for now. But there is a danger that we take this kind of monopoly position for granted. Companies like IBM hire the same people we do (including <a href="https://researcher.ibm.com/researcher/view.php?person=us-aveen" target="_blank">one grad school classmate</a>) and there’s no reason why they couldn’t become direct competitors. We need to continuously show that we can make sense of data in novel ways. </p>
Contributions to the R source
2011-11-29T14:10:03+00:00
http://simplystats.github.io/2011/11/29/contributions-to-the-r-source
<p>One of the nice things about tracking the R subversion repository using git instead of subversion is you can do</p>
<pre>git shortlog -s -n</pre>
<p>which gives you</p>
<pre>19855 ripley
6302 maechler
5299 hornik
2263 pd
1153 murdoch
813 iacus
716 luke
661 jmc
614 leisch
472 ihaka
403 murrell
286 urbaneks
284 rgentlem
269 apache
253 bates
249 tlumley
164 duncan
92 r
43 root
40 paul
40 falcon
39 lyndon
34 thomas
33 deepayan
26 martyn
18 plummer
15 (no author)
14 guido
3 ligges
1 mike
</pre>
<p>These data are since 1997 so for Brian Ripley, that’s 3.6 commits per day for the last 15 years. </p>
<p>I think that number 1 position will be out of reach for a while. </p>
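<p>The commits-per-day figure is easy to check. Here is a small Python sketch of the arithmetic, using the count for ripley from the shortlog output and treating the span as a round 15 years of 365.25 days each (both the rounding and the function name are my simplifications).</p>

```python
def commits_per_day(commits, years, days_per_year=365.25):
    """Average commits per calendar day over a span of years."""
    return commits / (years * days_per_year)

# 19855 is the count for "ripley" in the shortlog output above;
# 15 years is the rounded span used in the text.
rate = commits_per_day(19855, 15)
print(round(rate, 1))
```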
<p>By the way, I highly recommend to anyone tracking subversion repositories that they use <a href="http://git-scm.com" target="_blank">git</a> to do it. You get all of the advantages of git and there are essentially no downsides.</p>
Reproducible Research and Turkey
2011-11-28T14:50:41+00:00
http://simplystats.github.io/2011/11/28/reproducible-research-and-turkey
<p>Over the recent Thanksgiving break I naturally started thinking about reproducible research in between salting the turkey and making the turkey stock. Clearly, these things are all related. </p>
<!-- more -->
<p>I sometimes get the sense that many people see reproducibility as essentially binary. A published paper is either reproducible, as in you can compute every single last numerical result to within epsilon precision, or it’s not. My feeling is that there is a spectrum of reproducibility when it comes to published scientific findings. Some papers are more reproducible than others. And that’s where cooking comes in.</p>
<p>I do a bit of cooking and I am a shameless consumer of <a href="http://www.seriouseats.com/" target="_blank">food</a> <a href="http://ruhlman.com/" target="_blank">blogs</a>/<a href="http://www.cooksillustrated.com/" target="_blank">web</a> <a href="http://upstartkitchen.wordpress.com/" target="_blank">sites</a>. There seems to be pretty solid agreement (and my own experience essentially confirms) that the more you can make yourself and not have to rely on other people doing the cooking, the better. For example, for Thanksgiving, you could theoretically buy yourself a pre-roasted turkey that’s ready to eat. My <a href="http://vega.bac.pku.edu.cn/~peng/" target="_blank">brother</a> tells me this is what homesick Americans do in China because so few people have an oven (I suppose you could steam a turkey?). Or you could buy an un-cooked turkey that is “flavor injected”. Or you could buy a normal turkey and brine/salt it yourself. Or you could get yourself one of those heritage turkeys. Or you could raise your own turkeys…. I think in all of these cases, the turkey would definitely be edible and maybe even tasty. But some would probably be more tasty than others. </p>
<p>And that’s the point. There’s a spectrum when it comes to cooking and some methods result in better food than others. Similarly, when it comes to published research there is a spectrum of what authors can make available to reproduce their work. On the one hand, you have just the paper itself, which reveals quite a bit of information (i.e. the scientific question, the general approach) but usually too few details to actually reproduce (or even replicate) anything. Some authors might release the code, which allows you to study the algorithms and maybe apply them to your own work. Some might release the code and the data so that you can actually reproduce the published findings. Some might make a nice R package/vignette so that you barely have to lift a finger. Each case is better than the previous, but that’s not to say that I would only accept the last/best case. Some reproducibility is better than none.</p>
<p>That said, I don’t think we should shoot low. Ideally, we would have the best case, which would allow for full reproducibility and rapid dissemination of ideas. But while we wait for that best case scenario, it couldn’t hurt to have a few steps in between.</p>
Apple this is ridiculous - you gotta upgrade to upgrade!?
2011-11-27T19:34:11+00:00
http://simplystats.github.io/2011/11/27/apple-this-is-ridiculous-you-gotta-upgrade-to
<p>So along with a few folks here around Hopkins we have been kicking around the idea of developing an app for the iPhone/Android. I’ll leave the details out for now (other than to say stay tuned!).</p>
<p>But to start developing an app for the iPhone, you need a version of <a href="http://developer.apple.com/xcode/" target="_blank">Xcode</a>, Apple’s development environment. The latest version of Xcode is version 4, which can only be installed with the latest version of Mac OS X Lion (10.7, I think) and above. So I dutifully went off to download Lion. Except, whoops! You can only download Lion from the Mac App store.</p>
<p>Now this wouldn’t be a problem, if you didn’t need OS X Snow Leopard (10.6 and above) to access the App store. Turns out I only have version 10.5 (must be OS X Housecat or something). I did a little searching and it <a href="https://discussions.apple.com/thread/3102124?start=0" target="_blank">looks like</a> the only way I can get Lion is if I buy Snow Leopard first and upgrade to upgrade!</p>
<p>It isn’t the money so much (although it does suck to pay $60 for $30 worth of software), but the time and inconvenience this causes. Apple has done this a couple of times to me in the past with operating systems needing to be upgraded so I can buy things from iTunes. But this is getting out of hand….maybe I need to consider the <a href="http://www.google.com/chromebook/" target="_blank">alternatives</a>.</p>
An R function to analyze your Google Scholar Citations page
2011-11-23T14:07:53+00:00
http://simplystats.github.io/2011/11/23/an-r-function-to-analyze-your-google-scholar-citations
<p>Google Scholar has now made Google Scholar Citations profiles available to anyone. You can read about these profiles and set one up for yourself <a href="http://scholar.google.com/intl/en/scholar/citations.html" target="_blank">here</a>.</p>
<p>I asked <a href="http://www.jhsph.edu/faculty/directory/profile/5110/Muschelli/John" target="_blank">John Muschelli</a> and <a href="http://www.biostat.jhsph.edu/~ajaffe/" target="_blank">Andrew Jaffe</a> to write me a function that would download my Google Scholar Citations data so I could play with it. Then they got all crazy on it and wrote a couple of really neat functions. All cool/interesting components of these functions are their ideas and any bugs were introduced by me when I was trying to fiddle with the code at the end.</p>
<p>So how does it work? <a href="http://biostat.jhsph.edu/~jleek/code/googleCite.r" target="_blank">Here</a> is the code. You can source the functions like so:</p>
<p>source("http://biostat.jhsph.edu/~jleek/code/googleCite.r")</p>
<p>This will install the following packages if you don’t have them: wordcloud, tm, sendmailR, RColorBrewer. Then you need to find the URL of a Google Scholar Citations page. Here is Rafa Irizarry’s:</p>
<p><a href="http://scholar.google.com/citations?user=nFW-2Q8AAAAJ" target="_blank">http://scholar.google.com/citations?user=nFW-2Q8AAAAJ</a></p>
<p>You can then call the googleCite function like this:</p>
<p>out = googleCite("http://scholar.google.com/citations?user=nFW-2Q8AAAAJ", pdfname="rafa_wordcloud.pdf")</p>
<p>or search by name like this:</p>
<p>out = searchCite("Rafa Irizarry", pdfname="rafa_wordcloud.pdf")</p>
<p>The function will download all of Rafa’s citation data and put it in the matrix out. It will also make wordclouds of (a) the co-authors on his papers and (b) the titles of his papers and save them in the pdf file specified (There is an option to turn off plotting if you want). Here is what Rafa’s clouds look like:</p>
<p><img height="250" src="http://biostat.jhsph.edu/~jleek/code/rafa_wordcloud.png" width="500" /></p>
<p>We have also written a little function to calculate many of the popular citation indices. You can call it on the output like so:</p>
<p>gcSummary(out)</p>
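<p>If you haven’t run into citation indices before, here is a minimal sketch of one widely used index, the h-index, written in Python. This is not the code inside gcSummary, and I am not claiming gcSummary computes it this way; the function name and the citation counts below are hypothetical.</p>

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    h = 0
    # Walk the papers from most cited to least cited; rank is 1-based.
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts, not anyone's real profile.
print(h_index([10, 8, 5, 4, 3]))
```

<p>In this made-up example four papers have at least four citations each, but there are not five papers with at least five, so the index is 4.</p>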
<p>When you download citation data, an email with the data table will also be sent to Simply Statistics so we can collect information on who is using the function and perform population-level analyses.</p>
<p>If you liked this function you might also be interested in our <a href="http://simplystatistics.tumblr.com/post/11271228367/datascientist">R function to determine if you are a data scientist</a>, or in some of the other stuff going on over at <a href="http://simplystatistics.tumblr.com/" target="_blank">Simply Statistics</a>.</p>
<p>Enjoy!</p>
Data Scientist vs. Statistician
2011-11-22T12:41:00+00:00
http://simplystats.github.io/2011/11/22/data-scientist-vs-statistician
<p>There’s an interesting discussion over at reddit on <a href="http://www.reddit.com/r/MachineLearning/comments/mhodz/data_scientist_vs_statistician/" target="_blank">the difference between a data scientist and a statistician</a>. My crude summary of the discussion is that by and large they are the same but the phrase “data scientist” is just the hip new name for statistician that will probably sound stupid 5 years from now.</p>
<p>My question is why isn’t “statistician” hip? The comments don’t seem to address that much (although a few go in that direction). There are a few interesting comments about computing. For example from ByteMining:</p>
<blockquote>
<p>Statisticians typically don’t care about performance or coding style as long as it gets a result. A loop within a loop within a loop is all the same as an O(1) lookup.<br /></p>
</blockquote>
<p>Another more down-to-earth comment comes from marshallp:</p>
<blockquote>
<p>There is a real distinction between data scientist and statistician</p>
<ul>
<li>
<p>the statistician spent years banging his/her head against blackboards full of math notation to get a modestly paid job</p>
</li>
<li>
<p>the data scientist gets s—loads of cash after having learnt a scripting language and an api</p>
</li>
</ul>
<p>More people should be encouraged into data science and not pointless years of stats classes</p>
</blockquote>
<p>Not sure I fully agree but I see where he’s coming from!</p>
<p>[Note: See also our post on <a href="http://simplystatistics.tumblr.com/post/11271228367/datascientist" target="_blank">how to determine whether you are a data scientist</a>.]</p>
Ozone rules
2011-11-20T18:35:00+00:00
http://simplystats.github.io/2011/11/20/ozone-rules
<p>A recent article in the New York Times describes the backstory behind the <a href="http://www.nytimes.com/2011/11/17/science/earth/policy-and-politics-collide-as-obama-enters-campaign-mode.html" target="_blank">decision to not revise the ozone national ambient air quality standard</a>. This article highlights the reality of balancing the need to set air pollution regulation to protect public health and the desire to get re-elected. Not having ever served in politics (does being elected to the faculty senate count?) I can’t comment on the political aspect. But I wanted to highlight some of the scientific evidence that goes into developing these standards. </p>
<!-- more -->
<p>A bit of background: the Clean Air Act of 1970 and its subsequent amendments require that national ambient air quality standards be set to protect public health with “an adequate margin of safety”. Ozone (usually referred to as smog in the press) is one of the pollutants for which standards are set, along with particulate matter, nitrogen oxides, sulfur dioxide, carbon monoxide, and airborne lead. Importantly, the Clean Air Act requires the EPA to set standards based on the best available scientific evidence.</p>
<p>The ozone standard was re-evaluated years ago under the (second) Bush administration. At the time, the EPA staff recommended a daily standard of between 60 and 70 ppb as providing an adequate margin of safety. Roughly speaking, if the standard is 70 ppb, this means that states cannot have levels of ozone higher than 70 ppb on any given day (that’s not exactly true but the real standard is a mouthful). Stephen Johnson, EPA administrator at the time, set the standard at 75 ppb, citing in part the lack of evidence showing a link between ozone and health at low levels.</p>
<p>We’ve conducted epidemiological analyses that show that <a href="http://www.ncbi.nlm.nih.gov/pubmed/16581541" target="_blank">ozone is associated with mortality even at levels far below 60 ppb</a> (See Figure 2). Note, this paper was not published in time to make it into the previous EPA review. The study suggests that if a threshold exists below which ozone has no health effect, it is probably at a level lower than the current standard, possibly nearing natural background levels. Detecting thresholds at very low levels is challenging because you start running out of data quickly. But <a href="http://www.ncbi.nlm.nih.gov/pubmed/14757374" target="_blank">other</a> <a href="http://www.ncbi.nlm.nih.gov/pubmed/9541366" target="_blank">studies</a> that have attempted to do this have found results similar to ours.</p>
<p>The bottom line is pollution levels below current air quality standards should not be misinterpreted as safe for human health.</p>
Show 'em the data!
2011-11-20T01:59:00+00:00
http://simplystats.github.io/2011/11/20/show-em-the-data
<div>
<p>
    In a previous <a href="http://simplystatistics.tumblr.com/post/12599452125/expected-salary-by-major" target="_blank">post</a> I argued that students entering college should be shown data on job prospects by major. This week I found out the American Bar Association might <a href="http://www.abajournal.com/news/article/aba_committee_appears_poised_to_adopt_new_jobs_placement_standard/" target="_blank">make it a requirement for law school accreditation.</a>
</p>
<p>
Hat tip to Willmai Rivera.
</p>
</div>
Interview with Héctor Corrada Bravo
2011-11-18T17:52:01+00:00
http://simplystats.github.io/2011/11/18/interview-with-h-ctor-corrada-bravo
<p><strong>Héctor Corrada Bravo</strong></p>
<p><strong><img height="200" src="http://biostat.jhsph.edu/~jleek/hcb.jpg" width="300" /></strong></p>
<p>Héctor Corrada Bravo is an assistant professor in the Department of Computer Science and the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park. He moved to College Park after finishing his Ph.D. in computer science at the University of Wisconsin and a postdoc in biostatistics at the Johns Hopkins Bloomberg School of Public Health. He has done outstanding work at the intersection of molecular biology, computer science, and statistics. For more info check out his <a href="http://www.cbcb.umd.edu/~hcorrada/" target="_blank">webpage</a>.</p>
<!-- more -->
<p><strong><span>Which term applies to you: statistician/data scientist/computer</span></strong><br />
<strong><span>scientist/machine learner?</span></strong></p>
<p><span>I want to understand interesting phenomena (in my case mostly in</span><br />
<span>biology and medicine) and I </span><span>believe that our ability to collect a large number of relevant</span><br />
<span>measurements and infer characteristics of these phenomena can drive</span><br />
<span>scientific discovery and commercial innovation in the near future.</span><br />
<span>Perhaps that makes me a data scientist and means that depending on the</span><br />
<span>task at hand one or more of the other terms apply.</span></p>
<p><span>A lot of the distinctions many people make between these terms are</span><br />
<span>vacuous and unnecessary, but some are nonetheless useful to think</span><br />
<span>about. For example, both statisticians and machine learners [sic] know</span><br />
<span>how to create statistical algorithms </span><span>that compute interesting and informative objects using measurements </span><span>(perhaps) obtained through some stochastic or partially observed</span><br />
<span>process. These objects could be genomic tools for cancer screening, or</span><br />
<span>statistics that better reflect the relative impact of baseball players</span><br />
<span>on team success.</span><br />
<span> </span></p>
<p><span>Both fields also give us ways to evaluate and characterize these objects.</span><br />
<span>However, there are times when these objects are tools that fulfill an</span><br />
<span>immediately utilitarian purpose and thinking like an engineer might</span><br />
<span>(as many people in Machine Learning do) is the right approach.</span><br />
<span>Other times, these objects are there to help us get insights about our</span><br />
<span>world and thinking in ways that many statisticians do is the right</span><br />
<span>approach. You need both of these ways of thinking to do interesting</span><br />
<span>science and dogmatically avoiding either of them is a terrible idea.</span></p>
<p><strong><span>How did you get into statistics/data science (i.e. your history)?</span></strong></p>
<p><span>I got interested in Artificial Intelligence at one point, and found</span><br />
<span>that my mathematics background was nicely suited to work on this. Once</span><br />
<span>I got into it, thinking about statistics and how to analyze and</span><br />
<span>interpret data was natural and necessary. I started working with two</span><br />
<span>wonderful advisors at Wisconsin, Raghu Ramakrishnan (CS) and Grace Wahba (Statistics)</span><br />
<span>that helped shape the way I approach problems from different angles</span><br />
<span>and with different goals. The last piece was discovering that</span><br />
<span>computational biology is a fantastic setting in which to apply and</span><br />
<span>devise these methods </span><span>to answer really interesting questions.</span></p>
<p><strong><span>What is the problem currently driving you?</span></strong></p>
<p><span>I’ve been working on cancer epigenetics to find specific genomic</span><br />
<span>measurements for which increased stochasticity appears to be general</span><br />
<span>across multiple cancer types. Right now, I’m really wondering how far</span><br />
<span>into the clinic can these discoveries be taken, if at all. For</span><br />
<span>example, can we build tools that use these genomic measurements to</span><br />
<span>improve cancer screening?</span></p>
<p><strong><span>How do you see CS/statistics merging in the future?</span></strong></p>
<p><span>I think that future got here some time ago, but is about to get much</span><br />
<span>more interesting.</span></p>
<p><span>Here is one example: Computer Science is about creating and analyzing</span><br />
<span>algorithms and building the systems that can implement them. Some of</span><br />
<span>what </span><span>many computer scientists have done looks at problems concerning how to</span><br />
<span>keep, find and ship around information (Operating Systems, Networks,</span><br />
<span>Databases, etc.). Many times these have been driven by very specific</span><br />
<span>needs, e.g., commercial transactions in databases. In some ways,</span><br />
<span>companies have moved from asking how do I use data to keep track</span><br />
<span>of my activities to how do I use data to decide which activities to do</span><br />
<span>and how to do them. Statistical tools should be used to answer these</span><br />
<span>questions, and systems built by computer scientists have statistical</span><br />
<span>algorithms at their core.</span></p>
<p><strong><span>Beyond R, what are some really useful computational tools for</span></strong><br />
<strong><span>statisticians to know about?</span></strong></p>
<p><span>I think a computational tool that everyone can benefit a lot from</span><br />
<span>understanding better is algorithm design and analysis. This doesn’t</span><br />
<span>have to be at a particularly deep level, but just getting a sense of</span><br />
<span>how long a particular process might take, </span><span>and how to devise a different way of doing it that might make it more </span><span>efficient is really useful. I’ve been toying with the idea of creating </span><span>a CS course called (something like) “Highlights of continuous</span><br />
<span>mathematics for computer science” that reminds everyone of the cool</span><br />
<span>stuff that one learns in math </span><span>now that we can appreciate their usefulness. Similarly, I think</span><br />
<span>statistics students can benefit from “Highlights of discrete</span><br />
<span>mathematics for statisticians”.</span></p>
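<p><span>A minimal sketch of the kind of reasoning described above, assuming the task is to check whether a list contains a repeated value. The function names are hypothetical and chosen only for illustration. The first version compares every pair, so its work grows roughly with the square of the input size; the second remembers what it has already seen and does about one lookup per value.</span></p>

```python
def has_duplicates_quadratic(values):
    """Compare every pair: about n*(n-1)/2 comparisons in the worst case."""
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            if values[i] == values[j]:
                return True
    return False


def has_duplicates_linear(values):
    """Keep a set of values already seen: about n hash lookups on average."""
    seen = set()
    for v in values:
        if v in seen:
            return True
        seen.add(v)
    return False


def pair_comparisons(n):
    """Worst-case number of comparisons made by the pairwise version."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    data = list(range(1000)) + [500]
    print(has_duplicates_quadratic(data), has_duplicates_linear(data))
    print(pair_comparisons(len(data)))
```

<p><span>Both functions return the same answer; the difference is only in how the amount of work scales, which is the sort of back-of-the-envelope estimate that is useful before running something on a large dataset.</span></p>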
<p><span>Now a request for comments below from you and readers: (5a) Beyond R,</span><br />
<span>what are some really useful statistical tools for computer scientists</span><br />
<span>to know about?</span></p>
<p><strong><span>Review times in statistics journals are long, should statisticians</span></strong><br />
<strong><span>move to conference papers?</span></strong></p>
<p><span>I don’t think so. Long review times (anything more than 3 weeks) are</span><br />
<span>really not necessary. We tend to publish in journals with fairly quick</span><br />
<span>review times that produce (for the most part) really useful and</span><br />
<span>insightful reviews.</span><br />
<span> </span></p>
<p><span>I was recently talking to senior members in my field who were telling</span><br />
<span>me stories about the “old times” when CS was moving from mainly</span><br />
<span>publishing in journals to now mainly publishing in conferences. But</span><br />
<span>now, people working </span><span>in collaborative projects (like computational biology) work in fields</span><br />
<span>that primarily publish in journals, so the field needs to be able to</span><br />
<span>properly evaluate their impact and productivity. There is no perfect</span><br />
<span>system.</span><br />
<span> </span></p>
<p><span>For instance, review requests in fields where conferences are the main</span><br />
<span>publication venue come in waves (dictated by conference schedule).</span><br />
<span>Reviewers have a lot of papers to go over in a relatively short time</span><br />
<span>which makes their job of providing really helpful and fair reviews not</span><br />
<span>so easy. So, in that respect, </span><span>the journal system can be better. </span><span>The one thing that is universally </span><span>true is that you don’t need long review times.</span></p>
<p><span><strong>Previous Interviews:</strong> <a href="http://simplystatistics.tumblr.com/post/11436138110/interview-with-daniela-witten" target="_blank">Daniela Witten</a>, <a href="http://simplystatistics.tumblr.com/post/11729003971/interview-with-chris-barr" target="_blank">Chris Barr</a>, <a href="http://simplystatistics.tumblr.com/post/12328728291/interview-with-victoria-stodden" target="_blank">Victoria Stodden</a></span></p>
Google Scholar Pages
2011-11-17T20:31:00+00:00
http://simplystats.github.io/2011/11/17/google-scholar-pages
<p>If you want to get to know more about what we’re working on, you can check out our Google Scholar pages:</p>
<ul>
<li><a href="http://scholar.google.com/citations?user=HI-I6C0AAAAJ" target="_blank">Jeff Leek</a></li>
<li><a href="http://scholar.google.com/citations?user=nFW-2Q8AAAAJ" target="_blank">Rafael Irizarry</a></li>
<li><a href="http://scholar.google.com/citations?user=h5wUydwAAAAJ" target="_blank">Roger Peng</a></li>
</ul>
<p>I’ve only been using it for a day but I’m pretty impressed by how much it picked up. My only problem so far is having to merge different versions of the same paper.</p>
The History Of Nonlinear Principal Components
2011-11-17T17:13:00+00:00
http://simplystats.github.io/2011/11/17/the-history-of-nonlinear-principal-components
<p><a href="http://www.youtube.com/watch?v=V-hFORcBj44">http://www.youtube.com/watch?v=V-hFORcBj44</a></p>
<p>The History of Nonlinear Principal Components Analysis, a lecture given by Jan de Leeuw. For those that have ~45 minutes to spare, it’s a very nice talk given in Jan’s characteristic style.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Amazon EC2 is #42 on Top 500 supercomputer list
2011-11-16T17:05:00+00:00
http://simplystats.github.io/2011/11/16/amazon-ec2-is-42-on-top-500-supercomputer-list
<p><a href="http://www.top500.org/list/2011/11/100">Amazon EC2 is #42 on Top 500 supercomputer list</a></p>
Preparing for tenure track job interviews
2011-11-16T15:58:00+00:00
http://simplystats.github.io/2011/11/16/preparing-for-tenure-track-job-interviews
<p>If you are on the job market you will soon be receiving (or have already received) an invitation for an interview. So how should you prepare? You have two goals. The first is to make a good impression. Here are some tips:</p>
<!-- more -->
<p>1) During your talk, do NOT go over your allotted time. Practice your talk at least twice, both times in front of a live audience that asks questions. </p>
<p>2) Know your audience. If it’s a “math-y” department, give a more “math-y” talk. If it’s an applied department, give a more applied talk. But (sorry for the cliché) be yourself. Don’t pretend to be interested in something you are not. I remember one candidate who pretended to be interested in applications and it backfired badly during the talk. </p>
<p>3) Learn about the faculty’s research interests. This will help during the one-on-one interviews.</p>
<p>4) Be ready to answer the questions “what do you want to teach?” and “where do you see yourself in five years?”</p>
<div>
5) I can’t think of any department where it is necessary to wear a suit (correct me if I’m wrong in the comments). In some places you might feel uncomfortable wearing a suit while those interviewing you are in <a href="http://owpdb.mfo.de/photoNormal?id=7558" target="_blank">shorts and t-shirt</a>. But do <a href="http://apha.org/NR/rdonlyres/20123290-0DCC-4275-B7B6-8F1F609BC3EB/10147/IMG_1001.JPG" target="_blank">dress up</a>. Show them you care.
</div>
<p>Second, and just as important, you want to figure out if you like the department you are visiting. Do you want to spend the next 5, 10, 50 years there? Make sure to find out as much as you can to answer this question. Some questions are more appropriate for junior faculty, the more sensitive ones for the chair. Here are some example questions I would ask:</p>
<p>1) What are the expectations for promotion? Would you promote someone publishing exclusively in Nature? Somebody publishing exclusively in Annals of Statistics? Is being a PI on an R01 a requirement for tenure? </p>
<p>2) What are the expectations for teaching/service/collaboration? How are teaching and committee service assignments made? </p>
<p>3) How did you connect with your collaborators? How are these connections made?</p>
<p>4) What percent of my salary am I expected to cover? Is it possible to do this by being a co-investigator?</p>
<p>5) Where do you live? How are the schools? How is the commute? </p>
<p>6) How many graduate students does the department have? How are graduate students funded? If I want someone to work with me, do I have to cover their stipend/tuition?</p>
<p>Specific questions for the junior Faculty:</p>
<p>Are the expectations for promotion made clear to you? Do you get feedback on your progress? Do the senior faculty mentor you? Do the senior faculty get along? What do you like most about the department? What can be improved? In the last 10 years, what percent of junior faculty were promoted?</p>
<p>Questions for the chair:</p>
<p>What percent of my salary am I expected to cover? How soon? Is there bridge funding? What is a standard startup package? Can you describe the promotion process in detail? What space is available for postdocs? (for hard money place) I love teaching, but can I buy out teaching with grants? </p>
<p>I am sure I missed stuff, so please comment away….</p>
<p><strong>Update</strong>: I can’t believe I forgot computing! Make sure to ask about computing support. This varies a lot from place to place. Some departments share amazing systems; in others you might have to buy your own hardware. Ask how costs are shared. How is the IT staff? Is R supported? Get <strong>all</strong> the details.</p>
OK Cupid data on Infochimps - anybody got $1k for data?
2011-11-16T12:55:04+00:00
http://simplystats.github.io/2011/11/16/ok-cupid-data-on-infochimps-anybody-got-1k-for-data
<p>OK Cupid is an online dating site that has grown its visibility in part through a pretty awesome blog called <a href="http://blog.okcupid.com/" target="_blank">OK Trends</a>, where they have analyzed their online dating data to, for example, <a href="http://blog.okcupid.com/index.php/the-4-big-myths-of-profile-pictures/" target="_blank">show you what kind of profile picture works best</a>. Now, they have compiled <a href="http://www.infochimps.com/datasets/personality-insights-okcupid-questions-and-answers-by-gender-age" target="_blank">data</a> from their personality survey and made it available online through <a href="http://www.infochimps.com/" target="_blank">Infochimps</a>. We have talked about Infochimps before; it is basically a site for distributing/selling data. Unfortunately, the OK Cupid data costs $1000. I can think of some cool analyses we could do with this data, but the price is a little steep for me. Anybody got a grand they want to give me to buy some data? </p>
<p><strong>Related Posts</strong>: Jeff on <a href="http://simplystatistics.tumblr.com/post/11237403492/apis" target="_blank">APIs</a>, Jeff on <a href="http://simplystatistics.tumblr.com/post/10410458080/data-sources" target="_blank">Data sources</a>, Roger on <a href="http://simplystatistics.tumblr.com/post/10441403664/private-health-insurers-to-release-data" target="_blank">Private health insurers to release data</a></p>
First 100 Posts
2011-11-15T17:20:06+00:00
http://simplystats.github.io/2011/11/15/first-100-posts
<p>In honor of us passing the 100 post milestone, I’ve collected a few of our more interesting posts from the past 3 months for those who have not been avid followers from Day 1. Enjoy!</p>
<ul>
<li>First Post: <a href="http://simplystatistics.tumblr.com/post/9954726952/data-science-hot-career-choice" target="_blank">Data Science = Hot Career Choice</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/10124797490/advice-for-stats-students-on-the-academic-job-market" target="_blank">Advice for stats students on the academic job market</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/10558246695/getting-email-responses-from-busy-people" target="_blank">Getting responses from busy people</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/10686092687/25-minute-seminars" target="_blank">25 minute seminars</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">The future of graduate education</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/11271228367/datascientist" target="_blank">An R function to determine if you are a data scientist</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/11988685443/computing-on-the-language" target="_blank">Computing on the language</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/11695813030/finding-good-collaborators" target="_blank">Finding good collaborators</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/11436138110/interview-with-daniela-witten" target="_blank">Interview with Daniela Witten</a></li>
<li><a href="http://simplystatistics.tumblr.com/post/12469660993/is-statistics-too-darn-hard" target="_blank">Is statistics too darn hard?</a></li>
</ul>
The Cost of a U.S. College Education
2011-11-14T19:44:40+00:00
http://simplystats.github.io/2011/11/14/the-cost-of-a-u-s-college-education
<p>As a follow up to my previous post on <a href="http://simplystatistics.tumblr.com/post/12599452125/expected-salary-by-major" target="_blank">expected salaries by majors</a> I want to share the following graph:</p>
<p><img src="http://www.tonybates.ca/wp-content/uploads/Tuition-costs-USA.jpg" alt="" /></p>
<p>So why is the cost of higher education going up at a faster rate than almost everything else? Economists please correct me if I’m wrong, but it must be that demand grew, right? Universities are non-profits so they didn’t necessarily have to respond by increasing offers. <a href="http://nces.ed.gov/fastfacts/display.asp?id=98" target="_blank">But apparently they did</a>. So if the proportion of the population going to college grew, why is there a <a href="http://www.nytimes.com/2011/11/06/education/edlife/why-science-majors-change-their-mind-its-just-so-darn-hard.html?_r=3" target="_blank">shortage of STEM majors</a>? I think it’s because the proportion of the population that can complete such a degree has not changed since 1985 and most of those people were already going to college. If this is right, then it implies that to make more offers, the universities had to grow majors with higher graduation rates. The graph below (taken from <a href="http://marginalrevolution.com/marginalrevolution/2011/11/college-has-been-oversold.html" target="_blank">here</a>) seems to confirm this:</p>
<p><img src="http://marginalrevolution.com/wp-content/uploads/2011/11/EducationTabarrok-300x296.png" width="300" height="296" /></p>
<p>Unfortunately, in 1985 there was no dearth of psychologists, visual and performing artists, and journalists. So we should not be surprised that the increase in their numbers resulted in graduates from these fields having a harder time finding employment (see bottom of <a href="http://rafalab.jhsph.edu/images/salarytable.html" target="_blank">this table</a>). Meanwhile, the US has <a href="http://www.npr.org/blogs/thetwo-way/2011/06/15/137203549/two-million-open-jobs-yes-but-u-s-has-a-skills-mismatch" target="_blank">2 million job openings</a> that can’t be filled, many in <a>vocational careers</a>. So why aren’t more students opting for technical training with good job prospects? In this <a href="http://www.nytimes.com/2011/07/10/business/vocational-schools-face-deep-cuts-in-federal-funding.html?pagewanted=all" target="_blank">NYTimes article</a>, Motoko Rich explains that</p>
<blockquote>
<p>In European countries like Germany, Denmark and Switzerland, vocational programs have long been viable choices for a significant portion of teenagers. Yet in the United States, technical courses have often been viewed as the ugly stepchildren of education, backwaters for underachieving or difficult students.</p>
</blockquote>
<p>It’s hard not to think that universities have benefited from the social stigma associated with vocational degrees. In any case, as I said in my previous <a href="http://simplystatistics.tumblr.com/post/12599452125/expected-salary-by-major" target="_blank">post</a>, I am not interested in telling people what to study, but universities should show students the data.</p>
New O'Reilly book on parallel R computation
2011-11-14T17:14:00+00:00
http://simplystats.github.io/2011/11/14/new-oreilly-book-on-parallel-r-computation
<p><a href="http://shop.oreilly.com/product/0636920021421.do">New O’Reilly book on parallel R computation</a></p>
Cooperation between Referees and Authors Increases Peer Review Accuracy
2011-11-11T17:17:56+00:00
http://simplystats.github.io/2011/11/11/cooperation-between-referees-and-authors-increases-peer
<p>Jeff Leek and colleagues just published an article in PLoS ONE on the <a href="http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0026895" target="_blank">differences between anonymous (closed) and non-anonymous (open) peer review of research articles</a>. They developed a “peer review game” as a model system to track authors’ and reviewers’ behavior over time under open and closed systems.</p>
<p>Under the open system, it was possible for authors to see who was reviewing their work. They found that under the open system authors and reviewers tended to cooperate by reviewing each other’s work. Interestingly, they say</p>
<blockquote>
<p><span>It was not immediately clear that cooperation between referees and authors would increase reviewing accuracy. Intuitively, one might expect that players who cooperate would always accept each others solutions - regardless of whether they were correct. However, we observed that when a submitter and reviewer acted cooperatively, reviewing accuracy actually increased by 11%.</span></p>
</blockquote>
Expected Salary by Major
2011-11-10T15:04:00+00:00
http://simplystats.github.io/2011/11/10/expected-salary-by-major
<p>In this <a href="http://www.thenation.com/article/164348/audacity-occupy-wall-street" target="_blank">recent editorial</a> about the Occupy Wall Street movement, Richard Kim profiles a protestor who, despite having a master’s degree, can’t find a job. This particular protestor quit his job as a school teacher three years ago and took out a $35K student loan to obtain a master’s degree in puppetry from the University of Connecticut. I wonder if, before taking his money, UConn showed this person data on job prospects for their puppetry graduates. More generally, I wonder if any university shows their idealistic 18-year-old freshmen such data.</p>
<p><img height="600" width="480" src="http://rafalab.jhsph.edu/images/salaryvsrank.png" /></p>
<p>Georgetown’s <a href="http://cew.georgetown.edu/">Center for Education and the Workforce</a> has an informative <a href="http://cew.georgetown.edu/whatsitworth/">interactive webpage</a> that students can use to find out by-major salary information. I scraped data from this <a href="http://graphicsweb.wsj.com/documents/NILF1111/#term=">Wall Street Journal webpage</a>, which also provides, for each major, unemployment rates, salary quartiles, and its rank in popularity. I used these data to compute expected salaries by multiplying median salary by the percent employed. The graph above shows expected salary versus popularity rank (1=most popular) for the 50 most popular majors (go <a href="http://rafalab.jhsph.edu/images/salarytable.html">here</a> for a complete table and <a href="http://rafalab.jhsph.edu/images/majors.zip">here</a> for the raw data and code). I also included Physics (the 70th). I used different colors to represent four categories: engineering, math/stat/computers, physical sciences, and the rest. As a baseline I added a horizontal line representing the average salary for a truck driver: $65K, a job currently with <a href="http://www.npr.org/2011/10/13/141325299/a-labor-mismatch-means-trucking-jobs-go-unfilled">plenty of openings</a>. Different font sizes are used only to make names fit. A couple of observations stand out. First, only one of the top 10 most popular majors, Computer Science, has a higher expected salary than truck drivers. Second, Psychology, the fifth most popular major, has an expected salary of $40K and, as seen in <a href="http://rafalab.jhsph.edu/images/salarytable.html" target="_blank">the table</a>, an unemployment rate of 6.1%, almost three times that of nursing.</p>
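<p>For readers who want to redo the arithmetic, here is a minimal sketch of the expected salary calculation. It assumes that the percent employed is one minus the unemployment rate. The function name and the example numbers are hypothetical and are not taken from the scraped data.</p>

```python
def expected_salary(median_salary, unemployment_rate):
    """Median salary weighted by the fraction of graduates who are employed.

    Assumption: fraction employed = 1 - unemployment rate (given as a fraction).
    """
    if not 0.0 <= unemployment_rate <= 1.0:
        raise ValueError("unemployment_rate must be a fraction between 0 and 1")
    return median_salary * (1.0 - unemployment_rate)


if __name__ == "__main__":
    # Hypothetical major: $50,000 median salary and 6.1% unemployment.
    print(round(expected_salary(50_000, 0.061)))
```

<p>With these made-up inputs the expected salary is about $46,950, that is, the median salary discounted by the 6.1% of graduates who are unemployed.</p>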
<p><strong>A few editorial remarks:</strong> 1) I understand that being a truck driver is very hard and that there is little room for career development. 2) I am not advocating that people pick majors based on future salaries. 3) I think college freshmen deserve to know the data given how much money they fork over to us. 4) The graph is for bachelor’s degrees, not graduate education. The <a href="http://cew.georgetown.edu/whatsitworth/">CEW</a> website includes data for graduate degrees. Note that Biology shoots way up with a graduate degree. 5) For those interested in a PhD in Statistics I recommend you major in Math with a minor in a liberal arts subject, such as English, while taking as many programming classes as you can. We all know Math is the base for everything statisticians do, but why English? Students interested in academia tend to underestimate the <a href="http://bulletin.imstat.org/2011/09/terence%E2%80%99s-stuff-speaking-reading-writing/" target="_blank">importance of writing and communicating</a>.</p>
<p><strong>Related articles:</strong> <a href="http://www.nytimes.com/2011/11/06/education/edlife/why-science-majors-change-their-mind-its-just-so-darn-hard.html?_r=2" target="_blank">This</a> NY Times article describes how/why students are leaving the sciences. <a href="http://marginalrevolution.com/marginalrevolution/2011/11/college-has-been-oversold.html">Here</a>, Alex Tabarrok describes big changes in the balance of majors between 1985 and today and <a href="http://marginalrevolution.com/marginalrevolution/2011/11/not-from-the-onion-3.html" target="_blank">here</a> he shares his thoughts on Richard Kim’s editorial. Matt Yglesias explains that <a href="http://thinkprogress.org/yglesias/2011/11/08/363587/unemployment-is-rising-across-the-board/" target="_blank">unemployment is rising across the board</a>. Finally, Peter Orszag shares <a href="http://www.bloomberg.com/news/2011-11-09/winds-of-economic-change-blow-away-college-degree-peter-orszag.html" target="_blank">his views</a> on how a changing world is changing the value of a college degree.</p>
<p>Hat tip to David Santiago for sending various of these links and Harris Jaffee for help with scrap<strike>p</strike>ing.</p>
Statisticians on Twitter...help me find more!
2011-11-09T17:10:06+00:00
http://simplystats.github.io/2011/11/09/statisticians-on-twitter-help-me-find-more
<p>In honor of our blog finally dragging itself into the 21st century and jumping onto Twitter/Facebook, I have been compiling a list of statistical people on Twitter. I couldn’t figure out an easy way to find statisticians in one go (which could be because I don’t have Twitter skills). </p>
<p>So here is my very informal list of statisticians I found in a half hour of searching. I know I missed a ton of people; let me know who I missed so I can update!</p>
<p><a href="http://twitter.com/#!/leekgroup" target="_blank">@leekgroup</a> - Jeff Leek (What, you thought I’d list someone else first?)</p>
<p><a href="http://twitter.com/#!/rdpeng" target="_blank">@rdpeng</a> - Roger Peng</p>
<p><a href="http://twitter.com/#!/rafalab" target="_blank">@rafalab</a> - Rafael Irizarry</p>
<p><a href="http://twitter.com/#!/storeylab" target="_blank">@storeylab</a> - John Storey</p>
<p><a href="http://twitter.com/#!/bcaffo" target="_blank">@bcaffo</a> - Brian Caffo</p>
<p><a href="http://twitter.com/#!/sherrirose" target="_blank">@sherrirose</a> - Sherri Rose</p>
<p><a href="http://twitter.com/#!/raphg" target="_blank">@raphg</a> - Raphael Gottardo</p>
<p><a href="http://twitter.com/#!/airoldilab" target="_blank">@airoldilab</a> - Edo Airoldi</p>
<p><a href="http://twitter.com/#!/stat110" target="_blank">@stat110</a> - Joe Blitzstein</p>
<p><a href="http://twitter.com/#!/tylermccormick" target="_blank">@tylermccormick</a> - Tyler McCormick</p>
<p><a href="http://twitter.com/#!/statpumpkin" target="_blank">@statpumpkin</a> - Chris Volinsky</p>
<p><a href="http://twitter.com/#!/fivethirtyeight" target="_blank">@fivethirtyeight</a> - Nate Silver</p>
<p><a href="http://twitter.com/#!/flowingdata" target="_blank">@flowingdata</a> - Nathan Yau</p>
<p><a href="http://twitter.com/#!/kinggary" target="_blank">@kinggary</a> - Gary King</p>
<p><a href="http://twitter.com/#!/StatModeling" target="_blank">@StatModeling</a> - Andrew Gelman</p>
<p><a href="http://twitter.com/#!/AmstatNews" target="_blank">@AmstatNews</a> - Amstat News</p>
<p><a href="http://twitter.com/#!/hadleywickham" target="_blank">@hadleywickham</a> - Hadley Wickham</p>
Coarse PM and measurement error paper
2011-11-08T17:05:05+00:00
http://simplystats.github.io/2011/11/08/coarse-pm-and-measurement-error-paper
<p><a href="http://www.sph.emory.edu/cms/departments_centers/bios/faculty/index.php?Network_ID=HHCHANG" target="_blank">Howard Chang</a>, a former PhD student of mine now at Emory, just published a paper on a <a href="http://www.ncbi.nlm.nih.gov/pubmed/21297159" target="_blank">measurement error model for estimating the health effects of coarse particulate matter (PM)</a>. This is a cool paper that deals with the problem that coarse PM tends to be very spatially heterogeneous. Coarse PM is a bit of a hot topic now because there is currently no national ambient air quality standard for coarse PM specifically. There is a standard for fine PM, but compared to fine PM, the scientific evidence for health effects of coarse PM is relatively less developed. </p>
<p>When you want to assign a coarse PM exposure level to people in a county (assuming you don’t have personal monitoring) there is a fair amount of uncertainty about the assignment because of the spatial variability. This is in contrast to pollutants like fine PM or ozone which tend to be more spatially smooth. Standard approaches essentially ignore the uncertainty which may lead to some bias in estimates of the health effects.</p>
<p>Howard developed a measurement error model that uses observations from multiple monitors to estimate the spatial variability and correct for it in time series regression models estimating the health effects of coarse PM. Another nice thing about his approach is that it avoids any complex spatial-temporal modeling to do the correction.</p>
<p><strong>Related Posts:</strong> Jeff on “<a href="http://simplystatistics.tumblr.com/post/11024349209/cool-papers" target="_blank">Cool papers</a>” and “<a href="http://simplystatistics.tumblr.com/post/10204192286/dissecting-the-genomics-of-trauma" target="_blank">Dissecting the genomics of trauma</a>”</p>
Is Statistics too darn hard?
2011-11-07T15:33:00+00:00
http://simplystats.github.io/2011/11/07/is-statistics-too-darn-hard
<p>In <a href="http://www.nytimes.com/2011/11/06/education/edlife/why-science-majors-change-their-mind-its-just-so-darn-hard.html?_r=1" target="_blank">this</a> NY Times article, Christopher Drew points out that many students planning engineering and science majors end up switching to other subjects or fail to get any degree. He argues that this is partly due to the difficulty of classes. In a <a href="http://simplystatistics.tumblr.com/post/12241459446/we-need-better-marketing" target="_blank">previous post</a> we lamented the anemic growth in math and statistics majors in comparison to other majors. I do not think we should make our classes easier just to keep these students. But we can certainly do a better job of motivating the material and teaching it in a more interesting way. After having fun in high school science classes, students entering college are faced with the reality that the first college science classes can be abstract and technical. But in Statistics we certainly can be teaching the practical aspects first. Learning the abstractions is so much easier and more enjoyable when you understand the practical problem behind the math. And in Statistics there is always a practical aspect behind the math. The statistics class I took in college was so dry and removed from reality that I can see why it would turn students away from the subject. So, if you are teaching undergrads (or grads) I highly recommend the <a href="http://128.32.135.2/users/statlabs/" target="_blank">Stat Labs textbook</a> by Deb Nolan and Terry Speed that teaches Mathematical Statistics through applications. If you know of other good books please post in the comments. Also, if you know of similar books for other science, technology, engineering, and math (STEM) subjects please share as well.</p>
<p><strong>Related Posts:</strong> Jeff on “<a href="http://simplystatistics.tumblr.com/post/12076163379/the-5-most-critical-statistical-concepts" target="_blank">The 5 most critical statistical concepts</a>”, Rafa on “<a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">The future of graduate education</a>”, Jeff on “<a href="http://simplystatistics.tumblr.com/post/11770724755/graduate-student-data-analysis-inspired-by-a" target="_blank">Graduate student data analysis inspired by a high-school teacher</a>”</p>
Reproducible research: Notes from the field
2011-11-06T16:13:05+00:00
http://simplystats.github.io/2011/11/06/reproducible-research-notes-from-the-field
<p>Over the past year, I’ve been doing a lot of talking about reproducible research. Talking to people, talking on panel discussions, and talking about <a href="http://simplystatistics.tumblr.com/post/12243614318/i-gave-a-talk-on-reproducible-research-back-in" target="_blank">some of my own work</a>. It seems to me that interest in the topic has exploded recently, in part due to some recent scandals, such as the <a href="http://simplystatistics.tumblr.com/post/10068195751/the-duke-saga" target="_blank">Duke clinical trials fiasco</a>.</p>
<p>If you are unfamiliar with the term “reproducible research”, the basic idea is that authors of published research should make available the necessary materials so that others may reproduce to a very high degree of similarity the published findings. If that definition seems imprecise, well that’s because it is.</p>
<!-- more -->
<p>I think reproducibility becomes easier to define in the context of a specific field or application. Reproducibility often comes up in the context of computational science. In computational science fields, much of the work is done on the computer, often using very large amounts of data. In other words, the analysis of the data is of comparable difficulty to the collection of the data (maybe even more complicated). Then the notion of reproducibility typically comes down to the idea of making the analytic data and the computer code available to others. That way, knowledgeable people can run your code on your data and presumably get your results. If others do not get your results, then that may be a sign of a problem, or perhaps a misunderstanding. In either case, a resolution needs to be found. Reproducibility is key to science much the way it is key to programming. When bugs are found in software, being able to reproduce the bug is an important step to fixing it. Anyone learning to program in C knows the pain of dealing with a memory-related bug, which will often exhibit seemingly random and non-reproducible behavior.</p>
<p>My discussions with others about the need for reproducibility in science often range far and wide. One reason is that many people have very different ideas about (a) what reproducibility is and (b) why we need it. Here is my take on various issues.</p>
<ul>
<li><strong>Reproducibility is not replication</strong>. There’s often honest confusion between the notion of reproducibility and what I would call a “full replication”. A full replication doesn’t analyze the same dataset, but rather involves an independent investigator collecting an independent dataset and conducting an independent analysis. Full replication has been a fundamental component of science for a long time now and will continue to be the primary yardstick for measuring the plausibility of scientific claims. I think most would agree that full replication is preferable, but often it is simply not possible.</li>
<li><strong>Reproducibility is not needed solely to prevent fraud</strong>. I’ve heard many people emphasize reproducibility as a means to prevent fraud. Journal editors seem to think this is the main reason for demanding reproducibility. It is <em>one</em> reason, but to be honest, I’m not sure it’s all that useful for detecting fraud. If someone truly wants to commit fraud, then it’s possible to make the fraud reproducible. If I just generate a bunch of numbers and claim it as data that I collected, any analysis from that dataset can be reproducible. While demanding reproducibility may be useful for ferreting out certain types of fraud, it’s not a general solution and it’s not the primary reason we need it. </li>
<li><strong>Reproducibility is not as easy as it sounds</strong>. Making one’s research reproducible is hard. It’s especially hard when you try to do it <em>after</em> the research has been done. In that case it’s more like an audit, and I’m guessing for most people the word “audit” is NOT synonymous with “fun”. Even if you set out to make your work reproducible from the get go, it’s easy to miss things. Code can get lost (even with a version control system) and metadata can slip through the cracks. Even when you’ve done everything right, computers and software can change. Virtual machines like Amazon EC2 and others seem to have some potential. The single most useful tool that I have found is a good version control system, like <a href="http://git-scm.com/" target="_blank">git</a>. </li>
<li><strong>At this point, anything would be better than nothing</strong>. Right now, I think the bar for reproducibility is quite low in the sense that most published work is not reproducible. Even if data are available, often the code that analyzed the data is not available. So if you’re publishing research and you want to make it at least partially reproducible, just put what you can out there. On the web, on <a href="http://github.com" target="_blank">github</a>, in a data repository, wherever you can. If you can’t publish the data, make your code available. Even that is better than nothing. In fact, I find reading someone’s code to be very informative and often questions can arise without looking at the data. Until we have a better infrastructure for distributing reproducible research, we will have to make do with what we have. But if we all start putting stuff out there, the conversation will turn from “Why should I make stuff available?” to “Why wouldn’t I make stuff available?”</li>
</ul>
New ways to follow Simply Statistics
2011-11-05T16:11:05+00:00
http://simplystats.github.io/2011/11/05/new-ways-to-follow-simply-statistics
<p>In case you prefer to follow Simply Statistics using some other platforms, we’ve added two new features. First, we have an official <a href="http://twitter.com/simplystats" target="_blank">Twitter feed</a> that you can follow. We also have a new <a href="http://facebook.com/simplystatistics" target="_blank">Facebook page</a> that you can like. Please follow us and join the discussion!</p>
Interview with Victoria Stodden
2011-11-04T16:06:05+00:00
http://simplystats.github.io/2011/11/04/interview-with-victoria-stodden
<p><strong>Victoria Stodden</strong></p>
<p><img height="300" width="250" src="http://biostat.jhsph.edu/~jleek/vcs.jpg" /></p>
<p>Victoria Stodden is an assistant professor of statistics at Columbia University in New York City. She moved to Columbia after getting her Ph.D. at Stanford University. Victoria has made major contributions to the area of reproducible research and has been appointed to the NSF’s Advisory Committee for Infrastructure. She is the recent recipient of an NSF grant for “Policy Design for Reproducibility and Data Sharing in Computational Science”.</p>
<!-- more -->
<p><strong>Which term applies to you: data scientist/statistician/analyst (or something else)?</strong></p>
<p>Definitely statistician. My PhD is from the stats department at Stanford University.</p>
<p><strong>How did you get into statistics/data science (e.g. your history)?</strong></p>
<p>Since my undergrad days I’ve been motivated by problems in what’s called ‘social welfare economics.’ I interpret that as studying how people can best reach their potential, particularly how the policy environment affects outcomes. This includes the study of regulatory design, economic growth, access to knowledge, development, and empowerment. My undergraduate degree was in economics, and I thought I would carry on with a PhD in economics as well. I realized that folks with my interests were mostly doing empirical work so I thought I should prepare myself with the best training I could in statistics. Hence I chose to do a PhD in statistics to augment my data analysis capabilities as much as I could since I envisioned myself immersed in empirical research in the future.</p>
<p><strong>What is the problem currently driving you?</strong></p>
<p>Right now I’m working on the problem of reproducibility in our body of published computational science. This ties into my interests because of the critical role of knowledge and reasoning in advancing social welfare. Scientific research is becoming heavily computational and as a result the empirical work scientists do is becoming more complex and yet less tacit: the myriad decisions made in data filtering, analysis, and modeling are all recordable in code. In computational research there are so many details in the scientific process it is nearly impossible to communicate them effectively in the traditional scientific paper – rendering our published computational results unverifiable, if there isn’t access to the code and data that generated them.</p>
<p>Access to the code and data permits readers to check whether the descriptions in the paper correspond to the published results, and allows people to understand why independent implementations of the methods in the paper might produce differing results. It also puts the tools of scientific reasoning into people’s hands – this is new. For much of scientific research today all you need is an internet connection to download the reasoning associated with a particular result. Wide availability of the data and code is still largely a dream, but one the scientific community is moving towards.</p>
<p><strong>Who were really good mentors to you? What were the qualities that really helped you?</strong></p>
<p>My advisor, David Donoho, is an enormous influence. He is the clearest scientific thinker I have ever been exposed to. I’ve been so very lucky with the people who have come into my life. Through his example, Dave is the one who has had the most impact on how I think about and prioritize problems and how I understand our role as statisticians and scientific thinkers. He’s given me an example of how to do this and it’s hard to overestimate his influence in my life.</p>
<p><strong>What do you think are the barriers to reproducible research?</strong></p>
<p>At this point, incentives. There are many concrete barriers, which I talk about in my papers and talks (available on my website <a href="http://stodden.net" target="_blank">http://stodden.net</a>), but they all stem from misaligned incentives. If you think about it, scientists do lots of things they don’t particularly like in the interest of research communication and scientific integrity. I don’t know any computational scientist who really loves writing up their findings into publishable articles for example, but they do. This is because the right incentives exist. A big part of the work I am doing concerns the scientific reward structure. For example, my work on the Reproducible Research Standard is an effort to realign the intellectual property rules scientists are subject to, to be closer to our scientific norms. Scientific norms create the incentive structure for the production of scientific research, providing rewards for doing things people might not do otherwise. For example, scientists have a long established norm of giving up all intellectual property rights over their work in exchange for attribution, which is the currency of success. It’s the same for sharing the code and data that underlies published results – not part of the scientific incentive and reward structure today but becoming so, through adjusting a variety of other factors like funding agency policy, journal publication policy, and expectations at the institutional level.</p>
<p><strong>What have been some success stories in reproducible research?</strong></p>
<p>I can’t help but point to my advisor, David Donoho. An example he gives is his release of <a href="http://www-stat.stanford.edu/~wavelab" target="_blank">http://www-stat.stanford.edu/~wavelab</a> - the first implementation of wavelet routines in MATLAB, before MATLAB included their own wavelet toolbox. The release of the Wavelab code was a factor that he believes made him one of the top 5 highly cited authors in Mathematics in 2000.</p>
<p>Hiring and promotion committees seem to be starting to recognize the difference between candidates that recognize the importance of reproducibility and clear scientific communication, compared to others who seem to be wholly innocent of these issues.</p>
<p>There is a nascent community of scientific software developers that is achieving remarkable success. I co-organized a workshop this summer bringing some of these folks together, see <a href="http://www.stodden.net/AMP2011" target="_blank">http://www.stodden.net/AMP2011</a>. There are some wonderful projects underway to assist in reproducibility, from workflow tracking to project portability to unique identifiers for results reproducible in the cloud. Fascinating stuff.</p>
<p><strong>Can you tell us a little about the legal ramifications of distributing code/data?</strong></p>
<p>Sure. Many aspects of our current intellectual property laws are quite detrimental to the sharing of code and data. I’ll discuss the two most impactful ones. Copyright creates exclusive rights vested in the author for original expressions of ideas – and it’s a default. What this means is that your expression of your idea – your code, your writing, figures you create – is by default copyrighted to you. So for your lifetime and 70+ years after that, you (or your estate) need to give permission for the reproduction and re-use of the work – this is exactly counter to scientific norms of independent verification and building on others’ findings. The Reproducible Research Standard is a suite of licenses that permit scientists to set the terms of use of their code, data, and paper according to scientific norms: use freely but attribute. I have written more about this here: <a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4720221" target="_blank">http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4720221</a></p>
<p>In 1980 Congress passed the Bayh-Dole Act, which was designed to create incentives for access to federally funded scientific discoveries by securing ownership rights for universities with regard to inventions by their researchers. The idea was that these inventions could then be patented and licensed by the university, making the otherwise unavailable technology available for commercial development. Notice that Bayh-Dole was passed on the eve of the computer revolution and Congress could not have foreseen the future importance of code to scientific investigation and its subsequent susceptibility to patentability. The patentability of scientific code now creates incentives to keep the code hidden: to avoid creating prior art in order to maximize the chance of obtaining the patent, and to keep hidden from potential competitors any information that might be involved in commercialization. Bayh-Dole has created new incentives for computational scientists – that of startups and commercialization – that must be reconciled with traditional scientific norms of openness.</p>
<p><strong>Related Posts:</strong> Jeff’s interviews with <a href="http://simplystatistics.tumblr.com/post/11436138110/interview-with-daniela-witten" target="_blank">Daniela Witten</a> and <a href="http://simplystatistics.tumblr.com/post/11729003971/interview-with-chris-barr" target="_blank">Chris Barr</a>. Roger’s <a href="http://simplystatistics.tumblr.com/post/12243614318/i-gave-a-talk-on-reproducible-research-back-in" target="_blank">talk on reproducibility</a> </p>
Free access publishing is awesome...but expensive. How do we pay for it?
2011-11-03T16:05:06+00:00
http://simplystats.github.io/2011/11/03/free-access-publishing-is-awesome-but-expensive-how
<p>I am a huge fan of open access journals. I think open access is good both for moral reasons (science should be freely available) and for more selfish ones (I want people to be able to read my work). If given the choice, I would publish all of my work in journals that distribute results freely.</p>
<p>But it turns out that for most open/free access systems, the publishing charges are paid by the scientists publishing in the journals. I did a quick scan and compiled this little table of how much it costs to publish a paper in different journals (<a href="http://www.springeropen.com/about/apccomparison/" target="_blank">here</a> is a bigger table):</p>
<ul>
<li><strong>PLoS One</strong>: $1,350.00</li>
<li><strong>PLoS Biology</strong>: $2,900.00</li>
<li><strong>BMJ Open</strong>: $1,937.28</li>
<li><strong>Bioinformatics (Open Access Option)</strong>: $3,000.00</li>
<li><strong>Genome Biology (Open Access Option)</strong>: $2,500.00</li>
<li><strong>Biostatistics (Open Access Option)</strong>: $3,000.00</li>
</ul>
<!-- more -->
<p>The first thing I noticed is that it costs a minimum of about $1,500 to get a paper published open access. That may not seem like a lot of money, and most journals offer discounts to people who can’t pay. But it still adds up: this last year my group has published 7 papers. If I paid for all of them to be published open access, that would be at minimum $10,500! That is half the salary of a graduate student researcher for an entire year. For a senior scientist that may be no problem, but for early career scientists, or scientists with limited access to resources, it is a big challenge.</p>
<p>Publishers that are solely dedicated to open access (PLoS, BMJ Open, etc.) seem to have on average lower publication charges than journals that only offer open access as an option. I think part of this is that the journals that aren’t open access in general have to make up some of the profits they lose by making the articles free. I certainly don’t begrudge the journals the costs. They have to maintain the websites, format the articles, and run the peer review process. That all costs money.</p>
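<p>As a rough check on the two observations above, here is a minimal R sketch using only the fees in the list. The grouping into "open access only" and "open access option" follows the labels in that list, the variable names are made up for illustration, and six journals are far too few to say anything definitive.</p>
<pre># Fees as listed in the post (USD)
fees <- c("PLoS One" = 1350.00, "PLoS Biology" = 2900.00, "BMJ Open" = 1937.28,
          "Bioinformatics" = 3000.00, "Genome Biology" = 2500.00,
          "Biostatistics" = 3000.00)

# TRUE for the journals listed without the "(Open Access Option)" label
oa_only <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# Average fee within each group
group <- ifelse(oa_only, "open access only", "open access option")
mean_by_type <- tapply(fees, group, mean)
round(mean_by_type, 2)

# Rough yearly bill for a group publishing 7 papers at about $1,500 each
yearly_cost <- 7 * 1500
yearly_cost  # 10500
</pre>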
<p><strong>A modest proposal</strong></p>
<p>What I wonder is whether there is a better place for that money to come from. Here is one proposal (hat tip to Rafa): academic and other libraries pay a ton of money for subscriptions to journals like Nature and Science. They also are required to pay for journals in a large range of disciplines. What if, instead of investing this money in subscriptions for their university, academic libraries pitched in and subsidized the publication costs of open/free access?</p>
<p>If all university libraries pitched in, the cost for any individual library would be relatively small. It would probably be less than paying for subscriptions to hundreds of journals. At the same time, it would be an investment that would benefit not only the researchers at their school, but also the broader scientific community by keeping research open. Then neither the people publishing the work, nor the people reading it would be on the hook for the bill.</p>
<p>This approach is the route taken by <a href="http://arxiv.org/" target="_blank">ArXiv</a>, a free database of unpublished papers. These papers haven’t been peer reviewed, so they don’t always carry the same weight as papers published in peer-reviewed journals. But there are a lot of really good and important papers in the database - it is an almost universally accepted pre-print server.</p>
<p>The other nice thing about ArXiv is that you don’t pay for article processing; the papers are published as is. The papers don’t look quite as pretty as they do in Nature/Science or even PLoS, but it is also much cheaper. The only costs associated with making this a full fledged peer-reviewed journal would be refereeing (which scientists do for free anyway) and editorial responsibilities (again mostly volunteered by scientists).</p>
I Gave A Talk On Reproducible Research Back In
2011-11-02T16:05:00+00:00
http://simplystats.github.io/2011/11/02/i-gave-a-talk-on-reproducible-research-back-in
<p>[youtube http://www.youtube.com/watch?v=aH8dpcirW1U?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]</p>
<p>I gave a talk on reproducible research back in July at the Applied Mathematics Perspectives workshop in Vancouver, BC.</p>
<p>In addition to the YouTube version, there’s also a Silverlight version where you can <a href="http://mediasite.mediagroup.ubc.ca/MediaGroup/Viewer/?peid=1c8f6b5a331546ed9f28631239d8b24d1d" target="_blank">actually see the slides</a> while I’m talking.</p>
Guest Post: SMART thoughts on the ADHD 200 Data Analysis Competition
2011-11-02T15:37:42+00:00
http://simplystats.github.io/2011/11/02/guest-post-smart-thoughts-on-the-adhd-200-data
<p><strong>Note</strong>: <em>This is a guest post by our colleagues <span id="internal-source-marker_0.8358260081149638">Brian Caffo, </span>Ani Eloyan, Fang Han, Han Liu, John Muschelli, Mary Beth Nebel, Tuo Zhao and Ciprian Crainiceanu. They <a href="http://fcon_1000.projects.nitrc.org/indi/adhd200/results.html" target="_blank">won</a> the <a href="http://fcon_1000.projects.nitrc.org/indi/adhd200/" target="_blank">ADHD 200</a> imaging data analysis competition. There has been <a href="http://www.reddit.com/r/cogsci/comments/lblqs/the_adhd200_competition_was_intended_to_exhibit/" target="_blank">some</a> <a href="http://www.talyarkoni.org/blog/2011/10/12/brain-based-prediction-of-adhd-now-with-100-fewer-brains/" target="_blank">controversy</a> around the results because one team obtained a higher score without using any of the imaging data. Our colleagues have put together a very clear discussion of the issues raised by the competition so we are publishing it here to contribute to the discussion. Questions about this post should be directed to the Hopkins team leader <a href="http://www.bcaffo.com/home/contacts" target="_blank">Brian Caffo</a>.</em><br />
<span> </span></p>
<p><span><strong>Background</strong></span></p>
<p><span id="internal-source-marker_0.8358260081149638">Below we share some thoughts about the ADHD 200 competition, a landmark competition using functional and structural brain imaging data to predict ADHD status. </span></p>
<!-- more -->
<p>Note, we’re calling these “SMART thoughts” to draw attention to our working group, “Statistical Methods and Applications for Research in Technology” (<a href="http://www.smart-stats.org/" target="_blank"><a href="http://www.smart-stats.org" target="_blank">www.smart-stats.org</a></a>), though hopefully the acronym applies in the non-intended sense as well.</p>
<p><span> </span><br />
<span>Our team was declared the official winners of the competition. However, a team from the University of Alberta scored a higher number of competition points, but was disqualified for not having used imaging data. We have been in email contact with a representative of that team and have enjoyed the discussion. We found those team members to be gracious and to embody an energy and scientific spirit that are refreshing to encounter. </span><br />
<span> </span><br />
<span>We mentioned our sympathy to them, in that the process seemed unfair, especially given the vagueness of what qualifies as use of the imaging data. More on this thought below. </span><br />
<span> </span><br />
<span>This brings us to the point of this note, concern over the narrative surrounding the competition based on our reading of web pages, social media and water cooler discussions.</span><br />
<span> </span><br />
<span>We are foremost concerned with the unwarranted conclusion that because the team with the highest competition point total did not use imaging data, the overall scientific validity of using (f)MRI imaging data to study ADHD is now in greater doubt. </span><br />
<span> </span><br />
<span>We stipulate that, like many others, we are skeptical of the utility of MRI data for tasks such as ADHD diagnoses. We are not arguing against such skepticism. </span><br />
<span> </span><br />
<span>Instead we are arguing against using the competition results as if they were strong evidence for such skepticism.</span><br />
<span> </span><br />
<span>We raise four points to argue against overreacting to the competition outcome with respect to the use of structural and functional MRI in the study of ADHD.</span></p>
<h3 id="point-1-the-competition-points-are-not-an-accurate-measure-of-performance-and-scientific-value"><strong>Point 1. The competition points are not an accurate measure of performance and scientific value.</strong></h3>
<p><span>Because the majority of the training, and hence presumably the test, sets in the competition were typically developing, the competition points favored specificity. </span><br />
<span> </span><br />
<span>In addition, a correct label of TD yielded 1 point, while a correct ADHD diagnosis with incorrect subtype yielded .5 points. </span><br />
<span></span><br />
<span>These facts suggest a classifier that declares everyone as TD as a starting point. For example, if 60% of the 197 test subjects are controls, this algorithm would yield 118 competition points, better than all but a few entrants. In fact, if 64.5% or higher of the test set is TD, this algorithm wins over Alberta (and hence everyone else).</span><br />
<span></span><br />
<span>In addition, competition points are variables possessing randomness. </span><span>It is human nature to interpret the anecdotal rankings of competitions as definitive evidence of superiority. This works fine as long as rankings are reasonably deterministic, but it is riddled with logical flaws when rankings are stochastic. Variability in rankings has a huge effect on the result of competitions, especially when highly tuned prediction methods from expert teams are compared. Indeed, in such cases the confidence intervals of the AUCs (or other competition criteria) overlap. </span><span>The 5th or 10th place team may actually have had the most scientifically informative algorithm.</span></p>
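<p>The arithmetic behind the "declare everyone TD" baseline can be sketched in a few lines of R. This is only an illustration: the function name is made up, and it assumes that a correctly labelled TD subject earns 1 point while an ADHD subject labelled TD earns 0.</p>
<pre># Competition points for a rule that labels every test subject as TD.
# Assumption: 1 point per correctly labelled TD subject, 0 points otherwise.
all_td_points <- function(n_test, prop_td) {
  n_test * prop_td
}

all_td_points(197, 0.60)   # about 118 points
all_td_points(197, 0.645)  # about 127 points, the break-even share quoted above
</pre>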
<h3 id="point-2-biologically-valueless-predictors-were-important"><strong>Point 2. Biologically valueless predictors were important.</strong></h3>
<p><span>Most importantly, contributing location (aka site) was a key determinant of prediction performance. Site is a proxy for many things: the demographics of the ADHD population in the site’s PI’s studies, the policies by which a PI chose to include data, scanner type, IQ measure, missing data patterns, data quality and so on. </span><br />
<span></span><br />
<span>In addition to site, missing data existence and data quality also held potentially important information about prediction, despite being (biologically) unrelated to ADHD. The likely causality, if existent, would point in the reverse direction (i.e. that presence of ADHD would result in a greater propensity for missing data and lower data quality, perhaps due to movement in the scanner).</span><br />
<span></span><br />
<span>This is a general fact regarding prediction algorithms, which do not intrinsically account for causal directions or biological significance.</span></p>
<h3 id="point-3-the-majority-of-the-imaging-data-is-not-prognostic"><strong>Point 3. The majority of the imaging data is not prognostic.</strong></h3>
<p><span>Likely every entrant, and the competition organizers, were aware that the majority of the imaging data is not useful for predicting ADHD. (Here we use the term “imaging data” loosely, meaning raw and/or processed data.) In addition, the imaging data are noisy. Therefore, use of these data introduced tens of billions of unnecessary numbers to predict 197 diagnoses. </span><br />
<span></span><br />
<span>As such, even if extremely important variables are embedded in the imaging data, (non-trivial) use of all of the imaging data could degrade performance, regardless of the ultimate value of the data. </span><br />
<span></span><br />
<span>To put this in other words, suppose all entrants were offered an additional 10 billion numbers, say genomic data, known to be noisy and, in aggregate, not predictive of disease. However, suppose that some unknown function of a small collection of variables was very meaningful for prediction, as is presumably the case with genomic data. If the competition did not require its use, a reasonable strategy would be to avoid using these data altogether. </span><br />
<span></span><br />
<span>Thus, in a scientific sense, we are sympathetic to the organizers’ choice to eliminate the Alberta team, since a primary motivation of the competition was to encourage a large set of eyes to sift through a large collection of very noisy imaging data. </span><br />
<span></span><br />
<span>Of course, as stated above, we believe that what constitutes a sufficient use of the imaging data is too vague to be an adequate rule to eliminate a team in a competition. </span><br />
<span></span><br />
<span>Thus our scientifically motivated support of the organizers conflicts with our procedural dispute of the decision made to eliminate the Alberta team.</span><span></span></p>
<h3 id="point-4-accurate-prediction-of-a-response-is-neither-necessary-nor-sufficient-for-a-covariate-to-be-biologically-meaningful"><strong>Point 4. Accurate prediction of a response is neither necessary nor sufficient for a covariate to be biologically meaningful.</strong></h3>
<p><span>Accurate prediction of a response is an extremely high bar for a variable of interest. Consider drug development for ADHD. A drug </span><span>does not</span> <span>have to demonstrate that its application to a collection of symptomatic individuals would predict </span><span>with high accuracy</span> <span>a later abatement of symptoms. Instead, a successful drug would have to demonstrate a mild</span> <span>averaged</span> <span>improvement over a placebo or standard therapy when randomized. </span><br />
<span></span><br />
<span>As an example, consider randomly administering such a drug to 50 of 100 subjects who have ADHD at baseline. Suppose data are collected at 6 and 12 months. Further suppose that 8 out of 50 of those receiving the drug had no ADHD symptoms at 12 months, while 1 out of 50 of those receiving placebo had no ADHD symptoms at 12 months. The Fisher’s exact test P-value is .03, by the way. </span><br />
<span></span><br />
<span>The statistical evidence points to the drug being effective. Knowledge of drug status, however, would do little to improve prediction accuracy. That is, given a new data set of subjects with ADHD at baseline and knowledge of drug status, the most accurate classification for every subject would be to guess that they will continue to have ADHD symptoms at 12 months. Of course, our confidence in that prediction would be slightly lower for those having received the drug.</span><br />
<span></span><br />
<span>However, consider using ADHD status at 6 months as a predictor. This would be enormously effective at locating those subjects who have an abatement of symptoms whether they received the drug or not. In this thought experiment, one predictor (symptoms at 6 months) is highly predictive, but not meaningful (it simply suggests that Y is a good predictor of Y), while the other (presence of drug at baseline) is only mildly predictive, but is statistically and biologically significant.</span><br />
<span></span><br />
<span>As another example, consider the ADHD200 data set. Suppose that a small structural region is highly impacted in an unknown subclass of ADHD. Some kind of investigation of morphometry or volumetrics might detect an association with disease status. The association would likely be weak, given absence of a-priori knowledge of this region or the subclass. This weak association would not be useful in a prediction algorithm. However, digging into this association could potentially inform the biological basis of the disease and further refine the ADHD phenotype.</span><br />
<span></span><br />
<span>Thus, we argue that it is important to differentiate the goal of obtaining high prediction accuracy from that of biological discovery of complex mechanisms in the presence of high dimensional data. </span></p>
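<p>The drug thought experiment above can be checked with a short R sketch. The counts come straight from the example (8 of 50 on the drug and 1 of 50 on placebo symptom-free at 12 months), while the object names are made up for illustration. It runs the two-sided Fisher's exact test and then shows that the most likely outcome is the same in both arms, so knowing drug status does not change the best classification.</p>
<pre># Rows: drug, placebo. Columns: symptom-free at 12 months, still symptomatic.
trial <- matrix(c(8, 42,
                  1, 49),
                nrow = 2, byrow = TRUE,
                dimnames = list(arm = c("drug", "placebo"),
                                outcome = c("no_symptoms", "symptoms")))

# Two-sided Fisher's exact test on the 2x2 table
p_value <- fisher.test(trial)$p.value
round(p_value, 2)

# Most likely outcome within each arm: the best guess for a new subject
best_guess <- colnames(trial)[apply(trial, 1, which.max)]
best_guess

# Accuracy of always guessing the most likely outcome in each arm
accuracy <- sum(apply(trial, 1, max)) / sum(trial)
accuracy
</pre>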
<h3 id="conclusions"><strong>Conclusions</strong></h3>
<p><em>We urge caution in over-interpretation of the scientific impact of the University of Alberta’s strongest performance in the competition. </em><br />
<span></span><br />
<span>Ultimately, what Alberta’s having the highest point total established is that they are fantastic people to talk to if you want to achieve high prediction accuracy. (Looking over their work, this appears to have already been established prior to the competition :-).</span><br />
<span></span><br />
<span>It was not established that brain structure or resting state function, as measured by MRI, is a blind alley in the scientific exploration of ADHD.</span></p>
<p><span><strong>Related Posts: </strong>Roger on “<a href="http://simplystatistics.tumblr.com/post/11611102993/caffo-ninjas-awesome" target="_blank">Caffo + Ninjas = Awesome”</a>, Rafa on the “<a href="http://simplystatistics.tumblr.com/post/11732716036/the-self-assessment-trap" target="_blank">Self Assessment Trap</a>”, Roger on “<a href="http://simplystatistics.tumblr.com/post/10441403664/private-health-insurers-to-release-data" target="_blank">Private health insurers to release data</a>”</span></p>
We need better marketing
2011-11-02T14:45:30+00:00
http://simplystats.github.io/2011/11/02/we-need-better-marketing
<p>In <a href="http://marginalrevolution.com/marginalrevolution/2011/11/college-has-been-oversold.html" target="_blank">this post</a> Alex Tabarrok argues that not enough people are obtaining “degrees that pay” and that college has been oversold. It struck me that the number of students studying Visual and Performing Arts has more than doubled since 1985. Yet for Math and Statistics there has been no increase at all! We need to do a better job at marketing. The great majority (if not all) of the people I know with Statistics degrees have found a job related to Statistics. With a Master’s, salary can be as high as <a href="http://www.payscale.com/research/US/Degree=Master_of_Science_(MS),_Statistics/Salary" target="_blank">$110K</a>. So to those interested in Visual and Performing Arts who are good with numbers I suggest you hedge your bets: do a double major and consider Statistics. My <a href="https://www.facebook.com/group.php?gid=5049582229" target="_blank">brother</a>, a <a href="http://www.youtube.com/watch?v=yx_sPV04Img" target="_blank">successful</a> musician, majored in Math. He uses his math skills to supplement his income by playing poker with other musicians.</p>
Computing on the Language Followup
2011-11-01T16:05:05+00:00
http://simplystats.github.io/2011/11/01/computing-on-the-language-followup
<p>My article on <a href="http://simplystatistics.tumblr.com/post/11988685443/computing-on-the-language" target="_blank">computing on the language</a> was unexpectedly popular and so I wanted to quickly follow up on my own solution. Many of you got the answer, and in fact many got solutions that were quite a bit shorter than mine. Here’s how I did it:</p>
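<pre>makeList <- function(...) {
  # Capture the unevaluated call, e.g. list(x, y)
  args <- substitute(list(...))
  # Drop the leading 'list' and turn each argument expression into a string
  nms <- sapply(args[-1], deparse)
  # Evaluate the arguments and name them after their expressions
  vals <- list(...)
  names(vals) <- nms
  vals
} </pre>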
<p><span>Baptiste</span> pointed out that Frank Harrell has already implemented this function in his Hmisc package as the ‘llist’ function (thanks for the pointer!). I’ll just note that this function does a bit more actually because each element of the returned list is an object of class “labelled”.</p>
<p>The shortest solution was probably Tony Breyal’s version:</p>
<pre>makeList <- function(...) {
  # data.frame() derives column names from the argument expressions
  structure(list(...), names = names(data.frame(...)))
}
</pre>
<p>However, it’s worth noting that this function modifies the object’s name if the name is non-standard (i.e. if you’re using backticks like `r object name`). That’s just because the ‘data.frame’ function automatically modifies names if they are non-standard.</p>
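<p>To make the difference concrete, here is a small sketch that puts the two versions side by side. The second function is renamed <code>makeListShort</code> here purely so both can live in one session (that name is not from the original posts), and the example objects are made up.</p>
<pre># First version: names taken from the deparsed argument expressions
makeList <- function(...) {
  args <- substitute(list(...))
  nms <- sapply(args[-1], deparse)
  vals <- list(...)
  names(vals) <- nms
  vals
}

# Short version, renamed for this comparison (hypothetical name)
makeListShort <- function(...) {
  structure(list(...), names = names(data.frame(...)))
}

x <- 1
`my var` <- 2  # a non-standard name, only usable with backticks

names(makeList(x, `my var`))       # keeps the name as typed
names(makeListShort(x, `my var`))  # data.frame() rewrites the non-standard name
</pre>
<p>Note that the short version also inherits the restrictions of ‘data.frame’, so the arguments need to have compatible lengths.</p>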
<p>Thanks to everyone for the responses! I’ll try to come up with another one soon.</p>
Advice on promotion letters bleg
2011-11-01T01:13:56+00:00
http://simplystats.github.io/2011/11/01/advice-on-promotion-letters-bleg
<p>This fall I have been asked to write seven promotion letters. Writing these takes me at least 2 hours. If it’s someone I don’t know it takes me longer because I have to read some of their papers. Earlier this year, I wrote one for a Biology department that took me at least 6 hours. So how many are too many? Should I set a limit? Advice and opinions in the comments would be greatly appreciated.</p>
The 5 Most Critical Statistical Concepts
2011-10-29T16:05:05+00:00
http://simplystats.github.io/2011/10/29/the-5-most-critical-statistical-concepts
<p>It seems like everywhere we look, data is being generated - from politics, to biology, to publishing, to social networks. There are also diverse new computational tools, like GPGPU and cloud computing, that expand the statistical toolbox. Statistical theory is more advanced than it’s ever been, with exciting work in a range of areas. </p>
<p>With all the excitement going on around statistics, there is also increasing diversity. It is increasingly hard to define “statistician” since the definition ranges from <a href="http://www.stat.washington.edu/jaw/" target="_blank">very mathematical</a> to <a href="http://en.wikipedia.org/wiki/Nate_Silver" target="_blank">very applied</a>. An obvious question is: what are the most critical skills needed by statisticians? </p>
<!-- more -->
<p>So just for fun, I made up my list of the top 5 most critical skills for a statistician by my own definition. They are by necessity very general (I only gave myself 5). </p>
<ol>
<li><strong>The ability to manipulate/organize/work with data on computers</strong> - whether it is with Excel, R, SAS, or Stata, to be a statistician you have to be able to work with data. </li>
<li><strong>A knowledge of exploratory data analysis</strong> - how to make plots, how to discover patterns with visualizations, how to explore assumptions</li>
<li><strong>Scientific/contextual knowledge</strong> - at least enough to be able to abstract and formulate problems. This is what separates statisticians from mathematicians. </li>
<li><strong>Skills to distinguish true from false patterns</strong> - whether with p-values, posterior probabilities, meaningful summary statistics, cross-validation or any other means. </li>
<li><strong>The ability to communicate results to people without math skills</strong> - a key component of being a statistician is knowing how to explain math/plots/analyses.</li>
</ol>
<p>What are your top 5? What order would you rank them in? Even though these are so general, I almost threw regression in there because of how often it pops up in various forms. </p>
<p><strong>Related Posts:</strong> Rafa on <a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">graduate education</a> and <a href="http://simplystatistics.tumblr.com/post/10021164565/what-is-a-statistician" target="_blank">What is a Statistician</a>? Roger on <a href="http://simplystatistics.tumblr.com/post/11655593971/do-we-really-need-applied-statistics-journals" target="_blank">“Do we really need applied statistics journals?”</a></p>
Computing on the Language
2011-10-27T12:17:24+00:00
http://simplystats.github.io/2011/10/27/computing-on-the-language
<p>And now for something a bit more esoteric….</p>
<p>I recently wrote a function to deal with a strange problem. Writing the function ended up being a fun challenge related to computing on the R language itself.</p>
<p>Here’s the problem: Write a function that takes any number of R objects as arguments and returns a list whose names are derived from the names of the R objects.</p>
<!-- more -->
<p>Perhaps an example provides a better description. Suppose the function is called ‘makeList’. Then </p>
<pre>x <- 1
y <- 2
z <- "hello"
makeList(x, y, z)
</pre>
<p>returns</p>
<pre>list(x = 1, y = 2, z = "hello")
</pre>
<p>It originally seemed straightforward to me, but it turned out to be very much not straightforward. </p>
<p>Note that a function like this is probably most useful during interactive sessions, as opposed to programming.</p>
<p>I challenge you to take a whirl at writing the function, you know, in all that spare time you have. I’ll provide my solution in a future post.</p>
Visualizing Yahoo Email
2011-10-26T16:23:44+00:00
http://simplystats.github.io/2011/10/26/visualizing-yahoo-email
<p><a href="http://visualize.yahoo.com/" target="_blank">Here</a> is a cool page where Yahoo shows you the email it is processing in real time. It includes a visualization of the most popular words in emails at a given time. A pretty neat tool and definitely good for procrastination, but I’m not sure what else it is good for…</p>
Web-scraping
2011-10-24T16:05:05+00:00
http://simplystats.github.io/2011/10/24/web-scraping
<p>The internet is the greatest source of publicly available data. One of the key skills for obtaining data from the web is “web-scraping”, where you use a piece of software to run through a website and collect information. </p>
<p>This technique can be used for collecting data from databases or to collect data that is scattered across a website. Here is a very cool little <a href="http://thebiobucket.blogspot.com/2011/10/little-webscraping-exercise.html" target="_blank">exercise</a> in web-scraping that can be used as an example of the things that are possible. </p>
<p><strong>Related Posts</strong>: Jeff on <a href="http://simplystatistics.tumblr.com/post/11237403492/apis" target="_blank">APIs</a>, <a href="http://simplystatistics.tumblr.com/post/10410458080/data-sources" target="_blank">Data Sources</a>, <a href="http://simplystatistics.tumblr.com/post/11224744922/a-nice-presentation-on-regex-in-r" target="_blank">Regex</a>, and <a href="http://simplystatistics.tumblr.com/post/10766696449/the-open-data-movement" target="_blank">The Open Data Movement</a>.</p>
Archetypal Athletes
2011-10-24T13:37:00+00:00
http://simplystats.github.io/2011/10/24/archetypal-athletes
<p><a href="http://arxiv.org/PS_cache/arxiv/pdf/1110/1110.1972v1.pdf" target="_blank">Here</a> is a cool paper on the arXiv about archetypal athletes. The basic idea is to look at a large number of variables for each player and identify multivariate outliers or extremes. These outliers are the archetypes talked about in the title. </p>
<p>Based on this analysis, the author claims the best players (for different reasons, i.e. different archetypes) in the NBA in 2009/2010 were: Taj Gibson, Anthony Morrow, and Kevin Durant. The best soccer players were Wayne Rooney, Lionel Messi, and Cristiano Ronaldo.</p>
<p>Thanks to <a href="http://www.biostat.jhsph.edu/~ajaffe/" target="_blank">Andrew Jaffe</a> for pointing out the article. </p>
<p><strong>Related Posts</strong>: Jeff on “<a href="http://simplystatistics.tumblr.com/post/10989030989/innovation-and-overconfidence" target="_blank">Innovation and Overconfidence</a>”, Rafa on “<a href="http://simplystatistics.tumblr.com/post/10805255044/once-in-a-lifetime-collapse" target="_blank">Once in a lifetime collapse</a>”</p>
Graduate student data analysis inspired by a high-school teacher
2011-10-22T13:02:06+00:00
http://simplystats.github.io/2011/10/22/graduate-student-data-analysis-inspired-by-a
<p>I love watching TED talks. One of my absolute favorites is the <a href="http://www.ted.com/talks/dan_meyer_math_curriculum_makeover.html" target="_blank">talk</a> by Dan Meyer on how math class needs a makeover. Dan also has one of the more fascinating <a href="http://blog.mrmeyer.com/" target="_blank">blogs</a> I have read. He talks about math education, primarily K-12 education. His posts on <a href="http://blog.mrmeyer.com/?p=3055" target="_blank">curriculum design</a>, <a href="http://blog.mrmeyer.com/?p=811" target="_blank">assessment </a>, <a href="http://blog.mrmeyer.com/?p=154" target="_blank">work ethic</a>, and <a href="http://blog.mrmeyer.com/?p=133" target="_blank">homework</a> are really, really good. In fact, just go read all his <a href="http://blog.mrmeyer.com/?page_id=2716" target="_blank">author choices</a>. You won’t regret it. </p>
<p>The best quote from the talk is:</p>
<blockquote>
<p>Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and have to filter it out, or you didn’t have insufficient information and have to go find some?</p>
</blockquote>
<!-- more -->
<p>Many of the data analyses I have done in classes/assigned in class have focused on a problem with exactly the right information with relatively little extraneous data or missing information. But I have been slowly evolving these problems; as an example <a href="http://biostat.jhsph.edu/~jleek/qual2011.pdf" target="_blank">here</a> is a data analysis project that we developed last year for the qualifying exam at JHU. This project is what I consider a first step toward a “less helpful” project model. </p>
<p>The project was inspired by this <a href="http://marginalrevolution.com/marginalrevolution/2010/09/the-small-schools-myth.html" target="_blank">blog post</a> at Marginal Revolution, which Rafa suggested. As with the homework problem Dan dissects in his talk, there are layers to this problem:</p>
<ol>
<li>Understanding the question</li>
<li>Downloading and filtering the data</li>
<li>Exploratory analysis</li>
<li>Fitting models/interpreting results</li>
<li>Synthesis and writing the results up</li>
<li>Reproducibility of the R code</li>
</ol>
<p>For this analysis, I was pretty specific with 1. Understanding the question:</p>
<blockquote>
<p class="MsoNormal">
<span>(1) The association between enrollment and the percent of students scoring “Advanced” on the MSA in Reading and Math in the 5<sup>th</sup> grade. </span>
</p>
<p class="MsoNormal">
<span>(2) The change in the number of students scoring “Advanced” in Reading and Math from one year to the next (at minimum consider the change from 2009-2010) versus enrollment. </span>
</p>
<p class="MsoNormal">
<span>(3) Potential reasons for results like those in <strong>Table 1</strong>. <span> </span></span>
</p>
</blockquote>
<p class="MsoNormal">
I didn’t mention the key idea from the Marginal Revolution post, though. I think for a qualifying exam, this level of specificity is necessary, but for an in-class project I think I would have removed this information so students had to “discover the question” themselves.
</p>
<p class="MsoNormal">
I was also pretty specific with the data source, suggesting the Maryland Education department’s website. However, several students went above and beyond and found other data sources, or downloaded more data than I suggested. In the future, I think I will leave this off too. My Google/data finding skills don’t hold a candle to those of my students.
</p>
<p class="MsoNormal">
Steps 3-5 were summed up with the statement:
</p>
<blockquote>
<p class="MsoNormal">
<span>Your project is to analyze data from the MSA and write a short letter either in favor of or against spending money to decrease school sizes.</span>
</p>
</blockquote>
<p class="MsoNormal">
<span>This is one part of the exam I’m happy with. It is sufficiently vague to let the students come to their own conclusions. It also suggests that the students <strong>should</strong> draw conclusions and support them with statistical analyses. One of the major difficulties I have struggled with in teaching this class is getting students to state a conclusion as a result of their analysis and to quantify how uncertain they are about that decision. In my mind, this is different from just the uncertainty associated with a single parameter estimate. </span>
</p>
<p class="MsoNormal">
It was surprising how much requiring reproducibility helped students focus their analyses. I think this is because they had to organize/collect their code, which helped them organize their analysis. Also, there was a strong correlation between reproducibility and quality of the written reports.
</p>
<p class="MsoNormal">
Going forward, I have a few ideas for how I would change my data analysis projects:
</p>
<ol>
<li>Be less helpful - be less clear about the problem statement, data sources, etc. I definitely want students to get more practice formulating problems. </li>
<li>Focus on writing/synthesis - my students are typically very good at fitting models, but sometimes struggle with putting together the “story” of an analysis. </li>
<li>Stress much less about whether specific methods will work well on the data analyses I suggest. One of the more helpful things I think these messy problems produce is a chance to figure out what works and what doesn’t on real world problems. </li>
</ol>
<p><strong>Related Posts:</strong> Rafa on <a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">the future of graduate education</a>, <a href="http://simplystatistics.tumblr.com/post/11655593971/do-we-really-need-applied-statistics-journals" target="_blank">Roger on applied statistics journals</a>.</p>
The self-assessment trap
2011-10-21T14:35:00+00:00
http://simplystats.github.io/2011/10/21/the-self-assessment-trap
<p>Several months ago I was sitting next to my colleague <a href="http://www.cbcb.umd.edu/~langmead/" target="_blank">Ben Langmead</a> at the <a href="http://meetings.cshl.edu/meetings/info11.shtml" target="_blank">Genome Informatics meeting</a>. Various talks were presented on short read alignments and every single performance table showed the speaker’s method as #1 and Ben’s <a href="http://bowtie-bio.sourceforge.net/index.shtml" target="_blank">Bowtie</a> as #2 among a crowded field of lesser methods. It was fun to make fun of Ben for getting beat every time, but the reality was that all I could conclude was that Bowtie was best and speakers were falling into <a href="http://www.nature.com/msb/journal/v7/n1/full/msb201170.html#a1" target="_blank">the self-assessment trap</a>: each speaker had tweaked the assessment to make their method look best. This practice is pervasive in Statistics where easy-to-tweak Monte Carlo simulations are commonly used to assess performance. In a recent <a href="http://www.nature.com/msb/journal/v7/n1/full/msb201170.html#a1" target="_blank">paper</a>, a team at IBM described how the problem in the systems biology literature is pervasive as well. Co-author <a href="https://researcher.ibm.com/researcher/view.php?person=us-gustavo" target="_blank">Gustavo Stolovitzky <strike>Stolovitsky</strike></a> is a co-developer of the <a href="http://wiki.c2b2.columbia.edu/dream/index.php/The_DREAM3_Challenges" target="_blank">DREAM challenge</a> in which the assessments are fixed and developers are asked to submit. About 7 years ago we developed <a href="http://bioinformatics.oxfordjournals.org/content/20/3/323.long" target="_blank">affycomp</a>, a comparison <a href="http://affycomp.jhsph.edu/" target="_blank">webtool</a> for microarray preprocessing methods. I encourage others involved in fields where methods are constantly being compared to develop such tools.
It’s a lot of work, but journals are usually friendly to papers describing the results of such competitions.</p>
<p><strong>Related Posts:</strong> Roger on <a href="http://simplystatistics.tumblr.com/post/11573348494/colors-in-r" target="_blank">colors in R</a>, Jeff on <a href="http://simplystatistics.tumblr.com/post/10852070603/battling-bad-science" target="_blank">battling bad science</a></p>
Interview With Chris Barr
2011-10-21T11:13:00+00:00
http://simplystats.github.io/2011/10/21/interview-with-chris-barr
<p class="MsoNormal">
<strong>Chris Barr</strong>
</p>
<p class="MsoNormal">
<span>Chris Barr is an assistant professor of biostatistics at the Harvard School of Public Health in Boston. He moved to Boston after getting his Ph.D. at UCLA and then doing a postdoc at Johns Hopkins Bloomberg School of Public Health. Chris has done important work in environmental biostatistics and is also the co-founder of <a href="http://www.openintro.org/" target="_blank"><span>OpenIntro</span></a>, a very cool open-source (and free!) educational resource for statistics.<span> </span></span>
</p>
<!-- more -->
<p class="MsoNormal">
<strong><span>Which term applies to you: data scientist/statistician/analyst?</span></strong>
</p>
<p class="MsoNormal">
I’m a “statistician” by training. One day, I hope to graduate to “scientist”. The distinction, in my mind, is that a scientist can bring real insight to a tough problem, even when the circumstances take them far beyond their training.
</p>
<p class="MsoNormal">
Statisticians get a head start on becoming scientists. Like chemists and economists and all the rest, we were trained to think hard as independent researchers. Unlike other specialists, however, we are given the opportunity, from a young age, to see all types of different problems posed from a wide range of perspectives.
</p>
<p class="MsoNormal">
<strong><span>How did you get into statistics/data science (e.g. your history)?</span></strong>
</p>
<p class="MsoNormal">
I studied economics in college, and I had planned to pursue a doctorate in the same field. One day a senior professor of statistics asked me about my future, and in response to my stated ambition, said: “Whatever an economist can do, a statistician can do better.” I started looking at graduate programs in statistics and noticed UCLA’s curriculum. It was equal parts theory, application, and computing, and that sounded like how I wanted to spend my next few years. I couldn’t have been luckier. The program and the people were fantastic.
</p>
<p class="MsoNormal">
<strong><span>What is the problem currently driving you?</span></strong>
</p>
<p class="MsoNormal">
I’m working on so many projects, it’s difficult to single out just one. Our work on smoking bans (joint with Diez, Wang, Samet, and Dominici) has been super exciting. It is a great example of how careful modeling can really make a big difference. I’m also soloing a methods paper on residual analysis for point process models that is bolstered by a simple idea from physics. When I’m not working on research, I spend as much time as I can on OpenIntro.
</p>
<p class="MsoNormal">
<strong><span>What is your favorite paper/idea you have had? Why?</span></strong>
</p>
<p class="MsoNormal">
I get excited about a lot of the problems and ideas. I like the small teams (one, two, or three authors) that generally take on theory and methods problems; I also like the long stretches of thinking time that go along with those papers. That said, big science papers, where I get to team up with smart folks from disciplines and destinations far and wide, really get me fired up. Last, but not least, I really value the work we do on open source education and reproducible research. That work probably has the greatest potential for introducing me to people, internationally and in small local communities, that I’d never know otherwise.
</p>
<p class="MsoNormal">
<strong><span>Who were really good mentors to you? What were the qualities that really helped you?</span></strong>
</p>
<p class="MsoNormal">
Identifying key mentors is such a tough challenge, so I’ll adhere to a self-imposed constraint by picking just one: <a href="http://www.stat.ucla.edu/~frederic/" target="_blank">Rick Schoenberg</a>. Rick was my doctoral advisor, and has probably had the single greatest impact on my understanding of what it means to be a scientist and colleague. I could tell you a dozen stories about the simple kindness and encouragement that Rick offered. Most importantly, Rick was positive and professional in every interaction we ever had. He was diligent, but relaxed. He offered structure and autonomy. He was all the things a student needs, and none of the things that make students want to read those xkcd comics. Now that I’m starting to make my own way, I’m grateful to Rick for his continuing friendship and collaboration.
</p>
<p class="MsoNormal">
I know you asked about mentors, but if I could mention somebody who, even though not my mentor, has taught me a ton, it would be <a href="http://www.ddiez.com/" target="_blank">David Diez</a>. David was my classmate at UCLA and colleague at Harvard. We are also cofounders of OpenIntro. David is probably the hardest working person I know. He is also the most patient and clear thinking. These qualities, like Rick’s, are often hard to find in oneself and can never be too abundant.
</p>
<p class="MsoNormal">
<strong><span>What is OpenIntro?</span></strong>
</p>
<p class="MsoNormal">
<span>OpenIntro is part of the growing movement in open source education. Our goal, with the help of community involvement, is to improve the quality and reduce the cost of educational materials at the introductory level. Founded by two statisticians (Diez, Barr), our early activities have generated a full length textbook (OpenIntro Statistics: Diez, Barr, Cetinkaya-Rundel) that is available for free in PDF and at cost ($9.02) in paperback. People can also use openintro.org to manage their course materials for free, whether they are using our book or not. The software, developed almost entirely by David Diez, makes it easy for people to post lecture notes, assignments, and other resources. Additionally, it gives people access to our online question bank and quiz utility. Last but not least, we are sponsoring a student project competition. The first round will be this semester, and interested people can visit <a href="http://www.openintro.org/stat/comp.php" target="_blank">openintro.org/stat/comp</a> for additional information. We are little fish, but with the help of our friends (<a href="http://openintro.org/about.php" target="_blank">openintro.org/about.php</a>) and involvement from the community, we hope to do a good thing.</span>
</p>
<p class="MsoNormal">
<strong><span>How did you get the idea for OpenIntro?</span></strong>
</p>
<p class="MsoNormal">
Regarding the book and webpage - David and I had both started writing a book on our own; David was keen on an introductory text, and I was working on one about statistical computing. We each realized that trying to solo a textbook while finishing a PhD was nearly impossible, so we teamed up. As the project began to grow, we were very lucky to be joined by Mine Cetinkaya-Rundel, who became our co-author on the text and has since played a big role in developing the kinds of teaching supplements that instructors find so useful (labs and lecture notes to name a few). Working with the people at OpenIntro has been a blast, and a bucket full of nights and weekends later, here we are!
</p>
<p class="MsoNormal">
Regarding making everything free - David and I started the OpenIntro project during the peak of the global financial crisis. With kids going to college while their parents’ house was being foreclosed, it seemed timely to help out the best way we knew how. Three years later, as I write this, the daily news is running headline stories about the Occupy Wall Street movement featuring hard times for young people in America and around the world. Maybe “free” will always be timely.
</p>
<p class="MsoNormal">
<strong><span>For More Information</span></strong>
</p>
<p class="MsoNormal">
<span>Check out Chris’ </span><a href="http://www.hsph.harvard.edu/faculty/christopher-barr/publications/" target="_blank">webpage</a><span>, his really nice publications including </span><a href="http://jama.ama-assn.org/content/303/1/69.extract" target="_blank">this one</a><span> on the public health benefits of cap and trade, and the </span><a href="http://www.openintro.org/" target="_blank">OpenIntro</a><span> project website. Keep your eye open for the paper on </span><span>cigarette</span><span> bans Chris mentions in the interview; it is sure to be good. </span>
</p>
<p class="MsoNormal">
<strong>Related Posts: </strong>Jeff’s interview with <a href="http://simplystatistics.tumblr.com/post/11436138110/interview-with-daniela-witten" target="_blank">Daniela Witten</a>, Rafa on <a href="http://simplystatistics.tumblr.com/post/10764298034/the-future-of-graduate-education" target="_blank">the future of graduate education</a>, Roger on <a href="http://simplystatistics.tumblr.com/post/11573348494/colors-in-r" target="_blank">colors in R</a>.
</p>
Anthropology of the Tribe of Statisticians
2011-10-20T20:58:01+00:00
http://simplystats.github.io/2011/10/20/anthropology-of-the-tribe-of-statisticians
<p>From the BBC a pretty fascinating radio <a href="http://www.bbc.co.uk/iplayer/episode/b013851z/The_Tribes_of_Science_More_Tribes_of_Science_The_Statisticians/" target="_blank">piece.</a></p>
<blockquote>
<p>…in the same way that a telescope enables you to see things that are too far away to see with the naked eye, a microscope enables you to see things that are too small to see with the naked eye, <em>statistics</em> enables you to see things in masses of data which are too complex for you to see with the naked eye. </p>
</blockquote>
Finding good collaborators
2011-10-20T16:05:00+00:00
http://simplystats.github.io/2011/10/20/finding-good-collaborators
<p>The job of the statistician is almost entirely about collaboration. Sure, there’s theoretical work that we can do by ourselves, but most of the impact that we have on science comes from our work with scientists in other fields. Collaboration is also what makes the field of statistics so much fun.</p>
<p>So one question I get a lot from people is “how do you find good collaborations”? Or, put another way, how do you find good collaborators? It turns out this distinction is more important than it might seem.</p>
<!-- more -->
<p>My approach to developing collaborations has evolved over time and I consider myself fairly lucky to have developed a few very productive and very enjoyable collaborations. These days my strategy for finding good collaborations is to look for good collaborators. I personally find it important to work with people that I like as well as respect as scientists, because a good collaboration is going to involve a lot of personal interaction. A place like Johns Hopkins has no shortage of very intelligent and very productive researchers that are doing interesting things, but that doesn’t mean you want to work with all of them.</p>
<p>Here’s what I’ve been telling people lately about finding collaborations, which is a mish-mash of a lot of advice I’ve gotten over the years.</p>
<ol>
<li><strong>Find people you can work with</strong>. I sometimes see situations where a statistician will want to work with someone because he/she is working on an important problem. Of course, you want to be working on a problem that interests you, but it’s only partly about the specific project. It’s very much about the person. If you can’t develop a strong working relationship with a collaborator, both sides will suffer. If you don’t feel comfortable asking (stupid) questions, pointing out problems, or making suggestions, then chances are the science won’t be as good as it could be. </li>
<li><strong>It’s going to take some trial and error</strong>. I sometimes half-jokingly tell people that good collaborations are what you’re left with after getting rid of all your bad ones. Part of the reasoning here is that you actually may not know what kinds of people you are most comfortable working with. So it takes time and a series of interactions to learn these things about yourself and to see what works and doesn’t work. Of course, you can’t take forever, particularly in academic settings where the tenure clock might be ticking, but you can’t rush things either. One rule I heard once was that a collaboration is worth doing if it will likely end up with a published paper. That’s a decent rule of thumb, but see my next comment.</li>
<li><strong>It’s going to take some time</strong>. Developing good collaborations will usually take some time, even if you’ve found the right person. You might need to learn the science, get up to speed on the latest methods/techniques, learn the jargon, etc. So it might be a while before you can start having intelligent conversations about the subject matter. Then it takes time to understand how the key scientific questions translate to statistical problems. Then it takes time to figure out how to develop new methods to address these statistical problems. So a good collaboration is a serious long-term investment which has some risk of not working out. There may not be a lot of papers initially, but the idea is to make the early investment so that truly excellent papers can be published later.</li>
<li><strong>Work with people who are getting things done</strong>. Nothing is more frustrating than collaborating on a project with someone who isn’t that interested in bringing it to a close (i.e. a published paper, completed software package). Sometimes there isn’t a strong incentive for the collaborator to finish (i.e. she/he is already tenured) and other times things just fall by the wayside. So finding a collaborator who is continuously getting things done is key. One way to determine this is to check out their CV. Is there a steady stream of productivity? Papers in good journals? Software used by lots of other people? Grants? Web site that’s not in total disrepair?</li>
<li><strong>You’re not like everyone else</strong>. One thing that surprised me was discovering that just because someone you know works well with a specific person doesn’t mean that <em>you</em> will work well with that person. This sounds obvious in retrospect, but there were a few situations where a collaborator was recommended to me by a source that I trusted completely, and yet the collaboration didn’t work out. The bottom line is to trust your mentors and friends, but realize that differences in personality and scientific interests may determine a different set of collaborators with whom you work well.</li>
</ol>
<p>These are just a few of my thoughts on finding good collaborators. I’d be interested in hearing others’ thoughts and experiences along these lines.</p>
<p><strong>Related Posts:</strong> Rafa on <a href="http://simplystatistics.tumblr.com/post/11314293165/authorship-conventions" target="_blank">authorship conventions</a>, <a href="http://simplystatistics.tumblr.com/post/10440612965/finish-and-publish" target="_blank">finish and publish</a></p>
Caffo's Theorem
2011-10-20T02:35:03+00:00
http://simplystats.github.io/2011/10/20/caffos-theorem
<p>Brian Caffo from the comments:</p>
<blockquote>
<p>Personal theorem: the application of statistics in any new field will be labeled “Technical sounding word” + ics. Examples: Sabermetrics, analytics, econometrics, neuroinformatics, bioinformatics, informatics, chemometrics.</p>
<p>It’s like how adding mayonnaise to anything turns it into salad (eg: egg salad, tuna salad, ham salad, pasta salad, …)</p>
<p>I’d like to be the first to propose the statistical study of turning things into salad. So called mayonaisics.</p>
</blockquote>
Do we really need applied statistics journals?
2011-10-19T16:05:06+00:00
http://simplystats.github.io/2011/10/19/do-we-really-need-applied-statistics-journals
<p>All statisticians in academia are constantly confronted with the question of where to publish their papers. Sometimes it’s obvious: A theoretical paper might go to the <em>Annals of Statistics</em> or <em>JASA Theory & Methods</em> or <em>Biometrika</em>. A more “methods-y” paper might go to <em>JASA</em> or <em>JRSS-B</em> or <em>Biometrics</em> or maybe even <em>Biostatistics</em> (where all three of us are or have been associate editors).</p>
<p>But where should the applied papers go? I think this is an increasingly large category of papers being produced by statisticians. These are papers that do not necessarily develop a brand new method or uncover any new theory, but apply statistical methods to an interesting dataset in a not-so-obvious way. Some papers might combine a set of existing methods that have never been combined before in order to solve an important <em>scientific</em> problem.</p>
<p>Well, there are some official applied statistics journals: <em>JASA Applications & Case Studies</em> or <em>JRSS-C</em> or <em>Annals of Applied Statistics</em>. At least they have the word “application” or “applied” in their title. But the question we should be asking is if a paper is published in one of those journals, <em>will it reach the right audience</em>?</p>
<p>What is the audience for an applied stat paper? Perhaps it depends on the subject matter. If the application is biology, then maybe biologists. If it’s an air pollution and health application, maybe environmental epidemiologists. My point is that the key audience is probably not a bunch of other statisticians.</p>
<p>The fundamental conundrum of applied stat papers comes down to this question: <strong>If your application of statistical methods is truly addressing an important scientific question, then shouldn’t the scientists in the relevant field want to hear about it?</strong> If the answer is yes, then we have two options: Force other scientists to read our applied stat journals, or publish our papers in their journals. There doesn’t seem to be much momentum for the former, but the latter is already being done rather frequently.</p>
<p>Across a variety of fields we see statisticians making direct contributions to science by publishing in non-statistics journals. Some examples are this <a href="http://www.ncbi.nlm.nih.gov/pubmed/21706001">recent paper in <em>Nature Genetics</em></a> or a paper I published a few years ago in the <a href="http://www.ncbi.nlm.nih.gov/pubmed/18477784">Journal of the American Medical Association</a>. I think there are two key features that these papers (and many others like them) have in common:</p>
<p><strong>There was an important scientific question addressed</strong>. The first paper investigates variability of methylated regions of the genome and its relation to cancer tissue and the second paper addresses the problem of whether ambient coarse particles have an acute health effect. In both cases, scientists in the respective substantive areas were interested in the problem and so it was natural to publish the “answer” in their journals.</p>
<p><strong>The problem was well-suited to be addressed by statisticians</strong>. Both papers involved large and complex datasets for which training in data analysis and statistics was important. In the analysis of coarse particles and hospitalizations, we used a national database of air pollution concentrations and obtained health status data from Medicare. Linking these two databases together and conducting the analysis required enormous computational effort and statistical sophistication. While I doubt we were the only people who could have done that analysis, we were very well-positioned to do so.</p>
<p>So when statisticians are confronted by scientific problems that are both (1) important and (2) well-suited for statisticians, what should we do? My feeling is we should skip the applied statistics journals and bring the message straight to the people who want/need to hear it.</p>
<p>There are two problems that come to mind immediately. First, sometimes the paper ends up being so statistically technical that a scientific journal won’t accept it. And of course, in academia, there is the sticky problem of how do you get promoted in a statistics department when your CV is filled with papers in non-statistics journals. This entry is already long enough so I’ll address these issues in a future post.</p>
Spectacular Plots Made Entirely in R
2011-10-18T16:05:00+00:00
http://simplystats.github.io/2011/10/18/spectacular-plots-made-entirely-in-r
<p>When doing data analysis, I often create a set of plots quickly just to explore the data and see what the general trends are. Later I go back and fiddle with the plots to make them look pretty for publication. But some people have taken this to the next level. Here are two plots made entirely in R:</p>
<p><img align="middle" height="280" width="500" src="http://nzprimarysectortrade.files.wordpress.com/2011/10/weapon-export-2010.png" /></p>
<p><img align="middle" src="http://paulbutler.org/wp-content/uploads/2010/12/163413_479288597199_9445547199_5658562_14158417_n.png" width="500" height="280" /></p>
<p>The descriptions of how they were created are <a href="http://paulbutler.org/archives/visualizing-facebook-friends/" target="_blank">here</a> and <a href="http://nzprimarysectortrade.wordpress.com/2011/10/16/r-tells-you-where-weapons-go/" target="_blank">here</a>.</p>
<p><strong>Related:</strong> Check out Roger’s post on <a href="http://simplystatistics.tumblr.com/post/11573348494/colors-in-r" target="_blank">R colors</a> and my post on <a href="http://simplystatistics.tumblr.com/post/11237403492/apis" target="_blank">APIs</a></p>
Caffo + Ninjas = Awesome
2011-10-18T13:10:12+00:00
http://simplystats.github.io/2011/10/18/caffo-ninjas-awesome
<p>Our colleague <a href="http://www.biostat.jhsph.edu/~bcaffo/" target="_blank">Brian Caffo</a> and his team of statistics ninjas won the “Imaging-Based Diagnostic Classification Contest” as part of the <a href="http://fcon_1000.projects.nitrc.org/indi/adhd200/results.html" target="_blank">ADHD 200 Global Competition</a>. From the prize citation:</p>
<blockquote>
<p><span>The method developed by the team from Johns Hopkins University excelled in its <strong>specificity</strong>, or its ability to identify typically developing children (TDC) without falsely classifying them as ADHD-positive. They correctly classified 94% of TDC, showing that a diagnostic imaging methodology can be developed with a very low risk of false positives, a fantastic result. Their method was not as effective in terms of <strong>sensitivity</strong>, or its ability to identify true positive ADHD diagnoses. They only identified 21% of cases; however, among those cases, they discerned the subtypes of ADHD with 89.47% accuracy. Other teams demonstrated that there is ample room to improve sensitivity scores. </span></p>
</blockquote>
<p><span>Congratulations to Brian and his team!</span></p>
Colors in R
2011-10-17T16:05:06+00:00
http://simplystats.github.io/2011/10/17/colors-in-r
<p>One of my favorite R packages that I use all the time is the <a href="http://cran.r-project.org/package=RColorBrewer" target="_blank">RColorBrewer</a> package. The package has been around for a while now and is written/maintained by Erich Neuwirth. The guts of the package are based on <a href="http://www.personal.psu.edu/cab38/" target="_blank">Cynthia Brewer’s</a> very cool work on the use of color in cartography (check out the <a href="http://colorbrewer2.org/" target="_blank">colorbrewer web site</a>).</p>
<p>As a side note, I think the ability to manipulate colors in plots/graphs/maps is one of R’s many great strengths. My personal experience is that getting the right color scheme can make a difference in how data are perceived in a plot.</p>
<!-- more -->
<p>RColorBrewer basically provides one function, brewer.pal, that generates different types of color palettes. There are three types of palettes: sequential, diverging, and qualitative. Roughly speaking, sequential palettes are for continuous data where low is less important and high is more important, diverging palettes are for continuous data where both low and high are important (i.e. deviation from some reference point), and qualitative palettes are for categorical data where there is no logical order (e.g. male/female).</p>
<p>To use the brewer.pal function, it’s often useful to combine it with another R function, colorRampPalette. This function is built into R and is part of the grDevices package. It takes a palette of colors and interpolates between the colors to give you an entire spectrum. Think of a painter’s palette with 4 or 5 color blotches on it, and then think of the painter taking a brush and blending the colors together. That’s what colorRampPalette does. So brewer.pal gives you the colors and colorRampPalette mashes them together. It’s a happy combination.</p>
<p>So, how do we use these functions? My basic approach is to first set the palette depending on the type of data. Suppose we have continuous sequential data and we want the “Blue-Purple” palette</p>
<pre>library(RColorBrewer)
colors <- brewer.pal(4, "BuPu")
</pre>
<p>Here, I’ve taken 4 colors from the “BuPu” palette, so there are now 4 blotches on my palette. To interpolate these colors, I can call colorRampPalette, which actually returns a <em>function</em>.</p>
<pre>pal <- colorRampPalette(colors)
</pre>
<p>Now, pal is a function that takes a positive integer argument and returns that number of colors from the palette. So for example</p>
<pre>> pal(5)
[1] "#EDF8FB" "#C1D7E9" "#9FB1D4" "#8B80BB" "#88419D"
</pre>
<p>I got 5 different colors from the palette, with their red, green, and blue values coded in hexadecimal. If I wanted 20 colors I could have called pal(20).</p>
<p>The pal function is useful in other functions like image or wireframe (in the lattice package). In image, the ‘col’ argument can be given a set of colors generated by the pal function; in wireframe, the surface colors are set through the ‘col.regions’ argument (together with drape = TRUE). For example, you could call</p>
<pre>data(volcano)
image(volcano, col = pal(30))
</pre>
<p>and you would plot the ‘volcano’ data using 30 colors from the “BuPu” palette.</p>
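<p>For wireframe, here is a minimal sketch of the same idea. It assumes the lattice package is installed, and it builds a stand-in for pal from two hard-coded hex colors (the end points of the pal(5) output above) so that it runs without RColorBrewer. In lattice the surface colors go through the ‘col.regions’ argument together with drape = TRUE, rather than ‘col’.</p>
<pre>library(lattice)

# Stand-in for the pal function built above (an assumption for this sketch):
# interpolate between the lightest and darkest colors of the pal(5) output.
pal <- colorRampPalette(c("#EDF8FB", "#88419D"))

# drape = TRUE colors the surface by height; col.regions supplies the colors.
p <- wireframe(volcano, drape = TRUE, col.regions = pal(100),
               xlab = "row", ylab = "column", zlab = "height")
print(p)
</pre>
<p>Asking for 100 colors rather than 30 just gives a smoother gradient on the surface; any reasonably large number should work.</p>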
<p>If you’re wondering what all the different palettes are and what colors are in them, here’s a handy reference:</p>
<p><img src="http://media.tumblr.com/tumblr_lsyvc6tZ9U1r08wvg.jpg" alt="" /></p>
<p>Or you can just call</p>
<pre>display.brewer.all()</pre>
<p>There’s been a lot of interesting work done on colors in R and this is just scratching the surface. I’ll probably return to this subject in a future post.</p>
Competing through data: Three experts offer their game plan
2011-10-17T01:53:00+00:00
http://simplystats.github.io/2011/10/17/competing-through-data-three-experts-offer-their-game
<p><a href="https://www.facebook.com/video/video.php?v=10150407246723134">Competing through data: Three experts offer their game plan</a></p>
Where would we be without Dennis Ritchie?
2011-10-16T16:05:06+00:00
http://simplystats.github.io/2011/10/16/where-would-we-be-without-dennis-ritchie
<p>Most have probably seen this already since it happened a few days ago, but <a target="_blank" href="http://www.nytimes.com/2011/10/14/technology/dennis-ritchie-programming-trailblazer-dies-at-70.html">Dennis Ritchie died</a>. It just blows my mind how influential his work was — developing the C language, Unix — and how so many pieces of technology bear his fingerprints. </p>
<p>My first encounter with K&R was in college when I learned C programming in the “Data Structures and Programming Techniques” class at Yale (taught by <a href="http://www.cs.yale.edu/people/eisenstat.html" target="_blank">Stan “the man” Eisenstat</a>). Looking back, his book seems fairly easy to read and understand, but I must have cursed that book a million times when I took that course!</p>
Interview With Daniela Witten
2011-10-14T14:37:00+00:00
http://simplystats.github.io/2011/10/14/interview-with-daniela-witten
<p><strong>Note</strong>: This is the first in a series of posts where we will be interviewing junior, up-and-coming statisticians/data scientists. Our goal is to build visibility for people who are at the early stages of their careers.</p>
<p><strong>Daniela Witten</strong></p>
<p><img src="http://www.biostat.washington.edu/~dwitten/DanielaWittenSmall.jpg" width="230" height="308" /></p>
<p>Daniela is an assistant professor of Biostatistics at the University of Washington in Seattle. She moved to Seattle after getting her Ph.D. at Stanford. Daniela has been developing exciting new statistical methods for analyzing high dimensional data and is a recipient of the NIH Director’s Early Independence Award.</p>
<p><strong>Which term applies to you: data scientist/statistician/data analyst?</strong></p>
<p>Statistician! We have to own the term. Some of us have a tendency to try to sugarcoat what we do. But I say that I’m a statistician with pride! It means that I have been rigorously trained, that I have a broadly applicable skill set, and that I’m always open to new and interesting problems. Also, I sometimes get surprised reactions from people at cocktail parties, which is funny.</p>
<p>To the extent that there is a stigma associated with being a statistician, we statisticians need to face the problem and overcome it. The future of our field depends on it.</p>
<p><strong>How did you get into statistics/data science?</strong></p>
<p>I definitely did not set out to become a statistician. Before I got to college, I was planning to study foreign languages. Like most undergrads, I changed my mind, and eventually I majored in biology and math. I spent a summer in college doing experimental biology, but quickly discovered that I had neither the hand-eye coordination nor the patience for lab work. When I was nearing the end of college, I wasn’t sure what was next. I wanted to go to grad school, but I didn’t want to commit to one particular area of study for the next five years and potentially for my entire career.</p>
<p>I was lucky to be at Stanford and to stumble upon the Stat department there. Initially, statistics appealed to me because it was a good way to combine my interests in math and biology from the safety of a computer terminal instead of a lab bench. After spending more time in the department, I realized that if I studied statistics, I could develop a broad skill set that could be applied to a variety of areas, from cancer research to movie recommendations to the stock market.</p>
<p><strong>What is the problem currently driving you?</strong></p>
<p>My research involves the development of statistical methods for the analysis of very large data sets. Recently, I’ve been interested in better understanding networks and their applications to biology. In the past few years there has been a lot of work in the statistical community on network estimation, or graphical modeling. In parallel, biologists have been interested in taking network-based approaches to understanding large-scale biological data sets. There is a real need for these two areas of research to be brought closer together, so that statisticians can develop useful tools for rigorous network-based analysis of biological data sets.</p>
<p>For example, the standard approach for analyzing a gene expression data set with samples from two classes (like cancer and normal tissue) involves testing each gene for differential expression between the two classes, for instance using a two-sample t-statistic. But we know that an individual gene does not drive the differences between cancer and normal tissue; rather, sets of genes work together in pathways in order to have an effect on the phenotype. Instead of testing individual genes for differential expression, can we develop an approach to identify aspects of the gene network that are perturbed in cancer?</p>
<p><strong>What are the top 3 skills you look for in a student who works with you?</strong></p>
<p>I look for a student who is intellectually curious, self-motivated, and a good personality fit. Intellectual curiosity is a prerequisite for grad school, self-motivation is needed to make it through the 2 years of PhD level coursework and 3 years of research that make up a typical Stat/Biostat PhD, and a good personality fit is needed because grad school is long and sometimes frustrating (but ultimately very rewarding), and it’s important to have an advisor who can be a friend along the way!</p>
<p><strong>Who were really good mentors to you? What were the qualities that really helped you?</strong></p>
<p>My PhD advisor, Rob Tibshirani, has been a great mentor. In addition to being a top statistician, he is also an enthusiastic advisor, a tireless advocate for his students, and a loyal friend. I learned from him the value of good collaborations and of simple solutions to complicated problems. I also learned that it is important to maintain a relaxed attitude and to occasionally play pranks on students.</p>
<p><strong>For more information:</strong></p>
<p>Check out her <a href="http://www.biostat.washington.edu/~dwitten/" target="_blank">website</a>. Or read her really nice papers on <a href="http://www.biostat.washington.edu/~dwitten/Papers/WittenTibsPenalizedLDA2010-FINAL-MARCH252011.pdf" target="_blank">penalized classification</a> and <a href="http://www.biostat.washington.edu/~dwitten/Papers/pmd.pdf" target="_blank">penalized matrix decompositions</a>.</p>
Moneyball for Academic Institutes
2011-10-13T13:26:00+00:00
http://simplystats.github.io/2011/10/13/moneyball-for-academic-institutes
<p>A way that universities grow in research fields for which they have no department is by creating institutes. Millions of dollars are invested to promote collaboration between existing faculty interested in the new field. But do they work? Does the university get their investment back? Through the years I have noticed that many institutes are nothing more than a webpage and others are so successful they practically become self-sustained entities. <a href="http://www.itmat.upenn.edu/docs/Hughes_et_al_ScienceTranslationalMedicine_2010.pdf" target="_blank">This paper</a> (published in <a href="http://stm.sciencemag.org/content/2/53/53ps49.short" target="_blank">STM</a>) led by <a href="http://bioinf.itmat.upenn.edu/hogeneschlab/" target="_blank">John Hogenesch</a>, uses data from papers and grants to evaluate an institute at Penn. Among other things, they present a method that uses network analysis to objectively evaluate the effect of the institute on collaboration. The findings are fascinating. </p>
<p>The use of data to evaluate academics is becoming more and more popular, especially among administrators. Is this a good thing? I am not sure yet, but statisticians had better get involved before a biased analysis gets some of us fired.</p>
Benford's law
2011-10-12T13:44:00+00:00
http://simplystats.github.io/2011/10/12/benfords-law
<p>Am I the only one who didn’t know about <a href="http://en.wikipedia.org/wiki/Benford's_law" target="_blank">Benford’s law</a>? It says that for many datasets, the probability that the first digit of a random element is <em>d</em> is given by P(<em>d</em>) = log<sub>10</sub>(1 + 1/<em>d</em>). <a href="http://econerdfood.blogspot.com/2011/10/benfords-law-and-decreasing-reliability.html" target="_blank">This post</a> by <a href="http://apps.olin.wustl.edu/faculty/wang/" target="_blank">Jialan Wang</a> explores financial report data and, using Benford’s law, notices that something fishy is going on… </p>
<p>Hat tip to David Santiago.</p>
<p>Update: A link has been fixed. </p>
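<p>As a rough illustration of the formula above, the sketch below computes the Benford probabilities in R and compares them with the leading digits of the powers of 2, a sequence whose first digits are known to follow the law closely. The choice of 2^1 through 2^1000 and the variable names are mine, not something taken from the linked post.</p>
<pre># Benford probabilities for leading digits 1 through 9
d <- 1:9
benford <- log10(1 + 1/d)

# Leading digits of 2^1, ..., 2^1000 (each is exactly representable as a double)
x <- 2^(1:1000)
first_digit <- as.integer(substr(formatC(x, format = "e", digits = 10), 1, 1))

# Observed proportion of each leading digit
observed <- as.numeric(table(factor(first_digit, levels = d))) / length(x)

round(cbind(digit = d, benford = benford, observed = observed), 3)
</pre>
<p>With real data, such as figures from financial reports, you would replace x with the reported numbers and look for digits whose observed proportion drifts away from the Benford column.</p>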
Errors in Biomedical Computing
2011-10-11T14:30:00+00:00
http://simplystats.github.io/2011/10/11/errors-in-biomedical-computing
<p>Biomedical Computation Review has a <a href="http://www.biomedicalcomputationreview.org/7/2/9.pdf" target="_blank">nice summary</a> (in which I am quoted briefly) by Kristin Sainani about the many different types of errors in computational research, including the infamous Duke incident and some other recent examples. The reproducible research policy at <em>Biostatistics</em> is described as an example for how the publication process might need to change to prevent errors from persisting (or occurring).</p>
Authorship conventions
2011-10-11T12:20:00+00:00
http://simplystats.github.io/2011/10/11/authorship-conventions
<p>The main role of academics is the creation of knowledge. In science, publications are the main venue by which we share our accomplishments, our ideas. Not surprisingly, publications are heavily weighted in hires and promotions. But with multiple author papers how do we know how much each author contributed? Here are some related links from <a href="http://sciencecareers.sciencemag.org/career_magazine/previous_issues/articles/2010_04_16/caredit.a1000039%20%20" target="_blank">Science</a> and <a href="http://www.nature.com/embor/journal/v8/n11/full/7401095.html%20%20" target="_blank">Nature</a> and below I share some thoughts specific to Applied Statistics.</p>
<p>It is common for theoretical statisticians to publish <a href="http://pubs.amstat.org/doi/abs/10.1198/016214501750332875" target="_blank">solo papers</a>. For these it is clear who takes the credit for the idea. In contrast, applied statisticians typically include various authors. Examples include the postdoc that did most of the work, the graduate student that helped, the programmer that wrote associated software, and the biologists that created the data. So what position do we assign ourselves so that those that evaluate us know our role? Many of us working with lab scientists have adopted their convention: the main knowledge creator, usually the lab head, goes last and is the corresponding author. Here are examples from <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945785/?tool=pubmed" target="_blank">Jeff</a>, <a href="http://bioinformatics.oxfordjournals.org/content/27/10/1447.long" target="_blank">Hongkai</a>, <a href="http://onlinelibrary.wiley.com/doi/10.1111/j.1469-1809.2010.00623.x/abstract?systemMessage=Wiley+Online+Library+will+be+disrupted+8+Oct+from+10-14+BST+for+monthly+maintenance" target="_blank">Ingo</a>, and <a href="http://www.ncbi.nlm.nih.gov/pubmed/21706001" target="_blank">myself</a>. Through conversations with senior Biostatistics and Statistics faculty I have been surprised to learn that many are not aware of this. In some cases they went as far as advising junior faculty to publish more first author papers. This is somewhat concerning because junior faculty could be faced with study sections (where our grants are evaluated) that look for last author papers. Study section is not going to change so I am hoping this post will help educate the statistical community about the meaning of last author papers for those of us working in genomics and other lab-science related fields. Here is a summary of authorship conventions in these fields:</p>
<ul>
<li><span class="il">Last</span> and corresponding <span class="il">author</span> is associated with the major contributor of ideas and leadership. This is the most desirable position.</li>
<li>First <span class="il">author</span> is associated with the person who did most of the implementation and computing work. Very good for a postdoc or jr faculty. Excellent for a graduate student.</li>
<li>First and corresponding is sometimes used when the person not only had the ideas, but also did half or more of the work. This is rare.</li>
<li>Big collaborative projects will have two or more corresponding authors and two or more “first” authors. I included <a href="http://www.ncbi.nlm.nih.gov/pubmed/21706001" target="_blank">an example</a> above.</li>
</ul>
Government data collection vortex
2011-10-11T11:50:00+00:00
http://simplystats.github.io/2011/10/11/government-data-collection-vortex
<p><a href="http://nyti.ms/pbYl6K">Government data collection vortex</a></p>
Terence’s Stuff: Speaking, reading, writing
2011-10-11T04:19:00+00:00
http://simplystats.github.io/2011/10/11/terences-stuff-speaking-reading-writing
<p><a href="http://bulletin.imstat.org/2011/09/terence%E2%80%99s-stuff-speaking-reading-writing/">Terence’s Stuff: Speaking, reading, writing</a></p>
An R function to determine if you are a data scientist
2011-10-10T13:05:54+00:00
http://simplystats.github.io/2011/10/10/datascientist
<p>“Data scientist” is one of the buzzwords in the running for rebranding applied statistics mixed with some computing. David Champagne, over at Revolution Analytics, <a href="http://tdwi.org/articles/2011/01/05/Rise-of-Data-Science.aspx" target="_blank">described</a> the skills for being a data scientist with a Venn Diagram. Just for fun, I wrote a little R function for determining where you land on the data science Venn Diagram. Here is an example of a plot the function makes using the Simply Statistics bloggers as examples. </p>
<p><img src="http://www.biostat.jhsph.edu/~jleek/datascience2.png" alt="" /></p>
<p>The code can be found <a href="http://biostat.jhsph.edu/~jleek/code/dataScientist.R" target="_blank">here</a>. You will need the <a href="http://cran.r-project.org/web/packages/png/index.html" target="_blank">png</a> and <a href="http://cran.r-project.org/web/packages/klaR/index.html" target="_blank">klaR</a> R packages to run the script. You also need to either download the file <a href="http://biostat.jhsph.edu/~jleek/datascience.png" target="_blank">datascience.png</a> or be connected to the internet. </p>
<p>Here is the function definition:</p>
<pre>dataScientist(names = c("D. Scientist"), skills = matrix(rep(1/3, 3), nrow = 1), addSS = TRUE, just = NULL)</pre>
<ul>
<li>names = a character vector of the names of the people to plot</li>
<li>addSS = if TRUE will add the blog authors to the plot</li>
<li>just = whether to write the name on the right or the left of the point, just = “left” prints on the left and just = “right” prints on the right. If just = NULL, then all names will print to the right.</li>
<li>skills = a matrix with one row for each person you are plotting, the first column corresponds to “hacking”, the second column is “substantive expertise”, and the third column is “math and statistics knowledge”</li>
</ul>
<p>So how do you define your skills? Here is how it works:</p>
<p><strong>If you are an academic</strong></p>
<p>You calculate your skills by adding papers in journals. The classification scheme is the following:</p>
<ul>
<li>Hacking = sum of papers in journals that are primarily dedicated to software/computation/methods for very specific problems. Examples are: Bioinformatics, Journal of Statistical Software, IEEE Computing in Science and Engineering, or a software article in Genome Biology.</li>
<li>Substantive = sum of papers in journals that primarily publish scientific results such as JAMA, New England Journal of Medicine, Cell, Sleep, Circulation</li>
<li>Math and Statistics = sum of papers in primarily statistical journals including Biostatistics, Biometrics, JASA, JRSSB, Annals of Statistics</li>
</ul>
<p>Some journals are general, like Nature, Science, the Nature sub-journals, PNAS, and PLoS One. For papers in those journals, assess which of the areas the paper falls in by determining the main contribution of the paper in terms of the non-academic classification below. </p>
<p><strong>If you are a non-academic</strong></p>
<p>Since papers aren’t involved, determine the percent of your time you spend on the following things:</p>
<ul>
<li>Hacking = downloading/transferring data, cleaning data, writing software, combining previously used software</li>
<li>Substantive = time you spend learning about the scientific problem, discussing with scientists, working in the lab/field.</li>
<li>Math and Statistics = time you spend formalizing a problem in mathematical notation, time you spend developing new mathematical/statistical theory, time you spend developing general methods.</li>
</ul>
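<p>To make the bookkeeping concrete, here is a small sketch of how the skills matrix could be put together for two made-up academics. The names and paper counts are hypothetical. The post does not say whether dataScientist wants raw counts or proportions; since the default argument is three equal proportions, this sketch assumes proportions and rescales each row to sum to 1.</p>
<pre># Hypothetical paper counts; columns follow the order described above:
# hacking, substantive expertise, math and statistics knowledge
counts <- matrix(c( 4, 10, 6,
                   12,  3, 5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("A. Person", "B. Person"),
                                 c("hacking", "substantive", "mathstat")))

# Rescale each row so the three skills sum to 1 (an assumption, see above)
skills <- counts / rowSums(counts)
skills

# After sourcing dataScientist.R from the link above, the call would look like:
# dataScientist(names = rownames(skills), skills = skills, addSS = FALSE)
</pre>
<p>A non-academic could fill the same matrix directly with the fractions of time spent on each activity.</p>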
<p>Enjoy!</p>
Excuse our mess...
2011-10-10T01:09:00+00:00
http://simplystats.github.io/2011/10/10/excuse-our-mess
<p>…we are in the process of changing themes. The spammers got to us in the notes. I tried to fix the html and that didn’t go so well. New theme up shortly. </p>
<p><strong>Update</strong>: Done! We are back in business - minus the spammers. </p>
APIs!
2011-10-09T19:05:05+00:00
http://simplystats.github.io/2011/10/09/apis
<p>Application programming interfaces (<a href="http://en.wikipedia.org/wiki/Application_programming_interface" target="_blank">API</a>s) are tools that are built by companies/governments/organizations to allow software engineers to interact with their websites. One of the main uses of these APIs is to allow software engineers to build apps on top of <a href="http://developers.facebook.com/" target="_blank">Facebook</a>/<a href="https://dev.twitter.com/" target="_blank">Twitter</a>/etc. Many APIs are really helpful for statisticians/data scientists as well. Using APIs, it is generally very easy to collect large amounts of interesting data. <a href="http://www.programmableweb.com/apis/directory" target="_blank">Here</a> are some examples of APIs (you may need to sign up for accounts to get access to some of these). They vary in how easy/useful it is to obtain data from them. If people know of other good ones, I’d love to see them in the comments. </p>
<p><strong>Web 2.0</strong></p>
<ol>
<li><a href="https://dev.twitter.com/docs/using-search" target="_blank">Twitter</a> and associated <a href="http://cran.r-project.org/web/packages/twitteR/" target="_blank">R package</a></li>
<li><a href="http://code.google.com/apis/analytics/docs/gdata/home.html" target="_blank">Google analytics</a></li>
<li><a href="http://code.google.com/apis/blogger/index.html" target="_blank">Blogger</a></li>
<li><a href="http://www.indeed.com/jsp/apiinfo.jsp" target="_blank">Indeed</a></li>
<li><a href="https://sites.google.com/site/grouponapiv2/api-resources/deals" target="_blank">Groupon</a></li>
</ol>
<p><strong>Publishing</strong></p>
<ol>
<li><a href="http://developer.nytimes.com/docs" target="_blank">New York Times</a></li>
<li><span><a href="http://arxiv.org/help/api/index" target="_blank">ArXiv</a></span></li>
<li><a href="http://www.ncbi.nlm.nih.gov/books/NBK25500/" target="_blank">Pubmed</a></li>
<li><a href="http://api.plos.org/" target="_blank">PLoS</a></li>
<li><a href="http://dev.mendeley.com/" target="_blank">Mendeley</a></li>
</ol>
<p><strong>Government</strong></p>
<ol>
<li><span><a href="http://www.fedspending.org/apidoc.php" target="_blank">FedSpending</a> </span></li>
<li><span><a href="http://data.ed.gov/" target="_blank">Department of Education</a></span></li>
<li><a href="http://tools.cdc.gov/register/" target="_blank">CDC</a></li>
</ol>
A nice presentation on regex in R
2011-10-09T13:17:03+00:00
http://simplystats.github.io/2011/10/09/a-nice-presentation-on-regex-in-r
<p>Over at Recology there is a nice <a href="http://r-ecology.blogspot.com/2011/10/r-tutorial-on-regular-expressions-regex.html" target="_blank">presentation</a> on regular expressions. I found this on the R bloggers site. </p>
Single Screen Productivity
2011-10-08T14:25:00+00:00
http://simplystats.github.io/2011/10/08/single-screen-productivity
<p>Here’s a claim for which I have absolutely no data: I believe I am more productive with a smaller screen/monitor. I have a 13” MacBook Air that I occasionally hook up to a 21-inch external monitor. Sometimes, when I want to read a document I’ll hook up the external monitor so that I can see a whole page at a time. Other times, when I’m using R, I’ll have the graphics window on the external and then the R console and Emacs on the main screen.</p>
<p>But my feeling is that when I’ve got more monitor real estate I’m less productive. I think it’s because I have the freedom to open more windows and to have more things going on. When I’ve got my laptop, I can only really afford to have 1 or 2 windows open. So I’m more focused on whatever I’m supposed to be doing. I also think this is one of the (small) reasons that people like things like the iPad. It’s a single application/single window device.</p>
<p>A quick Google search will find some <a href="http://www.unplggd.com/unplggd/roundup/roundup-multiple-monitor-homes-052915" target="_blank">pretty crazy multiple-monitor setups</a> out there. For some of them you’d think they were head of security at Los Angeles International Airport or something. And most people I know would scoff at the idea of working solely on your laptop while in the office. Partially, it’s an ergonomic issue. But maybe they just need an external monitor that’s 13 inches? I think I have one sitting in my basement somewhere….</p>
R Workshop: Reading in Large Data Frames
2011-10-07T15:54:00+00:00
http://simplystats.github.io/2011/10/07/r-workshop-reading-in-large-data-frames
<p>One question I get a lot is how to read large data frames into R. There are some useful tricks that can save you both time and memory when reading large data frames, but I find that many people are not aware of them. Of course, your ability to read data is limited by your available memory. I usually do a rough calculation along the lines of</p>
<p><span># rows * # columns * 8 bytes / 2^20</span></p>
<p>This gives you the number of megabytes of the data frame (roughly speaking, it could be less). If this number is more than half the amount of memory on your computer, then you might run into trouble.</p>
<!-- more -->
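<p>As a minimal sketch, the rough calculation above can be wrapped in a small function. The name ‘estimate_mb’ and the example dimensions are made up for illustration, and the 8 bytes per value assumes every column is numeric:</p>
<pre># Rough size, in megabytes, of an all-numeric data frame
# (8 bytes per value; 2^20 bytes per megabyte).
estimate_mb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 2^20
}

# Hypothetical table: 1.5 million rows and 120 columns
estimate_mb(1.5e6, 120)
</pre>
<p>For this hypothetical table the estimate is roughly 1,373 MB, so on a machine with 2 GB of memory you might run into trouble.</p>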
<p>First, read the help page for ‘read.table’. It contains many hints for how to read in large tables. Of course, help pages tend to be a little confusing so I’ll try to distill the relevant details here.</p>
<p>The following options to ‘read.table()’ can affect R’s ability to read large tables:</p>
<p><strong>colClasses</strong></p>
<p>This option takes a vector whose length is equal to the number of columns in your table. Specifying this option instead of using the default can make ‘read.table’ run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are “numeric”, for example, then you can just set ‘colClasses = “numeric”’. If the columns are all different classes, or perhaps you just don’t know, then you can have R do some of the work for you.</p>
<p>You can read in just a few rows of the table and then create a vector of classes from just the few rows. For example, if I have a file called “datatable.txt”, I can read in the first 100 rows and determine the column classes from that:</p>
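<pre># Read only the first 100 rows to learn the column classes
tab100rows <- read.table("datatable.txt", header = TRUE, nrows = 100)
classes <- sapply(tab100rows, class)
# Reuse those classes when reading the whole table
tabAll <- read.table("datatable.txt", header = TRUE, colClasses = classes)
</pre>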
<p>Always try to use ‘colClasses’; it will make a very big difference. In particular, if one of the column classes is “character”, “integer”, “numeric”, or “logical”, then things will be optimal (because those are the basic classes).</p>
<p><strong>nrows</strong></p>
<p>Specifying the ‘nrows’ argument doesn’t necessarily make things go faster but it can help a lot with memory usage. R doesn’t know how many rows it’s going to read in so it first makes a guess, and then when it runs out of room it allocates more memory. The constant allocations can take a lot of time, and if R overestimates the amount of memory it needs, your computer might run out of memory. Of course, you may not know how many rows your table has. The easiest way to find this out is to use the ‘wc’ command in Unix. So if you run ‘wc datafile.txt’ in Unix, then it will report to you the number of lines in the file (the first number). You can then pass this number to the ‘nrows’ argument of ‘read.table()’. If you can’t use ‘wc’ for some reason, but you know that there are definitely fewer than, say, N rows, then you can specify ‘nrows = N’ and things will still be okay. A mild overestimate for ‘nrows’ is better than none at all.</p>
<p><strong>comment.char</strong></p>
<p>If your file has no comments in it (e.g. lines starting with ‘#’), then setting ‘comment.char = “”’ will sometimes make ‘read.table()’ run faster. In my experience, the difference is not dramatic.</p>
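<p>Putting the three options together, here is a hedged sketch rather than a recipe. The example file, its columns, and the variable names are all made up so that the code runs on its own. It counts lines with ‘readLines()’ for portability; for a truly large file, ‘wc’ is the better choice since it does not load the file into R:</p>
<pre># Write a small example file so the sketch is self-contained
# (the file and its columns are invented for illustration).
path <- tempfile(fileext = ".txt")
example <- data.frame(id = 1:1000, x = rnorm(1000),
                      g = sample(letters, 1000, replace = TRUE))
write.table(example, path, row.names = FALSE, quote = FALSE)

# Count the lines in the file; 'wc' does the same job on Unix.
n_lines <- length(readLines(path))

# Guess the column classes from the first 100 rows.
first_rows <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(first_rows, class)

# Read everything, passing all three options.
tab_all <- read.table(path, header = TRUE,
                      colClasses = classes,
                      nrows = n_lines - 1,   # subtract the header line
                      comment.char = "")
</pre>
<p>Note that the line count includes the header, which is why the sketch subtracts one. Passing the raw count would also be fine, since a mild overestimate for ‘nrows’ is okay.</p>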
A Really Cool Paper on the "Hot Hand" in Sports
2011-10-06T14:23:00+00:00
http://simplystats.github.io/2011/10/06/a-really-cool-paper-on-the-hot-hand-in-sports
<p>I just found <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0024532#pone-0024532-g001" target="_blank">this</a> really cool paper on the phenomenon of the “hot hand” in sports. The idea behind the <a href="http://en.wikipedia.org/wiki/Hot-hand_fallacy" target="_blank">“hot hand”</a> (also called the <a href="http://en.wikipedia.org/wiki/Clustering_illusion" target="_blank">“clustering illusion”</a>) is that success breeds success. In other words, when you are successful (you win games, you make free throws, you get hits) you will continue to be successful. In sports, it has frequently been observed that events are close to independent, meaning that the “hot hand” is just an illusion. </p>
<!-- more -->
<p>In the paper, the authors downloaded all the data on NBA free throws for the 2005/2006 through the 2009/2010 seasons. <span>They cleaned up the data, then analyzed changes in conditional probability. Their analysis suggested that free throw success was not an independent event. They go on to explain: </span></p>
<blockquote>
<p><span>However, while statistical traces of this phenomenon are observed in the data, an open question still remains: are these non random patterns a result of “success breeds success” and “failure breeds failure” mechanisms or simply “better” and “worse” periods? Although free throws data is not adequate to answer this question in a definite way, we speculate based on it, that the latter is the dominant cause behind the appearance of the “hot hand” phenomenon in the data.</span></p>
</blockquote>
<p>The things I like about the paper are that they explain things very simply, use a lot of real data they obtained themselves, and are very careful in their conclusions. </p>
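<p>To make the idea of “changes in conditional probability” concrete, here is a small sketch on simulated data. It is not the authors’ analysis or their data: the shots are generated independently with a made-up 75% success rate, so the two conditional probabilities should come out about equal. With real data, a gap between them would be one sign that free throws are not independent.</p>
<pre>set.seed(1)
# Simulated free throws: 1 = make, 0 = miss, independent with P(make) = 0.75.
# This is made-up data, not the NBA data from the paper.
shots <- rbinom(10000, size = 1, prob = 0.75)

prev_shot <- head(shots, -1)   # outcome of attempt i
next_shot <- tail(shots, -1)   # outcome of attempt i + 1

# Estimated P(make | previous make) and P(make | previous miss)
p_after_make <- mean(next_shot[prev_shot == 1])
p_after_miss <- mean(next_shot[prev_shot == 0])
c(after_make = p_after_make, after_miss = p_after_miss)
</pre>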
R Workshop
2011-10-06T12:58:32+00:00
http://simplystats.github.io/2011/10/06/r-workshop
<p>I am going to start a continuing “R Workshop” series of posts with R tips and tricks. If you have questions you’d like answered or were wondering about certain aspects, please leave them in the comments.</p>
Prezi
2011-10-06T11:09:00+00:00
http://simplystats.github.io/2011/10/06/prezi
<p><a href="http://www.biostat.jhsph.edu/~ajaffe/" target="_blank">Andrew Jaffe</a> pointed me to <a href="http://prezi.com/" target="_blank">prezi.com</a>. It looks like a new way of making presentations. Andrew made an example <a href="http://prezi.com/ft-_thkdllaf/dna-methylation/?auth_key=d9ae396295709050aa1bcb5f40b1665483f9c414" target="_blank">here</a> in just a couple of minutes. <a href="http://prezi.com/ftv9hvziwqi2/coca-cola-company/" target="_blank">Here</a> is one about Coca-Cola.</p>
<p>Things I like: </p>
<ol>
<li>I go to a lot of Beamer/Powerpoint talks; these presentations at least look different and could be interesting. </li>
<li>It is cool how it is easy to arrange slides in a non-linear order and potentially avoid clicking forward a few slides then back a few slides</li>
<li>I also like how the “global picture” of the talk can be shown in a display. </li>
</ol>
<p>Things I’m worried about:</p>
<ol>
<li>All the zooming and panning might start to drive people nuts, like slide transitions in powerpoint. </li>
<li>There is serious potential for confusing presentations, organization is already a problem with some talks. </li>
<li>There is potential for people to spend too much time on making the prezi look cool and less on content. </li>
</ol>
<p><strong>Update:</strong> From the comments <span>Abhijit points out that David Smith put together a presentation on the R ecosystem using Prezi. Check it out <a href="http://prezi.com/s1qrgfm9ko4i/the-r-ecosystem/" target="_blank">here</a>.</span></p>
Submitting scientific papers is too time consuming
2011-10-05T13:46:00+00:00
http://simplystats.github.io/2011/10/05/submitting-scientific-papers-is-too-time-consuming
<p>As an academic who does a lot of research for a living, I spend a lot of my time writing and submitting papers. Before my time, this process involved sending multiple physical copies of a paper by snail mail to the editorial office. New technology has changed this process. Now to submit a paper you generally have to: (1) find a Microsoft Word or Latex template for the journal and use it for your paper and (2) upload the manuscript and figures (usually separately). This is a big improvement over snail mail submission! But it still takes a huge amount of time. Some simple changes would give academics back huge blocks of time to focus on teaching and research.</p>
<p>Just to give an idea of how complicated the current system is here is an outline of what it takes to submit a paper.</p>
<p>To complete step (1) you go to the webpage of the journal you are submitting to, find their template files, and wrestle your content into the template. Sometimes this requires finding additional files which are not on the website of the journal you are submitting to. It always requires a large amount of tweaking the text and content to fit the template.</p>
<p>To complete step (2) you have to go to the webpage of the journal and start an account with their content management system. There are frequently different requirements for usernames and passwords, leading to proliferation of both. Then you have to upload the files and fill out between 5 and 7 web forms with information about the authors, information about the paper, information about the funding, information about human subjects research, etc. If the files aren’t in the right format you may have to reformat them before they will be accepted. Some journals even have editorial assistants who will go over your submission and find problems that have to be resolved before your paper can even be reviewed.</p>
<p>This whole process can take anywhere from one to ten hours, depending on the journal. If you have to revise your paper for that journal, you have to go through the process again. If your paper is rejected, then you have to start all over with a new template and a new content management system at a new journal.</p>
<p>It seems like a much simpler system would be for people to submit their papers in pdf/word format with all the figures embedded. If the paper is accepted to a journal, then of course you might need to reformat the submission to make it easier for typesetters to reformat your article. But that could happen just one time, once a paper is accepted.</p>
<p>This seems like a small thing. But suppose you submit a paper between 10 and 15 times a year (very common for academics in my field). Suppose it takes on average 3 hours to submit a paper. That is at least 3 x 10 = 30 hours a year, almost an entire workweek, just dealing with reformatting papers!</p>
<p>In the comments, I’d love to hear about the best/worst experiences you have had submitting papers. Where is good? Where is bad?</p>
Cool papers
2011-10-04T16:52:00+00:00
http://simplystats.github.io/2011/10/04/cool-papers
<ol>
<li><a href="http://www.sciencemag.org/content/333/6051/1878" target="_blank">Here</a> is a paper where they scraped Twitter data over a year and showed how the tweets corresponded with sleep patterns and diurnal rhythms. The coolest part of this paper is that these two guys just went out and collected the data for free. I wish they had focused on more interesting questions though; it seems like you could do a lot with data like this. </li>
<li>Since flu season is upon us, <a href="http://arxiv.org/abs/1109.0262" target="_blank">here</a> is an interesting paper where the authors used data on friendship networks and class structure in a high school to study flu transmission. They show targeted treatment isn’t as effective as people had thought when using random mixing models. </li>
<li>This one is a little less statistical. Over the last few years there were some pretty high profile papers that suggested that over-expressing just one protein could double or triple the lifetime of flies or worms. Obviously, that is a pretty crazy/interesting result. But in <a href="http://www.nature.com/nature/journal/v477/n7365/full/nature10296.html" target="_blank">this</a> paper some of those results are called into question. </li>
</ol>
Defining data science
2011-10-04T14:27:00+00:00
http://simplystats.github.io/2011/10/04/defining-data-science
<p>Rebranding of statistics as a field seems to be a popular topic these days and “data science” is one of the potential rebranding options. This <a href="http://blog.revolutionanalytics.com/2011/09/data-science-a-literature-review.html" target="_blank">article</a> over at Revolutions is a nice summary of where the term comes from and what it means. This quote seems pretty accurate:</p>
<blockquote>
<p><span>My own take is that Data Science is a valuable rebranding of computer science and applied statistics skills.</span></p>
</blockquote>
Innovation and overconfidence
2011-10-03T20:08:00+00:00
http://simplystats.github.io/2011/10/03/innovation-and-overconfidence
<p>I posted a while ago on how <a href="http://simplystatistics.tumblr.com/post/10241004305/when-overconfidence-is-good" target="_blank">overconfidence may be a good thing</a>. I just read this fascinating <a href="http://www.worldpolicy.org/journal/fall2011/innovation-starvation" target="_blank">article</a> by Neal Stephenson (via <a href="http://aldaily.com" target="_blank">aldaily.com</a>) about innovation starvation. The article focuses a lot on how science fiction inspires people to work on big/hard/impossible problems in science. It’s a great read for the nerds in the audience. But one quote stuck out for me:</p>
<blockquote>
<p><span>Most people who work in corporations or academia have witnessed something like the following: A number of engineers are sitting together in a room, bouncing ideas off each other. Out of the discussion emerges a new concept that seems promising. Then some laptop-wielding person in the corner, having performed a quick Google search, announces that this “new” idea is, in fact, an old one—or at least vaguely similar—and has already been tried. Either it failed, or it succeeded. If it failed, then no manager who wants to keep his or her job will approve spending money trying to revive it. If it succeeded, then it’s patented and entry to the market is presumed to be unattainable, since the first people who thought of it will have “first-mover advantage” and will have created “barriers to entry.” The number of seemingly promising ideas that have been crushed in this way must number in the millions.</span></p>
</blockquote>
<p>This has to be the single biggest killer of ideas for me. I come up with an idea, google it, find something that is close, and think well it has already been done so I will skip it. I wonder how many of those ideas would have actually turned into something interesting if I had just had a little more overconfidence and skipped the googling? </p>
OracleWorld Claims and Sensations
2011-10-03T12:40:00+00:00
http://simplystats.github.io/2011/10/03/oracleworld-claims-and-sensations
<p>Larry Ellison, the CEO of Oracle, like most technology CEOs, has a tendency for the over-the-top sales pitch. But it’s fun to keep track of what these companies are up to just to see what they think the trends are. It seems clear that companies like IBM, Oracle, and HP, which focus substantially on the enterprise (or try to), think the future is data data data. One piece of evidence is the list of <a href="http://simplystatistics.tumblr.com/post/9955104326/data-analysis-companies-getting-gobbled-up" target="_blank">companies that they’ve acquired</a> recently.</p>
<p>Ellison claims that they’ve <a href="http://bits.blogs.nytimes.com/2011/10/02/larry-ellison-stares-into-the-sun/" target="_blank">developed a new computer</a> that integrates hardware with software to produce an overall faster machine. Why do we need this kind of integration? Well, for data analysis, of course!</p>
<p>I was intrigued by this line from the article:</p>
<blockquote>
<p><span>On Sunday Mr. Ellison mentioned a machine that he claimed would do data analysis 18 to 23 times faster than could be done on existing machines using Oracle databases. The machine would be able to compute both standard Oracle structured data as well as unstructured data like e-mails, he said.</span></p>
</blockquote>
<p><span>It’s always a bit hard in these types of articles to figure out what they mean by “data analysis”, but even still, there’s an important idea here. </span></p>
<p><span><a href="http://www.sdss.jhu.edu/~szalay/" target="_blank">Alex Szalay</a> talks about the need to “bring the computation to the data”. This comes from his experience working with ridiculous amounts of data from the Sloan Digital Sky Survey. There, the traditional model of pulling the data on to your computer, running some analyses, and then producing results just does not work. </span><span>But the opposite is often reasonable. If the data are sitting in an Oracle/Microsoft/etc. database, you bring the analysis to the database and operate on the data there. Presumably, the analysis program is smaller than the dataset, or this doesn’t quite work.</span></p>
<p><span>So if Oracle’s magic computer is real, it and others like it could be important as we start bringing more computations to the data.</span></p>
Karl's take on meetings
2011-10-02T21:47:00+00:00
http://simplystats.github.io/2011/10/02/karls-take-on-meetings
<p><a href="http://kbroman.wordpress.com/2011/09/28/meetings-vs-work/">Karl’s take on meetings</a></p>
Department of Analytics, anyone?
2011-10-02T01:51:38+00:00
http://simplystats.github.io/2011/10/02/department-of-analytics-anyone
<p>This <a href="http://www.nytimes.com/2011/10/02/business/after-moneyball-data-guys-are-triumphant.html" target="_blank">article following up on the Moneyball PR</a> demonstrates one of the reasons why statistics might be doomed:</p>
<blockquote>
<p>Julia Rozovsky is a Yale M.B.A. student who studied economics and math as an undergraduate, a background that prepared her for a traditional — and lucrative — consulting career. Instead, partly as a result of reading “Moneyball” and finding like-minded people, she pointed herself toward work in analytics.</p>
</blockquote>
<p>Why can’t they call it statistics?? The message, of course, is statistics is boring. Analytics is awesome. We probably need to start changing the names of our departments.</p>
Bits: Big Data: Sorting Reality From the Hype
2011-10-01T13:43:00+00:00
http://simplystats.github.io/2011/10/01/bits-big-data-sorting-reality-from-the-hype
<p><a href="http://bits.blogs.nytimes.com/2011/09/30/big-data-sorting-reality-from-the-hype/">Bits: Big Data: Sorting Reality From the Hype</a></p>
Battling Bad Science
2011-09-30T17:16:00+00:00
http://simplystats.github.io/2011/09/30/battling-bad-science
<p><a href="http://www.ted.com/talks/ben_goldacre_battling_bad_science.html" target="_blank">Here</a> is a pretty awesome TED talk by epidemiologist Ben Goldacre where he highlights how science can be used to deceive/mislead. It’s sort of like epidemiology 101 in 15 minutes. </p>
<p>This seems like a highly topical talk. Over on his blog, Steven Salzberg has <a href="http://genome.fieldofscience.com/2011/09/dr-oz-tries-to-do-science.html" target="_blank">pointed out</a> that Dr. Oz has recently been engaging in some of these shady practices on his show. Too bad he didn’t check out the video first. </p>
<p>In the comments section of the TED talk, one viewer points out that Dr. Goldacre doesn’t talk about the role of the FDA and other regulatory agencies. I think that regulatory agencies are under-appreciated and deserve credit for addressing many of these potential problems in the conduct of clinical trials. </p>
<p>Maybe there should be an agency regulating how science is reported in the news? </p>
Why does Obama need statisticians?
2011-09-29T16:21:00+00:00
http://simplystats.github.io/2011/09/29/why-does-obama-need-statisticians
<p>It’s worth following up a little on why the Obama campaign is recruiting statisticians (note to Karen: I am not looking for a new job!). Here’s the blurb for the position of “Statistical Modeling Analyst”:</p>
<blockquote>
<p>The Obama for America Analytics Department analyzes the campaign’s data to guide election strategy and develop quantitative, actionable insights that drive our decision-making. Our team’s products help direct work on the ground, online and on the air. We are a multi-disciplinary team of statisticians, mathematicians, software developers, general analysts and organizers - all striving for a single goal: re-electing President Obama. We are looking for staff at all levels to join our department from now through Election Day 2012 at our Chicago, IL headquarters.</p>
<p>Statistical Modeling Analysts are charged with predicting electoral outcomes using statistical models. These models will be instrumental in helping the campaign determine how to most effectively use its resources.</p>
</blockquote>
<p>I wonder if there’s a bonus for predicting the correct outcome, win or lose?</p>
<p>The Obama campaign didn’t invent the idea of heavy data analysis in campaigns, but they seem to be heavy adopters. There are 3 openings in the “Analytics” category as of today.</p>
<p>Now, can someone tell me why they don’t just call it simply “Statistics”?</p>
Kindle Fire and Machine Learning
2011-09-29T14:05:00+00:00
http://simplystats.github.io/2011/09/29/kindle-fire-and-machine-learning
<p>Amazon released its new iPad competitor, the <a href="http://www.amazon.com/gp/product/B0051VVOB2" target="_blank">Kindle Fire</a>, today. A quick read through the description shows it has some interesting features, including a custom-built web browser called Silk. One innovation that they claim is that the browser works in conjunction with Amazon’s EC2 cloud computing platform to speed up the web-surfing experience by doing some computing on your end and some on their end. Seems cool, if it really does make things faster.</p>
<p>Also there’s this interesting bit:</p>
<blockquote>
<p><strong>Machine Learning</strong></p>
<p>Finally, Silk leverages the collaborative filtering techniques and machine learning algorithms Amazon has built over the last 15 years to power features such as “customers who bought this also bought…” As Silk serves up millions of page views every day, it learns more about the individual sites it renders and where users go next. By observing the aggregate traffic patterns on various web sites, it refines its heuristics, allowing for accurate predictions of the next page request. For example, Silk might observe that 85 percent of visitors to a leading news site next click on that site’s top headline. With that knowledge, EC2 and Silk together make intelligent decisions about pre-pushing content to the Kindle Fire. As a result, the next page a Kindle Fire customer is likely to visit will already be available locally in the device cache, enabling instant rendering to the screen.</p>
</blockquote>
<p>That seems like a logical thing for Amazon to do. While the idea of pre-fetching pages is not particularly new, I haven’t yet heard of the idea of doing data analysis on web pages to predict which things to pre-fetch. One issue this raises in my mind is that in order to do this, Amazon needs to combine information across browsers, which means your surfing habits will become part of one large mega-dataset. Is that what we want?</p>
<p>On the one hand, Amazon already does some form of this by keeping track of what you buy. But keeping track of every web page you go to and what links you click on seems like a much wider scope.</p>
Once in a lifetime collapse
2011-09-29T13:28:00+00:00
http://simplystats.github.io/2011/09/29/once-in-a-lifetime-collapse
<p><img src="http://media.tumblr.com/tumblr_lsadp9X52w1r085xo.jpg" alt="" /></p>
<p><a href="http://www.baseballprospectus.com/odds/" target="_blank">Baseball Prospectus</a> uses Monte Carlo simulation to predict which teams will make the postseason. According to this page, on Sept 1st, the probability of the Red Sox making the playoffs was 99.5%. They were ahead of the Tampa Bay Rays by 9 games. Before last night’s game, in September, the Red Sox had lost 19 of 26 games and were tied with the Rays for the wild card (the last spot for the playoffs). To make this event even more improbable, the Red Sox were up by one in the ninth with two outs and no one on for the last place Orioles. In this situation the team that’s winning wins more than 95% of the time. The Rays were in exactly the same situation as the Orioles, losing to the first place Yankees (well, their subs). So guess what happened? The Red Sox lost, the Rays won. But perhaps the most amazing event is that these two games, both lasting much longer than usual (one due to rain, the other to extra innings), ended within seconds of each other. </p>
<p>Update: Nate Silver beat me to it. And has <a href="http://fivethirtyeight.blogs.nytimes.com/2011/09/29/bill-buckner-strikes-again/" target="_blank">much more</a>!</p>
Obama recruiting analysts who know R
2011-09-28T19:41:00+00:00
http://simplystats.github.io/2011/09/28/obama-recruiting-analysts-who-know-r
<p><a href="http://rdatamining.wordpress.com/2011/09/27/obama-recruiting-analysts-and-r-is-one-preferred-skill/">Obama recruiting analysts who know R</a></p>
The Open Data Movement
2011-09-28T14:11:00+00:00
http://simplystats.github.io/2011/09/28/the-open-data-movement
<p>I’m not sure which of the <a href="http://simplystatistics.tumblr.com/post/10524782074/most-popular-infographics" target="_blank">categories</a> this <a href="http://visually.visually.netdna-cdn.com/TheOpenDataMovement_4e80e4e7c6495.jpg" target="_blank">infographic</a> on open data falls into, but I find it pretty exciting anyway. It shows the rise of APIs and how data are increasingly open. It seems like APIs are all over the place in the web development community, but less so in health statistics. Although, from the comments, John M. posts places to find free government data including some health data: </p>
<blockquote>
<p><span>1) CDC’s National Center for Health Statistics, <a href="http://www.cdc.gov/nchs/" target="_blank">http://www.cdc.gov/nchs/</a><br />2) NHANES (National Health and Nutrition Examination Survey) <a href="http://www.cdc.gov/nchs/nhanes.htm" target="_blank">http://www.cdc.gov/nchs/nhanes.htm</a><br />3) National Health Interview Survey: <a href="http://www.cdc.gov/nchs/nhis.htm" target="_blank">http://www.cdc.gov/nchs/nhis.htm</a><br />4) World Health Organization: <a href="http://www.who.gov/" target="_blank">www.who.gov</a><br />5) US Census Bureau: <a href="http://www.uscensus.gov/" target="_blank">www.uscensus.gov</a><br />6) Emory maintains a repository of links related to stats/biostat including online databases </span></p>
<p><span><a href="http://www.sph.emory.edu/cms/departments_centers/bios/resources.html#govlist" target="_blank">http://www.sph.emory.edu/cms/departments_centers/bios/resources.html#govlist</a></span></p>
</blockquote>
The future of graduate education
2011-09-28T11:49:00+00:00
http://simplystats.github.io/2011/09/28/the-future-of-graduate-education
<p>Stanford is offering a free <a href="http://www.nytimes.com/2011/08/16/science/16stanford.html?_r=1" target="_blank">online course</a> and more than 100,000 students have registered. This got the blogosphere talking about the future of universities. Matt Yglesias thinks that “<a href="http://bit.ly/qm97hI" target="_blank">colleges are the next newspaper</a> and are destined for some very uncomfortable adjustments”. Tyler Cowen reminded us that since 2003 he has been saying that <a href="http://marginalrevolution.com/marginalrevolution/2011/08/the-coming-education-revolution.html" target="_blank">professors are becoming obsolete</a>. His main point is that thanks to the internet, the need for lecturers will greatly diminish. He goes on to predict that</p>
<blockquote>
<p><span>the market was moving towards <a href="http://marginalrevolution.com/marginalrevolution/2009/12/online-education-and-the-market-for-superstar-teachers.html" target="_blank">superstar teachers</a>, who teach hundreds at a time or even thousands online. Today, we have the <a href="http://www.khanacademy.org/" target="_blank">Khan Academy</a>, a huge increase in online education, electronic textbooks and <a href="http://marginalrevolution.com/marginalrevolution/2011/05/sword-for-peer-grading.html" target="_blank">peer grading systems</a> and highly successful superstar teachers with Michael Sandel and his popular course <a href="http://www.justiceharvard.org/" target="_blank">Justice</a>, serving as example number one.</span></p>
</blockquote>
<p>I think this is particularly true for stat and biostat graduate programs, especially in <a href="http://simplystatistics.tumblr.com/post/10124797490/advice-for-stats-students-on-the-academic-job-market" target="_blank">hard money</a> environments.</p>
<!-- more -->
<p>A typical Statistics department will admit five to ten PhD students. In most departments we teach probability theory, statistical theory, and applied statistics. Highly paid professors teach these three courses for these five to ten students, which means that the university ends up spending hundreds of thousands of dollars on them. Where does this money come from? From those that teach hundreds at a time. The stat 101 courses are full of tuition paying students. These students are subsidizing the teaching of our graduate courses. In hard money institutions, they are also subsidizing some of the research conducted by the professors that teach the small graduate courses. Note that 75% of their salaries are covered by the University, yet they are expected to spend much less than 75% of their time preparing and teaching these relatively tiny classes. The leftover time they spend on research for which they have no external funding. This isn’t a bad thing as a lot of good theoretical and basic knowledge has been created this way. However, outside pressure to lower tuition costs has University administrators looking for ways to save and graduate education might be a target. “If you want to teach a class, fill it up with 50 students. If you want to do research, get a grant. ” the administrator might say.</p>
<p>Note that, for example, the stat theory class is pretty much the same every year and across universities. So we can pick a couple of superstar stat theory teachers and have them lead an online course for all the stat and biostat graduate students in the world. Then each department hires an energetic instructor, paying him/her 1/4 what they pay a tenured professor, to sit in a room discussing the online lectures with the five to ten PhD students in the program. Currently there are no incentives for the tenured professor to teach well, but the instructor would be rewarded solely by their teaching performance. Not only does this scheme cut costs, but it can also increase revenue as faculty will have more time to write grant proposals, etc..</p>
<p>So, with teaching out of the equation, why even have departments? Well, for now the internet can’t substitute the one-on-one interactions needed during PhD thesis supervision. As long as NIH and NSF are around, research faculty will be around. The apprenticeship system that has worked for centuries will survive the uncomfortable adjustments that are coming. Special topic seminars will also survive as faculty will use them as part of their research agenda. Rotations, similar to those implemented in Biology programs, can serve as match makers between professors and students. But classroom teaching is due for some “uncomfortable adjustments”.</p>
<p>I agree with Tyler Cowen and Matt Yglesias: the number of cushy professor jobs per department will drop dramatically in the future, especially in hard money institutions. So let’s get ready. Maybe Biostat departments should start planning for the future now. Harvard, Seattle, Michigan, Emory, etc., want to teach stat theory with us?</p>
<p>PS - I suspect this all applies to liberal arts and hard science graduate programs.</p>
The p>0.05 journal
2011-09-28T01:49:00+00:00
http://simplystats.github.io/2011/09/28/the-p-0-05-journal
<p>I want to start a journal called “P>0.05”. This journal will publish all the negative results in science. These would also be stored in a database. Think of all the great things we could do with this. We could, for example, plot p-value histograms for different disciplines. I bet most would have a flat distribution. We could also do it by specific association. A paper comes out saying <a href="http://www.nhs.uk/news/2007/January08/Pages/Chocolatecausesweakbones.aspx" target="_blank">chocolate is linked to weaker bones</a>? Check the histogram and keep eating chocolate. Any publishers interested? </p>
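<p>To see why a flat histogram is the signature of “nothing going on”, here is a minimal simulation sketch. It assumes every study is a simple two-sided z-test with known variance and no true effect; the function names, sample sizes, and seed are all made up for illustration and are not based on any real database of negative results.</p>

```python
import math
import random


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_p_values(n_studies, n_per_study, seed=0):
    """Simulate studies with no true effect and return one p-value per study."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_study)]
        # z statistic for the sample mean, assuming a known variance of 1
        z = sum(sample) / math.sqrt(n_per_study)
        p_values.append(two_sided_p_value(z))
    return p_values


def histogram(p_values, n_bins=10):
    """Count p-values in equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in p_values:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


p_values = simulate_null_p_values(n_studies=5000, n_per_study=20)
counts = histogram(p_values)
for i, count in enumerate(counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f}: {count}")
```

<p>Under these assumptions each bin should hold roughly a tenth of the p-values. A real association would instead pile up p-values near zero, which is the pattern to look for before giving up chocolate.</p>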
Some cool papers
2011-09-27T14:55:00+00:00
http://simplystats.github.io/2011/09/27/some-cool-papers
<ol>
<li>A cool article on the <a href="http://www.pnas.org/content/108/31/12647.abstract" target="_blank">regulator’s dilemma</a>. It turns out that the best risk profile to prevent one bank from failing is not the best risk profile to prevent all banks from failing. </li>
<li><a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0024914" target="_blank">Persistence of web resources</a> for computational biology. I think this one is particularly relevant for academic statisticians since a lot of academic software/packages are developed by graduate students. Once they move on, a large chunk of “institutional knowledge” is lost. </li>
<li><a href="http://kevindenny.wordpress.com/2011/08/08/are-private-schools-better-than-public-schools-evidence-for-ireland/" target="_blank">Are private schools better than public schools</a>? A quote from the paper: “Indeed when comparing the average score in the two types of schools after adjusting for the enrollment effects, we find quite surprisingly that public schools perform better on average.”</li>
</ol>
Unoriginal genius
2011-09-26T16:11:00+00:00
http://simplystats.github.io/2011/09/26/unoriginal-genius
<blockquote>
<p><span>“The world is full of texts, more or less interesting; I do not wish to add any more”</span></p>
</blockquote>
<p><span>This quote is from an <a href="http://chronicle.com/article/Uncreative-Writing/128908/" target="_blank">article</a> in the Chronicle Review. </span><span>I highly recommend reading the article; in particular, check out the section on the author’s “Uncreative writing” class at UPenn. </span><span>The article is about a trend in literature toward combining/using other people’s words to create new content. </span></p>
<p><span><!-- more --></span></p>
<p><br /></p>
<blockquote>
<p><span>The prominent literary critic Marjorie Perloff has recently begun using the term “unoriginal genius” to describe this tendency emerging in literature. Her idea is that, because of changes brought on by technology and the Internet, our notion of the genius—a romantic, isolated figure—is outdated. An updated notion of genius would have to center around one’s mastery of information and its dissemination. Perloff has coined another term, “moving information,” to signify both the act of pushing language around as well as the act of being emotionally moved by that process. She posits that today’s writer resembles more a programmer than a tortured genius, brilliantly conceptualizing, constructing, executing, and maintaining a writing machine.</span></p>
</blockquote>
<p><span>It is fascinating to see this happening in the world of literature; a similar trend seems to be happening in statistics. A ton of exciting and interesting work is done by people combining known ideas and tools and applying them to new problems. I wonder if we need a new definition of “creative”? </span></p>
25 minute seminars
2011-09-26T13:32:00+00:00
http://simplystats.github.io/2011/09/26/25-minute-seminars
<p>Most Statistics and Biostatistics departments have weekly seminars. We usually invite outside speakers to share their knowledge via a 50 minute PowerPoint (or Beamer) presentation. This gives us the opportunity to meet colleagues from other Universities and pick their brains in small group meetings. This is all great. But giving a good one hour seminar is hard. Really hard. Few people can pull it off. I propose to the statistical community that we cut the seminars to 25 minutes with 35 minutes for questions and further discussion. We can make exceptions of course. But in general, I think we would all benefit from shorter seminars. </p>
By poring over statistics ignored by conventional scouts, - 05.12.03 - SI Vault
2011-09-26T00:13:00+00:00
http://simplystats.github.io/2011/09/26/by-poring-over-statistics-ignored-by-conventional
<p><a href="http://sportsillustrated.cnn.com/vault/article/magazine/MAG1028746/1/index.htm">By poring over statistics ignored by conventional scouts, - 05.12.03 - SI Vault</a></p>
How do you spend your day?
2011-09-24T20:02:00+00:00
http://simplystats.github.io/2011/09/24/how-do-you-spend-your-day
<p>I’ve seen visualizations of how people spend their time in a couple of places. <a href="http://flowingdata.com/2011/09/20/how-do-americans-spend-their-days/" target="_blank">Here</a> is a good one over at Flowing Data. </p>
Getting email responses from busy people
2011-09-23T15:39:00+00:00
http://simplystats.github.io/2011/09/23/getting-email-responses-from-busy-people
<p>I’ve had the good fortune of working with some really smart and successful people during my career. As a young person, one problem with working with really successful people is that they get a <em>ton</em> of email. Some only see the subject lines on their phone before deleting them. </p>
<p>I’ve picked up a few tricks for getting email responses from important/successful people: </p>
<p><strong>The SI Rules</strong></p>
<ol>
<li>Try to send no more than one email a day. </li>
<li>Emails should be 3 sentences or fewer. Better if you can get the whole email in the subject line. </li>
<li>If you need information, ask yes or no questions whenever possible. Never ask a question that requires a full sentence response.</li>
<li>When something is time sensitive, state the action you will take if you don’t get a response by a time you specify. </li>
<li>Be as specific as you can while conforming to the length requirements. </li>
<li>Bonus: include obvious keywords people can use to search for your email. </li>
</ol>
<p>Anecdotally, SI emails have a 10-fold higher response probability. The rules are designed around the fact that busy people who get lots of email love checking things off their list. SI emails are easy to check off! That will make them happy and get you a response. </p>
<p>It takes more work on your end when writing an SI email. You often need to think more carefully about what to ask, how to phrase it succinctly, and how to minimize the number of emails you write. A surprising side effect of applying SI principles is that I often figure out answers to my questions on my own. I have to decide which questions to include in my SI emails and they have to be yes/no answers, so I end up taking care of simple questions on my own. </p>
<p>Here are examples of SI emails just to get you started: </p>
<p><strong>Example 1</strong></p>
<p>Subject: Is my response to reviewer 2 ok with you?</p>
<p>Body: I’ve attached the paper/responses to referees.</p>
<p><strong>Example 2</strong></p>
<p>Subject: Can you send my letter of recommendation to john.doe@someplace.com?</p>
<p>Body:</p>
<p>Keywords = recommendation, Jeff, John Doe.</p>
<p><strong>Example 3</strong></p>
<p>Subject: I revised the draft to include your suggestions about simulations and language</p>
<p>Body: Revisions attached. Let me know if you have any problems, otherwise I’ll submit Monday at 2pm. </p>
Dongle communism
2011-09-23T13:30:00+00:00
http://simplystats.github.io/2011/09/23/dongle-communism
<p>If you have a Mac and give talks or teach, chances are you have embarrassed yourself by forgetting your dongle. Our lab meetings and classes were constantly delayed due to missing dongles. Communism solved this problem. We bought 10 dongles, sprinkled them around the department, and declared all dongles public property. All dongles, not just the 10. No longer do we have to ask to borrow dongles because they have no owner. Please join the revolution. PS - I think this should apply to pens too!<img src="http://media.tumblr.com/tumblr_lrxprsU5Yq1r085xo.jpg" alt="" /></p>
Most popular infographics
2011-09-22T18:33:00+00:00
http://simplystats.github.io/2011/09/22/most-popular-infographics
<p>Thanks to <a href="http://kbroman.wordpress.com/" target="_blank">Karl Broman</a> via <a href="http://andrewgelman.com/" target="_blank">Andrew Gelman</a>.</p>
<p><a title="MOST POPULAR INFOGRAPHICS by theonlyone, on Flickr" href="http://www.flickr.com/photos/smoy/6143338263/" target="_blank"><img alt="MOST POPULAR INFOGRAPHICS" height="500" width="393" src="http://farm7.static.flickr.com/6190/6143338263_d2497c02fe.jpg" /></a></p>
The Killer App for Peer Review
2011-09-22T16:10:00+00:00
http://simplystats.github.io/2011/09/22/the-killer-app-for-peer-review
<p>A little while ago, over at Genomes Unzipped, Joe Pickrell asked, “<a href="http://www.genomesunzipped.org/2011/07/why-publish-science-in-peer-reviewed-journals.php" target="_blank">Why publish science in peer reviewed journals?</a>” He points out the flaws with the current peer review system and suggests how we can do better. What he suggests is missing is the killer app for peer review. </p>
<p>Well, PLoS has now developed an <a href="http://api.plos.org/" target="_blank">API</a>, where you can easily access tons of data on the papers published in those journals including downloads, citations, number of social bookmarks, and mentions in major science blogs. Along with <a href="http://www.mendeley.com/" target="_blank">Mendeley</a>, a free reference manager, they have launched a <a href="http://dev.mendeley.com/api-binary-battle/" target="_blank">competition</a> to build cool apps with their free data. </p>
<p>It seems that, with the right statistical analysis and cool features, a recommender system for, say, <a href="http://www.plosone.org/" target="_blank">PLoS One</a> could have most of the features suggested by Joe in his article. One idea would be an RSS feed based on an idea like the Pandora music sharing service. You input a couple of papers you like from the journal, then it creates an RSS feed with papers similar to those. </p>
StatistiX
2011-09-22T12:01:00+00:00
http://simplystats.github.io/2011/09/22/statistix
<p>I think our field would attract more students if we changed the name to something ending with X or K. I’ve joked about this for years, but someone has actually done it (kind of):</p>
<p><a href="http://www.bitlifesciences.com/AnalytiX2012/" target="_blank">http://www.bitlifesciences.com/AnalytiX2012/</a></p>
Small ball is a bad strategy
2011-09-21T11:55:00+00:00
http://simplystats.github.io/2011/09/21/small-ball-is-a-bad-strategy
<p>Bill James pointed this out a long time ago. If you don’t know Bill James, you should <a href="http://en.wikipedia.org/wiki/Bill_James" target="_blank">look him up</a>. I consider him to be one of the most influential statisticians of all time. This post relates to one of his first conjectures: sacrificing outs for runs, referred to as small ball, is a bad strategy. </p>
<p>ESPN’s Gamecast, a webtool that gives you pitch-by-pitch updates of baseball games, also gives you a pitch-by-pitch “probability” of winning. Gamecast confirms the conjecture with data. How do they calculate this “probability”? I am pretty sure it is based only on historical data. No modeling. For example, if the away team is up 4-2 in the bottom of the 7th with no outs and runners on 1st and 2nd, they look at all the instances exactly like this one that have ever happened in the digitally recorded history of baseball and report the proportion of times the home team wins. Well in this situation this proportion is 45%. If the next batter successfully bunts, moving the runners over, this proportion <strong>drops</strong> to 41%. Furthermore, if after the successful bunt, the runner from third scores on a sacrifice fly, the proportion <strong>drops</strong> again from 41% to 39%. The extra out hurts you more than the extra run helps you. That was Bill James’ intuition: you only have three outs so the last thing you want to do is give 33% away. </p>
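<p>The “no modeling” estimate described above amounts to a lookup and a proportion. Here is a minimal sketch of that idea. It is only my guess at the approach, not ESPN’s actual code; the state encoding, the function name, and the handful of records are invented toy values, so the printed numbers will not match the 45%, 41%, and 39% quoted above.</p>

```python
def home_win_proportion(records, state):
    """Proportion of past games that reached exactly this state and were won by the home team."""
    outcomes = [home_won for past_state, home_won in records if past_state == state]
    if not outcomes:
        return None  # state never observed: no estimate without a model
    return sum(outcomes) / len(outcomes)


# Hypothetical state encoding: (inning, half, outs, runners on base, home lead)
before_bunt = (7, "bottom", 0, ("1st", "2nd"), -2)
after_bunt = (7, "bottom", 1, ("2nd", "3rd"), -2)
never_seen = (9, "bottom", 2, (), -10)

# Toy history: (state the game passed through, did the home team end up winning?)
records = [
    (before_bunt, True),
    (before_bunt, False),
    (before_bunt, False),
    (before_bunt, True),
    (before_bunt, False),
    (after_bunt, True),
    (after_bunt, False),
    (after_bunt, False),
    (after_bunt, False),
]

print(home_win_proportion(records, before_bunt))
print(home_win_proportion(records, after_bunt))
print(home_win_proportion(records, never_seen))
```

<p>The weakness of this approach shows up in the last call: a situation that has never occurred gets no estimate at all, and rare situations get noisy ones. That is the price of using no model.</p>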
MacArthur Fellow Shwetak Patel
2011-09-20T18:24:05+00:00
http://simplystats.github.io/2011/09/20/macarthur-fellow-shwetak-patel
<p>The new <a href="http://www.macfound.org/site/c.lkLXJ8MQKrH/b.7728991/k.12E8/Meet_the_2011_Fellows.htm" target="_blank">MacArthur Fellows</a> list is out and, as usual, they are an interesting bunch. One person that I thought was worth pointing out is <a href="http://www.macfound.org/site/c.lkLXJ8MQKrH/b.7730995/k.96C7/Shwetak_Patel.htm" target="_blank">Shwetak Patel</a>. I had the privilege of meeting Shwetak at a National Research Council meeting on sustainability and computer science. Basically, he’s working on devices that you can install in your home to monitor your resource usage. He’s already spun off a startup company to make/sell some of these devices. </p>
<p>In the writeup for the award, they mention</p>
<blockquote>
<p><span>When coupled with a machine learning algorithm that analyzes patterns of activity and the signature noise produced by each appliance, the sensors enable users to measure and disaggregate their energy and water consumption and to detect inefficiencies more effectively.</span></p>
</blockquote>
<p><span>Now that’s statistics at work!</span></p>
Private health insurers to release data
2011-09-20T13:37:00+00:00
http://simplystats.github.io/2011/09/20/private-health-insurers-to-release-data
<p>It looks like four major private health insurance companies will be <a href="http://www.nytimes.com/2011/09/20/health/policy/20health.html" target="_blank">releasing data for use by academic researchers</a>. They will create a non-profit institute called the <a href="http://healthcostinstitute.org/" target="_blank">Health Care Cost Institute</a> and deposit the data there. Researchers can request the data from the institute by (I’m guessing) writing a short proposal.</p>
<p>Health insurance billing claims data might not sound all that exciting, but they are a gold mine of very interesting information about population health. In my group, we use billing claims from Medicare Part A to explore the relationships between ambient air pollutants and hospital admissions for various cardiovascular and respiratory diseases. The advantage of using a database like Medicare is that the population is very large (about 48 million people) and highly relevant. Furthermore, the data are just sitting there, already collected. The disadvantage is that you get relatively little information about those people. For example, you can’t find out what a particular Medicare enrollee’s blood pressure is on a given day. Also, it requires some pretty sophisticated data analysis skills to go through these large databases and extract the information you need to address a scientific question. But this “disadvantage” is what allows statisticians to play an important role in making scientific discoveries.</p>
<p>It’s not clear what kind of information will be made available from the private insurers—it looks like it’s mostly geared towards doing economic/cost analysis. However, I’m guessing that there will be a host of other uses for the data that will be revealed as time goes on. </p>
Finish and publish
2011-09-20T12:50:00+00:00
http://simplystats.github.io/2011/09/20/finish-and-publish
<p>Roger pointed us to this Amstat news <a href="http://magazine.amstat.org/blog/2011/09/01/nextstop/" target="_blank">profile of statisticians</a> including one on <a href="http://www.hsph.harvard.edu/faculty/francesca-dominici/" target="_blank">Francesca Dominici</a>. Francesca has used her statistics skills to become a top environmental scientist. She had this advice for young [academic] statisticians:</p>
<blockquote>
<p>First, I would say find a good mentor in or outside the department. Prioritize, manage your time, and identify the projects you would like to lead. Focus the most productive time of day on those projects. Take ownership of projects. The biggest danger is getting pulled in very different directions; focus on one main project. Finish everything you start. Always publish. Even if it is not revolutionary, publish.</p>
</blockquote>
<p>I think this is great advice. And I want to add to the last two sentences. If you are smart and it took you time to figure out the solution to a problem you find interesting, chances are others will want to read about it. So follow Francesca’s advice: finish and publish. Remember Voltaire’s quote “perfection is the enemy of the good”.</p>
Statistician Profiles
2011-09-20T12:14:00+00:00
http://simplystats.github.io/2011/09/20/statistician-profiles
<p>Just in case you forgot to renew your subscription to Amstat News, there’s a nice little <a href="http://magazine.amstat.org/blog/2011/09/01/nextstop/" target="_blank">profile of statisticians</a> (including my good colleague <a href="http://www.hsph.harvard.edu/faculty/francesca-dominici/" target="_blank">Francesca Dominici</a>) in the latest issue explaining how they ended up where they are.</p>
<p>I remember a few years ago I was at a dinner for our MPH program and the director at the time, Ron Brookmeyer, told all the students to ask the faculty how they ended up in public health. The implication, of course, was that the route was likely to be highly nonlinear. It was definitely that way for me.</p>
<p>Statisticians in particular, I think, have the ability to lead interesting careers simply because we have the ability to operate in a variety of substantive fields. I started out developing point process models for predicting wildfire occurrence. Perhaps to the chagrin of my <a href="http://www.stat.ucla.edu/~frederic/" target="_blank">advisor</a>, I’m not doing much point process modeling now, but rather am working in environmental health doing quite a bit of air pollution epidemiology.</p>
<p>So ask a statistician how they ended up where they are. It’ll probably be an interesting story.</p>
Data Sources
2011-09-19T19:26:00+00:00
http://simplystats.github.io/2011/09/19/data-sources
<p>Here are places you can get data sets to analyze (for class projects, fun and profit!)</p>
<ol>
<li><a href="http://datamarket.com/" target="_blank">Data Market</a></li>
<li><a href="http://www.infochimps.com/" target="_blank">Infochimps</a></li>
<li><a href="http://www.data.gov/" target="_blank">Data.gov</a></li>
<li><a href="http://www.factual.com/" target="_blank">Factual.com</a></li>
</ol>
<p>I’m sure there are a ton more…would love to hear from people. </p>
Meetings
2011-09-19T13:50:00+00:00
http://simplystats.github.io/2011/09/19/meetings
<p>In <a href="http://www.ted.com/talks/jason_fried_why_work_doesn_t_happen_at_work.html" target="_blank">this</a> TED talk Jason Fried explains why work doesn’t happen at work. He describes the evils of meetings. Meetings are particularly disruptive for applied statisticians, especially for those of us who hack data files, explore data for systematic errors, get inspiration from visual inspection, and thoroughly test our code. Why? Before I become productive I go through a ramp-up/boot-up stage. Scripts need to be found, data loaded into memory, and most importantly, my brain needs to re-familiarize itself with the data and the essence of the problem at hand. I need a similar ramp up for writing as well. It usually takes me between 15 and 60 minutes before I am in full-productivity mode. But once I am in “the zone”, I become very focused and I can stay in this mode for hours. There is nothing worse than interrupting this state of mind to go to a meeting. I lose much more than the hour I spend at the meeting. A short way to explain this is that having 10 separate hours to work is basically nothing, while having 10 hours in the zone is when I get stuff done.</p>
<!-- more -->
<p>Of course not all meetings are a waste of time. Academic leaders and administrators need to consult and get advice before making important decisions. I find lab meetings very stimulating and, generally, productive: we unstick the stuck and realign the derailed. But before you go and set up a standing meeting consider this calculation: a weekly one hour meeting with 20 people translates into 1 hour x 20 people x 52 weeks/year = 1040 person hours of potentially lost production per year. Assuming 40 hour weeks, that translates into 26 work weeks, or six months. How many grants, papers, and lectures can we produce in six months? And this does not take into account the non-linear effect described above. Jason Fried suggests you cancel your next meeting, notice that nothing bad happens and enjoy the extra hour of work.</p>
<p>I know many others that are like me in this regard and for you I have these recommendations: 1- Avoid unnecessary meetings, especially if you are already in full-productivity mode. Don’t be afraid to use this as an excuse to cancel. If you are in a soft $ institution, remember who pays your salary. 2- Try to bunch all the necessary meetings together into one day. 3- Set aside at least one day a week to stay home and work for 10 hours straight. Jason Fried also recommends that every work place declare a day in which no one talks. No meetings, no chit-chat, no friendly banter, etc… No talk Thursdays anyone? </p>
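<p>For anyone who wants to plug in their own numbers, here is a small sketch of the calculation above, plus a deliberately crude version of the ramp-up effect. The function names are made up, and the ramp-up model (a fixed warm-up cost paid at the start of every uninterrupted block, set here to one hour, the top of my 15 to 60 minute range) is only an assumption for illustration.</p>

```python
def meeting_cost(hours_per_meeting, attendees, meetings_per_year, hours_per_work_week=40):
    """Return (person-hours per year, equivalent work weeks) spent in a standing meeting."""
    person_hours = hours_per_meeting * attendees * meetings_per_year
    work_weeks = person_hours / hours_per_work_week
    return person_hours, work_weeks


def productive_hours(block_lengths, ramp_up_hours):
    """Hours left after paying an assumed fixed ramp-up cost at the start of each work block."""
    return sum(max(0.0, block - ramp_up_hours) for block in block_lengths)


person_hours, work_weeks = meeting_cost(hours_per_meeting=1, attendees=20, meetings_per_year=52)
print(person_hours, work_weeks)  # 1040 person-hours, 26.0 work weeks (about six months)

# Ten separate one-hour blocks versus one ten-hour block, with a one-hour ramp-up
print(productive_hours([1] * 10, ramp_up_hours=1.0))  # 0.0
print(productive_hours([10], ramp_up_hours=1.0))      # 9.0
```

<p>With a shorter ramp-up the gap shrinks, but the fragmented schedule always loses one warm-up per block.</p>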
Ideas/Data blogs I read
2011-09-18T15:52:00+00:00
http://simplystats.github.io/2011/09/18/ideas-data-blogs-i-read
<ol>
<li><a href="http://www.r-bloggers.com/" target="_blank">R bloggers</a> - good R blogs aggregator</li>
<li><a href="http://flowingdata.com/" target="_blank">Flowing Data</a> - interesting data visualizations</li>
<li><a href="http://marginalrevolution.com/" target="_blank">Marginal Revolution </a>- an econ blog with lots of interesting ideas</li>
<li><a href="http://blog.revolutionanalytics.com/" target="_blank">Revolutions</a> - another blog with news about R</li>
<li><a href="http://genome.fieldofscience.com/" target="_blank">Steven Salzberg’s blog</a></li>
<li><a href="http://andrewgelman.com/" target="_blank">Andrew Gelman’s blog</a></li>
</ol>
<p>I’m sure there are a ton more good blogs like this out there. Any suggestions of what I should be reading? </p>
Google Fusion Tables
2011-09-16T11:37:00+00:00
http://simplystats.github.io/2011/09/16/google-fusion-tables
<p><span></span></p>
<p>Thanks to <a href="http://www.biostat.jhsph.edu/~hiparker/" target="_blank">Hilary Parker</a> for pointing out <a href="http://www.google.com/fusiontables/public/tour/index.html#" target="_blank">Google Fusion Tables</a>. The coolest thing here, from my self-centered spatial statistics point of view, is that it automatically geocodes locations for you. So you can upload a spreadsheet of addresses and it will map them for you on Google Maps.</p>
<p>Unfortunately, there doesn’t seem to be an easy way to extract the latitude/longitude values, but I’m hoping that’s just a quick hack away….</p>
Communicating uncertainty visually
2011-09-15T23:12:00+00:00
http://simplystats.github.io/2011/09/15/communicating-uncertainty-visually
<p>From a cool <a href="http://www.sciencemag.org/content/333/6048/1393.full" target="_blank">review</a> about communicating risk to people without statistical/probabilistic training.</p>
<blockquote>
<p><span>Despite the burgeoning interest in infographics, there is limited experimental evidence on how different types of visualizations are processed and understood, although the effectiveness of some graphics clearly depends on the relative numeracy of an audience. </span></p>
</blockquote>
Another academic job market option: liberal arts colleges
2011-09-15T18:30:00+00:00
http://simplystats.github.io/2011/09/15/another-academic-job-market-option-liberal-arts
<p>Liberal arts colleges are an option that falls close to the 75% hard/25% soft option described by Rafa in his advice for folks on the job market. At these schools the teaching load may be even a little heavier than at schools like Berkeley/Duke; the students will usually be exclusively undergraduates. Examples of this kind of place are Pomona College, Carleton College, Grinnell College, etc. Teaching is the focus at places like this, but research plays an increasingly major role for academic faculty. In a recent Nature <a href="http://www.nature.com/nature/journal/v477/n7363/full/nj7363-239a.html" target="_blank">editorial</a>, Amy Cheng Vollmer produces an interesting analogy for the differences in responsibilities. </p>
<blockquote>
<p>“It’s like comparing the winter Olympics to the summer Olympics,” says Vollmer, who frequently gives talks on career issues. “It’s not easier, it’s different”</p>
</blockquote>
When overconfidence is good
2011-09-15T15:40:00+00:00
http://simplystats.github.io/2011/09/15/when-overconfidence-is-good
<p>A paper came out in the latest issue of Nature called the “<a href="http://www.nature.com/nature/journal/v477/n7364/full/nature10384.html" target="_blank">Evolution of Confidence</a>”. The authors describe a simple model where two participants are competing for a resource. They can either both claim the resource, only one can claim the resource, or neither can. If the ratio of the value of the resource over the cost of competition is good enough, then it makes sense to be overconfident about your abilities to obtain it. </p>
<p>The amazing thing about this paper is that it explains a really old idea “why are people overconfident” with really simple models and simulations (<a href="http://www.nature.com/nature/journal/v477/n7364/extref/nature10384-s1.pdf" target="_blank">done in R</a>!). Based on my own experience, I feel like they may be on to something. You can’t get a paper in Nature if you don’t send it there…</p>
Dissecting the genomics of trauma
2011-09-14T16:13:00+00:00
http://simplystats.github.io/2011/09/14/dissecting-the-genomics-of-trauma
<p>Today the results of a study I’ve been involved with for a long time (read: since my early graduate school days) came out in <a href="http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.1001093" target="_blank">PLoS Medicine</a> (also Princeton News <a href="http://www.princeton.edu/main/news/archive/S31/59/38O07/index.xml?section=topstories" target="_blank">coverage</a>, Eurekalert <a href="http://www.eurekalert.org/pub_releases/2011-09/plos-cig090711.php" target="_blank">press release</a>).</p>
<p>We looked at gene expression profiles - how much each of your 20,000 genes is turned on or turned off - in patients who had experienced blunt force trauma. Using these profiles we were able to distinguish very early on which of the patients were going to have positive or negative health trajectories. The idea was to compare patients to themselves and see how much their genomic profiles deviated from the earliest measurements.</p>
<p>I’m excited about this paper for a couple of reasons: (1) like we say in the paper, “Trauma is the number one killer of individuals 1-44y of age in the United States”, (2) the communicating <a href="http://www.genomine.org/research.html" target="_blank">author</a> and joint first authors, Keyur Desai and Chuen Seng Tan, on the paper were statisticians, highlighting the important role statistics played in the scientific process. </p>
<p><strong>Update:</strong> If you want to check out the data/analyze them yourself, there is a website explaining how to access the data & code <a href="http://genomine.org/trauma/" target="_blank">here</a>. </p>
Advice for stats students on the academic job market
2011-09-12T13:34:00+00:00
http://simplystats.github.io/2011/09/12/advice-for-stats-students-on-the-academic-job-market
<p>Job hunting season is upon us. Openings are already being posted <a href="http://www.stat.ufl.edu/vlib/Index.html" target="_blank">here</a>, <a href="http://www.stat.washington.edu/jobs/" target="_blank">here</a>, and <a href="http://jobs.amstat.org/" target="_blank">here</a>. So you should have your CV, research statement, and web page ready. I highly recommend having a web page. It doesn’t have to be fancy. <a href="http://jkp-mac1.uchicago.edu/~pickrell/Site/Home.html" target="_blank">Here</a>, <a href="http://www.biostat.jhsph.edu/~khansen/" target="_blank">here</a>, and <a href="http://www.biostat.jhsph.edu/~jleek/research.html" target="_blank">here</a> are some good ones ranging from simple to a bit over the top. Minimum requirements are a list of publications and a link to a CV. If you have written software, link to that as well.</p>
<p>The earlier you submit the better. Don’t wait for your letters. Keep in mind two things: 1) departments have a limit of how many people they can invite and 2) admissions committee members get tired after reading 200+ CVs. </p>
<p>If you are seeking an academic job your CV should focus on the following: PhD granting institution, advisor (including postdoc advisor if you have one), and papers. Be careful not to drown out these most important features with superfluous entries. For papers, include three sections: 1-published, 2-under review, and 3-in preparation. For 2, include the journal names and if possible have tech reports available on your web page. For 3, be ready to give updates during the interview. If you have papers for which you are co-first author be sure to highlight that fact somehow. </p>
<p>So what are the different types of jobs? Before listing the options I should explain the concept of hard versus soft money. Revenue in academia comes from tuition (in public schools the state kicks in some extra $), external funding (e.g. NIH grants), services (e.g. patient care), and philanthropy (endowment). The money that comes from tuition, services, and philanthropy is referred to as hard money. Every year roughly the same amount is available and the way it’s split among departments rarely changes. When it does, it’s because your chair has either lost or won a long hard-fought zero-sum battle. Research money comes from NIH, NSF, DoD, etc., and one has to write grants to <em>raise</em> funding (which pays part or all of your salary); this is soft money. These days about 10% of grant applications are funded, so it is certainly not guaranteed. Although at the school level the law of large numbers kicks in, at the individual level it certainly doesn’t. Note that the breakdown of revenue varies widely from institution to institution. Liberal arts colleges are almost 100% hard money while research institutes are almost 100% soft money.</p>
<p>So to simplify, your salary will come from teaching (tuition) and research (grants). The percentages will vary depending on the department. Here are four types of jobs:</p>
<p>1) Soft money university positions: examples are Hopkins and Harvard Biostat. A typical breakdown is 75% soft/25% hard. To earn the hard money you will have to teach, but not that much. In my dept we teach 48 classroom hours a year (equivalent to one one-semester class). To earn the soft money you have to write, and eventually get, grants. As a statistician you don’t necessarily have to write your own grants, you can partner up with other scientists that need help. And there are many! Salaries are typically higher in these positions. Stress levels are also higher given the uncertainty of funding. I personally like this as it keeps me motivated, focused, and forces me to work on problems important enough to receive NIH funding.</p>
<p>1a) Some schools of medicine have Biostatistics units that are 100% soft money. One does not have to teach, but, unless you have a joint appointment, you won’t have access to grad students. Still, these are tenure track jobs. Although at 100% soft what does tenure mean? The Oncology Biostat division at Hopkins is an example. I should mention that at MD Anderson, one only needs to raise 50% of one’s salary and the other 50% is earned via service (statistical consulting to the institution). I imagine there are other places like this, as well as institutions that use endowments to provide some hard money. </p>
<p>2) Hard money positions: examples are Berkeley and Stanford Stat. A typical breakdown is 75% hard/25% soft. You get paid a 9 month salary. If you want to get paid in the summer and pay students, you need a grant. Here you typically teach two classes a semester but many places let you “buy out” of teaching if you can get grants to pay your salary. Some tension exists when chairs decide who teaches the big undergrad courses (lots of grunt work) and who teaches the small seminar classes where you talk about your own work.</p>
<p>3) Research associate positions: examples are jobs in schools of medicine in departments other than Stat/Biostat. These positions are typically 100% soft and are created because someone at the institution has a grant to pay for you. These are usually not tenure track positions and you rarely have to teach. You also have less independence since you have to work on the grant that funds you.</p>
<p>4) Industry: typically 100% hard. There are plenty of for-profit companies where one can have a fruitful research career. AT&T, Google, IBM, Microsoft, and Genentech are all examples of companies with great research groups. Note that S, the language that R is based on, was born in Bell Labs. And one of the co-creators of R now does his research at Genentech. Salaries are typically higher in industry and cafeteria food can be quite awesome. The drawbacks are no access to students and lack of independence (although not always!).</p>
<p><strong>Update:</strong> A reader points out that I forgot:</p>
<p>5) Government jobs: The FDA and NIH are examples of agencies that have research positions. The NCI’s Biometric Research Branch is an example. I would classify these as 100% hard. But it is different from other hard money places in that you have to justify your budget every so often. Service, collaborative research, and independent research are expected. A drawback is that you don’t have access to students although you can get joint appointments. At Hopkins we have a couple of NCI researchers with joint appointments. </p>
<p>Ok, that is it for now. Sometime in December we will <a href="http://simplystatistics.tumblr.com/" target="_blank">blog</a> about job interviews. </p>
The Duke Saga
2011-09-11T04:15:00+00:00
http://simplystats.github.io/2011/09/11/the-duke-saga
<p>For those of you that don’t know about the saga involving genomic signatures, I highly recommend reading this <a href="http://www.economist.com/node/21528593" target="_blank">very good summary</a> published in The Economist. Baggerly and Coombes are two statisticians that can confidently say they have made an impact on clinical research and actually saved lives. A paper by this pair describing the details was published in the <a href="http://www.e-publications.org/ims/submission/index.php/AOAS/user/submissionFile/5816?confirm=cfad51b7" target="_blank">Annals of Applied Statistics</a> as most of the Biology journals refused to publish their letters to the editor. Baggerly is also a fantastic public speaker as seen in <a href="http://vimeo.com/16698764" target="_blank">this video</a> and <a href="http://www.youtube.com/watch?v=j1MT0oZqPXY" target="_blank">this one</a>. </p>
What is a Statistician?
2011-09-10T03:07:00+00:00
http://simplystats.github.io/2011/09/10/what-is-a-statistician
<p>This column was written by Terry Speed in 2006 and is reprinted with permission from the IMS Bulletin, <a href="http://bulletin.imstat.org" target="_blank">http://bulletin.imstat.org</a></p>
<p class="p1">
<span class="s1">I</span>n the generation of my teachers, say from 1935 to 1960, relatively few statisticians were trained for the profession. The majority seemed to come from mathematics, without any specialized statistical training. There was also a sizeable minority coming from other areas, such as astronomy (I can think of one prominent example), chemistry or chemical engineering (three), economics (several), history (one), medicine (several), physics (two), and psychology (several). In those days, PhD programs in statistics were few and far between, and many, perhaps most people moved into statistics because they were interested in the subject, or were responding to a perceived need. They learned the subject on the job, either in government, industry or academia. I also think statistics benefited disproportionately from the minority coming from outside mathematics and statistics, but that may be a personal bias.
</p>
<p class="p1">
This diversity of backgrounds seems to have diminished from the mid-1960s. Almost all of my colleagues in statistics over the last 40 years had some graduate training in statistics. Typically they had a PhD in statistics, probability or mathematics, the last two with some exposure to statistics. A few had masters degrees or diplomas in statistics. My experience probably reflects that of most of you.
</p>
<p class="p1">
By the 1960s our subject had become professional, there was a ticket of entry into it — a PhD or equivalent — and many graduate programs handing them out. I know many statistics departments now include people with joint appointments, for example in the biological, engineering or social sciences, but I have the impression that the majority are people who trained in statistics and moved ‘away’ through their interest in applications there, rather than people from these other areas who were embraced by the statisticians. As is to be expected, there are plenty of exceptions.
</p>
<p class="p1">
Why am I presenting this made-up history of the recent origins of statisticians? Because I have the sense that the situation which has prevailed for about 40 years is changing again. I see a steady trickle, which I predict will grow substantially, of people not trained in statistics moving into our profession. Many have noticed, and I have previously remarked on, the current shortage of bright young people going into our subject. We probably all know universities, institutes or industries trying hard to recruit statisticians, and coming up empty handed. On the other hand, there has been substantial growth in areas which, while not generally regarded as mainstream statistics, might well have been, had things gone differently. My unoriginal observation is that some people from these areas are starting to see statistics as a worthwhile career, not beating but joining us. Computer science, machine learning, image analysis, information theory and bioinformatics, to name a few, have all provided future statisticians to statistics departments around the world in recent years, and I think there will be much more of this.
</p>
<p class="p1">
Recently there was a call for applications for the new United Kingdom EPSRC Statistics Mobility Fellowships, whose aim is “to attract new researchers into the statistics discipline at an early stage in their career”. Is this “mobility” a good idea? In my view, unquestionably yes. Not only do we need an influx of talent to swell our numbers, we also need it to broaden and enrich our subject, so that much of the related activity we now see taking place outside of statistics, and threatening its future, comes inside. In his highly stimulating polemic “Statistical Modelling: The Two Cultures” published in <em>Statistical Science </em>just 5 years ago (16:199–231, 2001), my late colleague Leo Breiman argued that “the focus in the statistical community on data models has:
</p>
<ul>
<li>led to irrelevant theory and questionable scientific conclusions; </li>
<li>kept statisticians from using more suitable algorithmic models; </li>
<li>prevented statisticians from working on exciting new problems.”</li>
</ul>
<p class="p1">
His view was that “we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.”
</p>
<p class="p1">
One, perhaps over-optimistic, view is that the reform that Leo so desired will come automatically as mainstream statistics is joined by “outsiders” from fields like those mentioned above. Are there risks in this trend? There must be. We want statistics broadened and enriched; we don’t want to see it fragmented, trivialized, or otherwise weakened. We need our theorists working hard to incorporate all these new ideas into our long-standing big picture, we need the newcomers to become familiar with the best we have to offer, and we all need to work together in answering the questions of all the people outside our discipline needing our involvement.
</p>
Data visualization and art
2011-09-09T23:58:00+00:00
http://simplystats.github.io/2011/09/09/data-visualization-and-art
<p><a title="Mark Hansen" href="http://www.stat.ucla.edu/~cocteau/" target="_blank">Mark Hansen</a> is easily one of my favorite statisticians today. He is a Professor of Statistics at <a href="http://www.stat.ucla.edu/" target="_blank">UCLA</a> and his collaborations with artists have brought data visualization to a whole new place, one that is both informative and moving. </p>
<p>Here is a video of his project with Ben Rubin called Listening Post. The installation grabs conversations from unrestricted chat rooms and processes them in real time to create interesting “themes” or “movements”. I believe this one is called “I am” and the video is taken from the Whitney Museum of American Art.</p>
<p>[youtube http://www.youtube.com/watch?v=dD36IajCz6A&w=420&h=345]</p>
<p>Here is some pretty cool time-lapse photography of the installation of Listening Post at the San Jose Museum of Art.</p>
<p>[youtube http://www.youtube.com/watch?v=cClHQU6Fqro]</p>
Any Other Team Wins The World Series Good For
2011-09-08T23:12:00+00:00
http://simplystats.github.io/2011/09/08/any-other-team-wins-the-world-series-good-for
<p>[youtube http://www.youtube.com/watch?v=_tvh5edD22c?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]</p>
<p><em>“Any other team wins the World Series, good for them…if we win, with this team … we’ll have changed the game.”</em></p>
<p>Moneyball! Maybe the start of the era of data. Plus it is a feel-good baseball movie where a statistician is the hero. I haven’t been this stoked for a movie in a long time.</p>
<div class="attribution">
(<span>Source:</span> <a href="http://www.youtube.com/">http://www.youtube.com/</a>)
</div>
Data Science = Hot Career Choice
2011-09-08T15:00:00+00:00
http://simplystats.github.io/2011/09/08/data-science-hot-career-choice
<p>Not only are data analytics companies getting scooped up left and right, “data science” is blowing up as a career. Data science is sort of an amorphous term, like any hot topic (e.g., <a href="http://en.wikipedia.org/wiki/Cloud_computing" target="_blank">cloud computing</a>). Regardless, people who can crunch numbers and find patterns are in high demand, and I’m not the only <a href="http://www.nytimes.com/2009/08/06/technology/06stats.html" target="_blank">one</a> <a href="http://www.cio.com/article/684344/The_6_Hottest_New_Jobs_in_IT" target="_blank">saying</a> <a href="http://tech.fortune.cnn.com/2011/09/06/data-scientist-the-hot-new-gig-in-tech/" target="_blank">so</a>.</p>
<p>Don’t believe the hype? Search for “data” on the career sites of Amazon, Google, Facebook, Groupon, LivingSocial, Square, ….</p>
Data analysis companies getting gobbled up
2011-09-08T12:36:00+00:00
http://simplystats.github.io/2011/09/08/data-analysis-companies-getting-gobbled-up
<p>Companies that specialize in data analysis, or essentially, statistics, are getting gobbled up by larger companies. IBM bought <a href="http://dealbook.nytimes.com/2009/07/28/ibm-to-pay-12-billion-for-software-maker/" target="_blank">SPSS</a>, then later <a href="http://dealbook.nytimes.com/2011/09/01/ibm-to-buy-algorithmics-for-387-million/" target="_blank">Algorithmics</a>. MSCI bought <a href="http://dealbook.nytimes.com/2010/03/01/msci-buys-riskmetrics-for-1-55-billion/" target="_blank">RiskMetrics</a>. HP bought <a href="http://dealbook.nytimes.com/2011/08/19/after-h-p-s-rich-offer-deal-making-spotlight-swings-to-data-analysis/" target="_blank">Autonomy</a>. Who’s next? SAS?</p>
Build your own pre-cog
2011-09-07T17:38:00+00:00
http://simplystats.github.io/2011/09/07/build-your-own-pre-cog
<p>Okay, this is not really about <a href="http://simplystatistics.tumblr.com/post/9916412456/pre-cog-and-stats" target="_blank">pre-cog</a>, but just a pointer to some data that might be of interest to people. A number of cities post their crime data online, ready for scraping and data analysis. For example, the Baltimore Sun has a <a title="Baltimore homicide data" target="_blank" href="http://essentials.baltimoresun.com/micro_sun/homicides/index.php">Google map of homicides</a> in the city of Baltimore. There’s also some data for <a title="Oakland homicide data" target="_blank" href="http://www.sfgate.com/maps/oaklandhomicides/">Oakland</a>.</p>
<p>Looking at the map is fun, but not particularly useful from a data analysis standpoint. However, with a little fiddling (and some knowledge of XML), you can pull the data from the map and use it for data analysis.</p>
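<p>As a rough illustration of that fiddling, here is a minimal Python sketch. The XML below is a made-up sample: the <code>markers</code>/<code>marker</code> tags and the <code>lat</code>, <code>lng</code>, <code>date</code>, and <code>cause</code> attributes are assumptions for illustration, not the actual format behind either map, so you would need to inspect the real feed and adjust the names.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: a real map feed will use different tag and
# attribute names, so inspect it first and adapt the names below.
SAMPLE_XML = """
<markers>
  <marker lat="39.30" lng="-76.61" date="2011-01-05" cause="shooting"/>
  <marker lat="39.28" lng="-76.59" date="2011-02-11" cause="stabbing"/>
</markers>
"""


def parse_markers(xml_text):
    """Return one dict per <marker> element, with coordinates as floats."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for marker in root.iter("marker"):
        rows.append({
            "lat": float(marker.get("lat")),
            "lng": float(marker.get("lng")),
            "date": marker.get("date"),
            "cause": marker.get("cause"),
        })
    return rows


records = parse_markers(SAMPLE_XML)
for row in records:
    print(row)
```

<p>Once the records are in a list of dictionaries like this, they can be written to a CSV file or loaded into a data frame for analysis.</p>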
<p>Why not build your own model to predict crime?</p>
<p>I’ll just add that the model used in the pre-cog program was published in the Journal of the American Statistical Association in <a href="http://pubs.amstat.org/doi/abs/10.1198/jasa.2011.ap09546" target="_blank">this article</a>.</p>
Awesome Stat Ed Links
2011-09-07T13:58:00+00:00
http://simplystats.github.io/2011/09/07/awesome-stat-ed-links
<ol>
<li><a href="http://openintro.org/" target="_blank">OpenIntro</a> - A free online introductory stats textbook. Even the LaTeX source is free! One of the authors is Chris Barr, a former postdoc at Hopkins.</li>
<li><a href="https://sites.google.com/site/undergraduateguidetor/" target="_blank">The undergraduate guide to R</a> - A free, super-beginner-level intro to R, the most popular (and free) statistical programming language. Written by an undergrad at Princeton. </li>
</ol>
Pre-cog and stats
2011-09-07T13:11:00+00:00
http://simplystats.github.io/2011/09/07/pre-cog-and-stats
<p>A cool article <a href="http://singularityhub.com/2011/08/29/pre-cog-is-real-%E2%80%93-new-software-stops-crime-before-it-happens/" target="_blank">here</a> on a group predicting the place/time when crime is going to happen. It looks like they are using a Poisson process. They liken it to predicting the aftershocks of an earthquake. More details on the math behind the pre-cog software can be found <a href="http://math.scu.edu/~gmohler/crime_project.html" target="_blank">here</a>. I wonder what their prediction accuracy is. Thanks to Rafa for pointing the link out. </p>
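<p>To make the aftershock analogy concrete, here is a small Python sketch of one common way to formalize it: a self-exciting point process, where each past event temporarily raises the rate of new events. This is only an illustration under assumptions. The exponential decay and the parameter values <code>mu</code>, <code>alpha</code>, and <code>beta</code> are made up, and it is not necessarily the model the group actually fit.</p>

```python
import math


def conditional_intensity(t, past_events, mu=0.5, alpha=0.8, beta=1.2):
    """Event rate at time t for a simple self-exciting process.

    lambda(t) = mu + sum over t_i < t of alpha * beta * exp(-beta * (t - t_i))

    mu is the background rate; alpha and beta (illustrative values only)
    control how large each event's boost is and how quickly it fades.
    """
    boost = sum(
        alpha * beta * math.exp(-beta * (t - t_i))
        for t_i in past_events
        if t_i < t
    )
    return mu + boost


events = [0.9, 2.0, 2.1]  # hypothetical event times
for t in (1.0, 2.2, 6.0):
    print(t, round(conditional_intensity(t, events), 3))
```

<p>With no past events the rate is just the background rate, as in an ordinary Poisson process. Right after a burst of events the rate is elevated, and it decays back toward the background rate as time passes.</p>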
Where are the Case Studies?
2011-09-07T12:48:00+00:00
http://simplystats.github.io/2011/09/07/where-are-the-case-studies
<p>Many case studies I find interesting don’t appear in JASA Applications and <strong>Case Studies </strong>or other applied statistics journals for that matter. Some because the technical skill needed to satisfy reviewers is not sufficiently impressive, others because they lack mathematical rigor. But perhaps the main reason for this disconnect is that many interesting case studies are developed by people outside our field or outside academia.</p>
<p>In this blog we will try to introduce readers to some of these case studies. I’ll start it off by pointing readers to Nate Silver’s <a href="http://fivethirtyeight.blogs.nytimes.com" target="_blank">FiveThirtyEight</a> blog. Mr. Silver (yes, Mr. not Prof. nor Dr.) is one of my favorite statisticians. He first became famous for <a href="http://en.wikipedia.org/wiki/PECOTA" target="_blank">PECOTA</a>, a system that uses data and statistics to predict the performance of baseball players. In FiveThirtyEight he uses a rather sophisticated meta-analysis approach to predicting election outcomes.</p>
<p>For example, for the 2008 election he used data from the primaries to calibrate pollsters and then properly weighted these pollsters’ predictions to give a more precise estimate of election results. He predicted Obama would win 349 to 189 with a 6.1% difference in the popular vote. The actual result was 365 to 173 with a difference of 7.2%. His website included graphs that very clearly illustrated the uncertainty of his prediction. These were updated daily and I had a ton of fun visiting his blog at least once a day. I also learned quite a bit, used his data in class, and gained insights that I have used in my own projects.</p>
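<p>The basic idea of weighting pollsters can be sketched in a few lines of Python. This is a toy illustration, not Mr. Silver’s actual method: the margins and weights below are invented, and in practice the weights would come from how well each pollster was calibrated on earlier contests.</p>

```python
def weighted_poll_average(polls):
    """Combine poll margins using per-pollster weights.

    polls is a list of (margin_in_points, pollster_weight) pairs.
    """
    total_weight = sum(weight for _, weight in polls)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(margin * weight for margin, weight in polls) / total_weight


# Hypothetical polls: (margin in percentage points, weight for that pollster)
polls = [(8.0, 1.0), (5.0, 3.0), (6.0, 2.0)]
estimate = weighted_poll_average(polls)
print(round(estimate, 2))
```

<p>A pollster with a better track record gets a larger weight, so its margin pulls the combined estimate more strongly than a less reliable one.</p>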
Seek Simplicity And Distrust It
2011-09-07T03:28:02+00:00
http://simplystats.github.io/2011/09/07/seek-simplicity-and-distrust-it
<p>Seek simplicity and distrust it.</p>
<p>A. N. Whitehead</p>
First things first
2011-09-07T03:25:00+00:00
http://simplystats.github.io/2011/09/07/first-things-first
<p><strong>About us:</strong></p>
<p>We are three professors who are fired up about the new era where data is abundant and statisticians are scientists. </p>
<p><strong>About this blog:</strong></p>
<p>We’ll be posting ideas we find interesting, contributing to discussion of science/popular writing, and linking to articles that inspire us. </p>
<p><strong>Why “Simply Statistics”:</strong></p>
<p>We needed a title. Plus, we like the idea of using simple statistics to solve real, important problems. We aren’t fans of unnecessary complication - that just leads to lies, damn lies and something else. </p>
Example post
2011-09-01T00:00:00+00:00
http://simplystats.github.io/2011/09/01/examplepost
<p>Write your text here in Markdown. Be aware that our blog runs with <a href="https://jekyllrb.com/">Jekyll</a>.</p>
<ul>
<li>Do codeblocks like this https://help.github.com/articles/creating-and-highlighting-code-blocks/</li>
<li>Put all images in the public/ directory or point to them on a website where they are permanent</li>
</ul>