19 Sep 2016
Hilary and I celebrate our one-year anniversary of doing the podcast together by discussing whether some cities are better than others for data scientists, reproducible research, and professionalizing data science.
Also, Hilary and I have just published a new book, Conversations on Data Science, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!
If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at @NSSDeviations.
Subscribe to the podcast on iTunes or Google Play.
Please leave us a review on iTunes!
Support us through our Patreon page.
Show Notes:
Download the audio for this episode.
19 Sep 2016
Today I’m happy to announce that we’re launching a new specialization on Coursera titled Mastering Software Development in R. This is a 5-course sequence developed with Sean Kross and Brooke Anderson.
This sequence differs from our previous Data Science Specialization because it focuses primarily on using R for developing software. We’ve found that as the field of data science evolves, it is becoming ever more clear that software development skills are essential for producing useful data science results and products. In addition, there is a tremendous need for tooling in the data science universe and we want to train people to build those tools.
The first course, The R Programming Environment, launches today. In the following months, we will launch the remaining courses:
- Advanced R Programming
- Building R Packages
- Building Data Visualization Tools
In addition to the courses, we have a companion textbook that goes along with the sequence. The book is available from Leanpub and is currently in progress (if you get the book now, you will receive free updates as they become available). We will be releasing new chapters of the book alongside the launches of the other courses in the sequence.
07 Sep 2016
A few months ago, Jill Sederstrom from ASH Clinical News interviewed
me for this article on the data sharing editorial published by The New England Journal of Medicine (NEJM) and the debate it generated.
The article presented a nice summary, but I thought the original
comprehensive set of questions was very good too. So, with permission from
ASH Clinical News, I am sharing them here along with my answers.
Before I answer the questions below, I want to make an important remark.
When writing these answers I am reflecting on data sharing in
general. Nuances arise in different contexts that need to be
discussed on an individual basis. For example, there are different
considerations to keep in mind when sharing publicly funded data in
genomics (my field) and sharing privately funded clinical trials data,
just to name two examples.
In your opinion, what do you see as the biggest pros of data sharing?
The biggest pro of data sharing is that it can accelerate and improve
the scientific enterprise. This can happen in a variety of ways. For
example, competing experts may apply an improved statistical analysis
that finds a hidden discovery the original data generators missed.
Furthermore, examination of data by many experts can help correct
errors missed by the analyst of the original project. Finally, sharing
data facilitates the merging of datasets from different sources that
allow discoveries not possible with just one study.
Note that data sharing is not a radical idea. For example, thanks to
an organization called The MGED Society, most journals require all published
microarray gene expression data to be public in one of two
repositories: GEO or ArrayExpress. This has been an incredible
success, leading to new discoveries, new databases that combine
studies, and the development of widely used statistical methods and
software built with these data as practice examples.
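To make this concrete, here is a minimal R sketch of what reuse of shared data looks like in practice. It assumes the Bioconductor package GEOquery is installed, and the accession ID below is a placeholder for any published GEO series:

```r
## A minimal sketch of pulling shared expression data from GEO.
## Assumes the Bioconductor package GEOquery is installed;
## replace the placeholder accession with any published series ID.
library(GEOquery)

gse  <- getGEO("GSE0000", GSEMatrix = TRUE)  # placeholder accession
eset <- gse[[1]]        # an ExpressionSet object
expr <- exprs(eset)     # expression matrix: probes (rows) x samples (columns)
dim(expr)
```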
The NEJM editorial expressed concern that a new generation of researchers will emerge, those who had nothing to do with collecting the data but who will use it for their own ends. It referred to these as “research parasites.” Is this a real concern?
Absolutely not. If our goal is to facilitate scientific discoveries that
improve our quality of life, I would be much more concerned about
“data hoarders” than “research parasites”. If an important nugget of
knowledge is hidden in a dataset, don’t you want the best data
analysts competing to find it? Restricting the researchers who can
analyze the data to those directly involved with the generators cuts
out the great majority of experts.
To further illustrate this, let’s consider a very concrete example
with real life consequences. Imagine a loved one has a disease with
high mortality rates. Finding a cure is possible but only after
analyzing a very, very complex genomic assay. If some of the best data
analysts in the world want to help, does it make any sense at all to
restrict the pool of analysts to, say, a freshly minted master's-level
statistician working for the genomics core that generated the data?
Furthermore, what would be the harm of having someone double check
that analysis?
The NEJM editorial also presented several other concerns it had with data sharing, including whether researchers would compare data across clinical trials that are not in fact comparable, and a failure to provide correct attribution. Do you see these as being concerns? What cons do you believe there may be to data sharing?
If such mistakes are made, good peer reviewers will catch the error.
If it escapes peer review, we point it out in post publication
discussions. Science is constantly self-correcting.
Regarding attribution, this is a legitimate but, in my opinion, minor
concern. Developers of open source statistical methods and software
see our methods used without attribution quite often. We survive. But
as I elaborate below, we can do things to alleviate this concern.
Is data stealing a real worry? Have you ever heard of it happening before?
I can’t say I can recall any case of data being stolen. But let’s
remember that most published data is paid for by taxpayers. They are the
actual owners. So there is an argument to be made that the public’s
data is being held hostage.
Does data sharing need to happen symbiotically as the editorial suggests? Why or why not?
I think symbiotic sharing is the most effective approach to the
repurposing of data. But no, I don’t think we need to force it to happen this way.
Competition is one of the key ingredients of the scientific
enterprise. Having many groups competing almost always beats out a
small group of collaborators. And note that the data generators won’t
necessarily have time to collaborate with all the groups interested in
the data.
In a recent blog post, you suggested several possible data sharing guidelines. What would the advantage be of having guidelines in place to help guide the data sharing process?
I think you are referring to a post by Jeff Leek but I am happy to
answer. For data to be generated, we need to incentivize the endeavor.
Guidelines that assure patient privacy should of course be followed.
Some other simple guidelines related to those mentioned by Jeff are:
- Reward data generators when their data is used by others.
- Penalize those who do not give proper attribution.
- Apply the same critical rigor to critiques of the original analysis as we apply to the original analysis.
- Include data sharing ethics in scientific education.
One of the guidelines suggested a new designation for leaders of major data collection or software generation projects. Why do you think this is important?
Again, this was Jeff, but I agree. This is important because we need
an incentive other than giving the generators exclusive rights to
publications emanating from said data.
You also discussed the need to require statistical/computational co-authors for papers written by experimentalists without such expertise, and vice versa. What role do you see the referee serving? Why is this needed?
I think the same rule should apply to referees. Every paper based on
the analysis of complex data needs to have a referee with
statistical/computational expertise. I also think biomedical journals
publishing data-driven research should start adding these experts to
their editorial boards. I should mention that NEJM actually has had
such experts on their editorial board for a while now.
Are there certain guidelines you feel would be most critical to include?
To me the most important ones are:
- The funding agencies and the community should reward data generators when their data is used by others, perhaps more than for the papers they produce with these data.
- Apply the same critical rigor to critiques of the original analysis as we apply to the original analysis. Bashing published results and talking about the “replication crisis” has become fashionable. Although in some cases it is very well merited (see the work of Baggerly and Coombes, for example), in other circumstances critiques are made without much care, mainly for the attention. If we are not careful about keeping a good balance, we may end up paralyzing scientific progress.
You mentioned that you think symbiotic data sharing would be the most effective approach. What are some ways in which scientists can work symbiotically?
I can describe my experience. I am trained as a statistician. I analyze
data on a daily basis both as a collaborator and method developer.
Experience has taught me that if one does not understand the
scientific problem at hand, it is hard to make a meaningful
contribution through data analysis or method development. Most
successful applied statisticians will tell you the same thing.
Most difficult scientific challenges have nuances that only the
subject matter expert can effectively describe. Failing to understand
these usually leads analysts to chase false leads, interpret results
incorrectly or waste time solving a problem no one cares about.
Successful collaborations usually involve a constant back and forth
between the data analysts and the subject matter experts.
However, in many circumstances the data generator is not necessarily
the only one that can provide such guidance. Some data analysts
actually become subject matter experts themselves, others download
data and seek out other collaborators that also understand the details
of the scientific challenge and data generation process.
06 Sep 2016
This summer I had several conversations with undergraduate
students seeking career advice. All were interested in data analysis
and were considering graduate school. I also frequently receive
requests for advice via email. We have posted on this topic
before, for example
here
and
here, but
I thought it would be useful to share this short guide I put together based on my recent interactions.
It’s OK to be confused
When I was a college senior I didn’t really understand what Applied
Statistics was nor did I understand what one does as a researcher in
academia. Now I love being an academic doing research in applied statistics.
But it is hard to understand what being a researcher is like until you do
it for a while. Things become clearer as you gain more experience. One
important piece of advice is
to carefully consider advice from those with more
experience than you. It might not make sense at first, but I
can say today that I knew much less than I thought I did when I was 22.
Should I even go to graduate school?
Yes. An undergraduate degree in mathematics, statistics, engineering, or computer science
provides a great background, but some more training greatly increases
your career options. You may be able to learn on the job, but note
that a master's can be as short as a year.
A master's or a PhD?
If you want a career in academia or as a researcher in industry or
government you need a PhD. In general, a PhD will
give you more career options. If you want to become a data analyst or
research assistant, a master's may be enough. A master's is also a good way
to test out whether this career is a good match for you. Many people do a
master's before applying to PhD programs. The rest of this guide
focuses on those interested in a PhD.
What discipline?
There are many disciplines that can lead you to a career in data
science: Statistics, Biostatistics, Astronomy, Economics, Machine Learning, Computational
Biology, and Ecology are examples that come to mind. I did my PhD
in Statistics and got a job in a Department of Biostatistics. So this
guide focuses on Statistics/Biostatistics.
Note that once you finish your PhD you have a chance to become a
postdoctoral fellow and further focus your training. By then you will have a
much better idea of what you want to do and will have the opportunity
to choose a lab that closely matches your interests.
What is the difference between Statistics and Biostatistics?
Short answer: very little. I treat them as the same in this guide. Long answer: read
this.
How should I prepare during my senior year?
Math
Good grades in math and statistics classes
are almost a requirement. Good GRE scores help, and you need a near-perfect score in
the Quantitative Reasoning part of the GRE. Get yourself a practice
book and start preparing. Note that to survive the first two years of a statistics PhD program
you need to prove theorems and derive relatively complicated
mathematical results. If you can’t easily handle the math part of the GRE, this will be
quite challenging.
When choosing classes note that the area of math most related to your
stat PhD courses is Real
Analysis. The area of math most used in applied work is Linear
Algebra, specifically matrix theory including understanding
eigenvalues and eigenvectors. You might not make the connection between
what you learn in class and what you use in practice until much
later. This is totally normal.
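To give one small example of that connection, here is a short R sketch (the data are simulated purely for illustration) of the eigen-decomposition of a covariance matrix, the linear algebra computation at the heart of methods like principal component analysis:

```r
## Eigen-decomposition of a covariance matrix in R.
## Simulated data, purely for illustration: 100 observations of 5 variables.
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)
S <- cov(X)       # 5 x 5 covariance matrix
e <- eigen(S)
e$values          # eigenvalues: variance along each principal direction
e$vectors         # eigenvectors: the directions themselves
```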
If you don’t feel ready, consider doing a master's first. But also, get
a second opinion. You might be being too hard on yourself.
Programming
You will be using a computer to analyze data so knowing some
programming is a must these days. At a minimum, take a basic
programming class. Other computer science classes will help especially
if you go into an area dealing with large datasets. In hindsight, I
wish I had taken classes on optimization and algorithm design.
Know that learning to program and learning a computer language are
different things. You need to learn to program. The choice of language
is up for debate. If you only learn one, learn R. If you learn three,
learn R, Python and C++.
Knowing Linux/Unix is an advantage. If you have a Mac try to use the
terminal as much as possible. On Windows get an emulator.
Writing and Communicating
My biggest educational regret is that, as a college student, I underestimated the importance
of writing. To this day I am correcting that mistake.
Your success as a researcher greatly depends on how well
you write and communicate. Your thesis, papers, grant
proposals and even emails have to be well written. So practice as much as
possible. Take classes, read works by good writers, and
practice. Consider
starting a blog even if you don’t make it public. Also note that in
academia, job interviews will
involve a 50-minute talk as well as several conversations about your
work and future plans. So communication skills are also a big plus.
But wait, why so much math?
The PhD curriculum is indeed math heavy. Faculty often debate the
possibility of changing the curriculum. But regardless of
differing opinions on what is the right amount, math is the
foundation of our discipline. Although it is true that you will not
directly use much of what you learn, I don’t regret learning so much abstract
math because I believe it positively shaped the way I think and attack
problems.
Note that after the first two years you are
pretty much done with courses and you start on your research. If you work with an
applied statistician you will learn data analysis via the
apprenticeship model. You will learn the most, by far, during this
stage. So be patient. Watch
these
two Karate Kid scenes
for some inspiration.
What department should I apply to?
The top 20-30 departments are practically interchangeable in my
opinion. If you are interested in applied statistics make sure you
pick a department with faculty doing applied research. Note that some
professors focus their research on the mathematical aspects of
statistics. By reading some of their recent papers you will be able to
tell. An applied paper usually shows data (not simulated) and
motivates a subject area challenge in the abstract or introduction. A
theory paper shows no data at all or uses it only as an example.
Can I take a year off?
Absolutely. Especially if it’s to work in a data related job. In
general, maturity and life experiences are an advantage in grad school.
What should I expect when I finish?
You will have many, many options. The demand for your expertise is
great and growing. As a result, there are many high-paying options. If you want to
become an academic I recommend doing a postdoc. Here is why.
But there are many other options as we describe here
and here.
26 Aug 2016
Hilary and I are apart again, and this time we’re talking about political polling. We also discuss Trump’s tweets and the fact that Hilary owns a bowling ball.
Also, Hilary and I have just published a new book, Conversations on Data Science, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!
If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at @NSSDeviations.
Subscribe to the podcast on iTunes or Google Play.
Please leave us a review on iTunes!
Support us through our Patreon page.
Show Notes:
Download the audio for this episode.