23 Feb 2016
Soon after my discussion with Hilary Parker and Jenny Bryan about spreadsheets on Not So Standard Deviations, Brooke Anderson forwarded me this article written by Steven Levy about the original granddaddy of spreadsheets, VisiCalc. The article itself was actually written back in 1984, when so-called microcomputers were just getting their start. VisiCalc was originally written for the Apple II computer, and notable competitors at the time included Lotus 1-2-3 and Microsoft Multiplan, all since defunct.
It's interesting to see Levy's perspective on spreadsheets back then and to compare it to the current thinking about data, data science, and reproducibility in science. The problem back then was that "ledger sheets" (what we might now call spreadsheets), which contained the numbers and calculations related to a business, were tedious to make and keep up to date.
Making spreadsheets, however necessary, was a dull chore best left to accountants, junior analysts, or secretaries. As for sophisticated “modeling” tasks – which, among other things, enable executives to project costs for their companies – these tasks could be done only on big mainframe computers by the data-processing people who worked for the companies Harvard MBAs managed.
You can see one issue here: spreadsheets/ledgers were a "dull chore" best left to junior people, while the "real" computation was done by the people in the "data processing" center on big mainframes. So what exactly does that leave for the business executive to do?
Note that the way of doing things back then was effectively reproducible, because the presentation (ledger sheets printed on paper) and the computation (data processing on mainframes) were separated.
The impact of the microcomputer-based spreadsheet program appears profound.
Already, the spreadsheet has redefined the nature of some jobs; to be an accountant in the age of spreadsheet program is — well, almost sexy. And the spreadsheet has begun to be a forceful agent of decentralization, breaking down hierarchies in large companies and diminishing the power of data processing.
There has been much talk in recent years about an “entrepreneurial renaissance” and a new breed of risk-taker who creates businesses where none previously existed. Entrepreneurs and their venture-capitalist backers are emerging as new culture heroes, settlers of another American frontier. Less well known is that most of these new entrepreneurs depend on their economic spreadsheets as much as movie cowboys depend on their horses.
If you replace "accountant" with "statistician" and "spreadsheet" with "big data", you are magically teleported into 2016.
The way I see it, in the early '80s spreadsheets satisfied the never-ending desire that people have to interact with data. Now, with things like tablets and touch-screen phones, you can literally "touch" your data. But microcomputers had to reach a certain point before interactive data analysis could really be done in a way that we would recognize today. Spreadsheets tightened the loop between question and answer by cutting out the Data Processing department and replacing it with an Apple II (or an IBM PC, if you must) right on your desk.
Of course, combining presentation with computation comes at a cost to reproducibility and perhaps quality control. Seeing the description of how spreadsheets were originally used, it seems totally natural to me. It is not unlike today's analytic dashboards that give you a window into your business and allow you to "model" various scenarios by tweaking a few numbers or formulas. Over time, people took spreadsheets to all sorts of extremes, using them for purposes for which they were not originally designed, and problems naturally arose.
So now we are trying to separate out the computation and presentation bits a little. Tools like knitr and Shiny in R allow us to do this and to bring the two back together with a proper toolchain. The loss in interactivity is only slight because of the power of the toolchain and the speed of computers nowadays. Essentially, we've brought back the Data Processing department, but staffed it with robots and high-speed multi-core computers.
17 Feb 2016
During the discussion that followed the ggplot2 posts from David and me last week, we started talking about tidy data. The man himself noted that matrices are often useful instead of "tidy data", and I mentioned that there might be other data that are usefully "non-tidy". Here I will be using tidy/non-tidy according to Hadley's definition (a toy example follows the list below). So tidy data have:
- One variable per column
- One observation per row
- Each type of observational unit forms a table
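To make the definition concrete, here is a toy illustration (made-up numbers): the same values stored first as a wide table with one column per sample, and then reshaped into tidy form with tidyr.

library(tidyr)
wide = data.frame(gene = c("g1", "g2"),
                  sample_A = c(1.2, 0.4),
                  sample_B = c(2.3, 0.9))
tidy_version = gather(wide, sample, expression, -gene)
tidy_version
  gene   sample expression
1   g1 sample_A        1.2
2   g2 sample_A        0.4
3   g1 sample_B        2.3
4   g2 sample_B        0.9

In the tidy version each row is a single gene-sample observation, which is the "one observation per row" form the definition asks for.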
I push this approach in my guide to data sharing and in a lot of my personal work. But note that non-tidy data can definitely be already processed, cleaned, organized and ready to use.
This led to a very specific blog request:
So I thought I'd talk about a couple of reasons why data are usefully non-tidy. The basic reason is that I usually take a problem-first, not solution-backward, approach to my scientific research. In other words, the goal is to solve a particular problem, and the format I choose is the one that makes it most direct/easy to solve that problem, rather than one that is theoretically optimal. To illustrate these points I'll use an example from my area.
Example data
Often you want data in a matrix format. One good example is gene expression data or data from another high-dimensional experiment. David talks about one such example in his post here. He makes the (valid) point that for students who aren’t going to do genomics professionally, it may be more useful to learn an abstract tool such as tidy data/dplyr. But for those working in genomics, this can make you do unnecessary work in the name of theory/abstraction.
He analyzes the data in that post by first tidying the data.
library(dplyr)
library(tidyr)
library(stringr)
library(readr)
library(broom)
cleaned_data = original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
mutate_each(funs(trimws), name:systematic_name) %>%
select(-number, -GID, -YORF, -GWEIGHT) %>%
gather(sample, expression, G0.05:U0.3) %>%
separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE)
It isn't 100% tidy, since data of different types are in the same data frame (gene expression and metadata/phenotype data belong in different tables), but it's close enough for our purposes. Now suppose that you wanted to fit a model and test for association between the "rate" variable and gene expression for each gene. You can do this with David's tidy data set, dplyr, and the broom package like so:
rate_coeffs = cleaned_data %>% group_by(name) %>%
do(fit = lm(expression ~ rate + nutrient, data = .)) %>%
tidy(fit) %>%
dplyr::filter(term=="rate")
On my computer we get something like:
system.time(cleaned_data %>% group_by(name) %>%
  do(fit = lm(expression ~ rate + nutrient, data = .)) %>%
  tidy(fit) %>%
  dplyr::filter(term == "rate"))
   user  system elapsed
 12.431   0.258  12.364
Let's now try that analysis a little differently. As a first step, let's store the data in two separate tables: a table of "phenotype information" and a matrix of "expression levels". This is the more common format used for this type of data. Here is the code to do that:
expr = original_data %>%
  select(grep("[0-9]", names(original_data)))   # keep only the sample columns (names contain digits)
rownames(expr) = original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
select(systematic_name) %>% mutate_each(funs(trimws),systematic_name) %>% as.matrix()
vals = data.frame(vals=names(expr))
pdata = separate(vals,vals,c("nutrient", "rate"), sep = 1, convert = TRUE)
expr = as.matrix(expr)
If we leave the data in this format we can get the model fits for every gene at once using some simple linear algebra: the least squares coefficients for a single gene y are solve(t(mod) %*% mod) %*% t(mod) %*% y, so stacking the genes as the rows of the expression matrix gives all of the coefficients in one matrix product.
expr = as.matrix(expr)
mod = model.matrix(~ rate + as.factor(nutrient), data = pdata)
# one matrix product gives the least squares coefficients for every gene
rate_betas = expr %*% mod %*% solve(t(mod) %*% mod)
This gives the same answer after re-ordering (ind, whose definition is omitted here, is just an index that matches the ordering of the matrix results to the ordering of the tidy results):
all(abs(rate_betas[, 2] - rate_coeffs$estimate[ind]) < 1e-5, na.rm = TRUE)
[1] TRUE
But this approach is much faster.
system.time(expr %*% mod %*% solve(t(mod) %*% mod))
user system elapsed
0.015 0.000 0.015
This requires some knowledge of linear algebra and isn't pretty. But it brings us to the first general point: **you might not use tidy data because some computations are more efficient if the data are in a different format.**
Many examples, from graphical models to genomics to neuroimaging to the social sciences, rely on linear algebra based computations (matrix multiplication, singular value decompositions, eigendecompositions, etc.), which are all optimized to work on matrices, not on tidy data frames. There are certainly ways to improve performance with tidy data, but they would require an equal amount of custom code to take advantage of, say, C or vectorization in R.
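For instance, a singular value decomposition of the expression data, a routine operation in genomics, acts directly on the matrix. A minimal sketch, assuming the expr matrix built above (genes in rows, samples in columns) and dropping genes with missing values:

# rough sketch: principal components of the genes-by-samples matrix via the SVD
expr_complete = expr[complete.cases(expr), ]           # drop genes with missing values
s = svd(scale(t(expr_complete), center = TRUE, scale = FALSE))
pct_var = s$d^2 / sum(s$d^2)                           # variance explained by each component
round(head(pct_var), 3)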
OK, the linear regressions here are all treated independently, but it is well known that you get much better performance, in terms of the false positive/true positive tradeoff, if you use an empirical Bayes approach for this calculation, where you pool variance estimates across genes.
If the data are in this matrix format you can do it with R like so:
library(limma)
fit_limma = lmFit(expr,mod)
ebayes_limma = eBayes(fit_limma)
topTable(ebayes_limma)
This approach is again very fast, is optimized for the calculations being performed, and performs much better than the one-by-one regression approach. But it requires the data in matrix or ExpressionSet format. Which brings us to the second general point: **you might not use tidy data because many functions require a different, also very clean and useful, data format, and you don't want to have to constantly switch back and forth.** Again, this requires you to be more specific to your application, but the potential payoffs can be really big, as in the case of limma.
I'm showing an example here with expression sets and matrices, but in NLP the data are often input as lists, in graph analyses as matrices, in genomic analyses as GRanges lists, and so on. One option would be to rewrite all of the infrastructure in your area of interest to accept tidy data formats, but that would go against the conventions of a community and would ultimately cost you a lot of work when most of that work has already been done for you.
The final point, which I won't discuss in depth here, is that data are often usefully represented in a non-tidy way. Examples include the aforementioned GRangesList, which consists of (potentially) ragged arrays of intervals and quantitative measurements about them. You could force these data to be tidy by the definition above, but again, most of the infrastructure is built around a different format that is much more intuitive for that type of data. Similarly, data from other applications may be better suited to application-specific formats.
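For a flavor of what that ragged structure looks like, here is a minimal sketch (it assumes the Bioconductor GenomicRanges package is installed, and the intervals are made up):

library(GenomicRanges)
# two samples with different numbers of intervals -- a ragged structure
gr1 = GRanges(seqnames = "chr1",
              ranges = IRanges(start = c(100, 500, 900), width = 50),
              score = c(12, 3, 7))
gr2 = GRanges(seqnames = "chr2",
              ranges = IRanges(start = c(200, 1000), width = 50),
              score = c(5, 20))
grl = GRangesList(sample1 = gr1, sample2 = gr2)
elementNROWS(grl)   # each element has a different number of intervals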
In summary, tidy data is a useful conceptual idea and is often the right way to go for general, small data sets, but may not be appropriate for all problems. Here are some examples of data formats (biased toward my area, but there are others) that have been widely adopted, have a ton of useful software, but don’t meet the tidy data definition above. I will define these as “processed data” as opposed to “tidy data”.
I’m sure there are a ton more I’m missing and would be happy to get some suggestions on Twitter too.
16 Feb 2016
I read a lot of articles about what is going wrong with science:
These articles always point to the "incentives" in science and how they don't align with how we'd like scientists to work. These discussions often frustrate me because they almost always boil down to asking scientists (especially, and often, junior scientists) to make some kind of change for the public good without any guarantee that they are going to be ok. I've seen an acceleration/accumulation of people who are focusing on these issues, I think largely because it is now possible to make a very nice career by pointing out how other people are doing science wrong.
The issue I have is that the people who propose unilateral moves seem to care less that science is both (a) a calling and (b) a career for most people. I do science because I love it. I do science because I want to discover new things about the world. It is a direct extension of the wonder and excitement I had about the world when I was a little kid. But science is also a career for me. It matters if I get my next grant, if I get my next paper. Why? Because I want to be able to support myself and my family.
The issue with incentives is that talking about them costs nothing, but actually changing them is expensive. Right now our system, broadly defined, rewards (a) productivity - lots of papers, (b) cleverness - coming up with an idea first, and (c) measures of prestige - journal titles, job titles, etc. This is because there are tons of people going for a relatively small amount of grant money. More importantly, that money is decided on by processes that are both peer reviewed and political.
Suppose that you wanted to change those incentives to something else. Here is a small list of things I would like:
- People can have stable careers and live in a variety of places without massive two-body problems
- Scientists shouldn’t have to move 2-3 times in the space of a few years right at the beginning of their careers
- We should distribute our money among the largest number of scientists possible
- We should incentivize long-term thinking
- We should incentivize objective peer review
- We should incentivize openness and sharing
The key problem isn't publishing, or code, or reproducibility, or even data analysis.
The key problem is that the fundamental model by which we fund science is completely broken.
The model now is that you have to come up with an idea every couple of years then "sell" it to funders, your peers, etc. This is the source of the following problems:
- An incentive to publish only positive results so your ideas look good
- An incentive to be closed so people don’t discover flaws in your analysis
- An incentive to publish in specific “big name” journals that skews the results (again mostly in the positive direction)
- Pressure to publish quickly which leads to cutting corners
- Pressure to stay in a single area and make incremental changes so you know things will work.
If we really want to have any measurable impact on science we need to solve the funding model. The solution is actually pretty simple. We need to give out 20+ year grants to people who meet minimum qualifications. These grants would cover their own salary plus one or two people and the minimum necessary equipment.
The criteria for getting or renewing these grants should not be things like Nature papers or numbers of citations. They have to be designed to incentivize the things that we actually want (mine are listed above). So if I were going to define the criteria for meeting the standards, people would have to be:
- Working on a scientific problem and trained as a scientist
- Publishing all results immediately online as preprints/free code
- Responding to queries about their data/code
- Agreeing to peer review a number of papers per year
More importantly these grants should be given out for a very long term (20+ years) and not be tied to a specific institution. This would allow people to have flexible careers and to target bigger picture problems. We saw the benefits of people working on problems they weren’t originally funded to work on with research on the Zika virus.
These grants need to be awarded using a rigorous peer review system, just like the one the NIH, HHMI, and other organizations use, to ensure we are identifying scientists with potential early in their careers and letting them flourish. But they'd be given out in a different manner. I'm very confident in the ability of peer review to detect the difference between pseudo-science and real science, or between complete hype and realistic improvement. But I'm much less confident in the ability of peer review to accurately distinguish "important" from "not important" research. So I think we should seriously consider a lottery for these grants.
Each year, all eligible scientists who meet some minimum entry requirements submit proposals for what they'd like to do scientifically. Each year, those proposals are reviewed to make sure they meet the very minimum bar (are they scientific? do they have relevant training at all?). Among the (very large) class of people who pass that bar, we hold a lottery. We take the pool of research dollars and divide it up to give out the maximum number of these grants possible. These grants might be pretty small - just enough to fund the person's salary and maybe one or two students/postdocs. To make this work for labs that require equipment, there would have to be cooperative arrangements among multiple independent individuals to fund and sustain the equipment they need. Renewal of these grants would happen as long as you were posting your code/data online, meeting your peer review requirements, and responding to inquiries about your work.
One thing we'd do to fund this model is eliminate/reduce large-scale projects and super well funded labs. Instead of having 30 postdocs in a well funded lab, you'd have some fraction of those people funded as independent investigators right from the get-go. If we wanted to run a massive, large-scale program, it would come out of a very specific pot of money that would have to be saved up and spent, completely outside of the pot of money for investigator-initiated grants. That would reduce the hierarchy in the system, reduce the pressures that lead to bad incentives, and give us the best chance to fund creative, long-term-thinking science.
Regardless of whether you like my proposal or not, I hope that people will start focusing on how to change the incentives, even when that means doing something big or potentially costly.
12 Feb 2016
For this episode, special guest Jenny Bryan (@jennybryan) joins us from the University of British Columbia! Jenny, Hilary, and I talk about spreadsheets and why some people love them and some people despise them. We also discuss blogging as part of scientific discourse.
Subscribe to the podcast on iTunes.
Show notes:
Download the audio for this episode.
11 Feb 2016
Some of my colleagues think of me as super data-sciencey compared to other academic statisticians. But one place I lose tons of street cred in the data science community is when I talk about ggplot2. For the 3 data type people on the planet who still don’t know what that is, ggplot2 is an R package/phenomenon for data visualization. It was created by Hadley Wickham, who is (in my opinion) perhaps the most important statistician/data scientist on the planet. It is one of the best maintained, most important, and really well done R packages. Hadley also supports R software like few other people on the planet.
But I don’t use ggplot2 and I get nervous when other people do.
I get no end of grief for this from Hilary and Roger and especially from drob, among many others. So I thought I would explain why and defend myself from the internet hordes. To understand why I don’t use it, you have to understand the three cases where I use data visualization.
- When creating exploratory graphics - graphs that are fast, aren't meant to be shown to anyone else, and help me explore a data set
- When creating expository graphs - graphs that I want to put into a publication and that have to be very carefully made.
- When grading student data analyses.
Let’s consider each case.
Exploratory graphs
Exploratory graphs don't have to be pretty. I'm going to be the only one who looks at 99% of them. But I have to be able to make them quickly, and I have to be able to make a broad range of plots with minimal code. There are many types of graphs, including things like heatmaps, that don't fit neatly into the ggplot2 framework and are therefore harder to make. The flexibility of base R comes at a price, but it means you can make all sorts of things you need without struggling against the system, which is a huge advantage for data analysts. There are some graphs (like this one) that are pretty straightforward in base but require quite a bit of work in ggplot2. In many cases qplot can be used more or less interchangeably with plot, but then you really don't get any of the advantages of the ggplot2 framework.
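To make the heatmap point concrete, here is a quick sketch using a built-in dataset: a clustered heatmap is one line in base R, while the ggplot2 route typically means reshaping to long format and using geom_tile, and the clustering/reordering doesn't come for free.

# one line in base R: a scaled, clustered heatmap
heatmap(as.matrix(mtcars), scale = "column")

# a ggplot2 version needs the data in long format first and has no dendrogram
library(ggplot2)
library(reshape2)
mtcars_long = melt(scale(mtcars))
ggplot(mtcars_long, aes(x = Var2, y = Var1, fill = value)) + geom_tile()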
Expository graphs
When making graphs that are production ready or fit for publication, you can do this with any system. You can do it with ggplot2, with lattice, with base R graphics. But regardless of which system you use it will require about an equal amount of code to make a graph ready for publication. One perfect example of this is the comparison of different plotting systems for creating Tufte-like graphs. To create this minimal barchart:
The code they use in base graphics is this (super blurry sorry, you can also go to the website for a better view).
In ggplot2 the code is:
Both require a significant amount of coding. The ggplot2 version also takes advantage of the ggthemes package, which means that, for a plot the package doesn't cover, it would require even more coding.
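For flavor only, here is a rough approximation of the ggplot2 side (this is not the code from the linked comparison, and the data are made up); the full version in the comparison does considerably more work, with custom grid lines, tick marks, and fonts, which is where the extra code comes in.

library(ggplot2)
library(ggthemes)   # assumed available for theme_tufte()
d = data.frame(group = LETTERS[1:5], value = c(20, 14, 23, 7, 16))  # made-up data
ggplot(d, aes(x = group, y = value)) +
  geom_bar(stat = "identity", fill = "gray", width = 0.4) +
  theme_tufte(base_size = 14) +
  labs(x = NULL, y = NULL)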
The bottom line is for production graphics, any system requires work. So why do I still use base R like an old person? Because I learned all the stupid little tricks for that system, it was a huge pain, and it would be a huge pain to learn it again for ggplot2, to make very similar types of plots. This is one where neither system is particularly better, but the time-optimal solution is to stick with whichever system you learned first.
Grading student work
People I seriously respect suggest teaching ggplot2 before base graphics as a way to get people up and going quickly making pretty visualizations. This is a good solution to the little data scientist's predicament. The tricky thing is that the ggplot2 defaults are just pretty enough that they might trick you into thinking the graph is production ready. Say, for example, you make a plot of the latitude and longitude of the quakes data in R, colored by the number of stations reporting. This is one case where ggplot2 crushes base R for simplicity because of the automated generation of a color scale. You can make this plot with just the line:
ggplot() + geom_point(data=quakes,aes(x=lat,y=long,colour=stations))
And get this out:
That is a pretty amazing plot in one line of code! What often happens with students in a first serious data analysis class is they think that plot is done. But it isn’t even close. Here are a few things you would need to do to make this plot production ready: (1) make the axes bigger, (2) make the labels bigger, (3) make the labels be full names (latitude and longitude, ideally with units when variables need them), (4) make the legend title be number of stations reporting. Those are the bare minimum. But a very common move by a person who knows a little R/data analysis would be to leave that graph as it is and submit it directly. I know this from lots of experience.
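As a rough sketch (not a definitive recipe), the cleanup steps above might look something like this in ggplot2, using labs() for the full-name labels and legend title and theme() for the sizes:

library(ggplot2)
ggplot(quakes, aes(x = lat, y = long, colour = stations)) +
  geom_point() +
  labs(x = "Latitude", y = "Longitude",
       colour = "Number of stations reporting") +
  theme(axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        legend.title = element_text(size = 14))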
The one nice thing about teaching base R here is that the base version for this plot is either (a) a ton of work or (b) ugly. In either case, it makes the student think very hard about what they need to do to make the plot better, rather than just assuming it is ok.
Where ggplot2 is better for sure
ggplot2 being compatible with piping, having a simple system for theming, having a good animation package, and in general being an excellent platform for developers who create [extensions](https://ggplot2-exts.github.io/index.html) are all huge advantages. It is also great for getting absolute newbies up and making medium-quality graphics in a huge hurry. This is a great way to get more people engaged in data science and I'm psyched about the reach and power ggplot2 has had. Still, I probably won't use it for my own work, even though it disappoints my data scientist friends.