Is it species or is it batch? They are confounded, so we can't know
20 May 2015
In a 2005 OMICS paper, an analysis of human and mouse gene expression microarray measurements from several tissues led the authors to conclude that “any tissue is more similar to any other human tissue examined than to its corresponding mouse tissue”. This was a rather surprising result given how similar tissues are between species. For example, both mice and humans see with their eyes, breathe with their lungs, pump blood with their hearts, etc. Two follow-up papers (here and here) demonstrated that platform-specific technical variability was the cause of this apparent dissimilarity. The arrays used for the two species were different, and thus measurement platform and species were completely confounded. In a 2010 paper, we confirmed that once this technical variability was accounted for, the number of genes expressed in common between the same tissue across the two species was much higher than the number expressed in common between the two species across different tissues (see Figure 2 here).
So what is confounding and why is it a problem? This topic has been discussed broadly. We wrote a review some time ago. But based on recent discussions I’ve participated in, it seems that there is still some confusion. Here I explain, aided by some math, how confounding leads to problems in the context of estimating species effects in genomics. We will use
- Xi to represent the gene expression measurements for human tissue i,
- aX to represent the level of expression that is specific to humans and
- bX to represent the batch effect introduced by the use of the human microarray platform.
- Therefore Xi = aX + bX + ei, with ei the tissue i effect and other uninteresting sources of variability.
Similarly, we will use:
- Yi to represent the measurements for mouse tissue i
- aY to represent the mouse specific level and
- bY the batch effect introduced by the use of the mouse microarray platform.
- Therefore Yi = aY + bY + fi, with fi the tissue i effect and other uninteresting sources of variability.
If we are interested in estimating a species effect that is general across tissues, then we are interested in the following quantity:
aY - aX
Naively, we would think that we can estimate this quantity using the observed differences between the species that cancel out the tissue effect. We observe a difference for each tissue: Y1 - X1 , Y2 - X2 , etc… The problem is that aX and bX are always together as are aY and bY. We say that the batch effect bX is confounded with the species effect aX. Therefore, on average, the observed differences include both the species and the batch effects. To estimate the difference above we would write a model like this:
Yi - Xi = (aY - aX) + (bY - bX) + other sources of variability
and then estimate the unknown quantities of interest, (aY - aX) and (bY - bX), from the observed data Y1 - X1, Y2 - X2, etc. The problem is that we can estimate the aggregate effect (aY - aX) + (bY - bX), but, mathematically, we can't tease apart the two differences. To see this, note that if we are using least squares, the estimates (aY - aX) = 7, (bY - bX) = 3 will fit the data exactly as well as (aY - aX) = 3, (bY - bX) = 7 since
{(Y-X) - (7+3)}^2 = {(Y-X) - (3+7)}^2.
In fact, under these circumstances, there are an infinite number of solutions to the standard statistical estimation approaches. A simple analogy is trying to find a unique solution to the single equation m+n = 0: any pair with n = -m works. If batch and species are not confounded, then we are able to tease apart the differences, just as if we were given a second equation: m+n = 0; m-n = 2, which has the unique solution m = 1, n = -1. You can learn more about this in this linear models course.
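The non-identifiability can be checked numerically. The following is a minimal sketch in Python with NumPy, not an analysis of any real data: the eight simulated differences, the noise level, and the "balanced" design in which half of the comparisons carry no batch difference are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed differences d_i = Y_i - X_i for 8 tissues,
# simulated with an aggregate effect of 10 plus a little noise.
n = 8
d = 10 + rng.normal(scale=0.1, size=n)

# Confounded design: every difference carries both the species term
# (aY - aX) and the batch term (bY - bX), so the two columns are identical.
Z_confounded = np.column_stack([np.ones(n), np.ones(n)])


def rss(Z, coef, y):
    """Residual sum of squares for a given coefficient vector."""
    resid = y - Z @ np.asarray(coef, dtype=float)
    return float(resid @ resid)


rank_confounded = np.linalg.matrix_rank(Z_confounded)
rss_7_3 = rss(Z_confounded, [7, 3], d)  # (aY - aX) = 7, (bY - bX) = 3
rss_3_7 = rss(Z_confounded, [3, 7], d)  # (aY - aX) = 3, (bY - bX) = 7

# Assumed unconfounded design: only the first four differences include
# the batch term, so the batch column is no longer a copy of the intercept.
batch = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
Z_balanced = np.column_stack([np.ones(n), batch])
d_balanced = 7 + 3 * batch + rng.normal(scale=0.1, size=n)

rank_balanced = np.linalg.matrix_rank(Z_balanced)
coef_balanced, *_ = np.linalg.lstsq(Z_balanced, d_balanced, rcond=None)

print("rank (confounded, balanced):", rank_confounded, rank_balanced)
print("RSS for (7,3) and (3,7):", rss_7_3, rss_3_7)
print("balanced estimates (species, batch):", coef_balanced)
```

In the confounded case the design matrix has rank 1 and the two candidate solutions give the same residual sum of squares, so least squares has no basis for preferring one over the other. In the assumed balanced case the rank is 2 and the two effects are estimated separately.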
Note that the above derivation applies to each gene affected by the batch effect. In practice we commonly see hundreds of genes affected. As a consequence, when we compute distances between two samples from different species we may see large differences even where there is no species effect. This is because the bY - bX differences for each gene are squared and added up.
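A small simulation illustrates how these squared batch differences accumulate. This is a sketch under assumed numbers (5000 genes, 300 of them affected by batch, arbitrary effect sizes and noise), with no species effect simulated at all.

```python
import numpy as np

rng = np.random.default_rng(1)

n_genes = 5000
n_affected = 300   # hypothetical number of genes with a batch effect
noise_sd = 0.1     # hypothetical measurement noise

# Expression shared by the same tissue in both species.
# No species effect is simulated, i.e. aY - aX = 0 for every gene.
tissue = rng.normal(size=n_genes)

# Batch difference bY - bX: nonzero only for the affected genes.
batch_diff = np.zeros(n_genes)
batch_diff[:n_affected] = rng.normal(scale=2.0, size=n_affected)

human = tissue + rng.normal(scale=noise_sd, size=n_genes)
mouse_no_batch = tissue + rng.normal(scale=noise_sd, size=n_genes)
mouse = mouse_no_batch + batch_diff

# Euclidean distance: per-gene differences are squared and summed.
dist_without_batch = np.linalg.norm(mouse_no_batch - human)
dist_with_batch = np.linalg.norm(mouse - human)

print("distance without batch effect:", dist_without_batch)
print("distance with batch effect:   ", dist_with_batch)
```

Even though only a small fraction of genes is affected and the species effect is zero by construction, the distance between the two samples grows substantially once the batch differences are added.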
In summary, if you completely confound your variable of interest, in this case species, with a batch effect, you will not be able to estimate the effect of either. In fact, in a 2010 Nature Reviews Genetics article about batch effects we warned about "cases in which batch effects are confounded with an outcome of interest and result in misleading biological or clinical conclusions". We also warned that none of the existing solutions for batch effects (ComBat, SVA, RUV, etc.) can save you from a situation with perfect confounding. Because we can't always predict what will introduce unwanted variability, we recommend randomization as an experimental design approach.
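As a rough illustration of what randomization can look like in practice, here is a minimal Python sketch that assigns samples to sequencing lanes at random within each species, so that every lane ends up with samples from both species. The tissue names, the number of lanes, and the sample IDs are hypothetical, and a real design would typically also balance tissue and other known factors.

```python
import random
from collections import defaultdict

random.seed(2015)  # fixed seed only so the sketch is reproducible

species_list = ("human", "mouse")
tissues = ("heart", "lung", "liver", "kidney")  # hypothetical tissues
n_lanes = 2                                     # hypothetical lane count

# Randomize within species: shuffle each species' samples, then deal
# them out across lanes so no lane is dedicated to a single species.
lane_of = {}
for species in species_list:
    ids = [f"{species}_{tissue}" for tissue in tissues]
    random.shuffle(ids)
    for k, sample_id in enumerate(ids):
        lane_of[sample_id] = k % n_lanes

# Check the design: which species appear in each lane?
species_per_lane = defaultdict(set)
for sample_id, lane in lane_of.items():
    species_per_lane[lane].add(sample_id.split("_")[0])

for lane in sorted(species_per_lane):
    print("lane", lane, "contains", sorted(species_per_lane[lane]))
```

With a layout like this, lane is no longer a copy of the species indicator, so a lane effect and a species effect can in principle be estimated separately.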
Almost a decade after the OMICS paper was published, the same surprising conclusion was reached in this PNAS paper: "tissues appear more similar to one another within the same species than to the comparable organs of other species". This time RNAseq was used for both species, so the different-platform issue was not considered*. The authors therefore implicitly assumed that (bY - bX) = 0. However, in a recent F1000 Research publication Gilad and Mizrahi-Man describe an exercise in forensic bioinformatics that led them to discover that mouse and human samples were run in different lanes or on different instruments. The confounding was near perfect (see Figure 1). As pointed out by these authors, with this experimental design we can't simply accept that (bY - bX) = 0, which implies that we can't estimate a species effect. Gilad and Mizrahi-Man then apply a linear model (ComBat) to account for the batch/species effect and find that samples cluster almost perfectly by tissue. However, Gilad and Mizrahi-Man correctly note that, due to the confounding, if there is in fact a species effect, this approach will remove it along with the batch effect. Unfortunately, due to the experimental design it will be hard or impossible to determine if it's batch or if it's species. More data and more analyses are needed.
Confounded designs ruin experiments. Current batch effect removal methods will not save you. If you are designing a large genomics experiment, learn about randomization.
* The fact that RNAseq was used does not necessarily mean there is no platform effect. The species have different genomes, with different sequences and thus can lead to different biases during experimental protocols.
Update: Shin Lin has repeated a small version of the experiment described in the PNAS paper. The new experimental design does not confound lane/instrument with species. The new data confirm the original results, indicating that lane/instrument does not explain the clustering by species. You can see his response in the comments here.