Technical and biological variance structure in mRNA-Seq data: life in the real world
© Oberg et al.; licensee BioMed Central Ltd. 2012
Received: 20 January 2012
Accepted: 7 July 2012
Published: 7 July 2012
Skip to main content
© Oberg et al.; licensee BioMed Central Ltd. 2012
Received: 20 January 2012
Accepted: 7 July 2012
Published: 7 July 2012
mRNA expression data from next generation sequencing platforms is obtained in the form of counts per gene or exon. Counts have classically been assumed to follow a Poisson distribution in which the variance is equal to the mean. The Negative Binomial distribution which allows for over-dispersion, i.e., for the variance to be greater than the mean, is commonly used to model count data as well.
In mRNA-Seq data from 25 subjects, we found technical variation to generally follow a Poisson distribution as has been reported previously and biological variability was over-dispersed relative to the Poisson model. The mean-variance relationship across all genes was quadratic, in keeping with a Negative Binomial (NB) distribution. Over-dispersed Poisson and NB distributional assumptions demonstrated marked improvements in goodness-of-fit (GOF) over the standard Poisson model assumptions, but with evidence of over-fitting in some genes. Modeling of experimental effects improved GOF for high variance genes but increased the over-fitting problem.
These conclusions will guide development of analytical strategies for accurate modeling of variance structure in these data and sample size determination which in turn will aid in the identification of true biological signals that inform our understanding of biological systems.
Next generation sequencing is a tool that is revolutionizing scientific research with its unprecedented depth of coverage, accuracy, precision, and the ability to link gene expressions with phenotype. The Illumina Genome Analyzer (GA), originally by Solexa, enables interrogation of mRNA expression via the mRNA Sequencing protocol. There are several reports on quality assessments of next generation sequencing and comparisons with microarray gene expression [1, 2]. Biological signal has been evaluated in single biological replicates of cell lines or tissue samples from this platform [3–5]. Anders and Huber report on variation in pools of two fruit fly embryos ; however, to date there is no thorough report on the functional form of variance in variability in mRNA Sequencing data from true human biological replicates. Thus, we evaluated the structure of biological variability and statistical modeling strategies useful for determining differential expression in mRNA-Seq data with true biological replicates.
First we describe some distributional background. The Poisson distribution is commonly assumed when modeling count data. This distribution considers each individual piece of mRNA to be a random draw from a large collection of pieces of mRNA with some probability vector describing the relative proportion across all possible mRNA pieces. A piece could refer to a particular exon or gene according to the researcher’s interest. The Poisson distribution appears to describe well the variation observed between two technical replicates of the same specimen, i.e., two aliquots of the same library allocated to two lanes on a flow cell [3–5]. The mean, μ, and variance are expected to be equal when sampling from a Poisson distribution, i.e., Var(y) = μ.
Biological replication adds another level of variability to the observed data. Biological variability is that due to inter-individual differences between human or animal subjects, for example, which cause the probability vector describing the distribution of mRNA strands to differ between subjects. Thus, when count data are observed in multiple biological replicates, the observed variance is a sum of both the technical and biological variability pieces. This results in the observed variance being larger than expected under the Poisson distribution. That is, the variance is larger than the mean. This scenario is termed “over-dispersion” .
In the simplest case, variance increases as a linear function of the mean, i.e., the variance is a constant multiplied by the mean, Var(y) = kμ. We denote this as the over-dispersed (OD) Poisson throughout and model parameters can be estimated via quasi-likelihood methods. A more sophisticated model assumes the within-specimen (technical) variation follows a Poisson distribution and the between-specimen mean values follow a gamma distribution. This gives rise to the negative binomial (NB) distribution in which the variance increases as a quadratic function of the mean, i.e., Var(y) = μ + φμ2.
Our goal in the present work was to characterize the mean-variance relationship in mRNA Seq data in order to guide the choice of distributional assumptions. We first evaluated technical variability in gene-level counts to ensure consistency with what others have reported. Next, we evaluated the variance structure between biological replicates within a treatment group, considering the functions Var(y) = μ, kμ, and μ + φμ2 with and without normalization and blocking factors. We believe this work will be useful to others in analyzing and interpreting similar data.
Twenty five study subjects representing the extremes of the humoral immune responses to rubella vaccine (12 high antibody responders with a median titer of 145 IU/mL and 13 low responders with a median titer of 10 IU/mL) were selected from a large population-based, age-stratified random sample of 738 healthy children and young adults (age 11 to 19 years), from Olmsted County, Minnesota. Clinical and demographic characteristics of the population based sample have been previously reported . This population-based candidate gene association study was performed to assess the importance of single nucleotide polymorphisms (SNPs) and genes involved in the immune response heterogeneity to rubella vaccine [8–10]. All study participants had been previously immunized with two doses of measles-mumps-rubella (MMR-II) vaccine with a median time since last immunization to enrollment (blood draw and measurement of antibody levels) of 5.8 years. The Mayo Clinic Institutional Review Board granted approval for the study. Written, informed consent and assent from subjects and/or parents/guardians was obtained at the time of enrollment.
Subjects’ PBMC (peripheral blood mononuclear cells) were thawed and stimulated (or left unstimulated) with live rubella virus (multiplicity of infection, MOI = 5, 48 hrs). Total RNA was extracted from stabilized cells using RNeasy Plus mini kit (Qiagen, Valencia, CA). RNA concentration and quality were assessed by Nano Chip kit analysis on an Agilent 2100 Bioanalyzer (Agilent, Palo Alto, CA). Fifty samples from 25 subjects were completed for culture, RNA extraction and RNA quality control with adequate concentration and purity (lack of DNA contamination), as well as good RNA integrity and lack of RNA degradation.
Libraries were prepared using the mRNA-Seq Sample Prep Kit (Illumina, San Diego, CA) following the manufacturer’s instructions. Briefly, polyadenylated RNA was isolated from 1 μg of total RNA using two rounds of hybridization to oligo-dT magnetic beads. The mRNA samples were chemically fragmented, reverse transcribed and converted into double stranded cDNA. Unique Illumina adaptors were ligated to the DNA fragments after end repair (to produce blunt ends) and A-base tailing. Fragments of approximately 200 bp were gel purified and amplified by PCR. The libraries were validated and quantified on an Agilent 2100 Bioanalyzer (Agilent, Palo Alto, CA) using DNA 1000 Chip kits. Sequencing was carried out on the Genome Analyzer GAIIx (Illumina, San Diego, CA). Samples were sequenced as single end reads using Illumina’s Single Read Cluster Generation kit (v2) and 51 Cycle SBS Sequencing Kit (v3) following the manufacturer’s instructions.
The first five flow cells were processed using Sequencing Control Studio (SCS) v 2.01 and the last eight flow cells were processed with SCS v 2.4 which allowed for higher cluster densities and higher pass filter rates. The images from the sequencing cycles were processed using the Illumina Pipeline Software v1.5. Specifically, images were converted to signal intensities using Illumina Pipeline’s FireCrest program. Base calling from intensity values was performed, and the quality scores for every base were calculated using the Bustard program. Illumina’s alignment tool, ELAND, was used to align the sequence reads to genome build 36 and exon junction databases. Illumina’s CASAVA tool version 1.0 was used to summarize the alignment results using only reads that mapped to a unique genome location and CASAVA results were imported into Genome Studio to generate the count tables for genes, exons, and exon-junctions. Sequencing pass/fail quality was determined by cluster densities, percent of clusters passing filters, percent of reads aligning to the reference, and percent error rate of the alignment.
The endpoint used for analysis was total reads or counts per gene; for the analysis of technical variation this is counts per lane while for biological variation this is counts summed over two technical replicate lanes. We evaluated the suitability of the Poisson, OD Poisson, and NB assumptions for modeling biological variability. The Poisson distribution assumes variance is equal to the mean μ. The OD Poisson assumes the variance increases as a linear function of the mean, kμ, where k is a constant and μ is the true mean. The NB distribution assumes the variance increases as a quadratic function of the mean as μ + φμ 2 , where φ is a constant.
Model Goodness-of-fit (GOF) was assessed via quantile-quantile (QQ) plots of per-gene Pearson statistics as has been done previously . Here, i indexes n samples, y is the observed count, is the predicted value and is the appropriate variance of the predicted value ( for the Poisson, for the OD Poisson and for the NB). This statistic is asymptotically chi square with n-p degrees of freedom, where n is the sample size and p is the number of model parameters. While GOF statistics are a rather crude measure of how well a model fits, the magnitude of changes observed here are informative. We verified that the asymptotic assumptions were reasonable with the present sample size in simulated NB data (Additional file 1: Figure S1). While the empirical distributions do not have values quite as extreme in the right tail as theoretical distributions, the deviation from the identity line is very small relative to the distributional changes observed herein. An alternative to the Pearson statistic is residual deviance, but this statistic was found to be ill behaved, even in simulated data. Such behavior has been documented previously .
The generalized linear model (GLM) framework was used to fit per-gene models to test for differential expression between high and low response groups using the log link function . The most basic model included a term for high/low response group. These were fit to unstimulated specimens only so as to focus on biological variation in the absence of correlation between paired specimens. Counts for the two technical replicate lanes (which we call lane-pair) were summed for these models. Thus, these fits had p = 1 model degree of freedom.
Model fits were evaluated with no normalization, with total count per lane-pair, or with 75th percentile count per lane-pair as a normalization constant . Normalization constants were included as an offset term in the GLM model, i.e., a term with the coefficient forced to be 1.0. The normalization offset leads to the interpretation of the gene counts as a rate, e.g., the portion of counts out of the total lane counts . In addition, fits were evaluated with no blocking factor or with categorical variables added to the model to indicate flow cell (p = 13), lane-pair (p = 4) or library preparation batch (p = 6) as blocking factors. These blocking factors are potential sources of systematic non-biological variation in the experiment. There may be global shifts in counts due to flow cell, lane position on the flow cell such as middle versus edge, or PCR batch effects. Lastly, per-gene, local and global variance parameter estimation strategies were evaluated.
All statistical computing was performed in R . The glm function in R was used to fit all models owing to the flexibility in modeling biological and experimental effects. Models based upon NB distributional assumptions were fit with either per-gene, moderated or global estimates of φ. Per-gene estimates were estimated within the glm.nb function call in the MASS package on a per-gene basis. Global and local (moderated) estimates of φ were computed with the edgeR package [13–16] in R. In that package, the global estimate is computed via quantile-adjusted conditional maximum likelihood  and is shared by all genes. The local estimate is an approximate empirical Bayes estimate which shrinks per-gene estimates towards the common global estimate . The degree of shrinkage for local estimates depends on the prior.n parameter, where larger values result in more shrinkage towards the global value. We evaluated setting prior.n equal to 3, 10 (the default value), and 20; since a prior.n of 3 yielded much better fits than 10 or 20 (data not shown), results are presented here with prior.n = 3.
Convergence rates were excellent for all models, with the worst case being non-convergence for four genes when a blocking factor was included in the model. The data have been deposited in the Gene Expression Omnibus and are available (with anonymous gene names since the biological findings have not yet been published) via the following link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=nxahvuioqmkwwzg&acc=GSE29022.
All 25 study subjects were female Caucasians to minimize variation. Median rubella-specific antibody response in the 12 high response subjects was 145 IU/ml (min 115, max 325) and in the 13 low response subjects was 10 IU/ml (min 3, max 14) .
Figure 1B contains a schematic of the attempted and failed assays. Sample libraries were prepared in seven PCR batches and a total of thirteen flow cells were used. Sequencing failed for all four lanes of subject 86 and for the unstimulated lanes of subject 5361 and we did not receive data for these lanes. For subject 86, the libraries were determined to be of good quality and were re-sequenced, filling the eighth flow cell; for subject 5361 the libraries were not of good quality and there was not sufficient sample to repeat the library preparation. Since the focus of this manuscript is on the mean-variance relationship in these data in the absence of correlation, we focus here on the 48 lanes of data corresponding to the unstimulated specimens from 24 subjects for simplicity.
We assessed technical variation as the variation between the same specimen pipetted onto two lanes of a flow cell. Technical reproducibility in these data was good as demonstrated by scatter and minus versus average (MVA) plots (Additional file 1: Figure S2). Once scaled for total lane count, virtually no linear bias was detected as demonstrated by the grey smoother nearly overlaying the y = 0 line. In addition, there was no hint of any nonlinear biases as is often seen in gene expression microarray data. This was true for all pairs of technical replicates.
QQ plots assuming Poisson-type variation were made for all pairs of technical replicates using Pearson GOF statistics; all 24 are available in Additional file 1: Figure S3. For Poisson-type variation, the points would be expected to fall on the identity line. Eighteen of the plots demonstrated variation very close to expected. Three of the plots demonstrated a moderate fit of the Poisson distribution with observed Pearson quantile scales ranging from approximately 25 to 50 indicating larger than expected variance. Two plots demonstrated poor fit overall as evidenced by large observed Pearson quantile scales, one reaching approximately 250 and the other approximately 1,200. The observed Pearson quantiles for the technical replicates from subject 86 reached approximately 25,000. As there were problems with this sample in the laboratory, and because the distribution of GOF statistics was so far from expected and so different from the others, we discarded this sample from further analyses. Similar to what others have found, we concluded that in general, technical variation followed a Poisson distribution with variance equal to the mean. Similar results for technical variation were found in the stimulated specimens (data not shown). With this conclusion we summed the per-gene counts from the two technical replicates for further analyses as in .
We define biological variation as the variation between multiple subjects in the same group, e.g., high versus low responders, cancer versus normal, treatment versus not, etc. This variation may be small for cell lines and moderate for genetically identical animals, but can be very large for human samples. Analysis methods need to be able to cope with and appropriately model this variation.
We questioned whether a one-size-fits-all variance structure was appropriate. It is biologically plausible (and many would argue likely) that the relationship, i.e. the precise multiplier φ, varies between genes. Thus, models were fit estimating over-dispersion on a per-gene basis assuming a quadratic functional form. Note that the quantities plotted in QQ plots based on the linear over-dispersion assumption are used in the estimation of the over-dispersion parameter in the R function and therefore by definition were not useful for assessing model fit. The QQ plot based on NB distributional assumptions with per-gene estimates of φ demonstrated worse fits than when the global estimate was used with observed Pearson quantiles extending up 150,000 for all genes, with GOF statistics for all but two genes below 50,000 (Figure 4C, G). To investigate further, we exported local estimates of φ from the edgeR function and held these fixed. These fits demonstrated improvement over those based on the global estimate of φ with a maximum observed Pearson Chi-squared value of approximately 220 (Figure 4D, H). Eighty-two percent of the local φ estimates were less than 0.15 (Figure 3B). Note that a φ of 0 would correspond to no over-dispersion, consistent with Poisson distributional assumptions. Approximately 5% of the genes had an estimate of φ greater than 0.25. Upon investigation of the genes with φ estimates in the spike of density beginning at a log10(average count per lane-pair) of approximately 2 and extending to approximately 4.5 on the horizontal axis (Figure 3B), we found that both replicates from one patient sample had 0 counts for all of these genes. The data from this subject otherwise appear similar to that from other subjects. This phenomenon will be investigated further. Ignoring this spike, the estimates of φ do not appear to be a function of the mean in these data. We focus on the quadratic mean-variance functional form with local estimates from here forward for four reasons. First, a quadratic function fits the global mean-variance relationship for biological variation (Figure 3A). Second, the biological interpretation of the NB distribution is appropriate; count data that follow a Poisson distribution conditional on a subject-specific mean, together with between-subject means that follow a Gamma distribution, give rise to the NB distribution. Third, it was clearly difficult to estimate the mean-variance relationship on a per-gene basis (Figure 4). Fourth, we do not believe that all genes must share the same mean-variance relationship.
As in Bullard et al. , we found that choice of normalization constant had little effect on GOF distributions (data not shown). Therefore we present results taking the 75% count as the normalization factor since Bullard et al. found this normalization scale to have less bias in fold-change estimates and better sensitivity and specificity than none or total lane count. The rationale for using the 75% count to adjust for varying sequencing depths rather than, say, the total lane count, is to avoid the possibility of a high-count gene that is differentially expressed being overly influential on the normalization [5, 17].
While it is important for a model to explain all sources of variation, a balance must be made between this and over-fitting the data. This will especially be the case for studies with extremely small samples which are typically employed for next generation sequencing studies given the intense monetary and time resource utilization.
We set out to evaluate technical, biological, and experimental variation in gene expression measured by mRNA-Seq counts from n = 25 subjects. Technical variation in these data were in keeping with Poisson distributional assumptions in general as has been reported by some [3–5], while others have reported greater than Poisson variation for low counts . The biological variance across all observed genes was over-dispersed relative to the mean, increasing as a quadratic function of the mean. This has important implications for study planning in terms of power and sample size estimation as well as choice of statistical modeling procedures to detect differential expression. Differential expression results were not included here since the focus is on the mean-variance relationship.
Over-dispersion, i.e., variance larger than the mean, is a common phenomenon in observed count data . Others have suggested that over-dispersion in mRNA-Seq data will likely be observed in mRNA-Seq biological replicates and have suggested models relying on OD Poisson or NB assumptions to account for this [5, 6, 13, 16, 19]. There was clear evidence of over-dispersion from the standard Poisson distribution in our data. A quadratic function described the mean-variance relationship across all genes better than a linear function, which is consistent with the NB family of distributions. The interpretation of the NB distribution as counts from several within-subject Poisson distributions, each subject having their own mean parameter makes biological sense as well in the study at hand. In μ + φμ2, the square root of φ corresponds to the subject-to-subject CV of the Gamma distributed Poisson mean parameters. The CV corresponding to the global estimate of φ for the data presented herein is 36%. We have observed a CV of approximately 9% for technical variation in genetically similar rats and 22% for biological replicates in genetically similar rats (data not shown). This is in keeping with the expectation that biological variation is larger than technical variation, and human variation is larger than genetically inbred animal model variation.
There are several potential sources of experimental variation in these studies including PCR library preparation batch, flow cell, lane, and for large studies, machine and software upgrades. Some effects can be addressed through normalization while others must be modeled directly. We found that normalization strategy had little impact on GOF statistics. Bullard et al.  reported similar results and found that the 75% count normalization resulted in nearly no bias and the best sensitivity and specificity without increasing variance. Our data was similar to theirs in that a small percent of the genes contributed a large portion of the counts. Thus, we chose to use the 75% count as the normalization scale in our models. Others have evaluated normalization strategies beyond this [17, 20].
Bullard et al. found experimental variation due to library PCR preparation batch to be greater than that due to flow cell . We found that inclusion of library PCR batch, lane-pair and flow cell each resulted in a dramatic reduction in the large GOF statistics resulting in the GOF statistics being too small (i.e., less than expected) in the majority of genes. The GOF statistics were reduced to the extent that, in this particular experiment with only one subject per treatment per flow cell, we believe these terms should not be included in the differential expression models in order to avoid over-fitting. Note that this experiment is an incomplete block design with respect to subject, in which case there is not complete recovery of all effects. The discussion of incomplete recovery of effects has spanned decades in the statistical literature. The classic text by Scheffe points to the use of random effects as a compromise between the extremes of not including the effects or including them as fixed effects . Since estimation in a NB model of both the dispersion parameter and random effects is not straightforward, the experimental effects were modeled as fixed effects for the purposes of this investigation.
Estimation of the over-dispersion parameter has been the subject of much research; see for example the summary in . While the edgeR local estimates of over-dispersion resulted in the most reasonable GOF distributions, the results herein point to the need for further investigation in the estimation of variance for these data. Biologically it makes sense for the variance to differ across genes, and that the mean-variance relationship may vary across genes as well. For example, some systems in the body such as pH are very tightly controlled with dramatic consequences for deviations from the baseline normal levels. Other systems in the body such as cholesterol are less tightly controlled, and allowed to vary widely before consequences are observed. However, with small sample sizes it is likely too difficult to estimate the mean-variance relationship accurately and precisely on a per-gene basis. Local or per-gene estimates of variance are common practice in gene expression microarray and shotgun proteomic literature [22–24]. Anders and Huber recently presented data from two pools of fruit fly embryos in each of two study groups that agree with this assumption and they described a model for variation based on NB distributional assumptions with a “shot noise” portion and a “raw variance” portion . Alternatively, the fact that these model fits result in smaller than expected GOF statistics for a majority of genes may point to the need for a linear variance function for some genes and quadratic variance function for others due to biological phenomenon such as alternative splicing. Indeed, such a modeling strategy has been proposed . It is difficult to compare per-gene appropriateness of OD Poisson and NB distributional assumptions due to the poor behavior of the residual deviance statistic and the fact that estimation of the OD Poisson dispersion parameter is a function of the Pearson statistic. Application of a power transformation to the data in order to use Poisson modeling assumptions has recently been proposed . We expect continual improvement in informatics mapping algorithms to result in decreased variance; e.g. including reads that map to multiple locations on the genome has been shown to improve technical reproducibility . Robust GLMs based on M-estimators may be an attractive alternative as well. A Gaussian approximation to the Poisson distribution is said to be valid when the mean is ≥10 ; thus linear models which account for the mean-variance relationship, e.g. via weighted least squares, may be an attractive alternative since model fitting and assessment of model fit is more straightforward than for GLMs.
Our study had several strengths and limitations. We reported on technical and biological variation in mRNA-Seq data from a relatively large sample set and provide valuable insight into biological variation that will be useful to many researchers. The data herein were from true human biological replicates rather than contrived cell line replicates. We expect that the conclusion of a quadratic mean-variance relationship for biological replication will extend to other experimental settings. However, the precise estimates of over-dispersion are expected to be study-specific; it is plausible they would be smaller in inbred mouse models, smaller yet in cell lines. The present study was not designed to assess bias in estimating gene expression, fold-changes or sensitivity and specificity of fold-change detection. While the modeling strategy utilized does not account for the degrees of freedom used in estimating the dispersion parameter, thousands of data points are used in this estimation, so the estimate should be very stable and precise . Unfortunately there was a software upgrade mid-way through the study. Technology in the field of next generation sequencing is progressing at a rapid pace; indeed, data from third generation sequencing is already available. This rapid pace makes upgrades, whether in software or hardware, difficult to avoid in large studies making it important to utilize randomization and blocking when determining assay processing order.
We found that the within-gene variance structure is over-dispersed relative to the Poisson distribution. As a result, hypothesis tests based on Poisson distributional assumptions will be too liberal (reject more often than they should) and power estimates based on these assumptions will be over-estimated for most genes, i.e., sample sizes will be too small for the estimated power. Local estimates [14–16] of the mean-variance relationship are likely best, but further research is needed to understand the optimal variance estimation strategy. The variance increasing as a function of the mean has implications on modeling strategy even if the data are modeled as continuous values rather than counts; e.g. weighted least squares. Modeling experimental factors within the GLM framework likely requires larger sample sizes. These conclusions will guide development of analytical strategies useful in study planning, accurate modeling of these data, and will aid in the identification of true biological signals that inform our understanding of biological systems. In addition, as with other high dimensional expression assays [19, 28–31], since we do not yet thoroughly understand experimental variation in next generation sequencing platforms, it is vitally important to incorporate the fundamental principles of statistical experimental design, namely biological replication, blocking and randomization, during the study design phase in order to avoid confounding of experimental and biological signals.
We are grateful for the contributions of Dr. Iana Haralambieva, Mr. Sumit Middha and Mr. Bruce Eckloff, and for the assistance of Ms. Rhonda Larsen with manuscript preparation. This work was supported by the National Institutes of Health (NIH) [N01-AI40065, AI48793, AI-89859] and by the National Center for Research Resources (NCRR) [1 UL1 RR024150], a component of the National Institutes of Health, and the NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.