Improved moderation for genewise variance estimation in RNASeq via the exploitation of external information
 Ellis Patrick^{1, 2},
 Michael Buckley^{2},
 David Ming Lin^{3} and
 Yee Hwa Yang^{1}Email author
https://doi.org/10.1186/1471216414S1S9
© Patrick et al.; licensee BioMed Central Ltd. 2013
Published: 21 January 2013
Abstract
Background
The cost of RNASeq has been decreasing over the last few years. Despite this, experiments with four or less biological replicates are still quite common. Estimating the variances of gene expression estimates becomes both a challenging and interesting problem in these situations of low replication. However, with the wealth of microarray and other publicly available gene expression data readily accessible on public repositories, these sources of information can be leveraged to make improvements in variance estimation.
Results
We have proposed a novel approach called Tshrink+ for inferring differential gene expression through improved modelling of the genewise variances. Existing methods share information between genes of similar average expression by shrinking, or moderating, the genewise variances to a fitted common variance. We have been able to achieve improved estimation of the common variance by using genewise sample variances from external experiments, as well as gene length.
Conclusions
Using biological data we show that utilising additional external information can improve the modelling of the common variance and hence the calling of differentially expressed genes. These sources of additional information include gene length and genewise sample variances from other RNASeq and microarray datasets, of both related and seemingly unrelated tissue types. The results of this are promising, with our differential expression test, Tshrink+, performing favourably when compared to existing methods such as DESeq and edgeR when considering both gene ranking and sensitivity. These improved variance models could easily be implemented in both DESeq and edgeR and highlight the need for a database that offers a profile of gene variances over a range of tissue types and organisms.
Background
In the postgenomic era, the development of technologies for sequencing the genome and transcriptome has become a key issue in the global analysis of biological systems. Even with the lowering cost of sequencing data, the majority of RNASeq experiments are still suffering from low replication numbers. The identification of differential expressed (DE) genes and transcripts is still a key question of interest in many biological studies. To date, there are many methods that provide a test of whether a gene is DE or not [1], including cufflinks [2], DESeq [3] and edgeR [4]. A feature in all of these methods is moderation of genewise variance estimates to improve DE inference. Moderation is important in small samples size comparisons, increasing both the power and accuracy of a DE test [5]. The key differences between these methods are the extent of the moderation and their common variance estimatethe variance estimate that the procedure is moderating towards.
DESeq and edgeR account for the heteroscedasticity observed in the read counts of genes by modelling the relationship between expected value of the count and its variability. We propose using additional information, such as gene length and variance estimates from external datasets, as explanatory variables to further model the heterogeneity seen in the observed gene variances. Combining these improved models of gene variance with a moderation method [6] creates a robust tool for estimating gene variances and hence calling differential gene expression. When evaluated on publicly available data this tool offers both improved gene ranking and power of detection when compared to DESeq and edgeR. This method is implemented in the R package sydSeq available on http://www.maths.usyd.edu.au/u/jeany/software.htm.
RNASeq
The development of high throughput sequencing technologies has made it possible to sequence the transcriptome at a much higher resolution and coverage than was previously available. Sequencing of cDNA samples (RNASeq) has a dynamic range larger than that of microarrays [7]. This, combined with its high level of reproducibility [8] and falling cost, makes high throughput sequencing technologies an increasingly attractive alternative to microarrays for transcriptome analysis. In the following we will describe how variances are estimated for RNASeq data with small samples sizes.
A typical RNAseq data analysis work flow consists of many steps. These steps generally consist of mapping, summarisation, normalisation, differential expression analysis and systems biology [9]. The summarisation step counts how many reads were mapped to each gene or transcript. We will consider the case of mapping to genes rather than transcripts, so for us summarisation results in a matrix of counts, where the rows and columns correspond to genes and samples respectively.
Let y_{ ij } be the observed read count for the i^{ th } gene in the j^{ th } sample where sample j belongs to treatment t(j) = 1, 2. Let ${\sigma}_{i}^{2}$ and μ_{ i } be the variance and mean read count for gene i. For ease of presentation we will assume that all effects that are generally normalised for or modelled, such as library sizes and GC content, remain constant across samples. The technical variability for a gene count in RNASeq can be modelled quite reliably as Poisson [10, 11]. This is attractive in situations of low replication as one parameter can be estimated to describe both the mean and variance of a gene. Modelling the data as Poisson will give a very reliable estimate of which genes have changed in expression between any two samples. However many experiments are not simply focused on the detection of gene expression differences between any two samples focusing instead on the differences between any two types of cells for example. This distinction is important as it requires us to not only model the technical variability of the experiment but to also model the biological variability of a particular cell type (or experimental condition).
where f(μ) explains the biological variability can be fitted by some form of local regression [3]. This formulation of σ^{2} highlights that σ^{2} should always be greater than or equal to μ.
In current RNASeq experiments it is still quite common to see experiments with very little biological replication. Estimating variances from few observations is unstable [12]. To improve the stability and accuracy of these variance estimates there have been many methods proposed to shrink the variances to some common value for microarrays [12] and RNASeq [1]. We will refer to this as moderation. By stabilising the variances and sharing information moderation also increase the power of a statistic as this increases the degrees of freedom of a variance estimate [5].
Heterogeneous gene variances
It is well accepted that some genes have a higher variance than other genes [13]. That is, some genes vary in expression more from cell to cell, person to person, or treatment to treatment than other genes. In RNASeq datasets, genes with larger average expression have on average larger observed variances. Instead of shrinking the estimate of a genes variance towards some common value (as is often done in microarrays) to improve stability [12], edgeR and DESeq shrink the estimate towards some fitted curve describing the relationship between mean and variance. We refer to this fitted curve as the common variance. In doing this they are making the strong assumption (although not an unreasonable one) that genes with a similar average count should have a similar variance.
We incorporate external data from RNASeq and microarrays on mouse striatum and RNASeq data from different tissues to better estimate variances and hence identify DE genes between C57BL/6J and DBA/2J mouse striatum samples.
Methods
Tshrink+
When using variance estimates from other RNASeq experiments, these variances will also have a very strong meanvariance relationship. For use as an explanatory variable we normalise the external variance estimates in such a way that they have mean zero and variance one for all ranges of expression.
where ν_{ gene } = n  2. For simplicity, rather than using a different v for each gene we instead use one degrees of freedom estimate, v_{ shrink }, for all genes, taken as the mean of the ${\widehat{\mathit{v}}}_{k}$'s.
Data
Bottomly dataset
External datasets
Additional information sources.
Evaluation study
In this study, we evaluate our proposed method of improving variance estimation for differential gene expression analysis, Tshrink+. This evaluation consists of two components, assessing the capacity of a common variance estimate to explain the observed gene sample variances and evaluating how improving this common variance estimate can aid in the detection of differentially expressed genes. The performance of Tshrink+ will also be compared with two commonly used packages, edgeR and DESeq. This evaluation study is built upon one main dataset, the Bottomly data, and three datasets which are used for additional information.
In order to assess the capacity of a common variance estimate to explain the observed gene sample variances we will use the shrinkage coefficient λ, described in the methods section, as a statistic. λ is the ratio of the expected and average squared error of the common variance estimate. We aim to assess the effectiveness of using information in addition to the average expression of a gene to estimate a common variance function. Variance estimates from the external datasets described in Table 1 and also gene length are used to aid in the estimation of the common variance functions of one hundred random comparisons of n samples of B6 mouse striatum tissue with n samples of D2 mouse striatum tissue. This is performed for one additional dataset at a time. The average λ value is calculated for each n comparison and information source using only the genes that are present in all data sources.
We then further demonstrate that improving the information content of an additional information source improves the estimation of the common variance. This will be achieved by using variance estimates from the D2 mice to aid in the estimation of a common variance function of the B6 mice. The variance estimates from a random n D2 mouse samples are used to estimate the common variance function of a random four B6 mouse, this is repeated one hundred times and average λ values are calculated.
 1.
a ttest (T),
 2.
a moderated ttest (Tshrink) and
 3.
a moderated ttest using additional information (Tshrink+).
 4.
DESeq using only the common variance (DESeqCommon),
 5.
DESeq using the maximum of the common variance and sample variance (DESeqMax) and
 6.
edgeR using a trended common variance and empirical Bayes to shrink the gene sample variances towards the common variance (edgeR).
To assess the effectiveness of the six DE methods, a standard ttest was performed comparing ten B6 and ten D2 mouse striatum samples. In all of the following, the results of this ttest are taken to be the "truth". From this ttest a gene is conservatively called "truly" DE if it has a Bonferroni adjusted pvalue of less than 0.05. A gene is called "truly" not DE if it has an unadjusted pvalue greater than 0.05. We will then evaluate the ability of the DE methods to recover the information in the comparison of ten B6 samples with ten D2 samples by smaller comparisons of n B6 samples and n D2 samples, for n ranging from two to five. This is done by comparing a random set of n B6 and n D2 mouse striatum samples one hundred times and then

generating Receiver Operator Curves (ROC, a curve describing each methods True Positive Rate as a function of its False Positive Rate for a complete range of pvalue cutoffs),

calculating partial areas under the ROC for FPR less than 0.01 and

calculating True Positives (TP) and False Positives (FP) using a Bonferroni adjusted pvalue cutoff of 0.05.
Results and discussion
The estimation of the common variance
We begin by examining the effect of using information from different additional sources to help explain the variances observed in the Bottomly Data. That is, assessing the impact that each of the additional datasets in Table 1 can have on estimating the pooled variances of n B6 vs n D2 mouse striatum comparisons. Thus we only consider one additional dataset at a time and do not consider how they could interact. When used to help fit the common variance surface, using information from any of the additional data sources improve the estimate of the common variance as seen in Figure 1. This is observed through all of the average λ's being higher when using additional information when compared to using only the mean. λ is proportional to the reciprocal of the average squared error of the variance estimates, thus a larger λ corresponds to a better estimate of the common variance. A λ value of one implies that the common variance is representative of the population variance. A λ of zero suggests that the common variance estimate is failing to describe the observed gene sample variances.
The more relevant the information contained in the additional data source, the greater the improvement seen in the common variance estimate. As is perhaps expected either of the two striatum tissue datasets, RNASeq and microarray, when used to estimate the common variance produce the largest λ, with microarray striatum and RNASeq striatum only slightly out performing hippocampus. Spleen and lung both also increase λ highlighting that information can still be gained from unrelated tissue types, however, liver and heart barely increase λ at all. This can mostly be explained by the use of liver and heart resulting in the variance of one gene, transthyretin, being severely underestimated. If this gene is excluded the λ generated by using liver and heart are much similar to that of spleen and lung. Including information on gene length also has the potential to improve variance information however this appears to relatively decrease as the sample size n increases.
Using D2 variance estimates to estimate common variance of four B6 samples.
n D2 samples  0  2  3  4  5  6  7  8  9  10 

λ  0.35  0.45  0.50  0.55  0.58  0.65  0.68  0.72  0.75  0.77 
The impact of moderation on inferring differential expression
The aim of the remainder of the evaluation is to assess how the use of moderation affects inference on differential gene expression. This is done by assessing the impact of moderation on both gene ranking and sensitivity. Moderation is used to both increase the sensitivity of a test, by increasing the degrees of freedom of the variance estimate, and to improve the ranking of a test, by improving the accuracy of the variance estimate.
We will start by simply comparing the ttest (T), moderated ttest (Tshrink) and a moderated ttest using additional information (Tshrink+). For the additional data source used by Tshrink+, the four striatum RNASeq samples [25] in Table 1 were chosen as they gave the second highest value but were not generated from the same lab as the analysis dataset (as the microarray data were).
By first considering only four vs four comparisons, the ability of moderation to improve gene ranking is illustrated in Figure 2a where a partial average ROC curve from one hundred four vs four comparisons of B6 and D2 mouse striatum is plotted for each method. This curve shows each methods TPR for a range of FPR, where a method is deemed to have ranked genes better than another at a given FPR if its TPR is higher. Here we see that Tshrink (dark blue) performs better than T (light blue) for all FPR less than 0.01. Tshrink+ (red) offers a similar improvement again on top of that of Tshrink nearly doubling the improvement of Tshrink to T.
Moderation can improve the sensitivity of a test for differential expression as seen in Figure 2b. Figure 2b plots the average number of True Positive genes called at varying pvalue cutoffs for one hundred four vs four tests. At a Bonferroni adjusted pvalue cutoff of 0.05 (the grey dashed line) T calls 8 TP, Tshrink 47 TP and Tshrink+ 108 TP. These improvements are seen at very little cost the the number of False Positives called. The number of TP and FP called at a Bonferroni adjusted pvalue cutoff of 0.05 for n ranging from two to five are plotted in Figures 3b and 3c respectively. Here we see the number of TP called for Tshrink+ increases as n increases and the number FP decreasing as n increases. While the number of TP called also increase for T, it decreases for Tshrink over this range of n. The number of TP called by Tshrink will decrease until Tshrink converges to T when it will continue to increase. Tshrink may be overzealous in its calling of TP calling a relatively large amount of FP as well for small n.
Comparison with edgeR and DESeq
Tshrink+ performs favourably when compared to both DESeq and edgeR when considering gene ranking. When assessing gene ranking using Figure 3a, Tshrink+ performs marginally better than DESeqMax (green) which is better than edgeR (black) and DESeqCommon (pink). The relative performance of Tshrink+ over DESeqMax increases as n increases. For n equal to five edgeR performs worse than T. It could be argued that this is because T is becoming closer to the ttest that was used as "truth", however this behaviour is also observed when using the results from microarray array data [17] as "truth" as seen in Supplementary Figure 4 in Additional File 1. This is performance could also be explained by edgeR over moderating to a common variance that is become decreasingly relevant as n increases.
Tshrink+ compares comparably to edgeR and DESeq when assessing sensitivity. T selects a similar number of TP at the cutoff when compared edgeR but selects less FP as seen in Figures 3b and 3c. While DESeqMax does not select as many TP for the given cutoff as DESeqCommon it selects dramatically less FP.
Conclusions
Using additional information improves the estimation of the common variance and the detection of differentially expressed genes. Our differential expression test, Tshrink+ which incorporates information from additional datasets, showed marked improvement in both gene ranking and sensitivity over a moderated ttest, Tshrink, and a standard ttest, T. Tshrink+ also performed favourably against edgeR and DESeq when comparing gene ranking and comparably when assessing sensitivity.
Whilst Tshrink+ can offer improvements to a differential expression analysis it also provides insight into avenues for further research. The moderation used in Tshrink+ [6] can be drastically affected by genes with unusual variances. A more sophisticated methodology which manages the influence of these genes on moderation could offer potentially large improvements. While using local regression to t the common variance when incorporating one additional dataset is easy to implement, it does not scale well to the use of multiple information sources. A parametric based approach may make the integration of multiple data sources feasible.
This methodology should be considered as a complement, not a replacement, for metaanalysis when similar studies to the RNASeq study of interest exist. Tshrink+ leverages only the variance estimates from external datasets to improve the variance estimation in the study of interest. If information exists on the changes of expression between conditions as well, a researcher may be remiss to not utilise this information through the use of existing metaanalysis methodologies.
Using external data to improve the estimation of the common variance for a particular problem highlights the significance of access to public data repositories like the gene expression omnibus (GEO) [26]. These repositories have the ability to actualise improved inference lending both confidence and power to results. Projects like ReCount [18] aid in this process by providing access to preprocessed data that avoids the duplication of the computationally intensive procedure of both downloading and processing large datasets.
Declarations
The publication costs for this article were funded by the corresponding author's institution.
This article has been published as part of BMC Genomics Volume 14 Supplement 1, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S1.
Declarations
Acknowledgements
We would like to thank Terry Speed, John Robinson, Uri Keich, Samuel Müller and John Ormerod for their insightful comments. This work was supported in part by ARC through grants FT0991918 (YY), Australian Postgraduate Award (EP) and the Alzheimer's Association (DL).
Authors’ Affiliations
References
 Pachter L: Models for transcript quantification from RNASeq. Arxiv preprint arXiv:1104.3889. 2011Google Scholar
 Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28 (5): 511515. 10.1038/nbt.1621.PubMed CentralView ArticlePubMedGoogle Scholar
 Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11 (10): R10610.1186/gb20101110r106.PubMed CentralView ArticlePubMedGoogle Scholar
 Robinson MD, McCarthy DJ, Smyth GK: edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139140. 10.1093/bioinformatics/btp616.PubMed CentralView ArticlePubMedGoogle Scholar
 Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3: Article3PubMedGoogle Scholar
 OpgenRhein R, Strimmer K: Accurate ranking of differentially expressed genes by a distributionfree shrinkage approach. Stat Appl Genet Mol Biol. 2007, 6: Article9PubMedGoogle Scholar
 Wang Z, Gerstein M, Snyder M: RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 5763. 10.1038/nrg2484.PubMed CentralView ArticlePubMedGoogle Scholar
 Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008, 5 (7): 621628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
 Oshlack A, Robinson MD, Young MD: From RNAseq reads to differential expression results. Genome Biol. 2010, 11 (12): 22010.1186/gb20101112220.PubMed CentralView ArticlePubMedGoogle Scholar
 Bullard J, Purdom E, Hansen K, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinformatics. 2010, 11: 94+10.1186/147121051194.PubMed CentralView ArticlePubMedGoogle Scholar
 Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNAseq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18 (9): 15091517. 10.1101/gr.079558.108.PubMed CentralView ArticlePubMedGoogle Scholar
 Cui X, Churchill GA: Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 2003, 4 (4): 21010.1186/gb200344210.PubMed CentralView ArticlePubMedGoogle Scholar
 Cheung VG, Conlin LK, Weber TM, Arcaro M, Jen KY, Morley M, Spielman RS: Natural variation in human gene expression assessed in lymphoblastoid cells. Nat Genet. 2003, 33 (3): 422425. 10.1038/ng1094.View ArticlePubMedGoogle Scholar
 Loader C: locfit: Local Regression, Likelihood and Density Estimation. 2010, [R package version 1.56]Google Scholar
 Satterthwaite FE: An approximate distribution of estimates of variance components. Biometrics. 1946, 2 (6): 110114. 10.2307/3002019.View ArticlePubMedGoogle Scholar
 Welch BL: The generalisation of student's problems when several different population variances are involved. Biometrika. 1947, 34 (12): 2835. 10.1093/biomet/34.12.28.View ArticlePubMedGoogle Scholar
 Bottomly D, Walter NAR, Hunter JE, Darakjian P, Kawane S, Buck KJ, Searles RP, Mooney M, McWeeney SK, Hitzemann R: Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNASeq and microarrays. PLoS One. 2011, 6 (3): e1782010.1371/journal.pone.0017820.PubMed CentralView ArticlePubMedGoogle Scholar
 Frazee AC, Langmead B, Leek JT: ReCount: a multiexperiment resource of analysisready RNAseq gene count datasets. BMC Bioinformatics. 2011, 12: 44910.1186/1471210512449.PubMed CentralView ArticlePubMedGoogle Scholar
 Cleveland W, Grosse E, Shyu W: Local regression models. Statistical models in S. 1992, 309376.Google Scholar
 GagnonBartsch JA, Speed TP: Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012, 13 (3): 539552. 10.1093/biostatistics/kxr034.PubMed CentralView ArticlePubMedGoogle Scholar
 Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD: The sva package for removing batch effects and other unwanted variation in highthroughput experiments. Bioinformatics. 2012, 28 (6): 882883. 10.1093/bioinformatics/bts034.PubMed CentralView ArticlePubMedGoogle Scholar
 Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memoryeffcient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R2510.1186/gb2009103r25.PubMed CentralView ArticlePubMedGoogle Scholar
 Gautier L, Cope L, Bolstad BM, Irizarry RA: affyanalysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004, 20 (3): 307315. 10.1093/bioinformatics/btg405.View ArticlePubMedGoogle Scholar
 Wu J, with contributions from James MacDonald Jeff Gentry RI: gcrma: Background Adjustment Using Sequence Information. [R package version 2.26.0]Google Scholar
 Keane TM, Goodstadt L, Danecek P, White MA, Wong K, Yalcin B, Heger A, Agam A, Slater G, Goodson M, Furlotte NA, Eskin E, Nellr C, Whitley H, Cleak J, Janowitz D, HernandezPliego P, Edwards A, Belgard TG, Oliver PL, McIntyre RE, Bhomra A, Nicod J, Gan X, Yuan W, van der Weyden L, Steward CA, Bala S, Stalker J, Mott R, Durbin R, Jackson IJ, Czechanski A, GuerraAssun JA, Donahue LR, Reinholdt LG, Payseur BA, Ponting CP, Birney E, Flint J, Adams DJ: Mouse genomic variation and its effect on phenotypes and gene regulation. Nature. 2011, 477 (7364): 289294. 10.1038/nature10413.PubMed CentralView ArticlePubMedGoogle Scholar
 Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Holko M, Ayanbule O, Yefanov A, Soboleva A: NCBI GEO: archive for functional genomics data sets10 years on. Nucleic Acids Res. 2011, 39 (Database): D1005D1010. 10.1093/nar/gkq1184.PubMed CentralView ArticlePubMedGoogle Scholar
 Polymenidou M, LagierTourenne C, Hutt KR, Huelga SC, Moran J, Liang TY, Ling SC, Sun E, Wancewicz E, Mazur C, Kordasiewicz H, Sedaghat Y, Donohue JP, Shiue L, Bennett CF, Yeo GW, Cleveland DW: Long premRNA depletion and RNA missplicing contribute to neuronal vulnerability from loss of TDP43. Nat Neurosci. 2011, 14 (4): 459468. 10.1038/nn.2779.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.