 Research article
 Open Access
Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing
BMC Genomics volume 13, Article number: 484 (2012)
Abstract
Background
RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression, with both high-throughput and high-resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available; these can be utilised to increase the number of samples or replicates profiled at the cost of decreased sequencing depth generated per sample. These strategies impact on the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios, including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods.
Results
Differential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices.
Conclusions
This work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates than through library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced to as low as 15% without substantial impacts on false positive or true positive rates.
Background
RNA sequencing (RNA-Seq) allows an entire transcriptome to be surveyed at single-base resolution whilst concurrently profiling gene expression levels on a genome scale[1]. RNA-Seq is an attractive approach as it profiles the transcriptome directly through sequencing and therefore does not require prior knowledge of the transcriptome under consideration. An example of the use of RNA-Seq as a high-resolution exploratory tool is the discovery of thousands of additional novel coding and non-coding genes, transcripts and isoforms of known genes despite the prior extensive annotation of the mouse[2–4] and human genomes[5, 6].
Arguably, the most popular use of RNA-Seq is the profiling of gene expression or transcript abundance between samples, i.e. differential expression (DE). The efficiency, resolution and cost advantages of using RNA-Seq as a tool for profiling DE have prompted many biologists to abandon microarrays in favour of RNA-Seq[7, 8].
Despite the advantages of using RNA-Seq for DE analysis, there are several sources of sequencing bias and systematic noise that need to be considered when using this approach. Clearly, RNA-Seq analysis is vulnerable to the general biases and errors inherent in the next-generation sequencing (NGS) technology upon which it is based. These errors and biases include: sequencing errors (wrong base calls); biases in sequence quality, nucleotide composition and error rates relative to the base position in the read[9, 10]; variability in sequencing depth across the transcriptome due to preferential sites of fragmentation; variable primer and transcript nucleotide composition effects[11]; and finally, differences in the coverage and composition of raw sequence data generated from technical and biological replicate samples[1, 12].
Recently, there have been several investigations[13–15] into the biases that affect the accuracy with which RNA-Seq represents the absolute abundance of a given transcript as measured by high-precision approaches such as TaqMan RT-PCR[16]. It has been shown that these abundance measures are prone to biases correlated with the nucleotide composition[14, 17] and length of the transcript[1, 18]. Several within- and between-sample correction and normalisation procedures have recently been developed to address these biases, either as nucleotide composition effects[17] or as various combinations of nucleotide, length or library preparation biases[14, 15]. These approaches all yield improvements in the correspondence of RNA-Seq read counts with expression estimates gained by other experimental approaches.
Despite the known biases, RNA-Seq continues to be widely and successfully used to profile relative transcript abundances across samples to identify differentially expressed transcripts[19]. The profile of a given transcript across a biological population should be less prone to nucleotide composition and length biases, as these variables remain constant. Nevertheless, to accurately detect DE across samples it is necessary to understand the sources of variation across technical and biological replication and, where possible, to respond to these with an appropriate experimental design and statistically robust analysis[17, 20]. To date, there has been little discussion in the literature of efficient experimental designs for the detection of DE, and a lack of consensus about a standard and comprehensive approach to counter the many sources of noise and bias present in RNA-Seq has meant that some of the biological community remain sceptical about its reliability and unsure of how to design cost-efficient RNA-Seq experiments (see[19]).
Good experimental design and appropriate analysis are integral to maximising the power of any NGS study. With regard to RNA-Seq, important experimental design decisions include the choice of sequencing depth and the number of technical and/or biological replicates to use. For researchers with a fixed budget, a critical design question is often whether to increase the sequencing depth at the cost of reduced sample numbers, or to increase the sample size with limited sequencing depth for each sample[20].
Sequencing depth
Sequencing depth usually refers to the expected mean coverage at all loci over the target sequence(s); in the case of RNA-Seq experiments this assumes all transcripts have similar levels of expression. Without the benefit of extensive previous RNA-Seq studies, it is difficult in most cases to estimate, prior to data generation, the optimal sequencing depth or amount of sequencing data required to adequately power the detection of DE in the transcriptome of interest. Pragmatically, RNA-Seq sequencing depth is typically chosen based on an estimate of total transcriptome length (bases) and the expected dynamic range of transcript abundances. Given the dynamic nature of the transcriptome, the suitability of these estimates could vary substantially across organisms, tissues, time points and biological contexts.
Wang et al.[21] found a significant increase in the correlation between gene transcripts observed and the number of sequence reads generated when increasing sequencing depth from 1.6 to 10 million reads, after which the gains plateau – 10 million reads detected about 80% of the annotated chicken transcripts. Despite the expectation of continuous sequencing depth increases in the near future, Łabaj et al.[22] argue that most of the additional reads will align to the subset of already extensively sampled transcripts. As a result, transcripts with low to moderate expression levels will remain difficult to quantify with good precision using current RNA-Seq protocols, even at higher read depths. Greater sequencing depth will also increase the sensitivity to detect smaller changes in relative expression; however, this does not guarantee that these changes have functional impact in the biological system under study, as opposed to being tolerated fluctuations in transcript abundance[20]. Ideally, an efficient experimental design will be informed by an understanding of when increasing sequencing depth begins to provide rapidly diminishing returns with regard to transcript detection and DE testing.
Replication
Replication is vital for robust statistical inference of DE. In the context of RNA sequencing, multiple nested levels of technical replication exist, depending upon whether it is the sequence data generation, library preparation or RNA extraction process that is being replicated from the same biological sample. Several published studies have incorporated technical replicates into their RNA-Seq experimental designs[23–25]. The degree of technical variation present in these datasets appears to vary, and the main source of technical variation appears to be library preparation[15]. Biological replication measures variation within the target population and can simultaneously counteract random technical variation as part of independent sample preparation[20].
It has been shown that the power to detect DE improves when the number of biological replicates n is increased from n = 2 to n = 5[26]; however, to date few studies have incorporated extensive biological replication, and thorough testing of the effects of replication on power is needed. More recently, with the increasing utility and availability of multiplex experimental designs, the incorporation of biological replicates with decreased sequencing depth is becoming a much more attractive and cost-effective strategy. The relative merits of sacrificing sequencing depth for increased replication have not been rigorously explored.
Efficient experimental design
Multiplexing is an increasingly popular approach that allows the sequencing of multiple samples in a single sequencing lane or reaction and consequently reduces sequencing costs per sample[27, 28]. Multiplexing uses indexing tags, “barcodes”: short (≤ 20 bp) stretches of sequence that are ligated to the start of sample sequence fragments during the library preparation step. Barcodes are distinct between sample libraries and allow pooling for sequencing, followed by allocation of reads back to individual samples after sequencing by analysis of the sequenced barcode. Multiplex barcode designs are routinely available with up to 12 samples in the same lane; recently, up to 96 yeast DNA samples were profiled in a single lane[28]. Novel methods are continuing to emerge for low-cost strategies to multiplex RNA-Seq samples[29]. With the dramatic increases in sequencing yields being achieved with current chemistries and new platforms, multiplexing is becoming the method of choice to increase sample throughput. These designs have direct impacts on the sequencing depth generated, which need to be considered in the power of the experimental design. Also, when multiplex strategies are used, biologists need to be mindful of potential systematic variations between sequencing lanes. These variations can be addressed through randomisation or blocking designs to distribute samples across lanes; see[30] for a discussion of barcoding bias in multiplex sequencing, and[31] for an alternative to barcoding. In a comparison between microarray and NGS technologies using synthetic pools of small RNA, Willenbrock et al.[13] found that multiplexing resulted in decreased sensitivity due to a reduction of sequencing depth and a loss of reproducibility; however, the authors did not investigate the power for detection of DE in their study.
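The barcode allocation step described above can be sketched as follows. This is a minimal illustration, not code from this study: the barcode sequences, read strings and the one-mismatch tolerance are all assumptions chosen for demonstration.

```python
# Hypothetical demultiplexing sketch: assign pooled reads back to samples by
# matching a leading barcode, tolerating at most one sequencing mismatch.

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def demultiplex(reads, barcodes, max_mismatch=1):
    """Assign each read to the unique sample whose barcode matches the read's
    prefix within max_mismatch; ambiguous or unmatched reads are set aside."""
    bc_len = len(next(iter(barcodes.values())))
    bins = {sample: [] for sample in barcodes}
    bins["undetermined"] = []
    for read in reads:
        prefix = read[:bc_len]
        hits = [s for s, bc in barcodes.items()
                if hamming(prefix, bc) <= max_mismatch]
        if len(hits) == 1:
            bins[hits[0]].append(read[bc_len:])  # strip the barcode
        else:
            bins["undetermined"].append(read)
    return bins

barcodes = {"sampleA": "ACGT", "sampleB": "TGCA"}  # hypothetical 4 bp barcodes
reads = ["ACGTGGGG", "ACTTGGGG", "TGCACCCC", "NNNNCCCC"]
bins = demultiplex(reads, barcodes)
# "ACTT" is one mismatch from "ACGT", so the second read still goes to sampleA;
# "NNNN" matches neither barcode and is left undetermined.
```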
Approach
Improving the detection of DE requires not only an appropriate experimental design but also a suitably powered analysis approach. Several algorithms have recently been developed specifically to handle the expected technical and biological variation arising from RNA-Seq experiments. A non-exhaustive list of these algorithms is: edgeR[32], DESeq[25], NBPSeq[33], BBSeq[34], FDM[35], RSEM[36], NOISeq[37], Myrna[38] and Cuffdiff[2]. A thorough comparison of these packages’ performance with datasets of different properties falls beyond the scope of this study; however, before considering issues relating to power and experimental design, it is important to investigate whether packages for DE analysis give the correct type I error rate under the null hypothesis of no DE. For this evaluation we considered three popular packages for DE analysis of RNA-sequencing data. These packages are based on a negative binomial distribution model of read counts[39] and include edgeR[32], DESeq[25] and NBPSeq[33].
To quantify the effects of different sequencing depths and replication choices we compared a range of realistic experimental designs for their ability to robustly detect DE. Using simulated data with known DE transcripts allowed us to estimate the false positive rate (FPR) and true positive rate (TPR) of DE calls. The changes of these rates were used to compare the detection power yielded by each choice of number of biological replicates and sequencing depth.
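The FPR and TPR estimation described above reduces to simple counting once the truly DE transcripts are known, as they are in a simulation study. A minimal sketch (the p-values and truth labels below are hypothetical inputs, not data from this study):

```python
# Sketch: FPR and TPR at significance level alpha, given per-transcript
# p-values and a known ground-truth DE indicator from the simulation.

def fpr_tpr(pvals, is_de, alpha=0.01):
    """FPR = fraction of truly non-DE transcripts called DE at level alpha;
    TPR = fraction of truly DE transcripts called DE at level alpha."""
    called = [p <= alpha for p in pvals]
    fp = sum(c and not d for c, d in zip(called, is_de))
    tp = sum(c and d for c, d in zip(called, is_de))
    n_null = sum(not d for d in is_de)
    n_de = sum(is_de)
    return fp / n_null, tp / n_de

pvals = [0.001, 0.2, 0.005, 0.5, 0.03]      # hypothetical p-values
is_de = [True, False, True, False, False]   # hypothetical ground truth
fpr, tpr = fpr_tpr(pvals, is_de)  # fpr = 0.0, tpr = 1.0
```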
In the Methods section, we outline the definitions used for the FPR and TPR and explain the method used for the construction of the synthetic data, which induces differential expression, simulates the variation that biological replicates introduce and simulates loss of sequencing depth.
In our study, we test a wide range of real-world experimental design scenarios for performance under the null hypothesis and in the presence of DE. In these scenarios both the number of biological replicates n and the sequencing depth are varied. This provides a comprehensive quantitative comparison of different experimental design strategies and is particularly informative for those accessing modern multiplex approaches.
Results
Comparisons of statistical methods: edgeR, DESeq, and NBPSeq using simulated data under the null
To test the performance of each package under the null hypothesis, we simulated sets of n “control” and n “treatment” lanes of counts in accordance with the procedure described in the Methods section, for a range of values of n and with no DE between treatments. For each value of n and for each package the simulation and testing were repeated 100 times. Figure 1 shows the percentage of transcripts reported as differentially expressed at the 1% significance level by each of the three packages for a range of values of n. The height of each bar is the median value obtained from 100 repetitions of the synthetic data generation with its associated 90% confidence intervals. Under the null hypothesis, the percentage reported is the false positive rate (FPR) defined by Eq. 4, and should match the significance level of α = 1% if the package is performing correctly. Also shown are FPRs for high-count transcripts (> 100 counts averaged across biological replicates) and low-count transcripts (≤ 100 counts averaged across biological replicates). Figure 2 shows an example of the p-value distribution obtained for one experiment at n = 3 biological replicates. Ideally, p-values should have a uniform distribution on the interval [0, 1] if the package is performing correctly.
Immediately noticeable in the p-value histogram is a sharp spike in the right-hand bin for low-count transcripts, which is observed to be present in general for all values of n and all packages. This is a known artifact of calculating p-values for discrete random variables using the method described in[40] and summarised in our Methods subsection ‘Under the null hypothesis’: when the count sums in both conditions are equal, the computed p-value is exactly 1. This situation is most likely to occur for transcripts with extremely low counts, in which case it is difficult to draw meaningful conclusions regarding DE via any method. The behaviour at the left-hand end of the histogram, which drives the FPRs plotted in Figure 1, varies considerably between packages and numbers of biological replicates. It is affected mainly by the method used for estimating a dispersion parameter ϕ_i for each transcript i (see Methods section).
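The discrete p-value artifact can be illustrated with a simple exact binomial test; this is a simplifying stand-in of our own for the conditional NB test the packages use, not the paper's procedure. Conditioning on the total count of a transcript across two conditions, the count in one condition is Binomial(total, 0.5) under the null, and an exactly equal split yields a two-sided p-value of 1.

```python
# Sketch: two-sided exact binomial p-value, defined (as in the exact tests
# discussed above) as the total probability of all outcomes no more likely
# than the observed one. An equal split of counts gives p = 1 exactly.
from math import comb

def exact_binom_p(k, n, p=0.5):
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in pmf if q <= pmf[k] + 1e-12)

p_equal = exact_binom_p(5, 10)   # 5 vs 5 counts -> p-value exactly 1.0
p_skewed = exact_binom_p(9, 10)  # 9 vs 1 counts -> small p-value (22/1024)
```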
The package edgeR performs well for large numbers of biological replicates (n = 12), for which squeezing of the dispersion estimate towards the common dispersion is minimal and a tagwise estimate is appropriate. For small numbers of biological replicates, because the dispersion cannot be estimated accurately on a per-transcript basis, information is borrowed from the complete set of transcripts to squeeze the estimate towards a common dispersion estimate. For the high-count transcripts in particular, the squeezing causes the dispersion of the most highly dispersed transcripts to be underestimated, causing too many transcripts to be deemed differentially expressed and leading to an inflated FPR.
In an effort to be conservative, DESeq chooses as its estimate of dispersion the maximum of a per-transcript estimate and the functional form Eq. 2, which is fitted to the per-transcript estimates for all transcripts. Our results indicate that the method performs well for the high-count transcripts when the number of biological replicates is small (n = 2 or 3), but is otherwise over-conservative. This is generally preferable to an inflated FPR, as one has more evidence that what is called DE is truly DE.
The package NBPSeq imposes the functional relationship Eq. 3, which appears to be too restrictive for a number of relatively highly dispersed transcripts. For those transcripts the dispersion parameter is underestimated, leading to an overestimate of significance and hence an inflated FPR irrespective of the number of biological replicates.
Based on these results we selected DESeq (v1.6.1) and edgeR (v2.4.0) for use in subsequent experimental design testing. Throughout these tests, results obtained using DESeq and edgeR are largely consistent with each other. However, our comparison revealed a slightly inflated FPR from edgeR, while DESeq behaves more conservatively throughout. Therefore, in the following sections we focus on the results obtained using DESeq, while the analogous results obtained with edgeR are presented in Additional file 1: Figure S2.
Comparison of statistical methods: DESeq and edgeR using simulated data with 15% DE transcripts
To test the performance of the packages in the presence of an alternative hypothesis, we simulated sets of n “control” and n “treatment” lanes of counts with 15% of the transcripts either up- or down-regulated according to the procedure described in the Methods section. All results presented from this point on are derived from DESeq.
Detection of DE as a function of number of biological replicates n
With an increase in replication we saw a steady increase in the percentage of DE calls made by DESeq (call rate), increasing from 0.44% to 5.12% as n increased from 2 to 12 (at 100% depth). As n increased, the FPR, defined by Eq. 5 at a significance level of α = 1%, remained below 0.1% for all values of n, and the TPR, defined by Eq. 6 with α = 1%, increased substantially from 3.26% to 41.57% (see Table 1).
Detection of DE as a function of sequencing depth
Figure 3 presents the combined results of decreasing sequencing depth for all values of n. It can be seen that as sequencing depth decreases, the TPR generated by DESeq decreases monotonically across all n while the FPR remains below 0.1% (the corresponding results obtained using edgeR are shown in Additional file 1: Figure S2).
Table 2 shows the FPR for all numbers of biological replicates n and a subset of sequencing depths: 25%, 50%, 75% and 100%; the FPR remains below 0.1% at all sequencing depths. Table 3 shows the TPR reported by DESeq for the same subset of sequencing depths; here the TPR increases strongly as sequencing depth increases for any number of biological replicates n.
Detection of DE across multiplex experimental design strategies
We simulated various scenarios of multiplexing n control samples vs. n treatment samples into two sequencing lanes – each control and treatment sample at a sequencing depth of (1/n) × 100%. In Figures 3 and 4, a solid grey line connecting every value of n at its corresponding sequencing depth provides a summary of the performance of these multiplexing scenarios. We call this trend the “multiplex line”; it provides insight into the results obtained by fitting increasing numbers of biological replicates into a fixed number of sequencing lanes, in this case two sequencing lanes.
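The design points making up the multiplex line follow directly from the pooling arithmetic above: with n samples sharing a lane, each sample receives a fraction 1/n of that lane's reads. A sketch restating these design points (the set of n values mirrors the scenarios tested here):

```python
# Multiplex-line design points: n replicates per condition pooled into a lane
# leaves each sample with (1/n) x 100% of full sequencing depth.
designs = {n: 100.0 / n for n in (2, 3, 4, 6, 8, 12, 32, 96)}
# e.g. n = 4 replicates per condition corresponds to 25% depth per sample
```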
The multiplex line in Figure 3 shows a clear increase in TPR as replication increases, despite the loss of detection power that decreasing sequencing depth induces. It can also be seen that the FPR remains below 0.1% for all multiplex scenarios tested (Figure 3B). Note that for completeness we also added multiplex scenarios for n = 32 and n = 96, whose results follow the trends well. The multiplex line strongly favours adding more biological replicates despite the inherent loss of sequencing depth, as shown by its dramatic positive slope for the TPR while maintaining a roughly constant, low FPR.
Fold-changes as indicators of DE
It is common practice among biologists to use fold-change, rather than p-values, as an indicator of DE. Figure 4 shows results analogous to those of Figure 3 when the criterion of fold-change ≥ 2 (instead of p-values) is used to detect DE: as replication n increases, both TPR and FPR decrease, because more biological replicates have the effect of averaging out differences between control and treatment lanes. Note that, as sequencing depth decreases, the FPR increases owing to the growing number of transcripts with very low numbers of counts (Figure 4B), in which case the Poisson shot noise of the sequencer can easily induce a spurious doubling or halving of counts. This effect is ameliorated by adding 1 count to all transcripts prior to DE analysis; doing so does not affect the calculation of p-values (data not shown).
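The fold-change criterion and the pseudocount remedy discussed above can be sketched as follows. The counts, the ≥ 2 threshold and the +1 pseudocount mirror the description in the text; the function itself is our illustration, not the paper's code.

```python
# Sketch: call DE where the ratio of mean counts between conditions reaches a
# fold-change threshold; a +1 pseudocount damps spurious ratios at low counts.

def fold_change_calls(control, treatment, threshold=2.0, pseudocount=1):
    """control/treatment: per-transcript lists of replicate counts.
    Returns True where the (pseudocounted) fold-change reaches threshold."""
    calls = []
    for ctrl, trt in zip(control, treatment):
        m_c = sum(ctrl) / len(ctrl) + pseudocount
        m_t = sum(trt) / len(trt) + pseudocount
        ratio = max(m_c, m_t) / min(m_c, m_t)
        calls.append(ratio >= threshold)
    return calls

control = [[100, 120], [2, 0], [10, 10]]    # hypothetical replicate counts
treatment = [[400, 440], [0, 1], [12, 9]]
calls = fold_change_calls(control, treatment)  # [True, False, False]
# Without the pseudocount, the second transcript (means 1 vs 0.5) would show a
# spurious 2-fold change driven purely by shot noise at very low counts.
```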
Discussion
Comparisons of DE algorithms: edgeR, DESeq and NBPSeq
Our comparison of these three DE detection algorithms under the null hypothesis revealed different performances (measured by their FPR) when different numbers of biological replicates n are used. DESeq consistently performed more conservatively across the different biological replication scenarios. DESeq’s performance was closest to the expected significance level when only high-count (counts > 100) transcripts were used, while for only low-count (counts ≤ 100) transcripts it showed over-conservative behaviour. edgeR overestimates DE detection for small values of n, while its performance improves as n increases. edgeR’s level of detection is constant over n when only low-count transcripts are used, while overestimation increases when only high-count transcripts are used. NBPSeq overestimated detection across n for all three scenarios (all transcripts, high counts and low counts).
This comparison led us to use both DESeq and edgeR throughout our replication and sequencing depth simulations. We ultimately chose DESeq’s results^{a} as this package behaved slightly more conservatively and appeared less sensitive to changes in replication (see Figure 1). In a study by Tarazona et al.[37], it is argued that negative-binomial based DE analysis packages like DESeq and edgeR are highly sensitive to sequencing depth increases and are therefore unable to control the FPR as sequencing depth increases. Tarazona et al. propose a non-parametric algorithm (NOISeq) to calculate DE based on a noise distribution created with fold-changes and the absolute differences between the transcript’s control and experiment lane counts. However, Kvam and Liu[26] argue that, due to the small number of replicates typically used for RNA-Seq experiments, non-parametric methods do not offer enough detection power, and suggest that current statistical methods to detect DE genes based on parametric models for RNA-Seq data (e.g. DESeq and edgeR) remain a more adequate approach. In our study we also find that both DESeq and edgeR tend to slightly increase the FPR as sequencing depth increases, as higher depths induce DESeq and edgeR to assign smaller p-values to transcripts with small fold-changes^{b}. However, in no instance do we obtain a FPR larger than 1% for DESeq (2% for edgeR) (see Figures 1 and 3A). It is worth mentioning that the latest updates to DESeq (v1.6.1) and edgeR (v2.4.0) – released after the studies[37] and[26] – reduce the number of false positive calls by about 50% (data not shown).
Effects of replication for detection of DE
To quantify the effects of replication in RNA-Seq DE experiments, we tested n control vs. n treatment biological replicates (2, 3, 4, 6, 8 and 12) while holding sequencing depth constant. We find that as n increases, both algorithms increase the call rate and TPR while the FPR remains unchanged (Table 1).
Our results clearly support the simple message that more biological replicates are not only desirable but needed to improve the quality and reliability of DE detection using RNA-Seq; however, due to the costs associated with RNA-Seq, many experiments are likely to need multiplex designs to achieve this level of replication.
This study is concerned with the simulation of overdispersion effects due to biological variability, and overdispersion due to technical variability is implicitly nested within this estimation (see Methods section). It is worth mentioning that, while biological variability is important, the contribution of technical variation to overdispersion is not negligible, and disagreements between estimates of expression can occur at all levels of coverage[41]. Ideally, an RNA-Seq experimental design with biological replication should also aim to block sources of technical variation, such as between-lane variation, to constrain the dispersion of RNA-Seq experiments.
Effects of sequencing depth for detection of DE
To quantify the effects of sequencing depth in RNA-Seq DE experiments, we simulated an extensive sequencing depth range (100% down to 1%) for every case of n control vs. n treatment biological replicates. As the amount of available sequencing data decreases, both packages decrease the call rate and TPR while the FPR remains low. The TPR decreases very slowly as sequencing depth decreases, suggesting that sequencing depth can be reduced to ∼15% without much impact on TPR.
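One standard way to simulate such a loss of sequencing depth is binomial thinning of the count table; we sketch it here as an illustration of the idea (the paper's exact downsampling procedure is described in its Methods), with hypothetical counts:

```python
# Sketch: binomial thinning to simulate reduced sequencing depth. Each of a
# transcript's k reads survives independently with probability equal to the
# target depth fraction, so expected counts scale by that fraction.
import random

def downsample_counts(counts, depth_fraction, seed=0):
    rng = random.Random(seed)
    return [sum(rng.random() < depth_fraction for _ in range(k))
            for k in counts]

full = [1000, 50, 0, 300]                 # hypothetical transcript counts
reduced = downsample_counts(full, 0.15)   # expected roughly [150, 7.5, 0, 45]
```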
We conclude that DE analysis with RNA-Seq is robust to substantial loss of sequencing data, as indicated by a slow decline in TPR as sequencing depth is lost, accompanied by no increase in FPR. These findings are consistent with the results reported by Bashir et al.[42], who observed that lower levels of transcriptome sequencing had sufficient information to estimate the distribution of expression values arising from observed transcripts. Bashir et al. did not directly test the power to detect DE; however, as testing for DE relies on good concordance with the expected distribution, it follows that DE detection is reasonably robust to loss of sequencing data.
Multiplexing experimental designs
To quantify the effects of varying both n and sequencing depth, we simulated multiplexing n control vs. n treatment lanes into two sequencing lanes. We observed a steady increase in TPR with increasing n, despite the corresponding decrease in sequencing data per transcript by a factor of 1/n. For both DESeq and edgeR, the number of DE calls and the TPR increase with n, as observed previously, and are unaffected by the decrease in data. For DESeq, the FPR remains roughly constant and always below 0.1%, while for edgeR, the FPR decreases slowly as n increases.
Our simulations strongly support the conclusion that the benefits of multiplexing n biological replicates into one sequencing lane (two lanes for an n control vs. n treatment DE experiment) far outweigh the decrease in available data per sample by 1/n. These multiplexing experimental designs improve TPR and FPR while greatly reducing the cost of the experiment.
While the detection of DE appears robust to the amount of available sequence data, there remains the question of how multiplexing affects coverage of the transcriptome and detection of low-abundance or rare transcripts. This coverage issue will increasingly be counterbalanced by rapid increases in the data generation capacity of a single sequencing experiment. In a detailed study of the Marioni[23] human (liver and kidney) dataset, Bashir et al.[42] reported that over 90% of the total observed transcripts were sampled with 1 million reads. This should be considered in the context of quickly evolving sequencing technologies like HiSeq 2000 and HiSeq 2500, which can produce up to 300 million reads per sequencing lane. In an evaluation of coverage depth of the chicken transcriptome, Wang et al.[21] find that while 10 million reads allow detection of 80% of the annotated genes, an increase from 10 to 20 million reads does not have a significant effect on transcriptome coverage or the reliability of mRNA measurements. That said, current estimates of transcriptome coverage and the impacts of the multiplexing strategies analysed in this paper assume unbiased sampling of transcripts. It is highly likely that the power to detect DE varies across transcripts with their sequence content, isoform complexity and abundance. Fang and Cui[20] warn against and discuss several sequencing biases that could create the need for high sequencing coverage to accurately estimate transcript abundance and variation. The authors mention the importance of choosing whether to increase the sequencing depth per sample or to increase the number of biological replicates when planning an experiment. Here, we quantitatively argue that, given a fixed budget, the benefits of increasing the number of biological replicates outweigh the corresponding decrease in sequencing depth.
This suggestion is backed by the patterns in Figures 3 and 4, in which, for a given number n of biological replicates, the TPR drops very slowly and the FPR remains low as sequencing depth decreases. In light of new sequencing technologies rapidly increasing the available sequencing depth per lane, the information provided by biological replicates’ variation is likely to become a priority over sequencing depth.
Conclusions
Not surprisingly, our results indicate that more biological replicates are needed to improve the quality and reliability of DE detection using RNA-Seq. Importantly, however, we also find that DE analysis with RNA-Seq is robust to substantial loss of sequencing data, as indicated by a slow decline in TPR accompanied by no increase in FPR. Our simulations strongly support the conclusion that multiplexing experimental designs improve TPR and FPR while greatly reducing the cost of the experiment, as the benefits of multiplexing n biological replicates far outweigh the decrease in available data per sample by 1/n.
As many available packages for DE analysis are becoming faster and easier to use, our recommendation for most RNA-Seq DE experiments is to use two different packages for DE testing. Additional file 2: Figure S4 illustrates the detection overlap between DESeq and edgeR for two contrasting choices of sequencing depth and number of biological replicates: n = 2 at 100% depth and n = 4 at 25% depth. The combined use of packages based on different distribution statistics or different sets of assumptions can generate useful information about possible bias susceptibility of a given package on the specific dataset of interest.
To our knowledge, this is the most up-to-date comparison of DESeq’s and edgeR’s performance with regard to the ability to detect DE in a range of experimental designs. It directly tests the efficiency of modern multiplex experimental design strategies. Our study informs important experimental design decisions relevant to maximising the power of an RNA-Seq study to reliably detect DE.
Methods
Negative binomial model and biological variation simulation
Our synthetic data are based on a negative binomial (NB) model of read counts assumed by[39] and used in edgeR[32], DESeq[25] and NBPSeq[33]. The model is a hierarchical model which takes into account sources of variability in the molar concentration of each transcript isoform in the prepared cDNA library due to i) library preparation steps and, in the case of biological replicates, ii) biological variation. This variation is compounded by additional Poisson shot noise arising from the sequencing step. Assuming the molar concentration in the prepared cDNA library to have a Gamma distribution, one arrives at an NB distribution for the number of counts K mapped onto a particular transcript of interest in a given lane of the sequencer:

K ∼ NB(μ, ϕ),  with E[K] = μ and Var[K] = μ + ϕμ².
The mean μ is proportional to the concentration of the transcript of interest in the original biological sample, up to a normalisation factor specific to the lane of the sequencer. A suitable model for this normalisation factor is the Robinson-Oshlack TMM factor [32]. The quantity ϕ is called the dispersion parameter [39], and is specific to the transcript isoform and the library preparation. A more detailed account of the model is given in Additional file 3.
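As a concrete illustration, the (μ, ϕ) parameterisation above maps onto the (n, p) parameterisation used by common random number libraries via n = 1/ϕ and p = n/(n + μ). The following is a minimal Python/NumPy sketch, used here only as an illustrative stand-in for the R code of Additional file 3:

```python
import numpy as np

def sample_nb_counts(mu, phi, size, rng):
    """Draw read counts from an NB distribution with mean mu and
    dispersion phi (variance = mu + phi * mu**2), using numpy's
    (n, p) parameterisation: n = 1/phi, p = n / (n + mu)."""
    n = 1.0 / phi
    p = n / (n + mu)
    return rng.negative_binomial(n, p, size=size)

rng = np.random.default_rng(0)
counts = sample_nb_counts(mu=500.0, phi=0.1, size=100_000, rng=rng)
# The empirical mean is close to 500 and the empirical variance is close
# to 500 + 0.1 * 500**2 = 25500, i.e. strongly overdispersed vs. Poisson.
```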
R packages for DE in RNA-Seq
All three packages considered are based on an NB model, and differ principally in the way the dispersion parameter is estimated. Unless otherwise stated, the tests of these packages used herein use default settings. Typical R code examples are given in Additional file 3.
edgeR (version 2.4.0, Bioconductor)
edgeR [43] first calculates, for each transcript, a quantile-adjusted conditional log-likelihood function for the dispersion ϕ [39]. Here, "quantile adjusted" refers to an adjustment of the counts to account for the total number of counts across all transcripts in each biological replicate, and "conditional" means conditioning on the sum of counts for the given transcript across biological replicates. The "common dispersion" estimate defined by edgeR assumes ϕ to be constant over all transcripts in one lane of the sequencer, and is obtained by maximising the log-likelihood summed over transcripts. However, edgeR recommends the "tagwise dispersion" function, which estimates the dispersion on a gene-by-gene basis and implements an empirical Bayes strategy for squeezing the estimated dispersions towards the common dispersion. Under the default setting, the degree of squeezing is adjusted to suit the number of biological replicates within each condition: more biological replicates need to borrow less information from the complete set of transcripts and so require less squeezing.
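The common-dispersion idea — choosing a single ϕ that maximises a log-likelihood summed over transcripts — can be sketched as follows. This is an illustration only, not the edgeR estimator: it plugs per-transcript sample means in for μ and omits edgeR's quantile adjustment and conditioning.

```python
import numpy as np
from math import lgamma, log

def nb_loglik(k, mu, phi):
    """Log-pmf of a negative binomial with mean mu and dispersion phi."""
    r = 1.0 / phi
    return (lgamma(k + r) - lgamma(r) - lgamma(k + 1)
            + r * log(r / (r + mu)) + k * log(mu / (r + mu)))

def common_dispersion(counts, grid):
    """Return the phi from `grid` maximising the log-likelihood summed
    over all transcripts, plugging each transcript's sample mean in for
    mu. A simplified stand-in for edgeR's common-dispersion estimator:
    no quantile adjustment, no conditioning on the count sum."""
    best_phi, best_ll = None, -np.inf
    for phi in grid:
        ll = 0.0
        for row in counts:
            mu = max(row.mean(), 1e-8)
            ll += sum(nb_loglik(k, mu, phi) for k in row)
        if ll > best_ll:
            best_phi, best_ll = phi, ll
    return best_phi

# Recover a known dispersion from simulated counts
# (200 transcripts, 8 replicates, true phi = 0.2):
rng = np.random.default_rng(1)
true_phi, r = 0.2, 1 / 0.2
mus = rng.uniform(50, 500, size=200)
counts = np.array([rng.negative_binomial(r, r / (r + m), size=8) for m in mus])
phi_hat = common_dispersion(counts, np.linspace(0.05, 0.5, 10))
```

Plugging in the sample mean slightly underestimates the dispersion, which is precisely the bias that the conditional (quantile-adjusted) likelihood of [39] is designed to remove.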
DESeq (version 1.6.1, Bioconductor)
In previous versions of the DESeq package [25], ϕ was assumed to be a function of μ determined by non-parametric regression. The recent version used in this paper follows a more versatile procedure. First, for each transcript, an estimate of the dispersion is made, presumably using maximum likelihood. Second, the estimated dispersions for all transcripts are fitted to the functional form:

$$\phi(\mu) = a_{0} + \frac{a_{1}}{\mu},$$
using a gamma-family generalised linear model. The per-transcript estimate is considered more appropriate when many replicates (≥ 4) are present, while the functional form is considered more appropriate when few replicates (≤ 2) are present, in which case information is borrowed from the general trend across all transcripts. Recognising that the dispersion may be underestimated by the functional fit, leading to overestimates of significance when detecting DE, DESeq by default takes the maximum of the two estimates for each transcript. Also by default, DESeq assumes a model in which the mean μ differs between conditions but the dispersion ϕ is common across all conditions.
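The fit-then-take-the-maximum logic can be sketched as follows. Ordinary least squares stands in here for DESeq's gamma-family GLM purely for brevity, and the input numbers are invented:

```python
import numpy as np

def deseq_style_dispersions(mu_hat, phi_hat):
    """Fit phi = a0 + a1/mu across transcripts, then return, per
    transcript, the maximum of the fitted value and the individual
    estimate, mirroring the DESeq default. OLS is an illustrative
    substitute for DESeq's gamma-family GLM fit."""
    X = np.column_stack([np.ones_like(mu_hat), 1.0 / mu_hat])
    coef, *_ = np.linalg.lstsq(X, phi_hat, rcond=None)
    phi_fit = X @ coef
    return np.maximum(phi_fit, phi_hat), coef

mu_hat = np.array([10.0, 50.0, 200.0, 1000.0])
phi_hat = np.array([0.60, 0.12, 0.05, 0.30])   # last entry: noisy outlier
phi_used, (a0, a1) = deseq_style_dispersions(mu_hat, phi_hat)
# phi_used never drops below the per-transcript estimate, so
# significance is never overstated relative to the individual fit.
```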
NBPSeq (version 0.1.4, CRAN)
Like edgeR, the package NBPSeq [33] considers the per-transcript log-likelihood conditioned on the sum of counts across replicates. However, NBPSeq further imposes the following functional relationship between ϕ and μ:

$$\phi = c\,\mu^{\alpha - 2},$$
that is, a linear relationship between log ϕ and log μ. The cases α = 1 and α = 2 (the latter equivalent to common dispersion) of this function are referred to as NB1 and NB2 respectively. The global parameters α and c are estimated by maximising the log conditional likelihood summed over all replicates and transcripts.
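On noise-free data obeying the NBP relationship exactly, c and α can be recovered by simple log-log regression. NBPSeq itself estimates them by maximising the conditional likelihood; the regression below is only an illustrative shortcut (it is how the trend line of Additional file 5: Figure S1 can be drawn):

```python
import numpy as np

def fit_nbp(mu_hat, phi_hat):
    """Estimate c and alpha in phi = c * mu**(alpha - 2) by regressing
    log(phi) on log(mu); the slope equals alpha - 2. An illustrative
    shortcut, not the NBPSeq conditional-likelihood estimator."""
    X = np.column_stack([np.ones_like(mu_hat), np.log(mu_hat)])
    (logc, slope), *_ = np.linalg.lstsq(X, np.log(phi_hat), rcond=None)
    return np.exp(logc), slope + 2.0

mu = np.array([10.0, 100.0, 1000.0])
phi = 0.364 * mu ** (1.7 - 2.0)       # exact NBP relationship, no noise
c, alpha = fit_nbp(mu, phi)
# recovers c = 0.364 and alpha = 1.7 (to floating-point precision)
```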
Construction of the synthetic datasets
Each of our synthetic datasets consists of a 'control' dataset of read counts ${K}_{ij}^{\text{contr}}$ and a 'treatment' dataset of read counts ${K}_{ij}^{\text{treat}}$, for i = 1,…,t transcript isoforms sequenced from j = 1,…,n biological replicate cDNA libraries.
For each transcript isoform, we begin by providing a pair of NB parameters ${\widehat{\mu}}_{i}$ and ${\widehat{\varphi}}_{i}$. A read count ${K}_{ij}^{\text{contr}}$ for each isoform in each biological replicate is then generated by sampling randomly from an NB distribution with these estimated parameters to form the control dataset. To create the treatment dataset, the set of isoforms is first divided into a non-regulated subset, an up-regulated subset and a down-regulated subset. A regulating factor θ_{i}, i = 1,…,t, which is equal to 1 (non-regulated), > 1 (up-regulated) or < 1 (down-regulated), is then chosen from a suitable distribution. A treatment read count ${K}_{ij}^{\text{treat}}$ is then generated for each isoform in each biological replicate from an NB distribution with mean ${\theta}_{i}{\widehat{\mu}}_{i}$ and unchanged dispersion ${\widehat{\varphi}}_{i}$.
The basis for the parameters ${\widehat{\mu}}_{i}$ and ${\widehat{\varphi}}_{i}$ is a subset of the Pickrell [24] dataset of sequenced cDNA libraries generated from mRNA from 69 lymphoblastoid cell lines derived from Nigerian individuals as part of the International HapMap Project. For each individual, a library prepared for the Illumina GA2 platform was split into two portions, with one portion sequenced at the Argonne sequencing centre and the other at the Yale sequencing centre. For 12 of the individuals a second library was also prepared, split, and sequenced at both centres. Only data from the initial 69 libraries sequenced at Argonne were used for the current study. The raw reads were realigned onto the human transcriptome (hg18, UCSC) using the KANGA aligner [44]. The total number of reads mapped to annotated genes per lane varied substantially, from 2 × 10^{6} to 20 × 10^{6}. To provide a uniform set of biological replicates from which to estimate ${\widehat{\mu}}_{i}$ and ${\widehat{\varphi}}_{i}$, a subset of 44 libraries was chosen for which the total number of mappings to the transcriptome per lane was in the range 10 × 10^{6} to 16 × 10^{6}. Finally, any transcript for which the total number of reads was less than 44, i.e. an average of less than one read per lane, was culled from the dataset, leaving a list of 46,446 transcripts. The resulting subset of the Pickrell dataset is considered to exhibit overdispersion due to both library preparation and biological variation.
Note that for generation of synthetic data it is not necessary to provide an accurate estimate of μ_{i} and ϕ_{i} for each isoform in the reduced Pickrell dataset, but simply a plausible distribution of these parameter values over the transcriptome, representing typical isoform abundances and their variation due to technical and/or biological overdispersion. Parameter values ${\widehat{\mu}}_{i}$ and ${\widehat{\varphi}}_{i}$ were obtained from the reduced Pickrell dataset as follows. The total number of counts from each of the 44 lanes was first reduced to that of the lane with the smallest number of counts, by sampling from the counts in each lane while keeping track of the transcript to which each count is mapped. This forms a normalised set of counts for the i-th transcript in the j-th lane.
For each transcript, a maximum likelihood estimate (MLE) of ${\widehat{\mu}}_{i}$ and ${\widehat{\varphi}}_{i}$ was then made from the n = 44 biological replicates. Details of the construction of this estimate are given in Additional file 3. For each simulation described herein, a synthetic dataset was constructed consisting of n biological replicates of control data generated from NB distributions with the estimated ${\widehat{\mu}}_{i}$ and ${\widehat{\varphi}}_{i}$, and a further n biological replicates of treatment data generated from NB distributions with means ${\theta}_{i}{\widehat{\mu}}_{i}$ and unchanged dispersions ${\widehat{\varphi}}_{i}$.
Two sets of simulations were performed:

1. To test performance under the null hypothesis, the regulating factor was set to θ_{i} = 1 for all transcripts.

2. To test the ability to detect DE in the presence of an alternative hypothesis, the regulating factor θ_{i} was set to 1 + X_{i} for a randomly chosen 7.5% of the transcripts (up-regulated), (1 + X_{i})^{−1} for a further 7.5% (down-regulated), and 1 for the remaining 85% of the transcripts, where the X_{i} are independently and identically distributed exponential random variables with mean 1.
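The two-condition simulation described above can be sketched as follows. This is a compact NumPy stand-in for the authors' actual simulation code; the array sizes and seed are arbitrary choices for illustration:

```python
import numpy as np

def simulate_de_dataset(mu, phi, n_rep, frac_de=0.075, seed=0):
    """Generate control and treatment count tables: control counts are
    NB(mu_i, phi_i); a random 7.5% of transcripts are up-regulated by
    theta = 1 + X and another 7.5% down-regulated by theta = 1/(1 + X),
    X ~ Exponential(mean 1); dispersions are left unchanged."""
    rng = np.random.default_rng(seed)
    t = mu.size
    theta = np.ones(t)
    idx = rng.permutation(t)
    n_de = int(frac_de * t)
    x = rng.exponential(1.0, size=2 * n_de)
    theta[idx[:n_de]] = 1.0 + x[:n_de]                  # up-regulated
    theta[idx[n_de:2 * n_de]] = 1.0 / (1.0 + x[n_de:])  # down-regulated
    r = 1.0 / phi

    def nb(mean):  # NB draws with per-transcript mean and dispersion
        p = (r / (r + mean))[:, None]
        return rng.negative_binomial(r[:, None], p, size=(t, n_rep))

    return nb(mu), nb(theta * mu), theta

mu = np.full(2000, 100.0)
phi = np.full(2000, 0.1)
control, treatment, theta = simulate_de_dataset(mu, phi, n_rep=4)
# 7.5% of transcripts get theta > 1, 7.5% get theta < 1, the rest 1.
```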
Calculation of true and false positive rates
Under the null hypothesis
All three packages test for DE in single-factor experiments by calculating p-values using the method described in [25]. For each transcript i, a probability is calculated for the number of counts in each of the two conditions (control and treatment), conditional on the sum of the counts in both conditions, assuming the NB model described above. The p-value is the sum of the probabilities of all ways of apportioning the sum of counts between the two conditions that have a lower probability than the observed counts.
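For intuition, the apportioning argument can be sketched for a single transcript with one count per condition. This mirrors the test of [25] in spirit only: the real packages work with normalised, multi-replicate count sums, so the version below is a deliberate simplification.

```python
from math import lgamma, log, exp

def nb_logpmf(k, mu, phi):
    """Log-pmf of a negative binomial with mean mu and dispersion phi."""
    r = 1.0 / phi
    return (lgamma(k + r) - lgamma(r) - lgamma(k + 1)
            + r * log(r / (r + mu)) + k * log(mu / (r + mu)))

def exact_pvalue(ka, kb, mu, phi):
    """Conditional on the total S = ka + kb, sum the probabilities of
    all splits (a, S - a) that are no more likely than the observed
    split, and normalise by the total over all splits."""
    s = ka + kb
    probs = [exp(nb_logpmf(a, mu, phi) + nb_logpmf(s - a, mu, phi))
             for a in range(s + 1)]
    p_obs = probs[ka]
    return sum(p for p in probs if p <= p_obs) / sum(probs)

# A balanced split of the total is the most likely outcome under the
# null, so an extreme split yields a much smaller p-value:
p_extreme = exact_pvalue(95, 5, mu=50.0, phi=0.1)
p_balanced = exact_pvalue(52, 48, mu=50.0, phi=0.1)
```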
To test the performance of each package under the null hypothesis, we simulated sets of n control and n treatment lanes of counts for a range of values of n. The FPR, quoted as a percentage, was calculated at a given significance level α as:

$$\text{FPR} = \frac{\text{number of transcripts with } p \le \alpha}{\text{total number of transcripts}} \times 100\%. \quad (4)$$

Ideally, the FPR should match the significance level α if the package is performing correctly.
In the presence of an alternative hypothesis
All three packages provide an adjusted p-value, p_{adj}, to correct for multiple hypothesis testing with the Benjamini-Hochberg procedure, using the R function p.adjust(). All calculations herein of true and false positive rates in the presence of an alternative hypothesis use adjusted p-values.
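The Benjamini-Hochberg adjustment applied by p.adjust(p, method = "BH") can be reproduced in a few lines; this sketch follows the standard step-up definition:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, matching R's
    p.adjust(p, method = "BH"): scale each sorted p-value by m/rank,
    enforce monotonicity by a cumulative minimum taken from the
    largest p-value downwards, and restore the original order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)[::-1]                  # largest p first
    ranked = p[order] * m / np.arange(m, 0, -1)  # p * m / rank
    adj = np.minimum(np.minimum.accumulate(ranked), 1.0)
    out = np.empty(m)
    out[order] = adj
    return out

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
adj = bh_adjust(p)
# e.g. the smallest p-value 0.001 becomes 0.001 * 8 / 1 = 0.008
```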
From the 6,966/46,446 (15%) of the transcripts induced with a regulating factor other than 1, we selected the 5,726 (12%) with a regulating factor satisfying either θ_{i} ≤ 0.83 or θ_{i} ≥ 1.20. We define these as "effectively DE" transcripts. This additional filter on minimal fold-change is designed to quantify the performance of algorithms and experimental designs for the detection of DE that might be considered more biologically relevant by researchers. Likewise we define the remaining transcripts, those satisfying 0.83 < θ_{i} < 1.20, as "effectively non-DE". These definitions were used to estimate the FPR and TPR at significance level α via the following formulae:

$$\text{FPR} = \frac{\text{number of effectively non-DE transcripts with } p_{\text{adj}} \le \alpha}{\text{number of effectively non-DE transcripts}} \times 100\%,$$

$$\text{TPR} = \frac{\text{number of effectively DE transcripts with } p_{\text{adj}} \le \alpha}{\text{number of effectively DE transcripts}} \times 100\%.$$
Apart from the use of adjusted p-values, the formula for the FPR reduces to Eq. 4 if the number of simulated DE transcripts is set to zero, since in this case all transcripts are, by definition, "effectively non-DE". The quantities 1 − FPR and TPR are commonly referred to in the literature as "specificity" and "sensitivity" respectively.
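Given adjusted p-values and the simulated regulating factors, both rates follow directly from the definitions above. A small illustrative helper (function and variable names are our own):

```python
import numpy as np

def tpr_fpr(p_adj, theta, alpha=0.01, low=0.83, high=1.20):
    """TPR and FPR in percent, using the paper's definitions:
    'effectively DE' transcripts satisfy theta <= 0.83 or
    theta >= 1.20; all others are 'effectively non-DE'."""
    p_adj, theta = np.asarray(p_adj), np.asarray(theta)
    de = (theta <= low) | (theta >= high)
    called = p_adj <= alpha
    tpr = 100.0 * called[de].mean()    # sensitivity
    fpr = 100.0 * called[~de].mean()   # 100 - specificity
    return tpr, fpr

# Toy example: one true positive, one false negative,
# one false positive, one true negative.
tpr, fpr = tpr_fpr([0.001, 0.5, 0.001, 0.9], [2.0, 0.5, 1.0, 1.0])
```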
Simulating variable levels of sequence data and replication
Simulating variations in the amount of available sequencing data is a fundamental part of investigating the impact of multiplexed experimental design strategies. Variability in the amount of sequence data amongst samples can occur for reasons such as restrictions on available resources, machine error, or sequencing reads sequestered by pathogen transcriptome fractions present in the sample. To simulate loss of sequencing depth, we randomly subsampled counts without replacement from the original table of counts simulated in the presence of an alternative hypothesis for each biological replicate. Sequencing depth was decreased in both control and treatment samples over a range from 100% (a full lane of sequence) to 1% of the original data. After this subsampling, the resulting table of counts was analysed with DESeq (and edgeR), and the total number of effectively-DE calls, TPR, FPR and fold-changes were recorded for every n scenario. We simulated experimental choices of n control vs. n treatment biological replicates at n = 2, 3, 4, 6, 8 and 12.
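Subsampling a count vector without replacement is equivalent to a multivariate hypergeometric draw per library, which avoids expanding the table into individual reads. A sketch (the example counts are invented):

```python
import numpy as np

def subsample_depth(counts, fraction, rng):
    """Randomly keep `fraction` of the mapped reads of one library
    (an integer counts vector), sampling without replacement via a
    multivariate hypergeometric draw -- equivalent to discarding
    individual reads at random without replacement."""
    total = int(counts.sum())
    keep = int(round(fraction * total))
    return rng.multivariate_hypergeometric(counts, keep)

rng = np.random.default_rng(0)
lane = np.array([100, 10, 0, 890])    # 1000 mapped reads in total
thinned = subsample_depth(lane, 0.15, rng)
# thinned sums to exactly 15% of the original reads (here, 150)
```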
Multiplexing experimental designs
Multiplexing several samples into one sequencing lane reduces the monetary cost of RNA-Seq DE analysis, albeit by dividing the available sequencing depth over those samples. Our strategy consisted of simulating the multiplexing of n control samples vs. n treatment samples into two sequencing lanes. This way, the total amount of sequenced data is constrained, and each control and treatment sample is expected to be represented at an average depth of 1/n × 100%. The absolute number of reads produced in a lane of sequence (i.e. 100% depth) has increased as RNA-Seq technologies have evolved; currently this value can be up to 100 million reads. The multiplexed experimental setups we tested were:

- 2 vs. 2 biological replicates at 50% sequencing depth
- 3 vs. 3 biological replicates at 33% sequencing depth
- 4 vs. 4 biological replicates at 25% sequencing depth
- 6 vs. 6 biological replicates at 17% sequencing depth
- 8 vs. 8 biological replicates at 13% sequencing depth
- 12 vs. 12 biological replicates at 8% sequencing depth
- 32 vs. 32 biological replicates at 3% sequencing depth
- 96 vs. 96 biological replicates at 1% sequencing depth
Endnotes
^{a}Our results obtained using edgeR are presented in Additional file 1: Figure S2.
^{b}Additional file 4: Figure S3 shows the minimum fold-change at which a transcript is assigned p_{adj} ≤ 0.01 for every n scenario.
^{c}The details of our negative binomial model can be found in Additional file 3, including Additional file 5: Figure S1, which shows the maximum likelihood estimates of the model's mean and dispersion parameters for 46,446 transcript isoforms.
References

1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.
2. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28: 511-515. 10.1038/nbt.1621.
3. Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol MJ, Gnirke A, Nusbaum C, Rinn JL, Lander ES, Regev A: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol. 2010, 28: 503-510. 10.1038/nbt.1633.
4. Haas BJ, Zody MC: Advancing RNA-Seq analysis. Nat Biotechnol. 2010, 28: 421-423. 10.1038/nbt0510-421.
5. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008, 40: 1413-1415. 10.1038/ng.259.
6. Lovci MT, Li HR, Fu XD, Yeo GW: RNA-Seq analysis of gene expression and alternative splicing by double-random priming strategy. Methods Mol Biol. 2011, 729: 247-255. 10.1007/978-1-61779-065-2_16.
7. Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinf. 2010, 11: 94. 10.1186/1471-2105-11-94.
8. Oshlack A, Robinson MD, Young MD: From RNA-Seq reads to differential expression results. Genome Biol. 2010, 11: 220. 10.1186/gb-2010-11-12-220.
9. Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008, 36: e105. 10.1093/nar/gkn425.
10. Hansen KD, Brenner SE, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010, 38: e131. 10.1093/nar/gkq224.
11. Sendler E, Johnson GD, Krawetz SA: Local and global factors affecting RNA sequencing analysis. Anal Biochem. 2011, 419: 317-322. 10.1016/j.ab.2011.08.013.
12. Lü B, Yu J, Xu J, Chen J, Lai M: A novel approach to detect differentially expressed genes from count-based digital databases by normalizing with housekeeping genes. Genomics. 2009, 94: 211-216. 10.1016/j.ygeno.2009.05.003.
13. Willenbrock H, Salomon J, Søkilde R, Barken KB, Hansen TN, Nielsen FC, Møller S, Litman T: Quantitative miRNA expression analysis: comparing microarrays with next-generation sequencing. RNA. 2009, 15: 2028-2034. 10.1261/rna.1699809.
14. Zheng W, Chung LM, Zhao H: Bias detection and correction in RNA-Sequencing data. BMC Bioinf. 2011, 12: 290. 10.1186/1471-2105-12-290.
15. Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L: Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011, 12: R22. 10.1186/gb-2011-12-3-r22.
16. Canales RD, Luo Y, Willey JC, Austermiller B, Barbacioru CC, Boysen C, Hunkapiller K, Jensen RV, Knight CR, Lee KY, Ma Y, Maqsodi B, Papallo A, Peters EH, Poulter K, Ruppel PL, Samaha RR, Shi L, Yang W, Zhang L, Goodsaid FM: Evaluation of DNA microarray results with quantitative gene expression platforms. Nat Biotechnol. 2006, 24: 1115-1122. 10.1038/nbt1236.
17. Risso D, Schwartz K, Sherlock G, Dudoit S: GC-content normalization for RNA-Seq data. BMC Bioinf. 2011, 12: 480. 10.1186/1471-2105-12-480.
18. Oshlack A, Wakefield MJ: Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009, 4: 14. 10.1186/1745-6150-4-14.
19. Auer PL, Srivastava S, Doerge RW: Differential expression – the next generation and beyond. Brief Funct Genomics. 2011, doi: 10.1093/bfgp/elr041.
20. Fang Z, Cui X: Design and validation issues in RNA-seq experiments. Brief Bioinf. 2011, 12: 280-287. 10.1093/bib/bbr004.
21. Wang Y, Ghaffari N, Johnson CD, Braga-Neto UM, Wang H, Chen R, Zhou H: Evaluation of the coverage and depth of transcriptome by RNA-Seq in chickens. BMC Bioinf. 2011, 12 (Suppl 10): S5. 10.1186/1471-2105-12-S10-S5.
22. Łabaj PP, Leparc GG, Linggi BE, Markillie LM, Wiley HS, Kreil DP: Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics. 2011, 27: i383-i391. 10.1093/bioinformatics/btr247.
23. Marioni JC, Mason C, Mane SM, Stephens S, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 1509-1517. 10.1101/gr.079558.108.
24. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK: Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010, 464: 768-772. 10.1038/nature08872.
25. Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11 (10): R106. 10.1186/gb-2010-11-10-r106.
26. Kvam VM, Liu P, Si Y: A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot. 2012, 99 (2): 248-256. 10.3732/ajb.1100340.
27. Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, LeProust EM, Peck BJ, Emig CJ, Dahl F, Gao Y, Church GM, Shendure J: Multiplex amplification of large sets of human exons. Nat Methods. 2007, 4: 931-936. 10.1038/nmeth1110.
28. Smith AM, Heisler LE, St Onge RP, Farias-Hesson E, Wallace IM, Bodeau J, Harris AN, Perry KM, Giaever G, Pourmand N, Nislow C: Highly-multiplexed barcode sequencing: an efficient method for parallel analysis of pooled samples. Nucleic Acids Res. 2010, 38: e142. 10.1093/nar/gkq368.
29. Wang L, Si Y, Dedow LK, Shao Y, Liu P, Brutnell TP: A low-cost library construction protocol and data analysis pipeline for Illumina-based strand-specific multiplex RNA-seq. PLoS ONE. 2011, 6: e26426. 10.1371/journal.pone.0026426.
30. Alon S, Vigneault F, Eminaga S, Christodoulou DC, Seidman JG, Church GM, Eisenberg E: Barcoding bias in high-throughput multiplex sequencing of miRNA. Genome Res. 2011, 21: 1506-1511. 10.1101/gr.121715.111.
31. Timmermans MJ, Dodsworth S, Culverwell CL, Bocak L, Ahrens D, Littlewood DT, Pons J, Vogler AP: Why barcode? High-throughput multiplex sequencing of mitochondrial genomes for molecular systematics. Nucleic Acids Res. 2010, 38: e197. 10.1093/nar/gkq807.
32. Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010, 11: R25. 10.1186/gb-2010-11-3-r25.
33. Di Y, Schafer D, Cumbie J, Chang J: The NBP negative binomial model for assessing differential gene expression from RNA-Seq. Stat Appl Genet Mol Biol. 2011, 10: Article 24.
34. Zhou YH, Xia K, Wright FA: A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics. 2011, 27: 2672-2678. 10.1093/bioinformatics/btr449.
35. Singh D, Orellana CF, Hu Y, Jones CD, Liu Y, Chiang DY, Liu J, Prins JF: FDM: a graph-based statistical method to detect differential transcription using RNA-seq data. Bioinformatics. 2011, 27: 2633-2640. 10.1093/bioinformatics/btr458.
36. Li B, Dewey CN: RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinf. 2011, 12: 323. 10.1186/1471-2105-12-323.
37. Tarazona S, Garcia-Alcalde F, Dopazo J, Ferrer A, Conesa A: Differential expression in RNA-seq: a matter of depth. Genome Res. 2011, 21: 2213-2223. 10.1101/gr.124321.111.
38. Langmead B, Hansen KD, Leek JT: Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 2010, 11: R83. 10.1186/gb-2010-11-8-r83.
39. Robinson M, Smyth G: Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007, 23 (21): 2881-2887. 10.1093/bioinformatics/btm453.
40. Anders S: Analysing RNA-Seq data with the DESeq package. 2010, [http://www.bioconductor.org/help/coursematerials/2011/BioC2011/LabStuff/DESeq.pdf]
41. McIntyre LM, Lopiano KK, Morse AM, Amin V, Oberg AL, Young LJ, Nuzhdin SV: RNA-seq: technical variability and sampling. BMC Genomics. 2011, 12: 293. 10.1186/1471-2164-12-293.
42. Bashir A, Bansal V, Bafna V: Designing deep sequencing experiments: detecting structural variation and estimating transcript abundance. BMC Genomics. 2010, 11: 385. 10.1186/1471-2164-11-385.
43. Robinson M, McCarthy D, Smyth G: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139-140. 10.1093/bioinformatics/btp616.
44. Stephen S, Cullerne D, Spriggs A, Helliwell C, Lovell D, Taylor JM: BioKanga: a suite of high performance bioinformatics applications. In preparation, 2012, [http://code.google.com/p/biokanga/]
45. JabRef Development Team: JabRef. 2010, [http://jabref.sourceforge.net/faq.php]
46. Muller A: TeXMed – a BibTeX interface for PubMed. 2002–2012, [http://www.bioinformatics.org/texmed/]
47. Chen H, Boutros PC: VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R. BMC Bioinf. 2011, 12: 35. 10.1186/1471-2105-12-35.
Acknowledgements
JAR is funded by the OCE Science team. This project is supported by Australian Research Council Discovery Grant DP1094699. We used JabRef [45] as our BibTeX reference manager and the TeXMed database [46] as our BibTeX reference source. Additional file 2: Figure S4 was generated using the "VennDiagram" R package by Chen and Boutros [47].
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors contributed to the manuscript. JAR, SEQ and SJS performed the statistical and bioinformatic analysis. CJB developed the data simulation algorithm. CJB, SRW and JMT conceived and designed the study. All authors read and approved the final manuscript.
Electronic supplementary material
Figure S2.
Additional file 1: FPR and TPR detected by edgeR as a function of sequencing depth and replication. Different symbols represent the number n of control vs. treatment samples (n = 2, 3, 4, 6, 8 and 12) across sequencing depths [100% → 1%]. A: TPR at p_{adj} ≤ 0.01. B: FPR at p_{adj} ≤ 0.01. The solid grey line ("multiplex line") connecting the TPR values of n biological replicates at $\frac{1}{n}\times 100\%$ sequencing depth shows the increase in TPR as more biological replicates n are used, despite the loss of power due to the sequencing-depth reduction required by the multiplexing of lanes. This trend remains true even for the n = 32 and n = 96 cases. (PDF 29 KB)
Figure S4.
Additional file 2: Venn diagram showing the TP and FP calls made by DESeq (left, blue circle) and edgeR (right, red circle), and how they overlap with each other and with the total pool of transcripts designated as truly DE (top, green circle). A: the Venn diagram for the case in which the number of biological replicates is n = 12 and the depth is 100%. This combination of n and depth is somewhat unrealistic, as the cost of 24 lanes of sequencing would be almost prohibitive; however, it shows a 'best case' scenario in which 2870 of the total 5689 truly DE transcripts were detected by the union of DESeq and edgeR. Most of these 2870 TP detections (2360) were made by both algorithms, hence either algorithm alone would have sufficed. B: the Venn diagram for n = 4 at 25% depth. This more realistic experimental design choice of n and depth shows the value of using both algorithms: only 913 of the total 5697 truly DE transcripts were detected by the union of DESeq and edgeR, and only about two thirds of these (591) were detected by both algorithms. These contrasting scenarios show that using both algorithms helps to further constrain the list of viable DE candidates in a fast and cheap manner. (PDF 281 KB)
Negative binomial model
Additional file 3: Details of the negative binomial model.^{c} (PDF 44 KB)
Figure S3.
Additional file 4: Smallest fold-change required for a transcript to be called DE (p_{adj} ≤ 0.01) as a function of the number n of biological replicates (using DESeq). The more replicates available, the smaller the fold-change required for a transcript to be called DE by DESeq or edgeR. (EPS 36 KB)
Figure S1.
Additional file 5: Maximum likelihood estimates of the NB mean ${\widehat{\mu}}_{i}$ and dispersion parameter ${\widehat{\varphi}}_{i}$ for 46,446 transcript isoforms. The green line is a linear regression of ${log}_{10}\widehat{\varphi}$ against ${log}_{10}\widehat{\mu}$, and corresponds to the NBPSeq model relationship $\widehat{\varphi}=c{\widehat{\mu}}^{\alpha - 2}$ with α = 1.700, c = 0.364. (PDF 133 KB)
About this article
Cite this article
Robles, J.A., Qureshi, S.E., Stephen, S.J. et al. Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC Genomics 13, 484 (2012). https://doi.org/10.1186/1471-2164-13-484
Keywords
 RNASeq
 Differential expression analysis
 Sequencing depth
 Replication
 Experimental design
 Multiplex