 Research
 Open access
 Published:
Differential expression analysis of RNA sequencing data by incorporating nonexonic mapped reads
BMC Genomics volume 16, Article number: S14 (2015)
Abstract
Background
RNA sequencing (RNAseq) is a powerful tool for genomewide expression profiling of biological samples with the advantage of highthroughput and high resolution. There are many existing algorithms nowadays for quantifying expression levels and detecting differential gene expression, but none of them takes the misaligned reads that are mapped to nonexonic regions into account. We developed a novel algorithm, XBSeq, where a statistical model was established based on the assumption that observed signals are the convolution of true expression signals and sequencing noises. The mapped reads in nonexonic regions are considered as sequencing noises, which follows a Poisson distribution. Given measureable observed and noise signals from RNAseq data, true expression signals, assuming governed by the negative binomial distribution, can be delineated and thus the accurate detection of differential expressed genes.
Results
We implemented our novel XBSeq algorithm and evaluated it by using a set of simulated expression datasets under different conditions, using a combination of negative binomial and Poisson distributions with parameters derived from real RNAseq data. We compared the performance of our method with other commonly used differential expression analysis algorithms. We also evaluated the changes in true and false positive rates with variations in biological replicates, differential fold changes, and expression levels in nonexonic regions. We also tested the algorithm on a set of real RNAseq data where the common and different detection results from different algorithms were reported.
Conclusions
In this paper, we proposed a novel XBSeq, a differential expression analysis algorithm for RNAseq data that takes nonexonic mapped reads into consideration. When background noise is at baseline level, the performance of XBSeq and DESeq are mostly equivalent. However, our method surpasses DESeq and other algorithms with the increase of nonexonic mapped reads. Only in very low read count condition XBSeq had a slightly higher false discovery rate, which may be improved by adjusting the background noise effect in this situation. Taken together, by considering nonexonic mapped reads, XBSeq can provide accurate expression measurement and thus detect differential expressed genes even in noisy conditions.
Background
Nextgeneration sequencing (NGS) has been widely used in biological studies. RNA sequencing (RNAseq) is the most commonly used NGS technologies to investigate the aberration of mRNA expression in disease and normal condition comparison. Unlike microarray technology, which uses a short section of a gene as a probe to determine the gene's expression, RNAseq provides measurement across entire exonic region, enabling accurate expression quantification and discovery of novel isoforms and splicing junctions. With RNAseq technology, thousands of novel coding and noncoding genes, alternative splice forms of known genes have been discovered.
Differential expression (DE) analysis using RNAseq is commonly employed to interrogate changes between different experimental conditions. While enormously successful, DE analysis also suffers from systematic noise and sequencing biases, such as sequence quality bias, wrong base calls, variability in sequence depth across the transcriptome, and the coverage depth differences of replicate samples [1]. There are already many statistical testing methods for RNAseq differential expression analysis. One is to normalize the read counts of target transcripts, converting them into reads per kilobase per million mapped reads (RPKM) and then perform linear modeling methods that are used in microarray experiments [2]. However, the method designed for microarray measurement may not fit the characteristics of sequencing data properly. In past years, algorithms have been developed specifically for RNAseq data analysis. Among them, two popular software packages implemented the negative binomial (NB) model that account for genomewide read counts and moderate dispersion estimates with different statistical methods [3, 4]. EdgeR [4] uses a trendedbymean estimate to moderate dispersion estimates, whereas DESeq [3] takes account of the maximum of a fitted dispersion mean or the featurewise dispersion estimate, as reviewed in [5]. However, neither of these methods considered the misaligned reads existing in the sequencing data, which may play a significant role in detecting the significance of target transcripts.
Here we propose a novel DE analysis algorithm  XBSeq, which is derived from DESeq, where we take the nonexonic reads of RNAseq data into consideration. In conventional RNAseq analysis, reads mapped to the exons are counted as the expression of a gene, whereas reads aligned to the intronic and intergenic regions are generally ignored. Those nonexonic hits exist because of: sequencing error, mapping error, contamination by genomic DNA, unannotated genes, and nascent transcription and cotranscriptional splicing [6]. Our model treats these sequence reads as sequencing noises that exist across the entire genome, both exonic and nonexonic regions. Therefore, the observed read counts can be decomposed into two components: true signals that are directly derived from transcripts expression, and the others from the random noises. We model true expression signals by a negative binomial distribution and assume sequencing background noises possess a Poisson distribution. With nonexonic read counts, we can estimate the parameter λ_{ i } of the Poisson distribution of each gene. Afterwards, we remove sequencing noise effect from observed signals and retrieve the backgroundcorrected mean and variance parameters for the NB model of true expression signals.
To study the robustness of the algorithm, we have built a simulation framework that generates RNAseq data by combining the true signals in NB distribution with different levels of nonexonic reads from Poisson distribution. We demonstrate our method by applying it to the simulated data and examine how it performs comparing with other common DE analysis algorithms.
Methods
Nonexonic sequence read count
For a typical transcriptome profiling by RNAseq, we detect read count of each gene by using HTSeq algorithm [7]. Given exons' locations of every gene, HTSeq counts sequence reads aligned to the genic regions. In order to capture the reads in nonexonic regions, we preserved the structure information of each gene (transcript length, exon size, etc) by shifting start and end positions of each exon to a nearby intronic or intergenic region (See Figure 1A). We have defined nonexonic regions for each species by the following steps:

1)
Download refFlat table from UCSC database (http://genome.ucsc.edu) and create the preliminary list of genefree regions,

2)
Download tables of (a) all_mrna; (b) ensGene; (c) pseudoYale60Gene; (d) vegaGene;, (e)xenoMrna, and (f) xenoRefGene from UCSC database and remove regions appear in any of them from the genefree regions,

3)
To guarantee genefree regions are far enough from exonic regions, trim 100 bps from both sides of intronic regions and 1,000 bps from both sides of intergenic regions,

4)
Shift each exon of a gene to the right nearest genefree region. Most of the shifted genes remain the same as the original structures of the genes,

5)
If the nearby genefree region is too short, we may only preserve the exon size features but not the whole gene structure. The priority of shifting a region is: i) nearest right genefree region, 2) nearest left genefree region; 3) the second right nearest genefree region and so on until the shift region of the original exon fits, and

6)
At last, we considered the shifted regions as the nonexonic regions for each gene and a final .gtf file was created.
To extract nonexonic read counts, HTSeq was performed second time to generate an equivalent read count for each gene over an exactly same length of nonexonic region. By doing so we guarantee an equivalent read count from nonexonic region for each gene.
The histogram of a RNAseq data in RPKM unit was plotted in Figure 1B. The blue histogram was derived from the observed read counts genomewide, while the red one was derived from the nonexonic read counts after shifting the exons' position. As illustrated in the figure, the hump of the red histogram overlaps with the left tail of the blue histogram, indicating the existence of sequencing noises in commonly reported gene expression levels, particularly when gene expression level is low. Based on this observation, we hypothesize that the read count of i^{th} gene (e.g., observed by HTSeq), defined as X_{ i }, is composed of true signal S_{ i } (not measurable directly) and background noise B_{ i } (measured over our uniquely defined nonexonic regions).
Poissonnegative binomial model
We assume that read count of i^{th} gene can be decomposed into two components: true signal S_{ i } that directly derived from transcript expression, and background noise B_{ i } due to sequencing error or misalignment. Therefore, the observed signal X_{ i } is
where r_{ i }, p_{ i } are parameters (number successes and probability of success, respectively) of NB distribution and λ_{ i } is the rate parameter of Poisson distribution. We further assume that the true expression signal S_{ i } and background noise B are independent. Given the observed signal X_{ i } to be the sum of a NB and a Poisson, the probability distribution of X_{ i } is governed by a Delaporte distribution, which is the convolution of a NB distribution with a Poisson distribution [8, 9]. When there is no background noise (which is the assumption of many other RNAseq algorithms), the observed signal is simply governed by a NB distribution,
Estimation of distribution parameters
The two NB parameters r_{ i } and p_{ i } can be estimated by the background corrected mean and variance of gene i; and with the nonexonic read counts, the Poisson parameter λ_{ i } can be determined easily. We further assume that genes are independent to each other, acknowledging that some genes are dependent within pathways or other reasons. Hence, the objective is to estimate all parameters of each gene in order to obtain the NB model fitted to the true expression signals.
The estimation of Poisson parameter λ is relative simple, we assume that the read count derived from the nonexonic regions representing the background component B and independent of S. Therefore, we can obtain λ of each gene, being the average of total nonexonic read counts across all m replicates.
where b_{ ij } is the nonexonic read count of i^{th} gene from j^{th} sample. After the estimation of Poisson parameter, we can calculate the true expression signals' mean μ_{ Si } and variance {\sigma}_{{S}_{i}}^{2} of each gene as follows,
where {\sigma}_{{B}_{i}}^{2}={\lambda}_{i}. Note that observable X and background B are not independent. As we mentioned earlier, observed read count X follows a Delaporte distribution, which has no closed form [8, 9]. The parameters of Delaporte distribution, D(λ, α, β), however, is known as \mu =\lambda +\alpha \beta, and {\sigma}^{2}=\lambda +\alpha \beta \left(1+\beta \right), where \lambda can be considered as the parameter of Poisson distribution. When \lambda =0, the Delaporte reduces to NB distribution, similar to what we have in Eqs. (3) and (4). The same variance correction method in DESeq is subsequently used to adjust {\sigma}_{{S}_{i}}^{2} in order to get precise estimate of the variance when the number of replicates is small [3]. After obtaining the adjusted true expression signal mean and variance, μ_{ S } and {\sigma}_{{S}_{i}}^{{}^{\prime}2}, from Eqs. (3) and (4), the two NB parameters r_{ i } and p_{ i } are further estimated by,
Testing for differential expression
After estimating the NB parameters in both experimental conditions, differential expression analysis between the two conditions can be tested. Designed similarly to DESeq method, we use a Fisher's exact test approach to estimate the P value of each gene [3]. In short, suppose we have x and y reads of a gene in each condition, we compute every possible p(a, b), where the sum of the variables a and b equal to K_{ total } (K_{ total } = × + y). By assuming the independence of two test conditions, we have p(a, b) = Pr(a) Pr(b), where Pr(a) and Pr(b) are the probabilities in NB distribution that we have estimated for each condition. Therefore, the P value is the proportion of the sum of possible probabilities less than the probability of actual read counts among the sum of all probabilities as follows,
Equation 7 is evaluated genewise, and for simplicity, we omitted subscript i.
Simulation
In order to evaluate the performance of different RNAseq algorithms, we generated a set of simulated data where we could control the differential expression status for a given set of genes, as well as noise level for all genes. In this study, true signal S and background signal B were simulated based on a negative binomial distribution and a Poisson distribution, respectively, with parameters estimated from real RNAseq data.
We followed a similar simulation framework used by edgeRrobust [5]. Firstly, genes from real RNAseq data were filtered based on the expression intensity across all replicates. The genes with top 10% dispersion were discarded. Then 5,000 genes were randomly selected with replacement among the filtered genes. Based on the mean and dispersion estimated from real RNAseq data, the true signal S was simulated for each gene from the negative binomial distribution. Different proportions of genes (10% and 30%) were randomly selected as differentially expressed genes with various fold changes (1.5, 2, 3, and 5). To simulate baseline background signal B_{ baseline }, firstly, the mean value was calculated using the nonexonic mapped reads of its corresponding gene. Then B_{ baseline } was generated from a Poisson distribution with parameter λ equals to the calculated mean value from the nonexonic read counts. Observed signal X_{ baseline } was then generated as the addition of S and B_{ baseline }.
To simulate background signal B_{ inc } with increased nonexonic mapped reads, we first calculated mean read count for each selected gene based on the nonexonic mapped reads from real RNAseq data. Then B_{ inc } was simulated by a hybrid model,
where \mu is from a Poisson distribution \mu ~Poisson\left(\lambda +NF\right), NF is the noise factor, which we set to be 0 (low), 7 (intermediate) and 20 (high). λ equals to the mean of the nonexonic mapped reads of a given gene, and we set σ = 3 and multiply M = 10 for our simulation. Finally, we set the observed signal X_{ inc } to be the addition of S and B_{ inc }. For noise models different from Poisson, we simply replaced Poisson with binomial, uniform or other distributions in Eq. 8. Simulations were performed 100 times in order to evaluate and plot the Receiver Operating Characteristic (ROC) curves and other statistics.
RNAseq data set for testing
A mouse RNAseq dataset were obtained from Gene Expression Omnibus (GSE61875) [10]. For this testing purpose, we selected 3 replicates of wild type mouse liver tissues (WT) and 3 replicates of Myc transgenic mouse liver tissues (MYC) for differential expression analysis to determine differential expressed genes due to the activation of Myc. Out of the six samples selected, on average, 12,781 (out of 22,609) genes have at least one nonexonic reads, and 15,973 genes have at least one exonic reads.
Comparison of other RNAseq algorithms
We also compared XBSeq with other differential expression analysis methods, including DESeq (1.14.0) [3], latest version of DESeq (DESeq2, 1.2.10) [11], edgeR(3.5.15) [4], latest version of edgeR (edgeRrobust 3.5.15) [5], baySeq (1.16.0) [12], limma (3.18.13) [13], and EBSeq (1.2.0) [14]. All these evaluation were performed under R version 3.0.2 and Bioconductor version 2.13. Detailed workflow of XBSeq and simulation is illustrated in Figure 2.
Results
Implementation of XBSeq algorithm
XBSeq requires two inputs, the observed measurement from exonic regions and the background noise from nonexonic regions. Both read counts can be obtained by submitting mapped sequence reads to HTSeq twice with coding gene annotation (e.g., gene.gtf) and shifted gene annotation (shiftgene.gtf) as discussed in Methods Section (Figure 2). As the first step, the true signal for each gene is estimated by using Eq. 3. The read counts of genes with negative true signals will be automatically assigned to 0. We apply a similar method as DESeq to normalize the true signal based on size factors calculated from each sample, and we then estimate the variance of the true signal for each gene by using Eq. 4. A similar framework for variance correction and the differential expression significance testing as DESeq (Eqs. 57) is applied to generate pvalues for each gene. The output of XBSeq contains p value, adjusted p value for multiple test correction, log_{2} fold change and other statistical merits for each gene.
To optimize the performance of XBSeq, we applied variance correction procedure either at observed signal level or at estimated true signal level. After examining the performance of these two choices based on simulated datasets, we concluded that variance correction at estimated true signal level yields the best performance with larger area under the ROC curve. We also investigated the performance of LOWESS as well as locfit R package for fitting variancemean relationship. Locfit was proven to generate more robust variancemean relationship based on simulated datasets. Therefore, we selected locfit package to estimate variancemean relationship and carry out variance correction procedure at estimated true signal level in the current XBSeq implementation.
Discrimination between DE and nonDE genes
To compare the performance of XBSeq with other statistical methods, including DESeq, we generated synthetic data where the variability and fold change of differential expression genes could be controlled. To simulate RNAseq data with baseline background noise, we generated signal and background read counts (nonexonic region mapped reads) with distribution parameters (negative binomial distribution for true expression signal and a Poisson distribution for background noise) estimated from a set of real RNAseq data. After removing not expressed genes, we randomly selected 5,000 genes and the number of differentially expressed (DE) genes were set to be 500, 1500, with 1.5 fold, 2 fold, 3 fold or 5 fold changes. To compare XBSeq and DESeq under circumstances with increased nonexonic mapped reads, we carried out the simulation to generate low, intermediate, and high levels of nonexonic mapped reads (see Methods section for detailed simulation procedure and parameter settings). We select DESeq algorithm to compare due to the similarity of statistical evaluation of differential expression levels.
The ability to discriminate between DE and nonDE genes was evaluated by area under the Receiver Operating Characteristic (ROC) curve (AUC). As shown in Figure 3a and Table 1, when the nonexonic mapped reads are at baseline level, the performance of XBSeq is indistinguishable with DESeq in terms of AUC. Specifically, when the fold change was set to 1.5 with 500 DE genes, both XBSeq and DESeq have equivalent performance with either 3 (AUC = 0.90 and 0.90, respectively) or 6 replicates (AUC = 0.96 and 0.97, respectively). As shown in Table 1, the number of differentially expressed genes had little effect over the performance of XBSeq and DESeq, and both XBSeq and DESeq performed equivalently at different fold changes. When fold change is set to 3 or 5, both methods had no difficulty in detecting virtually all DE genes, with AUC almost equals to 1 (Supplementary Table S2). Overall, under baseline level, the performance of XBSeq and DESeq is comparable.
With increased background noise, as shown Table 2, as we expected, more replicates per test group performed better in terms of AUC, but the performance decreased as with the increase of the background noise level, but overall, XBSeq has a larger AUC comparing with DESeq in various conditions (AUC_{XBSeq}  AUC_{DESeq} ~ 0.11, on average. See Table 2). For instance, with 6 replicates per test condition, XBSeq achieved AUC of 0.89 with lower background (lower nonexonic read count), while AUC_{DESeq} was only 0.77. Figures 3b depicts the ROC under 3 different background level and with 3 replicates per test group, and XBSeq evidently outperformed DESeq as we expected when XBSeq utilized additional nonexonic read count information to estimate the true signal. Further examination under highly expressed (>75% quantile) (Figure 3c, Supplementary figure S1) and lowly expressed genes (<25% quantile,) (Figure 3d, Supplementary figure S1) condition revealed that among highly expressed genes XBSeq performs only slightly better than DESeq, while XBSeq has much better AUC than DESeq among lowly expressed genes (Supplementary table S3), indicating the importance of background estimation for true signal estimation. We also compared the performance of XBSeq with DESeq (DESeq2), edgeR and edgeR (edgeR_robust), baySeq, limma and EBSeq (Figures 3e and 3f) under high background (high nonexonic read count). As demonstrated in Figure 3e (3 replicate RNAseq per test group) and 3f (6 replicate RNAseq per test group), XBSeq outperformed all the other methods (AUC_{XBSeq} = 0.73, other methods around 0.64, Supplementary Table S4). Overall, XBSeq and DESeq performs comparable at baseline level. When the background noise increases, XBSeq is more robust than DESeq and some other differential expression analysis methods, especially when expression level of the gene is low.
Control of the false discoveries
We examined the number of false discoveries encountered among the topranked genes based on p values of different statistical methods. Under baseline level, XBSeq performs comparable with DESeq with similar number of false discoveries (Figure 4a). Under the scenarios of increased background noise, XBSeq has much less false discoveries compared with DESeq (Figure 4b). Taking the similar approach, we examined the performance of XBSeq and DESeq among highly expressed (>75% quantile) (Figure 4c) and lowly expressed genes (<25% quantile) (Figure 4d) with 3 replicates per test group. Even through that DESeq picks up similar number of false discoveries among highly expressed genes compared with XBSeq, it has more false discoveries among lowly expressed genes. Comparisons with some other differential expression methods under high background noise level showed that XBSeq is more robust against false discoveries (Figures 4e and 4f). At the preselected threshold (pvalue equals to 0.05), XBSeq has a false discovery around 0.3 which is much less than other statistical methods (around 0.75, Fig. S2). Overall, XBSeq is more robust against false discoveries compared to other statistical methods especially for lowly expressed genes.
Statistical power in detecting DE genes
We compared the performance of XBSeq and DESeq as well as other statistical methods in terms of statistical power at preselected threshold (p = 0.05) in detecting DE genes (Figure 5). At baseline level, XBSeq and DESeq perform comparable (statistical power = 0.58 and 0.91 under 1.5 and 2.0 fold change, respectively) (Figure 5a and Supplementary Tables S1&2). With the increase of the background noise, both methods have decreased statistical power (Figure 5b and Supplementary Table S3). Even so, XBSeq still has better statistical power at low, intermediate or high background noises. However, different from the result of AUC and false discovery, the statistical power difference between XBSeq and DESeq algorithms among highly expressed genes (>75% quantile, Figure 5c) and lowly expressed genes (<25% quantile, Figure 5d) are relatively same: both methods perform relatively well (XBSeq is slightly better), and both methods have difficulty in detecting lowly expressed DE genes (with statistical power for XBSeq and DESeq a merely 0.05 and 0.06 respectively, with high background noise). When comparing with some other statistical methods, XBSeq is one of the best methods in terms of statistical power along with algorithms such as edgeRrobust and DESeq2 (Figures 5e and 5f, Supplementary Table S4). However, with more samples per test group, DESeq2, edgeR and edgeRrobust outperformed XBSeq, due to their robust (or moderate) dispersion estimation. Overall, XBSeq remains one of the best algorithms in terms of statistical power in detecting DE genes compared to other statistical methods.
Application to differential expression analysis for MYC induced gene expression in mouse liver tissues
We have applied a real mouse RNAseq dataset to different algorithms to test the performance of XBSeq [10]. The mouse RNAseq dataset includes 3 replicates of wild type mouse (WT) and 3 replicates of Myc transgenic mouse (MYC). Figure 6 shows a Venn diagram of overlapping and total number of genes detected using XBSeq, DESeq, DESeq2, edgeR, and edgeR_robust with the criterion of pvalue less than 0.05. XBSeq and DESeq detected similar number of DE genes (446 and 452, respectively, and 414 of them in common, Supplementary Table S5), even if a 1.5 fold change cutoff was added. This is reasonable since the two algorithms share similar differential expression significant test statistics, other than nonexonic read count incorporated into XBSeq. In order to see the different results of XBSeq comparing with others, we listed the exonic and nonexonic read counts in Table 3 and 4, which show genes exclusively found by XBSeq (Table 3) and genes that are not detected by XBSeq but by other algorithms (Table 4). The venn diagram with 1.5 fold change cutoff added is shown in Fig. S5 and the numbers of overlapped DE genes between XBSeq and other methods are listed in table S5.
Specifically, in Table 3, from only exonic mapped reads, there are no significant difference between WT and MYC samples for genes Nat1 and Brca2 (fold change is 1.3, and 0.9 respectively). After subtracting the nonexonic mapped reads, the predicted true signals was significantly differentially expressed between MYC and WT for the two genes (pvalue = 0.0495, 6 × 10^{6} for Nat1 and Brca2, respectively) with fold change increased to 1.6 and 4.3 for Nat1 and Brca2, respectively, along with shrunken standard deviation. From Brca2, we can even see that the high dispersion in WT samples are possibly caused by sequencing noises, the gene is barely expressed in WT group as predicted by XBSeq.
On the other hand, with the information in nonexonic regions, XBSeq avoided picking genes that are potentially falsely identified as DE genes because of the background noises. The estimate of the true signal, after considering the nonexonic read counts, may decreased the differential expression and diminish the significance probability. In Table 4, the two genes, Calcr and Adh7, were detected by all other four methods, except DESeq for Adh7 (pvalue = 0.06), whereas XBSeq tested on the true expression signals and considered them as insignificant changes (pvalue = 1 and 0.53, respectively).
Discussion
Sources of nonexonic mapped reads
Previous studies have shown that nonexonic mapped reads account for about 4~6% of all uniquely mapped reads in mammals [15, 16]. Sequence reads that mapped outside of exonic regions might be originated from different RNA sources depending on the RNAseq experimental protocol selected for sequencing library preparation. In addition, these nonexonic reads might be derived from experimental artifacts, like genomic DNA contamination, sequencing errors, or unprocessed RNAs, like premRNAs [17], or even noncoding RNAs [18]. van Baker, et al [19] has also demonstrated that most of the nonexon mapped reads are associated with the nearby known genes, which suggests that nonexon mapped reads are contextually specific to the corresponding gene. Besides, Hebenstreit, et al [20] has shown that all genes from RNAseq can be classified into two distinct groups, and one of them is the the lower expressed group that consists of putative nonfunctional mRNAs. All these suggested the biological relevance of incorporating information from nonexonic regions. We have carefully evaluated the aforementioned biological relevance of RNA species in nonexonic regions, and thus necessary steps have been taken for identifying nonexonic regions as discussed in the Methods Section. As we showed in Figure 1B in one of our real RNAseq data set, the nonexonic read counts for all genes are mostly less than 0 (log2 RPKM unit), indicating little or no influence from highexpression. We also examined the correlation between the reads mapped to exonic regions and the reads mapped to nonexonic regions in a real RNAseq experiment. The average correlation is 0.32 which potentially indicates that the nonexonic reads are not 'functional' reads which can be used to represent the background noise. By measuring the reads mapped to the nonexonic regions of its corresponding gene but carefully avoid those functional relevant regions, we are able to gain a more reliable estimation of true expression level by eliminating the impact of background noise.
Compare of XBSeq and DESeq at baseline level
Our simulation at baseline level suggests that XBSeq and DESeq's performance are virtually indistinguishable in terms of AUC, number of false discoveries and statistical power at baseline level. Not surprisingly, comparison with 6 replicates per test group performed better than 3 replicates per test group even with low level of fold change (fold change at 1.5). Also at baseline background level, the number of truly differentially expressed genes has little effect on the performance of XBSeq and DESeq except with the number of false discoveries. As expected, simulation of 30 percent of true DE genes is more likely to generate false positives than those of 10% true DE genes. However, this effect is dampened with increased number of replicates per test group for differential expression analysis (Table S2).
Comparison of statistical methods with increased nonexonic mapped reads
To further demonstrate the robustness of XBSeq at different background levels, we simulated genes with low, intermediate or high levels of nonexonic mapped reads. XBSeq outperform DESeq with larger AUC (Figures 2b2d), and with better controlling of false discoveries (Figures 3b3d). While we achieved excellent true positive detection, we also examined the false negative rate for XBSeq and DESeq, or the statistical power. As shown in Figures 4b4d, XBSeq outperform DESeq's statistical power in detecting DE genes in all three increased background levels. All these suggest that XBSeq is more robust in detecting DE genes in noisy NGSseq samples.
We also compared performance of XBSeq with some other RNAseq algorithms, including DESeq2 and edgeRrobust. XBseq excelled in overall performance (better AUC, Figures 3e & 3f) and false discovery control (Figure 4yres 4e and 4f). XBSeq is also one of the best algorithms in terms of statistical power (Figures 5e and 5f), indicating a modest tradeoff in false negative while maintaining overall performance in DE detection. Moreover, XBSeq performed better with higher nonexonic mapped reads (in all three simulated increased background noise level above baseline). Further examination of algorithms performance among higher (> 75% percentile) or lower expressed genes (< 25% percentile) revealed that false positive genes were mostly generated among lower expressed genes. Among lower expressed genes, XBSeq outperformed than DESeq with better AUC and false discovery. However, both XBSeq and DESeq performed poorly among lowexpression genes.
Simulations were carried out with the assumption of NB model for gene expression and Poisson model for background noise (or nonexonic read counts), which is the model XBSeq is built on. While we demonstrated the validity of the assumption in one of our RNAseq data set (Figure 1B), we also tested the XBSeq under the uniform or normal model for background read counts while keeping the same NB model for gene expression. As shown in Fig. S2A, for all 3 background distributions (including Poisson model), XBSeq performed better than DESeq, indicating the robustness even when underlying assumption is deviated from Poisson distribution. Another way to examine model bias is by examining the distribution of p value under null hypothesis (with no DE genes). As shown in Fig S3, the p values generated by most differential expression analysis algorithms are close to uniform distribution from 0 to 1 (showed only 0 to 0.2 in Figs S3), with the exception of EBSeq.
Execution time of XBSeq algorithm
XBSeq algorithm complexity is similar to DESeq. However, XBSeq will not only take read counts from each gene, but also read counts from nonexonic read counts for each gene, and then perform true signal prediction before evaluating differential expression significance. We benchmarked a set of differential analysis algorithms for their computational times with different number of samples in each condition (Fig. S4). BaySeq algorithm requires the most computational time, followed by DESeq and XBSeq.
Conclusions
We have developed an approach to take into consideration of nonexonic mapped reads as sequencing noise for precise differential expression analysis. When there is no or very low background noise, the performance of XBSeq is similar to DESeq. However, XBSeq excels when background noise (nonexonic read counts) are higher due to the model estimation of true signal by removing the noise impact. Overall, XBSeq algorithm is shown to be more robust than existing differential expression analysis methods particularly when sequencing noise is a concern.
Availability of supporting data
The R package of XBSeq, the shiftgene gtf files as well as reproducible scripts for simulation are available from GitHub, https://github.com/Liuy12/XBSeq.
Abbreviations
 AUC:

Area Under the Curve
 DE:

Differential Expression
 NB:

Negative Binomial (distribution, model)
 NGS:

Next Generation Sequencing
 ROC:

Receiver Operating Characteristic
 RPKM:

Reads Per Kilobase of transcript per Million reads mapped
References
Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM: Efficient experimental design and analysis strategies for the detection of differential expression using RNASequencing. BMC genomics. 2012, 13: 48410.1186/1471216413484.
Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical applications in genetics and molecular biology. 2004, 3: Article3
Anders S, Huber W: Differential expression analysis for sequence count data. Genome biology. 2010, 11 (10): R10610.1186/gb20101110r106.
Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26 (1): 139140. 10.1093/bioinformatics/btp616.
Zhou X, Lindsay H, Robinson MD: Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic acids research. 2014, 42 (11): e9110.1093/nar/gku310.
Ameur A, Zaghlool A, Halvardson J, Wetterbom A, Gyllensten U, Cavelier L, Feuk L: Total RNA sequencing reveals nascent transcription and widespread cotranscriptional splicing in the human brain. Nat Struct Mol Biol. 2011, 18 (12): 14351440. 10.1038/nsmb.2143.
Anders S, Pyl PT, Huber W: HTSeq  A Python framework to work with highthroughput sequencing data. 2014
Johnson NL, Kemp AW, Kotz S: Univariate discrete distributions. 2005, Hoboken, N.J.: Wiley, 3
Vose D: Risk analysis : a quantitative guide. 2008, Chichester, England ; Hoboken, NJ: Wiley, 3
Srivastava J, Siddiq A, Gredler R, Shen XN, Rajasekaran D, Robertson CL, Subler MA, Windle JJ, Dumur CI, Mukhopadhyay ND, et al: Astrocyte elevated gene1 (AEG1) and cMyc cooperate to promote hepatocarcinogenesis. Hepatology. 2014
Love MI, Huber W, Anders S: Moderated estimation of fold change and dispersion for RNASeq data with DESeq2. 2014
Hardcastle TJ, Kelly KA: baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC bioinformatics. 2010, 11: 42210.1186/1471210511422.
Law CW, Chen Y, Shi W, Smyth GK: voom: precision weights unlock linear model analysis tools for RNAseq read counts. Genome biology. 2014, 15 (2): R2910.1186/gb2014152r29.
Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag JD, Gould MN, Stewart RM, Kendziorski C: EBSeq: an empirical Bayes hierarchical model for inference in RNAseq experiments. Bioinformatics. 2013, 29 (8): 10351043. 10.1093/bioinformatics/btt087.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nature methods. 2008, 5 (7): 621628. 10.1038/nmeth.1226.
Ponting CP, Belgard TG: Transcribed dark matter: meaning or myth?. Human molecular genetics. 2010, 19 (R2): R162168. 10.1093/hmg/ddq362.
Schwartz S, Oren R, Ast G: Detection and removal of biases in the analysis of nextgeneration sequencing reads. PloS one. 2011, 6 (1): e1668510.1371/journal.pone.0016685.
Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, et al: Landscape of transcription in human cells. Nature. 2012, 489 (7414): 101108. 10.1038/nature11233.
van Bakel H, Nislow C, Blencowe BJ, Hughes TR: Most "dark matter" transcripts are associated with known genes. PLoS biology. 2010, 8 (5): e100037110.1371/journal.pbio.1000371.
Hebenstreit D, Fang M, Gu M, Charoensawan V, van Oudenaarden A, Teichmann SA: RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Molecular systems biology. 2011, 7: 497
Acknowledgements
This research was supported in part by the Genome Sequencing Facility of the Greehey Children's Cancer Research Institute, UTHSCSA, which provided RNAseq service. Fundings for this research were provided partially by the Cancer Prevention and Research Institute of Texas (RP120685C2 and RP120715C4) YC, National Institutes of Health Cancer Center Shared Resources (NIHNCI P30CA54174) to YC, NIGMS (R01GM113245) to YH and YC, and the National Science Foundation (CCF1246073 to YH).
Declaration
The publication costs for this article were funded by the aforementioned CPRIT grants to YC.
This article has been published as part of BMC Genomics Volume 16 Supplement 7, 2015: Selected articles from The International Conference on Intelligent Biology and Medicine (ICIBM) 2014: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S7.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
Authors declare no competing interest in preparing the paper and developing the software associated to this paper.
Authors' contributions
All authors contribute to the manuscript. HHC, YL, YH and YC conceived and designed the study. HHC and YC implemented the algorithm in Matlab and YL migrated it to R. YL carried out the simulation procedure. YZ designed nonexonic regions for human and mouse, DS contributed biological samples and ZL and YC performed sequencing and basic bioinformatics analysis. All authors read and approved the final manuscript.
HungI Harry Chen, Yuanhang Liu contributed equally to this work.
Electronic supplementary material
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Chen, HI.H., Liu, Y., Zou, Y. et al. Differential expression analysis of RNA sequencing data by incorporating nonexonic mapped reads. BMC Genomics 16 (Suppl 7), S14 (2015). https://doi.org/10.1186/1471216416S7S14
Published:
DOI: https://doi.org/10.1186/1471216416S7S14