Evaluation of read count based RNAseq analysis methods

Background RNAseq technology is replacing microarray technology as the tool of choice for gene expression profiling. While providing much richer data than microarray, analysis of RNAseq data has been much more challenging. To date, there has not been a consensus on the best approach for conducting robust RNAseq analysis. Results In this study, we designed a thorough experiment to evaluate six read count-based RNAseq analysis methods (DESeq, DEGseq, edgeR, NBPSeq, TSPM and baySeq) using both real and simulated data. We found the six methods produce similar fold changes and reasonable overlapping of differentially expressed genes based on p-values. However, all six methods suffer from over-sensitivity. Conclusions Based on the evaluation of runtime using real data and area under the receiver operating characteristic curve (AUC-ROC) using simulated data, we found that edgeR achieves a better balance between speed and accuracy than the other methods.


Background
The process by which information from a gene is used in the synthesis of a functional gene product is called gene expression. The process of gene expression is used by all known life including eukaryotes, prokaryotes, and viruses to generate the macromolecular machinery for life. Gene expression analysis is essential for biomedical research. The introduction of microarray technology has helped biomedical research make significant advances in the last decade by allowing high-throughput gene expression screening on all known genes. Recently, the introduction of RNAseq technology has had a revolutionary impact on the field of expression research. RNAseq refers to the use of next-generation sequencing (NGS) technologies to sequence cDNA in order to get information about a sample's RNA content. Compared to microarray technology, the RNAseq method offers several distinct advantages. First, the detection range of RNAseq is not limited to a set of predetermined probes as with microarray technology, so RNAseq is capable of identifying new genes.
Second, the resolution of a microarray is limited to the gene level for most arrays and the exon level for specially designed exon arrays. On the other hand, RNAseq can detect expression at the gene, exon, transcript, and coding DNA sequence (CDS) levels. Finally and most importantly, RNAseq can detect structural variants such as alternative splicing and gene fusion. With the maturity of NGS technologies, the price of RNAseq has become comparable to microarrays. Many researchers have predicted the inevitable replacement of microarray by RNAseq [1][2][3] based on the competitive price and additional genomic information contained in RNAseq data.
With the rich genomic information RNAseq technology brings, it also comes with complication in the analysis phase. To date, the research community has not yet come to a consensus on the best approach for analyzing RNAseq data. The most popular normalization method for microarray data is Robust Multi-array Average (RMA) [4], a form of quantile normalization. For RNAseq, one of the popular normalization methods is Reads per Kilobase per Million mapped reads (RPKM) [5] or Fragments Per Kilobase of transcript per Million mapped reads (FPKM) [6]. The RPKM of a gene is computed as follows: RPKM = 10 9 × C N * L , where C is the number of reads mapped to the gene, N is the total number of reads mapped to all genes, and L is the length of the gene. FPKM is computed similarly to RPKM, except it accounts for the scenario in which only 1 end of a pairedend read is mapped. In addition to RPKM and FPKM, other read count methods based on Possion, negative binomial, and Bayesian methods also exist. Each method has unique strengths and weaknesses. In this study, we focus on read count-based methods and systematically evaluate 6 RNAseq R packages including DESeq [7], DEGseq [8], edgeR [9], bay-Seq [10], TSPM [11] and NBPSeq [12] using both real and simulated data. BaySeq is considered an empirical Bayes approach to detect patterns of differential expression, DESeq and NBPSeq are based on a negative binomial model, DEGseq and TSPM are based on a Poisson model, and edgeR uses empirical Bayes estimation and exact tests based on the negative binomial.

Method
The real RNAseq datasets were selected from The Cancer Genome Atlas (TCGA) [13,14]. TCGA is a massive, comprehensive, and collaborative project to catalogue genomic data for over 20 types of cancers by the National Cancer Institute (NCI), the National Human Genome Research Institute (NHGRI), and 27 institutes and centers of the National Institute of Health (NIH). Gene expression profiling by RNAseq is one of the major components of genomic data collected by TCGA. Breast cancer is the only cancer type in TCGA that collected expression data on a large quantity of tumor-normal paired samples. Thus we selected breast cancer tumor-normal paired data (53 pairs) as our primary source of real RNAseq data. Differentially expressed genes between tumor and normal were identified using all six methods at the significance level of 0.05 with Benjamin-Hochberg False discovery rate (BH FDR) adjustment. To evaluate the consistency between the six methods, we computed pairwise Spearman's correlations as well as intraclass correlation (ICC) for fold change values of all genes, along with the corresponding p-values. In addition, we evaluated each method's running time.
For each gene, the count is drawn from the negative binomial distribution with the mean and dispersion parameters estimated from the TCGA breast cancer dataset. For a given gene m, the fold change is calculated as (ρ m + ρ * ) d m where ρ m is drawn from the gamma distribution with shape parameter 0.87 and rate parameter 1.36 (parameters are estimated from the TCGA breast cancer dataset), ρ * is the lower bound of fold change, and We evaluated the methods using datasets simulated to present different scenarios corresponding to a given combination of the following parameters: sample size (5 or 10), proportion of differentially expressed genes (5% or 10%), ratio of up-regulated vs. down-regulated (1:1 or 3:1), lower bound of fold change (1.5 or 1.1), and lower bound of depth (5 or 1). A total of the seven most representative scenarios are shown in Table 1. For each scenario, 30 datasets were simulated from a negative binomial distribution. To evaluate the performance of the six methods, we calculated the number of genes that are significantly differentially expressed (N S ), the rate of false positives (FPR), and the rate of true positives (TPR) across 30 simulation datasets for each scenario. False discovery rate (FDR) at levels of 0.1, 0.05 and 0.01 were used as the cutoffs. In order to measure the overall performance, we also computed the area under the curve (AUC) across 30 simulation datasets under each scenario.

Results and discussion
We performed differential expression analysis on the 53 tumor-normal paired samples from TCGA using all six methods. Out of 16,146 genes, DEGseq identified the most number of significant genes at 15,226. That is, 94.3% of all genes were identified by DEGseq as statistically significantly differentially expressed between tumor and normal, which implies that DEGseq is over-sensitive. DESeq on the other hand identified the least number of differentially expressed genes with 7,171, which is still too many to be biologically plausible. NBPSeq, edgeR, TSPM, and baySeq identified 10,017, 10,457, 9,519, and 13,203 differentially expressed genes respectively ( Figure 1). In terms of up-regulated genes vs. down-regulated genes, DESeq, edgeR, and NBPSeq identified more up-regulated genes, while TSPM and DEGseq identified more downregulated genes. BaySeq does not provide fold change or test statistics, thus no direction of gene expression change can be determined through p-value alone. A unique characteristic associated with baySeq is that there is a randomization factor built into its computation model. Using baySeq to analyze the same dataset on same parameter settings twice will generate slightly different results. In our scenario, the difference can be as many as 100 genes.
Next we computed the overlap of differentially expressed genes between methods for both up-regulated and down-regulated genes. BaySeq was excluded from this analysis due to its lack of directionality ( Figure 2). For down-regulated genes, 2,521 were identified by the rest of five methods, and for up-regulated genes, 4,067 were identified by all five methods. DESeq, edgeR, and NBPSeq observed the fewest singleton genes (defined as genes identified by only 1 method) in terms of both upregulated and down-regulated genes. TSPM observed a moderate number of singleton genes, and DEGseq observed the most, which is also a reflection of its oversensitivity.
We studied the relationship of fold change and p-value vs. identification frequency of significantly expressed genes detected by the six methods ( Figure 3). As expected, genes identified by all six methods had a stronger signal (e.g., a higher fold change value) than genes identified by fewer methods. The singleton genes had the lowest fold change values (Figure 3 left). However, for p-value the pattern was not clear. No significant association between p-value and identification frequency was observed.
We also tested the consistency among the six methods at overall significance levels. Pair-wise Spearman's correlation coefficient of fold change values was computed ( Figure 4). Most correlation coefficients are very close to 1, which indicates that the six methods are highly consistent in terms of ranking genes according to fold change despite difference in normalization method. We also computed intraclass correlation coefficients (ICC) using both fold change and p-value as well. The ICC based on fold change was 0.975 which strongly suggests that a high degree of agreement among the five methods (excluding baySeq because no fold change can be calculated) as the fold change value is only affected by different normalization techniques used by different methods ( Figure 5) However, the ICC for p-value was only 0.50, indicating that these methods have a low agreement in terms of ranking the differentially expressed genes. In addition, baySeq produced the worst p-value correlation with regard to the other methods. To assess accuracy, one has to know a priori which genes are differentially expressed between normal and tumor. Hence, we can only calculate it in a simulation.
For simulated data, we compared the number of genes that are significantly differentially expressed. In contrast to the results from real data, baySeq identified the smallest number of differentially expressed genes, while DEGseq identified largest number of differentially expressed genes, significantly more than other methods (Table 2). DESeq, edgeR and NBPSeq identified similar numbers of differentially expressed genes. The similar performance of DESeq, edgeR, and NBPseq is probably due to the fact that these methods are based on similar principles except the estimation of the dispersion parameter. Among all six methods, baySeq has the smallest FPR, and DEGseq has the largest FPR. This indicates that baySeq was more conservative compared to other methods. Among DESeq, edgeR and NBPSeq, the largest FPR was found by NBPSeq, followed by DESeq or edgeR. In general, the FPR of DESeq was smaller than the FPR of edgeR. For TPR, the largest TPR was found by DEGseq (i.e. it also found too many false positives), followed by TSPM. As expected, DESeq, edgeR and NBPSeq obtained similar results in TPR. After comparing the results under scenarios I and II, we found that sample size has a large effect, especially for TPR. As sample size decreases, N S decreases, FPR increases, and TPR decreases, respectively. TSPM shows the strongest effects on sample size among the methods. After comparing the results under scenarios I and III, we found that the proportion of differentially expressed genes has a relatively small effect on FPR and TPR for all methods. As the proportion decreases, all of N S , FPR, and TPR decrease. After comparing the results under scenarios I and IV, we found that the proportions of up-regulated and down-regulated genes have a large effect on N S and FPR for TSPM and NBPSeq. After comparing the results under scenarios I, V, and VI, we found that the level of treatment effects and depth of reads have small effects for all of the methods under reasonable sample size. Comparing the results  under scenarios I and VII, all methods are sensitive to the sample size, level of treatment effects, and depth.
In Table 3, we summarized the average AUC-ROC for each scenario. For each scenario, DEGseq has less satisfying performance compared to the other methods. Under scenarios II, V, and VII, the average AUC-ROC of baySeq was the largest. Those three scenarios were simulated for the dataset with small sample size or small treatment effects. This indicates that baySeq is the best performing method when the dataset has a small sample size or small treatment effect. Under other scenarios (i.e. I, III, IV, VI), the best performance was generally obtained by edgeR and DESeq, followed by NBPSeq and TSPM.
We also measured the runtime for each method running on the TCGA breast cancer dataset. On a Dell 3500 with a 2.8GHz CPU and 12 GB RAM, DESeq took the most time to finish with over 2 hours followed by NBPSeq (77 minutes), baySeq (47 minutes). TSPM and edgeR took significantly less time in comparison with 6 and 2 minutes, respectively. DEGseq was the speediest of the six, finishing in only 12 seconds. Combining information from both runtime and AUC, edgeR performed best with relatively short runtime and good AUC-ROC.

Conclusions
In this study, we systematically evaluated six read countbased RNAseq analysis methods using both real and simulated data. BaySeq is the most unique method out of the six, because it is based on an empirical Bayesian approach which produces no fold change or test statistics  to infer the direction of the gene expression difference. This inconvenient feature of baySeq forces users to go back to the raw expression value to determine the direction. Another unique characteristic of baySeq is the randomization steps involved in its model. For the same data with the same parameter settings, baySeq produces slightly different results. Granted, genes with the strongest fold change will always be detected by baySeq, but this minor inconsistency between runs of baySeq does produce some inconvenience. From our simulation study, we found that the performances of DESeq, edgeR, and NBPSeq are similar in most cases. This result is expected because all three methods are based on a negative binomial model and differ principally in the way the dispersion parameter is estimated.
One of the issues we observed with these six read count-based RNAseq analysis methods is that they tend to be over-sensitive. For a traditional microarray study, the number of differentially expressed genes identified by a simple method such as a t-test is highly dependent on the sample size. We have tested these six methods using smaller sample sizes such as 2 vs. 2, but the majority of the methods still produced huge amounts of differentially expressed genes (sometimes over 50% of total genes), which is a clear sign of over-sensitivity. Through more thorough analysis, we found that the five methods (excluding baySeq) produced very similar fold changes but less similar ranks of p-value. Given these facts, we would recommend using the combined criteria of fold change and p-value to filter out more false positives. Combined with evaluation of runtime from real data and AUC-ROC from simulated data, edgeR achieved a good balance between speed and accuracy.
We are far from reaching a consensus on the best RNAseq analysis approach. Several unique characteristics of RNAseq data contribute to the difficulty of RNAseq data analysis. The value for non-expression in RNAseq is zero, which means there are no reads aligned to the gene. In microarray, there is always background intensity for non-expressed genes, thus we can always take log transformations of the intensity data. In contrast, due to the large number of zeros in RNAseq data (often more than 50%), we cannot take a log transformation. The range of RNAseq data is between 0 and 50,000, compared to microarray's 2 to 15 after RMA normalization. This huge range of RNAseq data can result in many false high fold changes. Also, there are many sequencing and alignment artifacts that can skew