 Research article
 Open Access
 Published:
Choice of library size normalization and statistical methods for differential gene expression analysis in balanced twogroup comparisons for RNAseq studies
BMC Genomics volume 21, Article number: 75 (2020)
Abstract
Background
Highthroughput RNA sequencing (RNAseq) has evolved as an important analytical tool in molecular biology. Although the utility and importance of this technique have grown, uncertainties regarding the proper analysis of RNAseq data remain. Of primary concern, there is no consensus regarding which normalization and statistical methods are the most appropriate for analyzing this data. The lack of standardized analytical methods leads to uncertainties in data interpretation and study reproducibility, especially with studies reporting high false discovery rates. In this study, we compared a recently developed normalization method, UQpgQ2, with three of the most frequently used alternatives including RLE (relative log estimate), TMM (Trimmedmean M values) and UQ (upper quartile normalization) in the analysis of RNAseq data. We evaluated the performance of these methods for genelevel differential expression analysis by considering the factors, including: 1) normalization combined with the choice of a Wald test from DESeq2 and an exact test/QL (Quasilikelihood) FTest from edgeR; 2) sample sizes in two balanced twogroup comparisons; and 3) sequencing read depths.
Results
Using the MAQC RNAseq datasets with small sample replicates, we found that UQpgQ2 normalization combined with an exact test can achieve better performance in term of power and specificity in differential gene expression analysis. However, using an intragroup analysis of false positives from real and simulated data, we found that a Wald test performs better than an exact test when the number of sample replicates is large and that a QL Ftest performs the best given sample sizes of 5, 10 and 15 for any normalization. The RLE, TMM and UQ methods performed similarly given a desired sample size.
Conclusion
We found the UQpgQ2 method combined with an exact test/QL Ftest is the best choice in order to control false positives when the sample size is small. When the sample size is large, UQpgQ2 with a QL Ftest is a better choice for the type I error control in an intragroup analysis. We observed read depths have a minimal impact for differential gene expression analysis based on the simulated data.
Background
Highthrough RNA sequencing (RNAseq) has been increasingly used in the studies of genomics and transcriptomics over the last decade [1, 2]. Unlike cDNA microarray technology, RNAseq has wide applications for the identification of novel genes or transcripts, mutations, gene editing and differential gene expression [1, 3,4,5,6,7]. Recent clinical studies demonstrated the utility of RNAseq in identifying complex disease signatures via transcriptome analysis [8, 9]. Despite this utility and importance, optimal methods for analyzing RNAseq data remain uncertain.
For each sample in an RNAseq experiment, millions of reads with a desired read length are mapped to a reference genome by alignment tools such as Bowtie2/TopHat2, STAR and HISAT2 [10,11,12,13,14]. The mapped reads for each gene or transcript are subsequently used to quantify its expression abundance. However, the sample read depth typically varies from one sample to another and a direct comparison of gene expression between samples cannot be performed. Thus, normalization and proper test statistics are critical steps in the analysis of RNAseq data [15].
Normalization of RNAseq read counts is an essential procedure that corrects for nonbiological variation of samples due to library preparation, sequencing read depth, gene length, mapping bias and other technical issues [16,17,18,19,20]. This ensures proper modeling of biological variations to directly compare and accurately detect expression changes between sample groups. Currently, a number of normalization methods are available to correct for technical variations and biases. These include methods to correct for read depth and transcript lengths, most commonly formulated as RPKM (Reads Per Kilobase per Million mapped reads) and FPKM (Fragments Per Kilobase per Million mapped fragments), which have been implemented in DEGSeq and CufflinksCuffDiff [7, 19, 21, 22]. Other global scaling quantile normalization methods consider either a TC (persample total counts) [23], UQ (persample 75% upper quartile Q3) [17], Med (persample Median Q2) [23], or Q (full quantile) implemented in Aroma.light [24]. More complex methods based on a size factor imputed include RLE normalization as implemented in DESeq2/DESeq and TMM implemented in edgeR for correcting read depth bias [16, 25, 26]. Still other methods normalize by the expression of control genes such as RUV for removing unwanted technical variation across samples [17, 27], GCcontent [28], or log_{2} transformed read counts implemented in voomlimma [24, 29]. In addition to these traditional normalization methods, two abundance estimation normalization methods have been recently developed. One is called RNAseq by ExpectationMaximization using a directed graph model (RSEM) [30] and the other is Sailfish which is an alignmentfree abundance estimation using kmers to index and count RNAseq reads [31]. More recently we developed a method called UQpgQ2 (pergene Q2 normalization following persample upperquartile global scaling at 75 percentile) for correcting library depths and scaling the reads of each gene into the similar levels across conditions [18, 32].
A number of studies have compared these normalization methods and their impact on the downstream analysis for identification of differentially expressed genes (DEGs) (Table 1). Briefly, the earliest comparison studies reported that UQ normalization followed by an exact test/LRT significantly reduced the length bias of DE from RPKM relative to quantitative RealTime polymerase chain reaction (qRTPCR) [17] and baySeq with UQ normalization had the highest true positive rates with low false positive rates (FPRs). The observed false discovery rate (FDR) from edgeR, DESeq and TSPM methods was higher than the expected rate of 0.05, while TSPM performs the worst when sample sizes were as small as two [33]. In contrast, Rapaport et al. reported that no single method was favorable in all comparisons. They observed that baySeq with UQ normalization was the least correlated with qRTPCR, CufflinksCuffDiff had an inflated number of false positive predictions and voomlimma package had comparable performance as DESeq and edgeR [34]. Moreover, a recent study based on a Spearman correlation analysis between read counts and qRTPCR for the two abundance estimation methods (Sailfish and RSEM) revealed that raw counts (RC) or RPKM seemed to be adequate due to inconsistent results from Sailfish and RSEM, suggesting that normalization methods are not necessary for all sequence data [35]. An extensive evaluation performed by Dillest et al. found that an exact test combined with DESeq/TMM normalization was the best for controlling the FDR below 0.05 for highcount genes while RPKM, TC and Q normalization were suggested to be abandoned [23]. Moreover, several studies summarized that DESeq was often too conservative, edgeR, NBPSeq, and EBSeq were too liberal, and voom/vstlimma had good type I error control with low power for small sample sizes [36,37,38,39]. These studies concur that DESeq is preferred for controlling the number of false positives while edgeR with TMM is slightly preferable for controlling false negatives by achieving higher sensitivity.
Since DESeq with an exact test was overly conservative, DESeq2 with a Wald test was developed for improving the sensitivity/power [25]. Subsequently, a comparison of RLE normalization from DESeq2 with other existing methods was performed by several studies (Table 1). In one of these studies, a threegroup comparison calculated the area under a Receiver Operating Characteristic (ROC) curve and recommended edgeR for count data with replicates while DESeq2 with RLE normalization was recommended for data without replicates [40]. Another study reported that voom and edgeR were generally superior to other methods for controlling the FDR with replicates of 3 and 5, but voom significantly underperformed in transcriptlevel simulation [41]. In contrast, another study reported that TMM, RLE and MRN gave the same results for a twocondition comparison without replicates [42] while limmaVoom, NOIseq and DESeq2 had more consistent results for DEG identification [43]. A recent study using RNAseq time course data found DESeq2 and edgeR with a pairwise comparison outperformed TC tools for short time course (< 8 time points) due to high FPRs, but they performed worse on longer time series than splineTC and maSigPro tools [44].
Taken together, these studies showed that TMM and/or RLE associated with edgeR and DESeq2 outperformed the others in terms of overall performance on sensitivity and specificity [17, 23, 33, 34, 36, 37, 39,40,41, 43, 44]. However, these studies also reported that TMM and UQ normalizations were too liberal or oversensitive, resulting in a large number of false positives, while RLE implemented in DESeq with an exact test was too conservative [23, 36, 37]. A recent study concluded that RLE/DESeq2 with a Wald test improves sensitivity compared with a previous version of RLE/DESeq with an exact test. But this comes with a tradeoff for a relatively higher FPR [25]. Later studies reported that the actual FDR produced from TMM/edgeR with an exact test, and RLE/DESeq2 with a Wald test, was not controlled well in many cases [18, 23, 33, 36, 37]. Most recently, edgeR offered a quasilikelihood (QL) Ftest for testing DE genes using negative binomial generalized models, which was considered to be a preferred choice for the uncertainty in estimating the dispersion for each gene when sample sizes were small [45]. In our recent study, we found that UQpgQ2 normalization combined with an exact test from edegR performed slightly better than TMM and RLE in terms of FDR when using MAQC data and simulated data. However, all the methods had an inflated FDR using MAQC datasets [18]. Thus, it remains unclear which combination of normalization and test statistics can minimize the number of false positives while taking into consideration of sample size and read depth variations. While studies comparing different normalization methods have been widely reported and discussed, this issue for the evaluation of newly developed normalization and testing statistical methods has not been adequately addressed.
In this study, we evaluated the performance of two commonly used packages (DESeq2 and edgeR) with three statistical tests (exact test, QL Ftest and Wald test), the three most frequently used normalization methods (RLE, TMM and UQ) and the more recently proposed twostep normalization (UQpgQ2). Two benchmark MAQC (Microarray Quality Control Project) datasets [34, 46, 47], five real RNAseq datasets from The Cancer Genome Atlas (TCGA) website [48], and simulated data with varying read depths are used in this study.
Results
Statistical analysis of MAQC2 and MAQC3 for the combined methods
In our previous study, we evaluated the effect of normalization methods including DESeq, TMM, UQpgQ2 and UQ based on DEG analysis using two MAQC datasets and an exact test/edgeR. In this study, the effects of the Wald test/DESeq2, exact test/QL Ftest from edgeR and ttest/voomlimma were used to evaluate the normalization and test statistical methods.
The number of true positive (TP) and false positive (FP) genes calculated were based on the number of DEGs identified from MAQC RNAseq data given a nominal FDR cutoff 0.05, and the total number of TPs and true negatives (TNs) were based on qRTPCR data. We also calculated the positive predictive value (PPV), the actual FDR, sensitivity and specificity for both datasets (Table 2). Using MAQC2 data, the analysis results show that UQpgQ2 combined with an exact test/edgeR has the highest specificity (85.1%) with the lowest actual FDR (0.055) while the others ranged from 37.8 to 45.3% with a FDR greater than 0.1 and slightly lower sensitivity (96.7%). An exact test/TMM has the highest sensitivity (98.5%) while the others ranged from 96.7 to 97.4%. The UQ approach performed the worst in both sensitivity and specificity, consistent with other findings [18].
While using a Wald test, the results show that UQpgQ2 outperformed the others with the highest specificity (66.9% compared to the others from 43.9 to 46.0%) and a slightly higher sensitivity (98.7% compared to the others from 95.9 to 96.4%). RLE has a slightly higher sensitivity (96.4%) than the TMM and UQ methods while having a tradeoff of lower specificity. When using the recently proposed QL Ftest, the results show that UQpgQ2 has the highest specificity (58.7% compared to the others arranged from 24.5 to 28.0%) and the highest sensitivity (99.7% compared to the others 99.2%). TMM with a QL Ftest has a slightly higher specificity (28%) than RLE/UQ (24.5%). Although a ttest for DEGs analysis in RNAseq studies is not commonly used due to the distribution of the read counts in RNAseq data following a negative binomial [26, 49], the voomlimma package has been recently proposed [29] and was reported to have good control of FDR, but low power for small sample size [36, 37]. Therefore, it is interesting to examine the results from a ttest using logtransformation of read counts following one of the four normalization methods. As expected, the results show there is no difference between the UQ and UQpgQ2 methods since the median scaling factor estimated for each gene across samples in UQpgQ2 was canceled while applying a ttest [50]. Although UQ/UQpgQ2 performed relatively better than TMM and RLE, with a specificity of 48.7%, there was a tradeoff with lower power of 93.1%, consistent with previous reports [36, 37]. The results also suggested a ttest is not a better choice for the TMM and RLE methods compared to other tests such as a Wald test or an exact test/QL Ftest.
Overall, for this comparison study of the four test statistics (the exact test/QL Ftest, Wald test and ttest), the results from MAQC2 data demonstrated that UQpgQ2 and TMM combined with an exact test/Wald test performed much better than using a QL Ftest and ttest in terms of sensitivity/power and specificity/FDR while UQ and RLE were varied.
The results from an additional analysis of MAQC3 with five replicates had similar conclusions for UQpgQ2 normalization (Table 2). Briefly, UQpgQ2 with an exact test was the best choice and achieved the highest specificity among the four normalization methods for all four test statistics. The results also show that all normalization methods combined with a ttest/voomlimma achieved better specificity than a Wald test and QL Ftest while all the methods have a sensitivity close to or above 90% with a tradeoff of lower power than others. Thus, the results using MAQC3 data suggested that an exact test for UQpgQ2 or a ttest from voomlimma seems to control the FDR better than other methods when sample sizes or replicates are relatively large.
Finally, the results from the analysis of both MAQC datasets suggested the four normalization methods combined with the three test statistics (exact test, QL Ftest and Wald test) can achieve a great sensitivity/power while a ttest from voomlimma has relatively lower power with unstable performance on the control of FDR. Although the UQpgQ2 method performed relatively better for controlling FPs, all normalization methods have a problem maintaining the actual FDR below the nominal level of 0.05, which agrees with previous reports [18, 23, 33].
Withingroup analysis of real cancer datasets for detecting FPs given a desired sample size
A type I error rate and FDR are the most important performance measures for evaluating DEG analysis methods. The large number of replicates from TCGA human cancer datasets including nonsmall cell lung cancer with adenocarcinoma subtype (AdLC), ovarian cancer (OC) and triple negative breast cancer (BC) allows us to perform withingroup analysis of FPs for estimation of a type I error rate. The four normalization methods (TMM, RLE and UQ and UQpgQ2) combined with the exact test, QL Ftest or Wald test were compared given a desired sample size of replicates in a single group. The two synthesized groups with an equal and desired sample size were randomly subsampled from the same cancer subtype. Under the null hypothesis, the genes between the two synthesized groups in this study are not expected to be differentially expressed. Thus, the DE genes identified are defined as FPs. Given a FDR cutoff of 0.05 and an absolute value of FC cutoff at 2 as a conventional way for identifying DEGs, the FPR (a fraction of DEGs) and the number of FPs identified are illustrated in Fig. 1 and Additional file 1: Figure S1, respectively. Although the FPR for all the four normalization methods based on the three datasets are below 0.05, the performance of these methods are significantly different.
First, using an exact test/QL Ftest, we found that the FPR in Fig. 1 increases as the sample size for all the methods increases from 5 to 40 in the three cancer datasets (Fig. 1a, b, d,e, g and h). However, not unexpectedly, this pattern was not observed when a Wald test was used. With a Wald test, higher FPRs are observed when sample sizes are five and they tend to decrease at larger sample sizes of 10, 15, 20, 25, 35 and 40 (Fig. 1c, f and i), but the FPR for different sample sizes varies (Fig. 1c).
Second, we found that the exact test at a sample size of five can achieve a smaller value of FPRs than a Wald test for all the methods (RLE in pink, TMM in green, UQ in blue and UQpgQ2 in purple). This suggests that when a sample size is small, an exact test is more conservative than a Wald test. Moreover, the QL Ftest combined with any of four normalization methods at sample sizes of 5, 10 and 15 can achieve the smallest FPR compared to the other two tests (Fig. 1b, e and h). However, when a sample size becomes large (n > 15), a Wald test for RLE, TMM and UQ is more conservative than choosing the exact test or QL Ftest.
Third, in this study, the differences among RLE, TMM and UQ normalization methods are relatively small and varied. We found that the twostep normalization method UQpgQ2, consistently performed better than the others by achieving the smallest FPR and number of FPs given a desired sample size in all scenarios (Fig. 1 and Additional file 1: Figure S1). Overall, the results illustrate that a QL Ftest with a UQpgQ2 may be the best option for DEG analysis when FDR is more important to be considered. These observations are consistent for the three real datasets.
The effect of sequencing read depth of OC samples on the analysis of FPs from the normalization methods and test statistics
Next, we examined whether the read depth in a RNAseq study affects the number of FPs for the normalization and test statistical methods given a desired sample size. The read depths of the 379 OC samples range from 19 to 157 million reads. Two new datasets were generated by simply reducing the read depth of the OC samples to onethird or half. Thus, we obtained one dataset with the read depth in the range of 12.8 to 104 million reads and the other with a range of 9.6 to 78.6 million reads. Given an FDR cutoff of 0.05 and an absolute FC cutoff of 2, the FPR estimated from the number of FPs identified in Additional file 1: Figure S2 and S3 is illustrated in Figs. 2 and 3.
Withingroup analysis of three sets of data with different read depths revealed that the FPR from higher read depths (19157 M and 12.8104 M) in Fig. 2a to 2f is slightly larger than those with smaller read depths (9.6–78.6 M) in Fig. 2g to 2i regardless of normalization methods and sample sizes. However, the difference for the samples with the read depths between 19 and 157 M and 12.8104 M (a smaller change) is varied and very small. Regardless of read depth, similar patterns are observed between Figs. 1 and 2. Overall, UQpgQ2 method is more conservative than the others in most of scenarios given a desired sample size and statistical test. However, Fig. 2 shows that UQpgQ2 combined with a Wald test in read depth of 9.6 to 78.6 million is more liberal at the sample size of five.
Figure 3 and Additional file 1: Figure S3 further demonstrate the difference between three test statistics (the exact test in pink, QL Ftest in green and Wald test in blue) and three normalization methods. The FPR increases as sample sizes increase while using the exact test and QL Ftest, but the impact of FPR by the sequencing read depth are very small. For a Wald test, the FRP is larger than the one from other tests when the sample size is five. Moreover, the FPR from RLE and TMM combined with the three tests are similar (Fig. 3a, b, d, e, g and h). In contrast, UQpgQ2 from the three tests (Fig. 3c, f and i) can achieve lower FPR compared to other normalization methods.
The effect of sequencing read depths from simulated data on the analysis of FPs given a desired sample size
Each of the six simulated datasets contains 122 samples with a desired mean read depth of 30, 40 and 50 million reads with a standard deviation (SD) of 3 and 5 million reads, respectively. In this study, we examined whether the simulated data with a desired read depth and SD affects the number of FPs/FPRs from different normalization and test statistical methods given a desired sample size. Given a FDR cutoff of 0.05 and an absolute value of FC cutoff at 2, the results are illustrated in Fig. 4 and Additional file 1: Figure S4.
Overall, the results from the simulated data are similar to the ones illustrated in Figs. 1 and 2. When the read depths increase from 30 to 50 million reads (Fig. 4a, d, g, j, m and p), the FPR from an exact test slightly increases, which is consistent with the observation from real data (Fig. 1). But, these patterns were not observed when using the Wald test and QL Ftest. We also observed that UQpgQ2 (purple) performed the best with the lowest FPR while TMM (green) performed the worst with the largest FPR using an exact test. For a Wald test, UQpgQ2 performed the worst for a sample size of five and performed the best for sample sizes of 10 or larger while the other methods performed similarly. For a QL Ftest, all the methods perform similarly by achieving very small FPR for a given sample size of 5, 10 and 15. When the sample size is 20 or larger, similar results are observed except TMM combined with the QL Ftest can achieve a smaller FPR than RLE and UQ methods. Taken together, with a QL Ftest, UQpgQ2 outperformed the other methods by achieving the smallest value of FPR regardless of the sample sizes and sequencing read depth.
Betweengroup analysis for identifying DEGs from cancer samples versus the normal controls given a desired sample size
Finally, we evaluated the four normalization methods with three test statistics based on the number of significant DEGs identified using the paired BC data (117 BC and 112 control) and lung cancer data (535 AdLC and 59 normal control). Given a desired sample size, the cancer and control groups were subsampled from BC/AdLC and their control samples for 50 times, respectively. The number of DEGs detected from each method with a desired sample size is the average of the number of DEGs while bootstrapping for 50 times. Given an FDR cutoff of 0.05 and an absolute FC cutoff of 2, the number of DEGs identified is illustrated in Fig. 5.
First, we found that the number of DEGs increases as the sample size increases from 5 to 30 for each method (Fig. 5af). As the sample size increases from 30 to 40, the number of DEGs varies either slightly increasing (Fig. 5ad) or decreasing (Fig. 5e, f).
Second, given sample sizes of 15 or more, an exact test combined with one of the four normalization methods detected more DEGs than the QL Ftest and Wald test in the most scenarios, which indicates that the Wald test and QL Ftest are more conservative than the exact test while the exact test has more power than others. However, given a sample size of five, the Wald test can identify more DEGs than the exact test and QL Ftest.
Third, comparing the four normalization methods, UQpgQ2 (purple) is more conservative than the others in the most scenarios except the QL Ftest for AdLC data, where UQpgQ2 achieved the highest detection power for given sample sizes of 10 or larger (Fig. 5d). However, given a sample size of five, the number of DEGs detected from UQpgQ2 combined with the exact test or QL Ftest is small (Fig. 5ac). This observation suggests that RLE, TMM and UQ combined with the exact test or QL Ftest from edgeR are the better choice for achieving a better detection power than UQpgQ2 for a small sample size such as five and the number of DEGs identified from UQpgQ2 looks more reasonable given a sample size ≥10. Since we do not know the number of true positives and true negatives, we assume the method detecting the highest number of DEGs may have the highest sensitivity or detection power.
Discussion
Some previous studies comparing normalization methods have reported that both DESeq2 and edgeR with an exact test/likelihood ratio test failed to maintain the actual FDR below the nominal level of 0.05, suffering from being” oversensitive” in some cases [14, 18, 23, 37, 51]. Our recently proposed normalization method, UQpgQ2, combined with an exact test had a better performance than the others in terms of controlling the type I error rate and FDR. However, in that previous study, sample sizes (replicates) for a twogroup comparison were smaller than six in many cases. Currently, common sample sizes in RNAseq studies can range from a minimum of 3 up to several hundreds of biological replicates. It is known that a Wald test can be used for testing a hypothesis of a parameter when sample sizes are usually larger than 30 based on an asymptotic theorem. Recently, edgeR provided a QL Ftest which was recommended for studies with a small number of replicates in RNAseq data. A study of singlecell differential expression analysis observed that edgeR/QLF performed well after filtering out lowly expressed genes [52]. Thus, a comprehensive comparison of QL Ftest and other test statistics combined with newly developed normalization methods (UQpgQ2 or voomlimma) for DEG analysis of bulk RNAseq data deem to be needed. Furthermore, one outstanding question is whether the Wald test, QL Ftest or ttest combined with a normalization method performs better than an exact test used in the previous study in terms of controlling FDR. Another question is what is the best combination between normalization and test statistical methods for the control of type I error rate given a desired sample size. To address these issues, we focused on four normalization methods and three test statistics using both real RNAseq datasets and simulated data given sample sizes at 5, 10, 15, 20, 25, 30, 35 and 40.
Initially, we used two benchmark MAQC datasets for the DEGs analysis. The results from MAQC datasets show that UQpgQ2 combined with an exact test can achieve the highest specificity with the sensitivity higher than 90%. We also observed that edgeR with TMM normalization performed better than RLE and UQ with the Wald test or QL Ftest in terms of sensitivity and specificity using the small samples sizes of MAQC datasets.
Next, we used an intragroup approach to calculate the number of FPs and FPR from null datasets generated from real data and simulated data. We compared these methods by taking into consideration of sample sizes and read depths. The results from this approach have shown that in general the QL Ftest combined with any one of four normalization methods (RLE, TMM, UQ and UQpgQ2) performs the best for achieving the lowest value of FPRs when the sample size is small (≤15). However, when the sample size is large, UQpgQ2 combined with the QL Ftest performs the best and the Wald test performs much better than an exact test and QL Ftest with FPR below 0.01. Moreover, we found the read depths from simulated data with 30 to 50 million reads have a minimal impact on detection power and FDR. Furthermore, comparing DEG analysis of BC and AdLC suggests that the RLE, TMM and UQ combined with an exact test or a Wald test have higher sensitivity or power than the UQpgQ2 method. However, these methods may suffer oversensitivity when the sample sizes are large. In addition, the results from AdLC show UQpgQ2 combined the QL Ftest achieves higher detection power than other combined methods, but this observation is not consistent with the one from BC datasets.
Furthermore, it is important to note that the evaluated methods may not be applicable to all type RNAseq data. For example, singlecell RNA sequencing (scRNAseq) data has been increasingly used to assess different cell states and cell types such as stem cell, neuron cell and cancer cells [53,54,55]. A DEG analysis between different cells in scRNAseq can help to uncover driver genes in cancer research [56]. Due to technical limitations, scRNAseq data generally have low library sizes resulting in a large fraction of ‘dropout’ events as well as huge heterogeneity, which introduces a major challenge in identification of DEGs. Given the special characteristics, many new methods have been developed especially for DE analysis of scRNAseq data [57,58,59,60]. A few studies have compared DEG analysis tools for scRNAseq data and found that existing methods for analysis of bulk RNAseq data perform as well as, or not worse than, those specifically developed for scRNAseq data in terms of the power and FDR [52, 61]. A recent study has conducted DEG analysis for a comparison of eleven tools and two of them (edgeR and DESeq2) were designed originally for bulk RNAseq [62]. In this study, a high disagreement among these tools in calling DEGs were identified due to a tradeoff between the sensitivity and FDR. It also reported that current methods developed for scRNAseq did not show better performance compared to edgeR and DESeq2, which is consistent with the findings from previous studies. Thus, it is reasonable for us to mention here whether the newly developed normalization methods such as UQpgQ2 and MedpgQ2 can be used for DEG analysis in scRNAseq data for a control of FDR.
Finally, our study has some limitations. First, this work is limited to genelevel analysis and balanced designs. Second, the per gene normalization method, UQpgQ2, is only used for the DEG analysis among the groups and is not applicable for comparing genes within a group. Third, we used within group analysis of several cancer subtypes to identify DEGs and we assumed the DEGs as false positives that were used to calculate FPRs. Although this approach has been used to estimate the type I error or FPs for the comparison of normalization and test statistical methods in several studies [32, 37, 46], these identified genes may contain some TPs due to the variation between cancer patients. However, this limitation can be offset by the two benchmark MAQC RNAseq data. Finally, since voomlimma with a ttest used for DEGs analysis in bulk RNAseq data, we need to address here that per gene normalization in UQpgQ2 would not alter the DEG results due to the invariant property of ttest for the linear transformations of gene counts across samples.
Conclusions
Taken together, we found the UQpgQ2 method with an exact test is the best choice for DEG analysis in terms of controlling false positives when using the benchmark MAQC datasets. However, based on an intragroup analysis of real data and simulated data, we found UQpgQ2 combined with a QL Ftest outperforms other methods by achieving the smallest value of FPR, and a Wald test from DESeq2 can achieve a FPR below 0.05 when sample sizes are large. We observed that the RLE, TMM and UQ normalization methods combined with the Wald or exact test/QL Ftest performed similarly and read depths have minimal impact on detection of DEGs from the analysis of simulated data. We hope this new finding can serve a guide for researchers to properly choose the normalization and test statistical methods for identifying DEGs while taking sample sizes into the consideration. As scRNAseq technology emerges, UQpgQ2 combined with the QL Ftest may be useful for DEG analysis of scRNAseq data for controlling FDR, but an evaluation should be conducted in the future.
Methods
Data sources
Microarray quality control project (MAQC) RNAseq data
Two benchmark RNAseq datasets (MAQC2 and MAQC3) were used for evaluation [18, 34, 46]. These datasets have two conditions: human brain reference RNA (hbr) and universal human reference RNA (uhr). MAQC2 RNAseq data contains two replicates in each condition (hbr1, hbr2, uhr1 and uhr2). The hbr1 / uhr1 and hbr2 / uhr2 samples were prepared and sequenced in different labs. MAQC3 RNAseq data contains five replicates in each condition.
TaqMan qRTPCR data
In the MAQC project, a benchmark PCR dataset of 1044 genes was used for validation. Detailed information for processing and analyzing this data has been previously described [17, 18, 38, 47]. Briefly, we identified 388 genes as true positives and 143 genes as true negatives to evaluate these methods.
Human cancer RNAseq datasets from TCGA
Three types of human BC, OC and LC data were downloaded from TCGA website [48]. In this study, we used one subtype from each cancer type. The 122 TNBC, 379 OC, 523 AdLC and 59 normal control samples were extracted using an R script (v3.5.2). For DEG analysis, 117 BC and 112 paired normal control samples were extracted. Lowly expressed genes in each TCGA dataset were filtered out when the zero counts across samples were greater than 50% across the samples. The remaining genes were used for the downstream analysis. These datasets are available in Additional file 2.
Simulated data using Monte Carlo method
The simulated data is based on the 122 TNBC samples with a range of total read depth between 23.8 and 98.7 million reads. For our simulation, let G be the total number of genes (G = 30,831) with desired sample sizes (n = 5, 10, 15, 20, 25, 30, 35 and 40) denoted as the number of replicates. Let R_{ig} be the read counts in sample i and gene g and N_{i} be the total number of reads (read depth) in sample i as estimated from the 122 TNBC samples. Let p_{ig} be the proportion for the gene g in sample i, where \( {\hat{p}}_{ig} \) was estimated from R_{ig} divided by N_{i}. For simulating read depths, we modeled the independent and desired read depth N_{i} (i = 1, …, n_{i}) to follow a normal distribution with a mean of 30, 40 and 50 million reads, and standard deviation of 3 million reads for scenario one and 5 million reads for scenario two for each N,. The resulting distributions are N_{i} ∼ N (30, 3^{2}) and N_{i} ∼ N(30, 5^{2}); N_{i} ∼ N(40, 3^{2}) and N_{i} ∼ N(40, 5^{2}); and N_{i} ∼ N(50, 3^{2}) and N_{i} ∼ N(50, 5^{2}), respectively.
As an example, in the case of generating samples with a mean read depth of 30 million reads and a standard deviation of 3 million reads given a desired sample size of five, first, ten samples for the two groups were randomly selected from the set of 122 TNBC cases. We estimated \( {\hat{p}}_{ig} \) for each sample and generated read depths based on N_{i} ∼ N(30, 3^{2}) using the Monte Carlo method. Finally, the raw counts for each sample were generated using multinomial distribution given a desired N_{i} and a vector of \( {\hat{p}}_{ig} \) for sample i. This procedure was repeated 50 times given desired sample size of 5, 10, 15, 20, 25, 30, 35 or 40 combined with N_{i} ∼ N(30, 3^{2}). Thus, the data in the two groups was generated in the scenario of N_{i} ∼ N(30, 3^{2}).
Sequence alignment and extraction of gene counts
The raw Sequence Read Archive (SRA) files for the MAQC2 and MAQC3 with two conditions were converted to .fastq files and then aligned to the human hg38 reference genome using STAR (v2.6.0c) [11] and the Ensembl hg38 annotation gtf file (GRCh38.82). The mapped counts for 60,483 genes per sample were extracted using HTSeqscriptscount (version 2.7.5). Lowly expressed genes with zero counts across all the samples were further filtered. The MAQC datasets are available in Additional file 2.
Normalization methods and software packages
The normalization methods, statistical tests and software packages for the DEG analysis between the twogroup comparisons are summarized in Table 3. The R code used in our analysis is available in Additional file 3.
Normalization methods
Four methods (RLE, TMM, UQ and UQpgQ2) were used to normalize RNAseq data. In this study, RLE and TMM normalizations were implemented in DESeq2 and edgeR, respectively [25, 63]. UQ and UQpgQ2 were implemented using an R script [18, 23, 28]. UQpgQ2 uses a twostep approach for normalization [18]. Briefly, assuming G genes and m samples, the scaling factor for UQ normalization is calculated from the 75th percentile (Q3) of the counts for each sample after removing genes with zero counts. Gene g in sample j is scaled by the UQ scaling factor and then multiplied by the mean of the scaling factors from m samples. Therefore, \( {X}_{gj}^{UQ} \) is gene g in sample j with UQ normalized counts. Then, UQnormalized gene g is further scaled by its median (Q2) of read counts across m samples and then multiplied by 100. Therefore, \( {X}_{gj}^{UQ pgQ2} \) is gene g in sample j and normalized by UQpgQ2 method.
Software packages and test statistics used for DEGs analysis
The exact test, QL Ftest and Wald test were used for the detection of DEGs. In this study, we used the exact test and QL Ftest implemented in edgeR (v3.24.3) [45]. A Wald test implemented in DESeq2 (v1.22.2) was reported to improve the sensitivity compared with an exact test implemented in DESeq [25].
Logtransformed Wald test in a NB distribution
The Wald statistical test is an asymptotic test based on the normal approximation, which utilizes the largesample properties of maximum likelihood estimation (MLE). In DESeq2, the read counts K_{ij} in gene i sample j is modelled by a generalized linear model (GLM) of the NB family with a log link:
K_{ij}~NB(mean = u_{ij}, Dispersion = α_{i}), u_{ij} = s_{ij}u_{ij} and log \( {q}_{ij}=\sum r{x}_{jr}{\beta}_{ir} \). The s_{ij} = s_{j} is the size factor used to normalize the gene read counts in sample j and the q_{ij} is the true expression of gene i [25].
For a GLM with two conditions and a single gene in sample j, the logtransformed Wald test has been described in several studies [64,65,66]. Briefly, the treatment and control group indicators (x_{r}) take the value 1 and 0, respectively, resulting \( \frac{q_1}{q_0}={e}^{\beta_1} \) . For the differential expression gene analysis, the ratio ρ =q_{1}/q_{0} (a fold change) is used for a hypothesis test. Testing the hypothesis of q_{r}: H_{0}: ρ = 1 vs. H_{1}: ρ ≠ 1 is equivalent to H_{0}: β_{1} = 0 vs. H_{1}: β_{1} ≠ 0 in GLM. The twosided Wald test is defined as \( \mid {Z}_w\mid =\frac{\left{\hat{\beta}}_1\right}{\sqrt{Var\left({\hat{\beta}}_1\right)}} \), where the \( Var\left({\hat{\beta}}_1\right) \) is estimated from the variancecovariance matrix of (β_{0} ,β_{1}), which is the inverse of the Fisher information matrix of \( {I}_{n_0{n}_1}\left({\beta}_0\ \mathrm{and}\ {\beta}_1\right) \) asymptotically. To reject the null hypothesis, \( \left{Z}_w\right>{Z}_{1\frac{\alpha }{2}} \) is defined. Thus, this gene with pvalue less than 0.05 is called a DEG. For testing multiple genes simultaneously, the pvalues are further corrected using BenjaminiHouchberg FDR method [67].
Exact test in a NB distribution
Since RNAseq data are read counts, an exact test has been implemented similarly in DESeq and edgeR [26, 68]. For a gene in a twogroup comparison, the exact test has been described by several studies [18, 26, 69] . Briefly, Y_{ij} is denoted the normalized read counts of the single gene in condition i = A and B, and replicate j = 1, …, n_{i}. Then, the distributions of \( {Y}_{ij}\ \mathrm{and}\ \sum \limits_j^{n_i}{Y}_{ij} \) are assumed to follow a negative binomial distribution with an expected mean u_{i} and dispersion ϕ expressed as:
Y_{gij}~Y_{gij}~NB(u_{gi}, ϕ), \( \sum \limits_j^{n_i}{Y}_{ij}\sim NB\left({n}_i\cdotp {u}_i,\frac{\upphi\ }{n_i}\right),E\left(\sum \limits_j^{n_i}{Y}_{ij}\right)={n}_i\cdotp {u}_i \), and Var \( \left(\sum \limits_j^{n_i}{Y}_{ij}\right)={n}_i\cdotp {u}_i+{n}_i\cdotp {u}_i^2\cdotp \upphi \). The null hypothesis H_{0} : u_{A} = u_{B} is to identify DEGs between conditions A and B and the total normalized counts in each condition are \( {Y}_A=\sum \limits_j^{n_A}{Y}_{Aj} \) and \( {Y}_B=\sum \limits_j^{n_B}{Y}_{Bj}. \) The total counts of two conditions for the gene are Y_{S} = Y_{A} + Y_{B}.
Since Y_{A} and Y_{B} are assumed to be independent, the joint probability of P(Y_{A} = y_{A}, Y_{B} = y_{B} ) under H_{0} is P(Y_{gA} = y_{gA}) × P(Y_{gB} = y_{gB}). Thus, the pvalue from an exact test [26] is calculated by summation of the probability of a pair of P(a, b) that is less than or equal to the observed P(y_{A}, y_{B}) given that the overall summation of P(a, b). The pair of variables a and b are defined as a = 0, …, Y_{S} and b = Y_{S} − a. Then the pvalue for gene g is
p. value = P_{1}/P_{2}, where
and
The pvalue is further adjusted for multiple test correction using the BenjaminiHochberg FDR methodology. A study reported that the exact test performed better for achieving a smaller FDR than a Wald and likelihood ratio tests when the sample size is small [49].
Intragroup analysis to identify the number of FPs given a desired sample size
To compare the methods, we utilized the desired number of replicates (5, 10, 15, 20, 25, 30, 35 and 40) to estimate the number of false positives and the corresponding FPR, a fraction of DEGs via intragroup analysis. Given sample sizes, two groups are generated by randomly subsampling from a single cancer group. Since the samples originate from the same condition, the number of DEGs expected in such twogroup comparisons should be zero. Thus, by this assumption, any DEGs would be defined as false positives that were further used for estimating FPR under a null hypothesis [32, 37, 52, 70].
Bootstrap differential expression analysis of cancer versus normal control samples given a desired sample size
Given desired sample sizes of 5, 10, 15, 20, 25, 30, 35 and 40, the twogroup data (cancer and normal control) were randomly subsampling from BC/AdLC and the normal samples, respectively. Subsequently, the cancer and control groups were normalized by one of the four normalization methods used in this study. The DEG analysis of the normalized data was performed with the aid of edgeR, DESeq2 and limma tools. The results from these packages include the log_{2} transformed fold change, pvalues and BenjaminiHochberg FDRs. In this study, we defined significant DEGs using an absolute value of log_{2} FC cutoff at one (FC = 2) and a FDR cutoff at 0.05. The bootstrapping process of running each DEG algorithm was iteratively repeated 50 times. The mean number of DEGs corresponding to the standard error was imputed from the 50 iterations.
Abbreviations
 AdLC:

Human nonsmall cell lung cancer with adenocarcinoma subtype
 BC:

Human breast cancer
 DE:

Differential expression
 DEGs:

Differentially expressed genes
 FC:

Fold change
 FDR:

False discovery rate
 FP:

False positive
 FPKM:

Fragments Per Kilobase per Million mapped fragments
 FPR:

False positive rate
 hbr:

Human brain reference RNA
 LC:

Human lung cancer
 LRT:

Likelihood ratio test
 MAQC:

Microarray quality control project
 MedpgQ2:

pergene Q2 normalization following persample median global scaling
 MLE:

Maximum likelihood estimation
 NB:

Negative binomial distribution
 OC:

Human ovarian cancer
 PPV:

Positive predict value
 Q:

Full Quantile
 Q2:

The 50th percentile
 Q3:

The 75th percentile
 QL Ftest:

Quasilikelihood F test
 qRTPCR:

Quantitative RealTime polymerase chain reaction
 RC:

Raw counts
 RLE:

Relative log estimate
 RNAseq:

Highthroughput RNA sequencing
 ROC:

Receiver Operating Characteristic
 RPKM:

Reads Per Kilobase per Million mapped reads
 RSEM:

RNAseq by expectationmaximization
 RUV:

Remove unwanted technical variation
 scRNAseq:

Single cell RNA sequencing
 SD:

Standard deviation
 SRA:

Sequenced read archive
 TC:

Total counts
 TCGA:

The Cancer Genome Atlas
 TMM:

Trimmedmean M values
 TNBC:

Triple negative breast cancer
 TP:

True positive
 uhr:

Universal human reference RNA
 UQ:

Upper Quartile (Q3)
 UQpgQ2:

pergene Q2 normalization following persample upperquartile global scaling at 75 percentile
References
 1.
Wang Z, Gerstein M, Snyder M. RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.
 2.
Marguerat S, Bahler J. RNAseq: from technology to biology. Cell Mol Life Sci. 2010;67(4):569–79.
 3.
Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, et al. De novo assembly and analysis of RNAseq data. Nat Methods. 2010;7(11):909–12.
 4.
Schliebner I, Becher R, Hempel M, Deising HB, Horbach R. New gene models and alternative splicing in the maize pathogen Colletotrichum graminicola revealed by RNASeq analysis. BMC Genomics. 2014;15(1):842.
 5.
Canovas A, Rincon G, IslasTrejo A, Wickramasinghe S, Medrano JF. SNP discovery in the bovine milk transcriptome using RNASeq technology. Mamm Genome. 2010;21(11–12):592–8.
 6.
Piskol R, Ramaswami G, Li JB. Reliable identification of genomic variants from RNAseq data. Am J Hum Genet. 2013;93(4):641–51.
 7.
Wang L, Feng Z, Wang X, Wang X, Zhang X. DEGseq: an R package for identifying differentially expressed genes from RNAseq data. Bioinformatics. 2010;26(1):136–8.
 8.
Ellard S, Patrinos GP, Oetting WS. Clinical applications of nextgeneration sequencing: the 2013 human genome variation society scientific meeting. Hum Mutat. 2013;34(11):1583–7.
 9.
Byron SA, Van KeurenJensen KR, Engelthaler DM, Carpten JD, Craig DW. Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat Rev Genet. 2016;17(5):257–71.
 10.
Langmead B, Salzberg SL. Fast gappedread alignment with bowtie 2. Nat Methods. 2012;9(4):357–9.
 11.
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNAseq aligner. Bioinformatics. 2013;29(1):15–21.
 12.
Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNASeq. Bioinformatics. 2009;25(9):1105–11.
 13.
Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357–60.
 14.
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.
 15.
Zeng W, Mortazavi A. Technical considerations for functional sequencing assays. Nat Immunol. 2012;13(9):802–7.
 16.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010;11(3):R25.
 17.
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinformatics. 2010;11:94.
 18.
Li X, Brock GN, Rouchka EC, Cooper NGF, Wu D, O'Toole TE, et al. A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNAseq data. PLoS One. 2017;12(5):e0176185.
 19.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008;5(7):621–8.
 20.
Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNAseq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17.
 21.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511–5.
 22.
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNAseq experiments with TopHat and cufflinks. Nat Protoc. 2012;7(3):562–78.
 23.
Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, et al. A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Brief Bioinform. 2013;14(6):671–83.
 24.
Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article 3.
 25.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014;15(12):550.
 26.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106.
 27.
Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNAseq data using factor analysis of control genes or samples. Nat Biotechnol. 2014;32(9):896–902.
 28.
Risso D, Schwartz K, Sherlock G, Dudoit S. GCcontent normalization for RNASeq data. BMC Bioinformatics. 2011;12:480.
 29.
Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014;15(2):R29.
 30.
Li B, Dewey CN. RSEM: accurate transcript quantification from RNASeq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.
 31.
Patro R, Mount SM, Kingsford C. Sailfish enables alignmentfree isoform quantification from RNAseq reads using lightweight algorithms. Nat Biotechnol. 2014;32(5):462–4.
 32.
Li X, Rouchka EC, Brock GN, Yan J, O'Toole TE, Tieri DA, et al. A combined approach with genewise normalization improves the analysis of RNAseq data in human breast cancer subtypes. PLoS One. 2018;13(8):e0201813.
 33.
Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting differentially expressed genes from RNAseq data. Am J Bot. 2012;99(2):248–56.
 34.
Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, et al. Comprehensive evaluation of differential gene expression analysis methods for RNAseq data. Genome Biol. 2013;14(9):R95.
 35.
Li P, Piao Y, Shon HS, Ryu KH. Comparing the normalization methods for the differential analysis of Illumina highthroughput RNASeq data. BMC Bioinformatics. 2015;16:347.
 36.
Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNAseq data. BMC Bioinformatics. 2013;14:91.
 37.
Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages for detecting differential expression in RNAseq studies. Brief Bioinform. 2015;16(1):59–70.
 38.
Zhang ZH, Jhaveri DJ, Marshall VM, Bauer DC, Edson J, Narayanan RK, et al. A comparative study of techniques for differential expression analysis on RNASeq data. PLoS One. 2014;9(8):e103207.
 39.
Lin Y, Golovnina K, Chen ZX, Lee HN, Negron YL, Sultana H, et al. Comparison of normalization and differential expression analyses using RNASeq data from 726 individual Drosophila melanogaster. BMC Genomics. 2016;17:28.
 40.
Tang M, Sun J, Shimizu K, Kadota K. Evaluation of methods for differential expression analysis on multigroup RNAseq count data. BMC Bioinformatics. 2015;16:361.
 41.
Germain PL, Vitriolo A, Adamo A, Laise P, Das V, Testa G. RNAontheBENCH: computational and empirical resources for benchmarking RNAseq quantification and differential expression methods. Nucleic Acids Res. 2016;44(11):5054–67.
 42.
Maza E. In Papyro comparison of TMM (edgeR), RLE (DESeq2), and MRN normalization methods for a simple twoconditionswithoutreplicates RNASeq experimental design. Front Genet. 2016;7:164.
 43.
CostaSilva J, Domingues D, Lopes FM. RNASeq differential expression analysis: an extended review and a software tool. PLoS One. 2017;12(12):e0190152.
 44.
Spies D, Renz PF, Beyer TA, Ciaudo C. Comparative analysis of differential gene expression tools for RNA sequencing time course data. Brief Bioinform. 2019;20(1):288–98.
 45.
Lun AT, Chen Y, Smyth GK. It's DElicious: a recipe for differential expression analyses of RNAseq experiments using quasilikelihood methods in edgeR. Methods Mol Biol. 2016;1418:391–416.
 46.
Consortium M, Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, et al. The MicroArray quality control (MAQC) project shows inter and intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006;24(9):1151–61.
 47.
Wan L, Sun F. CEDER: accurate detection of differentially expressed genes by combining significance of exons using RNASeq. IEEE/ACM Trans Comput Biol Bioinform. 2012;9(5):1281–92.
 48.
Grossman RL, Heath AP, Ferretti V, Varmus HE, Lowy DR, Kibbe WA, et al. Toward a shared vision for Cancer genomic data. N Engl J Med. 2016;375(12):1109–12.
 49.
Robinson MD, Smyth GK. Smallsample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008;9(2):321–32.
 50.
Li X, Gu G, Soliman F, Sanders AJ, Wang X, Liu C. The evaluation of durative transfusion of Endostar combined with chemotherapy in patients with advanced nonsmall cell lung cancer. Chemotherapy. 2018;63(4):214–9.
 51.
Guo Y, Li CI, Ye F, Shyr Y. Evaluation of read count based RNAseq analysis methods. BMC Genomics. 2013;14(Suppl 8):S2.
 52.
Soneson C, Robinson MD. Bias, robustness and scalability in singlecell differential expression analysis. Nat Methods. 2018;15(4):255–61.
 53.
Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet barcoding for singlecell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–201.
 54.
Usoskin D, Furlan A, Islam S, Abdo H, Lonnerberg P, Lou D, et al. Unbiased classification of sensory neuron types by largescale singlecell RNA sequencing. Nat Neurosci. 2015;18(1):145–53.
 55.
Zheng C, Zheng L, Yoo JK, Guo H, Zhang Y, Guo X, et al. Landscape of infiltrating T cells in liver Cancer revealed by singlecell sequencing. Cell. 2017;169(7):1342–56 e1316.
 56.
Myers JS, von Lersner AK, Robbins CJ, Sang QX. Differentially expressed genes and signature pathways of human prostate Cancer. PLoS One. 2015;10(12):e0145322.
 57.
Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to singlecell differential expression analysis. Nat Methods. 2014;11(7):740–2.
 58.
Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in singlecell RNA sequencing data. Genome Biol. 2015;16:278.
 59.
Nabavi S, Schmolze D, Maitituoheti M, Malladi S, Beck AH. EMDomics: a robust and powerful method for the identification of genes differentially expressed between heterogeneous classes. Bioinformatics. 2016;32(4):533–41.
 60.
Miao Z, Deng K, Wang X, Zhang X. DEsingle for detecting three types of differential expression in singlecell RNAseq data. Bioinformatics. 2018;34(18):3223–4.
 61.
Jaakkola MK, Seyednasrollah F, Mehmood A, Elo LL. Comparison of methods to detect differentially expressed genes between singlecell populations. Brief Bioinform. 2017;18(5):735–43.
 62.
Wang T, Li B, Nelson CE, Nabavi S. Comparative analysis of differential gene expression analysis tools for singlecell RNA sequencing data. BMC Bioinformatics. 2019;20(1):40.
 63.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
 64.
Keene ON, Jones MR, Lane PW, Anderson J. Analysis of exacerbation rates in asthma and chronic obstructive pulmonary disease: example from the TRISTAN study. Pharm Stat. 2007;6(2):89–97.
 65.
Zhu H, Lakkis H. Sample size calculation for comparing two negative binomial rates. Stat Med. 2014;33(3):376–87.
 66.
Li X, Wu D, Cooper NGF, Rai SN. Sample size calculations for the differential expression analysis of RNAseq data using a negative binomial regression model. Stat Appl Genet Mol Biol. 2019;18(1). https://doi.org/10.1515/sagmb20180021.
 67.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995;57:289–300.
 68.
Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23(21):2881–7.
 69.
Yu D, Huber W, Vitek O. Shrinkage estimation of dispersion in negative binomial models for RNAseq experiments with small sample size. Bioinformatics. 2013;29(10):1275–82.
 70.
Schurch NJ, Schofield P, Gierlinski M, Cole C, Sherstnev A, Singh V, et al. How many biological replicates are needed in an RNAseq experiment and which differential expression tool should you use? RNA. 2016;22(6):839–51.
Acknowledgements
The authors gratefully acknowledge the reviewers for the insightful suggestions and advice during the revision.
Funding
This research was supported by the National Institutes of Health, P20GM103436 to Dr. Nigel Cooper. The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Affiliations
Contributions
XL conducted the simulation, statistical and computational analysis, and drafted the manuscript. ECR and NGFC advised and participated in discussion. XL, TEO, ECR and NGFC revised the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 2.
Contains datasets used for the analysis. This zipped file folder contains MAQC2 and MAQC3 raw read counts, cancer raw data files filtered with zero counts (AdLC, OC and TNBC), and description of these data files named Supplementary Material.docx.
Additional file 3.
R code for the analysis.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Li, X., Cooper, N.G.F., O’Toole, T.E. et al. Choice of library size normalization and statistical methods for differential gene expression analysis in balanced twogroup comparisons for RNAseq studies. BMC Genomics 21, 75 (2020). https://doi.org/10.1186/s1286402065027
Received:
Accepted:
Published:
Keywords
 RNAseq
 Sample sizes
 Normalization
 Statistical test
 Differentially expressed genes