RNA sequencing (RNA-Seq) allows an entire transcriptome to be surveyed at single-base resolution whilst concurrently profiling gene expression levels on a genome scale. RNA-Seq is an attractive approach as it profiles the transcriptome directly through sequencing and therefore does not require prior knowledge of the transcriptome under consideration. An example of the use of RNA-Seq as a high-resolution exploratory tool is the discovery of thousands of additional novel coding and noncoding genes, transcripts and isoforms of known genes despite the prior extensive annotation of the mouse [2–4] and human genomes.
Arguably, the most popular use of RNA-Seq is profiling differential expression (DE), that is, comparing gene expression or transcript abundance between samples. The efficiency, resolution and cost advantages of using RNA-Seq as a tool for profiling DE have prompted many biologists to abandon microarrays in favour of RNA-Seq.
Despite the advantages of using RNA-Seq for DE analysis, there are several sources of sequencing bias and systematic noise that need to be considered when using this approach. Clearly, RNA-Seq analysis is vulnerable to the general biases and errors inherent in the next-generation sequencing (NGS) technology upon which it is based. These errors and biases include: sequencing errors (wrong base calls), biases in sequence quality, nucleotide composition and error rates relative to the base position in the read [9, 10], variability in sequence depth across the transcriptome due to preferential sites of fragmentation, variable primer and transcript nucleotide composition effects and, finally, differences in the coverage and composition of raw sequence data generated from technical and biological replicate samples.
Recently, there have been several investigations [13–15] into the biases that affect the accuracy with which RNA-Seq represents the absolute abundance of a given transcript as measured by high-precision approaches such as TaqMan RT-PCR. It has been shown that these abundance measures are prone to biases correlated with the nucleotide composition [14, 17] and length of the transcript [1, 18]. Several within- and between-sample correction and normalisation procedures have recently been developed to address these biases either as nucleotide composition effects or various combinations of nucleotide, length or library preparation biases [14, 15]. These approaches all yield improvements in the correspondence of RNA-Seq read counts with expression estimates gained by other experimental approaches.
Despite the known biases, RNA-Seq continues to be widely and successfully used to profile relative transcript abundances across samples to identify differentially expressed transcripts. The profile of a given transcript across a biological population is expected to be less prone to nucleotide composition and length biases, as these variables remain constant. Nevertheless, to accurately detect DE across samples it is necessary to understand the sources of variation across technical and biological replication and, where possible, respond to these with an appropriate experimental design and statistically robust analysis [17, 20]. To date, there has been little discussion in the literature of efficient experimental designs for the detection of DE, and a lack of consensus about a standard and comprehensive approach to counter the many sources of noise and bias present in RNA-Seq has meant that some of the biological community remain sceptical about its reliability and unsure of how to design cost-efficient RNA-Seq experiments (see
Good experimental design and appropriate analysis are integral to maximising the power of any NGS study. With regard to RNA-Seq, important experimental design decisions include the choice of sequencing depth and the number of technical and/or biological replicates to use. For researchers with a fixed budget, a critical design question is often whether to increase the sequencing depth at the cost of reduced sample numbers or to increase the sample size with limited sequencing depth for each sample.
Sequencing depth is usually defined as the expected mean coverage at all loci over the target sequence(s), which in the case of RNA-Seq experiments assumes that all transcripts have similar levels of expression. Without the benefit of extensive previous RNA-Seq studies, it is difficult in most cases to estimate, prior to data generation, the optimal sequencing depth or amount of sequencing data required to adequately power the detection of DE in the transcriptome of interest. Pragmatically, RNA-Seq sequencing depth is typically chosen based on an estimation of total transcriptome length (bases) and the expected dynamic range of transcript abundances. Given the dynamic nature of the transcriptome, the suitability of these estimates could vary substantially across organisms, tissues, time points and biological contexts.
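The notion of depth as expected mean coverage can be illustrated with a short calculation. The sketch below assumes single-end reads of uniform length spread evenly over the transcriptome, in line with the equal-expression assumption above; the function name and the numbers in the usage example are hypothetical and are not taken from any dataset discussed here.

```python
def expected_mean_coverage(n_reads: int, read_length: int,
                           transcriptome_length: int) -> float:
    """Expected mean per-base coverage, assuming reads are spread uniformly.

    All three arguments are illustrative inputs (assumptions), measured in
    reads, bases per read and total transcriptome bases respectively.
    """
    if transcriptome_length <= 0:
        raise ValueError("transcriptome_length must be positive")
    return n_reads * read_length / transcriptome_length


# Hypothetical example: 10 million 50 bp reads over a 50 Mb transcriptome.
coverage = expected_mean_coverage(10_000_000, 50, 50_000_000)
print(coverage)  # 10.0
```

Because real transcript abundances span a wide dynamic range, this mean is best read as a rough planning figure: highly expressed transcripts will be covered far more deeply than the mean and rare transcripts far less.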
Wang et al. found a significant increase in correlation between gene transcripts observed and number of sequence reads generated when increasing sequencing depth from 1.6 to 10 million reads, after which the gains plateau; 10 million reads detected about 80% of the annotated chicken transcripts. Despite the expectation of continuous sequencing depth increases in the near future, Łabaj et al. argue that most of the additional reads will align to the subset of already extensively sampled transcripts. As a result, transcripts with low to moderate expression levels will remain difficult to quantify with good precision using current RNA-Seq protocols even at higher read depths. Greater sequencing depth will also increase sensitivity to detect smaller changes in relative expression; however, this does not guarantee that these changes have functional impact in the biological system under study as opposed to tolerated fluctuations in transcript abundance. Ideally, an efficient experimental design will be informed by an understanding of when increasing sequencing depth begins to provide rapidly diminishing returns with regard to transcript detection and DE testing.
Replication is vital for robust statistical inference of DE. In the context of RNA sequencing, multiple nested levels of technical replication exist, depending upon whether it is the sequence data generation, library preparation or RNA extraction process that is being replicated from the same biological sample. Several published studies have incorporated technical replicates into their RNA-Seq experimental designs [23–25]. The degree of technical variation present in these datasets appears to vary and the main source of technical variation appears to be library preparation. Biological replication measures variation within the target population and simultaneously can counteract random technical variation as part of independent sample preparation.
It has been shown that power to detect DE improves when the number of biological replicates n is increased from n = 2 to n = 5; however, to date few studies have incorporated extensive biological replication, and extensive testing of the effects of replication on power is needed. More recently, with the increasing utility and availability of multiplex experimental designs, the incorporation of biological replicates with decreased sequencing depth is becoming a much more attractive and cost-effective strategy. The relative merits of sacrificing sequencing depth for increased replication have not been rigorously explored.
Efficient experimental design
Multiplexing is an increasingly popular approach that allows multiple samples to be sequenced in a single sequencing lane or reaction, thereby reducing the sequencing cost per sample [27, 28]. Multiplexing uses indexing tags or “barcodes”, short (≤ 20 bp) stretches of sequence that are ligated to the start of sample sequence fragments during the library preparation step. Barcodes are distinct between sample libraries and allow pooling for sequencing followed by allocation of reads back to individual samples after sequencing by analysis of the sequenced barcode. Multiplex barcode designs are routinely available with up to 12 samples in the same lane; recently, up to 96 yeast DNA samples were profiled in a single lane
. Novel methods are continuing to emerge for low-cost strategies to multiplex RNA-Seq samples
. With the dramatic increases in sequencing yields being achieved with current chemistries and new platforms, multiplexing is becoming the method of choice to increase sample throughput. These designs directly affect the sequencing depth generated per sample, which needs to be considered when assessing the power of the experimental design. Also, when multiplex strategies are used, biologists need to be mindful of potential systematic variations between sequencing lanes. These variations can be addressed through randomisation or blocking designs to distribute samples across lanes, see
 for a discussion of barcoding bias in multiplex sequencing, and
 for an alternative to barcoding. In a comparison between microarray and NGS technologies in synthetic pools of small RNA, Willenbrock et al.
 found that multiplexing resulted in decreased sensitivity due to a reduction of sequencing depth and a loss of reproducibility; however, the authors did not investigate power for detection of DE in their study.
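The barcode-based allocation of reads described above can be sketched in a few lines. This is a minimal illustration only: it assumes the barcode occupies the first bases of each read and requires an exact match, whereas production demultiplexers typically tolerate mismatches. The function names, barcode sequences and lane yield are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Assign reads to samples by their leading barcode (exact match only).

    Matched reads are stored with the barcode trimmed off; reads whose
    prefix matches no known barcode are kept whole under "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        if barcode in barcode_to_sample:
            by_sample[barcode_to_sample[barcode]].append(read[barcode_length:])
        else:
            by_sample["unassigned"].append(read)
    return dict(by_sample)


def reads_per_sample(lane_yield, n_samples):
    """Nominal depth per sample if a lane is shared equally (an idealisation)."""
    return lane_yield // n_samples


# Hypothetical 4 bp barcodes and toy reads.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = ["ACGTGGGCCC", "TGCAAAATTT", "NNNNCCCGGG"]
print(demultiplex(reads, barcodes, barcode_length=4))
# {'sample_A': ['GGGCCC'], 'sample_B': ['AAATTT'], 'unassigned': ['NNNNCCCGGG']}

# Hypothetical lane yield of 120 million reads shared by 12 samples.
print(reads_per_sample(120_000_000, 12))  # 10000000
```

The second function makes the trade-off explicit: each additional sample pooled into a lane lowers the nominal depth available to every other sample in that lane.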
Improving detection of DE requires not only an appropriate experimental design but also a suitably powered analysis approach. Several algorithms have recently been developed specifically to appropriately handle expected technical and biological variation arising from RNA-Seq experiments. A non-exhaustive list of these algorithms is: edgeR
. A thorough comparison of these packages’ performance with datasets of different properties falls beyond the scope of this study. However, before considering issues relating to power and experimental design, it is important to investigate whether packages for DE analysis give the correct type I error rate under the null hypothesis of no DE. For this evaluation, we considered three popular packages for DE analysis of RNA-sequencing data. These packages are based on a negative binomial distribution model of read counts
 and include edgeR
 and NBPSeq
To quantify the effects of different sequencing depths and replication choices we compared a range of realistic experimental designs for their ability to robustly detect DE. Using simulated data with known DE transcripts allowed us to estimate the false positive rate (FPR) and true positive rate (TPR) of DE calls. The changes of these rates were used to compare the detection power yielded by each choice of number of biological replicates and sequencing depth.
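When the set of truly DE transcripts is known from simulation, the two rates can be computed directly from the DE calls. The sketch below uses the conventional definitions, FPR = FP / (FP + TN) and TPR = TP / (TP + FN); this is an assumption, since the definitions actually used in this study are given in the Methods section and may differ. The function name and transcript identifiers are hypothetical.

```python
def fpr_tpr(called_de, truly_de, all_transcripts):
    """Return (FPR, TPR) for a set of DE calls against a known truth set.

    Uses the conventional definitions (an assumption, see lead-in):
    FPR = FP / (FP + TN) and TPR = TP / (TP + FN).
    """
    called_de, truly_de = set(called_de), set(truly_de)
    all_transcripts = set(all_transcripts)
    tp = len(called_de & truly_de)                     # DE and called DE
    fp = len(called_de - truly_de)                     # not DE but called DE
    fn = len(truly_de - called_de)                     # DE but missed
    tn = len(all_transcripts - called_de - truly_de)   # not DE, not called
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    return fpr, tpr


# Toy example: 10 transcripts, 4 truly DE, 4 called (3 correct, 1 false).
transcripts = {f"t{i}" for i in range(1, 11)}
truth = {"t1", "t2", "t3", "t4"}
calls = {"t1", "t2", "t3", "t5"}
fpr, tpr = fpr_tpr(calls, truth, transcripts)
print(round(fpr, 3), tpr)  # 0.167 0.75
```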
In the Methods section, we outline the definitions used for FPR and TPR and explain how the synthetic data were constructed, including how differential expression is induced, how the variation introduced by biological replicates is simulated and how loss of sequencing depth is simulated.
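One common way to mimic a loss of sequencing depth is binomial thinning, in which each read is retained independently with a fixed probability. The sketch below illustrates that idea only; it is not necessarily the procedure used in this study, which is described in the Methods section. The function name, the seed and the example counts are hypothetical.

```python
import random


def thin_counts(counts, keep_fraction, seed=0):
    """Down-sample read counts by keeping each read with a fixed probability.

    Binomial thinning, shown for illustration and not as the authors' method.
    Each of the c reads for a transcript is an independent Bernoulli trial,
    so the thinned count is Binomial(c, keep_fraction).
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [sum(rng.random() < keep_fraction for _ in range(c)) for c in counts]


# Hypothetical counts for four transcripts, thinned to roughly 25% depth.
original = [1000, 200, 12, 0]
thinned = thin_counts(original, keep_fraction=0.25)
print(thinned)
```

The explicit Bernoulli loop keeps the example dependency-free but is slow for realistic count matrices; a vectorised binomial sampler would be the natural choice in practice. Note that low counts can easily thin to zero, which is one way reduced depth erodes the ability to detect lowly expressed transcripts.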
In our study, we test a wide range of real-world experimental design scenarios for performance under the null hypothesis and in the presence of DE. In these scenarios both the numbers of biological replicates n and the sequencing depth are varied. This provides a comprehensive quantitative comparison of different experimental design strategies and is particularly informative for those accessing modern multiplex approaches.