Differentially expressed genes from RNA-Seq and functional enrichment results are affected by the choice of single-end versus paired-end reads and stranded versus non-stranded protocols
© The Author(s). 2017
Received: 16 December 2016
Accepted: 16 May 2017
Published: 23 May 2017
RNA-Seq is now widely used as a research tool. Choices must be made whether to use paired-end (PE) or single-end (SE) sequencing, and whether to use strand-specific or non-specific (NS) library preparation kits. To date there has been no analysis of the effect of these choices on identifying differentially expressed genes (DEGs) between controls and treated samples and on downstream functional analysis.
We undertook four mammalian transcriptomics experiments to compare the effect of SE and PE protocols on read mapping, feature counting, identification of DEGs and functional analysis. For three of these experiments we also compared a non-stranded (NS) and a strand-specific approach to mapping the paired-end data. SE mapping resulted in a reduced number of reads mapped to features in all four experiments, and a lower read count per gene. Up to 4.3% of genes in the SE data and up to 12.3% of genes in the NS data had read counts which were significantly different compared to the PE data. Comparison of DEGs showed the presence of false positives (average 5%, using voom) and false negatives (average 5%, using voom) using the SE reads. These increased further, by one or two percentage points, with the NS data. Gene ontology (GO) functional enrichment of the DEGs arising from SE or NS approaches revealed striking differences in the top 20 GO terms, with as little as 40% concordance with PE results. Caution is therefore advised in the interpretation of such results. By comparison, there was overall consistency in gene set enrichment analysis results.
A strand-specific protocol should be used in library preparation to generate the most reliable and accurate profile of expression. Ideally, PE reads are also recommended, particularly for transcriptome assembly. Whilst SE reads produce a DEG list with around 5% of false positives and false negatives, this method can substantially reduce sequencing cost, and this saving could be used to increase the number of biological replicates, thereby increasing the power of the experiment. As SE reads, when used in association with gene set enrichment, can generate accurate biological results, this may be a desirable trade-off.
Keywords: RNA-Seq; Transcriptomics; Paired-end reads; Single-end reads; Differential expression; Strand-specific; Non-strand-specific
Technical advances in next generation sequencing over the past decade have resulted in greater output of sequence data, and at a lower cost [1]. At the same time, analysis methods for understanding and interrogating sequence data have flourished [2–4]. This has resulted in the widespread uptake of techniques such as RNA-Seq for projects both large and small. RNA-Seq typically involves sequencing RNA obtained from a sample, quantification by mapping reads to genomic features, and comparison between conditions. Unlike microarrays, no prior knowledge of samples (or probes) is necessary and hence it is possible to identify both known and novel transcripts as well as assembling a transcriptome de novo [5, 6]. The use of RNA-Seq to better understand the transcriptome of a vast range of organisms has grown dramatically over the past few years. With more researchers undertaking transcriptomic analyses, questions arise as to the most accurate and cost-efficient way of doing this.
One question that arises is whether it is necessary to perform paired-end sequencing or whether single-end sequencing is adequate. Paired-end sequencing involves the sequencing of both ends of each cDNA fragment rather than sequencing only one end. As the gap size between the ends of the fragment can be estimated, this technique facilitates accurate alignment back to the reference genome. However, paired-end sequencing involves twice as much sequencing and is therefore more costly than single-end sequencing. The choice between the two methods therefore affects the number of samples which can be sequenced within a researcher’s budget. The number of biological replicates sequenced will in turn affect the power of the experiment to find differential expression, which is the purpose of many RNA-Seq experiments.
Early RNA-Seq library preparation protocols could not determine the strand of origin and thus direction of any RNA read from the genome. This is problematic when there are overlapping genomic features. A number of techniques have now been developed to address this shortcoming, as reviewed by Levin et al. [10]. Strand-specific library preparation protocols, such as the Illumina TruSeq Stranded Total RNA Prep Kit or the Illumina TruSeq Stranded mRNA Library Prep Kit, are available and are no more costly than using non-strand aware protocols. There have been some recent analyses of strand-specific and non-strand-specific protocols, which point to the benefits of strand-specific approaches [11, 12].
We have used four different mammalian RNA-Seq experiments to assess the effect of using paired-end or single-end reads and a strand-specific versus non-specific library preparation protocol. We have looked at the effect of these factors on mapping, feature counting and on the ultimate objective of many RNA-Seq experiments – the calling of differentially expressed genes. We also consider the impact on functional insights which emerge from the differential expression analysis.
Table: Four mammalian RNA-Seq experiments using different read mapping strategies. Columns: RNA-Seq data set (for example, primary hematopoietic cells), control group (n), treated group (n), and averaged read numbers for the PE, SE and NS analyses (values not preserved).
The number of assigned reads, multimapped reads and ambiguous reads varied between the PE, SE and NS data analyses
We mapped the reads to the respective genomes using Tophat2 [14] and then mapped the read loci to RNA features using the featureCounts function of Subread [15]. The featureCounts function gives the overall number of assigned reads as well as the number of reads which could not be assigned to RNA features because they were (i) ambiguous, (ii) multimapped or (iii) did not correspond to an RNA feature.
We investigated the reduction in the number of reads uniquely assigned using SE mapping. This revealed a high proportion of reads being ‘multimapped’, with an average increase of 20% of multimapped reads compared to the PE data (Fig. 1c). This indicates that many 75 bp reads are not sufficiently unique to allow mapping to a unique genomic locus, whereas the availability of the other end of the DNA fragment enables the exact genomic location of the fragment to be identified. When comparing the strand-specific PE and NS data, there was little difference in the number of multimapped reads (Fig. 1d). However, in this case we saw a striking increase in the number of ambiguous reads, ranging from a 200% increase in Experiment 2 to a 40% increase in Experiment 3, with an average increase of 116% over the three experiments (Fig. 1f). By contrast, we saw a decrease in ambiguous reads in the SE data in Experiments 2 and 3 and a small increase (< 10%) in Experiments 1 and 4 (Fig. 1e). Ambiguous reads arise when the genomic location of a read is known but where that location may be part of more than one gene or other feature, on the same or opposite strand of DNA. Using a strand-specific protocol resolves ambiguity arising from the latter, and hence we see the fewest ambiguous reads using the strand-specific protocol.
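The way strand information resolves ambiguity can be illustrated with a minimal Python sketch. It is not the featureCounts algorithm: the `Feature` and `Read` types, the coordinates and the gene names are hypothetical, and the read strand is assumed to have already been converted to the strand of the originating transcript.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    gene_id: str
    start: int
    end: int
    strand: str  # "+" or "-"


class Read(NamedTuple):
    start: int
    end: int
    strand: str  # assumed strand of the originating transcript


def overlaps(read: Read, feature: Feature) -> bool:
    """True if the read and the feature share at least one base."""
    return read.start <= feature.end and feature.start <= read.end


def assign(read: Read, features: list, stranded: bool) -> str:
    """Return a gene id, 'ambiguous' or 'no_feature' for one read."""
    hits = {
        f.gene_id
        for f in features
        if overlaps(read, f) and (not stranded or f.strand == read.strand)
    }
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"
    return next(iter(hits))


# Two hypothetical genes that overlap on opposite strands.
features = [
    Feature("geneA", 100, 500, "+"),
    Feature("geneB_antisense", 400, 800, "-"),
]
read = Read(420, 495, "-")

print(assign(read, features, stranded=True))   # geneB_antisense
print(assign(read, features, stranded=False))  # ambiguous
```

With strand information the read is counted for the antisense gene; without it, the same read overlaps both genes and is set aside as ambiguous.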
Read count per gene tends to be less in SE data compared to PE data
The number of assigned reads is significantly different in SE and NS data compared to PE data
Table: Number of genes with significantly different counts, including a breakdown of non-protein-coding genes (values not preserved).
Proportion of protein-coding and non-coding reads is affected by sequencing protocols
By contrast, the DEGs found with a non-strand-specific protocol have a higher proportion of antisense reads (ranging from 7–16% of the DEGs in the three experiments) compared to 1–4% of the DEGs in the SE vs PE comparison. This is also consistent with expectation as, by definition, antisense transcripts overlap genetic features on the opposite DNA strand and so are likely to be considered ‘ambiguous’ if a non-strand-specific protocol is used.
Differential expression analysis of treated versus control samples
Functional analysis of differentially expressed genes using gene ontology (GO) analysis and gene set enrichment
Table: Concordance for GO terms found using the goana function. Comparisons: PE vs SE and PE vs NS, for the top 20 GO terms (values not preserved).
Table: Concordance for gene sets found in gene set analysis using the camera function. Comparisons: PE vs SE and PE vs NS, for the top 50 gene sets (PE total 243; other values not preserved).
The importance of biological replicates
Table: Power calculations using functions in RNASeqPower. Columns: PE power (n = 3), SE power (n = 3), PE power (n = 5), SE power (n = 5) (values not preserved).
SE data can be produced more economically than PE data and can potentially allow the use of additional replicates. To investigate this, we tested whether differential expression analysis between controls and treated samples differed when using 3 biological replicates per condition with SE data compared to using 2 biological replicates per condition with PE data. In all cases we found that using 3 biological replicates with SE data was superior to using 2 biological replicates with PE data for identifying DEGs, as judged against the gold standard for this experiment (three biological replicates with PE). However, it should be noted that our results were influenced by which 2 biological replicates were chosen for study, indicating the inadequacy of 2 replicates for estimating biological variation. These data are presented in Additional file 1, Additional file 2: Table S1 and Additional file 3: Figure S1. The use of only 2 biological replicates is extreme and is not recommended; however, this analysis does indicate the benefit of increasing the number of biological replicates, even though this may be at the expense of sequencing both ends of the RNA-Seq reads.
RNA-Seq technology has been keenly embraced by the research community, as evidenced by the hundreds of publications involving the deep sequencing and analysis of transcriptomes. Sequencing centres generally offer a number of library preparation strategies, along with paired-end or single-end sequencing. Initially, sequencing chemistry could not distinguish the DNA strand from which a read originated. However, straightforward strand-specific chemistry has become available recently, such as that provided by the Illumina TruSeq Stranded Library Prep Kits, as used in this study. Sequencing centres and researchers about to embark upon an RNA-Seq project require information to make decisions regarding library preparation. This motivated us to undertake this comparison, involving four mammalian RNA-Seq experiments.
Single-end sequencing involves half as much sequencing as paired-end sequencing and thus halves the sequencing cost, excluding sample preparation. On this basis it is an attractive option. However, this must be balanced against its drawbacks. Our study shows that single-end reads, compared to paired-end reads, result in a reduction in the number of reads that can be assigned to RNA features and a trend of lower read counts per feature. As a consequence, between 0.72% and 4.29% of expressed genes had significantly different counts when comparing the SE data to the PE data. The main reason for this appears to be the discarding of multimapped reads, which reduces the number of reads that can be assigned to features. We found that this had a strong effect on read counts for pseudogenes but also affected other non-protein-coding features as well as protein-coding genes.
Using a non-strand-specific protocol had an even greater impact on results. Up to 12.26% of the tested genes showed significantly different counts in the NS data, as compared to the PE data. Non-strand-specific protocols therefore pose a definite risk to accurate analysis. An earlier study comparing a stranded and non-stranded RNA-Seq protocol in blood samples from five human subjects found 10.65% of genes were differentially expressed when comparing these protocols. Our study gave similar results in the human samples (12.26%, Exp 3 and 11.40%, Exp 4). Whilst SE data showed a trend towards decreased read counts per gene, the NS data was more variable, with read counts being higher for some genes but lower for others. It is interesting, though, that many genes had reads allocated in the NS protocol but no counts using the strand-specific data. This spurious allocation of reads is likely to lead to the incorrect assumption that certain features are expressed when they are not. Indeed, it has previously been observed that a non-strand-specific protocol results in a significant fraction of genes having overestimated expression values; this poses a significant problem given that approximately 16% of protein-coding genes are overlapping.
These differences in read counts had a non-negligible effect on evaluating differential expression. In RNA-Seq analyses it is usual to derive a list of DEGs based on a threshold statistical value, such as a multiple-testing adjusted p-value of 0.05. Adopting this standard approach, and using PE data as a gold standard, single-end data produced both false positives (average across four experiments: edgeR 7%, voom 5%) and false negatives (average across four experiments: edgeR 4%, voom 5%). Differential expression analysis of the controls and treated samples, again using PE data as a gold standard, revealed that the NS data produces a greater proportion of false positives and false negatives than occurs using the SE data (average false positives across three experiments: edgeR 9%, voom 6%; average false negatives across three experiments: edgeR 8%, voom 7%).
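A comparison of a DEG list against a gold-standard list can be sketched with set operations. The denominators below are an assumption made for illustration (false positives as a share of the test list, false negatives as a share of the gold-standard list); the gene names are hypothetical placeholders.

```python
def deg_error_rates(gold_degs, test_degs):
    """Compare a test DEG list (e.g. SE or NS) with a gold-standard list (e.g. PE).

    Assumed definitions, chosen for illustration only:
      false positives = genes called in the test list but not in the gold list,
                        as a percentage of the test list;
      false negatives = gold-list genes missing from the test list,
                        as a percentage of the gold list.
    """
    gold, test = set(gold_degs), set(test_degs)
    false_pos = test - gold
    false_neg = gold - test
    return {
        "false_positive_pct": 100 * len(false_pos) / len(test) if test else 0.0,
        "false_negative_pct": 100 * len(false_neg) / len(gold) if gold else 0.0,
    }


# Hypothetical lists: 20 gold-standard DEGs; the test list misses one
# of them and calls one gene that is not in the gold standard.
pe_degs = {f"gene{i}" for i in range(1, 21)}
se_degs = (pe_degs - {"gene1"}) | {"gene99"}

rates = deg_error_rates(pe_degs, se_degs)
print(rates)
```

Other denominators (for example, the number of tested genes) would give different percentages, so the definition should always be stated alongside the figures.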
The utility of single-end reads may depend on the questions being asked in the research. When it comes to understanding any functional differences between case and controls in the experiment, we saw that SE reads and NS reads can lead to a dramatic difference in the top 20 gene ontology terms arising from enrichment analysis. For example in Experiment 2, there was only 65% concordance in GO terms found between the SE and PE data and 40% concordance between the NS and PE data. Caution is thus to be advised if using SE data or NS data. However, there was reasonable agreement between the top 300 GO terms identified from the DEGs. If a comprehensive list of GO terms is used it is therefore likely that the same broad conclusions would be drawn as to the functional effect of the treated samples versus the controls. Interestingly, when we supplemented the use of DEGs by a gene set enrichment analysis approach, which uses all tested genes, we found strong agreement in the most significant gene sets. In this case, the same biological conclusions could be drawn regardless of the sequencing and read mapping method.
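Concordance between ranked lists of enriched terms can be expressed as the percentage of the top N entries that two lists share. The sketch below assumes that definition; the term names are hypothetical placeholders rather than real GO identifiers.

```python
def top_n_concordance(ranked_ref, ranked_test, n):
    """Percentage of the top-n terms of ranked_ref also found in the top n of ranked_test."""
    if n <= 0:
        raise ValueError("n must be positive")
    if len(ranked_ref) < n or len(ranked_test) < n:
        raise ValueError("both ranked lists need at least n entries")
    shared = set(ranked_ref[:n]) & set(ranked_test[:n])
    return 100 * len(shared) / n


# Hypothetical ranked term lists (most significant first).
pe_terms = [f"term{i}" for i in range(1, 31)]
# The test ranking keeps the first 10 terms but promotes terms 21-30
# ahead of terms 11-20.
se_terms = pe_terms[:10] + pe_terms[20:30] + pe_terms[10:20]

print(top_n_concordance(pe_terms, se_terms, 20))  # 50.0
print(top_n_concordance(pe_terms, se_terms, 30))  # 100.0
```

The example shows why a short list can disagree strongly while a longer list agrees: the same terms are present in both rankings, but in a different order.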
A final consideration when undertaking RNA-Seq is the power to detect differentially expressed genes. This increases with additional replicates. We estimate that power to detect differential expression could be increased by around 15% if the number of biological replicates in each condition was increased from three to five. As the cost of sequencing is an important factor in experimental design, the combination of single-end reads with an increased number of biological replicates may be a sound trade-off, especially if care is taken in the functional analysis of results. Our analysis indicates that using 3 biological replicates per condition with SE is preferable to using 2 biological replicates with PE sequencing when undertaking differential expression analysis. This is a rather extreme comparison; however, it does illustrate the important advantages that can be gained with additional biological replicates.
There is no cost difference in using a strand-specific and non-strand-specific library preparation. In this case it is clear that a strand-specific method is preferable. A strand-specific protocol avoids underestimation of read counts as occurs when a read could be allocated to alternative features. The number of ambiguous reads decreases and a strand-specific protocol avoids spurious allocation of reads to features in cases where the feature is not actually expressed.
Conclusions and recommendations
Using a paired-end strand-specific protocol is necessary to obtain an accurate read count for all genomic features. Errors in read counts will occur from use of single-end or non-stranded sequencing, and lead to false negatives and false positives in the analysis of differentially expressed genes. This can and will affect downstream analysis, including in functional GO enrichment analysis. Ultimately, this can affect the biological interpretation of results.
At the same time it must be borne in mind that using SE mapping reduces the sequencing cost and that this saving could be used to increase the number of biological replicates. This will increase the power of an experiment, and may be a desirable trade-off. There is no similar advantage in using a non-strand-specific protocol and we would always recommend the use of a strand-specific protocol.
In conclusion, the use of a strand-specific protocol is recommended in all cases. The use of single-end reads with additional replicates may be preferable to paired-end reads with fewer replicates for differential expression analysis. However, if transcriptome assembly is the primary objective of an experiment then use of paired-end reads will be a better strategy.
Experiments 1 and 2 involved the use of mice. All experimental procedures were approved by the Animal Care and Ethics Committee at UNSW Australia.
Experiment 3 involved the use of human primary hematopoietic cells. All experimental procedures were approved by the Human Research Ethics Committee and Institutional Biosafety Committee at UNSW Australia.
Experiment 4 involved the use of a human NuLi cell line [21] and did not require ethics approval.
mRNA libraries for all four experiments were prepared at the Ramaciotti Centre for Genomics (UNSW Australia). The Illumina TruSeq RNA Prep Kit was used for Experiment 1. The Illumina TruSeq Stranded Total RNA Prep Kit was used for Experiments 2 and 3, and the Illumina TruSeq Stranded mRNA Prep Kit was used for Experiment 4. The six RNA-Seq libraries in Experiment 1 were sequenced on the Illumina HiSeq2000 platform; Experiments 2, 3 and 4 were sequenced using the Illumina NextSeq 500. R1.fastq and R2.fastq files were produced for each sample.
Mapping RNA-Seq reads
The reads for Experiment 1 and Experiment 2 were mapped to the Ensembl Mus musculus genome (GRCm38). The reads for Experiment 3 and Experiment 4 were mapped to the Ensembl Homo sapiens genome (GRCh38). Mapping was performed with Tophat2 (v 2.0.12) [14] calling Bowtie2 (v 2.1.0) [22]. For paired-end mapping we used the settings: tophat2 -p 6 --library-type fr-firststrand -G $gtf -o $output $ref *R1*.fastq.gz *R2*.fastq.gz. For single-end mapping we used the settings: tophat2 -p 6 --library-type fr-firststrand -G $gtf -o $output $ref *R1*.fastq.gz. To process in a non-strand-specific manner we used the settings: tophat2 -p 6 --library-type fr-unstranded -G $gtf -o $output $ref *R1*.fastq.gz *R2*.fastq.gz.
The featureCounts function of Subread [15] was used to generate counts of reads uniquely mapped to annotated genes using the GRCm38 annotation gtf file and the GRCh38 annotation gtf file respectively. For the bam files produced from the strand-specific paired-end data we used the script: featureCounts -s 2 -T 12 -p -a $gtf -t exon -g gene_id -o $seq\featurecounts.txt n_sort.accepted_hits.bam. For the bam files produced from the strand-specific single-end data we used the script: featureCounts -s 2 -T 12 -a $gtf -t exon -g gene_id -o $seq\featurecounts.txt n_sort.accepted_hits.bam. For the non-strand-specific protocol we used the script: featureCounts -s 0 -T 12 -p -a $gtf -t exon -g gene_id -o $seq\featurecounts.txt n_sort.accepted_hits.bam.
Differential gene expression analysis
Tables of raw counts generated using featureCounts were used as input in all analyses. Comparison of controls in the PE, SE and NS datasets was performed using edgeR (v 3.14.0). We excluded lowly expressed genes and tested those genes with expression of at least 1 CPM (counts per million) in at least one of the controls. The number of genes retained for testing in each of the experiments was as follows: Experiment 1: 15456, Experiment 2: 12702, Experiment 3: 14016, Experiment 4: 14166. Counts were normalized using the TMM method and generalized linear models were used for differential expression analysis. Comparison of the controls versus treated samples in each of the four experiments was performed using functions in the edgeR (v 3.14.0) [23] and limma (v 3.28.17) [17] Bioconductor packages. Low count transcripts were excluded and only those genes with at least 1 CPM in at least 3 samples were used for analysis. In all cases differentially expressed genes were defined as those genes with a Benjamini-Hochberg corrected p-value less than 0.05.
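The two thresholds used above, a counts-per-million filter and a Benjamini-Hochberg cut-off, can be sketched in plain Python. This is an illustration of the arithmetic only, not the edgeR or limma code: it uses raw library sizes without TMM normalization, and the gene names, counts and p-values are hypothetical.

```python
def counts_per_million(counts):
    """counts: dict mapping gene -> list of raw counts, one per sample."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
        for gene, row in counts.items()
    }


def filter_low_expression(counts, min_cpm=1.0, min_samples=3):
    """Keep genes with at least min_cpm CPM in at least min_samples samples."""
    cpms = counts_per_million(counts)
    return [
        gene
        for gene, row in cpms.items()
        if sum(value >= min_cpm for value in row) >= min_samples
    ]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts for three genes across six samples.
raw = {
    "high": [999_990] * 6,
    "mid": [9, 9, 9, 0, 0, 0],
    "low": [1, 1, 0, 0, 0, 0],
}
kept = filter_low_expression(raw)
print(kept)  # ['high', 'mid']

# Hypothetical raw p-values and the genes passing an adjusted threshold of 0.05.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(p, 4) for p in adj])
```

In this toy example "low" reaches 1 CPM in only two samples and is therefore dropped, while "mid" reaches the threshold in three samples and is retained.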
We used the goana function [18] included in the limma Bioconductor package to find the most enriched gene ontology terms in the lists of DEGs. The Gene Ontology (GO) terms in the categories Biological Process, Cellular Component and Molecular Function were included. The false discovery rate (FDR) was set to 0.05 and the function topGO was used to order the GO terms by statistical significance. We compared sets comprising the top 20, 50, 100, 200 and 300 GO terms.
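Over-representation of a GO term in a DEG list is commonly assessed with a one-sided hypergeometric test. The sketch below shows that calculation under simple assumptions; it does not reproduce goana itself (for example, it applies no adjustment for gene-length or abundance bias), and the gene counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_term, n_deg, n_deg_in_term):
    """One-sided p-value P(X >= n_deg_in_term).

    X is the number of term-annotated genes obtained when n_deg genes are
    drawn without replacement from a universe of n_universe tested genes,
    of which n_in_term are annotated to the term.
    """
    total = comb(n_universe, n_deg)
    upper = min(n_in_term, n_deg)
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_deg - k)
        for k in range(n_deg_in_term, upper + 1)
    )
    return tail / total


# Hypothetical example: 10 tested genes, 4 annotated to the term,
# 3 DEGs of which 2 carry the annotation.
p = hypergeom_enrichment_p(n_universe=10, n_in_term=4, n_deg=3, n_deg_in_term=2)
print(p)
```

Because the test depends on exactly which genes are in the DEG list, a handful of false positives or false negatives can reorder the most significant terms.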
We used the camera function [19] included in the limma package to perform gene set enrichment analysis. We interrogated the gene sets contained in mouse_c2_v5.rdata for mouse and human_c2_v5.rdata for human, downloaded from http://bioinf.wehi.edu.au/software/MSigDB. We used an inter.gene.cor = 0.01 and an FDR value of 0.01. We compared sets comprising the top 20, 50, 100, 200 and 300 gene sets.
We carried out power calculations using the RNASeqPower package [24]. The required inputs to the rnapower function are the coefficient of variation between biological replicates (CV) and average read depth per gene. We calculated the biological coefficient of variation in each experiment using the function estimateGLMCommonDisp from edgeR. We used an average depth of 10 reads and confirmed this estimate by creating multidensity plots of count distribution.
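The dependence of power on the number of replicates, read depth and CV can be sketched with a normal-approximation formula of the kind used for RNA-Seq sample size estimation, in which the variance of the log fold change for two groups of n samples is taken as 2(1/depth + CV²)/n. This is an illustrative approximation and not the rnapower code; the CV of 0.4, the two-fold effect size and the alpha of 0.05 are hypothetical values, while the depth of 10 follows the text.

```python
from math import log, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()


def approx_power(n, depth, cv, effect, alpha=0.05):
    """Approximate power of a two-group comparison with n samples per group.

    Assumes the log fold change is approximately normal with
    variance 2 * (1/depth + cv**2) / n (an assumption of this sketch).
    """
    z_alpha = _NORMAL.inv_cdf(1 - alpha / 2)
    std_err = sqrt(2 * (1 / depth + cv ** 2) / n)
    return _NORMAL.cdf(abs(log(effect)) / std_err - z_alpha)


def approx_sample_size(power, depth, cv, effect, alpha=0.05):
    """Samples per group needed for the requested power (same approximation)."""
    z_alpha = _NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _NORMAL.inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * (1 / depth + cv ** 2) / log(effect) ** 2


# Hypothetical inputs: depth 10, CV 0.4, two-fold change.
for n in (3, 5):
    print(n, round(approx_power(n, depth=10, cv=0.4, effect=2.0), 3))
```

Under this approximation power rises with the number of replicates for any fixed depth and CV, which is the behaviour that motivates trading read pairing for additional biological replicates.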
The fastq files relating to Experiment 1 were kindly provided by Dr Stephen Palmer of UNSW Australia. Ms Veronica Mendoza Reinoso (PhD candidate), UNSW Australia carried out the molecular biology necessary to prepare the samples used in Experiment 2 for RNA-Seq analysis. Ms Zara Ali (PhD candidate), Children’s Cancer Institute, carried out the molecular biology necessary to prepare the samples used in Experiment 3 for RNA-Seq analysis. Ms Joanne Pagnon (PhD candidate) and Dr Phoebe Griffin (School of Medicine, University of Tasmania) carried out the molecular biology necessary to prepare the samples used in Experiment 4 for RNA-Seq analysis. SMC and MRW acknowledge support from the Australian Federal Government CRIS scheme, the NSW State Government RAAP scheme and UNSW Australia. KLM (Experiment 3) had funding support from the National Health and Medical Research Council and the Cancer Council NSW. LFR (Experiment 4) was funded by the Clifford Craig Medical Research Trust (Grant 116).
Availability of data and materials
The raw data on which this publication is based are available at the Gene Expression Omnibus with the GEO accession number GSE97471.
The study was conceived by SMC. The RNA samples were provided by AB, KLM and LFR. SMC designed and performed the bioinformatic analysis under the supervision of MRW. SMC drafted the manuscript in consultation with MRW. MRW, AB, KLM and LFR revised the draft and made contributions to the final manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
All experimental procedures involving mice (Experiment 1, Experiment 2) were approved by the Animal Care and Ethics Committee at UNSW Australia.
Experiment 3, which involved the use of human primary hematopoietic cells, was approved by the Human Research Ethics Committee and Institutional Biosafety Committee at UNSW Australia.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C. Ten years of next-generation sequencing technology. Trends Genet. 2014;30(9):418–26.
2. Engstrom PG, Steijger T, Sipos B, Grant GR, Kahles A, Ratsch G, Goldman N, Hubbard TJ, Harrow J, Guigo R, et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods. 2013;10(12):1185–91.
3. Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, Robinson MD. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat Protoc. 2013;8(9):1765–86.
4. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015;12(2):115–21.
5. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5(7):621–8.
6. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511–5.
7. Hu Y, Wang K, He X, Chiang DY, Prins JF, Liu J. A probabilistic framework for aligning paired-end RNA-seq data. Bioinformatics. 2010;26(16):1950–7.
8. Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010;11(5):473–83.
9. Ching T, Huang SJ, Garmire LX. Power analysis and sample size estimation for RNA-Seq differential expression. RNA. 2014;20(11):1684–96.
10. Levin JZ, Yassour M, Adiconis X, Nusbaum C, Thompson DA, Friedman N, Gnirke A, Regev A. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat Methods. 2010;7(9):709–15.
11. Sigurgeirsson B, Emanuelsson O, Lundeberg J. Analysis of stranded information using an automated procedure for strand specific RNA sequencing. BMC Genomics. 2014;15:631.
12. Zhao S, Zhang Y, Gordon W, Quan J, Xi H, Du S, von Schack D, Zhang B. Comparison of stranded and non-stranded RNA-seq transcriptome profiling and investigation of gene overlap. BMC Genomics. 2015;16:675.
13. Sultan M, Dokel S, Amstislavskiy V, Wuttig D, Sultmann H, Lehrach H, Yaspo ML. A simple strand-specific RNA-Seq library preparation protocol combining the Illumina TruSeq RNA and the dUTP methods. Biochem Biophys Res Commun. 2012;422(4):643–6.
14. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.
15. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–30.
16. Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15(2):R29.
17. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.
18. Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol. 2010;11(2):R14.
19. Wu D, Smyth GK. Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Res. 2012;40(17):e133.
20. Liu Y, Zhou J, White KP. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics. 2014;30(3):301–4.
21. Zabner J, Karp P, Seiler M, Phillips SL, Mitchell CJ, Saavedra M, Welsh M, Klingelhutz AJ. Development of cystic fibrosis and noncystic fibrosis airway cell lines. Am J Physiol Lung Cell Mol Physiol. 2003;284(5):L844–854.
22. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.
23. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
24. Hart SN, Therneau TM, Zhang Y, Poland GA, Kocher JP. Calculating sample size estimates for RNA sequencing data. J Comput Biol. 2013;20(12):970–8.