Finding the active genes in deep RNA-seq gene expression studies
© Hart et al.; licensee BioMed Central Ltd. 2013
Received: 12 July 2013
Accepted: 29 October 2013
Published: 11 November 2013
Early application of second-generation sequencing technologies to transcript quantitation (RNA-seq) has hinted at a vast mammalian transcriptome, including transcripts from nearly all known genes, which might be fully measured only by ultradeep sequencing. Subsequent studies suggested that low-abundance transcripts might be the result of technical or biological noise rather than active transcripts; moreover, most RNA-seq experiments did not provide enough read depth to generate high-confidence estimates of gene expression for low-abundance transcripts. As a result, the community adopted several heuristics for RNA-seq analysis, most notably an arbitrary expression threshold of 0.3 - 1 FPKM for downstream analysis. However, advances in RNA-seq library preparation, sequencing technology, and informatic analysis have addressed many of the systemic sources of uncertainty and undermined the assumptions that drove the adoption of these heuristics. We provide an updated view of the accuracy and efficiency of RNA-seq experiments, using genomic data from large-scale studies like the ENCODE project to provide orthogonal information against which to validate our conclusions.
We show that a human cell’s transcriptome can be divided into active genes carrying out the work of the cell and other genes that are likely the by-products of biological or experimental noise. We use ENCODE data on chromatin state to show that ultralow-expression genes are predominantly associated with repressed chromatin; we provide a novel normalization metric, zFPKM, that identifies the threshold between active and background gene expression; and we show that this threshold is robust to experimental and analytical variations.
The zFPKM normalization method accurately separates the biologically relevant genes in a cell, which are associated with active promoters, from the ultralow-expression noisy genes that have repressed promoters. A read depth of twenty to thirty million mapped reads allows high-confidence quantitation of genes expressed at this threshold, providing important guidance for the design of RNA-seq studies of gene expression. Moreover, we offer an example for using extensive ENCODE chromatin state information to validate RNA-seq analysis pipelines.
Second-generation sequencing technology has provided deep insight into the complexity of the transcriptome. Early sequencing of cellular mRNA resulted in a level of transcript quantitation that was in broad concordance with microarrays . Subsequent studies with improved mapping tools [2, 3] and increasingly deep sequencing depth [4, 5] suggested that, with enough depth of coverage, most annotated genes could be observed at some level. A key unanswered question, however, is whether these low-abundance transcripts are biologically significant [6, 7].
A recent study by Hebenstreit et al.  demonstrated that gene expression in mammalian cells measured by RNA-seq follows a bimodal distribution of high and low expression genes, and suggested that the high-expression genes comprise the active, functional transcriptome of the cell. The results of several studies constrain the range of the threshold that divides active from low-expression genes: at the upper bound, Hebenstreit et al. and Mortazavi et al.  calculated that fragments per kilobase of gene model per million mapped reads (FPKM) values of 1 to 2 correspond to ~1 mRNA molecule per cell, though a deep proteomic sampling of HeLa cells detected proteins from several genes expressed below this level . At FPKM of about 0.3, RNA-seq reads were shown to map to exonic regions and intergenic regions at similar rates , suggesting lower confidence in measured expression below this level. However, the data used in these studies were from short read RNA-seq (often 32-base single-end reads) of moderate depth (typically ~20 million reads). Advances in RNA-seq library preparation and sequencing technology now regularly yield tens to hundreds of millions of paired-end reads of 50 to 100 or more bases in length. Increased read length improves mapping accuracy and lowers the odds of spurious multiple mapping, while greater read depth allows more accurate assessment of the relative abundance of low-expression transcripts as well as the detection (by at least one read mapping) of a greater number of genes . These advances undermine the assumptions upon which previous heuristics for evaluating gene expression were based, highlighting the need for updated understanding of the signal and noise present in RNA-seq data.
In this study, we integrate current-generation RNA-seq and chromatin state data from the ENCODE project to understand the relationship between gene expression level and promoter activity signatures. We explore the effect of varying read depth on transcript detection and quantitation, and offer a novel normalization method that robustly identifies the subset of active genes observed in an RNA-seq experiment and provides guidance regarding efficient experimental design.
Results and discussion
Expression state from chromatin state
An important question arising from this observation is whether the low-expression transcripts of the shoulder are comprised of functional genes or merely by-products of leaky gene expression, sequencing errors, and/or off-target read mapping. To explore this question, we compared gene expression profiles to the results of an integrated analysis of chromatin state derived from ENCODE ChIP-seq data . Each gene promoter was tagged as "active" or "repressed" based on the local chromatin state (see Methods and Additional file 1: Tables S3 and S4). Genes were rank-ordered by expression level, binned (n = 500), and the fraction of genes in the bin with either active or repressed promoters was plotted against the genes’ mean expression level (Figure 1). As expected, ~100% of highly expressed genes have active promoters. However, transcripts detected at low levels tend to be associated with repressed promoters, suggesting that they do not play a functional role in the cell.
A Gaussian fit describes active genes
Mean +/− SD
−2.11 +/− 0.42
−2.74 +/− 0.30
−2.23 +/− 0.28*
−2.82 +/− 0.22*
While we do not have corresponding information on chromatin state for these samples, other cell line data do corroborate the relationship between promoter activation level and gene expression in the major peak. Additional file 2: Figure S2 shows RNA-seq distributions and corresponding paired histone H3K4 trimethylation ChIP-seq data. As with the Encode chromatin state data, the fraction of genes with promoter-associated H3K4me3 is high for genes expressed in the primary peak and drops to negligible levels for transcripts detected at trace levels.
The zFPKM threshold is robust to changes in read depth
As noted in this study and others , greater read depth increases the total number of genes detected (Figure 3c, solid curve), with newly discovered genes tending to show very low expression (Additional file 2: Figure S3). The corresponding fraction of newly detected genes that are expressed in the active region drops rapidly with read depth (Figure 3c, dashed curve). Beyond 24 Mmr, though transcripts from over 1,000 new genes are putatively detected, only 37 are observed in the active region. This suggests 20–30 Mmr is a reasonable target for RNA-seq studies of gene expression, as it captures virtually all active genes in a sample while allowing sample multiplexing on sequencing machines to reduce costs. This result is consistent with ENCODE recommendations for RNA-seq best practices ; moreover, at an expression level of log2(FPKM) of −2.8 (the lowest expression level corresponding to our zFPKM threshold in the samples studied here), this read depth yields ~10 mapped reads per typical 3 kb transcript, the minimum coverage recommended for analysis of differential expression using count-based statistics .
Other normalization methods have been proposed to deal with the change in calculated FPKM induced by, e.g., changes in read depth and mapping quality. One such option is transcripts per million (TPM), implemented in the RSEM software package  and used to compute gene expression values in, e.g., The Cancer Genome Atlas . While the TPM transform should in principle be more stable than raw FPKM, the software implementation (rsem-calculate-expression version 1.2.6 at time of writing) calls Bowtie with lax mapping parameters that result in dozens to hundreds of genes being called highly expressed in one pipeline vs. trace or zero expression in the other. Additional file 2: Figure S4 shows the Tophat/Cufflinks-derived FPKM vs. RSEM-derived TPM for nine ENCODE cell lines and highlights the genes unique to each pipeline. Comparing the fraction of active and repressed promoters among these genes suggests that the default Tophat/Cufflinks pipeline delivers more accurate results (Additional file 2: Table S5), and that end-users should carefully consider the command line parameters when using RSEM as a wrapper for Bowtie.
Second-generation sequencing technology has provided a detailed view of the transcriptome. Assays that previously required multiple platforms, or which were simply not available, now can be performed from a single sequencing data set; e.g. transcript quantitation, isoform identification, alternative splicing and transcription start sites, allele-specific transcription, and discovery of novel transcripts. For common assays of gene expression, however, the remarkable sensitivity of RNA-seq has generated many questions regarding how to most efficiently design an experiment and analyze the resulting data.
Previous transcriptome studies have suggested that many rare transcripts may be the product of biological noise, although few have provided evidence that these products are non-functional. We show that low-abundance transcripts are associated with chromatin signatures consistent with repressed promoters, and we provide the zFPKM normalization method that accurately determines the expression regime defined by genes controlled by active promoters. The method provides several advantages over widely used heuristic approaches of accepting expression values above a fixed threshold, typically FPKM values ~ 1. We show that, while most human RNA-seq experiments yield similarly shaped distributions of gene expression values, different samples and experimental protocols can result in pronounced changes in the location and scale of these distributions that add variability to the results from the application of such heuristics. In the more extreme cases, however, it would be worth carefully re-evaluating the quality of the primary data before applying any normalization techniques.
With a finite population of biologically active transcripts in a cell, it stands to reason that experimental methods can be optimized to provide requisite coverage of those transcripts while maximizing the multiplexing capability of a sequencer. Our work shows that 20–30 million mapped reads are sufficient to detect virtually all active transcripts in a cell line, and provides deep enough coverage to undertake analysis of differential expression across the bulk of the active transcriptome. RNA-seq at ever greater depth continues to detect new transcripts, but the overwhelming majority are expressed at trace levels and, in the ENCODE data, are associated with repressed promoters, indicating that these are not biologically active genes.
It is worth noting that these results are derived primarily from homogeneous samples of human cell lines. Heterogeneous samples present their own set of challenges. A gene that is moderately expressed in a small fraction of cells in a sample might be indistinguishable from the background transcripts of the whole sample. At the other extreme, an equal mix of two or three cell types would likely result in a similar top-end distribution of constitutively expressed genes but an enlarged left shoulder of tissue-specific genes whose observed expression is reduced by averaging over the whole sample. While none of these issues are unique to RNA-seq—microarray studies have long faced the same problems—there may be an opportunity to formally quantify this behavior by in silico combinations, across a range of proportions, of the ENCODE matched transcript and chromatin state data from different samples.
More broadly, the ENCODE data provides a unique and comprehensive data set from which to evaluate the quality of RNA-seq studies generally. Having independent chromatin state data for multiple cell lines provides a vital "ground truth" against which to measure the performance of RNA-seq analysis tools. We point to the differences between RSEM and Tophat/Cufflinks quantitation presented here as a case study for using this framework to evaluate computational methods against real-world data.
Sequencing technology has evolved significantly since the early proof-of-concept RNA-seq studies. Through a combination of bioinformatic and biochemical advances, modern RNA-seq data represents a deeper and more accurate sampling of the transcriptome than the moderate-depth, short-read data from which many current rules of thumb for analysis were derived. Improved library prep techniques have increased the fraction of total sequenced bases that map to mRNA and reduced the bias toward reads mapping at the 3′ end of known transcripts, while splice-aware mappers align longer reads with greater accuracy and less likelihood of multiple hits. The net result is that many of the features of early RNA-seq data which drove the development of heuristics in use today are not always applicable. We evaluate latest-generation data and offer an updated framework for extracting relevant gene expression information from RNA-seq experiments.
ENCODE RNA-seq data were downloaded from NCBI GEO (Accession no. GSE30567). Jurkat RNA-seq and ChIP-seq and CD4+ RNA-seq data generated in the Salomon lab were submitted to GEO. From EMBL-EBI, we downloaded Illumina BodyMap reads [E-MTAB-513] and Geuvadis FPKM values [E-GEUV-1]. Other sequence data were acquired from NCBI SRA: HeLa RNA-seq, SRR309265; HeLa ChIP-seq, SRR037862; HCC1954 RNA-seq and ChIP-seq, SRX061987-SRX061997. ICGC pancreatic cancer RNA-seq FPKM values were downloaded from ftp://data.dcc.icgc.org/current/Pancreatic_Cancer-OICR-CA/.
For isolation of total CD4 T cells and memory CD4 T cells, peripheral blood mononuclear cells (PBMCs) were first enriched by density gradient centrifugation of peripheral blood from healthy human donors through a Ficoll-histopaque gradient (Sigma). For total CD4 T cell purification, cells were positively selected from PBMCs on anti-CD2 beads (Miltenyi) followed by positive selection on anti-CD4 beads (Invitrogen). CD4+ memory T cells were purified from PBMCs by negative selection with magnetic beads (Miltenyi). Purified cells were cultured in RPMI (Mediatech) supplemented with 10% FBS, 100 U/ml Penicillin (Gibco), and 100 μg/ml Streptomycin (Gibco) for 48 hours with and without stimulation by anti-CD3/CD28 beads (Invitrogen) at 37C in 5% CO2. Jurkat cells were obtained from ATCC (clone E6-1) and cultured in the same medium as the primary cells.
Cells were harvested and resuspended in TRIzol (Invitrogen). RNA was isolated following a standard TRIzol extraction protocol. RNA-seq libraries were prepared as described . Briefly, 100 ng total RNA was amplified using the Ovation RNA-seq kit (NuGen). 100 ng amplified cDNA was digested with 50 U/µl endonuclease S1 (Promega) for 30 min at room temperature. Digested cDNA was end repaired and sequencing adapters were annealed following standard protocols (Illumina). Sequencing of total CD4 T cell RNA was conducted on an Illumina GAIIx instrument with 60-base paired-end reads. Ultradeep sequencing of activated CD4 memory T cell RNA was conducted in 5 lanes of the Genome Analyzer IIx instrument, generating 80-base single-end reads.
Sequence mapping and gene expression quantitation
RNA-seq reads were mapped to hg19 with TopHat version 1.4.1. No junctions file (−j) or GTF file (−G) was specified. FPKM values were calculated per gene with Cufflinks version 2.0.2, using the Gencode v.14 GTF file downloaded from the Human Genome Browser at UCSC. Cufflinks output was filtered for protein coding genes as annotated by the HUGO Gene Nomenclature Committee (http://www.genenames.org). The matrix of raw FPKM values is included as Additional file 1: Table S1.
Gaussian fit and zFPKM normalization
A matrix of all calculated zFPKM values is included as Additional file 1: Table S2.
Promoter chromatin state
Files containing the results of chromatin state analysis in  were downloaded in .bed format from the Human Genome Browser at UCSC at http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeBroadHmm/. In a given cell line, a gene was labeled as having an active promoter if a locus was classified as State 1 ("Active Promoter") or State 2 ("Weak Promoter") within 2 kb of the transcription start site of any annotated transcript associated with the gene in the Gencode gene models. A gene was labeled as having a repressed promoter if any TSS was within a locus classified as State 12 ("Polycomb-repressed") or State 13 ("Heterochromatin"). In rare cases genes were labeled with both active and repressed promoters. A list of all active and repressed genes in each sample is included as Additional file 1: Tables S3 and S4.
ChIP-seq reads were mapped to hg19 with Bowtie, and peak finding was performed using sissrs . H3K4me3 peaks within 1 kb of a gene transcription start site were identified based on the GTF file described above.
All data generated in the Salomon lab for this manuscript were covered by Human Subjects Research Protocols approved by the Institutional Review Board. Informed written consent was obtained from all study subjects in the study.
Funding for this work was provided by NIH grants U19 AI063603, U01 AI084146 and U01 AI063594 (ARRA). SL and KP were supported by National Institutes of Health TL1 RR025772–03.
- Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321 (5891): 956-960. 10.1126/science.1160342.View ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.PubMed CentralView ArticlePubMedGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111. 10.1093/bioinformatics/btp120.PubMed CentralView ArticlePubMedGoogle Scholar
- Labaj PP, Leparc GG, Linggi BE, Markillie LM, Wiley HS, Kreil DP: Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics. 2011, 27 (13): i383-391. 10.1093/bioinformatics/btr247.PubMed CentralView ArticlePubMedGoogle Scholar
- Toung JM, Morley M, Li M, Cheung VG: RNA-sequence analysis of human B-cells. Genome research. 2011, 21 (6): 991-998. 10.1101/gr.116335.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Mercer TR, Gerhardt DJ, Dinger ME, Crawford J, Trapnell C, Jeddeloh JA, Mattick JS, Rinn JL: Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nature biotechnology. 2012, 30 (1): 99-104.View ArticleGoogle Scholar
- van Bakel H, Nislow C, Blencowe BJ, Hughes TR: Most "dark matter" transcripts are associated with known genes. PLoS Biol. 2010, 8 (5): e1000371-10.1371/journal.pbio.1000371.PubMed CentralView ArticlePubMedGoogle Scholar
- Hebenstreit D, Fang M, Gu M, Charoensawan V, van Oudenaarden A, Teichmann SA: RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol Syst Biol. 2011, 7: 497-PubMed CentralView ArticlePubMedGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
- Nagaraj N, Wisniewski JR, Geiger T, Cox J, Kircher M, Kelso J, Paabo S, Mann M: Deep proteome and transcriptome mapping of a human cancer cell line. Mol Syst Biol. 2011, 7: 548-PubMed CentralView ArticlePubMedGoogle Scholar
- Ramskold D, Wang ET, Burge CB, Sandberg R: An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput Biol. 2009, 5 (12): e1000598-10.1371/journal.pcbi.1000598.PubMed CentralView ArticlePubMedGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology. 2010, 28 (5): 511-515. 10.1038/nbt.1621.PubMed CentralView ArticlePubMedGoogle Scholar
- Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S: GENCODE: the reference human genome annotation for The ENCODE Project. Genome research. 2012, 22 (9): 1760-1774. 10.1101/gr.135350.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M: Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011, 473 (7345): 43-49. 10.1038/nature09906.PubMed CentralView ArticlePubMedGoogle Scholar
- Lappalainen T, Sammeth M, Friedländer MR, 't Hoen PA, Monlong J, Rivas MA, Gonzàlez-Porta M, Kurbatova N, Griebel T, Ferreira PG: Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013, 501 (7468): 506-511. 10.1038/nature12531.PubMed CentralView ArticlePubMedGoogle Scholar
- Standards, Guidelines and Best Practices for RNA-Seq.http://encodeproject.org/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf,
- Soneson C, Delorenzi M: A comparison of methods for differential expression analysis of RNA-seq data. BMC bioinformatics. 2013, 14: 91-10.1186/1471-2105-14-91.PubMed CentralView ArticlePubMedGoogle Scholar
- Li B, Dewey CN: RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC bioinformatics. 2011, 12: 323-10.1186/1471-2105-12-323.PubMed CentralView ArticlePubMedGoogle Scholar
- McLendon R, Friedman A, Bigner D, Van Meir EG, Brat DJ, Mastrogianakis GM, Olson JJ, Mikkelsen T, Lehman N, Aldape K: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008, 455 (7216): 1061-1068. 10.1038/nature07385.View ArticleGoogle Scholar
- Head SR, Komori HK, Hart GT, Shimashita J, Schaffer L, Salomon DR, Ordoukhanian PT: Method for improved Illumina sequencing library preparation using NuGEN Ovation RNA-Seq System. Biotechniques. 2011, 50 (3): 177-180.PubMedGoogle Scholar
- Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 2008, 36 (16): 5221-5231. 10.1093/nar/gkn488.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.