Ultra-deep sequencing of the E. coli transcriptome
Previous studies have suggested that accurate quantification of > 95% of transcripts in a mammalian cell line (including splice junction level quantification) requires ~700 million reads [17]; however, no estimate of the number of reads needed to approach saturation of a bacterial transcriptome has been reported. To address this question, we isolated total RNA from a log phase culture of Escherichia coli K-12 which was then depleted of rRNA using the RiboZero kit (Epicentere), converted to a strand-specific Illumina cDNA library as described [14], and sequenced in one lane of Illumina HiSeq. This produced a dataset of more than 306 million total reads aligning to the E. coli K-12 genome. Over 97% of these reads corresponded to properly mapped paired end reads, i.e. those corresponding to reads derived from opposite ends of the same cDNA mapping no more than 450 base pairs apart on the genome (the approximate maximum size of cDNAs in the library – see Methods). Properly mapped paired end reads were resolved into a single fragment by filling in the gap between them (if any). For pairs of reads that was not properly mapped, one read was discarded and the remaining reads along with unpaired reads were each treated as independent fragments. In total this dataset contained approximately 156 million aligned fragments with an average length of 159 nucleotides. rRNA depletion in this sample was nearly complete, with less than 0.15% of fragments aligning to rRNA-encoding genes (Additional file 1: Table S1).
The proportion of annotated ORFs represented in this dataset was very high, with all but 2 of 4149 ORFs annotated in RefSeq covered by at least 1 fragment (Additional file 2: Table S2). Coverage of the genome also approached saturation, with at least 1 fragment mapping to over 94% of strand-specific genomic positions. Importantly, the density of this coverage varied markedly among different regions of the genome (Figure 1A). For example, while 96% of bases within annotated ORFs were detected by 10 or more fragments, only 60% of bases in regions antisense to annotated ORFs were detected above this cutoff. Similarly, the density of coverage was relatively high for genes encoding non-coding RNAs (ncRNAs) and relatively low in intergenic regions (Figure 1A).
As shown in Figure 1A, a surprisingly high proportion of antisense and intergenic positions were covered by at least 1 fragment. We reasoned that this could be due to limitations in the method used to maintain strand specificity in our libraries [18, 19]. In this method, dUTPs are incorporated only into the second strand of cDNAs during cDNA synthesis and these dUTPs are then excised prior to library amplification, ensuring that only the first cDNA strand is efficiently amplified. Incomplete incorporation and/or excision of dUTPs would presumably lead to low levels of antisense fragments corresponding to the second strand of cDNAs. To assess the level of second strand contamination in our samples, we compared the average fragment coverage on the sense and antisense strands of each annotated ORF with the expectation that this coverage should be somewhat correlated if second strand removal was incomplete. As shown in Additional file 3: Figure S1, there was very little positive correlation (R2 = 0.0004) between the fragment coverage of sense and antisense strands, even among highly expressed genes. In contrast, the correlation in the coverage of ORF sense and antisense strands was much higher (R2=0.83) when a similar rRNA-depleted E. coli cDNA library was not subjected to dUTP excision prior to amplification and sequencing. Thus, incomplete strand specificity in our libraries does not seem to have contributed significantly to the observed high coverage of antisense positions.
Another explanation for the high coverage of antisense and intergenic positions observed is that a much higher proportion E. coli genome is transcribed than is suggested by current gene annotations. Indeed, several recent studies have demonstrated widespread transcription from the antisense strand of protein-encoding genes in diverse bacteria [20–23]. While in some cases these antisense transcripts have been shown to play important regulatory functions, two recent studies in Bacillus subtilis and E. coli K-12 suggest that many antisense RNAs derive from spurious transcription initiation or incomplete transcription termination and may not be functionally relevant [24, 25]. Thus many of the fragments aligning to intergenic regions of the genome may correspond to non-specific transcription initiation or leaky transcription termination of upstream genes. Other sequences from intergenic regions may be derived from previously unannotated ncRNAs. Recent studies suggest the prevalence of ncRNA genes has likely been underestimated, even in well-studied bacteria such as E. coli K-12 [7, 26].
Finally, the nearly complete RNA-Seq read coverage of the genome could also reflect contamination of our cDNA libraries with a low amount of E. coli genomic DNA (gDNA). While total RNA was subjected to 2 rounds of DNase treatment and no gDNA was detected following 40 rounds of PCR prior to cDNA synthesis, it is possible that removal of gDNA from our total RNA was not complete. Similarly, reagents used after DNase treatment in library construction may also have introduced low amounts of E. coli gDNA contamination.
Taken together our findings suggest that a sequencing depth of 156 million fragments is sufficient to saturate the E. coli K-12 transcriptome but also yields numerous fragments aligning to very rare and potentially non-functional transcripts and/or to low-level contaminants introduced during library construction.
Genome coverage of RNA-Seq data after background subtraction
While read coverage of annotated E. coli genes was nearly complete in the 156M read dataset, the possibility of gDNA contamination raised concern that some of these genes were not actually transcribed. To better estimate the proportion of E. coli genes transcribed under the conditions tested, we devised an algorithm to subtract potential gDNA background from our RNA-Seq dataset based on the assumption that, unlike reads corresponding to cDNAs, the alignment of reads corresponding to gDNA would be uniformly distributed across the E. coli genome. As shown in Figure 1B and 1C, background subtraction assuming 0.5% or 1% gDNA contamination led to relatively modest decreases in ORF and ncRNA coverage but to significant drops in coverage of IGR and AS positions. Indeed, after applying a 1% background subtraction, only 33% and 62% of AS and IGR positions were covered at saturation, respectively, compared to 90% and 92% of ORF and ncRNA position, respectively. While the actual extent of gDNA contamination is difficult to ascertain, the results of our PCR screen prior to cDNA synthesis suggest it is unlikely to be as high as 1%. Yet even with this high level of subtraction, at least 1 and 10 reads aligned to 98% and 95% of annotated ORFs, respectively, suggesting that a very high proportion of annotated E. coli genes are expressed at least at low levels during exponential growth in rich media.
Of the 100 ORFs to which no reads aligned following 1% subtraction, several are near the minimum size cutoff of cDNAs efficiently maintained during library construction. These include 4 of the 5 ibs toxic membrane proteins that may indeed not be expressed under normal growth conditions [27] (Additional file 2: Table S2). Importantly, ORFs annotated as “predicted proteins” or encoded within annotated prophages were enriched more than 2- and 4-fold, respectively, among the undetected ORFs. Moreover, many undetected ORFs were clustered in known operons, including 5 of 7 ORFs in the rut operon involved in pyrimidine degradation [28], 5 of 6 ORFs in the cit operon encoding components of an inactive citrate lyase [29], and 8 of 15 ORFs in the phn operon required for use of phosphonate and phosphite as phosphorous sources [30] (Additional file 2: Table S2). Some of the 100 ORFs not represented in our RNA-Seq data have been shown to be expressed in other studies conducted under different growth conditions, suggesting the transcription of these genes is highly repressed and/or the half-lives of these transcripts is very short during exponential growth of E. coli K-12 in LB medium.
Effect of ribosomal RNA depletion on RNA-Seq transcriptome profiles
We next assessed to what extent rRNA depletion increases detection of low expressed transcripts by RNA-Seq. To this end, we constructed another Illumina library derived from the same total RNA used to produce the initial rRNA-depleted dataset and sequenced this library in a single Illumina HiSeq lane. While the depleted and undepleted libraries yielded a similar number of total fragments, 82% number of fragments in the undepleted sample aligned to rRNAs and the number of fragments aligning to ORFs in this sample was more than 8-fold lower than in the depleted sample.
As expected, the proportion of annotated ORFs detected was higher in the depleted than the undepleted samples (Figure 2A). However, even in the undepleted sample, at least one fragment mapped to over 99% of annotated ORFs, and over 96% of annotated ORFs were associated with 20 or more fragments. Moreover, the subset of ORFs detected with a minimum of 10 fragments per ORF was only 2% lower in the undepleted sample. Thus, in a dataset containing enough fragments to saturate the E. coli transcriptome, the lack of rRNA depletion greatly reduced the number of mRNA-derived fragments obtained but led to only a relatively modest decrease in the proportion of annotated E. coli ORFs detected.
Effect of decreased sequencing depth on RNA-Seq transcriptome coverage in E. coli
To systematically assess how decreasing fragment counts per sample affects the comprehensiveness of gene expression profiles, we developed scripts that randomly sampled our 156 million fragment rRNA-depleted E. coli RNA-Seq dataset to create datasets with decreasing numbers of fragments. The ORF and genome coverage provided by these datasets was then quantified and compared. To ensure our sampling approach accurately simulated multiplexing, we re-sequenced the E. coli cDNA library, this time multiplexed with 11 unrelated libraries in the same HiSeq lane, producing a dataset with approximately 15 million total fragments. Importantly, both the levels of genome coverage and the number of fragments per ORF in this dataset correlated very well (R2 > 0.99) with those of a dataset of 15 million fragments sampled from the 156 million fragment dataset.
As shown in Figure 2B, reducing the number of fragments led to a decrease in the proportion of annotated ORFs to which 1 or more fragments aligned. However, this decrease was often relatively small compared to the reduction in the number of fragments. For example, decreasing the number of fragments over 15-fold from 156 to 10 million fragments led to only a 3% and 7% loss in the number of ORFs detected with more than 5 and 10 fragments, respectively. Indeed, even with only 2 million fragments, 96% and 84% of ORFs were covered by at least 1 fragment and 5 fragments, respectively.
As shown in Figure 3A, positions within annotated genes were nearly saturated by 50 million fragments, and only relatively incremental increases in annotated gene coverage were obtained above 10 million fragments. A similar trend was observed in intergenic positions. As the number of fragment continues to increase beyond 50 million, nearly all new positions detected were within antisense regions of the genome, many of which, as discussed above, may correspond to non-functional spurious transcripts or gDNA contamination. Indeed, in the background subtracted datasets, very few new positions were detected in any category in datasets with more than 50 million fragments (Figure 3B and 3C). Taken together, these findings suggest that 50 million non-rRNA fragments yield nearly complete coverage of biologically relevant E. coli transcripts expressed during log phase growth in LB. Moreover, they suggest that vast majority of the E. coli transcriptome can be detected under this growth condition even with datasets of only 5-10 million non-rRNA fragments.
Effect of decreased sequencing depth on RNA-Seq transcriptome coverage in M. tuberculosis and V. cholerae
The regulatory networks governing gene expression can diverge significantly among different bacteria. Moreover, patterns of gene expression can vary dramatically among different growth conditions. To assess whether the relationship between sequencing depth and transcriptome coverage described above extends beyond log-phase E. coli K-12 cultures growing in LB, we repeated the analysis above with RNA-Seq data derived from log phase LB cultures of Mycobacterium tuberculosis (Figure 4), a species whose GC content, gene content and organization, and physiology are significantly diverged from those of E. coli. Importantly, similar levels of coverage of annotated ORFs and ncRNAs were seen in these M. tuberculosis datasets containing 5 and 10 million non-rRNA fragments (Figure 4). We also analyzed RNA-Seq datasets containing 5 and 10 million non-ribosomal fragments derived from log phase cultures of Vibrio cholerae growing in M9 minimal medium [3] and found similar levels of gene coverage, though coverage of antisense and intergenic regions in these data was somewhat lower (Figure 4). These results suggest that a sequencing depth of 5-10 million non-rRNA fragments enables profiling of the vast majority of transcriptional activity in diverse species grown under diverse culture conditions.
Using RNA-Seq to identify differentially expressed genes: how important is depth?
In addition to its utility in profiling the transcriptome of a single strain of interest, RNA-Seq is also a powerful tool for comparing gene expression among different strains and/or growth conditions. A recent study by Tarazona et al. examined the relationship between sequencing depth and the reliable identification of changes in gene expression in human RNA-Seq data [31] but to date no similar analysis has been conducted for bacterial RNA-Seq data. To assess how changes in sequencing depth influence RNA-Seq-based analysis of differential gene expression in bacteria, we sequenced rRNA-depleted total RNA isolated from LB cultures of E. coli O157:H7 strain EDL933 (from hereon referred to as EDL933) at the late exponential and early stationary phases. cDNA libraries corresponding to 2 biological replicates for each time point were subjected to multiplexed sequencing using Illumina HiSeq to yield 25-30 million fragments per sample. Data between biological replicates for each time point was were extremely well correlated (R2 of fragments/ORF = 0.99). To examine the impact of having fewer fragments on the results of differential expression analysis, we scaled down the counts of fragments per gene from each dataset while retaining the original values of relative gene expression.
We first used these sampled datasets to determine how changes in sequencing depth influenced the detection of transcripts in two independent biological replicates. As shown in Figure 5, the total number of transcripts detected in both replicates rose significantly as depth was increased, particularly among lowly expressed genes. Importantly, these increases began reaching an asymptote around 13 million fragments, suggesting that additional depth beyond this point did relatively little to increase either the number or percent of all genes detected in both biological replicates.
We next analyzed the full and sampled datasets with DESeq, a variance-analysis package that uses a model based on the negative binomial distribution to infer statistically significant differences in gene-expression from RNA-Seq data [32]. Based on the counts of gene-mapped fragments derived from the full RNA-Seq data set of ~25 million fragments per sample, DESeq identified 2486 genes (corresponding to 45% of all annotated EDL933 genes) as being at least 2-fold up- or down-regulated (P < 1×10-3). As shown in Figure 6A, reductions in sequencing depth correlated with a decrease in the number of genes identified as differentially expressed below this P-value cutoff. As expected, the effect of decreased depth was most marked for genes whose differential abundance between the two growth phases was relatively small (Figure 6A). For example, a 10-fold decrease in depth resulted in a loss of 38% of genes 2-5-fold differentially expressed but only 9% of genes whose differential expression was greater than 10-fold. However, even when the depth was reduced to 2.5-3 million fragments in each dataset, 1704 genes were identified as differentially regulated more than 2-fold with P < 1×10-3. Our findings indicate that when data from well-correlated biological replicates are included, 2-3 million fragments per sample enable a significant number of genes differentially expressed by 2-fold or more to be identified with high statistical significance.
The ability to reliably identify differentially expressed genes by RNA-Seq is affected by a variety of factors aside from total sequencing depth that can vary significantly from one experiment to another, including the number of biological replicates included and the variation between them, the average abundance of differentially expressed genes, and the magnitude of their differential expression under the conditions tested. We therefore repeated the analysis above with RNA-Seq data that were distinct in several ways from the EDL933 data. Specifically, these data were derived from V. cholerae growing in M9 minimal medium or isolated from the cecal fluid of 2 orally infected infant rabbits [3]. Moreover, the correlation between the 2 rabbit samples was much lower than for the EDL933 in vitro samples (R2=0.69). Finally, the total number of non-rRNA fragments for these datasets was between 4 and 6 million, significantly less than in the EDL933 datasets.
Despite these numerous differences, the impact of reducing the number of fragments in the V. cholerae and EDL933 datasets on the ability to detect differentially expressed genes was very similar (Figure 6B). Importantly, as we observed in the analysis of the EDL933 data, numerous genes were identified as differentially expressed by at least 2-fold (P < 1×10-3) even with a reduction of fragments to 2-3 million per sample. These included all 16 of the major V. cholerae colonization or virulence factors identified as induced in the rabbit when the full datasets were compared [3]. Strikingly, the differential expression of all but one of these major colonization and virulence factors was detected (P < 1×10-3) even when the total number of non-rRNA fragments was reduced 100-fold to 40,000-60,000 total fragments. While it is not possible to accurately simulate how changes in depth will affect RNA-Seq comparative gene expression analyses in all cases, our findings indicate that in diverse species and growth conditions and even with relatively low correlation between biological replicates, 2-3 million fragments per sample enable a significant number of genes differentially expressed by 2-fold or more to be identified with high statistical significance.