Directional RNA-seq reveals highly complex condition-dependent transcriptomes in E. coli K12 through accurate full-length transcripts assembling
© Li et al.; licensee BioMed Central Ltd. 2013
Received: 22 April 2013
Accepted: 27 July 2013
Published: 30 July 2013
Although prokaryotic gene transcription has been studied over decades, many aspects of the process remain poorly understood. Particularly, recent studies have revealed that transcriptomes in many prokaryotes are far more complex than previously thought. Genes in an operon are often alternatively and dynamically transcribed under different conditions, and a large portion of genes and intergenic regions have antisense RNA (asRNA) and non-coding RNA (ncRNA) transcripts, respectively. Ironically, similar studies have not been conducted in the model bacterium E coli K12, thus it is unknown whether or not the bacterium possesses similar complex transcriptomes. Furthermore, although RNA-seq becomes the major method for analyzing the complexity of prokaryotic transcriptome, it is still a challenging task to accurately assemble full length transcripts using short RNA-seq reads.
To fill these gaps, we have profiled the transcriptomes of E. coli K12 under different culture conditions and growth phases using a highly specific directional RNA-seq technique that can capture various types of transcripts in the bacterial cells, combined with a highly accurate and robust algorithm and tool TruHMM (http://bioinfolab.uncc.edu/TruHmm_package/) for assembling full length transcripts. We found that 46.9 ~ 63.4% of expressed operons were utilized in their putative alternative forms, 72.23 ~ 89.54% genes had putative asRNA transcripts and 51.37 ~ 72.74% intergenic regions had putative ncRNA transcripts under different culture conditions and growth phases.
As has been demonstrated in many other prokaryotes, E. coli K12 also has a highly complex and dynamic transcriptomes under different culture conditions and growth phases. Such complex and dynamic transcriptomes might play important roles in the physiology of the bacterium. TruHMM is a highly accurate and robust algorithm for assembling full-length transcripts in prokaryotes using directional RNA-seq short reads.
KeywordsRNA-seq Prokaryote E. coli Transcriptome Assembly Transcription start site Alternative operon Antisense RNA Non-coding RNA
In prokaryotes, several adjacent genes on the same strand of DNA are often co-transcribed as a polycistronic mRNA, forming a multi-gene transcription unit called an operon. Furthermore, in addition to protein- and RNA-coding genes, some parts of a non-coding sequence and the opposite strand of a coding sequence can also be transcribed under certain conditions, generating non-coding RNAs (ncRNAs) [1, 2] and anti-sense RNAs (asRNAs) [3, 4], respectively. Accumulating body of evidence suggest that ncRNAs [1, 2] and asRNAs [3, 4] may play important roles in the physiology of prokaryotes. Therefore, a full understanding of the transcriptomes of prokaryotic cells is necessary to annotate the functional elements in their genomes and to reconstruct the gene transcriptional networks in their cells. However, experimental determination of operon structures, ncRNAs and asRNAs by traditional molecular biology methods is time-consuming and labour-intensive. As a result, no single prokaryote has so far had all of its operon structures, ncRNA and asRNAs characterized using such methods. For instance, even for the most well-studied model bacteria E. coli K12 and B. subtilis, only 3,409  and 736  operons have been determined in their genomes using these methods, respectively, after decades of research while not each of their genes has been assigned to an operon. On the other hand, although a great progress has been made in computational prediction of operons [7–14] and small RNA genes [15–18], the accuracy of these predictors is still low [13, 19], and they can only predict the static longest possible operons without considering possible alternative operons [7–14].
In the past few years, increasing applications in prokaryotes of whole genome directional (strand-specific) tiling array and directional RNA-seq techniques have completely changed our way to study and our view of the architecture and complexity of prokaryotic transcriptomes (for a thorough review, see [20–22]). For example, using a combination of whole genome directional tiling array and RNA-seq techniques, Guell et al. found that operon utilizations in the reduced parasitic M. pneumoniae genome were highly variable and dynamic, almost half of the 139 identified multi-gene operons showed varying levels of (dynamic) expression in a staircase-like manner. Under different conditions, large operons could be transcribed as smaller sub-operons, resulting in many alternative transcripts, suggesting that the operon structures in M. Pneumonia were highly complex and dynamic, a phenomenon that was comparable to the alternative splicing in eukaryotes . They also identified a large number of ncRNAs and asRNAs expressed under various culture conditions, hence a much larger portion of the genome was transcribed than originally anticipated . Similar results were observed in many other taxonomically distinct species, such as epsilon proteobacteria H. pylori; firmicutes B. sutiblis and B. anthracis; cyanobacteria Synechocystis sp. PCC6803 ; euryarchaeota Halobacterium salinarum NRC-1 ; and bacteroidia Porphyromonas gingivalis, to only name a few. However, not all these surprising observations were noted in some other studies. For instance, prevalent alternative operon utilizations were not reported in many studies in a variety of prokaryotes, such as B. subtilis, Salmonella entericaserovar Typhi , Burkholderia cenocepacia, Caulobacter crescentus, Staphylococcus aureus, Vibrio cholera, Chlamydia trachomatis, Chlamydia pneumonia, Clostridium beijerinckii NCIMB 8052 , Listeria monocytogenes, Anabaena sp. strain PCC 7120 , Synechococcuselongatus PCC 7942 , and Sulfolobus solfataricus P2 . Contradictory results have also been reported. For instance, although Rasmussen et al. did not note alternative operon utilizations in B. subtilis, more recently, Nicolas et al. observed highly prevalent condition-dependent operon utilizations using a similar tiling array technique. Moreover, although most of these studies found extensive asRNA and ncRNA transcriptions, the levels of their prevalence could vary quite differently from different studies even in the same strains. For instance, although Selinger et al. reported that up to 4,000 E. coli K12 genes had asRNA transcriptions using directional tilling arrays, Dornenburg et al. only identified about 1,000 asRNAs in the same strain under similar growth conditions using directional RNA-seq. These discrepancies can be due to different experimental conditions and methods used in these studies. Nevertheless, they inevitably raise the question: are the prevalent alternative operon utilizations, asRNA and ncRNA transcriptions ubiquitous phenomena in all prokaryotes or only prevalent in some specific species?
E. coli K12 is probably the best known free living model organism [45, 46], where novel biological hypotheses and computational algorithms can be tested. Indeed, it is mainly through the studies in E. coli K12 that we have understood many fundamental biological processes, including the mechanisms of gene transcriptional regulation [47–49]. As a result, the E. coli K12 genome is in fact the best understood among all the model organisms in almost all aspects [50, 51]. Since the finishing of its genome sequence in 1997 , almost all newly developed high throughput technologies have been applied to this bacterium. As a result, 4,501 genes have been experimentally or computationally identified in the MS1655 strain, and 3,384 (75%) of them have been assigned a biochemical function . Of these 3,384 genes with an assigned function, 2,941 (87%) had their functions characterized experimentally (66% of the total encoded genes) [46, 51]. The products of the 918 genes with experimentally characterized function catalyze 1,008 metabolic reactions, which constitute the best understood metabolic network . As for its transcriptomes and transcriptional regulatory networks, RegulonDB database  that is dedicated to compiling all experimentally verified relevant information in E. coli K12 has documented 3,409 operons (including singleton genes), 1,878 promoters, 1,940 transcription factor binding sites of 175 transcription factors (TF) in the regulatory region of 703 operons, and 2,697 TF-target gene regulations . Furthermore, more than a hundred of ncRNAs and asRNAs have been experimentally identified in the E. coli[54–56]. More recently, Cho et al. applied a combination of tiling array, 5’-end RNA deep sequencing, RNAP ChIP-chip and proteomics analyses to reveal the transcription unit architecture in the E. coli K12 genome. They identified 4,661 transcription units, many alternative Transcription Start Sites (TSSs), alternative operons and ncRNAs under a few cultural conditions. In another study, Mendoza-Vargas et al. identified ~1,500 new TSSs using a modified 5’-RACE method and a 5’-end RNA sequencing method in the genome. Consequently, after more than 40 years intensive molecular genetics research in this bacterium, including the recent high throughput studies [43, 44, 57, 58], our experimentally validated knowledge of the transcriptome and gene regulatory systems in E. coli K12 is the most complete currently available for any organism [46, 51]. However, ironically, our understanding about the complexity of the transcriptomes in this model bacterium is rather limited compared to its counterpart model Gram-positive bacterium B. subtilis. In particular, large scale dynamic and alternative operon utilizations under various conditions have not been reported in E. coli K12, so do they exist in this bacterium? Furthermore, how many asRNAs and ncRNAs are transcribed in E. coli K12 given the aforementioned inconsistent results [43, 44]?
Technically, compared to directional tiling array techniques, directional RNA-seq methods are more suitable and powerful tools for understanding the complexity of the prokaryotic transcriptomes due to their single-nucleotide resolution, higher dynamic range, and lower noise levels, thus they have gained increasing popularity . One important step in RNA-seq data analysis is to accurately assemble all meaningful transcripts in their full-length, so that correct conclusions can be drawn from tens of thousands of RNA-seq short reads generated by next generation sequencing (NGS) technologies. However, it has been recently released [23, 24, 28, 29, 60, 61] and we will indicate later in this paper, that the coverage of reads generated by the current RNA-seq techniques on transcribed regions is highly non-uniform. More seriously, there are even numerous uncovered parts in transcribed regions, leading to gaps in otherwise a continuous mapping in the region [62–67]. These highly non-uniform coverage and uncovered gaps make the transcripts assembly and quantitative analyses highly challenging tasks [23, 60, 68–71]. Several technical problems in the current RNA-seq library construction protocols and sequencing technologies have been identified responsible for the non-uniform coverage and gaps. First, the chemical RNA fragmentation methods used in many protocols may have a bias to break or degrade some sequences . Second, the random primer based reverse transcription may preferentially transcribe some sequences [66, 73]. Third, ligases may preferentially link the adaptors to some sequences [74–76]. Fourth, PCR amplification is well-known for introducing GC content-dependent bias in libraries [77–80]. Fifth, it was recently found that sequencing errors could be biased to some specific sequences, making such sequences missing from the reads . Moreover, prokaryotic RNAs are more labile than their counterparts in eukaryotes, thus segments of some RNAs can be more easily lost during the library preparation. Although some of these problems can be avoided by new technical development, such as using FRET-seq for amplification-free sequencing to avoid GC content-dependent PCR bias , or using single RNA molecular sequencing for longer reads to ease the assembly problem [83, 84], no effective routine technique has yet been developed to avoid all these problems.
On the other hand, although several transcriptome assemblers using RNA-seq short reads have been developed in the past few years, they are mainly for reconstructing alternative isoforms in eukaryotes . These assemblers can be classified into two basic categories : the reference-based assemblers when a reference genome sequence is used, and the de novo assemblers when a reference genome is not used. The reference-based assemblers usually involve two steps: RNA-seq reads are first mapped to the reference genome using an aligner, such as BLAT , TopHat  or Bowtie , and then a graph representing all possible isoforms from overlapping reads is constructed, and the isoforms are resolved by traversing the graph. Examples of this strategy include Cufflinks  and Scripture . The de novo assemblers such as Trinity , Oases , TransAByss , Rnnotator , and Multiple-k , generally assemble isoforms based on a De Bruijn graph constructed using overlapping reads. The advantage of de novo strategy is that it can assemble transcripts when a reference genome is not available and can recover transcripts that are missing in the genome assembly. However, de novo transcriptome assembly is very sensitive to sequencing errors, in particular, missing and chimerical reads in the dataset, thus their accuracy is generally lower than the reference-based approaches .
De novo transcriptome assembly in prokaryotes can also be more challenging in prokaryotes owing to the prevalence of uncovered gaps caused by the aforementioned technical reasons and the unique prosperities of their RNAs. Fortunately, with thousands of sequenced prokaryotic genomes available now, transcriptome assembly in prokaryotes can often be done using the reference-based approaches. However, the only reference-based transcriptome assembler for prokaryotes that we are aware of is a Hidden Markov Model (HMM)-based method for reconstructing operons in Bacillus anthracis, yet no tool was delivered from this research. Furthermore, there are at least two limitations in this method. First, the prevalently uncovered gaps were not explicitly treated in this method , thus the interrupted partial transcripts could not be effectively bridged. Second, although this method attempted to model transcripts of different transcription levels using different expression states, it did not allow transitions among the states . Thus, without an effective method to correct the high non-uniformity of the read coverage along a transcript [65, 72, 73, 75, 81], this method can break a transcript into smaller fragments. Because of the lack of a good prokaryotic assembler, currently prokaryotic transcripts were assembled by either simply stitching the two covered segments if the gap between them is shorter than a cutoff , or determining 5’ and 3’ ends of transcripts via a probability-based approach , or relying on an additional source of information for the assembly, such as tiling array data that tend to have a more even and consecutive coverage along transcribed regions albeit at lower resolution [23, 25]. As RNA-seq becomes a routine technique for probing transcriptomes in prokaryotes, an efficient and accurate full-length transcripts assembly algorithm and tool tailored to prokaryotes are urgently needed in the research community.
To gain a better understanding of the complexity of the transcriptomes in E. coli K12, we have profiled the transcriptomes of the bacterium under different culture conditions and growth phases using a highly specific directional RNA-seq technique that can capture various types of transcripts in the cells, including mRNAs, asRNAs, and ncRNAs. To assemble all types of full length transcripts using the directional RNA-seq short reads, we have developed a new Hidden Markov Model based algorithm, TruHMM (TRancription Unit assembly by a Hidden Markov Model), attempting to addresses the highly non-uniform read coverage and uncovered gap problems of current RNA-seq techniques. TruHMM differs from the earlier HMM-based algorithm  in several ways (for details see Methods and Discussion). In particular, TruHMM overcomes the aforementioned limitations of the earlier method by allowing a transcript to have highly non-uniform coverage at different positions, and explicitly addressing the uncovered gap problem using a sliding window-based centroid read counting strategy in a pre-processing step. Furthermore, TruHmm can also predict alternative operons and TSSs of the assembled transcripts. When evaluated on sets of known operons, asRNAs and ncRNAs in E. coli K12, TruHMM was able to assemble various types of transcripts with rather high accuracy. The parameters trained in E. coli K12 can be applied to an earlier directional RNA-seq dataset in H. pylori with similarly high accuracy, and vice versa, thus TruHMM is also very robust. Based on the transcripts assembled in TruHMM, we found that 46.9 ~ 63.4% of expressed operons were utilized in their putative alternative forms, 72.23 ~ 89.54% open reading frames had putative asRNA transcriptions and 51.37 ~ 72.74% intergenic regions had putative ncRNA transcriptions under different culture conditions and growth phases. Thus, it seems that there are more prevalent alternative operon utilizations as well as asRNA and ncRNA transcriptions in E. coli K12 than originally anticipated, and they may play important roles in the physiology of the bacterium.
Our directional RNA-seq libraries are highly strand-specific and can capture various types of RNAs
Uncovered-gaps in transcribed regions are prevalent and read coverage is highly non-uniform
Furthermore, we also found that the read coverage along genes were highly non-uniform (an example is shown in Additional file 1: Figure S5). Interestingly, the pattern of non-uniform coverage did not depend on the culture conditions and growth phase; rather, it strongly depended on the positions in the transcribed region (Additional file 1: Figure S5). Such highly non-uniform read coverage along a transcribed region has been widely noted in recent studies [23, 24, 28, 29, 60, 61], and were shown to be caused by several technical artifices in current RNA-seq techniques [66, 72–81]. Clearly, both the uncovered gaps and highly non-uniform read coverage along transcribed regions make the full-length transcript assembling and alternative operon identification challenging tasks.
TruHMM assembles operons with high accuracy
The performance of TruHMM is robust
To evaluate the performance of TruHMM and the robustness of its parameters on different organisms and datasets, we first applied TruHMM with the parameters trained on the E. coli K12 dataset to the earlier directional RNA-seq datasets of H. pylori generated under five different culture conditions . We then trained the algorithm using an H. pylori training set (Additional file 3, and see Methods) based on the results in , and applied the algorithm with the trained parameters to both the H. pylori and E. coli K12 RNA-seq datasets. Remarkably, the operons reconstructed in both H. pylori and E. coli K12 using the E. coli- or H. pylori-trained parameters are exactly the same (data not shown), and have high accuracy measured by all the five metrics (Figures 5 and 6, and Additional file 1: Table S3 and S4). This might be explained by the fact that the parameters of the algorithm trained on the H. pylori training sets and on the E. coli K12 training sets are almost the same (Additional file 1: Table S5), although our E. coli and the earlier H. pylori RNA-seq datasets were generated by quiet different methods. These results unambiguously demonstrate that the performance of our algorithm is highly robust, thus parameters trained in one organism can be well extended to other organisms, at least in our tested datasets. The assembled operons in H. pylori for each sample are listed in Additional file 4.
The boundaries of operons can largely be captured by our libraries and assembled by TruHMM
We next evaluated the ability of TruHMM to define operon boundaries, i.e., the TSSs and transcription termination sites (TTSs) of assembled transcripts. However, an accurate evaluation of predicted operon boundaries is complicated by the recently discovered fact that alternative TSSs and TTSs are far more prevalent than previously thought [23–25, 57, 58] and the lack of a gold standard TSS and TTS datasets because although some different TSSs and TTSs are documented for some operons in RegulonDB, they were generally characterized in different studies under various conditions that are not necessarily the same as we used in this study. Thus, we evaluated our reconstructed TSSs by the following alternative ways. First, we wanted to know how many experimentally verified TSS in RegulonDB could be recovered by the boundaries of our assembled operons in any of the seven samples. If two known TSSs in RegulonDB are within 10nt from each other, we considered them as the same one in our evaluation. Thus, there are 1,742 known TSSs (Additional file 5) associated with the genes transcribed in at least one of our seven samples. We considered a known TSS was recovered by our predicted TSS if they were at most 50nt from each other. Using this criterion, 908 out of 1,742 (~52.1%) known TSS were recovered by a total of our 5,706 predicted TSSs (Additional file 5). Second, as for the remaining 4,798 predicted TSSs with no match to a known TSS, 2,830 of which appeared in at least two samples, thus they are likely to be novel true TSSs. For example, although genes b2628-b2627 on the reverse strand is documented as an operon in RegulonDB, there is no TSS documented for gene b2628. We predicted a potential TSS in the upstream intergenic region of b2628 (2,763,486) in five samples (Additional file 1: Figure S5). The remaining 1,968 predicted TSSs appeared only in one sample. The 4,798 predicted TSSs are listed in Additional file 6. The low coverage of known TSSs in RegulonDB does not necessarily indicate the inaccuracy of our prediction, considering the prevalence of alternative TSSs utilizations under different conditions and the fact that TSSs in RegulonDB were mostly characterized by different researchers, and under different conditions. Therefore, the limited number of TSSs in RegulonDB might be the major reason.
Lastly, Sharma et. al have determined 735 primary TSSs (defined as the most frequently used TSS by an annotated transcript, supplementary information of ) in H. pylori, using dRNA-seq technique that enriches the reads coverage on the 5’ end of a transcript. Therefore, the TSSs determined in this study could be a good dataset to test the accuracy of our algorithm. Specifically, we compared our predicted TSSs in H. pylori using their directional RNA-seq datasets with their TSSs determined by dRNA-seq. On average, 73.12% of our predicted TSSs in each sample are located within the [-50 nt, 50 nt] interval around a TSS determined by dRNA-seq (Additional file 1: Table S6). Thus our algorithm has achieved a rather high specificity. Our predicted TSSs in each of the five samples, located within the [-50 nt, 50 nt] interval around a verified TSS are listed in Additional file 4. Furthermore, we used the primary TSS to check the recall rate (sensitivity) of our program. Our program detected 558 (~76%) out of the 735 total primary TSSs. The majority of the verified TSSs recalled by our algorithm had a dominant coverage on the 5’ end of the transcript, one of such cases is shown in Additional file 1: Figure S7. By contrast, the majority of primary TSSs missed by our algorithm did not have a dominant read coverage on the 5’-end, two such cases are shown in Additional file 1: Figure S8. The primary TSSs both covered and missed by TruHMM are listed in Additional file 8. The much higher recovery rate of known TSSs by our algorithm in H. pylori than in E. coli K12 might be due to the fact that the gold standard dataset in H. pylori were derived from the same conditions as the RNA-seq datasets that we used for assembling the transcripts, while the datasets in RegulonDB were derived under various conditions.
As for the TTS predictions, our algorithm recovered 148 out of 221 (~67%) known TTSs associated with expressed genes in E. coli K12 (Additional file 5), which is higher than the recovery rate of known TSSs, even though the mapped reads are strongly biased to the 5’-ends (Figure 3). The lower recovery rates of known 5’ ends (TSS) compared to 3’ ends (TTS) might indicate that operons utilize more alternative TSSs than TTSs under different conditions. In other words, the predicted TSSs without a match with a known TSS in RegulonDB are likely to be novel alternative TSSs used in different conditions. Taken together, all these results strongly suggest that most of the predicted TSSs and TTSs are likely to be true transcription boundaries. The assembled operons and their alternative TSSs in each sample are listed in Additional file 9. However, as also demonstrated in earlier studies [24, 57, 58], to more accurately detect TSSs and TTSs of transcripts/operons, in particular TSSs, in addition to directional RNA-seq datasets, special datasets targeted to the 5’-endof transcripts are clearly needed, such as dRNA-seq data  and datasets for the more recently discovered transcription start site RNAs (tssRNAs) .
Condition-dependent alternative operon utilizations appear to be prevalent in E. coli K12
Another interesting example is the alternative utilization of the 13-gene operon fliFGHIJKLMNOPQR encoding proteins in the flagella of E. coli K12 (Additional file 1: Table S9 and Additional file 9). Although the fli operon was expressed as a 13-gene polycistron in the sample LB, it was split into short suboperons under the treatments of heat shock or phosphorus starvation in a time dependent manner (Additional file 1: Table S9). For example, at the beginning of heat shock (the sample HS15 min), the fli operon was divided into four suboperons, then it was further split into six to seven suboperons (samples HS30 min and HS60 min). Interestingly, it has been shown that heat shock reduces bacterial mobility possibly through the regulatory interactions between the heat shock system and the flagellum/chemotaxis system . Moreover, it has been shown that inorganic phosphorus is necessary for the motility of bacteria . However, the underlying mechanisms of these observations are largely unknown. Therefore, our results might provide a possible molecular explanation of these earlier observations: the extreme conditions (heat shock/phosphorus starvation) alter the expression of flagella proteins by changing the patterns of alternative usages of the fli operon, thus influence the motility of the bacterial cells.
Condition-dependent asRNA and ncRNA transcriptions appear to be prevalent in E. coli K12
Some hypothetical genes are transcribed
Although E. coli K12 is probably the best studied and understood model organism, researchers have not completely defined even its coding genes. For instance, there are still 36 sequences labelled as hypothetical protein genes as of this writing in the RegulonDB . Interestingly, we found that all these 36 hypothetical genes were transcribed in at least one of our seven samples (Additional file 14), and 21 (b0050, b0137, b1356, b1382, b1419, b1446, b1457, b1607, b1952, b1998, b3471, b3638, b3937, b4325, b4335, b4336, b4593, b4596, b4610, b4615 and b4620) of them were expressed in all the seven samples, suggesting that they are highly likely to be true protein coding genes. Furthermore, 20 of them formed multi-gene operons with other known genes (Additional file 14). The functions of these known genes might provide hints to possible functions of the associated hypothetical genes for “guilt by association”.
Although a few high throughput studies have attempted to delineate the architecture of E. coli K12 transcriptomes [43, 44, 57, 58], they mainly focused on identifying TSSs [57, 58], promoters  and other features . Thus we still lack a good understanding of the level of the complexity of the transcriptomes in E. coli K12, from which we gained most of our knowledge about transcription in bacteria, but the more recent revolutionary view of the high complexity and dynamics of prokaryotic transcriptomes. Therefore, there is an urgent need for a better understanding of the complexity of the transcriptomes in this most widely-used model Gram-negative bacterium, in particular, when the same highly complex and dynamic transcriptomes have recently been revealed in its counterpart model Gram-positive bacterium B. Subtilis. To fill the gap, we have profiled the transcriptomes in E. coli K12 during the course of heat shock and phosphorus starvation conditions using a highly strand-specific RNA-seq method that can capture various forms RNA transcripts, in conjunction with a highly accurate full-length transcript assembler, TruHMM. Indeed, as has been widely reported in many other prokaryotes [24–29], we have also identified numerous putative novel and/or alternative operons and TSSs, as well as novel putative asRNAs and ncRNAs in E. coli K12. More importantly, the transcription patterns of these putative alternative operons, asRNAs and ncRNAs were highly dependent on the growth phases and culture conditions of the bacterium, suggesting that they might play important roles in the physiology of the bacterium. In the future, it would be very interesting to study how the alternative operons, asRNAs and ncRNAs are related to transcriptional and translational regulations and cellular functions, in particular in responses to environmental cues. Furthermore, the molecular mechanisms that lead to the highly complex and dynamic transcriptomes in E. coli K12 and other organisms also warrant further investigations.
Based on the ever increasing body of evidence [20–22], and the data presented in current study, it is highly likely that prokaryotes generally have highly dynamic and complex transcriptomes to cope with environmental changes. The failure to observe such highly complex and dynamic transcriptomes in some earlier studies [31–42], and the inconsistent results in E. coli K12 and B. subtilis[25, 30], might well be due to the limitations of experimental and computational methods used in these studies. For instance, although an earlier study  did not detect alternative operon utilizations in B. subtilis using tiling arrays under two culture conditions, a more recent study  observed highly prevalent condition-dependent operon utilizations as well as numerous asRNA and ncRNA transcriptions using higher resolution tiling arrays and more sophisticated computational analysis in ~120 culture conditions. Furthermore, although Selinger et al. found that up to 3,000 ~ 4,000 E. coli K12 genes had asRNA transcriptions using directional tilling arrays, Dornenburg et al. only identified about 1,000 asRNAs in the same genome under similar growth conditions using a directional RNA-seq technique. Our results is in excellent agreement with the former results , as we detected that 72.23 ~ 89.54% annotated genes have putative asRNA transcriptions (Additional file 1: Table S10). Thus again asRNA transcription appears to be more prevalent than originally anticipated in E. coli K12. With the continuous drop in costs of the NGS technologies, directional RNA-seq becomes a routine technique to profile transcriptomes in thousands of sequenced prokaryotic genomes. We expect that highly complex and dynamic transcriptomes will be identified in more and more prokaryotes using improved directional RNA-seq techniques and analysis tools. The experimental methods and the transcripts assembler that we developed in this study can add in these efforts.
Another interesting and rather prevalent phenomenon called dynamic operon transcription is recently revealed by transcriptome profiling studies in M. Pneumonia and B. subtilis using high density tiling arrays that give more uniform signal coverage along genes albeit at lower resolution [23, 25]. Dynamic operon transcription is characterized by varying levels of transcription along an operon, resulting in staircase like transcription levels between adjacent genes in the operon [23, 25]. This phenomenon also is clearly seen in our datasets (examples are shown in Figure 9). However, TruHMM in its current form is unable to detect such dynamic operon transcription events due to the highly non-uniform read coverage along genes in an operon. Furthermore, if multiple alternative operons start at the same TSS, but terminate at different TTS in the same sample, TruHMM will fail to detect such coexisting alternative operons in the same sample. Clearly, to solve these problems, one needs to transform the highly non-uniform read coverage along the genes into a more uniform one by effectively correcting the aforementioned technical biases in the current RNA-seq methods, or relies on a better sequencing technology with minimal read bias, or capable of sequencing transcripts in their full-length. In addition, TruHMM might not be able to separate overlapping transcripts if the downstream transcript has no outstanding primary TSS. Finally, additional sequencing library targeted to the intact 5’-end of RNAs might be needed in order to identify all possible TSSs in a sample.
Using a highly efficient and strand-specific RNA-seq method combined with a highly accurate and robust algorithm and tool, TruHMM for assembling full-length transcriptomes, we showed that alternative operon utilizations in E. coli K12 appear to be more prevalent than originally anticipated, and that a large portion of ORFs and intergenic regions of the genome appear to have asRNA and ncRNA transcriptions, respectively. Furthermore, the patterns of alternative operon, asRNA and ncRNA transcriptions are dependent on the culture conditions and growth phases of the bacterium, thus they might play important roles in the physiology of the bacterium. Furthermore, with the recognition of the highly complex nature of prokaryote transcriptomes and the wide application of RNA-seq techniques in the prokaryotes research community, TruHMM can also be very useful for biologists to reveal the complexity of transcriptomes and the underlying molecular mechanisms in all sequenced prokaryotic genomes.
A frozen stock of Escherichia coli K12 strain MG1655 (a gift from Dr. Todd Steck, Department of Biology, the University of North Carolina at Charlotte) was thawed, inoculated in LB medium in a test tube by 1:100 dilution and cultured overnight at 37°C and 250 rpm. The cells were then transferred to fresh LB medium in a flask by 1:100 dilutions, and cultured at 37°C and 250 rpm. When the cells grew to the log phase with an optical density at 610 nm [OD610] of 0.87, they were spun down at 3,200 g for 25 min. For heat shock treatment (HS), the cell pellets were resuspended in the same volume of MOPS medium (100 ml of 10X MOPS mixture, 880 ml of sterile H2O, 10 ml (0.132 M) KH2PO4 and 10 ml of 20% glucose, Teknova, Hollister, CA), and incubated at 48°C and 250 rpm. For phosphorus-starvation treatment (M-P), the cell pellets were resuspended in the MOPS medium without KH2PO4. Three milliliter cell suspension were collected in a tube containing 1.5 ml RNA Later (Invitrogen) immediately after the cell pellets were resuspended in the indicated medium (0 min) and at the indicated time points thereafter (HS:15 min, 30 min and 60 min; M-P: 0 hrs, 2 hrs, 4 hrs). Cells were spun down at 6,000 g, 8 min and -4°C, and the pellets were resuspended in 1.5 ml of RNAlater. The samples were stored at -80°C until use.
Isolation and enrichment of mRNA
Total RNA was isolated using a RiboPure™ -Bacteria Kit (Ambion) following the manufacturer’s instructions. Once isolated, ~10 μg total RNA was treated with 8 units DNase (Invitrogen) twice to remove genomic DNA, and the complete removal of DNA was confirmed bythe absence of the product of 35 cycles PCR amplification of a 196 bp fragment of the crp gene (5’-primer: AGCATATTTCGGCAATCCAG; 3’-primer: TACAGCGTTTCCGCTTTTTC). To enrich mRNAs and other transcripts, majority of rRNAs were removed from the DNase-treated total RNA using a MICROBExpress kit (Ambion) following the manufacturer’s instructions.
Construction of directional RNA-seq libraries
In our early stage of experiments, sequencing was done on an Illumina GAII platform at the sequencing core facility of the University of North Carolina at Chapel Hill, and the directional RNA-seq libraries were constructed by following an Illumina’s instruction using their Small RNA Sample Prep Kit with some modifications. Briefly, after the purified mRNA was fragmented using a RNA fragmentation kit (Ambion), the fragmented RNA was treated with Antarctic phosphatase (NEB) to remove the 5’-tri-phosphate groups of RNAs with an intact 5’-end. A mono-phosphate group was then added back to the 5’-end of fragmented RNAs by polynucleotide kinase (PNK, NEB) in the presence of 10 mM ATP. The v1.5 sRNA 3’ Adaptor (5’/5rApp/ATCTCGTATGCCGTCTTCTGCTTG/3ddC/) was ligated to the 3’-end of fragmented RNAs using truncated T4 ligase 2 (NEB), and the SRA 5’ RNA adaptor (5’GUUCAGAGUUCUACAGUCCGACGAUC) was ligated to the 5’-end of fragmented RNAs using T4 ligase. To preserve short inserts from small RNAs we omitted the size selection step after PCR application of inserts. In our later experiments, sequencing was done on an Illumina HiSeq 2000 platform at David H. Murdock Research Institute of the North Carolina Research Campus (Kannapolis, NC), and we constructed the directional RNA-seq libraries using Illumina’s TruSeq Small RNA Sample Prep Kit, so that multiplex sequencing can be achieved by using the barcoded PCR primers. The details of the method will be described elsewhere (Dong, Li and Su). Briefly, after similar treatments as described above, the 5’ Adapter (RA5: 5’ GUUCAGAGUUCUACAGUCCGACGAUC), and 3’ Adapter (RA3: 5’ TGGAATTCTCGGGTGCCAAGG) were ligated to 5’- and 3’-end of fragmented RNAs, respectively. Reverse transcription-PCR (RT-PCR) was performed using SuperScript II Reverse Transcriptase Kit using the SRA RT primer, followed by 16 cycles of PCR amplification. Again, the size selection was omitted on PCR products to preserve short inserts from possible small RNAs. Single-end sequencing on the Illumina GA II platform was done with 76 cycles, while that on the HiSeq 2000 platform was done with 100 cycles. Some samples (HS15 min and M-P4 h) were sequenced on both platforms.
Mapping and filtering RNA-seq reads
The genome sequence and annotation files of E. coli K12 substr. MG1655 were obtained from NCBI (http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/), and the experimentally verified operons in the bacterium were downloaded from RegulonDB  (http://regulondb.ccg.unam.mx/). Additional 112 experimentally verified small RNAs in E. coli were obtained from Storz’s group (http://cbmp.nichd.nih.gov/segr/ecoli_rnas.html). A total of 4,501 annotated genes (also including pseudo genes and small RNAs) are included in this analysis. As the reads were not size-selected during the library construction, we trimmed the 3’ adapters attached to some short insertions. Adapter-free reads with lengths of <10 nt were discarded; the remaining reads were mapped to the E. coli K12 genome using Bowtie . For the reads of length 10–14, 15–29 and ≥30 nt, up to 1, 2, and 3 mismatches were allowed, respectively. Since over 99.6% of the multiple mapped reads in each sample were from duplicated tRNA/rRNA genes (data not shown), only uniquely mapped reads were used for further analysis. The alignment of mapped reads to the reference genome was visualized by Integrated Genome Browser (IGB) . To map the directional RNA-seq reads of H. pylori, we trimmed the polyA tails of the original datasets, which were introduced during the library preparation, and mapped the reads to the reference genome using Bowtie with the same parameter settings as for E. coli K12.
Normalization of the mapped counts
Where n is the number of nucleotides of the reads mapped to the transcript, N* our normalization factor defined above, and L the length of the transcript. Clearly, when all reads have the same length, NPKB and RPKM differ by a constant scaling factor. A similar method has been used earlier , except that our NPKB is further normalized by the global scaling factor N* in each sample.
Training the HMM
Selection of expressed adjacent operon pairs
Positive and negative training sets
To train the HMM, we constructed a positive training set in a sample by simply stitching the known adjacent operon pairs that met the two criteria described above to form a large operon if they are parts of a known operon according to RegulonDB. These positive training sets in the seven samples are listed in Additional file 2. To construct a relatively large negative training set in a sample, we included all the uncovered regions in the genome excluding the ones inside the sufficiently expressed genes in the sample.
Positive and negative testing sets
We evaluated the operon prediction accuracy using two methods: one was based on adjacent operon pairs, and the other on the entire operon structure using all the gene pairs of a known operon. For the first method, we constructed a positive testing set in a sample, consisting of sufficiently expressed adjacent operon pairs, and a negative testing set consisting of known adjacent non-operon pairs that were both sufficiently expressed in the sample. A known adjacent non-operon pair consisted of either the first gene of a known operon and its immediate upstream gene, or the last gene in a known operon and its immediate downstream gene, as long as the intergenic region of the gene pair had at least one uncovered region, regardless of its length. For the second method, we constructed a positive testing set in a sample, consisting of all pair-wise combinations of the genes in a sufficiently expressed operon, and a negative testing set consisting of the gene pairs between the genes of the operon and the immediate upstream or immediate downstream gene, given that the known adjacent non-operon pairs had no overlapping un-translated region (UTR) and that all these relevant genes were sufficiently expressed.
Leave-one-out cross validation
We employed a leave-one-out cross validation strategy to evaluate the performance of our algorithm. Specifically, we used the positive training sets and negative training sets in (n-1) samples to train the emission and transition probabilities of the HMM, and used the positive testing set and the negative testing set in the remaining sample to test the trained model.
Training emission probabilities
Where i is the i-th position (nucleotide) on the chromosome. N* the normalization factor defined in equation (1), L the window size, and Coverage (k) the coverage of position k on the genome. Note that a pseudo count of 1 is added to the coverage value of each window. The optimal window size is determined by balancing two goals with opposite effects: to cover as many gaps as possible and to exclude as many interoperonic regions as possible. See Results for the details of window size selection.
We arbitrarily assigned a high probability 1-10-20 for N to emit this value, and a low probability10-20 for N to emit any other values. The value 10-20 is also a pseudo probability to avoid zero probability for decoding the HMM later.
Training transition probabilities
Reconstruction of full length transcripts/operons
We used the Viterbi algorithm  to decode the path of the states that best explains the centroid coverage values of a region of DNA. If a string of adjacent genes are connected by a consecutive sequence of expressed states, then these genes are predicted to form an operon. Furthermore, we stitched two candidate adjacent operons, for instance, A-B and B-C, to obtain the full length transcripts/operons A-B-C. If over half of the length of a terminal gene is predicted to be expressed, this gene is considered as a member of the predicted operon, otherwise the expressed part of the terminal gene is only considered as the UTR of the operon. The TSS and TTS of an assembled operon/transcript were determined by the locations of its 5’-end and the 3’-end, respectively.
To identify potential alternative TSSs for the downstream genes of the assembled transcripts, we used a rather strict threshold of 5-fold for the ratio γ1(j), to guarantee that there is an outstanding ‘jump’ of read coverage in the downstream of position j. In both cases, we set w1 = 80 nt and w2 = 10 nt. The TTSs were simply determined by the locations of the 3’-end of the assembled operons/transcripts.
The algorithm was encoded in C++ and perl. The software package is open-source, and can be downloaded from http://bioinfolab.uncc.edu/TruHmm_package/. We provide users the option to train their model if enough known operons are available in their genomes of interest. Otherwise users can apply our algorithm using the default settings without the need of any training.
Motif detection in promoters
To estimate the statistical significance of motif scores, we used a 3rd-order Markov model to generate 50,000 random sequences based on the transition probabilities learned from the set of experimentally verified promoters in E. coli K12. The distribution of the motif scores in the random sequences was used to define an empirical p-value.
Where, TP (true positive) = Number of known operon pairs accurately classified as operon pairs by the model.
FP (False Positive) = Number of non-operon pairs falsely classified as operon pairs by the model.
FN (False Negative) = Number of known operon pairs falsely classified as non-operon pairs by the model.
TN (True Negative) = Number of non-operon pairs accurately classified as non-operon pairs by the model.
Sensitivity, i.e. TPR (True Positive Rate or recall) is the proportion of known operon pairs that can be correctly identified as operon pairs by the model. Specificity, i.e. 1-FPR (False Positive Rate) is the proportion of non-operon pairs that are correctly classified as non-operon pairs. Accuracy combines the two metrics to quantify the overall performance of the model. A high Accuracy value represents a low total error rate. Precision denotes the proportion of predicted positives that are true positives. F-factor combines Recall and Precision and normalized them to an idealized value.
Hidden Markov model
TRancription unit assembly by a Hidden Markov model
- E. coli:
Escherichia coli K12 substr MG1655 uid57779
- H. pylori:
Helicobacter pylori 26695 uid57787
- B. subtilis:
- M. pneumonia:
Transcription starting site
Transcription terminating site
Nucleotides per kilo base of transcript per billion nucleotides mapped
True positive rate
False positive rate
Open reading frames
Next generation sequencing.
This work was supported by the National Science Foundation (EF0849615, and CCF1048261 to ZS); and the UNC Charlotte (faculty research grand). Funding for open access charge is from EF0849615. We would like to thank members of the Su lab for discussions, Dr. Jennifer Weller for help with library preparations, and Peter Pham for suggestions on the website.
- Liu JM, Camilli A: A broadening world of bacterial small RNAs. Curr Opin Microbiol. 2010, 13: 18-23. 10.1016/j.mib.2009.11.004.PubMed CentralPubMedGoogle Scholar
- Repoila F, Darfeuille F: Small regulatory non-coding RNAs in bacteria: physiology and mechanistic aspects. Biol Cell. 2009, 101: 117-131. 10.1042/BC20070137.PubMedGoogle Scholar
- Thomason MK, Storz G: Bacterial antisense RNAs: how many are there, and what are they doing?. Annu Rev Genet. 2010, 44: 167-188. 10.1146/annurev-genet-102209-163523.PubMed CentralPubMedGoogle Scholar
- Georg J, Hess WR: cis-antisense RNA, another level of gene regulation in bacteria. Microbiol Mol Biol Rev. 2011, 75: 286-300. 10.1128/MMBR.00032-10.PubMed CentralPubMedGoogle Scholar
- Keseler IM, Collado-Vides J, Santos-Zavaleta A, Peralta-Gil M, Gama-Castro S, Muniz-Rascado L, Bonavides-Martinez C, Paley S, Krummenacker M, Altman T: EcoCyc: a comprehensive database of Escherichia coli biology. Nucleic Acids Res. 2011, 39: D583-590. 10.1093/nar/gkq1143.PubMed CentralPubMedGoogle Scholar
- Sierro N, Makita Y, de Hoon M, Nakai K: DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res. 2007, 36: D93-96. 10.1093/nar/gkm910.PubMed CentralPubMedGoogle Scholar
- Chen X, Su Z, Xu Y, Jiang T: Computational prediction of Operons in synechococcus sp. WH8102. Genome Inform Ser Workshop Genome Inform. 2004, 15: 211-222.Google Scholar
- Westover BP, Buhler JD, Sonnenburg JL, Gordon JI: Operon prediction without a training set. Bioinformatics. 2005, 21: 880-888. 10.1093/bioinformatics/bti123.PubMedGoogle Scholar
- Price MN, Huang KH, Alm EJ, Arkin AP: A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res. 2005, 33: 880-892. 10.1093/nar/gki232.PubMed CentralPubMedGoogle Scholar
- Dam P, Olman V, Harris K, Su Z, Xu Y: Operon prediction using both genome-specific and general genomic information. Nucleic Acids Res. 2007, 35: 288-298.PubMed CentralPubMedGoogle Scholar
- Tran TT, Dam P, Su Z, Poole FL, Adams MW, Zhou GT, Xu Y: Operon prediction in Pyrococcus furiosus. Nucleic Acids Res. 2007, 35: 11-20.PubMed CentralPubMedGoogle Scholar
- Bergman NH, Passalacqua KD, Hanna PC, Qin ZS: Operon prediction for sequenced bacterial genomes without experimental information. Appl Environ Microbiol. 2007, 73: 846-854. 10.1128/AEM.01686-06.PubMed CentralPubMedGoogle Scholar
- Mao F, Dam P, Chou J, Olman V, Xu Y: DOOR: a database for prokaryotic operons. Nucleic Acids Res. 2009, 37: D459-463. 10.1093/nar/gkn757.PubMed CentralPubMedGoogle Scholar
- Taboada B, Verde C, Merino E: High accuracy operon prediction method based on STRING database scores. Nucleic Acids Res. 2010, 38: e130-10.1093/nar/gkq254.PubMed CentralPubMedGoogle Scholar
- Livny J: Efficient annotation of bacterial genomes for small, noncoding RNAs using the integrative computational tool sRNAPredict2. Methods Mol Biol. 2007, 395: 475-488. 10.1007/978-1-59745-514-5_30.PubMedGoogle Scholar
- Tjaden B: Prediction of small, noncoding RNAs in bacteria using heterogeneous data. J Math Biol. 2008, 56: 183-200.PubMedGoogle Scholar
- Pichon C, Felden B: Small RNA gene identification and mRNA target predictions in bacteria. Bioinformatics. 2008, 24: 2807-2813. 10.1093/bioinformatics/btn560.PubMedGoogle Scholar
- Luban S, Kihara D: Comparative genomics of small RNAs in bacterial genomes. OMICS. 2007, 11: 58-73. 10.1089/omi.2006.0005.PubMedGoogle Scholar
- Brouwer RW, Kuipers OP, Hijum SA: The relative value of operon predictions. Brief Bioinform. 2008, 9: 367-375. 10.1093/bib/bbn019.PubMedGoogle Scholar
- Toledo-Arana A, Solano C: Deciphering the physiological blueprint of a bacterial cell: revelations of unanticipated complexity in transcriptome and proteome. Bioessays. 2010, 32: 461-467. 10.1002/bies.201000020.PubMedGoogle Scholar
- Sorek R, Cossart P: Prokaryotic transcriptomics: a new view on regulation, physiology and pathogenicity. Nat Rev Genet. 2010, 11: 9-16.PubMedGoogle Scholar
- Filiatrault MJ: Progress in prokaryotic transcriptomics. Curr Opin Microbiol. 2011, 14: 579-586. 10.1016/j.mib.2011.07.023.PubMedGoogle Scholar
- Guell M, van Noort V, Yus E, Chen WH, Leigh-Bell J, Michalodimitrakis K, Yamada T, Arumugam M, Doerks T, Kuhner S: Transcriptome complexity in a genome-reduced bacterium. Science. 2009, 326: 1268-1271. 10.1126/science.1176951.PubMedGoogle Scholar
- Sharma CM, Hoffmann S, Darfeuille F, Reignier J, Findeiss S, Sittka A, Chabas S, Reiche K, Hackermuller J, Reinhardt R: The primary transcriptome of the major human pathogen Helicobacter pylori. Nature. 2010, 464: 250-255. 10.1038/nature08756.PubMedGoogle Scholar
- Nicolas P, Mader U, Dervyn E, Rochat T, Leduc A, Pigeonneau N, Bidnenko E, Marchadier E, Hoebeke M, Aymerich S: Condition-dependent transcriptome reveals high-level regulatory architecture in Bacillus subtilis. Science. 2012, 335: 1103-1106. 10.1126/science.1206848.PubMedGoogle Scholar
- Passalacqua KD, Varadarajan A, Ondov BD, Okou DT, Zwick ME, Bergman NH: Structure and complexity of a bacterial transcriptome. J Bacteriol. 2009, 191: 3203-3211. 10.1128/JB.00122-09.PubMed CentralPubMedGoogle Scholar
- Mitschke J, Georg J, Scholz I, Sharma CM, Dienst D, Bantscheff J, Voss B, Steglich C, Wilde A, Vogel J, Hess WR: An experimentally anchored map of transcriptional start sites in the model cyanobacterium Synechocystis sp. PCC6803. Proc Natl Acad Sci USA. 2011, 108: 2124-2129. 10.1073/pnas.1015154108.PubMed CentralPubMedGoogle Scholar
- Koide T, Reiss DJ, Bare JC, Pang WL, Facciotti MT, Schmid AK, Pan M, Marzolf B, Van PT, Lo FY: Prevalence of transcription promoters within archaeal operons and coding sequences. Mol Syst Biol. 2009, 5: 285-PubMed CentralPubMedGoogle Scholar
- Hovik H, Yu WH, Olsen I, Chen T: Comprehensive transcriptome analysis of the periodontopathogenic bacterium Porphyromonas gingivalis W83. J Bacteriol. 2012, 194: 100-114. 10.1128/JB.06385-11.PubMed CentralPubMedGoogle Scholar
- Rasmussen S, Nielsen HB, Jarmer H: The transcriptionally active regions in the genome of Bacillus subtilis. Mol Microbiol. 2009, 73: 1043-1057. 10.1111/j.1365-2958.2009.06830.x.PubMed CentralPubMedGoogle Scholar
- Perkins TT, Kingsley RA, Fookes MC, Gardner PP, James KD, Yu L, Assefa SA, He M, Croucher NJ, Pickard DJ: A strand-specific RNA-Seq analysis of the transcriptome of the typhoid bacillus Salmonella typhi. PLoS Genet. 2009, 5: e1000569-10.1371/journal.pgen.1000569.PubMed CentralPubMedGoogle Scholar
- Yoder-Himes DR, Chain PS, Zhu Y, Wurtzel O, Rubin EM, Tiedje JM, Sorek R: Mapping the Burkholderia cenocepacia niche response via high-throughput sequencing. Proc Natl Acad Sci USA. 2009, 106: 3976-3981. 10.1073/pnas.0813403106.PubMed CentralPubMedGoogle Scholar
- McGrath PT, Lee H, Zhang L, Iniesta AA, Hottes AK, Tan MH, Hillson NJ, Hu P, Shapiro L, McAdams HH: High-throughput identification of transcription start sites, conserved promoter motifs and predicted regulons. Nat Biotechnol. 2007, 25: 584-592. 10.1038/nbt1294.PubMedGoogle Scholar
- Lasa I, Toledo-Arana A, Dobin A, Villanueva M, de los Mozos IR, Vergara-Irigaray M, Segura V, Fagegaltier D, Penades JR, Valle J: Genome-wide antisense transcription drives mRNA processing in bacteria. Proc Natl Acad Sci USA. 2011, 108: 20172-20177. 10.1073/pnas.1113521108.PubMed CentralPubMedGoogle Scholar
- Mandlik A, Livny J, Robins WP, Ritchie JM, Mekalanos JJ, Waldor MK: RNA-Seq-based monitoring of infection-linked changes in Vibrio cholerae gene expression. Cell Host Microbe. 2011, 10: 165-174. 10.1016/j.chom.2011.07.007.PubMed CentralPubMedGoogle Scholar
- Albrecht M, Sharma CM, Reinhardt R, Vogel J, Rudel T: Deep sequencing-based discovery of the Chlamydia trachomatis transcriptome. Nucleic Acids Res. 2010, 38: 868-877. 10.1093/nar/gkp1032.PubMed CentralPubMedGoogle Scholar
- Albrecht M, Sharma CM, Dittrich MT, Muller T, Reinhardt R, Vogel J, Rudel T: The transcriptional landscape of Chlamydia pneumoniae. Genome Biol. 2011, 12: R98-10.1186/gb-2011-12-10-r98.PubMed CentralPubMedGoogle Scholar
- Wang Y, Li X, Mao Y, Blaschek HP: Single-nucleotide resolution analysis of the transcriptome structure of Clostridium beijerinckii NCIMB 8052 using RNA-Seq. BMC Genomics. 2011, 12: 479-10.1186/1471-2164-12-479.PubMed CentralPubMedGoogle Scholar
- Toledo-Arana A, Dussurget O, Nikitas G, Sesto N, Guet-Revillet H, Balestrino D, Loh E, Gripenland J, Tiensuu T, Vaitkevicius K: The Listeria transcriptional landscape from saprophytism to virulence. Nature. 2009, 459: 950-956. 10.1038/nature08080.PubMedGoogle Scholar
- Flaherty BL, Van Nieuwerburgh F, Head SR, Golden JW: Directional RNA deep sequencing sheds new light on the transcriptional response of Anabaena sp. strain PCC 7120 to combined-nitrogen deprivation. BMC Genomics. 2011, 12: 332-10.1186/1471-2164-12-332.PubMed CentralPubMedGoogle Scholar
- Vijayan V, Jain IH, O’Shea EK: A high resolution map of a cyanobacterial transcriptome. Genome Biol. 2011, 12: R47-10.1186/gb-2011-12-5-r47.PubMed CentralPubMedGoogle Scholar
- Wurtzel O, Sapra R, Chen F, Zhu Y, Simmons BA, Sorek R: A single-base resolution map of an archaeal transcriptome. Genome Res. 2010, 20: 133-141. 10.1101/gr.100396.109.PubMed CentralPubMedGoogle Scholar
- Selinger DW, Cheung KJ, Mei R, Johansson EM, Richmond CS, Blattner FR, Lockhart DJ, Church GM: RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nat Biotechnol. 2000, 18: 1262-1268. 10.1038/82367.PubMedGoogle Scholar
- Dornenburg JE, Devita AM, Palumbo MJ, Wade JT: Widespread antisense transcription in Escherichia coli. MBio. 2010, 1: pii: e00024-10-Google Scholar
- Neidhardt FC, Curtiss R, INgraham JL, Lin ECC, Low KB, Magasanik B, Reznikoff WS, Riley M, Schaechter M, Umbarger HE: EcoSal : Escherichia coli and Salmonella : cellular and molecular biology. 2002, Washington D.C.: ASM PressGoogle Scholar
- Karp PD, Riley M, Saier M, Paulsen IT, Collado-Vides J, Paley SM, Pellegrini-Toole A, Bonavides C, Gama-Castro S: The EcoCyc database. Nucleic Acids Res. 2002, 30: 56-58. 10.1093/nar/30.1.56.PubMed CentralPubMedGoogle Scholar
- Resendis-Antonio O, Freyre-Gonzalez JA, Menchaca-Mendez R, Gutierrez-Rios RM, Martinez-Antonio A, Avila-Sanchez C, Collado-Vides J: Modular analysis of the transcriptional regulatory network of E. coli. Trends Genet. 2005, 21: 16-20. 10.1016/j.tig.2004.11.010.PubMedGoogle Scholar
- Busby S, Ebright RH: Promoter structure, promoter recognition, and transcription activation in prokaryotes. Cell. 1994, 79: 743-746.PubMedGoogle Scholar
- Browning DF, Busbym SJW: The regulation of bacterial transcription initiation. Nat Rev Microbiol. 2004, 2: 57-65. 10.1038/nrmicro787.PubMedGoogle Scholar
- Riley M, Abe T, Arnaud MB, Berlyn MK, Blattner FR, Chaudhuri RR, Glasner JD, Horiuchi T, Keseler IM, Kosuge T: Escherichia coli K-12: a cooperatively developed annotation snapshot–2005. Nucleic Acids Res. 2006, 34: 1-9. 10.1093/nar/gkj405.PubMed CentralPubMedGoogle Scholar
- Karp PD, Keseler IM, Shearer A, Latendresse M, Krummenacker M, Paley SM, Paulsen I, Collado-Vides J, Gama-Castro S, Peralta-Gil M: Multidimensional annotation of the Escherichia coli K-12 genome. Nucleic Acids Res. 2007, 35: 7577-7590. 10.1093/nar/gkm740.PubMed CentralPubMedGoogle Scholar
- Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF: The complete genome sequence of Escherichia coli K-12. Science. 1997, 277: 1453-1462. 10.1126/science.277.5331.1453.PubMedGoogle Scholar
- Gama-Castro S, Salgado H, Peralta-Gil M, Santos-Zavaleta A, Muniz-Rascado L, Solano-Lira H, Jimenez-Jacinto V, Weiss V, Garcia-Sotelo JS, Lopez-Fuentes A: RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units). Nucleic Acids Res. 2011, 39: D98-105. 10.1093/nar/gkq1110.PubMed CentralPubMedGoogle Scholar
- Hershberg R, Altuvia S, Margalit H: A survey of small RNA-encoding genes in Escherichia coli. Nucleic Acids Res. 2003, 31: 1813-1820. 10.1093/nar/gkg297.PubMed CentralPubMedGoogle Scholar
- Gottesman S, Storz G: Bacterial small RNA regulators: versatile roles and rapidly evolving variations. Cold Spring Harb Perspect Biol. 2011, 3: pii: a003798-Google Scholar
- Storz G, Vogel J, Wassarman KM: Regulation by small RNAs in bacteria: expanding frontiers. Mol Cell. 2011, 43: 880-891. 10.1016/j.molcel.2011.08.022.PubMed CentralPubMedGoogle Scholar
- Cho BK, Zengler K, Qiu Y, Park YS, Knight EM, Barrett CL, Gao Y, Palsson BO: The transcription unit architecture of the Escherichia coli genome. Nat Biotechnol. 2009, 27: 1043-1049. 10.1038/nbt.1582.PubMedGoogle Scholar
- Mendoza-Vargas A, Olvera L, Olvera M, Grande R, Vega-Alvarado L, Taboada B, Jimenez-Jacinto V, Salgado H, Juarez K, Contreras-Moreira B: Genome-wide identification of transcription start sites, promoters and transcription factor binding sites in E. coli. PLoS One. 2009, 4: e7526-10.1371/journal.pone.0007526.PubMed CentralPubMedGoogle Scholar
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63. 10.1038/nrg2484.PubMed CentralPubMedGoogle Scholar
- Vivancos AP, Guell M, Dohm JC, Serrano L, Himmelbauer H: Strand-specific deep sequencing of the transcriptome. Genome Res. 2010, 20: 989-999. 10.1101/gr.094318.109.PubMed CentralPubMedGoogle Scholar
- Levin JZ, Yassour M, Adiconis X, Nusbaum C, Thompson DA, Friedman N, Gnirke A, Regev A: Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat Methods. 2010, 7: 709-715. 10.1038/nmeth.1491.PubMed CentralPubMedGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.PubMedGoogle Scholar
- Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L: Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011, 12: R22-10.1186/gb-2011-12-3-r22.PubMed CentralPubMedGoogle Scholar
- Cheung MS, Down TA, Latorre I, Ahringer J: Systematic bias in high-throughput sequencing data and its correction by BEADS. Nucleic Acids Res. 2011, 39: e103-10.1093/nar/gkr425.PubMed CentralPubMedGoogle Scholar
- Sendler E, Johnson GD, Krawetz SA: Local and global factors affecting RNA sequencing analysis. Anal Biochem. 2011, 419: 317-322. 10.1016/j.ab.2011.08.013.PubMedGoogle Scholar
- Wu Z, Wang X, Zhang X: Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq. Bioinformatics. 2011, 27: 502-508. 10.1093/bioinformatics/btq696.PubMedGoogle Scholar
- Li J, Jiang H, Wong WH: Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010, 11: R50-10.1186/gb-2010-11-5-r50.PubMed CentralPubMedGoogle Scholar
- Pop M: Genome assembly reborn: recent computational challenges. Brief Bioinform. 2009, 10: 354-366. 10.1093/bib/bbp026.PubMed CentralPubMedGoogle Scholar
- Flicek P, Birney E: Sense from sequence reads: methods for alignment and assembly. Nat Methods. 2009, 6: S6-S12. 10.1038/nmeth.1376.PubMedGoogle Scholar
- Martin JA, Wang Z: Next-generation transcriptome assembly. Nat Rev Genet. 2011, 12: 671-682. 10.1038/nrg3068.PubMedGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28: 511-515. 10.1038/nbt.1621.PubMed CentralPubMedGoogle Scholar
- Ciesiolka J, Michalowski D, Wrzesinski J, Krajewski J, Krzyzosiak WJ: Patterns of cleavages induced by lead ions in defined RNA secondary structure motifs. J Mol Biol. 1998, 275: 211-220. 10.1006/jmbi.1997.1462.PubMedGoogle Scholar
- Hansen KD, Brenner SE, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010, 38: e131-10.1093/nar/gkq224.PubMed CentralPubMedGoogle Scholar
- Hafner M, Renwick N, Brown M, Mihailovic A, Holoch D, Lin C, Pena JT, Nusbaum JD, Morozov P, Ludwig J: RNA-ligase-dependent biases in miRNA representation in deep-sequenced small RNA cDNA libraries. RNA. 2011, 17: 1697-1712. 10.1261/rna.2799511.PubMed CentralPubMedGoogle Scholar
- Zhuang F, Fuchs RT, Sun Z, Zheng Y, Robb GB: Structural bias in T4 RNA ligase-mediated 3’-adapter ligation. Nucleic Acids Res. 2012, 40: e54-10.1093/nar/gkr1263.PubMed CentralPubMedGoogle Scholar
- Jayaprakash AD, Jabado O, Brown BD, Sachidanandam R: Identification and remediation of biases in the activity of RNA ligases in small-RNA deep sequencing. Nucleic Acids Res. 2011, 39: e141-10.1093/nar/gkr693.PubMed CentralPubMedGoogle Scholar
- Risso D, Schwartz K, Sherlock G, Dudoit S: GC-content normalization for RNA-Seq data. BMC Bioinformatics. 2011, 12: 480-10.1186/1471-2105-12-480.PubMed CentralPubMedGoogle Scholar
- Benjamini Y, Speed TP: Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012, 40: e72-10.1093/nar/gks001.PubMed CentralPubMedGoogle Scholar
- Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12: R18-10.1186/gb-2011-12-2-r18.PubMed CentralPubMedGoogle Scholar
- Minoche AE, Dohm JC, Himmelbauer H: Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome Biol. 2011, 12: R112-10.1186/gb-2011-12-11-r112.PubMed CentralPubMedGoogle Scholar
- Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak MC, Hirai A, Takahashi H: Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 2011, 39: e90-10.1093/nar/gkr344.PubMed CentralPubMedGoogle Scholar
- Mamanova L, Andrews RM, James KD, Sheridan EM, Ellis PD, Langford CF, Ost TW, Collins JE, Turner DJ: FRT-seq: amplification-free, strand-specific transcriptome sequencing. Nat Methods. 2010, 7: 130-132. 10.1038/nmeth.1417.PubMed CentralPubMedGoogle Scholar
- Lipson D, Raz T, Kieu A, Jones DR, Giladi E, Thayer E, Thompson JF, Letovsky S, Milos P, Causey M: Quantification of the yeast transcriptome by single-molecule sequencing. Nat Biotechnol. 2009, 27: 652-658. 10.1038/nbt.1551.PubMedGoogle Scholar
- Raz T, Causey M, Jones DR, Kieu A, Letovsky S, Lipson D, Thayer E, Thompson JF, Milos PM: RNA sequencing and quantitation using the Helicos Genetic Analysis System. Methods Mol Biol. 2011, 733: 37-49. 10.1007/978-1-61779-089-8_3.PubMedGoogle Scholar
- Kent WJ: BLAT–the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664.PubMed CentralPubMedGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25: 1105-1111. 10.1093/bioinformatics/btp120.PubMed CentralPubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-10.1186/gb-2009-10-3-r25.PubMed CentralPubMedGoogle Scholar
- Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol MJ, Gnirke A, Nusbaum C: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol. 2010, 28: 503-510. 10.1038/nbt.1633.PubMed CentralPubMedGoogle Scholar
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011, 29: 644-652. 10.1038/nbt.1883.PubMed CentralPubMedGoogle Scholar
- Schulz MH, Zerbino DR, Vingron M, Birney E: Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012, 28: 1086-1092. 10.1093/bioinformatics/bts094.PubMed CentralPubMedGoogle Scholar
- Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ: De novo assembly and analysis of RNA-seq data. Nat Methods. 2010, 7: 909-912. 10.1038/nmeth.1517.PubMedGoogle Scholar
- Martin J, Bruno VM, Fang Z, Meng X, Blow M, Zhang T, Sherlock G, Snyder M, Wang Z: Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads. BMC Genomics. 2010, 11: 663-10.1186/1471-2164-11-663.PubMed CentralPubMedGoogle Scholar
- Surget-Groba Y, Montoya-Burgos JI: Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Res. 2010, 20: 1432-1440. 10.1101/gr.103846.109.PubMed CentralPubMedGoogle Scholar
- Martin J, Zhu W, Passalacqua KD, Bergman N, Borodovsky M: Bacillus anthracis genome organization in light of whole transcriptome sequencing. BMC Bioinformatics. 2010, 11 (Suppl 3): S10-10.1186/1471-2105-11-S3-S10.PubMed CentralPubMedGoogle Scholar
- Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008, 320: 1344-1349. 10.1126/science.1158441.PubMed CentralPubMedGoogle Scholar
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994, 2: 28-36.PubMedGoogle Scholar
- Yus E, Guell M, Vivancos AP, Chen WH, Lluch-Senar M, Delgado J, Gavin AC, Bork P, Serrano L: Transcription start site associated RNAs in bacteria. Mol Syst Biol. 2012, 8: 585-PubMed CentralPubMedGoogle Scholar
- Makino K, Kim SK, Shinagawa H, Amemura M, Nakata A: Molecular analysis of the cryptic and functional phn operons for phosphonate use in Escherichia coli K-12. J Bacteriol. 1991, 173: 2665-2672.PubMed CentralPubMedGoogle Scholar
- Hove-Jensen B, Rosenkrantz TJ, Zechel DL, Willemoes M: Accumulation of intermediates of the carbon-phosphorus lyase pathway for phosphonate degradation in phn mutants of Escherichia coli. J Bacteriol. 2010, 192: 370-374. 10.1128/JB.01131-09.PubMed CentralPubMedGoogle Scholar
- Iqbal S, Parker G, Davidson H, Moslehi-Rahmani E, Robson RL: Reversible phase variation in the phnE gene, which is required for phosphonate metabolism in Escherichia coli K-12. J Bacteriol. 2004, 186: 6118-6123. 10.1128/JB.186.18.6118-6123.2004.PubMed CentralPubMedGoogle Scholar
- Jochimsen B, Lolle S, McSorley FR, Nabi M, Stougaard J, Zechel DL, Hove-Jensen B: Five phosphonate operon gene products as components of a multi-subunit complex of the carbon-phosphorus lyase pathway. Proc Natl Acad Sci USA. 2011, 108: 11393-11398. 10.1073/pnas.1104922108.PubMed CentralPubMedGoogle Scholar
- Chen CM, Ye QZ, Zhu ZM, Wanner BL, Walsh CT: Molecular biology of carbon-phosphorus bond cleavage. Cloning and sequencing of the phn (psiD) genes involved in alkylphosphonate uptake and C-P lyase activity in Escherichia coli B. J Biol Chem. 1990, 265: 4461-4471.PubMedGoogle Scholar
- Metcalf WW, Wanner BL: Evidence for a fourteen-gene, phnC to phnP locus for phosphonate metabolism in Escherichia coli. Gene. 1993, 129: 27-32. 10.1016/0378-1119(93)90692-V.PubMedGoogle Scholar
- Kononova SV, Nesmeyanova MA: Phosphonates and their degradation by microorganisms. Biochemistry (Mosc). 2002, 67: 184-195. 10.1023/A:1014409929875.Google Scholar
- Shi W, Zhou Y, Wild J, Adler J, Gross CA: DnaK, DnaJ, and GrpE are required for flagellum synthesis in Escherichia coli. J Bacteriol. 1992, 174: 6256-6263.PubMed CentralPubMedGoogle Scholar
- Rashid MH, Rao NN, Kornberg A: Inorganic polyphosphate is required for motility of bacterial pathogens. J Bacteriol. 2000, 182: 225-227. 10.1128/JB.182.1.225-227.2000.PubMed CentralPubMedGoogle Scholar
- Nicol JW, Helt GA, Blanchard SG, Raja A, Loraine AE: The integrated genome browser: free software for distribution and exploration of genome-scale datasets. Bioinformatics. 2009, 25: 2730-2731. 10.1093/bioinformatics/btp472.PubMed CentralPubMedGoogle Scholar
- Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010, 11: 94-10.1186/1471-2105-11-94.PubMed CentralPubMedGoogle Scholar
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 1509-1517. 10.1101/gr.079558.108.PubMed CentralPubMedGoogle Scholar
- Jones DC, Ruzzo WL, Peng X, Katze MG: A new approach to bias correction in RNA-Seq. Bioinformatics. 2012, 28: 921-928. 10.1093/bioinformatics/bts055.PubMed CentralPubMedGoogle Scholar
- Srivastava S, Chen L: A two-parameter generalized Poisson model to improve the analysis of RNA-seq data. Nucleic Acids Res. 2010, 38: e170-10.1093/nar/gkq670.PubMed CentralPubMedGoogle Scholar
- Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997, 268: 78-94. 10.1006/jmbi.1997.0951.PubMedGoogle Scholar
- Larsen TS, Krogh A: EasyGene–a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics. 2003, 4: 21-10.1186/1471-2105-4-21.PubMed CentralPubMedGoogle Scholar
- Reese MG, Kulp D, Tammana H, Haussler D: Genie–gene finding in Drosophila melanogaster. Genome Res. 2000, 10: 529-538. 10.1101/gr.10.4.529.PubMed CentralPubMedGoogle Scholar
- Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. 1998, Cambrage, UK: Cambridge University PressGoogle Scholar
- Su Z, Olman V, Mao F, Xu Y: Comparative genomics analysis of NtcA regulons in cyanobacteria: regulation of nitrogen assimilation and its coupling to photosynthesis. Nucleic Acid Res. 2005, 33: 5156-5171. 10.1093/nar/gki817.PubMed CentralPubMedGoogle Scholar
- Li S, Xu M, Su Z: Computational analysis of LexA regulons in Cyanobacteria. BMC Genomics. 2010, 11: 527-10.1186/1471-2164-11-527.PubMed CentralPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.