Leaderless genes in bacteria: clue to the evolution of translation initiation mechanisms in prokaryotes
© Zheng et al; licensee BioMed Central Ltd. 2011
Received: 1 March 2011
Accepted: 12 July 2011
Published: 12 July 2011
The Shine-Dalgarno (SD) signal has long been viewed as the dominant translation initiation signal in prokaryotes. Recently, leaderless genes, which lack 5'-untranslated regions (5'-UTRs) on their mRNAs, have been shown to be abundant in archaea. However, current large-scale in silico analyses of initiation mechanisms in bacteria are based mainly on SD-led initiation rather than leaderless initiation. The study of leaderless genes in bacteria remains open, which leaves our understanding of translation initiation mechanisms in prokaryotes uncertain.
Here, we study signals in the translation initiation regions of all genes in 953 bacterial and 72 archaeal genomes, and then attempt to construct an evolutionary scenario from the perspective of leaderless genes in bacteria. With an algorithm designed to identify multiple signals in the upstream regions of the genes in a genome, we classify all genes into SD-led, TA-led and atypical genes according to the category of the most probable signal in their upstream sequences. In particular, the occurrence of TA-like signals about 10 bp upstream of the translation initiation site (TIS) in bacteria most probably indicates leaderless genes.
Our analysis reveals that leaderless genes are widespread, although not dominant, in a variety of bacteria. In Actinobacteria and Deinococcus-Thermus in particular, more than twenty percent of genes are leaderless. In closely related bacterial genomes, our results imply that the change of translation initiation mechanism between genes deriving from a common ancestor is linearly dependent on the phylogenetic relationship. Analysis of the macroevolution of leaderless genes further shows that the proportion of leaderless genes in bacteria has a decreasing trend in evolution.
As the first stage of protein synthesis in gene expression, translation is a key process that is highly conserved in biological systems. To date, 31 universally occurring genes identified in 191 species have been shown to be involved in the translation process. However, translation initiation varies greatly among the three kingdoms. In eukaryotes, the ribosome binds at the 5'-end of the capped mRNA and slides downstream to find the first start codon, where it initiates translation; this is the so-called scanning mechanism. In prokaryotes, there are two known mechanisms. The Shine-Dalgarno (SD) initiation mechanism was found early in Escherichia coli. In this mechanism, a short motif called the SD sequence in the 5'-untranslated region (5'-UTR) of the mRNA binds to the 3'-end of the 16S rRNA on the ribosome and helps the ribosome directly identify the translation initiation site (TIS). The other mechanism, leaderless initiation, was found later in λ-phage of E. coli. In this case, the mRNA lacks a 5'-UTR and hence has no SD sequence, so the start codon itself serves as the most important signal for translation initiation. It was once proposed that signals downstream of the start codon, called "downstream boxes", may bind to the 16S rRNA and help translation initiation of leaderless genes, but these suggestions were later refuted by experimental evidence. In fact, several studies reported that leaderless initiation uses an alternative pathway similar to that in eukaryotes, and that the leaderless λ cI gene can be faithfully translated in vitro in all three kingdoms [6–8]. This further suggests that this initiation pathway may be the one used by the last universal common ancestor (LUCA) and is conserved in all three kingdoms [2, 7, 9].
Despite this diversity of translation initiation mechanisms, SD initiation has long been considered the dominant mechanism in prokaryotes. However, recent studies revealed that leaderless initiation should be as important as SD initiation in archaea, one of the two branches of prokaryotes. A computational study of 144 genes in the archaeon Sulfolobus solfataricus indicated that distal genes in operons are SD-led, while single genes and proximal genes in operons are leaderless. Further computational analysis showed that leaderless genes make up a rather high proportion of genes in the archaeon Pyrobaculum aerophilum. Torarinsson et al. also analyzed 18 complete archaeal genome sequences and estimated the numbers of SD-led and leaderless genes; their results indicate that at least 12 of the 18 archaeal genomes have plenty of leaderless genes. In addition, experimental studies in Pyrobaculum aerophilum and Haloarchaea reported that the majority of transcripts are leaderless in those archaeal genomes [11, 13].
In spite of the above-noted efforts to understand archaea, little is known about the bacteria-wide situation of leaderless genes. One major reason is that translation initiation was believed to be more complex in archaea than in bacteria, which led to a lack of attention to this issue and to continued use of the canonical explanation that neglects the leaderless initiation mechanism in bacteria. Recently, a few experimentally verified leaderless genes were reported in bacteria; however, it is extremely difficult to draw a clear picture of leaderless genes at the bacteria-wide level, since they seem to be scattered sporadically across some bacterial genomes and known data are scarce. There have been some computational analyses of the translation initiation mechanism in bacteria, but they were based mainly on genes likely to use SD-led initiation rather than leaderless initiation. SD sequences in 21 bacterial and 9 archaeal species were investigated, and their occurrence was found to vary from 10.8% to 90.1%. Chang et al. also studied 141 bacterial and 21 archaeal complete genomes and estimated the proportion of SD-led genes to range from 11.6% to 90.8%. A more recent work is noteworthy: Nakagawa et al. analyzed 277 prokaryotes (249 bacteria and 28 archaea) to survey the proportion of SD-led genes in each genome, and then discussed the link with the initiation mechanism. However, knowing the proportion of SD-led genes does not reveal the proportion of leaderless ones. Moreover, to estimate the number of SD-led genes, most of these algorithms detected SD signals by a simple scanning method, which may ignore the nucleotide composition bias of each genome. At the same time, without a test of statistical significance, these algorithms do not provide solid evidence that the detected signals are meaningful.
With more than a thousand complete bacterial genomes now deposited in public databases or being sequenced, it is increasingly important to clarify the translation initiation mechanism through a clear, bacteria-wide picture of leaderless genes, which should be based on a more accurate and reliable analysis.
The objective of this study is to understand the translation initiation mechanism and its evolutionary scenario from the perspective of leaderless genes in bacteria. We developed an algorithm, validated for statistical significance, to classify the initiation regulatory signals upstream of the gene start into SD-like, TA-like and atypical signals for all genes in each prokaryotic genome. The method leads to the identification of both leadered and leaderless genes in a genome. We examined 953 bacterial and 72 archaeal genomes and annotated diverse translation initiation signals in these genomes. Focusing on the leaderless genes and their initiation signals, our analysis reveals that leaderless genes are widespread, although not dominant, in a variety of groups in bacteria. More useful still are the quantitative relationships in the evolution of initiation signals in bacteria, which might provide a clear picture of the evolution of translation initiation mechanisms.
Validation of the algorithm
Documented leaderless genes in S. coelicolor A3(2)
Sequence upstream of TIS1
CGCTCTTG TAGCGT GC TGGAA TG
CGAAGTTGTAGCGTTT GGTCG TG
CCTCAAGAGAGAGTGT GAGAG TG
CAGGGAG TAGGTT CG CCGCCA TG
GACGCGGT TACTTT GA CGGCA TG
GACAGG TACAGT CC ACCCCTG TG
GCCGTGCC TAACCT GG AGACA TG
GAGGCGCCTTGAATAG AGGCA TG
CCGCCGCCTTGACTGG GGGCA TG
CGACTCGTAATCTCG ACACCA TG
GCCTCGGGTAG AAAATCCACA TG
TCGGCGCCGACAA AGGATGC GTG
It should be noted that the current algorithm is not sensitive enough for genomes that have only a few leaderless genes. In addition, because a widely used Pribnow box serves as the reference consensus (see Methods for details), the algorithm may fail to detect the few leaderless genes with non-Pribnow signals in the bacteria in this study. So when the algorithm does not judge a bacterial genome to have leaderless genes, this does not simply mean that the genome has no leaderless genes; its leaderless genes may be very few in number or lack the common Pribnow signal. In summary, however, the simulation based on a large sample size shows that our TA-like signal prediction differs significantly from random strings drawn from the background. This allows leaderless genes to be identified with statistical reliability in the current study. Although it is difficult to estimate the prediction accuracy owing to the lack of experimental data, our prediction includes most of the known literature-documented leaderless genes, at least in this example.
High usage of leaderless genes in archaea
Distribution of leaderless genes in sequenced prokaryotic genomes
Avg. Percentage2 (%)
49.5 (6.1, 66.7)
31.7 (3.8, 63.3)
71.8 (71.8, 71.8)
13.3 (9.7, 19.4)
19.2 (2.3, 29.8)
26.5 (20.7, 32.4)
2.4 (2.4, 2.4)
4.7 (4.0, 5.7)
39.4 (35.9, 46.1)
4.2 (2.4, 16.6)
6.3 (4.7, 7.8)
6.8 (2.8, 10.6)
4.5 (3.0, 5.7)
15.8 (5.7, 20.6)
2.9 (2.7, 3.2)
3.9 (1.8, 6.9)
7.8 (1.4, 16.1)
Diverse distribution of leaderless genes in bacteria
The leaderless genes in bacteria are of particular interest in this study. Below we report the analysis of leaderless genes in all bacterial genomes (see Table 2). Overall, the algorithm detects leaderless genes in 207 of the 953 bacterial genomes, and these 207 genomes include Acidobacteria, Actinobacteria, Aquificae, Bacteroidetes/Chlorobi, Chloroflexi, Deinococcus-Thermus, Firmicutes, Proteobacteria (α, β, γ, δ, and ε), Spirochaetes, and one genome unclassified in RefSeq. Unlike in archaea, leaderless genes are not identified in most genomes but are concentrated in a few groups. The most notable group is Actinobacteria, which comprises high-GC, gram-positive genomes. Actinobacteria live in a variety of natural environments such as soil, freshwater and the sea; some of them are human pathogens, and they are well known as secondary metabolite producers that are important in the pharmaceutical industry. Owing to their importance, 89 species have been completely sequenced. Despite the biodiversity of the 89 genomes, our method detects leaderless genes in nearly all of them (86 genomes), with proportions around 20%. This is in accordance with previous reports of leaderless genes found in Streptomyces and Corynebacterium. The Deinococcus-Thermus group is also noteworthy, although only five genomes have been completely sequenced. Our analysis shows that all five have leaderless genes in high proportions (around 40%). Among them, the Deinococcus radiodurans R1 genome has the highest occurrence of leaderless genes (46.1%) of all bacteria. This group includes the Deinococcus species, which are radiation-resistant, and Thermus thermophilus, which is thermophilic. The high presence of leaderless genes in these genomes probably corresponds to their extreme habitats, as suggested for archaea. Besides these two groups, leaderless genes are present in a few genera or species of other groups.
For example, in γ-Proteobacteria, leaderless genes are mostly found in Xanthomonadales and Legionellales, which are located near the root of γ-Proteobacteria, suggesting that the absence of leaderless genes in other γ-Proteobacteria may be due to a single loss event during genome evolution. In Firmicutes, leaderless genes are concentrated in two clades, Lactobacillales and Mycoplasma. In Lactobacillales, leaderless genes are identified in two subclades, the Lactococcus-Streptococcus clade and the Oenococcus-Leuconostoc clade, but not in Enterococcus or Lactobacillus. Neither the genomes with leaderless genes nor those without form a monophyletic group, which probably reflects the complex evolution of the translation initiation mechanism in Lactobacillales. In addition, leaderless genes are found in five Mycoplasma genomes, with an average occurrence of 18.3%.
As both the transcription promoters and the translation initiation signals of leaderless genes, the TA-like signals detected in bacteria merit careful attention. As shown in Figure 1, our results reveal variation of the detected TA-like signals across species. In high-GC (> 60%) groups such as Actinobacteria and Deinococcus-Thermus, the signals have a TAnnnT pattern in which the three middle positions provide little information (Figure 1A-B); in medium-GC (40-60%) groups such as Aquificae and Chloroflexi, the signals have a TAtaaT pattern, like the typical Pribnow box in E. coli, in which the three middle positions become more informative (Figure 1C); and in Firmicutes, which have mostly low-GC (< 40%) genomes, the signals become TAtAAT, in which the two As in the middle become as prominent as the A and T at the second and sixth positions (Figure 1D). Moreover, in several low-GC genomes, a TG peak before the Pribnow box can be observed, as seen in Figure 1D; this peak has been reported to substitute for the function of the -35 region. For example, in Lactococcus the TG peak is extremely strong, which probably means that the -35 region is missing in this genus. The dependence on GC content probably arises because the signal is AT-rich: in low-GC genomes the signal needs to be strengthened to remain informative, while in high-GC genomes it can degrade because the TAnnnT pattern already provides enough information. Considering that our algorithm is designed to compensate for the nucleotide composition bias of the genomic background, this dependence of signal conservation on GC content appears to be nontrivial and worth further study. More specifically, this tendency is most probably related to genome evolution in some phylogenetic groups of bacteria.
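One way to make "informative position" concrete is the relative entropy of a PWM column against the genomic background, which is larger when a base is both frequent in the signal and rare in the genome. The sketch below is only an illustration under that assumption: the text does not state which measure Figure 1 uses, and the function names and example columns are hypothetical.

```python
import math

BASES = "ACGT"

def position_information(column, background):
    """Relative entropy (in bits) of one PWM column against background frequencies."""
    return sum(
        p * math.log2(p / background[base])
        for base, p in column.items()
        if p > 0
    )

def pwm_information(pwm, background):
    """Per-position information for a PWM given as a list of {base: probability} columns."""
    return [position_information(column, background) for column in pwm]

# Illustrative, made-up columns (not values from the paper).
uniform = {base: 0.25 for base in BASES}
fixed_t = {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0}  # fully conserved T
no_preference = dict(uniform)                        # an "n" position

print(pwm_information([fixed_t, no_preference], uniform))

# Hypothetical high-GC background: a conserved T is rarer, hence more informative.
high_gc = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}
print(position_information(fixed_t, high_gc))
```

Under this measure, the same conserved T carries more bits in a high-GC background than in a uniform one, which is consistent with the argument that a shorter TAnnnT pattern can suffice in high-GC genomes.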
Atypical genes in prokaryotes
Besides the typical TA-like and SD-like signals, the algorithm also identifies many so-called atypical signals. Genes bearing these signals, namely atypical genes, are probably SD-less leadered genes, although they could also be leaderless genes with unknown promoter signals. Several clades, for example Mycoplasma, Cyanobacteria and Bacteroidetes, have a substantial number of atypical genes. The atypical signals detected in both bacteria and archaea are still poorly understood. For example, in halophilic archaea most leadered transcripts have no SD sequence in their leader regions, and these genes use novel translation initiation mechanisms that are still unknown. However, some atypical signals are conserved across species and have conserved position distributions relative to the TIS. For instance, in Cyanobacteria, Proteobacteria and other groups, a conserved dipyrimidine is located immediately upstream of the TIS, which is consistent with the findings of a previous study. Another AT-rich signal without a strong consensus is found in Bacteroidetes and many other groups. These AT-rich signals have been suggested to bind ribosomal protein S1 or to facilitate translation initiation by affecting mRNA secondary structure. Other atypical signals are likely to be patterns of coding regions, transcription factor binding sites, or other unknown translation or transcription signals. Altogether, such signals can serve as targets for experimental work to decipher their regulatory roles, leading to a better understanding of the initiation mechanism.
In summary, the analysis described above reveals that translation initiation in bacteria is far from the simple scenario previously imagined, and a more complex and diverse scenario should be built. Though SD-led genes are dominant in Firmicutes, Proteobacteria, and many other groups, as previously thought, leaderless genes occur frequently in many groups such as Actinobacteria and Deinococcus-Thermus. Moreover, many genomes in Cyanobacteria and Bacteroidetes use neither SD-led genes nor leaderless genes. The translation initiation mechanisms in those genomes are largely unknown and remain to be unravelled. In the following part, we try to understand this diversity from an evolutionary point of view.
Evolution scenario of translation initiation mechanisms
One limitation of the current study should be noted: only leaderless genes with transcriptional promoters resembling the Pribnow box are identified. This may lead to an underestimate of the number of leaderless genes, especially in genomes with a large number of atypical genes, such as Mycoplasma, Bacteroidetes, and Cyanobacteria. However, σ70 is the most widely used σ factor in bacterial genomes, and its σ2 domain, which binds to the Pribnow box, is highly conserved. It is therefore unlikely that a high percentage of leaderless genes remains unidentified by our method. Despite this limitation, the analysis does suggest the general evolutionary trend of translation initiation mechanisms in bacteria.
Although leaderless genes have been studied extensively in archaea, they have long been regarded as rare in bacteria [2, 24]. Recent works reported the possibility of a high occurrence of leaderless genes in some bacterial genomes [2, 15], but these proposals are based only on the lack of SD sequences in those genomes, not on direct evidence of leaderless genes. In this study, we have shown that leaderless genes are widespread, although not dominant, in a variety of groups in bacteria. Two of these groups, Deinococcus-Thermus and Actinobacteria, deserve serious attention. According to the algorithm, Deinococcus-Thermus is the phylum with the highest proportion of leaderless genes, over 30% on average. Interestingly, most organisms in this group live in extreme environments. For example, D. radiodurans can withstand an acute dose of 5,000 Gy of ionizing radiation with almost no loss of viability, while Thermus species are among the most thermophilic bacteria and can grow at 85 °C. These environments may reflect the conditions of the ancient earth, which suggests that the leaderless initiation mechanism played an important role for ancestral organisms in their original habitat. The latter group, Actinobacteria, is the major antibiotic producer in both nature and industry. It is known that most antibiotics attack the translation system. Specifically, antibiotics such as kasugamycin and pactamycin inhibit translation initiation on leadered transcripts but have no effect on leaderless transcripts. Therefore, the high occurrence of leaderless genes in Actinobacteria may be related to their antibiotic production.
In our method, all weight matrices and position distributions are learned automatically from the genome sequences. The model nevertheless has several predefined parameters, such as the length of the region upstream of the TIS in which signals are searched (denoted L in the section Materials and methods). Generally, increasing this length could improve the sensitivity of motif finding. However, it would cost more computation time, and the algorithm might find many motifs unrelated to translation initiation. We therefore limited the region to a range that is sufficient for finding the signals. In both bacteria and archaea, SD sequences are usually 4-7 bp in length and located 5-13 bp upstream of the TIS, so a region of 20 bp upstream of the TIS is enough to cover them. For leaderless genes in bacteria, the -10 box is 6-9 bp in length (TATAAT, or with the extended TG), and its position relative to the transcription start site varies between -14 and -8 bases (measured from the "A" at the second position of the TATAAT box, as in the current study). The -10 box is therefore also covered by the 20 bp region in bacteria. For leaderless genes in archaea, the TATA box is located about 25-37 bp from the transcription start, and a region of 50 bp upstream of the TIS is enough to cover it. When longer regions (e.g., 50 bp upstream of the TIS) are used for searching signals, TA-like signals are found in some additional species. However, the positions of these TA-like signals are not clustered and mostly lie far from the -10 region of the TIS, so these signals are more likely to be transcription promoters of leadered genes.
According to current evidence, SD-led initiation and leaderless initiation were probably both used by the LUCA. Evidence for the use of SD-led initiation by the LUCA includes the broad usage of the SD sequence and the high conservation of the anti-SD sequence in both bacteria and archaea, while leaderless initiation is also proposed to have been used by the LUCA considering its usage in all three kingdoms [2, 9, 24]. However, it is also possible that leaderless initiation originated only in archaea, if leaderless genes are marginal in bacteria as has long been thought. Our results show the broad occurrence of leaderless genes in bacteria, especially in Actinobacteria, where around 20% of genes are leaderless (Table 2). This indicates that leaderless genes were used by the LUCA, because it is unlikely that the leaderless initiation mechanism originated independently in bacteria and archaea and became so important in both. In conclusion, our results provide further support for the proposition that leaderless initiation was used by the LUCA.
In this paper, we have studied the translation initiation signals in 1025 sequenced prokaryotic genomes and described the distribution of translation initiation mechanisms used in these genomes. The most surprising finding is that, though not as common as in archaea, there are substantial numbers of leaderless genes in bacteria. Most genomes with a high percentage of leaderless genes are located near the root of the 16S rRNA tree and have a relatively small number of operon distal genes. These facts suggest that current leaderless genes in bacteria are likely to be remnants of the ancestor and are retained because those genomes have a low demand for organizing genes into operons for highly efficient and specialized gene expression.
The 953 bacterial and 72 archaeal genomes and their coding gene annotations used in this study were downloaded from RefSeq in 2010. A list of the genomes is available at our webpage. To analyze TIS upstream regions, TIS annotations were collected from the database ProTISA or predicted by TriTISA. ProTISA collects TISs confirmed by a variety of available evidence for prokaryotic genomes, including Swiss-Prot experimental records, literature (IPT), conserved domain hits (CDC) and sequence alignment between orthologous genes (HSC). ProTISA also includes TIS annotations from RefSeq and TISs predicted by TriTISA (MED). The latest update of ProTISA was released in Oct 2008, corresponding to RefSeq 30. For each gene, we use its TIS in the priority order IPT, CDC, HSC, and MED. Since ProTISA covered only 709 of the 1025 organisms, the TISs of the remaining 316 genomes were predicted with TriTISA, a TIS predictor with high accuracy. Then, for each genome, all TIS upstream sequences were extracted for further analysis. For bacteria, 20 bp sequences were extracted, while for archaea 50 bp sequences were extracted, since transcription promoters are farther away.
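The priority rule and the upstream extraction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input layout (a dictionary from evidence label to coordinate), the 0-based coordinate convention and the function names are all assumptions made for the example.

```python
# Evidence labels in the priority order given in the text.
PRIORITY = ("IPT", "CDC", "HSC", "MED")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def pick_tis(candidates):
    """Return (label, position) for the highest-priority evidence present, else None.

    candidates: hypothetical dict mapping an evidence label to a TIS coordinate.
    """
    for label in PRIORITY:
        if label in candidates:
            return label, candidates[label]
    return None

def upstream(genome, tis, length, strand="+"):
    """Extract up to `length` bases immediately upstream of the start codon.

    tis is the 0-based forward-strand index of the first base of the start codon
    (for a minus-strand gene, the highest coordinate of the codon).
    """
    if strand == "+":
        return genome[max(0, tis - length):tis]
    # Minus strand: take the bases after the codon and reverse-complement them.
    segment = genome[tis + 1:tis + 1 + length]
    return segment.translate(_COMPLEMENT)[::-1]

# Toy example: 20 bp would be used for bacteria and 50 bp for archaea.
forward = "AAACCCGGGATGTTT"        # ATG starts at index 9
reverse = "AAACATCCCGGGTTT"        # same locus seen from the other strand
print(pick_tis({"MED": 120, "HSC": 117}))
print(upstream(forward, 9, 5), upstream(reverse, 5, 5, strand="-"))
```

Both calls to `upstream` describe the same gene from opposite strands, so they should return the same sequence.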
The multi-signal model and the signal detection algorithm
To study the translation initiation mechanism of prokaryotes, a straightforward approach is to find the sequence patterns in the TIS upstream regions of each genome. From these sequence patterns, an algorithm can then build one or more positional weight matrices (PWMs) to describe the aligned positional frequencies corresponding to potential signals. This approach has been used in many computational studies of the translation initiation mechanism [41–45]. In the current study, following a similar strategy, we developed a multi-signal model of the region upstream of the TIS and used the Expectation-Maximization (EM) algorithm to estimate the parameters of the model. With the model, we then computationally identified the leaderless genes in each genome.
To detect a signal associated with leaderless genes, the algorithm first selects the search region and the reference signal consensus for each genome. In archaea, leaderless genes can be detected by scanning for the reference consensus of TA-rich transcription promoters at about 30 bp upstream of the TIS, because the promoter mostly occurs about 30 bp upstream of the TSS and, for a leaderless gene, the TIS and TSS should coincide. In bacteria, the most commonly used promoter sequence is the so-called Pribnow box, which corresponds to the σ70 factor, has the pattern T82 A89 T52 A59 A49 T89 (the number after each base is the frequency, in percent, of that nucleotide at that position) and occurs at about 10 bp upstream of the TSS. Thus, in this paper, we search for signals in the -10 bp region using this reference consensus.
The multi-signal model of translation initiation signals
Translation initiation signals are usually conserved in both content and position. For example, in E. coli, SD sequences are mostly AGGA, GGAG and GAGG and lie within the [-10, -5] region relative to the TIS. Our previous works used a positional weight matrix to characterize the signal content and a discrete distribution to characterize the position. In the current study, motivated by the knowledge that three categories of signals (SD sequences, TA-rich transcription promoters and other possible signals) may exist in the TIS upstream regions of a genome, we generalize the model to describe multiple signals in order to explore the complexity of the translation initiation mechanism.
Let S be the set of N sequences of length L upstream of the TISs. In these sequences, a signal of length W can be characterized by a W × 4 weight matrix w, where each element $w_{ib}$ of w is the probability that base b (A, C, G or T) occurs at position i in a set of aligned signal representatives. In addition, the signal may occur at any position upstream of the TIS, starting at the j-th position of the TIS upstream sequence with probability $p_j$.
To describe multiple signals, we assume that there are M signals in the sequence set, but at most one in each sequence. We use $w_m$ and $p_m$ to denote the weight matrix and positional distribution of the m-th signal. For the regions of the sequences not covered by signals, we further build a uniform background model with nucleotide frequencies b. There is also a probability $p_0$ that a sequence does not contain any signal and comes entirely from the background model.
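The following sketch shows how such a model could assign probabilities to the possible explanations of one upstream sequence. It is a hedged illustration rather than the authors' implementation, whose details are in their supplementary text: it assumes that $p_0$ and all positional probabilities together sum to one, and the function names and toy parameters are hypothetical.

```python
import math

def background_prob(seq, bg):
    """Probability of a whole sequence under the background model."""
    return math.prod(bg[c] for c in seq)

def signal_prob(seq, start, pwm, bg):
    """Probability of seq when the signal (list of per-position dicts) begins at `start`."""
    width = len(pwm)
    prob = 1.0
    for i, c in enumerate(seq):
        if start <= i < start + width:
            prob *= pwm[i - start][c]   # inside the signal: use w_ib
        else:
            prob *= bg[c]               # outside the signal: background
    return prob

def posteriors(seq, pwms, pos_priors, p0, bg):
    """Normalized probabilities keyed by None (no signal) or (m, j) (signal m at offset j)."""
    joint = {None: p0 * background_prob(seq, bg)}
    for m, (pwm, prior) in enumerate(zip(pwms, pos_priors)):
        for j, p_j in enumerate(prior):
            joint[(m, j)] = p_j * signal_prob(seq, j, pwm, bg)
    total = sum(joint.values())
    return {key: value / total for key, value in joint.items()}

# Toy parameters (made up): one signal of width 2 that prefers "TA", L = 4.
bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
ta = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
post = posteriors("GTAC", [ta], [[0.3, 0.3, 0.3]], p0=0.1, bg=bg)
best = max(post, key=post.get)
print(best)
```

In an EM setting, quantities of this kind would play the role of the E-step responsibilities, from which the weight matrices and positional distributions are re-estimated.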
We then use the EM algorithm to obtain the maximum-likelihood estimates, details of which are described in Additional File 2: Text S1. For each genome, the EM algorithm based on the multi-signal model is set to obtain four PWMs as the potential signals in the genome. Searching for more PWMs would reveal more detailed structure in the TIS upstream sequences, but the algorithm would cost much more time and would more easily fall into local maxima. We therefore use four signals as a balance. Other predefined parameters include the sequence length L, which is set to 50 for archaea and 20 for bacteria.
Classification of translation initiation signals
We then link the four PWMs of each genome to translation initiation mechanisms by classifying each PWM into a category of signals. One-by-one manual examination would certainly achieve this, but it is impracticable for a large-scale analysis of thousands of PWMs from more than a thousand genomes. Therefore, a classifier is designed to compare each PWM computationally against the widely known initiation signals, with the remaining PWMs assigned to the "Atypical" signals for further study.
Our analysis of the SD_dis values of all PWMs shows a bimodal distribution separated at approximately 4.0. We therefore classify all PWMs with SD_dis < 4.0 as SD-like signals.
The TA_dis values of all PWMs also follow a bimodal distribution, and the threshold is set to 0.8. PWMs with TA_dis < 0.8 are therefore regarded as TA-like signals.
In summary, each PWM from a genome is first tested for classification as an SD-like signal and then as a TA-like signal. If it is neither, the PWM is regarded as a so-called atypical signal.
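The decision order above can be written as a short function. This sketch assumes the two distances have already been computed for a PWM; how SD_dis and TA_dis are defined is not reproduced here, and the function name is hypothetical.

```python
SD_THRESHOLD = 4.0   # threshold on SD_dis stated in the text
TA_THRESHOLD = 0.8   # threshold on TA_dis stated in the text

def classify_pwm(sd_dis, ta_dis):
    """Classify one PWM from its precomputed distances, testing SD-like first."""
    if sd_dis < SD_THRESHOLD:
        return "SD-like"
    if ta_dis < TA_THRESHOLD:
        return "TA-like"
    return "atypical"

# Made-up distances for illustration only.
for sd_dis, ta_dis in [(2.1, 1.5), (6.3, 0.4), (5.0, 1.2), (3.0, 0.5)]:
    print(sd_dis, ta_dis, classify_pwm(sd_dis, ta_dis))
```

Because the SD test comes first, a PWM that falls below both thresholds is labeled SD-like.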
Classification of genes
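For each sequence k, we compare the probability that the sequence contains no signal with the probabilities that it contains each signal at each position. If the maximum among these corresponds to one of the signals, the sequence is predicted to have that signal; if the no-signal probability is the maximum, the sequence is predicted to have no signal. As the signals have already been classified as SD-like, TA-like and atypical, the genes are labeled accordingly as SD-led, TA-led and atypical.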
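As a sketch of this step, the function below picks the most probable explanation for one sequence and maps it to a gene label. The dictionary layout (key None for no signal, key (m, j) for signal m at position j), the separate "no signal" label and the example numbers are assumptions made for illustration.

```python
GENE_LABEL = {"SD-like": "SD-led", "TA-like": "TA-led", "atypical": "atypical"}

def label_gene(probabilities, signal_classes):
    """Label one gene from the probabilities of its candidate explanations.

    probabilities: dict keyed by None (no signal) or (m, j) (signal m at position j).
    signal_classes: category of each signal, indexed by m.
    """
    best = max(probabilities, key=probabilities.get)
    if best is None:
        return "no signal"          # assumption: kept as a separate outcome
    m, _position = best
    return GENE_LABEL[signal_classes[m]]

# Made-up example with two signals.
classes = ["SD-like", "TA-like"]
gene_a = {None: 0.10, (0, 8): 0.70, (1, 3): 0.20}
gene_b = {None: 0.05, (0, 8): 0.15, (1, 3): 0.80}
gene_c = {None: 0.90, (0, 8): 0.06, (1, 3): 0.04}
print([label_gene(g, classes) for g in (gene_a, gene_b, gene_c)])
```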
Strategy for validation of the signal detection algorithm
To test the statistical significance of the signals, we use simulated data that retain the dinucleotide frequencies of the original sequences. For each 20 bp TIS upstream sequence in a given genome, a corresponding sequence with exactly the same dinucleotide frequencies is generated using uShuffle. These simulated sequences form another sequence set, to which we apply our gene-classification procedure with the model learned from the real sequences. The simulation is run 1000 times. If the maximum number of "leaderless genes" identified in the shuffled samples is less than the number of "leaderless genes" identified in the real data, the signal we found is statistically significant with P-value < 0.001.
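The decision rule at the end of this procedure can be sketched as below. The shuffling itself (done with uShuffle in the study) and the gene classification are not reproduced; the function assumes the counts of predicted leaderless genes are already available, and its name and the example counts are hypothetical.

```python
def shuffle_test(real_count, shuffled_counts):
    """Compare the real count of predicted leaderless genes with shuffled replicates.

    Returns whether every shuffled replicate stays below the real count and,
    if so, the empirical bound 1/n on the P-value (0.001 for n = 1000).
    """
    n = len(shuffled_counts)
    reaching = sum(1 for count in shuffled_counts if count >= real_count)
    significant = reaching == 0          # same as max(shuffled_counts) < real_count
    return {
        "significant": significant,
        "p_bound": 1.0 / n if significant else None,
        "fraction_reaching_real": reaching / n,
    }

# Made-up counts for illustration: 1000 shuffled runs, none reaching the real value.
example = shuffle_test(500, [10] * 999 + [40])
print(example)
```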
Estimation of TIS signal evolution distance
16S rRNA sequences were extracted from the RefSeq annotation. After discarding sequences that are longer than 1600 bp, shorter than 1400 bp, or too divergent from the other sequences, 967 sequences were left, from 903 bacterial and 64 archaeal genomes. A list of the 967 genomes can be found at our webpage. The sequences were aligned with ClustalW v2.0. A Neighbor-Joining tree was then built with Mega 3.1 using the Kimura 2-parameter distance model. The tree was rerooted between bacteria and archaea, and the distance from each bacterial organism to the root was calculated from the tree.
TIS: translation initiation site
TSS: transcription start site
LUCA: Last Universal Common Ancestor
COG: Clusters of Orthologous Groups.
We thank Jiao Qu, Yongchu Liu, Hong Kang, Jiangtao Guo, Binbin Lai, Feifei He and others for beneficial discussions and help with the work. We are grateful to Dr. Iain C Bruce of Zhejiang University School of Medicine for his linguistic help. The work was supported by the National Natural Science Foundation (30970667, 30770499 and 10721403) of China and the National Basic Research Program of China (2011CB707500).
- Ciccarelli FD, Doerks T, Mering Cv, Creevey CJ, Snel B, Bork P: Toward Automatic Reconstruction of a Highly Resolved Tree of Life. Science. 2006, 311 (5765): 1283-1287. 10.1126/science.1123061.
- Nakagawa S, Niimura Y, Miura K, Gojobori T: Dynamic evolution of translation initiation mechanisms in prokaryotes. Proc Natl Acad Sci USA. 2010, 107 (14): 6382-6387. 10.1073/pnas.1002036107.
- Shine J, Dalgarno L: The 3'-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites. Proc Natl Acad Sci USA. 1974, 71 (4): 1342-1346. 10.1073/pnas.71.4.1342.
- Ptashne M, Backman K, Humayun MZ, Jeffrey A, Maurer R, Meyer B, Sauer RT: Autoregulation and function of a repressor in bacteriophage lambda. Science. 1976, 194 (4261): 156-161. 10.1126/science.959843.
- Moll I, Grill S, Gualerzi CO, Blasi U: Leaderless mRNAs in bacteria: surprises in ribosomal recruitment and translational control. Mol Microbiol. 2002, 43 (1): 239-246. 10.1046/j.1365-2958.2002.02739.x.
- Udagawa T, Shimizu Y, Ueda T: Evidence for the translation initiation of leaderless mRNAs by the intact 70 S ribosome without its dissociation into subunits in eubacteria. J Biol Chem. 2004, 279 (10): 8539-8546. 10.1074/jbc.M308784200.
- Grill S, Gualerzi CO, Londei P, Blasi U: Selective stimulation of translation of leaderless mRNA by initiation factor 2: evolutionary implications for translation. Embo J. 2000, 19 (15): 4101-4110. 10.1093/emboj/19.15.4101.
- Andreev DE, Terenin IM, Dunaevsky YE, Dmitriev SE, Shatsky IN: A leaderless mRNA can bind to mammalian 80S ribosomes and direct polypeptide synthesis in the absence of translation initiation factors. Mol Cell Biol. 2006, 26 (8): 3164-3169. 10.1128/MCB.26.8.3164-3169.2006.
- Benelli D, Londei P: Begin at the beginning: evolution of translational initiation. Res Microbiol. 2009, 160 (7): 493-501. 10.1016/j.resmic.2009.06.003.
- Tolstrup N, Sensen CW, Garrett RA, Clausen IG: Two different and highly organized mechanisms of translation initiation in the archaeon Sulfolobus solfataricus. Extremophiles. 2000, 4 (3): 175-179. 10.1007/s007920070032.
- Slupska MM, King AG, Fitz-Gibbon S, Besemer J, Borodovsky M, Miller JH: Leaderless transcripts of the crenarchaeal hyperthermophile Pyrobaculum aerophilum. J Mol Biol. 2001, 309 (2): 347-360. 10.1006/jmbi.2001.4669.
- Torarinsson E, Klenk HP, Garrett RA: Divergent transcriptional and translational signals in Archaea. Environ Microbiol. 2005, 7 (1): 47-54. 10.1111/j.1462-2920.2004.00674.x.
- Brenneis M, Hering O, Lange C, Soppa J: Experimental characterization of Cis-acting elements important for translation and transcription in halophilic archaea. Plos Genet. 2007, 3 (12): 2450-2467.
- Ma J, Campbell A, Karlin S: Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures. J Bacteriol. 2002, 184 (20): 5733-5745. 10.1128/JB.184.20.5733-5745.2002.
- Chang B, Halgamuge S, Tang SL: Analysis of SD sequences in completed microbial genomes: Non-SD-led genes are as common as SD-led genes. Gene. 2006, 373: 90-99.
- The Translation initiation Signal Database. [http://mech.ctb.pku.edu.cn/leaderless/]
- Shultzaberger RK, Chen Z, Lewis KA, Schneider TD: Anatomy of Escherichia coli sigma70 promoters. Nucleic Acids Res. 2007, 35 (3): 771-788. 10.1093/nar/gkl956.
- Wu CJ, Janssen GR: Translation of vph mRNA in Streptomyces lividans and Escherichia coli after removal of the 5' untranslated leader. Mol Microbiol. 1996, 22 (2): 339-355. 10.1046/j.1365-2958.1996.00119.x.
- Ryding NJ, Kelemen GH, Whatling CA, Flardh K, Buttner MJ, Chater KF: A developmentally regulated gene encoding a repressor-like protein is essential for sporulation in Streptomyces coelicolor A3(2). Mol Microbiol. 1998, 29 (1): 343-357. 10.1046/j.1365-2958.1998.00939.x.
- Nothaft H, Parche S, Kamionka A, Titgemeyer F: In vivo analysis of HPr reveals a fructose-specific phosphotransferase system that confers high-affinity uptake in Streptomyces coelicolor. J Bacteriol. 2003, 185 (3): 929-937. 10.1128/JB.185.3.929-937.2003.
- Mazurakova V, Sevcikova B, Rezuchova B, Kormanec J: Cascade of sigma factors in streptomycetes: identification of a new extracytoplasmic function sigma factor sigmaJ that is under the control of the stress-response sigma factor sigmaH in Streptomyces coelicolor A3(2). Arch Microbiol. 2006, 186 (6): 435-446. 10.1007/s00203-006-0158-9.
- Ventura M, Canchaya C, Tauch A, Chandra G, Fitzgerald GF, Chater KF, van Sinderen D: Genomics of Actinobacteria: tracing the evolutionary history of an ancient phylum. Microbiol Mol Biol Rev. 2007, 71 (3): 495-548. 10.1128/MMBR.00005-07.
- Janssen GR: Eubacterial, archaebacterial, and eukaryotic genes that encode leaderless mRNA. Industrial Microorganisms: Basic and Applied Molecular Genetics. Edited by: Baltz RH, Hegeman GD, Skatrud PL. 1993, Washington, DC: American Society for Microbiology Press, 59-67.
- Londei P: Evolution of translational initiation: new insights from the archaea. FEMS Microbiol Rev. 2005, 29 (2): 185-200. 10.1016/j.fmrre.2004.10.002.
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al: The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003, 4: 41. 10.1186/1471-2105-4-41.
- Hering O, Brenneis M, Beer J, Suess B, Soppa J: A novel mechanism for translation initiation operates in haloarchaea. Mol Microbiol. 2009, 71 (6): 1451-1463. 10.1111/j.1365-2958.2009.06615.x.
- Sazuka T, Ohara O: Sequence features surrounding the translation initiation sites assigned on the genome sequence of Synechocystis sp. strain PCC6803 by amino-terminal protein sequencing. DNA Res. 1996, 3 (4): 225-232. 10.1093/dnares/3.4.225.
- Kozak M: Regulation of translation via mRNA structure in prokaryotes and eukaryotes. Gene. 2005, 361: 13-37.
- Dam P, Olman V, Harris K, Su Z, Xu Y: Operon prediction using both genome-specific and general genomic information. Nucleic Acids Res. 2007, 35 (1): 288-298.
- Price MN, Huang KH, Alm EJ, Arkin AP: A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res. 2005, 33 (3): 880-892. 10.1093/nar/gki232.
- Bolotin A, Wincker P, Mauger S, Jaillon O, Malarme K, Weissenbach J, Ehrlich SD, Sorokin A: The Complete Genome Sequence of the Lactic Acid Bacterium Lactococcus lactis ssp. lactis IL1403. Genome Res. 2001, 11 (5): 731-753. 10.1101/gr.GR-1697R.
- Kaberdina AC, Szaflarski W, Nierhaus KH, Moll I: An unexpected type of ribosomes induced by kasugamycin: a look into ancestral times of protein synthesis?. Mol Cell. 2009, 33 (2): 227-236. 10.1016/j.molcel.2008.12.014.
- Paget MS, Helmann JD: The sigma70 family of sigma factors. Genome Biol. 2003, 4 (1): 203. 10.1186/gb-2003-4-1-203.
- Moseley BE, Mattingly A: Repair of irradiation transforming deoxyribonucleic acid in wild type and a radiation-sensitive mutant of Micrococcus radiodurans. J Bacteriol. 1971, 105 (3): 976-983.
- Henne A, Bruggemann H, Raasch C, Wiezer A, Hartsch T, Liesegang H, Johann A, Lienard T, Gohl O, Martinez-Arias R, et al: The genome sequence of the extreme thermophile Thermus thermophilus. Nat Biotechnol. 2004, 22 (5): 547-553. 10.1038/nbt956.
- Chin K, Shean CS, Gottesman ME: Resistance of Lambda-Ci Translation to Antibiotics That Inhibit Translation Initiation. J Bacteriol. 1993, 175 (22): 7471-7473.
- Chen H, Bjerknes M, Kumar R, Jay E: Determination of the optimal aligned spacing between the Shine-Dalgarno sequence and the translation initiation codon of Escherichia coli mRNAs. Nucleic Acids Res. 1994, 22 (23): 4953-4957. 10.1093/nar/22.23.4953.
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007, 35 (Database issue): D61-D65.
- Hu GQ, Zheng XB, Yang YF, Ortet P, She ZS, Zhu HQ: ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genomes. Nucleic Acids Res. 2008, 36 (Database issue): D114-D119.
- Hu GQ, Zheng X, Zhu HQ, She ZS: Prediction of translation initiation site for microbial genomes with TriTISA. Bioinformatics. 2009, 25 (1): 123-125. 10.1093/bioinformatics/btn576.
- Zhu HQ, Hu GQ, Yang YF, Wang J, She ZS: MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes. BMC Bioinformatics. 2007, 8: 97. 10.1186/1471-2105-8-97.
- Hu GQ, Zheng X, Ju LN, Zhu H, She ZS: Computational evaluation of TIS annotation for prokaryotic genomes. BMC Bioinformatics. 2008, 9: 160. 10.1186/1471-2105-9-160.
- Zhu HQ, Hu GQ, Ouyang ZQ, Wang J, She ZS: Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics. 2004, 20 (18): 3308-3317. 10.1093/bioinformatics/bth390.
- Besemer J, Lomsadze A, Borodovsky M: GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 2001, 29 (12): 2607-2618. 10.1093/nar/29.12.2607.
- Delcher AL, Bratke KA, Powers EC, Salzberg SL: Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007, 23 (6): 673-679. 10.1093/bioinformatics/btm009.
- Hershberg R, Bejerano G, Santos-Zavaleta A, Margalit H: PromEC: An updated database of Escherichia coli mRNA promoters with experimentally identified transcriptional start sites. Nucleic Acids Res. 2001, 29 (1): 277. 10.1093/nar/29.1.277.
- Jiang M, Anderson J, Gillespie J, Mayne M: uShuffle: a useful tool for shuffling biological sequences while preserving the k-let counts. BMC Bioinformatics. 2008, 9: 192. 10.1186/1471-2105-9-192.
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: ClustalW and ClustalX version 2. Bioinformatics. 2007, 23 (21): 2947-2948. 10.1093/bioinformatics/btm404.
- Kumar S, Nei M, Dudley J, Tamura K: MEGA: A biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief Bioinform. 2008, 9 (4): 299-306. 10.1093/bib/bbn017.
- Lake JA: Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc Natl Acad Sci USA. 1994, 91 (4): 1455-1459. 10.1073/pnas.91.4.1455.
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14 (6): 1188-1190. 10.1101/gr.849004.
- Anderson TB, Brian P, Champness WC: Genetic and transcriptional analysis of absA, an antibiotic gene cluster-linked two-component system that regulates multiple antibiotics in Streptomyces coelicolor. Mol Microbiol. 2001, 39 (3): 553-566. 10.1046/j.1365-2958.2001.02240.x.
- Hoskisson PA, Rigali S, Fowler K, Findlay KC, Buttner MJ: DevA, a GntR-like transcriptional regulator required for development in Streptomyces coelicolor. J Bacteriol. 2006, 188 (14): 5014-5023. 10.1128/JB.00307-06.
- Revill WP, Bibb MJ, Scheu AK, Kieser HJ, Hopwood DA: Beta-ketoacyl acyl carrier protein synthase III (FabH) is essential for fatty acid biosynthesis in Streptomyces coelicolor A3(2). J Bacteriol. 2001, 183 (11): 3526-3530. 10.1128/JB.183.11.3526-3530.2001.
- Hahn JS, Oh SY, Roe JH: Regulation of the furA and catC operon, encoding a ferric uptake regulator homologue and catalase-peroxidase, respectively, in Streptomyces coelicolor A3(2). J Bacteriol. 2000, 182 (13): 3767-3774. 10.1128/JB.182.13.3767-3774.2000.
- Umeyama T, Horinouchi S: Autophosphorylation of a bacterial serine/threonine kinase, AfsK, is inhibited by KbpA, an AfsK-binding protein. J Bacteriol. 2001, 183 (19): 5506-5512. 10.1128/JB.183.19.5506-5512.2001.
- van Wezel GP, White J, Young P, Postma PW, Bibb MJ: Substrate induction and glucose repression of maltose utilization by Streptomyces coelicolor A3(2) is controlled by malR, a member of the lacI-galR family of regulatory genes. Mol Microbiol. 1997, 23 (3): 537-549. 10.1046/j.1365-2958.1997.d01-1878.x.
- Sola-Landa A, Rodriguez-Garcia A, Franco-Dominguez E, Martin JF: Binding of PhoP to promoters of phosphate-regulated genes in Streptomyces coelicolor: identification of PHO boxes. Mol Microbiol. 2005, 56 (5): 1373-1385. 10.1111/j.1365-2958.2005.04631.x.
- Hong HJ, Hutchings MI, Neu JM, Wright GD, Paget MS, Buttner MJ: Characterization of an inducible vancomycin resistance system in Streptomyces coelicolor reveals a novel gene (vanK) required for drug resistance. Mol Microbiol. 2004, 52 (4): 1107-1121. 10.1111/j.1365-2958.2004.04032.x.