We obtained mycobacterial type strains from the Deutsche Sammlung von Mikroorganismen und Zellkulturen in Germany and the CCUG Laboratory, Göteborg, Sweden (Table S1a). The mycobacteria were cultivated and the DNA isolated and sequenced as outlined in Methods. Genomes of 114 different mycobacteria (RGM and SGM), distributed evenly throughout the genus Mycobacterium, together with 130 representative genomes available at the National Center for Biotechnology Information (NCBI) were included in a comparative genomic analysis of the Mycobacterium genus. Among these 244 genomes 192 represent known mycobacterial species. Some of the genomes appear in duplicates since sequencing were performed by different research groups and we also included more than one strain for some to ensure species affiliation (Table S1a, which also indicates type strains and accession numbers). Most of the genomes are near-complete multi-scaffold drafts, while 47 genomes are complete single scaffold genomes. The qualities of the 244 genomes were good, with estimated completeness of more than 90%; see Supplementary information and Table S1b. All 244 genomes were grouped and analyzed based on different criteria such as growth rate and pathogenicity. For simplicity, we name the different mycobacteria as, e.g., M. marinum throughout the text since we mainly discuss the Mycobacterium genus. We have followed the historical naming of mycobacteria and clades in order to avoid confusions [3, 19,20,21]. Below, we present phylogenetic data and factors influencing the evolution of the Mycobacterium genus. Secondly, we focus on tRNA and ncRNA.
Phylogeny and factors influencing the evolution of mycobacteria
Overview of mycobacterial genomes
The genome sizes of mycobacteria range from 3.2 Mb (M. leprae) to 8.1 Mb (M. dioxanotrophicus) with an average genome size of 5.7 Mb (Fig. 1a). Compared to other members of the Corynebacteriales order, to which mycobacteria belong [3], the average mycobacterial genome size is among the largest (Fig. S1a). On the basis of available literature [3,4,5, 7,8,9,10] and Refs therein we classified mycobacteria into three pathogenicity and two growth rate categories: (a) Pathogenic (P; n = 25, where n refers to the number of species); (b) Opportunistic pathogenic (OP; n = 116); (c) Non-pathogenic (NP; n = 66); (d) Rapid growing (RGM; n = 119); and (e) Slow growing (SGM; n = 112). For some mycobacteria we could not obtain enough information to classify them into any of the three pathogenicity categories (n = 37) or any of the two growth rate categories (n = 13).
First, we analysed the correlation between genome size and pathogenicity. As shown in Fig. 1a, the genome size of pathogenic mycobacteria is, on average, smaller (4.9 Mbp) than opportunistic/non-pathogenic mycobacteria (≈5.8 Mbp) with a p value of 0.0000005. Among those assigned to the pathogenic group, the smallest and largest genomes are M. lepromatosis (3,206,741 bp) and M. senegalense CK1 strain (6,738,555 bp), respectively. For the non-pathogenic and opportunistic pathogenic mycobacteria, the data suggested that the genome size distribution is similar. A comparison of SGM and RGM revealed that the average genome size for SGM to be roughly 0.5 Mbp smaller than in the case of RGM (Fig. 1a; p = 0.000000001). Our data further suggested that there is a linear correlation between the number of predicted coding sequences, CDS, and genome size with high R2-values [0.96 (NP), 0.91 (OP), 0.81 (P), 0.96 (RGM), and 0.89 (SGM); Fig. 1b, c]. In contrast, low R2-values [ranging from 0.01 (SGM) to 0.23 (P); Fig. 1d, e] indicated no correlation between the number of predicted tRNA genes and genome size (Fig. 1d, e). Similarly, the number of rRNA genes (Fig. 1f, g) did not correlate with the genome size nor did the number of ncRNA genes (Fig. 1h, i). Notably, while most mycobacteria carry one or two rRNA operons it appears that there are a few RGM with higher numbers of rRNA operons; M. neworleansense and M. dioxanotrophicus carry three each while M. icosiumassiliensis has four (Fig. 1f, g).
Mycobacterial phylogeny
Core genes refer to those common to the mycobacterial genomes and the outgroup Hoysella subflava genome [22, 23]. These were identified using two methods, which are based on different homology approaches (see Methods). The PanOCT tool, which includes a bidirectional blastp approach and consideration of the gene synteny (minimum cut off of 45% identity and 60% query coverage), identified 56 hard-core protein genes, hereafter the “56 HC-genes” (see Table S2 and the Discussion). Using the SCARAP v0.3.1 tool (core pipeline setting parameters -e 245, −f 245 and -i 1, with default coverage cutoff 50%) we identified 387 orthogroups, hereafter referred to as the “387 core genes”. The “56 HC-genes” and “387 core genes” were subsequently used to construct high-resolution phylogenetic trees (Fig. 2 and Fig. S2a). The two trees are in good agreement with each other, but have some differences. Below, we focus on the “387 core gene” tree since it resulted in higher bootstrap values (Fig. 2). The “56 HC-gene” tree and the differences are discussed in the Supplementary information (Fig. S2a and b).
The “387 core gene” tree was sub-divided into 33 clades (each encompassing more than one member) and 6 single-species clades (see also Table S3a). The clade names, based on the definitions according to Goodfellow et al. [3], were retained as far as possible for historical reasons and to avoid confusion (see above). To examine the quality of the core phylogeny, all vs. all pairwise ANI values were calculated and the branches between genomes, or group of genomes in the core tree (Fig. 2; see also Tables S3b and S4, and Figs. S2 and S3), were colored according to the ANI values in five ranges: > 95% (species boundary [13, 17, 28]; in red), 90 to 95% (in orange), 85 to 90% (in green), 80 to 85% (in blue) and < 80% (in black). The ANI data supported the overall clade structures based on the “387 core gene” (as well as the “56 HC-gene”) phylogenies in which the values were highest at the “tips” of the tree and decreased with the distance from the tips towards the root (Figs. 2 and S2, see also Fig. S3 and Table S4).
Clade assignment of unnamed mycobacteria and identification of new species
The “387 core gene” phylogeny suggested that M. palustre, M. phlei, M. peregrinum, and M. parafortuitum clades are dissolved. The different species from these clades were relocated to other clades, forming new or single membered clades (Fig. 2; see also Table S3a for a detailed clade description). For example, M. palustre was relocated to the M. interjectum clade and M. holsaticum to the new (proposed) M. elephantis clade. Moreover, the M. gordonae clade was suggested to encompass M. szulgai as well as M. angelicum and M. riyadhense. In addition to the M. elephantis clade, the “387 core gene” phylogeny suggested the formation of the new clades M. bohemicum, M. nebraskense, M. gadium, M. chlorophenolicum, M. pyrenivorans, M. mageritense, M. sediminis and M. duvalii while M. shigaense, M. conspicuum, M. parmense, M. shinjukuense, M. lacus and M. cookii constitute single clades (see the Discussion).
The “387 core gene” phylogeny also allowed the identification of new mycobacterial species and subspecies which we assigned to existing or new clades. Calculation of the ANI values between all genomes in the phylogenetic tree supported the suggestion to consider these mycobacteria as new species using ANI = 95% as “species threshold value” [17, 28] (Figs. 2, S2 and S3; Tables S3a, b and S4; and Fig. S3b-k compare the ANI values for these unnamed species with the respective clade members). Two unnamed species were assigned to the M. sphagni clade (Fig. S3b): M. sp. M26, which is deeply rooted in the clade, and M. sp. Epa45, which is close to but different from, M. crocinum, and M. rhodesiae JS60. The M. rhodesiae JS60 strain is separated from the M. rhodesiae type strain (DSM 44223) and phylogenetically closer to M. sphagni and therefore suggested not to be a M. rhodesiae species in keeping with a recent report [15]. Moreover, two unnamed mycobacteria belong to the M. flavescens clade (Fig. S3c), M. sp. GA-0227b and M. sp. IS-3022, and are clearly separated from each other as well as from the closely related species, M. celeriflavum. Finally, M. sp. URHB004 see also [24], which is deeply rooted in the M. sediminis clade, and M. sp. Root135 within the same clade are well separated from any other species (Fig. S3d). Taken together, we suggest that these unnamed mycobacteria should be considered as new species. This is also supported by the “56 HC-gene” tree (Fig. S2b).
For the other unnamed mycobacteria, the branch depth was smaller. On the basis of species and subspecies thresholds 95 and 98% [17, 28], respectively, our “387 core gene” (and “56 HC-gene”) based phylogeny (Fig. 2) and ANI analysis suggested the following. The isolates M. sp. TKK-01-0059 and M. sp. MOTT36Y assigned to the M. avium clade should probably be considered to be M. yongonense strains, whereas M. sp. MAC 080597 8934 is closely related to M. bouchedurhonense, M. timonense and M. avium 104 (Fig. S3e).
Two isolates were assigned to the M. terrae clade where M. sp. UM Kg17 belongs to the M. arupense group, and M. sp. JDM601 should be considered a M. algericum subspecies (Fig. S3f). The lone isolate, M. sp. 012931, belonging to the M. ulcerans clade, is closest to M. pseudoshottsii or M. liflandii; however, we emphasize that all mycobacteria in this clade show ANI values above the subspecies threshold (Fig. S3g; see also Ref [25]). The two unnamed mycobacteria assigned to the M. chelonae clade likely represent a single M. chelonae subspecies (Fig. S3h, ANI values for members of the M. chelonae clade; see also Ref [26]). Three isolates, M. sp. KMS, M. sp. MCS and M. sp. JLS, were assigned to the M. doricum clade and our data suggested that they should be considered as M. monacense strains (Fig. S3i). Finally, M. sp. UNNCL9 belonging to the M. neoaurum clade is suggested to be a M. neoaurum strain (Fig. S3j) and the M. fortuitum clade member M. sp. VKM Ac-1817D strain a M. fortuitum strain (Fig. S3k).
To conclude, the “387 core gene” (and “56 HC-gene”) phylogenetic trees, together with the ANI data, provided insight into i) the organization of the clades constituting the Mycobacterium genus, ii) clade allocation, and iii) proximity of the phylogenetic relationships for unnamed mycobacterial isolates. For the complete list of newly assigned or re-assigned species and subspecies see Table S3a, b.
Presence of plasmids
To identify plasmid sequences, we assembled raw reads (Illumina and Ion Torrent data) using “plasmidSPAdes” (see Methods). Following this we identified plasmids in 20 mycobacteria, 6 SGM and 14 RGM (Table S5a, b). Six of the 20 plasmids have previously been detected in other mycobacteria and 14 were new plasmids previously unreported. One of these latter, pJCM15653 was present in M. boenickei and partial hits were detected in M. peregrinum and M. septicum (not shown); all three belong to the M. fortuitum clade. Of the four M. gadium clade members, three harbor different plasmids where M. gadium carries pMM23, a plasmid which previously was reported to be present in the M. marinum M strain [29].
Prediction of the plasmid genes revealed many hypothetical genes and a number of interesting homologs (for annotation, see Table S5c). These homologs include di-guanylate cyclase dosC (M. crocinum; di-guanylate cyclase participates in the synthesis of the signal molecule c-di-GMP [30]); transcriptional factor whiB6; anti-sigma factor F antagonist rsfA (M. gordonae); dnaA (M. chimaera); and house-keeping sigma factor sigA (M. komossense and M. moriokaense). Plasmids in the M. ulcerans and M. chlorophenolicum clade members are found in Refs [6, 25].
Searching for plasmids in 197 draft genomes (excluding the complete genomes) using the PLSDB database resulted in 30 known plasmid sequences, corresponding to 29 circular plasmids and one linear plasmid, the latter present in M. branderi (Table S5b). For M. chimaera, M. nebraskense and M. parascrofulacuem we detected the presence of multiple known plasmid sequences; regarding M. chimaera DSM 44623, see also Ref [31].
Taken together, our analysis revealed the presence of 14 new plasmid sequences: 10 in RGM and 4 in SGM. Moreover, it appears that certain mycobacterial strains, such as M. chimaera and M. nebraskense, have been exposed to different plasmids that might have affected their evolution.
Presence of phages
Phages contribute to the diversity of genomes and play a role in horizontal gene transfer, HGT. We therefore predicted the presence and impact of phage genomes/sequences in mycobacteria as the percentage of the genome sequences with similarity to phage-derived genes. We predicted intact phage genomes to be present for 46 mycobacteria (Fig. 3; Table S6). Most (30 mycobacteria) carried one phage each, but up to three phages were detected in six species (M. aquaticum strain RW6, M. canariasense DSM 44828, M. mucogenicum DSM 44124, M. heckeshornense RLE, M. immunogenum DSM 45595, and M. abscessus subsp. bolletii 50,594). Including the questionable and incomplete phages, 238 genomes carried phage-derived genes. For 6 mycobacteria, we did not detect any phage-derived sequences, e.g., M. leprae and M. lepromatosis in the M. haemophilum clade. On average, 0.7% (range 0 to 4.64%) of the mycobacterial genomes consisted of phage-like genes (Table S6). For the majority of the mycobacteria (231 genomes) the phage content deviated less than two standard deviations (±1.38%; calculated for the whole Mycobacterium genus, i.e., 2 × 0.69%; see Table S6) from the average number 0.7%. Among the 13 mycobacteria with higher phage content than average, 9 belonged to the RGM with M. immunogenum having the highest fraction (4.64% of the total number of genes). There was no statistically significant difference between SGM and RGM, nor between the NP, OP, and P categories. At the clade level, M. mageritense and M. chelonae clade members were predicted to have significantly higher phage content (1.49 and 1.75%, respectively) than the average mycobacteria (Table S6). Thus, it appears that phages have had a larger impact on the evolution of these species compared to the majority of mycobacteria. With respect to the possible link between phage and tRNA see below and the Discussion.
Identification of IS elements
IS-elements also represent an important driving force for the evolution of bacterial genomes; they can disrupt genes and influence transcription/expression of genes close to the integrated IS element [34]. Hence, we predicted different types and numbers of IS elements in mycobacteria (Fig. 3). At least one IS element was detected in all mycobacteria but none was universally present in mycobacteria. The average number of IS elements in the genus was 28.3 copies with a standard deviation of 32.7 (Table S7a) indicating a high variability. (Due to draft genome status for many of the genomes, we emphasize the difficulties in accurately determining the number of IS elements.) The most common family of IS elements, found in > 65% of the genomes, included IS3, IS110, IS256, IS481, and ISL3 (Fig. 3). Of these, IS3 and IS256 are present with the highest average copy number per genome (5.5 and 5.8, respectively; Fig. 3 and not shown) albeit the variation in copy numbers among mycobacteria is high (ranging between 1 and 53 and 1 and 80, respectively). The ISAs1 type is present with the highest number (212 copies) in any individual genome (M. liflandii). M. ulcerans and M. liflandii have the highest total number of IS elements, 290 and 218, respectively (Table S7b). In contrast, only one IS element was found in M. cookii (IS481) while two were predicted in M. salmoniphilum, M. saopaulense EPM10906 (one ISAs1 and one IS701), and M. chelonae (one IS481 and one IS701); all of these belong to the earliest diverging mycobacterial lineage.
With respect to growth rate and pathogenicity classifications (RGM, SGM, NP, OP and P), the mycobacteria pathogens had significantly more IS-elements than the opportunistic ones (45.1 vs 28.0; p < 0.04; Table S7a). Variation within groups was high for all categories. Among species that we could classify with respect to growth rate and pathogenicity, IS66 and ISAzo13 were only found in slow-growing opportunistic pathogens (in one and two species, respectively) while IS1595 was only detected in four opportunistic and four non-pathogens (Fig. 3; Table S7b). IS1 was only found in SGM of unknown pathogenicity (three species). At the clade level, the M. ulcerans, and M. gadium clades harbor a significantly higher than average number of IS elements (88.0 and 69.8, respectively). This indicates that IS elements have played a significant role in the diversification of the species in these clades.
Our comparison of IS element types in mycobacteria detected no clear correlation with respect to clades, pathogenicity, or growth rate, with a few notable exceptions. Within the M. tuberculosis and the M. chelonae clades, there was a high degree of correlation, with species in each clade having similar sets of IS element types (Figs. 3 and S4, and Supplementary information). This suggests a low rate of gain of new IS elements and a low rate of loss of existing ones within these clades compared the to other mycobacteria.
Taken together, the data emphasizes the possible impact of IS elements on the evolution of the Mycobacterium genus (see also the Discussion).
Comparison of the presence of tRNA and ncRNA among mycobacteria
tRNA and non-coding RNA have key roles in the expression of genes and their regulation. tRNA genes also act as targets in integrating foreign DNA, leading to the establishment of e.g., pathogenicity, and metabolic and resistance islands [35, 36] and Refs therein. Among bacteria, including some mycobacteria, tRNA genes have been horizontally transferred by phages [24]. Together this indicates that tRNA and ncRNA have indeed contributed to bacterial evolution. Hence, we mapped the presence of tRNA and ncRNA genes among mycobacteria (see Methods). Since aminoacyl-tRNA synthetases (AARS) are closely linked to tRNA we also surveyed for the presence of AARS genes. Below we will discuss tRNA and AARS, then identified ncRNAs.
Variation of the number of tRNA genes among mycobacteria
Except for tRNAMet(CAT) and tRNACys(GCA), which are present in multiple copies in most mycobacteria, the remaining tRNA isoacceptors are generally present as single-gene copies in the mycobacterial genomes (Fig. 4). On average, mycobacteria are equipped with 49 tRNAs genes (Table S8a, b; range 41–87). Certain mycobacteria, however, have higher numbers and 16 mycobacteria carry more than 17 tRNA genes higher than the average number (including multiple gene copies of several tRNA isoacceptors). These mycobacteria belong to RGM and some harbor a large fraction of phage derived genes in their genomes (see above and the Discussion). Members of four clades encoded significantly more tRNA genes than average, with M. mageritense clade members carrying the highest numbers (65.8; p = 0.000183). Others belong to the M. fortuitum, M. mucogenicum and M. chelonae clades [24, 26]. The genomic locations of tRNA genes in 47 mycobacteria (for which complete genomes are available) were compared with M. chelonae as reference, which belongs to the earliest mycobacterial lineage (Figs. 2 and S2a). Our comparison revealed differences in the chromosomal locations of the tRNA genes. However, we also noted similarities when comparing M. chelonae and M. avium clade members (Fig. S5a). Interestingly, the positioning of tRNA genes [tRNAIle(GAT), tRNAAla(TGC) and tRNALeu(CAG)] close to dnaA and origin of replication, oriC see also [24, 27], appears to be conserved among these mycobacteria. The tRNAIle(GAT) and tRNAAla(TGC) are likely to be transcribed together and they are among the tRNAs identified to be necessary for optimal growth of M. tuberculosis H37Rv [37].
Among mycobacteria, the tRNA isoacceptor genes, tRNAAla(AGC), tRNAArg(TCG), tRNAArg(GCG), tRNAHis(ATG), tRNAIle(TAT), tRNASer(ACT) and tRNAThr(AGT) are rare (Fig. 5; Table S8a). The tRNAAla(AGC), tRNAArg(GCG), tRNAHis(ATG) and tRNAThr(AGT) genes were only detected as single copies in M. kansasiiatcc12478, M. salmoniphilumdsm43276, M. parakoreenseDSM45576 and M. fortuitumdsm44220, respectively. All mycobacteria carried tRNAIle(CAT), which allows reading of AUA as a result of modifying the C in the anticodon to 2-lysyl-cytidine. This modification is catalyzed by TilS [38], and tilS homologs were predicted in almost all mycobacteria. (When no homolog could be identified, was likely due to draft genome status; Fig. 5 and Table S8b). For 12 mycobacteria, the rare tRNAIle(TAT) was predicted to be present, in addition to tRNAIle(CAT). Hence, there are two ways to read AUA in these mycobacteria see also [24]. We also predicted one UAG tRNA suppressor gene in M. minnesotenseDSM45633, as well as the presence of a selenocysteine tRNASeC(TCA) gene in 103 mycobacteria (Fig. 4; Table S8b).
Taken together, our data suggest a high variability in the number of tRNA genes among mycobacteria (see the Discussion).
Mycobacteria lack AsnRS and GlnRS genes
Aminoacyl-tRNA synthetases (AARS) are responsible for attaching the correct amino acid to the respective tRNA. As previous studies indicate [24], all AARS were predicted to be present in all mycobacteria with the exception of AsnRS and GlnRS (Fig. 5; Table S9). In the absence of AsnRS and GlnRS charging is accommodated through the tRNA-dependent amidotransferase pathway (Adt) where the GatCAB enzymes are essential [39]. Indeed, gatCAB homologs were universally present among mycobacteria (Fig. 5; Table S9a) suggesting that the Adt pathway operates and is a common characteristic for the Mycobacterium genus. For prediction of aminoacyl-tRNA-synthetase paralogs see Supplementary information.
Variation of non-coding RNA genes among mycobacteria
To provide insight into the presence and role of non-coding RNAs (ncRNAs) among mycobacteria, we used the Rfam database (see Methods). In addition, we searched for ncRNA homologous to those of M. tuberculosis ncRNAs [40] (Figs. 6 and S6). The average mycobacterial genome encodes 39.2 ncRNAs of which 12 ncRNAs were predicted to be universally present among mycobacteria. The M. gordonae (53.6; p = 0.000053), M. kansasii (54.3; p = 0.0074), M. ulcerans (52.6; p = 0.00035) and M. interjectum (49.5; p = 0.0102) clades were predicted to have significantly more ncRNA genes than the genus average number and other clade members to have fewer (Fig. 6; Table S10a); the M. sphagni (27.3; p = 0.00061) and M. haemophilum (28.8; p = 0.033) clades both have < 30 detected ncRNA genes (with M. leprae having 24). Pathogens and opportunistic pathogens have significantly more (42.4; p = 0.00090; and 42.2, p = 5.79 × 10− 6, respectively) ncRNAs than non-pathogens (35.4), and SGM have significantly higher number of ncRNA genes (41.5; p = 0.00062) than RGM (37.2). The highest number of ncRNAs predicted was 73 in the SGM M. tusciae (Table S10).
Among core ncRNAs, ribonuclease P RNA (RPR; rnpB [45]), transfer-messenger RNA (tmRNA; ssrA [46]), signal recognition particle RNA (4.5S RNA), Ms1 RNA and 6C RNA were identified. (Notably, 6C RNA was not detected in M. leprae or M. lepromatosis). Moreover, we noted the presence of an anti-sense RNA targeting the fatty acid desaturase, desA2 (ASdes) [47] among core ncRNAs, as well as several cis-regulatory riboswitches [48] including mraW (putative regulator of peptidoglycan synthesis), ydaO-yuaA (which binds the signal molecule cyclic di-AMP; see below) and Actino-pnp (located in the 5′ UTR, untranslated region, of pnp, polynucleotide phosphorylase). These core ncRNA genes were present in single copies in all mycobacteria with few exceptions as discussed below. For some mycobacteria we detected multiple copies for other universally conserved riboswitches: the Glycine riboswitch (regulates glycine degradation in response to glycine) and SAMIV (responsive to the concentration of S-adenosyl methionine). Most have more than one copy of the TPP riboswitch (responsive to the level of thiamine pyrophosphate, i.e., the active form of vitamin B1 [49]). Together, this indicates the importance of regulating these targets/processes in mycobacteria.
Several other identified ncRNAs were confined to a (usually small) subset of mycobacteria, e.g., the horizontally transferred GOLLD RNA [24, 50] (see the Discussion). The M. tuberculosis ncRNAs predicted by Wang et al. [40] were also detected in some mycobacteria outside of the M. tuberculosis complex (Fig. S6; note that some of these ncRNAs overlaps with the Rfam annotated ncRNAs).
To summarize, a core set of 12 ncRNAs are present in almost all (> 99%) mycobacteria. Another set of five ncRNAs are present in many (> 83%), while eight additional ncRNAs are present in 54–75% of the mycobacterial genomes. Together, this provides information about conserved ncRNA that are involved in the growth, survival and stress tolerance of mycobacteria. We also noted that mapping the core ncRNA genes on the 47 mycobacterial (complete) genomes - using M. chelonae as reference - suggested differences in their locations, which was in keeping with what we observed for tRNA genes (Fig. S5b; see above).
Interestingly, for some mycobacteria more than one gene copy of RPR, tmRNA, Ms1 RNA, 6C RNA, ydaO-yuaA riboswitch and the ALIL RNA pseudoknot was annotated. Three RPR homologs were annotated in M. austroafricanum and M. sp. YC-RL4 [51] (Fig. 6; Table S10b). One of the identified RPR homologs folds into a regular bacterial RPR structure. For the two additional copies, the overall similarity was low for MAUSTDSM44191_20 while deeper analysis of the second copy, MAUSTDSM44191_22, revealed structural differences, in particular in the specificity domain [45] and Refs therein. This raises questions about the function of the two extra RPRs in these two mycobacteria albeit conserved residues important for RPR function are present [45] (Fig. S7). In this context, the gene positioned between the extra annotated RPR genes is annotated to encode an endonuclease, which might indicate that the sequences encompassing these two additional copies have been acquired through HGT.
Four mycobacteria, M. austroafricanum, M. canariasense, M. conceptionense and M. dioxanotrophicus, harbor extra homologs of tmRNA. Analysis of the extra tmRNA genes revealed that these possibly have been acquired through horizontal gene transfer. For M. austroafricanum and M. canariasense our data suggested that their extra tmRNA gene originated from a Mycobacterium phage, Nappy, while in the case of M. conceptionense and M. dioxanotrophicus the origin appeared to be Tsukamurella tyrosinosolvens strain PH-06 (Fig. S8). In keeping with that these have been horizontally transferred is that translation of these extra tmRNAs would result in proteolysis tags different from the tag generated by the regular mycobacterial tmRNA (Fig. S8). It remains to be seen whether these extra tmRNAs are functional and, if so, how they affect the distribution of polypeptides carrying different tags and the subsequent degradation of these polypeptides (see the Discussion).
Three mycobacteria, M. nebraskense (two strains; see also Ref [24]) and M. celatum, carry an additional copy of Ms1 RNA, which is suggested to function as a mycobacterial 6S RNA variant [24, 52, 53] (Fig. S9; 6S RNA in e.g. Escherichia coli is involved in the regulation of stationary phase genes [54]).
For the Actinobacteria specific 6C RNA, we identified more than one copy in several mycobacteria belonging to the M. flavescens, M. gadium and M. sediminis clades where M. tusciae strain JS617 and M. yunnanensis strain DSM 44838 carry three extra 6C RNA genes (Figs. 6 and S10; Table S10). The secondary structures (not shown) of these extra 6C RNAs are similar to the regular 6C RNA [24] and the genes were annotated to be positioned on the chromosomes as well as on plasmids (Fig. S10). Interestingly, in mycobacteria the regular 6C RNA gene is closely linked to the Ms1 RNA gene (generally one gene in between; Fig. S10) but this is not the case for the extra 6C RNA genes.
We also predicted the presence of an extra ydaO-yuaA riboswitch upstream of the hypothetical gene MKAN_24085 in M. kansasii. This riboswitch binds the signal molecule c-di-AMP and thereby influences regulation of rpfA (resuscitation-promoting factor A [48]). No extra copy is present in neighboring species such as M. persicum and M. gastri. Hence, the extra ydaO-yuaA has been acquired after M. kansasii diverged from M. persicum and M. gastri (Figs. 2 and S11).
In bacteria, translational frameshifting is induced by the ALIL RNA pseudoknot motif and it was originally identified in association with transposable elements belonging to the IS3 family. The frameshifting event results in a polypeptide having both transposase and integrase core domains, which eventually leads to transposition [46, 55]. For 113 of the mycobacterial genomes, we predicted the presence of ALIL motifs with the gene synteny “Transposase – ALIL – Integrase Core Domain” in 75 of these mycobacteria while a different gene synteny was observed in 17 genomes (with ALIL not “in between” but at one end or with completely different gene synteny; Fig. 6 and not shown). We did not detect any ALIL motif in 21 of the selected 113 genomes, which is likely due to their draft genome status. Moreover, the number of ALIL RNA motifs varies ranging between zero and twenty: M. avium strain 104 has twenty and the M. phlei strain CCUG21000 has twelve. A comparison of the complete and draft (Illumina generated; unpublished) M. phleiccug21000 genomes predicted only one ALIL motif was predicted for the draft genome. This supports the notion that we did not detect any ALIL motifs in 21 draft genomes and might indicate that ALIL motifs in mycobacteria are underestimated when only draft genomes are available (see above). Nevertheless, taken together this indicates the importance of ALIL RNA elements among mycobacteria and the possible impact on the diversification of some mycobacteria (with respect to IS elements see above).