Microsatellite polymorphism across the M. tuberculosis and M. bovis genomes: Implications on genome evolution and plasticity

Background Microsatellites are the tandem repeats of nucleotide motifs of size 1–6 bp observed in all known genomes. These repeats show length polymorphism characterized by either insertion or deletion (indels) of the repeat units, which in and around the coding regions affect transcription and translation of genes. Results Systematic comparison of all the equivalent microsatellites in the coding regions of the three mycobacterial genomes, viz. Mycobacterium tuberculosis H37Rv, Mycobacterium tuberculosis CDC1551 and Mycobacterium bovis, revealed for the first time the presence of several polymorphic microsatellites. The coding regions affected by frame-shifts owing to microsatellite indels have undergone changes indicative of gene fission/fusion, premature termination and length variation. Interestingly, the genes affected by frame-shift mutations code for membrane proteins, transporters, PPE, PE_PGRS, cell-wall synthesis proteins and hypothetical proteins. Conclusion This study has revealed the role of microsatellite indel mutations in imparting novel functions and a certain degree of plasticity to the mycobacterial genomes. There seems to be some correlation between microsatellite polymorphism and the variations in virulence, host-pathogen interactions mediated by surface antigen variations, and adaptation of the pathogens. Several of the polymorphic microsatellites reported in this study can be tested for their polymorphic nature by screening clinical isolates and various mycobacterial strains, for establishing correlations between microsatellite polymorphism and the phenotypic variations among these pathogens.


Background
Microsatellites, also known as simple sequence repeats, are short nucleotide segments comprising tandem repeating motifs of length 1-6 bp [1]. They are present in all genomes known to date [2][3][4], and are known to be polymorphic [5], characterized by high rates of indels of repeat units [1]. Microsatellites provide a framework for crucial genetic rearrangements with their reversible frameshift mutations that can confer a certain degree of selective advantage on pathogenic bacteria. Microsatellite mutations are known to affect expression levels [6], switching on/off of genes [6] and even alteration of gene functions [7]. The primary cause of microsatellite polymorphism is thought to be strand slippage during DNA replication [8]. Usually errors owing to strand slippage are repaired by a three-enzyme system comprising the enzymes mutL, mutS and mutH. However, some genomes, such as those of the mycobacterial species, lack these enzymes [9]. Hence, such genomes serve as interesting systems to investigate the rates of mutations in microsatellites and the existence of regulatory mechanisms that govern microsatellite mutations. Furthermore, these genomes present challenging and exciting systems to understand the role of microsatellite mutations in conferring genome plasticity, and in aiding the pathogens in their adaptation and evolution.
Previous reports on genomic changes in M. tuberculosis were mainly concerned with single nucleotide polymorphisms (SNPs) and large-sequence polymorphisms (LSPs) (>10 bp) [10]. While the involvement of SNPs in drug resistance has been shown [11], most of the LSPs are thought to be deleterious [12]. In the present study, we show for the first time that the coding regions of the three genomes of mycobacteria (M. tuberculosis H37Rv [13], M. tuberculosis CDC1551 [10] and M. bovis [14]) harbor a number of polymorphic microsatellite loci associated with remarkable changes in the coding regions.

Results and discussion
All three mycobacterial genomes, M. tuberculosis H37Rv (MTH), M. tuberculosis CDC1551 (MTC) and M. bovis (MB), harbor about a million microsatellite tracts each, comprising mono- to hexanucleotide repeats (Sreenu, Pankaj Kumar, Nagaraju and Nagarajaram, manuscript communicated). Systematic comparison of all the equivalent microsatellites, and of the equivalent coding regions harboring them, in all three genomes revealed several examples of microsatellites exhibiting length polymorphism characterized by indels of the repeat units. Frame-shifts in the coding regions owing to indels in microsatellites were also observed. While some frame-shifts caused ORFs to split (fission) (see Methods), others seemed to bring about fusion of two adjacent ORFs (with or without overlap), giving rise to a single ORF. Our study also revealed several ORFs eliminated as a result of premature termination by stop codons, and numerous other ORFs exhibiting length changes (Fig. 1). The complete list of polymorphic microsatellites along with the ORFs in which they are present is given in Table 1 (see Additional File 1 for details of the tracts, microsatellite polymorphism and outcomes). Illustrated below are some examples of microsatellites and their polymorphic effects on the coding regions.
In the MTH genome, two ORFs annotated as gmhA (Rv0113) and gmhB (Rv0114) have been identified as sedoheptulose-7-phosphate isomerase and D-α-β-D-heptose-7-biphosphate phosphatase, respectively (the TB structural genomics consortium [15]). These enzymes are known to be involved in the biosynthesis pathway of nucleotide-activated glycero-manno-heptose precursors of bacterial glycoproteins and cell surface polysaccharides [16]. Our study indicates that the ORF Rv0113 annotated as gmhA harbors the microsatellite (T)4 in MTH, while it is expanded to (T)5 in the MTC genome. This expansion has resulted in a frame-shift owing to which the reading frame extends and fuses with that of gmhB, thus giving rise to a fused ORF. Although it is hard to speculate on the possible roles of the gmhA-gmhB fused protein in MTC, there is a high probability that it forms a bi-functional protein with two domains.
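The fusion mechanism described above can be illustrated with a minimal sketch. The sequences below are invented toy ORFs, not the real gmhA/gmhB sequences, and the helper name `orf_end` is hypothetical. The sketch only shows how a single-base expansion of a (T)4 tract to (T)5 can shift the reading frame past the original stop codon and into the frame of a downstream ORF.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_end(seq, start):
    """Return the index just past the first in-frame stop codon, or None."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i + 3
    return None


# Toy locus (assumption): ORF1 with a (T)4 tract, a 2-nt spacer, then ORF2.
orf1 = "ATGGCATTTTCAGGCTGA"      # ATG GCA TTT TCA GGC TGA
spacer = "CA"
orf2 = "ATGAAACCGGATTAG"         # ATG AAA CCG GAT TAG
wild_type = orf1 + spacer + orf2

# Expand the (T)4 tract, which starts at index 6, to (T)5.
tract_start = 6
expanded = wild_type[:tract_start] + "T" + wild_type[tract_start:]

print(orf_end(wild_type, 0))     # end of ORF1 in the unexpanded sequence
print(orf_end(wild_type, 20))    # end of ORF2 in the unexpanded sequence
print(orf_end(expanded, 0))      # end of the single read-through ORF
```

In this toy case the downstream ORF starts at a position that is one base out of frame with the upstream ORF, so a +1 insertion brings the two into the same frame. Whether a real expansion produces a fusion depends on the actual spacing and on stop codons in the shifted frame.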
Similarly, two adjacent ORFs, viz. Rv0192A and Rv0192, in the MTH genome are observed to have fused into a single ORF (Mb0198) in the MB genome, owing to a frameshift caused by the expansion of the microsatellite (G)4 to (G)5. Previous PhoA fusion screening studies have shown Rv0192A in MTH to act as a signal peptide [17], and in light of this it is reasonable to speculate that the fused gene product in MB is a secretory protein that may act as a surface antigen.
The ORF MT1966 in MTC, encoding a functional isocitrate lyase [18], is observed to have split into two ORFs (Rv1915 and Rv1916) in MTH due to a single nucleotide deletion in the mononucleotide tract (T)5. The failure of these two ORFs to complement isocitrate lyase activity in MTH has been demonstrated [19]. Immunoblotting studies were unable to detect AceAa or AceAb products [18]. Subsequent studies by Betts and co-workers (2002) enabled detection of only the mRNA of AceAa, indicating the lack of expression of AceAb [20]. It is interesting to note that both the MTC and MTH genomes possess another copy of isocitrate lyase. This indicates the existence of two functional copies of the enzyme in MTC, and only a single copy in MTH. In MTC the activity of isocitrate lyase increases during the latent phase when the pathogen utilizes lipid as the energy source [21]. Redundancy in isocitrate lyase in MTC can therefore be beneficial to the pathogen, providing a greater chance of its survival in the host cell debris where lipid is used as a carbon source. However, in MTH, which is cultured under laboratory conditions with no dependence on lipids as the carbon source, the duplication of the isocitrate lyase enzyme is not required. Therefore, the removal of one copy of the enzyme in MTH may not pose a constraint for the growth of the pathogen.

Figure 1. Schematic representation of the various changes observed in the coding regions (green arrows) affected by microsatellite indel mutations. In this illustration a hypothetical microsatellite tract (AT)5 has been shown to undergo an indel of one repeat unit causing fission/fusion, premature termination and length variation of ORFs. The bi-directional arrows (black) indicate the reversible nature of the microsatellite mutations.

Table 1. List of polymorphic microsatellites found in the coding regions of the three genomes, M. tuberculosis H37Rv (MTH), M. tuberculosis CDC1551 (MTC) and M. bovis (MB). Please note that the microsatellites in the intergenic regions are not reported here. The table lists the ORFs (given by their gene id) harboring the polymorphic microsatellites. The first column denotes the microsatellite tract and its observed mutation in the form of insertion/deletion of repeat units leading to expansion or contraction of the microsatellite. As discussed in the text, the evolutionary relationship among the three genomes is not established clearly. Therefore, we have followed a consensus approach where the observed event being a case of insertion or deletion of a repeat is decided by the number of genomes in which the repeat number is conserved (given in bold text). For example, G4↔5 denotes that two of the genomes possess the tract G4 while in the third genome it exists as G5, and therefore it is regarded as an event of insertion leading to microsatellite expansion. Accordingly, the effect (fusion/fission, premature termination, length variation) on the coding region is also displayed.
[30], ArrayExpress [32]) and from references [33][34][35][36][37]. In some entries in column 2, the * mark denotes information on known expression from different literature but not from microarray data. The expression profile data of MTC and MB are not available on the public domain databases and therefore not given in this table. # Mutation is absent and also the region has not been annotated as ORF
On comparison, the highest number (18 ORFs) of split events is observed in the MB genome (Table 1). The expression of both parts of split genes in the MB genome implies a favorable situation for versatile protein-protein interactions. However, it is to be noted that in the cases of split ORFs, the expression of the second part of the ORF is entirely dependent on the availability of regulatory signals (Shine-Dalgarno sequence) for that ORF. In the absence of a regulatory mechanism, the second part of the ORF is unexpressed. As given in Table 1, section III, the second part in all four examples has been annotated as a pseudogene because of the absence of the Shine-Dalgarno sequence. If both parts of a split ORF are expressed, the split subunits can act together [22,23] or in isolation, resulting in different protein-protein interactions that can be instrumental in the creation of alternate/new pathways, which in turn may eventually render greater adaptation mechanisms to the bacteria. This may well be one of the underlying reasons for MB to have a wider host range as compared to M. tuberculosis.
The split ORFs encode membrane proteins, transporters, PE_PGRS, cell-wall synthesis proteins and hypothetical proteins. The membrane proteins are known to play an important role in host-pathogen interactions [24]. The majority of bacteria are thought to modify their membrane protein structures in order to escape the host immune defense system and promote colonization at various places within the host [6,24]. The PE_PGRS proteins are specific to mycobacteria and are speculated to function as surface antigens [25,26]. Truncation with respect to the second part can potentially give rise to an antigenic variant.
MTC, as compared to the other genomes, exhibits a greater number of cases of premature termination (10 ORFs) (Table 1), confined to the PE_PGRS, umaA1, pks5 and some hypothetical proteins. Of these, the ORF umaA1 codes for a mycolic acid methyl transferase that modifies the lipids of the mycobacterial cell wall [27]. The umaA1 deletion mutant of MTH is observed to be more virulent than the wild-type in the severe combined immune deficiency (SCID) mouse model [28]. However, it is difficult to categorically stress the importance of umaA1 in the virulence of the pathogen. This is because MTC has been shown to be less virulent in immunocompetent mice as compared to other clinical isolates [29]. A study of a umaA1 deletion mutant of MTH in immunocompetent mice would provide clues to the role of umaA1 in virulence. In addition, it is equally possible that the other prematurely terminated ORFs are also responsible for the less virulent nature of MTC. However, such correlations require further studies.
We also observe an appreciable number of ORFs (43 examples) in all three genomes exhibiting length variations due to indels of repeat units in microsatellites. Many proteins in this category have been annotated as hypothetical proteins, PPE and mammalian cell entry (mce) family virulence proteins. While the length variation in some ORFs produces no effect on the function of the translated protein, with the functional domains being well conserved, in others drastic changes are observed. For example, Rv2732c in MTH as well as Mb2791c in MB code for a membrane anchoring protein of length 204 aa. The equivalent ORF MT2802.1 in MTC is a shorter ORF encoding only 180 aa, owing to a frame-shift caused by a single G insertion in the microsatellite tract (G)2. In silico analysis of these proteins reveals a greater probability (0.959) for the N-terminally deleted short protein in MTC to carry a signal peptide and be secreted than for its longer counterparts in MB and MTH, which possess negligible propensities of carrying signal peptides and therefore of external secretion.
Although the primary focus of this communication is on microsatellite polymorphism in the coding regions, we have also examined the upstream promoter regions of the ORFs and found polymorphic microsatellites in some of them (data not shown). It should be noted that genes are located very close to each other in a prokaryotic genome, at times without any long intergenic region between two adjacent genes. It is probable that the coding sequence of a gene may act as a regulatory sequence for its neighboring genes. In addition to bringing about changes in the coding regions, the observed microsatellite variations may therefore also influence regulation of regions downstream of coding sequences.
We have referred to the Stanford microarray database [30], Tuberculist [31], ArrayExpress [32] and available literature on microarray analysis of mycobacterium [20][33][34][35][36][37] for the expression profiles of all ORFs of MTH listed in Table 1. Almost 85% of the ORFs (indicated by * in the table) display high expression profiles, including those that have undergone fission. However, further studies are necessary to verify and complement the function of these split gene products with their cognate wild-type/unsplit proteins.
It is evident from Table 1 that microsatellites with as few as two repeats display polymorphism (i.e., indels of their repeat units). This appears to contradict earlier observations of the requirement of a microsatellite length threshold for repeat expansions or contractions due to strand slippage [38,39]. Our study therefore suggests that strand slippage does not depend on microsatellite tract length. However, one should bear in mind the possibility of random mutational events leading to the observed length variation in microsatellites. For example, the genomes of M. canetti and M. tuberculosis contain the (GGGCCGC)2 tract in the ORF that encodes pks15/1. However, the equivalent regions in the MTC and MTH genomes have a 7 bp deletion of (GGGCCGC) and in the MB genome a 6 bp deletion of (GGCCGC) [40]. Although the deletion events are independent, the resultant sequences when compared give an impression of G tract expansion. Alternatively, it can be argued that all three genomes MB, MTC and MTH may have possessed an initial 7 bp deletion (GGGCCGC) similar to M. canetti, giving rise to the microsatellite tract (G)5 that may have subsequently expanded to (G)6 in MB. It is still unclear which of the models depicts the correct picture of events for the observed microsatellite polymorphism. This is largely because of the unavailability of detailed evolutionary information on the mycobacterial pathogens. Although M. canetti is believed to be the root from which the other mycobacterial strains evolved, a clear understanding of the evolutionary relationship between M. tuberculosis and M. bovis is absent [41][42][43][44]. Owing to this, it is difficult to put forward precisely the path of microsatellite evolution, although several possibilities can be suggested.
The rate at which microsatellites mutate is much higher than that of single-base substitutions [45,46]; therefore, greater variation is expected at the polymorphic loci than in other regions of the genomes. Though mycobacterial genomes are enriched with microsatellite tracts (Sreenu, Pankaj, Nagaraju and Nagarajaram, manuscript communicated), surprisingly there is as yet no report available on microsatellite-mediated phase variation in these bacteria. The majority of microsatellite-mediated phase variations reported in pathogenic bacteria are changes in the pili [47,48], capsule [49,50] and flagella [51,52], and the mycobacteria do not possess any of these structures. According to Hallet, phase variation is "an adaptive process through which bacteria undergo frequent and reversible phenotypic changes resulting in genetic alterations in their genomes" [53]. In light of this point it is highly interesting that this work presents several polymorphic microsatellite loci that seem to have been evolutionarily 'selected' and are involved in bringing about phenotypic alterations through changes in the coding regions, namely antigenic variation, virulence and modified host-pathogen interactions, for presumably better adaptation of the pathogen.
It is tempting to speculate that some of the polymorphic microsatellites discovered in this study are those that have undergone mutations at some point of time during microbe evolution, perhaps during speciation, and thereafter remained frozen as 'molecular fossils'. If this model is correct, then such tracts can be used as markers for species/strain identification. In any case, all the loci form a good starting set to screen several isolates and strains. This would enable the study of correlations between microsatellite polymorphism and the observed phenotypic variations among different isolates and strains.
An important point to be noted in connection with microsatellite polymorphism in the mycobacterial genomes is the absence of the post-replicative DNA mismatch repair system mediated by the mutS, mutL and mutH genes [9]. Impairment of these enzymes destabilizes mono-, di- and trinucleotide repeats [54]. This probably accounts for the prevalence of mono- and dinucleotide microsatellite variations in mycobacterial genomes. Moreover, the absence of these enzymes appears advantageous to these pathogens, resulting in the generation of polymorphic microsatellites, thereby imparting a certain degree of plasticity to the genomes. However, the total number of microsatellites that exhibit polymorphism, and their significance in the context of pathogen adaptability, virulence and survival, remain to be tested.

Conclusion
The coding regions in the mycobacterial genomes, viz. M. tuberculosis H37Rv, M. tuberculosis CDC1551 and M. bovis, harbor a number of polymorphic microsatellites. The observed indel mutations in microsatellites have brought about some interesting changes in the coding regions indicative of gene fusion/fission, loss, and functional variation. From this study, it can be concluded that microsatellites form an important set of genomic elements, mutations of which are beneficial to the pathogens.

Methods
Complete genome sequences of M. tuberculosis (H37Rv and CDC1551) and M. bovis were downloaded from the NCBI ftp site [55]. Functional annotations of the coding regions were taken from the Tuberculist website [31] and the TB structural genomics consortium site [15]. The various microsatellites in the three genomes were identified using SSRF [56]. SSRF scans a given nucleotide sequence and extracts all microsatellite tracts of motif length 1-6 bp. The extracted information includes the genomic location of the tracts, repeating motifs, repeat numbers and regions (coding, non-coding or partial) in which the tracts are present. The program utilizes the GenBank annotation file "xxx.ffn" (where xxx = genome name) that has exon boundary information, using which the location of microsatellites relative to the protein coding regions is subsequently recorded. In addition, internal motif redundancy is taken care of: a sequence of the type (AAAAGCAAAAGCAAAAGC) is represented as (AAAAGC)3, with the internal "A"s of each (AAAAGC) unit not considered as a separate (A)4 tract.
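The scanning step and the handling of internal motif redundancy can be sketched as follows. This is not the SSRF implementation, whose internals and thresholds are not described here. It is a minimal illustration under stated assumptions: the names `Tract`, `is_primitive` and `find_tracts` are hypothetical, and the default minimum of two repeats is chosen only because tracts with as few as two repeats are reported in this study.

```python
from typing import List, NamedTuple


class Tract(NamedTuple):
    start: int    # 0-based start position in the sequence
    motif: str    # repeating unit, 1-6 bp
    repeats: int  # number of tandem copies

    @property
    def end(self) -> int:
        return self.start + len(self.motif) * self.repeats


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif[:k] * (n // k) == motif
                   for k in range(1, n))


def find_tracts(seq: str, min_repeats: int = 2, max_motif: int = 6) -> List[Tract]:
    """Report tandem repeats of motif length 1..max_motif (assumed thresholds)."""
    seq = seq.upper()
    found: List[Tract] = []
    for k in range(1, max_motif + 1):
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if not set(motif) <= set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            r = 1
            while seq[i + r * k:i + (r + 1) * k] == motif:
                r += 1
            if r >= min_repeats:
                found.append(Tract(i, motif, r))
                i += r * k
            else:
                i += 1
    # Internal motif redundancy: drop tracts nested inside a longer-motif tract.
    kept = [t for t in found
            if not any(len(o.motif) > len(t.motif)
                       and o.start <= t.start and t.end <= o.end
                       for o in found)]
    return sorted(kept, key=lambda t: t.start)


print(find_tracts("AAAAGCAAAAGCAAAAGC"))
```

For the example sequence from the text, the three internal (A)4 runs are found first and then discarded because each lies wholly inside the (AAAAGC)3 tract. Mapping tracts to coding or non-coding regions would additionally require the annotation coordinates, which this sketch omits.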
The ORFs harboring microsatellites of one genome were used as queries to search against the other two complete mycobacterial genome sequences using the BLASTN program (version 2.2.6) [57] without the repeat masking filter. The alignment hits that differed from the query sequences only by indels in the microsatellites were selected for further analysis. The Tuberculist database (for H37Rv and M. bovis) and the NCBI (for CDC1551) were checked to confirm that the indels in microsatellites, especially those of the mononucleotide tracts, were indeed authentic mutations and not the results of sequencing errors (however, one cannot rule out some remote possibility of sequencing artifacts). Subsequently, the ORFs and their equivalent sequences were realigned using CLUSTALW [58] to reconfirm the alignment as well as the indels in the microsatellites. As the phylogenetic relation of these genomes is still ambiguous, a consensus of the three genomes was used for categorization of the microsatellite effects into gene fusion/fission, premature termination and ORF length variation.
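The consensus rule can be written down as a short sketch: the repeat number shared by two of the three genomes is taken as the reference state, and the third genome is scored as carrying an insertion or a deletion. The function names, the genome labels and the repeat numbers used below are illustrative assumptions, not values taken from Table 1.

```python
from collections import Counter
from typing import Dict, Tuple


def classify_indel(repeats: Dict[str, int]) -> Tuple[int, str, str]:
    """Return (consensus repeat number, variant genome, event) for one locus.

    `repeats` maps a genome label to the repeat number of the equivalent
    microsatellite tract in that genome; exactly three genomes are expected.
    """
    if len(repeats) != 3:
        raise ValueError("expected repeat numbers for exactly three genomes")
    consensus, support = Counter(repeats.values()).most_common(1)[0]
    if support != 2:
        raise ValueError("locus is conserved or has no two-genome consensus")
    variant = next(g for g, n in repeats.items() if n != consensus)
    event = "insertion" if repeats[variant] > consensus else "deletion"
    return consensus, variant, event


def causes_frameshift(motif_length: int, repeat_change: int) -> bool:
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return (motif_length * repeat_change) % 3 != 0


# Illustrative locus in the G4<->5 notation: two genomes carry (G)4, one (G)5.
print(classify_indel({"genome_A": 4, "genome_B": 4, "genome_C": 5}))
print(causes_frameshift(motif_length=1, repeat_change=1))
```

Loci where all three genomes differ are rejected rather than guessed at, since the consensus rule gives no reference state in that case.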