Simple sequence repeats and compositional bias in the bipartite Ralstonia solanacearum GMI1000 genome

Background: Ralstonia solanacearum is an important plant pathogen. The genome of R. solanacearum GMI1000 is organised into two replicons (a 3.7-Mb chromosome and a 2.1-Mb megaplasmid) and this bipartite genome structure is characteristic of most R. solanacearum strains. To determine whether the megaplasmid was acquired via recent horizontal gene transfer or is part of an ancestral single chromosome, we compared the abundance, distribution and composition of simple sequence repeats (SSRs) between both replicons and also compared their compositional biases. Results: Both replicons are very similar with respect to the distribution and composition of SSRs and the presence of compositional biases. The minor variations observed in SSRs and compositional biases may be attributable to minor differences in gene expression and its regulation, or to the small sample numbers observed. Conclusions: The observed similarities indicate that both replicons have shared a similar evolutionary history and thus suggest that the megaplasmid was not recently acquired from other organisms by lateral gene transfer but is part of an ancestral R. solanacearum chromosome.


Background
The paradigm that bacterial genomes consist of a single circular chromosome is no longer valid. Linear chromosomes have been identified in Borrelia burgdorferi [1], various Streptomyces species [2,3], Agrobacterium tumefaciens [4] and various other species. In addition, it is now appreciated that the genomes of several bacterial taxa consist of multiple replicons. Most organisms with a multi- or bipartite genome structure belong to the α-Proteobacteria (including Rhodobacter sphaeroides [5,6] and various Rhizobium [7,8], Agrobacterium [4,8], Brucella [9,10] and Azospirillum [11] species) or the β-Proteobacteria. Most isolates from species belonging to the β-proteobacterial genera Burkholderia and Ralstonia harbour multiple replicons, including members of the Burkholderia cepacia complex [12][13][14][15][16], Burkholderia gladioli [15], Burkholderia pseudomallei [17], Burkholderia glumae [13], Burkholderia glathei [13], Burkholderia sp. LB400 [18], Ralstonia pickettii [13], Ralstonia eutropha [13] and Ralstonia metallidurans [18]. Multiple replicons may have arisen from the need to achieve higher overall replication rates [19]. The origin of these multiple replicons is at present unclear, but it has been suggested that they could have their origin in gene duplication followed by divergence; in this case intrachromosomal recombinational events within a duplicated region could give rise to the formation of two stable replicons [8]. In the genus Brucella these rearrangements have occurred in the region containing the ribosomal RNA genes [10], but in theory the rearrangements can occur at any repeated sequence [20]. An additional explanation is that the presence of multiple replicons within an organism involved horizontal DNA transfer [19,21,22]. This hypothesis was used to explain the presence of two chromosomes in Vibrio cholerae: the small chromosome was suggested to be derived from a megaplasmid captured by an ancestral Vibrio [23,24]. This megaplasmid probably acquired genes from diverse bacterial species before its capture by the ancestral Vibrio; subsequent relocation of essential genes from the chromosome to the megaplasmid completed its stable structure.
Ralstonia solanacearum is a soil-borne phytopathogen with an unusually broad host range, causing bacterial wilt on a wide range of crops, including economically important crops like potato, tomato, ginger and banana [25]. Recently the genome sequence of R. solanacearum strain GMI1000 was determined [26]. It was shown that the 5.8-Mb genome is organised into two replicons, a 3.7-Mb chromosome and a 2.1-Mb megaplasmid. This bipartite genome structure is characteristic of most R. solanacearum strains [27] and derivatives of strain GMI1000 without the megaplasmid have not been obtained [26]. The larger replicon contains all the basic genes required for survival of the bacterium; the smaller replicon carries several metabolically essential genes also present on the chromosome (including an rDNA locus, a gene coding for the α-subunit of DNA polymerase III and the gene for protein elongation factor G) but also contains several genes coding for enzymes involved in primary metabolism (including amino acid and cofactor biosynthesis) not present on the chromosome. The smaller replicon also contains all the hrp genes (required to cause disease in plants) and it has been suggested that it has a significant function in overall fitness and adaptation of the organism to various environmental conditions [26]. The origin of the bipartite genome structure of R. solanacearum is not clear. To determine whether the megaplasmid was formed through intrachromosomal recombinational events within a duplicated region or was recently acquired from other organisms, we compared the abundance, distribution and composition of simple sequence repeats between the chromosome and the megaplasmid of R. solanacearum GMI1000. We also compared the compositional bias of di- and tetranucleotides between both replicons.
Repeated DNA consists of homopolymeric tracts of a single nucleotide or of small or large numbers of multimeric classes of repeats. These multimeric repeats can be homogeneous (i.e. built from identical units), heterogeneous (i.e. built from mixed units) or built from degenerate repeat sequence motifs [28]. A special category of repeats are tandem repeats, which are made up of periodically repeated monomeric sequences of varying length, arranged in a 'head-to-tail' configuration [29]. Several mechanisms have been proposed for the creation of tandem repeats, including 'slipped strand mispairing', in which illegitimate base-pairing during replication gives rise to addition of repeat units [30,31]. There is growing evidence that small tandem repeats (also called simple sequence repeats or SSRs) affect gene expression. A first effect of SSRs is the mediation of phase variation through the loss or gain of one or more repeats [29]. Phase variation is the process by which many bacterial species undergo reversible phenotypic changes resulting from genetic alterations in certain loci [32,33]. SSRs can also be involved in gene regulation by affecting spacing between flanking regions [34] or spacing between the -35 and -10 promoter regions [35]. Variation in abundance, distribution and composition of SSRs has been described [28] and it has been proposed that variation in SSRs results in variation in gene expression and key phenotypes and hence provides an important target for natural selection and evolution [28,36].
The comparison of genome-wide compositional biases as a tool to study bacterial evolution has been introduced by Karlin and co-workers [37][38][39]. It is thought that dinucleotide relative abundance values are constant within a genome because the factors that work on them are constant throughout the genome; and it has been postulated that the set of dinucleotide relative abundance values constitute a genomic signature that reflects the pressures of these factors [38]. Differences in genome signature between different organisms can be attributed to differences in context-dependent mutation rates generated by the replication-repair system and differences in efficiency of the replication machinery on different sequences. In addition, many DNA structural properties (including curvature, flexibility and helix stability), which may play an important role in biological processes like replication, are determined by dinucleotide arrangements [38,40]. Tetranucleotide relative abundances are also characteristic for a given genome [39]. It has been postulated that frequent tetranucleotides may include parts of repetitive structural, regulatory and transposable elements, while low values for some palindromic tetranucleotides have been attributed to restriction avoidance [39].

Distribution and composition of SSRs in the R. solanacearum genome
A total of 221729 SSRs with a motif length between 1 and 10 bp and a minimum of three repeats were found in the entire R. solanacearum genome. Of those, 139993 (63.14%) were located on the chromosome (Table 1) and 81736 (36.86%) were located on the megaplasmid (Table 2). This corresponds well with the size distribution between both replicons (63.96% of all bases are in the chromosome, 36.04% are in the megaplasmid). The SSRs were evenly distributed over both the chromosome and the megaplasmid (Fig. 1). The total number of repeats is lower than expected by chance; in particular, the number of mononucleotide repeats is significantly lower than expected (Tables 1 and 2). Trinucleotide repeats occur more often than expected by chance alone, both in the chromosome and the megaplasmid (Tables 1 and 2). Mononucleotide repeats with a length of 3 bp and dinucleotide repeats are distributed over coding and non-coding regions as expected, both in the chromosome and the megaplasmid. As mononucleotide repeats become larger, the deviation from the expected distribution increases; these larger mononucleotide repeats are almost exclusively located in non-coding regions. Our data also show that trinucleotides are overrepresented in protein-coding regions of both replicons (Table 3). The nucleotide composition of the SSR tracts in the R. solanacearum chromosome and megaplasmid is shown in Tables 4 and 5, respectively. Our data show that (i) the G+C composition of mononucleotide repeats in both replicons is significantly lower than the overall composition, but this difference can exclusively be attributed to non-coding regions; (ii) G and C mononucleotide repeats are underrepresented in coding and non-coding regions of both replicons and (iii) CG and GC dinucleotide repeats are vastly overrepresented both in coding and non-coding regions of both replicons, while other dinucleotide repeats are underrepresented.

Compositional biases in the R. solanacearum genome
Dinucleotide relative abundances are shown in Table 6. The dinucleotides TA and AT are strongly underrepresented in both replicons while GC is moderately overrepresented in both replicons. CC and GG are moderately underrepresented in the chromosome. The average absolute dinucleotide relative abundance difference (δ*) between both replicons is 9.78. To assess the variability of dinucleotide relative abundances within a replicon, the chromosome and the megaplasmid were divided into 12 and 7 equally-sized, non-overlapping fragments, respectively, and $\rho^*_{XY}$ values were calculated for each fragment. δ*(f,g) values within replicons were calculated from these fragments. These within-replicon differences are not significantly smaller than the between-replicon differences (data not shown). Significantly over- or underrepresented tetranucleotides are shown in Table 7. CTAG, AATT, CATG, GATA and TATA are underrepresented in both replicons. GTAG and TTAA are overrepresented in both replicons.
[Table legend: +, significantly overrepresented compared to mean frequencies in computer-generated randomised genomes (P < 0.001); -, significantly underrepresented compared to mean frequencies in computer-generated randomised genomes (P < 0.001).]

Discussion
To study the origin of the bipartite genome structure of R. solanacearum GMI1000 we compared the abundance, distribution and composition of simple sequence repeats and differences in compositional biases between the chromosome and the megaplasmid of R. solanacearum GMI1000.

Occurrence of simple sequence repeats
Our data clearly show that the R. solanacearum genome contains numerous SSRs with a motif length between 1 and 10 bp, although not as many as expected by chance alone. Mutations in SSRs are thought to be the result of slipped strand mispairing during DNA replication; slipped strand mispairing can occur because the tertiary structure of SSRs allows mismatching, and repeats can be inserted or excised during DNA duplication [41][42][43]. The observation of upper limits for SSR length in Escherichia coli suggested that the tendency for repeat length to arise via mutation is counteracted by selection [36]. We observed similar upper limits: the upper limit for the total length of mononucleotide SSRs is 13 bp for the chromosome and 11 bp for the megaplasmid and, in addition, very few other SSRs with a total length >15 bp (for the chromosome) or >18 bp (for the megaplasmid) are observed. Both strand separation and slippage are more likely for mononucleotide SSRs, explaining why mononucleotide SSRs are more likely to undergo slipped strand mispairing; longer SSRs with a lower repeat number have less opportunity to undergo slipped strand mispairing and there will be less mutability in their repeat number [36]. This may explain why larger mononucleotide SSRs are overrepresented in non-coding regions of the R. solanacearum genome, as selection has ample opportunity to operate against these larger repeats that cause frameshift and nonsense mutations in coding regions. This hypothesis is supported by the fact that poly(A) and poly(T) SSRs are overrepresented, especially in the non-coding regions, in both replicons (Tables 4 and 5): strand separation for these poly(A) and poly(T) tracts is considerably easier than for poly(G) or poly(C) tracts, increasing the possibility of slipped strand mispairing.

Compositional biases
The dinucleotide TA is underrepresented in both replicons. TA is underrepresented in almost all prokaryotic genomes; this could be due to the fact that (i) TA forms the thermodynamically least stable DNA (allowing unwinding of the helix), (ii) RNases preferentially degrade UA dinucleotides in mRNA, and/or (iii) TA is part of many regulatory sequences [38]. AT is significantly underrepresented in the R. solanacearum genome but is overrepresented in the genome of most α-Proteobacteria and in the genomes of the β-proteobacterial species R. eutropha and Bordetella pertussis [39]. CC and GG are slightly underrepresented in the chromosome but not in the megaplasmid, although the differences in relative abundances are small (Table 6). The dinucleotide GC is overrepresented in both replicons; this is also the case in most other β-Proteobacteria and γ-Proteobacteria [39]. In general, within-species δ*(f,g) differences among non-overlapping 50 kb contigs of bacteria are in the range 18-43 [39] and genome signatures of chromosomes and plasmids from the same host are at least weakly similar to each other [δ*(f,g) < 115] [44,45]. δ*(f,g) values reported for the multiple chromosomes of A. tumefaciens, Deinococcus radiodurans, V. cholerae and B. melitensis were between 27.0 and 30.8 [45]. A comparison of both R. solanacearum replicons based on dinucleotide relative abundances indicates that they are very similar, with δ*(f,g) = 9.78. A comparison of δ*(f,g) values within and between replicons revealed that the variability in δ*(f,g) within a replicon is not significantly smaller than the difference in δ*(f,g) between both replicons. CTAG is significantly underrepresented in the R. solanacearum genome, as it is in most proteobacterial organisms. Possible reasons for the underrepresentation of this tetranucleotide include structural defects or special functional roles associated with CTAG [38]. AATT, CATG, GATA and TATA are underrepresented in both replicons while GTAG and TTAA are overrepresented. ATTG, CATC and TTGG occur slightly less often than expected in the megaplasmid but their relative abundance in the chromosome is in the normal range. The general mechanisms underlying tetranucleotide extremes are unclear but, besides the above-mentioned structural defects or functional roles associated with specific tetranucleotides, it has been suggested that restriction avoidance may play an important role in the maintenance of tetranucleotide extremes [39].
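The tetranucleotide relative abundance behind these over- and underrepresentation calls can be sketched in Python. This is an illustration only, not the compseq-based workflow used in the study. It assumes the Karlin form of τ*, in which the denominator contains the four three-position terms XYZ, XYNW, XNZW and YZW, and it pools counts over both strands instead of concatenating the sequence with its inverted complement. The function names `gapped_freq` and `tau_star` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the inverted complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def gapped_freq(strands, pattern):
    """Frequency of a pattern such as 'GNAG' ('N' matches any base),
    pooled over all windows of both strands."""
    k = len(pattern)
    fixed = [(i, c) for i, c in enumerate(pattern) if c != "N"]
    hits = windows = 0
    for strand in strands:
        for start in range(len(strand) - k + 1):
            windows += 1
            if all(strand[start + i] == c for i, c in fixed):
                hits += 1
    return hits / windows


def tau_star(seq, tetra):
    """Tetranucleotide relative abundance tau* for one word XYZW.

    Assumes every pattern in the denominator occurs at least once;
    otherwise a ZeroDivisionError is raised.
    """
    x, y, z, w = tetra
    strands = (seq.upper(), reverse_complement(seq.upper()))

    def f(pattern):
        return gapped_freq(strands, pattern)

    numerator = (f(x + y + z + w) * f(x + y) * f(x + "N" + z)
                 * f(x + "NN" + w) * f(y + z) * f(y + "N" + w) * f(z + w))
    denominator = (f(x + y + z) * f(x + y + "N" + w) * f(x + "N" + z + w)
                   * f(y + z + w) * f(x) * f(y) * f(z) * f(w))
    return numerator / denominator
```

Because counts are pooled over both strands, a word and its inverted complement (for example GTAG and CTAC) receive the same τ* value up to floating-point rounding. The sketch makes fifteen passes over the sequence per word, so a genome-scale analysis would precompute the counts once.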

Conclusions
It can be concluded that both replicons that constitute the R. solanacearum genome are very similar with respect to the distribution and composition of SSRs and the presence of compositional biases, although minor differences between both replicons are present. The megaplasmid carries the hrp genes required to cause disease in plants, genes coding for constituents of the flagellum and genes involved in exopolysaccharide production; it also contains 315 genes of unknown function [26]. The minor variations in SSRs and compositional biases observed between both replicons may therefore be attributable to minor differences in gene expression and its regulation between both replicons. Alternatively, some of the observed differences may be the result of the small sample numbers observed (for example, the minor differences in tetranucleotide SSR distribution over coding and non-coding regions in both replicons [Table 3]). At present no completely sequenced and fully annotated genomes of other β-Proteobacteria with multiple replicons are available for comparison and therefore it is difficult to place the observed differences in a broader perspective. Nevertheless, the observed similarities in SSRs and compositional biases indicate that both replicons have shared a similar evolutionary history and suggest that the megaplasmid was not recently acquired from other organisms by lateral gene transfer but is part of an ancestral R. solanacearum chromosome. Alternatively, the hypothesis of an ancient acquisition by lateral gene transfer followed by a long coevolution with the chromosome cannot be completely ruled out.

DNA sequences
The sequences of the chromosome (AL646052) and the megaplasmid (AL646053) of R. solanacearum strain GMI1000 were downloaded from the GenBank database.

Analysis of SSRs
We used the software developed by Gur-Arie et al. [36] to screen the entire genome of R. solanacearum for SSRs with a motif length between 1 and 10 bp and a minimum of three repeats. This software can be downloaded from ftp://ftp.technion.ac.il/supported/biotech/ssr.exe and reports motif, motif length, repeat number and genomic location of all SSRs. To determine whether the observed SSR frequencies of a given motif length and repeat number occurred as expected by chance, they were compared with the mean frequencies observed in three randomly shuffled genomes. Randomised sequences were generated with shuffleseq (part of the EMBOSS package, http://www.hgmp.mrc.ac.uk/software/EMBOSS). Statistical significance was tested with two-tailed t-tests using SPSS 11.0.1 (SPSS). To determine the distribution of SSRs between coding and non-coding regions of the genome, all coding regions were extracted from the sequence using Artemis 4.0 [46] and parsed into a new sequence file using seqret (EMBOSS).
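The screening and randomisation steps can be illustrated with a short Python sketch. It is not the ssr.exe algorithm of Gur-Arie et al. [36], whose internals are not described here. It assumes that an SSR is a perfect tandem repeat, that a motif which is itself a repeat of a shorter unit is counted only under the shorter unit, and that Python's `random.shuffle` is an acceptable stand-in for shuffleseq (both preserve base composition). All function names are hypothetical.

```python
import random
from collections import Counter


def is_primitive(motif):
    """True if the motif is not itself built from a shorter repeated unit."""
    n = len(motif)
    return not any(n % d == 0 and motif[:d] * (n // d) == motif
                   for d in range(1, n))


def find_ssrs(seq, max_motif=10, min_repeats=3):
    """Return sorted (start, motif, copies) tuples for perfect tandem repeats."""
    seq = seq.upper()
    n = len(seq)
    hits = []
    for k in range(1, max_motif + 1):
        i = 0
        while i + k <= n:
            # Extend the tract while each base equals the base k positions back.
            j = i + k
            while j < n and seq[j] == seq[j - k]:
                j += 1
            motif = seq[i:i + k]
            copies = (j - i) // k
            if (copies >= min_repeats and is_primitive(motif)
                    and set(motif) <= set("ACGT")):
                hits.append((i, motif, copies))
                i = j - k + 1  # first window that reaches beyond this tract
            else:
                i += 1
    return sorted(hits)


def counts_by_motif_length(seq):
    """Number of SSRs found for each motif length."""
    return Counter(len(motif) for _, motif, _ in find_ssrs(seq))


def mean_shuffled_counts(seq, n_shuffles=3, seed=0):
    """Mean SSR counts per motif length over randomly shuffled copies."""
    rng = random.Random(seed)
    total = Counter()
    for _ in range(n_shuffles):
        bases = list(seq)
        rng.shuffle(bases)  # preserves mononucleotide composition
        total.update(counts_by_motif_length("".join(bases)))
    return {k: v / n_shuffles for k, v in total.items()}
```

For the toy sequence `TTGACACACAGGGGGTCATCATCATG`, `find_ssrs` reports an (AC)3 tract at position 3, a (G)5 tract at position 10 and a (TCA)3 tract at position 15 (0-based). Observed counts from `counts_by_motif_length` can then be set against `mean_shuffled_counts` in the same way the observed frequencies were compared with those of the shuffled genomes.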

Analysis of compositional bias
We determined the compositional bias in di- and tetranucleotides in the chromosome and megaplasmid of R. solanacearum GMI1000. Both sequences were concatenated with their inverted complementary sequence using revseq, yank and union (EMBOSS). Mononucleotide frequencies were calculated using Artemis 4.0 [46]; di-, tri- and tetranucleotide frequencies were calculated using compseq (EMBOSS). Dinucleotide relative abundances $\rho^*_{XY}$ were calculated using the equation $\rho^*_{XY} = f^*_{XY}/(f^*_X f^*_Y)$, where $f^*_{XY}$ denotes the frequency of dinucleotide XY and $f^*_X$ and $f^*_Y$ denote the frequencies of X and Y, respectively, in the sequence concatenated with its inverted complement [38]. Similarly, the corresponding fourth-order oligonucleotide measure (which factors out all lower-order biases) is given by $\tau^*_{XYZW} = (f^*_{XYZW} f^*_{XY} f^*_{XNZ} f^*_{XN_1N_2W} f^*_{YZ} f^*_{YNW} f^*_{ZW})/(f^*_{XYZ} f^*_{XYNW} f^*_{XNZW} f^*_{YZW} f^*_X f^*_Y f^*_Z f^*_W)$, where N is any nucleotide and X, Y, Z and W are each one of A, C, G and T [38]. Statistical theory and data from previous studies [38,39] indicate that the normal range of $\rho^*_{XY}$ is between 0.78 and 1.23. In this study we used the refined criteria of discrimination proposed by Karlin et al. [38]. Overrepresentation is indicated by + ($1.23 \le \rho^* < 1.30$), ++ ($1.30 \le \rho^* < 1.50$) and +++ ($\rho^* \ge 1.50$), while underrepresentation is indicated by - ($0.70 < \rho^* \le 0.78$), -- ($0.50 < \rho^* \le 0.70$) and --- ($\rho^* \le 0.50$). The dissimilarities in relative abundance of dinucleotides between both sequences were calculated using the equation described by Karlin et al. [38]: $\delta^*(f,g) = \frac{1}{16}\sum_{XY} |\rho^*_{XY}(f) - \rho^*_{XY}(g)|$ (multiplied by 1000 for convenience), where the sum extends over all dinucleotides. To assess the variability of dinucleotide relative abundances within a replicon, the chromosome and the megaplasmid were divided into 12 and 7 non-overlapping fragments, respectively, and $\rho^*_{XY}$ values were calculated for each fragment. The average δ*(f,g) within each replicon was also calculated.
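The dinucleotide calculations can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the EMBOSS and Artemis workflow used in the study. Counts are pooled over the sequence and its inverted complement instead of concatenating the two (which differs only by the single junction dinucleotide), all four bases are assumed to be present in each sequence, and the function names are hypothetical.

```python
from collections import Counter

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the inverted complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def rho_star(seq):
    """Dinucleotide relative abundances rho*_XY = f*_XY / (f*_X f*_Y)."""
    mono, di = Counter(), Counter()
    for strand in (seq.upper(), reverse_complement(seq.upper())):
        mono.update(strand)
        di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    # Only unambiguous bases and dinucleotides enter the totals.
    n_mono = sum(mono[b] for b in BASES)
    n_di = sum(di[x + y] for x in BASES for y in BASES)
    return {x + y: (di[x + y] / n_di)
            / ((mono[x] / n_mono) * (mono[y] / n_mono))
            for x in BASES for y in BASES}


def delta_star(seq_f, seq_g):
    """Average absolute rho* difference over 16 dinucleotides, times 1000."""
    rho_f, rho_g = rho_star(seq_f), rho_star(seq_g)
    return 1000 * sum(abs(rho_f[d] - rho_g[d]) for d in rho_f) / 16


def classify(rho):
    """Map a rho* value to the +/- symbols defined in the text."""
    if rho >= 1.50:
        return "+++"
    if rho >= 1.30:
        return "++"
    if rho >= 1.23:
        return "+"
    if rho <= 0.50:
        return "---"
    if rho <= 0.70:
        return "--"
    if rho <= 0.78:
        return "-"
    return ""


def fragments(seq, n):
    """Split a sequence into n equally-sized, non-overlapping fragments
    (any remainder at the 3' end is dropped)."""
    size = len(seq) // n
    return [seq[i * size:(i + 1) * size] for i in range(n)]
```

Within-replicon variability could then be estimated by calling `delta_star` on each pair of pieces returned by `fragments(chromosome, 12)` or `fragments(megaplasmid, 7)` and averaging the results, where `chromosome` and `megaplasmid` stand for the two downloaded sequences.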

List of Abbreviations
SSR : simple sequence repeat

Authors' Contribution
TC conceived the study and carried out the computational analyses. PV participated in experimental design. Both authors read and approved the final manuscript.