Short and long-term genome stability analysis of prokaryotic genomes
© Brilli et al.; licensee BioMed Central Ltd. 2013
Received: 13 August 2012
Accepted: 11 April 2013
Published: 8 May 2013
Skip to main content
© Brilli et al.; licensee BioMed Central Ltd. 2013
Received: 13 August 2012
Accepted: 11 April 2013
Published: 8 May 2013
Gene organization dynamics is actively studied because it provides useful evolutionary information, makes functional annotation easier and often enables to characterize pathogens. There is therefore a strong interest in understanding the variability of this trait and the possible correlations with life-style. Two kinds of events affect genome organization: on one hand translocations and recombinations change the relative position of genes shared by two genomes (i.e. the backbone gene order); on the other, insertions and deletions leave the backbone gene order unchanged but they alter the gene neighborhoods by breaking the syntenic regions. A complete picture about genome organization evolution therefore requires to account for both kinds of events.
We developed an approach where we model chromosomes as graphs on which we compute different stability estimators; we consider genome rearrangements as well as the effect of gene insertions and deletions. In a first part of the paper, we fit a measure of backbone gene order conservation (hereinafter called backbone stability) against phylogenetic distance for over 3000 genome comparisons, improving existing models for the divergence in time of backbone stability. Intra- and inter-specific comparisons were treated separately to focus on different time-scales. The use of multiple genomes of a same species allowed to identify genomes with diverging gene order with respect to their conspecific. The inter-species analysis indicates that pathogens are more often unstable with respect to non-pathogens. In a second part of the text, we show that in pathogens, gene content dynamics (insertions and deletions) have a much more dramatic effect on genome organization stability than backbone rearrangements.
In this work, we studied genome organization divergence taking into account the contribution of both genome order rearrangements and genome content dynamics. By studying species with multiple sequenced genomes available, we were able to explore genome organization stability at different time-scales and to find significant differences for pathogen and non-pathogen species. The output of our framework also allows to identify the conserved gene clusters and/or partial occurrences thereof, making possible to explore how gene clusters assembled during evolution.
Genome dynamics are mainly studied in relation to gene content, with several evolutionary models adapted to the problem, such as for instance birth-death and transfer models [1–4]. These approaches contributed to the development of the concepts of core and accessory genome: genes shared by all genomes of a species constitute the core, whereas accessory genes are present in a subset of the genomes. The maintenance of many prokaryotic genes is influenced by ecology, and accessory genes often carry information about peculiar adaptations (e.g.[5–13]). It is therefore conceivable that the fitness associated with a given genome organization depends in a similar way on the life-style of an organism: gene clusters may be transferred or assembled/disassembled, providing information on the selective pressures acting on peculiar gene associations in different ecological scenarios. Specific chromosome organizations (e.g. operons, genomic islands or larger aggregates) can be preferred by evolution, for instance through the selection of a given distribution of genes relative to the origin of replication or a specific pattern of gene co-expression. Chromosome rearrangements antagonize the selective features of the organization of bacterial genomes so that a trade-off is reached integrating the selective advantage of specific gene associations in an evolutionary and ecological context. Besides their destructive role, rearrangements allow exploring alternative gene associations and are therefore of paramount importance for genome evolution. Different rates of genome rearrangements characterize different ecological constraints, making the study of genome organization stability particularly interesting. Pathogens are for instance under periodic selection e.g. imposed by the immune system or drug treatments, and it was suggested that they have genomes with plastic organization . However, the transient nature of these selective pressures on the long evolutionary scale must be taken into account to identify stable and unstable genomes and gene associations: if historical niches do not reflect modern environmental associations, merging comparisons spanning millions or billions of years can obscure the true relationship with life-style: genes forming an operon in Escherichia coli can be scattered in other species [15–19], indicating that the tendency of particular genes to stay close on the genome is subject to evolutionary change, like all biological properties.
Gene order analysis is also an important tool in comparative and functional genomics since conserved gene clusters often comprise genes with related functions [20–26]. The importance of gene clustering in evolution has started being recognized for eukaryotes too [27–31].
where N=n f +n s is the number of pairs of genes in the genome. A similar strategy was previously used by Huynen et al.  leading to the same conclusions. Tamames et al.  used a different strategy and proposed a sigmoid relationship between genome organization conservation and phylogenetic distance. In this case the authors identified orthologous genes between pairs of genomes, extracted genome regions with conserved gene order and calculated their stability estimator as the fraction of genes in conserved runs with respect to the number of genes. In this case, accessory genes help defining the borders of the conserved regions.
Previous works therefore express different views on how genome organization changes in time which could be ascribed to the genomes selected for the comparisons or to differences in the analytical methods. Specifically, the way the insertion/deletion of accessory genes is addressed is relevant in this context since genome organization divergence is the resultof the interplay between genome rearrangements (i.e. translocations and recombinations) and gene content dynamics (insertions and deletions). The latter do not change the relative order of core genes, and are consequently neglected in backbone stability analyses. By taking them into account we can identify evolutionarily persistent gene associations but it is difficult to discern between the contribution of genome rearrangements and gene content dynamics to genome organization divergence. Based on this, a complete picture on genome organization evolution clearly requires considering the information coming from both core gene order (backbone stability) and insertions/deletions of accessory genes (genome organization stability).
To fulfill this task, we implemented a graph-based framework to study in depth the stability of prokaryotic genomes and applied it to a selected dataset of genomes. We improved Eq. 1 for backbone stability in time, and then we compared the fit of the new and several other models to the data. Using the fitted model, we studied genome backbone stability within and between bacterial species to better understand genome organization dynamics on the short and the long evolutionary time. The relationship between backbone stability (BS) and genome organization stability (GOS) provided information about the importance of genome organization rearrangements and gene content dynamics for genome evolution in different species. A comparison between GOS and genome fluidity  allowed to summarize the variability in the size of accessory gene clusters in different species highlighting differences between pathogens and non-pathogens. An additional output of our approach is the phylogenetic distribution of conserved gene clusters in the genomes under analysis, which provides useful evolutionary insights on how they are distributed and assembled.
We discuss our results in an ecological perspective where the life-style of the species under analysis is taken into account to explain the properties of the corresponding genomes.
The input for computing GOS is the same as for diameter calculation. The latter is the longest shortest path connecting any two nodes in a network. The shortest path between two nodes in a graph is defined as the path with the minimum number of edges between them. The diameter can be calculated in different ways; we use Johnson’s method  implemented in the Matlab library MatlabBGL . We propose to use the diameter as an alternative stability measure because it allows to consider accessory genes and to convey additional information to the previous measures. As shown by Watts and Strogatz , the simple rewiring of a small fraction of the links in a regular lattice results in a sudden lowering of the diameter of the graph; similarly, when the position of a gene changes between different genomes, the diameter of the corresponding GGN shrinks (Figure 2B). It follows that the diameter is inversely related to the stability of the genomes.
The GGN can be obtained summing any number of adjacency matrices: if the edge values of the GGN are normalized by the number of genomes under comparison we obtain a weighted network with edge values corresponding to the fraction of times a given gene is found close to another one in these genomes. This new matrix allows a rapid extraction of gene clusters present in more than α % of the genomes by removing edges with a value under the threshold and collecting the induced connected components. To this purpose, we use the Dulmage-Mendelsohn decomposition in Matlab.
Model fitting results
R o c h a+3p
E x p
B S=a+e b x
R o c h a+2p
R o c h a
H i l l
T a m a m e s
Results of the non linear fitting are reported in Table 1. The R o c h a+3p model has the minimum AIC and best explains the data. The Akaike weights indicate that its probability of being the correct model is much higher than for the others. The first observation on estimated parameters is that p f takes the same value in the two R o c h a models, around 0.24, and so does the f f fraction, indicating that about 95% of the gene pairs change quite frequently.Following the R o c h a+3p model, a small fraction of edges (f f f =0.02) changes very fast. These might involve transposases and other mobile elements. It should be noticed that the original R o c h a model gives p f =0.17 and p s =0.37, not too far from what we obtained here.
Prochlorococcus marinus was identified as unstable in Rocha  on the basis of the comparisons of three P. marinus and five other Cyanobacteria. This discrepancy derives from the different phylogenetic ranges of the comparisons here and in . By looking more in detail at the behavior of two additional Cyanobacteria present in our genus dataset (see Methods and Additional file 1), we found that the P. marinus genomes have indeed higher stability (Additional file 2). This reinforces the idea that by considering multiple genomes of the same species we can define the behavior of each group of organisms in a better way. Our analysis moreover shows that P. marinus is relatively unstable on the long-term, in agreement with .
The intra-species analysis shows that all species have more or less stable gene organization. To identify aberrant genomes within a species, we obtained Z-scores for the residuals of each genome with respect to its conspecific and the model predictions (Figure 4C); genomes with the largest deviation from the mean are discussed below in the light of previous reports about genome organization. Despite the only marginal sequence divergence within the analyzed isolates, strain Angola appeared to be highly rearranged with respect to the remaining Y. pestis genomes. A further analysis indicated that this strain shares 92% of the gene pairs with Y. pseudotubercolosis IP_32953, which is almost the same as with the other Y. pestis genomes. The similarity in gene order among the other Y. pestis genomes is instead much higher (they share on average 99% of the gene pairs) as it is higher the similarity of these strains with Y. pseudotubercolosis IP_32953 (about 96% of the gene pairs are in common). This suggests that strain Angola experienced a period of intense reorganization after the separation from Y. pseudotubercolosis and independently from other Y. pestis strains, in agreement with the high degree of intrachromosomal rearrangements detected in a dedicated comparative analysis .
The Bacillus anthracis CI genome was previously analyzed by a comparative genomics approach leading to the conclusion that it has evolved from a B. cereus strain and established a B. anthracis lifestyle . This is also reflected in a markedly different gene organization with respect to the other genomes of the anthracis clade, as it appears from Figure 4C.
Sela et al.  found an abundance of mobile genetic elements in the genome of Bifidobacterium longum ATCC15697 relative to other sequenced bifidobacteria, a feature positively affecting rearrangement frequencies . Acinetobacter baumannii is the source of numerous nosocomial infections in humans and is often multidrug resistant. Comparative genomics revealed that strain SDF is highly divergent from strains ADP1 and AYE and that it harbors over 400 insertion sequences, much more than other strains . This, along with our stability analysis, suggests that this strain is undergoing an intense rearrangement of gene order, perhaps as a consequence of the adaptation to the human host or the new challenges imposed by drug treatment.
For the inter-species analysis, we selected one stable genome per species on the basis of the previous analysis and we compared them all against all. Genomes more stable than expected at such large phylogenetic distances were those of B. aphidicola, S. muelleri and Coxiella burnetii (not shown). The latter is a widespread bacterium causing Q fever in humans whereas it does not normally cause overt disease in its reservoirs (cattle, sheep, goat). Its genome stability seems to be in contrast with the phenotype in humans but it agrees with its obligate intracellular life-style. It was indeed shown to have rearranged genomes with large syntenic blocks .
The presence of the two endosymbionts suggests that rearrangements played a minor role in their genome evolution, indicating that these endosymbionts diverged from their ancestors by eliminating superfluous genes and joining the remaining without much rearrangements.
In the previous section, we analyzed what we called backbone stability. In that case, we neglected the effect of accessory genes to highlight gene movements along the chromosome, and we observed a general stability of genomes on the short evolutionary time. However, the stability of a genome is the consequence of two processes: genome order rearrangements and gene content dynamics. In this section, we focus on genome organization stability using the diameter and the stability measure defined in Eq. 4 where insertions and deletions are also considered.
There are 7 and 4 gene clusters that are also present in the Archaea Sulfolobus islandicus and Methanococcus maripaludis, respectively; among those only one is common (gene cluster 23, comprising two genes involved in tryptophan biosynthesis). These gene clusters code for interacting proteins and have metabolic roles (e.g. enzymes for tryptophan, arginine, leucine and pyrimidine nucleotides biosynthesis, glycine cleavage for serine biosynthesis), in addition to the Phe-tRNA synthetase subunits α and β (gene cluster 25). The only gene cluster present in the Archaea and comprising more than two genes codes for the PhoU regulator and three subunits of a phosphate transporter (cluster 61, partially conserved in S. islandicus), underlining the importance of this limiting nutrient across the prokaryotic kingdoms. The 3 gene clusters with the wider phylogenetic distribution code for three subunits of cytochrome oxydase (cluster 50), the elongation factor G plus two ribosomal proteins (cluster 53) and two ribosomal proteins (cluster 38). The functions encoded are in this case more housekeeping but similarly to the previous case, all of them encode interacting proteins. These results lead to the testable hypothesis about protein interaction being the major driving force for gene clustering, facilitating the initial assembly of gene clusters that are then combined to form larger gene aggregates during evolution.
We studied the genome organization stability of 40 prokaryotic species for a total of 277 genomes, using an approach based on interpreting chromosomes as graphs. We focus on two different time-scales finding that at short phylogenetic distances, genomes are all quite stable besides life style; the use of multiple genomes of the same species allowed the identification of genomes with increased instability within a species, which are in majority from pathogens. Our results are in agreement with previous findings indicating that during adaptation to pathogenesis, several species experience phases of instability [49–51]. We confirm the high stability of endosymbiont genomes, adding moreover a few hints: B. aphidicola Cc has a deviant stability with respect to the other Buchneras, plausibly because of its coexistence with other symbionts within the host . The results show at the same time that Sulcia muelleri and Buchnera aphidicola differ concerning the stability of their genomes, with Buchnera having much more variability of both gene order and gene content. This suggests that Sulcia is more terminally differentiated, with a static backbone gene order and slow gene content dynamics.
The long term analysis allowed to identify those genomes that, although stable on the short time, are instead unstable on the long evolutionary time. As in the previous case, the genomes with increased instability were often from pathogens, indicating that at least some of them experienced instability periods during evolution while being quite stable today.
The comparison between backbone and genome organization stability for intra-species comparisons allowed to detect an important difference between pathogens and free-living non-pathogens: gene content dynamics plays a much more prominent role in the evolution of pathogen genomes, whereas free-living species tend to have slower gene content dynamics. We have moreover shown that non-pathogens tend to gain/delete fragments of the genome containing on average more genes than what is observed in pathogens.
Gene transfers play a fundamental role in genome evolution, therefore we focused on singleton gene clusters, that is gene clusters formed by genes present in only one of the genomes of a given species. It was shown that this category of genes is often enriched in xenologous genes, and this analysis may therefore inform about the size of transferred DNA fragments. We find that non-pathogens have a significantly larger mean size of singleton gene clusters. The statistical significance vanishes when the mean is calculated for gene clusters of at least two genes in length, indicating that the difference is not caused by larger clusters in the non-pathogens, but instead by a larger fraction of isolated singletons in pathogens.
Insertions and translocations in multiple genomes define the borders of evolutionary conserved gene clusters that can be rapidly extracted from our graphs by filtering the edges on the basis of the degree of conservation. Depending on the threshold and the organisms used, they can be seen as gene associations with different evolutionary success, and thus they may be related to more or less universal selective pressures. We show here that when focusing on gene clusters common to distantly related organisms, we mostly detect clusters encoding interacting proteins. This suggests that the main selective pressure towards gene clustering could be the co-localization of the synthesis of interacting partners, as it has been previously proposed on a limited number of genomes [18, 52, 53]. Even if only partial, our analysis also suggests that these conserved gene clusters may function as nucleation sites for the evolution of larger ones. Two lines of evidence support this view: most of the shorter clusters are known to be involved in longer operons in E. coli (e.g. tryptophan biosynthesis, clusters 23 and 24 and leucine biosynthesis, cluster 4), and larger gene clusters show high conservation in close E. coli relatives and only partial conservation in more distant ones.
Our strategy is briefly described at the beginning of the Results section, here we add some technical description.
We classify orthologs with the BBH (“Bidirectional-Best-Hit”) criterion  by comparing all the proteins coming from a group of genomes at once. After an all-against-all blast, we build a similarity network where two proteins are connected if they are reciprocal best hits. We define a cluster in such a network as an orthologous group (OG). Since we require a one-to-one orthologous mapping for assigning unique gene neighborhoods to all genes in a genome, we systematically excluded all proteins belonging to an OG that contains multiple proteins from a same organism. When reconstructing the gene neighborhood graph, we skip the corresponding genes, i.e. the gene cluster A→B→C→D, in case B and C have been assigned to multiple orthologs group, becomes A→D. Removed genes correspond mainly to transposases present in multiple (almost) identical copies in most of the genomes under analysis (Additional file 6) and for which a one-to-one mapping is almost impossible without adding some information.
The gene neighborhood network of each genome is built using the information about protein coding gene coordinates. Each gene is connected to the following one in the genome table, with no threshold on their distance and taking into account the circularity of the chromosome. Genes corresponding to removed proteins (see Orthologous mapping) are deleted at this stage by joining together predecessor and successor nodes. Taking advantage of the orthologous mapping and the gene ordering information encoded in the genome table files, all gene neighborhoods are stored in compatible adjacency matrices (the genome specific neighborhood networks, GSN), i.e. proteins belonging to the same OG occupy the same place in matrices corresponding to different chromosomes. The GSN is encoded as un undirected graph. Once the GSN for all genomes have been built, they can be compared on a pairwise basis (see Figure 1 and Figure 2). We call GGN the network for a given comparison, which is obtained by taking the sum of the adjacency matrices of the two GSNs under analysis. By summing all the GSNs from a group it is possible to extract connected components that are present in a given fraction of the genomes i.e. evolutionary conserved gene clusters. To this purpose we use the Dulmage-Mendelsohn decomposition performed by the matlab function dmperm.
The GSN is circular and all genes have one incoming and one outgoing edge only. The aim of the compression procedure is to remove a defined set of accessory genes (R) and add the connections between predecessors and successors of genes belonging to R (Figure 1), allowing to focus on the re-organization of the genome backbone. For each gene in R, we add to the graph the edge between its core predecessor and its core successor, and then we remove from the graph the genes in R. If part of the genes belonging to R form connected groups of genes, they are treated as a single gene. Genes in the compressed GSN correspond to genes common to the two genomes under comparison and consequently the compression of a given genome can be different for each comparison. To be noticed that the compressed network goes in the direction of the work of , and that it has a different meaning with respect to the original GSN, since after the compression, edges do not always correspond to physical proximity between genes (see Figure 2a) and cannot be used for identification of evolutionarily persistent gene clusters.
The diameter of the networks was calculated using the MatlabBGL package developed by David Gleich, .
where R is the number of models under comparison. Akaike weights inform on how much more probable is the model with the lowest AIC, with respect to the other models allowing not only to identify the best model, but also to say something on how far the others are from its performance.
Genomic features of the species under analysis
CDC 684 (59303)
AU 1054 (58371)
NCTC 10229 (58383)
NCTC 11168 (57587)
A Hall (58931)
RSA 493 (57631)
86 028NP (58093)
EGD e (61583)
ATCC 17029 (58449)
TIE 1 (58995)
Gallinarum 287 91 (59249)
ATCC 700669 (59287)
P1 7 (32235)
M 16 4 (58841)
M66 2 (59355)
ATCC 33913 (57887)
IP 32953 (58157)
On the contrary, when we consider a wider phylogenetic span for the comparisons, the probability that two genomes come from species with highly similar life-styles is reduced, and we can analyze changes in stability that occurred in ancestors of the present species (long-term stability analysis). This dataset concerns 40 genomes (one for each of the species under analysis), and comprises 780 pairwise comparisons, corresponding to 39 new observations for each reference genome.
For the purpose of statistical model fitting only, we add a set of comparisons between genomes belonging to 32 genera (see Additional file 1), and the comparisons were made only within each genus. To be noticed that this dataset is only used for statistical model fitting.
We ask for at least 4 genomes when selecting species for the short term analysis for two main reasons. First in such a way, we have some statistical power for intra-species comparisons allowing to identify deviant genomes with respect to the average species behavior. This is important if we want to identify genomes that changed their stability recently. Second, this within-species analysis allows to select the genomes for the comparisons between species. These genomes were chosen so that they have a stability which is in line with the other genomes belonging to the same species. It should be noticed that if genomes for the long term analysis are picked randomly within a species, we could end up using a biased genome as the prototypical species genome, affecting the subsequent analysis and interpretation of the results. Thus choosing genomes whose stability pattern corresponds to some sort of average of the species allows to obtain more accurate stability values for the species in the long-term analysis.
For distance calculations, we used two universal sequences proposed by , namely FusA and RplB. Coding sequences were aligned at the protein level using RevTrans . Distances were calculated using Mega 5 , Tamura-3 parameters model of evolution for DNA sequences, heterogeneous rates along lineages (α=1.3, default value), and the pairwise deletion option. The distance matrices obtained for the two proteins were combined together by taking the sum of the corresponding elements and used for subsequent analysis.
The research leading to these results has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement Nr 10. The authors wish to thank two anonymous reviewers: their criticisms about previous versions of the manuscript notably improved the paper.
Finally, thank to the BMC Genomics editors for according us more time to complete the revisions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.