Extensive recombination events and horizontal gene transfer shaped the Legionella pneumophila genomes

Background Legionella pneumophila is an intracellular pathogen of environmental protozoa. When humans inhale contaminated aerosols this bacterium may cause a severe pneumonia called Legionnaires' disease. Despite the abundance of dozens of Legionella species in aquatic reservoirs, the vast majority of human disease is caused by a single serogroup (Sg) of a single species, namely L. pneumophila Sg1. To get further insights into genome dynamics and evolution of Sg1 strains, we sequenced strains Lorraine and HL 0604 1035 (Sg1) and compared them to the available sequences of Sg1 strains Paris, Lens, Corby and Philadelphia, resulting in a comprehensive multigenome analysis. Results We show that L. pneumophila Sg1 has a highly conserved and syntenic core genome that comprises the many eukaryotic like proteins and a conserved repertoire of over 200 Dot/Icm type IV secreted substrates. However, recombination events and horizontal gene transfer are frequent. In particular the analyses of the distribution of nucleotide polymorphisms suggests that large chromosomal fragments of over 200 kbs are exchanged between L. pneumophila strains and contribute to the genome dynamics in the natural population. The many secretion systems present might be implicated in exchange of these fragments by conjugal transfer. Plasmids also play a role in genome diversification and are exchanged among strains and circulate between different Legionella species. Conclusion Horizontal gene transfer among bacteria and from eukaryotes to L. pneumophila as well as recombination between strains allows different clones to evolve into predominant disease clones and others to replace them subsequently within relatively short periods of time.


Background
Legionella pneumophila is the etiologic agent of Legionnaires' disease, an atypical pneumonia, which is often fatal if not treated promptly. However, it is principally an environmental bacterium that inhabits fresh water reservoirs worldwide where it parasitizes within free-living protozoa but also survives in biofilms [1][2][3]. Since L. pneumophila does not spread from person-to-person, humans have been inconsequential for the evolution of this pathogen. Instead, the virulence strategies of L. pneumophila have been shaped by selective pressures in aquatic ecosystems. Indeed, the co-evolution of L. pneumophila with fresh-water amoebae is reflected in its genome sequence. The analysis of two L. pneumophila genomes identified the presence of an unexpected high number and variety of eukaryotic-like proteins and proteins containing motifs mainly found in eukaryotes [4]. These proteins were predicted to interfere in different steps of the infectious cycle by mimicking functions of eukaryotic proteins [4]. For several of these eukaryotic like proteins it has been shown recently that they are secreted effectors that help L. pneumophila to subvert host functions to allow intracellular replication [5,6]. The possibility that L. pneumophila has acquired at least some of these genes through horizontal gene transfer from eukaryotes has been suggested by two studies [7,8].
* Correspondence: carmen.buchrieser@pasteur.fr 1 Institut Pasteur, Biologie des Bactéries Intracellulaires, 75724, Paris, France Full list of author information is available at the end of the article Plasticity is another specific feature of the L. pneumophila genomes as integrative plasmids, putative conjugation elements and genomic islands were identified. In addition to DNA interchange between different bacterial genera and even domains of life, horizontal gene transfer within the genus Legionella and within the species L. pneumophila has been reported. For example a 65-kb pathogenicity island described first in L. pneumophila strain Philadelphia [9] is present in several L. pneumophila strains and also in other Legionella species like L. anisa [10]. Another example is the particular lipopolysaccharide cluster of serogroup 1 strains that has been detected in L. pneumophila strains of different lineages and genetic backgrounds [10]. L. pneumophila has all necessary features for incorporating foreign DNA, as these bacteria are naturally competent and possess an intact recombination machinery [11,12]. These findings suggest that the L. pneumophila genomes are very dynamic and one would expect that horizontal gene transfer and recombination events play an important role in their evolution.
However, different analyses like early studies applying multilocus enzyme electrophoreses (MEE) supported a clonal population structure of L. pneumophila [13]. Two recent reports using genetic profiling based on six or three genetic loci, respectively concluded also that L. pneumophila shows a clonal populations structure [14,15] although the presence of few recombination events was not ruled out. Later the analysis of the dotA, mip and rpoB genes in different isolates suggested for the first time that recombination may play some role in L. pneumophila evolution [16][17][18] and a more in depth analysis using over 20 loci suggested that recombination events might be more frequent than was previously thought [19]. However, comparisons of these studies are difficult due to different sampling and different analysis methods used. Furthermore there may be a bias associated with some of the genes selected in these studies like intergenic spacer regions or genes under positive selection that may lead to artefactual effects in detecting recombination. To solve these problem efforts have been undertaken recently to homogenize the results obtained for different species to allow comparisons [20]. These authors report for L. pneumophila a low recombination rate like for the obligate pathogens Bordetella pertussis or Bartonella henselae. In contrast Coscolla and colleagues suggest a more important role for recombination at the intergenic level [21].
These different results and the fact that a globally distributed L. pneumophila clone implicated in Legionnaires' disease has been described [10] may suggest that the role of recombination is not relevant. However, the description of clonal complexes is not incompatible with high recombination rates. Transient clones may appear within a recombining population [22], in particular if clones with high disease prevalence appear, as this seems to be the case for some L. pneumophila strains. These clones are often vastly over-sampled due to their clinical importance and show strong clonality. Thus, this may be correct for this subgroup, but it may not be representative for the population. Indeed when analyzing over 200 clinical and environmental L. pneumophila strains, significantly less diversity was found among the clinical isolates [23].
In this study we investigated the genome dynamics and evolution of the species L. pneumophila by analyzing horizontal gene transfer, mobile genetic elements and recombination on a genome-wide level. We undertook this analysis based on six complete genome sequences four of which are the previously published reference genomes of L. pneumophila Paris, Lens [4], Corby [24] and Philadelphia [25] and two that were sequenced in this study. The newly sequenced strains were selected according to epidemiological features that might be reflected in their genomes and should thus allow to study genome dynamics with respect to virulence. Strain Lorraine is rarely isolated from the environment but its prevalence in human disease is increasing considerably in the last years [26]. In contrast, L. pneumophila strain HL 0604 1035 has been frequently isolated from a hospital water system since over 10 years but has never caused disease. Analysis of these six strains identified a highly conserved and syntenic core genome and a diverse accessory genome. Furthermore, it showed that recombination events and horizontal gene transfer are frequent in L. pneumophila. Horizontal gene transfer from eukaryotes as well as recombination between strains were identified suggesting that L. pneumophila genomes are highly dynamic, a feature allowing different clones to evolve into predominant disease clones and others to replace them subsequently within relatively short periods of time.

Results and discussion
The L. pneumophila core genome comprises over 2400 conserved genes that are highly syntenic To get comprehensive insight into the genetic basis, evolution and genome dynamics of L. pneumophila Sg1, the strains responsible for over 90% of disease worldwide, we analyzed six completely sequenced genomes. The strains selected are all of Sg1, have endemic and/or epidemic character (e.g. Paris, Lorraine or Philadelphia) were isolated in different countries (France, England, Spain, US) and in different years. Two strains were newly sequenced for this study (Lorraine and HL 0604 1035), the other four L. pneumophila genomes (Paris, Lens, Philadelphia, Corby) have been published previously [4,24,25]. The genomes of L. pneumophila Lorraine and HL 0604 1035 consist each of a single circular chromosome of 3.4 Mb. Strain Lorraine also contains a plasmid. As shown in Table 1, the main features of the six L. pneumophila genomes analyzed (e.g. genome size, GC content and coding density), are highly conserved. The core genome of the six L. pneumophila genomes comprises 2434 genes, which represents about 80% of the predicted genes in each genome. Furthermore, the gene order is highly conserved as the 260 kb inversion in strain Lens with respect to the other strains is the only exception. When comparing the strains two by two, in average 90% of the genes are present in both strains ( Figure 1). However, when determining the nonorthologous genes specific of each genome and not present in the remaining 5 strains, each strain contains between 136 (strain HL 0604 1035) and 222 (strain Corby) strain specific genes mainly encoded on mobile genetic elements. Taken together, the L. pneumophila genomes have a highly conserved and syntenic backbone and a highly dynamic accessory genome of about 300 genes each mainly formed by mobile genetic elements, genomic islands and genes of unknown function. The complete annotation of these six genomes is available in a new data base resource that we have set up, Legionel-laScope https://www.genoscope.cns.fr/agc/microscope/ about/collabprojects.php?P_id = 27 and at the Institut Pasteur, LegioList http://genolist.pasteur.fr/LegioList/.
The species L. pneumophila has a highly conserved core genome a) Most eukaryotic like proteins are conserved in all L. pneumophila genomes The presence of proteins with high similarity to eukaryotic proteins or proteins with domains preferentially or only present in eukaryotic genomes are a particular feature of L. pneumophila [4]. However, the criteria for identifying these proteins were never clearly defined. To analyze their evolution and possible origin in depth we have thus developed an automatic and systematic method to identify eukaryotic like proteins according to defined criteria. Previously we had identified eukaryotic like proteins in L. pneumophila as proteins with the highest similarity score to eukaryotic proteins according to BLAST results, or by identifying eukaryotic domains [4]. However, due to constantly growing databases BLAST results are changing. Furthermore, recent analyses of amoeba-associated bacteria, in particular symbionts of amoeba have shown that they also contain eukaryotic like proteins, suggesting multiple origins of these proteins in prokaryotes [27]. To get a more complete picture of eukaryotic like proteins of L. pneumophila and also to include those proteins that might have been transferred independently to different amoeba associated bacteria we defined a eukaryotic like protein as i) a protein having a better normalized blast score against eukaryotic sequences than against prokaryotic ones and ii) a protein that did not show BLAST results against neither Legionella spp. nor other bacterial species for which resistance to amoeba infection has been demonstrated (see material and methods). Applying these criteria we identified 46 proteins with putative eukaryotic origin, of which 17 are described here for the first time (Table 2). Given the fact that these proteins were probably acquired by HGT one would expect high diversity in the repertoire. However, our analyses revealed a considerable conservation as more than 50% (26) are conserved in all six L. pneumophila strains, indicating an ancient transfer. Furthermore, they show 89-99% nucleotide identity, probably due to high selection pressure for their maintenance. Thus most of these proteins belong to the core genome, indicating that their acquisition has taken place before the speciation of L. pneumophila. These 26 proteins might have allowed a common Legionella ancestor to colonize an intracellular niche or to adapt better to the intercellular environment of a specific protozoan species leading to the evolution of the species L. pneumophila. Interestingly, 19 of these 26 proteins are also conserved in L. longbeachae, which might thus be those indispensible for intracellular replication of Legionellae (Table 2) [28].
b) Eukaryotic protein motifs are highly conserved among the L. pneumophila genomes A second class of eukaryotic proteins of L. pneumophila is carrying domains predominantly present in eukaryotic proteins. To systematically identify these proteins we used the Interpro database comprising 10 different domain search programs [29]. This allowed to identify the L. pneumophila proteins carrying eukaryotic domains in the newly sequenced strains Lorraine and HL 0604 1035 as well as to identify previously not reported motifs. Similarly to the above described eukaryotic like proteins over half of the eukaryotic domain coding proteins are conserved in all six genomes and over 80% are conserved when two genomes are compared (e.g. 33 of the 39 proteins containing an eukaryotic motif in strain Lens are present also in strain Paris).
Moreover half of them share very high nucleotide identity of in average 98%-100% (Table 3) again suggesting high selection pressure to maintain them. Our approach identified also new eukaryotic domains like spectrin repeats. The spectrin repeat forms a threehelix bundle and was reported primarily in the animal kingdom [30]. These repeats act as modules building long, extended molecules that also serve as a docking surface for cytoskeletal and signal transduction proteins. In L. pneumophila it is present in up to eight proteins of each strain (Table 3) and all spectrin repeat proteins are predicted to be secreted Dot/Icm substrates [31][32][33]. Another interesting domain is the RAS GEF domain that is present in two proteins encoded by strain Paris one of which (Lpp0350) is conserved in the six strains analyzed. Ras-GEFs are small GTPases typically present in eukaryotes that are involved in numerous cellular processes like gene expression, cytoskeleton re-organization, microtubule organization and vesicular and nuclear transport [34]. GEFs (GDP-GTP exchange factors) regulate Rabs, GTP-binding proteins with conserved functions in membrane trafficking [35]. Interestingly, according to the Pfam database Ras-GEF domains in bacteria are only present in Legionella, Parachlamydia acanthamoebae and Protochlamydia amoebophila, all of which are amoeba-associated bacteria.
Coiled-coil domains have been identified previously in the L. pneumophila genomes as this motif can be found in all kingdoms of life. However extended coiled-coil domains are largely absent from bacterial genomes but are typical for archaea and eukaryotes. We thus searched the L. pneumophila genomes and 29 other genomes of bacterial pathogens or bacteria present in the aquatic environment (Table 4) for proteins with five or more coiled coil domains. Interestingly, Legionella spp, Streptococcus pneumoniae and Pseudomonas aeruginosa contain the highest percentage of proteins with extended coiled-coil domains (6-11 domains) compared to the number of predicted proteins encoded in their genome and only P. aeruginosa and L. pneumophila encode proteins containing more than 10 coiled-coil domains ( Table 4). Most of these Legionella proteins are predicted substrates of the Dot/Icm secretion system [31][32][33]36]. This suggests that large coiled-coil domains are specific adaptations to the eukaryotic cell probably implicated in interactions with host proteins.

c) High selection pressure acts on the Dot/Icm T4SS and its substrates
Central to the pathogenesis of L. pneumophila are the dot/icm loci, which together direct assembly of a type IV secretion apparatus [37,38]. Although all L. pneumophila strains investigated to date contain the complete dot/icm loci, sequence variations among the dot/icm genes among different L. pneumophila strains have been   reported [39]. The dot/icm loci of the six strains analyzed here exhibited a very high nucleotide conservation of 98-100% among orthologs except for dotA, icmX and for icmC of strain Corby that is shorter and more divergent (84% nucleotide identity) as compared to icmC of strain Paris. These results indicate that strong negative selection acts on these genes (Table 5).
Since the identification of RalF [40], numerous approaches have been used to identify Dot/Icm translocated substrates. Currently 278 proteins of L. pneumophila have been described as being transloctaed by the Dot/Icm T4SS system [7,31,32,[41][42][43][44]. Analysis of their distribution among the six L. pneumophila strains reveals a very high conservation, as 206 of the 278 substrates are present in all six strains. Nearly all of them show a nucleotide similarity of 95-100% and only nine are specific to strain Philadelphia (Additional file 1, Table S1). Furthermore, only 34 of the 278 substrates of strain Philadelphia are missing in strain Paris, 30 in strain Lorraine or 25 in strain HL 0604 1035 (Additional file 1, Table S1). Thus, although high redundancy seems to be present in the repertoire of Dot/Icm effectors, the strong conservation of nearly all of them in all genomes, argues for their mutual importance for the L. pneumophila life cycle, Rare exceptions are RalF and AnkB/Lpp2028. The nucleotide sequence of ralF of strain Philadelphia is only 85% similar to the ralF genes of the other strains and is 72 nts (24aa) shorter. A similar situation is seen for lpg2144/ankB that is 54 nts (18aa) longer in strain Philadelphia and Lens than in strain Paris and Corby. This is surprising, as the C-terminal region of AnkB of strain Philadelphia contains a eukaryotic prenylation CAAX motif mediating posttranslational modification of effector proteins, important for intracellular replication of L. pneumophila. Lipidation facilitates the localization of this effector protein to host organelles and serves as a docking platform for ubiquitinated proteins [45,46]. Thus in strain Paris and Corby other proteins might take over this function. Taken together, this analysis suggests that over 200 of the Dot/Icm substrates of L. pneumophila have been present or have been acquired before the speciation and that such a large repertoire of effectors is indeed necessary for intracellular replication and adaptation to the specific protozoan hosts.
The species L. pneumophila has a highly dynamic accessory genome a) A wide variety of T4ASSs and conjugative elements contribute to genome plasticity Based on sequence comparisons, T4SSs are categorized according to their similarity to the A. tumefaciens VirB/ D4 system into type IVA (type F and P) and type IVB secretion systems [47]. T4ASSs resemble the VirB/D4 system of A. tumefaciens, whereas T4BSS proteins are more distantly related to the VirB/D4 proteins [48]. T4SSs are involved in effector translocation, horizontal DNA transfer to other bacteria and eukaryotic cells, in DNA uptake from or release into the extracellular milieu or in the spread of conjugative plasmids [49]. Genome sequence analyses suggest that for L. pneumophila T4SSs play an important role for adaptation and virulence as each genome encodes several T4ASSs in addition to the essential T4BSS Dot/Icm discussed above. We identified in each strain either F-type or Ptype T4ASSs or both. Figures 2 and Figure 3 show the organization of the structural genes encoding these systems, their organization and their localization (chromosomal or plasmid). The F-type T4ASSs are all predicted to encode a complete T4SS core as well as the essential gene products for pilus assembly and mating pair stabilization that appears to be involved in DNA transfer. They show homology and colinearity with the tra-region of the E. coli F plasmid [50] and with the recently described tra region of Rickettsia belii [51]. In L. pneumophila strain Philadelphia (Tra5) and L. longbeachae strain NSW (Tra6), where the system has a chromosomal localization, it is inserted in a tRNA gene and flanking repeats are present as well as a gene coding for an integrase, suggesting that these T4SSs are mobile ( Figure  2). Furthermore, comparison of amino acid identities revealed that the Tra-region on the L. pneumophila strain Paris plasmid (Tra1) shows much higher identity with the Tra region located on the L. longbeachae plasmid (Tra4) than with those of the different L. pneumophila strains (Paris-Tra1, Lens-Tra3 or Lorraine-Tra2)    Figure 2). Thus these systems seem to be transferred horizontally via plasmids but are also able to integrate in the genome similar to what was reported for the Lvhregion [52]. The F-type T4SS encode long, flexible pili that allow donors to mate in liquid and on solid media with equal efficiencies [53]. In contrast P-type T4SS like described in P. aeuroginosa encode short and rigid conjugative pili that allow surface mating. Homologues to this system are also present in the Legionella genomes. They were initially described in two genomic islands of L. pneumophila strain Corby (Figure 3; Trb1 and Trb2) [54]. We show here that they are also present in the chromosomes of L. pneumophila strain Lorraine (Trb3) and L.
longbeachae NSW150 (Trb4) (Figure 3). Again for all T4SS regions flanking repeats are found suggesting mobility, and protein identity values and GC-content values of the tra-trb genes are higher than the genomic average (38%), supporting again horizontal and not vertical transmission.
Another intriguing feature of these regions is that several transposases and phage related proteins are present in each of the tra clusters as well as genes coding for homologues of a putative phage repressor protein (PrpA) and for homologues of LvrA, LvrB and LvrC, first described for the Lvh region of L. pneumophila. LvrC is a homologue of CsrA, a protein crucial for the regulation of the switch between replicative and    transmissive phase of L. pneumophila [55]. It is tempting to assume that these CsrA homologues are implicated in the regulation of the mobility of these islands. Possibly, dependent on the growth phase and/or on metabolic cues L. pneumophila might excise these islands as multiple copies could be advantageous in certain conditions, or perhaps allow high frequencies of DNA transfer leading to fast and efficient adaptation to new conditions. The genomic features of these islands suggest a particular mechanism of mobility, which will be interesting to investigate.
b) The L. pneumophila genomes encode systems specific for protection against invading DNA and stabilization of large genomic fragments Bacteria have developed multiple methods of protection against mobile genetic elements or bacteriophages. An example for acquired phage specific immunity is clustered regularly interspaced short palindromic repeats (CRISPR) loci [56]. Another type of protection may be conferred by toxin-antitoxin (TA) systems. Bacterial TA systems are small genetic modules composed of a toxin and antitoxin. While toxins are always proteins, antitoxins are either RNAs (type I and III) or proteins (type II) [57]. These systems were first described for being dedicated to plasmid maintenance. Several lines of research indicate that chromosomal TA systems might serve as protection against mobile genetic elements such as plasmids and phages. However, recent studies have shown that type II systems are also involved in the stabilization of large genomic fragments and of integrative conjugative elements [57]. Interestingly, type II TA systems are thought itself to be part of the mobilome and to move from one genome to another through horizontal gene transfer [57]. Genome analyses identified several TA and CRISPR systems. Interestingly, we identified only type II TA systems of which all except two are in a chromosomal location (Table 6). However, of the 18 chromosomal encoded TA systems identified at least 14 are located on putative genomic islands or mobile genetic elements. The two most frequently found TA systems in the L. pneumophila genomes are homologues of the HigAB and RelEB systems. HigAB was first described in the Vibrio cholerae superintegron where it encodes mRNA  [58]. RelEB was shown, when introduced into the E. coli chromosome to prevent deletion of flanking DNA and thus to diminish large scale genome reduction [59]. The same function was shown for the ParED system of Vibrio vulinificus, homologues of which are also present in one of the L. pneumophila genomes (Table 6). Thus, the different L. pneumophila TA systems might be important for stabilization of plasmids and integrative conjugative elements and for protection against invasion of plasmids, phages, or other mobile genetic elements. The CRISPR/cas system was shown to provide resistance against invading viruses and plasmids and has been identified in many bacteria and archea [60]. CRISPR/cas loci are also present in the L. pneumophila genomes of strains Paris, Lens, Alcoy and 130 b but are absent from strains HL06041035 and Lorraine. According to the cas genes, the CRISPR locus of Paris is closely related to that of strain 130 b. In contrast the one of strain Lens located on the plasmid is closely related to the chromosomal CRISPR locus of strain Alcoy as previously described [61]. Strain Lens carries a second CRISR locus on the chromosome; however, it does not seem to be functional like the one encoded by strain Alcoy. Probably strong protection against invading phages is not extremely important, as not all L. pneumophila strains contain CRISPR loci. This may be related to their intracellular life style or that despite their widespread occurrence in aquatic environments only few bacteriophages that specifically infect Legionella seem to exist [62].

c) Accessory genome of strains Lorraine and HL 0604 1035
In order to get insight in the genetic basis of the two newly sequenced strains, possibly implicated in their different disease frequencies (Lorraine is an newly emerging endemic clone and strain HL 0604 1035 is a L. pneumophila Sg1 strain never isolated from disease) we analyzed the specific gene content of each of these strains more in depth. Strain HL 0604 1035 contains 92 and strain Lorraine 148 genes without homology to any gene of the other five L. pneumophila strains sequenced of which the majority (60 in strain HL 0604 1035 and 73 in strain Lorraine) code for proteins of unknown function (Additional file 2, Tables S2 and additional file  3, Table S3). Among the genes in these two genomes that lack an ortholog in the other sequenced L. pneumophila genomes, about 50% are clustered on three large genomic islands. One genomic Island (GI-HL1) of 45 kb spans from lpv2637 to lpv2691. It is bordered by a Met tRNA gene and encodes a phage related integrase. A second putative mobile element (GI-HL2) of 27 kbs contains the region from lpv0193 to lpv0226. It is bordered at one side by an integrase and a reverse transcriptase (lpv0225) and on the other side by a prophage Rac integrase and a phage excisionase. Strain Lorraine contains also a large genomic island (GI-Lo1) of 69 kb that spans from lpo2442 to lpo2531. It is inserted in a Met tRNA gene, contains a phage related integrase and flanking repeats of 72 nts. Additional, smaller genomic islands seem to be present, however, their borders are difficult to define. Thus most of the strain specific genes seem to be acquired by HGT through mobility of genomic islands.
Only for few of the specific genes a putative function can be predicted like genes coding for proteins involved in sugar and nucleotide metabolism, for uridine diphosphoglucuronate 5'-epimerase or for an UDP-glucose 6dehydrogenase. Furthermore a specific ANK motif containing protein and a leucine reach repeat protein are present in strain HL 0604 1035. In strain Lorraine we identified mainly specific metabolic enzymes like a putative flavanone 3-dioxygenase, an enzyme involved in flavonoids metabolism and in biosynthesis of phenylpropanoids, which are secondary metabolites of plants and algae. In addition, lpo2614 is predicted to encode a kynurenine-oxoglutarate transaminase, an enzyme that is part of the tryptophan metabolism and lpo2960 codes for a putative glycolate oxidase that catalyses the conversion of glycolate and oxygen to glyoxylate and hydrogen proxide. lpo2502 codes a homologue of CsbD, a general stress response protein of Bacillus subtilis [63]. However, the best BLASTp hit is with the Protochlamydia amoebophila homologue, an Acanthamoeba sp. symbiont [64]. Probably this gene has been acquired by HGT between these two bacteria within their amoeba host. Quite surprisingly, we identified a gene coding a putative methyl-accepting chemotaxis sensory transducer (lpv1770) although all L. pneumophila strains analyzed to date do not encode chemotaxis systems. This gene shares 71.34% amino acid identity with Llo3301 of L. longbeachae a protein that is part of its chemotaxis system [28] also present in L. drancourtii [65]. Probably a common ancestor encoded a chemotaxis system that was lost in L. pneumophila through a deletion and degradation process.

d) Shared genome of the epidemic strains Paris and Lorraine
A search for genes shared by the two endemic strains but absent in all other strains identified only three genes that fulfilled these criteria and for which a function could be predicted. These encode the alpha, beta and gamma subunits of a putative thiocyanate hydrolase (lpo1236, lpo1237, lpo1238 and lpp1219, lpp1220, lpp1221). Most interestingly, these strains are both common in France and strain Paris is also world-wide distributed [10] suggesting a better niche adaptation. Indeed, thiocyanate compounds are used for cleaning water circuits and these strains are thus probably able to better resist these treatments [66]. Furthermore, strain Alcoy that is responsible for several outbreaks and many cases of Legionnaires' disease in Spain, also contains these genes [61]. The genes coding the putative thiocyanate hydrolase have a GC content of 41-43%, which is significantly higher than the average G+C content of the L. pneumophila genome, which is 38%. When searching for the closest homologues according to BLAST searches we identified them in the genomes of Rhodococcus opacus strain B4 and Nocardia farcinica spp. These two are high G+C Gram-positive bacteria belonging to the Actinomycetales, which are phylogenetically not closely related to Legionella suggesting that L. pneumophila acquired these genes by horizontal gene transfer.
Taken together, the analysis of the accessory gene content showed again that L. pneumophila genomes show high plasticity due to mobile genetic elements and HGT. No specific virulence related genes explaining their different disease frequencies have been identified. However, the identification of a specific thiocyanate hydrolase might explain the wide distribution of strains Paris and Lorraine as it may allow them to better adapted to artificial water systems.

Evolutionary genomics Phylogenetic reconstruction reveals extensive recombination
To analyze the relationship among the six different L. pneumophila strains a phylogenetic reconstruction was done based on a multilocus sequence (MLSA) approach using 31 genes selected according to Zeigler [67] (Table  7 and Additional file 4, Table S4). These 31 genes were chosen as they had been shown to be powerful for predicting the relatedness of bacterial genomes [67]. The phylogeny obtained from their concatenated alignment showed a well-resolved topology with bootstrap values a with respect to strain Paris, nts nucleotides over 50%. To ascertain the reliability of the obtained phylogenetic tree we established individual phylogenies for each of the 31 genes. Surprisingly, the incongruence among several gene trees was high. In addition the Consense program results did not support any node to at least 50%. To further investigate these results we undertook a second analysis using a Shimodaira-Hasegawa test and compared the topologies of the individual alignments of each gene and the concatenated alignment of the 31 genes. As shown in Additional file 5, Table S5 the likelihood-based SH test for alternative tree topologies identified striking discordances. A possible explanation for the identified incongruences among the phylogenies obtained in our study is the presence of recombination events.
With the aim to explore whether recombination events are present in the selected genes we undertook an in depth analysis using the program RDP [68]. Indeed, the analysis of individual genes identified intragenic recombination in 9 of the 31 genes (Table 8). Numerous additional recombination events were detected with the concatenated alignment of the 22 genes for which no intragenic recombination had been shown (Table 8). To minimize false positive recombination events only those that were supported by at least two of the six methods used in RDP were taken into account. However, except one, all were supported by at least three methods. No artifacts resulting of positive selection should be included in this analysis since all of the genes are either informational or operational (housekeeping). Most interestingly, four of the genes in which intragenic recombination was detected are housekeeping genes (pgk, atpD, ffh, metK). Housekeeping genes allow to estimate the extent of recombination within bacterial species since presence of recombination in such "normally recombination free genes" is indicative of a high rate of recombination [22]. Similarly antigen-coding genes of Legionella were reported to show recombination events [18,69] and certain other genomic regions [17,19,[70][71][72]. Another example of intragenic recombination in L. pneumophila is the rtxA gene that contains a long tandem repeated domain of variable copy number and sequence [4,10,73]. rtxA of strain Lorraine and Corby share the same repeats, whereas the other strains have unique types of repeats. However, when including the newly sequenced strains Lorraine and HL 0604 1035 we found that repeats of the same type are shared by HL 0604 1035 and Philadelphia and by Lorraine and Lens (Figure 4 and Additional file 6, Table S6), further substantiating high intragenic recombination among strains.
To reconstruct the phylogenetic history of the species L. pneumophila we used thus the concatenated alignment of the 31 genes described above. It gave a topology with high bootstrap support, however recombination bias may result in high support for the wrong tree. To avoid possible bias we thus analyzed the concatenated alignment of the 31 genes using a split tree decomposition that allows a more realistic representation of the phylogenetic relationships. Furthermore we constructed a classical bifurcating tree using the highest possible number of genes [all orthologs among the six strains with (1867 genes) and without (2434 genes) L. longbeachae as outgroup]. As shown in Figure 5 the Splits Decomposition phylogeny is network-like suggesting incompatible partitions within sequence data, which commonly arise from recombination. Although the phylogeny based on the orthologous genes can also be affected by recombination, the high number of informative sites included in this data set, should allow recovering the correct history of the species as it has been shown previously for other closely related bacterial species [74].
Taken together, in contrast to previous studies, which reported that the species L. pneumophila is a clonal population [13,14] our results show clearly that a high recombination rate shapes the L. pneumophila genomes. This finding is in line with the natural competence of L. pneumophila. However, some worldwide distributed L. pneumophila clones have been described (e.g. [10]), suggesting that L. pneumophila is able to develop a unique genetic population structure within a particular region or environment as reported recently [72].

Recombination of large chromosomal regions of over 200 kbs among L. pneumophila strains
Our recombination analysis revealed not only intragenic recombination events but also intergenic recombination as recombination was detected when using the entire alignment even with only recombination free genes (Table 8). This finding may be explained by the recombination of fragments encompassing several genes or multiple recombination events involving smaller tracts along the genome. To test this hypothesis we used a method recently developed for the analysis of Streptococcus agalactiae genomes [75]. In order to identify patterns of recombination, nucleotide substitutions between strains were counted in sliding windows across the previously defined core chromosome representing 15 possible pair wise comparisons. Each pair wise comparison revealed highly conserved regions (<0.05% polymorphism on average) and less-conserved regions (>0.7% polymorphism), suggesting the occurrence of recombinational exchanges. When analyzing the different strains in depth we identified in each genome several regions with very low polymorphisms (below 0.05%) suggesting that DNA exchange of these fragments has occurred between the different L. pneumophila strains. Most interestingly, the two French strains Paris and HL 0604 1035 that are present since several years in France show 15 regions of a size between 10 and 99 kbs that have very low polymorphism and thus seem to have been exchanged between them (Additional file 7, Figure  S1). In contrast when comparing strain Lens with the other 5 genomes analyzed here, very few regions with low polymorphism, two with strain HL 0604 1035 and one with strain Lorraine, were detected. Furthermore, no DNA exchanges seem to have occurred with strains Corby, Philadelphia or Paris. This indicates that strains that are frequent in the same environment (e.g. strain Paris and HL 0604 1035) show high rates of DNA exchange probably by conjugation as suggested for Streptococcus agalactiae [75] and Enterococcus fecalis [76]. In contrast strain Lens, which has been identified to date only twice, in Lens (France) and in Germany, very few DNA transfers with the studied L. pneumophila strains seem to have taken place. Furthermore, some regions may be transferred also between several strains.  Figure 6 shows the distribution of single-nucleotide polymorphisms (SNPs) along 330 kb of the genome of L. pneumophila HL 0604 1035, Philadelphia and Lorraine as compared to the same region in the genome of strain Paris. We identified a region of 213 kbs a SNP frequency of 0.005%. Except an indel of 158 bs that shows higher polymorphism, only 11 SNPs are present in this region. This fragment may have evolved by conjugative transfer and recombination between strains Philadelphia and Paris. Among others, this region carries the genes necessary for lipopolysaccaride biosynthesis, that are also part of the smaller fragment that has been exchanged with strain HL 0604 1035. Our analyses suggest, that in addition to frequent intragenic recombination also recombination and horizontal transfer of large chromosomal fragments is taking place and shapes the chromosomes of L. pneumophila.

Conclusion
Analysis of the genome sequences of six L. pneumophila strains shows that the genomes of this environmental pathogen evolve by frequent HGT and high recombination rates. Most interestingly, these events take place between eukaryotes and prokaryotes and among different strains and species of Legionella. A genome-wide map analysis of nucleotide polymorphisms among these six strains demonstrated that each chromosome is a mosaic of large chromosomal fragments from different origins suggesting that exchanges of large DNA regions of over 200 kb have contributed to the genome dynamics in the natural population. The many T4SS might be implicated in exchange of these fragments by conjugal transfer. Plasmids also play a role in genome diversification and are exchanged among strains and circulate even between different species of Legionella. Importantly, plasmids seem to excise and integrate into the genome probably depending on environmental cues. However, L. pneumophila encodes also several toxin anti-toxin that might help to stabilize certain mobile genetic elements. In the near future, the analyses of 100 s of genomes thanks to new generation sequencing combined with molecular studies should provide further clues about the genetic mechanisms and the evolutionary forces that shape the Legionella genomes.

Bacterial strains and sequence accession numbers
The strains sequenced in this study are L. pneumophila strain Lorraine [EMBL: FQ958210, EMBL:FQ958212] and L. pneumophila HL 0604 1035 [EMBL:FQ958211]. Strain Lorraine was isolated in 2004 from a patient and was recently described as a newly emerging endemic clone [26]. L. pneumophila strain HL 0604 1035 (ST 734, Bellingham subgroup of the Dresden panel) was isolated in 2006 from a water supply system in a French hospital that it is colonizing since more than 10 years.

Sequencing and assembly
The complete genome sequence of L. pneumophila subsp. pneumophila strain HL06041035 (A) and strain Lorraine (B) were determined using a Sanger/pyrosequencing hybrid approach. A shotgun library was constructed with 10kb size fragments, obtained after mechanical shearing of the total genomic DNA, and cloned into vector pCNS (pSU derived). Sequencing with vector-based primers was carried out using the ABI 3730 Applera Sequencer. A total of 20736 (A) and 21888 (B) reads (~4 fold-coverage) were analyzed and assembled with 502731 (A) and 555541 (B) reads (~15 fold-coverage) obtained with Genome Sequencer GS20 (Roche Applied Science). For the assembly, we used the Arachne "HybridAssemble" version (Broad Institute, http://www. broad.mit.edu) that combines the contigs obtained with 454 sequencing with Sanger reads. To validate the assembly, the Mekano interface (Genoscope), based on visualization of clone links inside and between contigs, was used to check the clone coverage and misassemblies. In addition, the consensus was confirmed using Consed functionalities http://www.phrap.org: the consensus quality and the high quality discrepancies. The finishing step was achieved by PCR, primer walking and in vitro transposition technology (Template Generation System™ II Kit; Finnzyme, Espoo, Finland), and a total of 930 (A) and 999 (B) sequences (109, 165 and 656 respectively for L. pneumophila subsp. pneumophila strain HL06041035 and 62, 204 and 733 respectively for L pneumophila subsp. pneumophila str. Lorraine) were needed for gap closure and quality assessment.

Sequence analysis and annotation
The two newly sequenced L. pneumophila genomes were integrated into the MicroScope platform [77] to perform automatic and expert annotation of the genes, and comparative analysis with the other L. pneumophila strains already published. In addition the annotations of the previously published genomes were updated. The system integrates, for each predicted gene, the results of multiple bioinformatics methods (Blast result on UniProt and specialized genomic data, InterPro, COG, PRIAM, synteny group computation using the complete bacterial genomes available at NCBI RefSeq, etc; more information on the syntaxic and functional annotation process is given in [78]). In addition, many genomic and metabolic comparative tools are also available [77]. For details see https://www.genoscope.cns.fr/agc/microscope/home/index.php.

Definition of orthologous genes
To define orthologous chromosomal genes among the different L. pneumophila strains, pseudogenes and mobile elements were not taken into account due to the difficulty of ortholog assignment for these genes. Putative orthologous relations were defined as gene couples fulfilling two criteria: (i) having a bidirectional best hit (BBH) with an alignment threshold of 55% identity over at least 60% of the query sequence and target size (ii) and being in synteny. Subsequently, putative genes without any orthologous relation due to reduced identity percentage were integrated in a pre-existing orthologue group if they were flanked by orthologous genes showing gene order conservation (microsynteny). A final step of manual curation was carried out for each doubtful case.

Sequence alignments
For each gene of the selected data set, the nucleotide sequence was aligned based on the amino acid sequence using tranalign/EMBOSS package http://emboss.sourceforge.net/. Subsequently genes were concatenated in different data sets.

Identification of eukaryotic like proteins and eukaryotic domain carrying proteins
Eukaryotic domains were identified by analyzing the results obtained for all genes using the Interpro database that is integrated in MAGE. For the identification of eukaryotic like proteins we developed a new method. First we constructed two databases, one containing all and only eukaryotic sequences retrieved from public databases and a second one containing all and only prokaryotic sequences. From the second database we excluded the proteins of bacterial genera for which eukaryotic like protein-domains have been found in high proportions (e.g. parasites of protozoa) or bacterial genera that are reported to establish a symbiotic relationship with amoeba (for a detailed list see Additional file 8, Table S7). Those proteins, that showed a better, normalized blast score against eukaryotic proteins than to those present in the prokaryotic database were retrieved as eukaryotic like proteins. Parameters established for blast were: minimum identity: 25%; minimum ratio avec query: 60%; minimum ratio avec target: 50%. The final results were manually checked.

Phylogenetic Analysis
For phylogentic reconstruction of the L. pneumophila strains analyzed in this work several data sets were used: (i) 31 housekeeping genes described to be essential for all prokaryotes were selected based on the study of Zeigler [67] (Table 7 and Additional file 9, Figure S2) for a multi locus sequence analysis (MLSA) approach for which gene each was analyzed individually and as a concatenated alignment, (ii) a concatenated alignment of 2434 orthologous genes present in all analyzed L. pneumophila strains (iii) a concatenated alignment of 1867 orthologous genes present in all analyzed L. pneumophila strains and in the selected out group, Legionella longbeachae strain NSW150. An analysis of genetic divergence was performed using DNAsp vs 5.00.07 [79] using the 31 selected housekeeping genes. For phylogenetic reconstruction maximum likelihood (ML) methods were used to infer phylogenetic relationships for all data sets. Prior to ML analyses, a DNA substitution model for each gene or data set was selected using Modeltest v3.06 [80] and the Akaike information criterion. ML heuristic searches were performed using 500 random taxon-addition replicates with tree bisection and reconnection (TBR) and branch swapping. ML bootstrap support was determined using 1000 bootstrap replicates. The ML best trees were rooted on L. longbeachae when added. A network reconstruction was done for the same data set (i) using SplitsTree4 (version 4.10) [81]. The NeighborNet method and the GTR distance model were used to create the network.
position of the corresponding gene starting from the origin of replication.