Representativeness of the strains used in this study
This study included 27 previously sequenced genomes of Escherichia coli (Table 1). To assess how representative these genomes are of the global diversity of the species, we compared them to the Escherichia coli reference collection (ECOR)  on the basis of two Multi-Locus Sequence Typing schemes which together spanned a total of 15 genes [16, 47]. The resulting phylogeny (Figure 2) highlighted the six previously described lineages of E. coli, designated A, B1, B2, E, D and F [14, 16]. Overall, the 27 strains covered much of the diversity of E. coli, with eight strains in clade A, seven in clade B1, five in clade E and seven in clade B2 (Figure 2; Table 1). In each of these four clades, the strains seem to represent much of the within-clade diversity rather than being closely related within the clade. However, two clades were not represented in this genomic panel: clade D and clade F. Three genomes from these phylogroups (IAI39 , SMS-3-5  and UMN026 ) were initially intended to be included, but were removed because they showed evidence for significant deviation from the assumption of a fixed molecular clock (Additional file 1: Figure S1). Figure 2 indicates how the diversity of the genomes in this study relates with that of the ECOR strains, however, it should be noted that the issue of biased sampling of bacterial isolates is frequent and it is never possible to be sure of the representativeness of a sample [4, 58].
Reconstruction of the clonal genealogy
Aligning the 27 genomes using progressiveMauve [18, 49, 50] allowed us to compare their genomic content. As more genomes are considered in the analysis, the cumulative size of genomic regions shared by them decreased down to about 3.3Mbp, or about two thirds of each genome (Figure 3). Since this length is roughly constant whether 10, 15, 20 or all 27 genomes are aligned (Figure 3), these regions are likely to represent the core-genome of E. coli, which means that homologs of these regions would be found in virtually any sequenced genome. We found 765 core regions present in the genomes of all 27 strains, with total length 3,306,899bp. These core regions were input into ClonalFrame  in order to estimate the clonal genealogy in a way that accounts for homologous recombination which can confuse the signal of clonal inheritance . This aspect is important because recombination has been reported to be frequent in E. coli by a large number of previous studies [14, 16, 47, 60, 61]. The inferred clonal genealogy (Figure 4A) consisted of four clades corresponding to A, B1, B2 and E. The relationships between genomes within clades were fully resolved, which is typically not achievable with MLST (eg. [14, 16]). The relationships between clades were unbalanced, with clade A and B1 most closely related to each other, and clade B2 most distant from any other clade. The stemminess (ie. the ratio of internal to external branch lengths) of this tree was compatible with expectation under the standard coalescent model (Additional file 3: Figure S2), suggesting no evidence for population size variation during the evolution of E. coli[62–64].
Analysis of the dispensable genome
In contrast to the core regions described above, non-core regions are found only in a strict subset of the genomes. The set of non-core regions is called the dispensable genome and together with the core genome forms the pan-genome [65–67]. The cumulative length of the non-core regions continues to increase up to the 27th genome, showing no sign of flattening, with each new genome adding about 250Kbp of previously unobserved sequence (Figure 3). This distribution has been observed before, including in E. coli, and its pan-genome has consequently been called “open” [23, 65–67]. However, an important difference between these previous studies and ours is that in Figure 3 the lengths of genomic material are measured directly whereas previous studies counted the number of genes. Our analysis is therefore robust to the problem of identifying homologous families of genes. Nevertheless, this result indicates that the pan-genome of species with a high diversity and ecological plasticity such as E. coli draws from a large repertoire of genes that can be gained and lost through lateral gene transfer [5, 67].
The similarity of the genomes in terms of genomic content was calculated from the patterns of presence and absence of non-core regions (Figure 4B). Compared with the clonal genealogy (Figure 4A), the clade structure is only partly preserved in this tree of genome content, with clades B2 and E intact but clades A and B1 intermingled. Clade B1 was split into three parts which were perfectly congruent with pathotypes. The three EHEC strains 12009, 11368 and 11128  formed one separate cluster. The two commensal strains IAI1  and SE11  and the only ETEC strain E24377A  constituted another separate cluster, in which the two commensal strains were closest to each other. Finally, the EAEC strain 55989  was on a separate branch in spite of its close relationship with the commensal strains IAI1 and SE11 in the clonal genealogy (Figure 4A). This subdivision of B1 in terms of genomic content has been partially hinted at before  and the fact that it is congruent with pathotypes suggests that it is linked with differences in ecological and pathogenic lifestyles. The presence or absence of genomic regions in the 27 observed genomes is the result of a process of gain and loss of content by the ancestors of the genomes since their evolution from a common ancestor. If gain and loss happened randomly and at constant rates, the tree based on genomic content (Figure 4B) would be expected to to be very similar to the tree based on homology of the core-genome (Figure 4A) since the evolution of both core and pan genomes would then follow the same molecular clock. The two trees were however highly different, indicating that the non-homologous recombination process (gain and loss of regions) did not follow a strict molecular clock. GenoPlast  was used to infer the non-homologous recombination events that happened in the context of the clonal genealogy inferred by ClonalFrame (Figure 4A) under a model where the rates of gain and loss are allowed to change. The results of the GenoPlast analysis are shown in Figure 5, with differences in the rates of gain and loss on specific branches spanning two orders of magnitude. The rates of gain and (to a lesser extent) loss of genomic material were found to be higher on the short recent branches within clades A, E and B2 than on older and longer branches, which explained the higher stemminess of the genomic content tree (Figure 4B) compared with the clonal genealogy (Figure 4A).
The branch directly above EHEC strain 12009 had the largest amount of gain of any branch (1405 Kbp) whereas the branch above the common ancestor of the other two B1 EHEC strains 11368 and 11128 was the highest amount of gain for an internal branch (788 Kbp; with the exception of the very long branch above clade E). Amongst the genomic material gained on these two branches, 265 Kbp were shared by the three genomes, which explained why they clustered together in Figure 4B. The distribution of this gain on the three genomes (Additional file 4: Figure S3) indicated that their convergence in genomic content happened as a result of multiple gain events that happened both on the branch above 12009 and on the branch above the common ancestor of 11368 and 11128. The convergence in genomic content of the three EHEC B1 strains was therefore reciprocal rather than unidirectional. Few convergence events were found on the branches directly above 11368 and 11128 (Additional file 4: Figure S3) in spite of considerable gain on these branches (1009Kbp and 645Kbp respectively), which could indicate that the convergence in gene content with 12009 is not on-going. Unsurprisingly, this convergence involved several genes known to be EHEC determinants, including Shiga toxins  and all genes from the locus of enterocyte effacement (or LEE ). However, it also included additional genes, such as flagellar genes (fli) and a few metabolic clusters (frl and gal) with a notable presence of genes involved in aromatic compounds metabolism (hpa
mhp and mhp). These genes were not present in the other B1 strains examined in this study, which may indicate that acquiring EHEC determinants via HGT is an important means of E. coli adaptation, possibly enhanced by the differences in host-associated selective pressures on EHEC compared to commensals or more opportunistic pathotypes.
Homologous recombination hotspots in Escherichia coli
To quantify the propensity, genomic distribution and directionality of homologous recombination during the evolution of E. coli, we applied ClonalOrigin  to the 765 core regions and assuming the clonal relationships between genomes estimated by ClonalFrame  in Figure 4A. The average length of fragments involved in homologous recombination was estimated at δ=542bp. This is almost ten times higher than a previous estimate in E. coli, but is of the same order as recent whole-genome estimates in Bacillus cereus, Helicobacter pylori or Chlamydia trachomatis. The relative rate of occurrence of recombination and mutation  was estimated at ρ
=0.0128/0.0125=1.024 which means that overall recombination happened just as frequently as mutation. The estimated rate of homologous recombination was fairly constant throughout the genome (Figure 6), with the exception of three clear hotspots (highlighted in grey) in which recombination rates were significantly higher. This included two large regions around the rfb operon involved in synthesis of the O antigen (positions 2,020 to 2,190 Kbp in the reference genome K-12/MG1655 ) and around the fimA gene (positions 4,420 to 4,620 Kbp). These two regions had been reported previously as hotspots of diversity and recombination [23, 78].
A smaller recombination hotspot was also detected, made of just two nearly adjacent core regions (between positions 2,442 and 2,447 Kbp). This region had a similarly high recombination rate as the two regions above, but had not previously been detected as a hotspot, perhaps because of its small size (around 5 Kbp). This hotspot contained genes yfcL
prmB and smrB. The gene mepA encodes for a murein endopeptidase  whose role is presumably to restructure the bacterial cell wall during elongation or stabilise the peptidoglycan. Mutational analyses on mepA[79, 80] do not provide enough information to explain why recombination should be high for this gene. In the bacterial cell, aroC governs the synthesis of chorismate, a key precursor to the biosynthesis of aromatic compounds including the amino acids tryptophan and phenylalanine but also the siderophore enterobactin. The positive maintenance of a functional allele of aroC is arguably crucial for the cell to maintain appropriate levels of these amino acids and siderophores in natural conditions. In Salmonella, as well as in Brucella suis, aroC is required for virulence. Incidentally, aroC is a common target to produce knocked-out attenuated vaccine strains , for instance in Salmonella serovars Typhi [84–86] and Typhimurium [81, 85, 87, 88], pathogenic E. coli, Brucella suis, Burkholderia pseudomallei or Edwardsiella tarda. To our knowledge, this is the first mention of aroC being part of a recombination hotspot, giving additional clues on evolutionary dynamics at this locus. Depending on how aroC is involved with virulence in E. coli, it may be under selective pressure from the immune system of the host, which could explain the observed peak in recombination rate , but this hypothesis will need further work to be fully assessed.
Flux of homologous recombination
The numbers of recombination events inferred by ClonalOrigin were counted for every combination of clades receiving and donating, and these values were compared with their expectation under the ClonalOrigin model which represents a close approximation to the coalescent model with gene conversion [21, 55, 56]. This comparison revealed significant non-uniformity in the homologous recombination flux within and between clades (Figure 7). The three clades A, B1 and B2 had higher numbers of within-clade recombination than expected, whereas clade E had almost exactly the expected number. On the other hand, the number of recombination events detected between clades was almost systematically below its expected value, with the only exception being recombination from clade A to B1 and vice-versa which had slightly higher than expected values. Clades A and B1 are the two most closely related phylogroups (Figure 4A) which may contribute to explain this observation. Overall, inter-phylogroup recombination fluxes were lower than intra-phylogroup ones, which is compatible with the hypothesis that there is a preferred way of gene sharing within phylogroups . This preferred exchange among strains of the same phylogroups could be explained by the possibility that the different E. coli phylogroups have slightly distinct ecological overlaps, which makes the likelihood of gene transfer higher among them than between them.
A similar analysis as above was performed on a branch-by-branch basis rather than a clade-by-clade basis (Figure 8), the only added difficulty being that some donor/recipient pairs of branches have too low numbers of expected and observed recombination events for the comparison to be meaningful (represented in grey in Figure 8).This analysis confirmed the general pattern described above, with more recombination than expected within-clades and between A and B1, and less recombination between all other clades. However, it also allowed the comparison of the individual behaviours of strains belonging to the same clade. For instance, strains BL21 and REL606  showed little history of importing recombination from clade B1, contrasting with ATCC8739  or HS  even though all four strains belong to clade A. This may be explained by the fact that these two strains are laboratory-adapted derived from E. coli strain B [27, 93], so that they would have had little or no opportunity for recent encounter and recombination with B1 strains.The four K-12 strains in this study [28–31] were also laboratory-adapted, but had terminal branches too small to reliably estimate deviations in the number of recombination events. These four strains all originated from bioengineering manipulation on K-12 lineages over the last century and therefore harbour a very limited number of differences between them compared to what would be observed in natural populations.
The B1 strains 11128 and 11368  showed significantly less sign of import from clade A (and to a lesser extent from B1) than other strains of B1. This observation implies that these EHEC strains have stopped recombining with strains of clade A (which are all commensals) as they adapted to this new pathogenic lifestyle. Two of the highest values throughout Figure 8 were the ones corresponding to imports from strains 11128 and 11386 into strain 12009 . As previously noted, these are the only three EHEC strains in B1, and these three genomes have been converging in genomic content due to numerous non-homologous recombination events. This result indicates that the three genomes also have an extensive history of convergence through homologous recombination, which may have occurred at the same time as the gain of new shared genes. The evolutionary history of these three genomes seems therefore analogous to that of Salmonella serovars Typhi and Paratyphi A, for which both core and pan genomes converged through recombination as they were progressively adapting to exclusive infection of the human host .
Speciation in E. coli
In the analysis of homologous recombination described above, three groups corresponding respectively to phylogroups E, B2, and A+B1 exhibited more recombination within than between one another (Figures 7 and 8). This pattern is compatible with a definition of speciation in bacteria in which recombination plays the role of a cohesive force counterbalancing divergence by genetic drift and population structure, and where species appear when this force is weakened between lineages [1, 2, 94]. Under such a model, patterns of genetic diversity can be generated in silico similar to those observed for example in Salmonella enterica[95, 96]. The three groups might therefore represent lineages that, because of slightly distinct ecologies or notable variations in the species life cycle, have gradually diverged too far from one another for recombination to play its cohesive role, so that they might eventually become separate species, should these variations remain or increase. In other words, all E. coli phylogenetic backgrounds are found in the gut of endotherms  which is their primary environment, and to some extent in nonhost secondary environments [8, 9] but it sounds plausible that phylogroup-associated variations in ecological fitness in different hosts or secondary environments could gradually decrease the physical and ecological overlap of strains from different phylogroups through time, and therefore the genetic flux between them. A number of studies seem to support this hypothesis, as different proportions of the different phylogroups are found in different environments and hosts [8, 9, 97, 98]. Additionally, some phylogroups seem to harbour strains that have either host-restricted or more generalist lifestyles , as well as strains that are either resident or transients in their ability to colonize the gut . Our study contributes to highlight that such variations in ecology could potentially have an impact on genetic exchange in E. coli.
An additional number of factors can be evoked to explain why the three groups would be diverging, including differences in their geographic distribution, adaptative selection, or simply as a result of the dependence of recombination on homology between donor and recipient [4, 94, 96, 101, 102]. The three groups are clearly separate in terms of genomic content (Figure 4B) which could explain why they recombine less with each other and why clades A and B1 still recombine frequently since they are not differentiated in terms of genomic content. To test this hypothesis, we compared the distribution of the number of recombination events found in the middle and at the edge of core regions (Additional file 5: Figure S4). We found that recombination happened more often in the middle of core regions at a small but highly significant level (Kolmogorov-Smirnov test; p-value=8.8e-09). If the variable genomic content is not just a random process, then homologous recombination would be expected to happen less often around these genes, a concept sometimes called fragmented speciation or “species in pieces” [103–106] as it would predict that speciation can apply differentially across the genome. Our results therefore demonstrate that fragmented speciation applies to E. coli, and that difference in genomic content is at least one of the factors driving the divergence of the three lineages.