Population genomics of Vibrionaceae isolated from an endangered oasis reveals local adaptation after an environmental perturbation

Background In bacteria, pan-genomes are the result of an evolutionary “tug of war” between selection and horizontal gene transfer (HGT). High rates of HGT increase the genetic pool and the effective population size (Ne), resulting in open pan-genomes. In contrast, selective pressures can lead to local adaptation by purging the variation introduced by HGT and mutation, resulting in closed pan-genomes and clonal lineages. In this study, we explored both hypotheses, elucidating the pan-genome of Vibrionaceae isolates after a perturbation event in the endangered oasis of Cuatro Ciénegas Basin (CCB), Mexico, and looking for signals of adaptation to the environments in their genomes. Results We obtained 42 genomes of Vibrionaceae distributed in six lineages, two of them did not showed any close reference strain in databases. Five of the lineages showed closed pan-genomes and were associated to either water or sediment environment; their high Ne estimates suggest that these lineages are not from a recent origin. The only clade with an open pan-genome was found in both environments and was formed by ten genetic groups with low Ne, suggesting a recent origin. The recombination and mutation estimators (r/m) ranged from 0.005 to 2.725, which are similar to oceanic Vibrionaceae estimations. However, we identified 367 gene families with signals of positive selection, most of them found in the core genome; suggesting that despite recombination, natural selection moves the Vibrionaceae CCB lineages to local adaptation, purging the genomes and keeping closed pan-genome patterns. Moreover, we identify 598 SNPs associated with an unstructured environment; some of the genes associated with these SNPs were related to sodium transport. Conclusions Different lines of evidence suggest that the sampled Vibrionaceae, are part of the rare biosphere usually living under famine conditions. Two of these lineages were reported for the first time. Most Vibrionaceae lineages of CCB are adapted to their micro-habitats rather than to the sampled environments. This pattern of adaptation is concordant with the association of closed pan-genomes and local adaptation.


Background
Comparative genomics analyses have shown a wide range of genomic variation within bacteria from different phylogenetic groups [1][2][3]. This variation range has been explained in part by the wide ecological niche occupied by different bacterial groups [4][5][6][7][8]. Bacterial genomes, in contrast to eukaryotic genomes, usually maintain constant genome sizes [9,10], suggesting that while horizontal gene transfer (HGT) increases the genome size by adding new genes, selection maintains the genome size by removing deleterious, non-functional or non-useful genes [11][12][13]. Therefore, bacteria can present very different genomic compositions even within a species, with HGT creating a flexible genome and natural selection purging or maintaining it [10,14].
Thus, the type of pan-genome is an indication of the evolutionary "tug of war" between selection and HGT. As a prediction, if there are high rates of HGT, the total genetic pool will increase, as well as the effective population size, generating an open pan-genome maintained by natural selection [15]. However, if there is a selective pressure towards local adaptation, the genetic diversity introduced by HGT will be purged, resulting in a closed pan-genome and clonal lineages [14].
To start understanding the reasons why some pangenomes are open while others are closed, we can analyze the rate and type of recombination. On the one hand, homologous recombination homogenizes populations, keeping them genetically cohesive in a closed pangenome [16,17]. On the other hand, non-homologous recombination brings new genetic material, offering new evolutionary opportunities for diversification and generating an open pan-genome [18][19][20][21]. Selection and the Hill-Robertson effect are expected to operate when recombination decreases the linkage disequilibrium among genes, which avoids the purging of genetic diversity along the genome [22,23]. As a result of this diversity of mechanisms, species with higher recombination levels maintain a large historical effective population size [15,24,25]. In contrast, highly clonal populations with low or no HGT evolve mostly by mutation and genetic drift, because the efficiency of selection is hampered by the Hill-Robertson effect that also reduces the standing levels of variation in the population and the historical effective population sizes [23,26].
In this study, we explored the role of different evolutionary forces shaping the genetic diversity of Vibrionaceae in the oasis of the Cuatro Ciénegas Basin (CCB), Mexico. CCB is composed of several aquatic systems that have a significant unbalance of the nutrient stoichiometry [27]. Population genetic studies of Pseudomonas spp., Exiguobacterium spp. and Bacillus spp. isolated from CCB aquatic systems in general show low recombination levels [28][29][30]. These patterns suggest that nutrient constraints in CCB may work as an ecological filter, reducing recombination maybe due to the cost of replicating new DNA, and leading to local adaptation [27,31,32].
We tested whether the environmental nutrient constraint would affect the genetic structure of Vibrio spp. lineages at CCB. Members of Vibrio spp. have been characterized in general as highly recombinant [33,34]. We analyzed the genetic structure of Vibrionaceae in a particular site of CCB, Pozas Rojas (Fig. 1). This site was the most stoichiometrically unbalanced (N:P 157:1) in our first sampling in 2008. In that study, it was found that Pseudomonaceae was the most abundant family, comprising around 50% of the taxonomical sequences, while only 0.08% corresponded to Vibrionaceae [35]. Later, Pozas Rojas was naturally perturbed with intense rains associated with hurricane Alex in 2010. The runoff detritus and water caused the nutrients ratios to change from extremely unbalanced stoichiometry to a ratio similar the standard values in the sea (N:P 20:1; compared to the Redfield standard N:P 16:1 values of the sea [36]). Given the change in stoichiometry ratios, we asked the following questions: 1) How did a naturally recombinant lineage like some members of Vibrionaceae respond to this perturbation? 2) Did Vibrionaceae lineages maintain their local adaptation to this unique site by restricting recombination, and maintaining their pangenomes closed? Alternatively, 3) Is it possible that Vibrio spp. developed open pan-genomes with large effective population sizes, similar to the lineages in the ocean to deal with this stoichiometric change? [33,34].
Herein we analyzed the role of the evolutionary forces that have shaped Vibrionaceae at CCB by performing a comparative genomics analysis of five reference and 42 strains isolated from two different local environments (i.e., water and sediments) in perturbed Pozas Rojas. Contrary to what we expected, our results show that most CCB Vibrionaceae lineages had similar levels of recombination compared to their oceanic relatives, and much higher levels of recombination than other genera in the CCB [28][29][30]. However, since most of the analyzed lineages had closed pan-genomes, we suggest that most of such recombination is homologous. This type of recombination should promote reproductive isolation and generate local adaptation. We did not observe a clear pattern of adaptation to either water or sediment environments, suggesting that there may be other environmental variables that we were not able to measure that could be driving local adaptation among these lineages.

Results
Nutrients raising shifts the stoichiometric unbalanced and the Vibrionaceae family at the cultivable level Based on Kruskal-Wallis statistical test, the total nutrient concentrations (Carbon (C), Nitrogen (N), and Phosphorus (P)) of the Pozas Rojas were not significantly different between the nine sampled ponds (C: p = 0.8815; N: p = 0.2256 and P: p = 0.9624; Fig. 1; Additional file 1: Table 1), however, they were statistically significant between type of environment (i.e., water vs. sediment: C: p = 3.486e-4; N: p = 0.03798 and P: p = 3.461e-4).
The proportion of C:N:P was on average 350:9:1 for water, and 258:21:1 for sediment (Additional file 1: Table 2). This ratios indicate a stoichiometric "balance" (i.e., similar to Redfield standard ratios) in Pozas Rojas during 2013, due to higher P availability, compared with the extreme stoichiometric imbalance observed in most of CCB sites, and in particular in Pozas Rojas microbial mat during summer 2008 (i.e., 15,820:157:1) [35], previous to the hurricane Alex perturbation.
The AdatptML environmental association analysis [37] showed that strains are structured according to the environment where they were isolated, i.e., water or sediment, and not by pond (Additional file 1: Figure 2). While most clades were specialist either to water (higher nutrient condition) or to sediment (lower nutrient condition), the most abundant lineage had no preference for any environment. Based on this analysis, we selected 42 isolates for further sequencing with Illumina MySeq 2 × 250 and one additional Jr. 454 Roche library for a de novo assembly of the strain V15_P4S5T153. These isolates were chosen as representatives from the different lineages and environments. The genome coverage ranged from 6x to 31x and the N50 values were from 4806 to 143,363 (Additional file 2: Table 4).
Among the 39 CCB sequenced Vibrio spp. genomes, we found variation in terms of genome size, ranging from 3.1 Mbp to 5.1 Mbp, while the three CCB Photobacterium spp. genomes had an average genome size of 4.5 Mbp. Despite this variation, when we compared the CCB strains genomes to their closest reference strain, we found similar genome sizes (Additional file 1: Table 5). The evaluation of the genome completeness showed that 39 (92.8%) of the genomes contained at least 95% of the 452 near-universal single-copy orthologs (BUSCOs) evaluated by the program [38] (Additional file 1: Table 6), suggesting that the observed variation in genome sizes could be due to intrinsic characteristics of each strain and not to a sequencing bias.

Most CCB Vibrionaceae lineages display a closed pangenome pattern
The pan-genome analysis of 39 CCB Vibrio spp., 3 CCB Photobacterium spp., and 5 Vibrio spp. references strains involved a total of 20,121 orthologous gene families. The orthologous gene families were defined by the DeNo-GAP pipeline [39] through HMMs generated by using Vibrio anguillarum 775 as seed reference, with cut-off values of 70% similarity and 70% coverage for query and target sequences. The genes that were present in at least 95% of the genomes conformed the core genome, including reference genomes, composed by 1254 gene families. The accessory genome is far more substantial, consisting of a total of 14,072 genes families that were found in at least two of the obtained genomes. The rest 4795 genes families were strain-specific.
In the core phylogeny, we found seven lineages (Fig. 2), of which six of them were previously identified in the 16S rRNA gene tree, and one was represented by a unique strain of marine V. furnissii sp. Nov. 4 strain (NCTC 11218) [40]. Reference strain V. anguillarum 775, isolated from a Coho salmon [41] clusters within the large generalist Clade II, while reference strain V. metschnikovii CP 69-14, which was isolated in marine systems, is basal to Clade III. Basal to Clade VI are reference V. parahaemoliticus BB22OP, a pre-pandemic strain [42], associated with seafood-borne gastroenteritis in humans and V. alginolyticus NBRC 15630 = ATCC 17749, an aquatic organism that can cause bacteremia. Clades IV and V are likely to be exclusive to CCB, given that there is no closely related strain sequenced on databases. Finally, Clade I is related to Photobacterium spp. (Fig. 2).
From the six clades identified, only Clade II presented an open pan-genome as suggested by the Heaps law analysis [43] (alpha = 0.7913). The rest of the clades displayed closed pan-genome patterns (i.e., alpha values > 1.0; Table 1). We performed random sub-samplings of genomes per clade to verify the effect of sample size, and we re-calculated alpha values from three random genomes of each clade; this test recovered the same pattern as the first test regarding the open or closed pangenome nature of each clade. Taking as an example, the Clade II, which is composed of 24 strains, the analysis identified the clade as with an open pan-genome even when we tested only three genomes from this clade (Additional file 1: Figure 3).

Clades have differences in genetic diversity, effective population sizes and recombination
We found that nucleotide diversity values for Clades III, IV, and V were the lowest within samples, ranging from 2.86E-05 to 0.0051, while Clades I, II, and VI had higher levels of genetic variation, in the range of 0.011 to 0.046 ( Table 2). This same pattern was observed for the θw values ( Table 2) Table 3).
Recombination analysis of 15,380 ortholog clusters showed that only 11% (1759) had a significant signal of recombination (Additional file 2: Table 7). These recombination events occurred more frequently among isolates of the same environment and pond (SPSE), suggesting reproductive isolation associated with an environmental variable (Additional file 1: Figure 5). However, we are aware that only isolates of water or sediment conform most clades. Therefore, we propose that the frequency of recombination events is mostly restricted to occur within clades (Fig. 3).
We evaluated the impact of homologous recombination and mutation within lineages estimating r/m at genome-scale using ClonalFrameML [49]. This measure reflects the ratio of probabilities that a given polymorphism is explained by either recombination (r) or by   (Table 4). We also performed the same analysis on V. parahaemolyticus, V. ordalii, V. anguillarum, and P. leiognathi reference genomes, all isolated from marine environments. For the marine samples, r/m estimates were within the range of CCB strains, except for V. anguillarum, which had the highest values (Table 4). This analysis also shows that some recombination events are shared with Vibrio spp. references strains (Additional file 1: Figure 6), supporting the hypothesis of ancient origin of these recombination events even though more recent recombination events were detected only among CCB strains. This indicates that homologous recombination is a constant (albeit relatively infrequent) source of polymorphism in the analyzed strains.

Structure of clade II and the effective population size supports a recent diversification
In the case of the generalist Clade II, we found substructure. Using Nei's genetic distances, we identified ten genetic groups (that we will call Sub-clades  T. brucei 5,332,244 --- [46] First column shows the names of the CCB Clades and reference strains used for the calculus, second column represents the number of strains within each group, followed by the median N e value estimated and the range. Last two columns display the isolation environment and the reference hereafter) with distances greater than 0.001. The discriminant function shows the same structure as the Nei distances, reflecting a broader relationship between Subclades A, D, F and G, and B with C and E. Meanwhile, H, I, and J Sub-clades had dissimilar sub-structures (Additional file 1: Figure 4). Since only three of the Subclades contained more than two isolates, further analyses were just performed with the larger Sub-clades (A, D, and G). When estimating the nucleotide diversity for Subclades belonging to Clade II (described below, see Additional file 1: Figure 4), we found lower values, π in the range of 1.61E-06 to 5.47E-06 (Table 2). This same pattern was observed for the θw values (Table 2). Regarding the posterior distribution of the effective population size, it was far smaller in the Sub-clades (Sub-clade A N e = 55,938; Sub-clade D N e = 20,849; Sub-clade G N e = 29, 791) than in the other clades, reinforcing the hypothesis of recent diversification in these Sub-clades (Table 3).
Selection analysis of orthologue genes show stronger signals of positive selection within the core genome than in the flexible genome Of a total of 15,380 ortholog clusters analyzed, only 367 (2.3%) had a significant signal of positive selection. Of these ortholog gene families, 297 belonged to the flexible genome, while 70 are part of the core genome. However, when we considered the universe of ortholog genes that conform the flexible genome (14,072), only 2.1% of the flexible genome had signals of positive selection, while in the core genome (composed by 1254 genes) 5.6% of the genes are positive selected (Additional file 2: Table 8). A Gene Ontology (GO) enrichment analysis was performed in order to identify those biological functions Fig. 3 Patterns of recombination events among isolated strains. Heatmap of the frequency of recombination events among different strains; red colors indicate more recombination events within strains while blue events indicate few recombination events. Distances were estimated with the Jaccard dissimilarity index overrepresented given those ortholog clusters with positive selection. Seven GO terms were enriched within these families (Table 5), one of them was the term GO: 0007156, which is associated with cell-cell adhesion; within this category, most of the genes annotated were related to cadherin domains.
Besides those analyses, based on pan-genome information, we looked for specific coding sequences that could be private (unique) to a specific pond, environment, or clade. There were no specific genes associated with a particular environment or pond, but we did identify ortholog gene clusters exclusive per clade. From Clades I to VI, we observed 1280, 10, 72, 23, 72, and no exclusive ortholog gene families, respectively. For each clade with exclusive ortholog gene families, we looked for enriched GO terms. On Clade I, the term related with bacteriocin immunity was enriched; in Clade II the terms associated to siderophore transport were enriched; in Clade III the category related to the biosynthesis of lipopolysaccharides was enriched; and on Clades IV and V terms related to transport were enriched (Additional file 1: Table 9).

Genome association study detected SNPs related to unstructured environment
Based on a whole genome alignment, we obtained 38, 533 single nucleotide polymorphisms (SNPs) variants, from which 26,663 were bi-allelic characters that were used in an UPGMA analysis of genetic distances. This analysis produced the same clustering as the core genome phylogeny (Fig. 2). With these SNPs, we performed a membership probability test, which shows that all the isolates had the same probability of being isolated from any pond and environment (Additional file 1: Figure 7).
We found on average 2473 private (unique) SNPs for each one of the nine ponds, 33,655 private SNPs for water or sediment environments, and 29,141, private SNPs for each of the six clades. This abundance of private SNPs suggests an effect of the environment, either by local adaptation (selection) or by genetic drift (low effective sizes or little or no gene flow).
We removed the SNPs with a minor allele frequency < 0.05 (771 SNPs removed) and we kept the alleles that were found in at least three individuals, for a total of 25, 892 SNPs. Within those SNPs, we detected a total of 598 SNPs with an association to the sediment environment. An UPGMA analysis of these 598 SNPs was performed in order to infer the similarity between samples (Fig. 4), finding most of the clusters previously observed with the core genome phylogeny (Fig. 2), except Clade III, which appears inside Clade II. Moreover, the mixed isolates of Clade III fall among the Sub-clade G of Clade II, and most of them were isolated from water environment, as well as members of Clade III (Fig. 4), suggesting a preference for diluted, unstructured environments.
To analyze the distribution of the SNPs, we mapped the above detected 598 SNPs to their positions in the genome alignment from where they were obtained, moving in 1 Kb windows. A total of 144 genomic regions containing SNPs were inspected, and we found 237 ortholog gene families in these regions. From these ortholog gene families, only 24 showed recombination signals, while 18 had selection signals (Additional file 2: Table 10). Within those SNPs we performed a test for  GO-term enrichment with TopGO [50]. From the 24 ortholog genes families with recombination signals, we detected four enriched GO, while we found only one enriched GO-term in the 18 ortholog gene families with selection signals ( Table 6). One of the functional enriched GO terms within these genes was the GO: 0006814, which is involved in sodium transport; some of the genes annotated within this category were the bacterial Na+/H+ antiporter B (NhaB).

Discussion
In this study, we performed comparative genomic analyses to understand how evolutionary forces shaped the pan-genome of 42 Vibrionaceae strains isolated from CCB, where environmental filtering is believed to increase local adaptation due to extreme stoichiometric bias [27]. We described how a natural perturbation lead to a temporal balanced stoichiometry, allowing six lineages of Vibrionaceae to prosper under a "feast-famine"

Ecology and microbial diversity in CCB
During the past 20 years, one of the main questions surrounding CCB bacterial hyper-diversity has been related to the roles of ecology and evolution promoting and maintaining its remarkable microbial diversity [27,51]. According to Souza et al. [27] "lost world" hypothesis, the extreme unbalanced stoichiometry (i.e., very low P availability) of CCB not only keeps the "ancestral niche" of many bacterial lineages, but also works as a semipermeable barrier to migration, restricting migration and keeping these ancient bacterial lineages alive and thriving in CCB [27]. As a result of these ecological and evolutionary conditions, CCB lineages are generally clonal [28][29][30], displaying an ancient marine ancestry [27,32,52]. Paradoxically, this extremely unbalanced stoichiometry seems to be, in part, the reason behind CCB high microbial endemicity and local differentiation: "No food, no sex, no travel" [27,31,32], allowing for local adaptation and broad differentiation between sites.
In this study, we explored the evolutionary dynamics after a natural perturbation (in this case a flood) changed the ecological conditions in CCB in a particular site (Pozas Rojas), generating a temporarily more "balanced" stoichiometric proportions (i.e., N:P 20:1). We know by meteorological data that similar floods occur at CCB sporadically due to the low incidence of intense storms (i.e., only three since 1940 [53]). The flood introduced to this lowland a large amount of debris that, with time, generated an increase in nutrients, in particular phosphorus that in turn opened opportunities for the "rare biosphere", represented by standing bacterial lineages usually found at very low proportions, like the rare members of Vibrionaceae that normally are not common at the standard low nutrient conditions [54][55][56].
Given the change in resources in Pozas Rojas, we proposed two hypotheses when we started the study: Vibrionaceae from CCB would show as their ocean counterparts an open pan-genome, showing high levels of recombination and genetic variation, as well as a high N e . Alternatively, due to local adaptation in each lineage of CCB, Vibrionaceae would display closed pangenomes, and a strong genetic structure, generated by high clonality and low genetic variation probably related to periodic selection and small effective population sizes.

Vibrionaceae in CCB
In a previous study at Pozas Rojas using both cultivated strains and metagenomic data, Bonilla-Rosso et al. found that Vibrio spp. were either very rare or absent [54]. In their study, the authors found mostly Pseudomonads among the cultivated strains [54]. This result was confirmed with metagenomics, where Pseudomonadales, Burkholderiales, and Bacillales represented 50% of the metagenome reads. As a result of this previous knowledge, in the 2013 sampling, we first used PIA media to analyze the effect of the 2010 flood in the previously abundant genera. However, we found that this lineage was replaced in the cultures by Vibrio spp. In other words, the increased levels of nutrients and the perturbation reduced the abundances of Pseudomonas and related genera in CCB. This effect was corroborated later in another system in CCB (Churince) with a nutrient enrichment experiment [55,57]. Among the analyzed genomes, we found two clades of Vibrionaceae, Clades IV and V, that had not been isolated previously and could be endemic to the basin.

Recombination, pan-genomes, and selection in Vibrionaceae
For this study, we sequenced 42 strains, which were selected to include most of the Vibrionaceae cultivable clades in our collection. However, for some of these clades, only a few numbers of strains were recovered, so we chose three at the minimum, which allowed us to have statistical support. As this reduced number of First two columns show the enriched GO IDs and its name, with signals of recombination or selection. Third column the number of annotated genes, fourth and fifth column the number of significant genes and the expected, last column shows the significance corrected with Bonferroni samples for some clades could have an impact on the analyses, we included different correction methods for our analyses.
In the case of diversity measures, π and θw showed lower diversity than cosmopolitan E. coli [58], nevertheless, for Clades I, II and VI, those values are comparable to the ones observed in pathogenic Vibrio spp. [59,60] suggesting similar demographic dynamics. Tajima's D was in most cases negative, except for Clade II, but none of the values were statistically significant. This could suggest bottlenecks in the process of diversification, explaining the extremely low effective population size and diversity in those Sub-clades. Negative values of Tajima's D indicate high content of rare alleles, which is in agreement with the private alleles test we performed [61]. In the same way, it could be the result of selective sweeps or recent demographic expansion as a result of the new nutrient conditions (feast).
We believe that the natural disturbance at Pozas Rojas generated by an increase in nutrient availability relaxed selection against HGT. Nevertheless recombination is kept within close lineages, resulting in large effective population sizes and a closed pan-genome in most of the lineages, allowing selection to act in response to environmental pressures [47,62,63]. The closed pangenome of these lineages contrast to what has been reported in oceanic Vibrio spp. where population sizes are large and pan-genomes are kept open due to HGT [64]. Even though Clade II is the only one with an open pangenome, its internal substructure suggest a recent process of diversification where each of its sub-clades shows again a closed pan-genome, with smaller N e and low genetic diversity, a similar pattern of N e and low genetic diversity has been observed in natural populations of E. coli [58].

Selection and adaptation in Pozas Rojas
We found 367 gene families that have signals of positive selection, most of them ortholog genes found in the core genome (2.05% of the flexible genome and 5% of the core genome; Additional file 2: Table 9). This result suggests that selection purges the genes that are in the flexible genome, closing the pan-genomes. Among the detected genes with selection signals, we found several genes related to cadherin domains that are associated with biofilm formation [65]. In natural environments, biofilm formation allows bacteria to cope with environmental changes, protects the cell, provides mechanical stability and provides cellular adhesion with other cells or with surfaces. It has been observed that biofilm formation is a persistent characteristic among bacteria from CCB in both water and sediment, and also under different nutrient conditions [57].
When we analyzed unique genes for each clade disregarding the isolation environment, in the case of Clade I we found the term GO:0030153 enriched, which is related to bacteriocin immunity. However, antibiotic resistance associated genes did not show particular signals of selection, suggesting that overall there is no ongoing selective pressure for defense. In the large generalist Clade II, we found three GO terms enriched, two of them related to cell wall structure while the third is related to siderophore transport, a group of genes that were rare in the previous metagenomic analysis of the same site [35]. In the case of Clade III, the enriched GO term is related to lipopolysaccharide biosynthesis. Meanwhile, in Clade IV, we identified six enriched GO terms, where most of them were related to transport and signal transduction. Finally, for Clade V, we identified GO four terms enriched mostly related to transport. These results suggest that distinct clades are indeed responding to their environment in different ways reinforcing the idea of genetic isolation as a way to preserve local adaptation.
When we performed a genome-wide association study (GWAS) test to analyze the association of the SNPs to either water or sediment environment, we identified 598 SNPs related to sediment. The UPGMA analysis showed a similar clustering pattern as the core genome (Fig. 4), suggesting a clade effect. However, Cluster III grouped among the Sub-clade G of Clade II, and most of the isolates of this Sub-clade as well as Clade III were isolated from the water environment. One possibility is that these SNPs are important to the adaptation to nonstructured environments such as water. Some of the genes associated to these SNPs presented signals of recombination and selection.
Our data suggest that there is a selective pressure over some clades regarding the unstructured environment, as shows the enriched GO term GO:0006814, with signals of positive selection. This term is related to sodium transport, and among the genes annotated within this category were the bacterial Na+/H+ antiporter B (NhaB), that has been suggested to play a role in the adaptation of halophilic and haloalkaliphilic proteobacteria to marine habitats [66]. This gene has also been found to play a role in homeostasis in Vibrio spp. [67].

Conclusions
At CCB, most of the environments present an extremely low phosphorus concentration, a factor that acts as an effective migration barrier, maintaining conditions of the ancient sea as well as ancestral microbial diversity [27]. However, due to natural perturbation, we had the opportunity to observe in Pozas Rojas what happens when that nutrimental barrier is lifted temporarily. Apparently, rare biosphere strains that normally had a hard time surviving low P conditions can follow a feast-famine cycles and have population expansion when the P availability is less limiting.
In order to understand the other dimensions of local adaptation, further sampling of Vibrio spp. in CCB is needed. Unfortunately, this extraordinary oasis is disappearing, given the loss of more than 95% of CCB wetlands due to groundwater overexploitation by agriculture [27,51,56,68].

Site description
We analyzed bacterial isolates from sediment and water of nine ponds in the Pozas Rojas area of CCB (Fig. 1). This site is composed of several small ponds (locally called pozas) that surround a larger pond in the system of Los Hundidos [30,35]. These small ponds become hypersaline in summer [30], and used to have the highest stoichiometric unbalance (i.e., lowest P concentration) reported in CCB (C:N:P 15820:157:1) [35]. The ponds have seasonal high fluctuations in temperature (around 1°C in winter to up to 60°C in some summer moments in some cases) [35] and are small but permanent, separated from each other by ca. 9 m or more, along an arch around the larger pond. However, the Pozas Rojas were flooded by hurricane Alex during summer 2010, merging most of the small ponds into a single large pond, until autumn 2011, when the water receded, leaving the moon shaped array of small red ponds at the same place (Fig. 1).

Sample collection and strains isolation
We collected water and sediment samples in duplicate from nine ponds located in Pozas Rojas, Los Hundidos, CCB, during March 2013 and stored them at 4°C until processing. Sediment was collected for nutrient analysis in 50 ml Falcon tubes and covered with aluminum foil before storage. Water was collected for nutrient quantification in 1 l volumes and stored in the dark at 4°C. Chemical analyses were performed at the Instituto de Investigaciones en Ecosistemas y Sustentabilidad, UNAM, in Morelia, Mexico. Cultivable strains from both sediment and water were isolated in PIA (Pseudomonas isolation agar) and TCBS (Thiosulfate Citrate Bile Sucrose Agar) as previously described [57,69], obtaining a total of 174 isolates, being 88 isolates from sediment and 86 from water (Additional file 1: Table 3).

Environmental variables measurement
For nutrient quantification, sediment samples were dried, and water samples were filtered through a Millipore 0.42 μm filter. Total carbon (TC) and inorganic carbon (IC) were determined by combustion and colorimetric detection [70] using a total carbon analyzer (UIC model CM5012, Chicago, USA). Total organic carbon (TOC) was calculated as the difference between TC and IC. For total N (TN) and total P (TP) determination, samples were acid digested with H 2 SO 4 , H 2 O 2 , K 2 SO 4 and CuSO 4 at 360°C. Soil N was determined by the macro-Kjeldahl method [71], while P was determined by the molybdate colorimetric method following ascorbic acid reduction [72]. The N and P forms analyzed were determined colorimetrically in a Bran-Luebbe Auto analyzer 3 (Norderstedt, Germany).

DNA extraction and PCR amplification of 16S rRNA
For the 174 isolates obtained, DNA extraction was performed as described by Aljanabi and Martinez (1997) [73]. 16S rRNA genes were amplified using universal primers 27F (5′-AGA GTT TGA TCC TGG CTC AG-3′) and 1492R (5′-GGT TAC CTT GTT ACG ACT T-3′) [74]. All reactions were carried out in an Applied Biosystems Veriti 96 Well Thermal cycler (California, USA) using an Amplificasa DNA polymerase (BioTec-Mol, Mexico) with the following program: 94°C for 5 min, followed by 30 cycles consisting of 94°C for 1 min, 50°C for 30 s, 72°C for 1 min and 72°C for 5 min. Polymerase chain reaction (PCR) amplification products were electrophoresed on 1% agarose gels. Sanger sequencing was performed at the University of Washington High-Throughput Genomics Center.

Phylogenetic analysis of 16S rRNA sequences
The first 700 bps of the 16S rRNA gene, were aligned with Clustalw [75] and quality control was performed with Mothur [76]. Genera level identification was made using the classifier tool [77] from the Ribosomal Database Project (RDP) Release 11.4 [78] (Additional file 1: Table 3). Blastn searches were performed against Refseq database from NCBI to select reference sequences.
A total of 110 sequences were identified as members of the Vibrionaceae family, 43 were isolates from water and 67 from sediment. Based on the previous taxonomic assignment, the sequences of Vibrio alginolyticus, V. parahaemolyticus, V. anguillarum, V. metschnikovii and Photobacterium spp. were included as references. These strains were used in subsequent analyses. A maximum likelihood phylogenetic reconstruction was obtained with PhyML version 3.0 [79], using the HKY + I + G substitution model estimated with jModelTest 2 [80]. The degree of support for the branches was determined with 1000 bootstrap iterations.

Environmental association of phylogroups
To test whether the community of cultivable strains was structured based on its isolation environment (i.e., water or sediment), we performed an AdaptML analysis [37], including our 110 isolates belonging to Vibrionaceae and a Halomonas spp. strain as an out-group. Three categorical environmental variables were tested, including pond of isolation, high and low nutrient concentrations, and the two sampled environments (water or sediment).

Genome sequencing, assembly, and annotation
For whole-genome sequencing, we selected from the AdaptML analysis 39 Vibrio spp. isolates, 23 isolated from sediment and 16 from water, plus 3 isolates of Photobacterium spp. (a lineage closely related to the Vibrio genus) isolated from sediment. DNA extractions were performed with the DNeasy Blood and Tissue kit (Qiagen).
Sequencing was performed with Illumina MiSeq 2 × 250 technology, with insert libraries of 650 bps and an expected coverage of ca.10x per genome. At first, we planned an assembly strategy using a genome reference; for this reason, the strain V15_P4S5T153 had a second library that was designed using the Jr. 454 Roche technology, in order to reduce sequencing bias and get higher coverage. However, due to divergence among genomes, we performed de novo assemblies for all genomes. All sequencing was performed at the Laboratorio Nacional de Genómica para la Biodiversidad (LANGE-BIO), México.
We estimated the core genome based on presence and absence of gene families across the genomes. If the genes were present in all strains, the orthologs were classified as core, while genes were classified as accessory when present in more than one strain but not in all of them, and unique genes when they were present only in a single strain. Since most of the genomes in our dataset are not completely sequenced, we designated core ortholog families as those present in at least 95% of the genomes, to avoid the impact of missing genes due to sequencing or assembly artifacts.
The package Micropan [87] within R v.3.4 (R Core Team) [88] was used to infer the open or closed nature of each pan-genome dataset, following the Heaps law proposed by Tettelin et al. [43]. The Heaps law model is fitted to the number of new gene clusters observed when genomes are ordered randomly. The model has two parameters: an intercept, and a decay parameter called alpha. If alpha is higher than 1.0 the pan-genome is considered closed, if alpha is lower than 1.0 it is considered open. Additionally, a random sub-sampling for each clade was made, taking three genomes and calculating the alpha value for each group of three genomes. A total of 1000 independent sub-sampling events were made for each clade.
Core proteins were aligned using Kalign [89] to infer the phylogenetic relationship between the samples. The resulting alignments of individual ortholog families were concatenated using a custom Perl script. With these concatenated core genes, a maximum likelihood phylogenetic tree was constructed using the FastTree program [90].

Recombination analyses
Of the total ortholog families in the Vibrio spp. pangenome, we only used the ortholog families found in at least three genomes for the recombination analyses. Genetic recombination was examined on each coding sequence (CDS) alignment by using inference of pairwise recombination sites, obtained with GENECONV [91] and by the identification of putative recombinant sequences through breakpoints using GARD [92].
We estimated the number of recombination events, considering the pan-genome size, number of strains per clade and branch length, according to the following classification: (i) isolates of the same pond and environment (SPSE), (ii) isolates of the different pond and environment (DPDE), (iii) isolates of the same pond and different environment (SPDE) and (iv) isolates of different pond and same environment (DPSE). For this, we normalized the data by pan-genome size, number of strains and branch length. Given that the large generalist Clade II presented a clear sub-structure, we did a separated analysis for the shorter branches within Clade II (Additional file 1: Figure 4).
To assess the impact of homologous recombination, we analyzed the substitution pattern using two different algorithms, Gubbins [93] and ClonalFrameML [49]. A whole-genome alignment for the 47 analyzed genomes was performed with MAUVE [94]. The resulting alignment was used as input for Gubbins [93] using RAxML for the phylogenetic inference [95] and default parameters. Additionally, whole genome alignments were performed for each clade, excluding references, with the progressive MAUVE algorithm [94]. We calculated the R/theta ratio, nu and delta [49] for each sample and for 100 bootstrapped replicates.

Genetic structure of clade II
Recombination analyses showed that in Clade II there are internal groups with higher internal recombination, so we decided to further investigate the structure within Clade II. For clustering analyses, we used Nei's genetic distance [96] and Neighbor Joining. Genomes with distance less than 0.001 were grouped and tested with a discriminant analysis of principal components of the genetic variation, using the adegenet library in R [97]. For this study, we used 20 principal components and 3 discriminant functions.

Selection analyses
We used FUBAR [98] to identify signatures of positive selection among ortholog gene families found in at least three genomes. We accounted for recombination breakpoints in the ortholog families, while calculating positively selected sites based on GARD results [92]. We considered any site to be positively selected if it showed P-value < 0.05. We also conducted a Gene Ontology (GO) enrichment analysis using topGO [50] to find overrepresented biological functions in this set of genes.

Effective population size estimation
We followed a simulation approach to estimate the posterior distribution of the effective population size (N e ) of each of the six clades. According to the previous clustering and recombination analysis, for Clades I, III, IV, V and VI we simulated a single population, while for Clade II we simulated three sub-populations that diverged from an ancestral population.
Simulations were performed using Fastsimcoal2 [44,45]. For each clade, we simulated DNA sequences having a similar length equal to the number of nucleotides in the given clade, as well as a sample size equal to the number of sequences sampled for each clade. We assumed no recombination within the genome, and used the Escherichia coli mutation rate of 2.2 × 10 − 10 mutations per nucleotide per generation [99]. We ran between two and four simulations for each clade. For the initial runs, we generated 100,000 replicates extracting N e values from a prior log-uniform distribution that ranged from 100,000 to 20,000,000 individuals. For Clade II, we also estimated the age of divergence of each Sub-clade, by setting the prior distribution of time ranging from 1000 to 4,000,000 generations. After a first run, we narrowed the prior ranges based on those simulations that had similar summary statistics compared to the observed data and performed another 100,000 simulations using the narrowed priors.
To compare the previously simulated and observed data based on summary statistics, we used the ape [100] and pegas [101] libraries in R to estimate the number of polymorphic sites and the Tajima's D based on the entire genomes. Tajima's D is commonly used to estimate demographic changes in populations [102,103]. Also, we obtained 1000 sliding windows frames to estimate the Tajima's D along the genomes, as well as the mean and standard deviation of Tajima's D. Tajima's D, π, and Watterson's theta (θw) were estimated for each clade as well as for Sub-clades A, B and G. Since Clades I and VI had three sequences and it was not possible to obtain Tajima's D, we did 1000 replicates in which we subsampled with replacement 10 sequences. For each replicate, we calculated Tajima's D and we obtained as the proximate value the median estimated across the 1000 replicates.
Based on the summary statistics, we used the abc function in the ABC package [104] in R to calculate the distribution of the N e parameter based on a 0.05% threshold distance between the simulated and observed data. For each clade, we reported the median and the 95% interval confidence of N e . For Clade II, we further reported the average and 95% interval confidence of the number of generations since each Sub-clade diverged from an ancestral clade.

Association between genotypes and environmental variables
We evaluated whether the genetic variation within the Vibrionaceae genomes could be explained by particular adaptations to the environment (water or sediment). We used progressiveMauve [94] to perform a global multiple alignment between the assembled genomes. We extracted the variant sites within the alignment and exported them as SNPs using snp-sites [105].
We obtained 38,533 SNPs, which we used to search for private alleles using Poppr [106]. Afterwards, we obtained a subset of 25,892 SNPs by filtering biallelic sites with minor allele frequencies > 0.05. We used PLINK [107] to perform a GWAS to detect possible associations between our SNPs set and either the water or sediment environments. We conducted Fisher exact tests and regarded as significant all SNPs whose associations had p-values < 0.01 after Bonferroni corrections. These analyses may be informative even considering these sampling differences [108,109].
To test whether these associations could be explained by convergent evolution rather than by common ancestry, we compared an UPGMA tree reconstructed from the total set of SNPs from an UPGMA tree using only the SNPs that were significantly associated to the environment. We analyzed the distribution of the SNPs within the genomes to find the genes associated to those SNPs.
We mapped the SNPs positions in the genome alignment moving by 1 Kb windows; this window size was selected considering the average bacterial gene size and retrieved all the associated genes. We conducted a Gene Ontology (GO) enrichment analysis using topGO [50] to find overrepresented biological functions in this set of genes.
Additional file 1: Table S1. Nutrient measures across sampled points. Table S2. Ratio of nutrients concentrations. Table S3. RDP classification of partial 16S rRNA. Table S5. General information of the 42 Vibrionaceae genomes and 5 references used in this study. Table S6. Results of the Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis (see methods) using 452 BUSCOS of 721 species of the Gammaproteobacteria database. Table S9. GO terms enriched at the unique genes of each clade. Figure S1. Phylogenetic reconstruction of 16S RNA sequences. Figure S2. AdaptML analysis. Figure S3. Random analysis of alpha values within each clade. Figure S4. Analysis of structure of Clade II. Figure S5. Frequency of recombination events. Figure S6. Recombination events across a whole genome alignment. Figure S7. UPGMA and membership of the CCB strains.
Additional file 2: Table S4. Genome assembly overview. Table S7. GeneConv results. Table S8. Gene clusters in which positive selection was found. Table S10. Orthologue cluster genes containig SNPs associated to the strain isolation environment.