Skip to main content

Population genomics of Vibrionaceae isolated from an endangered oasis reveals local adaptation after an environmental perturbation



In bacteria, pan-genomes are the result of an evolutionary “tug of war” between selection and horizontal gene transfer (HGT). High rates of HGT increase the genetic pool and the effective population size (Ne), resulting in open pan-genomes. In contrast, selective pressures can lead to local adaptation by purging the variation introduced by HGT and mutation, resulting in closed pan-genomes and clonal lineages. In this study, we explored both hypotheses, elucidating the pan-genome of Vibrionaceae isolates after a perturbation event in the endangered oasis of Cuatro Ciénegas Basin (CCB), Mexico, and looking for signals of adaptation to the environments in their genomes.


We obtained 42 genomes of Vibrionaceae distributed in six lineages, two of them did not showed any close reference strain in databases. Five of the lineages showed closed pan-genomes and were associated to either water or sediment environment; their high Ne estimates suggest that these lineages are not from a recent origin. The only clade with an open pan-genome was found in both environments and was formed by ten genetic groups with low Ne, suggesting a recent origin. The recombination and mutation estimators (r/m) ranged from 0.005 to 2.725, which are similar to oceanic Vibrionaceae estimations. However, we identified 367 gene families with signals of positive selection, most of them found in the core genome; suggesting that despite recombination, natural selection moves the Vibrionaceae CCB lineages to local adaptation, purging the genomes and keeping closed pan-genome patterns. Moreover, we identify 598 SNPs associated with an unstructured environment; some of the genes associated with these SNPs were related to sodium transport.


Different lines of evidence suggest that the sampled Vibrionaceae, are part of the rare biosphere usually living under famine conditions. Two of these lineages were reported for the first time. Most Vibrionaceae lineages of CCB are adapted to their micro-habitats rather than to the sampled environments. This pattern of adaptation is concordant with the association of closed pan-genomes and local adaptation.


Comparative genomics analyses have shown a wide range of genomic variation within bacteria from different phylogenetic groups [1,2,3]. This variation range has been explained in part by the wide ecological niche occupied by different bacterial groups [4,5,6,7,8]. Bacterial genomes, in contrast to eukaryotic genomes, usually maintain constant genome sizes [9, 10], suggesting that while horizontal gene transfer (HGT) increases the genome size by adding new genes, selection maintains the genome size by removing deleterious, non-functional or non-useful genes [11,12,13]. Therefore, bacteria can present very different genomic compositions even within a species, with HGT creating a flexible genome and natural selection purging or maintaining it [10, 14].

Thus, the type of pan-genome is an indication of the evolutionary “tug of war” between selection and HGT. As a prediction, if there are high rates of HGT, the total genetic pool will increase, as well as the effective population size, generating an open pan-genome maintained by natural selection [15]. However, if there is a selective pressure towards local adaptation, the genetic diversity introduced by HGT will be purged, resulting in a closed pan-genome and clonal lineages [14].

To start understanding the reasons why some pan-genomes are open while others are closed, we can analyze the rate and type of recombination. On the one hand, homologous recombination homogenizes populations, keeping them genetically cohesive in a closed pan-genome [16, 17]. On the other hand, non-homologous recombination brings new genetic material, offering new evolutionary opportunities for diversification and generating an open pan-genome [18,19,20,21]. Selection and the Hill-Robertson effect are expected to operate when recombination decreases the linkage disequilibrium among genes, which avoids the purging of genetic diversity along the genome [22, 23]. As a result of this diversity of mechanisms, species with higher recombination levels maintain a large historical effective population size [15, 24, 25]. In contrast, highly clonal populations with low or no HGT evolve mostly by mutation and genetic drift, because the efficiency of selection is hampered by the Hill-Robertson effect that also reduces the standing levels of variation in the population and the historical effective population sizes [23, 26].

In this study, we explored the role of different evolutionary forces shaping the genetic diversity of Vibrionaceae in the oasis of the Cuatro Ciénegas Basin (CCB), Mexico. CCB is composed of several aquatic systems that have a significant unbalance of the nutrient stoichiometry [27]. Population genetic studies of Pseudomonas spp., Exiguobacterium spp. and Bacillus spp. isolated from CCB aquatic systems in general show low recombination levels [28,29,30]. These patterns suggest that nutrient constraints in CCB may work as an ecological filter, reducing recombination maybe due to the cost of replicating new DNA, and leading to local adaptation [27, 31, 32].

We tested whether the environmental nutrient constraint would affect the genetic structure of Vibrio spp. lineages at CCB. Members of Vibrio spp. have been characterized in general as highly recombinant [33, 34]. We analyzed the genetic structure of Vibrionaceae in a particular site of CCB, Pozas Rojas (Fig. 1). This site was the most stoichiometrically unbalanced (N:P 157:1) in our first sampling in 2008. In that study, it was found that Pseudomonaceae was the most abundant family, comprising around 50% of the taxonomical sequences, while only 0.08% corresponded to Vibrionaceae [35]. Later, Pozas Rojas was naturally perturbed with intense rains associated with hurricane Alex in 2010. The runoff detritus and water caused the nutrients ratios to change from extremely unbalanced stoichiometry to a ratio similar the standard values in the sea (N:P 20:1; compared to the Redfield standard N:P 16:1 values of the sea [36]). Given the change in stoichiometry ratios, we asked the following questions: 1) How did a naturally recombinant lineage like some members of Vibrionaceae respond to this perturbation? 2) Did Vibrionaceae lineages maintain their local adaptation to this unique site by restricting recombination, and maintaining their pan-genomes closed? Alternatively, 3) Is it possible that Vibrio spp. developed open pan-genomes with large effective population sizes, similar to the lineages in the ocean to deal with this stoichiometric change? [33, 34].

Fig. 1
figure 1

Study site, Pozas Rojas in Los Hundidos within Cuatro Ciénegas Basin, Mexico. Sampling sites are signaled in yellow. Cuatro Ciénegas location is also shown in a map (Pozas Rojas photos were provided by David Jaramillo, a map showing the location of Cuatro Ciénegas Valley was obtained from Google Earth,

Herein we analyzed the role of the evolutionary forces that have shaped Vibrionaceae at CCB by performing a comparative genomics analysis of five reference and 42 strains isolated from two different local environments (i.e., water and sediments) in perturbed Pozas Rojas. Contrary to what we expected, our results show that most CCB Vibrionaceae lineages had similar levels of recombination compared to their oceanic relatives, and much higher levels of recombination than other genera in the CCB [28,29,30]. However, since most of the analyzed lineages had closed pan-genomes, we suggest that most of such recombination is homologous. This type of recombination should promote reproductive isolation and generate local adaptation. We did not observe a clear pattern of adaptation to either water or sediment environments, suggesting that there may be other environmental variables that we were not able to measure that could be driving local adaptation among these lineages.


Nutrients raising shifts the stoichiometric unbalanced and the Vibrionaceae family at the cultivable level

Based on Kruskal-Wallis statistical test, the total nutrient concentrations (Carbon (C), Nitrogen (N), and Phosphorus (P)) of the Pozas Rojas were not significantly different between the nine sampled ponds (C: p = 0.8815; N: p = 0.2256 and P: p = 0.9624; Fig. 1; Additional file 1: Table 1), however, they were statistically significant between type of environment (i.e., water vs. sediment: C: p = 3.486e-4; N: p = 0.03798 and P: p = 3.461e-4).

The proportion of C:N:P was on average 350:9:1 for water, and 258:21:1 for sediment (Additional file 1: Table 2). This ratios indicate a stoichiometric “balance” (i.e., similar to Redfield standard ratios) in Pozas Rojas during 2013, due to higher P availability, compared with the extreme stoichiometric imbalance observed in most of CCB sites, and in particular in Pozas Rojas microbial mat during summer 2008 (i.e., 15,820:157:1) [35], previous to the hurricane Alex perturbation.

Using two different isolation media, (i) PIA (Pseudomonas isolation agar) and (ii) TCBS (Thiosulfate Citrate Bile Sucrose Agar), we obtained 174 isolates from the 9 sampled ponds, 88 isolates from sediment and 86 from water. The taxonomic classification of the partial sequence of the 16S RNA of those isolates revealed that the collection was dominated by Vibrionaceae (63%, 110 strains), followed by Aeromonadaceae (14%, 24 strains) and Halomonadaceae (9.7%, 17 strains; Additional file 1: Table 3). Among Vibrionaceae, we identified three different genera; most strains belonged to Vibrio spp. (91.8%, 101 strains), far less to the related genus Photobacterium (6.4%, 7 strains), and 2 to Listionella genus (1.8%). Six different lineages were identified within Vibrio spp., and one additional lineage corresponded to Photobacterium spp. (Additional file 1; Fig. 1).

The AdatptML environmental association analysis [37] showed that strains are structured according to the environment where they were isolated, i.e., water or sediment, and not by pond (Additional file 1: Figure 2). While most clades were specialist either to water (higher nutrient condition) or to sediment (lower nutrient condition), the most abundant lineage had no preference for any environment. Based on this analysis, we selected 42 isolates for further sequencing with Illumina MySeq 2 × 250 and one additional Jr. 454 Roche library for a de novo assembly of the strain V15_P4S5T153. These isolates were chosen as representatives from the different lineages and environments. The genome coverage ranged from 6x to 31x and the N50 values were from 4806 to 143,363 (Additional file 2: Table 4).

Among the 39 CCB sequenced Vibrio spp. genomes, we found variation in terms of genome size, ranging from 3.1 Mbp to 5.1 Mbp, while the three CCB Photobacterium spp. genomes had an average genome size of 4.5 Mbp. Despite this variation, when we compared the CCB strains genomes to their closest reference strain, we found similar genome sizes (Additional file 1: Table 5). The evaluation of the genome completeness showed that 39 (92.8%) of the genomes contained at least 95% of the 452 near-universal single-copy orthologs (BUSCOs) evaluated by the program [38] (Additional file 1: Table 6), suggesting that the observed variation in genome sizes could be due to intrinsic characteristics of each strain and not to a sequencing bias.

Most CCB Vibrionaceae lineages display a closed pan-genome pattern

The pan-genome analysis of 39 CCB Vibrio spp., 3 CCB Photobacterium spp., and 5 Vibrio spp. references strains involved a total of 20,121 orthologous gene families. The orthologous gene families were defined by the DeNoGAP pipeline [39] through HMMs generated by using Vibrio anguillarum 775 as seed reference, with cut-off values of 70% similarity and 70% coverage for query and target sequences. The genes that were present in at least 95% of the genomes conformed the core genome, including reference genomes, composed by 1254 gene families. The accessory genome is far more substantial, consisting of a total of 14,072 genes families that were found in at least two of the obtained genomes. The rest 4795 genes families were strain-specific.

In the core phylogeny, we found seven lineages (Fig. 2), of which six of them were previously identified in the 16S rRNA gene tree, and one was represented by a unique strain of marine V. furnissii sp. Nov. 4 strain (NCTC 11218) [40]. Reference strain V. anguillarum 775, isolated from a Coho salmon [41] clusters within the large generalist Clade II, while reference strain V. metschnikovii CP 69–14, which was isolated in marine systems, is basal to Clade III. Basal to Clade VI are reference V. parahaemoliticus BB22OP, a pre-pandemic strain [42], associated with seafood-borne gastroenteritis in humans and V. alginolyticus NBRC 15630 = ATCC 17749, an aquatic organism that can cause bacteremia. Clades IV and V are likely to be exclusive to CCB, given that there is no closely related strain sequenced on databases. Finally, Clade I is related to Photobacterium spp. (Fig. 2).

Fig. 2
figure 2

Core gene phylogeny of the 1254 orthologs. Maximum-likelihood phylogenetic reconstruction of core genes, supporting branch values are shown. Each square represents the isolation environment, water or sediment, while yellow stars indicate reference strains. Isolation pond is indicated by its number. Clades are distinguished with colors. Clades IV and V which are likely to be exclusive to CCB are highlighted with an asterisk

From the six clades identified, only Clade II presented an open pan-genome as suggested by the Heaps law analysis [43] (alpha = 0.7913). The rest of the clades displayed closed pan-genome patterns (i.e., alpha values > 1.0; Table 1). We performed random sub-samplings of genomes per clade to verify the effect of sample size, and we re-calculated alpha values from three random genomes of each clade; this test recovered the same pattern as the first test regarding the open or closed pan-genome nature of each clade. Taking as an example, the Clade II, which is composed of 24 strains, the analysis identified the clade as with an open pan-genome even when we tested only three genomes from this clade (Additional file 1: Figure 3).

Table 1 Pan-genome metrics of each Vibrionaceae clades isolated from Poza Rojas, CCB

Clades have differences in genetic diversity, effective population sizes and recombination

We found that nucleotide diversity values for Clades III, IV, and V were the lowest within samples, ranging from 2.86E-05 to 0.0051, while Clades I, II, and VI had higher levels of genetic variation, in the range of 0.011 to 0.046 (Table 2). This same pattern was observed for the θw values (Table 2). Due to the number of individuals we could not obtain Tajima’s D estimate for Clades I and VI. For the rest of the clades, Tajima’s D values were negative, except for Clade II, that had positive values. The posterior distribution of the effective population size (Ne) estimated with Fastsimcoal2 [44] ranging from millions in the specialist Clades I (Ne = 12,822,270), III (Ne = 15,018,880), and V (Ne = 9,594,874) to intermediate in the range of thousands in the Clades IV (Ne = 383,067) and VI (Ne = 141,870; Table 3).

Table 2 Genetic diversity statistics of Vibrionaceae clades isolated from Poza Rojas, CCB
Table 3 Estimates of effective population sizes (Ne) of Vibrionaceae clades isolated from Poza Rojas, CCB, obtained through simulations with Fastsimcoal2 [44, 45], and comparative values from other organisms

Recombination analysis of 15,380 ortholog clusters showed that only 11% (1759) had a significant signal of recombination (Additional file 2: Table 7). These recombination events occurred more frequently among isolates of the same environment and pond (SPSE), suggesting reproductive isolation associated with an environmental variable (Additional file 1: Figure 5). However, we are aware that only isolates of water or sediment conform most clades. Therefore, we propose that the frequency of recombination events is mostly restricted to occur within clades (Fig. 3).

Fig. 3
figure 3

Patterns of recombination events among isolated strains. Heatmap of the frequency of recombination events among different strains; red colors indicate more recombination events within strains while blue events indicate few recombination events. Distances were estimated with the Jaccard dissimilarity index

We evaluated the impact of homologous recombination and mutation within lineages estimating r/m at genome-scale using ClonalFrameML [49]. This measure reflects the ratio of probabilities that a given polymorphism is explained by either recombination (r) or by mutation (m). Clade VI displayed the lowest r/m values = 0.0052, while Clade I (i.e., Photobacterium spp.) had the highest value in our dataset, r/m = 2.72 (Table 4). We also performed the same analysis on V. parahaemolyticus, V. ordalii, V. anguillarum, and P. leiognathi reference genomes, all isolated from marine environments. For the marine samples, r/m estimates were within the range of CCB strains, except for V. anguillarum, which had the highest values (Table 4). This analysis also shows that some recombination events are shared with Vibrio spp. references strains (Additional file 1: Figure 6), supporting the hypothesis of ancient origin of these recombination events even though more recent recombination events were detected only among CCB strains. This indicates that homologous recombination is a constant (albeit relatively infrequent) source of polymorphism in the analyzed strains.

Table 4 Recombination vs. mutation estimates of Vibrionaceae clades isolated from Poza Rojas, CCB

Structure of clade II and the effective population size supports a recent diversification

In the case of the generalist Clade II, we found sub-structure. Using Nei’s genetic distances, we identified ten genetic groups (that we will call Sub-clades hereafter) with distances greater than 0.001. The discriminant function shows the same structure as the Nei distances, reflecting a broader relationship between Sub-clades A, D, F and G, and B with C and E. Meanwhile, H, I, and J Sub-clades had dissimilar sub-structures (Additional file 1: Figure 4). Since only three of the Sub-clades contained more than two isolates, further analyses were just performed with the larger Sub-clades (A, D, and G).

When estimating the nucleotide diversity for Sub-clades belonging to Clade II (described below, see Additional file 1: Figure 4), we found lower values, π in the range of 1.61E-06 to 5.47E-06 (Table 2). This same pattern was observed for the θw values (Table 2). Regarding the posterior distribution of the effective population size, it was far smaller in the Sub-clades (Sub-clade A Ne = 55,938; Sub-clade D Ne = 20,849; Sub-clade G Ne = 29,791) than in the other clades, reinforcing the hypothesis of recent diversification in these Sub-clades (Table 3).

Selection analysis of orthologue genes show stronger signals of positive selection within the core genome than in the flexible genome

Of a total of 15,380 ortholog clusters analyzed, only 367 (2.3%) had a significant signal of positive selection. Of these ortholog gene families, 297 belonged to the flexible genome, while 70 are part of the core genome. However, when we considered the universe of ortholog genes that conform the flexible genome (14,072), only 2.1% of the flexible genome had signals of positive selection, while in the core genome (composed by 1254 genes) 5.6% of the genes are positive selected (Additional file 2: Table 8). A Gene Ontology (GO) enrichment analysis was performed in order to identify those biological functions overrepresented given those ortholog clusters with positive selection. Seven GO terms were enriched within these families (Table 5), one of them was the term GO:0007156, which is associated with cell-cell adhesion; within this category, most of the genes annotated were related to cadherin domains.

Table 5 GO terms enriched estimated with TopGO [50], regarding the gene families with signals of positive selection

Besides those analyses, based on pan-genome information, we looked for specific coding sequences that could be private (unique) to a specific pond, environment, or clade. There were no specific genes associated with a particular environment or pond, but we did identify ortholog gene clusters exclusive per clade. From Clades I to VI, we observed 1280, 10, 72, 23, 72, and no exclusive ortholog gene families, respectively. For each clade with exclusive ortholog gene families, we looked for enriched GO terms. On Clade I, the term related with bacteriocin immunity was enriched; in Clade II the terms associated to siderophore transport were enriched; in Clade III the category related to the biosynthesis of lipopolysaccharides was enriched; and on Clades IV and V terms related to transport were enriched (Additional file 1: Table 9).

Genome association study detected SNPs related to unstructured environment

Based on a whole genome alignment, we obtained 38,533 single nucleotide polymorphisms (SNPs) variants, from which 26,663 were bi-allelic characters that were used in an UPGMA analysis of genetic distances. This analysis produced the same clustering as the core genome phylogeny (Fig. 2). With these SNPs, we performed a membership probability test, which shows that all the isolates had the same probability of being isolated from any pond and environment (Additional file 1: Figure 7).

We found on average 2473 private (unique) SNPs for each one of the nine ponds, 33,655 private SNPs for water or sediment environments, and 29,141, private SNPs for each of the six clades. This abundance of private SNPs suggests an effect of the environment, either by local adaptation (selection) or by genetic drift (low effective sizes or little or no gene flow).

We removed the SNPs with a minor allele frequency < 0.05 (771 SNPs removed) and we kept the alleles that were found in at least three individuals, for a total of 25,892 SNPs. Within those SNPs, we detected a total of 598 SNPs with an association to the sediment environment. An UPGMA analysis of these 598 SNPs was performed in order to infer the similarity between samples (Fig. 4), finding most of the clusters previously observed with the core genome phylogeny (Fig. 2), except Clade III, which appears inside Clade II. Moreover, the mixed isolates of Clade III fall among the Sub-clade G of Clade II, and most of them were isolated from water environment, as well as members of Clade III (Fig. 4), suggesting a preference for diluted, unstructured environments.

Fig. 4
figure 4

UPGMA of the 598 SNPs associated with the isolation environment. Tip colors represent clade membership, for Clade II, Sub-clades are also indicated. Squares represent the isolation environment. Distances were calculated with the bitwise distance function of poppr v2.8.1

To analyze the distribution of the SNPs, we mapped the above detected 598 SNPs to their positions in the genome alignment from where they were obtained, moving in 1 Kb windows. A total of 144 genomic regions containing SNPs were inspected, and we found 237 ortholog gene families in these regions. From these ortholog gene families, only 24 showed recombination signals, while 18 had selection signals (Additional file 2: Table 10). Within those SNPs we performed a test for GO-term enrichment with TopGO [50]. From the 24 ortholog genes families with recombination signals, we detected four enriched GO, while we found only one enriched GO-term in the 18 ortholog gene families with selection signals (Table 6). One of the functional enriched GO terms within these genes was the GO:0006814, which is involved in sodium transport; some of the genes annotated within this category were the bacterial Na+/H+ antiporter B (NhaB).

Table 6 GO terms enriched in the genes found to have an association with the isolation environment (water or sediment)


In this study, we performed comparative genomic analyses to understand how evolutionary forces shaped the pan-genome of 42 Vibrionaceae strains isolated from CCB, where environmental filtering is believed to increase local adaptation due to extreme stoichiometric bias [27]. We described how a natural perturbation lead to a temporal balanced stoichiometry, allowing six lineages of Vibrionaceae to prosper under a “feast-famine” cycle. Most of these lineages present large effective population sizes as well as recombination rates comparable to their oceanic counterparts. However, their pan-genomes remained closed, probably due to selection purging HGT events external to each clade where genetic isolation has maintained clade specific selective events. Clade II is the exception, this large clade shows an open pan-genome with evidence of substructure with small effective sizes, suggesting early stages of diversification.

Ecology and microbial diversity in CCB

During the past 20 years, one of the main questions surrounding CCB bacterial hyper-diversity has been related to the roles of ecology and evolution promoting and maintaining its remarkable microbial diversity [27, 51]. According to Souza et al. [27] “lost world” hypothesis, the extreme unbalanced stoichiometry (i.e., very low P availability) of CCB not only keeps the “ancestral niche” of many bacterial lineages, but also works as a semipermeable barrier to migration, restricting migration and keeping these ancient bacterial lineages alive and thriving in CCB [27]. As a result of these ecological and evolutionary conditions, CCB lineages are generally clonal [28,29,30], displaying an ancient marine ancestry [27, 32, 52]. Paradoxically, this extremely unbalanced stoichiometry seems to be, in part, the reason behind CCB high microbial endemicity and local differentiation: “No food, no sex, no travel” [27, 31, 32], allowing for local adaptation and broad differentiation between sites.

In this study, we explored the evolutionary dynamics after a natural perturbation (in this case a flood) changed the ecological conditions in CCB in a particular site (Pozas Rojas), generating a temporarily more “balanced” stoichiometric proportions (i.e., N:P 20:1). We know by meteorological data that similar floods occur at CCB sporadically due to the low incidence of intense storms (i.e., only three since 1940 [53]). The flood introduced to this lowland a large amount of debris that, with time, generated an increase in nutrients, in particular phosphorus that in turn opened opportunities for the “rare biosphere”, represented by standing bacterial lineages usually found at very low proportions, like the rare members of Vibrionaceae that normally are not common at the standard low nutrient conditions [54,55,56].

Given the change in resources in Pozas Rojas, we proposed two hypotheses when we started the study: Vibrionaceae from CCB would show as their ocean counterparts an open pan-genome, showing high levels of recombination and genetic variation, as well as a high Ne. Alternatively, due to local adaptation in each lineage of CCB, Vibrionaceae would display closed pan-genomes, and a strong genetic structure, generated by high clonality and low genetic variation probably related to periodic selection and small effective population sizes.

Vibrionaceae in CCB

In a previous study at Pozas Rojas using both cultivated strains and metagenomic data, Bonilla-Rosso et al. found that Vibrio spp. were either very rare or absent [54]. In their study, the authors found mostly Pseudomonads among the cultivated strains [54]. This result was confirmed with metagenomics, where Pseudomonadales, Burkholderiales, and Bacillales represented 50% of the metagenome reads. As a result of this previous knowledge, in the 2013 sampling, we first used PIA media to analyze the effect of the 2010 flood in the previously abundant genera. However, we found that this lineage was replaced in the cultures by Vibrio spp. In other words, the increased levels of nutrients and the perturbation reduced the abundances of Pseudomonas and related genera in CCB. This effect was corroborated later in another system in CCB (Churince) with a nutrient enrichment experiment [55, 57]. Among the analyzed genomes, we found two clades of Vibrionaceae, Clades IV and V, that had not been isolated previously and could be endemic to the basin.

Recombination, pan-genomes, and selection in Vibrionaceae

For this study, we sequenced 42 strains, which were selected to include most of the Vibrionaceae cultivable clades in our collection. However, for some of these clades, only a few numbers of strains were recovered, so we chose three at the minimum, which allowed us to have statistical support. As this reduced number of samples for some clades could have an impact on the analyses, we included different correction methods for our analyses.

In the case of diversity measures, π and θw showed lower diversity than cosmopolitan E. coli [58], nevertheless, for Clades I, II and VI, those values are comparable to the ones observed in pathogenic Vibrio spp. [59, 60] suggesting similar demographic dynamics. Tajima’s D was in most cases negative, except for Clade II, but none of the values were statistically significant. This could suggest bottlenecks in the process of diversification, explaining the extremely low effective population size and diversity in those Sub-clades. Negative values of Tajima’s D indicate high content of rare alleles, which is in agreement with the private alleles test we performed [61]. In the same way, it could be the result of selective sweeps or recent demographic expansion as a result of the new nutrient conditions (feast).

We believe that the natural disturbance at Pozas Rojas generated by an increase in nutrient availability relaxed selection against HGT. Nevertheless recombination is kept within close lineages, resulting in large effective population sizes and a closed pan-genome in most of the lineages, allowing selection to act in response to environmental pressures [47, 62, 63]. The closed pan-genome of these lineages contrast to what has been reported in oceanic Vibrio spp. where population sizes are large and pan-genomes are kept open due to HGT [64]. Even though Clade II is the only one with an open pan-genome, its internal substructure suggest a recent process of diversification where each of its sub-clades shows again a closed pan-genome, with smaller Ne and low genetic diversity, a similar pattern of Ne and low genetic diversity has been observed in natural populations of E. coli [58].

Selection and adaptation in Pozas Rojas

We found 367 gene families that have signals of positive selection, most of them ortholog genes found in the core genome (2.05% of the flexible genome and 5% of the core genome; Additional file 2: Table 9). This result suggests that selection purges the genes that are in the flexible genome, closing the pan-genomes. Among the detected genes with selection signals, we found several genes related to cadherin domains that are associated with biofilm formation [65]. In natural environments, biofilm formation allows bacteria to cope with environmental changes, protects the cell, provides mechanical stability and provides cellular adhesion with other cells or with surfaces. It has been observed that biofilm formation is a persistent characteristic among bacteria from CCB in both water and sediment, and also under different nutrient conditions [57].

When we analyzed unique genes for each clade disregarding the isolation environment, in the case of Clade I we found the term GO:0030153 enriched, which is related to bacteriocin immunity. However, antibiotic resistance associated genes did not show particular signals of selection, suggesting that overall there is no ongoing selective pressure for defense. In the large generalist Clade II, we found three GO terms enriched, two of them related to cell wall structure while the third is related to siderophore transport, a group of genes that were rare in the previous metagenomic analysis of the same site [35]. In the case of Clade III, the enriched GO term is related to lipopolysaccharide biosynthesis. Meanwhile, in Clade IV, we identified six enriched GO terms, where most of them were related to transport and signal transduction. Finally, for Clade V, we identified GO four terms enriched mostly related to transport. These results suggest that distinct clades are indeed responding to their environment in different ways reinforcing the idea of genetic isolation as a way to preserve local adaptation.

When we performed a genome-wide association study (GWAS) test to analyze the association of the SNPs to either water or sediment environment, we identified 598 SNPs related to sediment. The UPGMA analysis showed a similar clustering pattern as the core genome (Fig. 4), suggesting a clade effect. However, Cluster III grouped among the Sub-clade G of Clade II, and most of the isolates of this Sub-clade as well as Clade III were isolated from the water environment. One possibility is that these SNPs are important to the adaptation to non-structured environments such as water. Some of the genes associated to these SNPs presented signals of recombination and selection.

Our data suggest that there is a selective pressure over some clades regarding the unstructured environment, as shows the enriched GO term GO:0006814, with signals of positive selection. This term is related to sodium transport, and among the genes annotated within this category were the bacterial Na+/H+ antiporter B (NhaB), that has been suggested to play a role in the adaptation of halophilic and haloalkaliphilic proteobacteria to marine habitats [66]. This gene has also been found to play a role in homeostasis in Vibrio spp. [67].


At CCB, most of the environments present an extremely low phosphorus concentration, a factor that acts as an effective migration barrier, maintaining conditions of the ancient sea as well as ancestral microbial diversity [27]. However, due to natural perturbation, we had the opportunity to observe in Pozas Rojas what happens when that nutrimental barrier is lifted temporarily. Apparently, rare biosphere strains that normally had a hard time surviving low P conditions can follow a feast-famine cycles and have population expansion when the P availability is less limiting.

In order to understand the other dimensions of local adaptation, further sampling of Vibrio spp. in CCB is needed. Unfortunately, this extraordinary oasis is disappearing, given the loss of more than 95% of CCB wetlands due to groundwater overexploitation by agriculture [27, 51, 56, 68].


Site description

We analyzed bacterial isolates from sediment and water of nine ponds in the Pozas Rojas area of CCB (Fig. 1). This site is composed of several small ponds (locally called pozas) that surround a larger pond in the system of Los Hundidos [30, 35]. These small ponds become hypersaline in summer [30], and used to have the highest stoichiometric unbalance (i.e., lowest P concentration) reported in CCB (C:N:P 15820:157:1) [35]. The ponds have seasonal high fluctuations in temperature (around 1 °C in winter to up to 60 °C in some summer moments in some cases) [35] and are small but permanent, separated from each other by ca. 9 m or more, along an arch around the larger pond. However, the Pozas Rojas were flooded by hurricane Alex during summer 2010, merging most of the small ponds into a single large pond, until autumn 2011, when the water receded, leaving the moon shaped array of small red ponds at the same place (Fig. 1).

Sample collection and strains isolation

We collected water and sediment samples in duplicate from nine ponds located in Pozas Rojas, Los Hundidos, CCB, during March 2013 and stored them at 4 °C until processing. Sediment was collected for nutrient analysis in 50 ml Falcon tubes and covered with aluminum foil before storage. Water was collected for nutrient quantification in 1 l volumes and stored in the dark at 4 °C. Chemical analyses were performed at the Instituto de Investigaciones en Ecosistemas y Sustentabilidad, UNAM, in Morelia, Mexico. Cultivable strains from both sediment and water were isolated in PIA (Pseudomonas isolation agar) and TCBS (Thiosulfate Citrate Bile Sucrose Agar) as previously described [57, 69], obtaining a total of 174 isolates, being 88 isolates from sediment and 86 from water (Additional file 1: Table 3).

Environmental variables measurement

For nutrient quantification, sediment samples were dried, and water samples were filtered through a Millipore 0.42 μm filter. Total carbon (TC) and inorganic carbon (IC) were determined by combustion and colorimetric detection [70] using a total carbon analyzer (UIC model CM5012, Chicago, USA). Total organic carbon (TOC) was calculated as the difference between TC and IC. For total N (TN) and total P (TP) determination, samples were acid digested with H2SO4, H2O2, K2SO4 and CuSO4 at 360 °C. Soil N was determined by the macro-Kjeldahl method [71], while P was determined by the molybdate colorimetric method following ascorbic acid reduction [72]. The N and P forms analyzed were determined colorimetrically in a Bran-Luebbe Auto analyzer 3 (Norderstedt, Germany).

DNA extraction and PCR amplification of 16S rRNA

For the 174 isolates obtained, DNA extraction was performed as described by Aljanabi and Martinez (1997) [73]. 16S rRNA genes were amplified using universal primers 27F (5′-AGA GTT TGA TCC TGG CTC AG-3′) and 1492R (5′-GGT TAC CTT GTT ACG ACT T-3′) [74]. All reactions were carried out in an Applied Biosystems Veriti 96 Well Thermal cycler (California, USA) using an Amplificasa DNA polymerase (BioTecMol, Mexico) with the following program: 94 °C for 5 min, followed by 30 cycles consisting of 94 °C for 1 min, 50 °C for 30 s, 72 °C for 1 min and 72 °C for 5 min. Polymerase chain reaction (PCR) amplification products were electrophoresed on 1% agarose gels. Sanger sequencing was performed at the University of Washington High-Throughput Genomics Center.

Phylogenetic analysis of 16S rRNA sequences

The first 700 bps of the 16S rRNA gene, were aligned with Clustalw [75] and quality control was performed with Mothur [76]. Genera level identification was made using the classifier tool [77] from the Ribosomal Database Project (RDP) Release 11.4 [78] (Additional file 1: Table 3). Blastn searches were performed against Refseq database from NCBI to select reference sequences.

A total of 110 sequences were identified as members of the Vibrionaceae family, 43 were isolates from water and 67 from sediment. Based on the previous taxonomic assignment, the sequences of Vibrio alginolyticus, V. parahaemolyticus, V. anguillarum, V. metschnikovii and Photobacterium spp. were included as references. These strains were used in subsequent analyses. A maximum likelihood phylogenetic reconstruction was obtained with PhyML version 3.0 [79], using the HKY + I + G substitution model estimated with jModelTest 2 [80]. The degree of support for the branches was determined with 1000 bootstrap iterations.

Environmental association of phylogroups

To test whether the community of cultivable strains was structured based on its isolation environment (i.e., water or sediment), we performed an AdaptML analysis [37], including our 110 isolates belonging to Vibrionaceae and a Halomonas spp. strain as an out-group. Three categorical environmental variables were tested, including pond of isolation, high and low nutrient concentrations, and the two sampled environments (water or sediment).

Genome sequencing, assembly, and annotation

For whole-genome sequencing, we selected from the AdaptML analysis 39 Vibrio spp. isolates, 23 isolated from sediment and 16 from water, plus 3 isolates of Photobacterium spp. (a lineage closely related to the Vibrio genus) isolated from sediment. DNA extractions were performed with the DNeasy Blood and Tissue kit (Qiagen).

Sequencing was performed with Illumina MiSeq 2 × 250 technology, with insert libraries of 650 bps and an expected coverage of ca.10x per genome. At first, we planned an assembly strategy using a genome reference; for this reason, the strain V15_P4S5T153 had a second library that was designed using the Jr. 454 Roche technology, in order to reduce sequencing bias and get higher coverage. However, due to divergence among genomes, we performed de novo assemblies for all genomes. All sequencing was performed at the Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO), México.

The quality of raw reads was analyzed using FASTQC software ( A minimum quality value of 25 was set, and low-quality sequences were removed with fastq_quality_filter from the FASTX-Toolkit ( Adapter sequences were identified, removed and paired-end reads were merged using SeqPrep ( De novo assemblies were performed with Newbler (Roche/ 454 Life Sciences) using both single-end and merged reads.

For scaffolding process, we used SSPACE [81], gaps were closed using GapFiller [82] and final error correction was performed with iCORN [83] (Additional file 2: Table 4). Coding sequences were inferred with Prodigal 2.0 [84] implemented in PROKKA software [85]. InterProScan 5 allowed annotation [86] with the databases enabled by default. Genome completeness was assessed with BUSCO using the Gamma-proteobacteria database [38].

Pan-genome analyses

The 42 genomes from CCB where compared with genomes of 5 reference Vibrio spp. strains: Vibrio alginolyticus NBRC 15630 = ATCC 17749, V. anguillarum 775, V. furnissii NCTC 11218, V. parahaemolyticus BB22OP and V. metschnikovii CIP 69 14 (Additional file 1: Tables 5, 6; Additional file 2: Table 4). Ortholog gene families were predicted from all 47 genomes using the DeNoGAP comparative genomics pipeline [39]. To minimize false positive prediction of orthologs, we assigned Photobacterium spp. genomes as outgroup. The completely sequenced genome of V. anguillarum strain 775 was used as seed reference.

We estimated the core genome based on presence and absence of gene families across the genomes. If the genes were present in all strains, the orthologs were classified as core, while genes were classified as accessory when present in more than one strain but not in all of them, and unique genes when they were present only in a single strain. Since most of the genomes in our dataset are not completely sequenced, we designated core ortholog families as those present in at least 95% of the genomes, to avoid the impact of missing genes due to sequencing or assembly artifacts.

The package Micropan [87] within R v.3.4 (R Core Team) [88] was used to infer the open or closed nature of each pan-genome dataset, following the Heaps law proposed by Tettelin et al. [43]. The Heaps law model is fitted to the number of new gene clusters observed when genomes are ordered randomly. The model has two parameters: an intercept, and a decay parameter called alpha. If alpha is higher than 1.0 the pan-genome is considered closed, if alpha is lower than 1.0 it is considered open. Additionally, a random sub-sampling for each clade was made, taking three genomes and calculating the alpha value for each group of three genomes. A total of 1000 independent sub-sampling events were made for each clade.

Core proteins were aligned using Kalign [89] to infer the phylogenetic relationship between the samples. The resulting alignments of individual ortholog families were concatenated using a custom Perl script. With these concatenated core genes, a maximum likelihood phylogenetic tree was constructed using the FastTree program [90].

Recombination analyses

Of the total ortholog families in the Vibrio spp. pan-genome, we only used the ortholog families found in at least three genomes for the recombination analyses. Genetic recombination was examined on each coding sequence (CDS) alignment by using inference of pairwise recombination sites, obtained with GENECONV [91] and by the identification of putative recombinant sequences through breakpoints using GARD [92].

We estimated the number of recombination events, considering the pan-genome size, number of strains per clade and branch length, according to the following classification: (i) isolates of the same pond and environment (SPSE), (ii) isolates of the different pond and environment (DPDE), (iii) isolates of the same pond and different environment (SPDE) and (iv) isolates of different pond and same environment (DPSE). For this, we normalized the data by pan-genome size, number of strains and branch length. Given that the large generalist Clade II presented a clear sub-structure, we did a separated analysis for the shorter branches within Clade II (Additional file 1: Figure 4).

To assess the impact of homologous recombination, we analyzed the substitution pattern using two different algorithms, Gubbins [93] and ClonalFrameML [49]. A whole-genome alignment for the 47 analyzed genomes was performed with MAUVE [94]. The resulting alignment was used as input for Gubbins [93] using RAxML for the phylogenetic inference [95] and default parameters. Additionally, whole genome alignments were performed for each clade, excluding references, with the progressive MAUVE algorithm [94]. We calculated the R/theta ratio, nu and delta [49] for each sample and for 100 bootstrapped replicates.

Genetic structure of clade II

Recombination analyses showed that in Clade II there are internal groups with higher internal recombination, so we decided to further investigate the structure within Clade II. For clustering analyses, we used Nei’s genetic distance [96] and Neighbor Joining. Genomes with distance less than 0.001 were grouped and tested with a discriminant analysis of principal components of the genetic variation, using the adegenet library in R [97]. For this study, we used 20 principal components and 3 discriminant functions.

Selection analyses

We used FUBAR [98] to identify signatures of positive selection among ortholog gene families found in at least three genomes. We accounted for recombination breakpoints in the ortholog families, while calculating positively selected sites based on GARD results [92]. We considered any site to be positively selected if it showed P-value < 0.05. We also conducted a Gene Ontology (GO) enrichment analysis using topGO [50] to find overrepresented biological functions in this set of genes.

Effective population size estimation

We followed a simulation approach to estimate the posterior distribution of the effective population size (Ne) of each of the six clades. According to the previous clustering and recombination analysis, for Clades I, III, IV, V and VI we simulated a single population, while for Clade II we simulated three sub-populations that diverged from an ancestral population.

Simulations were performed using Fastsimcoal2 [44, 45]. For each clade, we simulated DNA sequences having a similar length equal to the number of nucleotides in the given clade, as well as a sample size equal to the number of sequences sampled for each clade. We assumed no recombination within the genome, and used the Escherichia coli mutation rate of 2.2 × 10− 10 mutations per nucleotide per generation [99]. We ran between two and four simulations for each clade. For the initial runs, we generated 100,000 replicates extracting Ne values from a prior log-uniform distribution that ranged from 100,000 to 20,000,000 individuals. For Clade II, we also estimated the age of divergence of each Sub-clade, by setting the prior distribution of time ranging from 1000 to 4,000,000 generations. After a first run, we narrowed the prior ranges based on those simulations that had similar summary statistics compared to the observed data and performed another 100,000 simulations using the narrowed priors.

To compare the previously simulated and observed data based on summary statistics, we used the ape [100] and pegas [101] libraries in R to estimate the number of polymorphic sites and the Tajima’s D based on the entire genomes. Tajima’s D is commonly used to estimate demographic changes in populations [102, 103]. Also, we obtained 1000 sliding windows frames to estimate the Tajima’s D along the genomes, as well as the mean and standard deviation of Tajima’s D. Tajima’s D, π, and Watterson’s theta (θw) were estimated for each clade as well as for Sub-clades A, B and G. Since Clades I and VI had three sequences and it was not possible to obtain Tajima’s D, we did 1000 replicates in which we subsampled with replacement 10 sequences. For each replicate, we calculated Tajima’s D and we obtained as the proximate value the median estimated across the 1000 replicates.

Based on the summary statistics, we used the abc function in the ABC package [104] in R to calculate the distribution of the Ne parameter based on a 0.05% threshold distance between the simulated and observed data. For each clade, we reported the median and the 95% interval confidence of Ne. For Clade II, we further reported the average and 95% interval confidence of the number of generations since each Sub-clade diverged from an ancestral clade.

Association between genotypes and environmental variables

We evaluated whether the genetic variation within the Vibrionaceae genomes could be explained by particular adaptations to the environment (water or sediment). We used progressiveMauve [94] to perform a global multiple alignment between the assembled genomes. We extracted the variant sites within the alignment and exported them as SNPs using snp-sites [105].

We obtained 38,533 SNPs, which we used to search for private alleles using Poppr [106]. Afterwards, we obtained a subset of 25,892 SNPs by filtering biallelic sites with minor allele frequencies > 0.05. We used PLINK [107] to perform a GWAS to detect possible associations between our SNPs set and either the water or sediment environments. We conducted Fisher exact tests and regarded as significant all SNPs whose associations had p-values < 0.01 after Bonferroni corrections. These analyses may be informative even considering these sampling differences [108, 109].

To test whether these associations could be explained by convergent evolution rather than by common ancestry, we compared an UPGMA tree reconstructed from the total set of SNPs from an UPGMA tree using only the SNPs that were significantly associated to the environment. We analyzed the distribution of the SNPs within the genomes to find the genes associated to those SNPs.

We mapped the SNPs positions in the genome alignment moving by 1 Kb windows; this window size was selected considering the average bacterial gene size and retrieved all the associated genes. We conducted a Gene Ontology (GO) enrichment analysis using topGO [50] to find overrepresented biological functions in this set of genes.

Availability of data and materials

The datasets generated and analysed during the current study are available in the genome assembly project BioProject: PRJNA361510; PRJNA361511. The resulting InterProScan annotation files, CDS fasta files and the predicted protein fasta files for all taxa are available at Dryad. As by the politics of Dryad, the data will be available once the manuscript is accepted.



Cuatro Ciénegas Basin

N e :

Effective population size

r/m :

Recombination and mutation estimators







r :


m :



Markov chain Monte Carlo


Gene Ontology


Single nucleotide polymorphisms


Genome-wide association study


Na+/H+ antiporter B


Pseudomonas isolation agar


Thiosulfate Citrate Bile Sucrose Agar


Total carbon


Inorganic carbon


Total organic carbon


Total N


Total P


Polymerase chain reaction


Ribosomal Database Project


Laboratorio Nacional de Genómica para la Biodiversidad


Coding sequence


Basic Local Alignment Search Tool


Benchmarking Universal Single-Copy Orthologs


Fast Unconstrained Bayesian AppRoximation


Unweighted Pair Group Method using Arithmetic averages


National Center for Biotechnology Information


SSAKE-based Scaffolding of Pre-Assembled Contigs after Extension


Microbial Pan-Genome


Statistical Tests for Detecting Gene Conversion


A genetic algorithm for recombination detection


Genealogies Unbiased By recomBinations In Nucleotide Sequences


Randomized Axelerated Maximum Likelihood


  1. Lilburn TG, Gu J, Cai H, Wang Y. Comparative genomics of the family vibrionaceae reveals the wide distribution of genes encoding virulence-associated proteins. BMC Genomics. 2010;11:369.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Moriel DG, Tan L, Goh KGK, Phan M-D, Ipe DS, Lo AW, et al. A novel protective vaccine antigen from the core Escherichia coli genome. mSphere. 2016;1.

  3. Sanglas A, Albarral V, Farfán M, Lorén JG, Fusté MC. Evolutionary roots and diversification of the genus Aeromonas. Front Microbiol. 2017;8.

  4. Lapierre P, Gogarten JP. Estimating the size of the bacterial pan-genome. Trends Genet. 2009;25:107–10.

    Article  CAS  PubMed  Google Scholar 

  5. Collins RE, Higgs PG. Testing the infinitely many genes model for the evolution of the bacterial core genome and pangenome. Mol Biol Evol. 2012;29:3413–25.

  6. Gordienko EN, Kazanov MD, Gelfand MS. Evolution of pan-genomes of Escherichia coli, Shigella spp., and Salmonella enterica. J Bacteriol. 2013;195:2786–92.

  7. Valdivia-Anistro JA, Eguiarte-Fruns LE, Delgado-Sapién G, Gasca-Pineda PM-ZJ, Learned J, Elser JJ, et al. Variability of rRNA operon copy number and growth rate dynamics of Bacillus isolated from an extremely oligotrophic aquatic ecosystem. Front Microbiol. 2016;6.

  8. Zhi X-Y, Jiang Z, Yang L-L, Huang Y. The underlying mechanisms of genetic innovation and speciation in the family corynebacteriaceae : a phylogenomics approach. Mol Phylogenet Evol. 2017;107:246–55.

    Article  CAS  PubMed  Google Scholar 

  9. Hou Y, Lin S. Distinct gene number-genome size relationships for eukaryotes and non-eukaryotes: gene content estimation for dinoflagellate genomes. PLoS One. 2009;4:e6978.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. McInerney JO, McNally A, MJ O. Why prokaryotes have pangenomes. Nat Microbiol. 2017;2.

  11. Kuo C-H, Ochman H. Deletional bias across the three domains of life. Genome Biol Evol. 2009;1:145–52.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Morris JJ, Lenski RE, Zinser ER. The black queen hypothesis: Evolution of dependencies through adaptive gene loss. mBio. 2012;3.

  13. Mas A, Jamshidi S, Lagadeuc Y, Eveillard D, Vandenkoornhuyse P. Beyond the black queen hypothesis. ISME J. 2016;10:2085–91.

    Article  PubMed  PubMed Central  Google Scholar 

  14. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci. 2005;102:13950–5.

  15. Andreani NA, Hesse E, Vos M. Prokaryote genome fluidity is dependent on effective population size. ISME J. 2017;11:1719–21.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Smith JM, Smith NH, Spratt MOBG. How clonal are bacteria? Proc Natl Acad Sci. 1993;90:4384–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Souza V, Eguiarte LE. Bacteria gone native vs. bacteria gone awry?: Plasmidic transfer and bacterial evolution. Proc Natl Acad Sci. 1997;94:5501–3.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Lawrence JG, Ochman H. Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci. 1998;95:9413–7.

  19. Ochman H, Lawrence JG, Groisman EA. Lateral gene transfer and the nature of bacterial innovation. Nature. 2000;405:299–304.

    Article  CAS  PubMed  Google Scholar 

  20. Fournier GP, Gogarten JP. Evolution of acetoclastic methanogenesis in methanosarcina via horizontal gene transfer from cellulolytic clostridia. J Bacteriol. 2007;190:1124–7.

    Article  PubMed  PubMed Central  Google Scholar 

  21. Soucy SM, Fullmer MS, Papke RT, Gogarten JP. Inteins as indicators of gene flow in the halobacteria. Front Microbiol. 2014;5.

  22. Roze D, Barton NH. The hill Robertson effect and the evolution of recombination. Genetics. 2006;173:1793–811.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Comeron JM, Williford A, Kliman RM. The hill Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity. 2007;100:19–31.

    Article  CAS  PubMed  Google Scholar 

  24. Souza V, Nguyen TT, Hudson RR, Pinero D, Lenski RE. Hierarchical analysis of linkage disequilibrium in rhizobium populations: evidence for sex? Proc Natl Acad Sci. 1992;89:8389–93.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Bobay L-M, Ochman H. Factors driving effective population size and pan-genome evolution in bacteria. BMC Evol Biol. 2018;18.

  26. Cohan FM. Bacterial species and speciation. Syst Biol. 2001;50:513–24.

    Article  CAS  PubMed  Google Scholar 

  27. Souza V, Moreno-Letelier A, Travisano M, Alcaraz LD, Olmedo G, Eguiarte LE. The lost world of Cuatro Ciénegas basin, a relictual bacterial niche in a desert oasis. eLife. 2018;7.

  28. Escalante AE, Eguiarte LE, Espinosa-Asuar L, Forney LJ, Noguez AM, Saldivar VS. Diversity of aquatic prokaryotic communities in the Cuatro Cienegas basin. FEMS Microbiol Ecol. 2008;65:50–60.

    Article  CAS  PubMed  Google Scholar 

  29. Rebollar EA, Avitia M, Eguiarte LE, González-González A, Mora L, Bonilla-Rosso G, et al. Water-sediment niche differentiation in ancient marine lineages of Exiguobacterium endemic to the Cuatro Cienegas basin. Environ Microbiol. 2012;14:2323–33.

  30. Avitia M, Escalante AE, Rebollar EA, Moreno-Letelier A, Eguiarte LE, Souza V. Population expansions shared among coexisting bacterial lineages are revealed by genetic evidence. PeerJ. 2014;2:e696.

    Article  PubMed  PubMed Central  Google Scholar 

  31. Souza V, Eguiarte LE, Siefert J, Elser JJ. Microbial endemism: does phosphorus limitation enhance speciation? Nat Rev Microbiol. 2008;6:559–64.

    Article  PubMed  Google Scholar 

  32. Souza V, Eguiarte LE, Travisano M, Elser JJ, Rooks C, Siefert JL. Travel, sex, and food: Whats speciation got to do with it? Astrobiology. 2012;12:634–40.

    Article  PubMed  PubMed Central  Google Scholar 

  33. Vos M, Didelot X. A comparison of homologous recombination rates in bacteria and archaea. ISME J. 2008;3:199–208.

    Article  CAS  PubMed  Google Scholar 

  34. Cui Y, Yang X, Didelot X, Guo C, Li D, Yan Y, et al. Epidemic clones, oceanic gene pools, and eco-LD in the free living marine pathogen Vibrio parahaemolyticus. Mol Biol Evol. 2015;32:1396–410.

  35. Peimbert M, Alcaraz LD, Bonilla-Rosso G, Olmedo-Alvarez G, Garc-Oliva F, Segovia L, et al. Comparative metagenomics of two microbial mats at Cuatro Ciénegas basin II: ancient lessons on how to cope with an environment under severe nutrient stress. Astrobiology. 2012;12:648–58.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Redfield AC. On the proportions of organic derivatives in sea water and their relation to the composition of plankton. In: Daniel RJ, editor. James Johnstone Memorial Volume: Liverpool Univ. Press, Liverpool, Merseyside, England; 1934. p. 176–92.

  37. Hunt DE, David LA, Gevers D, Preheim SP, Alm EJ, Polz MF. Resource partitioning and sympatric differentiation among closely related bacterioplankton. Science. 2008;320:1081–5.

    Article  CAS  PubMed  Google Scholar 

  38. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–2.

    Article  CAS  PubMed  Google Scholar 

  39. Thakur S, Guttman DS. A de-novo genome analysis pipeline (DeNoGAP) for large-scale comparative prokaryotic genomics studies. BMC Bioinformatics. 2016;17.

  40. Lux TM, Lee R, Love J. Complete genome sequence of a free-living Vibrio furnissii sp. nov. strain (NCTC 11218). J Bacteriol. 2011;193:1487–8.

  41. Naka H, Dias GM, Thompson CC, Dubay C, Thompson FL, Crosa JH. Complete genome sequence of the marine fish pathogen Vibrio anguillarum harboring the pJM1 virulence plasmid and genomic comparison with other virulent strains of V. anguillarum and V. ordalii. Infect Immun. 2011;79:2889–900.

  42. Xu F, Ilyas S, Hall JA, Jones SH, Cooper VS, Whistler CA. Genetic characterization of clinical and environmental Vibrio parahaemolyticus from the Northeast USA reveals emerging resident and non-indigenous pathogen lineages. Front Microbiol. 2015;6.

  43. Tettelin H, Riley D, Cattuto C, Medini D. Comparative genomics: the bacterial pan-genome. Curr Opin Microbiol. 2008;11:472–7.

    Article  CAS  PubMed  Google Scholar 

  44. Excoffier L, Foll M. Fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics. 2011;27:1332–4.

    Article  CAS  PubMed  Google Scholar 

  45. Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M. Robust demographic inference from genomic and SNP data. PLoS Genet. 2013;9:e1003905.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Sung W, Ackerman MS, Miller SF, Doak TG, Lynch M. Drift-barrier hypothesis and mutation-rate evolution. Proc Natl Acad Sci. 2012;109(45):18488–92.

    Article  PubMed  PubMed Central  Google Scholar 

  47. Jensen JD, Bachtrog D. Characterizing the influence of effective population size on the rate of adaptation: Gillespie’s Darwin domain. Genome Biology Evol. 2011;3:687–701.

    Article  Google Scholar 

  48. Sivasundar A, Hey J. Population genetics of Caenorhabditis elegans: the paradox of low polymorphism in a widespread species. Genetics. 2003;163:147–57.

  49. Didelot X, Wilson DJ. ClonalFrameML: efficient inference of recombination in whole bacterial genomes. PLoS Comput Biol. 2015;11:e1004041.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  50. Alexa A, Rahnenfuhrer J, Lengauer T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 2006;22:1600–7.

    Article  CAS  PubMed  Google Scholar 

  51. Taboada B, Isa P, Gutiérrez-Escolano AL, del Ángel RM, Ludert JE, Vázquez N, et al. The geographic structure of viruses in the Cuatro Ciénegas basin, a unique oasis in northern Mexico, reveals a highly diverse population on a small geographic scale. Appl Environ Microbiol. 2018;84.

  52. Souza V, Espinosa-Asuar L, Escalante AE, Eguiarte LE, Farmer J, Forney L, et al. An endangered oasis of aquatic microbial biodiversity in the Chihuahuan desert. Proc Natl Acad Sci. 2006;103:6565–70.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. Montiel-González C, Bautista F, Delgado C, García-Oliva F. The Climate of Cuatro Ciénegas Basin: Drivers and Temporal Patterns. In: Souza V, Olmedo-Álvarez G, Eguiarte LE, editors. Cuatro Ciénegas Ecology, Natural History and Microbiology. New York: Springer, Cham; 2018. p. 35–42.

    Chapter  Google Scholar 

  54. Bonilla-Rosso G, Peimbert M, Alcaraz LD, Hernández I, Eguiarte LE, Olmedo-Alvarez G, et al. Comparative metagenomics of two microbial mats at Cuatro Ciénegas basin II: community structure and composition in oligotrophic environments. Astrobiology. 2012;12:659–73.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  55. Lee ZM-P, Poret-Peterson AT, Siefert JL, Kaul D, Moustafa A, Allen AE, et al. Nutrient stoichiometry shapes microbial community structure in an evaporitic shallow pond. Front Microbiol. 2017;8.

  56. Anda VD, Zapata-Peñasco I, Blaz J, Poot-Hernández AC, Contreras-Moreira B, González-Laffitte M, et al. Understanding the mechanisms behind the response to environmental perturbation in microbial mats: a metagenomic-network based approach. Front Microbiol. 2018;9.

  57. Ponce-Soto GY, Aguirre-von-Wobeser E, Eguiarte LE, Elser JJ, Lee ZM-P, Souza V. Enrichment experiment changes microbial interactions in an ultra-oligotrophic environment. Front Microbiol. 2015;6.

  58. Ghalayini M, Launay A, Bridier-Nahmias A, Clermont O, Denamur E, Lescat M, et al. Evolution of a dominant natural isolate of Escherichia coli in the human gut over the course of a year suggests a neutral evolution with reduced effective population size. Appl Environ Microbiol. 2018;84.

  59. Farfan M, Minana-Galbis D, Fuste MC, Loren JG. Allelic diversity and population structure in Vibrio cholerae o139 bengal based on nucleotide sequence analysis. J Bacteriol. 2002;184:1304–13.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  60. Gonzalez-Escalona N, Martinez-Urtaza J, Romero J, Espejo RT, Jaykus L-A, DePaola A. Determination of molecular phylogenetics of Vibrio parahaemolyticus strains by multilocus sequence typing. J Bacteriol. 2008;190:2831–40.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  61. Korneliussen TS, Moltke I, Albrechtsen A, Nielsen R. Calculation of Tajima’s D and other neutrality test statistics from low depth next-generation sequencing data. BMC Bioinformatics. 2013;14.

  62. Petit N, Barbadilla A. Selection efficiency and effective population size in Drosophila species. J Evol Biol. 2009;22:515–26.

    Article  CAS  PubMed  Google Scholar 

  63. Gossmann TI, Keightley PD, Eyre-Walker A. The effect of variation in the effective population size on the rate of adaptive molecular evolution in eukaryotes. Genome Biol Evol. 2012;4:658–67.

    Article  PubMed  PubMed Central  Google Scholar 

  64. Shapiro BJ, Friedman J, Cordero OX, Preheim SP, Timberlake SC, Szabó G, et al. Population genomics of early events in the ecological differentiation of bacteria. Science. 2012;336:48–51.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  65. Vozza NF, Abdian PL, Russo DM, Mongiardini E, Lodeiro A, Molin S, et al. A rhizobium leguminosarum CHDL- (cadherin-like-) lectin participates in assembly and remodeling of the biofilm matrix. Front Microbiol. 2016;7.

  66. Kurz M, Brünig AN, Galinski EA. NhaD type sodium/proton-antiporter of Halomonas elongata: a salt stress response mechanism in marine habitats? Saline Systems. 2006;2:10.

  67. Vimont S, Berche P. NhaA, an Na(+)/H(+) antiporter involved in environmental survival of Vibrio cholerae. J Bacteriol. 2000;182:2937–44.

  68. Wolaver BD, Crossey LJ, Karlstrom KE, Banner JL, Cardenas MB, Ojeda CG, et al. Identifying origins of and pathways for spring waters in a semiarid basin using He, Sr, and C isotopes: Cuatrociénegas basin, Mexico. Geosphere. 2012;9:113–25.

  69. Vázquez-Rosas-Landa M, Ponce-Soto GY, Eguiarte LE, Souza V. Comparative genomics of free-living gammaproteobacteria: pathogenesis-related genes or interaction-related genes? Pathogens Disease. 2017;75.

  70. Huffman EW. Performance of a new automatic carbon dioxide coulometer. Microchem J. 1977;22:567–73.

    Article  CAS  Google Scholar 

  71. Bremner JM. Total nitrogen. In: Sparks DL, editor. Methods of Soil Analysis. Part 2 Chemical Methods. Madison: Soil Science Society of America; 1996. p. 1085–6.

    Google Scholar 

  72. Murphy J, Riley J. A modified single solution method for the determination of phosphate in natural waters. Anal Chim Acta. 1962;27:31–6.

    Article  CAS  Google Scholar 

  73. Aljanabi S. Universal and rapid salt-extraction of high quality genomic DNA for PCR- based techniques. Nucleic Acids Res. 1997;25:4692–3.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  74. Lane DJ. 16S/23S rRNA sequencing. In: Stackebrandt E, Goodfellow M, editors. Nucleic acid techniques in bacterial systematic. New York: Wiley; 1991. p. 115–75.

    Google Scholar 

  75. Larkin M, Blackshields G, Brown N, Chenna R, McGettigan P, McWilliam H, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23:2947–8.

    Article  CAS  PubMed  Google Scholar 

  76. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75:7537–41.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  77. Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007;73:5261–7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  78. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, et al. The ribosomal database project: Improved alignments and new tools for rRNA analysis. Nucleic Acid Res. 2009;37 Database:D141–5.

    Article  CAS  Google Scholar 

  79. Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010;59:307–21.

    Article  CAS  PubMed  Google Scholar 

  80. Darriba D, Taboada GL, Doallo R, Posada D. jModelTest 2: more models, new heuristics and parallel computing. Nat Methods. 2012;9:772.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  81. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2010;27:578–9.

    Article  CAS  PubMed  Google Scholar 

  82. Nadalin F, Vezzi F, Policriti A. GapFiller: a de novo assembly approach to fill the gap within paired reads. BMC Bioinformatics. 2012;13.

  83. Otto TD, Sanders M, Berriman M, Newbold C. Iterative correction of reference nucleotides (iCORN) using second generation sequencing technology. Bioinformatics. 2010;26:1704–7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  84. Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11.

  85. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30:2068–9.

    Article  CAS  PubMed  Google Scholar 

  86. Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30:1236–40.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  87. Snipen L, Liland KH. Micropan: an R-package for microbial pan-genomics. BMC Bioinformatics. 2015;16.

  88. Core Team R. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2017.

    Google Scholar 

  89. Lassmann T, Frings O, Sonnhammer ELL. Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res. 2008;37:858–65.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  90. Price MN, Dehal PS, Arkin AP. FastTree 2 approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5:e9490.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  91. Sawyer S. Statistical tests for detecting gene conversion. Mol Biol Evol. 1989;6:526–38.

    Article  CAS  PubMed  Google Scholar 

  92. Pond SLK, Posada D, Gravenor MB, Woelk CH, Frost SD. GARD: a genetic algorithm for recombination detection. Bioinformatics. 2006;22:3096–8.

    Article  CAS  Google Scholar 

  93. Croucher NJ, Page AJ, Connor TR, Delaney AJ, Keane JA, Bentley SD, et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using gubbins. Nucleic Acids Res. 2014;43:e15.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  94. Darling AE, Mau B, Perna NT. progressiveMauve: Multiple genome alignment with gene gain, loss and rearrangement. PLoS ONE. 2010;5:e11147.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  95. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–3.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  96. Nei M. Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics. 1978;89(3):583–90.

    CAS  PubMed  PubMed Central  Google Scholar 

  97. Jombart T, Ahmed I. Adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics. 2011;27:3070–1.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  98. Murrell B, Moola S, Mabona A, Weighill T, Sheward D, Pond SLK, et al. FUBAR: a fast, unconstrained BAyesian AppRoximation for inferring selection. Mol Biol Evol. 2013;30:1196–205.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  99. Lee H, Popodi E, Tang H, Foster PL. Rate and molecular spectrum of spontaneous mutations in the bacterium Escherichia coli as determined by whole-genome sequencing. Proc Natl Acad Sci. 2012;109:E2774–83.

  100. Paradis E, Schliep K. Ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2018;35:526–8.

    Article  CAS  Google Scholar 

  101. Paradis E. Pegas: an R package for population genetics with an integrated-modular approach. Bioinformatics. 2010;26:419–20.

    Article  CAS  PubMed  Google Scholar 

  102. Eckshtain-Levi N, Weisberg AJ, Vinatzer BA. The population genetic test Tajima’s D identifies genes encoding pathogen-associated molecular patterns and other virulence-related genes in Ralstonia solanacearum. Mol Plant Pathol. 2018;19:2187–92.

  103. Shen H-M, Chen S-B, Cui Y-B, Xu B, Kassegne K, Abe EM, et al. Whole-genome sequencing and analysis of Plasmodium falciparum isolates from China-Myanmar border area. Infect Dis Poverty. 2018;7.

  104. Csillery K, François O, Blum MGB. Abc: an R package for approximate bayesian computation (ABC). Methods Ecol Evol. 2012;3:475–9.

    Article  Google Scholar 

  105. Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, et al. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom. 2016;2.

  106. Kamvar ZN, Tabima JF, Grünwald NJ. Poppr: an r package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ. 2014;2:e281.

    Article  PubMed  PubMed Central  Google Scholar 

  107. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  108. Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 2003;165:2213–33.

    CAS  PubMed  PubMed Central  Google Scholar 

  109. Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010;11:499–511.

    Article  CAS  PubMed  Google Scholar 

Download references


We thank Felipe García-Oliva and Rodrigo Velázquez-Durán at the Instituto de Investigaciones en Ecosistemas y Sustentabilidad, UNAM for performing the biogeochemical analysis. Laura Espinosa-Asuar and Erika Aguirre-Planter provided technical and logistical assistance during the project. David Jaramillo for the aerial view and the picture of Pozas Rojas in Fig. 1.


MV-R-L was a doctoral student from Programa de Doctorado en Ciencias Biomédicas, Universidad Nacional Autónoma de México (UNAM) and got a fellowship 345250 from CONACYT. This research was also supported by funding from PAPIIT project IG200215, WWF-Alianza Carlos Slim, SEP-Ciencia Básica CONACYT grant 238245 all awarded to VS and LEE. The paper was written during a sabbatical leave of LEE and VS at the University of Minnesota in Peter Tiffin and Michael Travisano laboratories, respectively, both with support by scholarships from PASPA, DGAPA, UNAM. The funding agencies had no role in the design of the study or in the collection, analysis, interpretation of data, or in writing the manuscript.

Author information

Authors and Affiliations



MV-R-L designed the sampling, obtained the biological material, analyzed the data, prepared figures and tables, and wrote the paper. GYP-S analyzed the data and participated in all stages of writing. JA-L, ST, ES, and JB-R analyzed the data. EI-L analyzed the data and provided computing facilities. DS-G provided computing facilities and contributed substantially to the analysis and discussion of the data. LEE made contributions for the design, analysis, discussion of the data and writing. V-S conceived, designed the study and the analyses, managed the obtaining financial resources and participated in all stages of writing. All authors reviewed the manuscript and approved the final version.

Corresponding author

Correspondence to Valeria Souza.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1: Table S1.

Nutrient measures across sampled points. Table S2. Ratio of nutrients concentrations. Table S3. RDP classification of partial 16S rRNA. Table S5. General information of the 42 Vibrionaceae genomes and 5 references used in this study. Table S6. Results of the Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis (see methods) using 452 BUSCOS of 721 species of the Gamma-proteobacteria database. Table S9. GO terms enriched at the unique genes of each clade. Figure S1. Phylogenetic reconstruction of 16S RNA sequences. Figure S2. AdaptML analysis. Figure S3. Random analysis of alpha values within each clade. Figure S4. Analysis of structure of Clade II. Figure S5. Frequency of recombination events. Figure S6. Recombination events across a whole genome alignment. Figure S7. UPGMA and membership of the CCB strains.

Additional file 2: Table S4.

Genome assembly overview. Table S7. GeneConv results. Table S8. Gene clusters in which positive selection was found. Table S10. Orthologue cluster genes containig SNPs associated to the strain isolation environment.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Vázquez-Rosas-Landa, M., Ponce-Soto, G.Y., Aguirre-Liguori, J.A. et al. Population genomics of Vibrionaceae isolated from an endangered oasis reveals local adaptation after an environmental perturbation. BMC Genomics 21, 418 (2020).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: