Genome-wide divergence among invasive populations of Aedes aegypti in California

Background In the summer of 2013, Aedes aegypti Linnaeus was first detected in three cities in central California (Clovis, Madera and Menlo Park). It has now been detected in multiple locations in central and southern CA as far south as San Diego and Imperial Counties. A number of published reports suggest that CA populations have been established from multiple independent introductions. Results Here we report the first population genomics analyses of Ae. aegypti based on individual, field collected whole genome sequences. We analyzed 46 Ae. aegypti genomes to establish genetic relationships among populations from sites in California, Florida and South Africa. Based on 4.65 million high quality biallelic SNPs, we identified 3 major genetic clusters within California; one that includes all sample sites in the southern part of the state (South of Tehachapi mountain range) plus the town of Exeter in central California and two additional clusters in central California. Conclusions A lack of concordance between mitochondrial and nuclear genealogies suggests that the three founding populations were polymorphic for two main mitochondrial haplotypes prior to being introduced to California. One of these has been lost in the Clovis populations, possibly by a founder effect. Genome-wide comparisons indicate extensive differentiation between genetic clusters. Our observations support recent introductions of Ae. aegypti into California from multiple, genetically diverged source populations. Our data reveal signs of hybridization among diverged populations within CA. Genetic markers identified in this study will be of great value in pursuing classical population genetic studies which require larger sample sizes. Electronic supplementary material The online version of this article (10.1186/s12864-019-5586-4) contains supplementary material, which is available to authorized users.


Background
Aedes aegypti has a short flight range, usually not actively moving more than 200 m from their breeding source [1], but is exquisitely adapted to hitchhiking in transport vehicles [2]. One central question concerning the populations dynamics of Ae. aegypti in California (CA) therefore is whether the established populations at different locations are founded from one source population that spread across the state or if they are the result of other kinds of founding effects. A recent study revealed several, genetically distinct Ae. aegypti populations in CA presumably originating from multiple introductions from other sites in the U.S. and/or northern Mexico [3]. Insight into the population structure of CA Ae. aegypti beyond this will be necessary to fully understand the dynamics that shape the current pattern of distribution and continuing spread of this invasive vector species.
California had no known established local populations of Ae. aegypti prior to the summer of 2013 when it was detected in three cities in central California: Clovis, Madera and Menlo Park [4][5][6]. In the spring and summer of the following year, this mosquito was again found in the same three California locations and for the first time in additional communities in central California and further south in San Diego County (Fig. 1). Ae. aegypti specimens have been collected over multiple years from some sites and from additional locations in both central and southern CA each year, indicating that Ae. aegypti has now become established and is spreading through large parts of the state (Fig. 1).
Successful vector control can benefit from population genetics and genomics analyses which can provide estimates of gene flow and identify the genetic basis of phenotypes such as insecticide resistance [7] and host preference [8]. Population genomics studies are especially critical to the development of control strategies based on genetic manipulation of vectors, which is a matter of growing interest. Modelling, planning and monitoring activities associated with control programs require affordable and rapid assays to distinguish vector sub-populations within a species and a deep understanding of the processes that shape their genetic structure. It is becoming increasingly apparent that hybridization between diverged vector populations may be an import source of new genetic material including alleles that mediate adaptations to facilitate range expansion [9] or that promote the evolution of resistance to insecticides [10]. Analysis of whole-genome sequencing data is the most powerful method to detect even minor admixture [11]. Therefore, we have applied a population genomics approach to study invasive Ae. aegypti populations. Here we report a preliminary analysis based on genome sequences of 39 individual Ae. aegypti, collected from twelve locations throughout CA, and, for comparison, four specimens from Florida and three from South Africa. The analyses presented here should serve as a step toward expanded population genomics studies aimed at understanding how invasive mosquito species become established in new locations and how distinct populations interact on the genetic level.

Results
We sequenced the genomes of 46 specimens of Ae. aegypti from California (N = 39), Florida (N = 4) and South Africa (N = 3) with median genome coverage of 9.6X per sample (Fig. 2, Additional file 1: Table S1). Filtering for biallelic SNPs and a minimum depth of 8 with at most 20% missing data yielded 4,653,297 high-quality SNPs.
Principal Components Analysis based on the genotypes of these SNPs revealed four distinct genetic clusters (Fig. 3). The genetic cluster designated GC1 includes all six cities in southern CA and includes one central CA site, Exeter. Samples from Florida also fall within this cluster. The GC2 cluster includes the northernmost sites near the coast in Menlo Park and includes some central CA populations at Madera and Fresno. Populations from the restricted area around Clovis and Sanger in central CA form the GC3 cluster. The GC4 cluster includes all South African Ae. aegypti samples. Overall, the distribution of the three Ae. aegypti genetic clusters containing CA Ae. aegypti have a nearly parapatric distribution with the three groups potentially converging in the Central Valley. The population at Sanger appears to have multiple genetic clusters occurring in sympatry. One specimen from Sanger (Ae17CON058) could be GC2 or a hybrid of GC2 and GC3 given its values for PC2 and PC3 relative to other samples (Fig.  3). Other Principal components (PC5 and PC6 in Fig. 3) indicate additional population subdivisions further dividing Mission Viejo, Garden Grove, Exeter and Vero Beach samples from the rest of GC2 (Fig. 3). This is consistent with a previous report suggesting highly structured populations within southern CA Ae. aegypti [3]. Our genome-wide SNP-based clustering results show clear subdivision between Clovis and Fresno/Madera/ Menlo Park. This may seem slightly different from published SNPchip data [3] but is consistent with the microsatellite-based genetic clustering reported in the same study. The respective year of the first detection of the species in each site is displayed on the map. * 2018 data is surveillance data available as of July, 31, 2018. Data was derived from CalSurv Maps [40] and the CleanTOPO2 basemap [41] was used as background Windowed-F ST analysis indicates genome-wide differentiation among the four genetic clusters (Fig. 4a -c). The GC1 and GC4 clusters are the most highly diverged (genome-wide average F ST = 0.159). Genetic distance between GC1 and the other clusters is intermediate (Fig. 4 a and b). Nucleotide diversity (π) is highly variable but lowest in the middle of chromosome 3 (Fig. 4). These regions correspond to the location of the centromeres (coordinates obtained from personal communications with M. Sharakhova at Virginia Polytechnic Institute) and include a relatively high density of repeated and difficult to sequence regions which are excluded from the SNP set analyzed. Overall nucleotide  diversity is lowest on Chromosome 1, which contains the sex determining locus in Ae. aegypti. Overall nucleotide diversity is similar in all population comparisons as indicated by difference in nucleotide diversity (Δπ = π 1 -π 2 ). As a peculiarity, the South African populations have noticeably higher nucleotide diversity in a region around 160-170 Mbp of Chromosome 1 compared to samples from the southern CA GC1.   Figure S1 and 2 (mean π 2016 -π 2013 value of 7.22 × 10 − 5 and 2.32 × 10 − 5 , respectively) but a decrease on chromosome 3 (π 2016 -π 2013 value of − 1.70 × 10 − 5 ). However, regions with a relatively large change in nucleotide diversity between 2016 and 2013 are visible on all three chromosomes, some of which also coincide with highly differentiated (F ST > 0.1) regions. These highly differentiated regions with a relatively large nucleotide diversity change may indicate genomic regions under selection presumably as the founding population adapts to local environmental conditions [7].
Differentiation among populations within GC2 (F ST = -0.043 to 0.007) and GC3 (F ST = -0.084 to 0.000) is minimal (Additional file 2: Table S2). However, differentiation between some populations within GC1 is fairly high (F ST = 0.000 to 0.210), especially between Exeter and the southern CA samples (F ST = 0.132 to 0.228). This suggests potential substructure within GC1 separating Exeter and Florida samples from other southern CA samples ( Fig. 3 and Fig. 5), so they are removed from GC1 in subsequent analysis ( Fig. 4 and Additional file 3: Figure S1 panel D). The F ST distance between one sample from the town of Sanger (sample Ae17CON058) to GC2 and GC3 clusters were equivalent (Additional file 2: Table S2) and its placement in the PCA (Fig. 3) and phylogenetic tree (Fig. 5) suggest that this individual could be a GC2/GC3 hybrid.
Estimates of F ST derived from whole genome sequence data have been shown to be accurate even with very small sample sizes (i.e. N = 2/population [12]). This is due to the very large number of SNPs (i.e. n> > 1000 loci) used in these analyses. The 4.65 million loci used in our analysis is well in excess of the number of loci required for an accurate assessment of F ST . In addition, we evaluated various minimum read depths and missing data ratios and observed results consistent with those we report here (data not shown).
Mitogenome sequence analysis revealed two major mitochondrial lineages in CA Ae. aegypti (Fig. 6). One lineage includes the Ae. aegypti reference sequence and is represented in all four genetic clusters, the other was present in samples from GC1, GC2 and GC4. Samples from Florida and South Africa are distributed among the two major lineages suggesting that these lineages might be present throughout the global range of Ae. aegypti.

Discussion
Using nuclear genome sequence data, we identified three major genetic clusters among CA Ae. aegypti. These correspond roughly to geographic regions in the state (Figs. 2 and 3). Our data support the hypothesis that Ae. aegypti in CA currently exists as multiple, mostly isolated populations. High genetic distance (F ST > 0.1) as well as genome-wide differentiation (Fig. 4) support multiple introductions into CA from genetically distinct source populations as the most plausible history of this invasion.
Two major mitochondrial lineages are present within California populations, probably corresponding to previously described global clades [13,14]. However, their genealogy differs from the nuclear genome genealogy  5 and 6). This is comparable to a previous study using ND4 sequence analysis of Ae. aegypti populations introduced to Florida [15]. The lack of geographic clustering of mitochondrial lineages therefore appears to be common in invasive Ae. aegypti populations and is likely due to the saltatory nature of dispersal in this species. The incongruence between nuclear and mitochondrial gene genealogies could be due to different evolutionary rates between different loci producing differing topologies [16][17][18]. It is possible that mitochondrial lineages capture historic divergence events, while nuclear genome divergence reflects relatively recent divergence. Linkage disequilibrium decays rapidly in mosquito genomes as seen in Anopheles arabiensis [19]. Thus, any contact between two distinct Ae. aegypti populations may have resulted in relatively recent gene flow homogenizing populations within a locality. In this case mitochondrial markers appear to be less useful to determine relatively recent population divergence events in Ae. aegypti. We investigated the possibility of generating SNP genotypes that are compatible with the existing Ae. aegypti SNP chip dataset [20] to allow for a direct comparison of our results with those previously published. The SNP positions provided in Evans et al. [20] were based on the initial genome assembly AaegL1 [21]. Our BLAST results comparing AaegL1-based SNP sequences to the AaegL5 assembly revealed numerous and significant differences, including multiple matches with high (> 98%) similarity, sequence differences (arising from indel mutations), non-biallelic SNPs, polymorphisms surrounding the target SNPs, etc. (Additional file 4: Appendix S1). These often resulted in mismatched genotype calls between the two different platforms (see genotype discrepancy examples provided in Additional file 4: Appendix S1). Due to these problems a direct comparison of SNP genotype calls using the published SNP chip data with those generated from genome sequence data is deemed inappropriate and we highly recommend taking this into account when applying SNP chip analyses in the future.
Microsatellite data from [3,4] indicated that San Mateo (=Menlo Park), Madera and Fresno samples were genetically similar to samples from the southeastern USA which includes samples from Louisiana, Georgia and Florida. Pless et al. [3] also included a population from Exeter, CA that was also classified together with other central CA samples and south central and southeast USA populations based on microsatellite profiles. This appears to be inconsistent with our results. Our analysis placed the three central CA populations (Menlo Park, Madera and Fresno) in a group (GC2) distinct from the group containing the southeast USA populations (Vero Beach and Key West, Florida). In our analysis, the samples from Florida clustered with populations from Exeter, CA and southern CA (GC1, Fig. 3). Contrary to the microsatellite data, the SNP chip data from the same study [3] groups the Exeter population apart from all other CA populations including those in central CA, consistent with our genome-wide SNP data. Unfortunately, their SNP chip data clustering results did not include samples from the southeast USA preventing direct comparison with their SNP clustering result. This, however, could support the view that the Exeter population, introduced in 2014 is distinct from all other CA populations and that it was introduced independently, rather than resulting from local spread of Ae. aegypti within CA.
PCA analyses of the SNP chip data separated Clovis (GC3) from the GC2 cluster with some overlap [3]. The larger number of SNPs used in our analysis (> 2.9 million biallelic SNPs compared to 15,698 SNPs) may have increased the resolution, allowing us to confidently separate the two. Our data together with previous reports strongly support multiple introductions of Ae. aegypti in California. The most likely scenario includes four independent introductions: (i) Clovis area; probably in 2013 (ii) Madera area; probably in 2013 (iii) southern CA, probably in 2014 (iv) Exeter, probably in 2014 introduced from someplace in the southeast USA like Florida. The years are based on reports from the California vector control districts. This scenario is also in line with most of the results published based on microsatellites and SNP chip data [3]. From our data the exact origin of the introductions remains uncertain with only the Exeter population showing signs of presumable derivation from the southeast USA.
The degree of genetic differentiation found in the Clovis population between the years 2013 and 2016 (Fig. 4d) indicates the population is undergoing rapid changes in its genome, potentially reflecting local adaptation, or, less likely, drift. The only other longitudinal investigation of a CA population of Ae. aegypti that we are aware of compares genotypes of samples from 2013 and 2015 from Madera, detecting almost no change within these two years [3]. Further investigation on genic features showing significant differences between the two time points may shed light on the genes involved in local adaptation at Clovis and the particular circumstances that drove it. Because Ae. aegypti chromosomes do not produce clearly visible polytene chromosomes like e.g. Anopheles gambiae, the detection of chromosome inversions has been challenging and the identification of precise location is at infant stage [22]. Approximate location of diverged regions and the potential chromosome inversions noted by Bernhardt et al. [22] did not provide clear indication that the diverged regions we observed are due to chromosome inversions. Future studies of linkage disequilibrium could illuminate the potential role of chromosome structures in adaptation as it has been demonstrated in Anopheles mosquitoes [23].
The geographic origin of CA Ae. aegypti populations and the means by which they were introduced remains unclear. Perhaps the most interesting open question is what conditions facilitated multiple introductions? Answering these questions is beyond the scope of this study and requires additional data. Investigating samples from different origins using the same NGS platform may provide a clearer description of Ae. aegypti invasion history in CA. In addition, investigation describing genomic changes over time may provide information on local adaptation and potentially will be useful for the control of the species in California.

Conclusion
The mosquito species Aedes aegypti, introduced in 2013, has now been detected in multiple locations throughout California. Our genome analyses identified 3 distinct population groups loosely corresponding to different regions within California. Genome-wide comparisons indicate extensive differentiation between genetic clusters. Samples collected from Clovis in two different years (2013 and 2016) reveal genomic signatures of potential selection. Our mitogenome analysis suggests that founding populations were polymorphic for two mitochondrial lineages with one or the other lost in the various extant populations. These observations support recent multiple introductions of Ae. aegypti into California. This is the first paper that utilizes the whole genome sequences of Aedes aegypti field isolates. Our dataset serves an important step toward future studies aimed at understanding population divergence, gene-environment interactions, and dispersal of this invasive species.

Mosquito collection
Adult female Ae. aegypti were collected from 13 cities by personnel from Mosquito Abatement Districts in Fresno, San Diego, and Orange Counties ( Fig. 2 and Additional file 1: Table S1). These mosquitoes are collected using BG Sentinel traps baited with CO 2 . All collections on private properties were conducted after obtaining permission from residents and/or owners. Mosquito samples were individually stored in 80% ethanol prior to DNA extraction.

Whole genome sequencing
Genomic DNA was extracted using established protocols [24,25]. DNA concentrations for each sample were measured using the Qubit dsDNA HS Assay Kit (Life Technologies) on a Qubit instrument (Life Technologies). A genomic DNA library was constructed for each individual mosquito using 20 ng DNA, Qiaseq FX 96 (Qiagen, Valencia, CA), and Ampure SPRI beads (Beckman) following an established protocol [25]. Library concentrations were measured using Qubit (Life Technologies) as described above. Libraries were sequenced as 150 bp paired-end reads using a HiSeq 4000 instrument (Illumina) at the UC Davis DNA Technologies Core.

Sequence analysis
Raw reads were trimmed using Trimmomatic [26] version 0.36 and mapped to the AaegL5 reference genome [27] using BWA-MEM [28] version 0.7.15. Mapping statistics were calculated using Qualimap version 2.2 [29](Additional file 1: Table S1). Joint variant calling using all samples was done using Freebayes [30] version 1.0.1 with standard filters and population priors disabled. We required a minimum depth of 8 to call variants for each individual following the recommendation of Crawford and Lazzaro to minimize bias in population inference [31]. To improve the reliability of calls, we required variants to be supported by both forward and reverse reads overlapping the loci (Erik Garrison, Wellcome Trust Sanger Institute and Cambridge University, personal communication, Dec. 2014). The repeat regions are "soft-masked" in the AaegL5 reference genome and SNPs in these regions were excluded from analysis. Only biallelic SNPs were used for further analysis. A missing data threshold of 20% was used to filter SNPs. A phylogenetic tree base on the polymorphism data was constructed using the neighbor-joining algorithm as implemented in PHYLIP [32] version 3.696. Hudson F ST [33], nucleotide diversity (π) and Principal Component Analysis (PCA) analyses was done in Python version 3.6.6 using the scikit-allel module version 1.2.0 [34].
The presence of mitochondrial pseudogenes in the nuclear genomes of Ae. aegypti could potentially confound SNP calling [35]. Thus we followed the mapping recommendations suggested by Schmidt et al. [14] and mapped raw reads to the mitochondrial reference genome prior to mapping unmapped reads to the nuclear genome.
We used Ae13CLOV028MT (Genbank ID: MH348176) as a reference for mapping the mitochondrial genome because all our specimens contained a deletion between position 14,522 and 14,659 compared to the AaegL5 reference genome [14]. Variants in the mitochondrial genome were called with Freebayes as described for the nuclear genome, but set to single ploidy. Mitochondrial coverage was on average 160 times greater than the nuclear genome coverage with a minimum of 25-fold difference (Additional file 1: Table S1). Use of properly paired reads for variant calling reduced errors generated by failing to recognize mitochondrial pseudogenes present in the nuclear genome. The Vcf2fasta program [36] was used to extract mitogenome sequences from the VCF file to FASTA format. MEGA version 7.0.26 [37] was used for mitogenome alignment. Mitogenome reference sequences of Culex quinquefasciatus (Genbank accession number = HQ724617), Aedes notoscriptus (KM676219), and Aedes albopictus (NC_006817) were obtained from GenBank and added to the alignment. Sequences for the thirteen mitochondrial protein-coding genes in Ae. aegypti were obtained from GenBank [38], extracted from our dataset, and concatenated for tree construction with the maximum likelihood algorithm implemented in MEGA.

Additional files
Additional file 1: Table S1. Metadata for each sample including geographic coordinates, collection date, genome sequencing related metrics, and genotypes of genetic-cluster-specific SNPs. (XLSX 20 kb) Additional file 2: Table S2