Dataset
Using data from the Ag1000G Project (phase 1 AR3 data release; https://www.malariagen.net/projects/ag1000) carried out by the Malaria Genomic Epidemiology Network (MalariaGEN) [22], SNPs were obtained for each gene. The dataset contains 765 samples (619 An. gambiae s.s., 132 An. coluzzii and 14 hybrids) from 8 African countries comprising Angola, Burkina Faso, Cameroon, Gabon, Guinea, Guinea-Bissau, Kenya and Uganda. Mosquitoes were collected between 2009 and 2012, except the Gabon samples which were collected in 2000. Samples from Burkina Faso were separated between species, An. gambiae s.s and An. coluzzii. Due to high frequency of hybrids [25, 26], Guinea-Bissau was considered as a single mixed population of the two sibling species. SNPs in non-coding regions were removed and SNP numbers were counted after filtering (S2 Table). Only SNPs within exons were used in this study to minimize the confounding effects of nested genes. In total 11,318 genes were included for downstream analysis corresponding to 90.14% of An. gambiae genes in VectorBase [27] gene set Agam4.2 (S2 Table). These genes are distributed throughout all 4 chromosomal arms, i.e. 2R, 2L, 3R, 3L and X chromosome (S2 Table). No single mosquito or gene had a substantial (> 4%) amount of missing data (S3 Table).
A gene-resolution map of An. gambiae s.s. and An. coluzzii populations
Inter-individual genetic distance matrices (765 × 764 / 2 = 292,230 individual pairs) were calculated for each of the 11,318 genes (methods summarized in Fig. 1). The distance matrices were linearized and combined for all genes into one large matrix (11,318 genes × 292,230 individual pairs). The t-SNE algorithm was used to flatten this high-dimensional data into a representation depicting genes in two dimensions. The goal of this dimensionality reduction is to preserve as much of the high dimensional population structure information in the low dimensional representation as possible. The graph-based t-SNE algorithm differs from matrix factorization methods such as principal components analysis (PCA) and multidimensional scaling (MDS) in that it is concerned primarily with local relationships between genes (genes with highly similar population structures) and is able to flatten complex manifolds to some extent, although global relationships are not usually reliably represented. Our approach has similarities to the TREESPACE algorithm for tree clustering [28]. However, the dimension-reducing method used by TREESPACE, MDS, does not scale well to large datasets [29] and over-emphasises long-range distances as discussed above.
The resulting all-gene t-SNE is shown in Fig. 2, where genes are colored by chromosome arm and two previously described chromosomal inversions: 2La and 2Rb [5, 30]. Approximately 6000 genes from a mix of chromosomes form a large homogenous central region, while surrounding clusters of various sizes appear to be defined by chromosomal arm and inversions, i.e. genes on the same chromosomal arm or within the same inversion tend to be co-located on the t-SNE.
Because t-SNE has a random initialization step, each invocation of the algorithm could produce a different result. The results of 30 independent replicate t-SNE mappings (S1, S2 and S3 Figs; S4 Table; and also interactively via the web interface at https://vigilab.shinyapps.io/anopheles/) show that, overall, the dense central region and the large and smaller peripheral clusters are consistent in terms of gene content, though their relative positions are more variable. Henceforth in this article, “consistent clustering” or “consistently clustered” refers to highly reproducible cluster membership across the 30 plots. The variability of global cluster arrangement between the repeated t-SNE plots clearly illustrates the limitations of the method with respect to the accurate reproduction of long-range relationships. Thus, we warn against trying to interpret the high-level structure of the plots.
Although the layout of genes on t-SNE is driven by inter-individual genetic distances, it is informative to use population-based genetic measures such as the fixation index (FST) as an aid to interpret the map. General FST, which quantifies the average divergence between all pairs of populations defined by country and sibling species, is high (FST > 0.2) in most of the outer clusters (Fig. 3a; S5 Table).
There is also a clear global trend for an increased number of SNPs per gene from the center to the periphery of the map (Fig. 3c). This trend is also seen at a smaller scale within major peripheral clusters. Thus, a broad interpretation of the map is that genes with undifferentiated SNPs tend to be located in the center, while more differentiated genes are found in several distinct, peripheral clusters (Fig. 3a, b).
X chromosome and speciation islands
In the An. gambiae s.l. complex, a “speciation continuum” is observed, i.e. species undergo heterogenous gene flow [31, 32], genomic introgression [25, 33, 34], and uncertain boundaries [35, 36]. Most of the species within this complex were first distinguished by inter-species hybridization resulting in sterile male progeny or by the presence of fixed chromosomal inversions [2, 4]. Centromeric regions of chromosomes have been demonstrated to contain high levels of differentiation and often described as “islands of speciation” [16, 17]. One of these, a region on the X chromosome, has been especially associated with the speciation process [36, 37]. According to Fontaine et al. [36], a 15 Mb region of the chromosome X reveals the ‘true’ species tree between An. gambiae, An. arabiensis and An. melas, while autosomes are misleading due to extensive historical introgression between species. Similarly, Lee et al. [38] showed that markers on the X chromosome have greater diagnostic power than those on autosomes for divergence between An. gambiae and An. coluzzii. Later, Aboagye-Antwi et al. [37] demonstrated that the X chromosome island only plays a key role in assortative mating between An. gambiae and An. coluzzii.
Genetic differences between the two sibling species included in the dataset analysed here, namely An. gambiae s.s. and An. coluzzii, contribute to the clustering in the t-SNE of genes that may be relevant to speciation (“species divergence cluster”, Fig. 4). Genes with high Species FST are located on the right-hand side of the plot (FST > 0.5) and are predominantly located either on the centromeric half of the X chromosome or close to the 3R centromere (Fig. 4a, b), which is consistent with previous studies that have detected high level of divergence between An. gambiae s.s. and An. coluzzii on speciation islands [16, 17, 39, 40]. Additionally, several high Species FST genes from other autosomal locations are consistently co-clustered with these X and 3R centromeric genes. If a t-SNE is made with only samples from one species (An. gambiae s.s.) the co-clustering of these genes is lost (Fig. 4e).
Thus, the t-SNE provides a visual indication of the genomic extent of “islands of speciation”. Starting with genes close to known X-linked species-diagnostic markers (e.g. the intergenic spacer (IGS) of the multicopy ribosomal DNA [41, 42] and the An. coluzzii-specific SINE (short interspersed element) insertion [43], one can identify autosomal genes that are consistently co-clustered with them (Fig. 4; S6 Table). One such gene, OBP41 (odorant binding protein 41; AGAP005182; non-centromeric 2L; Species FST 0.53) is interesting because it is highly expressed in ovaries 48 h after a blood meal [44, 45]. The protein product of this gene may, like other atypical-type ovary-expressed OBPs, be present in the eggshell and have a role in sperm chemotaxis [46]. Also, co-clustered with OBP41 and genes of X-linked speciation is AGAP001820 (genomic location: 2Rj inversion; Species FST 0.18). This gene is a one-to-one ortholog of Drosophila melanogaster Helicase 89B (Hel89B) which encodes a DNA-binding protein that acts as a chromatin regulator. The high level of expression of AGAP001820 in the testis mirrors the ovarian expression of OBP41 and is likewise suggestive of a role of this gene in speciation [44]. Furthermore, two odorant receptor genes Or37 (AGAP002126; chromosome 2R; Species FST 0.26) and Or60 (AGAP011979; 3L; Species FST 0.53) and TEP3 (thioester-containing protein 3, AGAP010816; 3L; Species FST 0.28) are consistently co-clustered. The odorant receptor genes have shown sex-biased expression in An. gambiae, where Or37 was differentially expressed in male reproductive tissues [47] and Or60 in females after a blood meal [45]. The immunity gene, TEP3, has been previously identified as having long-range LD with speciation island regions and highlighted as differentiated between A. gambiae and A. coluzzii [48, 49]. Thus, several genes that may have either driven the speciation process, or be directly downstream of it, have been identified easily using this visual tool.
Recent adaptive introgression on the left arm of chromosome 2 (2L) was repeatedly detected in natural populations conferring homogenization of autosomal genomic islands [31, 33, 50]. For example, under strong selective pressure by insecticides, An. coluzzii inherited the entire An. gambiae-associated large centromeric region of chromosome 2L 2L, where the voltage-gated sodium channel (Vgsc, AgamP4 gene ID = AGAP004707) gene that confers insecticide resistance is found [33, 34]. Homogenization of this genomic region via introgression therefore explains why centromeric genes from 2L are not highlighted in Fig. 4a.
Chromosomal inversions
As seen in Fig. 2, genes located within the 2La and 2Rb inversions form two well delineated clusters (blue and purple clusters in Fig. 2, respectively). The distinct population structures for the genes in these clusters is expected due to the much-reduced recombination rate within inversions in heterokaryotypes that reduces gene flow between homokaryotypes [51]. Chromosomal inversions are the result of reversed reinsertion of two break points, and like any other type of mutation, evolve under selection and random drift [51]. These two inversions in the second chromosome of An. gambiae have a broad geographic distribution and their frequency as the degree of aridity increases [5, 11]. Genomic resequencing of An. gambiae collected along the cline, has shown evidence of local adaptation i.e. environmental/ecological conditions maintain the cline inversion distribution [12, 52].
In the present study, a total of 2615 genes located in the 2L chromosome arm were included, of which 1087 (41.5%) are within the 22 Mb 2La inversion. In the dataset studied here, 75% of the individuals are homokaryotypes, including 43% 2La-standard (+a/+a) and 32% 2La-inverted (a/a). Genes mapped within the 2La inversion formed one large cluster (2La-1) and a small cluster (2La-2) in the plot (Fig. 5a). The 2La-2 cluster (or in some cases, just the subset of its 15 most distal genes) is present in all 30 replicate plots (S1 Fig) and is discussed in the selective sweeps section.
The other inversion, 2Rb, is approximately 7.7 Mb long and consequently comprises a smaller number of genes (475 genes), although the gene density here is higher (2La- 50 genes/Mb; 2Rb- 61 genes/Mb) (Fig. 5b). In the present study, 62% of the individuals are homokaryotypes for 2Rb-standard (+b/+b) and 17% homokaryotypes for 2Rb-inverted (b/b). Together with 2Rb, the other four common polymorphic inversions on this chromosome arm (c, d, j, and u, and overlapped du) are highlighted in Fig. 5b. The clustering of genes in the t-SNE largely follows the pattern of inversions and their overlaps. For example, where the 2Ru inversion overlaps with 2Rd, genes in this region (in blue in Fig. 5b) form a separate cluster from the non-overlapping 2Rd genes (in orange). This could be explained in part by the covariance that is relatively tightly linked by virtue of physical proximity of any two genes.
Not all documented inversions will cause a clear clustering of genes in the t-SNE. Only inversions that are polymorphic in the samples analysed will have this effect.
Centromere and telomere proximity
Several well-defined peripheral clusters are formed from genes not located within inversions and are worthy of further investigation. We have used the algorithm DBScan to extract clusters for each chromosome arm outside of the main inversions (S7 Table). Two of the largest non-inversion clusters are 2R-vii and 2L-ii, each containing more than 200 genes that are located close to the centromere (Fig. 6). Cluster X-iii (120 genes, Fig. 7c), previously discussed in terms of sibling species differentiation, is also centromere-proximal. The distinctness of these clusters may be explained by lower rates of recombination around centromeres that has been shown in several animals, plants and fungi [53], since reduced mixing would make it more likely that two neighbouring genes share the same population structure.
Centromere-proximal genes on chromosome 3 do not form consistently large peripheral clusters analogous to clusters 2R-vii and 2L-ii. Cluster 3R-i (180 genes, Fig. 7a) is large and consistent, but is rarely truly peripheral like 2R-vii and 2L-ii. Centromere-proximal clusters on 3L are small and one (3L-iv) is discussed below in another context. How the centromeres of the two autosomes have come to have different population genetic dynamics remains to be explained, though the presence of chromosomal inversions in one chromosome but not the other may be a factor.
Genes close to telomeres are not locally constrained in the t-SNE plot to the degree seen for genes in inversions and centromeres regions. For example, on chromosome 2, only two small t-SNE clusters are found near the telomeres 2R-ii (21 genes, Fig. 6b) and 2L-iv (18 genes, Fig. 6a), and near the telomere of chromosome 3R, cluster 3R-iii is quite large (184 genes, Fig. 7a) though not clearly separated from the core region of t-SNE.
Selective sweeps
Another genetic factor that strongly influences population structure and therefore the layout of genes in the t-SNE is positive selection. If a single locus is under strong selection, its genomic neighborhood is also affected due to linkage disequilibrium (LD), creating a so-called selective sweep. Positive selection is typically identified through the analysis of haplotype diversity and LD with reference to a fully assembled genome [54, 55]. In this study, we note that small, isolated clusters of contiguous genes in the t-SNE typically contain a gene that has either previously been implicated in recent selective sweeps (often related to insecticide resistance) or is a likely candidate for such selection. Thus, our t-SNE of gene-resolution population structure may offer a simple visual means to identify potential selective sweep genes in organisms with poorly assembled genomes. Below we explore in detail individual genes and genomic regions under selection.
Perhaps the most prominent ‘selective sweep cluster’ is the ‘GSTE cluster’, 3R-iv (Fig. 7a), containing four glutathione S-transferase genes: GSTE1 (AGAP009195), GSTE5 (AGAP009192), GSTE6 (AGAP009191) and GSTE7 (AGAP009196) that exhibit high population structure (General FST respectively = 0.27, 0.25, 0.29 and 0.25; S6 Table). This region of strong selection was also identified in the original analysis of the Ag1000G dataset [22]. The genes GSTE2 (AGAP009194), GSTE3 (AGAP009197) and GSTE4 (AGAP009193) are also located in this genomic region but fall in the center of the t-SNE due to low numbers of SNPs that pass the quality criteria. It is thought that GSTE2 may be the actual gene under selection [56, 57]. Three immune system genes SRPN6, SRPN16 and CLIPB11 are also consistently present on this cluster, however it is not clear if these have evolutionary and functional significance or have simply piggybacked with the locus under selection. It is noteworthy that SRPN6 is highly expressed in mosquito midgut and salivary gland epithelial cells that are invaded by the malaria parasites and is involved in parasite killing and/or clearance [58, 59]. Therefore, its putative involvement in this selective sweep notwithstanding, its location within a strongly selected locus could contribute to diversifying vectorial capacity between An. gambiae populations.
The tandemly duplicated CYP6P gene family has been previously identified to be under recent selection and likely involved in insecticide resistance [60,61,62]. In the t-SNE, the genes of this family are excluded because they are located in the intron of another gene, AGAP002859. However, this gene and 56 neighbouring genes form an isolated cluster (yellow cluster in Fig. 5b). All 57 genes are located within the 2Rc inversion, though the majority of the 271 genes within this inversion are dispersed elsewhere on the t-SNE (Fig. 5b; web interface). Thus the ‘2Rc cluster’ in Fig. 5b is not a typical inversion cluster as seen for 2La or 2Rb on the t-SNE plot (see Fig. 2), for instance, and may be better characterised as a selective sweep.
Both of the small, isolated, contiguous clusters 2La-2 and 3L-iv contain cytochrome P450 genes. Cluster 2La-2 (Fig. 5a) contains CYP4J5 (AGAP006048), CYP4J10 (AGAP006049) and CYP4J9 (AGAP006047). Weetman et al. [20] identified SNPs in CYP4J5 and CYP4J10 that are associated with pyrethroid resistance in Ugandan isofemale families but only one of the SNPs in the CYP4J5 gene showed highly reproducible and significant resistance association in sample sets from both Uganda and Kenya [20]. Because there was no loss of haplotypic diversity in the few samples sequenced from Uganda, they suggested that CYP4J5 has been subject to a soft selective sweep. However, the consistent distinctness of the 2La-2 cluster in our analysis of 765 samples suggests a strong selective sweep has indeed occurred. The 3L-iv cluster (Fig. 7b) is likely the result of selection on CYP4C28 (AGAP010414) or carboxylesterase COE12O (AGAP010390). The former gene is overexpressed in mosquitoes collected from agricultural sites compared to an insecticide susceptible strain, suggesting involvement in insecticide resistance [63].
Another small, isolated cluster of 15 genes, containing the immunity-related gene CLIPB5 (AGAP004148) may also indicate a recent selective sweep (Fig. 6b). Notably, this region (2R:50645302–50,862,651) does not contain any genes typically associated with insecticide resistance and so CLIPB5 may be the most likely candidate to contain the allele under selection [64].
The best-known gene under strong selective pressure in insects is Vgsc (para gene - AGAP004707). Two mutations in Vgsc codon 995 of An. gambiae have conferred knockdown resistance (kdr) to DDT and pyrethroid insecticides: leucine to phenylalanine (L995F) [65] and leucine to serine (S995) [66]. The frequency of L995F kdr mutation is high in West and Central Africa populations included in the present study, while L995S is mostly present in Central and East Africa [22]. This gene is located in the centromere-proximal region of the left arm of chromosome 2 and it belongs to cluster 2L-ii in the t-SNE (Fig. 6a), a cluster much larger (200 genes) than the selective sweep clusters (20–30 genes) described above. Thus, visual interpretation of the t-SNE would not highlight this locus as a potential selective sweep. The complex multi-locus resistance of the Vgsc gene, its multiple introgressions between sibling species, and its location close to the centromere may explain why this gene does not belong to a small, isolated cluster typical of other recently selected genes.
The Guinea-Bissau cluster
It is also possible for selection to act on two or more unlinked loci. Clusters 2L-iv (Fig. 6a) and 3L-v (Fig. 7b) overlap in an area of the t-SNE that we henceforth refer to as the “Guinea-Bissau cluster” (Fig. 8) since genes within this cluster exhibit the highest Population FST for this country (S4 Fig; S8 Table). Guinea Bissau samples included in present study are from the coastal region, where introgression from An. coluzzii to An. gambiae has been persistently reported at high rates (> 20%) [25, 26, 65]. Vicente et al. [67] reported that this massive introgression, which is limited to the coastal region, drove species radiation between coastal and inland An. gambiae populations. The spatiotemporal stability of this novel hybrid form was related to species and local selection on chromosomal inversions. Whilst inland An. gambiae populations showed a common chromosomal form (SAVANNA) across West Africa, coastal A. gambiae presented a localized chromosomal form (BISSAU). The “Guinea-Bissau cluster” could reveal genes involved in local adaptation of this hybrid form.
The cluster contains just over 100 genes from both autosomes and three genes from the X chromosome. Several putative immunity-related genes are present including CLIPC4 (AGAP000573), CLIPA8 (AGAP010731), CLIPB14 (AGAP010833), HPX11 (AGAP010899), HPX15 (AGAP013327) and HPX16 (AGAP011216). The heme peroxidase (HPX) genes are of particular note being located so close together in the map despite not being close genomic neighbours (spanning a region of 300 genes), and because HPX15 (also known as IMPer) has been implicated in the modulation of midgut immunity and microbiota tolerance [68].
Interestingly, two of the four genes in this cluster with the highest Guinea-Bissau Population FST have published links to viral infections: AGAP010732 (FST = 0.71) encodes a zinc-finger protein which is significantly upregulated upon densovirus infection (Ren et al., 2014); and AGAP004695 (FST = 0.75), which encodes a subunit of the ESCRT-I complex that mediates the intracellular trafficking of membrane proteins, was found to be upregulated during O’Nyong Nyong virus (ONNV) infection [69]. The gene encoding eukaryotic translation initiation factor 3 subunit B (AGAP012140) is also located in this cluster, though with a lower FST of 0.14, providing further support for a local viral challenge hypothesis given that viruses are dependent on the host’s translation machinery [70]. However, the innate immunity genes CLIPA8 and CLIPB14 that are also found in this cluster, are associated with Plasmodium and bacterial infections [71, 72], so non-viral immune challenges may also have influenced the evolution of these genes in Guinea-Bissau.
Gene function enrichment in the t-SNE
A systematic analysis was performed to detect over-representation of gene functions in sub-regions of the t-SNE. K-means clustering using the 2D t-SNE coordinates was exhaustively performed for a variety of K values (see Methods for details) to partition the map into different subsets. Each gene set was tested for overrepresentation of biological function by means of a Gene Ontology (GO) term enrichment analysis using annotations from VectorBase [27]. After appropriate multiple testing corrections, 67 unique GO terms were found to be significantly enriched in various locations in the t-SNE (S6 Fig; S9 Table).
Since genomic location is the primary driver of the location of a gene in the t-SNE, tandemly duplicated genes are generally found close together in the plot and their GO terms are enriched, though only trivially. Therefore, we were particularly interested in GO terms enriched in clusters of non-contiguous genes. Broadly speaking, the center of the plot is characterized by a low number of SNPs as well as low population structure values, i.e. General and Species FST (Fig. 3), characteristic of conserved/housekeeping genes. As expected, GO terms related to basic maintenance of biological functions such as translation, peptide and amide biosynthesis, ribosome, mitochondria are enriched in that area (S6 Fig).
The t-SNE region described above as the “species divergence cluster” contains genes from different chromosome arms (Figs. 2; 4d). Interestingly, this area is enriched for sensory perception and behavior (S6 Fig; S9 Table), functions likely to be involved in the distinct mating and habitat preferences of An. gambiae and An. coluzzii. The “Guinea-Bissau cluster”, which, as discussed above, contains several highly differentiated genes putatively involved in viral infection, is also significantly enriched for cholesterol transport and ion binding. All four cholesterol transport genes are located in a tandem array within the 2Rc inversion, so the GO enrichment is not unexpected. However, the majority of the genes closely neighbouring the tandem array are found in other distinct clusters in the t-SNE, particularly the main 2Rc cluster. So, the cholesterol transport genes appear to be in the “Guinea-Bissau cluster” by exception rather than by default. Cholesterol transport and ion binding can be linked to viral infections: membrane lipid properties can affect viral entry and exit and intracellular trafficking, and the expression of ion binding genes was previously found altered in Aedes aegypti under flaviviral infection [73].