This study provides the most extensive set of data on molecular polymorphisms in immune and non immune related genes in A. gambiae and A. arabiensis. We observed a high level of nucleotide diversity (π>0.006) in coding regions within members of the A. gambiae complex. This is approximately ten fold higher than the level of nucleotide diversity observed in coding regions of the human genome  but is comparable to the estimates reported from Drosophila or in previous studies on Anopheles [20, 21, 24, 35–38]. Nucleotide diversity is a product of both mutation rate and effective population size. Mutation rates were shown to be similar in humans and Drosophila [39, 40], so the ten fold difference in average nucleotide diversity between species was attributed to the approximately ten fold larger long term effective population size in Drosophila than in humans (Ne = 300,000 versus 20,000 respectively) [41, 42]. The effective population size of A. gambiae is estimated at several hundred thousand , which falls within the range of Drosophila estimates and might explain similarities in levels of genetic diversity in both species. We also observed a slightly reduced diversity in A. arabiensis compared to A. gambiae and in the M form of A. gambiae compared to the S form. Reduced genetic diversity in A. arabiensis was already observed previously [20, 44], but colonization and further maintenance of A. arabiensis for several generations in an insectary might have added to this trend through increased genetic drift. Alternatively, long term effective population sizes differences might be expected between these species, in light of their distinct bionomics and distribution across Africa [45, 46].
Patterns of genetic diversity can be influenced by differences in recombination rate across the genome, with a higher genetic diversity and a faster evolution rate expected in high recombination regions [47, 48]. Our results must therefore be interpreted taking into account this potential heterogeneity along the A. gambiae genome. The known factors influencing recombination rate in A. gambiae are the chromosomal inversions and the proximity to centromeres and telomeres [44, 49]. The distribution of immune related genes on the genome did not show any aggregation in particular regions (Figure 1), therefore we did not expect that the diversity pattern observed in immune related genes would be a consequence of heterogeneity of recombination rates along the genome. However, considering genes position on the genome sheds light on some evolutionary processes in A. gambiae. Paracentric chromosomal inversions are very abundant in the A. gambiae complex. More than 120 polymorphic inversions have been detected in natural populations [49, 50]. Ten inversions are fixed in the different species of the complex and can be used to differentiate individual specimens. Several lines of evidence suggest that these chromosomal arrangements are incidental to ecotypic adaptation and may be directly involved in the past and current speciation processes occurring within this species complex, mainly because recombination suppression between alternative chromosomal arrangements might protect arrays of co-adapted genes [46, 51–56]. Mathematical models for inversion-based local adaptation and/or speciation predict lower genetic diversity within species in the chromosomal region involved in speciation, and higher divergence between species [57, 58]. In agreement with these models, we found evidence for reduced variability and higher genetic divergence between species on the X-chromosome. Indeed, A. arabiensis and A. gambiae are differentiated by a fixed chromosomal arrangement (Xag) that inverts a large part of the X chromosome . Reduced effective population size for the X chromosome is not sufficient to explain the observed pattern which is therefore in agreement with previous studies demonstrating a "large × effect" on differentiation between these sibling species [44, 59]. The "large × effect" hypothesis assumes the existence of speciation genes on the X chromosome responsible for ecological and/or behavioral adaptations that affect interspecific mating and/or hybrid fitness.
No X chromosome inversions, detectable at the cytogenetic level, distinguish the M and S forms of A. gambiae. Moreover, our samples were collected from an area of South Cameroon where A. gambiae M and S are known to be homosequential for the standard karyotype on all autosomes [60–62]. Accordingly, levels of genetic differentiation (single gene Fst estimates) were generally much lower between the M and S forms of A. gambiae than they were between these populations and A. arabiensis, and only occasionally did reach statistical significance. However, consistent with recent evidence for increased differentiation due to reduced recombination in the centromeric region of the X chromosome of A. gambiae [44, 62–64], the only fixed SNP we observed between M and S maps to the proximal region of the X chromosome (X5D), a region that is considered as a "speciation island" between the M and S forms of A. gambiae . Except for the X chromosome, we observed comparable genetic diversity across the entire genome. We did not detect any centromere or telomere effect, but a higher density of genetic markers would be necessary to draw firm conclusions about reduced diversity in these regions. The constant distribution of genetic diversity on autosomal chromosomes observed in the present study is in contrast with results of the A. gambiae genome project, where a highly variable distribution was observed . The genome sequencing project utilized the PEST strain that was established from a mix of several natural populations of the M and S forms of A. gambiae, maintained under insectary conditions for several years and exposed to bottlenecks and selections . It is likely that the uneven distribution of diversity in the PEST strain resulted from maintenance in the insectary, and that diversity in natural populations is more evenly distributed. It must be kept in mind, however that the present study focuses on populations from only one location for each species and might reflect only a portion of the natural genetic diversity of the species. Moreover, Cameroon might be an area where the M and S forms of A. gambiae have achieved one of the highest level of genetic differentiation observable throughout the species range, as was first described by Wondji et al.  using microsatellite data and further expanded by Turner et al.[62, 66], using sequence data. Overall, our estimates of Fst between species/molecular forms are in strong agreement with those of Wondji et al.  and Turner et al. [62, 66], who detected one of the highest and most significant level of genetic differentiation between the M and S forms of A. gambiae observable throughout the species range.
Tests for departure from neutrality are based on the assumption of mutation-migration-drift equilibrium. However, evidence for recent population expansion and radiation has been found in both A. gambiae and A. arabiensis [67, 68]. Unstable demographic history can produce patterns of genetic variation indistinguishable from those of selection [69, 70]. However, demographic history affects similarly the entire genome, while selection is locus-specific. For example, rapid population expansion is expected to result in highly negative Tajima D values, as a consequence of the rapid increase in number of polymorphic sites (S), and excess of low frequency alleles that have little effect on π . This pattern was not observed in our dataset, where most computed Tajima D values were negative but not statistically significant. The Z test of selection and Ka/Ks ratios demonstrated a deficit of non synonymous mutations in most of the genes (immune related or not). The influence of population size changes on tests based on synonymous and non synonymous variations appears to be weak, even if not fully understood . Therefore, it is most likely that the deficit of non synonymous mutations is due to generalized purifying selection acting on A. gambiae ORFs, probably reflecting functional constraints on the encoded proteins.
In Drosophila, several immunity genes revealed directional selection [22, 72, 73] and a broad comparison of immune system and non immunity genes supported the hypothesis that pathogens exert a selective pressure on the immune system . Evidence for directional selection driving evolution of the immune system in Drosophila was consistent with the "arms race" model of co-evolution . In this model, the pathogen constantly evolves to escape the host's immune response and, in turn, the hosts' immune system evolves to better control infections. Such dynamic iterative interactions would promote rapid evolution in the genes involved in pathogen-host interactions with rapid rise in frequency of selectively advantageous alleles and high turn-over between alleles leaving insufficient time for the accumulation of neutral polymorphism [76, 77]. In contrast with data in Drosophila, our results did not detect a pattern of directional selection in innate immunity genes of A. gambiae. Although it is likely that different evolutionary forces are at play in these organisms, our inconclusive results probably reflect the limited statistical power of our dataset. Indeed, evidence for a higher rate of evolution in immune related genes compared to housekeeping genes in Drosophila was generated through a powerful sequence dataset and the results of the tests of selection was right below the 5% statistical significance thresholds . In our data, the statistical power was limited by the small number of fixed mutations detected between species, pointing towards the necessity of using a more distant outgroup than A. arabiensis for evolutionary studies in A. gambiae . Suitable outgroup species should diverge sufficiently to allow powerful selection tests without reaching mutation saturation. There is evidence that members of the A. gambiae complex are so closely related to each other that their low level of divergence would limit the statistical power of any conventional test of selection. Moreover, genetic introgression between well established species within the complex further overshadows their phylogenetic relationships and reduces divergence time . In contrast, species of the Pyretophorus series other than A. gambiae complex members appeared too divergent to represent appropriate outgroups . The identification of a suitable outgroup for comprehensive and powerful evolutionary studies in the An. gambiae complex is still pending but it is likely that ongoing whole-genome sequencing efforts and increased interest in this burgeoning field will soon provide appropriate candidate species and allow revision of previously inconclusive inferences.
In A. gambiae as well as in Drosophila, the level of genetic diversity appeared to be similar between immune related genes and control genes and between functional categories of genes involved in immunity. Balancing selection does not drive the evolution of the immune system (the specific case of TEP1 in A. gambiae is discussed below). This is contrasting with the pattern of selection observed in vertebrates, in which genes involved in defense mechanisms are under balancing selection in addition to directional selection. In vertebrates, the system of recognition of acquired immunity requires a large diversity of major histocompatibility complex genes to bind large diversity of antigens . As such, the pattern of variability in A. gambiae and Drosophila immune systems is consistent with the recognition of relatively few motifs conserved across broad ranges of pathogens. However, the recent discovery of hypervariable immunoglobulin domain-encoding genes, Dscam, capable of producing pathogen-specific splice [80, 81] opens new insights into recognition system in insects suggesting specific recognition of a spectrum of pathogens. Future investigations of molecular evolution will determine the selective forces at play on such genes and will help understanding their role in pathogen recognition.
One of the immune-related genes, TEP1, showed a unique pattern of variation. It displayed the highest level of genetic diversity among the genes we investigated and a significant positive value of the Tajima D statistic suggesting maintenance of divergent alleles. A previous study  revealed two highly differentiated TEP1 alleles that were initially mistaken as distinct genes, in the first version of the genome assembly. Crosses between laboratory strains showed Mendelian inheritance of these allelic forms, TEP1s and TEP1r, which are associated with two A. gambiae strains susceptible and refractory to P. berghei, respectively ; it was hypothesized that the alternative alleles are causally related to these phenotypes. The diversity we observed in TEP1 can be the result of balancing selection and would be reminiscent of selection for diversity in acquired immunity. However, gene conversion between different genes in the TEP family might result in similar patterns of diversity and further investigation is needed to disentangle these hypotheses. Nonetheless, our results emphasize the importance of the TEP1s and TEP1r alleles, and demonstrate these to segregate in natural populations. However, their role in determining the susceptibility of A. gambiae to Plasmodium infection remains to be established, as their segregation in laboratory strains could be due to increased genetic drift at the onset and throughout the colonization process.