Identification and analysis of Single Nucleotide Polymorphisms (SNPs) in the mosquito Anopheles funestus, malaria vector

Background Single nucleotide polymorphisms (SNPs) are the most common source of genetic variation in eukaryotic species and have become an important marker for genetic studies. The mosquito Anopheles funestus is one of the major malaria vectors in Africa and yet, prior to this study, no SNPs have been described for this species. Here we report a genome-wide set of SNP markers for use in genetic studies on this important human disease vector. Results DNA fragments from 50 genes were amplified and sequenced from 21 specimens of An. funestus. A third of specimens were field collected in Malawi, a third from a colony of Mozambican origin and a third form a colony of Angolan origin. A total of 494 SNPs including 303 within the coding regions of genes and 5 indels were identified. The physical positions of these SNPs in the genome are known. There were on average 7 SNPs per kilobase similar to that observed in An. gambiae and Drosophila melanogaster. Transitions outnumbered transversions, at a ratio of 2:1. The increased frequency of transition substitutions in coding regions is likely due to the structure of the genetic code and selective constraints. Synonymous sites within coding regions showed a higher polymorphism rate than non-coding introns or 3' and 5'flanking DNA with most of the substitutions in coding regions being observed at the 3rd codon position. A positive correlation in the level of polymorphism was observed between coding and non-coding regions within a gene. By genotyping a subset of 30 SNPs, we confirmed the validity of the SNPs identified during this study. Conclusion This set of SNP markers represents a useful tool for genetic studies in An. funestus, and will be useful in identifying candidate genes that affect diverse ranges of phenotypes that impact on vector control, such as resistance insecticide, mosquito behavior and vector competence.

Results: DNA fragments from 50 genes were amplified and sequenced from 21 specimens of An. funestus. A third of specimens were field collected in Malawi, a third from a colony of Mozambican origin and a third form a colony of Angolan origin. A total of 494 SNPs including 303 within the coding regions of genes and 5 indels were identified. The physical positions of these SNPs in the genome are known. There were on average 7 SNPs per kilobase similar to that observed in An. gambiae and Drosophila melanogaster. Transitions outnumbered transversions, at a ratio of 2:1. The increased frequency of transition substitutions in coding regions is likely due to the structure of the genetic code and selective constraints. Synonymous sites within coding regions showed a higher polymorphism rate than non-coding introns or 3' and 5'flanking DNA with most of the substitutions in coding regions being observed at the 3 rd codon position. A positive correlation in the level of polymorphism was observed between coding and non-coding regions within a gene. By genotyping a subset of 30 SNPs, we confirmed the validity of the SNPs identified during this study.
Conclusion: This set of SNP markers represents a useful tool for genetic studies in An. funestus, and will be useful in identifying candidate genes that affect diverse ranges of phenotypes that impact on vector control, such as resistance insecticide, mosquito behavior and vector competence.

Background
Anopheles funestus and Anopheles gambiae are the major malaria vectors in Africa. Due to the difficulty of laboratory colonization, An. funestus has not received the same attention as An. gambiae and as a consequence there are few molecular markers for this species. However, the recent successful colonization of two strains of An. funes-tus [1] and the identification of a number of microsatellite markers [2,3] have facilitated more detailed studies of this species. Microsatellite markers particularly have been used to study population structure and gene flow between An. funestus populations [4][5][6] and a subset of these microsatellite markers were used to build the first linkage map of this species [7]. However, microsatellite markers are not evenly distributed across the genome, and their low number so far is an obstacle to the development of high resolution linkage maps needed for QTL mapping or association studies in An. funestus. Therefore, this study was initiated to increase the availability of characterized and mapped markers for An. funestus. Physically mapped ESTs were used to identify SNPs. Such ESTs have been used to study the genetic variability in a number of species such as Aedes aegypti, Drosophila melanogaster or Homo sapiens [8][9][10] and should be a source of DNA polymorphisms for An. funestus as well.
Single nucleotide polymorphisms (SNPs) are by far the most common type of molecular variation in all organisms. They are extremely abundant with an occurrence of about one SNP per kb in human [11] and about one SNP every 125 bp in An. gambiae [10]. Significant progress has been made in the development of tools for detection and genotyping of SNPs and they are now becoming the markers of choice for association studies, high-resolution linkage mapping and population genomics studies [12]. SNPs located in non-coding regions of the genome and synonymous SNPs (sSNPs) in coding regions, which have no impact on the phenotype, may provide useful markers for population genetics studies. Non-synonymous SNPs (nsS-NPs) which alter the structure (change of amino acid sequence) and potentially the function of encoded proteins are useful markers for association studies to detect genetic variations linked with phenotypic traits.
Patterns of genetic diversity in An. funestus have not been studied to the same extent as in An. gambiae or Drosophila species. Nucleotide diversity in these species has been used to compare patterns of nucleotide variation, such as the relative occurrence of transitions/tranversions in different regions of the genome [8,13]. These surveys have established codon usage and usage bias patterns in many species, with bias hypothesized to occur as a result of selection for efficient translation [14,15].
The sequencing of the 278 million base pairs (Mbp) constituting the An. gambiae genome has revealed more than 400,000 SNPs indicating a high level of polymorphism in mosquito species [16]. We hypothesize that by sequencing DNA fragments of different genes of An. funestus, a similar level of polymorphism should be encountered and will allow the identification of a significant set of SNPs.
Here, we describe the detection and characterization of a set of genome-wide SNP markers from 50 nuclear genes using two laboratory strains and field samples of An. funestus. We also examined patterns of polymorphism and nucleotide diversity in coding and non-coding regions of the genome and define the pattern of codon usage in An.
funestus. The utility of the SNPs was assessed by genotyping a subset of these SNPs during a linkage mapping study.

Gene amplification
In total, 70 primer pairs were tested by PCR, 55 of which gave reliable amplification with PCR products ranging from 194 to 1342 bp. Sequence data from a total of 21 specimens of An. funestus was obtained for 50 of these genes (see Table 1) from laboratory and field samples. Overall, we sequenced a total of 20,547 bp consisting of 14,671 bp of coding region and 5,876 bp of non-coding region. We identified 494 SNPs consisting of 303 coding SNPs (cSNPs) and 191 non-coding SNPs. Each gene contained at least one polymorphism with BU73 having one and BU88 having 29. The distribution of SNPs among the 50 genes is presented in Table 1. All information concerning the location and the nature of each individual SNP have been submitted to dbSNP, the SNP database of Gen-Bank. These SNPs with their respective reference SNP number (rs) are publicly available in dbSNP Build N°127. The NCBI ss (submitted SNP) numbers of these SNPs are ss65917063 to ss65917416.

Type of polymorphism
For all sequenced DNA fragments, transition substitutions were more predominant than transversions (62 % vs 38%). Transitions C↔T and A↔G are over-represented with 35.4 and 27.2 % of the total substitutions respectively while the four transversion classes occurred at similar levels ( Figure 1). The higher frequency of C↔T and A↔G SNPs is probably partly related to 5-methylcytosine deamination reactions that occur frequently, particularly at CpG dinucleotides [17]. The preponderance of transitions is more obvious for coding regions where out of the 303 SNPs identified, 201 were transitions (66.3%) and 102 were transversions (33.7%). The ratio of transitions/ transversions observed here is close to the 2 : 1 ratio observed for Drosophila and humans [13,18]. For polymorphism in non-coding regions, transitions accounted for 55% (106) and transversion for 45% (86). The frequency of transitions between coding and non-coding regions were significantly different (66.3% vs 55% respectively; χ 2 = 5.86, P < 0.01). This confirms that SNPs occur more frequently as transitions in coding regions than in non-coding regions. There is also a higher frequency of SNPs occurring at the third codon position (63.7%) than at the 1 st or 2 nd position ( Table 1). Similar results have been observed for Aedes aegypti [10] and in three species of Drosophila [13]. The degeneracy of the genetic code and the selective pressure for gene conservation have been suggested as the main reasons for the preponderance of transitions over transversions [13]. Synonymous or silent substitutions are more often transitions than transver-     sions and there is a stronger selection against replacement substitutions than against synonymous, leading to an increase of the relative frequency of transitions [13]. For fourfold degenerate codons, selection should be neutral, since no amino acid change is induced by a nucleotide substitution at the third position, and each of the 4 codons will produce the same amino acid. We tested this hypothesis by comparing the proportions of transversions at fourfold degenerate codon positions and at non-coding positions for all the 50 genes ( Table 2). The result shows that there is no significant difference between the frequency of transversions at fourfold degenerate codon positions (36.8%) and at non-coding regions (44.5%) (χ 2 = 2.55 P = 0.11), while this difference is significant between coding and non-coding regions (χ 2 = 5.33; P = 0.021). The fact that fourfold degenerate sites have a similar ratio of transitions/tranversions to non-coding regions is consistent with an hypothesis that the structure of the genetic code and selection against replacement polymorphisms accounts for the preponderance of transition substitutions in coding regions.
Five insertion/deletion polymorphisms (indels) were observed in four genes ranging from 1 to 4 bp in coding, intronic and 5'UTR regions (Table 3). Two indels of 2 and 4 bp were observed in the BU10 intron. The frequency of indels (8% for 4/50) is lower than that reported in Ae aegypti of 24% [10] or 25% in An. gambiae [19]. Only one indel, in the BU93 gene, was located in a coding region. This indel was a triplet that did not cause a frame shift. The four indels identified can serve as molecular makers for mapping studies.
Approximately 2/3 (206) of the 303 cSNPs were synonymous substitutions (no modification in amino acid) while around 1/3 (97) were non-synonymous or replace-ment SNPs leading to a change of amino acid. As approximately two-thirds of random coding substitutions change an amino acid, the fact that only 1/3 of cSNPs are non-synonymous implies strong selection against changes that alter amino acid. This ratio of synonymous and replacement cSNPs is similar to that observed in An. gambiae [19] and Ae. aegypti [10].

Genetic diversity
We estimated the nucleotide diversity for each of the 50 genes in coding and non-coding regions ( Table 1). The average nucleotide diversity per gene in coding regions was 7.2 × 10 -3 or around 1 SNP every 138 bp similar to that observed in An. gambiae (1 SNPs every 125 bp) [19] but much higher than the frequency of 1 SNP/kb observed in humans [20]. SNPs were observed in non-coding regions at a frequency of 1 SNP per 100 bp, corresponding to π = 10 × 10 -3 . Figure 2 shows that there is a positive correlation in the level of polymorphism between coding and non coding regions of An. funestus genome within a gene (r = 0.48, P < 0.01). This positive correlation may be the consequence of many factors notably the correlated genealogies existing between coding regions and their surrounding non-coding regions. This correlation may also be strengthened by the presence of indirect selection (hitchhiking or background selection) and probably by variable recombination rate, as it is the case in Drosophila [21]. Mutational effect of recombination or biased gene conversion can also operate, but this needs to be confirmed as even in Drosophila, the effect of biased gene conversion is only suspected but unwarranted [22,23]. The average nucleotide diversity in non-coding DNA (0.010) was lower than in synonymous sites of the coding regions (0.0207), P < 0.01. This pattern was also observed in An gambiae, Ae aegypti and Drosophila species [10,13,19]. This is an indication that non-coding regions are under greater purifying selection than synonymous sites within coding regions. This is not surprising, given that non-coding regions may be involved in gene regulation. The non-coding 5'-flanking sequence of a gene may contain regulatory elements such as the promoter that control the expression of that gene, and single-base mutations can affect essential structures for splicing and processing [24].
Nucleotide diversity varies greatly from one gene to another (Table 1) and this is likely related to individual gene function and potentially to differences in selective constraints. However, non-synonymous diversities need to be compared in order to definitely estimate the influence of differences in selective constraints. Among the most polymorphic genes sequenced were cytochrome P450 genes, lysozyme, translation initiation factor and ubiquitin conjugating genes. The non-synonymous nucleotide diversity of these genes varied from 14 to 36.3 × 10 -3 . Most of these genes are involved in specific mechanisms that evolve very rapidly, such as detoxification of xenobi-

8.4%
otics for cytochrome P450s or defense mechanisms against bacteria like lysozyme. For example, P450s present a high level of redundancy with less genetic constraints and therefore more polymorphism. In contrast some genes showed very low level of variation particularly those involved in transcriptional or translational regulation (BU973 and BU25, BU93) or in signaling processes (BU01, BU08, BU13). Examples of selective constraints have been observed as well in Drosophila spp. where substitution rate between conservative genes and fast evolving genes differ by around 10-fold [25]. Nucleotide diversity was not statistically different between laboratory strains and field collected mosquitoes (7.4 × 10 -3 and 6.9 × 10 -3 ; P = 0.21 by Student's t-test), despite an apparent low level of heterozygosity (fewer heterozygote SNPs) observed in the two laboratory strains compared to the field sample. This result could be due to the fact that FUMOZ and FANG (the two laboratory strains used in this study), were only recently colonized in laboratory and therefore still largely retain the polymorphism of natural populations of An. funestus.
The ratio of synonymous to non-synonymous changes (Ka/Ks) gives an indication of the magnitude of the purifying selection against deleterious mutations in a species. The rate of non-synonymous nucleotide substitution per non-synonymous site (Ka) is generally expected to be much lower than the rate of synonymous substitution per synonymous site (Ks), because random amino acid changes are usually deleterious, whereas synonymous changes are likely to be neutral or nearly so [26]. Thus, the expectation is Ka << Ks, except when positive selection is involved favouring particular amino acid replacements, in which case Ka will increase. For An. funestus the Ka/Ks ratio was equal to 0.181 and is similar to the ratio of 0.192 observed in An. gambiae [19] or 0.204 in Ae. aegypti [10] but, higher than the ratio of 0.115 reported in D. melanogaster [13]. This result indicates that the purifying selection against deleterious mutations is acting in An. funestus. Indeed species with large effective population size such Drosophila or Anopheles species are generally more effective at purging deleterious mutations [26].

Clustering pattern of the SNPs
We analyzed the distribution of SNPs identified in this study. We found 16 clusters of two directly neighboring SNPs, one cluster of 3 consecutive SNPs and 13 clusters of two SNPs separated by just 1 bp. For some SNP genotyping methods based on allele-specific amplification, ligation or single base extension principles for which primers need to be designed immediately adjacent to the SNP, it is important that the SNPs are not too close together to prevent primer designing. The presence of a polymorphism within approximately 20 bp will limit the possibilities for designing a robust primer. Most of the 494 SNPs identified in this study do not have a SNP within 20 bp on either or one side thus, and should be easily genotyped by one of these methods.

Genomic position of the SNPs
Among the 50 genes amplified for SNP detection in this study, 45 are already physically mapped to the An. funestus genome by in situ hybridization [27], and the remaining 5 genes were genetically located to their respective chromosome by linkage mapping [7]. Overall, 29 SNPs were located on the X chromosome, 334 on chromosome 2 and 131 on chromosome 3. The higher number of SNPs observed on chromosome 2 is also a consequence of the fact that most of the studied genes are located on that chromosome. Table 4 gives the chromosomal location of the 50 genes across the genome of An. funestus.

Polymorphism reliability
To assess the validity of the SNPs identified in this study, 30 SNPs were tested for segregation in isofemale lines. These SNPs were tested using different methods (pyrose-  quencing, HOLA, SBE and AS-PCR) [7,28]. The Mendelian segregation ratio of each of these SNP loci at F 0 , F 1 and F 2 generations was examined in four families from reciprocal crosses between a pyrethroid resistant strain (FUMOZ-R) and a susceptible strain (FANG). Homozygous and heterozygous genotypes for each of these SNPs were observed. Importantly, the expected Mendelian ratio of 1:2:1 was respected in 27 of these 30 SNPs [7], confirming the polymorphism observed at these different positions. We can conclude from this result that the SNPs described in this study are then likely to be true polymorphisms rather than sequence artifacts and our scoring results indicate that they are suitable for use as genetic markers.

Relevance of the SNPs
The set of SNPs identified in this study provide a very useful tool for future genetic studies in An. funestus. These markers are of immediate use for association and QTL mapping studies. Some of these SNPs have been used for linkage mapping and identification of QTL involved in pyrethroid resistance in An. funestus [7]. This set of SNPs can be used as tools for population genetic studies in An. funestus. Genotyping large number of SNP markers will facilitate the study of genetic structure of natural populations and provide independent estimates of gene flow. It may provide additional markers to study the speciation process observed between the Folonzo and Kiribina chromosomal forms of An. funestus [29]. These markers may also be invaluable in monitoring insecticide resistance genes or genes involved in vector competence.

Conclusion
Through the sequencing of DNA fragments from 50 genes of An. funestus, we identified a set of 494 SNP markers and studied the pattern of genetic variability in this species. The distribution of SNPs in An. funestus was not neutral but under the influence of regional factors such as recom-Correlation of nucleotide diversity in coding (π c ) and non-coding regions (π nc ) π c : nucleotide diversity of coding region, π nc : nucleotide diversity of non-coding region Figure 2 Correlation of nucleotide diversity in coding (π c ) and non-coding regions (π nc ) π c : nucleotide diversity of coding region, π nc : nucleotide diversity of non-coding region.

Mosquito samples used for polymorphism discovery
We used adult female specimens of An. funestus from two laboratory strains, FANG and FUMOZ-R (seven specimens for each strain) as well as seven field specimens. FANG is a pyrethroid susceptible strain from Calueque, southern Angola and FUMOZ-R is a pyrethroid resistant strain from southern Mozambique [1]. Field specimens of An. funestus were collected from Kela village in Chikwawa district in southern Malawi.

Selection of gene sequences for SNP identification
Target genes were selected among cytochrome P450 genes for their putative involvement in insecticide resistance [30] or among genes of a broad range of functions that had been physically mapped to An. funestus polytene chromosomes [27] (see Table 4; Figure 3). They were also chosen to be distributed across the genome of An. funestus. The sequences of the physically mapped cDNAs were retrieved from Genbank. Determination of coding sequence, UTRs and intronic regions were done using the BLAST procedure through NCBI.

Gene amplification and sequencing
Genomic DNA was extracted using the LIVAK method as described previously [31]. Primers were designed using Primer3 software [32]  PCR products were purified using the QIaquick PCR purification kit (Qiagen) and directly sequenced on both strands using a Beckman CEQ 8000 automatic sequencer.

Sequence analysis and SNP detection
SNPs were detected as sequence differences in multiple alignments using Clustalw [33]. Electrophoregrams were visually inspected using BioEdit and heterozygotes were identified [12]. SNPs were identified as transitions or transversions in coding and non-coding regions. SNPs located within coding regions were classified as synonymous or non-synonymous and their codon position determined. Nucleotide diversity analyses were performed using DnaSP 4.0 [34]. The average number of nucleotide substitutions per site between two sequences, π was calculated for each gene as well as the haplotype diversity. The average number of synonymous substitutions per synonymous site (Ks) and non-synonymous substitutions per non-synonymous site (Ka) was computed according to [35].

SNP validation
Many of the SNPs discovered in this study were validated by different methods. As a part of an effort to construct a genetic map and to identify QTL involved in pyrethroid resistance, 30 SNP loci were genotyped in several families generated from a cross between FANG and FUMOZ-R strains of An. funestus. These SNPs were scored using a HOLA (Hot Oligonucleotides Ligation Assay) method [36], single base extension (SBE) using Beckman CEQ8000 and a pyrosequencing method [7].

Authors' contributions
CSW (corresponding author) carried out the experiments; analyzed the data and wrote the manuscript. JH is the PI of the program that funded the work and contributed to the critical review of the draft manuscript. HR contributed to the design of the study and critical review of the draft manuscript. All authors read and approved the final manuscript.
Relative location of studied genes on the An. funestus genome Figure 3 Relative location of studied genes on the An. funestus genome. For definitions of genes, see Table 4. This figure was adapted from [37].