Annotation and analysis of a large cuticular protein family with the R&R Consensus in Anopheles gambiae
- R Scott Cornman†1,
- Toru Togawa†1,
- W Augustine Dunn1,
- Ningjia He2,
- Aaron C Emmons1 and
- Judith H Willis1Email author
© Cornman et al; licensee BioMed Central Ltd. 2008
Received: 08 August 2007
Accepted: 18 January 2008
Published: 18 January 2008
The most abundant family of insect cuticular proteins, the CPR family, is recognized by the R&R Consensus, a domain of about 64 amino acids that binds to chitin and is present throughout arthropods. Several species have now been shown to have more than 100 CPR genes, inviting speculation as to the functional importance of this large number and diversity.
We have identified 156 genes in Anopheles gambiae that code for putative cuticular proteins in this CPR family, over 1% of the total number of predicted genes in this species. Annotation was verified using several criteria including identification of TATA boxes, INRs, and DPEs plus support from proteomic and gene expression analyses. Two previously recognized CPR classes, RR-1 and RR-2, form separate, well-supported clades with the exception of a small set of genes with long branches whose relationships are poorly resolved. Several of these outliers have clear orthologs in other species. Although both clades are under purifying selection, the RR-1 variant of the R&R Consensus is evolving at twice the rate of the RR-2 variant and is structurally more labile. In contrast, the regions flanking the R&R Consensus have diversified in amino-acid composition to a much greater extent in RR-2 genes compared with RR-1 genes. Many genes are found in compact tandem arrays that may include similar or dissimilar genes but always include just one of the two classes. Tandem arrays of RR-2 genes frequently contain subsets of genes coding for highly similar proteins (sequence clusters). Properties of the proteins indicated that each cluster may serve a distinct function in the cuticle.
The complete annotation of this large gene family provides insight on the mechanisms of gene family evolution and clues about the need for so many CPR genes. These data also should assist annotation of other Anopheles genes.
While the R&R Consensus is conserved across arthropods, its location within a cuticular protein and the nature of the regions that flank it are highly variable. Understanding of the role of these proteins in forming the insect exoskeleton and other cuticular structures will be facilitated by defining all of the cuticular proteins of a single species. Accounts of the cuticular proteins with the R&R Consensus have now been published for 28 proteins from Apis mellifera  and for 101 from Drosophila melanogaster . Also 102 CPR proteins have been identified in the genome of Tribolium castaneum (Beeman and Willis, unpublished observations). These annotations depended in large part on computerized genome annotation and were not systematically verified at the mRNA or protein level.
In the present study, we have carried out an exhaustive manual annotation of the CPR family of An. gambiae based on the whole genome sequence of the PEST strain . These annotations are being facilitated and verified by a proteomics analysis of cuticles [, He unpublished observations] and accompanied by an analysis of gene expression with real-time RT-PCR . In addition, ambiguous gene models have been confirmed or revised by sequencing RT-PCR or RACE products. This work has identified 156 genes coding for CPR proteins. Hence over 1% of the genes of An. gambiae are devoted to just this one family of cuticular proteins.
An investigation of cuticular proteins in An. gambiae carried out prior to whole genome sequencing was particularly informative for the present annotation study. Dotson et al.  sequenced a 17.4 kb insert in a genomic library constructed from the Sua strain. This region had three CPR genes that were at least 98% identical in their coding regions, yet differed in 5' and 3' UTRs as well as their introns. Hence, the lesson learned was that virtually identical genes can reside in compact tandem arrays, yet can be recognized as distinct and not an assembly artefact because of the differences in the non-coding regions associated with them.
CPR proteins can be divided into groups according to which variant of the extended R&R Consensus they possess. Two major groups have been named RR-1 and RR-2 while a third group (RR-3) has been identified but from only a small number of sequences [15, 16]. It is unclear whether RR-3s are an evolutionarily distinct group; for the present analysis we include RR-3 genes within the RR-1 class. A Hidden Markov Model can be employed at the cuticle DB web server  to assign proteins as RR-1 or RR-2 . Our analysis confirms that the bulk of RR-1 and RR-2 proteins form non-overlapping clades in An. gambiae, separated by a small set of long-branch RR-1, RR-2, and RR-3 proteins that are probably an artificial group. In addition to assembling information that supports annotation, we have analyzed the structure of these clades, examined patterns of molecular evolution, compared the amino acid composition of the different proteins and identified characteristics of each group. We now have further appreciation of the complexity of the insect cuticle and clues about the need for so many CPR genes.
Results and Discussion
Validation of annotation
The 156 putative genes coding for proteins with the R&R Consensus is the largest group validated to date for any species. EST support from public sequence databases obtained at  had to be evaluated carefully, requiring matches in unique non-translated regions, because of the similarities within groups of genes. When this was done, a combination of available EST sequences and RACE and RT-PCR products sequenced by our laboratory confirmed 55% of the sequences (Additional File 3). A further 42% were confirmed as unique genes in the G3 strain by obtaining single-copy kinetics with qPCR on genomic DNA and confirming their expression with qRT-PCR (Additional File 3, ). Most primers for this analysis were designed around the stop codon to take advantage of the higher level of variation in the 3' UTR. Together these data uniquely confirm the large majority of annotated genes. Using the SignalP web tool , we also confirmed that each predicted protein had a valid signal peptide, an essential feature of secreted proteins. The presence of a TATA box and INR (initiator element) in 93% of the genes (see below) provided additional confirmation of the reality of these genes. Indeed, these features were important in several cases where Ensembl had predicted a single gene with multiple R&R Consensus regions, yet manual annotation revealed that multiple genes were present. A proteomics analysis based on head capsules and pupal cuticles left behind after a molt and some batch prepared cuticles [, He unpublished observations] revealed unique peptide(s) for 75 of these genes and shared peptides for an additional 70, having excluded those with only DGDVVK, a very common peptide. This information confirms 48% of genes we annotated as authentic components of the cuticle, and if we include proteins with shared peptides the proportion rises to 93% (Figure 2, Additional File 3). Finally, a combination of all of the supporting data has provided at least some support for all 156 CPR proteins (Additional File 3). Given that our experimental work was carried out on the G3 strain, not the PEST strain that was the source of the annotated genome, we were gratified by the support we found.
Did we identify all of the genes with the R&R Consensus? We only included three genes on unplaced contigs (i.e., on the ''UNKN'' chromosome). We had information about these three genes that supported their inclusion in our analysis. Fifteen of the 22 genes we ignored were on contigs less than 1 kb, and the longest was on a contig of only 2966 nucleotides. We assumed that these represent alternative haplotypes present in the PEST strain, which is reasonable given the level of haplotype variation that has been observed . In one case, we omitted a tract of five candidate genes on chromosome 2R that are nearly identical to, but a subset of, another set of genes (CPR115 – CPR123 and CPR154) approximately 1 Mb away. The intergenic sequence in this tract was in some regions completely unresolved whereas in other regions it was virtually identical, preventing any direct verification of the tract by PCR. To ascertain whether all of these candidate genes were likely to be present in the PEST strain, we performed qPCR on genomic DNA with a primer set that included intronic sequences and was complementary to a subset of the candidate genes in question. We compared the data to results for a known single copy gene (chitin synthase, AGAP001748), which indicated that the actual number of targets was between five and six and not eight as expected if all of the candidate genes in the assembled genome were actually present (data not shown). This result was consistent with only the annotated genes being present in PEST; an identical result was also obtained with the G3 strain. The coordinates of the omitted sequences (labelled DUPL A-E) are provided in Additional File 1 for reference, but we believe there is insufficient evidence of their validity to warrant their inclusion in the present analysis. Thus, given the diverse data we used to annotate and confirm the CPR genes and our exhaustive BLAST searches, it is unlikely that many genes were missed or are artifacts of assembly or incorrect annotation. We also identified two pseudogenes as shown in Additional File 1, but we did not include them as named members of the gene family and they were excluded from analyses of molecular evolution. Seventy-six percent of the gene models are modifications or new genes relative to the Ensembl data available at the time the annotations were done. See Methods for details on retrieving all the sequences.
Further verification of the annotation comes from orthologs in D. melanogaster. The data on orthologs are summarized in Additional File 5 where obvious differences between pairs are highlighted. Such comparisons reveal an important complication in the discovery of distant orthologs, namely that high-scoring reciprocal BLAST hits may be restricted to the R&R Consensus region with little conservation of the flanking sequence. This creates ambiguity as to whether a gene is a genuine ortholog or a product of parallel evolution or domain shuffling. We present data in Additional File 5 on sequence identity in both the Consensus and the total protein and have retained putative orthologs only in cases where the evidence was compelling or interesting.
We found good orthologs for seven of the 15 proteins that had over 300 amino acids in their mature form. This indicates that there is something important about their structure that has been preserved for about 250 myr .
There are instances where the regions flanking the Consensus are quite different in an ortholog pair, such as AgamCPR132 and DmelCry, a D. melanogaster lens protein . The An. gambiae protein has a stretch of 20 glutamines in a row, while the D. melanogaster protein with almost the same percentage of glutamines has no cluster longer than 7. Given the propensity of glutamines to form amyloid-like structures, this suggests that the structure of the cuticle will be somewhat different in the two species. The length of the longest glutamine repeat was variable among the more than 50 cDNAs and ESTs that have been deposited in public databases from different strains (median = 23). Interestingly, while the long glutamine tracts in Huntington's and other human diseases are based on the expansion of just CAG repeats ; the long An. gambiae cluster uses both CAG and CAA codons. Cry along with AgamCPR132 had numerous repeats of RREE that may make a contribution to protein structure comparable to a stretch of glutamines.
Another example is illustrated by AgamCPR152 and the D. melanogaster protein resilin (CG15920). Resilin is a cuticular protein that confers elastic properties to specific regions of cuticle . While the extended Consensus region of the two has 76% identity, the other regions of the protein are totally different. AgamCPR152 has 20% histidine residues, resilin had only 1 histidine. Dmelresilin is 35% glycine, AgamCPR152 only 12%. This is a clear case where the regions flanking the Consensus are not related in the two species. So is there a resilin gene in Anopheles? Lyons et al.  have made a recombinant protein with 16 units of a repeat (AQTPSSQYGAP) from an An. gambiae EST (BX61961) identified in a BLAST search using the Drosophila gene. When properly cross-linked, this protein has the same resilient elastic properties as a comparable multimer made with the D. melanogaster resilin repeat (GGRPSDSYGAPGGGN). The An. gambiae sequence corresponds to the incompletely annotated gene AGAP002367. This gene has 10 perfect copies of the repeat and three additional ones that differ in one amino acid. It lacks, however, an R&R Consensus even after probable errors in the original annotation were corrected and no Consensus region lurks in the surrounding 10 kb. This may well be a situation where domain shuffling has occurred.
Several other orthologs merit comment. While all other An. gambiae CPR proteins have only a single R&R Consensus region, there are three versions of it in AgamCPR144 and its orthologs DmelCpr73D and Tribolium GLEAN_16311. All three share other features. AgamCPR138 is a clear ortholog of the D. melanogaster protein l(3)mbn. The latter is longer, but they share two interesting features, the C-terminal position of the R&R Consensus and the presence of two cysteine residues that are rarely found in mature CPRs. We identified orthologs for two of the six An. gambiae genes that lacked the aromatic triad; one had a diad, the other only a single aromatic residue signifying the start of the R&R Consensus. Each had an ortholog with the same atypical feature.
Of the six CPR genes on the X chromosome, we identified orthologs for five, but only AgamCPR129 has an ortholog that also resides on the X chromosome in D. melanogaster. The other four orthologs are on 3R and three unrelated genes with the R&R Consensus are on the D. melanogaster X chromosome. It was apparent from this limited number of orthologs, that genes that are on either 2L or 3L in An. gambiae reside on 3L in D. melanogaster.
Core promoter regions and polyA addition sites
For each gene, we summarized the presence of a TATA box, the position and sequence of its INR if one could be identified, and other features (Additional File 3). Analysis of the core promoter region was possible for 153 genes; the other three had their 5' region determined by RACE, since the genomic sequence was populated by a string of "Ns", an intra-contig gap, in the relevant regions. The vast majority, 126/153 (82%), of the CPR genes have a conventional TATA box (TATAAA). An additional 19 have a variant TATA box, including those in sequence cluster 2LC where 13/16 genes appeared to use TATTTAA. In all 145 cases where a TATA box was used, we were able to identify a putative INR. Cherbas and Cherbas  reported four common INRs in D. melanogaster, TCAGT, ACAGT, GCAGT, and TCATT. These sequences comprise 71% of the INRs we identified. Variants are shown in Additional File 3, some of which (italicized) were confirmed by 5' RACE, others were deduced by their distance downstream from the TATA box and the presence of an adenine in the third position. For seven of the eight genes with no known TATA box, we were able to identify INRs and putative DPEs (downstream promoter elements) . The conventional polyA addition site (AATAAA) was found in 84% of the CPR genes within the first 500 nucleotides after the stop codon. Alternative sites (AATACA, AATATA, AATTAA) have been reported from some Bombyx cuticular protein genes [26, 27]. We found the first two types in 14 additional genes. Hence all but 7% of the 153 genes that had sufficient C-terminal sequence data available had known polyA addition sites. The presence of appropriate untranslated flanking regions is further evidence for the authenticity of the genes we have annotated.
Patterns of gene architecture
As is evident from Additional Files 1 and 2, An. gambiae CPR genes show considerable variation in exon position and length. Complete coding sequences from genomic DNA are obtainable for 154 of these proteins; two others, with "Ns" in the genomic data, were obtained by sequencing RACE products. Eighteen sequences had only a single exon, 11 of which were in sequence cluster 2LC; one sequence had five exons. The majority (55%) had two exons, with three and four exons present in 25%, and 8%, respectively. A large majority (86%) of genes with introns had an interruption in the region coding for the signal peptide, resulting in short first exons of from 3–36 nucleotides long, excluding the 5'UTR. Both the median and mode were 12. Short first introns are not unique to CPR genes but are an obstacle to gene annotation. This is because several plausible first exons may fit a well-supported gene model and because EST support for first exons is sometimes lacking. We cloned and sequenced RACE products from the G3 strain to verify some problematic first exons (see Additional File 3), but we did not do so routinely for gene models that were well supported by similarity to related CPR genes.
Intron number is variable throughout the CPR phylogeny, suggesting that intron loss or gain has occurred repeatedly. Overall, however, there are general patterns that distinguish the exon structure of RR-1 and RR-2 genes. There are fewer RR-2 genes with more than two introns than RR-1 genes (23% vs. 52%), and exon-intron boundaries are more strongly biased toward phase 0 (i.e. between codons) in RR-2 genes (96% of introns) than in RR-1 genes (63%). Furthermore, the interruption of the Consensus region by an intron is less common in RR-2 genes (15%) than in RR-1 genes (49%). Indels are very rare in the aligned RR-2 Consensus region and span at most two codons, whereas the aligned RR-1 Consensus region has extensive indels, particularly near the center of the alignment. Thus, the two classes differ substantially in the structural diversity of the R&R Consensus. These architectural differences also suggest that RR-2 proteins may be more amenable to novel gene formation by transposition of an intact R&R Consensus domain, although such an occurrence has not been demonstrated for a CPR gene.
Strain variation in gene number
An inversion on chromosome 2L in some strains of An. gambiae is of particular interest because one form (2La) correlates with desiccation resistance and vector competence discussed in Sharakhov et al. . Their careful mapping of the inversion boundaries enabled us to place it in the context of the genes we have annotated with the surprising result that there are 73 CPR genes within the inverted region (Figure 2, Additional File 1). We used inversion specific primers  to verify that the G3 strain, which we used for analysis, is heterozygous for the inversion (data not shown).
General protein properties
Properties of CPRs by class and by individual RR-2 sequence clusters
% to START of CONSENSUS
Commonly occurring sequence motifs
Within the two major classes of the R&R Consensus sequence, some common variants are recognizable that extend the region of alignable sequence in the N- or C-terminal directions for a subset of genes. One common variant found in RR-1 genes has a proline-rich region adjacent to the defined Consensus (Figure 1), which can be approximated by the expression GFQPQGxHxPxPPP. A second common variant is found in RR-2 genes at the C-terminus of the R&R Consensus, approximated by the expression GFNAVV(HR)RE(GP). Also present in 38 RR-2 genes was RDGDVVKG. All three of these variants occur in tandem gene arrays in An. gambiae as well as other Dipterans, indicating that they arose to high frequency by tandem gene duplication (Cornman and Willis, MS in preparation). The first two, however, are also present in Apis and Tribolium (Cornman, unpublished data) indicating that these motifs are old and, given their high level of conservation, probably have functional importance, possibly in addition to their participation in chitin-binding.
Phylogeny and genomic organization of CPR genes
In An. gambiae, CPR proteins cannot generally be aligned outside of the R&R Consensus and signal peptide, consistent with the pattern of insect CPRs generally. We therefore based our phylogenetic analysis on the region that spans the R&R Consensus from the fifth position N-terminal to the tenth position C-terminal of the pfam00379 sequence (Figure 1). We included these flanking regions because, while they are not alignable across all An. gambiae CPRs, they incorporate common sequence variants that are alignable within large subsets of genes and thus are phylogenetically informative. However, we double-weighted all positions from the aromatic triad to two positions past the final invariant glycine (Figure 1) as these positions can be aligned across all CPRs. We used amino-acid sequence rather than codon-aligned nucleotide sequence because mutational saturation of the latter is evident across the gene family as a whole.
CPR genes are found on all of the chromosomal arms except for the Y chromosome but have a biased distribution with 51% occurring on 2L and 28% occurring on 3R. Phylogenetically, genes show a very strong tendency to cluster by chromosome (Figure 2, Figure 6), indicating that inter-chromosome duplications are rare. RR-1 and RR-2 tandem arrays show no overlap on chromosomes (Figure 2, Additional File 1) and contain few non-CPR genes.
Evolutionary patterns within the R&R Consensus and flanking sequence
The pattern of natural selection acting during the evolutionary history of a gene family can be illuminated by comparing the rate of nucleotide substitutions that change the protein sequence (Ka) to the rate of nucleotide substitutions that do not (Ks). Coding sequences that have a ratio of Ka to Ks equal to one are evolving in a manner consistent with neutrality, that is, nonsynonymous mutations are as likely to be fixed as synonymous ones. Ka/Ks greater than one indicates a higher rate of amino-acid substitution than expected under neutrality and implies that positive selection has driven the divergence from the ancestral state. Ka/Ks lower than one implies that changes in protein sequence are selected against on average.
To investigate the possibility of adaptive evolution of the R&R Consensus during the diversification of CPR genes, we calculated Ka/Ks within this region for all An. gambiae paralog pairs and An. gambiae – Ae. aegypti ortholog pairs that we identified. For this analysis, we used a shortened version of pfam00379, in which we excluded the first seven positions, to define a more strict R&R Consensus for An. gambiae. This was done because the alignment of the first seven sites across all RR-1 or RR-2 proteins was ambiguous and therefore not useful for analyses of substitution patterns.
One difficulty with the interpretation of Ka/Ks ratios between functional paralogs is that a finding of increased Ka/Ks after duplication may be explained by relaxed selection on initially redundant genes or by adaptive evolution at some fraction of sites within a functionally constrained sequence, or both successively. Only if Ka/Ks is substantially greater than one is adaptive evolution strongly implicated, although it remains a possible explanation for even small increases in Ka/Ks. Codon-based models of nucleotide substitution are potentially more useful because they can identify individual sites under selection within a local region of low Ka/Ks, but their power is strongly dependent on sample size  and the actual pattern of selection. We used both approaches to investigate the molecular evolution of the chitin-binding R&R Consensus, which has a well-characterized secondary structure and both labile and highly conserved positions [8, 33].
CPR genes with the highest mean pairwise Ka/Ks. Genes in bold are on the X chromosome
We next used codon- and branch-based tests of selection to test for positive selection on the R&R Consensus of An. gambiae CPRs. We used the single-likely-ancestor counting method, SLAC , and GA-Branch method  implemented by the DataMonkey web server . The SLAC method identifies individual codons that deviate from neutral expectation with respect to nonsynonymous versus synonymous substitutions, whereas the GA-Branch method identifies lineages of a phylogeny that differ in overall Ka/Ks. Subsets of CPR genes were investigated by submitting well supported interior clades that are likely to have experienced less mutational saturation and by removing highly similar gene duplicates. In no case did we find any positively selected sites within the R&R Consensus at P < 0.05, whereas a large majority of sites were found to be under negative selection. Conceptually similar but methodologically distinct methods for identifying sites under positive selection were also implemented with the codeml program of the PAML package  and gave similar results. Specifically, evolutionary models that included a category of Ka/Ks > 1 in addition to either a single category of Ka/Ks < 1 or a beta distribution of Ka/Ks < 1 were not significantly more likely than models that did not include positive selection (codeml model options 2 versus 1 and options 8 versus 7, respectively, 2ΔlnL ≈ 0 for both comparisons). Thus, diversification of the CPR gene family does not appear to have been driven by strong positive selection on specific sites within the R&R Consensus. We estimated the global branch Ka/Ks within the R&R Consensus under the beta-distribution model (model 7) of codeml, which was significantly more likely than a single Ka/Ks < 1 category (model 1) for both the RR-1 and RR-2 clades (P << 0.01), and after removing all but one haphazardly chosen gene of each sequence cluster. The evolutionary rate of amino-acid substitutions was more than twofold higher within the RR-1 clade (Ka/Ks = 0.271) than within the RR-2 clade (Ka/Ks = 0.118).
Although no sites or sites within lineages showed significant evidence of positive selection, the relative rates of amino-acid evolution are nonetheless variable among lineages, as indicated by Tajima's test . The most notable example is the branch leading to the 3RB sequence cluster (Figures 6, 8) compared with the linked sequence clusters 3RA and 3RC (average P < 0.01 across all possible gene comparisons for the three sequence clusters). Since we found no support for elevated Ka/Ks occurring along this or any other lineage as described above, the high rate of evolution in the 3RB sequence cluster appears to be due to mutation-rate variation alone. Consistent with this interpretation, the 3RB sequence cluster has a substantially lower GC content (36.9%) than the other 3R sequence clusters (46.7%), suggesting divergent mutational histories. This localized difference in GC content is surprising given the very tight linkage of the genes in these three sequence clusters (Additional File 1).
We also performed a SLAC analysis on all examples of the motif identified by MEME (aligned with indels) from An. gambiae and Ae. aegypti. Ten of fifteen sites were found to be under purifying selection at P < 0.05, as shown in Figure 5B, and the Ka/Ks for the region was estimated to be 0.42 (95% C.I. 0.39, 0.65). While the motif is evidence of sequence conservation among paralogs outside the R&R Consensus, the level of conservation is somewhat less than what is seen among single-copy orthologs in the two mosquito genomes (Ka/Ks of 0.08 – 0.33 with the R&R Consensus removed).
Our analysis of the CPR gene family in An. gambiae reveals that a high number of paralogs can be retained in insect genomes with remarkably low rates of pseudogene formation. The retention of these paralogous single-copy genes or sequence clusters over periods of 100 million years or more, as evidenced by orthologs in Ae. aegypti (Cornman and Willis, MS in preparation), implies that there has been extensive functional diversification of this group. We do not yet know the nature of this functional diversification, although there is certainly broad variation in the complement of CPR proteins expressed at different developmental stages and in different tissues as evidenced by gene expression and proteomic studies ([12, 13], He unpublished data). There are also a few examples from D. melanogaster of CPR proteins that have been shown to be essential components of specialized cuticular structures, such as crystallin in the eyes  and resilin in wing tendons, ligaments, etc. [22, 23]. Nonetheless, codon-based analyses of the R&R Consensus of An. gambiae CPR proteins failed to identify any sites under positive selection, although such tests can have limited power to detect ancient positive selection and Ka/Ks>1 within gene families is rarely found .
Our comparison of the RR-1 and RR-2 classes presents an interesting dichotomy regarding the diversification of the R&R Consensus and flanking sequences. The R&R Consensus of the RR-1 class has higher rates of amino-acid evolution and is structurally much more variable than the RR-2 family, and a number of RR-1 proteins have average pairwise Ka/Ks approaching one, many times higher than the mean Ka/Ks for orthologous pairs in An. gambiae and Ae. aegypti. To the extent that there has been adaptive evolution of the chitin-binding Consensus, it has probably been more prevalent in the RR-1 clade. In contrast, the RR-2 clade appears to have diversified in terms of amino-acid composition to a much greater extent than the RR-1 class and in general there is a higher frequency of histidine residues, which are involved in cross-linking, in this group. It seems likely that the functional diversification of the RR-2 group derives to a greater extent from the properties of the sequence flanking the Consensus than from any particular changes in the chitin-binding sequence itself.
One of the most remarkable features of this gene family in An. gambiae, the presence of sequence clusters of highly similar genes, begs investigation as to the possible selective advantage of increased copy number for these particular genes. A separate analysis of these genes was too extensive to include here, but our unpublished data indicate that these proteins have complex repeats and unusual amino-acid compositions relative even to other RR-2 proteins. Furthermore, different sequence clusters appear to have independently acquired compact gene architectures, perhaps as a response to selection for increased gene expression, and concerted evolution in addition to purifying selection appears to be an important mode by which protein similarity is maintained in these groups.
We have identified 156 genes containing the R&R Consensus in the genome sequence of An. gambiae; many have been confirmed experimentally. Phylogenetic analysis of these genes reveals three broad groups, the core RR-1 and RR-2 groups plus an intermediate, long-branched group that is probably artificial. The RR-2 class is dominated by sets of highly similar genes that we have termed sequence clusters. The RR-1 class of the R&R Consensus has a higher evolutionary rate than the RR-2 class, whereas the latter group has a greater diversity of amino-acid composition in the flanking sequences. The multiplicity of almost identical genes within sequences clusters suggests that their amplification may serve to allow massive protein synthesis during the brief periods of cuticle secretion. In addition, and perhaps most importantly, differences among sequence clusters, the rarity of pseudogenes, and the presence of good orthologs for several single-copy genes all indicate that the distinct CPR proteins serve important and unique roles in the cuticle.
Annotation of sequences
Prior to the availability of sequences from completely sequenced insect genomes, there were 98 insect cuticular proteins sequences available that had the R&R Consensus. Where corresponding genes were known, they were generally simple, with rarely more than two exons. The first intron frequently interrupted the region coding for the signal peptide. The second exon also began in a relatively conserved position close to or interrupting the nucleotides that coded for an aromatic triad near the N-terminus of the extended Consensus . This triad is Y/F-x-Y/F/W-x-Y/F. These simple and consistent features guided the annotation of the An. gambiae genome. We also used the information from  that genes that coded for nearly identical proteins would differ in their 5' and 3' UTRs.
We investigated all gene predictions that contained the pfam00379 motif by searching the Ensembl Anopheles Web Site  with IPR000618 Additional sequences were identified when our proteomics project  turned up peptides indicative of belonging to a previously unannotated protein with the R&R Consensus. Additional BLAST searches and dot-matrix plots were also employed to confirm that we had identified all genes.
We examined each candidate gene to identify essential or common gene features, namely a TATA box, INR , signal peptide (identified by SignalP [19, 42], R&R Consensus sequence, and poly-adenylation site (AATAAA or rarely AATACA). These were manually identified, guided whenever possible by ESTs available via BLAST searching . The program Splice Predictor  was used to guide identification of splice sites. In a few cases, putative orthologs in other species provided valuable clues. For a few of the more difficult sequences, we used 5' or 3' RACE or RT-PCR to verify/complete the sequence. (See Additional File 7 for primers used). Such experimental work was, of necessity, done with the G3 strain as the PEST strain used for the whole genome sequencing no longer exists. Once sequences were annotated we used SPIDEY  to locate the coordinates on the appropriate contig. We present only positions that correspond to the coding sequence itself.
A summary of all annotated genes, their properties, and the nature of supporting evidence is given in Figure 2 and Additional Files 1, 2, 3. Genes in these tables are ordered by their appearance on chromosome arms. We were advised not to use chromosome bands for names as had been done for some of the D. melanogaster genes coding for cuticular proteins , so we settled for naming them simply as CPR#. When the genes are discussed in the context of genes in other species, their complete name should be AgamCPR#. Genes were named in the order in which they were annotated; so many names were added out of numerical order. We have provided the VectorBase stable identifiers that begin with AGAP in Additional File 1. A few genes that are incorrect are noted. CPR150–156 were not available when the naming occurred and no Ensembl gene names exist for these.
For genes with ambiguous predictions, RT-PCR or RACE (rapid amplification of cDNA ends) was performed. RNA isolation and RT-PCR procedures were described before [13, 47]. All RACE products except for 3' RACE of CPR140 and CPR152 were obtained with the GeneRacer® kit (Invitrogen) using their SuperScript III reverse transcriptase. The 5' RACE product should reach the transcription start site. All PCR products were amplified with tag primers and/or gene specific primers listed in Additional File 7 by LA Taq or Ex Taq (TaKaRa) and cloned in pGEM-T Easy (Promega) or pCR4-TOPO (Invitrogen) plasmid vectors. For CPR140 and CPR152 3' RACE, first strand cDNA was synthesized by SuperScript III reverse transcriptase (Invitrogen) with an oligo dT-anchor primer (5'-GACCACGCGTATCGATGTCGACT23-3'), a modified version of the primer from the Roche RACE kit. A PCR anchor primer (5'-GACCACGCGTATCGATGTCGAC-3') and gene-specific primers were used with these two cDNAs. These modifications were necessary because the 3' RACE primers provided by Invitrogen frequently amplified incorrect regions of the An. gambiae genome by priming at both ends of a sequence. All other gene-specific primers used are given in Additional File 7.
Phylogenetics and sequence analysis
The R&R Consensus was aligned with ClustalW using gap penalties of 10 to open and 5 to extend. The alignment was then manually adjusted to ensure consistency with pfam00379, particularly at the edges of the alignment, and to ensure that alignments among sets of highly similar genes were not distorted by the global optimum. The MEGA3 program  was used to calculate genetic distances using the JTT exchangeability matrix . The distance measure is the expected fraction of accepted mutations scaled such that a distance of 1.0 is equivalent to 100 iterations of the JTT matrix. All distances were computed with pairwise deletion of indels.
All annotated An. gambiae CPR genes and all annotated Ae. aegypti RR-2 genes listed in Cornman and Willis (MS in preparation) were submitted to the MEME server [50, 51] with the extended R&R Consensus and signal peptide removed. We placed no restriction on the number of motifs per sequence but limited the size to 5 – 25 amino acids per motif. We investigated the stability of the MEME-identified motifs by submitting haphazard subsamples of CPR genes from mosquito as well as other insect CPR genes. This was to ensure that the same motif definition was recovered when background sequence probabilities were modulated in this way.
The model of nucleotide evolution for SLAC analysis was chosen independently for each run using the DataMonkey model-selection tool . Selected codons were assessed using a Bonferroni-corrected alpha of 0.05. Pairwise Ka/Ks ratios were calculated using the method of Nei and Gojobori  as implemented by the program DnaSP . The program codeml of the PAML package  was used to assess the likelihood of sequential pairs of evolutionary models described by Yang et al. . For each pair of models, the model with the fewest parameters was considered the null hypothesis and was rejected if the alternative model had a significantly higher likelihood by the chi-squared test suggested by Yang et al. .
List of abbreviations used
cuticular protein(s) with the R&R Consensus
downstream promoter element
expressed sequence tag
ratio of nonsynonymous to synonymous substitutions
principal components analysis
quantitative polymerase chain reaction with DNA template
quantitative reverse transcriptase PCR
rapid amplification of cDNA ends
- R&R Consensus:
Rebers and Riddiford Consensus
RR-2, two classes of CPR proteins
untranslated region of an mRNA
single-likely-ancestor counting method.
The authors appreciate the help Maureen E. Hillenmeyer and Frank Collins (Notre Dame) provided with the initial stages of annotation and thank Kathryn Cambell (FlyBase) for input on the genes on 2L. Christos Louis (Institute of Molecular Biology and Biotechnology, Crete) gave guidance on naming genes, and Martin Hammond provided help with moving data to ENSEMBL. Nora Besansky (Notre Dame) shared insight into the importance of the 2La inversion. Hugh Robertson (Univ. of Illinois) provided helpful comments on an early draft of the MS. This work was supported by a grant from the National Institutes of Health (AI55624) to JHW.
- Neville AC: Biology of the Arthropod Cuticle. 1975, New York, Springer-VerlagView ArticleGoogle Scholar
- Andersen SO, Hojrup P, Roepstorff P: Insect cuticular proteins. Insect Biochem Mol Biol. 1995, 25: 153-176. 10.1016/0965-1748(94)00052-J.PubMedView ArticleGoogle Scholar
- Willis JH, Iconomidou VA, Smith RF, Hamodrakas SJ: Cuticular proteins. Comprehensive Insect Science. Edited by: Gilbert LI, Iatrou K, Gill S. 2005, Oxford, Elsevier, 4: 79-109.View ArticleGoogle Scholar
- Rebers JE, Riddiford LM: Structure and expression of a Manduca sexta larval cuticle gene homologous to Drosophila cuticle genes. J Mol Biol. 1988, 203: 411-423. 10.1016/0022-2836(88)90009-5.PubMedView ArticleGoogle Scholar
- Rebers JE, Willis JH: A conserved domain in arthropod cuticular proteins binds chitin. Insect Biochem Mol Biol. 2001, 27: 229-240. 10.1016/S0965-1748(96)00090-2.View ArticleGoogle Scholar
- Togawa T, Nakato H, Izumi S: Analysis of the chitin recognition mechanism of cuticle proteins from the soft cuticle of the silkworm, Bombyx mori. Insect Biochem Mol Biol. 2004, 34: 1059-1067. 10.1016/j.ibmb.2004.06.008.PubMedView ArticleGoogle Scholar
- Hamodrakas SJ, Willis JH, Iconomidou VA: A structural model of the chitin-binding domain of cuticle proteins. Insect Biochem Mol Biol. 2002, 32: 1577-1583. 10.1016/S0965-1748(02)00079-6.PubMedView ArticleGoogle Scholar
- Iconomidou VA, Willis JH, Hamodrakas SJ: Unique features of the structural model of 'hard' cuticle proteins: implications for chitin-protein interactions and cross-linking in cuticle. Insect Biochem Mol Biol. 2005, 35: 553-560. 10.1016/j.ibmb.2005.01.017.PubMedView ArticleGoogle Scholar
- The Honeybee Genome Sequencing Consortium: Insights into social insects from the genome of the honeybee Apis mellifera. Nature. 2006, 443: 931-949. 10.1038/nature05260.PubMed CentralView ArticleGoogle Scholar
- Karouzou MV, Spyropoulos Y, Iconomidou VA, Cornman RS, Hamodrakas SJ, Willis JH: Drosophila cuticular proteins with the R&R Consensus: annotation and classification with a new tool for discriminating RR-1 and RR-2 sequences. Insect Biochem Mol Biol. 2007, 37: 754-760. 10.1016/j.ibmb.2007.03.007.PubMedView ArticleGoogle Scholar
- Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, Nusskern DR: The genome sequence of the malaria mosquito Anopheles gambiae. Science. 2002, 298: 129-149. 10.1126/science.1076181.PubMedView ArticleGoogle Scholar
- He N, Botelho JMC, McNall RJ, Belozerov V, Dunn WA, Mize T, Orlando R, Willis JH: Proteomic analysis of cast cuticles from Anopheles gambiae by tandem mass spectrometry. Insect Biochem Mol Biol. 2007, 37: 135-146. 10.1016/j.ibmb.2006.10.011.PubMedView ArticleGoogle Scholar
- Togawa T, Dunn WA, Emmons AC, Nagao J, Willis JH: Developmental expression patterns of cuticular protein genes with the R&R Consensus from Anopheles gambiae. Insect Biochem Mol Biol. 2008, [http://dx.doi.org/10.1016/j.ibmb.2007.12.008]Google Scholar
- Dotson EM, Cornel AJ, Willis JH, Collins FH: A family of pupal-specific cuticular protein genes in the mosquito Anopheles gambiae. Insect Biochem Molec Biol. 1998, 28: 459-472. 10.1016/S0965-1748(98)00016-2.View ArticleGoogle Scholar
- Andersen SO: Amino acid sequence studies on endocuticular proteins from the desert locust, Schistocerca gregaria. Insect Biochem Mol Biol. 1998, 28: 421-434. 10.1016/S0965-1748(98)00028-9.PubMedView ArticleGoogle Scholar
- Andersen SO: Studies on proteins in post-ecdysial nymphal cuticle of locust, Locusta migratoria, and cockroach, Blaberus cranifer. Insect Biochem Mol Biol. 2000, 30: 569-577. 10.1016/S0965-1748(00)00029-1.PubMedView ArticleGoogle Scholar
- cuticleDB. [http://bioinformatics2.biol.uoa.gr/cuticleDB/index.jsp]
- NCBI BLAST Server. [http://www.ncbi.nlm.nih.gov/blast/]
- SignalP. [http://www.cbs.dtu.dk/services/SignalP/]
- Janssens H, Gehring WJ: Isolation and characterization of drosocrystallin, a lens crystallin gene of Drosophila melanogaster. Devel Biol. 1999, 207: 204-214. 10.1006/dbio.1998.9170.View ArticleGoogle Scholar
- Perutz MF, Windle AH: Cause of neural death in neurodegenerative diseases attributable to the expansion of glutamine repeats. Nature. 2001, 412: 143-144. 10.1038/35084141.PubMedView ArticleGoogle Scholar
- Andersen SO, Weis-Fogh T: Resilin. A rubber-like protein in arthropod cuticle. Adv Insect Physiol. 1964, 2: 1-65.View ArticleGoogle Scholar
- Lyons RE, Lesieur E, Kim M, Wong DCC, Huson MG, Nairn KM, Brownlee AG, Pearson RD, Elvin CM: Design and facile production of recombinant resilin-like polypeptides: gene construction and a rapid protein purification method. Protein Eng Des Sel. 2007, 20: 25-32. 10.1093/protein/gzl050.PubMedView ArticleGoogle Scholar
- Cherbas L, Cherbas P: The arthropod initiator: the capsite consensus plays an important role in transcription. Insect Biochem Molec Biol. 1993, 23: 81-90. 10.1016/0965-1748(93)90085-7.View ArticleGoogle Scholar
- Butler JE, Kadonaga JT: The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 2002, 16: 2583-92. 10.1101/gad.1026202.PubMedView ArticleGoogle Scholar
- Nakato H, Toriyama M, Izumi S, Tomino S: Structure and expression of mRNA for a pupal cuticle protein of the silkworm, Bombyx mori. Insect Biochem. 1990, 7: 667-678. 10.1016/0020-1790(90)90080-E.View ArticleGoogle Scholar
- Shofuda K-I, Togawa T, Nakato H, Tomino S, Izumi S: Molecular cloning and characterization of a cDNA encoding a larval cuticle protein of Bombyx mori. Comp Biochem Physiol Part B. 1999, 122: 105-109. 10.1016/S0305-0491(98)10151-7.View ArticleGoogle Scholar
- Charles JP, Chihara C, Nejad S, Riddiford LM: A cluster of cuticle protein genes of Drosophila melanogaster at 65A: sequence, structure and evolution. Genetics. 1997, 147: 1213-1224.PubMedPubMed CentralGoogle Scholar
- Sharakhov IV, White BJ, Sharakhova MV, Kayondo J, Lobo NF, Santolamazza F, Della Torre A, Simard F, Collins FH, Besansky NJ: Breakpoint structure reveals the unique origin of an interspecific chromosomal inversion (2La) in the Anopheles gambiae complex. Proc Natl Acad Sci USA. 2006, 103: 6258-62. 10.1073/pnas.0509683103.PubMedPubMed CentralView ArticleGoogle Scholar
- White BJ, Santolamazza F, Kamau L, Pombi M, Grushko O, Mouline K, Brengues C, Guelbeogo W, Coulibaly M, Kayondo JK, Sharakhov I, Simard F, Petrarca V, Della Torre A, Besansky NJ: Molecular karyotyping of the 2La inversion in Anopheles gambiae. Am J Trop Med Hyg. 2007, 76: 334-9.PubMedGoogle Scholar
- Andersen SO: Cuticular sclerotization and tanning. Comprehensive Insect Science. Edited by: Gilbert LI, Iatrou K, Gill S. 2005, Oxford, Elsevier, 4: 145-170.View ArticleGoogle Scholar
- Kosakovsky Pond SL, Frost SDW: Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol Biol Evol. 2005, 22: 1208-1222. 10.1093/molbev/msi105.PubMedView ArticleGoogle Scholar
- Iconomidou VA, Willis JH, Hamodrakas SJ: Is β-pleated sheet the molecular conformation which dictates formation of helicoidal cuticle?. Insect Biochem Mol Biol. 1999, 29: 285-292. 10.1016/S0965-1748(99)00005-3.PubMedView ArticleGoogle Scholar
- Thornton K, Long M: Rapid divergence of gene duplicates on the Drosophila melanogaster X chromosome. Mol Biol Evol. 2002, 19: 918-925.PubMedView ArticleGoogle Scholar
- Charlesworth B, Coyne JA, Barton NH: The relative rates of evolution of sex-chromosomes and autosomes. Am Nat. 1987, 130: 113-146. 10.1086/284701.View ArticleGoogle Scholar
- Kosakovsky Pond SL, Frost SDW: A genetic algorithm approach to detecting lineage-specific variation in selection pressure. Mol Biol Evol. 2005, 22: 478-485. 10.1093/molbev/msi031.View ArticleGoogle Scholar
- Data Monkey. [http://www.datamonkey.org]
- Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Computer Applications in BioSciences. 1997, 13: 555-556.Google Scholar
- Tajima F: Simple methods for testing molecular clock hypothesis. Genetics. 1993, 135: 599-607.PubMedPubMed CentralGoogle Scholar
- Zhang L, Pond SK, Gaut BS: A survey of the molecular evolutionary dynamics of twenty-five multigene families from four grass taxa. J Mol Evol. 2001, 52: 144-156.PubMedView ArticleGoogle Scholar
- Ensembl Anopheles Web Site. [http://www.ensembl.org/Anopheles_gambiae/index.html]
- Bendtsen JD, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides: SignalP 3.0. J Mol Biol. 2004, 340: 783-795. 10.1016/j.jmb.2004.05.028.PubMedView ArticleGoogle Scholar
- Splice Predictor. [http://deepc2.psi.iastate.edu/cgi-bin/sp.cgi]
- SPIDEY. [http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/]
- Anopheles CP sequences. [http://may2005.archive.ensembl.org/Anopheles_gambiae/submission]
- Magkrioti CK, Spyropoulos IC, Iconomidou VA, Willis JH, Hamodrakas SJ: cuticleDB: a relational database of arthropod cuticular proteins. BMC Bioinform. 2004, 5: 138-143. 10.1186/1471-2105-5-138.View ArticleGoogle Scholar
- Togawa T, Dunn WA, Emmons A, Willis JH: CPF and CPFL, two related gene families encoding cuticular proteins of Anopheles gambiae and other insects. Insect Biochem Mol Biol. 2007, 37: 675-688. 10.1016/j.ibmb.2007.03.011.PubMedView ArticleGoogle Scholar
- Kumar S, Tamura K, Nei M: MEGA3: Integrated software for molecular evolutionary genetics analysis and sequence alignment. Brief Bioinform. 2004, 5: 150-163. 10.1093/bib/5.2.150.PubMedView ArticleGoogle Scholar
- Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. CABIOS. 1992, 8: 275-282.PubMedGoogle Scholar
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. Edited by: Altman R, Brutlag D, Karp P, Lathrop R, Searls D. 1994, Menlo Park, AAAI Press, 28-36.Google Scholar
- Meme Server. [http://meme.sdsc.edu/meme]
- Nei M, Gojobori T: Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986, 3: 418-426.PubMedGoogle Scholar
- Rozas J, Sanchez-DelBarrio JC, Messeguer X, Rozas R: DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics. 2003, 19: 2496-2497. 10.1093/bioinformatics/btg359.PubMedView ArticleGoogle Scholar
- Yang Z, Nielsen R, Goldman N, Krabbe Pedersen AM: Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 2000, 155: 431-449.PubMedPubMed CentralGoogle Scholar