Genome annotation of Anopheles gambiae using mass spectrometry-derived data
© Kalume et al. 2005
Received: 30 January 2005
Accepted: 19 September 2005
Published: 19 September 2005
A large number of animal and plant genomes have been completely sequenced over the last decade and are now publicly available. Although genomes can be rapidly sequenced, identifying protein-coding genes remains a problematic task. Availability of protein sequence data allows direct confirmation of protein-coding genes. Mass spectrometry has recently emerged as a powerful tool for proteomic studies. Protein identification using mass spectrometry is usually carried out by searching against databases of known proteins or transcripts. This approach generally does not allow identification of proteins that have not yet been predicted or whose transcripts have not been identified.
We searched 3,967 mass spectra from 16 LC-MS/MS runs of Anopheles gambiae salivary gland homogenates against the Anopheles gambiae genome database. This allowed us to validate 23 known transcripts and 50 novel transcripts. In addition, a novel gene was identified on the basis of peptides that matched a genomic region where no gene was known and no transcript had been predicted. The amino termini of proteins encoded by two predicted transcripts were confirmed based on N-terminally acetylated peptides sequenced by tandem mass spectrometry. Finally, six sequence polymorphisms could be annotated based on experimentally obtained peptide sequences.
The peptide sequences from this study were mapped onto the genomic sequence using the distributed annotation system available at Ensembl and can be visualized in the context of all other existing annotations. The strategy described in this paper can be used to correct and confirm genome annotations and permit discovery of novel proteins in a high-throughput manner by mass spectrometry.
The recent completion of the Anopheles gambiae genome sequence provided an architectural scaffold for mapping, identifying, selecting, and exploiting malaria insect vector genes for future studies. The An. gambiae genome consists of three pairs of chromosomes, designated 2R/2L, 3R/3L and X. The Y chromosome is yet to be completely sequenced and assembled because of its high number of transposable element fragments. Thus far, approximately 85% of the genome has been assembled, with the total genome size being 278 Mbp. About 15,189 genes are annotated in the An. gambiae genome, of which 11,757 are derived from prediction programs. Currently, approximately 700 known An. gambiae proteins are annotated in the databases. Annotation of the An. gambiae genome sequence has been an ongoing process since the sequence was completed in 2002. The assembled genome is publicly available through NCBI (National Center for Biotechnology Information) and EBI (European Bioinformatics Institute)/Ensembl http://www.ensembl.org. It is important to note that the existing genome annotation is based mainly on de novo gene predictions, supplemented by a small number of experimentally obtained transcripts. Because of the magnitude of the sequence data, automatic annotation is a necessity. However, it introduces different types of errors, some of which can be overcome by combining manual annotation and experimental evidence. In this regard, mass spectrometry is a powerful tool that can contribute to the identification of novel genes and assist in confirming predicted transcripts as well as correcting erroneous assignments from automated gene annotations. The use of mass spectrometry to assist the validation of genome annotation has been previously demonstrated in prokaryotes, yeast, plants and humans.
However, two of these studies [5, 7] did not directly search mass spectrometry-derived data against the genomic databases; rather, the peptide sequences were integrated with the genomic sequence post hoc. This approach is less suitable for annotating genomes because peptides arising from any region of a genome that has no transcript associated with it (e.g. introns and intergenic regions) will not be identified.
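To make the distinction concrete, the following is a minimal sketch of why a direct genome search can recover peptides from unannotated regions: the genomic DNA is translated in all six reading frames, so a peptide can be matched regardless of whether a transcript has been predicted there. The function names are illustrative, the standard genetic code is assumed, and this is not the internal procedure of any particular search engine.

```python
from typing import Dict, List, Tuple

# Standard genetic code, built in TCAG order (an assumption: nuclear code).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> List[Tuple[str, int, str]]:
    """Return (strand, frame offset, protein) for all six reading frames."""
    dna = dna.upper()
    frames = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames.append((strand, offset, translate(seq[offset:])))
    return frames


def frames_containing(dna: str, peptide: str) -> List[Tuple[str, int]]:
    """List the (strand, offset) frames whose translation contains the peptide."""
    return [(s, o) for s, o, prot in six_frame_translations(dna) if peptide in prot]
```

A peptide that is found in one of these frames but maps to no annotated exon is a candidate for a missed exon or a novel gene, which a search restricted to predicted transcripts cannot report.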
In this study, we carried out a proteomic analysis of salivary gland proteins from An. gambiae and searched these data against the An. gambiae genome database. We were able to validate 73 transcripts that had been predicted from the genome. We also corrected several erroneous predictions (e.g. missed exons) and identified one gene that was not predicted at all. To share our results with the biomedical community, we have taken advantage of the Distributed Annotation System provided by Ensembl. We have mapped all the peptides identified in this study such that they can be visualized by anyone using the Ensembl genome browser. It is hoped that availability of additional proteomic data will aid in further refining the annotation of genomic data.
Transcripts annotated as known transcripts are generally those for which a full-length cDNA exists. However, it is possible for transcripts to exist without being translated, as pseudogenes are also known to be transcribed [9, 10]. Here, we describe the use of peptide sequences obtained by tandem mass spectrometry to assist in the annotation and confirmation of exons of known transcripts. Figure 2 demonstrates how we were able to obtain peptide evidence for two of three exons of a known transcript encoding D7-related protein 1, whose cDNA was recently isolated [11, 12]. Additional file 1 lists 172 peptides that led to validation of exons from 23 known transcripts.
Proteins with the identified cSNP (table columns: Protein; Accession #; Amino acid change). Proteins listed: Similar to SGS4; D7-related 3 protein precursor; Putative 5'-nucleotidase precursor.
List of peptide sequences shown in Figures 3 and 6B (table columns: Protein; Accession #). Proteins listed: Peroxidase family of proteins; Novel protein-coding gene; Antigen 5-related 1 protein; Similar to calmodulin1; D7 protein family.
Proteins generally undergo proteolytic cleavage of their N-termini by aminopeptidases in vivo, which results in removal of one or more amino acids from the N-termini. In most cases, this is followed by addition of an acetyl moiety to the amino terminus of the processed protein. The result is often referred to as a 'blocked' N-terminus, which cannot be sequenced easily by the traditional Edman method. In this study, we found two acetylated peptides that matched two different predicted transcripts. The peptide sequence Ac-ADQLTEEQIAEFKEAFSLFDKDGDGTITTK, whose MS/MS spectrum is shown in Figure 3D (Ac refers to the acetyl moiety), corresponds to a novel transcript (Table 2, protein [ENSANG:P00000012700]). The predicted protein is orthologous to rabbit calmodulin (93% identity), which has been observed to contain a blocked amino terminus. We found another N-terminally acetylated peptide sequence, Ac-STVDKEELVQK, which corresponds to another novel transcript ([ENSANG:P00000009311]) predicted to encode a protein very similar (96% identity) to a chaperone found in D. melanogaster. In both of these cases, the amino-terminal methionine residue was cleaved and the newly exposed amino terminus was acetylated. Thus, we were able to validate the assignment of the translation initiation codons for these two predicted transcripts. Such assignment is not always straightforward because, contrary to popular belief, the most 5' AUG codon is not used for translation initiation in a large proportion of cases [17, 18]. We should also note that our strategy was not designed to enrich for N-termini of proteins. If such strategies were to be used in conjunction with mass spectrometry, a large number of N-termini could be assigned in a single experiment.
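The reasoning above can be illustrated with a short sketch that computes the monoisotopic mass of a peptide with and without an N-terminal acetyl group (an addition of C2H2O, about 42.0106 Da) and checks whether a peptide begins immediately after the initiator methionine of a predicted protein. The residue masses are standard monoisotopic values; the function names and the example protein string in the usage note are illustrative assumptions, not part of the original analysis.

```python
# Standard monoisotopic residue masses in Da (unmodified residues only).
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added for the free N- and C-termini
ACETYL = 42.010565  # C2H2O added by N-terminal acetylation


def peptide_mass(sequence: str, n_acetyl: bool = False) -> float:
    """Neutral monoisotopic mass of a peptide, optionally N-acetylated."""
    mass = sum(MONOISOTOPIC[aa] for aa in sequence) + WATER
    return mass + ACETYL if n_acetyl else mass


def matches_processed_n_terminus(predicted_protein: str, peptide: str) -> bool:
    """True if the peptide starts right after a removed initiator methionine."""
    return predicted_protein.startswith("M") and predicted_protein[1:].startswith(peptide)
```

For example, `peptide_mass("STVDKEELVQK", n_acetyl=True)` exceeds the unmodified mass by the acetyl increment, and `matches_processed_n_terminus("M" + "STVDKEELVQK" + "...", "STVDKEELVQK")` returns True for a hypothetical predicted protein whose second residue begins the peptide. A match of this kind supports the annotated start codon.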
One of the basic components of the annotation of any genome is an accurate representation of protein-coding genes. However, even this seemingly simple task is quite difficult. In this pilot study to annotate the An. gambiae genome, we demonstrate that mass spectrometry is a powerful tool for annotating protein-coding regions in genomes. Using mass spectrometry-derived data, we validated the physical existence of 23 known and 50 predicted transcripts at the protein level and confirmed the N-termini of proteins encoded by two predicted transcripts based on N-terminal acetylation. We also identified two sequence polymorphisms based on peptide evidence that were not annotated as SNPs in the databases. Thus, mass spectrometry is a valuable complementary method for initial discovery of locations of non-synonymous SNPs. Importantly, we also identified a novel gene that was not predicted by automatic annotation pipelines at all. The task of assigning translational start sites is fraught with errors, especially in the absence of transcript data. Similarly, UTRs can also be wrongly assigned. Using MS/MS data, we corrected the translational start sites and UTR assignments of proteins, which would otherwise be difficult or impossible using molecular biology-based methods. Our MS/MS-derived peptide sequence data have been uploaded onto the Ensembl DAS server and can be visualized using the Ensembl genome browser. In summary, we have demonstrated how mass spectrometry-derived data can be used to refine the annotations of a complex eukaryotic genome and share them with the biomedical community.
Salivary glands from female An. gambiae (G-3 strain) were homogenized and subjected to digestion with trypsin as described previously. The tryptic peptides were subjected to LC-MS/MS and analyzed on a quadrupole time-of-flight mass spectrometer (QTOF US-API, Micromass, UK) as described. A total of 16 LC-MS/MS runs were carried out, and the 3,967 MS/MS spectra acquired were searched against both protein and genome databases. This led to identification of 369 unique peptide sequences. The acquisition and deconvolution of data were performed on a MassLynx Windows NT PC data system (version 4.0).
The An. gambiae genome and proteome database (release 16.2) was downloaded from the Ensembl ftp site http://ftp.ensembl.org. Mass spectrometric data searches were performed using Mascot version 1.9 installed on a Linux cluster against the NCBI non-redundant (nr) database as described earlier. The following settings were used: a) trypsin as the specific enzyme (allowing up to 2 missed cleavages); b) peptide window tolerance (error window on experimental peptide mass values) of ± 0.4 Da; and c) fragment mass tolerance of ± 0.3 Da. During the searches, three amino acid modifications were allowed: N-terminal acetylation, oxidation of methionine and carbamidomethylcysteine. Searches were also carried out against the genome database. For this purpose, the large genome sequence files in FASTA format were trimmed into 100 kb long sequences, since Mascot cannot deal with large genomic sequences. The same set of parameters was used for the genome search as for the NCBI nr search. Only peptides with a Mascot score greater than 30 and containing a sequence tag of at least four consecutive amino acids were considered in this study. The spectra were further investigated and verified by manual interpretation. Any peptide hits that matched transcripts labeled known or novel were investigated further using the Ensembl browser. This included validation of existing exons, correction of intron-exon boundaries and mapping of N-termini of mature proteins. All peptide matches to the genome were compared with matches to the non-redundant protein database. Those peptides that did not match any entry in the nr database were analyzed further. This allowed identification of novel genes that are not predicted by gene prediction algorithms, as well as correction of regions annotated as introns or untranslated regions.
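The trimming step described above can be sketched as follows. This is a minimal illustration, not the script used in this study: the function names and the chunk naming scheme (`name:start-end`, 1-based) are assumptions, and the optional `overlap` parameter is a hypothetical addition whose default of 0 gives plain consecutive 100 kb pieces.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # Keep only the first word of the header as the sequence name.
            name, parts = (line[1:].split() or [""])[0], []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def chunk_sequence(
    name: str, seq: str, size: int = 100_000, overlap: int = 0
) -> Iterator[Tuple[str, str]]:
    """Split one sequence into pieces of at most `size` bases.

    Each piece is named with its 1-based coordinates on the parent
    sequence so that hits can later be mapped back to the genome.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    for start in range(0, len(seq), step):
        piece = seq[start:start + size]
        yield f"{name}:{start + 1}-{start + len(piece)}", piece
        if start + size >= len(seq):
            break  # the last piece already reaches the end of the sequence
```

With the default `overlap=0`, a peptide whose coding sequence straddles a chunk boundary would be split between two pieces; a small overlap is one possible way to guard against this, though whether the original trimming used one is not stated here.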
In our study, if a peptide maps to more than one transcript, we have assigned such peptides to all of the corresponding transcripts. However, for the purpose of counting the number of transcripts/proteins that we have identified, we have included only the assignments of those peptides that had matches to only one transcript.
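The counting rule above can be sketched as follows, assuming the search results have already been reduced to a mapping from each peptide to the set of transcripts it matches. The function name and the identifiers in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def assign_and_count(
    peptide_hits: Dict[str, Set[str]]
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Assign peptides to transcripts and pick the transcripts to count.

    Every peptide is assigned to all transcripts it matches, but a
    transcript is counted as identified only if at least one of its
    peptides matches that transcript alone.
    """
    peptides_by_transcript: Dict[str, Set[str]] = defaultdict(set)
    counted: Set[str] = set()
    for peptide, transcripts in peptide_hits.items():
        for transcript in transcripts:
            peptides_by_transcript[transcript].add(peptide)
        if len(transcripts) == 1:
            counted.update(transcripts)
    return dict(peptides_by_transcript), counted
```

For instance, with the hypothetical input `{"PEPA": {"T1"}, "PEPB": {"T1", "T2"}}`, both peptides are assigned to T1 and PEPB is also assigned to T2, but only T1 is counted because T2 is supported solely by a shared peptide.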
To identify potential cSNPs, we utilized the "point mutations" feature in X! Tandem, which allows the user to identify single amino acid changes in peptides. We searched the data against the Ensembl Anopheles protein database using the X! Tandem 2 release search algorithm installed on a Linux cluster. The search parameters were the same as those described above for the Mascot search.
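The idea of a single amino acid change can be illustrated with a simple comparison of an experimentally derived peptide against the corresponding reference peptide. This sketch is not X! Tandem's internal algorithm; it assumes the two sequences are already aligned and of equal length, and the function name is illustrative. Isoleucine and leucine are treated as equivalent because they have the same mass and cannot be distinguished by this kind of measurement.

```python
from typing import Optional, Tuple


def single_substitution(observed: str, reference: str) -> Optional[Tuple[int, str, str]]:
    """Return (1-based position, reference residue, observed residue)
    if the peptides differ by exactly one substitution, else None."""
    if len(observed) != len(reference):
        return None
    differences = [
        (i + 1, ref, obs)
        for i, (ref, obs) in enumerate(zip(reference, observed))
        if ref != obs and {ref, obs} != {"I", "L"}  # I and L are isobaric
    ]
    return differences[0] if len(differences) == 1 else None
```

A returned tuple such as `(5, "A", "I")` would be read as a change from alanine to isoleucine (or leucine) at position 5 of the peptide; whether it reflects a genuine polymorphism still requires inspection of the spectrum and of the underlying codon.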
The Distributed Annotation System provided by Ensembl was used to visualize the peptides in the context of the genome and to share our annotations with the community. A genome database search was carried out using the peptides to determine the corresponding region in the genome for each peptide. The positions of these peptides were obtained using scripts written in the Python programming language. These scripts parse the output files obtained by searching the genome using the TBLASTN algorithm. The genomic coordinates were packaged into the tab-delimited format necessary for uploading onto the DAS server at Ensembl. Whenever a user chooses 'MS data JHU' as a DAS server source, each peptide and its coordinates on the genome are mapped onto the browser and visualized on a separate track.
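The parsing step described above might look like the following sketch. It assumes TBLASTN results in the standard 12-column tabular BLAST format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score), which may differ from the output format actually parsed in this study. The output columns (peptide, sequence name, start, end, strand) are hypothetical and do not reproduce the upload format required by the Ensembl DAS server.

```python
from typing import Iterable, List


def tblastn_hits_to_rows(lines: Iterable[str], min_identity: float = 100.0) -> List[str]:
    """Convert tabular TBLASTN hits into tab-delimited coordinate rows."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular hit line
        peptide, sequence_name = fields[0], fields[1]
        identity = float(fields[2])
        subject_start, subject_end = int(fields[8]), int(fields[9])
        if identity < min_identity:
            continue
        # In TBLASTN output, start > end indicates a reverse-strand hit.
        strand = "+" if subject_start <= subject_end else "-"
        start = min(subject_start, subject_end)
        end = max(subject_start, subject_end)
        rows.append("\t".join([peptide, sequence_name, str(start), str(end), strand]))
    return rows
```

This sketch filters on percent identity only; it does not check that the alignment covers the full length of the peptide, and if the genome was searched as trimmed pieces, the piece offsets would still need to be added to obtain chromosome coordinates.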
NCBI: National Center for Biotechnology Information
EBI: European Bioinformatics Institute
Mbp: Mega base pair
DAS: Distributed Annotation System
UTRs: Annotated untranslated regions
cSNPs: Coding single nucleotide polymorphisms
LC-MS/MS: Liquid chromatography nanoelectrospray tandem mass spectrometry
MS/MS: Tandem mass spectrometry
JHU: Johns Hopkins University
AP and NK were supported by a pilot project grant from the Malaria Research Institute at the Johns Hopkins Bloomberg School of Public Health. We thank John Kloss and Jakob Bunkenborg for their help with database searching. We thank Martin Hammond, Ensembl mosquito genome project's coordinator, and Ewan Birney at the EBI for their help with making the peptide data available through the Ensembl DAS server. We also thank members of the Institute of Bioinformatics for their assistance with the genome analysis. We thank Sun Microsystems for providing us a computer cluster under the Academic Equipment Grant mechanism.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.