Identification of alternative splice variants in Aspergillus flavus through comparison of multiple tandem MS search algorithms
© Chang and Muddiman; licensee BioMed Central Ltd. 2011
Received: 14 January 2011
Accepted: 11 July 2011
Published: 11 July 2011
Database searching is the most frequently used approach for automated peptide assignment and protein inference of tandem mass spectra. The results, however, depend on the sequences in the target database and on the search algorithm. Recently, by using an alternative splicing database, we identified more proteins in Aspergillus flavus than with the annotated proteins alone. In this study, we aimed to find a greater number of eligible splice variants based on newly available transcript sequences and the latest genome annotation. The improved database was then used to compare four search algorithms: Mascot, OMSSA, X! Tandem, and InsPecT.
The updated alternative splicing database predicted 15833 putative protein variants, 61% more than the previous results. There was transcript evidence for 50% of the updated genes compared to the previous 35% coverage. Database searches were conducted using the same set of spectral data, search parameters, and protein database but with different algorithms. The false discovery rates of the peptide-spectrum matches were estimated to be below 2%. The total number of identified proteins varied from 765 to 867 between algorithms. Whereas 42% (1651/3891) of peptide assignments were unanimous, the comparison showed that 51% (568/1114) of the RefSeq proteins and 15% (11/72) of the putative splice variants were inferred by all algorithms. Twelve plausible isoforms were discovered by focusing on the consensus peptides, which were detected by at least three different algorithms. The analysis found different conserved domains in two putative isoforms of UDP-galactose 4-epimerase.
We were able to detect dozens of new peptides using the improved alternative splicing database with the recently updated annotation of the A. flavus genome. Unlike the identifications of the peptides and the RefSeq proteins, large variations existed between the putative splice variants identified by different algorithms. Twelve candidate isoforms were reported based on the consensus peptide-spectrum matches. This suggests that applying multiple search engines effectively reduced the possible false positive results and validated the protein identifications from tandem mass spectra using an alternative splicing database.
Tandem mass spectrometry (MS/MS) has been one of the most effective high-throughput approaches for protein identification and quantification. In a typical "bottom-up" approach, also known as the shotgun proteomics strategy, the enzyme-digested protein mixture is analyzed using single- or multi-dimensional chromatography coupled with tandem mass spectrometry [1, 2]. A variety of computational approaches have been developed to assign peptide sequences to the acquired MS/MS data. Database searching algorithms are the most frequently used methods for large-scale proteomics studies. The most popular commercial MS/MS search engines are SEQUEST (Thermo Fisher Scientific Inc.) and Mascot (Matrix Science Ltd.). Open source tools are also available, such as OMSSA, X! Tandem, and Andromeda. Although each implementation is different, the general approach of MS/MS search algorithms is similar. Given a protein sequence database, the search algorithm first generates all in silico-digested peptides according to the specified parameters, such as digestive enzymes, missed cleavages, and modifications. For each MS/MS spectrum, the search engine only evaluates the candidate peptide sequences within a user-defined precursor mass tolerance window. A scoring function is used to calculate a score that represents how well the theoretical spectrum of each candidate peptide matches the observed spectrum. The top-scoring peptide hit is reported, and the peptide sequence is then assigned to the experimental MS/MS spectrum. Protein identifications are inferred by grouping the peptide-spectrum matches.
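The in silico digestion step described above can be illustrated with a minimal Python sketch. It is not the code of any of the search engines named here. The cleavage rule (cut after K or R unless the next residue is P), the function names, and the `min_length` parameter are assumptions introduced for illustration only.

```python
def cleavage_fragments(protein):
    """Split a protein at tryptic sites: after K or R, unless followed by P (assumed rule)."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal fragment that does not end in K or R.
        fragments.append(protein[start:])
    return fragments


def tryptic_digest(protein, missed_cleavages=2, min_length=6):
    """Return all peptides with at most `missed_cleavages` uncut internal sites.

    `min_length` is a hypothetical filter, not a parameter taken from the study.
    """
    fragments = cleavage_fragments(protein)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages + 1, len(fragments))
        for j in range(i, last):
            # Joining j - i + 1 adjacent fragments leaves j - i missed cleavages.
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


if __name__ == "__main__":
    # Toy sequence, not an A. flavus protein.
    for peptide in sorted(tryptic_digest("MKAAARPEPTIDEKGG", missed_cleavages=1, min_length=1)):
        print(peptide)
```

A search engine would then compute the mass of each candidate and keep only those inside the precursor tolerance window of the spectrum being scored.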
Another approach for identifying peptides from fragment ion spectra combines partial de novo sequencing and database searching. Short peptide sequence tags are inferred from MS/MS spectra using de novo algorithms. The list of candidate peptides in the database search can be reduced to only those containing the tag. The algorithms will then try to extend the sequence tag by finding masses of the flanking residues in the database peptide which match masses of the prefix and suffix regions of the tag. Although the hybrid approach is still reliant on protein sequence databases, it is an alternative strategy when analyzing peptides with novel modifications or sequence variations.
Alternative pre-mRNA splicing (AS) enables eukaryotes to generate distinct mRNAs and therefore multiple protein variants from a single gene. The common approach to developing an alternative splicing database is based on automated large-scale mapping of transcripts and genomic sequences. The massively parallel picolitre-scale sequencing system developed by the 454 Life Sciences Corporation was capable of sequencing 25 million bases in a four-hour run. The 454 sequence reads are short, averaging 80-120 bases per read. The massively parallel sequencing-by-synthesis technology has been used to generate EST data of a human prostate cancer cell line, and 25 novel alternative exon splicing events were identified.
Rebuilding A. flavus alternative splicing database
Comparison of MS/MS search algorithms on identifying putative isoforms
Table 1. Number of identified peptides and proteins by algorithms with a FDR < 2%
Score thresholds listed in the table: E-value < 0.001; E-value < 0.09; E-value < 0.04; p-value < 0.02
Table 2. Overlap of identified peptides and proteins between algorithms with a FDR < 2%
Row labels (algorithm combinations): X! Tandem only; Mascot, X! Tandem; OMSSA, X! Tandem; X! Tandem, InsPecT; Mascot, OMSSA, X! Tandem; Mascot, OMSSA, InsPecT; Mascot, X! Tandem, InsPecT; OMSSA, X! Tandem, InsPecT; Mascot, OMSSA, X! Tandem, InsPecT
Table 3. Number of MS/MS spectra assigned to different peptide sequences by algorithms
Column heading: by X! Tandem
Table 4. List of consensus peptides specific to putative isoforms with a FDR < 2%
Column heading: Peptide Specific to Putative Isoform
Proteins listed: prefoldin subunit 6; UTP-glucose-1-phosphate uridylyltransferase Ugp1; conserved hypothetical protein; 14-3-3 family protein ArtA; conserved hypothetical protein; 14-3-3 protein sigma, gamma, zeta, beta/alpha; ubiquinol-cytochrome C reductase complex core protein 2
Conserved domain analysis of putative isoforms of UDP-galactose 4-epimerase
When multiple protein products are encoded by the same gene, the different isoforms often perform different biological functions. Thus, we were interested in learning whether the two identified UGE isoforms had different functional motifs in their sequences. The Conserved Domain Database (CDD), part of NCBI's Entrez database system, is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models represented as position-specific score matrices (PSSMs). Two motifs were found by searching the RefSeq sequence against CDD (version 2.23, containing 37407 PSSMs) (Figure 5A). One was a member of the Rossmann-fold NAD(P)(+)-binding proteins superfamily, 3-ketoacyl-(acyl-carrier-protein) reductase [CDD: PRK12825], and the other was UDP-glucose 4-epimerase [CDD: PLN02240]. A different member of the Rossmann-fold NAD(P)(+)-binding proteins superfamily, short chain dehydrogenase [CDD: pfam00106], was found in the sequence of the alternatively spliced variant (Figure 5B). UDP-galactose 4-epimerase is known as a member of the short chain dehydrogenase/reductase superfamily. These enzymes contain a conserved Tyr-X-X-X-Lys motif necessary for catalytic activity. The characteristic YXXXK motif of human epimerase was located at Tyr-157-Gly-Lys-Ser-Lys-161. The YXXXK signature sequence, Tyr-156-Gly-Asn-Thr-Lys-160 (YGNTK), was also found in the predicted variant sequence of A. flavus UGE. The different sets of motifs found in the two UGE proteins suggested that the putative isoforms may carry out different functions in vivo.
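Locating a Tyr-X-X-X-Lys signature in a protein sequence is a simple pattern scan. The following Python sketch shows one way to do it with a regular expression. The function name and the toy sequences are hypothetical, and the sequences are not the actual UGE proteins; the CDD searches reported above used PSSM-based models, which this sketch does not reproduce.

```python
import re

# A lookahead lets overlapping Y-X-X-X-K occurrences all be reported.
YXXXK = re.compile(r"(?=(Y...K))")


def find_yxxxk(sequence):
    """Return (1-based position of the Tyr, five-residue motif) for every match."""
    return [(match.start() + 1, match.group(1)) for match in YXXXK.finditer(sequence)]


if __name__ == "__main__":
    # Toy fragments that embed the two motifs quoted in the text.
    print(find_yxxxk("AAYGNTKAA"))
    print(find_yxxxk("MSYGKSKFF"))
```

A match only shows that the signature residues are present at that spacing; whether the site is catalytically functional is a separate question.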
A new A. flavus alternative splicing database was rebuilt with reference to the latest genome annotation. By incorporating newly qualified 454 sequence reads, more splice variants were predicted from more genes compared to the previous database. Though several previously discovered peptides had been included in the updated proteome, newly predicted variants were identified from the improved database using the same set of spectra. According to the Mascot results, 29 additional proteins from 26 genes were found in the previous study, while 21 putative isoforms encoded by 21 genes were reported in this study. The results suggested that the additional transcript sequences made it possible to predict eligible splice variants even though the genome annotation had been updated recently.
Different groups have conducted comparative evaluations of MS/MS search algorithms [9, 21]. The variation in scoring functions and statistical significance techniques among database-searching algorithms gives different identification results. The overlaps of the search results from multiple algorithms can shift significantly as search parameters are modified. However, those studies were performed using general protein databases without emphasizing alternatively spliced isoforms. In this study, Mascot, OMSSA, X! Tandem, and InsPecT were compared using an alternative splicing database. In spite of the agreement on 42% of peptide and 51% of RefSeq protein identifications, our results showed that only 15% of the putative splice isoforms were detected by all algorithms (Table 2). The fact that less than 2% of spectra were assigned to multiple peptide sequences did not explain all the variation in isoform identifications (Table 3).
The prediction of alternatively spliced variants from EST sequences by a computational pipeline tends to overestimate the number of variants and may contain errors. The introduction of putative isoforms into the protein database can further reduce the statistical significance of peptide identifications because the search space grows with the size of the database. Consensus decision making exploits the strengths of multiple search algorithms to validate the assignment results of spectral data at a relatively low cost. The approach is particularly valuable when making inferences about isoform identifications from an alternative splicing database.
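The consensus rule used in this study (a peptide assignment supported by at least three of the four algorithms) can be sketched in Python as follows. The input layout, the function name, and the example identifiers are assumptions made for illustration; they do not describe the scripts actually used.

```python
from collections import defaultdict


def consensus_psms(results, min_votes=3):
    """Keep spectrum-to-peptide assignments reported by at least `min_votes` algorithms.

    `results` maps an algorithm name to a dict of {spectrum_id: peptide}
    (a hypothetical layout). With four algorithms and min_votes=3, at most
    one peptide per spectrum can reach the threshold.
    """
    votes = defaultdict(lambda: defaultdict(set))
    for algorithm, assignments in results.items():
        for spectrum_id, peptide in assignments.items():
            votes[spectrum_id][peptide].add(algorithm)

    consensus = {}
    for spectrum_id, by_peptide in votes.items():
        for peptide, algorithms in by_peptide.items():
            if len(algorithms) >= min_votes:
                consensus[spectrum_id] = peptide
    return consensus


if __name__ == "__main__":
    # Made-up assignments, not data from the study.
    example = {
        "Mascot": {"s1": "PEPTIDEK", "s2": "AAAK"},
        "OMSSA": {"s1": "PEPTIDEK", "s2": "CCCK"},
        "X! Tandem": {"s1": "PEPTIDEK"},
        "InsPecT": {"s1": "PEPTIDER", "s2": "AAAK"},
    }
    print(consensus_psms(example))
```

Lowering `min_votes` to 2 could let two different peptides qualify for the same spectrum, in which case this sketch would keep only the last one it encounters; a stricter implementation would need an explicit tie rule.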
The A. flavus NRRL3357 whole genome shotgun project [Refseq: NZ_AAIH00000000] released an updated version on Aug 12, 2009. The second version of the project contains 13487 genes and coding proteins, and no splice isoforms were included in the genome annotation. The nucleotide records and protein sequences of A. flavus NRRL3357 were downloaded from RefSeq release 40 (March 7, 2010) using Taxonomy ID equal to 332952. Other supplementary information including coding exons was collected from Entrez Genome and Entrez Gene databases.
Alternative Splicing Database
The alternative splicing database of A. flavus in this study was constructed using the most recent official version of the genome described above. Serving as transcript evidence, 21130 EST sequences and 559014 454 sequences were used to predict putative splicing variants. 20371 ESTs were downloaded from the EST database of NCBI by specifying the species "Aspergillus flavus". All 454 sequences and an additional 759 ESTs were provided by the Center for Integrated Fungal Research at North Carolina State University.
The EST and 454 sequences were first mapped to the annotated gene sequences using BLAST (version 2.2.22). To ensure the quality of the predicted splice variant sequences, only those EST/454 transcripts that satisfied the threshold (E-value < 0.001) were aligned against the corresponding genes by sim4. The alignments were allowed to search 3000 bases upstream and downstream to capture any potential missing exons. The distance of 3 kb was chosen as twice the length of the largest intron found in the current genome annotation. For each gene, all splice sites of exons reported by the sim4 alignments were integrated into a data structure called a splicing graph. In the resulting directed graph, edges represented putative exons, vertices stood for splice sites, and paths denoted transcripts. An alternative splicing event was indicated when more than one exon (edge) led into the same 5' splice site (vertex), or when the same 3' splice site (vertex) was followed by more than one possible exon (edge). The putative splicing variants from the same gene were generated by visiting all possible paths. The corresponding protein sequences were translated from the predicted transcripts with a minimum length requirement of eighteen amino acids. Finally, any predicted protein whose sequence was either a subsequence or an identical duplicate of one entry in the RefSeq database was removed before conducting the database searches.
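The path enumeration and the final filtering step can be sketched in Python. This is a simplified illustration under stated assumptions, not the pipeline used in the study: the graph is assumed to be acyclic, each vertex is treated as an abstract junction label, "subsequence" is interpreted as a contiguous substring, and the function names and toy data are hypothetical. Translation of transcripts into protein sequences is omitted.

```python
from collections import defaultdict


def enumerate_transcripts(exons):
    """Spell out every source-to-sink path of an acyclic splicing graph.

    `exons` is a list of (from_vertex, to_vertex, sequence) edges.
    """
    outgoing = defaultdict(list)
    has_incoming = set()
    for start, end, sequence in exons:
        outgoing[start].append((end, sequence))
        has_incoming.add(end)
    # A source is a vertex that no exon leads into.
    sources = sorted(v for v in outgoing if v not in has_incoming)

    transcripts = []

    def walk(vertex, prefix):
        if vertex not in outgoing:  # sink: no exon leaves this vertex
            transcripts.append(prefix)
            return
        for next_vertex, sequence in outgoing[vertex]:
            walk(next_vertex, prefix + sequence)

    for source in sources:
        walk(source, "")
    return transcripts


def filter_novel(proteins, refseq, min_length=18):
    """Drop proteins shorter than `min_length` or contained in any RefSeq entry."""
    return [
        protein
        for protein in proteins
        if len(protein) >= min_length and not any(protein in entry for entry in refseq)
    ]


if __name__ == "__main__":
    # Toy gene with two alternative middle exons between the same junctions.
    toy_exons = [
        ("v0", "v1", "ATGGCT"),
        ("v1", "v2", "GAA"),
        ("v1", "v2", "GAAGTT"),
        ("v2", "v3", "TAA"),
    ]
    print(enumerate_transcripts(toy_exons))
```

The number of paths can grow exponentially with the number of alternative sites, so a production pipeline would likely need a cap or pruning rule that this sketch does not include.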
The MS/MS spectra used in this study were generated in a previous experiment. In brief, 12C6-Arg and 13C6-Arg labeled cultures of A. flavus were grown for 24 h at 28°C or 37°C. Extracted protein samples were separated on a 12.5% SDS-PAGE gel. Forty bands from each lane were excised, then reduced, alkylated, and digested with trypsin for 18 h at 37°C. Each of the 40 in-gel digested samples was analyzed by nanoflow LC-MS/MS on an LTQ-FT (ThermoFisher Scientific). The bottom-up SILAC A. flavus data associated with this manuscript may be downloaded from the Proteome Commons Database Tranche network using the following hash: O9h2YUGGpAOG+ex5+rYTySoRxqvyPayGlWPspibKkA13BXCVcpVMp3oCmH4HwZOofp5azAQcx4coCH6I82DCx5vQjwwAAAAAAAAn5g==.
Four different MS/MS search algorithms were chosen for comparison, including Mascot Server (version 2.2.04) from Matrix Science Ltd., OMSSA (version 2.1.7) from NCBI, X! Tandem TORNADO (2010.01.01.4) from the Global Proteome Machine Organization, and InsPecT (version 20100804) from the Center for Computational Mass Spectrometry at the University of California, San Diego. The original spectra were stored as Thermo XCalibur RAW files. To ensure that all four search algorithms started with the same set of peak lists, the experimental spectra in RAW file format were first converted to the files in Mascot Generic Format (MGF) by Mascot Distiller (Matrix Science Ltd.) using the same processing options. A total of 311105 spectra from 77 MGF files were used in this study. The database searches were performed with the same parameters for all four search algorithms. The settings specified trypsin as the protease, a maximum of two missed cleavage sites, precursor charge up to 3+, 5 ppm precursor ion tolerance (0.01 Da for OMSSA), and 1 Dalton product ion tolerance. The searches also accounted for carbamidomethyl modification on Cysteine (C) as a fixed parameter, and variable modifications included oxidation on Methionine (M) and deamidation on Asparagine (N) or Glutamine (Q). This study focused on detecting splice isoforms instead of exploring the protein profiles at different temperatures. Although the input spectra were derived from a previous SILAC experiment, the data were only searched for light peptides without the 13C6-Arg label. It is noted that the setting of the refinement node for X! Tandem is ON as default.
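Because the precursor tolerance is given in ppm for three engines and in Da for OMSSA, it can help to see how the two units relate. The short Python sketch below is plain arithmetic; the function name is hypothetical, and the example masses are chosen for illustration rather than taken from the data.

```python
def ppm_window(mass, ppm=5.0):
    """Return the (low, high) bounds of a +/- `ppm` window around `mass` (in Da)."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


if __name__ == "__main__":
    for mass in (1000.0, 2000.0, 3000.0):
        low, high = ppm_window(mass)
        # Half-width in Da grows linearly with the precursor mass.
        print(mass, round((high - low) / 2, 4))
```

A 5 ppm window equals 0.005 Da at 1000 Da and 0.01 Da at 2000 Da, so a fixed 0.01 Da window is wider than 5 ppm below 2000 Da and narrower above it.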
False Discovery Rate
The FDR for each search result was estimated by searching the decoy (reverse) database and then counting the number of peptide-spectrum matches identified from the target database (Nt) and the decoy database (Nd). The target-decoy database search can be conducted in two ways: a single search against a concatenated target/decoy database, or two independent searches against the target and decoy databases, respectively. The separate search provided a conservative estimate. FDRs of the peptides identified by Mascot, OMSSA, and X! Tandem were estimated using the separate search strategy and calculated as Nd/Nt. However, the separate search approach was not feasible for the InsPecT results. The InsPecT tutorial states that most results are not statistically significant and that post-processing is essential. It is necessary to run the PValue.py script, included in the InsPecT distribution, to weed out insignificant results. The script uses a concatenated target/decoy database to calibrate the p-value by fitting the score distribution with a mixture model. Hence, the FDR of the peptides identified by InsPecT was estimated using the concatenated database strategy instead, computed as 2 * Nd/(Nt + Nd).
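The two FDR estimates above translate directly into code. The following Python sketch implements only the two formulas as stated; the function names and the example counts are hypothetical and are not results from this study.

```python
def fdr_separate(n_target, n_decoy):
    """Separate target and decoy searches: FDR = Nd / Nt."""
    if n_target <= 0:
        raise ValueError("n_target must be positive")
    return n_decoy / n_target


def fdr_concatenated(n_target, n_decoy):
    """Single search of a concatenated database: FDR = 2 * Nd / (Nt + Nd)."""
    total = n_target + n_decoy
    if total <= 0:
        raise ValueError("at least one match is required")
    return 2 * n_decoy / total


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(fdr_separate(1000, 15))
    print(fdr_concatenated(985, 15))
```

In practice the score threshold of each engine would be adjusted until the estimate falls below the chosen cutoff, which is 2% in this study.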
The authors thank the W.M. Keck Foundation and North Carolina State University for supporting this research. The authors gratefully acknowledge Dr. Gary A. Payne and Dr. Dahlia Nielsen for providing the 454 sequencing data of A. flavus.
1. Link AJ, Eng J, Schieltz DM, Carmack E, Mize GJ, Morris DR, Garvik BM, Yates JR: Direct analysis of protein complexes using mass spectrometry. Nat Biotechnol. 1999, 17: 676-682. 10.1038/10890.
2. Washburn MP, Wolters D, Yates JR: Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol. 2001, 19: 242-247. 10.1038/85686.
3. Sadygov RG, Cociorva D, Yates JR: Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book. Nat Methods. 2004, 1: 195-202. 10.1038/nmeth725.
4. Eng JK, McCormack AL, Yates JR: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom. 1994, 5: 976-989. 10.1016/1044-0305(94)80016-2.
5. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999, 20: 3551-3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
6. Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH: Open mass spectrometry search algorithm. J Proteome Res. 2004, 3: 958-964. 10.1021/pr0499491.
7. Craig R, Beavis RC: TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004, 20: 1466-1467. 10.1093/bioinformatics/bth092.
8. Cox J, Neuhauser N, Michalski A, Scheltema RA, Olsen JV, Mann M: Andromeda: a peptide search engine integrated into the MaxQuant environment. J Proteome Res. 2011, 10: 1794-1805. 10.1021/pr101065j.
9. Balgley BM, Laudeman T, Yang L, Song T, Lee CS: Comparative evaluation of tandem MS search algorithms using a target-decoy search strategy. Mol Cell Proteomics. 2007, 6: 1599-1608. 10.1074/mcp.M600469-MCP200.
10. Nesvizhskii AI, Vitek O, Aebersold R: Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat Methods. 2007, 4: 787-797. 10.1038/nmeth1088.
11. Hughes C, Ma B, Lajoie GA: De novo sequencing methods in proteomics. Methods Mol Biol. 2010, 604: 105-121. 10.1007/978-1-60761-444-9_8.
12. Tanner S, Shu H, Frank A, Wang LC, Zandi E, Mumby M, Pevzner PA, Bafna V: InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal Chem. 2005, 77: 4626-4639. 10.1021/ac050102d.
13. Tabb DL, Saraf A, Yates JR: GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. Anal Chem. 2003, 75: 6415-6421. 10.1021/ac0347462.
14. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437: 376-380.
15. Bainbridge MN, Warren RL, Hirst M, Romanuik T, Zeng T, Go A, Delaney A, Griffith M, Hickenbotham M, Magrini V, et al: Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach. BMC Genomics. 2006, 7: 246. 10.1186/1471-2164-7-246.
16. Chang KY, Georgianna DR, Heber S, Payne GA, Muddiman DC: Detection of alternative splice variants at the proteome level in Aspergillus flavus. J Proteome Res. 2010, 9: 1209-1217. 10.1021/pr900602d.
17. Holden HM, Rayment I, Thoden JB: Structure and function of enzymes of the Leloir pathway for galactose metabolism. J Biol Chem. 2003, 278: 43885-43888. 10.1074/jbc.R300025200.
18. Barber C, Rosti J, Rawat A, Findlay K, Roberts K, Seifert GJ: Distinct properties of the five UDP-D-glucose/UDP-D-galactose 4-epimerase isoforms of Arabidopsis thaliana. J Biol Chem. 2006, 281: 17276-17285. 10.1074/jbc.M512727200.
19. Marchler-Bauer A, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, Gwadz M, et al: CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 2009, 37: D205-D210. 10.1093/nar/gkn845.
20. Thoden JB, Wohlers TM, Fridovich-Keil JL, Holden HM: Human UDP-galactose 4-epimerase. Accommodation of UDP-N-acetylglucosamine within the active site. J Biol Chem. 2001, 276: 15131-15136. 10.1074/jbc.M100220200.
21. Kapp EA, Schutz F, Connolly LM, Chakel JA, Meza JE, Miller CA, Fenyo D, Eng JK, Adkins JN, Omenn GS, Simpson RJ: An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: sensitivity and specificity analysis. Proteomics. 2005, 5: 3475-3490. 10.1002/pmic.200500126.
22. Searle BC, Turner M, Nesvizhskii AI: Improving sensitivity by probabilistically combining results from multiple MS/MS search methodologies. J Proteome Res. 2008, 7: 245-253. 10.1021/pr070540w.
23. Edwards N, Wu X, Tseng CW: An unsupervised, model-free, machine-learning combiner for peptide identifications from tandem mass spectra. Clin Proteomics. 2009, 5: 23-36. 10.1007/s12014-009-9024-5.
24. Yu W, Taylor JA, Davis MT, Bonilla LE, Lee KA, Auger PL, Farnsworth CC, Welcher AA, Patterson SD: Maximizing the sensitivity and reliability of peptide identification in large-scale proteomic experiments by harnessing multiple search engines. Proteomics. 2010, 10: 1172-1189. 10.1002/pmic.200900074.
25. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
26. Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998, 8: 967-974.
27. Heber S, Alekseyev M, Sze SH, Tang H, Pevzner PA: Splicing graphs and EST assembly problem. Bioinformatics. 2002, 18 (Suppl 1): S181-S188. 10.1093/bioinformatics/18.suppl_1.S181.
28. Georgianna DR, Hawkridge AM, Muddiman DC, Payne GA: Temperature-dependent regulation of proteins in Aspergillus flavus: whole organism stable isotope labeling by amino acids. J Proteome Res. 2008, 7: 2973-2979. 10.1021/pr8001047.
29. Proteome Commons Database. http://proteomecommons.org.
30. Choi H, Nesvizhskii AI: False discovery rates and related statistical concepts in mass spectrometry-based proteomics. J Proteome Res. 2008, 7: 47-50. 10.1021/pr700747q.
31. Käll L, Storey JD, MacCoss MJ, Noble WS: Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J Proteome Res. 2008, 7: 29-34. 10.1021/pr700600n.
32. Elias JE, Gygi SP: Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods. 2007, 4: 207-214. 10.1038/nmeth1019.
33. Oliveros JC: VENNY. An interactive tool for comparing lists with Venn Diagrams. 2007, http://bioinfogp.cnb.csic.es/tools/venny/index.html
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.