Identification of alternative splice variants in Aspergillus flavus through comparison of multiple tandem MS search algorithms

Background Database searching is the most frequently used approach for automated peptide assignment and protein inference of tandem mass spectra. The results, however, depend on the sequences in target databases and on search algorithms. Recently by using an alternative splicing database, we identified more proteins than with the annotated proteins in Aspergillus flavus. In this study, we aimed at finding a greater number of eligible splice variants based on newly available transcript sequences and the latest genome annotation. The improved database was then used to compare four search algorithms: Mascot, OMSSA, X! Tandem, and InsPecT. Results The updated alternative splicing database predicted 15833 putative protein variants, 61% more than the previous results. There was transcript evidence for 50% of the updated genes compared to the previous 35% coverage. Database searches were conducted using the same set of spectral data, search parameters, and protein database but with different algorithms. The false discovery rates of the peptide-spectrum matches were estimated < 2%. The numbers of the total identified proteins varied from 765 to 867 between algorithms. Whereas 42% (1651/3891) of peptide assignments were unanimous, the comparison showed that 51% (568/1114) of the RefSeq proteins and 15% (11/72) of the putative splice variants were inferred by all algorithms. 12 plausible isoforms were discovered by focusing on the consensus peptides which were detected by at least three different algorithms. The analysis found different conserved domains in two putative isoforms of UDP-galactose 4-epimerase. Conclusions We were able to detect dozens of new peptides using the improved alternative splicing database with the recently updated annotation of the A. flavus genome. Unlike the identifications of the peptides and the RefSeq proteins, large variations existed between the putative splice variants identified by different algorithms. 12 candidates of putative isoforms were reported based on the consensus peptide-spectrum matches. This suggests that applications of multiple search engines effectively reduced the possible false positive results and validated the protein identifications from tandem mass spectra using an alternative splicing database.


Background
Tandem mass spectrometry (MS/MS) has been one of the most effective high-throughput approaches for protein identification and quantification. In a typical "bottom-up" approach, also known as the shotgun proteomics strategy, the enzyme-digested protein mixture is analyzed using single-or multi-dimensional chromatography coupled with tandem mass spectrometry [1,2]. A variety of computational approaches have been developed to assign peptide sequences to the acquired MS/MS data. Database searching algorithms are the most frequently used methods for large-scale proteomics studies [3]. The most popular commercial MS/MS search engines are SEQUEST [4] (Thermo Fisher Scientific Inc.) and Mascot [5] (Matrix Science Ltd.). Open source tools are also available, such as OMSSA [6], X! Tandem [7], and Andromeda [8]. Although each implementation is different, the general approach of MS/MS search algorithms is similar [9]. Given a protein sequence database, the search algorithm first generates all in silico-digested peptides upon the specified parameters, such as digestive enzymes, missed cleavages, and modifications. For each MS/MS spectrum, the search engine only evaluates the candidate peptide sequences within a user-defined precursor mass tolerance window. A scoring function is used to calculate a score which represents how well the theoretical spectrum of each candidate peptide matches the observed spectrum. The top scoring peptide hit is reported and then the peptide sequence is assigned to the experimental MS/MS spectrum. Protein identifications are inferred by grouping the peptide-spectrum matches [10].
Another approach for identifying peptides from fragment ion spectra combines partial de novo sequencing and database searching. Short peptide sequence tags are inferred from MS/MS spectra using de novo algorithms. The list of candidate peptides in the database search can be reduced to only those containing the tag [11]. The algorithms will then try to extend the sequence tag by finding masses of the flanking residues in the database peptide which match masses of the prefix and suffix regions of the tag [12]. Although the hybrid approach is still reliant on protein sequence databases, it is an alternative strategy while analyzing peptides with novel modifications or sequence variations [13].
Alternative pre-mRNA splicing (AS) enables eukaryotes to generate distinct mRNAs and therefore multiple protein variants from a single gene. The common approach to developing an alternative splicing database is based on automated large-scale mapping of transcripts and genomic sequences. The massively parallel picolitrescale sequencing system developed by the 454 Life Sciences Corporation was capable of sequencing 25 million bases in a four-hour run [14]. The 454 sequence reads are short, averaging 80-120 bases per read. The massively parallel sequencing-by-synthesis technology has been used to generate EST data of a human prostate cancer cell line, and 25 novel alternative exon splicing events were identified [15].
Recently, we expanded the target database to include putative alternatively spliced isoforms with the aim that the MS/MS spectra can be better interpreted [16]. The results showed that our approach was able to identify more proteins from the experimental spectra and to provide evidence for improving the genome annotation. Subsequently, the Aspergillus flavus NRRL3357 whole genome shotgun project had a major update in 2009. Among 41 peptides discovered in our previous study, 6 of them were included in the second version of genome annotation. Meanwhile, 454 sequencing data of A. flavus became available locally. The first goal of this study was to rebuild the alternative splicing database using the latest genome annotation and newly acquired 454 sequencing data as transcript evidence. The second part of the study aimed at comparing four MS/MS search algorithms for isoform identifications using the resulting alternative splicing database. We tested three probability-based algorithms, Mascot [5], OMSSA [6], and X! Tandem [7], and one sequence tag-based algorithm, InsPecT [12]. The design of the study is illustrated in Figure 1.

Results
Rebuilding A. flavus alternative splicing database Genome annotation is the result of continuous efforts. An updated version of A. flavus genome annotation was released in 2009. Compared to the prior genome project, the second version dropped 360 previously documented genes and added 1000 novel ones (Figure 2A). A newly acquired collection of 454 sequence reads and ESTs provided the transcription information of half of the genes for predicting splice variants ( Figure 2B). An updated alternative splicing database was then built using the second version of the genome and all available transcripts. The RefSeq database (release 40) contained 13487 A. flavus genes and corresponding proteins, with no splice isoform. The updated alternative splicing database predicted another 15833 putative protein variants ( Figure 2C). It was estimated that 15.4% (2077/13487) of the total genes encoded more than one protein, 7.62 (15833/2077) putative isoforms per gene on average. The predicted variant sequences were appended to the collection of the RefSeq proteins to form a combined database for the following database searches.

Comparison of MS/MS search algorithms on identifying putative isoforms
In order to compare the performance of identifying putative splice variants, the same set of MS/MS spectra were searched against the resulting combined database by Mascot, OMSSA, X! Tandem, and InsPecT. Although each algorithm already reported internal statistical measures like p-value or E-value, the cut-off thresholds were selected to ensure the search results had an estimated false discovery rate (FDR) < 2% for peptide identification (see Additional file 1). While several isoforms were encoded from the same gene, sometimes the different protein products could not be distinguished by the identified peptides. In such a scenario, it was observed that Mascot would pick the protein with the longest sequence from all possible candidates. InsPecT would also report one protein from the list of candidate sequences, but not necessarily the longest one. In contrast, OMSSA and X! Tandem would report all matched proteins and let users interpret the findings. In order to present the results concisely, we accepted the longest protein sequence to represent the group of all possible matches. If a group of peptides could be mapped to either the RefSeq protein or the splice variant of the same gene, we conservatively assigned the identification to the RefSeq protein since no clear conclusion was possible. The number of identified peptides, RefSeq proteins, and splice variants by algorithms are listed in Table 1. To study the consistency between different algorithms on search results, the identified hits were categorized by the algorithms having the same finding ( Table 2). The overlaps were illustrated in four-way Venn diagrams as well ( Figure 3). For the peptide-spectrum matches, 42% (1651/3891) of peptide assignments were concurred by all four algorithms. Since we introduced predicted isoform sequences into the database, the protein identification was divided into two subgroups: RefSeq proteins and putative splice variants. 51% (568/1114) of the identified RefSeq proteins were consistent across all algorithms. In contrast, only 15% (11/72) of the putative splice variants were identified unanimously.
To investigate whether different algorithms assigned the same spectrum to different peptide sequences, the peptide-spectrum matches were examined within and between algorithms (Table 3). It was observed for all algorithms that 1% or fewer spectra were assigned to different peptides by the same tool. The inconsistency expanded but never exceeded 2% while comparing the assignment of the same spectrum between different algorithms. It also appeared that InsPecT assigned more spectra differently in comparison with other three probability-based algorithms. The multiple peptides assigned from the same spectra between algorithms might account for a part of the identification variations.
It was not surprising to see that the number of peptidespectrum matches and protein hits dropped while reducing the false discovery rate. However, most of the removed hits belonged to the identifications reported by only one algorithm (see Additional file 2). The consensus hits of multiple algorithms seemed more likely to be the correct identification. In the comparison of the overlaps between search results, the identified splice variants between different algorithms showed greater variations than the RefSeq proteins. It is noted that the prediction of all possible splice variants from ESTs tends to be overestimated. To reduce the false positive results, we compiled a list of top splice isoform candidates by taking advantage of the consensus peptides. By focusing on those variant-specific peptides identified by at least three different algorithms, 12 putative isoforms were reported (Table 4). 11 splice variants were inferred by all four algorithms. The scores, p-values, and E-values of the assignments looked satisfying. None of these specific peptide sequences appeared in any RefSeq proteins. In addition, no two consensus peptides came from the same spectra. As an example, one putative isoform discovered through the strategy was further analyzed below.  Conserved domain analysis of putative isoforms of UDPgalactose 4-epimerase plays a pivotal role in normal galactose metabolism, converting UDP-galactose back to UDP-glucose in the final step of the Leloir pathway [17]. NAD + is required to be a cofactor in the catalytic mechanism. Five UGE isoforms encoded in the Arabidopsis thaliana genome differed in enzymatic properties, transcript regulation, and subcellular localization [18]. The MS/MS spectrum which was used to assign the consensus peptide FAVE-TAITDVINAQR in the putative UGE isoform was examined ( Figure 4). The abundant matched b-and yions, accurate precursor ion mass, and expected mass difference from the SILAC pair observed in the spectrum correlated well with the low expectation value or p-value reported by algorithms.
According to the annotation of RefSeq release 40, A. flavus UDP-glucose 4-epimerase [Entrez Gene: 7919639] contained four coding exons ( Figure 5A). The corresponding splice variant generated from our prediction had three exons instead: the first two were constitutive and the third was alternative ( Figure 5B). Since different sets of peptide-spectrum matches were used to conclude the protein identification between search algorithms, the peptides shown in Figure 5 are based on Mascot's result. The alternative exon in the protein variant was supported by the distinctive peptide FAVETAITDVINAQR which was located in an intron of the corresponding RefSeq protein. The encoding variant sequence ended approximately in the middle of the third coding exon of the RefSeq counterpart. A group of 9 peptides which were mapped to the remaining coding sequence supported the identification of the RefSeq protein.
While multiple protein products are encoded from the same gene, different isoforms are usually destined for performing various biological functions. Thus, we were interested in learning whether two identified UGE isoforms had different functional motifs among their sequences. The Conserved Domain Database (CDD), part of NCBI's Entrez database system, is a protein annotation resource that consists of a collection of wellannotated multiple sequence alignment models as position-specific score matrices (PSSMs) [19]. Two motifs were found by searching the RefSeq sequence against CDD (version 2.23, containing 37407 PSSMs) ( Figure  5A)  Figure 5B). UDPgalactose 4-epimerase is known as a member of the short chain dehydrogenase/reductase superfamily. These enzymes contain a conserved Tyr-X-X-X-Lys motif necessary for catalytic activity. The characteristic YXXXK motif of human epimerase was located at Tyr-157-Gly-Lys-Ser-Lys-161 [20]. The YXXXK signature sequence, Tyr-156-Gly-Asn-Thr-Lys-160 (YGNTK), was also found in the predicted variant sequence of A. flavus

Discussion
A new A. flavus alternative splicing database was rebuilt referencing the latest genome annotation. By incorporating new qualified 454 sequence reads, more splice variants were predicted from more genes compared to the previous database. Though several previously discovered peptides had been included in the updated proteome, newly predicted variants were identified from the improved database using the same set of spectra. According to the Mascot results, 29 additional proteins from 26 genes were found in the previous study [16] while the 21 putative isoforms encoded by 21 genes were reported in this study. The results suggested that the increase of transcript sequences was able to predict eligible splice variants though the genome had been updated recently. Different groups have conducted comparative evaluations of MS/MS search algorithms [9,21]. The variation in scoring functions and statistical significance techniques in database-searching algorithms give different identification results. The overlaps of the search results from multiple algorithms can shift significantly as search parameters are modified [22]. However, those studies were performed using general protein databases without emphasizing alternatively spliced isoforms. In this study, Mascot, OMSSA, X! Tandem, and InsPecT were compared using an alternative splicing database. In spite of the agreement on 42% of peptide and 51% of RefSeq protein identifications, our results showed that 15% of the putative splice isoforms were detected by all algorithms ( Table 2). The fact that less than 2% of spectra were assigned to multiple peptide sequences did not explain all the variation in isoform identifications (Table  3).
To be cautious, we chose the RefSeq protein to represent a protein group when there was no decisive peptide belonging to the putative isoforms. This allowed different algorithms to assign various peptide groups to the same RefSeq protein, thus might indirectly increase the RefSeq protein identifications. On the other hand, the inference of isoform detection mainly relied on identifying the unique peptides which exclusively belonged to variant sequences ( Figure 6). As a result, the difference in peptide identifications might lead to a greater variation in isoform identifications. The variation between the splice variants identified by different algorithms implied that many unique peptides concluded by one algorithm were not necessarily recognized by another. Especially when the existence of putative isoforms was suggested by one or two isoform-specific peptides, an incorrect identification or missed detection of the specific peptides can change the conclusion immensely. Combination of multiple MS/MS search methods was used to distinguish the correct peptide identifications from the incorrect [23] and improve peptide identification rates [22]. We took advantage of consensus peptides assigned by at least three algorithms to generate 12 top candidate isoforms from the search results having estimated FDRs < 2% (Table 4). A recent study showed that the error rate of peptide hits was effectively reduced to 0.5% when a minimum of three engines were used [24]. The multiple search engine approach for peptide assignment not only takes advantage of differences in scoring functions to expand the target space for searching, but also bolsters the confidence of the peptide identifications [24].

Conclusions
The prediction of the alternatively spliced variants based on EST sequences by a computational pipeline inclines to be over-estimated and may contain errors. The introduction of putative isoforms into the protein database can further lower the p-value of peptide identifications because of the increasing size of the database. Consensus decision making exploits the goodness of multiple search algorithms to validate the assignment results of spectral data at a relatively low cost. The approach is particularly valuable while making inferences in isoform identifications from an alternative splicing database.

RefSeq Proteins
The A. flavus NRRL3357 whole genome shotgun project [Refseq: NZ_AAIH00000000] released an updated version on Aug 12, 2009. The second version of the project contains 13487 genes and coding proteins, and no splice isoforms were included in the genome annotation. The nucleotide records and protein sequences of A. flavus NRRL3357 were downloaded from RefSeq release 40 (March 7, 2010) using Taxonomy ID equal to 332952. Other supplementary information including coding exons was collected from Entrez Genome and Entrez Gene databases.

Alternative Splicing Database
The alternative splicing database of A. flavus in this study was constructed using the most recent official version of the genome described above. Serving as transcription evidence, 21130 EST sequences and 559014 454 sequences were used to predict putative slicing variants. 20371 ESTs were downloaded from the EST The EST and 454 sequences were first mapped to the annotated gene sequences using BLAST [25] (version 2.2.22). To ensure the quality of the predicted splicing variants sequences, only those EST/454 transcripts which satisfied the threshold (E-value < 0.001) were aligned against the corresponding genes by sim4 [26]. The alignments were allowed to search 3000 bases upstream and downstream to capture any potential missing exons. The distance of 3 kb was decided as two times the length of the largest intron found in the current genome annotation. For each gene, all splice sites of exons reported by sim4 alignments were integrated into a data structure called a splicing graph [27]. In the resulting directed graph, edges represented putative exons, vertices stood for splice sites, and paths denoted transcripts. If more than one exon (edge) pointed to the same 5' splice site (vertex) or the same 3' splice site (vertex) followed by multiple possible exons (edges), alternative splicing events were indicated. The putative splicing variants from the same gene were generated by visiting all possible paths. The corresponding protein sequences were translated from the predicted transcripts with a minimum length requirement of eighteen amino acids. Finally, any predicted protein whose sequence was either a subsequence or an identical duplicate of one entry in the RefSeq database was removed before conducting the database searches.

Experimental Spectra
The MS/MS spectra used in this study were generated in a previous experiment [28]. In brief, 12 C 6 -Arg and 13 C 6 -Arg labeled cultures of A. flavus were grown for 24 h at 28°C or 37°C. Extracted protein samples were separated on 12.5% SDS-PAGE gel. Forty bands from each lane were excised then they were reduced, alkylated, and digested by trypsin for 18 h at 37°C. Each of the 40 ingel digested samples was analyzed by nanoflow LC-MS/ MS on a LTQ-FT (ThermoFisher Scientific). The bottom-up SILAC A. flavus data associated with this manuscript may be downloaded from the Proteome Commons Database [29] Tranche network using the following hash: O9h2YUGGpAOG+ex5+rYTySoRxqvy-PayGlWPspibKkA13BXCVcpVMp3oCmH4HwZOof p5azAQcx4coCH6I82DCx5vQjwwAAAAAAAAn5g==. To ensure that all four search algorithms started with the same set of peak lists, the experimental spectra in RAW file format were first converted to the files in Mascot Generic Format (MGF) by Mascot Distiller (Matrix Science Ltd.) using the same processing options. A total of 311105 spectra from 77 MGF files were used in this study. The database searches were performed with the same parameters for all four search algorithms. The settings specified trypsin as the protease, a maximum of two missed cleavage sites, precursor charge up to 3 + , 5 ppm precursor ion tolerance (0.01 Da for OMSSA), and 1 Dalton product ion tolerance. The searches also accounted for carbamidomethyl modification on Cysteine (C) as a fixed parameter, and variable modifications included oxidation on Methionine (M) and deamidation on Asparagine (N) or Glutamine (Q). This study focused on detecting splice isoforms instead of exploring the protein profiles at different temperatures. Although the input spectra were derived from a previous SILAC experiment, the data were only searched for light peptides without the 13 C 6 -Arg label. It is noted that the setting of the refinement node for X! Tandem is ON as default.

False Discovery Rate
The FDR for each search result was estimated through searching the decoy (reverse) database and then counting the number of peptide-spectrum matches identified Figure 6 Detection of isoform-specific peptides plays a critical role in identifying alternatively spliced isoforms. The finding of two putative isoforms of A. flavus UDP-galactose 4 epimerase is illustrated as an example. Isoform A (RefSeq protein) and Isoform B (putative splice isoform) were identified by shared (green) and unique (yellow/blue) peptides. The identifications of both Isoform A and B are needed to declare the occurrence of an alternative splicing event. The detection of the splice isoform-specific (blue) peptide is decisive for the identification of Isoform B. Since only one isoform-specific peptide was found in this example, a false positive or missed identification of the peptide could alter the result of the isoform detection.
from the target database (N t ) and decoy database (N d ). The target-decoy database search can be conducted in two ways: a single search against a concatenated target/ decoy database; or two independent searches against the target and decoy databases, respectively. The separate search provided a conservative estimate [30]. FDRs of the peptides identified by Mascot, OMSSA, and X! Tandem were estimated using the separate search strategy and calculated as N d /N t [31]. However, the separate search approach was not feasible for the InsPecT results. The InsPecT tutorial describes that most results are not statistically significant and post-processing is essential. It is necessary to run the PValue.py script, included in the InsPecT distribution, to weed out insignificant results. The script uses a concatenated target/decoy database to calibrate the p-value by fitting the score distribution with a mixture model. Hence, FDR of the peptides identified by InsPecT was estimated using the concatenated database strategy instead, computed as 2 * N d /(N t + N d ) [32].