MUMAL: Multivariate analysis in shotgun proteomics using machine learning techniques
© Cerqueira et al.; licensee BioMed Central Ltd. 2012
Published: 19 October 2012
The shotgun strategy (liquid chromatography coupled with tandem mass spectrometry) is widely applied for identification of proteins in complex mixtures. This method gives rise to thousands of spectra in a single run, which are interpreted by computational tools. Such tools normally use a protein database from which peptide sequences are extracted for matching with experimentally derived mass spectral data. After the database search, the correctness of the obtained peptide-spectrum matches (PSMs) must also be evaluated algorithmically, as manual curation of these huge datasets would be impractical. The target-decoy database strategy is widely used to perform spectrum evaluation. Nonetheless, this method has been applied without considering sensitivity, i.e., only error estimation is taken into account. A recently proposed method termed MUDE treats the target-decoy analysis as an optimization problem, where sensitivity is maximized. This method demonstrates a significant increase in the number of PSMs retrieved for a fixed error rate. However, the MUDE model is constructed in such a way that linear decision boundaries are established to separate correct from incorrect PSMs. In addition, the described heuristic for solving the optimization problem has to be executed many times to achieve a significant increase in sensitivity.
Here, we propose a new method, termed MUMAL, for PSM assessment that is based on machine learning techniques. Our method can establish nonlinear decision boundaries, leading to a higher chance to retrieve more true positives. Furthermore, we need few iterations to achieve high sensitivities, strikingly shortening the running time of the whole process. Experiments show that our method achieves a considerably higher number of PSMs compared with standard tools such as MUDE, PeptideProphet, and typical target-decoy approaches.
Our approach not only enhances computational performance, and thus the turnaround time of MS-based experiments in proteomics, but also improves the information content through higher proteome coverage. This improvement, for instance, increases the chance of identifying important drug targets or biomarkers for drug development or molecular diagnostics.
Proteomic studies cover the identification of entire proteomes, the detection of post-translational modifications (PTMs), protein quantitation, and the determination of protein interactions. The shotgun strategy by means of liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has been considered the method of choice when the analysis involves complex mixtures [1–3]. On the other hand, a single MS/MS experiment typically generates thousands of spectra, of which usually less than 20% are correctly interpreted, clearly stressing the need for computational solutions for assessing each peptide-spectrum match (PSM) [4, 5]. Database (DB) search algorithms are by far the most used approach to MS/MS spectrum interpretation, and Mascot and Sequest are currently the best-known standard methods for DB search. As a result, the main computational tools for PSM evaluation were built to analyze the results of DB search algorithms. In the context of peptide/protein identification, which is our focus here, there are currently two widely used techniques for assessing PSMs produced by DB search methods: the construction of mixture models, as implemented in the PeptideProphet approach, and the target-decoy search strategy [9–13].
In the PeptideProphet approach, standard statistical distributions are fitted to the observed positive and negative score distributions. In the case of Sequest, for instance, the parameters of Gaussian and gamma distributions are estimated to model the underlying score distributions of correct and incorrect hits, respectively. Hence, the probability that a PSM with a certain score is correct is computed using the corresponding density functions along with prior probabilities. As long as the assumed distributions fit the data appropriately, the probabilities are very accurate and can be used in protein inference as well. On the other hand, certain datasets might present completely different score distributions. When dealing with phosphoproteins, for instance, scores are normally lower than usual because the fragmentation of precursor ions in mass spectrometry via low-energy dissociation tends to be biased towards phosphate groups, leading to the suppression of important fragment ions [4, 11, 14].
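The probability computation described above can be illustrated with a minimal sketch of Bayes' rule over two class-conditional densities. This is not PeptideProphet's implementation: the function names and all parameter values below are hypothetical and were not fitted to any data.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Normal density, used here for the scores of correct hits."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def gamma_pdf(x, shape, scale):
    """Gamma density, used here for the scores of incorrect hits."""
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)) / (math.gamma(shape) * scale ** shape)

def posterior_correct(score, prior_correct, mean, sd, shape, scale):
    """P(correct | score) from the two densities and the prior probability."""
    pos = prior_correct * gaussian_pdf(score, mean, sd)
    neg = (1.0 - prior_correct) * gamma_pdf(score, shape, scale)
    total = pos + neg
    return pos / total if total > 0 else 0.0

# Hypothetical mixture parameters (assumptions for illustration only).
params = dict(prior_correct=0.2, mean=4.0, sd=1.0, shape=2.0, scale=0.8)
low = posterior_correct(1.0, **params)   # low score: most likely incorrect
high = posterior_correct(5.0, **params)  # high score: most likely correct
print(round(low, 4), round(high, 4))
```

If the real score distributions deviate from the assumed Gaussian and gamma shapes, as the text notes for phosphodata, the posterior computed this way is no longer reliable.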
In contrast, the target-decoy search strategy works without any a priori assumption about the data, making it a good and general method for identification assessment in MS-based proteomics. In this strategy, besides using the target proteins in the search, a database composed of decoy (false) sequences is also included in the assignment procedure. A common approach is to generate decoy sequences by reversing the target ones, and both sets of sequences are then used as a composite target-decoy DB for the search. The false sequences have to be produced in such a way that it is reasonable to assume that a wrong PSM is equally likely to come from either kind of protein sequence (target or decoy). In this case, the number of decoy PSMs is an excellent estimate of the number of wrong hits among target PSMs. A desired false discovery rate (FDR) can be achieved by varying the score threshold and counting decoy results until a suitable cutoff value is reached. Although it provides a very good way to select a set of PSMs with an accurate estimate of its FDR, the target-decoy search strategy, as originally conceived, does not consider sensitivity, i.e., no computational strategy or performance metric is applied to find alternative sets of PSMs with the same FDR but a higher number of hits [5, 10, 11, 13].
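The threshold selection described above can be sketched as follows. The sketch assumes the FDR is estimated as the ratio of accepted decoy PSMs to accepted target PSMs, which is one common convention and is not stated explicitly in the text; the function name and the toy scores are hypothetical.

```python
def threshold_for_fdr(psms, max_fdr):
    """psms: list of (score, is_decoy) pairs.

    Returns (cutoff, n_targets, n_decoys) for the lowest score cutoff whose
    accepted set (score >= cutoff) has decoys / targets <= max_fdr, or None
    if no cutoff qualifies.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    best = None
    targets = decoys = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # PSMs with identical scores are accepted or rejected together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                decoys += 1
            else:
                targets += 1
            i += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best = (score, targets, decoys)
    return best

# Toy data: True marks a decoy hit.
toy = [(5.0, False), (4.5, False), (4.0, False), (3.5, True), (3.0, False),
       (2.5, False), (2.0, True), (1.5, True), (1.0, False)]
print(threshold_for_fdr(toy, 0.25))  # (2.5, 5, 1)
print(threshold_for_fdr(toy, 0.0))   # (4.0, 3, 0)
```

Note that this procedure only finds a cutoff on a single score; it does not look for alternative PSM sets with the same FDR and more hits, which is the limitation the paragraph points out.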
Cerqueira et al. proposed a new strategy called MUDE (MUltivariate DEcoy database analysis) to extend the target-decoy method. Using Sequest for their experiments, the authors show that a much higher sensitivity can be achieved. The enhancements are two-fold. First, the authors consider many more quality parameters than usual (traditionally uni- or bivariate analysis), namely, Xcorr, ΔCn, ΔM, SpRank, PercIons, and RT (retention time) p-value. Second, in the MUDE approach, the problem of finding threshold values leading to the desired FDR is treated as an optimization problem, in contrast with the simplistic procedures usually employed to explore possible values. As a consequence, a much higher discriminatory power is achieved compared with the traditional target-decoy search strategy and with PeptideProphet, resulting also in a significantly higher sensitivity for the same FDRs. Note, however, that the MUDE approach provides linear decision boundaries to separate false from true positives. Furthermore, according to the authors, the heuristic used to solve the proposed optimization problem has to be executed several times in order to visit many local optima, and the final result is a merge of the several outputs obtained. To achieve the reported results, the authors performed 45 runs of the proposed procedure. Each run takes on average 10 s, giving a total running time of approximately 7.5 minutes (45 × 10 s = 450 s). Considering that manual curation may take days or weeks, this is quite a good performance. On the other hand, it clearly leaves room for improvement.
We present here MUMAL, a computational tool to perform multivariate analysis for the target-decoy search strategy using powerful machine learning techniques. This is an improvement on the MUDE method, where the optimization procedure is replaced by the application of neural networks (NNs) to find better decision boundaries, even in non-linearly separable data, and the resulting ROC (receiver operating characteristic) curve is analyzed to further improve sensitivity. Experiments were performed on the same Sequest-generated data that was used to evaluate the MUDE approach. This data comprises six datasets derived mostly from phosphoproteins and five datasets from non-phosphorylated proteins. Given a certain dataset, we start by training a neural network to separate decoy from non-decoy PSMs. The features used for training are the six scores proposed in the MUDE procedure. In a second stage, the resulting ROC curve of the NN model is analyzed to determine the probability threshold leading to the highest sensitivity for the chosen FDR. The user can run the same procedure many times, using different parameter settings, and merge the best answers (highest sensitivities) of each run into a single output, similarly to the MUDE pipeline. The difference is that, with considerably fewer iterations, we could achieve significantly better sensitivities than MUDE. In our experiments, we chose FDRs varying from 0 to 0.05, so that we could compare the number of PSMs that our method and the MUDE approach could retrieve for the same error rates. The results were quite encouraging. For non-phosphodata, the sensitivities were ca. 26% higher, while phosphodata presented an average improvement of 24%. Furthermore, the running time of our procedure was strikingly shorter. A NN model takes approximately the same time to build as a MUDE run. Notice, however, that only a few NN runs are necessary to achieve much better sensitivities.
In our experiments, we performed six NN rounds for each dataset, in contrast with the 45 runs of the MUDE approach. In summary, the proposed strategy is able to enhance sensitivity while running about 7.5 times faster than MUDE (6 versus 45 runs of similar duration).
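The second stage and the merging step can be sketched as follows, under several assumptions: the classifier probabilities are taken as given (no neural network is trained here), the FDR is estimated as accepted decoys divided by accepted targets, and merging is a plain union of the target PSMs accepted in each run. The function names and PSM identifiers are hypothetical, and the actual MUMAL merge may differ.

```python
def accept_at_fdr(scored, max_fdr):
    """scored: dict psm_id -> (probability, is_decoy).

    Returns the target PSM ids of the largest probability-ranked prefix
    whose decoy/target ratio does not exceed max_fdr.
    """
    ranked = sorted(scored.items(), key=lambda kv: kv[1][0], reverse=True)
    accepted, best = [], set()
    decoys = 0
    for idx, (psm_id, (prob, is_decoy)) in enumerate(ranked):
        if is_decoy:
            decoys += 1
        else:
            accepted.append(psm_id)
        # Evaluate a cutoff only once all PSMs with this probability are counted.
        last_of_tie = idx + 1 == len(ranked) or ranked[idx + 1][1][0] != prob
        if last_of_tie and accepted and decoys / len(accepted) <= max_fdr:
            best = set(accepted)
    return best

def merge_runs(runs, max_fdr):
    """Union of the target PSMs accepted in each run (assumed merge rule)."""
    merged = set()
    for scored in runs:
        merged |= accept_at_fdr(scored, max_fdr)
    return merged

# Two hypothetical runs with different parameter settings on the same PSMs.
run_a = {"p1": (0.95, False), "p2": (0.90, False), "d1": (0.70, True),
         "p3": (0.65, False), "p4": (0.40, False)}
run_b = {"p3": (0.92, False), "p1": (0.88, False), "d1": (0.50, True),
         "p2": (0.45, False), "p4": (0.30, False)}
print(sorted(merge_runs([run_a, run_b], 0.0)))  # ['p1', 'p2', 'p3']
```

In this toy case each run alone accepts two PSMs at zero estimated FDR, while the merged output contains three, which is the kind of gain that motivates combining runs.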
In this work, we used the same data described in the MUDE publication, generated with an LC-MS/MS approach (high performance liquid chromatography coupled with an LTQ FT mass spectrometer (Thermo Electron, Bremen)). For more information on sample preparation, see Cerqueira et al. and Morandell et al. Three datasets were produced from three independent phospho-enriched samples. MS/MS spectrum files were converted to dta files, the text-file format of SEQUEST for MS/MS spectra, resulting in 24405 (S1), 23668 (S2), and 18996 (S3) spectra, respectively. Next, SEQUEST (Bioworks v3.3, Thermo Electron) was run on this data to assign peptide sequences to each spectrum. Each dataset (with its respective SEQUEST output) was divided into two parts, one containing spectra whose top result was reported as a phosphopeptide, and the other composed of spectra whose best assignment indicated a non-phosphopeptide. Each part was further split based on the precursor charge state. Only charges +2 and +3 were considered. As a result, the three initial datasets generated twelve sets. These separations are necessary because score distributions may vary significantly between datasets of phosphorylated and non-phosphorylated proteins. Important differences in scores are also noted between datasets with distinct precursor charge states [8, 16]. The twelve datasets were labeled as S1_PH_CH2, S1_PH_CH3, S1_NPH_CH2, S1_NPH_CH3, S2_PH_CH2, S2_PH_CH3, S2_NPH_CH2, S2_NPH_CH3, S3_PH_CH2, S3_PH_CH3, S3_NPH_CH2, and S3_NPH_CH3, where "PH" and "NPH" denote phosphodata and non-phosphodata, respectively, while "CH2" and "CH3" represent +2 and +3 charge states, respectively. The dataset S3_NPH_CH3 was removed from our experiments as it was shown to contain fewer than 10 correct assignments. This was verified by a decoy DB analysis and with the Trans-Proteomic Pipeline v4.2 (a tool suite containing PeptideProphet).
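As a quick check of the partitioning scheme, the dataset labels can be enumerated programmatically. The variable names below are illustrative only.

```python
from itertools import product

samples = ["S1", "S2", "S3"]   # three phospho-enriched samples
phospho = ["PH", "NPH"]        # phosphodata / non-phosphodata
charges = ["CH2", "CH3"]       # precursor charge +2 / +3

all_sets = [f"{s}_{p}_{c}" for s, p, c in product(samples, phospho, charges)]
# S3_NPH_CH3 was excluded because it held fewer than 10 correct assignments.
used_sets = [name for name in all_sets if name != "S3_NPH_CH3"]
phospho_sets = [name for name in used_sets if "_PH_" in name]
non_phospho_sets = [name for name in used_sets if "_NPH_" in name]

print(len(all_sets), len(used_sets), len(phospho_sets), len(non_phospho_sets))
# 12 11 6 5
```

The counts agree with the eleven datasets used in the experiments: six phosphodata sets and five non-phosphodata sets.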
Finally, in order to use retention time as a discriminatory feature in our method for identification assessment, the out files (containing assignments produced by SEQUEST) of each set were converted to a single IdXML (v1.1) file. This is the format used by the algorithm (OpenMS v1.4) for retention time prediction described by Pfeifer et al.
Following the recommendation of Elias et al., all searches used a database constructed as a composition of target protein sequences appended to their reverse (decoy sequences). Target proteins were obtained from the mouse IPI database (v3.18). The search parameters were the same for all runs. Enzyme: trypsin; missed cleavages: up to 2; fixed modifications: carbamidomethyl (C), methyl (C-term), methyl (DE); variable modifications: oxidation (M), phosphorylation (ST), phosphorylation (Y); protein mass: unrestricted; mass values: monoisotopic; peptide mass tolerance: ±10 ppm; fragment mass tolerance: ±0.6 Da.
In shotgun proteomics, a natural need has arisen to evaluate the resulting PSMs automatically, given the huge number typically produced in a single run. One of the most widely applied procedures to evaluate PSMs generated by DB search methods is the target-decoy DB search strategy. In this method, false (decoy) protein sequences are generated that maintain the amino acid distribution of real (target) protein sequences. The search is then performed either once, using a composite DB containing target sequences appended to decoy sequences, or twice, using the same parameters and one sequence DB at a time. The most common ways to generate decoy sequences are reversing the target ones, shuffling them, or using some randomization process [22, 23]. The construction of a decoy DB as proposed in the literature allows the assumption that a wrong hit (of SEQUEST or any other DB search algorithm) is equally likely to come from a decoy sequence or from a target one. This means that the number of hits coming from decoy sequences can be taken as a very good estimate of the number of wrong PSMs coming from target sequences. The main advantage of this method is that there is no a priori assumption on the data distribution, which has made this strategy very popular in proteomics. In particular, the target-decoy DB search strategy is frequently used in phosphoproteomics research, since the scores of phosphodata have a very peculiar distribution [10–12].
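The two simplest decoy constructions mentioned above, reversal and shuffling, can be sketched as follows. The protein names, the sequences, and the "DECOY_" prefix are made up for illustration; real pipelines read FASTA files and may treat cleavage sites specially.

```python
import random

def reverse_decoy(sequence):
    """Decoy by reversal: same amino acid composition, reversed order."""
    return sequence[::-1]

def shuffle_decoy(sequence, rng):
    """Decoy by shuffling: same amino acid composition, random order."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

def composite_db(targets, prefix="DECOY_"):
    """Target entries followed by their reversed decoys, in one database."""
    db = dict(targets)
    for name, sequence in targets.items():
        db[prefix + name] = reverse_decoy(sequence)
    return db

# Hypothetical toy sequences, not taken from any real database.
targets = {"P1": "MKWVTFISLL", "P2": "GSHMPEPTIDEK"}
db = composite_db(targets)
shuffled = shuffle_decoy(targets["P2"], random.Random(0))
print(list(db))        # ['P1', 'P2', 'DECOY_P1', 'DECOY_P2']
print(db["DECOY_P1"])  # LLSIFTVWKM
```

Both constructions preserve the amino acid composition of each protein, which is what makes it plausible that a wrong match lands on a target or a decoy with equal probability.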
As already mentioned, decoy DB methods have been widely applied to find score thresholds leading to a desired FDR, particularly in the case of phosphodata with typically odd score distributions. However, to the best of our knowledge, this method has been used without any attempt to maximize sensitivity, where sensitivity here means the proportion of true identifications captured by the chosen thresholds. Either only one quality parameter is varied or, even when more scores (normally two) are explored, once thresholds that produce the desired FDR have been determined, no other score combination that might provide a higher number of identifications is investigated and verified. Therefore, the inclusion of other parameters in the analysis, as well as a more systematic and elegant way to explore them, is a clear direction for improvement.
In MUDE, four other important parameters are included: ΔM, SpRank, percentage of ions found, and RT deviation (the difference between observed and predicted RT), i.e., six features are considered for the assessment procedure instead of the one or two used in previous works. Additionally, MUDE presents an optimization procedure, termed ε-masp, to maximize sensitivity for a fixed error ε. Although it demonstrates a significant increase in sensitivity, this method has two characteristics that could be further improved. First, the optimization method produces only linear decision boundaries. However, we show in Figure 3b that a non-linear decision boundary (green curve) could provide an even higher sensitivity for the same FDR. Second, MUDE's optimization procedure has to be repeated several times in a typical run to ensure a high sensitivity. Notice that non-linear learning algorithms can establish more appropriate decision boundaries, leading to high sensitivity, in a single run.
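To make the idea of maximizing sensitivity for a fixed error concrete, the following sketch exhaustively searches pairs of thresholds on two scores and keeps the pair that accepts the most target PSMs while the decoy/target ratio stays within the allowed error. This brute-force search is only an illustration and is not the ε-masp heuristic; the function name and the toy scores are hypothetical.

```python
def best_threshold_pair(psms, max_fdr):
    """psms: list of (xcorr, delta_cn, is_decoy) triples.

    Returns (n_targets, xcorr_min, delta_cn_min) for the threshold pair that
    accepts the most targets with decoys <= max_fdr * targets.
    """
    xcorr_values = sorted({p[0] for p in psms})
    delta_values = sorted({p[1] for p in psms})
    best = (0, None, None)
    for x_min in xcorr_values:
        for d_min in delta_values:
            kept = [p for p in psms if p[0] >= x_min and p[1] >= d_min]
            targets = sum(1 for p in kept if not p[2])
            decoys = len(kept) - targets
            if targets > best[0] and decoys <= max_fdr * targets:
                best = (targets, x_min, d_min)
    return best

# Toy PSMs: (Xcorr, delta Cn, is_decoy). Decoys have low delta Cn here.
psms = [(3.5, 0.30, False), (3.0, 0.25, False), (2.8, 0.05, True),
        (2.5, 0.20, False), (2.2, 0.22, False), (2.0, 0.02, True),
        (1.8, 0.15, False), (1.5, 0.01, True)]

bivariate = best_threshold_pair(psms, 0.0)
# Same search with the second score neutralized, i.e. Xcorr alone.
univariate = best_threshold_pair([(x, 0.0, d) for x, _, d in psms], 0.0)
print(bivariate, univariate)  # (5, 1.5, 0.15) (2, 3.0, 0.0)
```

In this toy case, a threshold on Xcorr alone retrieves two targets at zero estimated FDR, whereas the best threshold pair retrieves five. The accepted region is still a rectangle bounded by per-score thresholds, which is the kind of restriction a non-linear classifier can relax.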
Therefore, instead of pursuing a set of thresholds for PSM scores, as in former procedures, our approach now seeks a more complex function that combines such scores and represents a more accurate decision boundary. This is exactly what support vector machines (SVMs) and neural networks can provide.
Before further developing our procedure for PSM assessment, we performed a comparison between the SVM approach and NNs to decide which method should be chosen as the main learning algorithm in the MUMAL pipeline. We used the eleven datasets mentioned in Section "MS/MS data" to analyze which approach could provide a higher sensitivity for a 1% FDR. According to Elias et al. and Balgley et al. [24, 25], this FDR represents the best trade-off between sensitivity and precision when assessing PSMs. See Section "Varying the discriminant probability to achieve a desired FDR" for details on how to calibrate a learning algorithm, using the ROC curve and decoy hits counting, to obtain a decision boundary that provides the pursued FDR.
The comparisons were made using the Weka (v3.7.0) application programming interface (API), which provides two different implementations of the SVM approach, SMO and LibSVM, as well as an implementation of a multilayer neural network with backpropagation. For NN runs, default parameter values were used. In the case of LibSVM and SMO, the only change in parameters was probability estimate = true, to allow probability calculation instead of a dichotomous classification of type "yes" or "no". For more details on the parameters of these methods, see Tan et al. as well as Platt and Fan et al.
Comparison between NN and SVM (LibSVM and SMO)
Given the results of this first experiment, we proceeded with the development of the proposed method using neural networks as the learning algorithm of our pipeline.
The study of artificial neural networks is an effort to mimic biological neural systems with the objective of creating a powerful learning technique [29–32]. Similarly to the human brain, a NN is comprised of a set of nodes interconnected by directed links. The first proposed model was called the perceptron. Only two kinds of nodes (neurons) are present in this simple architecture: input nodes and one output node. Nodes of the first type represent features, while the node of the second kind represents the model output. Each input node is connected to the output node by a weighted link. The weights represent the strength of synaptic connections between neurons. Note that the human learning process consists exactly of changing the strength of such connections due to some repeated stimulus. In a perceptron, the output node computes the weighted sum of the inputs, subtracts a bias term from the result, and applies what is called an activation function (in this case, the signum function) to produce the final output (+1 if the value is positive, -1 if it is negative). Hence, training a perceptron consists of adapting the weights until an acceptable relation between input and output is obtained, according to what is observed in the training data.
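The perceptron described above can be sketched in a few lines. The update rule used here is the classic perceptron learning rule, and the choice to map a value of exactly zero to -1 is an assumption, since the text only specifies the positive and negative cases. The training data (a logical AND) is a toy example unrelated to PSM scores.

```python
def signum(value):
    """+1 for positive values, -1 otherwise (zero is mapped to -1 by assumption)."""
    return 1 if value > 0 else -1

def predict(weights, bias, features):
    """Weighted sum of the inputs minus the bias, passed through signum."""
    total = sum(w * x for w, x in zip(weights, features))
    return signum(total - bias)

def train_perceptron(samples, labels, learning_rate=1.0, epochs=100):
    """Adapt weights and bias until all samples are classified correctly
    or the epoch limit is reached."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for features, label in zip(samples, labels):
            if predict(weights, bias, features) != label:
                weights = [w + learning_rate * label * x
                           for w, x in zip(weights, features)]
                bias -= learning_rate * label  # bias is subtracted in predict
                errors += 1
        if errors == 0:
            break
    return weights, bias

# Linearly separable toy problem: output +1 only when both inputs are 1.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, -1, -1, 1]
weights, bias = train_perceptron(samples, labels)
print([predict(weights, bias, s) for s in samples])  # [-1, -1, -1, 1]
```

A single perceptron can only learn linearly separable problems, which is why multilayer networks with hidden nodes are needed for the non-linear boundaries discussed in this paper.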
Details of the parameters used in NN training (including the number of nodes in the hidden layer)
The learning procedure normally seeks to maximize the number of correctly classified instances, i.e., the accuracy. Our datasets are expected to lead to low-accuracy models, since our classes are decoy and normal hits. Notice that most normal hits (the wrong ones) will have characteristics similar to those of decoy hits (which are obviously wrong). This is because most interpretations of MS/MS spectra are wrong. Because of this property of the shotgun approach, our data can be thought of as very noisy, which makes model construction a challenging task. In fact, the average accuracy obtained for our eleven datasets was 60%, and the FDR for P > 0.5 was very high in all cases.
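The relation between a low classification accuracy and a high FDR at the default cutoff can be illustrated with a small sketch. It assumes that P is the predicted probability of a PSM being a normal (non-decoy) hit and that the FDR is estimated as accepted decoys divided by accepted normal hits; the function name and the probabilities are made up.

```python
def accuracy_and_fdr(predictions, cutoff=0.5):
    """predictions: list of (p_normal, is_decoy) pairs.

    Returns the decoy-versus-normal classification accuracy and the
    decoy-based FDR estimate among PSMs with p_normal > cutoff.
    """
    correct = targets = decoys = 0
    for p_normal, is_decoy in predictions:
        predicted_normal = p_normal > cutoff
        if predicted_normal != is_decoy:
            correct += 1  # normal predicted normal, or decoy predicted decoy
        if predicted_normal:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    accuracy = correct / len(predictions)
    fdr = decoys / targets if targets else 0.0
    return accuracy, fdr

# Five normal hits (two clearly good, three resembling decoys) and five decoys.
predictions = [(0.95, False), (0.90, False), (0.55, False), (0.45, False),
               (0.40, False), (0.60, True), (0.52, True), (0.48, True),
               (0.35, True), (0.30, True)]
print(accuracy_and_fdr(predictions))       # accuracy 0.6, FDR about 0.67
print(accuracy_and_fdr(predictions, 0.8))  # accuracy 0.7, FDR 0.0
```

Raising the cutoff well above 0.5 is what brings the estimated FDR down to a useful level, which is why the probability threshold is calibrated against decoy counts rather than left at its default.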
As the model construction is performed to maximize accuracy, we expect maximization of sensitivities as well. Notice that the MUDE approach also tries to maximize sensitivity. The difference in our case is that the models obtained here can construct non-linear decision boundaries, denoting the possibility of even higher sensitivities, as stated previously in the text.
In this paper, we propose a multivariate decoy DB analysis using neural networks and ROC analysis to produce more flexible decision boundaries. As in the MUDE procedure, we also take advantage of many important scores, in contrast to the bivariate decoy analysis (termed here BIDE) of previous works. In addition, MUMAL achieves higher sensitivity and much shorter running times than MUDE, as can be seen in our experiments below. Notice that PSMs are used to build a NN model, which, in turn, is applied to the same data, because our goal is not to apply the obtained model to future unseen instances but to separate correct from incorrect hits. Hence, it makes no sense here to apply traditional statistical methods for evaluating learned models, such as cross-validation. The main measure to evaluate our models is the number of true positives that can be achieved for a certain maximum FDR.
Our comparisons were performed at the peptide level. As previously demonstrated, improvements at the peptide level also lead to improvements at the protein level, possibly leading to a higher proteome coverage (i.e., identification of more proteins). This is to be expected, as proteins are inferred from peptide identifications. Thus, we limit our analysis to the peptide level, i.e., the number of correct PSMs our method could separate for a predefined maximum FDR. The experiments below demonstrate the superior performance of MUMAL over the main tools currently used for PSM validation: MUDE, PeptideProphet, and BIDE (using ΔCn and Xcorr or ΔM and Xcorr). See the work of Cerqueira et al. for details on how these previous methods were applied to generate the curves shown next.
It has been largely demonstrated that the target-decoy search strategy is a powerful tool for evaluating PSMs of MS/MS runs. Nonetheless, the potential of this method has not been fully explored as sensitivity maximization is not taken into account in typical experiments. The MUDE approach treats the decoy analysis as an optimization problem, enabling a significant improvement in sensitivity. In this work, we present MUMAL, a PSM evaluation pipeline that uses machine learning methods, namely neural networks and ROC curve analysis, to promote an even higher increase of sensitivity, i.e., the retrieval of as many PSMs as possible for a fixed error rate. Experiments demonstrate that our approach can establish better decision boundaries, embracing a higher number of true positives than MUDE and other standard methods.
The next step is to perform new experiments with alternative machine learning algorithms and, if they show promising results, to optimize their models to reach higher sensitivities. Another future effort will focus on extending the method to cope also with MASCOT results.
With the new proposed strategy, experiments on MS-based proteomics will gain in performance with respect to both time and proteome coverage, so that a better understanding of cellular activities can be achieved, advancing ultimately the utility of proteomics in the process of discovery and development of new drugs.
The software is open-source and is available under the URL: http://sourceforge.net/projects/mumal/
This work is supported by FAPEMIG, CNPq, and CAPES.
This article has been published as part of BMC Genomics Volume 13 Supplement 5, 2012: Proceedings of the International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S5.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.