Proteomic studies cover the identification of entire proteomes, the detection of post-translational modifications (PTMs), protein quantitation, and the determination of protein interactions. The shotgun strategy, based on liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS), is considered the method of choice when the analysis involves complex mixtures [1–3]. However, a single MS/MS experiment typically generates thousands of spectra, of which usually fewer than 20% are correctly interpreted, which underscores the need for computational solutions that assess each peptide-spectrum match (PSM) [4, 5]. Database (DB) search algorithms are by far the most widely used approach to MS/MS spectrum interpretation, and Mascot [6] and Sequest [7] are currently the best-known standard methods for DB search. As a result, the main computational tools for PSM evaluation were built to analyze the results of DB search algorithms. In the context of peptide/protein identification, which is our focus here, two techniques are currently in wide use for assessing PSMs produced by DB search methods: the construction of mixture models, as implemented in PeptideProphet [8], and the target-decoy search strategy [9–13].

In the PeptideProphet approach, standard statistical distributions are fitted to the observed score distributions of positive and negative hits. In the case of Sequest, for instance, the parameters of a Gaussian and a gamma distribution are estimated to model the scores of correct and incorrect hits, respectively. The probability that a PSM with a given score is correct is then computed from the corresponding density functions together with prior probabilities. As long as the assumed distributions fit the data appropriately, the probabilities are very accurate and can be used in protein inference as well. However, certain datasets may present completely different score distributions. With phosphoproteins, for instance, scores are normally lower than usual because the fragmentation of precursor ions by low-energy dissociation tends to be biased towards phosphate groups, leading to the suppression of important fragment ions [4, 11, 14].
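The probability computation described above is an application of Bayes' rule to a two-component mixture. The following is a minimal sketch of that calculation, not PeptideProphet's actual implementation: the function names, the parameter names, and the numeric values in the usage note are illustrative assumptions, and in practice the distribution parameters and the prior would be estimated from the data.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution (assumed model for correct hits)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution (assumed model for incorrect hits)."""
    if x <= 0.0:
        return 0.0
    # Work in log space to avoid overflow for large shape values.
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


def posterior_correct(score, prior_correct, mu, sigma, shape, scale):
    """Bayes' rule: P(correct | score) for a two-component mixture."""
    pos = prior_correct * gaussian_pdf(score, mu, sigma)
    neg = (1.0 - prior_correct) * gamma_pdf(score, shape, scale)
    total = pos + neg
    return pos / total if total > 0.0 else 0.0
```

With hypothetical parameters such as `prior_correct=0.2, mu=3.5, sigma=0.8, shape=2.0, scale=0.6`, a high score yields a posterior close to 1 and a low score a posterior close to 0. If the true score distributions deviate from the assumed shapes, as discussed for phosphoproteins, these probabilities lose accuracy.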

In contrast, the target-decoy search strategy works without any a priori assumption about the data, making it a good and general method for identification assessment in MS-based proteomics. In this strategy, besides the target proteins, a database composed of decoy (false) sequences is also included in the assignment procedure. A common approach is to generate decoy sequences by reversing the target ones; both sets of sequences are then used as a composite target-decoy DB for the search. The decoy sequences have to be produced in such a way that it is reasonable to assume that a wrong PSM is equally likely to come from a target or a decoy sequence. In this case, the number of decoy PSMs is an excellent estimate of the number of wrong hits among target PSMs. A desired false discovery rate (FDR) can be achieved by varying the score threshold and counting decoy results until a suitable cutoff value is reached. Although it provides a very good way to select a set of PSMs with an accurate estimate of its FDR, the target-decoy search strategy, as originally conceived, does not consider sensitivity: no computational strategy or performance metric is applied to find alternative sets of PSMs with the same FDR but a higher number of hits [5, 10, 11, 13].
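The two ingredients of this strategy, decoy generation by reversal and decoy counting above a score threshold, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the `(score, is_decoy)` tuple layout are hypothetical, higher scores are assumed to be better, and the FDR is estimated as the ratio of decoy to target PSMs above the threshold, following the equal-probability assumption described above.

```python
def make_decoy(sequence):
    """Build a decoy protein sequence by reversing a target sequence."""
    return sequence[::-1]


def estimate_fdr(psms, threshold):
    """Count target and decoy PSMs scoring at or above `threshold`.

    `psms` is an iterable of (score, is_decoy) pairs (hypothetical layout).
    Returns (n_target, n_decoy, fdr), where fdr = n_decoy / n_target,
    i.e. the decoy count is used as the estimate of wrong target hits.
    """
    n_target = 0
    n_decoy = 0
    for score, is_decoy in psms:
        if score >= threshold:
            if is_decoy:
                n_decoy += 1
            else:
                n_target += 1
    fdr = n_decoy / n_target if n_target else 0.0
    return n_target, n_decoy, fdr
```

Raising `threshold` generally lowers both the estimated FDR and the number of accepted target PSMs, which is why a single-score cutoff can leave sensitivity unexploited.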

Cerqueira et al. [5] proposed a new strategy called MUDE (MUltivariate DEcoy database analysis) to extend the target-decoy method. Using Sequest for their experiments, the authors show that a much higher sensitivity can be achieved. The enhancements are twofold. First, the authors consider many more quality parameters than usual (traditional analyses are univariate or bivariate), namely Xcorr, ΔCn, ΔM, SpRank, PercIons, and the RT (retention time) p-value. Second, in the MUDE approach, finding the threshold values that lead to the desired FDR is treated as an optimization problem, in contrast with the simplistic procedures usually employed to explore possible values. As a consequence, a much higher discriminatory power is achieved compared with the traditional target-decoy search strategy and with PeptideProphet, which also results in a significantly higher sensitivity for the same FDRs. Note, however, that the MUDE approach provides linear decision boundaries to separate false from true positives. Furthermore, according to the authors, the heuristic used to solve the proposed optimization problem has to be executed several times in order to visit many local optima, and the final result is a merge of the outputs obtained. To achieve the results shown in [5], the authors performed 45 runs of the proposed procedure. Each run takes 10 s on average, giving a total running time of approximately 7.5 minutes. Considering that manual curation may take days or weeks, this is quite a good performance. Nevertheless, it also leaves clear room for improvement.

We present here MUMAL, a computational tool that performs multivariate analysis for the target-decoy search strategy using powerful machine learning techniques. It improves on the MUDE method: the optimization procedure is replaced by neural networks (NNs), which find better decision boundaries even in data that are not linearly separable, and the resulting ROC (receiver operating characteristic) curve is analyzed to further improve sensitivity. Experiments were performed on the same Sequest-generated data that were used to evaluate the MUDE approach. These data comprise six datasets derived mostly from phosphoproteins and five datasets from non-phosphorylated proteins. Given a dataset, we start by training a neural network to separate decoy from non-decoy PSMs. The features used for training are the six scores proposed in the MUDE procedure. In a second stage, the ROC curve of the NN model is analyzed to determine the probability threshold that leads to the highest sensitivity for the chosen FDR. The user can run the same procedure many times with different parameter settings and merge the best answers (highest sensitivities) of each run into a single output, similarly to the MUDE pipeline. The difference is that, with considerably fewer iterations, we achieved significantly better sensitivities than MUDE. In our experiments, we chose FDRs varying from 0 to 0.05, so that we could compare the number of PSMs that our method and the MUDE approach retrieve for the same error rates. The results were quite encouraging. For non-phosphodata, the sensitivities were ca. 26% higher, while phosphodata presented an average improvement of 24%. Furthermore, the running time of our procedure was strikingly shorter. An NN model takes approximately the same time to build as a single MUDE run. Note, however, that only a few NN runs are necessary to achieve much better sensitivities. In our experiments, we performed six NN rounds for each dataset, in contrast with the 45 runs of the MUDE approach. In summary, the proposed strategy enhances sensitivity with a running time about 7.5 times shorter than that of MUDE.
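The second stage described above, choosing the probability threshold that retains the most target PSMs for a chosen FDR, can be sketched as a sweep over the classifier outputs. This is a minimal illustration and not the MUMAL implementation: the function name and the `(probability, is_decoy)` layout are hypothetical, the probabilities are assumed to come from an already trained model, and the FDR is estimated as decoys divided by targets above the cutoff.

```python
def best_threshold(psms, max_fdr):
    """Find the probability cutoff that keeps the most target PSMs
    while the estimated FDR (decoys / targets) stays <= max_fdr.

    `psms` is a list of (probability, is_decoy) pairs (hypothetical layout).
    Returns (threshold, n_target), or (None, 0) if no cutoff qualifies.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    n_target = 0
    n_decoy = 0
    best = (None, 0)
    i = 0
    while i < len(ranked):
        prob = ranked[i][0]
        # Consume all PSMs tied at this probability so that the cutoff
        # "probability >= prob" is well defined.
        while i < len(ranked) and ranked[i][0] == prob:
            if ranked[i][1]:
                n_decoy += 1
            else:
                n_target += 1
            i += 1
        within_fdr = n_target > 0 and n_decoy / n_target <= max_fdr
        if within_fdr and n_target > best[1]:
            best = (prob, n_target)
    return best
```

Unlike stopping at the first cutoff that exceeds the FDR, the sweep continues to the end of the ranked list, because the estimated FDR can fall back below the limit once further target PSMs are accumulated.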