 Methodology Article
 Open Access
A cost-sensitive online learning method for peptide identification
BMC Genomics volume 21, Article number: 324 (2020)
Abstract
Background
Post-database search is a key procedure in peptide identification with tandem mass spectrometry (MS/MS) strategies for refining peptide-spectrum matches (PSMs) generated by database search engines. Although many statistical and machine learning-based methods have been developed to improve the accuracy of peptide identification, challenges remain on large-scale datasets and on datasets with an unbalanced distribution of PSMs. A more efficient learning strategy is required to improve the accuracy of peptide identification on such challenging datasets. While complex learning models have greater classification power, they may overfit and introduce high computational cost on large-scale datasets. Kernel methods map data from the sample space to high-dimensional spaces where data relationships can be simplified for modeling.
Results
To tackle the computational challenge of using a kernel-based learning model for practical peptide identification problems, we present an online learning algorithm, OLCSRanker, which feeds only one training sample into the learning model at each round; as a result, the memory requirement for computation is significantly reduced. We also propose a cost-sensitive learning model for OLCSRanker that assigns a larger loss to decoy PSMs than to target PSMs in the loss function.
Conclusions
The new model can reduce its false discovery rate on datasets with an unbalanced distribution of PSMs. Experimental studies show that OLCSRanker outperforms other methods in terms of accuracy and stability, especially on such datasets. Furthermore, OLCSRanker is 15–85 times faster than CRanker.
Introduction
Tandem mass spectrometry (MS/MS)-based strategies are presently the method of choice for large-scale protein identification due to their high-throughput analysis of biological samples. With database sequence searching, a huge number of peptide spectra generated from MS/MS experiments are routinely searched with a search engine, such as SEQUEST, MASCOT or X!TANDEM, against theoretical fragmentation spectra derived from target databases or experimentally observed spectra to obtain peptide-spectrum matches (PSMs). However, most of these PSMs are not correct [1]. A number of computational methods and error rate estimation procedures have been proposed to improve the identification accuracy of target PSMs after database search [2, 3].
Recently, advanced statistical and machine learning approaches have been studied for better identification accuracy in the post-database search. PeptideProphet [4] and Percolator [5] are two popular machine learning-based tools. PeptideProphet employs the expectation maximization method to compute the probabilities of correct and incorrect PSMs, based on the assumption that the PSM data are drawn from a mixture of a Gaussian distribution and a Gamma distribution, which generate the samples of correct and incorrect PSMs. Several works have extended the PeptideProphet method to improve its performance. In particular, decoy PSMs were incorporated into a mixture probabilistic model in [6] at the estimation step of the expectation maximization. An adaptive method described in [7] iteratively learned a new discriminant function from the training set. Moreover, a Bayesian nonparametric (BNP) model was presented in [8] to replace the probabilistic distribution used in PeptideProphet for calculating the posterior probability. A similar BNP model [9] was also applied to MASCOT search results. Percolator starts the learning process with a small set of trusted correct PSMs and decoy PSMs, and it iteratively adjusts its learning model to fit the dataset. Percolator ranks the PSMs according to its confidence in them. Some works [10, 11] have also extended Percolator to deal with large-scale datasets.
In fact, Percolator is a typical method of supervised learning. Given labeled data, supervised learning trains a model and uses it to make accurate predictions on unlabeled data. In [12], a fully supervised method is proposed to improve the performance of Percolator. Two types of discriminant functions, linear functions and two-layer neural networks, are compared. The two-layer neural network is a nonlinear discriminant function that adds many hidden-unit parameters. As expected, it achieves better identification performance than the model with a linear discriminant function [12]. In addition, the work in [13] used a generative model, Deep Belief Networks, to improve the identification.
In supervised learning, kernel functions have been widely used to map data from the sample space to high-dimensional spaces where data with nonlinear relationships can be classified by linear models. With the kernel-based support vector machine (SVM), CRanker [14] has shown significantly better performance than linear models. Although kernel-based post-database searching approaches have improved the accuracy of peptide identification, two big challenges remain in the practical implementation of kernel-based methods: (1) The performance of the algorithms degrades on datasets with an unbalanced distribution of PSMs, in which case some datasets contain an extremely large proportion of false positives. We call them “hard datasets”, as the performance of most post-database search methods degrades on them; (2) Scalability problems in both memory use and computational time are still barriers for kernel-based algorithms on large-scale datasets. Kernel-based batch learning algorithms need to load the entire kernel matrix into memory, and thus the memory requirement can be very intense during the training process.
To some extent, the above challenges also exist in other post-database searching methods. A number of recent works are related to the two challenges. The data fusion methods [15–18] integrate different sources of auxiliary information, which alleviates the challenge of “hard datasets”. Moreover, a cloud computing platform is used in [19] to tackle the intense memory and computation requirements of mass spectrometry-based proteomics analysis using the Trans-Proteomic Pipeline (TPP). Existing research has either integrated extensive biological information or leveraged hardware support to overcome the challenges.
In this work, we develop an online classification algorithm to tackle the two challenges in kernel-based methods. For the challenge of “hard datasets”, we extend the CRanker [14] model to a cost-sensitive Ranker (CSRanker) by using different loss functions for decoy and target PSMs. The CSRanker model gives a larger penalty for wrongly selecting decoy PSMs than for target PSMs, which reduces the model’s false discovery rate while increasing its true positive rate. For the challenge of scalability, we design an online algorithm for CSRanker (OLCSRanker) which trains on PSM data samples one by one and uses an active set to keep only those PSMs that are effective for the discriminant function. As a result, the memory requirement and total training time can be dramatically reduced. Moreover, the training model is less prone to converging to poor local minima, avoiding extremely bad identification results.
In addition, we calibrate the quality of OLCSRanker outputs by using the entrapment sequences obtained from the “Pfu” dataset published in [20]. Although the target-decoy strategy has become a mainstream method for quality control in peptide identification, it cannot directly evaluate the false positive matches in identified PSMs. We use the entrapment sequence method as an alternative to the target-decoy strategy in the assessment of OLCSRanker [21, 22].
Experimental studies have shown that OLCSRanker not only outperformed Percolator and CRanker in terms of accuracy and stability, especially on hard datasets, but also reported clearly more target PSMs than Percolator on about half of the datasets. Also, OLCSRanker is 15∼85 times faster on large datasets than the kernel-based baseline method, CRanker.
Results
Experimental setup
To evaluate the OLCSRanker algorithm, we used six LC/MS/MS datasets generated from a variety of biological and control protein samples and different mass spectrometers to minimize the bias caused by the sample, type of mass spectrometer, or mass spectrometry method. Specifically, the datasets include the universal proteomics standard set (Ups1), the S.cerevisiae Gcn4 affinity-purified complex (Yeast), S.cerevisiae transcription complexes using the Tal08 minichromosome (Tal08 and Tal08large) and Human Peripheral Blood Mononuclear Cells (PBMC datasets). There are two PBMC sample datasets, which were analyzed with the LTQ-Orbitrap Velos with MiPS (Velosmips) and MiPS-off (Velosnomips), respectively. All PSMs were assigned by the SEQUEST search engine. Refer to [23] for the details of the sample preparation and LC/MS/MS analysis.
We converted the SEQUEST outputs from *.out format to Microsoft Excel format for OLCSRanker and removed any blank PSM records. Statistics of the SEQUEST search results of the datasets are summarized in Table 1.
A PSM record is represented by a vector of nine attributes: xcorr, deltacn, sprank, ions, hit mass, enzN, enzC, numProt, deltacnR. The first five attributes are inherited from the SEQUEST algorithm and the last four attributes are defined as follows:
enzN: A boolean variable indicating whether the peptide is preceded by a tryptic site;
enzC: A boolean variable indicating whether the peptide has a tryptic C-terminus;
numProt: The number of times the corresponding protein matches other PSMs;
deltacnR: deltacn/xcorr.
Based on our observation, “xcorr” and “deltacn” played more important roles in the identification of PSMs, and hence we used 1.0 for the weights of these two features and 0.5 for all others. Also, the Gaussian kernel \(k(x_{i},x_{j}) = \exp \left(-\frac{\|x_{i}-x_{j}\|^{2}}{2\sigma^{2}}\right)\) was chosen in this experimental study.
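The feature weighting and kernel described above can be sketched as follows. This is a minimal illustration rather than the authors' Matlab code: the function name `gaussian_kernel` is hypothetical, and it assumes that the weights rescale each attribute before the Euclidean distance is taken, which the text does not state explicitly.

```python
import numpy as np

# Attribute order follows the nine attributes listed in the text.
FEATURES = ["xcorr", "deltacn", "sprank", "ions", "hit mass",
            "enzN", "enzC", "numProt", "deltacnR"]

# Weights from the text: 1.0 for xcorr and deltacn, 0.5 for all others.
WEIGHTS = np.array([1.0, 1.0] + [0.5] * 7)


def gaussian_kernel(x_i, x_j, sigma, weights=WEIGHTS):
    """Gaussian kernel exp(-||w * (x_i - x_j)||^2 / (2 sigma^2)).

    Assumption: the feature weights multiply each attribute difference
    before the squared distance is computed.
    """
    diff = weights * (np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))
```

Under this assumption, a unit difference in xcorr lowers the kernel value more than a unit difference in a down-weighted attribute such as sprank.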
The choice of the parameters C_{1}, C_{2} and σ is a critical step in the use of OLCSRanker. We performed a 3-fold cross-validation, and the values of the parameters were chosen by maximizing the number of identified PSMs. Detailed cross-validation results can be found in Additional file 2. The PSMs were selected according to the calculated scores under FDR levels 0.02 and 0.04, respectively, and FDR was computed using the following equation
where D is the number of spectra matched to decoy peptide sequences and T is the number of PSMs matched to target peptide sequences. As the performance of OLCSRanker is not sensitive to the algorithm parameters, we fixed M=1000 and m=0.35|S| in this experimental study, where S is the active index set and |S| denotes its size.
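Selecting PSMs at a given FDR level can be sketched as below. The estimator used here, 2D/(D+T), is an assumption: it is one common target-decoy estimator, and the article's own equation may differ. The function names are hypothetical.

```python
def estimate_fdr(num_decoy, num_target):
    """Assumed target-decoy estimator: FDR = 2D / (D + T)."""
    total = num_decoy + num_target
    return 0.0 if total == 0 else 2.0 * num_decoy / total


def select_at_fdr(scores, is_decoy, fdr_level):
    """Return indices of target PSMs in the longest top-scoring prefix
    whose estimated FDR does not exceed fdr_level."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best_cut, decoys, targets = 0, 0, 0
    for rank, i in enumerate(order, start=1):
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        if estimate_fdr(decoys, targets) <= fdr_level:
            best_cut = rank  # remember the deepest cut that still passes
    return [i for i in order[:best_cut] if not is_decoy[i]]
```

With D = 1 decoy and T = 99 targets above the cut, this estimator gives 2/100 = 0.02, which is the stricter of the two FDR levels used in the experiments.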
OLCSRanker was implemented with Matlab R2015b. The source code can be downloaded from https://github.com/IsaacQiXing/CRanker. All experiments were run on a PC with an Intel Core E5-2640 CPU at 2.40 GHz and 24 GB RAM.
For comparison with PeptideProphet and Percolator, we followed the steps described for the Trans-Proteomic Pipeline (TPP) suite [24] and in [10]. For PeptideProphet, we used the program MzXML2Search to extract the MS/MS spectra from the mzXML file, and the search outputs were converted to pep.XML format files with the TPP suite. For Percolator, we converted the SEQUEST outputs to a merged file in SQT format [25, 26], and then transformed it to PIN format with sqt2pin, which is integrated in the Percolator suite [10]. We used the ‘-N’ option of the “percolator” command to specify the number of training PSMs.
Comparison with benchmark methods
We compared OLCSRanker, PeptideProphet and Percolator on the six datasets in terms of the numbers of validated PSMs at FDR =0.02 and FDR =0.04. A validation approach performs better if it validates more target PSMs than another approach under the same FDR. Table 2 shows the number of validated PSMs and the ratio of this number to the total for each dataset. Compared with PeptideProphet or Percolator, OLCSRanker identified more PSMs on three datasets and similar numbers of PSMs on the other three datasets.
Compared with PeptideProphet, 25.1%, 4.9% and 2.4% more PSMs were identified by OLCSRanker at FDR =0.02 on Tal08, Tal08large and Velosnomips, respectively. Compared with Percolator, 12.2%, 10.0% and 3.4% more PSMs were identified by OLCSRanker at FDR =0.01 on Yeast, Tal08 and Velosnomips, respectively. On Ups1 and Tal08large OLCSRanker identified a similar number of PSMs to that of Percolator. The numbers of PSMs identified by the three methods on each dataset under FDR =0.04 are similar to those under FDR =0.02.
We have also compared the overlap of target PSMs identified by the three approaches, as a PSM reported by multiple methods is more likely to be correct. Figure 1 shows that the majority of the PSMs validated by the three approaches overlap, indicating high confidence in the identified PSMs output by OLCSRanker. In particular, on Yeast, the three approaches have 1197 PSMs in common, covering more than 86% of the total target PSMs identified by each of the algorithms. This ratio of common PSMs is 86% and 75% on Ups1 and Tal08, respectively, and more than 90% on Tal08large, Velosmips and Velosnomips.
Furthermore, OLCSRanker shares more identified PSMs with each of PeptideProphet and Percolator than PeptideProphet and Percolator share with each other. On Yeast, besides the overlap among the three methods, OLCSRanker and PeptideProphet identified 128 PSMs in common and OLCSRanker and Percolator identified 25 PSMs in common. In contrast, PeptideProphet and Percolator have only 3 PSMs in common. Similar patterns occurred on other datasets.
Not surprisingly, OLCSRanker validated more PSMs than the other methods in most cases. For a closer look, we compared the outputs of OLCSRanker and Percolator on Velosnomips in Fig. 2. For visualization, we projected the PSMs from the nine-dimensional sample space onto a plane. As we can see, the red dots are mainly distributed in the margin region, and they are mixed with decoy and other target PSMs. Percolator misclassified these red dots; OLCSRanker, however, correctly identified them using a nonlinear kernel. We have observed this advantage of OLCSRanker on the Yeast, Tal08 and Velosmips datasets as well. These figures can be found in Additional file 1.
Hard datasets and normal datasets
Note that in Table 2, all three approaches reported relatively low ratios of validated PSMs on the Yeast, Ups1 and Tal08 datasets. As mentioned above, we call them “hard datasets”, in which a large proportion of incorrect PSMs usually increases the complexity of identification for any approach. In particular, the ratios on Yeast, Ups1 and Tal08 are 0.204 ∼0.219, 0.05 ∼0.062, and 0.096 ∼0.117, respectively, while the ratios on the other datasets (“normal datasets”) are larger than 0.35.
Model evaluation
We used the receiver operating characteristic (ROC) to compare the performances of OLCSRanker, PeptideProphet and Percolator. As shown in Fig. 3, OLCSRanker reached the highest TPRs among the three methods at most values of FPR on all datasets. Compared with PeptideProphet, OLCSRanker reached significantly higher TPR levels on the Tal08 and Tal08large datasets. Compared with Percolator, OLCSRanker reached significantly higher TPR levels on the Yeast, Tal08 and Velosnomips datasets. On Velosnomips, the TPR values of OLCSRanker were about 0.04 higher (i.e., about 8% more identified target PSMs) than those of Percolator at FPR levels from 0 to 0.02 (corresponding to FDR levels from 0 to 0.07). In general, OLCSRanker outperformed PeptideProphet and Percolator in terms of the ROC curve.
We have also examined model overfitting by plotting the ratio of the number of PSMs identified in the test set to the total number of identified PSMs (identified_test/identified_total) against the ratio of the training set size to the total dataset size (|train set|/|total set|). As PeptideProphet does not use the supervised learning framework, we only compared OLCSRanker with Percolator and CRanker in this experiment. Assume that correct PSMs are identically distributed over the whole dataset. If neither underfitting nor overfitting occurs, then identified_test/identified_total should be close to 1 − |train set|/|total set|. For example, at |train set|/|total set| = 0.2, the expected ratio of identified_test/identified_total is 0.8. The training sets and test sets were formed by randomly selecting PSMs from the original datasets according to |train set|/|total set| = 0.1, 0.2, ⋯, 0.8. For each value of |train set|/|total set|, we computed the mean and the standard deviation of identified_test/identified_total over 30 runs of Percolator and OLCSRanker; the results are shown in Fig. 4. The identified_test/identified_total ratios reported by OLCSRanker are closer to the expected ratios than those of Percolator on Yeast and Ups1. Take |train set|/|total set| = 0.2 in Fig. 4a as an example, in which 20%/80% of PSMs were used for training/testing and the corresponding expected identified_test/identified_total ratio is 0.8. The actual identified_test/identified_total ratio is 0.773 with standard error 0.018 for OLCSRanker, and 0.861 with standard error 0.043 for Percolator.
Due to the long running time of CRanker, we only compared OLCSRanker and CRanker at |train set|/|total set| = 2/3, and listed the results in Table 3. Although CRanker showed the same ratios of identified_test/identified_total on normal datasets as OLCSRanker did, its ratios on hard datasets are less than the expected ratio, 1/3. While the identified_test/identified_total ratio of CRanker is 0.272 and 0.306 on Ups1 and Tal08, respectively, the ratio of OLCSRanker is 0.334 and 0.342, respectively. The results indicate that, compared with CRanker, OLCSRanker overcomes the overfitting problem on hard datasets.
Furthermore, we have compared the outputs of Percolator and OLCSRanker with different training sets to examine the stability of OLCSRanker. Usually, the output of a stable algorithm does not change dramatically with the input training data samples. We ran Percolator and OLCSRanker 30 times at each value of the ratio |train set|/|total set| = 0.1, 0.2, 0.3, ⋯, 0.8.
The average numbers of identified PSMs and their standard deviations are plotted in Fig. 5. As we can see, both algorithms are stable on normal datasets. However, on Yeast and Ups1, the deviations of the outputs of OLCSRanker are smaller, especially when the |train set|/|total set| ratio is small.
The algorithm efficiency
To evaluate the computational resources consumed by OLCSRanker, we compared its running time and memory use with those of the kernel-based baseline method, CRanker. As the whole training data is needed for CRanker to construct its kernel matrix, it is very time-consuming on large datasets. Instead, CRanker divided the training set into five subsets by randomly selecting 16000 PSMs for each subset. The final score of a PSM is the average of the scores on the five subsets.
Table 3 summarizes the comparison of OLCSRanker and CRanker in terms of the total number of identified PSMs, the ratio of identified PSMs in the test set to the total number of identified PSMs, used RAM and elapsed time. As we can see, CRanker took from about 10 min to half an hour on the three small datasets, Ups1, Yeast and Tal08, and about 3 h on the comparatively large datasets, Tal08large, Velosmips and Velosnomips. In contrast, OLCSRanker took only 13 min on the largest dataset, Velosnomips, about 15∼85 times faster than CRanker. Moreover, OLCSRanker consumed only about 1/10 of the RAM used by CRanker on small datasets. On large datasets, OLCSRanker has a low memory cost: it uses about 400 MB of RAM on the largest tested dataset, Velosnomips. By contrast, CRanker could not efficiently deal with large-scale datasets since the large kernel matrix could not be loaded into memory. The memory of CRanker listed in the table is that used for training its five small-sized submodels.
In summary, OLCSRanker requires less computational time and memory than CRanker does. The analysis is as follows. CRanker uses a batch learning method in the training process and has to maintain an n-by-n dense kernel matrix, where n is the number of PSMs. In contrast, OLCSRanker uses an online learning algorithm, which iteratively trains the model by taking only one data sample at each round. Moreover, OLCSRanker only needs to keep the data samples in the active set in memory. Hence, the requirement of computational resources during the model-training process is significantly reduced.
In particular, the memory required by CRanker is O(n^{2}), with n the number of training PSMs, while that required by OLCSRanker is O(|S|^{2}), where |S| is the number of PSMs in the active set S. As the value of n is usually very large, CRanker can hardly run on a dataset with more than 20,000 PSMs on a normal PC. However, the maximum size of the active set S in OLCSRanker is preselected and far less than the value of n for large datasets.
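A rough worked example shows the scale of the difference in kernel-matrix memory. It assumes dense matrices of 8-byte double-precision entries; the active-set size of 2,000 is a hypothetical value chosen only for illustration and is not taken from the experiments.

```python
def dense_kernel_bytes(num_rows, bytes_per_entry=8):
    """Memory of a dense num_rows-by-num_rows matrix (8-byte doubles assumed)."""
    return num_rows * num_rows * bytes_per_entry


n = 20_000       # PSM count cited in the text as the practical limit for CRanker
active = 2_000   # hypothetical active-set size |S|, for illustration only

batch_gb = dense_kernel_bytes(n) / 1e9         # full n-by-n kernel matrix
online_mb = dense_kernel_bytes(active) / 1e6   # |S|-by-|S| kernel block

print(f"batch: {batch_gb:.1f} GB, online: {online_mb:.0f} MB")
```

Under these assumptions the full kernel matrix needs 3.2 GB, while the active-set block needs 32 MB, a factor of (n/|S|)^2 = 100.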
From the perspective of computational complexity, CRanker needs to solve a series of convex subproblems. Each subproblem is essentially an SVM classification problem, and the computational complexity is between O(n^{2}) and O(n^{3}). Thus, the computational complexity of CRanker is at least O(n^{2}). However, OLCSRanker deals with one PSM sample, at the computational cost of O(|S|^{2}), at each round. Thus, the computational complexity of OLCSRanker is bounded by O(n|S|^{2}), which is usually far less than that of CRanker when |S|≪n.
Evaluation by the entrapment sequence method
The entrapment sequence method was introduced as an alternative to the target-decoy strategy to validate true PSMs in mass spectrometry data analysis. We have evaluated the performance of OLCSRanker with the entrapment sequences obtained from the “Pfu” dataset published in reference [20].
We use the entrapment hits to calculate the false match rate (FMR) to assess the quality of the identification results. Fig. 6 depicts the corresponding FMRs under a series of FDR levels of OLCSRanker. It shows that with both the Tide (Fig. 6a) and Comet (Fig. 6b) search engines, OLCSRanker has FMR levels that are generally lower than the corresponding FDR levels for the identified sample PSMs and peptides, which indicates that the identification results are reasonable according to the definition of FMR.
We also compared the identification results of OLCSRanker using different search engines with those in [20] under 0.01 FDR for PSMs and peptides, respectively; the results are listed in Table 4. In most cases the FMRs estimated by entrapment hits are roughly equal to 0.01. In particular, with the Comet search engine at FMR =0.009, OLCSRanker identified 10603 PSMs, 6% more than those identified by Crux Percolator. Similarly for identified peptides, the number given by OLCSRanker is about 6% ((5667−5343)/5343=6.06%) more than that of Crux Percolator. With the Tide search engine, OLCSRanker identifies approximately the same number of PSMs and peptides as Crux Percolator, but has lower FMR levels. Thus, in terms of the identification numbers and FMRs given by this entrapment sequence test, the quality of the results identified by OLCSRanker is at least as high as that of Crux Percolator.
Conclusions
We have presented a cost-sensitive post-database search approach, OLCSRanker, for peptide identification to overcome the challenges of “hard datasets” and the scalability problem of the kernel-based learning model. We designed an online cost-sensitive model to handle the large portion of decoy PSMs in hard datasets by assigning them larger penalties. Moreover, OLCSRanker has shown better scalability than CRanker due to its significantly reduced memory requirement and total training time. Experimental studies have shown that OLCSRanker outperformed benchmark methods in terms of accuracy and stability. Also, compared with CRanker, OLCSRanker is about 15 ∼85 times faster on the tested datasets and has overcome the overfitting problem on hard datasets.
Materials and methods
Basic CRanker model
CRanker [14] casts the identification of target PSMs as a classification problem. Let \(\Omega = \{(x_{i}, y_{i})\}_{i=1}^{n} \subseteq R^{q} \times \{-1, 1\}\) be a set of n PSMs, where \(x_{i} \in R^{q}\) represents the ith PSM record with q attributes, and \(y_{i} \in \{-1, 1\}\) is the corresponding label indicating a target or decoy PSM. Define \( \Omega_{+} = \{ j \mid y_{j} = 1 \}, \quad \Omega_{-} = \{ j \mid y_{j} = -1 \}. \) The identification task is to train a discriminant function for filtering out the correct PSMs from the target PSMs (ones with labels “+1”).
While class labels in a standard classification problem are all trustworthy, a large number of “+1” labels in PSM identification are not correct. CRanker [14] introduced a weight θ_{i}∈[0,1] for each PSM sample (x_{i},y_{i}) to indicate the degree of reliability of the label y_{i}. In particular, θ_{i}=1 indicates that label y_{i} is definitely correct, θ_{i}=0 indicates that it is definitely incorrect, and θ_{i}∈(0,1) indicates that label y_{i} is probably correct. In fact, all “−1” labels (decoy PSMs) are correct, and thus θ_{i}=1 for all \(i \in \Omega_{-}\). Based on the Support Vector Machine (SVM) [27], CRanker is formulated as the following optimization problem
where C>0 is the regularization parameter, λ>0 is the parameter controlling the number of identified PSMs, h(t)= max(0,1−t) is the hinge loss function, and f(x_{i})=〈w,ϕ(x_{i})〉 is the value of discriminant function at x_{i} with feature mapping ϕ(·). As shown in [28, 29], a larger value of parameter λ selects more PSMs into the training process.
Cost-sensitive ranker model
In this section, we present a cost-sensitive (CS) classification model to partially tackle the stability problem of CRanker on datasets with an unbalanced distribution of PSMs. Unlike the CRanker model, the CS model uses different loss functions for decoy and target PSMs. In fact, learning errors should be treated with different penalties in peptide identification. If the discriminant function assigns a “+1” label to a decoy PSM, then we know for sure that the label assignment is wrong. In this case, the learning error is more likely caused by the model itself than by the quality of the data sample, and hence we should give the loss function a large penalty. On the other hand, if a target is classified as negative and assigned the label “−1”, we are not even sure whether the label assignment is correct, and thus we use a small penalty for the loss function. Based on these observations, we incorporate the new penalty policy into model (1), and the new model is described as follows:
where C_{1}>0 and C_{2}>0 are the weights for the losses of the decoys and targets, respectively. Model (2) is named the cost-sensitive ranker model and denoted by CSRanker. As we choose a larger penalty for decoy losses, \(C_{1} \geq C_{2}\) always holds.
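The cost-sensitive weighting can be illustrated with a short sketch of the loss terms alone. It assumes that decoy losses are weighted by C_1 and target losses by C_2 times the reliability weight θ_i, with the hinge loss h(t) = max(0, 1 − t) defined earlier; the regularization and λ terms of the full model are omitted, and the function names are hypothetical.

```python
import numpy as np


def hinge(t):
    """Hinge loss h(t) = max(0, 1 - t)."""
    return np.maximum(0.0, 1.0 - np.asarray(t, dtype=float))


def cost_sensitive_loss(margins, labels, theta, c1, c2):
    """Sum of weighted hinge losses (assumed form, loss terms only).

    margins: y_i * f(x_i) for each PSM
    labels:  +1 for target PSMs, -1 for decoy PSMs
    theta:   reliability weights in [0, 1], used for targets only
    c1, c2:  penalties for decoy and target losses, with c1 >= c2
    """
    losses = hinge(margins)
    labels = np.asarray(labels)
    theta = np.asarray(theta, dtype=float)
    decoy, target = labels == -1, labels == 1
    return float(c1 * losses[decoy].sum()
                 + c2 * (theta[target] * losses[target]).sum())
```

For instance, with c1 = 2 and c2 = 1, a margin of −1 (hinge loss 2) costs 4 when it belongs to a decoy but only 2 when it belongs to a fully trusted target.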
The convex-concave procedure for solving CSRanker
In order to solve the CSRanker model, we transform (2) to its DC (difference of two convex functions) form. According to the method in [29], if a pair of w^{∗}∈R^{n} and θ^{∗}∈R^{n} is an optimal solution to CSRanker model (2), then w^{∗} is also an optimal solution of the following problem
where R_{s}(t)= min(1−s, max(0,1−t)), \(s = 1 - \frac {\lambda }{C_{2}}\).
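Two properties of the ramp loss R_s can be checked numerically. The sketch below first verifies that R_s equals the difference of two hinge functions, H_1 − H_s with H_s(t) = max(0, s − t). It then checks that, assuming each target PSM contributes a term θ_i (C_2 h(t) − λ) with θ_i ∈ [0,1], minimizing over θ_i gives C_2 R_s(t) − λ when s = 1 − λ/C_2. The assumed form of that term and the parameter values are illustrative only.

```python
import numpy as np


def hinge(s, t):
    """H_s(t) = max(0, s - t); H_1 is the usual hinge loss."""
    return np.maximum(0.0, s - t)


def ramp(s, t):
    """R_s(t) = min(1 - s, max(0, 1 - t))."""
    return np.minimum(1.0 - s, np.maximum(0.0, 1.0 - t))


def best_theta_term(t, c2, lam):
    """Minimum over theta in [0, 1] of theta * (c2 * H_1(t) - lam)."""
    return np.minimum(0.0, c2 * hinge(1.0, t) - lam)


c2, lam = 2.0, 1.0          # illustrative values only
s = 1.0 - lam / c2          # s = 1 - lambda / C_2
t = np.linspace(-3.0, 3.0, 601)

# R_s is a difference of two convex hinge functions.
decomposition_ok = bool(np.allclose(ramp(s, t), hinge(1.0, t) - hinge(s, t)))
# Eliminating theta leaves the ramp loss, up to the constant -lambda.
elimination_ok = bool(np.allclose(best_theta_term(t, c2, lam),
                                  c2 * ramp(s, t) - lam))
```

Because the ramp loss is capped at 1 − s = λ/C_2, a badly misclassified target PSM contributes only a bounded loss, which matches the idea that its “+1” label may simply be wrong.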
Since R_{s}(t)=H_{1}(t)−H_{s}(t), with H_{s}(t)= max(0,s−t) and H_{1}(t)= max(0,1−t), then model (3) can be recast as
where
\(J_{\text {vex}}(\cdot)\) and \(J_{\text {cav}}(\cdot)\) are convex and concave functions, respectively. Hence, Problem (4) can be solved by a standard Concave-Convex Procedure (CCCP) [30], which iteratively solves subproblems
with initial w^{0}. The subproblem (6) can be solved by its Lagrange dual [31]:
where \(\eta_{i} = \begin{cases} 1, & \text{if } y_{i} f(x_{i}) < s,\\ 0, & \text{otherwise.} \end{cases}\)
Model (7) is a kernel-based learning model with k(·,·) the kernel function. Then k(x_{i},x_{j}) calculates, in feature space, the pairwise inner product of the PSM records x_{i} and x_{j}, which are represented in vector format. Hence, OLCSRanker can handle PSM records generated by any search engine as long as the output PSMs are represented in vector format.
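The general idea of the Concave-Convex Procedure used in this section can be illustrated on a one-dimensional toy problem. This is not the CSRanker objective: the functions `j_vex` and `j_cav` below are hypothetical and chosen only so that each convex subproblem has a closed-form solution. At every iteration, the concave part is replaced by its tangent at the current point and the resulting convex problem is minimized.

```python
import math


def j_vex(w):
    """Convex part of the toy objective."""
    return (w - 3.0) ** 2


def j_cav(w):
    """Concave part of the toy objective."""
    return -math.sqrt(1.0 + w * w)


def j_cav_grad(w):
    """Derivative of the concave part."""
    return -w / math.sqrt(1.0 + w * w)


def cccp(w0, iters=50):
    """Run CCCP from w0; return the final point and the objective history."""
    w = w0
    history = [j_vex(w) + j_cav(w)]
    for _ in range(iters):
        g = j_cav_grad(w)
        # Convex subproblem: minimize (w - 3)^2 + g * w, solved by w = 3 - g / 2.
        w = 3.0 - g / 2.0
        history.append(j_vex(w) + j_cav(w))
    return w, history


w_star, history = cccp(0.0)
```

The objective values in `history` never increase, which is the standard descent property of CCCP; the procedure stops at a stationary point, which for a non-convex objective need not be the global minimum.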
The online learning algorithm for CSRanker model
Inspired by the work in [32, 33], we obtain the discriminant function for CSRanker by solving its DC form (3).
Unlike classical classifiers, which take all PSM samples at once, the online CSRanker algorithm (OLCSRanker) iteratively trains the discriminant function and adds only one PSM sample to the training process at each iteration. The PSM sample is randomly selected to prevent the solution of (3) from becoming trapped in a local minimum; the effectiveness of this strategy has been observed in approaches such as stochastic gradient descent [34]. To reduce the cost of memory and computation, OLCSRanker maintains an active set which keeps only the indices of the PSMs that determine the discriminant function in model training, and the PSMs that do not affect the discriminant function are discarded.
Online algorithm for solving CSRanker
The implementation of OLCSRanker is depicted in Algorithm 1. Particularly, given a chosen PSM sample (Line 3), OLCSRanker updates bounds A_{j}, B_{j}, for all \(j \in \Omega _{+}^{}\cap S\) (Line 4 – Line 7), and calls subroutines PROCESS and REPROCESS to solve dual programming (7) with training samples in active set S (Line 8–Line 12). Iteratively, the algorithm calls subroutine CLEAN to remove part of redundant PSMs from the active set (Line 13). The iteration terminates when all the training PSMs have been chosen for training.
Subroutines
Subroutine PROCESS ensures that all the coordinates of α_{j} satisfy the bound constraint conditions in CSRanker model (7). It initializes \(\alpha _{i_{0}}\) with 0, where i_{0} is the index of the chosen PSM, and updates the coordinates α_{j} if bound A_{j} or B_{j} has changed (Line 12). Then, it updates gradient vector g_{j}, j∈S (Line 3), where g is defined by
Subroutine REPROCESS aims to find a better solution of model (7). It selects the instances with the maximal gradient in active set S (Line 1 – Line 12). Once an instance is selected, it computes a step size (Line 13 – Line 17) and performs a direction search (Line 18 – Line 19). The derivation of these iteration formulae can be found in Additional file 1.
Subroutine CLEAN removes PSMs that are not effective for the discriminant function from the active set S to minimize the requirement of memory and computation. The subroutine selects non-support vectors and keeps them in set V (Line 1 – Line 4), then selects at most m PSMs of V with the largest gradients, and finally removes them from S (Line 5 – Line 9).
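The pruning step performed by CLEAN can be sketched as follows. This is only an outline of the description above, not Algorithm 1 itself: it assumes that a non-support vector is an active index whose coefficient α_j is numerically zero, and the function name and data layout (a list of indices plus dictionaries for α and the gradient) are hypothetical.

```python
def clean(active, alpha, grad, m, tol=1e-12):
    """Return the active set after removing at most m non-support vectors.

    active: list of PSM indices currently in the active set S
    alpha:  dict mapping index -> dual coefficient alpha_j
    grad:   dict mapping index -> gradient g_j
    m:      maximum number of indices to remove
    """
    # Assumption: non-support vectors are those with alpha_j == 0 (within tol).
    candidates = [j for j in active if abs(alpha[j]) <= tol]
    # Among them, pick the (at most) m indices with the largest gradients.
    candidates.sort(key=lambda j: grad[j], reverse=True)
    to_remove = set(candidates[:m])
    return [j for j in active if j not in to_remove]
```

With the setting m = 0.35|S| used in the experiments, an active set of 1,000 indices would lose at most 350 of them in one call.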
Calculate PSM scores
After the discriminant function \(\hat {f}\): \( \hat {f}(x) = \sum _{j\in S} \alpha _{j} k(x_{j},x),\) where k(·,·) is the selected kernel function, is trained, we calculate the scores of all PSMs on both the training and test sets. The score of PSM (x_{i},y_{i}) is defined in [14]:
The larger the score value is, the more likely a PSM is correct. The PSMs are ordered according to their scores, and a certain number of PSMs are reported according to a preselected FDR.
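Evaluating the trained discriminant function and ordering PSMs can be sketched as below. The sketch evaluates \(\hat{f}(x) = \sum_{j\in S} \alpha_j k(x_j, x)\) with an unweighted Gaussian kernel and then sorts by value. Ranking directly by the value of the discriminant function is an assumption made for illustration, since the score definition from [14] is not reproduced here; all function names are hypothetical.

```python
import numpy as np


def rbf(x, z, sigma):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    d = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-d.dot(d) / (2.0 * sigma ** 2)))


def discriminant(x, active_points, alpha, sigma):
    """f_hat(x) = sum over the active set of alpha_j * k(x_j, x)."""
    return sum(a * rbf(xj, x, sigma) for xj, a in zip(active_points, alpha))


def rank_by_value(values):
    """Indices sorted from the largest value to the smallest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)
```

Only the PSMs kept in the active set enter the sum, so the cost of scoring one PSM grows with |S| rather than with the size of the training set.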
Availability of data and materials
The datasets supporting the conclusions of this article are available in the Figshare repository, https://doi.org/10.6084/m9.figshare.5739705.v1.
The OLCSRanker software can be downloaded from https://github.com/IsaacQiXing/CRanker. A web-based GUI for users of OLCSRanker is provided at http://161.6.5.181:8000/olcsranker/.
Abbreviations
MS/MS: Tandem mass spectrometry
PSM: Peptide-spectrum match
BNP: Bayesian nonparametric model
SVM: Support vector machine
TPP: Trans-Proteomic Pipeline
CSRanker: Cost-sensitive Ranker
OLCSRanker: Online algorithm for CSRanker
Ups1: Universal proteomics standard set
Yeast: S. cerevisiae Gcn4 affinity-purified complex
PBMC: Human Peripheral Blood Mononuclear Cells
Velosmips: LTQ-Orbitrap Velos with MiPS
Velosnomips: LTQ-Orbitrap Velos with MiPS-off
ROC: Receiver operating characteristic
DC: Difference of two convex functions
CCCP: Concave-Convex Procedure
FMR: False match rate
References
 1
Elias JE, Haas W, Faherty BK, Gygi SP. Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat Methods. 2005; 2(9):667–75.
 2
Link AJ, Eng J, Schieltz DM, Carmack E. Direct analysis of protein complexes using mass spectrometry. Nat Biotechnol. 1999; 17(7):676–82.
 3
Nesvizhskii AI. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics. 2010; 73(11):2092–123.
 4
Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002; 74(20):5383–92.
 5
Käll L, Canterbury JD, Weston J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods. 2007; 4(11):923–5.
 6
Choi H, Nesvizhskii AI. Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. J Proteome Res. 2007; 7(1):254–65.
 7
Ding Y, Choi H, Nesvizhskii AI. Adaptive discriminant function analysis and reranking of MS/MS database search results for improved peptide identification in shotgun proteomics. J Proteome Res. 2008; 7(11):4878–89.
 8
Zhang J, Ma J, Dou L, Wu S, Qian X, Xie H, Zhu Y, He F. Bayesian nonparametric model for the validation of peptide identification in shotgun proteomics. Mol Cell Proteomics. 2009; 8(3):547.
 9
Jie M, Jiyang Z, Songfeng W, Dong L, Yunping Z, Fuchu H. Improving the sensitivity of Mascot search results validation by combining new features with Bayesian nonparametric model. Proteomics. 2010; 10(23):4293–300.
 10
The M, MacCoss MJ, Noble WS, Käll L. Fast and accurate protein false discovery rates on large-scale proteomics data sets with Percolator 3.0. J Am Soc Mass Spectrom. 2016; 27(11):1719–27.
 11
Halloran JT, Rocke DM. A matter of time: faster Percolator analysis via efficient SVM learning for large-scale proteomics. J Proteome Res. 2018; 17(5):1978–82.
 12
Spivak M, Weston J, Bottou L, Käll L, Noble WS. Improvements to the Percolator algorithm for peptide identification from shotgun proteomics data sets. J Proteome Res. 2009; 8(7):3737–45.
 13
Halloran JT, Rocke DM. Gradients of generative models for improved discriminative analysis of tandem mass spectra. Adv Neural Inf Proc Syst. 2017; 30:5724–33.
 14
Liang X, Xia Z, Jian L, Niu X, Link A. An adaptive classification model for peptide identification. BMC Genom. 2015; 16(11):1–9.
 15
Ivanov MV, Levitsky LI, Lobas AA, Panic T, Laskay UA, Mitulovic G, Schmid R, Pridatchenko ML, Tsybin YO, Gorshkov MV. Empirical multidimensional space for scoring peptide spectrum matches in shotgun proteomics. J Proteome Res. 2014; 13(4):1911–20.
 16
Spivak M, Bereman MS, MacCoss MJ, Noble WS. Learning score function parameters for improved spectrum identification in tandem mass spectrometry experiments. J Proteome Res. 2012; 11(9):4499–508.
 17
Wang X, Zhang B. Integrating genomic, transcriptomic, and interactome data to improve peptide and protein identification in shotgun proteomics. J Proteome Res. 2014; 13(6):2715–23.
 18
Jian L, Xia Z, Niu X, Liang X, Samir P, Link A. L2 multiple kernel fuzzy SVM-based data fusion for improving peptide identification. IEEE/ACM Trans Comput Biol Bioinforma. 2016; 13(4):804–9.
 19
Slagel J, Mendoza L, Shteynberg D, Deutsch EW, Moritz RL. Processing shotgun proteomics data on the Amazon cloud with the Trans-Proteomic Pipeline. Mol Cell Proteomics. 2015; 14(2):399–404.
 20
Feng XD, Li LW, Zhang JH, Zhu YP, Chang C, Shu KX, Ma J. Using the entrapment sequence method as a standard to evaluate key steps of proteomics data analysis process. BMC Genomics. 2017; 18(Suppl 2). https://doi.org/10.1186/s12864-017-3491-2.
 21
Vaudel M, Burkhart JM, Breiter D, Zahedi RP, Sickmann A, Martens L. A complex standard for protein identification, designed by evolution. J Proteome Res. 2012; 11(10):5065–71.
 22
Granholm V, Noble WS, Käll L. On using samples of known protein content to assess the statistical calibration of scores assigned to peptide-spectrum matches in shotgun proteomics. J Proteome Res. 2011; 10(5):2671–8.
 23
Jian L, Niu X, Xia Z, Samir P, Sumanasekera C, Mu Z, Jennings JL, Hoek KL, Allos T, Howard LM, Edwards KM, Weil PA, Link AJ. A novel algorithm for validating peptide identification from a shotgun proteomics search engine. J Proteome Res. 2013; 12(3):1108–19.
 24
Shteynberg D, Mendoza L, Hoopmann M, Eng J, Lam H. Trans-Proteomic Pipeline. 2018. http://tools.proteomecenter.org/wiki/index.php?title=Software:TPP. Accessed 4 Nov 2019.
 25
McDonald H, Tabb D, Sadygov R, MacCoss M, Venable J, Graumann J, Johnson JR, Cociorva D, Yates J. MS1, MS2, and SQT - three unified, compact, and easily parsed file formats for the storage of shotgun proteomic spectra and identifications. Rapid Commun Mass Spectrom. 2004; 18:2162–8. https://doi.org/10.1002/rcm.1603.
 26
Bill N. SQT file format. 2004. http://crux.ms/fileformats/sqtformat.html. Accessed 15 Dec 2019.
 27
Burges CJC. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov. 1998; 2:121–67.
 28
Wang Y, Liang X, Xia ZX, Niu X, Link AJ. Improved classification model for peptide identification based on self-paced learning. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM): 2017. p. 258–61. https://doi.org/10.1109/bibm.2017.8217659.
 29
Meng D, Zhao Q, Jiang L. What objective does self-paced learning indeed optimize? 2015. arXiv:1511.06049.
 30
Yuille AL, Rangarajan A. The concave-convex procedure. Neural Comput. 2003; 15(4):915–36.
 31
Boyd S, Vandenberghe L. Convex Optimization. New York: Cambridge University Press; 2004.
 32
Bordes A, Ertekin S, Weston J, Bottou L. Fast kernel classifiers with online and active learning. J Mach Learn Res. 2005; 6(6):1579–619.
 33
Ertekin S, Bottou L, Giles CL. Nonconvex online support vector machines. IEEE Trans Pattern Anal Mach Intell. 2011; 33(2):368–81.
 34
Bottou L. Stochastic gradient learning in neural networks. In: Proceedings of NeuroNîmes, vol. 91. France: The International Neural Society (INNS), Nimes: 1991.
Acknowledgments
We wish to thank Prof. Xiaolin Chen (Qufu Normal University, China) for his work on the analysis of the OLCSRanker algorithm.
Funding
Xijun Liang and Ling Jian were partially supported by the National Natural Science Foundation of China under Grant Nos. 61503412 and 61873279, the Key Research and Development Program of Shandong Province under Grant No. 2018GSF120020, the National Natural Science Foundation of Shandong Province under Grant No. ZR2019MA016, the Fundamental Research Funds for the Central Universities under Grant No. 19CX05027B, and the National Science and Technology Major Project of China under Grant No. 2016ZX05011001003. Andrew J. Link was supported in part by NIH grant GM64779. Xinnan Niu and Andrew J. Link were supported by NIH Grants GM64779, HL68744, ES11993, and CA098131. Zhonghang Xia was supported by WKU RCAP Grant No. 208032.
Author information
Contributions
XL and ZX designed the classification model and wrote the manuscript. LJ, YW and XL designed the parameter selection and experiments. XN and AL provided the proteomics data and verified the experimental results. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Xijun Liang.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Liang, X., Xia, Z., Jian, L. et al. A cost-sensitive online learning method for peptide identification. BMC Genomics 21, 324 (2020). https://doi.org/10.1186/s12864-020-6693-y
Keywords
 Peptide identification
 Mass spectrometry
 Classification
 Support vector machines
 Online learning