MS2CNN: predicting MS/MS spectrum based on protein sequence using deep convolutional neural networks

Background Tandem mass spectrometry allows biologists to identify and quantify protein samples in the form of digested peptide sequences. When performing peptide identification, spectral library search is more sensitive than traditional database search but is limited to peptides that have been previously identified. An accurate tandem mass spectrum prediction tool is thus crucial in expanding the peptide space and increasing the coverage of spectral library search. Results We propose MS2CNN, a non-linear regression model based on deep convolutional neural networks, a deep learning algorithm. The features for our model are amino acid composition, predicted secondary structure, and physical-chemical features such as isoelectric point, aromaticity, helicity, hydrophobicity, and basicity. MS2CNN was trained with five-fold cross validation on a three-way data split on the large-scale human HCD MS2 dataset of Orbitrap LC-MS/MS downloaded from the National Institute of Standards and Technology. It was then evaluated on a publicly available independent test dataset of human HeLa cell lysate from LC-MS experiments. On average, our model shows better cosine similarity and Pearson correlation coefficient (0.690 and 0.632) than MS2PIP (0.647 and 0.601) and is comparable with pDeep (0.692 and 0.642). Notably, for the more complex MS2 spectra of 3+ peptides, MS2PIP is significantly better than both MS2PIP and pDeep. Conclusions We showed that MS2CNN outperforms MS2PIP for 2+ and 3+ peptides and pDeep for 3+ peptides. This implies that MS2CNN, the proposed convolutional neural network model, generates highly accurate MS2 spectra for LC-MS/MS experiments using Orbitrap machines, which can be of great help in protein and peptide identifications. The results suggest that incorporating more data for deep learning model may improve performance.


Background
Tandem mass spectrometry (MS 2 ) has emerged as an indispensable technology in high-throughput proteomics experiments [1]. Tandem mass spectra generated from bottom-up proteomics consist of mass-to-charge ratios and relative abundances of a set of fragment ions generated from digested peptides. The patterns of these fragment ions are useful for the identification and quantification of proteomes in the sample.
There are two common approaches for protein identification: database search and spectral library search. The former searches each tandem mass spectrum (or MS 2 spectrum) acquired from experiments against theoretical spectrums generated from all possible digested peptides (with trypsin in most of the cases) in the human proteome using a scoring function. The latter searches a MS 2 spectrum against a spectral library, a collection of highquality spectra of all identified peptides from previous experiments [2]. Although database search is more comprehensive and covers all possible peptide space, the sensitivity is lower because of the absence of intensity for each fragment ion in theoretical spectra. In contrast, spectral library search provides considerably higher sensitivity since a spectral library consists of realistic fragment ion intensities [3]. However, spectral library search is limited to peptides that have been previously identified, which hinders the application of spectral library search in areas where the discovery of novel peptides is of importance, such as the identification of peptides with mutations or peptides from isoforms of proteins. To take this into account, it is necessary to develop methods for computational prediction or simulation of MS 2 spectra from amino acid sequences to expand the size of a spectral library.
There are several different strategies in predicting the MS 2 spectrum of a peptide. MassAnalyzer, a pioneer work in computational prediction of a MS 2 spectrum, uses a kinetic model on the basis of the mobile proton hypothesis to simulate peptide fragmentation [4,5]. A semi-empirical approach is to predict the MS 2 spectrum of a peptide from the spectra of similar peptides by peak perturbation [6]. The approach is based on the observation that the peptides of similar sequences produce similar fragmentation patterns in most cases. The concept is then generalized to a weighted K-nearest neighbor (KNN) approach in which a machine learning model first selects peptides that are likely to have high spectra similarity to the target peptide, and then a consensus algorithm combines their spectra to predict the MS 2 spectrum of the target peptide [7]. Though the two approaches can yield good prediction accuracy for target peptides with similar amino acid sequence neighbors, they are not designed to predict the MS 2 spectrum for arbitrary peptides of interest. For better predictive capability, other methods simplify the model by focusing on the prediction of y-ion intensities only [8][9][10]. Although they achieve some success, the applicability of these methods is somewhat restricted.
PeptideART, a data-driven approach based on feedforward neural networks, is trained with more than 40, 000 peptide spectrum matches (PSMs) [11]. In benchmark tests on five different data sets for MS 2 spectrum prediction, PeptideART compares favorably to MassAnalyzer. MS 2 PIP [12], a later random forest approach, incorporates different predictive models for different peptide lengths (8 to 28 amino acids) and different charge states (charge 2+ and 3+). These models are trained with more than 73,000 PSMs; the overall performance is reported to be better than PeptideART. A web server version of MS 2 PIP has been constructed with a new computational model and a much larger training data set of more than 170,000 PSMs [13]. More recently, a deep neural network-based method called pDeep has been developed [14]. The method is based on a bidirectional long short-term memory (BiLSTM) model and is trained with a data set of around 4,000,000 MS 2 spectra. Notably, for the same peptide sequence, it predicts MS 2 spectra of three different fragmentation approaches: HCD (higher-energy collisional dissociation), ETD (electron-transfer dissociation), and EThcD (electron-transfer/higher-energy collision dissociation). According to the reported benchmark experiment, pDeep yields considerable improvements over MassAnalyzer and MS-Simulator.
In this study, we propose MS 2 CNN, a deep convolutional neural network (DCNN) method for MS 2 spectrum prediction given experimental spectra large enough to effectively train a sophisticated deep learning model. We adopt the network structure of LeNet-5 [15], a DCNN consisting of three major components: a convolutional layer, a pooling layer, and a fully connected layer. A single DCNN is constructed to predict peptides of a specific length and charge. The entire training set was composed of high-quality human HCD MS 2 spectra from an Orbitrap LC-MS/MS experiment downloaded from the National Institute of Standards and Technology (NIST) consisting of 320,824 unique peptide sequences and 1,127,971 spectra. Five-fold cross validation was performed and the method was then benchmarked on a publicly available independent test dataset of human HeLa cell lysate from LC-MS/MS experiments with MS 2 PIP and pDeep. MS 2 CNN achieved a cosine similarity (COS) in the range of 0.57-0.79 and 0.59-0.74 for peptides of charge 2+ and charge 3+, respectively. These results suggest that MS 2 CNN significantly outperforms MS 2 PIP, especially for shorter peptide sequences for which abundant training data is available. It is also shown that MS 2 CNN has an overall comparable performance to pDeep; however the former predicts MS 2 spectra for charge 3+ peptides, which are usually considered more complicated than the spectra for 2+ peptides, at a higher accuracy.

Five-fold cross validation for determining convolutional layer
Because there are significantly more charge 2+ than charge 3+ peptide sequences, the best layer number of MS 2 CNN is determined by charge 2+, after which the value is directly applied to charge 3+. Given the one-fold run of the five-fold validation result, we chose the 4layer model as the default structure of MS 2 CNN because it yielded the best performance and is the most efficient of all the models (Additional file 1: Table S4). Although the 5-layer model is comparable to the 4-layer model for some peptide lengths, we did not consider it as its performance fluctuates considerably for peptides of different lengths and it also requires longer training times. Figure 1 shows the five-fold cross validation performance evaluated with COS for different peptide lengths and charge states (other detailed metrics are given in Additional file 1: Table S5). The figure shows that the predictive capability decreases as the peptide length gets larger, possibly due to less training data for longer peptides. We further investigated whether there is a benefit to merging charge 2+ and 3+ training data to build up a single model as MS 2 CNN_mix instead of having the two MS 2 CNN 2+ and MS 2 CNN 3+ models for charge 2+ and charge 3+ peptides, respectively. We followed the previous training procedure with an additional input feature-engineered procedure based on the merged data set of charge 2+ or 3+ peptides. The performance in general falls between the performance of charge 2+ and charge 3+ (Fig. 1, gray bar). This shows that although a larger data set boosts performance (improves MS 2 CNN 3+ performance), different charge states also contain specific patterns in terms of spectrum prediction (impairing MS 2 CNN 2+ performance).

Upper bound analysis
Peptide fragmentation is a random process; for example, even the same peptide in the same experiment can sometimes result in different peak intensities in spectra. When combining different ionization sources, ion detection, experimental steps, and even different species, the spectrum of the same peptide can be significantly different. Therefore, we compare the similarity between the training spectra and independent spectra for the same peptide sequence (Table 1). Ideally, the similarity in terms of COS or PCC should be 1 if the experimental conditions and the random processes for generating the two spectra are perfectly identical. In reality, the similarity can be seen as the Bayes rate, the theoretical prediction upper bound on prediction accuracy due to unexplainable variance. To conclude, the average upper bound COS for different peptide lengths ranges from 0.600 to 0.800 and decreases as peptide length increases. The average upper bound of PCC for different peptide lengths is even lower, ranging from 0.550 to 0.760. Peptide length seems to have a smaller effect on PCC than on COS, especially for peptides of charge 3 + .

Independent test set evaluation
We compared the proposed MS 2 CNN and MS 2 CNN_mix models with MS 2 PIP and pDeep based on the independent test set in terms of COS and PCC (Figs. 2 and 3, detailed values in Additional file 1: Table S6). In general, MS 2 CNN and MS 2 CNN_mix outperform MS 2 PIP for charge 2+ (Fig. 2) and charge 3+ (Fig. 3) peptides in both metrics significantly with a p-value < 0.01 by a Wilcoxon signed-rank test (Additional file 2: R Script). For charge 2+ peptides, MS 2 CNN outperforms pDeep marginally for peptide lengths no greater than 11, whereas for peptide lengths from 12 to 19, pDeep considerably outperforms the other methods for both COS and PCC (Fig. 2). In contrast, for charge 3+ peptides, MS 2 CNN and MS 2 CNN_ mix yield higher COS and PCC than pDeep for all peptide lengths significantly with a p-value < 0.01 by the Wilcoxon signed-rank test (Fig. 3). This suggests that pDeep might be more sensitive to the size of training data, as the number of spectra for charge 3+ peptides is significantly smaller than that of the charge 2+ peptides. Note that pDeep was trained with HCD mouse spectra. Although they show a high MS/MS spectra similarity (a median PCC of 0.94) across different species, a minority of peptides which share low similarity across species can nevertheless deteriorate prediction performance.
Note that the performance of charge 3+ peptides at lengths of 17, 18, and 19 are better than that of charge 2+ peptides for both COS and PCC. This may be due to the richer training data set and higher theoretical prediction upper bound in those ranges. The advantage of MS 2 CNN_mix can be seen in the prediction results of charge 3+ (Fig. 3), for which the size of the training data set greatly increases. This benefit becomes insignificant for charge 2+ peptides, as the original training data set is much larger: the improvement is not affected by theoretical prediction upper bound. Taking charge 3+ peptide lengths of 11 and 12 as an example (Fig. 3 b), there is more improvement in length 12 (MS 2 CNN_mix vs MS 2 PIP) but a higher upper bound in length 11 than length 12 (0.721 vs 0.682, Table 2 charge 3 + .PCC).

Discussion and conclusion
Peptide identification is an important issue in mass spectrometry-based proteomics. There are two major approaches for peptide identification: database search and spectral library search. Spectral library search boasts a greater sensitivity than database search, but is limited to peptides that have been previously identified. Overcoming this limitation calls for an accurate MS 2 spectrum prediction tool that is capable of reproducing the chemical fragmentation pattern of a peptide sequence. Over the years, a large number of high quality MS 2 spectra have been generated and made publicly available by experimentalists, making for an excellent opportunity for researchers to effectively train modern machine learning models such as deep convolutional neural networks for MS 2 spectra prediction.
We devise DCNN, a deep learning model for the prediction of peak intensities of MS 2 spectra. In addition to DCNN, we incorporate different Python libraries for feature engineering to facilitate the training process. According to our independent test set of HCD spectra of human samples from Orbitrap LC-MS experiments, MS 2 CNN shows superior prediction performance compared to MS 2 PIP for charge 2+ and 3+ peptides in terms of COS. It also outperforms pDeep, another deep learning approach, for charge 3+ peptides. In the future, we plan to improve the predictive power of our model by either including more data for longer peptide sequences or employing another popular approach in deep learning such as transfer learning, in which a pretrained model is reused for another task, for example, we use a model trained on short peptides for a long peptide task. In the light of our results, we believe MS 2 CNN can be of great use in expanding the coverage of a spectral library and improving the identification accuracy of spectral library search in the analysis of proteomics samples.

Feature engineering
To apply a deep learning method to our dataset, each peptide sequence must be converted into a feature vector with a label. Table 2 lists the features we use to  In addition, the m/z and physical-chemical properties of all the 26 (=2*(14-1)) fragment ions are included in the feature vector. The total number of features for a peptide sequence is 290 (=1 + 20 + 9 + 26*1 + 26*9). We used Pyteomics v3.4.2 [16] to compute the mass-to-charge ratio and Biopython v1.7 [17] to calculate the amino acid composition, instability index, isoelectric point, and secondary structure fraction.

MS 2 CNN model
We propose MS 2 CNN, a DCNN model that uses the aforementioned features (Fig. 4). The MS 2 CNN model takes a peptide feature vector as input and computes an ensemble of nonlinear function nodes in which each layer consists of a number of nodes. The predicted peak intensity corresponds to an output node of the MS 2 CNN model. In the proposed model, a convolution layer is activated by the relu activation function. A max-pooling layer is added after a convolution layer: together they constitute one convolution-pooling layer. The number of convolution-pooling layers is repeated n times in MS 2 CNN, where n ranges from 2 to 7. The best number was determined by a cross validation experiment. We unify the node number of the convolutional layers as 10; the node number for the last convolutional layer depends on the layer depth. Additional file 1: Table S1 lists the detailed configurations for convolutional layers from layers 2 to 7. The repeated convolution-pooling layers are followed by another layer to flatten the output. Then we add a fully connected layer with twice as many nodes as the number of output nodes. We implemented the MS 2 CNN architecture and executed the whole training process using the Keras Python package version 2.0.4 [18]. Figure 4 illustrates the MS 2 CNN model structure.

Datasets
Training data set We downloaded the training seta human HCD library based on an Orbitrap mass analyzer and LC-MS (Liquid chromatography-mass spectrometry)from the NIST website. This set is based on CPTAC and ProteomeXchange, two public repositories containing 1,127,971 spectra from 320,824 unique peptide sequences in .msp format. The dataset consists of peptides with charge states ranging from 1+ to 9+, among which only charge states of 2+ and 3+ were selected as there was not enough data for the other charges to effectively train a machine learning model. This strategy is consistent with previous studies.

De-duplicated spectrum
It is common for different spectra to belong to the same peptide sequence, and for charge states to have different peak intensities for their fragment ions. We performed a two-step process to generate a de-duplicated spectrum from a set of spectra for a given peptide. First, each peak in a spectrum was normalized by the maximum peak intensity of the spectrum. Then, the intensity of each band y-ion was determined by the median intensity of the ion across different spectra. This yielded a consensus spectrum which filters out noise that could degrade DCNN training. Additional file 1: Table S2

Independent test set
We used the data-dependent acquisition data of Orbitrap LC-MS experiments from [19] as an independent test set. This included 22,890 and 5998 spectra for charge 2+ and 3+ peptides, respectively. The proportion of common peptides in our training set and independent test set exceeded 90%. Although these peptides were viewed as easier prediction targets, the performance is still bounded by the theoretical upper bound; for example, the upper bound of COS for charge 2+ and charge 3+ peptides ranges from 0.636 to 0.800 and from 0.617 to 0.781, respectively (detailed numbers shown in Table 1). The numbers of commonly observed peptides for different lengths are summarized in Additional file 1: Table S3.

K-fold cross validation
To select the best parameters (i.e., layer numbers) for the MS 2 CNN model and to prevent overfitting, we applied five-fold cross validation with a three-way data split, namely, the entire data set was partitioned into training, validation (10% of training data), and test sets.
Training epochs continued as long as the accuracy of the validation set improved over the previous epoch by 0.001; otherwise, training was terminated. The final model was selected based on validation performance, and was used to predict the test set for performance evaluation. Since our model was selected based on validation set performance, there was no data leakage problem, in which information in the test data is involved in model selection. This problem can result in overestimation of the performance and unfair comparison with other methods.

Metrics
Two metrics are used: Cosine similarity (COS) and Pearson correlation coefficient (PCC). COS is one of the most widely used spectrum similarity measures for mass spectrometry. It measures the similarity between two nonzero vectors by calculating the angle between them (Eq. 1, calculated by the Python scikit-learn package [20]). COS ranges from − 1 to + 1 (angle from 180°to 0°).
The PCC measures the linear correlation between two variables X and Y (Eq. 2, calculated by the Python Scipy package [21]). It ranges from 1 to − 1, where 1 denotes a completely positive correlation, − 1 a completely negative correlation, and 0 a random correlation or two variables that have no association.
Evaluation methods

MS 2 PIP
Recently, MS 2 PIP released a new prediction model using XGBoost [22]; the previous random-forest model [13] was not available. Thus, we used the latest MS 2 PIP model for benchmark comparison. The local standalone version (Python code downloaded from [23])