Feature engineering
To apply a deep learning method to our dataset, each peptide sequence must be converted into a feature vector with a label. Table 2 lists the features we use to characterize a peptide sequence. These features include peptide composition (similar to amino acid composition), mass-to-charge ratio (m/z), and peptide physical-chemical properties such as isoelectric point, instability index, aromaticity, secondary structure fraction, helicity, hydrophobicity, and basicity. The feature vector includes the m/z and physical-chemical features not only of the peptide sequence itself but also of all possible b and y fragment ions. Take for example the peptide sequence AAAAAAAAGAFAGR (length = 14): its m/z is 577.80; its amino acid composition is {A: 10, C: 0, D: 0, E: 0, F: 1, G: 2, H: 0, I: 0, K: 0, L: 0, M: 0, N: 0, P: 0, Q: 0, R: 1, S: 0, T: 0, V: 0, W: 0, Y: 0}; and its physical-chemical properties {isoelectric point, instability index, aromaticity, helicity, hydrophobicity, basicity, secondary structure fraction} are {9.80, 3.22, 0.07, −0.21, 1.21, 208.46, (0.071, 0.14, 0.71)}. In addition, the m/z and physical-chemical properties of all 26 (= 2 × (14 − 1)) fragment ions are included in the feature vector. The total number of features for a peptide sequence is thus 290 (= 1 + 20 + 9 + 26 × 1 + 26 × 9). We used Pyteomics v3.4.2 [16] to compute the mass-to-charge ratio and Biopython v1.7 [17] to calculate the amino acid composition, instability index, isoelectric point, and secondary structure fraction.
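As a concrete illustration, the sketch below shows how such a feature vector can be assembled with the two packages cited above. It is a minimal sketch, not our production code: the precursor m/z, composition, Biopython-derived properties, and fragment m/z follow the description above, while helicity and basicity rely on per-residue scales that Biopython does not provide and are therefore only indicated in a comment.

```python
# Minimal sketch of the 290-dimensional feature vector described above
# (1 m/z + 20 composition + 9 properties + 26 fragment m/z + 26*9 fragment
# properties for a length-14 peptide). Helicity and basicity use per-residue
# scales not provided by Biopython and are omitted here.
from pyteomics import mass
from Bio.SeqUtils.ProtParam import ProteinAnalysis

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def physchem(seq):
    """Physical-chemical properties available directly from Biopython."""
    p = ProteinAnalysis(seq)
    return [p.isoelectric_point(), p.instability_index(), p.aromaticity(),
            p.gravy()] + list(p.secondary_structure_fraction())

def peptide_features(seq, charge=2):
    feats = [mass.calculate_mass(sequence=seq, charge=charge)]  # precursor m/z
    counts = ProteinAnalysis(seq).count_amino_acids()
    feats += [counts[aa] for aa in AMINO_ACIDS]                 # composition (20)
    feats += physchem(seq)                                      # peptide properties
    for i in range(1, len(seq)):                                # 2*(len-1) fragments
        b, y = seq[:i], seq[i:]
        feats += [mass.fast_mass(b, ion_type='b', charge=1),
                  mass.fast_mass(y, ion_type='y', charge=1)]    # fragment m/z
        feats += physchem(b) + physchem(y)                      # fragment properties
    return feats
```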
MS2CNN model
We propose MS2CNN, a DCNN model that uses the aforementioned features (Fig. 4). The model takes a peptide feature vector as input and passes it through a stack of layers, each consisting of a number of nonlinear function nodes; each output node of the MS2CNN model corresponds to one predicted peak intensity.
In the proposed model, each convolution layer is activated by the ReLU activation function and followed by a max-pooling layer; together they constitute one convolution-pooling layer. The convolution-pooling layer is repeated n times in MS2CNN, where n ranges from 2 to 7; the best n was determined by a cross-validation experiment. We fix the number of nodes in the convolutional layers at 10, except for the last convolutional layer, whose node number depends on the layer depth. Additional file 1: Table S1 lists the detailed configurations of the convolutional layers for depths 2 to 7. The repeated convolution-pooling layers are followed by a layer that flattens the output and then by a fully connected layer with twice as many nodes as there are output nodes. We implemented the MS2CNN architecture and executed the whole training process using the Keras Python package version 2.0.4 [18]. Figure 4 illustrates the MS2CNN model structure.
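A minimal Keras sketch of this architecture follows. The kernel widths, pool sizes, optimizer, and loss are illustrative assumptions (the exact per-depth configuration is given in Additional file 1: Table S1); only the overall structure – repeated convolution-pooling layers with 10-node convolutions, a flatten layer, and a fully connected layer with twice the number of output nodes – follows the description above.

```python
# Illustrative Keras (2.x) sketch of MS2CNN; kernel/pool sizes, optimizer,
# and loss are assumptions, not the configuration of Table S1.
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

def build_ms2cnn(n_features=290, n_outputs=26, n_conv=4):
    model = Sequential()
    model.add(Conv1D(10, kernel_size=3, activation='relu',
                     input_shape=(n_features, 1)))   # feature vector reshaped to (290, 1)
    model.add(MaxPooling1D(pool_size=2))             # one convolution-pooling layer
    for _ in range(n_conv - 1):                      # repeated n times (n = 2..7)
        model.add(Conv1D(10, kernel_size=3, activation='relu'))
        model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(2 * n_outputs, activation='relu'))  # fully connected layer
    model.add(Dense(n_outputs))                      # one node per predicted peak intensity
    model.compile(optimizer='adam', loss='mse')
    return model
```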
Datasets
Training data set
We downloaded the training set – a human HCD library based on an Orbitrap mass analyzer and LC-MS (liquid chromatography–mass spectrometry) – from the NIST website. The library is built from CPTAC and ProteomeXchange, two public repositories, and contains 1,127,971 spectra from 320,824 unique peptide sequences in .msp format. The dataset consists of peptides with charge states ranging from 1+ to 9+, of which only charge states 2+ and 3+ were selected, as there was not enough data for the other charges to effectively train a machine learning model. This strategy is consistent with previous studies.
De-duplicated spectrum
It is common for multiple spectra of the same peptide sequence and charge state to have different peak intensities for their fragment ions. We therefore performed a two-step process to generate a de-duplicated spectrum from the set of spectra for a given peptide. First, each peak in a spectrum was normalized by the maximum peak intensity of that spectrum. Then, the intensity of each b- and y-ion was set to the median intensity of the ion across the different spectra. This yielded a consensus spectrum that filters out noise that could degrade DCNN training. Additional file 1: Table S2 summarizes the number of spectra after de-duplication. For effective training of a complex DCNN model, the number of peptides should exceed 5000 after de-duplication. Based on this criterion, we focused on peptides of lengths 9 to 19 and eliminated the rest. This resulted in 166,371 charge 2+ peptides (70.4% of the 2+ peptides from NIST) and 98,364 charge 3+ peptides (69.6% of the 3+ peptides from NIST).
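The two-step procedure can be summarized in a few lines of NumPy; this is a sketch assuming the replicate spectra have already been aligned into a matrix of b-/y-ion intensities:

```python
import numpy as np

def consensus_spectrum(spectra):
    """spectra: (n_spectra, n_ions) matrix of b-/y-ion intensities for
    replicate spectra of one peptide and charge state."""
    spectra = np.asarray(spectra, dtype=float)
    normalized = spectra / spectra.max(axis=1, keepdims=True)  # step 1: normalize by base peak
    return np.median(normalized, axis=0)                       # step 2: per-ion median
```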
Independent test set
We used the data-dependent acquisition data of Orbitrap LC-MS experiments from [19] as an independent test set. This included 22,890 and 5998 spectra for charge 2+ and 3+ peptides, respectively. The proportion of peptides common to our training set and the independent test set exceeded 90%. Although these peptides are thus easier prediction targets, performance is still bounded by the theoretical upper bound: for example, the upper bounds of COS for charge 2+ and charge 3+ peptides range from 0.636 to 0.800 and from 0.617 to 0.781, respectively (detailed numbers shown in Table 1). The numbers of commonly observed peptides for different lengths are summarized in Additional file 1: Table S3.
Evaluation
K-fold cross validation
To select the best parameters (i.e., the number of layers) for the MS2CNN model and to prevent overfitting, we applied five-fold cross-validation with a three-way data split: the entire data set was partitioned into training, validation (10% of the training data), and test sets. Training continued as long as the accuracy of the validation set improved by at least 0.001 over the previous epoch; otherwise, training was terminated. The final model was selected based on validation performance and used to predict the test set for performance evaluation. Since the model was selected based on validation-set performance, there was no data-leakage problem, in which information from the test data is involved in model selection; such leakage can result in over-estimation of performance and unfair comparison with other methods.
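In Keras, this stopping rule corresponds to an EarlyStopping callback. The sketch below is an assumption about the wiring, not our exact training script: it presumes a compiled model and training arrays, and monitors the validation loss as the regression analogue of the validation accuracy described above.

```python
# Sketch of the stopping rule, assuming a compiled Keras model `model` and
# training arrays X, y. The text above monitors validation accuracy; for a
# regression loss, val_loss is the analogous quantity and is used here.
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=0)
model.fit(X, y,
          validation_split=0.1,   # 10% of the training data held out for validation
          epochs=100,             # upper bound; EarlyStopping usually ends training sooner
          callbacks=[early_stop])
```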
Metrics
Two metrics are used: the cosine similarity (COS) and the Pearson correlation coefficient (PCC). COS is one of the most widely used spectrum similarity measures in mass spectrometry. It measures the similarity between two non-zero vectors as the cosine of the angle between them (Eq. 1, calculated with the Python scikit-learn package [20]). COS ranges from −1 to +1 (angles from 180° to 0°).
$$ \cos \left(X,Y\right)=\frac{X{Y}^{T}}{\left\Vert X\right\Vert\, \left\Vert Y\right\Vert } $$
(1)
The PCC measures the linear correlation between two variables X and Y (Eq. 2, calculated with the Python SciPy package [21]). It ranges from +1 to −1, where +1 denotes a completely positive linear correlation, −1 a completely negative linear correlation, and 0 no linear correlation between the two variables.
$$ {\rho}_{XY}=\frac{\operatorname{cov}\left(X,Y\right)}{\sigma_X\,{\sigma}_Y} $$
(2)
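Both metrics can be computed with the packages cited above; a minimal sketch for one pair of predicted and observed intensity vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity  # Eq. 1
from scipy.stats import pearsonr                         # Eq. 2

def spectrum_scores(predicted, observed):
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    cos = cosine_similarity(predicted.reshape(1, -1),
                            observed.reshape(1, -1))[0, 0]
    pcc, _ = pearsonr(predicted, observed)  # second return value is the p-value
    return cos, pcc
```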
Evaluation methods
MS2PIP
Recently, MS2PIP released a new prediction model based on XGBoost [22]; the previous random-forest model [13] was not available. We therefore used the latest MS2PIP model for the benchmark comparison. The local standalone version (Python code downloaded from [23]) was used instead of the online server, as the latter is limited to a maximum of 5000 peptides per query.
We used the default MS2PIP settings from the GitHub config file, except for changing frag_method from HCD to HCDch2. In addition, the MGF output function was enabled to generate intensities without log2 transformation. To ensure a fair comparison, we processed the test data using the same peak-normalization procedure used for our training data.
pDeep
First, we converted each peptide to a 2D array using the pDeep API. Then, we loaded the pDeep model (.h5 format) and used it to predict the intensities of the peptide [14]. Although the pDeep documentation states that "if the precursor charge state is <= 2, 2+ ions should be ignored", to ensure a fair and complete comparison on charge 2+ peptides, we set the intensities of 2+ fragment ions in the test data to zero, as if they were missing from the pDeep prediction. pDeep provides three trained models – BiLSTM, ProteomeTools-ETD, and ProteomeTools-EThcD – of which we used the BiLSTM model for comparison, as it performed the best on both the COS and PCC metrics (Additional file 1: Table S6).
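The pDeep-specific encoding and prediction calls belong to its API and are not reproduced here. The sketch below only illustrates the surrounding steps described above – loading the .h5 model with Keras and zeroing 2+ fragment intensities in the test spectra – with peptide_to_2d_array() as a hypothetical stand-in for the pDeep conversion routine.

```python
# Hedged sketch of the pDeep comparison. peptide_to_2d_array() is a
# hypothetical stand-in for the pDeep API encoding step; only the Keras and
# NumPy calls are standard library interfaces.
import numpy as np
from keras.models import load_model

pdeep_model = load_model('pdeep_bilstm.h5')  # pretrained pDeep BiLSTM model (.h5)

def compare_on_peptide(peptide, observed, idx_2plus):
    x = peptide_to_2d_array(peptide)          # hypothetical pDeep encoding call
    predicted = pdeep_model.predict(x)
    observed = np.asarray(observed, dtype=float).copy()
    observed[idx_2plus] = 0.0                 # zero 2+ ions absent from pDeep output
    return predicted, observed
```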