A new scoring function for top-down spectral deconvolution
© Kou et al.; licensee BioMed Central. 2014
Received: 16 May 2014
Accepted: 11 December 2014
Published: 18 December 2014
Top-down mass spectrometry plays an important role in intact protein identification and characterization. Top-down mass spectra are more complex than bottom-up mass spectra because they often contain many isotopomer envelopes from highly charged ions, which may overlap with one another. As a result, spectral deconvolution, which converts a complex top-down mass spectrum into a monoisotopic mass list, is a key step in top-down spectral interpretation.
In this paper, we propose a new scoring function, L-score, for evaluating isotopomer envelopes. By combining L-score with MS-Deconv, a new software tool, MS-Deconv+, was developed for top-down spectral deconvolution. Experimental results showed that MS-Deconv+ outperformed existing software tools in top-down spectral deconvolution.
L-score shows high discriminative ability in identification of isotopomer envelopes. Using L-score, MS-Deconv+ reports many correct monoisotopic masses missed by other software tools, which are valuable for proteoform identification and characterization.
In the last two decades, bottom-up mass spectrometry (MS) has been the mainstream of proteomics analysis [1–4]. Although it is efficient and high-throughput for protein identification and quantification, bottom-up MS has its limitations. It involves a sample preparation step in which long proteins are digested into short peptides by proteases, reducing its ability to identify various proteoforms with multiple changes, such as mutations, post-translational modifications (PTMs), and degradations [5, 6]. In contrast, top-down MS analyzes intact proteins, making it the method of choice for complex proteoform identification.
In a mass spectrum, each peak is represented as (m/z,i n t e n s i t y), where m/z and intensity are the mass-to-charge ratio and abundance of its corresponding ion, respectively. Because of the existence of natural isotopes, ions of the same chemical formula and charge state have different m/z values and correspond to a list of spectral peaks in a mass spectrum, called an isotopomer envelope. The monoisotopic mass of an ion is the sum of its atomic masses using the most abundant isotope for each of its atoms.
Compared with bottom-up mass spectra, top-down mass spectra are more complex because they often contain many high charge state isotopomer envelopes, some of which overlap with one another [7, 8]. As a result, a key step in top-down spectral interpretation is to deconvolute a complex top-down mass spectrum to a list of monoisotopic masses.
Given the chemical formula and charge state of an ion, its theoretical isotopomer distribution can be calculated based on the frequencies of natural isotopes. When the chemical formula is unknown and the only available information is its monoisotopic or average mass, the well-known averagine model  can be used to estimate the chemical formula from the monoisotopic or average mass. A theoretical isotopomer distribution is represented as a list of theoretical peaks (m/z,p r o b a b i l i t y), in which m/z and probility are the mass-to-charge ratio and probability of the corresponding isotopomer. In top-down spectral deconvolution, theoretical isotopomer distributions are utilized to identify and group isotopic peaks.
Spectral deconvolution of profile mass spectra has been studied by several groups [7, 10]. In this paper, we focus on centroided spectra. While profile spectra keep all information of raw data, centroided spectra simplify data representation and speed up spectral deconvolution. Similar to mass spectra, isotopomer distributions can be represented in the profile or the centroided mode (Additional file 1: Figures S1 and S2). The centroided mode will be used in the proposed scoring function.
Many software tools have been developed for top-down spectral deconvolution [7, 10–12]. Most tools deconvolute a top-down mass spectrum in four steps. First, candidate isotopomer envelopes are extracted from the experimental spectrum and matched to theoretical isotopomer distributions. Second, the theoretical isotopomer distribution in a match is converted into a theoretical isotopomer envelope by scaling the probabilities to theoretical peak intensities. The scale ratio is estimated based on the peak intensities of the experimental isotopomer envelope. Third, the matches are evaluated by a scoring function, and a match is reported only if its score is higher than a specified threshold. Finally, a monoisotopic mass is obtained from each of the reported isotopomer envelopes.
The scoring function for evaluating experimental isotopomer envelopes determines the accuracy and sensitivity of spectral deconvolution. Designing a good scoring function is a challenging problem because complex mass spectra often contain many noise peaks and overlapping isotopomer envelopes. Most software tools use scoring functions based on the intensities of peaks in a pair of experimental and theoretical isotopomer envelopes, such as the sum of squared errors of peak intensities , the ratios of neighbouring peak intensities , and the dot product of intensity distributions . The scoring function in MS-Deconv  combines peak intensities and errors in m/z values.
In this paper, we present a new scoring function, L-score, for computing the similarity between a pair of experimental and theoretical isotopomer envelopes. L-score can be used independently for spectral deconvolution or combined with other spectral deconvolution tools for envelope selection. We developed a software tool, MS-Deconv+, by combining MS-Deconv and L-score. Experiments showed that MS-Deconv+ outperformed other existing software tools in top-down spectral deconvolution.
A data set from Salmonella typhimurium (ST)  was used for training and testing L-score. Cell lysate obtained from ST was analyzed with a C4-based high-performance liquid chromatography (HPLC) column coupled with an LTQ-Orbitrap mass spectrometer. A total of 4,636 collision induced dissociation (CID) tandem mass spectra were acquired. The charge states of the spectra range from 1 to 24; the precursor masses of the spectra range from 1,000 to 20,000 Dalton (Da). (See Ref.  for the detailed experimental procedure.)
Two Escherichia coli (EC) data sets were utilized to test L-score and MS-Deconv+. Cell lysate of EC was analyzed by a reversed phase liquid-chromatography (RPLC) coupled with an LTQ-Orbitrap Velos mass spectrometer. A total of 3,704 higher-energy C-trap dissociation (HCD) and 4,045 electron-transfer dissociation (ETD) tandem mass spectra were collected at a resolution of 60,000.
Theoretical and experimental envelopes
Since the proposed scoring function is designed for centroided data, only centroided isotopomer distributions and centroided mass spectra are studied. In a centroided isotopomer distribution, two isotopomers are treated as the same if they contain the same number of neurons. For example, a water molecule with two 1H atoms and one 18O atom and another water molecule with two 2H atoms and one 16O atom are treated as the same although their masses are slightly different. As a result, isotopomers with the same number of neurons are represented by one peak. When the charge state of an ion is z, the distance between two neighbouring peaks in its centroided theoretical isotopomer distribution is approximately 1.00235/z thomson (Th) .
Top-down mass spectra contain some noise peaks. The noise intensity level of a spectrum is estimated by plotting the histogram of the peak intensity distribution and assuming that it is in the intensity bin with the largest number of peaks . A peak is considered as a signal peak if its intensity is higher than the noise intensity level. In addition, a charge state is valid if it is no larger than a user-defined parameter.
Given a theoretical envelope, a list of experimental peaks is extracted from the mass spectrum to form its corresponding experimental envelope by matching each peak in the theoretical envelope to an experimental peak with a similar m/z value (within an error tolerance). If such an experimental peak is not found, we add into the spectrum a zero-intensity peak whose m/z value is equal to the theoretical peak. A theoretical envelope and its matched experimental envelope are called an envelope match.
Training and test data sets
We generated and annotated a set of envelope matches from the ST data set for training and testing L-score. In short, after tandem mass spectra were identified by database search, the resulting protein-spectrum-matches were utilized to obtain annotated envelope matches. The detailed steps are described below.
ReAdW (http://tools.proteomecenter.org/software.php) was used to convert the Thermo raw file into a centroided mzXML file. MS-Deconv  was applied to extract a list of monoisotopic masses and their corresponding envelope matches from each tandem mass spectrum of the ST data set. The deconvoluted mass lists were searched against a target-decoy concatenated ST proteome database using MS-Align+ . Default parameter settings were used in MS-Deconv and MS-Align+. The Benjamini-Hochberg procedure [16, 17] was employed to estimate false discovery rates (FDRs) for identified protein-spectrum-matches. When the E-value cutoff was 5.74×10-4, a total of 493 target protein-spectrum-matches were identified and no decoy protein-spectrum-matches were reported. We assume the 493 target protein-spectrum-matches are all correct because they have an estimated 0% spectrum level FDR. Of the “correct” identifications, 468 protein-spectrum-matches, from 83 proteoforms of 67 proteins, do not contain PTMs (some may have truncations).
Since the PTM localization problem has not been fully solved in top-down spectral analysis, we used only the 468 protein-spectrum-matches without PTMs to generate training and test envelope matches. For each of the 83 proteoforms, we selected only one identified spectrum with the largest number of monoisotopic masses to remove similar spectra. The resulting 83 spectra contained 7,995 envelope matches. If the monoisotopic mass of an envelope match was mapped to a theoretical fragment ion of the identified proteoform within 15 parts per million (ppm), the envelope match was labeled as “correct”, otherwise, “incorrect”. Since the data set contains only CID tandem mass spectra, b- and y-ions as well as b- and y-ions with neutral losses (b-H 2O, b-N 3H, y-H 2O, y-N 3H) were used for labeling envelope matches. In addition, ±1 Da errors were allowed in mapping monoisotopic masses of envelopes to theoretical fragment ions because they are common in extracting monoisotopic masses from isotopomer envelopes. Out of the 7,995 envelope matches, 3,726 were labeled as “correct”, and 4,269 were labeled as “incorrect”.
L-score uses several features whose computation involves the number of peak pairs in an envelope match. Thus, we divided the 7,995 envelope matches into 4 groups with 2, 3, 4, and ≥5 peak pairs, which contained 924,1,284,2,017, and 3,770 envelope matches, respectively. The envelope matches in each group were randomly divided into training and test envelope matches of the same size. (If one group contains 2n+1 envelope matches, where n is an integer, the training data set contains n envelope matches and the test data set contains n+1 envelope matches).
We also generated a test set of envelope matches from the EC HCD data set. Following the method for the ST data set, we identified 1,537 protein-spectrum-matches with an estimated 0% spectrum level FDR, including 625 protein-spectrum-matches without PTMs from 242 proteoforms of 109 proteins. For each of the 242 proteoforms, we chose a matched spectrum with the largest number of monoisotopic masses. Finally, a set of 27,091 envelope matches was obtained, including 9,744 “correct” and 17,347 “incorrect” ones. They were further divided into 4 groups with 2,3,4, and ≥5 peak pairs, which contained 1,535, 4,572, 3,894, and 17,090 envelope matches, respectively.
Features of envelope matches
Let S be an experimental mass spectrum. A peak in an isotopomer envelope is represented by a pair (x,y), where x and y are the m/z value and intensity, respectively. Let E=(x1,y1),(x2,y2),⋯, (x k ,y k ) be a theoretical envelope where x1<x2<…<x k , and E′=(x1′,y1′),(x2′,y2),⋯,(x k′,y k′) its corresponding experimental envelope in S. Each theoretical peak (x i ,y i ) is mapped to the experimental peak (x i′,y i′) for 1≤i≤k. Below we describe five features for distinguishing correct envelope matches from incorrect ones.
Peak intensity distributions
Neutral loss envelopes
In the implementation of L-score, the envelope detection and selection methods in MS-Deconv are used to generate candidate envelope lists, in which supporting envelopes and neutral loss envelopes are identified.
Missing peak numbers
Peaks may be absent from experimental envelopes. In the generation of candidate envelopes, an experimental envelope is removed if it has ≥3 missing peaks, 3 theoretical peaks and 2 missing peaks, or 2 theoretical peaks and 1 missing peak. In addition, an experimental envelope is removed if it does not contain k-3 consecutive matched peaks, where k is the number of peaks in the theoretical envelope. As a result, most missing peaks in experimental envelopes are at the ends of isotopomer distributions. Therefore, locations of missing peaks are not included in L-score. Since envelope matches without missing peaks have a higher accuracy rate than those with missing peaks (Additional file 1: Figure S6), we introduce another feature m(E′) to represent the number of missing peaks in an experimental envelope E′.
The scoring function
The weights of the features in L-score reported by logistic regression using the ST training envelope matches
#peak pairs =2
#peak pairs =3
#peak pairs =4
#peak pairs ≥5
#neutral loss envelopes
To compare envelope matches from different peak pair number groups, we trained a lookup table for each peak pair number group to convert raw scores L(E,E′) to local FDRs using the ST training data set. Given a raw score and a peak pair number group, we count the numbers of correct and incorrect envelope matches in the training group whose scores are similar to the given score (the bin size is 0.02) and use the two numbers to estimate the local FDR. In practice, candidate envelope matches of a top-down mass spectrum are ranked and selected based on their estimated local FDRs.
Combination of MS-Deconv and L-score
MS-Deconv deconvolutes top-down mass spectra with four steps. First, a list of envelope matches is generated by enumerating all valid charge states and all signal peaks in a mass spectrum as base peaks. Second, all envelope matches are filtered based on the number of missing peaks and the number of consecutive matched peaks. Third, a graph model is employed to select a small number of envelope matches from the list that can explain the spectrum well. Fourth, the number x of envelope matches to report is specified by the user or estimated by the precursor mass when a tandem mass spectrum is analyzed. (When the precursor mass is M, the length L of the target protein is estimated as ⌈M/mavg⌉, where mavg is the average mass of the 20 amino acid residues, and the number of envelope matches to report is estimated as 2(L-1).) The envelope matches selected in the previous step are ranked by their similarity scores, and the top x envelope matches are reported. Finally, monoisotopic masses are extracted from the top x envelope matches. The similarity scoring function used in MS-Deconv is referred to as M-score.
To combine L-score with MS-Deconv, M-score is replaced by L-score (and the local FDR) in the fourth step of MS-Deconv (Additional file 1: Figure S7). By combining MS-Deconv and L-score, we developed a new software tool, MS-Deconv+, for top-down spectral deconvolution. Since local FDRs are reported with L-scores for envelope matches in MS-Deconv+, a local FDR threshold can be specified to decide the number of envelope matches to report.
In practice, MS-Deconv+ with default weights of features is first used to analyze data sets that are different from the training data set. To further improve the performance of MS-Deconv+, MS-Align+ can be utilized to identify highly confident protein-spectrum-matches and generate a set of training envelope matches to train the weights of features.
Results and discussion
We implemented L-score and MS-Deconv+ in Java and tested them on the ST and EC data sets.
Comparison of distance functions for peak intensity distributions
Discriminative abilities of single features and L-score
Comparison with other scoring functions
Combination of L-score and Decon2LS
Decon2LS , a reimplementation of THRASH , reports a list of ranked envelope matches from a top-down mass spectrum. To test L-score coupled with Decon2LS, L-score was utilized to re-rank the envelope matches in the list reported by Decon2LS. Two lists of ranked envelope matches (one by Decon2LS and the other by L-score coupled with Decon2LS) were generated for each of 242 mass spectra in the EC HCD test data set. For each i=1,2,…,20, we collected two sets of envelope matches with the rank i from the lists of ranked envelope matches reported by Decon2LS and L-score coupled with Decon2LS and then compared their accuracy rates (Additional file 1: Figure S8). L-score coupled with Decon2LS reported more correct top ranked enveloped matches than Decon2LS. In practice, when Decon2LS reports x envelope matches from a mass spectrum, the following procedure can be used to boost the accuracy rate of reported envelope matches. The RL-value threshold of Decon2LS is lowered so that the number of envelope matches extracted from the spectrum is larger than x. Then L-score is utilized to re-rank the envelope matches, and only the top x ones are reported.
Comparison of Decon2LS, MS-Deconv and MS-Deconv+ on spectral identification
In this paper, we proposed L-score for evaluating experimental isotopomer envelopes, which outperformed existing scoring functions in distinguishing correct experimentalenvelopes from incorrect ones. We further developed MS-Deconv+, a top-down spectral deconvolution tool that combines MS-Deconv and L-score. In the experiments on the two EC data sets, MS-Deconv+ reported more correct monoisotopic masses than MS-Deconv. These correct monoisotopic masses provide essential information for proteoform identification and characterization.
Publication of this article was funded by a startup fund provided by Indiana University-Purdue University Indianapolis.
This work was supported by a startup fund provided by Indiana University-Purdue University Indianapolis. Portions of this work were supported by funds from EMSL intramural research project and performed at EMSL, a national scientific user facility sponsored by DOE-BER and located at Pacific Northwest National Laboratory (PNNL). PNNL is a multi-program national laboratory operated by Battelle for DOE under Contract DE-AC05-76RL01830.
- Mann M, Hendrickson RC, Pandey A: Analysis of proteins and proteomes by mass spectrometry. Ann Rev Biochem. 2001, 70: 437-473. 10.1146/annurev.biochem.70.1.437.PubMedView ArticleGoogle Scholar
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature. 2003, 422: 198-207. 10.1038/nature01511.PubMedView ArticleGoogle Scholar
- Domon B, Aebersold R: Mass spectrometry and protein analysis. Science. 2006, 312: 212-217. 10.1126/science.1124619.PubMedView ArticleGoogle Scholar
- Liu T, Belov ME, Jaitly N, Qian W-J, Smith RD: Accurate mass measurements in proteomics. Chem Rev. 2007, 107: 3621-3653. 10.1021/cr068288j.PubMed CentralPubMedView ArticleGoogle Scholar
- Liu X, Hengel S, Wu S, Pevzner PA, Tolić N: Identification of ultramodified proteins using top-down tandem mass spectra. J Proteome Res. 2013, 12: 5830-5838. 10.1021/pr400849y.PubMed CentralPubMedView ArticleGoogle Scholar
- Britton L-MP, Gonzales-Cope M, Zee BM, Garcia BA: Breaking the histone code with quantitative mass spectrometry. Expert Rev Proteomics. 2011, 8: 631-643. 10.1586/epr.11.47.PubMed CentralPubMedView ArticleGoogle Scholar
- Horn DM, Zubarev RA, McLafferty FW: Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J Am Soc Mass Spectrom. 2000, 11: 320-332. 10.1016/S1044-0305(99)00157-9.PubMedView ArticleGoogle Scholar
- Liu X, Inbar Y, Dorrestein PC, Wynne C, Edwards N, Souda P, Whitelegge JP, Bafna V, Pevzner PA: Deconvolution and database search of complex tandem mass spectra of intact proteins: A combinatorial approach. Mol Cell Proteomics. 2010, 9: 2772-2782. 10.1074/mcp.M110.002766.PubMed CentralPubMedView ArticleGoogle Scholar
- Senko MW, Beu SC, McLafferty FW: Determination of monoisotopic masses and ion populations for large biomolecules from resolved isotopic distributions. J Am Soc Mass Spectrom. 1995, 6: 229-233. 10.1016/1044-0305(95)00017-8.PubMedView ArticleGoogle Scholar
- Zabrouskov V, Senko MW, Du Y, Leduc RD, Kelleher NL: New and automated MSn approaches for top-down identification of modified proteins. J Am Soc Mass Spectrom. 2005, 16: 2027-2038. 10.1016/j.jasms.2005.08.004.PubMed CentralPubMedView ArticleGoogle Scholar
- Jaitly N, Mayampurath A, Littlefield K, Adkins JN, Anderson GA, Smith RD: Decon2LS: An open-source software package for automated processing and visualization of high resolution mass spectrometry data. BMC Bioinformatics. 2009, 10: 87-10.1186/1471-2105-10-87.PubMed CentralPubMedView ArticleGoogle Scholar
- Hoopmann MR, Finney GL, MacCoss MJ: High-speed data reduction, feature detection, and MS/MS spectrum quality assessment of shotgun proteomics data sets using high-resolution mass spectrometry. Anal Chem. 2007, 79: 5620-5632. 10.1021/ac0700833.PubMed CentralPubMedView ArticleGoogle Scholar
- Park K, Yoon JY, Lee S, Paek E, Park H, Jung H-J, Lee S-W: Isotopic peak intensity ratio based algorithm for determination of isotopic clusters and monoisotopic masses of polypeptides from high-resolution mass spectrometric data. Anal Chem. 2008, 80: 7294-7303. 10.1021/ac800913b.PubMedView ArticleGoogle Scholar
- Tsai YS, Scherl A, Shaw JL, MacKay CL, Shaffer SA, Langridge-Smith PRR, Goodlett DR: Precursor ion independent algorithm for top-down shotgun proteomics. J Am Soc Mass Spectrom. 2009, 20: 2154-2166. 10.1016/j.jasms.2009.07.024.PubMedView ArticleGoogle Scholar
- Liu X, Sirotkin Y, Shen Y, Anderson G, Tsai YS, Ting YS, Goodlett DR, Smith RD, Bafna V, Pevzner PA: Protein identification using top-down spectra. Mol Cell Proteomics. 2012, 11: 111-008524.View ArticleGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate - a practical and powerful approach to multiple testing. J R Stat Soc Series B-Methodol. 1995, 57: 289-300.Google Scholar
- Storey JD: A direct approach to false discovery rates. J R Stat Soc Series B-Stat Methodol. 2002, 64: 479-498. 10.1111/1467-9868.00346.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.