A new scoring function for top-down spectral deconvolution

Kou, Qiang; Wu, Si; Liu, Xiaowen

doi:10.1186/1471-2164-15-1140

Research article
Open access
Published: 18 December 2014

A new scoring function for top-down spectral deconvolution

Qiang Kou¹,
Si Wu² &
Xiaowen Liu^1,3

BMC Genomics volume 15, Article number: 1140 (2014) Cite this article

2648 Accesses
12 Citations
Metrics details

Abstract

Background

Top-down mass spectrometry plays an important role in intact protein identification and characterization. Top-down mass spectra are more complex than bottom-up mass spectra because they often contain many isotopomer envelopes from highly charged ions, which may overlap with one another. As a result, spectral deconvolution, which converts a complex top-down mass spectrum into a monoisotopic mass list, is a key step in top-down spectral interpretation.

Results

In this paper, we propose a new scoring function, L-score, for evaluating isotopomer envelopes. By combining L-score with MS-Deconv, a new software tool, MS-Deconv+, was developed for top-down spectral deconvolution. Experimental results showed that MS-Deconv+ outperformed existing software tools in top-down spectral deconvolution.

Conclusions

L-score shows high discriminative ability in identification of isotopomer envelopes. Using L-score, MS-Deconv+ reports many correct monoisotopic masses missed by other software tools, which are valuable for proteoform identification and characterization.

Background

In the last two decades, bottom-up mass spectrometry (MS) has been the mainstream of proteomics analysis [1–4]. Although it is efficient and high-throughput for protein identification and quantification, bottom-up MS has its limitations. It involves a sample preparation step in which long proteins are digested into short peptides by proteases, reducing its ability to identify various proteoforms with multiple changes, such as mutations, post-translational modifications (PTMs), and degradations [5, 6]. In contrast, top-down MS analyzes intact proteins, making it the method of choice for complex proteoform identification.

In a mass spectrum, each peak is represented as (m/z,i n t e n s i t y), where m/z and intensity are the mass-to-charge ratio and abundance of its corresponding ion, respectively. Because of the existence of natural isotopes, ions of the same chemical formula and charge state have different m/z values and correspond to a list of spectral peaks in a mass spectrum, called an isotopomer envelope. The monoisotopic mass of an ion is the sum of its atomic masses using the most abundant isotope for each of its atoms.

Compared with bottom-up mass spectra, top-down mass spectra are more complex because they often contain many high charge state isotopomer envelopes, some of which overlap with one another [7, 8]. As a result, a key step in top-down spectral interpretation is to deconvolute a complex top-down mass spectrum to a list of monoisotopic masses.

Given the chemical formula and charge state of an ion, its theoretical isotopomer distribution can be calculated based on the frequencies of natural isotopes. When the chemical formula is unknown and the only available information is its monoisotopic or average mass, the well-known averagine model [9] can be used to estimate the chemical formula from the monoisotopic or average mass. A theoretical isotopomer distribution is represented as a list of theoretical peaks (m/z,p r o b a b i l i t y), in which m/z and probility are the mass-to-charge ratio and probability of the corresponding isotopomer. In top-down spectral deconvolution, theoretical isotopomer distributions are utilized to identify and group isotopic peaks.

Spectral deconvolution of profile mass spectra has been studied by several groups [7, 10]. In this paper, we focus on centroided spectra. While profile spectra keep all information of raw data, centroided spectra simplify data representation and speed up spectral deconvolution. Similar to mass spectra, isotopomer distributions can be represented in the profile or the centroided mode (Additional file 1: Figures S1 and S2). The centroided mode will be used in the proposed scoring function.

Many software tools have been developed for top-down spectral deconvolution [7, 10–12]. Most tools deconvolute a top-down mass spectrum in four steps. First, candidate isotopomer envelopes are extracted from the experimental spectrum and matched to theoretical isotopomer distributions. Second, the theoretical isotopomer distribution in a match is converted into a theoretical isotopomer envelope by scaling the probabilities to theoretical peak intensities. The scale ratio is estimated based on the peak intensities of the experimental isotopomer envelope. Third, the matches are evaluated by a scoring function, and a match is reported only if its score is higher than a specified threshold. Finally, a monoisotopic mass is obtained from each of the reported isotopomer envelopes.

The scoring function for evaluating experimental isotopomer envelopes determines the accuracy and sensitivity of spectral deconvolution. Designing a good scoring function is a challenging problem because complex mass spectra often contain many noise peaks and overlapping isotopomer envelopes. Most software tools use scoring functions based on the intensities of peaks in a pair of experimental and theoretical isotopomer envelopes, such as the sum of squared errors of peak intensities [7], the ratios of neighbouring peak intensities [13], and the dot product of intensity distributions [12]. The scoring function in MS-Deconv [8] combines peak intensities and errors in m/z values.

In this paper, we present a new scoring function, L-score, for computing the similarity between a pair of experimental and theoretical isotopomer envelopes. L-score can be used independently for spectral deconvolution or combined with other spectral deconvolution tools for envelope selection. We developed a software tool, MS-Deconv+, by combining MS-Deconv and L-score. Experiments showed that MS-Deconv+ outperformed other existing software tools in top-down spectral deconvolution.

Methods

Data sets

A data set from Salmonella typhimurium (ST) [14] was used for training and testing L-score. Cell lysate obtained from ST was analyzed with a C4-based high-performance liquid chromatography (HPLC) column coupled with an LTQ-Orbitrap mass spectrometer. A total of 4,636 collision induced dissociation (CID) tandem mass spectra were acquired. The charge states of the spectra range from 1 to 24; the precursor masses of the spectra range from 1,000 to 20,000 Dalton (Da). (See Ref. [14] for the detailed experimental procedure.)

Two Escherichia coli (EC) data sets were utilized to test L-score and MS-Deconv+. Cell lysate of EC was analyzed by a reversed phase liquid-chromatography (RPLC) coupled with an LTQ-Orbitrap Velos mass spectrometer. A total of 3,704 higher-energy C-trap dissociation (HCD) and 4,045 electron-transfer dissociation (ETD) tandem mass spectra were collected at a resolution of 60,000.

Theoretical and experimental envelopes

Since the proposed scoring function is designed for centroided data, only centroided isotopomer distributions and centroided mass spectra are studied. In a centroided isotopomer distribution, two isotopomers are treated as the same if they contain the same number of neurons. For example, a water molecule with two ¹H atoms and one ¹⁸O atom and another water molecule with two ²H atoms and one ¹⁶O atom are treated as the same although their masses are slightly different. As a result, isotopomers with the same number of neurons are represented by one peak. When the charge state of an ion is z, the distance between two neighbouring peaks in its centroided theoretical isotopomer distribution is approximately 1.00235/z thomson (Th) [7].

Top-down mass spectra contain some noise peaks. The noise intensity level of a spectrum is estimated by plotting the histogram of the peak intensity distribution and assuming that it is in the intensity bin with the largest number of peaks [7]. A peak is considered as a signal peak if its intensity is higher than the noise intensity level. In addition, a charge state is valid if it is no larger than a user-defined parameter.

We generate a theoretical isotopomer envelope as follows. First, we select a signal peak from a mass spectrum and a valid charge state z. The signal peak is called the base peak of the theoretical and its corresponding experimental isotopomer envelopes. Second, using the averagine model, we find a monoisotopic mass and its corresponding theoretical isotopomer distribution with the charge state z such that the m/z value of its most abundant isotopomer equals that of the base peak. Third, the peaks in the theoretical isotopomer distribution are matched to experimental peaks with similar m/z values in the spectrum. Finally, the intensities of the theoretical peaks are initialized as their probabilities and further scaled based on the intensities of the matched experimental peaks (Figure 1). Following the method in MS-Deconv, we scale theoretical peak intensities so that the sum of the intensities of the top three theoretical peaks equals that of their corresponding experimental peaks. If the scaled intensity of a theoretical peak is not higher than the noise intensity level, the theoretical peak is removed. The list of the remaining scaled peaks is referred to as a theoretical isotopomer envelope, or a theoretical envelope for brevity.

Given a theoretical envelope, a list of experimental peaks is extracted from the mass spectrum to form its corresponding experimental envelope by matching each peak in the theoretical envelope to an experimental peak with a similar m/z value (within an error tolerance). If such an experimental peak is not found, we add into the spectrum a zero-intensity peak whose m/z value is equal to the theoretical peak. A theoretical envelope and its matched experimental envelope are called an envelope match.

Training and test data sets

We generated and annotated a set of envelope matches from the ST data set for training and testing L-score. In short, after tandem mass spectra were identified by database search, the resulting protein-spectrum-matches were utilized to obtain annotated envelope matches. The detailed steps are described below.

ReAdW (http://tools.proteomecenter.org/software.php) was used to convert the Thermo raw file into a centroided mzXML file. MS-Deconv [8] was applied to extract a list of monoisotopic masses and their corresponding envelope matches from each tandem mass spectrum of the ST data set. The deconvoluted mass lists were searched against a target-decoy concatenated ST proteome database using MS-Align+ [15]. Default parameter settings were used in MS-Deconv and MS-Align+. The Benjamini-Hochberg procedure [16, 17] was employed to estimate false discovery rates (FDRs) for identified protein-spectrum-matches. When the E-value cutoff was 5.74×10^-4, a total of 493 target protein-spectrum-matches were identified and no decoy protein-spectrum-matches were reported. We assume the 493 target protein-spectrum-matches are all correct because they have an estimated 0% spectrum level FDR. Of the “correct” identifications, 468 protein-spectrum-matches, from 83 proteoforms of 67 proteins, do not contain PTMs (some may have truncations).

Since the PTM localization problem has not been fully solved in top-down spectral analysis, we used only the 468 protein-spectrum-matches without PTMs to generate training and test envelope matches. For each of the 83 proteoforms, we selected only one identified spectrum with the largest number of monoisotopic masses to remove similar spectra. The resulting 83 spectra contained 7,995 envelope matches. If the monoisotopic mass of an envelope match was mapped to a theoretical fragment ion of the identified proteoform within 15 parts per million (ppm), the envelope match was labeled as “correct”, otherwise, “incorrect”. Since the data set contains only CID tandem mass spectra, b- and y-ions as well as b- and y-ions with neutral losses (b-H ₂O, b-N ₃H, y-H ₂O, y-N ₃H) were used for labeling envelope matches. In addition, ±1 Da errors were allowed in mapping monoisotopic masses of envelopes to theoretical fragment ions because they are common in extracting monoisotopic masses from isotopomer envelopes. Out of the 7,995 envelope matches, 3,726 were labeled as “correct”, and 4,269 were labeled as “incorrect”.

L-score uses several features whose computation involves the number of peak pairs in an envelope match. Thus, we divided the 7,995 envelope matches into 4 groups with 2, 3, 4, and ≥5 peak pairs, which contained 924,1,284,2,017, and 3,770 envelope matches, respectively. The envelope matches in each group were randomly divided into training and test envelope matches of the same size. (If one group contains 2n+1 envelope matches, where n is an integer, the training data set contains n envelope matches and the test data set contains n+1 envelope matches).

We also generated a test set of envelope matches from the EC HCD data set. Following the method for the ST data set, we identified 1,537 protein-spectrum-matches with an estimated 0% spectrum level FDR, including 625 protein-spectrum-matches without PTMs from 242 proteoforms of 109 proteins. For each of the 242 proteoforms, we chose a matched spectrum with the largest number of monoisotopic masses. Finally, a set of 27,091 envelope matches was obtained, including 9,744 “correct” and 17,347 “incorrect” ones. They were further divided into 4 groups with 2,3,4, and ≥5 peak pairs, which contained 1,535, 4,572, 3,894, and 17,090 envelope matches, respectively.

Features of envelope matches

Let S be an experimental mass spectrum. A peak in an isotopomer envelope is represented by a pair (x,y), where x and y are the m/z value and intensity, respectively. Let E=(x₁,y₁),(x₂,y₂),⋯, (x_k,y_k) be a theoretical envelope where x₁<x₂<…<x_k, and E^′=(x1′,y1′),(x2′,y₂),⋯,(x k′,y k′) its corresponding experimental envelope in S. Each theoretical peak (x_i,y_i) is mapped to the experimental peak (x i′,y i′) for 1≤i≤k. Below we describe five features for distinguishing correct envelope matches from incorrect ones.

M/zvalues

In a correct experimental envelope, a peak is likely to have the same m/z value to its corresponding theoretical peak. Differences in m/z values between experimental and theoretical peaks are an effective feature for envelope evaluation. The squared m/z error between two peaks (x,y) and (x^′,y^′) is (x-x^′)² (Additional file 1: Figure S3). The m/z distance between E and E^′ is the root mean square deviation of the m/z values of their matched peak pairs. If a theoretical peak does not match any experimental peak and a zero-intensity peak is added to form a peak pair, the peak pair is excluded from the computation of the m/z distance. Let P be the set of peak pairs of E and E^′ without zero-intensity peaks. We define

d_{x} (E, E^{'}) = \sqrt{\frac{\sum_{((x, y), (x^{'}, y^{'})) \in P} {(x - x^{'})}^{2}}{| P |}} .

Peak intensity distributions

The difference between the peak intensities of a theoretical peak and its corresponding experimental peak in correct envelope matches is often smaller than that in incorrect ones [7]. To design the distance function for peak intensities used in L-score, the following factors are considered. First, experimental envelopes have various average peak intensities. To compare these envelopes, raw peak intensities are converted into relative intensities by dividing them by the largest peak intensity in the theoretical envelope. For a peak with raw intensity y, the relative intensity of the peak is r(y)=y/y_h, where y_h is the raw intensity of the highest peak in the theoretical envelope (Additional file 1: Figure S4). Second, a correct experimental peak may overlap with peaks from other envelopes, making its intensity error very large. To make the feature more reliable, a threshold is introduced so that the distance function is not significantly affected by one very large error in a pair of matched peaks. Third, the difference between the intensities of a theoretical peak (x,y) and its corresponding experimental peak (x^′,y^′) may be large, e.g., r(y^′)-r(y)>0.5 or r(y^′)-r(y)<-0.5. The main reason for the first case (r(y^′)-r(y)>0.5) is that the experimental peak overlaps with other peaks, but the reason for the second case (r(y^′)-r(y)<-0.5) is not clear. It is more frequent to observe the first case than the second (Additional file 1: Figure S5). Thus, a penalty factor is applied to the second case. Let t be the threshold for large errors and c the penalty factor. The distance function of a theoretical peak p=(x,y) and its corresponding experimental peak p^′=(x^′,y^′) is

d_{y} (p, p^{'}) = \{\begin{array}{l} min {| r (y) - r (y^{'}) |, t}, & if y < y^{'}; \\ c \cdot min {| r (y) - r (y^{'}) |, t}, & otherwise . \end{array}

In the experiments, t=0.5 and c=2. The distance between the intensity distributions of E and E^′ is the root mean square of the intensity distances of their matched peak pairs:

d_{y} (E, E^{'}) = \sqrt{\frac{\sum_{i = 1}^{k} {(d_{y} (p_{i}, p_{i}^{'}))}^{2}}{k}} .

Supporting envelopes

In top-down spectral deconvolution, the first step is to extract from a mass spectrum a list of candidate experimental envelopes that satisfy some basic requirements [8]. For example, a candidate experimental envelope cannot have 3 or more missing peaks. If the candidate envelope list contains two envelopes that have the same monoisotopic mass and different charge states, then one envelope is called a supporting envelope of the other. For an experimental envelope E^′ with f supporting envelopes, we define

s (E^{'}) = \{\begin{array}{l} f, & if f \leq 3; \\ 3, & otherwise . \end{array}

Neutral loss envelopes

If the monoisotopic masses m₁ and m₂ of two envelopes E1′ and E2′ in the candidate envelope list satisfy that m₁-m₂ equals (within an error tolerance) the mass of an NH ₃ or H ₂O molecule, then E2′ is a neutral loss envelope of E1′. For an experimental envelope E^′ with f neutral loss envelopes, we define

l (E^{'}) = \{\begin{array}{l} f, & if f \leq 3; \\ 3, & otherwise . \end{array}

In the implementation of L-score, the envelope detection and selection methods in MS-Deconv are used to generate candidate envelope lists, in which supporting envelopes and neutral loss envelopes are identified.

Missing peak numbers

Peaks may be absent from experimental envelopes. In the generation of candidate envelopes, an experimental envelope is removed if it has ≥3 missing peaks, 3 theoretical peaks and 2 missing peaks, or 2 theoretical peaks and 1 missing peak. In addition, an experimental envelope is removed if it does not contain k-3 consecutive matched peaks, where k is the number of peaks in the theoretical envelope. As a result, most missing peaks in experimental envelopes are at the ends of isotopomer distributions. Therefore, locations of missing peaks are not included in L-score. Since envelope matches without missing peaks have a higher accuracy rate than those with missing peaks (Additional file 1: Figure S6), we introduce another feature m(E^′) to represent the number of missing peaks in an experimental envelope E^′.

The scoring function

We designed L-score using a linear combination of the five features:

\begin{array}{l} L (E, E^{'}) = & a_{1} d_{x} (E, E^{'}) + a_{2} d_{y} (E, E^{'}) + a_{3} s (E^{'}) + a_{4} l (E^{'}) \\ + a_{5} m (E^{'}) . \end{array}

Logistic regression was applied to find the weights in the linear combination for each of the 4 groups (the peak pair number =2,3,4,≥5) using the ST training envelope matches. The resulting weights are listed in Table 1. The largest (absolute value) weight is from the feature of m/z distances, showing the importance of this feature.

Table 1 The weights of the features in L-score reported by logistic regression using the ST training envelope matches

Full size table

To compare envelope matches from different peak pair number groups, we trained a lookup table for each peak pair number group to convert raw scores L(E,E^′) to local FDRs using the ST training data set. Given a raw score and a peak pair number group, we count the numbers of correct and incorrect envelope matches in the training group whose scores are similar to the given score (the bin size is 0.02) and use the two numbers to estimate the local FDR. In practice, candidate envelope matches of a top-down mass spectrum are ranked and selected based on their estimated local FDRs.

Combination of MS-Deconv and L-score

MS-Deconv deconvolutes top-down mass spectra with four steps. First, a list of envelope matches is generated by enumerating all valid charge states and all signal peaks in a mass spectrum as base peaks. Second, all envelope matches are filtered based on the number of missing peaks and the number of consecutive matched peaks. Third, a graph model is employed to select a small number of envelope matches from the list that can explain the spectrum well. Fourth, the number x of envelope matches to report is specified by the user or estimated by the precursor mass when a tandem mass spectrum is analyzed. (When the precursor mass is M, the length L of the target protein is estimated as ⌈M/m_avg⌉, where m_avg is the average mass of the 20 amino acid residues, and the number of envelope matches to report is estimated as 2(L-1).) The envelope matches selected in the previous step are ranked by their similarity scores, and the top x envelope matches are reported. Finally, monoisotopic masses are extracted from the top x envelope matches. The similarity scoring function used in MS-Deconv is referred to as M-score.

To combine L-score with MS-Deconv, M-score is replaced by L-score (and the local FDR) in the fourth step of MS-Deconv (Additional file 1: Figure S7). By combining MS-Deconv and L-score, we developed a new software tool, MS-Deconv+, for top-down spectral deconvolution. Since local FDRs are reported with L-scores for envelope matches in MS-Deconv+, a local FDR threshold can be specified to decide the number of envelope matches to report.

In practice, MS-Deconv+ with default weights of features is first used to analyze data sets that are different from the training data set. To further improve the performance of MS-Deconv+, MS-Align+ can be utilized to identify highly confident protein-spectrum-matches and generate a set of training envelope matches to train the weights of features.

Results and discussion

We implemented L-score and MS-Deconv+ in Java and tested them on the ST and EC data sets.

Comparison of distance functions for peak intensity distributions

We proposed a function d_y(E,E^′) for measuring the distance between the peak intensity distributions of a theoretical envelope E and an experimental envelope E^′. To evaluate the performance of the function, we compared it with the dot product and the Kullback-Leibler (KL) divergence of peak intensity distributions on the ST test envelope matches. The dot-product is a function for computing the similarity between two vectors, which is used in Hardklör [12]. In an envelope match (E,E^′), the peak intensity distributions of the theoretical envelope E=(x₁,y₁),(x₂,y₂),…,(x_k,y_k) and the experimental envelope E^′=(x1′,y1′),(x2′,y2′),…,(x k′,y k′) are represented as two vectors (y₁,y₂,…,y_k) and (y1′,y2′,…,y k′). The two vectors are normalized to unit vectors before the dot product is calculated. The KL divergence is a function for measuring the relative entropy of one distribution from another distribution. For two discrete probability distributions P and Q, the KL divergence of Q from P is $\sum_{i} ln (\frac{P (i)}{Q (i)}) P (i)$ . To compute the KL divergence of E^′ from E, the two vectors (y₁,y₂,…,y_k) and (y1′,y2′,…,y k′) are converted into two probability distributions by dividing each peak intensity by the sum of peak intensities of the envelope. The three functions were tested on the 4 groups (the peak pair number =2,3,4,≥5) of the ST test envelope matches and compared based on the area under the curve (AUC) with respect to the receiver operating characteristic (ROC). The comparison shows that d_y(E,E^′) is more powerful than the other two functions in discriminating correct envelope matches from incorrect ones, especially when the envelopes contain 4 peak pairs (Figure 2).

Discriminative abilities of single features and L-score

We tested the discriminative abilities of the five single features and L-score on the ST test data set and the EC HCD test data set (Figure 3). The m/z distance has the best AUC among all the features. Compared with the single features, L-score improves the discriminative ability, demonstrating the advantage of combining multiple features (Figure 3).

Some test envelope matches have missing peaks, but the features for m/z distances and peak intensity distributions do not utilize this important information. We further compared the performance of the two features and L-score on envelope matches without missing peaks (Figure 4). L-score still outperformed the two single features in evaluating envelope matches.

Comparison with other scoring functions

We compared L-score with M-score, the dot product, and the KL divergence on the 3,998 envelope matches in the test ST data set and the 16,020 envelope matches in the test EC ETD data set. (See Additional file 1 for the parameter settings.) The ROC curves of the four functions demonstrate that M-score and L-score significantly increase the AUC compared with the other two functions (Figure 5). Compared with M-score, L-score increases the AUC from 0.696 to 0.825 for the ST test envelope matches and from 0.678 to 0.816 for the EC HCD test envelope matches.

Combination of L-score and Decon2LS

Decon2LS [11], a reimplementation of THRASH [7], reports a list of ranked envelope matches from a top-down mass spectrum. To test L-score coupled with Decon2LS, L-score was utilized to re-rank the envelope matches in the list reported by Decon2LS. Two lists of ranked envelope matches (one by Decon2LS and the other by L-score coupled with Decon2LS) were generated for each of 242 mass spectra in the EC HCD test data set. For each i=1,2,…,20, we collected two sets of envelope matches with the rank i from the lists of ranked envelope matches reported by Decon2LS and L-score coupled with Decon2LS and then compared their accuracy rates (Additional file 1: Figure S8). L-score coupled with Decon2LS reported more correct top ranked enveloped matches than Decon2LS. In practice, when Decon2LS reports x envelope matches from a mass spectrum, the following procedure can be used to boost the accuracy rate of reported envelope matches. The RL-value threshold of Decon2LS is lowered so that the number of envelope matches extracted from the spectrum is larger than x. Then L-score is utilized to re-rank the envelope matches, and only the top x ones are reported.

Comparison of Decon2LS, MS-Deconv and MS-Deconv+ on spectral identification

All the tandem mass spectra in the EC HCD and ETD data sets were deconvoluted by Decon2LS, MS-Deconv, and MS-Deconv+; the deconvoluted mass lists reported by the three tools were searched against the EC proteome for protein identification using MS-Align+ [15]. (See Additional file 1 for the parameter settings of MS-Align+ and the three tools.) The EC proteome database was downloaded from the Swiss-Prot database, and a shuffled database of the same size was concatenated to the target protein database for estimation of FDRs. With 1% protein level FDR, MS-Deconv+ coupled with MS-Align+ identified more spectra (1,585 in HCD and 1,223 in ETD) than MS-Deconv (1,543 in HCD and 1,216 in ETD) and Decon2LS (1,526 in HCD and 620 in ETD) (Figure 6). The three methods shared a total of 1,352 spectral identifications in the EC HCD data set and 607 spectral identifications in the EC ETD data set. Although the performances of MS-Deconv+ and MS-Deconv were similar in the number of identified spectra, MS-Deconv+ reported more matched monoisotopic masses (55,731 in HCD and 24,235 in ETD) than MS-Deconv (41079 in HCD and 21,360 in ETD) and Decon2LS (39,991 in HCD and 10,479 in ETD) for the spectra identified by all the tools. These matched masses play an important role in localizing various changes in identified proteoforms.

Conclusions

In this paper, we proposed L-score for evaluating experimental isotopomer envelopes, which outperformed existing scoring functions in distinguishing correct experimentalenvelopes from incorrect ones. We further developed MS-Deconv+, a top-down spectral deconvolution tool that combines MS-Deconv and L-score. In the experiments on the two EC data sets, MS-Deconv+ reported more correct monoisotopic masses than MS-Deconv. These correct monoisotopic masses provide essential information for proteoform identification and characterization.

Declarations

Publication of this article was funded by a startup fund provided by Indiana University-Purdue University Indianapolis.

References

Mann M, Hendrickson RC, Pandey A: Analysis of proteins and proteomes by mass spectrometry. Ann Rev Biochem. 2001, 70: 437-473. 10.1146/annurev.biochem.70.1.437.
Article CAS PubMed Google Scholar
Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature. 2003, 422: 198-207. 10.1038/nature01511.
Article CAS PubMed Google Scholar
Domon B, Aebersold R: Mass spectrometry and protein analysis. Science. 2006, 312: 212-217. 10.1126/science.1124619.
Article CAS PubMed Google Scholar
Liu T, Belov ME, Jaitly N, Qian W-J, Smith RD: Accurate mass measurements in proteomics. Chem Rev. 2007, 107: 3621-3653. 10.1021/cr068288j.
Article CAS PubMed Central PubMed Google Scholar
Liu X, Hengel S, Wu S, Pevzner PA, Tolić N: Identification of ultramodified proteins using top-down tandem mass spectra. J Proteome Res. 2013, 12: 5830-5838. 10.1021/pr400849y.
Article CAS PubMed Central PubMed Google Scholar
Britton L-MP, Gonzales-Cope M, Zee BM, Garcia BA: Breaking the histone code with quantitative mass spectrometry. Expert Rev Proteomics. 2011, 8: 631-643. 10.1586/epr.11.47.
Article CAS PubMed Central PubMed Google Scholar
Horn DM, Zubarev RA, McLafferty FW: Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J Am Soc Mass Spectrom. 2000, 11: 320-332. 10.1016/S1044-0305(99)00157-9.
Article CAS PubMed Google Scholar
Liu X, Inbar Y, Dorrestein PC, Wynne C, Edwards N, Souda P, Whitelegge JP, Bafna V, Pevzner PA: Deconvolution and database search of complex tandem mass spectra of intact proteins: A combinatorial approach. Mol Cell Proteomics. 2010, 9: 2772-2782. 10.1074/mcp.M110.002766.
Article CAS PubMed Central PubMed Google Scholar
Senko MW, Beu SC, McLafferty FW: Determination of monoisotopic masses and ion populations for large biomolecules from resolved isotopic distributions. J Am Soc Mass Spectrom. 1995, 6: 229-233. 10.1016/1044-0305(95)00017-8.
Article CAS PubMed Google Scholar
Zabrouskov V, Senko MW, Du Y, Leduc RD, Kelleher NL: New and automated MSn approaches for top-down identification of modified proteins. J Am Soc Mass Spectrom. 2005, 16: 2027-2038. 10.1016/j.jasms.2005.08.004.
Article CAS PubMed Central PubMed Google Scholar
Jaitly N, Mayampurath A, Littlefield K, Adkins JN, Anderson GA, Smith RD: Decon2LS: An open-source software package for automated processing and visualization of high resolution mass spectrometry data. BMC Bioinformatics. 2009, 10: 87-10.1186/1471-2105-10-87.
Article PubMed Central PubMed Google Scholar
Hoopmann MR, Finney GL, MacCoss MJ: High-speed data reduction, feature detection, and MS/MS spectrum quality assessment of shotgun proteomics data sets using high-resolution mass spectrometry. Anal Chem. 2007, 79: 5620-5632. 10.1021/ac0700833.
Article CAS PubMed Central PubMed Google Scholar
Park K, Yoon JY, Lee S, Paek E, Park H, Jung H-J, Lee S-W: Isotopic peak intensity ratio based algorithm for determination of isotopic clusters and monoisotopic masses of polypeptides from high-resolution mass spectrometric data. Anal Chem. 2008, 80: 7294-7303. 10.1021/ac800913b.
Article CAS PubMed Google Scholar
Tsai YS, Scherl A, Shaw JL, MacKay CL, Shaffer SA, Langridge-Smith PRR, Goodlett DR: Precursor ion independent algorithm for top-down shotgun proteomics. J Am Soc Mass Spectrom. 2009, 20: 2154-2166. 10.1016/j.jasms.2009.07.024.
Article CAS PubMed Google Scholar
Liu X, Sirotkin Y, Shen Y, Anderson G, Tsai YS, Ting YS, Goodlett DR, Smith RD, Bafna V, Pevzner PA: Protein identification using top-down spectra. Mol Cell Proteomics. 2012, 11: 111-008524.
Article CAS Google Scholar
Benjamini Y, Hochberg Y: Controlling the false discovery rate - a practical and powerful approach to multiple testing. J R Stat Soc Series B-Methodol. 1995, 57: 289-300.
Google Scholar
Storey JD: A direct approach to false discovery rates. J R Stat Soc Series B-Stat Methodol. 2002, 64: 479-498. 10.1111/1467-9868.00346.
Article Google Scholar

Download references

Acknowledgements

This work was supported by a startup fund provided by Indiana University-Purdue University Indianapolis. Portions of this work were supported by funds from EMSL intramural research project and performed at EMSL, a national scientific user facility sponsored by DOE-BER and located at Pacific Northwest National Laboratory (PNNL). PNNL is a multi-program national laboratory operated by Battelle for DOE under Contract DE-AC05-76RL01830.

Author information

Authors and Affiliations

Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis, 535 W. Michigan Street, Indianapolis, IN, 46202, USA
Qiang Kou & Xiaowen Liu
Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, 902 Battelle Boulevard, Richland, WA, 99352, USA
Si Wu
Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 410 West 10th Street, HS 5000, Indianapolis, IN, 46202, USA
Xiaowen Liu

Authors

Qiang Kou
View author publications
You can also search for this author in PubMed Google Scholar
Si Wu
View author publications
You can also search for this author in PubMed Google Scholar
Xiaowen Liu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xiaowen Liu.

Additional information

Competing interests

The authors declare that there are no competing interests.

Authors’ contributions

XL and QK designed the methods, SW generated the test data sets, and QK implemented the methods in Java. All authors have read and approved the final manuscript.

Electronic supplementary material

Additional file 1: Supplementary material.(PDF 139 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Kou, Q., Wu, S. & Liu, X. A new scoring function for top-down spectral deconvolution. BMC Genomics 15, 1140 (2014). https://doi.org/10.1186/1471-2164-15-1140

Download citation

Received: 16 May 2014
Accepted: 11 December 2014
Published: 18 December 2014
DOI: https://doi.org/10.1186/1471-2164-15-1140

A new scoring function for top-down spectral deconvolution

Abstract

Background

Results

Conclusions

Background

Methods

Data sets

Theoretical and experimental envelopes

Training and test data sets

Features of envelope matches

M/zvalues

Peak intensity distributions

Supporting envelopes

Neutral loss envelopes

Missing peak numbers

The scoring function

Combination of MS-Deconv and L-score

Results and discussion

Comparison of distance functions for peak intensity distributions

Discriminative abilities of single features and L-score

Comparison with other scoring functions

Combination of L-score and Decon2LS

Comparison of Decon2LS, MS-Deconv and MS-Deconv+ on spectral identification

Conclusions

Declarations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

Additional file 1: Supplementary material.(PDF 139 KB)

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Genomics

Contact us