Volume 13 Supplement 6
A systematic model of the LC-MS proteomics pipeline
© Sun et al.; licensee BioMed Central Ltd. 2012
Published: 26 October 2012
Mass spectrometry is a complex technique used for large-scale protein profiling with clinical and pharmaceutical applications. While individual components in the system have been studied extensively, little work has been done to integrate various modules and evaluate them from a systems point of view.
In this work, we investigate this problem by putting together the different modules in a typical proteomics work flow, in order to capture and analyze key factors that impact the number of identified peptides and quantified proteins, protein quantification error, differential expression results, and classification performance. The proposed proteomics pipeline model can be used to optimize the work flow as well as to pinpoint critical bottlenecks worth investing time and resources into for improving performance. Using the model-based approach proposed here, one can study systematically the critical problem of proteomic biomarker discovery, by means of simulation using ground-truthed synthetic MS data.
Mass spectrometry-based proteomics
Mass spectrometry (MS) is widely used for large-scale protein profiling with applications in biomarker discovery, signaling pathway monitoring [2, 3], drug development, and disease classification. In clinical applications of mass spectrometry, the number of samples available is usually in the range of tens to a few hundred (small sample size). The samples are analyzed by an MS instrument and transformed into a series of mass spectra containing hundreds of thousands of intensity measurements with signal generated by thousands of proteins/peptides (large feature dimension). This small-sample, high-dimensionality problem requires the experiment and analysis to be carefully designed and validated in order to arrive at statistically meaningful results.
The MS analysis pipeline consists of many steps, including sample preparation, protein digestion, ionization, peptide detection, protein quantification, and so on. The pipeline can be viewed as a noisy channel, where each processing step introduces some loss or distortion to the underlying signal and the end results are affected by the combined effects of all upstream steps. While individual components of the MS pipeline have been studied at length, little work has been done to integrate the various modules, evaluate them in a systematic way, and focus on the impact of the various steps on the end results of differential analysis and sample classification. In real experiments, it is not easy to decouple the compound parameter effects and determine the marginal influence of various modules on the end results, due to variations and the complicated nature of the work flow. Moreover, owing to contaminants and unknown or incomplete ground-truth, it is hard to meaningfully evaluate and compare results across different experiments. However, by employing a model-based approach, we may better understand the characteristics of the MS data, the contributions of the individual modules, and the performance of the full pipeline.
A key goal of MS-based proteomics is to discover protein biomarkers, which can be used to improve diagnosis, guide targeted therapy, and monitor therapeutic response across a wide range of diseases. But to date, the rate of discovery of successful biomarkers is still unsatisfactory. This is due to challenges in the candidate discovery and biomarker validation phases, such as the high dynamic range of proteins [5, 6], the tandem MS under-sampling problem, peptide redundancy and signal interference in the mass-to-charge domain, and inaccurate quantification of proteins [8, 9]. Through the proposed model-based approach and by means of simulation using ground-truthed synthetic data, the problem of biomarker discovery can be studied and evaluated.
Application of the proposed model
The proposed LC-MS proteomic pipeline model can be used to determine the working range of important parameters and may shed light on experimental design. Also, if the sample complexity, instrument configuration, system variation and detection accuracy are known beforehand, then tuning the corresponding parameters to their estimated values allows the pipeline to predict protein identification rates, differential analysis results, quantification accuracy and classification performance. These results can be used to assess the efficacy of biomarker discovery in MS data.
Protein mixture model
where the fold change parameter, $a_l > 1$, is sampled from a uniform distribution, as specified in the Results section.
where $R_\rho$ is a $D \times D$ matrix with 1 on the diagonal and $\rho$ elsewhere. The correlation $\rho$ and block size $D$ are tunable parameters, with values specified in the Results section.
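The block correlation structure can be illustrated with a short NumPy sketch. It builds $R_\rho$ for the values $D = 2$ and $\rho = 0.6$ listed in the model summary and checks the empirical correlation of sampled data. The Gaussian sampling step and the function name `block_correlation` are illustrative assumptions, since the distributional equation is not reproduced in this text.

```python
import numpy as np

def block_correlation(d, rho):
    """Return the d x d matrix with 1 on the diagonal and rho elsewhere."""
    return (1.0 - rho) * np.eye(d) + rho * np.ones((d, d))

R = block_correlation(2, 0.6)

# Assumption: zero-mean Gaussian draws, used only to show the correlation.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(2), R, size=5000)
empirical_rho = np.corrcoef(samples, rowvar=False)[0, 1]
```

With a few thousand draws, `empirical_rho` should land close to the nominal value of 0.6.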
Peptide mixture model
Peptide detection and identification
The sources of noise include variation in experimental conditions, instrument variance, thermal noise and measurement error. It is reported that the noise variance follows a quadratic dependence on the expected abundance, which is reflected by Eq. (8). The two parameters in the noise model, $\alpha$ and $\beta$, determine the noise severity. Their values can be estimated using replication analysis.
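Since Eq. (8) is not reproduced in this text, the following sketch only assumes one common quadratic form, variance $= \alpha\mu^2 + \beta\mu$ for expected abundance $\mu$, together with additive zero-mean Gaussian noise. Both the functional form and the function names are assumptions made for illustration; the default values of $\alpha$ and $\beta$ are those listed in the model summary.

```python
import numpy as np

def noise_variance(mu, alpha=0.03, beta=3.6):
    """Assumed quadratic form: variance = alpha * mu^2 + beta * mu."""
    mu = np.asarray(mu, dtype=float)
    return alpha * mu**2 + beta * mu

def add_noise(mu, alpha=0.03, beta=3.6, rng=None):
    """Assumption: additive zero-mean Gaussian noise with the variance above."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    return mu + rng.normal(0.0, np.sqrt(noise_variance(mu, alpha, beta)))
```

Under this assumed form the $\beta$ term dominates at low abundance and the $\alpha$ term at high abundance, so the relative noise level flattens out as abundance grows.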
In electrospray ionization, peptides can be multiply charged. But we do not model the charge distribution, considering the following facts: (1) Peptide charge distribution and the maximum charge states are complicated by many factors such as sample composition, analyte concentration and peptide conformation [20, 21]. The distribution is hard to predict and has not been well characterized. (2) In order to get the abundance of a peptide, and further, its parent protein, the abundance of peptide charge variants will eventually be summed up. We omit the intermediate process since in reality many factors involved are not well understood.
where $b$ represents the worst-case true positive rate (TPR), attained as the signal-to-noise ratio (SNR) approaches zero.
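The TPR equation itself is not reproduced in this text. Purely as an illustration, the sketch below uses a hypothetical saturating curve that equals $b$ at zero SNR and approaches 1 as the SNR grows; the functional form is an assumption and should not be read as the model's actual equation. The default values of $b$, $k$ and $p$ are those listed in the model summary.

```python
import numpy as np

def true_positive_rate(snr, b=0.0, k=0.0016, p=2.0):
    """Hypothetical TPR curve: b at SNR = 0, tending to 1 for large SNR."""
    snr = np.asarray(snr, dtype=float)
    return 1.0 - (1.0 - b) * np.exp(-k * snr**p)
```

Any monotone curve with the same two limits would serve the same illustrative purpose; the role of $b$ as the floor at zero SNR is the only property taken from the text.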
The output of the MS1-based peptide detection algorithm is a list of detected peptides annotated by monoisotopic mass, retention time, abundance, and so on. To obtain peptide sequence information, i.e. peptide identification, which can be used to infer the parent protein from which the peptide was digested, database searching is required. To do so, popular software such as SEQUEST and Mascot searches the acquired MS/MS (MS2) spectra against a protein database containing theoretical MS2 spectra generated from in-silico digested peptide sequences.
Several machine learning methods have been proposed to predict the probability (i.e., identifiability) of a peptide being identified through MS2 database searching [14, 28]. These methods try to extract the common trends in peptide identifiability that can be explained by peptide sequence-specific properties. Their successful application may suggest that the peptide sequence largely affects the chance of a peptide being selected for MS2 analysis, whether the peptide can be sufficiently fragmented, and the quality of its fragmentation spectra. In our simulation, the identifiability $p_i$ of the true peptide species $i$ is predicted by the APEX software, trained on the human serum proteome. Whether peptide species $i$ in sample $j$ is identified through database searching is then determined by the outcome of a Bernoulli trial with success rate $p_i$.
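The Bernoulli identification step can be sketched as follows. The identifiabilities are taken as given inputs (in the simulation they come from APEX); the function name and the example values are hypothetical.

```python
import numpy as np

def simulate_identification(p, n_samples, rng=None):
    """Draw one Bernoulli trial per (peptide, sample) pair.

    p: identifiabilities p_i, one per peptide species.
    Returns a boolean array of shape (n_peptides, n_samples).
    """
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float)
    return rng.random((p.size, n_samples)) < p[:, None]

# Hypothetical identifiabilities for three peptide species.
identified = simulate_identification([0.0, 1.0, 0.5], 100, np.random.default_rng(0))
```

Each row of `identified` records in which samples the corresponding peptide was identified.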
Linking of detection and identification results
For both MS1-based and MS2-based algorithms, sources of error exist that give rise to false positives (FPs). For the former, error sources include shot noise, abundance measurement error, signal interference, and so on. For the latter, co-eluting precursor ions, spectra matching ambiguity, or post-translational modifications may all lead to false identifications. By confronting the results of the two orthogonal algorithms (i.e., a feature is treated as a true positive if it is reported by both algorithms), dubious features reported by either algorithm can be filtered out.
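The linking rule amounts to intersecting the two result lists. The minimal sketch below assumes that features from both algorithms have already been matched to a shared identifier; in practice this matching would rely on mass and retention time, which is not shown here. The identifiers are hypothetical.

```python
def link_features(detected, identified):
    """Keep only features reported by both MS1 detection and MS2 identification."""
    return set(detected) & set(identified)

ms1_detected = {"pepA", "pepB", "pepC"}    # hypothetical feature identifiers
ms2_identified = {"pepB", "pepC", "pepD"}
confirmed = link_features(ms1_detected, ms2_identified)
```

Features seen by only one algorithm (`pepA` and `pepD` here) are treated as dubious and dropped.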
Peptide to protein abundance roll-up
As demonstrated in the previous sections, each step of the MS analysis pipeline introduces a degree of loss or distortion to the underlying true signal. Thus, "decoding" protein abundance from observed peptide abundance corrupted by noise is nontrivial. To reduce noise, three levels of filtering are applied: (1) only unique peptides, i.e. those that occur in exactly one protein of the analyzed proteome, are kept; (2) peptides with large missing value rates (larger than 0.7) are filtered out, since low reproducibility may be a red flag for false identifications; (3) among the remaining peptides, those having sufficiently high correlations (larger than 0.6) with other peptides digested from the same protein are retained. The estimated abundance of protein $l$ in sample $j$ is then obtained by averaging the abundances of its child peptides that pass the previous filters; if fewer than two peptides pass the filters, the estimated protein abundance is set to zero. The estimated protein concentration is calculated by dividing the estimated protein abundance by the instrument response factor $\kappa$.
where the two quantities being compared are the original and estimated concentrations of protein $l$ in sample $j$, respectively.
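The three filters and the averaging step can be sketched for a single protein as follows. Several details are assumptions made for this illustration: missing values are encoded as NaN, a peptide passes filter (3) if it correlates above the threshold with at least one sibling peptide, correlations are computed only over samples where both peptides are observed, and averaging ignores missing values. The function name `roll_up` is hypothetical.

```python
import numpy as np

def roll_up(pep, is_unique, kappa=5.0, max_missing=0.7, min_corr=0.6):
    """Estimate one protein's concentration per sample from its peptides.

    pep: (n_peptides, n_samples) abundances, NaN marks a missing value.
    is_unique: boolean flag per peptide (occurs in exactly one protein).
    """
    pep = np.asarray(pep, dtype=float)
    observed = ~np.isnan(pep)
    # Filters (1) and (2): unique peptides with an acceptable missing rate.
    keep = np.asarray(is_unique, dtype=bool) & ((~observed).mean(axis=1) <= max_missing)
    candidates = np.flatnonzero(keep)
    # Filter (3): keep peptides correlated with at least one sibling (assumption).
    passed = []
    for i in candidates:
        for j in candidates:
            both = observed[i] & observed[j]
            if i == j or both.sum() < 3:
                continue
            x, y = pep[i, both], pep[j, both]
            if x.std() > 0 and y.std() > 0 and np.corrcoef(x, y)[0, 1] > min_corr:
                passed.append(i)
                break
    # Fewer than two surviving peptides: abundance is set to zero.
    if len(passed) < 2:
        return np.zeros(pep.shape[1])
    counts = observed[passed].sum(axis=0)
    totals = np.nansum(pep[passed], axis=0)
    abundance = np.divide(totals, counts, out=np.zeros_like(totals), where=counts > 0)
    # Convert abundance to concentration via the instrument response factor.
    return abundance / kappa

peptides = [[10, 20, 30, 40], [12, 22, 28, 41], [40, 10, 30, 20]]
concentration = roll_up(peptides, is_unique=[True, True, True])
```

In this toy input the third peptide is anti-correlated with its siblings and is discarded, so the estimate is the average of the first two peptides divided by $\kappa$.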
Differential expression analysis
where the superscripts identify the two classes, and $m_l$ and $\mathrm{Var}_l$ represent the estimated class mean and variance of the abundance of protein $l$, respectively. The standard 0.05 significance level is used to detect differentially expressed markers.
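The test statistic is not reproduced in this text. As a hedged sketch, the code below computes a two-sample t-statistic per protein from the class means and variances, assuming the unequal-variance form; whether the original analysis pools the variances is not stated here. The function name is hypothetical.

```python
import numpy as np

def t_statistics(x0, x1):
    """Per-protein two-sample t-statistic (unequal-variance form, an assumption).

    x0, x1: (n_samples, n_proteins) estimated abundances for the two classes.
    """
    x0, x1 = np.asarray(x0, dtype=float), np.asarray(x1, dtype=float)
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    v0, v1 = x0.var(axis=0, ddof=1), x1.var(axis=0, ddof=1)
    return (m1 - m0) / np.sqrt(v0 / len(x0) + v1 / len(x1))

t = t_statistics([[1.0], [2.0], [3.0]], [[4.0], [5.0], [6.0]])
```

Converting the statistic to a p-value for comparison against the 0.05 level requires the t-distribution, for example through SciPy's `scipy.stats.ttest_ind` with `equal_var=False`.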
Feature selection and classification
In the simulation, t-test feature selection is first performed to reduce the data dimension, by selecting the top 20 differentially expressed features. Then two classifiers, namely K-nearest neighbor (KNN, K = 3) and linear discriminant analysis (LDA) are trained using the observed protein expression data. Classification performance is validated by independent ground-truth (testing) data sets (each with 1000 samples, generated from the same data model), and the classification error is recorded. In addition, the KNN and LDA classification error on the original protein data (before entering the MS analysis pipeline) is obtained using a similar approach. The latter may serve as a benchmark to gauge how much loss in classification performance the analysis pipeline has introduced.
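The feature selection and KNN steps can be sketched with NumPy alone. This is a minimal illustration rather than the simulation code: class labels are assumed to be 0 and 1, features are ranked by the absolute unequal-variance t-statistic, distances are Euclidean, and LDA is omitted for brevity. All function names are hypothetical.

```python
import numpy as np

def top_features(X, y, n_top=20):
    """Indices of the n_top features with the largest absolute t-statistic."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.abs(a.mean(axis=0) - b.mean(axis=0)) / se
    return np.argsort(t)[::-1][:n_top]

def knn_predict(X_train, y_train, X_test, k=3):
    """Majority vote among the k nearest training samples (labels 0/1)."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    dist = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

def error_rate(y_true, y_pred):
    """Fraction of misclassified samples."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

# Toy data: feature 1 separates the classes, feature 0 does not.
X = np.array([[1.0, 0.0], [1.1, 0.1], [0.9, -0.1],
              [1.0, 5.0], [1.1, 5.1], [0.9, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1])
selected = top_features(X, y, n_top=1)
predictions = knn_predict(X[:, selected], y, X[:, selected], k=3)
```

In the simulation setting, the selected features and trained classifier would be applied to an independent test set, and the same procedure repeated on the original protein data to obtain the benchmark error.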
Proteomics pipeline model summary
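No. of classes: (value missing)
Sample size of each class: $M = 50$
No. of marker proteins: (value missing)
No. of non-markers: (value missing)
Protein block size: $D = 2$
Protein block correlation: $\rho = 0.6$
Fold change: $a_l \sim \mathrm{Unif}(1.5, 2)$
Instrument response factor: $\kappa = 5$
Instrument saturation effect: sat = Inf
Noise model parameters: $\alpha = 0.03$, $\beta = 3.6$
Peptide efficiency factor: $e_i \sim \mathrm{Unif}(0.1, 1)$
Peptide detection algorithm: $b = 0$, $k = 0.0016$, $p = 2$
No. of MS2 replicates: (value missing)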
Effect of peptide efficiency factor
Effect of protein abundance
The distribution of in-solution protein abundance can affect various detection results. While high-abundance proteins are easily detectable, low-abundance proteins are hard to detect since their signals are more likely to be buried in background noise. Hence, improving detection of low-abundance proteins has become a central issue in proteomic research.
These results indicate that it is essential to develop methods that improve the identification of low-abundance peptides, which are often of greater biological interest. On the hardware side, sample fractionation and protein depletion through immunoaffinity-based approaches can be helpful. On the software side, some algorithms, such as BPDA2d, have been shown to be efficient at detecting low-abundance peptides.
Effect of sample size
In Figure 5(b), the classification error of the (unobserved) original protein sample, before passing through the MS pipeline, is plotted side by side with that of the observed protein data, after analysis by the MS pipeline. The performance degradation caused by various noise conditions throughout the pipeline is clearly visible.
Effect of instrument response
Effect of saturation
The compound effects of instrument sensitivity and saturation demonstrate that the effectiveness of MS in quantitative analysis relies on achieving a wide linear dynamic range with a high saturation ceiling and a matching sensitivity. For example, in electrospray ionization mass spectrometry, the linear range may be extended by enhancing gas-phase analyte charging, facilitating droplet evaporation, or introducing ionization competitors.
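The interplay between sensitivity and saturation can be pictured with a simple clipped linear response. This is an assumed form chosen for illustration (signal equal to $\kappa$ times concentration, capped at the saturation ceiling), not an equation taken from the model; the defaults $\kappa = 5$ and sat = Inf are the values listed in the model summary.

```python
import numpy as np

def instrument_response(concentration, kappa=5.0, sat=np.inf):
    """Assumed response: linear in concentration, clipped at the ceiling sat."""
    return np.minimum(kappa * np.asarray(concentration, dtype=float), sat)

linear = instrument_response([1.0, 2.0])             # no saturation
clipped = instrument_response([1.0, 2.0], sat=8.0)   # ceiling reached at the second value
```

Under this assumption, raising $\kappa$ without raising the ceiling narrows the concentration range over which the response stays linear.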
Effect of noise
Peptide detection and experimental design characteristics
Effect of MS1 peptide detection algorithm
Effect of overlapping peptides and mass resolving power
Effect of MS2 replication
In tandem MS analysis, the precursor ions selected for fragmentation have low reproducibility across runs, and only a subset of peptides present in the sample can be analyzed for each run; this problem is known variously as MS2 random sampling and MS2 under-sampling. Hence, though laborious and costly, replicate MS2 measurements are frequently conducted for in-depth proteomic profiling or for building an AMT database to facilitate quantitative and high-throughput proteome measurements.
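The benefit of replication can be quantified with a short calculation. Assuming each replicate run is an independent Bernoulli trial with identifiability $p$ (independence across replicates is an assumption of this sketch), the probability that a peptide is identified in at least one of $R$ replicates is $1 - (1 - p)^R$.

```python
import numpy as np

def prob_identified(p, n_replicates):
    """Probability of at least one identification in n independent replicates."""
    return 1.0 - (1.0 - np.asarray(p, dtype=float)) ** n_replicates

gain = prob_identified(0.5, 2)   # a peptide with p = 0.5 measured twice
```

Because the gain shrinks geometrically with each additional run, most of the improvement comes from the first few replicates.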
Performance measures reported: peptide identification rate; protein quantification rate; protein quantification error; percentage of detected markers; LDA and KNN error on the original protein data; LDA and KNN error on the observed protein data.
We have identified and analyzed different modules in a typical MS based proteomic work flow, resulting in a proteomic pipeline model that captures key factors in system performance. Through simulation based on ground-truthed synthetic data, we studied the effect of the various model parameters on the number of identified peptides and quantified proteins, quantification errors, detectable differentially expressed protein markers, and classification performance.
The main observations that were gleaned from the results of this study are as follows.
Regarding sample characteristics, we observed a positive correlation between peptide efficiency and performance. The intricacy in detecting low-abundance peptides was demonstrated, thereby elucidating the advantage of sample fractionation and protein depletion through immunoaffinity-based approaches. Moreover, we showed that results could be improved by increasing sample size.
As for instrument characteristics, the compound effects of instrument response and saturation were first examined, and it was shown that the effectiveness of MS in quantitative analysis relies on achieving a wide linear dynamic range with a high saturation ceiling and matching instrument sensitivity. Enhancing gas-phase analyte charging, facilitating droplet evaporation, or introducing ionization competitors can be beneficial in extending the linear dynamic range. The adverse effects of noise were illustrated, highlighting the need to strictly follow experiment protocols to minimize variance and measurement error.
Peptide detection and experimental design characteristics were also studied. It was shown that improving peptide detection algorithms in the direction of enhancing true positive rate for a wide range of SNR (especially for low SNR) and tackling convoluted peptide signals could be invaluable, especially for complex samples and for MS instruments with limited mass resolution. It was also observed that the use of only a small number of replicate tandem MS assays could effectively reduce the MS2 under-sampling problem and improve performance.
To enable the performance analysis of such a complex system, many reasonable assumptions are made and the pipeline is simplified and reduced to a few key characteristics; nevertheless corruption of the true signal caused by the pipeline is evident and readily seen. This is expected to become worse as more steps are considered.
Though we used two sample types to illustrate the use of the LC-MS based pipeline model, the extension to multiple sample types is straightforward. The same methodology can also be applied to study other MS platforms such as matrix-assisted laser desorption/ionization (MALDI), and a similar strategy applies to labeled experiments.
The proposed pipeline model can be used to optimize the work flow and to pinpoint critical steps to which it is worth allocating resources in order to improve biomarker detection performance, thereby giving it wide application potential in the current drive to enable proteomic biomarker discovery from MS data.
Based on “Modeling and systematic analysis of LC-MS proteomics pipeline”, by Youting Sun, Ulisses Braga-Neto and Edward R Dougherty, which appeared in Genomic Signal Processing and Statistics (GENSIPS), 2011 IEEE International Workshop on. © 2011 IEEE.
The authors acknowledge the support of the Partnership for Personalized Medicine (PPM) project, through Translational Genomics (TGen) contract C08-00904.
This article has been published as part of BMC Genomics Volume 13 Supplement 6, 2012: Selected articles from the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS) 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S6.
1. Rifai N, Gillette M, Carr S: Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nature Biotechnology. 2006, 24: 971-983. 10.1038/nbt1235.
2. Pandey A, Andersen JS, Mann M: Use of mass spectrometry to study signaling pathways. Science's STKE. 2000
3. Hewel JA, Liu J, Onishi K, Fong V, et al: Synthetic peptide arrays for pathway-level protein monitoring by LC-MS/MS. Mol Cell Proteomics. 2010, 9: 2460-2473. 10.1074/mcp.M900456-MCP200.
4. Frank R, Hargreaves R: Clinical biomarkers in drug discovery and development. Nat Rev Drug Disc. 2003, 2: 566-580. 10.1038/nrd1130.
5. Hüttenhain R, Malmström J, Picotti P, Aebersold R: Perspectives of targeted mass spectrometry for protein biomarker verification. Curr Opin Chem Biol. 2009, 13: 518-525. 10.1016/j.cbpa.2009.09.014.
6. Nilsson T, Mann M, Aebersold R, Yates JR, et al: Mass spectrometry in high-throughput proteomics: ready for the big time. Nature Methods. 2010, 7: 681-685. 10.1038/nmeth0910-681.
7. Sherman J, McKay MJ, Ashman K, Molloy MP: How specific is my SRM?: The issue of precursor and product ion redundancy. Proteomics. 2009, 9: 1120-1123. 10.1002/pmic.200800577.
8. Duncan MW, Yergey AL, Patterson SD: Quantifying proteins by mass spectrometry: the selectivity of SRM is only part of the problem. Proteomics. 2009, 9: 1124-1127. 10.1002/pmic.200800739.
9. Griffin NM, Yu J, Long F, Oh P, et al: Label-free, normalized quantification of complex mass spectrometry data for proteomics analysis. Nature Biotechnology. 2010, 28: 83-89. 10.1038/nbt.1592.
10. Knox C, Law V, Jewison T, Liu P, Ly S, et al: DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic Acids Res. 2011, 39: D1035-41. 10.1093/nar/gkq1126.
11. Coombes KR, Koomen J, Baggerly KA, Morris JS, Kobayashi R: Understanding the characteristics of mass spectrometry data through the use of simulation. Cancer Informatics. 2005, 1: 41-52.
12. Schulz-Trieglaff O, Pfeifer N, Gröpl C, Kohlbacher O, Reinert K: LC-MSsim - a simulation software for liquid chromatography mass spectrometry data. BMC Bioinformatics. 2008, 9: 423. 10.1186/1471-2105-9-423.
13. Taniguchi Y, Choi PJ, Li G, Chen H, et al: Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science. 2010, 329: 533. 10.1126/science.1188308.
14. Lu P, Vogel C, Wang R, Yao X, Marcotte EM: Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nature Biotechnology. 2007, 25: 117-24. 10.1038/nbt1270.
15. Hua J, Waibhav T, Dougherty ER: Performance of feature selection methods in the classification of high-dimensional data. Pattern Recognition. 2008, 42: 409-424.
16. PNNL protein digestion simulator. [http://omics.pnl.gov/software/ProteinDigestionSimulator.php]
17. Timm W, Scherbart A, Bocker S, Kohlbacher O, Nattkemper TW: Peak intensity prediction in MALDI-TOF mass spectrometry: A machine learning study to support quantitative proteomics. BMC Bioinformatics. 2008, 9: 443-460. 10.1186/1471-2105-9-443.
18. Cech NB, Enke CG: Practical implications of some recent studies in electrospray ionization fundamentals. Mass Spectrom Rev. 2001, 20 (6): 362-87. 10.1002/mas.10008.
19. Anderle M, Roy S, Lin H, Becker C, Joho K: Quantifying reproducibility for differential proteomics: noise analysis for protein liquid chromatography-mass spectrometry of human serum. Bioinformatics. 2004, 20 (18): 3575-3582. 10.1093/bioinformatics/bth446.
20. Iavarone AT, Jurchen JC, Williams ER: Effects of solvent on the maximum charge state and charge state distribution of protein ions produced by electrospray ionization. J Am Soc Mass Spectrom. 2000, 11 (11): 976-985. 10.1016/S1044-0305(00)00169-0.
21. Konermann L: A minimalist model for exploring conformational effects on the electrospray charge state distribution of proteins. J Phys Chem B. 2007, 111: 6534-6543.
22. Sun Y, Zhang J, Braga-Neto UM, Dougherty ER: BPDA - a Bayesian peptide detection algorithm for mass spectrometry. BMC Bioinformatics. 2010, 11: 490. 10.1186/1471-2105-11-490.
23. Sun Y, Zhang J, Braga-Neto UM, Dougherty ER: BPDA2d - a 2D global optimization based Bayesian peptide detection algorithm for LC-MS. Bioinformatics. 2012, 28: 564-572. 10.1093/bioinformatics/btr675.
24. Renard BY, Kirchner M, Steen JA, Hamprecht FA: NITPICK: peak identification for mass spectrometry data. BMC Bioinformatics. 2008, 9: 355. 10.1186/1471-2105-9-355.
25. Zhang J, Haskins W: ICPD - a new peak detection algorithm for LC/MS. BMC Genomics. 2010, 11 (Suppl 3): S8. 10.1186/1471-2164-11-S3-S8.
26. Yates JR, Eng JK, McCormack AL, Schieltz D: Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal Chem. 1995, 67: 1426-1436. 10.1021/ac00104a020.
27. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999, 20: 3551-67. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
28. Mallick P, Schirle M, Chen SS, Flory MR, Lee H, et al: Computational prediction of proteotypic peptides for quantitative proteomics. Nature Biotechnology. 2007, 25: 125-131. 10.1038/nbt1275.
29. Whiteaker JR, Zhang H, Eng JK, et al: Head-to-head comparison of serum fractionation techniques. J Proteome Res. 2007, 6 (2): 828-36. 10.1021/pr0604920.
30. Bohrer BC, Li YF, Reilly JP, Clemmer DE, et al: Combinatorial libraries of synthetic peptides as a model for shotgun proteomics. Anal Chem. 2010, 82 (15): 6559-568. 10.1021/ac100910a.
31. Echan LA, Tang HY, Nadeem AK, Lee K, Speicher DW: Depletion of multiple high-abundance proteins improves protein profiling capacities of human serum and plasma. Proteomics. 2005, 5 (13): 3292-3303. 10.1002/pmic.200401228.
32. Bazzi BH: Ionization competitors extend the linear range of electrospray ionization mass spectrometry. Master's thesis. 2010, The University of Texas at Arlington, Arlington
33. Rinner O, Mueller LN, Hubálek M, Müller M, Gstaiger M, Aebersold R: An integrated mass spectrometric and computational framework for the analysis of protein interaction networks. Nature Biotechnology. 2007, 25: 345-352. 10.1038/nbt1289.
34. Rea Smith: An accurate mass tag strategy for quantitative and high-throughput proteome measurements. Proteomics. 2002, 2: 513-523. 10.1002/1615-9861(200205)2:5<513::AID-PROT513>3.0.CO;2-W.
35. Sun Y, Braga-Neto U, Dougherty ER: Modeling and systematic analysis of LC-MS proteomics pipeline. Genomic Signal Processing and Statistics (GENSIPS), 2011 IEEE International Workshop on: 4-6 December 2011. 2011, 112-116. 10.1109/GENSiPS.2011.6169457.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.