Volume 18 Supplement 2
Using the entrapment sequence method as a standard to evaluate key steps of proteomics data analysis process
- Xiao-dong Feng†1, 2,
- Li-wei Li†2,
- Jian-hong Zhang†2,
- Yun-ping Zhu2,
- Cheng Chang2,
- Kun-xian Shu1 and
- Jie Ma2
© The Author(s). 2017
Published: 14 March 2017
The mass spectrometry based technical pipeline has provided a high-throughput, high-sensitivity and high-resolution platform for post-genomic biology. Varied models and algorithms are implemented by different tools to improve proteomics data analysis. The target-decoy searching strategy has become the most popular strategy to control false identification in peptide and protein identifications. While this strategy can estimate the false discovery rate (FDR) within a dataset, it cannot directly evaluate the false positive matches in target identifications.
As a supplement to the target-decoy strategy, the entrapment sequence method was introduced to assess the key steps of the mass spectrometry data analysis process: database search engines and quality control methods. Using the entrapment sequences as the standard, we evaluated five database search engines on both their original scores and reprocessed scores, as well as four quality control methods, in terms of both quantity and quality. Our results showed that the recently developed search engine MS-GF+ and the percolator-embedded quality control method PepDistiller performed best among the search engines and the quality control methods, respectively. Combined with efficient quality control methods, the search engines can overcome the low sensitivity of their original scores. Moreover, based on the entrapment sequence method, we showed that filtering the identifications separately could increase the number of identified peptides while improving the confidence level.
In this study, we have shown that the entrapment sequence method can be a useful strategy to assess the key steps of the mass spectrometry data analysis process. Its applications can be extended to all steps of the common workflow, such as the protein assembling methods and data integration methods.
Keywords: Proteomics; Tandem mass spectrometry; Entrapment sequence method; Target-decoy search; Quality control
The development of mass spectrometry has provided a high-throughput, high-sensitivity and high-resolution analysis platform for proteomics. Tandem mass spectrometry has become one of the most powerful technologies for protein identification, making global protein profiling possible. Meanwhile, the database searching strategy allows high-throughput identification of peptides and proteins in shotgun proteomics. Varied models and algorithms are implemented by different search engines, including the earlier engines SEQUEST, Mascot and X!Tandem as well as more recently developed engines such as Comet, Tide, MS-GF+ and MS Amanda. Quality control methods such as PeptideProphet [8–10], PepDistiller, Mfs, RockerBox, FDRAnalysis and BuildSummary are then applied to achieve highly reliable identifications.
The target-decoy database search strategy is the most commonly used approach to estimate false identifications in the target database, under the assumption that the number of false identifications in the target database equals that in the decoy database. However, this strategy only estimates the false discovery rate (FDR) within a dataset; it does not directly evaluate the false positive matches among target identifications.
In our previous work, we used the protein sequences from Archaea species as an appended database for standard dataset analysis, to avoid the ambiguous matches caused by sequence similarity between control protein sequences and searched database sequences [17, 18]. Similar work has been published by Granholm et al. and Vaudel et al. Granholm et al. suggested a semi-labeled method for evaluating the calibration of a given score function: a dataset from a known protein sample is searched against a database composed of a small number of sample sequences and a large number of entrapment sequences. Vaudel et al. proposed constructing a database that contains both the sample sequences (true positives) and entrapment sequences (false positives), and showed that the Pyrococcus furiosus proteome can provide a method for detecting random hits (comparable to the decoy database).
The work described above motivated us to introduce entrapment sequences into the target-decoy search strategy as a supplement. By using different labels, we can separate the peptide-spectrum matches (PSMs) into different classes and directly count the false matches among target identifications. Using the entrapment sequences as the objective standard (pure false positives), we assessed five database search engines and four quality control methods in terms of both quantity and quality. Results on two datasets show that the entrapment sequence method is a useful strategy for assessing the mass spectrometry data analysis workflow.
Two previously published datasets were used in this study. The Pfu dataset was produced by analyzing a Pyrococcus furiosus sample on an LTQ Orbitrap Velos (Thermo Scientific) and is used as a standard dataset here. The LM3 dataset was generated from a shotgun analysis of the metastatic human hepatocellular carcinoma cell line (HCCLM3) using a Q-Exactive (Thermo Scientific).
Protein Sequence Database
Table 1 Construction of the target database for Pfu and LM3 datasets. Reported quantities: sample tryptic peptides; entrapment tryptic peptides; shared tryptic peptides; shared/sample tryptic peptides (%).
Both Granholm et al. and Vaudel et al. suggested that sufficiently large entrapment sequence sets should be used, so that the probability that a random match hits the sample database is negligible, but the optimal size has not been examined. Here, about ten times as many entrapment sequences as sample sequences were used, a ratio similar to that in Vaudel et al.’s work. We also compared the tryptic peptides of all sample sequences and entrapment sequences. As shown in Table 1, the ratios of shared peptides are low for all three constructed databases (0.07%, 0.21% and 0.06%). Thus, very few correct PSMs should hit the entrapment sequences. A spectrum that matches both sample and entrapment sequences is considered a sample identification.
All mzML and MGF files were converted from raw files using the msconvert module in the Trans-Proteomic Pipeline (TPP v4.7.0). The MS/MS peak list files were searched against the combined database using Mascot (local server v2.3.2), Comet (in Crux v2.1.16833 [25, 26]), Tide (in Crux v2.1.16833), MS-GF+ (v10089) and X!Tandem (TPP v4.7.0). Monoisotopic masses were used for both peptide and fragment ions, with a fixed modification (Carbamidomethyl, +57 Da) on Cys and a variable modification (Oxidation, +16 Da) on Met. Tryptic cleavage at only Lys or Arg was selected, and at most one missed cleavage was allowed.
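The shared-peptide comparison can be illustrated with a minimal Python sketch. It assumes a simple in-silico digest that cleaves after every Lys or Arg, allows one missed cleavage and keeps peptides of at least 7 residues; the function names `tryptic_peptides` and `shared_peptide_stats` are hypothetical and are not part of any tool used in this study.

```python
import re

def tryptic_peptides(sequence, missed_cleavages=1, min_length=7):
    """Return the set of peptides from cleaving after every K or R."""
    # Each fragment ends with K or R, except a possible C-terminal remainder.
    fragments = re.findall(r"[^KR]*[KR]|[^KR]+$", sequence)
    peptides = set()
    for start in range(len(fragments)):
        last = min(start + missed_cleavages, len(fragments) - 1)
        for end in range(start, last + 1):
            # Joining consecutive fragments models missed cleavages.
            peptide = "".join(fragments[start:end + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides

def shared_peptide_stats(sample_proteins, entrapment_proteins):
    """Count sample, entrapment and shared peptides, plus shared/sample in %."""
    sample = set().union(*(tryptic_peptides(s) for s in sample_proteins))
    entrapment = set().union(*(tryptic_peptides(s) for s in entrapment_proteins))
    shared = sample & entrapment
    percent = 100.0 * len(shared) / len(sample) if sample else 0.0
    return len(sample), len(entrapment), len(shared), percent

# Toy sequences, for illustration only.
print(shared_peptide_stats(["AAAAAAAKCCCCCCCR"], ["DDDDDDDKCCCCCCCR"]))
```

A low shared/sample percentage indicates that few correct matches can be mistaken for entrapment hits.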
Quality control and protein assembling
Four commonly used quality control methods were applied in this study: BuildSummary, PeptideProphet [8–10], FDRAnalysis and PepDistiller. Each rescored the Mascot results for every PSM, producing BuildSummary’s ExpectValue, PeptideProphet’s probability, FDRAnalysis’s FDRScore and PepDistiller’s q-value, respectively. Comet and Tide results were processed by the Percolator integrated in Crux, which produced a q-value as the rescored value. MS-GF+ and X!Tandem results were processed by percolator-converters (v3-00) followed by percolator (v2-08) for further quality control. The percolator tools can be downloaded from https://github.com/percolator/percolator. In this study, we used MAYU for protein assembling. Peptides shorter than 7 amino acids were not taken into account.
False Discovery Rate and False Match Rate
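A minimal sketch of how the two rates could be computed from labelled PSMs. It assumes the common decoy-over-target estimate for the FDR and an analogous entrapment-over-target ratio for the FMR, with target hits counted as sample plus entrapment hits. These formulas, the label names and the function `fdr_and_fmr` are assumptions of this example and are not necessarily the exact definitions used in this study.

```python
from collections import Counter

def fdr_and_fmr(psms, threshold):
    """psms: iterable of (score, label) pairs, higher score meaning better.

    label is one of "sample", "entrapment" or "decoy" (hypothetical names).
    Returns (FDR, FMR) for the PSMs scoring at or above the threshold.
    """
    counts = Counter(label for score, label in psms if score >= threshold)
    # Entrapment sequences are part of the target database in this setup.
    targets = counts["sample"] + counts["entrapment"]
    if targets == 0:
        return 0.0, 0.0
    fdr = counts["decoy"] / targets       # assumed estimate: decoys / targets
    fmr = counts["entrapment"] / targets  # assumed estimate: entrapment / targets
    return fdr, fmr

# Toy PSM list, for illustration only.
psms = [(10, "sample"), (9, "sample"), (8, "entrapment"),
        (7, "decoy"), (6, "sample"), (5, "decoy")]
print(fdr_and_fmr(psms, threshold=6))
```

Under these assumptions, an FMR close to the FDR suggests that the decoy-based estimate is tracking the false matches that the entrapment sequences expose.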
Results and discussion
With the advance of proteome research, a growing number of database search engines and subsequent quality control methods have emerged and play key roles in the whole process of MS/MS data analysis. As shown in Fig. 1, using the entrapment sequences as a standard, we evaluated the original and reprocessed scores of five database search engines, as well as four quality control methods, in two important aspects: quantity and quality.
Evaluation of different database search engines based on both the original scores and reprocessed scores
First, we used the Pfu dataset as a standard dataset to compare the five search engines based on their original scores: Mascot’s ionscore, X!Tandem’s expect, Comet’s e-value, MS-GF+’s EValue and Tide’s XCorr. As shown in Additional file 1: Figure S1A-C, MS-GF+ far outperforms the other search engines, and the use of MS-GF+’s EValue yields significantly more identifications at the PSM, peptide and protein levels at the pre-defined FDR. The same trend was also observed in the large LM3 dataset (Additional file 1: Figure S1D-F).
In general, fewer entrapment hits occur in PSM and peptide identifications and in the large dataset (LM3) than in protein identifications and in the small dataset (Pfu). In most cases, the false match rates (FMRs) estimated from entrapment hits are roughly equal to the FDRs estimated from decoy hits. In some cases, however, the false matches represented by entrapment hits far outnumber the expected ones, for example the Tide (FMR = 3.2%) and X!Tandem (FMR = 2.7%) results for the Pfu dataset at a protein FDR of 0.01 (Fig. 3c), which signals to the researcher that stricter QC should be applied. Thus, we concluded that entrapment sequences can be used as an internal scale for researchers to monitor their peptide or protein identifications at any time.
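The monitoring idea can be sketched as a simple check that flags results whose entrapment-based FMR clearly exceeds the nominal FDR. The function name and the `tolerance` factor of 2 are hypothetical choices for this example; the two FMR values are the ones reported above for the Pfu dataset at a protein FDR of 0.01.

```python
def flag_excess_false_matches(fmr_by_engine, nominal_fdr, tolerance=2.0):
    """Return engines whose FMR exceeds tolerance times the nominal FDR."""
    return sorted(
        engine for engine, fmr in fmr_by_engine.items()
        if fmr > tolerance * nominal_fdr
    )

# FMR values reported in the text (Pfu dataset, protein FDR 0.01).
reported = {"Tide": 0.032, "X!Tandem": 0.027}
print(flag_excess_false_matches(reported, nominal_fdr=0.01))
```

Both engines are flagged here because their FMRs are more than twice the 1% FDR; a different tolerance may suit other datasets.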
Evaluation of four quality control methods
Combining identifications of different search engines and quality control methods with an appropriate framework
Different search engines and quality control methods implement varied models and algorithms, which makes them mutually complementary and well suited to different subsets of mass spectrometry data. Each search engine and QC method can uniquely identify some spectra (Additional file 1: Figure S3). Combining the results of multiple database search engines or QC methods can therefore increase identifications; however, the uniquely identified results also contribute more false positive hits.
Thus, combining the results of multiple database search engines and QC methods with an appropriate framework would benefit the data analysis process, increase the numbers of identified peptides and improve the confidence level of identifications.
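One way to picture such a framework is to filter identifications separately according to how many tools report them, instead of pooling everything under one threshold. The following is a minimal sketch under stated assumptions: each PSM carries a single comparable score, a decoy flag and a hypothetical `n_tools` field giving its overlap group, and the FDR is estimated as decoys over targets. It is not the authors' implementation.

```python
from collections import defaultdict

def filter_at_fdr(psms, max_fdr=0.01):
    """Keep target PSMs in the largest top-scoring set with FDR <= max_fdr."""
    ranked = sorted(psms, key=lambda p: p["score"], reverse=True)
    targets = decoys = cutoff = 0
    for rank, psm in enumerate(ranked, start=1):
        if psm["is_decoy"]:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = rank  # deepest rank that still satisfies the FDR limit
    return [p for p in ranked[:cutoff] if not p["is_decoy"]]

def filter_separately(psms, max_fdr=0.01):
    """Apply the FDR filter independently within each overlap group."""
    groups = defaultdict(list)
    for psm in psms:
        groups[psm["n_tools"]].append(psm)
    accepted = []
    for n_tools in sorted(groups):
        accepted.extend(filter_at_fdr(groups[n_tools], max_fdr))
    return accepted

def psm(score, is_decoy, n_tools):
    return {"score": score, "is_decoy": is_decoy, "n_tools": n_tools}

# Toy data: PSMs found by one tool include a decoy; those found by three do not.
toy = [psm(10, False, 1), psm(9, False, 1), psm(8, True, 1), psm(7, False, 1)]
toy += [psm(s, False, 3) for s in (5, 4, 3, 2, 1)]
print(len(filter_at_fdr(toy)), len(filter_separately(toy)))
```

In this toy case the pooled filter stops at the first decoy, while separate filtering still accepts the decoy-free group supported by three tools.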
Using a small size of entrapment sequences to evaluate the search engines and tools in large dataset
As mentioned by Granholm et al. and Vaudel et al., to efficiently separate correct PSMs from incorrect ones, the entrapment sequence set is supposed to be many times larger than the sample sequence set. However, an oversized database greatly increases the search time while decreasing the total number of positive identifications. Thus, an appropriately sized database is preferable in practical use. Here, we used the original Archaea protein sequences (Arc20825) as a small entrapment sequence set and reprocessed the LM3 dataset. The results were similar to those of the large entrapment sequence search (details are shown in Additional file 1: Figures S5 and S6). Thus, an easy way to use the entrapment sequence method is to randomize the sample sequences, label them and combine them with the sample sequences in a routine target-decoy database search, so that the entrapment hits at each step provide a rough estimate of the confidence of the intermediate or final results.
In this study, we proposed a complementary use of the target-decoy search strategy for evaluating the proteomics data analysis workflow. The labeled entrapment sequences are combined with the sample sequences to construct the target database for the search; the entrapment hits can then be considered false positive results and used to assess the quality of proteomics data analysis tools. Based on this method, we assessed two key steps of the mass spectrometry data analysis process: database search engines and quality control methods. Testing on both standard and experimental datasets, we found that the new search engine MS-GF+ and the support vector machine model based quality control method PepDistiller performed best among the evaluated tools, and that the performance of search engines can be improved by combining them with efficient quality control methods. We also proposed an alternative integrated method for results from different tools: by filtering the identifications according to their overlap conditions, we can increase the number of identifications and improve the confidence level at the same time.
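The randomize-label-combine recipe can be sketched as follows. The text does not specify the randomization scheme, so whole-sequence shuffling is assumed here as one simple option; the `ENTRAP_` prefix and the function names are hypothetical.

```python
import random

def make_entrapment(records, prefix="ENTRAP_", seed=0):
    """records: list of (protein_id, sequence) pairs.

    Returns labelled copies whose residues are shuffled, so each entrapment
    entry keeps the length and amino acid composition of its source protein.
    """
    rng = random.Random(seed)  # fixed seed keeps the database reproducible
    entrapment = []
    for protein_id, sequence in records:
        residues = list(sequence)
        rng.shuffle(residues)
        entrapment.append((prefix + protein_id, "".join(residues)))
    return entrapment

def to_fasta(records):
    """Serialize (protein_id, sequence) pairs as FASTA text."""
    return "".join(f">{pid}\n{seq}\n" for pid, seq in records)

# Toy sample database, for illustration only.
sample = [("P1", "MKTAYIAKQR"), ("P2", "GSHMLEDPVK")]
combined = sample + make_entrapment(sample)
print(to_fasta(combined))
```

The combined FASTA text would then serve as the target database in a routine target-decoy search, with the prefix used to recognise entrapment hits afterwards.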
Moreover, the entrapment sequence method could be an excellent strategy to assess all steps of the mass spectrometry data analysis process. Its applications can be extended to protein assembling methods, data integration methods and so on. By objective assessment of all steps of the common MS data analysis, we can standardize the analysis pipeline of mass spectrometry data.
FDR: False discovery rate
FMR: False match rate
PSM: Peptide spectrum match
We thank Dong-sheng Li for his help and support. We would like to acknowledge all members of the bioinformatics lab in Beijing Proteome Research Center for helpful discussion.
This study was financially supported by the Special Project of National Science and Technology Cooperation (2014DFB30010) and National Natural Science Foundation of China (21275160, 21475150). Work in J.M.’s laboratory was supported by National High Technology Research and Development Program of China (2015AA020108) and National Basic Research Program of China (2013CB910800). X.F. was supported by Chong Qing postgraduate scientific research and innovation project (CYS14154).
Availability of data and material
The datasets analyzed during the study are available in iProX with the identifier IPX0000812000 (www.iprox.org).
JM and KS conceived and designed the project. XF, LL and JZ collected the datasets, constructed the protein sequence database and implemented the evaluation workflow. CC assisted with the analysis of LM3 dataset. JM, KS and YZ made intellectual contributions to the whole project. JM and XF wrote the manuscript. All authors contributed to the editing of the manuscript, and all authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
About this supplement
This article has been published as part of BMC Genomics Volume 18 Supplement 2, 2017. Selected articles from the 15th Asia Pacific Bioinformatics Conference (APBC 2017): genomics. The full contents of the supplement are available online http://bmcgenomics.biomedcentral.com/articles/supplements/volume-18-supplement-2.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Eng JK, McCormack AL, Yates JR. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry. 1994;5(11):976–89.
2. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999;20(18):3551–67.
3. Craig R, Beavis RC. TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004;20(9):1466–7.
4. Eng JK, Jahan TA, Hoopmann MR. Comet: an open-source MS/MS sequence database search tool. Proteomics. 2013;13(1):22–4.
5. Diament BJ, Noble WS. Faster SEQUEST searching for peptide identification from tandem mass spectra. Journal of proteome research. 2011;10(9):3871–9.
6. Kim S, Pevzner PA. MS-GF+ makes progress towards a universal database search tool for proteomics. Nature communications. 2014;5:5277.
7. Dorfer V, Pichler P, Stranzl T, Stadlmann J, Taus T, Winkler S, Mechtler K. MS Amanda, a universal identification algorithm optimized for high accuracy tandem mass spectra. Journal of proteome research. 2014;13(8):3679–84.
8. Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical chemistry. 2002;74(20):5383–92.
9. Choi H, Nesvizhskii AI. Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. Journal of proteome research. 2008;7(1):254–65.
10. Ding Y, Choi H, Nesvizhskii AI. Adaptive discriminant function analysis and reranking of MS/MS database search results for improved peptide identification in shotgun proteomics. Journal of proteome research. 2008;7(11):4878–89.
11. Li N, Wu S, Zhang C, Chang C, Zhang J, Ma J, Li L, Qian X, Xu P, Zhu Y, et al. PepDistiller: A quality control tool to improve the sensitivity and accuracy of peptide identifications in shotgun proteomics. Proteomics. 2012;12(11):1720–5.
12. Jian L, Xia Z, Niu X, Liang X, Samir P, Link A. l2 multiple kernel fuzzy SVM-based data fusion for improving peptide identification. IEEE/ACM Trans Comput Biol Bioinform. 2016;13(4):804–9.
13. van den Toorn HW, Munoz J, Mohammed S, Raijmakers R, Heck AJ, van Breukelen B. RockerBox: analysis and filtering of massive proteomics search results. Journal of proteome research. 2011;10(3):1420–4.
14. Wedge DC, Krishna R, Blackhurst P, Siepen JA, Jones AR, Hubbard SJ. FDRAnalysis: a tool for the integrated analysis of tandem mass spectrometry identification results from multiple search engines. Journal of proteome research. 2011;10(4):2088–94.
15. Sheng Q, Dai J, Wu Y, Tang H, Zeng R. BuildSummary: using a group-based approach to improve the sensitivity of peptide/protein identification in shotgun proteomics. Journal of proteome research. 2012;11(3):1494–502.
16. Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature methods. 2007;4(3):207–14.
17. Zhang J, Ma J, Dou L, Wu S, Qian X, Xie H, Zhu Y, He F. Bayesian nonparametric model for the validation of peptide identification in shotgun proteomics. Molecular & cellular proteomics : MCP. 2009;8(3):547–57.
18. Ma J, Zhang J, Wu S, Li D, Zhu Y, He F. Improving the sensitivity of MASCOT search results validation by combining new features with Bayesian nonparametric model. Proteomics. 2010;10(23):4293–300.
19. Granholm V, Noble WS, Kall L. On using samples of known protein content to assess the statistical calibration of scores assigned to peptide-spectrum matches in shotgun proteomics. Journal of proteome research. 2011;10(5):2671–8.
20. Vaudel M, Burkhart JM, Breiter D, Zahedi RP, Sickmann A, Martens L. A complex standard for protein identification, designed by evolution. Journal of proteome research. 2012;11(10):5065–71.
21. Wu S, Li N, Ma J, Shen H, Jiang D, Chang C, Zhang C, Li L, Zhang H, Jiang J, et al. First proteomic exploration of protein-encoding genes on chromosome 1 in human liver, stomach, and colon. Journal of proteome research. 2013;12(1):67–80.
22. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, et al. UniProt: the Universal Protein knowledgebase. Nucleic acids research. 2004;32(Database issue):D115–119.
23. Kessner D, Chambers M, Burke R, Agus D, Mallick P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics. 2008;24(21):2534–6.
24. Deutsch EW, Mendoza L, Shteynberg D, Farrah T, Lam H, Tasman N, Sun Z, Nilsson E, Pratt B, Prazen B, et al. A guided tour of the Trans-Proteomic Pipeline. Proteomics. 2010;10(6):1150–9.
25. Park CY, Klammer AA, Kall L, MacCoss MJ, Noble WS. Rapid and accurate peptide identification from tandem mass spectra. Journal of proteome research. 2008;7(7):3022–7.
26. McIlwain S, Tamura K, Kertesz-Farkas A, Grant CE, Diament B, Frewen B, Howbert JJ, Hoopmann MR, Kall L, Eng JK, et al. Crux: rapid open source protein tandem mass spectrometry analysis. Journal of proteome research. 2014;13(10):4488–91.
27. Kall L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature methods. 2007;4(11):923–5.
28. Reiter L, Claassen M, Schrimpf SP, Jovanovic M, Schmidt A, Buhmann JM, Hengartner MO, Aebersold R. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Molecular & cellular proteomics : MCP. 2009;8(11):2405–17.
29. Tu C, Sheng Q, Li J, Ma D, Shen X, Wang X, Shyr Y, Yi Z, Qu J. Optimization of Search Engines and Postprocessing Approaches to Maximize Peptide and Protein Identification for High-Resolution Mass Data. Journal of proteome research. 2015;14(11):4662–73.
30. Granholm V, Kim S, Navarro JC, Sjolund E, Smith RD, Kall L. Fast and accurate database searches with MS-GF+ Percolator. Journal of proteome research. 2014;13(2):890–7.