Volume 17 Supplement 13
Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification
© The Author(s). 2016
Published: 22 December 2016
Proteogenomics is a promising approach for various tasks ranging from gene annotation to cancer research. Databases for proteogenomic searches are often constructed by adding peptide sequences inferred from genomic or transcriptomic evidence to reference protein sequences. Such inflation of databases has the potential to identify novel peptides. However, it also raises concerns about sensitive and reliable peptide identification. Spurious peptides included in target databases may result in an underestimated false discovery rate (FDR). On the other hand, inflation of decoy databases could decrease the sensitivity of peptide identification because of the increased number of high-scoring random hits. Although several studies have addressed these issues, widely applicable guidelines for sensitive and reliable proteogenomic search are still lacking.
To systematically evaluate the effect of database inflation in proteogenomic searches, we constructed a variety of real and simulated proteogenomic databases for yeast and human tandem mass spectrometry (MS/MS) data. Against these databases, we tested two popular database search tools with various approaches to search result validation: the target-decoy search strategy (with and without a refined scoring metric) and a mixture model-based method. The effect of separate filtering of known and novel peptides was also examined. The results from real and simulated proteogenomic searches confirmed that separate filtering increases the sensitivity and reliability of proteogenomic search. However, no single method consistently identified the largest (or the smallest) number of novel peptides from real proteogenomic searches.
We propose to use a set of search result validation methods with separate filtering, for sensitive and reliable identification of peptides in proteogenomic search.
Keywords: False discovery rate; Proteogenomic search; Separate false discovery rate analysis; Simulation; Target-decoy approach; Model-based approach
Proteogenomic search, i.e., searching tandem mass spectrometry (MS/MS) spectra against an integrated database consisting of reference proteins as well as protein sequences derived from genomic or transcriptomic evidence or hypotheses, is useful for identifying novel or sample-specific peptides. Typical approaches to the construction of proteogenomic databases include 6-frame translation of a genome [2, 3] and extraction of splicing information from RNA sequencing (RNA-seq) data [4–7]. In 6-frame translation, peptide sequences are generated from each of the six possible reading frames. From the extracted splicing information, novel splice-junction peptide sequences can be obtained. These genomic or transcriptomic information sources are essential for identifying novel peptides, whose sequences are not contained in reference protein databases such as RefSeq and UniProtKB. Thus, proteogenomic search has been applied to various tasks such as discovering novel protein-coding regions [2, 10, 11], validation of gene annotation [12–15], and studying disease mechanisms for personalized diagnosis and treatments [16–18].
However, there are a number of challenges for proteogenomic search. Proteogenomic databases can be an order of magnitude larger than reference protein databases. For example, Woo and colleagues constructed a 6-frame translation database (102 MB) and a splice graph database (410 MB) for Caenorhabditis elegans, which were respectively 7 and 28 times the size of a C. elegans reference protein database from UniProtKB. The increased size of proteogenomic databases demands more computational resources, resulting in longer analysis times than conventional proteomic database search.
Moreover, such inflation of proteogenomic databases makes it hard to apply widely used methods for controlling false discovery rates (FDRs) in peptide identification. In proteomic database search, the FDR of a search result is usually estimated by the target-decoy approach [19–21], in which a decoy database, consisting of reversed or shuffled versions of the target protein sequences, is used. An inflated target database for proteogenomic search contains a large number of spurious peptide sequences. For example, most of the peptide sequences obtained from 6-frame translation of a genome are not likely to be produced in vivo or in vitro. In this regard, the FDR in proteogenomic search is prone to underestimation, because random hits to the spurious peptide sequences are counted as target hits and their number is not negligible when the database inflation is significant. Furthermore, decoy databases for proteogenomic search can be significantly larger than decoy databases for conventional proteomic database search. An inflated decoy database could decrease the sensitivity of peptide identification at the same FDR, because the number of high-scoring decoy hits increases with the size of the decoy database.
Since the early stage of proteogenomics, it has been well noted that proteogenomic searches would produce more erroneous identifications than proteomic database searches because of their database sizes [1, 22, 23]. Blakeley and colleagues showed that database choice influences FDR estimation. They proposed limiting database size to improve FDR estimation in the target-decoy approach. Krug and colleagues showed, using a 6-frame translated Escherichia coli genome, that FDRs in proteogenomic search could be substantially underestimated. However, reliable and sensitive peptide identification methods applicable to various proteogenomic databases for organisms with still-evolving genomic information are not yet available.
To systematically evaluate the effect of database inflation in proteogenomic search on peptide identification, we generated a set of simulated and real proteogenomic databases. Proteogenomic databases of varying sizes were simulated by adding decoy peptide sequences to reference protein databases. As real proteogenomic databases, 6-frame translated versions of the yeast and the human reference genomes and a splice graph database, constructed from a human RNA-seq data set, were used. Yeast and human MS/MS spectra were searched against the simulated and real proteogenomic databases using two widely used database search tools, X!Tandem and Comet. To validate the search results, the target-decoy search strategy [19, 21] and a mixture model-based method were used and compared. The target-decoy search strategy was also tested with a refined scoring metric calculated by the self-boosted Percolator. The mixture model-based method assumes a mixture of score distributions for correct and incorrect peptide identifications. In contrast, the other methods require minimal distributional assumptions on peptide-spectrum match (PSM) scores. Additionally, we examined the effect of separate filtering of known and novel peptides with each of these methods. Separate filtering has been suggested for proteogenomic search, considering the difference in the probability of identifying known and novel peptides. Our evaluation and comparison of various peptide identification approaches applied to various proteogenomic databases provide insight into peptide identification in proteogenomics.
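The mixture-model idea mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the scores of incorrect and correct identifications each follow a Gaussian distribution with known parameters; the component shapes, the parameter values, and the function names are hypothetical and are not taken from the method evaluated in this study.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_error_probability(score, pi_incorrect, incorrect, correct):
    """Posterior probability that a PSM with this score is incorrect.

    pi_incorrect: assumed mixing weight of the incorrect component.
    incorrect, correct: (mean, sd) of the two assumed Gaussian components.
    """
    f0 = pi_incorrect * normal_pdf(score, *incorrect)
    f1 = (1.0 - pi_incorrect) * normal_pdf(score, *correct)
    return f0 / (f0 + f1)


# Hypothetical parameters: incorrect scores around 0, correct scores around 4.
pep_mid = posterior_error_probability(2.0, 0.5, (0.0, 1.0), (4.0, 1.0))
```

With equal weights and equal spreads, a score halfway between the two component means is equally likely to come from either component, so `pep_mid` is 0.5. A model-based filter would rank PSMs by one minus this value.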
MS/MS data set
We used a yeast MS/MS data set generated and studied by Joo and colleagues. Briefly, the data set was obtained from a yeast cell lysate, which was digested by trypsin and then separated by MudPIT. For the MS/MS analysis, an LTQ-Orbitrap hybrid mass spectrometer was used. The yeast data set contained 63,031 MS/MS scans. We also used a human MS/MS data set generated from a human gastric tissue sample. The tissue sample was obtained from a Korean gastric cancer patient, who signed an internal review board (IRB)-approved informed consent form. The human sample was digested by trypsin and analyzed using a quadrupole orbitrap mass spectrometer (Q Exactive, Thermo Scientific, Bremen, Germany) coupled with a dual online ultrahigh pressure liquid chromatography system (see Additional file 1: Supplementary methods for details on sample preparation and liquid chromatography (LC)-MS/MS experiments). The resulting data set contained 139,629 MS/MS spectra.
Databases consisting of reference protein sequences
A yeast “target” protein database (1Ty), including 179 common contaminants and 6619 S. cerevisiae proteins downloaded from Swiss-Prot (07/2012), was constructed. The total length of protein sequences in 1Ty was 3,062,279 amino acids (AA). A human target protein database (1Th), containing UniProt human protein sequences (05/2013; 90,191 entries) and 179 common contaminants, was built. The total length of protein sequences in 1Th was 35,856,033 AA. For simulated proteogenomic database construction and FDR estimation (see Database construction for simulated proteogenomic search and Database search and validation of search result), we used decoy databases 1Dy and 1Dh, which were constructed by ‘pseudo-reversing’ or ‘pseudo-shuffling’ the protein sequences in 1Ty and 1Th, respectively. To construct the decoy databases, all the fully-tryptic peptides (with a maximum missed cleavage value of two) were extracted from the target protein databases. Then, each of the extracted peptides was reversed (pseudo-reversing) or randomly permuted (pseudo-shuffling), preserving the length and the amino acid composition of the original peptide. By modeling the null hypothesis (i.e., an incorrect PSM), decoy databases can be used for p-value calculation and FDR estimation in peptide identification.
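The decoy construction described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the cleavage rule (after K or R unless followed by P) and the choice to keep the C-terminal residue in place during reversal and shuffling are common conventions that are assumed here, and all function names are hypothetical.

```python
import random


def cleave(protein):
    """Split a protein after K or R, except when the next residue is P (assumed rule)."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    return pieces


def tryptic_peptides(protein, max_missed=2):
    """Fully tryptic peptides with at most max_missed missed cleavages."""
    pieces = cleave(protein)
    peptides = []
    for i in range(len(pieces)):
        last = min(i + max_missed, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def pseudo_reverse(peptide):
    """Reverse all residues except the C-terminal one (assumed convention)."""
    return peptide[:-1][::-1] + peptide[-1]


def pseudo_shuffle(peptide, rng):
    """Randomly permute all residues except the C-terminal one (assumed convention)."""
    prefix = list(peptide[:-1])
    rng.shuffle(prefix)
    return "".join(prefix) + peptide[-1]


peptides = tryptic_peptides("AKRPGKCDE")
decoys = [pseudo_reverse(p) for p in peptides]
shuffled = pseudo_shuffle("RPGK", random.Random(0))
```

Both transformations keep the length and amino acid composition of each peptide, so the decoy peptides have the same masses as their targets.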
Database construction for real proteogenomic search
Two types of real proteogenomic target databases were used in the experiments: 6-frame translation databases for yeast (6FTTy) and human (6FTTh) as well as a splice graph database for human (SGTh). 6FTTy was constructed by 6-frame translation of the yeast whole-genome sequences (04/2014) downloaded from http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/. 6FTTh was generated by 6-frame translation of the human reference genome (hg19) downloaded from ftp://ftp.ensembl.org/pub/release-71/fasta/homo_sapiens/dna/. Both 6FTTy and 6FTTh were constructed using Cancer Proteogenomics Tools developed by Woo and colleagues (downloaded from http://proteomics.ucsd.edu/software-tools/splicedb-splice-graph-proteomics-tools/). The tool translates genomic regions between start and stop codons and ignores any splicing events. The proteins generated by the tool are usually shorter than reference proteins. In total, 6FTTy contained 114,386 proteins, corresponding to 2,010,708 fully-tryptic unique peptides (with a minimum length of eight AA and a maximum missed cleavage value of 2). Among the 688,452 fully-tryptic peptides in 1Ty, 677,777 (98.4%) existed in 6FTTy. The number of proteins in 6FTTh was 34,041,059, corresponding to 389,586,415 fully-tryptic unique peptides (with a minimum length of eight AA and a maximum missed cleavage value of 2). Among the 3,118,351 fully-tryptic peptides in 1Th, 1,851,052 (59.4%) were contained in 6FTTh.
SGTh was constructed using an RNA-seq data set obtained from the same tissue sample that was used for generating the human MS/MS data set (see MS/MS data set). The RNA-seq data set (binary sequence alignment/map (BAM) file) contained 41,353,547 reads mapped onto the human reference genome (hg19). A splice graph, whose nodes and edges respectively denote exons and splice junctions, was built using the read mapping information in the RNA-seq data set. From the splice graph, protein sequences (i.e., splice graph targets) for database search were extracted. We used Cancer Proteogenomics Tools for constructing SGTh as in the studies by Woo and colleagues (see Additional file 1: Supplementary methods for details on the RNA-seq analysis and splice graph database construction). SGTh included 264,426 splice graph targets and 90,370 entries from 1Th. Decoy databases for the three target proteogenomic databases were created by pseudo-reversing (see Databases consisting of reference protein sequences) and are denoted as 6FTDy, 6FTDh, and SGDh, respectively.
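As a rough illustration of 6-frame translation, the sketch below translates a DNA string in the three forward and three reverse-complement reading frames using the standard genetic code. It is not the Cancer Proteogenomics Tools implementation: start-codon handling and any length filters are omitted, and the frame labels and function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second, and third base.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def stop_free_segments(protein):
    """Split a frame translation at stop codons ('*') and drop empty pieces."""
    return [segment for segment in protein.split("*") if segment]


frames = six_frame_translation("ATGAAATAG")
```

Because every frame is translated regardless of whether it is expressed, most of the resulting sequences are expected to be spurious, which is the source of the database inflation discussed in this study.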
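The splice graph structure described above (exons as nodes, splice junctions as edges) can be sketched with a small path enumeration. This is only a toy illustration under the assumption that the graph is acyclic; the data layout and function name are hypothetical, and the actual tool may extract sequences differently, for example without enumerating full paths.

```python
def enumerate_transcripts(exons, junctions, start):
    """Yield the concatenated exon sequence of every path from start to a sink.

    exons: dict mapping exon id -> nucleotide sequence (hypothetical layout).
    junctions: dict mapping exon id -> list of downstream exon ids.
    Assumes the graph has no cycles.
    """
    stack = [(start, exons[start])]
    while stack:
        node, sequence = stack.pop()
        successors = junctions.get(node, [])
        if not successors:
            yield sequence
        for nxt in successors:
            stack.append((nxt, sequence + exons[nxt]))


exons = {"e1": "ATG", "e2": "AAA", "e3": "TAG"}
junctions = {"e1": ["e2", "e3"], "e2": ["e3"]}
transcripts = sorted(enumerate_transcripts(exons, junctions, "e1"))
```

In this toy graph the junction from e1 to e3 skips e2, so two distinct sequences are produced, one with and one without the middle exon.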
Database construction for simulated proteogenomic search
We assumed that the majority of the peptide sequences newly added to 6-frame translation and splice graph databases, apart from reference protein sequences, are not real target sequences but random sequences. For example, the proportion of novel peptides identified in a recent proteogenomic search was 0.8% of the total peptides identified, although the proteogenomic database was more than 60 times larger than the reference protein database. To test our hypothesis, we constructed simulated proteogenomic databases containing varying numbers of “simulated novel” proteins, which were generated by the decoy database generation methods (see Databases consisting of reference protein sequences).
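The naming of the simulated databases suggests a simple size relationship, sketched below. The sketch assumes that each added block of simulated novel sequences has the same total length as the reference database, so that a target database 1TnD and its decoy (n + 1)D both contain (n + 1) times the reference length. This assumption is consistent with the sizes quoted in the text for n = 2, but the function and variable names are hypothetical.

```python
# Total sequence lengths (AA) of the reference databases 1Ty and 1Th, from the text.
REFERENCE_AA = {"yeast": 3_062_279, "human": 35_856_033}


def simulated_sizes(reference_aa, n):
    """Return (target_aa, decoy_aa) for the database pair '1TnD + (n+1)D'.

    Assumes every simulated-novel block is as long as the reference database.
    """
    target_aa = reference_aa * (1 + n)   # reference plus n simulated-novel blocks
    decoy_aa = reference_aa * (n + 1)    # decoy database of matching size
    return target_aa, decoy_aa


yeast_n2 = simulated_sizes(REFERENCE_AA["yeast"], 2)
human_n2 = simulated_sizes(REFERENCE_AA["human"], 2)
```

For n = 2 this gives 9,186,837 AA for yeast and 107,568,099 AA for human, matching the sizes reported for '1T2Dy + 3Dy' and '1T2Dh + 3Dh'.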
Table 1 Size of proteomic, simulated proteogenomic, and real proteogenomic databases for yeast
Columns: Database (target + decoy); # Target (AA); # Decoy (AA)
Rows: 1Ty + 1Dy; 1T1Dy + 2Dy; 1T2Dy + 3Dy; 1T5Dy + 6Dy; 6FTTy + 6FTDy
Table 2 Size of proteomic, simulated proteogenomic, and real proteogenomic databases for human
Columns: Database (target + decoy); # Target (AA); # Decoy (AA)
Rows: 1Th + 1Dh; 1T1Dh + 2Dh; 1T2Dh + 3Dh; 1T5Dh + 6Dh; 6FTTh + 6FTDh; SGTh + SGDh
The pseudo-shuffling method could introduce an extra level of redundancy into decoy databases by producing multiple peptides with the same sequence. This can be especially problematic in simulated proteogenomic search, where a large decoy database is generated by pseudo-shuffling. Thus, we checked the level of redundancy in the decoy databases used in our experiments for simulated proteogenomic search. The proportion of redundant peptides in the generated decoy databases was less than 0.74% and 1.50% for yeast and human, respectively (see Additional file 3: Table S1). These proportions are smaller than those in the reference protein databases 1Ty and 1Th (2.24% and 59.62%, respectively).
Database search and validation of search result
Two database search tools, X!Tandem and Comet, were used in the experiments. For the yeast MS/MS data set, both search tools were run with a 3 Da peptide mass tolerance, a 1 Da MS/MS mass tolerance, and the semi-tryptic option. The human MS/MS data set was also searched by X!Tandem and Comet, with a 15 ppm parent mass tolerance, a 0.03 Da fragment mass tolerance, and the fully-tryptic option. For both data sets, one fixed modification at Cys (Carbamidomethyl, +57.02146) and one variable modification at Met (Oxidation, +15.99492) were allowed.
The three methods, TD, BP, and MB, were also applied to separate filtering of known and novel (or simulated novel) peptides (SepTD, SepBP, and SepMB; see Additional file 4: Figure S2). In SepTD, PSMs were divided into two groups, known and novel (or simulated novel), after database search. Then, each PSM group was separately filtered by TD. For the semi-supervised machine learning in SepBP or SepMB, the PSMs of both reference (known) and non-reference (novel or simulated novel) protein sequences were used together as in BP or MB, because it was not possible to train a separate support vector machine or fit a separate mixture model using only the novel (or simulated novel) PSMs. After the machine learning step, PSMs were divided into known and novel (or simulated novel) groups. Then, each group was separately filtered as follows. In SepBP, the PSMs of each group were separately sorted by the recalibrated score from BP, and filtered by estimating the FDR as N_D/N_T. In SepMB, the PSMs of each group were separately sorted by (1 – posterior error probability), and filtered by estimating the FDR as N_D/N_T. In the experiments, a 1% FDR cut-off was used for high-confidence peptide identification. We calculated peptide-level FDRs by considering only the highest-scoring PSM per peptide.
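The two kinds of mass tolerance used above (absolute, in Da, and relative, in ppm) differ in how the acceptance window scales with mass. The sketch below is a generic illustration of that difference, not code from X!Tandem or Comet; the function name and example masses are hypothetical, while the two modification masses are the values stated in the text.

```python
# Modification mass shifts (Da) as stated in the search settings.
CARBAMIDOMETHYL_CYS = 57.02146   # fixed modification at Cys
OXIDATION_MET = 15.99492         # variable modification at Met


def within_tolerance(observed, theoretical, tolerance, unit="ppm"):
    """Check whether an observed mass matches a theoretical mass.

    unit="ppm": window is theoretical * tolerance / 1e6 (scales with mass).
    unit="Da":  window is the tolerance itself (independent of mass).
    """
    difference = abs(observed - theoretical)
    if unit == "ppm":
        return difference <= theoretical * tolerance / 1e6
    if unit == "Da":
        return difference <= tolerance
    raise ValueError(f"unknown tolerance unit: {unit}")


# A 15 ppm window around 1500 Da is 0.0225 Da wide on each side.
accepted = within_tolerance(1500.02, 1500.0, 15, unit="ppm")
rejected = within_tolerance(1500.03, 1500.0, 15, unit="ppm")
```

The much wider 3 Da window used for the yeast data admits far more candidate peptides per spectrum than the 15 ppm window used for the human data, which is one reason search space size matters for the comparisons in this study.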
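A minimal sketch of peptide-level filtering with and without separation of known and novel peptides is given below. It reads N_D and N_T as the numbers of decoy and target peptides above the score threshold, which is an interpretation of the notation rather than a definition given in this section. Assigning each decoy hit to the known or novel group through an `is_novel` flag is also an assumption, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    peptide: str
    score: float     # higher is better
    is_decoy: bool
    is_novel: bool   # True if the sequence is not from the reference proteins


def best_psm_per_peptide(psms):
    """Keep only the highest-scoring PSM of each peptide."""
    best = {}
    for psm in psms:
        if psm.peptide not in best or psm.score > best[psm.peptide].score:
            best[psm.peptide] = psm
    return list(best.values())


def filter_at_fdr(psms, fdr=0.01):
    """Return target peptides above the lowest score at which N_D / N_T <= fdr."""
    ranked = sorted(best_psm_per_peptide(psms), key=lambda p: p.score, reverse=True)
    n_target = n_decoy = cut = 0
    for position, psm in enumerate(ranked, start=1):
        if psm.is_decoy:
            n_decoy += 1
        else:
            n_target += 1
        if n_target > 0 and n_decoy / n_target <= fdr:
            cut = position   # remember the deepest acceptable cut-off
    return [psm.peptide for psm in ranked[:cut] if not psm.is_decoy]


def separate_filter(psms, fdr=0.01):
    """Filter known and novel peptides independently at the same FDR."""
    known = [psm for psm in psms if not psm.is_novel]
    novel = [psm for psm in psms if psm.is_novel]
    return filter_at_fdr(known, fdr), filter_at_fdr(novel, fdr)
```

The `score` field can hold a raw search score, a recalibrated score, or one minus a posterior error probability, so the same routine covers rankings in the spirit of SepTD, SepBP, and SepMB. When high-scoring decoy hits from the inflated part of the database are pooled with known peptides, they can push the joint cut-off up and discard many known peptides; filtering the groups separately keeps those decoys from affecting the known group.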
Results and Discussion
Comparison between simulated and real proteogenomic search results
To test the effectiveness of simulation experiments, peptide identification results were compared between the following simulated and real proteogenomic database pairs of similar sizes: ‘1T2Dy + 3Dy’ (9,186,837 + 9,186,837 AA) and ‘6FTTy + 6FTDy’ (9,654,965 + 9,654,965 AA) for yeast, and ‘1T2Dh + 3Dh’ (107,568,099 + 107,568,099 AA) and ‘SGTh + SGDh’ (123,364,545 + 123,364,545 AA) for human (see Tables 1 and 2).
The total number of known and novel (or simulated novel) peptides identified from the simulated and real proteogenomic databases of similar sizes was also similar in most cases. For yeast, the difference was from 51 to 556 (corresponding to 0.95 to 8.99% of the peptides identified from ‘6FTTy + 6FTDy’) when X!Tandem was used (Fig. 1(a)). The difference in the results for yeast obtained using Comet was also small (from 0.90 to 4.41%), except for the results validated by SepTD and SepBP, in which the difference was 32.63 and 15.51% of the number of peptides identified from ‘6FTTy + 6FTDy’, respectively (Additional file 5: Figure S3(a)). In these two cases, the increase in the number of identified peptides by separate filtering of known and novel (or simulated novel) peptides was much larger for ‘6FTTy + 6FTDy’ than for ‘1T2Dy + 3Dy’. For the human data set, the difference in the number of identified peptides was 9.38 to 19.02% (for X!Tandem) and 2.80 to 13.86% (for Comet) of the number of peptides identified from ‘SGTh + SGDh’ (Fig. 1(b) and Additional file 5: Figure S3(b)). It must be noted that the size difference between the simulated and real proteogenomic databases was 4.85% (‘1T2Dy + 3Dy’ and ‘6FTTy + 6FTDy’) and 12.80% (‘1T2Dh + 3Dh’ and ‘SGTh + SGDh’).
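The size differences quoted above can be reproduced from the database sizes given in the text. The sketch below assumes the difference is expressed relative to the size of the real proteogenomic database; this choice of denominator is an inference that happens to reproduce the reported percentages, and the function name is hypothetical.

```python
def relative_size_difference(simulated_aa, real_aa):
    """Size difference as a percentage of the real database size (assumed denominator)."""
    return abs(real_aa - simulated_aa) / real_aa * 100.0


# Target database sizes (AA) quoted in the text.
yeast_diff = relative_size_difference(9_186_837, 9_654_965)        # 1T2Dy vs 6FTTy
human_diff = relative_size_difference(107_568_099, 123_364_545)    # 1T2Dh vs SGTh
```

Rounded to two decimals, these evaluate to 4.85 and 12.80, in line with the percentages stated for the yeast and human database pairs.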
Thus, we observed that proteogenomic search against simulated and real proteogenomic databases of similar sizes produced similar results with regard to the proportion of known peptides identified from reference protein sequences as well as the total number of identified peptides at the same FDR in most cases. These results suggest that we could use simulated proteogenomic databases for quantitatively examining the effect of database inflation on the sensitivity and reliability of peptide identifications.
Sensitivity and reliability in simulated proteogenomic search
Among the three separate filtering methods, SepBP identified the largest number of peptides from the search results using X!Tandem (Figs. 2 and 3; Additional file 6: Figure S4 and Additional file 7: Figure S5). In most cases, SepBP and SepMB identified larger numbers of peptides than SepTD from the search results using Comet (Additional file 8: Figure S6, Additional file 9: Figure S7, Additional file 10: Figure S8, and Additional file 11: Figure S9). Thus, machine learning-based methods for search result validation seem to improve the sensitivity in proteogenomic search.
We counted the number of simulated novel peptides identified from nDy or nDh, because they are highly likely to be false positives. In most cases, the number of identified simulated novel peptides increased as n increased (Figs. 2 and 3; Additional file 6: Figure S4, Additional file 7: Figure S5, Additional file 8: Figure S6, Additional file 9: Figure S7, Additional file 10: Figure S8, and Additional file 11: Figure S9). Thus, database inflation in proteogenomic search could also reduce the reliability of peptide identification regardless of the search result validation method. However, the number of simulated novel peptides identified by TD, BP, or MB was always much larger than that identified by SepTD, SepBP, or SepMB, suggesting that separate filtering of known and novel peptides is also essential for improving the reliability of proteogenomic search. Among the three separate filtering methods, SepBP identified the largest number of simulated novel peptides in most cases. SepTD was the most conservative for simulated novel peptide identification.
Sensitivity and reliability in real proteogenomic search
Table 3 Number of peptides with charge 2+ at 1% FDR identified from search against real proteogenomic databases using X!Tandem
Databases (target + decoy): 6FTTy + 6FTDy; 6FTTh + 6FTDh; SGTh + SGDh
Table 4 Number of peptides with charge 3+ at 1% FDR identified from search against real proteogenomic databases using X!Tandem
Databases (target + decoy): 6FTTy + 6FTDy; 6FTTh + 6FTDh; SGTh + SGDh
We compared the three separate filtering methods (SepTD, SepBP, and SepMB) with regard to novel peptide identification. In Table 3, SepMB identified the smallest number of novel peptides with charge 2+ from the search against ‘6FTTh + 6FTDh’ using X!Tandem. However, the same method identified the largest number of novel peptides with the same charge from the search against ‘SGTh + SGDh’ using the same database search tool. SepMB also identified the largest number of novel peptides (charge 2+) from the search against ‘6FTTh + 6FTDh’ using Comet (Additional file 12: Table S2). Therefore, none of the three filtering methods appears to be consistently the most (or the least) conservative for identifying novel peptides from real proteogenomic search. In many cases, SepTD, SepBP, and SepMB identified similar numbers of novel peptides from the three real proteogenomic databases. However, SepBP identified much larger numbers (>50 more) of novel peptides with charge 2+ than SepTD and SepMB from the searches against ‘6FTTy + 6FTDy’ and ‘6FTTh + 6FTDh’ using X!Tandem (Table 3). Thus, SepBP could produce different results from the other two methods in novel peptide identification from proteogenomic search.
With regard to the identification of known peptides from proteogenomic search, SepBP identified larger numbers of peptides than SepTD and SepMB in most cases. From the search results obtained using Comet, SepBP identified 6.23 to 65.82% more known peptides than SepTD or SepMB (Additional file 12: Table S2 and Additional file 13: Table S3). Only for the human data set searched against ‘6FTTh + 6FTDh’ using X!Tandem did SepBP identify a smaller number (5950) of known peptides (charge 2+) than SepTD (6336) and SepMB (6180) (Table 3). In many cases, SepTD and SepMB identified similar numbers of known peptides, except for the case of identifying known peptides (charge 3+) from the search against ‘6FTTy + 6FTDy’ using X!Tandem, where SepMB identified 21.24% more peptides than SepTD (Table 4).
Besides the six search result validation methods, we tested the two-stage FDR method against ‘SGTh + SGDh’. In the two-stage FDR method for proteogenomic search, only the spectra not matched to reference protein sequences with a sufficient level of confidence were searched against proteogenomic databases (see Additional file 1: Supplementary methods for a more detailed description of the two-stage FDR method). Additional file 15: Table S4 compares the peptide identification results among TD, SepTD, and the two-stage FDR method. Similar to the separate filtering methods, the two-stage FDR method increased the number of identified known peptides while decreasing the number of identified novel peptides. SepTD and the two-stage FDR method identified similar numbers of novel peptides.
Proteogenomic searches, originally suggested for gene annotation and validation, are now becoming a routine tool for many tasks including proteogenomic characterization of tumors. Since proteogenomic databases are inflated with a large number of spurious peptide sequences, it is important to accurately estimate the effect of such inflation on sensitive and reliable peptide identification. We evaluated the effect of database inflation in proteogenomic search using various simulated and real proteogenomic databases. Two popular database search tools with six approaches to search result validation were tested. First, we showed that the number of peptides identified from simulated and real proteogenomic databases of similar sizes is also similar, supporting the use of simulated proteogenomic databases for estimating the sensitivity and reliability of various search and validation strategies. Then, the relationship between the database size and the number of identified peptides was examined using simulated proteogenomic databases. When known and simulated novel peptides were filtered together, the number of peptides at the same FDR decreased as the (target + decoy) database size increased. However, the results from separate filtering of known and simulated novel peptides were hardly influenced by changes in database size. Moreover, the separate filtering methods effectively removed most of the simulated novel peptides, which are highly likely to be false positives. Finally, the sensitivity and reliability of real proteogenomic search were examined using 6-frame translated versions of the yeast and the human genomes and a splice graph database constructed using human RNA-seq data.
As in the results from simulated proteogenomic search, separate filtering of known and novel peptides increased the number of identified known peptides while decreasing the number of identified novel peptides, compared with the methods that filter known and novel peptides together. Therefore, separate filtering of known and novel peptides is strongly recommended for proteogenomic database search. Among the three separate filtering methods, SepBP generally identified the largest number of peptides, suggesting that semi-supervised machine learning could be effective in increasing the sensitivity of proteogenomic search. In terms of novel peptide identification, the three separate filtering methods usually identified similar numbers of novel peptides; however, no single method consistently identified the largest (or the smallest) number of novel peptides. Furthermore, the number of novel peptides commonly identified by the three methods was not large, suggesting that false negatives could be an issue even in novel peptide identification. To reduce false negatives, one can apply multiple separate filtering methods to a proteogenomic search result and combine the novel peptides identified by each method. In conclusion, we propose using two or more methods for search result validation with separate filtering of known and novel peptides, to maximize the sensitivity and reliability of proteogenomic search.
- (n + 1)Dh :
Simulated proteogenomic decoy database for human
- (n + 1)Dy :
Simulated proteogenomic decoy database for yeast
- 1Th :
“Target” reference protein sequences for human
- 1TnDh :
Simulated proteogenomic target database for human
- 1TnDy :
Simulated proteogenomic target database for yeast
- 1Ty :
“Target” reference protein sequences for yeast
- 6FTDh :
Six-frame translation decoy database for human
- 6FTDy :
Six-frame translation decoy database for yeast
- 6FTTh :
Six-frame translation target database for human
- 6FTTy :
Six-frame translation target database for yeast
- BAM :
Binary sequence alignment/map
- FDR :
False discovery rate
- IRB :
Internal review board
- MB :
Mixture model-based method
- MS/MS :
Tandem mass spectrometry
- SepBP :
Separated filtering of known and novel peptides using self-boosted Percolator
- SepMB :
Separated filtering of known and novel peptides using mixture model-based method
- SepTD :
Separated filtering of known and novel peptides using target-decoy search strategy
- SGDh :
Splice graph decoy database for human
- SGTh :
Splice graph target database for human
- TD :
Target-decoy search strategy
This work was supported in part by the National Research Foundation of Korea [NRF-2012M3A9B9036676, NRF-2014R1A2A1A11054147, NRF-2012M3A9D1054452, NRF-2012M3A9D1054705] and in part by the Institute for Information & Communications Technology Promotion (IITP) of Korea [R6910-15-1014].
This article has been published as part of BMC Genomics Volume 17 Supplement 13, 2016: 15th International Conference On Bioinformatics (INCOB 2016). The full contents of the supplement are available online at https://bmcgenet.biomedcentral.com/articles/supplements/volume-17-supplement-13.
Publication charges for this article have been funded by the National Research Foundation of Korea.
Availability of data and materials
The human MS/MS and RNA-seq data sets used in this study are available upon request.
HL carried out the experiments and drafted the manuscript. YSJ and HK carried out the experiments and provided input on drafts of the manuscript. S-WL provided the experimental data and wrote sections of the manuscript. EP and K-BH conceived the original idea of this study, participated in its design and coordination, and wrote sections of the manuscript. All authors read and approved the final manuscript.
Hyunwoo Kim: This work was done when the author was a PhD student at Hanyang University.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
The individual subjected to this study was recruited at Asan Medical Center in Korea. The patient signed an IRB-approved informed consent form (NCCNCS-120581).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nat Methods. 2014;11(11):1114–25.View ArticlePubMedPubMed CentralGoogle Scholar
- Castellana NE, Shen Z, He Y, Walley JW, Cassidy CJ, Briggs SP, Bafna V. An automated proteogenomic method uses mass spectrometry to reveal novel genes in Zea mays. Mol Cell Proteomics. 2014;13(1):157–67.View ArticlePubMedGoogle Scholar
- Khatun J, Yu Y, Wrobel JA, Risk BA, Gunawardena HP, Secrest A, Spitzer WJ, Xie L, Wang L, Chen X, et al. Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions. BMC Genomics. 2013;14:141.View ArticlePubMedPubMed CentralGoogle Scholar
- Sheynkman GM, Shortreed MR, Frey BL, Smith LM. Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-Seq. Mol Cell Proteomics. 2013;12(8):2341–53.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.View ArticlePubMedPubMed CentralGoogle Scholar
- Woo S, Cha SW, Merrihew G, He Y, Castellana N, Guest C, MacCoss M, Bafna V. Proteogenomic database construction driven from large scale RNA-seq data. J Proteome Res. 2014;13(1):21–8.View ArticlePubMedGoogle Scholar
- Woo S, Cha SW, Na S, Guest C, Liu T, Smith RD, Rodland KD, Payne S, Bafna V. Proteogenomic strategies for identification of aberrant cancer peptides using large-scale next-generation sequencing data. Proteomics. 2014;14(23–24):2719–30.View ArticlePubMedPubMed CentralGoogle Scholar
- Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012;40(Database issue):D130–5.View ArticlePubMedGoogle Scholar
- Schneider M, Lane L, Boutet E, Lieberherr D, Tognolli M, Bougueleret L, Bairoch A. The UniProtKB/Swiss-Prot knowledgebase and its Plant Proteome Annotation Program. J Proteome. 2009;72(3):567–73.View ArticleGoogle Scholar
- Kim MS, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, Madugundu AK, Kelkar DS, Isserlin R, Jain S, et al. A draft map of the human proteome. Nature. 2014;509(7502):575–81.
- Nesvizhskii AI, Roos FF, Grossmann J, Vogelzang M, Eddes JS, Gruissem W, Baginsky S, Aebersold R. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Mol Cell Proteomics. 2006;5(4):652–70.
- Ansong C, Purvine SO, Adkins JN, Lipton MS, Smith RD. Proteogenomics: needs and roles to be filled by proteomics in genome annotation. Brief Funct Genomic Proteomic. 2008;7(1):50–62.
- Castellana NE, Payne SH, Shen Z, Stanke M, Bafna V, Briggs SP. Discovery and revision of Arabidopsis genes by proteogenomics. Proc Natl Acad Sci U S A. 2008;105(52):21034–8.
- Jaffe JD, Berg HC, Church GM. Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 2004;4(1):59–77.
- Tanner S, Shen Z, Ng J, Florea L, Guigo R, Briggs SP, Bafna V. Improving gene annotation using peptide mass spectrometry. Genome Res. 2007;17(2):231–9.
- Renuse S, Chaerkady R, Pandey A. Proteogenomics. Proteomics. 2011;11(4):620–30.
- Sun H, Xing X, Li J, Zhou F, Chen Y, He Y, Li W, Wei G, Chang X, Jia J, et al. Identification of gene fusions from human lung cancer mass spectrometry data. BMC Genomics. 2013;14 Suppl 8:S5.
- Zhang B, Wang J, Wang X, Zhu J, Liu Q, Shi Z, Chambers MC, Zimmerman LJ, Shaddox KF, Kim S, et al. Proteogenomic characterization of human colon and rectal cancer. Nature. 2014;513(7518):382–7.
- Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods. 2007;4(3):207–14.
- He K, Fu Y, Zeng WF, Luo L, Chi H, Liu C, Qing LY, Sun RX, He SM. A theoretical foundation of the target-decoy search strategy for false discovery rate control in proteomics. arXiv e-print (arXiv:1501.00537). 2015.
- Nesvizhskii AI. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics. 2010;73(11):2092–123.
- Castellana N, Bafna V. Proteogenomics to discover the full coding content of genomes: a computational perspective. J Proteomics. 2010;73(11):2124–35.
- Hernandez C, Waridel P, Quadroni M. Database construction and peptide identification strategies for proteogenomic studies on sequenced genomes. Curr Top Med Chem. 2014;14(3):425–34.
- Blakeley P, Overton IM, Hubbard SJ. Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies. J Proteome Res. 2012;11(11):5221–34.
- Krug K, Carpy A, Behrends G, Matic K, Soares NC, Macek B. Deep coverage of the Escherichia coli proteome enables the assessment of false discovery rates in simple proteogenomic experiments. Mol Cell Proteomics. 2013;12(11):3420–30.
- Craig R, Beavis RC. TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004;20(9):1466–7.
- Eng JK, Jahan TA, Hoopmann MR. Comet: an open-source MS/MS sequence database search tool. Proteomics. 2013;13(1):22–4.
- Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002;74(20):5383–92.
- Yang P, Ma J, Wang P, Zhu Y, Zhou BB, Yang YH. Improving X!Tandem on peptide identification from mass spectrometry by self-boosted Percolator. IEEE/ACM Trans Comput Biol Bioinform. 2012;9(5):1273–80.
- Joo JW, Na S, Baek JH, Lee C, Paek E. Target-Decoy with Mass Binning: a simple and effective validation method for shotgun proteomics using high resolution mass spectrometry. J Proteome Res. 2010;9(2):1150–6.
- Kislinger T, Gramolini AO, MacLennan DH, Emili A. Multidimensional protein identification technology (MudPIT): technical overview of a profiling method optimized for the comprehensive proteomic investigation of normal and diseased heart tissue. J Am Soc Mass Spectrom. 2005;16(8):1207–20.
- Kall L, Storey JD, MacCoss MJ, Noble WS. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J Proteome Res. 2008;7(1):29–34.
- Kall L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods. 2007;4(11):923–5.
- Ma K, Vitek O, Nesvizhskii AI. A statistical model-building perspective to identification of MS/MS spectra with PeptideProphet. BMC Bioinformatics. 2012;13 Suppl 16:S1.