Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger
© Wright et al; licensee BioMed Central Ltd. 2009
Received: 07 April 2008
Accepted: 04 February 2009
Published: 04 February 2009
Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate between candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institute (JGI) and to a second predicted protein set from another A. niger genome sequence. Tandem mass spectra (MS/MS) were acquired from 1D gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS) and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR).
405 identified peptide sequences were mapped to 214 different A. niger genomic loci to which 4093 predicted gene models clustered, 2872 of which contained the mapped peptides. Interestingly, 13 (6%) of these loci either had no preferred predicted gene model or the genome annotators' chosen "best" model for that genomic locus was not found to be the most parsimonious match to the identified peptides. The peptides identified also boosted confidence in predicted gene structures spanning 54 introns from different gene models.
This work highlights the potential of integrating experimental proteomics data into genomic annotation pipelines, much as expressed sequence tag (EST) data has been. A comparison with the published genome of another A. niger strain, sequenced by DSM, showed that a number of the gene models or proteins with proteomics evidence did not occur in both genomes, further highlighting the utility of the method.
Post-genomic research and systems biology have greatly expanded our knowledge and understanding of biological processes, fuelled by the growth in sequenced genomes and accompanying technological developments. These techniques, such as microarray-based transcriptomics and proteomics, are reliant on the high quality annotation of newly sequenced genomes. Indeed, this heavy dependency on a sequenced genome or cDNA library can often limit the scope of studies, particularly for non-model organisms. However, functional genomics experiments on sequenced organisms can also play an important role in defining or re-evaluating the genome sequence on which they are based. Experimental data can be fed back into the genome annotation to help demonstrate the validity or otherwise of the original gene structure predictions or to assist the annotation of new genomes.
Many genome sequencing projects use a range of in silico prediction methods to generate a large, and sometimes highly redundant, set of possible open reading frames (ORFs) and gene structure models. A good example is the pipeline employed by the widely-used Ensembl genome browser. Here, a combination of EST, cDNA, orthology and statistical data is used to derive gene sets which are reconciled to produce a final set of high quality predicted genes. A further example is provided by recent fungal genomes sequenced at the US DOE Joint Genome Institute (JGI), whereby a large set of gene models is produced, typically with several candidates for each locus. Further analyses reduce this to a smaller filtered set of "best" gene predictions via a second layer of bioinformatic methods, manual annotation and the use of experimental data. It is one such example, that of Aspergillus niger, which forms the basis for this study. A. niger is a common ascomycete fungus that acts as an opportunistic human pathogen; however, it is more commonly known for its use in industrial biotechnological applications such as the production of citric acid. We wished to apply mass spectrometry-based proteomics to A. niger as an exemplar system with which to test the utility of proteomics to refine and process a recently sequenced and annotated genome and produce an even higher quality gene set. There have already been several studies of the proteomics of filamentous fungi, now that there are several complete genome sequences, and this technique is being widely applied to understand fungal biology.
Although cDNA and oligonucleotide arrays can demonstrate that a predicted gene is expressed [5, 6] and tiling arrays can define exon-intron structure with exquisite accuracy, they still focus on the untranslated mRNA. Proteomics provides a higher level confirmation of gene expression and is beginning to be used in genome annotation [8–10]. Mass spectrometry (MS) is an effective and fast method for identifying proteins from their constituent peptides and recent developments support much higher coverage of the commonly expressed proteome [11–13]. For example, Aebersold and colleagues demonstrated how the PeptideAtlas database could be exploited to map many thousands of peptides back on to the human proteome [14, 15]. Similarly, cDNA/EST data and mass spectrometry experiments have been used to identify novel ORFs and splice variants. Peptide identifications in expressed sequence tags (ESTs) or expressed peptide tags (ePSTs) [17, 18] were matched back to the genomic scaffolds, thereby identifying or validating real ORFs. Experimental proteomic data can therefore help with the prediction and validation of predicted gene structure and there are a growing number of examples which have helped annotate translational start sites, exons and SNPs [19, 20, 8]. In parallel, informatic proteome pipelines are also becoming more "genome-centric". Examples include the genome annotating pipeline (GAPP) and PeptideAtlas resources [14, 15], which both support the mapping of identified peptides back onto genome viewers. Some experiments have even found that published genomes are annotated incorrectly, fully demonstrating the utility of proteome data. These conclusions have sparked interest throughout both proteomics and genomics as to the best ways in which to use this new source of experimental validation of genome annotations.
Overview of JGI and DSM A. niger genome data
DOE JGI A. niger genome
Number of Gene Models generated
Number of filtered "best" Gene Models
DSM A. niger genome
Number of annotated proteins
Experimental Aspergillus niger proteomics
Aspergillus culture conditions and protein extraction
Aspergillus niger strain N402 was cultured in Aspergillus media (ACM) at 25°C and 150 rpm. The A. niger mycelia (100 mg) were ground down using a pestle and mortar and cells lysed by mechanical glass bead cell lysis. Protein was then extracted using TCA precipitation.
A. niger extracts were separated on 10%, 12% and 15% SDS-PAGE gels and stained using Coomassie R250. Gel bands were excised from top to bottom of the gel. In-gel tryptic digestion was carried out as described by Shevchenko et al, and the resulting peptides were extracted by the addition of 2 volumes of acetonitrile and dried prior to analysis.
Liquid chromatography tandem mass spectrometry (LC-MS/MS) was performed on an UltiMate/Switchos/Famos nanoflow HPLC (Dionex, Camberley, Surrey) coupled to a QTof I (Waters, Manchester). Prior to analysis, dried samples were redissolved in 6 μl of 0.1% formic acid (v/v). 5 μl of each sample was injected and desalted on a trapping column (PepMap C18, 300 μm i.d., 5 mm length) prior to separation on a PepMap C18 analytical column (75 μm i.d., 15 cm length) using a 200 nl/min, 1 h gradient of 5–90% solvent B (A = 2% acetonitrile, 0.06% formic acid; B = 95% acetonitrile, 0.05% formic acid, v/v). Data-dependent switching between MS and MS/MS acquisition was used, with product ion spectra recorded for a maximum of three precursors per cycle.
Peak lists (.pkl files) were generated using PeptideAuto in MassLynx 3.4 software (Waters), combining all sequential scans for the same precursor, centroiding data with a minimum peak width parameter of 2 and using a peak top parameter of 80%.
Computational Aspergillus niger proteomics
The main source of data for this project was downloaded from the JGI Genome Portal for A. niger (http://www.jgi.doe.gov/aspergillus) and included the genomic scaffolds, the unfiltered set of gene models generated automatically in the JGI annotation pipeline and also the filtered gene model set. The size of each of these sets is displayed in Table 1, along with statistics for another A. niger genome recently sequenced industrially by DSM Food Specialties (http://www.dsm.com/en_US/html/dfs/genomics_aniger.htm) and recently published.
Generation of gene models
Gene models in the genome of Aspergillus niger were predicted using the ab initio gene predictor Fgenesh and the homology-based methods Fgenesh+ and Genewise. In addition, over 15,000 A. niger ESTs from GenBank and over 1,200 full-length (FL) mRNAs from RefSeq were directly mapped to genomic sequence and were employed to extend the predicted gene models into FL genes by adding 5' and/or 3' UTRs using the estExt method. Since multiple gene models were generated for each locus, a single representative model from each set of overlapping gene models was selected. This selection was based on homology to proteins from other organisms and available EST support. Gene models overlapping with transposable elements detected in the A. niger assembly were excluded from the final set. All these methods are integrated into the JGI annotation pipeline. Finally, the initial redundant set of 87,287 gene models predicted by different methods was filtered down to a non-redundant set of 11,200 "best" gene models.
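The grouping of overlapping gene models into loci can be illustrated with a minimal Python sketch. This is not the JGI pipeline: the tuple layout, the function name and the example identifiers are hypothetical, strand is ignored, and any shared base on the same scaffold is treated as an overlap.

```python
from collections import defaultdict


def cluster_overlapping(models):
    """Group gene models that overlap on the same scaffold.

    models: list of (model_id, scaffold, start, end) tuples (hypothetical layout).
    Returns a list of clusters, each a list of model ids.
    """
    by_scaffold = defaultdict(list)
    for model_id, scaffold, start, end in models:
        # Minus-strand models may be given with start > end, so normalise.
        lo, hi = sorted((start, end))
        by_scaffold[scaffold].append((lo, hi, model_id))

    clusters = []
    for scaffold in sorted(by_scaffold):
        current, current_end = [], None
        # Sweep left to right, extending the open cluster while models overlap it.
        for lo, hi, model_id in sorted(by_scaffold[scaffold]):
            if current and lo <= current_end:
                current.append(model_id)
                current_end = max(current_end, hi)
            else:
                if current:
                    clusters.append(current)
                current, current_end = [model_id], hi
        if current:
            clusters.append(current)
    return clusters


# Toy example with made-up coordinates.
example = [
    ("a", "scaffold_1", 100, 500),
    ("b", "scaffold_1", 400, 900),
    ("c", "scaffold_1", 1000, 1200),
    ("d", "scaffold_2", 300, 100),
]
print(cluster_overlapping(example))
```

A representative model would then be chosen within each cluster; the selection criteria described above (homology and EST support) are not modelled here.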
Mass spectrometry database searching
$$\mathrm{FDR} = \frac{FP}{TP + FP}$$
where FP is the number of false positive reverse protein hits above the APS threshold, and TP the number of forward protein hits above the threshold less the number of false positives. For example, if 100 forward proteins and 5 reverse proteins exceed the APS threshold, TP = 100 - 5 = 95, FP = 5, and FDR = 5/(95 + 5) = 5%. As the reverse database hits were taken from all searches, the FP values were scaled by the relative number of spectra in each band. The APS protein threshold was lowered until the FDR reached 2%. For each gel band we calculated the optimal peptide score threshold by increasing it from a value of 10 in steps of 1 and calculating the number of APS forward hits at the 2% FDR threshold. The peptide score threshold reporting the largest number of APS hits was used.
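The FDR calculation and the lowering of the protein threshold can be sketched in Python as follows. This is an illustrative outline only: the function names, the representation of hits as plain lists of scores and the example scores are assumptions, and the scaling of FP by the relative number of spectra in each band is omitted.

```python
def fdr(forward_hits, reverse_hits):
    """FDR = FP / (TP + FP), with FP = reverse hits and TP = forward hits - FP."""
    fp = reverse_hits
    tp = forward_hits - fp
    total = tp + fp  # equal to the number of forward hits
    if total <= 0:
        return 1.0 if fp > 0 else 0.0
    return fp / total


def lowest_threshold(forward_scores, reverse_scores, max_fdr=0.02):
    """Lower the score threshold until the FDR would exceed max_fdr.

    Candidate thresholds are the observed scores, tried from highest to
    lowest; the last threshold that kept the FDR within the limit is returned
    (None if even the highest score fails).
    """
    best = None
    candidates = sorted(set(forward_scores) | set(reverse_scores), reverse=True)
    for threshold in candidates:
        n_forward = sum(s >= threshold for s in forward_scores)
        n_reverse = sum(s >= threshold for s in reverse_scores)
        if fdr(n_forward, n_reverse) > max_fdr:
            break
        best = threshold
    return best


# Worked example from the text: 100 forward and 5 reverse hits give 5% FDR.
print(fdr(100, 5))

# Made-up scores: the first reverse hit at 32 stops the scan at 35.
print(lowest_threshold([60, 55, 50, 45, 40, 35, 30], [32, 20]))
```

The per-band peptide score threshold could be chosen by wrapping this in a loop that starts at 10, increases in steps of 1 and keeps the value giving the most forward hits.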
All mass spectrometry data and peptide identifications have been deposited with the PRIDE proteomics database http://www.ebi.ac.uk/pride, with PRIDE accessions 7972–8124 inclusive.
Clustering of the gene models
Analysis of the gene models clusters with proteomics data
For searches conducted against the JGI predicted proteome set from the gene models, each peptide identified as significant using the APS scoring pipeline was mapped back through the matched gene models to the appropriate genomic scaffold. Gene clusters were then evaluated by ranking the clustered gene models in a tabular format based on the number of peptides matched in the genomic region. Each gene model cluster and affiliated peptides were also visualised using the BioPerl::Graphics module. The peptide data matched and aligned to each cluster of gene models allowed the elimination of inconsistent models from the cluster. This allowed the clusters to be classified into four categories: i) clusters with no matched proteomics data, ii) clusters containing a "best" filtered model which is consistent with all aligned peptides, iii) clusters which do not contain a "best" filtered model but match some proteomics data, and iv) clusters containing a "best" filtered model which is inconsistent with the proteomics data. Categories iii) and iv) are of particular interest since they support novel gene structures not deemed the most likely from the gene prediction pipeline.
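The ranking and four-way classification described above can be outlined in a short Python sketch. It makes a simplifying assumption that is not part of the original method: a model is called consistent when its predicted protein sequence contains every peptide mapped to the cluster, whereas the study worked with genomic coordinates. All names and the toy sequences are hypothetical.

```python
from enum import Enum


class Category(Enum):
    NO_DATA = "i"             # no matched proteomics data
    BEST_CONSISTENT = "ii"    # "best" model agrees with all peptides
    NO_BEST_MODEL = "iii"     # peptides matched, but no "best" model in cluster
    BEST_INCONSISTENT = "iv"  # "best" model disagrees with the peptides


def is_consistent(model_seq, peptides):
    """Assumed criterion: every peptide occurs in the model's protein sequence."""
    return all(p in model_seq for p in peptides)


def rank_models(models, peptides):
    """Rank models by the number of cluster peptides they contain.

    models: dict of model_id -> (protein_sequence, is_best_flag).
    """
    counts = {mid: sum(p in seq for p in peptides)
              for mid, (seq, _) in models.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


def classify_cluster(models, peptides):
    """Assign a cluster to one of the four categories."""
    if not peptides:
        return Category.NO_DATA
    best_seqs = [seq for seq, is_best in models.values() if is_best]
    if not best_seqs:
        return Category.NO_BEST_MODEL
    if any(is_consistent(seq, peptides) for seq in best_seqs):
        return Category.BEST_CONSISTENT
    return Category.BEST_INCONSISTENT


# Toy cluster: the "best" model m1 lacks the second peptide.
toy_models = {"m1": ("MKTAYIAKQR", True), "m2": ("MKTAYIAKQRLLGG", False)}
toy_peptides = ["TAYIAK", "LLGG"]
print(rank_models(toy_models, toy_peptides))
print(classify_cluster(toy_models, toy_peptides))
```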
Results and discussion
Average Peptide Scoring (APS) Results
Protein-level identifications obtained over three search databases.
Filtered modelsb (11,200)
Gel01–12% SDS (partial) 8 bands
Gel02–15% SDS (partial) 33 bands
Gel03–10% SDS (full) 110 bands
The APS search results are compared to two interpretations of direct Mascot searching. Mascot(1) simply reports the number of proteins containing one or more peptides with ion scores above Mascot's default threshold, which estimates significance at p < 0.05 for individual peptides. Mascot(2) results refer to Mascot's MudPIT scoring system, the recommended approach when considering large numbers of spectra, which filters out some low scoring peptides. Mascot did not find large numbers of significant protein hits to the reverse databases in this case (typically only 1 or 2 proteins for each experiment). It should be noted that the current version of Mascot (v2.2) also supports reverse database searching directly, although we performed equivalent searches here "manually" with the earlier version.
The data presented here provide further evidence that the APS technique is a simple yet effective strategy to find peptide hits consistent with confident protein-level identifications whilst maintaining a low overall false discovery rate. As noted by Shadforth and colleagues, the APS approach removes candidate false positive hits whilst maximising true positive matches by selecting weaker scoring peptides that are consistent with higher scoring peptides in the same proteins. This is broadly equivalent to Mascot's MudPIT scoring system, which effectively removes protein hits from multiple low scoring peptide matches.
Table 2 shows results for the two protein sequence databases derived from the JGI genome. The first contained all the automatically generated gene models and the second a reduced set of filtered gene models representing the most likely protein, where appropriate, for each gene locus. The "All Models" database is highly redundant containing 87,287 gene models, and is almost eight times the size of the filtered dataset. This leads to the greater number of APS matches compared to the filtered set; consequently there is also redundancy in these protein matches and the number of hits to the filtered database gives a better reflection of the total number of identified proteins from different gene loci. In mitigation, the Mascot significance threshold is dependent on database size and consequently the larger "All Models" database has a higher peptide and protein threshold for reporting significant matches.
The APS scoring method calculates separate thresholds for single- and multi-peptide identifications and, as is typical for high throughput proteomics studies, a large proportion of our peptide identifications were "one hit wonders". Although these matches are generally held to be of lesser confidence than multi-peptide matches, other authors have argued against this, and the APS methodology considers them independently with a more stringent threshold (the average multi-peptide APS filter was a Mascot ion score of 34, whereas the single peptide equivalent was 51). Whilst we recognise that the approaches used here are unlikely to remove all false positive identifications, we reasoned that a small false positive rate was tolerable for gene model validation, where other sources of information may also be used. We conducted some further tests to boost confidence in these single peptide identifications, searching them back against the genome sequence using tblastn (with no low complexity filtering, and expect value, word size, and database size parameters optimised for short matches). In total, 95.5% of these 149 peptides align to the genome only once (in the original predicted locus), indicating that the majority are indeed unique and not chance matches to another mis-predicted sequence. Although this does not guarantee that the Mascot identification of the peptide sequence is correct, it does at least reassure us that, given a correct peptide identification, there is no ambiguity in placement on the genome sequence. The few peptides that were unmatched were observed to span introns and so could not be matched by a simple tblastn search, which does not span the large "gap".
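The idea behind the uniqueness check can be illustrated without BLAST. The Python sketch below swaps tblastn for exact string matching against a six-frame translation, which is a much cruder technique: it tolerates no mismatches and, like the simple tblastn search, cannot place a peptide that spans an intron. The function names and the toy DNA sequence are invented for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse) for offset in range(3)]


def count_placements(peptide, dna):
    """Count exact occurrences of a peptide across all six frames."""
    total = 0
    for frame in six_frames(dna):
        start = frame.find(peptide)
        while start != -1:
            total += 1
            start = frame.find(peptide, start + 1)
    return total


# Toy sequence encoding M-K-T-A-Y-I-A-K-stop in the first forward frame.
toy_dna = "ATGAAAACCGCGTATATTGCGAAATAA"
print(count_placements("TAYIAK", toy_dna))
```

A peptide returning a count of exactly one would be considered unambiguously placed under this simplified criterion.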
Overall, these results are generally reassuring since the proteins used here represent a range of predicted gene models and there is a clear trend in decreasing protein mass down the gel. A high false positive rate would lead to many protein identifications outside of the expected mass range, decreasing the correlation with average mass shown in Figure 3. A simple correlation of average protein mass against gel band number results in a Pearson correlation of 0.94, which supports this. This is considerably higher than would be expected by chance ($p < 10^{-20}$): shuffling the protein masses in a simulation gave a mean correlation of 0.01 (s.d. 0.1). Similar results are obtained when searching against either the JGI filtered gene set or DSM gene set (data not shown), suggesting there is no bias produced by a given gene set.
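A shuffling simulation of this kind can be sketched in Python as below. The data are synthetic and generated purely for illustration (they are not the measured masses), the function names are hypothetical, and the number of shuffles is an arbitrary choice. With band numbers counted from the top of the gel the synthetic correlation is negative, so its magnitude is the quantity to compare with the shuffled values.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def shuffled_null(x, y, n_iter=1000, seed=0):
    """Mean and s.d. of the correlation after repeatedly shuffling y."""
    rng = random.Random(seed)
    shuffled = list(y)
    values = []
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        values.append(pearson(x, shuffled))
    mean = sum(values) / n_iter
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n_iter - 1))
    return mean, sd


# Synthetic example: mass (kDa) falls with band number, plus Gaussian noise.
rng = random.Random(1)
bands = list(range(1, 51))
masses = [120 - 2 * b + rng.gauss(0, 5) for b in bands]

observed = pearson(bands, masses)
null_mean, null_sd = shuffled_null(bands, masses)
print(observed, null_mean, null_sd)
```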
Mapping of proteomics data to gene model clusters
Gene cluster and proteome peptide identification results.
Number of APS matches (to Gene clusters, Proteins, or Peptides)
i) "Best" filtered model consistent with peptide data
ii) Gene cluster does not contain a "Best" filtered model, but does have APS matches
iii) "Best" filtered model in cluster is inconsistent with peptide data
(single peptide hits)
To put the peptide identification data into the context of our results, we examined the 4093 models in the 214 gene clusters with associated APS peptide matches. Of these, 1221 models, about 30%, were classified as inconsistent with the proteomic data. This is, on average, around 5 models per cluster; clusters contain 16 models on average.
One of the most informative types of peptide identification is one that spans an intron/exon boundary. We matched 54 peptides spanning introns. These identifications provide proteomics evidence for predicted splice sites in the gene models above and beyond any available transcript data or orthologues.
Further analysis of gene model clusters
Gene cluster statistics where "best" filtered model is inconsistent with proteome peptide data
DSM Protein Description, followed by UniRef Protein Description, for each gene cluster:

DSM: strong similarity to 14.8 kD subunit of NADH:ubiquinone reductase – Neurospora crassa
UniRef: Hypothetical protein; n = 1; Coccidioides immitis RS|Rep: Hypothetical protein – Coccidioides immitis RS

DSM: strong similarity to snRNA-associated sm-like protein Lsm2 – Saccharomyces cerevisiae
UniRef: Hypothetical protein; n = 1; Coccidioides immitis RS|Rep: Hypothetical protein – Coccidioides immitis RS

DSM: similarity to hypothetical protein CAE47874.1/AfA24A6.130c – Aspergillus fumigatus
UniRef: Predicted protein; n = 1; Aspergillus terreus NIH2624|Rep: Predicted protein – Aspergillus terreus NIH2624

DSM: strong similarity to heat shock protein 70 hsp70 – Ajellomyces capsulatus [putative frameshift]
UniRef: Heat shock protein 70; n = 2; mitosporic Trichocomaceae|Rep: Heat shock protein 70 – Penicillium marneffei

DSM: strong similarity to calmodulin 6 CaM6 – Arabidopsis thaliana
UniRef: EF-hand protein; n = 2; Aspergillus|Rep: EF-hand protein – Aspergillus fumigatus (Sartorya fumigata)

DSM: strong similarity to histone 4 from patent WO9919502-A1 – Homo sapiens
UniRef: PREDICTED: similar to germinal histone H4 gene; n = 1; Canis familiaris|Rep: PREDICTED: similar to germinal histone H4 gene – Canis familiaris

DSM: strong similarity to soluble cytoplasmic fumarate reductase YEL047c – Saccharomyces cerevisiae
UniRef: Hypothetical protein; n = 1; Aspergillus terreus NIH2624|Rep: Hypothetical protein – Aspergillus terreus NIH2624

DSM: similarity to hypothetical protein CAD21072.1 – Neurospora crassa
UniRef: Transcription factor RfeF, putative; n = 1; Aspergillus fumigatus|Rep: Transcription factor RfeF, putative – Aspergillus fumigatus (Sartorya fumigata)

DSM: strong similarity to translation initiation factor eIF-4A – Schizosaccharomyces pombe
UniRef: ATP-dependent RNA helicase eIF4A; n = 1; Emericella nidulans|Rep: ATP-dependent RNA helicase eIF4A – Emericella nidulans (Aspergillus nidulans)

DSM: similarity to glucanase ZmGnsN3 from patent WO200073470-A2 – Zea mays
UniRef: Hypothetical protein; n = 1; Aspergillus fumigatus|Rep: Hypothetical protein – Aspergillus fumigatus (Sartorya fumigata)

DSM: (none listed)
UniRef: Ca2+binding actin-bundling protein; n = 2; Aspergillus|Rep: Ca2+binding actin-bundling protein – Aspergillus oryzae

DSM: strong similarity to cytoplasmic ribosomal protein of the large subunit L10 – Saccharomyces cerevisiae
UniRef: RIB40 genomic DNA, SC011; n = 2; Aspergillus|Rep: RIB40 genomic DNA, SC011 – Aspergillus oryzae

DSM: strong similarity to mitochondrial ADP/ATP carrier anc1p – Schizosaccharomyces pombe
UniRef: Mitochondrial ADP, ATP carrier protein (Ant), putative; n = 1; Aspergillus fumigatus|Rep: Mitochondrial ADP, ATP carrier protein (Ant), putative – Aspergillus fumigatus (Sartorya fumigata)
The second example, cluster 68_S6: Scaffold_6: 215962–216777, shows 5 peptides clearly supporting the filtered model and validating one of the three introns in the gene models. These gene models show strong similarity to ubiquinone reductase in other filamentous fungi. Three of the peptides are clustered close together at the C-terminus and one at the N-terminus, two areas of the gene models that can be very difficult to validate and correctly predict. Figure 5 shows a more detailed look at how the peptides clearly match the terminal regions of the predicted gene models.
Finally, example cluster 78_S17:scaffold_17:300012-298397, which shows similarity to Aspartate aminotransferase, does contain a "best" filtered model which matches all the peptide data. The models in the cluster are similar with all the predicted structural differences occurring around one particular intron which is validated by one of the peptides. This cluster is supported by a large number of proteome peptides which gives good coverage of the models validating the key intron where the predictions differ. These examples provide a clear demonstration of how proteomic evidence can support and refine gene models predicted from sequenced genomes, lend weight to predictions and help reconcile different potential gene structures in the annotation process.
Comparison of DSM and JGI genomes
Another strain of A. niger has also been recently sequenced and published by DSM, and we compared the proteome data collected here using Mascot searches against both the DSM and JGI predicted protein sets, shown in Table 2 and Figure 3. Using a simple reciprocal top-hit BLAST approach we defined equivalence relationships between the DSM proteins and the JGI gene models. As would be expected, both protein datasets have a proportion of proteins with no similarity in the other database, and we have proteomics evidence for some of these unique proteins. In fact, 9 DSM proteins had significant APS peptide identifications but had no corresponding gene model in the JGI dataset and, vice versa, 130 JGI gene models (corresponding to 18 distinct clusters) with proteomic data had no equivalent protein in the DSM database. This suggests that some possible gene models have not yet been generated for the JGI genome which would fit into the "best" filtered dataset, and also that several proteins have been missed in the DSM annotation which are included in the JGI gene model set.
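A reciprocal top-hit comparison of this kind can be sketched in Python as below. The sketch assumes that BLAST results have already been reduced to (query, subject, bit score) tuples; the function names and the identifiers in the example are hypothetical and no BLAST software is invoked.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query_id, subject_id, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (a_id, b_id) pairs that are each other's top hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return {(a, b) for a, b in a_best.items() if b_best.get(b) == a}


def without_equivalent(ids, pairs, side):
    """Ids with no reciprocal partner; side is 0 for set A, 1 for set B."""
    matched = {pair[side] for pair in pairs}
    return set(ids) - matched


# Made-up hits: jgi2 points at dsm1, but dsm1 prefers jgi1.
jgi_vs_dsm = [("jgi1", "dsm1", 200), ("jgi1", "dsm2", 50), ("jgi2", "dsm1", 80)]
dsm_vs_jgi = [("dsm1", "jgi1", 190), ("dsm2", "jgi1", 60)]

pairs = reciprocal_best_hits(jgi_vs_dsm, dsm_vs_jgi)
print(pairs)
print(without_equivalent(["jgi1", "jgi2"], pairs, side=0))
```

Models or proteins left without an equivalent, and for which peptide evidence exists, correspond to the cases counted above.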
Proteomics would not be possible without genomics; however, this does not mean that it is powerless to assist genomics. Quite the contrary: proteomics provides a fast, relatively cheap and confident method for gathering a large amount of experimental evidence to assist genome annotation. It also has the added advantage of confirming that transcripts are translated to the proteome stage and can help identify functional details of the mature protein form, including N- and C-termini and post-translational modifications. Here we present a relatively modest scale study on a fungal organism of current interest and hence our data has a relatively limited coverage of the entire A. niger proteome. Recent publications point out the need to conduct multiple high throughput experiments under a variety of conditions to achieve complete proteome coverage, and we have used only one here. Despite this, and using a "hot off the press" genome annotation, we were able to offer proteomic support for 214 genes, with proteomic data offering refinements to predicted gene models in around 6% of these cases. These represent 13 gene predictions for which there was uncertainty in the annotation (no filtered model) or which were potentially incorrectly annotated in the original gene model selection process. Importantly, as some of the examples in Figure 4 highlight, there is often uncertainty surrounding the true N- and C-termini of genes and most of the variability in the gene model structures exists at these regions. We were able to offer concrete data to help resolve some of these ambiguities in a number of cases, but it should be noted that we have not employed a strategy targeted at protein termini here. One attractive potential solution would be to use an N-terminal specific peptide preparation to enrich for these peptides.
We believe that this feedback of experimental data into genomic annotation pipelines could well be formalised and assist in gene prediction pipelines, as has recently been demonstrated for Arabidopsis. Proteomics has the advantage of providing direct evidence for gene products rather than any intermediate stage in the transcription process, yet the field has been slow to incorporate proteomics data formally into gene prediction models. We hope that this work and others offer strong support for its inclusion, using MS-based peptide identifications as a similar line of evidence to ESTs.
A further point of caution is also necessary. In Aspergillus, the level of alternative splicing that occurs has yet to be fully characterised, but appears to be modest. In most cases, the choice for gene prediction is therefore which of the candidate models is most consistent with the data and most likely to be correct. For species exhibiting higher levels of splicing, with multiple isoforms from a single locus, it is more challenging to interpret the data. Multiple peptide identifications could belong to two or more isoforms and it is quite possible that several isoforms are present in the same sample, especially when multiple tissues are studied.
One final caveat that concerns many in the proteomics field is the "one hit wonder" syndrome: proteins with only a single confident peptide identification. Although the majority of our peptide identifications were not in this category, those that were are still more likely to be false positives. The APS approach, with its independent threshold for single peptide matches, attempts to give them a protein-level confidence consistent with that of the multi-peptide hits. Indeed, the APS threshold was higher for single peptide hits (51 compared to 34 on average). More work is clearly needed to convince practitioners of their validity but, appropriately weighted and considered, they can offer genuine experimental evidence to support gene models, as we have demonstrated here.
JW acknowledges NERC studentship NER/S/R/2005/13607. SJH, SJG, IRG, DS, SM acknowledge support, either directly or indirectly, from the Biotechnology and Biological Science Research Council via BBSRC grant CFB17723.
- Liska AJ, Shevchenko A: Expanding the organismal scope of proteomics: cross-species protein identification by mass spectrometry and its implications. Proteomics. 2003, 3 (1): 19-28. 10.1002/pmic.200390004.View ArticlePubMedGoogle Scholar
- Potter SC, Clarke L, Curwen V, Keenan S, Mongin E, Searle SM, Stabenau A, Storey R, Clamp M: The Ensembl analysis pipeline. Genome research. 2004, 14 (5): 934-941. 10.1101/gr.1859804.PubMed CentralView ArticlePubMedGoogle Scholar
- Baker SE: Aspergillus niger genomics: past, present and into the future. Med Mycol. 2006, 44 (Suppl 1): S17-21. 10.1080/13693780600921037.View ArticlePubMedGoogle Scholar
- Kim Y, Nandakumar MP, Marten MR: Proteomics of filamentous fungi. Trends Biotechnol. 2007, 25: 395-400. 10.1016/j.tibtech.2007.07.008.View ArticlePubMedGoogle Scholar
- Duggan DJ, Bittner M, Chen Y, Meltzer P, Trent JM: Expression profiling using cDNA microarrays. Nature genetics. 1999, 21 (1 Suppl): 10-14. 10.1038/4434.View ArticlePubMedGoogle Scholar
- Mantripragada KK, Buckley PG, de Stahl TD, Dumanski JP: Genomic microarrays in the spotlight. Trends Genet. 2004, 20 (2): 87-94. 10.1016/j.tig.2003.12.008.View ArticlePubMedGoogle Scholar
- Ghosh S, Hirsch HA, Sekinger EA, Kapranov P, Struhl K, Gingeras TR: Differential analysis for high density tiling microarray data. BMC Bioinformatics. 2007, 8 (1): 359-10.1186/1471-2105-8-359.PubMed CentralView ArticlePubMedGoogle Scholar
- Tanner S, Shen Z, Ng J, Florea L, Guigo R, Briggs SP, Bafna V: Improving gene annotation using peptide mass spectrometry. Genome research. 2007, 17 (2): 231-239. 10.1101/gr.5646507.PubMed CentralView ArticlePubMedGoogle Scholar
- Jaffe JD, Berg HC, Church GM: Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 2004, 4 (1): 59-77. 10.1002/pmic.200300511.View ArticlePubMedGoogle Scholar
- Ansong C, Purvine SO, Adkins JN, Lipton MS, Smith RD: Proteogenomics: needs and roles to be filled by proteomics in genome annotation. Brief Funct Genomic Proteomic. 2008, 7 (1): 50-62. 10.1093/bfgp/eln010.View ArticlePubMedGoogle Scholar
- Domon B, Aebersold R: Mass spectrometry and protein analysis. Science. 2006, 312 (5771): 212-217. 10.1126/science.1124619.View ArticlePubMedGoogle Scholar
- Kislinger T, Emili A: Multidimensional protein identification technology: current status and future prospects. Expert Rev Proteomics. 2005, 2 (1): 27-39. 10.1586/14789418.104.22.168.View ArticlePubMedGoogle Scholar
- Smith JC, Figeys D: Proteomics technology in systems biology. Mol Biosyst. 2006, 2 (8): 364-370. 10.1039/b606798k.View ArticlePubMedGoogle Scholar
- Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R: The PeptideAtlas project. Nucleic acids research. 2006, D655-658. 10.1093/nar/gkj040. 34 Database
- Desiere F, Deutsch EW, Nesvizhskii AI, Mallick P, King NL, Eng JK, Aderem A, Boyle R, Brunner E, Donohoe S, Fausto N, Hafen E, Hood L, Katze MG, Kennedy KA, Kregenow F, Lee H, Lin B, Martin D, Ranish JA, Rawlings DJ, Samelson LE, Shiio Y, Watts JD, Wollscheid B, Wright ME, Yan W, Yang L, Yi EC, Zhang H, Aebersold R: Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry. Genome Biol. 2005, 6 (1): R9-10.1186/gb-2004-6-1-r9.PubMed CentralView ArticlePubMedGoogle Scholar
- Lu F, Jiang H, Ding J, Mu J, Valenzuela JG, Ribeiro JM, Su XZ: cDNA sequences reveal considerable gene prediction inaccuracy in the Plasmodium falciparum genome. BMC Genomics. 2007, 8 (1): 255-10.1186/1471-2164-8-255.PubMed CentralView ArticlePubMedGoogle Scholar
- Kuster B, Mortensen P, Andersen JS, Mann M: Mass spectrometry allows direct identification of proteins in large genomes. Proteomics. 2001, 1 (5): 641-650. 10.1002/1615-9861(200104)1:5<641::AID-PROT641>3.0.CO;2-R.View ArticlePubMedGoogle Scholar
- McCarthy FM, Cooksey AM, Wang N, Bridges SM, Pharr GT, Burgess SC: Modelling a whole organ using proteomics: the avian bursa of Fabricius. Proteomics. 2006, 6 (9): 2759-2771. 10.1002/pmic.200500648.View ArticlePubMedGoogle Scholar
- Fermin D, Allen BB, Blackwell TW, Menon R, Adamski M, Xu Y, Ulintz P, Omenn GS, States DJ: Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol. 2006, 7 (4): R35-10.1186/gb-2006-7-4-r35.PubMed CentralView ArticlePubMedGoogle Scholar
- Rison SC, Mattow J, Jungblut PR, Stoker NG: Experimental determination of translational starts using peptide mass mapping and tandem mass spectrometry within the proteome of Mycobacterium tuberculosis. Microbiology. 2007, 153 (Pt 2): 521-528. 10.1099/mic.0.2006/001537-0.PubMed CentralView ArticlePubMedGoogle Scholar
- Shadforth I, Xu W, Crowther D, Bessant C: GAPP: a fully automated software for the confident identification of human peptides from tandem mass spectra. J Proteome Res. 2006, 5 (10): 2849-2852. 10.1021/pr060205s.
- Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Herrero J, Holland R, Howe K, Howe K, Johnson N, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Melsopp C, Megy K, Meidl P, Ouverdin B, Parker A, Prlic A, Rice S, Rios D, Schuster M, Sealy I, Severin J, Slater G, Smedley D, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wood M, Cox T, Curwen V, Durbin R, Fernandez-Suarez XM, Flicek P, Kasprzyk A, Proctor G, Searle S, Smith J, Ureta-Vidal A, Birney E: Ensembl. Nucleic acids research. 2007, 35 (Database issue): D610-617. 10.1093/nar/gkl996.
- Maillet I, Berndt P, Malo C, Rodriguez S, Brunisholz RA, Pragai Z, Arnold S, Langen H, Wyss M: From the genome sequence to the proteome and back: evaluation of E. coli genome annotation with a 2-D gel-based proteomics approach. Proteomics. 2007, 7 (7): 1097-1106. 10.1002/pmic.200600599.
- Pel HJ, de Winde JH, Archer DB, Dyer PS, Hofmann G, Schaap PJ, Turner G, de Vries RP, Albang R, Albermann K, Andersen MR, Bendtsen JD, Benen JA, Berg van den M, Breestraat S, Caddick MX, Contreras R, Cornell M, Coutinho PM, Danchin EG, Debets AJ, Dekker P, van Dijck PW, van Dijk A, Dijkhuizen L, Driessen AJ, d'Enfert C, Geysens S, Goosen C, Groot GS, de Groot PW, Guillemette T, Henrissat B, Herweijer M, Hombergh van den JP, Hondel van den CA, Heijden van der RT, Kaaij van der RM, Klis FM, Kools HJ, Kubicek CP, van Kuyk PA, Lauber J, Lu X, Maarel van der MJ, Meulenberg R, Menke H, Mortimer MA, Nielsen J, Oliver SG, Olsthoorn M, Pal K, van Peij NN, Ram AF, Rinas U, Roubos JA, Sagt CM, Schmoll M, Sun J, Ussery D, Varga J, Vervecken W, Vondervoort van de PJ, Wedler H, Wösten HA, Zeng AP, van Ooyen AJ, Visser J, Stam H: Genome sequencing and analysis of the versatile cell factory Aspergillus niger CBS 513.88. Nat Biotechnol. 2007, 25 (2): 221-231. 10.1038/nbt1282.
- Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999, 20 (18): 3551-3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
- Shadforth I, Dunkley T, Lilley K, Crowther D, Bessant C: Confident protein identification using the average peptide score method coupled with search-specific, ab initio thresholds. Rapid Commun Mass Spectrom. 2005, 19 (22): 3363-3368. 10.1002/rcm.2203.
- Shevchenko A, Wilm M, Vorm O, Mann M: Mass spectrometric sequencing of proteins silver-stained polyacrylamide gels. Anal Chem. 1996, 68 (5): 850-858. 10.1021/ac950914h.
- Salamov AA, Solovyev VV: Ab initio gene finding in Drosophila genomic DNA. Genome Res. 2000, 10: 516-522. 10.1101/gr.10.4.516.
- Birney E, Durbin R: Using GeneWise in the Drosophila annotation experiment. Genome Res. 2000, 10: 547-548. 10.1101/gr.10.4.547.
- Choi H, Nesvizhskii AI: False discovery rates and related statistical concepts in mass spectrometry-based proteomics. J Proteome Res. 2008, 7 (1): 47-50. 10.1021/pr700747q.
- Slater GS, Birney E: Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005, 6: 31. 10.1186/1471-2105-6-31.
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehväslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002, 12 (10): 1611-1618. 10.1101/gr.361602.
- Veenstra TD, Conrads TP, Issaq HJ: What to do with "one-hit wonders"?. Electrophoresis. 2004, 25 (9): 1278-1279. 10.1002/elps.200490007.
- Ahn NG, Shabb JB, Old WM, Resing KA: Achieving in-depth proteomics profiling by mass spectrometry. ACS Chem Biol. 2007, 2 (1): 39-52. 10.1021/cb600357d.
- McDonald L, Beynon RJ: Positional proteomics: preparation of amino-terminal peptides as a strategy for proteome simplification and characterization. Nat Protoc. 2006, 1 (4): 1790-1798. 10.1038/nprot.2006.317.
- Baerenfaller K, Grossmann J, Grobei MA, Hull R, Hirsch-Hoffmann M, Yalovsky S, Zimmermann P, Grossniklaus U, Gruissem W, Baginsky S: Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science. 2008, 320: 938-941. 10.1126/science.1157956.
- Semova N, Storms R, John T, Gaudet P, Ulycznyj P, Min XJ, Sun J, Butler G, Tsang A: Generation, annotation, and analysis of an extensive Aspergillus niger EST collection. BMC Microbiology. 2006, 6: 7. 10.1186/1471-2180-6-7.
- Käll L, Storey JD, MacCoss MJ, Noble WS: Posterior error probabilities and false discovery rates: two sides of the same coin. J Proteome Res. 2008, 7 (1): 40-44. 10.1021/pr700739d.
- Searle BC, Turner M, Nesvizhskii AI: Improving Sensitivity by Probabilistically Combining Results from Multiple MS/MS Search Methodologies. J Proteome Res. 2008, 7: 245-253. 10.1021/pr070540w.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.