Correcting Annotation Errors
Following the data path outlined in the Methods (Figure 1), ~15 million MS/MS spectra from Yersinia pestis KIM were searched by Inspect and PepNovo against the six-frame translation of the genome. Confident peptide/spectrum matches were mapped onto the genome sequence and used to infer annotation improvements. In total, we report 30,994 peptides mapping to 1302 proteins when requiring 2 peptides per protein. Of these, 1277 proteins (31% of the proteome) contain at least 1 uniquely mapping peptide. Of the 25 proteins lacking a unique peptide, the vast majority are copies of an active transposase (IS285). The finding of an active transposon is intriguing given that transposons have been proposed as a driving force of Y. pestis genome evolution since its recent divergence from Y. pseudotuberculosis [13].
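The protein-level counts above follow from two simple filters: a minimum number of peptides per protein, and the presence of at least one peptide that maps to a single protein. The sketch below illustrates that bookkeeping only; the function name, the input dictionary, and the protein identifiers are hypothetical and are not part of the pipeline described in the Methods.

```python
from collections import defaultdict

def summarize_protein_evidence(peptide_to_proteins, min_peptides=2):
    """Return (confident proteins, confident proteins with a unique peptide).

    peptide_to_proteins: dict mapping a peptide sequence to the set of
    protein IDs it maps to (hypothetical input format).
    """
    peptides_per_protein = defaultdict(set)
    unique_per_protein = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            peptides_per_protein[protein].add(peptide)
        # A peptide is "unique" when it maps to exactly one protein.
        if len(proteins) == 1:
            (only,) = proteins
            unique_per_protein[only].add(peptide)
    confident = {p for p, peps in peptides_per_protein.items()
                 if len(peps) >= min_peptides}
    with_unique = {p for p in confident if unique_per_protein.get(p)}
    return confident, with_unique
```

Under this scheme, multi-copy proteins such as a transposase can pass the two-peptide filter while having no unique peptide, because every peptide maps equally well to each copy.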
By mapping the proteomic data onto the genome, we are able to objectively assess the quality of the genome annotation. Peptides that map outside of predicted proteins point to one of two categories of annotation improvement. If the open reading frame (ORF) lacks a predicted protein, the observed peptides are evidence for a novel gene. If the ORF contains a predicted protein and peptides map upstream of it, they are evidence for a 5' extension and a new start site. Both of these situations necessitate an update to the genome annotation. Working with the RefSeq curators at NCBI, we have updated all of the instances discussed below; the updated annotations can be accessed through any NCBI tool (e.g. BLAST) or downloaded directly.
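The two categories can be expressed as a small decision rule. The following is a minimal sketch, assuming each peptide has already been placed in an ORF and that positions are given as codon offsets from the 5' end of that ORF; the function name, the coordinate convention, and the label strings are illustrative assumptions rather than the pipeline's actual interface.

```python
from typing import Optional

def classify_peptide(peptide_start: int, annotated_start: Optional[int]) -> str:
    """Classify one peptide mapped to an ORF.

    peptide_start: codon offset of the peptide's first residue in the ORF.
    annotated_start: codon offset of the annotated start codon, or None
    when the ORF has no predicted protein (hypothetical convention).
    """
    if annotated_start is None:
        # No protein is annotated in this ORF: evidence for a novel gene.
        return "novel_gene"
    if peptide_start < annotated_start:
        # Peptide lies upstream of the annotated start: 5' extension.
        return "five_prime_extension"
    # Peptide falls inside the existing gene model.
    return "supports_annotation"
```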
We find four ORFs that lack protein annotation but have at least two uniquely mapping peptides (Figure 2). Three are easily recognizable proteins: the major outer membrane lipoprotein between y1943 and y1944, and two cold-shock proteins, one between y1817 and y1818 and the other between y2562 and y2563. We also report an apparent Yersinia-specific protein between y2035 and y2036. This protein, now named y5001, lacks homology to any currently described protein domain and has no significant BLAST hits outside of the Yersinia genus. Finally, y3734 is predicted to be a pseudogene, an ABC transporter disrupted by insertion elements, indels, and nonsense mutations (Additional File 1, Figure S1). These mutations destroy several functionally relevant motifs and almost certainly preclude proper biochemical function. However, we find two peptides mapping to the N-terminus of the region, providing evidence for its translation and presence in the cell.
The second major class of proteogenomic correction is start site annotation. Yersiniabactin thioesterase is an enzyme participating in the biosynthesis of a siderophore important for iron acquisition from the host. In our proteomics data we observed several peptides upstream of the current start for this gene, which required a 44-residue extension. Similarly, numerous peptides found upstream of the Y. pestis-specific protein y0291 pointed to a 40-amino-acid extension. Its lack of homology to any known domain and narrow taxonomic distribution demonstrate the utility of proteomic involvement in gene prediction. A third gene, y2368, highlights a subtle ramification of erroneous start sites, which is that functional elements within the N-terminus are obscured. y2368 is annotated simply as a 'periplasmic protein' with a CDD iron transport domain (cl01377). As expected, we found this protein to be highly enriched in periplasmic as compared to cytoplasmic fractions [14], yet the annotated N-terminus lacked localization motifs, e.g. a signal peptide. Six peptides mapped upstream of the annotated start site (Figure 3). Manual inspection of the region immediately upstream from these peptides found a strong signal peptide sequence motif.
Two proteins with erroneous start sites are special cases and are widely mispredicted in bacterial genomes. Peptide chain release factor II, prfB, often contains a ribosomal +1 frameshift. As with the KIM genome, erroneous annotations of this protein typically contain only the C-terminal ORF but exclude the true N-terminus (Additional File 1, Figure S2). Our proteogenomics pipeline recognized peptides in both ORFs and the two were stitched together. We also found peptides upstream of infC, protein chain initiation factor IF-3, which utilizes the ultra-rare start codon ATT [15].
Signal Peptides
Proteins exported from the cytoplasm through the Sec-dependent pathway contain a short sequence essential to targeting and export (Figure 4). The ~20 residue motif, or signal peptide, is located at the N-terminus of the full-length protein. The signal peptide helps target the protein to the membrane, where it is temporarily anchored by a patch of hydrophobic residues. A three amino acid motif following the hydrophobic patch is recognized by the signal peptidase enzyme; the protein is cleaved and the C-terminal portion of the protein is released into the periplasm. The signal peptide, still anchored in the membrane, is rapidly degraded by the signal peptide peptidase. The Sec-dependent pathway is separate from other export pathways, such as the Type III secretion system. Signal peptides are also present in proteins exported through the Twin Arginine receptor-mediated pathway, but are typically longer and include an additional motif prior to the hydrophobic patch.
Proteolytic cleavage of proteins in vivo can be recognized in proteomics data by the atypical peptide endpoints it produces. Spectra used in this report were generated from proteins digested with trypsin. Thus, we expect most identified peptides to be fully tryptic. Previously, Gupta and colleagues postulated that signal peptides could be discovered simply by identifying non-tryptic peptides [16]. In their analysis, if the first observed peptide in a protein had a non-tryptic N-terminus and was within 17-55 residues of the start site, then it was considered evidence of signal peptide cleavage. 202 Yersinia proteins fulfill these two requirements. We extend Gupta's criteria to include critical biological motifs within the signal peptide: the hydrophobic patch and cleavage motif [17]. Filtering out proteins lacking these new requirements, we report 82 proteins with observed signal peptide cleavage (Additional File 2, Table S1). These proteins also contain other common signal peptide features: prevalence of LL doublets and early basic residues. Furthermore, we noticed that the hydrophobic patch had a similar placement within the signal peptide for all validated proteins (Figure 4B). This location is consistent with the patch's structural purpose, i.e. membrane anchoring and exposure of the cleavage motif at an appropriate distance from the membrane. Finally, we report that many of the sequences contain not simply early basic residues but Met-Lys as the first two residues in the protein sequence. 34 of the 82 proteins start in this manner. An additional 15 have an internal Met-Lys, which could be the true start if this trend proves to be a general pattern.
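The combined filter can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce Table S1: the residue sets, the hydrophobic window size and threshold, and the simplified rule of small residues at positions -3 and -1 relative to the cleavage site are all placeholder choices, and the function names are hypothetical. Only the 17-55 residue window and the non-tryptic N-terminus requirement come from the text above.

```python
HYDROPHOBIC = set("AVLIMFWC")   # assumed hydrophobic residue set
SMALL = set("AGSTC")            # assumed "small" residues for the cleavage motif

def has_hydrophobic_patch(region, window=8, min_hydrophobic=7):
    """True if any window of the region is mostly hydrophobic (assumed thresholds)."""
    for i in range(len(region) - window + 1):
        if sum(r in HYDROPHOBIC for r in region[i:i + window]) >= min_hydrophobic:
            return True
    return False

def is_candidate_signal_cleavage(protein, first_peptide_start,
                                 min_len=17, max_len=55):
    """Apply the four filters to the first observed peptide of a protein.

    first_peptide_start: 0-based index of the peptide's first residue,
    which equals the length of the putative signal peptide.
    """
    # 1. First observed peptide lies 17-55 residues from the start site.
    if not (min_len <= first_peptide_start <= max_len):
        return False
    # 2. Non-tryptic N-terminus: the preceding residue is not K or R.
    if protein[first_peptide_start - 1] in "KR":
        return False
    removed = protein[:first_peptide_start]
    # 3. Cleavage motif: small residues at -3 and -1 (simplified assumption).
    if removed[-1] not in SMALL or removed[-3] not in SMALL:
        return False
    # 4. Hydrophobic patch somewhere within the removed region.
    return has_hydrophobic_patch(removed)
```

Applying the first two tests alone corresponds to the earlier criteria of Gupta and colleagues; the last two tests represent the added motif requirements, in a deliberately simplified form.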
Twenty proteins contain signal peptides longer than 30 residues, which is atypical. When we compared these sequences to close homologs within the gamma-proteobacteria, 13 of them could be better predicted at an alternative start site downstream of the current annotation. This shorter version of the protein would not only agree better with homologous genes, but would also better match the characteristics of signal peptides noted above (not just length). An additional four long signal peptides appear to contain the twin arginine motif; signal peptides carrying this motif are characteristically longer than those exported through the Sec-dependent pathway.
The set of secreted proteins within a genome is often computationally predicted. To compare our proteomic observations to such predictions, we ran SignalP on all proteins in the genome. We plot the score for proteins with proteomically validated signal peptides against the background of SignalP's score distribution (Figure 4C). We also plot proteins whose signal peptides were proteomically rejected (see Methods). The proteomically observed and rejected proteins separate very clearly, with the positive set scoring well above the suggested cutoff. Furthermore, proteomics and SignalP generally agree on the exact residue of cleavage.
Dubious Genes
In the Yersinia pestis KIM genome, there were over 200 genomic loci with a >50 bp overlap between two protein coding genes. Such a substantial overlap is unusual, especially considering that 10% of the proteome (~400 proteins) falls into this category. We viewed these conflicted loci as unlikely to be correctly annotated. For the 46 loci covered by proteomics, we manually reviewed the evidence supporting the existence of either gene. 38 loci contained a dubious gene (Figure 5). In over half of the instances, the loci contained genes with 100% sequence overlap. The equivocal nature of dubious genes was evidenced by narrow phylogenetic distribution, poor and seemingly random sequence conservation, and weak computational justification noted in the original genome submission (see Methods). The remaining 8 loci were in conflict due to an overly extended 5' end on one or both of the genes (Figure 5C). Working with RefSeq curators at NCBI, we have removed the dubious genes. Analysis of the remaining ~150 conflicted loci will be addressed in future work.
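Identifying conflicted loci amounts to an interval-overlap scan over gene coordinates. The sketch below shows one way to do this, assuming genes are given as (gene ID, start, end) tuples with 1-based inclusive coordinates on the genome; the function name and input format are hypothetical, and strand is ignored for simplicity.

```python
def find_conflicted_loci(genes, min_overlap=50):
    """Return (gene_a, gene_b, overlap_bp) for gene pairs overlapping by > min_overlap bp.

    genes: iterable of (gene_id, start, end), 1-based inclusive, start <= end.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    conflicts = []
    for i, (id_a, start_a, end_a) in enumerate(ordered):
        for id_b, start_b, end_b in ordered[i + 1:]:
            if start_b > end_a:
                # Sorted by start, so no later gene can overlap gene a.
                break
            # start_b >= start_a, so the shared region begins at start_b.
            overlap = min(end_a, end_b) - start_b + 1
            if overlap > min_overlap:
                conflicts.append((id_a, id_b, overlap))
    return conflicts
```

A gene fully contained within another (the 100% overlap case) is reported with an overlap equal to its own length.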
Improvement of Related Genomes
As a final step in our analysis, we used our proteomics data to improve the annotation of other Yersinia genomes. Published in 2002, KIM was the second annotated Yersinia genome and served as a source for subsequent genome projects [18]. Thus, errors in KIM's annotation are likely to show up in other genomes. We created one-to-one orthology maps for proteins from 21 Yersinia genomes, and used our peptide mappings from the KIM dataset to evaluate gene models. Transferring proteogenomic improvements is complicated both by legitimate differences between strains/species and by artifacts from sequencing and assembly. Our approach was to be conservative, and only change gene models where the evidence was clear.
We started with the removal of dubious genes, resulting in the deletion of 46 accessions. Remembering that each dubious gene was part of a conflicted locus, we checked to make sure that the true gene was present. At two loci we found that the true gene was missing, excluded from the original annotation by the dubious gene. We created new gene models for each of these. Three additional new genes were created, arising from the transitive annotation of the novel KIM genes. The two cold-shock proteins were missing from Y. pseudotuberculosis IP 31758. The major outer membrane lipoprotein, lpp, was missing from Y. pestis Pestoides F.
To correct start sites in other Yersinia genomes, we applied the N-terminal-most peptide from a gene in KIM to members of its ortholog set. If an ortholog was too short to include this peptide, we tested whether an extension could be made to accommodate it. In 43 instances, this extension was trivial, with clear homology up to and including the start codon. For 22, an extension could not be made due to indels or mutations shortening the reading frame. Finally, 10 were difficult cases where we could extend the reading frame to an upstream start codon, but homology was unclear. In these cases, we did not alter the genome annotation. This was particularly problematic in Y. enterocolitica, which is the most divergent genome used in our comparison.
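The extension test can be illustrated with a simplified sketch. It assumes the ortholog's reading frame is supplied as in-frame DNA running from just after the upstream stop codon through the gene's own stop, requires an exact match to the KIM peptide, and proposes the nearest start codon at or upstream of the peptide. These are simplifications: the actual analysis relied on homology, which exact matching does not capture, and the function name, arguments, and set of accepted start codons are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of accepted start codons

def propose_start(orf_dna, annotated_start_codon, peptide):
    """Return the codon index of a start that accommodates the peptide, or None.

    orf_dna: in-frame, uppercase DNA of the reading frame (hypothetical input).
    annotated_start_codon: 0-based codon index of the annotated start.
    peptide: N-terminal-most peptide observed for the KIM ortholog.
    """
    codons = [orf_dna[i:i + 3] for i in range(0, len(orf_dna) - 2, 3)]
    protein = "".join(CODON_TABLE[c] for c in codons)
    hit = protein.find(peptide)
    if hit == -1:
        return None  # peptide is not encoded in this reading frame
    if hit >= annotated_start_codon:
        return annotated_start_codon  # current gene model already includes it
    # Walk upstream from the peptide to the nearest candidate start codon.
    for idx in range(hit, -1, -1):
        if codons[idx] in START_CODONS:
            return idx
    return None  # no start codon upstream of the peptide in this frame
```

A returned index smaller than the annotated start corresponds to a candidate extension; a return value of None corresponds to cases where the reading frame cannot accommodate the peptide. Judging whether homology supports the proposed start, as in the difficult cases above, is outside the scope of this sketch.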