From the large dataset of MS/MS spectra (1 117 372) recorded for proteome samples of R. pomeroyi cultivated in various conditions, we identified thirty-nine newly annotated genes and nine wrongly described ORFs. We also corrected seventy-four start codons and described five sequencing errors (a base insertion in all cases) that consequently modified the characteristics of the genes encoded at these loci. Because of its environmental relevance, the Roseobacter clade is currently subject to intense sequencing efforts [27–31]. However, because of the large diversity of this bacterial group, too few closely related genome sequences are available to improve their annotations by comparative genomics alone. Here, we show the importance of proteogenomic input for a better characterization of bacterioplankton.
We noted that the number of structural annotation inaccuracies is rather large in the R. pomeroyi genome annotation. This is similar to previous proteogenomic reports for Shewanella or Mycobacterium bacteria that resulted in thirty-eight and twenty-nine new annotations, respectively [19, 32]. In contrast, a recent proteogenomic study carried out on the enterobacterium Yersinia pestis identified only four novel genes. As enterobacteria are the most extensively studied organisms, and numerous genomes from the Enterobacteriaceae family have now been sequenced and annotated, it is reasonable to consider that their genomes are amongst the best for accuracy and reliability. This is in full agreement with the proteogenomic data presented by Payne et al. Here we have shown that even highly expressed genes and operons with potentially important cellular roles were missed during the genome annotation of R. pomeroyi. The majority of annotation problems come from the identification of CDSs exclusive to a small number of organisms, because comparative genomics cannot help to confirm the ORF prediction in such cases. Their validation requires additional experimental evidence, such as that described here. Blending data from complementary approaches, such as protein characterization by tandem mass spectrometry and transcriptomic evidence, is time-consuming but results in stronger evidence for small genes. In terms of mass spectrometry, 'one-hit-wonders' are proteins identified with only one, non-redundant peptide tag. They are usually proteins with low molecular weight that are able to generate only a few tryptic peptides. Depending on the score of the MS/MS spectrum assignment, these hits may be difficult to ascertain confidently and require manual validation. Gupta et al. proposed a method to validate one-hit-wonders using comparative proteogenomics, but this requires the recording of various MS/MS datasets on several species.
Here, we used RT-PCR to detect the expression of several CDSs identified with only one peptide. In this way we obtained evidence that the locus was being expressed, giving higher confidence to the assignment. This method proved to be effective, with the addition of five novel genes to our list.
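The notion of a 'one-hit-wonder' described above can be illustrated with a minimal sketch. It assumes that peptide-spectrum matches are available as (protein, peptide) pairs; the function name, protein identifiers and peptide sequences below are hypothetical and are not taken from the present dataset or pipeline.

```python
from collections import defaultdict


def one_hit_wonders(peptide_hits):
    """Return proteins supported by exactly one non-redundant peptide.

    peptide_hits: iterable of (protein_id, peptide_sequence) pairs,
    one pair per assigned MS/MS spectrum (hypothetical input format).
    """
    peptides_by_protein = defaultdict(set)
    for protein, peptide in peptide_hits:
        # A set collapses repeated spectra of the same peptide.
        peptides_by_protein[protein].add(peptide)
    return sorted(
        protein
        for protein, peptides in peptides_by_protein.items()
        if len(peptides) == 1
    )


# Hypothetical identifiers and sequences, for illustration only.
hits = [
    ("protA", "LVEELGK"),
    ("protA", "LVEELGK"),   # second spectrum of the same peptide
    ("protB", "AGTVDELR"),
    ("protB", "SLLPEIK"),
    ("protC", "MNDPIAR"),
]
print(one_hit_wonders(hits))  # ['protA', 'protC']
```

Proteins flagged in this way would be the candidates for complementary validation, for example by RT-PCR as done here.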
Another frequent problem encountered during genome annotation is the identification of a CDS located in two different reading frames that clearly encodes a unique, conserved protein. This can be either a real frameshift process occurring for the regulation of protein synthesis, an artefact resulting from a sequencing error, or a pseudogene that has been recently inactivated. As we identified peptides in different reading frames at the same loci in the present study (e.g. SPO_PG036 and SPO_PG037), we confirmed the production of the encoded polypeptides and discounted the existence of pseudogenes. We systematically checked the sequences of the five loci and found in all cases that the plausible frameshifts resulted from sequencing errors. In all five cases, the insertion of an extra nucleotide in the sequence shifted the coding region to another reading frame. This was expected, as frameshifts are rare regulatory processes that usually down-regulate protein synthesis in bacteria, while they are frequent in Archaea or viruses [34, 35]. The number of sequencing errors found in the R. pomeroyi genome sequence also supports the current idea of re-sequencing genomes that were established a decade ago. Here we have confirmed the value of proteogenomics in pinpointing the specific loci that need such sequence re-evaluation, as already highlighted by others.
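The effect of a single inserted nucleotide on the reading frame can be sketched as follows. The sequence, insertion position and function names are hypothetical and serve only to illustrate the principle; they do not correspond to any of the five loci discussed above.

```python
def codons(seq, frame=0):
    """Split seq into complete codons, starting at offset `frame`."""
    seq = seq[frame:]
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def insert_base(seq, pos, base):
    """Return seq with `base` inserted before index `pos`."""
    return seq[:pos] + base + seq[pos:]


# Hypothetical coding sequence: ATG GCT GAA ACC GGT TAA
reference = "ATGGCTGAAACCGGTTAA"
# Simulated sequencing error: one extra base after the second codon.
erroneous = insert_base(reference, 6, "C")

print(codons(reference))      # original codons
print(codons(erroneous, 0))   # frame 0: codons after the insertion are altered
print(codons(erroneous, 1))   # frame +1: the downstream codons reappear
```

In this toy example the codons downstream of the insertion (GAA ACC GGT TAA) are read in the next frame of the erroneous sequence, which is why peptides from a single protein can map to two different reading frames at the same locus.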
It is worth noting that ortho-proteogenomic extension of the corrected annotations to phylogenetically related microorganisms reinforces the interest of proteogenomic studies for poorly studied bacterial phyla. Ortho-proteogenomic analyses have, to date, been limited to only two genera, Mycobacterium and Yersinia, and have not been extended beyond this taxonomic level. In the present work, we exploited the MS/MS data combined with comparative genomics to extend re-annotations to genomes from higher taxonomic ranks. Although all sequenced members of the Roseobacter clade are distantly related, they all form a robust cluster with a high rate of similarity at the 16S rRNA nucleotide sequence level. We have successfully extended the identified N-terminal annotation of the 486 proteins detected in R. pomeroyi to 9887 homologous genes in the thirty-six sequenced Roseobacter isolates, corresponding to nineteen distinct genera. In this way, 1082 genes that were wrongly annotated were confidently corrected. This represents 11% of the total number of ORFs considered. Highlighting the importance of manual curation of genome annotations, the rate of erroneous N-terminal identifications fell to 6.8% when considering only the four complete Roseobacter genomes included in this study. These error rates are probably slightly underestimated, as we only considered the conserved and obvious corrections. It is important to note that the full rate of badly annotated N-termini established on the well-annotated genome of R. pomeroyi was 12.8%. A more comprehensive annotation of the clade could only be accomplished by integrating a comparative proteogenomic analysis of various Roseobacter strains, as previously carried out with the genus Shewanella (Gupta et al. 2008).
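As a quick check of the clade-wide figure quoted above, the proportion of corrected N-terminal annotations can be recomputed from the two counts given in the text (1082 corrected genes out of 9887 homologous genes considered). The function name is illustrative only.

```python
def error_rate_percent(corrected, total):
    """Percentage of considered genes whose annotation was corrected."""
    return 100.0 * corrected / total


# Counts taken from the text: 1082 corrected out of 9887 homologous genes.
clade_rate = error_rate_percent(1082, 9887)
print(f"{clade_rate:.1f}%")  # 10.9%, i.e. about 11%
```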
Finally, the identification of operon structures by RT-PCR has given insights into the plausible function of the novel proteins identified in the present study. Bacterial genomes are usually well structured and regulated in the form of operons. Remarkably, we found that most of the novel, proteogenomic-detected CDSs were identified in operons encoding catabolic functions for amino acid degradation, RTX-like secreted proteins or central citric acid cycle metabolism. Because most of the culture conditions involved a peptide broth, this catabolism is favoured, which may have facilitated such discoveries. Whether the genes encoded in the close neighbourhood of genes specifying RTX-like toxins are part of the protein secretion system or associated factors is an interesting question, as such toxins can be abundantly secreted, as previously shown [24, 38]. Moreover, we identified novel CDSs with no ascribable function in operons encoding essential determinants of the citric acid cycle. These novel CDSs are not at all conserved among other Roseobacter members, but their presence is a common feature restricted to members of this clade. Whether these genes encode proteins that enhance this central metabolic cycle in these bacteria or are opportunistic genes that specifically appear in this operon because of the advantage of their high expression is an open question. The presence of the novel proteins found in the citric acid cycle operon in R. pomeroyi could represent snapshots of how novel proteins with novel specific functions arise during evolution.
Systematic listing of CDSs in numerous microorganisms, with the help of proteogenomic evidence, should increase the accuracy of annotation software. As demonstrated here, proteogenomic evidence from bacteria belonging to groups that have, thus far, been poorly characterized, such as the Roseobacter clade, is necessary to improve genome and even metagenome annotations. Ortho-proteogenomic annotation extension to a whole bacterial clade has proven here to be highly valuable. Such extension could also be applied to metagenome data, taking into account higher constraints.