Tuberculosis, the disease caused by the pathogen Mycobacterium tuberculosis, is responsible for approximately 2 million deaths annually according to the World Health Organization. The genomic annotation of several lineages of Mycobacterium sp. is now available and largely validated [2–5]. However, the annotation of protein-coding genes remains a challenge in genome sequencing projects despite advances in computational gene finding [6, 7]. Consequently, differences in gene annotation introduced by different prediction methodologies can have a major impact on subsequent studies. In addition, the presence of overlapping genes further increases the difficulty of annotation, resulting in theoretical protein products of different lengths.
An example of such differences can be seen in the number of Open Reading Frames (ORFs) annotated for the H37Rv laboratory strain of M. tuberculosis by two independent institutions [2, 8], which differ by up to 12% in the number of annotated genes alone. In addition, the lengths of genes annotated by both institutions differ because of differences in start codon choice. Therefore, validation of such annotations by identification of protein sequences is highly desirable to further refine the genomic annotations and to enable the generation of improved, unbiased databases. Mass spectrometry-based proteomic approaches (also referred to as "shotgun" proteomics) are among the most sensitive, high-throughput methods available for large-scale screening of the peptides present in a particular sample (for a recent review, see ). Such techniques have become instrumental in proteomic projects aiming for systematic functional analyses of the genes uncovered by genome sequencing initiatives. Recently, MS-based approaches have been used to aid gene annotation in prokaryote and eukaryote genomes [11, 12], and to validate genomic annotation. Deshayes et al. demonstrated, for example, that a mass spectrometry-driven validation could identify sequencing errors in the genome of Mycobacterium smegmatis that had been mistaken for interrupted coding sequences.
The possibility of in-depth analysis of complex proteomes has been dramatically increased by recent developments in mass spectrometry-based proteomics, in particular by the hybrid Linear Ion Trap-Orbitrap mass spectrometer [14, 15], in which ions are detected with high resolution by their motion around a spindle-shaped electrode. It has recently been shown that, by using a 'lock mass strategy', very high mass accuracy is routinely achievable in both the MS and MS/MS modes, which virtually eliminates the problem of false positive peptide identification in proteomics.
In this article, we compared the revised original annotation for M. tuberculosis strain H37Rv from the Sanger Institute [2, 17] with the annotation of the same sequence from the Institute for Genomic Research (TIGR). Previous results from our group suggested that differences in annotation may lead to divergent proteomic characterization, but those results were obtained using low-sensitivity, low-accuracy mass spectrometers. Therefore, we generated a proteomic dataset from M. tuberculosis H37Rv culture filtrate, acquired on a high-accuracy LTQ-Orbitrap instrument to improve identification coverage and reliability, and we aimed to identify tryptic peptides specific to one annotation or the other.
Tryptic peptides specific to one annotation or the other can be observed when a complete gene is described in only one of the annotations. Specific tryptic peptides can also be observed when the two annotations disagree on the start codon of a particular gene. In that case, the specific tryptic peptides lie in the N-terminal part of the longer gene product. The correct choice of start codon may also be confirmed by observation of the very N-terminal peptide, with or without its first methionine.
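The reasoning above can be illustrated with a minimal sketch. It assumes a simplified in silico trypsin rule (cleavage after K or R unless followed by P, no missed cleavages) and uses two hypothetical sequences, `LONG_FORM` and `SHORT_FORM`, that stand for the same gene annotated with an upstream and a downstream start codon. The sequences, function names and the minimum peptide length are illustrative assumptions and are not taken from either annotation or from the search settings used in this study.

```python
import re

# Simplified trypsin rule (assumption): cleave after K or R unless followed by P.
TRYPSIN_SITE = r"(?<=[KR])(?!P)"

def tryptic_digest(protein, min_len=6):
    """Return the set of fully tryptic peptides (no missed cleavages)."""
    pieces = re.split(TRYPSIN_SITE, protein)
    return {p for p in pieces if len(p) >= min_len}

def n_terminal_variants(protein):
    """Return the N-terminal tryptic peptide with and without its first methionine."""
    first = re.split(TRYPSIN_SITE, protein)[0]
    variants = {first}
    if first.startswith("M"):
        variants.add(first[1:])
    return variants

# Hypothetical example: the same gene with two different start codon choices.
LONG_FORM = "MTEQLAGRSVDPNAAKMSTLLEGAKVIDETWGGRPELFNAK"
SHORT_FORM = LONG_FORM[16:]  # downstream start codon (internal methionine)

# Peptides that exist in only one of the two predicted products.
only_in_long = tryptic_digest(LONG_FORM) - tryptic_digest(SHORT_FORM)
only_in_short = tryptic_digest(SHORT_FORM) - tryptic_digest(LONG_FORM)

# N-terminal peptides of the short form that cannot arise from the long form.
start_evidence = n_terminal_variants(SHORT_FORM) - tryptic_digest(LONG_FORM)

print(sorted(only_in_long))
print(sorted(only_in_short))
print(sorted(start_evidence))
```

In this toy case the peptides unique to the longer product all lie upstream of the downstream start, while the only peptide that would support the shorter product is its N-terminal peptide lacking the initial methionine.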
Our M. tuberculosis culture filtrates contain a high number of secreted proteins exported through the general secretory pathway. In order to identify the N-terminal peptides of processed secreted proteins, from which the signal peptide has been cleaved off, we used the SignalP algorithm [19, 20] to identify proteins with a signal peptide and their potential cleavage sites. The choice of start codon may, however, influence the prediction of signal peptides by most signal prediction algorithms, including SignalP, because they consider the distance between the potential cleavage site and the start of the precursor. As a consequence, the N-terminal peptide of a mature secreted protein may not be detected if the choice of start codon precludes the prediction of the signal peptide.
We designed a database containing predicted N-terminal sequences in order to improve the identification of peptides in this region. To avoid repetitive entry generation and high levels of redundancy, the database was organized in a concise, MS-friendly format as described by Schandorff et al. In order to identify single nucleotide polymorphisms (SNPs) and predicted N-termini, these authors created a modified IPI human database in which the tryptic peptide containing a possible mutation was inserted at the end of the original entry, always preceded by the letter "J" (representing no amino acid). With this method they were able to test and identify several SNPs without compromising database size, redundancy or the reliability of the results. Therefore, all gene entries in the Sanger and the TIGR annotations were submitted to SignalP v3.0 prediction, and the N-terminal sequences corresponding to cleavage sites with a sufficiently high score were appended to the entries in the same "J"-delimited format.
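The entry format described above can be sketched as follows. This is an illustration under stated assumptions, not the script used to build the database: the predicted cleavage sites are passed in as plain `(position, score)` pairs rather than parsed from SignalP output, the score cut-off `min_score` is a hypothetical value (the text only specifies "a sufficiently high score"), and the example sequence is invented.

```python
import re

# Simplified trypsin rule (assumption): cleave after K or R unless followed by P.
TRYPSIN_SITE = r"(?<=[KR])(?!P)"

def first_tryptic_peptide(sequence):
    """Return the N-terminal tryptic peptide of a sequence."""
    return re.split(TRYPSIN_SITE, sequence)[0]

def append_nterm_candidates(sequence, predictions, min_score=0.5, separator="J"):
    """Append the predicted mature N-terminal peptides to a database entry.

    predictions: iterable of (site, score), where `site` is the number of
    residues removed with the signal peptide (hypothetical input format).
    Each retained peptide is added after the original sequence, preceded by
    the separator letter, so the original entry itself is left unchanged.
    """
    entry = sequence
    kept_sites = sorted({site for site, score in predictions if score >= min_score})
    for site in kept_sites:
        mature = sequence[site:]  # protein after signal peptide removal
        entry += separator + first_tryptic_peptide(mature)
    return entry

# Hypothetical precursor: a signal peptide followed by the mature protein.
SIGNAL = "MKRLLAAVALAGSAAAS"
MATURE = "ADPTTEGKVIDETWR"
precursor = SIGNAL + MATURE

# One confident and one low-scoring predicted cleavage site (invented values).
predictions = [(len(SIGNAL), 0.93), (len(SIGNAL) + 5, 0.41)]

entry = append_nterm_candidates(precursor, predictions)
print(entry)
```

Because only the short N-terminal peptide is appended, and not a second copy of the whole mature protein, each additional predicted cleavage site adds only a few residues to the entry.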
In total, we were able to identify 449 proteins from the M. tuberculosis H37Rv culture filtrate fractions (comprising mainly extracellular proteins), a greater depth of identification than in the previous study and a difference attributable solely to better MS instrumentation. Among the identified peptides, we detected and validated 35 that were specific to one annotation: 34 were specific to the Sanger annotation and only 1 was specific to the TIGR annotation. In addition, the identified peptides led to the identification of 5 gene products whose genes are annotated only in the Sanger dataset (and not in the one from TIGR). These peptides represented 1.78% of all peptides identified in the study. Because the detected protein population was rather small, such discrepancies could be even more critical with a larger dataset. Therefore, it is important to generate more precise, unbiased gene annotation datasets for M. tuberculosis to allow more efficient proteomic characterization.