Basic use of GATU
The utility of GATU is illustrated here by the annotation of sheeppox virus strain A (SPPV-A) which was deposited to GenBank unannotated (AY077833). The related SPPV strain TU-V02127 (NC_004002) is used here as the reference genome. The annotations are extracted from the reference genome and displayed graphically (Figure 2). The SPPV genomes are approximately 150 kb in length; running on one processor of a dual 1.0 GHz Macintosh G4, the automated process took less than 15 min to complete, with no user intervention required during this time. However, if a BLAST search is also performed against the NCBI nr database for each ORF, the running time will be substantially longer; we routinely run these searches interactively since only a small fraction of the ORFs require them.
After processing is complete, GATU provides the user with an interactive table (Figure 3, top section) and a graphical view (Figure 3, bottom section) of annotations; the views in these panels can be modified in the Preferences. Clicking on a row in the table (containing an accepted annotation) will automatically highlight this annotation in the graphical view; the Jump button moves the graphical display to center the ORF in the window. The slider at the bottom of the graphical view allows the user to zoom in and out. The interactive table contains a list of all the putative annotations for the target genome, along with relevant information for each annotation; clicking on the column header will sort the table by the column value. The buttons below the interactive table allow the user to switch between lists of the Reference genes and target genome Annotations.
The interactive table and graphical display allow the user to review the automatically generated annotations and accept or reject them as desired. To aid the user, GATU pre-selects the Accept annotation box for all annotations that meet user-specified requirements of length, percent sequence identity and coding strand identity. In our example, GATU found and automatically accepted target sequence counterparts of 146 of the 148 genes (genes 1–147 and gene 101a) present in the reference genome. The accepted ORFs were 99.1–100% similar (predicted amino acid sequence) to the reference genes and 127 were 100% similar; this also indicates that the start/stop positions of the reference and target genes matched. Variation in the start/stop positions can also be examined by comparing the P. size column (predicted size) with the Size column (actual length in the reference genome) and viewing the NEEDLE alignments.
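The pre-selection step described above can be pictured as a simple filter over the candidate annotations. The following is only a minimal sketch, not GATU's actual code: the `Candidate` record, the `auto_accept` function and the `max_size_diff` tolerance are hypothetical names and values chosen for illustration, the percent value is taken to be the similarity reported by the global alignment, and the 60% default merely mirrors the threshold used in the bacterial tests reported later in this article.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    gene: str                  # name of the reference gene
    ref_size: int              # "Size": length in the reference genome (aa)
    pred_size: int             # "P. size": predicted length in the target (aa)
    percent_similarity: float  # value taken from the global alignment
    same_strand: bool          # True if the coding strand matches the reference


def auto_accept(c, min_similarity=60.0, max_size_diff=0.10):
    """Return True if a candidate meets all three (assumed) requirements."""
    # Relative difference between predicted and reference lengths.
    size_diff = abs(c.pred_size - c.ref_size) / c.ref_size
    return (c.percent_similarity >= min_similarity
            and size_diff <= max_size_diff
            and c.same_strand)


# Invented example records; they are not data from the SPPV annotation.
candidates = [
    Candidate("geneA", 300, 300, 100.0, True),  # identical: accepted
    Candidate("geneB", 250, 40, 95.0, True),    # truncated: needs review
    Candidate("geneC", 180, 180, 45.0, True),   # too divergent: needs review
]
accepted = [c.gene for c in candidates if auto_accept(c)]
```

Candidates that fail any one of the tests are not discarded; they simply remain unchecked in the table so that the annotator can review them.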
Before deciding which annotations to include in the genome file, the user may wish for more information about a particular annotation. In our example, genes 02 and 146 (these two genes are identical because they lie within the terminal inverted repeats of the virus) need reviewing, as they were not automatically accepted for inclusion; the user must determine whether they should be accepted. To assist with this task, a global alignment of the reference protein and its putative counterpart on the target genome (generated by the NEEDLE program) can be obtained by clicking on the Needle Alignment button (Figure 4). The NEEDLE alignment shows that the ORF in the target genome is truncated at the N-terminus but contains the remaining 240 aa encoded by the reference genome. This global alignment also provides a useful indication of the level of similarity between the two ORFs. Another useful tool is a TBLASTN search of the target genome using the reference gene as a query; the results of this search can be obtained by clicking on the Blast Alignment(s) button (Figure 5). From the data shown, it is apparent that a frame-shifting mutation is responsible for the difference between the target and reference ORFs. If desired, the user could open these two genomes in our Viral Genome Organizer (VGO) program to determine whether the promoter regions are similar and to locate the frame-shifting mutation (which lies in a run of Ts). Another application that users will find useful at this stage is JDotter, which can show an alignment of the genomes together with the whole genome dotplots [9]. With these data at hand, the user can now make an informed decision as to whether this putative ORF should be included in the target genome annotations.
Given that the promoter regions and ORF start sites are similar, it is likely that protein translation of this mRNA would begin at the same position as in the reference genome, leading to the synthesis of a polypeptide only 12 aa in length. Therefore, this ORF should be omitted from the annotation even though, at first glance, it appears to be significant. Alternatively, the annotation could be accepted with the FRAG (fragment) designation; the user can select this by clicking in the relevant row of the Genetype column.
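To illustrate how a global alignment exposes an N-terminal truncation of this kind, the following is a minimal Needleman-Wunsch sketch. It is not the NEEDLE program itself: it assumes a simple match/mismatch score and a linear gap penalty rather than a substitution matrix with affine gaps, it reports percent identity rather than similarity, and the peptide sequences are invented for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return the two gapped strings."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(top, bottom):
    """Identical aligned residues as a percentage of the alignment length."""
    same = sum(x == y and x != "-" for x, y in zip(top, bottom))
    return 100.0 * same / len(top)


# Invented peptides: the "target" lacks the first two residues of the "reference".
ref_aln, tgt_aln = needleman_wunsch("MKTAYIAK", "TAYIAK")
identity = percent_identity(ref_aln, tgt_aln)
```

In this toy case the target is aligned as `--TAYIAK`, so the leading gaps mark the missing N-terminal residues while the remainder matches the reference exactly.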
To complement the process of matching known genes in the reference genome to the target, it is necessary to search for potential ORFs in the target genome that have no obvious match in the reference genome. Such ORFs could be the result of additional sequences in the target genome, minor sequence differences in the reference genome resulting in the loss of a functional gene, errors in the original annotation, overlapping ORFs, or failure to use the first MET codon as the ORF start. All ORFs larger than the Cutoff size that have not been matched to a gene in the reference genome are automatically placed in an Unassigned-ORFs table, which is accessed by clicking on the relevant button below the interactive table. Unassigned-ORFs can be added to the list of selected annotations and the graphical display by checking the Accept box for a given ORF; the Jump button moves the graphical display to show the selected ORF in the window (Figure 6). Several utilities have been built into GATU allowing the user to further investigate the nature of these Unassigned-ORFs. Right-clicking on any of the Unassigned-ORFs and selecting the appropriate option allows the user to view the results of pre-run BLAST searches for the ORF, run BLAST searches if the Manual BLAST search option was checked during the initial annotation run, or initiate a new search using a different algorithm (e.g. TBLASTN) or database (e.g. VOCs virus databases [2], or the NCBI nr database).
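The idea behind the Unassigned-ORFs table can be sketched as a six-frame ORF scan followed by removal of the ORFs already matched to reference genes. The sketch below is an illustration under stated assumptions and not GATU's implementation: the function names are hypothetical, ORFs are taken to run from the first ATG after a stop codon to the next in-frame stop, the cutoff is expressed in nucleotides, and coordinates are 0-based and end-exclusive on the forward strand.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_nt):
    """Yield (start, end, strand) for ATG..stop ORFs of at least min_nt nt."""
    n = len(seq)
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG in this open stretch
                elif start is not None and codon in STOPS:
                    end = i + 3  # include the stop codon
                    if end - start >= min_nt:
                        if strand == "+":
                            yield (start, end, "+")
                        else:
                            # Convert back to forward-strand coordinates.
                            yield (n - end, n - start, "-")
                    start = None


def unassigned_orfs(seq, matched, min_nt):
    """ORFs above the cutoff that were not matched to a reference gene."""
    return [orf for orf in find_orfs(seq, min_nt) if orf not in matched]


# Invented toy sequence containing a single short forward-strand ORF.
toy = "CCATGAAACCCGGGTAACC"
all_orfs = list(find_orfs(toy, 9))
remaining = unassigned_orfs(toy, matched={(2, 17, "+")}, min_nt=9)
```

Using the first ATG as the start is a simplification; as noted above, a real gene may not start at the first MET codon, which is one reason these ORFs are left for the annotator to review.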
In summary, GATU correctly transferred and automatically accepted annotation for 146 of 148 genes from the reference sheeppox genome to the target genome. The missing genes (2 copies of a gene in the terminal inverted repeats of the viral genome) encode a protein of only 36 aa and were subsequently detected by a TBLASTN search of the target genome. In further tests, 97% of the genes in rabbitpox virus (AY484669; a strain of vaccinia virus) were correctly annotated by using vaccinia virus strain WR (NC_006998) as the reference genome, whereas 88% of the genes in rabbitpox virus were correctly annotated by using ectromelia virus (a different species in the same Orthopoxvirus genus) as reference (data not shown). It should be noted, however, that although duplicated genes, such as those in poxvirus terminal inverted repeats, are detected by GATU, the evaluation of paralogues requires special attention because GATU can only match a reference gene to the first BLAST hit that is found. GATU places other paralogues into the Unassigned-ORFs group where they can be reviewed by the annotator and manually added to the annotation.
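The consequence of first-hit matching for paralogues can be illustrated with a short sketch. This is a schematic reading of the behaviour described above, not GATU's code; the function name, the data layout and the coordinates are hypothetical.

```python
def assign_first_hit(hits_by_gene):
    """Assign each reference gene to its first hit; set aside the rest.

    hits_by_gene maps a reference gene name to its target-genome hits,
    given as (start, end, strand) tuples in the order they were found.
    """
    assigned = {}
    leftover = []
    for gene, hits in hits_by_gene.items():
        if not hits:
            continue  # no counterpart found in the target genome
        assigned[gene] = hits[0]   # only the first hit is matched
        leftover.extend(hits[1:])  # further copies await manual review
    return assigned, leftover


# Invented coordinates for a gene present once in each terminal repeat.
hits = {"itr_gene": [(100, 208, "+"), (149000, 149108, "-")]}
assigned, leftover = assign_first_hit(hits)
```

Here the second copy ends up in `leftover`, which plays the role of the Unassigned-ORFs group from which the annotator can add it back.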
Annotation of bacterial genomes with GATU
Since many problems associated with genome annotation are magnified considerably when dealing with bacterial genomes due to their large size, we have tested GATU on bacterial genomes up to ten times the size of a typical poxvirus genome. First, Chlamydia pneumoniae strain TW183 (NC_005043) was annotated using C. pneumoniae strain AR39 (NC_002179), whose GenBank file contains 1112 annotated genes, as the reference genome. These two strains are highly similar (>90% nucleotide identity). Of the 1112 reference genes, 817 were 100% similar (global alignment with NEEDLE) to a gene found on the target genome, 145 genes were 95–99% similar, 28 genes were 90–94% similar, 22 genes were 85–89% similar, 16 genes were 80–84% similar, 11 genes were 75–79% similar, 14 genes were 70–74% similar and 59 genes were <70% similar. This information is available from the Statistics menu within GATU. With a minimum threshold of 60% aa similarity between the reference gene and the target ortholog, 1063 ORFs (96%) were accepted automatically by GATU and a further 24 were accepted from the reference genome after reviewing data in the Unassigned-ORFs table. Although annotation of bacterial genomes is outside of our area of expertise, another 29 ORFs that were not annotated in the reference genome were added from the Unassigned-ORFs table to the target genome based on their similarity to genes in the NCBI nr database. In total, we were able to annotate the C. pneumoniae TW183 genome with 1116 genes and gene fragments; in comparison, the GenBank file for this genome contains 1113 annotations (NC_005043). Two Excel spreadsheets show a more detailed comparison of these annotations [see Additional file 1]; in summary, our annotation performed with GATU added several annotations to the target that were not in its GenBank file and failed to predict a small number of short ORFs.
However, such differences will arise whenever two authors annotate a genome using different processes and standards; the discrepancies seen here represent less than 10% of the total genes, indicating a high level of consistency between the two annotations.
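The similarity bands and totals reported for the C. pneumoniae test can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the `band` function is a hypothetical reconstruction of how percent similarities might be grouped and is not taken from the GATU Statistics menu, whereas the counts are those given in the text.

```python
BANDS = ["100", "95-99", "90-94", "85-89", "80-84", "75-79", "70-74", "<70"]


def band(similarity):
    """Map a percent similarity to one of the reporting bands (assumed rule)."""
    if similarity >= 100:
        return "100"
    if similarity < 70:
        return "<70"
    low = int(similarity // 5) * 5  # lower edge of the 5-point band
    return f"{low}-{low + 4}"


# Counts per band as reported for the 1112 reference genes.
reported = dict(zip(BANDS, [817, 145, 28, 22, 16, 11, 14, 59]))
total_reference_genes = sum(reported.values())

auto_accepted = 1063      # accepted automatically at the 60% threshold
from_unassigned = 24      # reference genes recovered after manual review
target_specific = 29      # ORFs not annotated in the reference genome
total_annotated = auto_accepted + from_unassigned + target_specific
percent_auto = round(100 * auto_accepted / total_reference_genes)
```

The band counts sum to the 1112 reference genes, the three groups of accepted ORFs sum to the 1116 annotations, and 1063 of 1112 rounds to the 96% quoted for automatic acceptance.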
For our second test using bacterial genomes, we used two different species from the Thermoplasma genus: T. volcanium strain GSS1 (NC_002689) was annotated using T. acidophilum strain DSM 1728 (NC_002578) as a reference. The reference genome GenBank file contained 1482 annotated genes, and based on a threshold of 60% aa similarity 1103 annotations (74%) were accepted automatically into the target genome annotation (11 genes had 100% similarity, 135 genes were 95–100% similar, 182 genes were 90–94% similar, 164 genes were 85–89% similar, 186 genes were 80–84% similar, 141 genes were 75–79% similar, 120 genes were 70–74% similar and 459 genes were <70% similar). A further 124 ORFs could be added from the Unassigned-ORFs table; most of these ORFs were truncated and/or fragmented versions of their counterparts in the reference genome. Finally, a number of target genome-specific ORFs (201) were added following further analysis of the Unassigned-ORFs, to yield a total of 1428 ORFs accepted for annotation. A comparison between our annotation of GSS1 using GATU and the annotated GSS1 GenBank file (NC_002689) showed 100 differences, approximately 7% [see Additional file 2].
The 1112 and 1482 BLAST searches/NEEDLE alignments required for these two annotation trials took 25 and 40 minutes, respectively, to run on a 1 GHz G4 Macintosh computer. In both cases, a large fraction of the total bacterial ORFs present were automatically annotated correctly by GATU; the majority of the "missed" ORFs were small (<150 nt) hypothetical genes that were either not annotated in the reference genome or were unique to the target genome. This represents a substantial reduction of the time and effort required for annotation, allowing the annotator to concentrate on those areas of the genome that require expert knowledge to correctly annotate.