Limitations of the rhesus macaque draft genome assembly and annotation

Finished genome sequences and assemblies are available for only a few vertebrates. Thus, investigators studying many species must rely on draft genomes. Using the rhesus macaque as an example, we document the effects of sequencing errors, gaps in sequence and misassemblies on one automated gene model pipeline, Gnomon. The combination of draft genome with automated gene finding software can result in spurious sequences. We estimate that approximately 50% of the rhesus gene models are missing, incomplete or incorrect. The problems identified in this work likely apply to all draft vertebrate genomes annotated with any automated gene model pipeline and thus represent a pervasive challenge to the analysis of draft genomes.


Background
Genomic sequences and assemblies have been provided for many vertebrate species. However, only a few have reached "finished" status. The rest are considered "draft" genomes. Although draft genomes are known to be less complete than finished genomes, the implications of working with resources derived from draft genomes are not always appreciated by investigators not involved in their production. Independent assessments of possible misassemblies and sequencing errors are rarely done. Finally, the quality metrics reported when a new genome is published provide a global assessment of completeness but little information on the quality of the gene models derived from the sequence and assemblies.
To address these issues, we examined the draft genome sequence and assembly of the rhesus macaque (Macaca mulatta) [1], an important biomedical research model [2][3][4][5][6][7][8]. As with many vertebrate draft genomes, an animal was sequenced to 5-6-fold coverage using Sanger sequencing. Several different assemblers were used and their results combined [1]. The human genome assembly was used to correct errors in the rhesus assembly.
Using a gene-based approach that incorporated both similarity and synteny information from the human genome, we found sequencing errors, missing sequence that included exons or parts of exons, and misassemblies in the draft rhesus genome. Further, the automated annotation pipeline used by NCBI, Gnomon [9], made a wide variety of errors in producing gene models. These errors were due partly to incorrect sequence and assembly information and partly to the spurious annotations Gnomon created when presented with incomplete data.
We provide specific examples of the types of errors produced in the draft genome and in the annotation provided by Gnomon. We estimate that approximately 50% of the rhesus macaque genes were misannotated as a result. The errors observed in the current work likely apply to most other draft vertebrate genomes and would be expected to occur with any automated gene model pipeline.

Results
Sequencing errors
Case 1
actin-related protein T1 (ACTRT1): incorrect insertion results in a frameshift
Our targeted sequencing of the single exon of the rhesus ACTRT1 gene [GenBank:JF749838.1] revealed that a "C" was incorrectly inserted at position 126,268,430 in GenBank:NC_007878.1 (Figure 1). This occurred within the coding region of the single exon of this gene and thus resulted in a frameshift. The GenBank RNA associated with this gene was altered to remove the frameshift, but it is still annotated as a pseudogene [GenBank:

due to misannotation rather than evolutionary divergence. Alignment of the human AADAT transcript [GenBank:NM_182662.1] with the rhesus draft genome reveals that exon 12 is missing in the rhesus sequence (Figure 4). We targeted this exon for sequencing in the rhesus [GenBank:JN624744.1]. When our exon 12 sequence is used to create a gene model for rhesus AADAT, a protein similar to the human ortholog is found in this region. However, we were not able to create a complete protein from the available sequence because another 30 nucleotides of sequence in exon 13 were also missing from the rhesus draft genome.

is syntenic with human chromosome 6, the location of the SERPINB6 gene. Very few genes on a human autosome would be expected to be found on rhesus chromosome X. Therefore, the true location of the rhesus SERPINB6 gene is likely chromosome 4. Contigs which are apparently incorrectly assigned to chromosome X include: [GenBank:AANU01001442.1 and AANU01001441.1]. The correct assignment of these contigs would appear to have been, instead, rhesus chromosome 4.

Case 6
RALY RNA binding protein-like (RALYL): gene split between two chromosomes
The correct assignment of this contig would appear to have been, instead, rhesus chromosome 8. The anomalous sequence at the 5' end of the rhesus RALY transcript is likely due to this incorrect assignment.

incorrect assignment of sequences to chromosome 20 and the failure to integrate a contig into the chromosome 1 file led to two separate inventions of spurious protein sequence by Gnomon.

Case 9
Src homology 2 domain containing E (SHE): gene split between two chromosomes
This apparent misassembly was discovered while trying to build a gene model for the rhesus ortholog of SHE (Figure 9). The rhesus ortholog of SHE is correctly assigned to chromosome 1. However, it has a provisional gene symbol LOC716722 [9]

appears to be chimeric as it contains some sequence belonging to rhesus chromosome 1 (approximately 1-2,000) and some sequence belonging to rhesus chromosome X (approximately 2,000-5,282).

Case 10
Bardet-Biedl syndrome 1 (BBS1): genomic fragment containing exon in wrong orientation
This apparent misassembly was discovered while trying to build a gene model for the rhesus ortholog of BBS1 (Figure 10).

Chromosome 20 misannotations
To estimate the frequency of misannotations of rhesus genes, the first 100 genes from rhesus chromosome 20 were examined for evidence of misannotation (Additional file 1: Supplementary Table 1). Because rhesus chromosome 20 is syntenic with human chromosome 16, it is possible to directly compare human genes with their putative rhesus orthologs. Only 54% of the proteins associated with this set appeared to be complete and correct (Figure 11, Additional file 1: Supplementary Table 1). Of the rest, 26% appeared to be wrong, 6% were incomplete, 5% had no RNA or protein derived, 4% had no gene annotated and one was unclear.

Discussion
The draft rhesus assembly was described in a publication which appeared in April 2007 [1]. As with all draft sequences, there were substantial gaps in the assembly as well as sequencing errors. How significant were these errors? Global statistics were provided in the paper describing this assembly [1]. However, these types of statistics, while providing useful general information, can leave the reader with an incorrect perception of completeness at the gene level. For example, although it is reported that "98% of the available genome was represented" [1], we estimate that approximately 50% of the gene models are incomplete or incorrect, leading to missing, incomplete or spurious proteins. This was partly due to sequencing errors. A single missing nucleotide in an exon can result in a seriously incorrect protein model. Although such a mistake represents a very small percentage of the total draft sequence, the consequences for investigators relying on the accuracy of the gene models can be very great.
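The effect of a single-base error on a protein model can be illustrated with a small sketch. The sequence used here is hypothetical and is not taken from any rhesus gene; the function name `translate` and the compact encoding of the standard genetic code are choices made for illustration only. Inserting one "C" into the coding sequence, as in Case 1, shifts the reading frame so that every downstream codon is read differently.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical coding sequence (illustration only, not a rhesus sequence).
reference = "ATGGCTGAAGTTCTGCGTTAA"
# The same sequence with a single "C" inserted after the second codon.
with_insertion = reference[:6] + "C" + reference[6:]

print(translate(reference))       # MAEVLR
print(translate(with_insertion))  # MARSSAL
```

In this toy case only the first two residues survive the insertion; all later residues differ, and the original stop codon is no longer read in frame.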
Misannotation of genes, transcripts and proteins with Gnomon and other automated pipelines has been previously reported [12,13]. Further, a recent study using exome data found significant errors in sequence and annotation in rhesus macaques and chimpanzees [14]. According to the documentation for Gnomon [9], empirical evidence is preferred when producing gene models. However, for many draft species (including the rhesus macaque), complete transcriptome information is lacking. In the absence of such information, Gnomon attempts to predict exons from genomic sequence [9]. If this sequence is incomplete or wrong, there is no indication that Gnomon is able to detect or even flag this possibility. Indeed, we observed instances where Gnomon "invented" exons from intronic sequence when a complete exon was not present. As a result, both transcript and protein models were incorrect. We also found instances of genes incorrectly annotated as pseudogenes because Gnomon was unable to build a complete gene model due to missing or incorrect sequence information. It is important to point out that the errors made by Gnomon with the rhesus macaque genome assembly would likely be made with any automated gene model pipeline with any draft vertebrate genome. Hence, the issues identified here represent a pervasive challenge to the analysis of all draft vertebrate genomes.
In the paper describing the draft rhesus genome, 1.8 Mb of finished sequence was compared to the available ESTs to determine whether there were misassemblies [1]. It was reported that: "No misassemblies were identified in that comparison" [1]. However, our gene-based approach indicates that it is highly likely that there were a number of significant misassemblies, including contigs assigned to the wrong chromosomes. The discovery that some of the contigs were chimeric brings into question the assembly of the contigs themselves. The total number of misassemblies will not be known until a more finished version of the rhesus macaque genome is available for comparison. However, there are two studies which also support our contention that there are major misassemblies in the rhesus genome [15,16].
Draft sequences can be very helpful for some tasks. For example, we used a preliminary draft version of the rhesus macaque genome in our targeted approach to rhesus macaque microarray design [17]. However, for other uses, draft genomes may be insufficient to fully utilize genomics approaches, or worse, result in the generation of spurious results. Most investigators prefer to align their NextGen sequences against a reference genome. This cannot be reliably done with the rhesus draft genome (and likely not with other draft vertebrate genomes). Other methods that require a high-quality reference genome include SNP discovery and nucleotide-based probes (exon capture, siRNA and quantitative PCR).
Modern evolutionary studies depend on alignments of genomic sequences between species. Mismatches between species can be interpreted as indications of evolutionary change. However, given the incidence of sequencing errors and misannotation in the rhesus macaque draft genome (and likely all vertebrate draft genomes), such assumptions may lead to incorrect results. The more draft genomes included in an analysis, the smaller the subset of genes correctly annotated across all the genomes included in the analysis. This is a particular challenge to evolutionary studies which require information from multiple species.
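The shrinking of the usable gene set can be made concrete with a rough calculation. As a simplifying assumption that is not a result of this work, suppose that each draft genome has the same fraction of correctly annotated genes and that errors in different genomes are independent. The fraction of genes correct in every genome is then that fraction raised to the power of the number of genomes. The value 0.5 reflects our approximate estimate for the rhesus macaque; applying it to other genomes is an assumption, and the function name is hypothetical.

```python
def fraction_correct_in_all(p_correct: float, n_genomes: int) -> float:
    """Expected fraction of genes correctly annotated in every one of n genomes.

    Assumes (for illustration only) the same per-genome fraction p_correct
    and independent errors between genomes.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    if n_genomes < 1:
        raise ValueError("n_genomes must be at least 1")
    return p_correct ** n_genomes


# Fraction of genes usable as the number of draft genomes grows.
for n in range(1, 6):
    print(n, fraction_correct_in_all(0.5, n))
```

Under these assumptions, about 6% of genes would be correctly annotated in all of four draft genomes. Errors are unlikely to be fully independent in practice, so the figure is indicative only.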
Our exon order-based approach to identifying misassemblies suggests a possible strategy for assessing the quality of genome assemblies. Since exon order is highly conserved among vertebrates, requiring that contigs containing exons respect this order should lead to improved assemblies.

Conclusions
To take full advantage of the rhesus macaque genome, or indeed, any vertebrate genome, standard draft coverage (5-6X Sanger sequencing) is insufficient. This is because small sequence errors or gaps can result in large errors in assembly and in annotation. Given the relative costs involved, it may make sense to bring draft genomes closer to finished quality using NextGen sequencing.

Reference sequence and assembly
The initial draft rhesus assembly Mmul_0.1 (rheMac1) was released in January 2005. Mmul_051212 (rheMac2) was released in January 2006. No further assemblies based on this animal have been released since Mmul_051212. The draft rhesus genome assembly discussed in this work refers to Mmul_051212. NCBI produced a new annotation of Mmul_051212 in April 2010 (Macaca mulatta build 1.2) using Gnomon, a gene prediction tool for eukaryotes [9]. The annotations discussed in this work refer to build 1.2 annotations. The NCBI database Gene [10] was used to determine annotations for the rhesus draft genome.

Targeted re-sequencing
To assess the types of sequencing errors which occurred in the rhesus draft genome, four suspect exons from different genes were identified by attempting to align human mRNA with the rhesus draft genome. Evidence of likely error in annotation consisted of amino acids in the rhesus model not found in other species or a truncated protein with respect to other species. PCR primers flanking these exons were designed using Primer3 [18,19]. Default settings were used with the following exception: the human mispriming library was selected. Rhesus genomic DNA obtained from the reference animal was used as the template for amplification using standard PCR procedures [20]. Sanger sequences were obtained and used to determine whether the draft rhesus genome sequences were correct. All sequences were deposited in GenBank [HM067826.1, JF749838.1, JN589014.1, JN624744.1]. Four cases are described in detail in the Results section (Cases 1-4).

Misassembly detection
Vertebrate exon order is highly conserved. We took advantage of this fact to identify potential misassemblies in the rhesus macaque draft genome. BLAST was used to align human mRNAs against the draft rhesus chromosome and scaffold files. Anomalies in exon order were noted. Additional evidence for misassemblies consisted of mRNA data from rhesus macaques or a closely related species, the cynomolgus macaque (Macaca fascicularis), and sequence from a BAC clone. Six cases are described in detail in the Results section (Cases 5-10).
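The exon-order check can be sketched as follows, assuming that the BLAST alignments have already been reduced to one best hit per human exon. The `ExonHit` record, its fields and the function `find_exon_anomalies` are hypothetical names introduced for illustration, and the coordinates are invented; this is not the code used in this work.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExonHit:
    """Best alignment of one human exon to the draft assembly (hypothetical record)."""
    exon: int                     # exon number in the human mRNA, 1-based
    chrom: Optional[str] = None   # chromosome or scaffold of the hit; None if no hit
    start: int = 0                # start coordinate of the hit on that sequence
    strand: str = "+"             # "+" or "-"


def find_exon_anomalies(hits: List[ExonHit]) -> List[str]:
    """Report exons that are missing, on another chromosome, inverted, or out of order."""
    anomalies = [f"exon {h.exon}: no alignment" for h in hits if h.chrom is None]
    aligned = [h for h in hits if h.chrom is not None]
    if not aligned:
        return anomalies

    # Take the chromosome carrying most exons as the expected location.
    main_chrom = Counter(h.chrom for h in aligned).most_common(1)[0][0]
    on_main = [h for h in aligned if h.chrom == main_chrom]
    anomalies += [
        f"exon {h.exon}: on {h.chrom}, expected {main_chrom}"
        for h in aligned if h.chrom != main_chrom
    ]

    # Take the strand of most exons on that chromosome as the expected orientation.
    main_strand = Counter(h.strand for h in on_main).most_common(1)[0][0]
    consistent = [h for h in on_main if h.strand == main_strand]
    anomalies += [
        f"exon {h.exon}: strand {h.strand}, expected {main_strand}"
        for h in on_main if h.strand != main_strand
    ]

    # Coordinates should rise with exon number on "+" and fall on "-".
    ordered = sorted(consistent, key=lambda h: h.exon)
    for prev, cur in zip(ordered, ordered[1:]):
        rising = cur.start > prev.start
        if rising != (main_strand == "+"):
            anomalies.append(f"exon {cur.exon}: out of order relative to exon {prev.exon}")
    return anomalies


# Hypothetical hits for a six-exon gene (coordinates invented for illustration).
example = [
    ExonHit(1, "chr1", 1000, "+"),
    ExonHit(2, "chr1", 2000, "+"),
    ExonHit(3),                      # exon absent from the draft sequence
    ExonHit(4, "chrX", 500, "+"),    # exon placed on a different chromosome
    ExonHit(5, "chr1", 5000, "-"),   # exon in the opposite orientation
    ExonHit(6, "chr1", 6000, "+"),
]
for line in find_exon_anomalies(example):
    print(line)
```

The three kinds of flag correspond to the kinds of problem described in the Results: a missing exon, a gene split between two chromosomes, and a fragment containing an exon in the wrong orientation. A flag marks a candidate only; as described above, supporting evidence from mRNA or BAC sequence is still needed before a misassembly is called.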