Limitations of the rhesus macaque draft genome assembly and annotation
© Zhang et al.; licensee BioMed Central Ltd. 2012
Received: 6 December 2011
Accepted: 4 May 2012
Published: 30 May 2012
Finished genome sequences and assemblies are available for only a few vertebrates. Thus, investigators studying many species must rely on draft genomes. Using the rhesus macaque as an example, we document the effects of sequencing errors, gaps in sequence and misassemblies on one automated gene model pipeline, Gnomon. The combination of draft genome with automated gene finding software can result in spurious sequences. We estimate that approximately 50% of the rhesus gene models are missing, incomplete or incorrect. The problems identified in this work likely apply to all draft vertebrate genomes annotated with any automated gene model pipeline and thus represent a pervasive challenge to the analysis of draft genomes.
Genomic sequences and assemblies have been provided for many vertebrate species. However, only a few have reached "finished" status. The rest are considered "draft" genomes. Although draft genomes are known to be less complete than finished genomes, the implications of working with resources derived from draft genomes are not always appreciated by investigators who were not involved in their production. Independent assessments of possible misassemblies and sequencing errors are rarely done. Finally, the quality metrics reported when a new genome is published provide a global assessment of completeness but little information on the quality of the gene models derived from the sequence and assemblies.
To address these issues, we examined the draft genome sequence and assembly of the rhesus macaque (Macaca mulatta), an important biomedical research model [2–8]. As with many vertebrate draft genomes, an animal was sequenced to 5-6-fold coverage using Sanger sequencing. Several different assemblers were used and their results combined. The human genome assembly was used to correct errors in the rhesus assembly.
Using a gene-based approach that incorporated both similarity and synteny information from the human genome, we found sequencing errors, missing sequence which included exons or parts of exons, and misassemblies in the draft rhesus genome. Further, the automated annotation pipeline used by NCBI, Gnomon, made a wide variety of errors in producing gene models. These errors were partly due to incorrect sequence and assembly information and partly due to the spurious annotations Gnomon created when presented with incomplete data.
We provide specific examples of the types of errors produced in the draft genome and in the annotation provided by Gnomon. We estimate that approximately 50% of the rhesus macaque genes were misannotated as a result. The errors observed in the current work likely apply to most other draft vertebrate genomes and would be expected to occur with any automated gene model pipeline.
actin-related protein T1 (ACTRT1): incorrect insertion results in a frameshift
adrenergic, beta-1-, receptor (ADRB1): incorrect sequence results in a premature stop codon
adenylate cyclase 3 (ADCY3): missing exon results in spurious sequence
aminoadipate aminotransferase (AADAT): missing exon results in spurious protein sequence
serpin peptidase inhibitor, clade B (ovalbumin), member 6 (SERPINB6): gene split between two chromosomes
RALY RNA binding protein-like (RALYL): gene split between two chromosomes
coiled-coil domain-containing protein 135 (CCDC135): gene split between two chromosomes
vacuolar protein sorting 13 homolog D (S. cerevisiae) (VPS13D): gene split between two chromosomes and failure to integrate an unlocalized contig
Src homology 2 domain containing E (SHE): gene split between two chromosomes
Bardet-Biedl syndrome 1 (BBS1): genomic fragment containing exon in wrong orientation
Chromosome 20 misannotations
The draft rhesus assembly was described in a publication which appeared in April 2007. As with all draft sequences, there were substantial gaps in the assembly as well as sequencing errors. How significant were these errors? Global statistics were provided in the paper describing this assembly. However, these types of statistics, while providing useful general information, can leave the reader with an incorrect perception of completeness at the gene level. For example, although it is reported that "98% of the available genome was represented", we estimate that approximately 50% of the gene models are incomplete or incorrect, leading to missing, incomplete or spurious proteins. This was partly due to sequencing errors. A single missing nucleotide in an exon shifts the reading frame of everything downstream and can therefore produce a seriously incorrect protein model. Although such a mistake represents a very small percentage of the total draft sequence, the consequences for investigators relying on the accuracy of the gene models can be very great.
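The effect of a single missing nucleotide can be illustrated with a short sketch. The sequence below is invented for illustration and is not taken from the rhesus assembly; the codon table is the standard genetic code, and the helper name `translate` is an assumption of this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding sequence in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Hypothetical 27-nucleotide coding fragment (not real rhesus sequence).
exon = "ATGGCTGAACGTTTAGGCAAGTGGCCA"
# The same fragment with the nucleotide at index 5 missing.
exon_with_gap = exon[:5] + exon[6:]

print(translate(exon))           # MAERLGKWP
print(translate(exon_with_gap))  # MANV
```

In this toy case one missing base out of 27 changes every residue after the second and introduces a premature stop codon, which is the kind of outcome described for the frameshift and premature-stop examples above.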
Misannotation of genes, transcripts and proteins with Gnomon and other automated pipelines has been previously reported [12, 13]. Further, a recent study using exome data found significant errors in sequence and annotation in rhesus macaques and chimpanzees. According to the documentation for Gnomon, empirical evidence is preferred when producing gene models. However, for many draft species (including the rhesus macaque), complete transcriptome information is lacking. In the absence of such information, Gnomon attempts to predict exons from genomic sequence. If this sequence is incomplete or wrong, there is no indication that Gnomon is able to detect or even flag this possibility. Indeed, we observed instances where Gnomon "invented" exons from intronic sequence when a complete exon was not present. As a result, both transcript and protein models were incorrect. We also found instances of genes incorrectly annotated as pseudogenes because Gnomon was unable to build a complete gene model due to missing or incorrect sequence information. It is important to point out that the errors made by Gnomon with the rhesus macaque genome assembly would likely be made by any automated gene model pipeline with any draft vertebrate genome. Hence, the issues identified here represent a pervasive challenge to the analysis of all draft vertebrate genomes.
In the paper describing the draft rhesus genome, 1.8 Mb of finished sequence was compared to the available ESTs to determine whether there were misassemblies. It was reported that: "No misassemblies were identified in that comparison". However, our gene-based approach indicates that it is highly likely that there were a number of significant misassemblies, including contigs assigned to the wrong chromosomes. The discovery that some of the contigs were chimeric brings into question the assembly of the contigs themselves. The total number of misassemblies will not be known until a more finished version of the rhesus macaque genome is available for comparison. However, two other studies also support our contention that there are major misassemblies in the rhesus genome [15, 16].
Draft sequences can be very helpful for some tasks. For example, we used a preliminary draft version of the rhesus macaque genome in our targeted approach to rhesus macaque microarray design. However, for other uses, draft genomes may be insufficient to fully utilize genomics approaches or, worse, may result in the generation of spurious results. Most investigators prefer to align their NextGen sequences against a reference genome. This cannot be reliably done with the rhesus draft genome (and likely not with other draft vertebrate genomes). Other methods which require a high quality reference genome include SNP discovery and the design of nucleotide-based probes (exon capture, siRNA and quantitative PCR).
Modern evolutionary studies depend on alignments of genomic sequences between species. Mismatches between species can be interpreted as indications of evolutionary change. However, given the incidence of sequencing errors and misannotation in the rhesus macaque draft genome (and likely all vertebrate draft genomes), such assumptions may lead to incorrect results. The more draft genomes included in an analysis, the smaller the subset of genes correctly annotated across all the genomes included in the analysis. This is a particular challenge to evolutionary studies which require information from multiple species.
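A rough calculation shows how quickly the usable gene set can shrink. The sketch below assumes, purely for illustration, that every genome in the analysis has the same fraction of correct gene models and that errors occur independently across genomes; neither assumption is established in this work, and errors are likely to be correlated for genes that are difficult to sequence or assemble. The value 0.5 is taken from the estimate for the rhesus macaque, and the function name is hypothetical.

```python
def fraction_correct_in_all(p_correct, n_genomes):
    """Expected fraction of genes correctly annotated in every genome.

    Assumes the same per-genome fraction of correct models and
    independent errors across genomes (simplifying assumptions).
    """
    return p_correct ** n_genomes

for n in (1, 2, 4, 8):
    print(n, f"{fraction_correct_in_all(0.5, n):.4f}")
# 1 0.5000
# 2 0.2500
# 4 0.0625
# 8 0.0039
```

Under these assumptions, a four-species comparison would retain only about 6% of genes with correct models in every genome.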
Our exon order-based approach to identifying misassemblies suggests a possible strategy for assessing the quality of genome assemblies. Since exon order is highly conserved among vertebrates, requiring that contigs containing exons respect this order should lead to improved assemblies.
To take full advantage of the rhesus macaque genome, or indeed, any vertebrate genome, standard draft coverage (5-6X Sanger sequencing) is insufficient. This is because small sequence errors or gaps can result in large errors in assembly and in annotation. Given the relative costs involved, it may make sense to bring draft genomes closer to finished quality using NextGen sequencing.
Reference sequence and assembly
The initial draft rhesus assembly Mmul_0.1 (rheMac1) was released in January 2005. Mmul_051212 (rheMac2) was released in January 2006. No further assemblies based on this animal have been released since Mmul_051212. The draft rhesus genome assembly discussed in this work refers to Mmul_051212.
NCBI produced a new annotation of Mmul_051212 in April 2010 (Macaca mulatta build 1.2) using Gnomon, a gene prediction tool for eukaryotes. The annotations discussed in this work refer to build 1.2 annotations. The NCBI database Gene was used to determine annotations for the rhesus draft genome.
To assess the types of sequencing errors which occurred in the rhesus draft genome, four suspect exons from different genes were identified by attempting to align human mRNA with the rhesus draft genome. Evidence of likely error in annotation consisted of amino acids in the rhesus model not found in other species or a truncated protein with respect to other species. PCR primers flanking these exons were designed using Primer3 [18, 19]. Default settings were used with the following exception: the human mispriming library was selected. Rhesus genomic DNA obtained from the reference animal was used as the template for amplification using standard PCR procedures. Sanger sequences were obtained and used to determine whether the draft rhesus genome sequences were correct. All sequences were deposited in GenBank [HM067826.1, JF749838.1, JN589014.1, JN624744.1]. Four cases are described in detail in the Results section (Cases 1–4).
Vertebrate exon order is highly conserved. We took advantage of this fact to identify potential misassemblies in the rhesus macaque draft genome. BLAST was used to align human mRNAs against the draft rhesus chromosome and scaffold files. Anomalies in exon order were noted. Additional evidence for misassemblies consisted of mRNA data from rhesus macaques or a closely related species, the cynomolgus macaque (Macaca fascicularis), and sequence from a BAC clone. Six cases are described in detail in the Results section (Cases 5–10).
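The exon-order check can be sketched as follows. This is a minimal illustration, not the procedure actually used in this work: it assumes the BLAST alignments have already been reduced to one best hit per human exon, and the record type `ExonHit`, the function name and the chromosome labels in the usage example are all hypothetical.

```python
from typing import NamedTuple

class ExonHit(NamedTuple):
    exon: int    # exon number in the human mRNA (1-based)
    chrom: str   # rhesus chromosome or scaffold carrying the best hit
    start: int   # start coordinate of the hit on that sequence
    strand: str  # "+" or "-"

def exon_order_anomalies(hits):
    """Return descriptions of exon-order anomalies for one gene."""
    anomalies = []
    hits = sorted(hits, key=lambda h: h.exon)
    chroms = {h.chrom for h in hits}
    if len(chroms) > 1:
        anomalies.append("exons split across " + ", ".join(sorted(chroms)))
    # Within each chromosome or scaffold, check orientation and order.
    for chrom in sorted(chroms):
        on_chrom = [h for h in hits if h.chrom == chrom]
        strands = {h.strand for h in on_chrom}
        if len(strands) > 1:
            anomalies.append(f"mixed strand orientation on {chrom}")
            continue
        starts = [h.start for h in on_chrom]
        # On the minus strand, exon order runs with decreasing coordinates.
        expected = sorted(starts, reverse=(strands == {"-"}))
        if starts != expected:
            anomalies.append(f"exon order not conserved on {chrom}")
    return anomalies

# Hypothetical hits: exon 3 lands on a different chromosome.
split_gene = [
    ExonHit(1, "chr4", 100, "+"),
    ExonHit(2, "chr4", 500, "+"),
    ExonHit(3, "chr9", 200, "+"),
]
print(exon_order_anomalies(split_gene))
```

A flagged gene is only a candidate misassembly. Genuine rearrangements between species would be flagged in the same way, which is why independent evidence such as macaque mRNA or BAC sequence is needed before concluding that the assembly is wrong.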
Estimation of frequency of misannotations
The first 100 genes of rhesus chromosome 20 were compared with the orthologous human genes using the same rules which identified the four genes targeted for Sanger sequencing. Human chromosome 16 is syntenic with rhesus chromosome 20. Both similarity and synteny information were used to identify misannotations.
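Because the estimate rests on 100 genes, it carries sampling uncertainty. The sketch below computes a Wilson score interval for a proportion; the count of 50 misannotated genes is a hypothetical value chosen to match the reported figure of approximately 50%, not a number stated in this section. The first 100 genes of one chromosome are also not a random sample of the genome, so the interval reflects sampling noise only.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n (z=1.96 for ~95%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical count: 50 of 100 genes judged misannotated.
low, high = wilson_interval(50, 100)
print(f"{low:.3f} to {high:.3f}")
```

With these illustrative numbers the interval runs from roughly 40% to 60%, so a sample of this size supports "approximately half" but not a more precise figure.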
We thank Dr. David Webb at NCBI for valuable discussions. We gratefully acknowledge the Southwest National Primate Research Center for providing the genomic DNA for the PCR experiments described in this study. Sanger sequencing of PCR products was conducted at the University of Nebraska Medical Center DNA sequencing core facility. This study was supported by a grant from the National Center for Research Resources (RR017444).
- Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, Remington KA, Strausberg RL, Venter J, Rhesus Macaque Genome Sequencing and Analysis Consortium: Evolutionary and biomedical insights from the rhesus macaque genome. Science. 2007, 316: 222-234.
- Barr CS, Newman TK, Becker ML, Parker CC, Champoux M, Lesch KP, Goldman D, Suomi SJ, Higley JD: The utility of the non-human primate; model for studying gene by environment interactions in behavioral research. Genes Brain Behav. 2003, 2: 336-340. 10.1046/j.1601-1848.2003.00051.x.
- Arthur Chang TC, Chan AW: Assisted reproductive technology in nonhuman primates. Methods Mol Biol. 2011, 770: 337-363. 10.1007/978-1-61779-210-6_13.
- Messaoudi I, Estep R, Robinson B, Wong SW: Nonhuman primate models of human immunology. Antioxid Redox Signal. 2011, 14: 261-273. 10.1089/ars.2010.3241.
- Niu Y, Yu Y, Bernat A, Yang S, He X, Guo X, Chen D, Chen Y, Ji S, Si W, Lv Y, Tan T, Wei Q, Wang H, Shi L, Guan J, Zhu X, Afanassieff M, Savatier P, Zhang K, Zhou Q, Ji W: Transgenic rhesus monkeys produced by gene transfer into early-cleavage-stage embryos using a simian immunodeficiency virus-based vector. Proc Natl Acad Sci USA. 2010, 107: 17663-17667. 10.1073/pnas.1006563107.
- Shedlock DJ, Silvestri G, Weiner DB: Monkeying around with HIV vaccines: using rhesus macaques to define 'gatekeepers' for clinical trials. Nat Rev Immunol. 2009, 9: 717-728. 10.1038/nri2636.
- Tachibana M, Sparman M, Sritanaudomchai H, Ma H, Clepper L, Woodward J, Li Y, Ramsey C, Kolotushkina O, Mitalipov S: Mitochondrial gene replacement in primate offspring and embryonic stem cells. Nature. 2009, 461: 367-372. 10.1038/nature08368.
- Yang SH, Cheng PH, Banta H, Piotrowska-Nitsche K, Yang JJ, Cheng EC, Snyder B, Larkin K, Liu J, Orkin J, Fang ZH, Smith Y, Bachevalier J, Zola SM, Li SH, Li XJ, Chan AW: Towards a transgenic model of Huntington's disease in a non-human primate. Nature. 2008, 453: 921-924. 10.1038/nature06975.
- Souvorov A, Kapustin Y, Kiryutin B, Chetvernin V, Tatusova T, Lipman D: Gnomon – NCBI eukaryotic gene prediction tool. 2010, http://www.ncbi.nlm.nih.gov/RefSeq/Gnomon-description.pdf
- Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L: Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics. 2008, 9: 353. 10.1186/1471-2105-9-353.
- Vallender EJ: Bioinformatic approaches to identifying orthologs and assessing evolutionary relationships. Methods. 2009, 49: 50-55. 10.1016/j.ymeth.2009.05.010.
- Vallender EJ: Expanding whole exome resequencing into non-human primates. Genome Biol. 2011, 12: R87. 10.1186/gb-2011-12-9-r87.
- Karere GM, Froenicke L, Millon L, Womack JE, Lyons LA: A high-resolution radiation hybrid map of rhesus macaque chromosome 5 identifies rearrangements in the genome assembly. Genomics. 2008, 92: 210-218. 10.1016/j.ygeno.2008.05.013.
- Roberto R, Misceo D, D'Addabbo P, Archidiacono N, Rocchi M: Refinement of macaque synteny arrangement with respect to the official rheMac2 macaque sequence assembly. Chromosome Res. 2008, 16: 977-985. 10.1007/s10577-008-1255-1.
- Duan F, Spindel ER, Li YH, Norgren RB: Intercenter reliability and validity of the rhesus macaque GeneChip. BMC Genomics. 2007, 8: 61. 10.1186/1471-2164-8-61.
- Rozen S, Skaletsky HJ: Primer3 on the WWW for general users and for biologist programmers. Bioinformatics Methods and Protocols: Methods in Molecular Biology. Edited by: Krawetz S, Misener S. 2000, Totowa, NJ: Humana Press, 365-386.
- Spindel ER, Pauley MA, Jia Y, Gravett C, Thompson SL, Boyle NF, Ojeda SR, Norgren RB: Leveraging human genomic information to identify nonhuman primate sequences for expression array development. BMC Genomics. 2005, 6: 160. 10.1186/1471-2164-6-160.