Annotating the human genome with Disease Ontology
© Osborne et al; licensee BioMed Central Ltd. 2009
Published: 7 July 2009
The human genome has been extensively annotated with Gene Ontology for biological functions, but minimally computationally annotated for diseases.
We used the Unified Medical Language System (UMLS) MetaMap Transfer tool (MMTx) to discover gene-disease relationships from the GeneRIF database. We utilized a comprehensive subset of UMLS, which is disease-focused and structured as a directed acyclic graph (the Disease Ontology), to filter and interpret results from MMTx. The results were validated against the Homayouni gene collection using recall and precision measurements. We compared our results with the widely used Online Mendelian Inheritance in Man (OMIM) annotations.
The validation data set suggests a 91% recall rate and 97% precision rate of disease annotation using GeneRIF, in contrast with a 22% recall and 98% precision using OMIM. Our thesaurus-based approach allows comparisons to be made between disease-containing databases and increases accuracy in disease identification through synonym matching. The much higher recall rate of our approach demonstrates that annotating the human genome with Disease Ontology and GeneRIF dramatically increases the coverage of disease annotation of the human genome.
High-throughput genomics technologies generate a vast amount of data. Determining the biologically and clinically significant findings of an experiment can be a daunting task. Applying functional knowledge to genomic data is one method that has been used to reduce data complexity and establish biologically plausible arguments. These methods rely on a priori definition of gene sets, and the results necessarily depend on the strength of the annotations [1, 2]. Genome-wide annotation of gene function has garnered much attention, and the comprehensive Gene Ontology (GO) Consortium annotations are widely used. Few ontology-based tools are available for annotating genome-wide data with disease associations. The lack of ontology-based disease annotation prevents the application of disease knowledge to genomic data, thereby hindering the discovery of gene-disease associations from high-throughput genomics technologies.
Online Mendelian Inheritance in Man (OMIM), curated by the NCBI and Johns Hopkins University, is arguably the most widely used disease gene annotation database. Although the curation process provides highly detailed annotation and minimizes errors, there is a noticeable delay in updating. Furthermore, the vocabulary of OMIM is predominantly text-based, far from comprehensive, and difficult to use [4–6]. It is simply not possible to download a list of diseases from OMIM, and users have resorted to mining the Clinical Synopsis free-text section of OMIM for disease discovery. OMIM also focuses on genetic diseases with classic Mendelian inheritance, thus excluding the wide range of diseases resulting from more complicated environmental and genetic interactions.
Another source of gene-disease mappings from linkage studies is the "Genetic Association Database" (GAD), which aims to "collect, standardize and archive genetic association study data". The structure of the classification system that GAD uses for its diseases is not apparent. Diseases are classified into 10 broad classes including an "other" class (some remain unclassified), an unknown number of broad phenotype classes below those, and a further number of narrow phenotype classes. This lack of an apparent ontology makes it hard to determine the number and types of diseases GAD contains. For example, searching for "Crohn's disease" returns 25 results, but searching for the synonym "regional enteritis" returns no results.
Researchers have also used abstracts and titles from MEDLINE as a data source for inferring gene-disease associations [8, 9]. Although MEDLINE is a current and rich source of information, the free-text form of its abstracts makes it difficult to determine the context of the association between gene and disease. This is particularly true when genes are identified by semantically ambiguous gene symbols, which may or may not apply to a disease recognized in free text. For instance, CAT can refer to the catalase gene or a feline animal, depending on the context.
A GeneRIF (Gene Reference Into Function) is a brief (up to 255 characters) annotation to a gene in the NCBI database and contains gene-specific information including disease associations. These entries are modifiable by NCBI users willing to provide their email address. Such a wiki-type resource offers low mapping error of gene symbols and allows rapid updates by the research community. Despite this utility, GeneRIF has been infrequently considered as a data source for text mining, as evidenced by the fact that only six papers indexed in PubMed contain the term "GeneRIF". One of these describes a data mining tool called MILANO, which counts co-occurrences of each GeneRIF-annotated gene with user-defined terms selected from Medical Subject Headings (MeSH), including some disease terms. The authors found GeneRIF superior to MEDLINE, PubMatrix, BEAR GeneInfo, and MicroGenie for identifying p53-affected genes. A more recent approach using conditional random fields to map a test set of GeneRIFs to MeSH terms further validates GeneRIFs as a comprehensive data source for human genome annotation.
To provide a comprehensive disease-to-gene annotation, we used the Disease Ontology (DO) to identify relevant diseases in GeneRIFs. The Disease Ontology is a manually inspected subset of the Unified Medical Language System (UMLS) and includes concepts from outside the UMLS disease/disorder semantic network, including various cancers, congenital abnormalities, deformities and mental disorders. While many researchers have mapped diseases to MeSH terms [8, 15–17] or OMIM [5, 7], the Disease Ontology is larger and should therefore provide greater disease coverage. The hierarchical structure also allows more general disease terms to be distinguished from subclasses, in order to account for "over-mapping" of disease terms to a textually larger database. We used a thesaurus-based approach (MetaMap Transfer tool, MMTx) for analyzing GeneRIFs; this tool has demonstrated success in studying clinically relevant terms.
Mapping genes to Disease Ontology
Table: First ten diseases ordered by the number of gene annotations (columns: Disease, Number of Genes; only the disease names below are recoverable)
- Malignant neoplasm of breast
- Malignant neoplasm of prostate
- Diabetes Mellitus, Non-Insulin-Dependent
- Primary carcinoma of the liver cells
- Carcinoma of the Large Intestine
Table: First ten genes ordered by the number of disease annotations (columns: Gene, Number of Diseases; only the gene names below are recoverable)
- matrix metallopeptidase 9
- epidermal growth factor receptor
- cyclin-dependent kinase inhibitor
- matrix metallopeptidase 2
- B-cell CLL/lymphoma 2
- interleukin 1, beta
Recall and precision are closely related to the false negative and false positive rates that are used in other fields. A recall rate of 100% together with a precision rate of 100% represents the ideal situation. For disease annotation, we manually constructed a truth table of the Homayouni gene collection using GeneRIF and OMIM text as a source.
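As a minimal sketch, recall and precision can be computed from a set of predicted gene-disease pairs and a manually curated truth set. This assumes the standard definitions, recall = TP / (TP + FN) and precision = TP / (TP + FP); the function names and the example identifiers are hypothetical and are not taken from the authors' software.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (gene identifier, disease identifier), both hypothetical


def recall(tp: int, fn: int) -> float:
    """Fraction of true gene-disease pairs that were retrieved."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Fraction of retrieved gene-disease pairs that are true."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def score(predicted: Set[Pair], truth: Set[Pair]) -> Tuple[float, float]:
    """Return (recall, precision) of predicted pairs against a truth set."""
    tp = len(predicted & truth)   # pairs found and correct
    fp = len(predicted - truth)   # pairs found but not in the truth set
    fn = len(truth - predicted)   # true pairs that were missed
    return recall(tp, fn), precision(tp, fp)
```

For example, with three predicted pairs of which two appear in a three-pair truth set, `score` returns a recall of 2/3 and a precision of 2/3.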
Estimation of recall and precision of disease annotation
A network visualization based on the DO Annotation of the human genome
The Disease Ontology consists of a manually inspected subset of UMLS and terms outside the UMLS disease and disorder semantic network, including various cancers, congenital abnormalities, deformities and mental disorders that are important to researchers trying to understand the genetic and molecular basis of a particular disease. Therefore, compared to UMLS, the Disease Ontology is much larger in size and more specific to the disease of interest. It therefore offers greater disease coverage with improved accuracy. In addition, its hierarchical structure allows a more specific disease term to be binned to a more general disease term at different levels. This is especially useful for Disease Ontology enrichment analysis, analogous to Gene Ontology enrichment analysis, in experiments applying high-throughput technologies.
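The binning of specific disease terms into more general ones can be sketched as a traversal of the directed acyclic graph. The toy hierarchy below (including the parent terms "Cancer" and "Disease") and the function names are hypothetical illustrations, not actual Disease Ontology content or the authors' code.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical toy fragment of a disease hierarchy: child term -> parent terms.
PARENTS: Dict[str, List[str]] = {
    "Malignant neoplasm of breast": ["Cancer"],
    "Malignant neoplasm of prostate": ["Cancer"],
    "Cancer": ["Disease"],
    "Disease": [],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every more general term reachable from `term` in the DAG."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def roll_up(gene_to_terms: Dict[str, Set[str]],
            parents: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Propagate each gene annotation to the annotated term and all its ancestors."""
    term_to_genes: Dict[str, Set[str]] = defaultdict(set)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            for t in {term} | ancestors(term, parents):
                term_to_genes[t].add(gene)
    return dict(term_to_genes)
```

With this roll-up, a gene annotated only to "Malignant neoplasm of breast" is also counted under the more general parent term, which is the kind of aggregation an enrichment analysis at different levels of the hierarchy would need.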
Our results indicate that GeneRIFs are an excellent data source for discovering disease-gene relationships. This is primarily due to the large number of GeneRIFs relative to OMIM entries, and the surprisingly high (14.9%) frequency of disease references. The disease coverage of OMIM would be improved if its free text were mined, but only 235 genes in OMIM have a clinical synopsis section, and limiting our analysis to these entries would bias our results. Using the clinical synopsis section in addition to other OMIM free text would increase the number of false positives, since OMIM free text frequently includes diseases without a direct relationship to the gene, usually for comparative purposes or in reference to experiments in model organisms.
Errors in our method may arise from a variety of sources including problems with MMTx, many of which have already been elucidated. The problem of disease terms being present in OMIM or GeneRIF but missing in DO or UMLS was infrequent, but did include some cases such as Craniofacial-deafness-hand syndrome. A more significant problem, which contributed the majority of false positives, was the discovery of disease terms in GeneRIF that indicated only a partial, ambiguous or no association to the gene in question. Fortunately, the succinctness of GeneRIF means that this occurs less frequently than in abstracts (data not shown), which may contain diseases not directly related to the gene. We found only one incorrectly assigned GeneRIF in the 1746 GeneRIFs examined, indicating that this is a minor source of error.
One result from our analysis is that OMIM performs poorly relative to GeneRIF with newly discovered mappings. A fairly typical case is the alpha-2-macroglobulin gene. While OMIM includes mappings for Alzheimer's disease and pulmonary emphysema (missed by GeneRIF), it excludes potential links to benign prostatic hyperplasia, multiple sclerosis and argyrophilic grain disease. This may be a result of OMIM's stronger requirement for evidence, but failure to keep pace with current research may also contribute.
We annotated the human genome with Disease Ontology and reported its performance. Such an annotation will enable many graphical and statistical applications similar to what has previously been done with Gene Ontology annotations. An example is presented in Figure 4, where examining all genes in the human genome with established links to four cancers led to the identification of eleven genes common to all four cancers. This kind of analysis facilitates identification of relevant targets or markers for diseases with common etiology or pathology, and has implications for biological plausibility as well as therapeutic potential. Although previous studies have demonstrated the utility of this approach, the improved coverage and accuracy of our analysis provide even greater potential. Our future plans include developing a web-enabled database application of the Disease Ontology for the research community (http://projects.bioinformatics.northwestern.edu/fundo).
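Finding the genes shared by several diseases reduces to a set intersection over the gene-disease annotation. The sketch below assumes the annotation has already been loaded into a dictionary mapping disease names to gene symbols; the function name and the data in the usage note are hypothetical and do not reproduce the results of Figure 4.

```python
from typing import Dict, Iterable, Set


def genes_in_common(disease_to_genes: Dict[str, Set[str]],
                    diseases: Iterable[str]) -> Set[str]:
    """Return the genes annotated to every one of the given diseases."""
    gene_sets = [set(disease_to_genes.get(d, ())) for d in diseases]
    if not gene_sets:
        return set()
    # A disease with no annotations yields an empty set, so the result is empty.
    return set.intersection(*gene_sets)
```

For instance, if a hypothetical annotation links `GENE_A` and `GENE_B` to one cancer and `GENE_B` and `GENE_C` to another, `genes_in_common` returns only `GENE_B` for that pair of diseases.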
Similar to the GO annotation, we provide a DO annotation of the human genome; each annotation is supported by a peer-reviewed publication, as required by GeneRIF. It enables researchers to study gene-disease relationships computationally. The DO annotation of the human genome is available in both tab-delimited format and relational database format (http://projects.bioinformatics.northwestern.edu/do_rif/), which allows it to be easily adapted for other applications.
MMTx is a natural language processing engine that identifies concepts from free text using a lexicon. Briefly, a part-of-speech tagger labels the noun phrases from the lexical elements created after parsing and tokenization (Figure 1A). These noun phrases and their variants are used to search the UMLS Metathesaurus and outside disease terms to find matching candidates, each of which is given a score; final mappings are then generated that best cover the input noun phrase. The Disease Ontology consists of a manually inspected subset of UMLS and terms outside the UMLS disease and disorder semantic network, including various cancers, congenital abnormalities, deformities and mental disorders.
Our in-house software parses the GeneRIF and OMIM data and uses the MMTx API to generate final mappings between genes and diseases. The strict data model of the Unified Medical Language System (UMLS) distribution 2005ac was searched using the default settings of MMTx with an empirically derived score cutoff value of 700. Results were further filtered using the DO version 3.0 (RC9) to eliminate non-disease biological relationships. In addition, a simple heuristic approach was used to eliminate both non-informative mappings to DO (such as "Disease" or "Syndrome") and text present in GeneRIF that was frequently mis-mapped by UMLS, such as Ca++ ions being mapped to cancer terms. The program calling the MMTx API and generating the gene-disease mapping is written in Java and is available upon request.
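The filtering steps described above can be sketched as follows. This is an illustration in Python rather than the authors' Java program, and it does not call MMTx: it assumes the candidate mappings have already been produced, that the cutoff is inclusive (scores of 700 and above are kept), and that mis-mapped text is detected by simple substring matching. The `Candidate` fields, concept identifiers and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

SCORE_CUTOFF = 700                          # cutoff value stated in the text
NON_INFORMATIVE = {"disease", "syndrome"}   # examples given in the text
BLOCKED_TEXT = ("Ca++",)                    # example of frequently mis-mapped text


@dataclass(frozen=True)
class Candidate:
    gene_id: str        # gene the GeneRIF belongs to
    concept_id: str     # hypothetical concept identifier
    concept_name: str   # preferred name of the mapped concept
    score: int          # mapping score assigned by the mapper
    matched_text: str   # phrase in the GeneRIF that produced the mapping


def keep(c: Candidate, do_concepts: Set[str]) -> bool:
    """Apply the score cutoff, the ontology filter and the heuristics."""
    if c.score < SCORE_CUTOFF:                        # assumed inclusive cutoff
        return False
    if c.concept_id not in do_concepts:               # not a disease concept
        return False
    if c.concept_name.strip().lower() in NON_INFORMATIVE:
        return False
    if any(b in c.matched_text for b in BLOCKED_TEXT):
        return False
    return True


def filter_mappings(candidates: Iterable[Candidate],
                    do_concepts: Set[str]) -> List[Tuple[str, str]]:
    """Return the sorted, de-duplicated (gene_id, concept_id) pairs that pass."""
    return sorted({(c.gene_id, c.concept_id)
                   for c in candidates if keep(c, do_concepts)})
```

Keeping each filter as a separate early return makes it straightforward to count how many candidate mappings each step removes, which is useful when tuning an empirically derived cutoff.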
GeneRIF and OMIM data
The October 10th, 2008 releases of both OMIM and GeneRIF were used. For OMIM, only validated data in the formatted "morbidmap" file (including disease susceptibilities) was used. This is because there are currently only 235 records in OMIM that contain a clinical synopsis of the disease; additional disease information is scattered through other sections of the OMIM record, making it hard to determine whether a disease is mentioned in reference to an animal model or for other comparative purposes.
Scoring and validation
To evaluate our annotation methodology, and to compare our GeneRIF results with the traditional OMIM resource in detail, we utilized a well-characterized fifty-gene collection that Homayouni et al. used to evaluate semantic indexing of gene functions. This gene collection includes genes in the reelin signaling pathway of Alzheimer's disease and other genes important in cancer biology and development. We refer to it as the Homayouni gene collection from here on. The 5 genes with more than 50 diseases mapped to them (APOE, EGFR, ERBB2, TGFB1 and TP53) were excluded from the test set due to the large number of GeneRIFs requiring manual inspection. This evaluation was done on February 9th, 2006.
Assessing the false positive and false negative error rates for this collection was difficult, so several domain experts scored the results, and all results were reviewed by MID (internal medicine physician), who made the final error determination. To determine gene-disease relationships, a false positive was scored only when the disease was identified incorrectly. No effort was made here to assess the appropriateness of the GeneRIF because of the subjective nature of such a process. However, for Table 3, estimates were used for calculating precision and recall rates, whereby the overall false positive value was corrected to account for false positives arising when a correctly identified disease did not have a relationship to its associated gene as specified in the GeneRIF.
The authors thank Dr. Pan Du for discussing related data mining applications using GeneRIF; Drs. Quan Chen, Bangjun He, and Xin Zheng for integrating the DO annotations in the MAP-disease web application; and Dong Fu for system administration of the website.
This article has been published as part of BMC Genomics Volume 10 Supplement 1, 2009: The 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S1.
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102 (43): 15545-15550. 10.1073/pnas.0506580102.
- Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003, 4 (5): P3. 10.1186/gb-2003-4-5-p3.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
- Becker KG, Barnes KC, Bright TJ, Wang SA: The genetic association database. Nat Genet. 2004, 36 (5): 431-432. 10.1038/ng0504-431.
- Masseroli M, Galati O, Manzotti M, Gibert K, Pinciroli F: Inherited disorder phenotypes: controlled annotation and statistical analysis for knowledge mining from gene lists. BMC Bioinformatics. 2005, 6 (Suppl 4): S18. 10.1186/1471-2105-6-S4-S18.
- Smith CL, Goldsmith CA, Eppig JT: The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biol. 2005, 6 (1): R7. 10.1186/gb-2004-6-1-r7.
- Masseroli M, Galati O, Pinciroli F: GFINDer: genetic disease and phenotype location statistical analysis and mining of dynamically annotated gene lists. Nucleic Acids Res. 2005, 33 (Web Server): W717-723. 10.1093/nar/gki454.
- Hristovski D, Peterlin B, Mitchell JA, Humphrey SM: Using literature-based discovery to identify disease candidate genes. Int J Med Inform. 2005, 74 (2–4): 289-298. 10.1016/j.ijmedinf.2004.04.024.
- Hu Y, Hines LM, Weng H, Zuo D, Rivera M, Richardson A, LaBaer J: Analysis of genomic and proteomic data using advanced literature mining. J Proteome Res. 2003, 2 (4): 405-412. 10.1021/pr0340227.
- Shatkay H, Feldman R: Mining the biomedical literature in the genomic era: an overview. J Comput Biol. 2003, 10 (6): 821-855. 10.1089/106652703322756104.
- Osborne JD, Lin S, Kibbe WA: Other riffs on cooperation are already showing how well a wiki could work. Nature. 2007, 446 (7138): 856. 10.1038/446856a.
- Rubinstein R, Simon I: MILANO – custom annotation of microarray results using automatic literature searches. BMC Bioinformatics. 2005, 6 (1): 12. 10.1186/1471-2105-6-12.
- Bundschus M, Dejori M, Stetter M, Tresp V, Kriegel HP: Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics. 2008, 9: 207. 10.1186/1471-2105-9-207.
- Warren A, Kibbe JDO, Wolf Wendy, Smith Maureen, Zhu Lilhua, Lin Simon, Chisholm Rex: Disease Ontology. 2006.
- Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001, 28 (1): 21-28. 10.1038/88213.
- Perez-Iratxeta C, Wjst M, Bork P, Andrade MA: G2D: a tool for mining genes associated with disease. BMC Genet. 2005, 6: 45. 10.1186/1471-2156-6-45.
- Srinivasan P, Libbus B: Mining MEDLINE for implicit links between dietary substances and diseases. Bioinformatics. 2004, 20 (Suppl 1): i290-296. 10.1093/bioinformatics/bth914.
- Meystre SM, Haug PJ: Comparing natural language processing tools to extract medical problems from narrative text. AMIA Annu Symp Proc. 2005, 525-529.
- Barabasi AL, Albert R: Emergence of scaling in random networks. Science. 1999, 286 (5439): 509-512. 10.1126/science.286.5439.509.
- Homayouni R, Heinrich K, Wei L, Berry MW: Gene clustering by latent semantic indexing of MEDLINE abstracts. Bioinformatics. 2005, 21 (1): 104-115. 10.1093/bioinformatics/bth464.
- Divita G, Tse T, Roth L: Failure analysis of MetaMap Transfer (MMTx). Stud Health Technol Inform. 2004, 107 (Pt 2): 763-767.
- Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL: The human disease network. Proc Natl Acad Sci USA. 2007, 104 (21): 8685-8690. 10.1073/pnas.0701361104.
- Aronson AR: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp. 2001, 17-21.
- Zeeberg BR, Riss J, Kane DW, Bussey KJ, Uchio E, Linehan WM, Barrett JC, Weinstein JN: Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004, 5: 80. 10.1186/1471-2105-5-80.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.