Mining the Gene Wiki for functional genomic knowledge
© Good et al; licensee BioMed Central Ltd. 2011
Received: 6 July 2011
Accepted: 13 December 2011
Published: 13 December 2011
Ontology-based gene annotations are important tools for organizing and analyzing genome-scale biological data. Collecting these annotations is a valuable but costly endeavor. The Gene Wiki makes use of Wikipedia as a low-cost, mass-collaborative platform for assembling text-based gene annotations. The Gene Wiki comprises more than 10,000 review articles, each describing one human gene. The goal of this study is to define and assess a computational strategy for translating the text of Gene Wiki articles into ontology-based gene annotations. We specifically explore the generation of structured annotations using the Gene Ontology and the Human Disease Ontology.
Our system produced 2,983 candidate gene annotations using the Disease Ontology and 11,022 candidate annotations using the Gene Ontology from the text of the Gene Wiki. Based on manual evaluations and comparisons to reference annotation sets, we estimate a precision of 90-93% for the Disease Ontology annotations and 48-64% for the Gene Ontology annotations. We further demonstrate that this data set can systematically improve the results from gene set enrichment analyses.
The Gene Wiki is a rapidly growing corpus of text focused on human gene function. Here, we demonstrate that the Gene Wiki can be a powerful resource for generating ontology-based gene annotations. These annotations can be used immediately to improve workflows for building curated gene annotation databases and knowledge-based statistical analyses.
In 2011, PubMed surpassed 21 million total articles in its index and is growing at a rate approaching 1 million new articles per year. Tools that make effective use of this knowledge base are increasingly vital to biomedical research. Search interfaces like PubMed and Google Scholar help to find individual documents, yet no single document contains all of the knowledge about a particular biological concept. The knowledge is distributed throughout the text of many different articles with each new contribution simply adding to the stack. In considering the task of comprehensively capturing and utilizing society's biological knowledge, it is clear that document-centered approaches are insufficient.
Recognizing this problem and its clear pertinence to genome-scale biology, the research community began defining ontologies to capture and structure functional genomic knowledge even before the first human genome was fully sequenced [2–4]. Ontologies, such as the Gene Ontology (GO), provide a mechanism to efficiently bring together individual atomic facts (e.g. 'the product of the ABCA3 gene localizes to the plasma membrane') that may be scattered across many different texts. Such integration has enabled the production of new tools for interacting with and computing with massive bodies of knowledge. In particular, ontology-based computational analysis now plays a crucial role in the interpretation of the results of high-throughput, genomic studies [5, 6].
While structuring knowledge using ontologies has proven highly beneficial, it presents some substantial challenges. The task of manually representing knowledge with an ontology is difficult, time-consuming and generally not rewarded by the scientific community. Current paradigms drive scientists to publish their findings as text in traditional journals that must subsequently be sifted through by separate teams of database curators who identify and extract new ontology-based facts. This process results in a significant bottleneck that is costly both in terms of the resources required and the likelihood that the system is not capturing all of the knowledge that is produced. In addition, there remains a secondary challenge of presenting this knowledge to biologists of all levels in a manner that they can rapidly understand.
Faced with these challenges in manual biocuration, database curators have begun to investigate wikis as a potential solution [8, 9]. Wikis represent a third approach to knowledge management that straddles the line between document-centric traditional publishing and ontology-driven knowledge base development. In contrast to typical literature collections, wikis, like ontologies, are concept-centric. When new facts are added, they are placed directly into the context of existing articles. For each gene, for example, a single wiki article can summarize knowledge spread over a large and growing corpus of traditional publications related to that gene. This concept-centricity renders wikis an excellent medium to capture rapidly evolving biological knowledge.
The other distinguishing attribute of wikis is their potential to make use of community intelligence on a massive scale. Wikipedia is well known for harnessing the intellects of literally millions of people to assemble the world's largest encyclopedia. As a result of both their concept-centric structure and their enticing potential to facilitate mass collaboration, wikis have emerged in a variety of areas of biology. We have wikis that capture information about genes [10–12], proteins, protein structures [14, 15], SNPs, pathways, specific organisms and many other biological entities.
In some cases, wikis are already successfully tapping into the community's collective intellect to produce useful biological knowledge repositories. One prominent example is the Gene Wiki [10, 11]. The Gene Wiki is a growing collection of Wikipedia articles, each of which is focused specifically on a human gene. As of January 4, 2011, it contained articles on 10,271 human genes. These articles collectively amount to over 75 megabytes of text and more than 1.3 million words. In addition, they contain direct citations to more than 35,000 distinct articles in PubMed. To emphasize the collaborative scale of the Gene Wiki project, in 2010, these articles were edited by more than 3,500 distinct editors and were viewed more than 55 million times.
The Gene Wiki successfully harnesses community intelligence and escapes the "infinite pile" of document-centric approaches by maintaining a single, dynamic article for each gene. However, most of the captured knowledge is unstructured text and therefore it does not provide the structured gene annotations needed to effectively compute with the knowledge it contains. Wikipedia and the Gene Wiki are simply not designed to capture ontology-based facts. Hence, while the wiki-model can successfully summarize the collective knowledge of the community, the challenge of fully structuring the information remains.
Computational tools for finding ontology terms in text, such as the National Center for Biomedical Ontology's (NCBO) Annotator and the National Library of Medicine's MetaMap, can help to address the challenge of structuring information presented in natural language. In this article we describe an approach for mining ontology-based annotations for human genes from the text of the Gene Wiki. Specifically, we used the NCBO Annotator to identify structured gene annotations based on the Disease Ontology (DO) and the Gene Ontology (GO). We evaluated the predicted annotations through comparison to known annotations and manual expert review. In addition, we assessed the impact of these predicted annotations on gene set enrichment analyses.
Overall, this workflow resulted in 2,983 candidate DO annotations (Additional File 1) and 11,022 candidate GO annotations (Additional File 2) from the collaboratively authored text of 10,271 Gene Wiki articles. We next characterized these candidate annotations through comprehensive comparison to pre-existing reference gene annotation databases and through manual inspection by experts in GO and DO annotation.
Each candidate annotation was compared to reference annotations for the relevant gene. Matches to the reference could either be exact matches, matches to a more specific term in the same lineage as a reference annotation (with respect to the ontology's hierarchy) or matches to a more general term. For example, our system suggested that the Thrombin gene be annotated with 'hemostasis' (GO:0007599); since Thrombin was already annotated with 'blood coagulation' (GO:0007596) which is a narrower term than 'hemostasis' and in the same lineage, the predicted annotation was classified as a match to a more general term. (Additional details regarding the processing of the ontology hierarchies are found in the Methods section.)
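The matching scheme described above can be illustrated with a small sketch. It is not the authors' implementation: the parent map holds only the two GO terms from the Thrombin example, the function names are hypothetical, and the order in which the categories are checked is an assumption.

```python
# Toy "child -> parents" map. Only the relationship from the Thrombin example is
# included: 'blood coagulation' (GO:0007596) is narrower than 'hemostasis' (GO:0007599).
PARENTS = {
    "GO:0007596": {"GO:0007599"},
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def classify(candidate, reference_terms, parents=PARENTS):
    """Label a candidate term relative to a gene's reference annotations."""
    reference_terms = set(reference_terms)
    if candidate in reference_terms:
        return "exact"
    # The candidate is an ancestor of a reference term: a more general match.
    if any(candidate in ancestors(ref, parents) for ref in reference_terms):
        return "more general"
    # A reference term is an ancestor of the candidate: a more specific match.
    if ancestors(candidate, parents) & reference_terms:
        return "more specific"
    return "none"

# Thrombin already carries 'blood coagulation'; the candidate is 'hemostasis'.
thrombin_label = classify("GO:0007599", {"GO:0007596"})
```

Here `thrombin_label` is "more general", in line with the classification given in the text for this example.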
We compared the candidate DO annotations to a pre-existing gene-disease annotation database mined from NCBI's GeneRIFs. While Online Mendelian Inheritance in Man (OMIM) is probably the most widely recognized gene-disease database, we chose the GeneRIF-backed database for comparison because a) it used the DO for annotations thus enabling direct comparison (OMIM does not use any structured vocabulary for its disease annotations), b) at the time it was created, it contained the majority of the gene-disease associations in OMIM and significantly extended this set, and c) it reported a very high precision rate (96.6%), comparable to a manually curated resource. The downside of using this database for comparison was that it had not been updated since 2008 and hence it was undoubtedly missing more recent information.
We next compared the candidate annotations to reference annotations from the GO annotation database (GOA). The GOA database is the accepted standard public reference for GO-based annotation of human gene products. When this analysis was conducted, it provided annotations for 17,940 distinct human genes. Of the 11,022 mined GO annotations, 1,853 (17%) matched an annotation in the GOA database exactly, 218 (2%) matched more specific terms than GOA annotations, 2,850 (26%) matched more general terms, and 6,101 (55%) did not match any.
The GO is divided into three distinct topical branches: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). In all, 54% of the 11,022 candidate annotations used BP terms, 14% used MF, and the remaining 32% used CC. In the remainder of this article, we focus on the 5,978 candidate BP annotations because they were the most plentiful in the output and because they are generally the most difficult annotations to determine automatically using other methods.
For candidate annotations with no matches in the reference databases, we extracted a random sample and submitted it to expert curators to manually evaluate the quality of the predictions. For the DO evaluations, the sample contained 213 candidate annotations or approximately 10% of the 2,133 candidate annotations that either had no match in the reference set or matched a child of a reference annotation. For the GO we originally selected 200 novel candidate annotations for the evaluation but later had to remove 4 from consideration after discovering that, due to an error in processing, these were actually parents of reference annotations. The sample sizes were the largest that could be processed in a reasonable amount of time based on the curator resources at our disposal. For the evaluation, the curators assigned each candidate annotation in the sample to one of eight categories as follows.
Category 1: Yes, this would lead to a new annotation
1A: perfect match - the candidate annotation is exactly as it would be from a curator (e.g., Titin → Scleroderma)
1B: not specific enough - the candidate annotation is correct but a more specific term should be used instead (e.g., Titin → Autoimmune disease)
1C: too specific - the candidate annotation is close to correct, but is too specific given the evidence at hand (e.g., Titin → Pulmonary Systemic Sclerosis)
Category 2: Maybe, but insufficient evidence:
2A: evaluator could not find enough supporting evidence in the literature after about 10 minutes of looking (e.g., DUSP7 → cellular proliferation; there is literature indicating that DUSP7 is a phosphatase that dephosphorylates MAPK, and hence may play a role in regulating cell proliferation stimulated through MAPK. Although no direct evidence supporting this contention for Human DUSP7 was found, it seems plausible.)
2B: there is disagreement in the literature about the truth of this annotation
Category 3: No, this candidate annotation is incorrect:
3A: incorrect concept recognition (e.g., "Olfactory receptors share a 7-transmembrane domain structure with many neurotransmitter and hormone receptors and are responsible for the recognition and G protein-mediated transduction of odorant signals." The system incorrectly identifies 'transduction' (GO:0009293) which is defined as the transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector - a completely different concept from signal transduction as intended in the sentence.)
3B: incorrect sentence context - the sentence is a negation or otherwise does not support the predicted annotation for the given gene (e.g., "The protein is composed of ~300 amino acid residues and has ~30 carbohydrate residues attached including 10 sialic acid residues, which are attached to the protein during posttranslational modification in the Golgi apparatus." Such sentences may lead to incorrect candidate annotations of 'Golgi apparatus' and 'Posttranslational modification'.)
3C: this sentence seems factually false (e.g., a hypothetical example: "Insulin injections have been shown to cure Parkinson's disease and lead to the growth of additional toes".)
To assess the quality of the candidate DO annotations with no match in the reference set discussed above, we reviewed the 213 randomly selected novel candidate annotations manually according to the criteria outlined above (Additional File 3). The reviewers (authors SML and WAK) were experts in DO-based gene annotation and are active participants in the development of the DO.
It is worth noting that not one of the errors detected in either the DO or the GO annotations was classified as 3C (sentence is factually false). This provides evidence that the text of the Gene Wiki (at least the approximately 400 sentences examined manually here) is consistently correct.
In equations 1 and 2, exact, more general, child, and none are the numbers of candidate annotations with each kind of agreement with a reference set; e1a, e1b, e1c and e2a are the numbers of evaluated candidates assigned to categories 1A, 1B, 1C and 2A; and eAll is the total number of novel candidates evaluated.
Using equations 1 and 2, we estimated a range for precision of 90-93% for the DO annotations and 48-64% for the GO annotations. In retrospect, it may be more appropriate to remove the category 1C ('too specific') annotations from the 'biologically valid' grouping here, but, since there were no occurrences of this category in either the DO or the GO evaluations, this change would not impact the results of the present analysis.
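The equations themselves are not reproduced in this text. The sketch below shows one reading that is consistent with the symbol definitions above; it is an assumption, not the published formula. It treats exact and more general matches as correct, scales the novel candidates (child and none) by the fraction judged valid in the manual sample, and counts category 2A only in the upper bound. The function name and the numbers in the example are purely illustrative.

```python
def precision_range(exact, more_general, child, none, e1a, e1b, e1c, e2a, e_all):
    """Return (lower, upper) precision estimates under the assumptions stated above."""
    total = exact + more_general + child + none
    novel = child + none
    # Fraction of the manually reviewed novel candidates judged valid (lower bound)
    # and valid-or-plausible, i.e. including category 2A (upper bound).
    valid_low = (e1a + e1b + e1c) / e_all
    valid_high = (e1a + e1b + e1c + e2a) / e_all
    lower = (exact + more_general + novel * valid_low) / total
    upper = (exact + more_general + novel * valid_high) / total
    return lower, upper

# Illustrative counts only, not taken from the study.
lower, upper = precision_range(
    exact=50, more_general=50, child=20, none=80,
    e1a=10, e1b=5, e1c=0, e2a=5, e_all=25,
)
```

With these made-up counts, 60% of the reviewed sample is valid and 80% is valid or plausible, so the estimate spans 0.8 to 0.9.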
Given these estimates of precision, we next checked to see if the annotations produced here could be used immediately in applications relevant to biological discovery. Specifically we assessed the use of the new annotations in the context of gene-set enrichment analyses.
Gene-set enrichment analyses provide a knowledge-based statistical assessment of the important concepts related to a set of genes [5, 6]. Since tools for performing enrichment analysis are noise tolerant (small numbers of annotation errors do not overly disrupt the analysis), but cannot function without annotations, the use of automatically derived annotations as an extension to curated annotations can provide increased power and flexibility in terms of which concepts can be detected. For example, if a sufficient body of relevant text can be identified for each gene in a study set, enrichment analysis can now be conducted using any of the ontologies present in the NCBO BioPortal using the NCBO Annotator. The crux of this kind of analysis is the identification of a sufficient quantity of relevant text. Since each Gene Wiki article is exclusively about one gene, as opposed to a typical article indexed in PubMed that may mention many genes and processes in very specific contexts, the articles form a particularly useful corpus. While the Gene Wiki alone does not yet have enough content to warrant the use of annotations derived solely from its text in knowledge-based analyses, we hypothesized that it could be used as an extension to other sources of gene-centric text to improve the results of text-driven enrichment analyses.
To assess the potential value of the Gene Wiki-derived GO annotations, we measured their impact on a controlled gene set enrichment experiment based on the pattern introduced by LePendu et al. As an example, consider the GO term for 'muscle contraction' (GO:0006936). This GO annotation was associated with 87 genes in the GOA database. After blinding ourselves to the origin of this 87 gene list, we performed text mining on the titles and abstracts of the publications associated with those 87 genes. We expected to find the term 'muscle contraction' to be enriched relative to the background occurrence in all publications. However, when using article titles and abstracts alone, we found no such statistical enrichment (p = 1.0). In contrast, after adding Gene Wiki text to the corpus of publication titles and abstracts, the term 'muscle contraction' was highly enriched (Fisher's exact test, p = 1.22 × 10⁻⁹, odds ratio 81.8).
To confirm that this result was not an artifact, we performed a simulation in which, for each of 1,000 iterations, we selected 87 genes randomly from the set of genes with any GO annotations from the GOA database and performed the identical analysis. This provided a way to estimate the probability that we would observe this improvement from the addition of the Gene Wiki text for random gene sets. The higher this probability, the lower our confidence that the additional annotations provided by the Gene Wiki provided a real signal of value. In only 5 of the 1,000 simulation runs did the Gene Wiki-enhanced annotation set produce a significant P value (p < 0.01) when the PubMed-only set did not. In fact, for the random gene sets the Gene Wiki derived annotations were slightly more likely to make the P values worse (19/1000) though in most cases there was no impact at all. This simulation demonstrated that, as should be expected from results presented above, the annotations mined from the Gene Wiki are both non-random and are clearly correlated with annotations shared in curated databases. In addition, it provided empirical evidence that Fisher's exact test is appropriate in this situation (see Methods for additional discussion of the selection of the test statistic).
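The random-gene-set control described above can be sketched as follows. This is a minimal outline, not the authors' code: `empirical_rate` and the `improves` callback are hypothetical names, and the callback stands in for the full analysis (mining, contingency table, and test) applied to one gene set.

```python
import random

def empirical_rate(annotated_genes, set_size, improves, iterations=1000, seed=0):
    """Fraction of random gene sets for which `improves(gene_set)` returns True.

    `annotated_genes` is the pool to sample from (for example, genes with any
    reference annotation); `improves` should return True when the enhanced
    annotation set yields a significant result and the baseline does not.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    pool = sorted(annotated_genes)
    hits = sum(
        1 for _ in range(iterations) if improves(rng.sample(pool, set_size))
    )
    return hits / iterations

# Placeholder pool and a placeholder callback that never reports an improvement.
toy_pool = [f"GENE_{i}" for i in range(500)]
rate = empirical_rate(toy_pool, set_size=87, improves=lambda genes: False)
```

In the study's terms, 5 qualifying runs out of 1,000 would correspond to a rate of 0.005.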
The relatively low rate of annotation rediscovery for GO terms based on the text from abstracts related to associated genes is not surprising. Other groups have reported that only about 10% of the curated GO annotations can be found in the text of the abstract of the paper cited as evidence for the annotation. What we show here simply demonstrates that the text from the Gene Wiki can extend the reach of systems that rely solely on the text of PubMed abstracts for annotation mining. While the results are preliminary, a similar text-driven enrichment analysis that used the DO rather than the GO also showed improvement when annotations mined from the Gene Wiki were included alongside annotations mined from PubMed abstracts (data not shown).
Our results demonstrate that the Gene Wiki is an important repository of knowledge about the human genome that is different from and complementary to other biological knowledge sources. Its articles provide a growing source of gene-specific text that pulls together relevant bits of information from the published literature into a form that is both useful for human consumption and highly amenable to natural language processing. Importantly, the mass collaborative approach to assembling these wiki articles scales with the explosive growth of the biomedical literature. Annotations mined from Gene Wiki text both recapitulated and extended knowledge in existing databases.
We speculate that the differences in the level of precision estimated for the predicted annotations using the GO (48-64%) and the DO (90-93%) were primarily the result of two key factors: the differing scopes of the two ontologies and differences in the way that the two communities view annotations. The extremely broad scope of the GO means that it has many fairly general terms, like 'transduction', that can result in errors due to polysemy. While fairly general terms do exist in the DO, such as 'dependence' (DOID:9973), they are far less frequent.
Aside from increased numbers of mismatched concepts from the GO, the criteria used by curators for establishing that a GO annotation is fit for inclusion in the GOA database are more stringent than the criteria for establishing a gene-disease association. This is evident when comparing the category 2A (maybe but insufficient evidence) results for the two ontologies. As shown in Figures 4 and 5, the reviewers felt that only 4% of the evaluated DO-based candidates needed additional investigation while 26% of the GO-based candidates were judged to require more evidence. In many cases, the 2A scores for the GO evaluations resulted when the main evidence for the new annotation was from research conducted in a model organism. In order for a GO curator to accept evidence from another organism, further analysis of sequence and phylogenies must be conducted and such analysis was beyond the scope of this evaluation. In the case of the DO annotations, curators accept evidence from model organisms as sufficient for forming an association.
It is worth noting that the basic protocol described here is not tied to the Annotator; it could be used with any concept detection system. The Annotator was used in this analysis because it has a fast, convenient API, access to a large number of ontologies including the DO, and has been shown to have similar performance to MetaMap, a longstanding, commonly used tool for biomedical concept recognition. Improvements to the Annotator workflow, such as negation detection, are ongoing and will benefit the protocol described here. For critical assessments of other text-mining tools and applications in biology, see the BioCreative competitions. In particular see  for a discussion of challenges in working with the GO.
In our evaluation of the results, we only had access to a single qualified GO curator and two DO curators, hence we had to constrain the size of the sample processed and could not calculate inter-annotator agreement (the DO curators discussed and resolved all discrepancies). While we suggest that the number of manually reviewed candidate annotations was sufficient to provide rough estimates on the precision of this protocol, additional evaluations would certainly be valuable. The scarcity of qualified curators, and the even more apparent scarcity of their time, provides additional motivation for continuing this area of research.
We expect that predicted annotations from the Gene Wiki will have several applications.
First, professional biocurators could use Gene Wiki-derived annotations as a useful starting point for their curation efforts. Most obviously, these candidate annotations could be processed according to current curatorial standards (similar to the expert evaluation described in this study) to approve, refine, or reject them as formal annotations. On an even more basic level, curators could simply prioritize PubMed articles that were used in inline Gene Wiki citations for formal review. In this scenario, the Gene Wiki would be used as a crowdsourced method to identify the most relevant scientific literature, an increasingly difficult problem based on the rapid growth of PubMed.
Second, these candidate annotations could be used directly by end users in statistical analyses that are tolerant to noisy data. For example, gene set enrichment analysis is among the most popular analysis strategies for genomic studies, and the underlying statistical test is, by definition, noise tolerant. A recent application called the Rich Annotation Summarizer (RANSUM) performs gene set enrichment analysis using any of the ontologies in the NCBO BioPortal by applying the Annotator to extract relevant annotations from MEDLINE abstracts and the NCBO Resource index. Annotations derived from the Gene Wiki could fit directly into these and related systems.
Ontology-based gene annotation forms a crucial component of many tasks in bioinformatics, but accumulating these annotations is costly. By combining the mass collaboratively generated text of gene-specific articles in the Gene Wiki with readily accessible natural language processing technology, we introduced a new and scalable system for generating gene annotations. As with any application of currently available natural language processing, this system is not error free. GO annotations in particular proved difficult to produce at high precision. Looking forward, we can expect that improvements in information extraction technology and the continued expansion of the gene-centric text in the Gene Wiki will combine to produce an increasingly valuable process for harnessing the ever-expanding body of functional knowledge about the human genome.
identify individual sentences
identify the most recent author of each sub-block of each sentence where sub-blocks are determined by the sentence's edit history
remove most wikimarkup from the text
NCBO Annotator Parameters
The results from the Annotator were translated into candidate annotations for each Gene Wiki gene. In addition to the stopwords sent to the Annotator, we filtered out uninformative predicted annotations that used the GO terms 'chromosome', 'cellular component' and the DO terms 'disease', 'disorder', 'syndrome', and 'recruitment'. Using markup provided by the WikiTrust system, we linked each candidate annotation to the last Wikipedia author to edit the text from which it was extracted. We used this authorship information to remove annotations extracted from text that had been imported automatically from NCBI Gene summaries during the initial creation of the Gene Wiki articles. While this text contains useful information, we chose to remove it from this analysis because we wanted to focus on the text edited by Gene Wiki users.
An executable program for identifying GO and DO annotations in Wikipedia articles according to the protocol described above as well as all relevant source code may be accessed at the open source Gene Wiki code repository.
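The two filtering steps described in this paragraph can be sketched as below. The record layout, the function name, and the account name `ExampleImportBot` are hypothetical and chosen only for illustration; the filtered term lists are the ones named in the text.

```python
# Terms the text lists as uninformative, grouped by ontology.
UNINFORMATIVE_TERMS = {
    "GO": {"chromosome", "cellular component"},
    "DO": {"disease", "disorder", "syndrome", "recruitment"},
}

def filter_candidates(candidates, import_accounts):
    """Drop uninformative terms and annotations from automatically imported text.

    Each candidate is a dict with 'gene', 'ontology', 'term' and 'last_author'
    keys (an assumed layout). `import_accounts` holds the author names under
    which text was imported automatically.
    """
    kept = []
    for cand in candidates:
        blocked = UNINFORMATIVE_TERMS.get(cand["ontology"], set())
        if cand["term"].lower() in blocked:
            continue  # uninformative term
        if cand["last_author"] in import_accounts:
            continue  # text last edited by an automated import account
        kept.append(cand)
    return kept

example = [
    {"gene": "GENE_A", "ontology": "DO", "term": "disease", "last_author": "user1"},
    {"gene": "GENE_A", "ontology": "DO", "term": "scleroderma", "last_author": "user1"},
    {"gene": "GENE_B", "ontology": "GO", "term": "muscle contraction",
     "last_author": "ExampleImportBot"},
]
kept = filter_candidates(example, import_accounts={"ExampleImportBot"})
```

Only the second record survives: the first uses a filtered DO term and the third comes from text attributed to the placeholder import account.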
find all the terms annotated to the gene in the reference annotation set (e.g. from the GOA database).
◦ (For the GOA comparisons, we included electronically generated annotations which use the IEA evidence code.)
◦ For the DO, use only 'is a' relationships for this expansion
◦ For the GO, use 'is a', 'part of', 'regulates', 'positively regulates', and 'negatively regulates'. Each of these relations is treated in the same way as an 'is a' relation. This ensures that we can identify when our system identifies terms that are closely related to the terms used in a reference annotation even when they are not linked through a subsumption relationship. (The versions of the DO used in this work do not make extensive use of non-'is a' relationships). For example, if we discover the term 'positive regulation of transcription from RNA polymerase II promoter' and there is a reference annotation to the term 'transcription from RNA polymerase II promoter' we would record this as a match to a narrower term than existing annotation though the relation between these terms is 'positively regulates'.
record when a predicted annotation for a gene matches a reference annotation exactly, when it matches a broader term than an existing annotation, when it matches a narrower term and when it does not have a match.
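The relation-aware expansion in the steps above can be sketched as follows. This is an illustration under stated assumptions rather than the released code: the edge triples, the underscore spellings of the relation names, and the function names are hypothetical, and term labels stand in for identifiers.

```python
# Relations treated like 'is a' during expansion, per ontology (as listed in the text).
DO_RELATIONS = {"is_a"}
GO_RELATIONS = {
    "is_a", "part_of", "regulates", "positively_regulates", "negatively_regulates",
}

def build_parent_map(edges, allowed_relations):
    """Build a child -> parents map from (child, relation, parent) triples,
    keeping only edges whose relation is in `allowed_relations`."""
    parents = {}
    for child, relation, parent in edges:
        if relation in allowed_relations:
            parents.setdefault(child, set()).add(parent)
    return parents

def expand(terms, parents):
    """Return `terms` plus every term reachable upward through the kept relations."""
    closure = set(terms)
    stack = list(terms)
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in closure:
                closure.add(parent)
                stack.append(parent)
    return closure

REGULATOR = "positive regulation of transcription from RNA polymerase II promoter"
TARGET = "transcription from RNA polymerase II promoter"
edges = [(REGULATOR, "positively_regulates", TARGET)]

go_closure = expand({REGULATOR}, build_parent_map(edges, GO_RELATIONS))
is_a_only_closure = expand({REGULATOR}, build_parent_map(edges, DO_RELATIONS))
```

With the GO relation set, the closure of the regulation term includes the transcription term, so a candidate using the former is recorded as narrower than a reference annotation to the latter. With 'is a' alone, the two terms stay unconnected.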
Connections between genes and relevant PubMed citations are identified using literature-based records from the GOA database. The fact that a citation is used as evidence of gene function by a human curator provides good evidence that the text in the article is somehow about the gene.
The title and abstracts of each of these articles are mined for GO terms using the Annotator following the same protocol as applied to the Gene Wiki text.
The mined terms are assigned as candidate annotations to the genes linked to the relevant PubMed citations.
create a "true positive" gene set based on the non-IEA annotations for that term from GOA
build a contingency table where the values in the cells are the numbers of genes assigned or not assigned to the GO term by GOA and by the PubMed mining system.
assess the probability that the two gene sets are independent using Fisher's exact test
expand the PubMed-mined annotation set with the annotations mined from the Gene Wiki and repeat the comparison to the GOA-derived gene set
The results from steps 4b and 4c were compared to quantify the impact that the Gene Wiki-derived annotations had on gene set enrichment analysis (Additional File 5).
Note that Fisher's exact test is used in many widely used tools for conducting enrichment analysis including: DAVID, FatiGO, and GoMiner. As the situation modeled in this experiment is precisely that of an enrichment analysis, this provides evidence that this is a reasonable test statistic. Many of the tools that do not use Fisher's test report the use of the hypergeometric test, e.g. BINGO, CLENCH, FunSpec. This test is exactly equivalent to the one-tailed version of Fisher's test.
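To make the equivalence concrete, the sketch below computes a one-tailed Fisher's exact test directly as a hypergeometric tail sum using only the Python standard library. The table orientation (rows for GOA assignment, columns for assignment by text mining) and the function name are assumptions for illustration, and the counts in the example are a textbook 2×2 table rather than data from the study.

```python
from math import comb

def fisher_one_tailed(a, b, c, d):
    """One-tailed (enrichment) p-value for the 2x2 table [[a, b], [c, d]].

    a: genes assigned to the term by both sources
    b: genes assigned by the reference only
    c: genes assigned by text mining only
    d: genes assigned by neither
    Returns P(X >= a) with the row and column totals held fixed, which is the
    upper tail of the hypergeometric distribution.
    """
    row1 = a + b          # genes assigned by the reference
    col1 = a + c          # genes assigned by text mining
    n = a + b + c + d     # all genes considered
    denominator = comb(n, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / denominator

# Textbook example: 8 items, 4 in each row and column, 3 in the top-left cell.
p_value = fisher_one_tailed(3, 1, 1, 3)
```

For this table the tail sum is (16 + 1) / 70, so `p_value` equals 17/70, about 0.243.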
Gene Wiki Articles
Jan. 4, 2011
Jan. 12, 2011
Jan. 11, 2011
OWL version of the Disease Ontology (for evaluation)
Jan. 6, 2011
Disease Ontology annotations (from GeneRIF mining (Osborne et al 2009))
Created Oct. 23, 2008.
Accessed Jan. 7, 2011
OWL version of the Gene Ontology (for evaluation)
(cvs version: $Revision: 1.1439)
Sept. 22, 2010
Gene Ontology Annotations
* we used all annotations including IEA in our analysis
Dec. 13, 2010
Human Disease Ontology
Gene Ontology Annotation database.
We would like to thank Paea LePendu for executing a preliminary DO enrichment analysis and for other helpful advice and feedback. We would also like to thank the anonymous reviewers for thoughtful and constructive criticism that resulted in improvements to this manuscript. This work was supported by the National Institutes of Health (GM089820).
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.