System design
As illustrated in Figure 1, we applied the NCBO Annotator to identify occurrences of DO and GO terms in Gene Wiki articles. The text of each Gene Wiki article was first filtered to eliminate references and auto-generated text, and then sent to the Annotator for concept detection. Since each article is specifically about a particular human gene, we made the assumption that occurrences of concepts within the text of a gene-centric article were also about the gene and considered such occurrences candidate annotations for the gene. For example, we identified the GO term 'embryonic development' (GO:0009790) in the text of the article on the DAX1 gene: "DAX1 controls the activity of certain genes in the cells that form these tissues during embryonic development". From this occurrence, our system proposed the structured annotation 'DAX1 participates in the biological process of embryonic development'. Following the same pattern, we found a potential annotation to the DO term 'Congenital Adrenal Hypoplasia' (DOID:10492) in the sentence: "Mutations in this gene result in both X-linked congenital adrenal hypoplasia and hypogonadotropic hypogonadism".
Overall, this workflow resulted in 2,983 candidate DO annotations (Additional File 1) and 11,022 candidate GO annotations (Additional File 2) from the collaboratively authored text of 10,271 Gene Wiki articles. We next characterized these candidate annotations through comprehensive comparison to pre-existing reference gene annotation databases and through manual inspection by experts in GO and DO annotation.
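The candidate-generation step can be illustrated with a minimal Python sketch. It is not the NCBO Annotator: the two-entry vocabulary, the function name `candidate_annotations`, and the simple phrase matching are assumptions made only to show how a term occurrence in a gene-centric article becomes a (gene, term) candidate.

```python
import re

# Hypothetical mini-vocabulary for illustration; the actual system used the
# NCBO Annotator with the full GO and DO term lists.
TERMS = {
    "embryonic development": "GO:0009790",
    "congenital adrenal hypoplasia": "DOID:10492",
}

def candidate_annotations(gene, article_text, terms=TERMS):
    """Return sorted (gene, term_id, label) tuples for every term found in the text."""
    text = article_text.lower()
    hits = set()
    for label, term_id in terms.items():
        # Case-insensitive whole-phrase match on word boundaries.
        if re.search(r"\b" + re.escape(label) + r"\b", text):
            hits.add((gene, term_id, label))
    return sorted(hits)

text = ("DAX1 controls the activity of certain genes in the cells that form "
        "these tissues during embryonic development. Mutations in this gene "
        "result in both X-linked congenital adrenal hypoplasia and "
        "hypogonadotropic hypogonadism.")
for annotation in candidate_annotations("DAX1", text):
    print(annotation)
```

Because every occurrence is attributed to the article's gene, a sketch like this inherits the same limitation as the assumption described above: it cannot tell whether the sentence actually asserts a relationship between the gene and the concept.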
Comparisons to Reference Annotations
Each candidate annotation was compared to reference annotations for the relevant gene. Matches to the reference could either be exact matches, matches to a more specific term in the same lineage as a reference annotation (with respect to the ontology's hierarchy) or matches to a more general term. For example, our system suggested that the Thrombin gene be annotated with 'hemostasis' (GO:0007599); since Thrombin was already annotated with 'blood coagulation' (GO:0007596) which is a narrower term than 'hemostasis' and in the same lineage, the predicted annotation was classified as a match to a more general term. (Additional details regarding the processing of the ontology hierarchies are found in the Methods section.)
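The match classification can be sketched as follows. This is a minimal illustration, assuming a toy parent map (`PARENTS`) that holds only the hemostasis example; the function names and the order in which the match types are tested are assumptions, not the processing described in the Methods section.

```python
# Toy is_a hierarchy (child -> parents). Only the link needed for the
# Thrombin example is included; a real ontology has many more.
PARENTS = {
    "GO:0007596": {"GO:0007599"},  # blood coagulation is narrower than hemostasis
    "GO:0007599": set(),
}

def ancestors(term, parents=PARENTS):
    """All terms reachable by following parent links upward from `term`."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def classify(candidate, reference_terms, parents=PARENTS):
    """Classify a candidate term against a gene's reference annotations."""
    reference_terms = set(reference_terms)
    if candidate in reference_terms:
        return "exact"
    # Candidate is an ancestor of some reference term: more general.
    if any(candidate in ancestors(ref, parents) for ref in reference_terms):
        return "more general"
    # Some reference term is an ancestor of the candidate: more specific.
    if ancestors(candidate, parents) & reference_terms:
        return "more specific"
    return "none"

# Thrombin: reference has 'blood coagulation', candidate is 'hemostasis'.
print(classify("GO:0007599", {"GO:0007596"}))  # more general
```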
Disease Ontology
We compared the candidate DO annotations to a pre-existing gene-disease annotation database mined from NCBI's GeneRIFs [21]. While Online Mendelian Inheritance in Man (OMIM) is probably the most widely recognized gene-disease database, we chose the GeneRIF-backed database for comparison because a) it used the DO for annotations thus enabling direct comparison (OMIM does not use any structured vocabulary for its disease annotations), b) at the time it was created, it contained the majority of the gene-disease associations in OMIM and significantly extended this set, and c) it reported a very high precision rate (96.6%) - comparable to a manually curated resource [21]. The downside of using this database for comparison was that it had not been updated since 2008 and hence it was undoubtedly missing more recent information.
In all, 693 (23%) of the 2,983 candidate annotations exactly matched an annotation from the DO reference annotations, 157 (5%) matched a more general term in the same lineage as a reference annotation, 63 (2%) matched a more specific term, and 2,070 (70%) had no match (Figure 2). We refer to the annotations with no match as 'novel candidates' as these represent potential new annotations.
Gene Ontology
We next compared the candidate annotations to reference annotations from the GO annotation database (GOA) [22]. The GOA database is the accepted standard public reference for GO-based annotation of human gene products. When this analysis was conducted, it provided annotations for 17,940 distinct human genes. Of the 11,022 mined GO annotations, 1,853 (17%) matched an annotation in the GOA database exactly, 218 (2%) matched more specific terms than GOA annotations, 2,850 (26%) matched more general terms, and 6,101 (55%) did not match any.
The GO is divided into three distinct topical branches: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). In all, 54% of the 11,022 candidate annotations used BP terms, 14% used MF, and the remaining 32% used CC. In the remainder of this article, we focus on the 5,978 candidate BP annotations because they were the most plentiful in the output and because they are generally the most difficult annotations to determine automatically using other methods [23].
Of the 5,978 candidate BP annotations, 697 (12%) were direct matches to gene-function annotations in the GOA reference, 123 (2%) matched narrower terms than existing annotations, 1,667 (28%) matched more general terms than existing annotations, and the remaining 3,491 (58%) did not match any annotations in the GOA database (Figure 3).
Manual Evaluations
For candidate annotations with no matches in the reference databases, we extracted a random sample and submitted it to expert curators to manually evaluate the quality of the predictions. For the DO evaluations, the sample contained 213 candidate annotations or approximately 10% of the 2,133 candidate annotations that either had no match in the reference set or matched a child of a reference annotation. For the GO we originally selected 200 novel candidate annotations for the evaluation but later had to remove 4 from consideration after discovering that, due to an error in processing, these were actually parents of reference annotations. The sample sizes were the largest that could be processed in a reasonable amount of time based on the curator resources at our disposal. For the evaluation, the curators assigned each candidate annotation in the sample to one of eight categories as follows.
Category 1: Yes, this would lead to a new annotation:
1A: perfect match - the candidate annotation is exactly as it would be from a curator (e.g., Titin → Scleroderma)
1B: not specific enough - the candidate annotation is correct but a more specific term should be used instead (e.g., Titin → Autoimmune disease)
1C: too specific - the candidate annotation is close to correct, but is too specific given the evidence at hand (e.g., Titin → Pulmonary Systemic Sclerosis)
Category 2: Maybe, but insufficient evidence:
2A: evaluator could not find enough supporting evidence in the literature after about 10 minutes of looking (e.g., DUSP7 → cellular proliferation; there is literature indicating that DUSP7 is a phosphatase that dephosphorylates MAPK, and hence may play a role in regulating cell proliferation stimulated through MAPK. Although no direct evidence supporting this contention for Human DUSP7 was found, it seems plausible.)
2B: there is disagreement in the literature about the truth of this annotation
Category 3: No, this candidate annotation is incorrect:
3A: incorrect concept recognition (e.g., "Olfactory receptors share a 7-transmembrane domain structure with many neurotransmitter and hormone receptors and are responsible for the recognition and G protein-mediated transduction of odorant signals." [24] The system incorrectly identifies 'transduction' (GO:0009293) which is defined as the transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector - a completely different concept from signal transduction as intended in the sentence.)
3B: incorrect sentence context - the sentence is a negation or otherwise does not support the predicted annotation for the given gene (e.g., "The protein is composed of ~300 amino acid residues and has ~30 carbohydrate residues attached including 10 sialic acid residues, which are attached to the protein during posttranslational modification in the Golgi apparatus." [25] Such sentences may lead to incorrect candidate annotations of 'Golgi apparatus' and 'Posttranslational modification'.)
3C: this sentence seems factually false (e.g., a hypothetical example: "Insulin injections have been shown to cure Parkinson's disease and lead to the growth of additional toes".)
Disease Ontology
To assess the quality of the candidate DO annotations with no match in the reference set discussed above, we reviewed the 213 randomly selected novel candidate annotations manually according to the criteria outlined above (Additional File 3). The reviewers (authors SML and WAK) were experts in DO-based gene annotation and are active participants in the development of the DO.
Out of the 213 candidates evaluated, 193 (91%) were classified in category 1 (yes, this would lead to a new annotation) with 175 (82%) assigned to category 1A (perfect match). Figure 4 provides a breakdown of the results of the manual evaluation. Nearly all of the errors fall into category 3B (incorrect sentence context). For example, the NCBO Annotator correctly identified the disease term 'neuroblastoma' in the following sentence: "Dock3-mediated Rac1 activation promotes reorganisation of the cytoskeleton in SH-SY5Y neuroblastoma cells and primary cortical neurones as well as morphological changes in fibroblasts"; however, the assumption that the occurrence indicates an association with the gene Dock3 does not hold because the sentence is referring to a neuroblastoma cell line rather than to the human disease.
Gene Ontology
We then followed the same protocol to assess the quality of candidate biological process annotations with no match in the GOA reference set (Additional File 4). A professional curator familiar with GO annotation (author DGH) manually inspected a random sample of 196 candidate annotations with no match in the reference set and classified each using the same eight categories used for the DO evaluations. The performance was substantially worse for the novel GO annotations than for the DO (Figure 5): in particular, only 8 (4%) of the 196 candidates that were evaluated were assigned to category 1A (perfect match). Aside from the low number of 'perfect match' results, candidate GO annotations generated many more uncertain results (26% 2A) as well as a very large fraction of errors due to incorrect sentence context (47% 3B).
It is worth noting that not one of the errors detected in either the DO or the GO annotations was classified as 3C (sentence is factually false). This provides evidence that the text of the Gene Wiki (at least the approximately 400 sentences examined manually here) is consistently correct.
Overall precision of annotation mining system
We integrated the results from both the manual assessments and the comparison to reference data sets to provide an estimate of the overall likelihood that a predicted annotation is biologically valid. We refer to this likelihood as 'precision' where precision (also known as positive predictive value) is equal to the ratio of true positive predictions to the sum of true and false positive predictions. In order to calculate precision, we need to divide all predictions into two classes 'true' and 'false'. For this analysis, the true positive set contained the direct matches to reference annotations, the matches to parents of reference annotations (though not as specific as they could be, these are still valid annotations), and the estimated number of Category 1 (would result in a new annotation) novel annotations. The estimated counts for the novel annotations were derived by multiplying the precision rates observed in the manually evaluated sample by the total number of novel candidate annotations. To account for the novel predicted annotations that were classified into Category 2A (maybe), we provide two estimates of precision: one that includes the 2A results as true positives and one that includes them as false positives. In this way we produce an estimated lower and upper bound on the system's actual precision. The estimated upper bound for the overall precision of the annotation protocol was thus calculated as:
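$$\text{precision}_{\text{upper}} = \frac{\text{exact} + \text{more general} + (\text{child} + \text{none}) \times \dfrac{\text{e1a} + \text{e1b} + \text{e1c} + \text{e2a}}{\text{eAll}}}{\text{exact} + \text{more general} + \text{child} + \text{none}} \tag{1}$$
and the estimated lower bound as:
$$\text{precision}_{\text{lower}} = \frac{\text{exact} + \text{more general} + (\text{child} + \text{none}) \times \dfrac{\text{e1a} + \text{e1b} + \text{e1c}}{\text{eAll}}}{\text{exact} + \text{more general} + \text{child} + \text{none}} \tag{2}$$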
where exact, more general, child, and none correspond to agreements between candidate annotations and a reference set; e1a, e1b, e1c and e2a refer to evaluation categories 1A, 1B, 1C and 2A, and eAll refers to the total number of novel candidates evaluated.
Using equations 1 and 2, we estimated a range for precision of 90-93% for the DO annotations and 48-64% for the GO annotations. In retrospect, it may be more appropriate to remove the category 1C ('too specific') annotations from the 'biologically valid' grouping here, but, since there were no occurrences of this category in either the DO or the GO evaluations, this change would not impact the results of the present analysis.
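The bound calculation described above can be sketched in Python. This is an illustration under stated assumptions: the function name is hypothetical, the pool of novel candidates is taken to be child + none (the pool from which the DO sample was drawn), and the counts in the usage example are invented round numbers, not the values from this study.

```python
def precision_bounds(exact, more_general, child, none,
                     e1a, e1b, e1c, e2a, e_all):
    """Return (lower, upper) estimates of overall precision.

    exact, more_general, child, none: counts from the reference comparison.
    e1a, e1b, e1c, e2a: counts per category in the manually evaluated sample.
    e_all: total number of novel candidates evaluated.
    """
    total = exact + more_general + child + none
    confirmed = exact + more_general   # counted directly as true positives
    novel = child + none               # assumption: pool the sample represents
    category1 = e1a + e1b + e1c
    # Lower bound: category 2A ("maybe") counted as false positives.
    lower = (confirmed + novel * category1 / e_all) / total
    # Upper bound: category 2A counted as true positives.
    upper = (confirmed + novel * (category1 + e2a) / e_all) / total
    return lower, upper

# Hypothetical counts chosen only to make the arithmetic easy to follow.
lower, upper = precision_bounds(exact=50, more_general=10, child=5, none=35,
                                e1a=6, e1b=2, e1c=0, e2a=1, e_all=10)
print(round(lower, 2), round(upper, 2))  # 0.92 0.96
```

The two bounds differ only in how the 2A fraction of the sample is treated, so the width of the interval is proportional to the share of novel candidates that the curators could not resolve.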
Potential applications in enrichment analysis
Given these estimates of precision, we next checked to see if the annotations produced here could be used immediately in applications relevant to biological discovery. Specifically we assessed the use of the new annotations in the context of gene-set enrichment analyses.
Gene-set enrichment analyses provide a knowledge-based statistical assessment of the important concepts related to a set of genes [5, 6]. Since tools for performing enrichment analysis are noise tolerant (small numbers of annotation errors do not overly disrupt the analysis), but cannot function without annotations, the use of automatically derived annotations as an extension to curated annotations can provide increased power and flexibility in terms of which concepts can be detected. For example, if a sufficient body of relevant text can be identified for each gene in a study set, enrichment analysis can now be conducted using any of the ontologies present in the NCBO BioPortal using the NCBO Annotator [26]. The crux of this kind of analysis is the identification of a sufficient quantity of relevant text. Since each Gene Wiki article is exclusively about one gene, as opposed to a typical article indexed in PubMed that may mention many genes and processes in very specific contexts, the articles form a particularly useful corpus. While the Gene Wiki alone does not yet have enough content to warrant the use of annotations derived solely from its text in knowledge-based analyses, we hypothesized that it could be used as an extension to other sources of gene-centric text to improve the results of text-driven enrichment analyses.
Evaluation of mined GO annotations in gene-set enrichment analysis
To assess the potential value of the Gene Wiki-derived GO annotations, we measured their impact on a controlled gene-set enrichment experiment based on the pattern introduced by LePendu et al. [27]. As an example, consider the GO term for 'muscle contraction' (GO:0006936). This GO term was associated with 87 genes in the GOA database. After blinding ourselves to the origin of this 87-gene list, we performed text mining on the titles and abstracts of the publications associated with those 87 genes. We expected to find the term 'muscle contraction' to be enriched relative to the background occurrence in all publications. However, when using article titles and abstracts alone, we found no such statistical enrichment (p = 1.0). In contrast, after adding Gene Wiki text to the corpus of publication titles and abstracts, the term 'muscle contraction' was highly enriched (Fisher's exact test, p = 1.22 × 10^-9, odds ratio 81.8).
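For readers unfamiliar with the test, the following Python sketch computes a one-sided Fisher's exact p-value and an odds ratio from a 2×2 table using only the standard library. The table layout, the one-sided alternative, and the counts in the example are assumptions for illustration; the construction of the actual contingency table is described in the Methods.

```python
from math import comb

def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test for enrichment.

    Assumed 2x2 layout:
                        term found   term not found
        study set           a              b
        background          c              d
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, row1)

def odds_ratio(a, b, c, d):
    """Sample odds ratio of the 2x2 table (undefined when b * c == 0)."""
    return (a * d) / (b * c)

# Hypothetical counts, not the values from this study.
print(round(fisher_enrichment_p(3, 1, 1, 3), 4))  # 0.2429
print(odds_ratio(3, 1, 1, 3))                     # 9.0
```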
To confirm that this result was not an artifact, we performed a simulation in which, for each of 1,000 iterations, we selected 87 genes randomly from the set of genes with any GO annotations from the GOA database and performed the identical analysis. This provided a way to estimate the probability that we would observe this improvement from the addition of the Gene Wiki text for random gene sets. The higher this probability, the lower our confidence that the additional annotations contributed by the Gene Wiki carried a real signal of value. In only 5 of the 1,000 simulation runs did the Gene Wiki-enhanced annotation set produce a significant P value (p < 0.01) when the PubMed-only set did not. In fact, for the random gene sets the Gene Wiki-derived annotations were slightly more likely to make the P values worse (19 of the 1,000 runs), though in most cases there was no impact at all. This simulation demonstrated that, as should be expected from results presented above, the annotations mined from the Gene Wiki are both non-random and clearly correlated with annotations shared in curated databases. In addition, it provided empirical evidence that Fisher's exact test is appropriate in this situation (see Methods for additional discussion of the selection of the test statistic).
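The structure of this random-set control can be sketched as below. The function name, its parameters, and the two scoring callbacks (`p_baseline` for the PubMed-only corpus, `p_enhanced` for PubMed plus Gene Wiki) are hypothetical placeholders; the sketch shows only the counting logic, not the text-mining pipeline itself.

```python
import random

def simulate_random_sets(genes, set_size, p_baseline, p_enhanced,
                         iterations=1000, alpha=0.01, seed=0):
    """Count how often adding text helps or hurts for random gene sets.

    p_baseline / p_enhanced: callables mapping a gene list to a p-value
    (placeholders for the two enrichment analyses being compared).
    Returns (gained, worsened):
      gained   - enhanced corpus is significant while the baseline is not
      worsened - enhanced corpus gives a larger p-value than the baseline
    """
    rng = random.Random(seed)
    gained = worsened = 0
    for _ in range(iterations):
        gene_set = rng.sample(genes, set_size)
        p0, p1 = p_baseline(gene_set), p_enhanced(gene_set)
        if p1 < alpha <= p0:
            gained += 1
        elif p1 > p0:
            worsened += 1
    return gained, worsened

# Stub scoring functions: both corpora always return p = 1.0 (no signal).
genes = [f"gene{i}" for i in range(500)]
print(simulate_random_sets(genes, 87, lambda s: 1.0, lambda s: 1.0))  # (0, 0)
```

A small `gained` count relative to the number of iterations, as reported above, indicates that the improvement seen for the real gene set is unlikely to arise by chance from simply adding more text.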
We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores (Figure 6). Using abstracts and titles alone, this protocol resulted in a significant enrichment score for 314 gene sets (41%). Enhancing the data from PubMed mining with the Gene Wiki text resulted in an increase to 399 significant tests (52%).
The relatively low rate of annotation rediscovery for GO terms based on the text from abstracts related to associated genes is not surprising. Other groups have reported that only about 10% of the curated GO annotations can be found in the text of the abstract of the paper cited as evidence for the annotation [27]. What we show here simply demonstrates that the text from the Gene Wiki can extend the reach of systems that rely solely on the text of PubMed abstracts for annotation mining. While the results are preliminary, a similar text-driven enrichment analysis that used the DO rather than the GO also showed improvement when annotations mined from the Gene Wiki were included alongside annotations mined from PubMed abstracts (data not shown).