Use artificial neural network to align biological ontologies
© Huang et al. 2008
Published: 16 September 2008
Being formal, declarative knowledge representation models, ontologies help to address the problem of imprecise terminologies in biological and biomedical research. However, ontologies constructed under the auspices of the Open Biomedical Ontologies (OBO) group have exhibited a great deal of variety, because different parties can design ontologies according to their own conceptual views of the world. It is therefore becoming critical to align ontologies from different parties. During automated/semi-automated alignment across biological ontologies, different semantic aspects, i.e., concept name, concept properties, and concept relationships, contribute to alignment results in different degrees. Therefore, a vector of weights must be assigned to these semantic aspects. It is not trivial to determine what those weights should be, and current methodologies depend heavily on human heuristics.
In this paper, we take an artificial neural network approach to learn and adjust these weights, and thereby support a new ontology alignment algorithm, customized for biological ontologies, with the purpose of avoiding some disadvantages in both rule-based and learning-based aligning algorithms. This approach has been evaluated by aligning two real-world biological ontologies, whose features include huge file size, very few instances, concept names in numerical strings, and others.
The promising experimental results verify our proposed hypothesis, i.e., that the three weights for semantic aspects learned from a subset of concepts are representative of all concepts in the same ontology. Therefore, our method represents a large leap forward towards automating biological ontology alignment.
The fields of biological and biomedical research are characterized by great complexity and imprecise terminologies. To address this imprecision and to standardize descriptions of biological entities, extensive efforts have been dedicated toward ontology development. The most successful endeavor is the development of the Gene Ontology (GO), a formal and structured language, by the GO Consortium. GO has three independently structured, controlled vocabularies: molecular functions - activities, such as catalysis or binding, at the molecular level; biological processes - events accomplished by one or more ordered assemblies of molecular functions; and cellular components - components that are part of some larger objects, such as an anatomical structure or gene product group. To coordinate GO and other ontology development for biomedical research, the Open Biomedical Ontologies (OBO) group has developed mechanisms to share different ontologies. Many ontologies in OBO have been represented in both the OBO format and the Web Ontology Language (OWL).
Although they are formal, declarative knowledge representation models, ontologies from OBO have exhibited great variety. This variety stems from the fact that different parties can design ontologies according to their own conceptual views of the world. Unless this heterogeneity problem is resolved, it will be very difficult, if not impossible, to relate different ontologies and take advantage of the resulting integration [4, 5]. Current efforts to integrate ontologies include: 1) merging - combine several ontologies into a single one; 2) mapping - relate similar concepts or relationships across different ontologies, resulting in a virtual integration; and 3) alignment - define relationships between terms in different ontologies. In fact, mapping is a special kind of alignment, i.e., defining equivalentClassOf relationships between two ontologies. This paper concentrates on the challenge of finding equivalent concept pairs from different biological ontologies, which is one of the most significant tasks in biological ontology alignment.
According to the classification in , most schema alignment techniques can be divided into two categories: rule-based and learning-based. We briefly discuss these two categories by summarizing some well-known algorithms.
The rule-based schema alignment techniques consider schema information only, and different algorithms are distinguished from one another by their specific rules. PROMPT provides a semi-automatic approach to ontology merging. By performing some tasks automatically and guiding the user in performing other tasks, PROMPT helps in understanding and reusing ontologies. Dou et al. view ontology translation as ontology merging and automated reasoning, which are in turn implemented through a set of axioms. The authors regard ontology merging as taking the union of the terms and the axioms defining them, then adding bridging axioms through the terms in the merge. Cupid discovers mappings between schema elements based on their names, data types, constraints, and schema structure. Cupid has a bias toward leaf structure, where much of the schema content resides. The experimental results show better performance than DIKE and MOMIS. Giunchiglia et al. view match as an operator that takes two graph-like structures and produces a mapping between the nodes. They discover mappings by computing semantic relations, determined by analyzing the meaning codified in the elements and the structures. The hypothesis in  is that a multiplicity of ontology fragments can be related to each other without the use of a global ontology. Any pair of ontologies can be related indirectly through a semantic bridge consisting of many other previously unrelated ontologies. Huang et al. extend this work to incorporate: extended use of WordNet; use of the Java WordNet Library API for performing run-time access to the dictionary; and reasoning rules based on the domain-independent relationships and each ontology concept's property list to infer new relationships.
The learning-based schema alignment techniques consider both schema information and instance data, and various kinds of machine learning techniques have been adopted. GLUE employs machine learning techniques to find semantic mappings between ontologies. After obtaining the results from a Content Learner and a Name Learner, a Metalearner is used to combine the predictions from both learners. Then common knowledge and domain constraints are incorporated through a Relaxation Labeler, and the mappings are finally calculated. In addition, the authors extend GLUE to find complex mappings. Williams introduces a methodology and algorithm, DOGGIE, for multiagent knowledge sharing and learning in a peer-to-peer setting. DOGGIE enables multiagent systems to assist groups of people in locating, translating, and sharing knowledge represented in ontologies. After locating similar concepts, agents can continue to translate concepts and then are able to share meanings. Soh describes a framework for distributed ontology learning embedded in a multiagent environment. The objective is to improve communication and understanding among the agents while agent autonomy is still preserved. Agents are able to independently evolve their own ontological knowledge, and maintain translation tables through learning to help sustain the collaborative effort. Wiesman and Roos present an ontology matching approach based on probability theory by exchanging instances of concepts. During each step of the matching process, the likelihood that a decision is correct is taken into account. No domain knowledge is required, and the ontology structure plays no role. Madhavan et al. show how a corpus of schemas and mappings can be used to augment the evidence about the schemas being matched. Such a corpus typically contains multiple schemas that model similar concepts and their properties.
They first increase the evidence about each element being matched by including evidence from similar elements in the corpus. Then they learn statistics about elements and their relationships to infer constraints.
Both rule-based and learning-based algorithms have disadvantages. The former ignore the information obtained from instance data. A more severe problem is the way this technique treats different semantic aspects. In general, ontologies are characterized by the aspects of concept name, concept properties, and concept relationships. These aspects have different contributions to understanding ontologies' semantics. Take BiologicalProcess ontology  for example: there is a rich set of super/subClassOf relationships (over 20,000); however, at the same time, numerical strings, which are hardly meaningful to machines, are adopted as concept names, "GO_0030838" for example. Therefore, it is essential to assign different weights to different semantic aspects if a more accurate and meaningful alignment result is favored. Unfortunately, current research has made use of human intervention and/or prior domain knowledge to define these weights.
The main problems for learning-based algorithms include a relatively longer running time (due to the learning phase), and the difficulty of getting enough and/or good-quality data. The knowledge bases, i.e., ontologies and/or databases, in the biological and biomedical areas are usually huge. For example, there are more than 126,000 concepts in the NCI Thesaurus ontology. The extremely large file size places higher demands on any alignment algorithm's efficiency. Moreover, most biological ontologies have very few instance data, if any at all. Even for those with instances, the instances are most likely to be in semi-structured or unstructured formats, and therefore difficult to use.
In this paper, we present a new approach to align biological ontologies that combines both rule-based and learning-based algorithms. Our contributions are in the following. (1) Our approach integrates an artificial neural network (ANN) technique in our algorithm, such that the weights mentioned above can be learned instead of being specified by a human in advance. (2) Moreover, our learning technique is carried out based on the ontology schema information alone, which distinguishes it from most other learning-based algorithms.
The rest of this paper is organized as follows. Section 2 gives an overview of our method, and discusses the challenges in applying machine learning techniques without instance data information; it also presents the details of our algorithm. Section 3 reports the experiments conducted and analyzes the results. Section 4 concludes with an outline of future work in aligning biological ontologies.
Our hypothesis is that the three weights for semantic aspects learned from a subset of concepts are representative of all concepts in the same ontology. In order to verify this, our experiments need to show that: (1) the learning process itself is correct, i.e., the three weights converge to certain values; and (2) the learned weights are meaningful, i.e., the resultant equivalent concept pairs have satisfactory performance on Precision and/or Recall.
Two real-world biological ontologies, BiologicalProcess and Pathway, are adopted as the test ontologies. They have 13,922 and 571 concepts, respectively; in addition, most relationships are super/subClassOf ones.
1. Both ontologies adopt numerical strings, "PW0000015" for example, as concept names, while the meaningful terms, "alzheimer_disease_pathway" for example, are embodied as labels. The purpose of this design is to avoid any potentially repeated concept names. We preprocessed both OWL files by replacing numerical strings with corresponding labels. Fortunately, there are no redundant concept names in either ontology.
Table: Training examples in OAANN (column headers: Concepts from BiologicalProcess; Concepts from Pathway).
4. After a certain number of iterations (again, see next section for detailed information) in our ANN, all three weights for semantic aspects converged, and their values are 0.64, 0.01, and 0.35, respectively. These learned weights were then applied to recalculate the similarity matrix.
5. Out of the updated similarity matrix, a set of different thresholds for the similarity was chosen, ranging from 1.00 down to 0.29. We then used these thresholds to calculate equivalent concept pairs from the updated matrix.
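As an illustration, extracting equivalent concept pairs from the similarity matrix at a given threshold could be sketched as below. This is our own simplified sketch, not the exact OAANN procedure; in particular, the one-to-one pairing constraint and the function name are assumptions.

```python
def equivalent_pairs(M, names1, names2, threshold):
    """Report concept pairs whose similarity meets the threshold,
    scanning cells in descending order of similarity; each concept
    is paired at most once (an assumed one-to-one constraint)."""
    cells = sorted(((M[i][j], i, j)
                    for i in range(len(names1))
                    for j in range(len(names2))), reverse=True)
    used1, used2, pairs = set(), set(), []
    for s, i, j in cells:
        if s < threshold:
            break
        if i in used1 or j in used2:
            continue
        pairs.append((names1[i], names2[j], s))
        used1.add(i)
        used2.add(j)
    return pairs
```

Sweeping the threshold from 1.00 down to 0.29, as in the experiment, amounts to calling this function repeatedly and recording how the number of pairs grows.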
Based on the experiment results, it is clear that our hypothesis is validated since:
1. w2 has a rather low value, reflecting the low contribution from concept properties. This conforms exactly to the characteristics of our test ontologies. Most properties in both test ontologies offer little help to the alignment process, because most property values are in semi-structured format and require natural language processing techniques before they can be used effectively.
3. The percentage of training examples is around 24% (30 divided by 127). This small percentage reflects the feasibility of our method.
4. The adopted ANN structure in OAANN is not complicated because only three semantic aspects are considered. However, this structure is justified via experiment results. If we also consider other relationships in addition to the super/subClassOf ones, multilayer networks might be more appropriate, owing to their representational power.
5. We have tried different settings of initial weights, over a large range from (0.0, 0.0, 1.0) to (1.0, 0.0, 0.0). Our conclusion is that these weights converge to the same values, regardless of their initial values. In addition, varying the learning rate affects the speed of this convergence: the greater the learning rate, the smaller the number of iterations needed before convergence. For example, if η is set to 0.05, 1000 iterations are needed, while if η is set to 0.1, only 700 iterations are necessary.
6. The curve in Figure 2 shows an initial drop, followed by a plateau, which is in turn followed by a second drop. It is reasonable to conclude that the threshold can be assigned the value corresponding to the beginning of the plateau. The intuition is that the semantic similarity between non-equivalent concepts and that between equivalent concepts are different, and this difference can be remarkable enough to form a plateau.
7. Because OAANN adopts vectors to record semantic aspects, it is not difficult to take more relationships into consideration: we only need to expand the current vectors into more dimensions to hold more semantic aspects. Nevertheless, an ANN with multiple layers might be necessary in that case.
Ontologies help in reconciling different views of independently developed and exposed data sources in the biological and biomedical research areas. Due to their inherent heterogeneity, ontologies need to be aligned before they can be integrated and used effectively. We present OAANN, a new alignment algorithm that overcomes some disadvantages of both rule-based and learning-based approaches. Our contributions are: (1) we exploit an approach to learning the weights for different semantic aspects of ontologies, by applying an artificial neural network technique during ontology alignment; and (2) we tackle the difficult problem of carrying out machine learning techniques without help from instance data. We explain and analyze our algorithm in detail, and our promising experimental results verify that OAANN represents a large leap forward towards automating biological ontology alignment.
Our focus has been on locating equivalent concept pairs between two ontologies, leaving other mapping tasks for future work, such as the discovery of parent-child concept pairs and the finding of sibling concept pairs. Another potential direction for future work is to apply our approach to other biological ontologies, anatomy ontologies for example, where the larger file size poses some challenges with regard to OAANN's efficiency.
In our opinion, the semantics of an ontology concept are determined by three aspects: (1) the name of the concept; (2) the properties of the concept; and (3) the relationships of the concept. These three features together specify a conceptual model for each concept from the viewpoint of an ontology designer. For example, in the Pathway ontology , a concept has "altered_metabolic_pathway" as its name, two properties ("comment" and "label"), and seven relationships (subClassOf concept "classic_metabolic_pathway," superClassOf concepts "altered_amino_acid_metabolic_pathway," "altered_carbohydrate_metabolic_pathway," "altered_glycan_metabolic_pathway," "altered_lipid_metabolic_pathway," "altered_metabolic_pathway_of_cofactors_and_vitamins," and "altered_metabolic_pathway_of_other_amino_acids").
Rule-based algorithms usually have the advantage of relatively fast running speed, but share the disadvantage of ignoring the additional information from instance data. In addition, there is a more serious concern for this type of algorithm. In order to obtain a helpful matching result from automated/semi-automated tools, more than one of the three semantic aspects mentioned above should be considered. If only one aspect is taken into account, a meaningful matching result is unlikely to be acquired. Once two or more aspects are considered, corresponding weights for the different aspects must be determined to reflect their different importance (or contributions) in ontology alignment. To the best of our knowledge, most existing rule-based algorithms make use of human heuristics and/or domain knowledge to predefine these weights. Moreover, once the weights are determined, they are unlikely to be updated, and then only by trial and error.
While taking advantage of the extra clues contained in instance data, learning-based algorithms are likely to be slower. In addition, the difficulty of getting enough and/or good-quality data is also a potential problem. For example, both the BiologicalProcess and Pathway ontologies contain barely any instance data. On the other hand, it is very challenging for machines to learn to reconcile ontology structures if they are provided with schema information alone. The most critical challenge is that, because ontologies reflect their designers' conceptual views of part of the world, they exhibit a great deal of diversity. Identical terms can be used to describe different concepts, or, vice versa, different terms can be assigned to the same concept. A more complicated situation is that, even if the same set of terms is adopted, which is almost impossible in real life, different designers can still create different relationships for the same concept, corresponding to their different conceptual views of this concept. Compared with schemas, instance data usually exhibit far less variety.
Based on this insight into the pros and cons of the two approaches, we present a new alignment algorithm, Ontology Alignment by Artificial Neural Network (OAANN), which combines rule-based and learning-based solutions. We integrate machine learning techniques such that the weights of a concept's semantic aspects can be learned from training examples, instead of being predefined ad hoc. In addition, in order to avoid the problem of missing instance data (in either quality or quantity), which is common for real-world ontologies, our weight learning technique is carried out at the schema level instead of the instance level.
Our main idea is that, given a pair of ontologies to be aligned, although a great deal of design diversity might exist, it is still reasonable to assume that the contributions of different semantic aspects to ontology understanding hold across, and are therefore independent of, specific concepts. In fact, the different contributions, which are the foundation for the different weights, are characteristics of ontologies viewed as a whole. That is, during ontology alignment, weights are determined by ontologies, rather than by individual concepts. Therefore, we propose the following hypothesis: it is possible to learn these weights for all concepts from training examples drawn from a subset of concepts.
Ontology alignment consists of many mapping tasks, for example, the discovery of parent-child concept pairs, the finding of sibling concept pairs, etc. OAANN concentrates on finding pairs of equivalent concepts as the first step. In addition, after the successful discovery of equivalent concept pairs, it is not difficult to design an algorithm to merge corresponding ontologies.
There are many kinds of relationships in ontologies, both domain-dependent and domain-independent, e.g., superClassOf, subClassOf, partOf, contains, etc. In this paper, we consider only the super/subClassOf relationships, which are the most common ones in most real-world ontologies. We plan to extend OAANN to include other relationships later. Due to the scalability of our approach (discussed later in this paper), this extension is relatively easy.
We build a 3-dimensional vector for each concept, and each dimension records one semantic aspect, i.e., concept name, concept properties, and concept relationships. When we match two concepts, we compare their contents in these three dimensions, and acquire the corresponding similarity in each dimension. Recall that our goal is to find the equivalent concept pairs.
The similarity s1 between a pair of concept names is a real value in the range of [0, 1]. Some pre-processing, for example the removal of hyphens and underscores, is performed on the two strings before s1 is calculated.
s1 = 1 - d/l, where d stands for the edit distance between the two strings, and l for the length of the longer string.
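A minimal sketch of this name-similarity calculation follows. The function names and the exact pre-processing (lowercasing plus hyphen/underscore removal) are our illustrative choices:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming (Levenshtein) edit distance d."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def name_similarity(n1: str, n2: str) -> float:
    """s1 = 1 - d/l after simple pre-processing."""
    clean = lambda s: s.lower().replace("-", "").replace("_", "")
    n1, n2 = clean(n1), clean(n2)
    l = max(len(n1), len(n2))
    if l == 0:
        return 1.0
    return 1.0 - edit_distance(n1, n2) / l
```

Because d can never exceed the length of the longer string, the result is guaranteed to stay in [0, 1].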
s2 = n/m, where n is the number of pairs of properties matched, and m is the smaller cardinality of the property lists p1 and p2.
In order for a pair of properties (one from p1 and the other from p2) to be matched, their data types should be the same or compatible with each other (float and double, for example), and their property names should have a similarity value greater than a threshold. Notice that here we use the same procedure as in Section to calculate the similarity between a pair of property names. In addition, we adopt the idea of "stable marriage" in determining the matched property pairs. That is, once two properties are considered matched, they have both found the best match from the other property list. Imagine a similarity matrix built between p1 and p2; each time we pick the pair with the maximum value in the matrix, say cell [i, j], and then discard row i and column j.
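The greedy max-then-discard procedure above could be sketched as follows. This is a simplified illustration: the name-similarity function and the threshold are passed in as parameters, and the data-type compatibility check is omitted for brevity.

```python
def match_properties(p1, p2, name_sim, threshold):
    """Repeatedly pick the cell with the maximum similarity in the
    p1 x p2 matrix, record the pair, then discard its row and column.
    Returns the matched pairs and s2 = n / m."""
    matrix = {(i, j): name_sim(a, b)
              for i, a in enumerate(p1) for j, b in enumerate(p2)}
    pairs = []
    while matrix:
        (i, j), s = max(matrix.items(), key=lambda kv: kv[1])
        if s <= threshold:       # names must be *more* similar than the threshold
            break
        pairs.append((p1[i], p2[j]))
        matrix = {k: v for k, v in matrix.items()
                  if k[0] != i and k[1] != j}
    m = min(len(p1), len(p2))
    return pairs, (len(pairs) / m if m else 0.0)
```

Discarding a whole row and column after each pick guarantees that every property is matched at most once, which is what makes the pairing "stable" in the sense described above.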
We take into account only the super/subClassOf relationships. In order to obtain a better matching result, we make use of as much information as possible. For example, suppose there are two pairs of equivalent concepts, and the numbers of concepts in-between are different from each other, i.e., the ontology with more detailed design tends to have more intermediate concepts. If the direct parent alone is considered, the information from this multilayered parent-child hierarchy will be ignored. Therefore, we not only consider the direct parent of a concept, but also all ancestors (concepts along the path from a concept up to the root "Thing") of this concept as well. Descendants (direct and indirect children of a concept) are not taken into account, as that would lead to an infinite loop.
Given two lists of concept ancestors, a1 and a2, their similarity s3 is a real value in the range of [0, 1], and is obtained by first calculating the similarity values for pairwise concepts (one from a1, the other from a2, considering all combinations), then assigning the maximum value to s3. Notice that this is a recursive procedure but is guaranteed to terminate, because (1) the number of concepts is finite; and (2) we assume that "Thing" is a common root for the two ontologies being aligned.
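A simplified, non-recursive sketch of the s3 calculation follows. The full procedure recursively applies the overall concept similarity to ancestor pairs; here a pluggable pairwise similarity function stands in for that recursion.

```python
def ancestor_similarity(a1, a2, concept_sim):
    """s3: the maximum similarity over all ancestor pairs,
    one ancestor drawn from each list (all combinations)."""
    if not a1 or not a2:
        return 0.0
    return max(concept_sim(c1, c2) for c1 in a1 for c2 in a2)
```

Since "Thing" appears in both ancestor lists, the recursion in the full procedure bottoms out at the shared root.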
s = w1·s1 + w2·s2 + w3·s3, where w1 + w2 + w3 = 1. Notice that the wi are randomly initialized and will be adjusted through a learning process (see Sec. 4.5 below).
For two ontologies being matched, O1 and O2, we calculate the similarity values for pairwise concepts (one from O1, the other from O2, considering all combinations). Then we build an n1 × n2 matrix M to record all values calculated, where ni is the number of concepts in Oi. The cell [i, j] in M stores the similarity value between the i-th concept in O1 and the j-th concept in O2.
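Building the matrix could be sketched as below. Here aspect_sims is an assumed callback that returns the triple (s1, s2, s3) for a concept pair, and w holds the current weight vector:

```python
def similarity_matrix(concepts1, concepts2, aspect_sims, w):
    """Build the n1 x n2 matrix M; cell [i][j] holds the weighted
    similarity w1*s1 + w2*s2 + w3*s3 between the i-th concept of the
    first ontology and the j-th concept of the second."""
    M = []
    for c1 in concepts1:
        row = []
        for c2 in concepts2:
            s1, s2, s3 = aspect_sims(c1, c2)
            row.append(w[0] * s1 + w[1] * s2 + w[2] * s3)
        M.append(row)
    return M
```

The same function serves both phases of the algorithm: first with randomly initialized weights, and later to recalculate the matrix with the learned weights.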
The main purpose of OAANN is to try to learn different weights for three semantic aspects during the ontology alignment process. We design our learning problem as follows.
Task T: align two ontologies (in particular, find equivalent concept pairs)
Performance measure P: Precision and Recall measurements with regard to manual matching
Training experience E: a set of equivalent concept pairs by manual matching
Target function V: a pair of concepts → ℜ
Target function representation: V = w1·s1 + w2·s2 + w3·s3
We choose ANN as our learning technique, based on the following considerations.
Instances are represented by attribute-value pairs
The target function output is a real-valued one
Fast evaluation of the learned target function is preferable
Initially, we obtain a concept similarity matrix for the two ontologies, with the wi initialized randomly. Then we randomly pick a set of concepts from one ontology and find the corresponding equivalent concepts in the other by manual matching. Each such manually matched pair is processed by OAANN: the similarity values in name, properties, and ancestors for the two concepts are calculated and used as a training example for the network in Figure 5.
E(w) = (1/2) Σ_{d ∈ D} (t_d - o_d)^2, where D is the set of training examples, t_d is the target output for training example d, and o_d is the output of the network for d.
Δw_i = η (t_d - o_d) s_id, where η is the learning rate, and s_id is the s_i value for a specific training example d.
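A sketch of this per-example (delta-rule) update for a single linear unit follows. The per-epoch renormalization that keeps the weights summing to 1, and the fixed epoch count, are our illustrative assumptions rather than details specified by OAANN:

```python
import random

def train_weights(examples, eta=0.1, epochs=500):
    """examples: list of ((s1, s2, s3), target) pairs.
    Applies w_i <- w_i + eta * (t_d - o_d) * s_id for each example,
    renormalizing after each epoch so the weights sum to 1."""
    w = [random.random() for _ in range(3)]
    w = [x / sum(w) for x in w]          # start from a random unit-sum vector
    for _ in range(epochs):
        for s, t in examples:
            o = sum(wi * si for wi, si in zip(w, s))
            w = [wi + eta * (t - o) * si for wi, si in zip(w, s)]
        total = sum(w)
        if total > 0:
            w = [x / total for x in w]
    return w
```

On noiseless training data generated by a fixed weight vector, the learned weights converge to that vector regardless of initialization, mirroring the convergence behavior reported in our experiments.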
After we obtain the learned weights, we apply them to recalculate the similarity matrix. We then pick a threshold (see next section for details) and output the equivalent concept pairs according to this threshold. The resultant equivalent concept pairs between the two ontologies are then presented to domain experts for verification.
This work is partly supported by IRG 97-219-08 from the American Cancer Society, GC-3319-05-44598CM, M01 RR01070, 5 P20 RR017696-05 and PhRMA Foundation Research Starter Grant to W. J. Zheng.
This article has been published as part of BMC Genomics Volume 9 Supplement 2, 2008: IEEE 7th International Conference on Bioinformatics and Bioengineering at Harvard Medical School. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/9?issue=S2
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.