Hypotheses generation as supervised link discovery with automated class labeling on large-scale biomedical concept networks
© Katukuri et al; licensee BioMed Central Ltd. 2012
Published: 11 June 2012
Computational approaches to generate hypotheses from biomedical literature have been studied intensively in recent years. Nevertheless, it remains a challenge to automatically discover novel, cross-silo biomedical hypotheses from large-scale literature repositories. To address this challenge, we first model a biomedical literature repository as a comprehensive network of biomedical concepts and formulate hypotheses generation as a process of link discovery on the concept network. We extract the relevant information from the biomedical literature corpus and generate a concept network and a concept-author map on a cluster using the Map-Reduce framework. We extract a set of heterogeneous features such as random walk based features, neighborhood features and common author features. Because the number of candidate links in our concept network is large, the features are also extracted on a cluster using the Map-Reduce framework to address the scalability problem. We further model link discovery as a classification problem carried out on a training data set automatically extracted from two network snapshots taken in two consecutive time durations. A set of heterogeneous features, covering both topological and semantic features derived from the concept network, has been studied with respect to its impact on the accuracy of the proposed supervised link discovery process. A case study of hypotheses generation based on the proposed method is presented in the paper.
Text mining of biomedical literature is a research area that has attracted a lot of attention in the last 5 to 10 years. Swanson was one of the early proponents of hypotheses discovery from biomedical literature. In his pioneering work, Swanson discovered a novel connection between Raynaud's disease and fish oil by examining two disjoint biomedical literature sets. The hypothesis of the beneficial effect of fish oil on Raynaud's disease was confirmed by an independent clinical trial two years later, which demonstrated the value of biomedical literature mining in scientific discovery. Swanson's hypothesizing model, the so-called ABC model, can be simply described as: A relates to B, B relates to C, therefore A may relate to C. Since Swanson's discovery, much research has aimed at automating and refining the ABC model [1, 3–8]. Nevertheless, most of these approaches analyze the retrieval result set for one or two initial topics provided as a query by a user, and do not scale up to the whole literature database for the purpose of discovering real, novel and cross-silo biomedical hypotheses.
In recent years, link discovery has been extensively studied on social networks such as those obtained from Facebook data and on bibliographic databases such as the one maintained by DBLP. As an important problem of link mining, link discovery refers to the discovery of future links between objects (or nodes) that are not directly connected in the current snapshot of a given network. Őzgür and his colleagues applied a link discovery technique to generate hypotheses on relationships between genes and vaccines. This work first extracted networks of gene-gene interactions and gene-vaccine interactions from literature with the help of gene and vaccine ontologies, then analyzed the networks by computing different types of centrality measures for each node. Given its restricted focus on gene and vaccine relationships, this work by its nature was not designed for cross-silo biomedical discovery.
In order to address the challenge of large-scale cross-silo biomedical hypotheses discovery, in this paper we first model a biomedical literature repository as a comprehensive network of biomedical concepts belonging to different semantic types. We then extract such a large-scale concept network from Medline. We further calculate a variety of topological and semantic features from the concept network and model hypotheses discovery as a classification problem based on those features. Moreover, in order to automatically build the classification model for prediction, we take two snapshots of the concept network corresponding to two consecutive time durations, such that a training data set can be formed from a group of labeled concept pairs that are automatically extracted from the snapshots. We extract multiple heterogeneous features for the labeled concept pairs solely from the first snapshot of the concept network. The impact of those heterogeneous features on hypotheses discovery has been studied.
The rest of the paper will be organized as follows. In the Related work section, we briefly describe relevant works in biomedical hypotheses discovery and link mining. In the Hypotheses generation as supervised link discovery on biomedical concept network section, we formulate hypotheses generation from literature as link discovery in a concept network and further model the link discovery as a supervised learning process based on a set of topological and semantic features. In the Concept network creation and feature extraction using Map-Reduce framework section, we address the challenges of extracting large-scale concept networks from literature corpus. We also address the challenges involved in automatically generating labeled data and extracting heterogeneous features for a large number of labeled data using Map-Reduce framework. In the Experimental results section, we present experimental results. Finally, we conclude our paper with the Conclusions section.
Swanson's pioneering work in 1986 on biomedical hypotheses generation led to the discovery of the novel connection between Raynaud's disease and fish oil by examining two disjoint biomedical literature sets (Swanson). In his follow-up work in 1990, Swanson suggested a trial-and-error search strategy, by which the ABC model guides a manual online search for identifying logically related non-interactive literature (Swanson). By applying this strategy to biomedical literature analysis, Swanson discovered other novel biomedical hypotheses, such as the implicit connection between the blood levels of Somatomedin C and the dietary amino acid arginine (Swanson [7, 11]), and the hidden link between the mineral magnesium and the treatment of the medical problems that cause migraine headaches (Swanson).
Along with the advances in text retrieval and mining techniques, researchers have made several efforts to partially automate Swanson's ABC model for hypotheses generation. Stegmann and Grohmann proposed a way to guide a researcher to identify a set of promising B terms by conducting clustering analyses of terms on both the retrieval result set of topic A and the retrieval result set of topic C (Stegmann et al.). Their work used measures called centrality and density to evaluate the goodness of term clusters and showed that the promising B terms that link disjoint literature for topics A and C tend to appear in clusters of low centrality and density. Srinivasan's approach to identifying promising B terms starts with building a profile for each of topic A and topic C from the retrieval result sets of A and C. In her work, the profile of a topic consists of terms that have high frequency in the retrieval result set of that topic and belong to semantic types of interest to the user. The intersection of A's profile with C's profile then generates the candidate B terms. The process of identifying B terms from given topics A and C is called closed discovery. Srinivasan also applies the topic profile idea to conduct open discovery, which identifies both B terms and C terms given only topic A. Srinivasan's open discovery algorithm can be simply described as follows. Top-ranking B terms are selected from the profile of topic A. Then, a profile for each selected B term is created from the retrieval result set of that B term. The top-ranking terms in a B term's profile form candidate C terms. If topic A's retrieval result set is disjoint from a candidate C term's retrieval result set, then this candidate C term is reported as having a potential relationship with topic A via term B.
Slightly different from Srinivasan's topic profile approach, Pratt and Yildiz directly applied association mining on the retrieval result set of topic A to conduct open discovery. In their work, logical inference based on two association rules A→B and B→C leads to the finding of a candidate C term.
One of the problems that almost all hypotheses generation approaches face is the large number of spurious hypotheses generated in the process of automating Swanson's ABC model. In order to eliminate the spurious hypotheses, different components of the biomedical ontology system UMLS have been utilized. Weeber et al. used the Metathesaurus of the UMLS to extract biomedical phrases and further limited the desired phrases by using the semantic types of the UMLS as an additional filter. Similar strategies are widely used in most of the follow-up research. Zhang et al. used the semantic network, another UMLS component that specifies possible relations among different semantic types, in order to restrict the association rules generated from the retrieval result set of topic A in the process of open discovery. Besides utilizing the biomedical ontology system, we envision that cross-repository validation may be another effective addition for eliminating spurious hypotheses.
No matter whether designed for closed discovery or open discovery, the described works are still constrained in the category of automating and refining Swanson's ABC hypothesizing model. Furthermore, all the approaches are based on retrieval result set of one or two initial topics provided by a user, instead of being able to scale up to the whole set of topics within a literature database for the purpose of discovering real, novel and cross-silo biomedical hypotheses.
If we model a biomedical literature repository as a comprehensive network of biomedical concepts belonging to different semantic types, the link discovery techniques may enable large-scale, cross-silo hypotheses discovery that goes beyond information retrieval-based discovery. Link discovery has been extensively studied on social networks such as Facebook, and bibliographic databases such as DBLP in recent years. As an important problem of link mining, link discovery refers to the discovery of future links between objects that are not directly connected in the current snapshot of a given network. In the following, we briefly review those link discovery techniques that are relevant to our work.
In the paper by Faloutsos et al., the authors proposed a measure called effective conductance to evaluate the goodness of a connection subgraph. Later, in the paper by Koren et al., an improved measure called cycle free effective conductance was proposed, which uses only the cycle free paths in computing the proximity. This measure guarantees that high degree intermediate nodes in the paths do not increase the proximity between two nodes unreasonably. The paper by Liben-Nowell and Kleinberg discussed the problem of link prediction in social networks. It was one of the early works on link prediction that addressed the question of to what extent new collaborations (links) can be predicted by using the topology of the network. This work used an unsupervised approach to predict the links based on several network topology features in co-authorship networks. The paper by Al Hasan et al. used a supervised learning approach for co-authorship link prediction based on simple neighborhood features, without factoring in any random walk features like effective conductance. Simple neighborhood features have several limitations compared to random walk features: they cannot predict connecting paths of length greater than two (Benchettara et al.), nor can they discriminate significant (good) paths from the set of all neighborhood nodes. The paper by Benchettara et al. used the bipartite nature of publication networks in a supervised learning framework. The paper by Savas et al. addressed the link discovery problem based on the number of paths of different lengths from multiple sources that exist between two nodes. However, this work did not factor in the different degrees of significance that different paths may have. Őzgür and his colleagues applied a link discovery technique to generate hypotheses on relationships between genes and vaccines.
This work first extracted networks on gene-gene interactions and gene-vaccine interactions from literature with the help of gene and vaccine ontology; then analyzed the networks based upon different centrality measures calculated for each node in the networks. Given its limited focus on gene and vaccine relationships, this work by its nature was not designed for cross-silo biomedical discovery.
We model a biomedical literature repository as a concept network G, where each node represents a biomedical concept that belongs to a certain semantic type, and each edge represents a relationship between two concepts. Each node and each edge carries a weight that reflects its significance. In this work, we use the document frequency of a given node as its weight, and the co-occurrence frequency of the two end nodes as the weight of the corresponding edge. The hypotheses generation problem can now be formulated as the process of link discovery on the concept network, i.e., the process of discovering all those pairs of nodes which are not directly connected in the current concept network but will be directly connected in the future. We further model link discovery on the concept network as a process of supervised learning in which a training data set is automatically generated from the concept network, without class label assignments by domain subject experts. More specifically, we take two snapshots, namely $G_{t_f}$ and $G_{t_s}$, of the concept network corresponding to two consecutive time durations $t_f$ and $t_s$; that is, $t_f$ is the first time duration and $t_s$ is the second. We automatically collect a group of concept pairs that are not directly connected in $G_{t_f}$ and label each pair as either positive or negative. A concept pair is assigned the class label positive if it is directly connected in $G_{t_s}$, and negative otherwise. For each collected pair, we further extract a set of features from $G_{t_f}$, such that a classification model can be built by using part of the labeled pairs as the training data. Once the classification model is learned, it can be used to predict the appearance of a new edge at a future time between two nodes that are not currently directly connected. The quality of the classification model depends on the features we can extract for the labeled pairs.
Existing work in link discovery typically uses different types of topological features. We examine two types of topological features, namely random walk based and neighborhood based. Besides topological features, we also propose two semantically-enriched features, namely Semantic CFEC and Author List Jaccard. In the following, we will describe both topological and semantically-enriched features in detail.
Given a collected pair of nodes (s, t), we consider the following aspects of topology related to s and t: 1. the neighborhood of s and t; 2. the paths between s and t. To describe the neighborhood of s and t, the following measures are calculated:
where τ (s) and τ (t) are the set of neighboring concepts for concepts s and t respectively.
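The formulas for the neighborhood measures did not survive in this text, so the following is only a minimal sketch of measures that are common in the link prediction literature and are defined over τ(s) and τ(t). Preferential attachment is named in the Conclusions; common neighbors and the Jaccard coefficient are assumptions about what the lost list contained, and the function name and toy graph are hypothetical.

```python
from typing import Dict, Set


def neighborhood_features(adj: Dict[str, Set[str]], s: str, t: str) -> Dict[str, float]:
    """Compute simple neighborhood measures for the pair (s, t).

    adj maps each concept to tau(concept), its set of neighboring concepts.
    """
    tau_s = adj.get(s, set())
    tau_t = adj.get(t, set())
    common = tau_s & tau_t
    union = tau_s | tau_t
    return {
        # |tau(s) ∩ tau(t)|
        "common_neighbors": len(common),
        # |tau(s) ∩ tau(t)| / |tau(s) ∪ tau(t)|
        "jaccard": len(common) / len(union) if union else 0.0,
        # |tau(s)| * |tau(t)|
        "preferential_attachment": len(tau_s) * len(tau_t),
    }


# Toy undirected concept network (illustrative data only).
adj = {
    "A": {"B", "C"},
    "D": {"B", "E"},
    "B": {"A", "D"},
    "C": {"A"},
    "E": {"D"},
}
features = neighborhood_features(adj, "A", "D")
print(features)
```

Here A and D are not directly connected but share the neighbor B, which is the situation these measures are meant to score.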
To describe the paths between s and t, we examine the following features.
Number of paths: the more paths there are between s and t, the more likely a future edge between s and t.
Distance between s and t: the longer it takes to reach s from t, the less likely a future edge between s and t.
From the above equation it is evident that shorter paths are preferred.
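The equation referred to here is missing from this text. As a hedged illustration of why shorter paths contribute more, the sketch below follows the commonly stated definition of cycle free effective conductance from Koren et al.: the weighted degree of s times the sum, over cycle-free paths from s to t, of the product of transition probabilities w(u, v) / deg(u) along the path. The path length bound `max_len`, the function names and the toy graph are assumptions for illustration, not the authors' implementation.

```python
from typing import Dict

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: edge weight}


def weighted_degree(g: Graph, v: str) -> float:
    """Sum of the weights of all edges incident to v."""
    return sum(g[v].values())


def cfec(g: Graph, s: str, t: str, max_len: int = 3) -> float:
    """Approximate CFEC(s, t) using cycle-free paths with at most max_len edges."""
    total = 0.0

    def walk(node: str, visited: frozenset, prob: float, hops: int) -> None:
        nonlocal total
        if node == t:
            total += prob  # a complete cycle-free path from s to t
            return
        if hops == max_len:
            return
        deg = weighted_degree(g, node)
        for nbr, w in g[node].items():
            if nbr not in visited:  # never revisit a node: paths stay cycle free
                walk(nbr, visited | {nbr}, prob * w / deg, hops + 1)

    walk(s, frozenset({s}), 1.0, 0)
    return weighted_degree(g, s) * total


# Toy weighted, undirected graph (illustrative data only).
g = {
    "s": {"a": 2.0, "b": 2.0},
    "a": {"s": 2.0, "t": 1.0},
    "b": {"s": 2.0, "t": 3.0},
    "t": {"a": 1.0, "b": 3.0},
}
score = cfec(g, "s", "t")
print(score)
```

Every additional hop multiplies the path probability by a factor of at most 1, so long paths add little to the score.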
The above measures only evaluate features related to network topology. However, each node that represents a biomedical concept is also associated with rich semantic information. In this work, we consider the following two types of semantic information for a given node: its semantic type and its related author information.
In this section, we describe the implementation of the computational model presented in the Hypotheses generation as supervised link discovery on biomedical concept network section. The major challenge in implementing such a computational model is the need to process a huge amount of data. We use the Map-Reduce framework to implement the following three major components: 1) extract a comprehensive biomedical concept network from the abstracts of all Medline papers published within 1990-2010; 2) generate labeled pairs from two consecutive snapshots of the concept network; and 3) for each labeled concept pair, extract all the features described in the subsections titled Topological features and Semantically-enriched features.
Each node of the concept network represents a biomedical concept and carries the following information: semantic type, related authors, and document frequency. Each edge of the concept network represents the co-occurrence of its two end nodes in the same documents. An edge carries the following information: the strength of the edge (i.e., the frequency of co-occurrence of the two end nodes), and the duration of the edge. The concept network is stored by using the following data structures.
Concept-Document Map (CDM): The key of an entry in this map is a concept 'c' and a year 'y', and the value of an entry is a set of document ids (PMIDs), where a PMID is the ID of a Medline paper in which concept c appears and the year is the publication year of this paper. Given a time duration t, we can easily derive a snapshot of CDM for t, denoted as CDM_t, by taking the union of all the PMIDs for the keys 〈c, y〉 where the year 'y' is within the given time duration t. To generate this map in the Map-Reduce framework, each mapper processes a subset of the document collection and sends the tuple 〈concept, year〉 as the key and the document list as the value to the reducers. Reducers aggregate the document set for a given concept and year.
Concept-Concept Matrix (CCM): We compute concept-concept associations from the set of concepts extracted from each PMID. That is, for each concept, we compute the co-occurring concepts within the same document. For each concept-concept association, we compute the co-occurrence frequency in each year. Algorithm 1 describes the implementation of CCM in the Map-Reduce framework.
Concept-semantic Type: We extract the semantic type from UMLS Metathesaurus for each of the concepts.
Concept-Author Map (CAM): The key of an entry in this map is a concept 'c' and year 'y', and the value of an entry is a set of authors. This map provides the set of authors who have published a document containing the given concept 'c' in a given year 'y'. Given a time duration t, we can easily derive a snapshot of CAM for t, denoted as CAM t , by taking a union of all the authors for the keys 〈c, y〉, where the year 'y' is within the given time duration t. To generate this map in Map-Reduce framework each of the mappers processes a subset of the document collection and sends the tuple 〈concept, year〉 as the key and author set as the value to reducers. Reducers aggregate the author set for a given concept and year.
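The two maps described above can be sketched in plain Python, with the mapper and reducer stages simulated in memory instead of on a cluster. The record layout (`pmid`, `year`, `concepts`, `authors`), the helper names and the sample data are hypothetical. The last function assumes that the Author List Jaccard feature is the Jaccard coefficient of the two concepts' author sets in the snapshot, which the name suggests but this text does not spell out.

```python
from collections import defaultdict


def build_map(docs, value_of):
    """Group values by (concept, year), mimicking the map and reduce stages."""
    out = defaultdict(set)
    for doc in docs:  # map stage: one document at a time
        for concept in doc["concepts"]:
            # reduce stage: union of all values that share a key
            out[(concept, doc["year"])].update(value_of(doc))
    return out


def snapshot(m, concept, years):
    """Union of the values stored for `concept` over all years in the duration."""
    result = set()
    for y in years:
        result |= m.get((concept, y), set())
    return result


def author_jaccard(cam, s, t, years):
    """Jaccard coefficient of the author sets of s and t (assumed definition)."""
    a_s, a_t = snapshot(cam, s, years), snapshot(cam, t, years)
    union = a_s | a_t
    return len(a_s & a_t) / len(union) if union else 0.0


# Illustrative records only; not taken from Medline.
docs = [
    {"pmid": 1, "year": 1991, "concepts": ["fish oil", "blood viscosity"], "authors": ["Lee", "Kim"]},
    {"pmid": 2, "year": 1993, "concepts": ["blood viscosity", "Raynaud disease"], "authors": ["Kim", "Park"]},
    {"pmid": 3, "year": 1997, "concepts": ["fish oil"], "authors": ["Cho"]},
]
cdm = build_map(docs, lambda d: {d["pmid"]})
cam = build_map(docs, lambda d: set(d["authors"]))
duration = range(1991, 1996)

doc_frequency = len(snapshot(cdm, "fish oil", duration))
shared_authors = author_jaccard(cam, "fish oil", "Raynaud disease", duration)
print(doc_frequency, shared_authors)
```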
Algorithm 1: Generating concept-concept matrix
Data: Document Corpus
Result: Concept-Concept Matrix
initialization: CCM_local is a local matrix;
for each mapper m do
    for each document d_i in the document corpus of mapper m do
        c(i) ← set of concepts extracted from d_i;
        y_i ← published year of d_i;
        for each concept pair (c_k, c_l) of c(i) do
            CCM_local[c_k, c_l, y_i] ← CCM_local[c_k, c_l, y_i] + 1;
    for each entry (c_k, c_l, y_i) in CCM_local do
        key ← (c_k, c_l, y_i);
        value ← CCM_local[c_k, c_l, y_i];
        return (key, value);
for each key (c_k, c_l, y_i) and its list of values received by a reducer do
    CCM(c_k, c_l, y_i) ← sum of the values;
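A minimal in-memory sketch of Algorithm 1 follows, with each mapper counting co-occurrences in its own shard and a single reducer summing the partial counts. Storing each unordered pair once in sorted order is an assumption made here for compactness; the function names and the sample shards are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_cooccurrence(docs):
    """Mapper: count concept pairs per publication year within one shard."""
    local = Counter()
    for doc in docs:
        concepts = sorted(set(doc["concepts"]))  # canonical order, no duplicates
        for ck, cl in combinations(concepts, 2):
            local[(ck, cl, doc["year"])] += 1
    return local


def reduce_cooccurrence(partials):
    """Reducer: sum the partial counts emitted by all mappers for each key."""
    ccm = Counter()
    for partial in partials:
        ccm.update(partial)  # Counter.update adds counts instead of replacing them
    return ccm


# Two shards of illustrative documents, one shard per mapper.
shard1 = [
    {"year": 1991, "concepts": ["A", "B", "C"]},
    {"year": 1991, "concepts": ["A", "B"]},
]
shard2 = [
    {"year": 1991, "concepts": ["B", "A"]},
    {"year": 1992, "concepts": ["A", "C"]},
]
ccm = reduce_cooccurrence([map_cooccurrence(shard1), map_cooccurrence(shard2)])
print(ccm)
```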
Given a comprehensive concept network stored in the above data structures, we apply Algorithm 2 to derive a snapshot of the concept network for a given time duration t in Map-Reduce framework. A snapshot of the concept network is stored in a graph data structure.
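The snapshot derivation that Algorithm 2 specifies can be sketched on a single machine as follows. The dictionary layouts for CCM and CDM_t, the function name and the sample data are assumptions for illustration; edges are keyed by a sorted concept pair so that each undirected edge is stored once.

```python
def build_snapshot(ccm, cdm_t, years):
    """Build the nodes and edges of the concept network for one time duration.

    ccm:   {(c_i, c_j, year): co-occurrence count}
    cdm_t: {concept: set of PMIDs within the duration}
    years: container of the years that belong to the duration
    """
    nodes = {}  # concept -> document frequency within the duration
    edges = {}  # (c_i, c_j) with c_i <= c_j -> accumulated strength
    for (ci, cj, year), val in ccm.items():
        if year not in years:
            continue
        for c in (ci, cj):
            if c not in nodes:
                nodes[c] = len(cdm_t.get(c, ()))
        key = (ci, cj) if ci <= cj else (cj, ci)
        edges[key] = edges.get(key, 0) + val  # sum the yearly counts
    return nodes, edges


# Illustrative data only.
ccm = {("A", "B", 1991): 3, ("A", "B", 1993): 2, ("A", "C", 1997): 4}
cdm_t = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5, 6}}
nodes, edges = build_snapshot(ccm, cdm_t, range(1991, 1996))
print(nodes, edges)
```

The 1997 entry falls outside the duration, so concept C does not appear in this snapshot.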
Algorithm 2: Generating the snapshot of the concept network, G_t, for a time duration t
Data: Concept-Concept Matrix CCM, Concept-Document Map CDM, time duration t
Result: Snapshot of the Concept Network for a time duration t
initialization: create CDM_t, the snapshot of CDM for the time duration t;
for each 〈key(c_i, c_j, y_k), value(val)〉 in CCM do
    if y_k ∈ t then
        if no node exists for c_i then
            create a node v_i for c_i; v_i.name = c_i;
            v_i.frequency = (CDM_t.get(c_i)).size();
        if no node exists for c_j then
            create a node v_j for c_j; v_j.name = c_j;
            v_j.frequency = (CDM_t.get(c_j)).size();
        if no edge links 〈v_i, v_j〉 then
            create an edge e_ij between v_i and v_j;
            e_ij.strength = 0;
        e_ij.strength = e_ij.strength + val;
Given two snapshots $G_{t_f}$ and $G_{t_s}$ of the concept network corresponding to two consecutive time durations $t_f$ and $t_s$, we generate a group of labeled pairs from which a training data set can be formed for the proposed supervised link discovery. The following process describes how we automatically assign class labels to concept pairs without any involvement of subject domain experts.
For a pair of nodes (i, j) that is not directly connected in $G_{t_f}$, we categorize its possible connection situations in $G_{t_s}$ as follows:
Connection is strong in $G_{t_s}$: There is an edge between i and j in $G_{t_s}$, namely e_ij, and we have e_ij.strength ≥ min_support.
Connection is emerging in $G_{t_s}$: There is an edge between i and j in $G_{t_s}$, namely e_ij, and we have margin × min_support ≤ e_ij.strength < min_support, where 0 < margin < 1.
Connection is weak in $G_{t_s}$: There is an edge between i and j in $G_{t_s}$, namely e_ij, and we have e_ij.strength < margin × min_support, where 0 < margin < 1.
No direct connection in $G_{t_s}$: There is no edge between i and j in $G_{t_s}$.
Given a pair of nodes that has no direct connection in $G_{t_f}$, we assign the class label positive to it if this pair's connection is strong in $G_{t_s}$, and the class label negative if its connection is weak in $G_{t_s}$ or there is no direct connection in $G_{t_s}$. If the pair's connection in $G_{t_s}$ is emerging, its class label should be emerging; however, we do not consider this class in this work. The major challenge in generating labeled pairs is the huge number of pairs that are not directly connected in $G_{t_f}$. To address this issue, we use the following procedure to generate labeled pairs.
For each pair whose connection is strong in $G_{t_s}$, if it has no direct connection in $G_{t_f}$, assign positive to this pair.
For each pair whose connection is weak in $G_{t_s}$, if it has no direct connection in $G_{t_f}$, assign negative to this pair.
Select a random sample of the nodes in $G_{t_f}$ and generate concept pairs from the selected random sample. If a pair has no connection in both $G_{t_f}$ and $G_{t_s}$, assign negative to it.
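The three-step labeling procedure above can be sketched as follows, assuming each snapshot is given as a dictionary from a sorted concept pair to its edge strength. The function name, the parameters `sample_size` and `seed`, and the toy data are hypothetical. Emerging pairs are skipped, as in the text.

```python
import random
from itertools import combinations


def label_pairs(edges_f, edges_s, nodes_f, min_support, margin, sample_size, seed=0):
    """Assign positive/negative labels to pairs that are unconnected in the first snapshot.

    edges_f, edges_s: {(a, b) with a < b: strength} for the first and second snapshot.
    nodes_f: iterable of the concepts present in the first snapshot.
    """
    labels = {}
    for pair, strength in edges_s.items():
        if pair in edges_f:
            continue  # already connected in the first snapshot: not a candidate
        if strength >= min_support:
            labels[pair] = "positive"  # strong in the second snapshot
        elif strength < margin * min_support:
            labels[pair] = "negative"  # weak in the second snapshot
        # otherwise the connection is emerging and is left unlabeled

    # Pairs from a random node sample that are unconnected in both snapshots.
    rng = random.Random(seed)
    pool = sorted(nodes_f)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    for a, b in combinations(sorted(sample), 2):
        if (a, b) not in edges_f and (a, b) not in edges_s:
            labels[(a, b)] = "negative"
    return labels


# Illustrative snapshots only.
edges_f = {("A", "B"): 5}
edges_s = {("A", "B"): 9, ("A", "C"): 12, ("B", "C"): 1, ("A", "D"): 6}
labels = label_pairs(edges_f, edges_s, {"A", "B", "C", "D", "E"},
                     min_support=8, margin=0.5, sample_size=5)
print(labels)
```

With min_support = 8 and margin = 0.5, the pair (A, D) with strength 6 falls in the emerging band and receives no label.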
The number of labeled pairs generated from a large-scale concept network can be huge. Furthermore, the numbers of positive pairs and negative pairs can be highly unbalanced. To address these issues, we randomly select a certain portion of the positive and negative pairs to form a training data set.
For each labeled concept pair, we extract all the features described in the subsections titled Topological features and Semantically-enriched features from the snapshot of the concept network $G_{t_f}$. Because the number of labeled pairs is large, feature extraction is also a computationally expensive step. To address this problem, feature extraction is implemented on the Map-Reduce framework. The distributed implementation of feature extraction can be described in the following way:
Trim $G_{t_f}$ such that it only contains edges with strength greater than or equal to the minimum support. Store the trimmed $G_{t_f}$ in the main memory of each mapper. After trimming, $G_{t_f}$ is much smaller, so it is feasible to store it in memory.
Distribute the labeled pairs among the mappers. Each mapper extracts the features for a subset of concept pairs using the trimmed $G_{t_f}$.
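The two steps above can be sketched in memory as follows, with each "mapper" being a plain function call over its share of the labeled pairs. The common neighbor count stands in for the full feature set, and the round-robin partitioning, function names and sample data are assumptions for illustration.

```python
from collections import defaultdict


def trim(edges, min_support):
    """Keep only edges whose strength is at least min_support."""
    return {pair: w for pair, w in edges.items() if w >= min_support}


def adjacency(edges):
    """Build neighbor sets from an undirected edge dictionary."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def mapper(adj, pairs):
    """Extract one stand-in feature (common neighbor count) for each assigned pair."""
    return {(s, t): len(adj.get(s, set()) & adj.get(t, set())) for s, t in pairs}


# Illustrative first-snapshot edges and labeled pairs.
edges = {("A", "B"): 10, ("B", "C"): 9, ("A", "C"): 2, ("C", "D"): 8}
pairs = [("A", "C"), ("B", "D"), ("A", "D")]

adj = adjacency(trim(edges, min_support=8))  # trimmed graph shared by all mappers
n_mappers = 2
chunks = [pairs[i::n_mappers] for i in range(n_mappers)]  # round-robin distribution

results = {}
for chunk in chunks:  # on a cluster, each chunk would be handled by a separate mapper
    results.update(mapper(adj, chunk))
print(results)
```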
We study the following aspects of our proposed methodology in our experimental set-up:
1. The performance of the proposed supervised link discovery approach. More specifically, we evaluate whether the proposed approach is able to conduct reasonable predictions on concept links that are currently weak or non-existing but may become strong in the future. Since predictions are carried out based on a classification model that is built upon a training data set extracted from two consecutive snapshots of the concept network, the performance of link discovery can be evaluated by measures such as classification accuracy, recall, and precision as results of n-fold cross validation on the training data.
2. The effect of the parameters min_support and margin on the performance of link discovery. These two parameters are used in generating class labels for the concept pairs of the training data.
3. The effect of the proposed features for each concept pair, such as CFEC, Semantic-CFEC and Author-Jaccard, on the performance of link discovery.
4. The effect of using different snapshots of the concept network to generate training data. For this purpose, we first take three consecutive snapshots of the concept network, each of which spans a 5-year period; then generate the first training data set from the first two snapshots and the second training data set from the last two snapshots. Accordingly, we compare the performance of classification models built on these two training sets.
5. The effects of different supervised learning methods on the performance of link discovery. For this purpose, we experiment with two typical supervised learning methods: the C4.5 decision tree and the Support Vector Machine (SVM). A decision tree generates results that are easy to interpret, whereas SVM is well regarded for its strong performance in various applications.
Table: Statistics of the generated concept network. Rows: total number of concept pairs, total number of documents, total number of concepts.
Table column headers: number of instances, test support value.
We first applied the C4.5 decision tree to the training data set generated from the first two snapshots to study the effects of the parameters and the proposed features on the performance of the proposed approach. We then studied the performance of the C4.5 decision tree built on both training data sets, and finally compared the performance of the C4.5 decision tree and SVM on both training data sets. A 10-fold cross validation was used to evaluate classification accuracy, recall, precision and F-measure in all experiments.
We generated the labeled pairs by using the procedure described in the Automatic generation of class labels for concept pairs section with different values for the variable min_support and for the variable margin. The number of positive instances and the negative instances generated for training purpose is highly unbalanced. Table 2 shows the number of positive and negative examples for different values of min_support. Given the fact that unbalanced data sets are difficult to train on, we performed an under-sampling of the majority class.
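Random under-sampling of the majority class can be sketched as follows. The text does not state the target class ratio, so the 1:1 ratio used here is an assumption, as are the function name, the `seed` parameter and the sample data.

```python
import random


def undersample(instances, seed=0):
    """Randomly drop majority-class instances until both classes have equal size.

    instances: list of (pair, label) tuples with label "positive" or "negative".
    """
    pos = [x for x in instances if x[1] == "positive"]
    neg = [x for x in instances if x[1] == "negative"]
    minority, majority = sorted((pos, neg), key=len)
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))  # sampling without replacement
    return minority + kept


# Illustrative unbalanced data: 3 positive and 10 negative instances.
instances = [((f"p{i}", "x"), "positive") for i in range(3)]
instances += [((f"n{i}", "x"), "negative") for i in range(10)]
balanced = undersample(instances)
print(len(balanced))
```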
If we consider the time duration from 1991 to 1995, there exists no Medline record in this time duration that mentions both "Prostatic Neoplasms" and "NF-κB inhibitor alpha". The document frequency of "Prostatic Neoplasms" in this time duration is 6807, whereas the document frequency of "NF-κB inhibitor alpha" is 91. However, the co-occurrence frequencies of this concept pair are 15 and 42 in the MEDLINE corpus for the time durations 1996-2000 and 2001-2005, respectively. It is worthwhile to study whether the supervised learning model built from the first training data set is able to predict the strong connection between these two concepts after 1995.
Recall that, in our experimental study, the first training data set was formed by randomly selecting 10% of the labeled pairs generated from the concept network snapshots $G_{t_1}$ (1991-1995) and $G_{t_2}$ (1996-2000). We first made sure that the pair "Prostatic Neoplasms" and "NF-κB inhibitor alpha" was not part of the first training data set. We then ran the supervised learning model built on the first training data set to make a prediction for this pair. The model successfully predicted the strong connection between these two concepts after 1995 by assigning a positive class label to this pair.
Significant paths using cycle free effective conductance feature
Prostatic Neoplasms → Tumor Necrosis Factor-alpha → NF-κB inhibitor alpha
Prostatic Neoplasms → RNA, Messenger → NF-κB inhibitor alpha
Prostatic Neoplasms → Adenosine Triphosphate → NF-κB inhibitor alpha
Prostatic Neoplasms → NF-κB inhibitor alpha
Prostatic Neoplasms → Tetradecanoylphorbol Acetate → NF-κB inhibitor alpha
Prostatic Neoplasms → NF-κB inhibitor alpha
Modeling a biomedical literature repository as a comprehensive network of biomedical concepts, and viewing hypotheses generation as a process of automated link discovery on that concept network, opens the door to large-scale cross-silo biomedical hypotheses discovery. We have presented methods to generate a concept network and a concept-author map from large-scale literature repositories using the Map-Reduce framework. Link discovery on the concept network was further modeled as a classification problem, and we proposed a framework to automatically generate the labeled instances of concept pairs for supervised link discovery. Our method also extracts multiple heterogeneous features for labeled concept pairs. These features include path based features such as cycle free effective conductance (CFEC) and neighborhood features such as preferential attachment. In addition, we proposed a new feature based on CFEC, namely semantic-CFEC, which utilizes the semantic types of the nodes in the path. Another important contribution of this work is the use of author information. To the best of our knowledge, this is the first work that connects two concepts via the author links associated with those concepts for hypotheses discovery. Through experimental results, we showed an improvement of 7-9% in the classification accuracy of link discovery due to the addition of semantic type and author based features.
As part of the future work, we will explore using ensemble methods such as gradient descent boosted decision trees for classification. We will also explore the prediction of emerging connections between concepts in addition to the prediction of strong connections. A web service that generates biomedical hypotheses based on the proposed method will be built and published.
This article has been published as part of BMC Genomics Volume 13 Supplement 3, 2012: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.