 Proceedings
 Open access
Hypotheses generation as supervised link discovery with automated class labeling on large-scale biomedical concept networks
BMC Genomics volume 13, Article number: S5 (2012)
Abstract
Computational approaches to generate hypotheses from biomedical literature have been studied intensively in recent years. Nevertheless, it remains a challenge to automatically discover novel, cross-silo biomedical hypotheses from large-scale literature repositories. To address this challenge, we first model a biomedical literature repository as a comprehensive network of biomedical concepts and formulate hypotheses generation as a process of link discovery on the concept network. We extract the relevant information from the biomedical literature corpus and generate a concept network and a concept-author map on a cluster using the MapReduce framework. We extract a set of heterogeneous features such as random walk based features, neighborhood features and common author features. Because the number of candidate links to consider in our concept network is large, the features are also extracted on a cluster using the MapReduce framework to address the scalability problem. We further model link discovery as a classification problem carried out on a training data set automatically extracted from two network snapshots taken in two consecutive time durations. A set of heterogeneous features, covering both topological and semantic features derived from the concept network, is studied with respect to its impact on the accuracy of the proposed supervised link discovery process. A case study of hypotheses generation based on the proposed method is presented in the paper.
Introduction
Text mining of biomedical literature is a research area that has attracted a lot of attention in the last 5 to 10 years. Swanson [1] was one of the early proponents of hypotheses discovery from biomedical literature. In his pioneering work, Swanson discovered a novel connection between Raynaud's disease and fish oil by examining two disjoint biomedical literature sets [1]. The hypothesis of the beneficial effect of fish oil on Raynaud's disease was confirmed by an independent clinical trial two years later, which demonstrated the value of biomedical literature mining in scientific discovery. Swanson's hypothesizing model, the so-called Swanson's ABC model, can be simply described as: A relates to B, B relates to C, therefore A may relate to C [2]. Ever since Swanson's discovery, much research has been carried out with the aim of automating and refining Swanson's ABC model [1, 3–8]. Nevertheless, most of the reported approaches analyze the retrieval result set for one or two initial topics provided as a query by a user, and do not scale up to the whole literature database for the purpose of discovering real, novel and cross-silo biomedical hypotheses.
In recent years, link discovery has been extensively studied on social networks such as those obtained from Facebook data and on bibliographic databases such as DBLP. As an important problem of link mining, link discovery refers to the discovery of future links between objects (or nodes) that are not directly connected in the current snapshot of a given network. In [9], Özgür and colleagues applied a link discovery technique to generate hypotheses on relationships between genes and vaccines. This work first extracted networks of gene-gene interactions and gene-vaccine interactions from literature with the help of gene and vaccine ontologies, and then analyzed the networks by computing different types of centrality measures for each node in the networks. Given its restricted focus on gene and vaccine relationships, this work by its nature was not designed for cross-silo biomedical discovery.
In order to address the challenge of large-scale cross-silo biomedical hypotheses discovery, in this paper we first model a biomedical literature repository as a comprehensive network of biomedical concepts belonging to different semantic types. Then we extract such a large-scale concept network from Medline [10]. We further calculate a variety of topological and semantic features from the concept network and model hypotheses discovery as a classification problem based on those features. Moreover, in order to automatically build the classification model for prediction, we take two snapshots of the concept network corresponding to two consecutive time durations, such that a training data set can be formed from a group of labeled concept pairs that are automatically extracted from the snapshots. We further extract multiple heterogeneous features for the labeled concept pairs solely from the first snapshot of the concept network. The impact of those heterogeneous features on hypotheses discovery is studied.
The rest of the paper is organized as follows. In the Related work section, we briefly describe relevant work in biomedical hypotheses discovery and link mining. In the Hypotheses generation as supervised link discovery on biomedical concept network section, we formulate hypotheses generation from literature as link discovery in a concept network and further model link discovery as a supervised learning process based on a set of topological and semantic features. In the Concept network creation and feature extraction using MapReduce framework section, we address the challenges of extracting large-scale concept networks from a literature corpus. We also address the challenges involved in automatically generating labeled data and extracting heterogeneous features for a large number of labeled pairs using the MapReduce framework. In the Experimental results section, we present experimental results. Finally, we conclude our paper with the Conclusions section.
Related work
Swanson's pioneering work in 1986 on biomedical hypotheses generation led to the discovery of the novel connection between Raynaud's disease and fish oil by examining two disjoint biomedical literature sets (Swanson [1]). In his follow-up work in 1990, Swanson suggested a trial-and-error search strategy, by which the ABC model guides a manual online search for identifying logically related non-interactive literature (Swanson [7]). By applying this strategy to biomedical literature analysis, Swanson discovered some other novel biomedical hypotheses, such as the implicit connection between the blood levels of Somatomedin C and the dietary amino acid arginine (Swanson [7, 11]), and the hidden link between the mineral magnesium and the treatment of migraine headaches (Swanson [7]).
Along with the advances in text retrieval and mining techniques, researchers have made several efforts to partially automate Swanson's ABC model for hypotheses generation. Stegmann and Grohmann proposed a way to guide a researcher to identify a set of promising B terms by conducting clustering analyses of terms on both the retrieval result set of topic A and the retrieval result set of topic C (Stegmann et al. [6]). Their work used measures called centrality and density to evaluate the goodness of term clusters and showed that the promising B terms that link disjoint literature for topics A and C tend to appear in clusters of low centrality and density. Srinivasan's approach to identifying promising B terms starts with building a profile for each of topic A and topic C from the retrieval result sets of A and C [5]. In her work, the profile of a topic consists of terms that have high frequency in the retrieval result set of that topic and belong to semantic types of interest to the user. The intersection of A's profile with C's profile then generates the candidate B terms. The process of identifying B terms from given topics A and C is called closed discovery. In her work, Srinivasan also applies the topic profile idea to conduct open discovery, which identifies both B terms and C terms given only topic A. Srinivasan's open discovery algorithm can be simply described as follows: top-ranking B terms are selected from the profile of topic A. Then, a profile for each selected B term is created from the retrieval result set of that B term. The top-ranking terms in a B term's profile form candidate C terms. If topic A's retrieval result set is disjoint from a candidate C term's retrieval result set, then this candidate C term is reported as having a potential relationship with topic A via term B.
Slightly different from Srinivasan's topic profile approach, Pratt and Yildiz directly applied association mining on the retrieval result set of topic A to conduct open discovery [4]. In their work, the logical inference based on two association rules A→B, B→C leads to the finding of a candidate C term.
One of the problems that almost all hypotheses generating approaches face is the large number of spurious hypotheses generated in the process of automating Swanson's ABC model. In order to eliminate the spurious hypotheses, different components of the biomedical ontology system UMLS [12] have been utilized. Weeber et al. [13] used the Metathesaurus of the UMLS to extract biomedical phrases and further limited the desired phrases by using the semantic types of the UMLS as an additional filter. Similar strategies are widely used by most of the follow-up research. Zhang et al. [3] used the semantic network, another UMLS component that specifies possible relations among different semantic types, in order to restrict the association rules generated from the retrieval result set of topic A in the process of open discovery. Besides utilizing the biomedical ontology system, we envision that cross-repository validation may be another effective addition for eliminating spurious hypotheses.
Whether designed for closed discovery or open discovery, the described works remain within the category of automating and refining Swanson's ABC hypothesizing model. Furthermore, all of these approaches are based on the retrieval result set of one or two initial topics provided by a user, and do not scale up to the whole set of topics within a literature database for the purpose of discovering real, novel and cross-silo biomedical hypotheses.
If we model a biomedical literature repository as a comprehensive network of biomedical concepts belonging to different semantic types, link discovery techniques may enable large-scale, cross-silo hypotheses discovery that goes beyond information retrieval-based discovery. Link discovery has been extensively studied in recent years on social networks such as Facebook and on bibliographic databases such as DBLP. As an important problem of link mining, link discovery refers to the discovery of future links between objects that are not directly connected in the current snapshot of a given network. In the following, we briefly review those link discovery techniques that are relevant to our work.
In the paper by Faloutsos et al. [14], the authors proposed a measure called effective conductance to evaluate the goodness of a connection subgraph. Later, in the paper by Koren et al. [15], an improved measure called cycle free effective conductance was proposed, which uses only the cycle free paths in computing the proximity. This measure guarantees that high degree intermediate nodes in the paths do not increase the proximity between two nodes unreasonably. The paper by Liben-Nowell and Kleinberg [16] discussed the problem of link prediction in social networks. It was one of the early works on link prediction and addressed the question of to what extent new collaborations (links) can be predicted by using the topology of the network. This work used an unsupervised approach to predict links based on several network topology features in co-authorship networks. The paper by Al Hasan et al. [17] used a supervised learning approach for co-authorship link prediction based on simple neighborhood features, without factoring in any random walk features such as effective conductance. Simple neighborhood features have several limitations compared to random walk features: they cannot predict connecting paths of length greater than two (Benchettara et al. [18]), nor can they discriminate significant (good) paths from the set of all neighborhood nodes. The paper by Benchettara et al. [18] used the bipartite nature of publication networks in a supervised learning framework. The paper by Savas et al. [19] addressed the link discovery problem based on the number of paths of different lengths from multiple sources that exist between two nodes. However, this work did not factor in the different degrees of significance that different paths may have. Özgür and colleagues [9] applied a link discovery technique to generate hypotheses on relationships between genes and vaccines. This work first extracted networks of gene-gene interactions and gene-vaccine interactions from literature with the help of gene and vaccine ontologies, and then analyzed the networks based upon different centrality measures calculated for each node in the networks. Given its limited focus on gene and vaccine relationships, this work by its nature was not designed for cross-silo biomedical discovery.
Hypotheses generation as supervised link discovery on biomedical concept network
We model the biomedical literature as a concept network G, where each node represents a biomedical concept that belongs to a certain semantic type, and each edge represents a relationship between two concepts. Each node and each edge carries a weight that reflects its significance. In this work, we use the document frequency of a node as its weight, and the co-occurrence frequency of the two end nodes as the weight of the corresponding edge. The hypotheses generation problem can now be formulated as link discovery on the concept network, i.e., the process of discovering all pairs of nodes that are not directly connected in the current concept network but will be directly connected in the future. We further model link discovery on the concept network as a supervised learning process in which the training data set is generated automatically from the concept network, without class label assignments by domain experts. More specifically, we take two snapshots, $G_{t_f}$ and $G_{t_s}$, of the concept network corresponding to two consecutive time durations $t_f$ and $t_s$, where $t_f$ is the first time duration and $t_s$ is the second. We automatically collect a group of concept pairs that are not directly connected in $G_{t_f}$ and label each pair as either positive or negative. A concept pair is labeled positive if it is directly connected in $G_{t_s}$, and negative otherwise. For each collected pair, we further extract a set of features from $G_{t_f}$, such that a classification model can be built by using part of the labeled pairs as the training data. Once the classification model is learned, it can be used to predict the appearance of a new edge at a future time between two nodes that are not currently directly connected. The quality of the classification model depends on the features we can extract for the labeled pairs.
Existing work in link discovery typically uses different types of topological features. We examine two types of topological features, namely random walk based and neighborhood based. Besides topological features, we also propose two semantically-enriched features, namely Semantic CFEC and Author List Jaccard. In the following, we describe both the topological and the semantically-enriched features in detail.
Topological features
Given a collected pair of nodes (s, t), we consider the following aspects of topology related to s and t: 1. the neighborhood of s and t; 2. the paths between s and t. To describe the neighborhood of s and t, the following measures are calculated:

Common neighbors:
$$\mathrm{Score}(s, t) = |\tau(s) \cap \tau(t)|,$$
where τ(s) and τ(t) are the sets of neighboring concepts of concepts s and t, respectively.

Adamic/Adar: This measure uses the common neighbors of the two nodes and weights each of them. It gives a higher weight to common neighbors with low degree.
$$\mathrm{Score}(s, t) = \sum_{z \in \tau(s) \cap \tau(t)} \frac{1}{\log |\tau(z)|}.$$

Jaccard Coefficient:
$$\mathrm{Score}(s, t) = \frac{|\tau(s) \cap \tau(t)|}{|\tau(s) \cup \tau(t)|}.$$

Preferential Attachment:
$$\mathrm{Score}(s, t) = |\tau(s)| \cdot |\tau(t)|.$$
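The four neighborhood measures above can be illustrated with a minimal Python sketch. It assumes the concept network is held as a dictionary `adj` mapping each concept to the set of its neighbors; this representation, the function names and the toy graph are illustrative assumptions, not the implementation used in the paper.

```python
import math

def common_neighbors(adj, s, t):
    # |tau(s) ∩ tau(t)|
    return len(adj[s] & adj[t])

def adamic_adar(adj, s, t):
    # Sum of 1 / log|tau(z)| over common neighbors z.
    # Degree-1 nodes are skipped to avoid dividing by log(1) = 0.
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[s] & adj[t] if len(adj[z]) > 1)

def jaccard(adj, s, t):
    # |tau(s) ∩ tau(t)| / |tau(s) ∪ tau(t)|
    union = adj[s] | adj[t]
    return len(adj[s] & adj[t]) / len(union) if union else 0.0

def preferential_attachment(adj, s, t):
    # |tau(s)| * |tau(t)|
    return len(adj[s]) * len(adj[t])

# Toy undirected graph (hypothetical): a and d are not linked
# but share the neighbors b and c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c"},
}

scores = {
    "common_neighbors": common_neighbors(adj, "a", "d"),
    "adamic_adar": adamic_adar(adj, "a", "d"),
    "jaccard": jaccard(adj, "a", "d"),
    "preferential_attachment": preferential_attachment(adj, "a", "d"),
}
```

All four scores use only the immediate neighborhoods of s and t, which is why they cannot reflect connecting paths longer than two edges.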
To describe the paths between s and t, we examine the following features.

Number of paths: the more paths there are between s and t, the more likely a future edge between s and t is.

Distance between s and t: the longer it takes to reach s from t, the less likely a future edge between s and t is.
Given a collected pair of nodes (s, t), the Cycle Free Effective Conductance (CFEC) measure proposed in [15] captures the effect of both of these features on the likelihood of a future edge between s and t. We briefly explain the definition of CFEC below. The cycle-free escape probability $P_{cf.esc}(s \to t)$ from s to t is the probability that a random walk originating at s will reach t without visiting any node more than once. Let R be the set of simple paths from s to t (simple paths are those that never visit the same node twice). The cycle-free escape probability is defined using the following equation:
$$P_{cf.esc}(s \to t) = \sum_{r \in R} \mathrm{Prob}(r).$$
The cycle free effective conductance measure is defined with the following equation:
$$C_{cf}(s, t) = \deg_s \cdot P_{cf.esc}(s \to t) = \deg_s \sum_{r \in R} \mathrm{Prob}(r).$$
From the above equation, it is clear that having multiple paths between two nodes boosts the score, which addresses the first desired property. The definition also ensures that already known information makes no contribution to the score, as it avoids cycles. In the random walk, the probability of transition from node i to node j is $p_{ij} = \frac{w_{ij}}{\deg_i}$. Thus, given a path $P = v_1, v_2, \ldots, v_r$, the probability that a random walk starting at $v_1$ will follow this path is given by:
$$\mathrm{Prob}(P) = \prod_{i=1}^{r-1} \frac{w_{v_i v_{i+1}}}{\deg_{v_i}}.$$
From the above equation it is evident that shorter paths are preferred, since every additional step multiplies the path probability by a factor of at most 1.
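To make the CFEC definition concrete, the following sketch enumerates simple paths by depth-first search and sums their probabilities. It assumes the graph is a symmetric dictionary of edge weights, that $\deg_i$ is the sum of the weights of the edges incident to node i, and that paths are cut off at `max_len` edges to keep the enumeration tractable; the cutoff and all names are assumptions made for illustration, not details taken from the paper.

```python
def cfec(graph, s, t, max_len=4):
    """Cycle free effective conductance between s and t.

    graph: dict mapping node -> dict of neighbor -> edge weight
           (undirected, so every edge appears in both directions).
    max_len: assumed cutoff on the number of edges in a path.
    """
    # Weighted degree of every node (assumption: deg_i = sum_j w_ij).
    deg = {u: sum(nbrs.values()) for u, nbrs in graph.items()}

    def walk(u, prob, visited, hops):
        # Returns the summed probability of all simple paths u -> t
        # that extend the partial path with probability `prob`.
        if u == t:
            return prob
        if hops == max_len:
            return 0.0
        total = 0.0
        for v, w in graph[u].items():
            if v not in visited:  # simple paths only: never revisit a node
                total += walk(v, prob * w / deg[u], visited | {v}, hops + 1)
        return total

    # C_cf(s, t) = deg_s * P_cf.esc(s -> t)
    return deg[s] * walk(s, 1.0, {s}, 0)

# Hypothetical toy graphs with unit weights.
one_path = {
    "s": {"a": 1}, "a": {"s": 1, "t": 1}, "t": {"a": 1},
}
two_paths = {
    "s": {"a": 1, "b": 1},
    "a": {"s": 1, "t": 1},
    "b": {"s": 1, "t": 1},
    "t": {"a": 1, "b": 1},
}

score_one = cfec(one_path, "s", "t")
score_two = cfec(two_paths, "s", "t")
```

On these toy graphs the pair connected by two parallel paths receives the higher score, which matches the first desired property discussed above.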
Semantically-enriched features
The above measures only evaluate features related to network topology. However, each node that represents a biomedical concept is also associated with rich semantic information. In this work, we consider the following two types of semantic information for a given node: its semantic type and its related author information.
To factor in the semantic type of a given node, we propose a semantically-enriched CFEC measure called Semantic CFEC. The intuition behind using the semantic types of the intermediate nodes in a path is that connections formed between homogeneous nodes are less likely to be spurious. This observation has also been substantiated in prior work on biomedical literature mining: the works by Weeber et al. [13] and Zhang et al. [3] used the UMLS semantic types to restrict the association rules or the hypotheses. Our proposed Semantic CFEC considers a subset of the simple paths, in which every intermediate node has the same semantic type as either the source node or the destination node. Let R* be the set of such simple paths, which we call semantic simple paths. Semantic CFEC is then computed using the paths r ∈ R*. Figure 1 shows some examples of such paths.
To factor in the related author information for a given node, we propose another new measure called Author List Jaccard. The intuition behind this measure is that two distant concepts may get connected due to the presence of enough researchers who are familiar with both concepts. Let author(s) and author(t) be the sets of authors who have published documents containing concepts s and t, respectively. Then, we define this measure as below:
$$\mathrm{Score}(s, t) = \frac{|\mathrm{author}(s) \cap \mathrm{author}(t)|}{|\mathrm{author}(s) \cup \mathrm{author}(t)|}.$$
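A hedged sketch of the two semantically-enriched features follows. It reuses the path enumeration idea of CFEC but skips any intermediate node whose semantic type differs from both end nodes. It assumes that transition probabilities are still computed from degrees in the full graph and that paths are limited to `max_len` edges; both are assumptions of this sketch, and the data structures and semantic type labels are hypothetical.

```python
def semantic_cfec(graph, sem_type, s, t, max_len=4):
    """CFEC restricted to semantic simple paths (the set R*).

    graph: dict node -> dict neighbor -> weight (symmetric).
    sem_type: dict node -> semantic type label.
    """
    allowed = {sem_type[s], sem_type[t]}
    # Assumption: degrees come from the full graph, not the filtered one.
    deg = {u: sum(nbrs.values()) for u, nbrs in graph.items()}

    def walk(u, prob, visited, hops):
        if u == t:
            return prob
        if hops == max_len:
            return 0.0
        total = 0.0
        for v, w in graph[u].items():
            if v in visited:
                continue
            # Intermediate nodes must share a type with s or t.
            if v != t and sem_type[v] not in allowed:
                continue
            total += walk(v, prob * w / deg[u], visited | {v}, hops + 1)
        return total

    return deg[s] * walk(s, 1.0, {s}, 0)

def author_list_jaccard(authors, s, t):
    # |author(s) ∩ author(t)| / |author(s) ∪ author(t)|
    union = authors[s] | authors[t]
    return len(authors[s] & authors[t]) / len(union) if union else 0.0

# Hypothetical example: the path through "b" is dropped because
# its semantic type matches neither s nor t.
graph = {
    "s": {"a": 1, "b": 1},
    "a": {"s": 1, "t": 1},
    "b": {"s": 1, "t": 1},
    "t": {"a": 1, "b": 1},
}
sem_type = {"s": "disease", "a": "disease", "b": "gene", "t": "chemical"}
authors = {"s": {"x", "y"}, "t": {"y", "z"}}

sem_score = semantic_cfec(graph, sem_type, "s", "t")
author_score = author_list_jaccard(authors, "s", "t")
```

Because R* is a subset of R, the Semantic CFEC score computed this way can never exceed the plain CFEC score for the same pair.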
Concept network creation and feature extraction using MapReduce framework
In this section, we describe the implementation of the computational model presented in the Hypotheses generation as supervised link discovery on biomedical concept network section. The major challenge in implementing such a computational model is the need to process a huge amount of data. We use the MapReduce framework to implement the following three major components: 1) extract a comprehensive biomedical concept network from the abstracts of all Medline papers published within 1990–2010; 2) generate labeled pairs from two consecutive snapshots of the concept network; and 3) for each labeled concept pair, extract all of the features described in the subsections titled Topological features and Semantically-enriched features.
Concept network extraction
Each node of the concept network represents a biomedical concept and carries the following information: semantic type, related authors, and document frequency. Each edge of the concept network represents the co-occurrence of the two end nodes in the same documents. An edge carries the following information: the strength of the edge (i.e., the frequency of co-occurrence of the two end nodes) and the duration of the edge. The concept network is stored by using the following data structures.

Concept-Document Map (CDM): The key of an entry in this map is a concept c and a year y, and the value of an entry is a set of document IDs (PMIDs), where a PMID is the ID of a Medline paper in which concept c appears and y is the publication year of that paper. Given a time duration t, we can easily derive a snapshot of CDM for t, denoted as $CDM_t$, by taking the union of all the PMIDs for the keys 〈c, y〉 where the year y is within the given time duration t. To generate this map in the MapReduce framework, each mapper processes a subset of the document collection and sends the tuple 〈concept, year〉 as the key and the document list as the value to the reducers. The reducers aggregate the document set for a given concept and year.

Concept-Concept Matrix (CCM): We compute concept-concept associations from the set of concepts extracted from a PMID. That is, for each concept, we compute the co-occurring concepts within the same document. For each concept-concept association, we compute the co-occurrence frequency in each year. Algorithm 1 describes the implementation of CCM in the MapReduce framework.

Concept-Semantic Type: We extract the semantic type of each concept from the UMLS Metathesaurus.

Concept-Author Map (CAM): The key of an entry in this map is a concept c and a year y, and the value of an entry is a set of authors. This map provides the set of authors who have published a document containing the given concept c in a given year y. Given a time duration t, we can easily derive a snapshot of CAM for t, denoted as $CAM_t$, by taking the union of all the authors for the keys 〈c, y〉 where the year y is within the given time duration t. To generate this map in the MapReduce framework, each mapper processes a subset of the document collection and sends the tuple 〈concept, year〉 as the key and the author set as the value to the reducers. The reducers aggregate the author set for a given concept and year.
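The construction of the Concept-Document Map and of its snapshot for a time duration can be sketched in plain Python as a map step followed by a reduce step. This is a single-process illustration only; the record fields `pmid`, `year` and `concepts`, the function names, and the representation of a time duration as an inclusive pair of years are assumptions made for the example. The Concept-Author Map would follow the same pattern with author names in place of PMIDs.

```python
from collections import defaultdict

def map_document(doc):
    # Mapper: emit ((concept, year), pmid) for each concept of one document.
    for concept in doc["concepts"]:
        yield (concept, doc["year"]), doc["pmid"]

def reduce_cdm(pairs):
    # Reducer: aggregate the document set for each (concept, year) key.
    cdm = defaultdict(set)
    for key, pmid in pairs:
        cdm[key].add(pmid)
    return dict(cdm)

def cdm_snapshot(cdm, first_year, last_year):
    # CDM_t: union of the PMID sets of all keys whose year lies in t.
    snapshot = defaultdict(set)
    for (concept, year), pmids in cdm.items():
        if first_year <= year <= last_year:
            snapshot[concept] |= pmids
    return dict(snapshot)

# Hypothetical records standing in for parsed Medline entries.
docs = [
    {"pmid": 1, "year": 1991, "concepts": ["fish oil", "raynaud disease"]},
    {"pmid": 2, "year": 1993, "concepts": ["fish oil"]},
    {"pmid": 3, "year": 1997, "concepts": ["fish oil"]},
]

cdm = reduce_cdm(kv for d in docs for kv in map_document(d))
cdm_t = cdm_snapshot(cdm, 1991, 1995)
```

Keying the map by 〈concept, year〉 rather than by concept alone is what allows a snapshot for any time duration to be derived later by a simple union, without reprocessing the documents.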
Algorithm 1: Generating the concept-concept matrix
Data: Document Corpus
Result: Concept-Concept Matrix (CCM)
initialization: CCM_local is a local matrix;
Map:
for each mapper m do
    for each document d_i in the document corpus of mapper m do
        c(i) ← set of concepts extracted from d_i;
        y_i ← published year of d_i;
        for each concept pair (c_k, c_l) of c(i) do
            CCM_local[c_k, c_l, y_i] ← CCM_local[c_k, c_l, y_i] + 1;
        end
    end
    for each entry (c_k, c_l, y_i) in CCM_local do
        key ← (c_k, c_l, y_i);
        value ← CCM_local[c_k, c_l, y_i];
        return (key, value);
    end
end
Reduce:
for each key (c_k, c_l, y_i) and counts count_1, ..., count_n do
    sum ← Σ_{i=1}^{n} count_i;
    CCM(c_k, c_l, y_i) ← sum;
end
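Algorithm 1 can be mirrored by the following single-process Python sketch, in which each call to `map_ccm` plays the role of one mapper with its local matrix and `reduce_ccm` sums the partial counts per key. The record fields, the function names, and the choice to store each unordered concept pair once in sorted order are assumptions of this sketch rather than details stated in the paper.

```python
from collections import Counter
from itertools import combinations

def map_ccm(docs):
    # Mapper: count co-occurring concept pairs per year in a local matrix,
    # then emit one (key, value) entry per (c_k, c_l, year).
    local = Counter()
    for doc in docs:
        concepts = sorted(set(doc["concepts"]))
        for ck, cl in combinations(concepts, 2):
            local[(ck, cl, doc["year"])] += 1
    return list(local.items())

def reduce_ccm(mapper_outputs):
    # Reducer: sum the counts received for each (c_k, c_l, year) key.
    ccm = Counter()
    for output in mapper_outputs:
        for key, count in output:
            ccm[key] += count
    return dict(ccm)

# Two hypothetical input splits, one per mapper.
split_a = [{"year": 1991, "concepts": ["A", "B", "C"]}]
split_b = [
    {"year": 1991, "concepts": ["B", "A"]},
    {"year": 1992, "concepts": ["A", "B"]},
]

ccm = reduce_ccm([map_ccm(split_a), map_ccm(split_b)])
```

Accumulating counts in a local matrix before emitting them, as Algorithm 1 does, reduces the number of key-value pairs sent from mappers to reducers.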
Given a comprehensive concept network stored in the above data structures, we apply Algorithm 2 to derive a snapshot of the concept network for a given time duration t in the MapReduce framework. A snapshot of the concept network is stored in a graph data structure.
Algorithm 2: Generating the snapshot of the concept network, G_t, for a time duration t
Data: Concept-Concept Matrix CCM, Concept-Document Map CDM, time duration t
Result: Snapshot of the Concept Network for a time duration t
initialization: create CDM_t, the snapshot of CDM for the time duration t;
for each 〈key(c_i, c_j, y_k), value(val)〉 in CCM do
    if y_k ∈ t then
        if no node exists for c_i then
            create a node v_i for c_i; v_i.name = c_i;
            v_i.frequency = (CDM_t.get(c_i)).size();
        end
        if no node exists for c_j then
            create a node v_j for c_j; v_j.name = c_j;
            v_j.frequency = (CDM_t.get(c_j)).size();
        end
        if no edge links 〈v_i, v_j〉 then
            create an edge e_ij between v_i and v_j;
            e_ij.strength = 0;
        end
        e_ij.strength = e_ij.strength + val;
    end
end
Automatic generation of class labels for concept pairs
Given two snapshots $G_{t_f}$ and $G_{t_s}$ of the concept network corresponding to two consecutive time durations $t_f$ and $t_s$, we generate a group of labeled pairs from which a training data set can be formed for the proposed supervised link discovery. The following process describes how we automatically assign class labels to concept pairs without any involvement of domain experts.
For a pair of nodes (i, j) that is not directly connected in $G_{t_f}$, we categorize its possible connection situations in $G_{t_s}$ as follows:

Connection is strong in $G_{t_s}$: There is an edge between i and j in $G_{t_s}$, namely $e_{ij}$, and $e_{ij}$.strength ≥ min_support.

Connection is emerging in $G_{t_s}$: There is an edge between i and j in $G_{t_s}$, namely $e_{ij}$, and margin × min_support ≤ $e_{ij}$.strength < min_support, where 0 < margin < 1.

Connection is weak in $G_{t_s}$: There is an edge between i and j in $G_{t_s}$, namely $e_{ij}$, and $e_{ij}$.strength < margin × min_support, where 0 < margin < 1.

No direct connection in $G_{t_s}$: There is no edge between i and j in $G_{t_s}$.
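The snapshot construction of Algorithm 2 and the four connection categories can be sketched together as follows. The sketch assumes the CCM is a dictionary keyed by (concept, concept, year), that the snapshot of the CDM maps each concept to its set of PMIDs, and that a time duration is an inclusive pair of years; it returns plain dictionaries instead of a graph object. All names and the example values of min_support and margin are hypothetical.

```python
def build_snapshot(ccm, cdm_t, first_year, last_year):
    """Return (node_frequency, edge_strength) for one time duration."""
    node_freq, strength = {}, {}
    for (ci, cj, year), val in ccm.items():
        if first_year <= year <= last_year:
            # Create missing nodes with their document frequency in CDM_t.
            for c in (ci, cj):
                node_freq.setdefault(c, len(cdm_t.get(c, ())))
            # Accumulate the co-occurrence counts over the years in t.
            edge = (ci, cj) if ci <= cj else (cj, ci)
            strength[edge] = strength.get(edge, 0) + val
    return node_freq, strength

def connection_category(strength, edge, min_support, margin):
    """Classify an edge of the second snapshot; edge is a sorted pair."""
    if edge not in strength:
        return "none"
    value = strength[edge]
    if value >= min_support:
        return "strong"
    if value >= margin * min_support:
        return "emerging"
    return "weak"

# Hypothetical inputs.
ccm = {
    ("A", "B", 1996): 5, ("A", "B", 1997): 4,
    ("A", "C", 1996): 1, ("B", "C", 1998): 3,
    ("A", "B", 2001): 7,
}
cdm_t = {"A": {1, 2, 3}, "B": {1, 2}, "C": {3}}

node_freq, strength = build_snapshot(ccm, cdm_t, 1996, 2000)
categories = {
    edge: connection_category(strength, edge, min_support=8, margin=0.3)
    for edge in [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D")]
}
```

With min_support = 8 and margin = 0.3, the emerging band covers strengths from 2.4 up to, but not including, 8.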
Given a pair of nodes that has no direct connection in $G_{t_f}$, we assign the class label positive to it if the pair's connection is strong in $G_{t_s}$, and the class label negative if the pair's connection is weak in $G_{t_s}$ or there is no direct connection in $G_{t_s}$. If the pair's connection in $G_{t_s}$ is emerging, its class label should be emerging; however, we do not consider this class in this work. The major challenge in generating labeled pairs is the huge number of pairs that are not directly connected in $G_{t_f}$. In order to address this issue, we use the following procedure to generate labeled pairs.

For each pair whose connection is strong in $G_{t_s}$, if it has no direct connection in $G_{t_f}$, assign positive to this pair.

For each pair whose connection is weak in $G_{t_s}$, if it has no direct connection in $G_{t_f}$, assign negative to this pair.

Select a random sample of the nodes in $G_{t_f}$ and generate concept pairs from the selected random sample. If a pair has no connection in both $G_{t_f}$ and $G_{t_s}$, assign negative to it.
The number of labeled pairs generated from a large-scale concept network can be huge. Furthermore, the numbers of positive pairs and negative pairs can be highly unbalanced. To address these issues, we randomly select a certain portion of the positive and negative pairs to form a training data set.
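The labeling procedure above can be sketched in a few lines of Python. The sketch assumes each snapshot is summarized by a dictionary from a sorted concept pair to its edge strength and that the nodes of the first snapshot are given as a set; the function name, the `n_sampled_nodes` parameter and the seed are hypothetical choices for illustration.

```python
import random

def label_pairs(strength_f, strength_s, nodes_f, min_support, margin,
                n_sampled_nodes, seed=0):
    """Assign positive/negative labels to pairs not connected in G_tf."""
    labels = {}
    # Steps 1 and 2: pairs that are strong or weak in the second snapshot.
    for edge, value in strength_s.items():
        if edge in strength_f:
            continue  # already directly connected in the first snapshot
        if value >= min_support:
            labels[edge] = "positive"
        elif value < margin * min_support:
            labels[edge] = "negative"
        # Emerging connections are left unlabeled, as in the text.
    # Step 3: pairs from a random node sample with no connection at all.
    rng = random.Random(seed)
    k = min(n_sampled_nodes, len(nodes_f))
    sample = sorted(rng.sample(sorted(nodes_f), k))
    for i, u in enumerate(sample):
        for v in sample[i + 1:]:
            if (u, v) not in strength_f and (u, v) not in strength_s:
                labels[(u, v)] = "negative"
    return labels

# Hypothetical snapshots.
strength_f = {("A", "B"): 9}
strength_s = {("A", "B"): 12, ("A", "C"): 10, ("B", "C"): 1, ("C", "D"): 4}
nodes_f = {"A", "B", "C", "D"}

labels = label_pairs(strength_f, strength_s, nodes_f,
                     min_support=8, margin=0.3, n_sampled_nodes=4)
```

Sampling nodes rather than enumerating every unconnected pair keeps the number of never-connected negative pairs under control.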
Feature extraction
For each labeled concept pair, we extract all of the features described in the subsections titled Topological features and Semantically-enriched features from the snapshot of the concept network $G_{t_f}$. Given that the number of labeled pairs is large, feature extraction is also a computationally expensive step. To address this problem, feature extraction is implemented on the MapReduce framework. The distributed implementation of feature extraction can be described in the following way:

1. Trim $G_{t_f}$ such that it only contains edges with strength greater than or equal to the minimum support. Store the trimmed $G_{t_f}$ in the main memory of each mapper. After trimming, $G_{t_f}$ is much smaller, so it is feasible to store it in memory.

2. Distribute the labeled pairs among the mappers. Each mapper extracts the features for a subset of concept pairs using the trimmed $G_{t_f}$.
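The two steps above can be imitated in a single process as follows: the graph is trimmed once, the labeled pairs are split into chunks, and each chunk is processed by a function standing in for a mapper. The round-robin partitioning, the use of the common neighbors count as the only feature, and all names are assumptions made to keep the sketch short.

```python
def trim(strength, min_support):
    # Step 1: keep only edges whose strength reaches the minimum support.
    return {e: w for e, w in strength.items() if w >= min_support}

def to_adjacency(strength):
    # Build neighbor sets from the (sorted pair -> strength) dictionary.
    adj = {}
    for u, v in strength:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def partition(pairs, n_mappers):
    # Step 2: distribute the labeled pairs round-robin among the mappers.
    return [pairs[i::n_mappers] for i in range(n_mappers)]

def mapper(adj, pairs):
    # Each mapper extracts features for its own subset of pairs,
    # here only the common neighbors count as a placeholder feature.
    return {(s, t): len(adj.get(s, set()) & adj.get(t, set()))
            for s, t in pairs}

# Hypothetical first snapshot and labeled pairs.
strength_f = {("A", "B"): 9, ("B", "C"): 8, ("A", "D"): 2, ("C", "D"): 10}
pairs = [("A", "C"), ("B", "D"), ("A", "D")]

adj = to_adjacency(trim(strength_f, min_support=8))
features = {}
for chunk in partition(pairs, n_mappers=2):
    features.update(mapper(adj, chunk))
```

Since every mapper only reads the trimmed graph, the chunks can be processed independently and their outputs merged without coordination.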
Experimental results
We study the following aspects of our proposed methodology in our experimental setup:

1. The performance of the proposed supervised link discovery approach. More specifically, we evaluate whether the proposed approach is able to make reasonable predictions on concept links that are currently weak or non-existing but may become strong in the future. Since predictions are carried out based on a classification model that is built upon a training data set extracted from two consecutive snapshots of the concept network, the performance of link discovery can be evaluated by measures such as classification accuracy, recall, and precision obtained from n-fold cross validation on the training data.

2. The effect of the parameters min_support and margin on the performance of link discovery. These two parameters are used in generating class labels for concept pairs of the training data.

3. The effect of the proposed features for each concept pair, such as CFEC, Semantic CFEC and Author List Jaccard, on the performance of link discovery.

4. The effect of using different snapshots of the concept network to generate training data. For this purpose, we first take three consecutive snapshots of the concept network, each of which spans a 5-year period; then generate the first training data set from the first two snapshots and the second training data set from the last two snapshots. Accordingly, we compare the performance of classification models built on these two training sets.

5. The effects of different supervised learning methods on the performance of link discovery. For this purpose, we experiment with two typical supervised learning methods: the C4.5 decision tree and the Support Vector Machine (SVM). A decision tree generates results that are easy to interpret, whereas SVM is well received due to its strong performance in various applications.
Experimental setting
We processed the MEDLINE records from 19902010 to build the base concept network. From each of the MEDLINE record, which is a XML file, we extract the following information to build the concept network: Authors, Dates, Document ID (PMID), Keywords from fields such as MeshHeadingList, Chemical Compounds List and Gene Symbol List. Table 1 shows some important statistics of the generated concept network.
We further show the distribution of document frequency of concepts in Figure 2, the distribution of cooccurrence frequency of concepts in Figure 3, and the distribution of degree of concept nodes in Figure 4. From these distributions, we observed that 1) majority of the concepts have document frequency greater than 1000; 2) majority of the concepts link to at least 1000 other concepts; and 3) among all linked concept pairs, around 33% have cooccurance frequency greater than 4 and around 20% have cooccurance frequency greater than 8.
Based on the concept network, the following snapshots were generated: G_{t_1} = 1991-1995, G_{t_2} = 1996-2000 and G_{t_3} = 2001-2005. We generated the first set of labeled pairs from G_{t_1} and G_{t_2}. As shown in Table 2, the number of labeled pairs, especially the number of negative instances, is too large for a typical supervised learning algorithm. Therefore, we randomly select 10% of the positive instances and 10% of the negative instances from the first set of labeled pairs to form the first training data set. For each labeled pair in the first training data set, we extracted its features solely from G_{t_1}. Then we generated the second set of labeled pairs from G_{t_2} and G_{t_3}. By taking 10% of the positive instances and 10% of the negative instances from the second set of labeled pairs, we form the second training data set. For each labeled pair in the second training data set, we extracted its features solely from G_{t_2}.
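The snapshot windows and the per-class 10% sampling can be sketched as follows. This is an illustrative sketch only: the names `snapshot_of`, `sample_fraction` and `build_training_set` are hypothetical, and the fixed seed is an assumption added for reproducibility.

```python
import random

# Snapshot windows as stated in the text (inclusive year ranges).
SNAPSHOTS = {"G_t1": (1991, 1995), "G_t2": (1996, 2000), "G_t3": (2001, 2005)}

def snapshot_of(year):
    """Return the snapshot covering `year`, or None if it falls outside all windows."""
    for name, (start, end) in SNAPSHOTS.items():
        if start <= year <= end:
            return name
    return None

def sample_fraction(items, fraction, rng):
    """Draw a simple random sample (without replacement) of the given fraction."""
    k = round(len(items) * fraction)
    return rng.sample(items, k)

def build_training_set(positives, negatives, fraction=0.10, seed=0):
    """Sample each class separately so both keep the same sampling rate."""
    rng = random.Random(seed)
    pos = sample_fraction(positives, fraction, rng)
    neg = sample_fraction(negatives, fraction, rng)
    return [(pair, 1) for pair in pos] + [(pair, 0) for pair in neg]

# Hypothetical labeled pairs, used only to exercise the functions.
positives = [("p", i) for i in range(50)]
negatives = [("n", i) for i in range(500)]
training = build_training_set(positives, negatives)
```

Sampling each class separately preserves the original class ratio, so the resulting training set remains unbalanced.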
We first applied the C4.5 decision tree to the training data set generated from G_{t_1} and G_{t_2} to study the effects of the parameters and the proposed features on the performance of the proposed approach; we then studied the performance of the C4.5 decision tree built on both training data sets; finally, we compared the performance of the C4.5 decision tree and SVM on both training data sets. A 10-fold cross validation was used to evaluate classification accuracy, recall, precision and F-measure in all experiments.
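The 10-fold cross validation protocol can be sketched in a few lines. The sketch below only produces the train/test index splits; the classifier itself (C4.5 or SVM) is not shown, and the function name `k_fold_indices` and the shuffling seed are assumptions of this illustration.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding assigns every index to exactly one fold; fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_indices(95, k=10))
```

Each instance is used for testing exactly once, and the reported measures are typically averaged over the k folds.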
Support and margin
We generated the labeled pairs by using the procedure described in the Automatic generation of class labels for concept pairs section, with different values for the variables min_support and margin. The numbers of positive and negative instances generated for training are highly unbalanced. Table 2 shows the number of positive and negative examples for different values of min_support. Because unbalanced data sets are difficult to train on, we undersampled the majority class.
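Random undersampling of the majority class can be sketched as below. The text does not state the target class ratio, so the 1:1 ratio used here is an assumption, as are the function name `undersample` and the fixed seed.

```python
import random

def undersample(instances, seed=0):
    """Randomly drop majority-class instances until both classes have equal size.

    `instances` is a list of (features, label) tuples with label 1 or 0.
    The 1:1 target ratio is an assumption of this sketch.
    """
    pos = [x for x in instances if x[1] == 1]
    neg = [x for x in instances if x[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    balanced = minority + kept
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced

# Hypothetical unbalanced data: 20 positives, 400 negatives.
data = [((i,), 1) for i in range(20)] + [((i,), 0) for i in range(400)]
balanced = undersample(data)
```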
Figure 5 shows the classification results obtained on the test data set by varying the value of min_support from 4 to 10 for a fixed margin of 0.3. We present the classification accuracy, the recall for the positive class (PRecall), the precision for the positive class (PPrecision) and the F-measure for the positive class (PFmeasure). As can be seen from Figure 5, all four measures increased as the value of min_support increased from 4 to 10; the classification accuracy rose from 67.5% to 73.4%. The explanation for this improvement is as follows. As we increase the value of min_support, some of the labeled pairs that are considered strong connections at a lower value are no longer strong connections at a higher value, but fall into the category of emerging connections. In other words, our feature set discriminates strong connections from weak connections better than it discriminates emerging connections from weak connections.
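For reference, the four reported measures follow the standard definitions based on true/false positives and negatives. The sketch below uses hypothetical labels and predictions; the function name `positive_class_metrics` is not from the paper.

```python
def positive_class_metrics(y_true, y_pred):
    """Return accuracy plus precision, recall and F-measure for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "PPrecision": precision,
            "PRecall": recall, "PFmeasure": f_measure}

# Hypothetical ground truth and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = positive_class_metrics(y_true, y_pred)
```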
We also experimented with different values for the variable margin. Figure 6 illustrates the results of the classifier as we increase the value of margin from 0.1 to 0.7. The best results are obtained with a margin of 0.1, which yields a classification accuracy of 76.2%. As the margin increases, there are more negative examples and the data become even more unbalanced.
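The interplay of min_support and margin described above can be illustrated with a small sketch. The exact labeling rule is defined in the Automatic generation of class labels for concept pairs section and is not reproduced here; the thresholds below (strong at or above min_support, weak below margin × min_support, emerging in between, and labels assigned only to pairs that are weak in the earlier snapshot) are assumptions chosen to be consistent with the behavior reported in this section, and the function names are hypothetical.

```python
def connection_strength(freq, min_support, margin):
    """Classify a co-occurrence frequency (assumed thresholds, see lead-in)."""
    if freq >= min_support:
        return "strong"
    if freq >= margin * min_support:
        return "emerging"
    return "weak"

def label_pair(freq_t1, freq_t2, min_support, margin):
    """Return 1 (positive), 0 (negative) or None (pair left unlabeled).

    Assumption: only pairs that are weak in the earlier snapshot are labeled.
    """
    if connection_strength(freq_t1, min_support, margin) != "weak":
        return None
    later = connection_strength(freq_t2, min_support, margin)
    if later == "strong":
        return 1
    if later == "weak":
        return 0
    return None  # emerging connections are neither positive nor negative

# A pair with 6 later co-occurrences is positive at min_support=4 ...
label_low_support = label_pair(0, 6, min_support=4, margin=0.3)
# ... but only emerging (unlabeled) at min_support=10.
label_high_support = label_pair(0, 6, min_support=10, margin=0.3)
# A larger margin turns more pairs into negatives.
label_small_margin = label_pair(0, 2, min_support=4, margin=0.3)
label_large_margin = label_pair(0, 2, min_support=4, margin=0.7)
```

Under these assumed thresholds, raising min_support moves borderline positives into the emerging category, and raising margin enlarges the weak category, which matches the growth in negative examples noted above.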
Semantically-enriched features
We proposed two semantically-enriched features, Author_List Jaccard and Semantic CFEC. Figure 7 illustrates the relative improvement in the classification model obtained by adding each of them: Semantic CFEC improved the classification accuracy by 6%, and Author_List Jaccard improved it by another 2%.
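As a reminder of the underlying measure, the Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below applies it to the author sets of two concepts; how the paper builds those author lists is described in an earlier section, so the inputs and the function name `author_jaccard` are assumptions of this illustration.

```python
def author_jaccard(authors_a, authors_b):
    """Jaccard coefficient of two author collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(authors_a), set(authors_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty author sets
    return len(a & b) / len(union)

# Hypothetical author identifiers for two concepts.
score = author_jaccard({"x", "y", "z"}, {"y", "z", "w"})
```

A score close to 1 means that largely the same authors publish on both concepts, even if the concepts never co-occur in a single record.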
Two different training data sets
In Figure 8, we compare the classification accuracies corresponding to the two different training data sets. Recall that the first training data set was extracted from concept network snapshots G_{t_1} and G_{t_2}, whereas the second training data set was extracted from snapshots G_{t_2} and G_{t_3}. As can be seen from the figure, the classification accuracies are consistent across the two training data sets.
C4.5 decision tree vs. SVM
Figure 9 compares the classification accuracy obtained using SVM and the C4.5 decision tree on the first training data set, which was extracted from concept network snapshots G_{t_1} and G_{t_2}. We used the radial basis function (RBF) as the kernel type for SVM, with LIBSVM [20] as the SVM library. The results from SVM are slightly better (by 1% to 2%). In Figure 10, we show the corresponding comparison for the second training data set, which was extracted from concept network snapshots G_{t_2} and G_{t_3}.
A case study
In the period from 1991 to 1995, there exists no MEDLINE record that mentions both "Prostatic Neoplasms" and "NF-κB inhibitor alpha". The document frequency of "Prostatic Neoplasms" in this period is 6807, whereas the document frequency of "NF-κB inhibitor alpha" is 91. However, the co-occurrence frequencies of this concept pair are 15 and 42 in the MEDLINE corpus for the periods 1996-2000 and 2001-2005, respectively. It is therefore worthwhile to study whether the supervised learning model built from the first training data set is able to predict the strong connection between these two concepts after 1995.
Recall that, in our experimental study, the first training data set was formed by randomly selecting 10% of the labeled pairs generated from concept network snapshots G_{t_1} = 1991-1995 and G_{t_2} = 1996-2000. We first made sure that the pair "Prostatic Neoplasms" and "NF-κB inhibitor alpha" is not part of the first training data set. Then we ran the supervised learning model built on the first training data set to make a prediction for this pair. The model successfully predicted the strong connection between these two concepts after 1995 by assigning a positive class label to this pair.
Furthermore, we extracted the paths between these two concepts, which may provide clues as to why these two concepts may potentially link to each other. Table 3 shows the six most significant paths connecting the two concepts, ranked by the Cycle Free Effective Conductance (CFEC) feature.
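Ranking connecting paths by their CFEC contribution can be sketched as follows. The sketch assumes the formulation of Koren et al. [15], in which a simple path's probability is the product of edge weight over weighted degree along the path, and its contribution to CFEC is that probability multiplied by the weighted degree of the source node. The toy graph, the length limit `max_len` and all function names are hypothetical; the paper's own computation is defined in an earlier section and runs on a far larger network.

```python
def weighted_degree(graph, node):
    """Sum of the weights of all edges incident to `node`."""
    return sum(graph[node].values())

def simple_paths(graph, source, target, max_len):
    """Enumerate cycle-free paths from source to target with at most max_len edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) > max_len:  # path already has max_len edges
            continue
        for nxt in graph[node]:
            if nxt not in path:  # never revisit a node: keeps the path cycle-free
                stack.append((nxt, path + [nxt]))

def path_probability(graph, path):
    """Probability that a random walk follows exactly this path."""
    prob = 1.0
    for u, v in zip(path, path[1:]):
        prob *= graph[u][v] / weighted_degree(graph, u)
    return prob

def rank_paths(graph, source, target, max_len=4, top_k=6):
    """Return the top_k (score, path) tuples, highest CFEC contribution first."""
    deg_s = weighted_degree(graph, source)
    scored = [(deg_s * path_probability(graph, p), p)
              for p in simple_paths(graph, source, target, max_len)]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return scored[:top_k]

# Hypothetical undirected toy graph; weights stand in for co-occurrence counts.
graph = {
    "s": {"x": 3, "y": 1},
    "x": {"s": 3, "t": 2, "y": 1},
    "y": {"s": 1, "x": 1, "t": 1},
    "t": {"x": 2, "y": 1},
}
ranked = rank_paths(graph, "s", "t")
```

Under this formulation, summing the scores of all enumerated paths gives an estimate of the CFEC value itself, so the highest-ranked paths are those that contribute most to the proximity of the two concepts.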
Conclusions
Modeling a biomedical literature repository as a comprehensive network of biomedical concepts, and viewing hypotheses generation as a process of automated link discovery on that concept network, opens the door to large-scale cross-silo biomedical hypotheses discovery. We have presented methods to generate a concept network and a concept-author map from large-scale literature repositories using the MapReduce framework. Link discovery on the concept network was further modeled as a classification problem, and we proposed a framework to automatically generate the labeled instances of concept pairs for supervised link discovery. Our method also extracts multiple heterogeneous features for labeled concept pairs. These features include path-based features such as cycle free effective conductance (CFEC) and neighborhood features such as preferential attachment. In addition, we proposed a new feature based on CFEC, namely Semantic CFEC, which utilizes the semantic types of the nodes in a path. Another important contribution of this work is the use of author information. To the best of our knowledge, this is the first work that exploits author links connecting two concepts for hypotheses discovery. Through experimental results, we showed an improvement of 7-9% in the classification accuracy of link discovery due to the addition of the semantic type and author based features.
As part of future work, we will explore using ensemble methods such as gradient boosted decision trees for classification. We will also explore the prediction of emerging connections between concepts in addition to the prediction of strong connections. A web service that generates biomedical hypotheses based on the proposed method will be built and published.
References
Swanson DR: Raynaud's syndrome and undiscovered public knowledge. Perspectives in Biology and Medicine. 1984, Johns Hopkins University Press, 30 (1): 7-18.
Bekhuis T: Conceptual biology, hypothesis discovery, and text mining: Swanson's legacy. Biomed Digit Libr. 2006, 3: 2. 10.1186/1742-5581-3-2.
Hu X, Zhang X, Yoo I, Wang X, Feng J: Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule. International Journal of Intelligent Systems - Granular Computing: Models and Applications. 2010, 25: 207-223.
Pratt W, Yildiz MY: Capturing connections across the biomedical literature. Proceedings of the 2nd International Conference on Knowledge Capture; Sanibel Island, FL, USA. 2003, ACM, 105-112.
Srinivasan P: Text mining: generating hypotheses from Medline. Journal of American Society for Information Science and Technology. 2004, 55 (5): 396-413. 10.1002/asi.10389.
Stegmann J, Grohmann G: Hypothesis generation guided by co-word clustering. Scientometrics. 2003, 56 (1): 111-135. 10.1023/A:1021954808804.
Swanson DR: Somatomedin C and arginine: implicit connection between mutually-isolated literatures. Perspectives in Biology and Medicine. 1990, Johns Hopkins University Press, 33: 157-186.
Xie Y, Katukuri JR, Raghavan VV, Presti T: Conceptual biology research supporting platform: current design and future issues. Applications of Computational Intelligence in Bioinformatics and Biomedicine: Current Trends and Open Problems. 2008, Berlin, Heidelberg: Springer-Verlag, 307-324.
Özgür A, Vu T, Ergan G, Radev DR: Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics. 2008, 24 (13): i277-i285. 10.1093/bioinformatics/btn182.
MEDLINE®/PubMed® Resources Guide. [http://www.nlm.nih.gov/bsd/pmresources.html]
Swanson DR: Medical literature as a potential source of new knowledge. Bull Med Libr Assoc. 1990, 78: 29-37.
Unified Medical Language System® (UMLS®). [http://www.nlm.nih.gov/research/umls/]
Weeber M, Klein H, Aronson AR, Mork J, de Jong-van den Berg LTW, Vos R: Text-based discovery in biomedicine: the architecture of the DAD-system. Proceedings of the AMIA Annual Fall Symposium; Philadelphia, PA, USA. 2000, ACM, 903-907.
Faloutsos C, McCurley KS, Tomkins A: Fast discovery of connection subgraphs. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Seattle, WA, USA. 2004, 118-127.
Koren Y, North SC, Volinsky C: Measuring and extracting proximity in networks. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Philadelphia, PA, USA. 2006, 245-255.
Liben-Nowell D, Kleinberg J: The link-prediction problem for social networks. Proceedings of the Twelfth International Conference on Information and Knowledge Management; New Orleans, LA, USA. 2003, 556-559.
Hasan MA, Chaoji V, Salem S, Zaki M: Link prediction using supervised learning. Proc of SDM 06 Workshop on Link Analysis, Counterterrorism and Security. 2006.
Benchettara N, Kanawati R, Rouveirol C: A supervised machine learning link prediction approach for academic collaboration recommendation. Proceedings of the Fourth ACM Conference on Recommender Systems; Barcelona, Spain. 2010, 253-256.
Lu Z, Savas B, Tang W, Dhillon I: Supervised link prediction using multiple sources. Proceedings of the 2010 IEEE International Conference on Data Mining. 2010, IEEE Computer Society, 923-928.
Support Vector Machine. [http://www.csie.ntu.edu.tw/~cjlin/libsvm/]
Acknowledgements
This article has been published as part of BMC Genomics Volume 13 Supplement 3, 2012: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S3.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
JRK researched the area of link discovery methods and proposed the supervised link discovery method for biomedical hypotheses discovery. YX and VVR proposed further improvements to the methodology. JRK implemented the proposed method and generated experimental results. AG generated the input data sets and also formatted the manuscript. YX and VVR organized the manuscript in a formal way. All authors have contributed to the writing of the manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Katukuri, J.R., Xie, Y., Raghavan, V.V. et al. Hypotheses generation as supervised link discovery with automated class labeling on large-scale biomedical concept networks. BMC Genomics 13 (Suppl 3), S5 (2012). https://doi.org/10.1186/1471-2164-13-S3-S5