Disease gene identification by random walk on multigraphs merging heterogeneous genomic and phenotype data
© Li et al.; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
Skip to main content
© Li et al.; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
High throughput experiments resulted in many genomic datasets and hundreds of candidate disease genes. To discover the real disease genes from a set of candidate genes, computational methods have been proposed and worked on various types of genomic data sources. As a single source of genomic data is prone of bias, incompleteness and noise, integration of different genomic data sources is highly demanded to accomplish reliable disease gene identification.
In contrast to the commonly adapted data integration approach which integrates separate lists of candidate genes derived from the each single data sources, we merge various genomic networks into a multigraph which is capable of connecting multiple edges between a pair of nodes. This novel approach provides a data platform with strong noise tolerance to prioritize the disease genes. A new idea of random walk is then developed to work on multigraphs using a modified step to calculate the transition matrix. Our method is further enhanced to deal with heterogeneous data types by allowing cross-walk between phenotype and gene networks. Compared on benchmark datasets, our method is shown to be more accurate than the state-of-the-art methods in disease gene identification. We also conducted a case study to identify disease genes for Insulin-Dependent Diabetes Mellitus. Some of the newly identified disease genes are supported by recently published literature.
The proposed RWRM (Random Walk with Restart on Multigraphs) model and CHN (Complex Heterogeneous Network) model are effective in data integration for candidate gene prioritization.
Reliable identification of disease genes is an important task in biomedical research useful to find out the mechanism of a disease and to reveal therapeutic targets. Family based genetic linkage analysis has been widely conducted to determine regions in the chromosomes of a genome which have large genetic effects on a disease . Each susceptible region in the chromosomes is called a susceptible locus which may cover dozens even hundreds of genes. Those genes in a susceptible locus are candidate disease genes which can be further narrowed down to the real disease genes by computational or experimental experiments. At the Online Mendelian Inheritance in Man (OMIM) database  which stores the latest data obtained by linkage analysis, there are still thousands of disease loci in which the real disease-causing genes have not been identified. Sophisticated computational algorithms have been recently proposed to prioritize those candidate genes [3–7] to deal with this problem. However, most of the algorithms are based on single data source. As a single data source is prone of bias, incompleteness and noise, integration of various genomic data sources is highly demanded for reliable prioritization of a set of candidate genes. The top ranked candidate genes are then most likely to be the real disease genes.
A commonly adapted data integration approach is to integrate separate lists of candidate genes derived from the each single data sources. A notable example is ENDEAVOUR , by which nine data sources were handled including sequence data, gene annotation data, etc. It was implemented in a rank aggregation based integration (RABI) framework consisting of two stages. In the first stage, a rank list of candidate genes is determined according to their similarity to known disease genes based on each data source. Subsequently, these rank lists are integrated into one rank list by using N-dimensional order statistics (NDOS) . In an earlier work , we improved the performance of ENDEAVOUR by using a random walk with restart (RWR) in the first stage as the ranking algorithm, and using a discounted rating system (DRS) in the second stage to combine the ranked lists.
Merging separate lists of candidate disease genes derived from single data sources with bias and noise can inflate the uncertainties in the data and may propagate into the final ranking. To address this problem, it's better to eliminate the bias and noise by merging the single data sources into an integrated data source, and then to prioritize a set of candidate genes. This work proposes a novel integration method to merge various genomic networks into a multigraph which is capable of connecting multiple edges between a pair of nodes. We then operate a random walk on the multigraph to find disease genes. Many random walk models have been introduced to solve different kinds of problems in bioinformatics recently. For example, Köhler et al.  used the RWR algorithm to prioritize candidate genes. Macropol et al.  proposed a repeated random walks algorithm to predict protein complex from the PPI network. Nibbe et al.  used random walk models to identify disease-relevant subnetworks from the PPI network, and studied a crosstalk between them. However, none of these models can work on multigraphs as the multiple edges between a pair of nodes complicates the calculation of transition probabilities.
In this work, we first construct separate gene networks corresponding to different data sources, and then merge these networks into a single network defined by a multigraph. When our random walk algorithm runs on the merged network, the transition probability is proposed to be calculated as the expected value of the transition probabilities from the multiple networks. Our algorithm was compared with four RABI models [8, 10]. On a benchmark data set covering 36 genetic diseases , our proposed algorithm achieved AUC value of 89.4%, much higher than the four RABI models. Our method is named RWRM (Random Walk with Restart on Multigraphs).
This work is further deepened by additionally considering phenotype data. It is widely understood that phenotype information can be used to improve the discovery of disease genes [14–16], because similar disease phenotypes are caused by mutation in functionally similar genes . We do not integrate the phenotype data into the multigraph gene network. Instead, it is connected, as a subgraph, to the multigraph gene network. So, the traversals of random walk are sometimes within the multigraph gene network, sometimes within the phenotype network, and sometimes cross between these two subgraphs. We propose a Complex Heterogeneous Network (CHN) model to guide the random walk. Four genomic data sources used in this work are a PPI network and three ontologies: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). We also used the latest OMIM data as the benchmark data containing 3,871 Phenotype-Gene Relationships (PGRs) to parter with the genomic networks. We successfully identified 2,105 of the PGRs, whereas a NDOS-based  or a DRS-based  method could identify only 2,008 or 2,048 number of relationships respectively. This demonstrate that a better performance by our CHN model can be achieved. We also conduct a case study to show the excellent performance of the proposed CHN model by discovering disease genes for Insulin-Dependent Diabetes Mellitus (IDDM).
This section describes the datasets used by this work, including two datasets of disease genes, a phenotype network, a PPI network and functional similarity networks derived from gene ontology (GO) .
Two benchmark datasets of disease genes are involved. The first one is obtained from Köhler et al.'s work . It is constructed by processing all entries in the OMIM database  in which thousands of OMIM phenotypes are categorized into 110 diseases. The largest one contains 47 genes and the smallest contains only three genes. This study is focused on 36 diseases each related to at least 6 genes. The second benchmark dataset contains 3,871 PGRs obtained from the latest version of OMIM as described in detail in the following section.
Every phenotype entry is defined as an MIM record, a text description of the disease phenotype. We excluded those records with the prefix '∗' and '^', because the prefix '∗' refers to an arbitrary record of disease gene, and '^' refers to an obsoleted record. We obtained 6,708 phenotypes in total. Then we calculated the similarity between phenotypes, based on the co-occurrence of key words in the Medical Subject Headings vocabulary (MeSH) . This was done as follows. Every disease phenotype is first converted into a numerical vector, where each element denotes the frequency of a key word in the description of the disease phenotype. Then the similarity between two phenotypes is measured by the correlation between the two vectors. We used a text mining PERL package MimMiner  to calculate the similarity between every pair of phenotypes and obtained a phenotype similarity matrix.
PGRs were extracted from the OMIM database , using BioMart http://www.biomart.org/biomart/martview. The PGRs can be viewed as a bipartite network, i.e., a phenotype-gene network, in which one partite of nodes are the genes and the other partite are the phenotypes, and edges are the PGRs. This phenotype-gene network can be used as a bridge to construct a heterogeneous network as explained in a later section.
The PPI data were derived from Human Protein Reference Database (HPRD) . HPRD contains manually curated scientific information pertaining to the biology of most human proteins. All the interactions in HPRD are extracted manually from literature by expert biologists who read, interpret and analyze the published data. We excluded self-interactions. In total, there are 36,619 unique interactions between 9,474 proteins. The interaction data is used to construct a network, in which proteins and interactions are represented by nodes and edges, respectively.
Gene Ontology provides a controlled vocabulary to describe gene and gene product attributes . It is comprised of three independently annotated subontologies: Biological Process (BP), Cellular Components (CC) and Molecular Function (MF).
The functional similarity between two genes can be measured by the semantic similarity between their GO annotation terms [23–26]. In our research work, the similarity between two genes was measured by their overlap annotation terms , because of its computational efficiency. Since there are three independent sub-ontologies, the functional similarity is defined considering three different aspects. We calculate three similarity matrices, each corresponding to a sub-ontology, then we construct the corresponding KNN graph, as illustrated in Figure 1. Specifically in this study, K is set as 5. (Other K's are also investigated and compared.) The obtained graphs are named BP network, CC network and MF network, respectively.
where M T is the transpose matrix of M and p 0 is the initial probability vector. In this work, the initial probability vector p 0 is set such that equal probabilities are assigned to all the source nodes with the sum of the probabilities equal to 1. The probability p ∞ (i) is the probability of finding the random walker at node g i in the steady state. It gives a measure of proximity between g i and source nodes. If p ∞ (i) >p ∞ (j), then node g i is more proximate to source nodes than node g j does.
We extend the RWR algorithm to operate on multigraph gene networks and propose an RWRM (Random Walk with Restart for Multigraphs) algorithm to prioritize a set of candidate genes.
where q (k) is the probability of selecting the k-th network.
As another example, suppose we want to find the most proximate node with g 4 in Figure 2(d). Then we set g 4 as the source node to run the RWRM algorithm. The initial probability is p 0 = [ 0 0 0 1 0 ] T . After running the RWRM algorithm, we get the stationary probability p ∞ = [ 0.01 0.04 0.13 0.71 0.06 ] T . Node g 3 has the highest stationary probability, therefore it is found to be the most proximate to the source node g 4. If g 4 is the known disease gene, we may infer that gene g 3 is possibly a disease gene.
where M G is the transition matrix of the gene network; M P is the transition matrix of the phenotype network; M GP is the transition matrix from the gene network to the phenotype network; and M PG is the transition matrix from the phenotype network to the gene network. The intra-subnetwork transition matrix, M P and M G , are calculated by using Eq.1 and Eq.3, respectively. The transition probability from a bridging node to the other nodes in the same subnetwork is modified by multiplying 1 - λ. For example, using Eq.3, the transition probability from node g 2 to g 3 is calculated as p(g 2 → g 3) = 0.44. However in the CHN model, it becomes 0.44 × (1 - λ). If λ = 0.5, p(g i → g j ) = 0.44 × 0.5 = 0.22. The transition probability from an internal node remains unchanged. For example, the transition probability from the internal node g 1 to g 2 is 0.75 in both the merged gene network and the CHN model.
The parameter η is used to weight the importance of each sub-network. If η is 0.5, two sub-networks are equally weighted. If η is above 0.5, the random walker prefers to return to the phenotype source node, therefore the phenotype information is assigned with more importance.
The initial probability of gene network u 0 is assigned such that equal probabilities are set to all the source nodes in the gene network, with the sum of the probabilities equal to 1. The initial probability of phenotype network v 0 is given similarly. Suppose g 1, g 4 and g 5 in Figure 3 are candidate genes for phenotype Ph3. To find the real disease gene, we use g 2, g 3 and Ph3 as source nodes to run the above random walk algorithm. In this case, .
can be obtained. Then genes and phenotypes can be ranked according to this steady probability u ∞ and v ∞ , respectively. The larger the probability value is, the higher the rank position.
Our proposed ranking methods are compared with various rank aggregation based integration (RABI) methods for performance evaluation. RABI methods prioritize a set of candidate genes derived from individual data source, and then combine these rank lists into a single list. A RABI has two steps: ranking and rank aggregation. For the first step, we considered using one-class support vector machine (1CSVM) [27, 28], RWR  and RWRH . For the second step, we considered using N-dimensional order statistics (NDOS)  and discounted rating system (DRS) . Therefore, six RABI models were used for comparison in this work, namely 1CSVM+NDOS, 1CSVM+DRS, RWR+NDOS, RWR+DRS, RWRH+NDOS and RWRH+DRS. The first four models are compared with our RWRM model, which integrates multiple gene networks. The last two models are compared with the CHN model, which integrates multiple gene networks and the phenotype network.
The 1CSVM method requires a kernel representation of the data. We used the diffusion kernel [29, 30], with the diffusion parameter set as 0.25. In RWR and RWRH, the γ value was set as 0.7. It has been shown that the performance of RWR algorithm is stable with λ ranging from 0.6 to 0.9 . In the DRS algorithm, there is one parameter, the number of ratings, which has little impact on the results . In this work, candidate genes were classified into five ratings. There is no parameter in the NDOS algorithm.
This section reports a comparative performance of our new methods on two benchmark datasets. As mentioned, the first dataset consists of 36 diseases collected by . The second dataset is the set of the 3,871 PGRs obtained from OMIM . Our proposed RWRM and CHN models are compared with the six RABI models by using the ROC and other criteria. As a complicated case study, we also use the proposed CHN model to discover previously unknown disease genes for Insulin-Dependent Diabetes Mellitus (IDDM).
Overview of Four Gene Networks and the Merged Gene Network
Performance of five different integration models on 36 diseases in terms of overall AUC values (%)
Overall AUC (%)
The PPI network and the three ontology networks BP, CC and MF were used to construct the multigraph part of the heterogeneous network which was then finalized by using the PGRs to connect to the phenotype network. The leave one out cross validation was also taken to evaluate the performance of the CHN model. In each round of cross validation, one PGR was removed from the dataset of 3,871 PGRs. The removed gene is called held-out gene and the removed disease phenotype is called target phenotype. The aim of this cross validation is to test whether or not the CHN model can successfully predict the relationship between the held-out gene and the target phenotype. We also collected 99 genes in the nearest chromosome region of the held-out disease gene, and our set of test genes constituted these 99 genes and the held-out gene.
Top five ranked candidate genes for each phenotype of IDDM
Number of Candidate Genes
Top 5 Candidate Genes
BLM, IGF1R, FURIN, RHCG, NR2F2
PPME1, RELA, STIP1, GSTP1, MEN1
BCL2, TXNL1, MYO5B, SMAD4, TNFRSF11A
NEUROD1, ABCB11, WIPF1, LRP2, CCDC141
ESR1, PLG, TBP, VIL2, IGF2R
TSHR, FOS, EIF2B2, NEK9, PTPN21
IDH1, ERBB4, PIP5K3, LANCL1, PTHR2
FYN, CDC40, ATG5, CD164, TUBE1
TCF7L2, GFRA1, VTI1A, ACSL5, ADRA2A
SPINK1, NR3C1, VDAC1, HSPA9, IL3
SH2B3, TCF1, OAS1, PTPN11, OAS2
ESR1, SYNE1, VIL2, LATS1, OPRM1
CCNA2, IL2, MAD2L1, FGF2, TRPC3
PTEN, ACTA2, PANK1, ANKRD1, MPHOSPH1
Given a phenotype, the set of candidate genes was obtained by using BioMart . Then our CHN model was used to prioritize this set of candidate genes. The parameters for the CHN model were set as γ = 0.7, λ = 0.5 and η = 0.5. Top five ranked candidate genes are shown in the fifth column of Table 3 with a decreasing relevance from the left to the right. Some of these predictions can be affirmed with literature work.
The fifth ranked gene for MIM 600318 (Serial No. 1 in Table 3) is NR2F2 (also called COUP-TFII). It has been found to contribute to the control of insulin secretion through the complex HNF4 transcription factor network operating in chicken pancreatic beta cell . This is suggestive of that the mutation studies in the human NR2F2 gene are important to understand more about IDDM.
The first ranked gene for MIM 600321 (Serial No. 4 in Table 3) is NEUROD1. A literature work by Cinek et al.  found a close association between NEUROD1 and IDDM in Czech children. They compared 285 children with IDDM diagnosed under the age of 15 years with 289 non-diabetic control children to confirm that NEUROD1 polymorphism Ala45Thr is associated with IDDM.
The locus of MIM 603266 (Serial No. 9 in Table 3) overlaps with the loci of type 2 diabetes (T2D). Therefore those genes in this particular region may explain the common mechanisms of both types of diabetes. The first ranked gene TCF7L2 is known as T2D causing gene and the fifth ranked gene ADRA2A is hypothesized to increase T2D risk . A single-nucleotide polymorphism (SNP) in the human ADRA2A gene has been found responsible to reduced insulin secretion . All these suggest that important associations of IDDM with TCF7L2 or with ADRA2A are likely to be established through further mutation analysis.
NR3C1 is known as GR (Glucocorticoid Receptor). It is ranked the second among the set of 241 candidate genes for MIM 605598 (Serial No. 10 in Table 3). Rosmond and Holm  performed a five-year follow-up study on 163 unrelated Swedish men for investigating 3 polymorphisms of this gene. They found a significant increase in body weight, body mass index, abdominal obesity, fasting glucose, insulin, and homeostasis model assessment over the 5-year follow-up among homozygotes for the rare BclI allele. Syed et al.  also reported an SNP in the human GR gene (rs2918419) which is linked with insulin resistance in men. These results indicate a high possibility that NR3C1 is likely to be associated with IDDM18.
In this paper, we have proposed two random walk models applicable to merged data for candidate gene prioritization and identification of disease genes. The RWRM model is an extended version of random walk algorithms specially designed for multigraph gene networks integrating various data sources. It was compared to the state-of-the-art RABI models and improved performance has been achieved. We also combined the phenotype information with the multigraph gene networks and proposed the CHN model. The CHN model is also found to be better than the RABI models in disease gene prediction.
We are interested in several future research directions. One topic is how to assign proper weights to different data sources. In this paper, we simply give all the data sources equal importance. Biologists may give the weights to different data sources based on their expert knowledge. Another problem is that the transition probability corresponding to individual gene networks is calculated independently before combination. For better performance, the transition probability should be calculated not only based on the topology of the current network but also based on other networks. The third direction is about the case study. Some of the top-ranked genes have been affirmed with literature work. As our prediction performance is high in accuracy, it is believed that those top-ranked but un-affirmed genes are new and potentially useful research targets in the field of disease gene prioritization.
This article has been published as part of BMC Genomics Volume 13 Supplement 7, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S7.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.