# BMRF-MI: integrative identification of protein interaction network by modeling the gene dependency

- Xu Shi
^{1}, - Xiao Wang
^{1}, - Ayesha Shajahan
^{2}, - Leena Hilakivi-Clarke
^{2}, - Robert Clarke
^{2}and - Jianhua Xuan
^{1}Email author

**16(Suppl 7)**:S10

https://doi.org/10.1186/1471-2164-16-S7-S10

© Shi et al.; licensee BioMed Central Ltd. 2015

**Published: **11 June 2015

## Abstract

### Background

Identification of protein interaction network is a very important step for understanding the molecular mechanisms in cancer. Several methods have been developed to integrate protein-protein interaction (PPI) data with gene expression data for network identification. However, they often fail to model the dependency between genes in the network, which makes many important genes, especially the upstream genes, unidentified. It is necessary to develop a method to improve the network identification performance by incorporating the dependency between genes.

### Results

We proposed an approach for identifying protein interaction network by incorporating mutual information (MI) into a Markov random field (MRF) based framework to model the dependency between genes. MI is widely used in information theory to measure the uncertainty between random variables. Different from traditional Pearson correlation test, MI is capable of capturing both linear and non-linear relationship between random variables. Among all the existing MI estimators, we choose to use k-nearest neighbor MI (kNN-MI) estimator which is proved to have minimum bias. The estimated MI is integrated with an MRF framework to model the gene dependency in the context of network. The maximum a posterior (MAP) estimation is applied on the MRF-based model to estimate the network score. In order to reduce the computational complexity of finding the optimal network, a probabilistic searching algorithm is implemented. We further increase the robustness and reproducibility of the results by applying a non-parametric bootstrapping method to measure the confidence level of the identified genes. To evaluate the performance of the proposed method, we test the method on simulation data under different conditions. The experimental results show an improved accuracy in terms of subnetwork identification compared to existing methods. Furthermore, we applied our method onto real breast cancer patient data; the identified protein interaction network shows a close association with the recurrence of breast cancer, which is supported by functional annotation. We also show that the identified subnetworks can be used to predict the recurrence status of cancer patients by survival analysis.

### Conclusions

We have developed an integrated approach for protein interaction network identification, which combines Markov random field framework and mutual information to model the gene dependency in PPI network. Improvements in subnetwork identification have been demonstrated with simulation datasets compared to existing methods. We then apply our method onto breast cancer patient data to identify recurrence related subnetworks. The experiment results show that the identified genes are enriched in the pathway and functional categories relevant to progression and recurrence of breast cancer. Finally, the survival analysis based on identified subnetworks achieves a good result of classifying the recurrence status of cancer patients.

## Background

Biological systems in cancer involve multifunctional modules that coordinately regulate complex behavior [1]. Many researches focus on identifying biomarkers on high-throughput data such as DNA microarray data and RNA sequencing data. However, the high complexity of biological systems makes the single molecular approach hard to fully reveal the underlying mechanism. Integrative approaches with different data sources are needed to extract deeper insights in different levels and aspects [2]. Several methods [3–6] have been developed to integrate protein-protein interaction (PPI) data with microarray gene expression data to identify significant protein interaction networks. Chuang *et al*. [3] proposed a protein-network based approach to identify the biomarkers of metastasis within gene expression profiles. The biomarkers identified in interconnected subnetworks have shown high reproducibility and accuracy in the classification of metastatic versus non-metastatic tumors. Ideker *et al*. [5] introduced an approach to identify active subnetworks, which shows consistent condition-specific gene expression change on PPI network. The change of gene expression is measured by significance value (p-value) and further converted to z-score, then the network score can be aggregated by the z-score of the genes in the subnetwork. A simulated annealing based searching algorithm is implemented to find the maximal-scoring connected network. Chen *et al*. [6] point out that these two methods mentioned above define the network score as an aggregation of significance score of genes, which usually leave the less differentially expressed biomarkers unidentified. In order to address the concern of gene interaction, Chen *et al*. proposed a bagging Markov random field (BMRF) based method to improve the protein interaction subnetwork identification. BMRF employs maximum a posterior (MAP) principle to estimate the differential score of genes or proteins and form a novel network score that considers the pairwise gene interaction in the subnetworks. Although BMRF has achieved success in identifying biologically meaningful subnetworks, there are still some concerns about the method. BMRF treats the PPI connection equally by putting the same weight on the edges of the network, which is not true in the real case. As a cell needs to act differently in response to different stimuli, the regulatory mechanism should have condition specific preference. Accepting all the connections in the PPI database may lead to errors in MAP estimation. Furthermore, it has been proven that there are a lot of false positives in the protein interaction database [7]. Therefore, we need to quantify the dependency between genes to reduce the negative effect of false connections and improve the performance.

In this paper, we proposed an approach, namely BMRF-MI, for identifying protein interaction network by incorporating mutual information (MI) into a Markov random field (MRF) framework to model the dependency of genes. MI is developed in information theory to measure the uncertainty between random variables. As the complexity of the biological system is very high, using MI to estimate the correlation between genes can help us reveal both linear and non-linear relationships. By incorporating the quantification of the dependency between the genes, we are able to minimize the effects of false edges in protein network and identify more accurate subnetworks. We generate synthetic data to show that our method has an improved performance in protein subnetwork identification. Besides, we also apply BMRF-MI to breast cancer patient data to demonstrate the feasibility of the proposed method for real biological studies. Experimental results show that the proposed method is able to identify biologically meaningful subnetworks. Furthermore, we use survival analysis to show that the identified subnetworks can be used to predict the recurrence status of the cancer patients.

## Results and discussion

### Bagging Markov random field and mutual information (BMRF-MI) based network identification

### Simulation experiments

The simulation PPI network is an estrogen receptor (ER) related PPI network with 376 nodes and 1,825 edges extracted from the HPRD database [6, 8]. The ground truth subnetwork is constructed by the genes in pathways closely related to breast cancer, which has 35 genes and 89 interactions. The dependency between genes can be constructed as a symmetric matrix, and the magnitude of the element indicates the strength of dependency. The genes in the ground truth subnetwork are set to have a higher dependency value, which means that they have stronger mutual dependency than the other genes. We use the Markov random field model developed by Wei *et al*. [9] to simulate the differential state of genes. Based on the differential state and gene dependency, the gene expression data is simulated from multivariate Gaussian distribution with 40 samples in each phenotype (80 samples in total), which takes the dependency matrix as the covariance matrix. For differential genes, a random difference will be generated to differentiate the mean level of two phenotypes. For non-differential genes, the gene expression data comes from the same distribution. The differential z-score is calculated by inverse cumulative standard normal distribution of p-value estimated from Student's t-test. The false positive rate (FPR) of the simulation data can be controlled by a weight parameter *w* in the MRF model introduced by Chen *et al*. [6]. We simulate the data by varying the FPR ranging from 30% to 85% to evaluate the performance of identifying differential expressed protein interaction networks under different levels.

AUC values for subnetwork identification.

FPR | BMRF-MI | BMRF |
---|---|---|

31% |
| 0.8768 |

38% |
| 0.8734 |

44% |
| 0.8496 |

51% |
| 0.8082 |

85% |
| 0.6015 |

F-scores for subnetwork identification.

FPR | BMRF-MI | BMRF | jActiveModule |
---|---|---|---|

31% |
| 0.6674 | 0.3862 |

38% |
| 0.6649 | 0.3627 |

44% |
| 0.6079 | 0.3558 |

51% |
| 0.4371 | 0.3324 |

85% |
| 0.2145 | 0.1965 |

### Breast cancer microarray data

We further tested our method on one estrogen receptor (ER) related breast cancer patient dataset introduced in Loi *et al*. [10] to identify subnetworks related to recurrence of breast cancer. The patients samples are divided into 'early recurrence' group (≤ 5 years) and 'late recurrence' group (> 5 years) based on survival time. We finally got 19 samples in 'early recurrence' group and 28 samples in 'late recurrence' group. The PPI network data is obtained from HPRD database [8], which contains about 9000 genes and 35000 interactions. We further extracted an ER focused PPI network with 2545 genes and 15094 connections by finding the subnetwork within two jumps from ER. The 199 seed genes are selected from the ER focused network with node degree larger than 10. For the differential score of the genes, we perform Student's t-test on the 2545 genes between two groups of samples to calculate the p-value and convert the p-value to z-score by inverse cumulative standard normal distribution. The gene dependency is estimated by R package 'parmigene'[11], which implements the kNN-MI estimator mentioned in Kraskov *et al*. [12]. For the bootstrapping process, the confidence level threshold is set to 0.3 to find significant genes in network.

*et al*. [15] show that the CDK inhibitors will lead to increased MCM complex association with DNA and MCM is closely with DNA damage [16], which is related to breast cancer recurrence. In addition, Net2 shows strong connection between BAD and YWHAQ, which is associated with cell death and validated by [17]. Moreover, the genes in common pathway tend to have strong mutual dependency. For example, it can be seen from Net2 that the cell cycle genes including YWHAG, YWHAQ, YWHAH, CDKN1B and ABL1 are closely connected.

P-values of the enriched pathways of identified subnetworks from functional annotation.

Identified Pathway | Net1 | Net2 | Net3 | Net4 | Net5 |
---|---|---|---|---|---|

Cell Cycle | 4.40e-12 | 2.90e-7 | 5.70e-6 | - | - |

TGF-Beta signaling | 7.00e-5 | - | - | - | - |

ErbB signaling | - | 5.10e-5 | - | 2.00e-9 | 9.70e-7 |

Insulin signaling | - | - | - | 1.90e-5 | - |

MAPK signaling | - | - | - | 8.40e-5 | - |

## Conclusion

In this paper, we have proposed a new method by incorporating mutual information into a Markov random field based framework to tackle the problem of protein interaction subnetwork identification. The proposed method is tested by simulation data with different experimental conditions. We have observed significant improvements in terms of the accuracy of subnetwork identification. To validate the efficacy of the method in real biomedical applications, a breast cancer patient dataset is used for the identification of protein interaction networks related to recurrence of breast cancer. The identified subnetworks are significantly enriched in pathways related to the progression and recurrence of breast cancer. We further validate the significance of the subnetworks on other dataset by predicting the recurrence status of patients.

## Methods

### Gene dependency estimation

*p*

_{ Y }(

*y*) are intractable, discretization is one intuitive solution which is usually applied to gene expression data to approximate the distributions. After partitioning both × and Y into a finite number of bins, all the probabilities can be calculated by counting the number of points falling into corresponding bins. Then Equation (1) can be approximated as:

*et al*. [12] proposed a k-nearest neighbor MI (kNN-MI) estimator. The kNN-MI estimator utilizes the distance between the point and its kth nearest neighbor to estimate the mutual information:

where $\psi \left(x\right)=\Gamma {\left(x\right)}^{-1}\frac{d\Gamma \left(x\right)}{dx}$ is the digamma function, $<f\left(x\right)>=\frac{1}{\mathsf{\text{N}}}E\left[f\left(x\right)\left(i\right)\right]$ averages *f*(*x*) all over *i* and realizations, and *N* is the number of realizations or samples. Assume ${\in}_{X}\left(i\right)$ and ${\in}_{Y}\left(i\right)$ are the distance from point *i* with coordinate (x_{i},y_{i}) to its k-th nearest neighbor in subspace × and Y respectively, *n*_{
x
} and *n*_{
y
} are the number of points in the set $\left\{\left({x}_{j},{y}_{j}\right)|{||x}_{j}-{x}_{i}||\le \frac{{\in}_{\mathsf{\text{x}}}\left(i\right)}{2}\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{and}}\phantom{\rule{2.77695pt}{0ex}}||{y}_{j}-{y}_{i}||\le \frac{{\in}_{y}\left(i\right)}{2}\right\}$. Sales *et al*. [11] compare kNN-MI estimator with several other mutual information estimators such as KDEMI [19], the Miller-Madow [20] and the Schurmann-Grassberger estimators [21] and show that the kNN-MI estimator has the minimum bias. In this paper, we use the 'parmigene' [11], a well-developed R package, to estimate the mutual information between all pair of genes. The mutual information can be represented as a symmetric matrix **W**, where *w*(*i, j*) is the estimated mutual information between the *ith* and the *jth* gene.

### Markov random field (MRF) based network score

*i*in the subnetwork, let

*N*

_{ i }represents the number of genes connected with gene

*i*. By the virtue of Markov property, we can assume that the discriminative score of gene

*i*depends on the discriminative scores of its

*N*

_{ i }neighbor genes:

**f**:

*T*is a temperature parameter that controls the sharpness of the distribution and

*U*(f) is the prior energy function over 1-vertex and 2-vertex cliques of the subnetwork. The 1-vertex clique

*C*

_{1}is defined as the whole gene set and 2-vertex clique

*C*

_{2}is defined as the set of neighbor genes on

*C*

_{1}.

*U*(f) proposed by Chen

*et al*.[6] can be represented by the sum of clique potential

*V*

_{ c }(

**f**):

*k*is the number of edges, λ is a trade-off parameter and

*d*

_{ i }is the node degree of gene

*i*. The first term in the equation estimates discriminative level of the network and the second term accounts for the smoothness of the discriminative score. As seen from Equation (6), smaller U(

**f**) will lead to a more probable network configuration, thus identifying a network with genes that have high and consistent discriminative score is our goal. However, there are some concerns needed to be addressed. In the first term, the average of discriminative score tends to decrease the variance of

**f**when the number of genes increases, which makes the network score not comparable over different sizes of networks. In order to make the variance comparable, we modified the first term by changing the coefficient $\frac{1}{\mathsf{\text{m}}}$ to $\frac{1}{\sqrt{m}}$. The second term utilizes the MRF framework to smooth the discriminative scores of the genes in a network; however, this term treats each connection equally. Here we incorporate the gene dependency estimated from kNN-MI:

where *w*(*i, j*) is the mutual information between *ith* gene and *jth* gene and the node degree *d*_{
i
} can be calculated as $\sum _{\left\{j|\left(i,j\right)\in {C}_{2}\right\}}w\left(i,j\right)$. The weights added in the second term of Equation (7) can guarantee the smoothness of the discriminative scores over genes with strong dependency.

**f**cannot be directly estimated. In order to lessen the effect of noise and address the concern of gene dependency, MAP estimation method is applied on the observed score to estimate

**f**. Suppose the observed discriminative scores are$z={\left[{z}_{1},\dots ,{z}_{m}\right]}^{T}$. Here the z-score

*z*

_{ i }can be calculated from corresponding p-value

*p*

_{ i }by the inverse Gaussian cumulative density function. In this method, p-value is calculated by Student's t-test between two phenotypes. To account for the noise in the gene expression data, we assume that

**e**is Gaussian noise and

**I**is the identity matrix. Given the z-score, we can estimate the discriminative score $\widehat{f}$ by MAP estimation. Based on Bayes's rule and Gibbs distribution, the estimation can be expressed as:

**f**) is specified in Equation (7). U(

**z**|

**f**) is the likelihood potential, which can be derived from the distribution of

**e**as:

**f**has zero mean; however, the assumption usually violates in the real application. Here we estimate the background distribution of network score with different sizes by random sampling subnetworks from the whole PPI network. The network score can be normalized by:

where *μ*(*M*) and *σ*(*M*) are the mean and standard deviation estimated from the null distribution of the networks with the same size of M.

### Simulated annealing searching

Given the network score definition, finding the optimal network with the highest network score is an NP hard problem. Instead of using exhausted searching approach, a probabilistic approach for global optimization, simulated annealing, is applied here. To reduce the computational complexity, we start the simulated annealing searching from 'seed' genes, which are pre-selected from the PPI network. Several constraints are made to further increase the network searching efficiency: (i) the searching space is restricted to a local 2-jump network from 'seed' node and (ii) the searching will be terminated when no significant improvement is observed.

### Confidence level measurement

Due to the large noise of data and heterogeneity of samples, the reproducibility of subnetwork identification is usually low. Furthermore, the number of samples is usually limited in biological experiment due to cost and quality issue, which will introduce bias to the results. In order to get more robust results, we applied a non-parametric bootstrapping strategy to measure the confidence level of the genes in the identified network. For each bootstrap, we generate a data set by randomly sampling the samples from the gene expression data with replacement. Applying BMRF-MI on the generated data sets, we can measure the confidence level of genes as the frequency of appearance. We are more confident about the genes with high frequency; then a threshold can be set to generate the final network.

### Clustering networks by affinity propagation clustering (APC)

S(*i, j*) has a value between 0 and 1 and higher value indicates higher similarity between the two networks. The number of exemplars or clusters can be automatically determined without pre-configuration, which is different from traditional clustering methods such as k-means clustering.

## Declarations

### Acknowledgements

This work is supported by National Institutes of Health (NIH) [CA149653, CA149147, CA164384, and NS29525-18A, in part].

**Declarations**

The publication costs for this article were funded by National Institutes of Health (NIH) [CA149653].

This article has been published as part of *BMC Genomics* Volume 16 Supplement 7, 2015: Selected articles from The International Conference on Intelligent Biology and Medicine (ICIBM) 2014: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S7.

## Authors’ Affiliations

## References

- Hanash S: Integrated global profiling of cancer. Nature reviews Cancer. 2004, 4 (8): 638-644. 10.1038/nrc1414.View ArticlePubMedGoogle Scholar
- Rhodes DR, Chinnaiyan AM: Integrative analysis of the cancer transcriptome. Nature genetics. 2005, S31-37. 37 SupplGoogle Scholar
- Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Molecular systems biology. 2007, 3: 140-PubMed CentralView ArticlePubMedGoogle Scholar
- Dittrich MT, Klau GW, Rosenwald A, Dandekar T, Muller T: Identifying functional modules in protein-protein interaction networks: an integrated exact approach. Bioinformatics. 2008, 24 (13): i223-231. 10.1093/bioinformatics/btn161.PubMed CentralView ArticlePubMedGoogle Scholar
- Ideker T, Ozier O, Schwikowski B, Siegel AF: Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics. 2002, S233-240. 18 Suppl 1Google Scholar
- Chen L, Xuan J, Riggins RB, Wang Y, Clarke R: Identifying protein interaction subnetworks by a bagging Markov random field-based method. Nucleic acids research. 2013, 41 (2): e42-10.1093/nar/gks951.PubMed CentralView ArticlePubMedGoogle Scholar
- Bork P, Jensen LJ, von Mering C, Ramani AK, Lee I, Marcotte EM: Protein interaction networks from yeast to human. Current opinion in structural biology. 2004, 14 (3): 292-299. 10.1016/j.sbi.2004.05.003.View ArticlePubMedGoogle Scholar
- Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N, Reddy R, Raghavan TM, et al: Human protein reference database--2006 update. Nucleic acids research. 2006, 34 (Database): D411-414.PubMed CentralView ArticlePubMedGoogle Scholar
- Wei Z, Li H: A Markov random field model for network-based analysis of genomic data. Bioinformatics. 2007, 23 (12): 1537-1544. 10.1093/bioinformatics/btm129.View ArticlePubMedGoogle Scholar
- Loi S, Haibe-Kains B, Majjaj S, Lallemand F, Durbecq V, Larsimont D, Gonzalez-Angulo AM, Pusztai L, Symmans WF, Bardelli A, et al: PIK3CA mutations associated with gene signature of low mTORC1 signaling and better outcomes in estrogen receptor-positive breast cancer. Proceedings of the National Academy of Sciences of the United States of America. 2010, 107 (22): 10208-10213. 10.1073/pnas.0907011107.PubMed CentralView ArticlePubMedGoogle Scholar
- Sales G, Romualdi C: parmigene--a parallel R package for mutual information estimation and gene network reconstruction. Bioinformatics. 2011, 27 (13): 1876-1877. 10.1093/bioinformatics/btr274.View ArticlePubMedGoogle Scholar
- Kraskov A, Stogbauer H, Grassberger P: Estimating mutual information. Phys Rev E. 2004, 69 (6):Google Scholar
- Frey BJ, Dueck D: Clustering by passing messages between data points. Science. 2007, 315 (5814): 972-976. 10.1126/science.1136800.View ArticlePubMedGoogle Scholar
- Huang da W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols. 2009, 4 (1): 44-57.View ArticlePubMedGoogle Scholar
- Johnson N, Shapiro GI: Cyclin-dependent kinases (cdks) and the DNA damage response: rationale for cdk inhibitor-chemotherapy combinations as an anticancer strategy for solid tumors. Expert opinion on therapeutic targets. 2010, 14 (11): 1199-1212. 10.1517/14728222.2010.525221.PubMed CentralView ArticlePubMedGoogle Scholar
- Bailis JM, Forsburg SL: MCM proteins: DNA damage, mutagenesis and repair. Current opinion in genetics & development. 2004, 14 (1): 17-21. 10.1016/j.gde.2003.11.002.View ArticleGoogle Scholar
- Chen L, Willis SN, Wei A, Smith BJ, Fletcher JI, Hinds MG, Colman PM, Day CL, Adams JM, Huang DC: Differential targeting of prosurvival Bcl-2 proteins by their BH3-only ligands allows complementary apoptotic function. Molecular cell. 2005, 17 (3): 393-403. 10.1016/j.molcel.2004.12.030.View ArticlePubMedGoogle Scholar
- Chen L, Xuan J, Riggins RB, Clarke R, Wang Y: Identifying cancer biomarkers by network-constrained support vector machines. BMC systems biology. 2011, 5: 161-10.1186/1752-0509-5-161.PubMed CentralView ArticlePubMedGoogle Scholar
- Qiu P, Gentles AJ, Plevritis SK: Fast calculation of pairwise mutual information for gene regulatory network reconstruction. Computer methods and programs in biomedicine. 2009, 94 (2): 177-180. 10.1016/j.cmpb.2008.11.003.View ArticlePubMedGoogle Scholar
- Olsen C, Meyer PE, Bontempi G: On the impact of entropy estimation on transcriptional regulatory network inference based on mutual information. EURASIP journal on bioinformatics & systems biology. 2009, 308959-Google Scholar
- Schurmann T, Grassberger P: Entropy estimation of symbol sequences. Chaos. 1996, 6 (3): 414-427. 10.1063/1.166191.View ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.