Volume 12 Supplement 2
Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2010
bNEAT: a Bayesian network method for detecting epistatic interactions in genome-wide association studies
 Bing Han^{1} and
 Xuewen Chen^{1}
DOI: 10.1186/1471-2164-12-S2-S9
© Han and Chen; licensee BioMed Central Ltd. 2011
Published: 27 July 2011
Abstract
Background
Detecting epistatic interactions plays a significant role in understanding the pathogenesis of complex human diseases and in improving their prevention, diagnosis and treatment. A recent study on the automatic detection of epistatic interactions shows that Markov Blanket-based methods can find genetic variants strongly associated with common diseases and reduce false positives when the number of instances is large. Unfortunately, a typical dataset from genome-wide association studies contains a very limited number of examples, on which current methods, including Markov Blanket-based methods, may perform poorly.
Results
To address the small-sample problem, we propose a Bayesian network-based approach (bNEAT) to detect epistatic interactions. The proposed method also employs a Branch-and-Bound technique for learning. We apply the proposed method to simulated datasets based on four disease models and to a real dataset. Experimental results show that our method outperforms Markov Blanket-based methods and other commonly-used methods, especially when the number of samples is small.
Conclusions
Our results show that bNEAT achieves strong power regardless of the number of samples and is especially suitable for detecting epistatic interactions with slight or no marginal effects. The merits of the proposed approach lie in two aspects: a suitable score for Bayesian network structure learning that can reflect higher-order epistatic interactions, and a heuristic Bayesian network structure learning method.
Background
A genome-wide association study (GWAS) examines the genetic variants that differ from individual to individual in a cohort of cases (people with the disease) and controls (similar people without the disease) and relates them to a variety of diseases [1–3]. The most important category of genetic variation is the SNP (Single Nucleotide Polymorphism), which can influence disease risk. Conventional analysis methods for GWAS data consider only one SNP at a time, using the Armitage trend test (ATT), and are likely to miss genetic variants that have slight to moderate marginal effects but strong joint effects on disease risk. Moreover, it is widely acknowledged that some common complex diseases, such as various types of cancer, cardiovascular disease and diabetes, are caused by multiple genetic variants [4]. Therefore, there is an urgent need to detect high-order epistasis (gene-gene interaction), which refers to the interactive effect of two or more genetic variants on complex human diseases, and to explore how these epistatic interactions confer susceptibility to complex diseases [5]. However, the very large number of SNPs checked in a typical GWAS (more than 10 million) and the enormous number of possible SNP combinations make detecting high-order epistatic interactions from GWAS data statistically and computationally challenging [6, 7].
During the past decade, several heuristic computational methods have been proposed to detect causal interacting genes or SNPs. One family of computational methods for detecting epistatic interactions is statistical, and includes multifactor dimensionality reduction (MDR) [8–11], penalized logistic regression (stepPLR [12], lassoPLR [13]), and Bayesian epistasis association mapping (BEAM) [14]. MDR is a non-parametric and model-free method based on constructing a risk table for every SNP combination [11]. If the ratio of cases to controls in a cell of this risk table is larger than 1, MDR labels the cell as “high risk”; otherwise it labels it as “low risk”. Using the risk table, MDR can predict disease risk and selects the SNP combination with the highest prediction accuracy. StepPLR and lassoPLR modify standard logistic regression to avoid its overfitting problem when detecting epistatic interactions [15]. For example, stepPLR combines the LR criterion with a penalty on the L2-norm of the coefficients. This modification makes stepPLR more robust to high-order epistatic interactions [12]. In general, most statistical methods can only be applied to small-scale analysis (i.e., a small set of SNPs) because of their computational complexity. Moreover, MDR, stepPLR and lassoPLR are all predictor-based methods, which makes them prone to including false positives. Compared with MDR, stepPLR and lassoPLR, BEAM is a scalable and non-predictor-based statistical method [14]. BEAM partitions SNPs into three groups: group 0 contains normal SNPs, group 1 contains disease SNPs that affect disease risk independently, and group 2 contains disease SNPs that jointly contribute to disease risk (interactions). Given a fixed partition, BEAM computes the posterior probability of this partition from the SNP data using Bayes' theorem. BEAM then uses a Markov Chain Monte Carlo method to reach the optimal SNP partition with maximum posterior probability.
One drawback of BEAM is that identifying both single disease SNPs and SNP combinations simultaneously makes BEAM overly complex and weakens its power.
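The MDR labelling rule described above (a cell is "high risk" when its case/control ratio exceeds 1) can be illustrated with a minimal sketch. The function names, the 0/1 label coding and the handling of cells without controls are assumptions made for illustration; this is not the MDR software cited in [8].

```python
from collections import Counter

def mdr_risk_table(genotypes, labels, snp_idx, threshold=1.0):
    """Label every observed multi-locus genotype cell as 'high' or 'low' risk.

    genotypes: list of tuples, one genotype code per SNP
    labels:    list of 0 (control) / 1 (case)      [assumed coding]
    snp_idx:   indices of the SNPs forming the combination
    """
    cases, controls = Counter(), Counter()
    for row, y in zip(genotypes, labels):
        cell = tuple(row[i] for i in snp_idx)
        (cases if y == 1 else controls)[cell] += 1
    table = {}
    for cell in set(cases) | set(controls):
        # Equivalent to cases/controls > threshold; a cell with cases but
        # no controls is treated as high risk (an assumption of this sketch).
        high = cases[cell] > threshold * controls[cell]
        table[cell] = "high" if high else "low"
    return table

def mdr_accuracy(genotypes, labels, snp_idx, table):
    """Fraction of samples whose label matches the risk-table prediction."""
    correct = 0
    for row, y in zip(genotypes, labels):
        cell = tuple(row[i] for i in snp_idx)
        predicted = 1 if table.get(cell) == "high" else 0
        correct += predicted == y
    return correct / len(labels)

# Toy data: two SNPs, six individuals.
toy_genotypes = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1)]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_table = mdr_risk_table(toy_genotypes, toy_labels, snp_idx=(0, 1))
print(toy_table, mdr_accuracy(toy_genotypes, toy_labels, (0, 1), toy_table))
```

In a full MDR run this table would be built for every candidate SNP combination and the accuracy estimated by cross-validation, which is what makes the exhaustive search expensive.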
An alternative approach is machine learning-based methods, which are based on binary classification (prediction) and treat cases as positives and controls as negatives in SNP data. Support vector machine-based approaches [16] and random forest-based approaches [17] are two commonly-used machine learning methods for detecting epistatic interactions. They use an SVM or a random forest as a predictor and select, by feature selection, the set of SNPs with the highest prediction/classification accuracy. Like predictor-based statistical methods, machine learning-based methods lack the capability of detecting causal elements and tend to introduce many false positives, which may result in a huge cost for further biological validation experiments [18].
Recently, we proposed a new Markov Blanket-based method, DASSO-MB, to detect epistatic interactions in case-control studies [18]. The Markov Blanket is a minimal set of variables that completely shields the target variable from all other variables under the Markov condition. Thus, DASSO-MB can detect the SNP set that shows a strong association with diseases with the fewest false positives. Furthermore, the heuristic search strategy in DASSO-MB avoids the time-consuming training process required by SVMs and Random Forests.
In this paper, we address these problems by introducing a Bayesian network-based method, which also employs a Branch-and-Bound technique, to detect epistatic interactions. Bayesian networks provide a succinct representation of the joint probability distribution and of the conditional independence among a set of variables. In general, a structure learning method for Bayesian networks first defines a score reflecting the fitness between each possible structure and the observed data, and then searches for a structure with the maximum score. Compared with Markov Blanket-based methods, the merits of applying Bayesian networks to epistatic interaction detection include: (1) the BDe, BIC or MDL scores for Bayesian network structure learning can reflect higher-order interactions and are not sample-consuming; and (2) a heuristic Bayesian network structure learning method can solve the classical XOR problem, which may hinder the application of Markov Blanket-based approaches.
We apply the bNEAT (Bayesian Networks based Epistatic Association sTudies) method to simulated datasets based on four disease models and to a real dataset (the Age-related Macular Degeneration (AMD) dataset). We demonstrate that the proposed method outperforms Markov Blanket methods and other commonly-used methods, especially when the number of samples is small.
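The XOR problem mentioned above can be made concrete with a small sketch. It assumes two binary loci and a disease status equal to their exclusive-or, and uses mutual information as a generic association measure; both choices are illustrative assumptions rather than the models or tests used in this paper. Each locus alone carries no information about the disease, so a search that begins from marginal association has nothing to start from, while the pair is fully informative.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

# Balanced toy data: disease = snp_a XOR snp_b (hypothetical binary coding).
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
disease = [a ^ b for a, b in zip(snp_a, snp_b)]

mi_a = mutual_information(snp_a, disease)                       # marginal effect of A
mi_b = mutual_information(snp_b, disease)                       # marginal effect of B
mi_pair = mutual_information(list(zip(snp_a, snp_b)), disease)  # joint effect
print(mi_a, mi_b, mi_pair)
```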
Results
Analysis of simulation data
We first evaluate the proposed bNEAT method on simulated datasets, which are generated from three commonly used two-locus epistatic models in [15] and one three-locus epistatic model developed in [14]. Model 1 is a multiplicative model, model 2 demonstrates two-locus interaction multiplicative effects and model 3 specifies two-locus interaction threshold effects. Model 4 has three disease loci [14]. In model 4, certain genotype combinations increase disease risk, and each disease locus has almost no marginal effect.
We measure the performance of each method by its power,
Power = N_{D} / N,
where N is the total number of simulated datasets and N_{D} is the number of simulated datasets in which all disease-associated markers are identified without any false positives.
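The power criterion, the fraction N_{D}/N of simulated datasets in which exactly the disease markers are reported, can be computed as in the minimal sketch below. The function name and the representation of each result as a set of marker indices are assumptions for illustration.

```python
def detection_power(detected_sets, true_markers):
    """Fraction of datasets whose detected markers equal the true disease markers.

    A dataset counts toward N_D only if every true marker is found and no
    other marker (false positive) is reported.
    """
    truth = frozenset(true_markers)
    n_exact = sum(1 for detected in detected_sets if frozenset(detected) == truth)
    return n_exact / len(detected_sets)

# Four hypothetical runs; markers 3 and 7 are the true disease loci.
runs = [{3, 7}, {3}, {3, 7, 9}, {7, 3}]
print(detection_power(runs, {3, 7}))
```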
We compare the bNEAT algorithm with four methods, BEAM, Support Vector Machine, MDR and DASSO-MB, on the four simulated disease models. The BEAM software is downloaded from http://www.fas.harvard.edu/~junliu/BEAM and we set the threshold of the B statistic to 0.1 [14]. For support vector machines, we use LIBSVM with an RBF kernel to detect gene-gene interactions; the details are given in [18]. Since the MDR algorithm cannot be applied to a large dataset directly, we first reduce the number of SNPs to 10 with ReliefF [19], a commonly-used feature selection algorithm, and then MDR performs an exhaustive search for the SNP set that maximizes cross-validation consistency and prediction accuracy. For DASSO-MB, we set the threshold of the G^{2} test to 0.01 to determine (conditional) dependence and (conditional) independence.
Typically, a GWAS cannot generate a large number of samples because of the high experimental cost. Thus, the performance of computational methods for epistatic interaction detection on small samples is important. We explore the effect of the number of samples on the performance of bNEAT, DASSO-MB, BEAM and SVM. We generate synthetic datasets containing 40 markers genotyped for different numbers of cases and controls with r^{2} = 1 and MAF = 0.5.
Results on AMD data
In this section, we apply bNEAT to large-scale datasets (a large number of SNPs but few samples) from real genome-wide case-control studies, which often require genotyping of 30,000–1,000,000 common SNPs. We use an Age-related Macular Degeneration (AMD) dataset containing 116,204 SNPs genotyped for 96 cases and 50 controls [21]. AMD, which can result in a loss of vision, is caused by multiple genetic factors.
To remove inconsistently genotyped SNPs, we apply the filtering process of [18]. After filtering, 97,327 SNPs remain. Because the number of SNPs is very large, it is necessary to restrict the search space by selecting candidate SNPs, as in [22]. We select the top 200 candidate SNPs based on the G^{2} test and then use bNEAT to identify disease SNPs related to AMD. bNEAT detects three associated SNPs: rs380390, rs3913094 and rs10518433. The first SNP, rs380390, was already reported in [21] to have a significant association with AMD. Although no evidence has been reported in the literature that the other two SNPs are related to AMD, they may be plausible candidate SNPs associated with AMD.
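The candidate pre-selection step can be sketched as follows. The text does not spell out how the G^{2} statistic is computed, so the code assumes the textbook likelihood-ratio form G^{2} = 2 Σ O ln(O/E) on the SNP-by-disease contingency table; the function names and the tie-breaking rule are hypothetical.

```python
import math
from collections import Counter

def g2_statistic(xs, ys):
    """Likelihood-ratio statistic G^2 = 2 * sum O * ln(O / E) for two discrete variables."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    cx, cy = Counter(xs), Counter(ys)
    g2 = 0.0
    for (x, y), observed in joint.items():
        expected = cx[x] * cy[y] / n  # expected count under independence
        g2 += 2.0 * observed * math.log(observed / expected)
    return g2

def top_k_snps(genotypes, labels, k):
    """Indices of the k SNPs with the largest marginal G^2 against disease status."""
    n_snps = len(genotypes[0])
    scored = [
        (g2_statistic([row[j] for row in genotypes], labels), j)
        for j in range(n_snps)
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))  # ties broken by SNP index (assumption)
    return [j for _, j in scored[:k]]

# Toy data: SNP 0 tracks the label perfectly, SNP 1 is independent of it.
toy_genotypes = [(0, 0), (0, 1), (1, 0), (1, 1)]
toy_labels = [0, 0, 1, 1]
print(top_k_snps(toy_genotypes, toy_labels, k=1))
```

Because this ranking uses only marginal association, it can discard loci that act purely through interaction, which is one reason the choice of the candidate-set size matters.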
Conclusions and discussion
Compared with many computational methods used to identify epistatic interactions, Markov Blanket-based methods can increase power and reduce false positives. However, Markov Blanket-based methods are sample-consuming, and their greedy search strategy is not suitable for detecting interaction models in which no disease locus has an independent main effect. In this paper, we propose a Bayesian network method based on a Branch-and-Bound technique (bNEAT) to detect epistatic interactions. We demonstrate that the proposed bNEAT method significantly outperforms the Markov Blanket method and other commonly-used methods, especially when the number of samples is small.
Even though the bNEAT method is more powerful than the Markov Blanket-based method, it cannot be applied directly to genome-wide datasets because of the large number of SNPs. Integrating Markov chain Monte Carlo or simulated annealing techniques into bNEAT to make it scalable to genome-wide datasets is one direction for future research. Moreover, we will explore different scoring schemes for epistatic interaction detection with Bayesian networks. For example, information-based scores (e.g., the AIC and BIC scores) are derived under the assumption of a large number of samples [23]. When the number of samples is small, the approximations used to derive both the AIC and BIC scores no longer hold. In fact, the penalty term for model complexity in the AIC and BIC scores can also reflect the variance of the model [24]. Thus, in future work, we will design a new scoring scheme that estimates the penalty term from the data so that the score fits the data better.
Methods
Bayesian networks
A Bayesian network is a directed acyclic graph (DAG) G consisting of nodes corresponding to a random variable set X = {X_{1}, X_{2}, …, X_{n}} and edges between nodes, which determine the structure of G and therefore the joint probability distribution of the whole network [25].
Definition 1 (Conditional Independence) For three random variables (nodes) X, Y and Z, if the probability distribution of X conditioned on both Y and Z is equal to the probability distribution of X conditioned only on Y, i.e., P(X | Y, Z) = P(X | Y), X is conditionally independent of Z given Y.
This conditional independence is represented as (X ⊥ Z | Y). Similarly, ¬(X ⊥ Z | Y) represents conditional dependence [26].
Theorem 1 (Local Markov Assumption) Each variable is conditionally independent of its non-descendants, given its parents in the DAG G.
Under this assumption, the joint probability distribution J over X factorizes as
J = P(X_{1}, X_{2}, …, X_{n}) = ∏_{i=1}^{n} P(X_{i} | Pa(X_{i})),
where Pa(X_{i}) denotes the set of parents of X_{i} in G. Therefore, there are two components in a Bayesian network. The first component is the DAG G reflecting the structure of the network. The second component, θ, describes the conditional probability distributions P(X_{i} | Pa(X_{i})) that specify the unique distribution J on G.
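As a worked illustration of the two components G and θ, the sketch below evaluates the joint probability of a full assignment as the product of the local conditional probabilities P(X_{i} | Pa(X_{i})). The three-node network X → Z ← Y and all probability values are invented for the example.

```python
from itertools import product

def joint_probability(assignment, parents, cpts):
    """Product over all nodes of P(node = value | parents = parent values)."""
    p = 1.0
    for var, value in assignment.items():
        pa_values = tuple(assignment[pa] for pa in parents[var])
        p *= cpts[var][pa_values][value]
    return p

# Structure G: X and Y have no parents; Z has parents X and Y.
parents = {"X": (), "Y": (), "Z": ("X", "Y")}

# Parameters theta: one distribution per parent configuration (made-up numbers).
cpts = {
    "X": {(): {0: 0.7, 1: 0.3}},
    "Y": {(): {0: 0.6, 1: 0.4}},
    "Z": {
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.2, 1: 0.8},
        (1, 0): {0: 0.2, 1: 0.8},
        (1, 1): {0: 0.9, 1: 0.1},
    },
}

p_example = joint_probability({"X": 1, "Y": 0, "Z": 1}, parents, cpts)

# Sanity check: the factorized probabilities of all 8 assignments sum to 1.
total = sum(
    joint_probability({"X": x, "Y": y, "Z": z}, parents, cpts)
    for x, y, z in product((0, 1), repeat=3)
)
print(p_example, total)
```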
Definition 2 (V-structure) For three nodes X, Y and Z in a Bayesian network, a structure of the form X→Z←Y (with no edge between X and Y) is called a v-structure.
Definition 3 (D-separation) For three nodes X, Y and Z in a Bayesian network, if there is no active path between X and Y given Z, we say that X and Y are d-separated given Z, denoted as Dsep(X; Y | Z).
Structure learning of Bayesian networks
Even though a Bayesian network can be constructed by an expert, most tasks of determining the network structure are too complex for humans. We have no choice but to learn the network structure and parameters from data. There are two types of structure learning methods for Bayesian networks: constraint-based methods and score-and-search methods.
The constraint-based methods first build the skeleton of the network (an undirected graph) from a set of dependence and independence relationships. Next, they direct the links in the undirected graph to construct a directed graph whose d-separation properties correspond to the dependence and independence relationships that were determined [27–29]. Even though constraint-based methods have a rigorous theoretical foundation, errors in the conditional dependence and independence tests affect their stability, and this problem is especially serious when the number of samples is small.
The score-and-search methods view a Bayesian network as a statistical model and transform the structure learning of Bayesian networks into a model selection problem [30]. To select the best model, a score function is needed to indicate the fitness between a network and the data. The learning task is then to find the network with the highest score. Thus, score-and-search methods typically consist of two components: (1) a score function, and (2) a search procedure. In this paper, we focus on structure learning approaches based on score-and-search methods because they are more robust for small datasets than constraint-based methods.
One of the most important issues in score-and-search methods is the selection of the score function. A natural choice is the likelihood function. However, the maximum likelihood score often overfits the data because it does not reflect model complexity. Therefore, a good score function for Bayesian network structure learning must balance the fitness and the complexity of a selected structure. There are several existing score functions based on a variety of principles, such as information theory and minimum description length (BIC score, AIC score, MDL score) [31–33] and the Bayesian approach (BDe score) [34].
The BIC score is derived from a Taylor expansion and Laplace approximation as the number of samples N approaches ∞. As a result, the structure penalty term in (8) is too strict when the number of samples is small; therefore, we reduce the coefficient of the second term in (8) from 1/2 to a smaller number (in our applications, we empirically set it to 0.17 for all the datasets we study).
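The effect of the penalty coefficient can be illustrated with a minimal sketch of a BIC-style score for one node and a candidate parent set. Equation (8) is not reproduced in this text, so the code assumes the standard decomposable form, maximized log-likelihood minus c · ln N · (number of free parameters), with c = 1/2 for ordinary BIC. The function name, argument layout and toy data are hypothetical.

```python
import math
from collections import Counter

def bic_family_score(data, child, parents, arity, penalty=0.5):
    """BIC-style score of `child` given `parents` (assumed standard form).

    data:    list of tuples of discrete values, one column per variable
    arity:   number of states of each variable, indexed by column
    penalty: coefficient c of the complexity term (0.5 for ordinary BIC)
    """
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    pa_counts = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximized log-likelihood: sum of N_jk * ln(N_jk / N_j).
    loglik = sum(c * math.log(c / pa_counts[pa]) for (pa, _), c in joint.items())
    n_parent_configs = 1
    for p in parents:
        n_parent_configs *= arity[p]
    n_params = (arity[child] - 1) * n_parent_configs
    return loglik - penalty * math.log(n) * n_params

# Toy data: columns 0 and 1 are binary loci, column 2 is their XOR (disease).
xor_data = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)] * 5
arity = [2, 2, 2]

score_none = bic_family_score(xor_data, 2, (), arity)
score_one = bic_family_score(xor_data, 2, (0,), arity)
score_both = bic_family_score(xor_data, 2, (0, 1), arity)
score_both_relaxed = bic_family_score(xor_data, 2, (0, 1), arity, penalty=0.17)
print(score_none, score_one, score_both, score_both_relaxed)
```

On this toy data a single parent scores worse than no parent, while the pair of parents scores best, and lowering the coefficient from 0.5 to 0.17 reduces the complexity charge paid by the two-parent family.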
The computational task in score-and-search methods is to find a network structure with the highest score. The search space consists of a super-exponential number of structures, 2^{O(n^{2})}, and finding the optimal Bayesian network structure from data is NP-hard [37]. One simple heuristic search algorithm is the greedy hill-climbing algorithm. In greedy hill-climbing, there are three types of operators that change one edge at each step:

Add an edge

Remove an edge

Reverse an edge
With these three operators, we can construct the local neighbourhood of the current network. We then select the network with the highest score in the local neighbourhood to obtain the maximal gain. This process is repeated until it reaches a local maximum. However, the greedy hill-climbing algorithm cannot guarantee a global maximum [30]. Other structure learning methods for Bayesian networks include Branch-and-Bound (B&B) [38, 39], genetic algorithms [40] and Markov chain Monte Carlo [41]. Branch-and-Bound algorithms guarantee optimal results in a significantly reduced search time compared with exhaustive search. Thus, we employ B&B algorithms in our study.
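The greedy hill-climbing baseline described above (not the Branch-and-Bound search that bNEAT employs) can be sketched as follows. The score is passed in as a function of the edge set, so any decomposable score could be plugged in; the toy score used in the example, which simply counts disagreements with a fixed target graph, is an assumption made to keep the sketch self-contained.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Repeatedly strip nodes without incoming edges; a leftover means a cycle."""
    remaining = set(nodes)
    while remaining:
        roots = {
            v for v in remaining
            if not any(b == v and a in remaining for a, b in edges)
        }
        if not roots:
            return False
        remaining -= roots
    return True

def neighbours(nodes, edges):
    """Yield every edge set reachable by adding, removing or reversing one edge."""
    edges = frozenset(edges)
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            yield edges - {(a, b)}                  # remove an edge
            yield (edges - {(a, b)}) | {(b, a)}     # reverse an edge
        elif (b, a) not in edges:
            yield edges | {(a, b)}                  # add an edge

def hill_climb(nodes, score, start=frozenset()):
    """Move to the best-scoring acyclic neighbour until no neighbour improves."""
    current = frozenset(start)
    current_score = score(current)
    while True:
        best, best_score = None, current_score
        for candidate in neighbours(nodes, current):
            if not is_acyclic(nodes, candidate):
                continue
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:  # local maximum reached
            return current, current_score
        current, current_score = best, best_score

# Toy score: negative number of edges that differ from a target v-structure.
target = frozenset({("A", "C"), ("B", "C")})
toy_score = lambda edges: -len(frozenset(edges) ^ target)
learned, learned_score = hill_climb(["A", "B", "C"], toy_score)
print(sorted(learned), learned_score)
```

With a data-driven score the same loop can stop at a local maximum that is not the best structure, which is the limitation that motivates exact methods such as B&B.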
Declarations
Acknowledgements
This work is supported by the US National Science Foundation Award IIS-0644366.
This article has been published as part of BMC Genomics Volume 12 Supplement 2, 2011: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2010. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S2.
References
 1. Hirschhorn JN, Daly MJ: Genome-wide association studies for common diseases and complex traits. Nat Rev Genet. 2005, 6: 95-108.
 2. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008, 9: 356-369. 10.1038/nrg2344.
 3. Wang WY, Barratt BJ, Clayton DG, Todd JA: Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet. 2005, 6: 109-118. 10.1038/nrg1522.
 4. Cordell HJ: Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet. 2009, 10: 392-404.
 5. Cordell HJ: Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Genet. 2002, 11: 2463-2468. 10.1093/hmg/11.20.2463.
 6. McKinney BA, Reif DM, Ritchie MD, Moore JH: Machine learning for detecting gene-gene interactions: a review. Appl Bioinformatics. 2006, 5: 77-88. 10.2165/00822942-200605020-00002.
 7. Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB: Detection of gene x gene interactions in genome-wide association studies of human population data. Hum Hered. 2007, 63: 67-84. 10.1159/000099179.
 8. Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics (Oxford, England). 2003, 19: 376-382. 10.1093/bioinformatics/btf869.
 9. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC: A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. Journal of theoretical biology. 2006, 241: 252-261. 10.1016/j.jtbi.2005.11.036.
 10. Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genetic epidemiology. 2003, 24: 150-157. 10.1002/gepi.10218.
 11. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American journal of human genetics. 2001, 69: 138-147. 10.1086/321276.
 12. Park MY, Hastie T: Penalized logistic regression for detecting gene interactions. Biostatistics (Oxford, England). 2008, 9: 30-50.
 13. Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics (Oxford, England). 2009, 25: 714-721. 10.1093/bioinformatics/btp041.
 14. Zhang Y, Liu JS: Bayesian inference of epistatic interactions in case-control studies. Nature genetics. 2007, 39: 1167-1173. 10.1038/ng2110.
 15. Marchini J, Donnelly P, Cardon LR: Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature genetics. 2005, 37: 413-417. 10.1038/ng1537.
 16. Chen SH, Sun J, Dimitrov L, Turner AR, Adams TS, Meyers DA, Chang BL, Zheng SL, Gronberg H, Xu J, Hsu FC: A support vector machine approach for detecting gene-gene interaction. Genetic epidemiology. 2008, 32: 152-167. 10.1002/gepi.20272.
 17. Jiang R, Tang W, Wu X, Fu W: A random forest approach to the detection of epistatic interactions in case-control studies. BMC bioinformatics. 2009, 10 (Suppl 1): S65. 10.1186/1471-2105-10-S1-S65.
 18. Han B, Park M, Chen XW: A Markov blanket-based method for detecting causal SNPs in GWAS. BMC bioinformatics. 2010, 11 (Suppl 3): S5. 10.1186/1471-2105-11-S3-S5.
 19. Robnik-Šikonja M, Kononenko I: Theoretical and empirical analysis of ReliefF and RReliefF. Machine learning. 2003, 53: 23-69. 10.1023/A:1025667309714.
 20. Ueno M: Learning networks determined by the ratio of prior and data. Proceedings of 26th Conference on Uncertainty in Artificial Intelligence: 8-11 July 2010; Corvallis, Oregon. Edited by: P. Grünwald and P. Spirtes. 2010, Arlington: AUAI Press, 598-605.
 21. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, et al: Complement factor H polymorphism in age-related macular degeneration. Science (New York, NY). 2005, 308: 385-389. 10.1126/science.1109557.
 22. Friedman N, Nachman I, Pe'er D: Learning Bayesian Network Structure from Massive Datasets: The ''Sparse Candidate'' Algorithm. Proceedings of 15th Conference on Uncertainty in Artificial Intelligence: 30 July - 1 August 1999; Stockholm, Sweden. Edited by: Kathryn B. Laskey and Henri Prade. 1999, San Francisco: Morgan Kaufmann, 206-215.
 23. Burnham KP, Anderson DR, Hussong Fund Hazel: Model selection and multimodel inference: a practical information-theoretic approach. 2002, New York: Springer, 2
 24. Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. 2001, New York: Springer
 25. Chen XW, Anantha G, Wang X: An effective structure learning method for constructing gene networks. Bioinformatics (Oxford, England). 2006, 22: 1367-1374. 10.1093/bioinformatics/btl090.
 26. Chen XW, Anantha G, Lin X: Improving Bayesian Network Structure Learning with Mutual Information-Based Node Ordering in the K2 Algorithm. IEEE Trans on Knowl and Data Eng. 2008, 20: 628-640.
 27. Cheng J, Greiner R, Kelly J, Bell D, Liu W: Learning Bayesian networks from data: an information-theory based approach. Artif Intell. 2002, 137: 43-90. 10.1016/S0004-3702(02)00191-1.
 28. Pearl J: Causality: models, reasoning, and inference. 2009, Cambridge, U.K.; New York: Cambridge University Press, 2
 29. Spirtes P, Glymour CN, Scheines R: Causation, prediction, and search. 2000, Cambridge, Mass.: MIT Press, 2
 30. Heckerman D, Geiger D, Chickering DM: Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Mach Learn. 1995, 20: 197-243.
 31. Akaike H: A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974, 19: 716-723. 10.1109/TAC.1974.1100705.
 32. Schwarz G: Estimating the dimension of a model. The Annals of Statistics. 1978, 6: 461-464. 10.1214/aos/1176344136.
 33. Rissanen J: Stochastic Complexity and Modeling. The Annals of Statistics. 1986, 14: 1080-1100. 10.1214/aos/1176350051.
 34. Cooper GF, Herskovits E: A Bayesian Method for the Induction of Probabilistic Networks from Data. Mach Learn. 1992, 9: 309-347.
 35. Koller D, Friedman N: Probabilistic graphical models: principles and techniques. 2009, Cambridge, Mass.: MIT Press
 36. Yu J, Smith VA, Wang PP, Hartemink AJ, Jarvis ED: Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics (Oxford, England). 2004, 20: 3594-3603. 10.1093/bioinformatics/bth448.
 37. Chickering DM, Heckerman D, Meek C: Large-Sample Learning of Bayesian Networks is NP-Hard. J Mach Learn Res. 2004, 5: 1287-1330.
 38. Suzuki J: Learning Bayesian Belief Networks Based on the Minimum Description Length Principle: An Efficient Algorithm Using the B & B Technique. Proceedings of 13th conference on machine learning: 3-6 July 1996; Bari, Italy. Edited by: Lorenza Saitta. 1996, San Francisco: Morgan Kaufmann, 462-470.
 39. Tian J: A Branch-and-Bound Algorithm for MDL Learning Bayesian Networks. Proceedings of 16th Conference on Uncertainty in Artificial Intelligence: 30 June - 3 July 2000; Stanford, California. Edited by: Craig Boutilier and Moisés Goldszmidt. 2000, San Francisco: Morgan Kaufmann, 580-588.
 40. Wong ML, Lam W, Leung KS: Using Evolutionary Programming and Minimum Description Length Principle for Data Mining of Bayesian Networks. IEEE Trans Pattern Anal Mach Intell. 1999, 21: 174-178. 10.1109/34.748825.
 41. Giudici P, Castelo R: Improving Markov Chain Monte Carlo Model Search for Data Mining. Machine learning. 2003, 50: 127-158. 10.1023/A:1020202028934.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.