 Methodology article
 Open Access
 Published:
A Bayesian model for detection of highorder interactions among genetic variants in genomewide association studies
BMC Genomicsvolume 16, Article number: 1011 (2015)
Abstract
Background
A central question for disease studies and crop improvements is how genetics variants drive phenotypes. Genome Wide Association Study (GWAS) provides a powerful tool for characterizing the genotypephenotype relationships in complex traits and diseases. Epistasis (genegene interaction), including highorder interaction among more than two genes, often plays important roles in complex traits and diseases, but current GWAS analysis usually just focuses on additive effects of single nucleotide polymorphisms (SNPs). The lack of effective computational modelling of highorder functional interactions often leads to significant underutilization of GWAS data.
Results
We have developed a novel Bayesian computational method with a Markov Chain Monte Carlo (MCMC) search, and implemented the method as a Bayesian Highorder Interaction Toolkit (BHIT) for detecting epistatic interactions among SNPs. BHIT first builds a Bayesian model on both continuous data and discrete data, which is capable of detecting highorder interactions in SNPs related to case—control or quantitative phenotypes. We also developed a pipeline that enables users to apply BHIT on different species in different use cases.
Conclusions
Using both simulation data and soybean nutritional seed composition studies on oil content and protein content, BHIT effectively detected some highorder interactions associated with phenotypes, and it outperformed a number of other available tools. BHIT is freely available for academic users at http://digbio.missouri.edu/BHIT/.
Background
In this era of explosive genomics development and nextgeneration sequencing (NGS) data, genomewide association study (GWAS) is central to characterizing complex traits and diseases [1]. However, the vast majority of genetic variants associated with complex traits identified by current GWAS approaches explain only a small amount of the overall variance of these traits in the underlying population [1]. Some of the reasons for this have been extensively studied, including trait identity problems, sample collection, population resampling and epigenetic variation [2]. From the perspective of computational methodology, one prominent limitation of widely used methods is due to the fact that GWAS usually analyzes one single nucleotide polymorphism (SNP) at a time [3]. Admittedly, the singleSNP approach is useful and (relatively) computationally efficient [4–6]. However, this approach does not account for collective effects among SNPs (or interactions among genetic variants in a more general sense) indicating a phenotype or a disease [3]. In genetics, these effects arise from the phenomenon epistasis, where the expression or effect of one gene depends on the presence of one or more other genes [4]. The roles of SNP interactions have been widely acknowledged in the research community. Hence, a number of computational methods for detecting SNP interactions have been developed in recent years [4, 5]. These detected SNP interactions often illustrate epistasis interactions that better explain the phenotype from the genotype [7, 8].
The major challenge in SNP interaction detection using the whole genomescale data is computing time [4, 5, 9]. It may not be feasible to enumerate all possible twoorder interactions in whole genomic scale with typical computational resources, let alone calculating highorder SNP interactions (e.g., three SNPs interacting together, four SNPs interacting together, etc.) across the whole genome. Even with improved computational strategies and resources, multiple testing in computing is likely more problematic. Researchers have developed several methods to address this issue in detecting and exploring SNP interactions [5]. Briefly, these methods use four strategies: exhaustive search, heuristic search, sampling, and twostage search. This exhaustive search strategy examines all possible SNP interactions to make sure that no candidates are missing, which is extremely costly in computational time. PLINK [10], uses a classic logistic regression and oddsratio contrast to infer epistasis, which provides a baseline of SNP interaction detection. BOOST presents data in the Boolean format and conducts Boolean computation to speed up the search process [11]. The heuristic search, e.g. EDCF [12] sets several rules to prune the search space, which consumes less time than the exhaustive search, but may lose some true SNP interactions. The sampling strategy applies statisticsbased sampling processes to avoid the bruteforce search. BEAM uses Markov Chain Monte Carlo (MCMC) in Bayesian partition to infer highorder interactions in case–control data [13], then its following version BEAM2 incorporates linkage disequilibrium (LD) information into Bayesian partition [14]. The twostage search strategy separates the two search processes by first filtering out candidates and then identifying interactions, such as SNPHarvester [15] and TRM [16].
Although there are multiple methods for SNPinteraction detections, several challenges still remain open to conquest:

(a)
Highorder (more than twoorder) SNP interaction is rarely handled. Given the extremely high computational cost in highorder SNP interaction detection in GWAS [5], nearly all the existing methods ignore highorder epistasis, which are highly important in many cases [4], especially in quantitative trait analysis [9]. It was demonstrated that highorder epistasis is critical in metabolic networks in yeast [17] and E. coli [18]. Specific interactions have uncovered twogene to fourgene interactions showing differential pleiotropic effects on branching and flowering in Arabidopsis [19], which cannot be easily detected by standard twoway tests.

(b)
Continuous traits in genotypephenotype relationships. Nearly all the existing computational methods are designed for categorical phenotypes in case–control GWAS analysis. To our knowledge, no other existing methods can effectively handle highorder interactions in continuous traits.
To address these issues, we developed the Bayesian Highorder Interaction Toolkit (BHIT), a novel Bayesian partition computational method and tool for detecting SNP interactions. The proposed approach first builds a Bayesian model on both continuous data and discrete data, and then extends the model to partition multiplephenotype data. When compared with other methods on both simulation data and real data, the key strengths of our developed approach are as follows: (i) With the advanced Bayesian model using MCMC search, BHIT can efficiently explore highorder interactions. (ii) BHIT can handle both continuous and discrete phenotypes, and the interaction within or between phenotypes and genetic data can also be detected. We have applied BHIT to both simulation datasets and experimental soybean oil/protein content datasets, and we were able to obtain high accuracy and reliable results on both datasets. Based on BHIT, we also developed a generalpurpose BHIT pipeline to meet the demands of detecting highorder interactions between genotype and phenotypes for various species.
Methods
Let Y be the continuous trait with G samples in the population, Y = (Y _{1}, Y _{2}, …, Y _{ G }). X contains observed genetic variations and R is the total number of variations, X = (X _{1}, X _{2}, …, X _{ R }). Assuming traits are conditional dependence on associated genetic variations, Y _{ i } are independent of each other following a Gaussian distribution, as shown in the example illustrated by Fig. 1; hence, we assume Y can be divided into M clusters based on values of the quantitative trait. Let I be indicators, I = (I _{1}, I _{2}, …, I _{ R }), indicating group membership of each X _{ i }. H is the total number of groups (determined by I, 1 ≤ H ≤ R), which means we partition all R genetic variations into H groups by I. We assume M is the total number of combination configurations of X _{{I = 1}} that are associated with Y, where X _{{I = h}} represents all the X in the hth group, (h = 1 ⋅ ⋅ ⋅ H). In the example of Fig. 1, M equals to 4 by the values of quantitative traits. In genotype, only four genes inferred in Group 1 (X _{{I = 1}}) is associated with Y; all the other groups such as X _{{I = h}}(h = 2 ⋅ ⋅ ⋅ H) are independent groups, which are clearly not associated with phenotypes.
The goal of the method is to infer P(I, HY, X), which is partitioning relationships between the genotypes (X) and phenotypes (Y), and the relationship is depicted by grouping dependent genotypes and phenotypes in the same groups (illustrated by I and H). Given partition indicator I, the likelihood is illustrated as (1):
Where X _{{I = h}} represents all the X in the hth group, and Y _{{m}} represents all the Y in the mth cluster. For P(X _{{I = h}}I) we used the multinomial distribution method and Dirichlet prior as the Bayesian partition model in our study [20]. Assume that there are c _{ h } possible combination values in the hth genetic variation group (X _{{I = h}}); thus, M = c _{1}. In the hth group, suppose for every Y _{ i } (the ith row in Fig. 1a), X _{{I = h}} has the probability p _{1} to be the first combination value, p _{2} for the second combination value, … , p _{ ch } for the last combination value, and \( {\displaystyle {\sum}_{j=1}^{c_h}{p}_j=1} \). Then the conditional likelihood for the hth group of genetic variations is \( P\left({X}_{\left\{I=h\right\}}\Big{p}_1,\dots, {p}_{ch},I\right)={\displaystyle \prod_{j=1}^{c_h}{p}_j^{n_j}} \) where n _{ j } denotes the number of the rows in Fig. 1a taking the jth value in X _{{I = h}}. However, we do not know the p _{ j }. So we assume they are random and used the Dirichlet prior on them:
where \( B\left(\alpha \right)=\frac{{\displaystyle {\prod}_{j=1}^{c_h}\varGamma \left({\alpha}_j\right)}}{\varGamma \left({\displaystyle {\sum}_{j=1}^{c_h}{\alpha}_j}\right)} \) , \( \alpha =\left({\alpha}_1,\dots, {\alpha}_{c_h}\right) \) and Γ(x) = ∫ _{0} ^{∞} t ^{x − 1} e ^{− t} dt.
So we have
By integrating p we have:
For P(Y _{{m}}X _{{I = 1}}) we use the Gaussian distribution with conjugate priors on mean and variance. Suppose there are G _{ m } rows (samples) in Y _{{m}} (the mth cluster, \( {\displaystyle \sum_{m=1}^M{G}_m}=G \)), and they are i.i.d. to follow a N(μ _{ m }, σ _{ m } ^{2} ) distribution. We further use conjugate priors , μ _{ m }σ _{ m } ^{2} ~ N(μ _{0}, σ _{0} ^{2} /κ _{0}), σ _{0} ^{2} ~ Inv − χ ^{2}(ν _{0}, σ _{0} ^{2} ). The resulting posterior distribution of (μ _{ m }, σ _{ m } ^{2} )Y _{{m}} is then a \( N\sim Inv{\chi}^2\left({\mu}_n,\frac{\sigma_n^2}{\kappa_n},{\nu}_n,{\sigma}_n^2\right) \) distribution (n = G _{ m }), where: \( \overline{y}={\overline{Y}}_{\left\{m\right\}} \), \( {\mu}_n=\frac{1}{\kappa_n}\left({\kappa}_0{\mu}_0+n\overline{y}\right) \), κ _{ n } = κ _{0} + n, ν _{ n } = ν _{0} + n. And \( {\sigma}_n^2=\frac{1}{\nu_n}\left({\nu}_0{\sigma}_0^2+\left(n1\right){s}^2+\frac{\kappa_0n}{\kappa_n}{\left(\overline{y}{\mu}_0\right)}^2\right) \), \( {s}^2=\frac{1}{n1}{\displaystyle \sum_{i=1}^n{\left({y}_i\overline{y}\right)}^2} \).
Using the relationship \( P\left({Y}_{\left\{m\right\}}\right)=\frac{P\left(\mu, {\sigma}^2,{Y}_{\left\{m\right\}}\right)}{P\left(\mu, {\sigma}^2\Big{Y}_{\left\{m\right\}}\right)} \), we can compute the marginal distribution of the data as (5):
This form of the marginal distribution is then used to compute Formula (1).
The joint posterior of the targeted P(I, HY, X) was defined as P(I, HY, X) ∝ P(H)P(IH)P(Y, XI, H), and the MetropolisHasting algorithm applying MCMC [13] was used to sample from this posterior distribution and make the inference on I. Considering that different I’s may represent the same grouping of X, for example, I = {1, 2, 3} is the same as I = {1, 3, 2} so we order the group label increasingly, and thus I = {1, 3, 2} is not allowed.
BHIT Algorithm
The details of the BHIT algorithm is shown below:

Step 1.
Initialization. Choose I _{0} = (I _{1}, I _{2}, …, I _{ R }), genotype matrix X(G × R), and phenotype matrix Y(G × 1); then H _{0} equals to R, means each genetic variation makes one group by initial partition; set the maximum number of iterations T with the current iteration t = 0; set the burnin number of iterations B (B < T), choose I only when MCMC gets convergency.

Step 2.
Calculate initialized likelihood P _{0}(Y, XI _{0}, H _{0}) by (1) (H is determined by I).

Step 3.
For iteration t, sample I _{ t } randomly, get a candidate I _{ t } ^{′} , and calculate likelihood of P _{ t } ' (Y, XI _{ t } ', H _{ t } ').

Step 4.
Calculate the defined acceptance ratio.
$$ \alpha ={P}_t\hbox{'}\left(Y,X\Big{I}_t\hbox{'},{H}_t\hbox{'}\right)/{P}_t\left(Y,X\Big{I}_t,{H}_t\right) $$ 
Step 5.
If α ≥ 1, then accept the candidate by setting I _{ t + 1} = I _{ t } '; otherwise, accept the candidate with probability α. If the candidate is rejected, set I _{ t + 1} = I _{ t } instead. Meanwhile, set t = t + 1.

Step 6.
Check whether iteration t is smaller than burnin number T; if not store I _{ t }.

Step 7.
Check whether iteration t meets threshold T; if not go back to Step 3; or select output I _{ t }.
Simulation setup
Considering the additive and nonadditive effects, we used four different Epistasis Models as introduced in [13] to test epistasis on single continuous trait, and then proposed four additional sophisticated Dependency Models to mimic scenarios of epistasis detection on different types of phenotypes. In addition with calculating statistical powers on type2 errors of Epistasis and Dependency Models, two Nullmodels are constructed to calculate type1 errors of BHIT.
Simulation on epistasis models
Epistasis models were designed to check the epistasis detection on single quantitative trait. Epistasis Models 1–4 were generated upon the genotypes depicted by discrete numbers (0 for Homozygous Major Allele, one for Heterozygous and two for Homozygous Minor Allele), and the quantitative phenotypes dependent with the genotype were depicted by continuous values. Each model contains one group of ground truth loci predefined interacted together according individual types of interaction as work [13] in genotype, and other loci are independent with each other as the background. The quantitative trait is simulated based on the genotype of groundtruth loci combination following normal distribution. Increased quantitative level was assigned to the specific genotype combination of groundtruth loci, and marginal effect of each groundtruth locus individually ranges from very small to zero. The odds tables of Epistasis Models are demonstrated in Additional file 1: Tables S1–S4. Effect parameter θ sand α are determined using the same procedure in [13].
Model 1 demonstrates additive effects of paired interactions in genotype. This model contains two groundtruth loci, each of which contributes to the quantitative trait independently; furthermore, addictive effects accumulate when both loci occur. Model 2 is analogous with Model 1, but the addictive effect is presented only when both loci have at least one groundtruth allele. Model 3 is a threshold model in which each of the two loci contributes to the quantitative trait independently, but both loci presented simultaneously do not further increase the quantitative trait. Model 4 contains three groundtruth loci interacting together.
We used R to generate different marginal effects and dependencies. First, we used defined Minor Allele Frequency (MAF) to determine the raw ratio of three genotypes as Major Allele Homozygous (0), Heterozygous (1), and Minor Allele Homozygous (2). Then the allele of buried groundtruth epistatic locus was selected to be altered under different settings of Linkage Disequilibrium (LD). By looking up the Odds table of corresponding models in Additional file 1: Tables S1–S4, groundtruth epistatic locus can be kept by probability under the corresponding ratio; otherwise, the allele of the locus can uniformly change to other allele randomly.
Contrast with Case–control phenotypes construction in [13], the dependent quantitative phenotype is simulated as follows: For each genotype combination of altered groundtruth epistatic loci, the individual normal distribution was built using the rnorm function in R with mean 0 and standard deviation one. Then the normal distribution was placed in a related index of genotype combinations, which corresponds to the continuous phenotype.
Fifty data sets for each epistasis model were simulated under each setting, where Minor Allele Frequencies (MAFs) were chosen in {0.1, 0.5}. Simulation datasets of Models 1–3 consist of 2000 and 4000 observations, and Model 4 consists of 5000 and 10,000 observations. Each dataset has 100 simulated genotype variation linked by 100 loci with different settings of LD effect r in {0.7,1}, and the groundtruth loci (2 in Models 1–3, and 3 in Model 4) are buried in them.
Simulation on dependency models
We designed Dependency Models 5–8 to simulate multiple highorder dependencies in both discrete and continuous phenotypes. We used D to denote discrete column of data and C to denote continuous column of data. In Dependency Models, genotypes are illustrated as D, and phenotypes could be illustrated as different numbers of D and (or) C.
Discrete and dependent data sets were generated by selecting the number of different discrete values possible, then raising that value to the power equal to the number of data sets to be generated, and generating for each of those values in a probability of occurrence. In order to make it obvious that the data are related, the first value was made to be 80 % of all the values, and the rest of them were the remaining 20 %.
Independent continuous data were generated by R’s rnorm function, which selects values from a normal distribution in each column of continuous independent data set generated. Dependent continuous data were generated by R’s mvrnorm function to sample from a multivariate normal distribution.
To generate a mix of discrete and continuous dependent data sets, the defined number of discrete data sets was generated following the same routine as stated earlier. Afterwards, for each unique group of discrete data generated (unique by rows – tuples), we generate continuous data separately for each, hence making the discrete and the generated continuous data dependent on each other.
Model 5 contains nine discrete columns and nine continuous columns. Among them, D1 and D2 are discrete columns independent of all the other columns. C1 and C2 are continuous columns independent of all the other columns. There are four dependencies buried in the model: (1) D3 and D4 are dependent on each other in the discrete columns; (2) C3 and C4 are dependent on each other in the continuous columns; (3) Discrete columns D5 and D6, Continuous columns C5 and C6 are also dependent on each other; (4) Discrete columns D7, D8, and D9, Continuous columns C7, C8, and C9 are dependent on each other. The posterior distribution matrix is given as Additional file 1: Table S5 and shows that the independent columns were in partitions by themselves (column 0 is for partitions containing only one column) and all of the dependent columns were in partitions with each other. The generated partition of Model 5 is {C1} ∪ {C2} ∪ {C3, C4} ∪ {D1} ∪ {D2} ∪ {D3, D4} ∪ {C5, C6, D5, D6} ∪ {C7, C8, C9, D7, D8, D9}.
Model 6 contains one discrete column and four continuous columns. All these columns are independent. The generated partition is {C1}∪ {C2}∪ {C3}∪ {C4}∪ {D1} as posterior distribution matrix in Additional file 1: Table S6.
Model 7 contains ten discrete columns and ten continuous columns. There are seven dependencies buried in the model: (1) D1 and D2, D3 and D4, and D5 and D6 are three groups dependent on each other corresponding to the discrete columns; (2) C1 and C2, and C3 and C4 are two groups depending with each other corresponding to the continuous columns; (3) discrete columns D7 and D8, and continuous columns C5 and C6 are dependent on each other; (4) discrete columns D9 and D10, and continuous columns C7, C8, C9, and C10 are dependent on each other. The generated partition is {C1, C2}∪{C3, C4}∪ {D1, D2} ∪ {D3, D4}∪ {D5, D6} ∪ {C5, C6, D7, D8}∪ {C7, C8, C9, C10, D9, D10} as posterior distribution matrix in Additional file 1: Table S7.
Model 8 contains eight discrete columns and eight continuous columns. Among them, discrete column D1 and continuous column C1 are independent. There are four dependencies buried in the model: (1) D2, D3, and D4 are dependent on each other in the discrete columns; (2) C2, C3, and C4 are dependent on each other in the continuous columns; (3) discrete column D5 and continuous column C5 are dependent on each other; (4) discrete columns D6, D7, and D8, and continuous columns C6, C7, and C8 are dependent on each other. The generated partition is {C1} ∪{D1}∪{C2, C3, C4}∪ {D2, D3, D4} ∪ {C5, D5}∪ {D6, D7, D8, C6, C7, C8} as posterior distribution matrix in Additional file 1: Table S8.
In simulation, 1000 datasets for each dependency model were simulated under each setting. Models 5–8 consist of 5000, 1000, 5000 and 10,000 observations, respectively, and these numbers representing variations are defined in parsimony to illustrate the relationships.
Simulation on null models
Null models were generated under hypothesis that no phenotypes are associated with genotypes. In Null Models, the discrete genotypes were composed with one group of two dependent loci, one group of three dependent loci and ninetyfive independent loci. Null Model 1 has two independent phenotypes both independent with genotypes while Null Model 2 only has one phenotype independent of genotypes, which makes it comparable with other existed methods as PLINK. Null Models were generated as Dependency Models. Each Null model generated 1000 datasets in each setting of simulation with different samples of 1000, 2000 and 5000.
Computational experiment setting
For experiments on simulation, BHIT, PLINK, EDCF, and SIXPAC ran on BIOCLUSTER operating from the University of Missouri, which is a 64bit Linux platform with 16 CPU and 2 T memory; BOOST (64 bit) ran on a 64bit Windows 7 platform with 3.40 GHz Intel CPU and 8 G RAM. Experiments on experimental data also ran on BIOCLUSTER operating from the University of Missouri.
BHIT parameters were set as follows: For Epistasis Models 1–3, the MCMC iteration in BHIT was set at 30,000 times of running and 29,000 times as the builtin procedure. The MCMC iteration of Epistasis Model 4 was set at 50,000 and 29,000 times as the builtin procedure. For Dependency Models 7, 9, and 10, the number of MCMC iterations was set at 2000, and the builtin procedure was set at 1000. For Dependency Model 8, the number of MCMC iterations was set at 1000, and the builtin procedure was set at 500.
Results and discussion
BHIT software implementation and pipeline
We implemented the Bayesian partition algorithm on both categorical and continuous data in the BHIT (Bayesian Highorder Interaction Toolkit) software on the Linux computing platform using C++. BHIT requires a user specifying the Minor Allele Frequency (MAF) as the prior and uses the PLINK ped and map file format. For long runs in big whole genome data, BHIT also provides intermediate status output and input in benefits stepwise running in big whole genome data. Compared with PLINK’s minutes computing on simulation Epistasis models of hundreds variation, the typical computing time of BHIT is about 1 h on a single CPU.
The BHIT pipeline for general species is shown in Fig. 2. In the preprocessing stage, missing data imputation methods should be applied to estimate missing entries in the genotype data if any. Then SNP with MAF less than 0.05 is filtered out. All the genotype data should be converted to the appropriate data format by PLINKrecodeA. If the input has continuous trait values, whether, the data should be checked by the Kolmogorov–Smirnov test to confirm whether they follow the normal distribution or not. After that, both genotype and phenotype data should be combined together and converted to the BHIT file format by the perl script provided at the BHIT website. In order to handle genomewide SNPs, we provide three strategies to use BHIT in the pipeline. Strategy A has a two stages: (1) feature selection methods (LASSO [21], etc.) are used first to filter all the SNPs and run BHIT only on the filtered set of SNPs. Strategy B runs BHIT on individual chromosome, one at a time. Strategy C focuses on SNPs located in proteincoding regions and/or certain regions that users define. This pipeline is applicable to any species with appropriate genotype and phenotype data.
Simulation results on epistasis models
We used the Epistasis Model based simulated datasets to compare capability between BHIT and other currently available tools for case–control pairwise interaction detection. In this study, we applied the representative methods PLINK (with parameter “epistasis”) [10], PLINKfast (PLINK with parameter “fastepistasis”), BOOST [11], EDCF [12], and SIXPAC [22] on the simulation datasets. We also used PLINK(Q) on quantitative trait dataset directly. We also set BHIT running it once and then three times to fully use its MCMC properties. To accommodate the setting of PLINK, BOOST, EDCF, and SIXPAC on pairwise effects detection only, we decomposed the highorder effects to multiple pairwise effects, e.g., detecting the complete set of all three pairwise relations AB, AC, and BC, which is counted as detecting the threeorder relationship ABC successfully. In Epistasis Model, the statistical power of these methods is defined as the fraction of the generated datasets on which only topmost results given by the method matches the ground truth. For triple runs of BHIT, the statistical power is defined as the fraction of the generated datasets on any of the three outputs of BHIT matching the ground truth. We chose 0.5 as the threshold for the posterior probabilities to determine the dependency for each loci and phenotype.
We extensively explored the simulation settings on MAF and LD variants for Epistasis Models 1–4 in Table 1 and Additional file 1: Figure S1–S8. First, we considered the genotyped genetic variations to be exactly groundtruth loci where LD equals 1, an idealistic and unrealistic situation. When MAF equals 0.5, nearly all the methods designed to detect pairwise epistasis could effectively detect Models 1–3 in various marginal pairwise effects with no LD effects. For threeorder interactions of Model 4, PLINK failed to detect highorder interactions, while BOOST and EDCF significantly lost its power to effectively detect the three pairwise interactions decomposed from the threeorder interaction. However, BHIT was highly effective in detecting both twoorder and threeorder interactions on all four of these models. Through multiple runs of BHIT, we could perfectly detect nearly all of these interactions in all the simulation datasets of these models. When changing the settings of MAF to 0.1 with a perfect LD effect, PLINK could still work on all of these epistasis models fairly well. BOOST and EDCF decreased their power in pairwise interactions of Models 1–3 with lower MAF. BHIT maintained relatively high detection power in pairwise interactions and showed good performance in detecting threeorder interactions in Model 4. Then we introduced the LD effect as r equals 0.7 in the simulation datasets to mimic a more realistic scheme. Comparing with no LD effects, PLINK, BOOST and EDCF showed significant performance shortfalls in Models 1–3 with MAF settings at either 0.5 or 0.1. We also applied SIXPAC in all these simulations but it failed in all the models, which may be due to its inability to provide block distances in the simulation datasets. LD effects also decreased detection power of BHIT. However, for threeorder interactions of Model 4, BHIT showed much more tolerance with these effects as it nearly detected all the interactions with or without LD. BHIT’s ability could easily be enhanced by multiple runs.
Simulation results on dependency models
In order to demonstrate the potential of detecting more sophisticated dependency relationships within the data, we proposed four additional dependency models both on SNP (discrete) and different types of quantitative phenotypes (continuous) and discrete phenotypes. Since no public software to our knowledge can detect such multiple dependencies of both discrete and continuous data simultaneously, we only ran BHIT once, thrice and ten times to evaluate the performance.
Other than detecting loci in Epistasis Models, the statistical power of Dependency Models is defined by detecting the correct relationships of the correct corresponding variations, which is challenging.
For Dependency Models, BHIT showed its remarkable capabilities in detecting multiple dependencies both in discrete and continuous datasets detailed in Table 2, while no other publicly available tools can handle these data. For Model 5, the dependencies were correctly detected 81.3 % in 1000 simulation datasets by a single run of BHIT, and the detection power increased to 87.8 % in 1000 and 89.8 % in 1000 by running BHIT for triple and ten runs. For Models 6–8, BHIT correctly detected nearly all the dependencies (99.8 % in Model 6, 98.5 % in Model 7, and 99.8 % in Model 8) in all 1000 simulated datasets. With multiple runs of BHIT, all the interactions in simulated datasets of Dependency Models could be detected.
Simulation results on null models
Different from Epistasis Models and Dependency Models, no phenotypes are associated with genotypes in Null Models. The statistical power of Null models in BHIT is defined as the fraction of the generated datasets showing no dependencies between phenotype with any genotype, and the statistical power in PLINK is defined as the fraction of the generated datasets showing no association found. As shown in Table 3, in Null Model 1 only 2.1 % of 1000 datasets in each simulation at least one phenotype were incorrectly inferred as associated with given genotypes with the settings of 1000 samples in BHIT, and this number decreased to 1.9 and 0.6 % with settings of 2000 and 5000 samples. In Null Model 2, the phenotype was incorrectly inferred as dependent with given genotypes in only 0.6, 0.3 and 0.1 % of 1000 datasets in each simulation setting of 1000, 2000 and 5000 samples. Comparing with same datasets running by PLINK (with parameter “epistasis”), PLINKfast (PLINK with parameter “fastepistasis”), PLINK(Q) which works on the quantitative trait, BHIT obviously got confidential results in Null Models with different settings of samples.
In the simulation studies, BHIT demonstrated its excellent capabilities and potential in comparison with other epistasis detection methods on both type1 and type2 errors. On pairwise interactions with/without additive effects, BHIT could get same good results as BOOST and EDCF in high MAF, and outperform them by a large margin in a lower MAF. With various settings of MAF and LD, BHIT showed its robustness on high detection power by adopting the proper setting of prior on MAF as a Bayesian partition, which other methods (BOOST and EDCF) neglect. Besides pairwise interactions, BHIT could also obtain excellent results in threeorder interactions, while other methods were not designed to handle. Benefiting from the flexible statistics frameworks, BHIT could detect these dependencies very effectively with various types of phenotypes in simulation datasets. By checking validity by Null Models, BHIT showed good results in distinguishing no associations in simulation datasets.
Results on soybean quantitative traits studies
Soybeans represent one of the most important agricultural crops providing nutrition and sustenance to humans and household animals. Among its hundreds of agricultural traits, oil and protein content of its seeds are among the most interesting of its composition traits both for farmers and breeders. We used the SoySNP50K iSelect BeadChip SNP array [23] as the genotype, and 243 Plant Introduction (PI) lines with the oil and protein contents phenotyped in 2011 (unpublished results). All the SNPs are mapped to genes by Soybean Knowledge Base (SoyKB) [24]. For oil and protein contents are highly correlated, we only choose oil content as the phenotype in BHIT running. We applied all three strategies of the BHIT pipeline in this research. Strategy A is a twostage strategy, i.e., to choose SNP subsets by feature selection of LASSO, filtering all the SNPs first and then apply BHIT on significant SNPs detected by LASSO. Strategy B uses BHIT on individual chromosomes, one at a time. Strategy C is mainly focused on proteincoding regions, which only applies BHIT on known protein QTL regions and oil QTL regions. In all strategies, BHIT was set to running 1,000,000 times of MCMC, and set 990,000 as the burnin period to guarantee the convergency, 0.5 was chosen as the threshold for the posterior probabilities to determine the dependency for each loci and phenotype.
Preprocessing on soybean data
There is no missing value in SoySNP50K iSelect BeadChip SNP array, and each SNP in the array is only chosen with threshold that MAF larger than 0.05. Both quantitative phenotypes of soybean oil/protein data are accepted as normal distribution hypothesis by Kolmogorov–Smirnov test.
Epistasis results on soybean quantitative traits studies by strategy A
In Strategy A, we first used LASSO to get 153 SNPs related to protein and oil content, and then we ran BHIT 200 times on this subset. We got 147 SNPSNP interactions related to oil/protein content traits, including 86 twoorder interactions, 40 threeorder interactions, 20 fourorder interactions and 1 fiveorder interaction. Table 4 gives a general view of the most interesting BHIT results obtained by using Strategy A. The SNP interaction pair in Index one is located in chromosome seven at positions 15,667,842 and 15,662,403 (Fig. 3). The major allele homozygous and minor allele homozygous genotypes could divide the oil content phenotype significantly with pvalue 2.75 × 10^{− 10} by ttest. Considering that minor allele homozygous pairs are expected to appear 4.12 % in the background, the observed minor homozygous pairs appeared 11.39 times higher than the set percentage for background appearance. This bias may be due to breeding selection. When mapping to soybean genes using Soybean Knowledge Base (SoyKB), both SNPs are associated with individual genes Glyma07g15930.1, a KOGoxysterolbinding protein functioned in lipid transport process, and Glyma07g15960.1, a KOGdehydrogenase (EC/1.1.1.145/3betahydroxyDelta(5)steroid dehydrogenase). Both genes function in the steroid biosynthetic process are located in chloroplast stroma. Index two shows a threeorder interaction with Index one and another SNP located in location 42,883,965 of chromosome six, which is mapped to gene Glyma06g39891 (Fig. 4). This gene is glycerol3phosphate acyltransferase (EC 2.3.1.15), which is also located in the glycerolipid metabolism pathway (http://www.genome.jp/kegg/pathway/map/map00561.html), functioning in lipid transport and located in endoplasmic reticulum. Another promising threeorder interaction across chromosomes is Index 3 positioned at 1,481,641 in Chromosome 16, and 38,897,850 and 39,737,193 in Chromosome 19 (Fig. 5). The statistical significance pvalue reaches 7.63 × 10^{− 16} by ttest between triplemajor allele and tripleminor allele. These SNPs are associated with Glyma19g31120.1, a glutamate synthase (NADH) functioned in glutamate synthase (NADH) activity; Glyma19g31960.1, an AP2EREBP transcription factor in the lipid biosynthetic process; and Glma16g01950.1, an ABI3/VP1 transcription factor located in chloroplast. All three genes are related to the oil biosynthesis process.
Epistasis results on soybean quantitative traits studies by strategy B
In strategy B, we split the SoySNP50K data into 20 parts by their chromosomes, and then ran BHIT separately on each of these individual chromosomes. Several interesting interactions were detected among the results and are detailed in Table 5. An interesting SNP pair is located on Chromosome five at position 34,107,233 and 40,523,205 (Fig. 6), which are annotated by Glyma05g28240, an Enzyme EC 3.6.4.4, which is a plusenddirected kinesin ATPase, and Glyma05g36730, a homologous gene OPI10 in Medicago Truncatula, which is involved in phospholipid biosynthesis, respectively. This pair of SNP interactions directly connects to oil content phenotype. The individual pvalues of both SNPs are 3.63 × 10^{− 7} and, 1.03 × 10^{− 5} respectively, but the combined pvalue reached 4.97 × 10^{− 12}. Another threeorder interaction related to protein synthesis phenotype is located on Chromosome 13 at position 28,866,067, 28,868,130 and 29,473,740 (Fig. 7), which are annotated by Glyma13g25650 and Glyma13g26260. Glyma13g25650 is a subtilase family protein acting in the serinetype peptidase activity, and Glyma13g26260 encodes Enzyme EC 5.2.1.8 acting in Peptidylprolyl isomerase. The individual pvalues of SNPs are 2.52 × 10^{− 5}, 1.95 × 10^{− 3}, and 3.41 × 10^{− 8}, respectively, but their combination results in a pvalue of 1.03 × 10^{− 13}. The most significant fourorder interaction detected by this strategy is located in Chromosome ten at positions 47,616,648; 47,618,284; 47,730,445; and 47,753,689 (Fig. 8), contained in three genes Glyma10g40110, Glyma10g40260 and Glyma10g40290. Glyma10g40110 is a pyruvate kinase participating in the fatty acid biosynthetic process, Glyma10g40260 has a homologous gene fatty acylCoA reductase three in Medicago Truncatula, while Glyma10g40290 is a glycosyl hydrolase superfamily protein. The pvalues of individual SNPs are 5.68 × 10^{− 4}, 4.15 × 10^{− 4}, 6.69 × 10^{− 1}, and 3.81 × 10^{− 1}, respectively, and the combined pvalue reaches 3.79 × 10^{− 7}. We also detected a 7order interaction on Chromosome five at positions 8,688,492, 8,714,882, 8,715,355, 8,800,108, 8,800,879, 8,817,375, and 8,904,128 (Fig. 9). These SNPs are annotated by gene Glyma05g08810, Glyma05g08830, Glyma05g08970 and Glyma05g09080. Among them, Glyma05g08810 is EC 4.99.1.4 sirohydrochlorin ferrochelatase, while Glyma05g08830 is included in the premRNA cleavage complex II protein family, and Glyma05g09080 is an EC 1.14.15.3 alkane 1monooxygenase. The pvalues of these individual SNPs are around 0.1 but their combined pvalue reaches 3.59 × 10^{− 7}.
Epistasis results on soybean quantitative traits studies by strategy C
In this Strategy C, we ran BHIT on a subset of 799 SNPs from Soy50KSNP array, which overlapped with the protein coding regions and known QTL regions related to protein/oil contents (Additional file 2: Table S9). Then we ran BHIT for about 1000 times on this subset using either oil or protein phenotypes. Table 6 presents some interesting identified interactions. The first and most interesting interactions among them were identified in 4 loci across two chromosomes located in position 20,897,627; 20,954,490 of Chromosome eight, and 8,642,446; 12,051,017 of Chromosome 19. The first SNP (named as SNP293) is located in gene Glyma08g26580.1 mapped, which has an Arabidopsis homology AT3G0140 (EC/6.3.2.19) and an ubiquitinprotein ligase. At the sequence level, the polymorphism makes the major allele nucleotide guanine (g) replaced by the minor allele nucleotide adenine (a), which causes the 73th amino acid of the protein change from glycine (G) to arginine (R). The added positive charged arginine may have significant impact on the protein conformation and function. The second SNP (named as SNP294) is located in gene Glyma08g26680.1, which causes the 31st amino acid change from alanine (A) to valine (V). The function of this gene is unknown. The third SNP (named as SNP792) is located in gene Glyma19g07330.1, which also causes amino acid change from glycine (G) to arginine (R). This gene has the Arabidopsis homolog AT3G48990.1, which encodes an oxalylCoA synthetase and is required for oxalate degradation and normal seed development processes. The fourth SNP (named as SNP794) is located in gene Glyma19g10100.1, which causes amino acid change from valine (V) to isoleucine (I). This gene has the Arabidopsis homology AT1G51310.1, a tRNA (5methylaminomethyl−2thiouridylate)methyltransferases. This transferase is involved in tRNA processing in chloroplast and cytoplasm. It was found that protein Glyma08g26580.1 containing SNP293 and protein Glyma19g07330.1 containing SNP792 were predicted to interact by ProprInt [25]. We predicted their protein structures by MUFOLD [26] . Then we docked the two predicted structures using GRAMMX [27]. Interestingly, the distance between the residue containing SNP293 and the residue containing SNP792 was shorter than 0.0052 % of all the paired distances between the two structures, as shown in Fig. 10. This suggests that the epistatic interaction between the two SNPs may play a role in the interaction between the two proteins.
Many of the SNPassociated genes in Table 6 are highly related to oil content according to the literature. It is known that ubiquitin processes (Glyma08g26580) have significant influences on fatty acid compensation [28, 29] . Intracellular composition of fatty acid could affect the processing and function of enzymes in connection with the ubiquitin–proteasome pathway, which might be a common physiological approach to regulate protein degradation [30]. Under the control of a corn ubiquitin promoter, positive expression of puroindoline a and b (PINA and PINB) proteins in transgenic corn could significantly increase the seed oil content [31]. It is also well known that acylcoenzyme A (CoA) synthase (Glyma19g07330) catalyses the conversion of free fatty acid to acylCoA ester, and is, therefore, necessary for many pathways of fatty acid and lipid metabolism [32]. Researchers have successfully improved fatty alcohol by manipulating the CoA synthase in fatty alcohol biosynthesis pathway in engineered E. coli [33], and they also succeeded improving fatty acid ethyl esters in Saccharomyces cerevisiae [34]. Although there is no direct report on the role of tRNAmethyltransferase (Glyma19g10100) in oil synthesis in plants, a product of tRNAmethyltransferase, Sadenosylhomocysteine was shown to have a metabolic link with the fatty acid metabolism in Alzheimer’s disease [35]. The potential interaction between Glyma08g26580 and Glyma19g07330 links the ubiquitin and acylCoA synthase through NAD(P)/FADdependent dehydrogenases [36].
We also checked the neutral mutation hypothesis by using a dataset by Tajima’s D statistics using DnaSP software version 4.0 [37]. The Tajima’s D value of the region is 3.65 (p < 0.001), which significantly rejected null hypothesis of neutrality and meets the prior knowledge of soybean as a human cultivated plants, may undergo the balancing selection.
Conclusions
Epistasis is a common phenomenon in many complex biological processes of various organisms, which has been known to be related to quantitative traits in plants [38]. From the evolution perspective, epistasis and natural selection shape the mutational architecture of complex traits [39]. Epistasis may cause hidden quantitative genetic variation in natural populations and could be responsible for the small additive effects, missing heritability and the lack of replication, which are typically observed for human complex traits [9, 40]. Towards this direction, our computational method, tool and pipeline enable researchers to explore epistasis and may elaborate on what specific epistasis, especially higherorder epistasis, play a role in a complex trait.
Seed oil content and protein content are both polygenic traits controlled by several gene loci in soybeans, which represent a major source of dietary nutrition and an increasingly valuable feedstock for industrial applications [41]. However, due to overlapping biosynthesis pathways and alternative nutrition distribution, seed oil content shows strong negative correlation with seed protein content, i.e., improvement of one trait is often achieved at the expense of the other [42]. Quantitative genetic analyses and QTL mapping based studies have suggested that both seed oil content and protein content are governed by the additive effect of genes involved [43]. Many of the QTL alleles with positive and negative effects on oil content are often dispersed among genotypes [44], which suggests that accumulation of the positive alleles from different genetic backgrounds could eventually lead to the development of genotypes with higher seed oil content or protein content [45]. By computing Tajima’s D, multiple alleles are actively maintained in the gene pool of a population at frequencies above that of gene mutation. The significance level of positive Tajima’s D also supports the hypothesis that the samples we used are indeed under the balancing selection with genetic polymorphism conserved by multiple alleles, not simply by genetic drift. With SNP array data and NGS data, we could directly focus on the trait study at the nucleotide level, instead of the QTL level, which could give us much more detailed information and guidance on molecular breeding. Some of our predicted epistatic interactions could serve as hypotheses for molecular breeding design.
We mainly focus on biological interpretation of epistasis at the gene level. The trait associated coding region polymorphism at the gene level may change the biochemical property and the protein structural conformation, which could cause significant functional and phenotype changes. According to our analysis, these changes are indeed associated with the trait change quantitatively. Breen demonstrated that epistasis is pervasive in protein evolution by considering aminoacid usage. In this theory, epistasis as the primary factor in molecular evolution, provides the primary conceptual framework to describe the tempo and mode of longterm protein evolution [46, 47]. Another possible explanation arose from Hemani’s work [48], which indicates that epistatic interactions can allow deleterious mutations to persist under selection and these interactions can abate the depletion of additive genetic variation. In our cases, soybean has been cultivated for thousands of years. Some of the epistasis that we identified by BHIT could be a result of protein evolution under human breeding selection. In particular, the differences on protein conformation in different genotypes may cause differences in proteinprotein interaction, which could alter the interaction between ubiquitinprotein ligase and oxalylCoA synthetase, as an example. These important biochemical changes in the oil biosynthesis pathway may finally affect the phenotypes of oil/protein content quantitatively. It has been noted that some of the SNPs in an identified epistatic interaction are close to each other in the genomic sequence, which may be due to LD instead of epistasis.
The key advantage of BHIT is its advanced flexible setting on detecting highorder interactions on both discrete and continuous data. In contrast to other methods designed to detect only pairwise interaction, BHIT does not restrict the computational models to twoorder dependencies, and dependencies in different orders could be uniformly chosen by adopting larger likelihood following using a MCMC search. Even though MCMC could obtain the global maximum in theory, multiple runs of BHIT may overcome the local maximum in practice. In addition, the search convergence could be easily judged by checking the changing status of likelihood with the need to be determined empirically in other modelfree methods. The design of likelihood computation between both discrete and continuous attributes gave BHIT’s capability on both case–control and complex quantitative trait analysis. In contrast to other model free machinelearning based methods, the deployed Bayesian framework could fully use the prior knowledge of Minor Allele Frequency, which benefits BHIT to become much more adaptive in different settings of datasets. In simulation studies on epistasis models, in comparison with other existing methods, BHIT can maintain high efficiency in various settings of sample numbers, MAF and LD effects. Another key advantage of BHIT is its capability to deal with continuous phenotypes as quantitative traits. In both simulation studies on Dependency Models and soybean study of oil and protein content traits, BHIT is versatile and robust in detecting multiple dependencies simultaneously.
In comparison to BEAM and BEAM2, which were developed in detecting highorder epistasis from discrete genotypes in single case–control phenotype, BHIT expands BEAM by building a flexible framework to detect multiple highorder epistasis in case–control and/or quantitative phenotypes. Our work is also different from Zhang’s work [49] on Pleiotropic and Epistatic eQTL, which also used the idea of Bayesian partition, in that a) Zhang’s work aimed at eQTL module identification while BHIT is a general framework for GWAS analysis; b) Zhang’s work focused on association with genotype and continuous traits as eQTL, while BHIT could detect association between genotype and traits, both continuous and categorical; c) Zhang’s work mainly handled pairwised interactions, while BHIT can detect highorder interactions. With the Bayesian method’s strengths and flexible settings, BHIT demonstrates its great capabilities and potentials in detecting both pairwise and highorder interactions in GWAS datasets both on discrete and continuous data.
The BHIT pipeline is developed to apply BHIT for general purpose of research on any species and any traits with appropriate data. Due to detecting multiple orders of interaction, BHIT demonstrates its capability on thousands genetic variations by thousands samples on single CPUs in plausible time. To handle even bigger datasets genome wide, three different computational strategies in computational experiment were implemented in the BHIT pipeline to address the computational intensive nature of BHIT Bayesian computation. Strategy A screened additive effects first by feature selection method, penaltybased regression LASSO in our soybean quantitative trait study, this twostage strategy may have ignored several truepositive epistasis, but it was able to fully use all the SNPs in the genome scale. The application in experimental datasets successfully found crosschromosome epistasis. Strategy B’s individual chromosome computing was able to fully use all the candidate SNPs in one chromosome, but it neglected the crosschromosome epistasis. The nearby SNPs in one chromosome might overestimate epistasis because of linkage disequilibrium. Strategy C enables users to use priorknowledge to select SNPs first, eliminated the computation by only applying BHIT on proteincoding regions in known QTL in our study. Results obtained by this strategy could be well explained at the protein level but it obviously lost some information both in the noncoding regions and nonQTL regions. Besides soybean oil/protein traits, high order interactions exist in many other species and traits. The BHIT pipeline can be applied in detecting highorder interaction between genotype and phenotype in other species with appropriate data and strategy. Even with advanced computational technologies and strategies, BHIT still requires high computational resources within the whole genome dataset. We are working on developing an effective approximation algorithm as well as the parallel and GPU version of BHIT to advance genotypephenotype research.
References
 1.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.
 2.
Johannes F, Porcher E, Teixeira FK, SalibaColombani V, Simon M, Agier N, et al. Assessing the Impact of transgenerational epigenetic variation on complex traits. Plos Genet. 2009;5(6):e1000530.
 3.
Lu C, Latourelle J, O’Connor GT, Dupuis J, Kolaczyk ED. Networkguided sparse regression modeling for detection of genebygene interactions. Bioinformatics. 2013;29(10):1241–9.
 4.
Wei WH, Hemani G, Haley CS. Detecting epistasis in human complex traits. Nat Rev Genet. 2014;15(11):722–33.
 5.
Van Steen K. Travelling the world of genegene interactions. Brief Bioinform. 2012;13(1):1–19.
 6.
Wang Y, Liu GM, Feng ML, Wong L. An empirical comparison of several recent epistatic interaction detection methods. Bioinformatics. 2011;27(21):2936–43.
 7.
Zhang M, Liang L, Morar N, Dixon AL, Lathrop GM, Ding J, et al. Integrating pathway analysis and genetics of gene expression for genomewide association study of basal cell carcinoma. Hum Genet. 2012;131(4):615–23.
 8.
Platt A, Vilhjalmsson BJ, Nordborg M. Conditions under which genomewide association studies will be positively misleading. Genetics. 2010;186(3):1045–52.
 9.
Mackay TFC. Epistasis and quantitative traits: using model organisms to study genegene interactions. Nat Rev Genet. 2014;15(1):22–33.
 10.
Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet. 2007;81(3):559–75.
 11.
Wan XA, Yang C, Yang QA, Xue H, Fan XD, Tang NLS, et al. BOOST: a fast approach to detecting genegene interactions in genomewide case–control studies. Am J Hum Genet. 2010;87(3):325–40.
 12.
Xie MZ, Li J, Jiang T. Detecting genomewide epistases based on the clustering of relatively frequent items. Bioinformatics. 2012;28(1):5–12.
 13.
Zhang Y, Liu JS. Bayesian inference of epistatic interactions in case–control studies. Nat Genet. 2007;39(9):1167–73.
 14.
Zhang Y, Zhang J, Liu JS. Blockbased bayesian epistasis association mapping with application to WTCCC type 1 diabetes data. Ann Appl Stat. 2011;5(3):2052–77.
 15.
Yang C, He ZY, Wan X, Yang Q, Xue H, Yu WC. SNPHarvester: a filteringbased approach for detecting epistatic interactions in genomewide association studies. Bioinformatics. 2009;25(4):504–11.
 16.
Lin HY, Chen YA, Tsai YY, Qu XT, Tseng TS, Park JY. TRM: a powerful twostage machine learning approach for identifying SNPSNP interactions. Ann Hum Genet. 2012;76:53–62.
 17.
Weinreich DM, Lan YH, Wylie CS, Heckendorn RB. Should evolutionary geneticists worry about higherorder epistasis? Curr Opin Genet Dev. 2013;23(6):700–7.
 18.
He XL, Qian WF, Wang Z, Li Y, Zhang JZ. Prevalent positive epistasis in Escherichia coli and Saccharomyces cerevisiae metabolic networks. Nat Genet. 2010;42(3):272–U120.
 19.
Huang XQ, Ding J, Effgen S, Turck F, Koornneef M. Multiple loci and genetic interactions involving flowering time genes regulate stem branching among natural variants of Arabidopsis. New Phytol. 2013;199(3):843–57.
 20.
Zhang J, Hou TJ, Wang W, Liu JS. Detecting and understanding combinatorial mutation patterns responsible for HIV drug resistance. Proc Natl Acad Sci U S A. 2010;107(4):1321–6.
 21.
Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc B Met. 1996;58(1):267–88.
 22.
Prabhu S, Pe’er I. Ultrafast genomewide scan for SNPSNP interactions in common complex disease. Genome Res. 2012;22(11):2230–40.
 23.
Song QJ, Hyten DL, Jia GF, Quigley CV, Fickus EW, Nelson RL, et al. Development and evaluation of SoySNP50K, a highdensity genotyping array for soybean. PLoS One. 2013;8(1):e54985.
 24.
Joshi T, Fitzpatrick MR, Chen SY, Liu Y, Zhang HX, Endacott RZ, et al. Soybean knowledge base (SoyKB): a web resource for integration of soybean translational genomics and molecular breeding. Nucleic Acids Res. 2014;42(D1):D1245–52.
 25.
Rashid M, Ramasamy S, Raghava GPS. A simple approach for predicting proteinprotein interactions. Curr Protein Pept Sc. 2010;11(7):589–600.
 26.
Zhang JF, Wang QG, Barz BD, He ZQ, Kosztin I, Shang Y, et al. MUFOLD: a new solution for protein 3D structure prediction. Proteins. 2010;78(5):1137–52.
 27.
Tovchigrechko A, Vakser IA. GRAMMX public web server for proteinprotein docking. Nucleic Acids Res. 2006;34:W310–4.
 28.
Rawsthorne S. Carbon flux and fatty acid synthesis in plants. Prog Lipid Res. 2002;41(2):182–96.
 29.
Hills MJ. Control of storageproduct synthesis in seeds. Curr Opin Plant Biol. 2004;7(3):302–8.
 30.
Ando H, Wen ZM, Kim HY, Valencia JC, Costin GE, Watabe H, et al. Intracellular composition of fatty acid affects the processing and function of tyrosinase through the ubiquitinproteasome pathway. Biochem J. 2006;394:43–50.
 31.
Zhang JR, Martin JM, Beecher B, Lu CF, Hannah LC, Wall ML, et al. The ectopic expression of the wheat Puroindoline genes increase germ size and seed oil content in transgenic corn. Plant Mol Biol. 2010;74(4–5):353–65.
 32.
Guo XJ, Jiang ML, Wan X, Hu CJ, Gong YM. Identification and biochemical characterization of five longchain acylcoenzyme a synthetases from the diatom Phaeodactylum tricornutum. Plant Physiol Bioch. 2014;74:33–41.
 33.
Zheng YN, Li LL, Liu Q, Yang JM, Wang XW, Liu W, et al. Optimization of fatty alcohol biosynthesis pathway for selectively enhanced production of C12/14 and C16/18 fatty alcohols in engineered Escherichia coli. Microb Cell Fac. 2012;11:65.
 34.
de Jong BW, Shi S, Siewers V, Nielsen J. Improved production of fatty acid ethyl esters in Saccharomyces cerevisiae through upregulation of the ethanol degradation pathway and expression of the heterologous phosphoketolase pathway. Microb Cell Fact. 2014;13(1):39.
 35.
Selley ML. A metabolic link between Sadenosylhomocysteine and polyunsaturated fatty acid metabolism in Alzheimer’s disease. Neurobiol Aging. 2007;28(12):1834–9.
 36.
Burroughs AM, Iyer LM, Aravind L. Natural history of the E1like superfamily: Implication for adenylation, sulfur transfer, and ubiquitin conjugation. Proteins. 2009;75(4):895–910.
 37.
Rozas J, SanchezDelBarrio JC, Messeguer X, Rozas R. DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics. 2003;19(18):2496–7.
 38.
Dudley JW, Johnson GR. Epistatic models and preselection of markers improve prediction of performance in corn. Mol Breeding. 2013;32(3):585–93.
 39.
Jones AG, Burger R, Arnold SJ. Epistasis and natural selection shape the mutational architecture of complex traits. Nat Commun. 2014;5:3709.
 40.
Hansen TF. Why Epistasis is important for selection and adaptation. Evolution. 2013;67(12):3501–11.
 41.
Snyder CL, Yurchenko OP, Siloto RMP, Chen X, Liu Q, Mietkiewska E, et al. Acyltransferase action in the modification of seed oil biosynthesis. New Biotechnol. 2009;26(1–2):11–6.
 42.
Dudley JW. Quantitative genetics and plant breeding. Adv Agron. 1997;59:1–23.
 43.
Qi ZM, Hou M, Han X, Lu CY, Jiang HW, Xin DW, et al. Identification of quantitative trait loci (QTLs) for seed protein concentration in soybean and analysis for additive effects and epistatic effects of QTLs under multiple environments. Plant Breed. 2014;133(4):499–507.
 44.
Zhao JY, Becker HC, Zhang DQ, Zhang YF, Ecke W. Oil content in a European x Chinese rapeseed population: QTL with additive and epistatic effects and their genotypeenvironment interactions. Crop Sci. 2005;45(1):51–9.
 45.
Weselake RJ, Taylor DC, Rahman MH, Shah S, Laroche A, McVetty PBE, et al. Increasing the flow of carbon into seed oil. Biotechnol Adv. 2009;27(6):866–78.
 46.
Breen MS, Kemena C, Vlasov PK, Notredame C, Kondrashov FA. Epistasis as the primary factor in molecular evolution. Nature. 2012;490(7421):535.
 47.
Pollock DD, Goldstein RA. Strong evidence for protein epistasis, weak evidence against it. Proc Natl Acad Sci U S A. 2014;111(15):E1450–0.
 48.
Hemani G, Knott S, Haley C. An Evolutionary Perspective on Epistasis and the missing heritability. Plos Genet. 2013;9(2):e1003295.
 49.
Zhang W, Zhu J, Schadt EE, Liu JS. A bayesian partition method for detecting Pleiotropic and Epistatic eQTL Modules. Plos Comput Biol. 2010;6(1):e1000642.
Acknowledgements
We acknowledge the contributions of Vander Heiden, Namita T Gupta and Mate Nagy. This work was supported by the National Natural Science Foundation of China (Grants 61272207, 61472158); the ScienceTechnology Development Project from Jilin Province of China (Grant 201201048); and US National Institute of Health (R01GM100701).
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JW, DX, and JZ participated in the research design and wrote the manuscript; JW developed the software; JW performed the computational benchmark studies; HTN, BV and HS provided the experimental data and biological input. TJ and YL provided computational input. All authors read and approved the final manuscript.
Additional files
Additional file 1: Table S1.
Odds Table of Epistasis Model 1. Table S2 Odds Table of Epistasis Model 2. Table S3. Odds Table of Epistasis Model 3. Table S4. Odds Table of Epistasis Model 4. Table S5. P(IData) in Dependency Model 5 Partition. Table S6. P(IData) in Dependency Model 6 Partition. Table S7. P(IData) in Dependency Model 7 Partition. Table S8. P(IData) in Dependency Model 8 Partition. Figure S1: Statistic power of Model 1 with sample size 2000. Figure S2: Statistic power of Model 1 with sample size 4000. Figure S3: Statistic power of Model 2 with sample size 2000. Figure S4: Statistic power of Model 2 with sample size 4000. Figure S5: Statistic power of Model 3 with sample size 2000. Figure S6: Statistic power of Model 3 with sample size 4000. Figure S7: Statistic power of Model 4 with sample size 5000. Figure S8: Statistic power of Model 4 with sample size 10,000. (DOCX 62 kb)
Additional file 2: Table S9.
QTL regions related to protein/oil contents (XLSX 19 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Markov Chain Monte Carlo
 Minor Allele Frequency
 Lasso
 Markov Chain Monte Carlo Iteration
 Continuous Column