 Research
 Open access
 Published:
GEPEpiSeeker: a gene expression programmingbased method for epistatic interaction detection in genomewide association studies
BMC Genomics volume 22, Article number: 910 (2021)
Abstract
Background
Identification of epistatic interactions provides a systematic way for exploring associations among different single nucleotide polymorphism (SNP) and complex diseases. Although considerable progress has been made in epistasis detection, efficiently and accurately identifying epistatic interactions remains a challenge due to the intensive growth of measuring SNP combinations.
Results
In this work, we formulate the detection of epistatic interactions by a combinational optimization problem, and propose a novel evolutionarybased framework, called GEPEpiSeeker, to detect epistatic interactions using Gene Expression Programming. In GEPEpiSeeker, we propose several tailormade chromosome rules to describe SNP combinations, and incorporate Bayesian networkbased fitness evaluation into the evolution of tailormade chromosomes to find suspected SNP combinations, and adopt the Chisquare test to identify optimal solutions from suspected SNP combinations. Moreover, to improve the convergence and accuracy of the algorithm, we design two genetic operators with multiple and adjacent mutations and an adaptive genetic manipulation method with fuzzy control to efficiently manipulate the evolution of tailormade chromosomes. We compared GEPEpiSeeker with stateoftheart methods including BEAM, BOOST, AntEpiSeeker, MACOED, and EACO in terms of power, recall, precision and F1score on the GWAS datasets of 12 DME disease models and 10 DNME disease models. Our experimental results show that GEPEpiSeeker outperforms comparative methods.
Conclusions
Here we presented a novel method named GEPEpiSeeker, based on the Gene Expression Programming algorithm, to identify epistatic interactions in Genomewide Association Studies. The results indicate that GEPEpiSeeker could be a promising alternative to the existing methods in epistasis detection and will provide a new way for accurately identifying epistasis.
Introduction
Genomewide association studies (GWAS) aim at identifying associations between Single Nucleotide Polymorphism (SNP) and disease, which has been an important way for identifying the genetic basis of diseases in the last decade [1,2,3,4,5,6,7,8,9,10,11].
GWAS is capable of finding singlelocus SNP that is related to disease trait [7]. Great progress has been made in identifying singlelocus SNP that is the genetic causes of diseases such as Mendelian diseases and diabetes, however, detecting causative loci for complex diseases is more complicated [3, 5, 6, 12]. Complex diseases are often caused by complicated effects of multilocus SNPs, such as diabetes, rheumatoid arthritis and hypertension [6, 7, 13]. Some SNPs influence the complex disease traits and dominate the effect of other SNPs when interacting with each other [6, 7, 12]. In GWAS, the relation of an SNP influencing the effect of another SNP is described as epistasis [7, 12, 14]. Many studies have shown that epistasis exists in SNP interactions and plays an important role in human diseases [7, 15].
With the rapid development of highthroughput genotyping and sequencing technologies, it is an enormous challenge to analyze the epistatic associations between disease and millions of SNPs in GWAS. Recently, several epistatic interaction detection methods have been designed for efficiently detecting epistasis. These efforts can be divided into three types [3, 6, 13, 14, 16,17,18,19,20,21] : (1) exhaustive search method, (2) stochastic search method, and (3) heuristic search method.
Exhaustive search method evaluates all possible multilocus SNP combinations to detect the associations between disease and SNPs. Therefore, exhaustive search methods can produce stable and global optimum solutions. Some exhaustive search methods, such as MDR [22, 23], BOOST [24], TEAM [25], ESMO [6], have been proposed. Exhaustive search is a straightforward search strategy, but it may require huge computational resources and consume too much time as the size of SNP combinations exponentially grows.
Stochastic searchbased identifies SNPSNP interactions by random sampling [26, 27]. BEAM (Bayesian Epistasis Association Mapping) [27] is an example. BEAM searches and categorizes diseaseassociated SNP interactions via posterior probabilities of the suspected candidate SNPs. Tang et al. [28] constructed a Gibbs sampling approach for identifying epistatic interactions. Jiang et al. [29] presented a stochastic method called epiForest to detect epistatic interactions using random forest. Although random sampling significantly reduces search space and accelerates the detection of SNP interactions, the performance of stochastic search relies on the random sampling elements.
Heuristic search [3, 7, 12, 14, 30] adopts an approximate search strategy, which guides the search of epistatic interactions by heuristic information. For example, Wang et al. [30] proposed a twostage heuristic ant colony optimization (ACO) algorithm named AntEpiSeeker to detect epistatic interactions. AntEpiSeeker uses an ant colony optimization search to find diseaseassociated SNPs. Wan et al. [31] developed SNPRuler to detect epistatic interactions utilizing prediction rule learning. Jing et al. [3] presented a multiobjective optimization heuristic method named MACOED, which complementarily combines the logistical regression and Bayesian network to identify epistatic interactions. Yuan et al . [7] designed a multiobjective ACObased method named FAACOSE. FAACOSE combines multiobjective optimization functions with an adaptive ant colony optimization algorithm to search epistatic interactions. Sun et al . [14] proposed an ACObased method named EACO for identifying epistatic interactions by incorporating heuristic information multiSURF(Spatially Uniform ReliefF) into antdecision rules.
Recently, in addition to the ACObased algorithm, some other evolutionary methods have also been adopted for the heuristic search of diseaseassociated SNPs [15, 26, 32]. For example, Yang et al . [32] proposed a Genetic algorithmbased hybrid algorithm, which is named genetic ensemble (GE). GE combines an ensemble of classifiers with a multiobjective genetic algorithm to detect epistatic interactions. Aflakparast et al. [15] presented an evolutionarybased heuristic search method CSE (Cuckoo Search Epistasis) to detect SNP interactions. CSE integrates the evolutionarybased optimization algorithm Cuckoo with the Bayesian network to mine SNP interactions. Tuo et al. [13] presented FHSASED, which adopts a harmony search algorithm with the Bayesian network and Giniscore to detect epistatic interactions.
Heuristic search has become a popular search strategy of epistatic interactions for its heuristic positive feedback and small search space for the past decades. However, heuristic search sometimes may lose the global optimum solutions for its approximate search strategy.
In recent years, Gene Expression Programming (GEP) algorithm is a notable evolutionary algorithm, which is a generalized method of Genetic Algorithm (GA) and Genetic Programming (GP) [33]. It has advantages for simply encoding complex problems and searching for global optimum solutions, and discovering rules and formulas [33,34,35]. Therefore, GEP algorithm has been widely adopted in solving complex nonlinear problems that are difficult to be solved by traditional methods for the possible loss of global optimum solutions [36,37,38].
Motivated by GEP, we propose a novel evolutionary framework based on the GEP algorithm called GEPEpiSeeker to detect epistatic interactions. Distinguishing from other evolutionarybased methods, GEPEpiSeeker contains the screening and cleaning stages to find the SNP interactions associated with specific diseases. In the screening stage, we developed a tailormade Gene Expression Programming algorithm named EpiGEP for screening suspected SNP interactions. In the cleaning stage, we conducted Chisquare tests for each screened SNP combinations produced by EpiGEP to identify the significant epistatic interactions. Fig. 1 summarizes the flowchart of the GEPEpiSeeker.
Results and discussion
We conducted experiments on 22 simulated disease models containing 12 disease models with marginal effects (DME) and 10 disease models with no marginal effects (DNME) to investigate the performance of GEPEpiSeeker. The experimental results of GEPEpiSeeker were compared with the experimental results gained from five stateoftheart epistasis detection methods including BEAM [27], BOOST [24], AntEpiSeeker [30], MACOED [3] and EACO [14] in terms of power, recall, precision, and F1score. Furthermore, we investigated the influence of the proposed fuzzy adaptive genetic manipulation rate on GEPEpiSeeker performance. The simulation datasets for the 22 disease models, evaluation metrics, and parameter setting are introduced in the Methods section in detail.
Comparison with stateoftheart methods
Figures 2, 3 and 4 present the performance of different methods on four multiplicative DME disease models (model 1 ~ model 4), four threshold DME disease models (model 5 ~ model 8) and four concrete DME disease models (model 9 ~ model 12), respectively. As shown in Fig. 2, GEPEpiSeeker achieves higher power than all other methods and exhibits increasing power when h^{2}=0.02. Similarly, as shown in Fig. 3 and Fig. 4, GEPEpiSeeker outperforms all other methods in terms of power in most DME models with different parameter settings and is comparable with other methods in the rest DME models. Specifically, the power of GEPEpiSeeker on the models 8 and 10 are equal to 1, and the power of GEPEpiSeeker on the models 11 and 12 are equal to 0.99, due to the effective search guided by the chromosome evolution of GEPEpiSeeker. These results indicate that the Bayesian fitness evaluation combined with the tailormade chromosome evolution can fit the DME models well.
Figure 5 and Fig. 6 present the performance of different methods on ten DNME disease models under h^{2}=0.01 and MAF=0.2. The results of Fig. 5 and Fig. 6 reveal that GEPEpiSeeker outperforms other methods on most DNME models. However, the power of GEPEpiSeeker on DNME models does not reach the optimal level when comparing with its performance on DME models. This is because the SNP interactions in DNME models display no marginal effects and it is hard to capture these SNP interactions [14]. In addition, the performance of GEPEpiSeeker are quite comparable with the performance of BOOST in most models, whereas the power of GEPEpiSeeker is a little smaller than BOOST on DNME models 18 and 20. This is because the DNME models merely show interactive with no marginal effects whereas the mathematical model of BOOST only takes the interactive with no marginal effects into account, thus BOOST perfectly fits this dataset well.
To comprehensively evaluate the performance of our proposed method, we also compare the performance of the GEPEpiSeeker and other methods in terms of recall, precision and F1 on all disease models. Tables 1, 2 and 3 show the comparison results of recall, precision and F1 on different disease models, respectively. Note that the values in brackets are the pvalues of the ttest between results of GEPEpiSeeker and the corresponding comparative method. As seen from Tables 1, 2 and 3, compared with other comparative methods, GEPEpiSeeker achieves the best on 19 out of 22, 20 out of 22, and 17 out of 22 disease models in terms of recall, precision and F1, respectively. In terms of recall, GEPEpiSeeker just slightly underperforms MACOED and EACO on the disease model 7, slightly underperforms EACO on the disease model 9, and has poor performance than MACOED, EACO, and BOOST on the disease model 18. In terms of precision, GEPEpiSeeker just slightly underperforms MACOED on the disease models 6 and 17. In terms of F1, GEPEpiSeeker just slightly underperforms MACOED and EACO on the disease models 6, 7, 9, 17, and 18. These results demonstrate that GEPEpiSeeker outperforms comparative methods on most disease models. Overall, GEPEpiSeeker is superior to stateoftheart methods in the experiment. This indicates that the effective optimization of SNP combinations by the GEP algorithm greatly helps to narrow the search space and improve the power of our method. It is also interesting to see in Tables 1 3 and Fig. 5 that the power, precision and F1 of the results produced by GEPEpiSeeker achieves better performances than other comparative methods in most settings of DNME and DME models, demonstrating that the results of GEPEpiSeeker on DNME and DME models are worth exploring, despite not obtaining correspondingly high levels.
Although the recalls of MACOED and EACO on disease models 7, 9 and 18 are better than those of GEPEpiSeeker, the precision of GEPEpiSeeker is higher than that of MACOED and EACO. Similarly, the recalls of BOOST on models 18, 20 and 21 are higher or equal to those of GEPEpiSeeker, but the precisions of GEPEpiSeeker on these models are much higher than those of BOOST, thereby resulting in GEPEpiSeeker’s superior performance in F1. These results demonstrate that GEPEpiSeeker performs well in both recall and precision by coupling the EpiGEP algorithm with the Chisquare test.
Note that the values in brackets are the pvalues of the ttest between results of GEPEpiSeeker and the corresponding comparative method. The best performances of each disease model are shown in bold and italics.
The influence of fuzzy adaptive genetic manipulation rate
In this section, we investigate whether fuzzy adaptive control will affect the performance of GEPEpiSeeker. The comparisons on all metrics are based on the average score of 20 times of each epistasis model. Figures 7, 8 and 9 show the comparison result, where GEPEpiSeekerf represents that GEPEpiSeeker uses fuzzy adaptive genetic manipulation rate, while GEPEpiSeekern represents GEPEpiSeeker does not use fuzzy adaptive genetic manipulation rate but uses the same fixed genetic manipulation rate as the original GEP. We observe from Figures 7 9 that, on most epistasis models, GEPEpiSeekerf outperforms GEPEpiSeekern over all the metrics. This indicates that the use of fuzzy adaptive genetic manipulation rate in GEPEpiSeeker improves epistatic interaction detection, which is largely because the fuzzy adaptive genetic manipulation rate can improve the global search of GEP.
Conclusion
In this work, we presented a novel method named GEPEpiSeeker, based on the Gene Expression Programming algorithm, to identify epistatic interactions in Genomewide Association Studies. In GEPEpiSeeker, we proposed several tailormade chromosome rules to depict SNP combinations, and integrated Bayesian networkbased fitness function into the evolution of the chromosomes to search candidate SNP combinations and used the Chisquare test to identify optimal solutions from candidate SNP combinations.
Furthermore, we proposed two genetic operators with multiple and adjacent mutations and an adaptive genetic manipulation method with fuzzy control to improve the convergence and accuracy of our method. We conducted experiments on 22 disease models including 12 DME models and 10 DNME models to evaluate our method. Experimental results show that GEPEpiSeeker is comparable or even superior to other comparative methods including BEAM, BOOST, AntEpiSeeker, MACOED and EACO in terms of power, recall, precision and F1score on all datasets. These results indicate that GEPEpiSeeker could be a promising alternative to the existing methods in epistasis detection and will provide a new way for accurately identifying epistasis.
Generally, the length of the GEP chromosome grows as the epistatic order increases, which results in a large increase in computation resources. A possible solution for this problem is to implement highperformance parallel algorithms for detecting epistasis interactions, which would be of interest in future work.
Methods
For solving the epistasis detection problem with high dimension and small sample size, we transformed the identification of diseasecausing SNP combinations into a heuristic combinatorial optimization problem. Then, GEPEpiSeeker formulates SNP combinations using tailormade GEP chromosome rules for epistasis detections, and discovers candidate SNP combinations by integrating Bayesian fitness evaluation with the tailormade chromosome evolution, and finds optimal solutions from candidate SNP combinations by the Chisquare test. Furthermore, two genetic operators with multiple and adjacent mutations and an adaptive genetic manipulation method with fuzzy control are proposed to guide the tailormade chromosome evolution, which helps to improve the convergence and accuracy of the algorithm.
In this section, we first briefly introduce the fundamentals of GEP in the first subsection. Then the proposed method GEPEpiSeeker is introduced in detail, which involves the definitions of tailormade chromosome and genetic operators, fuzzy adaptive control of genetic manipulation rate, and Bayesian networkbased fitness function in the screening stage, and Chisquare tests for cleaning significant epistasis in the cleaning stage. Finally, we introduce the experimental method in this work, which involves the datasets, evaluation metrics for comparing the performance of the comparative methods, and the parameter setting.
Fundamentals of Gene Expression Programming
Gene Expression Programming (GEP) is an excellent evolutionary algorithm, which is based on the gene expression law of biological genetics [33, 39]. GEP does not rely on gradient information and initial search point and is strong at searching optimum solutions [33, 39]. GEP heuristically searches the optimum solutions using chromosome evolution. A GEP chromosome consists of one or multiple genes. Each gene in the chromosome consists of a head and a tail. The head consists of function set F, which contains a series of simple functors, and terminator set T, which contains a series of decision variables and constants. The tail only consists of the terminator set. Assuming that the gene head length is h, the tail length t satisfies the following Exp. (1), where n is the maximum arity of the functors in F.
The GEP chromosome has two forms of expression, one of which is the Karva expression (Kexpression), and the other is the expression tree. Each gene in the chromosome can be expressed in a Kexpression and an expression tree. Both Kexpression and expression trees can be transformed into each other. We can transform the expression tree into Kexpression by traversing the expression tree from top to bottom and left to right. Similarly, we can transform Kexpression into an expression tree by filling the expression tree layer by layer with the symbols of Kexpression from left to right. For example, Exp. (2) is a GEP chromosome with a gene of length 9, which includes functors {Q,*,,+} and terminators {a,b,c,d,2},
where Q denotes the squareroot function. According to GEP algorithm, the expression tree of the chromosome in Exp. (2) is shown in Fig. 10, and this expression tree can be interpreted as Exp. (3) in mathematics.
Each chromosome of GEP can be regarded as a solution of a target problem and is evaluated by the fitness function of GEP. The higher optimal fitness value of the solution is, the better solution represented in the chromosome is. The chromosomes can gradually evolve after a series of genetic manipulations until obtaining a solution with an acceptable fitness value. The genetic manipulation of GEP mainly includes selection, mutation, and crossover. The flowchart of GEP is shown in Fig. 11. For more details of GEP, please refer to [33, 36].
Screening stage: EpiGEP for screening SNP combinations
In this section, we will elaborate on our GEPbased algorithm named EpiGEP for detecting epistatic interactions. In EpiGEP, we proposed several tailormade chromosome rules, two new genetic operators, and a tailormade fitness function, and a genetic manipulation method with adaptive rate to accurately detect epistatic interactions. Fig. 12 provides the pseudocode of EpiGEP. In the following, we will elaborate on the procedure of EpiGEP.
Tailormade chromosome
In EpiGEP, each chromosome in a population is a candidate solution of a kway SNP interaction combination that is associated with disease status Y. Recall that each gene in the GEP chromosome consists of a head and a tail. In EpiGEP, each gene consists of a head, a tail and an GT domain. The GT domain represents a genotype of one SNP. Let Chr_{i} be ith chromosome in a population with L chromosomes, i=1, 2, …, L. The chromosome Chr_{i} can be described by Exp. (4):
where S_{ij} is the jth SNP in the SNP dataset, j=1, 2…, k; Gt_{ij} is the variable of S_{ij} genotypes with values of {0,1,2}; (S_{ij}, Gt_{ij}) indicates a gene of the chromosome Chr_{i}.
Exp. (5) gives an example of an EpiGEP chromosome Chr_{k}. Chr_{k} is the kth chromosome with head length h=3 in a population, which is described as a 2way SNP interaction combination with genotypes 0 and 2. In Exp. (5), Chr_{k} includes two genes: (* + 1825, 0) and (+ *+ 3674, 2). In gene (* + 1825, 0), the head is “* +”, the tail is “1825” and the GT domain is “0”.
In EpiGEP, any kway (k=1, 2, 3, …) SNP interaction combination can be described by Exp. (4). In order to map each EpiGEP chromosome into a valid solution in SNP interaction detections, we define several idealized rules:

EpiGEP only uses functors {+,,*, /} and terminators {1, 2,…, n}, n is the total number of SNP in the dataset.

Each chromosome in EpiGEP cannot contain identical SNP markers. The decoding result of S_{ix} and S_{iy} in a chromosome must not be identical (x≠y), or else this chromosome has to be mutated to get a new valid solution. The adjacent mutation is preferable (see section 3.2.2 for details).

When EpiGEP decoding the expression trees of genes, the decoding results will be performed modulo by the number of SNP in the dataset. EpiGEP takes the absolute value of the modulo results as the final results.
The EpiGEP chromosome of Exp.(5) can be encoded into expression trees and these trees are shown in Fig. 13. These expression trees can be decoded into 49 and 29, which correspond to the 49^{th} and 29^{th} loci of SNP, respectively. Then a candidate SNP interaction combination of S_{i49} and S_{i29} can be derived from the decoding results of these expression trees.
Tailormade genetic operators
EpiGEP inherits the genetic operators of GEP and expands two new genetic operators including adjacent mutation and multigene mutation to improve epistatic interaction detection. There are considerable correlations among neighboring SNPs in the genome as measure by linkage disequilibrium (LD) [15]. This is a helpful clue for finding epistatic interactions. We developed a novel genetic operator called adjacent mutation using the LDspecific heuristics to narrow the combination space and accelerate the convergence of EpiGEP. The adjacent mutation obeys the following idealized rules:

The adjacent mutation performs mutation when a random number between 0 and 1 is smaller than the given threshold value called adjacent mutation rate;

The adjacent mutation aims at refining the solutions with the neighborhoods of the current solution. To achieve this goal, the adjacent mutation only mutates at the tail or the GT domain of the objective gene. When adjacent mutation takes at the tail, the adjacent mutation randomly replaces the locus of the mutation point with one of the neighboring loci of the mutation point. When adjacent mutation takes at the GT domain, the adjacent mutation replaces the genotype with one of the rest genotypes.
In addition, we proposed another novel genetic operator Multigene mutation for EpiGEP. Multigene mutation simultaneously implements mutation operation on multiple points of different genes. The Multigene mutation could increase the diversity of population, assisting EpiGEP to jump out of the current search area, which avoids EpiGEP falling into local optimum to some extent and finally enhances the global exploration power of EpiGEP.
Fuzzy adaptive control of genetic manipulation rate
The crossover rate of evolutionary algorithms will largely influence their convergence efficiency, while the mutation rate determines whether the algorithms can globally find the optimal solution out of the local optimum solution or not [40]. Nevertheless, similar to other evolutionary algorithms, GEP keeps the initial parameters unchanged during the procedure of the program. As evolution is ongoing, it is not easy to jump out of the local optimum solution due to the loss of population diversity.
In this work, we use a fuzzy control method to dynamically and automatically adjust the genetic manipulation rates of EpiGEP to find the globally optimum solution out of the local optimum solution.
First, population diversity is measured according to the dispersion degree of individual fitness in the population. Population diversity is evaluated by the ratio d of optimal fitness (F_{best}) to average fitness (F_{ave}) of the current population. Equation (6) is used to determine the population diversity when F_{best} ≤F_{ave}. On the contrary, Equation (7) is used. As the population converges, d gradually approaches one.
We designed some different fuzzy controllers to describe the size of population diversity and dynamically adjust the genetic manipulation rate. To simplify, we introduce how to use three different fuzzy controllers to adjust the crossover rate, mutation rate and adjacent mutation rate combined with fuzzy mathematics. These three fuzzy controllers use the current population diversity and the number of the current iterations as input. Outputs of the three fuzzy controllers are crossover rate, mutation rate and adjacent mutation rate of the nextgeneration population. Membership function of input and output is constructed by the triangular membership function and trapezoid membership function. Five fuzzy linguistic variables {XL, ML, M, MH, XH} are represented by low, lowmedium, medium, mediumhigh and high diversity, respectively. They are used to describe the five fuzzy membership functions, as shown in Fig. 14. When the population diversity becomes low, GEP will increase the mutation rate to enhance diversity. When the population diversity becomes too high, GEP will increase the crossover rate and reduce the mutation rate.
Bayesian networkbased fitness function
A Bayesian network (BN) is a probabilistic directed graphical model [3]. In the GWAS Bayesian network, a directed graphical BN model has consisted of a set of nodes and edges [6]. Each node represents a genotype or phenotype, while each edge represents the conditional dependencies between nodes. Given the Markov condition, in a BN model with m+1 nodes (m SNP nodes and a disease state), the joint probability distribution for the m+1 nodes can be calculated as the following [3, 6]:
where pa(x_{i}) denotes the set of parent nodes of x_{i}. An instance of mSNP epistasis BN model is given in Fig. 15. Note that, in the epistasis BN model, there are only edges going from an SNP node to a disease node [6]. As we can see in Fig. 15, for an mSNP epistasis BN model, the total number of combinations of SNP and disease state is \({C}_n^m\), where n is the total number of SNP in the SNP set.
In EpiGEP, we take the K2 score given in [3] as the fitness evaluation function. K2 score can be calculated as the following:
where I is the total number of SNP combinations, and I=3^{m} as the possible values of SNP node are 0, 1 or 2. J denotes the state number of disease nodes [3]. r_{i} is the number of ith SNP combination and r_{ij} denotes the number of ith SNP combination connected with jth disease state [3, 6]. K2 score has been proposed to the mlocus epistasis detection in MACOED [3] and FHSASED [13], but these swarm intelligence based algorithms are only effective in detecting 2locus epistasis. In this work, m can be set as a positive integer larger than 1 according to the users’ requirement.
Cleaning stage: Chisquare tests for cleaning significant epistasis
In the screening stage, GEPEpiSeeker gets a candidate solution set that consists of all suspected diseasecausing SNP combinations. In the cleaning stage, the task of GEPEpiSeeker is to identify the real diseasecausing SNP combinations from candidate solutions. Previous researches [3, 41] showed that the Chisquare test can simply and powerfully identify the SNP combinations associated with the disease without considering disease models. GEPEpiSeeker conducts an exhaustive search in candidate solutions with the Chisquare test to identify the significant epistasis. In the Chisquare test, the null hypothesis is that the candidate solution and the specific disease are not associated [3, 41]. The alternative hypothesis is that the candidate SNP combinations associated with the disease are accepted when the Pvalue of the Chisquare test is smaller than 0.05 [3, 41].
Experimental method
Datasets
We used 22 GWAS datasets corresponding to 22 epistasis models as GWAS datasets, which were generated by the classic simulation software GAMETES 2.0 [42]. GAMETES was widely used in the performance evaluation of epistasis detection [43]. In 22 epistasis models, there are 12 disease models with marginal effects (DME) and 10 disease models with no marginal effects (DNME).
The 12 DME models contain three types of DME epistasis models including 4 multiplicative models, 4 threshold models and 4 concrete models. These 12 DME models are produced by three different penetrance functions. These penetrance functions of the 12 DME epistasis models are shown in Table 4 [3]. These models have both marginal and interaction effects. The parameters α and β are used to control the penetrance table. The disease prevalence P(D), the genetic heritability h^{2} and the minor allele frequency MAF can be determined by α and β [3]. In this work, P(D)=0.1. In the experiments, the multiplicative models, threshold models and concrete models are named as model 1 ~ model 4, model 5 ~ model 8, model 9 ~ model 12, respectively.
The 10 DNME models (model 13 ~ model 22) are limited to the HardyWeinberg equilibrium (HWE) constraints but not limited to specific predetermined models. The penetrance table of the DNME models was produced by an exhaustive search.
Table 5 lists the details of 22 epistasis models. In each model of our experiments, there are 100 datasets with 750 controls and 750 cases genotyped by 100 SNPs.
Evaluation method
In this section, we compare the performance of GEPEpiSeeker with other representative methods [3, 14, 24, 27, 30]. Following [3], we also used four common metrics including power, recall, precision and F1score (F1) to evaluate the performance of these comparative methods. These metrics are defined as follows:
where N_{s} is the number of identified diseasecausing models from all N_{d} datasets (in the experiments, N_{d}= 100 for each disease model). TP denotes the number of SNP combinations associated with disease verified by the comparative algorithm, where the Pvalue of the Chisquare test is smaller than the given threshold (P<0.05). FN denotes the number of SNP combinations that are truly associated with disease but are identified as not associated with disease by the algorithm. FP denotes the number of SNP combinations that are not associated with disease but are identified as diseaserelated by the algorithm.
For each epistasis model and comparative method, it was independently run 20 times with the same 100 data files in our experiment to avoid stochastic deviation. For each epistasis model, we conducted some ttests on 20 results of each method to validate the performance of the comparative models and GEPEpiSeeker.
Parameter setting
Since the Elitism mechanism can guarantee the global convergence of GEP [34], EpiGEP uses the roulette wheel selection model and the Elitism mechanism when it produces offspring. The parameters of GEPEpiSeeker are: population_size=100, number_of_iteration=1000, head_length=5, initial_genetic_manipulation_rate = 0.3. EpiGEP will terminate when the number of iteration N_i >1000. For a klocus epistasis detection, the number of gene N_g in an EpiGEP chromosome is k.
The parameters for BEAM and BOOST were set as the default of the BEAM and BOOST packages, respectively. Due to AntEpiSeeker, EACO and MACOED being three ACObased methods, the parameter settings of these ACObased methods were the same to conduct a fair comparison. The ant number and iteration number were set to 200 and 1000, respectively; the initial pheromone τ0 was set to 100; the parameters and that determine the weights of pheromone and heuristic information were set to 1. The evaporation coefficient of pheromones was set to 0.3. In addition, the rest of parameter settings for AntEpiSeeker were: largesetsize = 6, smallsetsize = 3, iItCountLarge = 150, iItCountSmall = 300.
Availability of data and materials
The Simulation datasets of the models were generated by GAMETES 2.0, that is available from https://surveillance.cancer.gov/geneticsimulationresources/packages/gametes/. All other data that support the results of this study are available from the corresponding author upon request.
Abbreviations
 SNP:

Single nucleotide polymorphism
 GWAS:

Genomewide association studies
 BEAM:

Bayesian Epistasis Association Mapping
 ACO:

The ant colony optimization algorithm
 MACOED:

A multiobjective optimization heuristic method for identifying epistatic interactions
 FAACOSE:

A multiobjective ACObased method for identifying epistatic interactions
 EACO:

An ACObased method for identifying epistatic interactions by incorporating heuristic information multiSURF into antdecision rules
 SURF:

Spatially Uniform ReliefF
 GE:

A Genetic Algorithmbased hybrid algorithm, which is named genetic ensemble
 CSE:

A Cuckoo Search method for identifying SNP interactions
 FHSASED:

A harmony search algorithm with the Bayesian network and Giniscore for identifying epistatic interactions
 GEP:

The Gene Expression Programming algorithm
 GA:

The Genetic Algorithm
 GP:

The Genetic Programming algorithm
 DME:

Disease models with marginal effects
 DNME:

Disease models with no marginal effects
 GT domain:

A substructure of the chromosome that represents the gene type
 LD:

Linkage disequilibrium
 BN:

The Bayesian network
 GAMETES:

A simulation software for generating simulation GWAS datasets
 MAF:

The minor allele frequency
 HWE constraints:

The HardyWeinberg equilibrium constraints
 TP:

True Positive
 TN:

True Negative
 FN:

False Negative
 FP:

False Positive
References
Churchill GA, Airey DC, Allayee H, Angel JM, Attie AD, Beatty J, et al. The Collaborative Cross, a community resource for the genetic analysis of complex traits. Nature Genetics. 2004;36(11):1133–7.
Fontanesi L, Schiavo G, Galimberti G, Calò DG, Scotti E, Martelli PL, et al. A genome wide association study for backfat thickness in Italian Large White pigs highlights new regions affecting fat deposition including neuronal genes. Bmc Genomics. 2012;13(1):583.
Jing PJ, Shen HB. MACOED: a multiobjective ant colony optimization algorithm for SNP epistasis detection in genomewide association studies. Bioinformatics. 2014;31(5):634–41.
Huang D, Du J. A Constructive Hybrid Structure Optimization Methodology for Radial Basis Probabilistic Neural Networks. IEEE Transactions on Neural Networks. 2008;19(12):2099–115.
Deng SP, Zhu L, Huang DS. Mining the bladder cancerassociated genes by an integrated strategy for the construction and analysis of differential coexpression networks. BMC Genomics. 2015;16 (Suppl 3):S4.
Li X. A fast and exhaustive method for heterogeneity and epistasis analysis based on multiobjective optimization. Bioinformatics. 2017;33(18):2829–36.
Yuan L, Yuan CA, Huang DS. FAACOSE: A Fast Adaptive Ant Colony Optimization Algorithm for Detecting SNP Epistasis. Complexity. 2017;2017(1):1–10.
Yu HJ, Huang DS. Normalized Feature Vectors: A Novel AlignmentFree Sequence Comparison Method Based on the Numbers of Adjacent Amino Acids. IEEE/ACM Trans Comput Biol Bioinformatics. 2013;10(2):457–67.
Zhao ZQ, Huang DS, Sun BY. Human face recognition based on multifeatures using neural networks committee. Pattern Recognition Letters. 2004;25(12):1351–8.
Wang X, Huang D. A Novel DensityBased Clustering Framework by Using Level Set Method. IEEE Transactions on Knowledge and Data Engineering. 2009;21(11):1515–31.
Huang Y, Zhong C, Lin HX, Wang J, Peng Y. Reconstructing Phylogeny by Aligning Multiple Metabolic Pathways Using Functional Module Mapping. Molecules. 2018;23(2):486.
Shang J, Wang X, Wu X, Sun Y, Ding Q, Liu J, et al. A Review of Ant Colony Optimization Based Methods for Detecting Epistatic Interactions. IEEE Access. 2019;7:13497–509.
Tuo S, Zhang J, Yuan X, Zhang Y, Liu Z. FHSASED: TwoLocus Model Detection for GenomeWide Association Study with Harmony Search Algorithm. PLOS ONE. 2016;11(3):e0150669.
Sun Y, Wang X, Shang J, Liu JX, Lei X. Introducing Heuristic Information into Ant Colony Optimization Algorithm for Identifying Epistasis. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2018;PP(99):11.
Aflakparast M, Salimi H, Gerami A, Dubé MP, Visweswaran S, MasoudiNejad A. Cuckoo search epistasis: a new method for exploring significant genetic interactions. Heredity. 2014;112:666.
DeShuang H. A constructive approach for finding arbitrary roots of polynomials by neural networks. IEEE Transactions on Neural Networks. 2004;15(2):477–91.
Huang D, Jiang W. A General CPLAdS Methodology for Fixing Dynamic Parameters in Dual Environments. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 2012;42(5):1489–500.
Huang DS. RADIAL BASIS PROBABILISTIC NEURAL NETWORKS: MODEL AND APPLICATION. International Journal of Pattern Recognition and Artificial Intelligence. 1999;13(07):1083–101.
Huang DS, Ip HHS, Chi Z. A Neural Root Finder of Polynomials Based on Root Moments. Neural Computation. 2004;16(8):1721–62.
Huang Y, Zhong C. Detecting listcolored graph motifs in biological networks using branchandbound strategy. Computers in Biology and Medicine. 2019;107:1–9.
Xie M, Li J, Jiang T. Detecting genomewide epistases based on the clustering of relatively frequent items. Bioinformatics. 2011;28(1):5–12.
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, et al. MultifactorDimensionality Reduction Reveals HighOrder Interactions among EstrogenMetabolism Genes in Sporadic Breast Cancer. The American Journal of Human Genetics. 2001;69(1):138–47.
Abo Alchamlat S, Farnir F. KNNMDR: a learning approach for improving interactions mapping performances in genome wide association studies. BMC Bioinformatics. 2017;18(1):184.
Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NLS, et al. BOOST: A Fast Approach to Detecting GeneGene Interactions in Genomewide CaseControl Studies. The American Journal of Human Genetics. 2010;87(3):325–40.
Zhang X, Huang S, Zou F, Wang W. TEAM: efficient twolocus epistasis tests in human genomewide association study. Bioinformatics. 2010;26(12):i217–27.
Shang J, Zhang J, Sun Y, Liu D, Ye D, Yin Y. Performance analysis of novel methods for detecting epistasis. BMC Bioinformatics. 2011;12(1):475.
Zhang Y, Liu JS. Bayesian inference of epistatic interactions in casecontrol studies. Nature Genetics. 2007;39:1167.
Tang W, Wu X, Jiang R, Li Y. Epistatic Module Detection for CaseControl Studies: A Bayesian Model with a Gibbs Sampling Strategy. Plos Genetics. 2009;5(5):e1000464.
Jiang R, Tang W, Wu X, Fu W. A random forest approach to the detection of epistatic interactions in casecontrol studies. BMC Bioinformatics. 2009;10(1):S65.
Wang Y, Liu X, Robbins K, Rekaya R. AntEpiSeeker: detecting epistatic interactions for casecontrol studies using a twostage ant colony optimization algorithm. BMC Research Notes. 2010;3(1):117.
Wan X, Yang C, Yang Q, Xue H, Tang NLS, Yu W. Predictive rule inference for epistatic interaction detection in genomewide association studies. Bioinformatics. 2009;26(1):30–7.
Yang P, Ho JWK, Zomaya AY, Zhou BB. A genetic ensemble approach for genegene interaction identification. BMC Bioinformatics. 2010;11(1):524.
Ferreira C. Gene Expression Programming: a New Adaptive Algorithm for Solving Problems. Complex Systems. 2001;13(2):87–129.
Peng Y, Yuan C, Qin X, Huang J, Shi Y. An improved Gene Expression Programming approach for symbolic regression problems. Neurocomputing. 2014;137:293–301.
Deng S, Yue D. Yang Lc, Fu X, Feng Yz: Distributed Function Mining for Gene Expression Programming Based on Fast Reduction. PLOS ONE. 2016;11(1):e0146698.
Peng YZ, Yuan CA, Chen JW, XinDong WU, Wang RL. Multicellular gene expression programming algorithm for function optimization. Control Theory & Applications. 2010;27(11):1585–9.
Zhong J, Ong YS, Cai W. SelfLearning Gene Expression Programming. IEEE Transactions on Evolutionary Computation. 2016;20(1):65–80.
Sabar NR, Ayob M, Kendall G, Qu R. A Dynamic Multiarmed BanditGene Expression Programming HyperHeuristic for Combinatorial Optimization Problems. IEEE Transactions on Cybernetics. 2015;45(2):217–28.
Huang DS, Zheng CH. Independent component analysisbased penalized discriminant method for tumor classification using gene expression data. Bioinformatics. 2006;22(15):1855–62.
Yang C, Qian Q, Wang F, Sun M: An improved adaptive genetic algorithm for function optimization. In: 2016 IEEE International Conference on Information and Automation (ICIA): 13 Aug. 2016 2016. 675680.
Guan B, Zhao Y, Li Y. DESeeker: Detecting Epistatic Interactions Using a TwoStage Differential Evolution Algorithm. IEEE Access. 2019;7:69604–13.
Urbanowicz RJ, Kiralis J, SinnottArmstrong NA, Heberling T, Fisher JM, Moore JH. GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData Mining. 2012;5(1):16.
Tuo S, Zhang J, Yuan X, He Z, Liu Y, Liu Z. Niche harmony search algorithm for detecting complex disease associated highorder SNP combinations. Scientific Reports. 2017;7(1):11529.
Acknowledgments
Not applicable.
About this supplement
This article has been published as part of BMC Genomics Volume 22 Supplement 1, 2021: Proceedings of the 2019 International Conference on Intelligent Computing (ICIC 2019): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume22supplement1.
Funding
This work was supported in part by the National Natural Science Foundation of China Grant # 61862006, #61562008 and #61961004, the Natural Science Foundation of Guangxi Province Grant #2017GXNSFAA198228, 2020GXNSFAA159074, 2017GXNSFAA198276, 2017GXNSFAA198263 and the BAGUI Scholar Program of Guangxi Zhuang Autonomous Region of China.
Author information
Authors and Affiliations
Contributions
Yuzhong Peng and Yiran Huang conceived the study. Yuzhong Peng, Yanmei Lin, and Jianping Liao designed the experiment. Ying Li, Yuzhong Peng, and Guangsheng Luo performed the experiment and wrote the manuscript. Yanmei Lin, Yiran Huang, and Jianping Liao revised and polished the manuscript. All authors have participated in the study discussion and manuscript preparation. All of the authors have read and approved the final manuscript. Yuzhong Peng and Yanmei Lin contributed equally to this work and should be regarded as cofirst authors. Yiran Huang and Jianping Liao are the cocorresponding authors.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Peng, Y.Z., Lin, Y., Huang, Y. et al. GEPEpiSeeker: a gene expression programmingbased method for epistatic interaction detection in genomewide association studies. BMC Genomics 22 (Suppl 1), 910 (2021). https://doi.org/10.1186/s12864021082078
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12864021082078