An efficiency analysis of high-order combinations of gene–gene interactions using multifactor-dimensionality reduction

Background Multifactor dimensionality reduction (MDR) is widely used to analyze interactions of genes to determine the complex relationship between diseases and polymorphisms in humans. However, the astronomical number of high-order combinations makes MDR a highly time-consuming process which can be difficult to implement for multiple tests to identify more complex interactions between genes. This study proposes a new framework, named fast MDR (FMDR), which is a greedy search strategy based on the joint effect property. Results Six models with different minor allele frequencies (MAFs) and different sample sizes were used to generate the six simulation data sets. A real data set was obtained from the mitochondrial D-loop of chronic dialysis patients. Comparison of results from the simulation data and real data sets showed that FMDR identified significant gene–gene interaction with less computational complexity than the MDR in high-order interaction analysis. Conclusion FMDR improves the MDR difficulties associated with the computational loading of high-order SNPs and can be used to evaluate the relative effects of each individual SNP on disease susceptibility. FMDR is freely available at http://bioinfo.kmu.edu.tw/FMDR.rar. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1717-8) contains supplementary material, which is available to authorized users.


Background
Large single nucleotide polymorphisms (SNPs) research projects across the human genome are important studies for biological and biomedical science, with many researchers seeking to use SNPs as predictors for susceptibility to disease. Traditional approaches to identify SNP interactions usually use bio-statistical methods such as contingency tables combined with k-fold cross-validation, but the vast number of possible combinations makes the application of traditional methods difficult. Therefore, current research is aimed at combining biostatistics and machine learning in family-based and case-control association studies [1][2][3][4][5][6][7][8].
Multifactor dimensionality reduction (MDR) [9] is a well-known hybrid technology that combines a 2-way contingency table, k-fold cross-validation, and a dimensionality reduction technique. MDR belongs to a group of non-parametric statistical methods used to determine high-order gene-gene interactions in case-control studies [9,10]. Typically, multi-locus genotypes are classified into high-risk and low-risk classes, allowing the number of genotype predictors to be effectively reduced from n dimensions to one dimension. This reduction influences the contingency table allowing for the quick computation of statistics including the accuracy rate, odds ratio (OR), P-value, etc. Many modifications and extensions to MDR have been proposed and these can be divided into three groups. The first group contains modifications and combinations of biostatistics in MDR; this group includes entropy-based interpretation methods [11], the use of OR [12], generalized linear models [13], log-linear methods [14], Bayesian posterior probability [15], and model-based methods [16]. The second group focuses on particular data problems, such as imbalanced data [17,18], permutation testing [19], and missing data [20]. These extensions and modifications of MDR have been used to address different situations encountered in disease analysis. Many disease studies have thus successfully employed MDR to detect interactions between particular genes, including those for coronary artery disease [21,22], hypertension [23][24][25], bladder cancer [26], and autism [27]. Finally, the third group aims to reduce MDR computational time, using methods including parallel implementations [28] and the use of hardware graphics processing units (GPUs) [29,30]. Although these studies use GPUs to reduce MDR running time, the problem of factorial operation in MDR still presents a challenge.
This study seeks to develop a new framework to improve MDR computational times in investigations of highorder gene-gene interaction. The framework retains the significant factors to reduce the number of multi-locus evaluations in MDR. Improvements in computational time were measured over 100 runs on a simulation data set and a genome-wide analysis of chronic dialysis epistasis.

MDR algorithm
MDR is an attribute construction approach that reduces the data dimensionality by seeking to identify combinations of multi-locus genotypes that are associated with either high-risk or low-risk groups. The combination of two or more locus genotypes into a single attribute can be used to effectively estimate the risk associated with genegene interactions in relation to a disease. This study uses the imbalanced functions proposed by Yang et al. [17]. MDR can be divided into five separate processes. In the first step, the data are divided into 10 parts for ten-fold cross-validation. Nine-tenths of the data are classified as training sets and the remaining 1/10 is used for testing. The second step is the construction of a contingency table. For a given interaction order n, n SNPs are selected from the data set. L is defined as a set of multi-locus genotypes at n loci and/or environmental factors. L can be represented as an n-dimensional vector: where l represents an SNP factor and/or environmental factor. Next, L is used to calculate the case-control ratios for each multi-locus genotype. The ratio between cases and controls is evaluated by Equation (2). where where the cases are labelled P and the controls are labelled N. P * and N * respectively represent the sizes of cases and controls in the training set. Here j represents the index of samples in the cases and controls. P j represents the j th sample among the cases and N j represents the j th sample among the controls. u(L, P j ) represents a match (given a score of "1") if all multi-locus genotypes l in vector L match P; a mismatch is given a score of "0". u(L, N j ) represents a match (given a score of "1") if all multi-locus genotypes l in vector L match N; a mismatch is given a score of "0". For example, a 2-order interaction model consisting of SNP1 and SNP2 has nine multilocus genotypes, i.e., AA-AA, AA-Aa, AA-aa, Aa-AA, Aa-Aa, Aa-aa, aa-AA, aa-Aa and aa-aa. After the ratio calculation, each L is labelled 'H' ("high") if the ratio of cases to controls is equal to or greater than a threshold of T (=1); otherwise it is labelled 'L' ("low"). Once all Ls are labelled 'H' or 'L' , a new binary attribute is created by pooling the high-risk genotype combinations into one group and the low-risk genotype combinations into another group. This means that the four frequencies (TP, FP, TN, and FN) can be computed in a 2-way contingency table. Finally, each possible L computes a training classification error rate for each n-way interaction in the training set. The classification error rate is given by Equation (3).
Among all n SNP combinations, the best model with the minimum classification error rate is selected by the training step. The third step evaluates the remaining 1/10 of the original data set (i.e., the independent test data). This step creates an MDR attribute for the testing set using the n SNPs that have the minimum training classification error rate. In addition, the best model in each cross-validation is collected and named the cross-validation consistency (CVC). In the fourth step, the procedure is repeated 10 times (i.e., ten-fold cross-validation) so that each sample is included in the testing set once, and the resulting classification error rates of each of the ten models in CVC are averaged. In the last step, the best MDR model with the highest frequency in CVC is selected.

Fast MDR algorithm (FMDR)
FMDR proposes a new framework to improve the MDR computational time. Figure 1 shows the FMDR flowchart consisting of five steps: (1) data processing, (2) selection of training and testing sets, (3) evaluation of all possible combinations, (4) identifying the best model, and (5) statistical analysis of the best model. In the FMDR, the number of selected SNPs is limited to two at the outset. The framework is represented by the thick frame in  Table 1 A paired t-test comparison of the power analysis between MDR and FMDR for 2-to 5-loci   (2) and (5) (Fig. 1). In step (2), the framework checks whether or not the number of loci is equal to two. If yes, all available two-order locus combinations amongst the loci are created and regarded as conditions. All these conditions are then used to evaluate the contingency table (step (3)), and the classification error rate in each combination is estimated by Equation (3) (step (4)). In step (5), all two-order locus combinations are sorted based on the classification error rate, and then the results of the best n% combinations with the minimum classification error rate are saved into the i th memory where i is the i th -fold cross-validation. When ten cross-validations are computed, the best 2-loci model is output to show related gene-gene interaction information. If the number of order exceeds two (i.e., m-loci, m > 2), each cross-validation uses the corresponding memory and the recorded results of the best n% combinations to create the available combinations (go to step (2)), i.e., conditions. In step (3), these conditions are evaluated using the contingency table, and the classification error rates of the conditions are estimated in step (4). The results are then sorted and the best n% combinations are saved into i th memory to analyze the next interaction order. This process tremendously reduces the number of available combinations. The processes are repeatedly implemented until the defined number of selected SNPs is analyzed.

Illustrative example to FMDR and statistical analysis
The supplementary Additional file 1 provides an example to illustrate how the FMDR works, and the supplementary Additional file 2 explains the statistical analysis method.

Results
Results on the simulated data set All simulated models set the 50 attributes with a heritability of 0.2. The minor allele frequencies (MAFs) were 0.1, 0.2, and 0.4. The sample sizes were 800 and 1600, in which the total number of cases is equal to the total number of controls. The simulation data was generated using GAMETES, software used for generating complex n-loci models with random architectures [31]. The settings and results of the six models are shown in Table 1. Figure 2 shows the power analysis box plots of six models. A summary of the six simulation data set shows that the difference between MDR and FMDR was statistically significant for 4-loci and 5-loci, but there was only a slight difference between the averages of the two methods. Figure 3 showed the execution times for the simulation data sets. The MDR execution times in all locus orders were collected in a stand-alone test. The total of all FMDR execution times for all locus orders was collected since FMDR uses a continual analysis strategy. For the six simulation data sets, MDR and FMDR required similar durations to implement the 2-loci analysis. When comparing 2-loci and n-loci (n = 3, 4, 5) in model 1, the growth times between MDR and FMDR for 3-loci to 5-loci were 3.796 vs.

Results on the chronic dialysis data set
The 77 mitochondrial SNPs in the D-loop region of chronic dialysis patients were obtained from investigations conducted by Chen et al. [32] that enrolled 193 chronic dialysis patients and 704 healthy controls from unrelated ethnic Chinese in Taiwan. The results revealed that chronic peritoneal dialysis patients suffer from higher oxidative stress than healthy subjects; this elevated oxidative stress alters the number of copies of mtDNA in peripheral leukocytes. The possible complicated networks with direct or indirect cross-communication among the 77 SNP candidates were explained. The ratio of controls (n = 704) to cases (n = 193) was 3.65:1. We randomly sorted the samples in the data set to generate 100 data sets each of which was then divided into ten groups for ten-fold cross-validation. The ratios of cases to controls amongst 1000 training sets range from 3.41-3.95, with a mean (SD) ratio of 3.65 (0.10). Each data set was used once to test MDR and FMDR.
For the 100 tests, we summed up the frequencies of the results based on the cross-validation consistency (CVC) and the classification error rate in each test. The accuracy and OR of the best candidate model was evaluated. Table 2 shows the best, worst, and mean (±SD) in the 100 tests for MDR and FMDR. For the 3--6-loci  models producing the best accuracy amongst the 100 tests, both MDR and FMDR had the same candidate model, and also had the same accuracy and OR. In the models for 3-to 6-loci with the lowest accuracy amongst the 100 tests, MDR and FMDR were different slightly, and the accuracy and OR also differed. A box plot was used to compare the two methods for 3-, 4-, 5-, and 6loci interactions. Figure 4a and b respectively shows the accuracy and OR box plot of MDR and FMDR. Paired ttest comparison results indicate that the accuracy and OR values for 3--6-loci analysis over 100 test runs were similar for both MDR and FMDR. Figure 4c shows the box plot of the power results of MDR and FMDR for four order interactions. As the order of interaction increases, both MDR and FMDR shows increasing power values. All powers of MDR and FMDR exceeded 0.8. A summary of the 100 test runs shows that the difference between MDR and FMDR was statistically significant for 3-and 6-loci, but the average difference between the two methods is very slight, i.e., −0.011 at 3-loci and 0.002 at 6-loci. In addition, the powers in the 4-and 5loci analysis over 100 test runs are similar for both MDR and FMDR.
In Exhaustive search approaches, e.g., genetic algorithm (GA) [10] and ant colony optimization (ACO) [33] are important for improving MDR computational times. GA and ACO use small combinations to find the acceptable n-loci gene-gene interaction model in a huge combination space, thus effectively reducing the computational time requirements. However, all parameters can influence the results of detected gene-gene interaction. The parameters of population size, generation size, random seed, and algorithm setting (e.g., mutation probability in GA and pheromone in ACO) are difficult to define to successfully find the n-loci gene-gene interaction model for data sets of different sizes, i.e., sample size and SNP size. Therefore, current research directions focus on the use of software and hardware to improve MDR computational times.
Many researchers employ software [28] and hardware [29,30]    class. These three modular classes were implemented in parallel, finding that parallel MDR can be used to analyze high-order interactions of small data sets and can feasibly perform lower-end genome-wide analyses. Greene et al. and Sinnott-Armstrong et al. [29,30] employed modern computer Graphics Processing Units (GPUs) to speed up MDR since GPUs have a higher memory bandwidth and computational capability than Central Processing Units (CPUs). Still, the factorial increase of time complexity remains an obstacle.
The FMDR procedure is a type of greedy search strategy [34], and is based on joint effect property [35]. The joint effect can be divided into the three effects: (1) overall effect, (2) n-order interaction effect, and (3) main effect. In epistasis, overall effect indicates the common effect amongst n risk factors. The main effect indicates any effect(s) could serve as a guide to determining the correct multi-locus interaction. The n-order interaction effect indicates the least proper subset of the loci also interacts epistatically. The highly-associated SNPs have a high probability of being a significant factor in the nextorder interaction. A low classification error rate in an MDR model indicates a high statistically significant risk of n-loci effects. Suppose all 2-loci combinations in four SNPs are sorted according the classification error rate as {SNP a , SNP b }, {SNP b , SNP c }, {SNP a , SNP c }, {SNP a , SNP d }, …, {SNP b , SNP d }. The {SNP a , SNP b } is the best model in 2-loci gene-gene interaction. The {SNP b , SNP c } and {SNP a , SNP c } combinations are both probably significant models for gene-gene interaction, but neither is the best model. SNP c has the highest probability of joining the 3-loci gene-gene interaction because it's strong association with SNP a and SNP b (i.e., 2-order interaction effect). On the other hand, {SNP b , SNP d } is the worst model; it means that adding SNP d via SNP b into the gene-gene interaction network is the least likely scenario. SNP d has a high probability of being added via the SNP a effect because {SNP a , SNP d } belongs to the top model with a low classification error rate. Therefore, the {SNP b , SNP d } can be deleted, and all combinations based on {SNP b , SNP d } in 3-loci combination are not evaluated. These properties allow us to apply the greedy search strategy to find the significant gene-gene interaction model. Moreover, FMDR only sets one parameter to select the number of best combinations with the low classification error rate, which are then saved into the memory. We suggest the optimal choice for n is the dynamic adjustment according to the order of interaction, i.e., n = 2 with 2-order gene-gene interaction and n = 3 with 3-order gene-gene interaction.
The idea behind FMDR is the retention of good results for high-order interaction, indicating the available combinations are generated from n% good results, i.e., n% results × N combinations, where N is the total number of SNPs. Therefore, FMDR has a time complexity of O(n). The execution time of FMDR is much shorter than that of MDR in high-order gene-gene interactions because FMDR effectively decreases the number of possible unnecessary computations. FMDR includes the following advantages: (1) FMDR can effectively reduce the computational time required by MDR for high-order interactions, (2) the best model has a low classification error rate and a high sensitivity for disease prediction, and (3) FMDR can easily be combined with existing MDR methods.

Conclusions
FMDR based on the joint effect property reduces MDR computational time by retaining results for higher interactions. The retained number of results can be formularized and improved using statistical methods and mathematic theories in future work. The time complexity can be easily computed by estimation of a function. We suggest that the function be designed as a dynamic adjustment based on the data set size and the order of interaction. The flexible framework underlying FMDR can effectively improve the limitations of existing MDR methods in finding high-order interactions.