We examined the power and proportion of false positives of the KWII against a diverse group of multi-locus methods that included MDR, RPM, logistic regression and logic regression, demonstrating that the power of KWII metric is greater than MDR, RPM and logic regression and comparable to logistic regression for a class of realistic models both with and without genetic heterogeneity. To our knowledge, this is the first detailed comparison of power and false positive proportion comparisons between existing interaction analysis approaches and those based on information theory.

The power of KWII exceeded the power of MDR for all models in Simulation sets 1 and 2. The discrepancy in power is attributable to differences in the algorithms. KWII has greater power than MDR because it selects *all* significant combinations separately while MDR selects only the best model, such that if two or more combinations of the same order are associated with a phenotype, as in the case of GH, MDR selects only one of them. In addition to the inability to detect the independent genetic contributions to models of GH MDR can be dependent on higher order combinations for power. This is illustrated by Model 2-GH; the power of MDR to detect both Interaction 1 *and* Interaction 2 is greater than its power to detect the interactions individually. This dependence coupled with the fact that MDR uses an exhaustive search approach also means that MDR would be very computationally inefficient for larger datasets as the number of combinations increases combinatorially with number of variables and combination order [18]. MDR is being continuously improved and used to analyze quantitative phenotypes and family data [19–21], computational efficiency remains the rate-limiting factor irrespective the improvements [22, 23]. Cattaert et al. [19] have developed FAM-MDR method, which addresses correlation between observations in family-based studies and extends the model based MB-MDR approach [24] to handle continuous covariates and continuous phenotype. In contrast with the classical MDR, FAM-MDR considers multiple multi-SNP models for significance evaluation. Further research on extending the KWII based approach to handle family data is ongoing.

For Simulation sets 1 and 2, we found that RPM has reduced power when compared to KWII. When working with quantitative outcomes, RPM uses variances of the trait for each genotype group for merging groups of genotypes with closest mean trait values. While this works well for quantitative traits this approach does not translate as well for case-control data for particularly for less frequent diseases. This effect is compounded when only a small proportion of the variance of the trait is explained by genetics (Simulation 1B, Model 7C). While RPM performed reasonably well in models without GH, when GH was introduced (Table 7, Interaction 3) the ability of RPM to properly partition based on the proportion of cases for a given genotype is hampered because multiple different loci are contributing equally to disease. This is evidenced by a reduction in power, which is only overcome by substantially increasing the sample size.

In Simulation set 2, it is clear that logistic regression and KWII have almost identical power (within ~1% for all sample sizes and alpha values). Logistic regression models were also run for Simulation set 1B (results not shown) and again KWII and logistic models were powered within 2% of one another for all simulations; one approach was not consistently better than another. Despite similar power, the two methods differ in their model fitting approach. In logistic regression, model parameters are fit simultaneously but with the KWII approach, higher-order interactions are inferred after investigating and eliminating lower-order contributions. While we did not run models to detect power for three way interactions, Marchini *et al.* [25] found that loci with specific 3-way interactions are more likely to be detected by looking for two-locus effects. Thus while power is equivalent when considering two-locus models with and without GH, the genetic contribution to complex disease is proving to be oligogenic. Because KWII looks at increasing orders of interaction from low to high the method is well suited to finding these higher order combinations because it finds the lower order ones first.

For Simulation set 3, we employed a hybrid approach in which simulated interactions were planted in the context of the real genotypes from the GAW15 Problem 2 data set. This approach has the advantage of overcoming the limitations imposed by the lack of real data sets with well-validated interactions. Although there is a strong interest in detecting genes distantly located, and using a dense panel of SNPs spanning a 10 Kb region of single chromosome is not optimal, it could be argued that this approach is appropriate for post-genome wide studies (i.e., sequencing, deep genotyping). Furthermore SNPs selected from different chromosomal regions would have less LD amongst themselves, which may make interaction detection feasible with more traditional statistical approaches.

While KWII performed equivalently to the statistical gold standard (logistic regression) for the simple two locus models, with potential for greater power to detect higher order interactions, the method does have some limitations. KWII has sensitivities to missingness, sample size, low MAF and LD. As the number of missing genotypes increases in a dataset the estimation of entropies becomes more inaccurate. However given genotype imputation is regularly used for both candidate gene and genome wide studies this is easily remedied. As with statistical approaches such as MDR, RPM and regression low sample size reduces power when estimating higher order interactions, particularly when SNPs with very low allele frequencies are involved. Additionally KWII permutations have slightly higher false positive rate than logistic and MDR, although not substantially different. Lastly, the power of KWII to detect GGI is reduced when LD between the causal variant and the genotyped (or imputed) variant is low. However current chip and Hapmap coverage is quite good and this is problematic for all statistical methods used to test for allelic association

The results of these simulations provide clues as to how to perhaps more effectively search for interacting loci. One possible approach when working with a large dataset would be to combine our approach with logistic regression by running the AMBIENCE algorithm first with an anti-conservative alpha and then testing the combinations obtained using logistic regression models. This would be an improvement upon the computational speed of logistic regression and yield lower false positive rates than using KWII alone. This two stage approach is similar in spirit to that suggested by both Hoh *et al.* [26] and Marchini *et al.* [25] in which a liberal alpha is set for testing interactions in stage one in order to find SNPs with small marginal effects but large interaction effects.

Higher-order interactions are computationally intensive because of the rapid growth of the number of combinations. The power to detect higher order interactions is also limited because the number of samples within each multi-locus genotype stratum is a limiting factor. Given the challenges associated with higher-order interactions, it may be preferable to base model building on multiple lower-order interactions. Alternatively given pathway information from Gene Ontology (GO) or KEGG for example, AMBIENCE [6] could spend more computation time testing SNPs within the same pathway than across pathways. This will help to identify biological meaningful epistatic interactions that could then be analyzed using logistic models.

**Conclusions:** In this article, we compare an information theoretic approach with existing statistical methods to test for GGI and find that our method has excellent power and to detect interaction in both simulated and real data.