HFCC is a new software program for exhaustive genome-wide analysis of multi-locus association effects in a case-control design. It carries out different types of statistical tests to assess a variety of genetic and epistatic models. HFCC differs from other association or multi-locus methods in that it can simultaneously analyze multiple samples or multiple phenotypes, incorporates several complementary noise-signal filters, and provides post-hoc analysis tools. To address the enormous computing task, it is designed to take advantage of the multiple CPUs available in modern computer servers and clusters.
The goal of HFCC analysis is to find multi-locus marker combinations that are significantly associated with a phenotype, especially those displaying interaction effects that may not be detected in a single-locus analysis. By setting the type of genetic model, the number of subjects in each group, the number of replication groups, and the statistical cut-offs for the different tests, and by applying different noise-elimination filters, the user can narrow the search down to the most promising multi-locus combinations.
Other multi-locus methods to detect gene-gene interactions exist. For example, MDR is a method for exhaustive search of high-order multi-locus interactions, but it is extremely computationally intensive, and genome-wide searches for epistasis with it are prohibitive. More recently, a promising Bayesian method (BEAM) has been suggested as a powerful alternative for detecting epistatic interactions, although it is not exhaustive and still requires further improvements to effectively handle the large SNP datasets commonly used in genome-wide studies. Other methods, such as PLINK or approaches based on logistic regression, can carry out genome-wide epistasis tests relatively quickly, but they are currently limited to two- or three-locus models and only perform general tests of epistasis. HFCC combines a relatively fast computing algorithm for genome-wide epistasis detection with the flexibility to test a variety of different epistatic models in multi-locus combinations. Our analysis of a simulated dataset reveals that HFCC has good power, at least as good as or better than MDR, to detect epistatic interactions, as long as they are relatively strong and common. In the most extreme simulation, with 5% genotyping error, 5% missing data, 50% phenocopies, and 50% genetic heterogeneity, HFCC still had 71% power to detect some types of epistasis, although the power for other types was smaller (34–51%). More extensive simulations will be needed to evaluate the power under different conditions and genetic models.
For this illustrative application of the software we have also analyzed an open-access dataset of Parkinson's disease patients and unaffected controls. We would like to emphasize here the importance of such publicly available real datasets, both for improving the quality of applied research and for fostering the development of new methodology.
One of the peculiarities of HFCC is the possibility of dividing the case-control sample into replication groups, so that only those effects that are consistent across samples are retained. The number of replication case-control groups to be used depends on the available dataset (sample size and number of genetic markers). It is an important analysis parameter because the overall significance level is a function of the selected critical value for the test statistic and the number of replication groups. For the current analysis, the sample of cases was divided into three replication groups, and so were the controls. This strategy focuses on the detection of large and consistent effects, which are hopefully detectable with the available datasets. It is reasonable to expect that the joint effect of a combination of genes is larger and more penetrant than each of the single-locus effects, and therefore under some circumstances (i.e., when the effect is not extremely heterogeneous or rare) these multi-locus effects can be identified. The detection of small, rare, or heterogeneous effects may require larger samples and more complex models.
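The relationship between the per-group critical value and the number of replication groups can be sketched as follows, assuming the groups are independent (this illustrates only the arithmetic, not HFCC's internal implementation):

```python
def overall_alpha(alpha_per_group: float, n_groups: int) -> float:
    """Nominal overall significance level when a result must pass
    in every one of n_groups independent replication groups, each
    tested at level alpha_per_group."""
    return alpha_per_group ** n_groups

# Three replication groups tested at alpha = 0.01 each give an
# overall level of about 10^-6.
overall = overall_alpha(0.01, 3)
```

This is why a fairly liberal per-group cut-off can still yield a stringent overall significance level once consistency across groups is required.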
HFCC allows for a variety of different genetic models and tests. Different models may be necessary to detect different types of effects, such as recessive, dominant, or heterozygote effects. The best analysis strategy may depend on prior knowledge or hypotheses about the trait. An optimal strategy would apply a selected subset of models that maximizes the chances of detecting a hypothesized effect. For example, for this Parkinson's disease study we employed a subset of nine epistatic models that typically detect recessive effects.
Another characteristic of HFCC analysis is the successive application of noise-signal filters. Control groups can be compared against each other to remove background noise associations. The direction of effect can be taken into account, so that only those results consistent in strength and direction across replication samples are selected. A final filter removes multi-locus results that are primarily due to markers failing quality control (i.e., markers in Hardy-Weinberg disequilibrium, with low allele frequency, or with a low call rate) or to large single-locus effects. The remaining multi-locus combinations can be categorized into epistatic, conditional, and simultaneous effects, and interaction tests can be used to detect possible epistatic interactions over and above the marginal effects. The selected markers can then be included in a validation analysis in an independent sample. As an illustration, Table 4 displays 30 two-locus combinations suggestive of epistatic interactions influencing the development of Parkinson's disease. It is important to note that, because of the small sample size used in this experiment, these results may not be reliable and need re-analysis or confirmation in larger datasets. The number of combinations selected for a validation analysis depends on many factors, and tools are included to help perform this selection.
The Parkinson's disease study reported here illustrates several issues regarding the search for epistatic effects in large datasets. One of the most difficult tasks in large-dataset analysis is selecting the most promising candidate results. The huge number of statistical tests performed requires a severe statistical cut-off, or a protocol of data filtering, to select only the most promising results. For example, the two-locus analysis of a genome-wide association SNP dataset presented here reveals several hundred thousand two-locus marker combinations at a liberal significance level (i.e., p-value < 10⁻⁶). Using a more stringent significance level, such as a Bonferroni correction, may be overly conservative and can miss real effects. HFCC filters and post-hoc analyses help select the most promising two-locus interactions from a large set of preliminary findings. For example, all two-marker combinations in Table 4 are consistently associated with Parkinson's disease in three replication groups (overall two-locus χ² (1 df) in the range 21–30), and they all also deviate from a multiplicative model (case-only χ² (1 df) in the range 7–33), suggesting an epistatic interaction.
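For reference, the case-only statistic is of the familiar 1-df chi-square form. The sketch below computes it on a 2 × 2 table of cases cross-classified by carrier status at two loci; the counts are hypothetical, and this generic test is only an illustration of the idea, not the software's own implementation:

```python
def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """1-df chi-square statistic (no continuity correction) for the
    2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

# Cases cross-classified by carrier status at locus 1 (rows) and
# locus 2 (columns); an excess of double carriers over independence
# indicates a departure from a multiplicative model (hypothetical counts).
stat = chi2_2x2(40, 20, 20, 40)
```

Under independence of the two loci among cases (a multiplicative risk model), this statistic follows a 1-df chi-square distribution, so large values flag candidate epistatic pairs.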
The resampling validation analysis raises an important issue: the difficulty of replicating a result across different validation samples, which may reflect a general lack of power to detect these types of effects, especially in the presence of heterogeneity. With the available sample size for this study we have approximately 80% power to detect large common effects (odds ratio > 3 at a genotype prevalence > 25%) at a significance level of 0.01 per replication group. However, this small dataset is underpowered to detect more moderate, and perhaps more realistic, effect sizes. The general lack of consistency suggested by our own sensitivity analysis may be a consequence of the small sample size analyzed. Fung et al. (2006) claimed that no common genetic variant exerts a large genetic risk for late-onset Parkinson's disease in white North Americans. Multi-locus analysis may, however, reveal the existence of large complex (multi-locus) genetic effects.
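A power figure of this kind can be approximated with a standard two-proportion normal approximation. The per-group sample size used below (about 90 cases and 90 controls per replication group) is an assumption for illustration only; the case-group genotype frequency follows from the 25% control prevalence and odds ratio of 3:

```python
from statistics import NormalDist

def power_two_proportions(p0: float, p1: float, n: int, alpha: float) -> float:
    """Approximate power of a two-sided two-proportion z-test with
    n subjects per group (normal approximation, unpooled variance)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = ((p0 * (1 - p0) + p1 * (1 - p1)) / n) ** 0.5
    return NormalDist().cdf(abs(p1 - p0) / se - z_crit)

p0 = 0.25                                          # genotype prevalence in controls
odds_ratio = 3.0
p1 = odds_ratio * p0 / (1 - p0 + odds_ratio * p0)  # implied case prevalence (0.5)
# Per-group size of 90 is a hypothetical value for illustration.
power = power_two_proportions(p0, p1, n=90, alpha=0.01)
```

With these assumed numbers the approximation lands in the low-to-mid 80% range per replication group, broadly consistent with the figure quoted above.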
HFCC provides a tool for the genome-wide study of epistasis. Its use may depend heavily on the researcher's goals and the data available. For this reason it is hard to provide general guidelines on the optimal parameters for analysis, but HFCC's flexibility to accommodate the specific needs of each study is a great asset.
A key parameter is the number of replication groups. When the available sample size is fixed, dividing the sample into more replication groups decreases the power of the analysis but also increases confidence in the remaining results. For example, the two-locus analysis of the full Parkinson's dataset in one case-control group with a Type I error set at 10⁻⁶ yields a total of 784,506 preliminary positive results. The analysis of the same data split into three replication groups, each with alpha = 10⁻², yields only 418,535 results (53% of the single-group results).
There is no single optimal number of replication groups. Researchers need to select this parameter, as well as the significance-level cut-off, to fit their dataset and study design. For example, a study of three related diseases or phenotypes suggests the use of three replication groups. In the case of a single disease, the number of replication groups, the sample size in each group, and the statistical cut-off should be chosen depending on the nature of the study. A strict statistical correction may be necessary if the results are to be conclusive, while a more relaxed criterion may be used in a two-stage study where the goal of the preliminary analysis is to select a subset of markers for subsequent validation.
HFCC is also flexible in the application of data filters. The results displayed in Table 2 suggest that the control filter and the direction filter eliminate largely the same noise results. The application of either or both of these filters can therefore depend on the study design. A study with many control subjects and a limited number of cases can benefit from using the control filter; a study with many affected individuals can instead use several replication groups and the direction filter.
There are also 255 possible two-marker genetic models that can be tested. Many of them are correlated, so it is probably not necessary to test them all. We suggest using a small subset of models that covers the researcher's hypotheses. For example, for the Parkinson's analysis in this paper we tested nine simple genetic models. Identifying subsets of models that optimize the chances of discovering epistatic effects of different natures would require a thorough simulation analysis that is beyond the scope of the current paper. Another analysis option is to use more general tests of epistasis, which are also implemented in the software.
The two-locus analysis of the full Parkinson's dataset as presented here comprises a total of 708 × 10⁹ statistical tests. The two-locus analysis of these data in one case-control group with a Type I error set at 10⁻⁶ yields a total of 784,506 preliminary positive results, about 10% more than expected by chance. This false-positive inflation is probably due to tracking markers (mainly QC-failing markers) as well as to the correlated nature of some of the statistical tests, and it can be controlled by the use of replication groups or by applying more stringent cut-offs if necessary.
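The expected-by-chance figure follows directly from the number of tests and the significance threshold. A quick check of the arithmetic, which also shows the extremely small per-test threshold a full Bonferroni correction would imply:

```python
n_markers = 396_591
n_pairs = n_markers * (n_markers - 1) // 2  # ~78.6 x 10^9 two-locus combinations
n_tests = 9 * n_pairs                       # nine genetic models, ~708 x 10^9 tests
expected = n_tests * 1e-6                   # positives expected by chance at alpha = 10^-6
excess = 784_506 / expected - 1             # observed inflation over the null expectation (~10%)
bonferroni = 0.05 / n_tests                 # ~7 x 10^-14 per-test Bonferroni threshold
```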
For the simulated dataset, analyzed with only one case-control group and a significance level of 10⁻² (chi-square 6.64), the Type I error was actually lower than expected (approximately 0.006 on average). These results suggest that HFCC analysis is not only powerful but can also be conservative, guarding against both Type I and Type II errors. More thorough simulations assessing the impact of sample size, allele frequency, number of replication groups, and noise-filter applications are needed to understand these issues in detail.
Applying a strict multiple-testing correction (Bonferroni) to the Parkinson's disease analysis, we do not find any significant two-locus interactions. The reason may be the small sample size available in the Parkinson's dataset, but we can still select the most promising two-locus results for subsequent validation. The key issue is whether we are concerned with achieving an absolute level of statistical significance, which may not be properly defined in this setting given the complexity of the analysis (multiple testing of correlated hypotheses, independent replication of results, data-filtering steps), or with selecting the most promising markers or marker combinations that pass a more or less stringent statistical criterion. In either case, HFCC is clearly useful for selecting marker combinations for later validation.
Multi-locus analysis is computationally intensive and is therefore limited by computing capabilities. HFCC is a relatively fast algorithm considering the huge number of computations it performs. The dataset analyzed in this study consists of 396,591 genetic markers, which results in 78.6 × 10⁹ two-locus marker combinations. Nine genetic models were tested in the current study, resulting in a total of 708 × 10⁹ statistical tests. Moreover, these tests may be carried out in as many as three case-control replication groups, and also in as many as three control-control noise-filter groups, so the computing task is staggering. HFCC is programmed to take advantage of computer resources by dividing the computing task into processes that can migrate to the available CPUs in a computer server or cluster. For the current analysis we employed a computer cluster of twelve 3.2 GHz CPUs, which carried out the full genome-wide analysis in approximately 5 days.
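The division of the computing task described above can be sketched as a simple partition of the triangular pair-index space into contiguous, nearly equal ranges, one per worker process. This is an illustration of the general approach, not HFCC's actual scheduler:

```python
def pair_chunks(n_markers: int, n_workers: int):
    """Split the n*(n-1)/2 marker-pair indices into n_workers
    contiguous ranges of (almost) equal size, as (start, stop) pairs."""
    total = n_markers * (n_markers - 1) // 2
    base, extra = divmod(total, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # spread the remainder
        ranges.append((start, start + size))
        start += size
    return ranges

# e.g., twelve workers, matching the cluster used for this analysis
chunks = pair_chunks(396_591, 12)
```

Each worker then maps its index range back to (i, j) marker pairs and evaluates the tests independently, which is why the workload parallelizes well across CPUs.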
Another computational complication for multi-locus analysis is RAM. The data matrices need to be loaded into memory to speed up computations, so a limitation exists for the analysis of very large datasets (millions of genetic markers or several thousand subjects), where RAM is rapidly exhausted. This limitation, however, can be addressed with parallel-processing techniques (e.g., MPI), such as dividing the data matrices into smaller subsets that are distributed across the computer network.
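A back-of-the-envelope estimate shows why memory becomes limiting at this scale. The 2-bit genotype packing assumed below is a common encoding, not necessarily the one HFCC uses:

```python
def genotype_matrix_bytes(n_markers: int, n_subjects: int,
                          bits_per_genotype: int = 2) -> int:
    """Approximate in-memory size of a genotype matrix, assuming a
    fixed number of bits per genotype (2 bits is a common packing)."""
    return n_markers * n_subjects * bits_per_genotype // 8

# Illustrative figures: a million markers and 5,000 subjects already
# need ~1.25 GB even at 2 bits per genotype; at one byte per genotype
# the same matrix would need ~5 GB.
size = genotype_matrix_bytes(1_000_000, 5_000)
```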
HFCC can perform more complex multi-locus analyses (three-locus, four-locus, etc.), but the number of computations grows rapidly with the number of interacting markers, and the analysis becomes dependent on computing resources and time limitations. Depending on these resources, genome-wide three- or four-locus analysis may require a two-stage strategy, in which some markers are first selected by single-locus analysis and then used to guide the multi-locus testing. Our exhaustive two-locus genome-wide analysis of a Parkinson's disease dataset reveals that pure epistatic effects, as defined here, are rare (0.01% of the preliminary results). Therefore, a two-stage strategy for multi-locus analysis may be more economical, with minimal information loss. This statement assumes that we had power to detect these epistatic effects, that more complex interactions behave like two-locus ones, and that Parkinson's disease is representative of other diseases. Our results suggest the use of a conditional two-stage strategy, in which a liberal single-locus threshold is first used to select loci with marginal effects, and these markers are then tested against the full panel in a multi-locus analysis. This conclusion is similar to some previous suggestions [5, 19] but not all, confirming that a liberal single-locus cut-off (i.e., p < .05) greatly reduces the computational task while minimizing the probability of discarding potential epistatic loci.
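The growth in combinations with model order is easy to quantify: a k-locus analysis of n markers must consider C(n, k) combinations, so each additional locus multiplies the search space by roughly (n − k)/(k + 1). For the panel analyzed here:

```python
import math

n = 396_591
combos = {k: math.comb(n, k) for k in (2, 3, 4)}
# Moving from two- to three-locus analysis multiplies the number of
# combinations by (n - 2) / 3, i.e. over 130,000-fold for this panel.
growth_2_to_3 = combos[3] / combos[2]
```

This arithmetic is what makes exhaustive genome-wide searches beyond two loci impractical without a marker-selection stage.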
It is also important to note that linkage disequilibrium (LD) is unaccounted for in our analyses. LD reflects an association among markers and can therefore affect the results of some tests; for example, it can produce a significant case-only chi-square test. Nonetheless, HFCC's algorithm and analysis filters seem to prevent this bias. When one marker is associated with several markers in LD, these redundant results are detected in the last stage of marker selection.