Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits
- Pritam Chanda^{1},
- Lara Sucheston^{2, 3},
- Song Liu^{2, 3},
- Aidong Zhang^{1} and
- Murali Ramanathan^{4}Email author
https://doi.org/10.1186/1471-2164-10-509
© Chanda et al; licensee BioMed Central Ltd. 2009
Received: 08 October 2008
Accepted: 04 November 2009
Published: 04 November 2009
Abstract
Background
The purpose of this research was to develop a novel information theoretic method and an efficient algorithm for analyzing the gene-gene (GGI) and gene-environmental interactions (GEI) associated with quantitative traits (QT). The method is built on two information-theoretic metrics, the k-way interaction information (KWII) and phenotype-associated information (PAI). The PAI is a novel information theoretic metric that is obtained from the total information correlation (TCI) information theoretic metric by removing the contributions for inter-variable dependencies (resulting from factors such as linkage disequilibrium and common sources of environmental pollutants).
Results
The KWII and the PAI were critically evaluated and incorporated within an algorithm called CHORUS for analyzing QT. The combinations with the highest values of KWII and PAI identified each known GEI associated with the QT in the simulated data sets. The CHORUS algorithm was tested using the simulated GAW15 data set and two real GGI data sets from QTL mapping studies of high-density lipoprotein levels/atherosclerotic lesion size and ultra-violet light-induced immunosuppression. The KWII and PAI were found to have excellent sensitivity for identifying the key GEI simulated to affect the two quantitative trait variables in the GAW15 data set. In addition, both metrics showed strong concordance with the results of the two different QTL mapping data sets.
Conclusion
The KWII and PAI are promising metrics for analyzing the GEI of QT.
Background
The clinical presentation of many common complex diseases causing morbidity and mortality are associated with deviations from the population distributions of important quantitative traits (QT). For example, in hypertension and non-insulin dependent diabetes, the disease processes increase the QT, blood pressure and blood glucose, respectively. For many diseases, threshold values of QT are the basis for the diagnostic criteria for the diseases. However, obtaining an in-depth understanding of genetic and environmental determinants of QT such as weight, height and lifespan in healthy populations can also be important scientific questions. The regulation of many QT is typically complex and involves interactions among many genes as well as endogenous and exogenous factors [1, 2]. For example, genes in pathways regulating appetite, metabolism, hormones and adipokines may interact with environmental factors such as diet and exercise to determine body weight. Nonetheless, the successful identification of the critical gene-environment interactions (GEI) involved in QT such as body weight can provide the scientific basis for preventative public health measures to reduce the exposure of individuals to the modifiable environmental variable/s associated with increased risk.
Information theoretic methods have considerable promise for enhancing single nucleotide (SNP), gene-gene interaction (GGI) and GEI analysis [3–6]. The Kullback-Leibler divergence (KLD), an information theoretic measure of the 'distance' between two distributions, has been proposed for 2-group comparisons such as those used to evaluate ancestry informative markers [7–9], as a multi-locus linkage disequilibrium (LD) measure to enable identification of TagSNPs [6] and for analytical visualization [4, 5]. Entropy-based statistics to test for allelic association with a phenotype [10–12] and for two-locus interactions have also been proposed [13]. Information theoretic extensions of the KLD allow measurement of complex multivariate dependencies among genetic variations and environmental factors without complex modeling and could enable powerful and intuitive methodology for GGI and GEI analyses to be developed [14, 15].
While there is now considerable evidence demonstrating the usefulness of information theoretic methods for identifying the interactions associated with discrete and binary phenotypes, to our knowledge, information theoretic approaches have not been reported for analyzing the GGI and GEI associated with QT. This report proposes an information-theoretic approach for identifying associations of GEI and GGI with a QT.
Methods
Terminology and Representation
Definition of Interaction
In our information theoretic framework, we use the K-way interaction information (KWII) [16, 17], which is defined and described in detail below, as the measure of interaction information. We operationally define "for each variable combination containing the QT phenotype, a positive KWII value indicates the presence of an interaction, negative values of KWII indicates the presence of redundancy and a KWII value of zero denotes the absence of K-way interactions".
The methods in this paper are applicable to both GEI and GGI analyses and henceforth, we will simply use the term GEI to refer to both. The underlying terminology and representation for this paper was developed in our earlier publications [14, 15] but is concisely recapitulated here.
The operational definition can yield results that are difficult to interpret in the presence of variables that are completely redundant with each other because an even number of completely redundant variables will result in a positive KWII. We address these issues in detail in Discussion.
Entropy
K-way interaction information
The number of genetic and environmental variables K in a combination is called the order of the combination. The KWII quantifies interactions by representing the information that cannot be obtained without observing all K variables and the QT phenotype P at the same time [16, 17, 19, 20]. The KWII of a given combination of variables is a parsimonious interaction metric in that it does not contain contributions arising from the KWII of other lower order combinations (i.e., the subsets of the K-way variable combination).
Total Correlation Information (TCI)
The TCI is the amount of information shared among the variables in the set; equivalently, it can be viewed a general measure of dependency. A TCI value that is zero indicates that knowing the value of one variable tells you nothing about the others, i.e., that the variables are independent. The maximal value of TCI occurs when one variable is completely redundant with the others.
Phenotype-Associated Information
In the above equation, the genetic and environmental variables are denoted by the X_{1}, X_{2},..., X_{ K }, and the quantitative trait is denoted by P.
In the PAI definition, the TCI(X_{1}, X_{2},..., X_{ K }, P) term represents the overall dependency among the genetic and environmental variables and the phenotype whereas the TCI(X_{1}, X_{2},..., X_{ K }) term represents the interdependencies among the genetic and environmental variables in the absence of the phenotype variable.
Our approach utilizes the KWII as the principal measure of the GEI. However, we employ the PAI to facilitate efficient searching of the combinatorial space. Although the KWII is a parsimonious measure of interaction for the variable combination of interest alone because it does not contain contributions from lower-order interactions, KWII computations on the entire combinatorial space are computationally intractable because they require the entropies of all subsets. In addition, the KWII cannot be used for hill climbing algorithms because it takes on both positive and negative values depending on the nature of the interaction but the values cannot be used to direct the search process.
In contrast, the PAI is appropriate for hill climbing algorithms as it is always greater than or equal to zero and increases monotonically with increased combination size. PAI calculations require only individual and joint entropies and are computationally far more tractable than KWII calculations. In addition to these desirable properties, the PAI contains useful information regarding the KWII because it represents the cumulative phenotype-associated synergy present in all subset combinations of the variables X_{1}, X_{2}, ..., X_{ K }and P. In Results, we will demonstrate that both the KWII and PAI metrics are effective and efficient for GEI analysis.
Extension to Quantitative Traits
The expressions for KWII and PAI are general and can be used for QT and categorical phenotypes. However, the specific forms are necessary for the entropy of the QT and the joint entropy of the QT with discrete variables resulting from the genetic variants, environmental variables and their combinations.
Note that the entropy, H(Z), of a normally distributed variable is independent of the mean μ.
These equations for the entropy of the QT, the entropy of discrete variable combinations and the joint entropy of the QT and discrete variable enable computation of the KWII and the PAI for GEI analysis of QT.
The CHORUS Algorithm
The CHORUS Algorithm employs the PAI to iteratively search for combinations of genetic variations and environmental variables related to the QT. The increases in the PAI are used to guide the search process in a hill-climbing algorithm that identifies promising subsets of combinations potentially involved in GEI with the QT. The KWII values necessary for assessing GEI in our method are then computed for the promising subsets of combinations identified.
Let X = {X_{1}, X_{2}, ..., X_{ n }} be the set of all genetic/environmental variables and P be the QT of interest. The algorithm takes as input X, P, and algorithm parameters θ and τ, which represent the number of combinations retained in an iteration of the search and the number of iterations, respectively.
We start by calculating PAI(X_{ i }, P) ∀i ∈ 1 ... n. We retain the θ combinations with the highest values of PAI for use in the next iteration. Let this set of variables be denoted by S_{1}. In the next step, we calculate PAI(X_{ i }, X_{ j }, P) ∀ X_{ i }∈ 1... n, (j ≠ i) and again retain the θ combinations with the highest values of PAI(X_{ i }, X_{ j }, P) in set S_{2}. The above steps are repeated τ times. Thus, we greedily search for combinations containing up to τ variables that have higher values of PAI. This search process identifies regions in the combinatorial space with the strongest KWII-based interactions among the variable combinations examined. Finally, for each combination C identified by the above search steps, we calculate the KWII(v, P) ∀ v ⊆ C to identify the interacting combinations, {v, P}.
The simulations and CHORUS implementations were built in-house and written in the Java 1.5 programming language. The COLT high-performance computing library from CERN (acs.lbl.gov/~hoschek/colt) was used for numerical functions including random variate generators.
Simulated Data
Simulated data sets were used to examine: 1) the relationship between LD, KWII and PAI in the presence of association with a quantitative trait, 2) the correlation between effect size and KWII values and, 3) the power of KWII to detect association across various effect sizes.
Approaches Common for All Simulated Data Sets
All simulated data sets contained six diallelic SNPs with the three possible genotypes. The genotypes at each locus were assumed to be in Hardy-Weinberg equilibrium proportions. The Simulated Data Sets 1, 2 and 3 consisted of 2 interacting SNPs, SNP(1) and SNP(2), that were associated with the QT. An additional 4 null SNPs, SNP(3) through SNP(6), were simulated. The alleles of SNP 1 were labeled A_{1} and A_{2}, and those of SNP 2 were labeled B_{1} and B_{2}. All SNPs had a minor allele frequency (MAF) of 0.5.
Samples composed of 5000 individuals with randomly varying genotypes and environmental exposures consistent with the underlying GEI models were simulated. The computations for all three Simulated Data Sets were performed with CHORUS input parameter values of τ = 3 and θ was set to include all possible combinations at each iteration.
To obtain p-values of the KWII, the null distribution of the KWII of each combination was obtained from 5000 random permutations of the QT values for each simulated data set. The permutations for each combination were conducted independently. The raw p-value of the KWII of a combination was ascertained as the fraction of the permuted KWII values that were equal to or greater than the observed KWII value. The p-values were not corrected for multiple testing.
The 5^{th} and 95^{th} percentiles of the KWII and PAI, representing the variations due to sampling, were obtained from 100 independent repetitions of the simulation procedure. In forming confidence intervals we did not consider multiple testing. The number of potential tests for all one-way, two-way and three-way combinations was 41 each for Simulated Data Sets 1 and 3 and 175 for Simulated Data Set 2.
For the simulations summarized in Figures 2B-2C, 3B-3C, the effect size was set to 0.5. For the simulations in Figure 2D and 3D, the effect size was varied from 0.5 to 2.5 in increments of 0.5 by varying the standard deviation of QT distribution. Linkage disequilibrium (LD) was assumed to be absent in the simulations in Figures 2B-D and 3B-D. In Figure 2E and 3E, we incorporated LD between the SNP 1-SNP 3 and the SNP 2-SNP 4 pairs; the R^{2} values for both pairs was set to be equal and eleven R^{2} values ranging from 0 to 0.95 were studied.
Simulated Data Set 1, Two-SNP Interactions
The mean QT values for each genotype were based on the GGI model in Figure 2A. The standard deviation σ_{2} was set to σ, and μ_{2} = 1.
Simulated Data Set 2, Multi-SNP and SNP-Environment Interactions
The mean QT values for each genotype were based on the GEI model in Figure 3A. There were four environmental variables, E(1) through E(4), in addition to the six SNPs. The interacting environmental variables, E(1) and E(2), were associated with the QT; E(3) and E(4) were null environmental variables. The percentage of subjects in low (L) and high (H) exposure categories of E(1) and E(3) were each 50%; the percentage of subjects in low (L), medium (M) and high exposure (H) categories of E(2) and E(4) were 25%, 50% and 25%, respectively.
Simulated Data Set 3, Two Locus Model with Pure Epistasis Interactions
The Simulated Data Set was selected to contain pure epistasis interactions without main effects present, i.e., it does not contain QT variation that is attributable to any individual locus and requires the combined presence of two loci for explaining the QT variation.
Power Calculations
Analysis of Public Domain Data Sets
The computations for all three public domain data sets were performed with CHORUS input parameters values of θ = 50 and τ = 2.
We did not correct for multiple testing; instead, we present the number of tests for all one and two-way combinations for each dataset and the top 25 unadjusted p-values and corresponding KWII values for one and two-way combinations. Given only one dataset is based on simulated data and the other two are real datasets in which the authors adjusted for multiple testing using different methods we felt adjusting further confounded the comparisons and thus simply present the rank unadjusted values for scrutiny.
GEI Analysis of Genetic Analysis Workshop 15 Data
Effects of major trait loci and covariates in the GAW15 data set.
Locus | Chr | SNP # | Phenotype | Effects |
---|---|---|---|---|
DR | 6 | 152-155 | RA | Affects risk of RA |
A | 16 | 30-31 | RA | Controls effect of DR on RA risk |
B | 8 | 442 | RA | Controls effect of smoking on RA risk |
C | 6 | 152-155 | RA | Increases RA risk only in women |
D | 6 | 161-162 | RA | Rare allele increases RA risk 5-fold |
E | 18 | 268-269 | RA, Anti-CCP | Affects DR on anti-CCP and increases RA risk |
F | 11 | 387-389 | IgM | QTL for IgM |
G | 9 | 185-186 | Severity | 25% QTL for severity |
H | 9 | 192-193 | Severity | 25% QTL for severity |
Age | - | RA | Affects RA risk through smoking and sex ratio | |
Sex | - | RA | Affects RA risk with Locus C | |
Smoking | - | RA, IgM | Affects RA risk with Locus B and IgM. |
For our analysis, which aimed to evaluate the effectiveness of CHORUS, we have used the set of 9187 SNPs along with sex, age and smoking status as covariates. We refer to this data set as the "10K GAW15 Dataset." We conducted separate analyses with Anti-CCP and IgM as QT of interest. The Anti-CCP and IgM variables were logarithm (base 10) transformed prior to analysis. The Age variable was binned into five intervals of equal width. Although haplotype-phase information was provided, we chose to not include it and treated the data as unphased genotype data.
Analysis of High Density Lipoprotein and Atherosclerosis Data
Female B6 mice have low levels of plasma high-density lipoprotein (HDL) and are susceptible to atherosclerosis whereas female 129 mice have high plasma HDL levels and are relatively resistant. This data set contains genotypes, HDL concentrations and size of aortic fatty streak measurements for 294 female F2 intercross progeny (derived from the C57BL/6 (B6) and 129 strains of inbred mice) that were fed a high-fat diet for 14 weeks [25]. The mice were genotyped with 88 simple sequence length polymorphic (SSLP) markers and subsequently 23 additional SSLP markers in the QTL regions were added. The data were obtained from the Center for Genome Dynamics at the Jackson Laboratory http://cgd.jax.org/nav/qtlarchive1.htm.
We analyzed the HDL and atherosclerotic aortic fatty streak lesion size as QT of interest. The atherosclerotic aortic fatty streak lesion size variable was logarithmically transformed (base 10) prior to analysis.
Comparative Genomics Analysis
We also conducted mouse-human comparative genomics analysis to evaluate the biological significance of the markers identified by our method. The list of human HDL and atherosclerosis QTL were obtained from [26] and [25, 27], respectively. The coordinates of the QTL were converted to latest release (hg18, NCBI Build 36.1) of the human genome assembly based on the UCSC genome browser [28]. The mouse marker regions with peak position ± 16 Mbp (the average HDL [26] and Atherosclerosis [29] QTL is 32 Mbp) were retrieved from the latest mouse genome assembly (mm9, NCBI Build 37). The human homologous regions of mouse markers were obtained by parsing the chain alignment of mouse (mm9)-to-human (hg18) genome [30, 31].
Analysis of Ultra-violet (UV) Light-Induced Immunosuppression Data
This data set contains results from a genome-wide scan using MIT microsatellite markers and was obtained from F1 backcross mice derived from the low susceptibility BALB/c and high susceptibility C57BL/6 strains of mice that were tested for systemic UV light-induced immunosuppression of a contact hypersensitivity response [32]. The data were downloaded from the Center for Genome Dynamics at the Jackson Laboratory http://cgd.jax.org/nav/qtlarchive1.htm. The data contain 64 markers and sex was included as a factor. The percent immunosuppression of the contact hypersensitivity responses in backcross animals was used as the QT of interest. The percent immunosuppression (expressed as a fraction) was arcsine square root transformed. The between-experiment variation was corrected by expressing the results as Z-score based on the mean and standard deviation of each experiment.
Statistical Significance Assessments
For the GAW15 Data set, we used all 100 replicates to obtain KWII and PAI values and the corresponding 95% confidence intervals for each combination of variables. The number of potential tests for all one-way and two-way combinations was 42,232,645 (consisting of 9,190 one-way and 42,223,455 two-way) each for the IgM and Anti-CCP QT in the GAW15 data set.
For the high density lipoprotein and atherosclerosis and ultra-violet light induced immunosuppression datasets and the first replicate of the GAW15 Data set, the null distribution of the KWII of each combination was obtained from 5000 random permutations of the QT values. The permutations for each combination were conducted independently. The raw p-value of the KWII of a combination was ascertained as the fraction of the permuted KWII values that were equal to or greater than the observed KWII value. The same permutation approach was employed for the HDL and atherosclerosis data sets and for the first data replicate in the GAW15 data set to determine the time required for permutations; the top 25 one-way and top 25 two-way combinations were analyzed.
The number of potential tests for all one-way and two-way combinations was 6,216 (111 one-way plus 6,105 two-way) each for the HDL and atherosclerosis data sets and 2,145 (65 one-way plus 2,080 two-way) for UV light-induced immunosuppression data set.
Similarity and Differences with the Restricted Partitioning Method
The Restricted Partitioning method [33] (RPM) code was provided by Dr. Culverhouse. It was used to analyze the HDL data set [25] previously described. The α-value was set to 0.01.
Results
Performance of K-Way Interaction Information (KWII) and Phenotype-Associated Information (PAI) on Simulated Data
In these experiments, our goal was to compare the effectiveness of K-way interaction information (KWII) and phenotype-associated information (PAI) on simulated data with known patterns of interactions. These Simulated Data Sets were intentionally kept simple so that the heuristics for interpreting the KWII and PAI could be identified.
Simulated Data Set 1, Simple Epistasis
Figures 2B and 2C show the KWII and PAI, respectively, for each combination of interactions between the phenotype and all possible SNP combinations. We refer to these graphs as KWII and PAI spectra.
The combinations are shown on the y-axis; e.g., 1, 2, P indicates that variables SNP 1, SNP 2 and the QT variable, P, are used in calculating k-way (k = 3) interaction. The KWII spectrum contains two dominant peaks corresponding to the {1, P} and {2, P} combinations indicating that SNPs 1 and 2 contribute significantly to the quantitative trait variable P. The permutation-based p-values of the KWII for both combinations was p < 0.0002. Qualitatively, the stepwise structure of the PAI spectra contrasts with the peak-like structure of the KWII spectra. The presence of a higher peak for the {1, 2, P} combination compared to the {1, P} and {2, P} combinations indicates a dependence of SNP 1 on SNP 2. Thus, the PAI detects the dependence between SNPs 1 and 2 and provides information that complemented the KWII. The variables not involved in interactions (SNPs 4, 5 and 6) can be more easily identified from the PAI spectra because the PAI values for combinations containing these independent variables are the lowest compared to combinations containing interacting variables. None of p-values of the KWII for variables not involved in interactions were significant (all p-values > 0.05).
Figure 2D shows the dependence of KWII on the effect size. In Figure 2D, the KWII of three representative combinations: the 1-variable containing {1, P} and {2, P} combinations and the 2-variable containing {1, 2, P} combination are shown. The KWII values of all three combinations increased with increasing effect size. We also computed the 95^{th} and 5^{th} percentiles of the KWII each peak to assess variability due to sampling; in some of the cases, the standard deviation was quite low and obscured by the symbols used for graphing. These results demonstrate that the KWII spectra convey information on the types of interaction present and also their effect size.
We incorporated pairwise LD between the SNP 1 and SNP 3 pair and between the SNP 2 and SNP 4 pair. The R^{2} values for the LD for both SNP pairs were set equal and the value was increased from 0 to 0.95. The effects of LD on the PAI for two combinations, {1, 3, P} and {2, 4, P} are summarized in Figure 2E. The results demonstrate that the PAI is independent of LD and the interaction information related to the QT is retained.
Figure 2F shows the power of KWII at a sample size of 5000 per group for various effect sizes for {1, P}, {2, P} and {1, 2, P} combinations. The power of the KWII was greater than 0.99 for all three combinations at an effect size of 0.5 or greater.
Simulated Data Set 2
This Simulated Data Set is relatively complex and contains a combination of interactions among and between the environmental and genetic variables. In addition, 1-variable, 2-variable and 3-variable containing GGI and GEI interactions with the QT variable, C, are incorporated in the model.
The one-variable containing peaks in the KWII spectrum correctly identified the critical roles of E1, E2, SNP(1) and SNP(2) variables (all p-values < 0.0002) in the underlying model and the two-variable containing {E1, E2, P} and {1, 2, P} interactions (p < 0.0002 and p = 0.015, respectively) were also identified. Furthermore, the three variable containing {E2, 1, 2, P} combination (p < 0.0002) was the three-variable combination with highest the KWII values.
The PAI spectrum complements the KWII spectrum. For 1-variable containing interactions the values of KWII and PAI are identical and equivalent to mutual information. However, the most notable characteristic of the PAI spectrum is its discrete step-like visual appearance. Each interacting variable in the data set contributes approximately one single unit step. The height of the PAI peak increases whenever an informative variable is added to the variable list and is unchanged when a non-interacting variable is added. The lowest values of PAI correspond to the subsets containing only non-interacting variables such as {3, 4, P}, {3, 5, P}. By identifying the largest subset with low PAI values, the spectrum can be used to eliminate non-interacting variables.
Figure 3D shows the effect size dependence of KWII for three representative informative 1, 2, or 3-variable containing combinations, {E2, P}, {E1, E2, P} and {E2, 1, 2, P}, from Figure 3B. As in Simulated Data Set 1, the KWII for each combination increased by almost an order of magnitude for the 5-fold increase in effect size. The 95^{th} and 5^{th} percentiles of KWII were also computed and variability for Simulated Data Set 2 was qualitatively greater than for Simulated Data Set 1 possibly because of the complexity and number of the underlying GEI. The coefficients of variation of the KWII values decreased with increasing effect size.
The effects of LD (for R^{2} ranging from 0 to 0.95) on the PAI for two combinations, {1, 3, P} and {2, 4, P} for effect size of 0.5 are summarized in Figure 3E. Again, the results demonstrate that the PAI is robust and independent of LD.
Figure 3F shows the power of KWII at a sample size of 5000 for various effect sizes for three representative informative 1, 2, or 3-variable containing combinations: {E2, P}, {E1, E2, P} and {E2, 1, 2, P}. The KWII had a power of 0.89 at an effect size of 0.25 for the {E2, P} peak. At an effect size of 0.5, the power for the {E1, E2, P} and {E2, 1, 2, P} peaks were each 0.98 or greater.
Simulated Data Set 3
This Simulated Data Set investigated a two-locus model of pure epistasis. Pure epistasis models lack QT variation attributable to single-loci and require interactions between different loci to explain the variation [23].
As would be predicted from the simulation framework, the KWII spectra for the two-locus (Figure 4A) pure epistasis interaction model did not contain any peaks corresponding to single-SNP combinations or to uninformative SNP combinations (all p-values > 0.05). The KWII spectrum for the two-locus model contained a prominent peak (p < 0.0002) corresponding to the informative two-SNP combination {1, 2, P}. This suggests that the interactions identified with the KWII are qualitatively concordant with epistatic interactions. Figure 4B shows that the power of detecting the epistatic interaction based on the KWII of {1, 2, P} combination is satisfactory.
Although the KWII correctly detects epistatic interactions, CHORUS is not capable of identifying these interactions because it employs a marginal effect search strategy. The models used in this Simulated Data Set exhibit pure epistasis only when the allele frequencies of the interacting SNPs are 0.5. We hypothesized that differences in allele frequencies between the interacting loci would cause deviations from pure epistasis and result in the appearance of KWII peaks for lower-order combinations. Figure 4C shows the dependence of the KWII on the allele frequency at SNP(2) in the two-locus model of epistasis. When the allele frequencies of both loci are equal, the KWII values of the one-variable combinations are nearly zero as expected. However, the KWII value of the {1, P} combination increases with deviations from equality of allele frequency. The KWII value for the {1, 2, P} combination is maximal when the allele frequencies are equal and decreases with deviations from equality.
Taken together, these Simulated Data Sets demonstrate that the KWII and PAI spectra are capable of visually summarizing a diverse range of gene-environment interaction phenomena.
Performance of PAI-Based CHORUS Algorithm on Public Domain Sets
10K GAW15 Dataset
The underlying GGI and GEI in the simulations for this data set models the interaction of nine loci: C, DR and D on chromosome 6, A on chromosome 16, B on chromosome 8, E on chromosome 18, F on chromosome 11, G and H on chromosome 9 (see Table 1 for a summary) [24]. The associations between the individual loci and RA affection status and the Anti-CCP and IgM that were built into the data set by Miller et al. [24] are summarized in Table 1. The Anti-CCP and IgM were QT and the focus for this report; the associations corresponding both QT are highlighted in bold in Table 1. Note that the Anti-CCP and IgM measures are defined for the cases only. Although phase information was provided, we treated the data as genotype data for the CHORUS analysis.
The GAW15 data set contained 100 replicates from repetitions of the simulation procedure [24] and the availability of these replicates enabled us to compute the 95% confidence interval for the KWII of each combination.
Figure 5B presents the KWII values for various combinations with IgM as the QT of interest. The peaks with the highest KWII values enabled the identification of the following loci and covariates associated with IgM: loci C and DR (chromosome 6), locus E (chromosome 18), Smoking: the highest KWII peaks corresponded to {C11_389, IgM}, {C11_389, Smoking, IgM} and {Smoking, IgM} in Figure 5B (all p-values < 0.0002). It is important to note that the KWII for each of these combinations does not contain redundant information. Thus, the significant peaks for the {C11_389, Smoking, IgM} and {Smoking, IgM} indicate that Smoking alone is IgM-associated but also contributes to IgM synergistically in association C 11_389. Furthermore, the changes in the peak height should not be interpreted to imply any protective role for Smoking.
The KWII spectrum derived from CHORUS identified all key covariates associated with the IgM and Anti-CCP without the use of haplotype-phase information or the parent-child transmission information contained in the pedigree structures.
HDL and Atherosclerosis Dataset
However, our two SNP results did not agree with those of Ishimori et al. [25]. The 2-SNP containing combination {D1Mit159, D1Mit406} was found to have negative KWII indicating redundancy most likely because of the strong single SNP effect of each SNP (p < 0.0002). Also the combination {D1Mit406, D2Mit285} was not found among the 2-SNP combinations with the highest KWII values; both D1Mit406 (p < 0.0002) and D2Mit285 (p = 0.031) were individually associated with HDL (Figure 6A). Interestingly we did detect interactions between D1MIT406 locus on Chromosome 2 and markers identified by Ishimori et al. [25] for atherosclerotic aortic fatty streak lesion size, e.g., {D1MIT406, D10MIT213} and {D1MIT406, D10MIT31}. The mouse markers associated with both HDL and Atherosclerosis QTs in our analysis are annotated (with an A in Figure 6A and an H in Figure 6B).
For atherosclerotic aortic fatty streak lesion size, Ishimori et al. identified five loci that accounted for 35% of the variance in multiple regression analysis [25]. Of these, D10Mit31 was a significant QTL on chromosome 10 wherein a dominant B6 allele conferred atherosclerosis resistance. D12Mit243 was suggestive as a single QTL and was significant in gene interaction with D11Mit333. Another interaction was found between the loci D10Mit213 and D12Mit7. In the CHORUS analysis (Figure 6B), we were able to identify three of the five susceptibility loci but as single locus interactions. D10Mit32 was found to have the highest KWII value and was in strong LD with D10Mit31. D12Mit243 and D10Mit213 were also found to have significant single locus effect on atherosclerotic aortic fatty streak lesion size.
Figures 6A-C summarize the results from mouse-human comparative genomics analysis to critically evaluate the biological significance of the markers identified by our method. The plus and minus symbols in Figure 6A and 6B indicate whether or not the orthologs corresponding to a ± 16 Mbp region of the mouse markers identified by the CHORUS method overlapped with human QTLs for HDL and atherosclerotic disease. For HDL (Figure 6A), all 25 (100%) of the single marker combinations with the highest KWII values were concordant with the human QTL for HDL. Among the two-marker combinations, 20 of 25 were concordant on both markers and 5 of 25 were concordant for one of the two markers. For atherosclerotic disease (Figure 6B), 23 of 25 (92%) of the single marker combinations with the highest KWII values were concordant with the human QTL for atherosclerotic disease and among the two-marker combinations, 23 of 25 were concordant on at least one of the two markers. The high degree of concordance with human QTL demonstrates usefulness of our approach in humans. In Figure 6C, we have represented the human QTL against the backdrop of a human haploid karyotype ideogram.
The underlying genetic etiology of HDL and atherosclerotic diseases appear to be shared. Human atherosclerotic diseases include myocardial infarction, coronary artery disease, acute coronary syndrome (acute myocardial infarction and unstable angina), carotid artery intimal-medial thickness disease, coronary artery calcium, peripheral arterial occlusive disease and stroke [27, 29]. To date, atherosclerotic disease linkage studies have been conducted in 19 human populations and 40 atherosclerotic disease-regulating QTL have been identified. Two meta-analysis have confirmed regions on 2q and 3q [34] and (1p, 5p, 12q, 13q) while four new linkage regions (6p, 2 on 8q, 14p) not found in original studies have been revealed [35]. Previously, 31 of the 40 human QTL were found to be concordant with mouse QTL [26], suggesting that identifying the genes underlying the mouse QTL could potentially facilitate identification of the genes underlying the human QTLs. To date, only 3 putative disease genes from these QTLs have been found for atherosclerotic disease [36, 37]. The comparative genomics approach was used to determine if orthologs of mouse markers identified by CHORUS overlapped with the 40 human QTLs. As summarized in the human haploid karyotype ideogram in Figure 6C, we successfully identified orthologous regions of the mouse markers identified by CHORUS overlapping with the substantive majority of the human 40 QTL demonstrating: i) high concordance between human QTL and the mouse markers identified by CHORUS, and ii) that CHORUS in conjunction with comparative genomics can enable identification of more limited, narrower regions of overlap. These narrow regions may merit fine mapping studies. Furthermore several regions had QTLs for both HDL and Atherosclerosis, a biologically plausible result not previously found.
The 7q22 region (Figure 6C) exhibits a small region of overlap between HDL and Atherosclerosis phenotypes in humans. In 2002, a combined analysis of genome scans for obesity was undertaken using the interim results from the National Heart, Lung, and Blood Institute Family Blood Pressure Program. The sample represents the largest single collection of genome-wide scan data that has been analyzed for obesity and provide a test of the reproducibility of linkage analysis for a complex phenotype. Body mass index (BMI) was used as the measure of adiposity and this analysis confirmed linkage to the 7q22.2 region [38]. The result from a hypothesis free analysis clearly indicate that this is a viable region for fine mapping either heart disease endophenotypes, such as body mass index or Atherosclerotic phenotypes previously discussed. Further interrogation of this region has shown that it contains the gene visfatin (Gene Symbol PBEF1; also called nicotinamide phosphoribosyltransferase or Nampt and pre-B-cell colony enhancing factor or PBEF1) an adipokine found in abundance in visceral fat. This gene has been shown to lower plasma glucose in both humans and mice [39]. More specifically, there is evidence that visfatin levels are correlated with HDL-C and apolipoprotein A1 in Asian Indians [40] and recently evidence of allelic association with SNPs in visfatin and plasma levels has been shown in a Chinese population [41]. The results demonstrate that our method is capable of successfully identifying the same regions at the genome wide level as have been found and confirmed. Although follow up candidate gene studies have mapped visfatin, the power of our approach lies in the ability to map and narrow regions of GEI when doing the initial genome wide scan if given clinical and demographic variables.
UV-Induced Immunosuppression Data Set
Clemens et al. identified four quantitative trait loci (QTLs) with significant main effects for UV-induced immunosuppression on Chromosomes 1, 6, 10 and 17 [32]. The locus on Chromosome 1 had interactions with two other loci on chromosomes 14 and 19 [32].
Marker Pair | F-interaction | P-value | KWII | KWII p-value |
---|---|---|---|---|
D6Mit389-D10Mit170 | 2.63 | 0.11 | -0.0017 | 0.76 |
D10Mit170-D17Mit49 | 0 | 0.97 | -0.0079 | > 0.99 |
D10Mit170-D17Mit123 | 0.01 | 0.91 | 0.0047 | 0.38 |
D10Mit170-D17Mit187 | 0.04 | 0.85 | 0.0005 | 0.59 |
D10Mit170-D17Mit100 | 0.01 | 0.91 | -0.0078 | > 0.99 |
D10Mit170-D14Mit185 | 5.38 | 0.02 | 0.0087 | 0.22 |
D10Mit170-D14Mit165 | 3.78 | 0.055 | 0.0031 | 0.43 |
D1Mit411-D14Mit260 | 12.9 | 0.00053 | 0.10 | < 0.0002 |
D1Mit411-D14Mit185 | 12.3 | 0.00069 | 0.048 | 0.004 |
D1Mit411-D19Mit19 | 12.0 | 0.00079 | 0.042 | 0.006 |
D1Mit411-D17Mit49 | 0.98 | 0.32 | 0.0020 | 0.48 |
Similarity and Differences with the Restricted Partitioning Method
We assessed the similarities and the differences between the results from CHORUS to the Restricted Partitioning Method (RPM), a competing method for GEI analysis of QT, using the HDL data set [25].
RPM identified 21 one-variable containing combinations as significantly associated with HDL. Twenty of these were among the top 20 one-variable combinations with the highest KWII values. The concordance on the one-variable combinations is reassuring given that RPM and CHORUS use different approaches to detect between-group differences. RPM identified 1905 two-variable containing combinations with p-values of 0.05 or less; of these 599 combinations had p-values < 0.0002. All twenty of the top 20 two-variable combinations with the highest KWII values were identified as significant by RPM.
Computational Complexity
The CHORUS search algorithm is computationally much more efficient than the exhaustive search to compute KWII for all possible SNP combinations, which requires exponential time.
Let m denote the sample size. In the CHORUS search algorithm, lines 3-5 take O(n × m^{2}), line 6 require O(θ × n) time whereas lines 7-16 takes O(τ × θ × n × m^{2}) time and lines 19-25 takes O(2^{τ+1} × θ × m^{2}) time. In GEI analysis of QT, the range of τ values of interest is small because of sample size constraints, which limits the computational complexity from becoming exponentially large. When ω permutations are conducted for every combination identified by CHORUS, the computations require O(2^{τ+1}× θ × m^{2} × ω) time.
The GAW15 data set was analyzed on a 3.2 GHz Intel Xeon computer with Irwindale processors with 4 Gb of main memory. The maximum running time for single replicate of the GAW15 data set was 3.6 hours with τ = 2 and θ = 50 excluding permutations. The permutation analysis of the first replicate of the GAW15 data set, which was conducted for the 25 one-way and 25 two-way combinations with the highest KWII values with 5000 permutations per combination, required 5 hours and 25 minutes. The HDL and atherosclerosis data sets were analyzed using a Dell Laptop with 2 GHz dual core Intel processor with 2 Gb of memory. The time required was approximately 8 seconds for both data sets excluding permutations. The permutation analysis (5000 permutations per combination) for the top 25 one-way and top 25 two-way combinations required 14 minutes for the HDL data set and 16 minutes for the atherosclerosis data set.
Discussion
In this report, we have presented results on an information theoretic approach for GEI analysis of QT that uses two complementary information-theoretic metrics, the KWII and the PAI. The dependence of these metrics on biological and study design variables was systematically investigated with controlled numerical experiments. We analyzed the GAW15 data set, which was generated by Miller et al. [24] from a complex simulation based on rheumatoid arthritis data and two GGI data sets generated from QTL mapping studies of HDL levels/atherosclerotic lesion size [25] and UV-induced immunosuppression [32].
The current method assumes that the QT of interest is normally distributed within each strata of the gene-environmental variable combination. The normal distribution is a common assumption in parametric statistics and derives its importance from the central limit theorem. From the information theoretic standpoint, the normal distribution N(μ, σ) has maximum entropy among all real-valued distributions with specified mean μ and standard deviation σ. Therefore, if only the mean and standard deviation of a distribution are known, it is often reasonable to assume that the distribution is normal. As we demonstrated, a variety of data transformations such as log-transformation, arcsine transformation and others can sometimes be used to obtain normal distributions in some cases when the underlying variable is non-normally distributed. Although the normality requirements for each genotype-environment stratum could be considered a strong assumption, it is possible to deal with mixed distributions or empirically estimate the distribution of the QT in each stratum, e.g., with Parzen windows [44], and use the information theoretic framework and CHORUS in consistent and analogous manner.
In the case of a normal distribution, the entropy expression contains only the variance. As a result, the approach conveys the impression of being driven by the variance. We have not addressed standard deviation estimation issues in detail here because our primary focus was to determine whether the underlying method was capable of identifying GEI. Greenwood and Sandomire demonstrate that at a sample size of 25, a standard deviation estimate is within ± 10% error half the time [45].
Derivations of information theoretic metrics in terms of statistical parameters such as variance result in analytical expressions that are difficult to interpret intuitively. Set theoretic approaches provide more interpretability because they can account for addition and subtraction of entropies. Unlike the variance or second moment, which measures dispersion around the mean, entropy depends on parameters other than just the second moment, e.g., the shape and scale parameters of the distribution of interest. One advantage of the information-theoretic method is that it is capable of handling mixtures wherein the strata have different distributions.
The KWII definition of an interaction has a strong theoretical foundation from information theory and the statistical significance of the KWII can be assessed using permutation-based methods. Because the distribution of the KWII and PAI of higher order combinations has not been characterized, we used independent replicates for the three Simulated Data Sets and GAW15 data set to directly obtain confidence intervals as well as empirical information on the distribution of the KWII and PAI values. In the case of the GAW15 data set our approach enabled use of the entire data set. This approach is not feasible for real data and permutations are necessary to assess statistical significance via p-values. Permutations however, provide information on the null distribution.
As indicated in Methods, the KWII-based definition of interaction yields results that difficult to interpret for completely redundant variables because in the presence of an even number of completely redundant variables, the KWII is positive. This quandary can be addressed by retaining only one representative variable from every group of completely redundant variables in a pre-processing step prior to analysis. However, the PAI does not change when a completely redundant variable is added to combinations containing odd or even number of completely redundant variables. Because the CHORUS search of combinatorial space is directed towards combinations that increase PAI, our approach is less susceptible to identifying combinations comprised of variables that are completely redundant with each other.
The CHORUS algorithm however, is a heuristic method. CHORUS uses a search strategy rather than a dimensionality reduction approach and is capable of conducting efficient search of the large combinatorial space because of the unique nature of the PAI metric, which allows for greedy search identification of the most promising combinations by utilizing the marginal effects. As a consequence however, CHORUS is not capable of detecting pure epistasis. It is possible to develop a "two-locus" variation of CHORUS that utilizes KWII from all one-variable and two-variable combinations.
For Simulated Data Set 3, we adopted the overall structure, key assumptions and numerical values from previous work on pure epistasis in case-control data by Culverhouse [23]. Our simulations assumed Hardy-Weinberg equilibrium and MAF of 0.5 at the interacting SNPs, SNP(1) and SNP(2). The frequency of each genotype in our sample was representative of the corresponding population frequencies. The QT value for each subject was a random variate drawn from one of two normal distributions N(μ_{1}, σ_{1}) or N(μ_{0}, σ_{0}). The probability of drawing the QT random variate from N(μ_{1}, σ_{1}) was specified for each combination of SNP(1) and SNP(2) genotypes Equation 1 obtained from [23]. The probability of drawing from N(μ_{0}, σ_{0}) was defined by the complement of the probabilities in Equation 1. These assumptions result in a form of QT epistasis because the two distributions N(μ_{1}, σ_{1}) and N(μ_{0}, σ_{0}) can be considered cases and controls and the model can be viewed as a binary trait to which normally distributed noise has been added.
CHORUS can be considered complementary to dimensionality reduction methods such as combinatorial partitioning method (CPM), multi-factor dimensionality reduction (MDR) and restricted partitioning method (RPM), which are computationally more burdensome but are sensitive to pure epistasis interactions. The CPM approach is capable of identifying multilocus genotypes capable of predicting QT levels [46]. The multi-factor dimensionality reduction (MDR) method is applicable to binary phenotypes and uses constructive induction to reduce the dimensionality of the multi-locus genotype systematically by pooling into high and low risk groups [3, 47–50]. The CPM is computationally very intensive and Culverhouse et al. advocated the RPM [33], which is applicable to both binary phenotypes and QT. Although RPM and MDR are computationally more efficient than the CPM, significant computational effort is required for datasets from genomewide association studies, which can contain tens of thousands to millions of predictor variables. The generalized MDR (GMDR) method handles both discrete phenotypes and continuous traits in population-based study designs and employs the generalized linear model (GLM) framework for scoring in conjunction with MDR for dimensionality reduction [51].
Unlike exhaustive search algorithms, which can identify the global minimum, all heuristic approaches are potentially vulnerable to entrapment in local minima. CHORUS can be modified with established methods such as simulated annealing to reduce this risk. Within the current CHORUS framework, the input parameter θ, which determines the number of combinations retained at each stage of the algorithm, can also be a determinant of power: if too few combinations are retained at the initial stages of the search, the risk of missing key higher-order interactions with intermediate levels of marginal effects is increased. However, increases in θ increase the computational cost. In principle, the computational effort depends exponentially on the input parameter τ, which determines the order of combinations. In practice however, the value of τ is constrained by sample size because the genotype contingency tables for combinations rapidly become sparse and contain numerous empty cells when the order of the combinations increases. Although biological pathways are complex, sequentially ordered actions of protein-protein interactions and enzymatic chemical reactions are frequently involved [52]. Such sequential interactions typically involve only a small subset of molecules in the pathway. The order of resulting statistical interactions may be limited as a consequence [15].
There are some fundamental differences and unique advantages to CHORUS compared to the widely used GMDR approach. The metric used by GMDR is based on the GLM, a commonly used and versatile statistical analysis method, and is combined with dimensionality reduction method of MDR. For QTL analysis, GMDR analyzes the QT and covariates first to obtain the GLM score statistic and in a second stage, the interactions of GLM score statistics with the genetic and environmental variables are determined. In contrast, the CHORUS approach analyzes the underlying interactions between the QT of interest and all variables including covariates simultaneously. Another advantage with CHORUS is that it is capable of handling cases-only study designs that are useful for studying the genetic and environmental determinants of important QT such as body weight, height and lifespan. The statistical GLM framework enables GMDR to handle covariates whose distribution follows any of the exponential family distributions (normal, Poisson or Bernoulli distributions) but a limitation of CHORUS that we are working to overcome is that it cannot handle continuous covariates. Continuous covariates can be used after discretization, however. Another notable advantage of CHORUS is that it can be applied to very large data sets. We were able to analyze the 100 replicates in the 10K GAW15 data set without difficulty.
We have described the conceptual framework of CHORUS to highlight its strengths and its differences with other methods. We are exploring a range of enhancements including parallel computing that could enhance the efficiency and effectiveness of CHORUS further.
Conclusion
Our results indicate that the information theoretic, KWII-based CHORUS approach has considerable promise as a method for GEI analysis of QT.
Declarations
Acknowledgements
This work was partly supported by funding from the Center for Protein Therapeutics at the University of Buffalo. Support from the National Multiple Sclerosis Society (RG3743) for the Ramanathan laboratory is also gratefully acknowledged. We thank the two Reviewers of the manuscript for their insightful comments. We are extremely grateful to the investigators who provided simulated data for the Genetic Analysis Workshops. The GAW15 simulated data set was generated by Michael B. Miller in collaboration with others. Support for generation of the simulated data was provided from NIH grants 5R01-HL049609-14, 1R01-AG021917-01A1, the University of Minnesota and the Minnesota Supercomputing Institute.
Authors’ Affiliations
References
- Shields PG, Harris CC: Cancer risk and low-penetrance susceptibility genes in gene-environment interactions. J Clin Oncol. 2000, 18 (11): 2309-2315.PubMedGoogle Scholar
- Wu X, Gu J, Grossman HB, Amos CI, Etzel C, Huang M, Zhang Q, Millikan RE, Lerner S, Dinney CP, et al: Bladder cancer predisposition: a multigenic approach to DNA-repair and cell-cycle-control genes. Am J Hum Genet. 2006, 78 (3): 464-479. 10.1086/500848.PubMed CentralView ArticlePubMedGoogle Scholar
- Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC: A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006, 241 (2): 252-261. 10.1016/j.jtbi.2005.11.036.View ArticlePubMedGoogle Scholar
- Bhasi K, Zhang L, Brazeau D, Zhang A, Ramanathan M: Information-theoretic identification of predictive SNPs and supervised visualization of genome-wide association studies. Nucleic Acids Res. 2006, 34 (14): e101-10.1093/nar/gkl520.PubMed CentralView ArticlePubMedGoogle Scholar
- Bhasi K, Zhang L, Brazeau D, Zhang A, Ramanathan M: VizStruct for visualization of genome-wide SNP analyses. Bioinformatics. 2006, 22 (13): 1569-1576. 10.1093/bioinformatics/btl144.View ArticlePubMedGoogle Scholar
- Liu Z, Lin S: Multilocus LD measure and tagging SNP selection with generalized mutual information. Genet Epidemiol. 2005, 29 (4): 353-364. 10.1002/gepi.20092.PubMed CentralView ArticlePubMedGoogle Scholar
- Rosenberg NA, Li LM, Ward R, Pritchard JK: Informativeness of genetic markers for inference of ancestry. Am J Hum Genet. 2003, 73 (6): 1402-1422. 10.1086/380416.PubMed CentralView ArticlePubMedGoogle Scholar
- Smith MW, Lautenberger JA, Shin HD, Chretien JP, Shrestha S, Gilbert DA, O'Brien SJ: Markers for mapping by admixture linkage disequilibrium in African American and Hispanic populations. Am J Hum Genet. 2001, 69 (5): 1080-1094. 10.1086/323922.PubMed CentralView ArticlePubMedGoogle Scholar
- Anderson EC, Thompson EA: A model-based method for identifying species hybrids using multilocus genetic data. Genetics. 2002, 160 (3): 1217-1229.PubMed CentralPubMedGoogle Scholar
- Zhao J, Boerwinkle E, Xiong M: An entropy-based statistic for genomewide association studies. Am J Hum Genet. 2005, 77 (1): 27-40. 10.1086/431243.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhao J, Boerwinkle E, Xiong M: An entropy-based genome-wide transmission/disequilibrium test. Hum Genet. 2007, 121 (3-4): 357-367. 10.1007/s00439-007-0322-6.View ArticlePubMedGoogle Scholar
- Li Y, Xiang Y, Deng H, Sun Z: An entropy-based index for fine-scale mapping of disease genes. J Genet Genomics. 2007, 34 (7): 661-668. 10.1016/S1673-8527(07)60075-7.View ArticlePubMedGoogle Scholar
- Dong C, Chu X, Wang Y, Jin L, Shi T, Huang W, Li Y: Exploration of gene-gene interaction effects using entropy-based methods. Eur J Hum Genet. 2008, 16 (2): 229-235. 10.1038/sj.ejhg.5201921.View ArticlePubMedGoogle Scholar
- Chanda P, Zhang A, Brazeau D, Sucheston L, Freudenheim JL, Ambrosone C, Ramanathan M: Information-theoretic metrics for visualizing gene-environment interactions. Am J Hum Genet. 2007, 81 (5): 939-963. 10.1086/521878.PubMed CentralView ArticlePubMedGoogle Scholar
- Chanda P, Sucheston L, Zhang A, Brazeau D, Freudenheim JL, Ambrosone C, Ramanathan M: AMBIENCE: a novel approach and efficient algorithm for identifying informative genetic and environmental associations with complex phenotypes. Genetics. 2008, 180 (2): 1191-1210. 10.1534/genetics.108.088542.PubMed CentralView ArticlePubMedGoogle Scholar
- Jakulin A, Bratko I: Testing the significance of attribute interactions. Proceedings of the Twenty-first International Conference on Machine Learning (ICML-2004): 2004; Banff, Canada. 2004, 409-416.Google Scholar
- Jakulin A: Machine Learning Based on Attribute Interactions. 2005, Ljubljana, Slovenia: University of LjubljanaGoogle Scholar
- Han TS: Multiple mutual informations and multiple interactions in frequency data. Information and Control. 1980, 46: 26-45. 10.1016/S0019-9958(80)90478-7.View ArticleGoogle Scholar
- McGill WJ: Multivariate information transmission. Psychometrika. 1954, 19: 97-116. 10.1007/BF02289159.View ArticleGoogle Scholar
- Fano RM: Transmission of Information: A Statistical Theory of Communications. 1961, Cambridge, MA.: MIT PressGoogle Scholar
- Watanabe S: Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development. 1960, 4: 66-82.View ArticleGoogle Scholar
- Gokhale DV, Ahmed NV, Res BC, Piscataway NJ: Entropy Expressions and Their Estimators for Multivariate Distributions. IEEE Transactions on Information Theory. 1989, 35 (3): 688-692. 10.1109/18.30996.View ArticleGoogle Scholar
- Culverhouse R: The use of the restricted partition method with case-control data. Hum Hered. 2007, 63 (2): 93-100. 10.1159/000099181.View ArticlePubMedGoogle Scholar
- Miller MB, Lind GR, Li N, Jang S-Y: Genetic Analysis Workshop 15: Simulation of a complex genetic model for rheumatoid arthritis in nuclear families including a dense SNP map with linkage disequilibrium between marker loci and trait loci. BMC Genetics. 2007, 1 (Suppl 1): S4-Google Scholar
- Ishimori N, Li R, Kelmenson PM, Korstanje R, Walsh KA, Churchill GA, Forsman-Semb K, Paigen B: Quantitative trait loci analysis for plasma HDL-cholesterol concentrations and atherosclerosis susceptibility between inbred mouse strains C57BL/6J and 129S1/SvImJ. Arterioscler Thromb Vasc Biol. 2004, 24 (1): 161-166. 10.1161/01.ATV.0000104027.52895.D7.View ArticlePubMedGoogle Scholar
- Rollins J, Chen Y, Paigen B, Wang X: In search of new targets for plasma high-density lipoprotein cholesterol levels: promise of human-mouse comparative genomics. Trends Cardiovasc Med. 2006, 16 (7): 220-234. 10.1016/j.tcm.2006.04.003.View ArticlePubMedGoogle Scholar
- Chen Y, Rollins J, Paigen B, Wang X: Genetic and genomic insights into the molecular basis of atherosclerosis. Cell Metab. 2007, 6 (3): 164-179. 10.1016/j.cmet.2007.07.001.PubMed CentralView ArticlePubMedGoogle Scholar
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12 (6): 996-1006.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang X, Ishimori N, Korstanje R, Rollins J, Paigen B: Identifying novel genes for atherosclerosis through mouse-human comparative genetics. Am J Hum Genet. 2005, 77 (1): 1-15. 10.1086/431656.PubMed CentralView ArticlePubMedGoogle Scholar
- Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D: Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA. 2003, 100 (20): 11484-11489. 10.1073/pnas.1932072100.PubMed CentralView ArticlePubMedGoogle Scholar
- Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res. 2003, 13 (1): 103-107. 10.1101/gr.809403.PubMed CentralView ArticlePubMedGoogle Scholar
- Clemens KE, Churchill G, Bhatt N, Richardson K, Noonan FP: Genetic control of susceptibility to UV-induced immunosuppression by interacting quantitative trait loci. Genes Immun. 2000, 1 (4): 251-259. 10.1038/sj.gene.6363667.View ArticlePubMedGoogle Scholar
- Culverhouse R, Klein T, Shannon W: Detecting epistatic interactions contributing to quantitative traits. Genet Epidemiol. 2004, 27 (2): 141-152. 10.1002/gepi.20006.View ArticlePubMedGoogle Scholar
- Chiodini BD, Lewis CM: Meta-analysis of 4 coronary heart disease genome-wide linkage studies confirms a susceptibility locus on chromosome 3q. Arterioscler Thromb Vasc Biol. 2003, 23 (10): 1863-1868. 10.1161/01.ATV.0000093281.10213.DB.View ArticlePubMedGoogle Scholar
- Zintzaras E, Kitsios G: Identification of chromosomal regions linked to premature myocardial infarction: a meta-analysis of whole-genome searches. J Hum Genet. 2006, 51 (11): 1015-1021. 10.1007/s10038-006-0053-x.View ArticlePubMedGoogle Scholar
- Farrall M, Green FR, Peden JF, Olsson PG, Clarke R, Hellenius ML, Rust S, Lagercrantz J, Franzosi MG, Schulte H, et al: Genome-wide mapping of susceptibility to coronary artery disease identifies a novel replicated locus on chromosome 17. PLoS Genet. 2006, 2 (5): e72-10.1371/journal.pgen.0020072.PubMed CentralView ArticlePubMedGoogle Scholar
- Hakonarson H, Thorvaldsson S, Helgadottir A, Gudbjartsson D, Zink F, Andresdottir M, Manolescu A, Arnar DO, Andersen K, Sigurdsson A, et al: Effects of a 5-lipoxygenase-activating protein inhibitor on biomarkers associated with risk of myocardial infarction: a randomized trial. JAMA. 2005, 293 (18): 2245-2256. 10.1001/jama.293.18.2245.View ArticlePubMedGoogle Scholar
- Wu X, Cooper RS, Borecki I, Hanis C, Bray M, Lewis CE, Zhu X, Kan D, Luke A, Curb D: A combined analysis of genomewide linkage scans for body mass index from the National Heart, Lung, and Blood Institute Family Blood Pressure Program. Am J Hum Genet. 2002, 70 (5): 1247-1256. 10.1086/340362.PubMed CentralView ArticlePubMedGoogle Scholar
- Filippatos TD, Derdemezis CS, Gazi IF, Lagos K, Kiortsis DN, Tselepis AD, Elisaf MS: Increased plasma visfatin levels in subjects with the metabolic syndrome. Eur J Clin Invest. 2008, 38 (1): 71-72.View ArticlePubMedGoogle Scholar
- Smith J, Al-Amri M, Sniderman A, Cianflone K: Visfatin concentration in Asian Indians is correlated with high density lipoprotein cholesterol and apolipoprotein A1. Clin Endocrinol (Oxf). 2006, 65 (5): 667-672. 10.1111/j.1365-2265.2006.02647.x.View ArticleGoogle Scholar
- Jian WX, Luo TH, Gu YY, Zhang HL, Zheng S, Dai M, Han JF, Zhao Y, Li G, Luo M: The visfatin gene is associated with glucose and lipid metabolism in a Chinese population. Diabet Med. 2006, 23 (9): 967-973. 10.1111/j.1464-5491.2006.01909.x.View ArticlePubMedGoogle Scholar
- Noonan FP, Hoffman HA: Control of UVB immunosuppression in the mouse by autosomal and sex-linked genes. Immunogenetics. 1994, 40 (4): 247-256. 10.1007/BF00189969.View ArticlePubMedGoogle Scholar
- Damian DL, Patterson CR, Stapelberg M, Park J, Barnetson RS, Halliday GM: UV radiation-induced immunosuppression is greater in men and prevented by topical nicotinamide. J Invest Dermatol. 2008, 128 (2): 447-454.View ArticlePubMedGoogle Scholar
- Kwak N, Choi CH: Input feature selection by mutual information based on Parzen window. IEEE Trans Pattern Analysis and Machine Intelligence. 2002, 24 (12): 1667-1671. 10.1109/TPAMI.2002.1114861.View ArticleGoogle Scholar
- Greenwood JA, Sandomire MM: Sample Size Required for Estimating the Standard Deviation as a Per Cent of its True Value. Journal of the American Statistical Association. 1950, 45 (250): 257-260. 10.2307/2280683.View ArticleGoogle Scholar
- Nelson MR, Kardia SL, Ferrell RE, Sing CF: A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001, 11 (3): 458-470. 10.1101/gr.172901.PubMed CentralView ArticlePubMedGoogle Scholar
- Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003, 19 (3): 376-382. 10.1093/bioinformatics/btf869.View ArticlePubMedGoogle Scholar
- Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69 (1): 138-147. 10.1086/321276.PubMed CentralView ArticlePubMedGoogle Scholar
- Bush WS, Dudek SM, Ritchie MD: Parallel multifactor dimensionality reduction: a tool for the large-scale analysis of gene-gene interactions. Bioinformatics. 2006, 22 (17): 2173-2174. 10.1093/bioinformatics/btl347.View ArticlePubMedGoogle Scholar
- Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003, 24 (2): 150-157. 10.1002/gepi.10218.View ArticlePubMedGoogle Scholar
- Lou XY, Chen GB, Yan L, Ma JZ, Zhu J, Elston RC, Li MD: A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence. Am J Hum Genet. 2007, 80 (6): 1125-1137. 10.1086/518312.PubMed CentralView ArticlePubMedGoogle Scholar
- Hwang W, Cho YR, Zhang A, Ramanathan M: CASCADE: a novel quasi all paths-based network analysis algorithm for clustering biological interactions. BMC Bioinformatics. 2008, 9: 64-10.1186/1471-2105-9-64.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.