- Methodology article
- Open Access
A robust and efficient statistical method for genetic association studies using case and control samples from multiple cohorts
- Minghui Wang^{1, 2},
- Lin Wang^{1},
- Ning Jiang^{1, 3},
- Tianye Jia^{2} and
- Zewei Luo^{1, 2}
https://doi.org/10.1186/1471-2164-14-88
© Wang et al.; licensee BioMed Central Ltd. 2013
- Received: 16 May 2012
- Accepted: 31 January 2013
- Published: 8 February 2013
Abstract
Background
The theoretical basis of genome-wide association studies (GWAS) is statistical inference of linkage disequilibrium (LD) between any polymorphic marker and a putative disease locus. Most methods widely implemented for such analyses are vulnerable to several key demographic factors, deliver poor statistical power for detecting genuine associations, and have a high false positive rate. Here, we present a likelihood-based statistical approach that properly accounts for the non-random nature of case–control samples with regard to the genotypic distribution at the loci in the populations under study, and that confers flexibility to test for genetic association in the presence of confounding factors such as population structure and non-randomness of samples.
Results
We implemented this novel method, together with several popular methods from the GWAS literature, to re-analyze recently published Parkinson’s disease (PD) case–control samples. The real data analysis and computer simulation show that, compared to its rivals, the new method confers not only significantly improved statistical power for detecting associations but also robustness to the difficulties stemming from non-random sampling and genetic structure. In particular, the new method detected 44 significant SNPs within 25 chromosomal regions of size < 1 Mb, but only 6 SNPs in two of these regions were previously detected by the trend test based methods. It discovered two SNPs located 1.18 Mb and 0.18 Mb from the PD candidates, FGF20 and PARK8, without invoking false positive risk.
Conclusions
We developed a novel likelihood-based method which provides adequate estimation of LD and other population model parameters from case and control samples, eases the integration of such samples from multiple genetically divergent populations, and thus confers statistically robust and powerful analyses of GWAS. On the basis of simulation studies and analysis of real datasets, we demonstrated significant improvement of the new method over the non-parametric trend test, which is the most popularly implemented method in the GWAS literature.
Keywords
- Case and control samples
- Genome-wide association study
- Linkage disequilibrium
- Multiple cohorts
- Parkinson’s disease
- Robust statistical approach
Background
Rapid advancement in high-throughput sequencing techniques has greatly inspired the wave of genome-wide association studies (GWAS) to unravel the genetic basis underlying complex traits in plants, animals and humans [1–3]. The theoretical kernel of these genetic association studies is statistical inference of linkage disequilibrium (LD) between a tested polymorphic marker locus and a putative trait locus in the population of interest. A review of the rich literature reveals that a major challenge to association studies lies in the high vulnerability of LD-based analyses to several demographic factors, the most prominent of which is population stratification. It has been well documented that use of samples collected from a population with genetic structure may result in both false positive and false negative inferences of association [4]. Tremendous efforts have been invested to tackle the problem, either by predicting genetic structure in the population under study [5] and incorporating this information into the association analysis [6], or by adjusting the test statistic through so-called genomic control [7].
In contrast to the problem arising from population stratification, the consequences of using non-random samples in association studies are usually neglected. We recently investigated the effect of using non-random samples in LD analyses and observed that estimates of LD parameters can be severely biased and that the statistical power for testing their significance is substantially reduced [8]. In practice, the sampling schemes of many genetic association studies involve various types of selection, and thus the samples collected are no longer random representations of the corresponding populations. A typical example is the ‘case–control sample’ used in many association studies of human diseases. In such instances, the frequencies of some disease genotypes are artificially inflated relative to the population frequencies so as to ensure sufficient representation of those genotypes carrying a rare allele. Although approaches have been developed to model and analyze ‘case–control’ samples, they are usually based on nonparametric statistical tests such as χ^{2} or trend tests and rarely account for the biases described above [9]. Such approaches are statistically less sophisticated and often not robust in the presence of these influences, exposing the corresponding analyses to the risk of false positive and/or false negative inferences of genetic association. We present here a novel likelihood-based statistical framework that confers improved robustness of parameter estimation to non-randomness of samples and thus a more powerful statistical test of LD in the presence or absence of genetic structure. We demonstrate the improved robustness and statistical power by re-analyzing the recently published Parkinson’s disease (PD) case and control datasets [2]. In addition, we illustrate the statistical properties of the method by computer simulation.
Methods
Derivation of equation (1) implies that the conditional probabilities of a marker allele given an allele at the disease locus are the same in cases and controls within each subpopulation, that is ${p}_{M|a}^{\left(i\right)\mathit{case}}={p}_{M|a}^{\left(i\right)\mathit{control}}={p}_{M|a}^{\left(i\right)}$ and ${p}_{M|A}^{\left(i\right)\mathit{case}}={p}_{M|A}^{\left(i\right)\mathit{control}}={p}_{M|A}^{\left(i\right)}$. This is true if the sub-populations are not subject to any further stratification. The simple algebraic formulation reveals that any association test based on comparing the marker allele frequency between cases and controls has at least two properties. First, the test statistic will not be zero even though the disease and marker loci are in linkage equilibrium in all subpopulations, i.e., all D^{(i)} = 0, if the allele frequency of the tested marker varies from one population to the other, i.e. ${p}_{M}^{\left(i\right)}\ne {p}_{M}^{\left(j\right)}\phantom{\rule{0.5em}{0ex}}\left(i\ne j\right)$, suggesting the risk of making a false positive inference of LD if the case and control samples are collected from genetically divergent populations. Secondly, the efficiency of the association test can be greatly influenced by the sampling scheme of the cases and controls, as characterized by the term ${r}_{i}\left({q}_{A}^{\left(i\right)\mathit{case}}-{q}_{A}^{\left(i\right)}\right)-{s}_{i}\left({q}_{A}^{\left(i\right)\mathit{control}}-{q}_{A}^{\left(i\right)}\right)$ in equation (1), as we and others have previously demonstrated [4, 8].
Conditional probability distributions
a.
Disease genotype | MM | Mm | mm |
---|---|---|---|
AA | g_{11} = Q^{2} | g_{12} = 2Q(1-Q) | g_{13} = (1-Q)^{2} |
Aa | g_{21} = QR | g_{22} = Q + R-2QR | g_{23} = (1-Q)(1-R) |
aa | g_{31} = R^{2} | g_{32} = 2R(1-R) | g_{33} = (1-R)^{2} |
where Q = p + D/q and R = p - D/(1-q)
b.
Marker genotype | AA | Aa | aa |
---|---|---|---|
MM | h_{11} = Q^{2} | h_{12} = 2Q(1-Q) | h_{13} = (1-Q)^{2} |
Mm | h_{21} = QR | h_{22} = Q + R-2QR | h_{23} = (1-Q)(1-R) |
mm | h_{31} = R^{2} | h_{32} = 2R(1-R) | h_{33} = (1-R)^{2} |
where Q = q + D/p and R = q - D/(1-p)
c.
Disease genotype | Cases: MM | Cases: Mm | Cases: mm | Controls: MM | Controls: Mm | Controls: mm |
---|---|---|---|---|---|---|
AA | f_{1} × g_{11} | f_{1} × g_{12} | f_{1} × g_{13} | (1-f_{1}) × g_{11} | (1-f_{1}) × g_{12} | (1-f_{1}) × g_{13} |
Aa | f_{2} × g_{21} | f_{2} × g_{22} | f_{2} × g_{23} | (1-f_{2}) × g_{21} | (1-f_{2}) × g_{22} | (1-f_{2}) × g_{23} |
aa | f_{3} × g_{31} | f_{3} × g_{32} | f_{3} × g_{33} | (1-f_{3}) × g_{31} | (1-f_{3}) × g_{32} | (1-f_{3}) × g_{33} |
#observed | n_{11} | n_{12} | n_{13} | n_{21} | n_{22} | n_{23} |
In Method 1, we first consider a case–control sample of size n collected from a single randomly mating population. The method focuses on gene segregation at a marker locus and a putative disease locus in this population. There are two alleles, M and m, segregating at the marker locus and two alleles, A and a, at the disease locus. For simplicity but without loss of generality, A is taken to be the disease-causing allele. The population genetic parameters characterising the distribution of genotypes at the marker and disease loci are p, the frequency of marker allele M; q, the frequency of disease allele A; and D, the coefficient of LD between genes at the two loci. The distribution of genotypes at the two loci can be expressed in terms of these parameters. Let g_{ ij } = Pr{Y = j | X = i} be the conditional probability of marker genotype Y = j (j = 1, 2 and 3 for marker genotypes MM, Mm and mm respectively) given the disease genotype X = i (i = 1, 2 and 3 for the disease genotypes AA, Aa and aa respectively). Let h_{ ij } = Pr{X = i | Y = j} be the conditional probability of disease genotype X = i given the marker genotype Y = j. These conditional probability distributions can be expressed in terms of the population genetic parameters p, q and D as given in Table 1a and 1b respectively.
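As a minimal illustrative sketch (not the implementation used in this study), the entries of Table 1a can be computed directly from p, q and D. The function name and the example parameter values below are our own choices for illustration.

```python
def conditional_marker_given_disease(p, q, D):
    """Return g[i][j] = Pr{Y = j | X = i} as laid out in Table 1a.

    Rows i: disease genotypes AA, Aa, aa.
    Columns j: marker genotypes MM, Mm, mm.
    Assumes 0 < q < 1 so that both conditional allele frequencies exist.
    """
    Q = p + D / q          # frequency of M among A-bearing gametes
    R = p - D / (1 - q)    # frequency of M among a-bearing gametes
    return [
        [Q * Q, 2 * Q * (1 - Q), (1 - Q) ** 2],
        [Q * R, Q + R - 2 * Q * R, (1 - Q) * (1 - R)],
        [R * R, 2 * R * (1 - R), (1 - R) ** 2],
    ]


# Illustrative values only: p = q = 0.5 and D = 0.1 give Q = 0.7, R = 0.3.
g = conditional_marker_given_disease(p=0.5, q=0.5, D=0.1)
for row in g:
    print([round(x, 4) for x in row])
```

Each row is a probability distribution over the three marker genotypes and therefore sums to one, which is a convenient sanity check on any chosen p, q and D.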
The cases and controls collected from the population can be classified according to their genotypes at the marker locus, and the sample size n is equal to the sum of the n_{ ij }, where n_{ ij } is the number of individuals with the j th marker genotype (j = 1, 2 and 3) among cases (i = 1) or controls (i = 2). Table 1c illustrates the conditional probability distribution of genotypes at the disease locus for any given genotype at the marker locus among cases or controls. It can be seen that the conditional probability distribution is a function of the penetrance parameters that characterize the inheritance of the disease genes as well as of the population genetic parameters. Table 1c is derived from Table 1a by noting that each disease genotype presents a unique disease risk. The model involves a total of six parameters, making their estimation a typical over-parameterization problem. To ease the problem, we fixed the penetrance parameters f_{1-3} to take one of the values (1, 1, 0), (1, ½, 0) or (1, 0, 0), which correspond to dominant, additive or recessive inheritance of the disease gene A respectively. Our focus was on estimation of the population genetic parameters. The consequences of possible mis-specification of the penetrance parameters are evaluated through simulation, as detailed in the Results section below.
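To make the role of the penetrance parameters concrete, the following sketch computes the expected marker genotype distribution among cases and among controls. It is an illustration under explicit assumptions, namely Hardy-Weinberg proportions at the disease locus (random mating) and the cell probabilities of Table 1c; it may differ in form from the likelihood actually used in the method, and all function and variable names are hypothetical.

```python
def marker_distribution_by_status(p, q, D, f):
    """Expected marker genotype distribution (MM, Mm, mm) in cases and controls.

    f = (f1, f2, f3) are the penetrances of disease genotypes AA, Aa, aa.
    Assumes Hardy-Weinberg proportions at the disease locus and a prevalence
    strictly between 0 and 1.
    """
    Q = p + D / q
    R = p - D / (1 - q)
    g = [
        [Q * Q, 2 * Q * (1 - Q), (1 - Q) ** 2],
        [Q * R, Q + R - 2 * Q * R, (1 - Q) * (1 - R)],
        [R * R, 2 * R * (1 - R), (1 - R) ** 2],
    ]
    x = [q * q, 2 * q * (1 - q), (1 - q) ** 2]      # disease genotype frequencies
    prevalence = sum(xi * fi for xi, fi in zip(x, f))
    cases = [sum(x[i] * f[i] * g[i][j] for i in range(3)) / prevalence
             for j in range(3)]
    controls = [sum(x[i] * (1 - f[i]) * g[i][j] for i in range(3)) / (1 - prevalence)
                for j in range(3)]
    return cases, controls


# Additive penetrance (1, 1/2, 0) with illustrative population parameters.
cases, controls = marker_distribution_by_status(0.5, 0.5, 0.1, (1.0, 0.5, 0.0))
print([round(c, 4) for c in cases], [round(c, 4) for c in controls])
```

When D = 0 the two distributions coincide with the Hardy-Weinberg proportions at the marker locus, so any difference between cases and controls in this sketch is attributable to LD.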
The coefficients a_{ i } (i = 0,1,…,6) and b_{ i } (i = 0,1,…,5) in equations (4) and (5) are functions of the sample observations N = (n_{11},n_{12},n_{13},n_{21},n_{22},n_{23}) and the conditional probabilities w_{ ij } and v_{ ij }. Mathematical forms of these coefficients were derived using the computer software Mathematica [10] and are listed in Additional file 1. We propose an EM algorithm to calculate the maximum likelihood estimates (MLEs) of the parameters q and D. The algorithm starts with the estimate of the marker allele frequency, $\widehat{p}$, and initial guesses for the values of the other two model parameters, D and q. With these parameter values, the conditional probabilities w_{ ij } and v_{ ij } can be calculated from equation (3). This constitutes the expectation (E) step of the EM algorithm. The maximization (M) step calculates new values of the parameters by solving equations (4) and (5) respectively. It should be noted that the coefficient of the leading term in the polynomial equations (4) and (5) is a positive constant, warranting the existence of at least one real root to these equations. Although there is no analytical solution to these equations, they can be solved numerically [11]. When multiple real roots were found, we selected the one that was within the corresponding theoretical bounds (0 < q < 1 and/or max{-pq, -(1-p)(1-q)} ≤ D ≤ min{p(1-q), (1-p)q}) and also resulted in the highest value of the likelihood. As the E and M steps are iteratively repeated, the likelihood function increases monotonically along the sequence of newly determined parameter estimates, which converge to the MLEs of the model parameters, $\widehat{q}$ and $\widehat{D}$.
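The root-selection rule described above can be sketched as follows. This is only an illustration: the candidate (q, D) pairs and the log-likelihood function are hypothetical inputs supplied by the caller, and the sketch does not attempt to solve the polynomial equations themselves.

```python
def ld_bounds(p, q):
    """Theoretical bounds on D given marker allele frequency p and disease allele frequency q."""
    lower = max(-p * q, -(1 - p) * (1 - q))
    upper = min(p * (1 - q), (1 - p) * q)
    return lower, upper


def pick_root(candidates, p, log_likelihood):
    """Keep admissible (q, D) pairs and return the one with the highest likelihood.

    `candidates` is an iterable of (q, D) pairs and `log_likelihood(q, D)` a
    callable; both are hypothetical placeholders for illustration.
    Returns None when no candidate lies within the theoretical bounds.
    """
    admissible = []
    for q, D in candidates:
        if not 0 < q < 1:
            continue
        lower, upper = ld_bounds(p, q)
        if lower <= D <= upper:
            admissible.append((q, D))
    if not admissible:
        return None
    return max(admissible, key=lambda qd: log_likelihood(*qd))


print(ld_bounds(0.5, 0.5))
```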
It is important to note that the log-likelihood function under the null hypothesis simplifies to L(p, q, D = 0|N, f_{1}, f_{2}, f_{3}) = (n_{11} + n_{21})Log[p^{2}] + (n_{12} + n_{22})Log[2p(1 - p)] + (n_{13} + n_{23})Log[(1 - p)^{2}], which is a function of the vector N and the marker allele frequency p only and is independent of the other parameters, q and D. Thus, the likelihood ratio test statistic can be approximated by a χ^{2} distribution with 2 degrees of freedom (d.f.). Under D = 0, the MLE of p is given by $\widehat{p}=\left(2{n}_{11}+2{n}_{21}+{n}_{12}+{n}_{22}\right)/2n$, as expected.
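A minimal sketch of the null fit and the resulting test is given below. It evaluates the null log-likelihood at $\widehat{p}$ and converts a likelihood ratio statistic into a P value using the closed-form survival function of the χ^{2} distribution with 2 d.f., which is exp(-x/2). The log-likelihood under the alternative must come from the EM fit and is treated here as a given number; function names are illustrative.

```python
import math


def null_fit(cases, controls):
    """MLE of p and log-likelihood under D = 0.

    cases = (n11, n12, n13) and controls = (n21, n22, n23) are counts of
    marker genotypes MM, Mm, mm. Assumes 0 < p_hat < 1.
    """
    n11, n12, n13 = cases
    n21, n22, n23 = controls
    n = sum(cases) + sum(controls)
    p_hat = (2 * n11 + 2 * n21 + n12 + n22) / (2 * n)
    loglik = ((n11 + n21) * math.log(p_hat ** 2)
              + (n12 + n22) * math.log(2 * p_hat * (1 - p_hat))
              + (n13 + n23) * math.log((1 - p_hat) ** 2))
    return p_hat, loglik


def lrt_2df(loglik_alt, loglik_null):
    """Likelihood ratio statistic and its P value under a chi-square with 2 d.f."""
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    return stat, math.exp(-stat / 2.0)


p_hat, ll0 = null_fit((30, 50, 20), (20, 50, 30))
print(p_hat, round(ll0, 3))
```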
where the superscript denotes the parameters for each subpopulation. To calculate the above likelihood function, we propose first to estimate the population-specific parameters q^{(i)} and D^{(i)} from the case and control sample of the i th subpopulation separately using the method described above, and then to sum the log-likelihoods over all the case and control samples. Although the likelihood ratio statistic confers the flexibility to test for significance of LD in any subpopulation, we are interested here in testing the conservative null hypothesis that there is no LD between the marker and disease loci in any of the subpopulations, based on the ratio of the aggregate likelihood with ${\widehat{D}}^{\left(i\right)}$ over the likelihood with D^{(i)} = 0. This likelihood ratio test statistic was approximated by a χ^{2} variable with 2k d.f.
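Because the samples from different subpopulations are independent, the aggregate likelihood ratio statistic is the sum of the per-subpopulation statistics. The sketch below, which is illustrative rather than the implementation used in this study, pools such statistics and evaluates the χ^{2} tail probability with 2k d.f. using the closed form available for even degrees of freedom; the input statistics are assumed to have been computed elsewhere.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square variable with an even number of d.f."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j          # (x/2)^j / j!
        total += term
    return math.exp(-half) * total


def pooled_lrt(per_population_stats):
    """Sum per-subpopulation LRT statistics and test on 2k d.f."""
    total = sum(per_population_stats)
    df = 2 * len(per_population_stats)
    return total, df, chi2_sf_even_df(total, df)


# Two hypothetical cohorts with LRT statistics 4.0 and 6.0.
print(pooled_lrt([4.0, 6.0]))
```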
to be the test statistic, which follows the chi-square distribution with 1 d.f. In equation (8), the denominator is the sampling variance of the numerator under the null hypothesis, i.e. that there is no LD in any subpopulation. ${n}_{i}^{\mathit{case}}$ (or ${n}_{i}^{\mathit{control}}$) and n^{(i)} are the number of cases (or controls) and the total number of cases and controls from the i th population or cohort.
under an additive genetic inheritance model [9, 13], where ${\widehat{p}}_{M}=\left(2{n}_{11}+{n}_{12}+2{n}_{21}+{n}_{22}\right)/2n$, ${\widehat{p}}_{\mathit{MM}}=\left({n}_{11}+{n}_{21}\right)/n$, ${\widehat{p}}_{M}^{\mathit{\text{case}}}=\left(2{n}_{11}+{n}_{12}\right)/2{n}_{1\cdot}$, and ${\widehat{p}}_{M}^{\mathit{\text{control}}}=\left(2{n}_{21}+{n}_{22}\right)/2{n}_{2\cdot}$. Note that the denominator in equation (10) has a term ${\widehat{p}}_{\mathit{MM}}-{\widehat{p}}_{M}^{2}$, which is zero when the case–control sample is in Hardy-Weinberg Equilibrium (HWE). This term is proposed as a correction for bias in variance estimation when there is departure from HWE due to factors such as population structure [13]. However, there is no such correction term under the dominant and recessive models.
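The sample quantities entering the trend test can be computed from the six genotype counts as sketched below. The sketch covers only the frequency estimates and the HWE deviation term defined above, not the test statistic itself; the function name and dictionary keys are illustrative.

```python
def trend_test_frequencies(cases, controls):
    """Allele and genotype frequency estimates used by the trend test.

    cases = (n11, n12, n13) and controls = (n21, n22, n23) are counts of
    marker genotypes MM, Mm, mm.
    """
    n11, n12, n13 = cases
    n21, n22, n23 = controls
    n1 = sum(cases)       # number of cases
    n2 = sum(controls)    # number of controls
    n = n1 + n2
    p_M = (2 * n11 + n12 + 2 * n21 + n22) / (2 * n)
    p_MM = (n11 + n21) / n
    return {
        "p_M": p_M,
        "p_MM": p_MM,
        "p_M_case": (2 * n11 + n12) / (2 * n1),
        "p_M_control": (2 * n21 + n22) / (2 * n2),
        "hwe_deviation": p_MM - p_M ** 2,   # zero under exact HWE
    }


print(trend_test_frequencies((30, 50, 20), (20, 50, 30)))
```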
Re-analysis of the Parkinson’s disease datasets
We implemented the three methods described above to re-analyze the PD dataset recently published by Simon-Sanchez et al. [2]. The study carried out a genome-wide screen for genetic variants predisposing to PD through a two-stage case–control design. In stage I, 4,005 individuals (971 cases and 3,034 controls) recruited from the United States and 1,686 individuals (742 cases and 944 controls) recruited from Germany were genotyped at 507,861 SNPs using Infinium BeadChips, of which 463,185 SNPs with genotyping call rate larger than 95%, minor allele frequency (MAF) above 0.05 and no departure from HWE (p > 0.01) were retained [2]. Because estimates of allele frequency from a small sample may vary greatly, we further excluded markers at which there were fewer than five individuals for any genotype. After this quality control, a total of 447,270 SNPs were used in the present study. Principal component analysis (PCA) of the genotype data was carried out with the program GCTA [14] to investigate the population structure of the stage I dataset. In stage II, which was designed as a confirmation stage, 3,392, 3,223 and 1,319 individuals were recruited from three different cohorts: the USA (1,473 cases and 1,919 controls), Germany (1,074 cases and 2,149 controls) and the UK (814 cases and 505 controls) respectively. All 7,934 individuals were genotyped for 345 SNPs which showed significant associations in the analysis of the stage I dataset. After applying the same quality check to the SNP data, two SNPs were excluded from the present study. The genetic association for each SNP marker was evaluated by Armitage’s trend test (Method 3 here) and the genome-wide significance level was determined by the Bonferroni correction for the probability of an overall type I error at 5%.
Simulation model and method
To investigate statistical properties and limitations of the method developed in the present study, we considered three schemes for sampling cases and controls from computer simulated randomly mating populations. In the first two sampling schemes, scheme A and B, we fixed the penetrance parameters f_{1-3} for genotypes at the disease locus to be (1, ½, 0), while in the third sampling scheme, scheme C, mild penetrance parameters (i.e. f_{1-3} < 1) were used. Sampling schemes A and C collected cases and controls from a single population, while scheme B sampled individuals from two genetically divergent populations with regard to a tested marker and a putative disease locus. A simulated population in the present study was fully characterized by a set of population genetic parameters, p, q and D (frequencies of alleles M and A at a marker locus and a disease locus respectively and the coefficient of LD between the two loci), and quantitative genetic parameters, f_{1}, f_{2} and f_{3} (penetrance of three genotypes at the disease locus). For a given set of these parameters, genotype of a case or control subject at both the marker and disease loci was generated by randomly sampling from the conditional probability distribution given in Table 1c. The sampling process continued until the required number of cases or controls was obtained. The computer programs implementing the simulation were described and modified in Luo [15] and Wang et al. [8].
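The sampling procedure for a single population can be sketched as follows. This is a simplified illustration, not the simulation program used in this study: it assumes Hardy-Weinberg proportions at the disease locus, draws the marker genotype from the conditional probabilities g_{ij}, assigns disease status with probability f_{i}, and keeps drawing until both quotas are filled. It also assumes the penetrances allow both cases and controls to occur; otherwise the loop would not terminate.

```python
import random


def simulate_case_control(p, q, D, f, n_cases, n_controls, rng):
    """Return marker genotype counts (MM, Mm, mm) for cases and controls."""
    Q = p + D / q
    R = p - D / (1 - q)
    g = [
        [Q * Q, 2 * Q * (1 - Q), (1 - Q) ** 2],
        [Q * R, Q + R - 2 * Q * R, (1 - Q) * (1 - R)],
        [R * R, 2 * R * (1 - R), (1 - R) ** 2],
    ]
    x = [q * q, 2 * q * (1 - q), (1 - q) ** 2]   # disease genotype frequencies
    cases = [0, 0, 0]
    controls = [0, 0, 0]
    while sum(cases) < n_cases or sum(controls) < n_controls:
        i = rng.choices(range(3), weights=x)[0]       # disease genotype
        j = rng.choices(range(3), weights=g[i])[0]    # marker genotype given i
        affected = rng.random() < f[i]
        if affected and sum(cases) < n_cases:
            cases[j] += 1
        elif not affected and sum(controls) < n_controls:
            controls[j] += 1
    return cases, controls


rng = random.Random(2013)   # fixed seed so the illustration is repeatable
print(simulate_case_control(0.5, 0.5, 0.1, (1.0, 0.5, 0.0), 500, 500, rng))
```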
Results
Re-analysis of the Parkinson’s disease datasets
Summary of top associations from stage I dataset
Locus | SNP name | Dist (kb)* | P value (M1) | P value (M2) | P value (M3) | BPP % (M1) | BPP % (M2) | BPP % (M3) |
---|---|---|---|---|---|---|---|---|
1p13.2-13.3 | rs17654531 | - | 1.9 × 10^{-9} | 3.2 × 10^{-6} | 1.2 × 10^{-5} | 37 | 22 | 14 |
1p13.2-13.3 | rs10857899 | 328 | 2.7 × 10^{-8} | 3.1 × 10^{-6} | 3.1 × 10^{-6} | 57 | 25 | 27 |
2p23.3 | rs7564397 | - | 9.7 × 10^{-8} | 0.013 | 0.033 | 55 | 0 | 0 |
2q21.2 | rs1474406 | - | 4.3 × 10^{-8} | 2.3 × 10^{-3} | 0.001 | 57 | 1 | 3 |
2q36.1 | rs1447108 | - | 5.5 × 10^{-8} | 2.5 × 10^{-4} | 4.4 × 10^{-4} | 59 | 4 | 3 |
3p24.3 | rs1605527 | - | 2.0 × 10^{-8} | 1.0 × 10^{-4} | 9.4 × 10^{-5} | 53 | 9 | 10 |
4p15.2 | rs6820719 | - | 1.6 × 10^{-9} | 0.23 | 0.30 | 74 | 0 | 0 |
4p15.2 | rs7676830 | 23 | 8.6 × 10^{-10} | 0.12 | 0.15 | 77 | 0 | 0 |
4p15.2 | rs12649499 | 11 | 4.8 × 10^{-10} | 0.20 | 0.26 | 77 | 0 | 0 |
4q21 | rs11931074 | - | 3.9 × 10^{-8} | 5.1 × 10^{-8} | 4.8 × 10^{-8} | 56 | 54 | 54 |
4q21 | rs356220 | 2 | 7.7 × 10^{-11} | 3.4 × 10^{-8} | 7.0 × 10^{-8} | 81 | 56 | 52 |
4q21 | rs3857059 | 34 | 5.3 × 10^{-8} | 4.0 × 10^{-8} | 3.6 × 10^{-8} | 56 | 55 | 56 |
4q21 | rs2736990 | 3 | 6.3 × 10^{-12} | 2.9 × 10^{-9} | 5.7 × 10^{-9} | 88 | 71 | 67 |
6q27 | rs2072638 | - | 1.1 × 10^{-11} | 0.014 | 0.012 | 86 | 0 | 0 |
7p14-p13 | rs859522 | - | 1.8 × 10^{-8} | 9.7 × 10^{-6} | 3.4 × 10^{-5} | 62 | 21 | 14 |
7q21 | rs3779331 | - | 6.6 × 10^{-8} | 0.028 | 0.01 | 56 | 0 | 0 |
7q21.11 | rs10246477 | - | 9.3 × 10^{-8} | 2.3 × 10^{-5} | 5.3 × 10^{-5} | 56 | 13 | 10 |
8p23.2 | rs7013027 | - | 5.8 × 10^{-8} | 4.3 × 10^{-6} | 1.9 × 10^{-6} | 56 | 23 | 29 |
8p23.2 | rs4875773 | 63 | 1.6 × 10^{-8} | 0.02 | 0.044 | 63 | 0 | 0 |
8p22 | rs7828611 | - | 8.4 × 10^{-8} | 1.2 × 10^{-4} | 6.2 × 10^{-4} | 55 | 6 | 3 |
8p22 | rs2736050 | 1 | 9.9 × 10^{-10} | 1.0 × 10^{-5} | 2.0 × 10^{-4} | 74 | 18 | 5 |
8p22 | rs2009817 | 3 | 2.0 × 10^{-9} | 1.3 × 10^{-5} | 2.1 × 10^{-4} | 72 | 16 | 5 |
8q24.23-24.3 | rs4556079 | - | 4.8 × 10^{-8} | 5.0 × 10^{-6} | 4.8 × 10^{-6} | 60 | 20 | 22 |
8q24.23-24.3 | rs11781101 | 14 | 7.3 × 10^{-8} | 5.4 × 10^{-6} | 5.3 × 10^{-6} | 56 | 21 | 22 |
8q24.23-24.3 | rs7004938 | 12 | 3.1 × 10^{-8} | 3.0 × 10^{-6} | 3.0 × 10^{-6} | 59 | 24 | 25 |
8q24.23-24.3 | rs11783351 | 1 | 7.7 × 10^{-8} | 5.0 × 10^{-6} | 5.5 × 10^{-6} | 53 | 21 | 21 |
9q21.31 | rs2378554 | - | 6.6 × 10^{-8} | 2.0 × 10^{-6} | 2.9 × 10^{-5} | 54 | 29 | 13 |
10p11.21 | rs2492448 | - | 3.8 × 10^{-8} | 1.6 × 10^{-6} | 3.8 × 10^{-6} | 61 | 29 | 24 |
10p11.21 | rs11591754 | 12 | 4.8 × 10^{-10} | 2.5 × 10^{-7} | 1.7 × 10^{-6} | 80 | 43 | 30 |
10p11.21 | rs7923172 | 102 | 7.0 × 10^{-8} | 1.1 × 10^{-5} | 1.4 × 10^{-5} | 54 | 17 | 16 |
10p11.21 | rs4934704 | 23 | 7.3 × 10^{-8} | 1.2 × 10^{-5} | 1.5 × 10^{-5} | 54 | 17 | 16 |
10p11.21 | rs10827492 | 97 | 9.7 × 10^{-8} | 1.3 × 10^{-5} | 1.7 × 10^{-5} | 52 | 16 | 16 |
10q24.3 | rs17115100 | - | 2.7 × 10^{-8} | 6.9 × 10^{-6} | 2.5 × 10^{-5} | 37 | 19 | 13 |
11p15.2 | rs11605276 | - | 3.4 × 10^{-11} | 0.079 | 0.19 | 86 | 0 | 0 |
11p15.2 | rs10500796 | 45 | 1.9 × 10^{-8} | 0.18 | 0.30 | 61 | 0 | 0 |
11q13 | rs1726764 | - | 6.6 × 10^{-8} | 0.088 | 0.20 | 53 | 0 | 0 |
12p13 | rs10849446 | - | 6.7 × 10^{-9} | 1.1 × 10^{-4} | 3.7 × 10^{-5} | 68 | 6 | 12 |
16p13.3 | rs11648673 | - | 5.5 × 10^{-8} | 1.3 × 10^{-5} | 4.8 × 10^{-7} | 56 | 15 | 38 |
17q21 | rs169201 | - | 1.0 × 10^{-7} | 6.5 × 10^{-6} | 1.2 × 10^{-7} | 57 | 19 | 49 |
17q21 | rs199533 | 39 | 4.1 × 10^{-8} | 2.8 × 10^{-6} | 5.0 × 10^{-8} | 60 | 24 | 55 |
17q24.3 | rs558076 | - | 6.6 × 10^{-8} | 1.0 × 10^{-4} | 2.5 × 10^{-5} | 57 | 7 | 14 |
17q24.3 | rs817097 | 42 | 5.0 × 10^{-8} | 8.1 × 10^{-6} | 6.2 × 10^{-6} | 56 | 18 | 18 |
20p12.1 | rs6041636 | - | 9.9 × 10^{-9} | 0.16 | 0.24 | 66 | 0 | 0 |
21q22.3 | rs2070535 | - | 5.0 × 10^{-8} | 0.060 | 0.096 | 54 | 0 | 0 |
To assess variation of the predicted genetic associations, we carried out bootstrap sampling with replacement from the stage I dataset (1,000 replicates) and calculated the empirical posterior probability at each of the 44 significant SNPs. Table 2 summarizes the significance levels (P values) and the bootstrap posterior probabilities (BPP) calculated for each of the three methods. BPP was calculated as the proportion of bootstrap replicates in which the SNP of interest was detected at the empirical Bonferroni P value threshold of 1.1 × 10^{-7}. Of the three methods tested on the stage I dataset, we found that Method 1 confers the most powerful test for genetic association. The BPP values calculated from repeated bootstrap samples using Method 1 were consistently higher (Wilcoxon signed-rank test P value 5.8 × 10^{-9}), suggesting that the method is more robust to sampling variation than the other two methods tested in this study. Methods 2 and 3 had comparable BPP values (Wilcoxon signed-rank test P value 0.57) and hence similar robustness to sampling variation.
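The BPP calculation can be sketched generically as below. The sketch resamples individuals with replacement and counts how often a user-supplied test falls below the threshold; whether cases and controls were resampled jointly or separately in this study is not specified here, so joint resampling is an assumption, and `p_value_fn` is a hypothetical placeholder for any of the three methods.

```python
import random


def bootstrap_detection_rate(individuals, p_value_fn, threshold, n_boot, rng):
    """Percentage of bootstrap replicates in which p_value_fn(sample) < threshold.

    `individuals` is a list of per-individual records; `p_value_fn` maps a
    resampled list to a P value. Both are supplied by the caller.
    """
    n = len(individuals)
    hits = 0
    for _ in range(n_boot):
        resample = [individuals[rng.randrange(n)] for _ in range(n)]
        if p_value_fn(resample) < threshold:
            hits += 1
    return 100.0 * hits / n_boot


# Toy usage with a stand-in test that always returns a very small P value.
records = list(range(100))
rate = bootstrap_detection_rate(records, lambda s: 1e-12, 1.1e-7, 50, random.Random(0))
print(rate)
```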
Before reporting our analysis of the stage II dataset, it is worth stressing that the 345 SNPs originally genotyped were selected only from the previous analysis using Method 3 [2]. The dataset contains only 27 of the 44 significant SNPs identified using Method 1 in our analysis of the stage I data. Using a Bonferroni corrected genome-wide P = 0.05 significance threshold (1.5 × 10^{-4}), we found the SNPs located within 4q21 and 17q21 to be repeatedly detected by all three methods in the stage II dataset, while an additional six SNPs were detected by Method 1 to be in significant association with the disease trait at the Bonferroni genome-wide threshold (Figure 1b). The maximum r^{2} between the six SNPs identified only by Method 1 was 0.0013. In particular, analysis using Method 1 detected a significant SNP (rs11564162) within chromosome 12q12 (P = 2.2 × 10^{-5}), located just 176 kb from the previously identified PD candidate gene PARK8 [19]; neither Method 2 nor Method 3 detected this SNP. In addition, a significant SNP (rs2878172) within chromosome 14q22.2 detected by Method 1 is only 4 kb from the gene GCH1, which was recently found to be associated with PD through meta-analysis of multiple PD GWAS datasets and curated in the PDGene database [20]. A full list of significant SNPs detected by the three methods in the analysis of the stage II dataset is shown in Additional file 3.
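The two Bonferroni thresholds quoted above can be checked with a short calculation. The stage I count of 447,270 SNPs is taken from the quality control described in Methods; the stage II count of 343 is our inference from 345 genotyped SNPs less the two excluded, and should be read as an assumption.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold controlling the overall type I error at alpha."""
    return alpha / n_tests


stage1 = bonferroni_threshold(0.05, 447270)   # SNPs analyzed in stage I
stage2 = bonferroni_threshold(0.05, 343)      # assumed: 345 SNPs minus 2 excluded
print(f"{stage1:.1e}", f"{stage2:.1e}")
```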
When the two datasets (stage I and stage II) were combined, 90 SNPs were detected as significant at the Bonferroni corrected P = 0.05 threshold (1.5 × 10^{-4}) using Method 1, including all twenty-seven significant SNPs detected using the same method in the stage II analysis and eight significant SNPs detected by the same method in the stage I analysis (Figure 1c and Additional file 3). The 55 SNPs undetected in either individual dataset were distributed in 39 chromosomal regions (r^{2} between regions was less than 0.0012). The SNP marking the PD candidate gene PARK8, detected in the analysis of the stage II dataset, was also detected in the analysis of the combined dataset. Querying the PDGene database [20], we found another 6 of the 55 SNPs that had been reported to be associated with PD: rs6812193 (OR 0.89, 95% confidence interval (CI) 0.85-0.93), rs6532197 (OR 1.31, 95% CI 1.19-1.44), rs7077361 (OR 0.86, 95% CI 0.79-0.93), rs11191425 (OR 0.84, 95% CI 0.75-0.93), rs12413409 (OR 0.84, 95% CI 0.75-0.95) and rs1481088 (OR 1.08, 95% CI 1.01-1.16). A majority of the thirty-five significant SNPs that replicated the stage I or II analysis were detected at markedly more stringent significance levels, reflecting the increased size of the combined dataset.
Simulation study
Parameters and results of scheme A simulation
Pop. | p | q | D | Method 1: $\widehat{q}$ ± s.d. | Method 1: $\widehat{D}$ ± s.d. | Method 1: $X_{[2]}^{2}$ ± s.d. | Method 1: ρ (%) | Method 3: $X_{[1]}^{2}$ ± s.d. | Method 3: ρ (%) |
---|---|---|---|---|---|---|---|---|---|
1 | 0.5 | 0.5 | 0 | - | 0.004 ± 0.012 | 1.9 ± 2.5 | 6.9 | 1.0 ± 1.3 | 4.2 |
2 | 0.3 | 0.7 | 0 | - | 0.005 ± 0.011 | 2.0 ± 2.8 | 7.3 | 1.0 ± 1.4 | 4.5 |
3 | 0.7 | 0.3 | 0 | - | 0.002 ± 0.011 | 1.9 ± 2.7 | 6.7 | 1.0 ± 1.5 | 5.0 |
4 | 0.5 | 0.5 | 0.15 | 0.50 ± 0.05 | 0.148 ± 0.015 | 184.4 ± 42.8 | 100 | 73.3 ± 14.0 | 100 |
5 | 0.5 | 0.5 | 0.10 | 0.50 ± 0.09 | 0.097 ± 0.018 | 73.9 ± 26.5 | 99.7 | 33.3 ± 10.6 | 96.6 |
6 | 0.5 | 0.5 | 0.05 | 0.50 ± 0.20 | 0.043 ± 0.020 | 18.1 ± 12.0 | 36.8 | 8.8 ± 5.6 | 10.8 |
7 | 0.3 | 0.7 | 0.07 | 0.72 ± 0.12 | 0.064 ± 0.026 | 68.4 ± 25.4 | 99.6 | 29.6 ± 10.2 | 91.5 |
8 | 0.3 | 0.7 | 0.05 | 0.70 ± 0.15 | 0.047 ± 0.023 | 33.2 ± 17.6 | 77.3 | 15.1 ± 7.5 | 38.2 |
9 | 0.7 | 0.3 | -0.07 | 0.28 ± 0.14 | -0.062 ± 0.028 | 54.8 ± 23.4 | 96.8 | 26.3 ± 9.6 | 85.2 |
10 | 0.7 | 0.3 | -0.05 | 0.31 ± 0.20 | -0.042 ± 0.024 | 27.8 ± 15.6 | 66.1 | 13.7 ± 6.9 | 31.0 |
Parameters and results of scheme B simulation
Pop. | p^{(1)} | q^{(1)} | D^{(1)} | p^{(2)} | q^{(2)} | D^{(2)} | Population 1: M1 | Population 1: M2 | Population 1: M3 | Population 2: M1 | Population 2: M2 | Population 2: M3 | Admixed samples: M1 | Admixed samples: M2 | Admixed samples: M3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.40 | 0.10 | 0.00 | 0.70 | 0.10 | 0.00 | 0.1 | 0.0 | 0.0 | 1.6 | 0.0 | 0.0 | 1.2 | 0.0 | 25.3 |
2 | 0.45 | 0.10 | 0.00 | 0.70 | 0.10 | 0.00 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.6 | 0.0 | 12.6 |
3 | 0.50 | 0.10 | 0.00 | 0.70 | 0.10 | 0.00 | 0.3 | 0.0 | 0.0 | 1.4 | 0.0 | 0.0 | 1.2 | 0.0 | 3.7 |
4 | 0.55 | 0.10 | 0.00 | 0.70 | 0.10 | 0.00 | 0.2 | 0.0 | 0.0 | 2.1 | 0.0 | 0.0 | 1.1 | 0.0 | 0.9 |
5 | 0.60 | 0.10 | 0.00 | 0.70 | 0.10 | 0.00 | 0.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.3 |
6 | 0.65 | 0.10 | 0.00 | 0.70 | 0.10 | 0.00 | 0.1 | 0.1 | 0.1 | 0.9 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 |
7 | 0.40 | 0.10 | 0.00 | 0.50 | 0.10 | 0.02 | 0.1 | 0.0 | 0.0 | 94.3 | 44.8 | 45.6 | 91.1 | 2.9 | 50.8 |
8 | 0.45 | 0.10 | 0.00 | 0.50 | 0.10 | 0.02 | 0.0 | 0.0 | 0.0 | 93.4 | 45.7 | 47.2 | 90.8 | 1.4 | 28.4 |
9 | 0.40 | 0.10 | 0.02 | 0.50 | 0.10 | 0.00 | 99.5 | 93.9 | 94.7 | 1.1 | 0.0 | 0.0 | 99.4 | 70.1 | 90.0 |
10 | 0.45 | 0.10 | 0.02 | 0.50 | 0.10 | 0.00 | 99.7 | 95.4 | 95.5 | 1.1 | 0.0 | 0.0 | 99.3 | 69.3 | 77.4 |
11 | 0.40 | 0.10 | 0.02 | 0.50 | 0.10 | 0.02 | 99.6 | 95.0 | 95.1 | 93.2 | 43.7 | 45.7 | 100.0 | 99.7 | 100.0 |
12 | 0.45 | 0.10 | 0.02 | 0.50 | 0.10 | 0.02 | 99.6 | 95.2 | 95.6 | 93.1 | 47.5 | 49.0 | 100.0 | 99.7 | 100.0 |
13 | 0.40 | 0.10 | 0.02 | 0.50 | 0.10 | -0.02 | 99.4 | 95.1 | 95.3 | 92.2 | 45.6 | 47.0 | 100.0 | 4.2 | 6.1 |
14 | 0.45 | 0.10 | 0.02 | 0.50 | 0.10 | -0.02 | 99.1 | 93.9 | 94.0 | 94.2 | 45.8 | 47.8 | 100.0 | 3.0 | 1.4 |
Parameters and results of scheme C simulation
Pop. | p | q | D | f_{1} | f_{2} | f_{3} | $\widehat{p}$ ± s.d. | Method 1*: χ^{2} ± s.d. | Method 1*: ρ (%) | Method 1**: χ^{2} ± s.d. | Method 1**: ρ (%) | Method 3: χ^{2} ± s.d. | Method 3: ρ (%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.5 | 0.5 | 0 | 0.1 | 0.05 | 0 | 0.50 ± 0.02 | 2.0 ± 6.1 | 2 | 1.9 ± 2.9 | 0 | 0.9 ± 1.3 | 0 |
2 | 0.3 | 0.7 | 0 | 0.1 | 0.05 | 0 | 0.30 ± 0.02 | 2.2 ± 5.9 | 2.1 | 1.8 ± 2.9 | 0.2 | 1.0 ± 1.3 | 0 |
3 | 0.7 | 0.3 | 0 | 0.2 | 0.1 | 0 | 0.70 ± 0.02 | 1.5 ± 3.9 | 0.5 | 2.0 ± 2.8 | 0 | 1.0 ± 1.3 | 0 |
4 | 0.5 | 0.5 | 0.15 | 0.2 | 0.1 | 0 | 0.48 ± 0.02 | 57.8 ± 21.8 | 99.2 | 52.5 ± 19.0 | 98 | 24.1 ± 9.2 | 79.1 |
5 | 0.5 | 0.5 | 0.1 | 0.1 | 0 | 0 | 0.49 ± 0.02 | 74.5 ± 24.1 | 99.7 | 68.7 ± 20.9 | 99.6 | 35.0 ± 11.0 | 97.7 |
6 | 0.5 | 0.5 | 0.05 | 0.1 | 0 | 0 | 0.50 ± 0.02 | 20.0 ± 12.3 | 42.6 | 19.6 ± 11.9 | 41.5 | 9.3 ± 5.8 | 11.8 |
7 | 0.3 | 0.7 | 0.07 | 0.3 | 0.1 | 0 | 0.29 ± 0.02 | 16.4 ± 12.7 | 32.4 | 14.2 ± 10.9 | 25 | 6.5 ± 4.9 | 4.5 |
8 | 0.3 | 0.7 | 0.05 | 0.3 | 0.1 | 0 | 0.29 ± 0.02 | 9.5 ± 8.9 | 12.6 | 8.2 ± 7.7 | 8.8 | 3.7 ± 3.6 | 1 |
9 | 0.7 | 0.3 | -0.07 | 0.1 | 0 | 0 | 0.70 ± 0.02 | 102.6 ± 29.7 | 100 | 93.5 ± 26.1 | 99.9 | 44.7 ± 12.0 | 99.6 |
10 | 0.7 | 0.3 | -0.05 | 0.1 | 0 | 0 | 0.70 ± 0.02 | 53.7 ± 21.1 | 96.6 | 50.5 ± 19.3 | 96.2 | 24.0 ± 9.0 | 80.1 |
Discussion
We have shown that Armitage’s trend test [9], the most popular statistical strategy implemented in the current GWAS literature for the case–control setting, is highly vulnerable to sampling schemes and to genetic structure embedded in the samples. To address this problem, we have developed a novel statistical method that is robust to these factors and confers a more powerful test. We have demonstrated the robustness and improved statistical power of the new method through (i) re-analysis of the large-scale SNP genotype datasets of PD cases and controls collected from multiple geographical cohorts [2], and (ii) computer simulation studies. The new method detected a total of 44 SNPs in significant association with the disease phenotype, distributed across 25 chromosomal regions of size < 1 Mb that are largely in linkage equilibrium (LE). Only two of these regions were detected by the other methods under comparison. Among the newly detected significant SNPs, some are within or near PD candidate genes previously reported in the literature and the rest are novel discoveries. A bootstrap-based analysis shows that the new method has consistently higher posterior probabilities at the significant SNPs than the compared methods, suggesting markedly improved robustness to the sampling problem.
which follows a chi-square distribution with 1 d.f. Comparison of equation (11) with equation (8) shows that the two test statistics share the same numerator, and the denominator of ${\chi}_{G}^{2}$ is only slightly larger than that of ${\chi}_{M}^{2}$. Thus, the Mantel-Haenszel test is approximately equivalent to Method 2 explored in the present study. Moreover, we demonstrate in Additional file 5 that the Mantel-Haenszel test is equivalent to the widely employed logistic regression approach for analyzing stratified case and control samples such as the PD datasets. Either Method 2 or the Mantel-Haenszel test therefore provides an efficient alternative to logistic regression for testing association using case and control samples with known stratification. Thirdly, the likelihood-based method developed in the present study confers the flexibility to fit different fixed effects in different populations and is thus logically appropriate for integrating cases and controls collected from genetically different cohorts or populations, like the PD case and control samples analyzed here. Although a rich literature is available on predicting the genetic structure of a census population from random samples of that population, no theory or method has yet been established for making this prediction from case and control samples. In the present study, we have assumed that the population origin of the case and control samples is known in advance. This assumption is satisfied in many association studies, as illustrated by the PD datasets, where the cases and controls were collected from known populations or cohorts.
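To make the stratified test discussed above concrete, the following is a minimal sketch of the classical Cochran-Mantel-Haenszel statistic for one 2×2 table per cohort, without continuity correction. It is an illustration under stated assumptions, not a reproduction of equation (11): the function name `mantel_haenszel_chi2`, the table layout (rows are cases and controls, columns are the two marker alleles) and the counts used in the usage note are hypothetical.

```python
import math


def mantel_haenszel_chi2(tables):
    """Cochran-Mantel-Haenszel chi-square with 1 d.f. (no continuity correction).

    tables: iterable of 2x2 count tables ((a, b), (c, d)), one per cohort.
    Assumed layout (hypothetical): rows = cases, controls;
    columns = marker allele 1, marker allele 2.
    Returns (statistic, p_value).
    """
    observed = expected = variance = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        # Hypergeometric mean and variance of cell a given the margins
        expected += row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        raise ValueError("no informative strata")
    stat = (observed - expected) ** 2 / variance
    # Upper tail of a chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For example, `mantel_haenszel_chi2([((30, 70), (20, 80)), ((45, 55), (35, 65))])` pools two hypothetical cohorts while keeping each cohort's own margins, so a difference in allele frequency between cohorts does not by itself inflate the statistic.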
where π = f_{1}q^{2} + 2f_{2}q(1 - q) + f_{3}(1 - q)^{2} is the population prevalence of the disease attributed to the disease locus. The second term, i.e. the bias, is negligible when π is low. To illustrate the magnitude of the bias, we calculated the absolute difference |p’ - p| over a wide range of population settings (Additional file 6). The absolute value of the bias does not exceed 0.05 if π is less than 10%. It should be stressed that the bias presented here is its largest possible value because it was calculated at the maximum value of the disequilibrium parameter D. In addition, we compared the false positive rate and statistical power of Method 1 when the true and the biased values of the marker allele frequency were used to analyze simulated data under a wide range of settings. The results, summarized in Additional file 7, show that using marker allele frequency estimates from control samples makes no notable difference to the false positive rate or test power compared with using the true marker allele frequencies. Together, these results suggest that the proposed way of calculating the marker allele frequency does not seriously affect the efficiency of the method developed in the present study in the association test.
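The prevalence expression above is simple to evaluate numerically. The sketch below computes π from q and the three penetrances under Hardy-Weinberg genotype proportions; the function name `disease_prevalence` and the parameter values in the usage note are hypothetical and are not taken from the simulation settings of the paper.

```python
def disease_prevalence(q, f1, f2, f3):
    """Population prevalence pi = f1*q^2 + 2*f2*q*(1 - q) + f3*(1 - q)^2.

    q  : frequency of the allele whose homozygote has penetrance f1
    f1 : penetrance of the homozygote with genotype frequency q^2
    f2 : penetrance of the heterozygote
    f3 : penetrance of the other homozygote
    Hardy-Weinberg genotype proportions are assumed at the disease locus.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    for f in (f1, f2, f3):
        if not 0.0 <= f <= 1.0:
            raise ValueError("penetrances must lie in [0, 1]")
    return f1 * q ** 2 + 2.0 * f2 * q * (1.0 - q) + f3 * (1.0 - q) ** 2
```

With the hypothetical values q = 0.1, f1 = 0.5, f2 = 0.3 and f3 = 0.1, the three genotype contributions are 0.005, 0.054 and 0.081, giving π = 0.14, which lies above the 10% level mentioned in the text.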
Although the population genetic model focuses on the most prominent LD measure, D as defined in Table 1, several other scaled or standardized disequilibrium measures, such as D’ and r^{2} [28], are frequently used in the literature. The robustness and improved statistical efficiency achieved in inferring D will carry over to the transformed versions of the parameter [8]. Although the method is developed for complex quantitative traits with discrete phenotypes, extending the ideas and principles behind it to continuous phenotypes should not involve major technical difficulty. Genetic heterogeneity may further complicate the genetic control of common disease traits and is not taken into account in the present model and analysis. In the presence of genetic heterogeneity, disease-predisposing loci may differ between populations or cohorts. A direct and intuitive consequence of such heterogeneity would be weakened test power, because the effective sample size for detecting the marker-disease association at a test site is smaller than the census sample size.
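The relationship between D and its standardized versions can be sketched as follows, using the standard textbook definitions of D’ and r^{2}. The sketch assumes that D is the haplotype frequency minus the product of the two allele frequencies, which may differ in sign convention from Table 1; the function name `standardized_ld` is hypothetical.

```python
def standardized_ld(d, p, q):
    """Return (D', r^2) for two biallelic loci.

    d : disequilibrium coefficient, assumed to be P(haplotype) - p * q
    p : allele frequency at the marker locus
    q : allele frequency at the disease locus
    """
    if not (0.0 < p < 1.0 and 0.0 < q < 1.0):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    # Largest |D| compatible with non-negative haplotype frequencies
    if d >= 0:
        d_max = min(p * (1.0 - q), (1.0 - p) * q)
    else:
        d_max = min(p * q, (1.0 - p) * (1.0 - q))
    d_prime = d / d_max  # signed; take abs() if only the magnitude is needed
    r_squared = d * d / (p * (1.0 - p) * q * (1.0 - q))
    return d_prime, r_squared
```

Because both measures are deterministic functions of D and the allele frequencies, an estimate of D together with estimates of p and q yields D’ and r^{2} directly, which is the sense in which properties of the estimator of D pass on to the transformed parameters.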
Conclusions
We have developed a novel likelihood-based statistical approach to model linkage disequilibrium between any genetic marker locus and a putative disease locus in a randomly mating population, and to infer the disequilibrium parameter and other population genetic parameters from case and control samples drawn from that population. The model and the likelihood-based approach were implemented to re-analyze large SNP datasets of Parkinson’s disease case and control samples collected from multiple human cohorts. Statistical properties and limitations of the approach were investigated through simulation studies. Based on the analysis of the simulated data and of the Parkinson’s disease case and control samples, we demonstrate that the likelihood-based approach outperforms the trend test and logistic regression methods, which are widely implemented in the GWAS literature, with increased statistical power and a reduced false positive rate.
Declarations
Acknowledgements
We thank Dr. Thomas Gasser of Neurodegenerative Diseases and German Center for Neurodegenerative Diseases (Germany) and Dr. Andrew B Singleton at National Institute on Aging (NIH, USA) for allowing us to re-analyze the Parkinson’s disease datasets. We thank two anonymous reviewers for their comments and suggestions which have been useful for improving presentation of the paper. This study was supported by research grants from the Leverhulme Trust (UK) and The National Basic Research Program of China (2012CB316505). ZL is also supported by China’s National Natural Science Foundation.
Authors’ Affiliations
References
- Atwell S, Huang YS, Vilhjalmsson BJ, Willems G, Horton M, Li Y, Meng D, Platt A, Tarone AM, Hu TT: Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature. 2010, 465: 627-631. 10.1038/nature08800.
- Simon-Sanchez J, Schulte C, Bras JM, Sharma M, Gibbs JR, Berg D, Paisan-Ruiz C, Lichtner P, Scholz SW, Hernandez DG: Genome-wide association study reveals genetic risk underlying Parkinson’s disease. Nat Genet. 2009, 41 (12): 1308-U1368. 10.1038/ng.487.
- Valdar W, Solberg LC, Gauguier D, Burnett S, Klenerman PO, Cookson W, Taylor MS, Rawlins JNP, Mott R, Flint J: Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat Genet. 2006, 38 (8): 879-887. 10.1038/ng1840.
- Astle W, Balding DJ: Population structure and cryptic relatedness in genetic association studies. Stat Sci. 2009, 24 (4): 451-471. 10.1214/09-STS307.
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155 (2): 945-959.
- Yu JM, Pressoir G, Briggs WH, Bi IV, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB: A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006, 38 (2): 203-208. 10.1038/ng1702.
- Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55 (4): 997-1004. 10.1111/j.0006-341X.1999.00997.x.
- Wang M, Jia T, Jiang N, Wang L, Luo Z: Inferring linkage disequilibrium from non-random samples. BMC Genomics. 2010, 11: 328. 10.1186/1471-2164-11-328.
- Sasieni PD: From genotypes to genes: doubling the sample size. Biometrics. 1997, 53 (4): 1253-1261. 10.2307/2533494.
- Wolfram S: Mathematica: a system for doing mathematics by computer. 1991, Redwood City: Addison-Wesley, Second Edition.
- Riley KF: Mathematical methods for the physical sciences. 1978, Cambridge, London: Cambridge University Press.
- Jackson MR, Genin E, Knapp M, Escary JL: Accurate power approximations for chi(2)-tests in case–control association studies of complex disease genes. Ann Hum Genet. 2002, 66: 307-321. 10.1046/j.1469-1809.2002.00120.x.
- Schaid DJ, Jacobsen SJ: Biased tests of association: comparisons of allele frequencies when departing from Hardy-Weinberg proportions. Am J Epidemiol. 1999, 149 (8): 706-711. 10.1093/oxfordjournals.aje.a009878.
- Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW: Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010, 42 (7): 565-569. 10.1038/ng.608.
- Luo ZW: Detecting linkage disequilibrium between a polymorphic marker locus and a trait locus in natural populations. Heredity. 1998, 80 (2): 198-208. 10.1046/j.1365-2540.1998.00275.x.
- Petroski MD, Deshaies RJ: Function and regulation of cullin-RING ubiquitin ligases. Nat Rev Mol Cell Biol. 2005, 6 (1): 9-20.
- Shimura H, Hattori N, Kubo S, Mizuno Y, Asakawa S, Minoshima S, Shimizu N, Iwai K, Chiba T, Tanaka K: Familial Parkinson disease gene product, parkin, is a ubiquitin-protein ligase. Nat Genet. 2000, 25 (3): 302-305. 10.1038/77060.
- Mizuta I, Tsunoda T, Satake W, Nakabayashi Y, Watanabe M, Takeda A, Hasegawa K, Nakashima K, Yamamoto M, Hattori N: Calbindin 1, fibroblast growth factor 20, and α-synuclein in sporadic Parkinson’s disease. Hum Genet. 2008, 124 (1): 89-94. 10.1007/s00439-008-0525-5.
- Khan NL, Jain S, Lynch JM, Pavese N, Abou-Sleiman P, Holton JL, Healy DG, Gilks WP, Sweeney MG, Ganguly M: Mutations in the gene LRRK2 encoding dardarin (PARK8) cause familial Parkinson’s disease: clinical, pathological, olfactory and functional imaging and genetic data. Brain. 2005, 128 (12): 2786-2796. 10.1093/brain/awh667.
- Lill CM, Roehr JT, McQueen MB, Kavvoura FK, Bagade S, Schjeide B-MM, Schjeide LM, Meissner E, Zauft U, Allen NC: Comprehensive research synopsis and systematic meta-analyses in Parkinson’s disease genetics: the PDGene database. PLoS Genet. 2012, 8 (3): e1002548. 10.1371/journal.pgen.1002548.
- Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003, 100 (16): 9440-9445. 10.1073/pnas.1530509100.
- Kruger R, Kuhn W, Muller T, Woitalla D, Graeber M, Kosel S, Przuntek H, Epplen JT, Schols L, Riess O: Ala30Pro mutation in the gene encoding alpha-synuclein in Parkinson’s disease. Nat Genet. 1998, 18 (2): 106-108. 10.1038/ng0298-106.
- Zabetian CP, Hutter CM, Factor SA, Nutt JG, Higgins DS, Griffith A, Roberts JW, Leis BC, Kay DM, Yearout D: Association analysis of MAPT H1 haplotype and subhaplotypes in Parkinson’s disease. Ann Neurol. 2007, 62 (2): 137-144. 10.1002/ana.21157.
- Little RA, Rubin DB: Statistical analysis with missing data. 1987, New York: John Wiley & Sons.
- Chatterjee N, Chen Y, Luo S, Carroll R: Analysis of case–control association studies: SNPs, imputation and haplotypes. Stat Sci. 2009, 24 (4): 489-502. 10.1214/09-STS297.
- Wang T, Jacob H, Ghosh S, Wang X, Zeng Z-B: A joint association test for multiple SNPs in genetic case–control studies. Genetic Epidemiol. 2009, 33 (2): 151-163. 10.1002/gepi.20368.
- Hill WG, Weir BS: Maximum-likelihood estimation of gene location by linkage disequilibrium. Am J Hum Genet. 1994, 54 (4): 705-714.
- Devlin B, Risch N: A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics. 1995, 29: 311-322. 10.1006/geno.1995.9003.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.