 Methodology article
 Open Access
 Published:
LPG: A fourgroup probabilistic approach to leveraging pleiotropy in genomewide association studies
BMC Genomics volume 19, Article number: 503 (2018)
Abstract
Background
To date, genomewide association studies (GWAS) have successfully identified tens of thousands of genetic variants among a variety of traits/diseases, shedding light on the genetic architecture of complex disease. The polygenicity of complex diseases is a widely accepted phenomenon through which a vast number of risk variants, each with a modest individual effect, collectively contribute to the heritability of complex diseases. This imposes a major challenge on fully characterizing the genetic bases of complex diseases. An immediate implication of polygenicity is that a much larger sample size is required to detect individual risk variants with weak/moderate effects. Meanwhile, accumulating evidence suggests that different complex diseases can share genetic risk variants, a phenomenon known as pleiotropy.
Results
In this study, we propose a statistical framework for Leveraging Pleiotropic effects in largescale GWAS data (LPG). LPG utilizes a variational Bayesian expectationmaximization (VBEM) algorithm, making it computationally efficient and scalable for genomewidescale analysis. To demonstrate the advantages of LPG over existing methods that do not leverage pleiotropy, we conducted extensive simulation studies and applied LPG to analyze two pairs of disorders (Crohn’s disease and Type 1 diabetes, as well as rheumatoid arthritis and Type 1 diabetes). The results indicate that by levelaging pleiotropy, LPG can improve the power of prioritization of risk variants and the accuracy of risk prediction.
Conclusions
Our methodology provides a novel and efficient tool to detect pleiotropy among GWAS data for multiple traits/diseases collected from different studies. The software is available at https://github.com/Shufeyangyi2015310117/LPG.
Background
Genomewide association studies (GWAS) have reported more than 51,000 singlenucleotide polymorphisms (SNPs) to be significantly associated with complex human phenotypes, including quantitative traits and complex diseases (Accession of the GWAS Catalog database [1] on October, 2017). Although the discovery of genetic risk variants has advanced our understanding of the genetic architecture of complex diseases/traits, these variants explain only a small proportion of phenotypic variance [2]. For example, while the heritability of human height has been estimated to be approximately 70%80%, the 697 genetic variants found by a GWAS analysis of human height based on 253,288 individuals explain only 20% of the heritability of human height. A more complete characterization of the genetic architecture of complex phenotypes remains a significant challenge.
To increase the statistical power of a GWAS analysis, newer analytical methods leveraging pleiotropy have been developed. Pleiotropy, which refers to the situation where a gene affects multiple phenotypes, was first proposed more than 100 years ago [3]. Since then, an increasing number of human genetic studies have reported pleiotropic effects in various complex diseases, such as autoimmune diseases [4], metabolic disorders [5] and psychiatric disorders [6]. Thus, the identification of genetic risk variants in GWAS can be significantly improved by incorporating pleiotropy into the statistical analysis. Existing statistical methods for GWAS analysis that incorporate pleiotropy involve the joint GWAS analysis of multiple traits [7–9]. However, these methods assume that the individuallevel GWAS data for each trait were collected from the same study cohort, and the methods cannot be applied when the individuallevel GWAS data were collected from different study cohorts. Alternatively, when summary statistics derived from GWAS analysis conducted in different study cohorts are available, methods proposed to leverage pleiotropy via GWAS summary statistics can be utilized [10–12]. Thus, a methodological gap in leverage pleiotropy for joint GWAS analysis of multiple traits using individuallevel GWAS data for each trait from different study cohorts remains.
In this article, we propose a novel statistical method for leveraging pleiotropic effects in GWAS (LPG), where individuallevel data for two traits are obtained from different studies. LPG provides a statistical framework for the evaluation of the local false discovery rate and prediction accuracy and a formal test of the pleiotropic effects between two traits. LPG utilizes a variational Bayesian expectationmaximization (VBEM) algorithm, making it computationally efficient for genomewide analysis. We conducted extensive simulation studies to evaluate the performance of LPG. We then applied LPG to conduct a joint analysis of two pairs of disorders (Crohn’s disease and Type 1 diabetes, as well as rheumatoid arthritis and Type 1 diabetes) using data from the Welcome Trust Case Control Consortium (WTCCC) [13]. The simulation studies and real data analyses suggest that LPG can steadily improve both the prediction accuracy and the statistical power of risk variant identification compared to those of singletraitbased methods that do not leverage pleiotropy.
The remainder of this article is organized as follows. First, we introduce the statistical model and describe the VBEM algorithm used to estimate the parameters in the model. Second, we describe the statistical inference procedure used to evaluate the local false discovery rate and the prediction accuracy of the identified genetic variants. We also describe a formal hypothesis test for pleiotropy. Third, we evaluate the performance of LPG using simulations and real data analysis of WTCCC data. Finally, we conclude with a discussion.
Methods
Model for quantitative traits
Suppose that we have a GWAS data set {y,X} with n independent samples, where \({\mathbf {y}}\in \mathbb {R}^{n}\) is the vector of quantitative phenotype, and \(\mathbf {X}=[\mathbf {x}_{1},\dots,\mathbf {x}_{p}]\in \mathbb {R}^{n\times p}\) is the genotype matrix for n individuals and p SNPs. Without loss of generality, we assume that both X and y have been centered. We assume the following standard linear model,
where β=[β_{1},…,β_{p}]^{⊤} is a vector of effect sizes, and \({\mathbf {e}}\sim \mathcal {N}\left (\boldsymbol {0},\sigma _{e}^{2}{\mathbf {I}}\right)\) is the random error. Let the vector of binary variables γ=[γ_{1},…,γ_{p}]^{⊤} indicate the association status of all p SNPs, where γ_{j}=1 indicates that the jth SNP is associated with trait y, and γ_{j}=0 otherwise. In this paper, we consider a spikeslab prior [14]. Many sparse priors can be employed in the context of Bayesian variable selection. However, the spikeslab prior is perfectly adapted to the variational expectationmaximization algorithm because after reparameterization, we are able to derive closedform formulas for the variational expectationmaximization algorithm.
where Ber(α) is a Bernoulli distribution with probability Pr(γ_{j}=1)=α, and \(\mathcal {N}(m,\sigma ^{2})\) denotes a Gaussian distribution with mean m and variance σ^{2}. In Eq (2), α represents the true (unknown) proportion of genetic variants associated with trait y (nonnull group of genetic variants), and 1−α represents the true (unknown) proportion of genetic variants not associated with trait y (null group of genetic variants). The model (2) is known as a binary mask model because we can consider the indicator γ_{j} to be masking the coefficient β_{j}. Then, the probabilistic model can be written as
where \(\boldsymbol {\theta } = \left \{\sigma _{\beta }^{2},\sigma _{e}^{2},\alpha \right \}\) is the collection of model parameters, \(\sigma _{\beta }^{2}\) depicts the variance of the genetic effects, and \(\sigma _{e}^{2}\) is the variance of the random errors. We note that in our model, the parameters \(\sigma _{\beta }^{2},\sigma _{e}^{2}\) and α are considered to be fixed but unknown and are estimated as part of the model. This is in contrast to β and γ, which are not considered to be fixed but have prior and posterior distributions.
Now, we generalize the above twogroup model to leverage the pleiotropy between two traits that are potentially genetically correlated. Suppose we have two GWAS datasets {y_{1},X_{1}} and {y_{2},X_{2}} with n_{1} and n_{2} samples, respectively. Here, \(\mathbf {y}_{1} \in \mathbb {R}^{n_{1}}\) and \(\mathbf {y}_{2} \in \mathbb {R}^{n_{2}}\) are the vectors of phenotypic values, and \(\mathbf {X}_{1}=[\mathbf {x}_{11},\dots,\mathbf {x}_{1p}]\in \mathbb {R}^{n_{1}\times p}\) and \(\mathbf {X}_{2}=[\mathbf {x}_{21},\dots,\mathbf {x}_{2p}]\in \mathbb {R}^{n_{2}\times p}\) are the corresponding genotype matrices for p identical SNPs. Without loss of generality, we assume that both the genotype data (X_{1} and X_{2}) and phenotype data (y_{1} and y_{2}) have been centered. Then, we have
where k=1,2 refers to individual studies 1 and 2, β_{k}=[β_{k1},…,β_{kp}]^{⊤} is a vector of effect sizes for study k, and \(\sigma _{e_{k}}^{2}\) is the variance of the random error in study k. Compared with traditional linear regression, the latent vector of binary variables γ_{k}=[γ_{k1},…,γ_{kp}]^{⊤} indicates the association statuses in study k, and \(\boldsymbol {\gamma }=[\boldsymbol {\gamma }_{1},\boldsymbol {\gamma }_{2}] \in \mathbb {R}^{p \times 2}\) is a matrix of the association statuses in the two studies. For mixture proportions, α=(α_{00},α_{01},α_{10},α_{11})^{⊤} is the vector of parameters in a multinomial distribution, and Mu_{l∈L}(α) is the multinomial distribution with parameter α for each possible value of L={00,01,10,11}, i.e., α_{00}=Pr(γ_{1j}=0,γ_{2j}=0), α_{10}=Pr(γ_{1j}=1,γ_{2j}=0), α_{01}=Pr(γ_{1j}=0,γ_{2j}=1), and α_{11}=Pr(γ_{1j}=1,γ_{2j}=1).
When comparing model (4) with the basic model (2) for a single trait, the major difference lies in the joint sampling of hidden association statuses in the joint model of the two studies. In the presence of pleiotropy, γ_{1j} and γ_{2j} are no longer independent. We demonstrate that all the parameters in our model can be adaptively estimated from the data without ad hoc tuning. Let \(\boldsymbol {\theta } (= \{\sigma _{\beta _{1}}^{2},\sigma _{\beta _{2}}^{2},\sigma _{e_{1}}^{2},\sigma _{e_{2}}^{2},\boldsymbol {\alpha }\})\) be the collection of model parameters. The joint probabilistic model can be written as
By marginalizing over the latent variables (β_{1},β_{2},γ_{1},γ_{2}), the probabilistic model of observed data becomes
where we have used the operation \(\sum \) to represent the integration of continuous variables. Then, according to Bayes rule, the posterior probability distributions for the variables of interest can be calculated as
Computing the posterior distribution (7) is difficult because it requires the evaluation of the marginal likelihood (6), which is computationally intractable.
Algorithm for the quantitative trait model
To overcome the intractability of the marginal likelihood (6), we derive an efficient algorithm based on variational inference, which makes our model scalable to genomewide data analysis (see supplementary document for details). The key idea is that we use Jensen’s inequality to iteratively obtain an adjustable lower bound on the marginal log likelihood [15]. First, we consider a lower bound of the logarithm of the marginal likelihood (6),
where
Note that \(\mathbb {KL}(qp)\) is the KullbackLeibler (KL) divergence and satisfies \(\mathbb {KL}(qp)\ge 0\), with the equality holding if, and only if, the variational posterior probability (q) and the true posterior probability (p) are equal. Similar to the expectationmaximization (EM) algorithm, we can maximize the lower bound \(\mathcal {L}(q,\boldsymbol {\theta })\) with respect to the variational distribution q, which is equivalent to minimizing the KL divergence [16]. To make the evalution of the lower bound computationally effecient, we use meanfield theory [17] and assume that q(β_{1},β_{2},γ_{1},γ_{2}) can be factorized as
No additional assumptions on the posterior distribution are required. This factorization (9) is used as an approximation for the posterior distribution Pr(β_{1},β_{2},γ_{1},γ_{2}y_{1},y_{2},X_{1},X_{2};θ). This fully factorized approximating distribution was first proposed by Logsdon et al. [18] in the context of GWAS. The factorization used in the approximating distribution makes the corresponding variational expectationmaximization algorithm scalable to large sample sizes and large numbers of SNPs. We expect the approximation to perform best when the genetic variants are independent. Nevertheless, our numerical studies demonstrate that the approximation is sufficient even when linkage disequilibrium exists between the genetic variants.
By means of the properties of the factorized distributions in variational inference [16], we can obtain the optimal approximation via the following formula:
where the expectation is taken with respect to all the other factors \(\phantom {\dot {i}\!}\{q_{j^{\prime }}(\beta _{1j^{\prime }},\beta _{2j^{\prime }},\gamma _{1j^{\prime }},\gamma _{2j^{\prime }})\}\) for j^{′}≠j. After some derivations (details in the supplementary document), we have
where α_{lj} is the posterior probability of [γ_{1j},γ_{2j}]=l, f_{0}(β_{kj}) is the posterior distribution of β_{kj} when γ_{kj}=0, and f_{kj}(β_{kj}) is the posterior distribution of β_{kj} under γ_{kj}=1. Following algebraic manipulation, we show that f_{0}(β_{kj}) and f_{kj}(β_{kj}) are the density functions of Gaussian distributions \(\mathcal {N}(0,\sigma _{\beta _{k}}^{2})\) and \(\mathcal {N}(\mu _{kj},s_{kj}^{2})\).
The details of the derivation of the updating equations and the corresponding VBEM algorithm (Algorithm 1) can be found in the supplementary document. The VBEM algorithm performs similarly to the coordinate descent algorithm, which comes from the factorization of the variational distribution (11). Hence, the VBEM algorithm developed here is scalable to large numbers of individuals and large numbers of SNPs..
Accommodating casecontrol data
Suppose that we have two GWAS casecontrol datasets {y_{1},X_{1},Z_{1}} and {y_{2},X_{2},Z_{2}} with n_{1} and n_{2} samples, respectively. We may apply the definitions introduced earlier with \(\mathbf {y}_{k} \in \mathbb {R}^{n_{k}\times 1}\) as the vector indicating disease status having values 1 and 1 for controls and cases, respectively, and \(\mathbf {Z}_{k}=[\mathbf {z}_{k1},\dots,\mathbf {z}_{kp_{0}}]\in \mathbb {R}^{n_{k}\times p_{0}}\) as a matrix of the p_{0} covariates in study k. Note that the first column of Z_{k} is a vector of ones corresponding to the intercept. Then, conditional on the observed genotype X_{k}, hidden status γ, and effects β_{k}, we have
where \(\boldsymbol {\delta }_{k} = [\delta _{k1},\dots,\delta _{kn_{k}}]^{\top }\), \(\delta _{ki} \left (= \text {Pr}(y_{ki}=1{\mathbf {X}}_{k}, \boldsymbol {\beta }_{k}, \boldsymbol {\gamma }_{k}) = \frac {1}{1+e^{y_{ki}\eta _{ki}}}\right)\) is the sigmoid function of linear predictor η_{ki}, i is the index for individuals, and \(\boldsymbol {\eta }_{k} (= [\eta _{k1},\dots,\eta _{kn_{k}}]^{\top }\in \mathbb {R}^{n_{k}\times 1})\) is the linear predictor of all the individuals in study k such that \(\boldsymbol {\eta }_{k} = \sum _{j=1}^{p_{0}}\mathbf {z}_{kj}\phi _{kj} + \sum _{j=1}^{p} \gamma _{kj}\beta _{kj}\mathbf {x}_{kj}\). Here, we include fixedeffect covariates in the binary studies to adjust for potential population stratification and confounders in samples. β and γ are the effect sizes and indicator variables as defined earlier. Let \(\boldsymbol {\theta } = \{\sigma _{\beta _{1}}^{2}, \sigma _{\beta _{2}}^{2},\boldsymbol {\phi }_{1}, \boldsymbol {\phi }_{2}, \boldsymbol {\alpha }\}\) be the collection of model parameters. The probabilistic model can be written as
Note that we take the coefficients for covariates (Z_{1} and Z_{2}) as fixed effects, which are included in the parameter space θ. By marginalizing over latent variables (β_{1},β_{2},γ_{1},γ_{2}), we can obtain the marginal likelihood, similar to expression (6). The primary difficulty for the binary model (13) comes from the evaluation of the sigmoid function δ_{ki}. As there is no convenient conjugate prior for the sigmoid function, it is not analytically feasible to compute the full posterior distribution over the parameter space. To overcome this limitation, we use the Bohning bound [19]. Here, we first derive a lower bound of the completedata likelihood as follows
where \(B(\mathbf {y}_{k}\mathbf {X}_{k},\mathbf {Z}_{k},\boldsymbol {\beta }_{k},\boldsymbol {\gamma }_{k};\tilde {\boldsymbol {\theta }}) \left (= \prod _{i=1}^{n_{k}}\exp (\frac {1}{2}a\eta _{ki}^{2}y_{ki}^{2}+(1+b_{ki})\eta _{ki}y_{ki}  c_{ki})\right)\) denotes the product of the lower bound of sigmoid functions with a=1/4, \(\phantom {\dot {i}\!}b_{kn} = a\psi _{kn}  (1 + e^{\psi _{kn}})^{1}\) and \(c_{kn} = \frac {1}{2}a\psi _{kn}^{2}  (1 + e^{\psi _{kn}})^{1}\psi _{kn} + \log (1 + e^{\psi _{kn}})\), and \(\tilde {\boldsymbol {\theta }} =\{\sigma _{\beta _{1}}^{2},\sigma _{\beta _{2}}^{2},\boldsymbol {\phi }_{1},\boldsymbol {\phi }_{2},\boldsymbol {\alpha },\boldsymbol {\psi }_{1},\boldsymbol {\psi }_{2}\}\) is the new parameter that combines the model parameters θ with the variational parameters ψ_{1},ψ_{2} (details are provided in the supplementary document). Using Jensen’s inequality and the lower bound of the completedata likelihood (14), we have the following lower bound
where the first inequality is based on the Bohning bound and the second follows from Jensen’s inequality as in lower bound (8). By maximizing the lower bound (15) with respect to μ_{kj} and \(s_{kj}^{2}\), we can obtain the variational distribution in the same fashion as in expression (11). The details of the updating equation and the corresponding VBEM algorithm (Algorithm 2) are given in the supplementary document.
Statistical inference
Evaluation of the local false discovery rate (lfdr)
After fitting an LPG model with all the parameters estimated, SNPs can be prioritized based on their local false discovery rates (lfdr) [20]. As discussed in [21], although false discovery rate (FDR) methods were developed in a strict frequentist framework, they also have a convincing Bayesian rationale. Since \(\sum _{l\in L_{k}} \alpha _{lj}\) is a good approximation for the true posterior Pr(γ_{kj}=1y_{1},y_{2},X_{1},X_{2};θ), \(\text {lfdr}_{kj} (= 1  \sum _{l\in L_{k}} \alpha _{lj})\) can be used as the lfdr of SNP j in the kth trait, where k=1 or 2, L_{1}={10,11} and L_{2}={01,11}. Namely, the smaller the lfdr is, the more confident we are in prioritizing a SNP. Then, we use the direct posterior probability approach [22] to control the global false discovery rate to select a list of SNPs to be as large as possible while bounding the rate of false discoveries by a prespecified threshold τ. With the data and fitted model, we rank the SNPs according to their local false discovery rate in ascending order. We increase the threshold for lfdr ζ from zero to one and find the largest ζ that satisfies
where τ is the prespecified bound of the global FDR, and \(\mathbb {I}(\cdot)\) is an indicator function that returns 1 if the argument is true and 0 otherwise. This progress makes it convenient for users to control the FDR either in terms of the global FDR or lfdr.
Evaluation of prediction performance
In addition to the identification of risk variants, we can also use the LPG approach to conduct risk prediction. In the LPG model, the effect size of SNP j in the kth study is given as \(\mathbb {E}(\gamma _{kj}{\beta }_{kj}) = \sum _{l\in L_{k}}\alpha _{lj}\mu _{kj}\). Given the genotype vector of an individual x_{k}=[x_{k1},…,x_{kp}]^{⊤}, the predicted phenotypic value is \(\hat {y}_{k} = c_{k0} + \sum _{j}\left (({x}_{kj}  c_{kj})\sum _{l\in L_{k}} \alpha _{lj} \mu _{kj}\right)\), where c_{k0} and c_{k1},…,c_{kp} are the mean of the phenotype and each SNP before centering for the kth study, respectively. We measure the Pearson’s correlation between the observed phenotypic values and the predicted phenotypic values in the testing set for quantitative traits. For casecontrol studies, the predicted linear predictor is \(\hat \eta _{k} = \mathbf {z}_{k} \boldsymbol {\phi }_{k} + \sum _{j}\left (({x}_{kj}  c_{kj})\sum _{l\in L_{k}} \alpha _{lj} \mu _{kj}\right)\), and the odds of being a case for such an individual can be found via logit transformation. For the predicted odds from the testing set, we can evaluate the area under the receiver operating characteristic (ROC) curve (AUC) [23].
Hypothesis testing of pleiotropy
It is of great interest to quantify the significance of pleiotropy between two traits. The presence of pleiotropy means that the null and nonnull groups in two traits are not distributed independently. Formally, we can set up a likelihood ratio test (LRT) as follows:
where α_{1∗}=α_{10}+α_{11} and α_{∗1}=α_{01}+α_{11} are marginal probabilities. The LRT statistic is
where \(\hat {\boldsymbol {\theta }}_{0}\) and \(\hat {\boldsymbol {\theta }}\) denote the parameters estimated under the null and alternative hypotheses, respectively. Due to the intractability of the marginal distribution (6), we use the lower bound as a surrogate to approximate the marginal likelihood. Under the null hypothesis, the test statistic λ approximately follows a χ^{2} distribution with df=1.
Results and discussion
We applied the LPG approach to both simulation data and real data. First, we evaluated the performance of the LPG approach using simulation studies. We examined its performance in risk variant identification as measured by AUC, statistical power and FDR and its performance in risk prediction as measured by the Pearson’s correlation and AUC for quantitative traits and binary traits, respectively. We compared the LPG performance with two other singletrait analysis methods that do not leverage pleiotropy, namely, the twogroup model (BVSR [24]) and Lasso [25]. The number of the replicates in simulation studies was 50 unless otherwise specified.
Simulation settings
The simulation datasets were generated by simulating genotype matrices X_{k} (k=1,2) from a normal distribution, where an autoregressive correlation (AR) \(\phantom {\dot {i}\!}\rho ^{jj^{\prime }}\) structure was used to mimic the linkage disequilibrium (LD) between variants j and j^{′} with ρ = 0.2, 0.5 and 0.7. Next, the entries of both X_{k} (k=1,2) were discretized to obtain genotypes {0,1,2} according to the HardyWeinberg equilibriumbased minor allele frequencies, which were drawn from a uniform distribution of [0.05,0.5]. In all scenarios, unless otherwise specified, the sample size used was n_{k}=3000 (k=1,2) and the number of variants was set to p=20,000. To evaluate the prediction performance, we generated an additional n_{test}=500 samples for each study under the same model. For all scenarios, except for those specifically evaluating Type 1 error rates for the test of pleiotropy, we assumed α_{01}=α_{10}. Denote the proportions of the null and nonnull SNPs of both GWAS as α_{0}=α_{00}+α_{10}=α_{00}+α_{01} and α_{1}=α_{11}+α_{10}=α_{11}+α_{01}, respectively. Then, the hidden association status in the first study (γ_{1}) can be sampled randomly with the number of nonzero entries – pα_{1}. α_{1} is set to 0.005 for the quantitative traits and 0.0025 for the binary traits. To account for pleiotropy between two GWAS, we controlled the number of SNPs with pleiotropic effects for the two traits as p(α_{11}+α_{10})(α_{11}+α_{01}+g(α_{10}+α_{00})) and p(α_{11}+α_{01})((α_{11}+α_{10}+g(α_{01}+α_{00})), where g=0 and g=1 correspond to the two extreme cases of no pleiotropy and full pleiotropy, respectively. We considered g=0 to 1 in intervals of 0.2, where larger values of g represent larger pleiotropy. Next, the effect sizes β were simulated from \(\mathcal {N}(0,1)\). For the quantitative traits, as the heritability of each study was defined as \(h_{k}^{2} = \frac {\text {Var}(\mathbf {X}_{k}\boldsymbol {\beta }_{k} \boldsymbol {\gamma }_{k})}{\text {Var}(\mathbf {X}_{k}\boldsymbol {\beta }_{k} \boldsymbol {\gamma }_{k}) + \sigma _{e_{k}}^{2}}\), the noise level was chosen to control the heritability at 0.3, 0.4 and 0.5. For the binary traits, the heritability was also defined as \(h_{k}^{2} = \frac {\text {Var}(\mathbf {X}_{k}\boldsymbol {\zeta }_{k} \boldsymbol {\gamma }_{k})}{\text {Var}(\mathbf {X}_{k}\boldsymbol {\zeta }_{k} \boldsymbol {\gamma }_{k}) + \sigma _{e_{k}}^{2}}\), except for the effect size β_{k} was replaced by ζ_{k}. We set the population prevalence to 0.1 and casecontrol ratio to 1 while controlling heritability at 0.3, 0.4 and 0.5 using a liability model [26]. To evaluate the Type 1 error rate of our proposed test of pleiotropy (i.e., g=0), we considered different values of α=(α_{00},α_{01},α_{10},α_{11})^{⊤}. The values of α=(α_{00},α_{01},α_{10},α_{11})^{⊤} are given in Additional file 1: Figure S20. We also considered simulation studies where the true effect sizes β were generated from either a truncated normal distribution or a tdistribution for a quantitative trait (ρ=0.5 and h^{2} = 0.5). Finally, to more accurately mimic the LD and minor allele frequency patterns present in real data, we excerpted a subset of variants from real dataset (KAISER, dbGaP Study Accession: phs000674.v2.p2) and conducted a simulation study in which we sampled the genotypes from these data. We considered a binary outcome generated using a logistic regression model with casecontrol sampling (instead of using a liability model as above), with n_{1}=n_{2} = 7000. For each simulation, we randomly selected 10 causal SNPs, where half of the causal SNPs had odds ratios of =e^{0.25}=1.28 and half had odds ratios of =e^{−0.25}=0.78. The causal SNPs were randomly selected such that they had at most a moderate correlation with each other (correlation < 0.8). However, the tested SNPs could be highly correlated (correlation > 0.8) with each other. Additionally, when calculating the true positive rate (for the AUC and power), each of the 10 causal SNPs was considered to have been correctly discovered if either the causal SNP or an SNP in high LD with it (correlation > 0.8) was discovered (with the global FDR controlled at 0.2). When calculating the true negative rate, the collection of true nullSNPs excluded the 10 causal SNPs and any SNP in high LD (correlation > 0.8) with any of the 10 causal SNPs. This mimics the situation in GWAS where the identified SNPs may or may not be causal but are capturing the “signal” from the true causal SNPs.
Simulation results
For both the quantitative and binary traits, we analyzed the simulated data using the proposed LPG jointly on two traits in comparison with other alternative methods, including BVSR and Lasso, on each separate trait. For the probabilistic approaches, i.e., LPG and BVSR, we evaluated their risk variant identification performance using the area under the receiver operating characteristic (ROC) curve (AUC), statistical power, and false discovery rate (FDR). Note that for all settings, we evaluated the statistical power to identify risk variants with the global FDR controlled at 0.2. As Lasso is a deterministic approach and its FDR is not controllable, we did not evaluate its statistical power. The tuning parameter in Lasso was chosen via 5fold crossvalidation [27]. We evaluated the risk prediction performance based on the Pearson’s correlation between the observed phenotypic values and the predicted values in the testing datasets for quantitative traits; AUC was used to measure the classification accuracy performance for binary outcomes.
For the quantitative traits, Fig. 1 shows the risk variant identification and prediction performance for ρ=0.5 and h^{2}=0.5. It demonstrates that LPG, which incorporats the pleiotropy between two traits, improves the risk SNP identification compared with the singletrait analysis (BVSR). In particular, when there is no pleiotropy (g=0), the performance of LPG is the same as that of the singletrait analysis (BVSR), suggesting that LPG can exploit available pleiotropic information. Another observation is that the risk SNP identification performance (AUC and statistical power) of LPG improved with increasing proportion of shared risk SNPs. Additionally, the probabilistic approaches (LPG and BVSR) outperformed Lasso in terms of risk SNP identification, regardless of the presence of pleiotropy, as Lasso does not leverage pleiotropy between two traits and its performance depends on the extent of sparsity and strong signals. The FDR rates of both probabilistic models (LPG and BVSR) were wellcontrolled at the target 0.2 level. In terms of prediction performance, as pleiotropy became stronger, the Pearson’s correlation coefficients between the observed and predicted phenotypic values in LPG increased slightly over those of BVSR. For the binary traits, we observed similar results (shown in Fig. 2 for ρ=0.5 and h^{2}=0.5). First, the improved AUC and statistical power of LPG increased as the strength of pleiotropy increased, and the global FDR rates of both LPG and BVSR were wellcontrolled. The prediction performance of LPG showed a slight improvement over that of BVSR when pleiotropy was strong. In our simulation studies, the performance of Lasso was worse than that of its probabilistic counterpart, BVSR. A similar observation was previously reported [28]. Additional simulation results under different configurations of ρ (strength of the correlation between genetic variants) and h^{2} (heritability) (Additional file 1: Tables S1 and S2) produced similar conclusions (Additional file 1: Figures S1  S18). We also conducted simulation studies (Additional file 1: Figures S21  S23) where the true effect sizes β were generated from either a truncated normal distribution or a tdistribution (quantitative trait, ρ=0.5 and h^{2} = 0.5). The results demonstrate that LPG performs well even when the underlying generating distribution for the effect sizes β differ from our assumed prior distribution for β. The simulation results when the genotypes were sampled from real data are given in Additional file 1: Figure S24. The simulation results demonstrate that our proposed method also performs well in this setting as well. We evaluated the Type 1 error and power of the hypothesis test for pleiotropy at a nominal 0.05 level. As expected, the power of the test increases with increasing pleiotropy (increasing g) for both quantitative and binary traits (Additional file 1 Figure S19). The empirical Type 1 error rates (g=0) for various configurations of α=(α_{00},α_{01},α_{10},α_{11})^{⊤} were close to the nominal 0.05 level (Additional file 1: Figure S20).
Real data analysis
Crohn’s disease (CD), rheumatoid arthritis (RA) and type 1 diabetes (T1D) are autoimmune diseases, and previous work suggests they share common genetic risk variants [29]. We applied LPG to the analysis of two pairs of diseases, CD and T1D, as well as RA and T1D, using data reported by the WTCCC [13]. The dataset consists of approximately 2000 cases for CD, RA and T1D and 3000 shared controls, with genotypes at 500,568 SNPs. We performed strict data quality control using plink [30]. First, we removed individuals with missing genotype call rates greater than 2%. For cases from each disease and samples from each control dataset, we removed SNPs with minor allele frequencies smaller than 5% and SNPs with missing rates greater than 1%. We further excluded SNPs with pvalues < 0.001 in the HardyWeinberg equilibrium test for samples in the control groups. In addition, pairs of subjects with estimated relatedness exceeding 0.025% were identified and one individual from each pair was removed at random by GCTA [31].
RA and T1D
Since WTCCC used shared controls among seven diseases and because samples in the control group were from two cohorts (the 1958 British Birth Cohort (58C) and UK Blood Services (UKBS)), we used one control cohort for RA and the other for T1D. After quality control filtering, 240,101 SNPs in 1,812 cases from RA, 1,932 cases from T1D, and 2,897 controls (1,427 controls from 58C and 1,470 controls from UKBS) from the two data sets were retained for the following analysis. First, we conducted the analysis for the 58C controls with RA and the UKBS controls with T1D using LPG and BVSR. The prioritization results are shown in Fig. 3, in addition to a complete list of findings in Table 1, where the lfdr cutoff point is 0.2. As shown in Table 2, the singletrait analysis using BVSR identified 2 SNPs for RA, while a joint analysis using LPG identified 9 SNPs, in addition to the 2 SNPs identified by BVSR, for RA (giving a total of 11 SNPs identified by LPG). There were a few SNPs (e.g., rs10484565) where the joint analysis using LPG gave highly significant pvalues for RA but the separate trait analysis using BVSR gave a pvalue of 1 (Table 1). One possible explanation for this discrepancy is that the effect sizes for these SNPs were smaller and the sample size used in the separate analysis was too small to detect SNPs with smaller effect sizes. For the additional SNPs identified by LPG that were not identified by BVSR, 1 of 9 was reported to be associated with RA in previous studies. rs10484565, within the TAP2 gene was previously reported to be associated with RA [32]. The pvalue for the pleiotropy test was 1.68×10^{−17}, suggesting the existence of pleiotropy between RA and T1D (Table 3). In summary, leveraging the pleiotropic effects enabled LPG to identify more risk SNPs compared to those identified by the singletrait analysis (BVSR). We also evaluated the prediction performance using RA and T1D. Specifically, we quantitatively assessed the risk prediction performance using 10fold cross validation. The prediction accuracies of both LPG and BVSR are shown in Table 2, which shows that the joint analysis of RA and T1D consistently outperformed the separate analysis of each study in terms of prediction accuracy, improving from 62.8% to 64.4% for RA and from 76.7% to 78.3% for T1D. The joint analysis of RA and T1D took 8 to 29 min to complete on a Linux platform with a 2.60 GHz intel Xeon CPU E52690 v3 with 30720 KB cache and 96 GB RAM (Additional file 1: Table S3). To demonstrate the robustness of our LPG, we switched the control cohorts for RA and T1D and repeated the analysis, with similar results.
CD and T1D
After the basic quality control filtering described above, 240,393 SNPs in 1675 cases from CD, 1932 cases from T1D, and 2895 controls (1425 controls from 58C and 1470 controls from UKBS) from the two datasets were used for the analysis. After excluding the MHC region SNPs, leaving a total of 239,931 SNPs, we performed the same four comparisons. Here, we discuss the comparison with 58C controls for CD and UKBS controls for T1D after excluding the MHC region. Manhattan plots are shown in Additional file 1: Figure S37, and all SNP findings are shown in Additional file 1: Table S17 in the supplementary document, where the threshold for lfdr was set to 0.2. As shown in Additional file 1: Table S17, the singletrait analysis using BVSR identified 3 SNPs for CD, while the joint analysis using LPG identified an additional 4 SNPs for CD (giving a total of 7 SNPs identified by LPG). For the SNPs identified by LPG that were not identified by BVSR, 2 (rs6679677 and rs2542151) of 4 were reported to be associated with CD in the GWAS catalog [1]. Overall, the SNP findings are consistent with the published literature. For example, rs11805303 in the IL23R gene was identified to be strongly associated with CD by both methods, consistent with an earlier report by the WTCCC [13]. Additionally, rs17234657 on chromosome 5 was identified to be associated with CD by both LPG and BVSR, a finding previously reported by the WTCCC [13]. Another intergenic SNP, rs2542151, which was identified by LPG but not BVSR, was also previously reported to be significantly associated with CD [13, 33]. The pvalue for the pleiotropy test was 2.96×10^{−2}, suggesting the existence of pleiotropy between CD and T1D (Table 3). The prediction performance of both LPG and BVSR is shown in Table S16 in the Additional file 1: document. The results demonstrate that the prediction of the joint analysis of CD and T1D was slightly better than that of the separate analysis of each study, improving from 58.1 to 58.7% for CD and from 60.1 to 60.3% for T1D. The joint analysis of CD and T1D took 20 to 37 min on a Linux platform with a 2.60 GHz intel Xeon CPU E52690 v3 with 30720 KB cache and 96 GB RAM (Additional file 1: Table S3). The results from the other comparisons are detailed in the supplementary document and were similar to those presented above.
Conclusion
In this article, we proposed a novel statistical framework for leveraging pleiotropy in GWAS data. Compared with a singletraitbased analysis that does not leverage pleiotropy, LPG offers improved statistical power and prediction accuracy in the identification of risk variants. We developed an efficient algorithm based on VBEM, which not only enabled us to evaluate the posterior quantities of interest but also made the evaluation computationally scalable. These advantages make LPG a powerful tool to analyze GWAS data exhibiting pleiotropic effects. In this article, we analyzed two pairs of traits from WTCCC, namely, RA vs T1D and CD vs T1D. The findings reported here are consistent with the published literature.
Despite these advantages, a current limitation of LPG is that it is not applicable to more than two traits. Modeling pleiotropic effects in a combinatorial fashion for more than two traits is challenging as the number of hidden association statuses increases exponentially with the number of traits. LPG was designed to leverage pleiotropy when GWAS data for multiple traits are collected from different study individuals, and LPG therefore complements the earlier methods proposed for incorporating pleiotropy when GWAS data are collected from the same study individuals [8, 9]. However, to date, no method has been proposed for leveraging pleiotropy when the GWAS data for multiple traits are collected from partially shared study samples, indicating an avenue for future work.
Abbreviations
 58C:

The 1958 British Birth Cohort
 AUC:

Area under the curve
 BVSR:

Bayesian variable selection regression
 CD:

Crohn’s disease
 FDR:

False discovery rate
 GWAS:

Genomewide association study
 lfdr:

Local false discovery rate
 RA:

Rheumatoid arthritis
 ROC:

Receiver operating characteristic
 SNP:

Single nucleotide polymorphism
 T1D:

Type 1 diabetes
 UKBS:

UK Blood Services
 VBEM:

Variational Bayesian expectationmaximization
 WTCCC:

Welcome Trust Case Control Consortium
References
GWAS Catalog. http://www.ebi.ac.uk/gwas/home. Accessed 28 Oct 2017.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009; 461(7265):747.
Wagner GP, Zhang J. The pleiotropic structure of the genotype–phenotype map: the evolvability of complex organisms. Nat Rev Genet. 2011; 12(3):204–13.
Cotsapas C, Voight BF, Rossin E, Lage K, Neale BM, Wallace C, et al. Pervasive sharing of genetic effects in autoimmune disease. PLos Genet. 2011; 7(8):e1002254.
Kraja AT, Chasman DI, North KE, Reiner AP, Yanek LR, Kilpeläinen TO, et al. Pleiotropic genes for metabolic syndrome and inflammation. Mol Genet Metab. 2014; 112(4):317–38.
Wang Q, Yang C, Gelernter J, Zhao H. Pervasive pleiotropy between psychiatric disorders and immune disorders revealed by integrative analysis of multiple GWAS. Hum Genet. 2015; 134(1112):1195–209.
Bottolo L, ChadeauHyam M, Hastie DI, Zeller T, Liquet B, Newcombe P, et al. GUESSing polygenic associations with multiple phenotypes using a GPUbased evolutionary stochastic search algorithm. PLoS Genet. 2013; 9(8):e1003657.
Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genomewide association studies. Nat Methods. 2014; 11(4):407–9.
Liu J, Yang C, Shi X, Li C, Huang J, Zhao H, et al. Analyzing Association Mapping in PedigreeBased GWAS Using a Penalized Multitrait Mixed Model. Genet Epidemiol. 2016; 40(5):382–93.
Andreassen OA, Djurovic S, Thompson WK, Schork AJ, Kendler KS, O’Donovan MC, et al. Improved detection of common variants associated with schizophrenia by leveraging pleiotropy with cardiovasculardisease risk factors. Am J Hum Genet. 2013; 92(2):197–209.
Chung D, Yang C, Li C, Gelernter J, Zhao H. GPA: a statistical approach to prioritizing GWAS results by integrating pleiotropy and annotation. PLos Genet. 2014; 10(11):e1004787.
Liu J, Wan X, Ma S, Yang C. EPS: An empirical Bayes approach to integrating pleiotropy and tissuespecific information for prioritizing risk genes. Bioinformatics. 2016; 32(12):1856–64.
Burton PR, Clayton DG, Cardon LR, Craddock N, Deloukas P, Duncanson A, et al. Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007; 447(7145):661–78.
Kuo L, Mallick B. Variable selection for regression models. Sankhyā: Indian J Stat Ser B. 1998; 60(1):65–81.
Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK. An introduction to variational methods for graphical models. Mach Learn. 1999; 37(2):183–233.
Bishop CM. Pattern recognition. Mach Learn. 2006; 128:1–58.
Opper M, Saad D. Advanced mean field methods: Theory and practice. Cambridge: MIT press; 2001.
Logsdon BA, Hoffman GE, Mezey JG. A variational Bayes algorithm for fast and accurate multiple locus genomewide association analysis. BMC Bioinformatics. 2010; 11(1):58.
Böhning D. Multinomial logistic regression algorithm. Ann Inst Stat Math. 1992; 44(1):197–200.
Efron B. LargeScale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction (Institute of Mathematical Statistics Monographs). Cambridge: Cambridge University Press; 2010. https://doi.org/10.1017/CBO9780511761362.
Efron B, et al. Microarrays, empirical Bayes and the twogroups model. Stat Sci. 2008; 23(1):1–22.
Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004; 5(2):155–76.
Fogarty J, Baker RS, Hudson SE. Case studies in the use of ROC curve analysis for sensorbased estimates in human computer interaction. In: Proceedings of Graphics Interface 2005. Canadian HumanComputer Communications Society: 2005. p. 129–136.
Guan Y, Stephens M. Bayesian variable selection regression for genomewide association studies and other largescale problems. Ann Appl Stat. 2011; 5(3):1780–815.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010; 33(1):1.
Lee SH, Wray NR, Goddard ME, Visscher PM. Estimating missing heritability for disease from genomewide association studies. Am J Hum Genet. 2011; 88(3):294–305.
Friedman J, Hastie T, Tibshirani R. The elements of statistical learning. vol. 1. Berlin: Springer series in statistics Springer; 2001.
Dai M, Ming J, Cai M, Liu J, Yang C, Wan X, Xu Z. IGESS: a statistical approach to integrating individuallevel genotype data and summary statistics in genomewide association studies. Bioinformatics. 2017; 33(18):2882–9.
Solovieff N, Cotsapas C, Lee PH, Purcell SM, Smoller JW. Pleiotropy in complex traits: challenges and strategies. Nat Rev Genet. 2013; 14(7):483.
Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet. 2007; 81(3):559–75.
Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genomewide complex trait analysis. Am J Hum Genet. 2011; 88(1):76–82.
Lee HS, Lee AT, Criswell LA, Seldin MF, Amos CI, Carulli JP, et al. Several regions in the major histocompatibility complex confer risk for antiCCPantibody positive rheumatoid arthritis, independent of the DRB1 locus. Mol MedCamb N Y. 2008; 14(5/6):293.
Parkes M, Barrett JC, Prescott N, Tremelling M, Anderson CA, Fisher SA, et al. Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn disease susceptibility. Nat Genet. 2007; 39(7):830.
Acknowledgement
The computational work for this article was performed on resources of the National Supercomputing Centre, Singapore, and the Shanghai University of Finance and Economics. The authors also acknowledge the efforts of Chao Su and XianChen Meng, who assisted in the software development.
Funding
This work was supported in part by grant No. 61501389 from National Science Funding of China, grants No. 22302815, No. 12316116 and No. 12301417 from the Hong Kong Research Grant Council, and grant R913200098263 from the DukeNUS Graduate Medical School, and AcRF Tier 2 (MOE2016T22029) from the Ministry of Education, Singapore.
Availability of data and materials
The datasets analyzed in this study can be requested from the Wellcome Trust Case Control Consortium (WTCCC). The software is available at https://github.com/Shufeyangyi2015310117/LPG.
Author information
Authors and Affiliations
Contributions
JL conceived the idea of the article. YY and MD jointly developed the software. YY implemented the model, generated the simulation data, analyzed the real data, obtained the results and prepared the manuscript. JL supervised the whole progress of modeling and analysis. XL, JL, CY, JH and MC edited and approved the manuscript. All authors have read and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
The supplementary document contains additional simulation and data analysis results as well as derivation details. (PDF 11571 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Yang, Y., Dai, M., Huang, J. et al. LPG: A fourgroup probabilistic approach to leveraging pleiotropy in genomewide association studies. BMC Genomics 19, 503 (2018). https://doi.org/10.1186/s1286401848512
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1286401848512
Keywords
 Pleiotropy
 Variational Bayesian expectationmaximization
 Genomewide association studies