LPG: A four-group probabilistic approach to leveraging pleiotropy in genome-wide association studies

Background To date, genome-wide association studies (GWAS) have successfully identified tens of thousands of genetic variants for a variety of traits/diseases, shedding light on the genetic architecture of complex diseases. The polygenicity of complex diseases is a widely accepted phenomenon, through which a vast number of risk variants, each with a modest individual effect, collectively contribute to the heritability of complex diseases. This imposes a major challenge on fully characterizing the genetic bases of complex diseases. An immediate implication of polygenicity is that a much larger sample size is required to detect individual risk variants with weak/moderate effects. Meanwhile, accumulating evidence suggests that different complex diseases can share genetic risk variants, a phenomenon known as pleiotropy. Results In this study, we propose a statistical framework for Leveraging Pleiotropic effects in large-scale GWAS data (LPG). LPG utilizes a variational Bayesian expectation-maximization (VBEM) algorithm, making it computationally efficient and scalable for genome-wide-scale analysis. To demonstrate the advantages of LPG over existing methods that do not leverage pleiotropy, we conducted extensive simulation studies and applied LPG to analyze two pairs of disorders (Crohn’s disease and Type 1 diabetes, as well as rheumatoid arthritis and Type 1 diabetes). The results indicate that, by leveraging pleiotropy, LPG can improve the power of prioritization of risk variants and the accuracy of risk prediction. Conclusions Our methodology provides a novel and efficient tool to detect pleiotropy among GWAS data for multiple traits/diseases collected from different studies. The software is available at https://github.com/Shufeyangyi2015310117/LPG. Electronic supplementary material The online version of this article (10.1186/s12864-018-4851-2) contains supplementary material, which is available to authorized users.


2 Simulation studies: Binary trait
For the binary trait in case-control studies, we conducted simulation studies to evaluate the performance of LPG against its alternatives, i.e., BVSR and Lasso. We first evaluated their performance in the identification of risk variants using AUC, statistical power, and FDR.
Note that for Lasso we could only evaluate AUC. FDR was controlled at 0.2, and power was measured as the proportion of true risk SNPs discovered. We then evaluated the prediction performance of all three methods in a testing set using AUC, as described in Section Results and discussion. The results are provided in Supplementary Figures S10-S18; they demonstrate that the performance of LPG gradually improves as the pleiotropic effects become stronger.
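As a concrete illustration of these evaluation metrics, the sketch below (a hypothetical helper, not part of the LPG software) computes AUC, power, and FDR from per-SNP evidence scores, using the common Bayesian FDR-control rule of selecting top-ranked SNPs while the estimated FDR stays below the 0.2 level:

```python
import numpy as np

def evaluate_prioritization(pip, causal, fdr_level=0.2):
    """Compute AUC, power, and FDR for risk-variant prioritization.
    pip[j] is an evidence score (e.g. a posterior inclusion probability)
    for SNP j; causal[j] is 1 if SNP j is a true risk variant."""
    pip = np.asarray(pip, dtype=float)
    causal = np.asarray(causal, dtype=int)
    n_pos = causal.sum()
    n_neg = causal.size - n_pos
    # AUC via the rank-sum (Mann-Whitney U) formulation; assumes no tied scores
    ranks = pip.argsort().argsort() + 1
    auc = (ranks[causal == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    # select top-ranked SNPs while the estimated Bayesian FDR stays below the level
    order = np.argsort(-pip)
    est_fdr = np.cumsum(1.0 - pip[order]) / np.arange(1, pip.size + 1)
    n_sel = int((est_fdr <= fdr_level).sum())
    selected = order[:n_sel]
    power = causal[selected].sum() / n_pos if n_pos else 0.0
    fdr = 1.0 - causal[selected].mean() if n_sel else 0.0
    return float(auc), float(power), float(fdr)
```

With scores that rank all causal SNPs first, AUC and power are 1 and the empirical FDR is 0.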

Additional simulation studies
We also conducted simulation studies (Supplementary Figures S21-S23) where the true effect sizes β were generated from either a truncated normal distribution or a t-distribution (quantitative trait, ρ = 0.5 and h² = 0.5). The results demonstrate that LPG performs well even when the underlying generating distribution for the effect sizes β differs from our assumed prior distribution for β. The simulation results where the genotypes were sampled from real data are given in Supplementary Figure S24; they demonstrate that our proposed method performs well in this setting as well.
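A minimal sketch of how such effect sizes can be generated (variable names are illustrative; the truncation bounds follow the TN(-2, 2, 0, 1) setting described in Figure S21, and the t-distribution uses df = 5 as in Figure S22):

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_normal(n, lo=-2.0, hi=2.0):
    """Draw effect sizes from a standard normal truncated to [lo, hi]
    by rejection sampling (cheap for mild truncation such as [-2, 2])."""
    out = np.empty(0)
    while out.size < n:
        draw = rng.standard_normal(n)
        out = np.concatenate([out, draw[(draw >= lo) & (draw <= hi)]])
    return out[:n]

n_causal = 10
beta_tn = truncated_normal(n_causal)           # TN(-2, 2, 0, 1) effect sizes
beta_t5 = rng.standard_t(df=5, size=n_causal)  # t-distributed effect sizes, df = 5
```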
Figure S21: Comparison of LPG (VB joint) with its alternative methods, BVSR (VB separate) and Lasso, for quantitative traits using effect sizes drawn from the truncated normal distribution TN(-2, 2, 0, 1), where the standard normal distribution is truncated at -2 and 2. Panels from top to bottom are AUC, FDR, Power and Prediction, respectively. Choices of g range from 0 to 1. The parameter setting of the model is: p = 20,000, n_1 = n_2 = 3000, h² = 0.5, ρ = 0.5, α_1 = 0.005.
Figure S22: Comparison of LPG (VB joint) with its alternative methods, BVSR (VB separate) and Lasso, for quantitative traits using effect sizes drawn from a t-distribution with degrees of freedom (df) = 5. Panels from top to bottom are AUC, FDR, Power and Prediction, respectively. Choices of g range from 0 to 1. The parameter setting of the model is: p = 20,000, n_1 = n_2 = 3000, h² = 0.5, ρ = 0.5, α_1 = 0.005.
Figure S23: Comparison of LPG (VB joint) with its alternative methods, BVSR (VB separate) and Lasso, for quantitative traits using effect sizes drawn from a t-distribution with df = 10. Panels from top to bottom are AUC, FDR, Power and Prediction, respectively. Choices of g range from 0 to 1. The parameter setting of the model is: p = 20,000, n_1 = n_2 = 3000, h² = 0.5, ρ = 0.5, α_1 = 0.005.
Figure S24: Comparison of LPG (VB joint), BVSR (VB separate), and Lasso for a binary trait, with genotypes excerpted from the real data and the binary outcome generated using a logistic regression model with case-control sampling, with g ranging from 0 to 1. Panels from top to bottom are AUC, FDR, Power and Prediction, respectively. The parameters used for this simulation are: p = 20,000, n_1 = n_2 = 7000, α_1 = 0.0005.
For each simulation, we randomly selected 10 causal SNPs such that the 10 SNPs had at most moderate correlation with each other (correlation < 0.8).
Table S3: Summary of real data analysis. inMHC and exMHC indicate that SNPs include and exclude the MHC region, respectively; iter-sep1 and iter-sep2 denote the iterations of separate analysis using BVSR for the first and second study, respectively; iter-joint denotes the iterations of joint analysis using LPG; time-sep1 and time-sep2 denote the time of separate analysis using BVSR for the first and second study, respectively; time-joint denotes the time of joint analysis.
4.1 Comparison of LPG and BVSR for the data consisting of 58C controls with T1D and UKBS controls with RA (RA-T1D-inMHC)
Figure S25: For the data consisting of 58C controls with T1D and UKBS controls with RA, Manhattan plots of separate analysis using BVSR and joint analysis using LPG.
Figure S26: For the data consisting of 58C controls with T1D and UKBS controls with RA, prediction performance of separate analysis using BVSR and joint analysis using LPG.
Figure S31: For the data consisting of 58C controls with T1D and UKBS controls with CD, Manhattan plots of separate analysis using BVSR and joint analysis using LPG.
Figure S32: For the data consisting of 58C controls with T1D and UKBS controls with CD, prediction performance of separate analysis using BVSR and joint analysis using LPG.
4.5 Comparison of LPG and BVSR for the data consisting of 58C controls with CD and UKBS controls with T1D (T1D-CD-inMHC)
Figure S33: For the data consisting of 58C controls with CD and UKBS controls with T1D, Manhattan plots of separate analysis using BVSR and joint analysis using LPG.
Figure S34: For the data consisting of 58C controls with CD and UKBS controls with T1D, prediction performance of separate analysis using BVSR and joint analysis using LPG.
Figure S35: For the data consisting of 58C controls with T1D and UKBS controls with CD, excluding the MHC region, Manhattan plots of separate analysis using BVSR and joint analysis using LPG.
Figure S36: For the data consisting of 58C controls with T1D and UKBS controls with CD, excluding the MHC region, prediction performance of separate analysis using BVSR and joint analysis using LPG.
Figure S37: For the data consisting of 58C controls with CD and UKBS controls with T1D, excluding the MHC region, Manhattan plots of separate analysis using BVSR and joint analysis using LPG.
Figure S38: For the data consisting of 58C controls with CD and UKBS controls with T1D, excluding the MHC region, prediction performance of separate analysis using BVSR and joint analysis using LPG.

Derivation details for the quantitative trait model
To overcome the intractability of the marginal likelihood, we derive an efficient algorithm based on variational inference, which makes our model scalable to genome-wide data analysis. The key idea is to make use of Jensen's inequality to iteratively obtain an adjustable lower bound on the marginal log-likelihood [4]. First, we have a lower bound of the logarithm of the marginal likelihood,
log Pr(y_1, y_2 | X_1, X_2; θ) = L(q, θ) + KL(q || p) ≥ L(q, θ),
where we define
L(q, θ) = E_q[log Pr(y_1, y_2, β_1, β_2, γ_1, γ_2 | X_1, X_2; θ) − log q(β_1, β_2, γ_1, γ_2)].
Note that the Kullback-Leibler (KL) divergence satisfies KL(q || p) ≥ 0 by Jensen's inequality, with equality holding if and only if the variational posterior distribution q and the true posterior distribution p are equal. Similar to the expectation-maximization (EM) algorithm, we can maximize the lower bound L(q, θ) with respect to the variational distribution q, which is equivalent to minimizing the KL divergence [1]. To make the lower bound computationally efficient to evaluate, we use mean-field theory [5] and assume that q(β_1, β_2, γ_1, γ_2) can be factorized as
q(β_1, β_2, γ_1, γ_2) = ∏_{j=1}^p q_j(β_1j, β_2j, γ_1j, γ_2j). (S3)
No additional assumption on the form of the posterior distribution is required. This factorization (S3) is used as a surrogate for the posterior distribution Pr(β_1, β_2, γ_1, γ_2 | y_1, y_2, X_1, X_2; θ).
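The decomposition of the marginal log-likelihood into the lower bound L(q, θ) plus a nonnegative KL term can be verified numerically on a toy model with a single binary indicator γ; the prior and likelihood values below are arbitrary illustrative numbers:

```python
import numpy as np

# Toy check of log Pr(y; θ) = L(q, θ) + KL(q || p) for one binary indicator γ
pi = 0.3                              # prior Pr(γ = 1), hypothetical value
lik = np.array([0.2, 0.9])            # p(y | γ=0), p(y | γ=1), hypothetical values
joint = np.array([(1 - pi) * lik[0], pi * lik[1]])
log_marginal = np.log(joint.sum())
posterior = joint / joint.sum()       # the true posterior p(γ | y)

def elbo(q1):
    q = np.array([1 - q1, q1])
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def kl(q1):
    q = np.array([1 - q1, q1])
    return float(np.sum(q * (np.log(q) - np.log(posterior))))

for q1 in (0.1, 0.5, posterior[1]):
    assert np.isclose(elbo(q1) + kl(q1), log_marginal)  # identity holds for any q
    assert kl(q1) >= -1e-12                             # KL(q || p) >= 0 (Jensen)
# the bound is tight exactly when q equals the true posterior
assert np.isclose(elbo(posterior[1]), log_marginal)
```

Maximizing the lower bound over q is therefore exactly minimizing KL(q || p), as stated above.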
Using the properties of factorized distributions in variational inference [1], we can obtain the optimal approximation using the following formula:
log q_j(β_1j, β_2j, γ_1j, γ_2j) = E_{j' ≠ j}[log Pr(y_1, y_2, β_1, β_2, γ_1, γ_2 | X_1, X_2; θ)] + const, (S4)
where the expectation is taken with respect to all of the other factors {q_{j'}(β_1j', β_2j', γ_1j', γ_2j')} for j' ≠ j. After some derivations, we have
q_j(β_1j, β_2j, γ_1j, γ_2j) = α_lj ∏_{k=1,2} f_kj(β_kj)^{γ_kj} f_0(β_kj)^{1−γ_kj}, (S5)
where α_lj is the posterior probability of [γ_1j, γ_2j] = l, f_0(β_kj) is the posterior distribution of β_kj when γ_kj = 0, and f_kj(β_kj) is the posterior distribution of β_kj when γ_kj = 1. With some algebra, it is easy to show that f_0(β_kj) and f_kj(β_kj) are the density functions of the Gaussian distributions N(0, σ²_β_k) and N(μ_kj, s²_kj), respectively, with
s²_kj = σ²_e_k / (x_kj^T x_kj + σ²_e_k / σ²_β_k),
μ_kj = (s²_kj / σ²_e_k) x_kj^T (y_k − Σ_{j' ≠ j} (Σ_{l: l_k = 1} α_lj') μ_kj' x_kj'), (S6)
where Σ_{l: l_k = 1} α_lj' is the posterior probability that SNP j' is included in study k. As discussed in [3], both s²_kj and μ_kj can be interpreted through a single-variable linear model y_k = x_kj β_kj + e_k: the update for s²_kj is the posterior variance of the effect β_kj, and the update for μ_kj is the posterior mean of the effect β_kj after correcting for correlations with covariates not included in the single-variable model. With some algebra, we can update the log-odds of α_lj as follows:
log(α_lj / α_00j) = log(α_l / α_00) + Σ_{k: l_k = 1} [log(s_kj / σ_β_k) + μ²_kj / (2 s²_kj)], for l ∈ {10, 01, 11}. (S7)
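The single-variable interpretation of (S6) can be sketched as follows (a simplified helper under the stated model, not the LPG implementation; `residual` stands for y_k minus the expected contribution of all SNPs j' ≠ j):

```python
import numpy as np

def single_snp_update(x_j, residual, sigma2_e, sigma2_beta):
    """Variational update for one SNP in one study: s2 is the posterior
    variance and mu the posterior mean of the effect in the single-variable
    linear model fitted to the residual left by all other SNPs."""
    s2 = sigma2_e / (x_j @ x_j + sigma2_e / sigma2_beta)
    mu = (s2 / sigma2_e) * (x_j @ residual)
    return mu, s2
```

With a nearly flat prior (large sigma2_beta), the update approaches the ordinary least-squares estimate for a single covariate, as the interpretation above suggests.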
With q(β_1, β_2, γ_1, γ_2) given in (S5), we can derive the lower bound analytically. Once we have the variational lower bound, the other parameters can be updated by maximizing the lower bound while keeping the variational distribution q fixed:
σ²_e_k = [||y_k − X_k(ᾱ_k ∘ μ_k)||² + Σ_j x_kj^T x_kj (ᾱ_kj(μ²_kj + s²_kj) − ᾱ²_kj μ²_kj)] / n_k,
σ²_β_k = Σ_j ᾱ_kj(μ²_kj + s²_kj) / Σ_j ᾱ_kj,
α_l = (1/p) Σ_j α_lj, (S8)
where ᾱ_kj = Σ_{l∈L_k} α_lj, L_1 = {10, 11}, and L_2 = {01, 11}. Clearly, σ²_e_k is equal to its maximum-likelihood estimate adjusted by the variance of γ_kj β_kj; the update for σ²_β_k is the weighted average of the posterior variances, where the weights are the posterior means of γ_kj; and the update for α_l is the average of the posterior means of γ_kj over all SNPs. Derivation details for the parameter updates of the Gaussian distributions (S6) and the update equations (S7) and (S8) can be found later. The VBEM algorithm (Algorithm 1) behaves similarly to a coordinate descent algorithm, which follows from the factorization of the variational distribution (S5). Hence, the VBEM algorithm developed here is scalable to a large number of individuals and a large number of SNPs.
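A sketch of the M-step updates for one study k, following the interpretations above (variable names are hypothetical; `post_incl[j]` plays the role of ᾱ_kj, and the residual-variance adjustment follows the standard variational spike-and-slab form):

```python
import numpy as np

def m_step(y, X, post_incl, mu, s2):
    """M-step updates for one study: post_incl[j] is the posterior
    inclusion probability of SNP j, mu/s2 are the variational moments
    of its effect. Returns the updated residual and slab variances."""
    n, p = X.shape
    r = post_incl * mu                     # posterior mean of gamma_j * beta_j
    fitted = X @ r
    xsq = (X ** 2).sum(axis=0)             # per-SNP x_j^T x_j
    # residual variance: MLE adjusted by the variance of gamma_j * beta_j
    sigma2_e = ((y - fitted) @ (y - fitted)
                + np.sum(xsq * (post_incl * (mu ** 2 + s2) - r ** 2))) / n
    # slab variance: inclusion-weighted average second moment of beta
    sigma2_beta = np.sum(post_incl * (mu ** 2 + s2)) / np.sum(post_incl)
    return sigma2_e, sigma2_beta
```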

The derivation of lower bound
The lower bound of the quantitative LPG model can be written as in (S9). Algebraically, the first term of the lower bound (S9) can be expanded as in (S10), where the variational expectations appearing in (S10) are listed in (S11). The second term of the lower bound (S9) is the entropy of the variational posterior distribution.
After some simplification, we can derive the analytical form of the lower bound (S14).

The derivation of posterior distribution
We derive the variational posterior distribution by maximizing the lower bound (S13). The derivatives with respect to μ_kj and s²_kj are given in (S15). Setting these derivatives to zero yields μ_kj and s²_kj as in equation (S6). The posterior distribution is given in (S16).

The estimation of model parameters
We then derive the updating formulas of α_lj and θ by maximizing the lower bound (S13).
For the updating formula of the posterior probabilities α_lj, we adopt the Lagrange multiplier approach.
The derivatives with respect to α_00j and α_01j are given below. Using the last equation (S18), we can obtain the following relation:
log(α_01j / α_00j) = log(α_01 / α_00) + log(s_2j / σ_β_2) + μ²_2j / (2 s²_2j).
Similarly, we can obtain the other three equations. After some algebra, they can be expressed as equation (S7). Next, we maximize the lower bound with respect to σ²_e_k,
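The normalization implied by the Lagrange multiplier constraint Σ_l α_lj = 1 is a softmax over the four groups. A small numerically stable sketch (`A` is a hypothetical 4×p array of the log-scale scores A_lj from (S7)):

```python
import numpy as np

def update_alpha(A):
    """Normalize log-scale group scores A_lj, l in {00, 10, 01, 11},
    into posterior probabilities alpha_lj summing to one per SNP."""
    A = A - A.max(axis=0)          # subtract the column max for numerical stability
    alpha = np.exp(A)
    return alpha / alpha.sum(axis=0)
```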
which yields a maximum at the update for σ²_e_k given in (S8). Then, we maximize the lower bound with respect to σ²_β_k,
which has a maximum at the update for σ²_β_k given in (S8). At last, the updating equation of α_l is obtained by the Lagrange multiplier approach.
Algorithm 1: Variational EM algorithm to solve the quantitative LPG model. In the E-step, for each SNP j, the moments μ_kj and s²_kj are updated as in (S6), the group probabilities are normalized as α_lj ← exp(A_lj) / Σ_{l∈{00,10,01,11}} exp(A_lj), where the A_lj are defined in expression (S7), and the fitted vector is updated via ỹ_k ← ỹ_k + Σ_{l∈L_k} α_lj μ_kj x_kj; in the M-step, σ²_e_k, σ²_β_k, and α_l are updated as in (S8).
Binary trait model
Consider two case-control studies with n_1 and n_2 samples, respectively. All settings remain the same as for the quantitative traits, except that y_k ∈ R^{n_k×1} is the vector of disease status taking values -1 and 1 for study k, and Z_k = [z_k1, ..., z_kp_0] ∈ R^{n_k×p_0} is a matrix of p_0 covariates. Then, conditional on the observed genotypes X_k, the hidden status γ_k, and the effects β_k, we have
Pr(y_ki | X_k, β_k, γ_k, φ_k) = δ_ki = 1 / (1 + e^{−y_ki η_ki}),
where δ_ki is the sigmoid function of the linear predictor η_ki, i is the index for individuals, and η_k (= [η_k1, ..., η_kn_k] ∈ R^{n_k×1}) is the vector of linear predictors of all individuals in study k such that
η_k = Σ_{j=1}^{p_0} z_kj φ_kj + Σ_{j=1}^p γ_kj β_kj x_kj.
Note that φ_k is a vector of fixed effects, including an intercept, in the model. Here, we include fixed-effect covariates in the binary studies to adjust for population stratification in the samples.
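A sketch of the likelihood and linear predictor for the binary model (illustrative helpers, not the LPG implementation; with labels coded -1/+1, the two class probabilities for an individual sum to one):

```python
import numpy as np

def sigmoid_prob(y_i, eta_i):
    """Bernoulli likelihood with labels coded -1/+1:
    Pr(y_ki | .) = 1 / (1 + exp(-y_ki * eta_ki))."""
    return 1.0 / (1.0 + np.exp(-y_i * eta_i))

def linear_predictor(Z, phi, X, gamma, beta):
    """eta_k = Z phi + X (gamma * beta): fixed-effect covariates
    (including an intercept, adjusting for population stratification)
    plus the genetic effects of the included SNPs."""
    return Z @ phi + X @ (gamma * beta)
```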

The derivation of lower bound
The definition of the lower bound for the binary trait model is given in (S30). We can derive the analytical form of the lower bound by evaluating its two expectations. The first expectation of the lower bound (S30) is E_q[log h(y_1, y_2, β_1, β_2, γ_1, γ_2 | X_1, X_2; θ)], which involves terms of the form (y_kn z^T_kn φ_k)².