Skip to main content
  • Methodology article
  • Open access
  • Published:

A new statistical framework for genetic pleiotropic analysis of high dimensional phenotype data

Abstract

Background

The widely used genetic pleiotropic analyses of multiple phenotypes are often designed for examining the relationship between common variants and a few phenotypes. They are not suited for both high dimensional phenotypes and high dimensional genotype (next-generation sequencing) data.

To overcome limitations of the traditional genetic pleiotropic analysis of multiple phenotypes, we develop sparse structural equation models (SEMs) as a general framework for a new paradigm of genetic analysis of multiple phenotypes. To incorporate both common and rare variants into the analysis, we extend the traditional multivariate SEMs to sparse functional SEMs. To deal with high dimensional phenotype and genotype data, we employ functional data analysis and the alternative direction methods of multiplier (ADMM) techniques to reduce data dimension and improve computational efficiency.

Results

Using large scale simulations we showed that the proposed methods have higher power to detect true causal genetic pleiotropic structure than other existing methods. Simulations also demonstrate that the gene-based pleiotropic analysis has higher power than the single variant-based pleiotropic analysis. The proposed method is applied to exome sequence data from the NHLBI’s Exome Sequencing Project (ESP) with 11 phenotypes, which identifies a network with 137 genes connected to 11 phenotypes and 341 edges. Among them, 114 genes showed pleiotropic genetic effects and 45 genes were reported to be associated with phenotypes in the analysis or other cardiovascular disease (CVD) related phenotypes in the literature.

Conclusions

Our proposed sparse functional SEMs can incorporate both common and rare variants into the analysis and the ADMM algorithm can efficiently solve the penalized SEMs. Using this model we can jointly infer genetic architecture and casual phenotype network structure, and decompose the genetic effect into direct, indirect and total effect. Using large scale simulations we showed that the proposed methods have higher power to detect true causal genetic pleiotropic structure than other existing methods.

Background

In the past several years, a large number of statistical methods for association analysis of both qualitative and quantitative traits with next-generation sequencing data were developed [114]. Most genetic analyses of quantitative traits focus on association analysis of a single trait, analyzing each phenotype individually and independently [15]. However, multiple phenotypes are correlated. For example, metabolism of lipoproteins involves cholesterol, triglycerides, very low density lipoproteins (VLDL), low density lipoproteins and high density lipoproteins. These multiple traits are dependent. The integrative analysis of correlated phenotypes often increase statistical power to identify genetic associations [16, 17]. The association analysis of multiple phenotypes is expected to become popular in the near future [18].

Three major approaches are commonly used to explore association of genetic variants with multiple correlated phenotypes: multiple regression methods, integration of p values of univariate analysis, and dimension reduction methods [16]. Despite their differences in selection of specific methods for estimation, all these estimation methods share the following common features. First, many methods were designed for common variants and hence may not be appropriate for rare ones. Second, the results of all these analyses are difficult to interpret. They do not provide information to indicate which phenotypes the genetic variants are significantly associated [15]. Third, all these methods estimate the effect of the genetic variant on each phenotype individually and do not explore the dependency patterns of genetic effects among the phenotypes and do not provide a detailed characterization of the relationships among the genetic effects. Fourth, all these estimations only estimate the effects of the genetic variants on the phenotypes. However, the genetic effects can be classified into three types of effects: direct, indirect and total effects. These methods are unable to reveal mechanisms underlying the genetic structures of multiple phenotype association analysis [19]. The direct effect is the measurement of the influence of a genetic variant on a phenotype that is not mediated by any other phenotypes in a system. The indirect effect of a genetic variant measures the sensitivity of a phenotype to change of a genetic variant that is mediated by at least one intervening variable (phenotype). The total effect is the sum of the direct and indirect effects. The most popular multivariate association methods are lack of ability to decompose total effect into direct effect and indirect effect and ignore indirect effects through other mediating phenotypes and risk factors. Therefore, they cannot discover how the effect of the genetic variant on the phenotype is mediated by other phenotypes and the effect path from the initially affected phenotype by the genetic variant through a number of mediating phenotypes to the targeted phenotype. Pleiotropic effect is a context dependent genetic effect and plays an important role in multivariate trait association studies and evolution analysis [20]. The pleiotropic effect of a specific genetic variant on multiple phenotypes may be due to either direct contribution of the genetic variant to the multiple phenotypes or phenotype correlations (mediations). The multivariate trait association studies cannot distinguish the paths connecting multiple phenotypes and genetic effects [21].

In the past several years, there have been increasing interests in modeling the complex structures among phenotypes, risk factors and genotypes which are referred to as the genotype-phenotype networks and therefore overcome these limitations. Current methods for inference of genotype-phenotype networks can be classified into two categories: whole network scoring methods and local analysis methods [2229]. Network scoring approaches assign a score to the network model for measuring how well the network fits the data and develop algorithms to search the network with the best score. Local analysis methods analyze small sets of variables that are pieced together into networks from multiple causality tests between variables.

One of network scoring methods is structure equations that can be used as a tool to model the complex network structures among phenotypes, risk factors and genotypes [1921, 3032]. A graphical model in which the variables are represented as nodes and the relationships between variables are represented by edges between the nodes can be used to model the genotype-phenotype networks. Structural equations can generate biological interpretations of relations among variables and uncover the mechanism structure underlying phenotypic and genotypic relationships. To date, in applications of the structural equation model (SEM) in quantitative genetics, the causal structure was assumed to be known as a priori, or partially specified, thereby allowing selection of the causal structure for a small set of variables from the data [21]. There are two major approaches to estimate the causal structure from the data. One approach is based on the conditional independence and the notion of Markov equivalence of directed acyclic graphs (DAGs) [33]. DAGs encode causal structure. However, a DAG is not, in general, identifiable from observational data. Conditional independence only determines the skeleton of the DAG which is the undirected graph of the DAG by removing its directions of all edges, and the v structure of the DAG where two nodes are directed to a common node (collider) [34]. A number of algorithms such as PC-algorithms have been used to estimate the equivalence class of DAGs [35]. A second approach is to use the notion of ‘sparse’ and develop sparse SEMs for estimating the causal structures [36]. By incorporating the penalized constraints of the parameters into the likelihood function to enforce the network sparsity, we could estimate the causal structure. Coordinate ascent algorithms are often used to maximize the penalized likelihood functions.

Despite their successful application to joint analysis of genetic architecture and causal phenotype networks, current approaches often demand intensive computations and are lack of efficient computational algorithms for implementing penalization of network structure parameters. Therefore, they cannot be used for large-scale causal inference. Most current approaches are designed for common variants and are difficult to be applied to next generation sequencing (NGS) data. The purpose of this paper is to overcome these limitations. We first develop novel functional SEMs where exogenous genotype profiles across a genomic region or a gene are represented as a function of the genomic position for genetic association analysis of multiple quantitative traits which is referred to as multivariate quantitative traits locus (QTL) analysis. The functional SEMs for multivariate QTL analysis consist of three components. The first component is a phenotype network that is modeled as a directed graph. The second component is a genotype network that is represented as an undirected graph. The third component is connections between the genotype network and phenotype network with direction from genotype nodes to phenotype nodes. To make the network sparse and reduce the burden of computations, we develop the novel sparse SEMs for genotype-phenotype networks and an efficient computational algorithms based on ADMM to search the causal structure and estimate the parameters [37, 38]. We will estimate the direct, indirect and total effects of the genetic variants on the phenotypes using estimated directed graph and intervention calculus [39] and explore the relationships between direct, indirect and total effects estimated from SEMs and the genetic effects estimated from the traditional simple regressions and multiple regressions. Finally, the sparse SEMs are applied to exome sequence data from the NHLBI’s Exome Sequencing Project (ESP) with 11 phenotypes. A program implementing the developed sparse SEMs for quantitative genetic analysis with multiple phenotypes will be published as an R package.

Methods

Multivariate quantitative trait association analysis can be investigated by phenotype-genotype networks, which can be represented as a graph. Phenotypes, covariates such as age, sex, race, and SNPs are variables. Variables are represented as nodes in the graph. We assume that causal relationships among phenotypes exist. Therefore, a phenotype network is represented by a directed graph. A directed edge between two nodes indicates the causal relationship between them. Since SNPs do not have causal relationships among them, a genotype network is represented as an undirected graph. An edge between two nodes in the genotype network indicates their correlation. Since all SNPs and covariates may cause changes in phenotypes, the phenotype network and genotype network are connected by edges directed from covariates and SNP to the phenotypes. The phenotypes and connections between phenotypes, covariates and SNPs can be modeled by structural equations. The genotype network can be leant by graphical LASSO (GLASSO) [39], here we didn’t focus on genotype network in this paper. An example of phenotype-genotype network is shown in Additional file 1: Figure S1.

SEMs for multivariate association analysis

The SEMs offer a general statistical framework for inferring phenotype networks and connections between genotypes and phenotypes. Assume that n individuals are sampled. We consider M phenotypes that are referred to as endogenous variables. The endogenous variables are jointly determined in the model and are also influenced by the variables outside the model. We denote the n observations on the M endogenous variables by the matrix Y = [y 1, y 2, …, y M ], where y i  = [y 1i , …, y ni ]T is a vector of collecting n observation of the endogenous variable i. Covariates, genetic variants as exogenous or predetermined variables are denoted by X = [x 1, …, x K ] where x i  = [x 1i, …, x ni ]T. Similar to independent variables in the regression, the exogenous variables are outside the models and are not influenced by the variables in the model. Similarly, random errors are denoted by E = [e 1, …, e M ], where we assume E[e i ] = 0 and E[e i e T i ] = σ 2 i I n for i = 1, …, M. Recall that the relationships between the phenotypes and genotypes are traditionally described by the regressions where the phenotypes are taken as dependent variables and genotypes are taken as independent variables are predictors. In the regression models, the dependence relationships among dependent variables or phenotypes cannot be explicitly expressed. Therefore, the regression models cannot be used to determine which phenotypes cause the variations of which phenotypes. To overcome this limitation, we introduce linear structural equations. The linear structural equations for modeling relationships among phenotypes and genotypes can be written as [38].

$$ \begin{array}{l}{y}_1{\gamma}_{11}+{y}_2{\gamma}_{21}+\dots +{y}_M{\gamma}_{M1}+{x}_1{\beta}_{11}+{x}_2{\beta}_{21}+\dots +{x}_K{\beta}_{K1}+{e}_1=0\\ {}\begin{array}{cccc}\hfill \hfill & \hfill \hfill & \hfill \vdots \hfill & \hfill \begin{array}{cccc}\hfill \hfill & \hfill \hfill & \hfill \hfill & \hfill \begin{array}{cccc}\hfill \hfill & \hfill \hfill & \hfill \hfill & \hfill \begin{array}{cccc}\hfill \hfill & \hfill \hfill & \hfill \hfill & \hfill \begin{array}{cccc}\hfill \hfill & \hfill \hfill & \hfill \hfill & \hfill \begin{array}{ccc}\hfill \hfill & \hfill \vdots \hfill & \hfill \hfill \end{array}\hfill \end{array}\hfill \end{array}\hfill \end{array}\hfill \end{array}\hfill \end{array}\\ {}{y}_1{\gamma}_{1M}+{y}_2{\gamma}_{2M}+\dots +{y}_M{\gamma}_{MM}+{x}_1{\beta}_{1M}+{x}_2{\beta}_{2M}+\dots +{x}_K{\beta}_{KM}+{e}_M=0\end{array} $$
(1)

where the γ’s and β’s are the structural parameters of the system that are unknown.

Variables in the SEMs can be classified into two basic types of variables: observed variables that can be measured and the residual error variables that cannot be measured and represent all other unmodeled causes of the variables. Most observed variables (e. g. phenotypes such as BMI, blood pressure, high density lipoprotein, low density lipoprotein) are random. Some observed variables may be nonrandom or control variables (e. g. genotypes, drug dosages) whose values remain the same in repeated random sampling or might manipulated by the experimenter. The observed variables will be further classified into exogenous variables (e.g. genotypes, age, sex, race), which lie outside the model, and endogenous variables (e.g. phenotypes), whose values are determined through joint interaction with other variables within the system. All nonrandom variables can be viewed as exogenous variables. Phenotypes are viewed as endogenous variables. The terms exogenous and endogenous are model specific. It may be that an exogenous variable in one model is endogenous in another. The structural parameters γ describe the relationships between phenotypes and parameters β measure the direct genetic effects of the genotypes on the phenotypes.

In matrix notation the SEMs (1) can be rewritten as

$$ Y\varGamma +XB+E=0, $$
(2)

where Γ = [Γ 1, …, Γ M ], Γ i  = [γ 1i , …, γ Mi ]T, B = [B 1, …, B M ], B i  = [β 1i , …, β Ki ]T.

We assume that the random errors in the structural equations are independent and uncorrelated with exogenous variables. We apply the sparsity penalty to each equation to ensure that the sparse SEMs are identifiable.

Two-stage least square estimates of the parameters in the SEMs

The ordinary least squares estimator is biased and inconsistent for the parameters of structural equations. To ensure the consistent estimates of the parameters in the SEMs, we use a generalized least square method that can be interpreted as a two-stage least square estimate method to estimate the parameters in the SEMs [38].

Recalling that y i is the vector of observations of the variable i, let Y − i be the observation matrix Y after removing y i from it and γ − i be the parameter vector Γ i after removing the parameter γ ii . The ith equation:

$$ Y{\varGamma}_i+X{B}_i+{e}_i=0 $$

can be rewritten as

$$ \begin{array}{l}{y}_i={Y}_{-i}{\gamma}_{-i}+X{B}_i+{e}_i\\ {}\begin{array}{cc}\hfill \hfill & \hfill ={W}_i{\Delta}_i+{e}_i\hfill \end{array},\end{array} $$
(3)

where \( {W}_i=\left[\begin{array}{cc}\hfill {Y}_{-i}\hfill & \hfill X\hfill \end{array}\right],{\Delta}_i={\left[\begin{array}{cc}\hfill {\gamma}_{-i}^T\hfill & \hfill {B}_i^T\hfill \end{array}\right]}^T. \)

Multiplying by the matrix X T on both sides of eq. (3), we obtain

$$ {X}^T{y}_i={X}^T{Y}_{-i}{\gamma}_{-i}+\left({X}^TX\right){B}_i+{X}^T{e}_i={X}^T{W}_i{\Delta}_i+{X}^T{e}_i. $$
(4)

It is known that

$$ \operatorname{cov}\left({X}^T{e}_i,{X}^T{e}_i\right)={X}^TX{\sigma}_i^2. $$

The generalized least square estimate \( {\widehat{\Delta}}_i \) is given by

$$ {\widehat{\Delta}}_i={\left[{W}_i^TX{\left({X}^TX\right)}^{-1}{X}^T{W}_i\right]}^{-1}{W}_i^TX{\left({X}^TX\right)}^{-1}{X}^T{y}_i. $$
(5)

The generalized least square estimate \( {\widehat{\Delta}}_i \) can be interpreted as a two-stage least square estimate [38].

Suppose that in the first stage, Y − i is regressed on X to obtain

\( {\widehat{\Pi}}_i={\left({X}^TX\right)}^{-1}{X}^T{Y}_{-i} \) and \( {\widehat{Y}}_{-i}=X{\widehat{\Pi}}_i. \)

Then,

$$ \begin{array}{l}{\widehat{W}}_i=\left[\begin{array}{cc}\hfill {\widehat{Y}}_{-i}\hfill & \hfill X\hfill \end{array}\right]\\ {}\begin{array}{cc}\hfill \hfill & \hfill =X{\left({X}^TX\right)}^{-1}{X}^T{W}_i\hfill \end{array}.\end{array} $$

Eq. (5) can be reduced to

$$ {\widehat{\Delta}}_i={\left({\widehat{W}}_i^T{\widehat{W}}_i\right)}^{-1}{\widehat{W}}_i^T{y}_i. $$
(6)

Therefore, if W i in eq. (3) is replaced by Ŵ i , eq. (6) can be interpreted as that in the second stage, y i is regressed on Ŷ i and X to obtain estimate \( {\widehat{\Delta}}_i \).

Sparse SEMs and alternating direction method of multipliers

In general, the genotype-phenotype networks are sparse. Therefore, Γ and B are sparse matrices. In order to obtain sparse estimates of Γ and B, the natural approach is the l 1 -norm penalized regression of eq. (4). Using weighted least square and l 1 -norm penalization, we can form the following optimization problem:

$$ \begin{array}{l}\underset{\Delta_i}{ \min}\kern0.5em f\left({\Delta}_i\right)+\lambda \left|\right|{\Delta}_i\left|\right|{}_1\\ {}\mathrm{where}\ f\left({\Delta}_i\right)={\left({X}^T{y}_i-{X}^T{W}_i{\Delta}_i\right)}^T{\left({X}^TX\right)}^{-1}\left({X}^T{y}_i-{X}^T{W}_i{\Delta}_i\right).\end{array} $$
(7)

The size of the genotype-phenotype network may be large. The efficient ADMM [37] algorithm is used to solve the optimization problem (7). The procedure for implementing ADMM is given below (more detailed descriptions are provided in Appendix 1).

Under some assumptions convergence of ADMM can be proved [37]. In practice, although it can be slow to converge to high accuracy, ADMM converges to modest accuracy within a few tens of iterations. When large-scale problems and parameter estimation problems are considered, modest accuracy is sufficient. Therefore, ADMM may work very well for structure and parameter estimation in the genotype-phenotype networks.

Most of the elements of matrices Γ and B are equal to zero. The l 1 − regularized Lasso for the two stage least squares approach and ADMM algorithms are expected to shrink most of the coefficient matrices Γ and B toward zero, yielding sparse network structures. The sparsity-controlling parameter λ will be estimated via cross validation or set by users to get reasonable results. We abbreviate this sparse two stage least square estimation of SEMs as S2SEMs.

Sparse functional structural equation models for phenotype and genotype networks

Fast and cheaper next generation sequencing (NGS) technologies will generate unprecedentedly massive and highly-dimensional genomic variation data. Despite their promise, next generation sequencing platforms also have three specific features: high error rates, enrichment of rare variants and large proportion of missing values. Available causal analysis platforms for genetic studies which are mainly designed for common variants provide useful tools for single marker-based pleiotropic genetic analysis, but have limitations in analyzing thousands of sequences collected for very large population-based studies of humans. To address the critical barrier in causal genetic analysis with NGS data, we extend the multivariate SEMs to functional SEMs where exogenous genotype profiles across a genomic region are represented as a function of the genomic position. To effectively reduce the dimension of the data, we use genetic variant profiles which will recognize information contained in the physical location of the SNP as a major data form. The densely distributed genetic variants across the genomes in large samples can be viewed as realizations of a Poisson process. The densely typed genetic variants in a genomic region for each individual are so close that these genetic variant profiles can be treated as observed data taken from curves. The genetic variant profiles are called functional.

Large simulations have shown that combining information across multiple variants in a genomic region of analysis will greatly enhance power to detect association of rare variants [9]. To jointly utilize multi-locus genetic information and reduce the dimension of the NGS data, we propose to use a genomic region or a gene as a unit in multiple trait association analysis and develop sparse functional structural equation models (FSEMs) for construction and analysis of the phenotype and genotype networks. The FSEMs collectively analyze the contribution of multiple variants to the traits, reduce the errors in the NGS data via data reduction techniques and can effectively deal with missing data through the smooth mechanism of the function curves of the data.

Let t be a genomic position. Define a genotype profile x i (t) of the i-th individual as

$$ {x}_i\left(\mathrm{t}\right)=\left\{\begin{array}{c}\hfill 2{P}_q\left(\mathrm{t}\right), \kern2.25em \mathrm{Q}\mathrm{Q}\hfill \\ {}\hfill {P}_q\left(\mathrm{t}\right)\hbox{-} {\mathrm{P}}_{\mathrm{Q}}\left(\mathrm{t}\right),\kern0.5em \mathrm{Q}\mathrm{q}\hfill \\ {}\hfill -2{P}_Q\left(\mathrm{t}\right), \kern1.5em \mathrm{q}\mathrm{q}\hfill \end{array}\right. $$

where Q and q are two alleles of the marker at the genomic position t, P Q (t) and P q (t) are the frequencies of the alleles Q and q, respectively. Suppose that we are interested in k genomic regions or genes [a j , b j ], denoted as T j , j = 1, 2, …, k. We consider the following functional structural equation models:

$$ \begin{array}{l}{y}_1{\gamma}_{11}+{y}_2{\gamma}_{21}+\dots +{y}_M{\gamma}_{M1}+{\displaystyle {\int}_{T_1}{x}_1(t){\beta}_{11}(t)dt+\dots +{\displaystyle {\int}_{T_k}{x}_k(t){\beta}_{k1}(t)dt+{e}_1}}=0\\ {}{y}_1{\gamma}_{12}+{y}_2{\gamma}_{22}+\dots +{y}_M{\gamma}_{M2}+{\displaystyle {\int}_{T_1}{x}_1(t){\beta}_{12}}(t)dt+\dots +{\displaystyle {\int}_{T_k}{x}_k(t){\beta}_{k2}(t)dt+{e}_2=0}\\ {}\vdots \kern11.75em \vdots \kern17.5em \vdots \\ {}{\mathrm{y}}_1{\gamma}_{1M}+{y}_2{\gamma}_{2M}+\dots +{y}_M{\gamma}_{MM}+{\displaystyle {\int}_{T_1}{x}_1(t){\beta}_{1M}(t)dt+\dots +{\displaystyle {\int}_{T_k}{x}_k(t){\beta}_{kM}(t)dt+{e}_M=0}}\end{array} $$
(8)

where β ij (t) are genetic effect functions.

Functional principal components (FPCs) are efficient summary statistics. The FPCs simultaneously employs genetic information of the individual variants and correlation information (linkage disequilibrium) among all variants. The FPCs view the genetic variation across the genomic region as a function of its genomic location and uses intrinsic functional dependence structure of the data and all available genetic information of the variants in the genomic region. The neighboring genetic variants are linked. The genotypes at one SNP are dependent on the genotypes at nearby SNPs. The FPCs account for the space-ordering of the genetic variation data. Expanding the genotype functions in terms of a few orthogonal FPCs will substantially reduce the dimensions of the genetic variation data while preserving the intrinsic correlation structure and the space-ordering of the data. Specifically, For each genomic region or gene, we use functional principal component analysis to calculate principal component function [14]. We expand x nj (t), n = 1, …, N, j = 1, 2, …, k in each genomic region in terms of orthogonal principal component functions:

$$ {x}_{ij}(t)={\displaystyle \sum_{l=1}^{L_j}{\eta}_{ijl}{\varphi}_{jl}(t)},j=1,\dots, k, $$

where φ jl (t), j = 1, …, k, l = 1, …, L j are the l-th principal component function in the j-th genomic region or gene and η ijl are the functional principal component scores of the i-th individual.

Let η be a matrix collection of all functional principal component scores, the parameter matrix B can be defined as that in Appendix 2, matrices Y and Γ can be defined as that in the previous section. The structural functional equations can be reduced in terms of functional principal component scores (Appendix 2):

$$ Y{\varGamma}_i+\eta {B}_i+{e}_i=0, $$

which can be rewritten as

$$ {y}_i={W}_i{\Delta}_i+{e}_i, $$

where \( {W}_i=\left[\begin{array}{cc}\hfill {Y}_{-i}\hfill & \hfill \eta \hfill \end{array}\right],{\Delta}_i={\left[\begin{array}{cc}\hfill {\gamma}_{-i}^T\hfill & \hfill {B}_i^T\hfill \end{array}\right]}^T. \)

Then, the sparse FSEMs are transformed to

$$ \begin{array}{l}\underset{\Delta_i}{ \min}\kern0.5em f\left({\Delta}_i\right)+\lambda \left|\right|{\Delta}_i\left|\right|{}_1\\ {}\mathrm{where}\ f\left({\Delta}_i\right)={\left({\eta}^T{y}_i-{\eta}^T{W}_i{\Delta}_i\right)}^T{\left({\eta}^T\eta \right)}^{-1}\left({\eta}^T{y}_i-{\eta}^T{W}_i{\Delta}_i\right).\end{array} $$
(9)

The ADMM algorithms for solving the sparse FSEMs are the same as that in the previous section if the matrix X is replaced by a functional principal component score matrix η (Appendix 2).

The functional SEMs can efficiently combine both common and rare genetic variants across the gene region and are suitable for NGS data [14]. This model extends the single variant-based network analysis to gene-based analysis, which can deal with hundreds of genes that may include tens of thousands SNPs. However, due to the computational limitation, we cannot directly handle the whole genome sequencing data. To construct whole genome genotype-phenotype networks, the network construction consists of two stages. At the first stage, we can group the genes based on the metabolic pathways or cluster analysis, with each group having at most hundreds of genes, and then apply the functional SEMs to each group of genes to find the set of genes significantly connected to the phenotypes. At the second stage, adding the sets of significantly connected genes identified at the first stage together to form a new of set of genes for network construction. The functional SEMs are again applied to the new set of genes to construct the final genotype-phenotype network.

Effect decomposition and estimation

To make this paper self-contained, we introduce basic concepts and methods for decomposition and estimation of the effects. In the genotype-phenotype network analysis we are interested in estimation of effects of genetic variants on phenotypes, which is referred to as genetic effects and effects of treatment on phenotypes. All genetic effects and treatment effects can be decomposed as total (causal), direct effects and indirect effects. Distinction between total, direct and indirect effects are of great practical importance in genetic association analysis [40]. The total effect measures the changes of response variable Y (phenotype) would take on the value y when variable X is set to x by external intervention. Direct effect is defined as sensitivity of Y to changes in X while all other variables in the model are held fixed. Indirect effect is to measure the portion of the effect which can be explained by mediation alone, while inhibiting the capacity of Y to respond to X [41]. The total effect is equal to the summation of direct and indirect effects.

Given a directed graph model G, one can compute total effects using intervention calculus [34, 42]. Suppose that the expected value of a response variable Y, after X is assigned value x by intervention is denoted by E[Y|do(X = x)]. The total effect is defined as

$$ \frac{\partial }{\partial x}E\left[Y\Big| do\left(X=x\right)\right]. $$
(10)

Note X j is called a parent of X in G if there is a directed edge X j  → X. Let pa x denote the set of all parents of X in G. In the linear SEMs, we assume that E[Y|X, pa x ] is linear in X and pa x :

$$ E\left[Y\Big|X,\ {\mathrm{pa}}_x\right]=\alpha +\beta X+{\gamma}^T{\mathrm{pa}}_x. $$
(11)

Then,

$$ \frac{\partial }{\partial x}E\left[Y\Big| do\left(X=x\right)\right]=\beta . $$

When a directed graph is given, it is easy to calculate total effect [42]. Assume that there are k directed paths from X to Y and p i are the product of the path coefficients along the i-th path. The total effect of X on Y is then defined as ∑ k i = 1 p i . As shown in Additional file 1: Figure S2, the total effect of X on Y is ag + bdh + acdh. By its definition, direct effect measures the sensitivity of Y to changes in X while all other variables in the model are held fixed. In other words, all links from X to Y other than the direct link will be blocked. As a consequences, the direct effect is equal to the path coefficient from X to Y. In the linear SEMs, the indirect effect of X on Y mediated by M is equal to the sum of the products associated with directed paths from X to Y through M [42]. In Additional file 1: Figure S2, there is no direct effect from X to Y. The indirect effect of X on Y which is mediated by B and D is equal to bdh.

In the SEMs for genotype-phenotype networks, since all SNPs only form undirected graph and there are no directed links between SNPs although we can observe linkage (or correlation) between SNPs; SNPs in the genotype-phenotype networks do not have parents. The total effect of SNP X on Y is the regression coefficient β of the following linear regression:

$$ E\left[Y\Big| do\ \left(X=x\right)\right]=\alpha +\beta x, $$

which is a simple regression of Y on X. This indicates that the traditional simple regression for association studies captures the total effect of a genetic variant on a phenotype.

If we include environments and risk factors such as smoking and obesity in the model and want to evaluate the effects of the environments and risk factors on the phenotype, these variables play mediating roles and will also be taken as phenotypes. We denote these mediating phenotypes by Y ME . Since genetic variants, and other risk factors and phenotypes will affect the mediating phenotypes, the mediating phenotypes in the graphics may have parents. Their parents are denoted by S. Total effect of the mediation phenotype on the target phenotype is calculated by

$$ E\left[Y\Big| do\ \left({Y}_{ME}={y}_{ME},{X}_{pa}={x}_{pa}\right)\right]=\alpha +\beta {y}_{ME}+{\gamma}^T{x}_{pa}, $$
(12)

where β is the total effect of the mediation phenotype Y ME on the target phenotype Y. In this case, a simple regression of Y on Y ME can no longer be used to measure the total effect of the mediation phenotype Y ME on the target phenotype Y. To observe this, we simulated 1000 individuals with the SEM as shown in Additional file 1: Figure S3. Each variable has a noise term distributed as N(0, 1). The total effect of the mediation phenotype Y ME on the target phenotype Y is 3.5. We obtain the simple regression:

$$ Y=1.39+5.85{Y}_{ME}. $$

It is clear that the coefficient of the simple regression is 5.85. This value is far away from the total effect 3.5. However, using eq. (12) we obtain

$$ Y=3.54{Y}_{ME}+5.85X, $$

where the regression coefficient 3.54 measured the total effect of the mediation phenotype Y ME on the target phenotype Y.

Test statistics for path coefficients

Testing connection between the j-th gene and the i-th phenotype in the genotype-phenotype network, we formally investigate the problem of testing the coefficient of the path directed from the j- th gene to the i-th phenotype:

$$ {H}_0:{\beta}_{ji}(t)=0, \kern0.5em \forall t\in \left[0,{T}_j\right] $$
(13)

against

$$ {H}_a:{\beta}_{ji}(t)\ne 0\ . $$

If the coefficient function of path or genetic effect function β ji (t) is expanded in terms of the principal component functions:

$$ {\beta}_{ji}(t)={\displaystyle {\sum}_{g=1}^G{b}_{jig}{\varphi}_{jg}(t)}, $$

then testing the null hypothesis H0 in Eq. (13) is equivalent to testing the hypothesis:

$$ {H}_0:{b}_{jig}=0,\ \forall \mathrm{g}. $$
(14)

The path coefficients b jig can be estimated by solving problems (8) and (9). Let \( {\widehat{b}}_{ji}={\left[{b}_{ji1},\dots, {b}_{jiG}\right]}^T \). The covariance matrix of the vector of the estimators of path coefficients for the i -th equation is given by [38]

$$ {\widehat{\varSigma}}_i={\sigma}_{ii}{\left[{W}_i^T\eta {\left({\eta}^T\eta \right)}^{-1}{\eta}^T{W}_i\right]}^{-1}, $$
(15)

where

$$ {\sigma}_{ii}={\left({y}_i-{W}_i{\widehat{\Delta}}_i\right)}^T\left({y}_i-{W}_i{\widehat{\Delta}}_i\right)/n. $$
(16)

Let Λ i be the submatrix that corresponds to b ji in the matrix \( {\widehat{\Sigma}}_i \). Define the statistic for testing the directed connection from the j-th gene to the i-th phenotype as

$$ {T}_g={\widehat{b}}_{ji}^T{\varLambda}_i^{-1}{\widehat{b}}_{ji}. $$
(17)

Under the null hypothesis of no association H 0 : b ji  = 0, T g is asymptotically distributed as a central χ 2(G) distribution where G is the number of functional principal components in the expansion of β ji (t).

For testing a single parameter or single variant’s path coefficient in the SEMs, the l-th parameter of the i-th equation, the statistic is given by

$$ {T}_c=\frac{{\widehat{\Delta}}_{il}^2}{\operatorname{var}\left({\widehat{\Delta}}_{il}\right)}, $$
(18)

where \( \operatorname{var}\left({\widehat{\Delta}}_{il}\right) \) is the l-th diagonal element of the matrix \( {\widehat{\Sigma}}_i \). Under the null hypothesis H 0 : Δ il  = 0, T c is asymptotically distributed as a central χ 2(1) distribution.

Testing for the path coefficients within the network results in multiple testing problems. Both false discovery rate approach and Bonferroni correction can be used to adjust for multiple testing [43, 44].

Results

Model evaluation by simulations

We evaluated the performance of the sparse SEM approach for genetic analysis of multiple quantitative traits in simulation studies of a genotype-phenotype network where SNP-based simulations and gene-based simulations were considered. The simulations were carried out for common variants, rare variants and half common and half rare variants. The genotype data were selected from the NHLBI’s Exome Sequencing Project (ESP) with 3248 individuals of European origin, which were then used to generate a population of 1,000,000 individuals.

We first study the SNP-based simulations. The genotype-phenotype network consisted of two parts. The first part was the phenotype network that was modeled by a DAG. The second part was the connections between the genotypes and phenotypes in which the genotypes were directed to the phenotypes. We randomly generated a genotype-phenotype network structure (see an example in Additional file 1: Figure S4). The parameters Γ ij in the SEMs for modeling phenotype sub-network were generated from a uniformly distributed random variable over the interval (0.5, 1) or (−1,-0.5) if an edge from node j to node i was presented in the phenotype sub-network; otherwise Γ ij =0. Similarly, the parameters B ij in the SEMs for modeling the direction from the genotype (SNP) node j to the phenotype node i were generated from a uniformly distributed random variable over the interval (0, 1) or (−1,0) if an edge from node j to node i was presented in the genotype-phenotype network, otherwise B ij  = 0. The indicator variables for coding genotypes of the SNP were as previously described. Using the randomly generated network structure and parameters in the structural equations, we produced the phenotypes by the model: Y = − XBΓ− 1 + εΓ− 1, where ε ~ N(0, 0.01 × I), and X is a matrix of indicator variables for coding genotypes. For the randomly generated phenotype network, the expected number of degrees per node is three. Simulations were repeated 100 times. Five-fold cross validation was used to determine the penalty parameter λ that was then employed to infer the network while running power simulations. Two measures: the power of detection (PD) and the false discovery rate (FDR) were used to evaluate the performance of the algorithms for identification of the network structures. Specifically, let N t be the total number of edges among 1000 replicates of the network and \( {\widehat{N}}_t \) be the total number of edges detected by the inference algorithm, N true be the total number of true edges detected among simulated network and N False be the false edges detected among \( {\widehat{N}}_t \). Now, the power of detection (PD) is defined by \( \frac{NTrue}{{\widehat{N}}_t} \) and false discovery rate (FDR) is defined by \( \frac{NFalse}{{\widehat{N}}_t} \).

In the SNP-based simulations we first compared the S2SEM with ADMM algorithms with the sparse maximum likelihood SEMs (SML) with coordinate ascent algorithms [36]. The SML method assumes each phenotype has one priori known QTL, and only focus on the inference of phenotype network. So in this comparison we only calculate PD and FDR for phenotype network.

We first compare the power and FDR of the S2SEM and SML under the assumption that each QTL had only connection with one phenotype, and no pleiotropic effects were present. We considered two scenarios: 10 phenotypes and 10 SNPs, 30 phenotypes and 30 SNPs. Results were shown in Additional file 1: Figure S5. We observed that if the variants were common variants, the SML had a higher power and a lower FDR than the S2SEM. However, once rare variants were included the SML substantially lost power and increased FDR.

Now we compare the power and FDR of the S2SEM and SML under the assumption of the presence of pleiotropic genetic effects. We considered three scenarios: ten phenotypes and SNPs, 30 phenotypes and SNPs, and 100 phenotypes and SNPs. The simulation results were shown in Fig. 1. We observed that even though SML still showed very high power for common variants, its FDR was large when the genotype had pleiotropic effects. Again, in the presence of rare variants the S2SEM had a higher power and a lower FDR than the SML. Even if in the presence of only common variants, we also observed an interesting feature that when the number of phenotypes and SNPs exceeds some threshold, the power of the S2SEM became higher than the SML.

Fig. 1
figure 1

Performance of S2SEM and SML. The power and FDR of the two methods for phenotype networks inference when the phenotype and genotype number is 10 (a, b), 30 (c, d) and 100 (e, f) respectively, each genotype has pleiotropic effect for another phenotype

Next we compare the computational time of the S2SEM and SML methods. Table 1 showed their program running time for one replicate of simulations in the presence of the common variants. The computer CPU was Intel (R) Xeon E7-4870. We can see that our S2SEM method are much faster than SML method.

Table 1 The computation time for S2SEM and SML methods of one replicate for the simulations of common variants

When we study the general genotype-phenotype networks, construction of large genotype-phenotype requires heavy computations. The SML methods are not suitable for the general genotype-phenotype network inference due to its large computational time and have not be applied to general genotype-phenotype network estimation. Next we compared the S2SEM with the QTLnet algorithm [22] which can be used for joint inference of causal network and genetic architecture for correlated phenotypes. We considered three scenarios: ten phenotypes with 30 SNPs, 30 phenotypes with 100 SNPs and 100 phenotypes and 1000 SNPs. The procedures for randomly generating genotype-phenotype networks were described as in the previous section. We assumed that on the average, each phenotype was affected by three genetic variants.

Figure 2a and c showed the power of two methods: S2SEM and QTLnet for detecting the structure of the genotype (common variants, rare variants and both common and rare variants)– phenotype network as a function of sample size.

Fig. 2
figure 2

Performance of S2SEM and QTLnet. The power and FDR of the two methods for genotype-phenotype networks inference in three different settings, (a,b) is results for 10 phenotypes and 30 SNPs, (c,d) is results for 30 phenotypes and 100 SNPs. QTLnet method is too time consuming to obtain results for 100 phenotypes, (e,f) only gives results of S2SEM for different SNPs frequencies

We observed three features. The first, the power of S2SEM in all three cases was higher than QTLnet method. Second, the power of the two methods to detect the structure of the networks with the common variants was the highest, followed by the half common and half rare variants. The power of two methods to detect the structure of the network with the rare variants was the lowest. Third, in general, the power increased when the sample sizes increased. To fully evaluate the performance of the two methods, we also presented the FDR for detection of the structure of the networks as a function of sample sizes in Fig. 2 (b) and (d). It was clear that the FDR of the S2SEM in all three cases was lower than QTLnet method. The FDR of two methods to detect the structure of the networks with the common variants was the lowest, followed by the both common and rare variants. The FDR of two methods to detect the structure of the network with the rare variants was the highest. However, the false discovery rates for these two methods and in three cases were larger than 0.1 even the sample sizes reached 3000, and it is larger than 0.3 in the rare variants case.

Finally, the simulation results for the third scenario: 100 phenotypes and 1000 SNPs were shown in Fig. 2(e) and (f). The power and FDR patterns of the S2SEM and QTLnet were the same as that for the previous two scenarios. We also observed that the sample sizes increased as the sizes of the network increased. However, when the sample sizes exceeded 2000 the impact of the sizes of the network on the power became small. Table 2 showed the required computational times for network construction using the S2SEM and QTLnet. It was clear that the S2SEM still can estimate the large genotype-phenotype networks in a short time. However, the QTLnet method could not estimate such large genotype-phenotype networks in a reasonable time.

Table 2 The computation time for S2SEM and QTLnet methods of one replicate for the simulations of common variants

Figures 1 and 2 showed that the power of the variant by variant tests for identifying the network structure with the rare variants was low. To increase the power and reduce data dimensions, we develop functional SEMs (FSEMs) for network analysis using a genomic region or gene as a unit of analysis. To evaluate this strategy, we presented Fig. 3 to compare the power and FDR of the gene-based FSEMs and the SNP-based SEMs for detection of the network structures. Since the original papers for QTLnet [22] did not develop the gene-based statistics, in Fig. 3 we did not present the results of QTLnet algorithm. Simulation were conducted for two settings: ten phenotypes with ten genes (ten SNPs for each gene), and 30 phenotypes with 100 genes (ten SNPs for each gene). We observed that in all three cases: common, rare and both common and rare variants, the gene-based FSEM had much higher power and smaller FDR than the SNP-based SEMs. It is interesting to observe that even if for the rare variants the gene-based method can reach the power as high as 85 % when sample sizes were larger than 3000.

Fig. 3
figure 3

Performance of gene-based S2SEM and FSEM. The power and FDR of the two methods for inference of genotype-phenotype networks in two different settings, (a,b) is results for 10 phenoytypes and 10 genes which include 100 SNPs, (c,d) is results for 30 phenotypes and 100 genes which include 1000 SNPs

Application to real data examples

To evaluate its performance, we applied the sparse functional SEMs with a gene as a unit of analysis to a sample of 1011 European-Americans (EA) with complete exome sequencing (total of 1,861,447 common and rare variants, 18,025 genes, of which, 5288 genes were mapped to 259 pathways downloaded from the KEGG database) and 11 phenotypes: high density lipoprotein cholesterol (HDL), low density lipoprotein cholesterol (LDL), triglyceride (Trig) and total cholesterol (TotChol), fast glucose, systolic blood pressure (SBP), diastolic blood pressure, body mass index (BMI), fastinsulin, Fibrinogen, and platelet count (PLATELET) (no missing phenotype data). Inverse rank normal transformation of the phenotypes was used in the analysis.

The analysis consisted of two stages. At the first stage, the sparse functional SEMs were applied to each of the 259 KEGG pathways and 11 phenotypes to infer genotype-phenotype networks. The remaining 12,737 genes which were not mapped to KEGG pathways were divided into 100 groups according to the order of chromosomes. Again, the sparse functional SEMs were applied to each group of genes and 11 phenotypes. We identified 1789 genes with P-values for testing path coefficients < 0.05 from the analysis at the first stage. To dissect pleiotropic genetic structure, at the second stage, we select 142 genes that were connected with more than one phenotype for further analysis. The sparse functional SEMs were applied to the selected 142 genes and 11 phenotypes to infer genotype-phenotype networks. To improve the accuracy of estimation, a stability selection procedure was used to infer the structure of the network. In other words, we randomly resampled data and estimated the genotype-phenotype networks 100 times. We only selected arrows when their P-values for testing the path-coefficients were less than 0.05 and they were present in the estimated network more than 80 times, i.e., the probability for each arrow to be selected was more than 0.8. We identified a genotype-phenotype network with 137 genes directly connected to 11 phenotypes and 341 edges. One hundred fourteen genes out of 137 genes showed pleiotropic genetic effects. The results were presented in Fig. 4. Additional file 2: Table S1 shows the path coefficients and p-values according to network in Fig. 4. We observed that the most causal relationships among phenotypes had P-values < 10− 7 and stability ~1. This showed that the inference about phenotype sub-network is highly reliable. We also observed that large proportion of the edges in the phenotype network had two directions. This demonstrated that the SEMs had limitations for inferring causal networks.

Fig. 4
figure 4

A genotype-phenotype network consisted of 137 genes and 11 phenotypes. One hundred fourteen genes of all the gene nodes showed pleiotropic genetic effect. The nodes in yellow color represented the phenotypes, the nodes in light red color represented genes influencing phenotype variation, the nodes in the red color represented genes from our network were reported to be associated with 11 phenotypes or cardiovascular diseases phenotypes, the black arrows indicated the causal relations between phenotypes, the blue arrows indicted the contribution of the gene to one phenotype

We observed that PIK3R5 directly affected seven phenotypes, HS1BP3 directly affected five phenotypes, 11 genes directly affected four phenotypes, and 33 genes directly affected three phenotypes and remaining 102 genes directly affected two phenotypes. To assess the roles of path analysis in detecting genetic pleiotropic effects, we presented Table 3 that summarizes the P-values of three genes that affecting more than four phenotypes for path coefficients, the marginal effects of single and multiple traits (simple regression and multiple regression), and the minimum of P-values derived from principal component analysis (PCA) based regression. Table 3 showed that the most P-values for path coefficient were less than that for the marginal effect of corresponding single trait. Estimation of the marginal genetic effect of single trait only explores information of the target trait and genetic variants. However, estimation of the path coefficient uses information of all the relevant traits and genetic variants. This implies that path analysis has higher power to detect genetic risk variants than the traditional marginal analysis. From Table 3 we also observed that in general, each gene had at least one path with P-value for path coefficient was less or close to that for marginal effects of multiple relevant traits or their PCA analysis. Additional file 3: Table S2 summarizes the results for all the 13 genes that connected to more than 4 phenotypes.

Table 3 P-values for four different effect tests

SEMs provide a powerful tool to distinguish four types of effects: direct, indirect, total and marginal (estimated by a simple regression) effects. Additional file 4: Table S3 summarized the direct, indirect, total and marginal effects of one variable (phenotype or gene) that was referred to as the causal on another variable (phenotype) that was referred to as outcome for all 11 phenotypes and 137 genes in the Fig. 4. Investigating each type of effect allows a more comprehensive understanding of the relationship between variables.

Additional file 4: Table S3 listed the total 1414 pairs of causal relations between variables. We observed 343 (24.3 %) pairs of relations with direct effects, 1283 (90.7 %) pairs of relations with indirect effects and 212 (15 %) pairs of relations with both direct and indirect effects (Table 4 showed examples of pairs with both direct and indirect effects). This implied that the most effects are indirect effects due to mediation. In the quantitative trait locus (QTL) analysis, we often identify QTL by testing association of the marginal effect with the single trait. The SEMs provide complimentary information about path coefficients. In Table 5 we listed 25 tests in which the P-values for testing path coefficients were smaller than that for testing the marginal effects (coefficient of simple regression model, SRG) and 25 tests in which the P-values for testing marginal effects were smaller than that for testing the path coefficients. This showed that using SEMs for path analysis will discover additional QTLs that may be missed by marginal association analysis. In theory, the total effect of the causal X on outcome Y is equal to the summation of the product of the path coefficients along all possible paths between X and Y [34]. In the previous section, the total effect is defined as the coefficient \( {\beta}_{YX. Paren{t}_X} \) of X in the linear regression of Y on X and its parent set. Let Z = Parent x . The total effect β YX. Z can be expressed by [34]

$$ {\beta}_{YX.Z}={\beta}_{YX}\frac{1-\frac{\rho_{YZ}{\rho}_{ZX}}{\rho_{YX}}}{1-{\rho}_{XZ}^2}. $$
(19)
Table 4 An example of 20 pairs of variables that had both direct and indirect effects
Table 5 Twenty-five pairs of P-values for testing path coefficients and marginal effects, respectively

Since causal relations between SNPs do not exist, any SNP does not have its parent, i.e., the set Z = φ is empty. Therefore, for the SNP or gene X, we have β YX. Z  = β YX . For example, the estimator of the direct effect of gene MET on the phenotype SBP was −0.0596. MET also had path MET → DBP → SBP. The indirect effect of MET on SBP was 0.0621 × 0.605 = 0.0376. Thus, the total effect of MET on SBP was −0.022. The marginal effect β YX of MET on SBP estimated by SRG was −0.0212. The total effect of MET on SBP estimated from Fig. 4 was close to the marginal effect of MET on SBP. This example showed that if the causal relationships among the variables were completely captured by a DAG, the total effect and marginal effect were almost equal. Therefore, in the genotype-phenotype estimation process, we can use the relationship between the total and marginal effects to check whether the causal relationship modeled by a DAG is complete.

Multiple SNPs within a gene jointly have significant genetic effects, but individually each SNP make mild contributions to the phenotype variation. Table 6 listed P-values of 22 SNPs in seven genes for testing the path coefficients. We observed that single SNP made only a mild contribution to the direct effect, the multiple SNPs made significant contributions to the phenotype variation. This showed that the gene-based genotype-phenotype inference had higher power than the single SNP-based genotype-phenotype inference.

Table 6 P-values of 22 SNPs in seven genes for testing path coefficients

Since the most existing methods for genotype-phenotype network estimation only take a single SNP as a variable (unite of analysis) and cannot take a gene as a unite of analysis, next we illustrate the application of S2SEMs for inference of genotype-phenotype network using SNPs and compared their results with that of QTL-driven phenotype network method (QTLnet) [22]. The number of SNPs in 137 genes was 5482. Due to the limitation of the size of the genotype-phenotype network which the sparse multivariate SEMs can estimate, from 137 genes in Fig. 4 we selected 45 genes that were reported to be associated with the 11 phenotypes in the analysis or other cardiovascular disease (CVD) related phenotypes in the literature. A total of 1993 SNPs in the 45 genes (248 common and 1745 rare SNPs) were included in the analysis.

The gene-based genotype-phenotype network with 55 nodes (11 phenotypes and 44 genes) and 110 edges estimated using the selected 45 genes and the FSEM method was shown in Fig. 5. S2SEM can also be used to estimate gene-based genotype-phenotype network. The procedures were as follows. At the first stage, S2SEM method and all 1993 SNPs were used to estimate the SNP-phenotype network (Fig. 6) where a gene was connected to a phenotype if the minimum of P-values for the coefficients of all the paths connecting SNPs within a gene and a phenotype was less than 0.05. At the second stage, we used Bonferroni correction to adjust P-values for multiple tests. In Fig. 7, we plotted the estimated gene-phenotype network with 17 nodes (11 phenotypes and six genes) and 22 edges using the selected 45 genes and the gene-based S2SEM method where a gene was connected to a phenotype if the Bonferroni correction adjusted P-values for path coefficients connecting gene and phenotype was less than 0.05. Figures 5 and 7 showed that the gene-based FSEM method can identify much more genes influencing phenotypes than the gene-based S2SEM method.

Fig. 5
figure 5

A genotype-phenotype network consisted of 44 genes and 11 phenotypes. The network was constructed using FSEM from 45 genes. Nodes and edges are the same as described in Fig. 4

Fig. 6
figure 6

A genotype-phenotype network consisted of 31 genes and 11 phenotypes. The network was constructed using SNP-based S2SEM method from 1993 SNPs of 45 genes. Nodes and edges are the same as described in Fig. 4

Fig. 7
figure 7

A genotype-phenotype network consisted of six genes and 11 phenotypes. The network was constructed using the gene-based S2SEM method from 1993 SNPs of 45 genes where a gene was connected to a phenotype if the Bonferroni correction adjusted P-values for path coefficients connecting gene and phenotype was less than 0.05. Nodes and edges are the same as described in Fig. 4

Next we study the SNP-based genotype-phenotype network estimation using the S2SEM method. In other words, we connected genes to the phenotypes using the minimum of P-values for the coefficients of all the paths that connect SNPs within a gene and a phenotype without Bonferroni correction. Figure 6 plotted the estimated genotype-phenotype network with 42 nodes (11 phenotypes and 31 genes) and 78 edges using S2SEM method and 1993 SNPs in the 45 genes. The path coefficients and P-values (<0.05) for the path coefficients of the edges connecting the SNPs in the gene to the phenotypes were summarized in Additional file 5: Table S4. In Additional file 1: Figure S5, we plotted the estimated genotype-phenotype network with 13 nodes (ten connected phenotypes, one isolated phenotype and two genes) and 20 edges using QTLnet method. In Additional file 6: Table S5, we listed the edges of the estimated network which connect the genes and phenotypes using QTLnet method. While the QTLnet method only identified two genes: LBP connected to the phenotypes TRIGS and TOTCHOL, and DOCK1 connected to the phenotype HDL, the SNP-based and gene-based S2SEM method, respectively, discovered 31 and six genes connected to phenotypes. These results showed that all proposed SEM methods including FSEM, gene-based and SNP-based S2SEM methods outperform the QTLnet method.

Similar to the gene-based FSEM method, we observed several remarkable features from these results obtained by the S2SEM method. First, we observed three SNPs that showed pleiotropic genetic effects (rs138251768 in the gene ADAMTS19 effected SBP and DBP, rs116623954 in the gene CNIH3 affected FASTINSULIN and FIBRINOGEN, rs13223756 in the gene MET affected SBP and DBP). Second, multiple SNPs in the same gene affected the same phenotype. Three SNPs: rs754555, rs754554 and rs754553 in the gene DFNA5 jointly affected BMI, two SNPs: rs11017658 and rs61758438 in the gene DOCK1 jointly affected SBP. Third, the pleiotropic effects of the gene were due to different SNPs. The SNPs: rs564665 and rs141647150 in the gene DAB1 affected phenotypes DBP and FASTGLUCOSE, respectively; the SNPs: rs376043577 and rs3731878 in the gene IHH affected BMI and PLATELET COUNT, respectively; SNPs rs2232585 and rs2232605 in the gene LBP affected FIBRINOGEN and PLATELET COUNT, respectively. SNPs rs2305610, rs372123385 and rs17027957 in the gene OSBPL10 affected BMI, DBP and FIBRINOGEN, respectively; three SNPs: rs144082896, rs140962261 and rs11547635 in the gene SYN3 affected TRIGS, SBP and FASTINSULIN, respectively. You can find more examples from Additional file 5: Table S4 and Additional file 6: Table S5. Due to space limitation, they are omitted here.

In summary, we jointly estimated genetic architecture and phenotype network with 137 genes that were significantly connected to phenotypes. A total of 45 genes out of 137 genes were reported to be associated with 11 phenotypes or CVD related phenotypes, Additional file 7: Table S6 summarized the results of the reported 45 genes and their associated phenotypes. For the reported phenotypes, 6 phenotypes are from the analyzed 11 phenotypes. According to Fig. 4, Gene SMC2 was connected with phenotypes: BMI, HDL and FIBRINOGEN. It was reported associated with HDL and BMI [45, 46], and also related with respiratory function and Echocardiography [47, 48]. Gene RNF157 was connected with HDL, and it was reported associated with blood pressure [49] and HDL [45]. The other pairs of association for these six phenotypes were found through indirect paths from Fig. 4. For example, gene DAB1, DFNA5 and DOCK1 were reported associated with LDL [46], and there are indirect path from these genes to LDL according to Fig. 4. From these results we can summarized that our gene-based FSEMs has a rather high power to detect genetic pleiotropic effects, and it also provide a tool to decompose the effects into direct and indirect effects.

Discussion

Alternative to the standard marginal models for genetic association analysis of multiple correlated phenotypes, we have developed sparse SEMs and sparse FSEMs as a statistical framework for joint analysis of genetic architecture and causal phenotype network, which may emerge as a new generation of genetic analysis of multiple phenotypes exploring the causal network structures of the phenotypes. To facilitate using SEMs as a new paradigm for genetic analysis of multiple phenotypes, several issues have been addressed in this paper.

The first issue is to develop a unified framework for joint analysis of genetic architecture and causal phenotype network with both GWAS and the NGS data. The traditional multivariate SEMs can be applied to infer genotype-phenotype network with common variants, but are difficult to deal with rare variants. To overcome this limitation, we extend the multivariate SEMs to functional SEMs where exogenous genotype profiles across a genomic region or a gene are represented as a function of the genomic position for genetic analysis of multiple quantitative traits. In other words, we extend the variant-based genotype-phenotype network analysis to gene-based genotype-phenotype network analysis.

The second issue is how to develop statistical methods for jointly inferring genetic architecture and casual phenotype network structure. There is increasing consensus that the structure of the network in nature is sparse. However, the traditional estimation methods for the SEMs do not take the sparsity presented in the network into account. To solve this problem, we developed sparse SEMs and sparse functional SEMs to automatically incorporate the sparse condition into the estimation process. The widely used estimation method for the SEMs is the maximum likelihood method. However, the penalized maximum likelihood method and coordinate descent algorithms are not scalable to SEMs of high dimension. To overcome this limitation, we develop the ADMM-based sparse two-stage least square estimation method for the structure and parameter estimation of the SEMs. Our experience showed that the newly developed ADMM-based sparse two-stage least square estimation methods can infer networks with hundreds of nodes.

The third issue is the true structure discovery. An essential problem for the genotype-phenotype network analysis is to accurately estimate the network structure. By large scale simulations we showed that the true network structure can be accurately recovered with high probability. We also compared the performance of the sparse two-stage least square estimate methods with the QTLnet method. We demonstrated that for all the three cases (common, rare and both common and rare variants) our sparse two-stage SEMs (S2SEM) outperformed QTLnet method. Since the gene-based version of QTLnet method has not been developed we only compared the power and false discovery rates of the variant-based SEMs and gene-based functional SEMs. We found that for all spectrums of allele frequencies (common, rare and both common and rare variants) the gene-based functional SEMs substantially outperformed the variant-based multivariate SEMs.

The fourth issue is how to distinguish four types of effects: direct, indirect, total and marginal effects. The current paradigm for genetic association analysis of multiple phenotypes is genetic marginal analysis in which the effects of the genetic variants on the phenotypes are estimated by regressing phenotypes on the genetic variants. This paradigm is unable to unravel the structure of the genotype-phenotype network and to estimate direct, indirect and total effects of the genetic variants on the phenotypes. The direct, indirect and total genetic effects provide valuable information for dissecting genetic structure of complex traits. We developed sparse SEMs and FSEMs as a causal inference tool to estimate direct, indirect and total genetic effects in addition to estimating marginal genetic effects. We observed that the most effects were indirect effects due to mediation. In traditional QTL analysis, we often identify QTL by testing association of the marginal effect with the single trait. The FSEMs and SEMs provide complimentary information about path coefficients. Interestingly, we found that many P-values for testing path coefficients were smaller than that for testing the marginal effects. This demonstrated that only using marginal association analysis we might miss identification of many significant QTLs.

The fifth issue is how to solve the large genotype-phenotype networks with up to hundreds of nodes or genes. A key to the large network inference is computation efficiency of the algorithms. Two strategies were employed to solve this problem. The first strategy was to reduce the dimension of data using functional data analysis. We first expand the genotype profiles in a genomic region (gene) in terms of orthonormal eigenfunctions. Genetic information across all variants in the genomic region including all single variant variation and their linkage disequilibrium is compressed into functional principal component scores. We use genetic information compressed into functional principal component scores to infer genotype-phenotype networks. The second strategy is to use ADMM algorithms to optimally solve the sparse SEM problem. The widely used algorithms for sparse SEMs are coordinate descent algorithms borrowed from the lasso originally designed for the sparse linear regression. The ADMM algorithms are parallel and efficient. Their convergence rates are fast. The ADMM algorithms allow inferring networks with hundreds or even thousands of nodes.

Major limitation of the SEMs for joint inference of genetic architecture and causal phenotype networks is the presence of two directions associated with one edge in the estimated network, which leads to a cyclic graph. To remove the cycles from the graph we need to strictly enforce the global constraint that the graph structure has to be acyclic. Such problems are often casted into a combinatorial optimization problem. We rank graph structures via a scoring metric that measure how well the DAG models fit the data. Combinatorial optimization algorithms are then used to search the optimal DAG with the best score [50].

Although their application to genome-wide genotype-phenotype network construction is difficult due to computational limitations, the SEMs are suitable to the phenome-wide association studies where starting phenomics, defined as the unbiased study of a large number of phenotypes in a population. We study the complex networks between multiple expressed phenotypes and genetic variants. Since the number of genetic variants in the phenome-wide association is quite limited and hence the size of the genotype-phenotype network is limited, the required computational time of construction of genotype-phenotype networks using SEMs is in the range the current computer system can reach. Advances in biosensors and sequencing technologies generate large amounts of phenotype and genetic data. SEMs and causal inference may emerge as a new paradigm of genetic studies of complex traits. The main purpose of this paper is to stimulate discussions about what are the optimal strategies to facilitate the development of a new generation of genetic analysis. We hope that our results will greatly increase the confidence in joint inference of genetic architecture and causal phenotype networks.

Conclusions

We have developed sparse SEMs and sparse FSEMs as a statistical framework for joint analysis of genetic architecture and causal phenotype network, which may emerge as a new generation of genetic analysis of multiple phenotypes. Our proposed sparse functional SEMs can incorporate both common and rare variants into the analysis and the ADMM algorithm can efficiently solve the penalized SEMs. Using this model we can jointly infer genetic architecture and casual phenotype network structure, and decompose the genetic effect into direct, indirect and total effect. Using large scale simulations we showed that the proposed methods have higher power to detect true causal genetic pleiotropic structure than other existing methods.

Abbreviations

ADMM:

Alternative direction methods of multiplier, an algorithm that solves convex optimization problems

BMI:

Body mass index

CVD:

Cardiovascular disease

DAGs:

Directed acyclic graphs

DBP:

Diastolic blood pressure

ESP:

Exome Sequencing Project

FDR:

The false discovery rate

FPCs:

Functional principal components

FSEMs:

Functional structure equation models

HDL:

High density lipoprotein cholesterol

LDL:

Low density lipoprotein cholesterol

NGS:

Next generation sequencing

PCA:

Principal component analysis

PD:

The power of detection

PLATELET:

Platelet count

QTL:

Quantitative traits locus

S2SEM:

Sparse two-stage structure equation models

SBP:

Systolic blood pressure

SML:

Sparse maximum likelihood SEMs

SRG:

Simple regression model

TotChol:

Total cholesterol

Trig:

Triglyceride

VLDL:

Very low density lipoproteins

References

  1. Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010;34(2):188–93.

    Article  PubMed  Google Scholar 

  2. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–21.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5(2):e1000384.

    Article  PubMed  PubMed Central  Google Scholar 

  4. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei L-J, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010;86(6):832–8.

    Article  PubMed  PubMed Central  Google Scholar 

  5. Li Y, Byrnes AE, Li M. To identify associations with rare variants, just WHaIT: weighted haplotype and imputation-based tests. Am J Hum Genet. 2010;87(5):728–35.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Yi N, Zhi D. Bayesian analysis of rare variants in genetic association studies. Genet Epidemiol. 2011;35(1):57–69.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered. 2010;70(1):42–54.

    Article  PubMed  PubMed Central  Google Scholar 

  9. Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. Testing for an unusual distribution of rare variants. PLoS Genet. 2011;7(3):e1001322.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Ionita-Laza I, Buxbaum JD, Laird NM, Lange C. A new testing strategy to identify rare variants with either risk or protective effect on disease. PLoS Genet. 2011;7(2):e1001289.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Liu DJ, Leal SM. A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet. 2010;6(10):e1001156.

    Article  PubMed  PubMed Central  Google Scholar 

  12. Luo L, Boerwinkle E, Xiong M. Association studies for next-generation sequencing. Genome Res. 2011;21(7):1099–108.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Fan R, Wang Y, Mills JL, Wilson AF, Bailey‐Wilson JE, Xiong M. Functional linear models for association analysis of quantitative traits. Genet Epidemiol. 2013;37(7):726–42.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Luo L, Zhu Y, Xiong M. Quantitative trait locus analysis for next-generation sequencing with the functional linear models. J Med Genet. 2012;49(8):513–24.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Stephens M. A unified framework for association analysis with multiple related phenotypes. PLoS ONE. 2013;8(7):e65245.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Aschard H, Vilhjálmsson BJ, Greliche N, Morange P-E, Trégouët D-A, Kraft P. Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. Am J Hum Genet. 2014;94(5):662–76.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Schifano ED, Li L, Christiani DC, Lin X. Genome-wide association analysis for multiple continuous secondary phenotypes. Am J Hum Genet. 2013;92(5):744–59.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat Methods. 2014;11(4):407–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Bollen KA. Structural equations with latent variables. New York: John Wiley & Sons; 2014.

    Google Scholar 

  20. Lawson HA, Cady JE, Partridge C, Wolf JB, Semenkovich CF, Cheverud JM. Genetic effects at pleiotropic loci are context-dependent with consequences for the maintenance of genetic variation in populations. PLoS Genet. 2011;7(9):e1002256.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Rosa GJ, Valente BD, de Los Campos G, Wu X-L, Gianola D, Silva MA. Inferring causal phenotype networks using structural equation models. Genet Sel Evol. 2011;43(6). doi:10.1186/1297-9686-43-6

  22. Neto EC, Keller MP, Attie AD, Yandell BS. Causal Graphical Models in Systems Genetics: A Unified Framework for Joint Inference of Causal Network and Genetic Architecture for Correlated Phenotypes. Ann Appl Stat. 2010;4(1):320–39.

    Article  PubMed  PubMed Central  Google Scholar 

  23. Neto EC, Ferrara CT, Attie AD, Yandell BS. Inferring causal phenotype networks from segregating populations. Genetics. 2008;179(2):1089–100.

    Article  Google Scholar 

  24. Rockman MV. Reverse engineering the genotype-phenotype map with natural genetic variation. Nature. 2008;456(7223):738–44.

    Article  CAS  PubMed  Google Scholar 

  25. Winrow CJ, Williams DL, Kasarskis A, Millstein J, Laposky AD, Yang HS, Mrazek K, Zhou L, Owens JR, Radzicki D, et al. Uncovering the genetic landscape for multiple sleep-wake traits. PLoS ONE. 2009;4(4):e5161.

    Article  PubMed  PubMed Central  Google Scholar 

  26. Hageman RS, Leduc MS, Korstanje R, Paigen B, Churchill GA. A Bayesian framework for inference of the genotype-phenotype map for segregating populations. Genetics. 2011;187(4):1163–70.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Schadt EE, Molony C, Chudin E, Hao K, Yang X, Lum PY, Kasarskis A, Zhang B, Wang S, Suver C, et al. Mapping the genetic architecture of gene expression in human liver. PLoS Biol. 2008;6(5):e107.

    Article  PubMed  PubMed Central  Google Scholar 

  28. Li Y, Tesson BM, Churchill GA, Jansen RC. Critical reasoning on causal inference in genome-wide linkage and association studies. Trends Genet. 2010;26(12):493–8.

    Article  PubMed  PubMed Central  Google Scholar 

  29. Duarte CW, Zeng ZB. High-confidence discovery of genetic network regulators in expression quantitative trait loci data. Genetics. 2011;187(3):955–64.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Mi X, Eskridge K, Wang D, Baenziger PS, Campbell BT, Gill KS, Dweikat I, Bovaird J. Regression-based multi-trait QTL mapping using a structural equation model. Stat Appl Genet Mol. 2010;9(1):1–23.

    Google Scholar 

  31. Valente BD, Rosa GJ, de Los CG, Gianola D, Silva MA. Searching for recursive causal structures in multivariate quantitative genetics mixed models. Genetics. 2010;185(2):633–44.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Li R, Tsaih S-W, Shockley K, Stylianou IM, Wergedal J, Paigen B, Churchill GA. Structural model analysis of multiple quantitative traits. PLoS Genet. 2006;2(7):e114.

    Article  PubMed  PubMed Central  Google Scholar 

  33. Hauser A, Bühlmann P. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. J Mach Learn Res. 2012;13(1):2409–64.

    Google Scholar 

  34. Maathuis MH, Kalisch M, Bühlmann P. Estimating high-dimensional intervention effects from observational data. Ann Stat. 2009;37(6A):3133–64.

    Article  Google Scholar 

  35. Kalisch M, Bühlmann P. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J Mach Learn Res. 2007;8:613–36.

    Google Scholar 

  36. Cai X, Bazerque JA, Giannakis GB. Inference of gene regulatory networks with sparse structural equation models exploiting genetic perturbations. PLoS Comp Biol. 2013;9(5):e1003068.

    Article  CAS  Google Scholar 

  37. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trend Mach Learn. 2011;3(1):1–122.

    Article  Google Scholar 

  38. Judge GG, Hill RC, Griffiths WE, Lutkepohl H, Lee T-C. Introduction to the Theory and Practice of Econometrics. New York: Wiley; 1982.

    Google Scholar 

  39. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–41.

    Article  PubMed  Google Scholar 

  40. Pearl J. Direct and indirect effects. In: Proceedings of the seventeenth conference on uncertainty in artificial intelligence: 2001. San Francisco: Morgan Kaufmann Publishers Inc; 2001. p. 411–20.

    Google Scholar 

  41. Pearl J. The deductive approach to causal inference. J Causal Infer. 2014;2(2):115–29.

    Google Scholar 

  42. Chen B, Pearl J. Graphical tools for linear structural equation modeling. In.: DTIC Document; 2014.

  43. Glickman ME, Rao SR, Schultz MR. False discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies. J Clin Epidemiol. 2014;67(8):850–7.

    Article  PubMed  Google Scholar 

  44. Veazie PJ. When to combine hypotheses and adjust for multiple tests. Health Serv Res. 2006;41(3p1):804–18.

    Article  PubMed  PubMed Central  Google Scholar 

  45. Kathiresan S, Manning AK, Demissie S, D’Agostino RB, Surti A, Guiducci C, Gianniny L, Burtt NP, Melander O, Orho-Melander M, et al. A genome-wide association study for blood lipid phenotypes in the Framingham Heart Study. BMC Med Genet. 2007;8 Suppl 1:S17.

    Article  PubMed  Google Scholar 

  46. Fox CS, Heard-Costa N, Cupples LA, Dupuis J, Vasan RS, Atwood LD. Genome-wide association to body mass index and waist circumference: the Framingham Heart Study 100K project. BMC Med Genet. 2007;8 Suppl 1:S18.

    Article  PubMed  Google Scholar 

  47. Wilk JB, Walter RE, Laramie JM, Gottlieb DJ, O’Connor GT. Framingham Heart Study genome-wide association: results for pulmonary function measures. BMC Med Genet. 2007;8 Suppl 1:S8.

    Article  PubMed  Google Scholar 

  48. Vasan RS, Larson MG, Aragam J, Wang TJ, Mitchell GF, Kathiresan S, Newton-Cheh C, Vita JA, Keyes MJ, O’Donnell CJ, et al. Genome-wide association of echocardiographic dimensions, brachial artery endothelial function and treadmill exercise responses in the Framingham Heart Study. BMC Med Genet. 2007;8 Suppl 1:S2.

    Article  PubMed  Google Scholar 

  49. O’Donnell CJ, Cupples LA, D’Agostino RB, Fox CS, Hoffmann U, Hwang SJ, Ingellson E, Liu C, Murabito JM, Polak JF, et al. Genome-wide association study for subclinical atherosclerosis in major arterial territories in the NHLBI’s Framingham Heart Study. BMC Med Genet. 2007;8 Suppl 1:S4.

    Article  PubMed  Google Scholar 

  50. Loh P-L, Bühlmann P. High-dimensional learning of linear causal networks via inverse covariance estimation. J Mach Learn Res. 2014;15(1):3065–105.

    Google Scholar 

Download references

Acknowledgments

Mr. Xiong was supported by Grant 1R01AR057120–01 and 1R01HL106034-01, from the National Institutes of Health and NHLBI. Ms. Wang and Mr. Jin were supported by Grant 31330038 from the National Science Foundation of China and grant B13016 from the Programme of Introducing Talents of Discipline to Universities (111 Project). Ms. Wang was also supported by China Scholarship Council.

The authors thank two anonymous reviewers for the thorough reading of the manuscript and thoughtful suggestions that improved the manuscript.

The authors wish to acknowledge the support of the National Heart, Lung, and Blood Institute (NHLBI) and the contributions of the research institutions, study investigators, field staff and study participants in creating this resource for biomedical research. Funding for GO ESP was provided by NHLBI grants RC2 HL-103010 (HeartGO), RC2 HL-102923 (LungGO) and RC2 HL-102924 (WHISP). The exome sequencing was performed through NHLBI grants RC2 HL-102925 (BroadGO) and RC2 HL-102926 (SeattleGO).

Funding

NIH, National Heart, Lung, and Blood Institute (NHLBI), National Science Foundation of China, Programme of Introducing Talents of Discipline to Universities (111 Project). These funding agencies supported the design of the study, analysis, interpretation of data and writing the manuscript.

Availability of data and materials

Exome sequence data were generated from the NHLBI’s Exome Sequencing Project (ESP) and have been deposited in dbGaP as part of the ESP cohort data. The data can be downloaded from dbGaP (https://esp.gs.washington.edu/drupal/dbGaP_Releases).

Authors’ contributions

MX and PW proposed the model and drafted the manuscript. PW implemented the model and carried out most of the real data analysis. PW and MR completed the simulations together. LJ, MX and PW designed the analysis for real data application. All authors read and approved the final manuscript.

Competing interest

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Li Jin or Momiao Xiong.

Additional files

Additional file 1: Figure S1.

A scheme of genotype-phenotype network. Figure S2. Diagram associated with effect decomposition. Figure S3. Diagram of a simulation example for illustrating equation (12). Figure S4. An example for the simulated genotype-phenotype network. The network consisted of ten phenotype nodes and 30 genotype (SNP) nodes. Figure S5. Performance of S2SEM and SML for phenotype network inference. The power and FDR of the two methods for inference of phenotype networks when the phenotype and genotype number is 10 and 30 respectively. Figure S6. A genotype-phenotype network consisted of two genes that were reported to be associated with phenotypes in the analysis or other CVD related phenotypes in the literatures and ten phenotypes (one isolated phenotype didn’t appear) estimated using QTLnet method. The nodes in yellow color represented the phenotypes, the nodes in the red color represented genes, the black arrows indicated the causal relations between phenotypes and the blue arrows indicted the contribution of the gene to one phenotype. (DOCX 4101 kb)

Additional file 2: Table S1.

Path coefficients and P-values in Fig. 4. (XLSX 40 kb)

Additional file 3: Table S2.

P-values for the path coefficient, mariginal effects of single trait and multiple traits, and minimum of P-values from PCA analysis for 13 genes that are connected to more than 4 phenotypes in Fig. 4. (XLSX 17 kb)

Additional file 4: Table S3.

Direct, indirect, total and marginal effects for Fig. 4. (XLSX 94 kb)

Additional file 5: Table S4.

P-value for the path coefficient (SNPs to Phenotype) of the genotype-phenotype networks with 31 genes and 11 phenotypes estimated using S2SEM method. (XLSX 13 kb)

Additional file 6: Table S5.

SNPs to phenotype edges in the genotype-phenotype network estimated by QTL net method. (XLSX 8 kb)

Additional file 7: Table S6.

Reported 45 genes out of 137 genes that are associated with the phenotypes in the analysis or other CVD related phenotypes. (XLSX 11 kb)

Appendices

Appendix 1

Alternating direction method of multipliers for sparse SEMs

The optimization problem (7) can be further reduced to

$$ \begin{array}{l} \min \kern2.25em f\left({\Delta}_i\right)+\lambda \left|\right|{Z}_i\left|\right|{}_1\\ {}\mathrm{subject}\ \mathrm{t}\mathrm{o}\kern0.75em {\Delta}_i-{Z}_i=0.\end{array} $$
(20)

To solve the optimization problem (20), we form the augmented Lagrangian

$$ {L}_{\rho}\left({\Delta}_{i,}{Z}_i,\mu \right)=f\left({\Delta}_i\right)+\lambda \left|\right|{Z}_i\left|\right|{}_1+{\mu}^T\left({\Delta}_i-{Z}_i\right)+\frac{\rho }{2}\left|\right|{\Delta}_i-{Z}_i\left|\right|{}_2^2. $$
(21)

The alternating direction method of multipliers (ADMM) consists of the iterations:

$$ {\Delta}_i^{\left(k+1\right)}:==\underset{\Delta_i}{ \arg \min }{L}_{\rho}\left({\Delta}_i,{Z}_i^{(k)},{\mu}^{(k)}\right) $$
(22)
$$ {Z}_i^{\left(k+1\right)}:==\underset{Z_i}{ \arg \min }{L}_{\rho}\left({\Delta}_i^{\left(k+1\right),},{Z}_i,{\mu}^{(k)}\right) $$
(23)
$$ {\mu}^{\left(k+1\right)}:=={\mu}^{\left(k+1\right)}+\rho \left({\Delta}_i^{\left(k+1\right)}-{Z}_i^{\left(k+1\right)}\right), $$
(24)

where ρ > 0. Let \( u=\frac{\mu }{\rho } \). Eq. (20, 21 and 22) can be reduced to

$$ {\Delta}_i^{\left(k+1\right)}:==\underset{\Delta_i}{ \arg \min }\ \left(f\left({\Delta}_i\right)+\frac{\rho }{2}\left|\right|{\Delta}_i-{Z}_i^{(k)}+{u}^{(k)}\left|\right|{}_2^2\right) $$
(25)
$$ {Z}_i^{\left(k+1\right)}:==\underset{Z_i}{ \arg \min }\ \left(\lambda \left|\right|{Z}_i\left|\right|{}_1+\frac{\rho }{2}\left|\right|{\Delta}_i^{\left(k+1\right)}-{Z}_i+{u}^{(k)}\left|\right|{}_2^2\right) $$
(26)
$$ {u}^{\left(k+\right)}:=={u}^{(k)}+\left({\Delta}_i^{\left(k+1\right)}-{Z}_i^{\left(k+1\right)}\right). $$
(27)

Solving minimization problem (25), we obtain

$$ {\Delta}_i^{\left(k+1\right)}={\left[{W}_i^TX{\left({X}^TX\right)}^{-1}{X}^T{W}_i+\rho I\right]}^{-1}\left[{W}_i^TX{\left({X}^TX\right)}^{-1}{X}^T{y}_i+\rho \left({Z}_i^k-{u}^k\right)\right], $$

which can be reduced to

$$ {\Delta}_i^{\left(k+1\right)}=\left[\frac{1}{\rho }I-\frac{1}{\rho }{W}_i^TX{\left(\rho {X}^TX+{X}^T{W}_i{W}_i^TX\right)}^{-1}{X}^T{W}_i\right]\left[{W}_i^TX{\left({X}^TX\right)}^{-1}{X}^T{y}_i+\rho \left({Z}_i^k-{u}^k\right)\right] $$
(28)

The optimization problem (26) is non-differentiable. Although the first term in (26) is not differentiable, we still can obtain a simple closed-form solution to the problem (26) using subdiffenrential calculus [37]. Let Γ j be a generalized derivative of the j-th component Z j i of the vector Z i and Γ = [Γ1, …, Γ M + K − 1]T where

$$ {\Gamma}_j=\left\{\begin{array}{c}\hfill \begin{array}{cc}\hfill 1\hfill & \hfill {Z}_i^j>0\hfill \end{array}\hfill \\ {}\hfill \begin{array}{cc}\hfill \left[-1,1\right]\hfill & \hfill {Z}_i^j=0\hfill \end{array}\hfill \\ {}\hfill \begin{array}{cc}\hfill -1\hfill & \hfill {Z}_i^j<0\hfill \end{array}\hfill \end{array}\right. $$

Then, we have

$$ \frac{\lambda }{\rho}\varGamma +{Z}_i={\Delta}_i^{k+1}+{u}^k, $$

which implies that

$$ {Z}_i^{\left(k+1\right)}=\operatorname{sgn}\left({\Delta}_i^{k+1}+{u}^k\right){\left(\left|{\Delta}_i^{k+1}+{u}^k\right|-\frac{\lambda }{\rho}\right)}_{+}, $$
(29)

Where

$$ \left|x\right|{}_{+}=\left\{\begin{array}{cc}\hfill x\hfill & x\ge 0\\ {}\hfill 0\hfill & \hfill x<0.\hfill \end{array}\right. $$

Appendix 2

Estimation of parameters in the sparse structural functional equation models for the genotype-phenotype networks

Assume that the sparse SFEMs are given by

$$ \begin{array}{l}{y}_1{\gamma}_{11}+{y}_2{\gamma}_{21}+\dots +{y}_M{\gamma}_{M1}+{\displaystyle {\int}_{T_1}{x}_1(t){\beta}_{11}(t)dt+\dots +{\displaystyle {\int}_{T_k}{x}_k(t){\beta}_{k1}(t)dt+{e}_1}}=0\\ {}{y}_1{\gamma}_{12}+{y}_2{\gamma}_{22}+\dots +{y}_M{\gamma}_{M2}+{\displaystyle {\int}_{T_1}{x}_1(t){\beta}_{12}}(t)dt+\dots +{\displaystyle {\int}_{T_k}{x}_k(t){\beta}_{k2}(t)dt+{e}_2=0}\\ {}\vdots \kern11.75em \vdots \kern17.5em \vdots \\ {}{\mathrm{y}}_1{\gamma}_{1M}+{y}_2{\gamma}_{2M}+\dots +{y}_M{\gamma}_{MM}+{\displaystyle {\int}_{T_1}{x}_1(t){\beta}_{1M}(t)dt+\dots +{\displaystyle {\int}_{T_k}{x}_k(t){\beta}_{kM}(t)dt+{e}_M=0}}\end{array} $$
(30)

For each genomic region or gene, we use functional principal component analysis to calculate principal component function [14]. We expand x ij (t), j = 1, 2, …, k in each genomic region in terms of orthogonal principal component functions:

$$ {x}_{ij}(t)={\displaystyle \sum_{l=1}^{L_j}{\eta}_{ijl}{\varphi}_{jl}(t)},j=1,\dots, k, $$
(31)

where φ jl (t), j = 1, …, k, l = 1, …, L j are the l-th principal component function in the j-th genomic region and η ijl are the functional principal component scores of the i-th individual. Using the functional principal component expansion of x ij (t), we obtain

$$ {\displaystyle \underset{T}{\int }{x}_{ij}(t){\beta}_{jm}(t)}dt={\displaystyle \underset{T}{\int }{\displaystyle \sum_{l=1}^{L_j}{\eta}_{ijl}{\varphi}_{jl}(t)}{\beta}_{jm}(t)}dt={\displaystyle \sum_{l=1}^{L_j}{\eta}_{ijl}{b}_{jlm}},i=1,\dots, n,j=1,\dots, k,m=1,\dots, M. $$
(32)

Let x j (t) = [x 1j (t), …, x nj (t)]T, η jl  = [η 1jl , …, η njl ]T. Substituting eq. (32) into eq. (30), we obtain

$$ \begin{array}{l}{y}_1{r}_{11}+{y}_2{r}_{21}+\cdots +{y}_M{r}_{M1}+{\displaystyle \sum_{l=1}^{L_1}{\eta}_{1l}{b}_{1l1}}+\cdots +{\displaystyle \sum_{l=1}^{L_k}{\eta}_{kl}{b}_{kl1}}+{e}_1=0\\ {}\vdots \kern11.5em \vdots \kern12em \vdots \\ {}{y}_1{r}_{1M}+{y}_2{r}_{2M}+\cdots +{y}_M{r}_{MM}+{\displaystyle \sum_{l=1}^{L_1}{\eta}_{1l}{b}_{1lM}}+\cdots +{\displaystyle \sum_{l=1}^{L_k}{\eta}_{kl}{b}_{klM}}+{e}_M=0\end{array} $$
(33)

Let \( \eta =\left[{\eta}_{11},\dots, {\eta}_{1{L}_1,\dots, }{\eta}_{k1},\dots, {\eta}_{k{L}_k}\right],\kern0.5em B=\left[\begin{array}{ccc}\hfill \begin{array}{c}\hfill {b}_{111}\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {b}_{1{L}_11}\hfill \end{array}\hfill & \hfill \begin{array}{c}\hfill \cdots \hfill \\ {}\hfill \vdots \hfill \\ {}\hfill \cdots \hfill \end{array}\hfill & \hfill \begin{array}{c}\hfill {b}_{11M}\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {b}_{1{L}_1M}\hfill \end{array}\hfill \\ {}\hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \vdots \hfill \\ {}\hfill \begin{array}{c}\hfill {b}_{k11}\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {b}_{k{L}_k1}\hfill \end{array}\hfill & \hfill \begin{array}{c}\hfill \cdots \hfill \\ {}\hfill \vdots \hfill \\ {}\hfill \vdots \hfill \end{array}\hfill & \hfill \begin{array}{c}\hfill {b}_{k1M}\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {b}_{k{L}_kM}\hfill \end{array}\hfill \end{array}\right] \) and Y, Γ and E be defined as before.

In matrix form, eq. (33) can be rewritten as

$$ Y\Gamma +\eta B+E=0, $$
(34)

which has the same form as the Eq. (2) has.

If we consider only one genomic region or gene, the matrices η and B will be reduced to η = [η 1, …, η L ] and \( B=\left[\begin{array}{ccc}\hfill {b}_{11}\hfill & \hfill \cdots \hfill & \hfill {b}_{1M}\hfill \\ {}\hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \vdots \hfill \\ {}\hfill {b}_{L1}\hfill & \hfill \cdots \hfill & \hfill {b}_{LM}\hfill \end{array}\right]. \)

If we take functional principal component scores as predictors, the models and algorithms for network structure and parameter estimation will be similar to that discussed in Appendix 1. Specifically, the i-th equation is given by

YΓ i  + ηB i  + e i  = 0,

which can be rewritten as

$$ {y}_i={W}_i{\Delta}_i+{e}_i, $$
(35)

where \( {W}_i=\left[\begin{array}{cc}\hfill {Y}_{-i}\hfill & \hfill \eta \hfill \end{array}\right],{\Delta}_i=\left[\begin{array}{cc}\hfill {\gamma}_{-i}\hfill & \hfill {B}_i\hfill \end{array}\right] \).

Then, the sparse SFEMs are transformed to

$$ \begin{array}{l}\underset{\varDelta_i}{ \min}\kern0.5em f\left({\Delta}_i\right)+\lambda \left|\right|{\Delta}_i\left|\right|{}_1\\ {}\mathrm{where}\ f\left({\Delta}_i\right)={\left({\eta}^T{y}_i-{\eta}^T{W}_i{\Delta}_i\right)}^T{\left({\eta}^T\eta \right)}^{-1}\left({\eta}^T{y}_i-{\eta}^T{W}_i{\Delta}_i\right).\end{array} $$
(36)

Finally, ADMM algorithms are given by

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Wang, P., Rahman, M., Jin, L. et al. A new statistical framework for genetic pleiotropic analysis of high dimensional phenotype data. BMC Genomics 17, 881 (2016). https://doi.org/10.1186/s12864-016-3169-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12864-016-3169-1

Keywords