Multivariate quantitative trait association analysis can be investigated by phenotype-genotype networks, which can be represented as a graph. Phenotypes, covariates such as age, sex, race, and SNPs are variables. Variables are represented as nodes in the graph. We assume that causal relationships among phenotypes exist. Therefore, a phenotype network is represented by a directed graph. A directed edge between two nodes indicates the causal relationship between them. Since SNPs do not have causal relationships among them, a genotype network is represented as an undirected graph. An edge between two nodes in the genotype network indicates their correlation. Since all SNPs and covariates may cause changes in phenotypes, the phenotype network and genotype network are connected by edges directed from covariates and SNP to the phenotypes. The phenotypes and connections between phenotypes, covariates and SNPs can be modeled by structural equations. The genotype network can be leant by graphical LASSO (GLASSO) [39], here we didn’t focus on genotype network in this paper. An example of phenotype-genotype network is shown in Additional file 1: Figure S1.
SEMs for multivariate association analysis
The SEMs offer a general statistical framework for inferring phenotype networks and connections between genotypes and phenotypes. Assume that n individuals are sampled. We consider M phenotypes that are referred to as endogenous variables. The endogenous variables are jointly determined in the model and are also influenced by the variables outside the model. We denote the n observations on the M endogenous variables by the matrix Y = [y
1, y
2, …, y
M
], where y
i
= [y
1i
, …, y
ni
]T is a vector of collecting n observation of the endogenous variable i. Covariates, genetic variants as exogenous or predetermined variables are denoted by X = [x
1, …, x
K
] where x
i
= [x
1i, …, x
ni
]T. Similar to independent variables in the regression, the exogenous variables are outside the models and are not influenced by the variables in the model. Similarly, random errors are denoted by E = [e
1, …, e
M
], where we assume E[e
i
] = 0 and E[e
i
e
T
i
] = σ
2
i
I
n
for i = 1, …, M. Recall that the relationships between the phenotypes and genotypes are traditionally described by the regressions where the phenotypes are taken as dependent variables and genotypes are taken as independent variables are predictors. In the regression models, the dependence relationships among dependent variables or phenotypes cannot be explicitly expressed. Therefore, the regression models cannot be used to determine which phenotypes cause the variations of which phenotypes. To overcome this limitation, we introduce linear structural equations. The linear structural equations for modeling relationships among phenotypes and genotypes can be written as [38].
$$ \begin{array}{l}{y}_1{\gamma}_{11}+{y}_2{\gamma}_{21}+\dots +{y}_M{\gamma}_{M1}+{x}_1{\beta}_{11}+{x}_2{\beta}_{21}+\dots +{x}_K{\beta}_{K1}+{e}_1=0\\ {}\begin{array}{cccc}\hfill \hfill & \hfill \hfill & \hfill \vdots \hfill & \hfill \begin{array}{cccc}\hfill \hfill & \hfill \hfill & \hfill \hfill & \hfill \begin{array}{cccc}\hfill \hfill & \hfill \hfill & \hfill \hfill & \hfill \begin{array}{cccc}\hfill \hfill & \hfill \hfill & \hfill \hfill & \hfill \begin{array}{cccc}\hfill \hfill & \hfill \hfill & \hfill \hfill & \hfill \begin{array}{ccc}\hfill \hfill & \hfill \vdots \hfill & \hfill \hfill \end{array}\hfill \end{array}\hfill \end{array}\hfill \end{array}\hfill \end{array}\hfill \end{array}\\ {}{y}_1{\gamma}_{1M}+{y}_2{\gamma}_{2M}+\dots +{y}_M{\gamma}_{MM}+{x}_1{\beta}_{1M}+{x}_2{\beta}_{2M}+\dots +{x}_K{\beta}_{KM}+{e}_M=0\end{array} $$
(1)
where the γ’s and β’s are the structural parameters of the system that are unknown.
Variables in the SEMs can be classified into two basic types of variables: observed variables that can be measured and the residual error variables that cannot be measured and represent all other unmodeled causes of the variables. Most observed variables (e. g. phenotypes such as BMI, blood pressure, high density lipoprotein, low density lipoprotein) are random. Some observed variables may be nonrandom or control variables (e. g. genotypes, drug dosages) whose values remain the same in repeated random sampling or might manipulated by the experimenter. The observed variables will be further classified into exogenous variables (e.g. genotypes, age, sex, race), which lie outside the model, and endogenous variables (e.g. phenotypes), whose values are determined through joint interaction with other variables within the system. All nonrandom variables can be viewed as exogenous variables. Phenotypes are viewed as endogenous variables. The terms exogenous and endogenous are model specific. It may be that an exogenous variable in one model is endogenous in another. The structural parameters γ describe the relationships between phenotypes and parameters β measure the direct genetic effects of the genotypes on the phenotypes.
In matrix notation the SEMs (1) can be rewritten as
$$ Y\varGamma +XB+E=0, $$
(2)
where Γ = [Γ
1, …, Γ
M
], Γ
i
= [γ
1i
, …, γ
Mi
]T, B = [B
1, …, B
M
], B
i
= [β
1i
, …, β
Ki
]T.
We assume that the random errors in the structural equations are independent and uncorrelated with exogenous variables. We apply the sparsity penalty to each equation to ensure that the sparse SEMs are identifiable.
Two-stage least square estimates of the parameters in the SEMs
The ordinary least squares estimator is biased and inconsistent for the parameters of structural equations. To ensure the consistent estimates of the parameters in the SEMs, we use a generalized least square method that can be interpreted as a two-stage least square estimate method to estimate the parameters in the SEMs [38].
Recalling that y
i
is the vector of observations of the variable i, let Y
− i
be the observation matrix Y after removing y
i
from it and γ
− i
be the parameter vector Γ
i
after removing the parameter γ
ii
. The ith equation:
$$ Y{\varGamma}_i+X{B}_i+{e}_i=0 $$
can be rewritten as
$$ \begin{array}{l}{y}_i={Y}_{-i}{\gamma}_{-i}+X{B}_i+{e}_i\\ {}\begin{array}{cc}\hfill \hfill & \hfill ={W}_i{\Delta}_i+{e}_i\hfill \end{array},\end{array} $$
(3)
where \( {W}_i=\left[\begin{array}{cc}\hfill {Y}_{-i}\hfill & \hfill X\hfill \end{array}\right],{\Delta}_i={\left[\begin{array}{cc}\hfill {\gamma}_{-i}^T\hfill & \hfill {B}_i^T\hfill \end{array}\right]}^T. \)
Multiplying by the matrix X
T on both sides of eq. (3), we obtain
$$ {X}^T{y}_i={X}^T{Y}_{-i}{\gamma}_{-i}+\left({X}^TX\right){B}_i+{X}^T{e}_i={X}^T{W}_i{\Delta}_i+{X}^T{e}_i. $$
(4)
It is known that
$$ \operatorname{cov}\left({X}^T{e}_i,{X}^T{e}_i\right)={X}^TX{\sigma}_i^2. $$
The generalized least square estimate \( {\widehat{\Delta}}_i \) is given by
$$ {\widehat{\Delta}}_i={\left[{W}_i^TX{\left({X}^TX\right)}^{-1}{X}^T{W}_i\right]}^{-1}{W}_i^TX{\left({X}^TX\right)}^{-1}{X}^T{y}_i. $$
(5)
The generalized least square estimate \( {\widehat{\Delta}}_i \) can be interpreted as a two-stage least square estimate [38].
Suppose that in the first stage, Y
− i
is regressed on X to obtain
\( {\widehat{\Pi}}_i={\left({X}^TX\right)}^{-1}{X}^T{Y}_{-i} \) and \( {\widehat{Y}}_{-i}=X{\widehat{\Pi}}_i. \)
Then,
$$ \begin{array}{l}{\widehat{W}}_i=\left[\begin{array}{cc}\hfill {\widehat{Y}}_{-i}\hfill & \hfill X\hfill \end{array}\right]\\ {}\begin{array}{cc}\hfill \hfill & \hfill =X{\left({X}^TX\right)}^{-1}{X}^T{W}_i\hfill \end{array}.\end{array} $$
Eq. (5) can be reduced to
$$ {\widehat{\Delta}}_i={\left({\widehat{W}}_i^T{\widehat{W}}_i\right)}^{-1}{\widehat{W}}_i^T{y}_i. $$
(6)
Therefore, if W
i
in eq. (3) is replaced by Ŵ
i
, eq. (6) can be interpreted as that in the second stage, y
i
is regressed on Ŷ
i
and X to obtain estimate \( {\widehat{\Delta}}_i \).
Sparse SEMs and alternating direction method of multipliers
In general, the genotype-phenotype networks are sparse. Therefore, Γ and B are sparse matrices. In order to obtain sparse estimates of Γ and B, the natural approach is the l
1 -norm penalized regression of eq. (4). Using weighted least square and l
1 -norm penalization, we can form the following optimization problem:
$$ \begin{array}{l}\underset{\Delta_i}{ \min}\kern0.5em f\left({\Delta}_i\right)+\lambda \left|\right|{\Delta}_i\left|\right|{}_1\\ {}\mathrm{where}\ f\left({\Delta}_i\right)={\left({X}^T{y}_i-{X}^T{W}_i{\Delta}_i\right)}^T{\left({X}^TX\right)}^{-1}\left({X}^T{y}_i-{X}^T{W}_i{\Delta}_i\right).\end{array} $$
(7)
The size of the genotype-phenotype network may be large. The efficient ADMM [37] algorithm is used to solve the optimization problem (7). The procedure for implementing ADMM is given below (more detailed descriptions are provided in Appendix 1).

Under some assumptions convergence of ADMM can be proved [37]. In practice, although it can be slow to converge to high accuracy, ADMM converges to modest accuracy within a few tens of iterations. When large-scale problems and parameter estimation problems are considered, modest accuracy is sufficient. Therefore, ADMM may work very well for structure and parameter estimation in the genotype-phenotype networks.
Most of the elements of matrices Γ and B are equal to zero. The l
1 − regularized Lasso for the two stage least squares approach and ADMM algorithms are expected to shrink most of the coefficient matrices Γ and B toward zero, yielding sparse network structures. The sparsity-controlling parameter λ will be estimated via cross validation or set by users to get reasonable results. We abbreviate this sparse two stage least square estimation of SEMs as S2SEMs.
Sparse functional structural equation models for phenotype and genotype networks
Fast and cheaper next generation sequencing (NGS) technologies will generate unprecedentedly massive and highly-dimensional genomic variation data. Despite their promise, next generation sequencing platforms also have three specific features: high error rates, enrichment of rare variants and large proportion of missing values. Available causal analysis platforms for genetic studies which are mainly designed for common variants provide useful tools for single marker-based pleiotropic genetic analysis, but have limitations in analyzing thousands of sequences collected for very large population-based studies of humans. To address the critical barrier in causal genetic analysis with NGS data, we extend the multivariate SEMs to functional SEMs where exogenous genotype profiles across a genomic region are represented as a function of the genomic position. To effectively reduce the dimension of the data, we use genetic variant profiles which will recognize information contained in the physical location of the SNP as a major data form. The densely distributed genetic variants across the genomes in large samples can be viewed as realizations of a Poisson process. The densely typed genetic variants in a genomic region for each individual are so close that these genetic variant profiles can be treated as observed data taken from curves. The genetic variant profiles are called functional.
Large simulations have shown that combining information across multiple variants in a genomic region of analysis will greatly enhance power to detect association of rare variants [9]. To jointly utilize multi-locus genetic information and reduce the dimension of the NGS data, we propose to use a genomic region or a gene as a unit in multiple trait association analysis and develop sparse functional structural equation models (FSEMs) for construction and analysis of the phenotype and genotype networks. The FSEMs collectively analyze the contribution of multiple variants to the traits, reduce the errors in the NGS data via data reduction techniques and can effectively deal with missing data through the smooth mechanism of the function curves of the data.
Let t be a genomic position. Define a genotype profile x
i
(t) of the i-th individual as
$$ {x}_i\left(\mathrm{t}\right)=\left\{\begin{array}{c}\hfill 2{P}_q\left(\mathrm{t}\right), \kern2.25em \mathrm{Q}\mathrm{Q}\hfill \\ {}\hfill {P}_q\left(\mathrm{t}\right)\hbox{-} {\mathrm{P}}_{\mathrm{Q}}\left(\mathrm{t}\right),\kern0.5em \mathrm{Q}\mathrm{q}\hfill \\ {}\hfill -2{P}_Q\left(\mathrm{t}\right), \kern1.5em \mathrm{q}\mathrm{q}\hfill \end{array}\right. $$
where Q and q are two alleles of the marker at the genomic position t, P
Q
(t) and P
q
(t) are the frequencies of the alleles Q and q, respectively. Suppose that we are interested in k genomic regions or genes [a
j
, b
j
], denoted as T
j
, j = 1, 2, …, k. We consider the following functional structural equation models:
$$ \begin{array}{l}{y}_1{\gamma}_{11}+{y}_2{\gamma}_{21}+\dots +{y}_M{\gamma}_{M1}+{\displaystyle {\int}_{T_1}{x}_1(t){\beta}_{11}(t)dt+\dots +{\displaystyle {\int}_{T_k}{x}_k(t){\beta}_{k1}(t)dt+{e}_1}}=0\\ {}{y}_1{\gamma}_{12}+{y}_2{\gamma}_{22}+\dots +{y}_M{\gamma}_{M2}+{\displaystyle {\int}_{T_1}{x}_1(t){\beta}_{12}}(t)dt+\dots +{\displaystyle {\int}_{T_k}{x}_k(t){\beta}_{k2}(t)dt+{e}_2=0}\\ {}\vdots \kern11.75em \vdots \kern17.5em \vdots \\ {}{\mathrm{y}}_1{\gamma}_{1M}+{y}_2{\gamma}_{2M}+\dots +{y}_M{\gamma}_{MM}+{\displaystyle {\int}_{T_1}{x}_1(t){\beta}_{1M}(t)dt+\dots +{\displaystyle {\int}_{T_k}{x}_k(t){\beta}_{kM}(t)dt+{e}_M=0}}\end{array} $$
(8)
where β
ij
(t) are genetic effect functions.
Functional principal components (FPCs) are efficient summary statistics. The FPCs simultaneously employs genetic information of the individual variants and correlation information (linkage disequilibrium) among all variants. The FPCs view the genetic variation across the genomic region as a function of its genomic location and uses intrinsic functional dependence structure of the data and all available genetic information of the variants in the genomic region. The neighboring genetic variants are linked. The genotypes at one SNP are dependent on the genotypes at nearby SNPs. The FPCs account for the space-ordering of the genetic variation data. Expanding the genotype functions in terms of a few orthogonal FPCs will substantially reduce the dimensions of the genetic variation data while preserving the intrinsic correlation structure and the space-ordering of the data. Specifically, For each genomic region or gene, we use functional principal component analysis to calculate principal component function [14]. We expand x
nj
(t), n = 1, …, N, j = 1, 2, …, k in each genomic region in terms of orthogonal principal component functions:
$$ {x}_{ij}(t)={\displaystyle \sum_{l=1}^{L_j}{\eta}_{ijl}{\varphi}_{jl}(t)},j=1,\dots, k, $$
where φ
jl
(t), j = 1, …, k, l = 1, …, L
j
are the l-th principal component function in the j-th genomic region or gene and η
ijl
are the functional principal component scores of the i-th individual.
Let η be a matrix collection of all functional principal component scores, the parameter matrix B can be defined as that in Appendix 2, matrices Y and Γ can be defined as that in the previous section. The structural functional equations can be reduced in terms of functional principal component scores (Appendix 2):
$$ Y{\varGamma}_i+\eta {B}_i+{e}_i=0, $$
which can be rewritten as
$$ {y}_i={W}_i{\Delta}_i+{e}_i, $$
where \( {W}_i=\left[\begin{array}{cc}\hfill {Y}_{-i}\hfill & \hfill \eta \hfill \end{array}\right],{\Delta}_i={\left[\begin{array}{cc}\hfill {\gamma}_{-i}^T\hfill & \hfill {B}_i^T\hfill \end{array}\right]}^T. \)
Then, the sparse FSEMs are transformed to
$$ \begin{array}{l}\underset{\Delta_i}{ \min}\kern0.5em f\left({\Delta}_i\right)+\lambda \left|\right|{\Delta}_i\left|\right|{}_1\\ {}\mathrm{where}\ f\left({\Delta}_i\right)={\left({\eta}^T{y}_i-{\eta}^T{W}_i{\Delta}_i\right)}^T{\left({\eta}^T\eta \right)}^{-1}\left({\eta}^T{y}_i-{\eta}^T{W}_i{\Delta}_i\right).\end{array} $$
(9)
The ADMM algorithms for solving the sparse FSEMs are the same as that in the previous section if the matrix X is replaced by a functional principal component score matrix η (Appendix 2).
The functional SEMs can efficiently combine both common and rare genetic variants across the gene region and are suitable for NGS data [14]. This model extends the single variant-based network analysis to gene-based analysis, which can deal with hundreds of genes that may include tens of thousands SNPs. However, due to the computational limitation, we cannot directly handle the whole genome sequencing data. To construct whole genome genotype-phenotype networks, the network construction consists of two stages. At the first stage, we can group the genes based on the metabolic pathways or cluster analysis, with each group having at most hundreds of genes, and then apply the functional SEMs to each group of genes to find the set of genes significantly connected to the phenotypes. At the second stage, adding the sets of significantly connected genes identified at the first stage together to form a new of set of genes for network construction. The functional SEMs are again applied to the new set of genes to construct the final genotype-phenotype network.
Effect decomposition and estimation
To make this paper self-contained, we introduce basic concepts and methods for decomposition and estimation of the effects. In the genotype-phenotype network analysis we are interested in estimation of effects of genetic variants on phenotypes, which is referred to as genetic effects and effects of treatment on phenotypes. All genetic effects and treatment effects can be decomposed as total (causal), direct effects and indirect effects. Distinction between total, direct and indirect effects are of great practical importance in genetic association analysis [40]. The total effect measures the changes of response variable Y (phenotype) would take on the value y when variable X is set to x by external intervention. Direct effect is defined as sensitivity of Y to changes in X while all other variables in the model are held fixed. Indirect effect is to measure the portion of the effect which can be explained by mediation alone, while inhibiting the capacity of Y to respond to X [41]. The total effect is equal to the summation of direct and indirect effects.
Given a directed graph model G, one can compute total effects using intervention calculus [34, 42]. Suppose that the expected value of a response variable Y, after X is assigned value x by intervention is denoted by E[Y|do(X = x)]. The total effect is defined as
$$ \frac{\partial }{\partial x}E\left[Y\Big| do\left(X=x\right)\right]. $$
(10)
Note X
j
is called a parent of X in G if there is a directed edge X
j
→ X. Let pa
x
denote the set of all parents of X in G. In the linear SEMs, we assume that E[Y|X, pa
x
] is linear in X and pa
x
:
$$ E\left[Y\Big|X,\ {\mathrm{pa}}_x\right]=\alpha +\beta X+{\gamma}^T{\mathrm{pa}}_x. $$
(11)
Then,
$$ \frac{\partial }{\partial x}E\left[Y\Big| do\left(X=x\right)\right]=\beta . $$
When a directed graph is given, it is easy to calculate total effect [42]. Assume that there are k directed paths from X to Y and p
i
are the product of the path coefficients along the i-th path. The total effect of X on Y is then defined as ∑
k
i = 1
p
i
. As shown in Additional file 1: Figure S2, the total effect of X on Y is ag + bdh + acdh. By its definition, direct effect measures the sensitivity of Y to changes in X while all other variables in the model are held fixed. In other words, all links from X to Y other than the direct link will be blocked. As a consequences, the direct effect is equal to the path coefficient from X to Y. In the linear SEMs, the indirect effect of X on Y mediated by M is equal to the sum of the products associated with directed paths from X to Y through M [42]. In Additional file 1: Figure S2, there is no direct effect from X to Y. The indirect effect of X on Y which is mediated by B and D is equal to bdh.
In the SEMs for genotype-phenotype networks, since all SNPs only form undirected graph and there are no directed links between SNPs although we can observe linkage (or correlation) between SNPs; SNPs in the genotype-phenotype networks do not have parents. The total effect of SNP X on Y is the regression coefficient β of the following linear regression:
$$ E\left[Y\Big| do\ \left(X=x\right)\right]=\alpha +\beta x, $$
which is a simple regression of Y on X. This indicates that the traditional simple regression for association studies captures the total effect of a genetic variant on a phenotype.
If we include environments and risk factors such as smoking and obesity in the model and want to evaluate the effects of the environments and risk factors on the phenotype, these variables play mediating roles and will also be taken as phenotypes. We denote these mediating phenotypes by Y
ME
. Since genetic variants, and other risk factors and phenotypes will affect the mediating phenotypes, the mediating phenotypes in the graphics may have parents. Their parents are denoted by S. Total effect of the mediation phenotype on the target phenotype is calculated by
$$ E\left[Y\Big| do\ \left({Y}_{ME}={y}_{ME},{X}_{pa}={x}_{pa}\right)\right]=\alpha +\beta {y}_{ME}+{\gamma}^T{x}_{pa}, $$
(12)
where β is the total effect of the mediation phenotype Y
ME
on the target phenotype Y. In this case, a simple regression of Y on Y
ME
can no longer be used to measure the total effect of the mediation phenotype Y
ME
on the target phenotype Y. To observe this, we simulated 1000 individuals with the SEM as shown in Additional file 1: Figure S3. Each variable has a noise term distributed as N(0, 1). The total effect of the mediation phenotype Y
ME
on the target phenotype Y is 3.5. We obtain the simple regression:
$$ Y=1.39+5.85{Y}_{ME}. $$
It is clear that the coefficient of the simple regression is 5.85. This value is far away from the total effect 3.5. However, using eq. (12) we obtain
$$ Y=3.54{Y}_{ME}+5.85X, $$
where the regression coefficient 3.54 measured the total effect of the mediation phenotype Y
ME
on the target phenotype Y.
Test statistics for path coefficients
Testing connection between the j-th gene and the i-th phenotype in the genotype-phenotype network, we formally investigate the problem of testing the coefficient of the path directed from the j- th gene to the i-th phenotype:
$$ {H}_0:{\beta}_{ji}(t)=0, \kern0.5em \forall t\in \left[0,{T}_j\right] $$
(13)
against
$$ {H}_a:{\beta}_{ji}(t)\ne 0\ . $$
If the coefficient function of path or genetic effect function β
ji
(t) is expanded in terms of the principal component functions:
$$ {\beta}_{ji}(t)={\displaystyle {\sum}_{g=1}^G{b}_{jig}{\varphi}_{jg}(t)}, $$
then testing the null hypothesis H0 in Eq. (13) is equivalent to testing the hypothesis:
$$ {H}_0:{b}_{jig}=0,\ \forall \mathrm{g}. $$
(14)
The path coefficients b
jig
can be estimated by solving problems (8) and (9). Let \( {\widehat{b}}_{ji}={\left[{b}_{ji1},\dots, {b}_{jiG}\right]}^T \). The covariance matrix of the vector of the estimators of path coefficients for the i -th equation is given by [38]
$$ {\widehat{\varSigma}}_i={\sigma}_{ii}{\left[{W}_i^T\eta {\left({\eta}^T\eta \right)}^{-1}{\eta}^T{W}_i\right]}^{-1}, $$
(15)
where
$$ {\sigma}_{ii}={\left({y}_i-{W}_i{\widehat{\Delta}}_i\right)}^T\left({y}_i-{W}_i{\widehat{\Delta}}_i\right)/n. $$
(16)
Let Λ
i
be the submatrix that corresponds to b
ji
in the matrix \( {\widehat{\Sigma}}_i \). Define the statistic for testing the directed connection from the j-th gene to the i-th phenotype as
$$ {T}_g={\widehat{b}}_{ji}^T{\varLambda}_i^{-1}{\widehat{b}}_{ji}. $$
(17)
Under the null hypothesis of no association H
0 : b
ji
= 0, T
g
is asymptotically distributed as a central χ
2(G)
distribution where G is the number of functional principal components in the expansion of β
ji
(t).
For testing a single parameter or single variant’s path coefficient in the SEMs, the l-th parameter of the i-th equation, the statistic is given by
$$ {T}_c=\frac{{\widehat{\Delta}}_{il}^2}{\operatorname{var}\left({\widehat{\Delta}}_{il}\right)}, $$
(18)
where \( \operatorname{var}\left({\widehat{\Delta}}_{il}\right) \) is the l-th diagonal element of the matrix \( {\widehat{\Sigma}}_i \). Under the null hypothesis H
0 : Δ
il
= 0, T
c
is asymptotically distributed as a central χ
2(1)
distribution.
Testing for the path coefficients within the network results in multiple testing problems. Both false discovery rate approach and Bonferroni correction can be used to adjust for multiple testing [43, 44].