 Research article
 Open access
 Published:
Genetic association tests in family samples for multicategory phenotypes
BMC Genomics volume 22, Article number: 873 (2021)
Abstract
Background
Advancements in statistical methods and sequencing technology have led to numerous novel discoveries in human genetics in the past two decades. Among phenotypes of interest, most attention has been given to studying genetic associations with continuous or binary traits. Efficient statistical methods have been proposed and are available for both types of traits under different study designs. However, for multinomial categorical traits in related samples, there is a lack of efficient statistical methods and software.
Results
We propose an efficient score test to analyze a multinomial trait in family samples, in the context of genomewide association/sequencing studies. An alternative Wald statistic is also proposed. We also extend the methodology to be applicable to ordinal traits. We performed extensive simulation studies to evaluate the typeI error of the score test, Wald test compared to the multinomial logistic regression for unrelated samples, under different allele frequency and study designs. We also evaluate the power of these methods. Results show that both the score and Wald tests have a wellcontrolled typeI error rate, but the multinomial logistic regression has an inflated typeI error rate when applied to family samples. We illustrated the application of the score test with an application to the Framingham Heart Study to uncover genetic variants associated with diabesity, a multicategory phenotype.
Conclusion
Both proposed tests have correct typeI error rate and similar power. However, because the Wald statistics rely on computerintensive estimation, it is less efficient than the score test in terms of applications to largescale genetic association studies. We provide computer implementation for both multinomial and ordinal traits.
Background
Genetic association tests for continuous or binary phenotypes have uncovered many susceptibility genes or variants related to diseases. Various methods and efficient software have been developed and used for continuous and binary traits. For family samples, due to the correlation between relatives and violation of the independence assumption of ordinary linear regression, some alternative approaches were proposed. For example, Therneau and colleagues developed an R package (coxme) implementing linear mixed effects model to evaluate the association between a genetic variant and a continuous trait or survival outcome accounting for correlation present in family samples. Similar extensions to account for familial correlation using mixed effects models have been proposed for genebased association tests [1]. The progress in family sample designs has been restricted mostly to quantitative traits or binary traits. However, methods are needed to study categorical traits with more than two categories in family samples. For example, the phenotype diabesity has been defined as a fourcategory (diabetes & obesity, diabetes but no obesity, obesity but no diabetes and no diabetes & no obesity) variable constructed jointly from type 2 diabetes and obesity. Currently, approaches for genetic association analysis of such multinomial traits are limited. Zhang and colleagues [2] proposed a proportional odds logistic model which allows for the inclusion of covariates. However, it has a few limitations. First, this approach is restricted to nuclear families and cannot handle complex family structures. Second, no software implementation has been made publicly available. Diao and Lin [3] proposed a general framework for linkage and association tests for ordinal traits. Their method utilized adaptive Gaussian quadrature to approximate the maximum loglikelihood and a likelihood ratio test was proposed to test the hypothesis of no association between a genetic variant and an ordinal trait of interest. Again, this approach also has not been widely used due to lack of computerefficient software and the fact that the likelihood ratio test is computationally intensive. Another possible option is to use the SAS generalized linear mixed models (GLMM) procedure, which can incorporate a kinship matrix. However, in real applications, the current implementation of the GLMM cannot handle extended families due to the computational burden. More recently, Wang and colleagues [4] proposed a Bayesian framework incorporating kinship matrix as a random effect, which however can not be applied to largescale genetic study because of lack of computational efficiency. Bi and colleagues [5] proposed a computerefficient framework (POLMM), specifically for ordinal traits. Because it doesn’t allow for a userprovided kinship matrix, such as the one estimated from pedigree or using a typical genetic software, this will be a limitation for familybased cohort studies with known relationships. Our proposed method is complementary to these two approaches as it can be applied to family samples without available genomewide data to compute a GRM, and without the proportional odds assumption. In this paper, we propose a computationally efficient score test based on extended generalized estimating equations (EGEE) for largescale genetics studies of multicategory phenotypes accounting for familial correlation. We evaluate our approach using simulations and apply it to a genomewide scan to identify genetic variants associated with diabesity, a fourcategory phenotype, with the healthy referent category being no diabetes and no obesity and the unhealthiest category, “diabese” (diabetes and obesity), having a prevalence of at least 25% in several countries [6].
Results
TypeI error
The results of familybased and unrelated samples are summarized in Table 12 respectively. Both the score and Wald tests have wellcontrolled typeI error rates across all MAF scenarios except for rare variants. This conclusion applies to both familybased and unrelated designs. The multinomial logistic regression, which ignores familial correlation, returns an inflated typeI error rate in the presence of related individuals, although its typeI error rate for unrelated study design is wellcontrolled. In the application to ordinal trait (Table 3), robust score test preserves the typeI error in all MAF scenarios although the simulated phenotype distribution is highly unbalanced. The Wald test is only very slightly inflated for very rare variants when evaluated at 0.0001. We have also generated QQplots (Additional File 3) for the robust score test and the simplified score test for results from all MAF scenarios for both multinomial and ordinal traits when applied to familybased samples. The QQplots are consistent with the empirical typeI error summarized in the tables below.
Power evaluation
The results of familybased and unrelated samples are summarized in Tables 4 and 5, respectively. Because we have concluded that multinomial logistic regression leads to inflated typeI error rates, the power rate of multinomial logistic regression is not evaluated for familybased samples (Table 4). The score and Wald tests have approximately the same power rate for each scenario (MAF, study design). The logistic regression using LRT has approximately the same power as the other two approaches in unrelated samples.
Data analysis
Lowfrequency (MAF < 0.01) and poorly imputed variants (imputation ratio < 0.3) have been excluded to avoid spurious results. All results are presented in the Manhattan plot (Fig. 1., and Manhattan plots for diabetes, obesity in Additional File 3) and QQplot (Fig. 2.). The variants that have reached a genomewide significance threshold of 5 × 10^{−8} or a suggestive threshold of 4 × 10^{−7} (calculated as 1/number of tests = 1/2542166) are summarized in Table 6. All variants in Table 6 are located within the CYP3A43, AP3B1 and LOC105370246 genes. AP3B1 is known to have variants associated with fasting insulin and HOMAIR in African Americans without diabetes [7]. The direct association between LOC105370246 and dibestes or obesity is not known in literature. CYP3A43 gene encodes a member of the cytochrome P450 superfamily of liver enzymes. Although the direct relationship between CYP3A43 and diabetes/obesity was not well known, some variants located in CYP3A4 have been identified in previous studies to be associated with relevant metabolism traits. For instance, one study in 2011 [8] indicated diabetes is associated with a significant decrease in hepatic CYP3A4 enzymatic activity and protein level. Several studies have demonstrated nonalcoholic fatty liver disease and diabetes are associated with decreased expression of the protein encoded by this gene in human livers [9, 10]. Two variants on CYP3A43 were identified to be associated with Ticagrelor levels in individuals with acute coronary syndromes treated with ticagrelor [11] and serum metabolite measurement [12] respectively. Because this gene might have clinical value for treating chronic metabolic diseases such as nonalcoholic fatty liver disease [13], future research efforts targeting this gene area are worthwhile. Additional information about this region might be discovered with targeted sequencing.
For target validation purpose, we have performed two additional GWAS of diabetes and obesity respectively using our approach (Manhattan plots in Additional File 3). The plots confirm that all signals are observed from the combined phenotype and not driven by a single binary trait (diabetes or obesity).
We apply our ordinal approach to the secondary outcome (“ordinal” diabesity) and compare to results obtained from POLMM, an approach for ordinal trait. We observe that the results are similar with small differences (Fig. 3. and 4). Compared to results obtained from the multinomial trait (Fig. 1.), the ordinal trait highlights one region near DAB1 on Chromosome 1 and one region near LOC107986327 on Chromosome 4 of potential interest in the search for genes associated with “ordinal” diabesity.
Discussion
The proposed score test offers advantages over the Wald test and the multinomial logistic regression in the following aspects. First, it is more computationally efficient, especially for largescale genetic studies such as GWAS, or sequencing studies because the iterative Fisher’s scoring algorithm is only applied once under the null hypothesis while the iterative algorithm is implemented for each variant when computing the Wald test statistic. Therefore, for a largescale genetic study, the Wald test will be less computationally efficient than the score test. We have summarized the computing time for the score and Wald tests in Table 7 for different sample sizes as implemented in R functions using a 3category multinomial phenotype on a i78565u processor with 16GB RAM. Second, the simulation studies show that the typeI error of both the score and Wald tests is well controlled for most scenarios. In contrast, the multinomial logistic regression results in a very inflated typeI error rate for familybased design when the familial correlation is ignored, and therefore it is not recommended for familybased studies. When the phenotypes are extremely unbalanced, e.g. the allocation ratio of the 4 categories is approximately 2.5:1:10:23, both score and Wald tests can result in slightly inflated typeI error for rare variants in the simulation studies. This conclusion has been noted in most approaches [5]. However, when the phenotype distribution is more balanced, the tests return valid typeI error rates for all MAF scenarios, as demonstrated in the QQplot of the FHS data analysis (Fig. 2.). We have observed that the typeI error of ordinal traits is very robust to an unbalanced distribution of the phenotype for all MAF scenarios (Table 3), as indicated by the calculated typeI error rate obtained from 500,000 simulations by treating the simulated phenotype as an ordinal trait. Lastly, the score test has approximately the same power as the Wald test, under the scenarios we evaluated.
It is worth noting that the EGEE are simply reduced to the score equations of generalized linear models for a multinomial variable when applied to unrelated samples. Because the same iteratively reweighted least square method is employed under this particular circumstance, the parameter estimates are identical to those obtained using a generalized linear model function for multinomial variables. This equivalence enhances the applicability of this approach to a general population, regardless of the underlying study design.
The score test can be readily extended to ordinal traits (i.e. categorical traits for which the values are ordered.) in family samples. Due to the nature of the ordinal regression model, fewer regression parameters are estimated. Because applications to ordinal traits are a special case of the general framework proposed with reduced complexity, the validity of simulation results should hold when applied to ordinal traits. When K = 2, i.e. an ordinal trait with only two categories, the estimates will be the same when using either multinomial or ordinal function, i.e. estimates of a binary logistic regression accounting for familial correlation.
Our proposed approaches have enabled the identification of a few loci associated with diabesity. As discussed, none of the signals were driven solely by one of the two binary traits (diabetes or obesity). Targeted sequencing might reveal more information, by providing a more comprehensive overview of rare and lowfrequency variants in that specific regions. We also provide a comparison of our ordinal approach to POLMM for an ordinal trait and found that both approaches have revealed similar regions of association.
Conclusions
Score tests should be considered for largescale genetic association testing due to their computational advantage. Because the Wald test also has valid typeI error rates and its computational efficiency is comparable to the score test (Table 7), if computing resources allow, the Wald test can also be applied for largescale genetic studies. As illustrated using Framingham heart study data, the proposed score test has enabled the identification of several loci associated with diabesity. One of the drawbacks of the score test is the lack of effect estimates. When only a handful of associated variants are identified from a genetic association study, the effect size and statistical significance of each variant can be estimated using the Wald test. In addition to the multinomial application, we have also provided a computer implementation for ordinal traits in Additional File 2. Although we presented association results from additively coded genetic variants, the application and implementation are not restricted to SNPs, but also applicable to a genetic risk score, weightedsum gene test [14], and other genetic summary measures.
Methods
Assuming that there are N independent families (i = 1, …, N), with n_{i} individuals in family i and a total of \( n={\sum}_{i=1}^N{n}_i \) subjects, the basic model for a Kcategory (multinomial) trait, with the Kth level chosen as the reference level, is written as,
The n × 1 response variable Y has K unordered levels, i.e. k = 1, …, K, resulting in (K1) equations; G is the genotype vector of size n × 1; X is the n × q covariates matrix; α = (α_{1}, …, α_{k}, …, α_{K − 1})^{T} is the intercept vector for the (K1) equations; β = (β_{1}, …, β_{k}, …, β_{K − 1})^{T} is the effect size vector of the genotype in the (K1) equations; and γ = (γ_{1}, …, γ_{k}, …, γ_{K − 1}) are the parameters of the covariates X, for the (K1) equations with a dimension of q × 1 for each γ_{k}. Although there are a variety of choices for the link function g, here we demonstrate with the canonical link function, the general logit, i.e.
Extended generalized estimating equations (EGEE)
We adopt the idea of EGEE previously proposed [15, 16] to approximate the likelihood using quasilikelihood, to handle correlated observations. The variance of the response variable Y_{ij}, is defined using (K1) indicator variables as follows: z_{ij} = [I(Y_{ij} = 1), …, I(Y_{ij} = (K − 1))] ’. The expected value of z_{ij} is E[z_{ij}] = [P(Y_{ij} = 1), …, P(Y_{ij} = (K − 1))]′ and the variance of z_{ij} can be derived as:
Let R = rJ where J is a matrix of ones with a dimension of (K1) by (K1), and r is an unknown correlation parameter to be estimated with value between −1 and 1. The implementation of the approach provided in Additional File 1 can also accommodate twoparameter R with diagonal elements set to r1 and all offdiagonal elements set to r2. The matrix R is used to model the correlation between any two individuals in the same family along with the use of relationship matrix, such that R_{i} = Φ_{i} ⨂ R ( Φ_{i} is the relationship matrix of the ith family defined as twice the kinship matrix), similar to how the familial correlation was handled in previous publications [17, 18]. V_{i}, the overall variance matrix of z_{i.} for the ith independent family is constructed as sd(z_{i.})R_{i}sd(z_{i.}) with the variance of each subject var(z_{ij}) (j = 1, …, n_{i}), as derived above,
where
and
The following score equations of EGEE [15, 16, 19] are used to estimate the regression parameters θ = (α_{1}, β_{1}, γ_{1}, …, α_{K − 1}, β_{K − 1}, γ_{K − 1}) and the correlation parameter r.
where the n_{i}(K − 1) × (2 + q)(K − 1) matrix D_{i} is stacked vertically from D_{ij} (j = 1, …, n_{i}) and defined as \( {\boldsymbol{D}}_{\boldsymbol{ij}}=\frac{\partial E\left[{z}_{ij}\right]}{\partial {\boldsymbol{\theta}}^{\prime }}={\left(\frac{\partial P\left({Y}_{ij}=1\right)}{\partial \boldsymbol{\theta}},\dots, \frac{\partial P\left({Y}_{ij}=K1\right)}{\partial \boldsymbol{\theta}}\right)}^{\prime } \); F_{i} is the vectorized \( \frac{\partial {\boldsymbol{V}}_{\boldsymbol{i}}^{\mathbf{1}}}{\partial r} \) with a dimension of \( {n}_i^2{\left(K1\right)}^2 \) by 1, I is an identity matrix with a size of \( {n}_i^2{\left(K1\right)}^2 \) and σ_{i} is the vectorized version of V_{i}. Similarly, s_{i} is vectorized version derived from the following:
where \( {\boldsymbol{e}}_{\boldsymbol{ij}}=\left({e}_{ij}^1\kern0.5em \dots \kern0.5em {e}_{ij}^{K1}\right)^{\prime } \) (j = 1, …, n_{i}) and \( {e}_{ij}^k \) (k = 1, …, (K1)) is defined as =I(Y_{ij} = k) − P(Y_{ij} = k). Therefore, E[s_{i}] = σ_{i}. Fisher’s scoring algorithm is used to update both θ and r from mth iteration to (m + 1)th iteration, written as
where
and D stands for the firstorder derivative with respect to (θ, r), until the prespecified convergence criterion is met. Estimates of multinomial logistic regression and r = 0 or 0.5 usually work well in terms of starting values.
Note the score equations will be reduced to the following GEE form [20] when applied to N unrelated samples. The coefficients estimation will follow the same iteratively reweighted least square method of generalized linear model [21] for multinomial outcome until a prespecified convergence criterion is met.
Robust score test
To determine if a genetic variant is associated with a multicategory phenotype, the following null hypothesis is tested H_{0} : β = 0. We first define the score vectors \( {\boldsymbol{U}}^{\left(\mathbf{1}\right)}=\left(\begin{array}{c}\begin{array}{c}{\boldsymbol{U}}_{{\boldsymbol{\upgamma}}_{\mathbf{1}}}\\ {}\boldsymbol{\vdots}\\ {}\boldsymbol{\vdots}\end{array}\\ {}{\boldsymbol{U}}_{{\boldsymbol{\upgamma}}_{\boldsymbol{K}\mathbf{1}}}\end{array}\right) \), \( {\boldsymbol{U}}^{\left(\mathbf{2}\right)}={\boldsymbol{U}}_{\boldsymbol{\upbeta}}=\left(\begin{array}{c}{U}_{\upbeta_1}\\ {}\boldsymbol{\vdots}\\ {}{U}_{\upbeta_{K1}}\end{array}\right) \), U^{(3)} = U_{r}. The score statistic is proposed as follows:
Where \( {\hat{\boldsymbol{\theta}}}_{\mathbf{0}},{\hat{\boldsymbol{r}}}_{\mathbf{0}} \) are parameter estimates under H_{0} : β = 0. \( {\boldsymbol{U}}^{\boldsymbol{main}}\left(\boldsymbol{\theta}, r\right)=\left(\begin{array}{c}{\boldsymbol{U}}^{\left(\mathbf{1}\right)}\\ {}{\boldsymbol{U}}^{\left(\mathbf{2}\right)}\end{array}\right)={\sum}_{i=1}^N{\boldsymbol{U}}_{\boldsymbol{i}}^{\boldsymbol{main}}\left(\boldsymbol{\theta}, r\right) \), A(θ, r) = (−U^{∗}_{21}U^{∗}_{11}^{−1}, I) with subscript 2 denoting rows/columns that correspond to β, subscript 1 denoting rows/columns that correspond to γ_{1}, …,γ_{K − 1}, and I is an identity matrix of size (K1).
The score statistic follows a \( {\chi}_{K1}^2 \) asymptotically according to the derivation for bivariate association testing in family samples [17, 22]. One of the major advantages is its robustness to incorrect variance specification. If the variance V_{i} (i = 1, … N) is prespecified correctly, then var(U^{main}(θ, r)) will equal to U^{∗} restricted to β, γ_{1}, …,γ_{K − 1}, and the score statistic will be simplified to
where \( {V}^{\left(\mathbf{2}\right)}={I}_{22}{I}_{2\left(2\right)}{I}_{\left(2\right)\left(2\right)}^{1}{I}_{\left(2\right)2} \) (The subscript 2 denotes the (K1) row/columns corresponding to β_{1}, …, β_{(K − 1)}; ““denotes excluding these rows/columns) and \( I={\sum}_{\boldsymbol{i}=\mathbf{1}}^{\boldsymbol{N}}{\boldsymbol{D}}_{\boldsymbol{i}}^{\prime }{\boldsymbol{V}}_{\boldsymbol{i}}^{\mathbf{1}}{\boldsymbol{D}}_{\boldsymbol{i}} \).
Wald test
The Wald test is an alternative test with lower computational efficiency when applied to a largescale genetic study. The Wald test statistic is proposed as follows:
This test statistic follows a \( {\chi}_{K1}^2 \) asymptotically. The parameters \( \hat{\boldsymbol{\theta}},\hat{r} \) are obtained from the score equations with no constraints (i.e. H_{0} ∪ H_{a}) until the prespecified convergence criterion is met.
The full variance matrix of all parameters \( V\left(\hat{\boldsymbol{\theta}},\kern0.5em \hat{r}\ \right) \) is derived as \( V\left(\hat{\boldsymbol{\theta}},\kern0.5em \hat{r}\ \right)={\boldsymbol{U}}^{\ast}{\left(\hat{\boldsymbol{\theta}},\hat{r}\right)}^{1}{\sum}_{i=1}^N{\boldsymbol{U}}_{\boldsymbol{i}}\left(\hat{\boldsymbol{\theta}},\hat{r}\right){\boldsymbol{U}}_{\boldsymbol{i}}\left(\hat{\boldsymbol{\theta}},\hat{r}\right)^{\prime }{\left({\boldsymbol{U}}^{\ast}{\left(\hat{\boldsymbol{\theta}},\hat{r}\right)}^{\prime}\right)}^{1} \). \( V\left({\hat{\beta}}_1\kern0.5em \dots \kern0.5em {\hat{\beta}}_{\left(K1\right)}\right) \) is extracted from \( V\left(\hat{\boldsymbol{\theta}},\kern0.5em \hat{r}\ \right) \), a sandwichtype variance estimator [19], with rows and columns corresponding to \( \left({\hat{\beta}}_1\kern0.5em \dots \kern0.5em {\hat{\beta}}_{\left(K1\right)}\right) \).
Ordinal traits
Under the same framework, using the statistical theory of ordinal regression, the above score and Wald tests can be easily extended to test the association of a genetic variant with an ordinal trait for a familybased design. More specifically, because P(Y_{ij} = k) can be derived from P(Y_{ij} = k) = P(Y_{ij} ≤ k) − P(Y_{ij} ≤ k − 1) using proportional cumulative logit models, then the same EGEE equations are used for parameter estimation. However, the dimensions of EGEE equations are reduced and mathematical formulas of the matrix elements are derived differently due to the use of proportional cumulative logit models. A computer implementation for both multinomial and ordinal phenotypes is provided in Additional File 12.
Simulations
We conduct typeI error and power simulation studies to evaluate the validity of our score test in assessing the association between singlenucleotide variants (SNVs) with different minor allele frequencies (MAF) and a categorical trait with four categories (“multinomial” trait), and compare the score test to the Wald test and the multinomial logistic regression which does not account for related samples. We then conduct simulations to assess the power of all three approaches.
TypeI error
We compare the typeI error rate of the robust score test to the Wald test as well as multinomial logistic regression (without accounting for related samples) in both familybased and unrelated designs. We simulate a 4category trait under the null hypothesis that there is no genetic association with the trait, i.e. H_{0} : β_{1} = … = β_{3} = 0. Eight SNV scenarios with MAF ranging from 0.01 to 0.3 are explored. For each SNV scenario and sample design, 500,000 replicates are simulated and the typeI error rate is defined as the proportion of simulations significant at the threshold of 0.01, 0.001, and 0.0001. For familybased samples, we also have conducted simulations to evaluate the typeI error of robust score and Wald test when applied to ordinal traits, based on 500, 000 replicates for each MAF scenario.
Familybased samples: In each replicate, a total of 1000 independent 3generation families with 2 grandparents who have one son and one daughter (Fig. 5.) are simulated. The number of grandchildren (3rdgeneration) is randomly determined from a discrete uniform distribution ranging from 1 to 4. Within each of the 1000 families, we simulate additively coded genotypes (0, 1, or 2 minor alleles) of the grandparents under HardyWeinberg equilibrium, and the 2nd and 3rd generations’ genotypes are then simulated using random allele dropping. Two covariates (age and sex) are simulated. The sex of the 3rdgeneration is randomly assigned, and the covariate of age is simulated in the following way [17]: we start by simulating the age of female offspring (2nd generation) from a continuous uniform distribution ranging from 25 to 50. Her spouse’s age is set to be within 5year of her age. The male offspring’s ages (2nd generation) are set to be within 5 years of the sister with at least a 1year gap to exclude twins. Then we simulate the age of the grandparents (1st generation). The grandmother is assumed to be 20 to 45 years older than both offspring (2nd generation), and the grandfather’s age is set to be within 5year of the grandmother’s age and he must be at least 20 years older than his older offspring. Finally, we simulate the age of the 3rd generation, in such a way that everyone in the 3rd generation is assumed to be 20 to 45 years younger than the mother (2nd generation) and at least 20 years younger than the father (2nd generation). Two continuous traits are simulated from age and sex, based on the following two equations, i.e. age and sex explains around 3 and 0.002% of the total variance of the latent variable u_{1} versus 0.8 and 0.01% of the latent variable u_{2}:
where \( \left(\begin{array}{c}{\varepsilon}_1\\ {}{\varepsilon}_2\end{array}\right)\sim N\left(0,{\boldsymbol{\Sigma}}_{\boldsymbol{a}}\bigotimes \boldsymbol{\Phi} +{\boldsymbol{\Sigma}}_{\boldsymbol{e}}\bigotimes \mathbf{I}\right) \), the additive covariance matrix is \( {\boldsymbol{\Sigma}}_{\boldsymbol{a}}=\left(\begin{array}{cc}4& 6\\ {}6& 36\end{array}\right) \) and the environmental covariance matrix is \( {\boldsymbol{\Sigma}}_{\boldsymbol{e}}=\left(\begin{array}{cc}4& 6\\ {}6& 36\end{array}\right) \). Φ is the relationship matrix which is a kinship matrix multiplied by 2.
We transform u_{1}, u_{2} to two binary traits using a threshold model with a disease prevalence of 10 and 35%, assuming a disease with a moderate prevalence such as type 2 diabetes (T2D) and a high prevalence such as obesity. The multinomial trait is then defined by these two binary traits as follows: diabetes & obesity, diabetes but no obesity, obesity but no diabetes and no diabetes & no obesity, in adults.
Unrelated samples: In each replicate, we simulate a total of 5000 independent subjects with ages ranging from 18 to 90. A total of 5000 independent additivelycoded genotypes are simulated. The sex is randomly assigned (1 = male; 2 = female). We then simulate two continuous traits influenced by age and sex only, based on the following two equations, so that age and sex explain around 3.2 and 0.8% of the total variance of u_{1} versus 0.94 and 0.01% of u_{2} respectively:
where \( \left(\begin{array}{c}{\varepsilon}_1\\ {}{\varepsilon}_2\end{array}\right)\sim N\left(0,{\boldsymbol{\Sigma}}_{\boldsymbol{T}}\bigotimes \mathbf{I}\right) \) with \( {\boldsymbol{\Sigma}}_{\boldsymbol{T}}=\left(\begin{array}{cc}8& 12\\ {}12& 72\end{array}\right) \). We transform u_{1}, u_{2} as described in the family design section.
We evaluate the typeI error of the proposed score test and Wald test, and then compare them to the multinomial logistic regression assuming independence among observations (using likelihood ratio test (LRT)).
Power evaluation
We compare the power of the score to the Wald test and multinomial logistic regression under the same allele scenarios and with the same family/unrelated structure as described above. In addition to the effects of age and sex, we also include an additively coded genetic variant g which explains approximately 0.5% of the variance of each continuous trait, i.e.
With this phenotype generation model, both traits are simulated under the alternative hypothesis that there is an association between the trait and the genetic variant. For each MAF scenario, a total of 5000 replicates are generated. The power rate is then evaluated for 3 different significance thresholds including the commonly used GWAS threshold for each method.
Framingham heart study
The motivation for developing this efficient score test is to make the application to a largescale genetic study computationally feasible, especially after the cost of wholegenome sequencing has been greatly reduced in recent years.
We apply the robust score test to the Framingham Heart Study (FHS) [17, 23]. A total of 7564 participants from 1315 families are analyzed, after excluding observations with missing values in body mass index (BMI), age, sex, the first 10 principal components (PC) s or T2D status. The primary outcome is diabesity with four categories as defined above. Diabesity is considered a modern epidemic and the largest in human history [24]. However, there are very few papers available regarding genetic association studies on this trait. We analyze the association between diabesity and genotypes from the Framingham SNP Health Association Resource (SHARe) project sponsored by the National Heart, Lung and Blood Institute (NHLBI), adjusting for age, sex, and the first 10 PCs. Genotypes from Affymetrix 550 K genotyping arrays (Affymetrix, Santa Clara, CA, USA), supplemented by the Affymetrix MIPS array, are available on 8481 participants after exclusion for low call rate (< 97%), heterozygosity rate outside of 5 SDs from the mean or excess Mendelian errors (> 1000). Additional SNVs are imputed with the software MACH (Markov Chainbased haplotyper) using the HapMap 2 reference haplotypes [25]. To help understand the GWAS results of diabesity and given the fact that diabesity is jointly constructed from obesity and diabetes, we perform two additional familybased logistic regression analyses using our approach to study the association of diabetes and genotypes, and the association of obesity and genotypes respectively. A secondary outcome treats the diabesity as an ordinal variable with 4 levels of increasing severity. We apply both our ordinal approach and POLMM with derived sparse GRM matrix to the secondary outcome and compare the results.
Availability of data and materials
The FHS dataset that supports the findings of this manuscript is available on dbGap (dbGaP Study Accession: phs000007.v32.p13, https://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000007.v32.p13).
Abbreviations
 GLMM:

Generalized linear mixed model
 EGEE:

Extended generalized estimating equations
 SNV:

Singlenucleotide variant
 MAF:

Minor allele frequency
 T2D:

Type2 diabetes
 FHS:

Framingham heart study
 BMI:

Body mass index
 SHARe:

SNP Health Association Resource
 NHLBI:

National Heart, Lung and Blood Institute
 MACH:

Markov Chainbased haplotyper
 POLMM:

Proportional Odds Logistic Mixed Model
References
Therneau T. Mayo Clinic. The lmekin function.
Zhang H, Wang X, Ye Y. Detection of genes for ordinal traits in nuclear families and a unified approach for association studies. Genetics. 2006;172(1):693–9. https://www.ncbi.nlm.nih.gov/pubmed/16219774. https://doi.org/10.1534/genetics.105.049122.
Diao G, Lin DY. Variancecomponents methods for linkage and association analysis of ordinal traits in general pedigrees. Genetic epidemiology. 2010;34(3):232n/a. https://www.ncbi.nlm.nih.gov/pubmed/19918762. https://doi.org/10.1002/gepi.20453.
Wang X, Philip VM, Ananda G, White CC, Malhotra A, Michalski PJ, et al. A bayesian framework for generalized linear mixed modeling identifies new candidate loci for lateonset alzheimer's disease. Genetics. 2018;209(1):51–64. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5937180/; https://pubmed.ncbi.nlm.nih.gov/29507048https://doi.org/10.1534/genetics.117.300673.
Bi W, Zhou W, Dey R, Mukherjee B, Sampson JN, Lee S. Efficient mixed model approach for largescale genomewide association studies of ordinal categorical phenotypes. The Am J Hum Gen. 2021;108(5):825–39. https://www.sciencedirect.com/science/article/pii/S0002929721001038. https://doi.org/10.1016/j.ajhg.2021.03.019.
Zimmet PZ. Diabetes and its drivers: The largest epidemic in human history?. 2017;3.
Irvin MR, Wineinger NE, Rice TK, Pajewski NM, Kabagambe EK, Gu CC, et al. Genomewide detection of allele specific copy number variation associated with insulin resistance in african americans from the HyperGEN study. PLoS One. 2011;6(8):e24052. https://doi.org/10.1371/journal.pone.0024052.
Dostalek M, Court MH, Yan B, Akhlaghi F. Significantly reduced cytochrome P450 3A4 expression and activity in liver from humans with diabetes mellitus. Br J Pharmacol. 2011;163(5):937–47. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130941/; https://pubmed.ncbi.nlm.nih.gov/21323901https://doi.org/10.1111/j.14765381.2011.01270.x.
Jamwal R. de la Monte, Suzanne M., Ogasawara K, Adusumalli S, Barlock BB, Akhlaghi F. nonalcoholic fatty liver disease and diabetes are associated with decreased CYP3A4 protein expression and activity in human liver. Mol Pharm. 2018;15(7):2621–32. https://doi.org/10.1021/acs.molpharmaceut.8b00159. https://doi.org/10.1021/acs.molpharmaceut.8b00159.
Kolwankar D, Vuppalanchi R, Ethell B, Jones DR, Wrighton SA, Hall SD, et al. Association between nonalcoholic hepatic steatosis and hepatic cytochrome P450 3A activity. Clin Gastroenterol Hepatol. 2007;5(3):388–93. https://doi.org/10.1016/j.cgh.2006.12.021. https://doi.org/10.1016/j.cgh.2006.12.021.
Varenhorst C, Eriksson N, Johansson A, et al. Effect of genetic variations on ticagrelor plasma levels and clinical outcomes. Eur Heart J. 2015;36(29):1901–12. https://doi.org/10.1093/eurheartj/ehv116.
Krumsiek J, Suhre K, Evans AM, et al. Mining the unknown: A systems approach to metabolite identification combining genetic and metabolic information. PLoS Genet. 2012;8(10):e1003005. https://doi.org/10.1371/journal.pgen.1003005.
Zanger UM, Schwab M. Cytochrome P450 enzymes in drug metabolism: Regulation of gene expression, enzyme activities, and impact of genetic variation. Pharmacol Ther. 2013;138(1):103–41. https://www.sciencedirect.com/science/article/pii/S0163725813000065. https://doi.org/10.1016/j.pharmthera.2012.12.007.
Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5(2):e1000384. https://doi.org/10.1371/journal.pgen.1000384.
Hall DB. On the application of extended quasilikelihood to the clustered data case. Can J Statistics. 2001;29(1):77–97. https://doi.org/10.2307/3316052. https://doi.org/10.2307/3316052.
Hall DB, Severini TA. Extended generalized estimating equations for clustered data. J Am Stat Assoc. 1998;93(444):1365–75. https://www.jstor.org/stable/2670052. https://doi.org/10.1080/01621459.1998.10473798.
Wang S, Meigs J. & Dupuis, J. Joint association analysis of a binary and a quantitative trait in family samples. 2017;25:130–6.
Wang X, Lee S, Zhu X, Redline S, Lin X. GEEBased SNP set association test for continuous and discrete traits in FamilyBased association studies. Genet Epidemiol. 2013;37(8):778–86. https://doi.org/10.1002/gepi.21763.
Liu J, Pei Y, Papasian CJ, Deng H. Bivariate association analyses for the mixture of continuous and binary traits with the use of extended generalized estimating equations. Genet Epidemiol. 2009;33(3):217–27. https://doi.org/10.1002/gepi.20372.
Zeger SL, Liang KY, Albert PS. Models for longitudinal data: a generalized estimating equation approach. Biometrics. 1988;44(4):1049–60. https://www.jstor.org/stable/2531734. https://doi.org/10.2307/2531734.
Nelder JA, Wedderburn RWM. Generalized linear models. Journal of the Royal Statistical Society: Series A (General). 1972;135(3):370–84. https://doi.org/10.2307/2344614. https://doi.org/10.2307/2344614.
Davison AC. Statistical models: Cambridge University Press; 2003. . https://doi.org/10.1017/CBO9780511815850.
Levy R. The framingham study: The epidemiology of atherosclerotic disease. JAMA. 1981;245(5).
Farag YMK. Gaballa MR. Diabesity: An overview of a rising epidemic. 2011;26(1):28–35. https://doi.org/10.1093/ndt/gfq576.
Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34(8):816–34. https://doi.org/10.1002/gepi.20533. https://doi.org/10.1002/gepi.20533.
Acknowledgments
We acknowledge the contribution of Achilleas Pittsillides to the POLMM analyses. We thank Boston University research computing services and Katia Bulekova for generous assistance with script development and use of the highperformance computing (HPC) cluster.
Funding
This work is supported by U.S. National Institutes of Heath (NIH) grant U01DK078616, UM1DK078616–13, and 1R01HL151855 awarded to Dr. James B. Meigs. The funding body played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Authors and Affiliations
Contributions
SW analyzed the data. SW, JBM, JD interpreted the data. SW and JD drafted the manuscript. All authors have agreed to the submitted version.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The Framingham Heart Study was approved by the Boston University Medical Campus Institutional Review Board and all participants provided written informed consent. Only adult participants’ data were analyzed in this manuscript.
Consent for publication
Not applicable.
Competing interests
Authors declare no competing interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wang, S., Meigs, J.B. & Dupuis, J. Genetic association tests in family samples for multicategory phenotypes. BMC Genomics 22, 873 (2021). https://doi.org/10.1186/s1286402108107x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1286402108107x