Genetic association tests in family samples for multi-category phenotypes

Background Advancements in statistical methods and sequencing technology have led to numerous novel discoveries in human genetics in the past two decades. Among phenotypes of interest, most attention has been given to studying genetic associations with continuous or binary traits. Efficient statistical methods have been proposed and are available for both types of traits under different study designs. However, for multinomial categorical traits in related samples, there is a lack of efficient statistical methods and software. Results We propose an efficient score test to analyze a multinomial trait in family samples, in the context of genome-wide association/sequencing studies. An alternative Wald statistic is also proposed. We also extend the methodology to be applicable to ordinal traits. We performed extensive simulation studies to evaluate the type-I error of the score test, Wald test compared to the multinomial logistic regression for unrelated samples, under different allele frequency and study designs. We also evaluate the power of these methods. Results show that both the score and Wald tests have a well-controlled type-I error rate, but the multinomial logistic regression has an inflated type-I error rate when applied to family samples. We illustrated the application of the score test with an application to the Framingham Heart Study to uncover genetic variants associated with diabesity, a multi-category phenotype. Conclusion Both proposed tests have correct type-I error rate and similar power. However, because the Wald statistics rely on computer-intensive estimation, it is less efficient than the score test in terms of applications to large-scale genetic association studies. We provide computer implementation for both multinomial and ordinal traits. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-08107-x.


Background
Genetic association tests for continuous or binary phenotypes have uncovered many susceptibility genes or variants related to diseases. Various methods and efficient software have been developed and used for continuous and binary traits. For family samples, due to the correlation between relatives and violation of the independence assumption of ordinary linear regression, some alternative approaches were proposed. For example, Therneau and colleagues developed an R package (coxme) implementing linear mixed effects model to evaluate the association between a genetic variant and a continuous trait or survival outcome accounting for correlation present in family samples. Similar extensions to account for familial correlation using mixed effects models have been proposed for gene-based association tests [1]. The progress in family sample designs has been restricted mostly to quantitative traits or binary traits. However, methods are needed to study categorical traits with more than two categories in family samples. For example, the phenotype diabesity has been defined as a four-category (diabetes & obesity, diabetes but no obesity, obesity but no diabetes and no diabetes & no obesity) variable constructed jointly from type 2 diabetes and obesity. Currently, approaches for genetic association analysis of such multinomial traits are limited. Zhang and colleagues [2] proposed a proportional odds logistic model which allows for the inclusion of covariates. However, it has a few limitations. First, this approach is restricted to nuclear families and cannot handle complex family structures. Second, no software implementation has been made publicly available. Diao and Lin [3] proposed a general framework for linkage and association tests for ordinal traits. Their method utilized adaptive Gaussian quadrature to approximate the maximum loglikelihood and a likelihood ratio test was proposed to test the hypothesis of no association between a genetic variant and an ordinal trait of interest. Again, this approach also has not been widely used due to lack of computer-efficient software and the fact that the likelihood ratio test is computationally intensive. Another possible option is to use the SAS generalized linear mixed models (GLMM) procedure, which can incorporate a kinship matrix. However, in real applications, the current implementation of the GLMM cannot handle extended families due to the computational burden. More recently, Wang and colleagues [4] proposed a Bayesian framework incorporating kinship matrix as a random effect, which however can not be applied to large-scale genetic study because of lack of computational efficiency. Bi and colleagues [5] proposed a computer-efficient framework (POLMM), specifically for ordinal traits. Because it doesn't allow for a userprovided kinship matrix, such as the one estimated from pedigree or using a typical genetic software, this will be a limitation for family-based cohort studies with known relationships. Our proposed method is complementary to these two approaches as it can be applied to family samples without available genome-wide data to compute a GRM, and without the proportional odds assumption. In this paper, we propose a computationally efficient score test based on extended generalized estimating equations (EGEE) for large-scale genetics studies of multi-category phenotypes accounting for familial correlation. We evaluate our approach using simulations and apply it to a genome-wide scan to identify genetic variants associated with diabesity, a four-category phenotype, with the healthy referent category being no diabetes and no obesity and the unhealthiest category, "diabese" (diabetes and obesity), having a prevalence of at least 25% in several countries [6].

Type-I error
The results of family-based and unrelated samples are summarized in Table 1-2 respectively. Both the score and Wald tests have well-controlled type-I error rates across all MAF scenarios except for rare variants. This conclusion applies to both family-based and unrelated designs. The multinomial logistic regression, which ignores familial correlation, returns an inflated type-I error rate in the presence of related individuals, although its type-I error rate for unrelated study design is wellcontrolled. In the application to ordinal trait (Table 3), robust score test preserves the type-I error in all MAF scenarios although the simulated phenotype distribution is highly unbalanced. The Wald test is only very slightly inflated for very rare variants when evaluated at 0.0001. We have also generated QQ-plots (Additional File 3) for the robust score test and the simplified score test for results from all MAF scenarios for both multinomial and ordinal traits when applied to family-based samples. The QQ-plots are consistent with the empirical type-I error summarized in the tables below.

Power evaluation
The results of family-based and unrelated samples are summarized in Tables 4 and 5, respectively. Because we have concluded that multinomial logistic regression leads to inflated type-I error rates, the power rate of multinomial logistic regression is not evaluated for family-based samples ( Table 4). The score and Wald tests have approximately the same power rate for each scenario (MAF, study design). The logistic regression using LRT has approximately the same power as the other two approaches in unrelated samples.

Data analysis
Low-frequency (MAF < 0.01) and poorly imputed variants (imputation ratio < 0.3) have been excluded to avoid spurious results. All results are presented in the Manhattan plot (Fig. 1., and Manhattan plots for diabetes, obesity in Additional File 3) and QQ-plot (Fig. 2.). The variants that have reached a genome-wide significance threshold of 5 × 10 −8 or a suggestive threshold of 4 × 10 −7 (calculated as 1/number of tests = 1/2542166) are summarized in Table 6. All variants in Table 6 are located within the CYP3A43, AP3B1 and LOC105370246 genes. AP3B1 is known to have variants associated with fasting insulin and HOMA-IR in African Americans without diabetes [7]. The direct association between LOC105370246 and dibestes or obesity is not known in literature. CYP3A43 gene encodes a member of the cytochrome P450 superfamily of liver enzymes. Although the direct relationship between CYP3A43 and diabetes/obesity was not well known, some variants located in CYP3A4 have been identified in previous studies to be associated with relevant metabolism traits. For instance, one study in 2011 [8] indicated diabetes is associated with a significant decrease in hepatic CYP3A4 enzymatic activity and protein level. Several studies have demonstrated nonalcoholic fatty liver disease and diabetes are associated with decreased expression of the protein encoded by this gene in human livers [9,10]. Two variants on CYP3A43 were identified to be associated with Ticagrelor levels in individuals with acute coronary syndromes treated with ticagrelor [11] and serum metabolite measurement [12] respectively. Because this gene might have clinical value for treating chronic metabolic diseases such as nonalcoholic fatty liver disease [13], future research efforts targeting this gene area are worthwhile. Additional information about this region might be discovered with targeted sequencing.
For target validation purpose, we have performed two additional GWAS of diabetes and obesity respectively using our approach (Manhattan plots in Additional File 3). The plots confirm that all signals are observed from the combined phenotype and not driven by a single binary trait (diabetes or obesity).
We apply our ordinal approach to the secondary outcome ("ordinal" diabesity) and compare to results obtained from POLMM, an approach for ordinal trait. We observe that the results are similar with small differences (Fig. 3. and 4). Compared to results obtained from the multinomial trait ( Fig. 1.), the ordinal trait highlights one region near DAB1 on Chromosome 1 and one region near LOC107986327 on Chromosome 4 of potential interest in the search for genes associated with "ordinal" diabesity.

Discussion
The proposed score test offers advantages over the Wald test and the multinomial logistic regression in the following aspects. First, it is more computationally efficient, especially for large-scale genetic studies such as GWAS, or sequencing studies because the iterative Fisher's scoring algorithm is only applied once under the null hypothesis while the iterative algorithm is implemented for each variant when computing the Wald test statistic. Therefore, for a large-scale genetic study, the Wald test will be less computationally efficient than the score test. We have summarized the computing time for the score and Wald tests in Table 7 for different sample sizes as implemented in R functions using a 3-category multinomial phenotype on a i7-8565u processor with 16GB RAM. Second, the simulation studies show that the type-I error of both the score and Wald tests is well controlled for most scenarios. In contrast, the multinomial logistic regression results in a very inflated type-I error rate for family-based design when the familial correlation is ignored, and therefore it is not recommended  for family-based studies. When the phenotypes are extremely unbalanced, e.g. the allocation ratio of the 4 categories is approximately 2.5:1:10:23, both score and Wald tests can result in slightly inflated type-I error for rare variants in the simulation studies. This conclusion has been noted in most approaches [5]. However, when the phenotype distribution is more balanced, the tests return valid type-I error rates for all MAF scenarios, as demonstrated in the QQ-plot of the FHS data analysis (Fig. 2.). We have observed that the type-I error of ordinal traits is very robust to an unbalanced distribution of the phenotype for all MAF scenarios (Table 3), as indicated by the calculated type-I error rate obtained from 500,000 simulations by treating the simulated phenotype as an ordinal trait. Lastly, the score test has approximately the same power as the Wald test, under the scenarios we evaluated.   It is worth noting that the EGEE are simply reduced to the score equations of generalized linear models for a multinomial variable when applied to unrelated samples. Because the same iteratively reweighted least square method is employed under this particular circumstance, the parameter estimates are identical to those obtained using a generalized linear model function for multinomial variables. This equivalence enhances the applicability of this approach to a general population, regardless of the underlying study design.
The score test can be readily extended to ordinal traits (i.e. categorical traits for which the values are ordered.) in family samples. Due to the nature of the ordinal regression model, fewer regression parameters are estimated. Because  applications to ordinal traits are a special case of the general framework proposed with reduced complexity, the validity of simulation results should hold when applied to ordinal traits. When K = 2, i.e. an ordinal trait with only two categories, the estimates will be the same when using either multinomial or ordinal function, i.e. estimates of a binary logistic regression accounting for familial correlation.
Our proposed approaches have enabled the identification of a few loci associated with diabesity. As discussed, none of the signals were driven solely by one of the two binary traits (diabetes or obesity). Targeted sequencing might reveal more information, by providing a more comprehensive overview of rare and low-frequency variants in that specific regions. We also provide a comparison of our ordinal approach to POLMM for an ordinal trait and found that both approaches have revealed similar regions of association.

Conclusions
Score tests should be considered for large-scale genetic association testing due to their computational advantage. Because the Wald test also has valid type-I error rates and its computational efficiency is comparable to the score test (Table 7), if computing resources allow, the Wald test can also be applied for large-scale genetic studies. As illustrated using Framingham heart study data, the proposed score test has enabled the identification of several loci associated with diabesity. One of the drawbacks of the score test is the lack of effect estimates. When only a handful of associated variants are identified from a genetic association study, the effect size and statistical significance of each variant can be estimated using the Wald test. In addition to the multinomial application, we have also provided a computer implementation for ordinal traits in Additional File 2. Although we presented association results from additively coded genetic variants, the application and implementation are not restricted to SNPs, but also applicable to a genetic risk score, weighted-sum gene test [14], and other genetic summary measures.

Methods
Assuming that there are N independent families (i = 1, …, N), with n i individuals in family i and a total of n ¼ P N i¼1 n i subjects, the basic model for a K-category (multinomial) trait, with the Kth level chosen as the reference level, is written as, The n × 1 response variable Y has K unordered levels, i.e. k = 1, …, K, resulting in (K-1) equations; G is the genotype vector of size n × 1; X is the n × q covariates matrix; α = (α 1 , …, α k , …, α K − 1 ) T is the intercept vector for the (K-1) equations; β = (β 1 , …, β k , …, β K − 1 ) T is the effect size vector of the genotype in the (K-1) equations; and γ = (γ 1 , …, γ k , …, γ K − 1 ) are the parameters of the covariates X, for the (K-1) equations with a dimension of q × 1 for each γ k . Although there are a variety of choices for the link function g, here we demonstrate with the canonical link function, the general logit, i.e.

Extended generalized estimating equations (EGEE)
We adopt the idea of EGEE previously proposed [15,16] to approximate the likelihood using quasi-likelihood, to handle correlated observations. The variance of the response variable Y ij , is defined using (K-1) indicator variables as follows: )]′ and the variance of z ij can be derived as: where J is a matrix of ones with a dimension of (K-1) by (K-1), and r is an unknown correlation parameter to be estimated with value between −1 and 1. The implementation of the approach provided in Additional File 1 can also accommodate two-parameter R  with diagonal elements set to r1 and all off-diagonal elements set to r2. The matrix R is used to model the correlation between any two individuals in the same family along with the use of relationship matrix, such that R i = Φ i ⨂ R ( Φ i is the relationship matrix of the i-th family defined as twice the kinship matrix), similar to how the familial correlation was handled in previous publications [17,18]. V i , the overall variance matrix of z i. for the i-th independent family is constructed as sd(z i. )R i sd(z i. ) with the variance of each subject var(z ij ) (j = 1, …, n i ), as derived above, where The following score equations of EGEE [15,16,19] are used to estimate the regression parameters θ = (α 1 , β 1 , γ 1 , …, α K − 1 , β K − 1 , γ K − 1 ) and the correlation parameter r.
where the n i (K − 1) × (2 + q)(K − 1) matrix D i is stacked vertically from D ij (j = 1, …, n i ) and defined as ∂r with a dimension of n 2 i ðK −1Þ 2 by 1, I is an identity matrix with a size of n 2 i ðK −1Þ 2 and σ i is the vectorized version of V i . Similarly, s i is vectorized version derived from the following: ij 0 (j = 1, …, n i ) and e k ij (k = 1, …, (K-1)) is defined as =I(Y ij = k) − P(Y ij = k). Therefore, E[s i ] = σ i . Fisher's scoring algorithm is used to update both θ and r from m-th iteration to (m + 1)-th iteration, written as and D stands for the first-order derivative with respect to (θ, r), until the pre-specified convergence criterion is met. Estimates of multinomial logistic regression and r = 0 or 0.5 usually work well in terms of starting values. Note the score equations will be reduced to the following GEE form [20] when applied to N unrelated samples. The coefficients estimation will follow the same iteratively reweighted least square method of generalized linear model [21] for multinomial outcome until a prespecified convergence criterion is met.
Robust score test To determine if a genetic variant is associated with a multi-category phenotype, the following null hypothesis is tested H 0 : β = 0. We first define the score vectors U ð1Þ The score statistic is proposed as follows: Whereθ 0 ;r 0 are parameter estimates under H 0 : β = 0.

Wald test
The Wald test is an alternative test with lower computational efficiency when applied to a large-scale genetic study. The Wald test statistic is proposed as follows: This test statistic follows a χ 2 K −1 asymptotically. The parametersθ;r are obtained from the score equations with no constraints (i.e. H 0 ∪ H a ) until the pre-specified convergence criterion is met.

Ordinal traits
Under the same framework, using the statistical theory of ordinal regression, the above score and Wald tests can be easily extended to test the association of a genetic variant with an ordinal trait for a family-based design. More specifically, because P(Y ij = k) can be derived from P(Y ij = k) = P(Y ij ≤ k) − P(Y ij ≤ k − 1) using proportional cumulative logit models, then the same EGEE equations are used for parameter estimation. However, the dimensions of EGEE equations are reduced and mathematical formulas of the matrix elements are derived differently due to the use of proportional cumulative logit models. A computer implementation for both multinomial and ordinal phenotypes is provided in Additional File 1-2.

Simulations
We conduct type-I error and power simulation studies to evaluate the validity of our score test in assessing the association between single-nucleotide variants (SNVs) with different minor allele frequencies (MAF) and a categorical trait with four categories ("multinomial" trait), and compare the score test to the Wald test and the multinomial logistic regression which does not account for related samples. We then conduct simulations to assess the power of all three approaches.

Type-I error
We compare the type-I error rate of the robust score test to the Wald test as well as multinomial logistic regression (without accounting for related samples) in both family-based and unrelated designs. We simulate a 4-category trait under the null hypothesis that there is no genetic association with the trait, i.e. H 0 : β 1 = … = β 3 = 0. Eight SNV scenarios with MAF ranging from 0.01 to 0.3 are explored. For each SNV scenario and sample design, 500,000 replicates are simulated and the type-I error rate is defined as the proportion of simulations significant at the threshold of 0.01, 0.001, and 0.0001. For family-based samples, we also have conducted simulations to evaluate the type-I error of robust score and Wald test when applied to ordinal traits, based on 500, 000 replicates for each MAF scenario.
Family-based samples: In each replicate, a total of 1000 independent 3-generation families with 2 grandparents who have one son and one daughter (Fig. 5.) are simulated. The number of grandchildren (3rd-generation) is randomly determined from a discrete uniform distribution ranging from 1 to 4. Within each of the 1000 families, we simulate additively coded genotypes (0, 1, or 2 minor alleles) of the grandparents under Hardy-Weinberg equilibrium, and the 2nd and 3rd generations' genotypes are then simulated using random allele dropping. Two covariates (age and sex) are simulated. The sex of the 3rd-generation is randomly assigned, and the covariate of age is simulated in the following way [17]: we start by simulating the age of female offspring (2nd generation) from a continuous uniform distribution ranging from 25 to 50. Her spouse's age is set to be within 5-year of her age. The male offspring's ages (2nd generation) are set to be within 5 years of the sister with at least a 1-year gap to exclude twins. Then we simulate the age of the grandparents (1st generation). The grandmother is assumed to be 20 to 45 years older than both offspring (2nd generation), and the grandfather's age is set to be within 5-year of the grandmother's age and he must be at least 20 years older than his older offspring. Finally, we simulate the age of the 3rd generation, in such a way that everyone in the 3rd generation is assumed to be 20 to 45 years younger than the mother (2nd generation) and at least 20 years younger than the father (2nd generation). Two continuous traits are simulated from age and sex, based on the following two equations, i.e. age and sex explains around 3 and 0.002% of the total variance of the latent variable u 1 versus 0.8 and 0.01% of the latent variable u 2 : u 1 ¼ 5:6 þ 0:025age þ 0:5sex þ ε 1 ; where ε 1 ε 2 ∼Nð0; Σ a ⨂Φ þ Σ e ⨂IÞ , the additive covari-ance matrix is Σ a ¼ 4 6 6 36 and the environmental covariance matrix is Σ e ¼ 4 6 6 36 . Φ is the relationship matrix which is a kinship matrix multiplied by 2.
We transform u 1 , u 2 to two binary traits using a threshold model with a disease prevalence of 10 and 35%, assuming a disease with a moderate prevalence such as type 2 diabetes (T2D) and a high prevalence such as obesity. The multinomial trait is then defined by these two binary traits as follows: diabetes & obesity, diabetes but no obesity, obesity but no diabetes and no diabetes & no obesity, in adults.
Unrelated samples: In each replicate, we simulate a total of 5000 independent subjects with ages ranging from 18 to 90. A total of 5000 independent additivelycoded genotypes are simulated. The sex is randomly assigned (1 = male; 2 = female). We then simulate two continuous traits influenced by age and sex only, based on the following two equations, so that age and sex explain around 3.2 and 0.8% of the total variance of u 1 versus 0.94 and 0.01% of u 2 respectively: where ε 1 ε 2 ∼Nð0; Σ T ⨂IÞ with Σ T ¼ 8 12 12 72 . We transform u 1 , u 2 as described in the family design section.
We evaluate the type-I error of the proposed score test and Wald test, and then compare them to the multinomial logistic regression assuming independence among observations (using likelihood ratio test (LRT)).

Power evaluation
We compare the power of the score to the Wald test and multinomial logistic regression under the same allele scenarios and with the same family/unrelated structure as described above. In addition to the effects of age and sex, we also include an additively coded genetic variant g which explains approximately 0.5% of the variance of each continuous trait, i.e. With this phenotype generation model, both traits are simulated under the alternative hypothesis that there is an association between the trait and the genetic variant. For each MAF scenario, a total of 5000 replicates are generated. The power rate is then evaluated for 3 different significance thresholds including the commonly used GWAS threshold for each method.

Framingham heart study
The motivation for developing this efficient score test is to make the application to a large-scale genetic study computationally feasible, especially after the cost of whole-genome sequencing has been greatly reduced in recent years.
We apply the robust score test to the Framingham Heart Study (FHS) [17,23]. A total of 7564 participants from 1315 families are analyzed, after excluding observations with missing values in body mass index (BMI), age, sex, the first 10 principal components (PC) s or T2D status. The primary outcome is diabesity with four categories as defined above. Diabesity is considered a modern epidemic and the largest in human history [24]. However, there are very few papers available regarding genetic association studies on this trait. We analyze the association between diabesity and genotypes from the Framingham SNP Health Association Resource (SHARe) project sponsored by the National Heart, Lung and Blood Institute (NHLBI), adjusting for age, sex, and the first 10 PCs. Genotypes from Affymetrix 550 K genotyping arrays (Affymetrix, Santa Clara, CA, USA), supplemented by the Affymetrix MIPS array, are available on 8481 participants after exclusion for low call rate (< 97%), heterozygosity rate outside of 5 SDs from the mean or excess Mendelian errors (> 1000). Additional SNVs are imputed with the software MACH (Markov Chain-based haplotyper) using the HapMap 2 reference haplotypes [25]. To help understand the GWAS results of diabesity and given the fact that diabesity is jointly constructed from obesity and diabetes, we perform two additional family-based logistic regression analyses using our approach to study the association of diabetes and genotypes, and the association of obesity and genotypes respectively. A secondary outcome treats the diabesity as an ordinal variable with 4 levels of increasing severity. We apply both our ordinal approach and POLMM with derived sparse GRM matrix to the secondary outcome and compare the results.