Method overview
LT-FH
Details regarding the LT-FH method can be found elsewhere [5]. In LT-FH, the first step is to compute a posterior mean genetic liability for each individual, based on the individual’s disease status as well as any available parent or sibling disease status. The genetic liabilities are assumed to follow a multivariate normal distribution:
$$\left(\begin{array}{c}{\epsilon }_{o,e}\\ {\epsilon }_{o,g}\\ {\epsilon }_{p1}\\ {\epsilon }_{p2}\\ {\epsilon }_{s}\end{array}\right)\sim {N}_{5}\left(\left(\begin{array}{c}0\\ 0\\ 0\\ 0\\ 0\end{array}\right),\left(\begin{array}{ccccc}1-{h}^{2}& 0& 0& 0& 0\\ 0& {h}^{2}& {0.5h}^{2}& 0.5{h}^{2}& {0.5h}^{2}\\ 0& {0.5h}^{2}& 1& 0& {0.5h}^{2}\\ 0& {0.5h}^{2}& 0& 1& {0.5h}^{2}\\ 0& {0.5h}^{2}& {0.5h}^{2}& {0.5h}^{2}& 1\end{array}\right)\right)$$
Which includes the environmental (\({\epsilon }_{o,e}\)) and genetic (\({\epsilon }_{o,g}\)) components of the liabilities of offspring, the liabilities of parent (\({\epsilon }_{p1},{\epsilon }_{p2}\)), and the liabilities of siblings (\({\epsilon }_{s}\)). Given the disease status of an individual \({Z}_{0}\), his parents \({Z}_{p1},{Z}_{p2}\), and siblings \({Z}_{s}\), The posterior mean genetic liabilities \(E[{\epsilon }_{o,g}|{Z}_{0},{Z}_{p1},{Z}_{p2},{Z}_{s}]\) for every possible configuration of case–control status and family history are estimated with Monte Carlo integration. In the second step, an association test is conducted between the continuous liabilities and genotypes. The association statistics can be obtained using multiple models, such as linear regression, score test, or linear mixed effect models.
Fam-meta
Fam-meta considers two independent association tests that have been previously described elsewhere [6]. The first analysis involves probands, or participants who have both genotypes and phenotype information available. A logistic regression can be fitted based on standard likelihood for case–control. The second analysis involves relatives who only have phenotypic information. This logistic regression is based on the likelihood of observing a relative’s case–control status, conditioning on both the proband’s case–control status as well as the probands’ phenotypes. The regression coefficients and variances from both regressions are then combined in a meta-analysis framework with optimal (inverse variance) weights:
$${T}_{meta}=\frac{\frac{1}{\widehat{var}(\widehat{{\beta }^{P}})}\widehat{{\beta }^{P}}+\frac{2\phi }{\widehat{var}(\widehat{{\beta }^{R}})}\widehat{{\beta }^{R}}}{\sqrt{\frac{1}{\widehat{var}(\widehat{{\beta }^{P}})}+\frac{4{\phi }^{2}}{ \widehat{var}(\widehat{{\beta }^{R}})}}}$$
where \(\widehat{{\beta }^{P}}\) is the beta coefficient from the probands’ regression, \(\widehat{{\beta }^{R}}\) is the beta coefficient from the relatives’ regression, and \(\phi\) is the kinship coefficient between the participant and the relative.
Simulations
Main models
We conducted simulations to directly compare the performance of LT-FH and Fam-meta, while CC-GWAS results served as a reference. Three main models were utilized to generate phenotypes: a null model without SNP effects (model 1), a model with variants contributing to disease liability (model 2), and a model with an additional age by genotype interaction (model 3). To evaluate each model, we first generated 400 nuclear families consisting of parents and offspring for each iteration, where 200 families have two children, and 200 families have three children (N = 1800). Sex was randomly assigned for the children, and age was assigned using the following rules: each child’s age was generated under a continuous uniform distribution between 18 to 45 years old; the mother was 20–45 years older than the oldest child; the father was within five years of the mother’s age, and had to be at least 20 years older than the oldest child.
For the null model (model 1), we simulated one continuous disease trait using the following formula:
$$Y=0.015age+0.45sex+\gamma +\varepsilon$$
Age and sex explained 10% and 5% of the total phenotypic variance, respectively. The polygenic component \(\gamma\) followed a multivariate normal distribution with mean 0 and covariance \({\phi \sigma }_{G}^{2}\), where \(\phi\) was the kinship matrix of one family, and \({\sigma }_{G}^{2}\) was set to 0.2. The random error \(\varepsilon\) was normally distributed with variance \({\sigma }_{E}^{2}\)=0.65.
In the second model (model 2), we incorporated eight independent causal variants with the following minor allele frequencies (MAFs): 1%, 2%, 5%, 10%, 20%, 30%, 40%, 50% (See Additional File 1 for description and results of additional scenarios where less frequent variations, or variants in linkage disequilibrium (LD) generated using HAPGEN2 [7] and 1000 Genomes reference panel were used). We first assigned parents’ genotypes under Hardy–Weinberg equilibrium, i.e., the probability of having 0, 1, or 2 minor alleles are p2, 2pq, q2, respectively, where p is the MAF. Then, we determined the children’s genotypes through gene dropping, assuming that each parent passes down one of their alleles to their offspring with an equal chance of selecting either allele. Phenotypes were generated as follows:
$$Y={\sum }_{k=1}^{8}{\beta }_{k}\left({g}_{k}\right)+0.015age+0.45sex+\gamma +\varepsilon$$
We assumed trait heritability of 0.28, including 1% of total phenotypic variance explained by each of the eight causal variants, and variability due to the polygenic component \(\gamma\) described in model 1. The \({\beta }_{k}\) was calculated using \(\sqrt{\frac{1\%}{2\times MAF\times (1-MAF)}}\). Ten percent of total variance was explained by age, and 5% of total variance was explained by sex. The remaining variance was explained by the normally distributed random error \(\varepsilon\), where \({\sigma }_{E}^{2}\)=0.57.
We incorporated into the third model (model 3) an interaction between age and genotype of causal variants in addition to the eight causal variants:
$$Y= 0.02Age({\sum }_{k=1}^{8}{\beta }_{k}\left({g}_{k}\right))+{\sum }_{k=1}^{8}{\beta }_{k}\left({g}_{k}\right)+0.015Age+0.45Sex+\gamma +\varepsilon$$
Causal variants were generated the same way as in model 2. We reduced the proportion of phenotypic variance explained by each causal variant from 1% to 0.5% so that the genotype by age interaction term and the genotype term each explained 4% of the total phenotypic variance (see Additional File 1 for details regarding an additional scenario where we simulated phenotype with a larger age effect).
The continuous trait Y was transformed to a binary case–control status for all three models using a threshold model with a disease prevalence of 0.3. To evaluate type I error rate of each model, we generated eight independent, non-causal variants with the same allele frequencies as the causal variants, and assessed their individual association with case–control status using CC-GWAS, LT-FH, and Fam-meta. For model 1, we completed 50,000 simulation replicates under H0: there is no association between non-causal variants and case–control status, then calculated the proportion of replicates with P-value less than 5%, 1%, and 0.5%; for models 2 and 3, we completed 5000 replicates and used a 5% alpha level. To determine power for models 2 and 3, we examined the association between the binary trait of each model and the causal SNPs using CC-GWAS, LT-FH, and Fam-meta. We ran 5000 simulation replicates under H1: there is an association between causal variants and case–control status, and calculated power as the proportion of replicates with P-value less than 5%. Note that in all association tests, we only used offspring (N = 1000) for CC-GWAS, as if the parents’ genotypes were unavailable but their phenotypes (case–control statuses) were. LT-FH and Fam-meta each leveraged the phenotypes of parents accordingly (see descriptions of each method above).
Use of more distant relatives
To test the influence of leveraging more distant relatives’ information compared to first-degree relatives, we used the phenotypes of grandparents instead of parents as the available family history information. We only tested this scenario with CC-GWAS and Fam-meta, because LT-FH is not designed to incorporate second-degree relatives. A new family structure was utilized, where each family still comprised two parents with two or three offspring, but each parent also had their respective parents, totaling four grandparents. Similar to previous simulations, there were 200 families with two grandchildren and 200 families with three grandchildren (N = 3400). Using the new pedigree, we simulated phenotypes under model 2, and ran association tests using the offspring (N = 1000) for CC-GWAS. In terms of the relatives regression conducted in Fam-meta, we considered two sample sizes: first, all grandparents (N = 1600), and second, only one pair of randomly selected grandparents (N = 800). The purpose of this scenario was to compare whether including a larger number of grandparents in association analyses would increase power. For each sample size, we ran 5000 iterations.
Application of both methods to the analysis of T2D in FHS
Description of the FHS
The FHS is an ongoing longitudinal cohort study that began in 1948, with an initial enrollment of 5209 first-generation participants who were mainly of European descent. Over the years, the cohort has grown substantially to include offspring (Offspring cohort) and grandchildren (Gen 3 cohort), while numerous health conditions have been monitored to assess cardiovascular diseases and their risk factors. Both LT-FH and Fam-meta were applied to the same three generations of FHS participants, with T2D case–control status available for most participants. T2D was defined as having at least one of the following conditions: fasting glucose ≥ 7 mmol/L after ≥ 8 h, Hemoglobin A1c ≥ 6.5%-units, 2-hour glucose by an oral glucose tolerance test ≥ 11.1 mmol/L, non-fasting glucose ≥ 11.1 mmol/L, physician-diagnosed diabetes, or use of antidiabetic medication. Participants with known type 1 diabetes were excluded. In the case where a second-generation participant’s parent was not part of the original FHS cohort, but the participant reported T2D family history during a follow-up exam, we used those records to expand available family information. For participants who were genotyped, genetic variant dosages from the Haplotype Reference Consortium release 1.1 based imputations were used in all analyses. Imputation was performed on the Michigan Imputation Server using minimac3 and the Haplotype Reference Consortium reference panel release 1.1 April 2016 using genetic variants passing the following criteria: call-rate ≥ 97%, Hardy–Weinberg P ≥ 10–6, < 1000 Mendelian errors, and MAF ≥ 1%. Prior to imputation, phasing was performed using the duoHMM algorithm incorporated into SHAPEIT2 to account for parental genotypes. We excluded variants with imputation quality r2 less than 0.3.
Incorporation of family history using LT-FH
From the full set of FHS participants, we retrieved T2D status and T2D family history of 8362 genotyped individuals, excluding participants of non-European ancestry, defined using principal component (PC) analysis.PCs were first computed with HapMap samples. Mean and standard deviation for PC1 and PC2 were computed for White samples, including those in FHS and CEU (Utah residents with ancestry from northern and western Europe). Participants were labeled as non-European if the value for PC1 or PC2 was greater than 6 standard deviations from the mean value for White samples. Using the LT-FH software, we calculated the posterior mean genetic liability scores for each participant based on the participant’s T2D status as well as any available parents’ or siblings’ T2D status. Then, we used a linear mixed-effects model to evaluate the association between T2D liability phenotypic scores and each imputed variant, adjusting for participant’s last exam age, sex, and the first ten PCs. We additionally adjusted for smoking status (never/former/current smoker) in a sensitivity analysis (see Additional File 1 for more details and results). We accounted for familial relatedness using a kinship matrix based on the FHS participants’ family structures.
Incorporation of family history using Fam-meta
We first identified two separate samples, one with probands (n = 8362) who have both T2D disease status and genotypes available, while the other included relatives (n = 3780) who only have T2D status but no genotypes available. For the first sample, we used a logistic mixed-effects regression model to evaluate the association between proband’s T2D case–control status and individual genetic variants, adjusting for last exam age, sex, and the first two PCs that were significantly associated with T2D. We accounted for relatedness using a kinship matrix based on the FHS pedigree. The first regression analysis corresponds exactly to CC-GWAS. In the second sample, we first matched the relatives with their corresponding probands using Framingham family identifiers. If a relative was related to multiple probands, the probands’ genotypes and phenotypes were averaged within the same family. Then, we used a logistic mixed-effects model to evaluate the association between relative’s T2D case–control status and imputed genetic variants based on probands’ genotypes, adjusting for proband’s phenotype, relative’s last exam age, relative’s sex, and the first two PCs. A kinship matrix was included to account for familial relatedness. For both regression models, we additionally adjusted for smoking status in a sensitivity analysis (see Additional File 1 for more details and results). Lastly, regression coefficients and variances from both regressions were combined using the meta-analysis formula provided in the original paper [6].