PopPAnTe assesses the relationship between quantitative dependent variables (responses) and quantitative independent variables (predictors) within a variance components framework in order to model the resemblance among relatives.
The association of a single predictor with a single response variable is described as
$$ r_{i} = \mu + \beta p_{i} + \sum\nolimits_{j}\psi_{j}C_{ij} + g_{i} + e_{i} \tag{1} $$
where \(r_{i}\) represents the response value for the i-th individual, \(\mu\) the response mean, \(\beta\) the estimated effect of the predictor value \(p_{i}\), \(\psi_{j}\) the estimated effect of the j-th covariate \(C\), and \(g_{i}\) and \(e_{i}\) the polygenic and environmental effects, respectively.
The total response variance is partitioned into polygenic and environmental variances (the latter also including measurement error), and the variance-covariance matrix is calculated as
$$\omega = 2\Phi\sigma_{a}^{2} + I\sigma_{e}^{2} $$
where Φ is the relatedness matrix between each pair of individuals, I is the identity matrix, and \(\sigma _{a}^{2}\) and \(\sigma _{e}^{2}\) are the additive genetic and environmental variance, respectively.
Within the same framework, PopPAnTe allows the evaluation of the narrow-sense heritability of any quantitative response variable included in the analysis.
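As an illustration, the variance-covariance matrix and the narrow-sense heritability can be sketched in a few lines of Python. This is a minimal sketch, not PopPAnTe's implementation: the function names are hypothetical, matrices are plain nested lists, and the heritability is taken to be the standard ratio \(\sigma_{a}^{2} / (\sigma_{a}^{2} + \sigma_{e}^{2})\), which the text above does not spell out.

```python
def covariance_matrix(kinship, var_additive, var_environment):
    """Build 2 * Phi * sigma_a^2 + I * sigma_e^2 as nested lists.

    `kinship` is the matrix Phi of pairwise kinship coefficients.
    """
    n = len(kinship)
    return [
        [
            2.0 * kinship[i][j] * var_additive
            + (var_environment if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def narrow_sense_heritability(var_additive, var_environment):
    """Assumed definition: h^2 = sigma_a^2 / (sigma_a^2 + sigma_e^2)."""
    return var_additive / (var_additive + var_environment)


# Two non-inbred full siblings: self-kinship 0.5, pairwise kinship 0.25.
phi = [[0.5, 0.25],
       [0.25, 0.5]]
omega = covariance_matrix(phi, var_additive=0.6, var_environment=0.4)
h2 = narrow_sense_heritability(0.6, 0.4)
```

In this toy example the diagonal of the matrix equals the total variance and the off-diagonal entry is the expected covariance between the two siblings.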
The significance of the association is calculated using a formal likelihood-ratio test comparing the likelihood of the alternative model described in Eq. (1) to the likelihood of a null model where the effect of the predictor is constrained to zero.
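The likelihood-ratio test can be sketched as follows. The sketch assumes that the two maximised log-likelihoods are already available and that the test statistic is referred to a \(\chi^{2}\) distribution with one degree of freedom (one constrained parameter, \(\beta\)); the function name is hypothetical.

```python
import math


def likelihood_ratio_test(loglik_null, loglik_alt):
    """Return (statistic, p-value) for a test with one constrained parameter."""
    # Twice the log-likelihood gain of the alternative over the null model.
    statistic = max(0.0, 2.0 * (loglik_alt - loglik_null))
    # Survival function of a chi-square with 1 d.f.:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


statistic, p_value = likelihood_ratio_test(loglik_null=-105.0, loglik_alt=-103.0)
```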
PopPAnTe implements an exact linear mixed model equivalent to that implemented in the QTDT software [4].
To speed up the evaluation, PopPAnTe clusters variables that share the same pattern of missingness (i.e., the same missing values in the same subset of individuals), evaluates the likelihood of the null model once per cluster, and reuses that value to assess every variable in the cluster. PopPAnTe also allows the evaluation of empirical p-values by randomly permuting the predictor values among subjects and re-assessing the association under the null hypothesis. When genealogical information is provided as input, predictor values are randomly permuted within families in order to preserve the phenotypic correlation between family members. To improve performance, PopPAnTe implements an adaptive permutation approach [8], which stops generating permuted samples early when there is little or no evidence of significance.
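Two of these steps lend themselves to a short sketch: grouping variables by missingness pattern and permuting predictor values within families. The data layout (a dictionary of value lists with `None` marking a missing value, and a parallel list of family identifiers) and the function names are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def cluster_by_missingness(variables):
    """Group variable names that are missing in exactly the same individuals.

    `variables` maps a name to a list of values; None marks a missing value.
    """
    clusters = defaultdict(list)
    for name, values in variables.items():
        pattern = tuple(value is None for value in values)
        clusters[pattern].append(name)
    return list(clusters.values())


def permute_within_families(values, family_ids, rng):
    """Shuffle predictor values among members of the same family only."""
    members = defaultdict(list)
    for index, family in enumerate(family_ids):
        members[family].append(index)
    permuted = list(values)
    for indices in members.values():
        shuffled = [values[i] for i in indices]
        rng.shuffle(shuffled)
        for i, value in zip(indices, shuffled):
            permuted[i] = value
    return permuted
```

With such a grouping, the null model needs to be fitted only once per cluster rather than once per variable.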
Pedigree versus population analysis
When genealogical information is available, PopPAnTe computes the relatedness matrix from the known pedigree relationships using a recursive procedure that treats pedigree founders as unrelated [9]. The resulting variance-covariance matrix is usually both symmetric and positive semi-definite, so the maximum likelihood estimates of the variance components can be obtained through an efficient Cholesky decomposition.
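The recursion is not detailed here; a minimal sketch of the textbook kinship recursion, which may differ from the procedure in [9], is given below. It assumes that the pedigree is supplied as (individual, father, mother) tuples with parents listed before their offspring, and that founders have both parents set to `None`; the function name is hypothetical.

```python
from functools import lru_cache


def kinship_matrix(pedigree):
    """Pairwise kinship coefficients, in the order the pedigree is listed."""
    order = {ind: k for k, (ind, _, _) in enumerate(pedigree)}
    parents = {ind: (father, mother) for ind, father, mother in pedigree}

    @lru_cache(maxsize=None)
    def phi(a, b):
        if a is None or b is None:
            return 0.0  # unknown parents: founders are treated as unrelated
        if a == b:
            father, mother = parents[a]
            return 0.5 * (1.0 + phi(father, mother))
        # Recurse on the individual listed later, which cannot be an
        # ancestor of the other one.
        if order[a] < order[b]:
            a, b = b, a
        father, mother = parents[a]
        return 0.5 * (phi(father, b) + phi(mother, b))

    ids = [ind for ind, _, _ in pedigree]
    return [[phi(a, b) for b in ids] for a in ids]


# A nuclear family: two founders and two full siblings.
pedigree = [
    ("F", None, None),
    ("M", None, None),
    ("S1", "F", "M"),
    ("S2", "F", "M"),
]
kinship = kinship_matrix(pedigree)
```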
When genealogical information is not available, a genetic similarity matrix (GSM) can be estimated from genome-wide genetic data with any of several well-established tools, such as PLINK [10], GCTA [11], or LDAK [12], and given as input to PopPAnTe. The property of positive definiteness does not always hold for GSMs. By default, a bending procedure [13] is used to transform the matrix when it is not positive semi-definite, but the user has the option to use an LU decomposition instead. Additionally, PopPAnTe implements the QR decomposition to handle the rare cases where the variance-covariance matrix is not invertible and neither the Cholesky nor the LU decomposition can be used.
To speed up the evaluation of the variance components, PopPAnTe allows the user to set an arbitrary threshold below which individuals are considered unrelated. Alternatively, the user can use the expected kinship between second or third cousins as the threshold [14, 15].
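A minimal sketch of such a thresholding step is shown below. The expected kinship coefficients of second and third cousins are 1/64 and 1/256, respectively; the constant and function names are hypothetical, and treating values equal to the threshold as related is an assumption.

```python
SECOND_COUSIN_KINSHIP = 1.0 / 64.0
THIRD_COUSIN_KINSHIP = 1.0 / 256.0


def sparsify_kinship(kinship, threshold):
    """Set off-diagonal kinship values below `threshold` to zero."""
    n = len(kinship)
    return [
        [
            kinship[i][j] if i == j or kinship[i][j] >= threshold else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


kinship = [
    [0.50, 0.01, 0.25],
    [0.01, 0.50, 0.02],
    [0.25, 0.02, 0.50],
]
sparse = sparsify_kinship(kinship, SECOND_COUSIN_KINSHIP)
```

Zeroing small entries makes the matrix sparser, so that distantly related individuals can be handled as independent blocks.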
Region-based testing
When predictors can be ordered in space, as in the case of epigenetic markers, PopPAnTe allows the computation of region-based association tests by gathering information from flanking predictors included in a sliding window of user-defined size, whose values are replaced by their first principal component. By definition the first principal component accounts for as much of the variability in the data as possible, and can thus be used to summarise the joint distribution of all variables included in a given region for gene- or region-based association studies (e.g., [16, 17]).
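The idea can be sketched with a simple power iteration on the covariance matrix of each window. This is an illustrative sketch only: it assumes complete data, windows of consecutive predictors advancing one predictor at a time, and hypothetical function names; it is not meant to reproduce PopPAnTe's numerical routine. The sign of a principal component is arbitrary.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Score of each row (individual) on the first principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(m)]
    centred = [[row[k] - means[k] for k in range(m)] for row in rows]
    # Sample covariance matrix of the predictors in the window.
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(m)]
        for a in range(m)
    ]
    # A random start avoids beginning orthogonal to the leading eigenvector.
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(m)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # the window has no variance at all
            break
        vec = [x / norm for x in nxt]
    return [sum(c[k] * vec[k] for k in range(m)) for c in centred]


def sliding_window_scores(matrix, window):
    """First-PC scores for every window of `window` adjacent predictors."""
    m = len(matrix[0])
    return [
        first_principal_component([row[start:start + window] for row in matrix])
        for start in range(m - window + 1)
    ]
```

Each list of scores then takes the place of the individual predictors of its window in the association model of Eq. (1).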
Data pre- and post-processing
Quantile normalisation [18] can be automatically applied to improve normality of both response variables and predictors. Moreover, PopPAnTe implements two approaches to correct the association test for unwanted biological and technical variability (e.g., batch effects). When the source of the confounders is known, it can be directly included in the association model. To deal with unknown sources of biological and technical co-variation, PopPAnTe can integrate into the association model the principal components that are required to explain a user-specified percentage of variation.
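Selecting the number of principal components for a given percentage of variation can be sketched as follows, assuming that the eigenvalues of the covariance matrix (the variance explained by each component) have already been computed; the function name is hypothetical.

```python
def components_needed(eigenvalues, percent):
    """Smallest number of leading components explaining >= `percent` of variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0.0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if 100.0 * running / total >= percent:
            return count
    return len(ordered)
```

The scores of the selected components would then enter the association model as additional covariates.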
PopPAnTe implements the Benjamini-Hochberg procedure (BH step-up procedure) to control the false discovery rate [19], and, to aid in results interpretation and further analyses, it generates basic Quantile-Quantile and Manhattan plots – the latter only when genomic data that can be ordered in space (e.g., CpG loci) are used as predictors.
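For reference, the BH step-up adjustment can be written compactly. The sketch below returns adjusted p-values in the input order; the function name is hypothetical and the code is not taken from PopPAnTe.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A test is declared significant at a given false discovery rate when its adjusted p-value does not exceed that rate.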
Finally, when the genealogical information is available, to determine whether an association has been generated by a uniform contribution of all the families within the sample, or by a strong contribution of a small number of families, PopPAnTe reports, for each test, the percentage of families showing a positive contribution and the Gini coefficient [20] assessed on family contribution to the \(\chi^{2}\) statistics.
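Both summaries can be sketched as follows. The sketch assumes that the per-family contributions have already been computed and, for the Gini coefficient, that they are non-negative; how negative contributions are treated in PopPAnTe is not specified here. The function names are hypothetical.

```python
def percent_positive(contributions):
    """Percentage of families with a strictly positive contribution."""
    positive = sum(1 for c in contributions if c > 0)
    return 100.0 * positive / len(contributions)


def gini_coefficient(values):
    """Gini coefficient: 0 for equal contributions, near 1 when concentrated."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch assumes non-negative values")
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if total == 0:
        return 0.0
    # Rank-weighted sum over the values sorted in ascending order.
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(ordered, start=1))
    return weighted / (n * total)
```

A coefficient close to zero indicates that all families contribute similarly, whereas a value close to one indicates that a few families drive the association.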