Similarity-based tests
Assume that an association study involves N individuals (NA cases and NU controls), and within a genomic region L SNPs (both common and rare) were called. Let us denote the genotype matrix G = {g
nl
, n = 1,…,N l = 1,…,L} coded as minor allele counts, and the phenotype vector Y = {y
n
, n = 1,…,N} with the elements valued 1 for cases and −1 for controls (except when otherwise specified). The N × N similarity matrix is defined as K = {s(g
n
, g
m
)}n,m = 1N, where g
n
is a multi-site vector of genotype {g1n,…,gLn} for n th individual, and s (x,y) is a similarity function. There is a variety of examples of similarity functions published in statistical genetics literature (for examples, see Wu et al.[11], Wessel and Shork[14], and Mukhopadhyay et al.[25]). However, it is desirable for the similarity matrix K to be symmetric positive semi-definite as this is “the key to its use in many statistical analyses”[26]. Thus, we consider only those similarity measures that result in a positive definite similarity matrix. Examples of such similarity measures are the weighted linear kernel s(g
n
, g
m
) = ∑ l = 1Lw
l
g
nl
g
ml
for some fixed weights w
l
,l = 1,…,L the weighted quadratic kernel s(g
n
, g
m
) = (1 + ∑
l
Lw
l
g
nl
g
ml
)2, and the weighted IBS kernel s(g
n
, g
m
) = ∑ l = 1Lw
l
(2 − |g
nl
− g
ml
|). For our analysis, a popular exponential similarity measure[27] was used:
(2)
The choice of similarity was motivated by the need to analyze quantitative genotype obtained as a result of population stratification adjustment (see Results section). As the exponential similarity is a function of the Euclidean distance between two multi-site genotypes, we consider this similarity to be more appropriate compared with, for example, another popular similarity measure, identity-by-state[17], which was designed to be applied to genotype codes.
Weighting and collapsing
Here we consider the two major ways of rare variants pooling: weighting and collapsing. The SNP weights will be denoted as w = {w
l
, l = 1,…,L}. In general, they may be derived from observed minor allele frequency (MAF) or prior information. Here, we adopted the weights proposed by Wu et al.[11]: w
l
= Beta(maf
l
; 1, 25)2, where maf
l
is MAF of l th SNP, Beta (a; b, c) is the beta density distribution function with parameters b and c evaluated at point a. The weight function monotonically increases as MAF decreases, while, as noted by the authors, “putting decent nonzero weights for variants with MAF 1%–5%”. As noted by Wu et al.[11], setting 0 ≤ b ≤ 1 and c ≥ 1 allows for an increase in the weight of rare variants and a decrease in the weight of common variants. Thus, any values of parameters and from the specified range are acceptable. For the three tests (SKAT, MDMR and U-test), the weights are incorporated via the calculation of similarity matrix. Specifically, the weights incorporating similarity function sw for the similarity matrix K
w
is as follows:
(3)
For the KBAT test statistic, the weights were incorporated differently (for details, see the description below) as the test does not use the multi-site genotype similarity.
The collapsing of rare variants was performed as described in Thalamuthu et al.[18], namely, by defining a super-locus g
n(L+1)
,n = 1,…,N as follows:
(4)
In general, this type of collapsing preserves more information than an indicator of at least one rare variant being present, as suggested by Li and Leal[28]. The collapsed genotype is treated as a new SNP (super-locus) g
n(L+1)
,n = 1,…,N, and a similarity matrix is constructed using common variants and the super-locus.
Multivariate distance matrix regression (MDMR)
Let us denote N x N identity matrix 1
N
and a vector of 1 of size N as 1
N
. Following Wessel and Schork[14], the test statistic is calculated according to the algorithm:
-
1.
Phenotype projection matrix H = Y(Y T Y) -1 Y T, where upper T denotes transposition.
-
2.
Dissimilarity matrix D = {d
ij
}i,j = 1 N = 1
N
1
N
T − K, where K is a similarity matrix defined above.
-
3.
Gover’s centered matrix G = (1
N
− 1
N
1
N
T/N)A(I
N
− 1
N
1
N
T/N), where .
-
4.
The test statistic MDMR = tr(HGH)/tr((I
N
− H)G(I
N
− H)), where tr is matrix trace.
Large values of the test statistic indicate a deviation from the null hypothesis of no association of a genotype with a phenotype.
Sequence kernel association test (SKAT)
For this test, the phenotype vector Y = {yn,n = 1,…,N} is coded as 1 for cases and 0 for controls. The mean phenotype vector is defined as. Following Wu et al.[11], the test statistic is. The SKAT test statistic under the null hypothesis is asymptotically distributed as the weighted sum of chi-squared random variables with one degree of freedom. Thus, the significance level can be assessed theoretically. Permutations can also be used to estimate the p-value empirically.
U-test
The average similarity score between pairs of cases U
1
and controls U
0
is defined as follows:
(5)
(6)
where K
nm
, n, m = 1,…,N are the elements of the K similarity matrix (K = {K
nm
}n,m = 1N). The U-test statistic is defined as U = (U1 − U0)2. Note that Shaid et al.[17] considered the weighted sum of the single SNP U-test statistics, where weights were derived from the asymptotic variance-covariance matrix of the U statistics vector. However, for the purpose of comparison of weighting and collapsing rare variants pooling methods, the statistic was modified as described above. The test statistic U is similar to the single SNP U-test statistic proposed by Shaid et al.[17], but it incorporates the similarity information across multiple variants within a region. Permutations need to be applied to assess the p-value.
Kernel-based association test (KBAT)
Let us denote K
l
= {(K
l
)
nm
}n,m = 1N as a single SNPs similarity matrix for l th variant. Similar to the notations of the U-test subsection, Ul
1
and Ul
0
are the average similarity scores for pairs of cases and controls, respectively, calculated from K
l
, and let U
l
= (Ul 1 + Ul 0)/2. Following Mukhopadhyay et al.[15], consider the within-group and between-group sum of squares:
(7)
(8)
where the two groups are case-case and control-control pairs. The test statistic is KBAT = ∑ l = 1LBSS
l
/∑l = 1LWSS
l
. Since the test does not utilize the multi-site similarity matrix, but only single SNP matrices K
l
, the weighted test statistic KBAT
W
= ∑ l = 1Lw
l
BSS
l
/∑l = 1Lw
l
WSS
l
is used here. A large value of the KBAT statistic indicates a deviation from the null hypothesis. Permutations are used to assess the significance.
Population genetics simulations
Population genetics simulations were performed based on the code provided by King et al.[29] with demographic parameters from Boyko et al.[30]. A total of 1000 data replicates were generated for each of the four phenotype models: “Risk Rare”, “Risk Both”, “Risk Common” and “Mixed Rare”. For a detailed description of the simulations, see Additional file1.