Similaritybased tests
Assume that an association study involves N individuals (N^{A} cases and N^{U} controls), and within a genomic region L SNPs (both common and rare) were called. Let us denote the genotype matrix G = {g_{
nl
}, n = 1,…,N l = 1,…,L} coded as minor allele counts, and the phenotype vector Y = {y_{
n
}, n = 1,…,N} with the elements valued 1 for cases and −1 for controls (except when otherwise specified). The N × N similarity matrix is defined as K = {s(g_{
n
}, g_{
m
})}_{n,m = 1}^{N}, where g_{
n
} is a multisite vector of genotype {g_{1n},…,g_{Ln}} for n th individual, and s (x,y) is a similarity function. There is a variety of examples of similarity functions published in statistical genetics literature (for examples, see Wu et al.[11], Wessel and Shork[14], and Mukhopadhyay et al.[25]). However, it is desirable for the similarity matrix K to be symmetric positive semidefinite as this is “the key to its use in many statistical analyses”[26]. Thus, we consider only those similarity measures that result in a positive definite similarity matrix. Examples of such similarity measures are the weighted linear kernel s(g_{
n
}, g_{
m
}) = ∑ _{l = 1}^{L}w_{
l
}g_{
nl
}g_{
ml
} for some fixed weights w_{
l
},l = 1,…,L the weighted quadratic kernel s(g_{
n
}, g_{
m
}) = (1 + ∑ _{
l
}^{L}w_{
l
}g_{
nl
}g_{
ml
})^{2}, and the weighted IBS kernel s(g_{
n
}, g_{
m
}) = ∑ _{l = 1}^{L}w_{
l
}(2 − g_{
nl
} − g_{
ml
}). For our analysis, a popular exponential similarity measure[27] was used:
s\left({g}_{n},{g}_{m}\right)=\mathit{\text{exp}}\left\{{\displaystyle {\sum}_{l=1}^{L}{\left({g}_{\mathit{nl}}{g}_{\mathit{ml}}\right)}^{2}}\right\}
(2)
The choice of similarity was motivated by the need to analyze quantitative genotype obtained as a result of population stratification adjustment (see Results section). As the exponential similarity is a function of the Euclidean distance between two multisite genotypes, we consider this similarity to be more appropriate compared with, for example, another popular similarity measure, identitybystate[17], which was designed to be applied to genotype codes.
Weighting and collapsing
Here we consider the two major ways of rare variants pooling: weighting and collapsing. The SNP weights will be denoted as w = {w_{
l
}, l = 1,…,L}. In general, they may be derived from observed minor allele frequency (MAF) or prior information. Here, we adopted the weights proposed by Wu et al.[11]: w_{
l
} = Beta(maf_{
l
}; 1, 25)^{2}, where maf_{
l
} is MAF of l th SNP, Beta (a; b, c) is the beta density distribution function with parameters b and c evaluated at point a. The weight function monotonically increases as MAF decreases, while, as noted by the authors, “putting decent nonzero weights for variants with MAF 1%–5%”. As noted by Wu et al.[11], setting 0 ≤ b ≤ 1 and c ≥ 1 allows for an increase in the weight of rare variants and a decrease in the weight of common variants. Thus, any values of parameters and from the specified range are acceptable. For the three tests (SKAT, MDMR and Utest), the weights are incorporated via the calculation of similarity matrix. Specifically, the weights incorporating similarity function s_{w} for the similarity matrix K_{
w
} is as follows:
{s}_{w}\left({g}_{n},{g}_{m}\right)=\mathit{\text{exp}}\left\{{\displaystyle {\sum}_{l=1}^{L}{w}_{l}{\left({g}_{\mathit{nl}}{g}_{\mathit{ml}}\right)}^{2}}/{\displaystyle {\sum}_{l=1}^{L}{w}_{l}}\right\}
(3)
For the KBAT test statistic, the weights were incorporated differently (for details, see the description below) as the test does not use the multisite genotype similarity.
The collapsing of rare variants was performed as described in Thalamuthu et al.[18], namely, by defining a superlocus g_{
n(L+1)
},n = 1,…,N as follows:
{g}_{n\left(L+1\right)}=\mathit{\text{min}}\left\{2,{\displaystyle {\sum}_{l:\mathit{ma}{f}_{l}\le 0.01}\left({g}_{\mathit{nl}}\right)}\right\}
(4)
In general, this type of collapsing preserves more information than an indicator of at least one rare variant being present, as suggested by Li and Leal[28]. The collapsed genotype is treated as a new SNP (superlocus) g_{
n(L+1)
},n = 1,…,N, and a similarity matrix is constructed using common variants and the superlocus.
Multivariate distance matrix regression (MDMR)
Let us denote N x N identity matrix 1_{
N
} and a vector of 1 of size N as 1_{
N
}. Following Wessel and Schork[14], the test statistic is calculated according to the algorithm:

1.
Phenotype projection matrix H = Y(Y ^{T} Y) ^{1} Y ^{T}, where upper T denotes transposition.

2.
Dissimilarity matrix D = {d _{
ij
}}_{i,j = 1} ^{N} = 1_{
N
}1_{
N
} ^{T} − K, where K is a similarity matrix defined above.

3.
Gover’s centered matrix G = (1_{
N
} − 1_{
N
}1_{
N
} ^{T}/N)A(I _{
N
} − 1_{
N
}1_{
N
} ^{T}/N), where A={\left\{\frac{{\text{d}}_{\mathrm{ij}}^{2}}{2}\right\}}_{\text{i},\text{j}=1}^{\text{N}}.

4.
The test statistic MDMR = tr(HGH)/tr((I _{
N
} − H)G(I _{
N
} − H)), where tr is matrix trace.
Large values of the test statistic indicate a deviation from the null hypothesis of no association of a genotype with a phenotype.
Sequence kernel association test (SKAT)
For this test, the phenotype vector Y = {y_{n},n = 1,…,N} is coded as 1 for cases and 0 for controls. The mean phenotype vector is defined as\overline{Y}={N}^{A}{1}_{N}/N. Following Wu et al.[11], the test statistic is\mathit{T}={\left(Y\overline{Y}\right)}^{T}K\left(Y\overline{Y}\right)/2. The SKAT test statistic under the null hypothesis is asymptotically distributed as the weighted sum of chisquared random variables with one degree of freedom. Thus, the significance level can be assessed theoretically. Permutations can also be used to estimate the pvalue empirically.
Utest
The average similarity score between pairs of cases U_{
1
} and controls U_{
0
} is defined as follows:
{U}_{1}={\displaystyle {\sum}_{\underset{n,m:{Y}_{n}={Y}_{m}=1}{1\le n\le m\le N}}2{K}_{\mathit{nm}}/{N}^{A}\left({N}^{A}1\right)}
(5)
{U}_{0}={\displaystyle {\sum}_{\underset{n,m:{Y}_{n}={Y}_{m}=1}{1\le n\le m\le N}}2{K}_{\mathit{nm}}/{N}^{U}\left({N}^{U}1\right)}
(6)
where K_{
nm
}, n, m = 1,…,N are the elements of the K similarity matrix (K = {K_{
nm
}}_{n,m = 1}^{N}). The Utest statistic is defined as U = (U_{1} − U_{0})^{2}. Note that Shaid et al.[17] considered the weighted sum of the single SNP Utest statistics, where weights were derived from the asymptotic variancecovariance matrix of the U statistics vector. However, for the purpose of comparison of weighting and collapsing rare variants pooling methods, the statistic was modified as described above. The test statistic U is similar to the single SNP Utest statistic proposed by Shaid et al.[17], but it incorporates the similarity information across multiple variants within a region. Permutations need to be applied to assess the pvalue.
Kernelbased association test (KBAT)
Let us denote K_{
l
} = {(K_{
l
})_{
nm
}}_{n,m = 1}^{N} as a single SNPs similarity matrix for l th variant. Similar to the notations of the Utest subsection, Ul_{
1
} and Ul_{
0
} are the average similarity scores for pairs of cases and controls, respectively, calculated from K_{
l
}, and let U_{
l
} = (U_{l 1} + U_{l 0})/2. Following Mukhopadhyay et al.[15], consider the withingroup and betweengroup sum of squares:
\begin{array}{c}\mathit{WS}{S}_{l}={\sum}_{\underset{n,m{Y}_{n}={Y}_{m}=1}{1\le n\le m\le N}}{\left({\left({K}_{l}\right)}_{\mathit{nm}}{U}_{l1}\right)}^{2}\\ \phantom{\rule{4.2em}{0ex}}+{\displaystyle {\sum}_{\underset{n,m{Y}_{n}={Y}_{m}=1}{1\le n\le m\le N}}{\left({\left({K}_{l}\right)}_{\mathit{nm}}{U}_{l0}\right)}^{2}}\end{array}
(7)
\begin{array}{c}\mathit{BS}{S}_{l}=\frac{{N}^{A}\left({N}^{A}1\right){\left({U}_{l}{U}_{l1}\right)}^{2}}{2}\\ \phantom{\rule{4.3em}{0ex}}+\frac{{N}^{U}\left({N}^{U}1\right){\left({U}_{l}{U}_{l0}\right)}^{2}}{2}\end{array}
(8)
where the two groups are casecase and controlcontrol pairs. The test statistic is KBAT = ∑ _{l = 1}^{L}BSS_{
l
}/∑_{l = 1}^{L}WSS_{
l
}. Since the test does not utilize the multisite similarity matrix, but only single SNP matrices K_{
l
}, the weighted test statistic KBAT_{
W
} = ∑ _{l = 1}^{L}w_{
l
}BSS_{
l
}/∑_{l = 1}^{L}w_{
l
}WSS_{
l
} is used here. A large value of the KBAT statistic indicates a deviation from the null hypothesis. Permutations are used to assess the significance.
Population genetics simulations
Population genetics simulations were performed based on the code provided by King et al.[29] with demographic parameters from Boyko et al.[30]. A total of 1000 data replicates were generated for each of the four phenotype models: “Risk Rare”, “Risk Both”, “Risk Common” and “Mixed Rare”. For a detailed description of the simulations, see Additional file1.