In this article, we compared the performance of four similarity-based tests combined with two rare-variant pooling strategies, using population genetics simulations and the GAW17 real data set. The results suggest that weighting may be a much better strategy than collapsing when rare variants have a strong effect and common variants have a moderate or low effect. Collapsing, in turn, performed much better when common variants had a strong effect. The absolute difference in power of a statistical test applied with the collapsing versus the weighting pooling strategy may be substantial. It follows from our study that if researchers are inclined to believe that rare variants within a region are associated with the phenotype, weighted pooling should be applied with similarity-based tests, whereas collapsing is more appropriate if common variants are suspected to be associated with the phenotype. Additionally, when rare variants had a strong effect in one direction and common variants were excluded from the analysis, collapsing performed as well as or better than weighting. Finally, both SKAT and KBAT had consistently high power compared with the other similarity-based tests considered when applied with the appropriate pooling strategy.

Recently, Basu and Pan [24] compared the performance of multiple statistical tests for identifying an association with rare variants. The authors included SKAT with unweighted linear and quadratic kernels as one of the testing strategies. Based on the results, Basu and Pan [24] concluded that SKAT was powerful compared with other methods when only rare variants were considered. However, the authors found that the method lost its high power when neutral common variants were added. Our results suggest that using weighted kernels in SKAT may preserve high power to identify an association with rare variants even if multiple neutral common variants are included in the analysis. However, since we compared only similarity-based tests, additional investigation is required to compare weighted similarity-based tests with other statistical strategies, including those considered in Basu and Pan [24].

From our results, the MDMR test does not seem to perform well when applied with the weighting pooling strategy. To obtain a more detailed picture, we applied the weighted MDMR test to the “Risk Rare” data sets with modified weights $w_l^{p}$, *l* = 1, …, *L*, where the exponent *p* varied from 0 to 1. Thus, *p* = 1 corresponded to the beta weights applied in our study, whereas *p* = 0 corresponded to the analysis without weights. Additional file 7 shows the power surface as a function of the significance level and the value of *p*. For all significance levels, the power of MDMR decreased monotonically with higher values of *p*, which corresponded to higher relative weights for rare variants. In Additional file 1, we proved that when the number of cases equals the number of controls (as in our simulations), the SKAT and MDMR test statistics are equivalent to the sum of dissimilarities and the sum of squared dissimilarities over all case–control pairs, respectively. When the weighting pooling strategy is applied, dissimilarity tends to be relatively large for pairs of individuals whose genotypes differ in multiple rare minor alleles. Squaring the dissimilarity measure puts much more emphasis on pairs with a larger dissimilarity. Thus, the magnitude of the MDMR test statistic may be almost completely determined by the number of case–control pairs whose genotypes differ by at least two rare minor alleles. We suppose that pairs with a difference of one rare allele may not have sufficient dissimilarity to significantly influence the MDMR test statistic, which leads to a loss of power. To illustrate our reasoning, consider two rare variants with only eight observed minor alleles each across 500 cases and 500 controls. To simplify the description, assume that each individual carries either zero or one minor allele across the two variants. We also use the equivalence of the MDMR test statistic to the sum of squared case–control dissimilarities. Consider the following configurations under the null and alternative hypotheses, respectively: cases and controls each carry four minor alleles of each variant, and cases carry all minor alleles. Under the alternative hypothesis, there are zero case–control pairs with a difference of two alleles across the genotype, whereas under the null hypothesis there are 32. However, under the alternative hypothesis, there are 16 × 500 case–control pairs with a difference of one minor allele, whereas under the null hypothesis there are only 16 × (500 − 8). It follows that if the dissimilarity of pairs of individuals with a difference of two alleles is large enough relative to the dissimilarity of pairs with a difference of only one allele, the MDMR test statistic may become lower than the null test statistic. This consideration explains the low performance of MDMR with weighted similarity and the fact that, for the “Risk Rare” scenario, the power of the MDMR test was below the type-1 error rate.
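The pair counts in this illustration can be checked by brute force. The following is a minimal sketch rather than the code used in the study: the function names and the dissimilarity values `d1` and `d2` (for pairs differing by one and by two minor alleles) are hypothetical placeholders, since the actual weighted dissimilarities depend on the weights and the allele frequencies.

```python
from itertools import product

N = 500  # number of cases and of controls, as in the illustration


def make_group(n, carriers_a, carriers_b):
    """Genotypes at two rare variants; each carrier has exactly one minor allele."""
    non_carriers = n - carriers_a - carriers_b
    return [(1, 0)] * carriers_a + [(0, 1)] * carriers_b + [(0, 0)] * non_carriers


def count_pairs(cases, controls):
    """Count case-control pairs by the number of variants at which they differ."""
    counts = {0: 0, 1: 0, 2: 0}
    for g_case, g_ctrl in product(cases, controls):
        counts[sum(a != b for a, b in zip(g_case, g_ctrl))] += 1
    return counts


def sum_of_squares(counts, d1, d2):
    """Sum of squared dissimilarities; d1 and d2 are assumed illustrative values."""
    return counts[1] * d1 ** 2 + counts[2] * d2 ** 2


# Null: four minor alleles of each variant in cases and in controls.
null_counts = count_pairs(make_group(N, 4, 4), make_group(N, 4, 4))
# Alternative: all sixteen minor alleles are carried by cases.
alt_counts = count_pairs(make_group(N, 8, 8), make_group(N, 0, 0))

print(null_counts)  # 32 pairs differ by two alleles, 16 * 492 by one
print(alt_counts)   # no pairs differ by two alleles, 16 * 500 by one
print(sum_of_squares(null_counts, 1.0, 3.0), sum_of_squares(alt_counts, 1.0, 3.0))
```

In this toy configuration the alternative gains 16 × 8 = 128 one-allele pairs and loses 32 two-allele pairs, so the sum of squares under the alternative falls below the null value exactly when 32 `d2`² > 128 `d1`², that is, when `d2` > 2 `d1`.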

One limitation of the current study is that the minimum significance level in the population genetics simulations was 0.001. At the genome-wide significance level, the number of permutations needed to estimate significance reliably is very large, which makes a comparison of the similarity-based tests at that level computationally prohibitive. In real GWAS, only a few highly significant genes will require a very large number of permutations to estimate *p*-values, as the many genes with a low or no association signal can be dropped after a few thousand permutations. For highly significant genes, the permutation procedure can be split into several parts and performed in parallel on a cluster.
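The idea of dropping weakly associated genes early can be sketched as an adaptive permutation loop. This is an illustrative sketch under stated assumptions, not the procedure used in the study: the stopping rule (stop once `min_exceed` permuted statistics reach the observed value), the batch size, the toy statistic `mean_difference`, and all function and parameter names are hypothetical.

```python
import random


def mean_difference(labels, scores):
    """Toy statistic: absolute difference in mean score, cases vs controls."""
    case = [s for y, s in zip(labels, scores) if y == 1]
    ctrl = [s for y, s in zip(labels, scores) if y == 0]
    return abs(sum(case) / len(case) - sum(ctrl) / len(ctrl))


def adaptive_permutation_pvalue(statistic, labels, scores, batch=1000,
                                max_perm=100_000, min_exceed=10, seed=0):
    """Permute case-control labels in batches and stop early for weak signals.

    Returns the estimated p-value and the number of permutations performed.
    """
    rng = random.Random(seed)
    observed = statistic(labels, scores)
    shuffled = list(labels)
    exceed = 0
    done = 0
    while done < max_perm:
        for _ in range(batch):
            rng.shuffle(shuffled)
            if statistic(shuffled, scores) >= observed:
                exceed += 1
        done += batch
        # Assumed rule: enough exceedances means the gene is clearly not
        # highly significant, so further permutations are unnecessary.
        if exceed >= min_exceed:
            break
    # Add-one correction keeps the estimate away from exactly zero.
    return (exceed + 1) / (done + 1), done


labels = [1] * 20 + [0] * 20
no_signal = [i % 2 for i in range(40)]   # identical score means in both groups
strong_signal = [1.0] * 20 + [0.0] * 20  # scores separate the groups perfectly
print(adaptive_permutation_pvalue(mean_difference, labels, no_signal, batch=100, max_perm=2000))
print(adaptive_permutation_pvalue(mean_difference, labels, strong_signal, batch=100, max_perm=2000))
```

For the parallel case, the same loop could be run in independent chunks with distinct seeds, with the exceedance counts and permutation counts summed across chunks before the *p*-value is formed; the function above would need to return the raw counts for that.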