AP-SKAT: highly-efficient genome-wide rare variant association test

Background: Genome-wide association studies have revealed associations between single-nucleotide polymorphisms (SNPs) and phenotypes such as disease symptoms and drug tolerance. To address the small sample sizes available for rare variants, association studies tend to group variants at the gene or pathway level and evaluate the effect of the variant set as a whole. One such strategy, the sequence kernel association test (SKAT), is a widely used collapsing method. However, the p-values reported by SKAT tend to be biased because the asymptotic property of the statistic is used to calculate the p-value. Although this bias can be corrected by applying permutation procedures to the test statistics, the computational cost of obtaining high-resolution p-values is prohibitive.

Results: To address this problem, we devise an adaptive SKAT procedure, termed AP-SKAT, that efficiently classifies significant SNP sets and ranks them according to the permuted p-values. Our procedure adaptively stops the permutation test when the significance level falls outside a confidence interval of the estimated p-value under a binomial distribution. To evaluate performance, we first compare the power, the sample size calculation, and the type I error rate estimates of SKAT, SKAT-O, and the proposed procedure using genotype data from the SKAT R package and the 1000 Genomes Project. Through computational experiments using whole genome sequencing and SNP array data, we show that the proposed procedure is highly efficient and of comparable accuracy to the standard procedure.

Conclusions: For several types of genetic data, the developed procedure achieves competitive power and sample sizes under both small and large sample conditions while controlling the type I error rate, and estimates p-values of significant SNP sets that are consistent with those estimated by the standard permutation test within a realistic time.
This demonstrates that the procedure is sufficiently powerful for recent whole genome sequencing and SNP array data with increasing numbers of phenotypes. Additionally, this procedure can be used in other association tests by employing alternative methods to calculate the statistics.


Background
High-throughput sequencing (HTS) technologies enable the detection of rare and common variants at the genome-wide scale for thousands of individuals [1, 2]. In addition, with population-specific reference panels comprising variants detected by HTS, low-frequency variants can be imputed accurately from single-nucleotide polymorphism (SNP) array genotype data [3]. Thus far, associations between SNPs and disease phenotypes have been studied for genotype data from HTS and SNP arrays, and the recent focus has moved to rare and low-frequency variants. Unlike common variants, the power of single-variant association tests on rare and low-frequency variants is low because of the lack of allele counts, even with thousands of individuals.
To address this issue, rare and low-frequency variants are often grouped at the gene or pathway level, and the effects of multiple variants are evaluated jointly. This type of strategy is called collapsing, and the sequence kernel association test (SKAT) [4, 5] is one of the most effective collapsing methods [6, 7]. Because the p-values of SKAT are derived from an asymptotic distribution of its statistic, the p-values for datasets with an insufficient number of samples may be inaccurate, causing inflation or power loss. To obtain accurate p-values, resampling methods such as the permutation test can be implemented in SKAT. However, resampling requires a huge amount of computation time to obtain high-resolution p-values for the correction of multiple comparisons, and hence a more efficient resampling method is necessary.
Therefore, we propose an adaptive procedure, termed AP-SKAT, for the highly efficient calculation of SKAT statistics. This procedure adaptively stops the permutation test when the significance level falls outside a predetermined confidence interval of the estimated p-value. Specifically, we propose the following criteria to stop the permutation test and obtain a p-value: (i) when all permutation statistics are greater than or less than the original statistic, the calculation is terminated once the probability of that event is less than the significance level, and (ii) the calculation is terminated when the confidence interval of the estimated p-value does not include the significance level. To show the effectiveness of the proposed procedure, we first evaluate the power and sample size calculations of SKAT [4], SKAT-O [5], and the proposed procedure using a genotype dataset in the SKAT R package [8]. Second, we evaluate the type I error rates of SKAT-O and the proposed procedure using real whole genome sequencing (WGS) data from the 1000 Genomes Project (1000GP) [9]. Finally, computational experiments additionally using SNP array data downloaded from the Wellcome Trust Case Control Consortium (WTCCC) [10] and the International HapMap Project [11] show that the proposed procedure can calculate highly accurate p-values within a reasonable time. We conclude that the proposed procedure is applicable to recent sequencing and genotype-imputed data with large amounts of phenotype data.

Sequence kernel association test
Let n and m be the number of individuals and grouped SNPs, respectively. The SKAT test statistic s is calculated as

s = (y − μ)^T G W G^T (y − μ),   (1)

where y is an n-dimensional vector of observed phenotypes and μ is an n-dimensional vector of means predicted under the null hypothesis, i.e., that the target phenotype has no association with the genotypes, using a logistic model for case/control studies and a linear model for quantitative trait analysis. G is given by (g_1, …, g_i, …, g_m), where g_i is an n-dimensional vector containing the genotypes of the n individuals for the ith SNP, and W = diag(w_1, …, w_j, …, w_m) is an m × m diagonal matrix consisting of the weights w_j for the jth variant.
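Although AP-SKAT itself is distributed as R code, the statistic in Eq. (1) is a simple quadratic form; a minimal NumPy sketch (function and variable names are ours, not from the package) is:

```python
import numpy as np

def skat_statistic(y, mu, G, w):
    """SKAT statistic s = (y - mu)^T G W G^T (y - mu), with W = diag(w)."""
    res = y - mu            # residuals under the null model
    t = G.T @ res           # per-variant score contributions (m-vector)
    return float(t @ (w * t))

# toy data: n = 5 individuals, m = 3 variants
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(5, 3)).astype(float)  # genotype dosages 0/1/2
y = rng.normal(size=5)
mu = np.full(5, y.mean())   # intercept-only null model
w = np.ones(3)
s = skat_statistic(y, mu, G, w)
```

Computing t = G^T (y − μ) first avoids forming the n × n kernel matrix G W G^T explicitly, which matters when n is in the thousands.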
In calculating SKAT statistics, we assume

y_i = α + Σ_{j=1}^{m} β_j G_{i,j} + ε_i,   (2)

where y_i is the ith element of y, α is a constant unrelated to the genotypes, β_j is the effect size of the jth SNP, G_{i,j} is the (i, j)th element of G, and ε_i is a noise term that follows a Gaussian distribution. A convenient property of s is that it follows a mixture of chi-squared distributions, so we can calculate p-values for the obtained statistics when the optimal conditions are satisfied [4]. However, it has been suggested that the distribution of s differs from the ideal one when the sample size n is insufficient and the phenotype data do not follow a Gaussian distribution. Thus, in case/control or cohort genome studies with limited samples, it is not valid to evaluate the test statistics based on a mixture of chi-squared distributions. In this case, Lee et al. [5] suggested using the optimal adjustment technique termed SKAT-O, which combines the burden test and a moment adjustment technique to modify the distribution instead of using the permutation test, and Wu et al. [12] proposed an alternative procedure to efficiently and analytically calculate the adaptive sum of SKAT-O statistics. However, even when applying these techniques, the modified distribution includes residual biases. Additionally, for a permutation test with more than 20,000 SNP sets, grouping SNPs at the gene level and accounting for multiple testing is not practical because it requires at least 4.0 × 10^5 (α_p = 5.0 × 10^−2) or 2.0 × 10^6 (α_p = 1.0 × 10^−2) tests for each SNP set, where α_p is the significance level. Thus, we focus on obtaining detailed p-values, through the permutation test, for sets of rare SNPs whose association with phenotypes lies around the predefined significance level α_p, and efficiently calculate p-values by adaptively stopping the test for plausible/improbable sets.
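Under the same notation, phenotypes for a simulation study can be drawn directly from Eq. (2). The sketch below is our illustration (not package code) and assumes standard Gaussian noise:

```python
import numpy as np

def simulate_phenotypes(G, beta, alpha=0.0, seed=None):
    """Draw y_i = alpha + sum_j beta_j * G_ij + eps_i with eps_i ~ N(0, 1),
    i.e., the quantitative-trait model of Eq. (2)."""
    rng = np.random.default_rng(seed)
    return alpha + G @ beta + rng.normal(size=G.shape[0])

# under the null hypothesis (all beta_j = 0), y is pure noise around alpha
G = np.zeros((10000, 3))
y = simulate_phenotypes(G, beta=np.zeros(3), alpha=1.0, seed=1)
```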

Distribution of estimated p-values in permutation test
In the process of a permutation test, let B be the number of permutations completed and r the number of permutation statistics that are greater than the original statistic s computed from the observed data. In this case, following a previous SNP analysis [13], we consider a binary random variable X that takes the value 1 when a permutation statistic is greater than s and 0 otherwise. Over the permutations considered so far, the expectation and variance of X are

E[X] = p̂,  Var[X] = p̂(1 − p̂),

where p̂ = r/B is the estimated p-value of an SNP set after the Bth permutation. Thus, the Bienaymé formula for the sum of variances gives the variance of the mean as

Var(p̂) = p̂(1 − p̂)/B.

According to the central limit theorem, we consider p̂ to follow a Gaussian distribution N(p̂, p̂(1 − p̂)/B) and obtain d_α as the width of the α confidence interval of this distribution. In this binomial setting, we fix the number of permutations B and consider the numerator r as a random variable to estimate p̂. We then compare the α confidence interval of p̂ with α_p, where α_p is a predetermined significance level accounting for multiple comparisons, and continue the permutation until either the α confidence interval does not include α_p or B reaches b. Figure 1 exemplifies this situation. In contrast, Chen et al. [13] used a negative binomial setting by fixing the total number of successes r and considering the denominator B as a random variable to estimate p̂. They chose b and R to control the standard error of p̂ at some predetermined values given α_p, and continued the permutation until r reached R or B reached b.
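The binomial stopping rule above can be written compactly. The following Python sketch (illustrative names; covers only the non-degenerate case 0 < r < B) checks whether the α confidence interval of p̂ excludes α_p:

```python
import math
from statistics import NormalDist

def ci_excludes_alpha_p(r, B, alpha_p, alpha=0.05):
    """Return True when the (1 - alpha) Gaussian confidence interval of
    p_hat = r / B does not contain alpha_p, so the permutation test may stop.
    The degenerate counts r == 0 and r == B need a separate rule (see text)."""
    if r == 0 or r == B:
        return False  # variance estimate is 0; the Gaussian CI is meaningless
    p_hat = r / B
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(p_hat * (1 - p_hat) / B)
    return alpha_p < p_hat - half or alpha_p > p_hat + half
```

For example, with r = 50 of B = 100 permutation statistics above s, p̂ = 0.5 lies far above α_p = 0.05 and the test can stop; with r = 5 of B = 100, p̂ = 0.05 sits inside its own confidence interval and the test continues.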
However, when r is 0 or B, the variance estimate becomes 0, and the above criterion cannot reasonably be used to terminate the permutation test. Thus, we adopt a negative binomial distribution. Let Y be a positive integer random variable indicating the number of trials and α_e = α_p × m, where m is the number of SNP sets. Assuming that the true p-value is at most α_p (when r is 0) or at least α_e (when r is B), we compute the probability of observing B trials given r and finish the permutation test at level α_p. Hence, when r is 0, if the probability NB(Y = B; B, 1 − α_p) is less than α_p, which gives an α_p confidence level for p̂ = α_p, the permutation test can be stopped and we obtain p̂ = 1/B. Similarly, when r is B, if the probability NB(Y = B; B, α_e) is less than α_p, the permutation test can be stopped and we obtain p̂ = 1.
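A sketch of this degenerate-count rule follows; it is our illustration of the probabilities NB(Y = B; B, 1 − α_p) and NB(Y = B; B, α_e) above, which reduce to simple powers because all B trials fall on one side:

```python
def degenerate_stop(r, B, alpha_p, alpha_e):
    """Early stopping for the degenerate counts r == 0 and r == B: stop once
    the probability of the observed one-sided run, evaluated at the boundary
    p-value (alpha_p or alpha_e), drops below alpha_p."""
    if r == 0:
        # NB(Y = B; B, 1 - alpha_p): B consecutive failures if the true p-value were alpha_p
        return (1 - alpha_p) ** B < alpha_p
    if r == B:
        # NB(Y = B; B, alpha_e): B consecutive successes at success probability alpha_e
        return alpha_e ** B < alpha_p
    return False
```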
If more precise p-values are needed for significant SNP sets, we can ignore the stopping criterion when p̂ < α_p and proceed with all b permutation tests to obtain the minimal p̂ = 1/b.

Adaptive SKAT
Our proposed procedure adaptively stops the permutation test when the significance level α_p is outside the α confidence interval of the estimated p-value, using the binomial and negative binomial distributions described in the previous subsection. The proposed procedure is described in Algorithm 1.
The following values are taken as input parameters: the significance level α_e (α_p = α_e/m), the maximum number of permutation tests b, which must be at least 1/α_p, and the significance interval α for the Gaussian distribution. Note that, in practice, we should also set M, the number of tests performed within the same loop, for computational efficiency. We recommend setting b = 5/α_p, α = α_p, and M = 1000, as used in the Results section. In practice, when SNPs are grouped at the gene level, the number of SNP sets exceeds 20,000. Although our proposed procedure can handle a few phenotypes on a single processor within a reasonable time, multiple phenotypes and their combinations entail a huge computational cost. As in many association testing procedures, we therefore recommend using parallel computation to calculate the p-value for each SNP set on a different core.
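Putting the pieces together, a condensed Python sketch of the adaptive loop is shown below. This is our illustration, not the R implementation: the stopping check shown covers only the non-degenerate case 0 < r < B, and an intercept-only null model is assumed so that permuting y alone is valid.

```python
import math
import numpy as np
from statistics import NormalDist

def adaptive_p_value(stat_fn, y, mu, G, w, alpha_p, b, alpha=0.05, M=1000, seed=None):
    """Run permutations in batches of M; stop early once the alpha confidence
    interval of the estimated p-value excludes alpha_p, or after b permutations."""
    rng = np.random.default_rng(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    s_obs = stat_fn(y, mu, G, w)
    r = B = 0
    while B < b:
        for _ in range(M):
            perm = rng.permutation(len(y))
            if stat_fn(y[perm], mu, G, w) > s_obs:
                r += 1
            B += 1
        if 0 < r < B:
            p_hat = r / B
            half = z * math.sqrt(p_hat * (1 - p_hat) / B)
            if alpha_p < p_hat - half or alpha_p > p_hat + half:
                break
    return max(r, 1) / B  # report p_hat = 1/B when no permutation exceeded s_obs
```

Checking the criterion only once per batch of M permutations, rather than after every permutation, is what keeps the bookkeeping overhead small relative to the statistic evaluations themselves.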

Results and discussion
We first compare the power and sample size calculations of SKAT, SKAT-O, and the proposed procedure. In these experiments, following the SKAT R package and previous literature [4, 5], we adopted the following settings: we used a numerical matrix of 10,000 haplotypes over a 200,000 base-pair region, where each row represents a different haplotype and each column represents a different SNP marker. The matrix was generated by the calibrated coalescent model (COSI) based on the LD structure of European ancestry [8]. As in the SKAT R package, to evaluate the power of the above methods, we simulated datasets under the alternative model; thus, we repeatedly and randomly selected 5 kb regions from the broader region, and then randomly set causal variants from the rare variants with a minor allele frequency (MAF) of less than 0.05 in each simulation. For generating phenotypes, we considered 20 % of the rare variants to be causal. The results are summarized in Tables 1, 2, 3, 4 and 5. These results show that the proposed procedure achieves relatively higher power than SKAT and SKAT-O even when the sample size and effect size are small, and retains competitive power when these are large, achieving a type II error of almost 0.2. Even when the phenotype does not follow the ideal distribution, the proposed procedure controls the type I error rate better than SKAT-O.
Additionally, we evaluated the type I error rates of SKAT-O and the proposed procedure when {β_1, …, β_m} are 0 and ε_i in Eq. (2) follows Student's t-distribution with 5 degrees of freedom; thus, the distribution of phenotypes has heavier tails than the ideal normal one. In this setting, we applied Illumina WGS data for 2,504 samples from 26 populations across Africa, East and South Asia, Europe, and the Americas in the 1000 Genomes Project [9] and performed 50 experiments for each sample size of {500, 1,000, 1,500, 2,000}, randomly extracted from the data. The numbers of false positives obtained using SKAT-O and the proposed procedure are summarized in Table 6, which indicates that the proposed method can reduce the number of false positives even when the distribution has heavier tails than the normal one.

Finally, to validate the proposed approach, we compared the computation times and p-values estimated by the permutation test (standard procedure) and the adaptive procedure. For this comparison, we prepared genotype data from the aforementioned WGS data of the 1000 Genomes Project, from the Illumina Infinium 550 SNP BeadChip for 1,438 samples from the 1958 British Birth Cohort in the Wellcome Trust Case Control Consortium [10], and from the Illumina SNP Chip for 1,397 individuals from 11 populations, including 250 of the original 270 phase I and phase II individuals, in the International HapMap Project [11]. Their quantitative phenotype data were synthetically generated according to a Gaussian distribution, and SNPs were grouped at the gene level. Note that only SNPs annotated as 'High' and 'Moderate' by the SnpEff tool [14] were selected as plausible ones for 1000GP, because WGS data include many less significant SNPs. All SNPs were grouped at the gene level for the WTCCC and HapMap data. In these experiments, we also considered only SNPs with an MAF of less than 0.05.
The significance interval α, the maximum number of permutations b, and the loop size M were set to α ∈ {0.05, 2.5 × 10^−6, 2.5 × 10^−11}, b ∈ {100, 1000, …, 1.0 × 10^7}, and M = min(N/10, 10^4), respectively. All computations were performed on 800 nodes of an Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80 GHz (20 cores each) in our supercomputer system. Figure 2 indicates that the computation time of the standard procedure increases linearly with respect to the number of permutation tests b. Hence, the setting with b = 10^8 tests was infeasible, even using our supercomputer system. In contrast, the computation time of the adaptive procedure is bounded because the proposed procedure terminates the evaluation of SNP sets according to the stopping criterion. Hence, the computation time of the adaptive procedure depends on the number of significant SNP sets; as only a handful of sets should be selected as significant, the computational cost is significantly lower than that of the standard procedure. When b = 100, the computational cost of the adaptive procedure is higher than that of the standard procedure, because the adaptive procedure requires additional computation to judge the stopping criterion in each loop of M tests. However, as b should be greater than 1/α_p to account for multiple comparisons, the low computational cost when N > 1.0 × 10^5 is more important.
In Fig. 3, the p-values estimated by the adaptive procedure clearly approach those of the standard procedure as the confidence interval narrows, and they are almost identical when the confidence interval is below 2.5 × 10^−6. Even when the confidence interval was set to around 0.05, the tendency of the p-values could be observed, enabling us to determine whether the p-values of SNP sets exceeded the threshold. These results indicate that the proposed procedure can be applied at the whole genome scale to achieve arbitrary confidence levels within a reasonable time.

[Fig. 3: Comparison of p-values estimated by the standard and adaptive procedures for the 1000 Genomes Project, WTCCC, and HapMap data with significance intervals of 0.05, 2.5 × 10^−6, and 2.5 × 10^−11. Solid and dotted lines denote the baseline and the Bonferroni-corrected significance level (p = 0.05), respectively. Circles indicate the estimated p-values of SNP sets; the numbers of SNP sets are 20,568, 13,397, and 31,002, respectively. Both axes are on a logarithmic scale.]

Conclusions
In this paper, we proposed a novel rare variant association procedure that can calculate the p-values for sets of SNPs within a reasonable time. A comparison experiment showed that the proposed procedure significantly reduces the computational cost while maintaining the estimation quality at predefined significance levels, and that its cost remains bounded even at the highest significance level. This result demonstrates that the proposed procedure is capable of calculating p-values of SNP sets for WGS data that cannot be evaluated by the standard permutation procedure. In addition, this procedure can be applied to other common/rare variant association tests [15, 16]. The R code is available at http://nagasakilab.csml.org/data/aSKAT.zip; the input is either PLINK-format files or a numeric matrix.

Availability and requirements
Project name: AP-SKAT
Project home page: http://nagasakilab.csml.org/data/aSKAT.zip
Operating system(s): Platform independent
Programming language: R
Any restrictions to use by non-academics: Please contact authors for commercial use.