GATE: an efficient procedure in study of pleiotropic genetic associations

Background: Association studies of human complex traits are widely used to identify deleterious genetic markers. Compared with single-trait analyses, multiple-trait analyses can make better use of the information on both traits and markers, and can thus improve the statistical power of association tests. Principal component analysis (PCA) is a well-known tool in multivariate analysis and can be applied to this task. Generally, PCA is first performed on all traits, and then a certain number of top principal components (PCs) that explain most of the trait variation are selected to construct the test statistics. However, in some situations, using only these top PCs loses important evidence carried by the discarded PCs and thus compromises power.
Methods: To overcome this drawback while keeping the advantages of using the top PCs, we propose a group accumulated test evidence (GATE) procedure. By dividing the PCs, sorted in descending order of their eigenvalues, into a few groups, GATE integrates the information of traits at the group level.
Results: Simulation studies demonstrate the superiority of the proposed approach over several existing methods in terms of statistical power. In some scenarios the increase in power reaches 25%. The methods are further illustrated using the Heterogeneous Stock Mice data, which were collected from a quantitative genome-wide association study.
Conclusions: Overall, GATE provides a powerful test for pleiotropic genetic associations.
Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-3928-7) contains supplementary material, which is available to authorized users.


Background
Many human complex traits are highly correlated because of genetics, environmental influences and the interactions among them; examples include low density lipoprotein and triglycerides, serum calcium and phosphorus, and serum prostate specific antigen and prostate cancer [1][2][3]. Identifying genetic variants that are associated with these correlated traits can help researchers better understand their genetic architecture [4]. The single nucleotide polymorphism (SNP) is an important genetic factor. A variety of SNPs have been found to be deleterious through multiple-trait, single-marker hypothesis tests. For example, seven SNPs, namely rs3764261, rs4420638, rs629301, rs964184, rs1367117, rs1042034, and rs174546, are concurrently associated with four complex traits: total cholesterol, high and low density lipoprotein, and triglycerides [1,5]. The SNP rs2476601 has been reported to be associated with five traits: rheumatoid arthritis [6], Crohn's disease [7], systemic lupus erythematosus [8], type I diabetes [9], and Graves' disease [10]. The joint analysis of the associations between multiple traits and a single marker is becoming popular, and many methods have been put forward in the literature [5,11-19]. Broadly speaking, these methods can be classified into two categories: univariate analyses and multivariate analyses. The basic idea of univariate analyses is to first test the association between each single trait and the SNP, and then to combine the resulting p-values with a p-value combination procedure to construct an omnibus test. Fisher-combined p-values [20] and weighted p-values [16] are two representative approaches of this type.
*Correspondence: liqz@amss.ac.cn. †Equal contributors. 1 Key Laboratory of Systems and Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China. Full list of author information is available at the end of the article.
Multivariate analyses mainly consist of two types of methods: model-based analyses and dimension-reduction methods. In model-based analyses, either the traits are regressed on the marker or the marker is regressed on the traits simultaneously. Frequently used regression models are the mixed effect model and the proportional odds model [5,21,22]. By using random effects to account for the correlation among subjects, the linear mixed effect model can capture the covariance structure induced not only by correlated phenotypes but also by population structure [12,18]. Bayesian methods are another important type of model-based approach. PEER [23] and mvBIMBAM [15] are two Bayesian approaches; they use inferred hidden factors and posterior probabilities, which can indicate which phenotypes are involved in the association model. On the other hand, canonical correlation analysis (CCA) and principal component analysis (PCA) are two common dimension-reduction approaches. Both have been widely applied in pleiotropic genetic association studies [17,24,25].
As is well known, the Fisher-combined p-value test attains optimal Bahadur efficiency when the p-values are independent [26]. However, in pleiotropic genetic studies, the test statistics are often dependent. For example, the largest correlation coefficient among the traits in the Heterogeneous Stock Mice data analyzed below is 0.98. TATES, a typical weighted p-value procedure, uses an extended Simes procedure to correct for correlations among components, and might have low power when the genetic variant affects only some of the highly correlated traits. MultiPhen [5], which uses the proportional odds model with the marker as the outcome and the traits as the independent variables, may lose power when the genetic marker of interest is associated with all traits and the traits are strongly correlated. CCA [25] is equivalent to one-way multivariate analysis of variance (MANOVA). PCA-based tests are mainly built on the top principal components (PCs), which explain most of the total phenotypic variance of the traits used in the association studies. Such tests lose power if the discarded PCs are highly correlated with the traits. Moreover, there is no widely accepted criterion for selecting the optimal PCs. Furthermore, Aschard et al. [17] pointed out that PCs accounting for a small proportion of the total variance can be as important in association studies as those accounting for a large proportion. To make use of them, they developed a multistep combined PC procedure (mCPC). In their method, the number of top PCs included in the first group is a key quantity that affects the power significantly. An accumulated contribution rate of 80% is recommended for choosing this number. As shown in the simulation studies below, using 80% can sometimes lose power markedly.
In this work, we propose a procedure called GATE to test for the association between multiple traits and a single marker. GATE can be implemented in three steps: 1) perform PCA on all traits and, for each PC in turn, calculate the p-value of the association test between that PC and the marker; 2) sort the p-values in descending order of the corresponding eigenvalues of the trait covariance matrix, divide them into a few groups of given sizes, and use the Fisher-combined method to combine the p-values within and between groups; 3) let the number of p-values assigned to the first group vary and take the minimum of all the quantities obtained in Step 2 as the final test statistic. To improve computational efficiency, we propose a resampling procedure that collapses a two-layer resampling scheme into a single layer to calculate the statistical significance of the test statistic. It relies on the facts that, under the null hypothesis that the genetic marker is not associated with the traits, all the p-values asymptotically follow the uniform distribution on [0,1] and the quantities $-2\ln(p\text{-value})$ follow the chi-squared distribution with 2 degrees of freedom. Simulation studies show that GATE outperforms TATES, MultiPhen and mCPC in terms of power under most scenarios. In some cases a power increase of more than 25% is achieved (see Fig. 3 below). The performance of the compared methods is further illustrated using the genotypic and phenotypic data from the Heterogeneous Stock Mice data, collected from a quantitative genome-wide association study.
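The null-distribution facts quoted above can be illustrated with a minimal sketch of Fisher's combination of independent p-values. The function names `chi2_sf_even` and `fisher_combine` are hypothetical and are not taken from the authors' R code; the sketch assumes independent p-values and uses the closed-form chi-squared tail that is available for even degrees of freedom.

```python
import math

def chi2_sf_even(x, df):
    """Upper tail P(X > x) of a chi-squared variable with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # (x/2)^i / i!
        total += term
    return math.exp(-half) * total

def fisher_combine(pvalues):
    """Fisher statistic -2 * sum(ln p) and its p-value under independence."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    # Each -2 ln p is chi-squared with 2 df under the null, so the sum has 2k df.
    return stat, chi2_sf_even(stat, 2 * len(pvalues))
```

With a single p-value the combination returns that p-value unchanged, which is a quick sanity check of the 2-degrees-of-freedom claim.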

The GATE
Suppose that $n$ unrelated individuals are enrolled from a source population in a genetic study. For the $i$th individual, let $y_{ij}$ be the observed value of the $j$th trait and let $g_i$ denote the genotype at a SNP locus, $i = 1, 2, \cdots, n$, $j = 1, 2, \cdots, m$, where $m$ is the number of traits of interest. Denote $Y = (y_{ij})_{n \times m}$ and $G = (g_1, g_2, \cdots, g_n)^{\tau}$. Let $\Sigma = (\delta_{j_1 j_2})_{m \times m}$ be the covariance matrix of the traits, $j_1, j_2 = 1, 2, \cdots, m$. By the singular value decomposition, $\Sigma$ can be written as $\Sigma = Q \Lambda Q^{\tau}$, where $\Lambda$ is a diagonal matrix with diagonal elements $\lambda_1, \lambda_2, \cdots, \lambda_m$ ($\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m \geq 0$) and $Q$ is an orthogonal matrix whose columns are the eigenvectors. Denote $Z = YQ$, which is called the principal component matrix; its columns correspond to the principal components. Let $z_j$ be the $j$th column vector of $Z$. The relationship between the traits and the genotype can then be transformed into
$$z_j = \beta_j G + \varepsilon_j, \quad j = 1, 2, \cdots, m,$$
where $\varepsilon_j$ is the residual error term, whose entries independently follow a normal distribution with mean zero and unknown variance $\sigma^2$. The null hypothesis that there is no association between the genetic variant and the phenotypes becomes $H_0: \beta_1 = \beta_2 = \cdots = \beta_m = 0$. Denote the Wald test statistic for $\beta_j = 0$ by $T_j$, $j = 1, 2, \cdots, m$. Then $T_1, T_2, \cdots, T_m$ are independently and identically distributed and asymptotically follow the standard normal distribution under the null hypothesis.
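As an illustration of this step, the following minimal sketch computes the PCs and one Wald statistic per PC with NumPy. It is not the authors' implementation: the function name `pc_wald_statistics` is hypothetical, and the sketch assumes that traits and genotype are mean-centred, which is equivalent to fitting an intercept in each simple regression.

```python
import numpy as np

def pc_wald_statistics(Y, g):
    """Eigenvalues (descending) and Wald statistics T_j, one per principal component.

    Y: (n, m) trait matrix with m >= 2; g: length-n genotype vector.
    """
    Y = np.asarray(Y, dtype=float)
    g = np.asarray(g, dtype=float)
    n, m = Y.shape
    Yc = Y - Y.mean(axis=0)                  # centre traits (assumption of this sketch)
    cov = np.cov(Yc, rowvar=False)           # m x m sample covariance of the traits
    eigvals, Q = np.linalg.eigh(cov)         # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder so lambda_1 >= ... >= lambda_m
    eigvals, Q = eigvals[order], Q[:, order]
    Z = Yc @ Q                               # principal component scores
    gc = g - g.mean()
    sxx = gc @ gc
    beta = (gc @ Z) / sxx                    # least-squares slope for every PC
    resid = Z - np.outer(gc, beta)
    sigma2 = (resid ** 2).sum(axis=0) / (n - 2)
    T = beta / np.sqrt(sigma2 / sxx)         # Wald statistic for beta_j = 0
    return eigvals, T
```

The returned statistics are ordered by decreasing eigenvalue, which is the ordering the grouping step relies on.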
To test $H_0$, a natural choice is Fisher's combined test, denoted by
$$\mathrm{FCT} = \sum_{j=1}^{m} T_j^2,$$
which asymptotically follows the central chi-squared distribution with $m$ degrees of freedom (DF). Notice that $T_1, T_2, \cdots, T_m$ are sorted in descending order of the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$. Aschard et al. (2014) proposed to use
$$\mathrm{mCPC} = -2 \ln\Big(1 - F_s\Big(\sum_{j=1}^{s} T_j^2\Big)\Big) - 2 \ln\Big(1 - F_{m-s}\Big(\sum_{j=s+1}^{m} T_j^2\Big)\Big),$$
where $s$ is the smallest integer satisfying $\sum_{j=1}^{s} \lambda_j \big/ \sum_{j=1}^{m} \lambda_j \geq 0.8$, and $F_d(\cdot)$ is the cumulative distribution function of the central chi-squared distribution with $d$ DFs. As shown in the simulations below, using 0.8 to determine $s$ is not robust, and mCPC can lose power substantially. Sometimes the power loss exceeds 25% (see Fig. 3 below).
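The two statistics discussed here can be sketched in a few lines of standard-library Python. The names `chi2_sf`, `fct`, `top_pc_count` and `mcpc` are hypothetical, the two-group form of mCPC follows the description in the text (Fisher combination of the two group-level p-values, referred to a chi-squared distribution with 4 DF), and the fallback used when every PC lands in the first group is an assumption of this sketch.

```python
import math

def chi2_sf(x, df):
    """Upper tail of a chi-squared variable with positive integer df (stdlib only)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        sf, d = math.exp(-half), 2
    else:
        sf, d = math.erfc(math.sqrt(half)), 1
    while d < df:
        # add (x/2)^(d/2) * exp(-x/2) / Gamma(d/2 + 1), computed on the log scale
        sf += math.exp((d / 2.0) * math.log(half) - half - math.lgamma(d / 2.0 + 1.0))
        d += 2
    return min(1.0, sf)

def fct(T):
    """Sum of squared statistics and its chi-squared(m) p-value."""
    stat = sum(t * t for t in T)
    return stat, chi2_sf(stat, len(T))

def top_pc_count(eigvals, threshold=0.8):
    """Smallest s whose cumulative eigenvalue share reaches the threshold."""
    total, cum = sum(eigvals), 0.0
    for s, lam in enumerate(eigvals, start=1):
        cum += lam
        if cum / total >= threshold:
            return s
    return len(eigvals)

def mcpc(T, eigvals, threshold=0.8):
    """Two-group combination: top s PCs versus the remaining m - s PCs."""
    s, m = top_pc_count(eigvals, threshold), len(T)
    if s == m:                       # no second group; fall back to FCT (assumption)
        return fct(T)
    top = sum(t * t for t in T[:s])
    rest = sum(t * t for t in T[s:])
    stat = (-2.0 * math.log(max(chi2_sf(top, s), 1e-300))
            - 2.0 * math.log(max(chi2_sf(rest, m - s), 1e-300)))
    return stat, chi2_sf(stat, 4)
```

The inputs `T` and `eigvals` are assumed to be ordered by decreasing eigenvalue.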
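To overcome this drawback, we suggest dividing all marginal test statistics $T_1, T_2, \cdots, T_m$ into $K$ groups:
$$\{T_1, \cdots, T_{m_1}\}, \ \{T_{m_1+1}, \cdots, T_{m_1+m_2}\}, \ \cdots, \ \{T_{m_1+\cdots+m_{K-1}+1}, \cdots, T_m\},$$
where $m_t$ denotes the size of the $t$th group, $0 < m_t \leq m$, $t = 1, 2, \cdots, K$, and $m_1 + m_2 + \cdots + m_K = m$. For a given grouping (i.e. $m_1, m_2, \cdots, m_K$ are fixed), we can first construct a combined statistic as
$$\xi_{m_1 m_2 \cdots m_K} = -2 \sum_{t=1}^{K} \ln\Big(1 - F_{m_t}\Big(\sum_{j=m_1+\cdots+m_{t-1}+1}^{m_1+\cdots+m_t} T_j^2\Big)\Big),$$
where $F_d(\cdot)$ is the cumulative distribution function of the central chi-squared distribution with $d$ DFs, and $\xi_{m_1 m_2 \cdots m_K}$ asymptotically follows $\chi^2_{2K}$, a central chi-squared distribution with $2K$ DFs, under the null hypothesis. It should be noted that when $K = 1$, although the DF of the distribution of $\xi_{m_1}$ is 2, the power of $\xi_{m_1}$ is exactly equal to that of FCT, because $\xi_{m_1}$ is a monotone transformation of FCT. Hence the proposed test statistic is given by
$$\mathrm{GATE} = \min_{K} \Big\{1 - H_K\Big(\max_{m_1, \cdots, m_K} \xi_{m_1 m_2 \cdots m_K}\Big)\Big\},$$
where $H_K(\cdot)$ is the cumulative distribution function of $\max_{m_1, \cdots, m_K} \xi_{m_1 m_2 \cdots m_K}$. Note that GATE reduces to FCT when $K = 1$ and becomes mCPC when $K = 2$ and $m_1 = s$. Hence GATE is expected to have broader application than FCT and mCPC.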

Significance computation
Because GATE is the minimum of several correlated statistics, its exact or asymptotic distribution is hard to derive. To calculate the p-value of GATE, we adopt a resampling procedure. Since the distribution function of the statistic $\xi_{m_1 m_2 \cdots m_K}$ under each possible grouping is unknown, a two-layer resampling procedure is required in principle. However, two-layer resampling is computationally intensive. To address this, we use the following one-layer resampling procedure:
1) Calculate GATE based on the observations and denote it by $\eta^{(0)}$. Set a large number $B$, for example $B = 10000$;
2) For $b$ from 1 to $B$, generate $m$ random variables that are i.i.d. from the standard normal distribution, denoted by $T_1^{(b)}, T_2^{(b)}, \cdots, T_m^{(b)}$, and compute the corresponding $\xi_{m_1 m_2 \cdots m_K}$;
3) Estimate the distribution function $H$ with the $\xi_{m_1 m_2 \cdots m_K}$ obtained from Step 2 and denote the estimate by $\hat{H}$;
4) For each $b$, use $T_1^{(b)}, T_2^{(b)}, \cdots, T_m^{(b)}$ and $\hat{H}$ to calculate GATE, and denote it by $\eta^{(b)}$;
5) Calculate the p-value of GATE as
$$\frac{\#\{b : \eta^{(b)} \leq \eta^{(0)}, \ b = 1, 2, \cdots, B\}}{B},$$
where $\#$ is an operator that counts the number of elements in a set.
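The one-layer idea can be sketched as follows, using only the Python standard library. This is a hedged illustration rather than the authors' R code: the names `chi2_sf`, `neg2log_sf`, `max_xi` and `gate_pvalue` are hypothetical, only K = 1 and K = 2 are handled, the K = 2 groupings are all splits with the first group of size 1 to m - 1, and the empirical distribution function is evaluated with `bisect`. The same B null draws are used both to estimate the distribution of the inner maximum and to build the null distribution of GATE.

```python
import bisect
import math
import random

def chi2_sf(x, df):
    """Upper tail of a chi-squared variable with positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        sf, d = math.exp(-half), 2
    else:
        sf, d = math.erfc(math.sqrt(half)), 1
    while d < df:
        sf += math.exp((d / 2.0) * math.log(half) - half - math.lgamma(d / 2.0 + 1.0))
        d += 2
    return min(1.0, sf)

def neg2log_sf(x, df):
    """-2 ln(1 - F_df(x)), guarded against underflow of the tail probability."""
    return -2.0 * math.log(max(chi2_sf(x, df), 1e-300))

def max_xi(T, K):
    """Largest xi over the groupings considered for K groups (K = 1 or 2 only)."""
    sq = [t * t for t in T]
    m, total = len(sq), sum(sq)
    if K == 1:
        return neg2log_sf(total, m)
    if K != 2:
        raise ValueError("this sketch only handles K = 1 or K = 2")
    best, head = -math.inf, 0.0
    for m1 in range(1, m):               # size of the first group
        head += sq[m1 - 1]
        best = max(best, neg2log_sf(head, m1) + neg2log_sf(total - head, m - m1))
    return best

def gate_pvalue(T_obs, B=10000, Ks=(1, 2), seed=0):
    """Return (observed GATE, resampling p-value) from one layer of null draws."""
    rng = random.Random(seed)
    m = len(T_obs)
    null_max = {K: [] for K in Ks}
    for _ in range(B):
        T_b = [rng.gauss(0.0, 1.0) for _ in range(m)]
        for K in Ks:
            null_max[K].append(max_xi(T_b, K))
    sorted_null = {K: sorted(v) for K, v in null_max.items()}

    def eta(value_by_K):
        # min over K of 1 - H_hat_K(value), with H_hat_K the empirical CDF
        return min(1.0 - bisect.bisect_right(sorted_null[K], value_by_K[K]) / B
                   for K in Ks)

    eta0 = eta({K: max_xi(T_obs, K) for K in Ks})
    eta_b = [eta({K: null_max[K][b] for K in Ks}) for b in range(B)]
    return eta0, sum(e <= eta0 for e in eta_b) / B
```

Because the null draws depend only on m, the sorted null values could be cached and reused across markers; the sketch recomputes them on each call for simplicity.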
We point out that when m is fixed, the empirical null distribution functions of ξ and GATE are fixed and do not depend on the marker. Hence GATE can be readily applied to large-scale genetic studies such as genome-wide association studies.

Association models
Since the effect of a causal genetic variant on the phenotypes can be indirect or direct [27], we consider two association models (an indirect and a direct association model) to generate multiple correlated phenotypes. These two models (denoted by Model 1 and Model 2) have also been used in van der Sluis et al. [16] and Aschard et al. [17]. In Model 1, the genetic marker is associated with the phenotypes through latent factors. Consider $m$ correlated phenotypes $Y_1, Y_2, \cdots, Y_m$, which depend on $L$ latent variables $U_1, U_2, \cdots, U_L$ and a genetic marker $G$. Model 1 can be expressed as
$$U_l = \beta_l G + e_l, \quad l = 1, 2, \cdots, L,$$
$$Y_i = \gamma_i U_{k_i} + \varepsilon_i, \quad i = 1, 2, \cdots, m,$$
where $k_1, k_2, \cdots, k_m \in \{1, 2, \cdots, L\}$, and $e_1, e_2, \cdots, e_L$ and $\varepsilon_1, \varepsilon_2, \cdots, \varepsilon_m$ are independent random error terms that follow the standard normal distribution. Let $G$ denote the genotype value (the number of minor alleles) for a biallelic SNP with minor allele frequency $p$ (MAF = $p$), and assume that Hardy-Weinberg equilibrium holds in the general population at the SNP locus. Thus the corresponding genotype frequencies are $\Pr(G = 0) = (1-p)^2$, $\Pr(G = 1) = 2p(1-p)$, and $\Pr(G = 2) = p^2$. It should be noted that in reality the latent variables are unobservable. The correlations among phenotypes depend on the coefficients $\beta = (\beta_1, \beta_2, \cdots, \beta_L)^{\tau}$ and $\gamma = (\gamma_1, \gamma_2, \cdots, \gamma_m)^{\tau}$, which measure the strength of the association between the genetic marker and the latent variables and the association between the latent variables and the traits, respectively. The proportion of the variance of the $i$th phenotype explained by the genetic variant is
$$\frac{2p(1-p)\gamma_i^2 \beta_{k_i}^2}{\gamma_i^2\big(2p(1-p)\beta_{k_i}^2 + 1\big) + 1}.$$
For Model 2, the genetic marker is directly associated with the phenotypes and its genetic effects are independent of the latent factors. The relationships are
$$Y_i = \beta_i G + \sum_{k=1}^{L} \gamma_{ik} U_k + \varepsilon_i, \quad i = 1, 2, \cdots, m,$$
where $U_1, U_2, \cdots, U_L$ are $L$ latent variables that are independently normally distributed with mean 0 and variance 1, $G$ and $\varepsilon_1, \varepsilon_2, \cdots, \varepsilon_m$ are defined as above, and $\gamma_{ik}$ and $\beta_i$ are coefficients, $i = 1, 2, \cdots, m$, $k = 1, 2, \cdots, L$. The proportion of the variance of the $i$th phenotype explained by $G$ can be calculated by
$$\frac{2p(1-p)\beta_i^2}{2p(1-p)\beta_i^2 + \sum_{k=1}^{L} \gamma_{ik}^2 + 1}.$$
These two simulation schemes are illustrated in Fig. 1.
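A small simulation sketch may help make the two data-generating schemes concrete. It assumes the latent-factor forms described in this section (for Model 1, each latent variable is a linear function of the genotype plus standard normal noise and each phenotype loads on one latent variable; for Model 2, the genotype enters each phenotype directly). All function names are hypothetical, latent indices are 0-based, and genotypes are drawn under Hardy-Weinberg equilibrium as the sum of two independent allele draws.

```python
import math
import random

def draw_genotype(rng, maf):
    """Minor-allele count under Hardy-Weinberg equilibrium (0, 1 or 2)."""
    return int(rng.random() < maf) + int(rng.random() < maf)

def simulate_model1(n, maf, beta, gamma, k, seed=0):
    """Indirect model: U_l = beta_l*G + e_l and Y_i = gamma_i*U_{k_i} + eps_i."""
    rng = random.Random(seed)
    G, Y = [], []
    for _ in range(n):
        g = draw_genotype(rng, maf)
        U = [b * g + rng.gauss(0.0, 1.0) for b in beta]
        Y.append([gamma[i] * U[k[i]] + rng.gauss(0.0, 1.0) for i in range(len(gamma))])
        G.append(g)
    return G, Y

def simulate_model2(n, maf, beta, gamma, seed=0):
    """Direct model: Y_i = beta_i*G + sum_k gamma[i][k]*U_k + eps_i."""
    rng = random.Random(seed)
    L = len(gamma[0])
    G, Y = [], []
    for _ in range(n):
        g = draw_genotype(rng, maf)
        U = [rng.gauss(0.0, 1.0) for _ in range(L)]
        Y.append([beta[i] * g + sum(gamma[i][j] * U[j] for j in range(L))
                  + rng.gauss(0.0, 1.0) for i in range(len(beta))])
        G.append(g)
    return G, Y

def null_correlation(gamma_i, gamma_j):
    """Correlation of two Model 1 phenotypes sharing a latent variable when beta = 0."""
    return gamma_i * gamma_j / math.sqrt((gamma_i ** 2 + 1.0) * (gamma_j ** 2 + 1.0))
```

Under these assumptions, `null_correlation(0.5, 0.5)` gives 0.2 and `null_correlation(2.0, 2.0)` gives 0.8.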

Simulation settings
To compare our proposed method with the existing methods, we generate datasets from the indirect and direct association model, respectively. The detailed setups are as follows.
We specify different values for the latent variable coefficients γ i , i = 1, 2, · · · , m to ensure that the non-zero elements of the correlation matrix match the above four structures under the null hypothesis (i.e. β 1 = β 2 = · · · = β L = 0). For the uniform low and high correlation structures, we let all γ i , i = 1, 2, · · · , m be equal to 0.5 and 2, which results in uniform correlation matrices with equal correlation coefficients of 0.2 and 0.8, respectively. On the other hand, we consider a list of monotonically decreasing values for γ to construct a gradient correlation matrix. When m = 20, we have L = 5 and set γ = (1, 0.8, 0.6, 0.4, 0.2) τ . The derived correlation matrix belongs to the third pattern of correlation matrix and has a largest non-zero correlation coefficient of 0.500 (moderate) and a smallest one of 0.038 (low). We set γ = (1.5, 1.3, 1.1, 0.9, 0.7) τ to obtain the fourth pattern of correlation structure, whose non-zero correlation coefficients range from 0.329 to 0.692 when m = 20. We denote the four correlation structures obtained for the indirect association model by S1, S2, S3, and S4, respectively. The detailed settings of γ i , i = 1, 2, · · · , m corresponding to the above four correlation structures for m = 20 are presented as follows: The correlation matrix is calculated under the null hypothesis (β 1 = β 2 = · · · = β 5 = 0). Similarly, we simulate 100 correlated phenotypes with the above third and fourth correlation structures by choosing γ accordingly. In addition, the detailed settings of the correlation structures when 100 correlated phenotypes are considered, which we denote by S5, S6, S7, and S8, respectively, are provided in Additional file 1. We specify the minor allele frequency (

Selection of K
The selection of K in GATE is important, since a large K leads to extensive computation and a small K may fail to capture the information thoroughly. We suggest selecting K = 2 for the proposed GATE procedure in practice. From the viewpoint of "pseudo degrees of freedom", when K = 1 the statistic ξ m 1 m 2 ···m K used in the construction of the GATE statistic has roughly m degrees of freedom, while when K ≥ 2 the corresponding DF becomes 2K. Thus, when the number of traits to be analysed is large enough (m > 4), dividing all single Wald test statistics T 1 , T 2 , · · · , T m into 2 groups leads to the smallest DF. Hence, we deduce that GATE with K = 2 will have better power than GATE with other choices of K. Furthermore, we conduct simulation studies to explore the performance of GATE under different choices of K. The simulation results, summarized in Additional file 1, are consistent with this deduction. Therefore, in the following simulations, we compare GATE with K ∈ {1, 2} only to the other existing methods.

Performance comparison to other methods
To assess the performance of the proposed GATE approach, we compare it with four existing methods: TATES [16], MultiPhen [5], MANOVA, and mCPC [17].

1) Indirect association model
Firstly, we assume the correlated phenotypes are sampled from the indirect association model and explore the performances of the above five tests. Table 1

Power
Next, we compare the powers of TATES, MANOVA, MultiPhen, mCPC, and GATE under the nominal significance level of 0.05. Under each correlation structure, five levels of association are considered, in which λ = 20%, 40%, 60%, 80%, 100% of the phenotypes are associated with the genotype. Denote the number of associated traits by k (= λm). Without loss of generality, we assume that the first k phenotypes are associated with G. In addition, we consider scenarios in which the phenotypes associated with the genotype are randomly selected; the corresponding results are presented in Additional file 1. Figure 2 reports the power results for 20 correlated phenotypes generated from Model 1 with the correlation structures S1, S2, S3, and S4, respectively. To make the powers comparable, we set the proportions of the variance of the associated phenotypes explained by the genetic variant under the four configurations (S1, S2, S3, S4) to 0.1%, 0.2%, 0.1%, and 0.2%, respectively. In most cases, our proposed test is more powerful than the other methods, except when the correlations among associated phenotypes are uniformly strong (S2). In some cases the power gain of GATE over the other four approaches reaches 13%. For example, when MAF = 0.15, n = 1,500, λ = 60% and the correlation structure is S3, the empirical powers of TATES, MANOVA, MultiPhen, mCPC, and GATE are 0.324, 0.309, 0.312, 0.286, and 0.453, respectively. GATE is slightly less powerful than [...] In most scenarios, GATE performs better than the other three methods. From Fig. 3, we can see that the superiority of GATE over the other three methods is more evident when the number of analysed traits is large. In some cases, the power increase of the proposed GATE reaches 25%. For instance, when MAF = 0.15, n = 1,500, λ = 60% under the correlation matrix of S7, the powers of TATES, MANOVA, mCPC, and GATE are 0.251, 0.428, 0.409, and 0.683, respectively.
When the non-zero correlation coefficients are uniformly large (S6), mCPC performs slightly better than GATE. This occurs because, when the correlations among different phenotypes are strong and a relatively large number of traits are analysed, the test ξ m 1 , which is included in the construction of GATE, loses power substantially. However, under the other correlation structures (S5, S7, and S8), the powers of GATE are always higher than those of mCPC. In some cases, GATE has a 28% increase in power compared with mCPC. For instance, when MAF = 0.05, n = 1,500, λ = 60% under the correlation matrix of S7, the powers of mCPC and GATE are 0.408 and 0.693, respectively.

2) Direct association model
Next, we assess the performance of the proposed test compared with those of the other four tests when the correlated phenotypes are sampled from the direct association model.

Type I error rate
In this section, we compare the type I error rates of TATES, MANOVA, MultiPhen, mCPC, and GATE when multiple phenotypes are generated from the direct association model. Table 2 reports the type I error rates for 20 and 100 correlated phenotypes under the nominal significance level of 0.05. It shows that when m = 20, all five tests control the type I error rate correctly, as their empirical type I error rates are close to the nominal level. For example, when MAF = 0.30, the type I error rates of TATES, MANOVA, MultiPhen, mCPC, and GATE under the correlation structure of S11 are 0.049, 0.054, 0.055, 0.052, and 0.054, respectively. However, when the number of simulated phenotypes is large (m = 100), MultiPhen has inflated type I error rates, while the other four tests maintain correct type I error rates. For instance, when m = 100, the type I error rates of MultiPhen under the correlation structure of S15 for MAF = 0.05, 0.15, 0.30, and 0.50 are 0.106, 0.102, 0.118, and 0.103, respectively.

Power
Next, we compare the power of the five tests when multiple phenotypes are simulated from Model 2. Under each correlation structure, five levels of association are considered, in which λ = 20%, 40%, 60%, 80%, 100% of the phenotypes are associated with the genetic variant, and the number of associated traits is k. Here we report the results under the scenario where the first k phenotypes are associated with the genetic variant. Additional empirical power results for the cases in which the associated phenotypes are randomly selected with equal probability are available in Additional file 1. Figure 4 presents the power results of all five tests for 20 correlated phenotypes simulated from Model 2 with S9, S10, S11 and S12, respectively. To make the power results comparable, we set the proportions of the variance of the associated phenotypes explained by the genetic variant under the four configurations (S9-S12) to 0.2%, 0.1%, 0.2%, and 0.2%, respectively. When m = 20, we find that MANOVA and MultiPhen usually have similar performances. For example, when MAF = 0.30 and the correlation structure is S9, the powers of MANOVA and MultiPhen for λ = 20%, 40%, 60%, 80%, and 100% are (0.348, 0.350), [...] their corresponding powers for λ = 100% are 0.501, 0.475 and 0.798, respectively. When the correlations among different phenotypes are nonuniform (S11 and S12), GATE performs better than the other four methods in most cases. When λ is relatively small, mCPC slightly outperforms GATE. However, when λ becomes large, the powers of GATE exceed those of mCPC significantly, and in some cases the power increase reaches 25%. For example, when the correlation structure is S11 and MAF = 0.15, the powers of mCPC and GATE for λ = 80% are 0.372 and 0.621, respectively. Moreover, when there exist strong correlations among phenotypes (S10), TATES suffers significant loss [...] Figure 5 shows the power results of four tests, TATES, MANOVA, mCPC, and GATE, for 100 phenotypes simulated from Model 2. The proportions of the variance of the associated phenotypes explained by the genetic variant under the four configurations (S13, S14, S15, S16) are all set to 0.1%. The performances of all compared approaches are similar to those under m = 20.
The number of correlated phenotypes is 20 and 100. Scenarios S9-S12 correspond to four correlation structures for m=20 and Scenarios S13-S16 are for m=100. For each scenario, four MAFs, 0.05, 0.15, 0.30, and 0.50, are considered. The nominal significance level is 0.05 and 1000 simulation replicates are conducted.

Applications to heterogeneous stock mice data
The mouse is an important model organism that can provide information on gene functions in mammals. Its use has proved to be a powerful approach to understanding the genetic architecture of human disease and fundamental mammalian biology [28]. To further explore the performance of the proposed method in testing for pleiotropic genetic effects, we apply it to the Heterogeneous Stock Mice data, downloaded from http://mus.well.ox.ac.uk/. Originally, 101 phenotypes were collected, including models of human disease (such as asthma, type 2 diabetes mellitus, obesity, and anxiety), immunological, biochemical and hematological phenotypes, and others [29]. These 101 phenotypes belong to 19 categories, and a full description of them is available in Solberg et al. [30]. Before the analysis, we remove the phenotypes whose proportion of missing values is larger than 0.01, so that 52 phenotypes (listed in Additional file 1) are left. The remaining phenotypes are correlated with each other, and the largest correlation coefficient is 0.979, which occurs between two hematological phenotypes: "Haem Haemoglobin" and "Haem Haematocrit". In addition, we exclude the subjects with missing observed phenotype values, leaving a total of 588 mice. There are 302 SNPs on chromosome 19. After removing the SNPs with a proportion of missing genotype values larger than 15% or an MAF smaller than 0.05, 250 SNPs are finally analyzed. We use TATES, MANOVA, MultiPhen, mCPC, and GATE to test the association between the SNPs on chromosome 19 and all 52 phenotypes. A total of 1,000,000 resamplings are conducted to calculate the p-value of GATE. Under the nominal significance level of 0.05, the adjusted significance level for a single test is 0.05/250 = 2 × 10 −4 by the Bonferroni correction for multiplicity. On the whole, GATE identifies more SNPs associated with the 52 phenotypes than the other four methods. Among the 250 SNPs, 125 are detected by GATE, while 80, 114, 118, and 116 SNPs are detected by TATES, MANOVA, MultiPhen, and mCPC, respectively. Among the 125 SNPs, 7 are detected as significantly associated with the analyzed phenotypes only by the proposed method. The p-values of these SNPs for the five methods are presented in Table 3. For each SNP, the p-value of GATE is always the smallest and is smaller than the adjusted significance level 2 × 10 −4 . For example, for rs13483499, the p-value of GATE is 6.30 × 10 −5 , which is smaller than those of TATES [...]. It has been reported [29] that the SNP rs13459157 is associated with the phenotype "OFT Activity and defecation", which is among the 52 phenotypes, and that the SNP rs6259521 is associated with the phenotype "Pleth Enhanced pause (baseline)".

Discussion
Genetic variants play fundamental roles in studies of human complex diseases. The elucidation of genetic risk factors can provide insight into the occurrence of diseases and thereby make targeted therapy feasible. As genome-wide association studies move forward, the association between multiple traits and a single SNP has become a hot topic. Intuitively, multiple-traits-single-SNP analysis (MTSS) is more powerful in identifying deleterious SNPs than a single-marker test on one trait. In this paper, we have presented GATE, a new procedure for MTSS. The false positive rate of GATE is controlled correctly for various MAFs and different correlation structures of the traits, since the computation of the significance of GATE is based on the resampling procedure. Extensive simulations under both the direct and the indirect association model show that GATE outperforms the existing procedures when the association model is indirect and the relationship is not consistently strong, and is more robust in other situations. In other words, GATE is an efficient multivariate analysis procedure for association studies between genotypes and phenotypes, since the underlying genetic architecture is generally unknown beforehand.
"snpid" is the ID of the selected SNPs.
We provide a resampling procedure to calculate the significance of GATE. The key to this resampling procedure is to generate i.i.d. observations from the standard normal distribution. The procedure is user-friendly and can be implemented in any statistical or numerical software, such as R, SAS, or Matlab. In principle, a two-layer resampling procedure should be employed to obtain the p-value of GATE. Here we adopt a one-layer resampling procedure, in which the cumulative distribution function (H) of the inner statistic is estimated first, and then the estimated distribution function and the same samples are used to obtain the empirical significance of GATE. This procedure efficiently reduces the computational cost and makes GATE feasible for large-scale genetic studies. Since a large number of replications in the one-layer resampling procedure does not incur a high computational cost, we recommend using B = 10000 or larger to ensure the stability of the calculated p-values of GATE. On the other hand, we can use the generalized Gamma distribution (GGD) [20] to approximate the distribution of −2 ln(GATE). The 95%, 99%, 99.9%, 99.99%, and 99.999% quantiles from the fitted GGD and the empirical values of −2 ln(GATE) based on 1,000,000 resamplings are given in Additional file 1: Table S5. They match very well. So, to reduce the computational burden, we can consider using the fitted GGD to obtain the p-values of GATE. The proposed procedure has been coded in R version 3.3.3 and is available at http://www.statsci.amss.ac.cn/yjscy/yjy/lqz/201510/t20151027_313273.html.
PCA is an important tool in multivariate analysis. In PCA, a crucial issue is how to select PCs. A standard selection criterion is the cumulative contribution rate, under which a few top PCs are chosen. As pointed out by [31] and [17], using only some top PCs might miss important PCs that have a low contribution rate but are highly correlated with the outcome. FCT, which combines all PCs, is an alternative approach. However, it loses power substantially when the number of true signals is large. To overcome this drawback, Aschard et al. [17] proposed the mCPC procedure. By dividing the marginal test statistics for the PCs into two groups and combining the tests among groups, the DF can be reduced, especially when the signals are very sparse. However, because the correlation structure among multiple phenotypes and the association strength between genotype and phenotype are unknown beforehand, a fixed grouping technique is not robust enough. GATE builds a bridge between FCT and mCPC. To some extent, GATE can be regarded as an extension of FCT and mCPC, since it is exactly equal to FCT when the number of groups is one and takes mCPC as one of its components. GATE is constructed from a large family of test statistics containing FCT and mCPC. Overall, GATE is more robust than FCT and mCPC, as the simulation results also demonstrate.
GATE is also an extension of TATES, which uses the minimum of weighted p-values as the test statistic. Basically, TATES can be viewed as a function of p-values, namely the combination of a linear operator and a minimum operator. GATE, in contrast, uses the cumulative distribution function, log, quadratic-form and summation functions. These four functions are commonly used in constructing test statistics in hypothesis testing, and are expected to integrate the information over a wider range of scenarios than other functions. The simulations show that GATE is more robust than TATES under most of the considered scenarios.
Covariates or confounding factors, including gender, age, and environmental factors, can be of great importance in assessing the associations between genetic variants and complex traits. Adjusting for covariates in genetic association studies has two motivations: one is to correct for bias in the genetic effect estimates, and the other is to improve statistical power. For example, hidden population structure cannot be ignored in population genetic association studies, and a failure to account for population stratification might lead to many false positive findings. It is therefore routine for researchers to correct for population stratification in genome-wide association studies. Fortunately, the proposed GATE procedure can be directly applied to multiple-trait association studies with covariates by adding the covariates to the association analysis of each single PC and the genetic variant.
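One way this adjustment could look in practice is sketched below with NumPy: each PC is regressed on an intercept, the genotype and the covariates by ordinary least squares, and the Wald statistic of the genotype coefficient is returned. The function name `wald_with_covariates` is hypothetical, and the sketch assumes a linear model with at least one covariate column; it is not taken from the authors' software.

```python
import numpy as np

def wald_with_covariates(z, g, X):
    """Wald statistic for the genotype effect in the OLS fit z ~ 1 + g + X.

    z: length-n PC scores; g: length-n genotypes; X: (n,) or (n, q) covariates.
    """
    z = np.asarray(z, dtype=float)
    g = np.asarray(g, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(z), -1)
    D = np.column_stack([np.ones(len(z)), g, X])    # design: intercept, genotype, covariates
    coef, *_ = np.linalg.lstsq(D, z, rcond=None)
    resid = z - D @ coef
    dof = len(z) - D.shape[1]
    sigma2 = resid @ resid / dof                    # residual variance estimate
    cov = sigma2 * np.linalg.inv(D.T @ D)           # covariance of the OLS estimates
    return coef[1] / np.sqrt(cov[1, 1])             # genotype coefficient over its SE
```

Applying this function to every PC in turn yields covariate-adjusted statistics that can be passed to the grouping and resampling steps unchanged.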