 Proceedings
 Open Access
A probabilistic method for identifying rare variants underlying complex traits
 Jiayin Wang^{2}Email author,
 Zhongmeng Zhao^{1}Email author,
 Zhi Cao^{1},
 Aiyuan Yang^{1} and
 Jin Zhang^{2}
https://doi.org/10.1186/1471216414S1S11
© Wang et al.; licensee BioMed Central Ltd. 2013
 Published: 21 January 2013
Abstract
Background
Identifying the genetic variants that contribute to disease susceptibilities is important both for developing methodologies and for studying complex diseases in molecular biology. It has been demonstrated that the spectrum of minor allelic frequencies (MAFs) of risk genetic variants ranges from common to rare. Although association studies are shifting to incorporate rare variants (RVs) affecting complex traits, existing approaches do not show a high degree of success, and more efforts should be considered.
Results
In this article, we focus on detecting associations between multiple rare variants and traits. Similar to RareCover, a widely used approach, we assume that variants located close to each other tend to have similar impacts on traits. Therefore, we introduce elevated regions and background regions, where the elevated regions are considered to have a higher chance of harboring causal variants. We propose a hidden Markov random field (HMRF) model to select a set of rare variants that potentially underlie the phenotype, and then, a statistical test is applied. Thus, the association analysis can be achieved without preselection by experts. In our model, each variant has two hidden states that represent the causal/noncausal status and the region status. In addition, two Bayesian processes are used to compare and estimate the genotype, phenotype and model parameters. We compare our approach to the three current methods using different types of datasets, and though these are simulation experiments, our approach has higher statistical power than the other methods. The software package, RareProb and the simulation datasets are available at: http://www.engr.uconn.edu/~jiw09003.
Keywords
 Minor Allele Frequency
 Rare Variant
 Causal Variant
 Markov Random Field
 Background Region
Introduction
In most existing genetic variant association studies, "common trait, common variants", which asserts that common genetic variants contribute to most of traits (disease susceptibilities), serves as the central assumption. Researchers have successfully identified some significant associations between common single nucleotide polymorphisms (SNPs) and disease traits [1]. However, despite the enormous efforts expended on association studies of complex traits, common genetic variants only show a moderate influence on different phenotypes in many reported disease associations and consequently have limited diagnostic value [2, 3]. While the identification of common variants creates a dilemma, known as "common trait, rare variants", an alternative hypothesis, which asserts that multiple rare variants with moderate to high penetrances may collectively influence disease susceptibilities, has been suggested in some literatures [3–5]. Rare variants are defined as those whose minor allele frequencies (MAF) are less than or equal to 0.01 (≤ 10^{2}). Although some rare variants associated with Mendelian diseases have been identified, more often, the allelic population attributable risk (PAR), which describes a small reduction in the incidence that would be observed in unexposed samples compared to the actual exposure pattern, is low. The odds ratio (OR), a measure of the strength of association or nonindependence between two binary data values, is also low. Moreover, based on the "common trait, rare variants" hypothesis, in many cases, a set of rare variants, instead of just one variant, should be identified to fully explain the genetic influence. Both the singlevariant test [6] and the multiplevariant test [7] have been applied to rare variant association studies. However, due to the reasons outlined above, neither of them shows satisfactory power in obtaining associations. Although more and more attention is being focused upon rare variants, there has only been limited success thus far [8–10].
Alternatively, the collapsing strategy, also called the "burdenbased test", is another approach for rare variants association studies. Most of the collapsebased approaches build on the "recessiveset" genetic model, in which the predisposing haplotype contains mutation(s) in at least one variant [11]. Multiple rare variants in the same locus are collapsed, based on different standards, then statistical tests are applied. The locus here is defined as a selected region that consists of a group of candidate rare variants [9, 12–14]. However, it is argued that existing collapsebased approaches assume all rare variants implicitly influencing the phenotype in the same direction and with the same magnitude [10, 15]. Researchers have observed that any given rare variant could have no effect, could be causal, or could be protective for the endpoints (traits) [15]. For example, some lowfrequency variants in African Americans PCSK9 can have a substantial effect on serum LowDensity Lipoprotein Cholesterol (LDLC) by increasing the risk of or protecting against myocardial infarction [16–18].
Collapsebased approaches have low statistical powers when "causal", "neutral" and "protective" variants are combined [13, 15, 19]. To overcome this weakness, some approaches [9, 14] assume that the rare variants are well selected by experts, while weighting of each variant is another widely used strategy [9, 11, 14]. In a recent study, Bhatia and others [19] suggest the development of a "modelfree" approach, RareCover, that only collapses a subset of potentially causal variants from all of the given variants. Here, the "model" refers to the genetic association model that consists of the preselection candidate variants.
Motivated by RareCover, in this article, we focus on rare variant association analysis without any preselection of candidate variants. We propose a probabilistic approach, RareProb, to make selections using a Markov random field (MRF) model and identify multiple causal rare variants that influence a dichotomous phenotype using statistical tests. Our approach considers both the causal and the protective variants, which distinguishes it from the previous study RareCover, and it is therefore a robust predictor of the direction and the magnitude of the genetic effects. Moreover, inspired by the weightsum approaches [9, 11, 14], we also weight each variant; however, we not only consider the likelihood of a variant being causal but also compute the pairwise likelihood of candidate variants being collapsed together. Note that although it is difficult to observe, relatively low interactions (e.g., linkage disequilibrium) are expected between rare variants [4, 11, 13, 20]. Furthermore, in regressionbased association methods, genetic similarities are often used to reduce the dimensions of the regression models. Therefore, we introduced two kinds of genetic regions, the elevated region and the background region, in our model analysis; the elevated region has a higher probability of harboring a causal variant. This assumption that the causal variants are often located close to each other is often used, e.g. slide windows in RareCover [19]. However, the regions are more flexible than a preset slide window, as in RareCover.
We adopt the "dominant" and "recessive set" genetic model, which are also used in [9–11, 14, 15, 19]. In the dominant and recessiveset model, the predisposing genotype harbors the mutation(s) in at least one variant on any of the two haplotypes. Therefore, for one genotype, there are two possible allelic values at each variant: one denotes that both haplotypes carry a widetype allele, while the other denotes that at least one haplotype carries a mutant. In our method, each variant has two hidden states, causal/noncausal status and elevated/background region status. The MRF includes the hidden states, emission probabilities and transition probabilities. The emission probabilities bridge the hidden states and the genotypes, while the transition probabilities link the two hidden states. Following the pseudolikelihood estimation method [21], we infer the model parameters and all of the hidden states. The simulation experiments show that our approach outperforms RareCover, RWAS [14] and LRT [9] on different parametric settings. In particular, RareProb obtains better results on largescale data.
Methods
Notions and model overview
Suppose we are given M rare variants (allelic sites) on a set of N genotypes. Let s_{ i } denote the allelic value of the site s on the genotype i (1 ≤ i ≤ N, 1 ≤ s ≤ M), where s_{ i } = 0 means both haplotypes of i have the wild type allele, while s_{ i } = 1 means at least one haplotype has a mutant allele. Each genotype carries a dichotomous phenotype. Let vector P denotes the phenotypes, where P_{ i } = 1 represents that i is affected by the phenotype trait (being a case), while P_{ i } = 0 represents that i is a control.
The core of our approach is a Markov random field (MRF) model. We first introduce four key components of modeling this MRF:

The observed data of this MRF consist of all of the genotypes and phenotypes.

There are two unknown states for each site: one is the causal or noncausal status and the other is the region location status. Here, we define them as the hidden states of this Markov random field. Let a latent vector R represent the region status, where R_{ s }= 1 denotes that the site s is located in an elevated region, while R_{ s }= 0 denotes the s is located in a background region. Additionally, let a latent vector X represent the causal/noncausal status, where X_{ s }= 1 if the site s is causal (contributes to the phenotype); otherwise, X_{ s }= 0. Probabilistic functions are designed to present the probabilities of each hidden state. The RareProb framework is able to incorporate prior information, obtained by different software tools, e.g. AlignGVGD[22] and SIFT[23], etc, by updating initial X vector and R vector.

A neighborhood system is required in the MRF model to describe the interactions among hidden states. Details of the hidden states and neighborhood system are shown in the section "Estimation of the hidden states in HMRF".

There are two kinds of probabilities in the MRF model: emission probabilities and transition probabilities. Emission probabilities bridge the relationships among genotypes, phenotypes and hidden states. Moreover, hidden states X and R are not independent of each other, as the relationships between the hidden states are described by the transition probabilities. The conditional probability P(X_{ s }= 1R_{ s }= 1) denotes the probability that the site s is a causal site when it is located in an elevated region, while P(X_{ s }= 0R_{ s }= 1) denotes the probability that the site s is noncausal when it is located in an elevated region. Similarly, another two conditional probabilities, P(X_{ s }= 1R_{ s }= 0) and P(X_{ s }= 0R_{ s }= 0), present the probabilities of being causal or noncausal if the site is located in a background region. Details of the emission probabilities are shown in the section "Estimation of the emission probabilities in HMRF", and the transition probabilities are shown in the section "Estimation of the transition probabilities in HMRF".
The central thesis of our approach is that causal rare variants, which should be collapsed together, are treated as one random vector variable with certain dimensions. Then, the probability of this bunch of causal rare variants becomes the probability of one variable being associated with the phenotype. Based on the MarkovGibbs equivalence [21], the probability of this random variable can be decomposed into the sum of clique potentials. The firstorder clique potentials describe the probability of one variant being causal, while the secondorder clique potentials measure the pairwise genetic similarities, which share the idea of the kernel machine in regression frameworks [10, 24, 25]. The neighborhood system in the MRF model consists of clique potentials. In our approach, we select that the neighborhood system only contains the firstorder and the secondorder clique potentials because there is scanty evidence supporting the biological or medical scenario of highorder potentials. For each variable, the MAFs and model parameters can be estimated by maximizing the likelihoods of the genotypes. Then, the probability of the variable and the variable itself can be updated by MAFs and model parameters. Two or three iterations can be applied if needed for the convergence of the MRF. Thus, our approach selects a subset of candidate causal variants by updating the variables and avoids the weakness of the same magnitude effect assumption because the neighborhood system is able to describe both the "causal" and "protective" variants.
Estimation of the hidden states in HMRF
Neighborhood system
The ω function has the following properties: (1) When both s and s' are causal variants, due to the PAR, ω_{ s,s' } locates in the interval (0, 1]. (2) If one variant is "causal" but the other is "protective", the likelihood takes on a negative value. (3) The likelihood encourages the collapse of the variants with similar PAR. Those rare variants whose MAFs increase rapidly in some cases, as we mentioned before, could be identified by singlesite tests or pairwise tests, which are often not considered in collapsing models [8]. Let ω.,. be the weight of two neighbors. The closer the statistics z_{ s } and z_{ s' } are, the larger the likelihood will be. And thus, the neighborhood system is built up.
Hidden states
and the joint probability of latent vector R can be represented by $p\left(R;{\Phi}_{R}\right)\propto \mathsf{\text{exp}}\left(\tau {\sum}_{s}^{M}R+\upsilon {\sum}_{s,{s}^{\prime}}{\omega}_{s,{s}^{\prime}}{R}_{{s}^{\prime}}\right)$, where Φ_{ R } = (τ, υ)·τ and υ are two MRF parameters. We also limit υ > 0, which encourages the pairwise weights and prevents them from counteracting the negative weights.
Estimation of the emission probabilities in HMRF
where ${c}_{s}={c}_{s}^{+}+{c}_{s}^{}$. We have now obtained all the three emission probabilities of this HMRF: p (YX), P (Y_{ s }X_{ s } = 0) and P (Y_{ s }X_{ s } = 1).
Estimation of the transition probabilities in HMRF
where α(·) and β(·) are also hyperparameters.
Thus far, we have obtained all of the three transition probabilities of this HMRF: p (XR), $\pi \left(\xi {c}_{X}^{+}\right)$ and $\pi \left(\zeta {c}_{X}^{}\right)$.
Estimation the model parameters
Based on the GibbsMarkov Equivalence [21], a pseudolikelihood estimation cycle can be applied to this hidden MRF to estimate the model parameters and update the hidden states. We use the pseudolikelihood estimation because p (X; Φ) and p (R; Φ_{ R }) are difficult to compute directly. The algorithm involves the following four steps:

Step 1: Estimate α_{ θ }and β_{ ρ }with $\widehat{\theta}$ and $\widehat{\rho}$ by maximizing the likelihood $L\left(Y/\widehat{X}\right)$. Update ${\widehat{\theta}}_{s}$ by maximizing the posterior distribution:$\pi \left({\theta}_{s}{c}_{s}^{+}\right)=\frac{{\theta}_{s}^{{\alpha}_{{\theta}_{s}}+{c}_{s}^{+}1}{\left(1{\theta}_{s}\right)}^{{\beta}_{{\theta}_{s}}+N{c}_{s}^{+}1}}{B\left({\alpha}_{{\theta}_{s}}+{c}_{s}^{+},{\beta}_{{\theta}_{s}}+N{c}_{s}^{+}\right)}$
Similarly, Update ${\widehat{\rho}}_{s}$.

Step 2: Estimate α_{ ξ }, β_{ ξ }and α_{ ζ }, β_{ ζ }with $\widehat{\xi}$ and $\widehat{\zeta}$ by maximizing the transition probability $L\left(X/\hat{R}\right)$. Update $\widehat{\xi}$ and $\widehat{\zeta}$ by maximizing the transition probabilities $\pi \left(\xi {c}_{X}^{+}\right)$ and $\pi \left(\xi {c}_{X}^{+}\right)$, respectively.

Step 3: Estimate Φ and Φ_{ R }with $\widehat{\Phi}$ and ${\widehat{\Phi}}_{R}$ by maximizing the pseudolikelihood functions:$L\left(\widehat{X};\Phi \right)=exp\left({\displaystyle \sum _{S}^{M}}{p}_{s}\left({\widehat{X}}_{s}{\widehat{X}}_{n\left(s\right)}\right);\Phi \right)$
and $L\left(\widehat{R};{\Phi}_{R}\right)$.

Step 4: Update $\widehat{X}$ and $\widehat{R}$ by$P\left({X}_{s}Y,{\widehat{X}}_{S/s}\right)\propto f\left({Y}_{s}{X}_{s};\widehat{\theta},\widehat{\rho}\right){p}_{s}\left({X}_{s}{\widehat{X}}_{n\left(s\right)};\widehat{\Phi}\right)$
and $P\left({R}_{s}X,{\widehat{R}}_{S/s}\right)$.
There are several ways to exit from this iteration. We measure the Euclidean distance between the current and the updated $\widehat{X}$. If the distance is less than a preset threshold, our approach will stop the iteration. After the convergence of HMRF, we obtain the estimations of X and R, as well as the MAFs for every variant. The collapsed rare variants can be tested based on the existing statistics, e.g. in [9, 10, 14].
Experiments and results
In this section, we apply our approach on a real dataset from [30] and also compare it with three other approaches using different types of simulated datasets. The three comparison approaches are RareCover, which is based on [19], RWAS from [14] and LRT from [9]. Additionally, it seems that RareCover is not released online, so as in many previous works, we reimplement this algorithm and the related statistics by ourselves.
Simulation frameworks
As the simulation settings in different papers [9, 14, 19] are quite different, we adopt all of them and generate three types of simulated datasets. In the first one, each dataset has a fixed number of causal variants, while in the second dataset, the number of causal variants is determined by allelic population attributable risk (PAR). The last simulation method first generates elevated regions and background regions and then plants causal variants in each region. We describe the three simulation methods in the following sections.
Fix number of causal variants
where σ is the selection coefficient, β_{ S } is the probability that the normal allelic site mutates to the causal variant, and β_{ N } is the probability that a causal variant repairs to a normal variant. We take σ = 12.0, β_{ S } = 0.001 and β_{ N } = 0.00033, which are the same settings used by [9, 11, 14]. Then, the relative risk of s is: $RR=\frac{\delta}{\left(1\delta \right){\rho}_{s}}+1$, where δ is the marginal PAR. The marginal PAR is equal to the group PAR (Δ) divided by the number of causal variants, while the relative risk of M variants is 1 [14]. Afterwards, the MAF of s for the cases is calculated according to ${\theta}_{s}=\frac{RR\times {\rho}_{s}}{\left(RR1\right){\rho}_{s}+1}$. In each dataset, we simulate N = 2000 genotypes with half cases and half controls. The mutations on the cases and the controls are sampled independently according to θ_{ s } and ρ_{ s }, respectively.
Causal variants depends on PAR
where Pr represents the penetrance of the group of causal variants and P_{ D } is the disease prevalence in the population. Different settings are applied in the experiments.
We use the algorithm proposed in [19] to obtain the MAF of each causal variant. The algorithm samples the MAF of a causal variant s, θ_{ s }, from the Wright's distribution with σ = 30.0, β_{ S } = 0.2 and β_{ N } = 0.002 [4, 19], and then appends s to C. Next, the algorithm checks whether ${\prod}_{s\in C}\left(1\frac{{\theta}_{\mathsf{\text{s}}}Pr}{{P}_{D}}\right)>1\mathrm{\Delta}$ is true. If the inequality does not hold, the algorithm terminates and outputs C. Thus, we obtain all of the causal variants and their MAFs. If the inequality holds, then the algorithm continuously samples the MAF of the next causal variant. The mutations on genotypes are sampled according to θ_{ s }.
For those noncausal variants, we use Fu's model [29] of allelic distributions on a coalescent, which is the same used in [19]. We adopt ${\rho}_{s}=\frac{5.0}{N}$. The mutations on genotypes are sampled according to ρ_{ s }. The phenotype of each individual (genotype) is computed by the penetrance of the subset, Pr. Thereafter, we sample 1000 of the cases and 1000 of the controls.
Causal variants depends on regions
There are many ways to generate a dataset with regions. The simplest way is to preset the elevated regions and the background regions and to plant causal variants based on certain probabilities. An alternate way creates the regions by a Markov chain. For each site, there are two groups of states. The E state denotes that the variant is located in an elevated region, while the B state denotes that the variant is located in a background region. Both states E and B can transfer to a causal state C or a noncausal state $\stackrel{\u0304}{C}$. If the Markov chain travels to the $\stackrel{\u0304}{C}$ state, it plants a mutant on the genotype with probability ρ. If the variant is considered to be causal, it may continuously transfer to the state A, which means that the genotype carries a mutant that may affect the phenotype with penetrance Pr. Otherwise it arrives in the state Ā, and the Markov chain plants a mutant or a wildtype allele on the genotype afterwards.
To generate enough genotypes, we perform the following steps for each variant: if the process drops into $\stackrel{\u0304}{C}$, we take 50,000 iterations to yield a mutant, where ρ is sampled from the Wright's distribution with σ = 30.0, β_{ S } = 0.2 and β_{ N } = 0.002. If it drops into A or Ā, we design an iteration to C until it reaches 50,000 iterations. The transition probability from C to A is equal to ρ × Pr. After we have enough genotypes, we sample 1000 cases and 1000 controls from them.
Comparisons on power
Similar to the measurements in [9, 14], the power of an approach is measured by the number of significant datasets, among many datasets, using a significance threshold of 2.5 × 10^{6} based on the Bonferroni correction assuming 20000 genes, genomewide. We test at most 1000 datasets for each comparison experiment.
Power versus different proportions of causal variants
The power comparisons at different proportions of causal variants
Total  Causal  RareProb  RareCover  RWAS  LRT 

100  50  100%  100%  100%  100% 
200  50  100%  100%  99.6%  99.9% 
400  50  100%  100%  85.3%  88.6% 
600  50  100%  94.6%  54.1%  58.8% 
800  50  100%  0.0%  33.0%  36.5% 
1000  50  100%  0.0%  20.7%  22.0% 
2000  50  100%  0.0%  2.0%  2.0% 
3000  50  100%  0.0%  0.8%  0.0% 
4000  50  100%  0.0%  0.4%  0.0% 
5000  50  100%  0.0%  0.3%  0.0% 
200  1*  51.0%  0.0%  0.0%  0.0% 
400  3*  77.0%  0.0%  0.0%  0.0% 
600  2*  63.6%  0.0%  0.0%  0.0% 
800  3*  57.1%  0.0%  0.0%  0.0% 
1000  3*  59.0%  0.0%  0.0%  0.0% 
2000  1*  34.0%  0.0%  0.0%  0.0% 
3000  2*  41.2%  0.0%  0.0%  0.0% 
4000  3*  40.0%  0.0%  0.0%  0.0% 
5000  2*  29.8%  0.0%  0.0%  0.0% 
The Type I error rate is another important measurement for estimating an approach. To compute the Type I error rate, we apply the same technique as [19]. Type I error rate is defined as the probability of a noncausal variant being selected in the potential causal set. We compare our approach only with RareCover because RWAS or LRT does not select any potential causal variants. The results on different configurations can be found in supplementary documents. Based on the results, our approach always holds reasonable Type I error rates. Although on some configurations RareProb has a little higher Type I error rates, e.g. 1%10% higher when gourp PAR is 5%, than RareCover, the absolute values are still satisfied. Moreover, when the group PAR decreases, RareProb always performs lower Type I error rates than RareCover. These results can be found in Table S2 in the Additional file 2. Considering both statistical power and Type I error rate, the advantage of RareProb cannot be neglected: it is able to identify most of the causal variants with an acceptable Type I error rate. In the other words, if an approach rarely identifies correct variants, a low Type I error rate becomes meaningless.
Power versus different configurations of regions
The power comparisons for different configurations of regions
Total  Causal  Regions  Length  RareProb  Correct R 

1000  36*  1  50  100%  96% 
37*  2  50  100%  98%  
36*  3  50  100%  97%  
35*  4  50  100%  98%  
2000  73*  1  100  100%  97% 
73*  2  100  100%  97%  
70*  3  100  100%  98%  
71*  4  100  100%  96%  
Total  Causal  Regions  Length  RareCover  Correct R 
1000  36*  1  50  0.0%  1.9% 
37*  2  50  0.0%  1.4%  
36*  3  50  0.0%  1.7%  
35*  4  50  0.0%  1.6%  
2000  73*  1  100  0.0%  0.7% 
73*  2  100  0.0%  0.8%  
70*  3  100  0.0%  1.3%  
71*  4  100  0.0%  0.8% 
RareProb on real mutation screening data
Finally, we apply our approach to a real mutation screening dataset. This dataset has been previously published by [30]. Authors screen for a susceptibility gene, ATM, which is thought to associate with ataxia telangiectasia. ATM is also an intermediaterisk susceptibility gene for breast cancer [9, 14]. The dataset (ATM_CCMSdata_Dec2011_v1) we have consists of 121 rare variants in a set of 2506 cases and 2235 controls, which is called "bona fide casecontrol studies" [9, 14].
We apply RareProb to this dataset without any prior information. RareProb identifies variant #c.4424A > G as a causal variant and reports a significant association with a pvalue of 8.8817 × 10^{16}. As a comparison, authors in [30] reports that they did identify a significant association with the help of the prior information, but that they did not find a significant association only according to the results of CMC. Sul and others [14] applied RWAS and reports a nonsignificant association with pvalue of 0.3946 without prior information and a nonsignificant association with pvalue of 0.0078 and 0.0881 when prior information of variants is obtained by AlignGVGD [22] and SIFT [23], respectively. Sul and others [9] also applied LRT and reports that a nonsignificant association with pvalue of 0.3934 was found without prior information, but a significant association with pvalue of 0.0058 and 0.08384 were found introducing AlignGVGD scores and SIFT scores, respectively. Our approach successfully identifies an association and clearly points out the candidate causal variant, without prior information, while either RWAS or LRT cannot achieve this.
Conclusion
In this article, we propose a probabilistic method, RareProb, to identify multiple rare variants that contribute to dichotomous disease susceptibility. Our approach is inspired by RareCover. Both approaches select a subset of potentially causal variants from the given variants, which means our approach does not rely on the preselection of candidate rare variants. Furthermore, as opposed to simply merging the variants in RareCover, our approach gains power by considering the directions and the magnitudes of the genetic effects. Both the causal and the protective variants can be described by pairwise measurements, respectively. This method gets rid of the weakness of losing statistical power when "causal", "neutral" and "protective" variants are combined. Note that the pairwise weight is not the linkage disequilibrium (LD). LD is quite difficult to observe, although it is expected among rare variants. The pairwise measurements indicate the likelihood of two variants being collapsed, which is similar to the kernel functions in regressionbased frameworks. This weight is then used to build up the neighborhood system of the hidden Markov random field model.
The Markov random field model treats all of the variants as one vector and estimates their causal/noncausal status by globally maximizing the likelihood of genotypes instead of by local optimization. Our approach gains more power than existing groupwise collapsing approaches; RareProb filters out those variants with noncausal status. At the same time, unlike the previous selectionbased approaches, RareProb controls the false positive rate by partitioning elevated regions and background regions, instead of by presetting any sliding windows. Regions are much more flexible than preset sliding windows. While existing approaches can only handle hundreds of variants, there is no doubt that the total number of variants will increase rapidly with the development of new technologies, e.g. applications of next generation sequencing. The simulation experiments show that our approach obtains significantly more power, especially when the total number of given rare variants is large. We also apply our approach to a real mutation screening dataset and a significant association is found. Our approach is able to handle thousands of variants. Moreover, our approach is easy to extend to an "additive" genetic model and multiple phenotypes by updating the Dirichlet prior distribution.
Declarations
The publication costs for this article were funded by Xi'an Jiaotong University.
This article has been published as part of BMC Genomics Volume 14 Supplement 1, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S1.
Declarations
Acknowledgements
This work was supported by National Science Foundation [IIS0803440], [CCF1116175] and [IIS0953563] and the Ph.D. Programs Foundation of Ministry of Education of China [20100201110063]. Authors thank Professor Sean Tavtigian and Professor Georgia ChenevixTrench for sharing the ATM datasets with us. Authors thank Professor ChunXia Yan M.D. and M.D. Feng Zhu for discussing the elevated region and the background region from a medical and clinical view.
Authors’ Affiliations
References
 Hirschhorn Joel, Daly Mark: Genomewide association studies for common diseases and complex traits. Nature Reviews Genetics. 2005, 6: 95108.View ArticlePubMedGoogle Scholar
 Ropers HansHilger: New perspectives for the elucidation of genetic disorders. Am J Hum Genet. 2007, 81: 199207. 10.1086/520679.PubMed CentralView ArticlePubMedGoogle Scholar
 Manolio Teri, Collins Francis, Cox Nancy et al: Finding the missing heritability of complex diseases. Nature. 2009, 461: 747753. 10.1038/nature08494.PubMed CentralView ArticlePubMedGoogle Scholar
 Pritchard Jonathan: Are rare variants responsible for susceptibility to complex diseases?. Am J Hum Genet. 2001, 69: 124137. 10.1086/321272.PubMed CentralView ArticlePubMedGoogle Scholar
 Reich David, Lander Eric: On the allelic spectrum of human disease. Trends in Genetics. 2001, 17: 502510. 10.1016/S01689525(01)024106.View ArticlePubMedGoogle Scholar
 Chapman Nicola, Wijsman Ellen: Genome screens using linkage disequilibrium tests: optimal marker characteristics and feasibility. Am J Hum Genet. 1998, 63: 18721885. 10.1086/302139.PubMed CentralView ArticlePubMedGoogle Scholar
 Xiong Momiao, Zhao Jinying, Boerwinkle Eric: Generalized T^{2} Test for genome association studies. Am J Hum Genet. 2002, 70: 12571268. 10.1086/340392.PubMed CentralView ArticlePubMedGoogle Scholar
 Bodmer Walter, Bonilla Carolina: Common and rare variants in multifactorial susceptibility to common diseases. Nature Genetics. 2008, 40: 695701. 10.1038/ng.f.136.PubMed CentralView ArticlePubMedGoogle Scholar
 Sul Jae, Han Buhm, Eskin Eleazar: Increasing power of groupwise association test with likelihood ratio test. Proceedings of RECOMB. 2011, 2011: 2831.Google Scholar
 Michael Wu, Seunggeun Lee, Tianxi Cai et al: Rarevariant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011, 89: 8293. 10.1016/j.ajhg.2011.05.029.View ArticleGoogle Scholar
 Madsen Bo, Browning Sharon: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics. 2009, 5: e100038410.1371/journal.pgen.1000384.PubMed CentralView ArticlePubMedGoogle Scholar
 Morgenthaler Stephan, Thilly William: A strategy to discover genes that carry multiallelic or monoallelic risk for common diseases: a cohort allelic sums test (CAST). Mutation Research. 2007, 615: 2856. 10.1016/j.mrfmmm.2006.09.003.View ArticlePubMedGoogle Scholar
 Li Bingshan, Leal Suzanne: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311321. 10.1016/j.ajhg.2008.06.024.PubMed CentralView ArticlePubMedGoogle Scholar
 Sul Jae, Han Buhm, He Dan, Eskin Eleazar: An optimal weighted aggregated association test for identification of rare variants involved in common diseases. Genetics. 2011, 188: 181188. 10.1534/genetics.110.125070.PubMed CentralView ArticlePubMedGoogle Scholar
 Neale Benjamin, Rivas Manuel, Voight Benjamin et al: Testing for an unusual distribution of rare variants. PLoS Genetics. 2011, 7: e100132210.1371/journal.pgen.1001322.PubMed CentralView ArticlePubMedGoogle Scholar
 Cohen Jonathan, Pertsemlidis Alexander, Kotowski Ingrid et al: Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nature Genetics. 2005, 37: 161165. 10.1038/ng1509.View ArticlePubMedGoogle Scholar
 Cohen Jonathan, Boerwinkle Eric, Mosley Thomas, Hobbs Helen: Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. The New England Journal of Medicine. 2006, 354: 12641272. 10.1056/NEJMoa054013.View ArticlePubMedGoogle Scholar
 Kathiresan Sekar, Melander Olle, Anevski Dragi et al: Polymorphisms associated with cholesterol and risk of cardiovascular events. The New England Journal of Medicine. 2008, 358: 12401249. 10.1056/NEJMoa0706728.View ArticlePubMedGoogle Scholar
 Bhatia Gaurav, Bansal Vikas, Harismendy Olivier et al: A covering method for detecting genetic associations between rare variants and common phenotypes. PLoS Computational Biology. 2010, 6: e100095410.1371/journal.pcbi.1000954.PubMed CentralView ArticlePubMedGoogle Scholar
 Pritchard Jonathan, Cox NJ: The allelic architecture of human disease genes: common diseasecommon variantor not?. Human Molecular Genetics. 2002, 11: 24172423. 10.1093/hmg/11.20.2417.View ArticlePubMedGoogle Scholar
 Besag Julian: On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society Series B. 1986, 48: 259302.Google Scholar
 Tavtigian Sean, Deffenbaugh AM, Yin L et al: Comprehensive statistical study of 452 BRCA1 missense substitutions with classification of eight recurrent substitutions as neutral. Journal of Medical Genetics. 2006, 43: 295305.PubMed CentralView ArticlePubMedGoogle Scholar
 Pauline Ng, Steven Henikoff: SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research. 2003, 31: 38123814. 10.1093/nar/gkg509.View ArticleGoogle Scholar
 Kwee Coulter Lydia, Dawei Liu, Xihong Lin et al: A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet. 2008, 82: 386397. 10.1016/j.ajhg.2007.10.010.PubMed CentralView ArticlePubMedGoogle Scholar
 Michael Wu, Kraft P, Michael Epstein et al: Powerful SNPset analysis for casecontrol genomewide association studies. Am J Hum Genet. 2010, 86: 929942. 10.1016/j.ajhg.2010.05.002.View ArticleGoogle Scholar
 Quintana Melanie, Berstein Jonine, Thomas Duncan, Conti David: Incorporating model uncertainty in detecting rare variants: the Bayesian risk index. Genetic Epidemiology. 2011, 35: 638649. 10.1002/gepi.20613.PubMed CentralView ArticlePubMedGoogle Scholar
 Conti David, James Gauderman W: SNPs, haplotypes, and model selection in a candidate gene region: the SIMPle analysis for multilocus data. Genetic Epidemiology. 2004, 27: 429441. 10.1002/gepi.20039.View ArticlePubMedGoogle Scholar
 Melanie Wilson, Edwin Iversen, Merlise Clyde et al: Bayesian model search and multilevel inference for SNP association studies. Annals of Applied Statistics. 2010, 4: 13421364. 10.1214/09AOAS322.View ArticleGoogle Scholar
 Fu YunXin: Statistical properties of segregating sites. Theoretical Population Biology. 1995, 48: 172197. 10.1006/tpbi.1995.1025.View ArticlePubMedGoogle Scholar
 Sean Tavtigian, Peter Oefner, Davit Babikyan et al: Rare, evolutionarily unlikely missense substitutions in ATM confer increased risk of breast cancer. Am J Hum Genet. 2009, 85: 427446. 10.1016/j.ajhg.2009.08.018.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.