 Proceedings
 Open Access
 Published:
A probabilistic method for identifying rare variants underlying complex traits
BMC Genomicsvolume 14, Article number: S11 (2013)
Abstract
Background
Identifying the genetic variants that contribute to disease susceptibilities is important both for developing methodologies and for studying complex diseases in molecular biology. It has been demonstrated that the spectrum of minor allelic frequencies (MAFs) of risk genetic variants ranges from common to rare. Although association studies are shifting to incorporate rare variants (RVs) affecting complex traits, existing approaches do not show a high degree of success, and more efforts should be considered.
Results
In this article, we focus on detecting associations between multiple rare variants and traits. Similar to RareCover, a widely used approach, we assume that variants located close to each other tend to have similar impacts on traits. Therefore, we introduce elevated regions and background regions, where the elevated regions are considered to have a higher chance of harboring causal variants. We propose a hidden Markov random field (HMRF) model to select a set of rare variants that potentially underlie the phenotype, and then, a statistical test is applied. Thus, the association analysis can be achieved without preselection by experts. In our model, each variant has two hidden states that represent the causal/noncausal status and the region status. In addition, two Bayesian processes are used to compare and estimate the genotype, phenotype and model parameters. We compare our approach to the three current methods using different types of datasets, and though these are simulation experiments, our approach has higher statistical power than the other methods. The software package, RareProb and the simulation datasets are available at: http://www.engr.uconn.edu/~jiw09003.
Introduction
In most existing genetic variant association studies, "common trait, common variants", which asserts that common genetic variants contribute to most of traits (disease susceptibilities), serves as the central assumption. Researchers have successfully identified some significant associations between common single nucleotide polymorphisms (SNPs) and disease traits [1]. However, despite the enormous efforts expended on association studies of complex traits, common genetic variants only show a moderate influence on different phenotypes in many reported disease associations and consequently have limited diagnostic value [2, 3]. While the identification of common variants creates a dilemma, known as "common trait, rare variants", an alternative hypothesis, which asserts that multiple rare variants with moderate to high penetrances may collectively influence disease susceptibilities, has been suggested in some literatures [3–5]. Rare variants are defined as those whose minor allele frequencies (MAF) are less than or equal to 0.01 (≤ 10^{2}). Although some rare variants associated with Mendelian diseases have been identified, more often, the allelic population attributable risk (PAR), which describes a small reduction in the incidence that would be observed in unexposed samples compared to the actual exposure pattern, is low. The odds ratio (OR), a measure of the strength of association or nonindependence between two binary data values, is also low. Moreover, based on the "common trait, rare variants" hypothesis, in many cases, a set of rare variants, instead of just one variant, should be identified to fully explain the genetic influence. Both the singlevariant test [6] and the multiplevariant test [7] have been applied to rare variant association studies. However, due to the reasons outlined above, neither of them shows satisfactory power in obtaining associations. Although more and more attention is being focused upon rare variants, there has only been limited success thus far [8–10].
Alternatively, the collapsing strategy, also called the "burdenbased test", is another approach for rare variants association studies. Most of the collapsebased approaches build on the "recessiveset" genetic model, in which the predisposing haplotype contains mutation(s) in at least one variant [11]. Multiple rare variants in the same locus are collapsed, based on different standards, then statistical tests are applied. The locus here is defined as a selected region that consists of a group of candidate rare variants [9, 12–14]. However, it is argued that existing collapsebased approaches assume all rare variants implicitly influencing the phenotype in the same direction and with the same magnitude [10, 15]. Researchers have observed that any given rare variant could have no effect, could be causal, or could be protective for the endpoints (traits) [15]. For example, some lowfrequency variants in African Americans PCSK9 can have a substantial effect on serum LowDensity Lipoprotein Cholesterol (LDLC) by increasing the risk of or protecting against myocardial infarction [16–18].
Collapsebased approaches have low statistical powers when "causal", "neutral" and "protective" variants are combined [13, 15, 19]. To overcome this weakness, some approaches [9, 14] assume that the rare variants are well selected by experts, while weighting of each variant is another widely used strategy [9, 11, 14]. In a recent study, Bhatia and others [19] suggest the development of a "modelfree" approach, RareCover, that only collapses a subset of potentially causal variants from all of the given variants. Here, the "model" refers to the genetic association model that consists of the preselection candidate variants.
Motivated by RareCover, in this article, we focus on rare variant association analysis without any preselection of candidate variants. We propose a probabilistic approach, RareProb, to make selections using a Markov random field (MRF) model and identify multiple causal rare variants that influence a dichotomous phenotype using statistical tests. Our approach considers both the causal and the protective variants, which distinguishes it from the previous study RareCover, and it is therefore a robust predictor of the direction and the magnitude of the genetic effects. Moreover, inspired by the weightsum approaches [9, 11, 14], we also weight each variant; however, we not only consider the likelihood of a variant being causal but also compute the pairwise likelihood of candidate variants being collapsed together. Note that although it is difficult to observe, relatively low interactions (e.g., linkage disequilibrium) are expected between rare variants [4, 11, 13, 20]. Furthermore, in regressionbased association methods, genetic similarities are often used to reduce the dimensions of the regression models. Therefore, we introduced two kinds of genetic regions, the elevated region and the background region, in our model analysis; the elevated region has a higher probability of harboring a causal variant. This assumption that the causal variants are often located close to each other is often used, e.g. slide windows in RareCover [19]. However, the regions are more flexible than a preset slide window, as in RareCover.
We adopt the "dominant" and "recessive set" genetic model, which are also used in [9–11, 14, 15, 19]. In the dominant and recessiveset model, the predisposing genotype harbors the mutation(s) in at least one variant on any of the two haplotypes. Therefore, for one genotype, there are two possible allelic values at each variant: one denotes that both haplotypes carry a widetype allele, while the other denotes that at least one haplotype carries a mutant. In our method, each variant has two hidden states, causal/noncausal status and elevated/background region status. The MRF includes the hidden states, emission probabilities and transition probabilities. The emission probabilities bridge the hidden states and the genotypes, while the transition probabilities link the two hidden states. Following the pseudolikelihood estimation method [21], we infer the model parameters and all of the hidden states. The simulation experiments show that our approach outperforms RareCover, RWAS [14] and LRT [9] on different parametric settings. In particular, RareProb obtains better results on largescale data.
Methods
Notions and model overview
Suppose we are given M rare variants (allelic sites) on a set of N genotypes. Let s_{ i } denote the allelic value of the site s on the genotype i (1 ≤ i ≤ N, 1 ≤ s ≤ M), where s_{ i } = 0 means both haplotypes of i have the wild type allele, while s_{ i } = 1 means at least one haplotype has a mutant allele. Each genotype carries a dichotomous phenotype. Let vector P denotes the phenotypes, where P_{ i } = 1 represents that i is affected by the phenotype trait (being a case), while P_{ i } = 0 represents that i is a control.
The core of our approach is a Markov random field (MRF) model. We first introduce four key components of modeling this MRF:

The observed data of this MRF consist of all of the genotypes and phenotypes.

There are two unknown states for each site: one is the causal or noncausal status and the other is the region location status. Here, we define them as the hidden states of this Markov random field. Let a latent vector R represent the region status, where R_{ s }= 1 denotes that the site s is located in an elevated region, while R_{ s }= 0 denotes the s is located in a background region. Additionally, let a latent vector X represent the causal/noncausal status, where X_{ s }= 1 if the site s is causal (contributes to the phenotype); otherwise, X_{ s }= 0. Probabilistic functions are designed to present the probabilities of each hidden state. The RareProb framework is able to incorporate prior information, obtained by different software tools, e.g. AlignGVGD[22] and SIFT[23], etc, by updating initial X vector and R vector.

A neighborhood system is required in the MRF model to describe the interactions among hidden states. Details of the hidden states and neighborhood system are shown in the section "Estimation of the hidden states in HMRF".

There are two kinds of probabilities in the MRF model: emission probabilities and transition probabilities. Emission probabilities bridge the relationships among genotypes, phenotypes and hidden states. Moreover, hidden states X and R are not independent of each other, as the relationships between the hidden states are described by the transition probabilities. The conditional probability P(X_{ s }= 1R_{ s }= 1) denotes the probability that the site s is a causal site when it is located in an elevated region, while P(X_{ s }= 0R_{ s }= 1) denotes the probability that the site s is noncausal when it is located in an elevated region. Similarly, another two conditional probabilities, P(X_{ s }= 1R_{ s }= 0) and P(X_{ s }= 0R_{ s }= 0), present the probabilities of being causal or noncausal if the site is located in a background region. Details of the emission probabilities are shown in the section "Estimation of the emission probabilities in HMRF", and the transition probabilities are shown in the section "Estimation of the transition probabilities in HMRF".
The central thesis of our approach is that causal rare variants, which should be collapsed together, are treated as one random vector variable with certain dimensions. Then, the probability of this bunch of causal rare variants becomes the probability of one variable being associated with the phenotype. Based on the MarkovGibbs equivalence [21], the probability of this random variable can be decomposed into the sum of clique potentials. The firstorder clique potentials describe the probability of one variant being causal, while the secondorder clique potentials measure the pairwise genetic similarities, which share the idea of the kernel machine in regression frameworks [10, 24, 25]. The neighborhood system in the MRF model consists of clique potentials. In our approach, we select that the neighborhood system only contains the firstorder and the secondorder clique potentials because there is scanty evidence supporting the biological or medical scenario of highorder potentials. For each variable, the MAFs and model parameters can be estimated by maximizing the likelihoods of the genotypes. Then, the probability of the variable and the variable itself can be updated by MAFs and model parameters. Two or three iterations can be applied if needed for the convergence of the MRF. Thus, our approach selects a subset of candidate causal variants by updating the variables and avoids the weakness of the same magnitude effect assumption because the neighborhood system is able to describe both the "causal" and "protective" variants.
Estimation of the hidden states in HMRF
Neighborhood system
Assume there are N/ 2 cases and N/ 2 controls among all of the genotypes (if the number of cases is not equal to the number of controls, then all of the results still can be used by applying noncentrality parameters). At a certain variant s, let θ_{ s } denote the MAF for the cases, and let the number of genotypes in cases that carry at least one mutant allele be ${c}_{s}^{+}$. Let ρ_{ s } denote the MAF for the controls, and let the number of genotypes in controls that carry at least one mutant allele be ${c}_{s}^{}$. Then, we can draw two binomial distributions for the cases and the controls [9, 14]: ${c}_{s}^{+}~\mathsf{\text{Bin}}\left(\frac{N}{2}{\theta}_{s}\right)$ and ${c}_{s}^{}~\mathsf{\text{Bin}}\left(\frac{N}{2},{\rho}_{s}\right)$, where $f\left({c}_{s}^{+}{\theta}_{s}\right)={C}_{\frac{N}{2}}^{{c}_{s}^{+}}{\theta}_{s}^{{c}_{s}^{+}}{\left(1{\theta}_{s}\right)}^{\frac{N}{2}{c}_{s}^{+}}$ and $f\left({c}_{s}^{}{\rho}_{s}\right)={C}_{\frac{N}{2}}^{{c}_{s}^{}}{\rho}_{s}^{{c}_{\mathsf{\text{s}}}^{}}{\left(1{\rho}_{S}\right)}^{\frac{N}{2}{c}_{s}^{}}$. Thus, for a site s, the statistic of the difference between θ and ρ is
where ${\widehat{\theta}}_{s}=\frac{{c}_{\mathsf{\text{s}}}^{+}}{N/2}$ is the estimation of θ_{ s } and ${\widehat{\rho}}_{s}=\frac{{c}_{\mathsf{\text{s}}}^{}}{N/2}$ is the estimation of ρ_{ s }. Similar to the linear kernel function, which calculates genetic similarities [10], we measure the likelihood between pairwise rare variants, which denotes how likely two variants would be collapsed together. For two variants s and s', we define ω_{ s,s' } as the likelihood of collapsing as follows:
The ω function has the following properties: (1) When both s and s' are causal variants, due to the PAR, ω_{ s,s' } locates in the interval (0, 1]. (2) If one variant is "causal" but the other is "protective", the likelihood takes on a negative value. (3) The likelihood encourages the collapse of the variants with similar PAR. Those rare variants whose MAFs increase rapidly in some cases, as we mentioned before, could be identified by singlesite tests or pairwise tests, which are often not considered in collapsing models [8]. Let ω.,. be the weight of two neighbors. The closer the statistics z_{ s } and z_{ s' } are, the larger the likelihood will be. And thus, the neighborhood system is built up.
Hidden states
Rare variant s is either located in an elevated region or in a background region. Thus, we define the probability (Bayesian classifier) of s as
where n (s) denotes the neighbors of s. γ and η are two MRF parameters. γ represents how strongly the status of X_{ s } affects the probability of X_{ s }, while η represents how strongly the neighbors of s affect the probability of X_{ s }. Here, we limit η > 0, which encourages the pairwise weights and prevents them from counteracting the negative weights. Thus, the joint probability of the latent vector X is $p\left(X;\Phi \right)\propto \mathsf{\text{exp}}\left(\gamma {\sum}_{s}^{M}p\left({X}_{s}{R}_{s}\right)+\eta {\sum}_{{s}^{\prime}\in n\left(s\right)}{\omega}_{s,{s}^{\prime}}p\left({X}_{{s}^{\prime}}{R}_{{s}^{\prime}}\right)\right)$, where Φ = (γ, η). As the variants in different subsets (different collapsing groups) are conditional independent, this joint probability covers all of the probabilities of the random variables (collapsing groups). Similarly, the probability of s located in an elevated region can be represented by
and the joint probability of latent vector R can be represented by $p\left(R;{\Phi}_{R}\right)\propto \mathsf{\text{exp}}\left(\tau {\sum}_{s}^{M}R+\upsilon {\sum}_{s,{s}^{\prime}}{\omega}_{s,{s}^{\prime}}{R}_{{s}^{\prime}}\right)$, where Φ_{ R } = (τ, υ)·τ and υ are two MRF parameters. We also limit υ > 0, which encourages the pairwise weights and prevents them from counteracting the negative weights.
Estimation of the emission probabilities in HMRF
We now estimate the emission probabilities to relate X and R with the observed data. As linkage disequilibrium is rarely observed between rare variants [8], the vector consists of the allelic values from one variant that is conditionally independent from the others, when a particular X is given. Thus, the joint conditional probability of all of the genotypes is
If X_{ s } = 1, due to the PAR, θ_{ s } ≠ ρ_{ s }. We place a prior distribution on θ_{ s } and a prior on ρ_{ s } [26]: $\pi \left({\theta}_{s}\right)=\frac{{\theta}_{s}^{{\alpha}_{{\theta}_{s}}1}{\left(1{\theta}_{\mathsf{\text{s}}}\right)}^{{\beta}_{{\theta}_{s}}1}}{B\left({\alpha}_{{\theta}_{s}},{\beta}_{{\theta}_{s}}\right)}$ and $\pi \left({\rho}_{s}\right)=\frac{{\rho}_{s}^{{\alpha}_{{\rho}_{s}}1}{\left(1{\rho}_{s}\right)}^{{\beta}_{{\rho}_{s}}1}}{B\left({\alpha}_{{\rho}_{s}},{\beta}_{{\rho}_{s}}\right)}$, where α(·) and β(·) are hyperparameters in the prior distributions [26–28]. Then, the marginal distribution of ${c}_{s}^{+}$ is
The marginal distribution of ${c}_{s}^{}$ is similar. The probability of the observed genotypes on s is equal to the sum of $\frac{m\left({c}_{s}^{\cdot}\right)}{{C}_{\frac{N}{2}}^{{c}_{s}^{\cdot}}}$ Thus we have:
On the other hand, if X_{ s } = 0, then there is no PAR between θ_{ s } and ρ_{ s } that infers θ_{ s } = ρ_{ s }. Here, we simply use ρ_{ s } to denote the MAF of s for both the cases and controls. Thus, we have
where ${c}_{s}={c}_{s}^{+}+{c}_{s}^{}$. We have now obtained all the three emission probabilities of this HMRF: p (YX), P (Y_{ s }X_{ s } = 0) and P (Y_{ s }X_{ s } = 1).
Estimation of the transition probabilities in HMRF
The transition probabilities link the hidden states X and R. Let ${c}_{X}^{+}$ be the counts of the causal variants on all of the elevated regions, and let c_{ E } be the number of variants in those regions. Let ${c}_{X}^{}$ be the counts of the causal variants on all of the background regions, and c_{ B } be the number of variants in those regions. Then, we draw two binomial distributions: ${c}_{X}^{+}~\mathsf{\text{Bin}}\left({c}_{E},\phantom{\rule{2.77695pt}{0ex}}\xi \right);{c}_{X}^{}~\mathsf{\text{Bin}}\left({c}_{B},\phantom{\rule{2.77695pt}{0ex}}\zeta \right)$ where ξ = P (X = 1R = 1) and ζ = P (X = 1R = 0). We also place the prior distributions on ξ and ζ, as follows:
and
where ξ = P (X = 1R = 1) and ζ = P (X = 1R = 0). We also place the prior distributions on ξ and ζ, as follows:
where α(·) and β(·) are also hyperparameters.
Thus, we have the conditional probability of X given R:
and the posterior distribution of ξ given ${c}_{X}^{+}$ is
Similarly, the posterior distribution of ζ given ${c}_{X}^{}$ is
Thus far, we have obtained all of the three transition probabilities of this HMRF: p (XR), $\pi \left(\xi {c}_{X}^{+}\right)$ and $\pi \left(\zeta {c}_{X}^{}\right)$.
Estimation the model parameters
Based on the GibbsMarkov Equivalence [21], a pseudolikelihood estimation cycle can be applied to this hidden MRF to estimate the model parameters and update the hidden states. We use the pseudolikelihood estimation because p (X; Φ) and p (R; Φ_{ R }) are difficult to compute directly. The algorithm involves the following four steps:

Step 1: Estimate α_{ θ }and β_{ ρ }with $\widehat{\theta}$ and $\widehat{\rho}$ by maximizing the likelihood $L\left(Y/\widehat{X}\right)$. Update ${\widehat{\theta}}_{s}$ by maximizing the posterior distribution:
$$\pi \left({\theta}_{s}{c}_{s}^{+}\right)=\frac{{\theta}_{s}^{{\alpha}_{{\theta}_{s}}+{c}_{s}^{+}1}{\left(1{\theta}_{s}\right)}^{{\beta}_{{\theta}_{s}}+N{c}_{s}^{+}1}}{B\left({\alpha}_{{\theta}_{s}}+{c}_{s}^{+},{\beta}_{{\theta}_{s}}+N{c}_{s}^{+}\right)}$$
Similarly, Update ${\widehat{\rho}}_{s}$.

Step 2: Estimate α_{ ξ }, β_{ ξ }and α_{ ζ }, β_{ ζ }with $\widehat{\xi}$ and $\widehat{\zeta}$ by maximizing the transition probability $L\left(X/\hat{R}\right)$. Update $\widehat{\xi}$ and $\widehat{\zeta}$ by maximizing the transition probabilities $\pi \left(\xi {c}_{X}^{+}\right)$ and $\pi \left(\xi {c}_{X}^{+}\right)$, respectively.

Step 3: Estimate Φ and Φ_{ R }with $\widehat{\Phi}$ and ${\widehat{\Phi}}_{R}$ by maximizing the pseudolikelihood functions:
$$L\left(\widehat{X};\Phi \right)=exp\left({\displaystyle \sum _{S}^{M}}{p}_{s}\left({\widehat{X}}_{s}{\widehat{X}}_{n\left(s\right)}\right);\Phi \right)$$
and $L\left(\widehat{R};{\Phi}_{R}\right)$.

Step 4: Update $\widehat{X}$ and $\widehat{R}$ by
$$P\left({X}_{s}Y,{\widehat{X}}_{S/s}\right)\propto f\left({Y}_{s}{X}_{s};\widehat{\theta},\widehat{\rho}\right){p}_{s}\left({X}_{s}{\widehat{X}}_{n\left(s\right)};\widehat{\Phi}\right)$$
and $P\left({R}_{s}X,{\widehat{R}}_{S/s}\right)$.
There are several ways to exit from this iteration. We measure the Euclidean distance between the current and the updated $\widehat{X}$. If the distance is less than a preset threshold, our approach will stop the iteration. After the convergence of HMRF, we obtain the estimations of X and R, as well as the MAFs for every variant. The collapsed rare variants can be tested based on the existing statistics, e.g. in [9, 10, 14].
Experiments and results
In this section, we apply our approach on a real dataset from [30] and also compare it with three other approaches using different types of simulated datasets. The three comparison approaches are RareCover, which is based on [19], RWAS from [14] and LRT from [9]. Additionally, it seems that RareCover is not released online, so as in many previous works, we reimplement this algorithm and the related statistics by ourselves.
Simulation frameworks
As the simulation settings in different papers [9, 14, 19] are quite different, we adopt all of them and generate three types of simulated datasets. In the first one, each dataset has a fixed number of causal variants, while in the second dataset, the number of causal variants is determined by allelic population attributable risk (PAR). The last simulation method first generates elevated regions and background regions and then plants causal variants in each region. We describe the three simulation methods in the following sections.
Fix number of causal variants
First, we generate the datasets with fixed numbers of causal variants, following previous approaches [14] and [9]. Each variant is generated independently because they believe that rare variants do not show significant linkage disequilibrium [9, 14]. For each variant, the probability distribution of the MAF of site s on controls, ρ_{ s }, satisfies the Wright's distribution under purifying selection [4],
where σ is the selection coefficient, β_{ S } is the probability that the normal allelic site mutates to the causal variant, and β_{ N } is the probability that a causal variant repairs to a normal variant. We take σ = 12.0, β_{ S } = 0.001 and β_{ N } = 0.00033, which are the same settings used by [9, 11, 14]. Then, the relative risk of s is: $RR=\frac{\delta}{\left(1\delta \right){\rho}_{s}}+1$, where δ is the marginal PAR. The marginal PAR is equal to the group PAR (Δ) divided by the number of causal variants, while the relative risk of M variants is 1 [14]. Afterwards, the MAF of s for the cases is calculated according to ${\theta}_{s}=\frac{RR\times {\rho}_{s}}{\left(RR1\right){\rho}_{s}+1}$. In each dataset, we simulate N = 2000 genotypes with half cases and half controls. The mutations on the cases and the controls are sampled independently according to θ_{ s } and ρ_{ s }, respectively.
Causal variants depends on PAR
The second way generates a set, C, that contains all of the causal variants. Instead of a fixed number, the total number of causal variants depends on PAR [19], which is limited by Δ (the group PAR):
where Pr represents the penetrance of the group of causal variants and P_{ D } is the disease prevalence in the population. Different settings are applied in the experiments.
We use the algorithm proposed in [19] to obtain the MAF of each causal variant. The algorithm samples the MAF of a causal variant s, θ_{ s }, from the Wright's distribution with σ = 30.0, β_{ S } = 0.2 and β_{ N } = 0.002 [4, 19], and then appends s to C. Next, the algorithm checks whether ${\prod}_{s\in C}\left(1\frac{{\theta}_{\mathsf{\text{s}}}Pr}{{P}_{D}}\right)>1\mathrm{\Delta}$ is true. If the inequality does not hold, the algorithm terminates and outputs C. Thus, we obtain all of the causal variants and their MAFs. If the inequality holds, then the algorithm continuously samples the MAF of the next causal variant. The mutations on genotypes are sampled according to θ_{ s }.
For those noncausal variants, we use Fu's model [29] of allelic distributions on a coalescent, which is the same used in [19]. We adopt ${\rho}_{s}=\frac{5.0}{N}$. The mutations on genotypes are sampled according to ρ_{ s }. The phenotype of each individual (genotype) is computed by the penetrance of the subset, Pr. Thereafter, we sample 1000 of the cases and 1000 of the controls.
Causal variants depends on regions
There are many ways to generate a dataset with regions. The simplest way is to preset the elevated regions and the background regions and to plant causal variants based on certain probabilities. An alternate way creates the regions by a Markov chain. For each site, there are two groups of states. The E state denotes that the variant is located in an elevated region, while the B state denotes that the variant is located in a background region. Both states E and B can transfer to a causal state C or a noncausal state $\stackrel{\u0304}{C}$. If the Markov chain travels to the $\stackrel{\u0304}{C}$ state, it plants a mutant on the genotype with probability ρ. If the variant is considered to be causal, it may continuously transfer to the state A, which means that the genotype carries a mutant that may affect the phenotype with penetrance Pr. Otherwise it arrives in the state Ā, and the Markov chain plants a mutant or a wildtype allele on the genotype afterwards.
To generate enough genotypes, we perform the following steps for each variant: if the process drops into $\stackrel{\u0304}{C}$, we take 50,000 iterations to yield a mutant, where ρ is sampled from the Wright's distribution with σ = 30.0, β_{ S } = 0.2 and β_{ N } = 0.002. If it drops into A or Ā, we design an iteration to C until it reaches 50,000 iterations. The transition probability from C to A is equal to ρ × Pr. After we have enough genotypes, we sample 1000 cases and 1000 controls from them.
Comparisons on power
Similar to the measurements in [9, 14], the power of an approach is measured by the number of significant datasets, among many datasets, using a significance threshold of 2.5 × 10^{6} based on the Bonferroni correction assuming 20000 genes, genomewide. We test at most 1000 datasets for each comparison experiment.
Power versus different proportions of causal variants
We compare the powers under different sizes of total variants. In the first group of experiments, we include 50 causal variants and vary the total number of variants from 100 to 5000. Thus, the proportions of causal variants decrease from 50% to 1%. In the second group of experiments, we hold the group PAR as 5% and vary the total number of variants as before. The results are compared in Table 1. From the results, our approach clearly shows more powerful and more robust at dealing with largescale data. We also test our approach on different settings of the group PARs. Those results can be found in Table S1 in the Additional file 1.
The Type I error rate is another important measurement for estimating an approach. To compute the Type I error rate, we apply the same technique as [19]. Type I error rate is defined as the probability of a noncausal variant being selected in the potential causal set. We compare our approach only with RareCover because RWAS or LRT does not select any potential causal variants. The results on different configurations can be found in supplementary documents. Based on the results, our approach always holds reasonable Type I error rates. Although on some configurations RareProb has a little higher Type I error rates, e.g. 1%10% higher when gourp PAR is 5%, than RareCover, the absolute values are still satisfied. Moreover, when the group PAR decreases, RareProb always performs lower Type I error rates than RareCover. These results can be found in Table S2 in the Additional file 2. Considering both statistical power and Type I error rate, the advantage of RareProb cannot be neglected: it is able to identify most of the causal variants with an acceptable Type I error rate. In the other words, if an approach rarely identifies correct variants, a low Type I error rate becomes meaningless.
Power versus different configurations of regions
We compare the powers on different configurations of elevated regions and background regions and test the performance of our approach in identifying the regions. At each total variant number, we preset the number of regions between 2 and 8, with half elevated regions and half background regions. In these datasets, the probability of a rare variant being causal is 0.1 if the variant is located in an elevated region; otherwise, the probability is 0.001 if variant is located in a background region. In the last group of experiments, the regions are generated by the Markov chain, where the transition probability of remaining in the same regions (keeps in elevated region or background region) is 0.8, while the transition probability of transitioning between different regions (jumps from an elevated region to a background region, or jumps from a background region to an elevated region) is 0.2. The emission probabilities are the same as before. We test the powers and record the percentages of correct identifications on the regions. The results are listed in Table 2. The results show that our approach successfully estimates the regions, while RareCover suffers difficulty on identifying neither candidate causal variants nor region information. We also test our approach on total variants being 3000, 4000 and 5000. These results can be found in Table S3 in the Additional file 3.
RareProb on real mutation screening data
Finally, we apply our approach to a real mutation screening dataset. This dataset has been previously published by [30]. Authors screen for a susceptibility gene, ATM, which is thought to associate with ataxia telangiectasia. ATM is also an intermediaterisk susceptibility gene for breast cancer [9, 14]. The dataset (ATM_CCMSdata_Dec2011_v1) we have consists of 121 rare variants in a set of 2506 cases and 2235 controls, which is called "bona fide casecontrol studies" [9, 14].
We apply RareProb to this dataset without any prior information. RareProb identifies variant #c.4424A > G as a causal variant and reports a significant association with a pvalue of 8.8817 × 10^{16}. As a comparison, authors in [30] reports that they did identify a significant association with the help of the prior information, but that they did not find a significant association only according to the results of CMC. Sul and others [14] applied RWAS and reports a nonsignificant association with pvalue of 0.3946 without prior information and a nonsignificant association with pvalue of 0.0078 and 0.0881 when prior information of variants is obtained by AlignGVGD [22] and SIFT [23], respectively. Sul and others [9] also applied LRT and reports that a nonsignificant association with pvalue of 0.3934 was found without prior information, but a significant association with pvalue of 0.0058 and 0.08384 were found introducing AlignGVGD scores and SIFT scores, respectively. Our approach successfully identifies an association and clearly points out the candidate causal variant, without prior information, while either RWAS or LRT cannot achieve this.
Conclusion
In this article, we propose a probabilistic method, RareProb, to identify multiple rare variants that contribute to dichotomous disease susceptibility. Our approach is inspired by RareCover. Both approaches select a subset of potentially causal variants from the given variants, which means our approach does not rely on the preselection of candidate rare variants. Furthermore, as opposed to simply merging the variants in RareCover, our approach gains power by considering the directions and the magnitudes of the genetic effects. Both the causal and the protective variants can be described by pairwise measurements, respectively. This method gets rid of the weakness of losing statistical power when "causal", "neutral" and "protective" variants are combined. Note that the pairwise weight is not the linkage disequilibrium (LD). LD is quite difficult to observe, although it is expected among rare variants. The pairwise measurements indicate the likelihood of two variants being collapsed, which is similar to the kernel functions in regressionbased frameworks. This weight is then used to build up the neighborhood system of the hidden Markov random field model.
The Markov random field model treats all of the variants as one vector and estimates their causal/noncausal status by globally maximizing the likelihood of genotypes instead of by local optimization. Our approach gains more power than existing groupwise collapsing approaches; RareProb filters out those variants with noncausal status. At the same time, unlike the previous selectionbased approaches, RareProb controls the false positive rate by partitioning elevated regions and background regions, instead of by presetting any sliding windows. Regions are much more flexible than preset sliding windows. While existing approaches can only handle hundreds of variants, there is no doubt that the total number of variants will increase rapidly with the development of new technologies, e.g. applications of next generation sequencing. The simulation experiments show that our approach obtains significantly more power, especially when the total number of given rare variants is large. We also apply our approach to a real mutation screening dataset and a significant association is found. Our approach is able to handle thousands of variants. Moreover, our approach is easy to extend to an "additive" genetic model and multiple phenotypes by updating the Dirichlet prior distribution.
Declarations
The publication costs for this article were funded by Xi'an Jiaotong University.
This article has been published as part of BMC Genomics Volume 14 Supplement 1, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S1.
References
 1.
Hirschhorn Joel, Daly Mark: Genomewide association studies for common diseases and complex traits. Nature Reviews Genetics. 2005, 6: 95108.
 2.
Ropers HansHilger: New perspectives for the elucidation of genetic disorders. Am J Hum Genet. 2007, 81: 199207. 10.1086/520679.
 3.
Manolio Teri, Collins Francis, Cox Nancy et al: Finding the missing heritability of complex diseases. Nature. 2009, 461: 747753. 10.1038/nature08494.
 4.
Pritchard Jonathan: Are rare variants responsible for susceptibility to complex diseases?. Am J Hum Genet. 2001, 69: 124137. 10.1086/321272.
 5.
Reich David, Lander Eric: On the allelic spectrum of human disease. Trends in Genetics. 2001, 17: 502510. 10.1016/S01689525(01)024106.
 6.
Chapman Nicola, Wijsman Ellen: Genome screens using linkage disequilibrium tests: optimal marker characteristics and feasibility. Am J Hum Genet. 1998, 63: 18721885. 10.1086/302139.
 7.
Xiong Momiao, Zhao Jinying, Boerwinkle Eric: Generalized T^{2} Test for genome association studies. Am J Hum Genet. 2002, 70: 12571268. 10.1086/340392.
 8.
Bodmer Walter, Bonilla Carolina: Common and rare variants in multifactorial susceptibility to common diseases. Nature Genetics. 2008, 40: 695701. 10.1038/ng.f.136.
 9.
Sul Jae, Han Buhm, Eskin Eleazar: Increasing power of groupwise association test with likelihood ratio test. Proceedings of RECOMB. 2011, 2011: 2831.
 10.
Michael Wu, Seunggeun Lee, Tianxi Cai et al: Rarevariant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011, 89: 8293. 10.1016/j.ajhg.2011.05.029.
 11.
Madsen Bo, Browning Sharon: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics. 2009, 5: e100038410.1371/journal.pgen.1000384.
 12.
Morgenthaler Stephan, Thilly William: A strategy to discover genes that carry multiallelic or monoallelic risk for common diseases: a cohort allelic sums test (CAST). Mutation Research. 2007, 615: 2856. 10.1016/j.mrfmmm.2006.09.003.
 13.
Li Bingshan, Leal Suzanne: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311321. 10.1016/j.ajhg.2008.06.024.
 14.
Sul Jae, Han Buhm, He Dan, Eskin Eleazar: An optimal weighted aggregated association test for identification of rare variants involved in common diseases. Genetics. 2011, 188: 181188. 10.1534/genetics.110.125070.
 15.
Neale Benjamin, Rivas Manuel, Voight Benjamin et al: Testing for an unusual distribution of rare variants. PLoS Genetics. 2011, 7: e100132210.1371/journal.pgen.1001322.
 16.
Cohen Jonathan, Pertsemlidis Alexander, Kotowski Ingrid et al: Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nature Genetics. 2005, 37: 161165. 10.1038/ng1509.
 17.
Cohen Jonathan, Boerwinkle Eric, Mosley Thomas, Hobbs Helen: Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. The New England Journal of Medicine. 2006, 354: 12641272. 10.1056/NEJMoa054013.
 18.
Kathiresan Sekar, Melander Olle, Anevski Dragi et al: Polymorphisms associated with cholesterol and risk of cardiovascular events. The New England Journal of Medicine. 2008, 358: 12401249. 10.1056/NEJMoa0706728.
 19.
Bhatia Gaurav, Bansal Vikas, Harismendy Olivier et al: A covering method for detecting genetic associations between rare variants and common phenotypes. PLoS Computational Biology. 2010, 6: e100095410.1371/journal.pcbi.1000954.
 20.
Pritchard Jonathan, Cox NJ: The allelic architecture of human disease genes: common diseasecommon variantor not?. Human Molecular Genetics. 2002, 11: 24172423. 10.1093/hmg/11.20.2417.
 21.
Besag Julian: On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society Series B. 1986, 48: 259302.
 22.
Tavtigian Sean, Deffenbaugh AM, Yin L et al: Comprehensive statistical study of 452 BRCA1 missense substitutions with classification of eight recurrent substitutions as neutral. Journal of Medical Genetics. 2006, 43: 295305.
 23.
Pauline Ng, Steven Henikoff: SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research. 2003, 31: 38123814. 10.1093/nar/gkg509.
 24.
Kwee Coulter Lydia, Dawei Liu, Xihong Lin et al: A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet. 2008, 82: 386397. 10.1016/j.ajhg.2007.10.010.
 25.
Michael Wu, Kraft P, Michael Epstein et al: Powerful SNPset analysis for casecontrol genomewide association studies. Am J Hum Genet. 2010, 86: 929942. 10.1016/j.ajhg.2010.05.002.
 26.
Quintana Melanie, Berstein Jonine, Thomas Duncan, Conti David: Incorporating model uncertainty in detecting rare variants: the Bayesian risk index. Genetic Epidemiology. 2011, 35: 638649. 10.1002/gepi.20613.
 27.
Conti David, James Gauderman W: SNPs, haplotypes, and model selection in a candidate gene region: the SIMPle analysis for multilocus data. Genetic Epidemiology. 2004, 27: 429441. 10.1002/gepi.20039.
 28.
Melanie Wilson, Edwin Iversen, Merlise Clyde et al: Bayesian model search and multilevel inference for SNP association studies. Annals of Applied Statistics. 2010, 4: 13421364. 10.1214/09AOAS322.
 29.
Fu YunXin: Statistical properties of segregating sites. Theoretical Population Biology. 1995, 48: 172197. 10.1006/tpbi.1995.1025.
 30.
Sean Tavtigian, Peter Oefner, Davit Babikyan et al: Rare, evolutionarily unlikely missense substitutions in ATM confer increased risk of breast cancer. Am J Hum Genet. 2009, 85: 427446. 10.1016/j.ajhg.2009.08.018.
Acknowledgements
This work was supported by National Science Foundation [IIS0803440], [CCF1116175] and [IIS0953563] and the Ph.D. Programs Foundation of Ministry of Education of China [20100201110063]. Authors thank Professor Sean Tavtigian and Professor Georgia ChenevixTrench for sharing the ATM datasets with us. Authors thank Professor ChunXia Yan M.D. and M.D. Feng Zhu for discussing the elevated region and the background region from a medical and clinical view.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
JW and ZM conducted this research. JW designed algorithms and experiments. ZC, AY and JZ developed the software packages and participated in the performance analysis and the experiments on the real dataset. JW, ZM and JZ wrote this paper. All authors have read and approved the final manuscript.
Electronic supplementary material
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Published
DOI
Keywords
 Minor Allele Frequency
 Rare Variant
 Causal Variant
 Markov Random Field
 Background Region