Skip to main content

A probabilistic method for identifying rare variants underlying complex traits



Identifying the genetic variants that contribute to disease susceptibilities is important both for developing methodologies and for studying complex diseases in molecular biology. It has been demonstrated that the spectrum of minor allelic frequencies (MAFs) of risk genetic variants ranges from common to rare. Although association studies are shifting to incorporate rare variants (RVs) affecting complex traits, existing approaches do not show a high degree of success, and more efforts should be considered.


In this article, we focus on detecting associations between multiple rare variants and traits. Similar to RareCover, a widely used approach, we assume that variants located close to each other tend to have similar impacts on traits. Therefore, we introduce elevated regions and background regions, where the elevated regions are considered to have a higher chance of harboring causal variants. We propose a hidden Markov random field (HMRF) model to select a set of rare variants that potentially underlie the phenotype, and then, a statistical test is applied. Thus, the association analysis can be achieved without pre-selection by experts. In our model, each variant has two hidden states that represent the causal/non-causal status and the region status. In addition, two Bayesian processes are used to compare and estimate the genotype, phenotype and model parameters. We compare our approach to the three current methods using different types of datasets, and though these are simulation experiments, our approach has higher statistical power than the other methods. The software package, RareProb and the simulation datasets are available at:


In most existing genetic variant association studies, "common trait, common variants", which asserts that common genetic variants contribute to most of traits (disease susceptibilities), serves as the central assumption. Researchers have successfully identified some significant associations between common single nucleotide polymorphisms (SNPs) and disease traits [1]. However, despite the enormous efforts expended on association studies of complex traits, common genetic variants only show a moderate influence on different phenotypes in many reported disease associations and consequently have limited diagnostic value [2, 3]. While the identification of common variants creates a dilemma, known as "common trait, rare variants", an alternative hypothesis, which asserts that multiple rare variants with moderate to high penetrances may collectively influence disease susceptibilities, has been suggested in some literatures [35]. Rare variants are defined as those whose minor allele frequencies (MAF) are less than or equal to 0.01 (≤ 10-2). Although some rare variants associated with Mendelian diseases have been identified, more often, the allelic population attributable risk (PAR), which describes a small reduction in the incidence that would be observed in unexposed samples compared to the actual exposure pattern, is low. The odds ratio (OR), a measure of the strength of association or non-independence between two binary data values, is also low. Moreover, based on the "common trait, rare variants" hypothesis, in many cases, a set of rare variants, instead of just one variant, should be identified to fully explain the genetic influence. Both the single-variant test [6] and the multiple-variant test [7] have been applied to rare variant association studies. However, due to the reasons outlined above, neither of them shows satisfactory power in obtaining associations. Although more and more attention is being focused upon rare variants, there has only been limited success thus far [810].

Alternatively, the collapsing strategy, also called the "burden-based test", is another approach for rare variants association studies. Most of the collapse-based approaches build on the "recessive-set" genetic model, in which the predisposing haplotype contains mutation(s) in at least one variant [11]. Multiple rare variants in the same locus are collapsed, based on different standards, then statistical tests are applied. The locus here is defined as a selected region that consists of a group of candidate rare variants [9, 1214]. However, it is argued that existing collapse-based approaches assume all rare variants implicitly influencing the phenotype in the same direction and with the same magnitude [10, 15]. Researchers have observed that any given rare variant could have no effect, could be causal, or could be protective for the endpoints (traits) [15]. For example, some low-frequency variants in African Americans PCSK9 can have a substantial effect on serum Low-Density Lipoprotein Cholesterol (LDL-C) by increasing the risk of or protecting against myocardial infarction [1618].

Collapse-based approaches have low statistical powers when "causal", "neutral" and "protective" variants are combined [13, 15, 19]. To overcome this weakness, some approaches [9, 14] assume that the rare variants are well selected by experts, while weighting of each variant is another widely used strategy [9, 11, 14]. In a recent study, Bhatia and others [19] suggest the development of a "model-free" approach, RareCover, that only collapses a subset of potentially causal variants from all of the given variants. Here, the "model" refers to the genetic association model that consists of the pre-selection candidate variants.

Motivated by RareCover, in this article, we focus on rare variant association analysis without any pre-selection of candidate variants. We propose a probabilistic approach, RareProb, to make selections using a Markov random field (MRF) model and identify multiple causal rare variants that influence a dichotomous phenotype using statistical tests. Our approach considers both the causal and the protective variants, which distinguishes it from the previous study RareCover, and it is therefore a robust predictor of the direction and the magnitude of the genetic effects. Moreover, inspired by the weight-sum approaches [9, 11, 14], we also weight each variant; however, we not only consider the likelihood of a variant being causal but also compute the pair-wise likelihood of candidate variants being collapsed together. Note that although it is difficult to observe, relatively low interactions (e.g., linkage disequilibrium) are expected between rare variants [4, 11, 13, 20]. Furthermore, in regression-based association methods, genetic similarities are often used to reduce the dimensions of the regression models. Therefore, we introduced two kinds of genetic regions, the elevated region and the background region, in our model analysis; the elevated region has a higher probability of harboring a causal variant. This assumption that the causal variants are often located close to each other is often used, e.g. slide windows in RareCover [19]. However, the regions are more flexible than a preset slide window, as in RareCover.

We adopt the "dominant" and "recessive set" genetic model, which are also used in [911, 14, 15, 19]. In the dominant and recessive-set model, the predisposing genotype harbors the mutation(s) in at least one variant on any of the two haplotypes. Therefore, for one genotype, there are two possible allelic values at each variant: one denotes that both haplotypes carry a wide-type allele, while the other denotes that at least one haplotype carries a mutant. In our method, each variant has two hidden states, causal/non-causal status and elevated/background region status. The MRF includes the hidden states, emission probabilities and transition probabilities. The emission probabilities bridge the hidden states and the genotypes, while the transition probabilities link the two hidden states. Following the pseudo-likelihood estimation method [21], we infer the model parameters and all of the hidden states. The simulation experiments show that our approach outperforms RareCover, RWAS [14] and LRT [9] on different parametric settings. In particular, RareProb obtains better results on large-scale data.


Notions and model overview

Suppose we are given M rare variants (allelic sites) on a set of N genotypes. Let s i denote the allelic value of the site s on the genotype i (1 ≤ iN, 1 ≤ sM), where s i = 0 means both haplotypes of i have the wild type allele, while s i = 1 means at least one haplotype has a mutant allele. Each genotype carries a dichotomous phenotype. Let vector P denotes the phenotypes, where P i = 1 represents that i is affected by the phenotype trait (being a case), while P i = 0 represents that i is a control.

The core of our approach is a Markov random field (MRF) model. We first introduce four key components of modeling this MRF:

  • The observed data of this MRF consist of all of the genotypes and phenotypes.

  • There are two unknown states for each site: one is the causal or non-causal status and the other is the region location status. Here, we define them as the hidden states of this Markov random field. Let a latent vector R represent the region status, where R s = 1 denotes that the site s is located in an elevated region, while R s = 0 denotes the s is located in a background region. Additionally, let a latent vector X represent the causal/non-causal status, where X s = 1 if the site s is causal (contributes to the phenotype); otherwise, X s = 0. Probabilistic functions are designed to present the probabilities of each hidden state. The RareProb framework is able to incorporate prior information, obtained by different software tools, e.g. Align-GVGD[22] and SIFT[23], etc, by updating initial X vector and R vector.

  • A neighborhood system is required in the MRF model to describe the interactions among hidden states. Details of the hidden states and neighborhood system are shown in the section "Estimation of the hidden states in HMRF".

  • There are two kinds of probabilities in the MRF model: emission probabilities and transition probabilities. Emission probabilities bridge the relationships among genotypes, phenotypes and hidden states. Moreover, hidden states X and R are not independent of each other, as the relationships between the hidden states are described by the transition probabilities. The conditional probability P(X s = 1|R s = 1) denotes the probability that the site s is a causal site when it is located in an elevated region, while P(X s = 0|R s = 1) denotes the probability that the site s is non-causal when it is located in an elevated region. Similarly, another two conditional probabilities, P(X s = 1|R s = 0) and P(X s = 0|R s = 0), present the probabilities of being causal or non-causal if the site is located in a background region. Details of the emission probabilities are shown in the section "Estimation of the emission probabilities in HMRF", and the transition probabilities are shown in the section "Estimation of the transition probabilities in HMRF".

The central thesis of our approach is that causal rare variants, which should be collapsed together, are treated as one random vector variable with certain dimensions. Then, the probability of this bunch of causal rare variants becomes the probability of one variable being associated with the phenotype. Based on the Markov-Gibbs equivalence [21], the probability of this random variable can be decomposed into the sum of clique potentials. The first-order clique potentials describe the probability of one variant being causal, while the second-order clique potentials measure the pair-wise genetic similarities, which share the idea of the kernel machine in regression frameworks [10, 24, 25]. The neighborhood system in the MRF model consists of clique potentials. In our approach, we select that the neighborhood system only contains the first-order and the second-order clique potentials because there is scanty evidence supporting the biological or medical scenario of high-order potentials. For each variable, the MAFs and model parameters can be estimated by maximizing the likelihoods of the genotypes. Then, the probability of the variable and the variable itself can be updated by MAFs and model parameters. Two or three iterations can be applied if needed for the convergence of the MRF. Thus, our approach selects a subset of candidate causal variants by updating the variables and avoids the weakness of the same magnitude effect assumption because the neighborhood system is able to describe both the "causal" and "protective" variants.

Estimation of the hidden states in HMRF

Neighborhood system

Assume there are N/ 2 cases and N/ 2 controls among all of the genotypes (if the number of cases is not equal to the number of controls, then all of the results still can be used by applying noncentrality parameters). At a certain variant s, let θ s denote the MAF for the cases, and let the number of genotypes in cases that carry at least one mutant allele be c s + . Let ρ s denote the MAF for the controls, and let the number of genotypes in controls that carry at least one mutant allele be c s - . Then, we can draw two binomial distributions for the cases and the controls [9, 14]: c s + ~ Bin N 2 θ s and c s - ~ Bin N 2 , ρ s , where f ( c s + | θ s ) = C N 2 c s + θ s c s + ( 1 - θ s ) N 2 - c s + and f ( c s - | ρ s ) = C N 2 c s - ρ s c s - ( 1 - ρ S ) N 2 - c s - . Thus, for a site s, the statistic of the difference between θ and ρ is

z s = 2 θ ^ s - ρ ^ s 2 N θ ^ s + ρ ^ s 2 - θ ^ s - ρ ^ s

where θ ^ s = c s + N / 2 is the estimation of θ s and ρ ^ s = c s - N / 2 is the estimation of ρ s . Similar to the linear kernel function, which calculates genetic similarities [10], we measure the likelihood between pairwise rare variants, which denotes how likely two variants would be collapsed together. For two variants s and s', we define ω s,s' as the likelihood of collapsing as follows:

ω s , s = 2 z s z s z s 2 + z s 2

The ω function has the following properties: (1) When both s and s' are causal variants, due to the PAR, ω s,s' locates in the interval (0, 1]. (2) If one variant is "causal" but the other is "protective", the likelihood takes on a negative value. (3) The likelihood encourages the collapse of the variants with similar PAR. Those rare variants whose MAFs increase rapidly in some cases, as we mentioned before, could be identified by single-site tests or pair-wise tests, which are often not considered in collapsing models [8]. Let ω.,. be the weight of two neighbors. The closer the statistics z s and z s' are, the larger the likelihood will be. And thus, the neighborhood system is built up.

Hidden states

Rare variant s is either located in an elevated region or in a background region. Thus, we define the probability (Bayesian classifier) of s as

p ( X s | X n ( s ) ) exp γ p ( X s | R s ) X s + η s n ( s ) ω s , s p ( X s | R s ) X s

where n (s) denotes the neighbors of s. γ and η are two MRF parameters. γ represents how strongly the status of X s affects the probability of X s , while η represents how strongly the neighbors of s affect the probability of X s . Here, we limit η > 0, which encourages the pair-wise weights and prevents them from counteracting the negative weights. Thus, the joint probability of the latent vector X is p ( X ; Φ ) exp γ s M p ( X s | R s ) + η s n ( s ) ω s , s p ( X s | R s ) , where Φ = (γ, η). As the variants in different subsets (different collapsing groups) are conditional independent, this joint probability covers all of the probabilities of the random variables (collapsing groups). Similarly, the probability of s located in an elevated region can be represented by

p R s | R n ( s ) exp τ R + υ s n ( s ) ω s , s R s

and the joint probability of latent vector R can be represented by p ( R ; Φ R ) exp τ s M R + υ s , s ω s , s R s , where Φ R = (τ, υτ and υ are two MRF parameters. We also limit υ > 0, which encourages the pair-wise weights and prevents them from counteracting the negative weights.

Estimation of the emission probabilities in HMRF

We now estimate the emission probabilities to relate X and R with the observed data. As linkage disequilibrium is rarely observed between rare variants [8], the vector consists of the allelic values from one variant that is conditionally independent from the others, when a particular X is given. Thus, the joint conditional probability of all of the genotypes is

p ( Y | X ) = exp s = 1 M p ( Y s | X s )

If X s = 1, due to the PAR, θ s ρ s . We place a prior distribution on θ s and a prior on ρ s [26]: π ( θ s ) = θ s α θ s - 1 ( 1 - θ s ) β θ s - 1 B ( α θ s , β θ s ) and π ( ρ s ) = ρ s α ρ s - 1 ( 1 - ρ s ) β ρ s - 1 B ( α ρ s , β ρ s ) , where α(·) and β(·) are hyper-parameters in the prior distributions [2628]. Then, the marginal distribution of c s + is

m ( c s + ) = C N 2 c s + Γ ( α θ s , β θ s ) Γ ( α θ s ) Γ ( β θ s ) Γ ( α θ s + c s + ) Γ N 2 - c s + + β θ s Γ α θ s + β θ s + N 2

The marginal distribution of c s - is similar. The probability of the observed genotypes on s is equal to the sum of m c s C N 2 c s Thus we have:

P ( Y s | X s = 1 ) = Γ ( α θ s , β θ s ) Γ ( α θ s ) Γ ( β θ s ) Γ ( α θ s + c s + ) Γ ( N 2 - c s + + β θ s ) Γ ( α θ s + β θ s + N 2 ) × Γ ( α ρ s , β ρ s ) Γ ( α ρ s ) Γ ( β ρ s ) Γ ( α ρ s + c s - ) Γ ( N 2 - c s - + β ρ s ) Γ ( α ρ s + β ρ s + N 2 )

On the other hand, if X s = 0, then there is no PAR between θ s and ρ s that infers θ s = ρ s . Here, we simply use ρ s to denote the MAF of s for both the cases and controls. Thus, we have

P ( Y s | X s = 0 ) = Γ ( α s , β s ) Γ ( α s ) Γ ( β s ) Γ ( α s + c s ) Γ ( N - c s + β s ) Γ ( α s + β s + N )

where c s = c s + + c s - . We have now obtained all the three emission probabilities of this HMRF: p (Y|X), P (Y s |X s = 0) and P (Y s |X s = 1).

Estimation of the transition probabilities in HMRF

The transition probabilities link the hidden states X and R. Let c X + be the counts of the causal variants on all of the elevated regions, and let c E be the number of variants in those regions. Let c X - be the counts of the causal variants on all of the background regions, and c B be the number of variants in those regions. Then, we draw two binomial distributions: c X + ~ Bin ( c E , ξ ) ; c X - ~ Bin ( c B , ζ ) where ξ = P (X = 1|R = 1) and ζ = P (X = 1|R = 0). We also place the prior distributions on ξ and ζ, as follows:

c X + ~ Bin ( c E , ξ ) ; f ( c X + | ξ ) = C c E c X + ξ c X + ( 1 - ξ ) c E - c X +


c X - ~ Bin ( c B , ζ ) ;f ( c X - | ζ ) = C c B c X - ζ c X - ( 1 - ζ ) c B - c X -

where ξ = P (X = 1|R = 1) and ζ = P (X = 1|R = 0). We also place the prior distributions on ξ and ζ, as follows:

π ( ξ ) = ξ α ξ - 1 ( 1 - ξ ) β ξ - 1 B ( α ξ , β ξ ) ; π ( ζ ) = ζ α ζ - 1 ( 1 - ζ ) β ζ - 1 B ( α ζ , β ζ )

where α(·) and β(·) are also hyper-parameters.

Thus, we have the conditional probability of X given R:

P ( X | R ) = Γ ( α ξ , β ξ ) Γ ( α ξ ) Γ ( β ξ ) Γ ( α ξ + c X + ) Γ ( c E - c X + + β ξ ) Γ ( α ξ + β ξ + c E ) × Γ ( α ζ , β ζ ) Γ ( α ζ ) Γ ( β ζ ) Γ ( α ζ + c X - ) Γ ( c B - c X - + β ζ ) Γ ( α ζ + β ζ + c B )

and the posterior distribution of ξ given c X + is

π ( ξ | c X + ) = ξ α ξ + c X + - 1 ( 1 - ξ ) β ξ + M - c X + - 1 B ( α ξ + c X + , β ξ + M - c X + )

Similarly, the posterior distribution of ζ given c X - is

π ( ζ | c X - ) = ζ α ζ + c X - - 1 ( 1 - ζ ) β ζ + M - c X - - 1 B ( α ζ + c X - , β ζ + M - c X - )

Thus far, we have obtained all of the three transition probabilities of this HMRF: p (X|R), π ( ξ | c X + ) and π ( ζ | c X - ) .

Estimation the model parameters

Based on the Gibbs-Markov Equivalence [21], a pseudo-likelihood estimation cycle can be applied to this hidden MRF to estimate the model parameters and update the hidden states. We use the pseudo-likelihood estimation because p (X; Φ) and p (R; Φ R ) are difficult to compute directly. The algorithm involves the following four steps:

  • Step 1: Estimate α θ and β ρ with θ ^ and ρ ^ by maximizing the likelihood L ( Y / X ^ ) . Update θ ^ s by maximizing the posterior distribution:

    π ( θ s | c s + ) = θ s α θ s + c s + - 1 ( 1 - θ s ) β θ s + N - c s + - 1 B α θ s + c s + , β θ s + N - c s +

Similarly, Update ρ ^ s .

  • Step 2: Estimate α ξ , β ξ and α ζ , β ζ with ξ ^ and ζ ^ by maximizing the transition probability L ( X / R ^ ) . Update ξ ^ and ζ ^ by maximizing the transition probabilities π ( ξ | c X + ) and π ( ξ | c X + ) , respectively.

  • Step 3: Estimate Φ and Φ R with Φ ^ and Φ ^ R by maximizing the pseudo-likelihood functions:

    L X ^ ; Φ = e x p S M p s X ^ s | X ^ n ( s ) ; Φ

and L R ^ ; Φ R .

  • Step 4: Update X ^ and R ^ by

    P X s | Y , X ^ S / s f Y s | X s ; θ ^ , ρ ^ p s X s | X ^ n ( s ) ; Φ ^

and P R s | X , R ^ S / s .

There are several ways to exit from this iteration. We measure the Euclidean distance between the current and the updated X ^ . If the distance is less than a preset threshold, our approach will stop the iteration. After the convergence of HMRF, we obtain the estimations of X and R, as well as the MAFs for every variant. The collapsed rare variants can be tested based on the existing statistics, e.g. in [9, 10, 14].

Experiments and results

In this section, we apply our approach on a real dataset from [30] and also compare it with three other approaches using different types of simulated datasets. The three comparison approaches are RareCover, which is based on [19], RWAS from [14] and LRT from [9]. Additionally, it seems that RareCover is not released online, so as in many previous works, we re-implement this algorithm and the related statistics by ourselves.

Simulation frameworks

As the simulation settings in different papers [9, 14, 19] are quite different, we adopt all of them and generate three types of simulated datasets. In the first one, each dataset has a fixed number of causal variants, while in the second dataset, the number of causal variants is determined by allelic population attributable risk (PAR). The last simulation method first generates elevated regions and background regions and then plants causal variants in each region. We describe the three simulation methods in the following sections.

Fix number of causal variants

First, we generate the datasets with fixed numbers of causal variants, following previous approaches [14] and [9]. Each variant is generated independently because they believe that rare variants do not show significant linkage disequilibrium [9, 14]. For each variant, the probability distribution of the MAF of site s on controls, ρ s , satisfies the Wright's distribution under purifying selection [4],

f ( ρ s ) ( ρ s ) β s - 1 ( 1 - ρ s ) β N - 1 e σ - ρ s σ

where σ is the selection coefficient, β S is the probability that the normal allelic site mutates to the causal variant, and β N is the probability that a causal variant repairs to a normal variant. We take σ = 12.0, β S = 0.001 and β N = 0.00033, which are the same settings used by [9, 11, 14]. Then, the relative risk of s is: RR= δ ( 1 - δ ) ρ s +1, where δ is the marginal PAR. The marginal PAR is equal to the group PAR (Δ) divided by the number of causal variants, while the relative risk of M variants is 1 [14]. Afterwards, the MAF of s for the cases is calculated according to θ s = R R × ρ s ( R R - 1 ) ρ s + 1 . In each dataset, we simulate N = 2000 genotypes with half cases and half controls. The mutations on the cases and the controls are sampled independently according to θ s and ρ s , respectively.

Causal variants depends on PAR

The second way generates a set, C, that contains all of the causal variants. Instead of a fixed number, the total number of causal variants depends on PAR [19], which is limited by Δ (the group PAR):

Δ 1 - s C 1 - θ s P r P D

where Pr represents the penetrance of the group of causal variants and P D is the disease prevalence in the population. Different settings are applied in the experiments.

We use the algorithm proposed in [19] to obtain the MAF of each causal variant. The algorithm samples the MAF of a causal variant s, θ s , from the Wright's distribution with σ = 30.0, β S = 0.2 and β N = 0.002 [4, 19], and then appends s to C. Next, the algorithm checks whether s C 1 - θ s P r P D > 1 - Δ is true. If the inequality does not hold, the algorithm terminates and outputs C. Thus, we obtain all of the causal variants and their MAFs. If the inequality holds, then the algorithm continuously samples the MAF of the next causal variant. The mutations on genotypes are sampled according to θ s .

For those non-causal variants, we use Fu's model [29] of allelic distributions on a coalescent, which is the same used in [19]. We adopt ρ s = 5 . 0 N . The mutations on genotypes are sampled according to ρ s . The phenotype of each individual (genotype) is computed by the penetrance of the subset, Pr. Thereafter, we sample 1000 of the cases and 1000 of the controls.

Causal variants depends on regions

There are many ways to generate a dataset with regions. The simplest way is to preset the elevated regions and the background regions and to plant causal variants based on certain probabilities. An alternate way creates the regions by a Markov chain. For each site, there are two groups of states. The E state denotes that the variant is located in an elevated region, while the B state denotes that the variant is located in a background region. Both states E and B can transfer to a causal state C or a non-causal state C ̄ . If the Markov chain travels to the C ̄ state, it plants a mutant on the genotype with probability ρ. If the variant is considered to be causal, it may continuously transfer to the state A, which means that the genotype carries a mutant that may affect the phenotype with penetrance Pr. Otherwise it arrives in the state Ā, and the Markov chain plants a mutant or a wild-type allele on the genotype afterwards.

To generate enough genotypes, we perform the following steps for each variant: if the process drops into C ̄ , we take 50,000 iterations to yield a mutant, where ρ is sampled from the Wright's distribution with σ = 30.0, β S = 0.2 and β N = 0.002. If it drops into A or Ā, we design an iteration to C until it reaches 50,000 iterations. The transition probability from C to A is equal to ρ × Pr. After we have enough genotypes, we sample 1000 cases and 1000 controls from them.

Comparisons on power

Similar to the measurements in [9, 14], the power of an approach is measured by the number of significant datasets, among many datasets, using a significance threshold of 2.5 × 10-6 based on the Bonferroni correction assuming 20000 genes, genome-wide. We test at most 1000 datasets for each comparison experiment.

Power versus different proportions of causal variants

We compare the powers under different sizes of total variants. In the first group of experiments, we include 50 causal variants and vary the total number of variants from 100 to 5000. Thus, the proportions of causal variants decrease from 50% to 1%. In the second group of experiments, we hold the group PAR as 5% and vary the total number of variants as before. The results are compared in Table 1. From the results, our approach clearly shows more powerful and more robust at dealing with large-scale data. We also test our approach on different settings of the group PARs. Those results can be found in Table S1 in the Additional file 1.

Table 1 The power comparisons at different proportions of causal variants

The Type I error rate is another important measurement for estimating an approach. To compute the Type I error rate, we apply the same technique as [19]. Type I error rate is defined as the probability of a non-causal variant being selected in the potential causal set. We compare our approach only with RareCover because RWAS or LRT does not select any potential causal variants. The results on different configurations can be found in supplementary documents. Based on the results, our approach always holds reasonable Type I error rates. Although on some configurations RareProb has a little higher Type I error rates, e.g. 1%-10% higher when gourp PAR is 5%, than RareCover, the absolute values are still satisfied. Moreover, when the group PAR decreases, RareProb always performs lower Type I error rates than RareCover. These results can be found in Table S2 in the Additional file 2. Considering both statistical power and Type I error rate, the advantage of RareProb cannot be neglected: it is able to identify most of the causal variants with an acceptable Type I error rate. In the other words, if an approach rarely identifies correct variants, a low Type I error rate becomes meaningless.

Power versus different configurations of regions

We compare the powers on different configurations of elevated regions and background regions and test the performance of our approach in identifying the regions. At each total variant number, we preset the number of regions between 2 and 8, with half elevated regions and half background regions. In these datasets, the probability of a rare variant being causal is 0.1 if the variant is located in an elevated region; otherwise, the probability is 0.001 if variant is located in a background region. In the last group of experiments, the regions are generated by the Markov chain, where the transition probability of remaining in the same regions (keeps in elevated region or background region) is 0.8, while the transition probability of transitioning between different regions (jumps from an elevated region to a background region, or jumps from a background region to an elevated region) is 0.2. The emission probabilities are the same as before. We test the powers and record the percentages of correct identifications on the regions. The results are listed in Table 2. The results show that our approach successfully estimates the regions, while RareCover suffers difficulty on identifying neither candidate causal variants nor region information. We also test our approach on total variants being 3000, 4000 and 5000. These results can be found in Table S3 in the Additional file 3.

Table 2 The power comparisons for different configurations of regions

RareProb on real mutation screening data

Finally, we apply our approach to a real mutation screening dataset. This dataset has been previously published by [30]. Authors screen for a susceptibility gene, ATM, which is thought to associate with ataxia telangiectasia. ATM is also an intermediate-risk susceptibility gene for breast cancer [9, 14]. The dataset (ATM_CCMSdata_Dec2011_v1) we have consists of 121 rare variants in a set of 2506 cases and 2235 controls, which is called "bona fide case-control studies" [9, 14].

We apply RareProb to this dataset without any prior information. RareProb identifies variant #c.4424A > G as a causal variant and reports a significant association with a p-value of 8.8817 × 10-16. As a comparison, authors in [30] reports that they did identify a significant association with the help of the prior information, but that they did not find a significant association only according to the results of CMC. Sul and others [14] applied RWAS and reports a non-significant association with p-value of 0.3946 without prior information and a non-significant association with p-value of 0.0078 and 0.0881 when prior information of variants is obtained by Align-GVGD [22] and SIFT [23], respectively. Sul and others [9] also applied LRT and reports that a non-significant association with p-value of 0.3934 was found without prior information, but a significant association with p-value of 0.0058 and 0.08384 were found introducing Align-GVGD scores and SIFT scores, respectively. Our approach successfully identifies an association and clearly points out the candidate causal variant, without prior information, while either RWAS or LRT cannot achieve this.


In this article, we propose a probabilistic method, RareProb, to identify multiple rare variants that contribute to dichotomous disease susceptibility. Our approach is inspired by RareCover. Both approaches select a subset of potentially causal variants from the given variants, which means our approach does not rely on the pre-selection of candidate rare variants. Furthermore, as opposed to simply merging the variants in RareCover, our approach gains power by considering the directions and the magnitudes of the genetic effects. Both the causal and the protective variants can be described by pair-wise measurements, respectively. This method gets rid of the weakness of losing statistical power when "causal", "neutral" and "protective" variants are combined. Note that the pair-wise weight is not the linkage disequilibrium (LD). LD is quite difficult to observe, although it is expected among rare variants. The pair-wise measurements indicate the likelihood of two variants being collapsed, which is similar to the kernel functions in regression-based frameworks. This weight is then used to build up the neighborhood system of the hidden Markov random field model.

The Markov random field model treats all of the variants as one vector and estimates their causal/non-causal status by globally maximizing the likelihood of genotypes instead of by local optimization. Our approach gains more power than existing group-wise collapsing approaches; RareProb filters out those variants with non-causal status. At the same time, unlike the previous selection-based approaches, RareProb controls the false positive rate by partitioning elevated regions and background regions, instead of by presetting any sliding windows. Regions are much more flexible than preset sliding windows. While existing approaches can only handle hundreds of variants, there is no doubt that the total number of variants will increase rapidly with the development of new technologies, e.g. applications of next generation sequencing. The simulation experiments show that our approach obtains significantly more power, especially when the total number of given rare variants is large. We also apply our approach to a real mutation screening dataset and a significant association is found. Our approach is able to handle thousands of variants. Moreover, our approach is easy to extend to an "additive" genetic model and multiple phenotypes by updating the Dirichlet prior distribution.


The publication costs for this article were funded by Xi'an Jiaotong University.

This article has been published as part of BMC Genomics Volume 14 Supplement 1, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Genomics. The full contents of the supplement are available online at


  1. 1.

    Hirschhorn Joel, Daly Mark: Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics. 2005, 6: 95-108.

    CAS  Article  PubMed  Google Scholar 

  2. 2.

    Ropers Hans-Hilger: New perspectives for the elucidation of genetic disorders. Am J Hum Genet. 2007, 81: 199-207. 10.1086/520679.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  3. 3.

    Manolio Teri, Collins Francis, Cox Nancy et al: Finding the missing heritability of complex diseases. Nature. 2009, 461: 747-753. 10.1038/nature08494.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  4. 4.

    Pritchard Jonathan: Are rare variants responsible for susceptibility to complex diseases?. Am J Hum Genet. 2001, 69: 124-137. 10.1086/321272.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  5. 5.

    Reich David, Lander Eric: On the allelic spectrum of human disease. Trends in Genetics. 2001, 17: 502-510. 10.1016/S0168-9525(01)02410-6.

    CAS  Article  PubMed  Google Scholar 

  6. 6.

    Chapman Nicola, Wijsman Ellen: Genome screens using linkage disequilibrium tests: optimal marker characteristics and feasibility. Am J Hum Genet. 1998, 63: 1872-1885. 10.1086/302139.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  7. 7.

    Xiong Momiao, Zhao Jinying, Boerwinkle Eric: Generalized T2 Test for genome association studies. Am J Hum Genet. 2002, 70: 1257-1268. 10.1086/340392.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  8. 8.

    Bodmer Walter, Bonilla Carolina: Common and rare variants in multifactorial susceptibility to common diseases. Nature Genetics. 2008, 40: 695-701. 10.1038/ng.f.136.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  9. 9.

    Sul Jae, Han Buhm, Eskin Eleazar: Increasing power of groupwise association test with likelihood ratio test. Proceedings of RECOMB. 2011, 2011: 28-31.

    Google Scholar 

  10. 10.

    Michael Wu, Seunggeun Lee, Tianxi Cai et al: Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011, 89: 82-93. 10.1016/j.ajhg.2011.05.029.

    Article  Google Scholar 

  11. 11.

    Madsen Bo, Browning Sharon: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics. 2009, 5: e1000384-10.1371/journal.pgen.1000384.

    PubMed Central  Article  PubMed  Google Scholar 

  12. 12.

    Morgenthaler Stephan, Thilly William: A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutation Research. 2007, 615: 28-56. 10.1016/j.mrfmmm.2006.09.003.

    CAS  Article  PubMed  Google Scholar 

  13. 13.

    Li Bingshan, Leal Suzanne: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311-321. 10.1016/j.ajhg.2008.06.024.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  14. 14.

    Sul Jae, Han Buhm, He Dan, Eskin Eleazar: An optimal weighted aggregated association test for identification of rare variants involved in common diseases. Genetics. 2011, 188: 181-188. 10.1534/genetics.110.125070.

    PubMed Central  Article  PubMed  Google Scholar 

  15. 15.

    Neale Benjamin, Rivas Manuel, Voight Benjamin et al: Testing for an unusual distribution of rare variants. PLoS Genetics. 2011, 7: e1001322-10.1371/journal.pgen.1001322.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  16. 16.

    Cohen Jonathan, Pertsemlidis Alexander, Kotowski Ingrid et al: Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nature Genetics. 2005, 37: 161-165. 10.1038/ng1509.

    CAS  Article  PubMed  Google Scholar 

  17. 17.

    Cohen Jonathan, Boerwinkle Eric, Mosley Thomas, Hobbs Helen: Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. The New England Journal of Medicine. 2006, 354: 1264-1272. 10.1056/NEJMoa054013.

    CAS  Article  PubMed  Google Scholar 

  18. 18.

    Kathiresan Sekar, Melander Olle, Anevski Dragi et al: Polymorphisms associated with cholesterol and risk of cardiovascular events. The New England Journal of Medicine. 2008, 358: 1240-1249. 10.1056/NEJMoa0706728.

    CAS  Article  PubMed  Google Scholar 

  19. 19.

    Bhatia Gaurav, Bansal Vikas, Harismendy Olivier et al: A covering method for detecting genetic associations between rare variants and common phenotypes. PLoS Computational Biology. 2010, 6: e1000954-10.1371/journal.pcbi.1000954.

    PubMed Central  Article  PubMed  Google Scholar 

  20. 20.

    Pritchard Jonathan, Cox NJ: The allelic architecture of human disease genes: common disease-common variant-or not?. Human Molecular Genetics. 2002, 11: 2417-2423. 10.1093/hmg/11.20.2417.

    CAS  Article  PubMed  Google Scholar 

  21. 21.

    Besag Julian: On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society Series B. 1986, 48: 259-302.

    Google Scholar 

  22. 22.

    Tavtigian Sean, Deffenbaugh AM, Yin L et al: Comprehensive statistical study of 452 BRCA1 missense substitutions with classification of eight recurrent substitutions as neutral. Journal of Medical Genetics. 2006, 43: 295-305.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  23. 23.

    Pauline Ng, Steven Henikoff: SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research. 2003, 31: 3812-3814. 10.1093/nar/gkg509.

    Article  Google Scholar 

  24. 24.

    Kwee Coulter Lydia, Dawei Liu, Xihong Lin et al: A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet. 2008, 82: 386-397. 10.1016/j.ajhg.2007.10.010.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  25. 25.

    Michael Wu, Kraft P, Michael Epstein et al: Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet. 2010, 86: 929-942. 10.1016/j.ajhg.2010.05.002.

    Article  Google Scholar 

  26. 26.

    Quintana Melanie, Berstein Jonine, Thomas Duncan, Conti David: Incorporating model uncertainty in detecting rare variants: the Bayesian risk index. Genetic Epidemiology. 2011, 35: 638-649. 10.1002/gepi.20613.

    PubMed Central  Article  PubMed  Google Scholar 

  27. 27.

    Conti David, James Gauderman W: SNPs, haplotypes, and model selection in a candidate gene region: the SIMPle analysis for multilocus data. Genetic Epidemiology. 2004, 27: 429-441. 10.1002/gepi.20039.

    Article  PubMed  Google Scholar 

  28. 28.

    Melanie Wilson, Edwin Iversen, Merlise Clyde et al: Bayesian model search and multilevel inference for SNP association studies. Annals of Applied Statistics. 2010, 4: 1342-1364. 10.1214/09-AOAS322.

    Article  Google Scholar 

  29. 29.

    Fu Yun-Xin: Statistical properties of segregating sites. Theoretical Population Biology. 1995, 48: 172-197. 10.1006/tpbi.1995.1025.

    CAS  Article  PubMed  Google Scholar 

  30. 30.

    Sean Tavtigian, Peter Oefner, Davit Babikyan et al: Rare, evolutionarily unlikely missense substitutions in ATM confer increased risk of breast cancer. Am J Hum Genet. 2009, 85: 427-446. 10.1016/j.ajhg.2009.08.018.

    Article  Google Scholar 

Download references


This work was supported by National Science Foundation [IIS-0803440], [CCF-1116175] and [IIS-0953563] and the Ph.D. Programs Foundation of Ministry of Education of China [20100201110063]. Authors thank Professor Sean Tavtigian and Professor Georgia Chenevix-Trench for sharing the ATM datasets with us. Authors thank Professor Chun-Xia Yan M.D. and M.D. Feng Zhu for discussing the elevated region and the background region from a medical and clinical view.

Author information



Corresponding authors

Correspondence to Jiayin Wang or Zhongmeng Zhao.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JW and ZM conducted this research. JW designed algorithms and experiments. ZC, AY and JZ developed the software packages and participated in the performance analysis and the experiments on the real dataset. JW, ZM and JZ wrote this paper. All authors have read and approved the final manuscript.

Electronic supplementary material

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article

Cite this article

Wang, J., Zhao, Z., Cao, Z. et al. A probabilistic method for identifying rare variants underlying complex traits. BMC Genomics 14, S11 (2013).

Download citation


  • Minor Allele Frequency
  • Rare Variant
  • Causal Variant
  • Markov Random Field
  • Background Region