Skip to main content

An improved burden-test pipeline for identifying associations from rare germline and somatic variants



Identifying rare germline and somatic variants associated with cancer progression is an important research topic in cancer genomics. Although many approaches are proposed for rare variant association study, they are not fit for cancer sequencing data due to multiple issues, such as overly relying on pre-selection, losing sight of interacting hotspots, etc.


In this article, we propose an improved pipeline to identify germline variant and somatic mutation interactions influencing cancer susceptibility from pair-wise cancer sequencing data. The proposed pipeline, RareProb-C performs an algorithmic selection on the given variants by incorporating the variant allelic frequencies. The interactions among the variants are considered within the regions which are limited by a four-gamete test. Then it filters singular cases according to the posterior probability at each site. Finally, it outputs the selected candidates that pass a collapse test.


We apply RareProb-C on a series of carefully constructed simulation cases and it outperforms six existing genetic model-free approaches. We also test RareProb-C on 429 TCGA ovarian cancer cases, and RareProb-C successfully identifies the known highlighted variants which are considered increasing disease susceptibilities.


Over recent decades, large cancer genome projects, such as TCGA and ICGC [1, 2], greatly promote the achievements on cancer genomics. One of the important research topics is to comprehensively identify the germline and somatic variants that contribute to cancer susceptibility. Several standard pipelines for detecting germline variants and somatic mutations have been developed. For each cancer patient, these pipelines require two sets of sequencing data, one of which is from tumor tissue, while the other collects from normal tissue. The variant calls from a pair of such two sets are compared. Germline variants are expected to be observed in both sets, while the plausible differences may represent genuine somatic mutations.

Deleterious germline variants inheritance are usually confined to a small population, whose minor allele frequencies are usually very low. Interacting with germline ones, highly recurrent somatic mutations only make up a small proportion of total somatic events. Low minor allele frequencies can terribly hurt the statistical power and odd ratio in association analysis. To tackle this issue, existing computational approaches for germline and somatic variants widely adopt the collapsing strategy, as known as the burden-test. It is a major technique for rare variant association study. The basic idea of collapsing strategy is to merge the given variants to one or multiple virtual loci, whose minor allele frequency(ies) is(are) obviously increased. The statistical tests are then applied to these virtual loci, together with common variants. For example, the mutations from the genes that are on the same pathway are often collapsed to a single virtual locus. The successful cases of using collapsing strategy have reported many rare variants that contribute to the susceptibility of a number of complex diseases and traits, including mutlipe types of cancer [36].

Any burden-test approaches could be classified into one of two categories: genetic model-based or genetic model-free. Genetic model-based approaches assume that the given variants are finely selected, while the model-free ones generate a set of candidate variants from given ones. Recent published approaches prefer the model-free strategy because model-free approaches are more aggressive in identifying novel deleterious events. The major diversity of the existing approaches is presented in the ways of selecting candidate variants. Some approaches weight the given variants before statistical tests, such as the rare variant weighted aggregated statistic (RWAS) [7], the likelihood ratio test (LRT) [8] and two weighted score tests with branching under ratios (BUR) and likelihood-based model branching (LiMB), respectively [9]. Regression is another idea to refine the given variants, such as the kernel-based association test (KBAT) [10], the sequence kernel association tests (SKAT) [11, 12] and convex-concave rare variant selection method (CCRS) [13]. RareCover is considered the first algorithmic selection approach, which filters the variants via a χ 2 aggregation greedy strategy [14]. Both RareProb [15] and the logistic Bayesian LASSO (LBL) [16] improve the selection strategy of RareCover, where RareProb implements a hidden Markov random field model and LBL incorporates a posterior Bayesian score calculated by a Markov chain framework.

Although different burden-test approaches vary in the means of collapsing, the “common disease, rare variant” hypothesis limits their applications in cancer genomics [17]. This hypothesis suggests that a set of independent variants with low MAFs and modest penetrances jointly affect a complex trait, where each may only explain a small portion of the phenotypes [5, 6]. Here, the interactions among rare variants are suggested to be either non-existent or too weak to be observed [7, 8, 14, 18]. However, it is reported that the germline variant and somatic mutation interactions influencing cancer susceptibility is involved in more than 3% of cancer cases across multiple cancer types [4]. For example, the two-hit hypothesis serves as a classic genetic model for the DNA repair and tumor suppressor genes [19]. Moreover, several computational models are proposed to capture such interactions. For example, the significant mutated gene test has successfully supported the research on the landscape of the interacting somatic mutations across 12 major cancer types [3]. Such interactions are also observed in clonal expansion analysis [20]. It is also reported that the somatic copy-number alternations may contribute to the potential selective advantages of the germline variants [4]. Second, tumor tissue is an admixture of clonal tumor cells and non-cancerous cells. And thus, sample contamination and clonal architecture are ineluctable. Furthermore, the variant allelic frequency at each site reveals differences among sub-groups of the given samples. Without considering these issues, a collapsing method would lose sensitivity and specificity, and the association approach could be weakened by decreasing the statistical powers.

We propose a burden-test pipeline in this article. This pipeline extends the existing RareProb framework proposed in [15] and the name of this pipeline is RareProb-C. RareProb-C is designed for cancer sequencing data, which considers the interactions among rare germline and somatic variants. To verify this pipeline, we conduct a series of simulation experiments with different settings, and compare RareProb-C to 6 popular approaches, which are 1) RWAS, BUR and LiMB from weighted-based group, 2) CCRS from regression-based group and 3) RareCover and LBL from algorithmic selection group. The results of RareProb-C show significant advantage in type-II error rates. RareProb-C is also tested on a set of 429 TCGA ovarian cancer cases. The association report provided by RareProb-C includes more highlighted variants, which are considered to be associated with increased cancer susceptibilities.


RareProb-C consists of four components, and RareProb [15] becomes a core module of one component. The three new components are described in this section, whose functions are detecting interacting variants, estimating the significant mutated regions, and removing singular variants or cases. During implementation, all these components are executed simultaneously within a hidden Markov random field framework. Supposing that we are given M variants on a set of N samples. Variants could include both single-site ones and small indels. Large structural variations are not discussed here, as their functional analyses are often involved in either large LOH/CNV or more complex mechanisms [4]. Suppose the given N samples consist of \(\frac {N}{2}\) cases and \(\frac {N}{2}\) controls. If the number of cases is not equal to the number of controls, the following equations can be used simply by applying non-centrality parameters.

Detecting interacting variants

For germline variant s, let θ s denote the variant allelic frequency (VAF) in a tumor sample and ρ s denote the variant allelic frequency in the paired normal sample. Let \(c_{s}^{+}\) and \(c_{s}^{-}\) represent the number of reads supporting the mutation at site s in the tumor and normal samples, respectively. Two binomial distributions can be drawn for the tumor and normal samples: \(c_{s}^{+} \sim \text {Bin}\left (d_{s}^{+},\theta _{s}\right)\) and \(c_{s}^{-} \sim \text {Bin}\left (d_{s}^{-},\rho _{s}\right)\), where \(d_{s}^{+}\) and \(d_{s}^{-}\) denote the respective read depths. For this site across samples, the statistic of the difference between θ s and ρ s is calculated by a pair-wise t-test. Similar to the linear kernel function [11], which calculates genetic similarities, we weight the interactions of two variants s and s by \(\omega _{s,s'} \sim \frac {2t_{s}t_{s'}}{t_{s}^{2} + t_{s'}^{2}}\), which affects how likely these two variants would be collapsed. For somatic mutations, only θ · and \(c_{\cdot }^{+}\) are considered. By applying clonal architecture analysis [3], θ · are clustered into multiple sub-clones and the correlation coefficient \(\phantom {\dot {i}\!}\omega _{\cdot,\cdot ^{\prime }}\) of each mutation-pair within a sub-clone is obtained simultaneously.

Furthermore, the interactions among multiple variants may vary from two adjacent sites to a few sites within a certain region/pathway, e.g., within a gene or multiple genes. Let G ·,s represent the genotypes at any site s across all given samples. Here we assume that the observed genotype reflects a Bayesian classifier from multiple factors, such as \(c_{s}^{+}\) and \(c_{s}^{-}\). Let m(G ·,s ) denote the classifier model for each case at site s. Then, the interactions among s and neighboring variants n(s) can be described as the conditional probability of m(G ·,s ) undergoing \(\phantom {\dot {i}\!}m\left (G_{\cdot,s'}\right)\), where \(\phantom {\dot {i}\!}s^{\prime }\in n(s)\). To clearly describe the patterns of interactions, we have

$${} P\left(G_{\cdot,s} | G_{\cdot,s_{n(s)}}\right) \propto \text{exp}\left(\tau m\left(G_{\cdot,s}\right) + \nu \sum_{s'\in n(s)} \omega_{s,s'} m\left(G_{\cdot,s'}\right) \right) $$

where the first component, τ m(G ·,s ) represents the solo effects from s itself, while the second component describes the effects from the interacting ones.

The variant allelic frequency for each site consists of an N-dimensional vector. Let the vector be a state in the Markov random field. Then, state H s =[G ·,s ]−1 corresponding to site s, and based on the Markov-Gibbs equivalence, for two interacting variants s and s , the interaction could be represented as a transition probability between H s and \(\phantom {\dot {i}\!}H_{s^{\prime }}\), which is

$$\begin{aligned} P\left(H_{s} | H_{s^{\prime}}, \theta \right) \propto& \text{exp}\left(\sum_{i=1}^{N} \mu + \left(1 - \mu\right) f\right.\\&\left. \left(-\theta d_{s,s' } + I_{s,s'}e^{-\theta d_{s,s'}}\right){\vphantom{\sum_{i=1}^{N}}}\right) \end{aligned} $$

where μ denotes the mutation rate by which a variant occurs independently, while θ denotes an unknown model parameter that describes the probability of a variant being located in a deleterious region. \(\phantom {\dot {i}\!}d_{s,s^{\prime }}\) is the genetic distance between two adjacent sites s and s , which underlies the length of a region.

Estimating the regions with interactions

The transition probability \(\phantom {\dot {i}\!}P\left (H_{s} | H_{s^{\prime }}, \theta \right)\) models the interacting patterns between two sites, and we then bridge each interacting pattern to the region vector R. R can be shared across cases with similar features.

For any site s, vector G ·,s is used to estimate the status of region vector R s . To facilitate the computation, without loss of the Markov property, we first define two additional silent states affiliated to the state H s , which are \(R_{s}^{e}\) and \(R_{s}^{b}\). Then, we define that a Markov chain via state \(R_{s}^{e}\) denotes that site s is in an elevated region; otherwise, the chain though \(R_{s}^{b}\) denotes that site s is in a background region. Let \(P_{s}^{e} = P\left (R = 1 | H_{s}\right)\) be the transition probability of site s within an elevated region, while \(P_{s}^{b} = P\left (R = 1 | H_{s}\right)\) denotes the probability of site s within a background region and the Markov property limits \(P_{s}^{e} + P_{s}^{b} = 1\). Thus, the probability of site s being located in an elevated region can be represented by

$${} p\left(R_{s}^{e}|H_{s},H_{s^{\prime}},R_{n(s)}\right) \propto \text{exp}\left(\tau B\left(P_{s}^{e},P_{s}^{b}\right) + \nu \sum_{s' \in n(s)} \omega_{s,s'} R_{s'}\right) $$

and the joint probability of latent vector R can be represented by \(p\left (R;\Phi _{R}\right) \propto \text {exp}\left (\tau \sum _{s}^{M} R_{s} + \nu \sum _{s,s'} \omega _{s,s'} R_{s'}\right)\), where s n(s) and Φ R =(τ,ν). τ and ν are two model parameters balancing the importance of interactions, where τ+ν=1. \(\phantom {\dot {i}\!}R_{s'} = P\left (R_{s'}|X\right)\), where X is a M-dimensional vector that each element represent the causal/non-causal status of the corresponding variant.

Furthermore, we require each region to pass the four-gamete test, which limits the length of a region and potentially reduces false positives. We adopt a maximal-K-cover pipeline to divide \(G_{\frac {N}{2}\times M}\) into multiple compatible intervals, as proposed in [21]. This pipeline consists of a series of scanning algorithms. First, it establishes two interval sets, I l r and I r l , and obtains intervals from the left-to-right algorithm and right-to-left algorithm, respectively, where each set contains K intervals. Second, a merging algorithm combines these 2K intervals to extract the overlapping genotypes with the same index, for example, I l,i and I r,i . It then defines K cores according to the overlapping, denoted by C={o 1,o 2,,o K }. Third, an uber-scan algorithm calculates K+m candidate intervals U={u 1,u 2,,u K } and assigns these K+m intervals to each core in C. If o i contains u j , then u j is allocated in the i-th group. Finally, the maximal intervals are estimated via a dynamic programming algorithm. We restrict any region within one compatible interval, where the genotypes are restricted by having limited diversity.

Estimating model parameters

Based on the Gibbs-Markov equivalence, a pseudo-likelihood iteration algorithm can solve this model to estimate the unknown model parameters and the hidden states. Again, due to the computational complexity, we introduce another two silent field states to facilitate the computation process. For any site s, let \(\vec {J}_{i,\cdot }^{s}\) permute from \(\phantom {\dot {i}\!}\vec {G}_{\cdot,s^{\prime }}\), where the transition probability from \(J_{i,\cdot }^{s}\) to \(\vec {G}_{\cdot,s'}\) is \(\propto \frac {1}{N-1}\); let \(\vec {J}_{i}^{s}\) duplicate from the vector \(\phantom {\dot {i}\!}G_{\cdot,s^{\prime }}\), where the transition probability from \(\vec {J}_{i}^{s}\) to \(\vec {G}_{\cdot,s'}\) is 1. The other component of \(\phantom {\dot {i}\!}P\left (H_{s} | H_{s^{\prime }},\theta \right)\) can be divided into two separate transition probabilities, namely, \(P\left (J_{i}^{s} | R_{s}\right) = f\left (-\theta d_{s,s'}\right) + e^{-\theta d_{s,s'}}\) and \( P\left (J_{i,\cdot }^{s} | R_{s}\right) = \frac {1}{N-1}f\left (-\theta d_{s,s'}\right)\), where f(x) is (1−e x)/N.

Now, for the entire hidden Markov random field model, we need to estimate three unknown model parameters: the mutation rate μ, the deleterious region location likelihood θ and the regional Bayesian classifier \(P_{\cdot }^{e}\). \(P_{\cdot }^{b}\) fully depends on \(P_{\cdot }^{e}\). The forward probability for the hidden state R s is

$$\alpha_{R}(s) = \sum_{j=0}^{2N} \alpha_{R}\left(s - 1\right) P_{s}^{b}\cdot I + P_{s}^{e} \left(1 - I\right) P\left(R_{s}|X\right) $$

where the indicator function I=1 when m o d e(j,2)=0. The forward probability for \(\vec {H}_{s}\) is

$$\alpha_{H}(s) = \sum_{j=0}^{N} \alpha_{H}\left(i,s-1\right)P\left(H_{s}\right) $$

Similarly, we have the forward and backward probabilities for hidden states R s and H s at the same site, which are β R (s) and β H (s).

We incorporate an EM algorithm described in [22] into the original iterated conditional mode algorithm in RareProb to estimate the model parameters and update the hidden states. In iteration r, the algorithm estimates the hidden parameters as follows:

$$\mu_{r+1} = \frac{\sum_{i=1}^{N}\sum_{j=0}^{M} I(H_{s} \neq G_{i,s})P(H_{s}|G,\mu_{r},\theta_{r})}{\sum_{i=1}^{N}\sum_{j=0}^{M} I\left(G_{i,s} > 0\right) P(H_{s} |G,\mu_{r},\theta_{r})} $$


$$\begin{aligned} \theta_{r+1} =& argmax \left(\sum_{i=1}^{M} P\left(J_{i}^{s} | R_{s}\right) \sum_{j=1}^{N} P\left(J_{i}^{s} | G,\mu_{r},\theta_{r}\right)\right. \\&\qquad \left.+ P\left(J_{i,\cdot}^{s} | G,\mu_{r},\theta_{r}\right) {\vphantom{ \sum_{i=1}^{M}}}\right) \end{aligned} $$

Let ξ R (s,i,j) represent the probability of R s equal to i and \(\phantom {\dot {i}\!}R_{s^{\prime }}\) equal to j, with the conditions of \(\hat {R}\) and model parameters. Then, we have

$${} \xi_{R}\left(s,i,j\right) = \frac{\alpha_{R}(s,i) P\left(H_{i} | H_{j}\right) P\left(R_{s}|X\right) \beta_{R}(s,j)}{\sum_{i=1}^{2N} \sum_{j=1}^{2N} \alpha_{R}(s,i) P\left(H_{i} | H_{j}\right) P\left(R_{s}|X\right) \beta_{R}(s,j)} $$


$$\left(P_{s}^{b}\right)_{r+1} = \frac{\sum_{s}^{M} \xi_{R}\left(s,0,0\right)}{\frac{1}{N}\left(1 + \left(N - 1\right)e^{-\theta d_{s,s'}}\right)} $$

Once the differences in parameters between two iterations are less than a preset threshold(s), this algorithm terminates.

Filtering singular cases

With the estimation of regions and causal/non-causal status for each variant, we also filter cases with different patterns of features via the posterior probability at each variant. We first bootstrap a set of samples from all of the cases and estimate the model structural parameters via the aforementioned algorithms. In the implementation, we include a short-cut to first filter out L samples with higher posterior probabilities, and then apply the bootstrap on those selected ones to speed up the process.The posterior probability at a specific site s carrying genotype i is

$$\zeta\left(G_{i,s}\right) \propto \text{exp}\frac{\alpha_{R}(s,i) \beta_{R}(s,j)}{\alpha_{R}(M,i)} $$

If the posterior probability ζ(G i,s ) is less than \(\zeta \left (\bar {G}_{i,s}\right)\), we consider the variant as one with different feature pattern. If a case carries a number of such variants, where the number is a user-setting parameter, we will eliminate it from downstream association analysis.

Results and discussion

To test the performance of the proposed approach, we first apply RareProb-C to a series of simulated datasets under different configurations, and compare the results to 6 recent published methods. As mentioned above, according to the ways of selecting candidate variants, the existing model-free approaches could be summarized into one of three categories: weighted-based methods, regression-based methods and algorithmic selection methods. We choose at least one approach from each category. The approaches, to which we compare, include 1) RWAS, BUR and LiMB from weighted-based group, 2) CCRS from regression-based group and 3) RareCover and LBL from algorithmic selection group. We also apply RareProb-C to a set of 429 TCGA samples from the ovarian group, and then the results are verified by the literature review.

Generating simulation datasets

We adopt the same way of generating simulation datasets as in [7, 8, 15]. It implements a fixed-number strategy. The number of causal variants, denoted by C, and the total number of variants M in each dataset are preset. To conduct a fair comparison, each rare variant in this experiment is generated independently because the existing approaches do not consider the interacting rare variants. In the real dataset later on, the interacting germline and somatic variants are involoved then. The assumption behind the fixed-number strategy is that linkage disequilibrium between rare variants does not exist. The minor allele frequency of each variant in these cases follows the Wright’s distribution:

$$f\left(\rho_{i}\right) \propto \left(\rho_{i}\right)^{\beta_{i} - 1}\left(1 - \rho_{i}\right)^{\beta_{L} - 1}e^{\sigma - \rho_{i}\sigma} $$

where ρ i is the MAF at site i, and σ is the selection coefficient. A causal variant may be repaired to a neutral one with the probability of β L , while β i is the probability of mutating a neutral variant to causal one. We adopt the same parameter setting as in [7, 8, 15], where we set σ=12.0, β i =0.001 and β L =0.00033. The relative risk (RR) of a causal variant is defined as \(RR = \frac {\delta }{\left (1 - \delta \right)\rho _{i}} + 1\), where δ is the marginal population attributed risk (PAR), which is equal to the group PAR divided by the number of causal variants. The reason is that the sum of relative risk among all the variants should be equal to 1. The minor allele frequency in controls is controlled by relative risk, which is \(\theta _{i} = \frac {RR \times \rho _{i}}{(RR - 1)\rho _{i} + 1}\). In each dataset, we first randomly choose C causal variants in a set of 100 variants and record the causal ones into X k n o w n . Then, according to the previous strategy, we generate 1000 cases and 1000 controls.

Comparisons between RareProb-C and the existing approaches

We first compare the statistical power and type-I error rate. For the statistical power, we take the same measurement used in [7, 8, 15], where the power of an approach is measured by the number of significant datasets among all the datasets with the same configuration, using a significance threshold of 2.5 ×10−6 based on the Bonferroni correction, assuming 20000 genes genome-wide. The type-I error rate of each experiment is defined as the probability of a preset causal variant not being selected for the set of candidate variants. We test exactly 100 datasets for each comparative experimental configuration, where the type-I error rates take the average. In this group of experiments, we hold the number of preset causal variants to 50 and vary the population attributed risk (PAR) from 0.02 to 0.05 and the results are shown in Table 1. According to Table 1, we are able to summarize that most of the approaches can achieve very high statistical powers, among which RareProb-C presents lower type-I error rates than the other approaches in most of the simulation configurations. We also would like to explain a little more for the two exceptions, RareCover and LRT. The type-I error rates of RareCover are calculated from the significant datasets only. Although RareCover’s rates are much lower than the other approaches, the statistical power is always the premise. For LRT, as there is no prior information for the simulation datasets to conduct variant selection, LRT involves all given variants into the likelihood ratio test. And thus, LRT does not have type-I error rates but always has 100% type-II error rates.

Table 1 The statistical powers and the type-I error rates of RareProb-C and other approaches on the simulation datasets

To further evaluate the performance, we also compute the type-II error rate. The type-II error rate of each experiment is defined as the probability of a preset neural variant being selected for the set of candidate variants. We test exactly 100 datasets for each comparative experimental configuration, where the type-II error rates take the average. In the following experiments, we vary the population attributed risk (PAR) from 0.02 to 0.05 and also enumerate the number of preset causal variants from 50 to 90 and the results are listed in Table 2. From Table 2 we can see that RareProb-C offers a significant improvement in reducing the type-II error rate, which is considered to be very important in clinical genomics.

Table 2 Comparison results among RareProb-C and the other 5 state-of-the-art approaches on the type-II error rates

Null-model test for RareProb-C

We also apply a null-model test on RareProb-C to collect the dataset-level type-I error rate. The type-I error at dataset level measures how frequently a non-significant dataset (consisting of non-causal variants) is wrongly reported as a significant association dataset. We randomly generate one million datasets, each consisting of 1000 samples with 100 variants each. At each sample site, a mutation is assigned with the probability of 0.005. For each sample, it has the same probability of being set as a case or a control. Among these 10,000 datasets, RareProb-C only reports 9 significant datasets, which shows strong reliability. LBL reports 24 significant datasets. CCRS reports 122 significant datasets. Both BUR and LiMB report 0 significant datasets.

Experiments on cancer sequencing data

We then apply RareProb-C to a real cancer sequencing dataset. This dataset consists of 429 TCGA serous ovarian cancer (OV) cases [1, 23]. Each case has one tumor sample with whole exome sequencing data and one normal sample with whole exome sequencing data. All of the data are aligned to human reference build 37 using BWA, and variants are identified using VarScan, GATK, and Pindel, with stringent downstream filtering to standardize specificity. Variant annotation is based on Ensembl release 70_37_v5. The variant list for association analysis contains 3050 germline truncation variants and 4724 somatic truncation mutations. Read count and variant allelic frequency analysis are performed by the bam-readcount tool, available at Somatic variants with VAFs can be downloaded from the supplemental data of [23]. The control cohort is from the NHLBI Women’s Health Initiative (WHI), which consists of 557 samples. The variant calls for each WHI sample are collected via the same pipeline with OV cases.

RareProb-C applied an exome-wide association analysis on the total of 7774 variants. It reports 9 genes harboring causal variants with significant p-values, which are BRCA1, BRCA2, CHEK2, BRIP1, USP6, PALB2, ATM, PCSK7 and FLT3. Among these 9 genes, 5 genes (BRCA1, BRCA2, CHEK2, BRIP1 and USP6) are highlighted as significant susceptibility genes associated to ovarian cancer in an integrated germline-somatic study on the same TCGA OV cases [23]. The association analysis results between RareProb-C and this research are shown in Table 3.

Table 3 Significant associated genes identified by RareProb-C comparing to the ones highlighted in the integrated germline-somatic research on the same dataset

Moreover, for the 3 genes (ATM, PCSK7 and FLT3) that are not reported as significant in [23], our findings are also noteworthy. ATM is a known ovarian cancer associated gene reported in multiple researches [24, 25]. According to the literature review, we find that both PCSK7 and FLT3 are reported to be associated with ovarian cancer in [26], respectively. As a comparison, we also run RWAS+SIFT, LRT+SIFT, RareCover and RareProb on the same dataset. However, RareCover only identifies BRCA1 and BRCA2 with significant associations, while RareProb reports BRCA2, CHEK2 and YWHAE. Unfortunately, we do not find literature supports YWHAE. Other approaches do not report significant results.


In this article, we introduce an improved burden-test pipeline for cancer sequencing data, RareProb-C. This new pipeline is a model-free association approach. It considers the interactions among the given variants, by incorporating variant allelic frequencies and other estimations directly from sequencing data. It is able to overcome several known weaknesses of the existing collapsing methods. RareProb-C significantly extends and enhances the hidden Markov random field model in RareProb and technically estimates the hidden states and model parameter with fewer degrees of freedom. We apply RareProb-C to a set of TCGA ovarian cancer cases and a control cohort from NHLBI Women’s Health Initiative. RareProb-C successfully identifies several significant associations, which are strongly supported by multiple researches. In the comparisons of RareProb-C on simulation datasets under different simulation configurations, the results demonstrate that this new approach outperforms 6 popular approaches in terms of statistical power, sensitivity and specificity.


  1. The Cancer Genome Atlas. 2016. Accessed 5 June 2016.

  2. International Cancer Genome Consortium. 2016. Accessed 2 Feb 2017.

  3. Kandoth C, McLellan M, Vandin F, et al.Mutational landscape and significance across 12 major cancer types. Nature. 2013; 502:333–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Lu C, Xie M, Wendl M, Wang J, McLellan M, Leiserson M, et al.Patterns and functional implications of rare germline variants across 12 cancer types. Nat Commun. 2015; 6:10086.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Asimit J, Zeggini E. Rare variant association analysis methods for complex traits. Annu Rev Genet. 2010; 44:293–308.

    Article  CAS  PubMed  Google Scholar 

  6. Wagner M. Rare-variant genome-wide association studies: a new frontier in genetic analysis of complex traits. Pharmacogenomics. 2013; 14:413–24.

    Article  CAS  PubMed  Google Scholar 

  7. Sul J, Han B, He D, et al.An optimal weighted aggregated association test for identification of rare variants involved in common diseases. Genetics. 2011; 188:181–8.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Sul J, Han B, Eskin E. Increasing power of groupwise association test with likelihood ratio test. J Comput Biol. 2011; 18:1611–24.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Coombes B, Basu S, Guha S, et al.Weighted score tests implementing model-averaging schemes in detection of rare variants in case-control studies. PLoS ONE. 2015; 10:e0139355.

    Article  PubMed  PubMed Central  Google Scholar 

  10. Mukhopadhyay I, Feingold E, Weeks D, et al.Association tests using kernel-based measures of multi-locus genotype similarity between individuals. Genet Epidemiol. 2010; 34:213–21.

    PubMed  PubMed Central  Google Scholar 

  11. Wu M, Lee S, Cai T, et al.Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011; 89:82–93.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Lee S, Wu M, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012; 13:762–75.

    Article  PubMed  PubMed Central  Google Scholar 

  13. Yazdani A, Yazdani A, Boerwinkle E. Rare variants analysis using penalization methods for whole genome sequence data. BMC Bioinforma. 2015; 16:405.

    Article  Google Scholar 

  14. Bhatia G, Bansal V, Harismendy O, et al.A covering method for detecting genetic associations between rare variants and common phenotypes. PLoS Comput Biol. 2010; 6:e1000954.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Wang J, Zhao Z, Cao Z, et al.A probabilistic method for identifying rare variants underlying complex traits. BMC Genomics. 2013; 14:S11.

    Article  PubMed  PubMed Central  Google Scholar 

  16. Biswas S, Papachristou C. Evaluation of logistic Bayesian LASSO for identifying association with rare haplotypes. BMC Proc. 2014; 8:S54.

    Article  PubMed  PubMed Central  Google Scholar 

  17. Geng Y, Zhao Z, Zhang X, et al.An improved burden-test pipeline for cancer sequencing data In: Bourgeois A, Skums P, Wan X, Zelikovsky A, editors. Bioinformatics Research & Applications ISBRA 2016, LNCS (LNBI); vol. 9683. Cham: Springer. p. 314–5.

  18. Pritchard J. Are rare variants responsible for susceptibility to complex diseasesAm J Hum Genet. 2001; 69:124–37.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Hu H, Huff C. Detecting statistical interaction between somatic mutational events and germline variation from next-generation sequence data. In: Proceedings of Pacific Symposium on Biocomputing. New Jersey: World Scientific: 2014. p. 51–62.

    Google Scholar 

  20. Xie M, Lu C, Wang J, et al.Age-related mutations associated with clonal hematopoietic expansion and malignancies. Nat Med. 2014; 20:1472–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Wang J, Moore K, Zhang Q, et al.Genome-wide compatible SNP intervals and their properties. In: Proceedings of In Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology. New York: ACM: 2010. p. 43–52.

    Google Scholar 

  22. Kang H, Zaitlen N, Eskin E. EMINIM: an adaptive and memory-efficient algorithm for genotype imputation. J Comput Biol. 2010; 17:547–60.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Kanchi K, Johnson K, Lu C. Integrated analysis of germline and somatic variants in ovarian cancer. Nat Commun. 2014; 5:3156.

    Article  PubMed  PubMed Central  Google Scholar 

  24. Weissman S, Weiss S, Newlin A. Genetic testing by cancer site: Ovary. Cancer J. 2012; 18:320–7.

    Article  PubMed  Google Scholar 

  25. Walsha T, Casadeia S, Leea M, et al.Mutations in 12 genes for inherited ovarian, fallopian tube, and peritoneal carcinoma identified by massively parallel sequencing. Proc Nal Acad Sci USA. 2011; 108:18032–7.

    Article  Google Scholar 

  26. Page R, Klein-Szanto A, Litwin S, et al.Increased expression of the pro-protein convertase furin predicts decreased survival in ovarian cancer. Cell Oncol. 2007; 29:289–99.

    CAS  PubMed  PubMed Central  Google Scholar 

Download references


The authors would like to thank the conference organizers of ISBRA 2016. A 2-page abstract has been published in Lecture notes in computer science: Bioinformatics research and applications. We also would like to thank the reviewers for their valuable comments and suggestions, which guide us to improve the work and manuscript.


Publication of this article is funded by the National Science Foundation of China (Grant No: 31701150, 81400632), Shaanxi Science Plan Project (Grant No: 2014JM8350) and the Fundamental Research Funds for the Central Universities (CXTD2017003).

Availability of data and materials

All of the source codes have been uploaded to:, and are freely available for academic use only.

About this supplement

This article has been published as part of BMC Genomics Volume 18 Supplement 7, 2017: Selected articles from the 12th International Symposium on Bioinformatics Research and Applications (ISBRA-16): genomics. The full contents of the supplement are available online at

Author information

Authors and Affiliations



JYW and ZMZ conducted this research. WKW, YG and XPZ designed the algorithm, WKW, XJC, YG and JYW applied the simulation experiments, YG, KY applied the experiments on TCGA dataset, JYW, WKW and XX wrote this manuscript. All authors read and approved the final version of this manuscript.

Corresponding authors

Correspondence to Zhongmeng Zhao or Jiayin Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional information

Authors’ information

JYW, XX, ZMZ, ZXP, YG, XJC, KY are with the Shaanxi Engineering Research Center of Medical and Health Big Data. In addition, ZMZ, ZXP, YG, XJC are with the Department of Computer Science and Technology of Xi’an Jiaotong University. KY is with the Ye Lab for Omics and Omics Informatics of Xi’an Jiaotong University. WKW has graduated from the Department of Computer Science and Technology of Xi’an Jiaotong University.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Geng, Y., Zhao, Z., Zhang, X. et al. An improved burden-test pipeline for identifying associations from rare germline and somatic variants. BMC Genomics 18 (Suppl 7), 753 (2017).

Download citation

  • Published:

  • DOI: