RareProb-C consists of four components, and RareProb [15] becomes a core module of one component. The three new components are described in this section, whose functions are detecting interacting variants, estimating the significant mutated regions, and removing singular variants or cases. During implementation, all these components are executed simultaneously within a hidden Markov random field framework. Supposing that we are given M variants on a set of N samples. Variants could include both single-site ones and small indels. Large structural variations are not discussed here, as their functional analyses are often involved in either large LOH/CNV or more complex mechanisms [4]. Suppose the given N samples consist of \(\frac {N}{2}\) cases and \(\frac {N}{2}\) controls. If the number of cases is not equal to the number of controls, the following equations can be used simply by applying non-centrality parameters.
Detecting interacting variants
For germline variant s, let θ
s
denote the variant allelic frequency (VAF) in a tumor sample and ρ
s
denote the variant allelic frequency in the paired normal sample. Let \(c_{s}^{+}\) and \(c_{s}^{-}\) represent the number of reads supporting the mutation at site s in the tumor and normal samples, respectively. Two binomial distributions can be drawn for the tumor and normal samples: \(c_{s}^{+} \sim \text {Bin}\left (d_{s}^{+},\theta _{s}\right)\) and \(c_{s}^{-} \sim \text {Bin}\left (d_{s}^{-},\rho _{s}\right)\), where \(d_{s}^{+}\) and \(d_{s}^{-}\) denote the respective read depths. For this site across samples, the statistic of the difference between θ
s
and ρ
s
is calculated by a pair-wise t-test. Similar to the linear kernel function [11], which calculates genetic similarities, we weight the interactions of two variants s and s
′ by \(\omega _{s,s'} \sim \frac {2t_{s}t_{s'}}{t_{s}^{2} + t_{s'}^{2}}\), which affects how likely these two variants would be collapsed. For somatic mutations, only θ
· and \(c_{\cdot }^{+}\) are considered. By applying clonal architecture analysis [3], θ
· are clustered into multiple sub-clones and the correlation coefficient \(\phantom {\dot {i}\!}\omega _{\cdot,\cdot ^{\prime }}\) of each mutation-pair within a sub-clone is obtained simultaneously.
Furthermore, the interactions among multiple variants may vary from two adjacent sites to a few sites within a certain region/pathway, e.g., within a gene or multiple genes. Let G
·,s
represent the genotypes at any site s across all given samples. Here we assume that the observed genotype reflects a Bayesian classifier from multiple factors, such as \(c_{s}^{+}\) and \(c_{s}^{-}\). Let m(G
·,s
) denote the classifier model for each case at site s. Then, the interactions among s and neighboring variants n(s) can be described as the conditional probability of m(G
·,s
) undergoing \(\phantom {\dot {i}\!}m\left (G_{\cdot,s'}\right)\), where \(\phantom {\dot {i}\!}s^{\prime }\in n(s)\). To clearly describe the patterns of interactions, we have
$${} P\left(G_{\cdot,s} | G_{\cdot,s_{n(s)}}\right) \propto \text{exp}\left(\tau m\left(G_{\cdot,s}\right) + \nu \sum_{s'\in n(s)} \omega_{s,s'} m\left(G_{\cdot,s'}\right) \right) $$
where the first component, τ
m(G
·,s
) represents the solo effects from s itself, while the second component describes the effects from the interacting ones.
The variant allelic frequency for each site consists of an N-dimensional vector. Let the vector be a state in the Markov random field. Then, state H
s
=[G
·,s
]−1 corresponding to site s, and based on the Markov-Gibbs equivalence, for two interacting variants s and s
′, the interaction could be represented as a transition probability between H
s
and \(\phantom {\dot {i}\!}H_{s^{\prime }}\), which is
$$\begin{aligned} P\left(H_{s} | H_{s^{\prime}}, \theta \right) \propto& \text{exp}\left(\sum_{i=1}^{N} \mu + \left(1 - \mu\right) f\right.\\&\left. \left(-\theta d_{s,s' } + I_{s,s'}e^{-\theta d_{s,s'}}\right){\vphantom{\sum_{i=1}^{N}}}\right) \end{aligned} $$
where μ denotes the mutation rate by which a variant occurs independently, while θ denotes an unknown model parameter that describes the probability of a variant being located in a deleterious region. \(\phantom {\dot {i}\!}d_{s,s^{\prime }}\) is the genetic distance between two adjacent sites s and s
′, which underlies the length of a region.
Estimating the regions with interactions
The transition probability \(\phantom {\dot {i}\!}P\left (H_{s} | H_{s^{\prime }}, \theta \right)\) models the interacting patterns between two sites, and we then bridge each interacting pattern to the region vector R. R can be shared across cases with similar features.
For any site s, vector G
·,s
is used to estimate the status of region vector R
s
. To facilitate the computation, without loss of the Markov property, we first define two additional silent states affiliated to the state H
s
, which are \(R_{s}^{e}\) and \(R_{s}^{b}\). Then, we define that a Markov chain via state \(R_{s}^{e}\) denotes that site s is in an elevated region; otherwise, the chain though \(R_{s}^{b}\) denotes that site s is in a background region. Let \(P_{s}^{e} = P\left (R = 1 | H_{s}\right)\) be the transition probability of site s within an elevated region, while \(P_{s}^{b} = P\left (R = 1 | H_{s}\right)\) denotes the probability of site s within a background region and the Markov property limits \(P_{s}^{e} + P_{s}^{b} = 1\). Thus, the probability of site s being located in an elevated region can be represented by
$${} p\left(R_{s}^{e}|H_{s},H_{s^{\prime}},R_{n(s)}\right) \propto \text{exp}\left(\tau B\left(P_{s}^{e},P_{s}^{b}\right) + \nu \sum_{s' \in n(s)} \omega_{s,s'} R_{s'}\right) $$
and the joint probability of latent vector R can be represented by \(p\left (R;\Phi _{R}\right) \propto \text {exp}\left (\tau \sum _{s}^{M} R_{s} + \nu \sum _{s,s'} \omega _{s,s'} R_{s'}\right)\), where s
′∈n(s) and Φ
R
=(τ,ν). τ and ν are two model parameters balancing the importance of interactions, where τ+ν=1. \(\phantom {\dot {i}\!}R_{s'} = P\left (R_{s'}|X\right)\), where X is a M-dimensional vector that each element represent the causal/non-causal status of the corresponding variant.
Furthermore, we require each region to pass the four-gamete test, which limits the length of a region and potentially reduces false positives. We adopt a maximal-K-cover pipeline to divide \(G_{\frac {N}{2}\times M}\) into multiple compatible intervals, as proposed in [21]. This pipeline consists of a series of scanning algorithms. First, it establishes two interval sets, I
l
r
and I
r
l
, and obtains intervals from the left-to-right algorithm and right-to-left algorithm, respectively, where each set contains K intervals. Second, a merging algorithm combines these 2K intervals to extract the overlapping genotypes with the same index, for example, I
l,i
and I
r,i
. It then defines K cores according to the overlapping, denoted by C={o
1,o
2,⋯,o
K
}. Third, an uber-scan algorithm calculates K+m candidate intervals U={u
1,u
2,⋯,u
K
} and assigns these K+m intervals to each core in C. If o
i
contains u
j
, then u
j
is allocated in the i-th group. Finally, the maximal intervals are estimated via a dynamic programming algorithm. We restrict any region within one compatible interval, where the genotypes are restricted by having limited diversity.
Estimating model parameters
Based on the Gibbs-Markov equivalence, a pseudo-likelihood iteration algorithm can solve this model to estimate the unknown model parameters and the hidden states. Again, due to the computational complexity, we introduce another two silent field states to facilitate the computation process. For any site s, let \(\vec {J}_{i,\cdot }^{s}\) permute from \(\phantom {\dot {i}\!}\vec {G}_{\cdot,s^{\prime }}\), where the transition probability from \(J_{i,\cdot }^{s}\) to \(\vec {G}_{\cdot,s'}\) is \(\propto \frac {1}{N-1}\); let \(\vec {J}_{i}^{s}\) duplicate from the vector \(\phantom {\dot {i}\!}G_{\cdot,s^{\prime }}\), where the transition probability from \(\vec {J}_{i}^{s}\) to \(\vec {G}_{\cdot,s'}\) is 1. The other component of \(\phantom {\dot {i}\!}P\left (H_{s} | H_{s^{\prime }},\theta \right)\) can be divided into two separate transition probabilities, namely, \(P\left (J_{i}^{s} | R_{s}\right) = f\left (-\theta d_{s,s'}\right) + e^{-\theta d_{s,s'}}\) and \( P\left (J_{i,\cdot }^{s} | R_{s}\right) = \frac {1}{N-1}f\left (-\theta d_{s,s'}\right)\), where f(x) is (1−e
−x)/N.
Now, for the entire hidden Markov random field model, we need to estimate three unknown model parameters: the mutation rate μ, the deleterious region location likelihood θ and the regional Bayesian classifier \(P_{\cdot }^{e}\). \(P_{\cdot }^{b}\) fully depends on \(P_{\cdot }^{e}\). The forward probability for the hidden state R
s
is
$$\alpha_{R}(s) = \sum_{j=0}^{2N} \alpha_{R}\left(s - 1\right) P_{s}^{b}\cdot I + P_{s}^{e} \left(1 - I\right) P\left(R_{s}|X\right) $$
where the indicator function I=1 when m
o
d
e(j,2)=0. The forward probability for \(\vec {H}_{s}\) is
$$\alpha_{H}(s) = \sum_{j=0}^{N} \alpha_{H}\left(i,s-1\right)P\left(H_{s}\right) $$
Similarly, we have the forward and backward probabilities for hidden states R
s
and H
s
at the same site, which are β
R
(s) and β
H
(s).
We incorporate an EM algorithm described in [22] into the original iterated conditional mode algorithm in RareProb to estimate the model parameters and update the hidden states. In iteration r, the algorithm estimates the hidden parameters as follows:
$$\mu_{r+1} = \frac{\sum_{i=1}^{N}\sum_{j=0}^{M} I(H_{s} \neq G_{i,s})P(H_{s}|G,\mu_{r},\theta_{r})}{\sum_{i=1}^{N}\sum_{j=0}^{M} I\left(G_{i,s} > 0\right) P(H_{s} |G,\mu_{r},\theta_{r})} $$
and
$$\begin{aligned} \theta_{r+1} =& argmax \left(\sum_{i=1}^{M} P\left(J_{i}^{s} | R_{s}\right) \sum_{j=1}^{N} P\left(J_{i}^{s} | G,\mu_{r},\theta_{r}\right)\right. \\&\qquad \left.+ P\left(J_{i,\cdot}^{s} | G,\mu_{r},\theta_{r}\right) {\vphantom{ \sum_{i=1}^{M}}}\right) \end{aligned} $$
Let ξ
R
(s,i,j) represent the probability of R
s
equal to i and \(\phantom {\dot {i}\!}R_{s^{\prime }}\) equal to j, with the conditions of \(\hat {R}\) and model parameters. Then, we have
$${} \xi_{R}\left(s,i,j\right) = \frac{\alpha_{R}(s,i) P\left(H_{i} | H_{j}\right) P\left(R_{s}|X\right) \beta_{R}(s,j)}{\sum_{i=1}^{2N} \sum_{j=1}^{2N} \alpha_{R}(s,i) P\left(H_{i} | H_{j}\right) P\left(R_{s}|X\right) \beta_{R}(s,j)} $$
thus,
$$\left(P_{s}^{b}\right)_{r+1} = \frac{\sum_{s}^{M} \xi_{R}\left(s,0,0\right)}{\frac{1}{N}\left(1 + \left(N - 1\right)e^{-\theta d_{s,s'}}\right)} $$
Once the differences in parameters between two iterations are less than a preset threshold(s), this algorithm terminates.
Filtering singular cases
With the estimation of regions and causal/non-causal status for each variant, we also filter cases with different patterns of features via the posterior probability at each variant. We first bootstrap a set of samples from all of the cases and estimate the model structural parameters via the aforementioned algorithms. In the implementation, we include a short-cut to first filter out L samples with higher posterior probabilities, and then apply the bootstrap on those selected ones to speed up the process.The posterior probability at a specific site s carrying genotype i is
$$\zeta\left(G_{i,s}\right) \propto \text{exp}\frac{\alpha_{R}(s,i) \beta_{R}(s,j)}{\alpha_{R}(M,i)} $$
If the posterior probability ζ(G
i,s
) is less than \(\zeta \left (\bar {G}_{i,s}\right)\), we consider the variant as one with different feature pattern. If a case carries a number of such variants, where the number is a user-setting parameter, we will eliminate it from downstream association analysis.