Gibbs sampling
Our method is based on Gibbs sampling, so we first introduce its general idea below. Consider the following distribution, typically used for inference in graphical models:
$$P(X) = \frac{1}{Z} \exp\left(\mu \sum_{i=1}^{d}\sum_{j=1}^{d} \phi_{ij}(x_{i}, x_{j})\right) \tag{1}$$
where $X = (x_1, x_2, \dots, x_d)$ is a d-dimensional vector and Z is a normalization factor. The function $\phi_{ij}$ specifies the edge potential for two variables connected by an edge. We would like to collect samples of X from this distribution P(X).
The Gibbs sampler is a special case of the Markov chain Monte Carlo (MCMC) method [20], which is guaranteed to converge to the equilibrium distribution after sufficient burn-in iterations. In each iteration, it randomly samples one variable $x_i$ based on the conditional probability $P(x_i \mid x_{[-i]})$, where all other variables $x_{[-i]} = (x_1, x_2, \dots, x_{i-1}, x_{i+1}, \dots, x_d)$ are fixed. Formally, this conditional probability can be written using Bayes' rule:
$$P(x_{i} = t \mid x_{[-i]}) = \frac{P(x_{i} = t, x_{[-i]})}{\sum_{t'} P(x_{i} = t', x_{[-i]})} \tag{2}$$
Readers can refer to [21] for a more detailed description of Markov chain Monte Carlo methods.
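To make the procedure concrete, here is a minimal sketch of a Gibbs sampler for the pairwise model of Eq. 1 over binary variables; the potential `phi` and the weight `mu` are generic placeholders, not part of our phasing method.

```python
import numpy as np

def gibbs_sample(phi, mu, d, n_iter, rng=None):
    """Gibbs sampler for the pairwise model of Eq. 1 over binary
    variables. phi(i, j, xi, xj) is the edge potential; mu the weight.
    A generic sketch, not the Poly-Harsh sampler itself."""
    rng = rng or np.random.default_rng()
    x = rng.integers(0, 2, size=d)               # random initialization
    for _ in range(n_iter):
        i = rng.integers(d)                      # pick one variable
        # Unnormalized log P(x_i = t | x_[-i]) (Eq. 2): only the
        # potential terms involving x_i matter, the rest cancel.
        logp = np.array([
            mu * (phi(i, i, t, t)
                  + sum(phi(i, j, t, x[j]) + phi(j, i, x[j], t)
                        for j in range(d) if j != i))
            for t in (0, 1)
        ])
        p = np.exp(logp - logp.max())            # stable softmax
        x[i] = rng.choice(2, p=p / p.sum())      # resample x_i
    return x
```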
Polyploid haplotype phasing
The inputs for the polyploid haplotype phasing problem are the ploidy k (the number of haplotypes to be phased), the set of aligned sequencing reads X (we assume the raw reads have been aligned to a reference sequence, so the SNPs spanned by the reads have already been identified), and a sequencing error rate ε. A VCF (Variant Call Format) file containing the SNP positions and dosages is optional. The dosage information gives the number of reference and alternative alleles for a given SNP position, and can therefore be used to reduce the phasing search space and improve the phasing accuracy. Note that for some programs, such as HapCompass, the dosage information is mandatory.
The output of a phasing algorithm is the k phased haplotypes. There are several popular metrics for evaluating the performance of phasing algorithms, such as MEC (minimum error correction) [22], minimum fragment removal, MSR (minimum single nucleotide polymorphism (SNP) removal) [23], and two recent models, MFC (maximum fragments cut) [24] and BOP (balanced optimal partition) [25]. In this work, we focus on MEC as our metric.
MEC
We focus on minimizing the MEC between the phased haplotypes and the input read matrix, which is calculated as the total number of mismatches between the reads and their assigned haplotypes. The following formula gives the MEC for polyploid haplotypes:
$$MEC(X, H) = \sum_{j=1}^{m} \sum_{k=1}^{n} r_{jk} \times D(x_{j}, h_{k}) \tag{3}$$
where X is a set of m sequencing reads, H is the set of n haplotypes, $x_j$ is the j-th read, $h_k$ is the k-th haplotype, $D(x_j, h_k)$ is the number of mismatches between $x_j$ and $h_k$, and $r_{jk}$ is 1 if the j-th read is assigned to the k-th haplotype and 0 otherwise. A read is assigned to the haplotype that minimizes its number of mismatches. Note that mismatches occur only at SNP positions, as all other positions are homozygous. Thus the polyploid haplotype phasing problem is to phase k haplotypes H, given the set of reads X and the SNP positions, so that the objective function MEC(X, H) is minimized. It is known [8] that minimizing MEC is NP-hard even for diploid haplotype phasing when the length of the reads is greater than one.
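As a concrete illustration of Eq. 3, the following sketch computes the MEC for a read matrix; the encoding of reads as arrays with −1 marking uncovered SNP positions is an assumption made for illustration only.

```python
import numpy as np

def mec(reads, haplotypes):
    """MEC of Eq. 3: each read is charged the mismatch count against
    the haplotype it best matches (r_jk = 1 for that haplotype only).
    reads: (m, n_snps) array of 0/1 alleles with -1 for uncovered
    positions; haplotypes: (k, n_snps) 0/1 array."""
    total = 0
    for read in reads:
        covered = read >= 0                          # SNPs spanned by this read
        d = (haplotypes[:, covered] != read[covered]).sum(axis=1)
        total += int(d.min())                        # assign to best haplotype
    return total
```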
Recently, various methods have been proposed for the polyploid haplotype phasing problem. HapCompass [16] builds a compass graph, an undirected weighted graph, from the sequencing reads. In the compass graph, the vertices are the SNPs, an edge between a pair of SNPs indicates that at least one read spans both, and each edge carries an integer weight. It is shown that a compass graph has a unique phasing if it has no conflicting cycles, where a conflicting cycle is a simple cycle containing an odd number of negative edges, at least one 0-weight edge, or both. Haplotype phasings correspond to spanning trees of the graph. The phasing problem is thereby converted to a minimum weighted edge removal optimization on the graph, and an algorithm based on cycle-basis local optimizations is proposed for resolving conflicting cycles.
HapTree [17] aims to trim the search space of all possible haplotypes down to a much smaller set of more likely solutions. It takes an inductive approach, generating a collection of likely phasing solutions for the first two SNPs in the genome, extending those to phasing solutions for the first three SNPs, then the first four, and so on. When extending a particular solution, HapTree chooses (by computing likelihoods) how the alleles of the newly added SNP may be assigned to chromosomes, and includes only those assignments that are sufficiently likely. Once all SNPs to be phased have been included, HapTree randomly chooses a solution of maximum likelihood from among the solutions it has found. It has been shown [18] that the trimming process can be time-consuming in some cases.
SDHaP [18] formulates the haplotype assembly problem as a semi-definite program and exploits its special structure, namely the low rank of the underlying solution, to solve the problem rapidly and with high accuracy. A graph is defined in which the nodes are the reads and an edge between two nodes indicates that the corresponding reads overlap at one or more SNPs. Each edge carries a weight computed as follows:
$$W_{ij} = \frac{k_{sim} - k_{dissim}}{k_{sim} + k_{dissim}} \tag{4}$$
where $k_{sim}$ denotes the number of overlapping positions where the reads have identical alleles and $k_{dissim}$ is the number of positions where they differ. Then, given the graph and the ploidy, SDHaP aims to find k−1 cuts such that the sum of intra-partition edge weights is maximized and the sum of inter-partition edge weights is minimized; the problem is solved via correlation clustering. SDHaP formulates the problem as a semi-definite program (SDP) and employs a low-rank Lagrangian scheme followed by randomized projections and a greedy refinement of the k-ploid haplotypes to solve the SDP.
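For concreteness, the weight of Eq. 4 can be computed as in the sketch below; the dictionary encoding of reads (SNP index mapped to allele) is our own illustrative assumption, not SDHaP's data format.

```python
def edge_weight(read_a, read_b):
    """Edge weight of Eq. 4 between two overlapping reads.
    read_a, read_b: dicts mapping SNP index -> allele (0/1).
    Illustrative only; not SDHaP's own data structures."""
    shared = read_a.keys() & read_b.keys()           # overlapping SNPs
    if not shared:
        return 0.0                                   # no edge without overlap
    k_sim = sum(read_a[s] == read_b[s] for s in shared)
    k_dissim = len(shared) - k_sim
    return (k_sim - k_dissim) / (k_sim + k_dissim)
```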
H-PoP and H-PoPG [19] try to partition the DNA reads sequenced from a k-ploid organism into k groups such that reads in the same group share the same alleles on as many SNP loci as possible, while reads from different groups differ on as many loci as possible. Heuristic strategies are proposed that limit the number of intermediate solutions at each iteration of a dynamic programming algorithm. Note that H-PoP assumes no VCF file (which contains the dosage information of the variants) is provided, while H-PoPG accepts a VCF file as input.
The polyploid haplotype phasing method we propose in this work is an extension of Harsh [14], which applies Gibbs sampling to phase diploid haplotypes (namely, two haplotypes). In the diploid scenario the two haplotypes are complementary, so in effect only one haplotype needs to be phased. In the polyploid scenario there are more than two haplotypes, which can share common segments and are not necessarily complementary; the problem is therefore much more challenging. Harsh cannot be applied directly to polyploid haplotype phasing because the conditional probability estimation for the Gibbs sampling process needs to be completely redesigned.
Poly-Harsh
Haplotype blocks
Notice that there are cases where haplotype components are disconnected, i.e., we need to identify haplotype blocks that are not connected by any reads. There are two possible reasons for disconnected haplotype components, or blocks: adjacent SNPs may lie farther apart than the read length and thus are not spanned by any read; or the sequencing coverage is low, so not all SNPs are covered. To identify the haplotype blocks, we create a graph whose nodes are the SNPs, with an edge between two SNPs whenever some read connects them. We then identify the connected components of the graph, each of which gives the SNPs contained in one haplotype block. No read spans two blocks, and every read covers only SNPs from a single block. We then phase each block independently, using only the reads covering the SNPs of that specific block, as sketched below.
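A minimal sketch of this block identification, assuming each read is given as the list of SNP indices it covers; we use a small union-find in place of an explicit graph library.

```python
def haplotype_blocks(reads, n_snps):
    """Identify haplotype blocks: two SNPs are connected if some read
    spans both; blocks are the connected components of that graph.
    reads: iterable of lists of SNP indices covered by each read."""
    parent = list(range(n_snps))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i

    for snps in reads:
        for a, b in zip(snps, snps[1:]):         # link all SNPs of one read
            parent[find(a)] = find(b)

    blocks = {}
    for snp in range(n_snps):
        blocks.setdefault(find(snp), []).append(snp)
    return list(blocks.values())                 # each list is one block
```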
Gibbs sampling
In this work, we developed a Gibbs-sampling-based method, Poly-Harsh, for polyploid haplotype phasing. Our algorithm alternates between two major steps: fix the haplotypes and compute the read assignments; then fix the read assignments and compute the haplotypes. When we compare a haplotype to the reference haplotype, if the allele is the same as the reference, the genotype value is 0; if the allele is the alternative, the genotype value is 1. In cases where the allele is neither the reference nor the alternative, we simply assign the genotype value 1. For each SNP position i, we define a genotype value vector $h_i = [g_{1,i}, g_{2,i}, \dots, g_{k,i}]$, where $g_{j,i}$ is the genotype value of the j-th haplotype at the i-th SNP. For a given ploidy k (for illustration purposes and without loss of generality, assume k = 4), the genotype value vector is one of the $2^k$ binary vectors [1, 0, 0, 0], [1, 1, 0, 0], …, [0, 1, 1, 1]. Note that we ignore the vectors [0, 0, 0, 0] and [1, 1, 1, 1], as they indicate that the allele is homozygous. We also define a read assignment vector r as a k-dimensional vector $r = [a_1, a_2, \dots, a_k]$, where $a_i = 1$ if the read is assigned to the i-th haplotype and $a_i = 0$ otherwise. For example, for k = 4, if the read is assigned to the first haplotype, r = [1, 0, 0, 0]; if the read is assigned to the second haplotype, r = [0, 1, 0, 0]; and so forth. Note that a read cannot be assigned to more than one haplotype and, on the other hand, must be assigned to exactly one haplotype. Given the genotype value vector $h_i$ for the i-th SNP and a read assignment vector $r_j$ for the j-th read $x_j$, we define the following function:
$$\begin{aligned} \theta(h_{i}, r_{j}, x_{j}) &= t \ln(1-\epsilon) + (k - t) \ln(\epsilon) \\ t &= \mathrm{match}(h_{i}, r_{j} \times x_{j,i}) \end{aligned} \tag{5}$$
where ε is the sequencing error rate, $x_{j,i}$ is the i-th value of $x_j$, and $\mathrm{match}(A, B)$ counts the element-wise matches between two vectors A and B: the match count is increased by 1 whenever two corresponding elements are identical (either both 1 or both 0). For example, for $h_i = [1, 1, 0, 0]$, $r_j = [1, 0, 0, 0]$, and $x_{j,i} = 1$, we have $r_j \times x_{j,i} = [1, 0, 0, 0]$ and thus $t = \mathrm{match}(h_i, r_j \times x_{j,i}) = 3$. For $h_i = [1, 1, 0, 0]$, $r_j = [1, 0, 0, 0]$, and $x_{j,i} = 0$, we have $r_j \times x_{j,i} = [0, 0, 0, 0]$ and thus $t = \mathrm{match}(h_i, r_j \times x_{j,i}) = 2$. The θ function essentially models the probability of a correct read assignment given the matches between the read and the haplotypes; note that it accounts for mismatches due to sequencing error.
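The θ function is straightforward to compute; below is a minimal transcription of Eq. 5 for a single read and SNP, with the list encoding of the vectors assumed for illustration.

```python
import math

def theta(h_i, r_j, x_ji, eps):
    """theta of Eq. 5: log-likelihood term for one read at one SNP.
    h_i: genotype value vector at SNP i; r_j: read assignment vector;
    x_ji: allele (0/1) of read j at SNP i; eps: sequencing error rate."""
    k = len(h_i)
    rx = [a * x_ji for a in r_j]                      # r_j * x_{j,i}
    t = sum(1 for a, b in zip(h_i, rx) if a == b)     # match(h_i, r_j * x_{j,i})
    return t * math.log(1 - eps) + (k - t) * math.log(eps)
```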
Given the ploidy k, the set of genotype value vectors $H = [h_1, h_2, \dots, h_n]$, where $h_i = [g_{1,i}, g_{2,i}, \dots, g_{k,i}]$ as defined above and n is the number of SNPs, and the set of read assignment vectors $R = [r_1, r_2, \dots, r_m]$, where $r_i = [a_1, a_2, \dots, a_k]$ is the assignment vector for the i-th read and m is the number of reads, the sampling process proceeds as follows. We first randomly initialize H; then we fix H, compute the conditional probability P(R|H), and sample the read assignments R based on P(R|H). Next we fix R, compute the conditional probability P(H|R), and sample the genotype value vectors H based on P(H|R). For ploidy k, there are $2^k$ possible genotype value vectors for a specific SNP. Assuming the genotype value vector is h and the set of reads is $X = [x_1, x_2, \dots, x_m]$, we compute its probability as
$$P(h \mid R) = \frac{\exp\left(\sum_{j=1}^{m} \theta(h, r_{j}, x_{j})\right)}{\sum_{i=1}^{2^{k}} \exp\left(\sum_{j=1}^{m} \theta(h_{i}, r_{j}, x_{j})\right)} \tag{6}$$
where the function θ is defined in Eq. 5, $h_i$ is the i-th candidate genotype value vector, and $r_j$ is the assignment vector for the j-th read $x_j$. We then sample from this distribution to update the genotype values of the SNP. We perform a similar Gibbs sampling step for the read origins given fixed haplotypes: for a given read x, we sample to update its assignment vector on the fixed haplotypes H as below:
$$P(r \mid H) = \frac{\exp\left(\sum_{j} \theta(h_{j}, r, x)\right)}{\sum_{i=1}^{k} \exp\left(\sum_{j} \theta(h_{j}, r_{i}, x)\right)} \tag{7}$$
where the function θ is defined in Eq. 5, $h_j$ is the genotype value vector of the j-th SNP (the sums over j run over the SNPs covered by x), and $r_i$ is the assignment vector obtained when the read x is assigned to the i-th haplotype.
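As an illustration of the read-assignment step, the following sketch samples an assignment for one read from the distribution of Eq. 7, under our reading that the sums run over the SNPs covered by the read; the dictionary encoding of a read is an assumption for illustration.

```python
import math
import random

def sample_read_assignment(H, read, eps):
    """Sample one read's assignment from Eq. 7 given fixed haplotypes.
    H: per-SNP genotype value vectors h_j; read: dict mapping covered
    SNP index -> allele (0/1); eps: sequencing error rate. Assumes the
    sums in Eq. 7 run over the SNPs covered by the read."""
    k = len(H[0])
    log_scores = []
    for i in range(k):                     # candidate: assign read to haplotype i
        r_i = [1 if c == i else 0 for c in range(k)]
        s = 0.0
        for j, allele in read.items():     # theta (Eq. 5) summed over covered SNPs
            rx = [a * allele for a in r_i]
            t = sum(1 for a, b in zip(H[j], rx) if a == b)
            s += t * math.log(1 - eps) + (k - t) * math.log(eps)
        log_scores.append(s)
    m = max(log_scores)
    weights = [math.exp(s - m) for s in log_scores]      # normalize as in Eq. 7
    return random.choices(range(k), weights=weights)[0]  # sampled haplotype index
```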
Given H and R, we can easily compute the MEC of the phasing. Note that the k haplotypes can be constructed from the genotype value vectors H by concatenating, across SNPs, the genotype values belonging to the same haplotype; for example, haplotype one is $[g_{1,1}, g_{1,2}, \dots, g_{1,n}]$. The haplotype phasing problem therefore becomes identifying the optimal H that minimizes the MEC.
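For completeness, a one-step sketch of this reconstruction, transcribing the sentence above directly:

```python
def haplotypes_from_H(H):
    """Reconstruct the k haplotypes from the per-SNP genotype value
    vectors H: haplotype i collects the i-th genotype value of every
    SNP, e.g. haplotype one is [g_{1,1}, g_{1,2}, ..., g_{1,n}]."""
    k = len(H[0])
    return [[h_j[i] for h_j in H] for i in range(k)]
```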
Note that the computations of P(r|H) and P(h|R) differ significantly between the diploid and polyploid scenarios, since in the polyploid scenario r and h are k-dimensional vectors. The probabilities depend not only on the haplotype the read is assigned to, but also on the remaining haplotypes: if a haplotype has a mismatch with a read at a SNP position and the read is not assigned to that haplotype, not assigning the read to that haplotype counts as a correct operation.
We repeat the above two steps iteratively. In each iteration, we compute the MEC score of the reads. The process converges when the MEC stops improving or a maximum number of iterations is reached. Since Gibbs sampling is stochastic and its performance is affected by the random initialization of H, it may fall into local optima. We therefore re-run the program multiple times, each time starting from a different randomly initialized H, which helps the program escape local optima; a structural sketch of this outer loop is given below. Pseudocode for the algorithm is shown in Algorithm 1. Note that the algorithm shows only a single run of the procedure; to run it multiple times, we repeat steps 1 to 5, each time with a different randomly initialized H.
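The overall alternating procedure with restarts has the following shape. This is a structural sketch only: `sample_assignments`, `sample_haplotypes`, and `mec` are hypothetical stand-ins for the sampling steps of Eqs. 7 and 6 and the MEC of Eq. 3, passed in as functions so the sketch stays self-contained.

```python
import random

def poly_harsh_loop(reads, n_snps, k, sample_assignments, sample_haplotypes,
                    mec, n_iters=100, n_restarts=5):
    """Alternating Gibbs procedure with random restarts, tracking the
    best MEC found. The three function arguments are hypothetical
    stand-ins for the steps in the text (Eqs. 7, 6, and 3)."""
    best_H, best_mec = None, float("inf")
    for _ in range(n_restarts):                    # restarts escape local optima
        # random initialization of the genotype value vectors H
        H = [[random.randint(0, 1) for _ in range(k)] for _ in range(n_snps)]
        for _ in range(n_iters):
            R = sample_assignments(H, reads)       # fix H, sample R (Eq. 7)
            H = sample_haplotypes(R, reads)        # fix R, sample H (Eq. 6)
            score = mec(reads, H)
            if score < best_mec:
                best_H, best_mec = H, score
    return best_H, best_mec
```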