In this section, we will describe the algorithms used to solve the problem. We first design a dynamic programming algorithm that gives an exact solution and runs in O(n × 2t × t) time, where n is the number of columns in M, and t is the maximum coverage of a column in M. The dynamic programming algorithm will be very slow when t is large. We then design a heuristic algorithm that first computes an initial pair of haplotypes by using the dynamic programming algorithm on only a subset of M. This initial pair of haplotypes can be viewed as an approximation to the optimal solution. To obtain a better solution, we further introduce some techniques to refine the initial solution.
A dynamic programming algorithm
Recall that the goal of the haplotype assembly problem is to partition the rows of the input fragment matrix M into two groups, each of which determining a haplotype. To obtain an optimal partition, a naive approach is to enumerate all possible partitions on the rows of M, among which we then choose the one minimizing MEC. For an instance with m rows, there are 2m total partitions, and thus the approach does not work in practice. Here we introduce a dynamic programming algorithm for the haplotype assembly problem with MEC that runs in O(n × 2t × t) time, where n is the number of columns in M, and t is the maximum coverage of a column in M.
Before we give the details of the dynamic programming algorithm, we first define some basic notations that will be used later:
-
R
i
: the set of rows covering column i in M.
-
P
j
(i): the j-th partition on R
i
.
-
Q
j
(i): the j-th partition on R
i
∩ R
i
+1.
-
P
j
(i)|
Ri
∩Ri+1: the partition on R
i
∩ R
i
+1 obtained from P
j
(i) by restriction on the rows in R
i
∩ R
i
+1.
-
QQ
j
(i): the set of partitions P
k
(i) such that P
k
(i)|
Ri
∩Ri+1= Q
j
(i).
-
C(P
j
(i)): the minimum number of corrections to be made in column i of M when the partition on R
i
is indicated by P
j
(i).
-
MEC(i, P
j
(i)): the optimal cost for the first i columns in M such that column i has a partition P
j
(i).
In order to compute MEC(i + 1, P
j
(i + 1)) efficiently, we define
(3)
Let P
j
(i + 1) be the j-th partition on R
i
+1, Q
k
(i) = P
j
(i + 1)|
Ri
∩Ri+1. The recursion formula of the dynamic programming algorithm is illustrated as follows:
(4)
Based on P
j
(i + 1), we can get Q
k
(i) in O(t) time. Furthermore, we know the majority value (0 or 1) at column (i + 1) in each group. To compute C(P
j
(i + 1)), we can simply count the number of minorities in each group (at column (i + 1)) separately, and then add them up. Thus, it takes O(t) time to compute C(P
j
(i + 1)).
The optimal MEC cost for partitioning all the rows of M is the smallest MEC(n, P
j
(n)) over all possible P
j
(n), where n is the number of columns in M. A standard backtracking process can be used to obtain the optimal solution.
Let us look at the time complexity of the dynamic programming algorithm. To compute each MEC(i + 1, P
j
(i + 1)) in Equation (4), it requires O(t) time to compute C(P
j
(i + 1)). Thus, it takes O(n × 2t × t) time to compute all C(P
j
(i + 1))s for all the n columns in M. Now, let us look at the way to compute M E(i, Q
k
(i))s. For each partition P
j
(i) on R
i
, we can get Q
k
(i) = P
j
(i)|
Ri
∩Ri+1in O(t) time. We then update ME(i, Q
k
(i)) if the current value of ME(i, Q
k
(i)) is greater than MEC(i, P
j
(i)). There are at most 2t P
j
(i)s on R
i
. Thus, it takes O(t × 2t) time to compute all ME(i, Q
k
(i))s on R
i
. Since there are n columns in M, the total time required for computing all ME(i, Q
k
(i))s is O(n × 2t × t).
Theorem 1 Given a fragment matrix M, there is an O(n × 2t × t) time algorithm to compute an optimal solution for the haplotype assembly problem with MEC, where n is the number of columns in M, and t is the maximum coverage of a column in M.
Obtaining an initial solution via randomized sampling
The dynamic programming algorithm works well when t is relatively small. However, it will be very slow when t is large. To solve the problem when t is large, we look at each column i at a time, randomly select a fixed number of rows, say, boundOfCoverage, from the set of rows covering it and delete the characters in the remaining rows at all the columns after i - 1. After that, the coverage of each column in the newly obtained submatrix is at most boundOfCoverage. We then run the dynamic programming algorithm on the submatrix. The resulting pair of haplotypes, which is referred to as the initial solution, can be viewed as an approximation to the optimal solution.
The detailed procedure for obtaining a submatrix from M via the randomized sampling approach is as follows:
-
1.
Compute the coverage c
i
for each column i in M.
-
2.
For i = 1 to n, perform the following steps.
-
3.
If c
i
≤ boundOfCoverage, do nothing and goto the next column. Otherwise, goto step 4.
-
4.
Randomly choose boundOfCoverage rows from the set of rows covering column i. Let be the set of rows covering column i but are not chosen during this process.
-
5.
For each row , cut r from column i such that it no longer covers any column larger than i (including i). Accordingly, we need to reduce c
j
by 1 for each i ≤ j ≤ k, where k is the end position of r before being cut.
By employing this randomized sampling strategy, we can always make sure that the maximum coverage is bounded by the threshold boundOfCoverage in the selected submatrix. How to choose a proper value for boundOfCoverage? Actually, there is a tradeoff between the running time and the quality of the initial solution output by the dynamic programming algorithm. On one hand, reducing boundOfCoverage can reduce the running time of the algorithm. However, on the other hand, increasing boundOfCoverage can maintain more information from M. As a result, the initial solution output by the dynamic programming algorithm has a higher chance to be close to the optimal solution. In practice, boundOfCoverage is generally no larger than 15, which is feasible in terms of running time and is large enough to sample sufficient information from M. See Section Experiments for a detailed discussion on how the size of boundOfCoverage affects the initial solution.
Refining the initial solution with all fragments
In the newly obtained submatrix, it is possible that (1) some columns are not covered by any rows, thus leaving the haplotype values at these SNP sites undetermined in the initial solution, (2) the haplotype values at some SNP sites in the initial solution are wrongly determined due to the lack of sufficient information sampled from M during the randomized sampling process. In this subsection, we try to refine the initial solution with all input fragments, aiming to fill haplotype values that are left undetermined and correct those that are wrongly determined.
The refining procedure contains several iterations. In each iteration, we take two haplotypes as its input and output a new pair of haplotypes. Initially, the two haplotypes in the initial solution are used as the input to the first iteration. The haplotypes output in an iteration are then used as the input to the subsequent iteration. In each iteration, we try to reassign the rows of M into two groups based on the two input haplotypes. More specifically, for each row r of M, we first compute the generalized hamming distance between r and the two haplotypes. Then, we assign r to the group associated with the haplotype that has the smaller (generalized hamming) distance with r. After reassigning all rows of M into two groups, we can compute a haplotype from each of the two groups by majority rule. At the same time, we can also obtain the corresponding MEC cost.
The refining procedure stops when, at the end of some iteration, the obtained haplotypes no longer change, or when a certain number of iterations have been finished. The two haplotypes output in the last iteration are the output of the refining procedure.
Voting procedure
To further reduce the effect of randomness caused by the randomized sampling process, we try to obtain several different submatrices from M by repeating the randomized sampling process several times. Accordingly, we can obtain several initial solutions, one derived from each submatrix. Furthermore, we can refine these initial solutions with all fragments. Given a set of solutions, each of which containing a pair of haplotypes, the goal here is to compute a single pair of haplotypes by adopting a voting procedure.
In the voting procedure, the two haplotypes are computed separately. We next see how to compute one of the two haplotypes. The other case is similar. Let S be the set of solutions used for voting. First, we find a set of haplotypes (denoted by S1), one from each solution in S, such that the haplotypes in S1 all correspond to the same copy of a chromosome. With S1, we can then compute a haplotype by majority rule. Simply speaking, at each SNP site, we count the number of 0s and 1s at the given SNP site over the haplotypes in S1. If we have more 0s, the resulting haplotype takes 0 at the SNP site, otherwise, it takes 1.
How to find S1? First, we need to clarify that the two haplotypes in each solution in S are unordered. That is, given a solution H = (h1, h2), we do not know which chromosome copy h1 (or h2) corresponds to. So, we should first find the correspondence between the haplotypes in different solutions. Let be the set of solutions in S. Without loss of generality, assume that the MEC cost associated with H1 is the smallest among all the y solutions. We use H1 as our reference and try to find the correspondence between haplotypes in H1 and other solutions. For each i (1 < i ≤ y), we first compute two generalized hamming distances and . If , we claim that corresponds to the same chromosome copy as . Otherwise, corresponds to the same chromosome copy as . As a result, the set of haplotypes in S that correspond to the same chromosome copy as is the S1 we want to find.
Assume that at the beginning of this procedure, we obtain x solutions by repeating the randomized sampling process along with the refining procedure x times. It is worth mentioning that in the voting procedure, we only use part of the solutions, say, the first y (y ≤ x) solutions with the highest quality. Given two solutions A and B, we say that A has higher quality than B if the MEC cost associated with A is smaller than that of B. In this case, we assume that A is much closer to the optimal solution and contains less noises than B. To reduce the sideeffect of noises and improve the quality of the solution output by the voting procedure, it is helpful to use only solutions with high quality in the voting procedure.
Summarization of the algorithm
Generally speaking, given an input fragment matrix M, our heuristic algorithm can be summarized as the following four steps.
Step 1: We first perform a preprocessing on M to detect possible errors in it. After removing errors from M, we further convert it into in which each entry is encoded by a character from the alphabet . See Section Preliminaries for more details. is used as the input to the following steps.
Step 2: We compute an initial solution by running the dynamic programming algorithm on a subset of . The submatrix is computed by using the randomized sampling approach.
Step 3: Refine the initial solution with all the fragments in , instead of the submatrix that is used to generate the initial solution in Step 2.
Step 4: To further reduce the effect of randomness caused by the randomized sampling process, we repeat Step 2 and Step 3 several times. Each repeat ends with a solution, from which we then compute a single pair of haplotypes by adopting the voting procedure. The resulting pair of haplotypes is the output of our algorithm.