In general, our method consists of disassembling and reassembling genomic regions (Fig. 1). In the first step, we locate structurally consistent segments in an assembly (Fig. 1a) and extract these regions (Fig. 1b), thus disassembling the input assembly into smaller, validated portions (or contigs). In the second step, we reassemble the consistent regions (Fig. 1c) by applying an independently derived path cover model, first proposed as a scaffolding problem in [14], and fill the gaps. The details of our algorithms are presented below.
Disassembling
Long-read alignment
We first align error-corrected long reads to a target assembly (see Fig. 1a), using either Nucmer [17] or BLASR [23] (any long-read alignment tool could in theory be used here). The output of Nucmer/BLASR can be viewed as a sequence of aligned blocks \((DL_i, DR_i, RL_i, RR_i)\), such that the nucleotides in the range \([DL_i, DR_i]\) of the target assembly are aligned to the long read in the range \([RL_i, RR_i]\). If there are overlaps between consecutive aligned blocks, either in the draft sequence or in the long read, we cut off the left ends of the blocks on the right. We then attempt to close any small gaps between these aligned blocks via the Needleman-Wunsch algorithm, and call the resulting alignment region an alignment segment.
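The trimming-and-gap-closing step can be sketched as follows. This is a simplified illustration, not our implementation: `blocks_to_segment` and `needleman_wunsch` are hypothetical names, block coordinates are assumed to be 0-based and inclusive, and applying the same trim length to draft and read coordinates is a simplifying assumption of the sketch.

```python
# Sketch: turn consecutive aligned blocks (DL, DR, RL, RR) into one alignment
# segment by trimming overlaps and closing the remaining gaps with a global
# (Needleman-Wunsch) alignment. `needleman_wunsch` is a hypothetical helper
# standing in for any standard global-alignment implementation.

def blocks_to_segment(blocks, draft_seq, read_seq, needleman_wunsch):
    blocks = sorted(blocks)               # order blocks along the draft assembly
    segment = [blocks[0]]
    gap_alignments = []
    for dl, dr, rl, rr in blocks[1:]:
        _, prev_dr, _, prev_rr = segment[-1]
        # Trim the left end of the right-hand block if it overlaps the previous
        # block in the draft or in the read (same trim on both coordinates is a
        # simplifying assumption of this sketch).
        trim = max(prev_dr - dl + 1, prev_rr - rl + 1, 0)
        dl, rl = dl + trim, rl + trim
        # Close the remaining gap between consecutive blocks with a global alignment.
        gap_alignments.append(
            needleman_wunsch(draft_seq[prev_dr + 1:dl], read_seq[prev_rr + 1:rl])
        )
        segment.append((dl, dr, rl, rr))
    return segment, gap_alignments
```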
An alignment segment is kept and used if, for any sequence in the target assembly and a long read, the following hold: i) for any gap of length \(g_1\) between two aligned blocks \([DL_i, DR_i]\) and \([DL_{i+1}, DR_{i+1}]\) in the draft assembly, and the corresponding gap of length \(g_2\) between \([RL_i, RR_i]\) and \([RL_{i+1}, RR_{i+1}]\) in the long read, the value \(|g_1 - g_2|\) is within a chosen threshold (30 in our experiments); ii) the total length of all aligned blocks for a long read, divided by the length of the long read, is larger than a chosen threshold (0.99 in our experiments). After picking out these good alignment segments, if two alignment segments overlap by at least \(\delta\) (also a threshold, 50 in our experiments) in the target assembly, we merge them.
Since we do not have a quantitative measure for block quality, the choices of these thresholds are based on observations from custom visualizations of the alignment results. In general, the first threshold ensures that if the target assembly is of high quality, it is not broken into too many pieces after disassembly. The second threshold guarantees that the long reads are well contained in the target assembly. The third threshold eliminates overlaps that are not trustworthy while trying to retain contiguity. Using error-corrected long reads maximizes the number of obtained alignments, but is not a requirement for this step.
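As a concrete illustration, the segment-keeping criteria and the merge rule can be sketched as follows; the function names are hypothetical, and the threshold constants correspond to the values quoted above.

```python
# Sketch of the segment-filtering criteria described above.
# `blocks` is the list of (DL, DR, RL, RR) tuples forming one alignment segment.

GAP_DIFF_MAX = 30      # threshold on |g1 - g2| between consecutive blocks
COVERAGE_MIN = 0.99    # fraction of the long read covered by aligned blocks
MERGE_OVERLAP = 50     # minimum overlap (delta) for merging segments on the assembly

def keep_segment(blocks, read_length):
    # Criterion i): draft-gap and read-gap lengths must agree within GAP_DIFF_MAX.
    for (dl, dr, rl, rr), (ndl, ndr, nrl, nrr) in zip(blocks, blocks[1:]):
        g1 = ndl - dr - 1
        g2 = nrl - rr - 1
        if abs(g1 - g2) > GAP_DIFF_MAX:
            return False
    # Criterion ii): the aligned blocks must cover almost the whole long read.
    aligned = sum(rr - rl + 1 for _, _, rl, rr in blocks)
    return aligned / read_length > COVERAGE_MIN

def should_merge(seg_a, seg_b):
    # Two kept segments are merged if they overlap by at least MERGE_OVERLAP
    # positions on the target assembly (coordinates are inclusive (start, end) pairs).
    start = max(seg_a[0], seg_b[0])
    end = min(seg_a[1], seg_b[1])
    return end - start + 1 >= MERGE_OVERLAP
```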
The time complexity of the alignment step depends on the tool used (e.g., BLASR or Nucmer). Sorting the aligned blocks takes \(O(a \log a)\) time, where \(a\) is the total number of aligned blocks. Cutting off the overlaps takes \(O(k)\) time, where \(k\) is the number of insertions in all the aligned blocks \([DL_i, DR_i, RL_i, RR_i]\). Closing the gaps takes \(O(l_1 l_2)\) time for each gap, where \(l_1 = DL_{i+1} - DR_i - 1\) and \(l_2 = RL_{i+1} - RR_i - 1\). Theoretically, in the worst case, the time bound could be \(O(MN)\), where \(M\) is the total length of the target assembly and \(N\) is the total length of all the long reads; this is unlikely in practice, however, because the Needleman-Wunsch algorithm is only applied where a contig and a long read have already been aligned by BLASR or Nucmer. Deciding whether an alignment segment should be kept takes linear time. Overall, this alignment step takes \(O(MN)\) time in the worst case, and typically runs faster in practice.
Segment extraction
Once we obtain validated segments from the target assembly, we extract these segments and discard the remaining untrustworthy regions (see Fig. 1b). We support two options for extracting structurally validated segments. The first is simple extraction, which is preferred if the quality of the target draft assembly is fairly high. The second is to generate a new consensus from the long-read data aligned to each segment, which is preferred if the draft assembly quality is relatively low (e.g., under a low-coverage sequencing scheme).
The first option takes \(O(K)\) time, where \(K\) is the number of validated segments. The second option can take much longer, because it requires another BLASR or Nucmer alignment, which is then used to compute a consensus for each validated segment.
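A minimal sketch of the simple-extraction option, assuming validated segments are given as (contig, start, end) coordinates on the draft assembly (all names and the coordinate convention are hypothetical):

```python
# Sketch of simple extraction: slice each validated segment out of the draft
# assembly; everything outside the validated segments is discarded.
# `assembly` maps contig names to sequences; `segments` is a list of
# (contig_name, start, end) tuples with 0-based, end-exclusive coordinates.

def extract_segments(assembly, segments):
    extracted = []
    for idx, (contig, start, end) in enumerate(segments):
        extracted.append((f"segment_{idx}", assembly[contig][start:end]))
    return extracted
```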
Reassembling
In this step, our goal is to reassemble the extracted segments (e.g., Fig. 1c shows a possible reassembly). The information needed to close the induced gaps lies, ideally, within the remaining unaligned long reads. We therefore exploit these remaining long reads through a graph-based theoretical model, which also appeared as the scaffolding problem proposed by the authors of ScaffMatch [14].
Prior work on this scaffolding problem reduced it to a vertex-disjoint path problem, which is NP-hard, but no complexity analysis was given for the scaffolding problem itself. Here, we show that this scaffolding problem is indeed NP-hard. Using our independent formulation of this model, we also prove that the algorithm presented in ScaffMatch [14] is a 2-approximation of an optimal scaffolding.
Graph model construction
Our graph model (see Fig. 3 for an example) is derived from the concept of breakpoint graphs [24–26]. In a breakpoint graph, each gene is represented by two vertices, indicating the 5’-end and 3’-end of the gene. If two genes are consecutive in a scaffold, a colored edge is added to connect the two corresponding vertices. Following the same idea, we represent each validated segment by two vertices and add a dashed edge between them. A solid edge is added between two vertices in the graph if there is a long read bridging them. Similar edges are merged in this step; if multiple edges still remain between two vertices, we keep the one with the largest support (the number of long reads) (see Fig. 3a). This construction has the advantage that, in the subsequent steps of the algorithm, we do not have to consider the orientations of the vertices (as shown in Fig. 3b and c). Prior approaches such as MAIA [13] and Medusa [27] both require the orientations of vertices and edges to be determined. The orientations are usually determined in two separate phases, which may accumulate assembly errors, given that both phases are NP-hard problems (and therefore approximated). This two-phase difficulty is avoided both by our model and by the similar graph model in ScaffMatch [14].
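The construction can be sketched as follows; this is a minimal illustration using networkx, with a hypothetical input format in which each bridging long read reports which two segment ends it connects.

```python
import networkx as nx

# Sketch of the graph model: each validated segment i contributes two vertices,
# (i, 'L') and (i, 'R'), joined by a dashed edge of weight 0. Each long read that
# bridges two segment ends contributes support to a solid edge; when several
# parallel connections arise, only the one with the largest support is kept.

def build_scaffold_graph(num_segments, bridges):
    # `bridges` is a list of ((seg_a, end_a), (seg_b, end_b)) pairs, one per
    # bridging long read, where end_a/end_b are 'L' or 'R' (hypothetical format).
    g = nx.Graph()
    for i in range(num_segments):
        g.add_edge((i, 'L'), (i, 'R'), kind='dashed', weight=0)
    support = {}
    for u, v in bridges:
        if u == v:
            continue  # ignore degenerate bridges
        key = frozenset((u, v))
        support[key] = support.get(key, 0) + 1
    for key, count in support.items():
        u, v = tuple(key)
        if g.has_edge(u, v) and g[u][v]['kind'] == 'dashed':
            continue  # a segment's own dashed edge is never replaced
        g.add_edge(u, v, kind='solid', weight=count)
    return g
```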
NP-hardness of the scaffolding problem
In this graph, an alternating path that starts at a dashed edge, is followed by a solid edge, and so on, and ends at a dashed edge is a possible scaffold (as shown in Fig. 3b). Since each validated segment should ideally appear in exactly one scaffold in the final output assembly (i.e., no repeats, see Fig. 3c), the objective is to find a set of (vertex-disjoint) alternating paths that covers every dashed edge exactly once. In addition, we weight each solid edge by the number of long reads that support that connection. Thus, the objective is to find a maximum-weight alternating path cover.
Formally, the scaffolding problem is defined as follows.
Definition 1
Given a weighted undirected graph \(G=(U, U'; E)\), where \(U=\{u_1, u_2, \ldots, u_n\}\), \(U'=\{u_1', u_2', \ldots, u_n'\}\), and \(E\) is a set of dashed edges and solid edges. The dashed edges are exactly \(\{(u_1, u_1'), (u_2, u_2'), \ldots, (u_n, u_n')\}\). The solid edges may connect any two vertices in the graph. Each solid edge is assigned a positive edge weight, and the edge weight of each dashed edge is 0. An alternating path is defined as a simple path that starts at a dashed edge, followed by a solid edge, then a dashed edge, then a solid edge, and so on, and ends at a dashed edge. The objective is to find a set of alternating paths such that each vertex and each dashed edge appear in the paths exactly once (i.e., an alternating path cover), and the sum of the edge weights over all paths is maximized.
To prove that this problem is NP-hard, we prove that the decision version of the problem is NP-complete.
Theorem 1
Given a weighted undirected graph \(G=(U, U'; E)\) defined as above, and a parameter \(k\), the problem of determining whether there exists an alternating path cover whose edge weight sum is at least \(k\) is NP-complete.
Proof
Given any set of paths, we can verify in polynomial time whether they form an alternating path cover and whether the weight sum is at least \(k\). Hence, the decision version of the scaffolding problem is in NP.
To prove NP-hardness, we reduce the Hamiltonian path problem (the undirected graph version) to this problem. Given an instance \(I\) of the Hamiltonian path problem on a graph \(G=(V, E)\), we make copies \(V'\) and \(E'\) of the vertex set \(V\) and the edge set \(E\). Let \(E \cup E'\) be the solid edges in the instance \(I'\) of the decision version of the scaffolding problem. The dashed edges in \(I'\) are constructed by connecting the corresponding vertices in \(V\) and \(V'\). All solid edge weights are set to 1, and \(k = n-1\). In this way, we construct an instance of the decision version of the scaffolding problem in polynomial time. Figure 4 shows an example of the construction.
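To make the construction concrete, the following sketch (assuming networkx; the function name is ours) builds the scaffolding instance \(I'\) from an undirected graph \(G\), mirroring the reduction just described:

```python
import networkx as nx

# Sketch of the reduction: given an undirected graph G (Hamiltonian path instance),
# build the scaffolding instance I'. Each vertex v yields a pair (v, 0) in U and
# (v, 1) in U' joined by a dashed edge; every edge of G is duplicated as a solid
# edge of weight 1 inside U and inside U'. The target weight is k = n - 1.

def hamiltonian_to_scaffolding(g):
    inst = nx.Graph()
    for v in g.nodes:
        inst.add_edge((v, 0), (v, 1), kind='dashed', weight=0)
    for u, v in g.edges:
        inst.add_edge((u, 0), (v, 0), kind='solid', weight=1)   # edge set E
        inst.add_edge((u, 1), (v, 1), kind='solid', weight=1)   # copied edge set E'
    k = g.number_of_nodes() - 1
    return inst, k
```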
(⇒) If there is a Hamiltonian path in \(I\), we can denote the path as \(u_1 \rightarrow u_2 \rightarrow \cdots \rightarrow u_n\). Then we can construct a single alternating path \(u_1 \rightarrow u_1' \rightarrow u_2' \rightarrow u_2 \rightarrow u_3 \rightarrow u_3' \rightarrow \cdots\). The path ends at \(u_n\) if \(n\) is even, and at \(u_n'\) if \(n\) is odd. The edge weight sum of the alternating path is \(n-1\). Each dashed edge and each vertex in \(I'\) are covered exactly once, and the solid edges \((u_1', u_2'), (u_2, u_3), \ldots\) must be in the graph \(I'\) because the corresponding edges are in the Hamiltonian path, and thus in the original graph \(I\).
(⇐) If there is an alternating path cover of edge weight sum at least \(n-1\) in the graph \(I'\), then the path cover must be a single alternating path: every solid edge has weight 1, and a cover consisting of \(p\) paths over the \(n\) dashed edges contains exactly \(n-p\) solid edges, so a weight sum of at least \(n-1\) forces \(p=1\). Without loss of generality, we can assume that the path starts at a vertex in \(U\). Then the alternating path is \(u_{i_1} \rightarrow u_{i_1}' \rightarrow u_{i_2}' \rightarrow u_{i_2} \rightarrow u_{i_3} \rightarrow u_{i_3}' \rightarrow \cdots\). We can construct a Hamiltonian path \(u_{i_1} \rightarrow u_{i_2} \rightarrow \cdots \rightarrow u_{i_n}\), since the corresponding edges must exist in the graph \(I\).
Hence, the decision version of the scaffolding problem is NP-complete. □
The 2-approximation algorithm
For completeness, we also describe the algorithm presented in ScaffMatch [14]:

1. Find a maximum weighted matching by considering only the solid edges.

2. Add the dashed edges to the matching.

3. For each alternating cycle, change the weight of its smallest-weight solid edge to -1, and rerun steps (1) and (2) until no cycle remains.
To prove that this is a 2-approximation algorithm, for simplicity we relax step (3) so that we simply remove the smallest-weight solid edge from each cycle. This relaxation can only do worse than the iterative algorithm above, but the two coincide in the worst case. In fact, the current implementation of our algorithm follows this relaxed version. Our proof is similar to that for the ordinary path cover problem in [28].
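The relaxed version can be sketched as follows, assuming the graph carries 'kind' and 'weight' edge attributes as in the earlier construction sketch; here we use networkx's maximum weight matching, whereas our implementation uses the LEMON routine discussed below.

```python
import networkx as nx

# Sketch of the relaxed 2-approximation: (1) maximum weight matching over the
# solid edges only, (2) add all dashed edges, (3) break each alternating cycle
# by removing its smallest-weight solid edge.

def approximate_path_cover(g):
    solid = nx.Graph()
    solid.add_nodes_from(g.nodes)
    solid.add_edges_from(
        (u, v, d) for u, v, d in g.edges(data=True) if d['kind'] == 'solid'
    )
    matching = nx.max_weight_matching(solid)              # step (1)

    cover = nx.Graph()
    cover.add_nodes_from(g.nodes)
    for u, v, d in g.edges(data=True):                    # step (2): dashed edges
        if d['kind'] == 'dashed':
            cover.add_edge(u, v, **d)
    for u, v in matching:
        cover.add_edge(u, v, **g[u][v])

    # Every vertex now has degree at most 2 (its dashed edge plus at most one
    # matched solid edge), so each component is a path or an alternating cycle.
    for component in list(nx.connected_components(cover)):
        sub = cover.subgraph(component)
        if sub.number_of_edges() == sub.number_of_nodes():  # component is a cycle
            u, v, _ = min(
                (e for e in sub.edges(data=True) if e[2]['kind'] == 'solid'),
                key=lambda e: e[2]['weight'],
            )
            cover.remove_edge(u, v)                       # step (3), relaxed
    return cover  # each connected component is one scaffold (an alternating path)
```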
Theorem 2
There is a 2-approximation algorithm for the scaffolding problem.
Proof
Let \(M^*\) be the weight sum of the maximum matching from step (1), let ALG be the output value of the algorithm presented above, and let OPT be the optimal value of the scaffolding problem. For any feasible solution, removing the dashed edges leaves a matching of the solid edges; hence OPT \(\le M^*\). On the other hand, for a cycle \(c_i\), let \(k_i\) be the number of solid edges in \(c_i\), and let \(k_{\min} = \min\{k_i\}\). Since we remove the smallest-weight solid edge from each cycle, it follows that ALG \(\ge M^* \cdot (k_{\min}-1)/k_{\min}\). Thus OPT/ALG \(\le k_{\min}/(k_{\min}-1)\). In the worst case, \(k_{\min}=2\); hence OPT/ALG \(\le 2\). □
Remark 1
If the output of the maximum weighted matching is improved (i.e., \(k_{\min}\) is larger), then the approximation ratio of the algorithm will be better.
The time complexity of our 2-approximation algorithm is dominated by the maximum weighted matching. We use the maximum weighted matching implementation from the LEMON library [29], whose time complexity is \(O(mn \log n)\), where \(n\) is the number of vertices and \(m\) is the number of edges in the matching graph. Hence, the total time complexity of the 2-approximation algorithm is \(O(Km \log K)\), where \(m\) is the number of edges, which in the worst case equals the number of long reads, and \(K\) is the number of validated segments, which is proportional to the number of vertices in the graph.