In general, our method consists of disassembling and reassembling genomic regions (Fig. 1). In the first step, we locate structurally consistent segments in an assembly (Fig. 1a) and extract these regions (Fig. 1b), thus disassembling the input assembly into smaller, validated portions (or contigs). In the second step, we reassemble the consistent regions (Fig. 1c) by applying an independently derived path cover model, first proposed as a scaffolding problem in [14], and fill the gaps. The details of our algorithms are presented below.
Disassembling
Long-read alignment
We first align error-corrected long reads to a target assembly (see Fig. 1a), using either Nucmer [17] or BLASR [23] (any long-read alignment tool could in theory be used here). The output of Nucmer/BLASR can be viewed as a sequence of aligned blocks \((DL_i, DR_i, RL_i, RR_i)\), such that the nucleotides in the range \([DL_i, DR_i]\) of the target assembly are aligned to the long read in the range \([RL_i, RR_i]\). If there are overlaps between consecutive aligned blocks, either in the draft sequence or in the long read, we cut off the left ends of the blocks on the right. We then attempt to close any small gaps between these aligned blocks via the Needleman-Wunsch algorithm, and call the resulting alignment region an alignment segment.
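The trimming-and-gap-closing step can be sketched as follows. This is a simplified illustration, not our implementation: `blocks_to_segment` and `needleman_wunsch` are hypothetical names, block coordinates are assumed to be 0-based and inclusive, and applying the same trim length to draft and read coordinates is a simplifying assumption of the sketch.

```python
# Sketch: turn consecutive aligned blocks (DL, DR, RL, RR) into one alignment
# segment by trimming overlaps and closing the remaining gaps with a global
# (Needleman-Wunsch) alignment. `needleman_wunsch` is a hypothetical helper
# standing in for any standard global-alignment implementation.

def blocks_to_segment(blocks, draft_seq, read_seq, needleman_wunsch):
    blocks = sorted(blocks)               # order blocks along the draft assembly
    segment = [blocks[0]]
    gap_alignments = []
    for dl, dr, rl, rr in blocks[1:]:
        _, prev_dr, _, prev_rr = segment[-1]
        # Trim the left end of the right-hand block if it overlaps the previous
        # block in the draft or in the read (same trim on both coordinates is a
        # simplifying assumption of this sketch).
        trim = max(prev_dr - dl + 1, prev_rr - rl + 1, 0)
        dl, rl = dl + trim, rl + trim
        # Close the remaining gap between consecutive blocks with a global alignment.
        gap_alignments.append(
            needleman_wunsch(draft_seq[prev_dr + 1:dl], read_seq[prev_rr + 1:rl])
        )
        segment.append((dl, dr, rl, rr))
    return segment, gap_alignments
```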
An alignment segment is kept and used if, for any sequence in the target assembly and a long read, the following hold: i) for any gap of length \(g_1\) between two aligned blocks \([DL_i, DR_i]\) and \([DL_{i+1}, DR_{i+1}]\) in the draft assembly, and the corresponding gap of length \(g_2\) between \([RL_i, RR_i]\) and \([RL_{i+1}, RR_{i+1}]\) in the long read, the value \(|g_1 - g_2|\) is within a chosen threshold (30 in our experiments); ii) the total length of all aligned blocks for a long read, divided by the length of the long read, is larger than a chosen threshold (0.99 in our experiments). After picking out these good alignment segments, if two alignment segments overlap by at least \(\delta\) (also a threshold, 50 in our experiments) in the target assembly, we merge them.
Since we do not have a quantitative measure for block quality, the choices of these thresholds are based on observations from custom visualizations of the alignment results. In general, the first threshold ensures that if the target assembly is of high quality, it is not broken into too many pieces after disassembly. The second threshold guarantees that the long reads are well contained in the target assembly. The third threshold eliminates overlaps that are not trustworthy while trying to retain contiguity. Using error-corrected long reads maximizes the number of obtained alignments, but is not a requirement for this step.
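As a concrete illustration, the segment-keeping criteria and the merge rule can be sketched as follows; the function names are hypothetical, and the threshold constants correspond to the values quoted above.

```python
# Sketch of the segment-filtering criteria described above.
# `blocks` is the list of (DL, DR, RL, RR) tuples forming one alignment segment.

GAP_DIFF_MAX = 30      # threshold on |g1 - g2| between consecutive blocks
COVERAGE_MIN = 0.99    # fraction of the long read covered by aligned blocks
MERGE_OVERLAP = 50     # minimum overlap (delta) for merging segments on the assembly

def keep_segment(blocks, read_length):
    # Criterion i): draft-gap and read-gap lengths must agree within GAP_DIFF_MAX.
    for (dl, dr, rl, rr), (ndl, ndr, nrl, nrr) in zip(blocks, blocks[1:]):
        g1 = ndl - dr - 1
        g2 = nrl - rr - 1
        if abs(g1 - g2) > GAP_DIFF_MAX:
            return False
    # Criterion ii): the aligned blocks must cover almost the whole long read.
    aligned = sum(rr - rl + 1 for _, _, rl, rr in blocks)
    return aligned / read_length > COVERAGE_MIN

def should_merge(seg_a, seg_b):
    # Two kept segments are merged if they overlap by at least MERGE_OVERLAP
    # positions on the target assembly (coordinates are inclusive (start, end) pairs).
    start = max(seg_a[0], seg_b[0])
    end = min(seg_a[1], seg_b[1])
    return end - start + 1 >= MERGE_OVERLAP
```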
The time complexity of the alignment step depends on the tool used (e.g., BLASR or Nucmer). Sorting the aligned blocks takes \(O(a \log a)\) time, where \(a\) is the total number of aligned blocks. Cutting off the overlaps takes \(O(k)\) time, where \(k\) is the number of insertions in all the aligned blocks \([DL_i, DR_i, RL_i, RR_i]\). Closing the gaps takes \(O(l_1 l_2)\) time for each gap, where \(l_1 = DL_{i+1} - DR_i - 1\) and \(l_2 = RL_{i+1} - RR_i - 1\). Theoretically, in the worst case, the time bound could be \(O(MN)\), where \(M\) is the total length of the target assembly and \(N\) is the total length of all the long reads; this is unlikely in practice, however, because the Needleman-Wunsch algorithm is only applied where a contig and a long read have already been aligned by BLASR or Nucmer. Deciding whether an alignment segment should be kept takes linear time. Overall, this alignment step takes \(O(MN)\) time in the worst case, and typically runs faster in practice.
Segment extraction
Once we obtain validated segments from the target assembly, we extract these segments and discard the remaining untrustworthy regions (see Fig. 1b). We support two options for extracting structurally validated segments. The first is simple extraction, which is preferred if the quality of the target draft assembly is fairly high. The second is to generate a new consensus from the long-read data aligned to each segment, which is preferred if the draft assembly quality is relatively low (e.g., under a low-coverage sequencing scheme).
The first option takes \(O(K)\) time, where \(K\) is the number of validated segments. The second option can take much longer, because it requires another BLASR or Nucmer alignment, which is then used to compute a consensus for each validated segment.
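A minimal sketch of the simple-extraction option, assuming validated segments are given as (contig, start, end) coordinates on the draft assembly (all names and the coordinate convention are hypothetical):

```python
# Sketch of simple extraction: slice each validated segment out of the draft
# assembly; everything outside the validated segments is discarded.
# `assembly` maps contig names to sequences; `segments` is a list of
# (contig_name, start, end) tuples with 0-based, end-exclusive coordinates.

def extract_segments(assembly, segments):
    extracted = []
    for idx, (contig, start, end) in enumerate(segments):
        extracted.append((f"segment_{idx}", assembly[contig][start:end]))
    return extracted
```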
Reassembling
In this step, our goal is to reassemble the extracted segments (e.g., Fig. 1c shows a possible reassembly). The information needed to close the induced gaps lies, ideally, within the remaining unaligned long reads. We therefore exploit these remaining long reads through a graph-based theoretical model, which also appeared as the scaffolding problem proposed by the authors of ScaffMatch [14].
Prior work on this scaffolding problem reduced it to a vertex-disjoint path problem, which is NP-hard, but no complexity analysis was given for the scaffolding problem itself. Here, we show that this scaffolding problem is indeed NP-hard. Using our independent formulation of this model, we also prove that the algorithm presented in ScaffMatch [14] is a 2-approximation of an optimal scaffolding.
Graph model construction
Our graph model (see Fig. 3 for an example) is derived from the concept of breakpoint graphs [24–26]. In a breakpoint graph, each gene is represented by two vertices, indicating the 5’-end and 3’-end of the gene. If two genes are consecutive in a scaffold, a colored edge is added to connect the two corresponding vertices. Following the same idea, we represent each validated segment by two vertices and add a dashed edge between them. A solid edge is added between two vertices in the graph if there is a long read bridging them. Similar edges are merged in this step; if multiple edges still remain between two vertices, we keep the one with the largest support (the number of long reads) (see Fig. 3a). This construction has the advantage that, in the subsequent steps of the algorithm, we do not have to consider the orientations of the vertices (as shown in Fig. 3b and c). Prior approaches such as MAIA [13] and Medusa [27] both require the orientations of vertices and edges to be determined. The orientations are usually determined in two separate phases, which may accumulate assembly errors, given that both phases are NP-hard problems (and therefore approximated). This two-phase difficulty is avoided both by our model and by the similar graph model in ScaffMatch [14].
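The construction can be sketched as follows; this is a minimal illustration using networkx, with a hypothetical input format in which each bridging long read reports which two segment ends it connects.

```python
import networkx as nx

# Sketch of the graph model: each validated segment i contributes two vertices,
# (i, 'L') and (i, 'R'), joined by a dashed edge of weight 0. Each long read that
# bridges two segment ends contributes support to a solid edge; when several
# parallel connections arise, only the one with the largest support is kept.

def build_scaffold_graph(num_segments, bridges):
    # `bridges` is a list of ((seg_a, end_a), (seg_b, end_b)) pairs, one per
    # bridging long read, where end_a/end_b are 'L' or 'R' (hypothetical format).
    g = nx.Graph()
    for i in range(num_segments):
        g.add_edge((i, 'L'), (i, 'R'), kind='dashed', weight=0)
    support = {}
    for u, v in bridges:
        if u == v:
            continue  # ignore degenerate bridges
        key = frozenset((u, v))
        support[key] = support.get(key, 0) + 1
    for key, count in support.items():
        u, v = tuple(key)
        if g.has_edge(u, v) and g[u][v]['kind'] == 'dashed':
            continue  # a segment's own dashed edge is never replaced
        g.add_edge(u, v, kind='solid', weight=count)
    return g
```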
NP-hardness of the scaffolding problem
In this graph, an alternating path that starts at a dashed edge, is followed by a solid edge, and so on, and ends at a dashed edge is a possible scaffold (as shown in Fig. 3b). Since each validated segment should ideally appear in exactly one scaffold in the final output assembly (i.e., no repeats, see Fig. 3c), the objective is to find a set of (vertex-disjoint) alternating paths that covers every dashed edge exactly once. In addition, we weight each solid edge by the number of long reads that support that connection. Thus, the objective is to find a maximum-weight alternating path cover.
Formally, the scaffolding problem is defined as follows.
Definition 1
Given a weighted undirected graph \(G=(U, U'; E)\), where \(U=\{u_1, u_2, \ldots, u_n\}\), \(U'=\{u_1', u_2', \ldots, u_n'\}\), and \(E\) is a set of dashed edges and solid edges. The dashed edges are exactly \(\{(u_1, u_1'), (u_2, u_2'), \ldots, (u_n, u_n')\}\). The solid edges may connect any two vertices in the graph. Each solid edge is assigned a positive edge weight, and the edge weight of each dashed edge is 0. An alternating path is defined as a simple path that starts at a dashed edge, followed by a solid edge, then a dashed edge, then a solid edge, and so on, and ends at a dashed edge. The objective is to find a set of alternating paths such that each vertex and each dashed edge appear in the paths exactly once (i.e., an alternating path cover), and the sum of the edge weights over all paths is maximized.
To prove that this problem is NP-hard, we prove that the decision version of the problem is NP-complete.
Theorem 1
Given a weighted undirected graph \(G=(U, U'; E)\) defined as above, and a parameter \(k\), the problem of determining whether there exists an alternating path cover whose edge weight sum is at least \(k\) is NP-complete.
Proof
Given any set of paths, we can verify in polynomial time whether they form an alternating path cover and whether the weight sum is at least \(k\). Hence, the decision version of the scaffolding problem is in NP.
To prove NP-hardness, we reduce the Hamiltonian path problem (the undirected graph version) to this problem. Given an instance \(I\) of the Hamiltonian path problem on a graph \(G=(V, E)\), we make copies \(V'\) and \(E'\) of the vertex set \(V\) and the edge set \(E\). Let \(E \cup E'\) be the solid edges in the instance \(I'\) of the decision version of the scaffolding problem. The dashed edges in \(I'\) are constructed by connecting the corresponding vertices in \(V\) and \(V'\). All solid edge weights are set to 1, and \(k = n-1\). In this way, we construct an instance of the decision version of the scaffolding problem in polynomial time. Figure 4 shows an example of the construction.
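To make the construction concrete, the following sketch (assuming networkx; the function name is ours) builds the scaffolding instance \(I'\) from an undirected graph \(G\), mirroring the reduction just described:

```python
import networkx as nx

# Sketch of the reduction: given an undirected graph G (Hamiltonian path instance),
# build the scaffolding instance I'. Each vertex v yields a pair (v, 0) in U and
# (v, 1) in U' joined by a dashed edge; every edge of G is duplicated as a solid
# edge of weight 1 inside U and inside U'. The target weight is k = n - 1.

def hamiltonian_to_scaffolding(g):
    inst = nx.Graph()
    for v in g.nodes:
        inst.add_edge((v, 0), (v, 1), kind='dashed', weight=0)
    for u, v in g.edges:
        inst.add_edge((u, 0), (v, 0), kind='solid', weight=1)   # edge set E
        inst.add_edge((u, 1), (v, 1), kind='solid', weight=1)   # copied edge set E'
    k = g.number_of_nodes() - 1
    return inst, k
```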
(⇒) If there is a Hamiltonian path in \(I\), we can denote the path as \(u_1 \rightarrow u_2 \rightarrow \cdots \rightarrow u_n\). Then we can construct a single alternating path \(u_1 \rightarrow u_1' \rightarrow u_2' \rightarrow u_2 \rightarrow u_3 \rightarrow u_3' \rightarrow \cdots\). The path ends at \(u_n\) if \(n\) is even, and at \(u_n'\) if \(n\) is odd. The edge weight sum of the alternating path is \(n-1\). Each dashed edge and each vertex in \(I'\) are covered exactly once, and the solid edges \((u_1', u_2'), (u_2, u_3), \ldots\) must be in the graph \(I'\) because the corresponding edges are in the Hamiltonian path, and thus in the original graph \(I\).
(⇐) If there is an alternating path cover of edge weight sum at least \(n-1\) in the graph \(I'\), then the path cover must be a single alternating path: every solid edge has weight 1, and a cover consisting of \(p\) paths over the \(n\) dashed edges contains exactly \(n-p\) solid edges, so a weight sum of at least \(n-1\) forces \(p=1\). Without loss of generality, we can assume that the path starts at a vertex in \(U\). Then the alternating path is \(u_{i_1} \rightarrow u_{i_1}' \rightarrow u_{i_2}' \rightarrow u_{i_2} \rightarrow u_{i_3} \rightarrow u_{i_3}' \rightarrow \cdots\). We can construct a Hamiltonian path \(u_{i_1} \rightarrow u_{i_2} \rightarrow \cdots \rightarrow u_{i_n}\), since the corresponding edges must exist in the graph \(I\).
Hence, the decision version of the scaffolding problem is NP-complete. □
The 2-approximation algorithm
For completeness, we also describe the algorithm presented in ScaffMatch [14]:

1. Find a maximum weighted matching by considering only the solid edges.

2. Add the dashed edges to the matching.

3. For each alternating cycle, change the weight of its smallest-weight solid edge to -1, and rerun steps (1) and (2) until no cycle remains.
To prove that this is a 2-approximation algorithm, for simplicity we relax step (3) so that we simply remove the smallest-weight solid edge from each cycle. This relaxation can only do worse than the iterative algorithm above, but the two coincide in the worst case. In fact, the current implementation of our algorithm follows this relaxed version. Our proof is similar to that for the ordinary path cover problem in [28].
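The relaxed version can be sketched as follows, assuming the graph carries 'kind' and 'weight' edge attributes as in the earlier construction sketch; here we use networkx's maximum weight matching, whereas our implementation uses the LEMON routine discussed below.

```python
import networkx as nx

# Sketch of the relaxed 2-approximation: (1) maximum weight matching over the
# solid edges only, (2) add all dashed edges, (3) break each alternating cycle
# by removing its smallest-weight solid edge.

def approximate_path_cover(g):
    solid = nx.Graph()
    solid.add_nodes_from(g.nodes)
    solid.add_edges_from(
        (u, v, d) for u, v, d in g.edges(data=True) if d['kind'] == 'solid'
    )
    matching = nx.max_weight_matching(solid)              # step (1)

    cover = nx.Graph()
    cover.add_nodes_from(g.nodes)
    for u, v, d in g.edges(data=True):                    # step (2): dashed edges
        if d['kind'] == 'dashed':
            cover.add_edge(u, v, **d)
    for u, v in matching:
        cover.add_edge(u, v, **g[u][v])

    # Every vertex now has degree at most 2 (its dashed edge plus at most one
    # matched solid edge), so each component is a path or an alternating cycle.
    for component in list(nx.connected_components(cover)):
        sub = cover.subgraph(component)
        if sub.number_of_edges() == sub.number_of_nodes():  # component is a cycle
            u, v, _ = min(
                (e for e in sub.edges(data=True) if e[2]['kind'] == 'solid'),
                key=lambda e: e[2]['weight'],
            )
            cover.remove_edge(u, v)                       # step (3), relaxed
    return cover  # each connected component is one scaffold (an alternating path)
```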
Theorem 2
There is a 2-approximation algorithm for the scaffolding problem.
Proof
Let \(M^*\) be the weight sum of the maximum matching from step (1), let ALG be the output value of the algorithm presented above, and let OPT be the optimal value of the scaffolding problem. For any feasible solution, removing the dashed edges leaves a matching of the solid edges; hence OPT \(\le M^*\). On the other hand, for a cycle \(c_i\), let \(k_i\) be the number of solid edges in \(c_i\), and let \(k_{\min} = \min\{k_i\}\). Since we remove the smallest-weight solid edge from each cycle, it follows that ALG \(\ge M^* \cdot (k_{\min}-1)/k_{\min}\). Thus OPT/ALG \(\le k_{\min}/(k_{\min}-1)\). In the worst case, \(k_{\min}=2\); hence OPT/ALG \(\le 2\). □
Remark 1
If the output of the maximum weighted matching is improved (i.e., \(k_{\min}\) is larger), then the approximation ratio of the algorithm will be better.
The time complexity of our 2-approximation algorithm is dominated by the maximum weighted matching. We use the maximum weighted matching implementation from the LEMON library [29], whose time complexity is \(O(mn \log n)\), where \(n\) is the number of vertices and \(m\) is the number of edges in the matching graph. Hence, the total time complexity of the 2-approximation algorithm is \(O(Km \log K)\), where \(m\) is the number of edges, which in the worst case equals the number of long reads, and \(K\) is the number of validated segments, which is proportional to the number of vertices in the graph.