LRScaf: improving draft genomes using long noisy reads

Background The advent of third-generation sequencing (TGS) technologies opens the door to improve genome assembly. Long reads are promising for enhancing the quality of fragmented draft assemblies constructed from next-generation sequencing (NGS) technologies. To date, a few algorithms that are capable of improving draft assemblies have released. There are SSPACE-LongRead, OPERA-LG, SMIS, npScarf, DBG2OLC, Unicycler, and LINKS. Hybrid assembly on large genomes remains challenging, however. Results We develop a scalable and computationally efficient scaffolder, Long Reads Scaffolder (LRScaf, https://github.com/shingocat/lrscaf), that is capable of significantly boosting assembly contiguity using long reads. In this study, we summarise a comprehensive performance assessment for state-of-the-art scaffolders and LRScaf on seven organisms, i.e., E. coli, S. cerevisiae, A. thaliana, O. sativa, S. pennellii, Z. mays, and H. sapiens. LRScaf significantly improves the contiguity of draft assemblies, e.g., increasing the NGA50 value of CHM1 from 127.1 kbp to 9.4 Mbp using 20-fold coverage PacBio dataset and the NGA50 value of NA12878 from 115.3 kbp to 12.9 Mbp using 35-fold coverage Nanopore dataset. Besides, LRScaf generates the best contiguous NGA50 on A. thaliana, S. pennellii, Z. mays, and H. sapiens. Moreover, LRScaf has the shortest run time compared with other scaffolders, and the peak RAM of LRScaf remains practical for large genomes (e.g., 20.3 and 62.6 GB on CHM1 and NA12878, respectively). Conclusions The new algorithm, LRScaf, yields the best or, at least, moderate scaffold contiguity and accuracy in the shortest run time compared with other scaffolding algorithms. Furthermore, LRScaf provides a cost-effective way to improve contiguity of draft assemblies on large genomes.


Background
With the advent of next-generation sequencing (NGS) technologies, the genomics community has made significant contributions to de novo genome assembly. Despite that many studies and tools are aimed at reconstructing NGS data into complete de novo genomes, this goal is challenging to achieve because of an intrinsic limitation of NGS data, i.e., read lengths are shorter than most of the repetitive sequences [1]. The existence of repeats makes it challenging to reconstruct a complete genome instead of generating lots of contiguous sequences (contigs) even when the sequencing coverage is high [2]. Thus, attention has focused on the so-called genomic scaffolding procedure, which aims at reducing the number of contigs by using fragments of moderate lengths whose ends are sequenced (double-barreled data) [3,4]. Nevertheless, long repetitive sequences still limit genomic assembly.
As the development of third-generation sequencing (TGS) technologies, it sheds light on different alternatives to solve genome assembly problems by offering very long reads. For example, the single-molecule, realtime (SMRT) sequencing technology of Pacific Biosci-ences® (PacBio) delivers read lengths of up to 50 kbp [5], and the nanopore sequencing technology of Oxford Nanopore Technologies® (Nanopore) yields read lengths that are greater than 800 kbp [6]. Also, the Hi-C data provides much longer-range linking information than other technologies (such as mate pairs, optical maps, linked reads) [7]. With the TGS long reads and the Hi-C data on de novo assembly, it is possible to reconstruct genome into complete chromosome arms [8,9]. However, these long reads suffer from high sequencing error rates, which necessitate high coverage during the de novo assembly [10]. Also, TGS technologies have a higher cost per base than NGS methods, and the Hi-C data would produce small inversions within the scaffolds when the draft assemblies are not with good quality and contiguity [11]. On a large-scale assembly project (such as the ruminant project [12]), a more reasonable and cost-effective way is to improve the contiguity of draft assemblies constructed by NGS data with low coverage long reads [7,13].
The process of genome assembly typically divides into two major steps. The first step is to piece-by-piece overlap reads into contigs. This step commonly performs using de Bruijn or overlap graphs [1]. The second step is to assemble scaffolds that consist of ordered the oriented contigs with estimated distances between them. Scaffolding, which was first introduced by Huson [3], is a critical part of the genome assembly process, especially for NGS data. Scaffolding is an active research area because of NP-hard complexity [14]. By using paired-end and/or mate-pair reads, a number of standalone scaffolders, e.g., Bambus [4], MIP [15], Opera [16], SCARPA [17], SOPRA [18], SSPACE [19], BESST [20], and BOSS [21] have been developed. A recent comprehensive evaluation showed that scaffolding remains computationally intractable and requires larger insert-size and higher quality pair read libraries than what is presently available [22]. As TGS technologies are likely to offer longer reads than the lengths of the most common repeats, these technologies are capable of drastically reducing and overcoming the complexity caused by repeats. Considering the pros and cons of NGS and TGS data, a hybrid assembly approach that assembles draft genomes using TGS data was proposed [23]. The core strategy of this approach is: 1) long reads are mapped onto the contigs using a longread mapper (e.g., BLASR [24] or minimap [25]); 2) alignment information is examined, and long reads that span more than one contig are identified and their linking relationship is stored in a data structure; 3) the last step is to clean up the structure by removing redundant and error-prone links, calculate distances between contigs, and build scaffolds using linking information.
AHA was the first standalone scaffolder basing on the hybrid-assembly strategy, and this algorithm was part of the SMRT analysis software suite [23]. As AHA is designed for small genomes and has limitations on the input data, it is not suitable for large genomes. To ensure that scaffolds are as contiguous as possible, AHA performs six iterations by default, thus increasing the run time. For comparison, SSPACE-LongRead [26] produces final scaffolds in a single iteration and, therefore, has a significantly shorter run time than AHA. Nevertheless, SSPACE-LongRead has somewhat lower assembly accuracy than AHA. Despite being designed for eukaryotic genomes, the run time of SSPACE-LongRead is unpractical on large genomes. LINKS [27] opens a new door to building linking information between contigs. The algorithm uses the long interval nucleotide K-mer without computational alignment and a readscorrection step. The memory usage of LINKS is noteworthy. OPERA-LG [28] provides an exact algorithm for large and repeat-rich genomes. It requires significant mate-pair information to constrain the scaffold graph and yield an optimised result. OPERA-LG is not directly designed for TGS data. To construct scaffold edges and link contigs into scaffolds, OPERA-LG needs to simulate and group mate-pair relationship information from long reads. Recently, scaffolding algorithms, such as SMIS (http://www. sanger.ac.uk/science/tools/smis), npScarf [29], DBG2OLC [30], and Unicycler [31], incorporate the hybrid-assembly strategy. However, these tools have not been thoroughly assessed for different genome sizes, especially large genomes.
Here we present a Long Reads Scaffolder (LRScaf) that improves draft genomes using TGS data. The input to LRScaf is given by a set of contigs and their alignments over PacBio or Nanopore long reads. We compare our algorithm with state-of-the-art tools on datasets for seven species (i.e., Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Oryza sativa, Solanum pennellii, Zea mays, and Homo sapiens). All the tested methods improve the contiguity of pre-assembled genomes. LRScaf yields the best assembly metrics and contiguity for the pre-assembled genomes on E. coli, S. cerevisiae, A. thaliana, S. pennellii, Z. mays, and H. sapiens. More importantly, our method consistently returns high-quality scaffolds and has the shortest run time. LRScaf significantly improves the contiguity of human draft assemblies, increasing the NGA50 value of CHM1 from 127.1 kbp to 9.4 Mbp with 20-fold coverage PacBio dataset and the NGA50 value of NA12878 from 115.3 kbp to 12.9 Mbp with 35-fold coverage Nanopore dataset. Thus, we show that LRScaf is a promising tool for improving draft assemblies in a computationally costeffective way.

Experimental data
The present study performs on two small genomes (E. coli and S. cerevisiae), three medium genomes (A. thaliana, O. sativa, and S. pennellii), and two large genomes (Z. mays and H. sapiens). All the tested data of the seven species are collected from published datasets except the Illumina dataset of O. sativa, which is sequenced in this study (Table 1). For the PacBio long reads datasets, we select the first 20-fold coverage of each PacBio dataset to assess all the scaffolders comprehensively. To test all the scaffolders' performances on the lower depths, we use three different coverages, i.e., 1, 5, and 10 -fold for the two small genomes and 1, 5, and 15 -fold for H. sapiens (NA12878). To assess scaffolder performances for different median read lengths, we use three different medians (8,18, and 26 kbp) of 10-fold coverage on the E. coli PacBio datasets. The coverage of the Nanopore long reads datasets is not too high, and, therefore, we use all of the long reads data from these datasets to assess scaffolder performances.

Producing draft assemblies
For the two small genomes, the draft assemblies are constructed by SOAPdenovo2 [32] and SPAdes [33]. To assess the performances between LINKS and the other scaffolders on the Nanopore datasets, the draft assemblies which are published on LINKS are also included ( Table 2). The NGS assemblers, i.e., DISCOVAR [34], MaSuRCA [35], Platanus [36], SOAPdenovo2, and Spar-seAssembler [37] are used to generate the draft assemblies for A. thaliana (KBS-Mac-74), O. sativa, and S. pennellii. To compare with DBG2OLC, we generate the draft assemblies for E. coli and A. thaliana (ler-0) using SparseAssembler. The best parameters for these NGS assemblers are determined by taking assembly size and contiguity into account. For the Z. mays, H. sapiens (CHM1), and H. sapiens (NA12878), the released assemblies are used because the computational resources required to determine optimised assembly parameters for these species exceed our platform capacity.

Alignment validation and repeat identification
LRScaf is designed to separate the mapping and scaffolding procedures and supports BLASR and minimap (Version 1 and 2). The high error rate is a severe disadvantage of TGS long reads. Thus, a significant fraction of the alignments is incorrect and needs to be filtered out. We develop a validation model to validate each alignment (Fig. 1). The model partitions each long read into three regions (R1, R2, and R3) that are separated by two points (P1 and P2). There are six different combination sets in R if the alignment start (S) and end (E) loci of the contig are considered, i.e., R ∈ {(S in R1, E in R1), We also define the distal length of a contig to the start or end alignment Note: a refers to DBG2OLC dataset; b refers to LINKS dataset; c the dataset was sequenced in this study loci as the over-hang length of the contig. Taking both the alignment region and the over-hang length into account, the valid alignment satisfies: 1) (S in R1, E in R1) with the right over-hang length not exceeding the constraints; 2) (S in R1, E in R2) with the right over-hang length not exceeding the constraints; 3) (S in R2, E in R2) with the over-hang length of two ends not exceeding the constraints; 4) (S in R2, E in R3) with the left overhang length not exceeding the constraints; 5) (S in R3, E in R3) with the left over-hang length not exceeding the constraints. An alignment is filtered out if a long read is entirely covered by a contig (S in R1, E in R3), i.e., the contig contains the long read. After this procedure, the remaining alignments are considered to be valid for the scaffolding procedure. Repetitive sequences complicate genome assembly. Thus, such sequences are masked in our approach. First, we identify and remove repeats by the coverage of reads based on the uniform coverage of TGS data. In the calculation of reads coverage, long reads that covered the entire contig are counted. Then we compute the mean coverage and the standard deviation among the set of contigs. Any contig coverage is higher than the threshold coverage which is μ cov + 3 × s. d. cov by default. It is considered to be a repeat, and the corresponding contig is removed from the next step of the analysis.

Construction of links and edges
A long read may have multiple mappings because of repeats and high sequencing error rates. Figure 2 shows how links are built between contigs from the validated alignments. This process has constraints on orientation and distance. Four strand combination sets, S, are used between contigs to constrain orientation, i.e., S ∈ {s 1 : (+, +), s 2 : (+, −), s 3 : (−, +), s 4 : (−, −)}. We define the orientation between contigs as O(c i , c j ) = max (s). The probability that the internal distance, e, between two contigs lies outside the range [μ is − 3 × σ is , μ is + 3 × σ is ] is less than 5% because e approximately follows a normal distribution N(μ is , σ is ). When e is out of the range [μ is − , it is considered to be abnormal, and the linking information is removed. Any long reads linking a contig to itself at different loci are also removed. After validating that the two constraints on links between contigs are fulfilled, we introduce an edge to represent a bundle of links that join two contigs using quadruple parameters Eðc i ; c j Þ ¼ ðn; μ is ; σ is ; oÞ . Here, n is the number of remaining links considering as the weight of the edge, μ is is the mean internal distance for the remaining links, σ is is the standard deviation of the internal distances for the remaining links, and o is the orientation strand between contigs.

Graph generation and simplification
In After the edges-construction step, we account for the majority of the sequencing errors by removing all the edges that have a lower number of long reads than the threshold value. Once the edges have been cleaned and filtered, we construct an assembly graph G. We only add an edge to G if neither of the two nodes comprising the edge is present in G. In some cases, G contains some edges of transitive reduction, error-prone and tips. After deleting these edges, we obtain the final scaffold graph, which we use for further analysis.

Scaffolding contigs into scaffolds
After the repeat identification and the graph simplification steps, most of the contigs are connected in linear  stretches on the assembly graph. There are, however, some complex regions that required addition manipulation. We refer to a contig as a divergent node if it links more than two nodes in the graph (Fig. 3). LRScaf searches for unique nodes at the end of this complex region and steps through this region for as long as there are long reads that join two unique nodes. If a divergent node is reached, LRScaf stops travelling the graph in the forward direction and switches to the reverse direction. Similarly, the search along the reverse direction of the graph stops at the end of a linear stretch or a divergent node. The process is then repeated using an unvisited node as the starting node. The procedure ends after traversing all the unvisited and unique nodes in the graph, thus exposing all linear paths. Finally, the gap-size between contigs is calculated. If the gap-size value is negative, the contigs are merged into a combined contig, and if the value is positive, a gap is inserted between the contigs (a gap is represented by one or more undefined 'N' nucleotides, depending on gap-size).

The parameters of LRScaf
LRScaf has three sections of parameters, i.e., 1) a filerelated section, 2) an alignment-validation section, and 3) a performance-related section. The parameter settings can be provided via the program's command line or through an XML configuration input file. In the filerelated section, input parameters are required for the alignment file, draft assemblies, and output path. In the alignment-validation section, there are six mandatory parameters for the alignment-validation model. These are min_overlap_length, min_overlap_ratio, max_over-hang_length, max_overhang_ratio, max_end_length, and max_end_ratio. Whereas loosening these parameters improves assembly contiguity, the number of misassemblies increases. In the performance-related section, there are 7 parameters, i.e., min_contig_length, identity, min_sup-ported_links, repeat_mask and tip_length, iqr_time, and process. It will sometimes be necessary to change the values of these parameters based on data. If, for example, long-read coverage is low and assembly contiguity is the main priority, reducing the values of identity and min_ supported_links could significantly improve the assembly contiguity. Besides, masking repeats could decrease the number of divergent nodes in the scaffolding graph and generate more contiguous scaffolds.

Results and discussion
We perform in-depth analysis on seven species (  assembly contiguity (NG50 and NGA50) is about two times contiguous than that of DBG2OLC. The BUSCO measurement for LRScaf is 94.0%. To summarise, LRScaf yields comparable or superior metrics when compared with other scaffolders on PacBio long reads.

Nanopore long read benchmarks
We use the Nanopore long reads datasets for E. coli, S. cerevisiae, A. thaliana, S. pennellii, and H. sapiens to assess the performances of the tested scaffolders (Table 4).
For the two small genomes, OPERA-LG and Unicycler are excluded from this assessment because of the lack of NGS data. The assembly contiguity obtained from SMIS, npScarf, and LRScaf are generally better than those obtained from LINKS, SSPACE-LongRead, and DBG2OLC (Additional file 1 Supplementary Note; Additional file 2: Although all scaffolders show some improvements in our experiments, the application of the Nanopore data remains challenging. A recent study showed that the NA12878 genome was assembled with an NG50 value of about 6.5 Mbp using pure 35-fold Nanopore data [6]. Our experiments show that it is possible to improve the assembly contiguity further. Using LRScaf, we obtained an NG50 value of 17.4 Mbp, which is similar to the PacBio human (CHM1) benchmark. To summarise, LRScaf yields either the best or similar assembly metrics using long reads of Nanopore compared with other scaffolders.

Computational performance and accuracy analysis
Assembly metrics are undoubtedly the most concerning matters to biologists and bioinformaticians. Nevertheless, from a practical point of view, the run time limits software applications. As evident from our experiments, the run time and the memory usage for scaffolding procedures become significant concerns for large and complex genomes. LRScaf is the fastest scaffolder on the benchmarks. For the PacBio dataset of O. sativa, LRScaf reduces the CPU time more than 6700 times compared with SMIS. MaSuRCA-Hy produces the best assembly contiguity. However, its run time is 1600 times longer than LRScaf. For Z. mays, LRScaf is 20 times faster than SSPACE-LongRead. On the Nanopore dataset, LRScaf reduces the CPU time more than 4700 times compared with SMIS for A. thaliana and 380 times compared with MaSuRCA-Hy for S. pennellii. Our experiments show that the memory usage of LRScaf is practical even for large and complex genomes because the peak RAM usage remains below 30.0 GB on the CHM1 PacBio dataset and 70.0 GB on the NA12878 Nanopore dataset respectively. Reducing the number of misassemblies is important because misassemblies are likely to be misinterpreted as real genetic variations [39,40]. For the PacBio dataset, LRScaf and SSPACE-LongRead yield the fewest number of misassemblies on O. sativa and Z. mays, respectively. For the Nanopore dataset, OPERA-LG has the best assembly accuracy on A. thaliana, and LRScaf yields the fewest number of misassemblies on S. pennellii. LRScaf yields a relatively low number of misassemblies in most of the studied cases. We have no record on the number of misassemblies on H. sapiens for the other scaffolders because all of them fail to scaffold the draft assemblies. In summary, LRScaf introduces a new strategy for keeping valid alignments and produces only a moderate number of misassemblies.

Conclusions
In this work, we present a novel algorithm (LRScaf, see Additional file 3) for scaffolding draft assemblies using noisy TGS long reads and compare our algorithm with published methods. The majority of the draft assemblies constructed using NGS data are fragmented and influenced by repeats. To improve the contiguity of draft assemblies using TGS long reads, there are two strategies on the scaffolding step: 1) simulating mate-pair information (e.g., OPERA-LG and Fast-SG [41]), and 2) using the full-length long reads information (e.g., SSPACE-LongRead and LRScaf). Both strategies could improve the contiguity of draft assemblies. The first strategy could build the scaffolding graph from either short or long reads and have sufficient and convenient linking information from a range of synthetic insert-size mate-pair libraries as well as for the NGS standalone scaffolder reuse. The run time for processing the synthetic linking information and the impact of base accuracy are significant concerns. For the second strategy, it could do the scaffolding step in a speedy and memoryefficient way. Considering only the linking information from long reads would neglect the linking information of paired reads library, whereas integrating paired reads information could improve assembly accuracy. We successfully use long reads to improve draft assemblies over different NGS de novo assemblers. Basing on assembly contiguity and accuracy, DISCOVAR and SOAPdenovo2 are the recommended NGS de novo assemblers for LRScaf.
We propose a new strategy to filter out inaccurate alignments so that these false alignments do not propagate through the scaffolding process. For the assessments on PacBio and Nanopore long-read datasets covering seven organisms, our method shows significant improvements over state-of-the-art scaffolders. The primary benefits of LRScaf are significant reductions in run time and memory usage. These benefits are especially crucial for large and complex genomes. As studied genomes keep getting bigger and more complex, the run time and memory usage of scaffolding software become increasingly essential to biologists and bioinformaticians. Our method is designed to reduce run time and memory usage. Thus, LRScaf is computationally more efficient than other scaffolders. Identification of misassembled contigs is also essential because any misassembled sequences are propagated into the next step during biological analysis. Most state-of-the-art scaffolders lack functions for the identification of misassembled contigs. Besides, misassemblies might be introduced during the scaffolding procedure. Consequently, to limit the number of misassembled scaffolds, our method incorporates an effective validation algorithm to reduce the influence of false alignment.
In the past decade, worldwide collaboration has led to several projects aiming at improving the understanding of species biology and evolution. Examples of such projects are the i5k [42], which provides the genomes of 5000 species of insects, and the Bird 10,000 Genomes (B10K) [43]. A substantial fraction of genomes with short contiguity prevent downstream analysis. Our result shows that TGS data is capable of effectively improving draft assemblies, and LRScaf is a valuable tool for cost-effectively improving these draft assemblies.

Availability and requirements
Project name: LRScaf.
Project home page: https://github.com/shingocat/lrscaf Operating system(s): Platform independent. Programming language: Java. Other requirements: Java 1.8 or higher. License: GNU GPL. Any restrictions to use by non-academics: license needed.
Additional file 1. Supplementary Note. The additional comparisons and information are included in the Supplementary Note: 1) the assessment for two small genomes on PacBio and Nanopore long reads; 2) the Note: The best genomic assembly metrics are highlighted in Bold; Mis. the number of misassemblies, SA SparseAssembler, MSR MaSuRCA de novo, SOAP SOAPdenovo2, DIS DISCOVAR de novo, PLA Platanus, SSPACE-LR SSPACE-LongRead, MaSuRCA-Hy MaSuRCA hybrid pipeline; '-': not available; We report two CPU times for MaSuRCA-Hy (the first value is the time for hybrid pipeline and the second value enclosed in the parenthesis is the time for de novo pipeline); SMIS is excluded on S. pennellii because the run time exceeds the one-month time limit. SIMS and SSPACE-LongRead are excluded on H. sapiens (NA12878) because the run time exceeds the one-month time limit. OPERA-LG is excluded from H. sapiens (NA12878) because of the lack of NGS data