BASE: a practical de novo assembler for large genomes using long NGS reads

Background De novo genome assembly using NGS data remains a computation-intensive task especially for large genomes. In practice, efficiency is often a primary concern and favors using a more efficient assembler like SOAPdenovo2. Yet SOAPdenovo2, based on de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are more favorable for longer reads. Methods This paper shows a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have high probability to appear uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove the branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs. Results Experiments on two bacteria and four human datasets shows the advantage of BASE in both contig quality and speed in dealing with longer reads. In the experiment on bacteria, two datasets with read length of 100 bp and 250 bp were used.. Especially for the 250 bp dataset, BASE gives much better quality than SOAPdenovo2 and SGA and is simlilar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and Soapdenov2 are further compared using human datasets with read length 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, while the improvement becomes more significant when read length reaches 250 bp. Besides, BASE is more-meory efficent than SOAPdenovo2 when sequencing data with error rate. Conclusions BASE is a practically efficient tool for constructing contig, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.


Background
The past few years have witnessed a number of improved de novo genome assemblers, providing users choices between speed and accuracy [1]. The more recent NGS technologies have gradually increase the read length beyond 100 bp (e.g., 150 bp from HiSeq and 250 -400 bp from MiSeq), yet existing efficient assemblers do not have much improvement regarding accuracy, and it remains challenge how to take better advantage of longer NGS reads to assemble genomes in a fast and accurate manner. This paper presents a new assembler that can achieve better assembly quality for longer reads when compared with those efficient assemblers, without scarifying speed a lot.
Most state-of-the-art short read assemblers such as SOAPdenovo2 [2] and ALLPATHS-LG [3] are based on de Bruijn graph (DBG). In these assemblers, reads are chopped into a sequence of overlapping k-mers such that two adjacent k-mers have k-1 bases in common.
The DBG based method works well for short reads but it is non-trivial to handle repetitive sequences that are longer than k. When the reads are longer, it is natural to consider using a larger k, yet this is not feasible as this will require higher sequencing depth and consume much more memory (especially for NGS data with high error rate). Another method is to use the multiple k-mer strategies like IDBA-UD [4] and SPAdes [5], which could save memory by using smaller k parameter to take care most of sequencing errors, and using larger k to solve longer repetitive sequences. Yet this requires multiple constructions of DBG and much longer running time, limiting their usage for the assembly for relative large genome.
Assemblers for Sanger sequencing reads or Roche 454 reads (400-800 bp) are mostly based on Overlap-Layout-Consensus (OLC) strategy, such as Celera assembler and Newbler. An alternative representation named string graph was proposed by Mayer a decade ago [6], which has been implemented in assemblers such as SGA [7], Fermi [8], and Readjoiner [9]. Like OLC based assemblers, a proper minimum overlap size is required in string graph based assemblers to reduce the complexity of graph and to improve the connectivity of graph. Smaller minimum overlap size will increase the probability that the overlap sequence falls within a repetitive region of the genome. This leaves much more branches in the graph and may result in shorter contigs. Meanwhile, according to the Lander-Waterman model [10], larger minimum overlap size leads to a reduction of sufficient support for overlap among reads, thus enhancing the demand for higher sequencing depth. Therefore, due to the variation in length of repetitive sequences in genome, it is difficult to find a fixed minimum overlap size that fits all use cases especially when the NGS read is not so long. Regarding speed, for 30X 100 bp reads, BASE took 2 days and 5 days to assemble raw contigs by SGA [7] and Fermi [8], respectively. This speed is much slower than DBG based assembler SOAPdenovo2, which takes only a half day to obtain raw contigs.
Recently, a very efficient GPU-based method has been developed to index short reads using the Burrows Wheeler Transform (BWT) or bi-directional BWT of short reads [11]. For 30X human short reads, it only needs 6 h to build the BWT index. With such an index, the depth of any sequence that is no longer than the read length could be calculated in real time, which enables us to predict whether a sequence comes from a repetitive region of the genome [12]. With such efficient indexing, we find that it becomes feasible to produce better assemblies efficiently for large genomes using longer NGS reads, and in particular, we developed an adaptive seed extension method called BASE to construct contigs by searching for non-repetitive overlaps between reads. The details of our algorithm are given in the Methods Section, and an overview of the extension method is shown in Fig. 1. We tested the performance of BASE using data from HiSeq and MiSeq, with length ranging from 100 bp to 250 bp and compared BASE to popular assemblers including SOAPdenovo2, SGA and SPAdes.

Preliminary
Given a set of reads R = {R 0 , R 1 , …, R n-1 }, and each R i is terminated with a sentinel symbol $ (i.e., Given a string P, the range [l R (P), u R (P)] of P in SA R is defined as follows: From this definition, the size of SA range (u R (P) -l R (P) + 1) is the number of reads containing string P. If l R (P) > u R (P), it means that P is not a substring of any reads in R.
For double-stranded DNA sequence, we define P' as the reverse sequence of P and RC(P) for its reverse complement sequence. In this way, we also define R' as the reverse of R, SA R' as the suffix array of R' and BWT of R'. Then for string P, we can also have the SA' range as [l R' (P'), u R' (P')], which can help to find the reads containing the RC(P) [13]. The BWT of R and BWT of R' form the bi-directional BWT of R.
For bi-directional BWT, we introduce the term intact SA range (ISR) of P, which is the combination of: a) the SA range of P in R, b) the SA range of RC(P) in R, and c) the SA' range of RC(P) in R. The intact SA range is denoted by ISR(P) = [l R (P), u R (P), l R (RC(P)), u R (RC(P)), l R' (RC(P)), u R' (RC(P))]. Note that the size of b) and c) are the same. With respect to ISR(P), the depth of P (with respect to the set R) is defined as follows: Dep(P) = max{0, u R (P) -l R (P) + 1} + max{0, u R' (RC(P)) -l R' (RC(P)) + 1}.
As shown by Lam et al. [13], bi-directional BWT can finish the following operations in constant time: For any string P and a character c in {A, C, G, T, $}, calculate the SA range of cP from the SA range of P.
For any string P and a character c in {A, C, G, T, $}, calculate the SA range and the SA' range of cP (or Pc) from the SA range and the SA' range of P.
Then for P = P[0…m], using backward searching, the depth of P[m], P[m-1…m],…, P[0…m] can be calculated incrementally by updating their ISRs. Therefore, with the bi-directional BWT of R, it is possible to trace how Dep(P) decreases with the increasing length of P.

Bi-directional BWT construction
We build the BWT index of the reads based on our GPU-accerlated algorithm [11]. To test the efficiency of GPU acceleration and to make this construction available to computers without GPU, we have also developed a CPU-only version.
Meanwhile, the detailed content of BWT is also modified for easier genome assembly.
(a) A base is encoded with 4 bits using the first 3 bits to encode "A", "C", "G", "T" or the read terminal symbol "$", and the last 1 bit to indicate whether the base has a high sequencing quality with respect to a user-defined threshold. (b)Read ID (2i, 2i + 1) is defined by the i-th pair of reads, and an auxiliary table is used to record the mapping between a read ID and the position of the "$" in the BWT w.r.t this read. This enables fast recovery of read sequences and qualities in linear time. However, this method requires that all reads have equal length.

Seed selection
A seed is a sub-sequence shorter than a read. The main idea of our seed selection strategy is to select the seeds that have only one occurrence in the genome to be assembled. In the context of de novo assembly, there is no way to calculate the exact number of occurrences of a seed in the genome. We develop the following method to guarantee a high probability to select one-occurrence seeds, which we call inferred-unique seeds. Let d be the average sequencing depth of a genome, and each read has length m. Here we define the expected depth of a sub-sequence P with length k to be Fig. 1 Overview of the whole assembly method. There are five steps for one direction extension. Firstly, we choose an initial read by order and find an initial seed in this read. Then we use bi-directional BWT to get the SA ranges of this seed using backward exact matching. Thirdly, we build up a backward extension tree by adding bases to continue the backward matching. After removing erroneous branches and heterozygosis branches, we obtain the consensus sequence of the extended region. Finally, we continue to find a new seed in the extended region and extend iteratively d k = (mk + 1) * d/m [12]. If Dep(P) (which is the depth of P calculated according to ISR(P) in Section 2) is no larger than z*d k , in which z is a user-defined parameter, we define P as an inferred-unique sequence, which means it is likely to occur only once in the genome.
To find an inferred-unique seed in a read R i or a previously extended sequence, starting at the end and by using backward search mentioned above, we can update the ISRs and calculate the depth incrementally until it achieves inferred-unique. For example, we find a seed in read R i of length m, we calcualte the ISRs and depth of Meanwhile, We calculate the expected depth d m-j with j decreasing from m-1 to 0. Then there would be two cases for the changes of depth from these sub-sequences: Case 1: The depth of R i [j…m-1] is reduced to less than user-defined depth threshold τ. Then we will further try to find seed in the substring R i [0…j-1].
Case 2: The depth of R i [j…m-1] is no larger than z* d m-j . Then sub-sequence R i [j…k-1] meets the requirement of inferred-unique and no more sub-sequences will be checked.
Each inferred-unique sub-sequence will be further checked to make sure the all the bases in seed have high quality scores (using the 1-bit base quality stored in BWT). Then we obtain a high quality inferred-unique seed to start the extension step. It can be found in a read to initialize an extention as an initial seed or in the extended regions to start a new iteration.

Seed extension and consensus
Given a pattern P with Dep(P) > 0 and a character c, we define cP as a valid backward extension of P if and only if Dep(cP) > 0. For a seed S, by adding characters in the head base by base, we can construct a backward extension tree T s whose nodes are tagged with characters chosen from {"A", "C", "G", "T"}, except for the root node, which is tagged with Seed S. Define L(v), the label of a node v, to be the concatenation of tags from v to the root; and define W(v), the weight of the node v, to be the depth of L(v).
Backward extension tree is built recursively. The root tagged with S is firstly created. For each newly created node v, if cL(v) is a valid backward extension of L(v) for some character c in {"A", "C", "G", "T"}, a new node is created as a child of v and is tagged with c. Note that the label of a node will not be longer than the read length, the depth of the tree is limited by the read length minus the seed length. Moreover, for any node v in the tree, if Dep($L(v)) > 0, we obtain the IDs of reads which have L(v) as a shared prefix and mark these reads to avoid redundant assembly.
The consensus sequence for the backward extension tree is constructed by walking down the tree from the root to a certain node. This process is called consensuswalk. When visiting a node with only one child, the walk moves on to that child. Otherwise we have to select a branch to move on or stop immediately. A greedy algorithm, which chooses the child with the largest weight, is straightforward but error-prone. Therefore, we introduce another strategy, which we call reverse validation, to improve the probability of choosing the correct branch. For simplicity, we describe our method for the case of two branches. As shown in Fig. 2, let ν be the node that the consensus is currently processing to, a and b be two children of v, tagged with t a and t b respectively. Let C = L(v) be the consensus sequence we have already constructed. The method incrementally calculates the depth of t a , t a C[0], t a C[0 … 1], t a C[0 … 2], etc. and Below τ denotes a user-defined threshold. Case 1. If Dep(t a C[0 … i]) < τ and Dep(t a C[0 … i]) < 0 for some i, we immediately conclude that a is an erroneous branch and b is authentic if it demonstrates the following properties: Case 2. if Dep(t a C[0 … i]) = 0 for some i, we conclude that the initial seed is a false positive inferred-unique seed and a is near another copy of this seed in the genome, and a is named as a repetitive branch if it demonstrates the following properties: If we fail to identify the above two cases, an additional step will be introduced to estimate whether the branches are due to heterozygous sites. Starting from a and b, we use a greedy algorithm mentioned above to obtain two sequences representing the consensus of the sub-trees rooted at a and b, respectively. If the similarity of these two sequences is high enough, we make a prediction that these two branches are caused by heterozygous sites, and walk to the child with larger weight. Otherwise, the consensus-walk stopped at node v.
If the consensus-walk does not stop at the root of the tree, i.e. the consensus sequence has been extent by at least one base pair from the seed, a new inferred unique seed will be chosen from the prefixes of the consensus sequences to start a new round. The process of seeding, backward extension and consensus is repeated until the consensus-walk stops at the root of the extension tree in some round. Then a series of symmetric processes follow, which forward extend the initial seed. Finally the contig containing this initial seed, which is the concatenation of the consensus sequences in both directions, is obtained when the forward extension completes.

Contig assembly using paired-end information
Paired-end reads have adjacent read IDs in the BWT and this information is used to resolve longer repetitive regions, when the consensus-walk stops at the root of the extension tree. For two children nodes a and b of root node v, reads with "$" falling in the sub-tree of a and sub-tree of b regions are divided into R(a) and R(b). We check whether the paired reads of R(a) or R(b) have been used in current assembled sequences and whether the distances are proper as estimated by their positions in this contig and their insert sizes. Without loss of generality, if only paired reads of R(a) are found and the number is larger than user-defined threshold τ mentioned above, the child node b will be removed. This method could be used to assemble repetitive sequences longer than read length and obtain longer contig sequences.

Evaluation
Using reference genomes for Staphylococcus aureus MW2 (www.genomic.ch/edena/results2013/ReferenceSe quences/) and V. parahaemolyticus (RIMD2210633), we evaluated the accuracy of assembly using the GAGE pipeline [16], in which metrics such as correct N50, mismatch, align rate and coverage are assessed. For YH and NA12878, we mapped the assembled contigs to Fig. 2 Remove branches in backward extension tree. In the backward extension tree, we try to remove erroneous branches, repetitive branches and heterozygosis branches to obtain the consensus sequences of the extended region. As an example, we meet node v with two child node a and b. Firstly, combined with L(v), we obtained TL(v) for a and GL(v) for b to detect erroneous branches between a and b. We incrementally calculate the depth of sub-sequences of a(sub-a i with length i): T, TA, TAT, …, and b(sub-b i with length i): G, GA, GAT, … until the depth of sub-a is less than user-defined threshold τ. At the same time, if Dep(sub-a i ) is significantly smaller than Dep(sub-a i-1 ), Dep(sub-a i ) is significantly smaller than d i and Dep(sub-b i ) is significantly larger than Dep(sub-a i ), then branch a will be treated as a erroneous branch or repetitive branch. When there is no erroneous signal, we will further try to remove the branch, which might be caused by heterozygosis. After obtaining two sequences representing the consensus sequences of the sub-trees rooted at a and b respectively, we compare the two sequences to find the matched region and get the depth of it. Then we use this depth to calculate base depth and compare to the base depth calculated by depth of initial seed. If the two sequences have high similarity and the two base depths are similar to each other, we will treat a as heterozygous branch if W(a) is smaller than W(b) hg19 with LAST [17] and evaluated the alignment rate, reference coverage and repeat-masked reference coverage.

Comparison
As shown in the bacterial assembly (Table 1), BASE obtains contigs with the highest accuracy among all evaluated assemblers and is the only assembler that achieve 100 % alignment rate. Except four translocations of SPAdes in dataset of V.para, translocations assembled by BASE, SGA and SOAPdenovo2 are all caused by circular DNA and are not included in Table 1. For the 100 bp dataset of S.aureus, the correct N50 statistics of BASE is much shorter than that of SPAdes and is only a bit longer than that of SGA and SOAPdenovo2. Further analysis showed that BASE's improvement over SGA and SOAPdenenovo2 is mainly due to the effective usage of paired-end information. For the 250 bp dataset of V.para, the correct N50 from BASE is indeed comparable to that of SPAdes and is much longer than that of SGA and SOAPdenovo2. As shown in Table 1, BASE takes slightly longer time in building index and assembling contigs than SOAPdenovo2, but is much faster than SPAdes and SGA. The coverage of contigs from BASE is relatively low, which could be improved by devoting more time to initialize more extension or by scaffolding like SOAPdenovo2.
We also tested human genome assembly with four datasets: YH 100 bp~35X, YH 150 bp~63X, NA12878 XTen 150 bp~35X and NA12878 HiSeq 250 bp~45X. With 30X 100 bp reads, it already took SGA more than 2 days [7] and Fermi nearly five days [8] to output the contigs. As shown in Table 2, for the~35X 100 bp YH dataset, both BASE and SOAPdenovo2 (in single-kmer mode) took only about half a day to obtain the contigs. To assemble X Ten data (150 bp reads), BASE used much less memory than SOAPdenovo2 on indexing and contig assembly (Table 3). This is probably related to the high error rate of X Ten data, as shown in 17mer depth distribution of the three datasets in Fig. 3.
In all four human datasets, shown in Table 3 the N50 statistics of BASE improves as read length increases, while SOAPdenovo2 does not show such degree of improvement. BASE's improvement over SOAPdenovo2 becomes significant for 250 bp reads. Similar to bacterial S.aureus MW2 has its real reference with length 2.8 Mb and V.para has its species' reference with length 5.1 Mb and two chromosomes. Both of these two bacteria are sequenced up to 240X. GAGE validation pipeline was used to calculate the corrected contig N50, base errors, structural errors, contig aligned rate and reference coverage. Except BASE used single thread for contig assembly part, and other the assemblies were all performed with 24 threads. The time before semicolon is for index building and after semicolon is for assembly. For SGA, indexing time contains the time used in the indexing after error correction and filtering; assembly time contains the time used in the overlap and assembly  assembly, BASE's genome coverage, with repeat masked, is lower that of SOAPdenovo2. But BASE has an overall higher genome coverage in each dataset. This suggests that BASE is able to assemble more repetitive regions, which can be used to explain the slightly more mismatches between assembled sequences and hg19 for BASE, as shown in Table 4. The efficiency of GPU acceleration is shown in Table 5. To construct BWT of small datasets from two bacterial genomes, without GPU, it takes 1.38~1.56 folds longer time than GPU version. But for large human datasets, the CPU-only version takes near 4 times as much time as the GPU version. Therefore the GPU version is recommended especially for large sequencing dataset.

Conclusions
The primary objective of this paper is to study whether a seed-extension approach to contig assembly, coupled with reverse validation, can give a performance (accuracy and N50) comparable to SOAPdenovo2 and SGA. As shown in the previous section, the new approach gives clear advantage for longer reads, and with speed much higher than SGA and comparable to SOAPdenovo2, and stable memory usage (i.e., not sensitive to error rate of the reads). The contigs obtained by BASE are longer and cover more repetitive sequences than those from SOAPdenovo2 and SGA.
Based on the high quality contigs assembled by BASE, one could use less accurate third generation reads or paired-end reads with longer insert size for scaffolding and gap closing. This approach has been used in a recently published assembler DBG2OLC [18], which assembles second level contigs onto high accurate DBG contigs. Indels or SV could also be detected with these contigs using established methods [8].
With the increasing length of high quality Illumina reads, it is of computational interest how to fully utlilize the read length information in contig assembly. SGA, Fermi and our tool BASE both build an index of the reads and make it possible to assemble high-depth short reads without splitting them into kmers. Although SGA and Fermi could finish assembly with less memory, they need much longer time. As noted in MEGAHIT [19], the requirment for big memory machine can be circumvented. For future bioinformatics analysis including assembly, it is time and robustness that matter most. We plan to further reduce the running time of BASE. We mapped the assembled contigs to Hg19 and got the mismatches between each contig and reference sequence. Then we used the detected SNPs and SNPs from published SNP databases to analysis these mismatches in whole genome and exon regions respecitively To evaluate the acceleration performance of GPU on BWT construction, we used two versions, one of which used GPU and the other not, to build bi-directional BWT on four datasets. The results showed that especially in large genome dataset, compared with GPU version, version without GPU costs~4 fold more time to construct BWT