 Research
 Open access
 Published:
HyDAVista: towards optimal guided selection of kmer size for sequence assembly
BMC Genomics volume 15, Article number: S9 (2014)
Abstract
Motivation
Intimately tied to assembly quality is the complexity of the de Bruijn graph built by the assembler. Thus, there have been many paradigms developed to decrease the complexity of the de Bruijn graph. One obvious combinatorial paradigm for this is to allow the value of k to vary; having a larger value of k where the graph is more complex and a smaller value of k where the graph would likely contain fewer spurious edges and vertices. One open problem that affects the practicality of this method is how to predict the value of k prior to building the de Bruijn graph. We show that optimal values of k can be predicted prior to assembly by using the information contained in a phylogeneticallyclose genome and therefore, help make the use of multiple values of k practical for genome assembly.
Results
We present HyDAVista, which is a genome assembler that uses homology information to choose a value of k for each read prior to the de Bruijn graph construction. The chosen k is optimal if there are no sequencing errors and the coverage is sufficient. Fundamental to our method is the construction of the maximal sequence landscape, which is a data structure that stores for each position in the input string, the largest repeated substring containing that position. In particular, we show the maximal sequence landscape can be constructed in O(n + n log n)time and O(n)space. HyDAVista first constructs the maximal sequence landscape for a homologous genome. The reads are then aligned to this reference genome, and values of k are assigned to each read using the maximal sequence landscape and the alignments. Eventually, all the reads are assembled by an iterative de Bruijn graph construction method. Our results and comparison to other assemblers demonstrate that HyDAVista achieves the best assembly of E. coli before repeat resolution or scaffolding.
Availability
HyDAVista is freely available [1]. The code for constructing the maximal sequence landscape and choosing the optimal value of k for each read is also separately available on the website and could be incorporated into any genome assembler.
Introduction
The ability to accurately assemble genomes is a fundamental problem in bioinformatics that is vital to the success of many scientific projects, including the 10,000 vertebrate genomes (Genome 10K) [2], Arabidopsis variations (1001 genomes) [3], human variations (1000 genomes) [4], and Human Microbiome Project [5]. The genome assembly process aims to build contiguous sequences, called contigs, predominantly from short read sequencing data. Other sources of information have also been used to boost the accuracy, including genetic linkage data [6], optical mapping data [7], and longer sequencing reads (e.g. PacBio data) [8]. A potential source of information that has not been fully explored is the information contained in phylogeneticallyclose genomes. The genome of an individual of the same species or that of a phylogeneticallyclose species can potentially be used as an extra source of information, and increase the assembly quality. We argue that genome assemblers can benefit from using a reference genome to help guide the assembly process, particularly in those regions of the genome that are pervaded by repetitive sequences.
In Eulerian sequence assembly [9, 10], a de Bruijn graph is constructed with a vertex v for every (k − 1)mer present in an input set of reads, and an edge \left(v{v}^{\prime}\right) for every observed kmer in the reads with (k − 1)mer prefix v and (k − 1)mer suffix {v}^{\prime}. A contig corresponds to a nonbranching path through this graph. SPAdes [11], IDBA [12], EulerSR [13], Velvet [14], SOAPdenovo [15], ABySS [16] and ALLPATHS [17] all use this paradigm for assembly. The majority of de Bruijn graph based assemblers follow the same general outline: break the (possibly error corrected) reads into kmers, construct the de Bruijn graph on the set of resulting kmers, simplify the de Bruijn graph, resolve the repeated regions by using matepair information, and construct the contigs (simple paths in the de Bruijn graph). Therefore, the majority of assemblers require or allow the value of k to be specified by the user.
The problem of determining an appropriate value of k for the de Bruijn graph construction is important since it has a direct impact on assembly quality; stated very briefly, when k is too small the resulting graph is complicated by spurious edges and vertices, and when k is too large the graph becomes too sparse and possibly disconnected. Repetitive regions are especially problematic for genome assembly since they inadvertently result in spurious edges and vertices in the de Bruijn graph [18] and are very sensitive to the choice of k. There has been a significant effort in developing algorithms that will choose an ideal value for k by preprocessing the sequence reads, and thus, reduce the complexity of the de Bruijn graph [11, 12, 19].
A more obvious combinatorial approach for building a simplified de Bruijn graph would be to allow the value of k to vary; having a larger value of k where the graph is more complex and a smaller value of k where the graph would likely contain fewer spurious edges and vertices. A major difficulty in implementing this approach is determining a practical method that makes this idea feasible assembling large genomes. Peng et al. [12] and Bankevich et al. [11] both introduced assemblers that use various values of k. IDBA builds the de Bruijn graph in an iterative manner from k = k_{ min } to k = k_{ max }; these values of k are predetermined and (by default) do not change for different datasets or genomes. At iteration i, the de Bruin graph for k_{ i } is built from the current set of reads and the contigs for that graph are constructed, then all the reads that align to at least one of those contigs are removed from the current set of reads. In the next iteration the graph is built by converting every edge from the previous graph to a vertex while treating contigs as edges. SPAdes [11] uses a similar approach but uses all the reads at each iteration.
While this method has been shown to greatly improve assembly quality [11, 12], it is not efficient since all the reads are assembled at each iteration. Thus, one challenge that remains to be addressed is how to efficiently determine which values of k should be used for this iterative assembly process and how to assign a kmer value for each read. If this could be accomplished prior to assembly of the de Bruijn graph(s) then these iterative assembly methods could be made more efficient without degrading the assemblies quality.
Our contribution. We introduce an efficient algorithm for determining an optimal value of k for each read prior to constructing the de Bruijn graph, and implement this method into a modified version of HyDA, a de novo assembler developed by Movahedi et al. [20]. This modified assembler, which we refer to as HyDAVista, takes as input a phylogeneticallyclose genome and a set of pairedend reads. Imperative to HyDAVista is the construction of the maximal sequence landscape, which is a data structure that stores for every position in the input string, the longest repeat containing it. Prior to de Bruijn graph construction, HyDAVista constructs the maximal sequence landscape for the phylogeneticallyclose genome, and aligns the reads to the reference genome. The alignment and landscape allows the optimal value of k for each read be determined in linear time in the length of the read, provided the read is longer than the longest repeat. These values of k are "optimal" in the sense that for unchanged parts of the genome, the de Bruijn graph will have no spurious edges or vertices if there are no sequencing errors, and the length of the repeat is smaller than the read length. Unaligned reads are assigned a default value of k. After the assignment of values of k to each read, HyDAVista constructs the de Bruijn graph in an iterative manner.
Our approach for choosing values of k for each read takes into consideration the repeat structure of the genome, which enables us to avoid overlycomplex regions of the graph since the assignment of values of k to reads is done prior to the assembly rather than during the assembly. We compare HyDAVista versus IDBA [12], SPAdes [11], SOAPdenovo [15], ABySS [16] and HyDA [20]. Our results demonstrate that HyDAVista produces the best assembly of E. coli before repeat resolution or scaffolding. We aim to achieve the best assembly without repeat resolution and scaffolding and note that such methods could be applied to all these initial assemblies. Lastly, we demonstrate that this method improves the efficiency of iterative assembly.
Roadmap. We review related tools for the problem in the remainder of this section. Section then sets notation and formally lays down the problem and the data structures that will be used for the construction of the maximal sequence. We formally define the maximal sequence landscape in Section. Section gives details of HyDAVista. In Section we give results that demonstrate how HyDAVista compares against competing assemblers. Finally, Section offers reflections and some future areas of research that warrant investigation.
Related work. The resequencing methods include LOCAS and SUPERLOCAS [21, 22], eRGA [23], Colombus module of Velvet [24] and IDBAHybrid [25]. Since these methods aim to determine structural variations between species, and require extremely high sequence similarity to produce reasonable results, they have only been applied to individuals of the same species. Our focus is to produce high quality de novo assemblies using homology information contained in the reference genome of the same species or phylogeneticallyclose species. Gnerre et al. [26] also consider how to improve assembly quality by using the alignment of reads to a reference genome. Their method simultaneously builds a de novo assembly from the reads and aligns these same reads to one or more related genomes. The alignment is then used to improve the assembly quality, e.g., reads that were not used in the assembly are incorporated into the assembly using the alignment.
Complementary to the work of Chikhi and Medvedev [19], Peng et al. [12], and Bankevich et al. [11], there has been an effort in developing methods that use pairedend data to constrain the construction of the de Bruijn graph [21, 27–30]. Medvedev et al. [29] introduced the concept of a paired de Brujin graph. Since the insert size is variable among mate pairs, this method requires that all the paths within some threshold be considered in order to ensure an edge is not missed. Thus, Bankevich et al. [11] improve upon this idea by developing the rectangle graph, which eliminates the need to consider all paths. Vyahhi et al. [30] furthered this study of rectangle graphs for genome assembly. These methods merit mentioning the goal of these methods is the same as the goal of increasing the value of k in certain regions; both aim to minimize spurious edges and branching in the graph but in a different manner.
Determining all maximal exact repeats in a string has been previously studied [31–33]. It has been shown that all maximal repeats of a string can be found and stored in O(n)time and O(n)space using a suffix tree (although the output maybe of size Θ(n^{2})) or directed acrylic graph [32]. Therefore, the maximal sequence landscape, which we define in this paper, can be constructed from either a suffix tree or a directed acrylic graph in O(n)time and O(n)space using these algorithms directly or adapting them. However, the constant in the order notation of the space complexity of these constructions is relatively large. The algorithm we present uses a suffix array and thus, requires linear space with a smaller constant and O(n log n)time. Hence, we pay a log n cost in time to remove the large constant from the linear space time. We also note that the related problem of finding inexact maximal repeats also has been previously studied [34–37].
Background
Strings. Consider a fixed alphabet Σ = {σ_{1},..., σ_{ m }} and a total order ≤_{ L } defined over Σ = Σ ∪ {$} where $ ∉ Σ, and for all σ ∈ Σ we have $ ≤_{ L } σ. We denote a finite string s as s_{1}s_{2} ... s_{ n }, where s_{ i } ∈ Σ. We use s_{ ij } , where 1 ≤ i ≤ j ≤ n, to indicate substring s_{ i }s_{ i+1 } s_{ j } of string s. We call substrings {s}_{i}^{\mathsf{\text{Pre}}}={s}_{1i} and {s}_{i}^{\text{Suf}}={s}_{in} with i ∈ {1, . . . , n} the i^{th} prefix and i^{th} suffix of s respectively. Based on the total order ≤_{ L }, we define a lexicographical total order on the strings in Σ^{∗}.
Arrays. We denote arrays of integers by all capital letter strings like A, SP, LCP, etc. A[i], with 1 ≤ i ≤ A, stands for the integer in the i^{th} cell of array A. Also, A[i, j] indicates the projection of A onto indices i to j, inclusive of both ends. For an array A, with A = n, that holds a permutation of integers {1, ..., n}, index array of A is another array I(A) with I(A) = n such that I(A)[i] = j if and only if A[j] = i.
Suffix and longest common prefix arrays. SA_{ s }, for some string s, denotes the suffix array associated with s [38]. SA_{ s }[i] = l for i ∈ {1, ... , n} if and only if {s}_{l}^{\text{Suf}} is the i^{th} string in the lexicographically sorted list of all suffixes of s. We also indicate the longest common prefix array of some string s with LCP_{ s }, and LCP_{ s }[i] = l for i ∈ {1, . . . , n − 1} if and only if the length of the longest common prefix between {s}_{\mathsf{\text{S}}{\mathsf{\text{A}}}_{s\left[i\right]}}^{\mathsf{\text{Suf}}} and {s}_{\mathsf{\text{S}}{\mathsf{\text{A}}}_{s\left[i\right]}+1}^{\mathsf{\text{Suf}}} equals to l.
Approach
The maximal sequence landscape
We formally define the maximal sequence landscape in this section. Clift et al. [32] introduced the concept of a sequence landscape, which is a data structure that stores the occurrences of any substring from a source string s in a target string t. In set representation, the sequence landscape L_{ ts } of a target string t with respect to a source string s is defined as a set of tuples {m_{1}, m_{2}, ..., m_{ l }}, where m_{ i } = (b_{ i }, e_{ i }, f_{ i }) corresponds to the occurrence of substring sb_{ i } e_{ i } = sb_{ i } sb_{ i +1 } se_{ i } from s in t with frequency f_{ i }. If s and t are equal then the sequence landscape categorizes all repeated substrings in the source string s. We define to this special case where s = t as the self sequence landscape. Figure 1 illustrates an example of a self sequence landscape and a sequence landscape. Given a position i of the input string s, all the repeated substrings containing s_{ i } can be recovered from the self sequence landscape in lineartime in the number of different repetitions.
The occurrences of the substrings in the source string are defined as mountains. This terminology reflects the visual representation that was first introduced by Clift et al. [32] that illustrates each occurrence as a mountain having height equal to the length of the substring, i.e. the height of mountain mi of L_{ ts } is denoted as h(m_{ i }) and equal to e_{ i } − b_{ i } + 1. The peak of each mountain is labelled with the frequency of the substring corresponding to it. In Figure 1 (left), the substring CAT is represented as two mountains each of which has a height equal to three and frequency equal to two.
We say that a mountain m_{ j } = (b_{ j } , e_{ j } , f_{ j } ) in a landscape L_{ ts } covers index i and denote it by m_{ j } Δ i if and only if i ∈ {b_{ j } ..., e_{ j } }. Hence, the cover set of a specific index i of the sequence landscape L_{ ts } is the set of all the mountains that covers i. We denote the cover set as {C}_{{L}_{ts}}\left(i\right) and define it as follows:
Lastly, we define the summit of index i as the highest mountains in its cover set. We denote the summit of i by SLt s (i) and define it as follows:
Please note that the summit of index i can be empty or nonunique so the height of summit of index i is defined to be zero for empty set.
Definition 1 The maximal sequence landscape, which we denote as {L}_{ts}^{*}, is the set of the summits of all positions in s that have frequency greater than one. {L}_{ts}^{*}can be formally defined as follows: {L}_{ts}^{*}=\left\{{S}_{{L}_{ts}}\left(i\right)i=1,\dots ,n\right\}The maximal sequence landscape is highlighted in light grey in Figure 1.
The maximal sequence landscape is obtained from the sequence landscape by removing all mountains except those that are highest and have frequency greater than one at each position. In the case of the maximal sequence landscape constructed from a self sequence landscape, this results in a data structure containing the longest repeat at each position of the input string. In Subsection we give an algorithm that builds the maximal sequence landscape and returns an array containing the length of the longest repeat at each position of the input string. Therefore, given a position i in s, we can determine the length of the longest repeat in s containing that position in constant time by simply indexing the maximal sequence landscape at position i. By choosing a value for k that is larger than the length of this repeat it can guaranteed that there will be no branching in the corresponding vertices of the de Bruijn graph, if the same substring is not repeated in changed parts of the genome that is being assembled. This is our idea based on which we determine the optimal value of k for each read. We consider the maximal sequence landscape constructed from the self sequence landscape for the remainder of this paper since it is what is used by HyDAVista.
Methods
Algorithm 1 gives an overview of HyDAVista algorithm. We explain each of these steps in detail in the subsequent subsections.
Algorithm 1 An overview of HyDAVista
1: Build the maximal sequence landscape for the reference genome.
2: Align all reads to the reference using BWA.
3: For each aligned read: assign a value of k using the maximal sequence landscape.
4: Unaligned reads are assigned a value of k using a heuristic.
5: The de Bruijn graph is constructed in an iterative manner, as shown in Algorithm 2.
Construction of the maximal sequence landscape
In this section we demonstrate that the maximal sequence landscape for an input string s can be built in O(n + n log n)time and O(n)space using a simple algorithm, where n is the length of s. Our method relies on the use of suffix array and longest common prefix array and thus, begins by building the suffix array (SA_{ s }) and the longest common prefix array (LCP_{ s }). This construction can be done in O(n)space and O(n)time [39]. Two other auxiliary data structures are constructed at the beginning of the algorithm. However, we delay the definition of these to later in this section. The algorithm then iterates through each position of s and finds the longest repeated substring in s that contains it using SA_{ s }, LCP_{ s }, and the auxiliary data structures. An important aspect of our algorithm that allows us to achieve O(n + n log n)time is that we only search the interval of SA_{ s } that is between the indices SA_{min} and SA_{max} at each iteration; not the entire array. In other words this invariant holds at each iteration of our algorithm: [SA_{min}, SA_{max}] holds the the interval in the suffix array that corresponding suffixes share the same prefix. This prefix is the longest repeat that has been seen so far and covers that position. Thus, each time the largest repeated substring is found for a particular position, the maximal sequence landscape, SA_{min}, and SA_{max} are updated for search at the next iteration. Two possibilities exist at each iteration i of the algorithm when we are processing si;

(a)
The longest repeated substring at position i  1 can be extended by appending s_{ i }. The maximal sequence landscape, SA_{min}, and SA_{max} are updated.

(b)
The longest repeated substring at position i  1 cannot be extended by appending s_{ i } (either the extended string does not occur or it does occur but its frequency is one). Let p = s_{ j } ⋯ s_{ i−1 } be the longest repeated substring yet found that contains s_{ i−1 } , and {p}^{\prime}={s}_{j+1}\cdots {s}_{i} be the string obtained by removing the first letter of p and appending s_{ i }. If the frequency of occurrence of the {p}^{\prime} is greater than one, then the maximal sequence landscape, and the search interval is updated as in (a). Otherwise, the search for the longest repeated substring continues by eliminating the first character of {p}^{\prime} each time until a repeating match is found or the null string is reached. If the null string is reached then the maximal sequence landscape is empty at that position and the search interval is updated to [1, n].
The search interval contains all indices in SAs for which the corresponding suffixes have the current longest repeated substring as a prefix. In (a), the interval is updated by performing binary search. In (b) the search interval is no longer valid since we removed a letter from the beginning of the current longest repeated substring and we need (a more complicated) scheme to efficiently find the correct search interval. To accomplish this we need two auxiliary data structures that are constructed at the beginning of the algorithm: the SP_{ s } array, and an ordered binary search tree containing all consecutive intervals of LCP_{ s }. SP_{ s }[j] holds the index in SAs that is obtained by removing the first letter from the beginning of {s}_{j}{s}_{j+1}\cdots {s}_{n} in order to obtain {s}_{j+1}\cdots {s}_{n}. This array can be built in linear time by scanning the index array of SA_{ s }. Thus, to find the correct interval in (b), we locate an index of SA_{ s } (denoted as sp) where the corresponding suffix contains {s}_{j+1}\dots {s}_{i1} as a prefix, and find the largest interval around sp where all the suffixes in the interval have {s}_{j+1}\dots {s}_{i1} as prefix. This is the new search interval. The sp index can be found in constant time by correctly indexing SP_{ s } . The second step is equivalent to finding the largest interval [d, u] around sp that for all j ∈ [d, u] we have \mathsf{\text{LC}}{\mathsf{\text{P}}}_{s}\left[j\right]\ge \left{p}^{\prime}\right1. This can be done in O(log n)time and O(n)space using an ordered binary search tree.
The time complexity can be seen by first noting that at each iteration of the algorithm the maximal sequence landscape either ascends by one after a number of descents (possible zero) or it is undefined after a number of nonzero descents, and each of these ascents or descents require O(log n)time. Note that in (a) the maximal sequence landscape is ascending, and in (b) the maximal sequence landscape is descending, and the frequency of a substring in s can be determined in constant time using LCP_{ s }. Second, since each time it ascends one character from s is processed and the number of ascents equals the number of descents, the total number of ascents and descents is 2n. Therefore, since the data structures are constructed in O(n)time, and since there are at most 2n ascents or descents which take O(log n)time, the running time of the algorithm is O(n + n log n).
Assignment of kmer sizes to reads
We assign values of k to the reads using the maximal sequence landscape constructed for the reference genome by first aligning the reads to the reference genome using BWA (version 0.7.4) [40] in pairedend mode. We consider all forward and reverse alignments of every read. Let p be the position in the reference genome where a read of length l aligns, and let k^{∗} be the maximum of {MSL[p] + 1, MSL[p + 1] + 1, ... , MSL[p + l] + 1}, where MSL is the maximal sequence landscape array that contains the height of the maximal sequence landscape at each position. We compute k^{∗} for each forward alignment and let K^{∗} be the set of all these values. The optimal value of k for the (forward) read is equal to the maximum value in K^{∗}. We follow the same procedure for the reverse alignments with the exception that we compute the reverse complement of the read. Thus, the optimal values of k can be computed in linear time in the length of the read.
If the computed kmer size (maximum of all maximal sequence landscape heights of all aligned nucleotides) is larger than the read length, then a default value (k = 77 is the default) is used instead. Unaligned reads are also assigned a default kmer value (k = 55 is the default).
The de Bruijn graphs
Let R = {r_{1},..., r_{ N } } denote the set of reads. We also denote the kmer size assigned to r_{ i } in the previous section by K(r_{ i }). In the first step of constructing the assembly de Bruijn graphs, we partition R into R_{ k } := {r ∈ R  K(r) = k}, in which k ranges from k_{min} = min_{ r∈R } K(r) to k_{max} = max_{ r∈R } K(r). The HyDAVista assembly procedure, shown in Algorithm 2, iteratively builds de Bruijn graphs {G}_{{k}_{\mathsf{\text{min}}}},\cdots \phantom{\rule{0.3em}{0ex}},{G}_{{k}_{\mathsf{\text{max}}}} with k = k_{min},..., k_{max} respectively and obtains {A}_{{k}_{\mathsf{\text{min}}}},\dots ,{A}_{{k}_{\mathsf{\text{max}}}} assembly contig sequences after iterative graph condensation and error removal. Each G_{ k } is constructed from the reads whose assigned kmer size is not more than k and the contigs resulting from G_{ k−1 } constructed in the previous iteration,
The rationale behind this idea is that those reads that have an assigned kmer size not more than k should ideally not create any repeats when they are assembled with the kmer size k. The iterative inclusion of contigs from previous rounds, first introduced in IDBA [12] and later adopted by others [11], is an idea that has already shown merit in improving assembly quality. In Algorithm 2, HyDA is a function that accepts a set of input sequences and an integer k, and returns a set of contigs which are obtained from assembling the input sequences with a k de Bruijn graph.
Algorithm 2 Construction of the de Bruijn graphs
1: function HyDAVista(R, K)
2: {k}_{\mathsf{\text{min}}}\leftarrow \underset{r\in R}{\mathsf{\text{min}}}K\left(r\right),{k}_{\mathsf{\text{max}}}\leftarrow \underset{r\in R}{\mathsf{\text{max}}}K\left(r\right)
3: for all k_{min} ≤ k ≤ k_{max} do
4: {R}_{k}\leftarrow \mathrm{0\u0338}
5: end for
6: for all r ∈ R do
7: k ← K(r)
8: R_{ k } ← R_{ k } ∪ {r}
9: end for
10: {R}^{\prime}\leftarrow \mathrm{0\u0338}
11: {A}_{{k}_{\mathsf{\text{min}}}1}\leftarrow \mathrm{0\u0338} ▷ assembly contigs
12: for k := k_{min} to k_{max} do
13: {R}^{\prime}\leftarrow {R}^{\prime}\cup {R}_{k}
14: {A}_{k}\leftarrow \mathsf{\text{HyDA}}\left({R}^{\prime}\cup {A}_{k1},k\right) ▷ contigs resulting from assembly with HyDA
15: end for
16: return {A}_{{k}_{\mathsf{\text{max}}}}
17: end function
Results
Improved efficiency due to maximal sequence landscape
HyDAVista uses the maximal landscape to break the reads into groups by assigning each a value of k. It then uses these groups to build the graph iteratively. This is in contrast to other methods that also iteratively build of the graph; SPAdes [11] uses all the reads at each iteration, and IDBA [12] uses a more complicated approach to remove some subset of reads at each iteration. Thus, one of the main advantages of using the maximal sequence landscape is that it increases the efficiency of building the assembly graph iteratively without degrading assembly quality (see the next subsection for a comparison of the different assemblers). To demonstrate this efficiency experimentally we ran HyDAVista with and without the maximal sequence landscape on multicell E. coli (substr. K12) Illumina data and the E. coli (substr. K12) reference genome. See Subsection for a description of this dataset. Without the maximal sequence landscape the assembly took 1,414 minutes, and with the maximal sequence landscape the assembly took 822 minutes with 42 number of minutes for building the maximal sequence landscape and assigning the values of k to the reads.
Comparison between competing assemblers and HyDAVista on E.Coli
The first data set consists of approximately 27 million pairedend 100 bp reads from multicell E. coli (substr. K12), generated by the Illumina Genome Analayzer (GA) IIx platform. It was obtained from the NCBI Short Read Archive (accession ERA000206, EMBLEBI Sequence Read Archive). To assess assembly quality, we aligned the reads to the E. coli reference genome (substr. K12) using BWA (version 0.5.9) [40] with default parameters. We call a read mapped if BWA outputs an alignment for it and unmapped otherwise. Analysis of the alignments revealed that 98% of the reads mapped to the reference genome, representing an average depth of approximately 600×; An analysis using BLAST against known contaminants revealed that the unmapped reads are attributed to minor contamination of the sample [41]. All reads were error corrected using BayesHammer [42] with default parameters.
KmerGenie [19] predicted 41 to be the optimal kmer value for this dataset. Therefore, for the assemblers that require a single value of k to be specified (SOAPdenovo, ABySS, HyDA) we used k = 41. HyDA and HyDAVista were ran with a cut off of five. All other parameters of SOAPdenovo and ABySS were kept at their default. SPAdes and IDBA were run with their default parameters in singleend mode. Since the input reads were corrected prior to assembly, the reported data for SPAdes is from the "only assembly" stage. IDBA was run with and without error correction, yielding the same statistics as expected.
Table 1 gives the standard assembly statistics of all the assemblies. All statistics in Table 1 were computed by QUAST in default mode [43]. The results demonstrate that HyDAVista achieves the best assembly prior to repeat resolution or scaffolding. Note that upon determination of kmer sizes, all assemblers were run in singleend mode, i.e., ignoring the pairing information, to study only the effect of our k assignment on contigs. HyDA provides a skeletal de Bruijn graph implementation on which other technologies can be developed. HyDA alone does not compete with some of the state of the art assemblers such as SPAdes and IDBA that employ multiple sophisticated technologies. However, empowered by only the maximum landscape information without any other sophisticated technology particularly pairing information, HyDAVista increases the N50 and NG50 more than twice (in comparison to HyDA) and outperforms the competing assemblers in all measures (the NG50 of 36 kbps obtained by HyDA increases to 82 kbps by HyDAVista).
Conclusion
We demonstrated that HyDAVista achieves superior performance with respect to standard assembly statistics for ecoli genome before repeat resolution and scaffolding. A crucial aspect of our method is the construction of the maximal sequence landscape for the phylogeneticallyclose genome, which allows for the optimal kmer value to be computed for each read. The maximal sequence landscape requires that a pair of substrings be an exact match in order for them to be considered to be the same repetition. An area that warrants future investigation is determining if there is an efficient algorithm for computing the maximal sequence landscape for inexact matches, i.e., the maximal sequence landscape with the condition that two substrings x and y are considered to represent the same repetition (mountain) if the Hamming distance or edit distance between the two is at most d, where d is a parameter in the problem. Another open problem is determining whether the maximal sequence landscape could be constructed with only a suffix array. Since all the data structures (including the suffix array) are constructed in O(n)time and O(n)space the order notation of the running time of such a maximal sequence landscape construction algorithm would likely not improve the running time of the existing algorithm by more than a small constant factor. However, the removal of the auxiliary data structures may simplify the algorithm and would be of theoretical interest.
References
HyDAVista : [https://sites.google.com/site/hydavista]
Haussler D, et al: Genome 10K: a proposal to obtain wholegenome sequence for 10,000 vertebrate species. 2009, J. Hered, 100 (6): 659674.
Ossowski S, et al: Sequencing of natural strains of Arabidopsis Thaliana with short reads. Genome Res. 2008, 18 (12): 20242033. 10.1101/gr.080200.108.
The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491 (7422): 5665. 10.1038/nature11632.
Turnbaugh PJ, et al: The human microbiome project: exploring the microbial part of ourselves in a changing world. Nature. 2007, 449 (7164): 804810. 10.1038/nature06244.
Lin HC, et al: AGORA: assembly guided by optical restriction alignment. BMC Bioinformatics. 2012, 13 (1): 18910.1186/1471210513189.
Nagarajan N, Read TD, Pop M: Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics. 2008, 24 (10): 12291235. 10.1093/bioinformatics/btn102.
Huddleston J, et al: Reconstructing complex regions of genomes using longread sequencing technology. Genome Res. 2014, 24 (4): 688696. 10.1101/gr.168450.113.
Idury RM, Waterman M: A new algorithm for DNA sequence assembly. J Comput Biol. 1995, 2 (2): 291306. 10.1089/cmb.1995.2.291.
Pevzner PA, Tang H, Waterman MS: An eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001, 98 (17): 97489753. 10.1073/pnas.171285098.
Bankevich A, et al: SPAdes: a new genome assembly algorithm and its applications to singlecell sequencing. J Comput Biol. 2012, 19 (5): 455477. 10.1089/cmb.2012.0021.
Peng Y, Leung HC, Yiu SM, Chin FY: IDBAa practical iterative de bruijn graph de novo assembler. Res Comput Mol Biol Lecture Notes in Computer Science. 2010, 6044: 426440. 10.1007/9783642126833_28.
Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome Res. 2008, 18 (2): 324330. 10.1101/gr.7088808.
Zerbino R, Birney E: Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome Res. 2008, 18 (5): 821829. 10.1101/gr.074492.107.
Li R, et al: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010, 20 (2): 265272. 10.1101/gr.097261.109.
Simpson JT, et al: ABySS: a parallel assembler for short read sequence data. Genome Res. 2009, 19 (6): 11171123. 10.1101/gr.089532.108.
Butler J, et al: de novo assembly of wholegenome shotgun microreads. Genome Res. 2008, 18 (5): 810820. 10.1101/gr.7337908. ALLPATHS
Alkan C, Sajjadian S, Eichler EE: Limitations of nextgeneration genome sequence assembly. Nat Methods. 2011, 8 (1): 6165. 10.1038/nmeth.1527.
Chikhi R, Medvedev P: Informed and automated kmer size selection for genome assembly. Bioinformatics. 2014, 30 (1): 3137. 10.1093/bioinformatics/btt310.
Movahedi NS, Forouzmand E, Chitsaz H: De novo coassembly of bacterial genomes from multiple single cells. In: IEEE Conference on Bioinformatics and Biomedicine. 2012, 561565.
Schneeberger K, et al: Referenceguided assembly of four diverse Arabidopsis thaliana genomes. Proc Natl Acad Sci. 2011, 108 (25): 1024910254. 10.1073/pnas.1107739108.
Klein JD, et al: LOCASa low coverage assembly tool for resequencing projects. PLOS ONE. 2011, 6 (8): 2345510.1371/journal.pone.0023455.
Vezzi F, Cattonaro F, Policriti A: eRGA: enhanced reference guided assembly of complex genomes. EMBnet journal. 2011, 17 (1): 4610.14806/ej.17.1.208.
Colombus module of Velvet: [http://bioweb2.pasteur.fr/docs/velvet/Columbus_manual.pdf]. Unpublished
IDBAHybrid : Unpublished, [http://i.cs.hku.hk/~alse/hkubrg/projects/idba_hybrid/index.html].
Gnerre S, Lander ES, LindbladToh K, Jaffe DB: Assisted assembly: how to improve a de novo genome assembly by using related species. Genome Biol. 2009, 10 (8): 8810.1186/gb2009108r88.
Pevzner PA, Tang H, Tesler G: De novo repeat classification and fragment assembly. Genome Res. 2004, 14 (9): 17861796. 10.1101/gr.2395204.
Phillippy AM, Schatz MC, Pop M: Genome assembly forensics: finding the elusive misassembly. Genome Biol. 2008, 9 (3): 5510.1186/gb200893r55.
Medvedev P, Pham S, Chaisson M, Tesler G, Pevzner PA: Paired de bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. J Comput Biol. 2011, 18 (11): 16251634. 10.1089/cmb.2011.0151.
Vyahhi N, Pyshkin A, Pham S, Pevzner PA: From de bruijn graphs to rectangle graphs for genome assembly. Algorithms in Bioinformatics Lecture Notes in Computer Science. 2012, 7534: 249261. 10.1007/9783642331220_20.
Gusfield D: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. 1997, Cambridge University Press, New York
Clift B, Haussler D, McConnell RM, Schneider TD, Stormo GD: Sequence landscapes. Nucleic Acids Res. 1986, 14: 141158. 10.1093/nar/14.1.141.
Blumer A, et al: The smallest automation recognizing the subwords of a text. Theor Comput Sci. 1985, 40 (0): 3155.
Kurtz S, Choudhuri JV, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R: Reputer: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 2001, 29 (22): 46334642. 10.1093/nar/29.22.4633.
Fitch WM, Smith T, Breslow JL: Detecting internally repeated sequences and inferring the history of duplication. Method Enzymol. 1985, 128: 773788.
Benson G: A space efficient algorithm for finding the best nonoverlapping alignment score. Theor Comput Sci. 1995, 145 (1): 357369.
Sagot MF: Spelling approximate repeated or common motifs using a suffix tree. LATIN'98: Theoretical Informatics Lecture Notes in Computer Science. 1998, 1380: 374390. 10.1007/BFb0054337.
Manber U, Myers G: Suffix arrays: a new method for online string searches. SIAM J Comput. 1993, 22 (5): 935948. 10.1137/0222058.
Kärkkäinen J, Sanders P, Burkhardt S: Linear word suffix array construction. J ACM. 2006, 53 (6): 918936. 10.1145/1217856.1217858.
Li H, Durbin R: Fast and accurate short read alignment with BurrowsWheeler transform. Bioinformatics. 2009, 25 (14): 17541760. 10.1093/bioinformatics/btp324.
Chitsaz H, et al: Efficient de novo assembly of singlecell bacterial genomes from shortread data sets. Nat Biotechnol. 2011, 29 (10): 915921. 10.1038/nbt.1966.
Nikolenko SI, Korobeynikov AI, Alekseyev MA: Bayeshammer: bayesian clustering for error correction in singlecell sequencing. BMC Genomics. 2013, 14 (Suppl 1): 710.1186/1471216414S1S7.
Gurevich A, Saveliev V, Vyahhi N, Tesler G: QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013, 29 (8): 10721075. 10.1093/bioinformatics/btt086.
Earl D, et al: Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 2011, 21 (12): 22242241. 10.1101/gr.126599.111.
Declarations
B.S. and C.B. were funded by the Colorado Clinical and Translational Sciences Institute which is funded by the National Institutes of Health (NIHNCATS,UL1TR001082, TL1TR001081, KL2TR001080). N.S.M.T.and H.Ch. were funded by the National Science Foundation through grant number DBI1262565 to H.Ch. Publications costs were covered by computer science department of Colorado State University.
This article has been published as part of BMC Genomics Volume 15 Supplement 10, 2014: Proceedings of the 25th International Conference on Genome Informatics (GIW/ISCBAsia): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S10.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
B.S. observed the possibility of using sequence landscape for improving sequence assembly and introduced the concept of maximal sequence landscape. C.B. and B.S. conceived the project, defined the problem, and designed the maximal landscape algorithm. H.Ch. and N.S.M.T. designed the assembly algorithm. B.S., N.S.M.T., and H.Ch. implemented the algorithms. B.S., N.S.M.T., and C.B. ran the experiments and interpreted the results. B.S., C.B., and H.Ch. wrote the manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Shariat, B., Movahedi, N.S., Chitsaz, H. et al. HyDAVista: towards optimal guided selection of kmer size for sequence assembly. BMC Genomics 15 (Suppl 10), S9 (2014). https://doi.org/10.1186/1471216415S10S9
Published:
DOI: https://doi.org/10.1186/1471216415S10S9