A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework
© Chang et al.; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate sequencing reads that are longer than 150 bp up to a total of 600 Gbp of data per run. The high-throughput sequencers generate lengthier reads with greater sequencing depth than those generated by previous technologies. Two major challenges exist in using the high-throughput technology for de novo assembly of genomes. First, the amount of physical memory may be insufficient to store the data structure of the assembly algorithm, even for high-end multicore processors. Moreover, the graph-theoretical model used to capture intersection relationships of the reads may contain structural defects that are not well managed by existing assembly algorithms.
We developed a distributed genome assembler based on string graphs and MapReduce framework, known as the CloudBrush. The assembler includes a novel edge-adjustment algorithm to detect structural defects by examining the neighboring reads of a specific read for sequencing errors and adjusting the edges of the string graph, if necessary. CloudBrush is evaluated against GAGE benchmarks to compare its assembly quality with the other assemblers. The results show that our assemblies have a moderate N50, a low misassembly rate of misjoins, and indels of > 5 bp. In addition, we have introduced two measures, known as precision and recall, to address the issues of faithfully aligned contigs to target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush is shown to produce contigs with high precision and recall. We also verified the effectiveness of the edge-adjustment algorithm using simulated datasets and ran CloudBrush on a nematode dataset using a commercial cloud. CloudBrush assembler is available at https://github.com/ice91/CloudBrush.
With the rapid growth of DNA sequencing throughput delivered by next-generation sequencing technologies , there is a pressing need for de novo assemblers to efficiently handle massive sequencing data of genomes using scalable, on-demand, and inexpensive commodity cloud servers. De novo genome assembly is a fundamental step in analyzing a newly sequenced genome without a backbone sequence. De novo assembly software must deal with sequencing errors, repeat structures, and the computational complexity of processing large volumes of data . The most recent assemblers use de Bruijn graphs [3–10] or string graphs [11–14] to model and manipulate the sequence reads. Using the de Bruijn graph model of sequence assembly requires breaking reads into short k-mers . Typically, de Bruijn graph-based assemblers must recover the information lost from the breaking of reads, and attempt to resolve small repeats using read threading algorithms . Using the string graph model of assembly can help avoid this issue. However, with the deeper coverage depth of read data, our preliminary studies show that the underlying string graph used to model the intersection of reads becomes much more complex than expected by previous assembly algorithms .
To unfold these complex branch patterns into correct linear paths in string graphs, we present an Edge Adjustment (EA) algorithm to remedy this problem. The algorithm utilizes the sequence information of all graph neighbors for each read and eliminates the edges connecting to reads containing rare bases. We also used simulated read datasets of Escherichia coli genomes of varying sequencing depths and error rates to verify the effectiveness of the EA algorithm. In addition, we integrated the EA algorithm into a distributed assembly program  based on string graphs and MapReduce cloud computing framework [16, 17], known as CloudBrush. We evaluated the method against the GAGE benchmarks established by Salzberg et al  to compare assembly quality with other de novo assembly tools. Moreover, we introduced a pair of novel indices to measure the quality of sequence assembly, known as precision and recall, to indicate whether the output contigs are faithfully aligned (i.e., without inversions or rearrangements) with a contiguous region in the target genome, and whether the output contigs fully cover the entire target genome. It is noteworthy that these two indices are important for follow-up annotation and analysis of the target genome. Finally, we ran CloudBrush on a nematode dataset using a computing cloud  and analyzed its performance.
Structural defects in string graphs
In using the graph-based assembly approach, sequencing error may generate complex structures in the graph. For example, sequencing errors at the end of reads may create tips in the graph, and sequencing errors within long reads may create bubbles in the graph. Tips and bubbles are well-defined problems with a solution making use of the topological features of the graph as described in  and . Some errors, however, create more complex structures that cannot be readily identified from the topology of the graph. In this report, we refer to these structures are "structural defects." A well-known structural defect is the chimerical link problem. Figure 1 displays an example of chimerical links caused by sequencing error in string graph. In this instance, the chimerical link is caused by false overlap between node C and node G. In addition to sequencing errors, repeat regions also cause structural defects in a string graph; for example, the well-known "frayed rope" pattern . Furthermore, repeats shorter than the read lengths may also complicate processing in string graphs; for example, if a short repeat exists in reads D, E, F, I, J, L, and M, where C, D, E, and F are reads from a specific region in the genome, while I, J, L, and M are reads from another region in the same genome (Figure 2). It is noteworthy that in the string graph, the edge between nodes D and L is denoted as a "branch structure" which may lead an assembly algorithm to report an erroneous contig. In addition to false overlaps, missing overlaps also introduce structural defects into the string graph; for example, the formation of a braid structure caused by sequencing errors appearing in continuous reads (Figure 3). In this instance, two missing overlaps forbid the adjacent reads from being merged together; node B lost an overlap link to node C, and node D lost an overlap link to node E (Figure 3). Similar to the chimerical link problem, it is challenging to use topological features of the graph to deal with braid structures.
Edge Adjustment with the neighbors' contents
To solve the branch structure problem, the EA algorithm generates a consensus sequence for read L from the neighboring reads D, I, and J (Figure 7). Therefore, read D differs from the consensus sequence, which is primarily represented by reads I and J. The overlap link between reads D and L is removed.
To solve the braid structure problem, in which instance the errors in reads C and D complicate the graph structure, the EA algorithm removes the overlap links between reads C and E and between reads B and D (Figure 8). Thus, reads C and D are isolated from the main graph, and no braid structure exists.
Analysis of edge adjustment
We prepared simulated datasets generated from the E. coli genome to evaluate effectiveness of the EA algorithm. In other words, the position of each read on the target genome, and thus positions of sequencing errors on the read are also present in the dataset. We subsequently construct the overlap graph of the dataset by creating a node to present each read, and an edge between each pair of reads if they have a sequence overlap with size no smaller than an integer k. Two attributes are associated with each edge of the overlap graph from the simulated data. In the first attribute, if the positions of the two reads overlap with each other on the genome, then the overlapping region is designated as a true edge; otherwise, it is designated as a false edge. The second attribute is used to denote whether any sequencing error exists on the two reads of the edge. Therefore, we can now classify edges of the overlap graph into four classes according to these two attributes. Class I denotes the subset of true edges without sequencing errors; class II denotes the subset of true edges with sequencing errors; class III denotes the subset of false edges with sequencing errors; and class IV denotes the subset of false edges without sequencing errors. It is noteworthy that class I edges are most desired to improve the quality of data for subsequent stages of sequence assembly. By contrast, class III edges are chimerical edges; class II edges contain sequencing errors; and class IV edges contain reads that intersect repeats. Edges of classes II, III, and IV may introduce errors or structural defects into the later stages of sequence assembly. Therefore, it is the design goal of the EA algorithm to minimize the number of class II, III, and IV edges and to maximize the number of class I edges.
The edge analysis of overlap graph before and after Edge Adjustment
E. coli Dataset
# of edges before
# of edges after
100 × 36 bp
100 × 36 bp
200 × 150 bp
200 × 150 bp
The analysis of simplified string graph with and without Edge Adjustment
100 × 36 bp
# of node
# of edge
100 × 36 bp
# of node
# of edge
200 × 150 bp
# of node
# of edge
200 × 150 bp
# of node
# of edge
Evaluation of assembly accuracy
Importantly, we only evaluate contigs whose length ≥ 200 bp.
We used three real and two simulated datasets to test CloudBrush and the other assemblers. The first real dataset was a set of short read data from an E. coli library (NCBI Short Read Archive, accession no. SRX000429) consisting of 20.8 M 36-bp reads. The second real dataset was released by Illumina, which included 12 M paired-end 150-bp reads. This dataset contains sequences from a well-characterized E. coli strain K-12 MG1655 library sequenced on an Illumina MiSeq platform. For the two real datasets, we select the first half of reads to evaluate assemblers, and their coverage depth was 81× and 197×, respectively. We used D1 and D2 to denote the 36-bp and 150-bp datasets, respectively. Furthermore, we downloaded Caenorhabditis elegans sequence reads (strain N2) from the NCBI SRA (accession no. SRX026594) as the D3 dataset, consisting of 33.8 M read pairs sequenced using the Illumina Genome Analyzer II and a constant coverage depth of 67×. The two simulated datasets were generated at random from the E. coli K-12 genome using 36-bp reads with 100× coverage depth and 1% mismatch errors, and with 100-bp reads with 200× coverage depth and 1% mismatch errors.
We performed assemblies on these datasets using Edena , Velvet , Contrail  and CloudBrush assemblers. Edena is the first string graph-based assembler for data of short reads. Velvet is one of the first de Bruijn graph-based assemblers for short reads that is often used as a standard tool for assembling small- to medium-sized genomes. Contrail is the first de Bruijn graph-based assembler using the MapReduce framework. Each assembler is required to set the parameter k, i.e., the minimum length of overlap for two contigs to form a longer contig. Considering the relationship between parameter k and coverage depth , we used k = 21 on dataset D1 and 100× simulated data, k = 75 on dataset D2 and 200× simulated data, and k = 51 on dataset D3. Importantly, we did not use pair-end information in this experiment.
Evaluation of assemblies of the simulated dataset (100×, 36 bp, 1% error) and dataset D1 with CloudBrush, Contrail, Velvet, and Edena
# of contigs1
Largest contig size
# of valid
# of invalid contigs1
100 × 36 bp
Evaluation of assemblies of the simulated dataset (200 × 150 bp, 1% error) and dataset D2 and D3 with CloudBrush, Contrail, and Velvet
# of contigs1
# of valid
# of invalid
200 × 150 bp
Comparison with other tools using GAGE benchmarks
Evaluation of S aureus (genome size 2,872,915 bp)
> 5 bp
# of valid
contigs (> 200 bp)
# of invalid
(> 200 bp)
Evaluation of R. sphaeroides (genome size 4,603,060 bp)
> 5 bp
# of valid
(> 200 bp)
# of invalid
(> 200 bp)
As described in , a more aggressive assembler is prone to generate more segmental indels as it strives to maximize the length of its contigs, while a conservative assembler minimizes errors at the expense of contig size. We observed that the SGA assemblies have the fewest errors of misjoins and indels of > 5 bp, but have the shortest N50 (Tables 5 and 6). CloudBrush generated the second fewest number of errors, but led to a longer N50, which identified CloudBrush as a conservative assembler that could still assemble longer contigs.
A caveat on the use of the assembly precision and recall for contigs is required. When misjoined errors occur in a very long contig, the whole contig will be invalidated, and the precision and recall will obviously decrease in proportion to the contig length. By contrast, when misjoined errors occur in a shorter contig, the precision and recall may only decrease slightly. We observed that SGA and CloudBrush produced the highest precisions and recalls (Tables 5 and 6), indicating that the contigs generated will have very few artificial breakpoints generated by assemblers; moreover, it will reduce the noisy interrupts in the subsequent genome annotation and comparative genomic analysis. It is noteworthy that some assemblers e.g., Bambus2  and SOAPdenovo , have lower precision and recall due to the fact that their misjoined errors and longer indels occur in longer contigs.
Run time analysis
Discussion and conclusions
With the rapid growth of sequence data, genome assembly remains one of the most challenging computational problems in genomics. String graph-based approaches have the benefits of read coherence , less memory requirement, and successful experience in analyzing Sanger sequence data . In this report, we identify several types of structural defects in string graphs resulting from sequencing errors and short repeats. To remedy the structural defects in string graphs, we developed the EA algorithm that utilizes information from the consensus of graphical neighbors. To validate the effectiveness of the EA algorithm, we used simulated data to define four types of edges and a braid index to help evaluate the structural defects in string graphs. The experimental results show that the EA algorithm efficiently minimizes structural defects in string graphs. Thus far, the EA algorithm is not suitable for studies on SNPs, because it only removes the edges. We suggest that correcting the edges with sequence logos will maintain information for SNP analysis; this is the subject of a future study.
To demonstrate the validity of CloudBrush, we used GAGE benchmarks  to compare CloudBrush with other state-of-the-art assembly tools. The evaluation results show that CloudBrush is a conservative assembler that nevertheless can generate precise contigs that avoid error propagation in downstream analysis with moderate N50 contig lengths. We also tested the scalability of CloudBrush using three different sizes of hadoop clusters to assemble ~7-Gbp data of the C. elegans dataset on a hicloud™ computing service . The study results show that the stage of graph construction is the primary performance bottleneck and its scalability in the MapReduce framework is quite impressive.
In future studies, we will incorporate the scaffolding issue and mate-pair analysis into the MapReduce pipeline. Combining state-of-the-art error correction and our edge analysis is another subject worthy of investigation. We believe that CloudBrush will achieve a better contig N50 with fewer misjoin errors if these former two issues are resolved. Adapting the pipeline toward third generation sequencing technologies is also an important direction of investigation.
We previously described a string-graph base assembly algorithm using MapReduce called CloudBrush . The framework of MapReduce can easily be implemented as a modular pipeline, allowing it to be easily extended when improved algorithms have been developed. In this study, we have expanded on CloudBrush by revising its pipeline and adding an EA algorithm. We introduced the principle of the graph processing in MapReduce and the pipeline of CloudBrush. It is noteworthy that the code is written in Java and readers may refer to  for further details concerning the implementation of the procedures in the MapReduce framework.
Distributed graph processing in MapReduce
Genome assembly has been modelled as a graph-theoretic problem. Graph models of particular interests include de Bruijn and string graphs in either directed or bidirected forms. Here we use bidirected string graph to model the genome assembly problem.
In a bidirected string graph, nodes represent reads and edges represent the overlaps between reads. To model the double-stranded nature of DNA, a read can be interpreted in either forward or reverse-complement directions. For each edge that represents an ordered pair of nodes with overlapping reads, four possible types exist, according to the directions of the two reads: forward-forward, reverse-reverse, forward-reverse, and reverse-forward. The type attribute is incorporated into each edge of the bidirected string graph. It is noteworthy that traversing the bidirected string graph should follow a consistent rule, i.e., the directions of in-links and out-links of the same node should be consistent. In other words, the read of a specific node can only be interpreted in a unique direction in one path of traversal.
The MapReduce framework [16, 17] use key-value pairs as the only data type to distribute the computations. To manipulate a bidirected string graph in MapReduce, we use a node adjacency list to represent the graph, which stores node id (i.e., the identifier of a node) as the key, and node data structure as the value. Node data structure contains features of the node as well as a list of its outgoing edges and their features. The node adjacency list is a compact representation and allows easy traversal along the outgoing links. In MapReduce, a basic unit of computations is usually localized to a node's internal state and its neighbors in the graph. The results of computations on a node are emitted as values, each keyed with the identification of a neighbor node. Conceptually, we can think of this process as "passing" the results of computation along out-links. In the reducer, the algorithm receives all partial results having the same destination node id, and performs the computation. Subsequently, the data structure corresponding to each node is updated and written back to distributed file systems.
CloudBrush: string graph assembly using MapReduce
Graph construction in MapReduce
1. Retaining non-redundant reads as vertices
A sequence read may have several redundant copies in the dataset by oversampling in Solexa or SOLiD sequencing. The first step in graph construction is to merge redundant copies of the same read into a single node. We implemented a distributed prefix tree in MapReduce to extend Edena's prefix-tree approach .
2. Finding pairwise overlaps between reads
Read-read overlaps are basic clues in connecting reads to contigs; however, finding overlaps between reads is often the most computationally intensive step in string graph-based assemblies. To find all the pairs of read-read overlaps, we adopted a prefix-and-extend strategy to speed up construction of the string graph . The strategy consists of two phases, the prefix phase and the extend phase. In the prefix phase, a pair of reads is reported if the prefix of one of the reads exactly matches a substring of the other read at the given seed length. The pair is then said to have a "brush." In the extend phase, pairs of reads having a brush are further validated starting from the brush. If the exact match extends to one end of the second read, then an edge containing the two nodes of the two reads is created.
3. Edge Adjustment
4. Reducing transitive edges
After the EA algorithm, the graph still has superfluous edges due to oversampling in sequencing. Consider two paths of consecutively overlapping nodes r a →r b →r c and r a →r c ; r a →r c is transitive because it spells the same sequence as r a →r b →r c , but ignores the middle node r b .
To perform the transitive reduction in the MapReduce framework, we passed the neighbors' edges for each node r i such that r i knows all the neighboring nodes in the reducer. Different from de Bruijn graphs, the overlap size information is attached to the edge of our bidirected string graph. Therefore, we can sort neighbors by overlap size and remove transitive edges in order.
Graph simplification in MapReduce
After constructing the string graph, we used several techniques to simplify the graph, including path compression, tip and bubble removal, and low coverage node removal. Path compression is used to merge a chain of nodes, each having one in-link and one out-link along a specific strand direction into a single node. After path compression, tips and bubbles are easily recognized locally. Our MapReduce implementation of path compression, tip and bubble removal, and low coverage node removal are similar to that of Contrail , except that we add an additional field of overlap size for the data structure of message passing between nodes tailed for the string graphs. Additionally, we provide an option to reuse the EA algorithm in graph simplification. In this study, we only performed the EA algorithm on nodes whose neighbors were dead ends of the graph; more broadly, the EA algorithm can be performed iteratively until no dead-end neighbors can be removed.
The authors wish to thank Chunghwa Telecom Co. and National Communication Project of Taiwan for providing the cloud computing resources and the technical supports they provided. They wish to thank Jazz Yao-Tsung Wang at the National Center for High-Performance Computing for his help with the efficient deployment of Hadoop clusters. YJC, CCC, and JMH were partially supported by National Science Council grant NSC 99-2321-B-001-025-.
This article has been published as part of BMC Genomics Volume 13 Supplement 7, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S7.
- Stein LD: The case for cloud computing in genome informatics. Genome Biology. 2010, 11: 207-10.1186/gb-2010-11-5-207.PubMed CentralView ArticlePubMedGoogle Scholar
- Miller JR, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics. 2010, 95: 315-327. 10.1016/j.ygeno.2010.03.001.PubMed CentralView ArticlePubMedGoogle Scholar
- Pevzner P, Tang H, Waterman M: Fragment assembly with double-barreled data. Proceedings of the National Academy of Sciences. 2001, 98 (17): 9748-9753. 10.1073/pnas.171285098.View ArticleGoogle Scholar
- Zerbino D, Birney E: Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs. Genome Research. 2008Google Scholar
- Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome Research. 2008, 18: 324-10.1101/gr.7088808.PubMed CentralView ArticlePubMedGoogle Scholar
- Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, Berlin AM, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander ES, Jaffe DB: High-Quality Draft Assemblies of Mammalian Genomes from Massively Parallel Sequence Data. PNAS. 2011, 108: 1513-1518. 10.1073/pnas.1017351108.PubMed CentralView ArticlePubMedGoogle Scholar
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol İ: ABySS: A parallel assembler for short read sequence data. Genome Research. 2009, 19: 1117-10.1101/gr.089532.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: De novo assembly of human genomes with massively parallel short read sequencing. Genome Research. 2010, 20: 265-272. 10.1101/gr.097261.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Peng Y, Leung H, Yiu S, Chin F: IDBA-A Practical Iterative de Bruijn Graph De Novo Assembler. Research in Computational Molecular Biology (RECOMB 2010). 2010, 426-440.View ArticleGoogle Scholar
- Schatz M, Sommer D, Kelley D, Pop M: Contrail: Assembly of Large Genomes using Cloud Computing. [http://contrail-bio.sf.net/]
- Myers E: The fragment assembly string graph. Bioinformatics. 2005, 21: ii79-ii85. 10.1093/bioinformatics/bti1114.PubMedGoogle Scholar
- Hernandez D, Francois P, Farinelli L, Osteras M, Schrenzel J: De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Research. 2008, 18: 802-809. 10.1101/gr.072033.107.PubMed CentralView ArticlePubMedGoogle Scholar
- Jackson B, Schnable P, Aluru S: Parallel short sequence assembly of transcriptomes. BMC Bioinformatics. 2009, 10: S14-PubMed CentralView ArticlePubMedGoogle Scholar
- Simpson JT, Durbin R: Efficient De Novo Assembly of Large Genomes Using Compressed Data Structures. Genome Res. 2012, 22: 549-556. 10.1101/gr.126953.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Chang Y-J, Chen C-C, Chen C-L, Ho J-M: De Novo Assembly of High-Throughput Sequencing Data with Cloud Computing and New Operations on String Graphs. Proceedings of IEEE International Conference on Cloud Computing (CLOUD 2012). 2012, Hawaii, USAGoogle Scholar
- Dean J, Ghemawat S: MapReduce: Simplified data processing on large clusters. Communications of the ACM. 2008, 51: 107-113.View ArticleGoogle Scholar
- White T: Hadoop: The Definitive Guide. O'Reilly Media. 2009Google Scholar
- Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marçais G, Pop M, Yorke JA: GAGE: A Critical Evaluation of Genome Assemblies and Assembly Algorithms. Genome Res. 2012, 22: 557-567. 10.1101/gr.131383.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Hicloud computer-as-a-service (CaaS). [http://hicloud.hinet.net/]
- Chen C-C, Lin W-D, Chang Y-J, Chen C-L, Ho J-M: Enhancing De Novo Transcriptome Assembly by Incorporating Multiple Overlap Sizes. ISRN Bioinformatics. 2012, 2012: 1-9.Google Scholar
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. Journal of Computational Biology. 2000, 7: 203-214. 10.1089/10665270050081478.View ArticlePubMedGoogle Scholar
- Koren S, Treangen TJ, Pop M: Bambus 2: scaffolding metagenomes. Bioinformatics. 2011, 27: 2964-2971. 10.1093/bioinformatics/btr520.PubMed CentralView ArticlePubMedGoogle Scholar
- Schatz MC, Delcher AL, Salzberg SL: Assembly of large genomes using second-generation sequencing. Genome Research. 2010Google Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.