Finishing genomes with limited resources: lessons from an ensemble of microbial genomes
© The Article is a work of the United States Government; licensee BioMed Central Ltd. 2010
Received: 25 November 2009
Accepted: 16 April 2010
Published: 16 April 2010
While new sequencing technologies have ushered in an era where microbial genomes can be easily sequenced, the goal of routinely producing high-quality draft and finished genomes in a cost-effective fashion has remained elusive. Due to shorter read lengths and limitations in library construction protocols, shotgun sequencing and assembly based on these technologies often result in fragmented assemblies. Correspondingly, while draft assemblies can be obtained in days, finishing can take many months, and hence the time and effort can only be justified for high-priority genomes and in large sequencing centers. In this work, we revisit this issue in light of our own experience in producing finished and nearly-finished genomes for a range of microbial species in a small-lab setting. These genomes were finished with surprisingly little investment in terms of time, computational effort and lab work, suggesting that the increased access to sequencing might also eventually lead to a greater proportion of finished genomes from small labs and genomics cores.
Not long ago, the expected outcome of a microbial genome project was the complete DNA sequence of all the chromosomes and extra-chromosomal elements of the genome being sequenced. As more and more complete genome sequences became available in public databases, scientists started to debate the need for completely sequencing the genome of an organism [1–3]. While an initial draft sequence for an organism could be determined in a matter of weeks, the complete sequence required many months or even years of additional experiments - a time- and cost-intensive process called genome finishing. Furthermore, draft assemblies are sufficient for many genomic analyses, especially if complete sequences of closely related organisms are available. The recent development of nextgen high-throughput sequencing technologies appears to have sealed the fate of genome finishing. Draft assemblies of approximately 5 bacterial genomes can now be generated in a matter of days (hours in fact, ignoring library preparation time) using a single 454 Titanium sequencing instrument. The costs associated with finishing, however, have not significantly decreased in recent years. The high costs of finishing experiments, thus, appear to only be justified for high-priority genomes.
Having finished or nearly-finished genomes is of course still a worthy goal, as it enables a much richer set of genomic analyses. For example, the reliability of order-based genomic analyses, such as studying operon structure and gene regulation, as well as the granularity of comparative genomic studies are enhanced by the availability of finished genomes. In addition, the finishing process can substantially improve the quality of the data available to the community by identifying and fixing mis-assemblies and low-coverage regions. Fortunately, several characteristics of the new types of sequencing data, specifically increased depth of coverage and low representation biases in the sequencing libraries, lend themselves well to finishing analysis. Draft assemblies can therefore be combined with additional sequence and map-based information to reduce the finishing effort. Here we describe our experience in doing this in the course of several finishing projects, highlighting the reduction in finishing effort as well as the feasibility of such projects in a small-lab setting. We also present the tools and approaches that were designed in our lab for this purpose (source code and executables available at http://cbcb.umd.edu/finishing). In combination with the democratization of genomics made possible by the reduced cost of sequencing, computational approaches such as the ones we describe here may help rectify the imbalance in the number of draft vs finished (or nearly finished) genomes that are available to the scientific community.
Overview of finishing techniques
Prior to describing our results we briefly survey the main challenges encountered in finishing a genome and outline ways in which new technologies can be used to overcome these challenges. Detailed descriptions of these approaches will be provided in the methods section.
Finishing aims to overcome two major limitations of the shotgun sequencing process. First of all, the output of a genome assembler is generally fragmented due to difficulties in assembling repeat regions and to cloning/sequencing biases. Second of all, the assembled fragments frequently contain errors, either due to sequencing artifacts or to the incorrect reconstruction of repeats. The finishing process can thus be decomposed into two steps: gap closure, and assembly validation and refinement.
In gap closure, pairs of adjacent contigs are identified, then the genomic sequence spanning the gap between them is determined, traditionally through directed-PCR and primer-walking approaches. When mate-pair libraries are available, the adjacency of contigs can often be inferred from the mate-pair data and the gaps spanned by paired reads (sequencing gaps) can be closed relatively easily. Contigs whose adjacency cannot be inferred from mate-pair data, however, require expensive (and error-prone) combinatorial PCR experiments. New experimental technologies alleviate these difficulties in two ways. First, in nextgen sequencing projects performed to a high depth of coverage (>20-fold is common in 454 projects), sequence gaps between contigs are rare due to the relatively unbiased libraries generated by these new technologies; fragmentation into contigs is largely due to the presence of repeats. Therefore, once the adjacency between two contigs is determined (e.g. through PCR experiments), the contigs can often simply be "glued" together without the need for additional sequencing. Second, the adjacency of contigs can be easily determined either through recently developed nextgen mate-pair protocols or through the use of new mapping technologies, such as the optical mapping approach from Opgen Inc. (http://www.opgen.com).
The validation and refinement finishing stage aims to correct errors in the assembled sequence, both single-base errors (such as mis-called bases due to sequencing errors) and large-scale errors (such as mis-assemblies due to repeats). Both problems are somewhat alleviated in nextgen sequencing data. Due to a high level of coverage, most single-base errors can be automatically corrected. This is true even in the case of 454 pyrosequencing, where errors within homo-polymer tracts are common. Furthermore, assembly software designed specifically for high-coverage nextgen data (e.g. the Newbler assembler from 454) uses conservative algorithms specifically designed to avoid mis-assemblies. The resulting assemblies are usually more fragmented; contigs end at repeat boundaries where the reconstruction of the genome is ambiguous. The high depth of coverage and conservative assembly strategy also enable a better estimation of the number of repeat copies contained within a contig. Repeat-induced ambiguities can be resolved through targeted PCR experiments aimed at uncovering the correct adjacency of the assembled contigs. As in the case of gap closure, once two contigs have been determined to be adjacent in the assembly, they can be simply glued together. In the following section we describe the results from our experience in putting these principles into practice.
Results and Discussion
Table 1: Statistics from some contemporary finishing projects. Genomes listed: Yersinia pestis Angola; Frankia sp. CcI3; Vibrio cholerae O395; Salmonella enterica SL476; Pantoea stewartii stewartii; Verrucomicrobium spinosum DSM4136; Aliivibrio salmonicida LFI1238.
Finished genome of A. aphrophilus
The A. aphrophilus genome project was started at the American Museum of Natural History in 2007 to better understand the pathobiology of the bacterium (GenBank accession number CP001607). A. aphrophilus is a proteobacterium from the Pasteurellaceae family that is strongly implicated as a causative agent of infective endocarditis. It can also be found as an apparently benign resident of dental plaque.
Sequencing was done for this project using a hybrid approach, combining two runs of 454 pyrosequencing with Sanger sequencing of two shotgun libraries of clones with 1.6 to 4 Kbp inserts. Overall, 549,417 pyrosequencing reads (average length ~100 bp) and 12,889 Sanger reads (average length ~600 bp) were used to generate 22× and 3× coverage, respectively, of the 2.3 Mbp genome. The hybrid assembly of these reads was constructed using a novel pipeline (see Methods), with a Newbler assembly of the 454 reads as a starting point. In terms of sequence information, the Sanger reads added very little (≤0.2%) to the Newbler assembly (only 7 reads were not mapped to the Newbler assembly). However, they provided valuable information to merge contigs (reducing the contig count from 146 to 122 and increasing the N50 size from 99 Kbp to 150 Kbp) and to scaffold them (resulting in 110 scaffolds with an N50 size of 232 Kbp), as well as independent verification of the assembly (266 reads spanned Newbler contigs and 246 mates connected the contigs).
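The N50 sizes quoted here and in the tables can be computed directly from a list of contig or scaffold lengths. The following is a minimal sketch, assuming the common definition of N50 (the largest length L such that sequences of length at least L cover half or more of the total assembled bases); the function name and the toy lengths are illustrative and are not taken from the tools or data used in this work.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (lengths in Kbp; not real data from this project).
toy_contigs = [150, 99, 80, 40, 20, 11]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, merging contigs can raise it even when almost no new sequence is added, which is the effect reported above for the Sanger reads.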
Scaffolded assemblies of 8 Yersinia strains
The Yersinia genus contains several enteropathogens, such as Y. enterocolitica and Y. pseudotuberculosis, and the causative agent of the plague, Y. pestis. The genus also contains several less virulent strains that are commonly found in soil and water. While the genomes of multiple strains of the pathogenic Yersinia are available, we know little about their non-pathogenic relatives. Correspondingly, 8 other species (Y. kristensenii, Y. aldovae, Y. mollaretii, Y. frederiksenii, Y. bercovieri, Y. intermedia, Y. rohdei and Y. ruckeri) were sequenced in 2006 and 2007 via 454 FLX sequencing at the Naval Medical Research Center (GenBank accession numbers ACCA00000000, ACCB00000000, AALD00000000, AALE00000000, AALC00000000, AALF00000000, ACCD00000000 and ACCC00000000, respectively).
Table 2: Assembly and Map statistics for the Yersinia genomes. Column headings: Count (>10 Kbp); N50 Size (Kbp); N50 Size (Kbp).
Table 3: Scaffolding results for the Yersinia genomes. Column headings: Size in Mbp (% of genome), in four columns; # of gaps (>10 Kbp).
To further augment and verify the scaffolds, information about adjacency of contigs was extracted from the assembly and applied (see Methods). The results from this process can be seen in columns 5-6 in Table 3. In all cases, the placements from the optical map were confirmed and in a few cases gaps were closed in silico. The resulting draft genomes were well-suited for comparative analysis and as a template for finishing efforts; as a proof-of-concept project we worked on the Y. rohdei genome using a suite of finishing techniques (see Methods). Using only 43 PCR experiments and 26 sequencing reactions, 33 of the gaps were closed, leaving only 7 gaps to close. In contrast, working with the original assembly (59 large contigs) could have necessitated on the order of 59² ≈ 3000 PCR experiments (see Table 1). A similar project for the Y. ruckeri genome is also in progress.
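To make the comparison concrete, the sketch below counts candidate adjacency tests for a set of contigs whose order and orientation are unknown. It is an illustration under simple assumptions: the function name and the three counting conventions are ours, with the quadratic figure corresponding to the order-of-magnitude estimate quoted above for 59 contigs, and the other two counts (unordered contig pairs, and pairs of contig ends from different contigs) shown only to indicate that any reasonable convention gives a number in the thousands.

```python
from math import comb


def pcr_candidates(n_contigs):
    """Count candidate adjacency tests for n unordered, unoriented contigs."""
    return {
        # Rough quadratic estimate.
        "n_squared": n_contigs ** 2,
        # Every unordered pair of distinct contigs tested once.
        "contig_pairs": comb(n_contigs, 2),
        # Each contig has two ends; pair up ends that lie on different contigs.
        "end_pairs": comb(2 * n_contigs, 2) - n_contigs,
    }


counts = pcr_candidates(59)
print(counts)
# Ratio of the quadratic estimate to the 43 PCR experiments actually used.
print(counts["n_squared"] / 43)
```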
In-silico finishing of R. prowazekii
The R. prowazekii genome is an interesting case where we were able to finish the genome without any additional experimental effort. We believe it is the first case to be reported in the literature of the in silico closure of a bacterial genome and in fact we obtained similar results with several other Rickettsial genomes (data not shown).
The R. prowazekii genome provides evidence for the power of high-coverage sequencing using new sequencing technologies such as 454; we were able to reconstruct the genome with no gaps using a single run of the 454 machine, de novo assembly using Newbler and analysis of the resulting contig adjacency information. Also, while the assembly was performed without using prior knowledge of the previously sequenced R. prowazekii Madrid E genome, the assembly could be validated using it. The post-assembly analysis revealed that the two genomes align perfectly (indicating that there are no misordered contigs), differ in length by only 3 bp and are more than 99.96% identical in sequence.
It should be noted that the finishing projects presented here were not selected in any way for their ease of finishing. While we of course cannot claim to have done a uniform sampling of "typical" genomes, the evidence from our experience, as well as similar anecdotes from other labs, strongly suggests that it is increasingly feasible for small labs and genomic cores to take on the challenge of finishing a genome using minimal resources (possibly a week's effort for a bioinformatician and a lab technician).
As sequencing technologies improve in terms of their read lengths and throughput, the task of finishing a genome is likely to become even more feasible. In particular, technologies such as Illumina and SOLiD already produce, or will soon produce, reads as long as some of those studied here (~100 bp), making the techniques described here directly applicable to the resulting assemblies. Also, newer 454 instruments, where read lengths are now roughly 400 bp, allow for the extension of similar ideas to larger genomes. As techniques to construct ordered restriction maps (such as nanocode maps and nano-fluidic arrays, http://www.bionanomatrix.com/) and mate-pair libraries improve, scaffolding and finishing larger genomes could potentially become a routine affair. All of this bodes well for a more widespread adoption of finishing as a goal for genome projects.
Computational analysis continues to play an important role in the finishing task and it is unlikely that there will be a one-size-fits-all solution or a standard pipeline for this problem. However, a small bag of tricks can often make the task less daunting, and an increased awareness and availability of tools for these could go a long way in making more finished genomes available to the scientific community. One such trick is the use of optical maps to aid in finishing, and the results reported here represent the first large-scale, automated use of these maps for scaffolding genomes. A slightly less novel and yet neglected idea is the contig adjacency information residing in assemblies (discussed originally in ). Ignoring it would have led us to potentially do hundreds of PCR experiments to determine the order of contigs in the R. prowazekii genome; instead we closed the genome in silico. As genomes are sequenced with longer reads and to greater coverage, we are likely to come across more such examples, possibly with larger, but still resolvable contig graphs.
The programs and pipelines that we used for our finishing analysis are freely available at http://cbcb.umd.edu/finishing. We hope that they can serve as a starting point as other labs strive to create and refine their own finishing toolboxes.
In the following paragraphs we briefly describe some of the ideas and approaches that were used in the finishing projects described in this paper. More details about installation and running of the associated tools can be found in the documentation accompanying the tools.
Using mate-pairs and contig adjacency information
Assembly algorithms piece together reads into contigs, which are ungapped sequences from the genome. If additional information linking these contigs is available (such as mate-pairs) then they may also produce scaffolds, which represent regions of the genome where the contiguous sequence is not necessarily known, but the region can be represented as an ordered and oriented list of contigs with estimates for the lengths of the gaps between contigs.
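A scaffold in this sense can be captured by a very small data structure. The sketch below is one possible representation, not the format used by any particular assembler; the class and field names (PlacedContig, Scaffold, gap_after) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlacedContig:
    name: str
    length: int          # contig length in bp
    forward: bool        # True if the contig is used in its assembled orientation
    gap_after: int = 0   # estimated gap (bp) to the next contig in the scaffold


@dataclass
class Scaffold:
    contigs: List[PlacedContig]  # ordered and oriented list of contigs

    def span(self) -> int:
        """Estimated scaffold length: contig lengths plus internal gap estimates."""
        total = sum(c.length for c in self.contigs)
        total += sum(c.gap_after for c in self.contigs[:-1])
        return total

    def gaps(self) -> int:
        """Number of gaps between consecutive contigs."""
        return max(len(self.contigs) - 1, 0)


scaffold = Scaffold([
    PlacedContig("contig1", 1000, True, gap_after=200),
    PlacedContig("contig2", 500, False, gap_after=50),
    PlacedContig("contig3", 300, True),
])
print(scaffold.span(), scaffold.gaps())
```

Keeping the gap estimate on the contig that precedes the gap makes it straightforward to set an estimate to zero once two neighbours have been glued together.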
The existence of information about the adjacency of contigs in an assembly is not a new concept. Surprisingly, however, it is common practice for finishing-associated tools to treat contigs from an assembly as sequences whose relative order and orientation are completely unconstrained [16, 17]. In our work we found contig-graph information useful in three different tasks: 1) closing gaps in silico, as described in the section after next; 2) guiding and verifying map-based or mate-pair based scaffolding of contigs (see below and the next section); and 3) direct reconstruction of larger contigs. For the third task we relied on the fact that these graphs often have linear paths or simple repeat structures that can be easily resolved to merge contigs together (see Figure 3); this is implemented as part of the AMOS-Hybrid pipeline described below. More sophisticated analysis as suggested in  to identify regions in the assembly or contig graph that are uniquely traversable is also feasible. In the case of the R. prowazekii genome, we reconstructed the contig graph using get_graph.pl in the finishing scripts package (see Availability) and manually noted the presence of a unique directed traversal of the graph to reconstruct the genome. Tools automating this task are still in development.
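The idea of merging contigs along linear paths can be illustrated with a small graph routine. This is a simplified sketch and not the AMOS-Hybrid or get_graph.pl implementation: it assumes the contig graph is given as directed edges between contig names, ignores strand orientation (which real contig graphs carry), and glues two contigs only when the edge between them is the sole exit of the first and the sole entry of the second. The function name merge_linear_paths is hypothetical.

```python
from collections import defaultdict


def merge_linear_paths(edges):
    """Collapse unambiguous chains in a directed contig graph.

    edges: iterable of (u, v) pairs meaning contig u is followed by contig v.
    Returns a list of paths (lists of contig names); each contig is in one path.
    """
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = set()
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
        nodes.update((u, v))

    def unambiguous(u, v):
        # u -> v can be glued only if it is u's sole exit and v's sole entry.
        return len(succ[u]) == 1 and len(pred[v]) == 1 and u != v

    paths, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        # Skip nodes that will be reached by extending an earlier contig.
        if len(pred[start]) == 1:
            (p,) = pred[start]
            if unambiguous(p, start):
                continue
        path = [start]
        seen.add(start)
        while len(succ[path[-1]]) == 1:
            (nxt,) = succ[path[-1]]
            if nxt in seen or not unambiguous(path[-1], nxt):
                break
            path.append(nxt)
            seen.add(nxt)
        paths.append(path)

    # Anything left lies on an unambiguous cycle, e.g. a closed circular genome.
    for start in sorted(nodes - seen):
        if start in seen:
            continue
        path = [start]
        seen.add(start)
        while True:
            (nxt,) = succ[path[-1]]
            if nxt in seen:
                break
            path.append(nxt)
            seen.add(nxt)
        paths.append(path)
    return paths


# Toy graph: A-B-C is a linear path; C branches (as at a repeat boundary).
print(merge_linear_paths([("A", "B"), ("B", "C"), ("C", "D"), ("C", "E"), ("E", "F")]))
```

When every contig falls on a single unambiguous cycle, the routine returns one path containing all contigs, which is the simplest version of the situation in which a circular genome can be closed without further experiments.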
Optical map based scaffolds
In the case of some genome projects, multiple optical maps can be useful as they can provide complementary information and help place contigs that have few restriction sites. In addition, multiple maps can help validate placements from individual maps. In order to achieve this we implemented an efficient approach to merge single-map results from SOMA. An overview of this approach is provided below:
Select a contig placed on all the maps as an anchor.
For all uniquely placed contigs, test if the distance to the anchor is consistent across maps.
For non-uniquely placed contigs, find a unique placement based on agreement between maps.
Starting from the closest to the anchor, add unique placements into a global placement, excluding those that conflict with earlier placements.
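The four steps above can be sketched in a few lines of code. This is a simplified illustration and not the SOMA implementation: it assumes each map reports candidate start positions for each contig in a shared orientation and on linear coordinates, that agreement is judged with a fixed tolerance, and that a conflict means an overlap with an already accepted placement. The function name, argument layout and the default tolerance of 5 Kbp are assumptions made for this example.

```python
def merge_map_placements(maps, lengths, tol=5000):
    """Merge per-map contig placements into one global placement.

    maps:    list of dicts, contig name -> list of candidate start positions (bp)
    lengths: dict, contig name -> contig length (bp)
    tol:     maximum disagreement (bp) tolerated between maps
    Returns (anchor, {contig: start position relative to the anchor}).
    """
    # Step 1: anchor = a contig placed exactly once on every map (longest wins).
    anchors = [c for c in lengths if all(len(m.get(c, [])) == 1 for m in maps)]
    if not anchors:
        raise ValueError("no contig is uniquely placed on all maps")
    anchor = max(anchors, key=lambda c: (lengths[c], c))
    origin = [m[anchor][0] for m in maps]

    candidates = {}
    for contig in lengths:
        if contig == anchor:
            continue
        # Offsets relative to the anchor, one list per map that places the contig.
        rel = [[p - o for p in m[contig]]
               for m, o in zip(maps, origin) if m.get(contig)]
        if not rel:
            continue
        # Steps 2-3: keep offsets from the first map that every other map supports.
        agreed = [x for x in rel[0]
                  if all(any(abs(x - y) <= tol for y in other) for other in rel[1:])]
        if len(agreed) == 1:
            candidates[contig] = agreed[0]

    # Step 4: add placements closest-first, skipping overlaps with earlier ones.
    placed = {anchor: 0}
    for contig in sorted(candidates, key=lambda c: (abs(candidates[c]), c)):
        start = candidates[contig]
        end = start + lengths[contig]
        clash = any(start < s + lengths[c] and s < end for c, s in placed.items())
        if not clash:
            placed[contig] = start
    return anchor, placed


# Toy input: C is ambiguous on the first map, D is inconsistent between maps.
lengths = {"A": 100000, "B": 30000, "C": 20000, "D": 15000}
map1 = {"A": [500000], "B": [650000], "C": [200000, 900000], "D": [300000]}
map2 = {"A": [100000], "B": [251000], "C": [501000], "D": [700000]}
print(merge_map_placements([map1, map2], lengths))
```

In the toy input the second map resolves the ambiguous placement of C, while D is dropped because its distance to the anchor differs between the two maps.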
The optical map based scaffolds can also be augmented with the information available in contig graphs, and we employed it in two ways: 1) to validate placements using the optical map and 2) to place smaller contigs based on connections to contigs already placed on the map. We found both these approaches useful in increasing the completeness and reliability of our contigs and the scripts we used are freely available (see Availability).
Directed finishing using contig composition and adjacency information
Closing of gaps between contigs using directed finishing experiments can be a long and laborious component of a sequencing project. While map and mate-pair based scaffolding can be critical to reducing the number of experiments to be done, they will not typically eliminate them. It is therefore important in such cases to use auxiliary information to prioritize experiments. One approach that proved valuable in our efforts was to use contig adjacency information to resolve repeats, as in the case of the R. prowazekii genome (based on the unique traversal property discussed in Theorem 7.5 in ). In several cases, analysis of connecting paths in contig graphs helped close gaps in silico and the scripts we used for this are freely available (see Availability). Another useful strategy, particularly for the larger gaps, was to use sequence composition of contig ends to suggest contigs that could supply the missing sequence. In the case of Y. rohdei, for example, we matched GC content for 1 Kbp ends of contigs and the top match closed 5 large gaps (as validated by PCR experiments) using 5 of the 6 large contigs (>10 Kbp) that were not placed on the scaffold. A script to do this matching is also provided. Note that another similar approach that could prove valuable for closing gaps in future projects is to identify genes fragmented into two or more contigs to identify contigs that are likely to be adjacent.
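The composition-based matching can be sketched as follows. This is a hedged illustration rather than the script distributed with the finishing package: the text specifies only that GC content of 1 Kbp contig ends was matched, so the scoring used here (sum of absolute GC differences at the two gap flanks, taking the better of the two possible orientations of the candidate) and the function names are assumptions.

```python
def gc_content(seq):
    """Fraction of G/C bases among the unambiguous (A, C, G, T) bases of seq."""
    seq = seq.upper()
    gc = sum(seq.count(b) for b in "GC")
    at = sum(seq.count(b) for b in "AT")
    return gc / (gc + at) if gc + at else 0.0


def rank_by_end_gc(left_flank, right_flank, candidates, end_len=1000):
    """Rank unplaced contigs by how well their end GC content matches a gap.

    left_flank, right_flank: sequences of the placed contigs on either side
    candidates: dict, contig name -> sequence
    Returns contig names sorted best-first (smallest total GC difference).
    """
    target_left = gc_content(left_flank[-end_len:])
    target_right = gc_content(right_flank[:end_len])

    def score(seq):
        head, tail = gc_content(seq[:end_len]), gc_content(seq[-end_len:])
        forward = abs(target_left - head) + abs(tail - target_right)
        # A reverse-complemented candidate swaps its ends; GC content of each
        # end is unchanged by reverse complementation.
        reverse = abs(target_left - tail) + abs(head - target_right)
        return min(forward, reverse)

    return sorted(candidates, key=lambda name: (score(candidates[name]), name))


# Toy example with 4 bp "ends" so the arithmetic is easy to follow.
ranking = rank_by_end_gc(
    "ATATGCGCGC", "ATATATGGCC",
    {"c1": "GGCCAAAATTTT", "c2": "ATATGCATATGC"},
    end_len=4,
)
print(ranking)
```

A ranking of this kind only prioritizes which PCR experiments to try first; as in the Y. rohdei project, the suggested placements still need experimental validation.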
The executables and source code for the various programs used are available at http://cbcb.umd.edu/finishing. This includes programs and scripts for scaffolding using optical maps (SOMA v2.0), an AMOS pipeline for merging mate-pair and contig information (AMOS-Hybrid v1.0) and scripts for various finishing tasks (Finishing Scripts v1.0).
The authors are supported in part by Department of Defense Transformational Medical Technologies initiative TMTI0068 07 NM T to T.D.R. and NSF grant IIS-0812111 (to M.P.). R.D. thanks the Sackler Institute for Comparative Genomics at the American Museum of Natural History and the Korein Foundation for their continued support. The views expressed in this article are those of the authors and do not necessarily reflect the official policy or position of the Department of the Navy, Department of Defense, nor the U.S. Government. A.R. is a military service member. This work was prepared as part of his official duties. Title 17 U.S.C. §105 provides that 'Copyright protection under this title is not available for any work of the United States Government.' Title 17 U.S.C. §101 defines a U.S. Government work as a work prepared by a military service member or employee of the U.S. Government as part of that person's official duties.
1. Parkhill J: In defense of complete genomes. Nature Biotechnology. 2000, 18: 493-494. 10.1038/75346.
2. Fraser C, Eisen J, Nelson K, Paulsen IT, Salzberg S: The Value of Complete Microbial Genome Sequencing (You Get What You Pay For). J Bact. 2002, 183: 6403-6405.
3. Branscomb E, Predki P: On the High Value of Low Standards. J Bact. 2002, 183: 6406-6409.
4. Tettelin H, Radune D, Kasif S, Khouri H, Salzberg S: Optimized Multiplex PCR: efficiently closing a whole-genome shotgun sequencing project. Genomics. 1999, 62 (3): 500-507. 10.1006/geno.1999.6048.
5. Bonaventura MD, DeSalle R, Pop M, Nagarajan N, Figurski DH, Fine DH, Kaplan JB, Planet PJ: Complete Genome Sequence of Aggregatibacter (Haemophilus) aphrophilus NJ8700. J Bacteriol. 2009, 191 (14): 4693-4694. 10.1128/JB.00447-09.
6. Khairat O: Endocarditis due to a new species of Haemophilus. J Pathol and Bacteriol. 1940, 50: 497-505. 10.1002/path.1700500312.
7. Chen P, Cook C, Stewart A, Nagarajan N, Sommer D, Pop M, Thomason B, Kiley M, Lentz S, Nolan N, Sozhamannan S, Sulakvelidze A, Mateczun A, Du L, Zwick M, Read T: Genomic characterization of the Yersinia genus. Genome Biology. 2010, 11: 10.1186/gb-2010-11-1-r1.
8. Nagarajan N, Read T, Pop M: Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics. 2008.
9. Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome Research. 2008, 18 (2): 324-330. 10.1101/gr.7088808.
10. Myers E, Sutton G, Delcher A, Dew I, Fasulo D, Flanigan M, Kravitz S, Mobarry C, Reinert K, Remington K, Anson E: A whole-genome assembly of Drosophila. Science. 2000, 287 (5461): 2196-2204. 10.1126/science.287.5461.2196.
11. Andersson S, Zomorodipour A, Andersson J, Sicheritz-Ponten T, Alsmark U, Podowski R, Naslund A, Eriksson A, Winkler H, Kurland C: The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature. 1998, 396 (6707): 133-140. 10.1038/24094.
12. Jo K, Dhingra D, Odijk T, de Pablo J, Graham M, Runnheim R, Forrest D, Schwartz D: A single molecule barcoding system using nanoslits for DNA analysis. Proc Natl Acad Sci USA. 2007, 104: 2673-2678. 10.1073/pnas.0611151104.
13. Mulyukov Z, Pevzner PA: EULER-PCR: Finishing Experiments for Repeat Resolution. Pacific Symposium on Biocomputing. 2002, 199-210.
14. Pop M: Shotgun sequence assembly. Adv in Comp. 2004, 60: 193-248.
15. Sommer D, Delcher A, Salzberg SL, Pop M: Minimus: a fast, lightweight genome assembler. BMC Bioinformatics. 2007, 8: 64. 10.1186/1471-2105-8-64.
16. Richter DC, Schuster SC, Huson DH: OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics. 2007, 23 (13): 1573-1579. 10.1093/bioinformatics/btm153.
17. Zhao F, Zhao F, Li T, Bryant DA: A new pheromone trail-based genetic algorithm for comparative genome assembly. Nucleic Acids Research. 2008, 36 (10): 3455-3462. 10.1093/nar/gkn168.
18. Nagarajan N, Pop M: Parametric complexity of sequence assembly: Theory and applications to next generation sequencing. Journal of Computational Biology. 2009, 16 (7): 897-908. 10.1089/cmb.2009.0005.
19. Miller J, Delcher A, Koren S, Venter E, Walenz B, Brownley A, Johnson J, Li K, Mobarry C, Sutton G: Aggressive Assembly of Pyrosequencing reads with Mates. Bioinformatics. 2008, 24 (24): 2818-2824. 10.1093/bioinformatics/btn548.
20. Jeck W, Reinhardt J, Baltrus D, Hickenbotham MT, Magrini V, Mardis ER, Dangl JL, Jones CD: Extending assembly of short DNA sequences to handle error. Bioinformatics. 2007, 23: 2942-2944. 10.1093/bioinformatics/btm451.
21. Chevreux B, Wetter T, Suhai S: Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. German Conference on Bioinformatics. 1999, 45-56.
22. Samad A, Huff E, Cai W, Schwartz D: Optical mapping: a novel, single-molecule approach to genomic analysis. Genome Research. 1995, 5: 1-4. 10.1101/gr.5.1.1.
23. Latreille P, Norton S, Goldman B, Henkhaus J, Miller N, Barbazuk B, Bode HB, Darby C, Du Z, Forst S, Gaudriault S, Goodner B, Goodrich-Blair H, Slater S: Optical mapping as a routine tool for bacterial genome sequence finishing. BMC Genomics. 2007, 8: 321. 10.1186/1471-2164-8-321.
24. Waterman M: Introduction to computational biology. 1995, Chapman & Hall.
25. Salzberg S, Sommer D, Puiu D, Lee V: Gene-boosted assembly of a novel bacterial genome from very short reads. PLOS Comp Bio. 2008.