Not long ago, the expected outcome of a microbial genome project was the complete DNA sequence of all the chromosomes and extra-chromosomal elements of the genome being sequenced. As more and more complete genome sequences became available in public databases, scientists started to debate the need for completely sequencing the genome of an organism [1–3]. While an initial draft sequence for an organism could be determined in a matter of weeks, the complete sequence required many months or even years of additional experiments - a time- and cost-intensive process called genome finishing. Furthermore, draft assemblies are sufficient for many genomic analyses, especially if complete sequences of closely related organisms are available. The recent development of nextgen high-throughput sequencing technologies appears to have sealed the fate of genome finishing. Draft assemblies of approximately 5 bacterial genomes can now be generated in a matter of days (hours in fact, ignoring library preparation time) using a single 454 Titanium sequencing instrument. The costs associated with finishing, however, have not significantly decreased in recent years. The high costs of finishing experiments, thus, appear to only be justified for high-priority genomes.
Having finished or nearly finished genomes is of course still a worthy goal, as it enables a much richer set of genomic analyses. For example, the reliability of order-based genomic analyses, such as studies of operon structure and gene regulation, as well as the granularity of comparative genomic studies, are enhanced by the availability of finished genomes. In addition, the finishing process can substantially improve the quality of the data available to the community by identifying and fixing mis-assemblies and low-coverage regions. Fortunately, several characteristics of the new types of sequencing data, specifically increased depth of coverage and low representation biases in the sequencing libraries, lend themselves well to finishing. Draft assemblies can therefore be combined with additional sequence and map-based information to reduce the finishing effort. Here we describe our experience in doing this in the course of several finishing projects, highlighting the reduction in finishing effort as well as the feasibility of such projects in a small-lab setting. We also present the tools and approaches that were designed in our lab for this purpose (source code and executables available at http://cbcb.umd.edu/finishing). In combination with the democratization of genomics made possible by the reduced cost of sequencing, computational approaches such as the ones we describe here may help rectify the imbalance in the number of draft vs. finished (or nearly finished) genomes that are available to the scientific community.
Overview of finishing techniques
Prior to describing our results we briefly survey the main challenges encountered in finishing a genome and outline ways in which new technologies can be used to overcome these challenges. Detailed descriptions of these approaches will be provided in the methods section.
Finishing aims to overcome two major limitations of the shotgun sequencing process. First of all, the output of a genome assembler is generally fragmented due to difficulties in assembling repeat regions and to cloning/sequencing biases. Second of all, the assembled fragments frequently contain errors, either due to sequencing artifacts or to the incorrect reconstruction of repeats. The finishing process can thus be decomposed into two steps: gap closure, and assembly validation and refinement.
In gap closure, pairs of adjacent contigs are identified, then the genomic sequence spanning the gap between them is determined, traditionally through directed-PCR and primer-walking approaches. When mate-pair libraries are available, the adjacency of contigs can often be inferred from the mate-pair data, and the gaps spanned by paired reads (sequencing gaps) can be closed relatively easily. Contigs whose adjacency cannot be inferred from mate-pair data, however, require expensive (and error-prone) combinatorial PCR experiments. New experimental technologies alleviate these difficulties in two ways. First of all, in nextgen sequencing projects performed to a high depth of coverage (>20-fold is common in 454 projects), sequence gaps between contigs are rare due to the relatively unbiased libraries generated by these new technologies; fragmentation into contigs is largely due to the presence of repeats. Therefore, once the adjacency between two contigs is determined (e.g. through PCR experiments), the contigs can often simply be "glued" together without the need for additional sequencing. Second of all, the adjacency of contigs can be easily determined either through recently developed nextgen mate-pair protocols or through the use of new mapping technologies, such as the optical mapping approach from Opgen Inc. (http://www.opgen.com).
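To make the two routes to contig adjacency concrete, the following Python fragment is a minimal sketch and not the tools described in this paper. It assumes a hypothetical input format in which each mate pair spanning two contigs is reduced to a pair of (contig, end) labels, and it counts the links supporting each candidate adjacency. The second function counts the end-to-end primer pairings that would have to be considered when no linking information is available, under the simplifying assumption that the two ends of a single contig are never paired with each other.

```python
from collections import Counter

def infer_adjacencies(mate_links, min_links=2):
    """Return contig-end pairs supported by at least min_links mate pairs.

    mate_links: iterable of ((contig_a, end_a), (contig_b, end_b)) tuples,
    one per mate pair; 'end' is "L" or "R". This input format and the
    min_links threshold are assumptions made for illustration only.
    """
    support = Counter()
    for a, b in mate_links:
        if a[0] == b[0]:
            continue  # both reads fall in the same contig: no adjacency information
        support[tuple(sorted((a, b)))] += 1  # orientation-independent key
    return {pair: n for pair, n in support.items() if n >= min_links}

def combinatorial_pcr_pairs(n_contigs):
    """Number of contig-end pairings between distinct contigs.

    With 2n contig ends there are C(2n, 2) end pairs; removing the n pairs
    that join the two ends of the same contig leaves 2n(n - 1).
    """
    return 2 * n_contigs * (n_contigs - 1)

links = [
    (("c1", "R"), ("c2", "L")),
    (("c2", "L"), ("c1", "R")),
    (("c1", "R"), ("c2", "L")),
    (("c2", "R"), ("c3", "L")),
    (("c3", "L"), ("c2", "R")),
    (("c1", "L"), ("c3", "R")),  # a single link, below the threshold
]
adjacencies = infer_adjacencies(links)
```

In this toy input, the c1/c2 and c2/c3 junctions are retained while the singly supported c1/c3 link is discarded. The quadratic growth of `combinatorial_pcr_pairs` (180 candidate pairings for 10 contigs) illustrates why PCR without prior adjacency information is costly.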
The validation and refinement finishing stage aims to correct errors in the assembled sequence, both single-base errors (such as mis-called bases due to sequencing errors) and large-scale errors (such as mis-assemblies due to repeats). Both problems are somewhat alleviated in nextgen sequencing data. Due to a high level of coverage, most single-base errors can be automatically corrected. This is true even in the case of 454 pyrosequencing, where errors within homo-polymer tracts are common. Furthermore, assembly software designed specifically for high-coverage nextgen data (e.g. the Newbler assembler from 454) uses conservative algorithms specifically designed to avoid mis-assemblies. The resulting assemblies are usually more fragmented; contigs end at repeat boundaries where the reconstruction of the genome is ambiguous. The high depth of coverage and conservative assembly strategy also enable a better estimation of the number of repeat copies contained within a contig. Repeat-induced ambiguities can be resolved through targeted PCR experiments aimed at uncovering the correct adjacency of the assembled contigs. As in the case of gap closure, once two contigs have been determined to be adjacent in the assembly, they can be simply glued together. In the following section we describe the results from our experience in putting these principles into practice.
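The estimation of repeat copy number from depth of coverage can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the procedure used by Newbler or by our tools: it takes the median per-contig depth as the single-copy baseline (reasonable only if most contigs are unique sequence) and rounds each contig's depth ratio to the nearest integer. The function name and input format are hypothetical.

```python
from statistics import median

def estimate_copy_numbers(contig_depth):
    """Estimate how many genomic copies each contig represents.

    contig_depth: dict mapping contig name -> mean depth of coverage.
    Assumption: the median depth across contigs approximates the depth
    of single-copy sequence.
    """
    baseline = median(contig_depth.values())
    # Every assembled contig is present at least once, so clamp to 1.
    return {name: max(1, round(depth / baseline))
            for name, depth in contig_depth.items()}

depths = {
    "c1": 24.0, "c2": 25.0, "c3": 25.0, "c4": 26.0, "c5": 25.0,
    "rep1": 50.0,   # roughly twice the baseline depth
    "rep2": 150.0,  # roughly six times the baseline depth
}
copies = estimate_copy_numbers(depths)
```

Here the five unique contigs are assigned one copy each, while rep1 and rep2 are assigned two and six copies. In practice the estimate becomes less reliable for short contigs, where depth fluctuates more, which is one reason a high overall depth of coverage helps.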