High quality draft sequences for prokaryotic genomes using a mix of new sequencing technologies
© Aury et al; licensee BioMed Central Ltd. 2008
Received: 04 September 2008
Accepted: 16 December 2008
Published: 16 December 2008
Massively parallel DNA sequencing instruments are enabling the decoding of whole genomes at significantly lower cost and higher throughput than classical Sanger technology. Each of these technologies has been estimated to yield assemblies with more problematic features than the standard method. These problems differ in nature depending on the technique used. An appropriate mix of technologies may therefore help resolve most difficulties, and eventually provide assemblies of high quality without requiring any Sanger-based input.
We compared assemblies obtained using Sanger data with those obtained from different New Sequencing Technology inputs. The assemblies were systematically compared with a reference finished sequence. We found that the 454 GSFLX can efficiently produce high continuity when used at high coverage. The potential to enhance continuity by scaffolding was tested using 454 sequences from circularized genomic fragments. Finally, we explored the use of Solexa/Illumina short reads to polish the genome draft by implementing a technique to correct 454 consensus errors.
High quality drafts can be produced for small genomes without any Sanger data input. We found that 454 GSFLX and Solexa/Illumina show great complementarity in producing large contigs and supercontigs with a low error rate.
Whole-genome sequencing has profoundly impacted the field of prokaryotic genetics since its first demonstration. Almost all of the economically and medically important microbes have had at least one representative with their genome sequenced. This achievement was first seen as the main goal of bacterial genomics, but is now strongly challenged by two observations. First, most microbial diversity is represented by uncultivated organisms, so the genomes sequenced today only represent a small fraction of the microbial gene space. Second, the variability between members of the same bacterial "species" can be very high in terms of gene content [2, 3]. Therefore, the definition of the proteome for a defined taxon may necessitate the sequencing of numerous related genomes. New technologies are therefore needed to sequence a larger number of prokaryotic genomes than previously thought. A number of new methods have reached the commercialization stage in the last few years. They are based on principles that differ from the dideoxy termination and electrophoretic separation used in the Sanger method [4, 5]. As such, they display different error rates and types, and produce assemblies with different characteristics. The most commonly used method, which makes use of highly parallelized pyrosequencing, has an inherently higher error rate around tracts of mononucleotides [7, 8]. This translates into more insertion-deletion errors in the assembly consensus, and into in-frame stop codons in genes.
For de novo sequencing, these technologies have two main drawbacks beyond the sequencing error issue. First, they have been developed in the framework of the resequencing of the human genome, and thus produce mostly short reads that are useful for detecting substitution polymorphisms against a reference genome, but are more difficult to use for de novo assembly of a new genome. Second, their initial implementation permitted only un-paired sequences. The presence of links between two reads is a major element for de novo sequencing, enabling both the linkage of different contigs separated by a sequence gap, and the construction of robust contigs by detection of assembly problems due to repeated elements. For these reasons, the accuracy and continuity of assemblies obtained with new sequencing technology data were lower than those traditionally obtained with the Sanger approach. Recent improvements of the new technologies brought the promise of a better final product for WGS projects. Here, we evaluated how assemblies made with such improvements compare with assemblies produced with Sanger data, and how a mix of Roche/454 and Solexa/Illumina technologies performed in whole-genome sequencing of a reference bacterial genome.
To test the efficiency of different approaches in bacterial genome assembly, we chose the gamma-proteobacterium Acinetobacter baylyi as a test case. A finished version of the genome has already been produced, and subsequent projects of global gene disruptions have led to the resequencing of almost all genes, thus providing a nearly error-free reference. Additionally, this genome was sequenced in our lab, which permits reanalysis of any incongruencies between the reference assembly and the versions generated with the new data sets. Acinetobacter baylyi has a genome size of 3.6 Mb with a GC content of about 40%.
Characteristics of the assemblies with different data inputs
Assembly size (% of reference): 3.417 Mb (95%); 3.542 Mb (98%); unpaired + paired 454: 3.544 Mb (98%); unpaired + paired 454 with Illumina/Solexa GA1 (25× and 50×): 3.544 Mb (98%)
We next explored the possibility of obtaining more continuity with GSFLX data by sequencing a "paired" library using the Roche/454 protocol (see Methods). We produced about 5× supplemental coverage with a "paired" library using circularized fragments of 3 kb, of which about one-fourth were detected as paired sequences by the Newbler assembler (see Additional File 3). The other three-fourths were used in the assembly, but contained no pairing information since they did not consist of two sufficiently-sized regions separated by the spacer. The number of contigs did not change from the 20× unpaired distribution, reflecting that the main impact of the 454 paired reads is the bridging of contigs. We obtained 10 scaffolds. When compared with the reference sequence, all gap positions appeared in repetitive regions that are more than 3 kb long. This indicates that at this coverage, all possible supercontiging has been automatically achieved. An obvious way of improving the assembly would be to construct paired libraries from circularized fragments larger than 3 kb. Our results indicate that with appropriate sizing of such libraries, most bacterial genomes may be obtained as a collection of very few supercontigs using 454/Roche data only.
Use of Solexa data to correct the consensus sequence
Remaining errors after correction using Solexa/Illumina reads with different coverage
The recent availability of new sequencing technologies has fueled enormous expectations for the rapid and cost-effective determination of the sequence of small genomes. The usefulness of such genome draft sequences will depend on their quality, principally in terms of their error rates and contiguity. In this study, we tested the effect of mixing two different data types (454 and Solexa/Illumina) to deliver a high quality draft for a bacterial genome. There is no easy way to efficiently co-assemble these two data sets, so we devised a method that first assembles the genome using the longest reads, then corrects the remaining errors by aligning the shortest sequences to the consensus. We were able to obtain a high accuracy consensus, together with a low number of total scaffolds, using this approach. The improvement over the use of a single technology is evident, and may be attributed to the different error types in each technology. We determined the approximate level of redundancy required to obtain optimal results as 25× (for 454 data assembled with Newbler) and 50× (for Solexa data). These numbers are indicative, as they may differ slightly according to the nature of the target genome. Although these numbers will evolve as the technologies develop longer read lengths and lower error rates, and as assembly software improves, this constitutes a significant decrease in the cost of a prokaryotic genome sequence compared with Sanger technology. We estimate the total cost of this experiment as about one-third of the total cost of Sanger sequencing. Future improvements are expected to make this difference even greater.
As the number of genome sequences increases, it will be more and more difficult to finish such a large number of draft sequences. It is therefore essential that the standard quality of a draft sequence remain high, and the method described in this paper may contribute to this objective. The main difficulty encountered in correcting all errors in the initial assembly was the lack of Solexa reads mapping uniquely at certain locations. This may improve as the read length of Solexa data increases, as expected in the coming months. At the end of the procedure described here, we found that most remaining gaps are due to repetitive sequences. This could be improved in future versions of the assemblers that fill gaps using paired-end data anchored on one side in single-copy sequences. These methods have been successfully employed with Sanger data [22–25], and offer the potential advantage of obtaining almost finished sequences without directed finishing, as soon as they are adapted to the new type of paired data produced with the new technologies.
The combination of two technologies (454 GSFLX and Solexa/Illumina) allows production of high-quality drafts of at least a comparable quality to those obtained with Sanger data. The method presented in this study is based on available software and protocols, and may be readily implemented in many labs. Using this procedure, and with ongoing developments in assembly methods, we can expect that the advent of the New Sequencing Technologies may provide a wealth of genome sequences without compromising their overall accuracy and contiguity. This will augment their usefulness for comparative analyses.
Two libraries were constructed from A. baylyi genomic DNA: one with 3.6 kb inserts in the high-copy plasmid vector pcDNA2.1 after DNA shearing, and one with 19.6 kb inserts in the BAC vector pBeloBAC11 after DNA digestion with Sau3A. We obtained 24,750 and 15,739 quality-trimmed sequences for the plasmid and BAC libraries, respectively. The reads were assembled using Arachne.
454 GS FLX sequencing
A library of single stranded DNA fragments was obtained from A. baylyi nebulized DNA according to Roche/454 standard procedures. A total of 390,596 individual reads were assembled using Newbler (20× coverage).
For the paired library, we purified 3 kb fragments after shearing using a Hydroshear device. These fragments were joined to a biotinylated linker and circularized according to Roche/454 standard procedures. After nebulization and purification of the linker-containing fragments, we amplified the library using the bead/emulsion protocol, and sequenced the products on the GS FLX sequencer. A total of 71,387 sequences were obtained after quality trimming. Of those, 18,774 were recognized as paired by the Newbler assembler. All 71,387 sequences were assembled with 296,433 unpaired reads from the previous experiment to reach 25× coverage.
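The read counts above can be checked with a few lines of arithmetic. The sketch below only restates numbers given in the text; the variable names are illustrative, and no read length is assumed because none is reported for the 454 data.

```python
# Read counts reported for the paired 454 library and the reused unpaired reads.
paired_library_reads = 71_387   # sequences kept after quality trimming
recognized_as_paired = 18_774   # reads in which Newbler detected both ends
unpaired_reads_reused = 296_433  # unpaired reads taken from the first run

# Fraction of the paired library that actually carries pairing information.
paired_fraction = recognized_as_paired / paired_library_reads

# Total number of reads entering the 25x assembly.
reads_in_assembly = paired_library_reads + unpaired_reads_reused

print(f"Reads with pairing information: {paired_fraction:.1%}")
print(f"Reads in the combined assembly: {reads_in_assembly:,}")
```

The resulting fraction is close to the "about one-fourth" quoted in the Results.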
The genomic DNA was unidirectionally sequenced on a Solexa/Illumina Genome Analyser I using standard procedures. The sequences were 36 bases long. A total of 12,248,948 reads passed the filter, giving a total coverage of about 123 genome equivalents.
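The coverage figure follows from the usual relation: depth = (number of reads × read length) / genome size. A minimal sketch, assuming the rounded genome size of 3.6 Mb quoted earlier (the function name is illustrative; the exact reference length would shift the result slightly):

```python
def fold_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


# Assumption: genome size rounded to 3.6 Mb, as stated for A. baylyi.
GENOME_SIZE = 3_600_000

solexa_depth = fold_coverage(n_reads=12_248_948, read_length=36,
                             genome_size=GENOME_SIZE)
print(f"Solexa/Illumina depth: {solexa_depth:.1f}x")
```

With the rounded genome size this gives a depth of roughly 122×, consistent with the reported value of about 123 genome equivalents.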
Assessment of error rates
The assemblies produced were compared to the reference sequence of Acinetobacter, and differences between the two versions were detected. Each contig of a given assembly was mapped onto the reference sequence using nucmer with default parameters. We only retained the best match for each contig, and the resulting alignment was parsed to check for mismatches, insertions and deletions.
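The parsing step can be illustrated as follows. This is a sketch only: it assumes the best alignment of a contig has already been exported as two gapped strings of equal length, with `-` marking gaps. That intermediate format and the function name are hypothetical and are not nucmer's native output.

```python
from collections import Counter


def count_differences(ref_aln: str, contig_aln: str) -> Counter:
    """Count mismatches, insertions and deletions in a gapped pairwise alignment.

    An insertion is a contig base facing a gap in the reference;
    a deletion is a reference base facing a gap in the contig.
    """
    if len(ref_aln) != len(contig_aln):
        raise ValueError("aligned strings must have the same length")
    counts = Counter(mismatch=0, insertion=0, deletion=0)
    for ref_base, ctg_base in zip(ref_aln.upper(), contig_aln.upper()):
        if ref_base == "-" and ctg_base == "-":
            continue  # gap in both rows carries no information
        if ref_base == "-":
            counts["insertion"] += 1
        elif ctg_base == "-":
            counts["deletion"] += 1
        elif ref_base != ctg_base:
            counts["mismatch"] += 1
    return counts


# Toy alignment with one insertion, one deletion and one mismatch.
example = count_differences("ACGT-ACGTA", "ACGTTAC-TC")
print(dict(example))
```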
Automatic error corrections with Solexa/Illumina reads
Short read sequences were aligned to the assembly using the SOAP software, with a seed size of 12 bp and a maximum gap size of 3 bp allowed per read. Only uniquely mapped reads were retained. Each difference was then considered and kept only if it met the following three criteria: (1) the error is not located in the first 5 bp or the last 5 bp of the read, (2) the quality values of the considered base, the previous base and the next base are all above 20, and (3) the remaining sequences (before and after) around the error are not homopolymers (to avoid misalignment at boundaries). The next stage piles up errors located at the same position, particularly errors that occurred inside homopolymers (since two reads that tag the same error can report different positions). Finally, each detected error was corrected if at least three reads detected the given error and 70% of the reads located at that position agreed.
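The per-read filters and the final decision rule can be sketched as below. This is an illustration under stated assumptions, not the code used in the study: the function names and data layout are hypothetical, criterion (3) on flanking homopolymers is omitted for brevity, and the 70% threshold is interpreted as the fraction of all reads covering the position that report the same variant.

```python
from collections import Counter
from typing import List, Optional

EDGE = 5             # bases ignored at each end of a read (criterion 1)
MIN_QUALITY = 20     # quality must be strictly above this (criterion 2)
MIN_SUPPORT = 3      # minimum number of reads reporting the error
MIN_FRACTION = 0.70  # minimum fraction of covering reads that must agree


def passes_read_filters(offset: int, qualities: List[int]) -> bool:
    """Apply criteria (1) and (2) to a difference at `offset` (0-based) in a read."""
    if offset < EDGE or offset >= len(qualities) - EDGE:
        return False
    # The considered base, the previous one and the next one.
    return all(q > MIN_QUALITY for q in qualities[offset - 1:offset + 2])


def call_correction(observations: List[str], consensus: str) -> Optional[str]:
    """Return the corrected allele for one consensus position, or None.

    `observations` holds what each uniquely mapped read reports at the position
    (a base, or any string encoding an indel); `consensus` is the current call.
    """
    variants = Counter(obs for obs in observations if obs != consensus)
    if not variants:
        return None
    allele, support = variants.most_common(1)[0]
    if support >= MIN_SUPPORT and support / len(observations) >= MIN_FRACTION:
        return allele
    return None


# Four of five reads report "A" where the consensus says "G": corrected.
accepted = call_correction(["A", "A", "A", "A", "G"], consensus="G")
# Only two supporting reads: left unchanged.
rejected = call_correction(["A", "A", "G"], consensus="G")
print(accepted, rejected)
```

Requiring both an absolute count and a fraction guards against two different failure modes: a single erroneous read in a thinly covered region, and a minority of discordant reads in a deeply covered one.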
Since we only retained uniquely mapped reads with a maximum of two mismatches and three indels, several regions were devoid of Solexa tags. In a first step, one or several errors might be corrected, and when the strategy is applied again, regions that were devoid of Solexa reads may then be covered. We therefore decided to iterate the previous strategy until no new errors were found. For example, at 50× coverage four cycles were required (the first cycle corrected 263 errors, the second 14, the third 2, and the fourth and last none).
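The iteration amounts to repeating the mapping and correction pass until a cycle changes nothing. A minimal sketch of that control flow, assuming a hypothetical `correct_once` callable that maps the reads to the current consensus and returns the updated sequence together with the number of corrections made; the toy pass below merely replays the per-cycle counts reported above and performs no real alignment.

```python
from typing import Callable, List, Tuple


def iterate_corrections(
    consensus: str,
    correct_once: Callable[[str], Tuple[str, int]],
    max_cycles: int = 20,
) -> Tuple[str, List[int]]:
    """Repeat the correction pass until a cycle corrects nothing.

    Returns the final consensus and the number of corrections per cycle.
    `max_cycles` is a safety bound added for this sketch.
    """
    history: List[int] = []
    for _ in range(max_cycles):
        consensus, n_corrected = correct_once(consensus)
        history.append(n_corrected)
        if n_corrected == 0:
            break
    return consensus, history


def make_replay_pass(counts: List[int]) -> Callable[[str], Tuple[str, int]]:
    """Toy stand-in for a real pass: leaves the sequence alone, replays counts."""
    remaining = list(counts)

    def replay_pass(sequence: str) -> Tuple[str, int]:
        n_corrected = remaining.pop(0) if remaining else 0
        return sequence, n_corrected

    return replay_pass


# Replay the cycle counts reported for 50x Solexa coverage.
_, cycle_counts = iterate_corrections("ACGT", make_replay_pass([263, 14, 2, 0]))
print(cycle_counts, sum(cycle_counts))
```

In this replay the loop stops after the fourth cycle, for a total of 279 corrections across the three productive cycles.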
Evolution of the read coverage during the error correction process (starting with 50× of Solexa reads leads to a usable coverage of around 17×)
Table columns: Uniquely mapped reads; Number of reads; Number of bases
We thank the Institut de Génomique du CEA for financial support, Susan Cure for the English corrections and Jean Weissenbach for continuous support. The traces and assemblies generated in the framework of this project were submitted to the Short Read Archive under accession number SRA003611.
1. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995, 269: 496-512. 10.1126/science.7542800.
2. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, Deboy RT, Davidsen TM, Mora M, Scarselli M, Margarit y Ros I, Peterson JD, Hauser CR, Sundaram JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O'Connor KJ, Smith S, Utterback TR, White O, Rubens CE, Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels MR, Rappuoli R, Fraser CM: Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci USA. 2005, 102: 13950-13955. 10.1073/pnas.0506758102.
3. Makarova K, Slesarev A, Wolf Y, Sorokin A, Mirkin B, Koonin E, Pavlov A, Pavlova N, Karamychev V, Polouchine N, Shakhova V, Grigoriev I, Lou Y, Rokhsar D, Lucas S, Huang K, Goodstein DM, Hawkins T, Plengvidhya V, Welker D, Hughes J, Goh Y, Benson A, Baldwin K, Lee JH, Diaz-Muniz I, Dosti B, Smeianov V, Wechter W, Barabote R, Lorca G, Altermann E, Barrangou R, Ganesan B, Xie Y, Rawsthorne H, Tamir D, Parker C, Breidt F, Broadbent J, Hutkins R, O'Sullivan D, Steele J, Unlu G, Saier M, Klaenhammer T, Richardson P, Kozyavkin S, Weimer B, Mills D: Comparative genomics of the lactic acid bacteria. Proc Natl Acad Sci USA. 2006, 103: 15611-15616. 10.1073/pnas.0607117103.
4. Mardis ER: Next-Generation DNA Sequencing Methods. Annu Rev Genomics Hum Genet. 2008
5. Holt RA, Jones SJ: The new paradigm of flow cell sequencing. Genome Res. 2008, 18: 839-846. 10.1101/gr.073262.107.
6. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437: 376-380.
7. Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 2007, 8: R143. 10.1186/gb-2007-8-7-r143.
8. Brockman W, Alvarez P, Young S, Garber M, Giannoukos G, Lee WL, Russ C, Lander ES, Nusbaum C, Jaffe DB: Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Res. 2008, 18: 763-770. 10.1101/gr.070227.107.
9. Roach JC, Boysen C, Wang K, Hood L: Pairwise end sequencing: a unified approach to genomic mapping and sequencing. Genomics. 1995, 26: 345-353. 10.1016/0888-7543(95)80219-C.
10. Goldberg SM, Johnson J, Busam D, Feldblyum T, Ferriera S, Friedman R, Halpern A, Khouri H, Kravitz SA, Lauro FM, Li K, Rogers YH, Strausberg R, Sutton G, Tallon L, Thomas T, Venter E, Frazier M, Venter JC: A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. Proc Natl Acad Sci USA. 2006, 103: 11240-11245. 10.1073/pnas.0604351103.
11. Barbe V, Vallenet D, Fonknechten N, Kreimeyer A, Oztas S, Labarre L, Cruveiller S, Robert C, Duprat S, Wincker P, Ornston LN, Weissenbach J, Marliere P, Cohen GN, Medigue C: Unique features revealed by the genome sequence of Acinetobacter sp. ADP1, a versatile and naturally transformation competent bacterium. Nucleic Acids Res. 2004, 32: 5766-5779. 10.1093/nar/gkh910.
12. de Berardinis V, Vallenet D, Castelli V, Besnard M, Pinet A, Cruaud C, Samair S, Lechaplais C, Gyapay G, Richez C, Durot M, Kreimeyer A, Le Fevre F, Schachter V, Pezo V, Doring V, Scarpelli C, Medigue C, Cohen GN, Marliere P, Salanoubat M, Weissenbach J: A complete collection of single-gene deletion mutants of Acinetobacter baylyi ADP1. Mol Syst Biol. 2008, 4: 174. 10.1038/msb.2008.10.
13. Wicker T, Schlagenhauf E, Graner A, Close TJ, Keller B, Stein N: 454 sequencing put to the test using the complex genome of barley. BMC Genomics. 2006, 7: 275. 10.1186/1471-2164-7-275.
14. Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L, Taillon BE, Chen Z, Tanzer A, Saunders AC, Chi J, Yang F, Carter NP, Hurles ME, Weissman SM, Harkins TT, Gerstein MB, Egholm M, Snyder M: Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007, 318: 420-426. 10.1126/science.1149504.
15. Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylie T, Schedl T, Wilson RK, Mardis ER: Whole-genome sequencing and variant discovery in C. elegans. Nat Methods. 2008, 5: 183-188. 10.1038/nmeth.1179.
16. Smith AD, Xuan Z, Zhang MQ: Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics. 2008, 9: 128. 10.1186/1471-2105-9-128.
17. Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008
18. Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics. 2008, 24: 713-714. 10.1093/bioinformatics/btn025.
19. McLean MJ, Wolfe KH, Devine KM: Base composition skews, replication orientation, and gene orientation in 12 prokaryote genomes. J Mol Evol. 1998, 47: 691-696. 10.1007/PL00006428.
20. Pihlak A, Bauren G, Hersoug E, Lonnerberg P, Metsis A, Linnarsson S: Rapid genome sequencing with short universal tiling probes. Nat Biotechnol. 2008, 26: 676-684. 10.1038/nbt1405.
21. Sirand-Pugnet P, Lartigue C, Marenda M, Jacob D, Barre A, Barbe V, Schenowitz C, Mangenot S, Couloux A, Segurens B, de Daruvar A, Blanchard A, Citti C: Being pathogenic, plastic, and sexual while living with a nearly minimal bacterial genome. PLoS Genet. 2007, 3: e75. 10.1371/journal.pgen.0030075.
22. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC: A whole-genome assembly of Drosophila. Science. 2000, 287: 2196-2204. 10.1126/science.287.5461.2196.
23. Havlak P, Chen R, Durbin KJ, Egan A, Ren Y, Song XZ, Weinstock GM, Gibbs RA: The Atlas genome assembly system. Genome Res. 2004, 14: 721-732. 10.1101/gr.2264004.
24. Mullikin JC, Ning Z: The phusion assembler. Genome Res. 2003, 13: 81-90. 10.1101/gr.731003.
25. Jaffe DB, Butler J, Gnerre S, Mauceli E, Lindblad-Toh K, Mesirov JP, Zody MC, Lander ES: Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 2003, 13: 91-96. 10.1101/gr.828403.
26. Delcher AL, Phillippy A, Carlton J, Salzberg SL: Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002, 30: 2478-2483. 10.1093/nar/30.11.2478.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.