The relative value of draft versus completely finished genomes has been debated for over a decade (e.g. ), during which time the finishing of genomes, including bacterial ones, has remained time-consuming and costly. However, this is expected to change as techniques that simplify the finishing process mature in the near future . At the same time, NGS technologies are developing rapidly in terms of sequence output and cost, which allows genome sequences to be obtained economically. In addition, some NGS technologies can already produce sequence reads comparable in length to traditional Sanger sequences at high coverage (~30-fold) within hours, which should simplify analysis. For these reasons, we conducted an update study to assess the usefulness of these data for rapid, preliminary genome studies. This study compares only free and open source tools, because these alone allow individual researchers and small research groups both to assemble and annotate sequence reads into completed genomes and to tune the analysis pipelines.
Before embarking on bioinformatic genome analysis, researchers face a multitude of choices regarding which analytical protocols and tools to apply – a choice that is especially critical for small and independent research groups. Many bioinformatic tools are either free and open source or freely available for academic use, and are thus natural choices. However, these two types of software differ. The installation, operation, and maintenance of free and open source software often require specific technical expertise, and current best practices are either poorly documented or entirely lacking – drawbacks that are expected to diminish rapidly over time. Closed-source tools that limit our academic freedoms, including those without licensing fees, also limit the transparency of their pipelines, which may adversely affect the reproducibility of computed results.
Analogous to the choice of analysis tools, the choice of sequencing technology has not been clearly settled. Our results show that Roche 454 pyrosequencing can be a good choice for obtaining de novo draft assemblies of environmental bacterial strains relatively economically; however, it remains problematic for re-sequencing projects because of its high error rate. The errors carry high-quality signals at the single-nucleotide level in homopolymeric regions and therefore introduce ambiguities or contradictions into the reference mapping assembly. On the other hand, Roche 454 technology is suitable for de novo sequencing and draft assembly, and allows automatic annotation of WGS bacterial genomes with relatively low sequencing effort, i.e. currently one or two bacterial genomes per run. Because de novo assemblers take a more probabilistic approach, they are less influenced by erroneous Roche 454 reads in homopolymeric regions. Erroneous nucleotides in individual reads, even with high quality signals, are levelled off by the average nucleotide frequencies at each position, so no artificial indels are introduced and the errors do not disturb the consensus in the way they do in reference mapping.
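The levelling effect of the consensus can be illustrated with a minimal sketch (a hypothetical toy example, not the actual MIRA3 algorithm): per-column majority voting over aligned reads cancels a homopolymer over- or under-call that appears in only a minority of reads.

```python
from collections import Counter

def column_consensus(aligned_reads):
    """Per-column majority vote over gap-padded, aligned reads.

    Homopolymer-length errors in individual 454 reads appear as gap
    characters ('-') in the alignment; as long as most reads agree,
    the column-wise consensus cancels them out.
    """
    length = max(len(r) for r in aligned_reads)
    consensus = []
    for i in range(length):
        column = Counter(r[i] for r in aligned_reads if i < len(r))
        base, _ = column.most_common(1)[0]
        if base != '-':  # a majority gap means no inserted base here
            consensus.append(base)
    return ''.join(consensus)

# One read over-calls the AAAA homopolymer, one under-calls it;
# majority voting still recovers the true AAAA run.
reads = ["GGAAAA-TT",
         "GGAAAAATT",   # one extra A (over-call)
         "GGAAAA-TT",
         "GGAAA--TT"]   # one missing A (under-call)
print(column_consensus(reads))  # -> GGAAAATT
```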
Automated annotation of draft genomes provides preliminary information about the genomes of novel organisms, and this annotation approach ought not to yield highly erroneous results that may mislead the researcher. At this level, all of the pipelines we tested performed very well; the artefacts in the annotation data mostly originated from the re-mapping assembly rather than from the annotation process itself. However, the coding regions were not precisely located (exact start and stop) when compared between different automated annotation tools, because none of these tools attempts to locate the origin of replication .
Two algorithms are widely employed in de novo genome assembly. The first is the overlap layout consensus (OLC) or overlap contig consensus approach; the other is the de Bruijn graph (DBG) or Eulerian path . The latter is more useful for shorter reads (<150 bp) numbering in the hundreds of millions, compared to a few million 454 Titanium pyrosequencing reads (>400 bp). We used our data to conduct preliminary tests on the effectiveness of Velvet, which implements a DBG-based approach ; however, the assemblies were highly fragmented (data not shown). For a comprehensive review of de novo assembly algorithms we refer the reader to Miller et al. .
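The DBG idea can be sketched as follows (a toy construction, not Velvet's implementation): reads are decomposed into k-mers, nodes are (k-1)-mers, and a contig corresponds to a walk that consumes the edges.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk_contig(graph, start):
    """Greedy edge-consuming walk (a simplified Eulerian-path traversal)."""
    contig, node = start, start
    while graph[node]:
        node = graph[node].pop()
        contig += node[-1]
    return contig

# Overlapping toy reads reassemble into a single contig.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
g = de_bruijn_graph(reads, k=4)
print(walk_contig(g, "ATG"))  # -> ATGGCGTGCA
```

With short, error-containing real reads the graph acquires tips and bubbles, which is why assemblies fragment; Velvet's value lies in resolving those structures, not in the basic traversal shown here.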
MIRA3 is a hybrid of the OLC and greedy algorithms , and is an iterative assembler that learns from past mistakes made by its OLC and greedy components (B. Chevreux, personal communication). The MIRA3 project is actively developed and has a growing user base. MIRA3 performed best in all aspects of bacterial genome assembly using Roche 454 reads and currently seems to have the greatest potential. In addition, it can combine high-quality Illumina reads into a hybrid assembly, which potentially evades bottlenecks in de novo assembly of relatively simple bacterial genomes and will hopefully allow this step to be fully automated in the near future.
Newbler is a commercial product developed by Roche Diagnostics that probably uses the OLC approach; it is usually freely available to laboratories running Roche 454 sequencing. It is not a free and open source software package, and release descriptions indicate that the current algorithm may differ from the originally published one. In our hands, the available version of Newbler performed almost as well as MIRA3; however, commercial tools cannot be evaluated in full detail. Celera Assembler (CA) is another open source OLC-based tool that evolved from a Sanger-era assembler; the revised pipeline for NGS reads, including Roche 454 data, is generally called CABOG .
It should be mentioned that we did not apply paired-end (PE) library sequencing, although such mate pairs would clearly allow more contigs to be closed into scaffolds, or at least into an ordered and oriented list of longer contigs. We disregarded this approach in the search for the simplest and most cost-efficient method for assembling bacterial genomes; in theory, a coverage of 30-fold should allow an average bacterial genome to be assembled. The bacterial WGS contigs we obtained should not forfeit the correct prediction of many genes, and from this point of view proper scaffolding is more important for the larger genomes of eukaryotes.
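The intuition that 30-fold coverage should suffice can be checked against the classic Lander-Waterman model (an idealised calculation that ignores repeats and minimum-overlap thresholds; the genome size and read length below are illustrative, not our measured values):

```python
import math

def lander_waterman_contigs(genome_size, read_len, coverage):
    """Lander-Waterman estimate of the expected number of contigs.

    N = coverage * genome_size / read_len reads give c-fold coverage;
    E[contigs] ~= N * exp(-c). Idealised: no repeats, and any overlap
    is assumed detectable.
    """
    n_reads = coverage * genome_size / read_len
    return n_reads * math.exp(-coverage)

# Illustrative numbers: a 4 Mb bacterial genome, 400 bp Titanium reads.
for c in (5, 10, 30):
    print(c, lander_waterman_contigs(4e6, 400, c))
# Roughly 337 expected contigs at 5x, ~4.5 at 10x, and effectively
# zero (~3e-8) at 30x - consistent with 30-fold being ample in theory.
```

In practice, repeats and coverage bias keep real 30-fold assemblies in tens of contigs, which is why the model is only a lower bound on fragmentation.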
Reference mapping is a distinct branch of genome assembly related to alignment; it should be much easier to map NGS reads to a very close reference sequence. We used reference mapping only with our test strain, and our initial approach was to use only Mosaik  for this purpose. Mosaik is a reference-guided assembler that uses a hash table of short reads and the Smith-Waterman algorithm to align them to the reference genome. The purpose of this analysis was to compare the finished genome with re-sequenced 454 reads mapped to the reference and to the de novo assembly; we did not expect surprises in this process. However, Mosaik was initially unable to close all gaps between the contigs (18 contigs) resulting from the re-sequenced data, leaving ~20,000 bp out (Table 1). Because the preliminary results were somewhat suboptimal, including several regions that were not covered, we tested the reference mapping performance of Newbler and MIRA3 for comparison. Newbler performed better, but as with the Mosaik mapping it left some regions without mapped reads (overall statistics in Table 1). MIRA3 reference mapping revealed that a few regions were still weakly covered, although the total coverage (30-fold) should be acceptable. Fine-tuning of the MIRA3 reference mapping would allow all gaps to be closed, since they were covered only by a few reads that were rejected and thus not mapped by the software.
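A minimal Smith-Waterman scorer (toy scoring parameters, not Mosaik's tuned implementation) illustrates why a single homopolymer over-call in a read costs only one gap penalty in a local alignment against the reference:

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Minimal Smith-Waterman local alignment score (no traceback)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if seq_a[i-1] == seq_b[j-1] else mismatch)
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

# A "read" with one homopolymer over-call still aligns locally to the
# "reference": 8 matches (+16) minus one single-base gap (-2) = 14.
ref  = "ACGTAAAACGT"
read = "GTAAAAACG"   # extra A in the homopolymer run
print(smith_waterman(ref, read))  # -> 14
```

Many such near-perfect alignments still pass a mapper's score threshold, which is why homopolymer errors show up as indel ambiguities in the mapped consensus rather than as rejected reads.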
Gene prediction and annotation - tools
In most so-called automated online tools, truly automatic annotation with no manual intervention remains a design goal in terms of predicting tRNA, rRNA, and protein-coding genes – coding sequences (CDSs). The usual pipeline for finding a gene in a raw DNA sequence involves detecting an ORF, locating the gene, and predicting its function by comparison with genes in existing databases (see below). Automated annotators normally use several HMMs (Hidden Markov Models) and BLAST-based gene prediction methods, e.g. tRNAscan-SE , and BLASTp and BLASTn for protein-coding and RNA genes, respectively.
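The ORF-detection step can be sketched as follows (a deliberately simplified, forward-strand-only toy; real gene callers also scan the reverse strand, score codon usage with HMMs, and weigh BLAST evidence, as described above):

```python
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Scan all three forward frames for ATG...stop open reading frames."""
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i+3] == START:
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j+3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append(seq[i:j+3])
                        i = j  # resume scanning after this ORF
                        break
            i += 3
    return orfs

# Toy contig with one ORF in the third frame.
print(find_orfs("CCATGAAATTTGGGTAACC"))  # -> ['ATGAAATTTGGGTAA']
```

Note that an ORF running off the end of a contig is silently lost here, which mirrors how fragmented drafts can truncate gene predictions.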
Thereafter, the CDSs are assigned annotations based on various functional resources such as COG clusters , Pfam , TIGRfam, Gene Ontology, etc. Functional annotations may be further “grouped” into metabolically relevant “pathways” such as COG functional categories, Entrez Protein Clusters (ProtClustDB) , FIGfams subsystems , and the KEGG  and MetaCyc  pathway collections, etc. Annotated genomes may then be maintained by integrated network systems such as RAST-SEED, IMG, and others. Such networks allow further comparative analysis to be performed easily, with no need for the researcher to have deep expertise in bioinformatic algorithms and tools.
All of the annotation pipelines we tested are second generation tools, which try to combine multiple gene-calling algorithms and knowledge databases for comparison with related species and training sets. Therefore, these tools should perform much better than the early gene-prediction methods such as Glimmer  and GeneMark .
Draft WGS assembly and preliminary annotation using these pipelines proved able to describe the core genomes and various metabolic features of novel environmental microorganisms fairly well: genome size, basic metabolic pathways, and the number of tRNAs, but not rRNAs. For example, two Flavobacterium strains isolated from the same environment in the same season showed large differences in genome size; strain GOBB3-209 is 45% smaller than GOBB3-C103-3. It is therefore not surprising that strain GOBB3-209 lacks several features found in GOBB3-C103-3, e.g. the capsule and extracellular polysaccharide pathway, the denitrification pathway, etc. (data not shown). Re-annotation of finished genomes after a re-sequencing project (e.g. A. borkumensis SK2 and Roche 454 pyrosequencing in our study) using automated pipelines may be less useful; in our hands, no improvement was observed after re-annotation. For example, even the prediction of rRNA operons was suboptimal; only one copy was located in each of the de novo sequenced and assembled genomes, although most bacteria described so far have more than one rRNA operon .
However, it is worrying that most genomes currently deposited in public databases rely on automated methods, because it has been reported that the performance of these methods has not improved over several years . The deposited data sets may contain genomes and gene annotations that differ in precision and resolution owing to the use of different sequencing methods and annotation procedures. The most severe problem is that erroneous and incomplete annotations are often carried over into public resources and are difficult to trace and correct afterwards. For example, several hundred CDSs might be removed or added and the start sites corrected (e.g. the re-annotation of the uropathogenic Escherichia coli strain CFT073 ), totalling more than 1000 changes when such data are re-evaluated carefully. Even when many biochemical, physiological, and genetic data support broad genome similarity, current automated annotation tools can fail to predict certain metabolic pathways. Therefore, more detailed studies that combine all types of available data are needed .
Bottlenecks to be considered
The computational efficiency of de novo assembly algorithms implemented in free and open source software (e.g. MIRA3) no longer seems to be a bottleneck. At least with longer reads at ~30-fold coverage, a reasonable draft genome can be produced from direct shotgun sequencing within hours and without manual intervention.
On the other hand, the tools supplied by bioinformatic service providers such as RAST, PGAAP, and IMG cannot yet be fully automated and involve manual intervention. The performance of these tools will clearly improve as more carefully curated and finished genomes become available to aid the automation process. However, considering the number of genomes from unique species available today (in the range of several thousand ), and because the potential abundance of microbes in nature is huge , progress in this area will take time. Nevertheless, the bacterial genomes used in our study are relatively well covered by existing phylogenetic knowledge and databases of fully sequenced and finished genomes. The GOLD database  contains closely related bacteria from the genus Marinomonas: M. mediterranea MMB-1, ATCC 700492 (unpublished), M. posidonica IVIA-Po-181 (unpublished), and Marinomonas sp. MWYL1 (unpublished), as well as Alcanivorax borkumensis SK2 (finished) ; and from the genus Flavobacterium: F. columnare ATCC 49512 (unpublished), F. branchiophilum FL-15, F. johnsoniae UW101 , and F. psychrophilum JIP02/86 . Therefore, after true finishing of the drafts and manual curation, the genomes of these organisms can be described without particular effort at the same level as their phylogenetic relatives with finished genomes. Further refinement and re-annotation would be based directly on new biological discoveries.
The usefulness of detecting SNPs has been discussed previously , and because the error rate of 454 pyrosequencing is higher than that of Sanger sequencing and probably also of Illumina (Solexa/Genome Analyser), it may be a suboptimal choice for re-sequencing and reference mapping projects. Pop & Salzberg  reasoned that fragmented draft genomes would produce fewer annotated genes because of false stop codons; our observations differ but lead to a similar conclusion with regard to reference mapping. The flaws in reference mapping caused by 454 pyrosequencing errors led to frame shifts and thus to a greater number of CDSs.
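The effect can be demonstrated with a toy example (hypothetical sequence, not from our data): a single base dropped from a homopolymer run shifts the reading frame and creates a spurious in-frame stop, splitting one CDS into two and thus inflating the CDS count:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons_to_stop(seq):
    """Return the in-frame codons up to and including the first stop."""
    out = []
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        out.append(codon)
        if codon in STOP_CODONS:
            break
    return out

true_seq = "ATGAAACTAACGGCT"   # correct read: no in-frame stop
err_seq  = "ATGAACTAACGGCT"    # one A under-called in the AAA run
print(codons_to_stop(true_seq))  # -> ['ATG', 'AAA', 'CTA', 'ACG', 'GCT']
print(codons_to_stop(err_seq))   # -> ['ATG', 'AAC', 'TAA']  premature stop
```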
We did not use paired-end (PE) libraries because we wanted to keep the cost of sequencing preparation as low as possible. Our results indicate that the PE approach is useful for scaffolding truly de novo-sequenced data, even at sufficient coverage (~30-fold) and even in the assembly of relatively small and simple bacterial genomes. Complex genomes containing repetitive elements may need more attention, as is also true for the annotation of genomes containing genomic and/or pathogenic islands. Isolation of genomic DNA by standard methods often fragments the chromosomal DNA into pieces with a size limit of a few tens of kb, so extended PE libraries with very large inserts are not easy to construct and require specific treatment of the genomic DNA prior to sequencing. Repetitive elements larger than the fragments in the DNA extracts therefore cannot be assembled correctly.