MicroPIPE: validating an end-to-end workflow for high-quality complete bacterial genome construction

Background Oxford Nanopore Technology (ONT) long-read sequencing has become a popular platform for microbial researchers due to the accessibility and affordability of its devices. However, easy and automated construction of high-quality bacterial genomes using nanopore reads remains challenging. Here we aimed to create a reproducible end-to-end bacterial genome assembly pipeline using ONT in combination with Illumina sequencing. Results We evaluated the performance of several popular tools used during genome reconstruction, including base-calling, filtering, assembly, and polishing. We also assessed overall genome accuracy using ONT both natively and with Illumina. All steps were validated using the high-quality complete reference genome for the Escherichia coli sequence type (ST)131 strain EC958. Software chosen at each stage were incorporated into our final pipeline, MicroPIPE. Further validation of MicroPIPE was carried out using 11 additional ST131 E. coli isolates, which demonstrated that complete circularised chromosomes and plasmids could be achieved without manual intervention. Twelve publicly available Gram-negative and Gram-positive bacterial genomes (with available raw ONT data and matched complete genomes) were also assembled using MicroPIPE. We found that revised basecalling and updated assembly of the majority of these genomes resulted in improved accuracy compared to the current publicly available complete genomes. Conclusions MicroPIPE is built in modules using Singularity container images and the bioinformatics workflow manager Nextflow, allowing changes and adjustments to be made in response to future tool development. Overall, MicroPIPE provides an easy-access, end-to-end solution for attaining high-quality bacterial genomes. MicroPIPE is available at https://github.com/BeatsonLab-MicrobialGenomics/micropipe. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07767-z.


Background
Bacterial genome construction using short-read sequencing has historically been difficult, largely due to the abundance of repeat sequences which collapse during de novo assembly, resulting in breaks in contiguous sequence [1]. However, long-read sequencing technologies, such as Oxford Nanopore Technology (ONT) and Pacific Biosciences (PacBio), are able to traverse these repeats enabling complete bacterial genomes [2]. Long reads also present the opportunity to correctly place single nucleotide variants (SNVs), particularly across complex regions of the genome that require more genomic context than short reads can provide. The accessibility and affordability of the ONT MinION sequencing device has resulted in its widespread use globally, allowing researchers the autonomy to perform their own experiments much more rapidly compared to using external sequencing facilities [3]. However, bacterial genome construction continues to be problematic, especially for non-specialised researchers.
Numerous tools designed to address aspects of complete bacterial genome construction have been developed by both ONT and community users, however few pipelines exist that offer end-to-end construction of bacterial genomes. Currently, these include Katuali [4], CCBGpipe [5], ASA 3 P [6] and Bactopia [7]. Katuali is an ONT-developed assembly pipeline implemented in Snakemake. It offers the user flexibility in software choice and is well-documented, but provides limited rationale or validation of the provided software. Additionally, it performs exclusively long-read assembly with no shortread polishing included. While ASA 3 P and Bactopia are able to generate assemblies using nanopore data, overall these pipelines were not designed solely for de novo assembly and are more focused on reproducible and comprehensive downstream analysis. CCBGpipe is distributed via Docker and implements a series of python scripts to run Canu with Racon and Nanopolish. However, similar to Katuali, this pipeline also performs Nanopore-only assembly (without Illumina) and was designed using Canu version 1.6, which is now several releases behind the current version (v2.1.1).
Substitution errors in nanopore reads have improved dramatically over recent years, from read accuracies of 60% [8] to the currently reported 95% for 1D reads using R9.4.1 flow cells [9]. While this is approaching that of Illumina (99.9%) [10] and PacBio (99%) [11], single nucleotide insertion/deletion (indel) errors remain problematic [12,13]. Improvements in base-calling software (e.g. that account for methylation) and the introduction of the R10 pore have reduced these artefacts, but polishing nanopore assemblies with Illumina data has been generally required to achieve the highest quality possible [14].
With the rapid pace of ONT progression, development of new software and pipelines, or reappraisal of existing ones, has become an ongoing necessity. This has prompted the need for appropriate validation sets, to assess (or reassess) the accuracy of results. While simulated datasets provide an initial assessment of a tool's ability, data generated from biological sources provide additional confidence in its real-world application, as has been developed previously using metagenomic communities [15,16]. Escherichia coli sequence type (ST)131 represents a globally disseminated lineage that has been intensively studied as a result of its recent emergence, antibiotic resistance and link to human disease [17][18][19]. Extensive knowledge of both E. coli (as a species) and the ST131 lineage makes it an ideal dataset to use for software and pipeline validation. Additionally, the E. coli ST131 strain EC958 represents an extensively curated and highly accurate reference genome, having been sequenced on multiple occasions using PacBio, Illumina and 454 pyrosequencing [20].
Here we present our complete pipeline, MicroPIPE, for automated construction of high-quality bacterial genomes using software chosen by systematic comparison of the most popular tools currently available in the community. Validation of each pipeline stage was completed using the high-quality E. coli ST131 reference genome, EC958. Subsequent validation of the complete pipeline was performed using 11 previously characterised ST131 E. coli strains, for which completely assembled genomes were already available. Finally, we tested MicroPIPE on 12 other publicly available bacterial isolates that had both a complete genome and associated raw nanopore sequencing data available. In all cases, we show that high-quality bacterial reference genomes can be achieved using MicroPIPE.

Public data
The EC958 complete genome was downloaded from NCBI (GenBank: HG941718.1, HG941719.1, HG941720.1) [20]. Illumina reads for 12 ST131 genomes and draft assemblies for 95 ST131 were accessed from [17]. Twelve publicly available complete genomes were also selected to test MicroPIPE, under the following criteria: (i) the raw nanopore sequencing files (fast5) were available, (ii) a complete genome was made available for the same strain and (iii) Illumina sequencing data were available for the same strain. These 12 genomes represented 7 species from both gram-positive and gramnegative bacteria with chromosome sizes between 1.8 Mbp -5.6 Mbps. A complete list of data used is provided in Supplementary dataset 1.

Culture and DNA extraction
Twelve ST131 E. coli isolates (including EC958) were grown from single colonies in Lysogeny Broth (LB) at 37°C overnight with 250 rpm shaking. The overnight cultures (1.5 mL) were then pelleted for DNA extraction using the Wizard Genomic DNA Purification Kit (Promega) following manufacturer's protocol with modifications. Briefly, the cell pellet was lysed following the protocol for Gram negative bacteria. RNA was removed by 1 h incubation at 37°C with RNase and the lysate was then mix with Protein Precipitation Solution by vortexing for 5 s at max speed using Vortex-Genie 2 with horizontal tube adapter (Scientific Industries). The DNA was precipitated using isopropanol and washed with 70% ethanol. The DNA pellet was air-dried and then rehydrated in 100 μl EB buffer (QIAgen) by incubation at 65°C for 1 h. The DNA was quantified using a Qubit fluorometer (ThermoFisher Scientific) and the DNA fragment size was estimated using agarose gel electrophoresis (0.5% agarose in TAE, 90 V, 1h30m).

Nanopore sequencing
DNA from 12 ST131 E. coli were multiplexed onto a single FLO-MIN106 flow cell using the rapid barcode sequencing kit (SQK-RBK004) as per manufacturer's recommendation with the following adjustments: the barcoded DNA was pooled without a concentration step using AMPure XP beads prior to sequencing. Read metrics for each isolate are given in Supplementary Table 1.

Basecalling
Reads were basecalled using Guppy (v3.4.3) "fast" and "high-accuracy" modes. Fast mode was evaluated using both GPU and CPU servers, while the "high-accuracy" mode was evaluated using only GPU as the time to completion for this mode became unfeasible when run using CPUs. Upon the release of Guppy v3.6.1, reads were rebasecalled using only the "high-accuracy" mode. Guppy versions (3.4.3 and 3.6.1) were tested using the methylation aware config file "dna_r9.4.1_450bps_modbases_ dam-dcm-cpg_hac.cfg".

Compute resources
All results were produced using cloud-based nodes with 16vCPUs and 32GB RAM. For the GPU node, the GPU is a NVIDIA Tesla P40 24GB while the CPUs are 2x Intel Xeon Silver 4214 2.2G (12C/24 T, 9.6GT/s, 16.5 M Cache, Turbo, HT [85 W] DDR4-2400).

ST131 phylogeny
ParSNP (v1.5.2) [39] was used to create an ST131 phylogeny using the 12 ST131 E. coli assembled in this study in addition to 95 ST131 E. coli short-read assemblies from Petty and Ben Zakour et al. [17]. Recombination was removed using PhiPack [40], as implemented in ParSNP. To evaluate the accuracy of each assembly and polishing step, we included our 12 completely polished assemblies (long and short read), 12 unpolished assemblies, 12 long-read polished assemblies and 12 shortread polished assemblies. The tree was visualised using Figtree (http://tree.bio.ed.ac.uk/software/figtree/) and iTOL [41].

Validation of pipeline stages by comparison to EC958 complete genome
The main goal of this study was to create a robust and easily applicable pipeline for the construction of highquality bacterial genomes with minimal manual manipulations. To achieve this, we first evaluated the performance of commonly used software at each stage of bacterial genome construction using the high-quality EC958 genome (Accession: HG941718) as our standard for final genome accuracy. Figure 1 shows a diagram of the whole workflow, indicating the software chosen for comparison at each stage. Nanopore reads for EC958 were generated on a multiplexed run of 12 using the rapid barcoding kit on an R9.4.1 flow cell.

Basecalling
When considering the ongoing stability and accuracy of our overall pipeline, we decided to limit our basecalling validation to software that we were confident would be consistent and well-maintained for the foreseeable future. Many existing basecallers (such as Bonito, Flappie and Runnie) are currently research releases and therefore have minimal support and unknown longevity. Other basecallers are either depreciated (Albacore) or no longer updated (Scrappie). As such, we decided to focus our analysis on Guppy, which is the ONT recommended basecaller and is stably released and maintained.
Here we tested Guppy using both the "fast" and "highaccuracy" modes, as well as the CPU vs. GPU configurations. When using Guppy v3.4.3 with the "high-accuracy" setting on GPU servers we generated reads with approximately 91.0% accuracy in 828.5 min (13.81 h). Using the "fast" mode on CPUs, we were able to generate 88.9% accuracy in 2948.4 min (49.14 h) ( Table 1). Testing the "high-accuracy" mode on a CPU server was unfeasible due to the time required for processing (fewer than 10% of reads completed basecalling in 1 week). Despite the lower per-read accuracy when using CPUs and the "fast" basecalling setting, the consensus quality of the overall finished genome (after assembly and polishing through MicroPIPE v0.8) was of comparable quality to that generated with the GPU and high-accuracy setting (Table 1).
We also tested the effects of methylation and found that using the "high-accuracy" model with methylationaware basecalling achieved a similar per-read accuracy (90.6%) to the "high-accuracy" only model. The final assembly, however, had fewer SNPs (3 vs. 23 originally) and indels (31 vs. 45 originally) compared to the reference standard (Table 1).

Demultiplexing
For demultiplexing we tested three tools: Deepbinner [21], Guppy_barcoder [45] and qcat [46]. While Guppy and qcat rely on basecalled reads, Deepbinner uses the raw fast5 reads. As such, we compared the total number of binned reads after both basecalling and binning for each tool. Overall, qcat was able to bin 89% of reads, compared to 84% for Guppy_barcoder and 75% for Deepbinner (Supplementary Figure 1). Initially we chose qcat as the default demultiplexer as we prioritised read retention to maximise coverage of each genome. However, following the recent depreciation of qcat (det a i l e d o n t h e i r G i t H u b : h t t p s : / / g i t h u b . c o m / nanoporetech/qcat), ONT is recommending the use of the Guppy demultiplexer. As such, Guppy was chosen as the default demultiplexer for MicroPIPE, while qcat is still optionally available within the pipeline.

Filtering
Here we tested two filtering tools: Filtlong and Japsa. Filtlong has the advantage of being versatile enough to filter based on a number of requirements, such as read length, quality, percentage of reads to keep and the option of using an external reference. Japsa primarily filters based on read length and quality. Read metrics after filtering using each tool are given in Supplementary Figure 2. Overall, we found that filtering with Japsa retained more reads, but with a reduced N50 read length and median read quality compared to Filtlong. Both tools took an equivalent amount of time to run. For all downstream analysis we filtered reads using Japsa with a minimum average quality cut-off of Q10 and 1 kb minimum read length, although Filtlong would have been equally suitable. Both filtering tools are available as optional steps in MicroPIPE. We have also included Rasusa [47] as an optional tool to randomly subsample large datasets down to a specific coverage (as necessary based  on user needs). This subsampling step is performed before trimming in order to reduce computational time.

Long-read-only assembly
A number of tools have been designed for de novo assembly from long reads. Here we compared six popular assembly tools and evaluated speed, completeness (of the chromosome and plasmids, including circularisation) and correctness (i.e. nucleotide identity) based on the complete EC958 reference genome standard, which contains 1 chromosome (5,109,767 bp) and 2 plasmids (135, 602 bp and 4080 bp). Parameters used for all assemblers are given in Supplementary Dataset 1.
Overall, we found that all assemblers constructed the chromosome and larger (~135 kb) plasmid (Fig. 2, Supplementary Table 2). Raven, Redbean and Shasta did not assemble the smaller~4 kb plasmid. While Canu was able to assemble both plasmids, closer inspection found them to be much larger than expected (1.4x and 2x larger for the large and small plasmid, respectively) due to overlapping ends that required additional trimming. Interestingly, both Flye and Canu assembled a third, previously unidentified, small plasmid of~1.8 kb in size. This small plasmid was only identified when the Flye "--plasmids" mode was selected (to rescue short unassembled plasmids) and when certain or no filtering parameters were applied to the reads prior to assembly (Supplementary Table 3). Comparison of this small plasmid to the Illumina data for the EC958 reference genome standard confirmed its presence and was likely missed in the original assembly.
For most de novo assemblies, a number of small (< 4.5 kb) misassemblies were detected, mainly on the chromosome (Fig. 2). This included a small inversion, which on closer inspection was found to be an invertible phage tail protein that has been characterised previously [20]. This inversion was found in the Flye, Unicycler, Raven and Redbean assemblies and was not counted as a misassembly due to its biological relevance.
Additional contigs were found in both Canu and Unicycler (long-read only mode). The three additional contigs produced by Unicycler all matched other parts of the EC958 reference genome standard (two on the chromosome, one on the larger plasmid). The additional contig in Canu matched part of the additional~1.8 kb plasmid.
In terms of speed, Shasta, Redbean and Raven were the fastest assemblers, completing in less than 30 min. Of the remainder, Flye was four times faster than Canu and two times faster than Unicycler. The majority of contigs from all assemblers were reported as circularised upon assembly completion, with the exception of the additional contigs in Canu and Unicycler. Redbean did not generate circularisation information, although the chromosome and plasmid contigs could be circularised manually or using 3rd party software following assembly. Overall, we found that Flye generated the best de novo assembly from long read data without the need for manual intervention.

Polishing
Polishing of assemblies generated using long reads is currently regarded as a necessity for ONT data due to Fig. 2 Assembly comparison: long horizonal bars (in greyscale and red) represent contiguous sequences generated by each assembler. The chromosome and plasmids 1 and 2 are coloured according to their overall nucleotide identity when compared to the EC958 reference genome standard (indicated by the scale on the left). Plasmid 3 was only recovered when assembling with Flye and Canu, as indicated. The "other" column refers to contigs that were generated by assemblers but were redundant to the assembly (coloured red). The additional blue horizontal bars in the Canu and Redbean assemblies represent the increased size of the contigs from these assemblers. Contigs that were not reported as circular are marked with a red asterisk(*), while contigs that required manual trimming for circularisation are marked with a blue asterisk. Misassemblies are marked with a red vertical line at their approximate position. The phage tail protein inversion is marked with a blue hourglass high per-read errors that can persist through to the de novo assemblies [14]. Here we tested the polishing capabilities of three different tools (Racon/Medaka, Next-Polish and Nanopolish) using nanopore long reads against the de novo assembly generated using Flye. We additionally tested polishing with Illumina short reads (NextPolish and Pilon), which have a higher basecall accuracy. Polishing was tested both independently (i.e., long read and short read separately) as well as sequentially (long read followed by short read polishing) to determine the best polishing protocol.
Overall, we found that polishing with Racon and Medaka (four rounds of Racon and one round of Medaka, using long reads) followed by NextPolish (two rounds using short reads) achieved the most accurate assemblies (Fig. 3, Supplementary Table 4). Polishing using only long or short reads did not produce comparable levels of accuracy, therefore we emphasize the requirement of short read sequencing in parallel with Nanopore for high-quality complete genome assembly (as is already commonly done).
To confirm our choice of Flye as the best assembler, we polished assemblies generated from the other five long-read assemblers, described above, using this strategy (Supplementary Table 5). The polished Flye assembly remained the most accurate, closely followed by the polished Raven assembly.

Hybrid assembly
In addition to long-read assembly (followed by shortread polishing), hybrid assemblers capable of using both long and short reads simultaneously have also been developed, and include Unicycler, MaSuRCA and SPAdes. Comparison of these pipelines to our genome completed with Flye, Racon, Medaka and NextPolish found that they did not outperform our current method. Unicycler was the only hybrid assembler able to completely resolve the chromosome and both plasmids (SPAdes failed to circularise the chromosome while MaSuRCA was unable to assemble the 4 kb plasmid) (Supplementary Table 6). Additional long and short read polishing greatly improved the accuracy of the Unicycler and SPAdes hybrid assemblies but not MaSuRCA (Supplementary Table 5). We compared the quality of the genomes generated by either the best long-read only assembly (Flye) or the best hybrid assembler based on accuracy and structure (Unicycler) and polished with the same strategy. The polished assemblies contained a similar number of indels compared to the EC958 reference genome standard, however the Flye assembly contained around two-fold fewer substitution errors (Supplementary Table 5). Furthermore, Flye was nearly eight times faster than Unicycler (Supplementary Table 6).

Final pipeline
Based on the results of our comparative analysis for all of the major steps of bacterial genome assembly, we have developed MicroPIPE (Fig. 4). The pipeline is written in Nextflow [48] and the dependencies are packaged into Singularity [49] container images available through the Docker Hub and Quay.io BioContainers repositories. The bioinformatics workflow manager Nextflow allows users to run the pipeline locally or using common High-Performance Computing schedulers. Each step of the pipeline uses a specific container image which enables easy modifications to be made in the future to include new or updated tools. Furthermore, in addition to the recommended default pipeline settings, MicroPIPE also provides alternative software options and/or parameters to suit the user's individual needs. The pipeline is freely available on GitHub: https://github.com/BeatsonLab-MicrobialGenomics/micropipe. Fig. 3 Polishing results for EC958 ONT Flye assembly: Comparative analysis of (i) long read polishing only, (ii) short read polishing only, and (iii) sequential long read and short read polishing, using various tool combinations. Comparison metrics were the number of SNPs/indels to the EC958 reference genome standard (by DNAdiff), run time and quality score (by Poxomis assess_assembly) Evaluation of remaining differences with EC958 reference genome standard The final genome for EC958 produced by MicroPIPE v0.8 was compared to the previously published EC958 reference genome standard (GenBank: HG941718.1) to assess any remaining differences. We observed a single 3.4 kb inversion corresponding to a phage tail protein switching event previously characterised in EC958 [20]. Overall, there were no other structural rearrangements. The final assembly contained an additional~1.8 kb plasmid, with 100% nucleotide identity to previously reported E. coli plasmids (GenBank records CP048320.1, KJ484633.1, [50]). This plasmid appears to have been lost during size selection when constructing the original genomic DNA library for PacBio RSII sequencing of EC958 as it could be identified from de novo assembly of the corresponding Illumina reads.
Comparison of the two assemblies identified 68 remaining differences (66 on the chromosome, 2 on pEC958) (for full list, see Supplementary Dataset 1). The two differences in the plasmid sequence correspond to known errors in the EC958 reference genome standard (PacBio assembly constructed without Illumina polishing). The majority of the chromosomal differences were indels (n = 45, 67%) ranging from 1 to 6 bp in size. These indels were mainly found in rRNA (n = 31), tRNA (n = 4), insertion sequences (n = 4), or phage-related genes (n = 2). The remaining 23 differences were SNPs, which were similarly found mainly in rRNA (n = 13) and insertion sequences (n = 8). These remaining differences likely represent an inability of current short-read polishing to adequately determine true alleles in repetitive regions of the genome. Using methylation-aware basecalling was found to significantly improve these errors, with only 3 SNPs and 31 indels (Supplementary Table 7).

MicroPIPE validation using 11 ST131 E. coli
To further test the robustness of MicroPIPE on other genomes, we included an additional 11 wellcharacterised E. coli ST131 strains [17] on a multiplexed run of 12 E. coli (i.e. 11 ST131 strains plus EC958).
Each strain took on average 120 min to run completely through MicroPIPE v0.8 using 16 threads (excluding the basecalling and demultiplexing steps) (Fig. 4). Of these 11 isolates, all had complete circularised chromosomes of the expected size. They also carried an array of plasmids, which were circularised in all cases except for a single isolate, HVM2044 (Supplementary Table 8). Reanalysis of this sample found that complete circularised plasmids can be achieved by adjusting the read filtering step. We also identified additional small plasmids in six out of the 11 genomes ranging between 1.5-5 kb in size. Importantly, we found that these plasmids are not recovered when using filtering parameters above 1 kb.
In order to confirm the accuracy of the assemblies generated with MicroPIPE, we recreated the ST131 phylogeny from [17] using (i) the complete MicroPIPE assembly, (ii) long read only polished assembly, (iii) short read only polished assembly and (iv) unpolished Nanopore assembly, and assessed the position of each strain within the tree. We found that all MicroPIPE v0.8 assemblies and ONT assemblies polished with Illumina clustered closest to their Illumina counterpart within the phylogenetic tree (Fig. 5A). However, the long read polished and unpolished ONT assemblies in most cases did not cluster as expected. They also displayed longer branches indicative of the remaining errors within the assembly. Interestingly, the long read polished and unpolished assemblies for all ST131 isolates belonging to our previously defined fluoroquinolone-resistance clade C [17,18] clustered together independent of other clade C strains, possibly representing systematic errors from the ONT data. Further interrogation of the branch leading to this cluster identified 401 shared SNPs. Of these SNPs, 97% were transitions, particularly A → G (n = 187) and T → C (n = 203) (Supplementary Table 9, Fig. 5C). Further analysis of these sites determined that 393 (98%) were associated with a Dcm methylase motif CC(A/T)GG (Supplementary Figure 4).

MicroPIPE validation using publicly available ONT sequenced bacteria
Lastly, we tested MicroPIPE using 12 publicly available genomes from both gram-positive and gram-negative bacteria with available raw nanopore data (fast5) and validated our results using their corresponding complete genomes ( Table 2, Supplementary Dataset 1). These genomes also represent a wide range of GC content to further validate the use of MicroPIPE on diverse bacterial species ( Table 2). As most of these isolates were  [17]. Branches are coloured based on the ST131 clade they belong to, as per [17] (Red = clade A, Orange = clade B, Green = clade C). dark blue: Complete polished assemblies from the MicroPIPE pipeline next to their Illumina assembly counterpart in the tree, light blue: assemblies with incomplete polishing (i.e. Illumina only, Nanopore only or no polishing) clustered with their Illumina counterpart, red: discrepant clustering of Nanopore assemblies. B Phylogenetic tree created using assemblies generated with MicroPIPE v0.9 (Guppy v3.6.1). Annotations are same as in A. C Position of alternative (alt) and reference (ref) alleles compared to the EC958 reference standard chromosome present on branch leading to discrepant ONT assemblies as indicated by the star in (A) Flye was run using the --asm-coverage 100 parameter in order to reduce the computational run time. Only circular contigs are reported (as identified by Flye). For further details on all public data, see Supplementary dataset 1.
sequenced using entire flow cells, the coverage was reduced to 100x during the initial Flye assembly stage to minimise processing time. Using MicroPIPE v0.9, we were able to completely assemble the chromosome and plasmids of all 12 isolates. We were also able to recover two additional plasmids from the Salmonella enterica str. SA20055162 that were not reported in the original assembly (Table 2).
To determine the accuracy of MicroPIPE, we compared our final assemblies with the submitted complete genome for each isolate. Overall, the fewest differences were detected between our MicroPIPE assembly and the complete genome of Staphylococcus aureus strain 110900 (6 SNPs, 5 indels) and strain 128254 (4 SNPs, 4 indels), constructed using ONT data basecalled with a recent version of Guppy (v3.2.6) ( Table 2). These were followed by Streptococcus pyogenes strain SP1336, constructed using PacBio long-read sequencing (8 SNPs, 96 indels). All other comparisons yielded 25-510 SNPs, and 14-758 indels, with the greatest number of differences observed in the Salmonella enterica serovar Napoli strain LC0541/17 (Table 2).
With the exception of S. pyogenes SP1336, all other complete genomes were constructed using previously assembled nanopore data (Supplementary Dataset 1). Specifically, all assemblies with a high number of SNPs and indels were generated using reads basecalled with Albacore or a Guppy version prior to v3. As such, we hypothesise that our MicroPIPE assemblies likely represent corrections to the existing complete genomes, as a result of updated basecalling and assembly methods. Further investigation found that one sample, Salmonella enterica Bareilly str. CFSAN000189, also had a corresponding complete genome constructed using PacBio data. Comparison of our MicroPIPE assembly to this complete genome detected 0 SNPs and 15 indels, while there were 32 SNPs and 34 indels compared to the ONT complete genome.

Future development of MicroPIPE
Rapid and continual enhancement of nanopore technology has been integral to ONTs growth and popularity in recent years. It does, however, lead to several problems, including rapid depreciation, abandonment or replacement of software. As such, we have developed a modularised ONT/Illumina pipeline that can be readily adapted and re-evaluated alongside the changing nanopore landscape.
An example of MicroPIPE's adaptability came from the release of Guppy v3.6.1 during preparation of this manuscript. As this version reported a substantial increase in basecalling accuracy, we incorporated it into MicroPIPE (v0.9) and re-evaluated our pipeline's performance for all ST131 genomes.
Using MicroPIPE v0.9 on our EC958 data, we were able to resolve 21 out of the 23 SNPs and 32 out of 45 indels compared to the MicroPIPE v0.8 assembly (Guppy v3. 4 Figure 3, Supplementary Table 7), relative to the published reference genome. Two SNPs and 12 indels were additionally detected using v3.6.1, which were not detected using v3.4.3. Both SNPs were detected in IS elements, while 11 out of the 12 indels were detected in rRNA genes. Overall, the v3.6.1 assembly performed better than the v3.4.3 assembly with only 29 differences compared to the complete reference EC958 genome (4 SNPs and 25 indels). Interestingly, using methylationaware basecalling with Guppy v3.6.1 was not found to improve overall assembly accuracy (Supplementary Table 7).

.3) (Supplementary Dataset 1, Supplementary
We also found that by re-basecalling all other remaining ST131 isolates with MicroPIPE v0.9 and recreating assemblies as before, we were able to achieve a remarkable increase in the accuracy of Nanopore-only assemblies, such that all assemblies clustered in their expected position within the tree (Fig. 5B).

Discussion
ONT long-read sequencing has quickly become one of the most prominent sequencing platforms for microbial researchers globally. However, despite the large number of bacterial genomes being completed using ONT, few end-to-end genome assembly pipelines exist. Here we created an easy, automated and reproducible genome assembly pipeline for the construction of complete, high-quality genomes using ONT in combination with Illumina sequencing. We also provide a robust, publicly available set of 12 ST131 genomes that can be used to validate future pipeline development or software advancements.
One of the main benefits of nanopore sequencing is its cost effectiveness, particularly when multiplexing several samples onto a single flow cell. Methods have been developed to improve yield and length during DNA extraction in order to achieve longer sequencing reads [15,58]. However, here we show with our method that highquality complete genomes can be achieved using a standard, commercially available DNA extraction kit coupled with up to 12 multiplexed samples. This builds on other advances such as those described by Wick et al. [59], and establishes an updated packaged pipeline that provides an efficient, cost effective and reproducible approach to bacterial genome construction.
In our comparative analysis of different aspects of bacterial genome assembly, we chose not to explore the effect of basecallers outside of ONT's Guppy_basecaller. As stated previously, many other existing basecallers have been released in a research-capacity (Bonito, Flappie and Runnie), and are therefore unsuitable when considering the stability and maintenance of MicroPIPE. This is of particular importance to users from clinical settings, where consistency and versioning are essential when it comes to accrediting workflows. We were also confident that Guppy was among the highest performing basecallers, as this comparison has been completed previously [14]. Lastly, Guppy is ONTs recommended basecaller, coupled with several of Oxford Nanopore's devices, such as the MinIT, PromethION and GridION. For these reasons, we felt that it was in the best interest of the community at this time to provide a pipeline that used Guppy as the basecaller. We made a point of testing both the "high accuracy" mode on a GPU server compared to the "fast" mode on a CPU server, as not all Nanopore users are guaranteed to have access to GPU facilities. We found that, while the GPU server was significantly faster, basecalling reads using the "fast" mode with CPUs could also achieve high-quality genomes with MicroPIPE.
During preparation of this manuscript, Guppy v3.6.1 was released with a raw read accuracy of > 97% using R9.4.1 flow cells (https://nanoporetech.com/accuracy). Community feedback regarding this upgraded version supported increased overall accuracy, which prompted us to incorporate this version into our analysis (Micro-Pipe v0.9). We also found that Guppy v3.6.1 increased the overall accuracy of our assemblies, particularly where it came to unresolved indels using v3.4.3, which were suspected to be the result of technical artefacts around methylated sites [58]. Using Guppy v3.6.1 made Nanopore-only assemblies more feasible, particularly in cases where sufficient genetic context can be provided (e.g. identification of outbreak vs. non-outbreak strains). However, we found that overall both v3.4.3 and v3.6.1 still required polishing with short-read Illumina for maximum accuracy.
We observed some redundancy in the choice of tools for demultiplexing. Binning of reads with both Guppy_ barcoder and qcat performed almost equivalently (in terms of number of reads binned), with minimal differences in the overall assembly (Supplementary Table 10). Recent improvements to Guppy_barcoder, which were released by ONT after compilation of this manuscript, suggest that Guppy_barcoder is likely to be the default standard moving forward.
MicroPIPE implements a modest filtering measure to remove shorter, low quality reads from the dataset. In this study, we found that the length of sequencing reads used for assembly was an important parameter. Circularised chromosomes and large circularised plasmids were only obtained when the dataset contained a substantial proportion of reads longer than 5 kb (read length N50 for the 12 E. coli strains here ranged between 11 kb and 15 kb). However, excessive removal of short reads negatively impacted the recovery of small plasmids, where removing reads <= 2 kb resulted in the loss of several small plasmids in a number of strains (data not shown). This was also the case when using certain additional filtering parameters with Filtlong, where "--min-length 1000 --keep_percent 90" resulted in the loss of the~1.8 kb small plasmid identified in EC958, which was retained when filtering with Japsa at "--minlength 1000" (Supplementary Tables 11 and 12). As such, we have implemented a conservative 1 kb filtering cut-off (using Japsa) as default in MicroPIPE to retain reads and small plasmids.
We also found when testing MicroPIPE on publicly available data that harsher filtering is sometimes desirable, especially in cases where a single bacterial genome has been sequenced using an entire flow cell (such that we used the Flye parameter "--asm-coverage 100" to reduce coverage for initial disjointig assembly). As such, pre-processing of large quantities of highly ununiform data may be the most desirable method. This is possible to implement within MicroPIPE, as users may choose to randomly subsample with Rasusa, or implement subsampling and filtering with Filtlong over Japsa (the current default tool). Ultimately, understanding the quality and read lengths of the input data is a valuable step in generating the best possible assembly. We also provided the user read quality assessment using PycoQC to assist in parameter selection.
Several other comparative analyses have been published exploring the overall utility of different assemblers, in particular Wick et al. [60], who provide a comprehensive assembly comparison using both simulated and real read datasets. While we did not test NECAT and Miniasm, we found that our results generally matched those reported by Wick et al., particularly when it came to the overall strong performance of Flye. The most recent version of Flye (v2.8) also removes the need to nominate a genome size, making it a more robust option. However, we found that this version did not outperform the release used in this paper (v2.5) on our dataset, as it was unable to circularise all plasmids. As such, we have retained Flye v2.5 in MicroPIPE.
Long and short read polishing is a staple of highquality genome assembly, as the combination of both ensures the correct contextual placement of variants as well as highly accurate basecalls. However, while longreads have enabled completion of assemblies by spanning repetitive regions, polishing of these regions with short reads remains a problem. Here we found that the majority of remaining differences between our EC958 ONT assembly and the reference assembly (constructed with PacBio single molecule real time [SMRT] sequencing) resided in repetitive regions. Ideally, polishing with long reads only would be a viable method to reduce these errors as they would have sufficient coverage to ensure correct placement of the repeat variant. However, as we show here, long read-only polishing was insufficient (likely due to per-read accuracy), and short read polishing was necessary for removal of the majority of errors. Currently, final polishing and assembly prior to completion will still necessitate manual frameshift inspection. While impractical and costly, a combination of both PacBio and ONT assembly could correct inherent biases in both technologies, using a consensus tool such as Trycycler (https://github.com/rrwick/Trycycler).
Long-read correction could also provide another means of error reduction [61,62]. Upon subsequent analysis, we did find that the final assembly produced when using MicroPIPE v0.9 with Canu error-corrected reads was marginally better than using raw reads (Supplementary Table 13). However, this was at a cost of a 2.5 times slower runtime. We further tested raw vs. corrected reads with the latest Guppy version (v4.4.1) and did not see improvement of the final genome with corrected reads. Additionally, Flye (as the default assembler in MicroPIPE) recommends the use of raw reads over corrected reads (https://github.com/fenderglass/Flye/blob/ flye/docs/USAGE.md#error-corrected-reads-input). As such, we have not implemented read-correction in our pipeline, but it could be implemented by users separately (or added in their own version of the pipeline) if desired.
We validated MicroPIPE using a set of 12 wellcharacterised E. coli isolates described previously from a global collection [17,18]. We did this for several reasons, including (i) the availability of an existing highquality reference genome and associated phylogenetic data (ii) the robustness of E. coli as a representative species and workhorse organism, and (iii) our extensive knowledge of the E. coli genome and ST131 lineage. We hope that by providing this dataset to the wider community, it can serve as a resource for future validation and testing of not only MicroPIPE, but other microbial assembly pipelines and tools.
In addition to in-house ONT sequencing data, we also tested MicroPIPE on a variety of publicly available bacterial genomes to evaluate its assembly capabilities on other species. Without any manual intervention, Micro-PIPE was able to assemble all 12 genomes, while also recovering additional plasmids that were likely missed in the original assembly. When evaluating correctness of the genomes, we found a number of remaining SNPs and indels when compared to the complete genomes provided. Investigation into construction of the reference genomes found that 11 of the 12 genomes provided were constructed previously using ONT sequencing data, leading us to believe that differences in our assemblies compared to the "reference" genomes may actually be corrections. Indeed, the genomes with the closest match between reference and MicroPIPE assembly were the genomes constructed using PacBio or ONT data with contemporary basecalling. As such, we believe that genomes completed historically using ONT reads should be used cautiously, and raw ONT data provided where possible to allow for reconstruction and improvement of the assembly as the technology improves.

Conclusions
Overall, we present an end-to-end pipeline for highquality bacterial genome construction designed to be easily implemented in the research lab setting. We believe this will be a useful resource for users to easily and reproducibly construct complete bacterial genomes from Nanopore sequencing data.
Additional file 1: Table S1. Read length and quality metrics per isolate. Table S2. assembly results for EC958.  Table S10.
Demultiplexing comparison between qcat and Guppy Table S11. Assembly comparison using different filtering parameters(qcat demultiplexing). Table S12. Assembly comparison using different filtering parameters (guppy demultiplexing). Table S13. Assembly comparison using ONT raw or corrected reads Figure S1. Demultiplexing metrics. Figure S2. Long-read metrics using different demultiplexing tools and read filtering parameters (using EC958 ONT data). Figure S3. comparison of SNPs/indels in ONT assemblies vs. complete EC958 chromosome.