MicroPIPE: An end-to-end solution for high-quality complete bacterial genome construction

Oxford Nanopore Technology (ONT) long-read sequencing has become a popular platform for microbial researchers; however, easy and automated construction of high-quality bacterial genomes remains challenging. Here we present MicroPIPE: a reproducible end-to-end bacterial genome assembly pipeline for ONT and Illumina sequencing. To construct MicroPIPE, we evaluated the performance of several tools for genome reconstruction and assessed overall genome accuracy using ONT both natively and with Illumina. Further validation of MicroPIPE was carried out using 11 sequence type (ST)131 Escherichia coli and eight publicly available Gram-negative and Gram-positive bacterial isolates. MicroPIPE uses Singularity containers and the workflow manager Nextflow and is available at https://github.com/BeatsonLab-MicrobialGenomics/micropipe.

For demultiplexing we tested three tools: Deepbinner, Guppy_barcoder and qcat. While Guppy 166 and qcat rely on basecalled reads, Deepbinner uses the raw fast5 reads. As such, we compared 167 the total numbers of binned reads after both basecalling and binning for each tool. Overall qcat 168 was the fastest demultiplexer, and was able to bin 89% of reads, compared to 84% for 169 Guppy_barcoder and 75% for Deepbinner (Supplementary Figure 1). We prioritised read 170 retention to maximise coverage of each genome. As such, qcat was chosen as the default demultiplexer for MicroPIPE. Following the recent depreciation of qcat, we have also provided 172 Guppy_barcoder as an optional demultiplexer. Here we trialled two filtering tools: Filtlong and Japsa. Filtlong has the advantage of being 177 versatile enough to filter based on a number of requirements, such as read length, quality, 178 percentage of reads to keep and the option of using an external reference. Japsa primarily filters 179 based on read length and quality. Read metrics after filtering using each tool are given in 180 Supplementary figure 2. Overall, we found that filtering with Japsa retained more reads, but 181 with a reduced N50 read length and median read quality compared to Filtlong. Both tools took 182 an equivalent amount of time to run. For all downstream analysis we filtered reads using Japsa 183 with a minimum average quality cut-off of Q10 and 1 kb minimum read length, although 184 Overall, we found that all assemblers constructed the chromosome and larger (~135 kb) 197 plasmid (Figure 2, Supplementary Table 2). Raven, Redbean and Shasta did not assemble 198 the smaller ~4 kb plasmid. While Canu was able to assemble both plasmids, closer inspection 199 found them to be much larger than expected (1.4x and 2x larger for the large and small plasmid, 200 respectively) due to overlapping ends that required additional trimming. Interestingly, both 201 Flye and Canu assembled a third, previously unidentified, small plasmid of ~1.8 kb in size. 202 This small plasmid was only identified when the Flye "-plasmids" mode was selected (to 203 rescue short unassembled plasmids) and when certain or no filtering parameters were applied 204 to the reads prior to assembly (Supplementary Table 3). Comparison of this small plasmid to 205 the Illumina data for the EC958 reference genome standard confirmed its presence and was 206 likely missed in the original assembly. 207 208 For most de novo assemblies, a number of small (<4.5 kb) misassemblies were detected, mainly 209 on the chromosome (Figure 2). This included a small inversion, which on closer inspection 210 was found to be an invertible phage tail protein that has been characterised previously [19]. 211 This inversion was found in the Flye, Unicycler, Raven and Redbean assemblies and was not 212 counted as a misassembly due to its biological relevance. In terms of speed, Shasta, Redbean and Raven were the fastest assemblers, completing in less 220 than 30 minutes. Of the remainder, Flye was four times faster than Canu and two times faster 221 than Unicycler. The majority of contigs from all assemblers were reported as circularised upon 222 assembly completion, with the exception of the additional contigs in Canu and Unicycler. 223 Redbean did not generate circularisation information, although the chromosome and plasmid 224 contigs could be circularised manually or using 3 rd party software following assembly. Overall, 225 we found that Flye generated the best de novo assembly from long read data without the need 226 for manual intervention.  Polishing of assemblies generated using long reads is currently regarded as a necessity for ONT 242 data due to high per-read errors that can persist through to the de novo assemblies [13]. Here 243 we tested the polishing capabilities of three different tools (Racon/Medaka, NextPolish and 244 Nanopolish) using nanopore long reads against the de novo assembly generated using Flye. We 245 additionally tested polishing with Illumina short reads (NextPolish and Pilon), which have a 246 higher basecall accuracy. Polishing was tested both independently (i.e., long read and short 247 read separately) as well as sequentially (long read followed by short read polishing) to 248 determine the best polishing protocol. 249 250 Overall, we found that polishing with Racon and Medaka (using long reads) followed by 251 NextPolish (using short reads) achieved the most accurate assemblies (Figure 3, 252 Supplementary Table 4). Polishing using only long or short reads did not produce comparable 253 levels of accuracy, therefore we emphasize the requirement of short read sequencing in parallel 254 with Nanopore for high-quality complete genome assembly (as is already commonly done). 255 To confirm our choice of Flye as the best assembler, we polished assemblies generated from 257 the other five long-read assemblers, described above, using this strategy (Supplementary 258 Table 5). The polished Flye assembly remained the most accurate, closely followed by the 259 polished Raven assembly. In addition to long-read assembly (followed by short-read polishing), hybrid assemblers 270 capable of using both long and short reads simultaneously have also been developed, and 271 include Unicycler, MaSuRCA and SPAdes. Comparison of these pipelines to our genome 272 completed with Flye, Racon, Medaka and NextPolish found that they did not outperform our 273 current method. Unicycler was the only hybrid assembler able to completely resolve the 274 chromosome and both plasmids (SPAdes failed to circularise the chromosome while 275 MaSuRCA was unable to assemble the 4 kb plasmid) (Supplementary Table 6). Additional 276 long and short read polishing greatly improved the accuracy of the Unicycler and SPAdes 277 hybrid assemblies but not MaSuRCA (Supplementary Table 5). We compared the quality of 278 the genomes generated by either the best long-read only assembly (Flye) or the best hybrid 279 assembler based on accuracy and structure (Unicycler) and polished with the same strategy. 280 The polished assemblies contained a similar number of indels compared to the EC958 reference 281 genome standard, however the Flye assembly contained around two-fold fewer substitution 282 Table 5). Furthermore, Flye was nearly eight times faster than 283 Table 6

298
Steps with dotted outline are optional. Time for running each step is provided based on running 12 multiplexed 299 E. coli samples with MicroPIPE v0.8. Basecalling (Guppy) and long-read polishing (Racon and Medaka) were 300 run on a GPU node. The rest of the pipeline was run using CPU resources.

302
Evaluation of remaining differences with EC958 reference genome standard: 303 304 The final genome for EC958 produced by MicroPIPE v0.8 was compared to the previously 305 published EC958 reference genome standard (GenBank: HG941718.1) to assess any remaining 306 differences. We observed a single 3.4 kb inversion corresponding to a phage tail protein 307 switching event previously characterised in EC958 [19]. Overall, there were no other structural 308 rearrangements. MicroPIPE assembled an additional ~1.8 kb plasmid, with 100% nucleotide 309 identity to previously reported E. coli plasmids (GenBank records CP048320. 1,KJ484633.1,310 [22]). This plasmid appears to have been lost during size selection when constructing the 311 original genomic DNA library for PacBio RSII sequencing of EC958 as it could be identified 312 from de novo assembly of the corresponding Illumina reads. 313

314
Comparison of the two assemblies identified 68 remaining differences (66 on the chromosome, 315 2 on pEC958) (for full list, please see Supplementary Dataset 1). The two differences in the 316 plasmid sequence correspond to known errors in the EC958 reference genome standard 317 (PacBio assembly constructed without Illumina polishing). The majority of the chromosomal 318 differences were indels (n=45, 67%) ranging from 1-6 bp in size. These indels were mainly 319 found in rRNA (n=31), tRNA (n=4), insertion sequences (n=4), or phage-related genes (n=2). 320 The remaining 23 differences were SNPs, which were similarly found mainly in rRNA (n=13) 321 and insertion sequences (n=8). These remaining differences likely represent an inability of 322 current short-read polishing to adequately determine true alleles in repetitive regions of the 323 genome. Using methylation-aware basecalling was found to significantly improve these errors, 324 with only 3 SNPs and 31 indels (Supplementary Table 7). were additionally detected using v3.6.1, which were not detected using v3.4.3. Both SNPs were 331 detected in IS elements, while 11 out of the 12 indels were detected in rRNA genes. Overall, 332 the v3.6.1 assembly performed better than the v3.4.3 assembly with only 29 differences 333 compared to the complete EC958 genome (4 SNPs and 25 indels). Interestingly, using 334 methylation-aware basecalling with Guppy v3.6.1 was not found to improve overall assembly 335 accuracy (Supplementary Table 7). Each strain took on average 86 minutes to run completely through MicroPIPE v0.8 using 16 343 threads (excluding the basecalling and demultiplexing steps) (Figure 4). Of these 11 isolates, 344 all had complete circularised chromosomes of the expected size. They also carried an array of 345 plasmids, which were circularised in all cases except for a single isolate, HVM2044 346 Table 8). Re-analysis of this sample found that complete circularised 347 plasmids can be achieved by adjusting the read filtering step. We also identified additional 348 small plasmids in seven out of the 12 genomes ranging between 1.5-5 kb in size. Importantly, 349

(Supplementary
we found that these plasmids are not recovered when using filtering parameters above 1 kb.  Table  364 9, Figure 5B). Further analysis of these sites determined that 393 (98%) were associated with 365 a Dcm methylase motif CC(A/T)GG (Supplementary Figure 4). 366

367
We also evaluated the MicroPIPE v0.9 assemblies using Guppy v3.6.1, which was released 368 during preparation of this manuscript. By re-basecalling and recreating all assemblies as before, 369 we found a remarkable increase in the accuracy of Nanopore-only assemblies, such that all 370 assemblies clustered in their expected position within the tree ( Figure 5A).   Table 10). As most of these isolates were sequenced using entire flow cells, 386 the coverage was reduced to 100x during the initial Flye assembly stage to minimise processing 387

time. 388
Using MicroPIPE v0.9, we were able to completely assemble the chromosome and plasmids 390 of all eight isolates. We were also able to recover two additional plasmids from the Salmonella 391 enterica str. SA20055162 that were not reported in the original assembly (Supplementary 392 ONT long-read sequencing has quickly become one of the most prominent sequencing 413 platforms for microbial researchers globally. However, despite the large number of bacterial 414 genomes being completed using ONT, few end-to-end genome assembly pipelines exist. Here 415 we created an easy, automated and reproducible genome assembly pipeline for the construction 416 of complete, high-quality genomes using ONT in combination with Illumina sequencing. We 417 also provide a robust, publicly available set of 12 ST131 genomes that can be used to validate 418 future pipeline development or software advancements. 419 420 One of the main benefits of nanopore sequencing is its cost effectiveness, particularly when 421 multiplexing several samples onto a single flow cell. Methods have been developed to improve 422 yield and length during DNA extraction in order to achieve longer sequencing reads [14,23].
However, here we show with our method that high-quality complete genomes can be achieved 424 using a standard, commercially available DNA extraction kit coupled with up to 12 multiplexed 425 samples. This build on other advances such as those described by Wick et al. [24], and 426 establishes an updated packaged pipeline that provides an efficient, cost effective and 427 reproducible approach to bacterial genome construction. 428 429 In our comparative analysis of different aspects of bacterial genome assembly, we chose not to 430 explore the effect of basecallers outside of ONT Guppy basecaller. This comparison has 431 already been completed previously [13], where it was found that Guppy outperformed other 432 existing basecallers. Guppy is also the default basecaller coupled with several of Oxford 433 Nanopore's devices, such as the MinIT, PromethION and GridION. For these reasons, we felt 434 that it was in the best interest of the community to provide a pipeline that used Guppy as the 435 basecaller. We also made a point of testing both the "high accuracy" mode on a GPU server 436 compared to the "fast" mode on a CPU server, as not all Nanopore users would have access to 437 GPU facilities. We found that, while the GPU server was significantly faster, basecalling reads 438 using the "fast" mode with CPUs can also achieve high-quality genomes with MicroPIPE. 439 440 During preparation of this manuscript, Guppy v3.6.1 was released with a raw read accuracy of 441 >97% using R9.4.1 flow cells (https://nanoporetech.com/accuracy). Community feedback 442 regarding this upgraded version supported increased overall accuracy, which prompted us to 443 incorporate this version into our analysis (MicroPipe v0.9). We also found that Guppy v3.6.1 444 increased the overall accuracy of our assemblies, particularly where it came to unresolved 445 indels using v3.4.3, which were suspected to be the result of technical artefacts around 446 methylated sites [23]. Using Guppy v3.6.1 made Nanopore-only assemblies more feasible, 447 particularly in cases where sufficient genetic context can be provided (e.g. identification of 448 outbreak vs. non-outbreak strains). However, we found that overall both v3.4.3 and v3.6.1 still 449 required polishing with short-read Illumina for maximum accuracy. 450

451
We observed some redundancy in the choice of tools for demultiplexing. Binning of reads with 452 both Guppy_barcoder and qcat performed almost equivalently (in terms of number of reads 453 binned), with minimal differences in the overall assembly (Supplementary Table 11). Recent 454 improvements to Guppy_barcoder, which were released by ONT after compilation of this 455 manuscript, suggest that Guppy_barcoder is likely to be the default standard moving forward. 456 MicroPIPE implements a modest filtering measure to remove shorter, low quality reads from 458 the dataset. In this study, we found that filtering reads below 5 kb had little effect on the final 459 chromosome and larger plasmids, while filtering above 1 kb resulted in the loss of several small 460 plasmids in a number of strains (Supplementary Tables 12 and 13). Filtering with Filtlong at 461 "--min-length 1000 --keep_percent 90" resulted in the loss of the additional ~1.8 kb small 462 plasmid identified in EC958, which was retained when filtering with Japsa at "--min-length 463 1000". As such, we have implemented a 1 kb filtering cut-off (using Japsa) as default in 464 MicroPIPE to retain reads and small plasmids. However, we also found when testing 465 MicroPIPE on publicly available data that harsher filtering is sometimes desirable, especially 466 in cases where a single bacterial genome has been sequenced using an entire flow cell (such 467 that we used the Flye parameter "--asm-coverage 100" to reduce coverage for initial disjointig 468 assembly). As such, pre-processing of large quantities or highly ununiform data using Filtlong 469 may be the most desirable method. Ultimately, understanding the quality and read lengths of 470 the input data is a valuable step in generating the best possible assembly. We also provided the 471 user read quality assessment using PycoQC to assist in parameter selection. 472 473 Several other comparative analyses have been published exploring the overall utility of 474 different assemblers, in particular Wick et al. [25], who provide a comprehensive assembly 475 comparison using both simulated and real read datasets. While we did not test NECAT and 476 Miniasm, we found that our results generally matched those reported by Wick et al.,477 particularly when it came to the overall strong performance of Flye. The most recent version 478 of Flye (v2.8) also removes the need to nominate a genome size, making it a more robust option. 479 However, we found that this version did not outperform the release used in this paper (v2.5) 480 on our dataset, as it was unable to circularise all plasmids. As such, we have retained Flye v2.5 481 in MicroPIPE. 482 483 Long and short read polishing is a staple of high-quality genome assembly, as the combination 484 of both ensures the correct contextual placement of variants as well as highly accurate 485 basecalls. However, while long-reads have enabled completion of assemblies by spanning 486 repetitive regions, polishing of these regions with short reads remains a problem. Here we 487 found that the majority of remaining differences between our EC958 ONT assembly and the 488 reference assembly (constructed with PacBio single molecule real time [SMRT] sequencing) 489 resided in repetitive regions. Ideally, polishing with long reads only would be a viable method 490 to reduce these errors as they would have sufficient coverage to ensure correct placement of the repeat variant. However, as we show here, long read-only polishing was insufficient (likely 492 due to per-read accuracy), and short read polishing was necessary for removal of the majority 493 of errors. Currently, final polishing and assembly prior to completion will still necessitate 494 manual frameshift inspection. While impractical and costly, a combination of both PacBio and 495 ONT assembly could correct inherent biases in both technologies, using a consensus tool such 496 as Trycycler (https://github.com/rrwick/Trycycler). Long-read correction could also provide 497 another means of error reduction, however, was not assessed in this paper [26,27]. 498

499
We validated MicroPIPE using a set of 12 well-characterised E. coli isolates described 500 previously from a global collection [16,17]. We did this for several reasons, including (i) the 501 availability of an existing high-quality reference genome and associated phylogenetic data (ii) 502 the robustness of E. coli as a representative species and workhorse organism, and (iii) our 503 extensive knowledge of the E. coli genome and ST131 lineage. We hope that by providing this 504 dataset to the wider community, it can serve as a resource for future validation and testing of 505 not only MicroPIPE, but other microbial assembly pipelines and tools. 506

507
In addition to in-house ONT sequencing data, we also tested MicroPIPE on a variety of publicly 508 available bacterial genomes to evaluate its assembly capabilities on other species. Without any 509 manual intervention, MicroPIPE was able to assemble all eight genomes, while also recovering 510 additional plasmids that were likely missed in the original assembly. When evaluating 511 correctness of the genomes, we found a number of remaining SNPs and indels when compared 512 to the complete genomes provided. Investigation into construction of the reference genomes 513 found that seven of the eight genomes provided were constructed previously using ONT 514 sequencing data, leading us to believe that differences in our assemblies compared to the 515 "reference" genomes may actually be corrections. Indeed, the genome with the closest match 516 between reference and MicroPIPE assembly were the genomes constructed using PacBio. As 517 such, we believe that genomes completed historically using ONT reads should be used 518 cautiously, and raw ONT data provided where possible to allow for reconstruction and 519 improvement of the assembly as the technology improves. Overall, we present an end-to-end pipeline for high-quality bacterial genome construction 524 designed to be easily implemented in the research lab setting. We believe this will be a useful 525 resource for users to easily and reproducibly construct complete bacterial genomes from 526 Nanopore sequencing data. 527 528 Methods: 529 530 Public data: 531 The EC958 complete genome was downloaded from NCBI (GenBank: HG941718.1, 532 HG941719.1, HG941720.1) [19]. Illumina reads for 12 ST131 genomes and draft assemblies 533 for 95 ST131 were accessed from [16]. Eight publicly available complete genomes were also 534 selected to test MicroPIPE, under the following criteria: (i) the raw nanopore sequencing files 535 (fast5) were available, (ii) a complete genome was made available for the same strain and (iii) 536 Illumina sequencing data were available for the same strain. These eight genomes represented 537 5 species from both gram-positive and gram-negative bacteria with chromosome sizes between 538 following manufacturer's protocol with modifications. Briefly, the cell pellet was lysed 545 following the protocol for Gram negative bacteria. RNA was removed by 1h incubation at 37°C 546 with RNase and the lysate was then mix with Protein Precipitation Solution by vortexing for 547 5s at max speed using Vortex-Genie 2 with horizontal tube adapter (Scientific Industries). The 548 DNA was precipitated using isopropanol and washed with 70% ethanol. The DNA pellet was 549 air-dried and then rehydrated in 100 µl EB buffer (QIAgen) by incubation at 65°C for 1 hour. 550 The DNA was quantified using a Qubit fluorometer (ThermoFisher Scientific) and the DNA 551 fragment size was estimated using agarose gel electrophoresis (0.5% agarose in TAE, 90V, 552 1h30m). 553 554 Nanopore sequencing: 555 DNA from 12 ST131 E. coli were multiplexed onto a single FLO-MIN106 flow cell using the 556 rapid barcode sequencing kit (SQK-RBK004) as per manufacturer's recommendation with the 557 following adjustments: the barcoded DNA was pooled without a concentration step using 558