AMRomics: a scalable workflow to analyze large microbial genome collections

Le, Duc Quang; Nguyen, Tam Thi; Nguyen, Canh Hao; Ho, Tho Huu; Vo, Nam S.; Nguyen, Trang; Nguyen, Hoang Anh; Vinh, Le Sy; Dang, Thanh Hai; Cao, Minh Duc; Nguyen, Son Hoang

doi:10.1186/s12864-024-10620-8

Software
Open access
Published: 22 July 2024



BMC Genomics volume 25, Article number: 709 (2024)


Abstract

Whole genome analysis for microbial genomics is critical to studying and monitoring antimicrobial-resistant strains. The exponential growth of microbial sequencing data necessitates a fast and scalable computational pipeline to generate the desired outputs in a timely and cost-effective manner. Recent methods have been implemented to integrate individual genomes into large collections of specific bacterial populations and are widely employed for systematic genomic surveillance. However, they do not scale well as the population expands, and turnaround time remains the main issue for this type of analysis. Here, we introduce AMRomics, an optimized microbial genomics pipeline that can work efficiently with big datasets. We use different bacterial data collections to compare AMRomics against competing tools and show that our pipeline generates similar results of interest but with better performance. The software is open source and is publicly available at https://github.com/amromics/amromics under an MIT license.


Background

Whole genome sequencing (WGS) of bacterial isolates using next-generation sequencing technology has progressively become the predominant method in clinical microbiology, public health surveillance, and disease control [1,2,3]. The ability to study the complete genetic information of a large number of bacterial genomes provides the potential to generate insights into pathogenic genotype/phenotype relationships [4,5,6], pathogenic virulence transmissibility [7, 8] and antibiotic resistance tracking [9, 10]. The combination of genomic information and epidemiological data has been frequently used in disease control processes, such as rapid outbreak clustering investigation during the recent SARS-CoV-2 pandemic [11, 12] and inference and prediction of evolutionary perspectives with regard to pathogenic diversification [13, 14]. The richness of current high-throughput genomic data has created a solid foundation for systematic studies of large cohorts of related genomes using genome-wide methods such as cgMLST, phylogenetic, or pangenomic analyses. WGS approaches can generate insightful data to discern knowledge about existing pathogens and assist in unraveling the characteristics of unknown ones [15, 16], which is critical in understanding and thus controlling disease outbreaks.

To meet the demand for analysis tools, a number of computational pipelines have been developed to facilitate the analysis of microbial WGS data and to generate practical results of interest. Several have become well-established and widely used in the field, notably Nullarbor [17], Bactopia [18], and ASA³P [19]. The first of these, Nullarbor, has long been part of the standard process in public health microbial genomics, while the latter two are more recent and offer comprehensive, wide-spectrum functionality. However, these pipelines usually require high-end computing infrastructure and take prohibitively long to run when collections grow beyond thousands of genomes. Furthermore, while it is typical for laboratories to collect and sequence new samples over time, none of the existing pipelines can efficiently manage growing collections to which new samples are constantly added. In most cases, many parts of these pipelines need to be rerun every time new samples are added to the collection, incurring high additional computation costs.

Here we introduce AMRomics, a lightweight open-source software package for analyzing and managing large collections of bacterial genomes. The tool generates essential genomic results for individual samples, together with a population analysis that is faster and more memory-efficient than existing methods. Thanks to its optimized design, performance is significantly improved, making analyses of big collections of bacteria feasible on regular desktop computers with reasonable turnaround time. The AMRomics source code is available at https://github.com/amromics/amromics.git.

Workflow and implementation

AMRomics is a software package that provides a comprehensive suite of genomic analyses of microbial collections in a simple and easy-to-use manner. It is designed to be performant and scalable to large genome collections with minimal hardware requirements, without compromising the analysis results. To that end, we select tools considered best practice in microbial genomics and stitch them together via a well-structured workflow, as described below. For certain tasks in the workflow, AMRomics provides options for users to select among several alternative tools. The workflow is written in Python and is designed as a modular and expandable application with standardized data formats flowing between the tools in the workflow.

The software flexibly takes input data in various formats, including sequencing reads (from Illumina, PacBio and Nanopore technologies), genome assemblies, and genome annotations. It then performs assembly, genome annotation, MLST, virulome and resistome prediction, pangenome clustering, phylogenetic tree construction for each gene and for the core genes, and pan-SNPs analysis, all from a simple command line. AMRomics achieves this by building a pipeline consisting of the current best practice tools in bacterial genomics. It is also designed to be fast, efficient, and scalable to collections of thousands of isolates on a computer with modest hardware. Crucially, AMRomics supports the progressive analysis of a growing collection, where new samples can be added to an existing collection without the need to rebuild the collection from scratch.

Functionally, the AMRomics pipeline can be split into two stages: single sample analysis and collection analysis, as depicted in Fig. 1. In the single sample analysis stage, every sample is processed based on the type of input data. Specifically, for Illumina sequencing data, fastp [20, 21] is employed for quality control, adaptor trimming, quality filtering and read pruning. The pre-processed reads are then assembled to generate a genome assembly. SKESA [22] is the method of choice for assembling Illumina sequencing data because of its speed, but the user can optionally choose SPAdes [23, 24] for a slightly better N50 at the cost of extra computation time. If long read data (Nanopore or PacBio) are provided, the sample genome is assembled by Flye [25]. The assembly step can be skipped if the user provides the genome assembly in FASTA format as input to the pipeline. AMRomics then standardizes the sample IDs and the contig names to ensure their uniqueness in the collection. Next, the genome assembly is annotated with Prokka [26] unless the annotations are provided by the user. The gene sequences are extracted and stored in files at predefined locations. The genome sequence is also subject to multi-locus sequence typing with the pubMLST database of typing schemes for bacterial strains [27], antibiotic resistance gene identification with the AMRFinderPlus database [28], virulence gene identification with the virulence factor database VFDB [29, 30], and plasmid detection with the PlasmidFinder database [31]. All the results of the single sample analysis are organized in a standard manner.
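
The branching in the single sample stage can be summarized as a small dispatch routine. The following Python sketch is illustrative only: the function name `plan_single_sample`, the input type labels and the step labels are assumptions made for this example and are not taken from the AMRomics code base. It simply lists which tools the paragraph above assigns to each kind of input.

```python
# Hypothetical sketch: names and labels are invented for illustration and
# do not reflect the actual AMRomics implementation or command line.

TYPING_STEPS = ["pubmlst", "amrfinderplus", "vfdb", "plasmidfinder"]


def plan_single_sample(input_type, short_read_assembler="skesa"):
    """Return the ordered list of steps for one sample.

    input_type: "illumina", "nanopore", "pacbio",
                "assembly" (FASTA provided) or "annotation" (GFF provided).
    """
    steps = []
    if input_type == "illumina":
        if short_read_assembler not in ("skesa", "spades"):
            raise ValueError("short read assembler must be 'skesa' or 'spades'")
        # Read pre-processing followed by short read assembly.
        steps += ["fastp", short_read_assembler]
    elif input_type in ("nanopore", "pacbio"):
        # Long reads are assembled with Flye.
        steps.append("flye")
    elif input_type not in ("assembly", "annotation"):
        raise ValueError(f"unknown input type: {input_type}")

    # Sample IDs and contig names are made unique within the collection.
    steps.append("standardize_ids")

    # Annotation is skipped when the user already supplies it.
    if input_type != "annotation":
        steps.append("prokka")

    # Typing and gene screening run on every genome.
    steps += TYPING_STEPS
    return steps


if __name__ == "__main__":
    for kind in ("illumina", "nanopore", "assembly", "annotation"):
        print(kind, plan_single_sample(kind))
```

Keeping the plan as plain data makes it easy to see that supplying an assembly or an annotation only removes steps from the front of the list, which is where most of the per-sample running time is spent.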

In the second stage, AMRomics performs a pangenome comparative analysis of the genome collection. The annotations of all the genomes in GFF format are loaded into a pangenome inference module for gene clustering. PanTA [32] is the method of choice for pangenome construction because of its speed and scalability, but users can optionally choose Roary [33] as the alternative. AMRomics then classifies gene clusters into core genes (gene clusters present in at least 95% of the genomes in the collection) and accessory genes. In addition, AMRomics identifies shell genes, which are those present in at least a given fraction of the genomes in the pangenome. The threshold for shell genes defaults to 25% but can be adjusted by users. AMRomics then performs multiple sequence alignment (MSA) of all the identified shell genes using MAFFT [34]. The MSAs of these shell genes are then used to construct phylogenetic trees of these gene families using FastTree 2 [35] or IQ-TREE 2 [36]. In addition, AMRomics builds the phylogeny of the collection from the concatenation of the MSAs of all core genes using the chosen tree-building method.
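
The core and shell thresholds can be illustrated with a short sketch. The code below is not the AMRomics or PanTA implementation; the function name `classify_clusters` and the presence/absence dictionary are assumptions for this example. It also assumes that the shell set includes the core genes, since the text defines shell genes only by a lower bound on prevalence.

```python
# Hypothetical sketch of the prevalence-based classification described above.

def classify_clusters(presence, n_genomes, core_frac=0.95, shell_frac=0.25):
    """Classify gene clusters by the fraction of genomes that carry them.

    presence: dict mapping cluster id -> set of genome ids with that cluster.
    Returns (core, accessory, shell) as sorted lists of cluster ids.
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    core, accessory, shell = [], [], []
    for cluster, genomes in presence.items():
        frac = len(genomes) / n_genomes
        if frac >= core_frac:
            core.append(cluster)
        else:
            accessory.append(cluster)
        # Assumption: every cluster above the shell threshold is aligned,
        # which includes the core clusters.
        if frac >= shell_frac:
            shell.append(cluster)
    return sorted(core), sorted(accessory), sorted(shell)


if __name__ == "__main__":
    genomes = [f"g{i}" for i in range(10)]
    presence = {
        "gyrA": set(genomes),       # present in 10 of 10 genomes
        "blaOXA": set(genomes[:3]),  # present in 3 of 10 genomes
        "hypo1": {"g0"},            # present in 1 of 10 genomes
    }
    core, accessory, shell = classify_clusters(presence, len(genomes))
    print("core:", core)
    print("accessory:", accessory)
    print("shell:", shell)
```

Lowering `shell_frac` increases the number of gene families that are aligned and given their own tree, so the threshold is a direct trade-off between coverage of the accessory genome and running time.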

AMRomics introduces pan-SNPs, a novel concept to represent genetic variants of the samples in the collection. Existing variant analysis methods usually rely on a reference genome, and can only identify variants in the genes present in that reference genome. This severely limits the analysis to only a fraction of the genome of interest because of the high variability between isolates within a clade. In addition, it is often not possible to find a single reference genome that represents the whole collection, especially if the collection is diverse and growing. AMRomics addresses this by building a pan-reference genome for the collection, which is the set of representative genes, one from each gene cluster. It then identifies the variants of all genes in a cluster against the representative gene directly from the MSA. The variant profile of a sample is the concatenation of the variations of all its genes, reported in a VCF file.
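
Calling variants "directly from the MSA" can be sketched as a column-by-column comparison between the aligned representative gene and an aligned sample gene. The following Python example is a simplified illustration under stated assumptions: the function name `call_substitutions` is hypothetical, only substitutions are reported (columns with a gap in either sequence are skipped), and positions are 1-based coordinates on the ungapped representative gene. It does not reproduce the VCF output of AMRomics.

```python
# Hypothetical sketch: substitution calling from one pair of aligned rows.

def call_substitutions(rep_aln, sample_aln):
    """Return a list of (position, ref_base, alt_base) tuples.

    rep_aln, sample_aln: rows of the same MSA, with '-' marking gaps.
    position is 1-based on the representative gene without gaps.
    """
    if len(rep_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have the same length")
    variants = []
    rep_pos = 0
    for ref, alt in zip(rep_aln.upper(), sample_aln.upper()):
        if ref != "-":
            rep_pos += 1  # advance only on representative bases
        if ref == "-" or alt == "-":
            continue      # simplification: indels are ignored here
        if ref != alt:
            variants.append((rep_pos, ref, alt))
    return variants


if __name__ == "__main__":
    representative = "ATG-CA"
    sample = "ATGTCG"
    print(call_substitutions(representative, sample))
```

Because every gene in a cluster is compared with the same representative, variants from different samples share a coordinate system without any sample having to be mapped to a whole-genome reference.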

The representative gene for a gene cluster is set to be the one that comes from the earliest genome in the collection list. With this selection strategy, users who have a preferred reference genome can place it first in the collection list so that its genes become the representatives of their respective clusters. Moreover, as AMRomics supports adding new samples to a collection, the selection strategy also ensures that the representative gene for a cluster does not change as new samples are added, and that a new representative gene is added to the pan-reference genome only if a new cluster is created as a result of the collection expansion.
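
The selection rule and its stability under collection growth can be sketched as follows. This is an illustrative example rather than the AMRomics code: the function name `update_pan_reference` and the data layout are assumptions, and the sketch further assumes that cluster identifiers stay the same when new samples are added.

```python
# Hypothetical sketch of representative selection for the pan-reference genome.

def update_pan_reference(representatives, genome_order, clusters):
    """Add representatives for clusters that do not have one yet.

    representatives: dict cluster id -> gene id (updated in place, returned).
    genome_order: genome ids in collection order; earlier genomes win.
    clusters: dict cluster id -> list of (genome id, gene id) members.
    """
    rank = {genome: i for i, genome in enumerate(genome_order)}
    for cluster_id, members in clusters.items():
        if cluster_id in representatives:
            continue  # an existing representative is never replaced
        _, gene_id = min(members, key=lambda member: rank[member[0]])
        representatives[cluster_id] = gene_id
    return representatives


if __name__ == "__main__":
    # Initial collection: the preferred reference is listed first.
    reps = update_pan_reference(
        {},
        ["ref", "s1"],
        {"c1": [("s1", "s1_g1"), ("ref", "ref_g1")], "c2": [("s1", "s1_g2")]},
    )
    print(reps)

    # Adding sample s2 creates cluster c3 but leaves c1 and c2 untouched.
    reps = update_pan_reference(
        reps,
        ["ref", "s1", "s2"],
        {
            "c1": [("s1", "s1_g1"), ("ref", "ref_g1"), ("s2", "s2_g1")],
            "c2": [("s1", "s1_g2")],
            "c3": [("s2", "s2_g9")],
        },
    )
    print(reps)
```

Since existing entries are never rewritten, variant coordinates reported for earlier samples remain valid after the collection grows.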

All results obtained from running AMRomics can ultimately be aggregated into a final output for reporting or for customized visualizations for end users. Details of the third-party bioinformatics tools and databases used by AMRomics are listed in Supplementary Tables 1 and 2.

Results

Comparison with other pipelines

To the best of our knowledge, at the time of writing, there are four existing open-source software pipelines for end-to-end microbial genomics analysis, namely Nullarbor [17], TORMES [37], ASA³P [19] and Bactopia [18]. While AMRomics and these software tools share the same overall functionality, they differ in their underlying philosophies. Here, we present a high-level discussion of AMRomics features and highlight the principles behind the design of AMRomics.

Overall, AMRomics and the existing tools support a wide variety of input formats, except Nullarbor and TORMES, which are designed to run on Illumina paired-end reads only, in line with their specific public health routines. AMRomics and the more recent methods, ASA³P [19] and Bactopia, accept raw reads from third-generation sequencing technologies such as Oxford Nanopore Technology or PacBio long reads. A range of genomic analyses is included in all pipelines. These are common tasks in bacterial genomics, such as sequence typing (MLST), AMR/virulence factor scanning, and genome annotation for an isolate. While all of the tools provide SNP analysis results, AMRomics outputs variants (in VCF files) from the core gene alignment of the pangenome analysis instead of from the snippy [38] core alignment used by the other methods. Table 1 summarizes the key features across the software tools.

Table 1 Functional comparison between AMRomics and other bacterial genomics pipelines


The primary principle of AMRomics is to extract the highest quality and most informative statistics from the input data. For example, AMRomics constructs the phylogenetic tree of the collection using the multiple alignment of core genes. This provides a higher resolution of evolutionary information than SNP information or the multiple alignment of 16S genes [39], the two techniques applied by the existing tools. In addition, AMRomics utilizes the population information to call variants across the pangenome instead of from a chosen reference genome, and hence provides a bigger picture of genetic relations among the isolates in the collection. Users can still use one or more preferred reference genomes by placing them at the top of the collection list.

The second, and perhaps equally important, design principle of AMRomics emphasizes the scalability of the software, aiming to enable the analysis of large collections without the need to scale up hardware infrastructure. While AMRomics uses the same underlying core tools (e.g., BLAST+, SPAdes, SKESA, Flye, Prokka, etc.) as other pipelines, we chose to reimplement helper and pre-processing modules such as Shovill and Dragonflye. In the process, we paid attention to the data structures used to manage the large amounts of data flowing between steps of the pipeline. As a result, AMRomics is significantly faster and requires only a fraction of the memory used by its counterparts (shown in the following section). While speed is paramount, AMRomics offers users the flexibility to choose between alternatives to fit their needs when there is more than one core algorithm for the same step (such as SPAdes and SKESA for assembling short reads, or FastTree and IQ-TREE for phylogenetic tree construction). AMRomics also takes advantage of progressive analysis: when new samples are added to an existing collection, AMRomics only performs the extra computation related to the new samples, instead of recomputing from scratch. This strategy offers a scalable solution that is practically suitable for the analysis of large, growing collections of bacteria in the sequencing age.

Case study

We demonstrate the utility of AMRomics on a large and heterogeneous set of Klebsiella pneumoniae genomes collected from various public sources. In particular, we designed a case study that reflects a practical use case and highlights the ease of use, flexibility and scalability of AMRomics. The input data of the case study consisted of three batches of genome data. The first batch contained the sequencing data of 89 K. pneumoniae isolates collected at Patan Hospital in Kathmandu, Nepal between May and December 2012 [40]. These samples were multi-drug resistant isolates, in the form of Illumina paired-end short read data. While AMRomics did not require a reference genome for variant calling, we included in the batch four genome assemblies obtained from RefSeq (two in the genome assembly FASTA format and two in the annotation GFF format) for the other workflows to use as the reference. In the second batch, we included 11 samples that were collected from Hospital Universitario Ramon y Cajal, exhibiting carbapenem resistance and harboring the pOXA-48 plasmid [41]. The input data for these 11 samples were Oxford Nanopore sequencing data. Finally, we included a third batch of 1000 samples; the genomes in this batch were previously assembled and annotated by NCBI PGAP, and they were in GFF format. The data for the case study are provided in the Supporting data.

Despite the commonalities among the analysis pipelines, a direct comparison can be challenging because of variations in the processing steps and the selection of different analysis tools within each pipeline. For simplicity, we used the default settings to run all existing pipelines, which cover the essential analyses shown in Table 1. We also made our best effort to use the parameters most compatible with AMRomics (Supporting data). We did not include TORMES in the comparison because of its resemblance to its predecessor, Nullarbor. The experiments were conducted on a cloud server with moderate performance, equipped with a 6-core 12-thread E-2286G processor, 32 GB of RAM, and a 960 GB SSD drive.

Table 2 shows the running time and resource consumption of the four pipelines. For the first batch, AMRomics took only 4.32 hours to perform single sample analysis on 89 samples, significantly faster than Bactopia and Nullarbor at 8.82 hours and 11.09 hours, respectively, even though the three pipelines use similar underlying algorithms (SKESA for Illumina read assembly, Prokka for annotation and BLAST for virulome and resistome calling). This is likely due to better process management and parallelization in AMRomics. ASA³P took much longer, 22.24 hours, as a result of using SPAdes, a slower assembly algorithm that typically produces assemblies with higher N50. Of note, AMRomics, Bactopia and Nullarbor could optionally use SPAdes as the short read assembler. It is also worth noting that variant calling was part of single sample analysis in Bactopia, ASA³P and Nullarbor, which also contributed to the extended single sample analysis time of these tools. AMRomics took under 1 hour for collection analyses, including pangenome inference, multiple alignment of cloud genes, phylogenetic analyses of organisms and of every cloud gene, and SNP analysis. Nullarbor performed collection analysis in a much shorter time, 0.19 hours, albeit producing only the pan-genome and core-gene phylogeny. Bactopia and ASA³P took significantly longer, 2.32 hours and 12.24 hours, respectively. Taken together, AMRomics required less than half of the time of the other tools for the whole pipeline. It also consumed only 3.44 GB of memory, compared with 5.83 GB by Bactopia, 20.86 GB by ASA³P and 7.91 GB by Nullarbor.
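
The "less than half of the time" statement can be checked from the figures quoted in the paragraph above. The short Python calculation below is only a worked example of that arithmetic: the timings are copied from the text, while the AMRomics collection time is entered as an upper bound of 1.0 hour because the text only states "under 1 hour" (an assumption of this sketch, as are the variable and function names).

```python
# Worked arithmetic on the batch 1 timings quoted in the text (hours).
single_sample = {"AMRomics": 4.32, "Bactopia": 8.82, "Nullarbor": 11.09, "ASA3P": 22.24}
collection = {"Bactopia": 2.32, "Nullarbor": 0.19, "ASA3P": 12.24}

# Assumption: the text reports "under 1 hour", so 1.0 is used as an upper bound.
AMROMICS_COLLECTION_UPPER = 1.0


def batch1_totals():
    """Return total hours per tool; the AMRomics value is an upper bound."""
    totals = {
        tool: round(single_sample[tool] + collection[tool], 2)
        for tool in collection
    }
    totals["AMRomics"] = round(
        single_sample["AMRomics"] + AMROMICS_COLLECTION_UPPER, 2
    )
    return totals


if __name__ == "__main__":
    totals = batch1_totals()
    for tool, hours in sorted(totals.items(), key=lambda item: item[1]):
        ratio = totals["AMRomics"] / hours
        print(f"{tool:10s} {hours:6.2f} h  (AMRomics upper bound / tool = {ratio:.2f})")
```

Even with the conservative upper bound, the AMRomics total stays below half of the next fastest pipeline's total for this batch.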

Table 2 Running times and memory usages of AMRomics, Bactopia, ASA³P and Nullarbor in the case study


The second batch consisted of Nanopore sequencing data for 11 samples, which Nullarbor did not support. ASA³P did not support progressive analysis; hence all samples in the first and second batches had to be analyzed from scratch, leading to a total of 54.76 hours. Bactopia took 1.86 hours for single sample analysis, shorter than the 2.74 hours taken by AMRomics, even though both tools used Flye as the underlying assembler. Upon examining the runtimes, we noticed that Bactopia subsampled the sequencing reads to 50x, resulting in the speed-up. AMRomics took less than one hour for collection analysis thanks to the progressive mode of its underlying pangenome method, PanTA. Bactopia, on the other hand, took 3.99 hours.

The genomes in the third batch were already annotated in GFF format. We did not run ASA³P on the third batch because of the excessive time required to re-analyze the samples in the previous batches. Bactopia did not have a function to extract the annotations from the GFF files, and instead re-annotated the input genomes. In addition, Bactopia simulated sequencing reads from the assembled genomes and mapped the simulated reads back to the reference to call SNPs. While these steps could produce the intended analysis results, they took 67.72 hours for 1000 genomes. AMRomics, on the other hand, reused the existing annotations from the input genomes, leading to a substantially shorter single sample analysis time of only 4.17 hours. Similarly, the pangenome analysis strategy employed by AMRomics reused the existing pangenome computation, requiring only 2.44 hours to add 1000 genomes to the existing pangenome. Bactopia ran pangenome analysis for more than 20 hours before crashing because it ran out of memory.

Discussion

We introduce AMRomics, a lightweight and scalable computational pipeline to analyze bacterial genomes and pan-genomes cost-effectively. Our method focuses on optimizing the workflow and selected sub-modules for microbial genomic studies, especially comparative genomics, and, most importantly, on supporting progressive analysis for growing big data collections. AMRomics provides flexible input scenarios by supporting a wide range of data formats, such as different types of raw reads, assemblies, or annotated genomes for each sample, depending on data availability or the pipeline settings chosen by end users. It can generate fundamental genomic properties sample by sample by conducting routine analyses for bacterial isolates, as well as comparative genomics for the whole collection, i.e., pangenome evaluation and the corresponding phylogenetic results. Analysis results from AMRomics can be directly imported into AMRViz [42], a visualization tool for viewing and visually inspecting the analysis results.

AMRomics leverages the wealth of bioinformatics and genomics tools available to develop an end-to-end analysis workflow. While the focus so far has been on efficacy and scalability, opportunities exist for enhancing and broadening its functionality. We are continuously updating the pipeline with new methods to provide alternative options for each available function, or novel ones, to meet the various needs of end users. For instance, the default genome annotation module in the community has been Prokka [26], but recent tools such as Bakta [43] and PGAP [44] are becoming prominent; such tools will be incorporated into the pipeline to provide alternatives tailored to users' needs. Another direction to enhance the application of AMRomics is to consider species-specific downstream analyses in addition to the core general-purpose modules. Such analyses are required in many scenarios of microbial genomics surveillance, especially in public health settings. Examples of such tools include various species-specific serotyping methods: SISTR [45] for Salmonella, Shigatyper [46] for Shigella and PneumoCAT [47] for Streptococcus pneumoniae.

In summary, AMRomics is a useful tool that can manage and enable the scale-up of large bacterial collections with modest computational resources. Continuing support for new modules and workflow maintenance will make it another practical option for the booming era of microbial genomics data.

Availability of data and materials

The data results from the case study are available on Figshare at DOI https://doi.org/10.6084/m9.figshare.26333002.

References

Kwong JC, McCallum N, Sintchenko V, Howden BP. Whole genome sequencing in clinical and public health microbiology. Pathology. 2015;47(3):199–210.
Article CAS PubMed Google Scholar
Brown E, Dessai U, McGarry S, Gerner-Smidt P. Use of whole-genome sequencing for food safety and public health in the united states. Foodborne Pathog Dis. 2019;16(7):441–50.
Article PubMed PubMed Central Google Scholar
Ferdinand AS, Kelaher M, Lane CR, da Silva AG, Sherry NL, Ballard SA, Andersson P, Hoang T, Denholm JT, Easton M, et al. An implementation science approach to evaluating pathogen whole genome sequencing in public health. Genome Med. 2021;13:1–11.
Article Google Scholar
Karlsen ST, Rau MH, Sánchez BJ, Jensen K, Zeidan AA. From genotype to phenotype: computational approaches for inferring microbial traits relevant to the food industry. FEMS Microbiol Rev. 2023;47(4):fuad030.
Do VH, Nguyen SH, Le DQ, Nguyen TT, Nguyen CH, Ho TH, Vo NS, Nguyen T, Nguyen HA, Cao MD. Pasa: leveraging population pangenome graph to scaffold prokaryote genome assemblies. Nucleic Acids Res. 2024;52(3):15. https://doi.org/10.1093/nar/gkad1170.
Article Google Scholar
Do VH, Nguyen VS, Nguyen SH, Le DQ, Nguyen TT, Nguyen CH, Ho TH, Vo NS, Nguyen T, Nguyen HA, Cao MD. PanKA : Leveraging population pangenome to predict antibiotic resistance. iScience. 2024. To Appear.
Massey RC, Horsburgh MJ, Lina G, Höök M, Recker M. The evolution and maintenance of virulence in staphylococcus aureus: a role for host-to-host transmission? Nat Rev Microbiol. 2006;4(12):953–8.
Article CAS PubMed Google Scholar
De la Fuente J, Diez-Delgado I, Contreras M, Vicente J, Cabezas-Cruz A, Tobes R, Manrique M, Lopez V, Romero B, Bezos J, et al. Comparative genomics of field isolates of mycobacterium bovis and m. caprae provides evidence for possible correlates with bacterial viability and virulence. PLoS Negl Trop Dis. 2015;9(11):0004232.
Google Scholar
Alghoribi MF, Balkhy HH, Woodford N, Ellington MJ. The role of whole genome sequencing in monitoring antimicrobial resistance: A biosafety and public health priority in the arabian peninsula. J Infect Public Health. 2018;11(6):784–7.
Article PubMed Google Scholar
Hendriksen RS, Bortolaia V, Tate H, Tyson GH, Aarestrup FM, McDermott PF. Using genomics to track global antimicrobial resistance. Front Public Health. 2019;7:242.
Article PubMed PubMed Central Google Scholar
Petrone ME, Rothman JE, Breban MI, Ott IM, Russell A, Lasek-Nesselquist E, Badr H, Kelly K, Omerza G, Renzette N, et al. Combining genomic and epidemiological data to compare the transmissibility of sars-cov-2 variants alpha and iota. Commun Biol. 2022;5(1):439.
Article CAS PubMed PubMed Central Google Scholar
Haanappel CP, Oude Munnink BB, Sikkema RS, de Jager H, de Boever R, Koene HH, Boter M, Chestakova IV, van der Linden A, Molenkamp R, et al. Combining epidemiological data and whole genome sequencing to understand sars-cov-2 transmission dynamics in a large tertiary care hospital during the first covid-19 wave in the netherlands focusing on healthcare workers. Antimicrob Resist Infect Control. 2023;12(1):1–12.
Article Google Scholar
Duault H, Durand B, Canini L. Methods combining genomic and epidemiological data in the reconstruction of transmission trees: A systematic review. Pathogens. 2022;11(2):252.
Article PubMed PubMed Central Google Scholar
Khataei MM, Epi SBH, Lood R, Spégel P, Yamini Y, Turner C. A review of green solvent extraction techniques and their use in antibiotic residue analysis. J Pharm Biomed Anal. 2022;209:114487.
Article CAS PubMed Google Scholar
Donkor ES. Sequencing of bacterial genomes: principles and insights into pathogenesis and development of antibiotics. Genes. 2013;4(4):556–72.
Article PubMed PubMed Central Google Scholar
Li LM, Grassly NC, Fraser C. Genomic analysis of emerging pathogens: methods, application and future trends. Genome Biol. 2014;15(11):1–9.
Article Google Scholar
Seemann T, Goncalves da Silva A, Bulach DM, Schultz MB, Kwong JC, Howden BP. Nullarbor Github. 2015. https://github.com/tseemann/nullarbor. Accessed 11 Dec 2023.
Petit RA III, Read TD. Bactopia: a flexible pipeline for complete analysis of bacterial genomes. Msystems. 2020;5(4):10–1128.
Article Google Scholar
Schwengers O, Hoek A, Fritzenwanker M, Falgenhauer L, Hain T, Chakraborty T, Goesmann A. ASA3P: an automatic and scalable pipeline for the assembly, annotation and higher-level analysis of closely related bacterial isolates. PLoS Comput Biol. 2020;16(3):1007134.
Article Google Scholar
Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one fastq preprocessor. Bioinformatics. 2018;34(17):884–90.
Article Google Scholar
Chen S. Ultrafast one-pass fastq data preprocessing, quality control, and deduplication using fastp. iMeta. 2023;2(2):e107
Souvorov A, Agarwala R, Lipman DJ. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biol. 2018;19(1):153. https://doi.org/10.1186/s13059-018-1540-z.
Article CAS PubMed PubMed Central Google Scholar
Prjibelski AD, Vasilinetc I, Bankevich A, Gurevich A, Krivosheeva T, Nurk S, Pham S, Korobeynikov A, Lapidus A, Pevzner PA. ExSPAnder: a universal repeat resolver for DNA fragment assembly. Bioinformatics. 2014;30(12):293–301. https://doi.org/10.1093/bioinformatics/btu266.
Article CAS Google Scholar
Vasilinetc I, Prjibelski AD, Gurevich A, Korobeynikov A, Pevzner PA. Assembling short reads from jumping libraries with large insert sizes. Bioinformatics. 2015;31(20):3262–8. https://doi.org/10.1093/bioinformatics/btv337.
Article CAS PubMed Google Scholar
Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019;37(5):540–6.
Article CAS PubMed Google Scholar
Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30(14):2068–9.
Article CAS PubMed Google Scholar
Jolley KA, Maiden MC. BIGSdb: scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics. 2010;11:1–11.
Article Google Scholar
Feldgarden M, Brover V, Gonzalez-Escalona N, Frye JG, Haendiges J, Haft DH, Hoffmann M, Pettengill JB, Prasad AB, Tillman GE, et al. AMRFinderPlus and the reference gene catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep. 2021;11(1):1–9.
Article Google Scholar
Chen L, Yang J, Yu J, Yao Z, Sun L, Shen Y, Jin Q. VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res. 2005;33(suppl_1):325–8.
Liu B, Zheng D, Zhou S, Chen L, Yang J. Vfdb 2022: a general classification scheme for bacterial virulence factors. Nucleic Acids Res. 2022;50(D1):912–7.
Article Google Scholar
Carattoli A, Zankari E, García-Fernández A, Voldby Larsen M, Lund O, Villa L, Møller Aarestrup F, Hasman H. In silico detection and typing of plasmids using plasmidfinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother. 2014;58(7):3895–903.
Article PubMed PubMed Central Google Scholar
Le DQ, Nguyen TA, Nguyen TT, Nguyen SH, Do VH, Nguyen CH, Phung HT, Ho TH, Nam VS, Nguyen T, Nguyen HA, Cao MD. PanTA : An ultra-fast method for constructing large and growing microbial pangenomes. bioRxiv. 2023;1–9. https://doi.org/10.1101/2023.07.03.547471
Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane JA, Parkhill J. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31(22):3691–3. https://doi.org/10.1093/bioinformatics/btv421.
Article CAS PubMed PubMed Central Google Scholar
Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics. 2018;34(14):2490–2. https://doi.org/10.1093/bioinformatics/bty121.
Article CAS PubMed PubMed Central Google Scholar
Price MN, Dehal PS, Arkin AP. FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE. 2010;5(3):9490. https://doi.org/10.1371/journal.pone.0009490.
Article CAS Google Scholar
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, Von Haeseler A, Lanfear R. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020;37(5):1530–4.
Article CAS PubMed PubMed Central Google Scholar
Quijada NM, Rodríguez-Lázaro D, Eiros JM, Hernández M. TORMES: an automated pipeline for whole bacterial genome analysis. Bioinformatics. 2019;35(21):4207–12. https://doi.org/10.1093/bioinformatics/btz220.
Article CAS PubMed Google Scholar
Seeman T. Github. 2013. https://github.com/tseemann/snippy. Accessed 11 Dec 2023.
Hassler HB, Probert B, Moore C, Lawson E, Jackson RW, Russell BT, Richards VP. Phylogenies of the 16S rRNA gene and its hypervariable regions lack concordance with core genome phylogenies. Microbiome. 2022;10(1):104. https://doi.org/10.1186/s40168-022-01295-y.
Article CAS PubMed PubMed Central Google Scholar
Chung The H, Karkey A, Pham Thanh D, Boinett CJ, Cain AK, Ellington M, Baker KS, Dongol S, Thompson C, Harris SR, et al. A high-resolution genomic analysis of multidrug-resistant hospital outbreaks of klebsiella pneumoniae. EMBO Mol Med. 2015;7(3):227–39.
León-Sampedro R, DelaFuente J, Díaz-Agero C, Crellen T, Musicha P, Rodríguez-Beltrán J, de la Vega C, Hernández-García M, R-GNOSIS WP5 Study Group, López-Fresneña N, et al. Pervasive transmission of a carbapenem resistance plasmid in the gut microbiota of hospitalized patients. Nat Microbiol. 2021;6(5):606–16.
Le DQ, Nguyen SH, Nguyen TT, Nguyen CH, Ho TH, Vo NS, Nguyen T, Nguyen HA, Cao MD. AMRViz enables seamless genomics analysis and visualization of antimicrobial resistance. BMC Bioinformatics. 2024;25(1):193.
Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microb Genomics. 2021;7(11):000685.
Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res. 2016;44(14):6614–24.
Yoshida CE, Kruczkiewicz P, Laing CR, Lingohr EJ, Gannon VP, Nash JH, Taboada EN. The Salmonella in silico typing resource (SISTR): an open web-accessible tool for rapidly typing and subtyping draft Salmonella genome assemblies. PLoS ONE. 2016;11(1):e0147101.
Wu Y, Lau HK, Lee T, Lau DK, Payne J. In silico serotyping based on whole-genome sequencing improves the accuracy of Shigella identification. Appl Environ Microbiol. 2019;85(7):e00165-19.
Kapatai G, Sheppard CL, Al-Shahib A, Litt DJ, Underwood AP, Harrison TG, Fry NK. Whole genome sequencing of Streptococcus pneumoniae: development, evaluation and verification of targets for serogroup and serotype prediction using an automated pipeline. PeerJ. 2016;4:e2477.

Availability of source code and requirements

• Project name: AMRomics

• Project home page: https://github.com/amromics/amromics

• Operating system(s): Platform independent

• Programming language: Python

• Other requirements: Python 3.10 or higher, conda

• License: MIT

Funding

This work was supported by the Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA11.

Author information

Authors and Affiliations

AMROMICS JSC, Nghe An, Vietnam
Duc Quang Le, Trang Nguyen, Hoang Anh Nguyen, Minh Duc Cao & Son Hoang Nguyen
Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam
Duc Quang Le, Le Sy Vinh & Thanh Hai Dang
Faculty of IT, Hanoi University of Civil Engineering, Hanoi, Vietnam
Duc Quang Le
Oxford University Clinical Research Unit, Hanoi, Vietnam
Tam Thi Nguyen
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
Canh Hao Nguyen
Department of Medical Microbiology, The 103 Military Hospital, Vietnam Military Medical University, Hanoi, Vietnam
Tho Huu Ho
Department of Genomics & Cytogenetics, Institute of Biomedicine & Pharmacy, Vietnam Military Medical University, Hanoi, Vietnam
Tho Huu Ho
Center for Biomedical Informatics, Vingroup Big Data Institute, Hanoi, Vietnam
Nam S. Vo

Authors

Duc Quang Le
Tam Thi Nguyen
Canh Hao Nguyen
Tho Huu Ho
Nam S. Vo
Trang Nguyen
Hoang Anh Nguyen
Le Sy Vinh
Thanh Hai Dang
Minh Duc Cao
Son Hoang Nguyen

Contributions

MDC, SHN, DQL and TTN conceptualized and designed the project. DQL, SHN, HAN and MDC implemented the pipeline. DQL and SHN created the use cases and ran the comparison. SHN and DQL drafted the first version of the manuscript. All authors contributed to the writing and revision.

Corresponding authors

Correspondence to Duc Quang Le, Minh Duc Cao or Son Hoang Nguyen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Le, D.Q., Nguyen, T.T., Nguyen, C.H. et al. AMRomics: a scalable workflow to analyze large microbial genome collections. BMC Genomics 25, 709 (2024). https://doi.org/10.1186/s12864-024-10620-8

Received: 20 April 2024
Accepted: 15 July 2024
Published: 22 July 2024
DOI: https://doi.org/10.1186/s12864-024-10620-8
