Comparative genomics of drug-resistant strains of Mycobacterium tuberculosis in Ecuador

Background Tuberculosis is a serious infectious disease affecting millions of people. In spite of efforts to reduce the disease, increasing antibiotic resistance has contributed to persist in the top 10 causes of death worldwide. In fact, the increased cases of multi (MDR) and extreme drug resistance (XDR) worldwide remains the main challenge for tuberculosis control. Whole genome sequencing is a powerful tool for predicting drug resistance-related variants, studying lineages, tracking transmission, and defining outbreaks. This study presents the identification and characterization of resistant clinical isolates of Mycobacterium tuberculosis including a phylogenetic and molecular resistance profile study by sequencing the complete genome of 24 strains from different provinces of Ecuador. Results Genomic sequencing was used to identify the variants causing resistance. A total of 15/21 isolates were identified as MDR, 4/21 as pre-XDR and 2/21 as XDR, with three isolates discarded due to low quality; the main sub-lineage was LAM (61.9%) and Haarlem (19%) but clades X, T and S were identified. Of the six pre-XDR and XDR strains, it is noteworthy that five come from females; four come from the LAM sub-lineage and two correspond to the X-class sub-lineage. A core genome of 3,750 genes, distributed in 295 subsystems, was determined. Among these, 64 proteins related to virulence and implicated in the pathogenicity of M. tuberculosis and 66 possible pharmacological targets stand out. Most variants result in nonsynonymous amino acid changes and the most frequent genotypes were identified as conferring resistance to rifampicin, isoniazid, ethambutol, para-aminosalicylic acid and streptomycin. However, an increase in the resistance to fluoroquinolones was detected. Conclusion This work shows for the first time the variability of circulating resistant strains between men and women in Ecuador, highlighting the usefulness of genomic sequencing for the identification of emerging resistance. In this regard, we found an increase in fluoroquinolone resistance. Further sampling effort is needed to determine the total variability and associations with the metadata obtained to generate better health policies. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-09042-1.


Introduction
Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb), is one of the top 10 causes of death in 2020. The World Health Organization estimated that 9.9 million people were infected, and 1.5 million died [1]. Despite constant efforts over the years to lessen the impact of this disease, several challenges, such as patient adherence, long duration of treatment, and late diagnosis, have reduced the effectiveness of TB therapy [2]. In addition, the frequent emergence of drug-resistant strains is a serious global threat and poses significant challenges to public health, especially in low-and middle-income countries.
*Correspondence: garielmorey@uees.edu.ec; lberna@pasteur.edu.uy Drug resistance in bacteria, particularly in M. tuberculosis, negatively affects health programs worldwide due to the increase of resistant strains that cannot be treated with existing anti-TB drug therapy. Indeed, tuberculosis is considered to be one of the "Big three" infectious diseases worldwide [3]. Patients with resistant TB may develop it due to clinical factors such as late or inappropriate diagnosis, ineffective treatment, poor compliance to the regimen, and exposure to circulating multidrug-resistant and extensively drug-resistant tuberculosis (MDR/XDR-TB) strains [2]. The slow growth of Mtb slows diagnosis and consequently limits the timely detection of resistance to anti-TB drugs. This contributes to the rising incidence of MDR and XDR TB worldwide. Conversely, the accumulation of point mutations in coding regions for drug targets or drug-converting enzymes is an essential mechanism for acquiring resistance in Mtb [4].
The Mycobacterium tuberculosis complex is composed of the human-adapted members, the Mycobacterium tuberculosis sensu stricto (lineages 1, 2, 3, 4, and 7) and M. tuberculosis var. africanum (lineages 5 and 6; M. africanum), distinct phylogenetic lineages that have evolved over millennia [5][6][7] and lineage 8 and 9 are new lineages recently discovered in Africa [8]. Research on the distribution of lineages of Mtb in South America has shown that the Euro-American lineage (lineage 4) has a variable distribution of lineages/sub-lineages between and within these countries, establishing that the predominant strains in their America have evolved from the Euro-American lineage [9]. This lineage has been described as more transmissible than others [10]. From 10 sub-lineages, including in Lineage 4, LAM (L4.3), Haarlem (L4.1.2), X-type and T families are the most observed members of the Euro-American lineage in South and Central America and the Caribbean, a distribution profile shared with Europe and Middle Africa [11][12][13][14].
In 2015, due to drug resistance rising worldwide, the WHO proposed expanding rapid testing and detection of cases as one of the five high-priority actions to tackle the global DR-TB crisis [15]. As part of these actions, some molecular biology-based diagnostic methods have been applied in a clinical setting for their versatility and speed, including variants of Xpert MTB/RIF and Geno-Type MTBDRplus [16][17][18][19]. Although these methods are rapid and straightforward, Mtb continuously evolves through the genomic acquisition of single nucleotide polymorphisms (SNPs), insertions, and deletions (indels), impacting the ability to discriminate strains without the classic regions used for resistance detection [20]. Wholegenome sequencing (WGS), as a molecular diagnostic tool, has been greatly developed in TB research and is clinically useful for predicting drug resistance, lineage, tracing transmission, and defining outbreaks [21][22][23][24][25][26]. Furthermore, WGS has the potential to determine drug resistance much faster than traditional phenotypic drug susceptibility testing (DST) [27] and does not require biological safety infrastructure [28,29]. These technologies are used differentially, with the highest concentration of sequenced strains found in 16 countries [30][31][32]. However, for some countries, particularly in South America, WGS information on the variability and distribution of circulating strains is unavailable. In Latin America, are extensive information on Mycobacterium tuberculosis, highlighting its genetic diversity analyzed by molecular methods; despite this, very few studies have been carried out in Ecuador, in which the LAM and Haarlem lineages are described as the more prevalent, and a little for the Beijing family [33][34][35]. Recently Garzon-Chavez et al. [36] determined that the main circulating lineages in Ecuador were LAM, Haarlem, and S and only applied WGS for resolving the classification in 8 isolates with an ambiguous result from 373 strains assessed by 24 loci-MIRU-VNTR and DR analysis, previously. In this sense, it is essential to know the genetic composition of M. tuberculosis, especially the variations associated with drug resistance distributed within its genome. This study aims to assess the lineage and molecular resistance profile characterization in clinical isolates of M. tuberculosis through whole-genome sequencing.

Results
In order to assess the genetic variability of resistant M. tuberculosis strains, twenty-four culture-confirmed clinical isolates of M. tuberculosis with at least one antibiotic resistance were selected for whole genome sequencing. From them, the bacillary load test in the microscopy smear was greater than two crosses, and 66,7% of patients relapsed. The isolates come from four regions of western Ecuador (20, 83.3%, Guayaquil; 2, 8.3%, El Empalme, 1, 4.2% Babahoyo and Chone, respectively). The majority of the patients were female (64,5%); the mean age was 42.9 ± 12.9 years; comorbidity like HIV was present in three patients (Table 1). All isolates were only tested for phenotypic resistance to four first-line drugs: rifampicin (R), isoniazid (H), ethambutol (E), and pyrazinamide (Z); and one second-line drug: streptomycin (S). Thus, a wide range of drug resistance profiles were covered: 85.7% are polydrug resistant (50% resistant to all first-line drugs plus streptomycin), and 14.3% are MDR (Table 1 and  Supplementary table 1).
The 24 strains isolated were sequenced using the Illumina platform with paired-end reads (2 × 150 bp). The sequences were filtered by the quality and trimmed when necessary. The depth obtained ranged from 22.7X to 53.6X (Table 1). The strains were mapped against the reference M. tuberculosis H37Rv (NC_000962.3) (4.4 million base pairs and 4008 protein-coding genes) to verify that they were free of contamination. The sequencing and mapping statistics are presented in Table 1. Most strains have an outstanding quality, depth, and percentage mapping against the reference; however, three strains presented an average depth below 20 and were discarded for further analysis. The 21 remaining strains were the novo assembled, and the assemblies obtained were checked for contamination. On average, we recovered draft genomes of 4.3 Mb distributed in 140 contigs per strain with at least an N50 of 56 Kb (Table 1 and Supplementary  table 1). Although a complete genome was obtained for all strains, the fragmentation into several contigs is due primarily to regions of extreme base composition and low complexity. This can be seen in Fig. 1, which shows the reference M. tuberculosis H37Rv, the percentage of GC and GC-Skew, and the coverage plot of 21 genomes assembled and plotted according to drug resistance profile. Here, common breaks in coverage correspond to peaks in GC or GC-skew.
Prokka and RASTtk were used to determine the structure and function of genes. On average, the Prokka annotation showed the presence of ∼4,119 ± 26.28 genes; among them, 4,065 correspond to coding sequences (CDS), 53 are transfer RNAs, and 1 code for a tmRNAs (Table 1). Based on the annotations, the analysis of the pangenome was carried out with Roary [37]. Thus, the pangenome analysis of 21 genomes of M. tuberculosis revealed that there were 4735 gene families defined as the pangenome. The protein-coding genes identified for each strain have slight variation in number, ranging from a minimum of 4021 genes (strain S0036_Mtb_Ec) to a maximum of 4115 genes (strains S0017_Mtb_Ec and S0022_Mtb_Ec). A total of 3750 common genes (core genes) shared by all the genomes analyzed were identified. In addition, 99 soft-core genes (presented in 20 of the 21 strains), 353 accessory genes (shared between two and 19 strains), and 533 cloud genes (strain-specific genes) were identified ( Fig. 2A). The core and pangenome size ratio were found to be 0.79; thus, the core forms 79% of the pangenome, representing a rather closed genome with less variability. Gene accumulation was calculated for the core and pangenome. It is observed that after including 21 strains, the increase in the total number of genes is significantly reduced, and on the other hand, the core genome increases minimally, indicating a pangenome close to closing. This can be observed in Figs. 2B and C; from this visualization, it can be suggested that From the core genes and using Prokka and RAS-Ttk, subsystems were annotated. Using COG annotations, a function was assigned for a mean of 1,731 genes for each strain analyzed. These annotated genes were classified into 21 metabolic pathways within four primary categories, including Metabolism (49.2%); Information, storage, and processing (17.5%); Cellular processes, and signaling (16.8%), and poorly characterized (16.5%). As for many other organisms, almost half of the genes identified in all strains analyzed (~ 2022  Fig. 3a).
In addition, using RASTtk [38], ∼1,521 genes were assigned to the subsystems (a group of proteins that implement a specific biological process or structural complex). It was observed that almost 18% of the annotated proteincoding genes were associated with amino acids and their derivatives (267 ± 3.04), followed by genes related to carbohydrates (13.1%, 199 ± 1.74), protein metabolism (10.9%, 166 ± 0.68), cofactors, vitamins, prosthetic groups, and pigments (9.5%, 144 ± 4.65 genes). About 60% of the genes identified in subsystems are involved in the Metabolism of proteins, amino acids, carbohydrates, lipids, and cofactors ( Fig. 3b). Cell regulation and signaling (6.6%, 101 ± 1.7) and DNA metabolism (6.1%, 93 ± 1.32) also appear as subsystems with a relevant role. Partly due to the number of strains, a higher diversity in the percentage of genes in the subsystems was found for MDR strains than for pre-XDR and XDR strains (Supplementary table 3). Most of the genes involved in amino acid biosynthesis were conserved and proved to have an essential function for pathogenicity in bacteria, including M. tuberculosis [39]. Of particular interest are the strain-specific genes that may be responsible for the variability and particularities of this set. Therefore, based on the identification and classification of the genes, we performed a characterization analysis of the functional groups of strain-specific genes. By analyzing the distribution in the 21 strains, we found that most of the identified unique genes involved in metabolism and transport. In particular in the metabolism of lipids, secondary metabolites, coenzymes, amino acids, and inorganic compounds. A smaller proportion of genes are linked to replication, transcription, and translation, and only a few unique genes are related to cellular processes and signaling (Fig. 4).
Since our analysis is based on resistant strains, particular attention was devoted to specific genes, like those genes related to virulence. First, using the Virulence Factor Database (VFDB), a total of 64 virulencerelated proteins implicated in the pathogenicity of M. tuberculosis were identified, classified as follows: 58.96% related to Secretion system; 11.36% to iron uptake, Siderophore; 4.26% to Regulation and Cellular Metabolism; 4.12% to the Cell wall and 1.42% to Heat-shock protein and Magnesium uptake (Supplementary table 4). Second, using DrugBank, 66 proteins were identified as potential drug targets (Supplementary table 5). Similarly, 38 families of Transporter proteins were identified by the Transporter Classification Database (TCDB), the most representative being the ATP-binding cassette (abc) superfamily (25.17%), followed by the type vii or esx protein secretion system (t7ss) family (23.65%), the general secretory pathway (sec) family (8.38%), the major facilitator superfamily (mfs) (5.86%) and the resistance-nodulation-cell division (rnd) superfamily (5.03%) (Supplementary table 6).
Finally, the presence of CRISPR genes was analyzed using CRISPRfinder, and between 2 and 3 CRISPRs were detected in each genome (Supplementary table 1).
We also evaluated the prevalence of resistance in the fluoroquinolone among isolates sequenced by looking at gyrA and gyrB gene mutation frequencies. In the gyrA gene, the SNV Asp94Asn, known to confer fluoroquinolone resistance, was found in all pre-XDR isolates, while SNV Arg90Val was found in XDR isolate and one pre-XDR. Additionally, the known SNV Ser95Thr in the gyrA gene was found in 20 isolates. For resistance characterization in tuberculosis, the web-based tools SAM-TB (76.4%) and Mykrobe (72.7%) approached the most frequently identified mutations, complemented by PhyReSse (50.9%), to determine genetic resistance by whole-genome sequencing in surveillance programs (Supplementary table 9 and 10).

Discussion
The present study describes for the first time the use of whole genome sequencing in the screening of TB drug susceptibility in Ecuador and provides information on the phylogenetic characteristics and the identification of the predominant mutations in circulating isolates. Despite its Fig. 7 Frequency of SNVs known to confer resistance against first and second-line treatment drugs in 21 clinical isolates characterized by WGS limited number of samples including in this study (24 isolates), it represents the initial effort to determine genetic diversity using WGS. Most isolates in our study were from Guayaquil (79.17%), one of the most economically important cities in Ecuador; the high number of tuberculosis cases detected in this city is possibly due to the high mobility from other provinces for trade or work in addition to the location of the leading health centers for the monitoring of this pathogen.
The potential of WGS to provide a quick and comprehensive view of the genotype and reliable prediction of the lineages has been extensively reported [5,7,14,40]; thus, its application in surveillance programs is necessary. The Euro-American lineage (lineage 4) presents a variable distribution in Argentina with a predominance of M and Ra strains [41], for Brazil, lineage 4 varies between 25-99% [13,42,43], for Colombia, between 85-99% [44][45][46], for Mexico between 69-98% [47,48] and for Peru around 90% [12,49,50]. Ecuador is not an exception, in fact, all the samples analyzed by us belong to the Euro-American lineage. This lineage in South and Central America and the Caribbean is represented mainly by the LAM, Haarlem, and T families sub-lineages [11,41,51]. In our study, most of the cases analyzed also corresponded to LAM (61.9%), followed by the Haarlem lineage (19.0%). This result provides genomic information from Ecuadorian isolates of Mycobacterium tuberculosis that contribute to the knowledge of the distribution of Mtb lineages in South America and confirms the European origin of the circulating strains in Ecuador [9,52,53]. Similar results have been described in some, but not all, previous studies using genotyping by 24-loci MIRU-VNTR, Spoligotyping, and SNP-PCR strategies in Ecuador [36,54].
The movement or conservation of genes in bacteria is a fundamental factor related to virulence, survival, and host adaptability, allowing them to co-evolve together [55]. Several studies in M. tuberculosis have shown that the ability of transmission within populations, latency periods, and drug resistance are due to genetic differences within the members of the M. tuberculosis complex and that many of its genes have been under positive selection at different periods [56]. Furthermore, the composition within its genome remains relatively stable due to the absence of horizontal gene transfer, and this superficial level of genetic variation is probably because these pathogens have a strongly clonal nature [57]. The highly conserved genetic content and extreme clonal nature suggest that M. tuberculosis should have a relatively high percentage of core genome, however, analysis of core and pangenome have shown more variability than expected [58][59][60][61]. A pangenome is a union of the entire genetic pool of several strains of a species under comparison, essentially consisting of a core genome containing genes and sequences shared in all strains and an accessory genome consisting of genes and sequences which may be absent from one or more strains and genes that are unique to each strain remaining. In our study, the genetic composition of 21 drug-resistant isolates was analyzed, and 3750 genes (79.2%) were identified in the core genome, considering it a relatively large core genome in line with previous studies in which few genomes were evaluated. Therefore, it is necessary to expand the number of isolates to determine exactly which are the essential genes in M. tuberculosis within those circulating in Ecuador.
Characterizing the protein's functionality in microorganisms of importance in public health is essential to understanding their pathogenicity, resistance to antibiotics, and virulence, which allow them to adapt or survive in their host. Our results showed that annotations of protein-encoding genes are mainly associated with metabolisms being Amino acids and their derivatives, Energy production and conversion, and lipid metabolism, the most representative. Similar studies in other M. tuberculosis populations also showed a high presence of genes related to energy production and conversion (C), amino acid transport and metabolism (E), and lipid transport and metabolism (I) [72][73][74]. This suggests that these genes are widely conserved to ensure the interactions of M. tuberculosis with its human host, especially in mycobacterial persistence, host pathogens struggle for nutrients and immune recognition. On the other hand, we were able to identify mainly virulence factors associated with the secretion system (58.96%), the most important being the transporter proteins of the ATP-binding cassette (abc) superfamily (25.17%) and the type vii or esx protein secretion system (t7ss) family (23.65%). It has been suggested that high levels of transporters are involved in cell detoxification, nutrient recycling, and antibiotics and drug efflux, significantly affecting the survival and development of multiple drug-resistant strains in Mtb [58,59,62,63]. It should be noted that the unique genes identified 2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase and Arylamine N-acetyltransferase were previously characterized as high-confidence drug targets [64][65][66].
All resistant isolates analyzed showed any variations in genes related to confer resistance, classified them in 71.43% of strains MDR (15/21), 19.05% pre-XDR (4/21), and 9.52% XDR (2/21). Both XDR strains presented similar mutations profile (8 non-synonyms and one deletion SNV in the coding region, 2 SNV in the intergenic region), but differing in additional Ser315Gly mutation in the katG gene arise to isolate S0013_Mtb_Ec, for the four pre-XDR isolates the S0202_Mtb_Ec was that more mutations were identified (9 genomic regions). It is brought up that of these six isolates of concern, five belong to women within the same age range.
The study of the M. tuberculosis resistance is an important factor within its surveillance plan in all countries, for which the timely detection of strains resistant to treatment drugs allows the application of an efficient drug strategy, reducing or limiting the incidence of new cases, especially when applying culture-based methods to detect it. In this study, we found high concordance between WGS and conventional culture-based DST in predicting phenotypic drug resistance to anti-TB drugs; In addition, WGS was able to determine resistance patterns that DST does not evaluate. These findings are in agreement with previous studies [40,67,68]. Drug resistance is mediated through mutations in specific gene targets. The key, therefore, is to identify single nucleotide polymorphisms (SNPs) that are responsible for or strongly associated with resistance. In the case of the first-line drug in Mtb treatment, mutations in the 81-bp rifampicin resistance determining region (RRDR) of the rpoB gene, also known as a hotspot region, have been accurate predictors of rifampicin resistance in many studies [69][70][71]. On the other hand, it has been established that INH resistance, predominantly mediated through loss of catalase-peroxidase activity via mutations in katG, produces high-level resistant strains [18,72,73]. Finally, in ethambutol, most resistance-related genes are located on the embB, embC, and upstream of the embA [74,75]. Meanwhile, previous studies have agreed that the genes rrs, rpsL, and gid are related to different levels of streptomycin resistance [76,77], and the chromosomal mutations in the quinolone resistance determining region of gyrA or gyrB are the main mechanism of resistance to fluoroquinolones in M. tuberculosis [78,79]. In our study, we found 57.1%, 70.0%, 47.6%, and 28.6% and 28.45% of mutations S450L (rpoB), S315T/G (katG), M306V/I (embB), K43R (rpsL) and c492t, a514c (rrs locus) related to rifampicin, isoniazid, ethambutol, streptomycin and fluoroquinolones resistance respectively. The increased resistance to fluoroquinolones, also recently reported in other studies [80], is of concern, probably caused by the increased administration of this drug without prescription [81]. It is therefore imperative to monitor the increase in these resistances in tuberculosis surveillance programs [82,83].
Whole-genome sequencing (WGS) is a powerful method for detecting drug resistance, genetic diversity, and transmission dynamics of M. tuberculosis. Despite their advantages, the analysis of genomic sequencing data remains an obstacle to the routine use of WGS technology in clinical tuberculosis because it requires bioinformatics expertise, high-performance computing, funding, and training that are not readily available in most clinical laboratories [84,85], which represents a significant challenge for TB control efforts. Over the last few years, many TB-specific genome browsers and WGS analysis tools such as TBProfiler, KvarQ, Mykrobe Predictor TB, CASTB, PhyTB, GenTB, PhyResSE, and others have been developed for genotyping and drug resistance identification, however, despite their use have not been widely used within antimicrobial resistance (AMR) surveillance program on low-and middle-income countries (LMICs) [86]. In our results, web-based tools enabled an effective and user-friendly identification of resistanceassociated mutations in M. tuberculosis. The recently developed SAM-TB platform [21] is the most capable of determining the mutations present in the isolates of our study, which, when used in conjunction with web-based tools Mykrobe and PhyReSse, would allow rapid screening of M tuberculosis isolates. These tools could easily be implemented in surveillance programs based on microbiological procedures to obtain results efficiently and cheaply [84,87,88].
Given that the sample size used is small and could be underestimating the real frequencies of mutations in Mycobacterium tuberculosis present in this region, it becomes imperative to make greater efforts -and use the advantage of WGS-to analyze a larger number of samples to identify and study the mutations present in patients with tuberculosis including other regions of Ecuador.

Conclusion
The findings of this study demonstrate the usefulness of applying next-generation sequencing tools such as WGS to characterize mutations and describe the existing variability in tuberculosis strains that allow adequate monitoring to generate health policies. We identified the variability of resistant strains circulating among men and women in Ecuador and showed that mainly strains of the American-European lineage 4 circulate, with a higher proportion of the LAM sub-lineage. XDR/MDR strains are not associated with a specific lineage, region or other metacharacter analyzed. However, we found an association between sex and resistance, which should be verified with further sampling. Within the observed resistances, an increase in fluoroquinolone resistance is evident, which should be monitored. Additional sampling is needed to determine the total variability and associations with the metadata obtained to generate better health policies.
Although M. tuberculosis is considered a highly conserved clonal species, among the 21 strains analyzed, we found unique genes specialized in lipid metabolism that are particularly interesting for identifying potential new drug targets.

Methods
Twenty-four clinical isolates were sampled and collected from private laboratories between March to December 2020; these isolates had been previously characterized as multidrug-resistant through drug susceptibility testing. The resistance pattern were performed for first-line anti-tuberculosis drugs, isoniazid (H), rifampicin (R), pyrazinamide (Z), ethambutol (E), and one second-line drug (streptomycin (S)), the proportional method by [89] on Lowenstein-Jensen medium and the resistance profile was defined according to WHO recommendations. The critical concentrations of anti-TB drugs evaluated were as follows: rifampicin, 40.0 μg/ml; isoniazid, 0.2 μg/ ml; ethambutol, 0.4 μg/ml; streptomycin, 4.0 μg/ml, and 200 µg/mL for pyrazinamide.

DNA extraction and sequencing
Genomic DNA was extracted from 24 isolates of Mycobacterium tuberculosis grown in a Lowenstein-Jensen medium using the CTAB method [90]. The quantity of isolated DNA was measured using a Qubit 4.0 fluorometer (Invitrogen, Carlsbad, CA, USA.). DNA samples that fulfilled the integrity, purity, and quantity standards were sequenced. Genomic DNA libraries were prepared for whole genome sequencing using the Tagmentation-based library prep kit according to the manufacturer's instructions (Illumina Inc., San Diego, CA) and sequenced on an Illumina MiniSeq platform with High Output Reagent Kit, producing 150 bp paired-end reads.

Bioinformatic analyses
Using the Galaxy platform (https:// usega laxy. org/), reads were classified by Kraken version 2 [91] to detect any contamination or the presence of other mycobacteria; In addition, FastQC version 0.11.9 [92] and Trimmomatic version 0.38 [93] were used to control the quality and trim the low-quality ends of the reads, respectively. In particular, a sliding window was used to trim sequences with an average quality value lower than 20.

Genome assembly
The final high-quality reads were assembled using Megahit version 1.1.3 [94]. The assembly was carried out with a kmer range from 29 to 141 at intervals of 20 and 800 bp as minimum contig size. To perform contig taxonomic assignment, we used Kaiju version 1.9.0 [95,96]. For generating metrics and evaluating the quality of the assemblies using M. tuberculosis strain HR37v (NC_000962) as a reference genome, Quast version 5.0.2 was used [97].

Gene prediction and functional annotation
The structural and functional annotation was performed with Prokka version 1.12 [98] and the "Rapid Annotation using Subsystem Technology (RASTtk server)" pipeline online [38], and the predicted proteins were also annotated for the identification with the Clusters of Orthologous Groups (COG) by eggNOG Mapper v2.1.7 (http:// eggnogmapper. embl. de/). For the identification of CRISPRs, we used the webserver of CRISPRFinder (https:// crispr. i2bc. paris-saclay. fr/ Server/). Analysis of virulence-related proteins, potential drug targets, and Transporter proteins by Virulence Factor Database (VFDB), DrugBank, and Transporter Classification Database (TCDB) was performed in the Pathosystems Resource Integration Center (PATRIC) online (https:// www. patri cbrc. org/).

Variant calling
Processed reads were mapped with Bowtie2 [99] using H37Rv (NCBI ID: NC_000962.3) as a reference genome. Sequence alignment files were sorted and indexed with Samtools v0.1.19 [100]. Bcftools [101] were used for calling variants, and vcftools [102] were used to filter the raw variants (minimum quality score of 30, minimum depth 10). From high-quality variants annotated for 21 genomes, the effect and impact were determined by SnpEff [103].

Pangenome analyses
The pangenome was analyzed with Roary [37] and organized as follows: core, soft-core, shell, and cloud genes if the genes were presented in all, 20, 3-19, and 1-2 samples, respectively; the characteristic curves of the Mtb pangenome, the core-genome, and the new genes were depicted using the pangenome Profile Analyze Tool (PanGP) [58,104,105]. Alignments for the core genes Multi Locus Sequence Typing (cgMLST) were used for generating a phylogenetic tree using Ridom SeqSphere v8.3.1 [106]. The output tree was visualized and annotated using the online tool iTOL [107].