Genome-wide analysis of Mycobacterium tuberculosis polymorphisms reveals lineage-specific associations with drug resistance

Background Continuing evolution of the Mycobacterium tuberculosis (Mtb) complex genomes associated with resistance to anti-tuberculosis drugs is threatening tuberculosis disease control efforts. Both multi- and extensively drug resistant Mtb (MDR and XDR, respectively) are increasing in prevalence, but the full set of Mtb genes involved are not known. There is a need for increased sensitivity of genome-wide approaches in order to elucidate the genetic basis of anti-microbial drug resistance and gain a more detailed understanding of Mtb genome evolution in a context of widespread antimicrobial therapy. Population structure within the Mtb complex, due to clonal expansion, lack of lateral gene transfer and low levels of recombination between lineages, may be reducing statistical power to detect drug resistance associated variants. Results To investigate the effect of lineage-specific effects on the identification of drug resistance associations, we applied genome-wide association study (GWAS) and convergence-based (PhyC) methods to multiple drug resistance phenotypes of a global dataset of Mtb lineages 2 and 4, using both lineage-wise and combined approaches. We identify both well-established drug resistance variants and novel associations; uniquely identifying associations for both lineage-specific and -combined GWAS analyses. We report 17 potential novel associations between antimicrobial resistance phenotypes and Mtb genomic variants. Conclusions For GWAS, both lineage-specific and -combined analyses are useful, whereas PhyC may perform better in contexts of greater diversity. Unique associations with XDR in lineage-specific analyses provide evidence of diverging evolutionary trajectories between lineages 2 and 4 in response to antimicrobial drug therapy. Electronic supplementary material The online version of this article (10.1186/s12864-019-5615-3) contains supplementary material, which is available to authorized users.


Background
Despite clonal expansion and a lack of lateral gene transfer in Mycobacterium tuberculosis (Mtb), the evolution of drug resistance is threatening tuberculosis disease (TB) control efforts. Resistance to all anti-Mtb drugs has been observed, usually evolving relatively shortly after their introduction. Drug-resistant TB is phenotypically categorised as multi-drug resistant (MDR) when resistant to two first-line drugs, rifampicin and isoniazid; extensively drug-resistant (XDR) occurs when MDR Mtb have additional resistance to fluoroquinolones and at least one second-line injectable. Only 50% of patients receiving treatment for MDR TB, globally, were successfully treated in 2014 [1].
De novo emergence of drug resistance has been observed, with the presence of multiple unfixed drug-resistance mutations and selective sweeps in Mtb populations within patients [2][3][4]. Additionally, transmission of resistant strains is frequently observed [5,6]. Indeed, many mutations associated with antimicrobial resistance have been identified [7], some have been associated with no fitness cost and others with additional compensatory mutations that may increase fitness and enable transmission [8]. These polymorphisms include both point mutations, for example, single nucleotide polymorphisms (SNPs) such as in rpoB [9] and structural variants such as the dfrA-thyA double deletion linked to para-aminosalicylic acid resistance [10]. Genes involved in resistance to some drugs are well known; for example, mutations for rifampicin (in rpoB and rpoC) and isoniazid (in katG) are well characterised [7]. However, the mechanisms for ethambutol (embB), pyrazinamide (pncA) and second line drug resistance are not fully known. As whole genome sequencing of Mtb becomes more routinely applied [11], association approaches using genomic variation have the potential to provide new insights into these resistance mechanisms. Compensatory mutations such as those in rpoA and rpoC, associated with the rpoB rifampicin resistance mutations, have been associated with transmission of drug resistant strains [12]. Furthermore, as patients receive a cocktail of anti-Mtb drugs, multiple concomitant resistance can arise naturally, and this complicates the analysis of phenotype-genotype relationships [13].
The genome-wide association study (GWAS) approach has been widely used in human genetics; for example, to identify variants in the class II human leukocyte antigens (HLA) region associated with susceptibility to TB infection [14]. However, it is increasingly being applied to pathogen research and shows great promise [13,15,16]. It allows the identification of variants across the genome, associated with specific phenotypes. In order to prevent spurious associations, pathogen GWASs face the need to deal with the much higher levels of population structure seen in bacteria compared to humans, whilst maximising sensitivity [17,18]. This is especially important for Mtb due to its clonality. This clonality is consistent with a phylogenetic tree structure and thus has led to the application of convergence-based methods, which have identified resistance mutations in Mtb [13,19]. Such methods seek to identify convergent evolution in genetically diverse strains with similar resistance phenotypes. This happens when mutations in the same gene or nucleotide position occur repeatedly and independently become fixed, thus signaling their positive selection for a particular phenotype.
However, there remain questions as to the importance of historic genetic background variation in the evolution of drug resistance, such as between Mtb lineages, which have not been systematically explored [20]. The Mtb complex is categorised into seven lineages, defined on the basis of molecular typing, which are endemic in different locations around the globe. These lineages are known to have other distinctive features, with some persisting in geographical regions (lineages 5 and 6 in West Africa) and others spreading across continents (lineage 2-East Asian and lineage 4 -Euro-American strains). This observation has led to the hypothesis that the strain-types are specifically adapted to people of different genetic backgrounds [21]. These lineages may vary in their propensity to transmit, their virulence, site of infection and ultimately propensity to cause disease [22][23][24], but results are inconsistent and there is considerable inter-strain variation within lineages [25,26]. Recent research into lineage 4 alludes to this variation, suggesting different evolutionary strategies are employed by different sublineages [27]. A set of single nucleotide polymorphisms (SNPs) has been identified that can be used to barcode sub-lineages [28], leading to informatic tools that position sequenced samples within a global phylogeny [29]. Thus, lineage-based genetic differences may also be important in resistance adaptations to anti-Mtb drug exposure.
The current study applies lineage-specific and lineagecombined GWAS, alongside convergence-based PhyC methods, to gain insight into lineage-specific drug resistance evolution. We focus on the modern lineage 2 and lineage 4 isolates, which are known to be drug resistant globally, and use a large dataset involving Mtb isolate sequences from more than 12 countries (n > 4400).

Genomic variants and population structure
High quality SNP and insertion and deletion (indel) variants were characterised in relation to the H37Rv reference genome, from raw sequence data from a convenience sample of existing data for isolates in lineages 2 (n = 702) and 4 (n = 3706). These isolates are within a global drug resistance data set [13], which has been further complemented by additional phenotypic data (see Methods). After removing variants that are monomorphic within each dataset, the final lineage-combined dataset consisted of 157,726 SNPs, 5998 deletions and 2926 insertions across the 4408 isolates (see Additional file 1). The median number of SNPs per sample in the lineage 2 dataset, after removing monomorphic variants, was 332 (range: 189-386) and in lineage 4 was 724 (range: 10-870) (significant difference between lineages with Wilcoxon test p-value < minimum calculable (2.2 × 10 − 16 )). Lineage 4 contains the H37Rv reference strain, but also has increased strain-type diversity [13,28]. The median number of indels per sample in lineage 2 was 31 (range: 7-42) and in lineage 4 was 40 (range: 2-61) (significant difference between lineages Wilcoxon test: p-value < minimum calculable (2.2 × 10 − 16 )) (see Additional file 1). The majority of variants were rare, with 75% of them found to have a non-reference variant frequency (defined as the number of isolates with a non-reference allele at a specific variant position divided by the total number of isolates with a non-missing allele at this position) of less than 0.0028 and 0.00054 in lineages 2 and 4, respectively (see Additional file 1 and Additional file 2). A principal component analysis (PCA) using the variants revealed the expected clustering by lineage and greater diversity within lineage 4 (see Additional file 3). Within lineage 2, the first 10 principal components account for 71.9% of the variation (see Additional file 3 and Additional file 4) and the mean pairwise variant distance was 1074 (range: 0-6270) (see Additional file 3). Within lineage 4, the first 10 principal components account for 88.9% of the variation (see Additional file 3 and Additional file 4) and the mean pairwise variant distance was 1458 (range: 0-11,780) (see Additional file 3). There are 567 isolates with < 10 variants different from at least one other isolate, indicative of potential transmission events, which can confound an association analysis. A phylogenetic tree constructed using the variants mimicked the relationships observed in the PCA, with isolates clustering by sublineage on both (see Additional file 3 and Fig. 1).

Drug resistance phenotypes
Overall, analyses were conducted for 17 drug resistance phenotypes, including for 12 individual drugs and 5 composite phenotypes. The 12 individual drug resistance phenotypes with frequency of resistance ranging from 3.3% (MOX in lineage 4) to 43.0% (STM in lineage 2), and the composite phenotypes of MDR (lineage 2 35.7%; lineage 4 9.5%) and XDR (lineage 2 9.9%; lineage 4 1.2%). The combined second-line drug resistance phenotypes for resistance to any fluoroquinones (FQ) and resistance to any aminoglycosides (AG) were also considered (see Additional file 5). The completeness of drug-resistance phenotype data is variable. Rifampicin was the most tested for (tested for in 92.0% of isolates); while ciprofloxacin was the least (tested for in 4.2% of isolates) (see Additional file 6). Furthermore, there is evidence of multiple concomitant resistance with 44.1% of MDR isolates also resistant to ethambutol.

Convergence-based analyses, variant-based GWAS and locus-based identified known resistance conferring variants
We performed convergence-based analyses (PhyC), GWAS across loci (locus-based) and GWAS on individual variants (variant-based). Each were conducted in a lineage-specific and lineage-combined manner. Due to the close relatedness between some samples, for the GWAS analyses, we applied specialized regression models with random effects that have been implemented in a human setting to handle "cryptic relatedness" [13] (see Methods).
Locus-based GWAS identified 23 different loci (see Table 2, Fig. 2, Additional file 7). Fourteen such loci were identified by locus-based GWAS exclusively; of these 14 loci, gid is known to be involved in streptomycin resistance and inhA is known to be involved in isoniazid and ethionamide resistance [30,31] (see Additional file 8). Variant-based GWAS identified eleven variants in nine different loci. No known associations were identified by variant-based GWAS exclusively; however, three novel associations were identified (RV0197, recF, argJ) (see Table 3, Additional file 8). Three loci were identified by locus-based GWAS and PhyC but not variant-based GWAS: pncA (pyrazinamide), embC-embA and embB (ethambutol) (see Fig. 3a and b, Additional file 8).
Effects of lineage-specific analysis on identifying known resistance associated variants Lineage 2 specific Overall, for locus-based GWAS analyses across the 16 phenotypes, two loci were identified exclusively to lineage 2 specific analyses; rrs (KAN; p-value = 1.40 × 10 − 22 ) and Rv3128c-Rv3129 (MDR; p-value = 7.4 × 10 − 22 ) (see Fig. 2a). For locus-based GWAS, pncA was found in association with XDR exclusively, however for lineage 4 pncA was found in association with PZA exclusively; greater variation was found in the pncA locus for lineage 2 (see Fig. 3c and d). For the variant-based GWAS analyses there were no lineage 2 exclusive associations. Furthermore, no lineage 2 exclusive associations were identified by PhyC analyses.

Lineage 4 specific
Overall, for the locus-based GWAS analyses, seven loci were identified exclusively by lineage 4 specific analyses (inhA, fadB4-Rv3142c, tuf, cut5b-Rv3725, Rv3007c, Rv2668, moeX) (see Fig. 2b). All of which were found in significant association with the XDR phenotype. For locus-based GWAS, gid was identified in association with streptomycin by lineage 4 specific analyses andcombined analyses but not lineage 2 specific analyses; there is greater variation within the gid locus for lineage 4 (see Fig. 3e and f). The variant-based GWAS analyses identified no lineage 4 exclusive analyses. Moreover, no lineage 4 exclusive associations were identified by PhyC analyses.

Novel resistance-associated variants identified
Across all analyses, we report 17 potentially novel associations between antimicrobial resistance and genomic variants in Mtb; 7 such associations were identified exclusively by lineage-specific analyses (see Tables 1, 2, 3). Twelve were identified by locus-based GWAS, three were identified by variant-based GWAS and two were identified by PhyC. All novel associations identified by GWAS were found in association with the XDR phenotype. There was no overlap in novel associations identified between methods.

Discussion
Our results highlight that lineage specific analyses are able to provide new insights into genetic associations with drug resistance phenotypes, despite a smaller sample size than a pan-lineage approach. Lineage specific associations were found within lineage 2, such as the novel association between Rv3128c-Rv3129 and MDR. We also identified lineage-specific novel associations within lineage 4, such as the association between fadB4-Rv3142c and XDR. This indicates biological differences between these lineages with respect to drug resistance and perhaps in evolutionary trajectory. Novel associations specific to combined analyses indicate convergent evolution between lineages 2 and 4 at the same loci, with variant frequency too low for lineage-specific analyses to detect, that would most likely be detected in larger scale combined analyses (as previously described 13 ). Lineage-specific GWAS is complementary to lineage-combined approaches, with their application in tandem potentially improving the power to detect Mtb genomic variants evolving under differing evolutionary dynamics.
Overall, despite conservative significance thresholds based on permutation, 17 potential novel associations were identified between antimicrobial resistance and Mtb loci and thus warrant experimental validation. For GWAS, 15 novel associations were identified, one in relation to the MDR phenotype and 14 in relation to the XDR phenotype; 7 were lineage specific. This might suggest an evolutionary shift amongst XDR strains. It may be feasible to consider XDR as a highly complex phenotype encompassing transmissibility [32]; unless evolution of XDR from pan-susceptible strains frequently happens within one patient, it is likely that XDR strains have gone through numerous cycles of active disease, transmission and treatment within recent history. The fact that many of these associations are lineage specific lends weight to such a hypothesis, suggesting differing evolutionary trajectories between lineages 2 and 4. Genetic drift might contribute to such divergence; there are numerous bottlenecks during the natural infectious cycle for Mtb, driven by host immune system, anti-TB drug therapy and transmission [33].
Some of the novel associated variants may be involved directly in drug resistance such as hadA, whose gene product, similar to InhA, is involved in fatty acid synthesis type II (FAS-II)) and thus may be involved in isoniazid resistance [34,35]. One of the novel associated loci, Rv0197, identified here by variant-based GWAS in association with XDR, was previously identified through PhyC in association with a transmissibility phenotype [36]. EspE was identified by this previous analysis also [36], and it remains possible that the espE-espF intergenic region, identified here by locus-based GWAS in    association with XDR, may be related by regulation to espE. Additionally, both espE-espF and whiB6-Rv3863 have been linked to Esx-1 which has been implicated in virulence regulation. The WhiB6-Rv3863 intergenic region, which was also identified through previous PhyC analyses including our dataset [13], may additionally be linked to the DosR regulon. This regulon is composed of 48 co-regulated genes and is considered essential for persistence of latent Mtb [37][38][39][40]. Interestingly, the whiB6-Rv3863 variant identified shows a markedly different distribution between lineages 2 and 4, showing greater frequency in lineage 2 (see Fig. 1). Apart from Rv0197, a further two variant-based GWAS SNPs were identified (recF and argJ), however both are synonymous variants. These may be examples of background variants 'hitchhiking' alongside causal variants, or may play a biological role. Notably, a number of identified loci are potentially involved in molybdenum cofactor biosynthesis; Rv3115-moeB2, moeX [41], and binding) (Mycobrowser). Molybdenum cofactor is found in molybdenum enzymes, which are responsible for a number of functions from dormancy regulation to energy source metabolism [41,42]. Interestingly, these three loci were each identified by a different analyses type; variant-based GWAS, locus-based GWAS and PhyC, respectively. Functional studies may be useful in providing further insight into the role of variants identified here.
Recognizing that drug resistance phenotypes may be subtly different, depending on the genetic background of the strain, could be important and might relate directly to drug resistance, or to fitness more broadly, such as through increased virulence and transmission. With the recognition of XDR transmission [36,43], our study suggests that further critical information on lineage and transmission clustering (obtained from the genome sequence) would also be important to determine the full impact of specific mutations, that might lead to further phenotypic descriptions related to transmission, virulence and degree of drug resistance.
The results show the differing evolutionary insights offered by locus-and variant-based GWAS, and convergence-based methodologies. Both variant-based and locus-based GWAS led to unique loci being identified. The rrs locus was found in lineage 2 only locus-based GWAS analyses, but for both variant-based GWAS and PhyC analyses, rrs was identified in both lineage-specific and lineage-combined analyses. Neutral variation within the rrs gene may be diluting the signal from causal drug resistance variants in the lineage 4 locus-based GWAS analysis.
inhA was not identified by variant-based GWAS or PhyC, only lineage 4 specific locus-based GWAS. A sub-type of the Portuguese Lisboa (lineage 4) strain is known to have inhA markers involved in isoniazid resistance [44], and a different mechanism to other lineages. Whilst inhA was not identified by lineage-combined GWAS, it is notable that Rv1482c-fabG1 and katG were; both these loci also play a role in isoniazid resistance, suggesting different mechanisms of resistance to these drugs between lineage 2 and lineage 4. In cases where drug resistance is driven by rare variants and genetic heterogeneity exists within a single gene, such as in pncA, where multiple alleles can cause pyrazinamide resistance, locus-based analyses may be more powerful. Indeed, pncA was identified here by locus-based GWAS but not variant-based GWAS.  success of PhyC in detecting antimicrobial resistance associated variants is determined by the magnitude of convergent evolution within the Mtb population under question [19]. Indeed, there were important differences between the GWAS and PhyC results outlined here. These differences might provide insight into the relative importance of within patient evolution of antimicrobial resistance versus transmission of antimicrobial resistant strains. In instances where a mutation is highly transmissible and consequently increases in frequency with only one or few mutation events, it might be expected that GWAS would be a more powerful analytical tool, due to the lack of convergent-evolution.
It is notable that lineage 2 had a smaller sample size than the lineage 4 dataset, this may contribute to the greater sensitivity in lineage 4 specific analyses. In order to assess the extent to which the lower significance levels in the lineage 2 GWAS were as a result of smaller sample size in comparison to lineage, it would be interesting to repeat the GWAS analyses with a larger and perhaps more geographically spread lineage 2 dataset. Additionally, statistical power is potentially limited in the current analyses by low resolution phenotypic data, with not all drugs tested on all samples, primarily due to second line drugs only being tested where there is multidrug resistance. For example, for lineage 2 there were only 8 resistant and 120 susceptible isolates for moxifloxacin. Despite this, the most significant gene-based GWAS result for lineage 2 was for gyrA, identified in relation to moxifloxacin resistance, showing the sensitivity of the method. Nevertheless, to identify variants with smaller effect sizes, increased phenotypic resolution may prove useful. Further work could explore the use of minimum inhibitory concentration values, where available, being incorporated into resistance phenotypes.

Conclusions
In summary, GWAS and PhyC are sensitive, robust and complementary methodologies in examining evolution of antimicrobial resistance in Mtb. Within GWAS analyses, locus-based and variant-based approaches are both useful and complementary, as are lineagecombined and lineage-specific analyses. These different methodological approaches can be used to detect different evolutionary dynamics and thus their similarities and differences are informative. Evidence presented here suggests the importance of lineage-specific paths of evolution towards drug resistance in Mtb. It will be interesting to see how methodologies outlined here might apply to other Mtb lineages and other pathogen species in an anti-microbial resistance context, or indeed in relation to other phenotypes of interest such as transmissibility.

Isolates, phenotypic methods, sequencing and variant calling
The raw sequence data used here (n = 4408) form part of a subset of a larger dataset (n = 6465), which represents multiple populations from different geographic areas (see Additional file 9), and is described elsewhere [13]. In particular, only lineages 2 (n = 702) and 4 (n = 3706) from the larger dataset are used, with additional phenotypic data for the samples collected in Portugal. Drug resistance phenotypes were available for amikacin, capreomycin, ciprofloxacin, ethambutol, ethionamide, isoniazid, kanamycin, moxifloxacin, ofloxacin, pyrazinamide, rifampicin, streptomycin, resistance to any fluoroquinolone; levofloxacin, moxifloxacin, ciprofloxacin or ofloxacin (FQ), resistance to any of the aminoglycosides; kanamycin, amikacin, or streptomycin (AG), combined isoniazid and rifampicin resistance, but not XDR (MDR), MDR plus resistance to a fluoroquinolone (ciprofloxacin, levofloxacin, moxifloxacin) and to a second line injectable (amikacin, kanamycin, capreomycin) (XDR), and pan-susceptible, susceptibility to rifampicin and isoniazid plus no other known resistance (PAN). Isoniazid, rifampicin, ethambutol, streptomycin and pyrazinamide are first-line drugs. Amikacin, capreomycin, ofloxacin, para-aminosalicylic acid, moxifloxacin and cycloserine are second-line drugs. Samples found to be MDR, underwent testing for second-line drugs. Para-aminosalycylic acid, levofloxacin, rifabutin and cycloserine resistance phenotypes were excluded from analyses due to lack of data. Where present, levofloxacin data was used in defining the aggregate phenotypes of FQ; however, there was not enough levofloxacin phenotypic data to use in individual drug-resistance analyses.
All samples underwent Illumina sequencing generating paired-end reads of at least 50 bp with at least 50-fold average genome coverage. The raw sequence data were aligned to the H37Rv reference genome (Genbank accession number: NC_000962.3) using the BWA mem algorithm [45]. The SAMtools/BCFtools [46] and GATK [47] software was used to call SNPs and small insertions or deletions (indels) using default options. The overlapping set of variants from the two algorithms was retained for further analysis. Alleles were additionally called across the whole genome (including SNP sites) using a coverage-based approach [16,28]. A missing call was assigned if the total depth of coverage at a site did not reach a minimum of 20 reads or none of the four nucleotides accounted for at least 75% of the total coverage. The final dataset consisted of 157,726 SNPs, 2926 insertions and 5998 deletions across the 4408 isolates. Monomorphic variants within each of the three datasets ('lineage 4-specific' , 'lineage 2-specific' and 'lineages 2 and 4 combined') were removed. e a Tecnologia, Portugal, through the grants UID/Multi/04413/2013 (DM and MV). The funding bodies played no role in design of the study, collection, analysis and interpretation of data or in writing the manuscript.

Availability of data and materials
The analysis was performed on raw sequencing data available from the European Nucleotide Archive (ENA) (https://www.ebi.ac.uk/ena) under the following study accession numbers; PRJEB10385, ERP006619, ERP002611, ERP000192, SRP018402 and ERP008770, as utilized in Coll et al. 2017 [13].
Authors' contributions TC and MH conceived and directed the project. JPh generated the sequencing dataset. YO performed bioinformatic and statistical analyses under the supervision of TC and MH, and wrote the first draft of the manuscript. JPe, DM, AM, IP and MV contributed protocols and data. All authors commented and edited on various versions of the draft manuscript. All authors compiled and approved the final manuscript.
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.