Skip to main content
  • Research article
  • Open access
  • Published:

Genomic epidemiology of Lineage 4 Mycobacterium tuberculosis subpopulations in New York City and New Jersey, 1999–2009



Whole genome sequencing (WGS) has rapidly become an important research tool in tuberculosis epidemiology and is likely to replace many existing methods in public health microbiology in the near future. WGS-based methods may be particularly useful in areas with less diverse Mycobacterium tuberculosis populations, such as New York City, where conventional genotyping is often uninformative and field epidemiology often difficult. This study applies four candidate strategies for WGS-based identification of emerging M. tuberculosis subpopulations, employing both phylogenomic and population genetics methods.


M. tuberculosis subpopulations in New York City and New Jersey can be distinguished via phylogenomic reconstruction, evidence of demographic expansion and subpopulation-specific signatures of selection, and by determination of subgroup-defining nucleotide substitutions. These methods identified known historical outbreak clusters and previously unidentified subpopulations within relatively monomorphic M. tuberculosis endemic clone groups. Neutrality statistics based on the site frequency spectrum were less useful for identifying M. tuberculosis subpopulations, likely due to the low levels of informative genetic variation in recently diverged isolate groups. In addition, we observed that isolates from New York City endemic clone groups have acquired multiple non-synonymous SNPs in virulence- and growth-associated pathways, and relatively few mutations in drug resistance-associated genes, suggesting that overall pathoadaptive fitness, rather than the acquisition of drug resistance mutations, has played a central role in the evolutionary history and epidemiology of M. tuberculosis subpopulations in New York City.


Our results demonstrate that some but not all WGS-based methods are useful for detection of emerging M. tuberculosis clone groups, and support the use of phylogenomic reconstruction in routine tuberculosis laboratory surveillance, particularly in areas with relatively less diverse M. tuberculosis populations. Our study also supports the use of wider-reaching phylogenomic and population genomic methods in tuberculosis public health practice, which can support tuberculosis control activities by identifying genetic polymorphisms contributing to epidemiological success in local M. tuberculosis populations and possibly explain why certain isolate groups are apparently more successful in specific host populations.


Tuberculosis (TB) epidemiology in New York City has undergone dramatic changes since the resurgent TB epidemic of the 1990s, when over 3000 cases were reported each year between 1991 and 1994, many in outbreak clusters among vulnerable populations [1]. TB incidence is now at an all-time low (7.2 cases per 100,000 people in 2014) [2], outbreak clusters have become increasingly rare, and so-called endemic clones have become a major source of new TB infections in the US-born population [3, 4].

Genotyping of Mycobacterium tuberculosis (M. tuberculosis) clinical isolates is a cornerstone of TB control in New York City. However, conventional genotyping methods (including restriction length fragment polymorphism typing, spoligotyping, and mycobacterial interspersed repetitive Units (MIRU) typing), interrogate less than 0.01% of the approximately 4 Mb M. tuberculosis genome and thus lack the discriminatory power to detect small-scale genetic differences within closely related populations. In these situations, genotyping will often yield little if any useful information, even in isolates with wide geographic distribution and long epidemiological histories in a given population [3].

Whole genome sequencing (WGS) directly overcomes these limitations and has rapidly become an important, if not central, research tool in TB epidemiology: WGS-based studies have detected previously unknown outbreak clusters among isolates with identical MIRU-VNTR types [5, 6] and identified so-called super-spreaders responsible for multiple secondary infections in the community [7]. In addition, an expanding body of work has employed WGS data to address a wide-reaching set of previously uninvestigated questions in M. tuberculosis evolution and population genomics [811].

Next-generation WGS technologies have markedly decreased per-isolate sequencing costs, and are expected to replace many current modalities in public health microbiology [12, 13]. Specific applications of interest for TB control include rapid drug resistance typing, locating cryptic outbreak clusters and transmission hotspots not identified via field epidemiology, and identification and tracking of novel M. tuberculosis strains in the community. SNP-distance based strategies have proven useful for identifying recent TB transmission [5] and WGS data has allowed for unprecedented phylogenetic resolution between and within M. tuberculosis subpopulations. Population genomics studies in both M. tuberculosis and other pathogens have established important linkages between the evolutionary and epidemiological histories of endemic and/or emerging pathogen subpopulations [14, 15]. Specifically, emerging M. tuberculosis subpopulations are expected to exhibit low sequence diversity, an excess number of high frequency derived alleles, and potentially harbor strain-specific patterns of positive or purifying genomic selection.

Multiple M. tuberculosis strains have emerged from New York City and neighboring New Jersey (NYC-NJ) over the last two decades. For example, M. tuberculosis isolates from the S75 group, a low-IS6110 copy number strain first identified in New Jersey, USA [16] in 2002, circulate within the NYC-NJ area, predominantly among HIV-positive and homeless populations [17]. The drug-susceptible C strain was first reported in NYC, where it has caused outbreaks among at-risk populations and sporadic cases in the general population, and then spread widely across the United States [3]. Both C and S75 strains belong to M. tuberculosis Lineage 4, the most widely distributed and successful of the six M. tuberculosis global phylogeographic lineages [18] and the most prevalent lineage in the New York City area.

This study uses WGS data from TB isolates collected in New York City and New Jersey between 1999 and 2009, applying both phylogenomic and population genomics methods to identify epidemiologically-relevant subpopulations within this relatively monomorphic local population. These methods identify previously known subpopulations (including S75) retrospectively, suggest useful measures for prospective and real-time identification of newly emerging isolate groups, and yield additional information on adaptation and epidemiological success in M. tuberculosis isolates endemic to New York City.


Mycobacterium tuberculosis isolates

Seventy one total M. tuberculosis full genome sequences were included in this study. Fourty-seven isolates from Lineage 4 were included: 32 isolates from TB cases occurring in New York City and New Jersey between 1997 and 2009, including 9 S75 isolates; 9 additional clinical isolates from Sub-Saharan Africa [19, 20] and North America [21]; and 6 well-characterized laboratory strains [20, 2224]. Two additional isolates from New York City, from Lineage 1 and Lineage 3, were also sequenced for this study. Sequence data for 19 additional non-L4 isolates, plus 3 isolates from the M. africanum-like Lineage 6 and the outgroup M. bovis, were obtained from publicly available sources (Additional file 1: Table S1).

Sequencing, alignment, and SNP calling

WGS data were obtained for 34 previously-unsequenced M. tuberculosis clinical isolates (Table 1). Isolates were cultured on Löwenstein-Jensen slants and grown at 37 °C for 3–5 weeks. Sequencing libraries were prepared using TruSeq DNA or Nextera DNA preparation kits (Illumina, San Diego, CA). Raw sequencing reads were generated on the Illumina HiSeq 1000 platform and aligned to the H37Rv reference genome (NC_000962.2) using the Burrows-Wheeler Aligner [25]. Genome assemblies for all isolates were deposited in the NCBI Genbank database (accession numbers are listed in Table 1). All isolates had reads covering >99% of the reference genome, and the lowest mean coverage depth for any isolate was 27×. SNPs were called using a PHRED-scaled quality threshold of 40 (Samtools v0.1.19 [26]) and annotated using snpEff v4 [27]. We excluded from analysis all variants occuring within PE and PPE genes, a family of highly repetitive, GC-rich M.tuberculosis genes in which recombination has been observed [28].

Table 1 Characteristics of the isolates sequenced in this study

Availability of data and materials

The dataset supporting the conclusions of this article is available in the NCBI Genbank repository (, BioProject: PRJNA288586) and supporting sequence alignments and phylogenetic tree data are available on TreeBASE.

Phylogenetic reconstruction

Phylogenetic trees were estimated using maximum likelihood methods in the POSIX-threads build of RAxML v8 [29]. Node robustness was assessed with 1000 bootstrap pseudoreplicates and a consensus network was calculated [30] as implemented in SplitsTree v4.3.1 [31]. A custom Perl script was used to identify SNPs with alleles unique to a given lineage or subpopulation.

Neutrality statistics and selection analysis

Neutrality statistics (including Tajima’s D, Fu and Li’s D and F, Ramos-Onsins and Rozas’s R 2 , and Fay and Wu’s H) were calculated in DnaSP v5.10.1 [32] with statistical significance assessed with 10,000–50,000 coalescent simulations. Fay and Wu’s H is particularly useful for distinguishing whether a given departure from neutrality is attributable to recent population expansion or a selective sweep [33]. The gene-wise ratio of the nonsynonymous substitution rate to the synonymous substitution rate (dN/dS) was estimated for every gene in the M. tuberculosis genome across all phylogenetic branches using the branch-site random effects likelihood (BSREL) model as implemented in HyPhy v2 [34, 35]. This model tests for branch-specific instances of episodic diversifying selection on every internal and terminal branch on the phylogenetic tree (in this case for every single gene fitted on the phylogenomic tree) and, following a likelihood-ratio test and Holm’s correction for multiple tests, detects branches on which a proportion of the codons have evolved under a dN/dS ratio that is significantly different from that of the rest of the codons. The advantage of this model over other so-called branch-site models is that it does not constrain the tree on either sides of the branch being tested to be subject to diversifying selection (foreground branches) and purifying selection (background branches).


Population structure and genetic diversity

Maximum likelihood phylogenomic reconstruction based on 14,601 quality-filtered SNPs recovered primary phylogeographic Lineages 1–6 and identified at least six distinct subpopulations within L4 isolates, including S75. Nucleotide diversity (π, the mean number of pairwise nucleotide differences per site [36]) ranged from 1.5E-5 to 1.7E-4 (Table 2), consistent with prior estimates of genetic diversity within coding regions of the M. tuberculosis genome [19]. S75 strains were separated from any other L4 isolate by at least 143 SNPs and exhibited lower nucleotide diversity and lower mean pairwise SNP distances between isolates (66.4 vs. 392.8 SNPs for NYC isolates not in the S75 cluster).

Table 2 Genetic diversity and neutrality test statistics by lineage (L1-L4)

Drug resistance-associated polymorphisms

Polymorphisms at drug resistance-associated codon sites were evaluated for 36 known drug resistance genes (Additional file 2: Figure S1). Mutations in katG, which confer resistance to isoniazid, were common among isolates from Lineages 1–3, 5, and 6 and L4 isolates from western and sub-Saharan Africa, and rare among L4 isolates from N. America and Europe, occurring in only a single isolate from this group. S75 isolates were found to have a strain-specific mutation in embA (Ala462Val) previously associated with ethambutol resistance [37], however the S75 isolates included in this study are ethambutol-sensitive. L4 isolates from Kwazulu-Natal, South Africa carried drug resistance-associated mutations in katG, rpoB, pncA, and rrs, consistent with prior studies on these drug-resistant isolates [38].

Subgroup-defining polymorphisms

One hundred seventeen synapomorphic SNPs (i.e. loci at which isolates in given subgroup carry one allele and all isolates outside the subgroup carry a different allele) differentiate L4 isolates from the non-L4 isolates included in this study (Fig. 1). Seventy-five additional SNPs differentiate North American isolates (isolates distal to Node a in Fig. 2, including those from New York, New Jersey, and the CDC1551 outbreak strain) from non-North American isolates, and 16 SNPs differentiate S75 from other North American isolates. Synapomorphic SNPs are unequally distributed by functional category, predominantly occurring in genes associated with cell wall functions, lipid metabolism, respiration, and intermediary metabolism. Non-synonymous synapomorphic SNPs occur in multiple genes with known or proposed functions in virulence, growth, and/or adaptation, including known virulence factors (mce1A, mce2C, vapC40, vapC38, otsA, yrbE2B , and cstA [3943]), and also components of gene-regulatory (sigJ, ramB), lipid metabolism (pks5, fadD15, Rv3087), intermediary metabolism (lpdA), and cell-wall associated pathways (eccC4) with known or proposed functions in M. tuberculosis virulence [4450]. New York City and S75 isolates carry a unique non-synonymous mutation in Rv1290c, a conserved gene of unknown function that when disrupted causes a severe attenuation of virulence [51]. Additional file 3: Table S2 lists the complete set of subgroup-defining SNPs.

Fig. 1
figure 1

Synapomorphic polymorphisms by functional category and isolate subgroup. a Virulence and adaptation, b Regulatory and information pathways, c Conserved proteins without known function, d Cell wall and lipid metabolism, e Intermediary metabolism and respiration. L4 includes all (n = 47) Lineage 4 isolates included in this study, NYC-NJ (N = 32) includes L4 isolates collected in New York City or New Jersey, USA, including the S75 outbreak cluster, and S75 (N = 9) includes isolates belonging the New Jersey outbreak cluster described in the text. Genes carrying diagnostic SNPs with known functions in virulence, growth, and/or adaptation are listed above each column, and of these genes, those with non-synonymous polymorphisms are highlighted in yellow. The number of total diagnostic SNPs unique to S75 (which includes those unique to L4 and NYC-NJ) are listed in the third column

Fig. 2
figure 2

a Consensus network of 1000 maximum likelihood bootstrap replicates for Mycobacterium tuberculosis isolates from North America, Sub-Saharan Africa, and Asia (n = 71) based on 14,601 SNPs. Branches are color-coded by lineage. Isolates from the S75 cluster, identified in New Jersey in 1997–2001, are highlighted; b World map of isolate collection locations color-coded by lineage

Neutrality test statistics and population size expansion

Site frequency-based neutrality test statistics were calculated using whole-genome polymorphism data by lineage (L1-L4) and by subgroups within L4, including the S75 outbreak cluster and non-S75 isolates from New York City and New Jersey (Table 2) Tajima’s D (D T ) and Fu & Li’s D and F test statistics (D FL and F FL ), were significantly negative when calculated for all L4 isolates as a group (n = 47) and for the subgroup of non-S75 isolates from New York City. Negative values for D T , D FL , and F FL indicate a relative excess of low frequency alleles in a population, which can occur following recent population expansion or a selective sweep [52]. Fay and Wu’s H, a statistic that is insensitive to population expansion but highly sensitive to selection pressure, was not significantly different from zero for all isolate subgroups, suggesting that population expansion–rather than a selective sweep–explains the relative excess of rare alleles in isolates in L1-3 and non-S75 L4 isolates. Significant values for Ramos-Onsins & Rozas’ R 2 statistic, which tests for recent population size expansion based on the difference between the number of singleton mutations and the mean number of nucleotide differences between samples, were observed for all subgroups except S75. All five neutrality test statistics were non-significant for the S75 outbreak cluster. Unlike other subgroups, the unfolded site frequency spectrum for S75 exhibited a lower number of low-frequency alleles (Fig. 3) and negative values for Fay and Wu’s H and normalized H, consistent with a small but non-significant excess of high-frequency derived alleles in this subpopulation.

Fig. 3
figure 3

Unfolded site frequency spectra for isolates from the S75 outbreak cluster (L4:S75) and non-S75 isolates from the New York City area. Dark and light blue bars indicate the number of non-synonymous and synonymous SNPs at each SNP allele frequency (from singletons on the left to SNPs at fixation on the right)

Purifying selection on genes involved in lipid metabolism and cell wall maintenance

dN/dS was significantly less than 1, consistent with purifying selection, for two genes in lipid metabolism pathways and five putative transmembrane protein genes (Supplementary Table S1). Two lipid metabolism pathway components, phenolphthiocerol synthesis polyketide synthase A–E family (ppsA) and polyketide synthase 12 (pks12), exhibited significantly decreased dN/dS in a specific subpopulation of New York City isolates (at nodes b, c, and d in Fig. 2). Evidence of episodic diversifying selection, with dN/dS significantly greater than 1, was limited to three isolates in our study, the L2 isolate W148, the L1 isolate T17, and the M. africanum K85 isolate.


M. tuberculosis exhibits very low sequence diversity compared to other bacteria, minimal evidence of horizontal gene transfer [5355], and recombination limited to known highly variable gene families [28]. This lack of genetic diversity is pronounced in geographically restricted M. tuberculosis populations, such that locally endemic clone groups have posed a unique challenge to laboratory-based identification of TB outbreak clusters in New York City. Historically, these isolates have been strongly associated with homeless and at-risk populations, in which field epidemiology and contact tracing are often difficult, placing a premium on rapid and reliable laboratory identification of clustered cases. In one case series, 52% of patients infected with C strain isolates in NYC had no phone number or address, or could otherwise not be contacted by public health investigators [17]. The present study demonstrates how whole genome-based laboratory analysis can overcome these challenges, and suggests that WGS may be a particularly important tool at the local level, where genetic diversity is expected to be lower compared to more geographically diverse samples. The results presented here provide three specific approaches for identifying outbreak clusters and emerging strain groups in local M. tuberculosis populations: (1) genome-based phylogenetic reconstruction; (2) population genetic analysis, specifically estimation of neutrality and diversity statistics within grouped samples; and (3) genome-wide analysis for distinguishing signatures of purifying or diversifying selection.

Whole genome-based phylogenetic reconstruction yielded clearly defined population substructure among locally-endemic isolates in the New York City area, and identified S75 isolates as distinct clade in the phylogeny. S75 isolates also exhibited poorly differentiated terminal branching patterns, and relatively lower bootstrap support at individual nodes, which may reflect the limits of phylogenetic resolution inherent to available genome sequencing data. Although this approach allows for robust retrospective identification of outbreak clusters and emerging strain groups, it is perhaps less well suited for rapid identification of clustered transmission among new TB cases, in which low levels of genetic differentiation may preclude high-confidence phylogenomic resolution between isolates. However, as WGS-based technologies replace conventional genotyping methods, phylogenetic reconstruction will likely become an important tool in public health microbiology, providing a “phylogenetic reference” for TB isolates sequenced within a given geographic area or TB control program [56]. In addition, clustering of incident isolates in a specific phylogenetic branch could suggest ongoing transmission within a specific at-risk population.

SNP distance-based inference of recent transmission, in which the pairwise SNP distance is used to infer whether two isolates were transmitted directly between their respective hosts, is likely to become an important epidemiological tool in TB control [7, 57]. Although the distribution of pairwise SNP differences is expected to vary between low- and high-transmission areas (with higher average pairwise SNP differences expected in high-transmission settings and in areas with lower TB case notification rates) [58], emerging M. tuberculosis subpopulations are still expected to exhibit relatively few SNP differences between isolates. Identification of subpopulation-defining synapomorphic polymorphisms can support this approach by identifying unique SNPs shared between isolates in emerging subpopulations.

The two additional methods used in this study (estimation of neutrality and diversity statistics and selection analysis) are likely to have more value in retrospective analyses, where they can yield useful information about the epidemiological and evolutionary history of circulating M. tuberculosis subpopulations. Subgroup-defining polymorphisms can provide useful genetic markers for M. tuberculosis strain identification, similar to other minimal SNP sets used in M. tuberculosis phylogenetics [59]. S75 isolates in this study could be distinguished from other North American isolates using only 16 SNPs, and determination of similar subgroup-defining SNP sets could provide a straightforward tool for rapidly determining if a given TB isolate belongs to an existing outbreak cluster or endemic strain group. More broadly, subgroup-defining polymorphisms also provide interesting, if limited, insight into the evolutionary history of Lineage 4 M. tuberculosis isolates in North America and the specific L4 populations endemic to New York City. Isolates in these populations have only a minimal number of drug resistance-associated mutations, and instead have acquired multiple non-synonymous SNPs in virulence- and growth-associated pathways. Mutations in pks5 and yrbE2B are of particular interest, first because of their well-characterized roles in M. tuberculosis virulence and second because they may both influence TNF-mediated host immune responses [39, 44]. S75 isolates strains are known to induce higher levels of TNF-α in vitro [60], which may help explain why S75 isolates have spread preferentially in immunocomproised patients. Although the functional consequences of these mutations are still unknown, these findings suggests that overall pathoadaptive fitness, rather than the acquisition of drug resistance mutations, may have played an important role in the evolutionary history of L4 M. tuberculosis populations in New York City.

Selection analysis identified two loci in M. tuberculosis lipid metabolism pathways, ppsA and pks12, with significantly decreased dN/dS ratios consistent with evolution under strong purifying selection. Observing signatures of purifying selection localized to a single subpopulation (in this case, the M. tuberculosis isolates grouped under Nodes b, c, and d), may suggest adaptation to a particular subpopulation or transmission niche, and thus provide useful information about risk factors for acquisition of infection with an emerging strain group. ppsA and other pps family genes are involved in the synthesis of phthiocerol and phenolphthiocerol, two components of cell wall lipids unique to pathogenic mycobacteria that likely participate in host-pathogen interactions [61] and virulence [62, 63]. Interestingly, Farhat et al. identified ppsA and pks12 among 39 genes that exhibit signatures of convergence and possible positive selection in multidrug-resistant M. tuberculosis isolates [64]. Although these loci may exhibit signatures of positive selection in drug-resistant populations, it is not unexpected that ppsA and pks12 would exhibit signatures of purifying selection in populations without a similar history of drug selection pressure. Consistent with this hypothesis we observed relatively fewer drug resistance-associated mutations in the same subpopulations where ppsA and pks12 exhibit signatures of purifying selection. Furthermore we found that dN/dS ratios at known drug-resistance loci were not significantly greater than one in our sample, consistent with prior studies in drug-susceptible M. tuberculosis isolates [65]. The dN/dS ratio has limited power to detect positive selection in recently diverged intraspecific sequences and may underestimate the magnitude of negative selection in genes under strong purifying selection [66]. However, because the dN/dS ratio is expected to underestimate the magnitude of the selection coefficient in this context, our analysis is likely conservative, and the true magnitude of negative selection on ppsA and pks12 may be larger than we have reported.

Lastly, estimation of multiple neutrality statistics yielded evidence for past population expansion across multiple subpopulations, consistent with prior studies on demographic expansion in M. tuberculosis populations [10, 67], with the notable exception of S75. This finding, in conjunction with the negative but nonsignificant H values estimated for S75 isolates (indicating an excess of high-frequency derived alleles), is consistent with the epidemiological history of this recently diverged group of closely related isolates. However, it is important to acknowledge that factors such as sample size and time since demographic expansion can influence the power of statistics that draw from the site frequency spectrum to detect past population growth. Specifically, site frequency spectrum-based statistics may fail to detect population expansion if the elapsed time since an expansion is either too small or too large, or with small sample sizes [52], and thus may be less useful for identification of emerging strain groups, as illustrated here. Importantly, the retrospective sample used in this study includes less than 0.01% of all M. tuberculosis infections occurring in New York City between 1999 and 2009 [2]. Nevertheless, this study demonstrates that even a small sample of isolates can yield meaningful information about the epidemiological and evolutionary history of endemic M. tuberculosis isolate groups in low-transmission settings.


WGS-based technologies are likely to replace many conventional genotyping methods currently used in public health microbiology and TB epidemiology. How to maximize the public health value of this paradigm shift, and the large quantities of genomic data it will soon make available, is still an open question. Whole genome-based drug resistance profiling, SNP distance-based methods to identify ongoing transmission, and phylogenetic reconstruction will likely yield the most direct, practical benefits, and the WGS data collected during these activities will provide an important resource for ongoing research in TB epidemiology and pathogen evolution.



Mycobacterial interspersed repetitive units


Single nucleotide polymorphism




Whole genome sequencing


  1. Macaraig M, Burzynski J, Varma JK. Tuberculosis control in New York City—a changing landscape. N Engl J Med. 2014;370(25):2362–5.

    Article  CAS  PubMed  Google Scholar 

  2. Hygiene NYCDoHaM. New York City Department of Health and Mental Hygiene. Bureau of Tuberculosis Control Annual Summary. In; 2013.

  3. Friedman CR, Quinn GC, Kreiswirth BN, Perlman DC, Salomon N, Schluger N, Lutfey M, Berger J, Poltoratskaia N, Riley LW. Widespread dissemination of a drug-susceptible strain of Mycobacterium tuberculosis. J Infect Dis. 1997;176(2):478–84.

    Article  CAS  PubMed  Google Scholar 

  4. Bifani PJ, Plikaytis BB, Kapur V, Stockbauer K, Pan X, Lutfey ML, Moghazeh SL, Eisner W, Daniel TM, Kaplan MH, et al. Origin and interstate spread of a New York City multidrug-resistant Mycobacterium tuberculosis clone family. J Am Med Assoc. 1996;275(6):452–7.

  5. Gardy JL, Johnston JC, Ho Sui SJ, Cook VJ, Shah L, Brodkin E, Rempel S, Moore R, Zhao Y, Holt R, et al. Whole-genome sequencing and social-network analysis of a tuberculosis outbreak. N Engl J Med. 2011;364(8):730–9.

    Article  CAS  PubMed  Google Scholar 

  6. Roetzer A, Diel R, Kohl TA, Ruckert C, Nubel U, Blom J, Wirth T, Jaenicke S, Schuback S, Rusch-Gerdes S, et al. Whole genome sequencing versus traditional genotyping for investigation of a Mycobacterium tuberculosis outbreak: a longitudinal molecular epidemiological study. PLoS Med. 2013;10(2):e1001387.

    Article  PubMed  PubMed Central  Google Scholar 

  7. Walker TM, Ip CL, Harrell RH, Evans JT, Kapatai G, Dedicoat MJ, Eyre DW, Wilson DJ, Hawkey PM, Crook DW, et al. Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study. Lancet Infect Dis. 2013;13(2):137–46.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Eldholm V, Monteserin J, Rieux A, Lopez B, Sobkowiak B, Ritacco V, Balloux F. Four decades of transmission of a multidrug-resistant Mycobacterium tuberculosis outbreak strain. Nat Commun. 2015;6:7119.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Casali N, Nikolayevskyy V, Balabanova Y, Harris SR, Ignatyeva O, Kontsevaya I, Corander J, Bryant J, Parkhill J, Nejentsev S, et al. Evolution and transmission of drug-resistant tuberculosis in a Russian population. Nat Genet. 2014;46(3):279–86.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Merker M, Blin C, Mona S, Duforet-Frebourg N, Lecher S, Willery E, Blum MG, Rusch-Gerdes S, Mokrousov I, Aleksic E, et al. Evolutionary history and global spread of the Mycobacterium tuberculosis Beijing lineage. Nat Genet. 2015;47(3):242–9.

    Article  CAS  PubMed  Google Scholar 

  11. Ford CB, Shah RR, Maeda MK, Gagneux S, Murray MB, Cohen T, Johnston JC, Gardy J, Lipsitch M, Fortune SM. Mycobacterium tuberculosis mutation rate estimates from different lineages predict substantial differences in the emergence of drug-resistant tuberculosis. Nat Genet. 2013;45(7):784–90.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Kwong JC, McCallum N, Sintchenko V, Howden BP. Whole genome sequencing in clinical and public health microbiology. Pathology. 2015;47(3):199–210.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Walker TM, Kohl TA, Omar SV, Hedge J, Del Ojo Elias C, Bradley P, Iqbal Z, Feuerriegel S, Niehaus KE, Wilson DJ, et al. Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. Lancet Infect Dis. 2015;15(10):1193–202.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Bastardo A, Ravelo C, Romalde JL. Phylogeography of Yersinia ruckeri reveals effects of past evolutionary events on the current strain distribution and explains variations in the global transmission of enteric redmouth (ERM) disease. Front Microbiol. 2015;6:1198.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Luo T, Comas I, Luo D, Lu B, Wu J, Wei L, Yang C, Liu Q, Gan M, Sun G, et al. Southern East Asian origin and coexpansion of Mycobacterium tuberculosis Beijing family with Han Chinese. Proc Natl Acad Sci U S A. 2015;112(26):8136–41.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Mathema B, Bifani PJ, Driscoll J, Steinlein L, Kurepina N, Moghazeh SL, Shashkina E, Marras SA, Campbell S, Mangura B, et al. Identification and evolution of an IS6110 low-copy-number Mycobacterium tuberculosis cluster. J Infect Dis. 2002;185(5):641–9.

    Article  CAS  PubMed  Google Scholar 

  17. Macaraig M, Agerton T, Driver CR, Munsiff SS, Abdelwahab J, Park J, Kreiswirth B, Driscoll J, Zhao B. Strain-specific differences in two large Mycobacterium tuberculosis genotype clusters in isolates collected from homeless patients in New York City from 2001 to 2004. J Clin Microbiol. 2006;44(8):2890–6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Gagneux S, Small PM. Global phylogeography of Mycobacterium tuberculosis and implications for tuberculosis product development. Lancet Infect Dis. 2007;7(5):328–37.

    Article  PubMed  Google Scholar 

  19. Comas I, Chakravartti J, Small PM, Galagan J, Niemann S, Kremer K, Ernst JD, Gagneux S. Human T cell epitopes of Mycobacterium tuberculosis are evolutionarily hyperconserved. Nat Genet. 2010;42(6):498–503.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Comparative Sequencing Project.Mycobacterium tuberculosis. Cambridge: Broad Institute of Harvard and MIT. 2013. Accessed 5 Oct 2015.

  21. Fleischmann RD, Alland D, Eisen JA, Carpenter L, White O, Peterson J, DeBoy R, Dodson R, Gwinn M, Haft D, et al. Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J Bacteriol. 2002;184(19):5479–90.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Miyoshi-Akiyama T, Matsumura K, Iwai H, Funatogawa K, Kirikae T. Complete annotated genome sequence of Mycobacterium tuberculosis Erdman. J Bacteriol. 2012;194(10):2770.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Zheng H, Lu L, Wang B, Pu S, Zhang X, Zhu G, Shi W, Zhang L, Wang H, Wang S, et al. Genetic basis of virulence attenuation revealed by comparative genomic analysis of Mycobacterium tuberculosis strain H37Ra versus H37Rv. PLoS One. 2008;3(6):e2375.

    Article  PubMed  PubMed Central  Google Scholar 

  24. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, Gordon SV, Eiglmeier K, Gas S, Barry 3rd CE, et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998;393(6685):537–44.

    Article  CAS  PubMed  Google Scholar 

  25. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Cingolani P, Platts A, le Wang L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012;6(2):80–92.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Phelan JE, Coll F, Bergval I, Anthony RM, Warren R, Sampson SL, Gey van Pittius NC, Glynn JR, Crampin AC, Alves A, et al. Recombination in pe/ppe genes contributes to genetic variation in Mycobacterium tuberculosis lineages. BMC Genomics. 2016;17:151.

    Article  PubMed  PubMed Central  Google Scholar 

  29. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30(9):1312–3.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Holland B, Moulton V. Consensus networks: A method for visualizing incompatibilities in collections of trees. In: “Workshop on Algorithms in Bioinformatics”: 2003. 2003. p. 165–76.

    Chapter  Google Scholar 

  31. Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 2006;23(2):254–67.

    Article  CAS  PubMed  Google Scholar 

  32. Librado P, Rozas J. DnaSP v5: a software for comprehensive analysis of DNA polymorphism data. Bioinformatics. 2009;25(11):1451–2.

    Article  CAS  PubMed  Google Scholar 

  33. Fay JC, Wu CI. Hitchhiking under positive Darwinian selection. Genetics. 2000;155(3):1405–13.

    CAS  PubMed  PubMed Central  Google Scholar 

  34. Pond SL, Frost SD, Muse SV. HyPhy: hypothesis testing using phylogenies. Bioinformatics. 2005;21(5):676–9.

    Article  CAS  PubMed  Google Scholar 

  35. Kosakovsky Pond SL, Murrell B, Fourment M, Frost SD, Delport W, Scheffler K. A random effects branch-site model for detecting episodic diversifying selection. Mol Biol Evol. 2011;28(11):3033–43.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Nei M. Molecular evolutionary genetics. New York: Columbia University Press; 1987.

    Google Scholar 

  37. Ramaswamy SV, Amin AG, Goksel S, Stager CE, Dou SJ, El Sahly H, Moghazeh SL, Kreiswirth BN, Musser JM. Molecular genetic analysis of nucleotide polymorphisms associated with ethambutol resistance in human isolates of Mycobacterium tuberculosis. Antimicrob Agents Chemother. 2000;44(2):326–36.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Ioerger TR, Koo S, No EG, Chen X, Larsen MH, Jacobs Jr WR, Pillay M, Sturm AW, Sacchettini JC. Genome analysis of multi- and extensively-drug-resistant tuberculosis from KwaZulu-Natal, South Africa. PLoS One. 2009;4(11):e7778.

    Article  PubMed  PubMed Central  Google Scholar 

  39. Marjanovic O, Miyata T, Goodridge A, Kendall LV, Riley LW. Mce2 operon mutant strain of Mycobacterium tuberculosis is attenuated in C57BL/6 mice. Tuberculosis. 2010;90(1):50–6.

    Article  CAS  PubMed  Google Scholar 

  40. Chatterjee A, Saranath D, Bhatter P, Mistry N. Global transcriptional profiling of longitudinal clinical isolates of Mycobacterium tuberculosis exhibiting rapid accumulation of drug resistance. PLoS One. 2013;8(1):e54717.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Forrellad MA, Klepp LI, Gioffre A, Sabio y Garcia J, Morbidoni HR, de la Paz Santangelo M, Cataldi AA, Bigi F. Virulence factors of the Mycobacterium tuberculosis complex. Virulence. 2013;4(1):3–66.

    Article  PubMed  PubMed Central  Google Scholar 

  42. Shimono N, Morici L, Casali N, Cantrell S, Sidders B, Ehrt S, Riley LW. Hypervirulent mutant of Mycobacterium tuberculosis resulting from disruption of the mce1 operon. Proc Natl Acad Sci U S A. 2003;100(26):15918–23.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. Beste DJ, Espasa M, Bonde B, Kierzek AM, Stewart GR, McFadden J. The genetic requirements for fast and slow growth in mycobacteria. PLoS One. 2009;4(4):e5349.

    Article  PubMed  PubMed Central  Google Scholar 

  44. Boritsch EC, Frigui W, Cascioferro A, Malaga W, Etienne G, Laval F, Pawlik A, Le Chevalier F, Orgeur M, Ma L, et al. pks5-recombination-mediated surface remodelling in Mycobacterium tuberculosis emergence. Nat Microbiol. 2016;1:15019.

    Article  PubMed  Google Scholar 

  45. Hu Y, Kendall S, Stoker NG, Coates AR. The Mycobacterium tuberculosis sigJ gene controls sensitivity of the bacterium to hydrogen peroxide. FEMS Microbiol Lett. 2004;237(2):415–23.

    CAS  PubMed  Google Scholar 

  46. Micklinghoff JC, Breitinger KJ, Schmidt M, Geffers R, Eikmanns BJ, Bange FC. Role of the transcriptional regulator RamB (Rv0465c) in the control of the glyoxylate cycle in Mycobacterium tuberculosis. J Bacteriol. 2009;191(23):7260–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. Li AH, Waddell SJ, Hinds J, Malloff CA, Bains M, Hancock RE, Lam WL, Butcher PD, Stokes RW. Contrasting transcriptional responses of a virulent and an attenuated strain of Mycobacterium tuberculosis infecting macrophages. PLoS One. 2010;5(6):e11066.

    Article  PubMed  PubMed Central  Google Scholar 

  48. Cheruvu M, Plikaytis BB, Shinnick TM. The acid-induced operon Rv3083-Rv3089 is required for growth of Mycobacterium tuberculosis in macrophages. Tuberculosis. 2007;87(1):12–20.

    Article  CAS  PubMed  Google Scholar 

  49. Akhtar P, Srivastava S, Srivastava A, Srivastava M, Srivastava BS, Srivastava R. Rv3303c of Mycobacterium tuberculosis protects tubercle bacilli against oxidative stress in vivo and contributes to virulence in mice. Microbes Infect. 2006;8(14–15):2855–62.

    Article  CAS  PubMed  Google Scholar 

  50. Gey Van Pittius NC, Gamieldien J, Hide W, Brown GD, Siezen RJ, Beyers AD. The ESAT-6 gene cluster of Mycobacterium tuberculosis and other high G + C Gram-positive bacteria. Genome Biol. 2001;2(10):RESEARCH0044.

  51. McAdam RA, Quan S, Smith DA, Bardarov S, Betts JC, Cook FC, Hooker EU, Lewis AP, Woollard P, Everett MJ, et al. Characterization of a Mycobacterium tuberculosis H37Rv transposon library reveals insertions in 351 ORFs and mutants with altered virulence. Microbiology. 2002;148(Pt 10):2975–86.

    Article  CAS  PubMed  Google Scholar 

  52. Ramos-Onsins SE, Rozas J. Statistical properties of new neutrality tests against population growth. Mol Biol Evol. 2002;19(12):2092–100.

    Article  CAS  PubMed  Google Scholar 

  53. Liu X, Gutacker MM, Musser JM, Fu YX. Evidence for recombination in Mycobacterium tuberculosis. J Bacteriol. 2006;188(23):8169–77.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  54. Achtman M. Evolution, population structure, and phylogeography of genetically monomorphic bacterial pathogens. Annu Rev Microbiol. 2008;62:53–70.

    Article  CAS  PubMed  Google Scholar 

  55. Smith NH, Gordon SV, de la Rua-Domenech R, Clifton-Hadley RS, Hewinson RG. Bottlenecks and broomsticks: the molecular evolution of Mycobacterium bovis. Nat Rev Microbiol. 2006;4(9):670–81.

    Article  CAS  PubMed  Google Scholar 

  56. Benavente ED, Coll F, Furnham N, McNerney R, Glynn JR, Campino S, Pain A, Mohareb FR, Clark TG. PhyTB: Phylogenetic tree visualisation and sample positioning for M. tuberculosis. BMC Bioinformatics. 2015;16:155.

    Article  PubMed  PubMed Central  Google Scholar 

  57. Walker TM, Lalor MK, Broda A, Saldana Ortega L, Morgan M, Parker L, Churchill S, Bennett K, Golubchik T, Giess AP, et al. Assessment of Mycobacterium tuberculosis transmission in Oxfordshire, UK, 2007–12, with whole pathogen genome sequences: an observational study. Lancet Respir Med. 2014;2(4):285–92.

    Article  PubMed  PubMed Central  Google Scholar 

  58. Glynn JR, Guerra-Assuncao JA, Houben RM, Sichali L, Mzembe T, Mwaungulu LK, Mwaungulu JN, McNerney R, Khan P, Parkhill J, et al. Whole Genome Sequencing Shows a Low Proportion of Tuberculosis Disease Is Attributable to Known Close Contacts in Rural Malawi. PLoS One. 2015;10(7):e0132840.

    Article  PubMed  PubMed Central  Google Scholar 

  59. Filliol I, Motiwala AS, Cavatore M, Qi W, Hazbon MH, Bobadilla del Valle M, Fyfe J, Garcia-Garcia L, Rastogi N, Sola C, et al. Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNA fingerprinting systems, and recommendations for a minimal standard SNP set. J Bacteriol. 2006;188(2):759–72.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  60. Mathema B, Kurepina N, Yang G, Shashkina E, Manca C, Mehaffy C, Bielefeldt-Ohmann H, Ahuja S, Fallows DA, Izzo A, et al. Epidemiologic consequences of microvariation in Mycobacterium tuberculosis. J Infect Dis. 2012;205(6):964–74.

    Article  PubMed  PubMed Central  Google Scholar 

  61. Azad AK, Sirakova TD, Fernandes ND, Kolattukudy PE. Gene knockout reveals a novel gene cluster for the synthesis of a class of cell wall lipids unique to pathogenic mycobacteria. J Biol Chem. 1997;272(27):16741–5.

    Article  CAS  PubMed  Google Scholar 

  62. Sirakova TD, Dubey VS, Kim HJ, Cynamon MH, Kolattukudy PE. The largest open reading frame (pks12) in the Mycobacterium tuberculosis genome is involved in pathogenesis and dimycocerosyl phthiocerol synthesis. Infect Immun. 2003;71(7):3794–801.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  63. Matsunaga I, Bhatt A, Young DC, Cheng TY, Eyles SJ, Besra GS, Briken V, Porcelli SA, Costello CE, Jacobs Jr WR, et al. Mycobacterium tuberculosis pks12 produces a novel polyketide presented by CD1c to T cells. J Exp Med. 2004;200(12):1559–69.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  64. Farhat MR, Shapiro BJ, Kieser KJ, Sultana R, Jacobson KR, Victor TC, Warren RM, Streicher EM, Calver A, Sloutsky A, et al. Genomic analysis identifies targets of convergent positive selection in drug-resistant Mycobacterium tuberculosis. Nat Genet. 2013;45(10):1183–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  65. Zhang H, Li D, Zhao L, Fleming J, Lin N, Wang T, Liu Z, Li C, Galwey N, Deng J, et al. Genome sequencing of 161 Mycobacterium tuberculosis isolates from China identifies genes and intergenic regions associated with drug resistance. Nat Genet. 2013;45(10):1255–60.

    Article  CAS  PubMed  Google Scholar 

  66. Kryazhimskiy S, Plotkin JB. The population genetics of dN/dS. PLoS Genet. 2008;4(12):e1000304.

    Article  PubMed  PubMed Central  Google Scholar 

  67. Pepperell CS, Casto AM, Kitchen A, Granka JM, Cornejo OE, Holmes EC, Birren B, Galagan J, Feldman MW. The role of selection in shaping diversity of natural M. tuberculosis populations. PLoS Pathog. 2013;9(8):e1003543.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

Download references


We would like to thank Drs. Elena Shashkina and Natalia Kurepina for their assistance.


No funding was obtained for the research activities conducted in this study.

Availability of data and materials

Genome assemblies for all isolates included in this study were deposited in the NCBI Genbank database under BioProject PRJNA288586 (accession codes are listed in Table 1).

Authors’ contributions

Conceived and designed the experiments: BM, BNK, SOK, and PJP. Conducted the experiments: AN, JRW, and PJB. Analyzed the data: SOK, AN, and TSB. Wrote the manuscript: TSB, SOK, PJP, and BM. All authors have read and approved the manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

This study met criteria for non-human subjects research activity (Not Human Subjects Research Under 45 CFR 46) as determined by the Columbia University Medical Center Institutional Review Board.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Barun Mathema.

Additional files

Additional file 1: Table S1.

Complete list of non-synonymous and synonymous synapomorphic polymorphisms for all L4 isolates, L4 isolates from New York City and New Jersey, and isolates from the S75 outbreak cluster. (DOCX 19 kb)

Additional file 2:

Figure S1. Whole-genome maximum likelihood phylogenetic reconstruction of Mycobacterium tuberculosis isolates from North America, Sub-Saharan Africa, and Asia (n = 71). Values at the nodes indicate branch support based on 1000 bootstrap replicates. Letter labels denote branches with genes under purifying selection (see Additional file 1: Table S1) or lineage-defining polymorphisms. Green boxes in the adjoining matrix indicate SNPs at drug resistance-associated codon sites in known drug resistance gene. INH: isonaizid; RIF: rifampin; PZA: pyrazinamide; ETH: ethionamide; EMB: ethambutol; AMI: amikacin; FLQ: fluoroquinolones; PAS: para-aminosalicylic acid; SM: streptomycin. (PDF 210 kb)

Additional file 3:

Table S2. Statistically significant dN/dS values, estimated by gene and across phylogenetic branches. P HOLM, Holm-corrected P-value. Location of Nodes A–E are indicated in Fig. 1. (DOCX 19 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Brown, T.S., Narechania, A., Walker, J.R. et al. Genomic epidemiology of Lineage 4 Mycobacterium tuberculosis subpopulations in New York City and New Jersey, 1999–2009. BMC Genomics 17, 947 (2016).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: