Construction of a virtual Mycobacterium tuberculosis consensus genome and its application to data from a next generation sequencer
© Okumura et al.; licensee BioMed Central. 2015
Received: 24 September 2014
Accepted: 20 February 2015
Published: 20 March 2015
Although Mycobacterium tuberculosis isolates are consisted of several different lineages and the epidemiology analyses are usually assessed relative to a particular reference genome, M. tuberculosis H37Rv, which might introduce some biased results. Those analyses are essentially based genome sequence information of M. tuberculosis and could be performed in sillico in theory, with whole genome sequence (WGS) data available in the databases and obtained by next generation sequencers (NGSs). As an approach to establish higher resolution methods for such analyses, whole genome sequences of the M. tuberculosis complexes (MTBCs) strains available on databases were aligned to construct virtual reference genome sequences called the consensus sequence (CS), and evaluated its feasibility in in sillico epidemiological analyses.
The consensus sequence (CS) was successfully constructed and utilized to perform phylogenetic analysis, evaluation of read mapping efficacy, which is crucial for detecting single nucleotide polymorphisms (SNPs), and various MTBC typing methods virtually including spoligotyping, VNTR, Long sequence polymorphism and Beijing typing. SNPs detected based on CS, in comparison with H37Rv, were utilized in concatemer-based phylogenetic analysis to determine their reliability relative to a phylogenetic tree based on whole genome alignment as the gold standard. Statistical comparison of phylogenic trees based on CS with that of H37Rv indicated the former showed always better results that that of later. SNP detection and concatenation with CS was advantageous because the frequency of crucial SNPs distinguishing among strain lineages was higher than those of H37Rv. The number of SNPs detected was lower with the consensus than with the H37Rv sequence, resulting in a significant reduction in computational time. Performance of each virtual typing was satisfactory and accorded with those published when those are available.
These results indicated that virtual CS constructed from genome sequence data is an ideal approach as a reference for MTBC studies.
Tuberculosis is one of the most prevalent and deadly bacterial infections affecting humans, with almost 9 million new cases worldwide and more than 1.4 million deaths in 2010 . It has been estimated that 310000 patients newly diagnosed with pulmonary tuberculosis in 2011 were infected with multidrug-resistant (MDR) bacteria , with 9% of these patients, living in 84 countries, having extensively drug-resistant (XDR) tuberculosis . Moreover, the World Health Organization (WHO) has estimated that 350000 of the 1.4 million deaths per year from tuberculosis are associated with HIV coinfection .
A variety of molecular typing methods have been used to classify M. tuberculosis strains for epidemiological studies, including assessment of the presence of the IS6110 restriction fragment length polymorphism (RFLP) , spoligotyping, analysis of mycobacterial interspersed repetitive unit-variable number tandem repeats (MIRU-VNTR)  and large sequence polymorphisms (LSPs) [7,8]. Their target sequences are mobile elements (e.g. IS6110), repetitive sequences (e.g. spoligotyping and MIRU-VNTR) and relatively long sequence polymorphisms (at least 7 bp), with many strains belonging to unrelated lineages possessing these DNA elements in common. According to the SpolDB4 of the international online spoligotype database (http://www.pasteur-guadeloupe.fr/tb/bd_myco.html), clinical strains of M. tuberculosis obtained worldwide can be classified into 10 major groups . Although it is useful to identify clonal lineage on the global scale, the discriminatory power of this method may not be sufficient for evaluation of genetically closely related isolates, including those from areas of tuberculosis outbreak. According to the SpolDB4, all M. tuberculosis strains belonging to the Beijing family, predominantly from Far-East Asia , share the same spoligotype patterns . Over 70% of the clinical strains isolated in Japan were found to belong to the Beijing family . Similarly, MIRU-VNTR genotyping [6,12], based on the typing of 12 MIRU loci, has become a global standard in the epidemiological typing of M. tuberculosis. Since its first use, many investigators have tried to find an ideal combination to further distinguish among genetically closely related strains [13,14]. This has led to the formulation of optimized sets, including a 15-locus system as the new standard for routine epidemiological discrimination and a 24-locus system as a high-resolution tool for phylogenetic studies . Although these 15- and 24-locus VNTR locus systems have been utilized for first-line typing, they are insufficient in distinguishing among closely related strains of the Beijing family to define deep phylogenetic structures . LSPs have been utilized to determine whether lineages of M. tuberculosis were associated with specific human populations [7,8]. Utilizing LSPs, MTBCs could be divided into at least 6 phylogeographical lineages, each associated with specific, sympatric human populations. Taken together, that conventional molecular typing methods for MTBC are limited in distinguishing among strain subtypes.
This limitation may be overcome by next generation whole genome sequencing (WGS) for genome-based epidemiology . WGS is becoming the ultimate tool for diagnosing and typing pathogens, and has amplified the impact of molecular diagnostics on clinical microbiology . The potential of WGS-based MTBC genotyping has started to be explored . In phylogenetic analysis based on WGS data, sequence reads are mapped to a reference genome, usually H37Rv, and single nucleotide polymorphisms (SNPs) are detected and concatenated to generate an artificial genome sequence representing each isolate. This approach has been shown to be robustness and of high resolution [19-21].
In comparative genome or phylogenetic analysis, generated genomes or WGS data are compared to results from reference genomes. The M. tuberculosis strain H37Rv was the first sequenced in its entirety  and has been utilized extensively as the reference genome in these investigations. In these comparisons, the sequences of one or several newly sequenced genomes are compared to the sequence of a reference genome. If the reference genome used in these analyses does not contain a gene or marker possessed by the newly sequenced genome, this method cannot be used to determine the evolutionary fate of the genetic context of the latter.
The evolution of MTBC genomes has several unique features. Although several mycobacterium phages have been reported [23,24], they were found to evolve through mutation without acquiring any external genetic traits. This feature differs strikingly from that observed in conventional drug resistance bacteria such as Pseudomonas aeruginosa and Escherichia coli . The nature of its evolution makes the MTBC genome highly stable, and the genetic diversity of MTBCs obtained in clinical settings is essentially restricted in SNPs and indels. Generally, mobile elements such as transposons, phages and plasmids are omitted from phylogenetic analysis because their rate of evolution differs from that of the intrinsic chromosome. MTBCs, however, are composed of several lineages, and the development of an artificial reference sequence of the entire MTBC genome, containing sequence information from all lineages, would be extremely useful. Although some SNPs and/or indels may be omitted or ignored when using a particular genome sequence of a real isolate as a reference, these SNPs and/or indels would not be missed in comparative genomics, especially in phylogenetic research.
In this study, we constructed a consensus whole MTBC genome sequence from the sequence data of 19 MTBC strains isolated and characterized through April 2012. For proof of concept, we compared the consensus and H37Rv genomes as reference standards for phylogenetic analysis, SNP detection and secondary analysis of WGS data. In addition, we found that spoligotype, VNTR type, LSP classification, Beijing type and antibiotic susceptibility profiles could be determined simultaneously, resulting in a virtual diagnosis in the absence of actual experiments except for WGS.
Construction and features of a virtual M. tuberculosis consensus genome
Mycobacterium strains used in this study
Genome inversion points
Drug-susceptible isolate belonging to the Beijing family.
Multidrug-resistant clinical isolate.
isolated from sputum samples of patients
Predominant strain in South African epidemic
An avirulent strain derived from its virulent parent strain H37
1. between 932051 and 932052. 2. between 3479594 and 3459595
Extensively drug-resistant clinical isolate
1. between 931985 and 931986. 2. between 3479865 and 3479866
Multidrug-resistant clinical isolate
1. between 932007 and 932008. 2. between 3476553 and 3476554
Drug-susceptible clinical isolate
isolated from sputum samples of patients
isolated from sputum samples of patients
M. bovis BCG
Brazilian vaccine strain
Although the concept of a consensus genome has long been recognized [26-30], no consensus genome had been determined for M. tuberculosis, largely due to difficulties in aligning these relatively large sized genomes. Despite its relatively stable and conserved genome, some M. tuberculosis strains have inversions and rearranged regions in comparison with H37Rv, making alignment more difficult. To overcome the inversions and rearrangements in M. tuberculosis strains, we performed genome rearrangement analyses using publicly available software Mauve , finding that the genomes of M. tuberculosis KZN605, KZN1435 and KZN4207 have large inversions, of 0.93–3.46Mbp (Table 1). IS6110 insertion sequences are located immediately adjacent to the flanking regions of all the inversion points, suggesting that the genome rearrangements observed in these 3 strains were driven by the insertion sequence. Manual correction of these inversions generated artificial KZN605, KZN1435 and KZN4206 genome sequences (KZN605_m, KZN1435_m and KZN4206_m, respectively), with the latter used to align the MTBC genomes. These procedures and employing a very fast and publicly available multiple sequence alignment software, MAFFT  allowed the successful alignment of the genomes of 19 MTBC strains and 13 M. tuberculosis strains. Merging the alignment results were carried out using sequence editing commercial software, Genetyx, although any software, which can handle multiple alignment data, should be suitable for the purpose. Two types of CSs were prepared with handling SNPs as majority rule or ambiguity rule. In this study, ambiguity sequence was used for further analysis. Insertion sequences derived from strain specific regions were all included in the CSs to increases the amount of sequence information.
The MTBC CS consists of one chromosome of 4991559 bp, with an average GC content of 64.8%. This genome was approximately 0.6 Mbp longer, because all sequence data from all strains used were merged into one sequence, and had a slightly lower GC content than the elements of the 19 strains used to construct the consensus genome. This artificially merged CS was intended for use as a reference genome in analyses of MTBC. Thus, instead of CDS extraction followed by annotation, homology analysis based on the corresponding gene sequences of M. tuberculosis H37Rv, which is extensively used as a reference genome, was performed to annotate the corresponding region in CS. Each region was assigned based on its CDS locus_tag (the Rv numbers), repeat regions, and rRNA and tRNA of H37Rv. All locus_tags of H37Rv were reflected in CS. This annotation resulted in features based on 4395 CDS (annotated as misc_features in CS) and repeat regions according with those H37Rv. Public databases do not accept virtual sequences. Thus the completely annotated sequence in addition to the alignments is available as additional data (Additional file 3).
Efficacy of SNP calling using the consensus sequence or H37Rv as the reference genome by comparing the number of SNPs detected
i) individual strains
vs consensus sequence
M. bovis BCG
Total SNP found in only one strain
ii) group comparison
four BCG strains
three KZN strains
All SNP at least found in one strain
Comparison of the performance of the consensus sequence (CS) and H37Rv as the reference genome in the phylogenetic analyses
Distance analysis among phylogenic trees constructed based on maximum-likelihood and Bayesian MCMC methods
1.68E + 03
1.20E + 03
2.61E + 03
1.91E + 03
Computational time for each phylogenetic analysis
Comparison of Illumina read mapping efficacy using clinical isolates derived from different lineages using Bowtie2 and SAMtools
i) Comparison of the numbers of mapped and unmapped reads to the H37Rv sequence or consensus sequence
Subtraction of ratio (%)
End to end
End to end
End to end
(CS minus H37Rv)
Haarlem, LAM, X etc.
Haarlem, LAM, X etc.
Haarlem, LAM, X etc.
Haarlem, LAM, X etc.
ii) Comparison of mappping frequency ratio (%) among the MTBC lineanges
Haarlem, LAM, X etc.
Haarlem, LAM, X etc.
P < 0.05
Comparison of Illumina read mapping efficacy
Comparison of Illumina read mapping efficacy using clinical isolates derived from different lineages using
Subtraction of ratio (%)
Haarlem, LAM, X etc.
Haarlem, LAM, X etc.
Haarlem, LAM, X etc.
Haarlem, LAM, X etc.
ii) Comparison of mappping frequency ratio (%) among the MTBC lineanges
HaarlemLAM, X etc.
Haarlem,LAM, X etc.
Based on a number of typing methods, MTBC could be divided into several lineages [7,12,45]. To analyze whether these MTBC lineages influence the read mapping efficacy, we analyzed the values obtained by subtracting the ratio of mapped to total reads number on the H37Rv sequence from that on CS (Table 5 ii). Using the Mann–Whitney U test, we found that isolates from the Beijing family showed significantly greater differences in mapping frequencies between H37Rv and consensus sequences than did isolates from other lineages (P < 0.05), indicating the greater suitability of CS in assessing mapping frequency.
Then we used a commercial software, CLC genomics workbench (CLC bio) to evaluate read mapping efficacy with a different algorism, which is based on Smith and Waterman , from Bowtie2. After quality based trimming described in the method section, the sequence reads were mapped to the consensus and H37Rv sequences, and the number of reads mapped and unmapped were compared as did with Bowtie2 and SAMtools. The efficacy of mapping was compared by analyzing three combinations of parameters: mismatch cost, insertion cost, deletion cost, matching length and similarity. Significance tests described above showed that each combination of consensus and H37Rv sequences for individual read data in each mapping stringency setting differed significantly difference in the proportion of mapped reads (p < 0.0001), with the proportion always higher using CS as the reference. The one exception was the isolate J147, which belongs to the Haarlem lineage, as does H37Rv, explaining the higher mapping efficacy found with H37Rv as the reference.
Again, we analyzed whether these MTBC lineages influence the read mapping efficacy in the different methods (Table 6 ii). We found that isolates from the Beijing family showed significantly greater differences in mapping frequencies between H37Rv and consensus sequences than did isolates from other lineages (P < 0.05). The difference was greatest when comparing the Beijing and Haarlem lineages, with the latter being the lineage to which H37Rv belongs (P < 0.01).
The reads failed to map H37Rv specifically and CS specifically were analyzed for their character with contigs constructed by de novo assembling them and BLAST search on the database. Notably, no contigs were created from reads failed to map specifically, indicating the reads were relatively low quality reads to be assembled. BLAST analyses of resulting contigs from reads failed to map H37Rv showed that all contigs had very little identity within the genome sequence of H37Rv as expected, and some of them were MTBC lineage specific while the others were simply missing from genome sequence of H37Rv but not lineage specific.
These results indicate that read mapping of MTBC based on the WGS data is sensitive to both the reference sequence and the MTBC lineage. Our CS provided a better standard for mapping efficacy of different MTBC lineages than did the H37Rv sequence, as well as statistically significant improvements in SNP detection and read mapping, suggesting that CS is a virtually better approach for MTBC research.
Virtual VNTR typing, spoligotyping and LPS typing
Virtual analyses of LSP and Beijing typing
M. bovis BCG
Cross-sectional studies in diverse geographic locations have shown epidemiologic associations between Beijing types of M. tuberculosis and increased risks of drug resistance . Beijing typing was originally based on spoligotyping , but was later determined by detection of specific SNP [53,54] and PCR analysis of the insertion of IS6110 into specific positions . We analyzed whether the 19 sequenced MTBC genomes could be classified as Beijing type, based on IS6110 insertion between Rv0001 and Rv0002, as shown for the H37Rv genome; or into a modern or ancestral subtype based on IS6110 insertion into the NTF region  (Table 7). Of the strains tested, only CCDC5180 was Beijing type. Although spoligotyping indicated that CCDC5079 should also be Beijing type (Figure 4 and ), this strain lacks an inserted IS6110 between Rv0001 and Rv0002. Although classification of Beijing type based on IS6110 insertion is limited in classifying MTBC lineages, virtual Beijing typing could be performed using an in silico approach.
Taken together, these results suggest that virtual VNTR, spoligotyping, LSP analysis and Beijing typing in silico can be utilized for epidemiological analysis of mycobacterial strains without the need for PCR amplification and/or hybridization procedures.
In this study, we successfully aligned whole genome sequences of 19 MTBC strains by correcting the large rearrangements found in the genomes of KZN605, KZN1435 and KZN4207, and using very fast multiple sequence alignment software, MAFFT [32,57]. This alignment allowed us to create a virtual consensus genome sequence of MTBC, reflecting all genetic information from various lineages. In comparing this CS with that of H37Rv as the reference sequence, we found that CS allowed an unbiased and efficient detection of critical SNPs, distinguishing among the lineages of MTBC. Use of CS as a reference reduced the SNP calling bias, as shown for M. canettii. Moreover, SNP concatemers of MTBC strains based on CS were better able to reproduce a phylogenetic tree based on whole genome alignment than concatemers based on H37Rv. Phylogeny of M. tuberculosis is very closely related the human evolution, and consistent with MTBC displaying characteristics indicative of adaptation to both low and high host densities . Ford at al reported that M. tuberculosis strains from lineage 2 (East Asian lineage and Beijing sublineage) acquire drug resistances in vitro more rapidly than M. tuberculosis strains from lineage 4 (Euro-American lineage) and that this higher rate can be attributed to a higher mutation rate . Thus precise and accurate phylogenetic analysis based on SNP concatemers is becoming a key importance of M. tuberculosis research. Using H37Rv as a reference to call SNP led to some inadequate clustering of M. tuberculosis strain. As shown in Figure 1c, RGTB432 becomes an out-group although it should be placed on the lineage 4 group as in Figure 1a. Thus, adequate reference sequence is not only important for efficient analysis such as mapping of read data from next generation sequencers but also phylogenetic analysis based on SNP concatemers because it directly link to evolution and drug resistance of M. tuberculosis strain. Use of CS also significantly reduced the total number of SNPs detected, decreasing computational time by an order of magnitude. Reduction of computational time is extremely useful when analyzing a large number (more than hundreds) of isolates.
During the construction of CS, we also called SNPs and indels using as reference the H37Rv genome to update information about polymorphisms found in MTBCs. Complete SNP data of individual isolates were presented with position and annotations. As reported previously, about 50% of the indels were located in the genes encoding the PE-PGRS and PPE family of proteins while about 25% were located in intergenic regions. The positions, length, annotation and strains of all SNPs and indels have been reported for further applications, such as exploration of lineage specific gene traits.
Use of CS as a reference also significantly improved the efficacy of short-read mapping of clinical isolates. Use of a particular strain as the reference can affect the mapping results, depending on the lineage of that strain. Testing of isolates of a different lineage from the reference strain can result in the omission of some SNPs and/or indels critical for further analysis. Use of a consensus sequence as a reference would minimize this possibility.
VNTR typing, spoligotyping, LSP analysis and Beijing typing. Thus, technically, it is possible to perform these analyses using sequence data of MTBC strains. To prove this concept, we performed these analyses in silico. Although actual typing data were available for a few of the strains tested, we observed fairly good agreement between actual and expected data. Virtual typing also showed several limitations. For example, Beijing typing classified strain CCDC5079, which belongs to the Beijing family, as non-Beijing type based on IS6110 insertion. Thus, one typing method would not be sufficient to accurately type MTBC isolates. As WGS technologies improve, SNP concatenation would become the ideal typing method. Moreover, we found that in silico analysis using CS was highly reproducible and robust because of its intrinsically objective nature, an objectivity sometimes lacking during actual epidemiological analysis of MTBC. in silico analysis is also labor-saving, since it requires only WGS data.
Our consensus genome sequence does not contain sequence information on several lineages, including lineages 1, 3 and 5 in LPS analysis. This could be a potential shortage because lacks of particular lineages in CS could lead bias in calling SNP as shown in this study. At present, few complete whole genome sequences of these lineages are available in the database. We intend to update our consensus sequence when such information becomes available.
We generated virtual consensus sequences of MTBC from 13 M. tuberculosis and 6 non-tuberculosis strains, and showed that this sequence was superior to the H37Rv sequence as a reference in MTBC research. A completely annotated consensus sequence, relative to the sequence of H37Rv, is available as the additional data. Construction of a web based service integrating the phylogenetic and epidemiological analyses performed in this study is currently underway.
Multiple genome sequence alignment and construction of a virtual MTBC consensus genome sequence
Whole genome sequences used to construct CSs in this study are detailed in Table 1. Genome sequences of 19 strains available by April 2012 were used: 13 genomes from M. tuberculosis, one each from M. africanum and M. canetti, and four from M. bovis BCG. Genome rearrangement was analyzed by a publicly available software, Mauve , with large inversions, hampering genome alignment, observed in M. tuberculosis strains KZN605, KZN1435 and KZN4207 and manually corrected. We constructed two types of consensus whole genome sequences: one containing sequence data from 13 M. tuberculosis strains and the other one containing sequence data from all four species of mycobacteria (19 strains). All sequence data were downloaded from the NCBI databases and aligned using publicly available alignment software, MAFFT version 6 or 7 [32,57] (http://mafft.cbrc.jp/alignment/server/) using our own-build Linux-based server. Consensus sequences were constructed by merging the alignment results using a sequence editing commercial software, Genetyx (Genetyx Inc, Tokyo Japan). Two types of consensus sequences were prepared with handling SNPs as majority rule or ambiguity rule. In this study, ambiguity sequence was used for further analysis.
Annotation of the virtual M. tuberculosis consensus genome
The consensus genome, consisting of an artificially assembled sequence, was annotated manually and edited using a commercial genome sequence editing software, in silico molecular cloning (in silico Biology Co., Kanagawa, Japan). Because CS used for the annotation contains many ambiguous nucleotides and insertion, instead of CDS extraction, homologous regions based on corresponding nucleotide sequences of CDS in H37Rv was determined to extract the corresponding regions in CS, with the assignment of each region based on the locus_tag (the Rv number) of CDS, repeat regions, rRNA and tRNA of H37Rv.
SNP and indel extraction, annotation and characterization
Each genome sequence of MTBC listed in Table 1 was compared with the constructed CS and with the genome sequence of M. tuberculosis H37Rv using a publicly available software, MUMmer 3.0 . SNPs and indels were extracted from each MTBC strain. For SNP analysis, insertions of more than two bp and all deletions were excluded from the resulting data. The resulting files from MUMmer were converted into the VCF format to annotate the SNPs using the a commercial software, CLC genomics workbench (CLC bio). Since the algorithms used to align genome sequences for the construction of the consensus genome (MAFFT) and for the extraction of SNPs and indels (MUMmer) are different, we manually checked the result of indel calls in the MUMmer data, and indels greater than 5 bp were used for manual annotation.
Phylogenetic analysis using SNP concatenated and whole genome sequences
Phylogenetic trees were constructed from SNP and whole-genome sequence alignments. Two methods were used to evaluate robustness: a maximum-likelihood approach, PhyML 3.0  and the Bayesian MCMC framework, BEAST1.7 . For PhyML, GTR and gamma were chosen for a nucleotide substitution model, and tree robustness was evaluated by two methods: approximate likelihood-ratio test (aLTR)  implemented in PhyML. Bootstrappings implemented in PhyML was used to generate multiple trees (100 trees for SNP concatemers) for choosing most probable trees based on combination of 9 statistical analyses using CONSEL . Because data size limitation in CONSEL, 40 trees were generated by PhyML for phylogenetic analysis with whole genome alignment.
For BEAST, various combinations of population size change and molecular clock models were compared to find the model that best fit the data. A simple HKY model was used for SNP concatemer based phylogenetic trees, whereas an HKY kappa model was used for whole-genome sequence based phylogenetic trees. MCMC chains were run for 10 million generations, with sampling every 1000th generation. Convergences and Effective Sample Sizes (ESSs) of the estimates were checked using Tracer v1.4 (http://beast.bio.ed.ac.uk/Tracer). Three of the phylogenetic trees we constructed had ESS values greater than 100, suggesting sufficient mixing of the Markov chain. Maximum clade credibility (MCC) trees were created and annotated using TreeAnnotator within the BEAST software package. All analyses were performed on Linux or Windows 7 based computers. Trees were visualized with FigTree (http://tree.bio.ed.ac.uk/software/figtree/). The tree to tree distance were compared using Robinson and Fould test  implemented in the treedist program in the Phylip package (http://evolution.genetics.washington.edu/phylip/doc/treedist.html) with both of unrooted and rooted mode. The computation time for each analysis was obtained from the log files. Phylogenetic data related to this study have been registered at TreeBase (http://datadryad.org/(doi:10.5061/dryad.nq070)).
Comparison of Illumina read mapping efficacy
To compare the mapping efficacy using CS as reference with H37Rv, we obtained approximately one million 251 bp x 2 pair-end reads from clinical M. tuberculosis isolates listed in Tables 5 and 6 using MiSeq with Nextera XT library kits (Illumina). The sequence data used have been registered with DNA Data Bank of JAPAN (DDBJ) as accession number DRA001219. First, Read mapping was performed using Bowtie 2  with local mode and end-to-end modes and default parameters. The resultant mapping was analyzed the idxstats coomand of SAMtools . We also analyzed the mapping efficacy using CLC genomics workbench, of which algorithm is based Smith and Waterman , after trimming based on base quality (quality score limit = 0.05, removing reads if there are more than 2 ambiguous nucleotides in the reads or less than 15 bp in length). In the analysis with CLC genomics workbench, the influence of three combinations of parameters on mapping was tested: mismatch cost, insertion cost, deletion cost, matching length and similarity. In both analyses, significance of differences in mapping frequencies were assessed using multiple comparisons of proportions tests . We also compared ratio of the number of reads mapped to total read for the two reference sequences, and subtracted values of the mapping frequencies (% mapped reads with consensus sequence minus that of H37Rv) were calculated. The subtracted values were used to compare mapping frequency among the MTBC lineages with Mann–Whitney U tests.
Virtual VNTR typing, spoligotyping, LSPs typing and Beijing typing
Based on the amplification primers for 24-loci VNTR , spoligotyping , LSPs [7,8], and Beijing typing based on the insertion positions of IS6110 , sequence data corresponding to the respective loci or regions were selected. Using the Blast algorithm, these sequence data were used to analyze whether each region is present in the MTBC strains listed in Table 1 . VNTR and spoligotyping results for the isolates or strains in the database were analyzed and compared on the MIRU-VNTR web site (http://www.miru-vntrplus.org/MIRU/index.faces) .
We thank Mrs. M. Komiya and Mrs. Y. Sakurai for their excellent technical assistance.
This work was supported by a Grant for International Health Research (26A103) to author T.M.A. from the Ministry of Health, Labor, and Welfare, Japan and a Grant for International Health Research (24A103) to author T. K. from the Ministry of Health, Labor, and Welfare, Japan.
- Raviglione M, Marais B, Floyd K, Lonnroth K, Getahun H, Migliori GB, et al. Scaling up interventions to achieve global tuberculosis control: progress and new developments. Lancet. 2012;379:1902–13.View ArticlePubMedGoogle Scholar
- Lawn SD, Mwaba P, Bates M, Piatek A, Alexander H, Marais BJ, et al. Advances in tuberculosis diagnostics: the Xpert MTB/RIF assay and future prospects for a point-of-care test. Lancet Infect Dis. 2013;13:349–61.View ArticlePubMedGoogle Scholar
- Abubakar I, Zignol M, Falzon D, Raviglione M, Ditiu L, Masham S, et al. Drug-resistant tuberculosis: time for visionary political leadership. Lancet Infect Dis. 2013;13:529–39.View ArticlePubMedGoogle Scholar
- Marais BJ, Lonnroth K, Lawn SD, Migliori GB, Mwaba P, Glaziou P, et al. Tuberculosis comorbidity with communicable and non-communicable diseases: integrating health services and control efforts. Lancet Infect Dis. 2013;13:436–48.View ArticlePubMedGoogle Scholar
- Thierry D, Brisson-Noel A, Vincent-Levy-Frebault V, Nguyen S, Guesdon JL, Gicquel B. Characterization of a Mycobacterium tuberculosis insertion sequence, IS6110, and its application in diagnosis. J Clin Microbiol. 1990;28:2668–73.PubMed CentralPubMedGoogle Scholar
- Mazars E, Lesjean S, Banuls AL, Gilbert M, Vincent V, Gicquel B, et al. High-resolution minisatellite-based typing as a portable approach to global analysis of Mycobacterium tuberculosis molecular epidemiology. Proc Natl Acad Sci U S A. 2001;98:1901–6.View ArticlePubMed CentralPubMedGoogle Scholar
- Hirsh AE, Tsolaki AG, DeRiemer K, Feldman MW, Small PM. Stable association between strains of Mycobacterium tuberculosis and their human host populations. Proc Natl Acad Sci U S A. 2004;101:4871–6.View ArticlePubMed CentralPubMedGoogle Scholar
- Gagneux S, DeRiemer K, Van T, Kato-Maeda M, de Jong BC, Narayanan S, et al. Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc Natl Acad Sci U S A. 2006;103:2869–73.View ArticlePubMed CentralPubMedGoogle Scholar
- Brudey K, Driscoll JR, Rigouts L, Prodinger WM, Gori A, Al-Hajoj SA, et al. Mycobacterium tuberculosis complex genetic diversity: mining the fourth international spoligotyping database (SpolDB4) for classification, population genetics and epidemiology. BMC Microbiol. 2006;6:23.View ArticlePubMed CentralPubMedGoogle Scholar
- van Soolingen D, Qian L, de Haas PE, Douglas JT, Traore H, Portaels F, et al. Predominance of a single genotype of Mycobacterium tuberculosis in countries of east Asia. J Clin Microbiol. 1995;33:3234–8.PubMed CentralPubMedGoogle Scholar
- Murase Y, Mitarai S, Sugawara I, Kato S, Maeda S. Promising loci of variable numbers of tandem repeats for typing Beijing family Mycobacterium tuberculosis. J Med Microbiol. 2008;57:873–80.View ArticlePubMedGoogle Scholar
- Surikova OV, Voitech DS, Kuzmicheva G, Tatkov SI, Mokrousov IV, Narvskaya OV, et al. Efficient differentiation of Mycobacterium tuberculosis strains of the W-Beijing family from Russia using highly polymorphic VNTR loci. Eur J Epidemiol. 2005;20:963–74.View ArticlePubMedGoogle Scholar
- Le Fleche P, Fabre M, Denoeud F, Koeck JL, Vergnaud G. High resolution, on-line identification of strains from the Mycobacterium tuberculosis complex based on tandem repeat typing. BMC Microbiol. 2002;2:37.View ArticlePubMed CentralPubMedGoogle Scholar
- Supply P, Allix C, Lesjean S, Cardoso-Oelemann M, Rusch-Gerdes S, Willery E, et al. Proposal for standardization of optimized mycobacterial interspersed repetitive unit-variable-number tandem repeat typing of Mycobacterium tuberculosis. J Clin Microbiol. 2006;44:4498–510.View ArticlePubMed CentralPubMedGoogle Scholar
- Iwamoto T, Yoshida S, Suzuki K, Tomita M, Fujiyama R, Tanaka N, et al. Hypervariable loci that enhance the discriminatory ability of newly proposed 15-loci and 24-loci variable-number tandem repeat typing method on Mycobacterium tuberculosis strains predominated by the Beijing family. FEMS Microbiol Lett. 2007;270:67–74.View ArticlePubMedGoogle Scholar
- Baker S, Hanage WP, Holt KE. Navigating the future of bacterial molecular epidemiology. Curr Opin Microbiol. 2010;13:640–5.View ArticlePubMed CentralPubMedGoogle Scholar
- Bravo LT, Procop GW. Recent advances in diagnostic microbiology. Semin Hematol. 2009;46:248–58.View ArticlePubMedGoogle Scholar
- Niemann S, Koser CU, Gagneux S, Plinke C, Homolka S, Bignell H, et al. Genomic diversity among drug sensitive and multidrug resistant isolates of Mycobacterium tuberculosis with identical DNA fingerprints. PLoS One. 2009;4:e7407.View ArticlePubMed CentralPubMedGoogle Scholar
- Gardy JL, Johnston JC, Ho Sui SJ, Cook VJ, Shah L, Brodkin E, et al. Whole-genome sequencing and social-network analysis of a tuberculosis outbreak. N Engl J Med. 2011;364:730–9.View ArticlePubMedGoogle Scholar
- Comas I, Coscolla M, Luo T, Borrell S, Holt KE, Kato-Maeda M, et al. Out-of-Africa migration and Neolithic coexpansion of Mycobacterium tuberculosis with modern humans. Nat Genet. 2013;45:1176–82.View ArticlePubMed CentralPubMedGoogle Scholar
- Roetzer A, Diel R, Kohl TA, Ruckert C, Nubel U, Blom J, et al. Whole genome sequencing versus traditional genotyping for investigation of a Mycobacterium tuberculosis outbreak: a longitudinal molecular epidemiological study. PLoS Med. 2013;10:e1001387.View ArticlePubMed CentralPubMedGoogle Scholar
- Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998;393:537–44.View ArticlePubMedGoogle Scholar
- Ford ME, Sarkis GJ, Belanger AE, Hendrix RW, Hatfull GF. Genome structure of mycobacteriophage D29: implications for phage evolution. J Mol Biol. 1998;279:143–64.View ArticlePubMedGoogle Scholar
- Hatfull GF, Sarkis GJ. DNA sequence, structure and gene expression of mycobacteriophage L5: a phage system for mycobacterial genetics. Mol Microbiol. 1993;7:395–405.View ArticlePubMedGoogle Scholar
- Rossolini GM, Mantengoli E, Montagnani F, Pollini S. Epidemiology and clinical relevance of microbial resistance determinants versus anti-Gram-positive agents. Curr Opin Microbiol. 2010;13:582–8.View ArticlePubMedGoogle Scholar
- Dutilh BE, Huynen MA, Strous M. Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly. Bioinformatics. 2009;25:2878–81.View ArticlePubMed CentralPubMedGoogle Scholar
- Arenas M, Posada D. Computational design of centralized HIV-1 genes. Curr HIV Res. 2010;8:613–21.View ArticlePubMedGoogle Scholar
- Mlera L, Jere KC, van Dijk AA, O’eill HG. Determination of the whole-genome consensus sequence of the prototype DS-1 rotavirus using sequence-independent genome amplification and 454(R) pyrosequencing. J Virol Methods. 2011;175:266–71.View ArticlePubMedGoogle Scholar
- Marston DA, McElhinney LM, Ellis RJ, Horton DL, Wise EL, Leech SL, et al. Next generation sequencing of viral RNA genomes. BMC Genomics. 2013;14:444.View ArticlePubMed CentralPubMedGoogle Scholar
- Farrell CM, O’eary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, et al. Current status and new features of the Consensus Coding Sequence database. Nucleic Acids Res. 2013;42(Dabase issue):D865–72.PubMed CentralPubMedGoogle Scholar
- Darling AE, Mau B, Perna NT. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One. 2010;5:e11147.View ArticlePubMed CentralPubMedGoogle Scholar
- Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30:772–80.View ArticlePubMed CentralPubMedGoogle Scholar
- Fleischmann RD, Alland D, Eisen JA, Carpenter L, White O, Peterson J, et al. Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J Bacteriol. 2002;184:5479–90.View ArticlePubMed CentralPubMedGoogle Scholar
- Arnold C, Thorne N, Underwood A, Baster K, Gharbia S. Evolution of short sequence repeats in Mycobacterium tuberculosis. FEMS Microbiol Lett. 2006;256:340–6.View ArticlePubMedGoogle Scholar
- Vishnoi A, Roy R, Bhattacharya A. Comparative analysis of bacterial genomes: identification of divergent regions in mycobacterial strains using an anchor-based approach. Nucleic Acids Res. 2007;35:3654–67.View ArticlePubMed CentralPubMedGoogle Scholar
- Casali N, Nikolayevskyy V, Balabanova Y, Harris SR, Ignatyeva O, Kontsevaya I, et al. Evolution and transmission of drug-resistant tuberculosis in a Russian population. Nat Genet. 2014;46:279–86.View ArticlePubMed CentralPubMedGoogle Scholar
- Anisimova M, Gascuel O. Approximate likelihood-ratio test for branches: a fast, accurate, and powerful alternative. Syst Biol. 2006;55:539–52.View ArticlePubMedGoogle Scholar
- Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol. 2012;29:1969–73.View ArticlePubMed CentralPubMedGoogle Scholar
- Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010;59:307–21.View ArticlePubMedGoogle Scholar
- Shimodaira H, Hasegawa M. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics. 2001;17:1246–7.View ArticlePubMedGoogle Scholar
- Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53:131–47.View ArticleGoogle Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9.View ArticlePubMed CentralPubMedGoogle Scholar
- RYAN TA. Significance tests for multiple comparison of proportions, variances, and other statistics. Psychol Bull. 1960;57:318–28.View ArticlePubMedGoogle Scholar
- Weniger T, Krawczyk J, Supply P, Niemann S, Harmsen D. MIRU-VNTRplus: a web tool for polyphasic genotyping of Mycobacterium tuberculosis complex bacteria. Nucleic Acids Res. 2010;38:W326–31.View ArticlePubMed CentralPubMedGoogle Scholar
- Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147:195–7.View ArticlePubMedGoogle Scholar
- Ioerger TR, Feng Y, Ganesula K, Chen X, Dobos KM, Fortune S, et al. Variation among genome sequences of H37Rv strains of Mycobacterium tuberculosis from multiple laboratories. J Bacteriol. 2010;192:3645–53.View ArticlePubMed CentralPubMedGoogle Scholar
- Koser CU, Niemann S, Summers DK, Archer JA. Overview of errors in the reference sequence and annotation of Mycobacterium tuberculosis H37Rv, and variation amongst its isolates. Infect Genet Evol. 2012;12:807–10.View ArticlePubMedGoogle Scholar
- Ilina EN, Shitikov EA, Ikryannikova LN, Alekseev DG, Kamashev DE, Malakhova MV, et al. Comparative genomic analysis of Mycobacterium tuberculosis drug resistant strains from Russia. PLoS One. 2013;8:e56577.View ArticlePubMed CentralPubMedGoogle Scholar
- Kamerbeek J, Schouls L, Kolk A, van Agterveld M, van Soolingen D, Kuijper S, et al. Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J Clin Microbiol. 1997;35:907–14.PubMed CentralPubMedGoogle Scholar
- Ford CB, Shah RR, Maeda MK, Gagneux S, Murray MB, Cohen T, et al. Mycobacterium tuberculosis mutation rate estimates from different lineages predict substantial differences in the emergence of drug-resistant tuberculosis. Nat Genet. 2013;45:784–90.View ArticlePubMed CentralPubMedGoogle Scholar
- Qian L, Van Embden JD, Van Der Zanden AG, Weltevreden EF, Duanmu H, Douglas JT. Retrospective analysis of the Beijing family of Mycobacterium tuberculosis in preserved lung tissues. J Clin Microbiol. 1999;37:471–4.PubMed CentralPubMedGoogle Scholar
- Nakanishi N, Wada T, Arikawa K, Millet J, Rastogi N, Iwamoto T. Evolutionary robust SNPs reveal the misclassification of Mycobacterium tuberculosis Beijing family strains into sublineages. Infect Genet Evol. 2013;16:174–7.View ArticlePubMedGoogle Scholar
- Filliol I, Motiwala AS, Cavatore M, Qi W, Hazbon MH, Bobadilla del Valle M, et al. Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNA fingerprinting systems, and recommendations for a minimal standard SNP set. J Bacteriol. 2006;188:759–72.View ArticlePubMed CentralPubMedGoogle Scholar
- Mokrousov I, Ly HM, Otten T, Lan NN, Vyshnevskyi B, Hoffner S, et al. Origin and primary dispersal of the Mycobacterium tuberculosis Beijing genotype: clues from human phylogeography. Genome Res. 2005;15:1357–64.View ArticlePubMed CentralPubMedGoogle Scholar
- Zhang Y, Chen C, Liu J, Deng H, Pan A, Zhang L, et al. Complete genome sequences of Mycobacterium tuberculosis strains CCDC5079 and CCDC5080, which belong to the Beijing family. J Bacteriol. 2011;193:5591–2.View ArticlePubMed CentralPubMedGoogle Scholar
- Katoh K, Toh H. Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform. 2008;9:286–98.View ArticlePubMedGoogle Scholar
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12.View ArticlePubMed CentralPubMedGoogle Scholar
- Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Biossssinf. 2009;10:421.View ArticleGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.