Selection of isolates
The strains for this study were selected from three different collections in Colombia, Argentina and Spain and were isolated from patients between 1995 and 2008. Isolates were selected to represent the main circulating lineages as well as strains with particular characteristics, such as strains involved in TB epidemics, or strains that acquired multidrug resistance. All isolates were typed by spoligotyping and IS6110 RFLP, as described  and were selected based on diverse RFLP patterns. Spoligotypes were assigned by comparison with the international spoligotype database, SpolDB4 (http://www.pasteurguadeloupe.fr:8081/SITVITDemo/index.jsp) .
The study and protocols for collection of bacterial strains from patients were approved by the ethics committees at each institution: the Ethics Committee at the Universidad de Antioquia and local health authorities in Medellín, Colombia (Dirección Seccional de Salud de Antioquia and the Secretaría de Salud de Medellín), the Ethical Committee of the Aragon Government for strains collected at the Hospital of Getafe, Hospital Universitario Miguel Servet and Hospital Universitario Lozano Blesa in Spain, and the Ethical Committee of the Instituto Malbrán in Argentina. Written informed consent was provided by the subjects in Colombia and Spain. In Argentina this was waived because the investigators performing the study received secondary data about human subjects and microbiological strains, which had been originally isolated elsewhere from human biological samples for diagnostic purposes, but no possible personal identifiers were transferred to the researchers. The source of the data was disclosed to the Committee in the application.
Extraction of genomic DNA was performed by the chemical lysis method . In brief, all visible colonies grown on Lowenstein-Jensen were collected into 400 μl 10 mM Tris- 1 mM EDTA buffer pH 8.0, suspensions were heated at 80°C for 20 min and incubated with lysozyme at 37°C overnight. Cells were then incubated with SDS and proteinase K (10% SDS and 0.05 mg/mL proteinase K) at 65°C for 10 min, 100 μl of a CTAB-NaCl solution (10% CTAB, 0.7 M NaCl) were added, tubes were vortexed until the suspension turned milky, and incubated for 10 min at 65°C. DNA was purified and precipitated using chloroform:isoamylalcohol (24:1) and 0.6 volumes of isopropanol, respectively. The genomic DNA was then resuspended in 20 μl of nuclease-free water.
Preparation of DNA libraries
The DNA from each sample was quantified by spectroscopy (NanoDrop®, ThermoScientific). Four hundred nanograms of each DNA were diluted in 10 μl of sterile water and sonicated at maximum power for 1 min (XL 2020 Sonicator, Misonix) to yield fragments with an average size between 200–400 bp. End repair of the DNA was done in a final volume of 20 μl using 10 μl of sonicated DNA, 0.05 mM dNTPs, 1x T4 DNA ligase buffer (NEB), 5 U of Klenow DNA Polymerase (NEB) and 10 U of T4 PNK (NEB). The reaction was incubated for 30 min at 25°C followed by 20 min at 75°C. To remove unused dNTPs, 11U of SAP (Promega) were added to each sample and incubated 30 min at 37°C followed by 30 min at 80°C to inactivate the phosphatase. Addition of a 3’ terminus adenine to the fragments was done in 30 μl using 21 μl of the previous reaction, 0.1 mM dATP, 10 U of Klenow exo minus (Epicentre) and 3 μl of 1x T4 DNA Ligase buffer; the reaction was incubated for 30 min at 37°C and then 20 min at 75°C.
Adapters were annealed by combining 25 picomoles of the forward and reverse adapters (Adapter F: P –B’AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG and Adapter R: ACACTCTTTCCCTACACGACGCTCTTCCGATCTBT [Illumina; Oligonucleotide sequences © 2007–2009 Illumina, Inc. All rights reserved], where B indicates the barcode position (Additional file 1: Table S 9) and B’ the reverse complement, P designates a phosphate) in 2.5X T4 DNA ligase buffer in a final volume of 20 μl. The solution was heated for 2 min at 94°C and slowly cooled down (0.1°C/s) to room temperature. The adapter mix was added to the previously prepared genomic DNA (30 μl) in a 50 μl reaction volume and 400 U of T4 DNA ligase (NEB) were added and incubated for 1 hour at 16°C followed by 10 min at 65°C. QiAquick PCR purification kit (Qiagen) was used to clean ligated DNA, which was eluted in 30 μl of Elution Buffer (EB).
PCR amplification of insertion sequence flanking sites
For the amplification of regions flanking the IS6110 element, sets of 24 samples that were prepared and ligated with different barcoded adapters were pooled. Twenty five different pools were amplified, each using a IS 6110-specific primer tagged with a 4–5 bp barcode (Additional file 1: Table S 9) and the 3’ end of the Pair End (PE) adapter (Illumina®) (IS-right: 5’ CTGAACCG
BACTCACCGGGGCGGTT 3’; IS-left: 5’ CTGAACCGCTCTTCCGATCT
BACATGCCGGGGCGGTT 3’, where B indicates the barcode and adapter sequences are underlined), and Nested_Primer1 that anneals on the adapter (underlined) and contains the reverse Illumina PE PCR primer 1.0 (italics) (5’ AATGATACGGCGACCACCGAGATCTACATCTTTCCCTACACGAC 3’). IS-specific primers were designed to specifically anneal the terminal direct repeats of the IS 6110 leaving a CA dinucleotide at the 3’ end for quality control. The PCR was performed using 5 U of Platinum Taq (Invitrogen), 1x PCR Buffer, 0.5 mM dNTPs, 0.5 μM of Adapter Primer1and 0.5 μM of an equimolar mix of the IS specific primers, 2 mM MgCl2, 0.5 M Betaine and 120 ng (equivalent to 5 ng per sample) of the purified DNA in a final volume of 25 μl. The thermal cycling program consisted of 33 min at 93°C for initial denaturation and 35 cycles of 2-step PCR amplification using 93°C for 45 sec and 6 min annealing/amplification. The first 5 cycles were done at 55°C, the remaining were done at 68°C.
Nested_Primer1 was then used in a nested PCR reaction with Nested_Primer2 (5’ CAAGCAGAAGACGGCATACGAGAT CGGTCTCGGCATTCCTGCTGAACCGCT3’) that anneals on the adapter (underlined) and incorporates the Illumina PE PCR Primer 2.0 (italics). This was done using 2 μl of the initial product; all the reagents had the same concentration as the initial PCR. The amplification was done as before for 20 cycles with an annealing temperature of 62°C. From this PCR, 2.4 μg of DNA from each reaction (100 ng/sample) were pooled together in a single tube. Purification of the fragments between 200 – 500 bp was done by gel extraction (QIAquick, Qiagen) and the purified DNA was sequenced in an Illumina® Genome Analyzer (GA) II sequencer. Two lanes in independent flow cells were performed. The first one generating 2 x 35 bp PE sequences and the second one generating 2 x 60 bp PE sequences. Generated sequences were deposited in NCBI Short Read Archive accession number SRA030755.
Method validation using reference strains
Seven strains (three from Colombia, two from Argentina, one from Spain and the reference M. tuberculosis H37Rv (Colombian Isolate) with known insertion sites were used to validate the method. Primers were designed to PCR-amplify genes and regions flanking the insertion sequence (Additional file 1: Table S 10) such that a difference in size of the amplified product reflects the presence of the insertion element (1362 bp). Twenty nanograms of genomic DNA were used for the PCR in a final volume of 25 μl with, 1X Red Mix (Corpogen, Colombia), 400 nM of each primer and purified water. Three temperature profiles were established according to the expected size of the amplified product. For products of more than 2500 bp, 95°C for 5 minutes followed by 40 cycles of 95°C for 45 sec, annealing temperature depending on primer sequence for 45 sec and 72°C for 2.5 min, and final extension at 72°C for 10 min. For products between 1500 and 2400 bp, initial denaturation at 95°C for 5 min, 35 cycles of 95°C for 40 sec, annealing temperature for 40 sec and 72°C for 2 min with final extension for 10 min at 72°C. For small products, the number of cycles was reduced to 30, denaturation and annealing time reduced to 35 sec and extension to 1 min, initial and final extension remained the same in all three profiles. Amplified products were visualized in 1% agarose gel stained with ethidium bromide.
Sequence data processing
Sequences were processed using a combination of stand-alone software linked through perl scripts for the parallel analysis of all the sequences. The standard sequencer output consists of two files, one for each pair-end, with the corresponding sequences in the same position on each file. In our particular case, one file had sequences composed by the adapter barcode followed by the target sequence (referred to as “adapter read” hereafter), the other file has sequences containing the primer barcode, the IS-Primer, a CA dinucleotide characteristic of the insertion site and the target sequence (called a “primer read” hereafter).
The reads were first classified according to the barcodes, filtered for low quality data and the IS primer was masked. Trimming of possible sequencing primers was done using agrep (http://laurikari.net/tre/) for fuzzy string matching with up to 2 errors in the IS primer and posterior masking. A Smith-Waterman fast alignment was then used to search for and trim parts of adapters or primers. Any sequence with terminal consecutive N’s or more than 1 consecutive internal N was removed. After trimming, only sequences with exact matches to the barcodes and with a length, after barcode trimming, longer than 12 nucleotides were kept and stored in an independent file per sample.
High-throughput Illumina sequencing can be limited by the capacity to accurately map short reads to reference sequenced genomes. In order to increase the precision of mapping reads, the reads were initially classified by sample and assembled to generate contigs that could be more easily mapped to the reference M. tuberculosis genomes. This strategy allowed us to identify genomic locations using contigs of up to 300 nucleotides rather than 35-nucleotide reads. Sequences from each sample were assembled using Velvet , parameters were adjusted for a seed of 17 bases and short pair ends. During the assembly the minimum coverage cutoff was 4, the expected insert length was 400 and the minimum contig length was 45. After the assembly, all adapter reads were mapped back to the contigs, and the corresponding primer read was retrieved. Since the primer read was trimmed for the IS specific primer sequence, all the primer reads should begin with the target sequence at the exact position of the insertion site. For each contig, all the primer reads were aligned using clustalW to generate a consensus sequence. If the consensus had less than 85% conservation and coverage less than 10, the contig was dissolved and the reads sent to a leftover file.
For the remaining contigs, the consensus of the alignment and the contig sequence were used as queries against the 6 complete and annotated M. tuberculosis genomes (Genbank accession numbers: NC_000962.2, NC_002755.2, NC_002945.3, NC_008769.1, NC_009525.1, NC_009565.1). All hits were retrieved and organized by genome according to the detail level of annotation. If the contig and the consensus hit the same genome in opposite strands and with a maximum distance between the hits of 600 bp (since the fragments sequenced ranged between 200–500 bp), the hit was reported in a table as a “specific” hit where the coverage, genome, position and orientation of the hit were reported. Otherwise, the contig was dissolved and the long reads sent to the leftover file. Leftover reads were mapped individually to all the sequenced reference genomes. All hits within 10% of the score of the best hit were kept for each read. Each region in the reference genomes longer than 25 bp with a minimum coverage of 9 reads was reported. In this case, since there is no evidence for the exact insertion site location, the fragments are labeled as “approximate”. Both reports were combined and a table for each strain was generated with the coverage, position, strand, genome and the annotation for the position.
Since both ends of each insertion element were ideally amplified and sequenced, we further parsed the data in order to search for segments located in close proximity but in opposite directions. This parsing generated 5 categories of results based on the confidence of the mapping. C0 corresponded to IS located in the CRISPR element of the genome known as the direct repeat (DR) region, which due to tandem 36 bp repeats needed special parsing for accurate mapping. C1, which has the highest confidence, consisting of a pair of “specific” fragments each corresponding to an end of the IS element. C2 consisted of a pair of fragments, one” specific” and one “approximate” in close proximity and opposite directions. C3 was formed by two” approximate” fragments in close proximity and opposite directions. C4 indicated that only one “specific” fragment could be mapped. C5 include those hits where only an “approximate” fragment was mapped. These last 2 categories had the highest probability of being false positives but they were kept since they could also represent regions where one of the flanking sides of the IS element is absent from all 5 reference genomes due to strains specific insertions or deletions.
Coverage cutoff determination
To see how well our sequence reads were able to detect IS insertion sites we looked at the correlation between expected (based on RFLP data) and observed sites (our IS-seq protocol). The relationship between the coverage obtained using IS-seq (reads per RFLP band) and the difference between observed and expected sites showed that our method was prone to identify fewer insertion sites than number of bands identified by RFLP when the coverage was below 100 reads per RFLP band, hence decreasing the sensitivity (Additional file 3: Figure S 2). For this reason, 60 strains with low coverage were discarded from further analysis. At high coverage the curve leveled off with an average of 4 observed sites above the number of RFLP-detected bands, reflecting again that RFLP analysis underestimates the real copy number of the IS6110 element. This also implies that there will be no increase of false positives with higher depth of sequencing.
Definition of unique insertion sites
In order to create matrices and compare insertion sites among samples it was necessary to determine which insertion sites were conserved among strains. For this, the different observed sites were evaluated based on the estimated precision. Initially a seed matrix was generated with all C1 sites that constitute the most accurate sites identified. Sites classified as C2, C3 or C4 were then compared to the seed, if they fell within 30 bp of a C1 site, both were collapsed; otherwise a new unique site was created. C0, which represent insertions on the DR region, were checked manually for determination of independent insertion sites. Within each locus and for insertion sites identified in homologous genes of different reference genomes, a manual curation was done to collapse a site when each terminus of the insertion mapped with high scores to different genomes (i.e. H37Rv and CDC1551). Unique sites with more than 20 sequences were checked manually for consistency and to verify that it was not composed of more than one unique site within the parameter of 30 bp used above.
Data analysis and statistics
Since complete genome sequences of all the strains under study are unavailable, and thus estimation of true negatives is unfeasible, we calculated the sensitivity and positive predictive values (PPV) for validation analysis using the following formulas:
We assumed that each insertion observed appeared either independently or that all the insertions in a given position of the genome appeared only once in a common ancestor. Although both assumptions are unlikely, they nevertheless give the lowest and highest maximum number of insertion events that can be measured. To rarefy this number of insertion events respective to the number of genes affected, random sub-sampling of the distribution of insertions was performed continually from a total of 10 insertions to the total number of insertions available. A step of 20 insertions was chosen and 10 replicates per step were performed. The data was plotted in MATLAB and a non-linear fit to an exponential decay function was performed using the formula:
Distribution of IS sequences
To examine the bias in genomic distribution of IS sequences in the M. tuberculosis
genome, the number of independent insertions observed in non-overlapping windows of 2 and 50 Kb were determined. Given the total number of independent insertions and the length of the genome, an average number of insertions in a 2 or 50 Kb window can be determined by,
We tested the probability that the observed number of insertions in a given window followed an expected Poisson distribution. Bonferroni correction of the threshold was performed to assess the significance of the hit.
Functional category representation
COG categories and gene lengths were retrieved from the NCBI entry for each reference genome (accessed on August 1, 2011). Genes with independent insertions were assigned to the different categories, filtering out those belonging to the PE/PPE family or those corresponding to IS6110 elements. The Fisher exact test was performed on each category to estimate the probability of obtaining by chance that number or more insertions in a particular category, given the length of the genes. Bonferroni correction for the threshold was performed. The genes involved in virulence, detoxification and adaptation were retrieved from http://tuberculist.epfl.ch/index.html (Accessed on July 26, 2011).
Variation of information and conditional information
To determine the relatedness among the different isolates given by IS6110
insertion patterns, we applied mutual information measurements. In order to generate a metric that will allow us to build hierarchical clustering trees, we used the derived metric known as Variation of Information where the distance between any two samples d(X,Y) is given by: Equation 1:
where I(X,Y), or the mutual information between X and Y, is defined as Equation 2:
and H(X) is defined as Equation 3:
Once we determined that the IS6110
insertion sites were highly associated with strain classifications we used Mutual Information (Equation 2) to determine the association between a given insertion in a loci and a specific family. Mutual Information values can be used to determine approximate p-values for the association as follows:
where N(X) is the number of elements of X.
We used Conditional Mutual Information, which allows us to compensate for the correlation between insertion sites and phylogeny, to test for correlation between biological traits (X) and the presence or absence of a particular insertion site (Y) given that the strain belongs to a phylogenetic family (Z), using the following formula:
Family prediction of sequenced available genomes
Fasta files were downloaded for all available M. tuberculosis genomes present in the PATRIC database  on September 25 2010: M. tuberculosis '98-R604 INH-RIF-EM', M. tuberculosis 02_1987, M. tuberculosis 210, M. tuberculosis 94_M4241A, M. tuberculosis C, M. tuberculosis CDC1551, M. tuberculosis CPHL_A, M. tuberculosis EAS054, M. tuberculosis F11, M. tuberculosis GM 1503, M. tuberculosis H37Ra, M. tuberculosis H37Rv, M. tuberculosis K85, M. tuberculosis KZN 1435, M. tuberculosis KZN 4207, M. tuberculosis KZN 605, M. tuberculosis KZN R506, M. tuberculosis KZN V2475, M. tuberculosis T17, M. tuberculosis T46, M. tuberculosis T85, M. tuberculosis T92, M. tuberculosis str. Haarlem. IS 6110 elements were identified in each genome using blast with an e-value threshold < 1e-30; 200 bp of the up-stream and down-stream flanking sequences were extracted for each insertion identified. The positions of these insertions were identified running blast of the flanking sequences against the H37Rv and CDC1551 genomes (NCBI). The list of insertions and positions was compared against the family-specific markers identified and were classified accordingly. A subset of these strains has been used recently and classified using SNPs (see Additional file 1: Table S 7).