Skip to main content

ISMapper: identifying transposase insertion sites in bacterial genomes from short read sequence data

Abstract

Background

Insertion sequences (IS) are small transposable elements, commonly found in bacterial genomes. Identifying the location of IS in bacterial genomes can be useful for a variety of purposes including epidemiological tracking and predicting antibiotic resistance. However IS are commonly present in multiple copies in a single genome, which complicates genome assembly and the identification of IS insertion sites. Here we present ISMapper, a mapping-based tool for identification of the site and orientation of IS insertions in bacterial genomes, directly from paired-end short read data.

Results

ISMapper was validated using three types of short read data: (i) simulated reads from a variety of species, (ii) Illumina reads from 5 isolates for which finished genome sequences were available for comparison, and (iii) Illumina reads from 7 Acinetobacter baumannii isolates for which predicted IS locations were tested using PCR. A total of 20 genomes, including 13 species and 32 distinct IS, were used for validation. ISMapper correctly identified 97 % of known IS insertions in the analysis of simulated reads, and 98 % in real Illumina reads. Subsampling of real Illumina reads to lower depths indicated ISMapper was able to correctly detect insertions for average genome-wide read depths >20x, although read depths >50x were required to obtain confident calls that were highly-supported by evidence from reads. All ISAba1 insertions identified by ISMapper in the A. baumannii genomes were confirmed by PCR. In each A. baumannii genome, ISMapper successfully identified an IS insertion upstream of the ampC beta-lactamase that could explain phenotypic resistance to third-generation cephalosporins. The utility of ISMapper was further demonstrated by profiling genome-wide IS6110 insertions in 138 publicly available Mycobacterium tuberculosis genomes, revealing lineage-specific insertions and multiple insertion hotspots.

Conclusions

ISMapper provides a rapid and robust method for identifying IS insertion sites directly from short read data, with a high degree of accuracy demonstrated across a wide range of bacteria.

Background

An insertion sequence (IS) is a small transposable element that encodes the proteins required for its own transposition. The ISfinder database [1] currently contains over 500 distinct IS. During transposition some ISs create direct repeats, or target site duplications, in the sequences into which they are integrating. The presence and length of these duplications vary widely between ISs and are characteristic of individual IS [2]. Rates of transposition vary between ISs and host species, but are frequently in the order of the rate of nucleotide substitutions, making IS activity one of the more dynamic evolutionary forces at play in many bacterial genomes. The movement of ISs can also have functional consequences for bacterial genomes. ISs have been implicated in large changes to genome structure, by expanding in copy number in microbial genomes, with subsequent loss of ISs resulting in inactivation of genes, pseudogene formation, mediating deletion of intervening sequences between two copies of the IS, or rearrangements of the genome [3].

In addition, IS insertions upstream of protein coding sequences can result in their enhanced expression, leading to different phenotypes depending on the function of the over-expressed gene. There are several known examples of IS-mediated gene expression leading to clinically important increases in antimicrobial resistance. For example, increased resistance to fluoroquinolones such as ciprofloxacin can result from the insertion of IS1 or IS10 upstream of the acrEF efflux pump in Salmonella Typhimurium [4], or the insertion of IS186 upstream of the acrAB efflux pump in Escherichia coli [5]. In Acinetobacter baumannii, insertion of ISAba1 or ISAba125 upstream of the intrinsic beta-lactamase ampC can cause resistance to third generation cephalosporins including ceftazidime and cefotaxime [6, 7]. Insertions of the same IS in nearby locations can generate a composite transposon, capable of mobilizing the intervening sequence and transferring it to new genomic locations. For example, the composite transposon Tn6168 was generated spontaneously via insertions of ISAba1 on either side of ampC, including one copy of ISAba1 that upregulates ampC expression [8]. Tn6168 has then transferred into different A. baumannii backgrounds, conferring horizontally-acquired resistance to third generation cephalosporins [8].

IS insertions also result in the upregulation of virulence genes in clinically important human pathogens. For example, an outbreak of tuberculosis in Spain in the 1990s was associated with the B strain of Mycobacterium bovis carrying an insertion of IS6110 in the promoter region of the virulence gene phoP, resulting in its upregulation [9]. In Neisseria meningitidis, insertion of IS1301 in the middle of the capsule locus has been shown to cause increased expression of operons on either side of the IS, contributing to protection from the human immune system and enhanced pathogenicity [10]. ISs have also been shown to enhance niche adaptation in bacteria, for example IS1247 insertion upstream of dhlB in Xanthobacter autotrophicus results in increased resistance to bromoacetate [11]. This region has also been mobilised by the IS and transferred to a plasmid [11]. In E. coli, IS3 has been shown to up-regulate threonine expression, allowing the bacteria to adapt to a low-carbon environment and utilise threonine as its sole carbon source [12].

The profiling of IS insertion patterns has been used for typing purposes in numerous bacterial species of importance to human health. For example, copy number and position of IS200 in Salmonella enterica [13], IS6110 in Mycobacterium tuberculosis [14], IS1004 in Vibrio cholerae [15] and ISAba1 in A. baumannii [16] has been used to profile these bacterial pathogens, allowing the identification and tracking of distinct subtypes. To date, IS-based typing schemes for various bacteria have relied on digesting the genome followed by either hybridizing IS probes to fragments in a gel or PCR probing [1315]. The detection of precise insertion sites can be achieved using PCR, and may be done for typing purposes [17] or for the detection of functionally important insertions [7, 9].

With the advent of cheap high-throughput short-read sequencing, whole genome sequencing (WGS) of bacteria is increasingly common and is replacing traditional methods for characterizing and typing bacterial genomes. Unfortunately the detection of ISs is complicated wherever read lengths are shorter than the length of the IS, as is the case for platforms that are currently most widely used – Illumina and Ion Torrent. IS insertion sites can readily be identified in finished bacterial genomes or in draft assemblies of genomes with single-copy ISs, using tools such as nucleotide BLAST or ISfinder [1]. However where multiple copies of the same IS are present within a single genome (including on the chromosome and/or plasmids), this complicates assembly of short-read data and makes IS insertion sites difficult to identify reliably. The IS detection problem can be resolved using long-read sequencing technologies such as the SMRT Cell (Pacific Biosciences) or MinION (Oxford Nanopore) platforms; however given the relative cost efficiency and reliability of short-read sequencing, together with the current widespread use of Illumina for bacterial WGS and wealth of available short-read data for clinically important bacteria, there remains a need for a simple tool to identify IS insertion sites from short-read data.

Several studies report the use of mapping-based approaches to identify IS insertion sites from bacterial short-read data [18, 19], however none provide software code or validation of the approach used. There are tools available for detecting transposons or structural variation in genomes, for example MindTheGap [20], BreakDancer [21], and Mobster [22], however these do not perform well in the identification of IS in bacterial genomes nor were they designed to do so. Some programs could potentially be used for this purpose, such as RelocaTE [23] and RetroSeq [24], however these require additional input or prior knowledge about the IS which may not always be available. TIF (Transposon Insertion Finder) [25] and breseq [26] could potentially be used for the detection of IS insertion sites in bacterial genomes, however they were not designed specifically for this purpose and did not perform well on our data sets (see Results).

Here we present a rapid and robust tool for accurate detection of ISs, including insertion site and orientation, direct from short-read data. The method is freely available in the form of open-source code called ISMapper, and here we validate its use via analysis of simulated and real short-read data from a range of ISs and bacterial species. ISMapper requires short reads and query IS sequences as input, and can be used either for typing against a reference genome or to assist with manual resolution of complex short-read assemblies.

Implementation

An overview of the ISMapper workflow is shown in Fig. 1. ISMapper takes as input: (i) a set of paired end Illumina reads for an isolate of interest, (ii) an IS query sequence in fasta format, and (iii) either a reference genome (for typing) or an assembly of the read set (for assembly improvement), in GenBank or FASTA format (Fig. 1a). Paired end Illumina reads are mapped to the IS query sequence using BWA-MEM (v0.7.5a or later) [27]. From the resulting alignment file (SAM format), unmapped reads whose pairs map to the end of the IS query sequence (that is, reads representing the sequences directly flanking the IS) are extracted using SAMtools view (v0.1.19 or later) [28] to retrieve reads based on SAM flags (Fig. 1b). Specifically, left flanking reads (taking input sequence as left to right) are extracted using flag ‘–f 36’ and right flanking reads are extracted using flag ‘–F 40 –f 4’ and stored in separate BAM files, which are then converted to FASTQ format using BedTools (v2.20.1) [29]. In addition, Samblaster (v0.1.21) [30] is used to extract from the SAM file any reads that map to the end of the IS and extend into the neighbouring sequence (i.e., “soft clipped” reads, Fig. 1b). The resulting FASTQ file is filtered using BioPython to extract the soft clipped portion of reads, where those sequences fit a specified size range (default 5–30 bp). The resulting sequences are sorted into left and right flanking sequences; these are each mapped separately to the reference genome or assembly using BWA-MEM, to identify the location(s) of the query IS in the genome under analysis (Fig. 1c). Insertion site information is extracted from the resulting alignments using BedTools (coverage command) to summarise coverage of the reference by left and right flanking reads; these are filtered by read depth (default, minimum read depth ≥6x) to minimize false positive hits, and regions that overlap or are separated by a short distance (default, ≤100 bp) are merged using BedTools (merge command). Pairs of left and right flanking regions that likely represent either side of the same IS insertion are identified on the basis of positional information, using BedTools (intersect and closest commands). Left and right regions that overlap are considered to indicate a novel IS insertion not present in the reference, with the overlap resulting from target site duplication arising during IS transposition (Fig. 1c, novel site). Coordinate x refers to the left side of the target site duplication, and y refers to the coordinate of the right side of the target site duplication. In cases where the left and right flanking regions are extremely close but do not overlap, x refers to the inner end of the left-most region, and y refers to the inner end of the right-most region (as per a known site, see below). Where left and right regions are separated by a sequence that is approximately the length of the IS query, the intervening sequence is extracted and compared to the IS query using nucleotide BLAST+ (v2.2.25 or later) [31] to confirm whether this is a known insertion site that is present in the reference (Fig. 1c, known site). In this case, coordinate x refers to the inner end of the left-most block, and y refers to the inner end of the right-most block. Coordinate x is always the smallest number and y always the largest, regardless of IS orientation.

Fig. 1
figure 1

Workflow for ISMapper. a Inputs are reads (fastq format) and an IS query (fasta), as well as either a reference genome to compare to or an assembly of the reads (fasta or genbank). b Reads are mapped to the IS query using BWA (dashed lines) and their pairs (i.e., flanking reads, solid lines) are retreived from the resulting SAM file using mapping flags (SAMTools). The unmapped component of soft clipped reads (solid + dashed lines), identified from the SAM file using Samblaster) are also retrieved using BioPython. c Flanking and softclipped read sequences are then mapped to either the reference genome or the read assembly (BWA), and the final mapping output is then filtered for depth to remove low coverage regions. Left and right end blocks are extracted from the resulting BAM file (using BedTools) and compared to one another to find either intersecting regions, indicating a novel IS position, or close regions, indicating a known IS position in the reference. In a novel position, overlapping sequences from the left and right ends most likely indicate the target site duplication generated during IS transposition (zoomed in section). d Results are tabulated, indicating the position and orientation of each site, whether it is novel or known, and information about the genes flanking the insertion site

Extensive testing of ISMapper revealed that it was sometimes unable to resolve IS positions that were adjacent to a repeat region (segments of DNA that were repeated multiple times around the genome; see Results). This is because when the IS-flanking reads were mapped back to the reference genome, those that belonged to the neighbouring multi-copy sequence were randomly assigned by BWA-MEM to the various locations of the repeat sequence, resulting in low read depth at the ‘true’ IS-adjacent copy of the multi-copy sequence, which can fall below the minimum depth filter (Fig. 2). In such cases, the sequence on the other side of the IS is usually not a multi-copy sequence and thus does not suffer the same problem, and so is usually identified as a confident IS-flanking region without a corresponding partner region (Fig. 2, purple block). Therefore, when ISMapper identifies an un-partnered IS-flanking region, it checks the original alignments for evidence of a nearby low-coverage partner region that failed to pass the depth filter and returns this as a potential but uncertain IS location, indicated by a ‘?’ character in the results table (see example of low-coverage partner region in Fig. 2, green block).

Fig. 2
figure 2

Issues can arise in the second mapping step of ISMapper if the IS insertion is next to a region that occurs more than once in the reference genome. In this example, the left flanking reads (green arrows) have mapped to all possible copies of the repeat sequence (dark red boxes) and are each randomly assigned to a repeat copy by BWA, resulting in each copy falling below the read depth cut-off (default 6x). To overcome this, where an unpaired flanking region is identified (purple box), ISMapper searches for a potential low-depth partner flanking region that is close to or intersecting the confident region (green box). This is recorded as an uncertain call, labelled in the output with the ‘?’ character

ISMapper generates two main output files summarizing the results: (i) a GenBank file of the reference sequence, annotated with the IS-flanking regions and (ii) a table indicating the locations and characteristics of each IS-flanking region identified (Fig. 1d). The table includes details of the location of the IS insertions (indicated by coordinates x and y); the distance between the left and right flanking regions (where a negative number indicates an overlap of left and right regions, indicating the size and sequence of the target site duplication); a call as to whether the insertion is present in the reference or is a novel insertion site (and, where the insertion site is present in the reference, the percent coverage and sequence homology with the IS query); and details of the gene(s) closest to the IS insertion site (including locus tag, product, gene name and distance from the IS to the start codon). Insertions are also marked to indicate less confident calls. A ‘*’ indicates an imprecise hit; i.e., where the gap between left and right regions is larger than expected for a novel insertion, but is not consistent with an IS insertion at that location in the reference. A ‘?’ indicates an uncertain hit, where only one end (left or right of the predicted insertion) passes the minimum read depth threshold; this often occurs when the IS is inserted within or adjacent to a multi-copy sequence, as described above (Fig. 2). When run in assembly improvement mode, the table produced is simpler and indicates which contigs are predicted to end adjacent to the IS (indicating left or right orientation), assisting the user to decide whether some contigs could be joined together based on the available IS evidence.

ISMapper is lightweight code – a test run on a laptop computer (MacBook Air) with 8GB of RAM and a 1.3GHz i5 processor was able to analyse a read set comprising 2.5 million 100 bp paired-end reads in approximately ten minutes for a single IS query. Because ISMapper analyses each read set and query IS independently, screening of multiple read sets and query IS can be easily performed in parallel across multiple cores. To facilitate easy compilation of results from multiple jobs, ISMapper includes a Python script to cross-tabulate results from multiple read sets, generating a single summary table per query IS (script ‘compiled_table.py’).

Results and discussion

Validation of IS detection using simulated reads

Nine publicly available finished genomes from a variety of bacterial genera, and including both chromosomes and plasmids, were downloaded from NCBI (Table 1). ISfinder [1] was used to identify the ISs present in each finished genome sequence. All sequences that had >50 % identity to a sequence in ISfinder and were present in at least two copies were tested using ISMapper (with the query IS being sourced from curated references in ISfinder). Nucleotide BLAST+ was used to confirm the precise locations and orientations for each query IS in all genomes (total 251 insertions of 17 distinct IS, see Table 1). Short reads (100 bp) were simulated from each genome sequence using the wgsim command in SAMtools (v0.1.19), with default parameter settings.

Table 1 Validation of ISMapper using simulated reads

ISMapper was run with default parameter settings on each combination of genome, query IS and simulated reads. ISMapper was able to accurately locate each IS position and its orientation (ranging between 2 and 61 positions per genome) for the majority of genomes (Table 1). In total, 97 % of IS insertions were correctly detected, with a false positive rate of 2.1 % (n = 6). The exceptions occurred in three genomes (K. pneumoniae plasmid pNDMAR, Y. pestis CO92 and E. coli O157:H7), in which ISMapper correctly identified 151 IS insertion sites and failed to identify nine (94 % detection). Closer inspection revealed that the missed IS were each located next to multi-copy repeat sequences, complicating the second mapping step as discussed above and outlined in Fig. 2. Switching on reporting of all alignments above a mapping score threshold of 30 (−a and -T 30 in BWA-MEM) enabled the detection of a further IS100 site in Y. pestis. By default this option is turned off in ISMapper as it tends to create noise in the mapping, making it more difficult to distinguish true and false positives; however this can be useful if an IS site of interest is known or suspected to be flanked by further repeats.

Validation of IS detection using real Illumina read sets derived from isolates with finished genomes

Next we validated ISMapper using six finished genomes for which both Illumina read data and finished genomes were publicly available (Table 2). Each finished genome sequence was analysed with ISfinder [1] to identify query ISs for testing as described above, and nucleotide BLAST was used to confirm the precise locations and orientations of each IS in each genome. The resulting test set comprised 106 insertions of 14 query ISs. Using default settings, ISMapper was able to accurately identify each IS insertion site and its orientation, between 2 and 26 per genome, for the majority of genomes (Table 2). In total, 104 (98 %) IS insertions were correctly detected by ISMapper, with 3 IS insertions falsely detected (false positive rate of 2.5 %, n = 3). Three of four IS431mec insertions in Staphylococcus aureus TW20 were correctly detected, however the fourth was missed by ISMapper as it was flanked by another IS431mec and further repeat sequences. Two of three IS1 insertions in Salmonella Typhi CT18 were correctly detected however a third, located between tviE and tviD, was problematic. ISMapper identified the region flanking the IS at tviE, but did not detect any corresponding region in tviD. To investigate why, we mapped the entire Illumina read set to the CT18 chromosome reference sequence, which showed that there were no reads derived from the region between tviD to tviA. Therefore this region appears to have been deleted during culture in the laboratory prior to the extraction of DNA for Illumina sequencing. This region encodes the biosynthesis of the Vi capsule of S. Typhi, and is known to be lost sporadically during culture [32]. This illustrates that situations where one end of the IS is detected but the other is not can often be ‘accurate’ in the sense that the result reflects underlying structural variation in the genome, including potentially IS-mediated deletions.

Table 2 Validation of ISMapper using real Illumina reads for which finished genomes were also available

Detection of antibiotic resistance-mediating IS insertions in Acinetobacter baumannii, confirmed by PCR

The genomes of seven ceftazidime resistant A. baumannii isolates, belonging to global clone (GC) 1, were sequenced via Illumina HiSeq to generate 100 bp paired end reads. Resistance gene screening of the Illumina data using SRST2 [33] and the ARG-Annot database [34] confirmed earlier PCR data indicating that none of these isolates carried acquired extended spectrum beta-lactamase (ESBL) genes that can confer resistance to third-generation cephalosporins. However, it is known that the insertion of ISAba1 upstream of the intrinsic chromosomally encoded ampC beta-lactamase gene can cause increased resistance to third-generation cephalosporins in A. baumannii [6].

We used ISMapper to screen for the ISAba1 query sequence (accession AY758396), sourced from ISfinder [1]. Using default parameters, ISMapper identified ISAba1 insertions in all seven GC1 genomes. IS positions were assessed relative to the finished genome sequence of A. baumannii GC1 reference A1 (accession CP010781). ISMapper found between 3 and 5 ISAba1 insertions in each GC1 isolate, including an insertion upstream of ampC in all 7 genomes that was in the orientation required to induce upregulation and explain the observed cephalosporin resistance phenotype (Fig. 3). In addition, out of 29 total ISAba1 insertions, ISMapper was able to correctly identify 26 target site duplications (9 bp in the case of ISAba1). All ISAba1 insertions were novel compared to the reference genome A1 (Fig. 3) and were confirmed using PCR, as described in [6].

Fig. 3
figure 3

ISAba1 positions detected in seven Acinetobacter baumannii genomes. The A1 reference genome is indicated by the outer black circle, tick marks indicate genome coordinates (in Mbp), genes targeted in the two available MLST schemes are marked in blue for orientation purposes. ISAba1 detected within the individual test genomes are shown on inner rings (purple), orientation of the IS is indicated by shading (dark, left end; light, right end). Novel IS insertions were identified upstream of the beta-lactamase ampC (green) in all isolates, explaining the observed phenotypic resistance to third generation cephalosporins

Impact of read depth on ISMapper performance

To test the effect of read depth on the performance of ISMapper, each of the seven GC1 A. baumannii read sets were randomly subsampled to depths of approximately 10x, 15x, 20x, 25x, 50x, 75x and 100x, with ten replicates per depth level per read set. ISMapper was then run using default settings to screen for ISAba1 insertions. The results indicated that at mean genome-wide read depths of approximately 20x, ISMapper was able to identify 95 % of insertions correctly (Fig. 4). However, all of these calls were either imprecise (gap size larger than expected) or uncertain (high coverage end paired with a low coverage end). An average genome-wide read depth of ~50x was required to find all insertions, with confident calls for >60 %, however there was clearly some variation depending on read quality (Fig. 4). To achieve 100 % detection with high confidence, average genome-wide read depths of >75x were required (Fig. 4).

Fig. 4
figure 4

ISAba1 detection rate as a function of read depth, for seven A. baumannii GC1 genomes sequenced with Illumina. Mean (lines) and standard deviation (shaded areas) proportions of IS insertions correctly detected per genome, amongst 70 replicate read sets sampled at each depth level (7 read sets x 10 replicates each at read depths 10x, 15x, 20x, 25x, 50x, 75x, 100x). Blue, IS insertion site detected; red, detected with high confidence; purple, detected with low precision (larger than expected gap size between left and right flanking regions, indicated with ‘*’ in output); green, detected with low confidence (high-depth evidence for one side only, low-depth evidence for the other, indicated with ‘?’ in output)

Comparison of ISMapper with TIF and breseq

The seven A. baumannii GC1 genomes were used to test both breseq [26] and TIF (Transposon Insertion Finder) [25]. breseq uses split read mapping to a reference genome along with statistical models to determine new junctions and deletions in the isolates of interest. As input, breseq takes paired end reads in FASTQ format, and a reference genome in Genbank format. The breseq manual indicates that new insertions of mobile elements can determined by looking for ‘JC JC’ evidence types in the final html output. All seven A. baumannii isolates were screened using default parameters and the reference genome A1 (accession CP010781). In all cases, breseq was unable to identify any mobile element insertions, including no structural variation at the known ISAba1 insertion sites, although many other types of structural variation were detected.

TIF requires as input paired end reads (FASTQ format), the head and tail sequences (approximately 17 bp) of the IS of interest as well as the size of the target site duplication the IS makes during transposition. TIF uses regular expressions to search for the head and tail sequences in the reads, and these reads are then extracted and grouped by their target site duplications. Unfortunately, following communication with the authors, we were unable to get TIF to output any results using our data. Other disadvantages of TIF are the requirements to (i) specify the size of target site duplications (which not all IS make and is not always known), (ii) manually extract subsequences of the IS rather than inputting the complete sequence, and (iii) manually edit a Perl script in order to specify inputs to the program.

Example use case: exploration of IS6110 insertions in Mycobacterium tuberculosis

While IS insertions are thought to be important for shaping the evolution of bacteria in a variety of ways, high-resolution comparative genomic studies of bacterial pathogens have largely ignored ISs due to the difficulties associated with accurate detection of insertion sites from high-throughput short read data. An important example is IS6110 in M. tuberculosis [35]. Profiling of IS6110 insertions using PCR and restriction fragment based polymorphism (RFLP) based methods has been reported for typing purposes [36], and specific insertions have been linked to clinically relevant changes in function including in outbreak strains [9, 37, 38] However while numerous studies have reported the genomic analysis of hundreds of M. tuberculosis isolates sequenced on the Illumina platform, these have not included analysis of IS6110 insertions. Thus, to demonstrate the utility of ISMapper for comparative profiling of ISs in an important bacterial pathogen, we analysed the distribution of IS6110 within 138 publicly available genomes representing the major lineages of M. tuberculosis [39]. Paired-end Illumina reads were downloaded from NCBI (ERP001731). A core genome phylogeny was generated from these reads by SNP (single nucleotide polymorphism) calling against reference genome H37Rv (accession NC_000962) (methods as described in [40]), followed by maximum likelihood phylogenetic inference on the SNP alignment using RAxML (GTR + G substitution model, 1000 bootstraps) to build a genome-wide phylogenetic tree. ISMapper was run with default settings to screen for insertions of IS6110 (accession X17348) in each read set, relative to reference genome H37Rv.

A total of 392 unique IS6110 insertion sites were identified by ISMapper, approximately one per 10 kbp of the 4.4 Mbp reference genome. The frequency of each insertion within each of the six main global lineages is shown in Fig. 5b. The data indicate multiple lineage-specific IS6110 insertions in lineages 2–6, but none that were shared by multiple lineages, suggesting that IS6110 insertions began to accumulate only after M. tuberculosis diverged into these distinct lineages. Isolates in the “modern” lineages 2–4 and in the West African lineage 5 had more IS6110 insertions overall, with far fewer insertions observed in the “ancient” East and West African lineages 1 and 6 (Fig. 5c). Lineage 2, which includes the highly successful Beijing sublineage, had the highest number of IS6110 although it was not the most common lineage in the collection (n = 23); it could be that these insertions contribute to the adaptive fitness of the Beijing lineage.

Fig. 5
figure 5

Analysis of IS6110 insertions in a diverse set of M. tuberculosis genomes. Analyses are based on ISMapper analysis of publicly available Illumina paired end data from 138 genomes. a Phylogenetic tree for M. tuberculosis based on genome-wide SNP calls, with midpoint rooting and collapsed to lineage level. Number of isolates analysed in each lineage (L) is indicated in brackets. b Heat map indicating the frequency of each IS6110 insertion site (columns) detected in each lineage (rows), according to inset legend (i.e., a black cell indicates that the IS insertion site was detected in all isolates of the given lineage, white cell indicates that the insertion site was not detected at all in that lineage). c Boxplots show number of IS6110 insertions detected per genome, for each lineage. Black line, median; boxes, interquartile range; whiskers, minimum and maximum values. d Histogram of IS6110 insertions in 1000 bp windows along the M. tuberculosis chromosome. Dashed line, threshold for defining insertion hotspots

The spatial distribution of unique IS6110 insertions within the M. tuberculosis genome (Fig. 5d) revealed several clusters of insertions detected by ISMapper. Many of these clusters comprised multiple independent insertions into PE/PPE genes (which are surface-associated and interact with the host immune system), as well as the membrane associated proteins mmpS1 and mmpL12. There was substantial clustering of IS6110 insertions interrupting genes encoding the CRISPR machinery, which is involved in immunity to bacteriophage and other foreign DNA. Further, all three phospholipase genes, which are involved in virulence by inducing cell death in macrophages [41] and are encoded by the plcABC operon, contained multiple IS6110 insertions detected by ISMapper. This locus is a known hotspot for IS6110 insertions and has been shown to mediate deletions of segments of this region [42]. IS6110 insertions upstream of phoP, which have been associated with upregulation and enhanced virulence in M. tuberculosis [9], were identified in multiple lineages (1 insertion in 6 lineage 2 genomes; singular insertions in one genome each in lineage 3 and 5) and may be indicative of positive selection for enhanced phoP expression and virulence. These findings from ISMapper analysis are consistent with those reported from PCR-based screens of smaller sets of isolates, but provide a more comprehensive picture of IS dynamics in M. tuberculosis that could be extended to much larger genomic data sets and other important pathogens.

Conclusions

ISMapper is a lightweight and reliable tool for the detection of IS insertion sites in bacterial genomes using high-throughput short-read sequencing data, which is now ubiquitous in microbial research and clinical investigations. ISMapper performed well on real and simulated data from 32 different ISs and 13 bacterial species, detecting all but the most complex instances involving multiple neighbouring IS insertions or other repeated sequences. ISMapper was able to detect antimicrobial resistance-associated ISAba1 insertions in A. baumannii, with all sites detected by the program being subsequently confirmed by PCR. Compared to other tools such as breseq and TIF, ISMapper is ideal for detecting new positions for known ISs in bacterial genomes. In addition, ISMapper was able to rapidly produce a wealth of data on IS6110 insertions in M. tuberculosis, allowing quick identification of lineage-specific insertions and specific regions enriched for insertions that may be functionally significant.

Availability and requirements

  • Project name: ISMapper

  • Project home page: https://github.com/jhawkey/IS_mapper

  • Programming language: Python v2.7.5

  • Operating system(s): platform independent, requires Python 2.7 and dependencies

  • Other requirements: BioPython v1.63, BWA v0.7.12, SAMtools v1.1, Bedtools v2.20.1, BLAST+ v2.2.28, Samblaster v0.1.21

  • License: Modified BSD

References

  1. Siguier P, Perochon J, Lestrade L, Mahillon J, Chandler M. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. 2006;34(Database issue):D32–6.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  2. Mahillon J, Chandler M. Insertion sequences. Microbiol Mol Biol Rev. 1998;62:725–74.

    PubMed Central  CAS  PubMed  Google Scholar 

  3. Siguier P, Gourbeyre E, Chandler M. Bacterial insertion sequences: their genomic impact and diversity. FEMS Microbiol Rev. 2014;38:865–91.

    Article  CAS  PubMed  Google Scholar 

  4. Olliver A, Vallé M, Chaslus-Dancla E, Cloeckaert A. Overexpression of the multidrug efflux operon acrEF by insertional activation with IS1 or IS10 elements in Salmonella enterica serovar typhimurium DT204 acrB mutants selected with fluoroquinolones. Antimicrob Agents Chemother. 2005;49:289–301.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  5. Jellen-Ritter AS, Kern WV. Enhanced expression of the multidrug efflux pumps AcrAB and AcrEF associated with insertion element transposition in Escherichia coli mutants selected with a fluoroquinolone. Antimicrob Agents Chemother. 2001;45:1467–72.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  6. Hamidian M, Hall RM. ISAba1 targets a specific position upstream of the intrinsic ampC gene of Acinetobacter baumannii leading to cephalosporin resistance. J Antimicrob Chemother. 2013;68:2682–3.

    Article  CAS  PubMed  Google Scholar 

  7. Hamidian M, Hancock DP, Hall RM. Horizontal transfer of an ISAba125-activated ampC gene between Acinetobacter baumannii strains leading to cephalosporin resistance. J Antimicrob Chemother. 2013;68:244–5.

    Article  CAS  PubMed  Google Scholar 

  8. Hamidian M, Hall RM. Tn6168, a transposon carrying an ISAba1-activated ampC gene and conferring cephalosporin resistance in Acinetobacter baumannii. J Antimicrob Chemother. 2014;69:77–80.

    Article  CAS  PubMed  Google Scholar 

  9. Soto CY, Menéndez MC, Pérez E, Samper S, Gómez AB, García MJ, et al. IS6110 mediates increased transcription of the phoP virulence gene in a multidrug-resistant clinical isolate responsible for tuberculosis outbreaks. J Clin Microbiol. 2004;42:212–9.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  10. Uria MJ, Zhang Q, Li Y, Chan A, Exley RM, Gollan B, et al. A generic mechanism in Neisseria meningitidis for enhanced resistance against bactericidal antibodies. J Exp Med. 2008;205:1423–34.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  11. Van Der Ploeg J, Willemsen M, Van Hall G, Janssen DB. Adaptation of Xanthobacter autotrophicus GJ10 to bromoacetate due to activation and mobilization of the haloacetate dehalogenase gene by insertion element IS1247. J Bacteriol. 1995;177:1348–56.

    PubMed Central  PubMed  Google Scholar 

  12. Aronson BD, Levinthal M, Somerville RL. Activation of a cryptic pathway for threonine metabolism via specific IS3-mediated alteration of promoter structure in Escherichia coli. J Bacteriol. 1989;171:5503–11.

    PubMed Central  CAS  PubMed  Google Scholar 

  13. Soria G, Barbé J, Gibert I. Molecular fingerprinting of Salmonella typhimurium by IS200-typing as a tool for epidemiological and evolutionary studies. Microbiologia. 1994;10:57–68.

    CAS  PubMed  Google Scholar 

  14. Das S, Paramasivan CN, Lowrie DB, Prabhakar R, Narayanan PR. IS6110 restriction fragment length polymorphism typing of clinical isolates of Mycobacterium tuberculosis from patients with pulmonary tuberculosis in Madras, South India. Tuber Lung Dis. 1995;76:550–4.

    Article  CAS  PubMed  Google Scholar 

  15. Bik EM, Gouw RD, Mooi FR. DNA fingerprinting of Vibrio cholerae strains with a novel insertion sequence element: a tool to identify epidemic strains. J Clin Microbiol. 1996;34:1453–61.

    PubMed Central  CAS  PubMed  Google Scholar 

  16. Adams MD, Chan ER, Molyneaux ND, Bonomo RA. Genomewide analysis of divergence of antibiotic resistance determinants in closely related isolates of Acinetobacter baumannii. Antimicrob Agents Chemother. 2010;54:3569–77.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  17. Suzuki M, Matsumoto M, Hata M, Takahashi M, Sakae K. Development of a rapid PCR method using the insertion sequence IS1203 for genotyping Shiga toxin-producing Escherichia coli O157. J Clin Microbiol. 2004;42:5462–6.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  18. Doig KD, Holt KE, Fyfe JA, Lavender CJ, Eddyani M, Portaels F, et al. On the origin of Mycobacterium ulcerans, the causative agent of Buruli ulcer. BMC Genomics. 2012;13:258.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  19. Doughty EL, Sergeant MJ, Adetifa I, Antonio M, Pallen MJ. Culture-independent detection and characterisation of Mycobacterium tuberculosis and M. africanum in sputum samples using shotgun metagenomics on a benchtop sequencer. PeerJ. 2014;2:e585.

    Article  PubMed Central  PubMed  Google Scholar 

  20. Rizk G, Gouin A, Chikhi R, Lemaitre C. MindTheGap: integrated detection and assembly of short and long insertions. Bioinformatics. 2014;30:3451–7.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  21. Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009;6:677–81.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  22. Thung DT, de Ligt J, Vissers LE, Steehouwer M, Kroon M, de Vries P, et al. Mobster: accurate detection of mobile element insertions in next generation sequencing data. Genome Biol. 2014;15:488.

    Article  PubMed Central  PubMed  Google Scholar 

  23. Robb SMC, Lu L, Valencia E, Burnette JM, Okumoto Y, Wessler SR, et al. The use of RelocaTE and unassembled short reads to produce high-resolution snapshots of transposable element generated diversity in rice. G3. 2013;3:949–57.

    Article  PubMed Central  PubMed  Google Scholar 

  24. Keane TM, Wong K, Adams DJ. RetroSeq: transposable element discovery from next-generation sequencing data. Bioinformatics. 2013;29:389–90.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  25. Nakagome M, Solovieva E, Takahashi A, Yasue H, Hirochika H, Miyao A. Transposon Insertion Finder (TIF): a novel program for detection of de novo transpositions of transposable elements. BMC Bioinformatics. 2014;15:71.

    Article  PubMed Central  PubMed  Google Scholar 

  26. Barrick JE, Colburn G, Deatherage DE, Traverse CC, Strand MD, Borges JJ, et al. Identifying structural variation in haploid microbial genomes from short-read resequencing data using breseq. BMC Genomics. 2014;15:1039.

    Article  PubMed Central  PubMed  Google Scholar 

  27. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  28. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9.

    Article  PubMed Central  PubMed  Google Scholar 

  29. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  30. Faust GG, Hall IM. SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics. 2014;30:2503–5.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  31. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421.

    Article  PubMed Central  PubMed  Google Scholar 

  32. Holt KE, Parkhill J, Mazzoni CJ, Roumagnac P, Weill F-X, Goodhead I, et al. High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi. Nat Genet. 2008;40:987–93.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  33. Inouye M, Dashnow H, Raven L-A, Schultz MB, Pope BJ, Tomita T, et al. SRST2: rapid genomic surveillance for public health and hospital microbiology labs. Genome Med. 2014;6:90.

    Article  PubMed Central  PubMed  Google Scholar 

  34. Gupta SK, Padmanabhan BR, Diene SM, Lopez-Rojas R, Kempf M, Landraud L, Rolain J-M: ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob Agents Chemother 2014;58:212–20

  35. McEvoy CRE, Falmer AA, van Pittius NCG, Victor TC, van Helden PD, Warren RM. The role of IS6110 in the evolution of Mycobacterium tuberculosis. Tuberculosis. 2007;87:393–404.

    Article  CAS  PubMed  Google Scholar 

  36. van Embden JD, Cave MD, Crawford JT, Dale JW, Eisenach KD, Gicquel B, et al. Strain identification of Mycobacterium tuberculosis by DNA fingerprinting: recommendations for a standardized methodology. J Clin Microbiol. 1993;31:406–9.

    PubMed Central  PubMed  Google Scholar 

  37. Beggs ML, Eisenach KD, Cave MD. Mapping of IS6110 insertion sites in two epidemic strains of Mycobacterium tuberculosis. J Clin Microbiol. 2000;38:2923–8.

    PubMed Central  CAS  PubMed  Google Scholar 

  38. Alonso H, Aguilo JI, Samper S, Caminero JA, Campos-Herrero MI, Gicquel B, et al. Deciphering the role of IS6110 in a highly transmissible Mycobacterium tuberculosis Beijing strain, GC1237. Tuberculosis. 2011;91:117–26.

    Article  CAS  PubMed  Google Scholar 

  39. Comas I, Coscolla M, Luo T, Borrell S, Holt KE, Kato-Maeda M, et al. Out-of-Africa migration and Neolithic coexpansion of Mycobacterium tuberculosis with modern humans. Nat Genet. 2013;45:1176–82.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  40. Holt KE, Thieu Nga TV, Thanh DP, Vinh H, Kim DW, Vu Tra MP, et al. Tracking the establishment of local endemic populations of an emergent enteric pathogen. Proc Natl Acad Sci U S A. 2013;110:17522–7.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  41. Assis PA, Espíndola MS, Paula-Silva FW, Rios WM, Pereira PA, Leão SC, et al. Mycobacterium tuberculosis expressing phospholipase C subverts PGE2 synthesis and induces necrosis and alevolar macrophages. BMC Microbiol. 2014;14:128.

    Article  PubMed Central  PubMed  Google Scholar 

  42. Vera-Cabrera L, Hernández-Vera MA, Welsh O, Johnson WM, Castro-Garza J. Phospholipase region of Mycobacterium tuberculosis is a preferential locus for IS6110 transposition. J Clin Microbiol. 2001;39:3499–504.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

This work was supported by the National Health and Medical Research Council of Australia (Fellowship #1061409 to KEH; Project Grant #1043830 to KEH and RMH) and the Victorian Life Sciences Computation Initiative (VLSCI, #VR0082).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jane Hawkey.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JH developed the code, analysed data and wrote the paper. KEH conceived the study and helped to draft the manuscript. HBJ participated in design and coordination of the study and contributed to data interpretation. RRW and DJE developed code. MH performed PCR and sequence analysis. RMH provided sequence data and isolates for validation and contributed to data interpretation. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hawkey, J., Hamidian, M., Wick, R.R. et al. ISMapper: identifying transposase insertion sites in bacterial genomes from short read sequence data. BMC Genomics 16, 667 (2015). https://doi.org/10.1186/s12864-015-1860-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12864-015-1860-2

Keywords