Skip to main content

Identification of highly variable sequence fragments in unmapped reads for rapid bacterial genotyping



Bacterial genotyping is a crucial process in outbreak investigation and epidemiological studies. Several typing methods such as pulsed-field gel electrophoresis, multilocus sequence typing (MLST) and whole genome sequencing are currently used in routine clinical practice. However, these methods are costly, time-consuming and have high computational demands. An alternative to these methods is mini-MLST, a quick, cost-effective and robust method based on high-resolution melting analysis. Nevertheless, no standardized approach to identify markers suitable for mini-MLST exists. Here, we present a pipeline for variable fragment detection in unmapped reads based on a modified hybrid assembly approach using data from one sequencing platform.


In routine assembly against the reference sequence, high variable reads are not aligned and remain unmapped. If de novo assembly of them is performed, variable genomic regions can be located in created scaffolds. Based on the variability rates calculation, it is possible to find a highly variable region with the same discriminatory power as seven housekeeping gene fragments used in MLST. In the work presented here, we show the capability of identifying one variable fragment in de novo assembled scaffolds of 21 Escherichia coli genomes and three variable regions in scaffolds of 31 Klebsiella pneumoniae genomes. For each identified fragment, the melting temperatures are calculated based on the nearest neighbor method to verify the mini-MLST’s discriminatory power.


A pipeline for a modified hybrid assembly approach consisting of reference-based mapping and de novo assembly of unmapped reads is presented. This approach can be employed for the identification of highly variable genomic fragments in unmapped reads. The identified variable regions can then be used in efficient laboratory methods for bacterial typing such as mini-MLST with high discriminatory power, fully replacing expensive methods such as MLST. The results can and will be delivered in a shorter time, which allows immediate and fast infection monitoring in clinical practice.


Bacterial genotyping is a powerful tool to investigate the relationships between individual strains from a single species as well as to study the bacterial population structure and dynamics. Phenotypic and genotypic methods can be applied to distinguish bacteria, with genotyping most often being used these days [1].

For a long time, the method considered a ’gold standard’ for bacterial genotyping in routine practice was pulsed-field gel electrophoresis, where DNA banding patterns are analyzed [2]. The main advantage of this method is its discriminatory power and intra-laboratory reproducibility, but on the other hand, it is time and labor intensive [3].

Another typing method used worldwide is multilocus sequence typing (MLST), a standardized and highly discriminatory technique. Several housekeeping genes’ 450 - 500 bp long fragments are sequenced and analyzed. For each allele of a gene, a unique number is assigned [4]. The combination of numbers for all the genes’ alleles define the allelic profiles represented by sequence types (STs). The MLST schemes are deposited in publicly available databases; thus, the method can be used for comparative epidemiological studies and monitoring the spread of high-risk strains. The MLST is a portable, standardized and reproducible method; however, the sequencing cost is still high. [5].

An alternative to MLST is mini-MLST, where the sequencing is replaced by high resolution melting analysis (HRM) [6]. In the first step, housekeeping gene fragments are amplified by PCR. Next, HRM is performed and as a result, a melting curve is obtained. When 50% of the DNA is denatured, the melting temperature is determined [7]. Each melting curve represents an individual melt allele named according to the GC base content in the amplified region. A combination of melt alleles from each gene defines the so-called melt type. The main advantage is the low cost, which is about 10 – 20% of MLST, low time demands and very high throughput [8]. However, there is no standardized approach to detect genetic markers suitable for mini-MLST, and often the sequences chosen for typing do not have sufficient discriminatory power.

The method with the highest discriminatory power for bacterial strain genotyping is whole genome sequencing [9]. Recently, genome sequencing has become more accessible to routine laboratories; however, the main bottlenecks in post-sequencing data analysis remain. The first bottleneck is the lack of standard protocols for data processing. This is because many tools exist, and each uses different data quality assessments, data processing, and results interpretation. The second bottleneck is in genome assembly, where the outputs are crucial for clinical practice. Two main approaches exist here: de novo assembly and reference-based assembly.

de novo assembly is time-consuming and computationally demanding, high-quality data are also required. High-throughput technologies provide short reads where the assembly is challenging, especially in repetitive regions [10]. As a result, the whole genome is not obtained, but only a large number of short contigs are generated [11]. These drawbacks make it difficult to use de novo assembly in routine clinical practice.

On the other hand, reference-based assembly is faster and less computationally demanding; nevertheless, the choice of an inappropriate reference sequence significantly affects the analyses and final results. In the reference assembly, only the shared genome parts are analyzed; however, the unmapped reads can contain important information, as the bacterial genomes are highly variable [12].

Hybrid assembly can be conducted to overcome de novo and reference-based assembly’s drawbacks. This approach combines two sequencing technologies and can be used to analyze unique genome parts [13]. However, this double-sequencing approach is not used as standard in clinical studies as it requires two sequencing platforms, which means extra cost and time.

Here we present a new approach based on reference-based mapping and de novo assembly, which can be compared to a hybrid assembly approach. The main difference is that data were obtained from only one sequencing platform, specifically Illumina Miseq, one of the most frequently used platforms in clinical practice [14]. Our goal is not to obtain a complete whole genome but only the most variable genomic regions that can be used in mini-MLST. The reference-based assembly is used as a filter to remove the low variable reads, which will map to the reference sequence. From the high variability reads that did not map to the reference, scaffolds are assembled. In the scaffolds, the most variable parts are identified, and these genomic regions can be used to distinguish bacteria in mini-MLST. Thus, more sequencing will not be necessary as further samples of the given bacterium will be classified based on mini-MLST analysis using the identified variable fragments.

In the present study, our goal was to analyze whole genome sequencing data obtained from isolates representing two clinically important bacterial species - Escherichia coli and Klebsiella pneumoniae. E. coli is a Gram-negative bacterium of the Enterobacteriaceae family [15], and includes pathogenic and commensal clones. Commensal clones are natural inhabitants of the human gastrointestinal tract and cause diseases only in immunocompromised patients or those with breached gastrointestinal barriers [16], [17]. Pathogenic strains can cause urinary tract infections, sepsis, or enteric diseases [15]. E. coli genome size varies from 4.2 to 6.0 Mb with an average of about 5 Mb [18]. K. pneumoniae is another bacterium from the Enterobacteriaceae family. It is an opportunistic pathogen that causes serious diseases such as pneumonia, bloodstream infections, urinary tract infections or sepsis, mainly in immunocompromised patients [19, 20]. In recent years, the number of antibiotic-resistant strains has increased [21]; thus, K. pneumoniae has become one of the major threats due to significant morbidity and mortality [8]. The genome has approximately 5.5 Mbp, and encodes about 5500 genes [22].

Results and discussion

Sequence type determination

The reference-based assembled consensus sequences were used for in silico MLST analysis. For E. coli the Warwick MLST scheme was employed; thus, seven housekeeping genes (adk, fumC, gyrB, icd, mdh, puA, recA) were analyzed. In total 11 sequence types were present in our dataset (1 x ST 69, 4 x ST 131, 1 x ST 95, 2 x ST 404, 2 x ST 38, 2 x ST 1049, 4 x ST 58, 1 x ST 297, 1 x ST 517, 2 x ST 101, 1 x ST UNW). The complete results of MLST typing are published in [23].

The sequence types of K. pneumoniae isolates were determined using seven housekeeping genes (gapA, infB, mdh, pgi, phoE, rpoB, tonB). Overall four STs were identified in the dataset (5 x ST 45, 9 x ST 405, 13 x ST 551, 4 x ST 950). The results of the MLST analysis are attached in Additional file 1: Table S1.

Assembly analysis

The number of obtained reads for E. coli genomes were from 2 514 472 to 5 137 462, and for K. pneumoniae, the number varied from 1 608 984 to 4 001 068. In both datasets, less than 0.14% of reads mapped to the human genome; thus, no serious contamination occurred in the sequencing data. Trimmomatic removed about 5% of E. coli reads and approximately 13% of K. pneumoniae reads, and the remaining 95% and 87% from the total number of reads were used for further assembly. The number of reads mapped to the reference sequence varied from 64% to 81% for E. coli genomes and from 54% to 79% for K. pneumoniae genomes. Thus, around 17% of reads remained unmapped, on average. Complete information on the numbers of total, mapped, and unmapped reads is shown in the Tables 1 and 2.

Table 1 The numbers of all reads obtained from sequencing and number of mapped and unmapped reads to reference sequence for each E. coli genome
Table 2 The numbers of all reads obtained from sequencing and number of mapped and unmapped reads to reference sequence for each K. pneumoniae genome

The unmapped reads of the 21 E. coli genomes and 31 K. pneumoniae genomes were de novo assembled via SPAdes. The number of scaffolds obtained for E. coli isolates varied from 151 to 296, and after removing the scaffolds shorter than 500 bp, the amount fluctuated from 124 to 230. In K. pneumoniae genomes were assembled from 86 to 323 scaffolds, and after removing short scaffolds, the number varied from 76 to 192. The largest scaffold lengths, N50 and L50, were determined and can be found in Tables 3 and 4 with other statistics. The detailed assembly process is described in the sections Reference-based assembly and De novo assembly of unmapped reads.

Table 3 Results of de novo analysis for E. coli genomes - number of assembled scaffolds, number of assembled scaffolds longer than 500 bp, length of the largest scaffold, N50 and L50 values
Table 4 Results of de novo analysis for K. pneumoniae genomes - number of assembled scaffolds, number of assembled scaffolds longer than 500 bp, length of the largest scaffold, N50 and L50 values

Variable fragments identification

The scaffolds from genome EC162 and KP1241 were searched for in the remaining E. coli and K. pneumoniae genomes via BLAST+. These genomes were chosen as they contained the largest number of scaffolds. In total, from 230 scaffolds searched, only 25 scaffolds were located in all E. coli isolates and from 192 scaffolds, 78 were identified in all K. pneumoniae genomes.

The isolate sequences from corresponding scaffolds present in all genomes were aligned. Next, the most variable regions were located in the alignments, and in total, 11 variable fragments for E. coli and 244 fragments for K. pneumoniae were identified.

A phylogenetic tree was constructed for each variable fragment, and the number of clusters was determined (see Additional file 1: Table S2 and S3). From the obtained results, only one E. coli variable fragment (labelled as EC_01) was further analyzed as only this one could distinguish the isolates to 11 clusters, according to the MLST results. For K. pneumoniae, five variable fragments (labelled as KP_01, KP_02, KP_03, KP_04 and KP_05) classified the genomes correctly into four clusters according to their STs and were further analyzed. The analyzed variable fragments are shown in Table 5. The variable fragments’ identification is described in the section Detection of variable regions.

Table 5 The identified variable fragments located in assembled E. coli and K. pneumoniae genomes scaffolds with their size and number of clusters obtained by phylogenetic analysis

Variable fragment analyses

The variability calculation was used for the preliminary selection of sequences with higher variability. Before further analysis, Web BLAST was used to analyze the fragments and ensure that the sequences were from the bacterial chromosome and not from plasmids. In total, six variable fragments from both datasets were analyzed, and one K. pneumoniae fragment (KP_01) with a length of 1346 bp was removed as it contained only a plasmid sequence.

If the lengths of the identified fragments were long (more than 150 bp [7]), the sequences’ analysis via commonly used laboratory methods such as mini-MLST would be complicated. For this reason, the fragments can be shortened to identify the region with the highest variability rate. The process of shortening is described in the Fragment shortening section.

The original length of the variable fragment (EC_01) with sufficient discriminatory power to distinguish E. coli genomes was 4,762 bp. After fragment erosion and decomposition, three fragments with a length of 120 bp were obtained (EC_01_1, EC_01_2, EC_01_3). The phylogenetic tree, constructed based on the three fragments with the highest variability, is depicted in Fig. 1.

Fig. 1
figure 1

Cladogram of 21 E. coli isolates based on three variable fragments (EC_01_1, EC_01_2, EC_01_3) with highlight clusters obtained from MLST analysis created by Evolview [38]

For K. pneumoniae genomes, the identified fragments lengths were 12 bp (KP_02), 13 bp (KP_03), 21 bp (KP_04) and 26 bp (KP_05); thus, no erosion was needed. Again, the phylogenetic trees were constructed, and one of them created based on variable fragment KP_02 is depicted in Fig. 2. The other trees are shown in Additional file 1: Fig. S1 A-C.

Fig. 2
figure 2

Cladogram of 31 K. pneumoniae isolates based on the variable fragment (KP_02) with highlight clusters obtained from MLST analysis created by Evolview [38]

All trees were constructed according to the process described in the section Phylogenetic analysis.

Melting temperatures analysis

For E. coli, the melting temperatures for three variable fragments (EC_01_1, EC_01_2, EC_01_3) for each genome were calculated based on the nearest neighbor method [24] and are shown in the Table 6. A melting temperature cluster analysis was conducted, and the obtained dendrogram is depicted in Fig. 3.

Fig. 3
figure 3

Dendrogram of 21 E. coli isolates obtained from cluster analysis of variable fragments (EC_01_1, EC_01_2, EC_01_3) melting temperatures with highlight clusters obtained from MLST analysis created

Table 6 Calculated melting temperatures for each E. coli genome for three variable fragments

The melting temperatures were also calculated for variable K. pneumoniae isolate fragments, and cluster analysis was conducted. It was found that three (KP_02, KP_03, KP_05) out of four identified fragments could distinguish the genomes based on the calculated melting temperatures. The calculated values are shown in Table 7, and one obtained dendrogram based on fragment KP_02 is shown in Fig. 4. See other dendrograms in Additional file 1: Fig. S2 A-B.

Fig. 4
figure 4

Dendrogram of 31 K. pneumoniae isolates obtained from cluster analysis of variable fragment (KP_02) melting temperatures with highlight clusters obtained from MLST analysis created

Table 7 Calculated melting temperatures for each K. pneumoniae genome for three variable fragments

The way the melting temperatures were determined is described in the Melting temperature calculation chapter.

The melting temperature calculation can be used to distinguish E. coli and K. pneumoniae strains. Thus, the fragments can be analyzed in mini-MLST. However, the difference in calculated temperature from different sequence types fluctuated only in the range of one degree is some cases. Therefore, the mini-MLST parameters, such as salt concentration, must be carefully determined.


Bacterial genotyping is an essential process in epidemiology as it helps to find an infection’s source and monitor outbreaks. Results delivery should be done in the shortest possible time. However, typing methods are often laborious and computationally demanding, and financial costs are also high, especially for local clinical laboratories. The solution to the mentioned problems can be using mini-MLST, a cost-effective and efficient laboratory method.

The pipeline to identify variable sequences in the next-generation sequencing data that can be used in mini-MLST for bacterial typing was proposed and tested on 21 E. coli and 31 K. pneumoniae genomes. The hybrid assembly approach consists of reference-based mapping, and de novo assembly of unmapped reads was used. The preliminary location for variable fragments in the assembled scaffolds was carried out using variability rate calculation. In the selected fragments, the most variable parts were identified.

The melting temperatures were calculated to verify that the variable segments can be used in mini-MLST. The calculated melting temperature cluster analysis showed that distinguishing of individual strains is possible. Also, in contrast with MLST, mini-MLST does not use sequencing; thus, the bacterial genotyping cost will be significantly lower.

The proposed approach can be used to identify genomic regions that are not presented in the chosen reference sequence and can be specific to analyzed bacterial strains. Nevertheless, analyzing unmapped reads must be done carefully as the assembled sequences can contain parts of plasmids extracted with the genomic DNA.

One new variable fragment was located in E. coli isolates, and three variable fragments were identified in K. pneumoniae genomes. The identified fragments’ discriminatory power was the same as the seven housekeeping genes used in the MLST analysis. The most variable fragment regions were identified to ensure that it will be possible to perform mini-MLST. As only three regions located in the variable fragment in E. coli and one region in K. pneumoniae can be analyzed in mini-MLST instead of sequencing seven housekeeping genes, the analyzing time and cost will be significantly lower.

Material and methods


The 21 E. coli and 31 K. pneumoniae isolates were collected in the Internal Hematology and Oncology Clinic at the University Hospital of Brno between 5/2019 and 7/2019. KAPA HyperPlus Kits (Roche, Switzerland) were used for sequencing library preparation, and a 2100 Bioanalyzer (Agilent Technologies, USA) was employed as a quality check. The prepared sequencing libraries were quantified with a KAPA Library Quantification kit (Roche, Switzerland) and the sequencing process using a MiSeq Reagent Kit v2 (500-cycles) was performed on an Illumina MiSeq platform. As a result, paired-end reads about 250 bp long were acquired.

Reference-based assembly

Before genome assembly, the sequenced data quality was checked by FastQC (v0.11.5, [25]) combined with MultiQC (v1.7, [26]). BBMap (v38.71, [27]) software was used to map reads to the human genome (GRCh38.p13) to remove possible contamination. Then the Trimmomatic (v0.36, [28]) was employed for adapters and low quality read trimming. Reference-based mapping was applied, and NC_002695.2 [29] and NC_012731.1 [30] obtained from the RefSeq database [31] were chosen as the reference genomes for E. coli and K. pneumoniae assembly. The assembly was performed via Burrows-Wheeler Aligner MEM (v0.7.17-r1188, [32]).

De novo assembly of unmapped reads

The reads that did not map to the reference genome were extracted, and the PCR and optical duplicated reads and low-quality reads were removed using Samtools (v1.9, [33]). In the next step, the St. Petersburg genome assembler (SPAdes) (v3.11.1, [34]) was employed for the unmapped reads’ de novo assembly. The assembly was run together with MismatchCorrector, and the Phred quality offset for the input reads was set to 33.

From genome assembly, the scaffolds were further analyzed. Due to a large number of short scaffolds, only those with a length greater than 500 bp were examined.

Detection of variable regions

The variable fragments should be present in all isolates of the analyzed bacterium. For that reason, the scaffolds from one genome of E. coli (EC162) and one genome of K. pneumoniae (KP1241) were searched for in remaining genomes via BLAST+ (v2.6.0+, [35]).

The scaffolds that corresponded to the same region and were found in all genomes were aligned. From the alignment, the parts present in all genomes were analyzed. The variability rate was determined for each fragment of alignments. The variability V was calculated as

$$ V = \frac{n_{vp}}{n_{p}}\cdot 100, $$

where nvp is the number of variable positions in the alignment, and np is the number of all positions in the alignment. If the variability of an examined fragment is more than 10%, the fragment is further analyzed. The proposed process is shown in Fig. 5.

Fig. 5
figure 5

Scheme of variability regions’ calculation for a set of aligned sequences

Phylogenetic analysis

The evolutionary distances based on the Kimura [36] model were calculated for the segments of alignment with high variability. Using the distances, phylogenetic trees were constructed via UPGMA. Then a cluster analysis was conducted, and as a result, the number of clusters was calculated and compared with the number of sequence types obtained from the MLST analysis.

Fragment shortening

If the variable fragment was too long to use in standard laboratory methods, it was shortened by two methods: erosion and dyadic decomposition. Firstly, erosion was performed. The nucleotides from the beginning of the sequence alignment were removed one by one. In each iteration, the phylogenetic tree was constructed, and controlling the number of clusters was performed to ensure that the discriminatory power was still preserved. As soon as the number of clusters decreased, the erosion was stopped. The same process was done from the end of the alignment. Secondly, fragment decomposition was carried out. As it was not possible to shorten the alignment using erosion, the fragment was split into two separate, equal halves. The first half of the alignment was shortened from the end as it was not possible to shorten it further from the beginning. For the second half of the alignment, the nucleotides were removed from the beginning. After each time it was shortened, the number of clusters was determined. If it was impossible to shorten the fragment further, each part was again split in the middle, and the process was repeated. Shortening was stopped when the length of the fragment was 120 bp.

Melting temperature calculation

The melting temperatures were calculated for the variable fragments using the Oligo Calc [37]. The calculation parameters were left by default, and the melting temperatures computed using the nearest neighbor method were used for further analysis.

Availability of data and materials

Raw sequencing data are available from the National Center for Biotechnology Information Sequence Read Archive database under a BioProject with accession number PRJNA695195 and PRJNA770840.



Multilocus sequence typing


mini-multilocus sequence typing


Sequence type


High resolution melting analysis




Unweighted pair group method with arithmetic mean


  1. Li W, Raoult D, Fournier P-E. Bacterial strain typing in the genomic era. FEMS Microbiol Rev. 2009; 33(5):892–916.

    Article  CAS  Google Scholar 

  2. Neoh H. -m., Tan X-E, Sapri HF, Tan TL. Pulsed-field gel electrophoresis (PFGE): A review of the “gold standard” for bacteria typing and current alternatives. Infect Genet Evol. 2019; 74(March):103935.

    Article  CAS  Google Scholar 

  3. Sabat AJ, Budimir A, Nashev D, Sá-Leão R, van Dijl JM, Laurent F, Grundmann H, Friedrich AW, on behalf of the ESCMID Study Group. Overview of molecular typing methods for outbreak detection and epidemiological surveillance. Eurosurveillance. 2013; 18(4):20380.

    Article  CAS  Google Scholar 

  4. Enright MC, Spratt BG. Multilocus sequence typing. Trends Microbiol. 1999; 7(12):482–7.

    Article  CAS  Google Scholar 

  5. Urwin R, Maiden MCJ. Multi-locus sequence typing: a tool for global epidemiology. Trends Microbiol. 2003; 11(10):479–87.

    Article  CAS  Google Scholar 

  6. Tong SYC, Giffard PM. Microbiological Applications of High-Resolution Melting Analysis. J Clin Microbiol. 2012; 50(11):3418–21.

    Article  CAS  Google Scholar 

  7. Andersson P, Tong SYC, Bell JM, Turnidge JD, Giffard PM. Minim Typing – A Rapid and Low Cost MLST Based Typing Tool for Klebsiella pneumoniae. PLoS ONE. 2012; 7(3):33530.

    Article  Google Scholar 

  8. Brhelova E, Kocmanova I, Racil Z, Hanslianova M, Antonova M, Mayer J, Lengerova M. Validation of Minim typing for fast and accurate discrimination of extended-spectrum, beta-lactamase-producing Klebsiella pneumoniae isolates in tertiary care hospital. Diagn Microbiol Infect Dis. 2016; 86(1):44–9.

    Article  Google Scholar 

  9. Bezdicek M, Nykrynova M, Plevova K, Brhelova E, Kocmanova I, Sedlar K, Racil Z, Mayer J, Lengerova M. Application of mini-MLST and whole genome sequencing in low diversity hospital extended-spectrum beta-lactamase producing Klebsiella pneumoniae population. PLoS ONE. 2019; 14(8):0221187.

    Article  Google Scholar 

  10. Paszkiewicz K, Studholme DJ. De novo assembly of short sequence reads. Brief Bioinforma. 2010; 11(5):457–72.

    Article  CAS  Google Scholar 

  11. Liao X, Li M, Zou Y, Wu F-X, Yi-Pan, Wang J. Current challenges and solutions of de novo assembly. Quant Biol. 2019; 7(2):90–109.

    Article  Google Scholar 

  12. Abnizova I, te Boekhorst R, Orlov YL. Computational Errors and Biases in Short Read Next Generation Sequencing. J Proteomics Bioinforma. 2017; 10(1):1–17.

    Article  Google Scholar 

  13. Larsen PA, Harris RA, Liu Y, Murali SC, Campbell CR, Brown AD, Sullivan BA, Shelton J, Brown SJ, Raveendran M, Dudchenko O, Machol I, Durand NC, Shamim MS, Aiden EL, Muzny DM, Gibbs RA, Yoder AD, Rogers J, Worley KC. Hybrid de novo genome assembly and centromere characterization of the gray mouse lemur (Microcebus murinus). BMC Biol. 2017; 15(1):110.

    Article  Google Scholar 

  14. Besser J, Carleton HA, Gerner-Smidt P, Lindsey RL, Trees E. Next-generation sequencing technologies and their application to the study and control of bacterial infections. Clin Microbiol Infect. 2018; 24(4):335–41.

    Article  CAS  Google Scholar 

  15. Nataro JP, Kaper JB. Diarrheagenic Escherichia coli. Clin Microbiol Rev. 1998; 11(2):403.

    Article  CAS  Google Scholar 

  16. Liu B, Furevi A, Perepelov AV, Guo X, Cao H, Wang Q, Reeves PR, Knirel YA, Wang L, Widmalm G. Structure and genetics of Escherichia coli O antigens. FEMS Microbiol Rev. 2020; 44(6):655–83.

    Article  Google Scholar 

  17. Kaper JB, Nataro JP, Mobley HLT. Pathogenic Escherichia coli. Nat Rev Microbiol. 2004; 2(2):123–40.

    Article  CAS  Google Scholar 

  18. Touchon M, Perrin A, de Sousa JAM, Vangchhia B, Burn S, O’Brien CL, Denamur E, Gordon D, Rocha EPC. Phylogenetic background and habitat drive the genetic diversification of Escherichia coli. PLoS Genet. 2020; 16(6):1008866.

    Article  Google Scholar 

  19. Li B, Zhao Y, Liu C, Chen Z, Zhou D. Molecular pathogenesis of Klebsiella pneumoniae. Futur Microbiol. 2014; 9(9):1071–81.

    Article  Google Scholar 

  20. Bengoechea JA, Sa Pessoa J. Klebsiella pneumoniae infection biology: living to counteract host defences. FEMS Microbiol Rev. 2019; 43(2):123–44.

    Article  CAS  Google Scholar 

  21. Paczosa MK, Mecsas J. Klebsiella pneumoniae: Going on the Offense with a Strong Defense. Microbiol Mol Biol Rev. 2016; 80(3):629–61.

    Article  CAS  Google Scholar 

  22. Wyres KL, Holt KE. Klebsiella pneumoniae Population Genomics and Antimicrobial-Resistant Clones. Trends Microbiol. 2016; 24(12):944–56.

    Article  CAS  Google Scholar 

  23. Nykrynova M, Barton V, Sedlar K, Bezdicek M, Lengerova M, Skutkova H. Word Entropy-Based Approach to Detect Highly Variable Genetic Markers for Bacterial Genotyping. Front Microbiol. 2021; 12(February):1–8.

    Google Scholar 

  24. Borer PN, Dengler B, Tinoco I, Uhlenbeck OC. Stability of ribonucleic acid double-stranded helices. J Mol Biol. 1974; 86(4):843–53.

    Article  CAS  Google Scholar 

  25. Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. Accessed Jan 2020.

  26. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016; 32(19):3047–8.

    Article  CAS  Google Scholar 

  27. Bushnell B, et al.BBMap: A Fast, Accurate, Splice-Aware Aligner. No. LBNL-7065E. Berkeley: Ernest Orlando Lawrence Berkeley National Laboratory; 2014.

    Google Scholar 

  28. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014; 30(15):2114–20.

    Article  CAS  Google Scholar 

  29. Hayashi T. Complete Genome Sequence of Enterohemorrhagic Eschelichia coli O157:H7 and Genomic Comparison with a Laboratory Strain K-12. DNA Res. 2001; 8(1):11–22.

    Article  CAS  Google Scholar 

  30. Wu KM, Li NH, Yan JJ, Tsao N, Liao TL, Tsai HC, Fung CP, Chen HJ, Liu YM, Wang JT, Fang CT, Chang SC, Shu HY, Liu TT, Chen YT, Shiau YR, Lauderdale TL, Su IJ, Kirby R, Tsai SF. Genome sequencing and comparative analysis of Klebsiella pneumoniae NTUH-K2044, a strain causing liver abscess and meningitis. J Bacteriol. 2009; 191(14):4492–501.

    Article  CAS  Google Scholar 

  31. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D, Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey KM, Murphy MR, O’Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C, Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C, Webb D, Wu W, Landrum MJ, Kimchi A, Tatusova T, DiCuccio M, Kitts P, Murphy TD, Pruitt KD. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44(D1):733–45.

    Article  Google Scholar 

  32. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013. Accessed Jan 2020.

  33. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25(16):2078–9.

    Article  Google Scholar 

  34. Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr Protoc Bioinforma. 2020; 70(1):1–29.

    Article  Google Scholar 

  35. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009; 10(1):421.

    Article  Google Scholar 

  36. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980; 16(2):111–20.

    Article  CAS  Google Scholar 

  37. Kibbe WA. OligoCalc: an online oligonucleotide properties calculator. Nucleic Acids Res. 2007; 35(Web Server):43–6.

    Article  Google Scholar 

  38. Subramanian B, Gao S, Lercher MJ, Hu S, Chen W-H. Evolview v3: a webserver for visualization, annotation, and management of phylogenetic trees. Nucleic Acids Res. 2019; 47(W1):270–5.

    Article  Google Scholar 

Download references


Not applicable.

About this supplement

This article has been published as part of BMC Genomics Volume 23 Supplement 3, 2022: Selected articles from the 9th International Work-Conference on Bioinformatics and Biomedical Engineering: genomics. The full contents of the supplement are available online at


Collecting, processing, storing and sequencing of all bacterial isolates used in this study was supported by the Ministry of Health of the Czech Republic, Grant No. NV19-09-00430, all rights reserved. Publication costs are funded by BUT Open Access Fund.

Author information

Authors and Affiliations



MN, VB, HS contributed to the conception and design of the study. MN implemented the algorithm and evaluated the results. ML and MB ensured the biological aspects of the project. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Marketa Nykrynova.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1

Supplementary tables and figures: Table S1: MLST analysis results for 31 K. pneumoniae strains, Table S2: The identified variable fragments located in assembled scaffolds of E. coli genomes with number of clusters obtained by phylogenetic analysis, Table S3: The identified variable fragments located in assembled scaffolds of K. pneumoniae genomes with number of clusters obtained by phylogenetic analysis, Figure S1 A-C: Cladograms of 31 K. pneumoniae isolates based on variable fragments with highlight clusters obtained from MLST analysis, Figure S2 A-B: Dendrograms of 31 K. pneumoniae isolates obtained from cluster analysis of variable fragments melting temperatures with highlight clusters obtained from MLST analysis created.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Nykrynova, M., Barton, V., Bezdicek, M. et al. Identification of highly variable sequence fragments in unmapped reads for rapid bacterial genotyping. BMC Genomics 23 (Suppl 3), 445 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Bacterial genotyping
  • Genome assembly
  • Unmapped reads
  • De novo assembly
  • Multilocus sequence typing
  • Mini-MLST