CLAME: a new alignment-based binning algorithm allows the genomic description of a novel Xanthomonadaceae from the Colombian Andes

Benavides, Andres; Isaza, Juan Pablo; Niño-García, Juan Pablo; Alzate, Juan Fernando; Cabarcas, Felipe

doi:10.1186/s12864-018-5191-y

Volume 19 Supplement 8

Selected articles from the IV Colombian Congress on Bioinformatics and Computational Biology & VIII International Conference on Bioinformatics SoIBio 2017

Research
Open access
Published: 11 December 2018

Andres Benavides¹, Juan Pablo Isaza²,⁴, Juan Pablo Niño-García³, Juan Fernando Alzate²,⁴ & Felipe Cabarcas¹,²

BMC Genomics volume 19, Article number: 858 (2018)

Abstract

Background

Hot spring bacteria have unique biological adaptations to survive the extreme conditions of these environments; these bacteria produce thermostable enzymes that can be used in biotechnological and industrial applications. However, sequencing these bacteria is complex, since it is not possible to culture them. As an alternative, genome shotgun sequencing of whole microbial communities can be used. The problem is that the classification of sequences within a metagenomic dataset is very challenging, particularly when they include unknown microorganisms, since these lack a genomic reference. We failed to recover a bacterial genome from a hot spring metagenome using the available software tools, so we developed a new tool that allowed us to recover most of this genome.

Results

We present a proteobacterial draft genome reconstructed from a hot spring metagenome from the Colombian Andes. The genome seems to be from a new lineage within the family Rhodanobacteraceae of the class Gammaproteobacteria, closely related to the genus Dokdonella. We were able to generate this genome thanks to CLAME. CLAME, from the Spanish “CLAsificador MEtagenomico”, is a tool to group reads in bins. We show that most reads from each bin belong to a single chromosome. CLAME is very effective at recovering most of the reads belonging to the predominant species within a metagenome.

Conclusions

We developed a tool that can be used to extract genomes (or parts of them) from a complex metagenome.

Background

Bacterial populations have colonized almost every possible niche on Earth, including those considered harsh for most organisms. These extreme environments are those whose chemical composition or physical conditions impose constraints under which most organisms cannot survive. Thermophiles are present in several ecosystems where temperatures rise above 50 °C and reach up to 90 °C. They can grow optimally under these conditions [1], since they have the adaptations and the necessary enzymatic machinery to deal with the complications of living in these extreme environments. Thermophiles are therefore a potential source of thermostable proteins suitable for several industrial and biotechnological applications, and the screening of novel thermophilic enzymes has become an important field of research. Although several thermostable enzymes have been recently described and characterized (e.g. [2,3,4]), thermophiles are still highly unexplored [5], especially because the majority of prokaryotic diversity cannot be cultured [6]. There have only been a few attempts to characterize enzymes or microorganisms from Neotropical hot springs (e.g. [7,8,9,10,11]) and just a handful of them (i.e. [10, 11]) used metagenomic approaches based on Next Generation Sequencing - NGS [12].

Since metagenomic NGS (from now on just metagenomic) approaches generate millions of short DNA reads of a few hundred bases [13], the challenge is to reconstruct the individual chromosomes of the different species from these reads. In a typical genomic experiment, most of the short reads belong to a single organism, and they can be assembled reliably using the tools that have been developed for this purpose (e.g. Newbler [14], Velvet [15], and Ray [16]). However, in a metagenomic experiment there is a mixture of reads from multiple species of a community [17]; moreover, the number of genomes and the abundance of reads from each species in the sample are unknown. These characteristics make the assembly process difficult, since there is a high risk of assembling reads from different organisms as a single chromosome. Tools like MetaVelvet [18], Ray Meta [19], MetAMOS [20], and SPAdes [21] use different approaches to address these issues and improve the chances of a good assembly. However, these tools are far from perfect, and chimeric chromosomes can be assembled [22].

In order to reduce chimeric assemblies, researchers group reads in bins, based on their sequence similarity, to reduce the data complexity and to increase the likelihood of obtaining a reliable assembly. Tools like AMPHORA2 [17], MEGAN [23], MG-Rast [24], Kraken [25], Clark [26] or MetaBinG [27] use reference-based (i.e. supervised) methods that bin the reads or contigs into taxonomic clades based on pair-wise comparisons against reference databases or pre-computed models. In contrast, there are reference-free (i.e. unsupervised) methods like MetaProb [28], BiMeta [29], MetaCluster [30], AbundanceBin [31] or CompostBin [32], which group reads using their mutual genetic similarities or their k-mer frequency composition, avoiding the pair-wise comparison step against reference databases. Supervised methods work well when reconstructing genomes from well characterized or low-diversity communities, whose taxa have a good representation in reference databases, but they exclude reads that come from less explored communities. Unsupervised methods are better when the species are poorly represented in databases, especially with long reads or contigs, which increase the likelihood of finding genetic markers within a sequence to bin it correctly.

Although there are research publications that propose a draft genome of an unknown species extracted from a metagenome (e.g. [33, 34]), only a few studies have reported the reconstruction of the complete genome of a thermophilic microbe (e.g. [35,36,37]). In these works, the process has been mainly manual, using a combination of Velvet [15], the study of the total coverage and k-mer characteristics, and the manual selection of contigs based on BLAST [38] results. In general, de-novo assembly of metagenome reads tends to generate short and chimeric contigs that are difficult to classify. Thus, the challenge of analyzing a metagenome is still open; we propose a tool that overcomes some of the limitations of traditional binning methods, mainly for metagenomes formed by unknown species.

Here, we introduce CLAME, a tool that groups metagenome reads in bins that come mainly from a single chromosome. The idea is to reduce the metagenomic complexity, to decrease the possibility of creating chimeric contigs and to improve the assembly speed. CLAME, from the Spanish “CLAsificador MEtagenomico”, is a C++ program that bins reads using a graph representation of the metagenome dataset. On the graph, reads are represented as nodes (vertices) and the overlap between two similar reads is represented as the edge that connects them. CLAME creates edges only on large exact matches between reads. This makes it very unlikely that two reads from different chromosome molecules are clustered together. We found that this technique creates bins mostly from a single chromosome, while assigning most reads of one particular chromosome to a single bin. It is important to note that CLAME is not an assembly tool; it is a binning tool that groups reads as a preliminary step before genome assembly. We calibrated CLAME using publicly available NGS data from 454 and Illumina MiSeq platforms, and we tested it with a metagenomic dataset obtained from a never before studied Andean hot spring. CLAME allowed us to generate a high-quality draft genome (available in CLAME’s GitHub and on the NCBI’s project PRJNA431299) of a Gammaproteobacteria closely related to the genus Dokdonella, which seems to represent a new lineage within the family Rhodanobacteraceae.

Methods

CLAME groups metagenomic reads in bins using their biological and shotgun sequencing properties. The fundamental biological idea of CLAME is that exact matches of a large number of bases between reads are very unlikely if the reads do not come from the same DNA chromosome. Furthermore, assuming that a genome in the metagenome is sufficiently covered, and given that the sequencing error rate is low (on platforms like Illumina MiSeq or Roche’s 454), most reads from a DNA chromosome will have exact matches between them. This way CLAME reliably bins together most reads of each chromosome from a metagenome.

Initially, CLAME produces a graph with nodes (vertices) and edges, G = (V,E); the reads are the nodes and the read alignments are the edges. An edge between two reads is created only if they have an exact alignment of a large number of bases. Ideally, two reads from different DNA chromosomes will not align together, at least not over a considerable number of bases, and thus the graph will represent the different organisms or chromosomes as separate subgraphs. The binning then follows naturally by traversing the graph and creating a bin for each connected subgraph. However, conserved regions, such as the ribosomal RNA genes, may generate edges between reads from different species. To handle this, CLAME takes into account user-defined thresholds on the number of edges of a node when creating the bins. The user can define several thresholds to configure CLAME’s sensitivity to the abundance of the species present, which depends on the characteristics of the experiment. A detailed CLAME methodology is illustrated in Fig. 1 and explained in the next subsections.

Read alignment stage

The read-overlap detection stage creates the edges of the graph. Algorithms like Needleman-Wunsch [39] and Smith-Waterman [40] were designed to find the optimal global and local alignment, respectively; the problem is that they have O(n²) computational times, where n is the number of bases of the reads. Thus, they are very slow for big datasets. To speed up alignment analysis, there are several algorithms that rely on a suffix/prefix tree representation of the dataset, such as the suffix tree, the enhanced suffix array or the FM-index [41]. In these algorithms, all the reads are used to create a tree representation of them, and then each read can be aligned to all the others by searching for it in the representation. In this case, the computational time can be reduced from O(n²) to O(m + n), where m is the time to build the suffix tree, which is of order n, so the computational time is reduced significantly.
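The index-then-search idea can be illustrated with a toy suffix array, one of the structures named above. This is only a sketch: the function names are hypothetical, the construction is deliberately naive (far slower than the linear-time builds used in practice), and it is not CLAME's implementation. It shows that, once the index exists, every exact occurrence of a query is located with two binary searches instead of a pairwise alignment.

```python
def build_suffix_array(text):
    """Return suffix start positions in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of every exact match of pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose first m bases are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m bases are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


text = "ACGTACGTTACG"
sa = build_suffix_array(text)
hits = find_occurrences(text, sa, "ACG")
print(hits)
```

All suffixes that start with the query sit next to each other in the sorted array, which is why a pair of binary searches is enough to delimit them.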

CLAME uses a custom version of the suffix tree method, built on the Succinct Data Structure Library 2.0 [42]. With this library, we can find all the alignments of a query by searching for a path in the tree. Descending from the root, each edge on the path matches part of the query. If there is a path for a query, the query occurs as a substring in the dataset, and the reads on that path are the matches. To reduce computational time, CLAME only searches for exact alignments of b bases (forward and reverse complement). The parameter “b” is the minimum alignment length accepted, in number of bases, and it is set by the user. Using this information, CLAME creates the graph. It is represented as an adjacency list in which the first column represents the node and the second the edges (the nodes that align in at least b bases). In an ideal case, the overlap stage would separate the graph into sub-graphs according to the number of chromosomes present in the metagenome. However, since there are sequencing errors and highly conserved genes, some reads can align to more than one species/chromosome, creating bins that include reads from more than one chromosome. To deal with this issue, CLAME uses the edge analysis stage.
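A minimal sketch of how such an adjacency list could be built is given here. It swaps in a plain hash index of b-base substrings for the suffix-tree index, which is an assumption made for brevity and not how CLAME does it; two reads share an exact match of at least b bases exactly when they share one b-base substring, so the resulting edges are the same. The names `build_adjacency` and `reverse_complement` and the toy reads are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_adjacency(reads, b):
    """Connect reads that share an exact match of at least b bases.

    reads: dict mapping read id -> sequence.
    Returns a dict mapping read id -> set of neighbouring read ids.
    """
    # Index every b-base substring of every read (forward strand).
    index = defaultdict(set)
    for read_id, seq in reads.items():
        for i in range(len(seq) - b + 1):
            index[seq[i:i + b]].add(read_id)

    adjacency = {read_id: set() for read_id in reads}
    for read_id, seq in reads.items():
        # Query the index with the read and with its reverse complement.
        for query in (seq, reverse_complement(seq)):
            for i in range(len(query) - b + 1):
                for other in index.get(query[i:i + b], ()):
                    if other != read_id:
                        adjacency[read_id].add(other)
                        adjacency[other].add(read_id)
    return adjacency


reads = {
    "r1": "ACGTTGCAAG",
    "r2": "TTGCAAGGCT",  # overlaps r1 on the forward strand
    "r3": "GCAACGTCCC",  # overlaps r1 on the reverse strand
    "r4": "GGGGGGGGGG",  # unrelated read
}
graph = build_adjacency(reads, b=6)
for read_id, neighbours in graph.items():
    print(read_id, sorted(neighbours))
```

A hash index keeps the sketch short but stores every b-base substring explicitly, so its memory use grows quickly with the dataset; compressed indexes such as those in the library cited above avoid that cost.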

Edge analysis stage

We have observed that the number of edges of a node is related to the abundance of that sequence in the metagenome. Furthermore, the number of edges follows a normal-like distribution. Using the adjacency list generated in the read alignment stage, CLAME reports the number-of-edges histogram of the reads of each bin. This histogram helps the user to set the thresholds: since a normal distribution is expected for the reads of the same chromosome, the user can look at the plot and set the thresholds accordingly, to deal with the following problems. 1) Nodes with a number of edges several times larger than the mean: our experiments show that they are mainly produced by conserved regions of the DNA that are similar in several species. 2) Nodes with a number of edges much smaller than the mean: we have observed that they are produced mainly by chimeric reads. Both of these problems cause reads from different DNA chromosomes to end up connected.
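As an illustration of the number-of-edges histogram, the sketch below counts, for a toy adjacency list, how many reads have each number of edges and prints a text bar chart. The function names and the toy graph are hypothetical; the plotHist.py script distributed with CLAME may work differently.

```python
from collections import Counter
from statistics import mean


def edge_histogram(adjacency):
    """Count how many reads have each number of edges."""
    return Counter(len(neighbours) for neighbours in adjacency.values())


def render_histogram(histogram):
    """Return a plain-text bar chart, one row per number-of-edges value."""
    return "\n".join(
        f"{edges:>4} | {'#' * count}"
        for edges, count in sorted(histogram.items())
    )


# Toy graph: three well-connected reads, one hub-like read, one sparse read.
toy = {
    "r1": {"r2", "r3", "hub"},
    "r2": {"r1", "r3", "hub"},
    "r3": {"r1", "r2", "hub"},
    "hub": {"r1", "r2", "r3", "odd"},
    "odd": {"hub"},
}
hist = edge_histogram(toy)
mean_edges = mean(len(neighbours) for neighbours in toy.values())
print(render_histogram(hist))
print(f"mean number of edges: {mean_edges:.1f}")
```

Reads far to the right of the main peak are candidates for conserved regions, and reads far to the left are candidates for chimeric reads, which is the information the user needs in order to choose the thresholds.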

Since the objective of CLAME is to create bins of reads from a single DNA chromosome, we allow the user to set thresholds on the number of edges. This allows users to eliminate reads whose number of edges is larger or smaller than normal. CLAME uses the user’s edge thresholds to redefine the graph and obtain connected subgraphs. The bins are generated by traversing the graph and reporting each subgraph.
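The effect of the edge thresholds can be sketched as a filter on the adjacency list. The version below assumes inclusive lower and upper bounds and assumes that the number of edges is counted on the unfiltered graph; both are illustrative choices, and `apply_edge_thresholds` is a hypothetical name rather than part of CLAME.

```python
def apply_edge_thresholds(adjacency, lower, upper):
    """Keep only reads whose number of edges lies in [lower, upper].

    Edges that pointed to a discarded read are dropped as well.
    """
    kept = {
        read_id
        for read_id, neighbours in adjacency.items()
        if lower <= len(neighbours) <= upper
    }
    return {read_id: adjacency[read_id] & kept for read_id in kept}


# Two triangles of reads joined only through one hub-like read,
# standing in for a conserved region shared by two chromosomes.
toy = {
    "a1": {"a2", "a3", "hub"},
    "a2": {"a1", "a3", "hub"},
    "a3": {"a1", "a2", "hub"},
    "b1": {"b2", "b3", "hub"},
    "b2": {"b1", "b3", "hub"},
    "b3": {"b1", "b2", "hub"},
    "hub": {"a1", "a2", "a3", "b1", "b2", "b3"},
}
filtered = apply_edge_thresholds(toy, lower=1, upper=3)
for read_id in sorted(filtered):
    print(read_id, sorted(filtered[read_id]))
```

Once the hub is removed, the two triangles are no longer connected, so a later traversal would report them as two separate bins.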

Graph traversal and bin generation

CLAME uses a greedy breadth-first search strategy to traverse the graph and to report each subgraph as a bin. It starts at an arbitrary node of the graph and explores the neighbor nodes first, before moving to the next level of neighbors. It takes the edge thresholds into consideration to decide whether a node is added to the bin or further analyzed. The process ends when no more reads can be added to the bin. At this point all the reads visited are reported as members of the same bin and a new seed is taken. This is repeated until all reads have been added to a bin. At the end, the bins and their reads are reported in output fasta files. CLAME allows the user to define a minimum bin size (number of reads) to avoid reporting singletons or very small bins.
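The traversal can be sketched as a standard breadth-first search that reports each connected subgraph as a bin. This simplified version assumes that the edge thresholds were already applied to the adjacency list, whereas CLAME is described as checking them during the traversal; the function name and the toy graph are hypothetical.

```python
from collections import deque


def bins_by_bfs(adjacency, min_bin_size=1):
    """Report each connected subgraph as a bin, largest bin first."""
    visited = set()
    bins = []
    for seed in adjacency:
        if seed in visited:
            continue
        # Start a new bin from an unvisited seed read.
        visited.add(seed)
        queue = deque([seed])
        current = []
        while queue:
            node = queue.popleft()
            current.append(node)
            # Visit the direct neighbours before going one level deeper.
            for neighbour in adjacency[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
        # Skip singletons or very small bins when requested.
        if len(current) >= min_bin_size:
            bins.append(sorted(current))
    return sorted(bins, key=len, reverse=True)


toy = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a"},
    "d": {"e"},
    "e": {"d"},
    "f": set(),
}
print(bins_by_bfs(toy, min_bin_size=2))
```

Each read is enqueued at most once, so the traversal is linear in the number of reads plus the number of edges.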

Simulated simple metagenome

A synthetic metagenome dataset was created using 289,917 reads of Brucella canis and 375,122 reads of Mycobacterium tuberculosis, both generated with Roche’s 454 Titanium platform and associated with the NCBI’s bioprojects PRJEB4803 and PRJEB8877, respectively. The reads were quality trimmed at Q30 using Prinseq [43]. The cleaned reads were concatenated into a single multi-fasta file to get a total of 665,039 mixed reads that formed the Brucella-Mycobacterium synthetic metagenome. These reads were binned using CLAME, requiring an alignment of at least 70 bases. The parameters were determined experimentally, such that CLAME generated 2 bins for this metagenome (see Additional file 1 and Additional file 2 for the details).

The number-of-edges histogram of B. canis and M. tuberculosis is shown in Fig. 2; it was plotted with the in-house Python script plotHist.py, which can be found as part of CLAME. Quality control for each bin was performed by matching the content (read codes) of each bin against the original fastq files.

We also used the MetaBinG [27], MetaProb [28], BiMeta [29], and AbundanceBin [31] tools to bin the metagenome. For the tools in which the number of bins or species can be specified, this parameter was set to 2. Quality control for each tool was performed by matching the content (read codes) of each bin against the original raw files. Table 1 shows the results of all the binning tools.

Table 1 Bins reported by each tool on the simulated metagenome. It also shows the number of reads that belong to each genome for each bin, and the time it took each tool to create the bins

Simulated multi-species metagenome

We created a metagenomic dataset based on the bacterial genomes of five species, which were downloaded from the NCBI database: Synechocystis (SRA code DRR106442), Dokdonella (SRA code SRR4217676), Hymenobacter (SRA code SRR1334914), Microbacteria (SRA code SRR5493999) and Rhizobium (SRA code SRR5165471). For each species, the raw reads downloaded were merged into an extended single multifasta file using the Flash tool [44] (minimal identity parameter of 65 bases). In order to simulate different abundance levels, similar to the real spring-water metagenome, different amounts of extended reads were randomly taken from each dataset. Table 2 shows the number of raw reads, the taxonomy of each species, the number of reads used (after using Flash to join read pairs), the size of the reported genome and the depth of each genome used. The final dataset was produced by concatenating the selected sequences into a single multifasta file.

Table 2 Species and total reads used to create the simulated multi-species metagenome. It shows the size of the original database, in reads and bases, the reads and bases used to create the metagenome, the size of the reported genome, and the depth calculated as the bases used divided by the genome size
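The depth defined in this caption is simply the number of bases used divided by the genome size. The short sketch below writes that relation out and inverts it to estimate how many reads would be needed for a target depth. The numbers are placeholders chosen only for illustration, not values from Table 2, and `reads_for_depth` assumes a fixed mean read length, which is a simplification.

```python
import math


def depth(bases_used, genome_size):
    """Sequencing depth: bases used divided by the genome size."""
    return bases_used / genome_size


def reads_for_depth(target_depth, genome_size, mean_read_length):
    """Approximate number of reads needed to reach a target depth."""
    return math.ceil(target_depth * genome_size / mean_read_length)


# Placeholder values, not taken from the paper.
example_depth = depth(bases_used=30_000_000, genome_size=3_000_000)
example_reads = reads_for_depth(
    target_depth=10, genome_size=3_000_000, mean_read_length=300
)
print(example_depth, example_reads)
```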

CLAME was executed using 70 bases alignment and no edge thresholds. The number of edges histogram is shown in Fig. 3 (generated with the script plotHist.py). Using the histogram CLAME was executed again using 70 bases and edge thresholds for the range 1, 51, 10,000. Quality control for each bin was manually checked, by matching the bins content versus the read codes from the original raw files (see Additional file 1 for the details).

We also executed MetaBinG [27], MetaProb [28], BiMeta [29], and AbundanceBin [31] tools with this metagenome. For the tools in which the number of bins or species can be specified, this parameter was configured to 5. Quality control for each tool was again checked, by matching the content of each bin against the original raw file codes. Table 3 compares these results versus CLAME’s results.

Table 3 Bins reported by the binning tools on the simulated multi-species metagenome. It also shows the number of reads that belong to each genome for each bin, and the time it took each tool to create the bins

Illumina MiSeq metagenomic read set

This dataset corresponds to a real metagenomic sequencing experiment of human intestinal microbiota after a separation stage, where the intestinal protozoan Cryptosporidium hominis was enriched [45]. The original paired-end reads cover the whole genome of this protozoan parasite, which is contained in 8 chromosomes. The reported reads belonging to C. hominis (1,066,460) were downloaded from the SRA database (accession ERX1047563). The metagenome raw reads (9,052,596) (available in CLAME’s GitHub) were trimmed with the Prinseq [43] tool, using a minimum quality cutoff of Q30. Then the reads were merged into an extended single multifasta file using the Flash [44] tool. There were 6,052,596 reads left after these steps.

The 6,052,596 reads were binned using CLAME with 100 bases alignment and custom edge thresholds. The distribution of the number of edges on the metagenome and the C. hominis’ read contribution was plotted using the python script plotHist.py (Fig. 4). We manually selected the bins that included reads from C. hominis genome (see Additional file 1 for the details).

CLAME’s performance was measured using the C. hominis reference genome (SRA Accession ERX1047563) as a control, by comparing the coverage generated by the original reads with the coverage generated by the binned reads. Bowtie2 [46] was used to map the reads to the reference. Figure 5 shows the obtained coverage; the data were plotted on the same figure using another in-house script (plotMapping.py).

Additionally, we analyzed the biggest bins produced by CLAME (Tables 4 and 5). Each bin was assembled using Newbler [14], set to minimum identity (mi = 95) and minimum length (ml = 60). Annotation of the Large contigs (> 500 bases) was done using AMPHORA2 [17], MEGAN [23] and RAIphy [47]. AMPHORA2 and RAIphy were executed with default parameters. For MEGAN, we generated a BLASTn-comparison file of the Large contigs (> 500 bases) against a local NT (downloaded in May 2017) in XML format (see Additional file 1 for the details).

Table 4 Assembly statistics of the biggest bins reported by CLAME on the Illumina metagenome

Table 5 Annotation of Newbler’s Large contigs assembled from the biggest bins reported by CLAME on the Illumina metagenome

San Vicente hot spring metagenome

San Vicente is a hot spring within the Cerro-Machin-Cerro-Bravo volcanic complex in the Colombian Andes, located at 4° 50.25’ N and 75° 32.35’ W at an altitude of 1715 masl. It is characterized by waters with discharge temperatures above 60 °C (max. 91 °C), a pH of 6.7 and high concentrations of chlorides. To reduce the complexity of the community, we incubated a sample of the hot spring (discharge temperature 64 °C) in a non-selective mineral medium, maintained at 45 °C with white light for 15 days (Fig. 6). We extracted the community DNA using the PowerMax® Soil DNA Isolation Kit supplied by MOBIO Corporation [48], following the instructions of the manufacturer. The sample was sequenced using Roche’s 454 Titanium technology in 3/4 PTP at the Centro Nacional de Secuenciación Genómica - CNSG, Universidad de Antioquia, Medellin, Colombia. A total of 926,130 reads (available in CLAME’s GitHub and on the NCBI’s project PRJNA431299) were generated with a 300 bp average length. Raw reads were trimmed using the Prinseq [43] tool to keep reads at least 50 bases long and with a quality of at least 30 at the 3′ end (see Additional file 1 for the details). Finally, a total of 900,370 quality reads were obtained for further processing steps. The analysis proceeded in two directions: 1) a de-novo metagenome assembly of the cleaned reads using popular state of the art tools (see below), followed by comparison and annotation; 2) the binning of the quality reads using CLAME, followed by assembly and annotation of the biggest bin.
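The trimming rule described above (a quality of at least 30 at the 3′ end and a minimum length of 50 bases) can be sketched as follows. This is an illustrative reading of the rule, assuming that low-quality bases are removed one at a time from the 3′ end until a base with sufficient quality is found; Prinseq's actual procedure and options may differ, and `trim_3prime` is a hypothetical helper.

```python
def trim_3prime(seq, quals, min_qual=30, min_len=50):
    """Trim low-quality bases from the 3' end of a read.

    seq: read sequence; quals: per-base quality scores (same length).
    Returns (sequence, qualities) after trimming, or None when the
    trimmed read is shorter than min_len.
    """
    end = len(seq)
    # Walk back from the 3' end while the last base is below the cutoff.
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], quals[:end]


good = trim_3prime("A" * 60, [35] * 55 + [10] * 5)
short = trim_3prime("A" * 52, [35] * 45 + [10] * 7)
print(len(good[0]), short)
```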

De-novo assembly was done with Newbler [14], Ray [16] and MetaVelvet [18] (see Table 6). Newbler assembly was set to minimum identity (mi = 95) and minimum length (ml = 60). The Ray and MetaVelvet assembly tools were configured to use a k-mer size of 31. Annotation of the Large contigs (> 500 bases) reported by Newbler was done using AMPHORA2 [17], MEGAN [23] and RAIphy [47]. AMPHORA2 and RAIphy were executed with default parameters. For MEGAN, we generated a BLASTx-comparison file of the Large contigs (> 500 bases) against a local NR in XML format (downloaded in April 2016) (see Additional file 1 and Additional file 3 for the details). Figure 7 summarizes these results.

Table 6 Assembler statistic reported by each tool on the original hot spring dataset, without binning

The binning process with CLAME was executed using 70 bases alignment and without edge threshold restrictions. Using the edge analysis stage, CLAME was executed again using 70 bases and edge thresholds with a lower bound of 30 edges and an upper bound of 130 edges (see Fig. 8). Only the biggest bin was kept for further analysis.

Assembly of the biggest bin was done using Newbler [14], Ray [16] and MetaVelvet [18] (see Table 7 and Fig. 9). Newbler parameters were: minimum identity 95 and minimum length 60. The Ray and MetaVelvet assembly tools were configured to use a k-mer size of 31. Large contigs generated by Newbler were classified with AMPHORA2 [17], MEGAN [23] and RAIphy [47] (Figs. 10 and 11). For MEGAN, we previously generated a BLASTx-XML comparison file of the Large contigs (> 500 bases). The assembly completeness for Newbler’s contigs was measured in terms of gene content and the presence of Universal Single-Copy Orthologs (see Additional file 1 and Additional file 2 for the details).

Table 7 Assembler statistic reported by each tool on the hot spring dataset of the biggest bin produced by CLAME

Putative open reading frames (ORFs) were detected using the CheckM [49], Prodigal [50] and GeneMark [51] tools (Table 8). Quality control for the ORFs reported by Prodigal was done using BLASTp [38] against the NR database from NCBI. Then we employed MEGAN [23] to assign each ORF to a taxonomic level (Fig. 12). Universal Single-Copy Orthologs analysis was done using the BUSCO tool [52] (see Additional file 1 and Additional file 2 for the details).

Table 8 Gene composition analysis for the Newbler’s Large contigs assembled of CLAME’s biggest bin of the hot spring metagenome

Initial taxonomic classification of the organisms represented within the resulting assembled contig set was done by searching for contigs that contain 16S ribosomal gene sequences. The selected contigs were manually curated, annotated (Table 9) and used to build an evolutionary tree (Fig. 13). The phylogenetic tree was inferred by using the Maximum Likelihood method with the Jukes-Cantor model [53] and the process described by Brumm et al. [54]. We kept the same number of bootstrap replicates (500) and the bootstrapped tree topology to represent the evolutionary history of the taxa analyzed. We used the strategy of Brumm et al. to obtain the initial tree(s). However, our analysis involved 29 nucleotide sequences, instead of 26 samples. There were a total of 547 positions in the final dataset. All the analyses were performed in MEGA 7.0 [55].

Table 9 BLASTn top 7 hits report for the 16S rRNA gene sequence found in the Newbler’s contig00154 of the assembly of CLAME largest bin of the hot spring metagenome

In order to gain insight into the functional annotation of the predicted proteome of the Xanthomonadaceae of the San Vicente hot spring, Gene Ontology annotation was performed for the 2726 ORFs predicted by Prodigal (Figs. 14, 15 and 16). It was done using BLASTp comparisons of all the predicted peptides against the NCBI’s protein NR database and the BLAST2GO version 2.8 [56] annotation tool. Additionally, KAAS (KEGG Automatic Annotation Server) [57] was employed to provide a detailed functional annotation of the predicted genes.

We compared CLAME against the MetaBinG [27], MetaProb [28], BiMeta [29], and AbundanceBin [31] tools. For the tools in which the number of bins or species can be specified, we decided to set it to 5, according to the number of phyla found by the annotation tools described previously. The biggest bins reported by each tool were assembled using Newbler [14], set to minimum identity (mi = 95) and minimum length (ml = 60) in all cases. Table 10 compares these results with CLAME’s de-novo assembly for the biggest bin.

Table 10 Newbler assembly statistics of the bins reported by each tool on the hot spring metagenome. It also shows the time it took each tool to create the bins

We also analyzed the other bins (with at least 2000 reads) produced by CLAME. These bins were assembled with Newbler [14], minimum identity (mi = 95) and minimum length (ml = 60), and annotated with AMPHORA2 [17], MEGAN [23] and RAIphy [47]. AMPHORA2 and RAIphy were executed with default parameters and for MEGAN we generated a BLASTn-comparison file of the Large contigs (> 500 bases) against a local NT (downloaded on May 2017) in XML format (see Additional file 1 for the details).

In order to study the other species present in the metagenome, we built an auxiliary dataset by deleting the reads binned in the first CLAME execution and keeping the remaining reads of the original dataset. A total of 519,524 reads make up this second dataset. CLAME was executed on this dataset using 15 bases matching and edge thresholds for the range 10 to 20 (Fig. 17); only bins with at least 2000 reads were reported. The parameters were configured experimentally to get suitable bins. The biggest bin produced by CLAME was assembled with Newbler [14] and annotated using AMPHORA2 [17], MEGAN [23] and RAIphy [47] (Tables 11 and 12). AMPHORA2 and RAIphy were executed with default parameters. For MEGAN we generated a BLASTn-comparison file of the Large contigs (> 500 bases) against a local NT (downloaded in May 2017) in XML format.

Table 11 Thermal metagenome Newbler assembler statistics for the balance reads (without the reads used for the draft genome)

Table 12 Annotation of Newbler’s Large contigs assembled from the thermal metagenome from the balance reads (without the reads used for the draft genome)

CLAME computational performance

We show CLAME’s speed and memory performance in Figs. 18 and 19. All the experiments were performed on a computer equipped with 64 Intel(R) Xeon(R) CPU X7560 @ 2.27GHz and 500 GB of RAM. CLAME was implemented in C++ using the OpenMP (Open Multi-Processing) interface. We executed CLAME with 1, 2, 4, 8, 16, 32 and 64 threads on each of the datasets explained previously. We selected the best of five executions. Valgrind [58] was used to measure CLAME’s memory usage. We took the maximal memory usage of each experiment.

Results

We calibrated CLAME using publicly available NGS data from 454 and Illumina MiSeq platforms; then we used it to study the metagenomic dataset obtained from a hot spring in the Colombian Andean Mountains (located in San Vicente, Risaralda, Colombia).

Simulated metagenome

We tested CLAME with the simulated metagenome, which was created by combining DNA sequencing reads from Brucella canis and Mycobacterium tuberculosis. The mixed dataset of 665,039 reads was built, as described in the methods section, using 289,917 reads of B. canis and 375,122 reads of M. tuberculosis. In order to understand the profile of the number of edges, we ran CLAME three times: only with M. tuberculosis reads, only with B. canis reads, and with the simulated metagenome (the combination of both). Figure 2 illustrates the number-of-edges histogram produced by CLAME in the read alignment stage using 70 bases alignment. CLAME generated two main bins that contained 353,876 and 280,014 reads. The first bin, with 353,876 reads, was formed exclusively by reads of M. tuberculosis; they represent 94.3% of the original M. tuberculosis set. The second bin, with 280,014 reads, was composed exclusively of B. canis reads; they represent 96.5% of the original B. canis read set. Most of the remaining reads were short (smaller than 70 bases) and therefore they were binned as singletons.
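The recovery figures quoted in this paragraph follow directly from the read counts, as the short check below shows. It uses only numbers stated in the text; the helper name `recovery` is hypothetical.

```python
def recovery(binned_reads, total_reads):
    """Percentage of a species' reads that ended up in its main bin."""
    return 100.0 * binned_reads / total_reads


m_tuberculosis = recovery(353_876, 375_122)
b_canis = recovery(280_014, 289_917)

# Reads of the synthetic metagenome that are in neither main bin.
outside_main_bins = 665_039 - (353_876 + 280_014)

print(f"M. tuberculosis: {m_tuberculosis:.2f}%")
print(f"B. canis: {b_canis:.2f}%")
print(f"reads outside the two main bins: {outside_main_bins}")
```

The computed values agree with the reported 94.3% and 96.5% to within rounding, and the difference between the dataset size and the two main bins gives the number of reads left outside them.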

We compared CLAME’s performance against the other binning tools. Table 1 summarizes the results produced by CLAME, MetaBinG [27], MetaProb [28], BiMeta [29], and AbundanceBin [31]. It shows that although most tools produced individual bins for B. canis and M. tuberculosis reads, only CLAME created bins that contained reads from only one species. The table also shows the time it took each tool to create the bins, (all the tools were executed on one thread), and it shows that CLAME is the fastest of all.