Evaluation of short read metagenomic assembly
© Charuvaka and Rangwala; licensee BioMed Central Ltd. 2011
Published: 27 July 2011
Metagenomic assembly is a challenging problem due to the presence of genetic material from multiple organisms. The problem becomes even more difficult when short reads produced by next generation sequencing technologies are used. Although whole genome assemblers are not designed to assemble metagenomic samples, they are being used for metagenomics due to the lack of assemblers capable of dealing with such samples. We present an evaluation of the assembly of simulated short-read metagenomic samples using a state-of-the-art de Bruijn graph based assembler.
We assembled simulated metagenomic reads from datasets of various complexities using a state-of-the-art de Bruijn graph based parallel assembler. We also studied the effect of the k-mer size used in the de Bruijn graph on metagenomic assembly and developed a clustering solution to pool the contigs obtained from different assembly runs, which allowed us to obtain longer contigs. We also assessed the degree of chimericity of the assembled contigs using an entropy/impurity metric and compared the metagenomic assemblies to assemblies of the isolated individual source genomes.
Our results show that the accuracy of the assembled contigs was better than expected for the metagenomic samples with a few dominant organisms and was especially poor in samples containing many closely related strains. Clustering contigs from runs with different k-mer parameters of the de Bruijn graph allowed us to obtain longer contigs; however, the clustering resulted in the accumulation of erroneous contigs, thus increasing the error rate in the clustered contigs.
Advances in sequencing technologies have equipped researchers with the ability to sequence the collective genomes of entire microbial communities, commonly referred to as metagenomics, in an inexpensive and high-throughput manner. Microbes are omnipresent within the human body and environments across the world. As such, characterizing and understanding their roles is crucial for improving human health and the environment. Metagenomics provides an unbiased view of the diversity and biological potential of microbial communities, and analysis of community samples from several different microbial environments has provided key insights into these communities. Some of the important metagenomic endeavours have radically transformed our knowledge of the microbial world. One of the pioneering studies, which sequenced samples from the Sargasso Sea, revealed more than 1.2 million unknown genes and identified 148 new bacterial phylotypes. Another study, the Sorcerer II Global Ocean Sampling project, has added many new protein families to existing protein databases, and a large scale metagenomic analysis of fecal samples has identified and cataloged a common core of genes and gut bacteria.
One of the major challenges related to metagenomic processing is the assembly of short reads obtained from community samples. Due to the lack of specific assemblers to handle metagenomes, researchers continue to use assemblers originally developed for whole genome assembly.
We have evaluated the performance of a state-of-the-art Eulerian-path based sequence assembler on simulated metagenomic datasets using a read length of 36 base pairs (bp), as produced by the Solexa/Illumina sequencing technology. The datasets were meant to reflect the different complexities of real metagenomic samples. They included a low complexity dataset with one dominant organism, a high complexity dataset with no dominant organism, and a medium complexity dataset having a few dominant organisms. We also created a dataset containing different strains of the same organism to measure the extent of co-assembly when reads from very similar organisms are used. Since the metagenomic read datasets are voluminous, we used a parallel sequence assembly algorithm (ABYSS) which can be deployed easily on a commodity Linux cluster.
The assembled contigs were evaluated based on several quality measures for contig length and assembly accuracy. To improve the quality of the contigs, we clustered the results of different parameter runs of the assembler. We used efficient local alignment to quickly and accurately map the assembled contigs to the input source genomes. We also used a short read mapping algorithm to align the input reads to the assembled contigs to compute the homogeneity of the assembled contigs using entropy as a metric. Finally, we assessed the coverage of the source genomes by the produced contigs.
Short-read assembly of metagenomes performed better than our initial expectation in some aspects, such as the accuracy of the contigs and the coverage of the source genomes. However, fragmentation of the contigs was more severe in metagenomic datasets than in the isolate assemblies. The assembly of a smaller dataset consisting of reads from 30 E. coli strains showed that the contigs obtainable through co-assembly of related strains are considerably shorter than those generated using isolate assemblies. We also observed that by clustering results from assembly runs with different k-mer size values of the de Bruijn graph, we were able to obtain a greater number of longer contigs (as optimal contigs are distributed across the k-mer space).
Traditionally, microbial genomics has relied primarily on pure cultures of microbes for sequencing. In recent years, researchers have developed a new approach known as metagenomics wherein the genetic material is obtained by directly sequencing the complex microbial communities without prior culturing. This presents an unbiased view of the diversity and biological potential of these communities .
The heterogeneous nature of the genetic material contained in metagenomic samples presents significant challenges for metagenomic assembly and analysis. Metagenomic samples have genomic content from many organisms which cannot be easily separated. The genetic material of individual organisms in these samples is roughly proportional to the abundance of those organisms in the communities, which varies significantly. The dominant organisms are over-represented, whereas organisms at low levels of abundance are not sequenced at sufficient depth. Also, polymorphism between related members of the communities may lead to incorrect estimation of repeat structures.
Phylogenetic classification of metagenomic reads is a problem closely related to assembly. Ideally, if the reads could be separated by their respective genomes, the problem of assembling them would be much simpler. However, segregation of reads in this manner is difficult to perform, and several supervised and unsupervised approaches have been developed to address this problem. Notable ones include Phymm and PhymmBL, which use Interpolated Markov Models; MEGAN, which classifies sequences based on the Lowest Common Ancestor from a sequence similarity search using BLAST; and PhyloPythia, a multiclass support vector machine based classifier using oligonucleotide frequencies. Unsupervised methods for classifying reads are sometimes also referred to as clustering or binning methods. Unsupervised methods have an advantage over supervised methods because known sequences represent only a minority of the estimated microbial diversity. The major drawback of classification and binning methods is that they are reliable only for relatively long sequences, of size greater than 1000 bp. In addition to binning and assembly, several other tools have been developed for gene prediction and comparative analysis of metagenomic datasets. We refer the reader to the literature for a review of computational challenges and available tools for metagenomics.
Sanger’s method has been the dominant sequencing platform for several decades. In recent years, the emergence of so-called “Next Generation Sequencing” (NGS) technologies has radically transformed the DNA sequencing domain. These new technologies are amenable to parallel sequencing and yield a much higher throughput at significantly lower cost per base compared to Sanger’s method. The trade-off with NGS is a shorter read length, although read lengths are gradually improving. NGS is particularly suited for metagenomic applications because it obviates the need for clonal culturing, and the lower per-base cost allows genomes to be sequenced at much greater depth than is feasible through Sanger-based methods. The conventional Overlap Layout Consensus (OLC) strategy has been one of the most successful paradigms for assembling long Sanger-based reads. However, in recent years an alternative method, inspired by Sequencing by Hybridization and based on Eulerian tours of de Bruijn graphs, has gained prominence. Assemblers using this Eulerian approach include EULER, VELVET, ABYSS and ALLPATHS.
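To make the de Bruijn idea concrete, the following is a minimal Python sketch (not the implementation used by any of the assemblers named above; function names are illustrative): reads are decomposed into k-mers, each k-mer becomes an edge from its (k-1)-prefix node to its (k-1)-suffix node, and contigs correspond to non-branching paths in the resulting graph.

```python
from collections import defaultdict

def build_debruijn(reads, k):
    """Decompose reads into k-mers; each k-mer is an edge from its
    (k-1)-prefix node to its (k-1)-suffix node."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def extend_contig(graph, start):
    """Walk unambiguous nodes (single out-edge, single in-edge) from
    `start`, appending one base per step; this non-branching path is
    the sketch's notion of a contig."""
    indeg = defaultdict(int)
    for succs in graph.values():
        for v in succs:
            indeg[v] += 1
    contig, node, visited = start, start, {start}
    while len(graph.get(node, [])) == 1:
        nxt = graph[node][0]
        if indeg[nxt] != 1 or nxt in visited:
            break  # branch point or cycle: stop extending
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig
```

For example, the single error-free read "ACGTT" with k=3 yields a chain of three k-mer edges, and extending from the node "AC" recovers the full read. Sequencing errors and repeats introduce branch points, which is why real assemblers must do substantially more work than this sketch.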
To study the extent of errors in metagenomic assemblies in comparison to single genome assembly, we performed a set of experiments on simulated datasets. Although simulated datasets do not completely capture the characteristics of real metagenomes, simulation studies do provide some insight into the feasibility of assembly of short read metagenomic samples. In our current work, we have estimated the extent of problems associated with the assembly of short reads obtained from the next generation sequencing Solexa platform for metagenomic samples. A similar study by Mavromatis et al. produced three simulated metagenomic datasets representing microbial communities of different complexities using reads obtained from Sanger-based sequencing. They used these datasets for benchmarking various metagenomic processing methods. One of the focuses of their study was estimating the chimericity in assembling reads obtained from Sanger sequencing using Overlap-Layout-Consensus (OLC) based assemblers commonly used for whole genome assembly. Another simulation study by Wommack et al. evaluated simulated NGS short reads from different metagenomic samples for taxonomic and functional annotation. As more and more metagenomic projects have started to tap into the potential of NGS, we felt the need for a similar simulation study to evaluate short read assemblers. The work of Pop provides a good overview of the OLC and Eulerian assembly paradigms and addresses some of the challenges associated with short read assembly. Since next generation sequencing allows samples to be sequenced at a greater depth, we used considerably larger datasets. Several researchers have studied the performance of NGS short reads and paired-end short reads for individual genome assembly [26–28]. Recently, Kingsford et al. performed a theoretical analysis of Eulerian-path based approaches to survey the repeat structure of individual prokaryotic genomes.
We have evaluated metagenomic assemblies based on the accuracy of the generated contigs using alignment-based similarity to the source genomes, contig length statistics, and the proportions of the source genomes recovered by the contigs. As the k-mer size of the de Bruijn graph plays a crucial role in ABYSS’s assembly, we have also assessed the impact of the k-mer size on metagenomic assemblies by comparing the contigs produced from runs with different k-mer parameter values. We also provide comparisons of the metagenomic assemblies to the isolate assemblies of the constituent genomes.
Table 1. Contig Alignment Statistics (columns: % Accurate Contigs, Bases in Accurate Contigs, % Bases in Accurate Contigs).
As a benchmark for our metagenomic assemblies, we separated the reads by their source sequence and performed isolate assemblies. We assembled the reads from each individual sequence separately and combined the final contigs from all the source sequences. We performed the isolate assemblies with different values of k and pooled the results using the clustering algorithm. Figure 1 (D) compares the length distribution of clustered results from metagenomic and isolate assemblies. The simHC dataset produced shorter contigs in both isolate and metagenomic assemblies. Between the simLC and simMC datasets, the performance of simLC was closer to its isolate assembly, whereas the simMC metagenomic assembly was far poorer in comparison to its isolate assembly.
Even assemblies of isolate genomes are not completely error free. In the case of metagenomes, the presence of multiple genomes at different coverage depths causes additional problems, and the contigs are expected to have more mis-assemblies than contigs from isolate genome assemblies. To assess the accuracy of the assembly, we aligned the contigs back to the source genomes. Table 1 reports the results for the different datasets. A threshold accuracy of 95% was used for considering a contig accurate. Details of the alignment methods and accuracy calculations are provided in the methods section. The assembly accuracy decreased as the k-mer size was decreased and was worst for all datasets at k=21. Further, the accuracy of the clustered contigs was the lowest, due to the accumulation of errors from all the contig sets. This is a consequence of our clustering approach, which tries to retain all unique sequences. An alternative clustering strategy could take the consensus of contigs obtained from different runs; this would improve the accuracy of the results while reducing the total number of bases recovered.
In an ideal case of metagenomic assembly, all the reads forming a contig would come from the same source sequence. In metagenomes, reads from different sources may be co-assembled, resulting in chimeric assemblies. Therefore, to estimate the degree of chimericity, we evaluated the homogeneity of contigs using their read composition. Essentially, the greater the number of sources from which a contig is assembled, the higher its entropy. The methods section describes the alignment of reads to the assembled contigs and the entropy calculations.
To evaluate the improvement in assembly quality with mate-pair information, we generated and assembled datasets similar to simHC, simLC and simMC with paired-end reads of insert length 2000 bases. However, we observed that only a small fraction (less than 2%) of the contigs produced made use of the mate-pair information. In addition, most of these contigs were assembled with gaps. For our analysis, we broke those contigs at the gaps and treated them as separate contigs. Due to these two factors, we did not observe a significant improvement in the assembly quality of paired-end reads in metagenomic samples.
In this paper we have presented the results of the assembly and analysis of several simulated metagenomic datasets. Short-read assembly of metagenomes performed better than our initial expectations in some aspects, such as the accuracy of the contigs and the coverage of the source genomes. Although a large fraction of the contigs were assembled accurately, fragmentation of the contigs was more severe in metagenomic datasets when compared to the isolate assemblies. Further, assembling the high complexity dataset was more difficult than assembling the low and medium complexity datasets. We also observed that ABYSS was able to utilize the mate-pair information for only a small fraction of the contigs, most of which were assembled with gaps. Therefore, using mate-pairs did not improve the results significantly in our case. The assembly of a smaller dataset consisting of reads from 30 E. coli strains showed that the contigs obtainable through co-assembly of related strains are considerably shorter than those generated using isolate assemblies.
We also observed that by simple clustering of the results from various assembly runs (with different k-mer size values of the de Bruijn graph) we were able to obtain a greater number of longer contigs, as optimal contigs are distributed across the k-mer space. However, because our simple clustering approach retains all unique contigs, most mis-assembled contigs made their way into the clustered results, increasing their error rate. Further improvements in the clustering technique may be required to improve the quality of the clustered results.
We created our simulated datasets using Metasim . It is a sequencing simulation tool for generating synthetic metagenomic datasets using a collection of complete genomic sequences. Metasim provides options for controlling various simulation parameters such as sequencing platform, read length, sequencing depth of individual sequences, error rate, and error distribution. We generated reads of length 36 bp using the default empirical error model of Metasim, which simulates the reads produced by Solexa sequencing technology. The bacterial sequences for generating the reads were taken from the completely assembled bacterial genomes from NCBI  genomes database.
Metagenomes vary considerably in their compositions depending on the environment from which the reads were sampled. Therefore, to assess the assembly quality as a function of metagenome’s complexity, we constructed three datasets using the profiles described in . These datasets, simLC (low complexity), simMC (medium complexity), and simHC (high complexity) simulate the composition of real metagenomic datasets. In the low complexity simLC dataset, a sizable portion of the reads belong to a single dominant organism. The high complexity dataset simHC has no distinctly dominant organism and all organisms are present in approximately equal concentrations. The simMC dataset has more than one dominant organism, but their concentrations in the samples are considerably lower than the concentration of the dominant sequence in simLC. We also produced datasets similar to each of the three datasets described above with paired-end reads with an insert size of 2 kb, to evaluate the performance of paired-ended assembly.
Due to the high computational requirements for the assembly of our metagenomic datasets, we used ABYSS  assembler (version - 1.0.8) which can perform parallel assembly using a cluster of commodity computers. We assembled all of our datasets using ABYSS, with read length of 36 bp and varied the k-mer size parameter of ABYSS’s distributed de Bruijn Graph between 21 and 33 (in increments of two) to obtain different assemblies. In the presence of sequencing errors the optimal k-mer size for Eulerian path based assemblers is determined by the coverage of source sequences. For high coverage, values close to read length produce longer contigs. Similarly, if the percentage of sequencing errors is high, optimal results are obtained by decreasing the k-mer size. All the assembly jobs were run using 32 cluster nodes. We filtered out contigs shorter than 50 bp from the final assemblies.
We also performed paired-ended assemblies using ABYSS on datasets generated with mate-pairs information. ABYSS performs paired-ended assembly in two stages, first it assembles the reads without using the mate-pair information and in the second stage it utilizes the paired reads information to merge the contigs. We set the minimum number of pairs required to merge the contigs at 10.
Due to the presence of sequencing errors and repeat regions in the source genome, assemblies are usually not completely error free, even in the context of a single genome. We assembled the reads from the individual source genomes by separating them first and combined all the produced contigs. In this study, the assembly performed in this manner is referred to as isolate assembly and provides us a comparative baseline to the metagenomic assembly.
We observed that the contigs of optimal length were distributed across the k-mer space. Therefore, we pooled the assembled contigs from different contig sets (obtained using different k-mer values) and clustered them to remove duplicate or suboptimal contigs which were contained in another longer contig. We clustered the contigs using Cd-hit, which uses a greedy incremental algorithm. The first cluster is formed using the longest sequence as the cluster representative, and the remaining sequences are compared to it in decreasing order of their lengths. If a sequence matches one of the cluster representatives with sufficient accuracy, it is placed in that cluster; otherwise, a new cluster is formed with the unmatched sequence as its representative. Instead of performing an actual alignment, Cd-hit uses a short-word filtering algorithm to compute sequence similarity and therefore achieves a significant speed-up compared to alignment-based clustering tools. We clustered our assemblies using a similarity threshold of 95% and a word size of 8 bases (recommended for clustering at high similarity).
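The greedy incremental scheme described above can be sketched in a few lines of Python. This is not Cd-hit's implementation: exact substring containment stands in for its word-filtered approximate similarity test, and the function names are illustrative.

```python
def greedy_cluster(sequences, is_similar):
    """Greedy incremental clustering: the longest sequence seeds the
    first cluster; each remaining sequence (in decreasing length order)
    joins the first cluster whose representative it matches, or seeds
    a new cluster. `is_similar(seq, rep)` is a user-supplied test."""
    reps = []       # cluster representatives, in creation order
    clusters = {}   # representative -> list of member sequences
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in reps:
            if is_similar(seq, rep):
                clusters[rep].append(seq)
                break
        else:  # no representative matched: seed a new cluster
            reps.append(seq)
            clusters[seq] = [seq]
    return clusters

# Stand-in similarity test for this sketch: a contig is redundant
# if it is exactly contained in a longer one.
contained = lambda seq, rep: seq in rep
```

With `contained` as the similarity test, a short contig that is a substring of a longer one collapses into the longer contig's cluster, which mirrors the goal of removing suboptimal contigs contained in longer ones.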
Some contigs, especially the shorter ones, produced multiple alignments, either to the same or to a different genome. Therefore, we used the best accuracy among all the alignments as the contig’s assembly accuracy. For the contig coverage calculations (discussed later), we consider only the contigs that were assembled with a threshold accuracy of at least 95%. We note that this threshold is somewhat arbitrary; in many cases, the acceptable accuracy threshold would depend on the particular application. Our choice was motivated by a similar threshold used by the authors of ABYSS for evaluating assembly accuracy on isolate genomes.
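The best-of-multiple-alignments rule and the 95% threshold can be sketched as follows (the input format, matched-base counts per alignment, is an assumption for illustration):

```python
def best_accuracy(match_counts, contig_len):
    """Accuracy of a contig with possibly multiple alignments: the
    fraction of contig bases matched by its best alignment.
    `match_counts` lists matched-base counts, one per alignment."""
    if not match_counts or contig_len == 0:
        return 0.0
    return max(match_counts) / contig_len

def accurate_contigs(contigs, threshold=0.95):
    """Names of contigs whose best-alignment accuracy meets the
    threshold; only these count toward coverage calculations.
    `contigs` maps contig name -> (match_counts, contig_len)."""
    return [name for name, (matches, length) in contigs.items()
            if best_accuracy(matches, length) >= threshold]
```

For example, a 100 bp contig with alignments matching 90 and 96 bases is scored 0.96 and passes the default threshold.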
We estimated the homogeneity of contigs by observing the source genomes of the reads used to assemble each contig. This was done by performing a read-to-contig alignment using a fast short read aligner. We used BWA, which performs a backward search with the Burrows-Wheeler transform and efficiently aligns short reads against reference sequences. In our case, the reference consisted of the set of assembled contigs. Each read was assigned to the contig for which BWA reported the best match.
For each contig, we computed the entropy of its read sources, H = −Σ_i p_i log p_i, where p_i is the fraction of total reads coming from source genome i. At different phylogenetic levels, organisms generally display greater sequence similarity within their group than with organisms belonging to a different group. Due to this sequence similarity, assemblers are more likely to co-assemble reads from different organisms belonging to the same phylogenetic class. We therefore also computed the entropy at two higher phylogenetic levels, genus and phylum, in addition to the entropy at the sequence and strain level, to see if there is a significant decrease in entropy at higher phylogenetic levels.
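As a concrete sketch of this per-contig entropy computation (using hypothetical read-to-source labels and a base-2 logarithm, an assumption since the paper does not specify the base):

```python
import math
from collections import Counter

def contig_entropy(read_sources):
    """Entropy of a contig from the source labels of its aligned reads.
    A perfectly homogeneous contig (all reads from one source) scores 0;
    more sources in more even proportions score higher."""
    counts = Counter(read_sources)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```

The same function serves every phylogenetic level: passing strain labels, genus labels, or phylum labels for the same reads yields the entropy at that level.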
The need for a short-read aligner arises because, for Eulerian path based assemblers, it is difficult to determine the actual read composition of the contigs. The input reads are not used directly but are broken down into smaller k-mers and the original read information is lost.
For the different assemblies generated by varying the values of the assembly parameter k, we calculated the extent to which the source sequences are recovered by the contigs. This is determined by aligning the contigs to the source sequences. We considered only the accurate alignments of the contigs, i.e. the alignments which accurately cover at least 95% of the contig. For each such alignment, we marked all the positions of the source genomes which were part of the alignment. The collection of all such positions of the source genome covered by the contigs represents the contig coverage of the genome. The contig coverage described here is different from the read coverage, which is the approximate coverage of the genome by the sampled reads and represents the sequencing depth of the source sequences in the datasets. The contig coverage represents the fraction of the source sequence recovered by the contigs and can be at most 1. We refer to this contig coverage as the source coverage ratio.
In cases where contigs had multiple accurate alignments, possibly due to repeat regions or sequences shared between genomes, we counted each contig’s contribution for all of its alignments. Therefore, our mapping of contigs to source genomes is not unique, and our source coverage ratio calculation may slightly over-count. As it is not possible to prefer one particular alignment over another, we believe this is a better option than randomly choosing one alignment per contig.
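The source coverage ratio computation can be sketched as follows. The position-set approach shown here is simple but memory-hungry; a bitmap or merged-interval implementation would scale better for real genomes. The interval input format is an assumption for illustration.

```python
def source_coverage_ratio(genome_len, alignments):
    """Fraction of a source genome covered by accurate contig
    alignments. `alignments` is a list of (start, end) intervals
    (0-based, end-exclusive). Overlapping alignments, e.g. from
    repeats or shared sequence, count each position only once,
    so the ratio is at most 1."""
    covered = set()
    for start, end in alignments:
        covered.update(range(start, end))
    return len(covered) / genome_len
```

For example, two alignments spanning positions 0-5 and 3-8 of a 10-base genome cover 8 distinct positions, giving a ratio of 0.8.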
AC performed the experiments and wrote all the programs needed for this study. Both AC and HR conceptualized this study and set up the different experiments. All authors wrote, read, and approved the final manuscript.
This work was supported by NSF IIS-0905117 and George Mason seed grants.
This article has been published as part of BMC Genomics Volume 12 Supplement 2, 2011: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2010. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S2.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.