Evaluation of viral genome assembly and diversity estimation in deep metagenomes
© Aguirre de Cárcer et al.; licensee BioMed Central Ltd. 2014
Received: 24 August 2014
Accepted: 30 October 2014
Published: 18 November 2014
Viruses have unique properties, small genome and regions of high similarity, whose effects on metagenomic assemblies have not been characterized so far. This study uses diverse in silico simulated viromes to evaluate how extensively genomes can be assembled using different sequencing platforms and assemblers. Further, it investigates the suitability of different methods to estimate viral diversity in metagenomes.
We created in silico metagenomes mimicking various platforms at different sequencing depths. The CLC assembler revealed subpar compared to IDBA_UD and CAMERA , which are metagenomic-specific. Up to a saturation point, Illumina platforms proved more capable of reconstructing large portions of viral genomes compared to 454. Read length was an important factor for limiting chimericity, while scaffolding marginally improved contig length and accuracy. The genome length of the various viruses in the metagenomes did not significantly affect genome reconstruction, but the co-existence of highly similar genomes was detrimental. When evaluating diversity estimation tools, we found that PHACCS results were more accurate than those from CatchAll and clustering, which were both orders of magnitude above expected.
Assemblers designed specifically for the analysis of metagenomes should be used to facilitate the creation of high-quality long contigs. Despite the high coverage possible, scientists should not expect to always obtain complete genomes, because their reconstruction may be hindered by co-existing species bearing highly similar genomic regions. Further development of metagenomics-oriented assemblers may help bypass these limitations in future studies. Meanwhile, the lack of fully reconstructed communities keeps methods to estimate viral diversity relevant. While none of the three methods tested had absolute precision, only PHACCS was deemed suitable for comparative studies.
Several studies have demonstrated the potential for reconstructing genomes from viral metagenomes . A major obstacle for metagenomic reconstruction is the existence of highly similar regions between coexisting genomes which can lead to fragmented assemblies, like repeats in single genome assemblies. In this regard, it has become important to analyze the performance of different sequencing platforms and assemblers in a metagenomic context, and assess their strengths and limitations. Initially, Mavromatis et al.  combined Sanger reads from sequenced bacterial isolates to form in vitro simulated communities of different complexities and benchmark assembly and other metagenomic processing methods. Pignatelli & Moya  derived short-read in silico simulated metagenomes from this work to explore facets of the assembly of high-throughput sequencing data. Charuvaha & Rangwala  evaluated the effects of k-mer size on the performance of the Bruijn-based assemblers. Later, Mende et al.  studied the effect of quality filtering on the assembly of Sanger, 454 and Illumina metagenomic datasets, while Luo et al. compared  the assembly of 454 and Illumina datasets from the same metagenome, and evaluated single bacterial genome reconstruction in a metagenomics setting .
The above-mentioned studies of metagenomic assembly all used bacterial communities and mainly focused on assessing functional and taxonomic annotations. While many of their results and findings can be applicable to the study of natural viral communities, viruses present unique properties. Viral genomes are usually smaller than those of Bacteria and it has become affordable to obtain high coverage of viral metagenomes using current high-throughput sequencing platforms and to attempt reconstructing environmental genomes. Viral genome reconstruction is an important step in the metagenomic analysis of viral communities. Viruses lack a universal phylogenetic marker gene that can be used as the ribosomal genes for cellular organisms. In this sense, both accurate phylogenetic annotation and putative host description rely heavily on the almost complete reconstruction of the viral genome. Additionally, the extent of intra-group variability among viruses is greater than in Bacteria due to their faster evolution rates, which poses increased difficulties to the assembler. Recently, Vázquez-Castellanos et al.  assessed the effects of different overlap-layout-consensus (OLC) assemblers for the functional and taxonomic annotation of an in silico simulated 454 viral metagenome, and Solonenko et al. commented on how different library preparation choices bias the outcome of virome assembly.
Community diversity is an important ecological characteristic of natural communities, and its estimation usually complements taxonomic and functional analyses of viral metagenomes. There are currently three different approaches to estimate viral richness in metagenomic datasets; the use of clustering [10–12], PHACCS  and CatchAll [14, 15]. The latter two represent software tools which rely on assembly results, more precisely contig spectra for the fit of their diversity models. Unfortunately, none of these methods have been the subject of a comparative performance evaluation using viromes of known diversity.
In the present study, we investigate the ability of various sequencing Platform – Assembler – Depth (PAD) combinations to reconstruct the genomes from a high-throughput in silico simulated virome, and explore how genome relatedness impacts the success of genome reconstruction. Furthermore, we evaluate the applicability of three different methods to estimate viral community diversity. Collectively, our results should guide researchers undertaking deep viral metagenomic studies to adequate methods for genome reconstruction and diversity estimation, as well as understand their limitations.
Using a single virtual viral community, composed of 300 genomes with different degrees of relatedness, from both ssDNA (Microviridae, Circoviridae, and Nanoviridae) and dsDNA (Siphoviridae, Podoviridae, and Myoviridae) viral families, we generated a large number of metagenomic reads mimicking Roche’s 454 and Illumina’s GAIIx sequencing platforms. Sequencing costs were kept similar for each technology (based on Reagent cost/Mb values), resulting in different sequencing depths. We also produced lower coverage datasets, containing 10% of the reads of these high coverage datasets, and complemented them with additional Illumina Miseq and Hiseq low coverage metagenomic libraries.
Sequencing and assembly statistics
a ) High-coverage mock
b) Low-coverage mock
We then compared sequencing platforms, focusing on the assembly statistics obtained with the best tested assembler for each platform (CAMERA for 454, and IDBA_UD for Illumina). For the high coverage datasets (Table 1a), Illumina GAIIx achieved a higher number of contigs, contig sum and N50 than 454. Similarly, for the low coverage datasets (Table 1b), Illumina GAIIx outperformed 454, and even surpassed the high coverage dataset by assembling the same amount of reads in fewer, larger contigs. This outcome indicates that the simulated community may have been oversequenced by the GAIIx high coverage dataset, and is consistent with the inability of Bruijn k-mer based assemblers to deal with large numbers of sequencing errors, resulting in more fragmented assemblies . Illumina Miseq and Hiseq (low coverage) contigs (Table 1b) were also more numerous and longer than those of 454, and there were no major differences between them, other than Hiseq having a slightly larger N50 value. All Illumina platforms were able to completely recover the longest genome in the dataset (Pseudomonas_phage_phiKZ, 280 kbp).
We also analyzed the behavior of 454 and Illumina technologies using an empirical metagenomic data derived from a single Antarctic freshwater viral community (Table 1c). This dataset contains about 40% of the sequence information of the low coverage 454 and Illumina GAIIx datasets. In this case, it was possible to assemble more reads into contigs with Illumina than with 454, leading to much higher maximum contig length and N50 values.
Accuracy and chimericity statistics
a) High-coverage mock
99.8 ± 0.5
0.93 ± 1.09
99.1 ± 3.4
0.90 ± 1.01
94.0 ± 14.0
1.30 ± 1.03
99.9 ± 1.1
1.27 ± 0.94
99.9 ± 1.4
1.25 ± 0.95
b) Low-coverage mock
99.2 ± 3.4
0.59 ± 0.87
99.6 ± 1.6
1.27 ± 0.78
99.5 ± 3.1
1.26 ± 0.79
99.5 ± 3.7
1.32 ± 0.99
98.7 ± 3.9
1.32 ± 0.87
Genome and community characteristics effects on genome reconstruction
It is noteworthy that the best PAD combination for contig coverage (low-coverage GAIIx-IDBA) only yielded maximum contig coverage above 95% for 162 out of the 300 different genomes in the community, a figure that was not improved by increasing the sequencing effort 10-fold (Figure 1). This inability to obtain contigs spanning a large proportion of many original genomes, despite high sequencing coverage may be due to: i) limits to what the assembly algorithm can achieve given a particular community, either theoretically or due to imperfect design, ii) variable genome coverage along genomes (e.g. due to lower conservation or to GC bias), iii) large genome length, iv) the existence of repeats regions in the genome, and v) high community diversity.
Several of these factors were further studied using the Illumina Hiseq dataset showing best overall performance and likely future use, and while more contigs derived from this dataset were chimeric compared to its Miseq counterpart they were still highly accurate. Differential coverage along genomes (due for instance to %GC bias) was not studied as it was not modelled by the chosen metagenomic simulator, although some less versatile simulators include such feature . First, we evaluated to which degree the existence of repeats regions within the genomes may have translated into low maximum contig coverage. We then analyzed the genomes in our evolved mock community for long repeats regions and found that the longest repeats region did not span more than 400 bp. Hence, the existence of repeats within the original genomes was not likely the cause of low maximum contig coverage.
Next, we studied whether or not we had retrieved at least one almost complete genome for the intra-species groups (α = 0.0025), which exhibit the lowest maximum contig coverage (Figure 2). For groups of two siblings, the largest maximum contig coverage corresponded to the genome with highest coverage (most abundant). However, only for one out of ten existing intra-species (α = 0.0025) sibling pairs did the maximum contig coverage value surpass the 95% threshold, and this group was characterized by the lowest coverage by others values (see below) of the ten groups. On the other hand, the groups with eight siblings showed a strikingly different behavior, with seven out of ten groups having at least one member surpassing the 95% maximum contig coverage threshold. For six of these groups, the genome showing greatest maximum contig coverage was not the most abundant, and maximum contig coverage for the other sibling were very low. The exception as before corresponded to a group with very low coverage by others values, where all 8 siblings surpassed the 95% threshold. Subsequent principal coordinates analyses based on pairwise nucleotide similarities of the six groups showing a similar behavior revealed that in five out of six cases the reconstructed genome represented a central genome within the group (Additional file 1: Figure S1).
Estimating community diversity
We used PHACCS and CatchAll to estimate the number of viral species in the evolved simulated community using contig spectra derived from sub-samples of the Illumina GAIIx dataset. Since we observed that the assemblers were not able to resolve all cases between highly similar genomes we expected richness estimates ca. 200 – 300. However, both methods over-estimated the number of species (Additional file 1: Table S3), with CatchAll being always 1 to 3 orders of magnitudes off compared to PHACCS, and richness estimates increased with sequencing depth.
Our ability to accurately reconstruct the viral genomes within a deep metagenomic dataset represents a black box. Assembly success is usually described in terms of rather subjective proxies such as assembly statistics, or the number of apparently complete genomes recovered. However, as we remain ignorant of the richness, structure, and genetic diversity of the community it is not possible to work with more objective measures of success. This is especially true for viruses that have no ribosomal genes to help us estimate its composition, richness and structure. In this study, we have shed some light on this black box’s internal functions and mechanisms.
The assembly statistics derived from each PAD combination revealed that both CAMERA and IDBA_UD outperformed the popular commercial assembler CLC in terms of both total information in contigs (Sum) and N50, likely due to the fact that they have been specifically developed for metagenomic studies. For the high coverage datasets, both 454 and Illumina performed similarly. However, for the low coverage scenarios all Illumina platforms greatly outperformed 454, with Hiseq performing slightly better than Miseq. This result was apparent in the empirical metagenomic datasets, where Illumina was able to recover 26 times the amount of bases in contigs of 454.
Since single point errors in contigs can alter gene calling and predicted translated proteins, contig accuracy is important. The contigs and scaffolds produced by all PAD combinations were generally highly accurate, but CLC exhibited a lower accuracy that could substantially compromise annotation efforts. Concerning chimericity, no major differences between assemblers were observed. However, the 454 platform produced less chimeric contigs and a larger fraction of non-chimeric contigs than the Illumina platforms. Moreover, Miseq also produced a larger fraction of perfectly non-chimeric contigs compared to Hiseq. Both results indicate that read length correlates negatively with chimericity.
Overall, scaffolding marginally improved both contig length statistics and accuracy, with the tradeoff of increasing contig chimericity, which is consistent with previous results . Both overall and maximum contig coverage derived from CLC assemblies were lower than for their CAMERA and IDBA_UD counterparts. 454 produced lower yet relatively similar values when compared to Illumina GAIIx for the high coverage datasets. However, its ability to reconstruct the original genomes (overall and maximum contig coverage) was much reduced in the low coverage dataset, when compared to all Illumina platforms.
Interestingly, while some large genomes were recovered completely as a single contig, we could not obtain contigs spanning a large proportion of many original genomes despite the high coverage values attained. This issue was further explored using the Illumina Hiseq dataset, which showed that genome length, sequencing coverage, or the existence of repeat regions had little effect on genome reconstruction. However, the co-existence of highly similar genomes within the community had a strong effect on genome reconstruction. It is noteworthy that usually, for pairs of highly similar genomes, the largest maximum contig coverage corresponded to the most abundant genome. However, in the more complex cases with eight intra-species genomes, a single genome was normally reconstructed. Instead of representing the most abundant species, the reconstructed genome tended to be that showing closest similarity to its siblings. In this sense, it seems that it is not possible to ascertain whether a retrieved genome corresponds to a single strain, or rather to a group of highly similar strains. One could take steps to study the intra-population structure of a given genome, e.g. by re-mapping onto it all metagenomic reads and then evaluating its nucleotide diversity, or using more sophisticated software [18, 19].
Despite advances in sequencing technologies and bioinformatic tools, the assembly of viral metagenomes thus remains incomplete. Since all genomes cannot be reconstructed, even with very high sequencing coverage, methods to estimate viral diversity in deep metagenomes remain highly relevant. PHACCS and CatchAll are sophisticated tools that model community diversity based on the assembly of metagenomes, while clustering is a simple method that provides a proxy for viral richness and has been used to generate rarefaction curves. Using 100 simulated viromes, we showed that CatchAll results are orders of magnitude higher than expected, due to its underlying assumption that each contig belongs to a different viral genome. The use of a discounted model was not sufficient to alleviate this fundamental limitation. By its nature, clustering creates many clusters for each genome and its richness estimates can thus be considered upper bounds of viral richness. However, correcting clustering results for genome length dramatically improved richness estimates, which suggest a new direction for modeling viral diversity in metagenomes. PHACCS was the most accurate of the tested tools, reaching a 44.5% ±33.0 relative error. Note that these results were obtained using the more thorough ‘cha’ mode and providing the exact average genome length in the community, which is rarely known with precision. We suggest that assuming an average for the viral genome length limits the absolute accuracy of PHACCS. However, PHACCS richness and evenness estimates were consistent with community structure, which makes PHACCS well suited for the estimation of viral diversity of communities in comparative studies.
The amount of metagenomic information available for genome reconstruction had a profound effect on assembly success, as evidenced by the low performance demonstrated by the low coverage and empirical 454 datasets. However, despite the fact that all Illumina combinations tested likely presented per-genome coverage in excess of what seems to be needed for accurate genome reconstruction, they were unable to recover all genomes in the community, because of the presence within the community of genomes bearing highly similar regions. The assemblers were nevertheless generally able to recover at least a single genome from a highly similar group of genomes. Overall, we recommend the use of Illumina platforms such as HiSeq and MiSeq, bearing in mind that oversequencing may be detrimental,and a metagenomic-aware assembler such as IDBA-UD for the assembly of viral metagenomes; this PAD combinations provide good value for money, and yield long, accurate contigs.
Deep metagenomic studies can be complemented by analyses of community diversity, some of which are based on contig assembly. While our simulation results argue against the use of CatchAll for this purpose, PHACCS was shown to be well suited for comparative work. Clustering might also prove a worthy alternative in the future, provided average genome length in the community is taken into account.
The evolved mock community
To evaluate PAD combinations, we devised an in silico simulated community containing 300 genomes from six DNA virus families (both single and double-stranded) most commonly found in freshwater environments. Each family was given the same relative abundance and contained 40 evolved genomes (details below) (Additional file 1: Table S4).
The rank-abundance curve for the genomes of each family followed a power-law distribution (Additional file 1: Figure S4), with few abundant genomes and a long tail of low-abundance species as seen in many environmental viral communities  (Additional file 2). Within this distribution, we intercalated one original genome for each four evolved genomes to assess genome reconstruction without sibling noise.
We used the GemSIM simulator  to generate mock metagenomes with empirically derived, sequence-context based error mimicking the widespread sequencing platforms Roche 454 FLX + and Illumina GAIIx, Hiseq and Miseq. The GemSIM error models were derived from data obtained from our local sequencing center (454 FLX+, Hiseq, and Miseq) or published data (GAIIx)  (Additional file 3).
Sequencing depths were chosen based on: i) the amount of sequences being reported for environmental viromes, and ii) existing sequencing costs associated with each technology (Reagent cost/Mb) . We thus produced high-coverage datasets containing 200,000 reads for Roche 454 FLX + (ca. 700 bp in length) and 26,500,000 read pairs for Illumina GAIIx (150 bp in length). In addition, we also produced low-coverage datasets representing 10% of their high-coverage counterparts, and complemented them with 1,657,913 MiSeq and 9,201,420 HiSeq read pairs (based on Reagent cost/Mb values provided by our local sequencing center).
To simulate realistic communities, in which closely related viral species and strains co-exist, we used MetaSim’s population sampler, which produces evolved sequences based on a source genome (Additional file 4) and a given evolutionary tree . We employed the default tree-simulation parameters but included two different nucleotide transition rates α (0.01 and 0.0025), and generated groups of two and eight sibling genomes for each transition rate. An initial exploration on the outcome of chosen transition rates on a single genome to produce 10 siblings resulted in average nucleotide identities (ANIs) ranging 0.786 – 0.947 (0.85 ± 0.04) and 0.935 – 0.985 (0.96 ± 0.01) for α 0.01 and 0.0025 respectively. The chosen α levels approximately correspond to intra and inter-species siblings based on the fact that 95% ANI can be considered a rough boundary for species in Bacteria. Nevertheless, we acknowledge that this value may represent a rather artificial boundary with viruses. For each viral family we introduced five categories of genomes; 10 unmodified genomes, two pairs of sibling genomes produced using α at 0.01, two pairs of sibling genomes produced using α at 0.0025, two groups of eight sibling genomes produced using α at 0.01, and two groups of eight sibling genomes produced using α at 0.0025. Hence, we produced 40 evolved genomes from 8 original genomes for each family by generating duplicates of the four combinations (2 X α levels [0.01/0.0025] X group sizes [2/8]).
The Limnopolar empirical community
We compared the assembly results derived from our simulated community to an actual viral metagenome obtained from Lake Limnopolar and composed mainly of unknown and ssDNA viruses . We maintained the costs associated to each technology and compared 23,249 (average 220 bp in length) Roche 454 sequences against 1,989,155 (75 bp in length) Illumina GAIIx sequence pairs. The 454 virome was assembled with the CAMERA assembler and the Illumina version with IDBA_UD.
A series of filtering and trimming steps were undertaken to remove low quality reads and bases using the prinseq-lite software  (trim_qual_right 28, trim_qual_type mean, trim_qual_window 5). Additionally, Lake Limnopolar 454 reads were dereplicated with prinseq and sequences shorter than 50 bp removed. The resulting reads were assembled into contigs using different assemblers; 454 reads were assembled using the CAMERA-assembler , and CLC Genomics Workbench 6.0 (CLC Inc, Aarhus, Denmark. Trial version). Illumina reads were assembled using CLC and IDBA_UD . CAMERA and CLC were used with default settings, and IDBA_UD with recommended metagenomic settings (mink 20, maxk 120, pre_correction). In all cases, contigs shorter than 500 bp were removed from further analysis.
In order to evaluate the performance of the different PAD combinations we used previously developed analytical strategies for short read metagenomic assembly . We calculated metrics reflecting the extent of genome reconstruction: overall contig coverage, the percentage of each genome covered by all its contigs, and maximum contig coverage, the percentage of each genome covered by its longest contig. First, contigs were aligned to the input genomes using nucmer (c 30, l 15) . Then, the results were filtered allowing only ≥95% identity and ≥100 bp length alignments. For each contig, only the best-scoring alignments to a genome was allowed. Finally, a dedicated python script recorded the alignment position information for each contig, with the collection of all such positions for a given genome representing its contig coverage, expressed as a percentage of the total genome length. The same alignment file produced for the contig coverage calculations was parsed using an in-house script to obtain the proportion of the original genome’s length covered by the longest aligning contig (maximum contig coverage). To assess which particular PAD combination produced the best maximum and overall contig coverage, we conducted paired Mann–Whitney tests with R.
Where p i is the proportion of mapped reads arising from genome i.
The level to which each contig accurately represents the information contained within the original genomes was assessed using a contig accuracy score, defined as the identity of the local alignment multiplied by the ratio of alignment length to contig length. Contig accuracy values were also obtained by processing the filtered nucmer files with a dedicated script.
Both genome and community characteristics may impact our ability to assemble a particular genome from a complex community. We have used the PAD combination showing best overall performance (Hiseq) to assess the effect on maximum contig coverage caused by genome length, relative abundance, existence of closely-related genomes in the community, and repeats regions within the genome. Due to the interaction between the circular nature of many genomes and chosen alignment thresholds genomes shorter than 1700 nt were removed from further analysis as there is the possibility that their maximum contig coverage may have been slightly underestimated.
In most instances, the assembly only recovered one of the genomes (maximum contig coverage >95%) from the groups of eight intra-species genomes (α = 0.0025). We studied the possible effect of intra-group genetic similarity on genome recovery by obtaining pairwise nucleotide similarities between sibling genomes, which were then analyzed by principal coordinate analysis using the dudi.pco function of the ade4 package  in R.
The existence within a community of genomes bearing highly similar regions may also hamper the reconstruction of a genome. For instance, the reads originating from a particular genome might be used in the reconstruction of other genomes with OLC assemblers, or it may lead to graph structures not properly resolved with de Bruijn graph-based assemblers. To analyze this aspect, we mapped all metagenomic reads to each genome using bowtie2 with default parameters but allowing all above-threshold hits. Then we recorded the number of metagenomic reads mapping to each genome minus the number of reads originating from each genome, and normalized for differing genome sizes dividing by genome length, obtaining a coverage by others parameter. Finally, we used the ratio of coverage by others to coverage as a proxy to assess possible genome reconstruction bias produced by this sort of interference.
The hundred mock communities
We generated 100 in silico mock metagenomes with different community structures to benchmark the accuracy of viral diversity estimation methods more thoroughly. To this end, >2,200 complete genomes from the NCBI RefSeq database  were used as reference for the Grinder read simulator . Each metagenome contained 200,000 reads designed to follow the length (~450 bp) and errors typical of 454 GS-FLX Ti pyrosequencing. The metagenomes followed a power law rank-abundance and were classified in four community structures, varying in richness (100 or 1,000 species) and evenness (most abundant genome at 2.0 or 25% relative abundance). We let Grinder automatically randomly generate 25 metagenomes of each type (total of 100 metagenomes) for statistical replication.
Estimation of viral diversity
Using the GAIIx evolved mock metagenome, we determined the effect of metagenome size on estimated community viral diversity. We produced subsets of this metagenome containing 24,658, 248,525 and 2,485,933 reads. Their contig spectra was calculated with Circonspect  using the Minimo assembler  employing all reads and default parameters (98% identity, 35 bp overlap). Then, both PHACCS and CatchAll were employed with their default values to fit the contig spectra using all available models.
Using the hundred 454 mock metagenomes, we calculated the accuracy of viral diversity estimates obtained using PHACCS, CatchAll and UCLUST as a function of community structure. Contig spectra were generated with Circonspect at 3X fold coverage using Minimo (and default options). These contig spectra were provided to CatchAll and the estimated richness using the best model and the best discounted model were recorded. PHACCS was also given these contig spectra to estimate the viral richness and evenness, letting its optimization algorithm look for the best fit using the more exhaustive ‘cha’. For the clustering method, the entire metagenome was used as input to UCLUST (cluster_smallmem program, both strands, minimum identity of 98, 90 and 75%) and the number of resulting viral clusters was calculated. In an attempt to improve their accuracy, the UCLUST and CatchAll estimates were divided by the average genome length minus average read length.
DAC and AA conceived the study. DAC carried-out the analyses, with the exception of the hundred mock communities generation and their diversity analysis contributed by FA. DAC wrote the manuscript with input from the co-authors. All authors read and approved the final manuscript.
We thank the Genomics unit Antonia Martín Gallardo at Parque Científico de Madrid for providing the necessary sequencing runs to construct the error models. This work was funded in part by the Spanish Ministry of Science and Innovation grant CTM2011-15091-E/ANT. Daniel Aguirre de Cárcer was supported by the Marie Curie International Incoming Fellow grant PIIF-GA-2012-328287. Florent Angly was supported by the Australian Research Council’s Discovery Early Career Research Award DE120101213. We acknowledge support of the publication fee by the CSIC Open Access Publication Support Initiative through its Unit of Information Resources for Research (URICI).
- Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, Chan AM, Haynes M, Kelley S, Liu H, Mahaffy JM, Mueller JE, Nulton J, Olson R, Parsons R, Rayhawk S, Suttle CA, Rohwer F: The Marine Viromes of Four Oceanic Regions. PLoS Biol. 2006, 4: e368-10.1371/journal.pbio.0040368.PubMed CentralPubMedView ArticleGoogle Scholar
- Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, Lapidus A, Grigoriev I, Richardson P, Hugenholtz P, Kyrpides NC: Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007, 4: 495-500. 10.1038/nmeth1043.PubMedView ArticleGoogle Scholar
- Pignatelli M, Moya A: Evaluating the fidelity of de novo short read metagenomic assembly using simulated data. PLoS One. 2011, 6: 23-View ArticleGoogle Scholar
- Charuvaka A, Rangwala H: Evaluation of short read metagenomic assembly. BMC Genomics. 2011, 12: 1471-2164.View ArticleGoogle Scholar
- Mende DR, Waller AS, Sunagawa S, Järvelin AI, Chan MM, Arumugam M, Raes J, Bork P: Assessment of Metagenomic Assembly Using Simulated Next Generation Sequencing Data. PLoS One. 2012, 7: e31386-10.1371/journal.pone.0031386.PubMed CentralPubMedView ArticleGoogle Scholar
- Luo C, Tsementzi D, Kyrpides N, Read T, Konstantinidis KT: Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PLoS One. 2012, 7: 10-View ArticleGoogle Scholar
- Luo C, Tsementzi D, Kyrpides NC, Konstantinidis KT: Individual genome assembly from complex community short-read metagenomic datasets. ISME J. 2012, 6: 898-901. 10.1038/ismej.2011.147.PubMed CentralPubMedView ArticleGoogle Scholar
- Vazquez-Castellanos JF, Garcia-Lopez R, Perez-Brocal V, Pignatelli M, Moya A: Comparison of different assembly and annotation tools on analysis of simulated viral metagenomic communities in the gut. BMC Genomics. 2014, 15: 1471-2164.View ArticleGoogle Scholar
- Solonenko S, Ignacio-Espinoza J, Alberti A, Cruaud C, Hallam S, Konstantinidis K, Tyson G, Wincker P, Sullivan M: Sequencing platform and library preparation choices impact viral metagenomes. BMC Genomics. 2013, 14: 320-10.1186/1471-2164-14-320.PubMed CentralPubMedView ArticleGoogle Scholar
- Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010, 26: 2460-2461. 10.1093/bioinformatics/btq461.PubMedView ArticleGoogle Scholar
- Reyes A, Haynes M, Hanson N, Angly FE, Heath AC, Rohwer F, Gordon JI: Viruses in the faecal microbiota of monozygotic twins and their mothers. Nature. 2010, 466: 334-338. 10.1038/nature09199.PubMed CentralPubMedView ArticleGoogle Scholar
- Roux S, Faubladier M, Mahul A, Paulhe N, Bernard A, Debroas D, Enault F: Metavir: a web server dedicated to virome analysis. Bioinformatics. 2011, 27: 3074-3075. 10.1093/bioinformatics/btr519.PubMedView ArticleGoogle Scholar
- Angly F, Rodriguez-Brito B, Bangor D, McNairnie P, Breitbart M, Salamon P, Felts B, Nulton J, Mahaffy J, Rohwer F: PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information. BMC Bioinformatics. 2005, 6: 41-10.1186/1471-2105-6-41.PubMed CentralPubMedView ArticleGoogle Scholar
- Allen HK, Bunge J, Foster JA, Bayles DO, Stanton TB: Estimation of viral richness from shotgun metagenomes using a frequency count approach. Microbiome. 2013, 1: 2049-2618.View ArticleGoogle Scholar
- Bunge J, Woodard L, Böhning D, Foster JA, Connolly S, Allen HK: Estimating population diversity with CatchAll. Bioinformatics. 2012, 28: 1045-1047. 10.1093/bioinformatics/bts075.PubMed CentralPubMedView ArticleGoogle Scholar
- Nagarajan N, Pop M: Sequence assembly demystified. Nat Rev Genet. 2013, 14: 157-167. 10.1038/nrg3367.PubMedView ArticleGoogle Scholar
- Hu X, Yuan J, Shi Y, Lu J, Liu B, Li Z, Chen Y, Mu D, Zhang H, Li N, Yue Z, Bai F, Li H, Fan W: pIRS: Profile-based Illumina pair-end reads simulator. Bioinformatics. 2012, 28: 1533-1535. 10.1093/bioinformatics/bts187.PubMedView ArticleGoogle Scholar
- Astrovskaya I, Tork B, Mangul S, Westbrooks K, Mandoiu I, Balfe P, Zelikovsky A: Inferring viral quasispecies spectra from 454 pyrosequencing reads. BMC Bioinformatics. 2011, 12 Supp 6: S1-View ArticleGoogle Scholar
- Zagordi O, Geyrhofer L, Roth V, Beerenwinkel N: Deep sequencing of a genetically heterogeneous sample: local haplotype reconstruction and read error correction. J Comput Biol. 2010, 17: 417-428. 10.1089/cmb.2009.0164.PubMedView ArticleGoogle Scholar
- Hoffmann KH, Rodriguez-Brito B, Breitbart M, Bangor D, Angly F, Felts B, Nulton J, Rohwer F, Salamon P: Power law rank–abundance models for marine phage communities. FEMS Microbiol Lett. 2007, 273: 224-228. 10.1111/j.1574-6968.2007.00790.x.PubMedView ArticleGoogle Scholar
- McElroy K, Luciani F, Thomas T: GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics. 2012, 13: 74-10.1186/1471-2164-13-74.PubMed CentralPubMedView ArticleGoogle Scholar
- Minoche A, Dohm J, Himmelbauer H: Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems. Genome Biol. 2011, 12: R112-10.1186/gb-2011-12-11-r112.PubMed CentralPubMedView ArticleGoogle Scholar
- Glenn TC: Field guide to next-generation DNA sequencers. Mol Ecol Resour. 2011, 11: 759-769. 10.1111/j.1755-0998.2011.03024.x.PubMedView ArticleGoogle Scholar
- Richter DC, Ott F, Auch AF, Schmid R, Huson DH: MetaSim—A Sequencing Simulator for Genomics and Metagenomics. PLoS One. 2008, 3: e3373-10.1371/journal.pone.0003373.PubMed CentralPubMedView ArticleGoogle Scholar
- Konstantinidis KT, Ramette A, Tiedje JM: The bacterial species definition in the genomic era. Philos Trans R Soc Lond B Biol Sci. 2006, 361: 1929-1940. 10.1098/rstb.2006.1920.PubMed CentralPubMedView ArticleGoogle Scholar
- Lopez-Bueno A, Tamames J, Velazquez D, Moya A, Quesada A, Alcami A: High diversity of the viral community from an Antarctic lake. Science. 2009, 326: 858-861. 10.1126/science.1179287.PubMedView ArticleGoogle Scholar
- Schmieder R, Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011, 27: 863-864. 10.1093/bioinformatics/btr026.PubMed CentralPubMedView ArticleGoogle Scholar
- Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M: CAMERA: A Community Resource for Metagenomics. PLoS Biol. 2007, 5: e75-10.1371/journal.pbio.0050075.PubMed CentralPubMedView ArticleGoogle Scholar
- Peng Y, Leung HC, Yiu SM, Chin FY: IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012, 28: 1420-1428. 10.1093/bioinformatics/bts174.PubMedView ArticleGoogle Scholar
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol. 2004, 5: 30-10.1186/gb-2004-5-5-r30.View ArticleGoogle Scholar
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9: 357-359. 10.1038/nmeth.1923.PubMed CentralPubMedView ArticleGoogle Scholar
- Dray S: The ade4 package: implementing the duality diagram for ecologists. J Stat Softw. 2007, 22: 1-View ArticleGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005, 33: D501-D504. 10.1093/nar/gki476.PubMed CentralPubMedView ArticleGoogle Scholar
- Angly FE, Willner D, Rohwer F, Hugenholtz P, Tyson GW: Grinder: a versatile amplicon and shotgun sequence simulator. Nucleic Acids Res. 2012, 40: e94-10.1093/nar/gks251.PubMed CentralPubMedView ArticleGoogle Scholar
- Treangen TJ, Sommer DD, Angly FE, Koren S, Pop M: Next generation sequence assembly with AMOS. Curr Protoc Bioinformatics. 2011, 11: 11-18.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.