The evolved mock community
To evaluate PAD combinations, we devised an in silico simulated community containing 300 genomes from six DNA virus families (both single and double-stranded) most commonly found in freshwater environments. Each family was given the same relative abundance and contained 40 evolved genomes (details below) (Additional file 1: Table S4).
The rank-abundance curve for the genomes of each family followed a power-law distribution (Additional file 1: Figure S4), with few abundant genomes and a long tail of low-abundance species as seen in many environmental viral communities [20] (Additional file 2). Within this distribution, we intercalated one original genome for each four evolved genomes to assess genome reconstruction without sibling noise.
We used the GemSIM simulator [21] to generate mock metagenomes with empirically derived, sequence-context based error mimicking the widespread sequencing platforms Roche 454 FLX + and Illumina GAIIx, Hiseq and Miseq. The GemSIM error models were derived from data obtained from our local sequencing center (454 FLX+, Hiseq, and Miseq) or published data (GAIIx) [22] (Additional file 3).
Sequencing depths were chosen based on: i) the amount of sequences being reported for environmental viromes, and ii) existing sequencing costs associated with each technology (Reagent cost/Mb) [23]. We thus produced high-coverage datasets containing 200,000 reads for Roche 454 FLX + (ca. 700 bp in length) and 26,500,000 read pairs for Illumina GAIIx (150 bp in length). In addition, we also produced low-coverage datasets representing 10% of their high-coverage counterparts, and complemented them with 1,657,913 MiSeq and 9,201,420 HiSeq read pairs (based on Reagent cost/Mb values provided by our local sequencing center).
Evolved genomes
To simulate realistic communities, in which closely related viral species and strains co-exist, we used MetaSim’s population sampler, which produces evolved sequences based on a source genome (Additional file 4) and a given evolutionary tree [24]. We employed the default tree-simulation parameters but included two different nucleotide transition rates α (0.01 and 0.0025), and generated groups of two and eight sibling genomes for each transition rate. An initial exploration on the outcome of chosen transition rates on a single genome to produce 10 siblings resulted in average nucleotide identities (ANIs) ranging 0.786 – 0.947 (0.85 ± 0.04) and 0.935 – 0.985 (0.96 ± 0.01) for α 0.01 and 0.0025 respectively. The chosen α levels approximately correspond to intra and inter-species siblings based on the fact that 95% ANI can be considered a rough boundary for species in Bacteria[25]. Nevertheless, we acknowledge that this value may represent a rather artificial boundary with viruses. For each viral family we introduced five categories of genomes; 10 unmodified genomes, two pairs of sibling genomes produced using α at 0.01, two pairs of sibling genomes produced using α at 0.0025, two groups of eight sibling genomes produced using α at 0.01, and two groups of eight sibling genomes produced using α at 0.0025. Hence, we produced 40 evolved genomes from 8 original genomes for each family by generating duplicates of the four combinations (2 X α levels [0.01/0.0025] X group sizes [2/8]).
The Limnopolar empirical community
We compared the assembly results derived from our simulated community to an actual viral metagenome obtained from Lake Limnopolar and composed mainly of unknown and ssDNA viruses [26]. We maintained the costs associated to each technology and compared 23,249 (average 220 bp in length) Roche 454 sequences against 1,989,155 (75 bp in length) Illumina GAIIx sequence pairs. The 454 virome was assembled with the CAMERA assembler and the Illumina version with IDBA_UD.
Metagenomic assembly
A series of filtering and trimming steps were undertaken to remove low quality reads and bases using the prinseq-lite software [27] (trim_qual_right 28, trim_qual_type mean, trim_qual_window 5). Additionally, Lake Limnopolar 454 reads were dereplicated with prinseq and sequences shorter than 50 bp removed. The resulting reads were assembled into contigs using different assemblers; 454 reads were assembled using the CAMERA-assembler [28], and CLC Genomics Workbench 6.0 (CLC Inc, Aarhus, Denmark. Trial version). Illumina reads were assembled using CLC and IDBA_UD [29]. CAMERA and CLC were used with default settings, and IDBA_UD with recommended metagenomic settings (mink 20, maxk 120, pre_correction). In all cases, contigs shorter than 500 bp were removed from further analysis.
Contig analysis
In order to evaluate the performance of the different PAD combinations we used previously developed analytical strategies for short read metagenomic assembly [4]. We calculated metrics reflecting the extent of genome reconstruction: overall contig coverage, the percentage of each genome covered by all its contigs, and maximum contig coverage, the percentage of each genome covered by its longest contig. First, contigs were aligned to the input genomes using nucmer (c 30, l 15) [30]. Then, the results were filtered allowing only ≥95% identity and ≥100 bp length alignments. For each contig, only the best-scoring alignments to a genome was allowed. Finally, a dedicated python script recorded the alignment position information for each contig, with the collection of all such positions for a given genome representing its contig coverage, expressed as a percentage of the total genome length. The same alignment file produced for the contig coverage calculations was parsed using an in-house script to obtain the proportion of the original genome’s length covered by the longest aligning contig (maximum contig coverage). To assess which particular PAD combination produced the best maximum and overall contig coverage, we conducted paired Mann–Whitney tests with R.
The accuracy of assemblies was established using a chimericity and contig accuracy metrics. Chimeric contigs are defined as contigs formed by reads derived from more than one genome. However, due to the short length of reads issued from high-throughput sequencing platforms and the existence of closely related viral genomes, chimericity does not necessarily mean lack of correspondence between a contig and its source genome. Reads were re-mapped to contigs using the bowtie2 read aligner [31] reporting only best hits at high stringency (score-min L,0,-0.2). For each contig, we used the counts of reads from each original genome to calculate chimericity, defined here as the entropy of the contigs:
(1)
Where p
i
is the proportion of mapped reads arising from genome i.
The level to which each contig accurately represents the information contained within the original genomes was assessed using a contig accuracy score, defined as the identity of the local alignment multiplied by the ratio of alignment length to contig length. Contig accuracy values were also obtained by processing the filtered nucmer files with a dedicated script.
Genome reconstruction
Both genome and community characteristics may impact our ability to assemble a particular genome from a complex community. We have used the PAD combination showing best overall performance (Hiseq) to assess the effect on maximum contig coverage caused by genome length, relative abundance, existence of closely-related genomes in the community, and repeats regions within the genome. Due to the interaction between the circular nature of many genomes and chosen alignment thresholds genomes shorter than 1700 nt were removed from further analysis as there is the possibility that their maximum contig coverage may have been slightly underestimated.
In most instances, the assembly only recovered one of the genomes (maximum contig coverage >95%) from the groups of eight intra-species genomes (α = 0.0025). We studied the possible effect of intra-group genetic similarity on genome recovery by obtaining pairwise nucleotide similarities between sibling genomes, which were then analyzed by principal coordinate analysis using the dudi.pco function of the ade4 package [32] in R.
The existence within a community of genomes bearing highly similar regions may also hamper the reconstruction of a genome. For instance, the reads originating from a particular genome might be used in the reconstruction of other genomes with OLC assemblers, or it may lead to graph structures not properly resolved with de Bruijn graph-based assemblers. To analyze this aspect, we mapped all metagenomic reads to each genome using bowtie2 with default parameters but allowing all above-threshold hits. Then we recorded the number of metagenomic reads mapping to each genome minus the number of reads originating from each genome, and normalized for differing genome sizes dividing by genome length, obtaining a coverage by others parameter. Finally, we used the ratio of coverage by others to coverage as a proxy to assess possible genome reconstruction bias produced by this sort of interference.
The hundred mock communities
We generated 100 in silico mock metagenomes with different community structures to benchmark the accuracy of viral diversity estimation methods more thoroughly. To this end, >2,200 complete genomes from the NCBI RefSeq database [33] were used as reference for the Grinder read simulator [34]. Each metagenome contained 200,000 reads designed to follow the length (~450 bp) and errors typical of 454 GS-FLX Ti pyrosequencing. The metagenomes followed a power law rank-abundance and were classified in four community structures, varying in richness (100 or 1,000 species) and evenness (most abundant genome at 2.0 or 25% relative abundance). We let Grinder automatically randomly generate 25 metagenomes of each type (total of 100 metagenomes) for statistical replication.
Estimation of viral diversity
Using the GAIIx evolved mock metagenome, we determined the effect of metagenome size on estimated community viral diversity. We produced subsets of this metagenome containing 24,658, 248,525 and 2,485,933 reads. Their contig spectra was calculated with Circonspect [1] using the Minimo assembler [35] employing all reads and default parameters (98% identity, 35 bp overlap). Then, both PHACCS and CatchAll were employed with their default values to fit the contig spectra using all available models.
Using the hundred 454 mock metagenomes, we calculated the accuracy of viral diversity estimates obtained using PHACCS, CatchAll and UCLUST as a function of community structure. Contig spectra were generated with Circonspect at 3X fold coverage using Minimo (and default options). These contig spectra were provided to CatchAll and the estimated richness using the best model and the best discounted model were recorded. PHACCS was also given these contig spectra to estimate the viral richness and evenness, letting its optimization algorithm look for the best fit using the more exhaustive ‘cha’. For the clustering method, the entire metagenome was used as input to UCLUST (cluster_smallmem program, both strands, minimum identity of 98, 90 and 75%) and the number of resulting viral clusters was calculated. In an attempt to improve their accuracy, the UCLUST and CatchAll estimates were divided by the average genome length minus average read length.