Replication is fundamental to rigorous experimental design, but it is only now becoming financially feasible for metagenomic studies [26, 27]. Here we examined replicate metagenomes across varied DNA input amounts, library preparation procedures, and sequencing platforms.
Low-input DNA library success depends on adaptor ligation
While all ≥1,000 ng DNA libraries were successful, environmental samples, particularly viral ones, routinely yield <1 ng of DNA. Libraries constructed from ≤100 ng of DNA were successful using the linker-amplification protocol for 454, whereas Illumina libraries failed or were low-quality in Experiment 1 but not in Experiment 2. The two experiments used separate protocols, both optimized for recovery from column purification steps, but with different template:adaptor ratios in the ligation: Experiment 1 used 170:1, while Experiment 2 used 22:1 for 10 ng of starting DNA. Thus, low-input DNA libraries require adjusted adaptor:template ratios during ligation (see the Genoscope protocol for guidelines).
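Because ligation ratios are molar, converting a DNA mass into the molar amounts behind ratios like those above depends on both mass and mean fragment length. A minimal sketch of that conversion, using the standard ~650 g/mol per base pair; all masses and lengths below are hypothetical examples, not values from these protocols:

```python
def pmol_dsdna(mass_ng: float, length_bp: float) -> float:
    """Picomoles of double-stranded DNA molecules for a given mass and
    mean length, assuming ~650 g/mol per base pair."""
    return mass_ng * 1e3 / (length_bp * 650.0)

# Hypothetical example: 10 ng of 300 bp template vs 5 ng of a 60 bp adaptor duplex
template_pmol = pmol_dsdna(10, 300)   # ~0.051 pmol
adaptor_pmol = pmol_dsdna(5, 60)      # ~0.128 pmol
print(f"adaptor:template = {adaptor_pmol / template_pmol:.1f}:1")
```

Working in pmol rather than ng is what allows an adaptor:template ratio to be held constant as the input mass drops.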
Presence of library amplification drives bias
Two amplification reactions are common in metagenomic sample preparations. The first, library amplification, increases input DNA to offset library preparation losses from purification, size selection, and quality titrations. This adaptor-mediated amplification step is used for limiting DNA in 454 (15–25 cycles), but is routinely employed in Ion Torrent (5 cycles) and Illumina (12–16 cycles) to enrich for correctly ligated adaptors. This step can alter overall library %G + C [15, 17, 30]. The second amplification step is specific to the sequencing technology (e.g., emPCR in 454 or Ion Torrent, bridge amplification in Illumina) and is used to improve signal detection. This step should not alter overall library %G + C, but can artificially over-represent sequences [23, 24].
In this study, two library types received no library amplification: unamplified 454 and fosmid libraries. Fosmids had elevated %G + C, which is ascribed to a cloning bias. Among the remaining libraries, we expected a shift toward low %G + C due to the adaptor-mediated amplification step, commonly attributed to inhibitory effects of secondary structures in high %G + C DNA, either during library amplification or downstream emPCR. However, these trends were not observed: in Experiment 1, the unamplified 454 and amplified 1,000 ng Illumina libraries correlated well with one another (r-values > 0.99), but poorly (r-values < 0.9) with the amplified (18 cycles) 10 ng Illumina libraries. This difference appears to be driven by a loss of low %G + C reads relative to the ≥1,000 ng libraries, which may implicate low-input DNA libraries as more sensitive to loss of low %G + C reads, either during gel extraction heat steps or through preferential fragmentation by heating. A possible improvement over gel extraction is Sage Science’s Pippin Prep (tested with 65 ng of DNA, see Figure 2B in ref.), which avoids heating. Heat during fragmentation is avoidable with Covaris acoustic shearing. Both techniques also minimize contamination, which is crucial for DNA-limited libraries.
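The %G + C comparisons above reduce to correlating per-library %G + C read profiles. A minimal sketch of that calculation, assuming reads are available as plain sequence strings; the bin width and toy reads are illustrative only:

```python
from statistics import mean, pstdev

def gc_profile(reads, bins=20):
    """Fraction of reads falling into each %G + C bin (bins span 0-100%)."""
    hist = [0] * bins
    for r in reads:
        gc = 100.0 * (r.count("G") + r.count("C")) / len(r)
        hist[min(int(gc * bins / 100), bins - 1)] += 1
    return [h / len(reads) for h in hist]

def pearson_r(p, q):
    """Pearson correlation between two binned profiles."""
    mp, mq = mean(p), mean(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    return cov / (len(p) * pstdev(p) * pstdev(q))

# Toy reads standing in for two libraries (hypothetical sequences)
lib_a = ["ATGCATGCAT", "GGGCCCATAT", "ATATATGCGC"]
lib_b = ["ATGCATGGAT", "GGCCCCATAT", "ATATATGCCC"]
r = pearson_r(gc_profile(lib_a), gc_profile(lib_b))
```

Comparing binned profiles rather than raw per-read values is what makes libraries of very different depths commensurable.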
While amplified ≤100 ng metagenomes displayed different %G + C distributions from ≥1,000 ng metagenomes, the amount of amplification only minimally impacted the resulting metagenomes. This was true in Experiment 1, where starting DNA amount and amplification cycling co-varied, as well as in Experiment 2, where these parameters were independent. Fragment competition resulting from cycling conditions is thought to select for higher %G + C and shorter fragments; thus linker-amplification protocols employ tight sizing conditions and %G + C-optimized PCR conditions. Such careful library construction can produce minimally biased (<1.5-fold %G + C variation) viral metagenomes from sub-nanogram amounts of DNA [10, 15]. The %G + C patterns observed in the current larger-scale study were also paralleled in functional analyses (protein cluster mapping) and assembly performance. This suggests that systematically prepared linker-amplified metagenomes derived from variable input DNA amounts are quantitatively comparable.
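One way to make a fold-variation criterion like the <1.5-fold figure concrete is as the largest per-bin ratio between two normalized %G + C profiles. A short sketch under that assumption; the profile numbers are invented for illustration:

```python
def max_fold_variation(p, q, floor=1e-6):
    """Largest per-bin fold difference between two %G + C profiles
    (each a list of bin fractions summing to 1); bins empty in either are skipped."""
    folds = [max(a, b) / min(a, b) for a, b in zip(p, q) if a > floor and b > floor]
    return max(folds) if folds else 1.0

# Invented 3-bin profiles for two libraries
fold = max_fold_variation([0.50, 0.30, 0.20], [0.40, 0.35, 0.25])
print(fold)  # 1.25 here, within a <1.5-fold threshold
```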
Some caution is warranted for high-throughput transposon-based library preparation options such as Nextera. Specifically, Experiment 2 revealed that standard libraries prepared from limiting DNA under varied conditions were relatively invariant, whereas the transposon-based protocol led to divergent %G + C and protein cluster profiles for metagenomes from both stations. While these deviations were statistically significant (90% bootstrap clustering in Figures 4 and 5), they were minor in magnitude relative to other treatment effects observed here. Such a %G + C bias in Nextera library preps is not entirely surprising, as previous work demonstrated reduced coverage in both high and low %G + C regions of virus genomes, presumably due to non-random transposition. New transposition methods should be evaluated carefully if their eventual products require strictly unbiased representation of input DNA.
Finally, while not investigated here, the polymerases used in amplification can alter metagenomes. Phi29 polymerase, for example, introduces stochastic and systematic biases that can impact resulting coverage, while some high-fidelity polymerases (e.g., TAKARA) enrich for rare sequences and others (e.g., PfuTurbo) do not [11, 15]. In Experiment 1, the ≥1,000 ng libraries differed only minimally from each other despite employing different polymerases across sequencing platforms. These polymerase-specific effects would depend on protocol particulars (e.g., PCR cycler settings and additives) [17, 30] and the underlying %G + C distribution (particularly for <20% or >80% G + C fragments) of the DNA to be amplified. Future work to empirically determine the impact of polymerase choice on metagenomes derived from a wider range of %G + C than employed here would be informative.
Duplicates vary by input DNA, amplification, technology
Duplicated reads are problematic in quantitative applications because they can be real or artificial [23, 24, 35, 36]. Here, Experiment 1’s true distribution of duplicates is presumably represented by the first cluster (which includes the unamplified 454 libraries), apart from the artificial duplicates discussed below. By comparison, metagenomes from the second cluster contained highly duplicated artificial reads that reduced library complexity during amplification. The last cluster, which included amplified 454, as well as one Illumina and two Ion Torrent metagenomes, had low levels of duplication. For the 454 libraries, this could be due to the diversifying effects of the linker amplification process, but it is harder to explain this trend in the Ion Torrent metagenomes or to find a process that ties low library amplification in the 100 ng Illumina metagenome to lower duplication levels. Artificial duplicates in Illumina libraries were only an issue in the problematic 10 ng library, where 40% of the reads were high-frequency, predominantly artificial duplicates. Further study is required to determine the mechanisms that generate artificial duplicates in Illumina data.
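One common heuristic for flagging candidate artificial duplicates is to group reads by an identical 5' prefix, since independently sheared natural fragments rarely share exact start points. A minimal sketch of that approach; the prefix length and copy threshold are arbitrary choices, not values from this study:

```python
from collections import Counter

def flag_duplicate_prefixes(reads, k=20, min_copies=2):
    """Group reads by their first k bases; shared exact 5' starts are a
    common heuristic for artificial (amplification-derived) duplicates."""
    prefixes = Counter(r[:k] for r in reads if len(r) >= k)
    return {p: n for p, n in prefixes.items() if n >= min_copies}

def duplication_rate(reads, k=20):
    """Fraction of reads that are extra copies within a duplicate group."""
    extra = sum(n - 1 for n in flag_duplicate_prefixes(reads, k).values())
    return extra / len(reads) if reads else 0.0

# Toy reads (hypothetical): three share an identical 20 bp start
reads = ["A" * 25, "A" * 25, "A" * 20 + "TTTTT", "C" * 25]
rate = duplication_rate(reads)
```

Prefix-based grouping cannot by itself separate real from artificial duplicates; it only identifies candidates whose frequency distribution can then be examined, as done for the clusters above.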
Sequencing technologies produce comparable output
While the metagenomes here were derived from three very different ocean viral communities, the range of %G + C was not extreme. Given that, sequencing technology is not a major factor impacting ocean viral metagenomes, which is consistent with previous microbial metagenomic studies. However, read length can influence many downstream applications, from assembly efforts to functional identification of genes [37, 38]. Of the widely used next-generation technologies, 454 currently has the longest read length at 800 bp, with paired-end Illumina capable of 250+ bp. However, emerging nanopore technologies are likely to be truly transformative. Details are not yet public, but these technologies promise longer reads, direct observation of fragment sequences, and minimal library preparation, enabling low input DNA applications.