Genomic DNA
P. falciparum 3D7 genomic DNA was a gift from Prof Chris Newbold (University of Oxford). Other P. falciparum genomic DNA (PK0076) was a clinical isolate containing ~53% human DNA contamination. The clinical isolate was obtained from the Malaria Genetics Group's Sequencing Sample Repository at the Wellcome Trust Sanger Institute.
Screening for an AT-rich tolerant enzyme
Various commercial enzymes were used to perform a conventional PCR (50 µl total volume) targeted at a 1217 bp locus (Pf3D7_01:55900-57116) with a high AT-content (84%) and a 540 bp locus (Pf3D7_11:1294982-1295521) with a relatively balanced base composition (positive control). Ten ng of P. falciparum 3D7 genomic DNA was used as template for each PCR. Pf_1 and Pf_11_ama1 oligo pairs (see additional file 1, Table S1) were used at a final concentration of 0.3 µM to amplify the high AT-rich and positive control fragments respectively. For each enzyme, two sets of PCRs were performed, one using reagents and conditions provided by the manufacturer (standard conditions) and the other deviating only by inclusion of 60 mM TMAC in the reaction mixture (optimized conditions). Each set of amplification condition was repeated at least twice to confirm the finding. See additional file 1, Table S4 for enzymes used and their catalogue numbers.
Illumina sequencing libraries
Genomic DNA (0.05 - 1 μg in 75 µl TE buffer) from P. falciparum 3D7 or the clinical isolate was sheared for 3 min using a Covaris S2 (Covaris Inc., Woburn, MA).To obtain a fragment-size distribution of ~100 bp to ~300 bp, the settings were: 20% duty cycle, intensity 5, and 200 cycles per burst. Illumina paired-end sequencing libraries were constructed using the NEBNext DNA sample preparation kit (NEB), following standard Illumina sample-preparation protocol with the following slight modifications. End-repair reactions (100 μl) contained 75 µl of sheared DNA sample, 1× T4 DNA-ligase buffer, 1 mM ATP, 0.4 mM dNTPs, 15 units of T4 DNA polymerase, 50 units of Klenow DNA polymerase, 50 units of T4 polynucleotide kinase, and were incubated at 20°C for 45 min. Resulting blunt phosphorylated DNA fragments (end-repaired) were cleaned using the QIAquick PCR purification Kit following the manufacturer's instructions and the DNA eluted in 32 μl of buffer EB. A single A base was added (A-tailing) to the 3' end of the end-repaired DNA fragments in a 50 μl reaction containing 1× Klenow buffer, 0.2 mM dATP, 150 units of Klenow exo- and incubated for 45 min at 37°C. A-tailed DNA fragments were cleaned using QIAquick MiniElute PCR purification Kits following the manufacturer's instructions and DNA eluted in 10 µl EB buffer. Pre-annealed paired-end (PE) adapter oligonucleotides (Illumina) were ligated to the A-tailed fragments in a 50 µl reaction containing 10 µl of DNA sample, 1× Quick T4 DNA-ligase buffer, 10 µl of PE-adapter mixture, 5 µl of Quick T4 DNA ligase (NEB) and incubated at 20°C for 30 min. The ligation reaction was cleaned twice with 1× Agencourt Ampure XP beads (Beckman Coulter). Aliquots were analyzed by running on an Agilent 2100 Bioanalyzer (Agilent Technologies) to determine the size distribution and to check for adapter contamination. Most amplified or non-amplified libraries were prepared twice (starting from same non-sheared genomic DNA) and each technical replicate sequenced independently before data analysis. All replicate data sets were pooled after normalizing and before analysis with the exception of coverage analysis on specific genome loci (presented as box plots, Figure 4i & 4ii) where each replicate was analyzed and shown independently.
Illumina PCR library amplification
All PCR amplifications were performed with an MJ Research Thermo Cycler PTC-225. Illumina PE 1.0 and 2.0 primers or PE 1.0- and 2.0-derived indexing primers were used to amplify 10 ng adapter-ligated library fragments by PCR. From a single pool of adapter-ligated library, aliquots were amplified using standard Illumina PCR reagents or using various alternative polymerases under optimized PCR conditions. Standard Illumina PCR (50 μl) with Phusion polymerase (Thermo) contained 1× Phusion DNA polymerase master mix and 0.4 μM of each primer pair and was amplified with thermocycling conditions of 1 min at 98°C for the initial denaturation, followed by 12 cycles of [10 s at 98°C, 30 s at 65°C, 30 s at 72°C] and a final extension for 5 min at 72°C. PCR (50 μl) with Kapa HiFi (KAPA Biosystems, South Africa) contained 1× Kapa HiFi buffer (containing TMAC) or 1× Kapa buffer B, 0.3 mM of each dNTP, 0.4 μM of each primer pair, 1 unit of Kapa HiFi and was amplified with thermocycling conditions of 1 min at 98°C for the initial denaturation followed by 12 cycles of [10 s at 98°C, 1 min at 65°C] and a final extension for 5 min at 65°C. PCR (50 μl) with Kapa2G Robust (KAPA Biosystems, South Africa) contained 1× Kapa2G buffer B, 0.3 mM dNTPs mix, 0.4 μM of each primer pair, 0.4 units of Kapa2G Robust and was amplified in the presence or absence of 60 mM TMAC with the same thermocycling conditions as Phusion above. PCR (50 μl) with Platinum pfx (Invitrogen) contained 1× Platinum pfx buffer, 0.3 mM dNTPs mix, 0.4 μM of each primer pair, 1 mM MgCl2, 0.8 units of Platinum pfx and was amplified in the presence or absence of 60 mM TMAC with thermocycling conditions of 2 min at 94°C for the initial denaturation followed by 12 cycles of [15 s at 94°C, 30 s at 60°C, 30 s at 65°C] and a final extension for 5 min at 65°C. PCR (50 μl) with AccuPrime Taq HiFi (Invitrogen) contained 1× AccuPrime buffer II, 0.4 μM of each primer pair, 1 unit of AccuPrime Taq HiFi and was amplified in the presence or absence of 60 mM TMAC with the thermocycling conditions used for Platinum pfx. PCR products were purified with 1× Agencourt Ampure XP beads (Beckman Coulter) and eluted in EB buffer (Qiagen). PCR products were analysed by running on an Agilent 2100 Bioanalyzer (Agilent Technologies) to assess amplification. Only successful amplifications were processed for sequencing.
Illumina isothermal library amplification using RPA technology
We used recombinase polymerase amplification (RPA) technology as provided in the TwistAmp™ DNA-amplification kit [11] to amplify 10 ng adapter-ligated library fragments. Illumina PE 1.0 and 2.0 primers or PE 1.0- and 2.0-derived indexing primers (each in 10 μM stock) were used for amplification following kit manufacturer's instructions. Amplifications were performed twice to generate technical replicates. For each sample, a rehydration solution was prepared by mixing 2.4 μl of each primer pair, 29.5 μl of rehydration buffer and 13.2 μl of template library solution. After vortexing and brief centrifugation, the 47.5 μl of rehydration solution was transferred to a freeze-dried RPA reaction complex pellet and mixed by pipetting up and down until complete resuspension was achieved. To initiate the reaction, 2.5 μl of 280 mM magnesium acetate was added and mixed well by vortexing followed by brief centrifugation. The reaction was allowed to continue in an incubator block at an isothermal temperature of 37°C for 40 min. Amplification products were purified with 1× Agencourt Ampure XP beads (Beckman Coulter) and eluted in EB buffer (Qiagen).
Illumina linear library amplification using T7 RNA polymerase
Linear amplification for deep sequencing (LADS) [12] was used to amplify end-repaired and A-tailed genomic DNA library fragments. Starting with 200 ng of genomic DNA, ~30 ng of A-tailed library fragments were ligated to Adapter-A and Adapter-B, previously prepared by annealing adapter-A and adapter-B forward and reverse oligos (see additional file 1, Table S1) respectively. Adapter-ligated libraries were cleaned twice with 1× Agencourt Ampure XP to remove non-ligated adapters and adapter dimers. Alternatively, adapter-ligated library was electrophoresed in a 2% agarose gel alongside a DNA ladder and fragment bands of 250 to 600 bp were cut and purified at room temperature using the Qiaquick gel extraction kit (Qiagen). The purified adapter-ligated library (20 ng) was used to perform in vitro transcription at 37°C overnight with reagents from the MEGAscript T7 kit (Ambion). The protocol described in [12] was followed to generate T7-amplified libraries ready for Illumina cluster generation and sequencing. The procedure was repeated twice, each time starting from same genomic DNA, to generate technical replicates
PCR-free libraries
Genomic DNA (1-2 μg) was sheared, end-repaired and A-tailed as described under "Illumina sequencing libraries". Pre-annealed paired-end Illumina No-PCR adapters were ligated using the same ligation procedure described under "Illumina sequencing libraries". Resulting adapter-ligated fragments were loaded alongside a low molecular weight ladder (NEB) on a TBE 2% agarose gel. Fragment bands of 350 to 500 bp were cut and purified using a Qiaquick gel extraction kit. Gel purification was performed at room temperate and DNA eluted in EB buffer.
Library quantification by qPCR
Pre-PCR libraries were quantified by qPCR using a previously quantified and sequenced library in dilutions containing 100 ng, 10 ng and 1 ng as standard. The qPCR reaction (25 μl) contained 12.5 μl of 2× Sybr Green PCR Master Mix (Applied Biosystems), 0.75 μl of 10 μM Pre-PCR qPCR top and bottom primers (see additional file 1, Table S1), 2.5 μl of library sample (pre-diluted 1000 fold) and 8.5 μl of PCR-quality water. The reaction was set in triplicates and run on a StepOne™ Real-Time PCR System (Applied Biosystems) with a thermocycling protocol of 10 min at 95°C for the initial denaturation followed by 40 cycles of [30 s at 95°C and 1 min at 60°C]. Prior to cluster generation, amplified or no-PCR libraries were quantified by qPCR using the SYBR fast Illumina library quantification kit (KAPA Biosystems). The qPCR was done following instructions provided by the kit manufacturer and run on the StepOne™ Real-Time PCR System (Applied Biosystems). The thermocycling protocol was 5 min at 95°C for the initial denaturation followed by 35 cycles of [30 s at 95°C and 45 s at 60°C].
Data processing
Replicate data sets were analyzed independently by mapping sequence reads to the P. falciparum 3D7 reference genome V2.1.5 ftp://ftp.sanger.ac.uk/pub/pathogens/Plasmodium/falciparum/3D7/3D7.version2.1.5/ using BWA [13]. SAMtools [14] was used to generate coverage data from the BWA pileup mapping output. To compare the quality of sequence data from different library preparation methods, we developed four analysis metrics. Evenness of coverage depth metrics - comparing the overall representation and depth across the entire genome; GC content analysis - to detect base composition biases in amplification through comparison with a theoretical GC content derived from shredded reads of the reference genome [6]; Tolerance metrics - to assess representation of extreme base-composition loci, focusing on selected genomic regions; and Fidelity metrics - assessing enzyme-dependent errors.
Genome coverage
Mapped reads for each replicate data sets were normalized by average genome coverage to 11× and 21× for clinical and non-clinical samples respectively, and then pooled or analyzed independently depending on analysis metrics used. For the clinical samples, reads originating from host contamination were filtered out (by aligning all reads to P. falciparum 3D7 reference genome) prior to analysis, hence the relatively low average genome coverage depth for these samples. Evaluation of coverage depth and evenness was based on cumulative distributions over a normalized overall average depth. A measurement of low-coverage index lci (d) is defined as the integration of the cumulative distribution C(x) from 0 to d giving an overall assessment of the coverage at the low end of distribution:
The value lci (0.5) that gives a measurement of the coverage below half of the average depth in the distribution was used to compare evenness of coverage for each data set.
We counted the number of bases in the genome with zero coverage (Cov = 0) and less than 5× coverage (Cov<5×). We used SAMtools to generate coverage plots and bash/awk scripts for coverage counting. Tolerance to regions of extreme base composition was analyzed after normalizing data sets by overall average coverage depth. Coverage distribution on selected regions (on chromosome 11) are presented as a boxplot generated in R [15]http://www.r-project.org/.
Fidelity
Most polymerases introduce errors during base incorporation in regions of low complexity. To assess these forced errors (or template-dependent errors), we compared variant calls by each amplification method at sites of low complexity (mainly TATA repeats). We also assessed the ability of each amplification condition to find a set of known consensus errors in the reference, and used data obtained to score true and false positives. The known consensus errors in the reference genome were previously identified and published [16]. We used the iCORN program [16] to conservatively map reads to the reference genome. iCORN iteratively maps reads and compares them with perfectly mapping reads. PCR-free variant calls were included as a quality control. A mismatch or indel was called with mpileup (a function in SAMtools) at a quality ≥40 from the vcf file. We differentiate between a true positive mismatch or indel (if the call agrees with iCORN and PCR-free), and a false positive, (if the call disagrees with iCORN and PCR-free) at that position. Using the total number of errors/variants counted per sample, we generated a normalized error score for each enzyme or library-preparation method.
Rank
We ranked the results by assigning a score of 1 to the best and a score of 0 to the worst. For each amplification condition, we calculated the average score value obtained in both coverage and fidelity metrics. Conditions with the highest average score was ranked highest while that with the least assigned the lowest rank (see Table 2).
Chimeric and duplicate reads analysis
We used the BWA paired-end alignment algorithm to generate mapped data sets. Picard software http://picard.sourceforge.net/command-line-overview.shtml was used to identify/mark duplicates from the mapped data set. Chimeric reads were identified from the alignment results using the “bwasw” option with single-end reads. Each replicate data set was analyzed independently to determine the number of chimeric or duplicate reads. We calculated the number of chimeric or duplicate reads as a percentage of the total mapped reads for each data set. Mean values are shown (see additional file 1, Table S2).
All datasets used in this study have been deposited in the ENA read archive under accession number [ERP000832].