Optimizing illumina next-generation sequencing library preparation for extremely at-biased genomes
© Oyola et al; licensee BioMed Central Ltd. 2012
Received: 26 August 2011
Accepted: 3 January 2012
Published: 3 January 2012
Massively parallel sequencing technology is revolutionizing approaches to genomic and genetic research. Since its advent, the scale and efficiency of Next-Generation Sequencing (NGS) has rapidly improved. In spite of this success, sequencing genomes or genomic regions with extremely biased base composition is still a great challenge to the currently available NGS platforms. The genomes of some important pathogenic organisms like Plasmodium falciparum (high AT content) and Mycobacterium tuberculosis (high GC content) display extremes of base composition. The standard library preparation procedures that employ PCR amplification have been shown to cause uneven read coverage particularly across AT and GC rich regions, leading to problems in genome assembly and variation analyses. Alternative library-preparation approaches that omit PCR amplification require large quantities of starting material and hence are not suitable for small amounts of DNA/RNA such as those from clinical isolates. We have developed and optimized library-preparation procedures suitable for low quantity starting material and tolerant to extremely high AT content sequences.
We have used our optimized conditions in parallel with standard methods to prepare Illumina sequencing libraries from a non-clinical and a clinical isolate (containing ~53% host contamination). By analyzing and comparing the quality of sequence data generated, we show that our optimized conditions that involve a PCR additive (TMAC), produces amplified libraries with improved coverage of extremely AT-rich regions and reduced bias toward GC neutral templates.
We have developed a robust and optimized Next-Generation Sequencing library amplification method suitable for extremely AT-rich genomes. The new amplification conditions significantly reduce bias and retain the complexity of either extremes of base composition. This development will greatly benefit sequencing clinical samples that often require amplification due to low mass of DNA starting material.
KeywordsNext-Generation Sequencing Illumina Library Plasmodium falciparum AT-rich Malaria Clinical isolate PCR Tetramethyammonium chloride PCR-free Isothermal Linear Exponential
Among the eukaryotic pathogens whose genomes have so far been sequenced, the malaria parasite, Plasmodium falciparum (P. falciparum), posed the most technical challenges both in sequencing and assembly of the draft genome . The biggest hurdle was the complexity of the genome, which is extremely base biased. The P. falciparum genome is very AT-rich; coding regions have a mean AT content of more than 75%, with up to 100% in intergenic and non-coding regions . Although biased base composition has been a challenge to the Sanger sequencing methods, it has continued to be a major problem for the current high-throughput Next-Generation Sequencing (NGS) technologies . Much work has focused on solving amplification and sequencing problems associated with high GC content but nothing has been done to improve on those caused by AT-rich parts of the genome [4, 5]. Since completion of the malaria genome sequencing and the subsequent development of massively parallel sequencing technology, there has been increased effort in using genetic information obtained through re-sequencing initiatives to control and ultimately eliminate malaria (Manske et al, submitted). The Wellcome Trust Sanger Institute is currently using Illumina NGS technologies to re-sequence thousands of clinical isolates from various malaria endemic regions, mainly sourced from Africa and Asia. Re-sequencing of these samples has faced two major challenges: low mass of total parasite DNA and extremely high AT-base composition of the genome. These problems pose major technical challenges predominantly in the library preparation stage of the NGS pipeline and also in subsequent data analysis. Due to extreme AT bias, standard Illumina library preparation procedures produce libraries with poor representation that are generally low in complexity compared to the genomic DNA from which they were derived. Polymerase chain reaction (PCR) amplification conditions as currently used in the standard library preparation procedures, have been shown to introduce biases in sequence coverage towards DNA regions with balanced base composition[5, 6]. Furthermore, genomic regions with high AT or very high GC content generally result in little or no amplification at all. These PCR-introduced artifacts have been shown to cause misleading or inaccurate conclusions in the analysis of genome-variation data .
Attempts have been made to overcome coverage problems associated with PCR amplification by omitting the PCR step in library preparation. Although the PCR-free approach has improved read distribution and produced more even genome coverage, it requires large amounts of starting DNA material, which is difficult to obtain especially from clinical isolates. The low mass of DNA in clinical isolates demands some form of amplification to generate sufficient quantities for an Illumina library preparation. We have therefore investigated various amplification possibilities and developed library-preparation procedures that are optimal for extremely AT-rich genomes using the Illumina sequencing platform.
Here we report alternative library-amplification procedures optimized for AT-rich templates and suitable for low mass DNA starting material especially from malaria clinical isolates. We have tested various library amplification methods including linear and isothermal amplification strategies. We have also explored various PCR amplification conditions by testing different polymerases, and their tolerance to AT-rich templates, in the absence or presence of tetramethylammonium chloride (TMAC), a PCR additive that increases thermostability of AT base pairs .
Conventional PCR amplification of an AT-rich locus
PCR amplification of Illumina genomic libraries
Although Phusion polymerase is the standard enzyme for Illumina sequencing library amplification, it produces libraries with poor representation of loci with extreme base composition particularly the AT-rich regions . In our screening above, many enzymes failed to show efficient amplification of the test locus and were clearly intolerant to AT-rich templates. We evaluated the efficiency of genomic library amplification using polymerases that showed some tolerance to the AT-rich template. We tested Phusion, AccuPrime Taq HiFi, Platinum pfx, Kapa2G Robust and Kapa HiFi. We used these enzymes to amplify PE adapter-ligated genomic libraries made from malaria parasite clinical isolate and the reference strain P. falciparum 3D7. In the presence of TMAC, AccuPrime Taq HiFi and Phusion produced undetectable amplifications that were not processed further. However, alternative amplification conditions produced high amplification yields (see additional file 1, Figure S1), that were sequenced and analyzed. All sequence data sets generated were normalized to average coverage before analysis. Two data sets were generated in parallel: One from a lab-cultured parasite (P. falciparum strain 3D7) genomic DNA; and the other one from P. falciparum clinical isolate containing ~53% host DNA contamination. Sequence data from the clinical isolate were filtered to remove host contamination before analysis resulting in lower coverage depths compared to those from 3D7. To analyze and compare the data generated under optimized conditions against standard procedures, we developed analysis metrics that assess four aspects of data quality: Evenness of coverage depth metrics was used to analyze the distribution of coverage depth for each library across the entire genome and plot results of multiple data sets in a single graph for comparative analysis; GC content analysis was used to compute the average GC content for each library and assess base-composition biases; tolerance metrics was used to assess representation of extreme base-composition loci, focusing on selected regions; and fidelity metrics was used to assess enzyme-dependent amplification errors.
Average GC content
Av. %GC content
Av. %GC content
Isothermal exponential amplification of Illumina genomic libraries using RPA technology
High extension temperatures in conventional PCR procedures have been associated with bias and other negative effects on high AT content libraries. We therefore assessed the suitability of an isothermal exponential amplification method provided by the recombinase polymerase amplification (RPA) technology . RPA uses recombinase enzymes which bind oligonucleotide primers (single-stranded DNA) to form nucleoprotein filaments that stimulate searching for homologous sequences in a duplex-DNA template. The primer-recombinase complex is able to attach to the homologous DNA duplex and facilitate strand exchange, eliminating the need to denature the template as in PCR. Disassembly of the recombinase allows strand-displacing DNA polymerase to begin primer extension from the 3' end. Two opposing oligonucleotide primers initiate an isothermal exponential amplification through automatic cyclic repetition of this process. We used RPA technology (TwistDx) with Illumina PE primers to amplify adapter-ligated libraries before sequencing. We obtained a relatively high yield of RPA-amplified library compared to T7 libraries (see additional file 1, Figure S1). Analysis of sequence data obtained from RPA-amplified libraries showed reduced bias towards GC rich templates compared to standard Illumina libraries for both P. falciparum 3D7 and the clinical isolate (Figure 3A & 3B). Coverage of extreme AT base-composition loci was however inferior to Kapa enzymes (Figure 4i & ii panel C). Furthermore, RPA amplified libraries showed relatively high levels of chimeric and duplicate reads (see additional file 1, Table S2).
Linear amplification of Illumina genomic libraries using T7 RNA polymerase
Exponential amplification of sequencing libraries is thought to be the source of biases and artifacts. To avoid this, we explored a recently described linear amplification for deep sequencing (LADS) protocol . LADS uses T7 RNA polymerase to transcribe DNA fragments bearing a T7-promoter sequence at the 5' end. This is an in-vitro transcription system whereby the T7 RNA polymerase transcribes a DNA fragment multiple times producing several RNA transcripts under isothermal conditions and resulting in a linearly amplified template. The resulting RNA material is then converted to cDNA using a reverse transcriptase. We sequenced the T7 polymerase-amplified libraries and analysed the quality of the data generated using our analysis metrics. The overall yield of the library generated using the LADS protocol was very low compared to those amplified exponentially (see additional file 1, Figure S1). However, coverage of extreme base-composition loci was better than standard Illumina libraries. Libraries produced under optimized PCR conditions were generally superior to the LADS protocol (T7) in coverage of AT-rich loci for both P. falciparum 3D7 and the clinical isolate (Figure 4 I & ii panel C). Surprisingly, an over-representation of fragments with high GC content was observed in T7-amplified library when viewing sequence reads aligned to the reference genome using artemis  (data not shown). T7 polymerase-amplified libraries were also associated with the highest level of duplicate and chimeric reads (see additional file 1, Table S2).
Effect of TMAC on Illumina genomic library amplification
TMAC has been shown to improve thermostability of AT-base pairs . However, it inhibits the activity of most polymerases. In our analysis, Phusion and AccuPrime Taq HiFi were inhibited by TMAC and produced no library amplification (see additional file 1, Figure S1). On the other hand Platinum pfx and the Kapa enzymes were tolerant to the concentration of TMAC used in this assay (Figure 1). We analyzed and compared libraries prepared with Kapa enzymes in the presence and absence of TMAC. Our analysis indicates that the overall evenness in coverage across the genome was significantly improved in the presence of TMAC (see additional file 1, Figure S3). We observed less bias towards GC rich sequences and an overall improved coverage on AT-rich regions. All amplification data shown for Kapa HiFi, Kapa2G and Platinum pfx in Figures 2, 3 and 4 were performed in the presence of TMAC. The positive effect of TMAC was observed in both P. falciparum 3D7 and the clinical isolate (containing ~53% host contamination).
Fidelity and coverage ranking
Ranking of P. falciparum 3D7-amplified and PCR-free libraries.
AccuPrime Taq HiFi
Library amplification is an important step in preparing samples for sequencing with the current NGS platforms, especially when the starting material is severely restricted in quantity. PCR amplification, however, has been shown to be the major source of artifacts and base-composition bias in library fragments. Although alternative library-preparation approaches that omit the PCR amplification show a significant reduction in base-composition bias, this strategy requires large quantities of starting material. Therefore, PCR-free library preparation is unsuitable where amounts of DNA starting material are limited, as is the case with most clinical isolates.
We demonstrate that problems associated with PCR can be ameliorated through optimization, thereby allowing amplification to be used in generating sequencing libraries even from extremely AT-biased genomes. Through careful selection of polymerases and the use of a PCR-additive that normalizes melting temperatures of the different library fragments, we have optimized Illumina library amplification. These alternative PCR conditions generate library fragments with increased coverage of extreme AT-rich regions.
Standard Illumina amplification procedures that use Phusion polymerase generate libraries with biased base-composition and are depleted of AT-rich loci. The sequence data generated under these standard conditions are of reduced complexity and maybe unsuitable for resequencing to identify relatively rare genetic variants of important biological function. We set out to solve this problem by focusing on the library amplification step, which is the primary source of biases in almost all NGS platforms. We started by screening commercially available polymerases for tolerance to an extremely AT-rich template. This initial screening procedure identified a preferred polymerase and amplification conditions that showed improved tolerance to AT-rich templates. We demonstrated that most commercial polymerases are inhibited by TMAC. However, this chemical can have a tremendous positive effect in amplifying AT-rich templates for the TMAC-tolerant enzymes. With TMAC additive we have developed Illumina library-amplification protocol suitable for extreme AT-rich genomic DNA. Quality analyses of sequence data from both P. falciparum 3D7 and a clinical isolate generated by our optimized conditions showed less bias towards base-balanced regions and produced an average GC content that matches that of the unamplified genome.
To date, no single polymerase or amplification method has been shown to efficiently amplify both extremes of base composition with equal and optimum efficiency. The two isothermal amplification methods (RPA and T7) tested here showed some level of improved coverage relative to standard amplifications, but they were both associated with high levels of chimeric and duplicate reads. Under the optimized conditions we show that the Kapa series of polymerases in the presence of TMAC produced the best coverage that was relatively consistent across regions of extreme base composition. Worthy of note is the low template-dependent fidelity score for Kapa HiFi (Table 2), where we detected a number of false-positive insertion calls mainly in regions of TATA repeats. Other enzymes had zero or extremely low coverage in these regions, limiting our ability to perform accurate comparative analysis on enzyme's fidelity at very low complexity regions.
Despite the bias problem associated with it, PCR still remains an essential step in many NGS library preparation procedures, especially when the starting DNA material is limited. We describe robust alternative PCR conditions optimized for extremely AT-biased genomes. We show that the optimized conditions significantly reduce bias and also retain the complexity of either extremes of base composition. This observation was consistent in both the lab-cultured strain (3D7) and the clinical isolate. The optimized procedures described here will greatly benefit the malaria genomic research and control initiatives that have faced serious challenges with the parasite's extreme AT-rich genome.
P. falciparum 3D7 genomic DNA was a gift from Prof Chris Newbold (University of Oxford). Other P. falciparum genomic DNA (PK0076) was a clinical isolate containing ~53% human DNA contamination. The clinical isolate was obtained from the Malaria Genetics Group's Sequencing Sample Repository at the Wellcome Trust Sanger Institute.
Screening for an AT-rich tolerant enzyme
Various commercial enzymes were used to perform a conventional PCR (50 µl total volume) targeted at a 1217 bp locus (Pf3D7_01:55900-57116) with a high AT-content (84%) and a 540 bp locus (Pf3D7_11:1294982-1295521) with a relatively balanced base composition (positive control). Ten ng of P. falciparum 3D7 genomic DNA was used as template for each PCR. Pf_1 and Pf_11_ama1 oligo pairs (see additional file 1, Table S1) were used at a final concentration of 0.3 µM to amplify the high AT-rich and positive control fragments respectively. For each enzyme, two sets of PCRs were performed, one using reagents and conditions provided by the manufacturer (standard conditions) and the other deviating only by inclusion of 60 mM TMAC in the reaction mixture (optimized conditions). Each set of amplification condition was repeated at least twice to confirm the finding. See additional file 1, Table S4 for enzymes used and their catalogue numbers.
Illumina sequencing libraries
Genomic DNA (0.05 - 1 μg in 75 µl TE buffer) from P. falciparum 3D7 or the clinical isolate was sheared for 3 min using a Covaris S2 (Covaris Inc., Woburn, MA).To obtain a fragment-size distribution of ~100 bp to ~300 bp, the settings were: 20% duty cycle, intensity 5, and 200 cycles per burst. Illumina paired-end sequencing libraries were constructed using the NEBNext DNA sample preparation kit (NEB), following standard Illumina sample-preparation protocol with the following slight modifications. End-repair reactions (100 μl) contained 75 µl of sheared DNA sample, 1× T4 DNA-ligase buffer, 1 mM ATP, 0.4 mM dNTPs, 15 units of T4 DNA polymerase, 50 units of Klenow DNA polymerase, 50 units of T4 polynucleotide kinase, and were incubated at 20°C for 45 min. Resulting blunt phosphorylated DNA fragments (end-repaired) were cleaned using the QIAquick PCR purification Kit following the manufacturer's instructions and the DNA eluted in 32 μl of buffer EB. A single A base was added (A-tailing) to the 3' end of the end-repaired DNA fragments in a 50 μl reaction containing 1× Klenow buffer, 0.2 mM dATP, 150 units of Klenow exo- and incubated for 45 min at 37°C. A-tailed DNA fragments were cleaned using QIAquick MiniElute PCR purification Kits following the manufacturer's instructions and DNA eluted in 10 µl EB buffer. Pre-annealed paired-end (PE) adapter oligonucleotides (Illumina) were ligated to the A-tailed fragments in a 50 µl reaction containing 10 µl of DNA sample, 1× Quick T4 DNA-ligase buffer, 10 µl of PE-adapter mixture, 5 µl of Quick T4 DNA ligase (NEB) and incubated at 20°C for 30 min. The ligation reaction was cleaned twice with 1× Agencourt Ampure XP beads (Beckman Coulter). Aliquots were analyzed by running on an Agilent 2100 Bioanalyzer (Agilent Technologies) to determine the size distribution and to check for adapter contamination. Most amplified or non-amplified libraries were prepared twice (starting from same non-sheared genomic DNA) and each technical replicate sequenced independently before data analysis. All replicate data sets were pooled after normalizing and before analysis with the exception of coverage analysis on specific genome loci (presented as box plots, Figure 4i & 4ii) where each replicate was analyzed and shown independently.
Illumina PCR library amplification
All PCR amplifications were performed with an MJ Research Thermo Cycler PTC-225. Illumina PE 1.0 and 2.0 primers or PE 1.0- and 2.0-derived indexing primers were used to amplify 10 ng adapter-ligated library fragments by PCR. From a single pool of adapter-ligated library, aliquots were amplified using standard Illumina PCR reagents or using various alternative polymerases under optimized PCR conditions. Standard Illumina PCR (50 μl) with Phusion polymerase (Thermo) contained 1× Phusion DNA polymerase master mix and 0.4 μM of each primer pair and was amplified with thermocycling conditions of 1 min at 98°C for the initial denaturation, followed by 12 cycles of [10 s at 98°C, 30 s at 65°C, 30 s at 72°C] and a final extension for 5 min at 72°C. PCR (50 μl) with Kapa HiFi (KAPA Biosystems, South Africa) contained 1× Kapa HiFi buffer (containing TMAC) or 1× Kapa buffer B, 0.3 mM of each dNTP, 0.4 μM of each primer pair, 1 unit of Kapa HiFi and was amplified with thermocycling conditions of 1 min at 98°C for the initial denaturation followed by 12 cycles of [10 s at 98°C, 1 min at 65°C] and a final extension for 5 min at 65°C. PCR (50 μl) with Kapa2G Robust (KAPA Biosystems, South Africa) contained 1× Kapa2G buffer B, 0.3 mM dNTPs mix, 0.4 μM of each primer pair, 0.4 units of Kapa2G Robust and was amplified in the presence or absence of 60 mM TMAC with the same thermocycling conditions as Phusion above. PCR (50 μl) with Platinum pfx (Invitrogen) contained 1× Platinum pfx buffer, 0.3 mM dNTPs mix, 0.4 μM of each primer pair, 1 mM MgCl2, 0.8 units of Platinum pfx and was amplified in the presence or absence of 60 mM TMAC with thermocycling conditions of 2 min at 94°C for the initial denaturation followed by 12 cycles of [15 s at 94°C, 30 s at 60°C, 30 s at 65°C] and a final extension for 5 min at 65°C. PCR (50 μl) with AccuPrime Taq HiFi (Invitrogen) contained 1× AccuPrime buffer II, 0.4 μM of each primer pair, 1 unit of AccuPrime Taq HiFi and was amplified in the presence or absence of 60 mM TMAC with the thermocycling conditions used for Platinum pfx. PCR products were purified with 1× Agencourt Ampure XP beads (Beckman Coulter) and eluted in EB buffer (Qiagen). PCR products were analysed by running on an Agilent 2100 Bioanalyzer (Agilent Technologies) to assess amplification. Only successful amplifications were processed for sequencing.
Illumina isothermal library amplification using RPA technology
We used recombinase polymerase amplification (RPA) technology as provided in the TwistAmp™ DNA-amplification kit  to amplify 10 ng adapter-ligated library fragments. Illumina PE 1.0 and 2.0 primers or PE 1.0- and 2.0-derived indexing primers (each in 10 μM stock) were used for amplification following kit manufacturer's instructions. Amplifications were performed twice to generate technical replicates. For each sample, a rehydration solution was prepared by mixing 2.4 μl of each primer pair, 29.5 μl of rehydration buffer and 13.2 μl of template library solution. After vortexing and brief centrifugation, the 47.5 μl of rehydration solution was transferred to a freeze-dried RPA reaction complex pellet and mixed by pipetting up and down until complete resuspension was achieved. To initiate the reaction, 2.5 μl of 280 mM magnesium acetate was added and mixed well by vortexing followed by brief centrifugation. The reaction was allowed to continue in an incubator block at an isothermal temperature of 37°C for 40 min. Amplification products were purified with 1× Agencourt Ampure XP beads (Beckman Coulter) and eluted in EB buffer (Qiagen).
Illumina linear library amplification using T7 RNA polymerase
Linear amplification for deep sequencing (LADS)  was used to amplify end-repaired and A-tailed genomic DNA library fragments. Starting with 200 ng of genomic DNA, ~30 ng of A-tailed library fragments were ligated to Adapter-A and Adapter-B, previously prepared by annealing adapter-A and adapter-B forward and reverse oligos (see additional file 1, Table S1) respectively. Adapter-ligated libraries were cleaned twice with 1× Agencourt Ampure XP to remove non-ligated adapters and adapter dimers. Alternatively, adapter-ligated library was electrophoresed in a 2% agarose gel alongside a DNA ladder and fragment bands of 250 to 600 bp were cut and purified at room temperature using the Qiaquick gel extraction kit (Qiagen). The purified adapter-ligated library (20 ng) was used to perform in vitro transcription at 37°C overnight with reagents from the MEGAscript T7 kit (Ambion). The protocol described in  was followed to generate T7-amplified libraries ready for Illumina cluster generation and sequencing. The procedure was repeated twice, each time starting from same genomic DNA, to generate technical replicates
Genomic DNA (1-2 μg) was sheared, end-repaired and A-tailed as described under "Illumina sequencing libraries". Pre-annealed paired-end Illumina No-PCR adapters were ligated using the same ligation procedure described under "Illumina sequencing libraries". Resulting adapter-ligated fragments were loaded alongside a low molecular weight ladder (NEB) on a TBE 2% agarose gel. Fragment bands of 350 to 500 bp were cut and purified using a Qiaquick gel extraction kit. Gel purification was performed at room temperate and DNA eluted in EB buffer.
Library quantification by qPCR
Pre-PCR libraries were quantified by qPCR using a previously quantified and sequenced library in dilutions containing 100 ng, 10 ng and 1 ng as standard. The qPCR reaction (25 μl) contained 12.5 μl of 2× Sybr Green PCR Master Mix (Applied Biosystems), 0.75 μl of 10 μM Pre-PCR qPCR top and bottom primers (see additional file 1, Table S1), 2.5 μl of library sample (pre-diluted 1000 fold) and 8.5 μl of PCR-quality water. The reaction was set in triplicates and run on a StepOne™ Real-Time PCR System (Applied Biosystems) with a thermocycling protocol of 10 min at 95°C for the initial denaturation followed by 40 cycles of [30 s at 95°C and 1 min at 60°C]. Prior to cluster generation, amplified or no-PCR libraries were quantified by qPCR using the SYBR fast Illumina library quantification kit (KAPA Biosystems). The qPCR was done following instructions provided by the kit manufacturer and run on the StepOne™ Real-Time PCR System (Applied Biosystems). The thermocycling protocol was 5 min at 95°C for the initial denaturation followed by 35 cycles of [30 s at 95°C and 45 s at 60°C].
Replicate data sets were analyzed independently by mapping sequence reads to the P. falciparum 3D7 reference genome V2.1.5 ftp://ftp.sanger.ac.uk/pub/pathogens/Plasmodium/falciparum/3D7/3D7.version2.1.5/ using BWA . SAMtools  was used to generate coverage data from the BWA pileup mapping output. To compare the quality of sequence data from different library preparation methods, we developed four analysis metrics. Evenness of coverage depth metrics - comparing the overall representation and depth across the entire genome; GC content analysis - to detect base composition biases in amplification through comparison with a theoretical GC content derived from shredded reads of the reference genome ; Tolerance metrics - to assess representation of extreme base-composition loci, focusing on selected genomic regions; and Fidelity metrics - assessing enzyme-dependent errors.
The value lci (0.5) that gives a measurement of the coverage below half of the average depth in the distribution was used to compare evenness of coverage for each data set.
We counted the number of bases in the genome with zero coverage (Cov = 0) and less than 5× coverage (Cov<5×). We used SAMtools to generate coverage plots and bash/awk scripts for coverage counting. Tolerance to regions of extreme base composition was analyzed after normalizing data sets by overall average coverage depth. Coverage distribution on selected regions (on chromosome 11) are presented as a boxplot generated in R http://www.r-project.org/.
Most polymerases introduce errors during base incorporation in regions of low complexity. To assess these forced errors (or template-dependent errors), we compared variant calls by each amplification method at sites of low complexity (mainly TATA repeats). We also assessed the ability of each amplification condition to find a set of known consensus errors in the reference, and used data obtained to score true and false positives. The known consensus errors in the reference genome were previously identified and published . We used the iCORN program  to conservatively map reads to the reference genome. iCORN iteratively maps reads and compares them with perfectly mapping reads. PCR-free variant calls were included as a quality control. A mismatch or indel was called with mpileup (a function in SAMtools) at a quality ≥40 from the vcf file. We differentiate between a true positive mismatch or indel (if the call agrees with iCORN and PCR-free), and a false positive, (if the call disagrees with iCORN and PCR-free) at that position. Using the total number of errors/variants counted per sample, we generated a normalized error score for each enzyme or library-preparation method.
We ranked the results by assigning a score of 1 to the best and a score of 0 to the worst. For each amplification condition, we calculated the average score value obtained in both coverage and fidelity metrics. Conditions with the highest average score was ranked highest while that with the least assigned the lowest rank (see Table 2).
Chimeric and duplicate reads analysis
We used the BWA paired-end alignment algorithm to generate mapped data sets. Picard software http://picard.sourceforge.net/command-line-overview.shtml was used to identify/mark duplicates from the mapped data set. Chimeric reads were identified from the alignment results using the “bwasw” option with single-end reads. Each replicate data set was analyzed independently to determine the number of chimeric or duplicate reads. We calculated the number of chimeric or duplicate reads as a percentage of the total mapped reads for each data set. Mean values are shown (see additional file 1, Table S2).
All datasets used in this study have been deposited in the ENA read archive under accession number [ERP000832].
Polymerase Chain Reaction
Recombinase Polymerase Amplification
Linear amplification for deep sequencing
Single Nucleotide Polymorphism.
This work was supported by the Wellcome Trust [grant number 079355/Z/06/Z].
- Doolittle RF: The grand assault. Nature. 2002, 419 (6906): 493-494. 10.1038/419493a.View ArticlePubMedGoogle Scholar
- Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S: Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002, 419 (6906): 498-511. 10.1038/nature01097.View ArticlePubMedGoogle Scholar
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008, 36 (16): e105-10.1093/nar/gkn425.PubMed CentralView ArticlePubMedGoogle Scholar
- Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12 (2): R18-10.1186/gb-2011-12-2-r18.PubMed CentralView ArticlePubMedGoogle Scholar
- Kieleczawa J, Mazaika E: Optimization of protocol for sequencing of difficult templates. J Biomol Tech. 2010, 21 (2): 97-102.PubMed CentralPubMedGoogle Scholar
- Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods. 2009, 6 (4): 291-295. 10.1038/nmeth.1311.PubMed CentralView ArticlePubMedGoogle Scholar
- Chevet E, Lemaitre G, Katinka MD: Low concentrations of tetramethylammonium chloride increase yield and specificity of PCR. Nucleic Acids Res. 1995, 23 (16): 3343-3344. 10.1093/nar/23.16.3343.PubMed CentralView ArticlePubMedGoogle Scholar
- Aurrecoechea C, Brestelli J, Brunk BP, Dommer J, Fischer S, Gajria B, Gao X, Gingle A, Grant G, Harb OS: PlasmoDB: a functional genomic database for malaria parasites. Nucleic Acids Res. 2009, D539-543. 37 Database
- Carver T, Bohme U, Otto TD, Parkhill J, Berriman M: BamView: viewing mapped read alignment data in the context of the reference sequence. Bioinformatics. 2010, 26 (5): 676-677. 10.1093/bioinformatics/btq010.PubMed CentralView ArticlePubMedGoogle Scholar
- Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell B: Artemis: sequence visualization and annotation. Bioinformatics. 2000, 16 (10): 944-945. 10.1093/bioinformatics/16.10.944.View ArticlePubMedGoogle Scholar
- Piepenburg O, Williams CH, Stemple DL, Armes NA: DNA detection using recombination proteins. PLoS Biol. 2006, 4 (7): e204-10.1371/journal.pbio.0040204.PubMed CentralView ArticlePubMedGoogle Scholar
- Hoeijmakers WA, Bartfai R, Francoijs KJ, Stunnenberg HG: Linear amplification for deep sequencing. Nat Protoc. 2011, 6 (7): 1026-1036. 10.1038/nprot.2011.345.View ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMedGoogle Scholar
- R-Development-Core-Team: A language and environment for statistical computing. R: A Language and Environment for Statistical Computing. Edited by: Computing RFfS. 2010, Vienna, AustriaGoogle Scholar
- Otto TD, Sanders M, Berriman M, Newbold C: Iterative Correction of Reference Nucleotides (iCORN) using second generation sequencing technology. Bioinformatics. 2010, 26 (14): 1704-1707. 10.1093/bioinformatics/btq269.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.