Digital PCR provides sensitive and absolute calibration for high throughput sequencing

Background Next-generation DNA sequencing on the 454, Solexa, and SOLiD platforms requires absolute calibration of the number of molecules to be sequenced. This requirement has two unfavorable consequences. First, large amounts of sample (typically micrograms) are needed for library preparation, thereby limiting the scope of samples which can be sequenced. For many applications, including metagenomics and the sequencing of ancient, forensic, and clinical samples, the quantity of input DNA can be critically limiting. Second, each library requires a titration sequencing run, thereby increasing the cost and lowering the throughput of sequencing. Results We demonstrate the use of digital PCR to accurately quantify 454 and Solexa sequencing libraries, enabling the preparation of sequencing libraries from nanogram quantities of input material while eliminating costly and time-consuming titration runs of the sequencer. We successfully sequenced low-nanogram scale bacterial and mammalian DNA samples on the 454 FLX and Solexa DNA sequencing platforms. This study is the first to definitively demonstrate the successful sequencing of picogram quantities of input DNA on the 454 platform, reducing the sample requirement more than 1000-fold without pre-amplification and the associated bias and reduction in library depth. Conclusion The digital PCR assay allows absolute quantification of sequencing libraries, eliminates uncertainties associated with the construction and application of standard curves to PCR-based quantification, and with a coefficient of variation close to 10%, is sufficiently precise to enable direct sequencing without titration runs.

Introduction A new generation of sequencing technologies based on "sequencing by synthesis" is revolutionizing biology, biotechnology, and medicine. A key advance facilitating higher throughput and lower costs for several of these platforms was the migration from the clone-based sample preparation commonly used in Sanger sequencing to massively parallel clonal PCR amplification of sample molecules on beads (Roche 454 and ABI SOLiD) (Margulies et al, 2005) or on a surface (Solexa) (Bing et al, 1996). The parallel amplification steps are relatively efficient, with sequence data obtained from a significant fraction of the sequencing library molecules used in the amplification. This, when coupled with a good loading efficiency onto the instrument, results in on the order of one million library molecules (typically less than a picogram of library DNA) being required to carry out a full sequencing run. However, the manufacturers require one to ten trillion DNA fragments (typically 1-5 micrograms) as input for library preparation. This is primarily because quantitation of the library DNA according to the manufacturers' protocols consumes more than a billion molecules, and secondarily because of the limited efficiency of the library preparation methods, which have typical conversion efficiencies of 0.001%-0.1%.
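The relationship between library mass and molecule count quoted above can be checked with a short calculation. The following is a minimal sketch, assuming double-stranded DNA with an average mass of about 650 g/mol per base pair and a mean fragment length of 500 bp; both values are illustrative assumptions, not figures taken from this study, and the function names are hypothetical.

```python
AVOGADRO = 6.022e23              # molecules per mole
DSDNA_G_PER_MOL_PER_BP = 650.0   # assumed average mass of one base pair of dsDNA


def molecules_to_picograms(n_molecules, mean_length_bp):
    """Mass in picograms of n double-stranded fragments of a given mean length."""
    grams = n_molecules * mean_length_bp * DSDNA_G_PER_MOL_PER_BP / AVOGADRO
    return grams * 1e12


def micrograms_to_molecules(mass_ug, mean_length_bp):
    """Number of double-stranded fragments contained in a given mass of DNA."""
    grams = mass_ug * 1e-6
    return grams * AVOGADRO / (mean_length_bp * DSDNA_G_PER_MOL_PER_BP)


if __name__ == "__main__":
    # One million 500 bp library molecules: expected to be under a picogram.
    print(molecules_to_picograms(1e6, 500))
    # One microgram of 500 bp fragments: expected to be on the order of 10^12.
    print(micrograms_to_molecules(1.0, 500))
```

Under these assumptions, one million molecules weigh roughly half a picogram and one microgram holds on the order of a trillion fragments, in line with the orders of magnitude stated in the text.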
The requirement for micrograms of input DNA limits the pool of samples that can be analyzed with next-generation sequencing technologies, since for many applications microgram quantities of sample are not available. In some cases it is possible to use amplification such as PCR or MDA, but amplification methods are biased and distort the composition of the sample. We developed a method for highly accurate absolute quantitation of sequencing libraries that consumes subfemtogram amounts of library material based on digital PCR (Vogelstein 1999). Eliminating the large quantity requirement for traditional quantitation has the direct effect of reducing the sample input requirement from micrograms to nanograms or less, opening the way for analysis of minute and/or precious samples on the next-generation sequencing platforms without the distorting effects of pre-amplification (Mackelprang et al. 2008). It also eliminates the need for expensive titration sequencing runs, which most manufacturers currently recommend.

Results:
TaqMan detection chemistry has the advantage of yielding a fluorescence signal proportional to the number of molecules that have been amplified, not to the total mass of dsDNA in the sample (Heid et al. 1996). This method works by the addition of a double-labeled oligonucleotide probe to a PCR reaction powered by a polymerase with 5' to 3' exonuclease activity.
The probe must be complementary to one of the two product strands such that the extending polymerase will encounter it and separate the two labels by exonuclease activity, activating the probe's fluorescence. Conventional TaqMan detection chemistry requires that the probe be complementary to a region within the amplified portion of the template between the two amplification primers. This strategy is not possible for sequencing libraries, which have inserts of unknown or random sequence between short adaptor sequences. To overcome the challenge of probe design for templates of random sequence, we adopted the universal template (UT) approach, in which a probe-binding sequence is appended to one of the PCR primers (Zhang et al. 2003). To decrease reaction times, we replaced the published 20 bp UT probe-binding region with an 8 bp sequence target for a probe containing a locked nucleic acid nucleotide, as applied in Roche's UPL (Universal Probe Library) probes. The shorter amplicon-probe interaction length allows the reduction of PCR run times from 2.5 hours to less than 50 minutes. In practice, we often use the UT-qPCR assay in the real-time mode (with a calibration standard) to range the library concentration so that an appropriate dilution can be made for absolute quantitation by UT-digital PCR.
Digital PCR gives an absolute, calibration-free measurement of the concentration of amplifiable library molecules, with a lower coefficient of variation than a real-time PCR measurement with an ideally prepared standard curve (Figure 1A). We have sequenced more than thirty 454 libraries without a single titration run over a five-month period using the digital PCR quantitation method. To demonstrate the utility of digital PCR in preparing DNA libraries from small amounts of starting material, twelve libraries were created from starting amounts of E. coli DNA ranging from 35 ng to as low as 500 pg. Six of the libraries were constructed from E. coli genomic DNA (diluted before library construction) with the standard 454 shotgun protocol and Roche's molecular barcodes, "MIDs" (Multiplex IDentifiers). Six more DNA libraries were prepared from the same quantities of an E. coli amplification product (of 466 bp). The DNA libraries were quantified via UT-dPCR as described.
The results from Figure 1B show that we can obtain enough library DNA from 500 pg of genomic (shotgun) or amplicon DNA to create more than 100,000 enriched beads for sequencing. All twelve trace libraries were sequenced in a full run of our GS FLX 454 DNA pyrosequencer. In total, 18 million raw bases were sequenced from the trace shotgun libraries and 38 million raw bases were sequenced from the amplicon libraries. 69.16% of the shotgun reads and 99.17% of the amplicon reads mapped back to E. coli. Specifically, in the case of the library made from 500 pg of E. coli 16S amplicon, half of the resulting library was used for sequencing. 14.0 million raw bases were obtained in 55,206 reads with 99.02% of the reads mapping back to the template, indicating that almost 30 Mbp can be obtained from a library of 131,000 molecules prepared from 500 pg input material.
Similarly, half of the 1 ng E. coli amplicon library gave 10.9 million raw bases in 43,217 reads with 99.17% mapping. The 500 pg E. coli shotgun library gave 5.7 million raw bases in 26,812 reads (69.9% mapping), while the 1 ng E. coli shotgun library gave 6.0 million raw bases in 28,730 reads (69.9% mapping).
In an earlier shotgun sequencing run we used 2,400,000 sstDNA fragments (or 0.71 pg amplifiable DNA) from an Acetonema longum shotgun library (prepared according to the standard library preparation method from 723 ng of genomic DNA). From these molecules, accurately and reproducibly quantitated by digital PCR, 74% of the beads loaded gave useful 454 sequence data.
A similar UT-dPCR assay was designed to quantify Solexa sequencing libraries. Solexa libraries were prepared from human plasma DNA or whole blood genomic DNA using starting amounts of DNA between 2 and 6 ng. The concentrations of library molecules were determined by UT-dPCR, and the libraries were diluted to 4 pM for loading onto the sequencer. We achieved consistent cluster densities of ~110,000 to 150,000 clusters per tile on the Genome Analyzer II, a range that is deemed optimal by the manufacturer. The total number of reads yielded was 11 to 15 million per lane (Table 2). The libraries were also quantitated on the Agilent Bioanalyzer and NanoDrop spectrophotometers. Had we determined the dilutions based on these standard techniques, we would have obtained cluster densities too high and too low by factors of two, respectively.

Discussion
The standard workflow for the next-generation instruments entails library creation (requiring a bulk PCR step on Solexa), massively parallel PCR amplification of library molecules, and sequencing. Library creation starts with conversion of the sample to appropriately sized fragments, ligation of adaptor sequences onto the ends of the sample molecules, and selection for molecules properly appended with adaptors. The presence of the adaptor sequences on the ends of the library molecules enables amplification of random-sequence inserts by PCR. The number of library DNA molecules in the massively parallel PCR step is critical: it must be low enough that the chance of two molecules associating with the same bead (454) or the same surface patch (Solexa) is low, but there must be enough library DNA present that the yield of amplified sequences is sufficient to realize a high sequencing throughput. The standard workflow calls for measuring the mass of library DNA using the Agilent Bioanalyzer capillary gel electrophoresis (GE) instrument (454) or the NanoDrop spectrophotometer (Solexa), and then converting the mass to a molecule count using knowledge of the length distribution.
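The trade-off between clonality and yield described above can be illustrated with an idealized loading model. The following is a minimal sketch, assuming that library molecules attach to beads (or surface patches) independently and at random, so that the number of molecules per bead follows a Poisson distribution, and that every molecule amplifies. This model and the function name are illustrative assumptions, not a description of the manufacturers' chemistry or of the analysis used in this study.

```python
import math


def loading_fractions(molecules_per_bead):
    """Return (empty, single, multiple) bead fractions under a Poisson model.

    molecules_per_bead is the mean number of amplifiable library molecules
    per bead, i.e. the molecule-to-bead ratio.
    """
    lam = molecules_per_bead
    empty = math.exp(-lam)            # beads that capture no molecule
    single = lam * math.exp(-lam)     # clonal beads with exactly one molecule
    multiple = 1.0 - empty - single   # mixed beads with two or more molecules
    return empty, single, multiple


if __name__ == "__main__":
    for ratio in (0.05, 0.1, 0.3, 1.0):
        empty, single, multiple = loading_fractions(ratio)
        mixed_among_occupied = multiple / (1.0 - empty)
        print(ratio, round(single, 4), round(mixed_among_occupied, 4))
```

In this idealized picture, raising the ratio increases the number of occupied beads but also the share of mixed beads among them, which is why an accurate molecule count matters for choosing the loading ratio.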
Quantification of the library by mass presents three major stumbling blocks that render the quantification inaccurate to a degree that can adversely affect the sequencing results. First, mass-based quantitation requires an accurate estimate of the length of the molecules to determine the molar concentration of DNA fragments. Second, degraded and damaged molecules that cannot be amplified in the massively parallel amplification step are counted. Third, methods of measuring DNA mass lack sensitivity and are imprecise in concentration measurements near the limit of detection.
When the library concentration is underestimated, the possibility of molecular crosstalk arises because the clonality of beads (454) or clusters (Solexa) is compromised, which reduces the fraction of useful reads. When the library concentration is overestimated, the number of beads recovered (454) or number of clusters generated (Solexa) is reduced, in which case the full capacity of the sequencers cannot be utilized. Before carrying out a bulk sequencing run with a new library, Roche and Illumina recommend carrying out a four-point titration run on their sequencers in order to empirically determine the optimal volume of DNA for the massively parallel PCR. In addition, Illumina recommends that the user check the library quality with traditional Sanger sequencing before its application for high-throughput sequencing. The digital PCR method eliminates all three of these problems and the requirement for titration.
Quantitative real-time PCR, and especially digital PCR, are ideal candidate techniques for this application because of their exquisite sensitivity. Some detection chemistries for real-time PCR, such as TaqMan, have the property of counting molecules rather than measuring DNA mass, although the measurements are relative and the methods by which standards are established often tie the real-time PCR quantitation back to sample mass. Digital PCR is a technique where a limiting dilution of the sample is made across a large number of separate PCR reactions such that most of the reactions have no template molecules and give a negative amplification result. In counting the number of positive PCR reactions at the reaction endpoint, one is counting the individual template molecules present in the original sample one-by-one. A major advantage of digital PCR is that the quantitation is independent of variations in the amplification efficiency: each successful amplification is counted as one molecule, independent of the actual amount of product. PCR-based techniques have the additional advantage of only counting molecules that can be amplified, i.e., those that are relevant to the massively parallel PCR step in the sequencing workflow. We use Fluidigm's BioMark platform for digital PCR, which performs 9,180 PCR reactions per chip with automated partitioning of nanoliter PCR reactions on 12 independent input samples.
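Counting positive reactions gives the molecule number directly only when very few chambers receive more than one template. The following is a minimal sketch of the standard Poisson correction commonly applied to digital PCR counts; the text above does not state that this correction was used here, so it is shown as an assumption. The panel size of 765 chambers follows from 9,180 reactions divided across 12 samples, and the function name is hypothetical.

```python
import math


def poisson_corrected_count(positive_chambers, total_chambers):
    """Estimate the number of template molecules loaded into one panel.

    Assumes molecules are distributed across chambers at random (Poisson),
    so a positive chamber may contain more than one molecule.
    """
    if not 0 <= positive_chambers < total_chambers:
        raise ValueError("positive_chambers must be >= 0 and < total_chambers")
    fraction_positive = positive_chambers / total_chambers
    # Mean molecules per chamber, from P(empty chamber) = exp(-mean).
    mean_per_chamber = -math.log(1.0 - fraction_positive)
    return mean_per_chamber * total_chambers


if __name__ == "__main__":
    # Illustrative count, not data from this study.
    print(poisson_corrected_count(200, 765))
```

The corrected estimate is always at least as large as the raw count of positive chambers, and the two converge as the sample becomes more dilute. A fully saturated panel carries no usable information, which is why the sketch rejects that case.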
Recently, Meyer et al. developed a SYBR Green real-time PCR assay that allows the user to estimate the number of amplifiable molecules in trace sequencing samples. This was the first report of PCR-based quantitation of sequencing libraries, and it extended the sensitivity of library quantitation significantly, although to an unknown extent, since the source material used to make the trace libraries was not quantitated. However, the SYBR Green assay presents three principal disadvantages: 1) SYBR Green I dye is an intercalating fluorochrome that gives signal in proportion to DNA mass, not molecule number, 2) SYBR Green assays rely on an external standard that limits the absolute accuracy over time and is not universal to all sample types, and 3) intercalating fluorochromes give signal from nonspecific PCR reaction products.
In a real-time assay, the standard must have the same amplification efficiency and molecular weight distribution as the unknown library sample. This means the user must have on hand a bulk sequencing library very similar to the trace library being made, and the molecular weight distributions of both the standard and the new library must be known. These are often impractical requirements for a trace sample library. Furthermore, this standard library must be of extremely high quality if mass-based quantitation is to be used to calibrate the assay for amplifiable molecules, which makes assessment of the concentration of amplifiable molecules in a degraded sample extremely difficult.
Lastly, sequence-nonspecific detection chemistries like SYBR Green give signal from all dsDNA products generated, including primer dimers and nonspecific amplification products, which may be an issue in complex samples. In particular, side products can compete with specific amplification from low numbers (<1000) of template molecules, limiting the accuracy of SYBR Green quantitation for dilute samples (Simpson 2000). Although the presence of these side products can often be discerned by analysis of the product melting curve, opportunities to optimize the primers are limited due to the short length of the adaptor sequences and the specific nucleotide sequences required for compatibility with proprietary sequencing reagents. Sensitivity to side products gives SYBR Green a tendency toward overestimation of the sample quantity.
Conclusion This work presents an assay that circumvents these limitations using TaqMan detection chemistry and digital PCR. When the assay is combined with digital PCR, dependence on a standard sample is eliminated, and the results are sufficiently accurate to allow the elimination of titration techniques, even for samples of low quantity and low quality. The extreme sensitivity of real-time and digital PCR eliminates quantitation as the material-limiting step in the sequencing workflow, bringing greater focus to library preparation procedures as the most limiting step in sequencing trace samples. It is natural to expect that library preparation procedures developed with the capacity to handle up to five micrograms of input are far from optimal with respect to minimizing loss from nanogram or picogram samples. Library preparation procedures optimized for trace samples with reduced reaction volumes and media quantities, possibly formatted in a microfluidic chip, have the potential to dramatically improve the recovery of library molecules, allowing preparation of sequencing libraries from quantities of sample comparable to that actually required for the sequencing run, i.e., close to or less than one picogram.
Digital PCR quantitation is sufficiently accurate in counting amplifiable library molecules to justify elimination of titration techniques as well as the associated cost and time involved. The method is also hundreds of millions of times more sensitive than traditional means of library quantitation, and allows the sequencing of libraries prepared from tens to hundreds of picograms of starting material, rather than the micrograms of DNA required by the manufacturers' protocols. The reduced sample requirement enables the application of next-generation sequencing technologies to minute and precious samples without the need for additional amplification steps.

Methods:
Sample generation. DNA was extracted from mid-log phase E. coli K12 overnight cultures using Qiagen's DNeasy Tissue & Blood kit and then further purified using Qiagen's QIAquick PCR purification kit following the manufacturer's protocol. E. coli amplicons were generated by 16S rRNA PCR following standard protocols to produce a uniform 466 bp fragment. DNA was extracted from human plasma or whole blood using Qiagen's DNA Blood Mini Kit or Macherey-Nagel's NucleoSpin Plasma Kit according to the manufacturers' protocols.
Sequencing library preparation. 454 shotgun libraries: Libraries were generated according to the manufacturer's protocol with a few adjustments: trace E. coli amplicons and human sample pX were not nebulized; 0.01% Tween-20 was added to the elution buffer for each MinElute column purification step; libraries were eluted using 1x TE containing 0.05% Tween-20 at a volume of 30 μl. Single-stranded libraries were aliquoted for storage.
Solexa libraries: Libraries were generated following the standard genomic DNA protocol with small adjustments. No nebulization was performed on plasma DNA samples since they were already fragmented in nature (average ~170 bp); the whole blood genomic DNA sample was sonicated to produce fragments between 100 and 400 bp; all ligated products were used for 18-cycle PCR enrichment; no gel extraction was performed and no Sanger sequencing was used to confirm fragments of correct sequence.

Standard creation for UT-qPCR for the Stratagene Mx3005.
After sequencing library preparation, UT-qPCR was used to range the concentration for UT-dPCR. For testing purposes and to gauge the correct dilution, a standard library was created, quantitated by UT-dPCR and serially diluted for UT-qPCR calibration. In order to ensure uniform amplification among various libraries, the fragment length distribution of the standard matched that of the library to be quantitated. To maintain the standard over time, the library was cloned into pCR2.1 (Invitrogen) and then transformed into DH5α cells. Plasmids containing the library standard were harvested from mid-log phase DH5α cells and then further isolated using Qiagen's QIAprep Spin Miniprep kit. The resulting plasmids were digested using EcoRI, then gel purified and cleaned up using Qiagen's QIAquick PCR purification kit. Calibration of the standard by UT-dPCR was conducted on a regular basis.

UT-qPCR quantitation on Stratagene's Mx3005.
Validated standards were diluted in ten-fold increments through the range 10^15 to 10^3 molecules/μl. Standards were assayed in triplicate in order to obtain standard deviation/relative coefficient of variation. Each library was diluted ten-fold, and assayed with twelve replicates in order to obtain standard deviation/relative coefficient of variation. The thermal cycling parameters are listed below.

UT-dPCR quantitation on Fluidigm's BioMark System.
454 libraries: UT-qPCR was first performed on aliquoted libraries in order to estimate the dilution factor for UT-dPCR. The libraries were diluted to roughly 100-360 molecules per μl. PCR reaction mix containing the diluted template was loaded onto Fluidigm's 12.765 Digital Array microfluidic chip. The microfluidic chip has 12 panels and each panel contains 765 chambers. The concentration of diluted template that yielded 150-360 amplified molecules per panel was chosen for technical replication. Six replicate panels on the digital chip were assayed in order to obtain absolute quantitation of the initial concentration of library. The diluted samples having typical relative coefficients of variation (between replicates) within 9-12% (or lower) were used for emPCR.
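The replicate-panel step above amounts to averaging the per-panel molecule counts, checking their relative coefficient of variation, and scaling back to the undiluted library. The following is a minimal sketch of that arithmetic. The replicate counts, the template volume per panel (`panel_volume_ul`), and the dilution factor are hypothetical placeholders, since the text does not give them, and the function names are illustrative.

```python
import statistics


def summarize_replicates(molecules_per_panel):
    """Return (mean, relative coefficient of variation) across replicate panels."""
    mean = statistics.mean(molecules_per_panel)
    sd = statistics.stdev(molecules_per_panel)   # sample standard deviation
    return mean, sd / mean


def stock_concentration(mean_molecules_per_panel, panel_volume_ul, dilution_factor):
    """Molecules per microliter in the undiluted library.

    panel_volume_ul is the volume of diluted template partitioned into one
    panel; it is an assumed parameter, not a value given in the text.
    """
    return mean_molecules_per_panel / panel_volume_ul * dilution_factor


if __name__ == "__main__":
    counts = [210, 190, 205, 220, 185, 200]       # six illustrative replicate panels
    mean, cv = summarize_replicates(counts)
    print(mean, cv)
    if cv <= 0.12:                                # acceptance range quoted in the text
        print(stock_concentration(mean, panel_volume_ul=4.0, dilution_factor=1000))
```

A dilution would be accepted in this sketch only when the between-replicate variation falls within the 9-12% range (or lower) described in the text.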
Solexa libraries: qPCR using human-specific primers was first performed to estimate the dilution factor required for carrying out UT-dPCR. The final dilution yielded ~150-360 amplified molecules per panel.

emPCR/Bridge PCR & Sequencing.
454 sequencing: Sequencing was performed according to the manufacturer's protocol. No titration or traditional sequencing was used. DNA:bead ratios of 0.085-0.300 (based on UT-dPCR quantitation) were used. These ratios yielded the desired 10-15% bead recovery in enrichment and the lowest mixed sequence fraction. Mixed reads in 454 sequencing are defined as four consecutive positive nucleotide flows for a given read.
Solexa sequencing: Sequencing libraries were first diluted to 10 nM according to the concentration determined by digital PCR. The average dilution factor was 10-20. Diluted libraries were denatured with 2 N NaOH and then diluted to a final concentration of 4 pM. The templates were loaded onto flow cells. Cluster generation was performed according to the manufacturer's instructions. Sequencing was carried out on the Genome Analyzer II. No titration run was performed.
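The dilution steps above require converting the digital PCR result, a molecule count per unit volume, into a molar concentration. The following is a minimal sketch of that conversion and of the resulting dilution factors; the function names and the example stock concentration are illustrative assumptions, not values from this study.

```python
AVOGADRO = 6.022e23   # molecules per mole


def molecules_per_ul_to_nanomolar(molecules_per_ul):
    """Convert a digital PCR concentration (molecules/ul) to nanomolar."""
    molecules_per_liter = molecules_per_ul * 1e6
    return molecules_per_liter / AVOGADRO * 1e9


def dilution_factor(stock_nm, target_nm):
    """Fold dilution needed to bring a stock to the target concentration."""
    if target_nm <= 0 or stock_nm < target_nm:
        raise ValueError("target must be positive and no higher than the stock")
    return stock_nm / target_nm


if __name__ == "__main__":
    stock_nm = molecules_per_ul_to_nanomolar(6.022e10)   # illustrative library stock
    print(stock_nm)
    print(dilution_factor(stock_nm, 10.0))    # first dilution, to 10 nM
    print(dilution_factor(10.0, 0.004))       # second dilution, 10 nM to 4 pM
```

Read this way, the reported 10- to 20-fold first dilution corresponds to undiluted libraries of roughly 100-200 nM.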