The here established workflow and study design are shown in Figure 1.
Samples tested
In this study we evaluated six HapMap gDNA samples (Coriell Repositories, USA), members of two family trios (IDs 1350 and 1420). We tested samples IDs NA11829, NA11830, and NA10856 of family ID 1350, and NA12003, NA12004, and NA10838 of family ID 1420. Whole-genome amplification of the same three gDNA samples of family ID 1420 was prepared following standard protocol of GenomiPhi V2 Amp Kit 500 RXS (Amersham Biosciences; GE Healthcare Europe GmbH). In each WGA reaction, one μl gDNA (concentration 10 ng/μl) served as a start material. A typical WGA gDNA yield of 4–7 μg gDNA (~ 250 ng/μl) with average product length of >10 kb was achieved. After verifying the quality of the DNA/WGA samples, the enrichment of the intended target regions of all nine samples, six non-amplified gDNA and three matched WGA gDNA, were undertaken using droplet-based multiplex PCR (RainDance; USA) as described below.
Targeted genomic sequences and primer panel design
The RainDance Technologies “384 Member Primer Panel” targets 384 Consensus CDS exons from 373 different genes thought to contain somatic mutations involved with cancer [15]. Amplicons were designed as described by Sjoblom et. al. The amplicons contained in the RainDance “384 Member Primer Panel” represent 172,053 amplicon bases (Additional file 1: Table S2). Amplicons were selected to represent several different design parameters that include: amplicon length (300 to 600 bases), amplicon GC content (30 – 60%) and Primer Tm (56 - 60°C). The RainDance “384 Member Primer Panel” contains 384 unique primer droplets, one for each amplicon in the panel. The forward and reverse primers for each amplicon are synthesized and combined into a single primer aliquot in the well of a standard microtiter plate. Each primer droplet contains an equal concentration of the forward and reverse primer (5.2 μM per primer). The primer aliquot is then reformatted to create an emulsion containing many primer droplets that are identical to each other, each containing a single primer pair. The primer droplets are 25 μm in diameter with a volume of eight pL per droplet. Each primer droplet is evaluated for its size and morphology to ensure each primer is represented at the same concentration. An automated counting process quantifies the number of primer droplets generated from each primer aliquot. The different primer droplets are then mixed together to ensure that each primer panel has the same number of each unique primer droplet.
Genomic DNA fragmentation
Genomic DNA samples were fragmented using a nebulization kit (Invitrogen, K7025-05) following the manufacturer’s recommended protocol: 2.0 μg of gDNA was re-suspended in 750 μL Shearing Buffer (TE, pH 8.0 (Fisher, 50843207) containing 10% glycerol (Fisher, AC15892)) and was nebulized at 6–10 lb per square inch (psi) for 90 s to produce 2–4 kb DNA fragments. Fragmentation of the gDNA to 2–4 kb produces the optimal template size for the amplicons size distribution represented in the RainDance “384 Member Primer Library”. Sheared gDNA was precipitated by adding 80 μL 3 M sodium acetate, pH 5.2 (Fisher, 50843081), 4 μL 20 mg/ml Mussel Glycogen (Fisher, NC9329100) and 700 μL 100% isopropanol (Fisher, AC14932) mixed and stored overnight at −20°C. The samples were then centrifuged at the maximum speed for 15 min at 4°C. The supernatant was removed, 500 μL of cold 80% ethanol (Fisher, 5739852) wash buffer was added and the DNA pellet was spun down by centrifugation at the maximum speed for 5 min at 4°C. The pellet was air dried and re-suspended in 10 μL 10 mM Tris–HCl, pH 8.0 (Sigma, T2694). Fragmented genomic DNA was run on a 0.8% agarose gel to confirm that the genomic DNA was in the correct size range (2–4 kb), data not shown.
Genomic DNA template Mix
In order to prepare the gDNA Template Mix, 1.5 μg of the purified fragmented gDNA was added to 4.7 μL 10× High-Fidelity Buffer (Invitrogen, 11304–029), 1.26 μL of MgSO4 (Invitrogen, 11304–029), 1.6 μL 10 mM dNTP (New England Biolabs, NO447S/L), 3.6 μL Betaine (Sigma, B2629-50 G), 3.6 μL of RDT Droplet Stabilizer (RainDance Technologies, 30–00826), 1.8 μL dimethyl sulfoxide (Sigma, D8418-50 ml) and 0.7 μL 5 units/μL of Platinum High-Fidelity Taq (Invitrogen, 11304–029) the samples was brought to a final volume of 25 μL with Nuclease Free Water, Teknova (Fisher, 50843418).
Merge genomic DNA template Mix and primer panel droplets: RDT 1000 instrument
PCR droplets were generated on the RDT 1000 (RainDance Technologies, Inc. USA) one droplet at a time using the manufacturer’s recommended protocol. All of the resulting PCR droplets were automatically dispensed as an emulsion into a single PCR tube and transferred to a standard thermal cycler for PCR amplification. Each sample generated an emulsion containing more than 1,000,000 droplets in which each droplet contained a single-plex polymerase chain reaction (PCR) that targeted one of the 384 targets defined by the RainDance Technologies “384 Member Primer Panel”. Each amplicon was represented by multiple unique PCR droplets (Additional file 2: Figure S1).
PCR amplification
Samples were cycled in a Bio-Rad PTC-225 thermal cycler with the following profile: 94°C for 2 min; 55 cycles at 94°C for 15 s, 58°C for 15 s, 68°C for 30 s; 68°C for 10 min and hold at 4°C.
Breaking emulsion
After PCR Amplification the emulsion of PCR droplets were broken, to release each individual amplicon from the PCR droplets, using the manufacturer’s recommended protocol.
PCR product clean Up
Each sample was purified over a MinElute column (Qiagen, 28004) following the manufactures recommended protocol. The sample was eluted off the column with 11 μL of the Qiagen Elution Buffer. The purified amplicon DNA was then run on an Agilent Bioanalyzer to confirm that the amplicon profile (Additional file 2: Figure S2 A) matches the predicted histogram distribution (Additional file 2: Figure S2 B).
SOLiD sequencing library construction and sample indexing
The sequencing libraries were constructed according to the Standard Fragment Library preparation of the SOLiD 3.0 system protocols (Applied Biosystems, USA). Six individual sequencing libraries were prepared using the enriched non-amplified gDNA PCR products and indexed with barcodes 1 to 6 (Libraries IDs 759L-BC1, 760L-BC2, 761L-BC3, 762L-BC4, 763L-BC5 and 764L-BC6, respectively). Three other libraries were prepared using the enriched WGA gDNA products and indexed with barcodes 9, 10 and 11 (Library IDs 765L-BC9, 766L-BC10 and 767L-BC11, respectively) (Figure 1 and Additional file 3: Table S1). In parallel, to test the performance of sample pooling before library preparation, equimolar portions of each of the six non-amplified gDNA enriched products (based on final concentration of each sample) were pooled together in duplicates and library IDs 768_1L and 768_2L were prepared. In total, 11 Standard Fragment Libraries were prepared (six individual gDNAs, three individual WGAs, and two pools, which comprised two duplicate pools of the six non-amplified gDNA enriched products) (details in Additional file 3: Table S1). For each library 0.5-1.0 μg enriched gDNA/WGA/pool was used.
Purification of the SOLiD libraries
After library preparation, sizing, quantification and quality control of the prepared SOLiD libraries were performed on Agilent Bioanalyzer 2100 (Agilent Technologies, Waldbronn, Germany). A typical electropherogram obtained (Additional file 2: Figure S3 A) showed a primer-dimer peak between 60 and 70 bp (~68 bp). These primer-dimers, and other unincorporated dNTPs, primers, salts and other contaminants, were removed and PCR amplicons >100 bp were recovered according to the standard procedure of Agencourt AMPure Kit (Beckman Coulter Genomics). Removal of the primer-dimers from the SOLiD library construction and sample indexing step was necessary to reduce the carryover of these products into the library. After the purification step, sizing and quantification of all the generated libraries were measured again on the Agilent 2100 Bioanalyzer (Additional file 2: Figure S3 B) shows a typical electropherogram after purification).
Pooling strategies of indexed samples
We evaluated here the performance of sample indexing and pooling before and after the emPCR step. After library purification, emPCR was carried out for three different scenarios (summarized in Figure 1): A) Nine individual emPCRs were carried out for the non-amplified gDNA’s (six reactions; emulsion IDs 759em, 760em, 761em, 762em, 763em and 764em) and the WGA gDNA’s (three reactions; emulsion IDs 765em, 766em and 767em). B) Two emPCR reactions were carried out from the two pooled samples before library construction: emulsion IDs 768_1em and 768_2em. C) Equimolar portions of five of the non-amplified gDNA libraries (IDs 759L, 760L, 761L, 762L and 764L) were collected and one emPCR was performed (emulsion ID 770em (pB)). Library ID 763L with BC 5 was not pooled with the afore-mentioned five gDNA libraries due to its low concentration, which was only appropriate for one emPCR of the individual library ID 763em. Finally, after performing all the afore-mentioned emPCR reactions, a single pool of equimolar amounts of the obtained individual emPCR products from non-amplified gDNA emulsion IDs 759em, 760em, 761em, 762em, 763em and764em was generated and processed in parallel (emulsion ID 792em (pA)). Therefore, a total of 13 emPCR reactions were performed for all the tested comparisons (details in Additional file 3: Table S1 and Additional file 2: Figure S1).
NGS using SOLiD 3.0 system platform
The sequencing was carried out on the SOLiD 3.0 system platform (Applied Biosystems/Life Technologies, MA, USA) using Standard Fragment Library protocol and lengths of 50 bp according to the manufacturer’s instructions (Applied Biosystems, MA, USA). The principle steps of the SOLiD sequencing protocol are well covered [11, 16–18]. Here each enriched product (single or pooled libraries) was run on a single well/spot of an octant slide. A total of 13 wells/spots were used (759seq, 760seq, 761seq, 762seq, 763seq, 764seq, 765seq, 766seq, 767seq, 768_1seq, 768_2seq, 770seq and 792seq), i.e. one complete octant slide (8/8) and five spots (two-thirds) of a second octant slide (5/8), for these experiments to sequence the 13 different sequencing samples (listed in Additional file 3: Table S1 and Additional file 2: Figure S1). The SOLiD platform interrogates each base by two separate probe hybridizations (i.e. two color calls), which has the advantage of improved base calling accuracy (invalid color call sequences point to sequencing errors). The color space reads are translated into (base space) fasta code during alignment to a reference genome, which is performed off the machine, on a high performance computer cluster.
Mapping the SOLiD sequencing data
SOLiD sequencing reads (50 bp) were aligned to the human genome (hg18/NCBI36) using the CLCbio Genomics Workbench (v 3.0) [19] with a custom RainDance Workflow plug-in, and in-house Perl scripts. Before assembly, the reference genome was annotated with the 384 amplicon targets, consisting of 382 discontiguous regions, using a gff file (Additional file 1: Table S2). Primer locations of non-overlapping amplicon targets were omitted from analysis as SNP detection is not possible under the primer as the PCR will generate amplicon product for these regions from the primer and not the gDNA. Default settings for reference assembly (Additional file 3: Table S3) were used. Non-unique reads were mapped randomly to possible placements. All aligned sequence reads were exported in SAM [20] format.
Sequence variant identification
Single nucleotide sequence variants were identified using the CLC bio Genomics Workbench (v 3.0), using default settings with the following exceptions: minimum allele frequency was set to 10%, minimum coverage was set to 5, maximum coverage was set to 15,000. Only SNPs that were located in amplicon target regions, excluding primer locations, were analyzed. Nucleotide coverage of known SNP positions in the target regions were extracted using a Perl wrapper in combination with Samtools 0.1.7 [20]. Samtools pileup files were combined with a Perl script. Nucleotide sequence data reported are available in the GenBank databases (the Sequence Read Archive (SRA)) under the study accession number ERP000999 [21]).
Data analysis
SNP concordance was determined by comparing HapMap genotype data (HapMap Public Release #27, merged II + III) to SNPs found using the CLC bio Genomics Workbench. Caucasian (CEU) HapMap SNPs from NA12003, NA12004, NA10838, NA11829, NA11830, and NA10856 were downloaded from [22]. SNPs detected from the SOLiD sequencing data were also compared to SNPs from dbSNP, Build 130. A Perl script was written to extract all known SNPs from the six HapMap samples.
SNPs were considered heterozygous if non-reference allele frequency was 10-90%. SNPs were considered homozygous if non-reference allele frequency was equal to or greater than 90%. For a SNP to be considered detected, the SNP had to have at least five reads per allele. Therefore a heterozygous SNP had to have at least 10 counts per position to be considered “detected”. Only detected SNPs were used for concordance calculations.