We compared standard post-capture multiplexing with pre-capture multiplexing for TGE of a relatively small genomic region of 521,647 bp that comprises all known non-syndromic deafness and Usher syndrome genes, as described previously [3]. We performed TGE on a set of 96 DNA samples: 48 unknowns, 39 positive controls with 44 Sanger-sequence-verified deafness-causing mutations, and 9 negative controls. The samples were captured in two ways for comparison, as shown in Figure 1: (1) 96 samples were prepared using standard post-capture multiplexing TGE and run in a single Illumina HiSeq lane (post-capture 96), and (2) pools of 12 or 16 samples (pre-capture 12 and pre-capture 16) were pre-capture multiplexed with barcodes added during ligation and sequenced in groups of 96 in a single Illumina HiSeq lane (8 pools of 12 samples for pre-capture 12; 6 pools of 16 samples for pre-capture 16). TGE for post-capture samples was performed with the Agilent SureSelect XT version 1 protocol, while TGE for pre-capture samples was performed with the Agilent SureSelect XT version 2 protocol, which is optimized for pre-capture pooling with regard to hybridization, amplification and capture wash conditions (see Methods). A single sample in the post-capture 96 set failed sequencing QC and was excluded from further analysis.
Total reads and read mapping
The total number of sequencing reads was significantly lower in the pre-capture lanes than in the post-capture lane (Tukey’s post-hoc ANOVA p=0.004 and p=0.02 for pre-capture 12 and pre-capture 16, respectively); the two pre-capture lanes were not significantly different from each other. On average, 97.4%, 93.3%, and 91.6% of sequencing reads mapped for post-capture, pre-capture 12, and pre-capture 16, respectively (Figure 2A). The percent of mapped reads was significantly lower for the pre-capture lanes (independent samples t-test p < 0.01 for both pre-capture 12 and 16); however, in all cases greater than 90% of reads mapped, which in our experience constitutes a high-quality sequencing run.
Capture efficiency and duplicate reads
We define capture efficiency as the percent of all mapped reads that overlap the targeted regions. In our experience using SureSelect TGE with standard post-capture multiplexing, capture efficiency ranges from ~40% for small target regions (<200 kb) to ~80% for large target regions (50 Mb, i.e. the exome) (data not shown). On average, the capture efficiency for post-capture samples was 68.7% (Figure 2B). This was significantly higher (p < 0.01) than the average for both pre-capture 12 samples (45.3%) and pre-capture 16 samples (37.1%). The difference between pre-capture 12 and pre-capture 16 was also significant (p < 0.01). The average percentages of duplicate reads and of optical duplicate reads also differed significantly between all three methods (p < 0.01 in all cases), as shown in Figure 2C: 12.6% and 4.7% (post-capture), 7.1% and 1.8% (pre-capture 12), and 5.8% and 1.5% (pre-capture 16) for duplicate and optical duplicate reads, respectively.
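The definition of capture efficiency above can be illustrated with a minimal sketch. It assumes reads and targets are given as (chromosome, start, end) tuples with half-open coordinates; the function names and the simple linear scan are illustrative assumptions and are not taken from the analysis pipeline used in this study.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def capture_efficiency(mapped_reads, targets):
    """Percent of mapped reads that overlap at least one targeted region.

    mapped_reads and targets are iterables of (chrom, start, end) tuples
    (a hypothetical representation chosen for this sketch).
    """
    mapped_reads = list(mapped_reads)
    if not mapped_reads:
        raise ValueError("at least one mapped read is required")

    # Group targets by chromosome so each read is only compared to nearby regions.
    by_chrom = defaultdict(list)
    for chrom, start, end in targets:
        by_chrom[chrom].append((start, end))

    on_target = sum(
        1
        for chrom, start, end in mapped_reads
        if any(overlaps(start, end, t_start, t_end)
               for t_start, t_end in by_chrom.get(chrom, ()))
    )
    return 100.0 * on_target / len(mapped_reads)
```

For real data an interval tree or a dedicated tool would be more appropriate than the linear scan used here; the sketch is only meant to make the definition concrete.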
Coverage performance
Depth of coverage represents the number of sequencing reads aligned over a sequenced base pair. On average, all methods of multiplexing showed 1X coverage (Figure 3A) that was not significantly different: an average of 99.8%, 99.6%, and 99.4% for post-capture, pre-capture 12, and pre-capture 16, respectively. Coverage at 10X (Figure 3A) was greater than 94% with all methods of multiplexing, but lowest for pre-capture 16 (97.4%, 96.9%, and 94.9% for post-capture, pre-capture 12, and pre-capture 16, respectively). Coverage for pre-capture 16 samples was significantly lower at 1X, 10X, and 20X than for both post-capture and pre-capture 12 samples (p < 0.01 in all cases). Average depth of coverage (Figure 3B) was significantly reduced for pre-capture 12 (average = 196X) and pre-capture 16 (average = 136X) samples when compared with post-capture samples (average = 227X) (p < 0.01 in all cases).
Coverage distribution differed among multiplexing methods (Figure 4). Post-capture multiplexing gave the broadest distribution of coverage, with increasingly sharply peaked distributions for pre-capture 12 and pre-capture 16 multiplexing. When comparing the different methods of multiplexing, we identified no systematic differences in coverage of targeted regions that could not be accounted for by differences in total numbers of reads.
To obtain a depth of coverage similar to that of post-capture samples when performing pre-capture 16 multiplexing, fewer pools of 16 could be sequenced in a single lane to compensate for reduced capture efficiency. To test this hypothesis, we randomly sub-sampled sequencing reads from the pre-capture 16 lane to simulate a total of 5, 4, and 3 pools of 16 samples (80, 64, and 48 samples, respectively) per lane (Figure 3A and B). We validated this simulation technique by simulating data for 6 pools of 16 and found that the results were not significantly different from the actual data: average depth of coverage was 125X and 136X for simulated and actual data, respectively; coverage at 1X, 10X, and 20X was 99.5%, 96.5%, and 89.8% for simulated data and 99.7%, 94.9%, and 87.0% for actual data. Our simulations for fewer pools per lane showed that coverage levels approached post-capture averages when the number of pools was reduced by one: coverage for 80 samples in 5 pools of 16 (80/5) was 99.7% at 1X, 97.6% at 10X and 92.6% at 20X (Figure 3). When the number of pools was further reduced, results surpassed post-capture averages.
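As an illustration of the two kinds of coverage metric reported above (percent of targeted bases covered at or above a threshold, and average depth), the following sketch computes both from a list of per-base depths. The input representation and the function name are assumptions made for this example, not a description of the software used here.

```python
def coverage_summary(depths, thresholds=(1, 10, 20)):
    """Summarize per-base depths over the targeted bases of one sample.

    depths: sequence of non-negative integers, one per targeted base
            (hypothetical input format for this sketch).
    Returns (mean_depth, breadth), where breadth maps each threshold to the
    percent of targeted bases covered by at least that many reads.
    """
    n = len(depths)
    if n == 0:
        raise ValueError("depths must contain at least one base")

    mean_depth = sum(depths) / n
    breadth = {
        t: 100.0 * sum(1 for d in depths if d >= t) / n
        for t in thresholds
    }
    return mean_depth, breadth
```

Applied to every sample in a lane, the per-sample values would then be averaged to give lane-level figures comparable to those quoted above.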
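Random sub-sampling of reads, in general terms, can be sketched as follows. This is a generic illustration only: the mapping from a sub-sampling fraction to a simulated number of pools per lane is not specified in the text, so the `fraction` parameter, the seed handling, and the function name are all assumptions.

```python
import random


def subsample_reads(reads, fraction, seed=0):
    """Return a random subset of reads without replacement.

    reads:    list of read records (any objects)
    fraction: proportion of reads to keep, between 0 and 1 (assumed parameter)
    seed:     seed for the random number generator, for reproducibility
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")

    rng = random.Random(seed)
    keep = round(fraction * len(reads))
    # Sample indices rather than reads so the original read order is preserved.
    chosen = sorted(rng.sample(range(len(reads)), keep))
    return [reads[i] for i in chosen]
```

Fixing the seed makes a simulated data set reproducible, which is convenient when a simulation is validated against actual data as described above.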
Variant detection
We used 96 unique DNA samples in three sets of 96 captures each. In each set of samples, 48 samples were from persons with presumed genetic hearing loss of an unknown cause; the results from these samples will be reported elsewhere. The other 48 samples in each set were either positive controls carrying deafness-causing mutations verified by Sanger sequencing or negative controls (Additional file 1: Table S1 and Additional file 2: Table S2). The exact composition of samples per set differed between the post-capture and pre-capture sets due to sample limitations, but there was a similar composition of types of control mutations in each set. The post-capture positive control samples included 43 mutations: 5 small deletions, 2 large deletions, 27 missense mutations (34 heterozygous and 4 homozygous), 4 mitochondrial mutations, and 3 splice site mutations (all heterozygous). The two pre-capture multiplexing lanes contained the same 44 positive control mutations: 3 small deletions, 2 large deletions, 31 missense mutations (35 heterozygous and 2 homozygous), 6 mitochondrial mutations, and 3 splice site mutations (all heterozygous).
The average number of variants identified within the targeted regions of interest for post-capture, pre-capture 12, and pre-capture 16 samples was 601, 511, and 509 variants, respectively. When normalized by total number of sequencing reads per sample, there was no significant difference between methods (p = 0.642 and p = 0.677 for post-capture compared with pre-capture 12 and pre-capture 16, respectively). All positive control variants were identified in all three lanes using our variant calling and annotation pipeline (see Methods), with the exception of those in the sample from the post-capture lane that failed at sequencing (Additional file 1: Table S1 and Additional file 2: Table S2). No pathogenic variants were identified in the negative controls. We examined allelic balance for heterozygous positive control variants and found no significant difference between post-capture, pre-capture 12, and pre-capture 16 samples (variant reads/total reads average [standard deviation] was 0.48 [0.02], 0.47 [0.02], and 0.47 [0.02], respectively).
We performed a systematic analysis to examine the possibility that artifacts or sequencing errors associated with molecular barcoding could lead to erroneous sample assignment (chimeras) and misdiagnosis. Within each lane, we searched the un-annotated (“raw”) variant call format (VCF) files for positive control variants from any sample that were also present in any other sample in that lane, even at low observation rates. We did not find any evidence of sample-switching due to aberrant barcode identification in any lane. We did find that a single individual was an incidental carrier of a disease-causing mutation in the MYO7A gene (p.Ala397Asp); this variant was found in that individual in all three lanes and was confirmed with Sanger sequencing.
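The allelic balance summary above (variant reads divided by total reads, reported as average and standard deviation) can be sketched as follows. The text does not state whether a sample or population standard deviation was used; the sample standard deviation is assumed here, and the function names are illustrative.

```python
from statistics import mean, stdev


def allelic_balance(variant_reads, total_reads):
    """Fraction of reads at a site that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def summarize_allelic_balance(observations):
    """Mean and sample standard deviation of allelic balance.

    observations: list of (variant_reads, total_reads) pairs, one per
                  heterozygous control variant; at least two are required.
    """
    ratios = [allelic_balance(v, t) for v, t in observations]
    return mean(ratios), stdev(ratios)
```

For a heterozygous variant the expected allelic balance is close to 0.5, so values near that figure in all three lanes are consistent with unbiased capture of both alleles.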
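The within-lane search described above can be sketched as a set comparison. This assumes each variant is keyed by (chromosome, position, reference allele, alternate allele) and that raw calls have already been parsed from the VCF files into one set per sample; the data structures and names are hypothetical and do not describe the actual scripts used.

```python
def find_cross_sample_hits(control_variants, raw_calls):
    """Find control variants that also appear in other samples of the same lane.

    control_variants: dict mapping sample ID -> set of known control variant keys
    raw_calls:        dict mapping sample ID -> set of variant keys seen in that
                      sample's raw (un-annotated) calls, including low-frequency ones
    Returns a sorted list of (variant, source_sample, other_sample) tuples.
    """
    hits = []
    for source, variants in control_variants.items():
        for variant in variants:
            for other, calls in raw_calls.items():
                if other == source:
                    continue
                # A sample known to carry the same control variant is not a hit.
                if variant in control_variants.get(other, set()):
                    continue
                if variant in calls:
                    hits.append((variant, source, other))
    return sorted(hits)
```

A hit from such a search is only a candidate for barcode misassignment: it may equally reflect a true carrier, which is why independent confirmation such as Sanger sequencing is needed before drawing a conclusion.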
Advantages of pre-capture multiplexing
Pre-capture multiplexing was associated with reduced capture efficiency, which in turn reduced average depth of coverage but did not negatively impact variant detection. Significant savings in cost and time were associated with pre-capture multiplexing. Costs for XTv2 kits are 15% lower than for XTv1 kits. Other costs associated with library preparation, including consumables, are difficult to quantify but can be estimated as 1/16th or 1/12th of the cost of post-capture TGE. As an example, each SPRI-bead purification during the protocol costs approximately $5.76. Using pre-capture 16 multiplexing, samples are pooled for three of these purifications, whereas in post-capture TGE each sample is purified individually. For 96 samples, the cost simply for purification of each of these samples individually three times (288 purifications, as per the post-capture TGE method) is $1,658.88. Pre-capture 16 multiplexing would reduce the number of purifications required to 6 pools three times (18 purifications), and the cost is therefore $103.68, a cost reduction of 93.75%. The same calculation yields a cost of $138.24 for pre-capture 12 multiplexing and a corresponding cost reduction of 91.7%. Similar reductions in cost can be assumed for other reagents and consumables.
In our hands, TGE requires 6 hours for pre-hybridization steps and 4 hours for post-hybridization steps. Post-capture multiplexing introduces barcodes at the final amplification step, and pooling is completed immediately prior to sequencing, which does not reduce hands-on time. Pre-capture multiplexing occurs prior to hybridization, and therefore 12X or 16X as many samples can be hybridized and captured in the same amount of time. Therefore, though difficult to quantify, hands-on time is substantially reduced with pre-capture multiplexing.
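The purification cost arithmetic above can be reproduced with a short calculation. The per-purification cost, the three pooled purification steps, and the sample and pool counts come from the text; the function names are illustrative, and decimal arithmetic is used only to avoid floating-point rounding in currency values.

```python
from decimal import Decimal

COST_PER_PURIFICATION = Decimal("5.76")  # approximate cost in USD, from the text
PURIFICATION_STEPS = 3                   # purifications performed on pooled samples


def purification_cost(units):
    """Cost of purifying `units` tubes (samples or pools) at each of the steps."""
    return units * PURIFICATION_STEPS * COST_PER_PURIFICATION


def percent_reduction(baseline, reduced):
    """Percent cost reduction of `reduced` relative to `baseline`."""
    return (baseline - reduced) / baseline * 100


samples = 96
post_capture = purification_cost(samples)          # 288 purifications
pre_capture_16 = purification_cost(samples // 16)  # 6 pools, 18 purifications
pre_capture_12 = purification_cost(samples // 12)  # 8 pools, 24 purifications

for label, cost in [("pre-capture 16", pre_capture_16),
                    ("pre-capture 12", pre_capture_12)]:
    print(label, cost, round(percent_reduction(post_capture, cost), 2))
```

Because the per-purification cost cancels out, the percent reduction depends only on the pool size: 1 - 1/16 for pools of 16 and 1 - 1/12 for pools of 12.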