Optimizing T-NGS methods to achieve unbiased performance with both non-amplified and whole-genome amplified (WGA) gDNA samples is necessary to enable statistically powered large-scale studies. Even if routine sequencing of the human exome and whole genome becomes affordable in the near future, multiplexed T-NGS will continue to serve alongside it as a downstream validation and diagnostic tool, reaching sufficient coverage with less sequencing and thus enabling the investigation of a larger number of samples in parallel. More specifically, T-NGS provides more power to catalogue most types of variation present in clinically relevant subsets of the genome at a fraction of the cost of whole-genome sequencing. Therefore, mixed approaches, for instance whole-genome sequencing at low coverage followed by T-NGS at higher coverage, may become more widely applicable in the coming years.
In the present study, we tested the performance of combining a microdroplet-based PCR target-enrichment approach with NGS and demonstrated its utility for T-NGS of 384 cancer exons distributed across the genome in a series of comparisons. We evaluated the performance of standard (non-amplified) gDNA versus whole-genome amplified (WGA) gDNA samples, and tested different sample indexing strategies within the sequencing workflow, namely samples pooled before library preparation (without sample indexing) versus individual samples, and sample indexing before versus after emPCR using SOLiD molecular barcodes. The results of T-NGS of 384 cancer exons demonstrate a sequencing specificity of 66.0% ± 3.4%, uniformity (coverage at 0.2× of the mean) of 85.6% ± 0.6%, concordance of 99.5% ± 0.4% and no Mendelian inheritance errors. We also demonstrated the ability to detect minor allele frequencies within pools of six non-barcoded, non-amplified gDNA samples. These results show that WGA gDNA samples can be processed at nearly the same performance as standard non-amplified gDNA samples, without significant allelic bias or differences in enrichment metrics and variation detection between non-pooled and pooled samples. For instance, 95% of the targeted regions in the six HapMap samples tested (two different HapMap trios) were covered at ≥20× while maintaining a 99.5% average genotype concordance across all samples.
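As a minimal illustration of the trio-based quality checks reported above, the following Python sketch computes genotype concordance against reference (e.g. HapMap) calls and counts Mendelian inheritance errors in a father-mother-child trio. The genotype dictionaries and positions are toy examples, and this is not the analysis pipeline used in the study.

```python
# Illustrative sketch: genotype concordance and Mendelian-error check for a trio.
# Genotypes are unphased allele pairs, e.g. ("A", "G"); positions are (chrom, pos).

def concordance(called, reference):
    """Fraction of shared positions where the called genotype matches the
    reference (e.g. HapMap) genotype, ignoring allele order."""
    shared = [p for p in called if p in reference]
    matches = sum(sorted(called[p]) == sorted(reference[p]) for p in shared)
    return matches / len(shared) if shared else float("nan")

def mendelian_consistent(child, father, mother):
    """A child genotype is consistent if one allele can come from the father
    and the other from the mother."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

def count_mendelian_errors(child_gt, father_gt, mother_gt):
    errors = 0
    for pos in child_gt:
        if pos in father_gt and pos in mother_gt:
            if not mendelian_consistent(child_gt[pos], father_gt[pos], mother_gt[pos]):
                errors += 1
    return errors

# Toy example: two positions, no Mendelian errors expected.
child = {("chr1", 100): ("A", "G"), ("chr2", 200): ("C", "C")}
father = {("chr1", 100): ("A", "A"), ("chr2", 200): ("C", "T")}
mother = {("chr1", 100): ("G", "G"), ("chr2", 200): ("C", "C")}
print(count_mendelian_errors(child, father, mother))  # prints 0
```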
The results show that combining the applied target-enrichment approach with NGS technology provides several advantages. This T-NGS approach leverages the sensitivity and specificity of the PCR to efficiently capture and represent the sequence context of different genomic regions. The stringency and flexibility in allocating primers to the targets of interest are required to tackle complex genomic regions of high homology (pseudogenes) and repetitive elements [24–26]. The uniformity of sequence coverage across all targeted regions in the indexed samples allows efficient use of NGS sequencing capacity; with this panel, typically more than 85% of bases fall within 0.2× of the mean coverage. This level of performance allows for predictable sequence coverage beyond what is reported in Table 1 given the appropriate amount of sequencing per sample (see additional coverage analysis in Additional file 3: Table S4). The utilized NGS platform, with its high throughput, allows accurate SNP detection, as argued previously and shown here. Inadequate enrichment and/or coverage depth, which was not the case in our study, can cause failure to detect real nucleotide variation (Table 1; Additional file 4: Table S5), which may lead to higher false-negative rates, in particular for heterozygotes [28, 29]. In fact, comparable coverage depth and uniformity for the tested 384 exons have been achieved using other NGS technology platforms, 454 FLX and Illumina (Additional file 3: Table S6). As illustrated in Table 2, parallel detection of both homozygous and heterozygous genotypes points to the efficiency of the tested workflow and indicates that little, if any, allelic bias has been introduced during the sample enrichment and sequencing processes.
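The uniformity metric used here (fraction of targeted bases covered at ≥0.2× of the mean) and the ≥20× completeness threshold can be computed directly from per-base coverage over the targeted regions. The sketch below is a simplified illustration that assumes a per-base depth array is already available; it is not the exact calculation used in the study.

```python
# Simplified sketch: coverage uniformity and completeness over targeted bases,
# assuming `depths` holds per-base read depth for every targeted position.

def coverage_metrics(depths, uniformity_factor=0.2, min_depth=20):
    mean_depth = sum(depths) / len(depths)
    threshold = uniformity_factor * mean_depth
    uniformity = sum(d >= threshold for d in depths) / len(depths)
    completeness = sum(d >= min_depth for d in depths) / len(depths)
    return mean_depth, uniformity, completeness

depths = [0, 15, 40, 80, 120, 95, 60, 33, 22, 18]  # toy per-base depths
mean_depth, uniformity, completeness = coverage_metrics(depths)
print(f"mean={mean_depth:.1f}, uniformity(0.2x)={uniformity:.2f}, >=20x={completeness:.2f}")
```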
One exception to the observed even distribution of all tested barcoded samples was the lower representation of samples indexed with BC4. We initially suspected that this might be due to inaccurate quantification and pooling, i.e. lower representation of BC4 in the original libraries before pooling/enrichment, caused either by a pipetting error or by an incorrect DNA concentration measurement during library preparation. Surprisingly, however, we also observed substantial underrepresentation of samples indexed with BC4 in other multiplex experiments using a different sample enrichment technique, hybridization-based sequence capture, in three successive runs. We therefore consider BC4 an outlier and recommend avoiding it in future experiments. In addition, the relatively high SNP false discovery rate (10-40%; Table 3) of the non-barcoded technical replicates (samples 768_1L and 768_2L), although it appears to depend to some extent on coverage depth, indicates that pooling enriched gDNA samples before library construction may not be a sensible option with the current T-NGS approach. We tested this pooling option mainly to assess the reproducibility of the established workflow (Figure 3). In general, these results emphasize the need to achieve higher coverage to reduce the SNP false discovery rate. Applying a similar pooling option to enriched WGA samples will likely result in worse performance, owing to a potential cumulative bias from the different DNA amplification reactions. Instead, we recommend pooling the samples before or after emPCR in the course of the NGS protocol (see Table 2 for more details).
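A simple way to flag such outlier barcodes is to compare per-barcode read counts against the expectation of an even pool. The sketch below is illustrative only: the read counts are invented and the 50%-of-expected threshold is an assumption, not part of the original workflow.

```python
# Illustrative sketch: flag underrepresented barcodes in a pooled run.
# `read_counts` maps barcode name -> number of reads assigned to that barcode.

def underrepresented_barcodes(read_counts, min_fraction_of_expected=0.5):
    """Return barcodes whose read share is below `min_fraction_of_expected`
    times the even-pool expectation (1 / number_of_barcodes)."""
    total = sum(read_counts.values())
    expected_share = 1.0 / len(read_counts)
    flagged = {}
    for barcode, count in read_counts.items():
        share = count / total
        if share < min_fraction_of_expected * expected_share:
            flagged[barcode] = share
    return flagged

# Toy example: BC4 falls clearly below the ~16.7% expected for a 6-plex pool.
read_counts = {"BC1": 5.4e6, "BC2": 6.1e6, "BC3": 5.8e6,
               "BC4": 1.2e6, "BC5": 5.9e6, "BC6": 6.0e6}
print(underrepresented_barcodes(read_counts))  # {'BC4': ~0.04}
```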
Another clear limitation of our study, which needs to be acknowledged, is the relatively low percentage of reads mapped to the human genome (Table 1). The percentage of mapped reads depends on several factors, such as the NGS technology platform, the target-enrichment approach, the mapping strategy, and the software and analytical approaches used. Using the same target region and RainDance amplicon library on different NGS technology platforms resulted in markedly different read yields and mapping percentages to the human genome (Table 1 and Additional file 3: Table S6). While the long-read NGS platform (454 FLX) generated a lower total number of reads (219,876), the percentage of mapped reads was as high as 84%. In contrast, only up to 40% of the 5 to 44 million reads produced by the short-read NGS platforms (SOLiD and Illumina) mapped to the human genome. This might indicate that shearing of the enriched PCR amplicons during preparation of the sequencing libraries for the short-read NGS platforms (Additional file 2: Figure S3 A and B) generates fragments that map ambiguously to several locations in the genome (i.e. off-targets). In fact, CLC bio and other mapping tools, such as BWA, randomly assign these ambiguous reads to one of the ‘possible’ locations. In addition, we recently found that the quality of SNP calling was substantially affected by the choice of mapping strategy and analytical tools. For example, we showed that the tested widely used SNP callers do not seem to be well tuned to handle enrichment data, and thus produced a significant fraction of false-positive as well as false-negative SNP calls. Moreover, changing the mapping settings or using another software version can result in different enrichment metrics. This observation was confirmed here using a newer version of the CLC bio Workbench software (version 5.1), which, for instance, improved the mapping specificity to up to 56% (for details see Additional file 3: Table S4). The drop in mapping percentage, while apparently worrisome, can still be acceptable as long as it does not impact the accuracy of downstream SNP calling. In other words, achieving high and even coverage at the intended bases (vertical coverage; ≥20×) together with a good on-target percentage (completeness or horizontal coverage; 60-80%), as shown in our study, ensures accurate SNP calling.
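In practice, the mapping percentage and the contribution of ambiguously mapped reads can be examined directly from the alignment file. The sketch below uses pysam; the BAM file name, the target intervals, and the MAPQ ≥ 20 cut-off for excluding ambiguously placed reads are all hypothetical choices for illustration, not the settings used in this study.

```python
# Hypothetical sketch using pysam: mapping rate, mapping-quality filtering of
# ambiguous reads, and on-target fraction over known amplicon intervals.
import pysam

# Illustrative amplicon intervals (0-based, half-open), as would be loaded
# from the primer-library design.
targets = {"chr17": [(7571720, 7578811)], "chr13": [(32889611, 32890664)]}

def on_target(read, targets):
    if read.reference_end is None:
        return False
    for start, end in targets.get(read.reference_name, []):
        if read.reference_start < end and read.reference_end > start:
            return True
    return False

total = mapped = confidently_mapped = on_tgt = 0
with pysam.AlignmentFile("sample.bam", "rb") as bam:  # hypothetical file name
    for read in bam:
        total += 1
        if read.is_unmapped:
            continue
        mapped += 1
        if read.mapping_quality >= 20:  # drop ambiguously placed reads
            confidently_mapped += 1
            if on_target(read, targets):
                on_tgt += 1

print(f"mapped: {mapped / max(total, 1):.1%}, "
      f"MAPQ>=20: {confidently_mapped / max(total, 1):.1%}, "
      f"on-target: {on_tgt / max(confidently_mapped, 1):.1%}")
```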
The “barcoding” approach holds promise for sequencing large numbers of samples and allows the identification of rare and novel variations in the intended loci, as well as of variant carriers, post-sequencing. Comparing the results of multiplexed barcoded samples pre- and post-emPCR revealed similar performance for both schemes (Table 1, Table 2 and Additional file 2: Figure S6). We therefore recommend pooling samples before emPCR to save money, time, and effort. Indeed, the best design would be to index samples and pool them before enrichment. Testing pre-enrichment sample multiplexing was unfortunately not feasible with the applied target-enrichment method, given the limited length of the sequence reads generated by the utilized SOLiD platform (version 3.0). Taking 50 bp sequencing reads as an example, we would expect at least 5-10% of the sequencing capacity to be spent sequencing only the PCR primers (mean length of 20 bases). Sample pooling before target enrichment is possible using array/hybridization-based sequence capture methods, and no loss in sequencing capacity is expected, since the binding oligos/probes remain fixed on the array and only the enriched genomic products are eluted from the array and sequenced. The array-based sequence capture approach may, however, be limited with regard to the selection of complete genomic regions, owing to repeat masking before the design of certain capture probes.
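One way to arrive at a figure in that range is to relate the primer bases per amplicon to the mean amplicon length implied by the panel totals given below (172,805 targeted bases over 384 amplicons). The back-of-the-envelope sketch that follows assumes two 20-base primers per amplicon; it is an illustrative estimate, not the authors' exact calculation.

```python
# Back-of-the-envelope sketch (assumptions: two 20-base primers per amplicon,
# amplicon totals taken from the panel figures reported in the text).
total_amplicon_bases = 172_805   # sum of amplicon lengths in the 384-plex panel
n_amplicons = 384
primer_length = 20               # mean primer length (bases)
primers_per_amplicon = 2

mean_amplicon_length = total_amplicon_bases / n_amplicons
primer_bases_per_amplicon = primer_length * primers_per_amplicon
lost_fraction = primer_bases_per_amplicon / mean_amplicon_length
print(f"mean amplicon ~{mean_amplicon_length:.0f} bp, "
      f"capacity spent on primer bases ~{lost_fraction:.1%}")  # roughly 9%
```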
As a final point, because the exact start and stop positions of each amplicon within the primer library are known, the amount of sequencing required for a given sample can be calculated precisely for any required level of coverage. For the 384-amplicon library used in this study, the sum total of amplicon bases was 172,805. The amount of sequencing required to achieve 100× average coverage, which results in >85% of bases represented at a minimum coverage of 20×, is therefore 17,280,500 bases. The 6× pool (792L, BCs 1–6) yielded 33,169,196 total reads, of which 7,221,473 were on target, corresponding to 361,073,648 on-target bases per octet. This would have allowed up to 20 samples to be pooled per octet, or 160 samples per flow cell and 320 samples per run on the SOLiD v3 system (Additional file 3: Table S7). Another practicable scenario to improve coverage for larger target regions would be to use a lower degree of multiplexing (a 2-, 3-, 4-, or 5-plex scheme) on a quarter of a sequencing slide (quad; 4-well slide) instead of on an eighth of a slide (octet; 8-well slide). Such a strategy would provide more sequencing capacity and consequently higher coverage, because it reduces the loss in total read output inherent to physically partitioning the slide into more compartments. In addition, the recent rapid improvement of NGS technologies would in principle allow a higher level of sample multiplexing, bringing the per-sample cost down further.
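The capacity calculation above can be restated as simple arithmetic. In the sketch below, a 50 bp read length is assumed for converting on-target reads to bases, and the scaling to flow cell and run levels (8 octets per flow cell, 2 flow cells per run) mirrors the figures quoted in the text for the SOLiD v3 system.

```python
# Sketch of the capacity calculation described above (50 bp reads assumed).
target_bases = 172_805                    # total amplicon bases in the panel
desired_mean_coverage = 100               # gives >85% of bases at >=20x in this panel
bases_needed_per_sample = target_bases * desired_mean_coverage  # 17,280,500

on_target_reads_per_octet = 7_221_473     # observed for the 6-plex pool (792L, BCs 1-6)
read_length = 50                          # bp, assumed
on_target_bases_per_octet = on_target_reads_per_octet * read_length

samples_per_octet = on_target_bases_per_octet // bases_needed_per_sample
samples_per_flow_cell = samples_per_octet * 8   # 8 octets per flow cell
samples_per_run = samples_per_flow_cell * 2     # 2 flow cells per SOLiD v3 run
print(samples_per_octet, samples_per_flow_cell, samples_per_run)  # 20 160 320
```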