Improving mapping and SNP-calling performance in multiplexed targeted next-generation sequencing

Background Compared to classical genotyping, targeted next-generation sequencing (tNGS) can be custom-designed to interrogate entire genomic regions of interest, in order to detect novel as well as known variants. To bring down the per-sample cost, one approach is to pool barcoded NGS libraries before sample enrichment. Still, we lack a complete understanding of how this multiplexed tNGS approach and the varying performance of the ever-evolving analytical tools can affect the quality of variant discovery. Therefore, we evaluated the impact of different software tools and analytical approaches on the discovery of single nucleotide polymorphisms (SNPs) in multiplexed tNGS data. To generate our own test model, we combined a sequence capture method with NGS in three experimental stages of increasing complexity (E. coli genes, multiplexed E. coli, and multiplexed HapMap BRCA1/2 regions). Results We successfully enriched barcoded NGS libraries instead of genomic DNA, achieving reproducible coverage profiles (Pearson correlation coefficients of up to 0.99) across multiplexed samples, with <10% strand bias. However, the SNP calling quality was substantially affected by the choice of tools and mapping strategy. With the aim of reducing computational requirements, we compared conventional whole-genome mapping and SNP-calling with a new faster approach: target-region mapping with subsequent ‘read-backmapping’ to the whole genome to reduce the false detection rate. Consequently, we developed a combined mapping pipeline, which includes standard tools (BWA, SAMtools, etc.), and tested it on public HiSeq2000 exome data from the 1000 Genomes Project. Our pipeline saved 12 hours of run time per Hiseq2000 exome sample and detected ~5% more SNPs than the conventional whole genome approach. This suggests that more potential novel SNPs may be discovered using both approaches than with just the conventional approach. Conclusions We recommend applying our general ‘two-step’ mapping approach for more efficient SNP discovery in tNGS. Our study has also shown the benefit of computing inter-sample SNP-concordances and inspecting read alignments in order to attain more confident results.

The object of this table is to identify the best approach(es) from: enhanced (SAET 2.2) reads or raw off-machine reads, whole genome (WG) or target region (TR) mapping, SNP-backmapping or not. We observe the expected even distribution of concordance/overlap within samples of the same plex, and the expected deterioration of results for higher plexing. TR mapping of raw reads leads to the highest number of known SNPs, despite lower coverages, and also to the highest number of potential novel SNPs. WG mapping leads to near-zero potential novel SNPs. Backmapping reduces the total number of SNPs by about 15-20% and the number of potential novel SNPs by about a third, eliminating obvious false positives or errors. The average depth of coverage at the SNPs is higher than 100-fold throughout all samples except in the 20-plex pool, where it is higher than 20-fold. To validate the potential novel SNPs we examined inter-sample concordances (Table S20) and manually inspected reads at these positions in IGV (Table S13).

HapMap Yoruban from
Ibadan NA18507 (Y) Bioscope: Identical mapping and SNP-calling settings (see Table T1   The object of this table is to validate Table S2 and help identify the best approach(es) from: enhanced (SAET 2.2) reads or raw off-machine reads, whole genome (WG) or target region (TR) mapping, SNP-backmapping or not. This table confirms the Gold Standard results (Table S2): Concordance/overlap within samples of the same plex are evenly distributed, and the results deteriorate as expected for higher plexing. TR mapping of raw reads leads to the highest number of known SNPs, despite lower coverages, and also to the highest number of potential novel SNPs. WG mapping leads to near-zero potential novel SNPs. Backmapping reduces the total number of SNPs by about 15-20% and the number of potential novel SNPs by about a third. The average depth of coverage at the SNPs is higher than 100-fold throughout all samples except in the 20-plex pool, where it is higher than 20-fold. The sequencing error is probably much smaller than the Bioscope diBayes SNP-calling error (see Table S2, footnote ** manual inspection using IGV).

HapMap Yoruban from
Ibadan NA18507 (Y) Bioscope: Identical mapping and SNP-calling settings (see Table T1   identical to reads obtained off the SOLiD 3 machine, i.e. without QualityValue-filtering (this filtering is carried out during tertiary analyses, e.g. mapping/pairing/SNP-calling) 24.2 reads are read segments of length 24 which could be mapped with 2 mismatches; this is the shortest allowed read segment in the default mapping settings (see Table T1 in http://www.ikmb.uni-kiel.de/tngs-backmapping/bioscope_settings.xls) 49.0 reads are read segments of length 49 which could be mapped with 0 mismatches; this is the longest possible read segment and the best possible match This   In this table we summarise seven of our initial SNP-calling results based on corona lite v4.0r2.0 mapping, using the stand-alone early-access diBayes 1.1.1 tool, both of which we adapted to work on our PBS Pro cluster (corona lite is meant for Torque clusters, diBayes 1.1.1 is meant for onmachine use). The results were unpromising, esp. for reads which were mapped to the whole genome. The best results (shown here) were obtained for the diBayes setting "call.stringency=default". We abandoned this early access diBayes tool, because the new Bioscope 1.0.1 diBayes allows additional settings especially for enriched samples (see Table T1 in http://www.ikmb.uni-kiel.de/tngs-backmapping/bioscope_settings.xls).   This table comprises Tables S2 and S11. We separated barcode 4 samples from the other samples, because the SOLiD on-machine software SETS assigned only ca. 1% of the expected reads to barcode 4. After this separation, we observe the expected even distribution of concordance/overlap within samples of the same plex, and the expected deterioration of results for higher plexing. Target region (TR) mapping of raw reads leads to the highest number of known SNPs, despite lower coverages, and also to the highest number of potential novel SNPs. Whole genome (WG) mapping leads to nearzero potential novel SNPs. Backmapping reduces the total number of SNPs by 15-20% and the number of potential novel SNPs by about a third, eliminating obvious false positives or errors. To validate the potential novel SNPs we examined inter-sample concordances (Table S20) and manually inspected reads at these positions in IGV (Table S12). The sequencing error is probably much smaller than the Bioscope diBayes SNP-calling error (see Table S2, footnote ** manual inspection using IGV   SAET-enhanced reads mapped to target region and to whole genome, using Bioscope 1.0.1 (control sample: also raw off-machine reads). Settings see Table T1 in http://www.ikmb.uni-kiel.de/tngs-backmapping/bioscope_settings.xls

16-plex bc14
We manually inspected 19 mapping runs (17 separate samples) for the HapMap Yoruban individual NA18507 using IGV. We manually inspected 5 SNPs within the BRCA1/2 target region which are given in the HapMap3** data. We do not confirm the first 4 HapMap SNPs (rs28897686, rs4986848, rs28897673, rs28897731) manually, nor automatically with Bioscope diBayes, nor in the whole genome NGS*** data. We resequenced the first 3 SNPs with the Sanger method, confirming errors in HapMap. We did not resequence the 4th erroneous HapMap SNP with the Sanger method, because we discovered this error later, and it falls into the same pattern as the first 3 errors. It should be noted that the HapMap website estimates a false positive SNP rate of 12%-14% in their data (June 2012). The last SNP (rs206123) is not called by any diBayes analysis, but our manual inspections confirm the presence of this variant for SAET-enhanced reads as well as for raw off-machine reads, and for target region mapping as well as for whole genome mapping.    Table S19). The right table shows the output from NextGENe V2. Concordant genotypes are highlighted in green. NextGENe called 12 of our 15 Gold Standard non-reference genotypes (Table S19). Within this overlap of 12/15 (80%), the genotype concordance is 12/12 (100%). The object of this table is to provide supporting data for Tables S2 and S3 to identify the best approach(es) from: enhanced (SAET 2.2) reads or raw off-machine reads, whole genome (WG) or target region (TR) mapping, SNPbackmapping or not. This table confirms the Gold and Silver Standard results (Tables S2 and S3): Concordance/overlap within samples of the same pool are evenly distributed, and the results deteriorate as expected for higher plexing. TR mapping of raw reads leads to the highest number of known SNPs, despite lower coverages, and also to the highest number of potential novel SNPs. WG mapping leads to near-zero potential novel SNPs. Backmapping reduces the total number of SNPs by 15-20% and the number of potential novel SNPs by about a third. SNP concordance is near 100% for the non-barcoded control and for the 4-plex samples. SNP overlap is significantly lower than for the Yoruban, which we suspect is due to the significant false positive rate in the HapMap 3 data (see estimate of 12%-14% on the HapMap website). The sequencing error is probably much smaller than the Bioscope diBayes SNPcalling error (see Table S2, footnote ** manual inspection using IGV). Page 21 S12_Novel_SNP_validation To validate potential novel SNPs in silico, we narrowed down the SNP-calls (after SNP-backmapping) for each sample: We considered SNPs to be confirmed if they were shared by several samples within the same plex. From these confirmed SNPs we subtracted the SNPs known in dbSNP130, and from the remaining SNPs we subtracted the SNPs known from whole genome sequencing (Illumina, SOLiD). The remaining SNPs are potential novel SNPs, subject to detailed manual inspection in the target region mapping files and the whole genome mapping files. For Illumina reads and the BWA mapper, we have automated this last manual inspection step by extracting all reads under the SNPs which were detected in the target-region-mapping-run, and then mapping just these reads to the whole genome (read-backmapping) to eliminate mapping-artefacts (our pipeline is available from http://www.ikmb.uni-kiel.de/tngs-backmapping/).

Manual Inspection of potential novel SNPs in the IGV Viewer
Tables (a) and (b) show the list of potential novel SNPs validated in silico by inter-sample intra-plex concordances (see Table S12). The green highlighted cells in tables (a) and (b) show SNPs which are called both from the SAET-enhanced reads and from the raw off-machine reads of the same 4-plex1 libraries. Table (c) shows manual inspection results in IGV for the Yoruban control without bar codes mapped to the whole genome, for the 4-plex1-BC05 Yoruban mapped to the target region, and for the Yoruban control mapped to the target region: Yellow denotes the whole genome mapped genotypes and the single target region mapped genotype which is concordant; blue denotes the confidently different target region mapped genotypes; grey denotes two roughly concordant target region mapped genotypes. We consider the grey and yellow highlighted target region SNPs unlikely. We also consider SNPs unlikely, if less than 20% of reads call a non-reference base. We thus rule out 6 SNPs, leaving 5 remaining possible novel heterozygous SNPs.
Backmapped potential novel SNPs from non-SAET reads mapped to the target region. (Known SNPs from dbSNP130 from and SOLiD+Illumina whole genome NGS are eliminated.)

of mapping and SNP-calling times for Proposed Combined Targetregion Mapping + Whole Genome Read-Backmapping (TR) vs. Conventional Whole-Genome-Mapping (WG)
Step/program step total step total step total step total step total step total step total step total step total 1_bwa_aln 0:50 0:

Average Median
This table shows strand bias in the coverage. We used Bedtools 2.9.0 with the option genomeCoverageBed. The BAM-files were generated from Bioscope 1.0.1 target-region mapped sam files using samtools 0.1.8. It should be mentioned that Bedtools was unable to process the whole-genome mapped BAM files. The table shows a fairly low median bias of strand coverage towards the plus strand of 6.7%.   Variant  base  muttype  chr13  31787968 rs206119  G  A  hom  chr13  31787968 rs206119  G  A  hom  chr13  31788026 rs9562605  C  T  het  chr13  31788026 rs9562605  C  T  het  chr13  31791791 rs9534174  A  G  hom  chr13  31788227  G  T  het  chr13  31793377 rs206123  g  C  hom  chr13  31790885  G  A  het  chr13  31809463 rs1799944  A  G  het  chr13  31791704  A  G  het  chr13  31811055 rs206075  A  G  hom  chr13  31791791 rs9534174  A  G  hom  chr13  31824654 rs11571699  T  G  het  chr13  31791823  A  G  het  chr13  31825894 rs9943876  C  T  het  chr13  31791908  A  G  het  chr13  31834646 rs9534262  T  C  hom  chr13  31793377 rs206123  g  C  hom  chr13 31851388 In this table we analyse the concordance of barcoded samples with the non-barcoded sample (control). Concordance/overlap within samples of the same plex are evenly distributed, and the overlap deteriorates in a similar way for increased plexing as in Tables S2-S3 and S11. For the target region (TR) mapping approach, ca. 12%-36% of SNPs (column "%rest") in a barcoded sample do not occur in the non-barcoded control; this would be the upper bound for false positives. The upper bound is ca. 12%-20% for the 4-plexes and increases as expected for higher plexing. For whole genome (WG) mapping, the upper bound for false positives is mostly lower than 10%, in most cases in fact near 1%. The sequencing error is probably much smaller than the Bioscope diBayes SNP-calling error (see Table S2, footnote ** manual inspection using IGV) Bioscope: Identical mapping and SNP-calling settings (see Table T1    identical to reads obtained off the SOLiD 3 machine, i.e. without QualityValue-filtering (this filtering is carried out during tertiary analyses, e.g. mapping/pairing/SNP-calling) 24.2 reads are read segments of length 24 which could be mapped with 2 mismatches; this is the shortest allowed read segment in the default mapping settings (see Table T1 in http://www.ikmb.uni-kiel.de/tngs-backmapping/bioscope_settings.xls) 49.0 reads are read segments of length 49 which could be mapped with 0 mismatches; this is the longest possible read segment and the best possible match This table summarises mapped reads and coverages in detail for all samples. Uniform distributions between barcodes within each pool are achieved for mapped reads, enrichment factors, coverages, and covered bases. For higher plexing, coverages and covered bases are thinner, much as expected. Notably, the 20-plex is significantly below expectations and the 4-plex2 is significantly above expectations.