Improving mapping and SNP-calling performance in multiplexed targeted next-generation sequencing
- Abdou ElSharawy†1,
- Michael Forster†1,
- Nadine Schracke2,
- Andreas Keller3,
- Ingo Thomsen1,
- Britt-Sabina Petersen1,
- Björn Stade1,
- Peer Stähler2,
- Stefan Schreiber1, 4,
- Philip Rosenstiel1 and
- Andre Franke1Email author
© ElSharawy et al.; licensee BioMed Central Ltd. 2012
Received: 10 November 2011
Accepted: 10 August 2012
Published: 22 August 2012
Compared to classical genotyping, targeted next-generation sequencing (tNGS) can be custom-designed to interrogate entire genomic regions of interest, in order to detect novel as well as known variants. To bring down the per-sample cost, one approach is to pool barcoded NGS libraries before sample enrichment. Still, we lack a complete understanding of how this multiplexed tNGS approach and the varying performance of the ever-evolving analytical tools can affect the quality of variant discovery. Therefore, we evaluated the impact of different software tools and analytical approaches on the discovery of single nucleotide polymorphisms (SNPs) in multiplexed tNGS data. To generate our own test model, we combined a sequence capture method with NGS in three experimental stages of increasing complexity (E. coli genes, multiplexed E. coli, and multiplexed HapMap BRCA1/2 regions).
We successfully enriched barcoded NGS libraries instead of genomic DNA, achieving reproducible coverage profiles (Pearson correlation coefficients of up to 0.99) across multiplexed samples, with <10% strand bias. However, the SNP calling quality was substantially affected by the choice of tools and mapping strategy. With the aim of reducing computational requirements, we compared conventional whole-genome mapping and SNP-calling with a new faster approach: target-region mapping with subsequent ‘read-backmapping’ to the whole genome to reduce the false detection rate. Consequently, we developed a combined mapping pipeline, which includes standard tools (BWA, SAMtools, etc.), and tested it on public HiSeq2000 exome data from the 1000 Genomes Project. Our pipeline saved 12 hours of run time per HiSeq2000 exome sample and detected ~5% more SNPs than the conventional whole genome approach. This suggests that more potential novel SNPs may be discovered using both approaches than with just the conventional approach.
We recommend applying our general ‘two-step’ mapping approach for more efficient SNP discovery in tNGS. Our study has also shown the benefit of computing inter-sample SNP-concordances and inspecting read alignments in order to attain more confident results.
Correlations between genotype and phenotype variations have traditionally been studied by determining the genotype of known markers. For example, genome-wide association studies (GWAS) have revealed associations of known common variants with several common diseases. But these associations typically explain less than 25% of the heritable risk estimated for each of those diseases [1, 2]. This is a serious limitation for complex diseases, which are often genetically heterogeneous. Existing GWAS data suggest that rare alleles also have a significant influence on common diseases. Therefore, targeted resequencing of suspected exonic, intronic, and intergenic loci mapped by GWAS and linkage studies is the next logical step toward identifying the entire underlying genetic variation and its disease relevance.
Target enrichment methods rely either on PCR or sequence-specific nucleic acid hybridization, and each method has unique advantages and disadvantages [5–7]. The combination of sample enrichment methods and next-generation sequencing (NGS) pipelines is an effective analysis approach, but it also raises some important questions: How accurate is the sequence variation discovery within the targeted NGS (tNGS) data? What is the effect of the varying performance of the rapidly evolving alignment and analytical tools? Can a sample multiplexing approach reduce the analysis and study costs? To address these questions, as a test model, we combined a hybridization-based sample enrichment method (Febit biomed GmbH, Germany) with an NGS platform (SOLiD, Life Technologies, USA). With sample mix-up control and enrichment cost reduction in mind, we evaluated the enrichment of barcoded SOLiD libraries (individual and pooled), rather than preparing NGS libraries after enrichment [8–10]. Individual samples are indexed by inserting unique nucleotide signatures, the barcodes, and are then pooled together before enrichment so that the pooled DNA samples can be sequenced en bloc. Sample identity is then re-established during the bioinformatic processing of the reads after sequencing. Sample barcoding has the advantage – in contrast to a simple pooling approach – that genotypes can be assigned at an individual level and that even rare variants can be identified with high sensitivity [4, 5, 8, 11–14]. We tested the applicability of the evaluated tNGS method on E. coli, as a simple model to optimize the process, and human genomic samples in three experimental stages of increasing scope and complexity, culminating in a SNP concordance evaluation for the BRCA1 and BRCA2 cancer genes.
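Sample identity is re-established bioinformatically from the sequenced barcodes. A minimal sketch of this demultiplexing step is given below; the 4-base barcode table and the one-mismatch tolerance are illustrative assumptions, not the actual SOLiD decoding scheme.

```python
# Minimal demultiplexing sketch: assign each read to a sample by its
# barcode. The barcode table is hypothetical.
BARCODES = {"ACGT": "sample1", "TGCA": "sample2"}

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(barcode_read, max_mismatches=1):
    """Return the sample name for a sequenced barcode, or None if the
    barcode is unknown or matches more than one sample ambiguously."""
    hits = [sample for bc, sample in BARCODES.items()
            if hamming(barcode_read, bc) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None
```

Reads whose barcode cannot be assigned unambiguously are typically discarded rather than guessed, so a pool of well-separated barcodes (pairwise Hamming distance > 2 × the allowed mismatches) is essential.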
To identify a suitable analytical approach in terms of computational time and accuracy of sequence variant detection in tNGS data, we carefully analyzed the data using different software and tools. We found that SNP detection depends strongly on the chosen analytical tools and settings, rather than on key enrichment measures such as high and uniform coverage, the percentage of reads mapped on target, or an adequate enrichment fold. Unexpectedly, all evaluated tools failed to identify a large proportion of true sequence variations (false negative SNPs). Whole-genome (WG) mapping/SNP-calling was time-consuming, but with the benefit of few false positive SNPs. Target region (TR) mapping/SNP-calling was faster and yielded more true SNPs than WG mapping, but led to a high proportion (up to 50%) of likely false positive SNPs. Target region mapping can force-map reads from other genomic loci into the target, leading to false positive SNPs and requiring a postprocessing cleanup. To benefit from both methods, we here report a novel ‘two-step’ mapping approach that starts with TR mapping/SNP-calling, followed by backmapping only the SNP-supporting reads to the WG. Because of the recent wide interest in whole exome sequencing, we also applied our approach to public human exome enrichment data generated by Illumina HiSeq instruments. Moreover, our report includes detailed practical instructions, such as validating SNPs by computing inter-sample SNP-concordances between multiplexed technical replicates, filtering for novel SNPs, or establishing and evaluating a tNGS method.
Pre-enrichment multiplexing of NGS libraries of BRCA1/2 genes and data analysis
Genotype concordance and overlap analysis in the BRCA1/2 experiments
For the Yoruban sample we observed a good SNP overlap (15/15) and reached a concordance of 14/15 through visual inspection in IGV. Bioscope called only 13/15 SNPs with the chosen settings and also with many alternative settings (Bioscope_settings.xls at ). Our manual inspection of the mapped reads at the false negative SNP positions clearly showed a non-reference allele consensus, i.e. a SNP (Additional file 3: Figure S3 and Additional file 2: Table S8). The CLC bio analysis (sample “4plex-1 bc05”, a sample from the first 4-plex with barcode 5) called 15/15 SNPs with a concordance of 13/15 (or 15/15 if the minor allele threshold for heterozygosity was manually adjusted) (Additional file 2: Table S9). The NextGENe analysis called 12/15 SNPs (Additional file 2: Table S10). For the Chinese individual (Additional file 2: Table S1 and Table S11), the Bioscope genotype concordance rate was near 100% for the 4-plex samples and the control sample, but the Bioscope SNP overlap was significantly lower than for the Yoruban samples. This, we assume, was partially due to the considerable false positive rate in the HapMap 3 data. Finally, we analyzed the Yoruban samples to distinguish potential novel SNPs from false positives or sequencing errors. We eliminated about 15-20% of SNPs as false positives using SNP-backmapping (see Additional file 1 and Additional file 2: Table S2). Then we eliminated about 20-30% of SNPs in the 4-plexes as unconfirmed using inter-sample concordance checks between technical replicates (see Additional file 1 and Additional file 2: Table S12). For one sample and one bioinformatics strategy (4-plex1bc05, SAET read enhancement, Bioscope TR mapping, Bioscope SNP-calling), we manually inspected the remaining SNPs to assess more accurately the false positive SNP calling rate after our filtering process: Bioscope called 75 SNPs, of which we removed 13 SNPs (17%) by backmapping the SNP and its flanking 25 reference bases to the WG. 
We then eliminated 22 SNPs (29%) after inter-sample validation within the same multiplex, considering concordant genotypes shared by several samples valid and rejecting the rest as potential false positives. We eliminated 34 SNPs (45%) that were known in dbSNP130. We eliminated one further SNP (1.3%) that was known in our ‘silver’ consensus (see Methods for the definition and details). We inspected the remaining 11 SNPs (15%) in the IGV viewer (Additional file 2: Table S13a), leaving 5 SNPs (8%) as potential novel SNPs and estimating the upper bound for false positives called by Bioscope at 6 SNPs (9%), and the upper bound for false positives after manual inspection at 5 SNPs (8%). We repeated this procedure for the non-SAET reads that resulted in 15 potential novel SNPs (Additional file 2: Table S13b). Finally, we identified the most likely potential novel SNPs (Additional file 2: Table S13c), i.e. those that were present in the SAET-reads and the raw reads, and subjected these to Sanger sequencing. As controls, we also selected some unlikely novel SNPs and a known SNP for sequencing (Additional file 2: Table S13c). We could not validate the potential novel SNPs.
Figure 3 (and Additional file 2: Tables S2, S3 and S5) summarizes our different bioinformatics strategies and SNP-calling results for the same 26 indexed Yoruban samples and the non-barcoded Yoruban control. We explored whether TR mapping increases SNP-calling sensitivity compared to WG mapping, expecting a higher rate of false positive SNPs for TR mapping as a trade-off. The first four groups in Figure 3 show automatic SNP-calling results using Bioscope without any manual inspection. The automatic detection of known SNPs (our internal ‘silver’ standard) worked best for the non-barcoded control and the 4-plex. There was very little difference in sensitivity between TR and WG mapping. However, SAET read-enhancement slightly decreased the Bioscope detection rate for known SNPs, for example from 48/66 SNPs to 44/66 SNPs in the 4-plex using TR mapping. This was surprising because SAET led to an increase in the mapping coverage. After TR mapping (first two groups in Figure 3), SNP-backmapping, and manual inter-sample genotype concordance computation, we obtained 3–15 potential novel SNPs (for instance, 15 candidate SNPs for the SAET-enhanced reads in the 4-plexes). For WG mapping (third and fourth groups in Figure 3), we obtained at most one potential novel SNP. Groups five and six in Figure 3 show results from mapping with Bioscope and SNP-calling with SAMtools mpileup, for genotype quality thresholds of 0, 9, 19 and 29. The sensitivity of SAMtools’ SNP-calling is slightly inferior to that of Bioscope. SAMtools also called more false positives than Bioscope. Bioscope mapping and pibase genotyping (seventh group) clearly yielded the best results. A subjective manual inspection in IGV (eighth and ninth groups) revealed evidence or at least some traces of all known SNPs in the Yoruban control (original reads mapped to TR) and the 4-plex1bc05 (original reads and SAET-enhanced reads mapped to TR).
But it also yielded more discordant SNPs than pibase.
Fourth stage: read-backmapping approach for public exome data
We here present a novel and more specific ‘two-step’ mapping/SNP-calling approach, which can speed up the analysis of a human exome sample by 12 hours on an 8-core compute node, and which can be applied to any targeted enrichment sample/region. We evolved this approach when we analyzed targeted enrichment runs and found that conventional mapping and SNP-calling takes up a wasteful amount of time and computing resources.
In our targeted enrichment experiments, we first evaluated a scalable multiplexed protocol for high-coverage tNGS to investigate cost saving and quality control potential. We subjected barcoded (indexed) SOLiD libraries instead of gDNA to one selected microarray-based sequence capture method. We tested this pre-enrichment sample multiplex approach by sequencing 68 E. coli genes as well as two human cancer genes (BRCA1 and BRCA2) in three independent stages and different multiplexing folds on the SOLiD system. We achieved good and reproducible coverage profiles for the TRs across most of the different multiplexed samples, enriching human exonic as well as intronic regions with less than 10% strand bias (Additional file 2: Table S17). Nevertheless, our enrichment design successfully captured only 54% of human BRCA1/2 regions. This weakness of the tested enrichment design would also apply to other hybridization-based sequence capture bait designs that subject targets to repeat masking before probe design. For this reason, longer capture baits and iterative refinement of the bait design would be required for such genomic regions with low complexity. To evaluate the SNP discovery performance we then analyzed our sequencing data employing different mapping and analytical tools, which have rapidly evolved within the last two years (see Figure 2). We found that SNP detection in enriched regions - even at high coverages - depends strongly on the tools and their settings. These SNP-callers, including the widely used SAMtools, do not seem to be well-tuned to handle enrichment data, and thus produced a significant fraction of false positive and negative SNP calls (Figure 3). To partially overcome this common mapping/SNP-calling problem we proposed to combine the advantages of WG mapping (lower false positive SNP detection rate) with TR mapping (faster processing with higher SNP detection rate) (Figure 3) and developed our novel ‘two-step’ mapping approach.
In this approach we first mapped raw reads to the TR to achieve faster mapping and SNP-calling, due to the smaller reference sequence, and to detect more SNPs, due to mapping more reads into low-complexity regions and obtaining higher coverages. We followed the first step with a SNP-cleanup mapping step to reduce false positives: In this step, only the SNP-supporting reads and their paired mates are mapped back to the WG (Additional file 3: Figure S4). Our novel approach resulted in more valid SNPs being detected and a more than two-fold speed-up of the time-consuming exome analysis.
Inadequate enrichment and/or coverage can prevent the detection of real nucleotide variants, leading to higher false negative rates, particularly for heterozygotes [21, 22]. In general, 20-fold coverages are deemed necessary for reliable sequence variation calling in data from the 454 platform, Illumina Genome Analyzer, and SOLiD [24–26]. Other studies recommend a sequence coverage higher than 30× to minimize the risk of failing to detect true SNPs, and at least 33× to enable the correct genotyping of most of the heterozygous positions. These high coverage thresholds are backed by a simulation of the SNP detection performance at the NOD2 gene, which is associated with Crohn’s disease; the detection performance fell rapidly when the achieved coverage was below 40×. But a high coverage is not the only prerequisite for accurate detection of sequence variation. Rather, SNP detection seems to be significantly affected by the chosen alignment tools and SNP callers, as revealed from the results of our analyses (discussed further below). This is also in agreement with the results of a comparative analysis of different alignment tools, which showed that there was a disturbingly low level of agreement between genome alignments produced by different tools. It concluded that it was not possible to make definitive qualitative statements concerning the alignment tools, as there are distinct trade-offs in their behaviours. Indeed, the 1000 Genomes Project Consortium reported that their NGS genotyping accuracy at heterozygous sites was 95% and higher in some regions, dropping off to 70%-80% or lower in “harder to access regions of the genome”. To avoid false positives, the novel SNPs published by the consortium are the consensus of two or more independent groups, sequencing platforms, and pipelines.
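The coverage recommendations above can be motivated by a simple binomial sampling model: assuming both alleles of a heterozygote are sequenced with equal probability and a caller requires some minimum number of reads per allele, the detection probability rises steeply with depth. The sketch below uses a four-read-per-allele threshold purely as an illustrative assumption; it is not a parameter of any of the cited tools.

```python
from math import comb

def p_het_detect(depth, min_alt=4, p=0.5):
    """Probability that both alleles of a heterozygous site are each
    observed in at least `min_alt` reads, under a binomial model where
    each read samples either allele with probability `p` / (1 - p)."""
    total = 0.0
    # Count outcomes where the first allele is seen k times and the
    # second allele (depth - k times) also meets the threshold.
    for k in range(min_alt, depth - min_alt + 1):
        total += comb(depth, k) * p**k * (1 - p)**(depth - k)
    return total
```

Under this toy model, detection probability climbs from roughly two thirds at 10× to well above 99% at 20×, and is essentially certain at 40×, which is consistent with the 20-40× recommendations cited above.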
Our experimental results (Figure 3) show that the enrichment and sequencing led to highly covered bases and the greatest number of SNP calls for the non-barcoded controls and the 4-plexes. They also illustrated that all SNP-calling tools performed weakly, with the exception of pibase and ‘manual’ inspection of aligned reads in IGV. Depending on the SNP-caller, WG mapping is generally characterized by a lower false positive and a slightly higher false negative SNP detection rate. The reverse holds true for TR mapping. Accordingly, combining both mapping approaches may help to rescue undetected true SNPs, and filter false positive SNP-calls. Regarding SNPs in non-repeat-masked regions, we feel confident that TR mapping with read-backmapping to the WG is an accurate and reliable method, because we discovered 4 genotyping errors in the HapMap data (21% of the non-reference genotypes) within our TRs. We supported this conclusion by Sanger sequencing (Additional file 3: Figure S5). When comparing our BRCA1/2 SNPs with the HapMap3 SNP chip data, we detected up to 4.4× more known SNPs than have been published as HapMap chip data.
SNP-callers were originally optimised for low or moderate variations in coverage. In other words, extremely high coverages or high coverage gradients are currently challenging from the bioinformatics point of view. Even for non-enriched samples, some degree of automatic or manual postprocessing is required specifically to distinguish SNPs from misalignments. A previous study confirms our findings and emphasizes that manual inspection is an essential part of the analysis. As large-scale manual inspection is unfeasible and prone to subjective errors, we use the pibase software (manuscript under review) for interrogating BAM files, re-typing SNPs, and other analysis tasks. As shown in our BRCA1/2 and exome analyses, the re-typing of SNPs in targeted enrichment experiments is indispensable (for example, using pibase). We further recommend, if cost allows, duplicating samples (which is easily done in a multiplexed experiment) and validating computed SNPs by inter-sample concordance checks between these technical replicates.
The bioinformatic processing of eukaryotic and metagenomic sample data sets can occupy a compute cluster for days to weeks. For multiplexed targeted enrichment of human samples (1536 samples can be sequenced in a single SOLiD run), the genome-sized data traffic may overload the compute cluster when too many samples are processed in parallel. Compared to BioScope, other pipelines such as BWA alleviate this problem, but a complete exome run using BWA, SAMtools and Picard nevertheless takes about a full day. Our aim was to cut the computational time significantly by mapping to the small TR (100 kb of BRCA1/2 regions, or 61 MB of exome regions, compared to 3 Gb of human genome) and cleaning up force-mapped reads. For exomes, we cut run time by more than half, enabling overnight runs on an 8-core node, or turnaround within a working day (4–5 hours) on a 16-core node. This run time cut is necessary to match the speed of new high-throughput platforms such as the Illumina HiSeq2500, which takes 27 hours to produce 120 Gb (see Additional file 2: Table S18 for details on potential energy saving calculations).
We successfully demonstrated our novel time-saving ‘two-step’ mapping approach using Illumina HiSeq2000 human exome data from the 1000 Genomes Project. This approach consists of TR mapping with subsequent SNP-cleanup by read-backmapping to the WG. We developed this approach after designing targeted enrichment experiments and experiencing an odyssey of run-time and SNP-detection problems when we used a wide range of mapping and SNP-calling tools. We recommend our approach and the employed tools (BWA, SAMtools, Picard, VCFtools, pibase) for a reliable and efficient analysis of exome and targeted enrichment data. To attain confident results, we specifically recommend the SNP-validation between duplicated samples and a final in silico validation using recently developed software and/or manual inspection.
The study design for our tNGS test model is shown in Figure 1. In our test model, we began the procedure by constructing the SOLiD sequencing fragment libraries (with and without barcodes) and pooling the desired libraries. Then, for each individual sample or pool we selected and enriched the TRs using the HybSelect sequence capture technology ([31–33]; Febit biomed GmbH, Heidelberg, Germany). This included three main steps: hybridization of the genomic DNA library to a Geniom biochip containing target-specific DNA capture probes, washing off non-captured DNA fragments, and elution of the captured fragments. Then, we resequenced the enriched products using the SOLiD 3.0 system platform (Life Technologies, Foster City, CA, USA). Details for each step (first, second, and third experimental stages) are fully described in the Additional file 1. Based on these experiments, we finally developed a novel ‘two-step’ mapping/SNP-calling approach, which we demonstrated on human exome data from the 1000 Genomes Project.
SOLiD sequence analysis (stages 1–3)
We performed read mapping and SNP calling with SOLiD Corona Lite and later switched to Bioscope (a pre-release version of v1.2), after the manufacturer advised that the Corona Lite SNP-caller (pre-release diBayes) was not designed for enriched samples (because of high coverages, steep coverage gradients and/or PCR duplicates). Our Bioscope mapping and SNP-calling settings (Bioscope plan and ini files) are fully documented on the project homepage. For SNP-calling we followed the manufacturer’s recommendations for enriched samples and set ‘call.stringency = high_coverage’ and ‘coverage.iqr.het.high = 10000’. These settings tell the SNP-caller that the coverage is ‘high’, that SNPs must be supported by reads on both strands, and that very-high-coverage SNPs are not artifacts. The filtering of base qualities and reads (low quality reads, duplicate reads) is carried out internally by the Bioscope SNP-caller using its own methods, and is reported in a ‘consensus call output file’.
SOLiD reference sequences
We mapped the 50-mer reads using two separate mapping strategies for each sample: WG mapping (i.e. the normal procedure) and TR mapping (i.e. the computationally faster procedure). For E. coli WG mapping we used the E. coli K-12 MG1655 genome (GenBank U00096; Refseq NC_000913) as our mapping reference. For human WG mapping we used the NCBI36/hg18 assembly as our mapping reference. For TR mapping we constructed a single FASTA reference sequence consisting of greater TRs, each separated by a block of 50 Ns (i.e. taking into account the read length of 50). Greater TRs consisted of the actual TRs, extended by 49 bases and merged with adjacent TRs if an overlap occurred. The reason for defining greater TRs was to map reads which partially overlap actual TRs. For E. coli we targeted 68 genes with a total length of 90 kb. For the human samples we targeted 198 small regions with a total length of 90 kb in two genes (the intended TRs were reduced from 165 kb to 90 kb by repeat masking, to enable the design of efficient capture probes).
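The construction of the greater TRs (extend each target by 49 bases on both sides, merge any regions that then overlap, and concatenate the region sequences with 50-N spacers) can be sketched as follows. The interval coordinates are hypothetical, and fetching the actual bases from the reference genome is left out.

```python
def build_greater_trs(regions, ext=49):
    """Extend each target region by `ext` bases on both sides and merge
    regions that then overlap or touch. `regions` is a list of
    (start, end) intervals on one chromosome, 0-based, end-exclusive.
    Returns the merged 'greater TR' intervals."""
    extended = sorted((max(0, s - ext), e + ext) for s, e in regions)
    merged = [list(extended[0])]
    for s, e in extended[1:]:
        if s <= merged[-1][1]:              # overlap/contiguity: merge
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [tuple(m) for m in merged]

def join_with_ns(region_seqs, gap=50):
    """Concatenate region sequences into one TR reference, separated by
    blocks of Ns at least as long as the read length, so that no read
    can map across a region boundary."""
    return ("N" * gap).join(region_seqs)
```

The N-spacer length is chosen to equal the read length (50 here), which guarantees that a 50-mer cannot align spanning two unrelated regions.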
SOLiD coverage and technical reproducibility analyses
We wrote R scripts for the coverage analyses and plots, using the mapping statistics files and the base-wise coverage files generated by Corona Lite and Bioscope. Our coverage metrics included percentage on-target reads (specificity), average depth of coverage (ADoC), covered TR bases (completeness), percentage TR bases at various coverage depths (2×, 5×, 15×, etc.), on-target versus off-target coverages (AUC), and enrichment factor (EF, see Additional file 1, formula (1)). For visualizing the coverage uniformity between all barcoded samples of the same plex and comparing with non-barcoded samples, we generated a zoomed-in multi-sample coverage plot for each TR (see Additional file 3: Figure S6). To analyze the technical reproducibility of coverages between samples we created a scatter plot and computed Pearson’s coefficient of correlation for pairs of samples, and summarized these coefficients for all samples in the form of a correlation matrix (Additional file 3: Figure S7).
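Pearson's coefficient for a pair of per-base coverage profiles is a one-liner in R (`cor`), which is what the analysis above used; the standalone sketch below only illustrates the computation on two hypothetical coverage vectors.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length per-base coverage
    profiles. Returns a value in [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Note that Pearson's r measures the similarity of coverage *shape*, not absolute depth: two samples whose profiles differ only by a constant scaling factor still correlate at 1.0, which is exactly the property wanted when comparing technical replicates sequenced to different depths.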
Consensus SNP list for SNP-calling benchmark in the SOLiD BRCA1/2 experiments
As validated references for our own SNP-calling, we created consensus SNP lists for the Yoruban HapMap individual NA18507. We created our ‘gold’ consensus from the HapMap3 data (), WG Illumina NGS data (file: pgYh1.txt.gz at ), WG SOLiD NGS data (file: Yoruban_snp_18x.gff at ), and by Sanger resequencing, resulting in 15 non-reference genotypes in the TR (Additional file 2: Table S19). This eliminated 4 of 19 HapMap SNPs (21%), which is broadly in line with the HapMap3 website (28 June 2012) estimate of a 12%-14% false positive rate. We created our ‘silver’ consensus of 66 non-reference genotypes from the WG Illumina data and the WG SOLiD data. For the Beijing Han Chinese HapMap individual NA1856 our benchmark SNP list was the HapMap3 SNP list with 28 non-reference genotypes (Additional file 2: Table S19), due to the unavailability of ‘highly’ covered WG data (see Results/Discussion sections in the Additional file 1).
SNP cleanup by ‘read-backmapping’
To clean up SNPs obtained after our fast TR mapping approach, we initially tested a ‘SNP-backmapping’ approach (see Results and Additional file 1). This did not eliminate all potential false positive SNPs (Figure 3). We then developed our ‘read-backmapping’ approach to eliminate all force-mapped reads: First, only those reads (and, if present, their paired mates) that cover SNPs are mapped to the WG, which identifies reads that map better or equally well to a different locus. Then, each SNP is re-typed using only those reads that remain mapped uniquely over the SNP. The read-backmapping method corresponds exactly to the WG mapping, except that only the SNP-covering reads (and their paired mates, if appropriate) are mapped (and not all initial reads). To implement this method, we wrote Python scripts and utilized existing freely available standard tools. We based our implementation on the current standard alignment format BAM and the current standard genomic region format BED. Our implementation is not applicable to SOLiD Corona Lite, SOLiD Bioscope v1.0.1 or pre-release v1.2, as these pipelines do not generate BAM files. As a note, BioScope and its successor LifeScope are not freely available. Therefore, we here show results for freely available data and pipelines from the 1000 Genomes Project.
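The re-typing logic of this cleanup step can be illustrated with a toy model: a read is kept only if its alignment at the target locus scores strictly better than its best hit anywhere else in the whole genome, and a SNP left with no supporting reads is rejected as a force-mapping artifact. The per-read data model below is hypothetical and stands in for real BAM records and aligner scores.

```python
def retype_snp(reads):
    """Re-type one SNP after read-backmapping. Each read is a dict:
      'base'             - the base this read shows at the SNP position
      'target_score'     - alignment score at the TR locus
      'best_other_score' - best alignment score anywhere else in the WG
    Reads that map better or equally well elsewhere are discarded.
    Returns the sorted allele list, or None if the SNP loses all
    support (i.e. it was caused by force-mapped reads)."""
    kept = [r['base'] for r in reads
            if r['target_score'] > r['best_other_score']]
    if not kept:
        return None
    return sorted(set(kept))
```

Discarding ties (`target_score == best_other_score`) is deliberate: a read that maps equally well to two loci cannot be attributed to either, so it should not support a genotype call.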
Stage four: Illumina data from the 1000 Genomes Project
To evaluate the performance of our backmapping pipeline we used publicly available sequence data of a HapMap CEU family trio (ID 1463) that had been repeatedly sequenced within the 1000 Genomes Project. As representative exome examples, we chose two runs submitted by the Broad Institute () where the genomic DNA of the father and his daughter had been prepared as paired-end exome libraries (Hybrid Selection) and sequenced on an Illumina HiSeq 2000. To test our approach on genomic data, we selected a third run submitted by the Broad Institute, where a paired-end WG library of the mother’s gDNA had been sequenced on an Illumina Genome Analyzer II. The data are publicly available from the sequence read archive: female NA12878 (SRR098401), her father NA12891 (SRR098359), and her mother NA12892 (SRR032860). More details on the Illumina runs are given in Additional file 2: Table S14.
Conventional mapping and SNP-calling (whole genome reference)
We aligned the reads to the human genome reference NCBI GRCh37 using BWA and converted the SAM output to BAM format with SAMtools. Then we removed duplicates with Picard. Finally, we called variants using SAMtools with the option “mpileup –E” for higher sensitivity but lower specificity.
Mapping and SNP-calling (exome reference)
The exome-based reference sequence for the TR mapping was created as follows: We downloaded a file of exonic regions, represented by the compatible Consensus Coding Sequence (CCDS) file ‘CCDS.20110907.txt’ at . We then converted the CCDS file into BED format with a self-written Python script, resulting in 287,287 lines – chromosome, start, and stop for each exon. Next, we used another self-written Python script to extract the reference bases for each BED region from the WG reference GRCh37. The script included a padding of ± 50 bases to account for reads that only partly originated from an exon. The script also merged the overlapping regions, resulting in 183,410 sequences, and concatenated the sequences into a single sequence using 50 Ns as a buffer between sequence regions. The same script generated a text file with the coordinate transformation table between the regions of interest in the TR coordinates and the original WG coordinates. Finally, we mapped the reads of the three individuals against this new reference and called SNPs in the TR, using the conventional mapping and SNP-calling pipeline described above for the WG reference.
Backmapping of SNP-supporting reads
We extracted, without duplication, the reads covering heterozygous SNPs from the BAM file. As the reads had been sequenced from paired-end libraries, we also extracted the corresponding mate from the BAM file and created new FASTQ files containing these reads and their (possibly unmapped) mates. We implemented the read-extraction process in a self-written shell script which used VCFtools, SAMtools, and Picard. Finally, we aligned the extracted reads to the WG reference GRCh37, converted formats, and removed duplicate reads using the steps and programs previously described.
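The read-selection logic can be sketched independently of the actual VCFtools/SAMtools/Picard commands. The data model below (read name mapped to the aligned intervals of its two mates) is a simplification for illustration; a real implementation walks BAM records.

```python
def select_snp_reads(alignments, snp_positions):
    """Select read pairs for backmapping to the whole genome.
    `alignments` maps a read name to the (start, end) intervals of its
    mates (0-based, end-exclusive). A pair is selected if either mate
    covers any heterozygous SNP position; both mates are then extracted
    together so the aligner can re-map them as a proper pair."""
    snps = sorted(snp_positions)
    selected = set()
    for name, mates in alignments.items():
        if any(s <= p < e for (s, e) in mates for p in snps):
            selected.add(name)
    return selected
```

Extracting the mate even when it does not itself cover a SNP matters: paired-end re-mapping uses the insert-size constraint, which helps decide whether the SNP-covering read truly belongs at the target locus.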
Final SNP typing
Using a self-written Python script and the coordinate transformation file (see above), we transformed the coordinates in the SNP-list from the exome reference coordinate system to the genomic coordinate system. We divided the SNP-list into heterozygous SNPs and homozygous SNPs. Because the heterozygous SNPs can be artifacts resulting from the alignment of non-exonic reads to the exome reference, we extracted the corrected genotype at the SNP-coordinates from the backmapped WG BAM file using the pibase software.
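The coordinate transformation amounts to a lookup in the transformation table written when the TR reference was built. The sketch below uses hypothetical table rows and coordinates; a position falling inside an N-spacer between regions has no whole-genome image.

```python
def tr_to_wg(tr_pos, table):
    """Transform a 0-based position on the concatenated TR reference
    back to whole-genome coordinates. Each table row is
    (tr_start, length, chrom, wg_start): the region begins at tr_start
    on the TR reference and at wg_start on chromosome chrom. Positions
    inside the N-buffers between regions return None."""
    for tr_start, length, chrom, wg_start in table:
        if tr_start <= tr_pos < tr_start + length:
            return (chrom, wg_start + (tr_pos - tr_start))
    return None
```

With the table sorted by `tr_start`, a production version would use binary search (e.g. `bisect`) instead of a linear scan, since exome tables contain on the order of 10^5 regions.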
We thank Yang Cheng (Febit biomed GmbH) for data analysis. We thank Markus Schilhabel, Sandra May, Lena Bossen, Catharina von der Lancken (all ICMB), Stefanie Heinze, Dr. Valesca Boisguerin, Nathalie Kefer, Haiguo Wu (Febit biomed GmbH), Alfred Wagner (University Kiel HPC center), Dr. Richard Dixon (LifeTech bioinformatics support), Rupert Yip and Aditya Balineni (Bioscope software development), and Marianne Oerum (CLC bio) for technical assistance. This study was supported by the German Ministry of Education and Research (BMBF) through the National Genome Research Network (NGFN) and the IRN-F project of the DFG Cluster of Excellence ‘Inflammation at Interfaces’. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 201418 (READNA) and n° 262055 (ESGI). Part of this READNA budget was originally awarded to Life Technologies and then, by mutual agreement with the EU Commission, transferred to the Christian-Albrechts-University Kiel. Sequence data have been submitted to the NCBI Sequence Read Archive: http://www.ncbi.nlm.nih.gov/sra under the study number SRP004231.
- Franke A, McGovern DP, Barrett JC, Wang K, Radford-Smith GL, Ahmad T, Lees CW, Balschun T, Lee J, Roberts R, et al: Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat Genet. 2010, 42: 1118-1125. 10.1038/ng.717.
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al: Finding the missing heritability of complex diseases. Nature. 2009, 461: 747-753. 10.1038/nature08494.
- Garber K: Fixing the front end. Nat Biotechnol. 2008, 26: 1101-1104. 10.1038/nbt1008-1101.
- Prabhu S, Pe’er I: Overlapping pools for high-throughput targeted resequencing. Genome Res. 2009, 19: 1254-1261. 10.1101/gr.088559.108.
- Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ: Target-enrichment strategies for next-generation sequencing. Nat Methods. 2010, 7: 111-118. 10.1038/nmeth.1419.
- Mertes F, Elsharawy A, Sauer S, van Helvoort JM, van der Zaag PJ, Franke A, Nilsson M, Lehrach H, Brookes AJ: Targeted enrichment of genomic DNA regions for next-generation sequencing. Brief Funct Genomics. 2011, 10: 374-386. 10.1093/bfgp/elr033.
- Voelkerding KV, Dames SA, Durtschi JD: Next-generation sequencing: from basic research to diagnostics. Clin Chem. 2009, 55: 641-658. 10.1373/clinchem.2008.112789.
- Craig DW, Pearson JV, Szelinger S, Sekar A, Redman M, Corneveaux JJ, Pawlowski TL, Laub T, Nunn G, Stephan DA, et al: Identification of genetic variants using bar-coded multiplexed sequencing. Nat Methods. 2008, 5: 887-893. 10.1038/nmeth.1251.
- Nijman IJ, Mokry M, van Boxtel R, Toonen P, de Bruijn E, Cuppen E: Mutation discovery by targeted genomic enrichment of multiplexed barcoded samples. Nat Methods. 2010, 7: 913-915. 10.1038/nmeth.1516.
- Varley KE, Mitra RD: Nested Patch PCR enables highly multiplexed mutation discovery in candidate genes. Genome Res. 2008, 18: 1844-1850. 10.1101/gr.078204.108.
- Erlich Y, Chang K, Gordon A, Ronen R, Navon O, Rooks M, Hannon GJ: DNA Sudoku–harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Res. 2009, 19: 1243-1253. 10.1101/gr.092957.109.
- Hajirasouliha I, Hormozdiari F, Sahinalp SC, Birol I: Optimal pooling for genome re-sequencing with ultra-high-throughput short-read technologies. Bioinformatics. 2008, 24: i32-i40. 10.1093/bioinformatics/btn173.
- Mir KU: Sequencing genomes: from individuals to populations. Brief Funct Genomic Proteomic. 2009, 8: 367-378. 10.1093/bfgp/elp040.
- ten Bosch JR, Grody WW: Keeping up with the next generation: massively parallel sequencing in clinical diagnostics. J Mol Diagn. 2008, 10: 484-492. 10.2353/jmoldx.2008.080027.
- The SOLiD Spectral Analysis Enhancement Tool. http://solidsoftwaretools.com/gf/project/saet/
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.
- Forster M*, Forster P*, ElSharawy A*, Hemmrich G, Kreck B, Wittig M, Thomsen I, Stade B, Barann M, Ellinghaus D, Petersen BS, May S, Melum E, Schilhabel M, Keller A, Schreiber S, Rosenstiel P, Franke A: From next-generation sequencing alignments to accurate comparison and validation of single nucleotide variants: the pibase software. Nucleic Acids Res. 2012, 10.1093/nar/gks836 (accepted 12 August 2012). *Joint first authors. [http://www.ikmb.uni-kiel.de/pibase]
- Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP: Integrative genomics viewer. Nat Biotechnol. 2011, 29: 24-26. 10.1038/nbt.1754.
- The tNGS backmapping pipeline. http://www.ikmb.uni-kiel.de/tngs-backmapping
- The HapMap chip data. http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html
- Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456: 53-59. 10.1038/nature07517.
- Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, et al: The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008, 452: 872-876. 10.1038/nature06884.
- Brockman W, Alvarez P, Young S, Garber M, Giannoukos G, Lee WL, Russ C, Lander ES, Nusbaum C, Jaffe DB: Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Res. 2008, 18: 763-770. 10.1101/gr.070227.107.
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008, 36: e105. 10.1093/nar/gkn425.
- McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, Tsung EF, Clouser CR, Duncan C, Ichikawa JK, Lee CC, et al: Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 2009, 19: 1527-1541. 10.1101/gr.091868.109.
- Smith DR, Quinlan AR, Peckham HE, Makowsky K, Tao W, Woolf B, Shen L, Donahue WF, Tusneem N, Stromberg MP, et al: Rapid whole-genome mutational profiling using next-generation sequencing technologies. Genome Res. 2008, 18: 1638-1642. 10.1101/gr.077776.108.
- Chou LS, Liu CS, Boese B, Zhang X, Mao R: DNA sequence capture and enrichment by microarray followed by next-generation sequencing for targeted resequencing: neurofibromatosis type 1 gene as a model. Clin Chem. 2010, 56: 62-72. 10.1373/clinchem.2009.132639.
- Melum E, May S, Schilhabel MB, Thomsen I, Karlsen TH, Rosenstiel P, Schreiber S, Franke A: SNP discovery performance of two second-generation sequencing platforms in the NOD2 gene region. Hum Mutat. 2010, 31: 875-885. 10.1002/humu.21276.
- Margulies EH, Cooper GM, Asimenos G, Thomas DJ, Dewey CN, Siepel A, Birney E, Keefe D, Schwartz AS, Hou M, et al: Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res. 2007, 17: 760-774. 10.1101/gr.6034307.
- Durbin RM, Abecasis GR, Altshuler DL, Auton A, Brooks LD, Gibbs RA, Hurles ME, McVean GA: A map of human genome variation from population-scale sequencing. Nature. 2010, 467: 1061-1073. 10.1038/nature09534.
- Bau S, Schracke N, Kranzle M, Wu H, Stahler PF, Hoheisel JD, Beier M, Summerer D: Targeted next-generation sequencing by specific capture of multiple genomic loci using low-volume microfluidic DNA arrays. Anal Bioanal Chem. 2009, 393: 171-175. 10.1007/s00216-008-2460-7.
- Schracke N, Kornmeyer T, Kranzle M, Stahler PF, Summerer D, Beier M: Specific sequence selection and next generation resequencing of 68 E. coli genes using HybSelect. N Biotechnol. 2009, 26: 229-233. 10.1016/j.nbt.2009.08.013.
- Summerer D, Wu H, Haase B, Cheng Y, Schracke N, Stahler CF, Chee MS, Stahler PF, Beier M: Microarray-based multicycle-enrichment of genomic subsets for targeted next-generation sequencing. Genome Res. 2009, 19: 1616-1621. 10.1101/gr.091942.109.
- R Development Core Team: R: A language and environment for statistical computing. 2008, Vienna: R Foundation for Statistical Computing
- The whole-genome Illumina NGS data for individual NA18507. ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/
- The whole-genome SOLiD NGS data for individual NA18507. http://solidsoftwaretools.com/gf/project/yoruban/
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26: 841-842. 10.1093/bioinformatics/btq033.
- The Broad Institute. http://www.broadinstitute.org
- The Sequence Read Archive at the European Bioinformatics Institute. http://www.ebi.ac.uk/ena/
- The human genome reference NCBI GRCh37. ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.
- The Picard software. http://picard.sourceforge.net
- The recommended SNP-calling settings in SAMtools mpileup. http://www.biostars.org/post/show/9938/why-does-mpileup-skip-my-mutation/
- The Compatible Consensus Coding Sequence (CCDS). ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al: The variant call format and VCFtools. Bioinformatics. 2011, 27: 2156-2158. 10.1093/bioinformatics/btr330.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.