Steps to ensure accuracy in genotype and SNP calling from Illumina sequencing data
© Liu et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Accurate calling of SNPs and genotypes from next-generation sequencing data is an essential prerequisite for most human genetics studies. A number of computational steps are required or recommended when translating the raw sequencing data into the final calls. However, whether each step contributes to the performance of variant calling, and how it affects the accuracy, remains unclear, making it difficult to select and arrange appropriate steps to derive high-quality variants from different sequencing data. In this study, we made a systematic assessment of the relative contribution of each step to the accuracy of variant calling from Illumina DNA sequencing data.
We found that the read preprocessing step did not improve the accuracy of variant calling, contrary to the general expectation. Although trimming off low-quality tails helped align more reads, it introduced many false positives. The ability of duplicate marking, local realignment, and recalibration to help eliminate false-positive variants depended on the sequencing depth. Rearranging these steps did not affect the results. The relative performance of three popular multi-sample SNP callers, SAMtools, GATK, and GlfMultiples, also varied with the sequencing depth.
Our findings clarify the necessity and effectiveness of computational steps for improving the accuracy of SNP and genotype calls from Illumina sequencing data and can serve as a general guideline for choosing SNP calling strategies for data with different coverage.
Next-generation sequencing (NGS) technology is a powerful and cost-effective approach for large-scale DNA sequencing. It has significantly propelled sequence-based genetics and genomics research and its downstream applications, which include, but are not limited to, de novo sequencing [2, 3], quantifying expression levels [4–7], providing a genome-scale look at transcription-factor binding [8, 9], creating a foundation for understanding human disease [10–12], and systematically investigating human variation [13, 14]. A number of projects based on NGS technology are underway. For example, the 1000 Genomes Project (http://www.1000genomes.org/) aims to provide a comprehensive resource of human genetic variation as a foundation for understanding the relationship between genotype and phenotype. The NHLBI GO Exome Sequencing Project (ESP) (http://evs.gs.washington.edu/EVS/) focuses on protein-coding regions to discover novel genes and mechanisms contributing to heart, lung, and blood disorders. TCGA (The Cancer Genome Atlas) (http://cancergenome.nih.gov/) has been sequencing a large number of tumor/normal pairs to provide insights into the landscape of somatic mutations and the great genetic heterogeneity that defines the unique signature of each individual tumor. The ability to discover a comprehensive list of human genetic variants and to search for the causal variants or mutations underlying diseases depends crucially on the accurate calling of SNPs and genotypes.
Translating the raw sequencing data into the final SNP and genotype calls requires two essential steps: read mapping and SNP/genotype inference. First, reads are aligned onto an available reference genome; then variable sites are identified and genotypes at those sites are determined. SNP and genotype calling suffers from high error rates due to the following factors. Poor overall quality, or low-quality tails, prevents reads from being properly mapped. Each read is aligned independently, causing many reads that span indels to be misaligned. The raw base-calling quality scores often co-vary with features such as sequencing technology, machine cycle, and sequence context, and thus do not reflect the true base-calling error rates. These alignment and base-calling errors propagate into SNP and genotype inference and lead to false variant detection. Moreover, low-coverage sequencing always introduces considerable uncertainty into the results and makes accurate SNP and genotype calling difficult. To obtain high-quality SNP and genotype data, most contemporary algorithms use a probabilistic framework to quantify the uncertainty and to model the errors introduced in alignment and base calling [17–20]. In addition, a number of optional steps are recommended. Some precede variant calling, including raw read preprocessing, duplicate marking, local realignment, and base quality score recalibration. Others follow variant calling, including linkage-based genotype refinement [21–23] and SNP filtering or variant quality score recalibration.
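As a concrete illustration of the probabilistic framework, the core diploid genotype likelihood shared (in various forms) by these callers can be sketched as follows. This is a minimal sketch with an independent-error model; the function names are ours, not code from any of the cited tools.

```python
import math

def phred_to_error(q):
    """Convert a Phred base quality Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)

def genotype_log_likelihood(bases, quals, allele1, allele2):
    """log P(observed bases | diploid genotype {allele1, allele2}).

    Each base is assumed to come from either haplotype with probability
    1/2; a miscalled base is spread evenly over the 3 other nucleotides.
    """
    ll = 0.0
    for b, q in zip(bases, quals):
        e = phred_to_error(q)
        p1 = (1 - e) if b == allele1 else e / 3
        p2 = (1 - e) if b == allele2 else e / 3
        ll += math.log(0.5 * p1 + 0.5 * p2)
    return ll
```

With ten concordant high-quality `A` bases, the homozygous-reference likelihood dominates the heterozygous and homozygous-alternate ones, as expected; real callers additionally combine such likelihoods with priors across samples.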
Here we focused on the optional steps preceding variant calling. We assessed their relative contributions and evaluated the effect of their order on the accuracy of SNP and genotype calling with data generated on the Illumina sequencing platform, currently the most widely used sequencing technology. In addition, we compared the performance of three popular multi-sample SNP callers, SAMtools, GATK, and GlfMultiples, in terms of dbSNP rate, transition-to-transversion ratio (Ti/Tv ratio), and concordance rate with SNP arrays (Methods section). Our findings can serve as a general guide for choosing appropriate steps for SNP and genotype calling from Illumina sequencing data with different coverage.
Five samples were selected for whole-exome sequencing. All samples were taken from women with very early-onset (22-32 years old) breast cancer, or with early-onset (38-41 years old) breast cancer plus a first-degree family history of breast cancer.
Genomic DNA from buffy coat was extracted using the QIAamp DNA kit (Qiagen, Valencia, CA) following the manufacturer's protocol. Exonic regions were captured using the Illumina TruSeq Exome Enrichment Kit, which targeted 201,071 regions (62.1 million bases; 49.3% inside exons; average length 309 bp), covering 96.5% of the consensus coding sequence database (CCDS). An Illumina HiSeq 2000 was used to generate 100-bp paired-end reads (five samples per lane).
Table 1. Summary of base distribution for the five samples' whole-exome sequencing data: total mapped bases (Gb, %) by region class (% of genome regions), including regions outside the targets beyond 200 bp.
Poor-quality tails of reads were dynamically trimmed off with the BWA parameter -q 15. Duplicated reads were marked with Picard. Base quality recalibration and local realignment were carried out using the Genome Analysis Toolkit (GATK) [17, 27]. SNPs were called simultaneously on the five samples by GATK UnifiedGenotyper, SAMtools mpileup, and GlfMultiples, using bases with base quality ≥ 20 and reads with mapping quality ≥ 20.
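The -q trimming used above can be sketched in a few lines. This is our reading of BWA's 3'-end trimming heuristic, not BWA's source code: scanning from the 3' end, accumulate (threshold − quality) and cut where the running sum is maximal.

```python
def trim_read(quals, q_threshold=15):
    """Return the trimmed read length, in the style of BWA's -q option.

    Scanning from the 3' end, accumulate (q_threshold - q) per base and
    cut at the position where that running sum peaks; if the sum never
    becomes positive, the read is kept whole.
    """
    best, running, cut = 0, 0, len(quals)
    for i in range(len(quals) - 1, -1, -1):
        running += q_threshold - quals[i]
        if running > best:
            best, cut = running, i
    return cut
```

For a read whose last five bases have quality 2, the rule cuts exactly at the start of that low-quality tail, while a uniformly high-quality read is left untrimmed.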
Table 2. Effects of data preprocessing on SNP calling accuracy (QUAL ≥ 50).
The variants are observed either as transitions (between purines or between pyrimidines) or transversions (between a purine and a pyrimidine). The ratio of the number of transitions to the number of transversions is particularly helpful for assessing the quality of SNP calls. Ti/Tv ratios are often calculated for known and novel SNPs separately. The expected Ti/Tv ratios in whole-genome sequencing are 2.10 and 2.07 for known and novel variants, respectively, and in exome target regions are 3.5 and 3.0, respectively. A higher Ti/Tv ratio generally indicates higher accuracy; a ratio close to that expected for random substitutions (~0.5) implies low-quality variant calls or data.
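The classification behind this metric is simple enough to state directly; a minimal sketch (the helper names are our own):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """A transition swaps within purines (A<->G) or pyrimidines (C<->T)."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def titv_ratio(snps):
    """Ti/Tv ratio of a list of (ref, alt) SNPs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv
```

The ~0.5 random expectation follows from the counts: of the 12 possible substitutions, 4 are transitions and 8 are transversions.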
All five samples had been genotyped using the Affymetrix SNP 6.0 array in a previous genome-wide association study. Detailed genotyping methods and stringent quality control criteria were described in Zheng et al. . The original scan included three quality control samples in each 96-well plate, and the SNP calls showed a very high concordance rate (mean 99.9%; median 100%) for the quality control samples.
Genotypes obtained from the sequencing data were compared with those from the SNP array. The non-reference discrepancy (NRD) rate was used to measure the accuracy of genotype calls; it reports the percentage of discordant genotype calls at commonly called non-reference sites on the SNP array and in the exome sequencing. The mathematical definition of NRD can be found in DePristo et al. . A lower NRD generally indicates higher accuracy of genotype calls.
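Under this definition, sites where both platforms call homozygous reference are excluded from the denominator, so the abundant concordant reference sites cannot mask genotyping errors. A minimal sketch of the computation over simple genotype strings (our reading of the NRD definition in DePristo et al.):

```python
def nrd_rate(seq_calls, array_calls, hom_ref="0/0"):
    """Non-reference discrepancy rate between two genotype call sets.

    Sites where both calls are homozygous reference are skipped; among
    the remaining (informative) sites, return the fraction that disagree.
    """
    mismatch = informative = 0
    for seq, array in zip(seq_calls, array_calls):
        if seq == hom_ref and array == hom_ref:
            continue  # concordant hom-ref: uninformative for NRD
        informative += 1
        if seq != array:
            mismatch += 1
    return mismatch / informative
```

For example, four compared sites of which one is concordant hom-ref, one is a concordant het, and two disagree give an NRD of 2/3, even though half of all sites agree.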
The filterY step identified fewer variants (~630 k); however, those variants showed a similar dbSNP rate (~77.8%) and Ti/Tv ratios (2.19 and 1.65, respectively) compared with the raw call set. Removing poor-quality reads from the raw data (filterY) added 887 known variants with a Ti/Tv ratio of 1.72, while it eliminated 9,542 known variants with a Ti/Tv ratio of 2.16 from the raw call set (Figure 1D). That is, the filterY step dropped more than 8,000 known variants, representing about 2% of all known calls. These results suggested that discarding the poor-quality reads that failed the chastity filter might not be necessary for subsequent SNP calling. Comparing the results of applying both the filterY and trim steps (filterY&trim) with those of the trim step alone also showed that the filterY step was not useful for improving SNP calling performance (Table 2 and Figure 1E).
A comprehensive comparison using variable quality thresholds for high-coverage data (inside target regions; ~60× coverage per sample on average; Table 1), medium-coverage data (outside regions within 200 bp; ~30× coverage per sample on average; Table 1), and low-coverage data (outside regions beyond 200 bp; ~4× coverage per sample on average; Table 1) came to the same conclusion: these two preprocessing steps, filterY and trim, could not improve the performance of SNP calling, contrary to the usual expectation. Applying the trim step might even introduce false positives, and this problem is more serious for high-coverage data than for low-coverage data (Additional file 1).
Table 3. Effects of duplicate marking, realignment, and recalibration on SNP calling accuracy, for deep coverage (QUAL > 50) and shallow coverage (QUAL > 20).
For low-coverage sequencing (outside regions beyond 200 bp; ~4× coverage per sample on average), however, the ability of these three steps to eliminate false-positive variants changed. Duplicate marking achieved the best performance, with a 79.09% dbSNP rate and a novel Ti/Tv ratio of 1.53 (Table 3). It removed 19,472 novel variants from the initial call set, representing more than 10% of all novel calls, with a Ti/Tv ratio of 0.67 (Figure 2F). In contrast, local realignment eliminated only 4,139 novel variants with a Ti/Tv ratio of 0.77 (Figure 2D), and recalibration removed only 3,526 novel variants with a Ti/Tv ratio of 0.93 (Figure 2E). These results suggested that duplicate marking was more efficient at reducing false-positive rates than the other two optional steps for low-coverage sequencing data.
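The duplicate-marking rule itself is straightforward: reads sharing the same 5' alignment coordinate and strand are presumed PCR copies of one template, and all but the best-quality copy are marked. A Picard-style sketch, with hypothetical read records of our own design:

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Return the names of reads marked as duplicates.

    Reads are grouped by (chrom, pos, strand); within each group the
    read with the highest summed base quality is kept and the rest are
    marked (a Picard-like tie-breaking rule).

    Each read is a dict with keys: name, chrom, pos, strand, quals.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)
    duplicates = set()
    for members in groups.values():
        members.sort(key=lambda r: sum(r["quals"]), reverse=True)
        duplicates.update(r["name"] for r in members[1:])
    return duplicates
```

At low coverage, such collapsed PCR copies would otherwise let a single miscalled template masquerade as multiple independent observations, which is why this step matters most there.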
A comprehensive comparison using variable quality thresholds also suggested that realignment was more efficient in removing false positives than base call recalibration and marking duplication for high-coverage data, whereas marking duplication was more efficient than the other two for low-coverage data (Additional file 2).
The effect of the order of the optional steps on SNP calling was also evaluated. We obtained the same accuracy of SNP and genotype calling using different orderings, suggesting that the order of the steps had no effect on calling performance (Additional file 3).
Intriguingly, we found that the read preprocessing steps before mapping were not necessary. Trimming off low-quality tails from reads even worsened the power of variant calling, although it helped align more reads whose tails had high error rates. A possible explanation is that although the quality of the tails is poor, they are still helpful for read mapping; trimming them off thus leads to more alignment artifacts than using raw reads and, in turn, to false-positive variant discovery. It should be noted that trimming reads is to some extent a matter of trial and error, balancing the number of mapped reads against mapping accuracy. If the quality decrease at the 3' end is acceptable and the loss of coverage is affordable, trimming is not necessary. In contrast, if there is a dramatic quality decrease at the tail, with poor quality observed at very early sequencing cycles, trimming might help by greatly increasing the number of mapped reads without reducing the mapping accuracy much.
For the steps after read mapping, including duplicate marking, realignment, and recalibration, the relative contribution of each step to the accuracy of variant calling depends on the sequencing depth. When the sequencing depth is high, read mapping can benefit from finding a consistent alignment among all reads, and realignment thus reduces the number of false positives effectively. When the sequencing depth is low, however, the lack of sufficient reads mapping to a locus limits the power of local multiple sequence alignment, so it cannot improve the quality of variant calls much. In such circumstances, duplicate marking plays a more important role in reducing false positives than realignment and recalibration. Moreover, the performance of the three popular multi-sample calling tools, SAMtools, GATK, and GlfMultiples, also depends on the sequencing depth. They use the same genotype likelihood model, but GlfMultiples takes into account not only the maximized likelihood but also an overall prior for each type of polymorphism; for example, it favors sites with transition polymorphisms over those with transversions. Incorporating such additional information helps reduce the uncertainty associated with shallow-sequencing data. However, the additional information can disturb the identification of variants when sufficient evidence is already available in deep-sequencing data.
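The role of such a prior can be illustrated by how it is folded into the call. A minimal sketch of combining per-genotype log-likelihoods with a log-prior and renormalizing (the names and numbers are illustrative, not GlfMultiples' actual prior values):

```python
import math

def genotype_posterior(log_likelihoods, log_priors):
    """Combine per-genotype log-likelihoods with log-priors and
    renormalize into posterior probabilities over genotypes.

    Both arguments map genotype labels to log values; the max-trick
    keeps the exponentiation numerically stable.
    """
    logs = {g: log_likelihoods[g] + log_priors[g] for g in log_likelihoods}
    top = max(logs.values())
    unnorm = {g: math.exp(v - top) for g, v in logs.items()}
    total = sum(unnorm.values())
    return {g: p / total for g, p in unnorm.items()}
```

When shallow data leave the likelihoods nearly flat, the prior dominates the call, which is exactly why prior information helps at low depth but can override real evidence less gracefully at high depth, where the likelihoods are already decisive.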
The steps posterior to variant calling, including linkage-based genotype refinement and SNP filtering or variant quality score recalibration, also contribute substantially to accurate SNP and genotype calling. The use of LD (linkage-disequilibrium) patterns can substantially improve genotype calling when multiple samples have been sequenced. Because not all information regarding errors can be fully incorporated into the statistical framework, proper SNP filtering strategies are recommended to reduce the error rates. Besides, the consensus of multiple call sets from different methods provides higher quality than any individual call set. Even with the best pipelines, however, we are still far from obtaining a complete and accurate picture of SNPs and genotypes in the human genome. The most challenging task is to distinguish rare variants from sequencing errors. SNP and genotype calling for rare variants, which would not be represented in any reference panel, may not improve much with the use of LD information. To identify rare variants, a direct and more powerful approach is to sequence a large number of individuals [23, 32]. In addition to using proper sequencing strategies, more accurate SNP detection methods need to be developed. More research is also needed in other areas, including longer read lengths, improved protocols for generating paired ends, advances in sequencing technology with lower base-calling error rates, and more powerful alignment methods.
Here, we evaluated the effect of a number of computational steps on the accuracy of SNP and genotype calling from Illumina sequencing data with different coverage. To our knowledge, no other study has made a systematic assessment of whether each step is valuable and how it affects the quality of variant detection. Our findings can serve as a general guideline for choosing SNP calling strategies.
The authors wish to thank Peggy Schuyler for editorial work on this manuscript and Wei Zheng for his support. This work was supported by National Cancer Institute grants U01 CA163056, P50 CA090949, P50 CA095103, P50 CA098131 and P30 CA068485 (to YS) and the National Institutes of Health grants R01GM088822 (to BZ). Subject recruitment and exome sequencing is supported by CA124558 (to WZ) and CA137013 (to JRL). QL's work was partially supported by the National Natural Science Foundation of China 31070746 (to QL).
This article has been published as part of BMC Genomics Volume 13 Supplement 8, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S8.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.