- Open Access
Statistical modeling for sensitive detection of low-frequency single nucleotide variants
© The Author(s). 2016
- Published: 22 August 2016
Sensitive detection of low-frequency single nucleotide variants carries great significance in many applications. In cancer genetics research, tumor biopsies are a mixture of normal and tumor cells from various subpopulations due to tumor heterogeneity. Thus the frequencies of somatic variants from a subpopulation tend to be low. Liquid biopsies, which monitor circulating tumor DNA in blood to detect metastatic potential, also face the challenge of detecting low-frequency variants due to the small percentage of the circulating tumor DNA in blood. Moreover, in population genetics research, although pooled sequencing of a large number of individuals is cost-effective, pooling dilutes the signals of variants from any individual. Detection of low frequency variants is difficult and can be cofounded by sequencing artifacts. Existing methods are limited in sensitivity and mainly focus on frequencies around 2 % to 5 %; most fail to consider differential sequencing artifacts.
We aimed to push down the frequency detection limit close to the position specific sequencing error rates by modeling the observed erroneous read counts with respect to genomic sequence contexts. 4 distributions suitable for count data modeling (using generalized linear models) were extensively characterized in terms of their goodness-of-fit as well as the performances on real sequencing data benchmarks, which were specifically designed for testing detection of low-frequency variants; two sequencing technologies with significantly different chemistry mechanisms were used to explore systematic errors. We found the zero-inflated negative binomial distribution generalized linear mode is superior to the other models tested, and the advantage is most evident at 0.5 % to 1 % range. This method is also generalizable to different sequencing technologies. Under standard sequencing protocols and depth given in the testing benchmarks, 95.3 % recall and 79.9 % precision for Ion Proton data, 95.6 % recall and 97.0 % precision for Illumina MiSeq data were achieved for SNVs with frequency > = 1 %, while the detection limit is around 0.5 %.
Our method enables sensitive detection of low-frequency single nucleotide variants across different sequencing platforms and will facilitate research and clinical applications such as pooled sequencing, cancer early detection, prognostic assessment, metastatic monitoring, and relapses or acquired resistance identification.
- Negative Binomial
- Single Nucleotide Variant
- Illumina MiSeq
- Sequencing Error Rate
- Negative Binomial Generalize Linear Model
In 2005, the first next-generation sequencing (NGS) technology was released by 454 Life Sciences (now Roche) . Within the past ten years, different sequencing technologies and platforms, including Illumina, SOLiD, Ion Torrent, Complete Genomics, were released to the public. The much faster sequencing speed, high-throughput capacity and now up to several hundred bases read length, together with a greatly reduced cost, revolutionized the scope and efficiency of biomedical related field researches . Paired with the increasingly diverse range of biological application of NGS technologies, numerous computational and informatics tools, frameworks and pipelines emerged to enable researchers to harness the power of NGS technologies. Statistical models suitable for count data modeling gained much attention in NGS data analysis due to the discrete count nature of the data generated by NGS sequencers. Such models were broadly applied in DNA sequencing (DNA-Seq) based variants identification such as samtools , VarScan2 , and SNVMix . For DNA sequencing based single nucleotide variant (SNV) identification, emerging new applications bring challenges to refine the statistical modeling methods and pushing the limit of NGS technologies.
In cancer genetics research, low frequency tumor somatic SNV identification is crucial due to the inevitable normal tissue contamination [6, 7] and the highly heterogeneous, constantly evolving nature of tumors . Accurate and sensitive identification of low frequency SNVs also carries clinical significance, since it enables the early diagnosis, cancer progression monitor and relapse identification. The recent discovery of circulating tumor DNA (ctDNA) also gained much attention. Contrast to traditional tumor biopsies, which is invasive and can only offer a snapshot of the tumor genetics landscape at certain checkpoints, ctDNA based ‘liquid biopsy’  is non-invasive and can be done repeatedly for close monitoring of early sign of relapse or metastasis. However, ctDNA only takes a small percentage of all blood sample DNA, a previous research  reported for some advanced cancers, ctDNA is about 1 ~ 10 % of blood DNA.
The difficulty for low-frequency SNV identification using NGS technologies is due to the relatively high sequencing artifacts or error rates, which is around 0.1 ~ 1 % for most platforms. Further, such error rates differ significantly under various genome contexts. For example, Illumina sequencing data are prone to have mismatches while Ion Torren and Ion Proton data contain more homopolymer related indels and consequently, mismatches near homopolymer loci [11–13]. For somatic SNV identification paired tumor-normal design, some existing methods derive the sequencing error probability from base qualities followed by error likelihood ratio test of tumor and normal sample at the same location, for example in Mutect , Strelka . While VarScan2 applies a Fisher’s exact test on the paired samples, treating non-reference read counts from the normal sample as background error rate. The former failed to consider differential error rates for substitution types while the latter only utilized information in one location thus the background error rate estimation is off. For one sample low-frequency SNV calling, UDT-Seq  tabulated the error rate based on substitution types, strand and location on the read to derive an empirical background error rate, then use binomial model to distinguish signal from error, and the candidate SNVs are further refined by 7 filters. This method is context-aware but also ad-hoc, thus the ability to adapt to different sequencing technologies is limited. A brief summary of the tools mentioned above is included in Additional file 1.
By analyzing previous efforts, our group proposed a framework (Yangyang Hao XX, Li L, Nakshatri H, Edenberg HJ, Liu Y. RareVar: A Framework for Detecting Low Frequency Single Nucleotide Variants, submitted) to first generated position specific error model (PSEM) using genome sequence contexts for candidate SNV identification and then apply a machine-learning model to refine the candidates. Testing on an Ion Proton benchmark dataset, our framework outperforms existing methods, especially at 0.5 % to 3 % frequencies. However, the potential to improve PSEM performances on SNVs with close to sequencing error rates by implementing more sophisticated statistical modeling and the generalizability and adaptiveness of PSEM remain untested. In this research, we explored what distributions fit the DNA-Seq data error rates modeling as well as the possibility of improved position specific error rate prediction for higher precision and recall on SNVs down to 0.5 % frequency. Further, we evaluated how different sequencing technologies affect the behavior of PSEM and the generalizability and adaptiveness of the PSEM framework.
In the Result section, we first briefly summarized the benchmark datasets used for PSEM. Then we compared the testing benchmark dataset from Ion Proton with Illumina MiSeq sequencing data in terms of allele frequency composition and depth distribution. Utilizing count data visualization plots and tabulation, we selected the candidate distributions that may fit the data. Since the Ion Proton dataset contains 3 times of the number of benchmark SNVs from Illumina MiSeq and also is enriched with SNVs of ≤ 1 % allele frequency, we mainly focused on Ion Proton data set for model development and evaluation. To test the generalizability of the PSEM, we further trained and evaluated it on Illumina MiSeq dataset.
Benchmarks overview and comparison
Two sets of designed benchmarks targeting low-frequency SNVs from both Ion Proton and Illumina MiSeq  sequencing technologies were included. The details of these 2 datasets are described in Methods section and Additional files 2 and 3. Briefly, the Ion Proton training benchmark is the sequencing data from a single individual with known genotypes, while the test benchmark was designed to mimic the paired normal-tumor design for somatic SNVs identification applications. The Illumina MiSeq benchmark data were generated by mixing 4 individuals at 4 different percentages and then permuted the mixing percentage assignment 4 times to generate 4 calibration datasets – CAL_A, CAL_B, CAL_C and CAL_D. Since the 4 calibration data sets were generated with the same procedures, without loss of generality, we used CAL_A as training benchmark and treated the others as testing benchmark.
Candidate distributions selection
Comparing the goodness-of-fit of different distributions
9 genomic sequence context covariates, totaling 24 degrees of freedom, were included in the GLM models (Methods and Additional file 4). Since ZIP and ZINB GLM require covariates for both the ‘zero’ and ‘count’ parts, the same covariates were provided for both, resulting in doubled degrees of freedom of those included in Poisson and NB GLM.
Vuong’s non-nested tests on 4 distributions applied to Ion Proton training data
Vuong z-statistic BIC-corrected
model2 > model1
model2 > model1
model1 > model2
model2 > model1
model2 > model1
Performance evaluation on Ion Proton testing benchmark
Overall performance comparison for Ion Proton testing benchmark
Application of ZINB PSEM on Illumina MiSeq data
To evaluate the generalizability and adaptiveness of the GLM based PSEM, the same modeling strategies were applied to the Illumina MiSeq sequencing data sets. The same genomic sequence context features from Ion Proton modeling were applied to the Illumina MiSeq CAL_A dataset. Similar to the analysis on Ion Proton data set, paired Vuong’s non-nested hypothesis tests were conducted on the 4 candidate distributions, with details summarized in Additional file 5. The test conclusions remained the same except for the NB (model 1) and ZIP (model 2) comparison, where the BIC-corrected Vuong z-statistic is −0.47 resulting in p value = 0.318. Therefore the goodness-of-fit for these two distributions on MiSeq dataset are not significantly different.
Despite similar statistical modeling schema can be readily generalized to Illumina MiSeq data set, Illumina MiSeq and Ion Proton sequencers differ significantly in terms of sequencing chemistry. The former is based on sequencing-by-synthesis (SBS) that relies on high-resolution optic systems, whereas the latter is based on Ion semiconductor sequencing where no modified nucleotides or optics are required. The differences in sequencing mechanisms make Ion Proton sequencers run faster but are prone to homopolymer related errors. Comparing the NB GLM regression coefficients on both datasets (Additional file 6), homopolymer related features significant in Ion Proton data set regression are either insignificant (hmer_len, hmer_dist) or show opposite effect (hmer_op, hmer_den) on the error rate. The same trend was also observed in ZIP and ZINB models comparing Ion Proton with Illumina MiSeq (Additional files 7, 8, 9 and 10).
Comparing with the results from UDT-Seq , which reported approximately 90 % recall and >95 % precision (no specific number was given, the precision was inferred by the precision for the other data UDT-Seq tested - Illumina GAII benchmark data at 1500x depth), ZINB GLM demonstrates higher overall recall (95.1 %) and high precision (93.4 %).
The PSEM model aims to predict the position specific error rates associated with various genomic sequence contexts, under which the specific sequencing technology is prone to error. Based on publications evaluating features associated with sequencing errors and experiences from our previous effort, 9 types of significant features are considered. With the features fixed, using GLM, we evaluated the appropriateness of distributions with different mean – variance relationships and the ability to consider zero-inflation. Consistent with the computational tool EdgeR  for RNA-Seq data, we found the ability to model over-dispersion by NB distribution necessary for DNA-Seq data as well. Additionally, for DNA-Seq erroneous read counts modeling, zero-inflation is also a key factor for accurate prediction and inference. The much-elevated F1 score for 0.5 % allele frequency SNVs as well as the highest overall performance by ZINB GLM highlighted the importance of choosing suitable statistical models. Moreover, comparing with VarScan2, which conducts the Fisher’s exact test for each targeted location on paired normal-tumor sequencing data, the significance of applying the correct reference error model is exemplified by higher recalls as well as precisions for 0.5 % and 1 % frequency SNVs. In theory, for low frequency SNV loci, VarScan2 treated the sequencing reads with non-reference bases from normal as the background error, which is essentially point estimation based on one location. Whereas PSEM collectively considers all loci with similar context features and thus is able to generate more accurate error estimation.
The evaluation of PSEM modeling on Illumina MiSeq dataset and the performance comparison with Ion Proton dataset show the generality of the PSEM framework as well as its adaptiveness to different technologies. Moreover, except for the established importance of choosing appropriate statistical model, the sequencing depth evenness is also an important factor affecting low-frequency SNVs calling performances.
The current GLM-based PSEM framework only considers 9 types of genome sequence context features. To further improve the performances, more informative features associated with sequencing errors should be included and tested. In addition, from the modeling aspect, exploration of the potential to further increase the performances by applying more sophisticated computational models are desired. To better understand its generalizability and adaptiveness, tests on other sequencing technologies, such us SOLiD and Complete Genomics, are necessary. Besides, since the capture assay for the two benchmarks is amplicon-based, hybridization-based approach should be tested to compare the performance profiles.
Differentiating low frequency SNVs from sequencing artifacts is the key for identifying SNVs at frequencies close to sequencing error rates. Our PSEM approach tried to push the limit toward the sequencing error rates. Based on the analyses on benchmarks from standard sequencing protocols and the given sequencing depth, we speculate the detection limit is around 0.5 % on the regions covering all exons of hundred of genes, with a total size up to millions of bases. However, with high accuracy sequencing protocols, such as duplex sequencing  and ultra-deep target enrichment assay , the researchers reported identification of SNVs around 0.1 % on a single gene scale. Despite the promising results, more efforts to make such protocols applicable on larger regions are required for broader applications.
Our method enables sensitive detection of low-frequency single nucleotide variants across different sequencing platforms down to 0.5 % frequency. Thus will facilitate research and clinical applications such as pooled sequencing, cancer early detection, prognostic assessment, metastatic monitoring, and relapses or acquired resistance identification.
For position specific error model training, we used the invariant loci from training benchmark. Genomic sequence context features were extracted for each locus and then fed to the generalized linear models using 4 different distributions. Then testing benchmark paired tumor and normal sequencing data went through the PSEM and the candidate SNVs were derived. Additional file 11 provides a diagram illustrating this procedure. In the following method section, we first introduced the benchmark datasets from both Ion Proton and Illumina MiSeq. Then we described the application of generalized linear models for PSEM. Last, we described the performance evaluation metrics.
Both Ion Proton and Illumina MiSeq datasets were generated from amplicon-based targeted sequencing.
The targeted region for Ion Proton datasets included all exons of 409 known cancer-related genes, totaling about 1.7 million bases covered by about 16,000 amplicon primer pairs from Ion AmpliSeq™ Comprehensive Cancer Panel. The training benchmark is the DNA sequencing data of NA11993. The testing benchmark mimic the paired normal-tumor design, where the normal sample is the DNA sequencing data of NA12878 while tumor sample is a mixture of 17 individuals from 1000 Genomics Project plus NA12878. The mixing percentage assignment is listed in Additional file 2. The sequencing data were aligned with TMAP from Torrent Suite software. Reads with mapping quality less than 40 were filtered out.
The length of targeted regions for Illumina MiSeq datasets is 23.2 kb, covered by 158 amplicons. The design details can be found in the paper  and Additional file 3. The raw reads were downloaded from NCBI Short Read Archive (SRP009487.1) and processed as the paper described. Reads with mapping quality less than 30 were filtered out.
Generalized linear models
The details of the 9 genomic sequence contexts considered in GLM were summarized in Additional file 4. Briefly, general contexts including substitution types, immediate upstream and downstream bases, GC content, and homopolymer related features: whether the locus is within a homopolymer, the closest homopolymer length, the distance to the closest homopolymer, the local homopolymer base percentages and whether the alternative base is the same as the immediate upstream or downstream base are considered. These 9 features are the covariates included in the GLMs.
A location with a certain alternative base is called as a candidate SNV if the numbers of reads from both strands are significantly greater than the predicted error rates. The p values were corrected using Benjamini–Hochberg procedure. The corrected p value cut-off is 0.01.
Performance evaluation measurements
For Ion Proton dataset, loci with at least 5 reads supporting alternative base are included in the evaluation. For Illumina MiSeq dataset, filter 2 used by UDT-Seq was applied which requires > = 0.2 % frequency for alternative bases. However, the other filters were not used. We relied on the PSEM framework to properly address sequencing problems, for example, uneven depth and local sequence context induced errors.
This study is supported by the funds from the US National Institutes of Health U10AA008401 (Collaborative Study on the Genetics of Alcoholism, to H.E.), Pilot funds from the breast cancer program of IUSCC (Indiana University Simon Cancer Center) and Zeta Tau Sorority and Susan G Komen for the Cure grant SAC110025 (to Y.L. and H.N.). The sequencing was performed in the Center for Medical Genomics (CMG) sequencing core at Indiana University School of Medicine.
The publication costs for this article were funded by the corresponding author.
This article has been published as part of BMC Genomics Volume 17 Supplement 7, 2016: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2015: genomics. The full contents of the supplement are available online at http://bmcgenomics.biomedcentral.com/articles/supplements/volume-17-supplement-7.
Availability of data and materials
YL and LL conceived the project. YH and PZ conducted the analysis. LL provided statistical support. XX performed the sequencing experiment. YH, YL, and HE wrote the manuscript. All authors reviewed the manuscript. All authors read and approved the final manuscript.
The authors declare no competing financial interests.
Consent for publication
Ethics approval and consent to participate
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437(7057):376–80.PubMedPubMed CentralGoogle Scholar
- van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C. Ten years of next-generation sequencing technology. Trends in Genetics: TIG. 2014;30(9):418–26.View ArticlePubMedGoogle Scholar
- Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27(21):2987–93.View ArticlePubMedPubMed CentralGoogle Scholar
- Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22(3):568–76.View ArticlePubMedPubMed CentralGoogle Scholar
- Goya R, Sun MG, Morin RD, Leung G, Ha G, Wiegand KC, Senz J, Crisan A, Marra MA, Hirst M, et al. SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics. 2010;26(6):730–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30(5):413–21.View ArticlePubMedPubMed CentralGoogle Scholar
- Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, Gabriel S, Meyerson M, Lander ES, Getz G. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol. 2013;31(3):213–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Meacham CE, Morrison SJ. Tumour heterogeneity and cancer cell plasticity. Nature. 2013;501(7467):328–37.View ArticlePubMedPubMed CentralGoogle Scholar
- Crowley E, Di Nicolantonio F, Loupakis F, Bardelli A. Liquid biopsy: monitoring cancer-genetics in the blood. Nat Rev Clin Oncol. 2013;10(8):472–84.View ArticlePubMedGoogle Scholar
- Diehl F, Li M, Dressman D, He Y, Shen D, Szabo S, Diaz Jr LA, Goodman SN, David KA, Juhl H, et al. Detection and quantification of mutations in the plasma of patients with colorectal tumors. Proc Natl Acad Sci U S A. 2005;102(45):16368–73.View ArticlePubMedPubMed CentralGoogle Scholar
- McElroy KE, Luciani F, Thomas T. GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics. 2012;13:74.View ArticlePubMedPubMed CentralGoogle Scholar
- Bragg LM, Stone G, Butler MK, Hugenholtz P, Tyson GW. Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS Comput Biol. 2013;9(4):e1003031.View ArticlePubMedPubMed CentralGoogle Scholar
- Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, Nusbaum C, Jaffe DB. Characterizing and measuring bias in sequence data. Genome Biol. 2013;14(5):R51.View ArticlePubMedPubMed CentralGoogle Scholar
- Saunders CT, Wong WS, Swamy S, Becq J, Murray LJ, Cheetham RK. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics. 2012;28(14):1811–7.View ArticlePubMedGoogle Scholar
- Harismendy O, Schwab RB, Bao L, Olson J, Rozenzhak S, Kotsopoulos SK, Pond S, Crain B, Chee MS, Messer K, et al. Detection of low prevalence somatic mutations in solid tumors with ultra-deep targeted sequencing. Genome Biol. 2011;12(12):R124.View ArticlePubMedPubMed CentralGoogle Scholar
- Hoaglin DC. A poissonness plot. Am Stat. 1980;34(No.3):146–9.Google Scholar
- Hoaglin DC, Mosteller F, Tukey JW. Checking the Shape of Discrete Distributions. In: Hoaglin DC, Mosteller F, Tukey JW, editors. Checking the Shape of Discrete Distributions, in Exploring Data Tables, Trends, and Shapes. Hoboken: John Wiley & Sons, Inc; 2011.Google Scholar
- Friendly M. Visualizing Categorical Data. Cary: SAS Institute; 2000.Google Scholar
- Lambert D. Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing. Technometrics. 1992;34(1):1–14.View ArticleGoogle Scholar
- Greene WH. Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models. In: NYU Working Paper No. EC-94-10; 1994.Google Scholar
- Vuong QH. Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses. Econometrica. 1989;57(2):307–33.View ArticleGoogle Scholar
- Cameron AC, Trivedi PK. Regression-Based Tests for Overdispersion in the Poisson Model. J Econometrics. 1990;46(3):347–64.View ArticleGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.View ArticlePubMedGoogle Scholar
- Kennedy SR, Schmitt MW, Fox EJ, Kohrn BF, Salk JJ, Ahn EH, Prindle MJ, Kuong KJ, Shen JC, Risques RA, et al. Detecting ultralow-frequency mutations by Duplex Sequencing. Nat Protoc. 2014;9(11):2586–606.View ArticlePubMedPubMed CentralGoogle Scholar
- Schmitt MW, Fox EJ, Prindle MJ, Reid-Bayliss KS, True LD, Radich JP, Loeb LA. Sequencing small genomic targets with high efficiency and extreme accuracy. Nat Methods. 2015;12(5):423–5.View ArticlePubMedPubMed CentralGoogle Scholar