Overall workflow
For position specific error model training, we used the invariant loci from training benchmark. Genomic sequence context features were extracted for each locus and then fed to the generalized linear models using 4 different distributions. Then testing benchmark paired tumor and normal sequencing data went through the PSEM and the candidate SNVs were derived. Additional file 11 provides a diagram illustrating this procedure. In the following method section, we first introduced the benchmark datasets from both Ion Proton and Illumina MiSeq. Then we described the application of generalized linear models for PSEM. Last, we described the performance evaluation metrics.
Benchmark dataset
Both Ion Proton and Illumina MiSeq datasets were generated from amplicon-based targeted sequencing.
The targeted region for Ion Proton datasets included all exons of 409 known cancer-related genes, totaling about 1.7 million bases covered by about 16,000 amplicon primer pairs from Ion AmpliSeq™ Comprehensive Cancer Panel. The training benchmark is the DNA sequencing data of NA11993. The testing benchmark mimic the paired normal-tumor design, where the normal sample is the DNA sequencing data of NA12878 while tumor sample is a mixture of 17 individuals from 1000 Genomics Project plus NA12878. The mixing percentage assignment is listed in Additional file 2. The sequencing data were aligned with TMAP from Torrent Suite software. Reads with mapping quality less than 40 were filtered out.
The length of targeted regions for Illumina MiSeq datasets is 23.2 kb, covered by 158 amplicons. The design details can be found in the paper [15] and Additional file 3. The raw reads were downloaded from NCBI Short Read Archive (SRP009487.1) and processed as the paper described. Reads with mapping quality less than 30 were filtered out.
Generalized linear models
The details of the 9 genomic sequence contexts considered in GLM were summarized in Additional file 4. Briefly, general contexts including substitution types, immediate upstream and downstream bases, GC content, and homopolymer related features: whether the locus is within a homopolymer, the closest homopolymer length, the distance to the closest homopolymer, the local homopolymer base percentages and whether the alternative base is the same as the immediate upstream or downstream base are considered. These 9 features are the covariates included in the GLMs.
The Poisson GLM for erroneous sequencing read counts with log link function is expressed in eq. (1), where N
s,b,l
is the observed number of erroneous reads for strand s (forward or reverse) with alternative base b (three possible values other than the reference) at location l, λ
s,b,l
represents the expected mean for N
s,b,l
, c
s,b,l
is the vector of genomic sequence context covariates, and β is the vector of fitted coefficients. The sequencing depth for strand s at location l is treated as the offset.
$$ \log \left({\lambda}_{s,b,l}\right)= \log \left(E\left({N}_{s,b,l}\Big|{\boldsymbol{c}}_{s,b,l}\right)\right)= \log \left({d}_{s,l}\right)+{\boldsymbol{\beta}}^{\boldsymbol{\hbox{'}}}{\boldsymbol{c}}_{s,b,l} $$
(1)
The negative binomial distribution GLM with log link function can be expressed in eq. (2), where μ
s,b,l
represents the expected mean for N
s,b,l
and θ is the dispersion parameter (the shape parameter of the gamma mixing distribution). The mean E(N
s,b,l
) = μ
s,b,l
and variance VAR(N
s,b,l
) = μ
s,b,l
+ θμ
s,b,l
2 can be estimated from GLM shown below.
$$ \log \left({\mu}_{s,b,l}\right)= \log \left(E\left({N}_{s,b,l}\Big|{\boldsymbol{c}}_{s,b,l}\right)\right)= \log \left({d}_{s,l}\right)+{\boldsymbol{\beta}}^{\boldsymbol{\hbox{'}}}{\boldsymbol{c}}_{s,b,l} $$
(2)
The zero-inflated Poisson distribution can be written as:
$$ P\left({N}_{s,b,l}={n}_{s,b,l}\Big|{\pi}_{s,b,l},\ {\lambda}_{s,b,l},\theta \right)=\Big\{\begin{array}{c}\hfill {\pi}_{s,b,l}+\left(1-{\pi}_{s,b,l}\right) Pois\left({\lambda}_{s,b,l};0\right)\kern3.5em if\kern0.5em {n}_{s,b,l}=0\hfill \\ {}\hfill \left(1-{\pi}_{s,b,l}\right) Pois\left({\lambda}_{s,b,l};{n}_{s,b,l}\right)\kern5.5em if\kern0.5em {n}_{s,b,l}>0\hfill \end{array} $$
(3)
Parameters of the zero-inflated Poisson distribution (3) can be estimated by generalized linear model as shown in (4), where z
s,b,l
is the vector of genomic sequence context covariates for the zero part, and γ is the vector of fitted coefficients.
$$ \mathrm{logit}\left(\frac{\pi_{s,b,l}}{1-{\pi}_{s,b,l}}\right)=\boldsymbol{\gamma} \boldsymbol{\hbox{'}}{\boldsymbol{z}}_{s,b,l} $$
(4)
$$ \log \left({\lambda}_{s,b,l}\right)={\boldsymbol{\beta}}^{\boldsymbol{\hbox{'}}}{\boldsymbol{c}}_{s,b,l} $$
The zero-inflated negative binomial distribution can be written as:
$$ P\left({N}_{s,b,l}={n}_{s,b,l}\Big|{\boldsymbol{c}}_{s,b,l},\ {\boldsymbol{z}}_{s,b,l}\right)=\Big\{\begin{array}{c}\hfill {\pi}_{s,b,l}+\left(1-{\pi}_{s,b,l}\right)NB\left({\mu}_{s,b,l},\ \theta; 0\right)\kern2.75em if\kern0.5em {n}_{s,b,l}=0\hfill \\ {}\hfill \left(1-{\pi}_{s,b,l}\right)NB\left({\mu}_{s,b,l},\theta; {n}_{s,b,l}\right)\kern4.75em if\kern0.5em {n}_{s,b,l}>0\hfill \end{array} $$
(5)
Parameters of the zero-inflated negative binomial distribution (5) can be estimated by generalized linear model as shown in (6).
$$ \mathrm{logit}\left(\frac{\pi_{s,b,l}}{1-{\pi}_{s,b,l}}\right)=\boldsymbol{\gamma} \boldsymbol{\hbox{'}}{\boldsymbol{z}}_{s,b,l} $$
(6)
$$ \log \left({\mu}_{s,b,l}\right)={\boldsymbol{\beta}}^{\boldsymbol{\hbox{'}}}{\boldsymbol{c}}_{s,b,l} $$
A location with a certain alternative base is called as a candidate SNV if the numbers of reads from both strands are significantly greater than the predicted error rates. The p values were corrected using Benjamini–Hochberg procedure. The corrected p value cut-off is 0.01.
Performance evaluation measurements
Precision, recall and F1 score are defined below.
$$ precision=\frac{number\ of\ recovered\ benchmark\ SNVs}{number\ of\ predicted\ SNVs} $$
$$ recall=\frac{number\ of\ recovered\ benchmark\ SNVs}{expected\ number\ of\ benchmark\ SNVs} $$
$$ F1=2*\frac{precision* recall}{precision+ recall} $$
For Ion Proton dataset, loci with at least 5 reads supporting alternative base are included in the evaluation. For Illumina MiSeq dataset, filter 2 used by UDT-Seq was applied which requires > = 0.2 % frequency for alternative bases. However, the other filters were not used. We relied on the PSEM framework to properly address sequencing problems, for example, uneven depth and local sequence context induced errors.