Let L denote a set of TGS reads and S denote a set of NGS data, respectively. Suppose that we are given L and S, QIHC uses the probabilistic models to judge the heterozygosity through Bayesian classifier, and corrects reads from L based on the integration of self correction and hybrid correction mechanism, finally outputs the corrected set L’. No reference sequence is required for the inputs. From the inputs to the outputs, QIHC includes three major modules, which are in turn:
-
1)
Generating pseudo reference sequence. Specifically, a pseudo reference sequence is obtained through assembly process, which can be done by any popular long-read assembly tool. Through this module, the inefficiency of repeatedly mapping S to each read from L in hybrid correction is solved. At the same time, there is no need to narrow down the QIHC scope of application in order to input a native reference sequence.
-
2)
Obtaining read alignment. Simultaneously mapping S and L to the pseudo reference sequence, named Ref. Let Lm denote a set of TGS reads which successfully map to Ref, and Lu denote a set of TGS reads which do not map to Ref. Specifically, mapping L to Ref to get Lm and Lu. A standard was set to accomplish this task. Dividing L into these two parts facilitates subsequent implementation of targeted correction strategies. For another, Ref provides anchor points for the mapping of Lm and S, that is, the sites on Ref sever as anchor points to ensure that the corresponding bases of Lm and S are mapped to the same site.
-
3)
Judging heterozygosity and correcting reads. This module consists of judging heterozygosity by the probabilistic models and performing the different correction strategies according to the judgment results. Specifically, after the first two modules are completed, the mutual relationships and the mapping positions between reads can be calculated. At this time, the probabilistic models can be used to calculate the posterior probabilities, furthermore, Bayesian classifier is used to judge heterozygosity, that is, the largest posterior probability is selected for decision making, finally the targeted correction strategies are implemented. This is our core module, under the premise of not losing accuracy and greatly increasing the sensitivity to heterozygosity, QIHC accomplishes the error correction of L.
Generating pseudo reference sequence
As the beginning of the method, we perform sequence assembly to get a pseudo reference sequence, the assembly process is as following four steps:
-
: Load reads from L and align all reads to each other to get a directed graph, where each read is treated as a node.
-
: Compute overlaps between any two reads based on Smith-Waterman algorithm and obtain the information of all possible overlaps. Specifically, we set the user parameters min_length, max_length and θ as the minimum length, the maximum length and a threshold score of overlap, respectively. Using the Smith-Waterman algorithm to compute the score of overlap between any pair of reads. Of course, if there is no overlap between two reads, the corresponding score is 0, and the overlap length is also 0. When the overlap length of a pair reads is between min_length and max_length, and the score is greater than θ, the overlap is established.
-
: According to the overlaps, the reads from L are preliminarily assembled, and get the combined relationship of fragments, defined as contig.
-
: Scan again, if there are overlaps in contigs, merge the contigs to form a new contig, and delete the original contigs. In this way, we get the final contigs.
Finally, we link these contigs to obtain a pseudo reference sequence–Ref. Since assembly principle of Canu can achieve the purpose of our assembly idea, and Canu adds correction and trimming before assembly to get high quality contigs, we choose Canu as the assembly tool.
Obtaining read alignment
After obtaining Ref, L and S are mapped to the pseudo reference sequence by BLASR [28], which has a strong fault tolerance and can map almost all reads to Ref. Since BLASR may generate multiple mapping results for a read and sort by the percentage of mapped base, we reserve the best mapping result for each read and divide L into Lm and Lu according to the percentage of mapped base, that is, the critical value of the percentage is 90%, a read with the percentage over 90% is assigned to Lm, otherwise it is assigned to Lu. Next, it is necessary to map S to each read of Lu separately in order to perform a correction strategy different from Lm.
Judging heterozygosity and correcting reads
The highlight of QIHC is the heterozygosity judgment, the core idea is as follows: two sets of the probabilistic models are established for S and L respectively, the probabilistic models are proposed based on Bayesian classifier. According to the basic principle of Bayesian classifier, we calculate the posterior probabilities of homozygosity and heterozygosity respectively, and take the side with higher probability value as the judgment result. More specifically, the comparison of the posterior probabilities is equivalent to the comparison of the product of the prior probability and the conditional probability of heterozygosity. The prior probability is a fixed value obtained through sequencing data features, the conditional probability is subject to binomial distribution. Alleles are sorted according to the frequency of occurrence. For heterozygous cases, the first two rank are taken as the heterozygous alleles, and the rest are sequencing errors; for homozygous cases, the first rank is taken as the homozygous allele, similarly, the rest are sequencing errors. Each allele obeys its respective binomial distribution, and the latter term is calculated on the basis of the former term.
Specifically, the heterozygosity judgment process is described in detail with respect to the site i on Ref. Let \(L_{k}, S_{t}, B_{k}^{m}, b_{t}^{n}, R_{i}\) represent the kth LRs, the tth SRs, the mth base of Lk, the nth base of St, the ith base of reference sequence Ref, respectively. Figure 5 intuitively shows the dependency relationships and the distribution among Ref, \( L_{k}, S_{t}, B_{k}^{m}, b_{t}^{n} \) and Ri. According to the known knowledge, a base may be four single-bases or null. Thus, \( B_{k}^{m} \) and \( b_{t}^{n} \) have five possible alleles, which are A, T, G, C and null label N. The mapping result is subdivided, the number of long reads mapped to Ri is defined as the read depth of long reads, represented by RDi. Similarly, the read depth of short reads is defined, denoted by rdi. Processing the long reads which mapped to Ri, let Xq denote alleles which are sorted according to the frequency of occurrence from large to small, |Xq| denote the corresponding frequency, q=1, 2, 3, 4, 5. Then, we can draw the binomial distribution for Xq: Xq∼Bin(Dq,P1), where \(f\left (X_{1}|P_{1}\right)={C}_{D_{q}}^{|X_{q}|} \times {\left (P_{1}\right)}^{|X_{q}|} \times {\left (1-P_{1}\right)}^{{D_{q}}-{|X_{q}|}}, D_{q}=\sum _{q}^{5} |X_{q}|\). Similarly, the alleles and frequency of occurrence of the alleles are defined in short reads, denoted as xq and |xq|, q=1, 2, 3, 4, 5. Drawing the binomial distribution for xq: xq∼Bin(dq,P2), where \(f\left (X_{1}|P_{2}\right)={C}_{d_{q}}^{|x_{q}|} \times {\left (P_{2}\right)}^{|x_{q}|} \times {\left (1-P_{2}\right)}^{{d_{q}}-{|x_{q}|}}, d_{q}={\sum \nolimits }_{q}^{5} \left |x_{q}\right |\). It should be noted here that P1 and P2 are the prior probability values of X and x respectively, which vary according to the different situations, and the details are given in the probability model calculation part.
Therefore, the posterior probability of L can be calculated by Eq. (1),
$$\begin{array}{*{20}l} & \ \ \ \ P\left(c|X_{1},X_{2},X_{3},X_{4},X_{5}\right) \notag \\ &=P\left(X_{1},X_{2},X_{3},X_{4},X_{5}|c\right) \times P(c) \notag \\ &=P\left(X_{1}|c\right)\times P\left(X_{2}|X_{1},c\right)\times P\left(X_{3}|X_{1},X_{2},c\right)\\&\times P\left(X_{4}|X_{1},X_{2},X_{3},c\right)\times P\left(X_{5}|X_{1},X_{2},X_{3},X_{4},c\right)\times P(c) \end{array} $$
(1)
where the value of c is homozygosity or heterozygosity. The probability of S is similar. For the allele whose occurrence frequency is 0, the corresponding item is removed in actual calculation. So far, two kinds of the posterior probabilities are obtained by multiplying the above probabilities, which are the probabilities of observing the bases distribution when homozygosity and heterozygosity; inferring the heterozygosity of the site based on the maximum probability value.
Then, the bases distribution as a new definition is brought up, which reflects how many kinds of alleles are mapped to Ri, denoted as dl, which can be computed as
$$\begin{array}{*{20}l} dl=\sum\limits_{n=1}^{5} I \left(|X_{q}| \ne0 \right) \end{array} $$
(2)
where I(·) is an indicator function, which outputs 1 when the equation is true. Similarly, the bases distribution are defined in short reads, denoted as ds. The possible distributions of bases are given in Fig. 6, which contribute to understanding of the heterozygosity judgment.
After |Xq|,|xq|, dl, ds, RDi and rdi are calculated, the heterozygosity judgment is performed. Let Di represent the bases distribution under the site i we observed. According to the bases distribution, it can be divided into five cases: d=1, d=2, d=3, d=4 and d=5, d here refers to dl and ds.
Case 1:
If dl=1, then it is directly judged to be homozygosity.
If ds=1, then it is directly judged to be homozygosity.
Case 2:
If dl=2, thus |X1|+|X2|=RDi, then the posterior probability of Di under homozygosity and heterozygosity are as follow (see the corresponding details of calculation formulas in Additional file 1):
$$\begin{array}{*{20}l} P \left(i\ is\ homozygosity|D_{i}=\left\{X_{1},X_{2}\right\} \right) \end{array} $$
(3)
$$\begin{array}{*{20}l} P\left(i\ is\ heterozygosity|D_{i}=\{X_{1},X_{2}\}\right) \end{array} $$
(4)
After calculating the two posterior probabilities, the result of judgment is corresponding to the larger value.
The judgment principle of S is similar to L, we do not describe here, see the corresponding details in Additional file 1.
Case 3:
If dl=3, thus |X1|+|X2|+|X3|=RDi, then the posterior probabilities of homozygosity and heterozygosity are given below (see the corresponding details of calculation formulas in Additional file 1):
$$\begin{array}{*{20}l} P\left(i\ is\ homozygosity|D_{i}=\{X_{1},X_{2},X_{3}\} \right) \end{array} $$
(5)
$$\begin{array}{*{20}l} P\left(i\ is\ heterozygosity|D_{i}=\{X_{1},X_{2},X_{3}\}\right) \end{array} $$
(6)
Similar to the case with d=2, we take the larger value in Eqs. (5) and (6) as the judgment result of L.
Case 4:
If dl=4, thus |X1|+|X2|+|X3|+|X4|=RDi, then the posterior probabilities of homozygosity and heterozygosity are given below (see the corresponding details of calculation formulas in Additional file 1):
$$\begin{array}{*{20}l} P\left(i\ is\ homozygosity|D_{i}=\{X_{1},X_{2},X_{3},X_{4}\}\right) \end{array} $$
(7)
$$\begin{array}{*{20}l} P\left(i\ is\ heterozygosity|D_{i}=\{X_{1},X_{2},X_{3},X_{4}\}\right) \end{array} $$
(8)
Case 5:
If dl=5, thus |X1|+|X2|+|X3|+|X4|+|X5|=RDi, then the posterior probabilities of homozygosity and heterozygosity are given below (see the corresponding details of calculation formulas in Additional file 1):
$$\begin{array}{*{20}l} & \ \ \ \ P\left(i\ is\ homozygosity|D_{i}=\{X_{1},X_{2},X_{3},X_{4},X_{5}\}\right) \notag \\ & =P\left(D_{i}=\{X_{1},X_{2},X_{3},X_{4},X_{5}\}|i\ is\ homozygosity\right)\\&\quad\times P\left(i\ is\ homozygosity\right) \notag \\ & = f\left(X_{1}|P_{1}\right) \times f\left(X_{2}|P_{2}\right) \times f\left(X_{3}|P_{3}\right) \times f\left(X_{4}|P_{4}\right) \\&\quad\times f\left(X_{5}|P_{5}\right) \times P\left(i\ is\ homozygosity\right) \end{array} $$
(9)
$$\begin{array}{*{20}l} & \ \ \ \ P\left(i\ is\ heterozygosity|D_{i}=\{X_{1},X_{2},X_{3},X_{4},X_{5}\}\right) \notag \\ & =P\left(D_{i}=\{X_{1},X_{2},X_{3},X_{4},X_{5}\}|i\ is\ heterozygosity\right)\\&\quad\times P\left(i\ is\ heterozygosity\right) \notag \\ & = f\left(X_{1}|P_{1}\right) \times f\left(X_{2}|P_{2}\right) \times f\left(X_{3}|P_{3}\right) \times f\left(X_{4}|P_{4}\right) \\&\quad\times f\left(X_{5}|P_{5}\right) \times P\left(i\ is\ heterozygosity\right) \end{array} $$
(10)
So far, the strategy of the heterozygosity judgment has been given. In general, the input to this process is the bases distribution under site i, and the different probabilistic models are implemented for different sources of reads. The final output is the result of heterozygosity of site i.
Then, we perform the different correction strategies for Lm and Lu, respectively.
For the correction of Lm, the inputs are the bases distributions of Lm and S and their heterozygosity judgment results. Correcting Lm is our goal, so the Ri on Ref is only used as an anchor point to locate related reads of Lm and S, as shown in Fig. 5. Under the same anchor point Ri, results of heterozygosity judgment for the bases distributions produce four possible combinations: heterozygous result from Lm and homozygous result from S; heterozygous result from Lm and heterozygous result from S; homozygous result from Lm and homozygous result from S; homozygous result from Lm and heterozygous result from S. For these four combinations, QIHC makes a decision: when the judgment results of Lm and S are consistent, since the sequencing accuracy of NGS is much higher than that of TGS, the judgment result of S is adopted; otherwise, the party whose judgment result is homozygosity is accepted. Thus, the final judgment result of heterozygosity is obtained, which is defined as Hm. According to Hm, the following correction rules are implemented:
If Hm is homozygosity, then the site to be corrected is replaced with the allele which appears most frequently among bases mapped to Ri;
If Hm is heterozygosity: if the site to be corrected is already one of the top two frequent alleles among bases mapped to Ri, then leave the allele of this site as it is; otherwise, the site to be corrected is randomly replace with one of the top two frequent bases.
According to the above decision results, all reads corresponding to the anchor point in Lm are corrected by the correction rules, a correction result set Lm’ is outputted.
For the correction of Lu, since Lu is the set of long reads that have not been successfully aligned to Ref, it can be seen that there is not enough correlation between each read in Lu. Therefore, the inputs are the bases distributions of S mapped to reads of Lu and their heterozygosity judgment results, using S to correct each read in Lu one by one. The basic principle is obtaining the final heterozygosity judgment result of S named Hu, and correcting Lu according to the following criterions:
If Hu is homozygosity, then the site to be corrected is replaced with the allele which appears most frequently among bases mapped to the base \(B_{k}^{m}\);
If Hu is heterozygosity: if the base corresponding to the site to be corrected is already one of the top two frequent alleles among bases mapped to \(B_{k}^{m}\), then leave the base of the site as it is; otherwise, the site to be corrected is randomly replaced with one of the top two frequent bases.
It is worth noting that the implementation of heterozygosity judgment and correction rules here only use the information provided by S. All reads in Lu are corrected, a correction result set Lu’ is outputted. Eventually, Lm’ and Lu’ form L’ together.
Overall, we design an error correction algorithm with high sensitivity to heterozygosity, the algorithm mainly consists of the following steps:
-
: Assemble L and get contigs;
-
: Link contigs one by one and obtain a pseudo reference sequence–Ref;
-
: Map L to Ref and get Lm and Lu;
-
: Map S to each read of Lu, obtain rdi,os(Vn) and ds, implement heterozygosity judgment and save result;
-
: Map Lm to Ref. For Ri of Ref, obtain RDi,ol(Vn) and dl, implement heterozygosity judgment and save result;
-
: Map S to Ref. For Ri of Ref, obtain rdi,os(Vn) and ds, implement heterozygosity judgment and save result;
-
: Make the final judgment Hm for Lm, if the results of step 5 and step 6 are consistent, the result of step 6 is adopted; otherwise, the step whose result is homozygosity is accepted, jump to step 9;
-
: According to the result of step 4 and the correction rules mentioned above, correct each read of Lu, obtain the correction set Lu’;
-
: According to the result of step 7 and the correction rules mentioned above, correct all reads of Lm which corresponding to the anchor point Ri, then load Ri+1 and jump to step 5, until all sites on Ref are traversed, obtain the correction set Lm’;
-
: Combine Lu’ and Lm’ to get L’.