Mining statistically-solid k-mers for accurate NGS error correction
 Liang Zhao^{1, 2},
 Jin Xie^{1},
 Lin Bai^{2},
 Wen Chen^{1},
 Mingju Wang^{1},
 Zhonglei Zhang^{1},
 Yiqi Wang^{1},
 Zhe Zhao^{2} and
 Jinyan Li^{3}
 Published: 31 December 2018
Abstract
Background
NGS data contains many machine-induced errors. The most advanced error correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, whereas a weak k-mer most likely does. An intensively investigated problem is to find a good frequency cutoff f_{0} to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors into the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve the correction performance.
Results
We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model the frequencies of correct k-mers, and we combine them to determine f_{0}. To identify the two special subsets of k-mers, we use the z-score of a k-mer, which measures the number of standard deviations by which the k-mer's frequency differs from the mean. These statistically-solid k-mers are then used to construct a Bloom filter for error correction. Tested on both real and synthetic NGS data sets, our method is markedly superior to the state-of-the-art methods.
Conclusion
The z-score is adequate to distinguish solid k-mers from weak k-mers, and it is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve the error correction accuracy.
Keywords
 Error correction
 Next-generation sequencing
 z-score
Background
The massively parallel next-generation sequencing (NGS) technology is revolutionizing a wide range of medical and biological research areas as well as their application domains, such as medical diagnosis, biotechnology and virology [1]. NGS data is so informative and powerful that some long-standing thorny problems can be effectively tackled with this technology, e.g., the genome-wide association study [2].
The information contained in NGS data is deep and broad, but the raw data is still error-prone. Various kinds of errors exist in the raw sequencing data, including substitutions, insertions and deletions. The substitution error rate can be as high as 1 to 2.5% for data produced by the Illumina platform [3], and the collective insertion and deletion error rate can be as high as 10 to 40% for the PacBio and Oxford Nanopore platforms [4, 5]. It has been widely recognized that correcting these sequencing errors is the first and critical step for many downstream data analyses, such as de novo genome assembly [6], variant calling from genome resequencing [7], identification of single nucleotide polymorphisms, and sequence mapping [3, 8]. For instance, the number of nodes of the De Bruijn graph generated from the HapMap sample NA12878 (https://www.ncbi.nlm.nih.gov/sra/ERR091571/) is 6.92 billion; however, this number can be reduced to only 1.98 billion after error correction. This reduction significantly alleviates the burden of graph manipulation.
Owing to the importance of error correction, dozens of approaches have been proposed to cope with various types of errors. Depending on their key ideas, existing approaches can be grouped into three major categories: (i) the k-spectrum-based approach, including Quake [3], Reptile [9], DecGPU [10], SGA [11], RACER [12], Musket [13], Lighter [14], Blue [15], BFC [16], BLESS2 [17] and MECAT [18]; (ii) the suffix tree/array-based approach, including SHREC [19], HSHREC [20], HiTEC [21] and Fiona [22]; and (iii) the multiple sequence alignment-based approach, including ECHO [23], Coral [8], CloudRS [24] and MEC [25]. Among these, the k-spectrum-based approaches are the most advanced. They provide very good scalability and competitive performance. Scalability is crucial for NGS data analysis since the input volume is usually huge.
In this research, we focus on a more challenging but less-studied problem: (i) removing a small subset of solid k-mers that are likely to contain errors, and (ii) adding a small subset of weak k-mers that are likely to contain no errors into the set of solid k-mers. This is achieved by using f_{0} together with the z-score of a k-mer, z(κ). With the purified set of solid k-mers, the correction performance can be much improved.
Our approach starts by counting k-mer frequencies with KMC2 [27] and then calculates the z-scores of the k-mers. Next, the statistically-solid k-mers are mined by considering both frequency and z-score. After that, a Bloom filter is constructed from the statistically-solid k-mers, and the weak k-mers are corrected. The proposed approach is named ZEC, short for z-score-based error corrector.
Algorithm: mining statistically-solid k-mers
A solid k-mer is conventionally defined as a k-mer that occurs with high frequency in a data set of NGS reads. A solid k-mer is usually considered error-free and is taken as the template for error correction. A k-mer that is not solid is defined as a weak k-mer and is considered error-containing. Existing k-mer-based approaches, e.g., BLESS2 [17], Musket [13] and BFC [16], use a frequency cutoff, f_{0}, to separate solid from weak k-mers in NGS reads. The main difference between these methods is how f_{0} is determined.
In fact, a solid k-mer is not necessarily error-free; it may contain errors with a small probability. The converse holds for weak k-mers: a weak k-mer can be entirely error-free. A solid k-mer is not always correct because the coverage is not uniformly distributed. Thus the cutoff f_{0} alone cannot perfectly distinguish correct k-mers from erroneous k-mers; cf. the parts labeled α and β in Fig. 1. Nevertheless, the purpose of this research is to obtain as many correct k-mers as possible.
In this study, we present a time- and memory-efficient algorithm to purify the solid k-mer set as well as the weak k-mer set, so that more correct k-mers can be identified.

f(κ), the frequency of κ;

z(κ), the z-score of κ.
Calculating f(κ)
The straightforward approach to determine f(κ) is as follows: (i) scan each read r of R from beginning to end; (ii) count the occurrences of κ. The total count is f(κ). This approach works for a single k-mer, but it cannot be applied to all k-mers simultaneously because the number of k-mers can be very large, demanding a huge amount of memory.
In this study, we use the k-mer counting algorithm KMC2 [27] to solve this problem. KMC2 remarkably reduces memory usage because: (i) it is disk-based; (ii) it uses (k,x)-mers; and (iii) it applies the minimizer idea to handle k-mers.
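For intuition only, the straightforward in-memory counting described above can be sketched as follows. This is not the KMC2 algorithm: it keeps every k-mer in a Python dictionary and is therefore practical only for small inputs. The function name `count_kmers` and the toy reads are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in an iterable of read strings."""
    counts = Counter()
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy input (hypothetical reads, not from the data sets in this paper).
reads = ["ACGTACGT", "CGTACGTA"]
counts = count_kmers(reads, k=4)
```

The memory footprint of such a table grows with the number of distinct k-mers, which is the limitation that motivates a disk-based counter.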
Computing z(κ)
Given the frequency of a k-mer and the frequencies of its neighbors, which are determined by the aforementioned approach, it is straightforward to calculate the z-score of the k-mer.
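A minimal sketch of this calculation is given below. It rests on two assumptions that are not spelled out here: the neighbors of κ are the k-mers that differ from κ by a single substitution (the Discussion mentions an edit distance of 1 and 3∗k candidates), and the mean and standard deviation are taken over the frequencies of κ together with those neighbors that are actually observed. The function names are hypothetical.

```python
from statistics import mean, pstdev

BASES = "ACGT"

def neighbors(kmer):
    """Yield the 3*k k-mers that differ from `kmer` by one substitution."""
    for i, base in enumerate(kmer):
        for alt in BASES:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]

def z_score(kmer, counts):
    """z-score of `kmer` within the group formed by itself and its observed neighbors.

    Assumption: mean and standard deviation are computed over the frequencies
    of the k-mer and of the neighbors present in `counts`.
    """
    f = counts.get(kmer, 0)
    freqs = [f] + [counts[n] for n in neighbors(kmer) if n in counts]
    sigma = pstdev(freqs)
    if sigma == 0:  # no observed neighbors, or all frequencies equal
        return 0.0
    return (f - mean(freqs)) / sigma

# Toy frequencies (hypothetical): one frequent k-mer and three rare variants.
counts = {"ACGT": 40, "ACGA": 2, "TCGT": 1, "AGGT": 1}
z_true = z_score("ACGT", counts)  # frequent k-mer: positive z-score
z_err = z_score("ACGA", counts)   # rare variant next to it: negative z-score
```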
Determining f_{0}
Unlike existing approaches that determine solid k-mers based on their frequency only, we examine their z-scores as well.
The two distributions (the Gamma distribution of erroneous k-mers and the Gaussian mixture of correct k-mers) are estimated with the EM algorithm from the frequencies of the k-mers. An example of the two distributions is shown in Fig. 1 (the sky-blue dashed line and the orange dashed line). Based on the two distributions, we determine the threshold f_{0} that minimizes the total area marked as α and β. Note that the threshold f_{0} determined in this way may not be the intersection point of the two density functions.
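The cutoff selection can be illustrated with the sketch below, which assumes that the two distributions have already been fitted. For brevity it uses a single Gaussian component instead of a mixture, scans integer cutoffs only, and treats the misclassified mass as erroneous k-mers at or above the cutoff plus correct k-mers below it. All parameter values and the function name `choose_cutoff` are hypothetical, and the EM fitting step is not shown.

```python
import math

def gamma_pdf(x, shape, scale):
    """Density of a Gamma(shape, scale) distribution at x > 0."""
    return math.exp((shape - 1) * math.log(x) - x / scale
                    - math.lgamma(shape) - shape * math.log(scale))

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def choose_cutoff(w_err, shape, scale, w_cor, mu, sigma, max_freq):
    """Return the integer cutoff f0 that minimizes the misclassified mass.

    Erroneous k-mers with frequency >= f0 and correct k-mers with
    frequency < f0 are counted as misclassified.
    """
    freqs = range(1, max_freq + 1)
    err = [w_err * gamma_pdf(f, shape, scale) for f in freqs]
    cor = [w_cor * normal_pdf(f, mu, sigma) for f in freqs]
    best_f0, best_loss = 1, float("inf")
    for f0 in range(1, max_freq + 2):
        # err[i] and cor[i] hold the densities at frequency i + 1.
        loss = sum(err[f0 - 1:]) + sum(cor[:f0 - 1])
        if loss < best_loss:
            best_f0, best_loss = f0, loss
    return best_f0

# Hypothetical fitted parameters: errors pile up near frequency 1-2,
# correct k-mers are centered on a coverage of about 40.
f0 = choose_cutoff(w_err=0.6, shape=1.2, scale=1.5,
                   w_cor=0.4, mu=40.0, sigma=7.0, max_freq=100)
```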
Mining solid k-mers
where \(N^{\text {solid}}_{\text {correct}}\) is the number of correct k-mers among the solid k-mers, and N_{correct} is the total number of correct k-mers.

If f(κ)<f_{0} and z(κ)≥z_{0}, then κ is removed from the weak k-mers and added to the solid k-mers, which increases the completeness.

If f(κ)≥f_{0} and \(z(\kappa) < z^{'}_{0}\), then κ is removed from the solid k-mers and added to the weak k-mers, which improves the purity.
Here, f_{0} is the minimum frequency determined above, z_{0} is the maximum z-score allowed for weak k-mers, and \(z^{'}_{0}\) is the minimum z-score required for solid k-mers.
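The two rules above can be written as a short sketch. The function name `mine_solid_kmers`, the toy frequencies and the threshold values are hypothetical; the z-scores are assumed to have been computed beforehand.

```python
def mine_solid_kmers(freq, z, f0, z0, z0_prime):
    """Split k-mers into (solid, weak) sets using both frequency and z-score."""
    solid, weak = set(), set()
    for kmer, f in freq.items():
        if f < f0:
            # Weak by frequency; promoted to solid if its z-score reaches z0.
            (solid if z[kmer] >= z0 else weak).add(kmer)
        else:
            # Solid by frequency; demoted to weak if its z-score is below z0'.
            (weak if z[kmer] < z0_prime else solid).add(kmer)
    return solid, weak

# Toy example with hypothetical values.
freq = {"AAAA": 50, "AAAC": 3, "CCCC": 4, "GGGG": 12}
z = {"AAAA": 1.8, "AAAC": -0.9, "CCCC": 1.5, "GGGG": -1.2}
solid, weak = mine_solid_kmers(freq, z, f0=10, z0=1.0, z0_prime=-1.0)
```

In this toy case "CCCC" is promoted despite its low frequency and "GGGG" is demoted despite its high frequency.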
Both z_{0} and \(z^{'}_{0}\) are learned automatically from the z-score distribution. To obtain the optimal z_{0}, the z-scores of the k-mers with frequency less than f_{0} are collected. The distribution of these z-scores is then estimated, and z_{0} is set to the value with the lowest density between the two peaks (viz. the trough of the bimodal distribution; see Results for more details). Analogously, \(z^{'}_{0}\) is determined from the z-scores of the k-mers with frequency no less than f_{0}.
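One simple way to locate such a trough is sketched below. It is only a crude stand-in for a proper density estimate: it builds a fixed-width histogram, takes the two highest bins that are at least `min_sep` bins apart as the peaks, and returns the center of the emptiest bin between them. The function name `find_trough`, the bin count and `min_sep` are hypothetical choices.

```python
def find_trough(values, bins=20, min_sep=3):
    """Return the center of the emptiest histogram bin between the two main peaks."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    # Highest bin, then the highest bin at least `min_sep` bins away from it.
    p1 = max(range(bins), key=lambda i: hist[i])
    p2 = max((i for i in range(bins) if abs(i - p1) >= min_sep),
             key=lambda i: hist[i])
    left, right = sorted((p1, p2))
    mid = (left + right) / 2
    # Lowest count wins; ties are broken by closeness to the midpoint of the peaks.
    t = min(range(left, right + 1), key=lambda i: (hist[i], abs(i - mid)))
    return lo + (t + 0.5) * width

# Hypothetical bimodal z-scores: one mode near -1.0 and one near 2.0.
z_values = ([-1.2] * 30 + [-1.0] * 50 + [-0.8] * 30 + [-0.5] * 5 + [0.6] * 2
            + [1.5] * 5 + [1.8] * 30 + [2.0] * 50 + [2.2] * 30)
z0 = find_trough(z_values)
```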
Methods
Our error correction model contains two main steps: (i) building a Bloom filter from the solid k-mers; and (ii) correcting errors in weak k-mers with the Bloom filter.
Build Bloom filter
In our study, m is the number of solid k-mers determined from all the k-mers by the aforementioned algorithm. Following existing approaches, p is set to 1%. One can also tune p, h and n to fit the hardware limitations.
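As an illustration, the sketch below sizes and builds a small Bloom filter. It assumes that p denotes the target false positive rate, n the length of the bit vector and h the number of hash functions, and it uses the standard Bloom filter sizing formulas n = −m·ln p/(ln 2)^2 and h = (n/m)·ln 2. The hashing scheme (double hashing on a SHA-256 digest) and the class name are illustrative choices, not a description of the ZEC implementation.

```python
import hashlib
import math

def bloom_parameters(m, p):
    """Standard sizing: bit count n and hash count h for m items at rate p."""
    n = math.ceil(-m * math.log(p) / (math.log(2) ** 2))
    h = max(1, round(n / m * math.log(2)))
    return n, h

class BloomFilter:
    def __init__(self, m, p=0.01):
        self.n, self.h = bloom_parameters(m, p)
        self.bits = bytearray((self.n + 7) // 8)

    def _positions(self, kmer):
        # Derive h positions from one digest (double hashing); illustrative only.
        digest = hashlib.sha256(kmer.encode()).digest()
        a = int.from_bytes(digest[:8], "big")
        b = int.from_bytes(digest[8:16], "big") | 1
        return [(a + i * b) % self.n for i in range(self.h)]

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(kmer))

# Toy usage with hypothetical solid k-mers.
solid = ["ACGTA", "CGTAC", "GTACG"]
bf = BloomFilter(m=len(solid), p=0.01)
for kmer in solid:
    bf.add(kmer)
```

With p = 1%, the formulas give roughly 9.6 bits per k-mer and 7 hash functions.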
Bloom filters have been used successfully to correct NGS errors, e.g., in BLESS2 [17] and BFC [16]. The major difference between our model and the existing models is that we focus on efficiently refining the solid k-mers used to construct the Bloom filter, which in theory directly improves the error correction performance. Note that the solid k-mers play the key role in error correction, as all the remaining k-mers (viz. the weak k-mers) are corrected based on the solid ones.
Correct errors
1. If G_{w} is the first group and there exists a succeeding group G_{s} that is solid, we iteratively change the first base of each k-mer of G_{w} to its alternatives and check the existence of the resulting k-mers in the Bloom filter. Once a solution makes all the weak k-mers solid, the base change is accepted and the error is corrected. This process is applied to the k-mers of G_{w} from the last one to the first one. If the number of k-mers contained in G_{w} is less than a predefined value, say τ, the preceding solid k-mers that extend from the corrected k-mers are generated until the total number of k-mers in G_{w} is τ. If this criterion cannot be satisfied, the solution is abandoned. On the other hand, if G_{s} does not exist, we iteratively alter the bases of all the k-mers to their alternatives until a solution that makes all the k-mers solid is found.
2. If G_{w} has a solid preceding group G_{s} and a solid succeeding group \(G_{s}^{'}\), we substitute the last base of each k-mer in G_{w} with its alternatives, from the first k-mer to the last k-mer; this is the forward search. Solutions that make all the k-mers solid up to the current substitution are recorded. Similarly, the backward search is conducted on the first base of the k-mers, from the last one to the first one. A solution is accepted if the forward search and the backward search meet and the k-mers contained in both of them are solid. If the number of k-mers in G_{w} is less than k, we only alter the last base of the first k-mer.
3. If G_{w} is the last group and there exists a solid preceding group G_{s}, we apply the backward search to obtain the solution. Analogously to the first situation, if the number of k-mers of G_{w} is less than τ, we extend the k-mers downstream until the number is satisfied. If G_{s} does not exist, the situation is the same as the second part of the first situation, and the same approach is applied.
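The sketch below illustrates only the simplest form of this idea: a single substitution error in a weak group that has a solid preceding group, so that the last base of the first weak k-mer is the suspect position. It does not implement the bidirectional search, the τ extension or the handling of groups at the ends of a read. The function names are hypothetical, and a Python set stands in for the Bloom filter (any object supporting `in` would do).

```python
BASES = "ACGT"

def kmers_covering(read, pos, k):
    """All k-mers of `read` that include position `pos`."""
    start = max(0, pos - k + 1)
    end = min(pos, len(read) - k)
    return [read[i:i + k] for i in range(start, end + 1)]

def correct_single_error(read, solid, k):
    """Try one substitution so that every k-mer covering the suspect base is solid."""
    weak = [i for i in range(len(read) - k + 1) if read[i:i + k] not in solid]
    if not weak:
        return read
    # Assumption: the weak group follows a solid group, so the last base of
    # the first weak k-mer is the erroneous one.
    pos = weak[0] + k - 1
    for alt in BASES:
        if alt == read[pos]:
            continue
        candidate = read[:pos] + alt + read[pos + 1:]
        if all(km in solid for km in kmers_covering(candidate, pos, k)):
            return candidate
    return read  # no single substitution makes the group solid

# Toy example: hypothetical genome, solid k-mers taken from it, one error at position 7.
genome = "ACGTTGCAAGGCTA"
k = 4
solid = {genome[i:i + k] for i in range(len(genome) - k + 1)}
read = genome[:7] + "T" + genome[8:]
fixed = correct_single_error(read, solid, k)
```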
Results
Datasets
Table 1 The data sets used for evaluating the performance of error correction models
Data set  Genome name  Genome size (bp)  Error rate (%)  Read length (bp)  Coverage  Number of reads  Insert length  Synthetic
R1  S. aureus  2,821,361  1.28  101  46.3×  1,294,104  180  No
R2  R. sphaeroides  4,603,110  1.08  101  45.0×  2,050,868  180  No
R3  H. chromosome 14  88,218,286  0.52  101  41.8×  36,504,800  155  No
R4  B. impatiens  249,185,056  0.86  124  150.8×  303,118,594  400  No
S1  H. chromosome 14  88,218,286  0.97  101  41.8×  36,504,800  180  Yes
S2  B. impatiens  249,185,056  0.98  124  150.8×  303,118,594  400  Yes
Performance evaluation
The error correction performance is evaluated through the widely accepted procedure implemented in [30]. The metrics considered include gain, recall, precision and per-base error rate (pber). Gain is defined as (TP−FP)/(TP+FN), recall is TP/(TP+FN), precision is TP/(TP+FP) and pber is N^{e}/N, where TP is the number of corrected bases that are truly erroneous, FP is the number of corrected bases that are not actually sequencing errors, FN is the number of erroneous bases that remain untouched, N^{e} is the number of erroneous bases and N is the total number of bases. Among these metrics, gain is the most informative.
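The four metrics follow directly from these definitions, as the sketch below shows. The function name and the counts in the example are hypothetical and are not taken from the experiments.

```python
def correction_metrics(tp, fp, fn, n_err, n_total):
    """Gain, recall, precision and per-base error rate as defined in the text."""
    return {
        "gain": (tp - fp) / (tp + fn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "pber": n_err / n_total,
    }

# Hypothetical counts for illustration only.
m = correction_metrics(tp=900, fp=20, fn=100, n_err=120, n_total=100_000)
```

Note that gain equals recall·(2 − 1/precision), so gain becomes negative whenever precision drops below 0.5, i.e., when a corrector introduces more errors than it removes.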
All experiments were carried out on a cluster with eight Intel Xeon E7 CPUs and 1 TB of RAM. Each CPU has eight cores.
Table 2 Error-correction performance comparison between ZEC, Lighter, Racer, BLESS2, Musket, BFC, SGA and MEC
Data  Corrector  Gain  Recall  Precision  Pber (%)
R1  ZEC  0.908  0.912  0.996  0.102
R1  Lighter  0.839  0.845  0.994  0.163
R1  Racer  0.760  0.822  0.929  0.190
R1  BLESS2  0.189  0.409  0.650  0.879
R1  Musket  0.499  0.628  0.830  0.448
R1  SGA  0.746  0.815  0.922  0.202
R1  BFC  0.753  0.817  0.927  0.196
R1  MEC  0.909  0.911  0.998  0.102
R2  ZEC  0.584  0.663  0.894  0.537
R2  Lighter  0.226  0.329  0.762  1.076
R2  Racer  0.364  0.450  0.839  0.780
R2  BLESS2  0.318  0.405  0.806  0.890
R2  Musket  0.265  0.364  0.786  0.984
R2  SGA  0.331  0.423  0.822  0.843
R2  BFC  0.306  0.400  0.811  0.893
R2  MEC  0.570  0.631  0.912  0.541
R3  ZEC  0.802  0.923  0.884  0.087
R3  Lighter  0.445  0.764  0.706  0.256
R3  Racer  0.562  0.814  0.764  0.196
R3  BLESS2  0.130  0.641  0.556  0.438
R3  Musket  0.533  0.802  0.749  0.211
R3  SGA  0.567  0.818  0.765  0.194
R3  BFC  0.603  0.833  0.783  0.176
R3  MEC  0.788  0.852  0.930  0.117
R4  ZEC  0.746  0.833  0.905  0.137
R4  Lighter  0.126  0.408  0.591  0.688
R4  Racer  0.313  0.541  0.703  0.484
R4  BLESS2  0.517  0.018  0.003  0.862
R4  Musket  0.502  0.660  0.807  0.320
R4  SGA  0.542  0.690  0.823  0.289
R4  BFC  0.195  0.457  0.636  0.607
R4  MEC  0.705  0.806  0.889  0.201
S1  ZEC  0.918  0.935  0.982  0.056
S1  Lighter  0.791  0.851  0.934  0.130
S1  Racer  0.882  0.916  0.964  0.071
S1  BLESS2  0.634  0.740  0.875  0.243
S1  Musket  0.819  0.871  0.944  0.111
S1  SGA  0.810  0.865  0.940  0.117
S1  BFC  0.866  0.903  0.961  0.081
S1  MEC  0.899  0.916  0.982  0.063
S2  ZEC  0.853  0.894  0.956  0.109
S2  Lighter  0.058  0.329  0.548  0.891
S2  Racer  0.168  0.408  0.630  0.720
S2  BLESS2  0.311  0.509  0.719  0.543
S2  Musket  0.232  0.453  0.672  0.636
S2  SGA  0.075  0.342  0.562  0.862
S2  BFC  0.751  0.822  0.920  0.157
S2  MEC  0.849  0.887  0.959  0.122
Comparison with the state of the art. ZEC compares favorably with the state-of-the-art methods, including Lighter [14], Racer [12], BLESS2 [17], Musket [13], SGA [11], BFC [16] and MEC [25]; see Table 2. In terms of the most informative evaluation metric, gain, ZEC outperforms the existing error correctors on five of the six data sets and is within 0.001 of the best method, MEC, on R1. For instance, on the data set R4, the gain of ZEC is 0.746, while the best gain produced by the other methods is 0.705. On the synthetic data sets, ZEC also has a higher gain than the other methods. For example, on the data set S2, the gain of ZEC is 0.853, while the best and worst gains produced by the other methods are 0.849 and 0.058, respectively. ZEC also has the lowest per-base error rate on every data set (tied with MEC on R1), which further supports its effectiveness.
Distinguishability of z-score
Efficiency of z-score calculation
Calculating the z-scores of k-mers is not trivial for very large data sets, as the k-mers and their frequencies are usually too large to be held in the main memory of a moderate computer. We designed a novel algorithm to solve this problem and studied its efficiency in terms of memory usage and running speed.
Regarding the running speed, the algorithm scales linearly. Since locating each k-mer in a bit vector by hashing takes O(1) time, the algorithm is fast. For instance, on our hardware, it takes only 387 s to construct the bit vectors and calculate the z-scores of all the k-mers of R4, the largest data set.
Since a Bloom filter has false positives, the z-score of a k-mer may differ from its genuine value. However, the false positive rate is small, usually less than 1%, so this impact can be neglected.
Discussion
Our model effectively pinpoints correct k-mers that have low frequency, achieving an improvement of 11.25% on weak k-mers. However, some issues still require further exploration, including neighbor inclusion and neighbor retrieval.
Neighbor inclusion concerns how the neighbor k-mers of a k-mer of interest, say κ, are determined. Our current approach takes k-mers within an edit distance of 1 as the neighbors of κ, but there is still a small chance that a true neighbor has an edit distance larger than 1. Suppose the error rate is e; the probability of a k-mer having exactly one error is k·e(1−e)^{k−1}/(k·e)=(1−e)^{k−1}. When e=1% and k=31, the probability is (1−0.01)^{31−1}=73.97%. That is, about 26% of the real neighbors are excluded. However, even extending the edit distance from 1 to 2 significantly increases the running time, because the number of candidate k-mers increases from 3∗k to 3∗k∗3∗(k−1).
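The figures in this paragraph can be reproduced with a few lines of arithmetic. The function names below are hypothetical; the formulas are the ones stated in the text.

```python
def prob_single_error(e, k):
    """(1 - e)^(k - 1), the quantity used in the text for neighbors at distance 1."""
    return (1 - e) ** (k - 1)

def candidate_count(k, distance):
    """Candidate counts as stated in the text: 3k for distance 1, 3k * 3(k - 1) for distance 2."""
    if distance == 1:
        return 3 * k
    if distance == 2:
        return 3 * k * 3 * (k - 1)
    raise ValueError("only distances 1 and 2 are covered in the text")

k = 31
p_one = prob_single_error(0.01, k)
print(f"{p_one:.4f}")                                 # 0.7397
print(candidate_count(k, 1), candidate_count(k, 2))   # 93 8370
```

For k = 31, moving from distance 1 to distance 2 multiplies the number of candidates per k-mer by 90.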
Neighbor retrieval is another issue to be considered. Usually, the set of counted k-mers is too large to fit into main memory, so a more sophisticated approach is required. We use Bloom filters to overcome this limitation. For k-mers with a small count, say 5, we store them in classical Bloom filters, where each Bloom filter stores the k-mers having the same count. For k-mers with a large count, we use a coupled Bloom filter: one Bloom filter encodes the k-mers, while the other represents their counts. This approach significantly reduces memory usage while achieving constant-time k-mer retrieval. However, it may produce false positives, although the probability is small. Hence, more effort is required to handle this problem.
Conclusions
We have proposed a novel method for correcting NGS errors. The novel idea is to use statistically-solid k-mers to construct the Bloom filter. These k-mers are mined from all the k-mers of an NGS data set by considering both their frequency and their z-score; the latter in particular can effectively fish out solid k-mers that have low frequency. Pinpointing such k-mers has been a very challenging problem. The experimental results show that our approach markedly outperforms the existing state-of-the-art methods in terms of error correction performance.
Declarations
Acknowledgments
We thank the anonymous reviewers for their valuable comments and insightful suggestions.
Funding
This study is collectively supported by the National Natural Science Foundation of China (No. 31501070), the Natural Science Foundation of Hubei (No. 2017CFB137) and Guangxi (No. 2016GXNSFCA380006), the Scientific Research Foundation of GuangXi University (No. XGZ150316) and Taihe hospital (No. 2016JZ11), and the Australian Research Council (ARC) Discovery Project 180100120. Publication costs are funded by the National Natural Science Foundation of China (No. 31501070).
Availability of data and materials
The source codes are available at github.com/lzhlab/zec/.
About this supplement
This article has been published as part of BMC Genomics Volume 19 Supplement 10, 2018: Proceedings of the 29th International Conference on Genome Informatics (GIW 2018): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume19supplement10.
Authors’ contributions
LZ conceived and designed the experiments, LZ and JL wrote the manuscript. Program coding: LZ, YW and ZZ. Data analyses: LZ, JX, LB, WC, MW and ZZ. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
1. Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novo stand-alone error correction methods for NGS data. WIREs Comput Mol Sci. 2016; 6:111–46.
2. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010; 467:1061–73.
3. Kelley DR, Schatz MC, Salzberg SL. Quake: Quality-aware detection and correction of sequencing errors. Genome Biol. 2010; 11(11):116.
4. Hackl T, Hedrich R, Schultz J, Förster F. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics. 2014; 30(21):3004–11.
5. Goodwin S, Gurtowski J, Ethe-Sayers S, Deshpande P, Schatz MC, McCombie WR. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 2015; 25(11):1750–1756.
6. Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marcais G, Pop M, Yorke JA. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2011; 22(3):557–67.
7. Zhao Z, Yin J, Li Y, Xiong W, Zhan Y. An Efficient Hybrid Approach to Correcting Errors in Short Reads. In: Modeling Decision for Artificial Intelligence: 8th International Conference, MDAI 2011, Changsha, Hunan, China, July 28-30, 2011, Proceedings. Berlin: Springer; 2011. p. 198–210.
8. Salmela L, Schröder J. Correcting errors in short reads by multiple alignments. Bioinformatics. 2011; 27(11):1455–61.
9. Yang X, Dorman KS, Aluru S. Reptile: Representative tiling for short read error correction. Bioinformatics. 2010; 26:2526–33.
10. Liu Y, Schmidt B, Maskell DL. DecGPU: Distributed error correction on massively parallel graphics processing units using CUDA and MPI. BMC Bioinforma. 2011; 12:85.
11. Simpson JT, Durbin R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 2012; 22(3):549–56.
12. Ilie L, Molnar M. Racer: Rapid and accurate correction of errors in reads. Bioinformatics. 2013; 29(19):2490–3.
13. Liu Y, Schröder J, Schmidt B. Musket: A multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics. 2013; 29(3):308–15.
14. Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol. 2014; 15(11):1–13.
15. Greenfield P, Duesing K, Papanicolaou A, Bauer DC. Blue: correcting sequencing errors using consensus and context. Bioinformatics. 2014; 30(19):2723–32.
16. Li H. Correcting Illumina sequencing errors for human data. arXiv preprint. 2015; arXiv:1502.03744.
17. Heo Y, Ramachandran A, Hwu WM, Ma J, Chen D. BLESS 2: accurate, memory-efficient and fast error correction method. Bioinformatics. 2016; 32(15):2369–71.
18. Xiao CL, Chen Y, Xie SQ, Chen KN, Wang Y, Han Y, Luo F, Xie Z. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat Methods. 2017; 14:1072–4.
19. Schröder J, Schröder H, Puglisi SJ, Sinha R, Schmidt B. SHREC: A short-read error correction method. Bioinformatics. 2009; 25(17):2157–63.
20. Salmela L. Correction of sequencing errors in a mixed set of reads. Bioinformatics. 2010; 26(12):1284–90.
21. Ilie L, Fazayeli F, Ilie S. HiTEC: Accurate error correction in high-throughput sequencing data. Bioinformatics. 2011; 27(3):295–302.
22. Schulz MH, Weese D, Holtgrewe M, Dimitrova V, Niu S, Reinert K, Richard H. Fiona: a parallel and automatic strategy for read error correction. Bioinformatics. 2014; 30(17):356–63.
23. Kao WC, Chan AH, Song YS. ECHO: A reference-free short-read error correction algorithm. Genome Res. 2011; 21:1181–92.
24. Chen CC, Chang YJ, Chung WC, Lee DT, Ho JM. CloudRS: An error correction algorithm of high-throughput sequencing data based on scalable framework. In: Big Data, 2013 IEEE International Conference On: 2013. p. 717–22.
25. Zhao L, Chen Q, Li W, Jiang P, Wong L, Li J. MapReduce for accurate error correction of next-generation sequencing data. Bioinformatics. 2017; 33(23):3844–51.
26. Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, Nusbaum C, Jaffe DB. Characterizing and measuring bias in sequence data. Genome Biol. 2013; 14(5):51.
27. Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics. 2015; 31(10):1569–76.
28. Bloom BH. Space/Time Tradeoffs in Hash Coding with Allowable Errors. Commun ACM. 1970; 13(7):422–6.
29. Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012; 28(4):593–4.
30. Molnar M, Ilie L. Correcting Illumina data. Brief Bioinform. 2014. https://doi.org/10.1093/bib/bbu029.