Rationale behind the indel error correction
Since we leverage the lower error rate of Illumina reads to correct the PacBio indel errors, let us first describe an error model for Illumina sequences and its consequences for the DBG constructed from these reads. We first observe that k-mers, DNA words of a fixed length k, tend to have similar abundances within a read. This is a well-known property of k-mers that stems from each read originating from a single source molecule of DNA [27]. Let us consider two reads R1 and R2 representing the same region of the genome, where R1 contains a single erroneous base. Assuming that the k-mers between positions \(pos_{begin}\) and \(pos_{end}\) cover the error region in R1, with the erroneous base located at position \(pos_{error} = \frac{pos_{end} + pos_{begin}}{2}\), we can make the following claim.
Claim 1: The coverage of at least one k-mer of R1 in the region between \(pos_{begin}\) and \(pos_{end}\) is lower than the coverage of any k-mer in the same region of R2. A brief theoretical rationale for the claim can be found in Additional file 1, and Fig. 1 illustrates the intuition behind it.
Rationale behind the substitution error correction
After correcting the indel errors with the Illumina reads, a substantial number of substitution errors are introduced into the PacBio reads, since substitution errors dominate in Illumina short-read sequences. To rectify those errors, we first divide each PacBio long read into smaller subregions, similar in length to the short reads. Next, we classify a subregion as erroneous only if most of its k-mers have high coverage and only a few low-coverage k-mers exist as outliers.
Specifically, we use Pearson’s skew coefficient (or median skew coefficient) to classify the true and error subregions. Figure 2 shows histograms of three different types of subregions in a genomic dataset. Figure 2a has similar numbers of low- and high-coverage k-mers, making the skewness of this subregion almost zero. Hence, it is not considered erroneous. Figure 2b is also classified as true because the subregion is mostly populated with low-coverage k-mers. Figure 2c is classified as error because the subregion is largely skewed towards the high-coverage k-mers, and only a few low-coverage k-mers exist as outliers. Existing substitution error correction tools do not analyze the coverage of neighboring k-mers and often classify the true yet low-coverage k-mers (e.g., Fig. 2b) as errors.
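To make the classification rule concrete, the following Python sketch computes Pearson’s second (median-based) skew coefficient over the k-mer coverages of a fragment. The threshold value and the sign convention used to flag a fragment are illustrative placeholders, not ParLECH’s actual parameters.

```python
from statistics import mean, median, pstdev

def pearson_median_skew(coverages):
    """Pearson's second skew coefficient: 3 * (mean - median) / stddev."""
    sd = pstdev(coverages)
    if sd == 0:
        return 0.0
    return 3.0 * (mean(coverages) - median(coverages)) / sd

def looks_erroneous(coverages, skew_threshold=1.0):
    """Flag a fragment whose coverage histogram is dominated by high-coverage
    k-mers with a few low-coverage outliers: those outliers pull the mean
    below the median, giving a strongly negative coefficient.
    The threshold is a placeholder, not ParLECH's default."""
    return pearson_median_skew(coverages) < -skew_threshold
```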
Another major advantage of our median-based methodology is that its accuracy depends less on the value of k. Median values are robust because, for a relatively small value of k, a few substitution errors will not alter the median k-mer abundance of the read [28]. However, these errors will increase the skewness of the read. The robustness of the median values in the presence of sequencing errors is shown mathematically in Additional file 1.
Big data framework in the context of genomic error correction
Error correction for sequencing data is not only data- and compute-intensive but also search-intensive, because the size of the k-mer spectrum increases almost exponentially with the value of k (i.e., up to \(4^{k}\) unique k-mers), and we need to search this huge space. For example, a large genome with 1 million reads of length 5000 bp involves more than 5 billion searches in a set of almost 10 billion unique k-mers. Since existing hybrid error correction tools are not designed for large-scale genome sequence data such as human genomes, we design ParLECH as a scalable and distributed framework equipped with Hadoop and Hazelcast.
Hadoop is an open-source abstraction of Google’s MapReduce, which is a fully parallel and distributed framework for large-scale computation. It reads the data from a distributed file system called Hadoop Distributed File System (HDFS) in small subsets. In the Map phase, a Map function executes on each subset, producing the output in the form of key-value pairs. These intermediate key-value pairs are then grouped based on the unique keys. Finally, a Reduce function executes on each group, producing the final output on HDFS.
Hazelcast [29] is a NoSQL database, which stores large-scale data in distributed memory using a key-value format. Hazelcast uses MurmurHash to distribute the data evenly over multiple nodes and to reduce collisions. The data can be stored in and retrieved from Hazelcast using hash table functions (such as get and put) in O(1) time. Multiple Map and Reduce functions can access this hash table simultaneously and independently, improving the search performance of ParLECH.
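As a minimal sketch of this key-value access pattern, the snippet below stores and retrieves a k-mer record in a Hazelcast distributed map using the hazelcast-python-client. The map name and record layout are assumptions for illustration only; ParLECH itself accesses Hazelcast from its Hadoop (Java) jobs.

```python
import hazelcast

# Connect to a running Hazelcast cluster (defaults to localhost:5701).
client = hazelcast.HazelcastClient()

# A distributed map acting as the k-mer spectrum; the name "kmer-spectrum"
# and the record layout are illustrative, not ParLECH's actual schema.
kmer_map = client.get_map("kmer-spectrum").blocking()

# put/get behave like ordinary hash-table operations, amortized O(1).
kmer_map.put("ACGTACGTTACGG", {"count": 37, "in": ["T"], "out": ["A", "C"]})
record = kmer_map.get("ACGTACGTTACGG")
print(record["count"])

client.shutdown()
```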
Error correction pipeline
Figure 3 shows the indel error correction pipeline of ParLECH. It consists of three phases: 1) constructing a de Bruijn graph, 2) locating errors in long reads, and 3) correcting the errors. We store the raw sequencing reads in HDFS, while Hazelcast is used to store the de Bruijn graph created from the Illumina short reads. We develop the graph construction algorithm following the MapReduce programming model and use Hadoop for this purpose. In the subsequent phases, we use both Hadoop and Hazelcast to locate and correct the indel errors. Finally, we write the indel error-corrected reads into HDFS. We describe each phase in detail in the subsequent sections.
ParLECH has three major steps for hybrid correction of indel errors, as shown in Fig. 4. In the first step, we construct a DBG from the Illumina short reads, storing the coverage information of each k-mer in its vertex. In the second step, we partition each PacBio long read into a sequence of strong and weak regions (i.e., correct and error regions, respectively) based on the k-mer coverage information stored in the DBG. We select the right and left boundary k-mers of two consecutive strong regions as the source and destination vertices, respectively, in the DBG. Finally, in the third step, we replace each weak region (i.e., indel error region) of the long read between those two boundary k-mers with the corresponding widest path in the DBG, which maximizes the minimum k-mer coverage between those two vertices.
Figure 5 shows the substitution error correction pipeline of ParLECH. It has two different phases: 1) locating errors and 2) correcting errors. Like the indel error correction, the computation of both phases is fully distributed with Hadoop. These Hadoop-based algorithms work on top of the indel error-corrected reads that were generated in the last phase and stored in HDFS. The same k-mer spectrum that was generated from the Illumina short reads and stored in Hazelcast is used to correct the substitution errors as well.
De Bruijn graph construction and k-mer counting
Algorithm 1 explains the MapReduce algorithm for de Bruijn graph construction, and Fig. 6 shows the working of the algorithm. The map function scans each read of the data set and emits each k-mer as an intermediate key and its previous and next k-mers as the value. The intermediate key represents a vertex in the de Bruijn graph, whereas the previous and next k-mers in the intermediate value represent an incoming edge and an outgoing edge, respectively. An associated count of occurrence (1) is also emitted as a part of the intermediate value. After the map function completes, the shuffle phase partitions these intermediate key-value pairs on the basis of the intermediate key (the k-mer). Finally, the reduce function accumulates all the previous and next k-mers corresponding to the key as the incoming and outgoing edges, respectively. The same reduce function also sums all the intermediate counts (i.e., 1) emitted for that particular k-mer. At the end of the reduce function, the entire graph structure and the count for each k-mer are stored in the NoSQL database of Hazelcast using Hazelcast’s put method. For improved performance, we emit only a single nucleotide character (i.e., A, T, G, or C) instead of the entire k-mer to store the incoming and outgoing edges. The actual k-mer can be obtained by prepending/appending that character to the (k−1)-length prefix/suffix of the vertex k-mer.
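The single-machine Python sketch below mirrors this map/reduce logic, assuming a small illustrative k and an in-memory dictionary in place of the Hadoop shuffle and the Hazelcast store; the function and field names are ours, not ParLECH’s.

```python
from collections import defaultdict

K = 5  # illustrative k-mer length only

def mapper(read):
    """Emit (k-mer, (prev_char, next_char, 1)) for every k-mer of a read.
    prev_char/next_char encode the incoming/outgoing edges as single bases."""
    for i in range(len(read) - K + 1):
        kmer = read[i:i + K]
        prev_char = read[i - 1] if i > 0 else None
        next_char = read[i + K] if i + K < len(read) else None
        yield kmer, (prev_char, next_char, 1)

def reducer(pairs):
    """Group by k-mer, accumulating edge characters and occurrence counts."""
    graph = defaultdict(lambda: {"in": set(), "out": set(), "count": 0})
    for kmer, (prev_char, next_char, one) in pairs:
        node = graph[kmer]
        if prev_char:
            node["in"].add(prev_char)
        if next_char:
            node["out"].add(next_char)
        node["count"] += one
    return graph

reads = ["ACGTACGTTACGGA", "CGTACGTTACGGAT"]  # placeholder short reads
graph = reducer(pair for read in reads for pair in mapper(read))
# In ParLECH, the aggregated records would instead be written to Hazelcast.
```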
Locating the indel errors in long reads
To locate the errors in the PacBio long reads, ParLECH uses the k-mer coverage information from the de Bruijn graph stored in Hazelcast. The entire process is designed in an embarrassingly parallel fashion and developed as a Hadoop Map-only job. Each map task scans through each of the PacBio reads and generates the k-mers with the same value of k as in the de Bruijn graph. Then, for each of those k-mers, we search for its coverage in the graph. If the coverage falls below a predefined threshold, we mark the k-mer as weak, indicating an indel error in the long read. It is possible to find more than one consecutive error in a long read; in that case, we mark the entire region as weak. If the coverage is above the predefined threshold, we denote the region as strong or correct. To rectify the weak regions, ParLECH uses the widest path algorithm described in the next subsection.
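A minimal sketch of this scan, assuming an in-memory coverage dictionary and placeholder values for k and the coverage threshold, is shown below; in ParLECH the lookups go to the Hazelcast-resident graph instead.

```python
def weak_regions(read, coverage, k=5, threshold=3):
    """Return (start, end) base-index ranges spanning runs of consecutive
    low-coverage k-mers. `coverage` maps k-mer -> count; k and threshold
    are illustrative placeholders."""
    regions, start = [], None
    for i in range(len(read) - k + 1):
        weak = coverage.get(read[i:i + k], 0) < threshold
        if weak and start is None:
            start = i                              # a weak run begins
        elif not weak and start is not None:
            regions.append((start, i + k - 2))     # last base of previous k-mer
            start = None
    if start is not None:
        regions.append((start, len(read) - 1))
    return regions
```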
Correcting the indel errors
Like locating the errors, our correction algorithm is also embarrassingly parallel and developed as a Hadoop Map-only job. Like LoRDEC, we use the pair of strong k-mers that enclose a weak region of a long read as the source and destination vertices in the DBG. Any path in the DBG between those two vertices denotes a sequence that can be assembled from the short reads. We implement the widest path algorithm for this local assembly. The widest path algorithm maximizes the minimum k-mer coverage of a path in the DBG. We use the widest path based on our assumption that, for the same region of a genome, a path generated from a read with sequencing errors is more likely to contain a k-mer of minimum coverage than a path generated from an error-free read. In other words, even if a path contains some high-coverage k-mers, an erroneous path is highly likely to include some low-coverage k-mer that prevents it from being selected as the widest path, as illustrated in Fig. 1.
Therefore, ParLECH is equipped with the widest path technique to find a more accurate sequence to correct the weak region in the long read. Algorithm 2 shows the widest path algorithm implemented in ParLECH, a slight modification of Dijkstra’s shortest path algorithm using a priority queue, which leads to a time complexity of O(E log V). Instead of computing the shortest paths, ParLECH traverses the graph and updates the width of each path from the source vertex as the minimum width of any edge on the path (line 15).
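The sketch below illustrates this max-min variant of Dijkstra’s algorithm in Python; it uses an adjacency-list graph and, as a simplification, takes the coverage of the vertex being entered as the width of each edge. The data structures and names are illustrative, not Algorithm 2 verbatim.

```python
import heapq

def widest_path(graph, coverage, source, dest):
    """Find the source-to-dest path maximizing the minimum coverage seen.
    `graph` maps vertex -> iterable of neighbours; `coverage` maps vertex ->
    k-mer count. Dijkstra-like traversal with a max-heap, O(E log V)."""
    best = {source: coverage.get(source, 0)}
    prev = {}
    heap = [(-best[source], source)]            # negate widths for a max-heap
    while heap:
        width, u = heapq.heappop(heap)
        width = -width
        if u == dest:
            break
        if width < best.get(u, 0):
            continue                            # stale heap entry
        for v in graph.get(u, ()):
            new_width = min(width, coverage.get(v, 0))
            if new_width > best.get(v, -1):
                best[v] = new_width             # found a wider path to v
                prev[v] = u
                heapq.heappush(heap, (-new_width, v))
    if dest not in best:
        return None                             # weak region cannot be bridged
    path, node = [dest], dest
    while node != source:
        node = prev[node]
        path.append(node)
    return list(reversed(path))
```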
Locating the substitution error
Algorithm 3 shows the process of locating substitution base errors. To locate the substitution errors in the long reads, we first divide the long reads into shorter fragments. As the k-mers in a smaller subregion tend to have similar abundances [27], this divides the longer reads into a sequence of high- and low-coverage fragments. If a fragment belongs to a low-coverage area of the genome, most of the k-mers in that fragment are expected to have low coverage. Otherwise, the k-mers are expected to have high coverage. This methodology enables ParLECH to better distinguish between true-yet-low-coverage and error-yet-high-coverage k-mers. By default, ParLECH uses the length of the short reads as the length of the shorter fragments, but this can easily be changed to a user-defined length. The last fragment of a long read can be shorter than the default (or user-defined) length; it is always ignored during substitution error correction, as it is considered too short to gather reliable statistics.
After dividing the long reads into shorter fragments, we calculate Pearson’s skew coefficient (referred to as skewThreshold in Algorithm 3) of the k-mer coverage of each fragment and use it as a threshold to classify those fragments as true or erroneous. If the skew coefficient of the fragment lies within a certain interval, the fragment is classified as a true fragment without any error. Furthermore, the fragments with mostly low-coverage k-mers are also ignored. All the other fragments (i.e., the fragments highly skewed towards high-coverage k-mers) are classified as erroneous. Through this classification, all the low-coverage areas of the genome are considered correct even though they contain low-coverage k-mers, as long as those k-mers have coverage similar to that of the neighboring k-mers.
After classifying the fragments as true or erroneous, we divide all the error fragments into high- and low-coverage groups. If the median k-mer coverage of a fragment is greater than the median coverage of the entire k-mer spectrum, the fragment is classified as high coverage; otherwise, the fragment belongs to a low-coverage area. ParLECH uses the pattern of true and error k-mers to localize the errors and searches for the set of corrections with maximum likelihood that makes all the k-mers true.
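As a simplified sketch of these two steps, the code below splits an indel-corrected long read into non-overlapping fragments of the short-read length and tags a fragment as high or low coverage by comparing its median k-mer coverage with the median of the whole spectrum; the fragment layout, k, and helper names are assumptions for illustration.

```python
from statistics import median

def split_into_fragments(read, fragment_len):
    """Split a long read into fragments of the short-read length; a trailing
    fragment shorter than fragment_len is dropped, as it is considered too
    short to gather reliable statistics."""
    frags = [read[i:i + fragment_len] for i in range(0, len(read), fragment_len)]
    return [f for f in frags if len(f) == fragment_len]

def is_high_coverage(fragment, coverage, spectrum_median, k=5):
    """Tag an error fragment as high coverage when its median k-mer coverage
    exceeds the median coverage of the entire k-mer spectrum."""
    covs = [coverage.get(fragment[i:i + k], 0)
            for i in range(len(fragment) - k + 1)]
    return median(covs) > spectrum_median
```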
Correcting the substitution error
To rectify the substitution errors, ParLECH uses a majority voting algorithm similar to that of Quake [4]. However, we have two major differences. First, ParLECH’s majority voting algorithm is fully distributed and can scale over hundreds of nodes. Second, unlike Quake, ParLECH uses different thresholds for the low- and high-coverage areas of the genome to improve the accuracy. For each error base detected in the previous phase, ParLECH substitutes the base with each of the four nucleotide characters (i.e., A, T, G, and C) and calculates the coverage of all the k-mers containing that base. Finally, the error base is replaced with the nucleotide such that all the k-mers containing that base meet or exceed the threshold specified for that area.
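A minimal single-base sketch of this voting step is given below, assuming an in-memory coverage dictionary and a small illustrative k; the tie-breaking rule (prefer the substitution with the largest minimum k-mer coverage) is our simplification, not necessarily ParLECH’s.

```python
def correct_base(read, pos, coverage, threshold, k=5):
    """Try each nucleotide at `pos` and keep one for which every k-mer
    overlapping that position meets the coverage threshold of this area.
    Ties are broken by the largest minimum coverage (a simplification)."""
    best_base, best_score = read[pos], -1
    for base in "ATGC":
        candidate = read[:pos] + base + read[pos + 1:]
        lo = max(0, pos - k + 1)                    # first k-mer covering pos
        hi = min(len(candidate) - k, pos)           # last k-mer covering pos
        covs = [coverage.get(candidate[i:i + k], 0) for i in range(lo, hi + 1)]
        if covs and min(covs) >= threshold and min(covs) > best_score:
            best_base, best_score = base, min(covs)
    return best_base
```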