Accelerating read mapping with FastHASH
© Xin et al.; licensee BioMed Central Ltd. 2013
Published: 21 January 2013
With the introduction of next-generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. The success of all medical and genetic applications of next-generation sequencing critically depends on the existence of computational techniques that can process and analyze the enormous amount of sequence data quickly and accurately. Unfortunately, the current read mapping algorithms have difficulties in coping with the massive amounts of data generated by NGS.
We propose a new algorithm, FastHASH, which drastically improves the performance of the seed-and-extend type hash table based read mapping algorithms, while maintaining the high sensitivity and comprehensiveness of such methods. FastHASH is a generic algorithm compatible with all seed-and-extend class read mapping algorithms. It introduces two main techniques, namely Adjacency Filtering, and Cheap K-mer Selection.
We implemented FastHASH and merged it into the codebase of the popular read mapping program, mrFAST. Depending on the edit distance cutoffs, we observed up to 19-fold speedup while still maintaining 100% sensitivity and high comprehensiveness.
Massively parallel sequencing, or so-called next-generation sequencing (NGS), technologies have substantially changed the way biological research is performed since 2000. With these new DNA sequencing platforms, we can now investigate human genome diversity between populations, find genomic variants that are likely to cause diseases [3–8], and investigate the genomes of the great ape species [9–14] and even ancient hominids [15, 16] to understand our own evolution. Despite all the revolutionary power these new sequencing platforms offer, they also present difficult computational challenges due to 1) the massive amount of data produced, 2) shorter read lengths, which result in more mapping locations, and 3) higher sequencing error rates compared to traditional capillary-based sequencing.
With NGS platforms, such as the popular Illumina platform, billions of raw short reads are generated at high speed. Each short read represents a contiguous DNA fragment (e.g., 100 base-pairs (bp)) from the sequencing subject. After the short reads are generated, the first step is to map (i.e., align) the reads to a known reference genome. The mapping process is computationally very expensive since the reference genome is very large (e.g., the human genome has 3.2 gigabase-pairs). The software performing the mapping, called the mapper, has to search (query) a very large reference genome database to map millions of short reads. Even worse, each short read may contain edits (base-pairs that differ from the reference fragment, including mismatches, insertions and deletions), which requires expensive approximate searching. In addition, the ubiquitous common repeats and segmental duplications within the human genome complicate the task, since a short read from such a genome segment corresponds to a large number of mapping locations in the reference genome.
To simplify searching a large database such as the human genome, previous work has developed several algorithms that fall into one of the two categories: seed-and-extend heuristic methods and suffix-array mapping methods.
The seed-and-extend heuristic is built on the observation that, for a correct mapping, the short query read and its corresponding reference fragment, which is the piece of the reference genome that the query read should map to, must share some brief regions (usually 10-100 base-pair-long) of exact or inexact matches. These shorter shared regions, which indicate high similarity between the query read and the reference fragment, are called seeds. By identifying the seeds of a query read, the mapper narrows the search range from the whole genome down to only the neighborhood of each seed. Seeds are generated by preprocessing the reference genome and storing the locations of their occurrences in the reference genome in a separate data structure. During mapping, a seed-and-extend mapper first analyzes the query read to identify its seeds. Then, the mapper tries to extend the read at each of the seed locations via dynamic programming algorithms such as the Smith-Waterman or Needleman-Wunsch algorithm.
On the other hand, the suffix-array mapping methods analyze the reference genome and transform it into a suffix-array data structure, which mimics a suffix-tree of the reference genome. Each edge of this suffix-tree is labeled with one of the four base-pair types, and each node contains all occurrence locations of its suffix. Walking through the tree from the root to a leaf while concatenating the base-pairs on the edges along the path forms a unique suffix of the reference genome. Every leaf node of the tree stores all mapping locations of this unique suffix in the reference genome. Searching for a query read is equivalent to walking through the reference suffix-tree from the root to a leaf node following the query read's sequence. If there exists a path from the root to a leaf such that the corresponding suffix of the path matches the query read, then all the locations stored in the leaf node are returned as mapping locations. Suffix-array based mappers use the Burrows-Wheeler Transform and the Ferragina-Manzini index to mimic the suffix-tree traversal process with a much smaller memory footprint.
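The lookup process above can be sketched with a plain (uncompressed) suffix array. This is only an illustration: the helper names and toy reference are hypothetical, it handles exact matches only, and it omits the BWT-FM machinery that real suffix-array based mappers rely on.

```python
import bisect

def build_suffix_array(ref):
    # Sort all suffix start positions of the reference lexicographically.
    return sorted(range(len(ref)), key=lambda i: ref[i:])

def exact_locations(ref, sa, read):
    # Because suffixes are sorted, all suffixes that share the read as a
    # prefix form a contiguous range; binary search finds its bounds.
    prefixes = [ref[i:i + len(read)] for i in sa]
    lo = bisect.bisect_left(prefixes, read)
    hi = bisect.bisect_right(prefixes, read)
    return sorted(sa[lo:hi])

ref = "ACGTACGTGACG"
sa = build_suffix_array(ref)
print(exact_locations(ref, sa, "ACG"))  # [0, 4, 9]
```

Note that materializing the `prefixes` list costs O(n) per query; the BWT-FM index exists precisely to avoid such costs and to shrink the memory footprint far below that of an explicit suffix array.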
Several mappers have been developed over the past few years. These mappers can be classified into two categories based on their mapping algorithms: 1) hash table based, seed-and-extend mappers (hash table based mappers) similar to the popular BLAST  method, such as mrFAST/mrsFAST [22, 23], MAQ , SHRiMP , Hobbes , drFAST  and RazerS ; and 2) suffix-array and genome compression based mappers that utilize the Burrows-Wheeler Transform and the Ferragina-Manzini index (BWT-FM) such as BWA , Bowtie , and SOAP2 . Both types of read mapping algorithms have different strengths and weaknesses. To measure the performance of different mappers, three general metrics are introduced: speed in performing the mapping, sensitivity in mapping reads in the presence of multiple edits (including mismatches, insertions and deletions) and comprehensiveness in searching for all mapping locations across the reference genome. The hash table based mappers are much slower, albeit more sensitive, more comprehensive and more robust to sequence errors and genomic diversity than suffix-array based mappers. For these reasons, hash table based mappers are typically more suitable when comparing the genomes of different species, such as mapping reads generated from a gorilla genome to the human reference genome, or when mapping reads to highly repetitive genomic regions where structural variants are more likely to occur [32–34]. On the contrary, suffix-array based mappers (with the BWT-FM optimization) offer very high mapping speed (up to 30-fold faster than hash table based mappers), but their mapping sensitivity and comprehensiveness suffer when the edit distance between the read and the reference fragment is high or when the diversity of the read increases (e.g., when mapping reads from other species). Their fast speed makes the suffix-array based mappers the first choice in single nucleotide polymorphism (SNP) discovery studies where sensitivity is less important. 
In this work, we focus on increasing the speed of hash table based mappers while preserving their high sensitivity and comprehensiveness.
The relatively slow speed of hash table based mappers is due to their high sensitivity and comprehensiveness. Such mappers first index fixed-length seeds (also called k-mers), typically 10-13 base-pair-long DNA fragments from the reference genome, into a hash table or a similar data structure. Next, they divide each query read into smaller fixed-length seeds to query the hash table for their associated seed locations. Finally, they try to extend the read at each of the seed locations by aligning the read to the reference fragment at the seed location via dynamic programming algorithms such as Needleman-Wunsch and Smith-Waterman, or via simple Hamming distance calculation for greater speed at the cost of missing potential mappings that contain insertions/deletions (indels). For simplicity, the rest of the paper will use the term "k-mer" to represent the term "fixed-length seed". We will also use the terms "location" and "seed location" interchangeably.
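A minimal sketch of this seed-and-extend pipeline follows. The toy k-mer length, Hamming-distance-only verification, and function names are assumptions of this illustration; real mappers such as mrFAST use longer k-mers, genome-scale indexes, and full dynamic programming alignment.

```python
from collections import defaultdict

K = 4  # toy k-mer length; hash table based mappers typically use 10-13 bp

def build_index(ref):
    # One-time preprocessing: map every k-mer to its occurrence locations.
    index = defaultdict(list)
    for i in range(len(ref) - K + 1):
        index[ref[i:i + K]].append(i)
    return index

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def map_read(ref, index, read, e):
    # Seed: look up each non-overlapping k-mer of the read.
    # Extend: verify the full read at each candidate location.
    hits = set()
    for offset in range(0, len(read) - K + 1, K):
        for loc in index.get(read[offset:offset + K], []):
            start = loc - offset  # candidate start of the whole read
            fragment = ref[start:start + len(read)]
            if start >= 0 and len(fragment) == len(read) and hamming(fragment, read) <= e:
                hits.add(start)
    return sorted(hits)

ref = "GGGGACGTACGTGGGG"
print(map_read(ref, build_index(ref), "ACGTACCT", 1))  # [4]
```

The inner loop makes the bottleneck discussed next visible: every location in every queried k-mer's list triggers a verification, whether or not the full read can possibly map there.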
Using real data generated with the NGS platforms, we observed that most of the locations fail to provide correct alignments. This is because the size of the k-mers that form the hash table's indices are typically very short (e.g., 12 bp as default for mrFAST/mrsFAST). These short k-mers appear in the reference genome much more frequently than the undivided, hundreds of base-pair-long query read. As a result, only a few of the locations of a k-mer, if any, provide correct alignments. Naively extending (aligning the read to the reference genome) at all of the locations of all k-mers only introduces unnecessary computation. In this paper, we define the seed locations that the read cannot align to as "false" locations. Reducing unnecessary computation associated with the large number of false locations is the key to improving hash table based mappers' speed.
In this paper, we propose a new algorithm, FastHASH, that dramatically improves the speed of hash table based algorithms while maintaining their sensitivity and comprehensiveness. We introduce two key ideas for this purpose. First, we drastically reduce the potential locations considered for the extend step while still preserving comprehensiveness. We call this method Cheap K-mer Selection. Second, we quickly eliminate most of the false locations without invoking the extend step in the early stages of mapping. This method is called Adjacency Filtering. We tested FastHASH by incorporating it into the mrFAST  codebase. Our initial CPU implementation of FastHASH provides up to 19-fold speedup over mrFAST, while still preserving comprehensiveness.
In the next section, we describe the basics and the characteristics of Cheap K-mer Selection and Adjacency Filtering. In the Mechanisms section, we present the mechanism of FastHASH in detail. In the Results section, we present the performance of mrFAST with FastHASH compared to the baseline mrFAST and several other read mapping tools. We then present more analysis in the Analysis section and draw conclusions in the Conclusion and Discussion section.
Hash table based mappers map query reads to a known reference genome under a user-defined edit distance e. With edit distance e, the mappers search for locations where there are at most e edits (mismatches, insertions or deletions) between the query read and the reference fragment. Typically, they follow a "seed-and-extend" procedure. These mappers index the reference genome and store the contents in a hash table. The hash table maps every possible fixed-length k-mer (typically 10-13 bp) as a key to that k-mer's occurrence locations in the reference genome as contents. The indexing procedure is performed only once. During the mapping stage, the mapper uses the previously generated hash table to search for seed locations.
Hash table based mappers are computationally more expensive than suffix-array alternatives. Unlike suffix-array based mappers, which quickly return the mapping locations at the leaf nodes of the suffix-tree, hash table based mappers try to calculate the optimal alignment at all of the query k-mers' locations. Mappers that are capable of aligning even in the presence of edits are the most sensitive, yet slowest, since these dynamic programming algorithms typically run in O(l^2) time (where l is the length of the reads). This can be reduced to O(2el) if the number of allowed indels is limited to e.
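The O(2el) bound can be realized by restricting the dynamic-programming matrix to a band of 2e+1 diagonals with an early cutoff. The sketch below is one way to do this (in the style of Ukkonen's banded algorithm), with hypothetical names; it is not mrFAST's actual implementation.

```python
def banded_edit_distance(read, ref_fragment, e):
    # Levenshtein distance restricted to a band of 2e+1 diagonals; returns
    # None as soon as the distance provably exceeds e. Runs in O(e*l) time
    # instead of O(l^2) for the full dynamic-programming matrix.
    INF = e + 1
    n, m = len(read), len(ref_fragment)
    if abs(n - m) > e:
        return None
    prev = {j: j for j in range(0, min(m, e) + 1)}  # row 0: j insertions
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - e), min(m, i + e) + 1):
            best = INF
            if j > 0 and j - 1 in prev:  # match / substitution
                best = min(best, prev[j - 1] + (read[i - 1] != ref_fragment[j - 1]))
            if j in prev:                # consume a read base (deletion)
                best = min(best, prev[j] + 1)
            if j - 1 in cur:             # consume a reference base (insertion)
                best = min(best, cur[j - 1] + 1)
            cur[j] = best
        prev = cur
        if min(prev.values()) > e:       # whole band exceeds e: stop early
            return None
    d = prev.get(m, INF)
    return d if d <= e else None

print(banded_edit_distance("ACGT", "ACGGT", 2))  # 1
```

The early cutoff is what makes the band pay off in practice: for a false location, the whole band usually exceeds e within a few rows, so most of the work is skipped.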
We experimentally tested the behavior of the hash table based mapper mrFAST to identify its performance bottlenecks. We observed that the dynamic programming alignment algorithm (step 5) occupies over 90% of the execution time, while most locations fail to pass the alignment verification. Due to the short k-mer size (10-13 bp) and the repetitive nature of most genomes (including human), the location list of a k-mer may contain many locations to which the full query read does not map. Yet the mapper still tries to align the query read to all of these extra locations (step 5), since it has no prior knowledge of which seed locations provide correct mappings.
Verification of the vast number of false locations greatly degrades the performance of the mapper, as it consumes a massive amount of unnecessary computation and memory bandwidth. To verify a location, the mapper has to 1) access the reference genome sequence starting at the seed location to get the reference fragment and then 2) invoke a complex dynamic programming algorithm to align the query read to the reference fragment. Performing these costly operations for a high number of false locations only wastes computational resources, as false locations, by definition, do not provide any valid mappings. Therefore, improving the performance of hash table based mappers strongly depends on efficiently reducing the number of false locations before the verification step.
There are two main directions to ameliorate the computational cost imposed by the false locations. First, one can apply a filter within the seed locations and only extend on "true locations" to reduce unnecessary computation. Second, one can select only the k-mers with low occurrence frequency in the reference as query k-mers to avoid probing long location lists, reducing the number of locations to examine. In this work, we propose two new mechanisms that address both directions.
Our first method aims to filter out the obviously false locations. Our observation is that by collecting a common set of adjacent locations from the location lists of all the k-mers, we can quickly distinguish obviously false locations from possibly true locations and skip the unnecessary verification steps for the false locations. The basic idea is as follows: a potential seed location from one k-mer's location list can return a correct mapping (under the given edit distance e) only if the other adjacent k-mers of the read are also located at adjacent locations in the reference genome (e.g., in Figure 2, location 212 in the first k-mer's list, location 224 in the second k-mer's list, location 236 in the third k-mer's list, and so on). Consequently, by checking whether all k-mers have the corresponding adjacent locations stored in their location lists, we can quickly identify false locations without the alignment step (e.g., in Figure 2, location 1121 from the first k-mer's location list is an easily detectable false location since no other k-mer contains adjacent locations in its location list). To tolerate edits, at most e k-mers are allowed to fail the adjacent-location search. Otherwise, the number of edits (mismatches and indels) between the query read and the reference fragment must be greater than e, and the location should be rejected before the verification step (step 5). We call this method Adjacency Filtering (AF).
Note that AF does not guarantee correct mappings; instead, it rejects obviously false locations. For computing the actual number, location, and content of edits (including sequence errors) the alignment step (step 5) is still needed. Nevertheless, AF detects a large fraction of the false locations (more than 99% on average based on our empirical evaluations) and removes them from consideration for verification.
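A simplified sketch of the AF test follows. The function and parameter names are hypothetical, and this version checks only the exact expected offsets, ignoring the small positional shifts that indels introduce; it reuses the location numbers from the Figure 2 example in the text.

```python
import bisect

def passes_adjacency_filter(candidate_start, kmer_lists, k, e):
    # Adjacency Filtering: a candidate mapping position survives only if
    # at most e of the read's k-mers lack a location at the expected
    # offset. kmer_lists[i] is the sorted location list of the read's
    # i-th non-overlapping k-mer.
    misses = 0
    for i, locations in enumerate(kmer_lists):
        expected = candidate_start + i * k   # where k-mer i should sit
        j = bisect.bisect_left(locations, expected)
        if j == len(locations) or locations[j] != expected:
            misses += 1          # this k-mer must carry an edit
            if misses > e:
                return False     # more than e edits: reject before alignment
    return True

lists = [[212, 1121], [224], [236]]  # sorted location lists, k = 12
print(passes_adjacency_filter(212, lists, 12, 1))   # True
print(passes_adjacency_filter(1121, lists, 12, 1))  # False
```

Because the lists are sorted, each adjacency check is a binary search, which is why AF is cheap per location even though it touches every k-mer's list.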
With AF eliminating unnecessary computation by detecting false locations, and Cheap K-mer Selection (CKS, described below) reducing the number of false locations, our new algorithm, FastHASH, minimizes unnecessary computation and focuses on mapping only at possibly true locations, which provides drastic speedup over previous hash table based mappers, as our experimental results show.
Adjacency Filtering (AF) uses the location lists retrieved from the hash table to detect false locations. Since the location lists are stored contiguously as sorted arrays in the hash table, it is easy to prefetch these lists into the CPU cache. Moreover, the location lists exhibit high temporal locality. Once fetched into the cache, the location lists of the k-mers can be used to verify all seed locations, and thus will be reused many times. Unlike traditional hash table based mappers which try to extend at all potential seed locations and perform many unpredictable reference genome lookups, FastHASH with AF only accesses the reference genome when it is confident that the seed location is very likely to be a true location.
The power of AF comes from detecting and rejecting most of the false locations before verification. Not only does AF prevent unnecessary computation (step 5), but it also prevents expensive unnecessary memory accesses to the reference genome to retrieve the reference fragment (step 4). For a false location, other than the query k-mer itself, usually the rest of the read is completely different from the reference genome. We observed in real data sets that even with high edit distance e = 5, only a small fraction (usually less than 1%) of the locations pass AF and are marked as potential true locations for further verification. As a result, AF is effective at detecting and rejecting false locations.
However, AF also comes with its own computational cost. To test a potential location, AF conducts N searches for adjacent locations, one for each k-mer's location list. Additionally, AF does not guarantee that the remaining seed locations will have at most e edits after alignment, since multiple edits might reside in a single edited k-mer. In such cases, AF cannot tell exactly how many edits there are, so it conservatively assumes there is only one edit per edited k-mer and passes the location to the verification step (step 5). During the alignment step, a dynamic programming algorithm extracts detailed editing information and verifies whether the mapping is indeed correct. To summarize, for true mapping locations (with at most e edits), AF introduces extra computation. Nevertheless, AF is cost-efficient because the number of locations that pass AF is marginal compared to the number of locations that are correctly filtered out.
Although AF reduces memory lookups, it also incurs a penalty in detecting false locations: AF searches the corresponding adjacent locations for every k-mer. This is in fact a quick lookup in the location lists: as the location lists are sorted, we can use binary search. Nevertheless, for longer reads with many k-mers, AF can be a costly process. From our experiments, we see that AF reduces the alignment calculation (steps 4 and 5) by over 90% but provides only 2x speedup. After profiling the execution, we observe that AF has become the new bottleneck by occupying over 90% of the execution time.
The core problem stems from the imbalance of the hash table. Most location lists in the hash table for the human genome have very few locations. However, there are also location lists with cardinality greater than 1 million. Even though such k-mers make up only a small portion of the hash table, we encounter them frequently with real data. These high-frequency (or expensive) k-mers mostly correspond to poly-N tracts and microsatellites, and such sequences have many copies in the human reference genome. These expensive k-mers also introduce many false locations. When we use such expensive k-mers to query the hash table, all of the locations in their entries will go through the AF test, which is a search-heavy (i.e., computationally expensive) process.
FastHASH actively selects cheaper k-mers over more expensive k-mers as query k-mers. There will be fewer false locations and fewer invocations to AF with cheaper query k-mers. Note that, for any read, both cheap and expensive k-mers will have the same true locations in their location lists. However, since by definition expensive k-mers are more frequent in most genomes, including the human genome, they will contain substantially more seed locations than cheaper k-mers, thus imposing more computational cost to both AF and the subsequent verification step. Instead, starting the AF and then the alignment with the cheap query k-mers relieves the mapper of this cost while preserving all the true locations.
We implemented the selection of cheap k-mers as a simple quicksort operation before querying the k-mers in the hash table. For each query read, instead of simply selecting the first e+1 k-mers to search in the hash table, FastHASH first sorts all k-mers by the cardinalities of their location lists, and then selects the cheapest e+1 k-mers (i.e., those k-mers that have the smallest cardinality of their location lists). Note that selecting e+1 k-mers as query k-mers guarantees full sensitivity under edit distance e.
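CKS can be sketched in a few lines. The names are hypothetical, and a tiny dictionary stands in for the genome-scale hash table; Python's built-in sort stands in for the quicksort mentioned above.

```python
def select_cheap_kmers(read, index, k, e):
    # Divide the read into non-overlapping k-mers, then keep the e+1
    # k-mers with the shortest location lists. Since at most e edits can
    # corrupt at most e of these k-mers, at least one selected k-mer is
    # always edit-free, preserving full sensitivity under edit distance e.
    kmers = [(i, read[i:i + k]) for i in range(0, len(read) - k + 1, k)]
    kmers.sort(key=lambda item: len(index.get(item[1], [])))
    return kmers[:e + 1]

# Toy hash table: "AAAA" is an expensive k-mer with 1000 locations.
index = {"AAAA": list(range(1000)), "ACGT": [5, 80], "CGCG": [7]}
print(select_cheap_kmers("AAAAACGTCGCG", index, 4, 1))
# [(8, 'CGCG'), (4, 'ACGT')]
```

The expensive "AAAA" k-mer and its 1000 locations are skipped entirely, yet any true mapping location still appears in at least one of the selected k-mers' lists.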
In summary, Cheap K-mer Selection (CKS) reduces the number of AF and verification operations using a computationally cheap operation: quicksort. There are other mechanisms that further reduce the number of false locations at the cost of more complex computation and more memory accesses, as done by Ahmadi et al. . In their work "Hobbes", instead of simply dividing the read contiguously, Ahmadi et al. test all possible ways of dividing a query read into multiple k-mers and select only the cheapest division (the way of dividing the query read that returns the cheapest set of k-mers). However, to assess the cost of one possible division of the query read into k-mers, Hobbes has to access the hash table multiple times to get the location list length of each k-mer belonging to the division. We believe Hobbes may not be cost-efficient compared to CKS for two reasons: 1) with CKS, the query k-mers are already very cheap; from our observation, most of the query k-mers have fewer than ten locations after CKS. 2) Hobbes incurs tens to hundreds of accesses to the hash table for only slightly cheaper query k-mers than CKS. The benefit of a slightly cheaper query k-mer set is very likely offset by the cost of many long-latency memory accesses to the hash table. In fact, CKS avoids examining many of the false locations with very few memory accesses (O(log N) accesses, where N is the number of k-mers into which a query read can be divided).
We implemented FastHASH on top of mrFAST, creating a new version of the mapper. To assess the performance of FastHASH, we compared this new version (mrFAST with FastHASH) against several popular read mappers currently available, including Bowtie, BWA, RazerS, and the baseline mrFAST, on both simulated and real data sets. We evaluated the mappers with respect to three metrics: speed, sensitivity and comprehensiveness. We also tried to benchmark Hobbes, but its very large memory footprint resulted in page thrashing, greatly degrading performance. As a result, we do not report performance results for Hobbes.
Speed is how fast a mapper maps reads to the reference genome, and is defined as the execution time of the binary measured by the Linux utility "time". Sensitivity is defined as the fraction of reads for which the mapper finds at least one mapping. Higher sensitivity corresponds to a better ability to tolerate edits. Comprehensiveness is how many true locations the mapper finds for a given read. Higher comprehensiveness corresponds to a more thorough search of the reference genome.
We tested speed, sensitivity and comprehensiveness with edit distances e from 1 to 5 for all mappers on three real data sets. Then, we mapped three simulated data sets with a fixed edit distance of 3. Since Bowtie does not support any edit distance greater than 3, we only have results for Bowtie with edit distances 1, 2 and 3. RazerS supports edit distance via a percent identity setting; to provide a fair comparison, we chose the percent identity corresponding as closely as possible to each edit distance. For simulated reads, we guarantee that each read contains at most 3 edits from the reference genome. As a result, a fully sensitive mapper should be able to map all simulated reads with an edit threshold greater than or equal to 3.
Real Data: We used three different real data sets to evaluate the performance of different mappers. All sequence data sets were generated using the Illumina platform. The first set (set 1; 160 bp per read, 1 million reads) consists of reads from an individual obtained from the 1000 Genomes project  sequenced with the Illumina platform. The second set (set 2; 101 bp per read, 500,000 reads) is generated from a chimpanzee genome , and the third set (set 3; 70 bp per read, 500,000 reads) is generated from an orangutan genome . In our benchmarks, we mapped all reads to the current human reference genome assembly (GRCh37, hg19).
Simulated Data: We generated three simulated data sets from the current human reference genome assembly (hg19). For each set, we generated 50,000 random reads from the first 20 chromosomes summing up to 1 million reads for each set. The sets differ in their read lengths: 72 bp, 108 bp, and 180 bp. For each read, we simulated the read errors and edits by randomly altering or inserting/deleting 0 to 3 base-pairs. Each set is mapped to the human reference genome (hg19).
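A read simulation of this kind can be sketched as follows. The function name, fixed seed, and exact sampling choices are assumptions of this illustration; the paper specifies only that each read carries 0 to 3 random substitutions, insertions, or deletions.

```python
import random

def simulate_read(ref, read_len, max_edits, rng=random.Random(7)):
    # Draw a fragment with `max_edits` bases of slack so that deletions
    # cannot shorten the final read below `read_len`.
    start = rng.randrange(len(ref) - read_len - max_edits)
    read = list(ref[start:start + read_len + max_edits])
    for _ in range(rng.randint(0, max_edits)):  # 0 to max_edits edits
        pos = rng.randrange(len(read))
        op = rng.choice(("sub", "ins", "del"))
        if op == "sub":
            read[pos] = rng.choice("ACGT")
        elif op == "ins":
            read.insert(pos, rng.choice("ACGT"))
        else:
            del read[pos]
    return "".join(read[:read_len])
```

Because the simulated edits are known by construction, a fully sensitive mapper run with a matching edit threshold must recover a mapping for every simulated read, which is what makes such sets useful for sensitivity benchmarking.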
We ran all mappers in single user mode on a Linux machine with a 3.2 GHz Intel i7 Sandy Bridge CPU and 16 GB DDR3-1333 main memory.
As Additional file 1 shows, hash table based mappers such as the baseline mrFAST and RazerS suffer from low performance (slow speed) compared to BWA and Bowtie. As we will show in the "Analysis" section, this is mainly due to the massive number of false locations. FastHASH (mrFAST with FastHASH) greatly improves mapping speed over the baseline mrFAST (e.g., up to 19 times for e = 3, depending on the data set), and is even faster than BWA under certain circumstances (e.g., for set 1, when the edit distance is greater than 3). Meanwhile, FastHASH preserves the sensitivity and comprehensiveness of the baseline mrFAST.
As discussed in the previous section, the baseline mrFAST, like other hash table based mappers, suffers greatly from extending at a large number of false locations. Figure 8(a) presents the number of true locations out of all potential locations: only 0.007% of the potential locations (seeds) provide a correct alignment on average.
We can clearly see the incremental benefits of AF and CKS when mapping 1 million simulated reads of 180 bp in length (Figure 8(b) and Figure 8(c)). As discussed above, a very small fraction of the seed locations pass the verification step of mrFAST (Figure 8(a)). Adjacency filtering substantially decreases the number of seed locations by eliminating false locations as seen in Figure 8(b). This way, AF saves many unnecessary memory accesses since now only the locations that pass AF will proceed to further verification. On average, AF filters out approximately 99.8% of all false locations. Figure 8(c) shows the benefit of having both AF and CKS. From Figure 8(c), we see that CKS reduces the number of overall potential locations, which reduces the amount of AF computation. On average, CKS eliminates 95.4% of all seed locations without degrading the sensitivity of the mapper (as seen in Table 1).
Next-generation sequencing platforms continue to evolve at a fast rate. New technologies are frequently introduced that offer different strengths; each, however, has unique biases. The current trend is to generate longer reads with newer technologies, such as nanopore sequencing, at the cost of increased error rates. While the suffix-array based mappers offer tremendous read mapping speed, they also suffer greatly with higher error rates and longer read lengths. Seed-and-extend hash table based mappers are more robust to these changes, but they are also very slow for mapping short reads.
In this paper, we analyzed seed-and-extend, hash table based read mapping algorithms and proposed a new optimization algorithm called FastHASH that provides up to 19-fold speedup over the best previous seed-and-extend mapper. FastHASH offers a potential solution to the speed inefficiency of hash table based mappers as a generic algorithm that can be implemented with any such read mapper.
Although our current implementation of FastHASH is on a CPU based system, we also have a preliminary implementation on a GPU based system, which we aim to develop further. Another future direction to improve FastHASH is to develop a hybrid indexing strategy that efficiently merges Burrows-Wheeler Transform and the Ferragina-Manzini indexing with FastHASH to increase seed size for longer (>1 kbp) reads while keeping the memory footprint low.
Together with additional GPU-based improvements for the alignment step of read mapping, FastHASH promises to accelerate read mapping further while maintaining the sensitivity of hash table based mappers to help cope with the overwhelming data deluge created by next generation sequencing.
The publication costs for this article were funded by the NIH grant HG006004 to O.M. and C.A.
This article has been published as part of BMC Genomics Volume 14 Supplement 1, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S1.
We thank F. Hach for his help in editing the mrFAST source code to integrate the FastHASH mechanism. This work was supported, in part, by an NIH grant HG006004 to O.M. and C.A., and a Marie Curie Career Integration Grant (PCIG10-GA-2011-303772) within the 7th European Community Framework Programme to C.A.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.