 Research
 Open Access
 Published:
GPU accelerated sequence alignment with traceback for GATK HaplotypeCaller
BMC Genomicsvolume 20, Article number: 184 (2019)
Abstract
Background
Pairwise sequence alignment is widely used in many biological tools and applications. Existing GPU accelerated implementations mainly focus on calculating optimal alignment score and omit identifying the optimal alignment itself. In GATK HaplotypeCaller (HC), the semiglobal pairwise sequence alignment with traceback has so far been difficult to accelerate effectively on GPUs.
Results
We first analyze the characteristics of the semiglobal alignment with traceback in GATK HC and then propose a new algorithm that allows for retrieving the optimal alignment efficiently on GPUs. For the first stage, we choose intratask parallelization model to calculate the position of the optimal alignment score and the backtracking matrix. Moreover, in the first stage, our GPU implementation also records the length of consecutive matches/mismatches in addition to lengths of consecutive insertions and deletions as in the CPUbased implementation. This helps efficiently retrieve the backtracking matrix to obtain the optimal alignment in the second stage.
Conclusions
Experimental results show that our alignment kernel with traceback is up to 80x and 14.14x faster than its CPU counterpart with synthetic datasets and real datasets, respectively. When integrated into GATK HC (alongside a GPU accelerated pairHMMs forward kernel), the overall acceleration is 2.3x faster than the baseline GATK HC implementation, and 1.34x faster than the GATK HC implementation with the integrated GPUbased pairHMMs forward algorithm. Although the methods proposed in this paper is to improve the performance of GATK HC, they can also be used in other pairwise alignments and applications.
Background
NGS (Next Generation Sequencing) platforms offer the capacity to generate large amounts of DNA sequencing data in a short time and at a low cost. However, the analysis of the dramatic amounts of DNA sequencing data is still a computational challenge. Researchers have proposed many methods to improve the performance of the DNA sequencing data analysis tools and applications. One method is to execute these tools and applications on high performance computing architectures, such as supercomputers, clusters and even cloud environments. Another method is to use accelerators, such as GPUs and FPGAs, to accelerate the timeconsuming kernels of these tools and applications to improve their performance.
GATK HaplotypeCaller (HC) is a popular variant caller, which is used to find the differences (or variants) between the sample DNA sequence compared with the reference sequence. Although GATK HC has higher accuracy of identifying variants compared with many other variant callers, its feasibility is limited by the long execution time needed for the analysis, which has proven to be difficult to optimize. This has driven researchers to improve its performance. Intel and IBM researchers both employ vector instructions to optimize the pairHMMs forward algorithm [1, 2], which is the most timeconsuming part of GATK HC, to reduce the total execution time. Ren et al. [3, 4] uses GPUs to accelerate the pairHMMs forward algorithm in GATK HC, which achieved 1.71x speedup in single thread mode. After accelerating the pairHMMs forward algorithm on GPUs, profiling of GATK HC shows that the semiglobal pairwise sequence alignment accounts for around 34.5% of the overall execution time, making it the most timeconsuming kernel in the application. This provides an opportunity to further improve the performance of GATK HC using GPU acceleration.
Pairwise sequence alignment, which includes global alignment, semiglobal alignment and local alignment, is one of the commonly used techniques in many biological tools and applications. The global alignment and the semiglobal alignment are calculated by the NeedlemanWunsch algorithm and the modified NeedlemanWunsch algorithm, respectively, while the local alignment is calculated by the SmithWaterman algorithm. Although there are some differences existing in the three algorithms, the main framework of these algorithms is similar, which includes two stages: (1) a dynamic programming kernel to calculate the score matrices and find the optimal alignment score; (2) a traceback (or backtracking) kernel to find the optimal alignment.
Since three kinds of pairwise sequence alignment (global, semiglobal and local) have the same framework and differ only in details, techniques of speeding up one can be applied to the other two with tiny modifications. Different kinds of highperformance platforms, especially accelerators, such as FPGAs [5, 6] and GPUs [7–16], are used to reduce their execution time.
There has been much research done to reduce the execution time of the three kinds of pairwise alignment on GPUs. There are two approaches to implement the first stage of the pairwise sequence alignment on GPUs (which is to calculate the optimal alignment score): intertask parallelization model and intratask parallelization model. The former is that each thread performs one alignment independently, such as [7] and [8]. The latter is that threads in a block cooperate to perform an alignment, such as [9]. If the pairwise sequence alignment is applied for sequence database scanning, aligning a query sequence with all database sequences for sequence similarity, a query profile and related data storage and access techniques are employed to reduce memory accesses on GPUs, such as [10] and [11]. In [11], alignments are performed in interleaved mode in order to amortize the cost of initiating each execution pass.
However, very few researchers implement the second stage on GPUs. The existing implementations can be classified into two groups. The implementations of the first group are based on backtracking matrices. Liu et al. [11] proposed to store the score matrices and backtrack the score matrices to obtain the optimal alignment. However, the method is not described clearly. gpupairAlign [12] proposed to store the alignment moves in four Boolean backtracking matrices during the first stage and retrieve the four Boolean backtracking matrices instead of the score matrices. This group of implementations obtain the optimal alignment in linear time, but the disadvantage is that their space complexity is quadratic. The implementations of the second group are based on the MyersMiller algorithm. MSACUDA [13] developed a stackbased iterative implementation of the MyersMiller algorithm [17] to retrieve the optimal alignment in linear space. SW# [14] proposed a modified MyersMiller algorithm. CUDAlign 2.0 [15] combined the MyersMiller and SmithWaterman algorithm. Moreover, with several versions of incremental optimizations, CUDAlign 4.0 [16] is able to achieve the optimal alignment of chromosomewide sequences using multiple GPUs. However, their approaches have quadratic time complexity, making them only suitable for the pairwise alignment of very long DNA and protein sequences.
In this paper, we provide an accelerated solution tailored to GATK HC which implements the semiglobal pairwise sequence alignment with traceback on GPUs to further improve the performance. The contributions of this paper are as follows:

We first analyze the characteristics of the semiglobal alignment in GATK HC and then propose a GPUbased implementation of the semiglobal alignment with traceback based on the analysis.

During the first stage, we propose to record the length of consecutive match(es)/mismatch(es) and store the alignment moves in a special backtracking matrix.

We also propose a new algorithm that allows for retrieving the optimal alignment efficiently on GPUs.

We benchmark the results and show an overall speedup of GATK HC of about 2.3x over the nonaccelerated version.
Although this paper proposes to improve the performance of GATK HC, the GPUbased implementation of the semiglobal alignment with traceback can be used in other applications and tools. Moreover, since there are only small differences among the global alignment, semiglobal alignment and local alignment, the methods proposed in this paper can also be applied to the global alignment and local alignment.
Methods
A brief overview of semiglobal alignment
Semiglobal alignment finds the overlap between two sequences. Insertion and deletions introduce gaps in the alignment. Gaps at the start or end of the sequences may be neglected. Hence, different types of semiglobal alignments are possible between two sequences. Figure 1 shows an example of the type of the semiglobal alignment performed in GATK HC.
The pairwise sequence alignment is to find the optimal alignment between two sequences, which has the optimal alignment score. The modified NeedlemanWunsch algorithm with affine gap penalties to calculate the optimal alignment score of the semiglobal alignment in GATK HC is defined as
Initialization:
Recurrence:
Termination:
where m and n are the length of R1 and R2, respectively. In these equations, M_{i,j} represents the optimal alignment score of two subsequences R1[1]...R1[i] and R2[1]...R2[j], while I_{i,j} and D_{i,j} represent the optimal alignment score of two subsequences R1[1]...R1[i] and R2[1]...R2[j] with R2[j] aligned to a gap and R1[i] aligned to a gap, respectively. Here, the semiglobal alignment uses an affine gap penalty model to calculate gap penalties, in which α and β are the gap open penalty and the gap extension penalty, respectively. sbt is the score of a match or mismatch. As shown by Eq. 1, the penalties of gaps at the start and end of two sequences are neglected. As shown by Eq. 3, the optimal alignment score of the semiglobal alignment in GATK HC is the greatest value of the elements in the last row and the last column of the matrix M.
These equations indicate that the computation complexity of the modified NeedlemanWunsch algorithm is O(mn), which makes the execution time increase quadratically with the sequence length. Usually, the algorithm is implemented using dynamic programming which solves three two dimensional matrices. According to the equations, M_{i,j}, I_{i,j} and D_{i,j} only depend on the upleft, up and left neighbor elements, which implies that the elements on the same antidiagonal can be computed in parallel. Thus, a method employed by many researchers to reduce the execution time is to exploit this inherent parallelism in the algorithm.
If the alignment only needs to find the optimal alignment score of two sequences, the dynamic programming kernel can be calculated in linear space. Otherwise, the alignment with affine gap penalties generally uses three backtracking matrices to store the scores or alignment moves calculated by the dynamic programming kernel. The optimal alignment traceback starts from the position of the element with the optimal alignment score until reaching any element in the first row or the first column of the backtracking matrices, which is calculated in linear time. Figure 2 presents an example of backtracking an optimal alignment based on the score matrices.
Cigar format
In GATK HC, the goal is to get the optimal semiglobal alignments, which are represented in the CIGAR format [18], and POS. CIGAR is a string including one or more numbercharacter pair(s). The character, including ‘M’, ‘I’, ‘D’, ‘N’, ‘S’, ‘H’, ‘P’, ‘=’ and ‘X’, defines an operation explaining how the base in R2 aligns to the base in R1. Table 1 shows the CIGAR operations used in GATK HC. The number defines the length of the consecutive operations. POS is 0based left most position of the first matching base of R1, which indicates the position of R1 where the alignment starts.
Take the alignment in Fig. 1 for example. The CIGAR representation of the alignment is “3M2D1M2I2M1S” and POS of the alignment is 1.
GPU architecture
Modern GPUs are widely used to accelerate computationally intensive algorithms. A GPU consists of thousands of small cores capable of executing one thread at a time. On NVIDIA GPUs, threads are grouped into blocks and these blocks are grouped into grids. Furthermore, consecutive threads in the same block are grouped into warps. The size of a warp is usually 32. The memory hierarchy includes registers, shared memory, global memory, cache and so on. Each thread is assigned a set of registers. The shared memory is accessed by all threads in a block. Using the shared memory, the threads in a block can exchange data at a very fast rate. The global memory is accessed by all the threads on the GPU. The latency of the global memory access is high since it resides on the device DRAM. If the data accessed by each thread in the same warp are stored at consecutive addresses, the global memory accesses of these threads can be coalesced. Usually, the width of one global memory access is 128 bytes. If the global memory accesses of threads in a warp are coalesced, there will be only one global memory access when the data accessed by each thread is not more than 4 bytes. Otherwise, there would be 32 sequential global memory accesses in the worstcase situation.
Semiglobal alignment in GATK HC
Implementation of alignment in GATK HC
In GATK HC, the semiglobal pairwise alignment is performed in two stages.
The implementation of the first stage is realized with a twolayer loop, which results in the O(mn) computational complexity. The results of the first stage are two matrices: the score matrix sw, which stores matrix M, and the backtracking matrix btrack. In btrack, the value of each element can be classified into three kinds, which is defined as follows:

>0  indicates a deletion and the length of the consecutive deletion(s) is the value of the element

=0  indicates a match or mismatch and the length of the consecutive match(es)/mismatch(es) is increased by 1

<0  indicates an insertion and the length of the consecutive insertion(s) is the absolute value of the element
The absolute values of the elements in the backtracking matrix are calculated by recording the length of the consecutive deletion(s) and consecutive insertion(s) when calculating the score matrix.
The implementation of the second stage is to calculate the optimal alignment in CIGAR format and POS. The score matrix sw is first used to find the optimal alignment score and the backtracking matrix btrack is then used to obtain the optimal alignment and POS. The optimal alignment is calculated in linear time. The backtracking matrix in GATK HC is helpful during backtracking. It is much easier to identify the next move compared with other methods since it does not need to jump among several backtracking matrices (shown in Fig. 1) or calculate the next move based on the current move [12]. Moreover, the lengths of the consecutive deletion(s) and consecutive insertion(s) are given by the element of the btrack matrix. However, the length of the consecutive match(es)/mismatch(es) is not given, which is increased by one instead.
Data analysis
In GATK HC, the semiglobal alignment is performed in three situations:

1
Align the reference path with the dangling path to recover dangling branches for the local assembly.

2
Align the read with the assembled haplotype.

3
Align the assembled haplotype with the reference to decide whether the assembled haplotype satisfied the defined requirements.
We profiled GPUbased GATK HC [3] with a typical workload (Chromosome 10 of the whole human genome dataset G15512.HCC1954.1 [19]) to investigate which situation is most timeconsuming. The profiling results in singlethreaded mode are shown in Table 2, which specifies the relative execution time and the number of the semiglobal alignments in each situation. As shown in the table, the execution time of all the semiglobal alignment accounts for 34.5% of the total execution time. Moreover, situation 2 and 3 consumes around 100% of the semiglobal alignment execution time and the execution time in situation 1 is negligible. However, although the number of semiglobal alignments in situation 2 is much larger than that in situation 3, the execution time in situation 2 is smaller than that in situation 3.
We then analyzed the lengths of the sequence pairs in situation 2 and 3. In situation 2, let R1 be the assembled haplotype and R2 be the read. In situation 3, let R1 be the assembled haplotype and R2 be the reference. Figure 3 shows a scatter plot of the lengths of the sequence pairs in these two situations. As shown in Fig. 3, the lengths of the sequence pairs in situation 2 (40 ∼350) are shorter than those in situation 3 (300 ∼520). Since the computation complexity is O(mn), the execution time of each semiglobal alignment in situation 3 is bigger than that in situation 2. This explains why the total execution time of situation 3 is bigger than that of situation 2, which is shown in Table 2. Moreover, in situation 2, the length of R2 (the read) is shorter than the length of R1 (the assembled haplotype).
In addition, we investigated the optimal alignments achieved in situation 2 and 3 and added up the number of M/I/D/S operations in each optimal alignment. Figure 4 shows that the number of M operations is the largest. Especially in situation 2, the number of M operations accounts for 99.65% of the total operations. Moreover, we found that most of M operations are consecutive in each optimal alignment. However, the length of the consecutive match(es)/mismatch(es) is increased by one during the optimal alignment retrieval.
Hence, although the computation complexity of the optimal alignment is linear, most of its execution time is used to calculate the length of the consecutive match(es)/mismatch(es).
We last studied the source code of GATK HC version 3.7 and found that the semiglobal alignments in situation 2 and 3 can be grouped into many batches without big modifications of the source code. Each batch consists of many semiglobal alignments of sequence pairs. The numbers of batches in situation 2 and 3 are 13,142 and 13,977, respectively. Figure 5 shows the number of sequence pairs of each batch in situation 2 and 3. The biggest number of sequence pairs in all the batches in situation 2 and 3 are 293 and 192, respectively. Furthermore, the majority of batches in situation 2 include 25 ∼125 of sequence pairs while the majority of batches in situation 3 include 1 ∼8 of sequence pairs.
Implementation on GPUs
The implementation of the semiglobal pairwise alignment for GATK HC on GPUs is performed in two stages. In the first stage, it performs the modified NeedlemanWunsch algorithm in order to obtain the backtracking matrix and the position of the optimal alignment score. In the second stage, it retrieves the backtracking matrix in order to obtain the optimal alignment in CIGAR format and POS.
First stage implementation
Intratask parallelization
As mentioned in “Data analysis” subsection, the number of sequence pairs in each batch is less than 300. In order to effectively use the resources on GPUs, the intratask parallelization model is employed to implement the modified NeedlemanWunsch algorithm on GPUs. For the implementation on GPUs, the elements on the same antidiagonal of the score matrix M, I and D and backtracking matrix are calculated in parallel, reducing the computational complexity to O(m+n). Figure 6 shows the calculation of matrix M as an example to explain the implementation. Let R1 and R2 be the two sequences. There are in total 6 threads in the block and the size of matrix M is 6×6. At each step, the elements on an antidiagonal are calculated in parallel and every element is calculated by one thread. For example, at step 5 (S5), M_{5,1}, M_{4,2}, M_{3,3}, M_{2,4} and M_{1,5}, which are on the same antidiagonal, are calculated by thread 0 (T0), thread 1 (T1), thread 2 (T2), thread 3 (T3) and thread 4 (T4), respectively. These elements are then used in the next step to calculate the elements on the next antidiagonal. Moreover, each thread is responsible to calculate the elements in one column of matrix M. For example, the elements in the second column are calculated by thread 1 (T1). The goal of the first stage is to obtain the backtracking matrix and the position of the optimal alignment score. Therefore, elements of the score matrix M, I and D are not stored. Instead, two vectors in the shared memory and three registers of each thread are used to store the intermediate results of the three score matrices. During the calculation of elements of the last column and the last row of matrix M, the optimal alignment score and its position are obtained. However, the drawback of the implementation is that some threads remain idle at the beginning or at the end of the calculation procedure, resulting in low thread utilization. If the length of R2 is smaller than the number of threads in a block, the execution is similar to Fig. 6 while some threads would remain idle during the whole calculation procedure. If the length of R2 is bigger than the number of threads in a block, there are two solutions to deal with it. One is to increase the size of a block until the number of threads in a block is equal to or bigger than the length of R2. The other is to divide the calculation into several passes. In each pass, the execution is similar to Fig. 6. Three vectors in the global memory are used to store the intermediate results produced by the last thread of each pass, which would be used in the next pass. The advantage of the second solution is that it increases efficiency by reducing idle percentage of threads during the calculation procedure while its disadvantage is that it increases global memory accesses.
Recording the length of consecutive match(es)/mismatch(es)
Besides recording the length of the consecutive deletion(es)/insertion(es), we also record the length of the consecutive match(es)/mismatch(es) in the first stage. The backtracking matrix on GPUs is stored in a short2 matrix. Each element of the matrix has two values, which are x and y. The value of x and y are defined as follows:

x>0  indicates a deletion and the length of the consecutive deletion(s) is the value of the element

x=0  indicates a match or mismatch and the length of the consecutive match(es)/mismatch(es) is y.

x<0  indicates an insertion and the length of the consecutive insertion(s) is the absolute value of the element
The data type of x and y is short, of which the minimum value and maximum value are −32768 and 32767, respectively. The absolute values of the minimum value and maximum value are bigger than the theoretical maximum length of the consecutive operations, which is the length of R1 or R2. In order to calculate the backtracking matrix, a short2 vector in the shared memory and two registers of each thread are used.
Moreover, Since the backtracking matrix will be used in the next stage and the shared memory is not big enough to store it, the backtracking matrix is stored in the global memory. Similar to calculation of the matrix M shown in Fig. 6, elements of the backtracking matrix are calculated in antidiagonal order. Thus, the backtracking matrices are stored in the diagonalmajor data format (Fig. 7b), which is proposed in [20], instead of the rowmajor data format (Fig. 7a) to avoid noncoalesced global memory accesses of 32 threads in a warp and reduce global memory accesses.
Second stage implementation
In the second stage, we use the backtracking matrix btrack to obtain the optimal alignment and POS. Algorithm 1 presents the pseudo code of the optimal alignment retrieval on GPUs. P1 and P2 describe the position of the optimal alignment score. Algorithm 1 first checks whether there are soft clippings at the end of R2, and then computes the optimal alignment in a while loop. At the end, it checks whether there are soft clippings at the beginning of R2. The backtracking starts from (P1,P2) and finishes when i≤0 or j≤0, which is calculated in linear time. POS is the value of (i−1) at the end of the while loop. In addition, the position of each element in the backtracking matrix is calculated by i, j and max_col, as shown in the 9th line in Algorithm 1. max_col is the column size of the maximum backtracking matrix of all sequence pairs.
The length of the deletion, insertion and match/mismatch is given by the value of an element of the backtracking matrix, as shown in the 13th, 18th and 23rd line, respectively. This reduces the global memory accesses used to calculate the length of the operations.
Results
All the experiments are performed on IBM Power System S823L (8247842L), which has 2 IBM Power8 processors (10 cores each) running at 3.6 GHz, 256 GB of DDR3 memory, and an NVIDIA Tesla K40 card. The NVIDIA Tesla K40 card has 2880 cores that run at up to 745 MHz and has a CUDA compute capability of 3.5.
We first compare the performance of the GPUbased semiglobal alignment implementation with different techniques using the synthetic datasets. The synthetic datasets are created based on the output of Wgsim [21] with default parameters. We then compare the performance of GPUbased semiglobal alignment implementation with gpupairAlign implementation using synthetic datasets. Next, we compare the performance of GPUbased semiglobal alignment implementation with CPUbased implementation using synthetic and real datasets. We last integrate the GPUbased semiglobal alignment implementation into GATK HC 3.7 and compare the overall performance.
Throughput is used as a performance metric of the first stage of the GPUbased implementation, which is measured by giga cell updates per second (GCUPS). Note that it is not fair to compare the throughput of the first stage of the semiglobal alignment with traceback with that of the scoreonly alignments since the former needs to store backtracking matrices in the global memory.
Performance comparison of multipass
There are two solutions to implement the first stage of the semiglobal alignment on GPUs if the length of R2 is bigger than the number of threads in a block. We realized these two solutions and used different synthetic datasets to compare the performance of the two solutions.
Figure 8 shows the performance of the two solutions with different synthetic datasets. There are 9 datasets each with a different length of R1/R2, namely: 64, 128, 192, 256, 320, 384, 448, 512 and 576. In each dataset, the lengths of R1 and R2 are the same. The number of sequence pairs in the 9 datasets is 25, 100 and 1000.
For the first solution, which is to increase the block size, there are in total 9 implementations for 9 datasets. The differences of these implementations are the block size and the sizes of vectors in the shared memory which store the intermediate results. For the second solution, which employs multipass, there is 1 implementation with block size of 128.
As shown by Fig. 8, the throughput of the first solution is higher than that of the second solution when the number of sequence pairs of the datasets is 25 and 100. However, when the number of sequence pairs of the datasets is 1000, the throughput of the second solution is higher in most cases. This is because the efficiency of the implementations for the first solution is smaller than that of the implementation for the second solution and the advantage of the second solution overweighs its disadvantage when the number of sequence pairs of the dataset is big. Thus, we can choose the implementation of these two solutions based on the number of sequence pairs of the dataset.
Performance comparison of recording match/mismatch lengths
In this section, we analyze the impact of recording the length of consecutive matches/mismatches on the performance of the second stage of the alignment on GPUs. We realized two implementations. The first implementation is our approach shown in Algorithm 1 in which the length of consecutive matches/mismatches is recorded in the backtrack matrix. The second implementation is similar to Algorithm 1 except that the length of M is increased by one and the coordinates (i, j) of M are decreased by 1. The backtracking matrices are produced by 9 implementations for the first solution using 9 synthetic datasets. Here, the synthetic datasets are not based on the output of Wgsim since we consider the best case, in which only many M operations exist in the optimal alignment. The lengths of R1/R2 in the 9 synthetic datasets are 64, 128, 192, 256, 320, 384, 448, 512 and 576. The number of sequence pairs in the 9 datasets is 100.
Figure 9 shows the execution time of the two implementations. The implementation which records match/mismatch lengths is faster. Moreover, its execution time remains nearly constant with increasing length of R1 and R2 as it only requires a single global memory access per R1 and R2 pair. The execution time of the implementation without recording match/mismatch lengths increases linearly with the length of R1 and R2. This is because the number of global memory accesses increases linearly with the number of M operations in the optimal alignment, which in turn increases linearly with the length of R1 and R2.
Performance comparison with gpupairAlign
As mentioned in “Background” section, there are two methods to implement the second stage on GPUs: the method based on the MyersMiller algorithm and the method based on backtracking matrices. The method based on the MyersMiller algorithm is only suitable for the pairwise alignment of very long DNA and protein sequences. Thus, we compared our implementation with gpupairAlign [12], which uses backtracking matrices to obtain the optimal alignments. gpupairAlign is designed to perform alignment of every given sequence pair on GPUs, especially for protein sequence pairs. It includes algorithms for global alignment, semiglobal alignment and local alignment. We compare with its semiglobal alignment algorithm. The semiglobal alignment algorithm of gpupairAlign is also performed in two stages: the optimal alignment score and the backtracking matrices are computed in the first stage; the backtracking is performed in the second stage.
There are two main differences between the gpupairAlign implementation and our implementation: (1) In the first stage, our implementation employs the intratask parallelization model, while the gpupairAlign implementation employs the intertask parallelization model; (2) The backtracking matrix of our implementation is a short2 matrix, elements of which are the length of consecutive deletion(es), insertion(es) and match(es)/mismatch(es), while the backtracking matrices of the gpupairAlign implementation are four Boolean matrices, elements of which indicate the proper direction of backtracking moves.
We modified the gpupairAlign implementation to make it to deal with the data produced by GATK HC: (1) Since the input data of our implementation is a set of sequence pairs instead of a set of sequences, the way in which the gpupairAlign implementation handles input data is modified; (2) Integer arrays are used to store the intermediate results instead of short arrays since the intermediate results are bigger than the maximum value of the short data type; (3) The alignments are modified to be represented using the CIGAR format and POS.
We first used the synthetic datasets described in “Performance comparison of multipass” subsection to compare the performance of the first stage of the two implementations, which is shown in Fig. 10. The performance of gpupairAlign implementation is much smaller than our implementation. The main reason is that when the size of the synthetic datasets is small, the resource on the GPU cannot be fully utilized for the intertask parallelization model. The second reason is that the intermediate results are stored in integer arrays, which increases the size of shared memory of each block and the number of global memory accesses.
We then compared the performance of the second stage of the two implementations using the synthetic datasets described in “Performance comparison of recording match/mismatch lengths” subsection, which is shown in Fig. 11. The execution time of the second stage of our implementation remains nearly constant when the length of R1 and R2 increases, while the execution time of the second stage of the gpupairAlign implementation increases linearly with the length of R1 and R2. Although the gpupairAlign implementation reduces the global memory space by using four Boolean matrices, it still needs to calculate each move one by one, which is avoided in our implementation through storing the length of consecutive deletion(es), insertion(es) and match(es)/mismatch(es).
Performance comparison with CPUbased implementation
In this section, we compare the performance of our GPUbased semiglobal alignment with traceback implementation with the CPUbased implementation using synthetic and real datasets. We used the first solution which increases the block size when the length of R2 is bigger than the block size and records the length of consecutive matches/mismatches. The CPUbased implementation is written in the C++ programming language and compiled with gcc O3 optimization, running on one Power8 core. The real datasets are produced by using a typical workload (Chromosome 10 of the whole human genome dataset G15512.HCC1954.1).
Figure 12 shows the speedup of the GPUbased implementations compared with the CPUbased implementation using the synthetic datasets described in “Performance comparison of multipass” subsection. There are in total 9 GPUbased implementations for 9 datasets, block size of which are 64, 128, 192, 256, 320, 384, 448, 512 and 576. The GPUbased implementations is up to 80x faster than the CPUbased implementation. Moreover, the speedup of the datasets with 1000 sequence pairs is bigger than the speedup of the datasets with 25 and 100 sequence pairs.
Table 3 shows the execution time of GPUbased implementations with the real datasets. As shown by Fig. 3, the length of R2 in situation 2 is 40 ∼120 and the length of R2 in situation 3 is 300 ∼520. Thus, we used two GPUbased implementations with block size of 128 and 576 to execute the real datasets produced in situation 2 and 3, respectively. The GPUbased implementation of situation 2 is 14.14x faster than the CPUbased implementation, while the GPUbased implementation of situation 3 is 4.89x faster than the CPUbased implementation. The throughput of the first stage of the GPUbased implementation for situation 2 is 1.86 GCUPS, while that for situation 3 is 0.64 GCUPS. The throughput of situation 3 is much smaller than the throughput for the synthetic datasets with size 25. This is because the number of sequence pairs of batches in situation 3 is extremely small (1∼8 in most cases).
Integration into GATK HC
The two GPUbased implementations with block size of 128 and 576 are integrated into GATK 3.7 to accelerate the semiglobal alignment with traceback of situation 2 and situation 3, respectively. The GATK HC implementation with both GPUbased pairHMMs forward algorithm and GPUbased semiglobal alignment with traceback is compared with other two GATK HC implementations: GATK HC (referred to as baseline), which is downloaded from the GATK website, and GATK HC with only GPUbased pairHMMs forward algorithm. The dataset is Chromosome 10 of the whole human genome dataset (G15512.HCC1954.1). All the GATK HC implementations are performed in single thread mode.
Table 4 shows the overall execution time of these three implementations. The implementation with both GPUbased pairHMMs forward algorithm and GPUbased semiglobal alignment with traceback is 2.30x faster than the baseline implementation. Moreover, it is 1.34x faster than the implementation with only GPUbased pairHMMs forward algorithm.
Note that the number of sequence pairs of each batch produced by GATK HC is small, leading to under utilization of the GPU resources. It is better to launch multiple GATK HC processes at the same time to fully utilize the GPU resources.
Conclusion
This paper presents an implementation of the semiglobal alignment with traceback on GPUs to improve the performance of GATK HC. Semiglobal alignment with traceback has two stages: in the first stage, a backtracking matrix is computed; in the second stage, the optimal alignment is calculated using the backtracking matrix. Based on the characteristics of the semiglobal alignment with traceback in GATK HC, the intratask parallelization model is chosen. The first stage of our GPU implementation is up to 18.94x faster than CPU. Moreover, our GPU implementation also records the length of consecutive matches/mismatches in addition to lengths of consecutive insertions and deletions as in the CPU implementation. This helps to reduce global memory accesses and provides a speedup of up to 4.43x in the second stage. Experimental results show that our alignment kernel with traceback is up to 80x and 14.14x faster than its CPU counterpart with synthetic datasets and real datasets, respectively. The GATK HC implementation with both GPUbased pairHMMs forward algorithm and GPUbased semiglobal alignment with traceback is 2.30x faster than the baseline GATK HC. It is 1.34x faster than the GATK HC implementation with only GPUbased pairHMMs forward algorithm.
Abbreviations
 HC:

HaplotypeCaller
 NGS:

Next Generation Sequencing
References
 1
Takeshi O, Yinhe C, Kathy T. Performance optimization of Broad Institute GATK Best Practices on IBM reference architecture for healthcare and life sciences. IBM Systems Technical White Paper. 2017. https://www.ibm.com/downloads/cas/LY1OY9XJ.
 2
Proffitt A. Broad, Intel Announce Speed Improvements to GATK Powered by Intel Optimizations. BioIT World. 2014. http://www.bioitworld.com/2014/3/20/broadintelannouncespeedimprovementsgatkpoweredbyinteloptimizations.html.
 3
Ren S, Bertels K, AlArs Z. GPUAccelerated GATK HaplotypeCaller with LoadBalanced MultiProcess Optimization. In: IEEE International Conference on Bioinformatics and Bioengineering: 2017. p. 497–502.
 4
Ren S, Bertels K, AlArs Z. Efficient Acceleration of the PairHMMs Forward Algorithm for GATK HaplotypeCaller on Graphics Processing Units. Evol Bioinform Online. 2018; 14:1176934318760543.
 5
Li IT, Shum W, Truong AK. 160fold acceleration of the SmithWaterman algorithm using a field programmable gate array (FPGA). BMC Bioinforma. 2007; 8(1):1–7.
 6
Benkrid K, Liu Y, Benkrid AS. A Highly Parameterized and Efficient FPGABased Skeleton for Pairwise Biological Sequence Alignment. IEEE Trans Very Large Scale Integr Syst. 2009; 17(4):561–70.
 7
Hasan L, Kentie M, AlArs Z. DOPA: GPUbased protein alignment using database and memory access optimizations. BMC Res Notes. 2011; 4(1):261.
 8
Ahmed N, Mushtaq H, Bertels K, AlArs Z. GPU accelerated API for alignment of genomics sequencing data. In: IEEE International Conference on Bioinformatics and Biomedicine: 2017. p. 510–5.
 9
Maskell DL, Liu Y, Bertil S. CUDASW++: optimizing SmithWaterman sequence database searches for CUDAenabled graphics processing units. BMC Res Notes. 2009; 2(1):73.
 10
Liu Y, Schmidt B, Maskell DL. CUDASW++2.0: enhanced SmithWaterman protein database search on CUDAenabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Res Notes. 2010; 3(1):93.
 11
Liu Y, Huang W, Johnson J, Vaidya S. GPU Accelerated SmithWaterman. In: International Conference on Computational Science: 2006. p. 188–95.
 12
Blazewicz J, Frohmberg W, Kierzynka M, Pesch E, Wojciechowski P. Protein alignment algorithms with an efficient backtracking routine on multiple GPUs. BMC Bioinforma. 2011; 12(1):181.
 13
Liu Y, Schmidt B, Maskell DL. MSACUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA. In: IEEE International Conference on ApplicationSpecific Systems, Architectures and Processors: 2009. p. 121–8.
 14
Korpar M, Sikic M. SW#GPUenabled exact alignments on genome scale. Bioinformatics. 2013; 29(19):2494–5.
 15
de O Sandes EF, de Melo ACMA. SmithWaterman Alignment of Huge Sequences with GPU in Linear Space. In: 2011 IEEE International Parallel Distributed Processing Symposium: 2011. p. 1199–211.
 16
Sandes EFO, Miranda G, Martorell X, Ayguade E, Teodoro G, Melo ACMA. CUDAlign 4.0: Incremental Speculative Traceback for Exact ChromosomeWide Alignment in GPU Clusters. IEEE Trans Parallel Distrib Syst. 2016; 27(10):2838–50.
 17
Myers EW, Miller W. Optimal alignments in linear space. Comput Appl Biosci Cabios. 1988; 4(1):11–7.
 18
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al.The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25(16):2078–9.
 19
TCGA Mutation Calling Benchmark 4 Files. https://gdc.cancer.gov/resourcestcgausers/tcgamutationcallingbenchmark4files. G15512.HCC1954.1.
 20
Xiao S, Aji AM, Feng W. On the Robust Mapping of Dynamic Programming onto a Graphics Processing Unit. In: International Conference on Parallel and Distributed Systems: 2009. p. 26–33.
 21
Wgsim. https://github.com/lh3/wgsim.
Acknowledgements
The authors wish to thank the Texas Advanced Computing Center (TACC) at the University of Texas at Austin and IBM for the giving access to the IBM Power8 machines used in this paper.
Funding
This work was supported by CSC (Chinese Scholarship Council) grant and Delft University of Technology. Publication of this article was sponsored by Delft University of Technology.
Availability of data and materials
The algorithm generated in this manuscript as well as all input datasets are publicly available on a publicly available repository: https://github.com/ShanshanRen/semiglobalalignmentwithtraceback.
About this supplement
This article has been published as part of BMC Genomics Volume 20 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume20supplement2.
Author information
Affiliations
Contributions
SR designed and performed the experiments, analyzed the data, and wrote the manuscript. All the authors jointly developed the structure and arguments for the paper, made critical revisions and approved final version.
Corresponding author
Correspondence to Zaid AlArs.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not Applicable.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Published
DOI
Keywords
 Semiglobal alignment with traceback
 Optimal alignment
 GATK HaplotypeCaller (HC)
 GPUs