Volume 13 Supplement 1
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications
© Lei et al.; licensee BioMed Central Ltd. 2012
Published: 17 January 2012
Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications.
In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperate execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture.
Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications.
RNA is an important molecule in biological systems. The function of RNA can be derived generally from its secondary structure. Recently, computational and mathematical methods such as thermodynamic energy minimization, homologous comparative sequences, and stochastic context-free grammar methods have been widely used to predict the RNA secondary structure. Among these, the Zuker algorithm is the most popularly and widely used free energy minimization algorithm for RNA secondary structure prediction . The Zuker algorithm was first presented in 1981 by M. Zuker. Its time complexity is O(n4) , and its spatial complexity is O(n2), where N is the length of the sequence. The optimized algorithm  reduces the time complexity to O(n3) by limiting the length of interior loop size. This algorithm is commonly used in many popular secondary structure prediction software, such as Mfold and ViennaPackage. The Zuker algorithm is able to predict short sequences because accuracy is decreased rapidly as the sequence grows longer . However, the exponential increment of RNA sequences makes conventional computers incapable of meeting the demand of multiple-sequence processes. For instance, a single-core Intel Xeon E5620 processor consumes 18 ms to predict the secondary structure of a single RNA sequence with length 120, and needs over 370s for 20000 RNA sequences of the same length. Execution time increases rapidly as sequence number grows.
Recently, heterogeneous computing systems have been broadly used in high-performance computing. Three of the first five Top 500 supercomputers are built based on CPU-GPU heterogeneous architecture . Specifically, Tian-1A, the first supercomputer in 2010's Top 500, is the first to exploit the heterogeneous system. The computing method on the CPU-GPU heterogeneous system has been proven effective in HPC, and its performance has grown rapidly with improvements of CPU and GPU technology.
The availability of multiple cores on a modern processor chip makes the CPU a more powerful computing platform for general applications. Parallel programming methods on multi-core CPUs, such as POSIX Thread (pThread) library , OpenMP library , and Intel Threading Building Blocks , effectively simplify concurrent software development on these platforms.
To date, GPUs have emerged as favorable and popular accelerator devices that keep up with the increasing demands of the gaming industry. GPUs have developed into more general, highly parallel, multiple-core processing architecture to keep up with the demands of the computer games, which increase faster than processor clock speeds. A number of libraries have been developed to allow users to write non-graphic applications computed on GPUs, which is known as General Purpose computation on GPUs (GPGPU). The development of GPGPU libraries, such as the NVidia CUDA , BrookGPU , the ATI Stream SDK , Sh , and OpenCL  have made GPGPU applications increasingly easy to develop. With the rapidly improving computational capability of CPU and GPU, exploring the performance potentials of both systems has become a problem. In the current heterogeneous system, CPU plays as program controller while GPU performs most of the computing task. The performance of CPU has not yet been fully explored. In the present study, a CPU-GPU hybrid system that efficiently explores both CPU and GPU computing potentials is presented. A part of the computing tasks of the GPU is allocated for the CPU, reducing the performance loss from waiting for results from the former. Speedup factor of 6.75× over optimized multi-core SIMD CPU implementation and performance improvement factor of 16% over optimized GPU implementation are shown in the experimental results. In addition, more than 14% of sequences are computed on CPU in the hybrid system.
Accelerating or parallelizing the Zuker algorithm for RNA secondary structure prediction on modern computing platforms is not new. Tan et al.  reported 8 speedup on a cluster with 16 Opteron processors that for cluster parallel computers. On shared memory parallel computers, Tan et al.  reported 19× speedup on a 32-processor system, DAWNING 4000. Mathuriya et al.  present their parallel implementation of GTfold on a 32-core IBM P5-570 server and 19.8 speedup is achieved. On multi-core processor, Wu et al.  parallelized the RNAfold program on quad-core Intel Q6600 processor to obtain 3.88 speedup. Another solution to accelerate the Zuker algorithm is to use accelerators. Based on the FPGA platform, Dou et al.  and Jacob et al.  presented fine-grain parallel implementation and one to two orders of magnitude speedup over the general-purpose processor. Based on the GPU platform, G. Rizk et al.  accelerated the Unafold application to attain 33.1 speedup over a single-core Xeon GPU. The problem of accelerating the Zuker algorithm or other applications on large-scale parallel computers is that use, maintenance, and management costs are very high. High-performance parallel computers are too expensive for use by many research institutes. Although specialized coprocessing accelerators are able to achieve high performance, the computing capability of CPUs in systems where the CPU is the program controller has not been explored. A heterogeneous hybrid CPU-GPU computing scheme would be capable of not only exploiting the power of accelerating device, but also of exploring the CPU computing potential. To our best knowledge, no parallel implementation for accelerating Zuker algorithm on CPU-GPU hybrid system has thus far been realized.
Overview of the Zuker algorithm
Suppose r1r2...r i ...r j ...r n represents an RNA sequence where i and j are the location of the nucleotides in the sequence, and n is the sequence length. Formula (1) to (5) describe the method for computing free energy. Here, W(j) is the energy of an optimal structure for the subsequence r1r2 ...... r j ; V(i, j) is the energy of the optimal structure of the subsequence r i ri+1...r j ; VBI(i, j) is the energy of the subsequence r i through r j , where r i r j closes a bulge or an internal loop; VM(i, j) is the energy of the subsequence r i through r j , where r i r j closes a multi-branched loop; and eS(i, j), eH(i, j), and eL(i, j, k, l) are free energy functions used to compute the energy of stacked pair, hairpin loop, and internal loop respectively. Given any subsequence r i ...r j , the Zuker algorithm calculates free energies of the four possible substructures if pair (i,j) is allowed. The results correspond to the four items in Formula (2): eH(i, j), eS(i,j) + V(i + 1, j - 1), VBI(i, j), and VM(i, j). The Zuker algorithm then selects the minimum value V(i,j) among the four results. The subsequence grows from r1, r1r2, ..., r1r2....rj-1to r1r2...r i ...r j ...r n . The lowest conformational free energy is stored in vector W. The corresponding energy of r1 is stored in W(1), and r1r2 is stored in W(2), and so on for longer fragments, such as W(j-1) for r1r2r3 ...... rj-1. Once the longest fragment (i.e., the complete sequence) is considered, the lowest conformational free energy of whole RNA sequence is calculated, and the energy of the most energetically stable structure is contained in W(n). The corresponding secondary structure is then obtained by a trace-back procedure from the vector W, and matrices V and WM.
Overview of CPU architecture
Multi-core CPUs make rethinking the development of application software necessary. Application programmers should explicitly use concurrency to approximate the peak performance of modern processors. To utilize all available processing power of these processors, computationally intensive tasks should be split up into subtasks for execution on different cores. A number of different approaches are available for parallel programming on multi-core CPUs, ranging from low-level multi-tasking or multi-threading such as POSIX Thread (pThread) library , over high-level libraries, such as Intel Threading Building Blocks , which provide certain abstractions and features attempting to simplify concurrent software development, to programming languages or language extensions developed specifically for concurrency, such as OpenMP . Apart from multi-threading parallelism on the multi-core platform, data parallelism can be explored by SIMD vector processing instructions. For example, Intel has an SIMD instruction set called streaming SIMD extensions (SSE) . SSE contains 70 new instructions and 8 new 128-bit registers. SSE2 adds new arithmetic types, including maximum and minimum operations. Each 128-bit register can be partitioned to perform four 32-bit integers, or single-precision floating points, or eight 16-bit short integers, or sixteen 8-bit bytes operations in parallel.
Overview of GPU architecture
A hierarchy of GPU memory architecture is available for programmers to utilize. The fastest memories are the shared memory and registers with severely limited sizes. Registers are allocated by a compiler, whereas shared memory is allocated by a programmer. The constant, texture, and global memory are all located on the off-chip DRAM. The texture and constant memory are read-only and are cached. The global memory is the slowest memory, and its access may take hundreds of clock cycles.
Most research in GPU programming involves finding the optimal way to solve a problem on data-parallel architecture while best using optimizations specific to GPU architectures. Of the many GPGPU APIs available [8–11], the Nvidia CUDA stands out as the most developed and advanced. GPGPU API only operates on Nvidia GPUs. Our development of GPGPU applications uses CUDA API limited to Nvidia graphics cards.
CPU-GPU hybrid computing system
The hybrid accelerating system predicts secondary structures of all RNA sequences input. Firstly, the CPU reads all RNA sequences, and performs some preprocess operations such as memory allocation for energy matrices. After the task allocation, CPUs and GPUs fill energy matrices in parallel. The CPU receives results from the GPU device memory after energy filling is through. Finally, the CPU executes backtrack operations, and displays all the RNA secondary structure information to the user.
Task-allocation and execution scheme
The proposed hybrid accelerating system fully exploits the performance potentials of both CPU and GPU to obtain higher system performance for the same computing task. The key factor in the hybrid accelerating system is the task allocation between the CPU and the GPU. Given several RNA sequences numbered 1,2,...,N, the task allocation obtains a boundary sequence number B. The sequences with number below B are to be computed on the CPU. The rest of sequences are computed on the GPU. To achieve load balancing in the task allocation, the processing capability of the CPU and the GPU for RNA sequence with different length is estimated beforehand.
Calculation of the boundary sequence number B is demonstrated below. The average execution time of a single sequence on CPU and GPU are first assumed as T1CPUand T1GPU, respectively. The entire energy filling time will be minimal when the CPU and GPU executions are overlapped nearly completely, satisfying equation T1CPU· B = T1GPU· (N - B). The boundary value B then can be computed by equation , where K denotes the speedup of GPU vs. CPU and equals . The expression is called the allocation ratio.
Performance tuning schemes
Multi-core CPU implementation
Performance tuning on multi-core CPU mainly consists of compiler optimization, SSE, and multi-thread processing. First, compiler optimization is used by setting the -O2 option to improve the performance of original software. SSE instruction is then utilized to accelerate computation of the VM matrix. The element VM(i,j) depends on the row WM(i,*) and the column WM(*,j). Computation on VM(i,j) is divided into two steps. In the first step, elements on the row WM(i,*) are added to corresponding elements on the column WM(*,j). The minimal value of the sums is then chosen to be VM(i,j). Using SSE instructions, the four elements from row i and column j of matrix WM are read and stored in a 128-bit register. Two 128-bit registers are added, such that four simultaneous additions are performed to compute VM(i,j). The third approach for boosting performance of the CPU is by OpenMP libraries that explore parallel processing of multiple sequences. Energy matrices computations of different sequences are mapped onto different threads for processing by an Intel compiler for OpenMP libraries. Each thread is responsible for the matrices computations of a single sequence allocated onto it. Threads are working independently from each other to access their own memory spaces, achieving high coarse-grain parallel performance.
Next, memory optimizations are adopted to fully use limited device memory. Data type transformation, immediate assignment, and redundant data cutting schemes are used to reduce parameters storage requirement in the device memory. To improve the memory access performance, column data of the matrix are stored in a consecutive memory address array. Sequence data and energy matrices are stored in shared memory and global memory respectively. A data reuse scheme is implemented by storing partly energy matrices in shared memory. Mediating data in the computation is partly stored in shared memory and registers to improve memory access and memory bandwidth.
Computation of the energy elements in the diagonal is mapped to the threads in a warp to explore fine-grain parallelism. Element computation in the V matrix is divided in two situations based on whether or not the i th and the j th nucleotide become a pair. Hence, execution path of the threads in the same warp may differ. In the proposed hybrid system, the matrix elements with the same execution path are mapped to the threads in a warp with consecutive thread ID to effectively improve the parallel efficiency.
Experiments and results
In general, the hybrid accelerating system is composed of multiple CPUs and GPUs. For ease, the proposed prototype system for performance evaluation only consists of a host PC and a GPU card. The host in the testbed is equipped with an Intel Xeon E5620 Quad 2.4 GHz CPU, 24 GB memory, and ASUS Z8PE-D12 Motherboard (Intel 5520 chipset running Windows 7 with Visual Studio 2010 development environment). Geforce GTX580 with CUDA toolkit 4.0 is utilized as the GPU experimental platform. The Zuker algorithm program RNAfold is derived from the software package ViennaRNA-1.8.4 .
Average execution time (ms) per sequence, and speedup (Sp) for four groups of sequences when different performance tuning method is adopted gradually on CPU
A (L = 68)
B (L = 120)
C (L = 154)
D (L = 221)
An average 1.90× speedup over the naive implementation on single Xeon E5620 core can be achieved in the O2 optimization. The SSE2 method is awkward for the first three groups of sequences, the control penalty induced is over the original VM decomposition operations for short sequence processes. After the multi-thread processing, the highest performance has been obtained for each group.
Average execution time (ms) per sequence and speedup (Sp) for four sequence groups when different performance tuning methods are adopted gradually on GPU
A (L = 68)
B (L = 120)
C (L = 154)
D (L = 221)
Thread scheduling and tiling
After thread scheduling and tiling, the optimal average execution time for each sequence of the corresponding group is available. The average execution time for the sequence with length 120 on a single GTX 280 card is inferred as 0.473 ms ; however, it is 0.293 ms on the proposed method. Thus, a 1.61× speedup can be achieved over single GTX 280 card implementation.
Hybrid system performance
Experimental results of the estimated task allocation ratio for different sequence groups
A (L = 68)
B (L = 120)
C (L = 154)
D (L = 221)
Avg. CPU exe. time
Avg. GPU exe. time
Speedup GPU vs. CPU
Estimated allocation ratio
Allocation scheme evaluation
Hybrid system speedup
Experimental results of execution time (s) and speedup (Sp) on different platforms for four sequence groups
A (L = 68)
B (L = 120)
C (L = 154)
D (L = 221)
According to the results of Table 4, the speedup over CPU-only implementation achieved for the hybrid system for the group B is larger than the rest of the groups. It can be inferred that the the efficiency of the hybrid system can be exploited well for sequences in this group. For longer sequences, the speedup is decreasing. It is mainly because that different portions of the Zuker algorithm have different computational complexity and GPU efficiency . For longer sequences in group D, the GPU efficiency is becoming lower. The speedup of GPU over CPU in group D is 5.83×, lower than that of the rest of groups. We can infer that although the main idea of the proposed hybrid system is to exploit CPU computing ability, the performance of the hybrid system is still limited to that of GPU. In group D, a majority of sequences (over 85%) are still processed on GPU, which is the primary computing platform in the hybrid system. The key point of the hybrid system is the task allocation between the CPU and GPU. The proposed task allocation scheme may only be used to multiple sequences with same or similar length. For these sequences, we measure the estimated average execution time of each sequence on both CPU and GPU platform to calculate the speedup of GPU over CPU. The speedup then can be used to estimated the boundary value B for allocation of tasks between CPU and GPU. According to the boundary value B, the input numbered sequences are divided into two parts which are sent to CPU and GPU for parallel processing respectively. Although we only evaluated the hybrid accelerating method in the testbed with one CPU and one GPU, the proposed method can be easily employed to hardware platforms with multiple CPUs and GPUs. In these systems, the performance of multiple CPUs and GPUs must be estimated to instruct the task allocation, which will become a little complicated. Take the hardware system with two CPUs and GPUs for example, there may be three boundary values to be calculated to determine how many sequences will be sent to multiple CPUs and GPUs respectively.
The Zuker algorithm is widely used for RNA secondary structure prediction. Based on careful investigation of the CPU and the GPU architecture, a novel CPU-GPU hybrid accelerating system for Zuker algorithm applications is proposed in the current study. Performance differences of CPU and GPU in the task allocation scheme is considered to obtain the workload balance. To improve the hybrid system performance, implementations of the Zuker algorithm on both the CPU and GPU platforms are optimized. The experimental results show that the hybrid accelerating system achieves a speedup factor of 15.93× and 16% performance advantages over optimized multi-core CPU and GPU implementations respectively. Moreover, more than 14% computation task is executed on CPU. The method combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can also be employed to other bioinformatics applications.
We would like to thank the researchers who provided access, documentation and installation assistance for the ViennaRNA-1.8.4 software. This work is partially sponsored by NSFC under Grant 61125201.
This article has been published as part of BMC Genomics Volume 13 Supplement 1, 2012: Selected articles from the Tenth Asia Pacific Bioinformatics Conference (APBC 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/13?issue=S1.
- Mount DW: Bioinformatics: Sequence and Genome Analysis. 2004, CSHL Press
- Zuker M, Stiegler P: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981, 9: 133-10.1093/nar/9.1.133.PubMed CentralView ArticlePubMed
- Lyngs R, Zuker M, Pedersen C: Fast evaluation of internal loops in RNA secondary structure prediction. Bioinformatics. 1999, 15 (6): 440-10.1093/bioinformatics/15.6.440.View Article
- TOP500 Supercomputing Sites. [http://www.top500.org/]
- Barney B: POSIX threads programming. [https://computing.llnl.gov/tutorials/pthreads/]
- Chandra R: Parallel Programming in OpenMP. 2001, Morgan Kaufmann
- Reinders J: Intel Threading Building Blocks. 2007, O'Reilly
- Corporation N: CUDATM 2.0 programming guide. 2008, [http://www.nvidia.com/]
- Buck I, Foley T, Horn D, Sugerman J, Fatahalian K, Houston M, Hanrahan P: Brook for GPUs: stream computing on graphics hardware. ACM Trans Graph. 2004, 23: 777-786. 10.1145/1015706.1015800.View Article
- ATI: Technical overview ATI stream computing. 2009, [http://developer.amd.com/sdks/amdappsdk/pages/default.aspx]
- McCool M, Du Toit S: Metaprogramming GPUs with Sh. 2004, AK Peters Wellesley
- Group K: OpenCL. 2008, [http://www.khronos.org/opencl/]
- Tan G, Feng S, Sun N: Locality and parallelism optimization for dynamic programming algorithm in bioinformatics. Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. 2006, ACM, 78-es.View Article
- Tan G, F S, Sun N: An optimized and efficiently parallelized dynamic programming for rna secondary structure prediction. Journal of Software. 2006, 17: 1501-1509. 10.1360/jos171501.View Article
- Mathuriya A, Bader D, Heitsch C, Harvey S: GTfold: a scalable multicore code for RNA secondary structure prediction. Proceedings of the 2009 ACM Symposium on Applied Computing. 2009, ACM, 981-988.View Article
- Wu G, Wang M, Dou Y, Xia F: Exploiting fine-grained pipeline parallelism for wavefront computations on multicore platforms. 2009 International Conference on Parallel Processing Workshops. 2009, IEEE, 402-408.View Article
- Dou Y, Xia F, Zhou X, Yang X: Fine-grained parallel application specific computing for RNA secondary structure prediction on FPGA. Computer Design, 2008. ICCD 2008. IEEE International Conference on. 2008, IEEE, 240-247.
- Jacob A, Buhler J, Chamberlain R: Rapid RNA folding: analysis and acceleration of the Zuker recurrence. 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. 2010, IEEE, 87-94.View Article
- Rizk G, Lavenier D: GPU accelerated RNA folding algorithm. Computational Science-ICCS. 2009, 1004-1013.
- Gardner P, Giegerich R: A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics. 2004, 5: 140-10.1186/1471-2105-5-140.PubMed CentralView ArticlePubMed
- Intel: Intel 64 and IA-32 architectures software developer manual. [http://www.intel.com/products/processor/manuals]
- Gruber AR, Lorenz R, Bernhart SH, Neuböck R, Hofacker IL: The Vienna RNA websuite. Nucleic Acids Res. 2008, 36 (Web Server issue): W70-W74.PubMed CentralView ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.