### Overview of the Zuker algorithm

The Zuker algorithm predicts the most stable secondary structure for a single RNA sequence by computing its minimal free energy (MFE). It uses a "nearest neighbor" model and empirical estimates of thermodynamic parameters for neighboring interactions and loop entropies to score all possible structures [20]. The main idea is that the secondary structure of an RNA sequence consists of four fundamental substructures: stack, hairpin, internal loop, and multi-branched loop. These fundamental substructures are independent of one another, and the energy of a secondary structure is assumed to be the sum of the substructure energies. With a single RNA sequence as input, the algorithm is executed in two steps. First, it calculates the minimal free energy of the input RNA sequence using a group of recurrence relations, shown in Formulas (1) to (5). Second, it performs a trace-back to recover the secondary structure, i.e., the set of base pairs. Experiments show that the first step consumes more than 99% of the total execution time. Thus, computing the energy matrices as quickly as possible is critical to improving performance.

W\left(j\right)=min\left\{W\left(j-1\right),\underset{1\le i<j}{min}\left[V\left(i,j\right)+W\left(i-1\right)\right]\right\}

(1)

V\left(i,j\right)=\left\{\begin{array}{cc}min\left\{\begin{array}{c}eH\left(i,j\right)\\ eS\left(i,j\right)+V\left(i+1,j-1\right)\\ VBI\left(i,j\right)\\ VM\left(i,j\right)\end{array}\right. & \text{pair}\ \left(i,j\right)\ \text{is allowed}\\ \infty & \text{pair}\ \left(i,j\right)\ \text{is not allowed}\end{array}\right.

(2)

VBI\left(i,j\right)=\underset{i<k<l<j}{min}\left\{eL\left(i,j,k,l\right)+V\left(k,l\right)\right\}

(3)

VM\left(i,j\right)=\underset{i<k<j}{min}\left\{WM\left(i,k\right)+WM\left(k+1,j\right)\right\}

(4)

WM\left(i,j\right)=min\left\{VM\left(i,j\right),min\left[WM\left(i+1,j\right),WM\left(i,j-1\right)\right],V\left(i,j\right)\right\}

(5)

Suppose *r*_{1}*r*_{2}...*r*_{i}...*r*_{j}...*r*_{n} represents an RNA sequence, where i and j are the locations of nucleotides in the sequence and n is the sequence length. Formulas (1) to (5) describe the method for computing the free energy. Here, W(j) is the energy of an optimal structure for the subsequence *r*_{1}*r*_{2}...*r*_{j}; V(i, j) is the energy of the optimal structure of the subsequence *r*_{i}*r*_{i+1}...*r*_{j}; VBI(i, j) is the energy of the subsequence *r*_{i} through *r*_{j}, where the pair *r*_{i}*r*_{j} closes a bulge or an internal loop; VM(i, j) is the energy of the subsequence *r*_{i} through *r*_{j}, where the pair *r*_{i}*r*_{j} closes a multi-branched loop; and eS(i, j), eH(i, j), and eL(i, j, k, l) are free-energy functions used to compute the energies of a stacked pair, a hairpin loop, and an internal loop, respectively. Given any subsequence *r*_{i}...*r*_{j}, the Zuker algorithm calculates the free energies of the four possible substructures if pair (i, j) is allowed. The results correspond to the four items in Formula (2): eH(i, j), *eS*(*i,j*) + *V*(*i* + 1, *j* - 1), VBI(i, j), and VM(i, j). The Zuker algorithm then selects the minimum of these four results as V(i, j). The subsequence grows from *r*_{1} through *r*_{1}*r*_{2}, ..., and *r*_{1}*r*_{2}...*r*_{j-1}, up to *r*_{1}*r*_{2}...*r*_{i}...*r*_{j}...*r*_{n}. The lowest conformational free energy of each such prefix is stored in the vector W: the energy corresponding to *r*_{1} is stored in W(1), that of *r*_{1}*r*_{2} in W(2), and so on for longer fragments, such as W(j-1) for *r*_{1}*r*_{2}*r*_{3}...*r*_{j-1}. Once the longest fragment (i.e., the complete sequence) has been considered, the lowest conformational free energy of the whole RNA sequence has been calculated, and the energy of the most energetically stable structure is contained in W(n). The corresponding secondary structure is then obtained by a trace-back procedure over the vector W and the matrices V and WM.

### Overview of CPU architecture

In recent years, the number of processing cores available on a modern processor chip has increased steadily. Quad-core CPUs are now the norm, and systems with more cores have become economically available. Figure 1 shows a typical quad-core CPU architecture. Each core hosts one thread at a time, with a set of registers holding the thread's state and a large functional unit devoted to computation and management.

Multi-core CPUs make it necessary to rethink the development of application software. Application programmers must explicitly use concurrency to approach the peak performance of modern processors. To utilize all the available processing power of these processors, computationally intensive tasks should be split into subtasks for execution on different cores. A number of approaches are available for parallel programming on multi-core CPUs, ranging from low-level multi-tasking or multi-threading, such as the POSIX Thread (pThread) library [5], through high-level libraries that provide abstractions and features intended to simplify concurrent software development, such as Intel Threading Building Blocks [7], to programming languages or language extensions developed specifically for concurrency, such as OpenMP [6]. Apart from multi-threaded parallelism on a multi-core platform, data parallelism can be exploited by SIMD vector-processing instructions. For example, Intel provides a SIMD instruction set called Streaming SIMD Extensions (SSE) [21]. SSE contains 70 new instructions and 8 new 128-bit registers, and SSE2 adds new arithmetic operations, including maximum and minimum. Each 128-bit register can be partitioned to operate in parallel on four 32-bit integers or single-precision floating-point values, eight 16-bit short integers, or sixteen 8-bit bytes.

### Overview of GPU architecture

Figure 2 depicts the GPU architecture from Nvidia. The GPU contains a scalable array of multi-threaded processing units known as streaming multi-processors (SMs). Each SM contains eight scalar processor (SP) cores that execute the actual instructions. Each SM performs computation independently; however, the SP cores within a single multi-processor execute instructions synchronously. This paradigm, called "single instruction, multiple threads" (SIMT) [8], is the basic computing scheme of the GPU. Threads are grouped into blocks, and multiple blocks may run in a grid of blocks. Such structured sets of threads are launched on a kernel of code that processes data in the device memory. Threads of the same block share data through on-chip memory and coordinate through synchronization points. CUDA is the most widely used parallel programming model and software environment for running applications on Nvidia GPUs. CUDA abstracts the architecture for parallel programmers via simple extensions to the C programming language. In CUDA, the threads of a block are created, managed, scheduled, and executed in groups of 32 threads with consecutive thread IDs, called warps. Parallel performance improves when all threads in the same warp follow the same execution path.
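To make the SIMT scheme concrete: cells on one anti-diagonal of the V matrix depend only on previously computed diagonals, so a CUDA kernel could assign one cell per thread and launch once per diagonal. The sketch below is purely illustrative; the names, the diagonal-per-launch scheme, and the elided energy evaluation are our assumptions, not a description of any particular implementation.

```cuda
/* Hypothetical kernel: each thread computes one cell (i, j) on anti-diagonal
 * d of the V matrix (stored row-major in a flat array). Threads with
 * consecutive IDs fall into the same warp and follow the same path. */
__global__ void fill_diagonal(int *V, int n, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; /* global thread ID */
    int j = i + d;                                 /* cell (i, j) on diagonal d */
    if (j >= n) return;
    /* ... evaluate eH(i,j), eS(i,j) + V[(i+1)*n + (j-1)], VBI, and VM here,
     *     and store the minimum in V[i*n + j] ... */
}

/* Host side: one kernel launch per diagonal, shortest subsequences first,
 * so every dependency of diagonal d is complete before d is launched. */
void fill_V(int *dV, int n) {
    const int threads = 128; /* a multiple of the 32-thread warp size */
    for (int d = 1; d < n; d++) {
        int blocks = (n - d + threads - 1) / threads;
        fill_diagonal<<<blocks, threads>>>(dV, n, d);
    }
    cudaDeviceSynchronize(); /* wait for all diagonals to finish */
}
```

Because every thread on a diagonal executes the same instruction sequence on different (i, j), warps diverge little, which is the condition for good SIMT performance noted above.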

A hierarchy of GPU memory architecture is available for programmers to utilize. The fastest memories are the shared memory and registers with severely limited sizes. Registers are allocated by a compiler, whereas shared memory is allocated by a programmer. The constant, texture, and global memory are all located on the off-chip DRAM. The texture and constant memory are read-only and are cached. The global memory is the slowest memory, and its access may take hundreds of clock cycles.

Most research in GPU programming involves finding the optimal way to solve a problem on a data-parallel architecture while making the best use of optimizations specific to GPU architectures. Of the many GPGPU APIs available [8–11], Nvidia's CUDA stands out as the most developed and advanced; however, it operates only on Nvidia GPUs. Our development of GPGPU applications therefore uses the CUDA API and is limited to Nvidia graphics cards.