Parallel progressive multiple sequence alignment on reconfigurable meshes

Background One of the most fundamental and challenging tasks in bio-informatics is to identify related sequences and their hidden biological significance. The most popular and proven best practice method to accomplish this task is aligning multiple sequences together. However, multiple sequence alignment is a computing extensive task. In addition, the advancement in DNA/RNA and Protein sequencing techniques has created a vast amount of sequences to be analyzed that exceeding the capability of traditional computing models. Therefore, an effective parallel multiple sequence alignment model capable of resolving these issues is in a great demand. Results We design O(1) run-time solutions for both local and global dynamic programming pair-wise alignment algorithms on reconfigurable mesh computing model. To align m sequences with max length n, we combining the parallel pair-wise dynamic programming solutions with newly designed parallel components. We successfully reduce the progressive multiple sequence alignment algorithm's run-time complexity from O(m × n4) to O(m) using O(m × n3) processing units for scoring schemes that use three distinct values for match/mismatch/gap-extension. The general solution to multiple sequence alignment algorithm takes O(m × n4) processing units and completes in O(m) time. Conclusions To our knowledge, this is the first time the progressive multiple sequence alignment algorithm is completely parallelized with O(m) run-time. We also provide a new parallel algorithm for the Longest Common Subsequence (LCS) with O(1) run-time using O(n3) processing units. This is a big improvement over the current best constant-time algorithm that uses O(n4) processing units.


Background
The advancement of DNA/RNA and protein sequencing and sequence identification has created numerous databases of sequences. One of the most fundamental and challenging tasks in bio-informatics is to identify related sequences and their hidden biological significance. Aligning multiple sequences together provides researchers with one of the best solutions to this task. In general, multiple sequence alignment can be defined as: Definition 1 Given: m sequences, (s 1 , s 2 ,..., s m ), over an alphabet ∑, where each sequence contains up to n symbols from ∑; a scoring function h: × × · · · × → ; and a gap cost function. Multiple sequence alignment is a technique to transform (s 1 , s 2 , ..., s m ) to s 1 , s 2 , · · · , s m , where s i is s i ∪ '-' [gap insertions], that optimizes the matching scores between the residues across all sequence columns [1]. However, multiple sequence alignment is an NP-Complete problem [2]; therefore, it is often solved by heuristic techniques. Progressive multiple sequence alignment is one of the most popular multiple sequence alignment techniques, in which the pair-wise symbol matching scores can be derived from any scoring scheme or obtained from a substitution scoring matrix such as PAM [3] or BLOSUM [4]. There are many implementations of progressive multiple sequence alignment as seen in [5][6][7][8]. In general, progressive multiple sequence alignment algorithm follows three steps: (i) Perform all pair-wise alignments of the input sequences.
(ii) Compute a dendrogram indicating the order in which the sequences to be aligned.
(iii) Pair-wise align two sequences (or two pre-aligned groups of sequences) following the dendrogram starting from the leaves to the root of the dendrogram. Figure 1 shows an example of these steps, where (a) represents the input sequences, (b) represents an alignment of step (i), (c) shows the dendrogram obtained from step (ii), and (d) shows a pair-wise group-alignment in step (iii).
Step (i) can be optimally solve by Dynamic Programming (DP) algorithm. There are two versions of DP: the Smith-Waterman's [9] is used to find the optimally aligned segment between two sequences (local DP), and the Needleman-Wunsch's [10] is used to find the global optimal overall sequence pair-wise alignment (global DP). The two algorithms are very similar and will be described in more details in the next section. The dynamic programming algorithms take O(n 2 ) time to complete, including the back-tracking steps. Thus, with To generate a dendrogram from the distances between the sequences (or the scores generated from step (i)), either UPGMA [11] or Neighbor Joining (NJ) [12] hierarchical clustering is used. These algorithms yield O(m 3 ) run-time complexity.
In the worst case, step (iii) performs (m -1) pair-wise alignments in-order following the dendrogram hierarchy. Similar to step (i), dynamic programming for pair-wise alignment is used, however, each of these pair-wise group alignment yields an order of O(n 4 ) via dynamic programming (O(n 2 )) and sum-of-pair scoring function [13](O(n 2 )). This scoring function is required to evaluate every all possible residue matchings of the sequences. As a result, the run-time complexity of step (iii) is O(m × n 4 ) ≈ O(n 5 ), which is the overall run-time complexity of progressive multiple sequence alignment algorithm.
Optimal pair-wise sequence alignment by dynamic programming Given two sequences x and y each contains up to n residue symbols. The optimal alignment of these sequences can be found by calculating an (n + 1) × (n + 1) dynamic programming (DP) matrix containing all possible pair-wise matching scores of residue symbols in the sequences. Initially, the first row and column of the matrix cells are set to 0, i.e. c 0,j = 0, The recursive formula to compute the DP matrix for the Longest Common Subsequence (LCS) as seen in [14] is: Similarly, the Needleman-Wunsch's algorithm [10] uses the following formula to complete the DP matrix: where s(x i , y j ) is the pair-wise symbol matching score of the two symbols x i and y j from sequences x and y, respectively; and g is the gap cost for extending a sequence by inserting a gap, i.e. gap insertion/deletion (indel).
Smith and Waterman [9] modified the above formula as: Figure 1 A progressive multiple sequence alignment. An example of progressive Multiple Sequence Alignment. (a) represents three input sequences (S1, S2, S3); (b) shows the pair-wise dynamic programming alignment of two sequences; (c) shows the order of the sequences to be aligned, where the leaves on right hand-side are the input sequences, the internal nodes represent the theoretical ancestors from which the sequences are derived, and the characters on the tree branches represent the missing/mutated residues; and (d) shows the pair-wise dynamic programming of two pre-aligned groups of sequences.
The alignment can be obtained from the DP matrix by starting from cell c n, n , (or the cell containing the max value in the matrix as in the Smith-Waterman's algorithm), and tracking back to the top of the matrix, i.e. cell c 0,0 , by following neighboring cells with the largest value.
The two most notable parallel versions of dynamic programming algorithm are proposed by Huang [25] and Huang et al. and Aluru [15,26]. Huang's algorithm exploits the independency between the cells on the antidiagonals of the DP matrix, where they can be calculated simultaneously. There are 2n + 1 anti-diagonals on a matrix of size (n + 1 × n+1). Thus, this parallel DP algorithm takes O(n) processing units and completes in O(n) time.
Independently, Huang et al. [15] and Aluru et al. [26] propose similar algorithms to partition the DP matrix column-wise and assign each partition to a processor. Next, all processors are synchronized to calculate their partitions one row at a time. For this algorithm to perform properly, each processor must hold a copy of the sequence that mapped to the rows of the matrix. Since these calculations are performed row-wise, the values from cells c i-1, j-1 and c i-1, j are available before the calculation of cell c i,j . The value of c i, j-1 can be obtained by performing prefix-sum across all cells in row i th . Thus, with n processors, the computation time of each row is dominated by the prefix-sum calculations, which is O(logn) time on PRAM models. Therefore, the DP matrix can be completed in O(nlogn) time using O(n) processors. Recently, Sarkar, et al. [19] implement both of these parallel DP algorithms [25,26] on a Networkon-Chip computing platform [27].
In addition, the construction of a dendrogram can be parallelized as in [18] using n Graphics Processing Units (GPUs) and completing in O(n 3 ) time.
Furthermore, there are attempts to parallelize the progressive alignment step [step (iii)] as in [28] and [29]. In [28], the independent pre-aligned pairs along the dendrogram are aligned simultaneously. This technique gains some speed-up, however, the time complexity of the algorithm remains unchanged since all the pair-wise alignments eventually must be merged together. Another attempt is seen in [29], where Huang's algorithm [25] is used. When an anti-diagonal of a DP alignment matrix in lower tree level in step (iii) is completed, it is distributed immediately to other processors for computing the pair-wise alignment of a higher tree level. This technique can lead to an incorrect result since the actual pair-wise alignment of the lower branch is still uncertain.
Overall, the major speedups achieved from these implementations come from two parallel tasks: performing m(m−1) 2 initial pair-wise alignments in step (i) simultaneously and calculating the dynamic programming matrix anti-diagonally (or in blocks). These tasks potentially can lower the run-time complexity of step (i) from O(m 2 n 2 ) to O(n) and step (iii) from O(mn 4 The overall run-time complexity of the original progressive multiple sequence alignment algorithm is still dominated by step (iii) with an order of O(m 3 n) regardless of how many processing units are used. The bottle-neck is the pair-wise group alignments must be done in order dictated by the dendrogram (O(m)), and each alignment requires all the column pair-wise scores be calculated (O(m 2 )). To address these issues, we design our parallel progressive multiple sequence alignment on a reconfigurable mesh (r-mesh) computing model similar to the ones used in [16,23,24]. Following is the detailed description of the rmesh model.

Reconfigurable-mesh computing models -(r-mesh)
A Reconfigurable mesh (r-mesh) computing, first proposed by Miller et al [30], is a two-dimensional grid of processing units (PUs), in which each processing unit contains 4 ports: North, South, East, and West (N, S, E, W). These ports can be fused or defused in any order to connect one node of the grid to its neighboring nodes. These configurations are shown in Figure 2. Each processing unit has its own local memory, can perform simple arithmetic operations, and can configure its ports in O(1) time.
There are many reconfigurable computing models such as Linear r-mesh (Lr-mesh), Processor Array with Reconfigurable Bus System (PARBS), Pipedlined r-mesh (Pr-mesh), Field-programmable Gate Array (FPGA), etc. These models are different in many ways from construction to operation run-time complexities. For example, the Pr-mesh model does not function properly with configurations containing cycles, while many other models do. However, there are many algorithms to simulate the operations of one reconfigurable model onto another in constant time as seen in [31][32][33][34][35][36].
In the scope of this study, we will use a simple electrical r-mesh system, where each processing unit, or processing element (PU or PE), contains four ports and can perform basic routing and arithmetic operations. Most reconfiguration computing models utilize the representation of the data to parallelize their operations; and there are various proposed formats [37]. Commonly, data in one format can be converted to another in O(1) time [37]. The unary representation format is used this study, which is denoted as 1UN, and is defined as:

Definition 2
Given an integer x [0, n -1], the unary 1UN presentation of x in n-bit is: x = (b 0 , b 1 , ..., b n-1 ), where b i = 1 for all i ≤ x and b i = 0 for all i > x [37].
For example, a number 3 is represented as 11110000 in 8-bit 1UN representation.
In addition to the 1UN unary format, we will be utilizing the following theorem for some of the operations: Theorem 1: The prefix-sum of n value in range [0, n c ] can be found in O(c) time on an n × n r-mesh [37].
In terms of multiple sequence alignment, the number of bits used in the 1UN notation is correlated to the maximum length of the input sequences. In the next Section, we will describe the designs of r-meshcomponents to use in dynamic programming algorithms.

Parallel pair-wise dynamic programming algorithms
This section begins with the description of several configurations of r-mesh needed to compute various operations in pair-wise dynamic programming algorithm. Following the r-mesh constructions is a new constanttime parallel dynamic programming algorithm for Needleman-Wunsch's, Smith-Waterman, and the Longest Common Subsequence (LCS) algorithms.

R-mesh max switches
One of the operations in the dynamic programming algorithm requires the capability to select the largest value from a set of input numbers. Following is the design of an r-mesh switch that can select the maximum value from an input triplet in the same broadcasting step. For 1-bit data, the switch can be built as in Figure  3(a) using one processing unit, (introduced by Bertossi [14]). The unit configures its ports as {NSEW}, where the North and West are input ports and the others are output ports. When a signal (or 1) comes through the switch, the max bit will propagate through the output ports. Similarly, a switch for finding a maximum value of four input bits can be built using 4 processing units with configurations: {NSW,E}, {NSE,W}, {NE,S,W}, and {NSW,E} as in Figure 3(b). To simulate a 3-input max switch on positive numbers, one of the input ports loads in a zero value. These switches can be combined together to handle the max of three n-bit values as in Figure 4. This n-bit max switch takes 4 × n, (i.e. O(n)) processing units and can handle 3 to 4 n-bit input numbers. All of these max switches allow data to flow directly through them in exactly one broadcasting step. They will be used in the design of our algorithm, described latter.

R-mesh adder/subtractor
Similarly, to get a constant time dynamic programming algorithm we have to be able to perform a series of additions and subtractions in one broadcasting step. Exploiting the properties of 1UN representation, we are presenting an adder/subtractor that can perform an addition or a subtraction of two n-bit numbers in 1UN representation in one broadcasting time. The adder/subtractor is a k × n r-mesh, where k is the smaller  magnitude of the two numbers. The r-mesh adder/subtractor is shown in Figure 5. To perform addition, one addend is fed into the North-bound of the r-mesh, and another addend is left-shifted one bit and fed into the West-bound. The left-bit shifting operation removes the bit that represents a zero, which in turn reduces one row of the r-mesh. Similarly, there is no need to have extra rows in the r-mesh to perform additions on the right trailing zeros of the second addend. Therefore, the number of rows in the r-mesh adder/subtractor can be reduced to k, where k + 1 is the number of 1-bits in the second addend. Each processing unit in the adder/subtractor fuses {NE, SW} if the West input is 1, otherwise, it will fuse {NS, E, W}. The first configuration allows the number to be incremented if there is a 1-bit coming from the West, and the second configuration maps the result directly to the output ports. Figure 5 shows the addition of 3 and 3 represented in n-bit 1UN. In this case, the r-mesh needs only 3 rows to compute the result. Similarly, for subtractions, the minuend is fed into the South bound (bottom) of the r-mesh, the subtrahend is 1-bit left-shifted and fed into the r-mesh from the West bound (left), the East bound (right) is fed with zeros, or no signals. The output is obtained from the North border (top).
This adder/subtractor can only handle numbers in 1UN representation, i.e. positive values. Thus, any operation that yields a negative result will be represented as a pattern of all zeros. When this adder/subtractor is used in a DP algorithm, one of the two inputs is already known. For example, to calculate the value at cell c i, j , three binary arithmetic operations must be performed: c i-1, j-1 + s(x i , y j ), c i-1, j + g, and c i, j-1 + g, where both the gap g and the symbol matching score s(x i , y j ) between any two residue symbols are predefined. Thus, we can store these predefined values to the West ports of the adder/subtractor units and have them configured accordingly before the actual operations.
For biological sequence alignments, symbol matching scores are commonly obtained from substitution matrices such as PAM [3], BLOSUM [4], or similar matrices, and gap cost is a small constant in the same range of the values in these matrices. These values are one or two digits. Thus, k is very likely is a 2-digit constant or smaller. Therefore, the size of the adder/subtractor unit is bounded by O(n), in this scenario.

Constant-time dynamic programming on r-mesh
The dynamic programming techniques used in the Longest Common Subsequence (LCS), Smith-Waterman's and Needle-Wunsch's algorithms are very similar. Thus, a DP r-mesh designed to solve one problem can be modified to solve another problem with minimal configuration. We are presenting the solution for the latter cases first, and then show a simple modification of the solution to solve the first case.

Smith-Waterman's and Needle-Wunsch's algorithms
Although the number representation can be converted from one format to another in constant time [37], the DP r-mesh run-time grows proportionally with the number of operations being done. These operations could be as many as O(n 2 ). To eliminate this format conversion all the possible symbol matching scores, or scoring matrix, (4 × 4 for RNA/DNA sequences and 20 × 20 for protein sequences) are pre-scaled up to positive values. Thus, an alignment of any pair of residue symbols will yield a positive score; and gap matching (or insert/delete) is the only operation that can reduce the alignment score in preceding cells. Nevertheless, if the value in cell c i-1, j (or c i,j-1 ) is smaller than the magnitude of the gap cost (|g|), a gap penalized operation will produce a bit pattern of all zeros (an indication of an underflow or negative value). This value will not appear in cell c i,j since the addition of the positive value in cell c i-1, j-1 and the positive symbol matching score s(x i , y i ) is always greater than or equal to zero.
In general, we do not have to perform this scale-up operation for DNA since DNA/RNA scoring schemes that generally use only two values: a positive integer value for match and the same cost for both mismatch and gap.
Unlike DNA, scoring protein residue alignment is often based on scoring scoring/substitution/mutation matrices such as that in [3,4]. These matrices are logodd values of the probabilities of residues being mutated (or substituted) into other residues. The difference between the matrices are the way the probabilities being derived. The smaller the probability, the less likely a mutation happens. Thus, the smallest alignment value between any two residues, including the gap is at least zero. To avoid the complication of small positive fractional numbers in calculations, log-odd is applied on these probabilities. The log-odd score or substitution score in [3] is calculated as s(i, j) = 1 λ log Q ij P i P j , where s(i, j) is the substitution score between residues i and j, l is a positive scaling factor, Q ij is the frequency or the percentage of residue i correspond to residue j in an accurate alignment, and P i and P j are background probabilities which residues i and j occur. These probabilities and the log-odd function to generate the matrices are publicly available via The National Center for Biotechnology Information's web-site (http://www. ncbi.nlm.nih.gov) along with the substitution matrices themselves. With any given gap cost, the probability of a residue aligned with a gap can be calculated proportionally from a given gap cost and other values from the un-scaled scoring matrices by taking anti-log of the log- odd values or score matrix. Thus, when a positive number b is added to the scores in these scoring matrices, it is equivalent to multiply the original probabilities by a b , where a is the log-based used in the log-odd function.
A simple mechanism to obtain a scaled-up version of a scoring matrix is: (a) taking the antilog of the scoring matrix and g, where g is the gap costs, i.e. the equivalent log-odd of a gap matching probability; (b) multiplying these antilog values by b factor such that their minimum log-odd value should be greater than or equal to zero; (c) performing log-odd operation on these scaledup values.
When these scaled-up scoring matrices are used, the Smith-Waterman's algorithm must be modified.
Instead of setting sub-alignment scores to zeros when they become negative, these scores are set to b when they fall below the scaled-up factor (b).
Using scaled-up scoring matrices will eliminate the need for signed number representation in our following algorithm designs. However, if there is a need to obtain the alignment score based on the original scoring matrices, the score can be calculated as follows: (i) load the original score matrix and gap cost to each cell on an r-mesh as similar to the one described in Section; (ii) configure cells on the diagonal path to use their corresponding matching score from the matrix and other cells representing gap insertions or deletions to use gap cost; (iii) calculate the prefix-sum of all the cells on the path representing the alignment using Theorem 1.
Having the adder/subtrator units and the switches ready, the dynamic programming r-mesh, (DP r-mesh), can be constructed with each cell c i,j in the DP matrix containing 3 adder/subtractor units and a 3-input max switch allowing it to propagate the max value of cells c i-1, j-1 , c i-1, j and c i, j-1 to cell c i, j in the same broadcasting step. Figure 6 shows the dynamic programming r-mesh construction. The adder/subtrator units are represented as "+" or "-" corresponding to their functions.
A 1 × n adder/subtractor unit can perform increments and decrements in the range of [-1,0,1]. As a result, a DP r-mesh can be built with 1-bit input components to handle all pair-wise alignments using constant scoring schemes that can be converted to [-1,0,1] range. For instance, the scoring scheme for the longest common subsequence rewards 1 for a match and zero for mismatch and gap extension.
To align two sequences, c i, j loads or computes its symbol matching score for the symbol pair at row i column j, initially. The next step is to configure all the adder/subtractor units based on the loaded values and the gap cost g. Finally, a signal is broadcasted from c 0,0 to its neighboring cells c 0,1 , c 1,0 , and c 1,1 to activate the DP algorithm on the r-mesh. The values coming from cells c i-1, j and c i, j-1 are subtracted with the gap costs. The value coming from c i-1, j-1 is added with the initial symbol matching score in c i, j . These values will flow through the DP r-mesh in one broadcasting step, and cell c n, n will receive the correct value of the alignment.
In term of time complexity, this dynamic programming r-mesh takes a constant time to initialize the DP r-mesh and one broadcasting time to compute the alignment. Thus, its run-time complexity is O(1). Each cell uses 10n processing units (4n for the 1-bit max switch and 2n for each of the three adder/subtrator units). These processing units are bounded by O(n). Therefore, the n × n dynamic programming r-mesh uses O(n 3 ) processing units.
To handle all other scoring schemes, k × n adder/subtractor r-meshes and n × n max switches must be used. In addition, to avoid overflow (or underflow) all predefined pair-wise symbol matching scores may have to be scaled up (or down) so that the possible smallest (or largest) number can fit in the 1UN representation. With this configuration, the dynamic programming r-mesh takes O(n 4 ) processing units.

Longest common subsequence (LCS)
The complication of signed numbers does not exist in the longest common subsequence problem. The arithmetic operation in LCS is a simple addition of 1 if there is a match. The same dynamic programming r-mesh as seen in Figure 6 can be used, where the two subtractors units are removed or the gap cost is set to zero (g = 0).
To find the longest common subsequence between two sequences x and y, each max switch in the DP rmesh is configured as in Figure 7. The value from cell c i-1, j-1 is fed into the North-West processing unit, and the other values are fed into the North-East unit. Then, c i, j loads in its symbols and fuses the South-East processing unit (in bold) as NS,E,W if the symbols at row i and column j are different; otherwise, it loads 1 into the adder unit and fuses N,E,SW. These configurations allow either the value from cell c i-1, j-1 or the max value of cells c i-1, j and c i, j-1 to pass through. These are the only changes for the DP r-mesh to solve the LCS problems.
This modified constant-time DP r-mesh used O(n 3 ) processing units. However, this is an order of reduction comparing the current best constant parallel DP algorithm that uses an r-mesh of size O(n 2 ) × O(n 2 ) [14] to solve the same problem.

Affine gap cost
Affine gap cost (or penalty) is a technique where the opening gap has different cost from an extending gap [38]. This technique discourages multiple and disjoined gap insertion blocks unless their inclusion greatly improves the pair-wise alignment score. The gap cost is calculated as p = o + g(l -1), where o is the opening gap cost, g is the extending gap cost, and l is the length of the gap block. Traditionally, Gotoh use three matrices to track these values; however, it is not intercessory in the reconfigurable mesh computing model since each cell in the matrix is a processing node with local memory.
To handle affine gap cost, we need to extend the representation of the number by 1 bit (right most bit). This bit indicates whether a value coming from c i-1, j or c i, j-1 to c i, j is an opening gap or not. If the incoming value has been gap-penalized, its right most bit is 1, and it will not be charged with an opening gap again; otherwise, an opening gap will be applied. The original "-" units must be modified to accommodate affine gaps. Figure 8 shows the modification of the "-" unit. The output from the original "-" unit is piped into an n × n + 1 r-mesh on/off switch (described in Section ), an adder/subtractor, and a max switch. When a number flows through the "-" unit, an extending gap is applied. If the incoming value has not been charged with gap to begin with, its right most bit (i.e. selector bit denoted as "s") remains zero, which keeps the switch in off position. Therefore, the value with extra charge on the adder/subtractor is allowed to flow through; otherwise, the switch will be on, and the larger value will be selected by the max switch. A value that is not from diagonal cells must have its selector bit set to 1 (right most bit) after a gap cost is applied to prevent multiple charges of an opening gap.
The modification of the dynamic programming r-mesh to handle affine gap cost requires additional 2 adder/ subtractor units, 2 on/off switches, and one 2-input max switch. Asymptotically, the amount of processing units used is still bounded by O(n 4 ) and the run-time complexity remains O(1).

R-mesh on/off switches
To handle affine gap cost in dynamic programming, we need a switch that can select, i.e. turns on or off, the output ports of a data flow. The on/off r-mesh switch can be configured as in Figure 9, where the input value is mapped into the North-bound (top). The right most bit of the input is served as a selector bit. The rmesh size is n × n+1, where column i fuses with row n -i to form an L-shape path that allows the input data from the Northbound to flow to the output port on the Eastbound. The processors on the last column will fuse the East-West ports allowing the input to flow through only if they receive a value of 1 from the input (Northern ports). Since the selector bit travels a shorter distance than all the other input bits, it will arrive in time to activate the opening or closing of the output ports.
This r-mesh configuration uses (n × n + 1), i.e., O(n 2 ), processing units to turn off the flow of an n-bit input in a broadcasting time.

Dynamic programming back-tracking on r-mesh
The pair-wise alignment is obtained by following the path leading to the overall optimal alignment score, or the end of the alignment. In the case of the Needleman-Wunsch's algorithm, cell c n, n holds this value; and in the case of the Smith-Waterman's algorithm, cell c i, j with the maximum alignment score is the end point. The cell with the largest value can be located in O(1) time on a 3-dimension n × n × n r-mesh through these steps: 1. Initially, the DP matrix with calculated values are stored in the first slice of the r-mesh cube, i.e. in cells c i, j,0 , 0 <i, j ≤ n.
2. c i, j,0 sends its value to c i, j, i , 0 ≤ i, j ≤ n, to propagate each column of the matrix to the 2D r-meshes on the third dimension.
3. c i, j, i sends its value to c 0, j, k , i.e. to move the solution values to the first row of each 2D r-mesh slice.
4. Each 2D r-mesh slice finds its max value c 0, r, k where r is the column of the max value in slice k 5. c 0, r, k sends r to c k,0,0 , i.e. each 2D r-mesh slice sends its max value column number m to the first 2D rmesh slice. This value is the column index of the max value on row k in the first slice.
6. The first 2D r-mesh slice, c i, j,0 , finds the max value of n DP r-mesh cells whose row index is i and column index is c i0,0 (i.e. value r received from the previous step). The row and column indices of the max value found in this step is the location of the max value in the original DP r-mesh.
These above steps rely on the capability to find the max value from n given numbers on an n × n r-mesh. This operation can be done in O(1) time as follows: 1. initially, the values are stored in the first row of the r-mesh.
3. c i, i broadcasts its value, namely a i , to c i, j (row-wise broadcasting). Figure 8 A configuration for selecting a min value. A configuration to select one of the two inputs in 1UN notation using the right most bit as a selector s. When s = 1 the switch is turned on to allow the data to flow through and get selected by the max switch. When the selector s = 0, the on/off switch produces zeros and the other data flow will be selected. ε = o -g, o ≥ g, is the difference between opening gap cost o and extending gap cost g. To trace back the path leading to the optimal alignment, we start with the end point cell c e, d found above and following these steps: 1. c i, j , (i ≤ e, i ≤ d), sends its value to c i, j+1 , c i+ 1, j , c i +1, j+i . Thus, each cell can receive up to three values coming from its Noth, West, and Northwest borders. 2. c i, j finds the max of the inputs and fuses its port to the neighbor cell that sent the max value in the previous step. If there are more than one port to be fused, (this happens when there are multiple optimal alignments), c i, j randomly selects one.
3. c e, d sends a signal to its fused port. The optimal pair-wise alignment is the ordered list of cells where this signal travels through.
Each operation in the back-tracking process takes O(1) time to complete, and there are no iterative operations. Therefore, the back-tracking steps requires n 3 processing units and takes O(1) time.

Progressive multiple sequence alignment on r-mesh
In this section, we start by describing a parallel algorithm to generate a dendrogram, or guiding tree, representing the order in which the input sequences should be aligned. Then we will show a reworked version of sum-of-pair scoring method that can be performed in constant time on a 2D r-mesh. Finally, we will describe our parallel progressive multiple sequence alignment algorithm on r-mesh along with its complexity analysis.

Hierarchical clustering on r-mesh
The parallel neighbor-joining (NJ) [12] clustering method on r-mesh is described here; other hierarchical clustering mechanisms can be done in a similar fashion. The neighbor-joining takes a distance matrix between all the pairs of sequences and represents it as a star-like connected tree, where each sequence is an external node (leaf) on the tree. NJ then finds the shortest distance pair of nodes and replaces it with a new node. This process is repeated until all the nodes are merged.  Figure 9(a) shows the r-mesh configuration on a selector bit of 1 (s = 1) and Figure 9(b) shows the r-mesh configuration on a selector bit of 0 (s = 0).
Followings are the actual steps to build the dendrogram: 1. Initially, all the pair-wise distances are given in form of a matrix D of size m × m, where m is the number of nodes (or input sequences).
2. Calculate the average distance from node i to all the other nodes by 3. The pair of nodes with the shortest distance (i, j) is a pair that gives minimal value of M ij , where M ij = D ijr ir j . 4. A new node u is created for shortest pair (i,j), and the distances from u to i and j are: 5. The distance matrix D is updated with the new node u to replace the shortest distance pair (i,j), and the distances from all the other nodes to u is calculated as These steps are repeated for m -1 iterations to reduce distance matrix D to one pair of nodes. The last pair does not have to be merged, unless the actual location of the root node is needed.
Step 1 and 4 are constant time operations on an m × m r-mesh, where each processing unit stores a corresponding value from the distance matrix. Since the progressive multiple sequence alignment algorithm only uses the dendrogram as a guiding tree to select the closest pair of sequences (or two groups of sequences), the actual distance values between the nodes on the final dendrogram mostly are insignificant. Consequently, the values in distance matrix D can be scaled down without changing the order of the nodes in the dendrogram. In addition, if these values are not to be preserved, the calculations in step 4 can be skipped.
Before proceeding to step 2, we should reexamine some facts. First, the maximum alignment score from all the pair-wise DP sequence alignments are bounded by b 2 , where b is the max pair-wise score between any two residue symbols. An alignment score of b 2 occurs only if we align a sequence of these symbols to itself. b+1 ≤ n is the number of bits being used to represent this value in 1UN. Similarly, the maximum value in distance matrix D generated from these alignment scores are also bounded by b 2 . Thus, the sum of n of these distance values are bounded by b 4 . These facts allow us to calculate the sums in step 2 in O(c) time using an n × n rmesh as in Theorem 1. In this case, c is constant, (c = 4). There are n summations to calculate, so the entire calculation requires n such r-meshes, or n 3 processing units, to complete in O(1) time.
In step 3, each processing unit computes value M ij locally. The max value can be found using the same technique described in Section in constant time.
Similarly, step 5 is performed locally by the processing units in the r-mesh in O(1) time. Since this procedure terminates after m -1 iterations, the overall run-time complexity to build a dendrogram, (or guiding tree), for any given pair-wise distance matrix of size m × m is O (m) using O(m3) processing units.

Constant run-time sum-of-pair scoring method
The third step [step (iii)] of the progressive MSA algorithm is following the dendrogram, built in the earlier step, to perform pair-wise dynamic programming alignment on two pre-aligned groups of sequences. The dynamic programming alignment algorithm in this step is exactly the same as the one in step (i); however, quantifying a match between two columns of residues are no longer a simple constant look-up, unless the hierarchical expected probability (HEP) matching scoring scheme is used [39]. The most popular quantifying method is the sum-of-pair (SP) method [40], or its variations as seen in [5][6][7]41]. This quantification is the sum of all pairwise matching scores between the residue symbols, where each paired-score is obtained either from a substitution matrix or from any scoring scheme discussed earlier. The alignment at the root of the tree gets n residues for every pair of columns to be quantified. Thus, there are m(m−1) 2 lookups per column quantification, i.e. m(m−1) 2 lookups or each DP matrix cell calculation. The sum-of-pair is formally defined as: where f is a column from one pre-aligned group of sequences and g is a column from the other pre-aligned group of sequences. f i and g j are residue symbols from columns f and g, respectively, and s(f i , g j ) is the matching score between these two symbols f i and g j . For example, to calculate the sum-of-pair of the following two columns f and g: . Thus, if we combine the two column symbols with their number of occurrences, the sum-of-pair method can be transformed into a counting problem and can be defined as: where f, g are the two columns, T is the number of different residue symbols (T = 4 for DNA/RNA and T = 20 for proteins), s(i,j) is the pair-wise matching score, or substitution score, between two residue symbols i and j, and n i and n j are the total count of symbols/types i and j (i.e. the occurrences of residue symbols/types i and j), respectively. Since residues from both column f and g are merged, there is no distinction in which column the residue are from. Since T is constant, the summations in Equation remain constant, regardless how many sequences are involved.
Thus, the sum-of-pair score of the two columns given above will be: This scoring function can be implemented on an array of m processing units as follows. First, map each residue symbol into a numeric value from 1 to T. Next, m residues from any two aligning columns are assigned to m processing units. Any processing unit holding a residue sends a 1 to processing unit p k , where k is the number represents the residue symbol it is holding. p k sums the 1's it receives. The sum-of-pair score can be computed between the pairs of processing units containing a sum larger than 0 calculated from previous steps. All of these steps are carried out in constant time. There are n 2 possible pair-wise column arrangements of two pre-aligned groups of sequences of max length n. Thus, the sum-ofpair column pair-wise matching scores for two prealigned groups of sequences can be done in O(1) using m × n 2 processing units.

Parallel progressive MSA algorithm and its complexity analysis
Progressive multiple sequence alignment algorithm is a heuristic alignment technique that builds up a final multiple sequence alignment by combining pair-wise alignments starting with the most similar pair and progressing to the most distant pair. The distance between the sequences can be calculated by dynamic programming algorithms such as Smith-Waterman's or Needle-Wunsch's algorithms (step i). The order in which the sequences should be aligned are represented as a guiding and can be calculated via hierarchical clustering algorithms similar to the one described in Section (step ii). After the guiding tree is completed, the input sequences can be pair-wise aligned following the order specified in the tree (step iii). In the previous Sections, we have described and designed several r-meshes to handle individual operations in the progressive multiple alignment algorithm. Finally, a progressive multiple sequence alignment r-mesh configuration can be constructed. First, the input sequences are pair-wise aligned using the dynamic programming r-mesh described previously in Section. These m(m−1)  described earlier, from all the pair-wise DP alignment scores from step (i). Lastly, [step (iii)], for each pair of pre-aligned groups of sequences along the dendrogram, the sum-of-pair column matching scores are pre-calculated for the DP r-mesh initialization before proceeding with the dynamic programming alignment. There are m -1 branches in the dendrogram leading to m -1 pairwise group alignments to be performed. In terms of complexity, the progressive multiple sequence  Table 1 summarizes the r-mesh size and the run-time complexity of various components in this study, where the components with "broadcast" run-time can finish their operations in one broadcasting time. The "DP" rmesh is designed to handle all the Needleman-wunsch's [10], Smith-Waterman's [9], and Longest Common Subsequence algorithms.

Conclusions
In this study, we have designed various r-mesh components that can run in one broadcasting step, which enabling us to effectively parallelize the progressive multiple sequence alignment paradigm. to align m sequences with max length n, we are able to reduce the algorithm run-time complexity from O(m × n 4 ) to O(m) using O(m × n 4 ) processing units. For a scoring scheme that rewards 1 for a match, 0 for a mismatch, and -1 for a gap insertion/deletion, our algorithm uses only O(m × n 3 ) processing units. Moreover, to our knowledge, we are the first to propose an O(1) run-time dynamic programming pair-wise alignment algorithm using only O (n 3 ) processing units.