Parallel progressive multiple sequence alignment on reconfigurable meshes
© Nguyen et al. licensee BioMed Central Ltd 2011
Published: 23 December 2011
Skip to main content
© Nguyen et al. licensee BioMed Central Ltd 2011
Published: 23 December 2011
One of the most fundamental and challenging tasks in bio-informatics is to identify related sequences and their hidden biological significance. The most popular and proven best practice method to accomplish this task is aligning multiple sequences together. However, multiple sequence alignment is a computing extensive task. In addition, the advancement in DNA/RNA and Protein sequencing techniques has created a vast amount of sequences to be analyzed that exceeding the capability of traditional computing models. Therefore, an effective parallel multiple sequence alignment model capable of resolving these issues is in a great demand.
We design O(1) run-time solutions for both local and global dynamic programming pair-wise alignment algorithms on reconfigurable mesh computing model. To align m sequences with max length n, we combining the parallel pair-wise dynamic programming solutions with newly designed parallel components. We successfully reduce the progressive multiple sequence alignment algorithm's run-time complexity from O(m × n 4) to O(m) using O(m × n 3) processing units for scoring schemes that use three distinct values for match/mismatch/gap-extension. The general solution to multiple sequence alignment algorithm takes O(m × n 4) processing units and completes in O(m) time.
To our knowledge, this is the first time the progressive multiple sequence alignment algorithm is completely parallelized with O(m) run-time. We also provide a new parallel algorithm for the Longest Common Subsequence (LCS) with O(1) run-time using O(n 3) processing units. This is a big improvement over the current best constant-time algorithm that uses O(n 4) processing units.
The advancement of DNA/RNA and protein sequencing and sequence identification has created numerous databases of sequences. One of the most fundamental and challenging tasks in bio-informatics is to identify related sequences and their hidden biological significance. Aligning multiple sequences together provides researchers with one of the best solutions to this task. In general, multiple sequence alignment can be defined as:
Given: m sequences, (s1, s2,..., s m ), over an alphabet ∑, where each sequence contains up to n symbols from ∑; a scoring function h: ; and a gap cost function. Multiple sequence alignment is a technique to transform (s1, s2, ..., s m ) to , where is s i ∪ '-' [gap insertions], that optimizes the matching scores between the residues across all sequence columns . However, multiple sequence alignment is an NP-Complete problem ; therefore, it is often solved by heuristic techniques. Progressive multiple sequence alignment is one of the most popular multiple sequence alignment techniques, in which the pair-wise symbol matching scores can be derived from any scoring scheme or obtained from a substitution scoring matrix such as PAM  or BLOSUM . There are many implementations of progressive multiple sequence alignment as seen in [5–8]. In general, progressive multiple sequence alignment algorithm follows three steps:
(i) Perform all pair-wise alignments of the input sequences.
(ii) Compute a dendrogram indicating the order in which the sequences to be aligned.
(iii) Pair-wise align two sequences (or two pre-aligned groups of sequences) following the dendrogram starting from the leaves to the root of the dendrogram.
Step (i) can be optimally solve by Dynamic Programming (DP) algorithm. There are two versions of DP: the Smith-Waterman's  is used to find the optimally aligned segment between two sequences (local DP), and the Needleman-Wunsch's  is used to find the global optimal overall sequence pair-wise alignment (global DP). The two algorithms are very similar and will be described in more details in the next section. The dynamic programming algorithms take O(n 2) time to complete, including the back-tracking steps. Thus, with unique pairs of the input sequences, the run-time complexity of step (i) is O(m 2 n 2) or O(n 4) if n and m are asymptotically equivalent.
To generate a dendrogram from the distances between the sequences (or the scores generated from step (i)), either UPGMA  or Neighbor Joining (NJ)  hierarchical clustering is used. These algorithms yield O(m 3) run-time complexity.
In the worst case, step (iii) performs (m - 1) pair-wise alignments in-order following the dendrogram hierarchy. Similar to step (i), dynamic programming for pair-wise alignment is used, however, each of these pair-wise group alignment yields an order of O(n 4) via dynamic programming (O(n 2)) and sum-of-pair scoring function (O(n 2)). This scoring function is required to evaluate every all possible residue matchings of the sequences. As a result, the run-time complexity of step (iii) is O(m × n4) ≈ O(n 5), which is the overall run-time complexity of progressive multiple sequence alignment algorithm.
where s(x i , y j ) is the pair-wise symbol matching score of the two symbols x i and y j from sequences x and y, respectively; and g is the gap cost for extending a sequence by inserting a gap, i.e. gap insertion/deletion (indel).
The alignment can be obtained from the DP matrix by starting from cell c n, n, (or the cell containing the max value in the matrix as in the Smith-Waterman's algorithm), and tracking back to the top of the matrix, i.e. cell c0,0, by following neighboring cells with the largest value.
Progressive multiple sequence alignment algorithms are widely parallelized, mostly because they perform independent pair-wise alignments as in step (i). These individual pair-wise alignments can be designated to different processing units for computation as in [15–24]. These implementations are across many computing architectures and platforms. For example,  implemented a DP algorithm on Field-Programmable Gate Array (FPGA). Similarly, Oliver et al. [23, 24] distributed the pair-wise alignment of the first step in the progressive alignment, where all pair-wise alignments are computed, on FPGA. Liu et al.  computed DP via Graphic Processing Units (GPUs) using CUDA platform,  used CRCW PRAM neural-networks,  used Clusters,  used 2D r-mesh,  used Network mesh, or  used 2D Pr-mesh computing model.
The two most notable parallel versions of dynamic programming algorithm are proposed by Huang  and Huang et al. and Aluru [15, 26]. Huang's algorithm exploits the independency between the cells on the anti-diagonals of the DP matrix, where they can be calculated simultaneously. There are 2n + 1 anti-diagonals on a matrix of size (n + 1 × n+1). Thus, this parallel DP algorithm takes O(n) processing units and completes in O(n) time.
Independently, Huang et al.  and Aluru et al.  propose similar algorithms to partition the DP matrix column-wise and assign each partition to a processor. Next, all processors are synchronized to calculate their partitions one row at a time. For this algorithm to perform properly, each processor must hold a copy of the sequence that mapped to the rows of the matrix. Since these calculations are performed row-wise, the values from cells c i-1, j-1 and c i-1, jare available before the calculation of cell c i,j . The value of c i, j-1 can be obtained by performing prefix-sum across all cells in row i th . Thus, with n processors, the computation time of each row is dominated by the prefix-sum calculations, which is O(logn) time on PRAM models. Therefore, the DP matrix can be completed in O(nlogn) time using O(n) processors. Recently, Sarkar, et al.  implement both of these parallel DP algorithms [25, 26] on a Network-on-Chip computing platform .
In addition, the construction of a dendrogram can be parallelized as in  using n Graphics Processing Units (GPUs) and completing in O(n 3) time.
Furthermore, there are attempts to parallelize the progressive alignment step [step (iii)] as in  and . In , the independent pre-aligned pairs along the dendrogram are aligned simultaneously. This technique gains some speed-up, however, the time complexity of the algorithm remains unchanged since all the pair-wise alignments eventually must be merged together. Another attempt is seen in , where Huang's algorithm  is used. When an anti-diagonal of a DP alignment matrix in lower tree level in step (iii) is completed, it is distributed immediately to other processors for computing the pair-wise alignment of a higher tree level. This technique can lead to an incorrect result since the actual pair-wise alignment of the lower branch is still uncertain.
Overall, the major speedups achieved from these implementations come from two parallel tasks: performing initial pair-wise alignments in step (i) simultaneously and calculating the dynamic programming matrix anti-diagonally (or in blocks). These tasks potentially can lower the run-time complexity of step (i) from O(m 2 n 2) to O(n) and step (iii) from O(mn 4) to O(m 3 n) ≈ O(n 4), [or O(m 4) if n < m]. The overall run-time complexity of the original progressive multiple sequence alignment algorithm is still dominated by step (iii) with an order of O(m 3 n) regardless of how many processing units are used. The bottle-neck is the pair-wise group alignments must be done in order dictated by the dendrogram (O(m)), and each alignment requires all the column pair-wise scores be calculated (O(m 2)). To address these issues, we design our parallel progressive multiple sequence alignment on a reconfigurable mesh (r-mesh) computing model similar to the ones used in [16, 23, 24]. Following is the detailed description of the r-mesh model.
There are many reconfigurable computing models such as Linear r-mesh (Lr-mesh), Processor Array with Reconfigurable Bus System (PARBS), Pipedlined r-mesh (Pr-mesh), Field-programmable Gate Array (FPGA), etc. These models are different in many ways from construction to operation run-time complexities. For example, the Pr-mesh model does not function properly with configurations containing cycles, while many other models do. However, there are many algorithms to simulate the operations of one reconfigurable model onto another in constant time as seen in [31–36].
In the scope of this study, we will use a simple electrical r-mesh system, where each processing unit, or processing element (PU or PE), contains four ports and can perform basic routing and arithmetic operations. Most reconfiguration computing models utilize the representation of the data to parallelize their operations; and there are various proposed formats . Commonly, data in one format can be converted to another in O(1) time . The unary representation format is used this study, which is denoted as 1UN, and is defined as:
Given an integer x ∈ [0, n - 1], the unary 1UN presentation of x in n-bit is: x = (b0, b1, ..., b n -1), where b i = 1 for all i ≤ x and b i = 0 for all i > x .
For example, a number 3 is represented as 11110000 in 8-bit 1UN representation.
In addition to the 1UN unary format, we will be utilizing the following theorem for some of the operations:
The prefix-sum of n value in range [0, n c ] can be found in O(c) time on an n × n r-mesh .
In terms of multiple sequence alignment, the number of bits used in the 1UN notation is correlated to the maximum length of the input sequences. In the next Section, we will describe the designs of r-meshcomponents to use in dynamic programming algorithms.
This section begins with the description of several configurations of r-mesh needed to compute various operations in pair-wise dynamic programming algorithm. Following the r-mesh constructions is a new constant-time parallel dynamic programming algorithm for Needleman-Wunsch's, Smith-Waterman, and the Longest Common Subsequence (LCS) algorithms.
This adder/subtractor can only handle numbers in 1UN representation, i.e. positive values. Thus, any operation that yields a negative result will be represented as a pattern of all zeros. When this adder/subtractor is used in a DP algorithm, one of the two inputs is already known. For example, to calculate the value at cell c i, j, three binary arithmetic operations must be performed: c i-1, j-1 + s(x i , y j ), c i-1, j+ g, and c i, j-1 + g, where both the gap g and the symbol matching score s(x i , y j ) between any two residue symbols are predefined. Thus, we can store these predefined values to the West ports of the adder/subtractor units and have them configured accordingly before the actual operations.
For biological sequence alignments, symbol matching scores are commonly obtained from substitution matrices such as PAM , BLOSUM , or similar matrices, and gap cost is a small constant in the same range of the values in these matrices. These values are one or two digits. Thus, k is very likely is a 2-digit constant or smaller. Therefore, the size of the adder/subtractor unit is bounded by O(n), in this scenario.
The dynamic programming techniques used in the Longest Common Subsequence (LCS), Smith-Waterman's and Needle-Wunsch's algorithms are very similar. Thus, a DP r-mesh designed to solve one problem can be modified to solve another problem with minimal configuration. We are presenting the solution for the latter cases first, and then show a simple modification of the solution to solve the first case.
Although the number representation can be converted from one format to another in constant time , the DP r-mesh run-time grows proportionally with the number of operations being done. These operations could be as many as O(n 2). To eliminate this format conversion all the possible symbol matching scores, or scoring matrix, (4 × 4 for RNA/DNA sequences and 20 × 20 for protein sequences) are pre-scaled up to positive values. Thus, an alignment of any pair of residue symbols will yield a positive score; and gap matching (or insert/delete) is the only operation that can reduce the alignment score in preceding cells. Nevertheless, if the value in cell c i-1, j(or c i,j-1) is smaller than the magnitude of the gap cost (|g|), a gap penalized operation will produce a bit pattern of all zeros (an indication of an underflow or negative value). This value will not appear in cell c i,j since the addition of the positive value in cell c i-1, j-1 and the positive symbol matching score s(x i , y i ) is always greater than or equal to zero.
In general, we do not have to perform this scale-up operation for DNA since DNA/RNA scoring schemes that generally use only two values: a positive integer value for match and the same cost for both mismatch and gap.
Unlike DNA, scoring protein residue alignment is often based on scoring scoring/substitution/mutation matrices such as that in [3, 4]. These matrices are log-odd values of the probabilities of residues being mutated (or substituted) into other residues. The difference between the matrices are the way the probabilities being derived. The smaller the probability, the less likely a mutation happens. Thus, the smallest alignment value between any two residues, including the gap is at least zero. To avoid the complication of small positive fractional numbers in calculations, log-odd is applied on these probabilities. The log-odd score or substitution score in  is calculated as , where s(i, j) is the substitution score between residues i and j, λ is a positive scaling factor, Q ij is the frequency or the percentage of residue i correspond to residue j in an accurate alignment, and P i and P j are background probabilities which residues i and j occur. These probabilities and the log-odd function to generate the matrices are publicly available via The National Center for Biotechnology Information's web-site (http://www.ncbi.nlm.nih.gov) along with the substitution matrices themselves. With any given gap cost, the probability of a residue aligned with a gap can be calculated proportionally from a given gap cost and other values from the un-scaled scoring matrices by taking anti-log of the log-odd values or score matrix. Thus, when a positive number β is added to the scores in these scoring matrices, it is equivalent to multiply the original probabilities by a β , where a is the log-based used in the log-odd function.
A simple mechanism to obtain a scaled-up version of a scoring matrix is: (a) taking the antilog of the scoring matrix and g, where g is the gap costs, i.e. the equivalent log-odd of a gap matching probability; (b) multiplying these antilog values by β factor such that their minimum log-odd value should be greater than or equal to zero; (c) performing log-odd operation on these scaled-up values.
When these scaled-up scoring matrices are used, the Smith-Waterman's algorithm must be modified.
Instead of setting sub-alignment scores to zeros when they become negative, these scores are set to β when they fall below the scaled-up factor (β).
Using scaled-up scoring matrices will eliminate the need for signed number representation in our following algorithm designs. However, if there is a need to obtain the alignment score based on the original scoring matrices, the score can be calculated as follows: (i) load the original score matrix and gap cost to each cell on an r-mesh as similar to the one described in Section; (ii) configure cells on the diagonal path to use their corresponding matching score from the matrix and other cells representing gap insertions or deletions to use gap cost; (iii) calculate the prefix-sum of all the cells on the path representing the alignment using Theorem 1.
A 1 × n adder/subtractor unit can perform increments and decrements in the range of [-1,0,1]. As a result, a DP r-mesh can be built with 1-bit input components to handle all pair-wise alignments using constant scoring schemes that can be converted to [-1,0,1] range. For instance, the scoring scheme for the longest common subsequence rewards 1 for a match and zero for mismatch and gap extension.
To align two sequences, c i, jloads or computes its symbol matching score for the symbol pair at row i column j, initially. The next step is to configure all the adder/subtractor units based on the loaded values and the gap cost g. Finally, a signal is broadcasted from c 0,0 to its neighboring cells c 0,1, c 1,0, and c 1,1 to activate the DP algorithm on the r-mesh. The values coming from cells c i-1, jand c i, j-1 are subtracted with the gap costs. The value coming from c i-1, j-1 is added with the initial symbol matching score in c i, j. These values will flow through the DP r-mesh in one broadcasting step, and cell c n, nwill receive the correct value of the alignment.
In term of time complexity, this dynamic programming r-mesh takes a constant time to initialize the DP r-mesh and one broadcasting time to compute the alignment. Thus, its run-time complexity is O(1). Each cell uses 10n processing units (4n for the 1-bit max switch and 2n for each of the three adder/subtrator units). These processing units are bounded by O(n). Therefore, the n × n dynamic programming r-mesh uses O(n 3) processing units.
To handle all other scoring schemes, k × n adder/subtractor r-meshes and n × n max switches must be used. In addition, to avoid overflow (or underflow) all pre-defined pair-wise symbol matching scores may have to be scaled up (or down) so that the possible smallest (or largest) number can fit in the 1UN representation. With this configuration, the dynamic programming r-mesh takes O(n 4) processing units.
The complication of signed numbers does not exist in the longest common subsequence problem. The arithmetic operation in LCS is a simple addition of 1 if there is a match. The same dynamic programming r-mesh as seen in Figure 6 can be used, where the two subtractors units are removed or the gap cost is set to zero (g = 0).
This modified constant-time DP r-mesh used O(n 3) processing units. However, this is an order of reduction comparing the current best constant parallel DP algorithm that uses an r-mesh of size O(n 2) × O(n 2)  to solve the same problem.
Affine gap cost (or penalty) is a technique where the opening gap has different cost from an extending gap . This technique discourages multiple and disjoined gap insertion blocks unless their inclusion greatly improves the pair-wise alignment score. The gap cost is calculated as p = o + g(l - 1), where o is the opening gap cost, g is the extending gap cost, and l is the length of the gap block. Traditionally, Gotoh use three matrices to track these values; however, it is not intercessory in the reconfigurable mesh computing model since each cell in the matrix is a processing node with local memory.
The modification of the dynamic programming r-mesh to handle affine gap cost requires additional 2 adder/subtractor units, 2 on/off switches, and one 2-input max switch. Asymptotically, the amount of processing units used is still bounded by O(n 4) and the run-time complexity remains O(1).
This r-mesh configuration uses (n × n + 1), i.e., O(n 2), processing units to turn off the flow of an n-bit input in a broadcasting time.
The pair-wise alignment is obtained by following the path leading to the overall optimal alignment score, or the end of the alignment. In the case of the Needleman-Wunsch's algorithm, cell c n, nholds this value; and in the case of the Smith-Waterman's algorithm, cell c i, jwith the maximum alignment score is the end point. The cell with the largest value can be located in O(1) time on a 3-dimension n × n × n r-mesh through these steps:
1. Initially, the DP matrix with calculated values are stored in the first slice of the r-mesh cube, i.e. in cells c i, j,0, 0 < i, j ≤ n.
2. c i, j,0 sends its value to c i, j, i, 0 ≤ i, j ≤ n, to propagate each column of the matrix to the 2D r-meshes on the third dimension.
3. c i, j, isends its value to c 0, j, k , i.e. to move the solution values to the first row of each 2D r-mesh slice.
4. Each 2D r-mesh slice finds its max value c 0, r, k where r is the column of the max value in slice k
5. c 0, r, k sends r to c k,0,0, i.e. each 2D r-mesh slice sends its max value column number m to the first 2D r-mesh slice. This value is the column index of the max value on row k in the first slice.
6. The first 2D r-mesh slice, c i, j,0, finds the max value of n DP r-mesh cells whose row index is i and column index is c i0,0 (i.e. value r received from the previous step). The row and column indices of the max value found in this step is the location of the max value in the original DP r-mesh.
These above steps rely on the capability to find the max value from n given numbers on an n × n r-mesh. This operation can be done in O(1) time as follows:
1. initially, the values are stored in the first row of the r-mesh.
2. c 0, j broadcasts its value, namely a j , to c i, j, (column-wise broadcasting).
3. c i, ibroadcasts its value, namely a i , to c i, j(row-wise broadcasting).
4. c i, jsets a flag bit f(i, j) to 1 if and only if a i > a j ; otherwise sets f(i, j) to 0.
5. c 0, j is holding the max value if f(0, j), f(1, j),..., f(n - 1, j) are 0. This step can be performed in O(1) time by ORing the flag bits in each column.
The location of the max value in the DP r-mesh can be obtained in O(1) time because each step in the process takes O(1) time to complete.
To trace back the path leading to the optimal alignment, we start with the end point cell c e, dfound above and following these steps:
1. c i, j, (i ≤ e, i ≤ d), sends its value to c i, j+1, c i+ 1, j, c i+1, j+i. Thus, each cell can receive up to three values coming from its Noth, West, and Northwest borders.
2. c i, jfinds the max of the inputs and fuses its port to the neighbor cell that sent the max value in the previous step. If there are more than one port to be fused, (this happens when there are multiple optimal alignments), c i, jrandomly selects one.
3. c e, dsends a signal to its fused port. The optimal pair-wise alignment is the ordered list of cells where this signal travels through.
Each operation in the back-tracking process takes O(1) time to complete, and there are no iterative operations. Therefore, the back-tracking steps requires n 3 processing units and takes O(1) time.
In this section, we start by describing a parallel algorithm to generate a dendrogram, or guiding tree, representing the order in which the input sequences should be aligned. Then we will show a reworked version of sum-of-pair scoring method that can be performed in constant time on a 2D r-mesh. Finally, we will describe our parallel progressive multiple sequence alignment algorithm on r-mesh along with its complexity analysis.
The parallel neighbor-joining (NJ)  clustering method on r-mesh is described here; other hierarchical clustering mechanisms can be done in a similar fashion. The neighbor-joining takes a distance matrix between all the pairs of sequences and represents it as a star-like connected tree, where each sequence is an external node (leaf) on the tree. NJ then finds the shortest distance pair of nodes and replaces it with a new node. This process is repeated until all the nodes are merged.
Followings are the actual steps to build the dendrogram:
1. Initially, all the pair-wise distances are given in form of a matrix D of size m × m, where m is the number of nodes (or input sequences).
2. Calculate the average distance from node i to all the other nodes by .
3. The pair of nodes with the shortest distance (i, j) is a pair that gives minimal value of M ij , where M ij = D ij - r i - r j .
4. A new node u is created for shortest pair (i,j), and the distances from u to i and j are: , and d j,u = d ij -d iu .
5. The distance matrix D is updated with the new node u to replace the shortest distance pair (i,j), and the distances from all the other nodes to u is calculated as D vu = D iv + d jv - D ij .
These steps are repeated for m - 1 iterations to reduce distance matrix D to one pair of nodes. The last pair does not have to be merged, unless the actual location of the root node is needed.
Step 1 and 4 are constant time operations on an m × m r-mesh, where each processing unit stores a corresponding value from the distance matrix. Since the progressive multiple sequence alignment algorithm only uses the dendrogram as a guiding tree to select the closest pair of sequences (or two groups of sequences), the actual distance values between the nodes on the final dendrogram mostly are insignificant. Consequently, the values in distance matrix D can be scaled down without changing the order of the nodes in the dendrogram. In addition, if these values are not to be preserved, the calculations in step 4 can be skipped.
Before proceeding to step 2, we should reexamine some facts. First, the maximum alignment score from all the pair-wise DP sequence alignments are bounded by b 2, where b is the max pair-wise score between any two residue symbols. An alignment score of b 2 occurs only if we align a sequence of these symbols to itself. b+1 ≤ n is the number of bits being used to represent this value in 1UN. Similarly, the maximum value in distance matrix D generated from these alignment scores are also bounded by b 2. Thus, the sum of n of these distance values are bounded by b 4. These facts allow us to calculate the sums in step 2 in O(c) time using an n × n r-mesh as in Theorem 1. In this case, c is constant, (c = 4). There are n summations to calculate, so the entire calculation requires n such r-meshes, or n 3 processing units, to complete in O(1) time.
In step 3, each processing unit computes value M ij locally. The max value can be found using the same technique described in Section in constant time.
Similarly, step 5 is performed locally by the processing units in the r-mesh in O(1) time. Since this procedure terminates after m - 1 iterations, the overall run-time complexity to build a dendrogram, (or guiding tree), for any given pair-wise distance matrix of size m × m is O(m) using O(m3) processing units.
where f, g are the two columns, T is the number of different residue symbols (T = 4 for DNA/RNA and T = 20 for proteins), s(i,j) is the pair-wise matching score, or substitution score, between two residue symbols i and j, and n i and n j are the total count of symbols/types i and j (i.e. the occurrences of residue symbols/types i and j), respectively. Since residues from both column f and g are merged, there is no distinction in which column the residue are from. Since T is constant, the summations in Equation remain constant, regardless how many sequences are involved.
This scoring function can be implemented on an array of m processing units as follows. First, map each residue symbol into a numeric value from 1 to T. Next, m residues from any two aligning columns are assigned to m processing units. Any processing unit holding a residue sends a 1 to processing unit p k , where k is the number represents the residue symbol it is holding. p k sums the 1's it receives. The sum-of-pair score can be computed between the pairs of processing units containing a sum larger than 0 calculated from previous steps. All of these steps are carried out in constant time. There are n 2 possible pair-wise column arrangements of two pre-aligned groups of sequences of max length n. Thus, the sum-of-pair column pair-wise matching scores for two pre-aligned groups of sequences can be done in O(1) using m × n 2 processing units.
Progressive multiple sequence alignment algorithm is a heuristic alignment technique that builds up a final multiple sequence alignment by combining pair-wise alignments starting with the most similar pair and progressing to the most distant pair. The distance between the sequences can be calculated by dynamic programming algorithms such as Smith-Waterman's or Needle-Wunsch's algorithms (step i). The order in which the sequences should be aligned are represented as a guiding and can be calculated via hierarchical clustering algorithms similar to the one described in Section (step ii). After the guiding tree is completed, the input sequences can be pair-wise aligned following the order specified in the tree (step iii). In the previous Sections, we have described and designed several r-meshes to handle individual operations in the progressive multiple alignment algorithm. Finally, a progressive multiple sequence alignment r-mesh configuration can be constructed. First, the input sequences are pair-wise aligned using the dynamic programming r-mesh described previously in Section. These pair-wise alignments can be done in O(1) using dynamic programming r-meshes, or in O(m) time using O(m) r-meshes. The latter is preferred since the dendrogram [step (ii)] and the progressive alignment [step (iii)] each takes O(m) time to complete. Then, a dendrogram is built, using the parallel neighbor-joining clustering algorithm described earlier, from all the pair-wise DP alignment scores from step (i). Lastly, [step (iii)], for each pair of pre-aligned groups of sequences along the dendrogram, the sum-of-pair column matching scores are pre-calculated for the DP r-mesh initialization before proceeding with the dynamic programming alignment. There are m - 1 branches in the dendrogram leading to m - 1 pair-wise group alignments to be performed. In terms of complexity, the progressive multiple sequence alignment takes O(m) time using O(n) DP r-meshes to complete all the pair-wise sequence alignments [step (i)] - (or O(1) time using DP r-meshes). Its consequence step, [step (ii)], to build the sequence dendrogram takes O(m) time using O(m 3) processing units. Finally, the progressive step, [step (iii)], takes O(m) time using a DP r-mesh. Therefore, the overall run-time complexity of this parallel progressive multiple sequence alignment is O(m). The number of processing units utilized in this parallel algorithm is bounded by the number of DP r-meshes used and their sizes. The general DP r-mesh uses O(n 4) processing units to handle all scoring schemes with affine gap cost. And step (i) needs m of such DP r-meshes resulting in O(mn 4) ≈ O(n 5) processing units used.
For alignment problems that use constant scoring schemes without affine gap cost, this parallel progressive multiple sequence alignment algorithm only needs O(mn 3) ≈ O(n 4) processing units to complete in O(m) time.
Summary of progressive multiple sequence alignment components
2-input max switch
1 - bit
4-input max switch
1 - bit
2-input max switch
n - bit
4-input max switch
n - bit
n - bit
n ×n +1
k ×n, k ≤ n
2 sequences, max length = n
DP (general scoring)
2 sequences, max length = n
O(kn3), k ≤ n
n × n
n × n × n
m × m
2 pre-aligned groups of m sequences
m × n2
m sequences, max length = n
O(m × n3)
m sequences, max length = n
O(m × n4)
In this study, we have designed various r-mesh components that can run in one broadcasting step, which enabling us to effectively parallelize the progressive multiple sequence alignment paradigm. to align m sequences with max length n, we are able to reduce the algorithm run-time complexity from O(m × n 4) to O(m) using O(m × n 4) processing units. For a scoring scheme that rewards 1 for a match, 0 for a mismatch, and -1 for a gap insertion/deletion, our algorithm uses only O(m × n 3) processing units. Moreover, to our knowledge, we are the first to propose an O(1) run-time dynamic programming pair-wise alignment algorithm using only O(n 3) processing units.
This study is supported by the Molecular Basis of Disease (MBD) at Georgia State University.
This research was also supported in part by CCF-0514750, CCF-0646102, and the National Institutes of Health (NIH) under Grants R01 GM34766-17S1, and P20 GM065762-01A1.
The research of Nong was supported in part by the National Natural Science Foundation of China under Grant 60873056 and the Fundamental Research Funds for the Central Universities of China under Grant 11lgzd04.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.