Sequence alignment by passing messages
- Byung-Jun Yoon^{1}Email author
https://doi.org/10.1186/1471-2164-15-S1-S14
© Yoon; licensee BioMed Central Ltd. 2014
Published: 24 January 2014
Abstract
Background
Sequence alignment has become an indispensable tool in modern molecular biology research, and probabilistic sequence alignment models have been shown to provide an effective framework for building accurate sequence alignment tools. One such example is the pair hidden Markov model (pair-HMM), which has been especially popular in comparative sequence analysis for several reasons, including their effectiveness in modeling and detecting sequence homology, model simplicity, and the existence of efficient algorithms for applying the model to sequence alignment problems. However, despite these advantages, pair-HMMs also have a number of practical limitations that may degrade their alignment performance or render them unsuitable for certain alignment tasks.
Results
In this work, we propose a novel scheme for comparing and aligning biological sequences that can effectively address the shortcomings of the traditional pair-HMMs. The proposed scheme is based on a simple message-passing approach, where messages are exchanged between neighboring symbol pairs that may be potentially aligned in the optimal sequence alignment. The message-passing process yields probabilistic symbol alignment confidence scores, which may be used for predicting the optimal alignment that maximizes the expected number of correctly aligned symbol pairs.
Conclusions
Extensive performance evaluation on protein alignment benchmark datasets shows that the proposed message-passing scheme clearly outperforms the traditional pair-HMM-based approach, in terms of both alignment accuracy and computational efficiency. Furthermore, the proposed scheme is numerically robust and amenable to massive parallelization.
Background
Sequence alignment has become an indispensable tool in modern molecular biology research, as it provides an effective and intuitive way of comparing and analyzing biological sequences. Given a set of biological sequences, the primary objective of sequence alignment is to predict the best overall mapping between the sequences, which accurately aligns the homologous regions that are embedded in them. This provides an effective means for detecting conserved sequence regions with potentially important functional roles. The concept of sequence alignment has had diverse applications in biomedical research [1–7], which include homology search, function and structure prediction of biomolecules, phylogenetic analysis, and detecting sequence motifs, among others.
Typically, sequence alignment is carried out by formulating and solving an optimization problem - either implicitly or explicitly - where the goal is to maximize an objective function that measures the overall quality of the sequence alignment. For example, one simple way of aligning a sequence pair would be to score each potential alignment by assigning a "substitution score" to every aligned symbol pair and penalty scores for gaps and then find the optimal alignment that maximizes the overall score through dynamic programming [1]. In the past, various ad hoc scoring schemes have been proposed to obtain intuitive and biologically meaningful sequence alignment results. As an alternative to heuristic scoring schemes, there have been also research efforts to develop probabilistic models for sequence alignment that can be used to evaluate and compare potential alignments and to estimate the symbol-to-symbol alignment probabilities.
Examples of such probabilistic schemes include the pair hidden Markov models (pair-HMMs) [1] and the partition function based scheme [8]. Given two biological sequences, these methods can be used to estimate the posterior symbol alignment probability for each symbol pair that may be aligned in the final sequence alignment. Based on the estimated probabilities, we can predict the optimal sequence alignment that contains the largest expected number of correctly aligned symbol pairs, rather than an alignment that maximizes an ad hoc score. This is typically referred to as the maximum expected accuracy (MEA) alignment [9–11], and as before, it can be also found through dynamic programming.
Among a number of probabilistic sequence alignment models, pair-HMMs have been especially popular, and they have been widely adopted by many multiple sequence alignment (MSA) algorithms, including ProbCons [9] and PicXAA [10]. Despite the simplicity of the model, pair-HMMs have been shown to be very effective in modeling sequence homology, as reflected in the well-rounded overall performance of various MSA algorithms that utilize the symbol alignment probabilities estimated by pair-HMMs. Furthermore, these probabilities can be estimated in a relatively efficient manner, making the pair-HMMs an attractive choice for various sequence alignment problems. However, pair-HMMs also have a number of shortcomings, which may negatively affect their alignment performance or make them impractical for certain alignment tasks.
In this paper, we propose a novel scheme for comparing and aligning biological sequences that can effectively address the limitations of pair-HMMs. The proposed scheme computes probabilistic symbol alignment confidence scores based on a simple and computationally efficient message-passing approach. As we will demonstrate in this paper, this message-passing scheme has a number of important advantages over the traditional pair-HMMs and it clearly outperforms pair-HMMs in terms of both speed and accuracy on protein alignment benchmark datasets.
Methods
A brief overview of pair hidden Markov models
Limitations of pair-HMMs
For the pair-HMM shown in Figure 1(A), any hidden state sequence $s={s}_{1}{s}_{2}\cdots {s}_{9}$ such that s_{1} = s_{2} = s_{3} = M and ${s}_{4}{s}_{5}\cdots {s}_{9}$ is a permutation of I_{x} I_{x} I_{x} I_{y} I_{y} I_{y} would lead to the sequence alignment shown in (1). When using this pair-HMM for predicting the optimal alignment of a sequence pair with the largest probability, this ambiguity may lead to performance degradation as these potential state sequences compete against each other. For this reason, it is generally more desirable to estimate the symbol alignment probabilities via the pair-HMM by considering all potential alignments and state sequences and use the estimated probabilities to find the MEA alignment that is expected to have the maximum number of correctly aligned symbols [9–11]. However, the aforementioned ambiguity also negatively affects the quality of the estimated symbol alignment probabilities, which is especially noticeable for sequence pairs with low percentage identity. In some cases, the alternative pair-HMM shown in Figure 1(B) is used to avoid such ambiguity. This alternative pair-HMM blocks transitions between the insertion states I_{x} and I_{y}, thereby prohibiting the model from inserting unaligned symbols to both sequences. For example, the alignment shown in (1) would not be allowed based on this alternative pair-HMM. However, due to this restriction, the pair-HMM in Figure 1(B) has a relatively stronger tendency to align unrelated sequence regions by treating them as mutations. This may again negatively affect the quality of the symbol alignment probabilities estimated based on the pair-HMM.
Another potential drawback of pair-HMMs is that the associated algorithms (i.e., the Viterbi, forward, and backward algorithms) can become numerically unstable for long sequences. Application of pair-HMMs to biological sequence analysis involves computing extremely small probabilities, which decrease exponentially with the sequence length. For example, based on the pair-HMM that was used in [9], the observation probability (i.e., the probability that the HMM may generate a given sequence pair) of a protein pair is typically in the order of 10^{-230} for proteins of length 80, 10^{-280} for proteins of length 100, and 10^{-320} for proteins of length 120. As a result, pair-HMM algorithms are prone to underflow errors, unless they are carefully implemented to keep them numerically robust. So far, a number of schemes have been proposed to address this issue, such as using log transformations of the probabilities or normalizing the probabilities to keep them within a reasonable numerical range, and have been shown to work well for relatively long sequences [1]. However, log transformations can make the forward and backward algorithms considerably slower, and the normalization approach can still lead to underflow errors as the sequences get longer.
One further disadvantage of pair-HMMs is that the algorithms that are used with the model cannot be easily parallelized. Although the Viterbi, forward, and backward algorithms for pair-HMMs are relatively efficient, they are still computationally expensive to be used with very long sequences. Moreover, as the algorithms are not amenable to massive parallelization, this makes the pair-HMMs not suitable for large-scale sequence analysis tasks, such as the whole genome alignment, despite their superior performance compared to other heuristic methods.
A message-passing scheme for estimating symbol alignment confidence scores
Here, we propose a novel method for aligning biological sequences that can effectively address the aforementioned shortcomings of pair-HMMs. The proposed method is based on a message-passing scheme, where messages are iteratively exchanged between neighboring symbol pairs to estimate the level of confidence for potential pairwise symbol alignments. The main underlying motivation is to develop an "analytical" method that can directly estimate the symbol alignment probabilities, without specifically modeling symbol insertions and deletions. This stands in contrast to the pair-HMM approach, which is essentially based on a "generative" sequence model that tries to explicitly model symbol insertions/deletions, in addition to symbol alignments. As discussed before, modeling symbol insertions in pair-HMMs can lead to subtle issues with potentially negative effects, and considering that our ultimate goal lies in finding an accurate sequence alignment through effective estimation of the symbol alignment probabilities, a method that can directly estimate these probabilities without explicitly modeling insertions/deletions would be desirable.
The pseudocode of the proposed message-passing algorithm is as follows:
STEP-1 Initialize c_{ xy } (i, j).
STEP-3 Normalize c_{ xy }(i, j).
STEP-4 If c_{ xy }(i, j) has converged, then terminate the algorithm.
Otherwise, go to STEP-2.
In STEP-1, we first initialize the alignment confidence score c_{ xy }(i, j), where we can simply use random initialization. If a preliminary sequence alignment of x and y is available (e.g., obtained from a simple heuristic method), we can also initialize the score based on this alignment such that c_{ xy }(i, j) = 1 if x_{ i } and y_{ j } are aligned, and c_{ xy }(i, j) = 0 otherwise. Next, in STEP-2, the alignment confidence score c_{ xy }(i, j) of the symbol pair (x_{ i }, y_{ j }) is updated based on the scores of its two neighbors (x_{i-1}, y_{j-1}) and (x_{i+1}, y_{j+1}). Note that the score is set to c_{ xy }(i, j) = 0 if i ∉ {1, ⋯ , L} or j ∉ {1, ⋯ , M}. P (x_{ i }, y_{ j }) is the joint occurrence probability of the symbol pair (x_{ i }, y_{ j }), which is essentially equivalent to the joint emission probability of an aligned symbol pair (x_{ i }, y_{ j }) at the match state M of a pair-HMM. It should be noted that this probability P (x_{ i }, y_{ j }) is not location-dependent and is simply determined by the symbols x_{ i } and y_{ j }. The weight parameter λ ∈ [0, 1] is used to balance the contribution from the neighbors and that from the joint probability of (x_{ i }, y_{ j }) in estimating the alignment confidence score. A large λ gives more weight to the "messages" received from the neighbors in estimating the scores, which tends to penalize gaps more heavily, and it generally leads to longer aligned regions with fewer gaps. On the contrary, a small λ gives more weight to the joint symbol occurrence probability P (x_{ i }, y_{ j }) while giving less weight to the messages received from the neighbors, which tends to be more lenient to gaps. Once the symbol alignment confidence score c_{ xy }(i, j) is updated for all i = 1, ⋯ , L and j = 1, ⋯ , M, we normalize the scores to keep them within a proper numerical range, as shown in STEP-3. For example, a simple way would be to divide the score matrix C = [(c_{ xy }(i, j)] by its matrix norm to normalize the confidence scores. After normalization, the updated scores are compared to the scores in the last iteration, and the algorithm terminates if the specified convergence criterion has been met. Otherwise, the algorithm goes back to STEP-2 and repeats the message-passing process.
Results and Discussion
Dataset and experimental set-up
In order to evaluate the performance of the proposed message-passing scheme, we carried out pairwise sequence alignment experiments based on the BAliBASE 3.0 protein alignment benchmark [16]. BAliBASE is arguably the most widely used benchmark for multiple sequence alignment, and it has been utilized by most multiple sequence alignment algorithms for assessing their performance. The benchmark consists of five reference sets, where Reference 1 consists of two subsets: V1 and V2. Each reference set consists of multiple sequence alignments that satisfy specific criteria, such that different reference sets can be used to test the performance of multiple sequence alignment algorithms under different conditions. For example, each alignment in Reference 2 consists of sequences that share reasonably high identity (> 40%) and "orphan sequences" that share little identity (< 20%) to other sequences in the alignment. Reference sets 4 and 5 are constructed such that every sequence has at least one other sequence in the same alignment whose identity exceeds 20%. Sequences in Reference 4 and Reference 5 may contain large N/C-terminal extensions or internal insertions, respectively. Further details of the BAliBASE 3.0 benchmark can be found in [16].
For every sequence family in BAliBASE 3.0, we performed pairwise sequence alignment for all possible sequence pairs in the given family. The pairwise alignment was performed in the following manner. First, we estimated the probabilistic symbol alignment confidence score using the proposed message-passing scheme. In our experiments, we used three different values of λ (= 0.25, 0.5, and 0.75) to investigate the effect of λ on the overall sequence alignment performance. For the joint symbol occurrence probability P (x_{ i }, y_{ j }), we used the joint emission probability (at state M) of the pair-HMM that was used in [9]. At the end of each iteration, we normalized the alignment confidence score by dividing the confidence score matrix C by the matrix 2-norm: C ← C /|| C ||_{2}. We terminated the message-passing process if ${\sum}_{i}{\sum}_{j}|{c}_{xy}\left(i,j\right)-{\stackrel{\u0303}{c}}_{xy}\left(i,j\right)|\phantom{\rule{2.77695pt}{0ex}}<0.01$, where c_{ xy }(i, j) is the current score and ${\stackrel{\u0303}{c}}_{xy}\left(i,j\right)$ is the score obtained in the previous iteration. Once the scores converged, based on our assumption that c_{ xy } (i, j) ∝ P (x_{ i } ~ y_{ j } | x, y), we used the confidence score c_{ xy } (i, j) to find the MEA alignment through dynamic programming. The predicted alignment was compared to the benchmark alignment in BAliBASE 3.0 to compute the sensitivity $\left(\mathsf{\text{SN}}\right)=\frac{TP}{TP+FN}$ and the positive predictive value $\left(\mathsf{\text{PPV}}\right)=\frac{TP}{TP+FP}$, where TP is the number of correctly aligned symbol pairs, FP is the number of incorrectly aligned pairs, and FN is the number of symbol pairs that are aligned in the benchmark alignment but not aligned in the predicted alignment. For comparison, we repeated similar experiments using the pair-HMM with the same set of parameters as the one used in [9].
Performance of the proposed message-passing scheme
Pairwise sequence alignment performance evaluated on the BAliBASE 3.0 benchmark.
Ref | Pair-HMM | Message-Passing | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
λ= 0.25 | λ= 0.50 | λ= 0.75 | ||||||||||
SN | PPV | CPU | SN | PPV | CPU | SN | PPV | CPU | SN | PPV | CPU | |
RV11 | 0.048 | 0.106 | 1.934 | 0.123 | 0.149 | 0.769 | 0.155 | 0.175 | 1.465 | 0.198 | 0.209 | 3.675 |
RV12 | 0.213 | 0.414 | 2.707 | 0.399 | 0.468 | 1.146 | 0.475 | 0.523 | 2.145 | 0.569 | 0.595 | 5.200 |
RV20 | 0.276 | 0.476 | 2.725 | 0.504 | 0.568 | 1.186 | 0.570 | 0.613 | 2.251 | 0.643 | 0.665 | 5.531 |
RV30 | 0.168 | 0.300 | 2.656 | 0.324 | 0.369 | 1.143 | 0.372 | 0.402 | 2.160 | 0.426 | 0.441 | 5.432 |
RV40 | 0.153 | 0.271 | 4.084 | 0.250 | 0.284 | 1.760 | 0.300 | 0.323 | 3.234 | 0.361 | 0.373 | 7.970 |
RV50 | 0.140 | 0.254 | 4.969 | 0.248 | 0.278 | 2.102 | 0.294 | 0.312 | 3.967 | 0.348 | 0.353 | 9.856 |
From Table 1, we can clearly see that the proposed message-passing scheme significantly outperforms the pair-HMM approach in terms of SN and PPV, for all three values of λ. For example, the message-passing scheme achieved up to 0.23 higher SN and 0.09 higher PPV for λ = 0.25, and up to 0.37 higher SN and 0.19 higher PPV for λ = 0.75. Our experiments showed that a larger λ tends to yield more accurate alignments, while a smaller λ tends to make the algorithm converge faster, hence computationally more efficient. For example, when the weight parameter was set to λ = 0.25, the message-passing scheme was around 2.3 ~ 2.5 times faster than the pair-HMM, while still yielding much more accurate alignments.
Conclusions
In this paper, we proposed a novel method for sequence alignment based on an efficient message-passing approach. Given two biological sequences, the proposed method estimates the symbol alignment confidence scores for all possible symbol pairs. These scores are iteratively computed by exchanging messages between neighboring symbol pairs, where empirical evidence shows that these scores quickly converge within several iterations. The proposed message-passing scheme effectively addresses a number of limitations of the traditional pair-HMM-based approach, and extensive performance assessment based on BAliBASE 3.0 shows that the proposed scheme consistently outperforms the pair-HMM approach, both in terms of alignment accuracy and computational efficiency. Considering that pair-HMMs have been widely adopted by many modern multiple sequence alignment algorithms [9–11], the proposed scheme has potentials to further improve the current state-of-the-art. Furthermore, the proposed scheme is numerically stable even for extremely long sequences. Unlike the pair-HMM approach, there is no global measure or quantity (such as the observation probability P (x, y) of the entire sequence pair) to be estimated, and the exchanged messages (i.e., symbol alignment confidence scores) are normalized after each iteration, which ensures that they lie within a reasonable numerical range. Finally, the simple iterative estimation process - in which the neighboring symbol pairs only exchange "local" messages - makes the proposed message passing scheme amenable to massive parallelization through the utilization of modern GPU (graphics processing unit) architecture. These characteristics open up the possibility of applying the proposed message-passing scheme to accurate probabilistic alignment of genome-scale sequences, which has not been possible using traditional pair-HMMs.
Finally, it is worth noting that the formula that is used to update c_{ xy }(i, j) in the proposed message-passing algorithm bears conceptual similarity to the eigenvalue equation used by the network alignment algorithm called IsoRank [17] for estimating the functional similarity between proteins across different protein-protein interaction (PPI) networks. As demonstrated in [18, 19], techniques that were originally developed for sequence alignment may also have potentials to improve network alignment methods. Conversely, techniques used in network alignment may also lead to better sequence alignment methods. For example, the scoring scheme used by IsoRank can be viewed as a random walk [20], and it was shown that the use of a different random walk scheme can lead to more accurate network alignment results [19]. Similarly, it may be possible to modify the update formula for c_{ xy }(i, j) to further improve the performance of the proposed message-passing scheme, and we are currently in the process of investigating several different implementations.
Declarations
Acknowledgements
This work was supported by the National Science Foundation through NSF Award CCF-1149544.
Declarations
Publication of this article was funded by the National Science Foundation through NSF Award CCF-1149544.
This article has been published as part of BMC Genomics Volume 15 Supplement 1, 2014: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S1.
Authors’ Affiliations
References
- Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 1998, Cambridge University PressView ArticleGoogle Scholar
- Phillips A, Janies D, Wheeler W: Multiple sequence alignment in phylogenetic analysis. Mol Phylogenet Evol. 2000, 16: 317-330. 10.1006/mpev.2000.0785.PubMedView ArticleGoogle Scholar
- Cuff JA, Barton GJ: Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins. 2000, 40: 502-511. 10.1002/1097-0134(20000815)40:3<502::AID-PROT170>3.0.CO;2-Q.PubMedView ArticleGoogle Scholar
- Notredame C: Recent progress in multiple sequence alignment: a survey. Pharmacogenomics. 2002, 3: 131-144. 10.1517/14622416.3.1.131.PubMedView ArticleGoogle Scholar
- Edgar RC, Batzoglou S: Multiple sequence alignment. Curr Opin Struct Biol. 2006, 16: 368-373. 10.1016/j.sbi.2006.04.004.PubMedView ArticleGoogle Scholar
- Pei J: Multiple protein sequence alignment. Curr Opin Struct Biol. 2008, 18: 382-386. 10.1016/j.sbi.2008.03.007.PubMedView ArticleGoogle Scholar
- Kumar S, Filipski A: Multiple sequence alignment: in pursuit of homologous DNA positions. Genome Res. 2007, 17: 127-135. 10.1101/gr.5232407.PubMedView ArticleGoogle Scholar
- Miyazawa S: A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng. 1995, 8 (10): 999-1009. 10.1093/protein/8.10.999.PubMedView ArticleGoogle Scholar
- Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 2005, 15 (2): 330-340. 10.1101/gr.2821705.PubMedPubMed CentralView ArticleGoogle Scholar
- Sahraeian SM, Yoon BJ: PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Res. 2010, 38 (15): 4917-4928. 10.1093/nar/gkq255.PubMedPubMed CentralView ArticleGoogle Scholar
- Hamada M, Asai K: A classification of bioinformatics algorithms from the viewpoint of maximizing expected accuracy (MEA). J Comput Biol. 2012, 19 (5): 532-549. 10.1089/cmb.2011.0197.PubMedPubMed CentralView ArticleGoogle Scholar
- Yoon BJ: Hidden Markov models and their applications in biological sequence analysis. Curr Genomics. 2009, 10 (6): 402-415. 10.2174/138920209789177575.PubMedPubMed CentralView ArticleGoogle Scholar
- Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. 1989, 77 (2): 257-286. 10.1109/5.18626.View ArticleGoogle Scholar
- Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. Information Theory, IEEE Transactions on. 1967, 13 (2): 260-269.View ArticleGoogle Scholar
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48 (3): 443-453. 10.1016/0022-2836(70)90057-4.PubMedView ArticleGoogle Scholar
- Thompson JD, Koehl P, Ripp R, Poch O: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins. 2005, 61: 127-136. 10.1002/prot.20527.PubMedView ArticleGoogle Scholar
- Singh R, Xu J, Berger B: Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc Natl Acad Sci USA. 2008, 105 (35): 12763-12768. 10.1073/pnas.0806627105.PubMedPubMed CentralView ArticleGoogle Scholar
- Flannick J, Novak A, Srinivasan BS, McAdams HH, Batzoglou S: Graemlin: general and robust alignment of multiple large interaction networks. Genome Res. 2006, 16 (9): 1169-1181. 10.1101/gr.5235706.PubMedPubMed CentralView ArticleGoogle Scholar
- Sahraeian SM, Yoon BJ: SMETANA: Accurate and Scalable Algorithm for Probabilistic Alignment of Large-Scale Biological Networks. PLoS ONE. 2013, 8 (7): e67995-10.1371/journal.pone.0067995.PubMedPubMed CentralView ArticleGoogle Scholar
- Yoon BJ, Qian X, Sahraeian SME: Comparative analysis of biological networks: Hidden Markov model and Markov chain-based approach. IEEE Signal Processing Magazine. 2012, 29: 22-34.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.