A brief overview of pair hidden Markov models
The pair-HMM [1, 12] is a generative sequence model that simultaneously generates a pair of aligned symbol sequences. This differs from traditional HMMs, which generate only a single symbol sequence at a time [13]. Figure 1 shows two examples of pair-HMMs that are widely used in biological sequence analysis. As shown in Figure 1, a typical pair-HMM consists of three hidden states Ix, Iy, and M, which model insertions in sequence x, insertions in sequence y, and matched (i.e., aligned) symbols in both sequences, respectively. The pair-HMM generates an aligned sequence pair (x, y) by making transitions between the hidden states according to the specified state transition probabilities. At state Ix, the model emits a symbol only to sequence x, while at Iy, a symbol is emitted only to sequence y. At state M, the model emits a pair of aligned symbols, where one symbol is added to x and the other is added to y. Figure 1(C) gives an example of a sequence pair (x, y) generated by a pair-HMM. In this example, the underlying hidden state sequence that gives rise to the two sequences x = AACCG and y = CCGTT is IxIxMMMIyIy. This indicates that the first two symbols in x (i.e., AA) and the last two symbols in y (i.e., TT) are "insertions," which have no matching counterpart in the other sequence, while the last three symbols in x and the first three symbols in y (i.e., CCG in both sequences) are jointly generated by the pair-HMM and hence closely match each other. As this example shows, we can unambiguously identify the alignment of a given sequence pair (x, y) once the underlying hidden state sequence yielding the pair is known. Of course, the hidden state sequence is generally not known, but there exist efficient algorithms for its prediction.
For example, we can use the Viterbi algorithm [14] to predict the optimal hidden state sequence that maximizes the observation probability of the sequence pair (x, y). Alternatively, we can predict the state sequence that maximizes the expected number of correct states by first estimating the alignment probabilities between the symbols in x and y through the forward and backward procedures [13] and then applying the Needleman-Wunsch algorithm [15]. This yields the maximum expected accuracy (MEA) alignment between the two sequences x and y.
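As a sketch of this second approach, the snippet below runs a Needleman-Wunsch-style dynamic program over a matrix of posterior symbol alignment probabilities (here a small hypothetical matrix, standing in for the output of the forward-backward procedures), scoring each aligned pair by its posterior and gaps as zero, so the optimal path maximizes the expected number of correctly aligned symbols:

```python
def mea_alignment(post):
    """Needleman-Wunsch-style DP on posterior alignment probabilities.

    post[i][j] is assumed to hold P(x_i ~ y_j | x, y); gaps score 0,
    so the recursion maximizes the expected number of correctly
    aligned symbol pairs.
    """
    L, M = len(post), len(post[0])
    S = [[0.0] * (M + 1) for _ in range(L + 1)]
    for i in range(1, L + 1):
        for j in range(1, M + 1):
            S[i][j] = max(S[i-1][j-1] + post[i-1][j-1],  # align x_i with y_j
                          S[i-1][j],                      # gap in y
                          S[i][j-1])                      # gap in x
    # trace back to recover the aligned index pairs (1-based, as in x_i, y_j)
    pairs, i, j = [], L, M
    while i > 0 and j > 0:
        if S[i][j] == S[i-1][j-1] + post[i-1][j-1]:
            pairs.append((i, j))
            i, j = i - 1, j - 1
        elif S[i][j] == S[i-1][j]:
            i -= 1
        else:
            j -= 1
    return S[L][M], pairs[::-1]
```

The posterior matrix `post` is a placeholder input here; in practice it would come from the forward and backward procedures applied to the pair-HMM.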
Limitations of pair-HMMs
Although the hidden state sequence of a pair-HMM unambiguously determines a specific sequence alignment, the converse does not necessarily hold. In fact, several different state sequences can lead to the same sequence alignment, so we may not always be able to unambiguously determine the underlying state sequence for a given pairwise sequence alignment. For example, consider two sequences x = AAACGG and y = AAATTA. Suppose the "true" alignment aligns only the first three symbols (i.e., AAA) of x and y, so that the last three symbols in each sequence are regarded as insertions with no matching counterpart in the other sequence. This is illustrated below, where the solid lines correspond to the aligned symbols:
x: A A A C G G - - -
   | | |
y: A A A - - - T T A        (1)
For the pair-HMM shown in Figure 1(A), any hidden state sequence s1 s2 ⋯ s9 such that s1 = s2 = s3 = M and s4 s5 ⋯ s9 is a permutation of Ix Ix Ix Iy Iy Iy would lead to the sequence alignment shown in (1). When using this pair-HMM to predict the optimal alignment of a sequence pair (i.e., the alignment with the largest probability), this ambiguity may degrade performance, as these potential state sequences compete against each other. For this reason, it is generally more desirable to estimate the symbol alignment probabilities via the pair-HMM by considering all potential alignments and state sequences, and to use the estimated probabilities to find the MEA alignment, which is expected to have the maximum number of correctly aligned symbols [9–11]. However, the aforementioned ambiguity also negatively affects the quality of the estimated symbol alignment probabilities, which is especially noticeable for sequence pairs with low percentage identity. In some cases, the alternative pair-HMM shown in Figure 1(B) is used to avoid such ambiguity. This alternative pair-HMM blocks transitions between the insertion states Ix and Iy, thereby prohibiting the model from inserting unaligned symbols into both sequences. For example, the alignment shown in (1) would not be allowed under this alternative pair-HMM. However, due to this restriction, the pair-HMM in Figure 1(B) has a relatively stronger tendency to align unrelated sequence regions by treating them as mutations. This may again negatively affect the quality of the symbol alignment probabilities estimated from the pair-HMM.
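To make the extent of this ambiguity concrete for the alignment in (1): the three M states are fixed, but the six insertion states may appear in any order, so the number of distinct state sequences that all map to this single alignment can be counted directly:

```python
from itertools import permutations

# For alignment (1), states s1..s3 are fixed to M, while the remaining
# six states are three Ix's and three Iy's in any order. Counting the
# distinct orderings gives the number of state sequences that all map
# to this one alignment.
tails = {p for p in permutations(["Ix"] * 3 + ["Iy"] * 3)}
print(len(tails))  # 6! / (3! * 3!) = 20
```

All 20 of these state sequences compete for probability mass under the model in Figure 1(A), even though they describe one and the same alignment.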
Another potential drawback of pair-HMMs is that the associated algorithms (i.e., the Viterbi, forward, and backward algorithms) can become numerically unstable for long sequences. Applying pair-HMMs to biological sequence analysis involves computing extremely small probabilities, which decrease exponentially with the sequence length. For example, based on the pair-HMM used in [9], the observation probability (i.e., the probability that the HMM generates a given sequence pair) of a protein pair is typically on the order of 10^-230 for proteins of length 80, 10^-280 for proteins of length 100, and 10^-320 for proteins of length 120. As a result, pair-HMM algorithms are prone to underflow errors unless they are carefully implemented to remain numerically robust. A number of schemes have been proposed to address this issue, such as log-transforming the probabilities or normalizing them to keep them within a reasonable numerical range, and these have been shown to work well for relatively long sequences [1]. However, log transformations can make the forward and backward algorithms considerably slower, and the normalization approach can still lead to underflow errors as the sequences get longer.
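The underflow problem is easy to reproduce. The sketch below uses a hypothetical per-position joint emission probability of 1/400 (roughly the average under a uniform 20×20 amino-acid pair distribution, not the value from any published model); the direct product underflows in double precision well before length 200, while the log-domain sum stays finite. The `logsumexp` helper illustrates the extra work a log transformation imposes on every sum in the forward and backward recursions, which is why the log-transformed algorithms run slower:

```python
import math

# Reproducing the underflow problem. The per-position probability is a
# hypothetical placeholder: 1/400 is roughly the average joint emission
# probability under a uniform 20x20 amino-acid pair distribution.
p_step = 1.0 / 400.0

prob = 1.0
for _ in range(200):               # a protein pair of length ~200
    prob *= p_step                 # direct product: underflows to 0.0
log_prob = 200 * math.log(p_step)  # log-domain equivalent stays finite

def logsumexp(log_vals):
    """Stable log(sum(exp(v))): the operation that replaces every plain
    sum in the forward/backward recursions after a log transformation."""
    m = max(log_vals)
    return m + math.log(sum(math.exp(v - m) for v in log_vals))

print(prob)      # 0.0 (underflow)
print(log_prob)  # approximately -1198.3
```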
A further disadvantage of pair-HMMs is that the algorithms used with the model cannot be easily parallelized. Although the Viterbi, forward, and backward algorithms for pair-HMMs are relatively efficient, they are still too computationally expensive to be used with very long sequences. Moreover, because these algorithms are not amenable to massive parallelization, pair-HMMs are not well suited for large-scale sequence analysis tasks, such as whole-genome alignment, despite their superior performance compared to heuristic methods.
A message-passing scheme for estimating symbol alignment confidence scores
Here, we propose a novel method for aligning biological sequences that can effectively address the aforementioned shortcomings of pair-HMMs. The proposed method is based on a message-passing scheme, where messages are iteratively exchanged between neighboring symbol pairs to estimate the level of confidence for potential pairwise symbol alignments. The main underlying motivation is to develop an "analytical" method that can directly estimate the symbol alignment probabilities, without specifically modeling symbol insertions and deletions. This stands in contrast to the pair-HMM approach, which is essentially based on a "generative" sequence model that tries to explicitly model symbol insertions/deletions, in addition to symbol alignments. As discussed before, modeling symbol insertions in pair-HMMs can lead to subtle issues with potentially negative effects, and considering that our ultimate goal lies in finding an accurate sequence alignment through effective estimation of the symbol alignment probabilities, a method that can directly estimate these probabilities without explicitly modeling insertions/deletions would be desirable.
Suppose x = x_1 x_2 ⋯ x_L and y = y_1 y_2 ⋯ y_M are the two sequences to be aligned. We define c_xy(i, j) as the symbol alignment confidence score between x_i (the i-th symbol in x) and y_j (the j-th symbol in y). The score c_xy(i, j) provides a quantitative measure of confidence as to whether x_i and y_j should be aligned to each other, and we assume c_xy(i, j) ∝ P(x_i ~ y_j | x, y), where P(x_i ~ y_j | x, y) is the posterior symbol alignment probability between x_i and y_j given the sequences x and y. We estimate the alignment confidence score by iteratively passing messages between neighboring symbol pairs, where each symbol pair (x_i, y_j) corresponds to a potential symbol alignment in the true (unknown) sequence alignment between x and y. For example, during the estimation process, the symbol pair (x_i, y_j) exchanges messages with its two neighbors (x_{i-1}, y_{j-1}) and (x_{i+1}, y_{j+1}); similarly, the pair (x_{i+1}, y_{j+1}) exchanges messages with (x_i, y_j) and (x_{i+2}, y_{j+2}). The message-passing process is illustrated in Figure 2, where the solid lines indicate the messages used to update the alignment confidence score c_xy(i, j) of the symbol pair (x_i, y_j), and the dashed lines correspond to messages used to update the confidence scores of other symbol pairs.
The pseudocode of the proposed message-passing algorithm is as follows:
STEP-1 Initialize c_xy(i, j).
STEP-2 Update the alignment confidence score c_xy(i, j).
STEP-3 Normalize c_xy(i, j).
STEP-4 If c_xy(i, j) has converged, terminate the algorithm. Otherwise, go to STEP-2.
In STEP-1, we first initialize the alignment confidence score c_xy(i, j), where we can simply use random initialization. If a preliminary sequence alignment of x and y is available (e.g., obtained from a simple heuristic method), we can also initialize the score based on this alignment such that c_xy(i, j) = 1 if x_i and y_j are aligned, and c_xy(i, j) = 0 otherwise. Next, in STEP-2, the alignment confidence score c_xy(i, j) of the symbol pair (x_i, y_j) is updated based on the scores of its two neighbors (x_{i-1}, y_{j-1}) and (x_{i+1}, y_{j+1}). Note that the score is set to c_xy(i, j) = 0 if i ∉ {1, ⋯, L} or j ∉ {1, ⋯, M}. P(x_i, y_j) is the joint occurrence probability of the symbol pair (x_i, y_j), which is essentially equivalent to the joint emission probability of an aligned symbol pair (x_i, y_j) at the match state M of a pair-HMM. It should be noted that this probability P(x_i, y_j) is not location-dependent and is determined solely by the symbols x_i and y_j. The weight parameter λ ∈ [0, 1] balances the contribution from the neighbors against that from the joint probability of (x_i, y_j) in estimating the alignment confidence score. A large λ gives more weight to the "messages" received from the neighbors, which tends to penalize gaps more heavily and generally leads to longer aligned regions with fewer gaps. On the contrary, a small λ gives more weight to the joint symbol occurrence probability P(x_i, y_j) while giving less weight to the messages received from the neighbors, which tends to be more lenient toward gaps. Once the symbol alignment confidence score c_xy(i, j) has been updated for all i = 1, ⋯, L and j = 1, ⋯, M, we normalize the scores to keep them within a proper numerical range, as shown in STEP-3. For example, a simple approach is to divide the score matrix C = [c_xy(i, j)] by its matrix norm. After normalization, the updated scores are compared to the scores from the previous iteration, and the algorithm terminates if the specified convergence criterion has been met. Otherwise, the algorithm goes back to STEP-2 and repeats the message-passing process.