Novel algorithms for LDD motif search

Background Motifs are crucial patterns that have numerous applications including the identification of transcription factors and their binding sites, composite regulatory patterns, similarity between families of proteins, etc. Several motif models have been proposed in the literature. The (l,d)-motif model is one of these that has been studied widely. However, this model will sometimes report too many spurious motifs than expected. We interpret a motif as a biologically significant entity that is evolutionarily preserved within some distance. It may be highly improbable that the motif undergoes the same number of changes in each of the species. To address this issue, in this paper, we introduce a new model which is more general than (l,d)-motif model. This model is called (l,d1,d2)-motif model (LDDMS) and is NP-hard as well. We present three elegant as well as efficient algorithms to solve the LDDMS problem, i.e., LDDMS1, LDDMS2 and LDDMS3. They are all exact algorithms. Results We did both theoretical analyses and empirical tests on these algorithms. Theoretical analyses demonstrate that our algorithms have less computational cost than the pattern driven approach. Empirical results on both simulated datasets and real datasets show that each of the three algorithms has some advantages on some (l,d1,d2) instances. Conclusions We proposed LDDMS model which is more practically relevant. We also proposed three exact efficient algorithms to solve the problem. Besides, our algorithms can be nicely parallelized. We believe that the idea in this new model can also be extended to other motif search problems such as Edit-distance-based Motif Search (EMS) and Simple Motif Search (SMS).

The task is to find all the strings (also called (l, d)-motifs) of length l each that occur in each of the input sequences within a hamming distance of d. The LDMS problem is known to be NP-hard [3,4].
Motifs can be thought of as evolutionarily preserved biological information. This information might have undergone different changes in different species. The (l, d)-motif model captures this possibility by requiring that the motif occur within a hamming distance of d in each sequence. However, this requirement may be more stringent than needed. When some biological information undergoes changes (e.g., mutations) in various species, the amount of change may not be the same across all the species. Some might have undergone more changes than the others. If we think of d as an upper bound on the amount of change, then it is conceivable (and very likely) that some of the species have undergone less changes. As a result, the (l, d)-motif model is likely to admit many spurious strings as motifs. These strings might occur by random chance and get qualified as motifs. Because of this, the LDMS algorithms might take longer time than actually needed. To rectify these shortcomings, in this paper we propose a new model of motifs. This model is called (l, d 1 , d 2 )-model. The corresponding motif search problem is called the LDDMS problem and defined next.

Definition 1
The input for the LDDMS problem has n biological sequences each of length m and three integers l, d 1 , and d 2 . The problem is to find all the strings M of length l that have the following two properties: 1) M should occur in each of the n input strings within a hamming distance of d 1 . This requirement is referred to as the (l, d 1 )-condition; and 2) M should occur in at least one of the n input strings within a hamming distance of d 2 . This requirement is referred to as the (l, d 2 )-condition.

Validity of the (l, d 1 , d 2 )-motif model
In this section we demonstrate the validity of the (l, d 1 , d 2 )-motif model with a simple random model for mutations. Assume that the species under consideration have the same origin. Let M be an original motif of length l. Consider a random model where the number of mutations occurring in the species is uniformly distributed in the range 0, l 2 . Let n be the number of species and let the number of mutations that have occurred in these species be X 1 , X 2 , . . . , X n , respectively and let Y = min{X 1 , X 2 , . . . , X n } and Z = max{X 1 , X 2 , . . . , X n }. It is easy to show that: Thus the difference between Y and Z could be quite large! As an example consider an input of 20 sequences, each of length 600 and let l = 10. Assume that the number of mutations d is uniformly random in the range [ 0, 5]. If we set d 2 = 1, the probability that there exists at least one DNA sequence such that the motif occurs with a hamming distance of at most d 2 is: When n is larger than 20, this probability will become even higher. Therefore, it is quite reasonable to add the (l, d 2 )-condition into the LDMS model.
It is easy to see that if d 2 d 1 , then the (l, d 2 )-condition becomes trivial and the LDDMS problem will become the standard LDMS problem. Thus, the LDMS problem is a special case of the LDDMS problem. If d 2 = 0, it means that we want to look for a motif that appears exactly in at least one of the input sequences. In the rest of this paper we assume that d 2 < d 1 .

Related work
(l, d) motif search is also referred to as Planted Motif Search (PMS) problem in some literature. Since (l, d 1 , d 2 ) motif search is closely related to PMS and we will use a PMS solver in one of the LDDMS algorithms, it is necessary to discuss some of the latest PMS algorithms.
In 2012, Yu, et al., proposed PairMotif to solve PMS problems [5]. They reduced the size of candidate motifs and scanned l-mers by selecting pairs of l-mers from different input sequences and then generate the common neighbors. The authors tested PairMotif algorithm on simulated data as well as on five real data sets from [6], which are preproinsulin, DHFR, c-fos, metallothionein and Yeast ECB. It can solve the weak instance (27, 9) within 10 hours. They also showed that PairMotif is more stable in solving PMS problem in longer input sequences [5].
Sometimes, biologists may also be interested in motifs that occur in a fraction of the input strings. The problem of identifying such motifs is known as quorum Planted Motif Search (qPMS). In this case, in addition to l and d and n strings there is an extra input parameter q. The problem is to identify all the (l, d, q)-motifs, that is, all the (l, d)-motifs that occur in at least q% of the input strings. In 2014, Tanaka proposed TraverStringRef in [7]. This algorithm is based on the PMS8 algorithm of Nicolae and Rajasekaran [8]. This is the first algorithm that solved the challenging DNA instance with (l, d, q) = (25, 10, 20) in a reasonable amount of time.
In 2015, Nicolae and Rajasekaran proposed qPMS9 [9]. It can solve challenging instances up to (25, 10) using a single core machine and up to (30, 13) using a 48-core machine. The algorithm is based on PMS8 proposed by the same authors [8], but it added quorum support and also included better pruning techniques to significantly reduce the size of the search space.
In 2016, Xiao, Pal and Rajasekaran proposed qPMS10 [3,4]. qPMS10 is a randomized algorithm based on the idea of random sampling. It will first utilize any existing PMS solver on a subset of the input. Then the candidate motifs are filtered to get the correct motifs for the original problem. Probability analysis shows that with high probability, the result is correct. Experimental result shows that this algorithm is competitive especially when the dataset is large.
Not only mutations, but also insertions and deletions are important as they may also play critical roles in divergence of biological sequences [10,11]. In this case, edit distance instead of hamming distance should be considered [12,13]. This corresponding problem is modeled as Editdistance-based Motif Search (EMS) problem. There are also some works in the literature on EMS (see e.g., [1,[12][13][14][15], and so on).
However, as far as the authors know, no such generalizations of PMS model exist in the published literature. Therefore, we propose LDDMS model and the corresponding algorithms.

Methods
Since the LDMS problem is NP-hard, the LDDMS problem is also NP-hard. All the known exact algorithms for solving the LDMS problem take time that is exponential in some of the underlying parameters. In this paper, we present three efficient algorithms for solving the LDDMS problem. These algorithms are referred to as LDDMS1, LDDMS2 and LDDMS3. Time complexities of these three algorithms are analysed. Experimental results on simulated dataset and real datasets both demonstrate that our algorithms are efficient.

Description of LDDMS algorithms
For any l-mer u we define its d-friendhood as the set of lmers v whose hamming distance is exactly d from u; define its d-neighborhood as the set of l-mers v whose hamming distance is at most d from u.
For all the LDDMS algorithms, the input is a database S containing n sequences, each of length m, and integers l, d 1 and d 2 ; the output is all the strings of length l that meet both (l, d 1 )-condition and (l, d 2 )-condition.
A straight-forward solution is the pattern driven approach. If is the alphabet under concern, there are | | l possible l-mers. For every such l-mer, check if it meets both the (l, d 1 )-condition and the (l, d 2 )-condition. If so, output this l-mer. Obviously, this algorithm takes too much time.
In addition to pattern driven approaches, we also have sample driven approaches. We could employ the following two step algorithm: 1) First find all the motifs that satisfy the (l, d 1 )-condition. This can be done using any of the LDMS algorithms. Let C 1 be the set of these motifs; and 2) For every motif x ∈ C 1 , check if x satisfies the (l, d 2 )-condition and if so output x. We call this algorithm LDDMS1. Since qPMS9 is currently the most efficient LDMS algorithm [9], we will take advantage of it in LDDMS1 (See Algorithm 1).
Equivalently, we can also find (l, d 2 )-motifs in the first step, and then for every such motif check if it satisfies the (l, d 1 )-condition. We refer to this algorithm as LDDMS2 (See Algorithm 2).
Note that each valid motif has at least one d 2 -neighbor in at least one of the input sequences. We generate n(m − l + 1) l-mers from each of the input sequences. d 2neighborhood of an l-mer u can be found by constructing Algorithm 1 LDDMS1 1: Run any LDMS algorithm to solve the (l, d 1 )-motif search problem on the input S. Let C 1 be the set of motifs found; the neighborhood tree. With u being the root and the height of the tree being d 2 , the level of a node is the hamming distance between u and this node. All the nodes of this tree, including the root and the leaves, will constitute the d 2 -neighborhood of u. In Step 3 of LDDMS2, we can employ radix sort and eliminate duplicates. In Step 4 the output O 2 of valid motifs found will be in sorted order.

Algorithm 2 LDDMS2
1: Generate all possible l-mers from out of each of the n input sequences. Let the collection of these l-mers be L; 2: For every u ∈ L, generate all the l-mers v such that v is a d 2 -neighborhood of u. Let the union of v be N; 3: Sort all the elements in N, in lexicographic order, and remove the duplicates. Let the sorted and reduced set be C 2 ; If d 2 is very small (for example, d 2 = 0 or 1), we can expect LDDMS2 to run faster than LDDMS1. This is because the d 2 -neighborhhod for any l-mer will be small. However, when d 2 is large, the neighborhood tree will be large and so will be the number of candidate motifs. Therefore, LDDMS2 takes much more time and memory when d 2 is large. To save time, one idea is to check the candidate motifs concurrently while constructing the neighborhood tree. During the checking process, some pruning conditions can be developed such that once certain conditions hold, a node is not explored deeper. The stronger the pruning condition is, the faster the algorithm will be. Inspired by similar pruning ideas proposed for the LDMS model [16], we develop LDDMS3 (See Algorithm 3).

Definition 2
Given an l-mer u from Sequence i (i ∈ [ 1, n]), construct its d 2 -neighborhood tree. Let x be any node in this tree, denote δ(x, i, q) as the smallest hamming distance between x and any l-mer out of Sequence q. Denote δ(x, i, I) to be the maximum of δ(x, i, q) where Algorithm 3 LDDMS3 1: For i = 1, 2, ...n, generate all possible l-mers from out of the input sequence S i . Let the collection of these l-mers be from the sequence S i be L i , for 1 ≤ i ≤ n; 2: For every u ∈ L i , construct a d 2 -neighborhood tree; 3: O 3 = ∅. Traverse the neighborhood tree and for each node x compute δ(x, i, I) and based on Theorem 1, add x or x into O 3 ; 4: Sort all the elements in O 3 , in lexicographic order, and remove the duplicates; 5: return O 3 ; q = 1, 2, ..., n and q = i.
If v is an l-mer in the sequence S q , we denote it as: v l s q . Also, Hd(v, x) is the hamming distance between v and x. By computing δ(x, i, I), we have the following pruning conditions [16].

Analysis of LDDMS algorithms Candidate size and expected number of motifs
In this section, we estimate candidate sizes of LDDMS1 and LDDMS2, i.e., |C 1 | and |C 2 |, and also the expected number of motifs that would be found. Such estimation is useful in computing the time complexities of these two algorithms.
Recall that in the benchmark dataset all the characters are generated from i.i.d. and there are n sequences with length m each. Given an l-mer M, the number of l-mers that have a hamming distance of d 1 from M is: where is the alphabet under concern. The probability that a randomly chosen l-mer has a hamming distance of at most d 1 from M is: The probability that in a sequence of length m, there is at least one string u such that u and M are within a hamming distance of d 1 is: The probability that a randomly chosen l-mer occurs within a hamming distance of d 1 in each of the n input sequences, each of length m is: Therefore, the expected number of (l, d 1 )-motifs is: Similarly, the probability that a randomly chosen l-mer has a hamming distance of at most d 2 from M is: The probability that in a sequence of length m, there is at least one string u that has a hamming distance of at most d 2 from M is: Therefore, the expected number of (l, d 2 )-motifs is: In all of the above assertions we have assumed that the l-mers of a sequence are independent. Clearly, this is incorrect. However, such analyses have proven useful in estimating the number of motifs in practice (see e.g., [17]). Along these lines, let us look at the expected number of motifs that will be found, i.e., |O 1 | or |O 2 |. Let M be a random l-mer, A i be the event that M has a neighbor that is within a hamming distance of d 2 in exactly i of the input sequences. Similarly, let B j be the event that M has a neighbor that is within a hamming distance of (d 2 , d 1 ] in exactly j of the input sequences. It should be noted here that if M has a neighbor whose hamming distance is at most d 2 in an input sequence, then it automatically will also have a neighbor that is within a hamming distance of d 1 in such sequence since we assume d 2 < d 1 .
We want to know the probability that events A i and B n−i both happen, which means in each of the n input sequences, there is an l-mer that is within a hamming distance of d 2 from M and also, in each of the remaining n − i input sequences, there will be an l-mer that is within a hamming distance of (d 2 , d 1 ] from M.
Given an l-mer M, the probability that a random string u of length l has a hamming distance in the range of (d 2 , d 1 ] from M is: In one sequence, there are m − l + 1 such l-mers. The probability that in such a sequence, there is at least one lmer that is within a hamming distance of d 1 but no l-mer that is within a hamming distance of d 2 from M is: Therefore, the probability that a random l-mer out of such dataset meets both (l, d 1 ) and (l, d 2 )-condition is: In conclusion, the expected number of spurious motifs we can find in the LDDMS model is:

Time complexity of the algorithms
Note that all the three algorithms (LDDMS1, LDDMS2, and LDDMS3) can be nicely parallelized. For LDDMS1, there are parallel versions of LDMS solvers, such as PMS9.
For every candidate motif, the checking process is independent and can also be parallelized. For LDDMS2 and LDDMS3, we need to generate the neighnorhood tree for n(m − l + 1) l-mers out of the input sequences. There are n(m − l + 1) independent subproblems and can be assigned to different processors. However, in this paper, we only implement these algorithms sequentially and analyze the time complexity of the sequential versions of these algorithms. Given a candidate motif of length l, checking if it meets (l, d 1 ) and (l, d 2 )-condition in an input of n sequences, each of length m, will take O ((m − l + 1)nl) = O(mnl) time. It is easy to see that the brute-force algorithm takes time O(| | l mnl).
For LDDMS1, qPMS9 can be implemented in O (m k mnN( , l, d 1 )) time. N( , l, d 1 ) has the same definition as in Eq. 1. k is a dynamic variable between 1 and n. We get the following:

Theorem 2 The time complexity of LDDMS1 algorithm is
where |C 1 | is the candidate size of (l, d 1 )-motif. An expected number can be obtained from Eq. 5.
For LDDMS2, in Step 1 and Step 2, generating the neighborhoods from all l-mers out of each of the input sequences will take time O ((m − l + 1)nN( , l, d 2 )). In Step 3, radix sort and removing the duplicates will take time O ((m − l + 1)nlN( , l, d 2 )). Thus we arrive at:

Theorem 3 LDDMS2 can be implemented in time
where |C 2 | is the candidate size of (l, d 2 )-motif. An expected number is given in Eq. 8.
The following lemma from [16] is useful in computing the time complexity of LDDMS3.

Theorem 4 LDDMS3 can be implemented in time
Note this is only the worst-case time complexity and d 1 does not appear in this expression. The actual run time could be much less because a lot of branches can be "pruned".

Results and discussion
LDDMS1, LDDMS2 and LDDMS3 are tested on synthetic datasets as well as real datasets. We evaluated our algorithms on a Dell Precisions Workstation T7910 running RHEL 7.0 on two sockets each containing 8 Dual Intel Xeon Processors E5-2667 (8C HT, 20MB Cache, 3.2GHz) and 256GB RAM.

Synthetic datasets
Following the tradition, we employ combinations of (l, d 1 ) that are challenging [3]. We vary d 2 from 0 to d 1 /2 . The challenging instances of n = 20, m = 600 for DNA sequences and the values of d 2 for carrying out the test are listed in Table 1.
The challenging instances correspond to a small number of spurious motifs. This will make the candidate size in LDDMS1 very small and hence the time spent in Step 2 in LDDMS1 is trivial. To avoid such problems, we slightly change the way we plant the motifs. We will randomly generate two l-mers, M 1 and M 2 . The hamming distance of M 1 and M 2 is q. Then we insert M 1 into each of the first n/2 input sequences and M 2 into each of the rest n/2 input sequences. A detailed algorithm for generating the test cases is given in Algorithm 4.
In this way, the common neighbors that are within d 2 hamming distance of M 1 and M 2 are (l, d 1 , d 2 )-motifs we plant. Generally, when q is small, there will be more common neighbors between M 1 and M 2 . Conversely, when q is large, there are fewer common neighbors between M 1 and M 2 . By varying q, we can control the output motif size. There is a theory proposed in [8] which proves to be useful here.

Algorithm 4 generateTestCases
1: Generate n sequences, each of length m from the alphabet ; 2: Randomly generate two l-mers, denoted as M 1 and M 2 . The hamming distance between M 1 and M 2 is q; 3: Insert M 1 into each of the first n/2 input sequences.
Insert M 2 into each of the remaining n/2 input sequences. The starting position of M i (i = 1, 2) is randomly chosen from 1 to m − l + 1.

Theorem 5 Two l-mers a and b have a common neighbor M such that Hd(a, M) d a and Hd(b, M) d b if and only if Hd(a, b) d a + d b .
Applying the above theorem, q has to be at a distance of at most 2d 2 for M 1 and M 2 to have common neighbors that are within a d 2 hamming distance. When d 2 = 0, we set q = 0, then there will be at least N( , l, d 2 ) (l, d 1 , d 2 )motifs that can be found. When d 2 = 0, q = 2d 2 , there will be at least 2d 2 d 2 (l, d 1 , d 2 )-motifs that can be found. However, the number of planted (l, d 1 )-motifs, i.e., common neighbors that are within a d 1 hamming distance between both M 1 and M 2 , is much larger.
We have tested our algorithms on challenging instances of (l, d 1 ) from (7, 1) upto (19, 7), where d 2 varies from 0 to d 1 /2 . Tables 2, 3 and 4 show the running times of LDDMS1, LDDMS2 and LDDMS3 on different (l, d 1 , d 2 ) values. For small instances such as (l, d 1 ) = (7, 1), (8,1), (9,2), (10,2), LDDMS1 runs faster than LDDMS2 and LDDMS3. This is because qPMS9 is fast and there are only a few (l, d 1 )-motifs to check. However, for moderate and relatively large instances, a small value of d 2 It is obvious that as (l, d 1 ) instances become larger, all the LDDMS algorithms will take more time. However, an interesting observation is that for a fixed (l, d 1 ) instance, increasing the value of d 2 will make LDDMS1 run faster but LDDMS2 and LDDMS3 slower. This is because of the  way we generate the test cases. If d 2 is very small, then the two l-mers we plant will be almost identical. In this case, we will find a lot of (l, d 1 )-motifs in the end of Step 2 in LDDMS1. However, small values of d 2 will make the neighborhood tree small, thus LDDMS2 and LDDMS3 will run faster.

Real datasets
We also used the datesets discussed in [18] to test our algorithms. We chose a group of real datasets. We excluded datasets with only one input sequence because such datasets are not meaningful for our test. We chose two relatively large number, 18 and 19 for the motif length. Then we re-computed d 1 which will make (l, d 1 ) challanging instances since each dataset has different number of input sequences and different length for each sequence. However, as we noted before, the challenging instances will make the candidate size in LDDMS1 very small. In this case, we cannot manually plant a motif to avoid such a problem. Therefore, we will increment d 1 by 2. We tested the minimum and maximum number of d 2 , i.e., 0 and d 2 /2 . Table 5 shows the datasets information and the (l, d 1 , d 2 ) instances we have tested. Table 6 shows the running time of LDDMS1, LDDMS2 and LDDMS3 on real datasets. On the real dataset, for fixed (l, d 1 ), changing d 2 does not affect the running time of LDDMS1 very much. This is because for a real dataset, the candidate size, i.e., the number of (l, d 1 ) motifs is unchanged. This is also true for the number of (l, d 2 ) motifs for LDDMS2. Moreover, as one can find, for a fixed d 1 , increasing l will make LDDMS1 run faster because

Conclusions
Efficient motif search algorithms are crucial in solving many bioinformatics problems effectively. In this paper, we have presented the (l, d 1 , d 2 ) motif model, a more general model for the motif search problem. We also have proposed LDDMS1, LDDMS2 and LDDMS3, three exact efficient algorithms to solve the LDDMS problem. Theoretical analysis shows that our algorithms are very competitive. Experimental results also reveal that our algorithms perform well in practice.
In future we will focus on solving harder LDDMS instances, including those involving protein strings. We also plan to implement our algorithms in parallel.