 Research
 Open access
Novel algorithms for LDD motif search
BMC Genomics volume 20, Article number: 424 (2019)
Abstract
Background
Motifs are crucial patterns that have numerous applications, including the identification of transcription factors and their binding sites, composite regulatory patterns, and similarity between families of proteins. Several motif models have been proposed in the literature. The (l,d)-motif model is one of these that has been studied widely. However, this model sometimes reports more spurious motifs than expected. We interpret a motif as a biologically significant entity that is evolutionarily preserved within some distance. It may be highly improbable that the motif undergoes the same number of changes in each of the species. To address this issue, in this paper we introduce a new model that is more general than the (l,d)-motif model. This model is called the (l,d_{1},d_{2})-motif model, and the corresponding search problem (LDDMS) is NP-hard as well. We present three elegant and efficient algorithms to solve the LDDMS problem: LDDMS1, LDDMS2 and LDDMS3. They are all exact algorithms.
Results
We carried out both theoretical analyses and empirical tests on these algorithms. Theoretical analyses demonstrate that our algorithms have a lower computational cost than the pattern-driven approach. Empirical results on both simulated and real datasets show that each of the three algorithms has advantages on some (l,d_{1},d_{2}) instances.
Conclusions
We proposed the LDDMS model, which is more practically relevant. We also proposed three exact, efficient algorithms to solve the problem. In addition, our algorithms can be readily parallelized. We believe that the idea behind this new model can also be extended to other motif search problems such as Edit-distance-based Motif Search (EMS) and Simple Motif Search (SMS).
Background
Motif search has many applications in solving some crucial biological problems. For example, finding DNA motifs is very important for the determination of open reading frames, identification of gene promoter elements, location of RNA degradation signals, and the identification of alternative splicing sites [1, 2]. For more than 15 years, motif search has stimulated a lot of interest from researchers in different areas.
There are many models of motif search. One popular model that has been studied extensively is the (l,d)-motif model. The corresponding motif search problem is called LDMS. The input for the LDMS problem consists of n input sequences, each of length m, and two integers l and d. The task is to find all the strings of length l (also called (l,d)-motifs) that occur in each of the input sequences within a Hamming distance of d. The LDMS problem is known to be NP-hard [3, 4].
Motifs can be thought of as evolutionarily preserved biological information. This information might have undergone different changes in different species. The (l,d)-motif model captures this possibility by requiring that the motif occur within a Hamming distance of d in each sequence. However, this requirement may be more stringent than needed. When some biological information undergoes changes (e.g., mutations) in various species, the amount of change may not be the same across all the species. Some might have undergone more changes than others. If we think of d as an upper bound on the amount of change, then it is conceivable (and very likely) that some of the species have undergone fewer changes. As a result, the (l,d)-motif model is likely to admit many spurious strings as motifs. These strings might occur by random chance and still qualify as motifs. Because of this, LDMS algorithms might take longer than necessary. To rectify these shortcomings, in this paper we propose a new model of motifs, called the (l,d_{1},d_{2})-motif model. The corresponding motif search problem is called the LDDMS problem and is defined next.
Definition 1
The input for the LDDMS problem has n biological sequences, each of length m, and three integers l, d_{1}, and d_{2}. The problem is to find all the strings M of length l that have the following two properties: 1) M should occur in each of the n input strings within a Hamming distance of d_{1}. This requirement is referred to as the (l,d_{1})-condition; and 2) M should occur in at least one of the n input strings within a Hamming distance of d_{2}. This requirement is referred to as the (l,d_{2})-condition.
Validity of the (l,d_{1},d_{2})-motif model
In this section we demonstrate the validity of the (l,d_{1},d_{2})-motif model with a simple random model for mutations. Assume that the species under consideration have the same origin. Let M be an original motif of length l. Consider a random model where the number of mutations occurring in the species is uniformly distributed in the range \(\left[0,\frac{l}{2}\right]\). Let n be the number of species, let the numbers of mutations that have occurred in these species be X_{1},X_{2},…,X_{n}, respectively, and let Y=min{X_{1},X_{2},…,X_{n}} and Z=max{X_{1},X_{2},…,X_{n}}. It is easy to show that:
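The expressions that followed this sentence are not present in this copy of the text. As a hedged reconstruction, assuming the X_{i} are independent and continuously uniform on [0, l/2] (the continuity assumption is ours), the standard order-statistics results for the expected minimum and maximum are:

```latex
\[
E[Y] = \frac{l}{2(n+1)}, \qquad
E[Z] = \frac{n\,l}{2(n+1)}, \qquad
E[Z] - E[Y] = \frac{(n-1)\,l}{2(n+1)} .
\]
```

For n=20 and l=10 these evaluate to roughly 0.24 and 4.76, so the expected gap is about 4.5 mutations.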
Thus the difference between Y and Z could be quite large! As an example, consider an input of 20 sequences, each of length 600, and let l=10. Assume that the number of mutations d is uniformly random in the range [0,5]. If we set d_{2}=1, the probability that there exists at least one DNA sequence in which the motif occurs within a Hamming distance of at most d_{2} is:
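The numeric value is missing from this copy, and the text does not say whether the mutation count is an integer or a real number. The sketch below therefore evaluates 1 − (1 − p)^n under both readings; the function name and the two interpretations are assumptions made for illustration only.

```python
def prob_at_least_one_within(n: int, p_single: float) -> float:
    """Probability that at least one of n independent species has <= d2 mutations,
    given that a single species does so with probability p_single."""
    return 1.0 - (1.0 - p_single) ** n


n = 20
# Assumption A: the mutation count is a uniform integer in {0, ..., 5}, so P(X <= 1) = 2/6.
p_discrete = prob_at_least_one_within(n, 2 / 6)
# Assumption B: the mutation count is continuous uniform on [0, 5], so P(X <= 1) = 1/5.
p_continuous = prob_at_least_one_within(n, 1 / 5)

print(f"discrete: {p_discrete:.4f}, continuous: {p_continuous:.4f}")
```

Under either reading the probability is above 0.98 for n=20, which is in line with the argument that follows.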
When n is larger than 20, this probability will become even higher. Therefore, it is quite reasonable to add the (l,d_{2})-condition to the LDMS model.
It is easy to see that if \(d_{2} \geqslant d_{1}\), then the (l,d_{2})-condition becomes trivial and the LDDMS problem becomes the standard LDMS problem. Thus, the LDMS problem is a special case of the LDDMS problem. If d_{2}=0, we are looking for a motif that appears exactly in at least one of the input sequences. In the rest of this paper we assume that d_{2}<d_{1}.
Related work
The (l,d)-motif search problem is also referred to as the Planted Motif Search (PMS) problem in the literature. Since (l,d_{1},d_{2})-motif search is closely related to PMS, and we use a PMS solver in one of the LDDMS algorithms, we briefly discuss some of the latest PMS algorithms.
In 2012, Yu et al. proposed PairMotif to solve PMS problems [5]. They reduced the number of candidate motifs and scanned l-mers by selecting pairs of l-mers from different input sequences and then generating their common neighbors. The authors tested the PairMotif algorithm on simulated data as well as on five real datasets from [6]: preproinsulin, DHFR, c-fos, metallothionein and Yeast ECB. It can solve the weak instance (27, 9) within 10 hours. They also showed that PairMotif is more stable when solving the PMS problem on longer input sequences [5].
Sometimes, biologists may also be interested in motifs that occur in a fraction of the input strings. The problem of identifying such motifs is known as quorum Planted Motif Search (qPMS). In this case, in addition to l, d and the n strings, there is an extra input parameter q. The problem is to identify all the (l,d,q)-motifs, that is, all the (l,d)-motifs that occur in at least q% of the input strings. In 2014, Tanaka proposed TraverStringRef [7]. This algorithm is based on the PMS8 algorithm of Nicolae and Rajasekaran [8]. It is the first algorithm that solved the challenging DNA instance with (l,d,q)=(25,10,20) in a reasonable amount of time.
In 2015, Nicolae and Rajasekaran proposed qPMS9 [9]. It can solve challenging instances up to (25,10) using a single-core machine and up to (30,13) using a 48-core machine. The algorithm is based on PMS8, proposed by the same authors [8], but it adds quorum support and better pruning techniques that significantly reduce the size of the search space.
In 2016, Xiao, Pal and Rajasekaran proposed qPMS10 [3, 4]. qPMS10 is a randomized algorithm based on the idea of random sampling. It first runs any existing PMS solver on a subset of the input. The candidate motifs are then filtered to obtain the correct motifs for the original problem. Probability analysis shows that, with high probability, the result is correct. Experimental results show that this algorithm is competitive, especially when the dataset is large.
Not only mutations, but also insertions and deletions are important, as they may also play critical roles in the divergence of biological sequences [10, 11]. In this case, edit distance instead of Hamming distance should be considered [12, 13]. The corresponding problem is modeled as the Edit-distance-based Motif Search (EMS) problem. There are also some works in the literature on EMS (see e.g., [1, 12–15]).
However, as far as the authors know, no such generalization of the PMS model exists in the published literature. Therefore, we propose the LDDMS model and the corresponding algorithms.
Methods
Since the LDMS problem is NP-hard, the LDDMS problem is also NP-hard. All the known exact algorithms for solving the LDMS problem take time that is exponential in some of the underlying parameters. In this paper, we present three efficient algorithms for solving the LDDMS problem. These algorithms are referred to as LDDMS1, LDDMS2 and LDDMS3. The time complexities of these three algorithms are analysed. Experimental results on both simulated and real datasets demonstrate that our algorithms are efficient.
Description of LDDMS algorithms
For any l-mer u we define its d-friendhood as the set of l-mers v whose Hamming distance from u is exactly d, and its d-neighborhood as the set of l-mers v whose Hamming distance from u is at most d.
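These two definitions can be illustrated with a minimal Python sketch. The function names and the default DNA alphabet are assumptions chosen for illustration; direct enumeration like this is only practical for small l and d.

```python
from itertools import combinations, product


def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(u, v))


def friendhood(u: str, d: int, alphabet: str = "ACGT") -> set[str]:
    """All l-mers at Hamming distance exactly d from u."""
    result = set()
    for positions in combinations(range(len(u)), d):
        # At each chosen position, substitute every symbol other than the original.
        choices = [[c for c in alphabet if c != u[p]] for p in positions]
        for subs in product(*choices):
            v = list(u)
            for p, c in zip(positions, subs):
                v[p] = c
            result.add("".join(v))
    return result


def neighborhood(u: str, d: int, alphabet: str = "ACGT") -> set[str]:
    """All l-mers at Hamming distance at most d from u."""
    result = set()
    for k in range(d + 1):
        result |= friendhood(u, k, alphabet)
    return result
```

For a DNA 4-mer, the 1-friendhood has 4·3 = 12 members and the 1-neighborhood has 13, since it also contains the l-mer itself.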
For all the LDDMS algorithms, the input is a database S containing n sequences, each of length m, and integers l, d_{1} and d_{2}; the output is the set of all strings of length l that meet both the (l,d_{1})-condition and the (l,d_{2})-condition.
A straightforward solution is the pattern-driven approach. If Σ is the alphabet under consideration, there are |Σ|^{l} possible l-mers. For every such l-mer, check whether it meets both the (l,d_{1})-condition and the (l,d_{2})-condition. If so, output this l-mer. Because it examines all |Σ|^{l} candidates, this algorithm is too slow for all but small l.
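The pattern-driven approach can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function names are hypothetical, and every sequence is assumed to have length at least l.

```python
from itertools import product


def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def min_distance(x: str, seq: str) -> int:
    """Smallest Hamming distance between x and any l-mer of seq."""
    l = len(x)
    return min(hamming(x, seq[j:j + l]) for j in range(len(seq) - l + 1))


def pattern_driven_lddms(seqs, l, d1, d2, alphabet="ACGT"):
    """Enumerate all |alphabet|**l candidates and keep those meeting both conditions."""
    motifs = []
    for letters in product(alphabet, repeat=l):
        x = "".join(letters)
        dists = [min_distance(x, s) for s in seqs]
        # (l,d1)-condition: within d1 in every sequence.
        # (l,d2)-condition: within d2 in at least one sequence.
        if max(dists) <= d1 and min(dists) <= d2:
            motifs.append(x)
    return motifs
```

Each candidate costs O(mnl) to check, so the loop over all candidates dominates the running time.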
In addition to pattern-driven approaches, we also have sample-driven approaches. We could employ the following two-step algorithm: 1) First find all the motifs that satisfy the (l,d_{1})-condition. This can be done using any of the LDMS algorithms. Let C_{1} be the set of these motifs; and 2) For every motif x∈C_{1}, check if x satisfies the (l,d_{2})-condition and if so output x. We call this algorithm LDDMS1. Since qPMS9 is currently the most efficient LDMS algorithm [9], we take advantage of it in LDDMS1 (see Algorithm 1).
Equivalently, we can find the (l,d_{2})-motifs in the first step, and then for every such motif check if it satisfies the (l,d_{1})-condition. We refer to this algorithm as LDDMS2 (see Algorithm 2).
Note that each valid motif has at least one d_{2}-neighbor in at least one of the input sequences. We generate all n(m−l+1) l-mers of the input sequences, m−l+1 from each. The d_{2}-neighborhood of an l-mer u can be found by constructing its neighborhood tree. With u as the root and d_{2} as the height of the tree, the level of a node is the Hamming distance between u and this node. All the nodes of this tree, including the root and the leaves, constitute the d_{2}-neighborhood of u. In Step 3 of LDDMS2, we can employ radix sort and eliminate duplicates. In Step 4 the output O_{2} of valid motifs found will be in sorted order.
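The following Python sketch illustrates the idea behind LDDMS2 as described above; it is not a transcription of Algorithm 2, which is not reproduced in this text. The function names are hypothetical, a `set` plus `sorted()` stands in for the radix sort and duplicate elimination, and the tree enumeration rule (each child changes one position to the right of the last changed position) is one common way to visit every neighbor exactly once.

```python
def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def neighborhood_tree(u, d, alphabet="ACGT"):
    """Yield (node, level) for every node of the d-neighborhood tree rooted at u.

    A child changes one position to the right of the last changed position,
    so each string at distance h from u appears exactly once, at level h.
    """
    def visit(node, level, start):
        yield node, level
        if level == d:
            return
        for p in range(start, len(node)):
            for c in alphabet:
                if c != u[p]:
                    yield from visit(node[:p] + c + node[p + 1:], level + 1, p + 1)

    yield from visit(u, 0, 0)


def satisfies(x, seqs, d):
    """True if x occurs within Hamming distance d in every sequence."""
    l = len(x)
    return all(
        any(hamming(x, s[j:j + l]) <= d for j in range(len(s) - l + 1))
        for s in seqs
    )


def lddms2_sketch(seqs, l, d1, d2, alphabet="ACGT"):
    """Collect the d2-neighborhoods of all l-mers, then filter by the (l,d1)-condition."""
    candidates = set()  # the set removes duplicate candidates
    for s in seqs:
        for j in range(len(s) - l + 1):
            for node, _ in neighborhood_tree(s[j:j + l], d2, alphabet):
                candidates.add(node)
    # sorted() stands in for the radix sort mentioned in the text
    return [x for x in sorted(candidates) if satisfies(x, seqs, d1)]
```

Every candidate satisfies the (l,d_{2})-condition by construction, so only the (l,d_{1})-condition needs to be checked in the final step.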
If d_{2} is very small (for example, d_{2}=0 or 1), we can expect LDDMS2 to run faster than LDDMS1. This is because the d_{2}-neighborhood of any l-mer will be small. However, when d_{2} is large, the neighborhood tree will be large and so will the number of candidate motifs. Therefore, LDDMS2 takes much more time and memory when d_{2} is large. To save time, one idea is to check the candidate motifs concurrently while constructing the neighborhood tree. During the checking process, pruning conditions can be developed such that once certain conditions hold, a node is not explored any deeper. The stronger the pruning condition is, the faster the algorithm will be. Inspired by similar pruning ideas proposed for the LDMS model [16], we develop LDDMS3 (see Algorithm 3).
Definition 2
Given an l-mer u from Sequence i (i∈[1,n]), construct its d_{2}-neighborhood tree. Let x be any node in this tree. Denote by δ(x,i,q) the smallest Hamming distance between x and any l-mer of Sequence q, and by δ(x,i,I) the maximum of δ(x,i,q) over q=1,2,...,n with q≠i.
If v is an l-mer in the sequence S_{q}, we write v⊲_{l}S_{q}. Also, Hd(v,x) is the Hamming distance between v and x. By computing δ(x,i,I), we have the following pruning conditions [16].
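Definition 2 can be made concrete with a short Python sketch. The function names are hypothetical, sequences are indexed from 0 rather than from 1 as in the text, and at least two sequences are assumed so that the maximum is taken over a non-empty set.

```python
def hamming(u, v):
    """Hd(u, v): number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def delta_q(x, seq):
    """delta(x,i,q): smallest Hamming distance between x and any l-mer of sequence q."""
    l = len(x)
    return min(hamming(x, seq[j:j + l]) for j in range(len(seq) - l + 1))


def delta_all(x, seqs, i):
    """delta(x,i,I): maximum of delta(x,i,q) over all sequences q != i (0-based i)."""
    return max(delta_q(x, s) for q, s in enumerate(seqs) if q != i)
```

This version recomputes every distance from scratch; it does not use the incremental update that the later complexity analysis relies on.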
Theorem 1
Traverse the d_{2}-neighborhood tree of u in a depth-first manner and compute δ(x,i,I), where x is a node in the tree and h is the level of x (the root is at level 0):
1) If δ(x,i,I)≤d_{1}, output x;
2) If δ(x,i,I)−d_{1}>d_{2}−h, prune all the descendants of x;
3) If δ(x,i,I)−d_{1}=d_{2}−h, consider only x′ such that x′ is a child of x and δ(x′,i,I)=δ(x,i,I)−1;
4) If δ(x,i,I)−d_{1}=d_{2}−h−1, consider only x′ such that x′ is a child of x and δ(x′,i,I)≤δ(x,i,I).
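A simplified Python sketch of this pruned traversal is given below. It applies only conditions 1 and 2 (output and subtree pruning); conditions 3 and 4, which further restrict which children are visited, are omitted, and δ is recomputed from scratch at each node. The function names are hypothetical and this is not a transcription of Algorithm 3.

```python
def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def delta_all(x, seqs, i):
    """Maximum over sequences q != i of the smallest distance from x to an l-mer of q."""
    l = len(x)
    return max(
        min(hamming(x, s[j:j + l]) for j in range(len(s) - l + 1))
        for q, s in enumerate(seqs) if q != i
    )


def lddms3_sketch(seqs, l, d1, d2, alphabet="ACGT"):
    """Depth-first traversal of every d2-neighborhood tree with conditions 1 and 2."""
    motifs = set()

    def visit(u, node, h, start, i):
        delta = delta_all(node, seqs, i)
        if delta <= d1:
            motifs.add(node)  # condition 1: node meets both conditions
        if h == d2 or delta - d1 > d2 - h:
            return  # leaf reached, or condition 2: no descendant can qualify
        for p in range(start, l):
            for c in alphabet:
                if c != u[p]:
                    visit(u, node[:p] + c + node[p + 1:], h + 1, p + 1, i)

    for i, s in enumerate(seqs):
        for j in range(len(s) - l + 1):
            u = s[j:j + l]
            visit(u, u, 0, 0, i)
    return sorted(motifs)
```

Condition 2 is safe because a descendant of x differs from x in at most d_{2}−h positions, so its δ value can drop by at most d_{2}−h relative to δ(x,i,I).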
Analysis of LDDMS algorithms
Candidate size and expected number of motifs
In this section, we estimate candidate sizes of LDDMS1 and LDDMS2, i.e., C_{1} and C_{2}, and also the expected number of motifs that would be found. Such estimation is useful in computing the time complexities of these two algorithms.
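Recall that in the benchmark dataset all the characters are generated i.i.d. and there are n sequences, each of length m. Given an l-mer M, the number of l-mers that have a Hamming distance of at most d_{1} from M is:
\(N(\Sigma,l,d_{1})=\sum_{i=0}^{d_{1}}\binom{l}{i}(|\Sigma|-1)^{i}\) (1)
where Σ is the alphabet under consideration.
The probability that a randomly chosen l-mer has a Hamming distance of at most d_{1} from M is:
\(\frac{N(\Sigma,l,d_{1})}{|\Sigma|^{l}}\) (2)
The probability that in a sequence of length m there is at least one string u such that u and M are within a Hamming distance of d_{1} is:
\(1-\left(1-\frac{N(\Sigma,l,d_{1})}{|\Sigma|^{l}}\right)^{m-l+1}\) (3)
The probability that a randomly chosen l-mer occurs within a Hamming distance of d_{1} in each of the n input sequences, each of length m, is:
\(\left[1-\left(1-\frac{N(\Sigma,l,d_{1})}{|\Sigma|^{l}}\right)^{m-l+1}\right]^{n}\) (4)
Therefore, the expected number of (l,d_{1})-motifs is:
\(|\Sigma|^{l}\left[1-\left(1-\frac{N(\Sigma,l,d_{1})}{|\Sigma|^{l}}\right)^{m-l+1}\right]^{n}\) (5)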
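Under the independence assumption used in this section, the chain of quantities described above can be evaluated numerically. The sketch below is an illustration with hypothetical function names; it treats all l-mers of a sequence as independent and all symbols as uniformly distributed.

```python
from math import comb


def n_neighbors(sigma: int, l: int, d: int) -> int:
    """N(Sigma,l,d): number of l-mers within Hamming distance d of a fixed l-mer."""
    return sum(comb(l, i) * (sigma - 1) ** i for i in range(d + 1))


def expected_ld_motifs(sigma: int, l: int, d: int, n: int, m: int) -> float:
    """Expected number of random l-mers that occur within distance d in all n sequences."""
    p = n_neighbors(sigma, l, d) / sigma ** l   # a random l-mer is within d of M
    q = 1.0 - (1.0 - p) ** (m - l + 1)          # some l-mer of one sequence is within d
    return sigma ** l * q ** n                  # summed over all sigma**l candidates


# Example call for DNA with n=20 sequences of length m=600.
print(expected_ld_motifs(4, 15, 4, 20, 600))
```

Evaluating the function over a range of (l,d) values gives a quick way to see which instances are expected to produce few spurious motifs.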
Similarly, the probability that a randomly chosen l-mer has a Hamming distance of at most d_{2} from M is:
\(\frac{N(\Sigma,l,d_{2})}{|\Sigma|^{l}}\) (6)
The probability that in a sequence of length m there is at least one string u that has a Hamming distance of at most d_{2} from M is:
\(1-\left(1-\frac{N(\Sigma,l,d_{2})}{|\Sigma|^{l}}\right)^{m-l+1}\) (7)
Therefore, the expected number of (l,d_{2})motifs is:
In all of the above assertions we have assumed that the l-mers of a sequence are independent. Clearly, this is incorrect. However, such analyses have proven useful in estimating the number of motifs in practice (see e.g., [17]). Along these lines, let us look at the expected number of motifs that will be found, i.e., O_{1} or O_{2}. Let M be a random l-mer, and let A_{i} be the event that M has a neighbor within a Hamming distance of d_{2} in exactly i of the input sequences. Similarly, let B_{j} be the event that M has a neighbor whose Hamming distance lies in (d_{2},d_{1}] in exactly j of the input sequences. It should be noted here that if M has a neighbor whose Hamming distance is at most d_{2} in an input sequence, then it automatically also has a neighbor within a Hamming distance of d_{1} in that sequence, since we assume d_{2}<d_{1}.
We want to know the probability that events A_{i} and B_{n−i} both happen, which means that in exactly i of the n input sequences there is an l-mer within a Hamming distance of d_{2} from M and, in each of the remaining n−i input sequences, there is an l-mer whose Hamming distance from M lies in (d_{2},d_{1}].
Given an l-mer M, the probability that a random string u of length l has a Hamming distance in the range (d_{2},d_{1}] from M is:
In one sequence, there are m−l+1 such l-mers. The probability that in such a sequence there is at least one l-mer within a Hamming distance of d_{1} but no l-mer within a Hamming distance of d_{2} from M is:
Therefore, the probability that a random l-mer meets both the (l,d_{1})-condition and the (l,d_{2})-condition on such a dataset is:
In conclusion, the expected number of spurious motifs we can find in the LDDMS model is:
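The expressions for the last four steps are not present in this copy of the text. As a hedged reconstruction under the same independence assumption, they can be written as follows, where the symbols P_k, Q_k and R are introduced here for illustration and are not taken from the original.

```latex
\begin{align*}
P_k &= \frac{N(\Sigma,l,d_k)}{|\Sigma|^{l}}, \qquad
Q_k = 1-(1-P_k)^{m-l+1}, \qquad k\in\{1,2\},\\
\Pr\bigl[d_2 < Hd(u,M) \le d_1\bigr] &= P_1 - P_2,\\
R &= (1-P_2)^{m-l+1} - (1-P_1)^{m-l+1} = Q_1 - Q_2,\\
\Pr[\text{both conditions hold}] &= \sum_{i=1}^{n}\binom{n}{i}\,Q_2^{\,i}\,R^{\,n-i}
 = Q_1^{\,n} - R^{\,n},\\
E[\text{number of motifs}] &= |\Sigma|^{l}\,\bigl(Q_1^{\,n} - R^{\,n}\bigr).
\end{align*}
```

Here R is the probability that a sequence contains an l-mer within d_{1} of M but none within d_{2}, and the sum over i≥1 collapses by the binomial theorem because Q_{2}+R=Q_{1}.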
Time complexity of the algorithms
Note that all three algorithms (LDDMS1, LDDMS2, and LDDMS3) can be readily parallelized. For LDDMS1, there are parallel versions of LDMS solvers, such as PMS9. For every candidate motif, the checking process is independent and can also be parallelized. For LDDMS2 and LDDMS3, we need to generate the neighborhood tree for each of the n(m−l+1) l-mers of the input sequences. These are n(m−l+1) independent subproblems that can be assigned to different processors. However, in this paper, we only implement these algorithms sequentially and analyze the time complexity of the sequential versions.
Given a candidate motif of length l, checking whether it meets the (l,d_{1})-condition and the (l,d_{2})-condition in an input of n sequences, each of length m, takes O((m−l+1)nl)=O(mnl) time. It is easy to see that the brute-force algorithm takes O(|Σ|^{l}mnl) time.
For LDDMS1, qPMS9 can be implemented in O(m^{k}mnN(Σ,l,d_{1})) time. N(Σ,l,d_{1}) has the same definition as in Eq. 1. k is a dynamic variable between 1 and n. We get the following:
Theorem 2
The time complexity of LDDMS1 algorithm is
where C_{1} is the candidate size of (l,d_{1})-motifs. An expected number can be obtained from Eq. 5.
For LDDMS2, in Step 1 and Step 2, generating the neighborhoods of all the l-mers of each of the input sequences takes O((m−l+1)nN(Σ,l,d_{2})) time. In Step 3, radix sort and removing the duplicates take O((m−l+1)nlN(Σ,l,d_{2})) time. Thus we arrive at:
Theorem 3
LDDMS2 can be implemented in time
where C_{2} is the candidate size of (l,d_{2})-motifs. An expected number is given in Eq. 8.
The following lemma from [16] is useful in computing the time complexity of LDDMS3.
Lemma 1
For a node x in the neighborhood tree, δ(x,i,I) can be updated in O(mn) time.
Theorem 4
LDDMS3 can be implemented in time
Note that this is only the worst-case time complexity, and d_{1} does not appear in this expression. The actual run time could be much less because many branches can be pruned.
Results and discussion
LDDMS1, LDDMS2 and LDDMS3 were tested on synthetic datasets as well as real datasets. We evaluated our algorithms on a Dell Precision Workstation T7910 running RHEL 7.0 on two sockets each containing 8 Dual Intel Xeon Processors E5-2667 (8C HT, 20MB Cache, 3.2GHz) and 256GB RAM.
Synthetic datasets
Following the tradition, we employ combinations of (l,d_{1}) that are challenging [3]. We vary d_{2} from 0 to ⌊d_{1}/2⌋. The challenging instances of n=20,m=600 for DNA sequences and the values of d_{2} for carrying out the test are listed in Table 1.
The challenging instances correspond to a small number of spurious motifs. This makes the candidate size in LDDMS1 very small, and hence the time spent in Step 2 of LDDMS1 is negligible. To avoid this problem, we slightly change the way we plant the motifs. We randomly generate two l-mers, M_{1} and M_{2}, whose Hamming distance is q. Then we insert M_{1} into each of the first ⌈n/2⌉ input sequences and M_{2} into each of the remaining ⌊n/2⌋ input sequences. A detailed algorithm for generating the test cases is given in Algorithm 4.
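A test-case generator along these lines might look like the following Python sketch. Algorithm 4 itself is not reproduced in this text, so the details here are assumptions: background characters are drawn uniformly, each sequence receives one exact copy of its motif at a uniformly random position, and the function and parameter names are hypothetical.

```python
import random


def generate_instance(n, m, l, q, alphabet="ACGT", seed=None):
    """Return (sequences, M1, M2) with Hd(M1, M2) == q and one motif planted per sequence."""
    rng = random.Random(seed)
    m1 = [rng.choice(alphabet) for _ in range(l)]
    m2 = list(m1)
    for p in rng.sample(range(l), q):  # q distinct positions to change
        m2[p] = rng.choice([c for c in alphabet if c != m1[p]])
    m1, m2 = "".join(m1), "".join(m2)

    first_half = (n + 1) // 2  # ceil(n/2) sequences receive M1
    seqs = []
    for k in range(n):
        background = [rng.choice(alphabet) for _ in range(m)]
        motif = m1 if k < first_half else m2
        pos = rng.randrange(m - l + 1)  # random planting position
        background[pos:pos + l] = motif
        seqs.append("".join(background))
    return seqs, m1, m2
```

For example, `generate_instance(20, 600, 15, 2, seed=0)` would produce a DNA instance of the size used in the experiments, with two planted 15-mers at Hamming distance 2.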
In this way, the common neighbors that are within a Hamming distance of d_{2} of both M_{1} and M_{2} are the (l,d_{1},d_{2})-motifs we plant. Generally, when q is small, there are more common neighbors between M_{1} and M_{2}. Conversely, when q is large, there are fewer common neighbors between M_{1} and M_{2}. By varying q, we can control the number of output motifs. A theorem proposed in [8] proves to be useful here.
Theorem 5
Two l-mers a and b have a common neighbor M such that Hd(a,M)≤d_{a} and Hd(b,M)≤d_{b} if and only if Hd(a,b)≤d_{a}+d_{b}.
Applying the above theorem, q must be at most 2d_{2} for M_{1} and M_{2} to have common neighbors that are within a Hamming distance of d_{2} of both. When d_{2}=0, we set q=0; then at least N(Σ,l,d_{2}) (l,d_{1},d_{2})-motifs can be found. When d_{2}≠0 and q=2d_{2}, at least \(\binom{2d_{2}}{d_{2}}\) (l,d_{1},d_{2})-motifs can be found. However, the number of planted (l,d_{1})-motifs, i.e., common neighbors that are within a Hamming distance of d_{1} of both M_{1} and M_{2}, is much larger.
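The counts quoted above can be checked by brute force on small examples. The sketch below enumerates all l-mers and keeps the common neighbors of two given l-mers; the function names are hypothetical and the enumeration is only feasible for small l.

```python
from itertools import product


def hamming(u, v):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(u, v))


def common_neighbors(a, b, da, db, alphabet="ACGT"):
    """All l-mers x with Hd(a, x) <= da and Hd(b, x) <= db, found by enumeration."""
    return [
        "".join(x)
        for x in product(alphabet, repeat=len(a))
        if hamming(x, a) <= da and hamming(x, b) <= db
    ]


# Two 5-mers at Hamming distance q = 4 with d2 = 2, i.e. q = 2 * d2.
neighbors = common_neighbors("AAAAA", "CCCCA", 2, 2)
print(len(neighbors))
```

When q=2d_{2}, a common neighbor must copy exactly d_{2} of the differing positions from each l-mer and leave all other positions unchanged, which is where the binomial count comes from.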
We have tested our algorithms on challenging instances of (l,d_{1}) from (7,1) up to (19,7), where d_{2} varies from 0 to ⌊d_{1}/2⌋. Tables 2, 3 and 4 show the running times of LDDMS1, LDDMS2 and LDDMS3 on different (l,d_{1},d_{2}) values. For small instances such as (l,d_{1})=(7,1), (8,1), (9,2), (10,2), LDDMS1 runs faster than LDDMS2 and LDDMS3. This is because qPMS9 is fast and there are only a few (l,d_{1})-motifs to check. However, for moderate and relatively large instances, a small value of d_{2} makes LDDMS2 run much faster than LDDMS1. For example, for (l,d_{1},d_{2})=(17,6,1), LDDMS1 takes 29.36 minutes while LDDMS2 takes only 9.19 minutes. However, for large values of d_{2}, LDDMS2 is slow. Compared to LDDMS2, LDDMS3 performs much better on large instances, although it takes more time when d_{2} is small. For example, it can solve instances that LDDMS2 cannot solve, such as (l,d_{1},d_{2})=(18,6,3) and (19,7,3).
As (l,d_{1}) instances become larger, all the LDDMS algorithms take more time. However, an interesting observation is that for a fixed (l,d_{1}) instance, increasing the value of d_{2} makes LDDMS1 run faster but LDDMS2 and LDDMS3 slower. This is because of the way we generate the test cases. If d_{2} is very small, then the two l-mers we plant will be almost identical. In this case, we will find many (l,d_{1})-motifs at the end of Step 2 of LDDMS1. However, small values of d_{2} make the neighborhood tree small, so LDDMS2 and LDDMS3 run faster.
Real datasets
We also used the datasets discussed in [18] to test our algorithms. We chose a group of real datasets. We excluded datasets with only one input sequence because such datasets are not meaningful for our test.
We chose two relatively large values, 18 and 19, for the motif length. We then recomputed d_{1} so that each (l,d_{1}) is a challenging instance, since each dataset has a different number of input sequences and different sequence lengths. However, as we noted before, the challenging instances make the candidate size in LDDMS1 very small. In this case, we cannot manually plant a motif to avoid this problem. Therefore, we increment d_{1} by 2. We tested the minimum and maximum values of d_{2}, i.e., 0 and ⌊d_{1}/2⌋. Table 5 shows the dataset information and the (l,d_{1},d_{2}) instances we have tested.
Table 6 shows the running times of LDDMS1, LDDMS2 and LDDMS3 on real datasets. On the real datasets, for fixed (l,d_{1}), changing d_{2} does not affect the running time of LDDMS1 very much. This is because for a real dataset the candidate size, i.e., the number of (l,d_{1})-motifs, is unchanged. This is also true for the number of (l,d_{2})-motifs for LDDMS2. Moreover, for a fixed d_{1}, increasing l makes LDDMS1 run faster because the instance becomes less challenging. Generally, when d_{2} is large, LDDMS2 takes much more time. However, it is hard to say whether LDDMS1 or LDDMS3 performs better. For example, on the real dataset dm05r, when (l,d_{1},d_{2})=(18,4,2), LDDMS3 (4.07 s) outperforms LDDMS1 (10.79 s). However, on the same dataset, when (l,d_{1},d_{2})=(19,4,2), LDDMS1 (2.31 s) outperforms LDDMS3 (4.55 s). The actual running time of these algorithms is highly dependent on the dataset and the (l,d_{1},d_{2}) values.
Conclusions
Efficient motif search algorithms are crucial in solving many bioinformatics problems effectively. In this paper, we have presented the (l,d_{1},d_{2})-motif model, a more general model for the motif search problem. We have also proposed LDDMS1, LDDMS2 and LDDMS3, three exact and efficient algorithms to solve the LDDMS problem. Theoretical analysis shows that our algorithms are very competitive. Experimental results also show that our algorithms perform well in practice.
In the future we will focus on solving harder LDDMS instances, including those involving protein strings. We also plan to implement our algorithms in parallel.
References
1. Pal S, Xiao P, Rajasekaran S. Efficient sequential and parallel algorithms for finding edit distance based motifs. BMC Genom. 2016; 17(4):465.
2. Xiao P, Rajasekaran S. Efficient exact algorithms for LDD motif search. In: 2017 IEEE 7th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS). IEEE: 2017. p. 1–1.
3. Xiao P, Pal S, Rajasekaran S. qPMS10: A randomized algorithm for efficiently solving quorum planted motif search problem. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE: 2016. p. 670–5.
4. Xiao P, Pal S, Rajasekaran S. Randomised sequential and parallel algorithms for efficient quorum planted motif search. Int J Data Min Bioinforma. 2017; 18(2):105–24.
5. Yu Q, Huo H, Zhang Y, Guo H. Pairmotif: a new patterndriven algorithm for planted (l, d) DNA motif search. PLoS ONE. 2012; 7(10):48442.
6. Blanchette M, Schwikowski B, Tompa M. Algorithms for phylogenetic footprinting. J Comput Biol. 2002; 9(2):211–23.
7. Tanaka S. Improved exact enumerative algorithms for the planted (l,d)motif search problem. IEEE/ACM Trans Comput Biol Bioinforma. 2014; 11(2):361–74.
8. Nicolae M, Rajasekaran S. Efficient sequential and parallel algorithms for planted motif search. BMC Bioinforma. 2014; 15(1):1.
9. Nicolae M, Rajasekaran S. qPMS9: An efficient algorithm for quorum planted motif search. Sci Rep. 2015; 5:7813. Nature Publishing Group.
10. Pevzner PA, Sze SH. Combinatorial approaches to finding subtle signals in DNA sequences. In: ISMB, vol. 8: 2000. p. 269–78.
11. Karlin S, Ost F, Blaisdell BE. Patterns in DNA and Amino Acid Sequences and Their Statistical Significance. In: Waterman MS, editor. Mathematical Methods for DNA Sequences. Boca Raton: CRC Press Inc: 1989.
12. Rocke E, Tompa M. On finding novel gapped motifs in DNA sequences. In: RECOMB98: Proceedings of the Second Annual International Conference on Computational Molecular Biology. ACM: 1998. p. 228–33.
13. Sagot MF. Spelling Approximate Repeated or Common Motifs using a Suffix Tree. In: LATIN’98: Theoretical Informatics. Brazil: Springer: 1998. p. 374–90.
14. Pathak S, Rajasekaran S, Nicolae M. EMS1: An Elegant Algorithm for Edit Distance Based Motif Search. Int J Found Comput Sci. 2013; 24(04):473–86.
15. Wang X, Miao Y. GAEM: A Hybrid Algorithm Incorporating GA with EM for Planted Edited Motif Finding Problem. Curr Bioinforma. 2014; 9(5):463–9.
16. Davila J, Balla S, Rajasekaran S. Fast and practical algorithms for planted (l,d) motif search. IEEE/ACM Trans Comput Biol Bioinforma. 2007; 4(4):544–52.
17. Rajasekaran S, Nicolae M. An elegant algorithm for the construction of suffix arrays. J Discret Algorithm. 2014; 27:21–8.
18. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005; 23(1):137–44.
Acknowledgements
Not applicable.
Funding
This work has been supported in part by the NSF grants 1447711, 1743418 and 1843025. Publication costs have been funded by these grants as well.
Availability of data and materials
The real DNA sequence data can be downloaded from [18]: http://bio.cs.washington.edu/assessment/download.html
About this supplement
This article has been published as part of BMC Genomics Volume 20 Supplement 5, 2019: Selected articles from the 7th IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2017): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume20supplement5.
Author information
Contributions
PX, MS and SR conceived the study. SR and PX designed the algorithms. PX implemented the algorithms and carried out the experiments. SR, PX, and MS analyzed the results and wrote the paper. All authors reviewed the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Xiao, P., Schiller, M. & Rajasekaran, S. Novel algorithms for LDD motif search. BMC Genomics 20 (Suppl 5), 424 (2019). https://doi.org/10.1186/s1286401957016
DOI: https://doi.org/10.1186/s1286401957016