- Open Access
MODSIDE: a motif discovery pipeline and similarity detector
BMC Genomics volume 19, Article number: 755 (2018)
Previous studies demonstrate the usefulness of using multiple tools and methods for improving the accuracy of motif detection. Over the past years, numerous motif discovery pipelines have been developed. However, they typically report only the top ranked results either from individual motif finders or from a combination of multiple tools and algorithms.
Here we present MODSIDE, a motif discovery pipeline and similarity detector. The pipeline integrated four de novo motif finders: ChIPMunk, MEME, Weeder, and XXmotif. It also incorporated a motif similarity detection tool MOTIFSIM. MODSIDE was designed for delivering not only the predictive results from individual motif finders but also the comparison results for multiple tools. The results include the common significant motifs from multiple tools, the motifs detected by some tools but not by others, and the best matches for each motif in the motif collection of multiple tools. MODSIDE also possesses other useful features for merging similar motifs and clustering motifs into motif trees.
We evaluated MODSIDE and its adopted motif finders on 16 benchmark datasets. The statistical results demonstrate MODSIDE achieves better accuracy than individual motif finders. We also compared MODSIDE with two popular motif discovery pipelines: MEME-ChIP and RSAT peak-motifs. The comparison results reveal MODSIDE attains similar performance as RSAT peak-motifs but better accuracy than MEME-ChIP. In addition, MODSIDE is able to deliver various comparison results that are not offered by MEME-ChIP, RSAT peak-motifs, and other existing motif discovery pipelines.
Detecting binding site motifs can reveal the transcription factors that control the gene expression. Hence, numerous tools and methods have been developed for finding binding site motifs. Nevertheless, the results reported from different tools for an identical dataset are diverse. This is largely due to the fact that different tools implemented different algorithms and possesses unique features for discovering the motifs. Therefore, using multiple tools and methods has been suggested as it improved the accuracy of the motif detection [1,2,3,4]. The suggestion has inspired the development of several motif discovery pipelines. They can be standalone applications on standalone servers or pipelining Web servers. Recent development tends to be pipelining Web servers, which eliminate the complications of software installations and configurations required by standalone applications in order to serve more users via the Web. Another advantage is that it allows running multiple tools and methods at once on the same server and eliminates the manual runs of the same dataset on several different motif finders residing on the same standalone server or on several different Web servers.
The research community has seen several motif discovery pipelines such as W-ChIPMotifs , GimmeMotifs , CompleteMOTIFS , MEME-ChIP , RSAT peak-motifs , MotifLab , and Promzea  among many others. Generally, the pipelines incorporated multiple algorithms or tools. They were designed to complement individual motif finders for achieving better accuracy. The results can be clustered and ranked for obtaining the top significant motifs. Some pipelines allow verifying the results with the reference databases such as TRANSFAC , Jaspar , and UniPROBE  by using a motif comparison tool such as STAMP  or TOMTOM .
Table 1 gives a summary of some current pipelines. We briefly discuss some of their general limitations here. W-ChIPMotifs was designed for mouse and human species only. There is no option for running different combinations of motif finders in the pipeline. The results include the top ranked motifs and their matches from the reference database by using STAMP tool. GimmeMotifs is a standalone application that has several functions including motif finding. However, the results from motif discovery module only present the top ranked motifs and their matches in the reference database. CompleteMOTIFS allowed selecting the tools to run the motif discovery. The results showed the top ten predicted motifs from each selected tool and their matches in the reference database via STAMP tool. However, this pipeline is no longer available for use. MEME-ChIP reports the predicted motifs from each tool and their matches in the reference database by using TOMTOM. RSAT peak-motifs allows selecting the motif discovery algorithms and it reports the predicted motifs from each selected algorithm with their matches in the reference database. MotifLab is a standalone application with a wide-range of functions including motif discovery. As other pipelines, only the top ranked motifs are presented in the results. Promzea is specialized for maize, rice, and Arabidopsis thaliana. It presents only the top predicted motifs that are not verified with the reference database.
Although existing pipelines were designed with their unique integrations and the methods for ranking and selecting the significant motifs, they do not allow obtaining different comparison results for multiple tools and methods. They generally report the top ranked results either from individual motif finders or from a combination of multiple predictive algorithms and tools.
In this work, we incorporated four de novo motif finders namely ChIPMunk , MEME , Weeder , and XXmotif  into a pipeline called MODSIDE. The pipeline also integrated a motif similarity detection tool MOTIFSIM . All adopted tools are open-source software. We chose ChIPMunk, MEME, and Weeder as they are widely used and some of their features are complemented. Since XXmotif is a general-purpose motif finder and it has some advanced features over three other motif finders, we adopted it for the pipeline. The features of these motif finders are presented in the Implementation section. We chose MOTIFSIM for similarity detection because of its unique features that are not offered by all existing pipelines. They include (1) the common (global) significant motifs from multiple tools, (2) the motifs detected by some tools but not by others (the global and local significant motifs), and (3) the best matches for each motif in the motif collection of multiple tools. Besides the unique features, MOTIFSIM also possesses other useful features for verifying the predicted motifs with the reference databases, merging similar motifs, and clustering predicted motifs into motif trees. MODSIDE pipeline delivers not only the results from individual motif finders but also the comparison results from the pipeline itself.
MEME (Multiple Expectation Maximization for Motif Elicitation) is a well-known motif discovery tool developed for targeting un-gapped motifs in unaligned DNA or protein sequences . MEME algorithm is based on a profile-based method that implemented the expectation maximization (EM) . The profile-based methods are faster than consensus-based methods but they suffer from lower accuracy because they tend to be trapped in a local optimum . MEME algorithm removes the previous discovered motifs when it searches for new motifs. Thus, it can only model a single motif at a time and it does not detect alternative binding motifs, which are motifs for co-factors . MEME also requires removing duplicate sequences and those with low information prior to running the tool . Another drawback of MEME is splitting variable-length patterns into two or more separate motifs . MEME was originally designed for discovering short motifs. However, its later versions allow finding longer motifs. MEME possesses numerous features for discovering motifs. These features are presented in the Additional file 1. We adopted version 4.11.4 for the pipeline.
ChIPMunk is a fast heuristic motif finder developed for analyzing high-throughput sequencing data . ChIPMunk is also a profile-based method. Its algorithm implemented an iterative approach that combines the greedy optimization with bootstrapping. ChIPMunk evaluates the motif profiles based on the Kullback Discrete Information Content (KDIC). It employs a greedy approach for discovering the motif profiles with high KDIC values. The motif profiles are ranked based on Position Weight Matrix (PWM) scores. They are subsequently improved by an EM iterative process. ChIPMunk’s performance is better than MEME in term of runtime and prediction quality .
ChIPMunk was originally designed for discovering the motifs in PWMs for transcription factor binding sites. It was later adapted for handling ChIP-Seq data. ChIPMunk contains numerous attributes that are presented in the Additional file 1 for finding motifs. We adopted version 7 for the pipeline.
Weeder was designed for finding DNA motifs . Its algorithm is based on a pattern-driven approach, which is a sub-category of the consensus-based method . Weeder algorithm implemented a suffix tree based exhaustive enumeration and extended it for searching longer patterns . The algorithm was designed for finding subtle similarities in small datasets, rather than large similarities in large datasets . Due to the nature of consensus-based method, Weeder is significantly slower than MEME and ChIPMunk but its prediction quality is higher. Weeder also comprises several attributes for finding DNA motifs. They are presented in the Additional file 1. We adopted version 1.4.2 for the pipeline.
XXmotif is a general-purpose method, which was designed for finding enriched motifs in nucleotide sequences . However, unlike other motif finders, XXmotif is capable for optimizing the statistical significance of PWMs directly. It can also score conservation and positional clustering of motifs . XXmotif algorithm is a combination of the pattern-based enumerative approach and the iterative PWM refinement . The algorithm consists of masking stage, pattern stage, and PWM stage. The masking stage masks out the repeat regions, compositionally biased segments, and homologous segment pairs. The pattern stage calculates enrichment P-values of degenerate seed patterns. The PWM stage optimizes candidate PWMs iteratively . The experimental results in  showed XXmotif has faster runtime and higher sensitivity than MEME and Weeder. In addition, the masking stage makes XXmotif more sophisticated, as this stage does not exist in MEME, ChIPMunk, and Weeder. As other tools, XXmotif provides a wide-range of features for finding motifs. We adopted its current version for the pipeline.
The pipeline incorporated MOTIFSIM version 2.2. MOTIFSIM allows comparing the results from multiple tools for attaining the common significant motifs, the motifs reported by some tools but not by others, as well as the best matches for each predicted motif in the motif collection for multiple tools. The results from multiple tools can be verified with the reference database such as TRANSFAC, Jaspar, or UniPROBE. Since the predicted motifs reported by a single tool or multiple tools can be redundant motifs. MOTIFSIM provides an option for merging them to reduce the number of redundant motifs. The new motif is formed only of it is within the similarity threshold with both of its parents . Another useful feature of MOTIFSIM is clustering the motifs into motif trees. The tree describes the relationship between motifs. MOTIFSIM calculates the similarity scores between motifs and builds two distance matrices. One is for the global significant motifs and the other is for every motif in the motif collection of multiple tools. The distance matrices contain the best similarity scores between motifs. MOTIFSIM uses the distance matrices to build the motif trees by using hclust function in R. This function implemented the hierarchical clustering algorithm . Hence, the motifs that reside in the same branch of the tree are more similar to one another. The height of the branch also shows the degree of similarity. The motifs that are connected by shorter branches are more similar than those that are connected by taller branches.
MODSIDE web Interface
The pipeline was assessed on 16 benchmark sequence datasets from Tompa et al. in Table 2 . They came from Homo sapiens, Mus musculus, and Saccharomyces cerevisiae species. The datasets can be generic or Markov type . The generic type was generated by obtaining the promoter sequences randomly and implanted the known binding sites of the same species into those sequences. The Markov type was obtained by generating random sequences using Markov chain order of 3 and then implanted the known binding sites of the same species into those sequences. Each binding site embedded in a sequence belongs to a specific transcription factor in the TRANSFAC database. The transcription factor embedded in each sequence is listed in Table 2. We selected the benchmark datasets so that each sequence in a dataset has at least one or more embedded binding sites of the same transcription factor. These benchmarks were used to run MODSIDE with all motif finders selected. They were also used to run MEME-ChIP and RSAT peak-motifs.
We evaluated MODSIDE in two phases. In the first phase, we assessed the accuracy of MODSIDE by comparing its results with the results from individual motif finder in the pipeline. The objective is to observe the efficiency of the pipeline and its motif finders. We used the assessment method, the benchmark sequence datasets, and the on-line assessment tool from Tompa et al. for this evaluation . Tompa et al. introduced a comprehensive method for assessing computational tools for discovery of transcription factors binding sites. They built 52 benchmark datasets for evaluating 13 tools in their assessment. The technique used for creating these datasets was presented in the Datasets section. We employed six statistics from Tompa et al. for this evaluation. They are presented in the Additional file 1. The authors also built an assessment tool, which calculates several statistics including those used in this evaluation. The benchmark datasets and the assessment tool are available on-line. They can be used for assessing existing and future tools as well. We measured the accuracy of ChIPMunk, MEME, Weeder, XXmotif, and MODSIDE on 16 benchmark datasets. For each tool T and each dataset D, we have a set of known binding sites and a set of predicted binding sites. Thus, we can measure the accuracy of T on D at the nucleotide level and at the site level. At the nucleotide level, we calculated four statistics: Sensitivity (nSn), Positive Predictive Value (nPPV), Specificity (nSP), and Correlation coefficient (nCC). At the site level, we calculated two statistics that are Sensitivity (sSn) and Positive Predictive Value (sPPV). Since different tools produce different numbers of significant motifs by using the thresholds presented in the section MODSIDE Web Interface, we selected all significant motifs from each tool. We compared the significant motifs from these tools for the same sequence dataset by using MOTIFSIM for obtaining the global significant motifs . Since MOTIFSIM identifies a set of common significant motifs reported by four tools, we selected the best common significant motif based on two criteria. First, it must represent the popular vote by majority of the tools. Second, it has the highest rank of similarity score. We assessed the accuracy of the top significant motif reported by each tool by using six statistics above. We then compared the accuracy for identifying the known motif of each tool including MODSIDE.
In the second phase, we compared MODSIDE with MEME-ChIP and RSAT peak-motifs for the following reasons. First, they are widely used. Second, they have no limitation for input species. Third, they have a user-friendly Web interface. Fourth, MEME-ChIP is based on a profile-based method, which has a lower accuracy while RSAT peak-motifs is based on a word-based method or consensus-based method, which has a higher accuracy. Hence, we expected to see RSAT peak-motifs outperforms MEME-ChIP. Alternatively, MODSIDE has a combination of both profile-based method and consensus-based method. This characteristic makes it interesting to observe the performance of each pipeline. In addition, all three pipelines have no limitation for input sequences as well as file size. Finally, like MODSIDE, both MEME-ChIP and RSAT peak-motifs have a feature for reporting the results of individual motif finders. Table 3 shows the characteristics of each pipeline.
We used the default setting provided by each pipeline to run the benchmark datasets in Table 2. The significant motifs were selected by using a similarity cut-off of ≥75% for MODSIDE and an E-value of ≤0.05 for MEME-ChIP and RSAT peak-motifs. We selected the top significant motif from each pipeline for each sequence dataset. We then calculated six statistics above for each top significant motif.
MODSIDE versus ChIPMunk, MEME, Weeder, and XXmotif
We measured the accuracy of each tool by calculating six statistics in the Evaluation section for the top significant motif produced by each tool for the same sequence dataset. The results of four motif finders and MODSIDE on 16 benchmark datasets are in the Additional file 1: Figures S1-S16. The absent tools in the figures did not report any significant motif. They either failed to detect any motif or their reported motifs did not pass the significant threshold. This is due to the nature design and implementation of each tool. MEME and XXmotif did not report any significant motif for ten sequence datasets: hm08m, hm19g, hm22g, hm22m, mus09g, mus11m, yst01g, yst02g, yst03m, and yst06g. XXmotif failed to detect the known motif NF-kappaB although other tools identified it for sequence dataset hm17g. Besides, XXmotif and MEME did not report any significant motif for the sequence datasets mus10g and yst09g respectively. We calculated the average statistics for each tool including MODSIDE on 16 sequence datasets. The average result reveals MODSIDE attains better accuracy than individual motif finders. Figure 2 shows MODSIDE in the top rank followed by Weeder, MEME, ChIPMunk, and XXmotif respectively. The calculation can also be found in the Additional file 1: Table S1.
MODSIDE versus MEME-ChIP and RSAT peak-motifs
We compared the accuracies of MEME-ChIP, RSAT peak-motifs, and MODSIDE by calculating six statistics for the top significant motif from each pipeline for each sequence dataset in Table 2. The statistical results are in Additional file 1: Figures S17-S32. Most of the figures do not show MEME-ChIP as it did not report any significant motif except for the dataset hm04g in Additional file 1: Figure S18. This is due to the nature design and implementation of MEME-ChIP and its components. All pipelines failed to identify the known motifs for the datasets hm01g, hm04g, hm15g, hm22g, mus09g, and yst01g. Again, this is due to the nature design and implementation of each pipeline and its components. For the rest of the datasets, either RSAT peak-motifs or MODSIDE can identify the known motifs with various degrees of accuracies. However, both RSAT peak-motifs and MODSIDE successfully identified the known motif NF-kappaB for the dataset hm17g. We calculated the average statistics for each pipeline on all sequence datasets as shown in Fig. 3 and in the Additional file 1: Table S2. MEME-ChIP shows a poorer accuracy than RSAT peak-motifs and MODSIDE. Again, this can be caused by the nature design and implementation of MEME-ChIP as presented above. However, both RSAT peak-motifs and MODSIDE expose a similar performance, as their average accuracies are quite similar. Nevertheless, MODSIDE has more advantages than MEME-ChIP and RSAT peak-motifs because it offers various comparison results that are not offered by MEME-ChIP, RSAT peak-motifs, and other existing pipelines.
We developed MODSIDE for motif discovery and similarity detection. The pipeline delivers the predicted motifs from ChIPMunk, MEME, Weeder, and XXmotif. It also provides various comparison results for multiple motif finders. The comparison results include the common significant motifs, the motifs detected by some tools but not by others, as well as the best matches for each predicted motif in the collection of multiple tools. Besides, the pipeline allows comparing the predicted motifs with the reference databases for obtaining similar motifs. It also allows merging similar motifs and clustering the results into motif trees. We assessed MODSIDE and its motif finders on 16 benchmark datasets. The statistical results reveal MODSIDE attains better accuracy than its adopted motif finders. We also compared MODSIDE with MEME-ChIP and RSAT peak-motifs. The comparison results show MODSIDE and RSAT peak-motifs achieve similar performance while MEME-ChIP has a lower accuracy than other two pipelines. Although the performance of MODSIDE is comparable to RSAT peak-motifs, it offers various comparison results that are not offered by RSAT peak-motifs and other existing motif discovery pipelines.
Availability and requirements
Project name: modside
Project home page: http://modside.org/
Operating system(s): Linux
Other requirements: Apache2 Web server, open source Prince software package, and WebLogo v. 3.4
Any restrictions to use by non-academics: None
Kullback Discrete Information Content
Position Weight Matrix
Harbison C, Gordon D, Lee T, Rinaldi N, Macisaac K, Danford T, et al. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431(7004):99–104.
Hu J, Li B, Kihara D. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 2005;33(15):4899–913.
MacIsaac KD, Fraenkel E. Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol. 2006;2(4):e36.
Hu J, Yang YD, Kihara DEMD. An ensemble algorithm for discovering regulatory motifs in DNA sequences. BMC Bioinformatics. 2006;7:342.
Jin VX, Apostolos J, Nagisetty NS, Farnham PJ. W-ChIPMotifs: a web application tool for de novo motif discovery from ChIP-based high-throughput data. Bioinformatics. 2009;25(23):3191–3.
van Heeringen SJ, Veenstra GJ. GimmeMotifs: a de novo motif prediction pipeline for ChIP-sequencing experiments. Bioinformatics. 2011;27(2):270–1.
Kuttippurathu L, Hsing M, Liu Y, Schmidt B, Maskell DL, Lee K, He A, Pu WT, Kong SW. CompleteMOTIFs: DNA motif discovery platform for transcription factor binding experiments. Bioinformatics. 2011;27(5):715–7.
Machanick P, Bailey TL. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics. 2011;27(12):1696–7.
Thomas-Chollier M, Herrmann C, Defrance M, Sand O, Thieffry D, van Helden J. RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res. 2011;40(4):e31.
Klepper K, Drabløs F. MotifLab: a tools and data integration workbench for motif discovery and regulatory sequence analysis. BMC Bioinformatics. 2013;14:9.
Liseron-Monfils C, Lewis T, Ashlock D, McNicholas PD, Fauteux F, Strömvik M, Raizada MN. Promzea: a pipeline for discovery of co-regulatory motifs in maize and other plant species and its application to the anthocyanin and phlobaphene biosynthetic pathways and the maize development atlas. BMC Plant Biol. 2013;13:42.
Matys V, Fricke E, Geffers R, Gössling E, Haubrock M, Hehl R, et al. TRANSFAC®: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003;31(1):374–8.
Mathelier A, Fornes O, Arenillas DJ, Chen CY, Denay G, Lee J, et al. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2016;44(D1):D110–5.
Newburger DE, Bulyk ML. UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2009;37(Database issue):D77–82.
Mahony S, Benos PV. STAMP: a web tool for exploring DNA-binding motif similarities. Nucleic Acids Res. 2007;35(Web Server issue):W253–8.
Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8(2):R24.
Kulakovskiy IV, Boeva VA, Favorov AV, Makeev VJ. Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics. 2010;26(20):2622–3.
Bailey T, Williams N, Misleh C, Li W. MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 2006;34(Web Server issue):W369–73.
Pavesi G, Mauri G, Pesole G. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics. 2001;17:S207–14.
Luehr S, Hartmann H, Söding J. The XXmotif web server for eXhaustive, weight matriX-based motif discovery in nucleotide sequences. Nucleic Acids Res. 2012;40(Web Server issue):W104–9.
Tran NTL, Huang CH. MOTIFSIM 2.1: an enhanced software platform for detecting similarity in multiple DNA motif data sets. J Comput Biol. 2017;24(9):895–905.
Jia C, Carson MB, Wang Y, Lin Y, New Exhaustive LHA. Method and strategy for finding motifs in ChIP-enriched regions. PLoS One. 2014;9(1):e86044.
Tran NTL, Huang C-H. A survey of motif finding web tools for detecting binding site motifs in ChIP-Seq data. Biol Direct. 2014;9:1–22.
Liu B, Yang J, Li Y, McDermaid A, Ma Q. An algorithmic perspective of de novo cis-regulatory motif finding based on ChIP-seq data. Brief Bioinform. 2017;19(5):1069–81.
Zambelli F, Pesole G, Pavesi G. Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform. 2013;14(2):225–37.
Hartmann H, Guthöhrlein EW, Siebert M, Luehr S, Söding J. P-value-based regulatory motif discovery using positional weight matrices. Genome Res. 2013;23(1):181–94.
Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23(1):137–44.
This work was supported by U.S. Department of Education Graduate Fellowships in Areas of National Need (GAANNs) [Grant P200A130153 to NTLT]. The tool’s infrastructure was supported by The AWS Cloud Credits for Research Program to NTLT.
Availability of data and materials
The datasets used in this study are available from the Computer Science and Engineering Department at University of Washington (http://bio.cs.washington.edu/assessment/).
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Tran, N.T.L., Huang, C. MODSIDE: a motif discovery pipeline and similarity detector. BMC Genomics 19, 755 (2018). https://doi.org/10.1186/s12864-018-5148-1
- Binding sites
- DNA motif
- Motif detection tool
- Motif discovery pipeline
- Motif similarity detection
- Motif clustering