- Open Access
A structure-based Multiple-Instance Learning approach to predicting in vitrotranscription factor-DNA interaction
© Gao and Ruan; licensee BioMed Central Ltd. 2015
- Published: 21 April 2015
Understanding the mechanism of transcriptional regulation remains an inspiring stage of molecular biology. Recently, in vitro protein-binding microarray experiments have greatly improved the understanding of transcription factor-DNA interaction. We present a method - MIL3D - which predicts in vitro transcription factor binding by multiple-instance learning with structural properties of DNA.
Evaluation on in vitro data of twenty mouse transcription factors shows that our method outperforms a method based on simple-instance learning with DNA structural properties, and the widely used k-mer counting method, for nineteen out of twenty of the transcription factors. Our analysis showed that the MIL3D approach can utilize subtle structural similarities when a strong sequence consensus is not available.
Combining multiple-instance learning and structural properties of DNA has promising potential for studying biological regulatory networks.
- Transcription Factor Binding Site
- Probe Sequence
- Protein Binding Microarray
- Mouse Transcription Factor
- Simple Counting Method
Modeling of transcription factor binding sites (TFBS), sometimes also referred to as transcription factor binding motifs, is a crucial step towards understanding molecular regulatory networks. Among the popular methods for modeling TFBS, position-specific weight matrix (PWM) [1, 2] and k-mer based approaches [3, 4] have gained great success [5, 6]. Providing a probability score for each of the four nucleotide bases of each position of a TFBS, PWM-based approaches are intuitive for representing the sequence preferences of a transcription factor (TF), and easy to visualize for the TFBS models. However, these approaches have some limitations, including inefficiency in optimizing PWM, difficulty to represent some TFBSs (such as structural motifs and those with variable lengths) by a single PWM, and incapability to capture positional dependency [3, 4, 7, 8]. In contrast, k-mer based approaches use a word-based sequence of length k to enumerate possible instances of a motif. Compared to PWM-based approaches, the k-mer based approaches can be designed to make fewer assumptions about the degree of TFBS degeneracy, length of the binding sites, and position dependences of a motif. The k-mer based approaches have been found to provide more accurate models than PWMs in general [5, 6, 9]. But these models usually involve too many parameters and may not unravel the underlying knowledge of TF-DNA interactions.
A TF binding site is usually 8 to 12-mer base pair, determined by the sequence specificity and its physicochemical properties (structural and chemical properties), which are overlooked by both the PWM based and the k-mer based approaches. In , a novel TFBS modeling and predicting approach is presented, where the sequence-specific chemical and structural features of DNA are applied. Based on their evaluation using an in vivo ChIP-chip dataset, their method outperforms four previous methods [1, 11–13] by reporting fewer false positive matches. Their method provides a new perspective for understanding TF-DNA interactions. On the other hand, the in vivo protein-DNA interactions observed in ChIP-chip assays are not necessarily direct , as some TFs tend to interact with DNA extensively through other partners. Therefore, an evaluation on a proper in vitro dataset would be more appropriate to reveal the benefit of such physicochemical features in modeling TF-DNA interactions.
Protein-binding microarray (PBM) is a high-throughput experiment used to measure the in vitro binding affinity of a given TF to the sequences on the probe array . A typical design of the array consists of an exhaustive enumeration of all possible 10-mers concatenated into ∼40,000 unique probe nucleic acid sequences, each containing 35 bases. The PBM score represents the relative binding affinity of a given TF to each probe sequence on the array. Because typical confounding factors such as transcription cofactors present in ChIP-based experiments are eliminated, PBM data provides an excellent information source to develop physicochemical models for TF-DNA interactions. On the other hand, the currently available physicochemical features are mainly 3-mer or 4-mer based . Direct mapping of the 3-mer or 4-mer based meta-features to the candidate DNA binding sequences as in the work of  may not reflect the TF-DNA binding nature, since a TFBS usually spans 6 to 12 base pairs (bp), and its exact location within the PBM probe sequence is unknown. As a result, conventional machine learning algorithms, which rely on well-structured instance (feature vector) and label pairs, may not work well in modeling PBM data.
In this paper we propose a novel approach, MIL3D, to predict in vitro transcription factor binding based on the structural properties of DNA using the so-called multiple-instance learning (MIL), which was originally developed in the mid '90s to deal with uncertainty in instance labels and has found many interesting applications in bioinformatics and information retrieval [16–19]. In a conventional classification problem, the input is a set of instances (feature vectors), each of which is labeled positive or negative. In contrast, the input of MIL is a set of bags that can have many instances, but the instances are not individually labeled - instead, the labels are associated with bags. The common assumption is that a bag with a positive label contains at least one positive instance (whose identity is unknown), while all instances in the negative bag should have negative labels. The MIL framework fits the TFBS modeling scenario very well, because it is commonly assumed that a DNA sequence that can be bound by a TF (positive sequence) should contain one or more TFBSs, while a DNA sequence that cannot be bound by the TF (negative sequence) should has no TFBS. The exact location of the TFBS within the positive sequence is typically not fixed, although some preference might exist. Therefore, it is fairly intuitive to consider each DNA sequence as a bag, and any subsequence that can be a potential binding site as an instance.
In our algorithm, we treat each PBM probe sequence as a bag, and label the bags positive or negative according to the affinity level from the TF binding data. Each PBM probe is then decomposed into a set of k-mers using a simple sliding-window approach, and each k-mer is represented by a vector describing its structural properties. We then apply an existing MIL algorithm to learn a classification model that can correct predict the labels for the bags.
To demonstrate the advantage of our approach, we compared the performance of our algorithm with the conventional single-instance based learning (SIL), and with the simple k-mer counting method. Experimental results on PBM data of twenty mouse TFs showed that our method outperformed those methods with significant margins. On the other hand, it is also worth noting that our goal in this current paper is not a state-of-the-art method to predict TF-DNA interaction with the highest accuracy. To our knowledge, this is the first work that demonstrates the feasibility of using the MIL paradigm and structural properties in modeling TF-DNA interactions. We believe that many of the existing k-mer based methods for predicting TF-DNA interactions, which often involve filtering, normalization, and transformation of the binding data, can be combined with the key idea proposed in this paper to obtain a more accurate model.
The in vitrotranscription factor-DNA binding data
The protein binding microarray (PBM) data is acquired from . Two completely different array designs, each of which consists of 40,000 unique 35-mers, are used for twenty different mouse TFs. In the arrays, all possible 10-mers, and 32 copies of every non-palindromic 8-mer are included, presenting an unbiased study of TF binding preferences. The data were used for training in the Dialogue on Reverse-Engineering Assessment and Methods (DREAM) competition  and the data is freely available on the DREAM5 competition website upon registration. In our experiment, we selected 3000 probes with the highest binding signals as positive sequences and 3000 probes with the lowest binding signals as negative for each of the twenty TFs. We used the two different sets of arrays for training and testing respectively. As the training and testing data are completely separate, and actually come from different sources, this setting makes the prediction problem more challenging and the results less likely to be influenced by biases, compared to evaluations based on, for example, 10-fold cross-validation, where training and testing data could share some unknown similarities and bias the evaluation results.
MIL model for TF-DNA binding
To apply MIL to model the transcription factor binding site, we use the multiple-instance-wrapper algorithm  implemented in the WEKA machine learning package , and used the popular C4.5 decision tree algorithm (named J48 in WEKA) as the base single-instance classifier. The choice of C4.5 is because it is a simple and less biased classifier - it does not need to adjust a parameter to optimize its performance, and the number of features of an instance has little influence on its performance. Moreover, the difference of range and scale among each feature has little influence on the performance as well.
The wrapper-based MIL approach enables us to compare MIL-based TF-DNA interaction models and single-instance based model in a relatively unbiased manner, because the wrapper-based method can use any traditional single-instance classifiers as a base classifier. In contrast, a non-wrapper based MIL algorithm does not use SIL classifiers directly, which makes it hard to underpin the source of the performance difference between the two. The wrapped approach is derived from the assumption that a bag's label is formed by the overall contribution of all the instances within the bag, which is also consistent with the understanding that multiple binding sites may contribute to recruiting TF additively. Based on the above assumption, at the learning time, the wrapper algorithm converts the MIL training and testing data to single-instance based learning data by giving each instance a label that is the same as its bag, and each instance is initially assigned a weight proportional to the inverse of the size of the bag that they belong to. In our case, as every bag has the same size (35−k+1 instances each representing a k-mer), the initial weight for every instance is . This setting is because we do not have any prior knowledge for the location of the TFBS within the probe sequence, and therefore we assume that all k-mers within the 35-mer have the same probability being the real binding site, which will be updated iteratively when the model becomes more accurate. At prediction time, the class probability for a bag is estimated by averaging the class probabilities assigned to the individual instances in the bag. Moreover, the class probability for each individual instance can also be extracted from the positive bags to identify possible binding site(s) for each probe sequence. As the result of the above designs, different from most of the other non-wrapper based MIL classifiers, the wrapper algorithm can be combined with any traditional single-instance classifiers, which enables an unbiased comparison between our MIL-based TFBS modeling method with the traditional SIL based TFBS modeling method.
The choice of decision tree rather than other popular algorithms such as support vector machines (SVM) is because of the relative parameter-free robust performance and the implicit feature selection provided by the former method, which makes it ideal for evaluating and comparing the performance of the MIL-based and SIL-based approaches. For example, the performance of SVM approach is very sensitive to kernel choices and other parameters including the soft margin parameter and feature normalization method. As the k-mer based approaches and our structure-based MIL approach have very different number of features and different feature characteristics, (e.g., in k-mer based approach, all features are small integers, while in our method the features are typically real values and different features can have different distributions), we feel a simple base classifier such as a decision tree can provide a more fair comparison.
The meta-features of DNA structural properties
The structural properties of DNA is obtained from , which describe the steric and conformational rigidity properties of DNA. The structural features of DNA oligomers were defined based on the predicted average 3-dimensional structures of short DNA sequences, which were determined via MD simulations . The structural feature table of 3-mer is a 64 by 6 matrix, which contains six real value "base" parameters for each middle base pair of all the 64 (43) different 3-mers. The base parameters measure the relative relations between two fundamental units of the middle based pair of a 3-mer in a 3-dimensional space. The six base parameters are shear, buckle, stretch, propeller, stagger and opening .
Mapping PBM sequences to feature vectors
Figure 1 shows the steps of mapping the 35-mer PBM probe sequence to the structural feature vectors by our MIL3D method. In this example the k-mer length is specified as 7. As mentioned before, each probe sequence is treated as a bag and has a negative or positive label depending on its binding affinity to a particular TF. Each of the 29 shifted 7-mers in the 35-mer is treated as an instance, which does not have a label itself. For each instance (7-mer), we obtain the structural features of each of the 5 triplets using the structural feature table mentioned above as a lookup table. The feature vectors for the 5 triplets from the same 7-mer are then concatenated, in the same order as they appear in the 7-mer, to form a single feature vector for the 7-mer, which is an instance in the MIL model.
Summary of different feature models.
Length of k-mer kernel
Number of instances per bag
Number of features per instance
k ∈ [5, 8]
k-3+1 triplets per k-mer * 6 base structural features per triplet
For each of the 35-k+1 different continuous k-mers in the 35-mer, for each of the k-3+1 triplets, map the structural features to the k-mer sequentially. The feature vector of one k-mer represent an instance, and the 35-k+1 instances form a bag.
198 (6 structural features per triplet * 33 continous 3-mers in the 35-mer)
For each of the 33 different continous 3-mers in the 35-mer, map the 6 structural features to the 3-mer.
k ∈ [3, 8]
4 k (number of occurrences of all different k-mers)
For each 35-mer DNA sequence in the PBM array, count the number of occurrences for each of the 4 k k-mers; map the k-mer counter table to the PBM sequence. This k-mer based method has been widely used in the previous decades and has been proven to be still very effective at present .
3, 4 and 5
For each 35-mer DNA sequence in the PBM array, map the above 3 counter tables (including 3-mer, 4-mer and 5-mer tables) to the sequence.
The MIL3D method performs significantly better than the traditional k-mer counting method. Furthermore, changing of instance (k-mer) length in the MIL3D model has almost no impact on the performance while the performance of k-mer based method significantly depends on the value of k. The best performance of the k-mer based method is with k equal to 5. Larger values of k cause a significant degradation of the performance of the k-mer based method, presumably due to overfitting associated with the exponentially growing number of features and sparsity of the feature vectors when k increases. For example, each instance of the 7-mer counting has 16384 features while the total number of instances in our experiment is only 6000. In contrast, the number of features in MIL3D is linear to the number of 3-mers contained in the k-mer, and therefore is relatively stable when the value of k changes. While the actual TFBS length might be different from k, the MIL3D model can overcome the problem by modeling the TFBS with multiple overlapping instances, each of which can be a partial binding site.
As shown in Figure 2, MIL3D models are consistently better than k-mer based models for almost all TFs, often by a significant margin, except for Tcf3, where the 3+4+5-mer counting approach works slightly better than MIL3D models. The average AUC for MIL3D (0.94) is significantly higher than the average AUC for the 5mer counting method (0.84) (p-value = 1e-7, paired t-test). Among all MIL3D models, there are only five TFs out of twenty with AUCs below 0.91, while for the 5-mer counting models, there are only 5 TFs with AUCs above 0.90. With the MIL3D method, there are 11 TFs with AUCs above 0.95, while with the 5-mer counting method, only has 2 TFs (Foxo4 and Gmeb2) have AUCs greater than 0.95. The performance of MIL3D on Esr1, Tcf3, Nr2C1 and Tbx3 are low compared to other TFs (AUC<0.86), despite being considerably better than the 5-mer counting method, marking possible areas for improvement.
Occurrence of ten most frequent high-scoring 7-mers in positive probe sequences for each TF (only 7-mers with a predicted probability of binding > 0.85 are considered.)
In this paper, we proposed a TFBS modeling and prediction approach - MIL3D. Combining MIL and structural properties of DNA, it models and predicts in vitro transcription factor bindings more accurately than the widely used k-mer counting methods on in vitro protein binding microarray data for twenty mouse transcription factors. Our analysis showed that the MIL3D approach can utilize subtle structural similarities when a strong sequence consensus is not available and demonstrated the encouraging potential of using MIL and structural properties of DNA to study molecular regulatory networks. The key idea in our method can be easily combined with current state-of-the-art k-mer based models, which often involve additional normalization, filtering, or transformation of data, to increase the prediction accuracy of TF-DNA interactions.
This article was funded by NSF grant IIS-1218201, NIH grants SC3GM086305, U54CA113001, G12MD007591, and a UTSA Tenure-track Research Award.
This article has been published as part of BMC Genomics Volume 16 Supplement 4, 2015: Selected articles from the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS) 2013. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S4.
- Berg OG, von Hippel PH: Selection of DNA binding sites by regulatory proteins: Statistical-mechanical theory and application to operators and promoters. Journal of molecular biology. 1987, 193 (4): 723-743. 10.1016/0022-2836(87)90354-8.View ArticlePubMedGoogle Scholar
- Stormo GD: Consensus patterns in DNA. Methods in enzymology. 1990, 183: 211-221.View ArticlePubMedGoogle Scholar
- Zhao X, Huang H, Speed TP: Finding short DNA motifs using permuted markov models. Journal of Computational Biology. 2005, 12 (6): 894-906. 10.1089/cmb.2005.12.894.View ArticlePubMedGoogle Scholar
- Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, et al: Diversity and complexity in DNA recognition by transcription factors. Science. 2009, 324 (5935): 1720-1723. 10.1126/science.1162327.PubMed CentralView ArticlePubMedGoogle Scholar
- Weirauch MT, Cote A, Norel R, Annala M, Zhao Y, Riley TR, Saez-Rodriguez J, Cokelaer T, Vedenko A, Talukder S, et al: Evaluation of methods for modeling transcription factor sequence specificity. Nature biotechnology. 2013, 31: 126-134. 10.1038/nbt.2486.PubMed CentralView ArticlePubMedGoogle Scholar
- Lei C, Ruan J: A particle swarm optimization-based algorithm for finding gapped motifs. BioData mining. 2010, 3: 9-10.1186/1756-0381-3-9.PubMed CentralView ArticlePubMedGoogle Scholar
- Nutiu R, Friedman RC, Luo S, Khrebtukova I, Silva D, Li R, Zhang L, Schroth GP, Burge CB: Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument. Nature biotechnology. 2011, 29 (7): 659-664. 10.1038/nbt.1882.PubMed CentralView ArticlePubMedGoogle Scholar
- Maerkl SJ, Quake SR: A systems approach to measuring the binding energy landscapes of transcription factors. Science. 2007, 315 (5809): 233-237. 10.1126/science.1131007.View ArticlePubMedGoogle Scholar
- Gao Z, Zhao R, Ruan J: A genome-wide cis-regulatory element discovery method based on promoter sequences and gene co-expression networks. BMC genomics. 2013, 14 (Suppl 1): 4-Google Scholar
- Bauer AL, Hlavacek WS, Unkefer PJ, Mu F: Using sequence-specific chemical and structural properties of DNA to predict transcription factor binding sites. PLoS computational biology. 2010, 6 (11): 1001007-10.1371/journal.pcbi.1001007.View ArticleGoogle Scholar
- Chen QK, Hertz GZ, Stormo GD: Matrix search 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Computer applications in the biosciences: CABIOS. 1995, 11 (5): 563-566.PubMedGoogle Scholar
- Djordjevic M, Sengupta AM, Shraiman BI: A biophysical approach to transcription factor binding site discovery. Genome research. 2003, 13 (11): 2381-2390. 10.1101/gr.1271603.PubMed CentralView ArticlePubMedGoogle Scholar
- Kel AE, Gößling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E: Matchtm: a tool for searching transcription factor binding sites in DNA sequences. Nucleic acids research. 2003, 31 (13): 3576-3579. 10.1093/nar/gkg585.PubMed CentralView ArticlePubMedGoogle Scholar
- Gordân R, Hartemink AJ, Bulyk ML: Distinguishing direct versus indirect transcription factor-DNA interactions. Genome Research. 2009, 19 (11): 2090-2100. 10.1101/gr.094144.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Mukherjee S, Berger MF, Jona G, Wang XS, Muzzey D, Snyder M, Young RA, Bulyk ML: Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nature genetics. 2004, 36 (12): 1331-1339. 10.1038/ng1473.PubMed CentralView ArticlePubMedGoogle Scholar
- Maron O, Lozano-Pérez T: A framework for multiple-instance learning. Advances in neural information processing systems. 1998, 570-576.Google Scholar
- Andrews S, Tsochantaridis I, Hofmann T: Support vector machines for multiple-instance learning. Advances in Neural Information Processing Systems. 2002, 561-568.Google Scholar
- Auer P: On learning from multi-instance examples: Empirical evaluation of a theoretical approach. Proc. 17th International Con. on Machine Learning. 1997, Morgan Kaufmann, 97: 21-29.Google Scholar
- Wang J, Zucker J-D: Solving the multiple-instance problem: A lazy learning approach. Proc. 17th International Con. on Machine Learning. 2000, Morgan Kaufman, 1119-1125.Google Scholar
- Stolovitzky G, Monroe D, Califano A: Dialogue on reverse-engineering assessment and methods. Annals of the New York Academy of Sciences. 2007, 1115 (1): 1-22. 10.1196/annals.1407.021.View ArticlePubMedGoogle Scholar
- Frank E, Xu X: Applying propositional learning algorithms to multi-instance data. Department of Computer Science, University of Waikato, Hamilton, NZ. 2003Google Scholar
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The weka data mining software: an update. ACM SIGKDD explorations newsletter. 2009, 11 (1): 10-18. 10.1145/1656274.1656278.View ArticleGoogle Scholar
- Cheatham TE: Simulation and modeling of nucleic acid structure, dynamics and interactions. Current opinion in structural biology. 2004, 14 (3): 360-367. 10.1016/j.sbi.2004.05.001.View ArticlePubMedGoogle Scholar
- Olson WK, Bansal M, Burley SK, Dickerson RE, Gerstein M, Harvey SC, Heinemann U, Lu X-J, Neidle S, Shakked Z, et al: A standard reference frame for the description of nucleic acid base-pair geometry. Journal of molecular biology. 2001, 313 (1): 229-237. 10.1006/jmbi.2001.4987.View ArticlePubMedGoogle Scholar
- Wang W-M, Wu S-Y, Lee A-Y, Chiang C-M: Binding site specificity and factor redundancy in activator protein-1-driven human papillomavirus chromatin-dependent transcription. Journal of Biological Chemistry. 2011, 286 (47): 40974-40986. 10.1074/jbc.M111.290874.PubMed CentralView ArticlePubMedGoogle Scholar
- Li M, Ge Q, Wang W, Wang J, Lu Z: c-jun binding site identification in k562 cells. Journal of Genetics and Genomics. 2011, 38 (6): 235-242. 10.1016/j.jgg.2011.05.004.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.