CTF: a CRF-based transcription factor binding sites finding system
© He et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Identifying the locations of transcription factor binding sites is crucial to understanding transcriptional regulation. Currently, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) can locate transcription factor binding sites (TFBSs) accurately and in high throughput, and it has become the gold-standard experimental method for TFBS finding. However, because of its high cost, it is impractical to apply the method on a very large scale. Given the large number of transcription factors, the numerous cell types and the various conditions involved, computational methods remain valuable for accurate TFBS identification.
In this paper, we propose a novel integrated TFBS prediction system, CTF, based on Conditional Random Fields (CRFs). By integrating information from different sources, CTF captures the patterns of TFBSs contained in different features (sequence, chromatin, etc.) and predicts TFBS locations with high accuracy. We compared CTF with several existing tools as well as the PWM baseline method on a dataset generated by ChIP-seq experiments (TFBSs of 13 transcription factors in the mouse genome). The results showed that CTF performed significantly better than the existing methods tested.
CTF is a powerful tool for predicting TFBSs by integrating high-throughput data and different features. It can be a useful complement to ChIP-seq and other experimental methods for TFBS identification and can thus improve our ability to investigate functional elements in the post-genomic era.
Availability: CTF is freely available to academic users at: http://cbb.sjtu.edu.cn/~ccwei/pub/software/CTF/CTF.php
Functional elements in genomes play important roles in many biological processes. For example, enhancers, silencers, and transcription factor binding sites (TFBSs) are required in transcription. Thus, identifying functional elements in genomes is one of the most important problems in the post-genomic era [1–3] and is essential for elucidating gene regulation comprehensively. TFBSs are one important type of functional element. However, it is very challenging to locate the actual positions of TFBSs because they are generally very short (10 ~ 20 bp) and highly degenerate. Moreover, only a small fraction of the sequences matching their patterns in a genome are actually bound by transcription factors [4–6].
Recently, advances in experimental technology have greatly expanded our ability to detect the locations of TFBSs. ChIP-seq (chromatin immunoprecipitation followed by massively parallel sequencing) is used to find binding motifs with high accuracy and high throughput, and it is becoming the gold-standard method for TFBS identification. However, it has several limitations: 1) the quality and source of the antibody have a big impact on the result, and it is hard to obtain high-quality antibodies for all TFs; 2) its resolution (about 300 bp) is too low to pinpoint TFBSs, which are only about 20 bp long; 3) ChIP-seq can detect the binding sites of only one transcription factor per experiment, and it is expensive. Although a recent study showed that it is possible to identify the binding sites of more than one TF in a single ChIP-seq experiment, the cost of identifying the binding sites of many TFs in various cell types and conditions remains prohibitive. Thus, computational methods are required as a complementary means of TFBS identification.
Efforts have been made to predict TFBSs computationally by searching for TFBS patterns in the genome. The position weight matrix (PWM), which represents TFBS patterns at the sequence level, is the most widely used model to represent and identify TFBSs. However, because the motifs are very short and typically degenerate, a PWM alone is not discriminative enough and predicts a large number of false positives. Recently, various approaches have been proposed to reduce false positives by integrating information from other sources [11–14]. For example, histone markers were shown to correlate with transcription factor binding sites and to improve accuracy significantly [13, 15]. However, none of the methods mentioned above considered the co-occurrence of histone markers, which has been shown to reflect the state of chromatin and to correlate with the binding events of transcription factors.
Three types of features, the Position Weight Matrix (PWM), the distance to transcription start sites (TSS proximity), and histone markers (8 distinct histone modifications), have been integrated into CTF (see Additional file 1 for more details). Test datasets were collected for 13 transcription factors in mouse embryonic stem cells (ES cells). We show that by integrating PWM, histone markers and TSS proximity, CTF predicts TFBSs with high accuracy and significantly outperforms existing methods, including Chromia and Cluster-Buster.
CTF also integrated histone markers and transcription start site (TSS) data. Histone modifications are observed across the genome, and some of them correlate strongly with TFBSs. In addition, studies of combinations of different histone modifications have shown that chromatin states are related to the activity of genomic regions and to regulation events. Therefore, histone markers and their combinations are informative for the prediction of TFBSs. In our work, 8 distinct histone markers were used: H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9me3, H3 and H4K20me3. Another feature included was TSS proximity, an indicator of whether a bin was within 2 kb of a TSS, which is the promoter region as defined in this paper. The discriminative power of each histone marker can be measured by comparing how frequently the marker occurs in bins with TFBSs and in bins without TFBSs. As Won et al. reported, H3K4me2 and H3K4me3 were the most discriminative, while H3K4me1 was less discriminative. This is consistent with our knowledge that H3K4me1/2/3 are active markers. In addition, we observed that the binding sites of some TFs, such as c-Myc and Zfx, are enriched in promoter regions (Additional file 3), and this enrichment can be captured by the TSS-proximity feature.
To further evaluate CTF, we compared it with two widely used existing algorithms. Chromia is an integrated method based on a Hidden Markov Model (HMM) that predicts TFBSs from PWM and chromatin signatures. Cluster-Buster is an algorithm for finding motif clusters (or cis-regulatory modules) and is also based on PWM. Cluster-Buster considers not only the signal (PWM score) of motifs but also their co-occurrence.
The AUC of CTF, Chromia and PWM on the dataset of 13 TFs
CTF, a novel integrative TFBS prediction system, was proposed in this paper. Although CTF achieved high accuracy, there is still much room for improvement. For example, in the current version, only the locations of the peaks of histone modifications are considered. Continuous feature functions that score the shape and intensity of the peaks could be included in future versions. In addition, the CRF framework itself is very flexible, and new features can be added to CTF in a straightforward manner. CTF can also be applied to similar problems such as the prediction of enhancers. We expect that CTF can facilitate the identification of transcription factor binding sites as well as other functional elements, and improve our knowledge of gene regulation.
In this paper, we presented and evaluated CTF, a novel integrative method that predicts transcription factor binding sites (TFBSs) by combining various features, using conditional random fields as the underlying framework. Our results showed that CTF successfully integrated the position weight matrix (PWM), the distance to transcription start sites (TSSs) and 8 distinct histone markers, which together improved the accuracy of TFBS prediction significantly. It outperformed models that used only a subset of those features. Most importantly, when compared with some existing representative tools, CTF showed significantly superior performance. CTF is an effective integrative TFBS prediction system and has great potential for finding other functional elements.
where y is the label sequence (the annotation of all bins), x is the observed genomic sequence, f_k is the k-th feature function and λ_k is the corresponding weight. The feature function f_k can be an arbitrary function of x and y, and y' denotes any possible label sequence. In CTF, the label of each bin takes one of two values: 0 (non-TFBS) or 1 (TFBS).
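For reference, a conditional random field over the symbols defined above is conventionally written in the form below. This is the textbook CRF expression, given here as an assumption about the general shape of the model and not as a quotation of the exact equation used in CTF.

```latex
P(y \mid x) =
  \frac{\exp\left( \sum_{k} \lambda_k \, f_k(x, y) \right)}
       {\sum_{y'} \exp\left( \sum_{k} \lambda_k \, f_k(x, y') \right)}
```

The denominator sums over every possible label sequence y' and acts as the normalising term (the partition function), so that the probabilities of all label sequences add up to 1.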
In CTF, several types of feature functions have been designed to capture the patterns contained in the features. The first type consists of PWM scoring functions. The second type consists of indicator functions, each of which checks for the occurrence of a feature. For example, a feature function of this type can be interpreted as an indicator of a bin in a promoter region if the i-th feature is TSS proximity, or as an indicator of a bin within an H3K4me2 peak if the i-th feature corresponds to H3K4me2. The third type targets the co-occurrence of two histone markers. Feature functions of this type can capture co-occurring features such as a bivalent domain or a bin that is "not in a promoter region or H3K4me3", which is a marker of active enhancers. In addition, we have defined feature functions that capture patterns in adjacent bins, as a complement to the feature functions above. With these feature functions, CTF is able to distinguish TFBSs from the background with high accuracy.
where i and i' correspond to two features and u and v are tags.
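The indicator and co-occurrence feature functions described above can be sketched as small closures. This is a minimal illustration and not CTF's actual code: the per-bin dictionary layout, the function names `indicator` and `cooccurrence`, and the feature keys are all hypothetical. The bivalent-domain example uses the usual definition of H3K4me3 and H3K27me3 occurring together.

```python
from typing import Callable, Dict

# Hypothetical per-bin record: feature name -> 0/1 flag (1 = bin overlaps a peak / promoter).
Bin = Dict[str, int]
FeatureFn = Callable[[Bin, int], int]


def indicator(feature: str, label: int) -> FeatureFn:
    """Feature function that fires when `feature` is present and the bin label is `label`."""
    def f(bin_features: Bin, y: int) -> int:
        return int(bin_features.get(feature, 0) == 1 and y == label)
    return f


def cooccurrence(feat_a: str, val_a: int, feat_b: str, val_b: int, label: int) -> FeatureFn:
    """Feature function on the joint state of two features within the same bin."""
    def f(bin_features: Bin, y: int) -> int:
        return int(
            bin_features.get(feat_a, 0) == val_a
            and bin_features.get(feat_b, 0) == val_b
            and y == label
        )
    return f


# Example feature functions for bins labelled as TFBS (y = 1).
in_promoter = indicator("TSS", 1)
bivalent = cooccurrence("H3K4me3", 1, "H3K27me3", 1, 1)

# A bin outside any promoter that carries both H3K4me3 and H3K27me3 peaks.
example_bin = {"TSS": 0, "H3K4me3": 1, "H3K27me3": 1}
```

In a CRF, each such function would be paired with a learned weight, so adding a new feature amounts to adding one more closure of this kind.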
where Z is the partition function and || · || is the L2 norm. In CTF, liblbfgs (http://www.chokkan.org/software/liblbfgs/), an open source library for unconstrained minimization, was used to find the optimal weight vector, λ.
which is assigned as the score of each bin. We can then set a threshold: bins whose scores exceed the threshold are assigned as TFBSs, and the remaining bins are assigned as background.
The binding sites of 13 transcription factors (TFs) in mouse ES cells were obtained directly from the ChIP-seq data of Chen et al. The 13 TFs were c-Myc, CTCF, E2f1, Esrrb, Klf4, Nanog, n-Myc, Oct4, Smad1, Sox2, STAT3, Tcfcp2l1 and Zfx. The position weight matrices (PWMs) of the TFs were obtained from JASPAR, and PWMs not available in JASPAR were obtained from Chen et al. The locations of transcription start sites (RefSeq mm8, April 8, 2012) were obtained from the UCSC Genome Browser. The sequence of the mouse genome (mm8, April 8, 2012) was also downloaded from the UCSC Genome Browser. Original ChIP-seq data for the 8 distinct histone modifications were obtained from . MACS was run with default parameters to call peaks from the ChIP-seq data.
A "peak-centric" method was used to generate the gold-standard dataset on the binned genome. First, the mouse genome was divided into 200-bp bins. Then, bins that overlapped the centers of TFBSs were labelled as positive and all other bins as negative. A similar strategy was applied to generate the feature matrix (Additional file 1). The PWM score assigned to a bin was the maximal PWM score inside the bin. For each histone modification, the value of a bin was set to 1 if that bin overlapped a peak and 0 otherwise. For transcription start site (TSS) proximity, we defined the promoter region as a 4,000-bp interval centred at the TSS; bins that overlapped that region had their TSS proximity set to 1, and all other bins had it set to 0.
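The binning rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions and not the pipeline used for CTF: coordinates are taken to be 0-based, peaks are half-open intervals `[start, end)`, a single chromosome is handled at a time, and the function names are hypothetical.

```python
BIN_SIZE = 200         # bin width in bp, as stated in the text
PROMOTER_HALF = 2000   # promoter = 4,000-bp interval centred at the TSS


def label_bins(n_bins, tfbs_centers):
    """Gold-standard labels: 1 for bins containing a TFBS centre, 0 otherwise."""
    labels = [0] * n_bins
    for center in tfbs_centers:
        idx = center // BIN_SIZE
        if 0 <= idx < n_bins:
            labels[idx] = 1
    return labels


def overlap_flags(n_bins, intervals):
    """Binary feature: 1 for every bin that overlaps any interval [start, end)."""
    flags = [0] * n_bins
    for start, end in intervals:
        first = max(start, 0) // BIN_SIZE
        last = min((end - 1) // BIN_SIZE, n_bins - 1)
        for idx in range(first, last + 1):
            flags[idx] = 1
    return flags


def tss_proximity(n_bins, tss_positions):
    """Binary feature: 1 for bins overlapping the promoter interval around any TSS."""
    promoters = [(t - PROMOTER_HALF, t + PROMOTER_HALF) for t in tss_positions]
    return overlap_flags(n_bins, promoters)


# Toy chromosome of 30 bins (6,000 bp) with made-up coordinates.
labels = label_bins(30, tfbs_centers=[450, 1999])
h3k4me3 = overlap_flags(30, intervals=[(390, 610)])
promoter = tss_proximity(30, tss_positions=[2500])
```

Each feature becomes one column of the feature matrix, with one row per bin; the PWM column would hold the maximal PWM score inside the bin instead of a 0/1 flag.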
To evaluate the performance of CTF, 10-fold cross-validation was employed. The 19 autosomes and chromosome X of the mouse genome were randomly divided into 10 groups. In each round, one group was used as the test set and the remaining groups as the training set. To measure performance, we calculated the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) as the threshold of the model is varied. For some methods, we were unable to obtain enough predictions to plot the complete ROC curve. Thus, in the comparison of all methods, we computed only the area under the ROC curve where the FPR was less than 10%, which was denoted as AUC10%. A further rationale was that in this range the number of false positives is moderate and the model is useful in practice.
We defined True Positives (TPs) as positive bins that were predicted as TFBSs and False Positives (FPs) as negative (non-TFBS) bins that were predicted as TFBSs. Similarly, negative bins predicted as non-TFBSs were termed True Negatives (TNs), and positive bins predicted as non-TFBSs were defined as False Negatives (FNs). The True Positive Rate (TPR) was then defined as the fraction of all positives that a model called as TFBSs, TPR = TP / (TP + FN), and the False Positive Rate (FPR) as the fraction of all negatives that a model called as TFBSs, FPR = FP / (FP + TN).
To evaluate other methods with the same criterion, we assigned the TFBSs predicted by each method to bins according to their positions, and each bin took the score of the TFBS it contained. If there were several TFBSs in one bin, the maximal score was chosen as the score of the bin. In this manner, we could measure the performance of all methods on the same binned genome.
We compared CTF with two existing methods and the baseline PWM method. The two existing methods were Chromia and Cluster-Buster. Chromia was downloaded from its website (http://tabit.ucsd.edu/download/Chromia2.tar.gz). Since the current release of Chromia contained the prediction result files generated from the same data set used in this paper, the results of Chromia were used directly. The predicted TFBSs were then merged into bins and the results were evaluated. Cluster-Buster focuses on detecting clustered motifs within a relatively narrow range and does not consider epigenetic modification information. Cluster-Buster was run with the parameters "-c 1 -m 1 -g 20 -f 2". The position weight matrix (PWM) baseline method used solely the PWM score of every bin to identify TFBSs, and we used various cut-offs to draw the ROC curves.
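The AUC10% measure can be illustrated with a short sketch that builds the ROC curve by sweeping the threshold and then integrates it only up to an FPR of 0.10. This is a minimal illustration, not the evaluation code used in the paper: the function names are hypothetical, the area is left unnormalised (so a perfect classifier scores 0.10), and the curve is interpolated linearly at the cut-off. Both of those choices are assumptions.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by lowering the threshold over the sorted scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both positive and negative bins are required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Bins with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def partial_auc(points, max_fpr=0.10):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the last segment at the cut-off by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: four bins, the two TFBS bins receive the highest scores.
toy_points = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
toy_auc10 = partial_auc(toy_points, max_fpr=0.10)
```

Restricting the area to low FPR values also means that methods which return only their top-scoring predictions can still be compared on the part of the curve they cover.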
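As a small worked example of these definitions, the sketch below counts the four outcomes at a single threshold and derives TPR and FPR from them. The function names are hypothetical, and a bin is called a TFBS when its score strictly exceeds the threshold, which is an assumption about how ties at the threshold are handled.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when bins scoring above `threshold` are called TFBSs."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1   # positive bin called as TFBS
        elif predicted == 1 and label == 0:
            fp += 1   # negative bin called as TFBS
        elif predicted == 0 and label == 0:
            tn += 1   # negative bin called as background
        else:
            fn += 1   # positive bin called as background
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """TPR = TP / (TP + FN); FPR = FP / (FP + TN)."""
    return tp / (tp + fn), fp / (fp + tn)


# Toy example: four bins, two of which are true TFBS bins.
counts = confusion_counts([0.9, 0.6, 0.4, 0.1], [1, 0, 1, 0], threshold=0.5)
tpr, fpr = rates(*counts)
```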
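The mapping from a method's predictions onto the binned genome can be sketched as follows. This is an illustrative sketch with hypothetical names: each prediction is assumed to be a `(position, score)` pair with a 0-based position on one chromosome, and bins without any prediction receive negative infinity so that they rank below every scored bin. Both are assumptions, since the text does not state them.

```python
def bin_scores(predictions, n_bins, bin_size=200, background=float("-inf")):
    """Assign each bin the maximal score among the predictions that fall inside it."""
    scores = [background] * n_bins
    for position, score in predictions:
        idx = position // bin_size
        if 0 <= idx < n_bins:
            scores[idx] = max(scores[idx], score)
    return scores


# Toy example: two predictions land in bin 0, none in bin 1, one in bin 2.
toy_bins = bin_scores([(10, 0.3), (150, 0.7), (450, 0.2)], n_bins=3)
```

Once every method is reduced to one score per bin, the same thresholds and ROC computation can be applied to all of them.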
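The PWM baseline relies on a single number per bin, the maximal PWM score inside the bin. A minimal sketch of such a score is given below. It assumes a log-odds PWM built from per-position base counts with a pseudocount and a uniform background, and it scans the forward strand only. These choices, the function names and the toy motif are illustrative assumptions, not the scoring scheme used in the paper.

```python
import math


def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2-odds scores against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def max_pwm_score(sequence, pwm):
    """Maximal PWM score over all windows of the sequence (forward strand only)."""
    width = len(pwm)
    best = float("-inf")
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguous symbols
        score = sum(pwm[j][base] for j, base in enumerate(window))
        best = max(best, score)
    return best


# Toy 3-bp motif that strongly prefers "ACG" (made-up counts).
toy_pwm = log_odds_pwm([{"A": 10}, {"C": 10}, {"G": 10}])
bin_score = max_pwm_score("TTACGTT", toy_pwm)
```

Sweeping a cut-off over these per-bin scores yields the points of the baseline ROC curve.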
TFBS (transcription factor binding site)
ChIP-seq (chromatin immunoprecipitation followed by massively parallel sequencing)
CRF (conditional random field)
CTF (CRF-based TFBS finding system)
FPR (false positive rate)
TPR (true positive rate)
PWM (position weight matrix)
ROC (receiver operating characteristic)
AUC (area under the curve).
We thank Dr. Kyoung Jae Won for his assistance in running Chromia. This work was supported by grants from the National Natural Science Foundation of China (60970050, 31100957), the Shanghai Pujiang Program (09PJ1407900), the K.C. Wong Education Foundation, Hong Kong, and the China Postdoctoral Science Foundation (20110490758). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This article has been published as part of BMC Genomics Volume 13 Supplement 8, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S8.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.