Predicting tissue specific transcription factor binding sites
© Zhong et al.; licensee BioMed Central Ltd. 2013
Received: 30 October 2013
Accepted: 6 November 2013
Published: 15 November 2013
Studies of gene regulation often utilize genome-wide predictions of transcription factor (TF) binding sites. Most existing prediction methods are based on sequence information alone, ignoring biological contexts such as developmental stages and tissue types. Experimental methods to study in vivo binding, including ChIP-chip and ChIP-seq, can only study one transcription factor in a single cell type and under a specific condition in each experiment, and therefore cannot scale to determine the full set of regulatory interactions in mammalian transcriptional regulatory networks.
We developed a new computational approach, PIPES, for predicting tissue-specific TF binding. PIPES integrates in vitro protein binding microarrays (PBMs), sequence conservation and tissue-specific epigenetic (DNase I hypersensitivity) information. We demonstrate that PIPES improves over existing methods on distinguishing between in vivo bound and unbound sequences using ChIP-seq data for 11 mouse TFs. In addition, our predictions are in good agreement with current knowledge of tissue-specific TF regulation.
We provide a systematic map of computationally predicted tissue-specific binding targets for 284 mouse TFs across 55 tissue/cell types. Such comprehensive resource is useful for researchers studying gene regulation.
To reconstruct and model transcriptional regulatory networks (TRNs) we need to know the genome-wide binding sites of transcription factors (TFs)[1, 2]. Chromatin immunoprecipitation(ChIP) followed by microarray (ChIP-chip) or sequencing (ChIP-seq) has been extensively used to study the in vivo binding locations of individual transcription factors and cofactors in a wide range of species and tissues[1, 2, 5–9]. Despite their popularity, such methods can only study a single TF in a single cell type, under a specific condition, in each experiment. Thus, it is difficult to use these methods to obtain a comprehensive understanding of the complicated mammalian TRNs. These networks can involve hundreds or thousands of TFs whose activities change across different tissues and conditions. Using computational methods to integrate other genomic resources in order to predict tissue-specific transcription factor binding is therefore an important research challenge.
Several methods have been developed to use in vitro data characterizing TF binding specificities to identify TF binding sites across the genome. Specifically, data from universal protein binding microarray (PBM)[10, 11] is often used for such analysis. PBM is capable of analyzing the interaction of a sequence-specific TF with tens of thousands of short DNA sequences (probes) in a single experiment, and thus provides a highly detailed picture of TF-DNA interactions. It has been successfully applied to reveal the binding profiles of hundreds of TFs in yeast, worm, mouse and arabidopsis. Some of the proposed methods for using PBM data represent TF binding preference by position weight matrices (PWMs)[11, 16, 17]. However, PWMs, although popular due to their simplicity, assume independence between positions, an assumption which may not hold in many cases[14, 18, 19]. In contrast, more sophisticated models (e.g. using k-mers) may better represent the full binding profiles of TFs, without loss of information from using PWM. For instance, it has been suggested that many TFs have more than one binding preference and these are easier to represent using k-mers.
While in vitro data provides important information regarding binding specificities, such data is context independent. Actual binding is highly dependent on tissue-specific conditions including chromatin accessibility, the presence of co-factors, etc. Recently, a number of studies have reported that epigenetic information including certain histone modifications and hypersensitivity to DNase I cleavage correlate with TF binding in vivo[21, 22]. Moreover, functional TFBSs tend to be under stronger negative selection, leaving a "phylogenetic footprint" in the genomic sequences. Several methods for predicting in vivo TF binding sites have attempted to combine such information with PWMs to predict global binding preferences[23–26]. However, as mentioned above, PWM may not be the best representation of TF binding. As we show, by using a model that retains the dependence between positions in the motif we can improve upon methods that integrate epigenetic and PWM data. In addition, none of these methods have so far been applied to elucidate the complete set of targets for TFs across a large number of tissues.
To predict accurate tissue-specific TFBS, we integrate multiple types of genomic data. The first part of our model is a biophysically-motivated k-mer based method for analyzing PBM data, which allows secondary binding profiles and nucleotide dependencies in different positions of the TF binding sites. Next, we develop a new method, PIPES (p robabilistic i ntegration of P BM, e pigenetics and s equence data), to combine the results from the PBM model with DNase I hypersensitivity (DHS) data and evolutionary conservation data to predict tissue-specific TFBS in vivo. We demonstrate that such an integrative model significantly boosts context specific prediction results compared with using PBM data alone. We also show that PIPES improves upon other methods developed for integrating data to predict TFBS[24, 26], in some cases significantly so. Finally, we created a resource for tissue-specific TRNs using PBM data for 284 mouse TFs from UniPROBE and DNase I hypersensitivity data for 55 mouse tissue/cell types from the mouse ENCODE project. We predict the activities of TFs across different tissues, and, as we show, many of these predictions agree with current knowledge regarding tissue-specific roles of TFs. Our tissue specific activity predictions are also supported by global analysis of TF expression data. The comprehensive resource of TF binding sites we built thus provides a reference map for understanding complex gene expression patterns.
K-mer based PBM analysis can accurately infer TF binding specificities
While a number of methods have been suggested to use PBM data for predicting TF binding sites (in most cases using PWMs), we decided to extend k-mer based methods using a biophysically-motivated model (Figure1a). K-mer based methods were shown to achieve the best performance among several techniques for the analysis of PBM data. Such methods allow an intuitive representation of potential alternative binding motifs and can account for dependency among positions in a motif. We use lasso regression with positive constraints to learn model parameters that represent binding probabilities to individual k-mers, where k is determined as part of the learning procedure. This results in a sparse model with relatively few k-mers having nonzero binding probabilities. On average, the number of k-mers with non-zero probabilities is 398.4 ± 253.2 across 284 mouse TFs with PBM data available. The model combines the benefits of recent PWM-based biophysical methods (for example, BEEML-PBM) with the ability of PBMs to capture dependencies between positions in a given motif (see Methods and Supplementary Methods in Additional file1 for details).
To test our PBM model, and as a baseline, we next used the inferred binding probabilities to predict in vivo TF binding. We collected 11 published mouse ChIP-seq datasets for which the PBM data for the same TF or for a TF with a similar DNA-binding domain is available (Additional file2). From each ChIP-seq dataset, the top 3000 peaks with highest enrichment are extracted, and the 600bp genomic regions centered on the reported peaks are used as the positive sequences bound by the TF. Then, 600bp sequences that (1) are upstream of and (2) 300bp apart from each positive sequence, and (3) do not overlap with any other positive sequences, are used as negative sequences. We also explored two alternative options for constructing negative sequences, including using size-matched random promoter sequences and randomly generated sequences. The choice of negative sequence sets makes very little difference on the AUC values, so we will only report the results based on the first negative set here (see Supplementary Results in Additional file1, and Additional file3 for details). We compared our PBM model with seven other methods that predict affinities of TF binding to given sequences, and the different methods are evaluated using areas under the ROC curve (AUC) as a measure of their abilities to correctly classify the two sets of sequences (see Methods, Supplementary Results in Additional file1 and Figure S1 therein for details).
The results indicate that the performance of our PBM method is at least comparable, and in some cases better than, previous methods. For 4 of the 11 TFs we tested (Esrrb, Sox2, Oct4 and Crx), our method improved over all other methods. In all other cases, the AUC of our method ranks within the top 4 (Supplementary Results in Additional file1 and Additional file1: Figure S1 therein). Notably, in such cases, none of the other methods consistently achieves the best AUC. The PWM-based and E-score based methods tend to work well for some cases (for example, BEEML PWM for Klf7 and Max E-score for Srf), but for others their performance is not as good. Overall, our PBM model has the highest average AUC over the 11 TFs tested (Supplementary Results in Additional file1, and Additional file4).
To further assess the benefits of using PBM to derive TF binding specificities, we use two other collections of PWMs for AUC evaluation. The HOMER PWMs were derived from ChIP-seq datasets, while the JASPAR PWMs were from multiple sources (including ChIP-seq, literature curation and PBM data). Overall, the results from the JASPAR PWMs are very similar to those obtained from the Seed and Wobble PWMs, and both are weaker than our method. The use of HOMER PWMs lead to better overall performance. However, given that HOMER trains the PWMs from ChIP-seq data and the same datasets may be used for evaluation, this is clearly not a fair comparison. We did notice that for some TFs, our method outperforms HOMER PWMs (e.g. Max: 0.809 vs 0.757). The full details are shown in Supplementary Results in Additional file1 and Additional file4.
Integrated model of PBM and DNase I hypersensitivity data significantly improves TFBS prediction accuracy
PBM data, although powerful, only measures in vitro binding. Therefore, even when using sophisticated methods, the ability to predict in vivo binding based on PBM data is limited. DNase I hypersensitive (HS) sites are regions of chromatin that are very sensitive to DNase I cleavage, and previous studies have shown that such hypersensitivity correlates with TF binding[21, 22]. To better predict tissue-specific in vivo binding sites, we developed PIPES, a probabilistic graphical model for integrating DNase I HS data with PBM data. For windows containing a 36bp genomic region ("site"), we assume the chromatin of the site could exist in two states: open or closed, and that only in the open state the chromatin is accessible to binding by a TF. We infer the chromatin state by using a mixture model for the DNase HS data: the open state should be associated with higher tag densities from the DNase data, and the closed state with lower densities. The in vivo occupancy of a site is then estimated as the probability of binding in vitro estimated using result from the PBM model multiplied by the probability that the site is in an open state inferred from the DNase HS data (see Methods for a detailed description of PIPES).
In addition to DNase I hypersensitivity data, bona fide TF binding sites are usually under evolutionary pressure and therefore more conserved[33, 34]. We thus further extended PIPES to incorporate phastCons scores for each site (Methods). Performance of the full model that incorporates PhastCons information is shown in Figure3 (purple bars). As can be seen, while in some cases adding the conservation information very slightly improves performance (for example for Srf and Oct4), overall using conservation data does not lead to a significant improvement in prediction accuracy. When the DNase HS data is not available, using PhastCons in addition to the PBM data provides a slight improvement of the AUCs from using PBM alone (improving the results for 8 out of 11 TFs, Additional file4). The average AUC is increased by 1%, and for some TFs, the improvement can be quite significant (e.g. Srf, AUC changes from 0.539 to 0.592 by adding PhastCons).
Finally, we compare PIPES with recent methods proposed for integrating DNase and motif information to predict TFBS. The first method we compare against, termed 'Intersection", was used by Neph et al.. This method intersects sites that have high-scoring PWM matches for a TF with DNase HS sites to predict in vivo binding (Supplementary Methods in Additional file1). We also compare PIPES with CENTIPEDE. CENTIPEDE uses a probabilistic model to integrate the prior information of putative sites, such as sequence conservation and matches to PWMs, with the epigenetic data to predict binding sites. While probabilistic, CENTIPEDE does, however, rely on a stringent cutoff for PWM match scores to achieve low false positive rates.
The results are presented in Figure3. As can be seen, the intersection method leads to AUCs that are significantly lower than the ones obtained by our method for all 11 TFs (cyan bars, p = 9.77 · 10-4, one sided Wilcoxon rank-sum test), indicating that strict cutoffs (as opposed to probabilistic integration) may lead to a high rate of false negatives. Similarly, using the default settings for CENTIPEDE led to AUC scores that are much lower than ours (Figure3, orange bars). To further explore this, we varied the setting of CENTIPEDE, including using different PWMs and a range of cutoffs for defining putative binding sites, but the results remained the same (See Supplementary Methods and Results in Additional file1, and Additional file4). The difference in AUCs is highly significant: p = 4.88 · 10-4 (one sided Wilcoxon rank-sum test) using the best combination of PWM and cutoff for CENTIPEDE. These results indicate that our PIPES model, which relies on k-mer based representation and avoids strict cutoffs, can improve in vivo predictions of TFBS.
We performed additional analysis to the integration model. For these, we replace the binding probabilities predicted from the k-mer model, with those predicted when using PWMs. Three different versions of PWMs were used: the Seed-and-Wobble PWMs and RAP PWMs learned from the same PBM data, and the JASPAR PWMs. In all cases, the AUCs are substantially higher than the ones from the Intersection method and CENTIPEDE (below or close to the full PIPES model). These results indicate that the probabilistic integration step alone is enough to improve upon prior methods (Additional file4). The usefulness of the k-mer based analysis provides additional advantage, as independently demonstrated in the earlier section.
Combining PBM and DNase data enables the prediction of tissue-specific TF activities
The recently released mouse ENCODE project data provides DNase I hypersensitivity data for more than 50 mouse tissue/cell types (Methods). We set out to combine the PBM data for 284 mouse TFs in UniPROBE with such DNase data to predict tissue-specific TF targets and determine tissue-specific TF activities (Methods).
To more globally validate these tissue-specific TF activities, we compared the correlation between our predicted TF activity scores and mRNA levels for the same TFs in the corresponding tissue (measured by qRT-PCR). Eight tissues and 222 TFs that are common to both datasets are used (Additional file6). Even though the two types of data (PBM and DNase vs. expression) measure completely different aspects of cellular activity, we observe a Pearson correlation coefficient of 0.229, which is highly statistically significant (p < 10-8, permutation test, Figure4b). Since many TFs are only post-transcriptionally regulated, such a significant correlation provides strong support to the predictions computed by our method.
Existing literature strongly supports predicted TF activities in several tissues
Top five predicted TFs for liver, retina and B cell
Known functions in the corresponding tissue
Essential for maintaining hepatic gene expression and lipid homeostasis
Important in maintaining liver homeostasis, and its disruption is linked to hepatocarcinogenesis
Expressed in liver, and known to regulate liver-specific genes
Important role in liver metabolism
Downstream regulator in Wnt signaling which is critical in liver physiology and pathology
Regulates photoreceptor gene expression
Involved in retina progenitor cell development
Pitx2-deficient mouse exhibits ocular abnormalities
Unknown function in retina.
C. CD19+ B cell
Essential regulator of B-cell differentiation
Required for T-cell independent B cell activation
Promotes B cell differentiation
Unknown function in B cell, but has almost the same binding preference as Pou2f2
Regulates proliferation of B cells
A number of recent projects including ENCODE, modENCODE[54, 55] and the Roadmap Epigenomics Project have generated large amounts of genomic data. An important research goal is to translate these resources into accurate, tissue and condition-sensitive, molecular-level networks. Constructing tissue-specific maps of TF binding sites is a central part of this overall research effort. Using several different datasets and a novel computational strategy, PIPES, we demonstrated that such high-quality computational predictions can be obtained. We used PIPES to compile a resource that includes comprehensive predictions for more than 200 TFs across 50 tissues.
A recent benchmark study that compared many methods for analyzing PBM data concluded that PWM-based methods work as well as other models for predicting TFBS. Our results differ from these previous studies. This could either be the result of the Lasso based method we have used or the specific dataset we used for the comparison. Additional work is required to reach a definitive conclusion regarding the importance of independence assumption used by PWMs when modeling TFBS. Here we focused on integrating a number of datasets for predicting TFBSs. For our biophysical approach, using a k-mer based method allowed us to capture both the dependency among positions within the binding site as well as multiple different motifs for a single TF. Such method worked well in classifying bound and unbound sequences from in vivo ChIP-seq data for many TFs. Moreover, integrating PBM data with chromatin accessibility from DNase I HS data greatly improves the accuracy of TF binding predictions. Several recent papers explored related ideas. Chromia used a hidden Markov model to combine sequence-specific TF binding with histone modification data, but their predictions were based on PWM scoring and only focused on a dozen of TFs in mouse embryonic stem cells. Ernst et al. combined experimental data from a number of tissues to generate a single (global) TF-target prediction map. However, that method has also relied on PWMs and no tissue specific predictions were made. CENTIPEDE used unsupervised methods to integrate TF-DNA interaction, epigenetic and evolutionary data, and is most similar to our efforts. However, CENTIPEDE relies on a footprint in the DNase data that TFs leave. Such DNase footprints are the actual locations where the TF binds and are therefore protected from DNase cleavage within the DNase HS site. Unfortunately, DNase footprint data is expensive to obtain (indeed, it was not available for most of the tissues we analyzed) since it requires very high coverage when sequencing. In addition, CENTIPEDE uses a stringent cutoff (based on PWM matching) to define putative binding sites, and thus may lose significant information in relatively weak binding sites, which have been shown to be collectively important for TF binding. Neph et al.[26, 72] also combined DNase footprints with PWMs to predict TF-TF interactions (though not TF-gene interactions) across a large number of human tissues, using a simple method to intersect motif matches with DHS sites. Such hard cutoffs may miss sites that score high (but just below the cutoff) for both types of data which are found by our method.
In general, we find that binding sites for the TFs we looked at are only modestly conserved when compared with controls (using PhastCons scores alone classifies ChIP-seq sequences quite poorly, see Additional file4). This is largely consistent with the recent findings that functional non-coding sequences evolve rather rapidly. As a result, adding PhastCons in the integrated model does not lead to improvements in AUC values. Nevertheless, there are a number of advantages for models that can incorporate sequence conservation. First, when DNase data is not available, adding conservation in the model leads to slight improvement over models that only use PBM data (Additional file4). Second, the conservation of binding events can vary greatly among tissues. For example, enhancers in brain are far more constrained than those in the heart. Thus it is quite possible that sequence conservation would be more informative for other ChIP-seq studies.
The application of PIPES to predict tissue specific TFBS led to results that agree well with existing knowledge regarding TF roles in specific tissues. The overall results are significantly correlated with independent gene expression data measured for these TFs across tissues even though such expression data was not used at all in our analysis.
Several extensions of our current work are possible. Recently Jiang et al. reported the interesting phenomenon of sticky k-mers: these are k-mers that appear to bind to TFs with relatively high affinities in a large number of PBM experiments. The sticky k-mers likely represent background noises in the PBM experiments and an interesting research direction is to expand our regression method to remove such noises. In another recent study, Ballare et al. reported that functional TFBSs are not always associated with high chromatin accessibility, an assumption implicitly made by us and other related methods. Rather, nucleosomes may occupy TFBSs at basal conditions, and are only remodeled or displaced upon change of cellular conditions (e.g. by hormone stimulation). Despite this unexpected relationship between TFBS and chromatin states, the paper does report that such sites, while associated with high nucleosome occupancy before stimulation, often overlap with DNase HS sites. It remains to be seen how common such cases are and what is the impact on methods such as ours that rely on DNase data to predict condition specific TF binding. Moreover, our integrative framework for utilizing additional information sources when predicting binding events on a genome wide scale could also be used for large scale comparison of different PWM methods and methods that use more complicated models to represent TF binding preferences[76–78]. This requires a detailed study and is left for future work.
Combining PBM and DNase data, we presented the first major effort to provide a systematic map of computationally predicted tissue-specific targets for hundreds of TFs across a large number of tissues in mouse. We complied a resource that provides TF-target predictions for all 284 TFs studied across the 55 tissue/cell types (Supplementary Methods and Supplementary Website in Additional file4). We believe that such comprehensive resource would be useful for both biology and computation-oriented researchers studying gene regulation[79–82].
K-mer based method that uses PBM data to predict TF binding
The binding specificities of TFs are often represented by position weight matrices, which assume that each position of a site contributes independently to the overall binding affinity of the site (independence assumption). The PBM data simultaneously measures binding of a TF to tens of thousands of probes, and can be used to construct a much more detailed and accurate model of TF binding specificities.
A biophysically-motivated model for PBM data
where λ j is the position effect at position j (see Supplementary Methods in Additional file1) and is the binding probability to S i (j), the k-mer at the j-th position of the i-th probe. The term β s is symmetric for any k-mer s, i.e., where is the reverse complement of s. We note that our model allows features/k-mers to overlap, so the k-mers can be of different lengths and can contain gaps. The model relies on Lasso regression to estimate the contribution of each k-mer (see below).
By plugging in the equation of F(i) into the linear model, we can couple the different values we obtain for probe intensities as a function of the individual k-mer contributions, see Supplementary Methods in Additional file1 for full details.
Learning the parameters of the linear model
The above linear model has approximately 4 k /2 parameters (one for each k-mer and its reverse complement). To avoid overfitting, we use the lasso regression to estimate the coefficients. Lasso is a widely used approach to linear regression that encourages a sparse model where most of the coefficients are zero. In our problem, we have the additional requirement that the coefficients must be non-negative, and this is known as positive lasso (Supplementary Methods in Additional file1). After learning the model parameters using positive lasso, we set to be the binding probability of the TF to the k-mer s up to a scaling constant.
Since we do not know the width of the motif bound by each TF, our method searches for k-mers of different lengths. In order to speed up the calculation, we first run positive lasso using all short 4–6 mers. To allow longer k-mers to be considered, after the first run, all pairs from the top 100 such k-mers (based on regression coefficients) are tested to see if the prefix of one matches the suffix of the other, yielding longer (k + 1)-mers. This process is repeated until up to 8-mers have been added to the feature set. In addition, we also allow for gapped k-mers to be considered (Supplementary methods in Additional file1).
Predicting TF binding to any sequences
in which β s is the binding probability to the k-mer s learned by the regression model, C s is the number of times that s occurs in this site, and Bmax is the highest possible unscaled binding probability of any 36-mer that can be achieved for the TF and is used as a scaling constant. The interpretation of this equation is that the binding probability to a 36bp sequence is the sum of binding probabilities to each of the k-mer of the 36bp sequence[85, 86]. In practice, Bmax is estimated, for each TF, from the highest unscaled binding probabilities to 100,000 randomly sampled 36bp sites. Then, the binding probability to the entire sequence is defined as the highest binding probability to any 36bp site in that sequence.
Integrated model of TF binding in vivo
PBM experiments measure TF binding in vitro. In vivo binding depends on several factors including the cellular environment and the chromatin state of the bound region. In addition, it has been shown that functional TFBSs tend to be evolutionary constrained[33, 34]. In this section, we describe PIPES, a method that integrates our PBM motif learning and scanning method with these additional data sources in order to determine tissue specific binding.
Incorporating DNase I hypersensitivity data
Let B i be the probability of binding of the TF of interest to a 36bp genomic region (a site) indexed by i based on the PBM model (Equation 3), reflecting the potential of TF binding in vitro. We are interested in the in vivo occupancy of the site, denoted as X i . We assume that X i is influenced by the chromatin state, which can be represented as a simple binary indicator variable, A i (it is 1 if the chromatin is open/accessible and 0 otherwise). When the chromatin is open (A i = 1), the occupancy X i equals B i ; whereas a closed chromatin at that location means that X i = 0. Thus, X i is simply the product of B i and P(A i = 1). The chromatin state variable can be partially determined using experimental data. Here we use DNase I hypersensitivity (HS) data (Figure1c) which is available for several mouse and human tissues (Results). See Supplementary Methods in Additional file1 for details.
The full integrated model
Here X i is the occupancy of site i as described above, Z i is a binary variable indicating whether site i is a true binding site in vivo or not, C i is a binary variable indicating whether site i is conserved or not, and S i is a measure of the evolutionary conservation of the site. The model assumes that true TFBSs have a higher occupancy. Similarly, when Z i = 1, C i is more likely to be 1 as well (a true binding site is more likely to be conserved), and this is reflected by a higher conservation score S i . The goal is to infer Z i from the observed data X i and S i . The evolutionary conservation measure we used is the phastCons score (phastCons 46way vertebrates) downloaded from the the UCSC Genome Browser (http://genome.ucsc.edu).
In Supplementary Methods in Additional file1 we discuss the specific distributions we assume for each of the conditional probabilities in our model and how we learn the parameters for these distributions. After these parameters are estimated, we can compute the probability that the i-th site is bound by a specific TF. See Supplement for details.
Identifying tissue-specific TF activities
We used PIPES to identify TFs likely to be active in each tissue. Intuitively, if a TF f is active in a tissue T, then the binding sites of f should be overrepresented in the open chromatin regions of T. To quantify this overrepresentation, we define R(f,T) as the fraction of DNase hypersensitive sites in tissue T that contain high-scoring binding sites of f. The high scoring sites are defined as those that have binding probabilities (according to the PBM model, as defined in Equation 3) higher than the top 0.1% of the binding probabilities for all possible sites for that TF (the exact percentage cutoff has little impact, data not shown). In practice the binding probability distribution of a TF is estimated from the 100,000 sampled sites. For each tissue, the open sites in the promoter regions are defined as those sites whose DNase tag densities are higher than 15. This threshold is chosen so that the inferred probability of chromatin being open is close to 1 according to our model. In order to identify TFs having active functions in specific cell types, we exclude the binding sites that are not tissue specific (defined as open in more than 1/3 of all tissues). Such broadly-active sites are not interesting for the purpose of finding tissue-specific TFs.
where denotes all tissues other than T. This is used as a measure of the likely activity of the TF in that tissue. We use a simple binomial test to evaluate the significance of the activity score defined here. Suppose we observe n high-scoring binding sites of the TF f in the tissue, and among these n sites, x sites fall into DNase hypersensitive regions. By chance, the expected fraction of binding sites in the DNase HS regions is, thus we perform the one-sided binomial test of x successes in n trials under the null model that the probability of success is equal to p.
A comprehensive collection of predicted TFBSs across 55 mouse tissues
PBM data for 284 mouse TFs were downloaded from UniPROBE. DNase data for 55 mouse tissue/cell types were downloaded from the mouse ENCODE website athttp://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeUwDnase/, and for each tissue/cell type, parameters were learned as described in Methods. A list of all the TFs and tissue/cell types is provided in Additional files7 and8. To predict targets of TFs, the promoter regions (+/- 10kb around transcription start sites) for all mouse genes were scanned. This choice of promoter regions is consistent with several recent publications[23, 70]. We assessed the quality of our predictions in two ways. (1) The false discovery rates of top TF/tissue combinations were assessed, and in all but one of 25 combinations we examined, the FDR is below 15%. (2) For each of the 11 TFs we evaluated in the Results, we compared the set of predicted targeted genes from our method with the genome-wide targeted genes from the mouse ENCODE ChIP-seq datasets. The overlap of the two sets are highly significant for almost all TFs. See Supplementary Methods and Results in Additional file1, Additional files9,10 and Supplementary Website for details.
Comparison with other methods
To evaluate our methods we obtained ChIP-seq data for 11 TFs for which PBM data and tissue specific DNase I hypersensitivity data were available (Additional file2). ChIP-seq data was downloaded from NCBI GEO or ENCODE using the GEO IDs or UCSC Accession IDs listed in Additional file2.
We performed a comprehensive comparison of our PBM model with several other methods that could be or have been used in predicting TF binding on real sequences. Since most prior methods relied on PWMs, we used the PWMs reported in UniPROBE for these TFs, which were obtained by applying the Seed-and-Wobble algorithm on the PBM data (S&W PWM). We also compared with BEEML using both the energy matrices (BEEML Energy) and converted PWMs (BEEML PWMs), PWMs identified by RAP (RAP PWMs), the max E-score of k-mers (Max E-score), the use of occupancy score proposed by, a support vector regression-based method (SVR), and FeatureREDUCE (unpublished). In addition, we also compared using PWMs from external sources including the JASPAR database and PWMs derived from HOMER on ChIP-seq data. Moreover, we also compared our integrative model with an intersection strategy that combines PWM scanning with DNase data across multiple tissues, a simple method that combines PWM scanning with our DNase model, and an integrative method CENTIPEDE that uses PWMs, DNase HS and conservation data. A description of the details for the settings of all methods is provided in Supplementary Methods in Additional file1.
Availability of supporting data
The genome-wide tissue-specific TFBS predictions for 284 mouse TFs and 55 tissue/cell types and codes for the k-mer based PBM modeling method are available from the supporting website athttp://www.sb.cs.cmu.edu/PIPES.
Shan Zhong and Xin He: Co-first author.
Work supported in part by NIH grant 1RO1 GM085022 and NSF DBI-0965316 award to Z.B.J.
- Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, MacIsaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of a eukaryotic genome. Nature. 2004, 431 (7004): 99-104. 10.1038/nature02800.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008, 133 (6): 1106-1117. 10.1016/j.cell.2008.04.043.View ArticlePubMedGoogle Scholar
- Buck MJ, Lieb JD: ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics. 2004, 83 (3): 349-360. 10.1016/j.ygeno.2003.11.004.View ArticlePubMedGoogle Scholar
- Park PJ: ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009, 10 (10): 669-680. 10.1038/nrg2641.PubMed CentralView ArticlePubMedGoogle Scholar
- Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA: Genome-wide location and function of DNA binding proteins. Science. 2000, 290 (5500): 2306-2309. 10.1126/science.290.5500.2306.View ArticlePubMedGoogle Scholar
- Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007, 316 (5830): 1497-1502. 10.1126/science.1141319.View ArticlePubMedGoogle Scholar
- Zeitlinger J, Zinzen RP, Stark A, Kellis M, Zhang H, Young RA, Levine M: Whole-genome ChIP-chip analysis of Dorsal, Twist, and Snail suggests integration of diverse patterning processes in the Drosophila embryo. Genes Dev. 2007, 21 (4): 385-390. 10.1101/gad.1509607.PubMed CentralView ArticlePubMedGoogle Scholar
- Kaufmann K, Muiño JM, Jauregui R, Airoldi CA, Smaczniak C, Krajewski P, Angenent GC: Target genes of the MADS transcription factor SEPALLATA3: integration of developmental and hormonal pathways in the Arabidopsis flower. PLoS Biol. 2009, 7 (4): e1000090-PubMed CentralView ArticlePubMedGoogle Scholar
- Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, Kutter C, Watt S, Martinez-Jimenez CP, Mackay S, Talianidis I, Flicek P, Odom DT: Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science. 2010, 328 (5981): 1036-1040. 10.1126/science.1186176.PubMed CentralView ArticlePubMedGoogle Scholar
- Mukherjee S, Berger MF, Jona G, Wang XS, Muzzey D, Snyder M, Young RA, Bulyk ML: Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat Genet. 2004, 36 (12): 1331-1339. 10.1038/ng1473.PubMed CentralView ArticlePubMedGoogle Scholar
- Berger MF, Philippakis AA, Qureshi AM, He FS, Estep PW, Bulyk ML: Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat Biotechnol. 2006, 24 (11): 1429-1435. 10.1038/nbt1246.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhu C, Byers KJRP, McCord RP, Shi Z, Berger MF, Newburger DE, Saulrieta K, Smith Z, Shah MV, Radhakrishnan M, Philippakis AA, Hu Y, De Masi F, Pacek M, Rolfs A, Murthy T, Labaer J, Bulyk ML: High-resolution DNA-binding specificity analysis of yeast transcription factors. Genome Res. 2009, 19 (4): 556-566. 10.1101/gr.090233.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Grove CA, De Masi F, Barrasa MI, Newburger DE, Alkema MJ, Bulyk ML, Walhout AJM: A multiparameter network reveals extensive divergence between C. elegans bHLH transcription factors. Cell. 2009, 138 (2): 314-327. 10.1016/j.cell.2009.04.058.PubMed CentralView ArticlePubMedGoogle Scholar
- Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, Kuznetsov H, Wang CF, Coburn D, Newburger DE, Morris Q, Hughes TR, Bulyk ML: Diversity and complexity in DNA recognition by transcription factors. Science. 2009, 324 (5935): 1720-1723. 10.1126/science.1162327.PubMed CentralView ArticlePubMedGoogle Scholar
- Chang KN, Zhong S, Weirauch MT, Hon G, Pelizzola M, Li H, Huang SsC, Schmitz RJ, Urich MA, Kuo D, Nery JR, Qiao H, Yang A, Jamali A, Chen H, Ideker T, Ren B, Bar-Joseph Z, Hughes TR, Ecker JR: Temporal transcriptional response to ethylene gas drives growth hormone cross-regulation in Arabidopsis. eLife. 2013, 2: e00675-10.7554/eLife.00675.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhao Y, Stormo GD: Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat Biotechnol. 2011, 29 (6): 480-483.PubMed CentralView ArticlePubMedGoogle Scholar
- Orenstein Y, Linhart C, Shamir R: Assessment of algorithms for inferring positional weight matrix motifs of transcription factor binding sites using protein binding microarray data. PLoS ONE. 2012, 7 (9): e46145-10.1371/journal.pone.0046145.PubMed CentralView ArticlePubMedGoogle Scholar
- Maerkl SJ, Quake SR: A systems approach to measuring the binding energy landscapes of transcription factors. Science. 2007, 315 (5809): 233-237. 10.1126/science.1131007.View ArticlePubMedGoogle Scholar
- Mordelet F, Horton J, Hartemink AJ, Engelhardt BE, Gordan R: Stability selection for regression-based models of transcription factor-DNA binding specificity. Bioinformatics. 2013, 29 (13): i117-i125. 10.1093/bioinformatics/btt221.PubMed CentralView ArticlePubMedGoogle Scholar
- Spitz F, Furlong EEM: Transcription factors: from enhancer binding to developmental control. Nat Rev Genet. 2012, 13 (9): 613-626. 10.1038/nrg3207.View ArticlePubMedGoogle Scholar
- Li XY, Thomas S, Sabo PJ, Eisen MB, Stamatoyannopoulos JA, Biggin MD: The role of chromatin accessibility in directing the widespread, overlapping patterns of Drosophila transcription factor binding. Genome Biol. 2011, 12 (4): R34-10.1186/gb-2011-12-4-r34.PubMed CentralView ArticlePubMedGoogle Scholar
- John S, Sabo PJ, Thurman RE, Sung MH, Biddie SC, Johnson TA, Hager GL, Stamatoyannopoulos JA: Chromatin accessibility pre-determines glucocorticoid receptor binding patterns. Nat Genet. 2011, 43 (3): 264-268. 10.1038/ng.759.View ArticlePubMedGoogle Scholar
- Ernst J, Plasterer HL, Simon I, Bar-Joseph Z: Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Res. 2010, 20 (4): 526-536. 10.1101/gr.096305.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK: Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 2011, 21 (3): 447-455. 10.1101/gr.112623.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Cuellar-Partida G, Buske FA, McLeay RC, Whitington T, Noble WS, Bailey TL: Epigenetic priors for identifying active transcription factor binding sites. Bioinformatics. 2012, 28: 56-62. 10.1093/bioinformatics/btr614.PubMed CentralView ArticlePubMedGoogle Scholar
- Neph S, Stergachis AB, Reynolds A, Sandstrom R, Borenstein E, Stamatoyannopoulos JA: Circuitry and dynamics of human transcription factor regulatory networks. Cell. 2012, 150 (6): 1274-1286. 10.1016/j.cell.2012.04.040.PubMed CentralView ArticlePubMedGoogle Scholar
- Newburger DE, Bulyk ML: UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2009, 37 (Database issue): D77-D82.PubMed CentralView ArticlePubMedGoogle Scholar
- Mouse ENCODE Consortium: An encyclopedia of mouse DNA elements (Mouse ENCODE). Genome Biol. 2012, 13 (8): 418-View ArticleGoogle Scholar
- Annala M, Laurila K, Lähdesmäki H, Nykter M: A linear model for transcription factor binding affinity prediction in protein binding microarrays. PLoS ONE. 2011, 6 (5): e20059-10.1371/journal.pone.0020059.PubMed CentralView ArticlePubMedGoogle Scholar
- Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel AE, Wingender E: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006, 34 (Database issue): D108-D110.PubMed CentralView ArticlePubMedGoogle Scholar
- Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK: Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol cell. 2010, 38 (4): 576-589. 10.1016/j.molcel.2010.05.004.PubMed CentralView ArticlePubMedGoogle Scholar
- Gross DS, Garrard WT: Nuclease hypersensitive sites in chromatin. Annu Rev Biochem. 1988, 57: 159-197. 10.1146/annurev.bi.57.070188.001111.View ArticlePubMedGoogle Scholar
- Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3’ UTRs by comparison of several mammals. Nature. 2005, 434 (7031): 338-345. 10.1038/nature03441.PubMed CentralView ArticlePubMedGoogle Scholar
- So AYL, Cooper SB, Feldman BJ, Manuchehri M, Yamamoto KR: Conservation analysis predicts in vivo occupancy of glucocorticoid receptor-binding sequences at glucocorticoid-induced genes. Proc Natl Acad Sci USA. 2008, 105 (15): 5745-5749. 10.1073/pnas.0801551105.PubMed CentralView ArticlePubMedGoogle Scholar
- Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005, 15 (8): 1034-1050. 10.1101/gr.3715005.PubMed CentralView ArticlePubMedGoogle Scholar
- Pandolfi PP, Roth ME, Karis A, Leonard MW, Dzierzak E, Grosveld FG, Engel JD, Lindenbaum MH: Targeted disruption of the GATA3 gene causes severe abnormalities in the nervous system and in fetal liver haematopoiesis. Nat Genet. 1995, 11: 40-44. 10.1038/ng0995-40.View ArticlePubMedGoogle Scholar
- Minegishi N, Morita S, Minegishi M, Tsuchiya S, Konno T, Hayashi N, Yamamoto M: Expression of GATA transcription factors in myelogenous and lymphoblastic leukemia cells. Int J Hematol. 1997, 65 (3): 239-249.View ArticlePubMedGoogle Scholar
- Zhang L, Ju X, Cheng Y, Guo X, Wen T: Identifying Tmem59 related gene regulatory network of mouse neural stem cell from a compendium of expression profiles. BMC Syst Biol. 2011, 5: 152-10.1186/1752-0509-5-152.PubMed CentralView ArticlePubMedGoogle Scholar
- Hennig AK, Peng GH, Chen S: Regulation of photoreceptor gene expression by Crx-associated transcription factor network. Brain Res. 2008, 1192: 114-133.PubMed CentralView ArticlePubMedGoogle Scholar
- Corbo JC, Lawrence KA, Karlstetter M, Myers CA, Abdelaziz M, Dirkes W, Weigelt K, Seifert M, Benes V, Fritsche LG, Weber BHF, Langmann T: CRX ChIP-seq reveals the cis-regulatory architecture of mouse photoreceptors. Genome Res. 2010, 20 (11): 1512-1525. 10.1101/gr.109405.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Hayhurst GP, Lee YH, Lambert G, Ward JM, Gonzalez FJ: Hepatocyte nuclear factor 4alpha (nuclear receptor 2A1) is essential for maintenance of hepatic gene expression and lipid homeostasis. Mol Cell Biol. 2001, 21 (4): 1393-1403. 10.1128/MCB.21.4.1393-1403.2001.PubMed CentralView ArticlePubMedGoogle Scholar
- Lucas B, Grigo K, Erdmann S, Lausen J, Klein-Hitpass L, Ryffel GU: HNF4alpha reduces proliferation of kidney cells and affects genes deregulated in renal cell carcinoma. Oncogene. 2005, 24 (42): 6418-6431.PubMedGoogle Scholar
- Ravasi T, Suzuki H, Cannistraci CV, Katayama S, Bajic VB, Tan K, Akalin A, Schmeier S, Kanamori-Katayama M, Bertin N, Carninci P, Daub CO, Forrest ARR, Gough J, Grimmond S, Han JH, Hashimoto T, Hide W, Hofmann O, Kamburov A, Kaur M, Kawaji H, Kubosaki A, Lassmann T, van Nimwegen E, MacPherson CR, Ogawa C, Radovanovic A, Schwartz A, Teasdale RD, Tegnér J, Lenhard B, Teichmann SA, Arakawa T, Ninomiya N, Murakami K, Tagami M, Fukuda S, Imamura K, Kai C, Ishihara R, Kitazume Y, Kawai J, Hume DA, Ideker T, Hayashizaki Y: An atlas of combinatorial transcriptional regulation in mouse and man. Cell. 2010, 140 (5): 744-752. 10.1016/j.cell.2010.01.044.View ArticlePubMedGoogle Scholar
- Grant SFA, Thorleifsson G, Reynisdottir I, Benediktsson R, Manolescu A, Sainz J, Helgason A, Stefansson H, Emilsson V, Helgadottir A, Styrkarsdottir U, Magnusson KP, Walters GB, Palsdottir E, Jonsdottir T, Gudmundsdottir T, Gylfason A, Saemundsdottir J, Wilensky RL, Reilly MP, Rader DJ, Bagger Y, Christiansen C, Gudnason V, Sigurdsson G, Thorsteinsdottir U, Gulcher JR, Kong A, Stefansson K: Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat Genet. 2006, 38 (3): 320-323. 10.1038/ng1732.View ArticlePubMedGoogle Scholar
- Boj SF, van Es, Huch M, Li VSW, José A, Hatzis P, Mokry M, Haegebarth A, van den Born M, Chambon P, Voshol P, Dor Y, Cuppen E, Fillat C, Clevers H: Diabetes risk gene and Wnt effector Tcf7l2/TCF4 controls hepatic response to perinatal and adult metabolic demand. Cell. 2012, 151 (7): 1595-1607. 10.1016/j.cell.2012.10.053.View ArticlePubMedGoogle Scholar
- Sansregret L, Nepveu A: The multiple roles of CUX1: insights from mouse models and cell-based assays. Gene. 2008, 412 (1–2): 84-94.View ArticlePubMedGoogle Scholar
- Kojima K, Takata A, Vadnais C, Otsuka M, Yoshikawa T, Akanuma M, Kondo Y, Kang YJ, Kishikawa T, Kato N, Xie Z, Zhang WJ, Yoshida H, Omata M, Nepveu A, Koike K: MicroRNA122 is a key regulator of alpha-fetoprotein expression and influences the aggressiveness of hepatocellular carcinoma. Nat Commun. 2011, 2: 338-View ArticlePubMedGoogle Scholar
- Wolfrum C, Asilmaz E, Luca E, Friedman JM, Stoffel M: Foxa2 regulates lipid metabolism and ketogenesis in the liver during fasting and in diabetes. Nature. 2004, 432 (7020): 1027-1032. 10.1038/nature03047.View ArticlePubMedGoogle Scholar
- Shih DQ, Bussen M, Sehayek E, Ananthanarayanan M, Shneider BL, Suchy FJ, Shefer S, Bollileni JS, Gonzalez FJ, Breslow JL, Stoffel M: Hepatocyte nuclear factor-1alpha is an essential regulator of bile acid and plasma cholesterol metabolism. Nat Genet. 2001, 27 (4): 375-382. 10.1038/86871.View ArticlePubMedGoogle Scholar
- Odom DT, Zizlsperger N, Gordon DB, Bell GW, Rinaldi NJ, Murray HL, Volkert TL, Schreiber J, Rolfe PA, Gifford DK, Fraenkel E, Bell GI, Young RA: Control of pancreas and liver gene expression by HNF transcription factors. Science. 2004, 303 (5662): 1378-1381. 10.1126/science.1089769.PubMed CentralView ArticlePubMedGoogle Scholar
- Iizuka K, Horikawa Y: Regulation of lipogenesis via BHLHB2/DEC1 and ChREBP feedback looping. Biochem Biophys Res Commun. 2008, 374: 95-100. 10.1016/j.bbrc.2008.06.101.View ArticlePubMedGoogle Scholar
- Dai J, Zhang C, Tian Z, Zhang J: Expression profile of HMBOX1, a novel transcription factor, in human cancers using highly specific monoclonal antibodies. Exp Ther Med. 2011, 2 (3): 487-490.PubMed CentralPubMedGoogle Scholar
- The ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489 (7414): 57-74. 10.1038/nature11247.PubMed CentralView ArticleGoogle Scholar
- Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, Yip KY, Robilotto R, Rechtsteiner A, Ikegami K, Alves P, Chateigner A, Perry M, Morris M, Auerbach RK, Feng X, Leng J, Vielle A, Niu W, Rhrissorrakrai K, Agarwal A, Alexander RP, Barber G, Brdlik CM, Brennan J, Brouillet JJ, Carr A, Cheung MS, Clawson H, Contrino S: Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science. 2010, 330 (6012): 1775-1787. 10.1126/science.1196914.PubMed CentralView ArticlePubMedGoogle Scholar
- The modENCODE Consortium: Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010, 330 (6012): 1787-1797.PubMed CentralView ArticleGoogle Scholar
- Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR, Farnham PJ, Hirst M, Lander ES, Mikkelsen TS, Thomson JA: The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol. 2010, 28 (10): 1045-1048. 10.1038/nbt1010-1045.PubMed CentralView ArticlePubMedGoogle Scholar
- Khetchoumian K, Teletin M, Tisserand J, Mark M, Herquel B, Ignat M, Zucman-Rossi J, Cammas F, Lerouge T, Thibault C, Metzger D, Chambon P, Losson R: Loss of Trim24 (Tif1alpha) gene function confers oncogenic activity to retinoic acid receptor alpha. Nat Genet. 2007, 39 (12): 1500-1506. 10.1038/ng.2007.15.View ArticlePubMedGoogle Scholar
- Zhang P, Bennoun M, Gogard C, Bossard P, Leclerc I, Kahn A, Vasseur-Cognet M: Expression of COUP-TFII in metabolic tissues during development. Mech Dev. 2002, 119: 109-114. 10.1016/S0925-4773(02)00286-1.View ArticlePubMedGoogle Scholar
- Wan YJ, An D, Cai Y, Repa JJ, Hung-Po Chen T, Flores M, Postic C, Magnuson MA, Chen J, Chien KR, French S, Mangelsdorf DJ, Sucov HM: Hepatocyte-specific mutation establishes retinoid X receptor alpha as a heterodimeric integrator of multiple physiological processes in the liver. Mol Cell Biol. 2000, 20 (12): 4436-4444. 10.1128/MCB.20.12.4436-4444.2000.PubMed CentralView ArticlePubMedGoogle Scholar
- Thompson MD, Monga SPS: WNT/beta-catenin signaling in liver health and disease. Hepatology. 2007, 45 (5): 1298-1305. 10.1002/hep.21651.View ArticlePubMedGoogle Scholar
- Khosrowshahian F, Wolanski M, Chang WY, Fujiki K, Jacobs L, Crawford MJ: Lens and retina formation require expression of Pitx3 in Xenopus pre-lens ectoderm. Dev Dyn. 2005, 234 (3): 577-589. 10.1002/dvdy.20540.View ArticlePubMedGoogle Scholar
- Shi X, Bosenko DV, Zinkevich NS, Foley S, Hyde DR, Semina EV, Vihtelic TS: Zebrafish pitx3 is necessary for normal lens and retinal development. Mech Dev. 2005, 122 (4): 513-527. 10.1016/j.mod.2004.11.012.View ArticlePubMedGoogle Scholar
- Chen D, Pacal M, Wenzel P, Knoepfler PS, Leone G, Bremner R: Division and apoptosis of E2f-deficient retinal progenitors. Nature. 2009, 462 (7275): 925-929. 10.1038/nature08544.PubMed CentralView ArticlePubMedGoogle Scholar
- Gage PJ, Suh H, Camper SA: Dosage requirement of Pitx2 for development of multiple organs. Development. 1999, 126 (20): 4643-4651.PubMedGoogle Scholar
- Sokalski KM, Li SKH, Welch I, Cadieux-Pitre HAT, Gruca MR, DeKoter RP: Deletion of genes encoding PU.1 and Spi-B in B cells impairs differentiation and induces pre-B cell acute lymphoblastic leukemia. Blood. 2011, 118 (10): 2801-2808. 10.1182/blood-2011-02-335539.View ArticlePubMedGoogle Scholar
- Corcoran LM, Karvelas M: Oct-2 is required early in T cell-independent B cell activation for G1 progression and for proliferation. Immunity. 1994, 1 (8): 635-645. 10.1016/1074-7613(94)90035-3.View ArticlePubMedGoogle Scholar
- Schweitzer BL, Huang KJ, Kamath MB, Emelyanov AV, Birshtein BK, DeKoter RP: Spi-C has opposing effects to PU.1 on gene expression in progenitor B cells. J Immunol. 2006, 177 (4): 2195-2207.View ArticlePubMedGoogle Scholar
- Lacorazza HD, Miyazaki Y, Di Cristofano A, Deblasio A, Hedvat C, Zhang J, Cordon-Cardo C, Mao S, Pandolfi PP, Nimer SD: The ETS protein MEF plays a critical role in perforin gene expression and the development of natural killer and NK-T cells. Immunity. 2002, 17 (4): 437-449. 10.1016/S1074-7613(02)00422-3.View ArticlePubMedGoogle Scholar
- Weirauch MT, Cote A, Norel R, Annala M, Zhao Y, Riley TR, Saez-Rodriguez J, Cokelaer T, Vedenko A, Talukder S, Agius P, Arvey A, Bucher P, Callan CG, Chang CW, Chen CY, Chen YS, Chu YW, Grau J, Grosse I, Jagannathan V, Keilwagen J, Kiełbasa SM, Kinney JB, Klein H, Kursa MB, Lähdesmäki H, Laurila K, Lei C, DREAM5 Consortium: Evaluation of methods for modeling transcription factor sequence specificity. Nat Biotechnol. 2013, 31 (2): 126-134. 10.1038/nbt.2486.PubMed CentralView ArticlePubMedGoogle Scholar
- Won KJ, Ren B, Wang W: Genome-wide prediction of transcription factor binding sites using an integrated model. Genome Biol. 2010, 11: R7-10.1186/gb-2010-11-1-r7.PubMed CentralView ArticlePubMedGoogle Scholar
- Tanay A: Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res. 2006, 16 (8): 962-972. 10.1101/gr.5113606.PubMed CentralView ArticlePubMedGoogle Scholar
- Neph S, Vierstra J, Stergachis AB, Reynolds AP, Haugen E, Vernot B, Thurman RE, John S, Sandstrom R, Johnson AK, Maurano MT, Humbert R, Rynes E, Wang H, Vong S, Lee K, Bates D, Diegel M, Roach V, Dunn D, Neri J, Schafer A, Hansen RS, Kutyavin T, Giste E, Weaver M, Canfield T, Sabo P, Zhang M, Balasundaram G: An expansive human regulatory lexicon encoded in transcription factor footprints. Nature. 2012, 489 (7414): 83-90. 10.1038/nature11212.PubMed CentralView ArticlePubMedGoogle Scholar
- Blow MJ, McCulley DJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F: ChIP-Seq identification of weakly conserved heart enhancers. Nature Genet. 2010, 42 (9): 806-810. 10.1038/ng.650.PubMed CentralView ArticlePubMedGoogle Scholar
- Jiang B, Liu JS, Bulyk ML: Bayesian hierarchical model of protein-binding microarray k-mer data reduces noise and identifies transcription factor subclasses and preferred k-mers. Bioinformatics. 2013, 29 (11): 1390-1398. 10.1093/bioinformatics/btt152.PubMed CentralView ArticlePubMedGoogle Scholar
- Ballaré C, Castellano G, Gaveglia L, Althammer S, González-Vallinas J, Eyras E, Le Dily F, Zaurin R, Soronellas D, Vicent GP, Beato M: Nucleosome-driven transcription factor binding and gene regulation. Mol Cell. 2012, 49 (1): 67-79.View ArticlePubMedGoogle Scholar
- Mathelier A, Wasserman WW: The next generation of transcription factor binding site prediction. PLoS Comput Biol. 2013, 9 (9): e1003214-10.1371/journal.pcbi.1003214.PubMed CentralView ArticlePubMedGoogle Scholar
- Kulakovskiy I, Levitsky V, Oshchepkov D, Bryzgalov L, Vorontsov I, Makeev V: From binding motifs in ChIP-Seq data to improved models of transcription factor binding sites. J Bioinform Comput Biol. 2013, 11: 1340004-10.1142/S0219720013400040.View ArticlePubMedGoogle Scholar
- Grau J, Posch S, Grosse I, Keilwagen J: A general approach for discriminative de novo motif discovery from high-throughput data. Nucleic Acids Res. 2013, doi:10.1093/nar/gkt831Google Scholar
- Ernst J, Vainas O, Harbison CT, Simon I, Bar-Joseph Z: Reconstructing dynamic regulatory maps. Mol Syst Biol. 2007, 3: 74-PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Zhan M: Unraveling transcriptional regulatory programs by integrative analysis of microarray and transcription factor binding data. Bioinformatics. 2008, 24 (17): 1874-1880. 10.1093/bioinformatics/btn332.PubMed CentralView ArticlePubMedGoogle Scholar
- Schulz MH, Devanny WE, Gitter A, Zhong S, Ernst J, Bar-Joseph Z: DREM 2.0: Improved reconstruction of dynamic regulatory networks from time-series expression data. BMC Syst Biol. 2012, 6: 104-10.1186/1752-0509-6-104.PubMed CentralView ArticlePubMedGoogle Scholar
- Marbach D, Roy S, Ay F, Meyer PE, Candeias R, Kahveci T, Bristow CA, Kellis M: Predictive regulatory models in Drosophila melanogaster by integrative inference of transcriptional networks. Genome Res. 2012, 22 (7): 1334-1349. 10.1101/gr.127191.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, Morgunova E, Enge M, Taipale M, Wei G, Palin K, Vaquerizas JM, Vincentelli R, Luscombe NM, Hughes TR, Lemaire P, Ukkonen E, Kivioja T, Taipale J: DNA-binding specificities of human transcription factors. Cell. 2013, 152 (1–2): 327-339.View ArticlePubMedGoogle Scholar
- Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann Stat. 2004, 32 (2): 407-499. 10.1214/009053604000000067.View ArticleGoogle Scholar
- Foat BC, Morozov AV, Bussemaker HJ: Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics. 2006, 22 (14): e141-e149. 10.1093/bioinformatics/btl223.View ArticlePubMedGoogle Scholar
- He X, Chen CC, Hong F, Fang F, Sinha S, Ng HH, Zhong S: A biophysical model for analysis of transcription factor interaction and binding site arrangement from genome-wide binding data. PLoS ONE. 2009, 4 (12): e8155-10.1371/journal.pone.0008155.PubMed CentralView ArticlePubMedGoogle Scholar
- Orenstein Y, Mick E, Shamir R: Rap: Accurate and fast motif finding based on protein-binding microarray data. J Comput Biol. 2013, 20 (5): 375-382. 10.1089/cmb.2012.0253.PubMed CentralView ArticlePubMedGoogle Scholar
- Agius P, Arvey A, Chang W, Noble WS, Leslie C: High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions. PLoS Comp Biol. 2010, 6 (9):Google Scholar
- Bryne JC, Valen E, Tang MHE, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A: JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res. 2008, 36 (Database issue): D102-D106.PubMed CentralPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.