Improving analysis of transcription factor binding sites within ChIP-Seq data based on topological motif enrichment
© Worsley Hunt et al.; licensee BioMed Central Ltd. 2014
Received: 20 December 2013
Accepted: 20 May 2014
Published: 13 June 2014
Chromatin immunoprecipitation (ChIP) coupled to high-throughput sequencing (ChIP-Seq) techniques can reveal DNA regions bound by transcription factors (TF). Analysis of the ChIP-Seq regions is now a central component in gene regulation studies. The need remains strong for methods to improve the interpretation of ChIP-Seq data and the study of specific TF binding sites (TFBS).
We introduce a set of methods to improve the interpretation of ChIP-Seq data, including the inference of mediating TFs based on TFBS motif over-representation analysis and the subsequent study of spatial distribution of TFBSs. TFBS over-representation analysis applied to ChIP-Seq data is used to detect which TFBSs arise more frequently than expected by chance. Visualization of over-representation analysis results with new composition-bias plots reveals systematic bias in over-representation scores. We introduce the BiasAway background generating software to resolve the problem. A heuristic procedure based on topological motif enrichment relative to the ChIP-Seq peaks’ local maximums highlights peaks likely to be directly bound by a TF of interest. The results suggest that on average two-thirds of a ChIP-Seq dataset’s peaks are bound by the ChIP’d TF; the origin of the remaining peaks remaining undetermined. Additional visualization methods allow for the study of both inter-TFBS spatial relationships and motif-flanking sequence properties, as demonstrated in case studies for TBP and ZNF143/THAP11.
Topological properties of TFBS within ChIP-Seq datasets can be harnessed to better interpret regulatory sequences. Using GC content corrected TFBS over-representation analysis, combined with visualization techniques and analysis of the topological distribution of TFBS, we can distinguish peaks likely to be directly bound by a TF. The new methods will empower researchers for exploration of gene regulation and TF binding.
KeywordsChromatin immunoprecipitation ChIP-Seq Motif prediction Over-representation analysis Regulation Sequence analysis Transcription factor Transcription factor binding site Visualization
Delineating the specific cis-regulatory elements governing gene transcription has been at the forefront of genome research. The landmark release of the ENCODE project findings supports the long-standing view that a greater portion of the human genome contributes to the regulation of gene activity than the ~2% encoding proteins. High-throughput chromatin immunoprecipitation (ChIP) analysis of protein-DNA interactions has been transformative, highlighting the genomic regions that contain regulatory elements, and thus reducing the computational search space for transcription factor binding sites (TFBSs) many fold. The coupling of ChIP to high-throughput sequencing (ChIP-Seq) is empowering researchers to investigate DNA binding transcription factors (TFs) and their regulation of the genome. We present here a set of convenient, practical visualization and bioinformatics approaches, which facilitate interpretation of ChIP-Seq TF binding data.
The bioinformatics foundations of generating ChIP-Seq data have been well explored. Given mapped DNA sequence reads from a ChIP experiment, “peak calling” software is applied to quantitatively delineate regions within which a greater frequency of mapped reads are observed than expected. The peak regions are reported as a pair of coordinates, ranging from 1 bp to >5000 bp wide, often accompanied by a score and the position at which the read frequency local maximum is observed. Within the peak regions, the coordinates of specific TF-DNA interactions can be computationally inferred, using a TFBS profile for the indicated TF. Such profiles, most commonly in the form of Position Frequency Matrices (PFMs), are available for a subset of TFs from databases like JASPAR  or HoCoMoCo , or can be computationally identified from the ChIP-Seq data using de novo motif discovery tools [3–6]. Analysis with a PFM converted to a weighted TFBS profile (Position Weight Matrix – PWM) yields a score that reflects the similarity of the sequence of interest to the modeled binding sites. Although ChIP-Seq data reduces the acknowledged specificity problem of detecting short (6-15 bp), degenerate motifs bound by a TF in the genome, the problem of TFBS prediction is not perfectly resolved as the ChIP-Seq peaks are often 20-fold or greater in length than the TFBS being searched for. As they become more widely used, higher resolution methods, such as ChIP-exo , are expected to reduce the difficulty.
A proportion of ChIP-Seq peaks may not contain a canonical TFBS for the ChIP’d TF above background expectation; a confounding property of the data that presumably arises from a combination of biological, experimental, and computational influences. While these regions may result from indirect interactions between the TF and the DNA, the multi-epitope specificity of polyclonal antibodies and the tendency for chromatin to shear at promoter regions [8, 9] may give rise to peaks not specific to the ChIP’d TF. The subset of peaks lacking the TF’s canonical motif is commonly treated as equivalent to the subset with motifs. The segregation of a ChIP-Seq dataset into the two classes could lead to insights into individual TFs mechanisms of regulation and reveal common properties of regions lacking TFBS. The analysis of specific TF bound regulatory regions and TFBSs from ChIP-Seq defined peak regions can be refined, and such improvement will consequently inform and improve our utilization of ChIP-Seq data across a spectrum of analyses.
In this report, we introduce a set of visualization methods and bioinformatics approaches to improve the study of TFBSs within ChIP-Seq regions, and demonstrate the application of these methods for the generation of new insights into regulatory sequences. We focus on three key challenges: known motif over-representation analysis, spatial visualization of TFBS positions, and determination of parameters for TFBS analysis. For over-representation analysis, we introduce the BiasAway tool to account for the non-random properties of regulatory sequences; such accounting has strongly informed the design of de novo motif discovery methods, but has been inadequately addressed for over-representation studies. We introduce a set of visualization approaches that reveal topological patterns of motif positions within ChIP-Seq data, helping to delineate the subset of peaks likely to be directly bound by the ChIP’d TF. The visualization methods directly inform the selection of parameters for motif prediction, a long-standing challenge in regulatory sequence analysis. Application of the procedure reveals that on average 61% of ChIP-Seq peak regions contain the canonical motif for the ChIP’d TF. The methods are applied to two cases related to TBP and ZNF143/THAP11. Access to the new methods and visualization approaches will provide the research community with improved capacity to analyze and interpret TF ChIP-Seq data.
Composition studies reveal the influence of non-random properties of the metazoan genome on the interpretation of ChIP-Seq data
The non-random properties of genome nucleotide sequence composition have been the subject of extensive investigation over decades. Key to the study of regulatory sequences, are observations that promoters and other open chromatin tend to have higher GC content, and that the observed composition has a wide distribution. We evaluated the relationship between mononucleotide composition and TF ChIP-Seq data (see Additional file 1: Text S1). We found that each TF exhibits a range of nucleotide composition environments in which it binds. The GC content distributions differ between TFs profiled in the same cells and between different cell types profiled for the same TF (Additional file 2: Figure S1). There is a clear relationship between the GC content of the peak regions and the multiplicity of TFBSs (Additional file 1: Text S1, Additional file 3: Figure S2).
Consequences of biased composition for TFBS motif over-representation analyses
It is a central practice in the analysis of regulatory regions to evaluate the frequency of predicted TFBSs in the regions of interest. This type of analysis depends on comparing predicted motif frequencies against a reference background. Such analysis of TFBS over-representation could be influenced by the compositional properties of the peaks and the background sequences. While accounting for background composition properties has been explored for de novo pattern discovery, the best approaches for and impact of composition correction on over-representation analysis of known motifs have not been resolved.
Visualizing composition bias in TFBS over-representation analyses
The impact of background composition selection on TFBS over-representation results
For many datasets the calculation of meaningful over-representation scores requires correction of bias. We assessed approaches to create and/or retrieve a background matched to a set of target sequences’ compositional properties in order to resolve the enrichment scoring bias engendered by compositional extremes of some ChIP-Seq datasets. For evaluation purposes we retained the naïve background model of random genomic sequences in our assessment. A second background was generated by a 3rd order Markov model (RSAT package ). We implemented a background sequence generator, BiasAway, to generate 6 additional background models; these backgrounds derived from mono- and di- nucleotide shuffled sequences, and genomic sequences matched to the GC content of the target data (see Additional file 1: Text S1 for details). We included one last set of backgrounds generated by “known motif” over-representation analysis feature of HOMER 2 . HOMER 2 is the only software of which we are aware that uses GC composition matched backgrounds for TFBS over-representation analysis. These backgrounds were then evaluated against 43 human TF ChIP-Seq datasets, with 166 PWMs (JASPAR 4.0_alpha development database ) via the oPOSSUM 3.0 TFBS over-representation software . Four backgrounds were re-evaluated for platform-independence of both bias and bias correction with the ASAP tool (see Additional file 1: Text S1 and Additional file 4: Figure S3).
We summarize the background evaluations on 201 bp sequences here, with further details available in Additional file 1: Text S1 for both 201 bp and 401 bp sequences. Additional file 6: Table S1 lists the rank of the ChIP’d TF for each analysis, and Additional file 7: Figure S5 provides the CB-plots of E2F1 for six background analyses. A naïve background of randomly selected sequences resulted in a systematic bias, or skew, of over-representation scores towards GC-rich binding sites for some datasets. The mononucleotide backgrounds were not as favourable as the dinucleotide shuffled backgrounds, with or without the sliding window. While neither suffered from bias towards GC-rich binding sites, the dinucleotide shuffled backgrounds produced results with over-representation scores closer to zero and predicted the ChIP’d TF’s motif in the top 5 results of more than 60% of the datasets. The 3rd order Markov model performed as well as the dinucleotide background for all measures except that of capturing the ChIP’d TF’s motif in the top 5 results. The best performing background was genomic sequences selected to match the GC nucleotide distribution of the target sequences. The BiasAway GC nucleotide background and HOMER generated backgrounds performed comparably for most measures, with the exception that the BiasAway background resulted in an 11 percentage point improvement for the inclusion of the ChIP’d TF binding model as one of the top 5 over-represented.
The CB-plots have been incorporated into the oPOSSUM over-representation analysis software, and BiasAway is available as open-source code posted on GitHub: https://github.com/wassermanlab/BiasAway/archive/noRPY.zip, and is being incorporated into the oPOSSUM 3.0 web interface.
Assessing TFBS predictions within ChIP-Seq regions
Subsequent to motif over-representation analysis, attention turns to investigating the candidate TFBS binding profiles. We outline here a complementary set of enrichment assessment methods specifically oriented to providing such insights. We implement a heuristic algorithm for direct binding (HADB) to determine the subset of ChIP-Seq peaks with high confidence binding sites, based on TFBS enrichment proximal to the peakMax, and to derive a PWM-specific scoring threshold. As with the previous section, visualization is a key method for discerning the properties of the data. Within the R environment  we implemented a set of plotting methods (TFBS-landscape view, TFBS-bi-motif view, and Dinucleotide-environment view) for displaying the properties of the ChIP-Seq regions relative to the locations of predicted TFBS. The code and user-guide for the visualization methods and calculating the HADB thresholds are posted on GitHub: https://github.com/wassermanlab/TFBS_Visualization/archive/master.zip.
The TFBS-landscape view for ChIP-Seq data facilitates qualitative assessment of the non-random relationship between predicted motif scores (for the top scoring motif in each sequence) and the peakMax position. The information in the TFBS-landscape view is translatable into a quantitative assessment of TFBSs, as presented by the HADB algorithm below. Motif proximity to the peakMax has been demonstrated on 14 ChIP-Seq datasets by  and for 3 datasets across 11 peak calling algorithms . The global tendency for ChIP-Seq data to yield motifs for the ChIP’d TF proximal to the peakMax is systematically confirmed here with a study of ~340 ChIP-Seq datasets for ~95 TFs (human and mouse).
Using the HADB algorithm to distinguish direct binding and for the selection of PWM motif score thresholds
Deciding upon an appropriate minimum scoring threshold for PWM-based analysis is an ongoing challenge, as each TF has unique characteristics. Quantitative analysis of the peak TF motif densities relative to chance expectation can delineate a suitable threshold for motif scores for a given PWM. We set the motif score threshold as the lowest motif score for which the count of motifs within the heuristic enrichment zone consistently exceeded the count of motif scores in comparably sized background regions (see methods; Additional file 9: Figure S7a-b). The motif score threshold does not vary greatly for an individual PWM across multiple datasets. The average relative score for each of the human and mouse datasets was 82 ± 1 (mean score ± mean difference between scores; Additional file 9: Figure S7c-d). A motif score threshold specific to each TF PWM is important for downstream analyses such as the study of TFBS altering mutations. The PWM score thresholds for the studied TF binding profiles are provided in Additional file 10.
We used the heuristic boundaries of enrichment and motif score threshold for each dataset to estimate the proportion of peaks that contain at least one TFBS for the ChIP’d TF within the bounds of the thresholds. We found that on average, for the datasets studied, ~61% of a ChIP-Seq dataset contains the ChIP’d TF’s canonical motif that is both within the peakMax enrichment zone and greater than the motif score enrichment threshold (Figure 4d – human mean 61%, and Additional file 8: Figure S6b – mouse mean 65%). The source of the remaining ~40% of peaks may be from such factors as indirect binding, shearing properties of open chromatin, peak calling errors, or antibody properties.
Regions predicted by HADB to directly bind the ChIP’d TF are more likely to replicate between experiments
When available, replicate experiments are a useful validation of regions ChIP’d by a TF. While many experiments lack replicates, ~100 ENCODE datasets included replicates which we used to evaluate whether the HADB method was differentiating between peaks that co-occur between replicates (peakMax within 500 bp) versus those peaks that occur in only one of the replicates (see methods). We found that the set of peaks with the ChIP’d TFs motif central to the peakMax were significantly enriched for replicated peaks (92% of datasets produced a Fisher exact test one tailed p-value <0.001 (91% produced scores <1e-09)) (see Additional file 1: Text S1 for similar results with different peakMax separation parameter settings)).
Regions predicted by HADB to directly bind the ChIP’d TF are enriched for GO terms related to the TF’s key biological process
To assess the functional enrichment of peaks defined by the HADB method, we performed GO enrichment analyses, using the GREAT software . As GO enrichment analysis for TFBSs is somewhat problematic due to the diversity of processes a TF may regulate and the proximity of TFBS to the genes they regulate, we selected two TFs known to be highly specific for a biological process: SRF, a master regulator of the actin cytoskeleton and contractile processes , and NFE2L2, a key regulator of oxidative stress response . We submitted peaks from the whole ChIP-Seq dataset, peaks from the subset of regions inferred by HABD to be directly bound by the ChIP’d TF, and those not inferred to be directly bound. GREAT analyses reported an enrichment of terms for the expected biological processes for SRF and NFE2L2 only in the subset of peaks with inferred direct binding of the ChIP’d TF (see Additional file 1: Text S1, Additional file 11: Figure S8 and Additional file 12: Figure S9). The results highlight how the use of the HADB method can enhance the interpretation of ChIP-Seq data.
Complementary visualization methods for spatial ChIP-Seq patterns: TFBS bi-motif view and Dinucleotide-environment view
We present two visualization methods that further empower investigation of spatial patterns in ChIP-Seq data that build upon the foundation of peakMax-proximal enrichment defined by HADB.
Three examples of bi-motif views are presented in Figure 5. Homotypic clustering is observed for ESRRB and NFYB. The NFYB clustering is consistent with published findings for the NF-Y complex , while the ESRRB observation suggests that properties observed for ESRRG may extend to other members of the TF family . A third plot for SRF and ELK4 (also called ‘SRF accessory protein’), using SRF ChIP-Seq data, presents heterotypic binding site results, consistent with known interactions between these TFs .
An Applied Case Study of ChIP-Seq Analyses
To demonstrate how over-representation analysis and the visualization methods are complementary, and enhance the analysis of ChIP-Seq data, we applied the TFBS-landscape view and the Dinucleotide-environment view to two case studies. The first focuses on the TATA binding protein (TBP) in the MEL mouse cell line, while the second explores the sequence properties proximal to ZNF143 TFBSs.
We have introduced methods that allow researchers confronted with ChIP-seq data for TF binding to extract more insights into TFBS. The first challenge in such studies is often the identification of contributing TFs based on known motif over-representation analysis. Such methods can both support the quality of data based on the identification of the ChIP’d TF’s motif, as well as highlight additional TFs that may be cooperatively acting with the former. In this report we present two key components related to over-representation analysis. First, Composition-bias (CB) plots are introduced to display skew in the GC-content of the enriched motifs, a common occurrence that reflects the non-random composition properties of the ChIP-Seq regions recovered. Second, the BiasAway software tool is introduced to generate composition-matched background sequence sets that correct for the skew. Once relevant TF binding profiles are identified, the challenge shifts to the identification of reliable TFBS within the broader ChIP-Seq regions. Every ChIP-Seq dataset is a mixture of directly bound segments and regions that may reflect alternative influences. We introduce TFBS-landscape plots as a convenient form for the visualization of the topological distribution of TFBS within ChIP-Seq segments. Based on the topology, we present the HADB method for the selection of PWM thresholds for the selection of candidate TFBS consistent with direct binding events. The HADB approach provides a quantitative method for selecting PWM thresholds – an enduring challenge in regulatory sequence analysis. Systematic analysis over ~340 ChIP-Seq datasets indicates that an average of 61% of peaks are classified as directly bound. The biochemical, experimental, and/or computational influences that account for the remaining 39% of regions remain to be resolved in future investigation. In the third stage of analysis, spatial properties relative to the defined TFBS and peakMax are analyzed. Bi-motif plots allow the study of inter-TFBS spacing concurrent with HADB assessment. To gain insight into additional properties at the edges of reliable TFBS, Di-nucleotide-environment views are introduced through which compositional enrichment in the flanking regions can be revealed. In combination, the software and methods allow for researchers to launch deeper explorations of TFBS within ChIP-Seq data.
The work builds on a substantial foundation of bioinformatics approaches to regulatory sequence analysis. Over-representation analysis of known TF motifs complements de novo motif discovery methods such as MEME. The later methods often incorporate better background representation to account for the non-random properties of sequences. The CB-plots highlight the problem for over-representation methods, providing a useful means for researchers to rapidly assess the quality of results. We introduced the BiasAway tool to correct for the bias. The HOMER 2 package includes an unpublished option for background correction of over-representation analysis, which does not perform as well as the GC composition matched option in BiasAway (HOMER was originally described in ). The open-access BiasAway provides a general purpose background generation capacity, which can be used as a component in other tools going forward. Both BiasAway and the CB plots are being incorporated into the online oPOSSUM-3 motif over-representation analysis tool .
The spatial analysis of TFBS positions within ChIP-Seq data has been a central focus in the development of several bioinformatics methods. The MEME Suite includes the CentriMo software which can evaluate both known motif collections and de novo pattern discovery derived motif models based on the centrality of predicted TFBS within ChIP-Seq peak regions . The SpaMo component of the same package evaluates the statistical significance of spacing between predicted pairs of TFBS across ChIP-Seq regions . The GEM system performs de novo pattern discovery of spatially correlated motif pairs . While these approaches touch on the same theme, they are not directly comparable to the work presented here. Our methods are complementary to the published work, allowing researchers to more fully explore the properties of ChIP-Seq data.
We introduced the HADB procedure for the selection of appropriate PWM score thresholds for the prediction of TFBS within ChIP-Seq data. The threshold parameter for TFBS calling with PWMs is treated in diverse ways in the research literature. In some cases thresholds are individually selected for each matrix based on the determination of empirical p-values based on distributions of scores observed for a sequence pool . This has been extended to generating empirical p-value thresholds for composition-matched pools of sequences . In other cases, percentile-based scores are applied uniformly across diverse matrices . Biophysical approaches based on the calculation of binding energies have also been introduced . The HADB procedure provides an experimental basis for the selection of the threshold parameter for a TF that can be applied to the matrix in multiple data sets, as the results show little variation in the value for different ChIP-Seq studies for the same TF.
The prevalence of directly bound regions in ChIP-Seq data has received increasing attention. The ChIP-Seq protocol is undeniably successful at retrieving TF bound regions, however, the data from these studies of sequence-specific TFs is invariably shaped by a mixture of biological, experimental and computational influences. From our perspective, TF ChIP-Seq data is effectively bi-partite: a subset of ChIP’d regions have a direct relationship to the ChIP’d TF made evident by the presence of the TF’s binding motif near the peak local maximum, while the remaining subset of regions are lacking the motif. We found that on average, for the ~340 datasets studied, a third of peak regions lack canonical motifs proximal to the peakMax. Some of these peaks may result from indirect interactions (in which the ChIP’d TF interacts with a different DNA bound protein) or result from a change to the ChIP’d TF’s binding properties due to an interaction partner. Methods have been proposed to infer indirect binding by the enrichment of secondary TFs motifs, such as , but the presence of secondary motifs do not reliably confirm the presence of the ChIP’d TF at the regions. We have observed that the motifs of some TFs are enriched at the peakMax for more than 10 other unique TFs ChIP-Seq datasets across diverse cell lines (Worsley-Hunt, submitted), suggesting that some peaks in ChIP-Seq datasets may result from a mechanism that is not specific to the ChIP’d TF. Some additional potential sources of non-direct ChIP-Seq regions, unrelated to indirect binding, include chromatin structure properties, antibody properties, stochastic noise, or peak calling software. Two recent papers have identified highly transcribed regions as enriched in the non-direct subset of ChIP-Seq data in yeast [9, 36]. We do not suggest that any peaks be dismissed, as all are potentially informative. However, we believe that it is appropriate to use the methods introduced in this report to segregate the direct-binding subset that can be explained by the ChIP’d TFs motif within the limited range of confidence around the peakMax, allowing the two subsets to be individually analyzed.
We highlighted the investigative potential of the methods and visualization approaches in two case studies. First, a study of TBP (TATA binding protein) ChIP-Seq data, where an enrichment of SINE elements and many secondary TFs’ motifs suggests that the adaptive use of repetitive sequences as promoters may be more frequent than widely thought. Second, a study of ZNF143 that revealed the ZNF143 canonical motif is part of a wider pattern than presented by the existing PWM, which has been shown by  to be the binding site for THAP11, a protein that works antagonistically with ZNF143.
This broad survey of ChIP-Seq data from diverse studies provides greater clarity about the properties of the data that impact the quality of interpretation. Using the methods presented here, the applied researcher can visualize the distribution of predicted TFBSs and calculate thresholds for both the maximum motif enrichment distance from the peakMax, and the PWM scores, for classifying TFBS-containing peaks. In addition to ChIP-Seq peak classification, the PWM scores thresholds set a lower limit on the range of motif scores expected for functional sites, which will be useful for downstream analyses such as binding site mutation analysis. These methods and approaches will improve TFBS enrichment analyses and the applied analysis of ChIP-Seq data, particularly for the annotation of reliable TFBSs within ChIP-Seq peaks.
We downloaded ChIP-Seq datasets from the GEO database: 1) GSE11431 - thirteen mouse ES cell datasets ; 2) GSE25532 – mouse NFYA data in ES cells ; 3) GSE17917 and GSE18292 – human KLF4, POU5F1, C-MYC, NANOG, and SOX2 data ; and 4) GSE22078 – human and mouse C/EBPA and HNF4A . A dataset for mouse FOXA2 was downloaded from http://www.bcgsc.ca/data/histone-modification/histone-modification-data . We also used ENCODE ChIP-Seq datasets (human and mouse), and human ChIP-Seq controls  downloaded from the UCSC ENCODE database . The ENCODE broadPeak datasets frequently occurred in replicate; we selected the larger replicate for our analyses. Where only the mapped sequence data was available (5 datasets), we called peaks using FindPeaks 4.0  with the following parameter options: −dist_type 1 200 -subpeaks 0.6 -trim 0.2 -duplicatefilter.
As the downloaded ChIP-Seq peaks were reported with a multitude of lengths, ranging from 1 bp to >5,000 bp, we trimmed or extended all peaks to a constant length (201 bp, 401 bp, or 1001 bp) centered at the local peak maximum, or at the peak centre for datasets which do not have peakMax positions provided.
A control set of “mappable” regions was generated from the CRG Alignability (36mer) data  downloaded from the UCSC ENCODE database . Unique, contiguous CRG Alignability 36mer regions were merged, and the resulting larger regions were then split into multiple non-overlapping regions of length 201 bp or 401 bp. This yielded two datasets of mappable regions, to be used as a source of background sequences in later analyses.
The DNase accessible control datasets used in over-representation analyses were generated from DNase I hypersensitivity datasets (UW DNase: University of Washington)  downloaded from the UCSC ENCODE database . To obtain a dataset that reflected a broad range of DNase accessibility, DNase I hypersensitive segments from multiple cell lines were concatenated, and contiguous or overlapping regions merged. The resultant sequences were split into one of the two lengths, 201 bp and 401 bp, to generate two large datasets of DNase accessible sequences to be used as a source of background sequences in later analyses.
The Ensembl Perl API was employed to retrieve sequences from GRCh37/hg19 and NCBI37/mm9 assemblies. Where the genome coordinates first needed conversion to GRCh37/hg19 or NCBI37/mm9, we used a locally installed version of the UCSC lift-over tool .
Position frequency matrices (PFMs) were obtained from the JASPAR  development 4.0_alpha database of transcription factor models (April 2013). As the development set has been subsequently revised for a curated release, we provide the entire set of matrices as used in this report in Additional file 14. The PFMs were converted to position weight matrices (PWMs) using the TFBS Perl modules , with default background values for a uniform nucleotide background.
Motif prediction was performed with C code adapted from the TFBS Perl Modules , which scans sequences for TFBS instances and reports both the motif location and a PWM relative motif score. Where multiple motifs per sequence per PWM were predicted, the reported motifs were not permitted to overlap by more than one-fifth the PWM length (e.g. a 7 bp motif could only overlap a neighbouring motif by 1 bp).
The mononucleotide GC content of sequences was determined from a count of each nucleotide type in the sequence, based on single stranded DNA. The composition of the TF profiles was determined similarly; from a count per each nucleotide base in the position frequency matrix (PFM), and subsequently obtaining the ratio of GC nucleotides.
Motif over-representation analyses
Motif over-representation analyses were performed on 201 bp and 401 bp sequences with a locally installed version of oPOSSUM 3.0  and the online ASAP tool . Within oPOSSUM we used the sequence-based analysis default settings, aside from supplying our own set of 166 PFMs. The oPOSSUM software converts the PFMs to PWMs using the default setting of the TFBS Perl modules . Over-representation scores of “infinite” value were set to the greater of either 500 or to the maximum non-infinite enrichment score plus 100. For ASAP analyses, we chose parameter values similar to those used for the oPOSSUM analyses. The ASAP tool limited the number of nucleotides submitted for each analysis, therefore our input was limited to a randomly selected subset of 10,000 sequences on each submission.
The backgrounds used for the over-representation analyses were individually generated for each set of target sequences to be analyzed, and matched to the length of the target sequences. The backgrounds were derived from either the target sequences (e.g. dinucleotide shuffle methods, or a Markov model), a dataset of uniquely mappable sequences, or DNase accessibility data (see Datasets, above).
For evaluation purposes we retained naïve backgrounds in our assessment, which were composed of randomly selected sequences from either the uniquely mappable portion of the human genome, or DNase accessible regions.
3rdorder Markov model background
A 3rd order Markov model background was generated using a combination of the oligo-analysis and random-seq programs from a local installation of the RSAT package . The target sequences of interest were submitted to RSAT::oligo-analysis with the parameters: −l 4 -1str (word length 4 bp, and single strand), and the results file was in turn submitted to RSAT::random-seq with the parameter: −ol 4. The RSAT::random-seq program then used a 3rd order Markov model trained on 4 bp oligo-nucleotides to generate sequences that matched the GC content and length of the target sequences. While Markov models have been previously introduced to generate background sequences, the ideal order of the models is not clear. If the order is too high, the model will simply recapitulate the frequency of the TFBSs in the target dataset, which is why we restricted the model to 3rd order.
HOMER 2 GC background
HOMER 2  was downloaded from http://homer.salk.edu/homer/ and installed on a cluster. The findMotifsGenome.pl tool was used to perform a known motif over-representation analysis on each of the 43 datasets using the following options: −N number_of_seqs –nomotif –dumpFasta –size length_of_sequence. The number_of_sequences was the number of sequences in each dataset, and the length_of_sequence was either 201 bp or 401 bp, dependent on the analysis. The –dumpFasta option was set in order to retrieve the backgrounds after the analysis was finished running. We provided the hg19 human genome (from UCSC) and the coordinates of the peaks for each dataset.
BiasAway Background Generating Tool
Six background models were implemented as a background sequence generator: BiasAway. The BiasAway tool has been implemented in Python using BioPython modules , and is available as open source code from GitHub https://github.com/wassermanlab/BiasAway/archive/noRPY.zip. The tool provides a user with six approaches for generating a background useful to over-representation analyses: 1) mononucleotide shuffled sequences, 2) dinucleotide shuffled target sequence to preserve the dinucleotide composition of the target sequences, 3) genomic sequences matched to the mononucleotide GC content of each target sequence to preserve the non-random association of nucleotides, 4) sliding windows of mononucleotide shuffled sequence, 5) sliding windows of dinucleotide shuffled target sequence, and 6) genomic sequences matched in windows of internal mononucleotide GC content for each target sequence. The latter two backgrounds (BiasAway 5–6) are variants of the former two backgrounds (BiasAway 2–3), in which we utilized a sliding window over the ChIP-Seq sequences to determine a distribution for local regions of composition. The background sequence set is then generated (dinucleotide shuffle) or selected from a pool of genomic sequences (genomic composition match) to match the distribution of window compositions for each target sequence. These latter backgrounds were considered because due to evolutionary changes such as insertion of repetitive sequences, local rearrangements, or biochemical missteps, the target sequences may have sub-regions of distinct nucleotide composition. See Additional file 1: Text S1 Supplemental Methods for greater detail.
Measures to evaluate over-representation analysis results
To summarize the impact that the choice of background has on the reported over-representation results, we assessed the over-representation results by four measures: 1) the skew of the over-representation scores, 2) the count of datasets with the ChIP’d TF’s binding profile reported in the top 5 over-representation results, 3) the mean of the over-representation scores (excluding outlying scores), and 4) the variability of the over-representation scores (excluding outlying scores). The first measure, the skew of the over-representation scores, is the negative of the slope of the line fitted to the over-representation scores (y-value) and the associated PFM GC content values (x-value; see Figure 2a for reference). The skew informs us of the degree to which a dataset is biased towards extreme composition motifs i.e. GC-rich or AT-rich; the more negative the skew, the greater the bias. The second measure calculates the proportion of datasets for which the over-representation analysis (using a given background type) reports the ChIP’d TF’s motif in the top 5 results. The skew and the proportion of datasets are plotted together (Figure 2a). Background types that consistently yield a low skew value and return the ChIP’d TF in the top results, are ideal. We used the third and fourth measures to assess the tendency for the over-representation analysis to report the majority of TF’s motifs as having a low over-representation score. Therefore we set a threshold of the mean plus one standard deviation, to remove the highest over-representation scores from further analysis. We then obtained the mean and standard deviation of the remaining scores. Ideally the results would display both a low mean and a low standard deviation, as this indicates that the majority of TF’s motifs are close to an enrichment score of 0.
Enrichment visualization plots and HADB boundaries of enrichment
For the evaluation of ChIP-Seq datasets with the HADB method, we restricted to those datasets for which we had a PWM for the ChIP’d TF.
Composition-Bias (CB) plots are generated to detect \whether GC-rich or AT-rich PWMs are over-represented in a TFBS motif over-representation analysis. The GC content of the PFM’s submitted to the over-representation analysis are presented on the x-axis, and the over-representation score of the predicted TFBSs on the y-axis.
TFBS-landscape views were generated to visualize central enrichment of a TF’s motifs within a ChIP-Seq dataset. TFBS prediction was done on 1001 bp regions, to extend sequences sufficiently far into the flanks to present the background rate of TFBS prediction. The view’s left plot presents the motif score of the top scoring predicted motif in each peak for the given TF PWM on the y-axis, and the motif’s signed distance from the peakMax on the x-axis, where the peakMax is x = 0. The right plot presents two densities depicting: 1) the distances of the top scoring motifs to the peakMax for all peaks, and 2) the distances to the peakMax for the subset of peaks for which the top scoring motif was equal to or greater than 85 to capture cases with low enrichment of strong scoring motifs. The value of 85 was based on the default parameters of the oPOSSUM software, for which 85 was chosen to best discriminate known TFBS from background. The x-axis of the right plot is the distance of the motifs to the peakMax (x = 0), and the y-axis is the probability density, which is a reflection of peak frequency for given distances. The default peak length was 1001 bp. For the TBP case study, we overlaid multiple smoothed densities of motif-to-peakMax distance enrichments on one plot (Figure 7b).
HADB motif enrichment threshold
The data used for the TFBS-landscape plots (TFBSs and 1001 bp sequences), was also used by the HADB method. To calculate a heuristic enrichment threshold we identified the distance from the peakMax at which the density of the top scoring motifs (one motif per sequence) falls below the density of motifs in the background regions, such as we see in the TFBS-landscape plots (Figure 4b). The background is the region spanning 200–500 bp from the peakMax. We created 5 bp bins of the absolute distance from the peakMax, and counted the number of peaks with a top scoring motif within each bin. The count in each bin was converted to the proportion of peaks in the dataset and a regression line was fitted to the background bins, where x was the upper limit of the 5 bp bin, and y was the proportion of peaks in the bin. To adjust for the variation in the y-values, we created an ‘allowance line’ by shifting the regression line along the y-axis by 2-fold the 3rd largest residual value (we omitted the two most extreme residuals as they were generally outliers). We then selected the largest non-background bin (X) that had a y-value greater than the allowance line’s predicted y-value at bin X (Figure 4a). The heuristic threshold for motif enrichment was set at the greatest distance represented by bin X (e.g. bin X is 100 bp-105 bp, therefore the threshold is 105 bp).
HADB motif score lower threshold
A similar procedure was used to determine a lower limit threshold for a PWM’s motif scores (Additional file 9: Figure S7). We generated bins of the top motif scores (one per peak) and compared the bins in the enrichment zone to bins in the background zone. Bins were populated with the count of peaks corresponding to a bin’s motif score range. Each bin’s range was 1 score point e.g. for bin81, the range was 80 < X < =81. ‘Central bins’ were within the enrichment zone as defined using the heuristic enrichment thresholds described above. An equal number of ‘control bins’ were generated using background regions outside of the enrichment zone, of the same width as the central enrichment zone (e.g. if the central enrichment zone was 180 bp, then 180 bp of background motif scores were also sampled and binned). The two sets of bins were compared to identify all the central score bins that exceeded the number of peaks in their matching control bin by greater than 20% of the maximum control bin; the maximum was selected from the set of control bins equal to or of higher value than the bin of interest (i.e. to the right of the bin of interest on the x-axis of Additional file 9: Figure S7a). By using the set of control bins of higher value than the bin of interest, we limited cases where low scoring bins with few peaks in the bin were selected as a threshold. The array of bins that passed the aforementioned 20% threshold was then used to determine the lower boundary of motif score enrichment. Proceeding from the upper score bins to the lower score bins, we identified the lowest set of 5 consecutive central bins (centered on the bin of interest) where at least 4 of the bins, including the bin of interest, exceeded the control bins. The latter requirement was to reduce the chance of the algorithm selecting a central bin that was an outlier in the surrounding neighbourhood of bins. The central bin preceding the flagged bin was selected as the threshold. We provide two PWM thresholds in Additional file 10: a threshold based on signal 20% above the background, as outlined above, and a threshold based on signal 5% above the background; the median absolute difference between the two different thresholds is 2 points (e.g. a motif score of 80 versus 82). The parameter of >20% signal was a heuristic limit selected to reduce stochastic noise in the HADB results.
The diversity of enrichment zones and relative motif score thresholds from the datasets were summarized in Figure 4c, Additional file 8: Figure S6a and Additional file 9: Figure S7c-d. The enrichment zone thresholds for multiple datasets of a given TF were averaged to select one threshold value per TF. To assess the similarity of the thresholds for a given TF, we calculated the average of the differences between every dataset’s threshold for the given TF, and plotted the average difference as bars above and below the average threshold. We used the same approach for reporting the average motif score and the similarity of motif scores per TF.
For each given sequence, the left plot reports the location of the first motif relative to the peakMax on the y-axis, and the location of the second motif relative to the first motif on the x-axis. A horizontal band of enrichment around y = 0 shows that the first motif is enriched at the peakMax, while a band of enrichment on the diagonal is the second motif’s enrichment at the peakMax. A gap at x = 0 is due to restricting the overlap of motifs. The diagonal limits of the plot arise from the sequence length limit, here 1001 bp. Increased density at the origin indicates that a number of peaks have both motif 1 and motif 2 enriched at the peakMax. The x-axis on the left plot was set as the distance between motifs in order to mirror the right plot, The right plot is a 10 bp resolution histogram of the distances between the two motifs.
To create a Dinucleotide-environment view, we wrote a Perl script to align a set of sequences on the motif of interest with the motif oriented in the same direction, and assess the frequency of dinucleotides at every position of the aligned sequences. The file of dinucleotide frequencies is submitted to an R script (R version 2.14.1). The R script calculates the proportion of sequences with a particular dinucleotide at each position of the alignment, and plots the position on the x-axis and the proportion of sequences on the y-axis.
Visualization was performed using the statistical package R (version 2.14.1) . The R functions for generating the four visualization plots and the perl code for generating dinucleotide frequencies from motif-anchored sequence alignments are available at GitHub https://github.com/wassermanlab/TFBS_Visualization/archive/master.zip. Plots for visualizing the heuristic threshold decisions, as seen in Figure 4a and Additional file 9: Figure 7a, are included in the provided R code.
Analysis of broadPeak replicate experiments
The ENCODE broadPeak datasets are available as replicates, which allowed us to determine whether the peaks predicted to be directly bound by the ChIP’d TF due to the presence of the TF’s motif were enriched for occurrence in both replicate experiments. Replicate experiments were pooled together, with neighbouring peaks (two peak maximums within 500 bp or 1000 bp) flagged as a single instance of a region. The 500 bp distance was selected to be inclusive of the ~400 bp median width of the broadPeak regions; 1000 bp was selected to explore the sensitivity to longer settings. Using a Fisher exact test, we then compared the proportion of replicated regions versus regions unique to a single experiment for the set of peaks with the ChIP’d TFs motif within the heuristic enrichment zone, and for the set of peaks without the ChIP’d TFs motif. Peaks that had been flagged as having the ChIP’d TFs motif in one experiment but without the ChIP’d TFs motif in the other experiment were rare (median 0.28% of the pooled replicates for an experiment), and we chose to omit these from the analysis.
GO term enrichment analysis
We used the default settings of the web-based version of GREAT  http://bejerano.stanford.edu/great/public/html/splash.php. We submitted those peaks predicted to contain the ChIP’d TF’s motif by the HADB method, the peaks without the HADB predicted ChIP’d TF’s motif and peaks from the whole dataset. We used the Binomial Region Set Coverage and the number of terms related to actin and oxidative stress, to assess differences between the three sets of peaks.
TBP case study – repeat elements analysis
Repeat elements in the TBP peaks were assessed using data downloaded from the UCSC mm9 Repeat Masker track via the UCSC Table Browser .
Chromatin immunoprecipitation and high-throughput sequencing
Peak local maximum
Position frequency matrix
Position weight matrix
Transcription factor binding site.
We thank Miroslav Hatas for systems support and Dora Pak for management support, and the members of the Wasserman lab for suggestions and feedback. We are indebted to the researchers around the globe who generated the ChIP-Seq data. The work was supported by funding from the National Institutes of Health (USA) grant 1R01GM084875, the Canadian Institutes for Health Research, the National Science and Engineering Research Council (NSERC), and GenomeBC and GenomeCanada (ABC4DE Project). RWH was supported by fellowships from the Canadian Institutes of Health Research and the Michael Smith Foundation for Health Research. LdP was supported by funding from a Fundación Caja Madrid mobility fellowship for visiting professors 2011–2012, and the Spanish Ministry of Science and Innovation, MICINN, grant number SAF2011_24225.
- Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A: JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2010, 38 (Database issue): D105-D110.PubMed CentralPubMedView ArticleGoogle Scholar
- Kulakovskiy IV, Medvedeva YA, Schaefer U, Kasianov AS, Vorontsov IE, Bajic VB, Makeev VJ: HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. 2013, 41 (Database issue): D195-D202.PubMed CentralPubMedView ArticleGoogle Scholar
- Machanick P, Bailey TL: MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics. 2011, 27 (12): 1696-1697.PubMed CentralPubMedView ArticleGoogle Scholar
- Georgiev S, Boyle AP, Jayasurya K, Ding X, Mukherjee S, Ohler U: Evidence-ranked motif identification. Genome Biol. 2010, 11 (2): R19-PubMed CentralPubMedView ArticleGoogle Scholar
- Kulakovskiy IV, Boeva VA, Favorov AV, Makeev VJ: Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics. 2010, 26 (20): 2622-2623.PubMedView ArticleGoogle Scholar
- Thomas-Chollier M, Herrmann C, Defrance M, Sand O, Thieffry D, Van Helden J: RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res. 2012, 40 (4): e31-PubMed CentralPubMedView ArticleGoogle Scholar
- Rhee HS, Pugh BF: Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell. 2011, 147 (6): 1408-1419.PubMed CentralPubMedView ArticleGoogle Scholar
- Auerbach RK, Euskirchen G, Rozowsky J, Lamarre-Vincent N, Moqtaderi Z, Lefrancois P, Struhl K, Gerstein M, Snyder M: Mapping accessible chromatin regions using Sono-Seq. Proc Natl Acad Sci USA. 2009, 106 (35): 14926-14931.PubMed CentralPubMedView ArticleGoogle Scholar
- Teytelman L, Thurtle DM, Rine J, Van Oudenaarden A: Highly expressed loci are vulnerable to misleading ChIP localization of multiple unrelated proteins. Proc Natl Acad Sci USA. 2013, 110 (46): 18602-18607.PubMed CentralPubMedView ArticleGoogle Scholar
- Thomas-Chollier M, Defrance M, Medina-Rivera A, Sand O, Herrmann C, Thieffry D, Van Helden J: RSAT 2011: regulatory sequence analysis tools. Nucleic Acids Res. 2011, 39: W86-W91.PubMed CentralPubMedView ArticleGoogle Scholar
- Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK: Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010, 38 (4): 576-589.PubMed CentralPubMedView ArticleGoogle Scholar
- Kwon AT, Arenillas DJ, Worsley Hunt R, Wasserman WW: oPOSSUM-3: advanced analysis of regulatory motif over-representation across genes or ChIP-Seq datasets. G3. 2012, 2 (9): 987-1002.PubMed CentralPubMedView ArticleGoogle Scholar
- R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. 2013Google Scholar
- Bailey TL, Machanick P: Inferring direct DNA binding from ChIP-seq. Nucleic Acids Res. 2012, 40 (17): e128-PubMed CentralPubMedView ArticleGoogle Scholar
- Wilbanks EG, Facciotti MT: Evaluation of algorithm performance in ChIP-seq peak detection. PLoS ONE. 2010, 5 (7): e11471-PubMed CentralPubMedView ArticleGoogle Scholar
- McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM, Bejerano G: GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol. 2010, 28 (5): 495-501.PubMedView ArticleGoogle Scholar
- Miano JM, Long X, Fujiwara K: Serum response factor: master regulator of the actin cytoskeleton and contractile apparatus. Am J Physiol Cell Physiol. 2007, 292 (1): C70-C81.PubMedView ArticleGoogle Scholar
- Singh S, Vrishni S, Singh BK, Rahman I, Kakkar P: Nrf2-ARE stress response mechanism: a control point in oxidative stress-mediated dysfunctions and chronic inflammatory diseases. Free Radic Res. 2010, 44 (11): 1267-1288.PubMedView ArticleGoogle Scholar
- Gotea V, Visel A, Westlund JM, Nobrega MA, Pennacchio LA, Ovcharenko I: Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res. 2010, 20 (5): 565-577.PubMed CentralPubMedView ArticleGoogle Scholar
- Giorgetti L, Siggers T, Tiana G, Caprara G, Notarbartolo S, Corona T, Pasparakis M, Milani P, Bulyk ML, Natoli G: Noncooperative interactions between transcription factors and clustered DNA binding sites enable graded transcriptional responses to environmental inputs. Mol Cell. 2010, 37 (3): 418-428.PubMedView ArticleGoogle Scholar
- Ji Z, Donaldson IJ, Liu J, Hayes A, Zeef LA, Sharrocks AD: The forkhead transcription factor FOXK2 promotes AP-1-mediated transcriptional regulation. Mol Cell Biol. 2012, 32 (2): 385-398.PubMed CentralPubMedView ArticleGoogle Scholar
- Yu X, Zhu X, Pi W, Ling J, Ko L, Takeda Y, Tuan D: The long terminal repeat (LTR) of ERV-9 human endogenous retrovirus binds to NF-Y in the assembly of an active LTR enhancer complex NF-Y/MZF1/GATA-2. J Biol Chem. 2005, 280 (42): 35184-35194.PubMedView ArticleGoogle Scholar
- Razzaque MA, Masuda N, Maeda Y, Endo Y, Tsukamoto T, Osumi T: Estrogen receptor-related receptor gamma has an exceptionally broad specificity of DNA sequence recognition. Gene. 2004, 340 (2): 275-282.PubMedView ArticleGoogle Scholar
- Watson DK, Robinson L, Hodge DR, Kola I, Papas TS, Seth A: FLI1 and EWS-FLI1 function as ternary complex factors and ELK1 and SAP1a function as ternary and quaternary complex factors on the Egr1 promoter serum response elements. Oncogene. 1997, 14 (2): 213-221.PubMedView ArticleGoogle Scholar
- Schmid CD, Bucher P: MER41 repeat sequences contain inducible STAT1 binding sites. PLoS ONE. 2010, 5 (7): e11425-PubMed CentralPubMedView ArticleGoogle Scholar
- Ferrigno O, Virolle T, Djabari Z, Ortonne JP, White RJ, Aberdam D: Transposable B2 SINE elements can provide mobile RNA polymerase II promoters. Nat Genet. 2001, 28 (1): 77-81.PubMedGoogle Scholar
- Schaub M, Myslinski E, Schuster C, Krol A, Carbon P: Staf, a promiscuous activator for enhanced transcription by RNA polymerases II and III. EMBO J. 1997, 16 (1): 173-181.PubMed CentralPubMedView ArticleGoogle Scholar
- Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, Morgunova E, Enge M, Taipale M, Wei G, Palin K, Vaquerizas JM, Vincentelli R, Luscombe NM, Hughes TR, Lemaire P, Ukkonen E, Kivioja T, Taipale J: DNA-binding specificities of human transcription factors. Cell. 2013, 152 (1–2): 327-339.PubMedView ArticleGoogle Scholar
- Ngondo-Mbongo RP, Myslinski E, Aster JC, Carbon P: Modulation of gene expression via overlapping binding sites exerted by ZNF143, Notch1 and THAP11. Nucleic Acids Res. 2013, 41 (7): 4000-4014.PubMed CentralPubMedView ArticleGoogle Scholar
- Whitington T, Frith MC, Johnson J, Bailey TL: Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Res. 2011, 39 (15): e98-PubMed CentralPubMedView ArticleGoogle Scholar
- Guo Y, Mahony S, Gifford DK: High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Comput Biol. 2012, 8 (8): e1002638-PubMed CentralPubMedView ArticleGoogle Scholar
- Medina-Rivera A, Abreu-Goodger C, Thomas-Chollier M, Salgado H, Collado-Vides J, Van Helden J: Theoretical and empirical quality assessment of transcription factor-binding motifs. Nucleic Acids Res. 2011, 39 (3): 808-824.PubMed CentralPubMedView ArticleGoogle Scholar
- Johansson O, Alkema W, Wasserman WW, Lagergren J: Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithm. Bioinformatics. 2003, 19 (Suppl 1): i169-i176.PubMedView ArticleGoogle Scholar
- Zhao Y, Ruan S, Pandey M, Stormo GD: Improved models for transcription factor binding site identification using nonindependent interactions. Genetics. 2012, 191 (3): 781-790.PubMed CentralPubMedView ArticleGoogle Scholar
- Gordan R, Hartemink AJ, Bulyk ML: Distinguishing direct versus indirect transcription factor-DNA interactions. Genome Res. 2009, 19 (11): 2090-2100.PubMed CentralPubMedView ArticleGoogle Scholar
- Park D, Lee Y, Bhupindersingh G, Iyer VR: Widespread misinterpretable ChIP-seq bias in yeast. PLoS ONE. 2013, 8 (12): e83506-PubMed CentralPubMedView ArticleGoogle Scholar
- Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008, 133 (6): 1106-1117.PubMedView ArticleGoogle Scholar
- Tiwari VK, Stadler MB, Wirbelauer C, Paro R, Schubeler D, Beisel C: A chromatin-modifying function of JNK during stem cell differentiation. Nat Genet. 2012, 44 (1): 94-100.View ArticleGoogle Scholar
- Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM, Edsall L, Antosiewicz-Bourget J, Stewart R, Ruotti V, Millar AH, Thomson JA, Ren B, Ecker JR: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009, 462 (7271): 315-322.PubMed CentralPubMedView ArticleGoogle Scholar
- Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, Kutter C, Watt S, Martinez-Jimenez CP, Mackay S, Talianidis I, Flicek P, Odom DT: Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science. 2010, 328 (5981): 1036-1040.PubMed CentralPubMedView ArticleGoogle Scholar
- Hoffman BG, Robertson G, Zavaglia B, Beach M, Cullum R, Lee S, Soukhatcheva G, Li L, Wederell ED, Thiessen N, Bilenky M, Cezard T, Tam A, Kamoh B, Birol I, Dai D, Zhao Y, Hirst M, Verchere CB, Helgason CD, Marra MA, Jones SJ, Hoodless PA: Locus co-occupancy, nucleosome positioning, and H3K4me1 regulate the functionality of FOXA2-, HNF4A-, and PDX1-bound loci in islets and liver. Genome Res. 2010, 20 (8): 1037-1051.PubMed CentralPubMedView ArticleGoogle Scholar
- Consortium EP, Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, Snyder M: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489 (7414): 57-74.View ArticleGoogle Scholar
- Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, Wong MC, Maddren M, Fang R, Heitner SG, Lee BT, Barber GP, Harte RA, Diekhans M, Long JC, Wilder SP, Zweig AS, Karolchik D, Kuhn RM, Haussler D, Kent WJ: ENCODE data in the UCSC Genome Browser: year 5 update. Nucleic Acids Res. 2013, 41 (Database issue): D56-D63.PubMed CentralPubMedView ArticleGoogle Scholar
- Fejes AP, Robertson G, Bilenky M, Varhol R, Bainbridge M, Jones SJ: FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics. 2008, 24 (15): 1729-1730.PubMed CentralPubMedView ArticleGoogle Scholar
- Kuhn RM, Haussler D, Kent WJ: The UCSC genome browser and associated tools. Brief Bioinform. 2013, 14 (2): 144-161.PubMed CentralPubMedView ArticleGoogle Scholar
- Lenhard B, Wasserman WW: TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics. 2002, 18 (8): 1135-1136.PubMedView ArticleGoogle Scholar
- Marstrand TT, Frellsen J, Moltke I, Thiim M, Valen E, Retelska D, Krogh A: Asap: a framework for over-representation statistics for transcription factor binding sites. PLoS ONE. 2008, 3 (2): e1623-PubMed CentralPubMedView ArticleGoogle Scholar
- Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009, 25 (11): 1422-1423.PubMed CentralPubMedView ArticleGoogle Scholar
- Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004, 32 (Database issue): D493-D496.PubMed CentralPubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.