T-KDE: a method for genome-wide identification of constitutive protein binding sites from multiple ChIP-seq data sets
© Li et al.; licensee BioMed Central Ltd. 2014
Received: 26 June 2013
Accepted: 13 January 2014
Published: 15 January 2014
A protein may bind to its target DNA sites constitutively, i.e., regardless of cell type. Intuitively, constitutive binding sites should be biologically functional. A prerequisite for understanding their functional relevance is knowing all their locations for a protein of interest. Genome-wide discovery of constitutive binding sites requires robust and efficient computational methods to integrate results from numerous binding experiments. Such methods are lacking, however.
To locate constitutive binding sites for a protein using ChIP-seq data for that protein from multiple cell lines, we developed a method, T-KDE, which combines a binary range tree with a kernel density estimator. Using 132 CTCF (CCCTC-binding factor) ChIP-seq datasets, we showed that the number of constitutive sites identified by T-KDE is robust to the choice of tuning parameter and that T-KDE identifies binding site locations more accurately than a binning approach. Furthermore, T-KDE can identify constitutive sites that are missed by a motif-based approach either because a bound site failed to reach the motif significance cutoff or because the peak sequence scanned was too short. By studying sites declared constitutive by T-KDE but not by the motif-based approach, we discovered two new CTCF motif variants. Using ENCODE data on 22 transcription factors (TF) in 132 cell lines, we identified constitutive binding sites for each TF and provide evidence that, for some TFs, they may be biologically meaningful.
T-KDE is an efficient and effective method to predict constitutive protein binding sites using ChIP-seq peaks from multiple cell lines. Besides constitutive binding sites for a given protein, T-KDE can identify genomic “hot spots” where several different proteins bind and, conversely, cell-type-specific sites bound by a given protein.
Keywords: Binding pattern; ChIP-seq; Kernel density estimation; Binary range tree; Mode-finding; Constitutive site; CTCF code
Many TFs bind to DNA directly and have well-defined motif models. For such TFs, binding sites may be located by scanning their ChIP-seq peak sequences with motif models such as position weight matrices (PWMs). Using ChIP-seq data for the same TF from a number of cell lines, one would consider a binding site constitutive if it were found in a sufficiently high proportion of the available cell lines/types. We refer to this as the motif-based approach. Not every transcription factor has a known binding motif, however. Of the ~1,400 sequence-specific DNA-binding transcription factors in the human genome, only about 10-20% have known binding motifs. Thus, while the motif-based approach should work well for factors with well-defined PWMs such as CTCF, it will fail for TFs lacking reliable PWMs or for proteins that do not always bind to their target DNA sites directly.
A simple alternative approach, one that accommodates TFs that bind indirectly or that lack a well-defined PWM, is to divide the genome into fixed-width bins and count the number of ChIP-seq peak centers that fall into each bin. Bins containing peak centers for a sufficiently high proportion of the available cell lines/types are declared constitutive. Although this binning method is simple, intuitive and commonly used in genome analysis, it suffers from several drawbacks, including a boundary effect in which it may be ambiguous which of two adjacent bins contains a peak center.
Our T-KDE approach is based on the following idea. If a particular genomic locus is bound by a protein in all available cell lines, then the centers of all ChIP-seq peaks, one from each cell line, at that locus should be within close proximity (as in Figure 1). We aimed to systematically identify such sites from ChIP-seq experiments that target the same protein in multiple cell lines by simultaneously analyzing peak centers across all the experiments.
Our goal is distinct from peak calling in a single ChIP-seq experiment. A ChIP-seq peak is a genomic region (~100 to 500 bp for a typical transcription factor) enriched with sequence reads and identified using a peak-calling algorithm, e.g. [10, 11]. Various peak-calling algorithms find genomic regions enriched for a binding signal in a variety of data types. These include Hidden Markov Model (HMM)-based peak-calling algorithms for ChIP-seq data, for ChIP-chip data [12, 13], and for MeDIP-seq data. All identify peaks by modeling emission and transition probabilities using multiple states and exploiting distinct signal signatures in different states.
Despite the distinct goals, another approach for detecting constitutive protein binding sites might be to apply existing peak-calling tools to the original ChIP-seq reads from multiple cell lines simultaneously, expecting that constitutive binding sites will exhibit especially high peaks. Such an approach has several drawbacks. First, BAM files from individual ChIP-seq experiments can be very large, so combining and processing BAM files from tens or hundreds of experiments together will be computationally intensive. Second, combining read counts from multiple data sets, where some binding occurs at loci common to many data sets and some at loci specific to particular data sets, will introduce unusual patterns of variation in read counts that could bias estimation of background rates. For tools that require estimation of background models, this feature may compromise their ability to reliably detect constitutive binding sites. Finally, the definition of a constitutive binding site in terms of binding to the same site in most cell lines does not directly translate to a criterion based on peak height in a combined BAM file. Consequently, declaring a constitutive peak seems to require mapping all reads under each detected peak back to their original BAM files, an additional computational burden.
In this paper, we propose an effective and efficient alternative to binning for locating binding sites for TFs that may bind directly or indirectly. Like binning, it uses peak centers from ChIP-seq as input data. Our algorithm, T-KDE, identifies binding site locations by combining a kernel density estimator (KDE) with a binary range tree. Kernel density estimation, also known as the Parzen window method, is an unsupervised and non-parametric technique for estimating a continuous probability density function from sample data [15, 16]. Because KDEs can converge asymptotically to any density function, they are widely used and have been applied to many genomic problems such as ChIP-seq peak calling, analyzing nucleosome positioning, and detecting transcription factor binding motifs based on their over-representation in regulatory regions. In this paper, we use a KDE to find those genomic regions that contain the highest density of ChIP-seq peak centers from multiple cell lines/types for a given TF. Use of a binary range tree in conjunction with kernel density estimation enhances T-KDE’s speed. A binary range tree is a data structure useful in many applications involving range or nearest neighbor searches, indexing and clustering [20–22]; we use it to recursively subdivide the set of peak centers into subgroups that allow efficient density estimation and mode finding.
Using information on the location of peak centers from 132 CTCF ChIP-seq datasets from the ENCODE project, we compared T-KDE to both the motif-based approach and the binning approach. T-KDE outperformed the binning approach and was competitive with the motif-based approach. More than 90% of the T-KDE-declared constitutive CTCF binding sites were within 20 base pairs (bp) of the nearest motif-declared constitutive CTCF binding site (16-bp canonical motif), indicating that T-KDE is highly accurate. In addition, T-KDE identified constitutive CTCF binding sites that the motif-based approach failed to find because the ChIP-seq peaks lacked apparent motif sites. We also applied T-KDE to 21 other proteins for which replicate ChIP-seq datasets were available in six or more cell lines and found that the number of constitutive binding sites varied from fewer than a hundred to tens of thousands. Gene ontology (GO) analysis of the genes with constitutive binding sites in their promoters suggests that constitutive binding sites for several of the proteins are biologically meaningful.
We downloaded data on ChIP-seq peaks for 22 transcription factors (Additional file 1: Table S1) from the ENCODE portal at the UCSC Genome Browser. (The complete list of datasets and their unique identifiers can be found in Additional file 1: Table S2.) For each ChIP-seq peak, we calculated the location of the peak center as half the sum of the start and end coordinates for the peak, and we used these locations for subsequent analysis.
Location of constitutive CTCF binding sites via motif model: our “gold standard”
For each of 132 CTCF ChIP-seq datasets with at least one replicate, we extended/trimmed each peak to 200 bp in length around its center. We then used a custom Python script to extract the sequences from a locally stored copy of the GRCh37 assembly. Next, we predicted the locations of the CTCF binding sites in the sequences using the GADEM software with a previously derived CTCF position weight matrix (PWM) (see Additional file 1: Table S3). We declared a subsequence a CTCF binding site when its PWM score exceeded the score corresponding to a p-value cutoff of 0.0005. When more than one CTCF site was found in the sequence for a single peak, only the highest scoring site (with the lowest p-value) was retained for that peak. When a CTCF binding site was found in two or more replicate datasets representing a single cell line, the site was declared present in that cell line. A CTCF binding site was considered a constitutive binding site when the same motif site was present in more than 90% of the 55 cell lines. We used the center of the motif site as the location of the motif-based binding site.
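The two-level calling rule described above (present in a cell line if found in at least two replicates; constitutive if present in more than 90% of cell lines) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `constitutive_sites`, the tuple layout of `hits`, and the parameter names are assumptions made for the example.

```python
from collections import defaultdict

def constitutive_sites(hits, n_cell_lines, min_reps=2, min_frac=0.9):
    """Apply the replicate and cell-line thresholds to motif hits.

    hits: iterable of (site_id, cell_line, replicate_id) tuples
          (a hypothetical input layout chosen for this sketch).
    Returns the set of site ids declared constitutive.
    """
    # Collect the distinct replicates in which each site was seen, per cell line.
    reps = defaultdict(set)
    for site, cell_line, rep in hits:
        reps[(site, cell_line)].add(rep)

    # A site is present in a cell line if seen in at least min_reps replicates.
    lines_per_site = defaultdict(set)
    for (site, cell_line), seen in reps.items():
        if len(seen) >= min_reps:
            lines_per_site[site].add(cell_line)

    # A site is constitutive if present in more than min_frac of all cell lines.
    return {site for site, lines in lines_per_site.items()
            if len(lines) > min_frac * n_cell_lines}
```

With 55 cell lines and the default thresholds, a site would need to be present in at least 50 cell lines to be declared constitutive under this sketch.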
Identification of constitutive binding sites via binning
We divided each chromosome of the human genome into bins of equal size beginning at the centromere and proceeding outward along each arm (the final bin on each arm might be smaller than the others). The center of any bin containing peak centers from at least 2 replicate datasets from the same cell line was declared a binding site location as before, and those bins containing peak centers for more than 90% of the cell lines were declared constitutive. We examined this binning procedure with various bin sizes ranging from 100 to 1000 bp.
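A simplified sketch of the binning procedure follows. It departs from the description above in one labeled respect: bins are laid out from coordinate 0 rather than outward from the centromere, purely to keep the example short. The function name `constitutive_bins` and the input layout are assumptions for illustration.

```python
from collections import defaultdict

def constitutive_bins(peaks, n_cell_lines, bin_width=400, min_reps=2, min_frac=0.9):
    """Return sorted centers of bins declared constitutive on one chromosome.

    peaks: iterable of (start, end, cell_line, replicate_id) tuples
           (hypothetical layout). Bins start at coordinate 0, which is a
           simplification of the centromere-outward layout in the text.
    """
    reps = defaultdict(set)  # (bin index, cell_line) -> replicate ids
    for start, end, cell_line, rep in peaks:
        center = (start + end) / 2  # peak center: half the sum of start and end
        reps[(int(center // bin_width), cell_line)].add(rep)

    # A bin is a binding site in a cell line if hit by >= min_reps replicates.
    lines = defaultdict(set)
    for (b, cell_line), seen in reps.items():
        if len(seen) >= min_reps:
            lines[b].add(cell_line)

    # Constitutive bins are those covering more than min_frac of cell lines.
    return sorted(b * bin_width + bin_width / 2
                  for b, covered in lines.items()
                  if len(covered) > min_frac * n_cell_lines)
```

The boundary effect is easy to reproduce with this sketch: if half of the cell lines place a peak center at 398 bp and the other half at 402 bp, the centers fall into two different 400-bp bins and neither bin reaches the 90% threshold, even though all centers lie within 4 bp of each other.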
Identification of constitutive binding sites via T-KDE
Binary range tree
A binary range tree is a data structure in which all data points are stored in the leaves (terminal nodes) of the tree for efficient data retrieval and manipulation. In our application, we construct a separate range tree for each chromosome. Initially, all peak centers on the chromosome (from all ChIP-seq data sets for the given TF) are ordered from smallest to largest according to their genomic locations and placed in the top node. Then, the midrange (mean of the minimum and maximum locations) is used to partition the peak centers into two sub-nodes: the left sub-node contains peak centers whose locations are less than the midrange, whereas the right sub-node contains peak centers whose locations equal or exceed the midrange. This process continues recursively within each sub-node until a stopping criterion is satisfied. In our case, a sub-node becomes a terminal node when further partitioning would result in one or both of its child nodes containing peak centers for fewer than 90% of available cell lines. Although each terminal node in our tree contains peak centers from at least 90% of the cell lines, each terminal node may contain zero, one, or more constitutive binding sites as determined by the subsequent KDE analysis and mode finding.
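The recursive midrange partitioning and its stopping criterion can be sketched as below. This is an illustrative reading of the description above, not the authors' Matlab implementation; the function name `partition` and the `(position, cell_line)` tuple layout are assumptions, and only the terminal nodes are returned rather than a full tree.

```python
def partition(centers, n_cell_lines, min_frac=0.9):
    """Split peak centers on one chromosome into terminal nodes.

    centers: list of (position, cell_line) tuples (hypothetical layout).
    Returns a list of terminal nodes, each a position-sorted list of tuples.
    """
    def covered(node):
        # True if the node holds peak centers from at least min_frac of cell lines.
        return len({line for _, line in node}) >= min_frac * n_cell_lines

    def split(node):
        lo, hi = node[0][0], node[-1][0]
        mid = (lo + hi) / 2  # midrange: mean of the minimum and maximum locations
        left = [p for p in node if p[0] < mid]
        right = [p for p in node if p[0] >= mid]
        # Stop when a child would be empty or would fall below the coverage threshold.
        if not left or not right or not (covered(left) and covered(right)):
            return [node]
        return split(left) + split(right)

    if not centers:
        return []
    return split(sorted(centers))
```

Because each split halves the coordinate range, the recursion depth in this sketch grows with the logarithm of the chromosome length, so it stays small even for the largest human chromosomes.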
Kernel density estimation
Given N observed peak centers x_1, ..., x_N, the kernel density estimate at genomic location x is

$$\hat{f}(x) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{h}\,K\!\left(\frac{x - x_i}{h}\right) \qquad (1)$$

where h represents the bandwidth, a user-defined tuning parameter that controls the smoothness of the resulting estimate. The kernel K(•) is a symmetric (not necessarily positive) function that integrates to one, i.e., ∫ K(x)dx = 1. The kernel function serves to smear the probability mass of each data point across a local region.
We used the standard Gaussian kernel,

$$K(u) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{u^{2}}{2}\right).$$

With this kernel, each term in the sum of equation (1) is a Gaussian density with mean x_i and standard deviation h. Thus, equation (1) states that the estimate at any location x is formed by averaging contributions from Gaussian densities with standard deviation h and means at the observed peak centers. The basic operations of kernel density estimation used by T-KDE were adapted directly from the KDE Toolbox for Matlab.
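As a concrete illustration of the estimate described above (an average of Gaussian densities with standard deviation h centered at the observed peak centers), a direct evaluation at a single location might look like the following. The function name `gaussian_kde` is an assumption for this sketch, and the naive sum over all centers is used for clarity rather than speed.

```python
import math

def gaussian_kde(x, centers, h):
    """Evaluate the Gaussian kernel density estimate at location x.

    centers: observed peak-center locations (bp); h: bandwidth (bp).
    Each center contributes a Gaussian density with mean at that center and
    standard deviation h; the contributions are averaged over all centers.
    """
    norm = len(centers) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - c) / h) ** 2) for c in centers) / norm
```

Evaluating this function on a grid shows the role of the bandwidth: with h = 100 bp, two groups of peak centers 1,000 bp apart produce two well-separated bumps, whereas a much larger h merges them into one.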
Mode finding in Gaussian mixture models
To find local maxima and minima of the estimated density function, we adapted a fixed-point iterative search scheme. Our kernel density estimate is an equally weighted mixture of Gaussian densities where the mean of each component is an observed peak center. Such a Gaussian mixture has, at most, as many local maxima as it has components. If peak centers are far apart relative to the bandwidth, each peak center will yield a local maximum. If peak centers are close relative to the bandwidth, a local maximum must lie between their smallest and largest locations. Thus, within each terminal node, we can use a “hill-climbing” algorithm starting from every peak center to locate all the local maxima and minima. Once we find a location whose gradient is zero using Newton’s method, we use a second derivative test to determine whether it is a maximum or a minimum. Modal regions are defined as extending from the observed peak center farthest to the left of the local maximum, but no farther than the next local minimum, to the similarly delimited observed peak center farthest to the right. (With this definition, modal regions containing a single peak center have width zero.)
We used DAVID to analyze gene ontology (GO). We assigned a constitutive binding site to a gene (or genes) if the site was located within ±5 kb of the gene’s transcription start site according to the UCSC refGene model (hg19). All unique genes within that distance were included in the GO analysis.
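The hill-climbing idea can be sketched with the standard fixed-point (mean-shift) update for an equally weighted Gaussian mixture, in which the current location is repeatedly replaced by the kernel-weighted mean of the peak centers. This is a simplified stand-in, not the authors' procedure: it finds local maxima only, and omits the Newton step, the second derivative test and the search for local minima mentioned above. The function name `find_modes` and the `tol`, `max_iter` and `merge` parameters are assumptions for the example.

```python
import math

def find_modes(centers, h, tol=1e-6, max_iter=1000, merge=1.0):
    """Locate local maxima of the Gaussian KDE by fixed-point iteration.

    Starts one climb from every peak center and merges climbs that end
    within `merge` bp of an already recorded mode.
    """
    modes = []
    for x in centers:
        for _ in range(max_iter):
            # Gaussian weights of all centers relative to the current location.
            w = [math.exp(-0.5 * ((x - c) / h) ** 2) for c in centers]
            # Fixed-point update: kernel-weighted mean of the centers.
            x_new = sum(wi * c for wi, c in zip(w, centers)) / sum(w)
            converged = abs(x_new - x) < tol
            x = x_new
            if converged:
                break
        if not any(abs(x - m) < merge for m in modes):
            modes.append(x)
    return sorted(modes)
```

For example, with a bandwidth of 100 bp, centers at 990, 1000 and 1010 climb to a single mode near 1000, while a distant pair at 5000 and 5005 yields a second mode near 5002.5.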
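The site-to-gene assignment rule (any gene whose transcription start site lies within ±5 kb of a constitutive site) can be sketched for a single chromosome as follows. The function name `genes_near_sites`, the `(position, gene)` layout and the gene labels in the usage note are hypothetical; a real analysis would read TSS coordinates from the gene annotation.

```python
import bisect

def genes_near_sites(sites, tss, window=5000):
    """Return the set of genes with a TSS within `window` bp of any site.

    sites: constitutive site positions on one chromosome.
    tss:   list of (position, gene) tuples on the same chromosome.
    """
    tss = sorted(tss)
    positions = [p for p, _ in tss]
    genes = set()
    for s in sites:
        # Binary search for the block of TSSs inside [s - window, s + window].
        lo = bisect.bisect_left(positions, s - window)
        hi = bisect.bisect_right(positions, s + window)
        genes.update(g for _, g in tss[lo:hi])
    return genes
```

Returning a set mirrors the statement that only unique genes enter the GO analysis: a gene near several constitutive sites is counted once.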
Utility of the binary range tree
Without initial data partition using the binary range tree, KDE analysis and mode finding on even a single chromosome is computationally prohibitive; estimating the density, rather than finding the local maxima/minima, is the bottleneck. For the CTCF datasets, analysis of chromosome 1 took less than half an hour with the binary range tree compared to more than 5 days without it (in Additional file 1: Table S4). The locations of sites declared constitutive using KDE with and without the binary range tree were nearly identical (in Additional file 3: Figure S1).
Bandwidth and bin width selection
Table 1. Observed number of CTCF binding sites on 23 chromosomes. Column headers: bandwidth or bin width (bp); number of declared sites; number of declared sites. [Table body not recovered.]
Applying the motif-based approach to the same 132 CTCF ChIP-seq data sets with the same criteria (a binding site must be present in at least two replicate datasets per cell line, and a constitutive binding site must be present in more than 90% of the cell lines) identified 17,575 constitutive CTCF binding sites (the canonical 16-bp motif site). We regarded those motif-based constitutive CTCF binding sites as an “alloyed gold standard”. We have high confidence in a CTCF binding site identified by the motif-based approach because binding at the exact same motif location is detected in more than 90% of cell lines. On the other hand, the motif-based approach is imperfect, as it may fail to identify low-affinity or indirect binding sites. The motif-based approach could also overlook constitutive sites if the length of peak sequence scanned (200 bp around peak centers in our application) is too short to cover the actual binding site.
For T-KDE with bandwidths smaller than 500 bp, all CTCF binding sites declared constitutive are within 200 bp of their nearest motif-based constitutive CTCF binding sites. For a bandwidth of 100 bp, more than 90% of the T-KDE-declared constitutive CTCF binding sites are within 20 bp of the nearest motif-based constitutive CTCF binding sites and nearly all are within 70 bp. For bandwidths exceeding 500 bp, performance deteriorates, though roughly 90% of the T-KDE-declared constitutive binding sites are still within 500 bp of their nearest motif-based counterpart.
The results from Table 1 and Figure 3 strongly suggest that changing the bandwidth with T-KDE has little impact on the number of constitutive binding sites identified but a greater impact on their locations. On the other hand, changing the bin width with the binning approach has an impact on both the number of constitutive binding sites identified and on their locations. Our results also suggest that, for CTCF, a bandwidth near 100 bp and a bin width near 400 bp may be the optimal values for T-KDE and for the binning method, respectively. Although derived from CTCF comparisons, we believe these choices of bandwidth or bin width should be applicable to other factors whose ChIP-seq peak length distributions are similar to those of CTCF.
Comparing Figure 3(A) and 3(B) also reveals that the accuracy of T-KDE for locating constitutive binding sites is generally far superior to that of the binning approach. In particular, the optimal bandwidth of 100 bp was more accurate in locating constitutive binding sites than the optimal bin width of 400 bp. Consequently, for our remaining analyses, we focus on T-KDE using a bandwidth of 100 bp.
T-KDE versus Binning
Constitutive sites found by T-KDE but not by the motif-based approach
Only 25 of the 17,575 motif-based constitutive CTCF binding sites were farther than 70 bp from the nearest constitutive CTCF binding site identified by T-KDE. Furthermore, T-KDE declared an additional 4,237 CTCF binding sites as constitutive that the motif-based approach missed. Among those 4,237 sites, the motif-based approach failed to detect 312 because no sub-sequence in the corresponding peaks reached the motif significance cutoff. (The motif-based approach did not declare any of these as a binding site in any of the cell lines.) The remaining additional constitutive sites found by T-KDE were found by the motif-based approach in a majority of cell lines but not in enough cell lines to reach the required 90%. When the true binding sites are not located near the center of some peaks and/or the peak sequences used in the motif scan are not long enough to cover the actual motif, a motif-based approach would miss the site. T-KDE, however, is unaffected by these issues. Because it uses peak centers from all cell lines to identify the center of mass of each modal region as the binding site, some misalignment or displacement among ChIP-seq peaks is tolerated. Thus, T-KDE is capable of identifying constitutive binding sites that are bound by a protein either directly or indirectly.
Analysis of constitutive binding sites for 22 factors
Table: Binding sites throughout the entire genome identified by T-KDE for 22 transcription factors. Column header recovered: available cell lines. [Table body not recovered.]
Table: Top ten GO processes for constitutive Pol II target genes

GO process                               Multiple testing adjusted p-value
Cellular metabolic process               8.2 × 10^-177
Primary metabolic process                1.2 × 10^-121
Macromolecule metabolic process          2.8 × 10^-114
Nitrogen compound metabolic process      8.8 × 10^-87
[process name not recovered]             1.0 × 10^-46
[process name not recovered]             6.3 × 10^-46
Establishment of protein localization    5.2 × 10^-40
[process name not recovered]             1.4 × 10^-37
Cell cycle process                       9.1 × 10^-35
Ribonucleoprotein complex biogenesis     2.5 × 10^-33
Binding sites that are occupied by a protein regardless of the cell or tissue type seem likely to have a distinct role compared to binding sites for the same protein that are occupied more selectively; the constitutive nature of the binding should signify something of fundamental import. Our earlier work using motif-based analysis found that constitutive CTCF binding sites, especially those near RAD21 sites, are highly enriched in CTCF-mediated chromatin interactions and that those interactions are predominantly within topological domains, not between them. Consequently, we hypothesized that the constitutive CTCF binding sites may be involved in maintaining and/or establishing chromatin structures that are common among most human cell types. Those earlier findings indicate to us that constitutive binding sites for other TFs may have unique biological roles.
The ENCODE consortium has generated more than 1,000 ChIP-seq protein-binding datasets for more than 100 proteins in multiple cell lines, and the data continue to expand. Discovering the locations and functions of the genomic loci that are constitutively bound by each of the proteins is potentially important. However, computational methods for locating constitutive binding sites when the protein does not bind directly to DNA are still lacking. One challenge is that ChIP-seq peak data are low-resolution, and the technology is unable to pinpoint exact genomic binding locations.
To fill this gap, we developed an efficient and effective approach, T-KDE, which takes as input the locations of peak centers from multiple ChIP-seq data sets, returns estimates of the locations of binding sites, and declares them constitutive or not. T-KDE combines a binary range tree algorithm, a kernel density estimator, and a mode-finding algorithm. Using data on CTCF binding, we found that T-KDE was superior at locating constitutive binding sites compared to a naïve approach based on binning and that T-KDE performed well compared to the motif-based approach. For example, all motif-based constitutive CTCF binding sites were included in the constitutive CTCF binding sites identified by T-KDE. Furthermore, T-KDE identified an additional 4,237 constitutive CTCF binding sites that the motif-based approach failed to detect. This result highlights a major advantage of T-KDE compared to both the motif-based and binning approaches: regardless of whether binding is direct or indirect and whether an adequate motif model is known, T-KDE accurately estimates the locations of constitutive binding sites by identifying genomic regions where the centers of ChIP-seq peaks from multiple datasets lie in close proximity. Accurate binding locations are necessary for subsequent functional analysis and discovery. We applied T-KDE to locate constitutive binding sites, if present, for 22 TFs that had replicate ChIP-seq data sets for at least 6 cell lines available from ENCODE, and we used gene ontology analysis to suggest possible biological functions for some of those TFs.
KDE-based methods different from ours have been applied to ChIP-seq reads for peak calling and nucleosome positioning. Additionally, a KDE-based method has been applied to motif locations for detection of regions locally enriched with transcription factor binding sites. Our goal is different: we use the locations of ChIP-seq peak centers from multiple cell lines (from as few as 6 to as many as 132, in this case) to infer the location of constitutive binding sites. In addition, our method has unique features. Our method first recursively partitions the locations of peak centers into subgroups (terminal nodes) using a binary range tree algorithm. The partitioning stops whenever either of the two would-be child nodes contains peak centers from fewer than 90% (a user-specified choice) of available cell lines. The KDE analysis and subsequent mode finding are carried out on each terminal node, one at a time. The partitioning guarantees that at least 90% of cell lines are represented in every terminal node; however, a terminal node may still contain zero, one or more constitutive binding sites depending on the spread of the peak centers present, making KDE and subsequent mode-finding necessary for localizing modal regions. Binding site locations are declared at local maxima within modal regions. Our use of the binary range tree before applying KDE and mode-finding makes our algorithm novel and efficient.
One reviewer suggested an alternative procedure (Additional file 2: Algorithm S3) using the peak-finding algorithm MACS. The procedure involves applying MACS with its default parameters to a combined BAM file built from the original ChIP-seq read data (also in BAM format) from the multiple cell lines. Peaks with low variation in log (read count + 1) within ±50 bp of the MACS summit are considered constitutive. We compared this procedure with T-KDE and a binning-based method and showed that T-KDE was far superior to this alternative procedure (details in Additional file 4: Supplementary text).
Although T-KDE can be applied to ChIP-seq data from any number of cell lines, caution must be exercised when interpreting a result from only a few cell lines. Because the property of being constitutive requires binding to the same locus in a variety of cell types, the number and diversity (or lineage) of cell lines/types providing data to the algorithm would be expected to have a strong influence on the biological trustworthiness of any result.
For N peak centers, KDE followed by mode-finding has a computational complexity of O(N log2 N) [28, 29]. When N is large, as in our CTCF dataset (N ≈ 690,000 for chromosome 1), the process becomes computationally prohibitive. After initial data partitioning by a binary range tree into a set of terminal nodes indexed by i, each with N_i peak centers, complexity is greatly reduced to ∑_i O(N_i log2 N_i). Consequently, T-KDE reduces the computational time for CTCF on chromosome 1 from days to within an hour. We envision that parallelization of our T-KDE algorithm at the node level would further reduce the computational time. A potential cost is that partitioning all peak centers into terminal nodes before the KDE analysis and mode finding might destroy a constitutive binding site by splitting it between two adjacent nodes. This problem appears to arise rarely or not at all, as we observed that the performance of T-KDE was nearly identical to that of KDE omitting the initial partitioning. We attribute this similarity, in part, to our stopping criterion for partitioning.
Generally, the choice of the bandwidth for KDE can exert a strong influence on the shape of the estimated density: small bandwidths yield spiky estimates and large bandwidths yield overly flattened ones. Yet, in our comparisons when locating constitutive CTCF binding sites, bandwidths from 100 to 400 bp uncovered similar numbers of constitutive CTCF binding sites, and the distribution of the distances from T-KDE-declared sites to the nearest motif-declared sites did not change much with bandwidth. We believe that a bandwidth of 100 to 400 bp may be optimal for most TF binding sites with narrow peaks (100-1,000 bp). Automatic selection of the optimal bandwidth would be desirable, but optimal bandwidth selection based on statistical criteria such as the mean integrated squared error did not work well with the CTCF data. That process, which involved maximizing a “pseudo-likelihood” combined with a leave-one-out cross-validation approach, was computationally expensive and selected a large bandwidth of 1,293 bp that did not locate constitutive binding sites as well as our preferred 100 bp bandwidth did.
Although designed for identifying constitutive binding sites for a protein using ChIP-seq data from multiple cell lines, our method could also be used to identify genomic loci that have concentrations of binding sites for different proteins (“hot spots”), and conversely “cold spots”, using ChIP-seq data for multiple proteins in the same cell line.
In conclusion, we developed an efficient and accurate method, T-KDE, to locate constitutive protein binding sites using ChIP-seq peak centers from multiple cell lines. T-KDE combines a binary range tree algorithm, a non-parametric kernel density estimator, and a mode-finding algorithm. We showed that, for CTCF data, our method is relatively robust to the choice of bandwidth and is highly accurate when compared to the identification of constitutive binding sites through motif analysis. Application of T-KDE to 22 proteins with ChIP-seq data from multiple cell lines located substantial numbers of constitutive binding sites for some TFs but almost none for others. For TFs with large numbers of constitutive binding sites, GO analysis suggests that these sites are biologically meaningful. As additional TF ChIP-seq datasets become available in more cell lines and for more TFs, our method will prove to be essential for identifying their constitutive binding sites.
Availability and requirements
Project Name: T-KDE
Operating system: Unix
Programming language: Matlab
Other requirements: N/A
License: This work is made available under the GPL v3.
Any restrictions to use by non-academics: none
Abbreviations
ChIP-chip: Chromatin immunoprecipitation followed by microarray
ChIP-seq: Chromatin immunoprecipitation followed by sequencing
CTCF: CCCTC binding factor
KDE: Kernel density estimation
MACS: Model-based analysis of ChIP-seq
NRSF: Neuron-restrictive silencer transcription factor, also known as REST
Pol II: RNA polymerase II
Homolog (S. pombe)
TAF1: TATA box binding protein-associated factor 1
We thank Liang Niu and Weichun Huang for discussion and Xuting Wang and Grace Kissling for critical reading of the manuscript. We thank the Computational Biology Facility at NIEHS for computing time and support. This research was supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences (ES101765).
- Schmidt D, Schwalie PC, Ross-Innes CS, Hurtado A, Brown GD, Carroll JS, Flicek P, Odom DT: A CTCF-independent role for cohesin in tissue-specific transcription. Genome Res. 2010, 20 (5): 578-588. 10.1101/gr.100479.109.
- Wang H, Maurano MT, Qu H, Varley KE, Gertz J, Pauli F, Lee K, Canfield T, Weaver M, Sandstrom R, et al: Widespread plasticity in CTCF occupancy linked to DNA methylation. Genome Res. 2012, 22 (9): 1680-1688. 10.1101/gr.136101.111.
- Li Y, Huang W, Niu L, Umbach DM, Covo S, Li L: Characterization of constitutive CTCF/cohesin loci: a possible role in establishing topological domains in mammalian genomes. BMC Genomics. 2013, 14 (1): 553. 10.1186/1471-2164-14-553.
- Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B: Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012, 485 (7398): 376-380. 10.1038/nature11082.
- Stormo GD: DNA binding sites: representation and discovery. Bioinformatics. 2000, 16 (1): 16-23. 10.1093/bioinformatics/16.1.16.
- Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM: A census of human transcription factors: function, expression and evolution. Nat Rev Genet. 2009, 10 (4): 252-263. 10.1038/nrg2538.
- Muller-Molina AJ, Scholer HR, Arauzo-Bravo MJ: Comprehensive human transcription factor binding site map for combinatory binding motifs discovery. PLoS One. 2012, 7 (11): e49086. 10.1371/journal.pone.0049086.
- Yip KY, Cheng C, Bhardwaj N, Brown JB, Leng J, Kundaje A, Rozowsky J, Birney E, Bickel P, Snyder M, et al: Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 2012, 13 (9): R48. 10.1186/gb-2012-13-9-r48.
- Ho JW, Bishop E, Kharchenko PV, Negre N, White KP, Park PJ: ChIP-chip versus ChIP-seq: lessons for experimental design and data analysis. BMC Genomics. 2011, 12: 134. 10.1186/1471-2164-12-134.
- Martin-Magniette ML, Mary-Huard T, Berard C, Robin S: ChIPmix: mixture model of regressions for two-color ChIP-chip analysis. Bioinformatics. 2008, 24 (16): i181-186. 10.1093/bioinformatics/btn280.
- Qin ZS, Yu J, Shen J, Maher CA, Hu M, Kalyana-Sundaram S, Yu J, Chinnaiyan AM: HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data. BMC Bioinforma. 2010, 11: 369. 10.1186/1471-2105-11-369.
- Li W, Meyer CA, Liu XS: A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics. 2005, 21 Suppl 1: i274-282.
- Seifert M, Keilwagen J, Strickert M, Grosse I: Utilizing gene pair orientations for HMM-based analysis of promoter array ChIP-chip data. Bioinformatics. 2009, 25 (16): 2118-2125. 10.1093/bioinformatics/btp276.
- Seifert M, Cortijo S, Colomé-Tatché M, Johannes F, Roudier F, Colot V: MeDIP-HMM: genome-wide identification of distinct DNA methylation states from high-density tiling arrays. Bioinformatics. 2012, 28 (22): 2930-2939. 10.1093/bioinformatics/bts562.
- Rosenblatt M: Remarks on some nonparametric estimates of a density function. Ann Math Stat. 1956, 27 (3): 832-837. 10.1214/aoms/1177728190.
- Scott DW: Multivariate density estimation: theory, practice, and visualization. 1992, New York: John Wiley & Sons.
- Wilbanks EG, Facciotti MT: Evaluation of algorithm performance in ChIP-Seq peak detection. PLoS One. 2010, 5 (7): e11471. 10.1371/journal.pone.0011471.
- Shivaswamy S, Bhinge A, Zhao Y, Jones S, Hirst M, Iyer VR: Dynamic remodeling of individual nucleosomes across a eukaryotic genome in response to transcriptional perturbation. PLoS Biol. 2008, 6 (3): e65. 10.1371/journal.pbio.0060065.
- Vandenbon A, Kumagai Y, Teraguchi S, Amada KM, Akira S, Standley DM: A Parzen window-based approach for the detection of locally enriched transcription factor binding sites. BMC Bioinforma. 2013, 14: 26-10.1186/1471-2105-14-26. doi:10.1186/1471-2105-14-26View ArticleGoogle Scholar
- Fuchs H, Kedem ZM, Naylor BF: On visible surface generation by a priori tree structures. Proceeding SIGGRAPH ’80 proceedings of the 7th annual conference on computer graphics and interactive techniques. 1980, ACM New York, NY, USA, 124-133. ISBN:0-89791-021-4. doi:10.1145/800250.807481. 1980Google Scholar
- Bentley JL, Saxe JB: Decomposable searching problems I. Static-to-dynamic transformation. J Algorit. 1980, 1 (4): 301-358. 10.1016/0196-6774(80)90015-2.View ArticleGoogle Scholar
- Berg MD, Kreveld M, Overmars M, Schwarzkopf O: Computational geometry. 2000, New York: SpringerView ArticleGoogle Scholar
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12 (6): 996-1006.PubMed CentralPubMedView ArticleGoogle Scholar
- Li L: GADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery. J Comput Biol. 2009, 16 (2): 317-329. 10.1089/cmb.2008.16TT.PubMed CentralPubMedView ArticleGoogle Scholar
- Deng K, Moore AW: Multi-resolution instance-based learning. Proc IJCAI’95 Proc 14th Int joint Conf Artif Intell. 1995, 2: 1233-1239.Google Scholar
- Parzen E: On estimation of a probability density function and mode. Ann Math Stat. 1962, 33 (3): 1065-1076. 10.1214/aoms/1177704472.View ArticleGoogle Scholar
- Silverman BW: Density estimation for statistics and data analysis. 1986, New York: Chapman and HallView ArticleGoogle Scholar
- Ihler AT: Inference in sensor networks: graphical models and particle methods. 2005, Cambridge, MA: Massachusetts Institute of TechnologyGoogle Scholar
- Carreira-perpiñán MÁ: Continuous latent variable models for dimensionality reduction and sequential data reconstruction. 2001, UK: University of SheffieldGoogle Scholar
- da Huang W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4 (1): 44-57.PubMedView ArticleGoogle Scholar
- Nakahashi H, Kwon KR, Resch W, Vian L, Dose M, Stavreva D, Hakim O, Pruett N, Nelson S, Yamane A, et al: A genome-wide Map of CTCF multivalency redefines the CTCF code. Cell Rep. 2013, 3 (5): 1678-1689. 10.1016/j.celrep.2013.04.024.PubMed CentralPubMedView ArticleGoogle Scholar
- Schmidt D, Schwalie PC, Wilson MD, Ballester B, Goncalves A, Kutter C, Brown GD, Marshall A, Flicek P, Odom DT: Waves of retrotransposon expansion remodel genome organization and CTCF binding in multiple mammalian lineages. Cell. 2012, 148 (1–2): 335-348.PubMed CentralPubMedView ArticleGoogle Scholar
- Handoko L, Xu H, Li G, Ngan CY, Chew E, Schnapp M, Lee CWH, Ye C, Ping JLH, Mulawadi F, et al: CTCF-mediated functional chromatin interactome in pluripotent cells. Nat Genet. 2011, 43: 630-638. 10.1038/ng.857.PubMed CentralPubMedView ArticleGoogle Scholar
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, et al: Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9 (9): R137-10.1186/gb-2008-9-9-r137.PubMed CentralPubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.