Statistics of protein-DNA binding and the total number of binding sites for a transcription factor in the mammalian genome
© Kuznetsov et al. 2010
Published: 10 February 2010
Skip to main content
© Kuznetsov et al. 2010
Published: 10 February 2010
Transcription factor (TF)-DNA binding loci are explored by analyzing massive datasets generated with application of Chromatin Immuno-Precipitation (ChIP)-based high-throughput sequencing technologies. These datasets suffer from a bias in the information about binding loci availability, sample incompleteness and diverse sources of technical and biological noises. Therefore adequate mathematical models of ChIP-based high-throughput assay(s) and statistical tools are required for a robust identification of specific and reliable TF binding sites (TFBS), a precise characterization of TFBS avidity distribution and a plausible estimation the total number of specific TFBS for a given TF in the genome for a given cell type.
We developed an exploratory mixture probabilistic model for a specific and non-specific transcription factor-DNA (TF-DNA) binding. Within ChiP-seq data sets, the statistics of specific and non-specific DNA-protein binding is defined by a mixture of sample size-dependent skewed functions described by Kolmogorov-Waring (K-W) function (Kuznetsov, 2003) and exponential function, respectively. Using available Chip-seq data for eleven TFs, essential for self-maintenance and differentiation of mouse embryonic stem cells (SC) (Nanog, Oct4, sox2, KLf4, STAT3, E2F1, Tcfcp211, ZFX, n-Myc, c-Myc and Essrb) reported in Chen et al (2008), we estimated (i) the specificity and the sensitivity of the ChiP-seq binding assays and (ii) the number of specific but not identified in the current experiments binding sites (BSs) in the genome of mouse embryonic stem cells. Motif finding analysis applied to the identified c-Myc TFBSs supports our results and allowed us to predict many novel c-Myc target genes.
We provide a novel methodology of estimating the specificity and the sensitivity of TF-DNA binding in massively paralleled ChIP sequencing (ChIP-seq) binding assay. Goodness-of fit analysis of K-W functions suggests that a large fraction of low- and moderate- avidity TFBSs cannot be identified by the ChIP-based methods. Thus the task to identify the binding sensitivity of a TF cannot be technically resolved yet by current ChIP-seq, compared to former experimental techniques. Considering our improvement in measuring the sensitivity and the specificity of the TFs obtained from the ChIP-seq data, the models of transcriptional regulatory networks in embryonic cells and other cell types derived from the given ChIp-seq data should be carefully revised.
Identification of transcription regulatory elements in the genome is an important problem of molecular systems biology and statistical genomic studies. Among those elements, transcription factor binding sites (TFBSs), short and specific DNA loci targeted by transcription factors (TFs), are considered as basic regulatory elements of gene functional activity and reflect the corresponding protein-DNA interactions in a cell. TFs are the largest set of regulatory proteins in mammalian cells. According to NCBI RefSeq database, about 10% of all known proteins of mammals, including humans, are TFs.
A TFBS serves as a target for a transcription factor which binds to this specific binding site (BS) directly or via other proteins and regulates gene transcription. In a mammalian genome the number of direct and indirect BSs for a given TF could be ranged from several hundreds to hundred thousand [1–8]. However, these values have not been directly measured and the theoretical estimates provide lowly confident values.
The interactions between the molecules of a given TF and corresponding TFBSs in the genome could be considered in the terms of TF-DNA binding events (BEs) which reflect the events of binding in the assay. Any of such events might be specific and non-specific in the context of a direct physical binding of the TF to a TFBS. The intensity (and the corresponding probability) of an occurrence of a given BE in a given genome locus can be characterized by the level of relative avidity (RA) of the TF-DNA binding- an integrative quantitative characteristic of availability of a DNA locus (e.g. TFBS and its flanking region) for a given protein (e.g. TF) binding .
The population distribution function of RA (i.e. the distribution function of RA for a given set (population) of BE) for a given TF can reveal functional attributes of the TFBSs and the mechanisms of the TF-DNA interaction on the genomic scale. However, at the level of single cells or cell samples the distribution function of RA for any TF is unknown, since many technical problems of direct counting of specific protein molecules bound DNA have not been solved yet. Instead, the relative avidity of TF-DNA binding in an average genome within a given cell sample can be quantified by an estimate of the number of TF molecules bound to a given locus averaging across the given cell sample. However, quantitative detection of TFs bound to specific loci is a great challenge. A simpler task is a pull-down of short DNA fragments directly or indirectly bound with the molecules of a given TF. Such TF-DNA complexes can be detected in Chromatin Immuno-Precipitation (ChIP) ChIP-based genome-wide sequencing experiments [3, 4, 6–8].
In a typical genome-scale ChIP experiment ~108 cells are used and the transcription factors are cross-linked to their DNA using chemical cross linkers. After the genomic DNA is isolated and fragmented by ultrasound sonication, an antibody specific to the TF of interest is used to isolate each TF molecule and the DNA fragment which it is bound to. Cloning the whole pool of such fragments and sequencing them (for example, in a serial analysis of chromatin occupancy (SACO) , ChIP-paired end tag (ChIP-PET) , sequence tag analysis of genome enrichment (STAGE) ) provides the data for a large-scale identification of TFBSs.
However, the experimental information, obtained from SACO, ChIP-PET and STAGE experiments, about statistical properties of TF binding at specific physical BSs (and moreover biological functions of the BSs) in a given cell type, at a given environment is highly noisy and essentially incomplete. Only pulled-down TF-DNA complexes containing DNA fragments with maximal relative binding avidity could be reliably detected.
One of the crucial problems with ChIP-based genome-wide assays, including ChIP-seq, is a statistically reliable identification of biologically meaningful phenomena (e.g. all specific and functionally important binding loci) from the large amounts of generated experimental data. In this context reliable, specific and sensitive mapping of protein-DNA interactions is still essentially dependent on subjective rules of pre-processing and filtration of the DNA fragment sequences, the statistical criteria used to identify specific binding loci and the real TFBSs. As a result, some basic definitions, datasets, statistical models and estimates have been revised after original publications [2,5,9,11-13].
A large unexplained "technical" noise component in the experimental measurements and its non-uniform location in the genome are serious limitations for the specificity of the current ChIP-based methods [7, 9, 11–13]. Comparison of ChIP-seq data sets analysing the same TF, under similar conditions and the same cell type but using different experimental platforms (ChIP-chip, ChIP-PET and ChIP-seq) revealed a reasonable consistency between the mapped datasets only for high and moderate avidity ChIP-defined binding loci. These loci, however, represent only a small fraction of all binding loci observed in the course of the entire ChIP-seq experiment. Therefore genomic localization and biological roles of other reliable but less reactive TF-binding loci (low avidity BSs) cannot be identified from the assays [9, 11, 12, 14]. To analyse statistical properties of the low-, moderate- and high-avidity binding loci for a given ChIP-seq data set we need to develop a probabilistic model which would allow us to answer three important open questions.
How many 'hidden' TF-specific BSs with moderate and low avidity are present in the noise-rich subset of ChIP-seq data?
How many "true negative" specific BSs which should be included in consideration are not present in the given ChIP-seq dataset?
How many specific BSs of a given TF exist in the genome?
To answer these questions an analytical model(s) of ChIP-seq TF-DNA binding experiments should be focused on analysis of specificity and sensitivity of protein-DNA binding.
To do this a simple exploratory probabilistic model of TF-DNA binding events was developed and implemented. It allowed us to facilitate noise filtering of ChIP-seq datasets and predicting, for a given TF, the number of specific TF-DNA binding loci in the entire genome. We applied the proposed analytical approach to the distributions of binding avidity of the TFBS of eleven TFs which are essential for maintaining and differentiation mouse embryonic stem cells.
TFs bind to short high affinity DNA response elements (motifs) mostly in putative promoter region of a gene and modify gene activity. However this model of TF-DNA binding is too simple and the phenomenon remains poorly understood, although many models of TF-DNA binding have been reported [9, 13, 15–17]. For example, well-known TF c-Myc exhibits very diverse and complex patterns of DNA binding activity . C-Myc binding region in gene promoter region could often contain multiple copies of a few c-Myc E-box sequences. Such TFBS clusters could be found in low-, moderate- and high- avidity ChIP-seq binding loci of the mouse embryonic stem cell (EC) -related genes (see below).
The E-box is a high affinity response/binding element of DNA which able to discriminate c-Myc among other different TFs. c-Myc forming heterodimers with its binding partners (MAX, MIZ1) and other DNA-specific proteins (e.g. MIZ1, Sp1, NEY) using "canonical" E box 5'-CACGTG-3', which represents the consensus or other at least five "non-canonical" E-boxes (CATGTG, CATGCG, CACGCG, CACGAG, CAACGTG), and additional one (CGCGAG) which we report in this work (see below). The E-box motifs are present in putative promoter regions of at least 4000 genes ; however, how a given response element can discriminate among all these different transcription factors is currently unclear. Moreover, several types of non-canonical E-boxes could bind c-Myc-MAX complex strongly than canonical . In this respect, several mechanisms of variability of c-Myc-Max-DNA binding avidity have been proposed (and, in some instances, experimentally substantiated) [12, 17, 18]. For instance, flanking sequences, chromatin structure, methylation status, and relative position within the promoter or interaction with adjacent regulatory elements might contribute to selection of a particular complex .
We assume here that transcription factor (TF)-DNA binding avidity in a given TFBS defined in ChIP-seq experiment could be considered as useful integrative characteristic of availability of the TFBS for a given TF. Naturally, TF-DNA binding avidity could be quantified by the number of TF molecules binding a given specific locus including specific binding element(s) and its flanking regions. However, the number of TF molecules bound DNA cannot be directly inferred from ChIP-seq experiment. ChIP-seq detects the amount of specific DNA fragments directly and indirectly bound by the protein-antibody complexes. Mapping the extended DNA fragments on the genome, an appropriate clustering of overlapping fragments and filtering of the clusters result in an identification of putative TFBSs. Thus the integrative relative avidity of TFBS in ChIP-seq experiment could be quantified by the number of overlapping DNA fragments in the cluster loci. When this number is equal to or larger then a specificity threshold, t, such significant DNA loci (specific BEs) could be used to construct and analyze of statistically reliable part of the empirical frequency distribution representing relatively high-avidity binding loci of specific TF-DNA binding. In particular, we used this part of the distribution for estimation of the number of reliably detected specific TFBSs (called , see Definitions & Methods).
In the datasets , which we used in our study, the specificity threshold was calculated using three different methods. The first method implements a model of normalization of observed binding peak height values in a ChIP-seq experiment against peak height values in the corresponding negative control experiment (anti-GFP ). However, the control data provided many strong peaks which are often found in specific genomic regions like satellite repeats creating an unpredictable bias in identification of specificity threshold. We found also a fraction of loci in which negative control and experimental set binding signals (peaks heights) are highly correlated and the fraction of locus in which the peaks in control dataset were higher than in the experimental set.
After removing non-random false peaks, a relatively small reduction of the number of the nonspecific DNA fragments even among the smallest peaks (1, 2,..., 7) about 15-25% of total sequence read could be excluded. However, many millions DNA fragments within low- and moderate- abundant peaks of are present in the data sets (data not shown). These findings suggest that the number and the variation of DNA fragments of the negative control experiment do not allow us to explain the source(s) of the major fraction of non-specific DNA sequences and their clusters occurred in a specific ChIP-seq experiment.
The second method implements a model of random occurrence of DNA fragment clusters in the genome. The model assumes that each genomic position has the same probability of producing a ChIP-seq DNA fragment (sequence read) and the probability of finding by chance m DNA fragments in a ChIP-seq experiment within the same genomic region taken from a genome of size g. The probability calculations were made, using a Monte-Carlo simulation, by randomly extracting 200 bp DNA fragments from reference mouse genome (mm8) and estimating the expected numbers of non-specific overlap peaks with various height values. Then the frequency distribution of the number of random ChIP-seq DNA fragments in a cluster overlap (peak heights) was constructed and the threshold value at the given specificity level for the original empirical frequency distribution of the peak height value was defined. The binding loci reported in  has been limited to the locis with specificity threshold value defined by the model described above, which was considered as the random background noise mode .
ChIP-qPCR is often used to determine in vitro whether a given ChIP-seq DNA fragment cluster belongs to a specific TFBS and, subsequently, to estimate the specificity threshold for the original ChIP-seq dataset by extrapolation. However, due to the limited number of loci which could be used in ChIP-qPCR, a sub-optimal design of ChIP-qPCR experiment might contain a bias in estimating the ChIP-seq BE specificity threshold. In this section we shall give an answer to the question: how to optimize a design of ChIP-qPCR experiments to minimize the sample size bias?
Comparative analysis of TF binding specificity estimated based on three methods. Specificity threshold estimates by the uniform noise model , by ChIP -qPCR  and by the best-fit GDP function. The uniform random model-based estimations of the threshold values defined at FDR <5%. TF binding specificity threshold t was estimated based on the best-fit double truncated GDP function. The last two columns of the table shows that GDP-model could improve the specificity estimates providing by the Poisson model and ChIP-q-PCR assay.
Unique mapped fragments, M*
Noise threshold by Poisson model*
# q-PCR experiments*
Noise 3-fold enrich threshold by qPCR*, q
Specificity threshold by GDP, t
N 2 at q*
Specificity by qPCR, %
GDP-predicted specificity at q
A simple visual inspection of the available empirical frequency distributions of TF-DNA binding plot in log-log scale (Figure 2, Additional files 1, 2) allows us to suggest that a mixture of at least two distinct skewed sample-size - dependent frequency distributions of binding avidity is present in the data. Statistical analysis of ChIP-based [9, 11–15, 19] has showed that the left part of the empirical distribution is reach with the noise resulted from relatively low avidity BE and the right side of the distribution is represented by relatively high-avidity TF- DNA binding (Figure 2, Additional files 1, 2). Also we found that the tails of the empirical distributions of TF-DNA binding avidity exhibit monotonically-skewed shape with a greater abundance of low avidity and more gaps among the high-avidity loci. Low- and moderate- avidity TFBSs are highly-abundant in the mouse and other mammalian genomes and could play biologically meaningful functional roles.
The specificity threshold of the distribution of the number of TF-bound DNA loci, t, is the parameter which separates lowly reliable and highly reliable-binding loci at a given specificity level (6).
For quantitative analysis of ChIP-seq data based on the model (1) it should be important to estimate the specificity threshold, t, as well as the number of high-avidity specific loci , at avidity m ≥ t, and the number of specific loci having low/moderate avidity less than t (m < t). To do that, we use the fitting and back-extrapolation method (see Methods). We illustrate our method with the analysis of the ChIP-seq Nanog TF-DNA binding data as an example (Figure 2A & 2B).
TGDP function parameters for the eleven TF libraries. Non-linear regression module of SigmaPlot program was used to estimate parameters of GDP function. More details regarding binding statistics and TGDP function parameters are presented in Definitions, Methods and Additional files.
GDP-defined specificity threshold t
k ± SE
β ± SE
# GDP-predicted seq.,
#Observed seq. at GDP specificity threshold t, M 2
2.40 ± 0.077
10.42 ± 0.682
1.81 ± 0.156
8.00 ± 1.33
3.00 ± 0.494
7.66 ± 2.910
2.35 ± 0.225
3.03 ± 1.432
3.55 ± 0.006
46.24 ± 0.340
1.50 ± 0.077
9.28 ± 1.227
2.90 ± 0.162
3.73 ± 0.778
4.83 ± 0.335
11.16 ± 1.421
2.94 ± 0.742
16.24 ± 7.298
2.81 ± 0.228
4.01 ± 1.306
4.02 ± 0.650
16.19 ± 4.248
The most noise-affected data points are located at the left side of the empirical histogram (from 1 to 8 BEs), where the data was accurately fitted by the exponential function using Sigma-Plot software (Figure 2A). The values of the exponential function at data points from 0 to 9 were fitted (solid line on Figure 2A) and for larger values were estimated by the best-fit exponential function (extrapolating onto the right). At the specificity threshold (t = 11) identified from the analysis of ChiP-q-PCR data, the admixture of non-specific BS (originating from the tail of the exponential function) consists of ~2% of the reliable subset (Sp = 98%, respectively). Importantly, at the predicted specificity level (Sp = 97%) our best-fit mixture model predicts the same specificity threshold (t = 11).
Subtraction of the exponential function values from the observed frequency distribution values allowed us to restore the "noise-free" frequency distribution function for overlap peak value 9 and larger (Figure 2A & 2B). This reconstructed double-truncated frequency distribution could be considered as a reliable segment of the empirical specific binding avidity distribution. After parameterisation of the GDP function we can extrapolate the best-fit GDP to smaller overlap peak values (8 and smaller). Figure 2A shows that the truncated GDP function fits well to the right-tail of the empirical frequency distribution of TF specific binding in the interval of peak values larger than 11. This part of the empirical distribution characterises the more reliable fraction of TF-DNA binding loci corresponding to high-avidity level BSs and represented in ChIP-seq experiment by number of overlapping DNA fragments (peak values) up to the maximal observed number of overlaps.
The GDP function, after parameterization in this interval, can be used to predict the values of the function for smaller number of peak values: 8,7,6,...,1. At the threshold 11 Sp= 97%, and the number of GDP-estimated specific BSs equals 10213 versus 10343 BSs observed in Nanog ChIP-Seq assay (Table 1). The number of observed unique ChIP-seq DNA fragments, M 2, representing these BSs (N 2), equals 299830; GDP estimates 297493 ChIP-seq DNA fragments. Interestingly, the last number represents only 3.5% of the total number of the unique DNA fragments of the ChIP-seq Nanog library. Similar results we obtained for other TF libraries (Table 1). These results suggest that the major fraction of ChIP-seq DNA fragments maps to moderate and low avidity loci specific TFBSs and to background non-specific DNA loci.
Only part of original ChIP-seq sequence datasets is available . The analyzed BEs were limited in the dataset to the loci conforming to the specificity threshold value defined by the random model of uniform distribution of background signals reported in . Using these data sets, the smallest threshold of the most reliable and the most specific BEs (providing the best goodness-of fit statistical criteria, according to ), was used to calculate the GDP-defined specificity cut-off value. In these cases, the right tail of the available truncated empirical distribution was used to estimate the parameters of the truncated GDP function.
We calculated the specificity threshold values (8) starting from ChIP-qPCR-defined specificity threshold followed by minimizing the distance between the truncated GDP probability function and the observed frequency distribution. Figure 2 and Additional Files 1, 2 show that the truncated GDP function fits well the right-tail GDP function corresponding to the fraction of reliable and highly specific BEs. Moreover, due to the high accuracy of parameterization of the mixture GDP function and the exponential function, our curve-fitting of ChIP-seq data allows us to predict the GDP function for the noise-enriched BEs, located on the histogram in the left part in the empirical avidity distribution function. Figure 2C and Additional File 1 show the truncated frequency distribution of Esrrb TF-DNA binding (overlap peak cut-off value 12) and the best-fit GDP function extrapolated to the noise-enriched part of the distribution are presented. The best fit GDP function with parameters k = 2.40 ± 0.0778, b = 10.42 ± 0.6828 allows us to extrapolate the best-fit curve and to predict the number of Esrrb TFBSs the in the noise-enriched binding sites.
Figure 2D contains the results of curve-fitting analysis for the c-Myc ChIP-seq library. It shows that the best-fit GDP function, by extrapolation, could provide an estimation of specific BEs in the highly noisy region of the distribution (the left part of the distribution. Thus, based on these findings, the empirical frequency distribution of TF BEs could be separated into the right-side and left-side regions, relatively to the critical cut-off value, t, discriminating the reliable and strongly-specific BSs from the less reliable noise-rich and low/moderate avidity BSs.
Figure 2D shows that at the given level of specificity (sp > 97%) the best-fit GDP function can predict a similar or larger binding specificity threshold, t, than the one obtained from ChIP-q-PCR data. Suboptimal design of the ChIP-qPCR experiment possibly (Figure 3) supports this suggestion. Table 1 and Additional file 2 show that in some Chip-seq libraries the frequency values at qPCR-defined specificity thresholds and the values around them systematically deviate from the GDP extrapolation curves. For six TFs (Oct4, sox2, Klf4, c-Myc, n-Myc, STAT3) the specificity predicted by our model at the qPCR-defined critical threshold values was much smaller (cutting off specificity at 78%, 85%, 88%, 77%, 78%, 78%, respectively) than it was found in qPCR-ChIP experiments (Table 1). In these cases our model allows us to select more reliable critical threshold values (at a larger overlap peak height) following a single confidence criterion (e.g. specificity cut-off at 94%). For the other four TF ChIP-seq libraries,(Essrb, E2F1, Tcfcp211, ZFX) out of 11 studied TFs specificity thresholds were defined by the both methods at a similar specificity cut-off (94%) (Table 1). In particular, for c-Myc BEs, at specificity level >0.95, reported in Supplementary File of , for qPCR-ChIP our model predicts t = 12 instead t = 9 (determined by qPCR-ChIP experiment with Sp = 100% based on 47 qPCR experiments). We could suggest that in these cases "optimistic" specificity estimates could be obtained due to a sub-optimal design of qPCR-ChIP experiments (see discussion of Figure 3) and/or unreliable detection of the peak values for a given range of overlap peak signal enriched by non-specific BSs.
Best fit K-W function estimations: specific components of probabilistic TF-DNA binding model
N 2/ * 100%
In this section we will answer the question: how many physically specific BSs of a given TF exist in the mammalian genome? We used the theoretical result obtained in the previous section to estimate the parameters a, b and θ of K-W function and thus, to estimate p 0. Practically, for all ChIP-seq datasets we could fit the K-W probability function to the GDP function values after fitting the truncated GDP to a high-confidence segment of the empirical frequency distribution of TF-DNA binding. This segment of the empirical distribution is assigned by the number of ChIP-seq DNA fragments, called M 2, which form N 2 clusters. The number of these clusters was counted for the peak height values ranged between specificity threshold level t and maximum value of peak height J, where the specificity Sp is defined by the formula (8).
The detailed description of the algorithm of estimation of the parameters of skewed distribution function functions like GDP and K-W was reported in . Using this algorithm, we found that for ChIP-seq data the GDP function exhibits a fairly accurate approximation of K-W function throughout the entire range of BEs (Additional file 5), We used the properties for fitting K-W function based on the extrapolated data estimated from the best-fit truncated GDP function values. The estimated parameters of K-W function, we used to estimate the probability of non-detected BEs (p 0 at m = 0) by the formula (28) and N tot , as follows:
GDP can be considered as a good empirical approximation of the K-W function (see Methods). Granted this property, we used fitting and back-extrapolation method to get an accurate estimate of three parameters of K-W function. Figure 2C-D Table 3 shows that all empirical distributions of binding avidity are fitted well by the K-W function. Using the estimated parameters of K-W function (see Methods), we can predict the fraction of specific BSs which have not been detected in the ChIP-seq experiments (p o ) (28). For example, the parameters of the probabilistic model for Nanog TF data were estimated as θ = 0.997; α = 5.870; β = 7.465, and thus, from (28), we have p 0 = 0.22. For Oct4 data: θ = 0.998; α = 5.681; β = 8.32, thus by (28) we have p o = 0.32. Interestingly, for all TF binding data the parameter of the K-W function equals to or slightly smaller than 1. This result suggests that for TF-DNA binding-dissociation processes the rate of preferential dissociation equals to or little bit larger than the rate of preferential binding.
Using our parameterization algorithm we can estimate the total number of specific BSs in the mouse genome for a given TF. This estimate equals ~6.67 104 BSs for Nanog and ~2.82 104 BSs for Oct4. Table 3 shows the estimates for other TFs.
Figure 4 and Table 3 shows that the fraction of non-detected BSs (p 0) varies among different TF ChIP-seq libraries from 6% (E2F1) to 54% (ZfX) of the total number of BSs predicted in the genome of mouse ESC E14. The total number of predicted BS ranges from 9874 (c-Myc) and 10320 (STAT3) to 178330 (Klf4) and 513739 (ZfX). The total number of c-Myc TFBS predicted in the genome mouse ESC is similar to the predicted number of c-Myc TFBS predicted in human B-lymphocytes . Note, later estimate derived based on the data analysis and modelling ChIP-PET data. Figure 4 shows also that reliable and specific BEs form the smallest fraction of the BSs with one exception (E2F1, 52%; Table 3), while the number of low- and moderate- avidity TFBSs for other ten TFs contain the most abundant subsets of N tot . The non-observed fraction of TF BS is estimated as the largest for Sox2, ZfX and n-Myc. Thus, these results strongly suggest that the sensitivity of ChIP-Seq technique is still low and has to be improved essentially.
In total, 6437 ChIP-seq c-Myc binding loci with the peak values 7 and higher were found. We found that the number of ChIP-seq loci containing the E-boxes in ± 150 bp region is 3527 (Additional files 6A and 7A) and in ± 250 bp region is 3948 (Additional files 6B and 7B), respectively. These results suggest that c-Myc binding loci are strongly enriched with E-box sequences: we found that 55% (3527/6437) loci of the ± 150 bp region (Additional file 7A) and 61% (3948/6437) loci of ± 250 bp region (additional file 7B) are E-box-positive. Each of these regions exhibits at least one copy of the three specific c-Myc E-boxes or (CACGTG, CACGCG and CGCGAG).
Figure 5C shows the frequency distribution of the peak height representing c-Myc-DNA binding avidity starting with peak value 7. On the Y-axis, we present the numbers of ChIP-seq c-Myc binding loci at the given peak height value, the number of ChiP-seq binding loci containing E box in the ± 150 bp region of centre of locus, and the number of Chip-seq binding loci containing E-boxes in the ± 150 bp region of centre of locus.
Each frequency distribution on Figure 5C shows a similar trend to increase number of binding loci when the peak height decreases. For example in ± 150 bp regions (Additional file 7A), the number of the binding loci containing E-boxes with overlap peak values 7, 8, 9, 10, 11 and 12 were 721, 539, 378, 266, 209, and 159, respectively. The low- and moderate- avidity BSs with peak values from 7 to 11 include the major fraction (~60% (2113/3527)) of the binding loci containing E-boxes found in the ChIP-seq library. We also observed that the number of binding loci containing only one of three E-box sequences increases when the binding avidity decreases (not presented). These trends are in agreement with predictions of our model of TFBS binding (Figure 2, Additional Files 1, 2).
It has been shown that relatively high avidity TFBSs could be preferentially located nearby transcription start site (TSS) of the target genes and overlap with CpG Island . It is also true for a large number of low- and moderate- avidity c-Myc BSs in the genome of human B-cells . We observed that for mouse EC c-Myc ChIP-seq data , E-box-positive Chip-seq binding loci are also often localized in the putative promoter regions of target genes. We found that 50.9% (3277/6437) c-Myc binding loci with peak height 7 and larger are localized within ± 1 kb TSS region of 3966 RefSeq genes (mm8) (Additional file 8). We count also the number of c-Myc binding loci containing E-boxes and being around ± 1 kb of TSS. 68% (2223/3277) of c-Myc binding loci around ± 1 kb of TSS contain at least one E-box in ± 150 bp around the centre of c-Myc binding loci while 32% (1054/3277) of c-Myc binding loci have no E-boxes (Table 1 in Additional file 8A). When we extend the region from ± 150 bp to be ± 250 bp around the centre of c-Myc binding loci, percent of c-Myc binding loci around ± 1 kb of TSS contain at least one E-box increase to 76% (2493/3277) and percent of c-Myc binding loci without E-boxes leave 24%(784/3277). These results easily demonstrate strong association of the E-box-positive binding loci with gene target promoter regions. Perhaps, the most of c-Myc target genes in mouse EC cell genome should be considered as direct gene targets, because alternative E-boxes (for instance, CACATG) also found in the c-Myc binding loci with high frequency.
Interestingly, 28-30% of c-Myc binding loci containing at least one E-box sequence exhibit low- or moderate- avidity (peak height 7-8) c-Myc binding (Table B Additional file 8B). These results support the prediction of our probabilistic model (K-W) that, in low- or moderate- avidity binding could be considered as true c-Myc binding sites. However information about conserved motifs and other regulatory regions are required for indentifying putative true binding sites in low- or moderate- avidity of binding. By our analysis, E-boxes positive binding loci often overlapped with CpG island region(s) which are co-occurred with c-Myc BSs . The loci with peak height >8 are overlapped with CpG Island in 82% (1305/1598) cases (+/-150 region in Table B in Additional file 8); E-box-positive relatively low- and moderate- binding avidity (7-8 peak heights) loci are also overlapped with CpG Islands of the putative gene targets in 75% (469/625) cases. The genes having this moderate avidity BS might be considered as strong candidates for further bioinformatics analysis and experimental validation.
We found that binding loci could be represented by multiple copies of E-box sequences: 3527 binding loci of ± 150 bp regions are represented by 5546 E-box sequences (mean. 1.57 (5546/3527) E-boxes per locus) (Additional file 9A), and 3948 binding loci of ± 250 bp regions are represented by 7182 E-box sequences (mean 1.82 (7182/3948) E-boxes per locus) (Additional file 9B).
Figure 5D shows that the E-boxes often co localized in the ChIP-seq c-Myc binding loci. For instance, Venn diagram on left panel of Figure 5D demonstrates that in ± 150 bp around a centre of binding locus 19% of E-box CACGTG sequence (363/1895) and 25% (363/1476) of E box CACGCG sequence are co -localized in the same binding locus. Kappa correlation coefficient, which measures co-occurrence of the events (StatXact 5; Cytel Software Co), equals 0.66 (p < E-230). Similar values of the correlation coefficient we found between two other E-box pairs, which are presented on the right panel of Figure 5D. Similar results we obtained when 250-nt vicinity around a centre of c-Myc binding locus was analyzed (not presented).
It was shown that the TFBS found in ChIP-seq and ChIP-PET study of c-Myc have a potential to stimulate embryonic SC-specific gene expression [12, 15]. In this section, we shall consider the examples of association between binding avidity level and potential relation of nearby genes with stem-cell specific activity.
FK506 binding protein 5 (FKBP5) belongs to the family of immuneglobulins named for their ability to bind immunosuppressive drugs, also known as peptidyl-prolyl cis-trans isomerases, and also with chaperones (involved in protein folding) . FK506 also plays an important role in cancer tumors growth and chemoresistance through regulating signal transduction of the NF-kappaB pathway . The expression and activity of NF-kappaB is comparatively low in undifferentiated embryonic cells ES cells, but increases during differentiation of the ES cells . Due to this finding we could suggest that c-Myc binds to the low avidity BSs in the promoter region of FKBP5 and, through a modification of expression level of this gene, can provide a regulation of NF-kappaB pathway in mouse ES cell. Interestingly, high, moderate, and relatively low avidity c-Myc BSs in the promoter regions of Wee1, Npm3, and Fkbp5 genes have their binding association scores correlating with ChIP-seq c-Myc binding avidity. Binding association score estimates the genomic distance between a BS and a gene TSS . Additional file 10 contains the data demonstrating that Wee1 could be under promoter regulating activity of Oct4, c-Myc, and n-Myc, while Npm3 could be under promoter activity of c-Myc only. However, although Fkbp5 shows a moderate binding association score with c-Myc and STAT3, Fkbp5 expression is strongly associated with Nanog and n-Myc promoter activity. These results suggest that all three genes could be under the transcriptional control of c-Myc and, additionally, that Wee1 and Fkbp5 could be associated with ES-cell specific expression.
In this work, we studied statistical characteristics of protein-DNA binding events for the eleven stem cell- related transcription factors bound in the genome of mouse embryonic stem cells E14 and detected by ChIP-seq assay . Several methods for ChIP-seq data analysis have been recently developed [11, 13, 27]. However, an appropriate mathematical model of TF-DNA binding in ChIP-seq binding assay and statistical evaluation of the sensitivity of the methods has been not developed. Several approaches to quantitative identification of individual TF-bound loci have been recently developed(peak finder algorithms [7, 9, 11–13, 27]), however overall binding events for specific DNA loci including low- and moderate- avidity TFBS at the genome scale have been out of systematic consideration.
Our previous analysis of different ChIP-base TF-DNA binding datasets  suggested that the mixture probability distribution model (1) could reflect a common property of TF-DNA binding events in different cells in different experimental conditions. This model allowed us to estimate the specificity of ChIP-based TF-DNA binding events for several TFs. In the present study, we develop a mathematical model of TF-DNA binding-dissociation events in both TF-specific and TF-in specific genomic loci and use this model to estimate a number of essential parameters of statistical distributions observed in ChIP-seq assays. We include in our consideration the binding events of TFs in the entire range of their avidity to their binding loci from very high to very low. We focused on several practically important parameters closely related to analysis of the empirical TF-DNA binding distributions: t, Se, Sp, p 0, and N tot (Figure 3, Table 3). Interestingly, for all TF binding data, the parameter of the K-W function equals to or slightly smaller than 1. It suggests that the Waring probability function could be used as good approximation of K-W function for analysis of empirical frequency distributions of ChIP-seq binding. Another interesting finding: for TF-DNA binding-dissociation process in the mouse genome the rate of preferential dissociation equals to or little bit larger than the rate of preferential binding.
We found that the parameters of the mixture distribution are sensitive to 1) the sample size (number of non-redundant sequence tag reads) M, 2) the proportion of nonspecific sequence reads in the sample, 3) the sampling model and 4) the analytical models and the computational algorithm used for the parameterization of the distribution of TF-DNA BEs. We have demonstrated that skewed shape and sample size-dependence is a common property of a specific TF-DNA binding distributions [7, 9, 11, 12]. Similar results we obtain here for 11 ChIP-seq samples derive from mouse embryonic SCs. Strongly positive values of parameter β in the GDP model describing TF-DNA binding in specific loci were found in all TF libraries (Table 2, Additional file 2 (Figure 2S)). Parameter J of the truncated GDP function reflects the maximal observed number of BEs of the GDP model. This parameter positively correlates with the sample size M [9, 28]. These results agree with previous findings in ChIP-PET TF-DNA binding assay [9, 12, 13].
The values of observed and computationally predicted parameters of the statistical distribution k are found very close to each other for all ChIP-seq TF libraries (Table 2, Additional File 1, 2, 3, 4, 5). Using 11 available ChIP-seq TF-DNA binding datasets, we revisited and improved the estimates of the levels of sensitivity and specificity of the TF-DNA BEs.
Our mixture probabilistic model allows us to estimate the specificity cut-off value for ChIP-seq library and also estimates the fraction of specific TF-DNA binding loci for TFs in a ChIP-seq data. Our probabilistic model estimates also the specificity threshold, which value often is close or more stronger than estimates by ChIP-qPCR assay. Table 2 shows that for all 11 library our model (1) models provides >94% specificity. Thus, we conclude that the basic concept of mixture skewed scale-dependent distribution, originally developed for ChIP-PET data analysis [9, 12], can be applicable to ChIP-seq data.
We conclude that a model of random occurrence of DNA fragment clusters in the genome is not appropriate for quantitative determination of critical threshold of binding specificity in Chip-based genome-wide binding analyses, including ChIp-seq. This conclusion is in agreement with recent computational simulations of the frequency distributions of TF-DNA BEs in Chip-Seq data , where local background noise for Stat1 TF ChIP-seq data  was modelled.
We developed a simulation model of unbiased identification of specificity of 'problematic' ChIP-seq loci. The method could be used to optimize the experimental design of ChIP-q-PCR experiments.
Our probabilistic model shows that at the conventional specificity threshold % (> 95%), the fraction of high-avidity specific sequences and TF binding loci containing these sequences are surprisingly low in all studied libraries. These results suggest that the major fraction of true binding sites could not be detected by the ChIP-seq method without additional experimental validation and rigorous bioinformatics and extensive statistical analysis of data.
According to our model prediction, the number of low- and moderate- avidity TFBSs should monotonously increase when the number of unique overlapping ChIP-seq DNA fragments in a binding locus becomes smaller. To validate these predictions, we used motif-discovery program nminfer to identify a position weighted matrix (PWM) of c-Myc motifs in the ChIP-seq binding loci. Our bioinformatics and statistical analyses revealed that the moderate avidity BSs with peak values 7 to 11 include the major fraction (~60% (2113/3527)) of the E-box-containing TFBSs found in the ChIP-seq library. We also observed that the number of BSs, containing each of the three major c-Myc E-box sequences and their combinations, monotonously increases when the binding avidity decreases (Figure 5C). All these trends are in agreement with predictions of our model of TFBS binding (Figure 2, Additional Files 1, 2). Thus, in combination with motif-finding techniques (and/or experimental validation assay ChIP-qPCR) our modelling approach allowed us to identify the loci of many thousands of novel BSs with characterized with low- and moderate- avidity of TF. Moreover, the number of undetected low- and moderate- avidity specific TFBSs was estimated, which addresses common problem of sensitivity of a given ChIP-seq assay for a given TF in cells under given experimental conditions. We show that although ChIP-seq is a powerful technique still it produces essentially incomplete information about the low- and moderate- avidity TF- DNA binding events in the complex (e.g. mammalian) genomes.
Our mathematical modelling of the mixture strong and weak TF-DNA binding and sequence analysis of genome-wide binding data suggests that integration of these approaches could help to reveal many new target genes for c-Myc and for other studied TFs. For instance, best-fit K-W function predicts in total 51114 Nanog BSs in the genome of E14 cells: 20% of these 51114 Nanog BSs were non-observed in the ChIP-seq library and 60% of the BSs were predicted by the model in the noise-rich BE set. These model-based and data-driven predictions of Nanog BSs could be validated in case study experiments. A functional significance of such low- and moderate- avidity BSs for putative target genes might be investigated in a near future.
The most important thing that our results suggest that low- and moderate- affinity BSs could have biologically meaningful functional roles. However, biological role of the enormous number of the moderate- and low- avidity BSs for TF is unknown. We speculate that these bindings could be used in the nucleus to storage large number diverse TFs and their cofactors at quasi-stationary and thermodynamically-defined states at vicinity of double-stranded DNA. Such weak, unstable and multiple protein-DNA bindings might be use by a cell for recruitment and redistribution of specific TF molecules along the double strand DNA depending from external and internal regulatory signals. We also speculate that the strength of TF-DNA binding-dissociation could be significantly modulated by cooperative interactions among TFs and DNA-binding signals.
For all studied TF library a relatively large fraction of low to moderate binding loci was not detected at all (Table 3). So the estimation of p 0, by K-W model, is an informative measure of the incompleteness of experimental data. ChiP-Seq- derived frequency of DNA-TF BEs can be fitted well with truncated K-W probability function and thus, allows us to estimate all specific TFBSs (Ns 1+ Ns 2) which could be found in the library and also to estimate the total number of TFBSs (N tot ) entire genome.
However, father analysis of sensitivity and robustness of estimates and extrapolations of the probabilistic model of TF-DNA binding model and analysis more complete and large ChIP-seq datasets might be important for robust and accurate parameterization of our models and its applications.
We proposed a probabilistic model of TF-DNA binding process at the genome level and based on the model we developed a computational method which allows us to 'de-noise' ChIP-seq datasets and to estimate the specificity and the sensitivity of ChIP-seq assays.
Goodness-of fit analysis of GDP and K-W functions suggests that the sensitivity problem has not yet been technically resolved by the ChIP-based methods, including ChIP-seq. TFBS motif finding analysis supports our results. Due to the proposed improvements in the sensitivity and the specificity of ChIP-seq assay, functional roles of an extremely large number of low/moderate avidity TF binding loci in the mammalian genomes can now be investigated. After these studies the models of transcriptional regulatory network in embryonic cells and other cell types should be carefully revised.
The numbers of the low/moderate- and high- avidity specific TF BSs are estimated here for all the studied data sets. We suggest that many low- and moderate- avidity BSs have biologically meaningful functional roles. Since in the previous studies only high avidity TF BSs could be reliably detected by ChIP-seq assays, identification of other binding sites and elucidation their functional role in genome is a great imperative goal for biotechnology, computational biology and functional genomics.
It is likely that our approach could also be applied to the analysis of high-resolution ChIP-based generated profiles of chromatin chemical modifications [29, 30] in mammalian genomes. Our work provides a theoretical framework for a comprehensive computational prediction and a robust experimental identification of TFBSs (and other ChIP-seq data) when low- and moderate- avidity sequences are over-represented in ChIP-derived sequence samples.
We used ChIP-seq datasets (ChIP-seq libraries) generated by Solexa sequencing for eleven TFs (Nanog, Oct4, sox2, KLf4, STAT3, E2F1, Tcfcp211, ZFX, n-Myc, c-Myc and Essrb). These TFs are considered as essential TFs for the maintenance of the pluripotency in mammalian stem cells and were studied using murine E14 embryonic stem cells cultured under feeder-free conditions as described in Chen et al . The mapped ChIP-seq datasets were downloaded from T2G GIS DB (http://t2g.bii.a-star.edu.sg; see also NIH GEO ID:GSE 11431).The extended ChIP-seq DNA fragments were clustered and the number of overlapping fragments were summed at each locus and used to construct empirical frequency distribution of TF-DNA binding. Some statistical additional characteristics of the libraries are presented in Table 1 and Table 2 and Additional files 4, 5.
To validate predictions of our TF-DNA binding model and to extend the sensitivity of ChIP-seq derived sequences clustered the loci with low- and moderate- binding avidity, we used genome coordinates of the available extended ChIP-seq fragments and provided computational identification of TFBSs by using motif-discovery program nminfer from NestedMICA http://www.sanger.ac.uk/Software/analysis/nmica/. c-Myc TF ChIP-seq data  were used to illustrate our approach NestedMICA program was used to identify position weighted matrixes (PWMs) E-boxes and consensus sequence motifs of c-Myc TF into genome-mapped ChIP-seq DNA fragments. The sequences of strongly specific BEs with height 12 and higher (peak12+) were used as a training set for motif discovery. The training set sequences were downloaded from mouse genome (UCSC mm8 after repeat masked by capital Ns). The test set is peak regions (± 200 bp around centre of the cluster peaks) with height 7 and higher (peak7+), which have been not used in the training set.
The PWM found from training set were computationally scanned on the test set by nmscan program at score threshold-3.0. The motifs found from in test set were used to scan and count number of each motif found in the mouse genome (mm8).
To quantify the avidity of TF-DNA binding in a given locus, Chen et al.  extended the distinct fragments by 200 bps and clustered these fragments that were overlapping by at least 4 bps. In ChIP-seq experiment, "a binding signal" for a given binding locus has been represented the number of the DNA fragments formed by these 5-end "extended and overlapped" DNA fragments . For Chip-seq experiment, this number could the considered as the TF-DNA binding event (BE) representing TF-DNA "binding avidity" averaged across genome BSs of hundred millions cells.
We define a list of uniquely mapped ChIP-seq DNA sequences observed in a given ChIP-seq experiment as the Chip-seq library. To quantify the data of an individual ChIP-seq library, we defined the size of the library, M, as the total number of distinct ChIP-seq reads uniquely mapped onto a reference genome. Let m denote the number of the BEs counted by a peak-finding program (e.g. T2G) in a DNA fragment cluster overlap of ChIP-seq library. The single DNA fragment mapped on the genome is also included. m = 1, 2, 3,..., m max, where m max = J denotes the maximum value of m. Let n (m, M) denote the average number of genome loci in which the BEs found in a given ChIP-seq data exactly m times in the library of a size M. Due to sample size, experimental errors and biological variation across many cells and environmental conditions, observed n is the function averaged across the cells and conditions and this function increases when M becomes larger. Thus, n only approximates a true number of TFs bound to genomic loci with BE value m. However, n = n (m, M) after an appropriate normalization could be used for statistical analysis of relative binding avidity of ChIP-seq binding loci.
Let denote the total number of distinct loci counted in the ChIP-seq library. Then and we could call M also as the "DNA sequence mass". The empirical frequency distribution of the number of DNA fragments in the locus within the ChIP-seq dataset ( ) might be considered as the empirical probability function of TF-DNA binding avidity. Such histogram is an essential starting point for further statistical analysis of data and planning of validations studies [7, 9, 13].
where is the empirical probability distribution function of specific binding, is the empirical probability of non-specific binding. The parameter α could be estimated as the fraction of the extended DNA fragments uniquely mapped on the genome and belonged to the true loci having specific TFBS. 0 < α < 1. Figure 2 shows the examples of non-normalized empirical frequency distribution of TF BEs (e.g. Nanog TF-DNA data) which after normalization to 1 can modeled by (2).
Due to sequence read sampling, the frequency distribution (1) is considered as scale-dependent and skewed functions [20, 32], i.e. when the sample size, M, increasing, the shapes of the noise and specific frequency distribution functions are changed correspondently with library size. We model the non-specific avidity distribution function P ns in (1) by an exponential function distribution with continuous exponent parameter as a decay function of sample size M [9, 11].
If one has prior knowledge of the sets of all TFBSs and of all sequences not bound by a given TF in the genome, then conventional calculation of the specificity and sensitivity of genome-wide TF BEs is straightforward. However, in the absence of such knowledge, one needs to rely on statistical analysis of data-driven physical models and computational estimates using available highly-noisy and incomplete DNA fragment samples [9, 12].
A significant amount of non-specific genomic DNA fragments (background noise) is always present in the inmmunoprecipitated DNA material of any ChIP-derived dataset [9, 12]. Some non-specific DNA might be easily filtered out after computer mapping of the DNA fragments on the genome. In larger library, the number TF specific loci could be increased. However, background (or noise) genomic DNA fragments could non-uniformly located in the genome and thus false clustered should be occurred in any region of the empirical frequency distribution [7, 9, 12, 13, 15].
The following basic statistical tasks are becoming imperative: (i) specificity of the library, i.e. to identify statistically significant TFBSs and count their number at the given confidence level, t, of BEs; (ii) power of the library, i.e., to identify "true" specific BEs which are present in the noise-enriched subset of relatively low read counts (0< m <t) in the library and (iii) sensitivity of the detection, i.e. the number of "lost" BSs which are available for TF binding in the given cells at the given condition, but were not detected due to a limit of the TF library size and the technical implementation.
We analyze these problems via probabilistic modelling, goodness-of-fit analysis and computational modelling of non-specific and specific BEs loci for a given TF in the ChIP-seq library. This analysis is used to quantify avidity of binding events of eleven TFs studied in the genome of mouse stem cells.
where N 1 is the number of observed 'noise-rich' TF- DNA binding loci, having relatively low/moderate potential of TF binding avidity and N 2 is the number of observed 'specific- rich' TF-DNA binding DNA loci, having a relatively high potential of TF binding avidity. The parameter t is the TF-DNA binding specificity threshold value.
where M 1is the number of ChIP-seq DNA sequences which observed in the subset of 'noise-rich' and non-reliable TF-bounding DNA loci, M 2 is the number of ChIP-seq DNA fragments which are observed in subset of reliable specific and TF-bounding DNA loci.
where, is the estimation of the number of specific BEs at value m. is the estimate of the number of TF-DNA binding loci at m≥t. is the estimate of the number of specific TF-DNA binding loci at m<t.
where is estimated in a domain of m-value of ChIP-seq library (m = 1, 2, 3,..., J).
The continue parameter k characterizes the skewness of the probability function; the continue parameter β characterizes the deviation of the GDP distribution from a simple power law. J denotes the maximum observed number of BEs and used as an empirical parameter of the model (11)-(12). This parameter in scale-dependent cases is positively correlated with the sample size M . Since in log-log plot the truncated function (11)-(12) exhibits systematic change of its shape when the sample size M is changed , the model could be co-called the empirical scale-dependent TGDP model .
A noise background BEs could mask the specific moderate to low avidity TFBSs. It is important to estimate the numbers of specific moderate to low avidity TFBSs masked by noise background BEs in ChIP-seq data. However, these sub-sets of BEs might be not easily separated due to the distributions overlapping and sample size dependence. To estimate t -value at the given specificity level and the numbers of specific moderate to low avidity TFBSs associated with this t-value, we used the fitting and back-extrapolation method of recovery of the distributions of specific and non-specific BEs.
Briefly, our algorithm includes several steps: (i) an identification of the functions which after optimization of parameters could provide the best-fit functions approximating the left side and the right side of the empirical distribution, respectively, (ii) an extrapolation of these functions to the function overlapping region, (iii) identification of the specificity threshold t, (iv) an estimation of the weight parameter α in (2), (v) final correction of the estimated parameters using complete model, (vi) restoration of the values and based on the extrapolation method applied to the best-fit distribution functions. To fit the distribution functions we used optimization criteria and methods reported by [20, 35]. We used also the non-linear regression tools of Sigma-Plot software (Version 11).
In ChIP-seq experiments, short DNA sequence tags are randomly chosen and consequently aggregated onto genome clusters in the result of sampling of the tags derived from a large but finite number of a ChIP-seq dataset. What kinds of exploratory models could be used to quantify forming of ensemble of TF clusters bound on the genome DNA?
We [7, 9, 12, 36] have shown that the tail of the empirical distribution function of TF-DNA BEs could be approximated by a skewed truncated Pareto-like function. This function is sample size dependent. Due to a limited number of sequence reads, some true TFBSs, N 0, which are physically available for TF binding in the given cells at the given condition cannot be detected in a TF ChIP-Seq library, even if the noisy BEs are fully suppressed. In ChIP-based experiments, the number of non-detected TFBSs, N 0, might be larger than the number of reliably detected specific TFBSs, N 2 .
In general, an estimation of N 0 within obtained from essentially incomplete samples is a very difficult statistical task . First, this is due to (i) a large number of the real low- and moderate- avidity TFBSs (subset of real 'hidden' BEs, , indicated by the extrapolation curve on Figure 2A-B) which provide only a minor admixed part of BEs in the histogram associated with noise-enriched BEs found in ChIP-seq data. Second, a fraction of the rare BEs (subset N 0) which is not presented in the observed dataset could be estimated if an appropriate physical-mathematical model is used. Below, we present a stochastic model of TF-DNA binding and dissociation together with the a method which allow us to estimate not only , but also N 0.
The GDP model (8)-(9) can provide a good approximation of the empirical data; however this model does not allow estimating N 0 in a given assay. Eq(8) can be a good approximation of the Waring probability function [20, 35, 38], which could count explicitly a probability of non-observed events. This function has been considered as an adequate distribution function derived from several explanatory stochastic process models including sampling genera from a heterogeneous biological population  and the aggregation process of particles . The Waring probability function can be considered as a special steady-state solution of the stochastic birth-death processes called Kolmogorov-Waring process  arising in 'omics' data analyses. In particular, this model has been used for the modelling of protein domains in molecular evolution [20, 35]and the sampling of SAGE gene expression tag profiles .
We assume that K-W function could be considered as an exploratory stochastic model of the evolution of specific TF-DNA binding and use the model to estimate N 0. We assume that an evolution of specific TF-DNA interaction in the genome can be considered as stochastic binding and dissociation events while taking into account two binding and two dissociation transition probabilities. For binding, we consider the preferential attachment process (due to the specific binding potential between TF and DNA) and the Poisson process (non-specific potential). Similar two processes but with different intensities are assumed for detachments transitions (Figure 1D).
where . This condition exists when starting from some i = i c the condition η i ≤ ν < 1 takes place for all i ≥ i c (i.e. on the right tail of the frequency distribution).
(m = 0, 1, 2,...), where the model parameters are defined by the following conditions . Hence, during an interval (t, t + h) where h is infinitively small, we assume that there are four independent processes: the spontaneous TF binding on and dissociation from a given specific TFBS DNA locus, with constant intensities and , respectively, and the TF binding on and dissociation from a specific TFBS DNA locus with the intensities proportional to the number of TFs already attached to the specific TFBS and respectively.
m = 0, 1, 2,.... Γ (x) is the Gamma function, and B (x) is the Beta function . The (25)-(26) is K-W probability function . When the preferential attachment and the preferential dissociation rates are equalled ( ), then we have the Waring distribution function .
at α = a, β = 1, γ = b + 1.
where and m = 0, 1, 2,...; p 0,0 ≡ p 0.
where2 F 1 is the hypergeometric Gauss series .
Eq(19) can be used for extrapolation of K-W model up to m = 0 (unobserved number of BEs). Than by the following recursive formula (18), we can estimate the frequency of BEs at each value m (m = 1, 2, 3....).
where m = 1, 2, 3...
Using this formula, we can prove the following useful approximation:
where B (b +1, m) is the Beta function.
The expressions (28) - (31) can be used for the quantitative analysis of the distribution of ChIP-seq derived TF-DNA binding events and for the calculation of p 0. Note that (27)-(28) provide more accurate estimates when a fraction of reliably detected TFBSs in ChiIP-seq assay becomes larger.
The estimation of parameters in the general multi-parameter families becomes problematic when the number of unknown parameters increases. However, the three, two- or one- parameter family distributions (27)-(28) could be feasibly fitted to the empirical distributions.
which as J → ∞ can be simplified to yield .
By analogy to (32), it is easy straight forward to derive the double-truncated K-W function for any low threshold value t (t = 1, 2,...) of the number of TFs bound to a given TFBS, and next using the re-normalized K-W function, to estimate parameters of the double-truncated K-W function using the optimization algorithm reported in .
ChIP-seq library: list of ChIP-seq DNA fragments uniquely mapped onto reference genome
m : number of ChIP-seq DNA fragments shearing a unique locus in the genome; m could represent a relative level of binding avidity of a TFBS in a given genome locus. m = 0, 1, 2, 3,... J.
n (m, M): number of the loci representing by m-value in the ChIP-seq library.
N : number of distinct loci counted by the peak-finding algorithm (T2G) in a given ChIP-seq library.
M : total number of DNA fragments uniquely mapped to the genome and counted in N loci (or " total DNA sequence mass").
t : specificity threshold
M 1: number of ChIP-seq DNA sequences which observed in the subset of low/moderate avidity loci
M 2: number of ChIP-seq DNA sequences in low/moderate avidity loci
N 1: number of observed 'noise-likely' TF-bound DNA loci, having relatively low/moderate avidity
N 2: number of observed 'specific- rich' TF-bound DNA loci having relatively high binding avidity.
P : probability distribution function.
P s : probability distribution function of specific binding.
P ns : probability distribution function of occurrence of non-specific binding : empirical probability distribution function
: empirical probability distribution function of TF-DNA specific binding.
: empirical probability function of occurrence of non-specific binding
: Predicted number of specific genome loci in the ChIP-seq library.
: Predicted number of specific TF-bound DNA fragments in .
estimate of the number of specific loci in the subset of observed 'specific- rich' TF-bound DNA loci.
estimate of the number of specific loci in the subset of observed 'noise-rich' TF-bound DNA loci.
s : relative frequency of specific BEs in the mixture probability function (1).
α : estimated fraction of specific DNA fragments representing
: Mean of TF specific binding.
TF-DNA binding avidity: an integrative quantitative characteristic of availability of a DNA locus (e.g. TFBS and its flanking region) for a given protein (e.g. TF) binding Sp : specificity
Se : sensitivity: (1- p 0)100%.
N tot : total number of specific TF bound loci in the genome
: model- predicted total number of specific TFbound loci in the genome
p 0: predicted fraction of specific TF bound loci out of ChIP-seq library data.
Generalized Discrete Pareto
Transcription factor binding site
serial analysis of chromatin occupancy
sequence tag analysis of genome enrichment
position weighted matrix
transcription start site
We thank Chia Lin Wei for access to original data of ChIP-seq libraries. The authors acknowledge Drs. A. Batagov, J. Muller, and V. Boeva for fruitful discussions and comments. Onkar Singh also thanks BII/A-Star for providing internship. This work was supported by Bioinformatics Institute/A-STAR, Singapore.
This article has been published as part of BMC Genomics Volume 11 Supplement 1, 2010: International Workshop on Computational Systems Biology Approaches to Analysis of Genome Complexity and Regulatory Gene Networks. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.