Most transcription factor binding sites are in a few mosaic classes of the human genome

Background Many algorithms for finding transcription factor binding sites have concentrated on the characterisation of the binding site itself: and these algorithms lead to a large number of false positive sites. The DNA sequence which does not bind has been modeled only to the extent necessary to complement this formulation. Results We find that the human genome may be described by 19 pairs of mosaic classes, each defined by its base frequencies, (or more precisely by the frequencies of doublets), so that typically a run of 10 to 100 bases belongs to the same class. Most experimentally verified binding sites are in the same four pairs of classes. In our sample of seventeen transcription factors — taken from different families of transcription factors — the average proportion of sites in this subset of classes was 75%, with values for individual factors ranging from 48% to 98%. By contrast these same classes contain only 26% of the bases of the genome and only 31% of occurrences of the motifs of these factors — that is places where one might expect the factors to bind. These results are not a consequence of the class composition in promoter regions. Conclusions This method of analysis will help to find transcription factor binding sites and assist with the problem of false positives. These results also imply a profound difference between the mosaic classes.


Background
The DNA sequence has no landmarks to guide the search for transcription factor binding sites: these binding sites may be near the transcription start site but may also be far from it [1,2]. Many papers have examined how these sites might be found computationally [3]. Some methods use a comparison between orthologous regions of different species [4], often treating the problem as one of multiple alignment [5,6]. Other algorithms use a collection of subsequences containing a binding site (for example the promoter regions of coregulated genes or subsequences derived from ChIp-chip experiments) to deduce the form or motif of the binding site which is then used to identify sites in other sequences -reviews of these methods are given in [7,8]. These methods include Weeder [9], MEME [10], ANN-SPEC [11], MORPH [12] and GLAM [13]. Some authors have proposed a statistical test to decide whether a region of DNA is a regulatory region: two methods [14,15] tested on fly data have been motivated by the hypothesis that the local region around the binding site should be similar to the motif itself. Interestingly, such a tendency would not explain the results of this paper. A distantly related line of research is the modeling of nucleosome positions with the expectation that transcription factor binding sites avoid these positions [16][17][18]. A number of projects have combined data of several types to predict binding sites: for example [19][20][21].
The motif-finding methods give the immediate context for the current work. These methods commonly find a large number of false positive binding sites in new sequences [22][23][24]. As well as a model for the binding site, these methods need a model of the non-binding sequence. The complexity of this model ranges from using single nucleotide frequencies (the default for MEME [10]), to modeling the background as a number of states [25]. Using a Hidden Markov Model, that study found that a useful level of complexity was four states with the probability of a base at a given position depending on the state and the previous base. It is convenient to refer to these states as "mosaic classes" because they are short -about 50-100 bases long. However, the emphasis has been on using no more complexity than is needed to assist the motif finding: there appears to have been little work to find the best model for the bulk DNA and this paper addresses this problem. It is plausible that such an analysis will be useful because much of the genome gets its character from local evolutionary processes [26][27][28] which would be well modeled by these kinds of classes. Short repetitive elements would also be well described.
In this paper, we draw a distinction between occurrences of motifs in the DNA sequence which are sites where a transcription factor might bind (and do bind in in vitro experiments [29]), and binding sites where factors are experimentally found to bind in vivo. There is also the difference between binding sites and the subset which are proven to affect transcription [30], but this point is not considered in this paper.
We find that the DNA sequence may be described in terms of short subsequences: each subsequence belonging to one of 38 states, or mosaic classes, each with its own distribution of base frequencies. These classes come in pairs because of the equivalence between the strands. For a set of seventeen transcription factors from different families of factors, 75% of actual binding sites are in the same set of four pairs of preferred classes which account for only 26% of the bases of the genome. However, only 31% of the motifs for these factors are in the same classes. This tendency is observed for all seventeen transcription factors. These results are not a consequence of the different base composition near transcription start sites.

Classes Found
By analysing sequences taken at random from the whole human genome, we find the pairs of mosaic classes described in Table 1. The parameters of the mathematical model describing these classes are given in Additional File 1. The classes are generally short in the range 10 to 100 bases, but one pair (number 6) has an average length of over 500 bases. As one would expect most classes are poor in CpG doublets, but one pair (number 14) has a CpG doublet proportion of 10%.
To visualise the mosaic classes, we have plotted them by their A, C, G, T content. To do this we have used the variables T+A, T+C, T+G -see Figure 1. These variables have been used because the A, C, G, T proportions form a three dimensional space (one degree of freedom is lost as the proportions sum to one), which is most naturally interpreted as a tetrahedron. Plotting the space using any two of these variables gives an undistorted view of the tetrahedron with the four corners of the tetrahedron at the four corners of the plot. These variables also maintain the symmetry between the bases as the variables T+A, T+C, T+G are one minus the remaining three pairs of variables C+G, A+G, A+C.
It would be possible to use sequences from a defined subset of the human genome to derive the mosaic classes, and such an analysis might give different results. Interesting possibilities include promoter regions, transcribed regions, non-transcribed regions and the genome masked of transposons and/or repetitive elements. Using the whole genome has the advantage of simplicity and ensures that nothing has been left out: it is the obvious baseline analysis and is justified by the results we obtain. Whether a different subset of the genome gives a more biologically relevant set of classes is a matter for research.

Symmetries
DNA has two strands which are chemically indistinguishable, with the structure found by Watson and Crick of As paired to Ts and Cs to Gs. When the base frequencies of a class are counted, it is necessary to choose one of the strands arbitrarily for the measurement. To discuss the effect of this symmetry, consider a hypothetical class, (called H), in which 40% of the bases are A, 20% are C, 20% G, and 20% T. There are logically two contradictory possibilities: a) Such a class as H cannot exist -exact strand symmetry applies everywhere including within a class, so that in a class there will be the same number of As as Ts and Cs as Gs. This would imply strong constraints on the base frequencies of any class. b) When the genome is examined -with the strands being equally likely to be used as the measurement strand -it will appear that there is another class K whose characteristics are the reverse complement of H, that is 40% T, 20% G, 20% C, 20% A. We find from our analyses that statement a) is false: if it were true, then the classes found would occur on the line of symmetry T+C = 50% in Figure 1A. This result is not unexpected because there is a literature [31,32] on A-T and C-G strand symmetry, which suggests that both on the small scale [28,33] and the large scale [34,35] there may a preferred strand.
It is possible for a large scale feature of the genome to be used to break the symmetry between the strands by defining one strand as the measurement strand -for example, in DNA replication one of the strands is the leading strand. However, we expect that classes will still be found in symmetric pairs, even if some class-pairs have a preferential orientation with respect to the defined strand. This remark is based on early analyses concerning transcription -details not shown.
In Figure 1, there are hints of other approximate symmetries, (about a vertical axis in Figure 1A and 1B and about a sloping line from lower left to upper right in Figure 1C).

Transcription factor binding sites
As explained in the Methods Section, we found the exact position of a set of experimentally verified transcription factor binding sites (TFBSs) and ran our model to discover which mosaic classes contained these sites. Detailed results will be presented for seventeen transcription factors and the average for these TFs is shown in Figure 2A. Four pairs of classes contain most of the binding sites: 2, 7, 9 and 14. We will refer to these classes as the "preferred classes" and they are shown in red in the Figures. Table 2 shows the proportion of sites in these classes for these transcription factors. The balance between the four pairs of classes varies with the factor and there may be a preferred orientation/choice of strand as may be seen from the example results for ZNF263 and SRF shown in Figures 2B and 2C. This tendency for binding sites to occur in these classes applies to every transcription factor -the lowest proportion of sites is 48% which is still nearly twice the proportion expected from the number of bases in these classes.
We ensured that in finding the positions of the binding sites we used the same motifs that had been previously reported. Logos for these motifs and cross references to previous work are given in Additional File 2. In the handful of cases where the known motif was not found the results were discarded.
We had little choice for which TF to use for a TF family and for which cell line to use for a TF, and where there is a choice, it is usually immaterial. This is shown in Table 3, which gives a table of ENCODE experiments for those TFs for which we have results. For consistency, the cell line K562 has been used where possible for the detailed analyses. (We discuss later whether the cell line influences the result.) Only for the ap1 family is there a real choice, where the c-Fos result for GM128 is an outlier from all the other results: here we have used the dataset (c-Fos/K562) that gave 60%. Each row refers to two classes, which we refer to as the a and b classes, each with the reverse complement properties of the other. The a class is the one with the higher percentage of T+C, and the base frequencies for this class are shown in columns 2-3-4-5. The frequencies for the b class are the same but with the A/T and C/G frequencies interchanged. Column 6 gives the mean length of the class in bases and column 7 gives the proportion of doublets within the class that are CpG: these two quantities are the same for both the a and b classes. Column 8 gives the proportion of bases in the genome within both classes of the class pair: the total of column 8 is therefore 1.0. The class pairs have been numbered by the proportion of bases they contain. All values have been calculated from the fitted HMM: the parameters of this model are given in Additional File 1.
We have also calculated the proportions of classes for bases for the whole length of the subsequences in the data. Given that this data is supposed to be enriched with TFBSs, (at least with one BS and more if BSs come in clusters), it would be a corollary of our results that these sequences would be enriched with the preferred classes.
In fact, the proportion of bases in the preferred classes in these subsequences is very similar to that for the binding sites, (Figure 3). If it could have been assumed that binding sites were in typical positions on the subsequence, then this result implies that the main thrust of our results  The experimental procedure is to break the DNA near to and on each side of the TF binding site to give fragments that are a few hundred bases long. The subsequence finally reported as containing the binding site is defined by the position of the two ends estimated from the distribution of the genomic positions of the fragment ends. The neutral baseline sample of breakpoints, where a particular TF is not being "pulled down", is variously called the input signal or control library. The nature of the bias inherent in a control library has been discussed by [36], which observes biases from the copy number and from different kinds of repeat regions being under or over   represented, but the most important bias they report is an enhancement near the transcription start site (TSS), especially for highly expressed genes. This later point is confirmed by [37,36] also make the point that the cell line may affect the bias in the control library. The reader might ask if any of these biases have affected our results.
A number of general arguments indicate that our results are not affected by experimental bias. For 14 of the TFs analysed the experimental protocol included direct control for these biases [38,39] and the other experiments, that is for CTCF, sp1 and p53, also included their own checking procedures [2,40,41]. If there is any remaining bias in these experiments, it must be well behaved because the protocol produces uncontroversial motifs, as noted above. Although control libraries have a bias towards the TSS, as we discuss later, most of our binding sites are in fact far from the TSS, so that this aspect of the control bias is not represented in our final results. There is some evidence concerning TFBSs: DNA-seI hypersensitive sites are usually taken to be indicative of regions containing a TFBS and reference [37] finds that there is a moderate enhancement for DNAseI hypersensi-  tive sites near TSSs, but there is very little enhancement for distant sites. We also note that the binding site tends to be at the middle of the reported subsequence, Figure 4: that is away from any end effect which might be associated with the breakpoint itself. There is some complexity, because Auerbach et al. [37] suggest that protein binding in a region makes it more likely for that region to be enhanced in a control library and this would imply that TFBSs are more likely to occur in regions of higher tag density in the control library: and we have observed such a bias in the data -details not shown. However, we are confident that our results are true because of the following analysis. There are many TFBSs in regions where the tag density in the control library is low and we have analysed the mosaic classes of these TFBSs. In the data, the tag density is given in integral values and we have found the median value M of 10 k bases chosen at random [chromosome 1 was used to find M]: we then found the subset of TFBSs for which the tag density in the control library at the binding site was zero or strictly less than M.
The results for the TFs in the cell lines K562/K56b from the YALE track show that the BSs from low tag density regions in the control data have the same character as the other BSs -see Table 4. To see if our results depend on the cell line, we have divided the ENCODE experiments into three groups: GM128, K562 and "Other". Many of the results are for K562, which appears to be a favourite cell line for this experiment. K562 is derived from cancerous cells, and it is plausible that this might affect the results. The other lines are also derived from cancer cells except for GM128 which has therefore been examined separately. Using a representative TF for each TF family, we calculated the mean proportion of sites in the preferred classes in each group of cell lines -see Table 5. Table 5 also shows the lower 95% confidence limit for a one tailed t-test. We conclude that each of these three groups of cell lines gives the same result and that our conclusions do not rely on any one cell line.
We now discuss if our results have a simple biological explanation. Most binding sites are far from the TSS: see [2] which reports only 22% of actual binding sites being    Column 2 gives the number of sites and column 5 the proportion of TFBSs in the preferred classes for all TFBS (as in columns 4 and 9 of Table  2). Columns 3 and 6 give this information for regions of low tag density in the control library, "sparse regions". Column 4 gives the proportion of sites in these regions. Note that the final column is very close to the preceding column and that often a substantial fraction of TFBSs are in "sparse regions". We conclude that the final results are not affected by experimental bias reflected in the control library. This analysis has been done for the "YALE" experiments using cell lines K562 and K562b. For K562, "sparse regions' were defined as a tag density of 0 or 1 in the input signal, and for K562b as a tag density of 0, 1 or 2.
within 1 kb of the TSS: compare the figure given by [1] of 53.5% of DNaseI hypersensitive sites as being further than 2500 bases from a TSS. For the TFs analysed in the current datasets, the proportion of actual binding sites further than 1500 bases from the TSS varies from TF to TF, but a typical figure is 80% to 90%, ( Table 6, column 3). It is therefore not possible to explain our results by arguing that the proportions of classes found for binding sites reflect the proportions of classes in promoter regions. Sites far from the TSS (Table 6) show the same tendency as all sites to occur in the preferred classes (Table 2). For sites close to the TSS, an even higher proportion of sites, 87%, are in the preferred classes, but this result comes from the predominance of pair 14, which has a high proportion of CpG doublets (Table 7). There are, however, some points of similarity between the proportion of sites in the preferred classes and the proportion of bases in promoter regions -see Figure 2D -and we speculate that this could arise because promoter regions must be suitable for TFBSs.
Another trivial explanation of our results might be that subsequences are generated at random within each class and the preferred classes shown in Table 2 are merely those classes which are most likely to generate the motif of the factor. This explanation is not likely to work for as many as seventeen factors, but this possibility has been tested as follows. For each transcription factor we used the motif found by MEME in finding the exact binding site positions, and counted the number of occurrences of this motif in artificial sequences representative of each class -see the Methods Section for details. The proportion of occurrences of motifs in each class is not a useful predictor of the proportion of sites in each class, neither for individual classes (see Figure 5), nor for the total in the preferred classes (see Figure 6). More detailed results are given in Table 8.
Using the data plotted in Figure 6, we find the following statements to be statistically significant. The proportion of sites in the preferred classes is greater than the proportion of bases in the genome: a one-sided T-test gives n = This table gives a comparison between cell lines using a representative transcription factor for each family from Table 3. The last row gives the lower 95% confidence limit for the mean of the values in the column based on a one-sided t-test. The calculation includes the anomalously low value for JunD. We conclude that the choice of cell line does not affect the conclusion that TFBSs mainly occur in the preferred classes.
17, df = 16, t = 13.4, p = 2.0e-10, and the 95% lower confidence limit of average proportion of sites = 0.68 (~0.43 more than the proportion of bases). The difference between the proportion of sites and the proportion of motifs in these classes is greater than zero: a one-sided Ttest gives n = 17, df = 16, t = 11.2, p = 2.7e-9, and the 95% lower confidence limit of average difference = 0.37. On the other hand, a comparison between the proportion of motifs and the proportion of bases has a much higher pvalue:-a one-sided T-test gives n = 17, df = 16, t = 1.8, p = 0.047, and the 95% lower confidence limit of average proportion of motifs = 0.257 (to be compared with the proportion of bases, 0.256). A prior Shapiro-Wilk test did not show any departure from normality for the samples: the sample of site proportions gave p = 0.54 and for the motif proportions p = 0.75.

Conclusions
We find that the preferred mosaic classes contain 75% of experimentally verified binding sites for transcription factors from seventeen families of transcription factors. The same mosaic classes constitute 26% of the genome and contain only 31% of the motifs for these factors. These results are shown in Figure 6. This method of analysis will help with the problem of false positive binding sites found by computational methods. These results must mirror the biological and physical processes involved and imply a profound difference between the mosaic classes. Recent research shows that biologically effective binding sites have a range of binding affinities which lead to a corresponding range of gene expression levels [42,43]. It would be interesting to see if the mosaic class which contains a transcription factor binding site also affects the expression level.
Another area for research would be to repeat the analysis for a defined subset of the genome to see if a different set of mosaic classes emerged. One interesting result would be if some parts of the genome showed mosaic classes preserved from an earlier stage of evolution -for example before CpG doublets degraded to CpA doublets. It would also be instructive to perform similar analyses using a large scale biological feature to identify a preferred strand. We have done some exploratory work on the approximate symmetries shown in Figure 1, and we think there is more to be said on this subject.

The mathematical framework
Our results on mosaic classes come from using a Hidden Markov Model (HMM) to model a large sample of the human genome. The analysis is entirely independent of any data on the position of transcription factor binding sites. In the HMM, each mosaic class has been modeled as a single state which implies that the length distribution in the model will be a simple exponential distribution. The probability of a base has been taken to depend on the class and the immediately previous base. The model was trained using the Expectation-Maximisation (E-M) algorithm -references and formulae are given in Additional File 3. The training was continued until convergence was achieved for the proportion of the genome represented by each class and for the base proportions within each class. All emission and transition probabilities may have different values when the model is trained. Classes have been matched in pairs, with the emission probabilities of the paired classes showing A/T and C/G symmetry. This symmetry also means that the transition probability between two classes is the same as that between their matched counterparts. These symmetries have been imposed at initialisation and maintained at each iteration.
Variability in the analyses comes from two sources: the initialisation of the HMM and the choice of sequence used to train the HMM. The variability has been controlled by using eight replications with different initialisation and different training sequences, with each replication using 6000 sequences of 2000 bases. These sequences were taken at random positions from the whole genome, assembly NCBI36, taken from Ensembl [44].
A common subset of classes from these replications was obtained as follows. The classes were considered in the space defined by the three coordinates T+A, T+C, T+G as defined by the steady-state model average and classes of different replication runs were matched if they were close together in this space. Classes were put into sets, so that each set contained one class from each run, with the aim of firstly maximising the number of sets and secondly minimising the average distance between classes within a set. A class was not put into a set unless it could be put into a set where it was close to other classes in the set. The procedure was as follows: in step one, the classes from runs 1 and 2 were matched and the position of each resulting pair was averaged: in the second step these positions were then matched with the classes from run 3: this step was repeated with the average position from the previous step being matched with the classes of the next run. Matching was only allowed between classes (or positions) within a distance, (D, taken as 0.12) in the base-proportion-space, but within that constraint, at each step, the algorithm made a complete search of how the classes should be matched to find the optimum in terms of number of sets and average distance. When finding the average position of pairs of classes, weights were used so that each run made an equal contribution to the final average. This procedure does not guarantee the global optimum and the heuristic was used of taking the best result when the runs were arranged in different This protocol was designed after earlier experiments in which classes were merged during the HMM training if they came close together in the space of base proportions. These experiments suggested that there were around three dozen classes and therefore to allow for some spurious classes in some runs, 50 classes were used to initialise the eight replications.
To give an indication of the statistical robustness of the method, Figure 7 shows the results of the eight constituent runs. This Figure shows that the classes are well defined. It also shows -as is to be expected -that using different sequences for the analysis will affect the results. Figure 8 gives a quantitative demonstration of statistical robustness. This Figure shows the position of the classes in the average-model, and an estimate of the uncertainty in these positions. This uncertainty is shown by the radius of the circles and can be seen to be small: the uncertainty has been calculated analogously to the standard error to the mean, as the root mean square distance from the composite-class to the constituent classes divided by the square root of the number of classes.

Finding classes for the Binding Sites
The positions of subsequences containing binding sites was obtained from the UCSC browser [45] using tracks created for the ENCODE project [1] and the tables labeled "peaks":-for three factors from the "HAIB TFBS" data [38], (ENCODE November 2008 and February 2009 freezes) and for eleven factors from the "YALE TFBS" This table gives the proportion of motifs in the class pair for each transcription factor: the value has been derived from artificial sequences and adjusted for the relative proportion of bases in each class in the human genome. This table is to be compared with Table 2 which gives the distribution of actual binding sites.
data [39], (release 2 of October 2009). We also used data [40] for CTCF from the ORegAnno database (version 17th September 2008) [46], and for sp1 [2] and p53 [2,41]: the data for sp1 and p53 was extracted from TRANSFAC [47] professional version 11.4 and the flanking bases added by TRANSFAC were removed. (For the ENCODE and CTCF data, if there were more than 5000 sequences in a dataset then about 5000 sequences were chosen evenly from the dataset for further analysis. Sequences longer than 700 bases were then excluded: sequences less than 30 bases were extended by 15 bases on each side. Sequences were also excluded from the masked data if more than a quarter of the bases were masked. In some cases the motif was found from a different subset of sequences than later used to find the binding sites.) The genomic coordinates of these subsequences were used to obtain the RepeatMasked sequence [48] from Ensembl [44], which was then analysed by MEME [10,49] to find the binding site motif of the transcription factor. MEME generated three motifs for consideration. Logos for these motifs were drawn with Weblogo [50] and checked against logos previously reported. (MEME does not find the motif for sp1 unaided and was initialised with the motif CCCCGCCCCC.) Using the version of the known motif found by MEME and the corresponding unmasked sequences, the positions of the binding sites within these sequences were found by MAST, a companion tool of MEME. MAST needs a parameter, mt, the threshold for the p-value, to determine how many sites will be reported and we normally used the default, (0.0001), for this. However, in some cases this procedure only finds a motif in a small proportion of sequences, and if necessary the value of mt was increased so that a motif was found in at least half the sequences. In fact, the final results are largely unaffected by the choice of mt. The position of all sites found by MAST in the subsequences was then cross referenced to the position in the genome. The non-masked sequence of length 4000 bases with the motif in the middle was analysed using the final HMM model to determine the class in which the motif occurred. For this we calculated the probability (of a base being in each class) as in the expectation step of the EMalgorithm. This is not the same as using the maximum likelihood estimate (as in the Viterbi algorithm) of the class containing the binding site to calculate the class proportions. For the classes selected for Table 2, the max- imum likelihood estimates are slightly higher than those shown, and lead to a slight (apparent) discrepancy between the theoretical model distribution and the observed genome distribution. Analyses of promoter regions are based on all coding genes as defined by Ensembl 52.

Estimating proportion of motifs by class
For each class, the HMM was used to generate artificial sequences (10 sequences of 1 million bases). For each transcription factor, MAST [10], a sister tool of MEME, was used to identify and count positions of the motif in the artificial sequences using its default parameters and the motif found by MEME. The proportion of motifs in each class was calculated in proportion to a × b × c where a = the number of bases in the genome, b = the proportion of all bases in the mosaic class, and c = the number of motifs per 10 million bases found from the MAST analysis.