Effects of cytosine methylation on transcription factor binding sites
BMC Genomics volume 15, Article number: 119 (2014)
DNA methylation in promoters is closely linked to downstream gene repression. However, whether DNA methylation is a cause or a consequence of gene repression remains an open question. If it is a cause, then DNA methylation may affect the affinity of transcription factors (TFs) for their binding sites (TFBSs). If it is a consequence, then gene repression caused by chromatin modification may be stabilized by DNA methylation. Until now, these two possibilities have been supported only by non-systematic evidence and they have not been tested on a wide range of TFs. An average promoter methylation is usually used in studies, whereas recent results suggested that methylation of individual cytosines can also be important.
We found that the methylation profiles of 16.6% of cytosines and the expression profiles of neighboring transcriptional start sites (TSSs) were significantly negatively correlated. We called the CpGs corresponding to such cytosines “traffic lights”. We observed a strong selection against CpG “traffic lights” within TFBSs. The negative selection was stronger for transcriptional repressors as compared with transcriptional activators or multifunctional TFs as well as for core TFBS positions as compared with flanking TFBS positions.
Our results indicate that direct and selective methylation of certain TFBS that prevents TF binding is restricted to special cases and cannot be considered as a general regulatory mechanism of transcription.
DNA methylation is one of the most studied epigenetic modifications. In differentiated cells in higher animals, methylated cytosine is almost always followed by guanine, associating methylation of 60-90% of all cytosines in a CpG context [1, 2]. Although recent evidence showed that cytosine methylation in embryonic stem cells may also occur as CpHpG and CpHpH (where H corresponds to A, C, or T) [3–5], genome-wide distributions of cytosine methylation in CpHpG and especially in CpHpH have great variability between individuals, contrary to methylation in the CpG context, which demonstrates stable cell-type-specific methylation . Thus, cell-type-specific regulatory patterns most likely depend on methylation in the CpG context.
Various methodologies have been developed to study DNA methylation at different genomic scales (for a review, see, for example, [6–8]) with direct sequencing of bisulfite-converted DNA  continuing to be the method of choice. However, the analysis of a single CpG site or a few CpG sites as surrogate indicators of DNA methylation status of the surrounding region is the most prevalent strategy in epigenetic studies at different scales, due to the assumption of the relatively homogeneous distribution of DNA methylation within genomic regions. This assumption is supported by multiple pieces of evidence of unmethylated CpGs closely co-located within CpG islands (CGIs) and methylated CpGs in repetitive elements. In addition, the level of methylation of the HpaII sites (CCGG) within CGIs demonstrates a correlation with average CGI methylation levels . At the same time, methylated CpGs have been found in unmethylated CGIs . It was also shown that a single differentially methylated CpG might affect transcription of the ESR1 gene . Moreover, it was hypothesized that DNA methylation of CpG-rich and CpG-poor regions might be involved in different regulatory programs . In short, whether or not the distinct methylation status of a single CpG affects specific transcription-related functions remains an open question.
It is widely accepted that cytosine methylation is a crucial regulatory mechanism in both normal and pathological processes. DNA methylation is involved in development [13, 14], cellular differentiation , maintaining cellular identity , pluripotency , aging [18, 19], memory formation , responses to environmental changes [21, 22] and reactions to diet . Several pathological conditions, including cancer [22, 24], diabetes , Alzheimer’s and Parkinson’s diseases , also show aberrant DNA methylation. Profiles of DNA methylation can be inherited through cell division  and in some cases through generations . However, recent studies of dynamic DNA methylation/de-methylation in vivo[27, 28] challenge the conventional view that DNA methylation is a permanent epigenetic mark and suggest the possibility of exploring DNA methylation as a promising target for non-invasive therapies for diseases linked with aberrant methylation.
DNA methylation of gene promoters is tightly associated to the repression of transcription, yet the mechanisms are still unclear . In the last four decades, multiple studies have shown that the level of DNA methylation in promoters is negatively correlated with the expression of downstream genes [30–35]. It was also hypothesized that ubiquitous, low-density cytosine methylation in vertebrate genomes can contribute to reduction of the transcriptional "noise" from inappropriate promoters . Recently, multiple pieces of evidence arguing against the paradigm that DNA methylation always represses transcription have started to appear. Transcription of some genes was found to be independent of methylation . Promoters with low CpG content are usually methylated, yet they still may be transcriptionally active [38, 39]. Although intergenic and gene terminal CGIs are frequently methylated, they demonstrate a pervasive transcription . Sparse DNA methylation of promoters may repress transcription, but this effect could be overcome by an enhancer . Genes exhibiting high levels of promoter methylation during normal development remain suppressed in Dnmt1-deficient mouse embryos, suggesting that developmental gene control does not globally rely on cytosine methylation and that the effects of DNA methylation are limited to specialized processes such as imprinting and mobile elements repression . Alternative promoter usage in different regions of the aged brain seems to be independent of promoter methylation . Promoter sequences are able to recapitulate correct DNA methylation autonomously and demonstrate proper de novo methylation during differentiation in pluripotent cells independently of the transcriptional activity of corresponding downstream promoters . Furthermore, in some cases, methylation is required for activation of transcription and therefore is positively correlated with gene expression .
Despite the various controversies, evidence that DNA methylation as an important step in regulation remains solid. The mechanisms of the interplay between methylation and expression are therefore critically important. It remains unclear whether DNA methylation is the cause or the consequence of altered gene expression. If DNA methylation causes gene repression, then there are several possible outcomes (Figure 1a). Cytosine methylation may directly affect the affinity of transcription factors (TFs) towards their binding sites (TFBSs) . Non-systematic experimental evidence that DNA methylation can prevent binding of some TFs to particular TFBSs [45, 46] supports this hypothesis. For example, methylation of the E-box (CACGTG) prevents n-Myc from binding to promoters of EGFR and CASP8 in a cell-specific manner ; methylation of the YY1-binding site in the promoter of the Peg3 gene represses the binding activity of YY1 in vitro. It is also worth noting that experimentally determined TFBSs usually show low levels of DNA methylation [4, 49, 50] and that TF-TFBS recognition is often associated with the lack of methylation [51, 52]. Furthermore, certain positions within CTCF binding sites are more sensitive to methylation than are others . Methylated cytosine can also attract TFs, both activators [44, 54] and repressors . Methylation of the CRE sequence enhances the DNA binding of C/EBPα, which in turn activates a set of promoters specific for adipocyte differentiation [44, 54]. Methyl-binding domain (MBD) proteins bind methylated CpG dinucleotide and induce histone deacetylation, subsequent chromatin condensation and gene repression .
The opposite scenario implies that chromatin modifications [56–58] reduce the accessibility of TFs and the transcriptional machinery to gene promoters, therefore leading to gene repression. DNA methylation in this model is not a cause, but a consequence of repression and serves to fix the repressed state of the chromatin (Figure 1b). In this case, cytosine methylation accumulates passively as a consequence of the independent absence of TF binding [50, 53] or it appears as a result of direct DNA methyltransferase recruitment by transcription repression proteins such as the Polycomb group (PcG) protein EZH2 . This model is supported by negative correlation of TF expression and average methylation of their TFBSs . Besides, it was reported that binding of some TFs, including Sp1 and CTCF, is sufficient for maintaining a local unmethylated state [60–65]. Nevertheless, this scenario (Figure 1b) does not explain the sensitivity of certain TFs to methylation of their TFBSs.
In this study, we explore the evidence that supports one of these two scenarios. To achieve this, we first test whether methylation of a particular cytosine correlates with transcription. This effect may provide a basis for regulation of transcription through methylation of specific TFBSs. Second, we investigate whether some TFs are more sensitive than others to the presence of such cytosines in their TFBSs and what features of TFBSs can be associated with this sensitivity. To this end, we employed ENCODE  data on DNA methylation obtained by reduced representation bisulfite sequencing (RRBS) . RRBS allows us to identify both methylated and unmethylated cytosines quantitatively at a single base pair resolution in the CCGG context in regions with high densities of rarely methylated cytosines, usually co-located within gene promoters . To evaluate genome-wide expression across different cell types, we used FANTOM5  data obtained by cap analysis of gene expression (CAGE) . FANTOM5 provides quantitative estimation of expression in several hundreds of different cell types.
Our study shows that a fraction of single CpGs within promoters exhibits a significant negative correlation of their methylation profiles with the expression profiles of neighboring transcriptional start sites (TSSs) considered across various samples. Moreover, we observe a strong negative selection against the presence of such cytosines within TFBSs, especially in their core positions. Interestingly, we find that repressors are more sensitive to the presence of such cytosines in their binding sites.
This work is part of the FANTOM5 project. Data downloads, genomic tools and co-published manuscripts are collected at http://fantom.gsc.riken.jp/5/.
Results and discussion
Only a fraction of cytosines exhibits significant correlation between methylation and expression profiles of a corresponding TSS
It is well known that the level of cytosine methylation of promoters is negatively correlated with gene expression ; the role of methylation of particular CpGs in the regulation of gene expression has been demonstrated in the case of ESR1 . The crucial role of the location of methylated regions relative to TSSs is also widely accepted. The question whether methylation of a particular cytosine may affect expression remains unanswered.
As the first step of this study, we studied whether the methylation level of a particular cytosine within a promoter region is correlated with the expression of the corresponding TSS, since such cytosines may serve as a basis for the regulation of transcription through TF binding. Table 1 demonstrates that among 237,244 cytosines analyzed in the study, only 16.6% (0.8%) have significantly (P-value ≤ 0.01) negative or positive Spearman Correlation Coefficients (SCC M/E ) between methylation and expression profiles of a closely located TSS (see Methods). This sheds different light on the common perception of a link between methylation and gene expression. We call cytosines demonstrating significantly negative SCC M/E CpG “traffic lights” (see Methods). In this study, we mostly focus on such cytosines.
Out of 50 cell types analyzed in this study, 14 were malignant. Genome-wide DNA methylation in cancer cells is dramatically different from that in normal cells (for the review see, for example [72–75]). Although we believe that the basic mechanism of interaction between DNA methylation and expression should be the same in cancer and non-cancer cells, we repeated the experiments on the 36 normal cell types and obtained similar results (Additional file 1): only a small fraction (9.5% and 1.5%) of cytosines have significant (P-value ≤ 0.01) negative and positive SCC M/E , respectively.
CAGE tags are often found within gene bodies  and methylation of a gene body may have a positive correlation with gene expression [77–79]. It was also suggested that the cytosines within gene bodies are often not methylated (5mC) but hydroxymethylated (5hmC) . However, bisulfite-based methods of cytosine modification detection (including RRBS) are unable to distinguish these two types of modifications . The presence of 5hmC in a gene body may be the reason why a fraction of CpG dinucleotides has a significant positive SCC M/E value. Unfortunately, data on genome-wide distribution of 5hmC in humans is available for a very limited set of cell types, mostly developmental [82, 83], preventing us from a direct study of the effects of 5hmC on transcription and TFBSs. At the current stage the 5hmC data is not available for inclusion in the manuscript. Yet, we were able to perform an indirect study based on the localization of the studied cytosines in various genomic regions. We tested whether cytosines demonstrating various SCC M/E are co-located within different gene regions (Table 2). Indeed, CpG “traffic lights” are located within promoters of GENCODE  annotated genes in 79% of the cases, and within gene bodies in 51% of the cases, while cytosines with positive SCC M/E are located within promoters in 56% of the cases and within gene bodies in 61% of the cases. Interestingly, 80% of CpG “traffic lights” are located within CGIs, while this fraction is smaller (67%) for cytosines with positive SCC M/E . This observation allows us to speculate that CpG “traffic lights” are more likely methylated, while cytosines demonstrating positive SCC M/E may be subject to both methylation and hydroxymethylation. Cytosines with positive and negative SCC M/E may therefore contribute to different mechanisms of epigenetic regulation. It is also worth noting that cytosines with insignificant (P-value > 0.01) SCC M/E are more often located within the repetitive elements and less often within the conserved regions and that they are more often polymorphic as compared with cytosines with a significant SCC M/E , suggesting that there is natural selection protecting CpGs with a significant SCC M/E .
Selection against TF binding sites overlapping with CpG “traffic lights”
We hypothesize that if CpG “traffic lights” are not induced by the average methylation of a silent promoter, they may affect TF binding sites (TFBSs) and therefore may regulate transcription. It was shown previously that cytosine methylation might change the spatial structure of DNA and thus might affect transcriptional regulation by changes in the affinity of TFs binding to DNA [47–49]. However, the answer to the question of if such a mechanism is widespread in the regulation of transcription remains unclear. For TFBSs prediction we used the remote dependency model (RDM) , a generalized version of a position weight matrix (PWM), which eliminates an assumption on the positional independence of nucleotides and takes into account possible correlations of nucleotides at remote positions within TFBSs. RDM was shown to decrease false positive rates effectively as compared with the widely used PWM model.
Our results demonstrate (Additional file 2) that from the 271 TFs studied here (having at least one CpG “traffic light” within TFBSs predicted by RDM), 100 TFs had a significant underrepresentation of CpG “traffic lights” within their predicted TFBSs (P-value < 0.05, Chi-square test, Bonferoni correction) and only one TF (OTX2) had a significant overrepresentation of CpG “traffic lights” within the predicted TFBSs. Similar results were obtained using only the 36 normal cell lines: 35 TFs had a significant underrepresentation of CpG “traffic lights” within their predicted TFBSs (P-value < 0.05, Chi-square test, Bonferoni correction) and no TFs had a significant overrepresentation of such positions within TFBSs (Additional file 3). Figure 2 shows the distribution of the observed-to-expected ratio of TFBS overlapping with CpG “traffic lights”. It is worth noting that the distribution is clearly bimodal with one mode around 0.45 (corresponding to TFs with more than double underrepresentation of CpG "traffic lights" in their binding sites) and another mode around 0.7 (corresponding to TFs with only 30% underrepresentation of CpG “traffic lights” in their binding sites). We speculate that for the first group of TFBSs, overlapping with CpG “traffic lights” is much more disruptive than for the second one, although the mechanism behind this division is not clear.
To ensure that the results were not caused by a novel method of TFBS prediction (i.e., due to the use of RDM), we performed the same analysis using the standard PWM approach. The results presented in Figure 2 and in Additional file 4 show that although the PWM-based method generated many more TFBS predictions as compared to RDM, the CpG “traffic lights” were significantly underrepresented in the TFBSs in 270 out of 279 TFs studied here (having at least one CpG “traffic light” within TFBSs as predicted by PWM), supporting our major finding.
We also analyzed if cytosines with significant positive SCC M/E demonstrated similar underrepresentation within TFBS. Indeed, among the tested TFs, almost all were depleted of such cytosines (Additional file 2), but only 17 of them were significantly over-represented due to the overall low number of cytosines with significant positive SCC M/E . Results obtained using only the 36 normal cell lines were similar: 11 TFs were significantly depleted of such cytosines (Additional file 3), while most of the others were also depleted, yet insignificantly due to the low number of total predictions. Analysis based on PWM models (Additional file 4) showed significant underrepresentation of such cytosines for 229 TFs and overrepresentation for 7 (DLX3, GATA6, NR1I2, OTX2, SOX2, SOX5, SOX17). Interestingly, these 7 TFs all have highly AT-rich binding sites with very low probability of CpG.
It was previously shown that cytosine methylation can prevent binding of several TFs (such as Sp1 , CTCF  and others) and, therefore, methylation may serve as a global regulatory mechanism for cell-specific TF binding. Yet, we observe that most of TFs avoid CpG “traffic lights” in their binding sites, suggesting a potentially damaging effect of CpG “traffic lights” to TFBS and therefore a natural selection against TFBS overlapping with CpG “traffic lights”.
Computational prediction of TFBSs identifies DNA regions of potential binding, which may not be available for a TF in a particular cell type due to chromatin modifications. To avoid a bias caused by potential TFBSs that are not functional in particular cell types, we used experimentally obtained regions of TF binding. Chromatin immunoprecipitation followed by parallel DNA sequencing (ChIP-seq) is an effective experimental technique for the identification of regions for DNA-protein interaction . Yet, regions where TFs most likely bind DNA (ChIP-seq peaks) in a particular cell type are relatively long, usually longer than several hundreds of base pairs, while real TFBSs are on average a dozen base pairs long. Therefore, we combined experimental and computational approaches and filtered out the predictions of TFBSs outside of ChiP-seq peak regions. We tested our results on ChIP-seq data for CTCF as it is the only TF in ENCODE with experimental binding information in as many as 22 cell types out of the 50 cell types we used in our study (14 of the 22 were normal cell types). Results in Additional file 5 support our initial finding: CTCF binding sites avoid CpG “traffic lights”. ChIP-seq data for other TFs are available only for the cancer cell lines included in our study, making it impossible to draw conclusions about normal cell functioning. At the current stage the ChiP-seq data for other TFs is not available for inclusion in the manuscript. Our findings suggest that changing a TF’s affinity to DNA or even blocking TF binding sites by direct and selective methylation is limited to certain TFBSs within a few promoters and thus is not likely to be a general mechanism of methylation-dependent regulation of gene expression.
TFBSs of repressors are especially sensitive to the presence of CpG “traffic lights”
Overlapping of TFBS with CpG “traffic lights” may affect TF binding in various ways depending on the functions of TFs in the regulation of transcription. There are four possible simple scenarios, as described in Table 3. However, it is worth noting that many TFs can work both as activators and repressors depending on their cofactors. Moreover, some TFs can bind both methylated and unmethylated DNA . Such TFs are expected to be less sensitive to the presence of CpG “traffic lights” than are those with a single function and clear preferences for methylated or unmethylated DNA.
Using information about molecular function of TFs from UniProt  (Additional files 2, 3, 4 and 5), we compared the observed-to-expected ratio of TFBS overlapping with CpG “traffic lights” for different classes of TFs. Figure 3 shows the distribution of the ratios for activators, repressors and multifunctional TFs (able to function as both activators and repressors). The figure shows that repressors are more sensitive (average observed-to-expected ratio is 0.5) to the presence of CpG “traffic lights” as compared with the other two classes of TFs (average observed-to-expected ratio for activators and multifunctional TFs is 0.6; t-test, P-value < 0.05), suggesting a higher disruptive effect of CpG “traffic lights” on the TFBSs of repressors. Although results based on the RDM method of TFBS prediction show similar distributions (Additional file 6), the differences between them are not significant due to a much lower number of TFBSs predicted by this method. Multifunctional TFs exhibit a bimodal distribution with one mode similar to repressors (observed-to-expected ratio 0.5) and another mode similar to activators (observed-to-expected ratio 0.75). This suggests that some multifunctional TFs act more often as activators while others act more often as repressors. Taking into account that most of the known TFs prefer to bind unmethylated DNA, our results are in concordance with the theoretical scenarios presented in Table 3.
“Core” positions within TFBSs are especially sensitive to the presence of CpG “traffic lights”
We also evaluated if the information content of the positions within TFBS (measured for PWMs) affected the probability to find CpG “traffic lights” (Additional files 7 and 8). We observed that high information content in these positions (“core” TFBS positions, see Methods) decreases the probability to find CpG “traffic lights” in these positions supporting the hypothesis of the damaging effect of CpG “traffic lights” to TFBS (t-test, P-value < 0.05). The tendency holds independent of the chosen method of TFBS prediction (RDM or RWM). It is noteworthy that “core” positions of TFBS are also depleted of CpGs having positive SCC M/E as compared to “flanking” positions (low information content of a position within PWM, (see Methods), although the results are not significant due to the low number of such CpGs (Additional files 7 and 8).
We found that the methylation profiles and expression profiles in 16.6% of single CpG dinucleotides in CAGE-derived promoters were significantly negatively correlated with neighbouring TSS, supporting the argument that single cytosine methylation is involved in the regulation of transcription. In a way, the current common perception of the link between methylation and gene expression is seen in a different light. Unexpectedly, we observed a strong selection against the presence of CpG “traffic lights” within the TFBSs of many TFs. We demonstrated that the selection against CpG “traffic lights” within TFBS is even more pronounced in the case of “core” positions within TFBSs as compared to “flanking” positions. These observations allow us to suggest that blocking of TFBSs by selective methylation is unlikely to be a general mechanism of methylation-dependent transcription regulation and that such a mechanism is limited to special cases. We conclude that the regulation of expression via DNA methylation and via TF binding are relatively independent regulatory mechanisms; both mechanisms are thus not in a direct causal relationship. Known cases of interaction between these mechanisms appear mostly because they operate on the same target regions (promoters) and require intermediate partners, for example, modification of chromatin.
We manually selected 137 FANTOM5 samples (cell types) matching 50 ENCODE samples. We grouped them into 50 classes of identical or similar biological cell types. To reduce the noise coming from inexact matching of cell types between FANTOM5 and ENCODE data, we averaged the expression/methylation values for different technical or biological replicas, donors and cell types within the same class. Detailed information is provided in Additional file 9.
All human samples used in the FANTOM5 project were either exempted material (available in public collections or commercially available), or provided under informed consent. All non-exempt material is covered under RIKEN Yokohama Ethics applications (H17-34 and H21-14) and collected in compliance with the Helsinki Declaration.
TSSs and promoter regions
We used TSSs found by the CAGE method in FANTOM5. The relative log expression normalization method (RLE ) was applied to CAGE-tags in each sample . For a particular TSS, we referred to a set of expression values across the selected 50 classes of cell types as an expression profile. Low expressed CAGE-tag clusters may be non-robust to sequencing errors or heterogeneity of the cell population. To reduce the effect of such CAGE-tag clusters, we excluded TSSs with all RLE-normalized expression values less than 1. For each CAGE-tag cluster, we selected a promoter region of 1500 bp upstream and 500 bp downstream of the ends of reported CAGE-tag clusters. Overlapping promoters were considered independently.
Cytosine methylation data
We used cytosine methylation data obtained by RRBS (http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeHaibMethylRrbs). All data included cytosine methylation only in the CCGG context. We excluded cytosines covered by less than 10 reads. For a particular cytosine, we referred to a set of methylation values (the proportion of methylated reads relative to all reads) across the selected 50 cell types as a methylation profile. We excluded cytosines having methylation data for less than 50% of samples (25 when using all 50 cell types and 18 when using the 36 normal cell types) in the methylation profiles.
While each particular cytosine may be either methylated or unmethylated, the RRBS technique measures the average methylation of a particular cytosine in the cell population, which results in a 0 to 100% range of values. Although methylation values of most of the cytosines tend to be 0 or 100%, intermediate values are also possible. Low (but not 0) levels of cytosine methylation may appear as a result of experimental errors, and these levels can affect further analysis. To avoid any bias caused by such cytosines, we used only positions differentially methylated between cell types. We defined a CpG as differentially methylated if the amplitude (the difference between the maximum and minimum values in the normalized profile) of the methylation profile for a particular CpG was greater than 50%.
Correlation of cytosine methylation and TSS expression
For all cytosines located within promoter regions, we calculated the Spearman Correlation Coefficient between methylation profiles of the cytosine and the expression profiles of the corresponding TSS (referred to as SCC M/E ). We estimated the statistical significance of SCC M/E based on transformation to a Student’s t-test distribution:
Here n is the length of the methylation/expression profile for a given position. In our analysis (if not stated otherwise), we referred to positions with P-values (SCCM/E) ≤ 0.01 as positions with significantly negative or positive correlations between the methylation and the expression profiles. It is noteworthy that due to the overlapping of promoter regions for different TSSs, one cytosine may have several SCC M/E . In the case of overlapping promoters, it is difficult to estimate which TSS is affected by the methylation of a particular cytosine. We therefore considered that a particular CpG affects transcription if it has at least one SCC M/E above (or below) the significance level (see Table 1).
CpG “traffic lights”
To avoid bias in estimating SCC M/E for low methylated cytosines caused by experimental errors, we introduced differentially methylated cytosines based on the difference between the highest and lowest value (amplitude) in the normalized methylated profile when it was greater than 50% of the maximum possible value. In the analysis of TFBSs affected by cytosine methylation, we considered only CpGs differentially methylated across cell types. We introduced the term CpG “traffic lights” to describe differentially methylated cytosines with significantly (P-values (SCCM/E) ≤ 0.01) negative SCC M/E .
We also looked for co-localization of CpG “traffic lights” and several genomic features (data downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/): known gene promoters (1500 bp upstream of TSS and 500 bp downstream) and gene bodies (500 bp downstream TSS to the end of the gene) (wgEncodeGencodeBasicV140); CpG islands (cpgIslandExt); DNase sensitivity regions (wgEncodeRegDnaseClusteredV2); repetitive elements (rmsk); SNPs (snp137Common); and conserved elements (phastConsElements46wayPrimates).
Prediction of TFBSs using the remote dependency models
To create RDMs, we used binding site alignments from HOCOMOCO . This collection of TFBS models was selected due to the low level of redundancy of TFBS models per single TF. Binding sites having scores less than PWM thresholds were excluded. PWM thresholds were selected according to the P-value < 0.0005 (i.e., when 5 of 10,000 random words had scores no less than the thresholds). P-values were computed by the MACRO-APE software (http://autosome.ru/macroape)  that implements the strategy presented in the work of Touzet and Varre . Due to the large number of parameters in RDM models as compared to PWM models provided in HOCOMOCO, the minimal number of sequences in the alignment was increased from 8 to 15. Filtered alignments of fewer than 15 binding sites were discarded, which reduced the initial set of 426 TFBS models available in HOCOMOCO to 280 TFBS models (Additional file 4, column 1).
Using the frequency of each dinucleotide with one nucleotide being at position i and the other at position j, where i = 1, …, L-1, j = i + 1, …, L, in the set of aligned binding sites, the dinucleotide frequency matrix with remote dependencies was constructed and normalized similar to PWM normalization in Bajic et al.:
Here f a,i,j is the frequency of dinucleotide a formed of nucleotides at positions i and j, and L is the length of the aligned TFBSs. We predicted TFBSs using the RDM models across the whole promoter set.
Prediction of TFBSs using position weight matrices
To check if the TFBS prediction method affects the results, we also predicted TFBS using widely accepted PWM models. We took the same PWMs from HOCOMOCO as used for RDM construction. PWM thresholds were selected according to the P-value of 0.0005 (Additional file 10).
TFBSs potentially affected by DNA methylation
We selected all cytosines for which SCC M/E were available and checked whether they were located within predicted TFBSs. The total number of predicted TFBSs is available in Additional files 2, 3 and 4 (column D). It is noteworthy that average GC-content of the RDM hits was undistinguishable from that of the binding sites in the initial alignments.
“Core” and “flanking” CpG positions within TFBS
If we consider all genome-wide hits of any TFBS model, we may find that CpG dinucleotides can appear almost in every position of TFBSs. However, some positions within binding sites contain CpG dinucleotide more often than do others, so we repeated the analysis for each type of binding site position separately. For a particular TFBS model, we selected CpG positions in the HOCOMOCO alignments according to the information content of the corresponding PWM columns. Information content is defined as DIC (Discrete Information Content ) separately for different types of binding site positions. For a particular TFBS model, we selected CpG positions in the HOCOMOCO alignments according to the information content of the corresponding PWM columns:
Here x a,j are elements of the position count matrix (i.e., nucleotide counts), N is the total number of aligned TFBS sequences. In contrast to classic information content , DIC is based on raw counts (instead of per-column nucleotide probabilities, which can be inaccurate for a small set of aligned sequences). We define two empirical DIC thresholds Th and th (introduced in ). Th corresponds to the DIC of the column having only 3 (of 4 possible) nucleotides that have the same frequency, th corresponds to the DIC of the column having two nucleotides with the same frequency, f, and the other two nucleotides each with the frequency 2f.
The CpG positions have C and G as major nucleotides (with the highest frequency) in the neighbouring columns. High information content CpG (“core” TFBS positions) has both C and G columns with DIC greater than Th. The medium (or low) information content CpG (“flanking” TFBS positions) has both C- and G-column DIC between Th and th (or lower than th). The summary is presented in Additional files 4 and 5.
Reduced representation bisulphite sequencing
Cap analysis of gene expression
Chromatin immunoprecipitation followed by DNA sequencing
Transcription start site
Transcription factor binding site
Remote dependency model
Position weight matrix
Spearman correlation coefficient between methylation and expression profiles
Discrete information content.
Ehrlich M, Gama-Sosa MA, Huang LH, Midgett RM, Kuo KC, McCune RA, Gehrke C: Amount and distribution of 5-methylcytosine in human DNA from different types of tissues of cells. Nucleic Acids Res. 1982, 10: 2709-2721. 10.1093/nar/10.8.2709.
Tucker KL: Methylated cytosine and the brain: a new base for neuroscience. Neuron. 2001, 30: 649-652. 10.1016/S0896-6273(01)00325-7.
Ramsahoye BH, Biniszkiewicz D, Lyko F, Clark V, Bird AP, Jaenisch R: Non-CpG methylation is prevalent in embryonic stem cells and may be mediated by DNA methyltransferase 3a. Proc Natl Acad Sci USA. 2000, 97: 5237-5242. 10.1073/pnas.97.10.5237.
Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM, Edsall L, Antosiewicz-Bourget J, Stewart R, Ruotti V, Millar AH, Thomson JA, Ren B, Ecker JR: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009, 462 (7271): 315-322. 10.1038/nature08514.
Laurent L, Wong E, Li G, Huynh T, Tsirigos A, Ong CT, Low HM, Kin Sung KW, Rigoutsos I, Loring J, Wei CL: Dynamic changes in the human methylome during differentiation. Genome Res. 2010, 20: 320-331. 10.1101/gr.101907.109.
Suzuki MM, Bird A: DNA methylation landscapes: provocative insights from epigenomics. Nat Rev Genet. 2008, 9: 465-476. 10.1038/nrg2341.
Harris RA, Wang T, Coarfa C, Nagarajan RP, Hong C, Downey SL, Johnson BE, Fouse SD, Delaney A, Zhao Y, Olshen A, Ballinger T, Zhou X, Forsberg KJ, Gu J, Echipare L, O'Geen H, Lister R, Pelizzola M, Xi Y, Epstein CB, Bernstein BE, Hawkins RD, Ren B, Chung WY, Gu H, Bock C, Gnirke A, Zhang MQ, Haussler D, et al: Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol. 2010, 28 (10): 1097-1105. 10.1038/nbt.1682.
Laird PW: Principles and challenges of genomewide DNA methylation analysis. Nat Rev Genet. 2010, 11: 191-203. 10.1038/nrg2732.
Clark SJ, Statham A, Stirzaker C, Molloy PL, Frommer M: DNA methylation: bisulphite modification and analysis. Nat Protoc. 2006, 1: 2353-2364. 10.1038/nprot.2006.324.
Barrera V, Peinado MA: Evaluation of single CpG sites as proxies of CpG island methylation states at the genome scale. Nucleic Acids Res. 2012, 40: 11490-11498. 10.1093/nar/gks928.
Furst RW, Kliem H, Meyer HH, Ulbrich SE: A differentially methylated single CpG-site is correlated with estrogen receptor alpha transcription. J Steroid Biochem Mol Biol. 2012, 130: 96-104. 10.1016/j.jsbmb.2012.01.009.
Schubeler D: Molecular biology. Epigenetic islands in a genetic ocean. Science. 2012, 338: 756-757. 10.1126/science.1227243.
Jaenisch R, Bird A: Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat Genet. 2003, 33 (Suppl): 245-254.
Borgel J, Guibert S, Li Y, Chiba H, Schubeler D, Sasaki H, Forne T, Weber M: Targets and dynamics of promoter DNA methylation during early mouse development. Nat Genet. 2010, 42: 1093-1100. 10.1038/ng.708.
Farthing CR, Ficz G, Ng RK, Chan CF, Andrews S, Dean W, Hemberger M, Reik W: Global mapping of DNA methylation in mouse promoters reveals epigenetic reprogramming of pluripotency genes. PLoS Genet. 2008, 4: e1000116-10.1371/journal.pgen.1000116.
Oda M, Yamagiwa A, Yamamoto S, Nakayama T, Tsumura A, Sasaki H, Nakao K, Li E, Okano M: DNA methylation regulates long-range gene silencing of an X-linked homeobox gene cluster in a lineage-specific manner. Genes Dev. 2006, 20: 3382-3394. 10.1101/gad.1470906.
Tomazou EM, Meissner A: Epigenetic regulation of pluripotency. Adv Exp Med Biol. 2010, 695: 26-40. 10.1007/978-1-4419-7037-4_3.
Christensen BC, Houseman EA, Marsit CJ, Zheng S, Wrensch MR, Wiemels JL, Nelson HH, Karagas MR, Padbury JF, Bueno R, Sugarbaker DJ, Yeh RF, Wiencke JK, Kelsey KT: Aging and environmental exposures alter tissue-specific DNA methylation dependent upon CpG island context. PLoS Genet. 2009, 5 (8): e1000602-10.1371/journal.pgen.1000602.
Rakyan VK, Down TA, Maslau S, Andrew T, Yang TP, Beyan H, Whittaker P, McCann OT, Finer S, Valdes AM, Leslie RD, Deloukas P, Spector TD: Human aging-associated DNA hypermethylation occurs preferentially at bivalent chromatin domains. Genome Res. 2010, 20 (4): 434-439. 10.1101/gr.103101.109.
Miller CA, Sweatt JD: Covalent modification of DNA regulates memory formation. Neuron. 2007, 53: 857-869. 10.1016/j.neuron.2007.02.022.
Jirtle RL, Skinner MK: Environmental epigenomics and disease susceptibility. Nat Rev Genet. 2007, 8: 253-262. 10.1038/nrg2045.
Fraga MF, Ballestar E, Paz MF, Ropero S, Setien F, Ballestar ML, Heine-Suner D, Cigudosa JC, Urioste M, Benitez J, Boix-Chornet M, Sanchez-Aguilera A, Ling C, Carlsson E, Poulsen P, Vaag A, Stephan Z, Spector TD, Wu YZ, Plass C, Esteller M: Epigenetic differences arise during the lifetime of monozygotic twins. Proc Natl Acad Sci USA. 2005, 102 (30): 10604-10609. 10.1073/pnas.0500398102.
Kucharski R, Maleszka J, Foret S, Maleszka R: Nutritional control of reproductive status in honeybees via DNA methylation. Science. 2008, 319: 1827-1830. 10.1126/science.1153069.
Esteller M, Fraga MF, Paz MF, Campo E, Colomer D, Novo FJ, Calasanz MJ, Galm O, Guo M, Benitez J, Herman JG: Cancer epigenetics and methylation. Science. 2002, 297: 1807-1808. discussion 1807–1808
Gilbert ER, Liu D: Epigenetics: the missing link to understanding beta-cell dysfunction in the pathogenesis of type 2 diabetes. Epigenetics. 2012, 7: 841-852. 10.4161/epi.21238.
Kwok JB: Role of epigenetics in Alzheimer’s and Parkinson’s disease. Epigenomics. 2010, 2: 671-682. 10.2217/epi.10.43.
Kangaspeska S, Stride B, Metivier R, Polycarpou-Schwarz M, Ibberson D, Carmouche RP, Benes V, Gannon F, Reid G: Transient cyclical methylation of promoter DNA. Nature. 2008, 452: 112-115. 10.1038/nature06640.
Metivier R, Gallais R, Tiffoche C, Le Peron C, Jurkowska RZ, Carmouche RP, Ibberson D, Barath P, Demay F, Reid G, Benes V, Jeltsch A, Gannon F, Salbert G: Cyclical DNA methylation of a transcriptionally active promoter. Nature. 2008, 452 (7183): 45-50. 10.1038/nature06544.
Walsh CP, Bestor TH: Cytosine methylation and mammalian development. Genes Dev. 1999, 13: 26-34. 10.1101/gad.13.1.26.
Shen L, Kondo Y, Guo Y, Zhang J, Zhang L, Ahmed S, Shu J, Chen X, Waterland RA, Issa JP: Genome-wide profiling of DNA methylation reveals a class of normally methylated CpG island promoters. PLoS Genet. 2007, 3: 2023-2036.
Hsieh CL: Dependence of transcriptional repression on CpG methylation density. Mol Cell Biol. 1994, 14: 5487-5494.
Jones PA, Takai D: The role of DNA methylation in mammalian epigenetics. Science. 2001, 293: 1068-1070. 10.1126/science.1063852.
Wolffe AP, Matzke MA: Epigenetics: regulation through repression. Science. 1999, 286: 481-486. 10.1126/science.286.5439.481.
Riggs AD: X inactivation, differentiation, and DNA methylation. Cytogenet Cell Genet. 1975, 14: 9-25. 10.1159/000130315.
Holliday R, Pugh JE: DNA modification mechanisms and gene activity during development. Science. 1975, 187: 226-232. 10.1126/science.1111098.
Bird AP: Functions for DNA methylation in vertebrates. Cold Spring Harb Symp Quant Biol. 1993, 58: 281-285. 10.1101/SQB.1993.058.01.033.
Bird AP: DNA methylation versus gene expression. J Embryol Exp Morphol. 1984, 83 (Suppl): 31-40.
Eckhardt F, Lewin J, Cortese R, Rakyan VK, Attwood J, Burger M, Burton J, Cox TV, Davies R, Down TA, Haefliger C, Horton R, Howe K, Jackson DK, Kunde J, Koenig C, Liddle J, Niblett D, Otto T, Pettett R, Seemann S, Thompson C, West T, Rogers J, Olek A, Berlin K, Beck S: DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet. 2006, 38 (12): 1378-1385. 10.1038/ng1909.
Weber M, Hellmann I, Stadler MB, Ramos L, Paabo S, Rebhan M, Schubeler D: Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat Genet. 2007, 39: 457-466. 10.1038/ng1990.
Medvedeva YA, Fridman MV, Oparina NJ, Malko DB, Ermakova EO, Kulakovskiy IV, Heinzel A, Makeev VJ: Intergenic, gene terminal, and intragenic CpG islands in the human genome. BMC Genomics. 2010, 11: 48-10.1186/1471-2164-11-48.
Boyes J, Bird A: Repression of genes by DNA methylation depends on CpG density and promoter strength: evidence for involvement of a methyl-CpG binding protein. EMBO J. 1992, 11: 327-333.
Pardo LM, Rizzu P, Francescatto M, Vitezic M, Leday GG, Sanchez JS, Khamis A, Takahashi H, van de Berg WD, Medvedeva YA, van de Wiel MA, Daub CO, Carninci P, Heutink P: Regional differences in gene expression and promoter usage in aged human brains. Neurobiol Aging. 2013
Lienert F, Wirbelauer C, Som I, Dean A, Mohn F, Schubeler D: Identification of genetic elements that autonomously determine DNA methylation states. Nat Genet. 2011, 43: 1091-1097. 10.1038/ng.946.
Rishi V, Bhattacharya P, Chatterjee R, Rozenberg J, Zhao J, Glass K, Fitzgerald P, Vinson C: CpG methylation of half-CRE sequences creates C/EBPalpha binding sites that activate some tissue-specific genes. Proc Natl Acad Sci USA. 2010, 107: 20311-20316. 10.1073/pnas.1008688107.
Tate PH, Bird AP: Effects of DNA methylation on DNA-binding proteins and gene expression. Curr Opin Genet Dev. 1993, 3: 226-231. 10.1016/0959-437X(93)90027-M.
Choy MK, Movassagh M, Goh HG, Bennett MR, Down TA, Foo RS: Genome-wide conserved consensus transcription factor binding motifs are hyper-methylated. BMC Genomics. 2010, 11: 519-10.1186/1471-2164-11-519.
Perini G, Diolaiti D, Porro A, Della Valle G: In vivo transcriptional regulation of N-Myc target genes is controlled by E-box methylation. Proc Natl Acad Sci USA. 2005, 102: 12117-12122. 10.1073/pnas.0409097102.
Kim J, Kollhoff A, Bergmann A, Stubbs L: Methylation-sensitive binding of transcription factor YY1 to an insulator sequence within the paternally expressed imprinted gene, Peg3. Hum Mol Genet. 2003, 12: 233-245. 10.1093/hmg/ddg028.
Mukhopadhyay R, Yu W, Whitehead J, Xu J, Lezcano M, Pack S, Kanduri C, Kanduri M, Ginjala V, Vostrov A, Quitschke W, Chernukhin I, Klenova E, Lobanenkov V, Ohlsson R: The binding sites for the chromatin insulator protein CTCF map to DNA methylation-free domains genome-wide. Genome Res. 2004, 14 (8): 1594-1602. 10.1101/gr.2408304.
Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B, Garg K, John S, Sandstrom R, Bates D, Boatman L, Canfield TK, Diegel M, Dunn D, Ebersol AK, Frum T, Giste E, Johnson AK, Johnson EM, Kutyavin T, Lajoie B, Lee BK, Lee K, London D, Lotakis D, Neph S, et al: The accessible chromatin landscape of the human genome. Nature. 2012, 489 (7414): 75-82. 10.1038/nature11232.
Straussman R, Nejman D, Roberts D, Steinfeld I, Blum B, Benvenisty N, Simon I, Yakhini Z, Cedar H: Developmental programming of CpG island methylation profiles in the human genome. Nat Struct Mol Biol. 2009, 16: 564-571. 10.1038/nsmb.1594.
Gebhard C, Benner C, Ehrich M, Schwarzfischer L, Schilling E, Klug M, Dietmaier W, Thiede C, Holler E, Andreesen R, Rehli M: General transcription factor binding at CpG islands in normal cells correlates with resistance to de novo DNA methylation in cancer cells. Cancer Res. 2010, 70: 1398-1407. 10.1158/0008-5472.CAN-09-3406.
Wang H, Maurano MT, Qu H, Varley KE, Gertz J, Pauli F, Lee K, Canfield T, Weaver M, Sandstrom R, Thurman RE, Kaul R, Myers RM, Stamatoyannopoulos JA: Widespread plasticity in CTCF occupancy linked to DNA methylation. Genome Res. 2012, 22 (9): 1680-1688. 10.1101/gr.136101.111.
Chatterjee R, Vinson C: CpG methylation recruits sequence specific transcription factors essential for tissue specific gene expression. Biochim Biophys Acta. 1819, 2012: 763-770.
Nan X, Ng HH, Johnson CA, Laherty CD, Turner BM, Eisenman RN, Bird A: Transcriptional repression by the methyl-CpG-binding protein MeCP2 involves a histone deacetylase complex. Nature. 1998, 393: 386-389. 10.1038/30764.
Feldman N, Gerson A, Fang J, Li E, Zhang Y, Shinkai Y, Cedar H, Bergman Y: G9a-mediated irreversible epigenetic inactivation of Oct-3/4 during early embryogenesis. Nat Cell Biol. 2006, 8: 188-194. 10.1038/ncb1353.
Tachibana M, Matsumura Y, Fukuda M, Kimura H, Shinkai Y: G9a/GLP complexes independently mediate H3K9 and DNA methylation to silence transcription. EMBO J. 2008, 27: 2681-2690. 10.1038/emboj.2008.192.
Strunnikova M, Schagdarsurengin U, Kehlen A, Garbe JC, Stampfer MR, Dammann R: Chromatin inactivation precedes de novo DNA methylation during the progressive epigenetic silencing of the RASSF1A promoter. Mol Cell Biol. 2005, 25: 3923-3933. 10.1128/MCB.25.10.3923-3933.2005.
Vire E, Brenner C, Deplus R, Blanchon L, Fraga M, Didelot C, Morey L, Van Eynde A, Bernard D, Vanderwinden JM, Bollen M, Esteller M, Di Croce L, de Launoit Y, Fuks F: The Polycomb group protein EZH2 directly controls DNA methylation. Nature. 2006, 439 (7078): 871-874.
Macleod D, Charlton J, Mullins J, Bird AP: Sp1 sites in the mouse aprt gene promoter are required to prevent methylation of the CpG island. Genes Dev. 1994, 8: 2282-2292. 10.1101/gad.8.19.2282.
Brandeis M, Frank D, Keshet I, Siegfried Z, Mendelsohn M, Nemes A, Temper V, Razin A, Cedar H: Sp1 elements protect a CpG island from de novo methylation. Nature. 1994, 371: 435-438. 10.1038/371435a0.
Mummaneni P, Yates P, Simpson J, Rose J, Turker MS: The primary function of a redundant Sp1 binding site in the mouse aprt gene promoter is to block epigenetic gene inactivation. Nucleic Acids Res. 1998, 26: 5163-5169. 10.1093/nar/26.22.5163.
Maurano MT, Wang H, Kutyavin T, Stamatoyannopoulos JA: Widespread site-dependent buffering of human regulatory polymorphism. PLoS Genet. 2012, 8: e1002599-10.1371/journal.pgen.1002599.
Lin IG, Tomzynski TJ, Ou Q, Hsieh CL: Modulation of DNA binding protein affinity directly affects target site demethylation. Mol Cell Biol. 2000, 20: 2343-2349. 10.1128/MCB.20.7.2343-2349.2000.
Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, Scholer A, van Nimwegen E, Wirbelauer C, Oakeley EJ, Gaidatzis D, Tiwari VK, Schubeler D: DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature. 2011, 480 (7378): 490-495.
Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, Epstein CB, Frietze S, Harrow J, Kaul R, Khatun J, Lajoie BR, Landt SG, Lee BK, Pauli F, Rosenbloom KR, Sabo P, Safi A, Sanyal A, Shoresh N, Simon JM, Song L, Trinklein ND, Altshuler RC, Birney E, Brown JB, Cheng C, Djebali S, Dong X, Dunham I, et al: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489 (7414): 57-74. 10.1038/nature11247.
Meissner A, Gnirke A, Bell GW, Ramsahoye B, Lander ES, Jaenisch R: Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res. 2005, 33: 5868-5877. 10.1093/nar/gki901.
Gu H, Smith ZD, Bock C, Boyle P, Gnirke A, Meissner A: Preparation of reduced representation bisulfite sequencing libraries for genome-scale DNA methylation profiling. Nat Protoc. 2011, 6: 468-481. 10.1038/nprot.2010.190.
Forrest ARR, Kawaji H, Rehli M, Baillie JK, de Hoon MJL, Haberle V, Lassmann T, Kulakovskiy IV, Lizio M, Itoh M, Andersson R, Mungall CJ, Meehan TF, Schmeier S, Bertin N, Jørgensen M, Dimont E, Arner E, Schmidl C, Schaefer U, Medvedeva YA, Plessy C, Vitezic M, Severin J, Semple CA, Ishizu Y, Young RS, Francescatto M, Alam I, Albanese D, et al: A promoter level mammalian expression atlas. Nature. 2014,http://dx.doi.org/10.1038/nature13182,
Salimullah M, Sakai M, Plessy C, Carninci P: NanoCAGE: a high-resolution technique to discover and interrogate cell transcriptomes. Cold Spring Harb Protoc. 2011, pdb.prot5559
Stein R, Razin A, Cedar H: In vitro methylation of the hamster adenine phosphoribosyltransferase gene inhibits its expression in mouse L cells. Proc Natl Acad Sci USA. 1982, 79: 3418-3422. 10.1073/pnas.79.11.3418.
Feinberg AP, Tycko B: The history of cancer epigenetics. Nat Rev Cancer. 2004, 4: 143-153. 10.1038/nrc1279.
Esteller M: Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet. 2007, 8: 286-298. 10.1038/nrg2005.
Ehrlich M: DNA hypomethylation in cancer cells. Epigenomics. 2009, 1: 239-259. 10.2217/epi.09.33.
Ehrlich M: DNA methylation in cancer: too much, but also too little. Oncogene. 2002, 21: 5400-5413. 10.1038/sj.onc.1205651.
Kanamori-Katayama M, Itoh M, Kawaji H, Lassmann T, Katayama S, Kojima M, Bertin N, Kaiho A, Ninomiya N, Daub CO, Carninci P, Forrest AR, Hayashizaki Y: Unamplified cap analysis of gene expression on a single-molecule sequencer. Genome Res. 2011, 21 (7): 1150-1159. 10.1101/gr.115469.110.
Meissner A, Mikkelsen TS, Gu H, Wernig M, Hanna J, Sivachenko A, Zhang X, Bernstein BE, Nusbaum C, Jaffe DB, Gnirke A, Jaenisch R, Lander ES: Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature. 2008, 454 (7205): 766-770.
Ball MP, Li JB, Gao Y, Lee JH, LeProust EM, Park IH, Xie B, Daley GQ, Church GM: Targeted and genome-scale strategies reveal gene-body methylation signatures in human cells. Nat Biotechnol. 2009, 27: 361-368. 10.1038/nbt.1533.
Jones PA: The DNA methylation paradox. Trends Genet. 1999, 15: 34-37. 10.1016/S0168-9525(98)01636-9.
Booth MJ, Branco MR, Ficz G, Oxley D, Krueger F, Reik W, Balasubramanian S: Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base resolution. Science. 2012, 336: 934-937. 10.1126/science.1220671.
Huang Y, Pastor WA, Shen Y, Tahiliani M, Liu DR, Rao A: The behaviour of 5-hydroxymethylcytosine in bisulfite sequencing. PLoS One. 2010, 5: e8888-10.1371/journal.pone.0008888.
Wang T, Pan Q, Lin L, Szulwach KE, Song CX, He C, Wu H, Warren ST, Jin P, Duan R, Li X: Genome-wide DNA hydroxymethylation changes are associated with neurodevelopmental genes in the developing human cerebellum. Hum Mol Genet. 2012, 21: 5500-5510. 10.1093/hmg/dds394.
Stroud H, Feng S, Morey Kinney S, Pradhan S, Jacobsen SE: 5-Hydroxymethylcytosine is associated with enhancers and gene bodies in human embryonic stem cells. Genome Biol. 2011, 12: R54-10.1186/gb-2011-12-6-r54.
Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, Barnes I, Bignell A, Boychenko V, Hunt T, Kay M, Mukherjee G, Rajan J, Despacio-Reyes G, Saunders G, Steward C, Harte R, Lin M, Howald C, Tanzer A, Derrien T, Chrast J, Walters N, Balasubramanian S, Pei B, Tress M, et al: GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012, 22 (9): 1760-1774. 10.1101/gr.135350.111.
Siddharthan R: Dinucleotide weight matrices for predicting transcription factor binding sites: generalizing the position weight matrix. PLoS One. 2010, 5: e9722-10.1371/journal.pone.0009722.
Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods. 2008, 5: 829-834. 10.1038/nmeth.1246.
Daniel JM, Spring CM, Crawford HC, Reynolds AB, Baig A: The p120(ctn)-binding partner Kaiso is a bi-modal DNA-binding protein that recognizes both a sequence-specific consensus and methylated CpG dinucleotides. Nucleic Acids Res. 2002, 30: 2911-2919. 10.1093/nar/gkf398.
Consortium U: Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013, 41: D43-47.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11: R106-10.1186/gb-2010-11-10-r106.
Kulakovskiy IV, Medvedeva YA, Schaefer U, Kasianov AS, Vorontsov IE, Bajic VB, Makeev VJ: HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. 2013, 41: D195-202. 10.1093/nar/gks1089.
Touzet H, Varre JS: Efficient and accurate P-value computation for Position Weight Matrices. Algorithms Mol Biol. 2007, 2: 15-10.1186/1748-7188-2-15.
Bajic VB, Seah SH, Chong A, Krishnan SP, Koh JL, Brusic V: Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates. J Mol Graph Model. 2003, 21: 323-332. 10.1016/S1093-3263(02)00179-1.
Kulakovskiy IV, Favorov AV, Makeev VJ: Motif discovery and motif finding from genome-mapped DNase footprint data. Bioinformatics. 2009, 25: 2318-2325. 10.1093/bioinformatics/btp434.
Schneider TD, Stormo GD, Gold L, Ehrenfeucht A: Information content of binding sites on nucleotide sequences. J Mol Biol. 1986, 188: 415-431. 10.1016/0022-2836(86)90165-8.
Kulakovskiy IV, Makeev VJ: Discovery of DNA motifs recognized by transcription factors through integration of different experimental sources. Biophysics. 2009, 54: 667-674. 10.1134/S0006350909060013.
Kulakovskii IV, Makeev V: Integration of data obtained by different experimental methods to determine the motifs in DNA sequences recognized by transcription-regulating factors. Biofizika. 2009, 54: 965-974.
This work was supported by a Research Grant from the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) to the RIKEN Center for Life Science Technologies; by a research grant for RIKEN Omics Science Center from MEXT to Yoshihide Hayashizaki; and by a research grant from MEXT through RIKEN Preventive Medicine and Diagnosis Innovation Program to YH. We would like to thank all members of the FANTOM5 consortium for contributing to the generation of samples and analysis of the dataset and to thank GeNAS for data production. We thank Prof. A.S. Kondrashov for computational facilities provided to IVK under Russian Ministry of Science and Education grant [11.G34.31.0008]. We also would like to thank Virginia Unkefer for careful editing of the manuscript.
YAM, AK, WBA, MdSIB and VBB were supported by the base research funds of VBB at KAUST. IVK was supported by the Dynasty Foundation, the Russian Foundation for Basic Research, 12-04-32082-mol_a and Russian Ministry of Science and Education State Contract [14.512.11.0092].
RIKEN Omics Science Center ceased to exist as of April 1st, 2013, due to RIKEN reorganization.
The authors declare that they have no competing interests.
YAM designed the computational experiments, selected and preprocessed the data, performed statistical analysis and wrote the manuscript. AK performed most of the data analysis. WBA and MdSIB provided RDM models and tools for threshold estimation and mapping. TL was responsible for tag mapping. HK managed the data handling. ARRF was responsible for FANTOM5 management and its concept. IVK performed part of the analysis and contributed to the design of the experiments and writing of the manuscript. VBB contributed to the design of the experiments and writing of the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1:Contains the total number of analyzed CpGs as well as the count of CpG demonstrating SCC M/E above certain significance levels. These results were obtained using only the 36 normal cell types.(DOCX 11 KB)
Additional file 2:Contains RDM-based predicted TFBSs based on 50 cell samples; tables containing information about the names of the TFBSs models used, their function in regulation (activator or repressor), the number of cytosines in our study (with any SCC M/E ) overlapping with TFBSs, the number of CpG “traffic lights” overlapping with TFBSs for each TF, the expected number of such overlaps and the statistical significance of the over-/underrepresentation of TFBS in CpG “traffic light” positions. Consistent information is given for positions with positive SCC M/E .(XLSX 82 KB)
Additional file 3:Contains RDM-based predicted TFBSs based on the 36 normal cell samples; tables containing information about the names of the TFBSs models used, their function in regulation (activator or repressor), the number of cytosines in our study (with any SCC M/E ) overlapping with TFBSs, the number of CpG “traffic lights” overlapping with TFBSs for each TF, the expected number of such overlaps and the statistical significance of the over-/underrepresentation of TFBS in CpG “traffic light” positions. Consistent information is given for positions with positive SCC M/E .(XLSX 73 KB)
Additional file 4:Contains PWM-based predicted TFBSs based on 50 cell samples; tables containing information about the names of the TFBSs models used, their function in regulation (activator or repressor), the number of cytosines in our study (with any SCC M/E ) overlapping with TFBSs, the number of CpG “traffic lights” overlapping with TFBSs for each TF, the expected number of such overlaps and the statistical significance of the over-/underrepresentation of TFBS in CpG “traffic light” positions. Consistent information is given for positions with positive SCC M/E .(XLSX 90 KB)
Additional file 5:Contains RDM-based predicted TFBS for CTCF supported with ChIP-seq peak data; tables containing information about the names of the TFBSs models used, their function in regulation (activator or repressor), the number of cytosines in our study (with any SCC M/E ) overlapping with TFBSs, the number of CpG “traffic lights” overlapping with TFBSs for each TF, the expected number of such overlaps and the statistical significance of the over-/underrepresentation of TFBS in CpG “traffic light” positions. Consistent information is given for positions with positive SCC M/E .(XLSX 11 KB)
Additional file 6:Contains a figure showing the distribution of the observed to expected ratio of CpG “traffic lights” overlapping with TFBSs of activators, repressors and multifunctional TFs. TFBSs were predicted using RDM. (PDF 34 KB)
Additional file 7:Contains RDM-based predicted TFBSs. Tables containing the positions within TFBSs with high, medium and low IC, the number of cytosines in our study (with any SCC M/E ) overlapping with TFBS, and the number of CpG “traffic lights” overlapping with TFBS for each TF. Consistent information is given for positions with positive SCC M/E .(XLSX 38 KB)
Additional file 8:Contains PWM-based predicted TFBSs. Tables containing the positions within TFBSs with high, medium and low IC, the number of cytosines in our study (with any SCC M/E ) overlapping with TFBS, and the number of CpG “traffic lights” overlapping with TFBS for each TF. Consistent information is given for positions with positive SCC M/E .(XLSX 40 KB)
Additional file 9:Contains a table listing FANTOM5 samples (cell types) matching 50 ENCODE samples. We grouped them into 50 classes of identical or similar biological cell types. The ENCODE sample description is also provided. Normal/cancer cell types (36/14) are marked in the last column. (XLSX 35 KB)
About this article
Cite this article
Medvedeva, Y.A., Khamis, A.M., Kulakovskiy, I.V. et al. Effects of cytosine methylation on transcription factor binding sites. BMC Genomics 15, 119 (2014). https://doi.org/10.1186/1471-2164-15-119
- DNA methylation
- Transcription factor binding sites
- Transcriptional regulation
- CpG “traffic lights”
- Computational biology