The number of peaks varies between cell-types
Recent genome-wide analyses of TF binding have shown that for a few individual TFs, binding sites can vary substantially between cell-types[29–31]. But how cell-type specific is TF regulation as measured by ChIP-seq data of different commonly expressed transcription factors? To address this question, we used a robust peak detection method[36] on publicly available ChIP-seq data[34] for 13 TFs from two cancer cell lines, K562 and HeLa-S3. (See Additional file1: Peaks and SNPs for predicted peak regions). Seven of the factors (CTCF, E2F4, E2F6, GABP, Max, c-Fos, and c-Myc) are Pol II factors with specific DNA binding preferences, one (TAF1) is a basal Pol II factor, four (BDP1, BRF1, BRF2, and TFIIIC-110) are general Pol III factors, and one (RPC155) is a Pol III subunit that is enriched at Pol III transcription start sites and has strongly correlated peak heights between K562 and HeLa-S3[35]. The resulting peaks revealed substantial differences in the number of putative binding sites for different TFs and cell types (Figure1A). The total number of peaks was similar in both cell types; in sum HeLa-S3 had 11% more predicted peaks than K562, but the peaks were divided unevenly between different TFs in the two cell types. Variable peak counts for different TFs could be expected as the TFs under study serve diverse regulatory roles, but the variability between cell types for the same TF was high.
Nearly half of the K562 peaks (47%) and more than a third of the HeLa-S3 peaks (34%) were associated with promoter regions; specifically, the 2000bp upstream and 200bp downstream of RefSeq genes’ transcription start sites plus their first introns (Figure1B). We included the first intron in our promoter definition, as binding sites for several TFs are enriched within first introns[37] and such binding sites have been shown to have important regulatory roles for specific genes (for example, the cystic fibrosis transmembrane conductance regulator gene (CFTR)[38]). Despite the variability in the overall number of peaks per TF and cell type, the relative number of peaks mapping to promoter regions for the same TFs were similar between the two cell types (p-value p = 0.34 on a two-sided Wilcoxon signed rank test).
Only a third of peaks are found in both cell types
Given the high variability in putative TFBS, we wondered to what extent these binding sites were cell-type specific. By comparing the genomic loci of peaks across cell types, we found that 33% of K562 and 30% of HeLa-S3 peaks overlapped with at least one peak of the same TF in the other cell type. Taking into account that the relative overlap is limited by the cell type having the fewest peaks for the TF, we found that the potential overlap varied between 16% (TAF1) and 93% (BRF2) (Figure1C; median 46%). Although the potential overlap was slightly higher for TFs with large differences in peak counts between the cell lines, this apparent trend was not significant (Spearman coefficient 0.20, two-sided p-value p = 0.52).
An extreme example of seemingly different binding between cell types is the cell-cycle associated factor E2F4, which had 8780 peaks in K562 but only 631 (7%) in HeLa-S3. A large fraction (81%) of the relatively few E2F4 peaks in HeLa-S3 overlapped with K562 peaks. Others using ChIP-chip have previously found E2F4 to have between 500 and 700 target genes with little cell-type specific binding[1, 30]. This could suggest that a majority of the E2F4 peaks in K562 are not functional binding sites, but we cannot exclude that our data are missing several true binding sites in HeLa-S3.
Peaks in regions associated with tissue-independent regulation have higher overlap than have other regions
Because of the unexpected differences between the two cell lines for the E2F4 peaks, we wondered whether overlapping peaks were enriched in genomic regions known to have limited tissue-specific activity. A previous study has indicated that TFs primarily mediate cell-type specific regulation through enhancers located far from core promoter regions[9]. In accordance with this, we found that peaks mapping to promoter regions had a significantly higher degree of overlap across cell types than had peaks that did not map to promoter regions (Figure2A; p = 0.0052 for K562 and p = 0.0034 for HeLa-S3, one-sided Wilcoxon signed-rank test). Promoter regions were again defined as the region 2000 bp upstream to 200 bp downstream of transcription start sites of RefSeq genes, together with the complete first intron of the genes. Consistent with Pol III factors preferentially associating with regions near functional Pol II promoters[35] three of the four general Pol III TFs showed higher overlap in these mostly Pol II promoters than in the non-promoter regions. Similarly, the subset of Pol II TFs also showed significantly higher overlap in the promoter regions than in other genomic regions (p = 0.012 for K562 and p = 0.020 for HeLa-S3).
Housekeeping genes tend to maintain similar expression levels across cell types and are therefore also likely to have similar regulation within different cell types. Indeed, the overlap was greater for peaks mapping to promoters of housekeeping genes[39] compared with promoters of other genes (Figure2B; p = 0.0093 for K562 and p = 0.0049 for HeLa-S3). Housekeeping genes frequently have high CpG promoters, whereas tissue-specific genes tend to have low CpG promoters[40, 41]. Accordingly, peaks in high CpG promoters had significantly higher overlap than had peaks in low CpG promoters (Figure2C; p = 2.4∗10−4 for K562 and p = 4.9∗10−4 for HeLa-S3). Thus, as expected, peaks mapping to tissue-independent regulatory regions showed a higher degree of overlap between cell types than did peaks mapping to other genomic regions.
Cell-type-specific peaks map to different genomic regions in the cell types
Some of the “missing” overlap could perhaps be explained by alternative transcription factor binding sites, where a binding site active in only one cell type could have another nearby site active in the other cell type. To test this possibility, we counted the number of overlaps when allowing increasingly larger regions surrounding the original peak regions (Figure2D). We found no striking increase in the number of overlaps, however—even when looking at regions larger than 10,000bp. Consequently, most of the cell-type unique peaks are found at completely different loci in the two cell types.
Differences in TF expression can partly explain differences in peak counts
The variability in peak numbers could suggest that some TFs have more binding activity in one cell type compared to the other. As one important determinant of binding activity could be TF availability, we looked at how the TF mRNA expression levels, as measured by RNA sequencing, correlated with the number of peaks. We found some, but insignificant, correlation between TF expression and peak numbers (Pearson coefficient r = 0.46, two-sided t-test p-value p = 0.21 in K562; r = 0.54, p = 0.11 in HeLa-S3). Using the differences in expression and peak count between the two cell types gave similar results (Figure2E; r = 0.47, p = 0.20), but removing the expression outliers c-Fos and c-Myc gave a large and significant correlation (r = 0.89, p = 7.2∗10−3). This suggested that the difference in peak count could partly be related to difference in mRNA expression levels of the TFs.
Of course, the mRNA level does not necessarily alone determine the TF’s actual protein-level[42] and binding activity could be influenced by other variables, such as for example post-translational modifications, chromatin differences, and co-factor availability.
High peaks are less cell-type specific and more associated with active gene regulation than are low peaks
The height of a ChIP-seq peak indicates the TF’s binding activity or strength at the site. Moreover, peak height can give a measure of the likelihood for the TFBS to be active in a given cellular context, as spurious binding events occurring within a fraction of the cells will give low ChIP-seq peaks. We therefore wondered whether cell-type specific binding or binding within regions consistent with regulatory activity, such as promoters or open chromatin, varied with increasing peak height (Figure3 and Additional file2: Figure S1 and Additional file3: Figure S2).
For 10 out of 13 TFs in K562, there were significantly more overlaps among higher peaks compared with lower peaks (Figure3). The trend was particularly clear for the very highest peaks, save for c-Fos and BRF2—the latter having only 15 peaks in K562. This could mean that cell-type specific binding is generally weaker than the binding in common binding sites, but the observation likely also reflects that spurious binding (stochastic noise) is more prevalent among low peaks.
High peaks were generally more associated with regulatory regions such as promoters (8 TFs significant) and CpG-rich regions (10 TFs significant). The TFs that showed little or no association were not expected to be associated with Pol II promoters; CTCF acts primarily in intergenic and intronic regions[43]), whereas BDP1, BRF1, BRF2, RPC155, and TFIIIC-110 are Pol III factors. The Pol III factors BDP1 and TFIIIC-110 and the Pol III subunit RPC155 did indeed show significant correlation between peak height and overlap with Pol III promoters (Additional file4: Figure S3; see Methods for description of Pol III promoter regions). Also, nearly all (79 of 81) BRF1 peaks in K562 were associated with Pol III promoter regions, whereas too few BRF2 peaks mapped to the Pol III promoters for a pattern to be evident. High peaks were also associated with increased chromatin accessibility as measured by sensitivity to DNase (10 TFs significant) and a higher enrichment in the histone mark for transcriptionally active promoters H3K4me3 (10 TFs significant), in accordance with previous findings[26].
Even though others have demonstrated poor conservation of binding sites across species for some TFs[30, 32], we noted that higher peaks generally resided in regions of higher sequence conservation for many TFs (6 and 4 TFs significant in K562 and HeLa-S3, respectively). The main exception was CTCF, which showed a significant negative correlation between peak height and sequence conservation.
High peaks were also generally more associated with sequence motifs (peaks for 5 of 7 TFs in K562 and 7 of 7 TFs in HeLa-S3 were positively correlated with PWM scores). However, c-Fos peak heights in K562 correlated negatively with PWM scores for the AP-1 motif, which c-Fos recognizes as a dimer with co-factor Jun. One possible explanation could be that competition with Jun homodimers and Jun-ARF2 heterodimers for the AP-1 motif may have pushed c-Fos to bind more in non-canonical motif regions[44]. Alternatively, as higher c-Fos K562 peaks did not have significantly increasing overlap with HeLa-S3 peaks, but did have significantly increasing association with promoters, CpG-rich regions, open chromatin, and sequence conservation, c-Fos or its antibody may have cross-reacted with a different TF in K562.
Using discriminative motif discovery on the 10% highest and 10% lowest c-Fos peaks (see Methods), we found that the highest c-Fos peaks in K562 had motif sequences such as CCAAT and CGCGG, which resemble binding profiles for NF-Y and parts of AP-2, but we did not find any AP-1 binding motif or variant thereof. When using sequences from another TF (CTCF) as negative data and running motif discovery with high and low c-Fos peaks separately as positive data, we again could not find the AP-1 motif in the high peaks (Additional file5: Figure S4C), whereas the top motif discovered on the lowest peaks was the canonical AP-1 consensus motif (Additional file5: Figure S4B). We therefore strongly suspect that the highest c-Fos peaks in K562 are the result of some experimental artifact.
As an additional measure of regulatory activity, we looked at how peaks from all TFs clustered together, under the assumption that clustered peaks indicate a region of high regulatory activity[45]. For all TFs save BRF2, higher peaks were significantly correlated with a higher number of TFs in the cluster. Finally, as ChIP-sequencing has a bias for GC-rich fragments[12], we did a separate analysis on a GC-controlled subset of the data. This GC-balanced subset (Additional file6: Figure S5) had the same trends as the original full dataset (Figure3). Consequently, any GC-content bias in the ChIP-seq data could not explain the correlations between peak height and different regulatory data seen in Figure3. Together, these results suggest that high peaks associated with strong and conserved binding sites tend to be cell-type independent or common for many cellular contexts, whereas low peaks associated with weak binding tend to show larger variation between cell types.
Differences in chromatin state suggest cell-type specific regulation
High TF peaks both had a higher degree of overlap between cell types and were more associated with open chromatin and active transcription, compared to low peaks. However, it was still unclear to what extent cell-type specific differences in high and low peaks were related to differences in chromatin structure or to spurious binding events. We therefore compared chromatin data for K562 and HeLa-S3 in the regions that contained the 30% highest overlapping and non-overlapping peaks and repeated the comparisons for the 30% lowest peaks.
In general, chromatin was more accessible and had a higher signal for the active histone mark H3K4me3 in the cell type where the cell-type specific peak regions were found, compared to the same regions in the other cell type (Figures4 and5). Differences in chromatin accessibility, as measured by DNase sensitivity, were significant on a Kolmogorov-Smirnov test for 10 of 13 TFs in K562 and for 9 TFs in HeLa-S3. Differences in H3K4me3 signal were significant for 10 TFs in K562 and 9 in HeLa-S3.
For the lowest 30% of peaks we expected to see smaller differences between the chromatin state in peak and non-peak regions, as a smaller difference would be consistent with increased stochastic noise due to more spurious binding within the peak data. Indeed, although the differences in DNase sensitivity and H3K4me3 signal were more consistent for the weak than for the strong peaks (11 TFs significant for DNase both in K562 and HeLa-S3; 11 TFs in K562 and 13 TFs in HeLa-S3 for H3K4me3; Additional file7: Figure S6 and Additional file8: Figure S7), the median difference in chromatin accessibility and H3K4me3 signal was greater for higher peaks (Additional file9: Figure S8). It is therefore likely that a larger fraction of the low than of the high cell-type specific peaks is a result of spurious binding events or stochastic noise in the data.
Different genotypes can contribute to binding differences
Another factor that could contribute to differences in binding between cell types is single nucleotide polymorphisms (SNPs), as K562 and HeLa-S3 are established from two different individuals with different genotypes. We aligned available DNA sequencing data for K562 and HeLa-S3 to the reference genome (hg18) and by using SNPs from the HapMap database and the aligned sequence reads, we calculated the most likely genotype for each cell type for the SNPs in peak regions (see Methods).
In total, 12% of the peaks harboured at least one SNP. Relative to peak count, more overlapping peaks had SNPs (13.9%) than K562-specific (11.8%) or HeLa-S3-specific (12.1%) peaks. But a relatively higher number of SNPs in cell-type specific peaks were homozygous with different alleles in the cell types (18%, 17.8%, and 10% for K562-specific, HeLa-S3-specific, and overlapping peak SNPs, respectively).
We reasoned that the SNPs that were homozygous for one allele in one cell type and homozygous for another allele in the other cell type could affect TF binding if the SNP resided in a high-scoring TF motif (Figure6A). Indeed, the peaks that were cell-type specific and contained such SNPs had significantly higher motif scores in the cell type having the peak compared with the other cell type (Figure6B; p = 1.9∗10−4 and p = 1.1∗10−5 on one-sided paired t-tests for K562 and HeLa-specific peaks, respectively). Among 73 K562-specific peaks that fitted the SNP and motif criteria stated above, 49 had higher motif scores in K562 and 24 had higher scores in HeLa-S3. Among 82 HeLa-S3-specific peaks, 60 had higher motif scores in HeLa-S3 and 22 in K562.
Genotype-related differences were also evident in peaks common between the two cell types. Specifically, for the SNPs that had different homozygous genotypes and were located in high-scoring motif regions, the difference in motif score between the genotypes correlated positively with the difference in peak height between the cell lines (r = 0.22, p = 0.09). This correlation was especially strong and significant for the 10% highest peaks, both for the peaks that had SNPs in high scoring motif regions (r = 0.59, p = 9.9∗10−3) and for the peaks that had SNPs in any location within the peak region (r = 0.43, p = 4.5∗10−3). Thus, differences in genotypes do affect TF binding and can explain some of the binding site differences between cell lines.
Peak height and clustering are the most important factors for cell-type specificity
Our results so far indicated that the cell-type specificity we observed among ChIP-seq peaks was related to a number of different factors that included peak height and locus, sequence properties of the peak region, and the region’s cell-type specific chromatin context. But which factors are the most important indicators of cell-type specific TF binding and to what extent do combinations of factors determine cell-type specific binding?
To address these questions, we used a machine learning approach. Specifically, we created support vector machine (SVM) classifiers[46] to separate cell-type specific peaks from peaks that are found in multiple cell types. The SVM’s feature set for each peak included both cell-type specific information such as chromatin state, and cell-type independent information such as sequence conservation; see Table1 for a full list of features. Using peaks from K562 as a reference and comparing overlap and cell-type specific data with the corresponding regions in HeLa-S3, we trained for each TF one SVM classifier to predict the K562 peaks that overlapped with HeLa-S3 peaks. We then used 10-fold stratified cross-validation to estimate the SVM classifiers’ predictive performance in terms of their receiver operating characteristic (ROC) curve and ROC-score (Additional file10: Figure S9A).
To assess which SVM features were most important for correct classification, we removed a single feature group at a time from the SVM and measured how this affected the SVM’s cross-validation performance (Figure7A). Although the results were somewhat different for each combination of TF and feature, one result was clear: peak clustering and peak height were the only features that had a strong positive impact on the prediction for the majority of the TFs. Peak clustering was particularly important for classifying E2F4 and E2F6 peaks, whereas information on peak height was important to most TFs. As for individual TFs, the two TFs with the fewest peaks in their datasets, BDP1 and BRF1, had larger score variation than the other TFs. As a result, the TFs appeared to benefit from removing some features. Finally, removing information about the distance to the closest transcription start site (TSS) had an expected negative impact on TAF1, which is a core component of the Pol II basal transcription factor TFIID[47].
Given the major importance of peak height and clustering for classifier performance, and in light of previous results (Figure3), we reasoned that including these features made it difficult to assess any importance of other features. We also suspected that some of the features could contain redundant information, as removing single features had relatively little impact on performance. We therefore removed peak height and clustering from the dataset and grouped features into the following three groups: cell-type specific features (chromatin accessibility and histone modification status), promoter and sequence-associated features (promoter, TSS distance, CpG content), and conservation feature (phyloP). The results (Additional file11: Figure S10) showed that cell-type specific and promoter-associated features were important for, respectively, most (all except E2F6) and some (CTCF, E2F6, c-Fos, and c-Myc) of the sequence-dependent TFs, whereas only promoter-associated features were important for some of the general TFs (TFIIIC-110 and RPC155). Sequence conservation, however, was not an important factor for identifying cell-type specific peaks in this dataset.
Peak height and clustering were the dominant factors in explaining cell-type specificity and could therefore confound the effects of other features when analyzing the total set of all peaks. We therefore did a separate analysis of the highest and most clustered peaks, as differences between cell types for these peaks should be due to other factors than peak height and clustering. Specifically, we first identified the subset of peaks that were among the 10% highest and that also occurred among the top 20% largest peak clusters; that is, in clusters containing peaks from six or more TFs. Second, we did a feature removal analysis on the five TFs that had more than 100 remaining peaks in both the overlapping and the cell-type specific datasets. Although none of the features on this reduced dataset showed the same consistent pattern as the height and cluster features did on the complete dataset, five of the eight features (Length, TSS, H3K4me3, H3K27me3, and CpG) had a positive impact on the prediction for the majority of the five TFs (Figure7B). Moreover, only one feature (Prom) had a negative impact on the prediction for the majority of the TFs. These results confirmed that additional factors beside peak height and clustering, including cell-type specific data on histone modifications, are important for identifying cell-type specific TF binding.
Models for identifying cell-type specific peaks are consistent between cell lines
The feature analysis suggested that even though peak height alone was a critical factor for identifying cell-type specific peaks, other features could also help predictions. To further assess the effect of adding additional features, we therefore developed two reference models, which we refer to as Height and HPP. These models were also based on SVMs, but contained fewer features; Height used only peak height as its single feature and HPP used peak height, phyloP conservation score, and PWM score (Figure8A; see Additional file10: Figure S9A for the ROC curves).
Consistent with the feature analyses, the SVM model that included all features (termed SVM in Figure8), was significantly better than the two reference models (p = 4.9∗10−4and p = 7.8∗10−3 for Height and HPP, respectively, on a Wilcoxon signed-rank test). The HPP model was also significantly better than Height (p = 7.8∗10−3). The Height model only had a higher ROC-score than the full SVM model on the BRF1 dataset. This is likely due to a low number of training examples for the more complicated SVM method, as there were only 81 BRF1 peaks in K562. Consistent with previous results (see c-Fos peak overlap in Figure3), peak height alone was insufficient to predict overlapping c-Fos peaks.
So far we had tested how the models performed when classifying new unseen peaks from the same cell types (K562 and HeLa-S3) used during training, but would the models generalize to a different cell type? To address this question, we used a stratified 10-fold cross-validation procedure where we first trained the models on a subset of K562 peaks that were common and cell-type specific compared with HeLa-S3 peaks. We then tested the models on a different subset of K562 peaks that were common and cell-type specific compared to peaks in the GM12878 ENCODE cell line; that is, we trained the models on K562 and HeLa-S3 data and tested the models on K562 and GM12878 data. Importantly, none of the peaks in these train and test sets overlapped and the cell-type specific features (Chromatin, H3K4me3, and H3K27me3) were derived from two different cell types in the train (HeLa-S3) and test (GM12878) sets. We trained and tested the models on the five TFs with ChIP-seq data available in all three cell types.
Whereas the Height and HPP models had similar performance in the new cell type (p = 0.22 and p = 0.43, respectively; one-sided Wilcoxon signed rank test), the full SVM model had a significant drop in performance (p = 0.031); see Figure8B and Additional file10: Figure S9B. This drop resulted in the full SVM model having lower ROC-scores than the Height model had on three of the five TFs. The largest differences between the SVM and Height models were on the Max and c-Fos data, suggesting that the SVM model was over-fitted on these particular TFs. Indeed for c-Fos, the different features in the SVM were more inconsistent between the three cell types than for the other TFs (7 of 14 feature trends differed, compared with 5 of 12 for TAF1 and at most 2 of 14 differing trends for the three other TFs; compare Figure3, Additional file2: Figure S1, and Additional file3: Figure S2).
These inconsistencies between the cell types also strongly affected the Height and HPP models on the c-Fos data. As already mentioned, Height could not predict c-Fos peaks that overlapped between K562 and HeLa-S3, whereas the HPP model trained on K562 and HeLa-S3 data could not predict peaks that overlapped in K562 and GM12878. Of the five TFs, c-Fos had the least consistent binding sites between the three cell types (Additional file12: Figure S11). c-Fos is less specific in recognizing the AP-1 site compared to dimer-partners Jun and ATF2[44], so some of the differences could be because c-Fos used different co-factors in the three cells. This explanation is consistent with c-Fos having three different trends in peak heights and PWM scores for the three cell lines. Moreover, Jun expression level was four times as high in HeLa-S3 as in K562, and the strong c-Fos peaks in HeLa-S3 had good AP-1 consensus motifs compared to weaker peaks (Additional file2: Figure S1). However, as previously mentioned, we cannot exclude that c-Fos or its antibody had cross-reacted with a different TF in K562. Apart from c-Fos, however, the models could consistently separate cell-type specific from cell-type independent TFBS.