To establish cChIP-seq (Fig. 1a), we optimized sonication of a limited number of crosslinked cells down to 30,000 cells using the Covaris LE220 ultrasonicator (Additional file 1: Figure S1). For each test, cells were counted prior to chromatin isolation, as were nuclei prior to sonication to ensure cell and chromatin amounts in each cChIP experiment. We estimated the amount of recombinant carrier histone based on potentially marked histones (see Methods). We have applied cChIP-seq to three informative and commonly investigated histone modifications: H3K4me3, H3K4me1 and H3K27me3 (Fig. 1b). We have compared our data to existing data from either the ENCODE consortium or Roadmap Epigenomics Consortium (REC). We provide a robust, yet simple method for ChIP-seq of 10,000 cells, which should be applicable to almost any histone modification and compatible with most working ChIP protocols.
As an initial optimization, we performed cChIP-seq for H3K4me3 in K562 cells (Fig. 1b), reasoning that it is the most robust mark to ChIP and should perform best at a small scale. After chromatin sonication, we mixed 10,000, 5000, 500 and 100 whole-cell equivalents with recombinant histone H3 with a lysine 4 trimethylation modification (recH3K4me3) and incubated with magnetic beads pre-bound with antibody against H3K4me3. We then proceeded to perform ChIP-seq as described previously [8, 9], but with the minor modification of generating libraries using PCR amplification performed in two sequential rounds of limited cycles to help reduce amplification-based background (see Methods). We generated a total of ~ 150 million monoclonal mapped reads so that all libraries were likely sequenced to saturation. We assessed correlations between replicates and across cell amounts, followed by a comparison to ENCODE consortium data as a standard for the field [15]. In order to account for differences that arise from variation in lab-to-lab practices, we compared our data to two different replicated datasets from ENCODE: Broad and SYDH. In order to avoid bias due to differences in the computational analysis, we obtained raw data for each datasets (two replicates for each assay to match our data) and analyzed all datasets with our pipeline under the same settings (see Methods).
cChIP-seq replicates for 10,000 cells correlated as well as replicates for each ENCODE group (Fig. 2a and b). Pearson’s correlation coefficients calculated across cChIP-seq, Broad and SYDH showed that at 10,000 cells cChIP-seq performed well relative to ENCODE data with average coefficients of 0.90 with respect to Broad and 0.7 with respect to SYDH. The average Pearson’s correlation coefficient between Broad and SYDH was ~0.80 (Fig. 2a, Additional file 1: Figure S2a). For lower cell numbers we did not obtain data of sufficient quality (Additional file 1: Figure S2a and b). Although enrichment for H3K4me3 at promoter regions could be observed when using 5000 cells (Additional file 1: Figure S2a), the Pearson’s correlation coefficient with respect to ENCODE data was less than 0.60 (Additional file 1: Figure S3b). This led us to conclude that below 10,000 cells cChIP-seq was sub-optimal. Given our goal to develop a protocol with minimal optimizations per mark or per starting amount of cells, we focused on 10,000 cells for the remaining validation of cChIP-seq. In addition, visual inspection across the genome of H3K4me3 cChIP-seq data using 10,000 cells, demonstrates that our data is highly similar to ENCODE data (Fig. 2b) further supporting the reliability of our method.
We compared cChIP-seq data to ENCODE data with respect to peak recovery. We used MACS to call peaks on each replicate of cChIP-seq and ENCODE data [16]. We called a similar number of peaks across the three datasets, and each replicate recovered a similar overlap for each group (Additional file 1: Figure S3c). Overlap of cChIP-seq peaks recovered approximately 80 % of the peaks called in either replicate. Similar peak overlaps were found for ENCODE replicates (Broad: 74 % average replicate overlap; and SYDH: 88 % average replicate overlap). Next, we compared the peaks called on merged replicates across the three datasets. When comparing cChIP-seq peaks to both Broad and SYDH, 71 % of the cChIP-seq peaks were found in both ENCODE datasets (Fig. 2c). Asking the reverse, cChIP-seq recovered 60 % of the Broad peaks and 67 % of the SYDH peaks, performing as well as the ENCODE datasets when compared to each other (capturing 63 and 80 % of the other’s peaks; Fig. 2c). Only 17 % of cChIP-seq peaks were unique to cChIP-seq, similar to SYDH (17.8 %) and lower than Broad (29.6 % unique). We also observed a similar enrichment of the signal across datasets at Gencode transcription start sites (TSS), regardless of peak calls (Fig. 2d). To confirm this observation we determined the overlap of peaks across the three datasets at TSS. We observed that over 90 % of the TSSs captured by each datasets were also enriched for H3K4me3 in the other two datasets. These results indicate that cChIP-seq is a robust method to profile H3K4me3 marks using three orders of magnitude fewer cells with respect to what was used for generating ENCODE data. Based on the three-way comparison across groups, we conclude that differences between cChIP-seq and ENCODE data are likely to due more to expected lab-to-lab variability rather than operating at a lower scale (Fig. 2c). Furthermore, the use of modified carrier histones results in a simple method that does not require any upfront optimization to scale down ChIP reaction conditions.
We next sought to apply our method to other histone modifications as well as test the performance on a different cell line. We first applied cChIP-seq to profile H3K4me1, a highly cell-type specific mark [7] associated with enhancer regions [17]. H3K4me1 enrichment in K562 was highly comparable to ENCODE data (Fig. 1b). Pearson’s correlation across cChIP-seq, Broad and SYDH confirmed that our method was highly reproducible (r = 0.96) and correlated with both ENCODE datasets (r = 0.86 with respect to Broad; r = 0.76 with respect to SYDH) as well as ENCODE datasets correlated with each other (r = 0.73) (Fig. 3a, Additional file 1: Figure S2b). Next, we compared peaks called in the two replicates per each datasets and observed that correlation across replicates was again comparable to ENCODE replicates (Fig. 3a, Additional file 1: Figure S4a). Moreover, cChIP-seq identified the same numbers of putative enhancers (over 40,000 when considering peaks shared between replicates) as predicted by either ENCODE dataset in this cell line (Additional file 1: Figure S4a). After replicates were merged, each datasets shared an average of 47 % of the peaks (29,938) with both the other two datasets (Fig. 3b). This drop in percent overlap compared to H3K4me3 may be due to challenges calling broader H3K4me1 peaks. When asking how many of the ENCODE peaks cChIP-seq identified, we observed that our method recovered 81 % of the peaks called in both Broad and SYDH (out of 36,778). Again we observed that the cChIP-seq data performed well with respect to either ENCODE dataset, and as well as ENCODE datasets performed with respect to each other confirming that differences between cChIP-seq and ENCODE data are due to inter-laboratory variability rather than the operating scale.
Next, we compared the signal intensity of the three datasets in a 4 kb window centered at all H3K4me1 enriched regions called in the three datasets (Fig. 3c). The three datasets performed similarly at all peaks regardless of peak calls. We did observe a higher baseline signal across our data with respect to ENCODE and a similar trend was observed for this mark in H1 hESC (see below. Additional file 1: Figure S5e and f); however, the signal-to-noise ratio is more than adequate for accurately calling peaks. To further confirm our results on H3K4me1, we compared the signal intensity of cChIP-seq and ENCODE data at K562 enhancers previously predicted by RFECS, a Random-Forest based algorithm recently developed to identify enhancers based on several histone modifications and p300 localization [18]. All datasets showed some degree of enrichment at RFECS enhancer locations (Additional file 1: Figure S4b) as well as a similar fraction of RFECS enhancers captured (on average, 57 %) (Additional file 1: Figure S4c). Overall, data generated for H3K4me1 on three orders of magnitude fewer cells by cChIP-seq performed in a highly comparable manner with respect to ENCODE data (Fig. 3d).
To ensure that cChIP-seq has applicability for various cell types, we performed cChIP-seq on 10,000 cells for H3K4me1 in H1 hESCs and compared these data to those previously generated by the REC on a few million cells [8, 9]. Both visual inspection of the global enrichment and the Pearson’s correlation between replicates (r = 0.96), indicate that cChIP-seq replicates are highly reproducible (Additional file 1: Figure S5a and b). Furthermore, cChIP-seq data correlate well with REC data (r = 0.76) (Additional file 1: Figure S5b). We observed that 63 % of the REC H3K4me1 peak calls on merged replicates were recovered by cChIP-seq (Additional file 1: Figure S5c). However, more than half of peaks called on cChIP-seq data appeared to be unique. We reasoned that at least a fraction of those unique peaks could have been captured by either of the REC ChIP-seq replicates. When we specifically looked for overlap between the unique cChIP-seq and either of the REC replicates (Additional file 1: Figure S5d), we found than 11 % of the unique cChIP-seq peaks (4878) overlapped peaks called in REC replicate 1 that were not called on the merged replicates. Similarly, an additional 19 % of the unique cChIP-seq (8430) were captured by REC replicate 2 that were not called on the merged replicates. Considering the total number of REC peaks captured by cChIP-seq, we concluded that our data correlated well with data previously generated on few million cells. This is supported by comparing the signal intensity of both datasets in a 4 kb window centered at H3K4me1 enriched regions for REC ChIP-seq (Additional file 1: Figure S5e). We also compared enrichment in both datasets on RFECS-predicted enhancers in H1 cells [18]. While both data showed enrichment at these sites, we observed higher signal in cChIP-seq compared to data ChIP-seq both at and around the peaks (Additional file 1: Figure S5f). Finally, we sought to measure how many RFECS-predicted enhancers [18] were captured by cChIP-seq as compared to REC data. We observed that cChIP-seq H3K4me1 peaks overlapped 61 % (33,996 out of 55,382) of RFECS enhancers, while REC data captured 45 % (24,831) of RFECS enhancers. Furthermore, cChIP-seq captured all the RFECS enhancers captured by ChIP-seq. Altogether, we have shown that cChIP-seq method successfully scaled our previous ChIP-seq protocol [8, 9] by two orders of magnitude for H3K4me1 in H1 hESC line.
Finally we tested cChIP-seq for the Polycomb repressive complexes-associated modification H3K27me3 in the K562 cell line. We generated H3K27me3 data on 10,000 cells using cChIP-seq and compared our data to both Broad and SYDH ENCODE datasets (Fig. 1b and Fig. 4a). Pearson’s correlation coefficients calculated across the three datasets indicated (Fig. 4b, Additional file 1: Figure S2c) that our replicates were highly reproducible (r = 0.97). While the correlation with the Broad dataset was good (r = 0.73), our data correlated less well with SYDH dataset (r = 0.46). The Pearson’s correlation coefficient between the two ENCODE datasets was 0.66 (Fig. 4b). We used ChromaBlocks, an algorithm previously developed to determine broad domains of histone modifications such as H3K27me3 [8], and called domains on merged replicated. When we merged all the domains called in cChIP-seq and ENCODE data (Fig. 4c), we identified 4743 broad H3K27me3 enriched regions common to all three datasets, accounting for 72 % of all enriched regions in cChIP-seq data. Similarly, 88 % of Broad H3K27me3 enriched regions and 59 % of SYDH regions were shared with both the other two datasets. We next looked at the global distribution of the signal in a 10 kb window around those protein-coding TSS that we found enriched for H3K27me3 in all the three datasets (Fig. 4d). Although the cChIP-seq and Broad data showed a better global correlation (Fig. 4b), the cChIP-seq and SYDH data showed a greater degree of enrichment around TSSs. Overall, these results indicate that cChIP-seq successfully generated H3K27me3ChIP-seq data using 10,000 cells.