Comparative analysis of 4C-Seq data generated from enzyme-based and sonication-based methods
© Gao et al.; licensee BioMed Central Ltd. 2013
Received: 8 October 2012
Accepted: 20 May 2013
Published: 24 May 2013
Skip to main content
© Gao et al.; licensee BioMed Central Ltd. 2013
Received: 8 October 2012
Accepted: 20 May 2013
Published: 24 May 2013
Circular chromosome conformation capture, when coupled with next-generation sequencing (4C-Seq), can be used to identify genome-wide interaction of a given locus (a “bait” sequence) with all of its interacting partners. Conventional 4C approaches used restriction enzyme digestion to fragment chromatin, and recently sonication approach was also applied for this purpose. However, bioinformatics pipelines for analyzing sonication-based 4C-Seq data are not well developed. In addition, data consistency as well as similarity between the two methods has not been explored previously. Here we present a comparative analysis of 4C-Seq data generated by both methods, using an enhancer element of Pou5f1 gene in mouse embryonic stem (ES) cells.
From biological replicates, we found good correlation (r>0.6) for inter-chromosomal interactions identified in either enzyme or sonication method. Compared to enzyme approach, sonication method generated less distal intra-chromosomal interactions, possibly due to the difference in chromatin fragmentation. From all mapped interactions, we further applied statistical models to identify enriched interacting regions. Interestingly, data generated from the two methods showed 30% overlap of the reproducible interacting regions. The interacting sites in the reproducible regions from both methods are similarly enriched with active histone marks. In addition, the interacting sites identified from sonication-based data are enriched with ChIP-Seq signals of transcription factors Oct4, Klf4, Esrrb, Tcfcp2i1, and Zfx that are critical for reprogramming and pluripotency.
Both enzyme-based and sonication-based 4C-Seq methods are valuable tools to explore long-range chromosomal interactions. Due to the nature of sonication-based method, correlation analysis of the 4C interactions with transcription factor binding should be more straightforward.
Chromosomal organization in the nucleus has been gradually recognized as an important contributor to gene regulation and genome function. In the last decade, chromosome conformation capture (3C) has been developed as a high-throughput assay to explore long-range chromatin-chromatin interactions in vivo. Technologies derived from 3C, such as Hi-C [2, 3], 5C  and TCC , explore global chromatin interactions. Other methods like ChIP-3C [6–10] and ChIA-PET (chromatin interaction analysis by paired-end tag sequencing) detect chromatin-chromatin interactions mediated by a specific DNA binding protein [11–13]. A particular derivative of 3C method named circular chromosome conformation capture (4C) enables de novo detection of all interacting partners of a known genomic region, such as differentially methylated H19 imprinting control region . Similar to 3C-based global mapping of chromatin-chromatin interactions, 4C follows the same concept of 1) crosslinking of protein-DNA complexes in the cell nucleus to capture chromosome conformation and 2) proximity ligation of physically interacting chromatin fragments. However, instead of using immuno-precipitation to capture the ligated junction DNA pieces in ChIP-3C and ChIA-PET, 4C utilizes PCR reactions to enrich genomic regions interacting with a known “bait” region. The PCR products generated from the 4C technology can be subsequently examined by different approaches: they can be cloned for Sanger sequencing , or hybridized to a microarray [14, 16]. Given the rapid development of next-generation sequencing, it is now feasible to directly interrogate the PCR products from the 4C technology (hereafter referred to as “4C-Seq”). This technique enables genome-wide mapping of the chromatins interacting with the “bait” in greater resolution and precision.
In most of the published 4C studies to date [14, 16–20], 4C library preparation typically starts with formaldehyde crosslinking of chromosomes in vivo, followed by restriction enzyme digestion to fragment chromosomes. After proximity ligation in diluted condition, chromatin proteins are digested with proteinase K and formaldehyde crosslinks are reversed. Finally “bait”-containing circular DNA molecules generated from the ligation are amplified using bait-specific primers in nested PCR reactions. Although enzyme digestion-based 4C method has limitations such as noisy/weak chromatin associations [21, 22], low chromatin accessibility-dependent digestion efficiency as well as uneven distribution of restriction digestion sites across the genome, it should be acknowledged that recent advance in enzyme-based 4C method greatly increased data resolution and robustness by using 4-bp cutter instead of 6-bp cutter in fragmentation . Compared to enzyme digestion, sonication is less accessibility-dependent and preferentially breaks crosslinked chromosomes at the edge of protein binding sites [24, 25]. Thus sonication-based approach provides an alternative choice for 4C-Seq studies , which is potentially more straightforward on exploring bound transcription factor(s) that mediate the interactome.
Unlike enzyme digestion-based 4C method [17–19], sonication-based library presents a data analysis challenge which requires a different set of analytical approaches. The major reason is that each restriction enzyme has a set of known breakage sites in the genome, so that researchers typically align all reads against flanking sequences of this set of restriction enzyme site, and the restriction sites are the exact ligation sites between two interacting regions . In contrast, sonication method has generally no preference of sequence motifs at breakage sites. Without knowing the exact ligation sites a priori, the pipeline for data processing is anticipated to be different from enzyme digestion-based method. So far, the bioinformatics pipelines have not been well developed to process sonication-based 4C data generated by high-throughput short-read sequencers, such as Illumina Hi-Seq. In the current study, we describe a sonication-based 4C-Seq protocol that we applied to explore Pou5f1 enhancer interactome in mouse ES cells. The Pou5f1 enhancer has been known to mediate expression of its own gene product – Oct4, a key reprogramming factor . Thus exploration of this enhancer interactome will provide better understanding of Pou5f1-related regulatory network. To compare different 4C protocols, we also performed a parallel study using an enzyme-based method described in . We analyzed consistency of processed 4C-Seq data generated from biological replicates, applied statistical analysis to identify enriched interacting regions, compared the reproducible enriched interacting regions identified from enzyme and sonication methods, and explored epigenetic features enriched in the 4C interactome.
Based on our experimental protocol, the short sequencing reads from Illumina Hi-Seq platform should theoretically fall into four categories: 1) reads at the 4C bait locus, 2) reads at the bait interacting regions, 3) reads that spans ligation junctions between bait locus and its interacting region, as well as 4) noises caused by contamination of genomic or circular DNA that are not amplified in PCR reactions (Figure 1A). We therefore evaluated different strategies to identify these four types of reads.
We first attempted to separately map both forward and reverse reads (91 bp) of the paired-end data to the reference genome (mm9) using Burrows-Wheeler Aligner (BWA ), to identify the reads that are completely aligned to the “bait” locus or other genomic regions. For sequencing data generated from the two biological replicates, the mapped reads account for ~77% of the total reads (Additional file 1: Table S1), with most of the mapped reads being uniquely mapped. With this mapping strategy, >99% of the uniquely mapped reads are within the bait locus, suggesting the presence of many proximal interactions or self-ligations in the data. Only 0.2% to 0.6% of the uniquely mapped reads are mapped to distal genomic regions, that is, the majority of them should fall into Category 2 described above, though it is possible that some of them may originate from genomic DNA described in Category 4. We also note that unmapped reads in the biological replicates account for ~23% of total reads, which may correspond to ligation junctions that cannot be mapped to the reference genome, that is, Category 3 above. Given the limited amount of data supporting Category 2, we believe that this mapping strategy is not optimal for identifying bait interacting regions.
We next evaluated an end-tag mapping strategy [2, 11], generally applied in 3C-based studies to identify Category 3 reads that are mosaic of the bait and its interacting regions. A similar strategy was also previously used in a ChIA-PET study , with 20-bp end tags. Here we define “bait region” as a ~1 kb region, which includes 500 bp extension from the locations of the 2nd set of forward and reverse PCR primers (Figure 1). We extracted 20-bp end tags from both forward and reverse sequencing reads and aligned them to the reference genome assembly separately using BWA . The generated forward and reverse alignment files were merged together using SAMtools . Junction reads are identified, when one end-tag uniquely maps to the “bait” and the other end-tag maps to genomic locations > 300 bp away on the same chromosome (intra-chromosomal interactions) or to a different chromosome (inter-chromosomal interactions). The rationale to choose 300 bp is that our sonication approach generates small DNA pieces with an average size of 200 bp for sequencing, so end-tags that are > 300 bp away should mostly be junction reads. We next classified junction reads as proximal junction reads and distal junction reads. Proximal junction reads have two end tags mapped on the same chromosome with genomic distance between the tags between 300 bp and 10 kb. Distal junction reads are either two tags on the same chromosome with distance greater than 10 kb or two tags on different chromosomes. Proximal junction reads account for ~90% of total junction reads identified (Additional file 1: Table S2), with the distribution of relative genomic distance between the two ends following a continuous decay, starting from 300 bp to 2 kb, similar to the self-ligation events observed in a ChIA-PET study . Predominant proximal ligation may reflect disruption of weak chromatin-chromatin interactions under the shearing force, thus facilitating self-ligation events. Since no interactions were identified within the distance range from 2 kb to 10 kb in both BR1 and BR2 biological replicate data, we used 10 kb distance cutoff to distinguish proximal vs. distal intra-chromosomal interactions.
To explore distal chromatin-chromatin interactions, we processed distal junction reads for further analysis. Tags that were within 100 bp range on their genomic locations were considered as PCR products from a single ligation event and merged as one unique distal interacting site, given that the 4C library DNA was fragmented before sequencing. Unique distal interacting sites supported by only one read were removed from our analysis as they likely represent background noises. In total, using 20 bp end-tag mapping, we identified 5,705 and 4,368 filtered unique interacting sites respectively from two biological replicate data generated from sonication-based 4C method. In the following sections, we will discuss 4C sequencing data in the context of data reproducibility, effect of sequencing depth, statistical models for identifying enriched interacting regions, comparison of reproducible interacting regions identified in both enzyme and sonication methods, epigenetic histone features surrounding the interacting sites within the reproducible regions, and transcription factors enriched around sonication generated interacting sites.
As a high-throughput assay, 4C-Seq revealed thousands of sites interacting with one bait region. However, it is unlikely that all the interactions identified are biologically significant, and many of them probably represent random collision between two genomic fragments in 3D space. To identify regions that are frequently associated with the bait region other than random collision, we applied statistical models to analyze interacting sites within each chromosome. We used a permutation-based false discovery rate (FDR) procedure to choose significantly enriched interacting regions. For enzyme-based 4C data, a z-score was assigned based on the number of 4C interacting sites per 500 HindIII sites . For sonication-based data, each interacting site was assigned a z-score based on the nearby interactions observed within ±1 Mb distance range (see Methods for details). FDR was calculated by random permutation of the data 100 times, and the cutoff value of 5% was used to select positive sites. Positive sites and the nearby interaction sites (±1 Mb range) were grouped together as enriched interacting regions. Overlapping enriched regions were further merged together. The statistical models applied here aim to identify enriched interacting regions from the background, similar to the concept used in ChIP-Seq peak calling.
In summary, we presented 4C-Seq data of an enhancer element in mouse ES cells, generated by two different methods: enzyme-based method and sonication-based method. Our data showed consistency of the data generated by both methods for biological replicate samples. For inter-chromosomal interactions, both methods have similar reproducibility of enriched-interacting regions; however, for intra-chromosomal interactions, enzyme-based method showed more frequent distal interactions and higher reproducibility. Both methods revealed that histone modifications related to gene activation were enriched in the interactomes. In addition, sonication-based data uncovered several key pluripotency genes enriched around the interaction sites. Thus, we conclude that 1) Both enzyme-based and sonication-based 4C-Seq technique are very useful tools for mapping long-range chromatin-chromatin interactions; and 2) 4C-Seq data together with ChIP-Seq data can help us elucidate molecular events surrounding a particular regulatory region in 3D space.
Mouse embryonic stem (ES) cell line E14 was grown in the culture dishes coated with 0.1% gelatin. A growth medium with Glasgow Minimum Essential Medium (GMEM) supplemented with 15% fetal bovine serum (FBS), 100 nM nonessential amino acids, 1% sodium pyruvate, 200 mM glutamate, 1% penicillin streptomycin, 50 uM b-mercaptoethanol and 10 ng/mL LIF was replaced every 24 hours to support ES cell growth. Immediately before 4C library preparation, 10 million cells were cross-linked with 1% fresh formaldehyde in the cell culture dishes, lifted and treated with Triton X100 buffer (0.25% Triton X100, 10 mM EDTA pH=8.0, 10 mM Tris–HCl pH=8.0, 100 mM NaCl, 1x protease inhibitor cocktail) to extract chromatins.
For sonication-based 4C library preparation, the isolated chromatin pellets were re-suspended in SDS lysis buffer (1% SDS, 5 mM EDTA pH=8.0, 50 mM Tris–HCl pH=8.0, 1x protease inhibitor cocktail) and sonicated to an average size of 500 bp. The diluted chromatin fragments were blunt-end repaired and ligated with T4 ligase for 24 hours at 4°C, followed by reverse crosslinking at 65°C for 20 hours with proteinase K. The purified DNA served as PCR template for 4C library construction. After nested PCR amplification for library construction, the purified PCR products (majority with size ranging from 500 bp to 1 kb) were further sonicated to small DNA fragments with an average size of 200 bp for sequencing. The Illumina HiSeq2000 Sequencer was used to perform paired-end sequencing with 90 bp read length.
For enzyme-based 4C library preparation, two rounds of enzyme digestions were carried out using HindIII and DpnII restriction enzymes respectively. The experimental procedure strictly followed the published protocol . The bar-coded DNA libraries were subject to single-end sequencing with 50 bp read length using the Illumina HiSeq2000.
For sonication-based 4C-Seq data, Burrows-Wheeler Aligner (BWA)  with the “samse” option was employed for aligning paired-end data separately to the reference genome assembly mm9. Alignment files for both ends were further merged using SAMtools , followed by a PERL script to select only uniquely aligned read pairs (BWA mapping score ≥1), where one end aligned to the bait region and the other end aligned to other regions. Finally, tags within a 100 bp range were merged as a unique interacting site using BEDTools . Singlet interacting sites were considered as background noise and removed.
For enzyme-based 4C-Seq data, the sequencing reads with 5’ end aligned to the forward inverse PCR primer sequence were selected. The rest part of the selected reads (including HindIII sites) was mapped to the mm9 assembly using BWA to locate ligation sites in the genome. The mapped ligated HindIII sites were further matched to a reduced genome with the locations of all HindIII sites included.
in which pW = μW/lW (μW is the expected number of interacting sites in window w on chromosome W).
Statistical significance was further assigned to each interacting site by a FDR-based approach. Briefly, we randomly permutated calculated z-score data for every chromosome 100 times, and chose interacting sites with a false discovery rate (FDR) ≤ 5% or FDR ≤ 20% as positively interacting sites for inter-chromosomal or intra-chromosomal interactions. FDR for each site was calculated by counting the number of randomly permutated Z-scores that are above experimentally determined Z-score. All the interacting sites within ±1 Mb range of a positively interacting site were clustered as an enriched interacting domain. Overlapping interacting domains were further merged together.
The ChIP-Seq peak data of histone variants for mouse ES cells were retrieved from the ENCODE database (http://genome.ucsc.edu/ENCODE). The ChIP-Seq raw read files of 15 DNA-binding proteins were downloaded from the GEO database (accession number GSE11431).
Statistical analysis was executed and plotted using the R software suite (http://www.r-project.org/). The conversion of genomic coordinates between different genome assemblies was done by liftover software tool (http://www.genome.ucsc.edu/).
The sequencing data has been deposited to the GEO database (accession number GSE43776, GSE45418).
The study is supported by departmental start-up funds from the Zilkha Neurogenetic Institute (K.W.), NIH grants R01 HG006465 (K.W.) and 5R01NS067213-02 (W. L.). We thank all members in the Lu lab and Wang lab for helpful comments and suggestions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.