GeSICA: Genome segmentation from intra-chromosomal associations
© Liu et al.; licensee BioMed Central Ltd. 2012
Received: 25 August 2011
Accepted: 4 May 2012
Published: 4 May 2012
Skip to main content
© Liu et al.; licensee BioMed Central Ltd. 2012
Received: 25 August 2011
Accepted: 4 May 2012
Published: 4 May 2012
Various aspects of genome organization have been explored based on data from distinct technologies, including histone modification ChIP-Seq, 3C, and its derivatives. Recently developed Hi-C techniques enable the genome wide mapping of DNA interactomes, thereby providing the opportunity to study genome organization in detail, but these methods also pose challenges in methodology development.
We developed Genome Segmentation from Intra Chromosomal Associations, or GeSICA, to explore genome organization and applied the method to Hi-C data in human GM06990 and K562 cells. GeSICA calculates a simple logged ratio to efficiently segment the human genome into regions with two distinct states that correspond to rich and poor functional element states. Inside the rich regions, Markov Clustering was subsequently applied to segregate the regions into more detailed clusters. The binding sites of the insulator, cohesion, and transcription complexes are enriched in the boundaries between neighboring clusters, indicating that inferred clusters may have fine organizational features.
Our study presents a novel analysis method, known as GeSICA, which gives insight into genome organization based on Hi-C data. GeSICA is open source and freely available at: http://web.tongji.edu.cn/~zhanglab/GeSICA/
Upon studying the function of the eukaryotic nucleus, genome organization can be used to modulate the interpretation of functional information encoded by the primary DNA sequence [1, 2]. It has been suggested that different organizational units reside in distinct chromatin environments and thereby contribute to genomic functional diversity [3–5]. Over the past several years, genome organization has been explored based on data using a variety of technologies. One type of approaches is based on genomic data, such as histone modification (ChIP-Seq) or chromatin components (DamID), to segment the whole genome into elaborate organizational units (called as states or domains) with the computational frameworks such as Hidden Markov Models or Bayesian networks [6–8]. These inferred organizational units were found to be associated with different regulatory elements, and therefore, distinct biological functions . Another type of approaches, however, provided a probably more straightforward perspective. Dekker et al. pioneered a method called Chromosome Conformation Capture (3C)  to examine the physical and spatial interactions between specific loci. With 3C, researchers can directly detect the higher-order DNA loops, which at least partially elucidate the structural basis of certain organizational units with specific functions [11–14]. Nonetheless, the applications of 3C and its derivatives require pre-selected loci, which limit more global insights into genome organization .
Recently, a technology called Hi-C, a novel derivative of 3C coupled with massively parallel pair-ended sequencing, has been used to generate an unbiased genome-wide mapping of the DNA interactome . From the analysis of Hi-C data, Botta et al. discovered that strong long-range genomic interactions could be maintained through the activity of the CCCTC-binding factor (CTCF) . Another group demonstrated that distal genomic rearrangements in early replication domains are enriched with DNA interactions. As Hi-C technology monitors higher-order DNA looping at the genome scale, this technology provides the opportunity to study the genome organization and also poses the challenges in the development of analytical methods. Although Lieberman-Aiden et al. employed Principal Component Analysis to segregate the whole genome into two compartments based on Hi-C data , attempts to explore the more detailed organization from Hi-C data are still lacking.
In this study, we thus propose a two-step strategy, titled Genome Segmentation from Intra-Chromosomal Associations (GeSICA), to investigate genome organization based on Hi-C data. We applied the method to Hi-C data in both the GM06990 and K562 cell lines. In the first step, GeSICA calculates a simple logged ratio to categorize the entire human genome into two different states. Regions in one of the states are significantly enriched with active genes and transcription factor binding sites (indicated as "plus states"), whereas regions in the other state are relatively less active (indicated as "minus states"). In the second step, we further segregated the plus-state regions into more detailed clusters by employing a Markov Clustering algorithm. These clusters are characterized by a relatively higher probability of DNA interactions inside rather than across clusters . The insulator CTCF and one subunit of cohesin, namely, Rad21, were observed to be preferentially located in the boundaries between neighboring clusters, as were the proteins and histone marks related to transcription activities, including RNA polymerase II (Pol II), transcription initiation factor TFIID subunit 1 (Taf1) and H3K79me2. Taken together, these clues imply that the inferred clusters may achieve a finer and more detailed level in describing the features of genome organization.
To extend our exploration of the properties of inferred plus and minus states, we employed dozens of transcription factor binding profiles and histone modification profiles to examine their distributions in genomic regions with different states (we define a region as a series of continuous bins with the same state). Compared with regions with the minus state, those regions with the plus state were dominantly enriched with a plethora of important transcription factors and histone modifications (all available through the ENCODE Project from UCSC Genome Browser ) in the GM06990 cell line. For 27 of the 35 transcription factors and histone marks, including P300, Sp1 and H3K4me3, over 60% of their ChIP-Seq peaks were located in regions with the plus state (Figure 2B). For each transcription factor, we also calculated the fold-enrichments of peak numbers in regions with the plus state to those with the minus state. Most were larger than 1.5, except for Zzz3 and H3K27me3. H3K27me3 is a typical repressive histone mark  and Zzz3 is the subunit of the Ada-Two-A-containing complex, which has been shown to be responsible for regulating the activity of non-histone targets and controlling mitotic progression, while accompanied by high levels of chromatin compaction . In contrast to the plus state, regions with the minus state appeared to behave as functional deserts with a small fraction of transcription factor bindings or histone marks present. Altogether this evidence is consistent with the degree of chromatin openness between regions with plus and minus states. Similar results were obtained in the human K562 cell line ( Additional file 1: Figure S 2Additional file 1: Figure S 1B, Additional file 1: S 1D). These observations indicate that the regions with the plus state dominantly contain more interested information due to the fecundity of transcription factor binding and histone marks in the relatively open chromatin. Thus, the detailed organization of regions with the plus state became the focus of our subsequent analysis.
In a previous study, Lieberman-Aiden et al. employed principal component analysis (PCA) on the correlation matrix of DNA interactions to segregate the human genome into two compartments . The value of the first principal component (PC) from their approach could also be taken to reflect the degree of chromatin openness ( Additional file 1: Figure S 3). Thus, the genome can be segmented into regions characterized as one of the two states according to the sign of first PC: which could also be called plus and minus state separately. In the GM06990 cell line, the segmentation results of the two approaches were largely similar, and 75.2% of genomic bins were inferred to have the same state (Figure 1B). Although both approaches achieved similar results, their basic assumptions are quite different. The approach of Lieberman-Aiden et al. was based on the assumption that the interaction vectors of genomic bins (with 100Kb or 1 Mb as bin-size) are more similar within one compartment than across two different compartments. As mentioned previously, however, our approach was based on the assumption that short-range random DNA interactions would be easier to detect in relatively open chromatin environments (Figure 1A). Overall, our approach provided a new perspective by which to interpret different states in the human genome from Hi-C data, which exhibits a performance similar to that of PCA.
To explore the organizational features of the inferred clusters, we next assessed the distribution of transcription factors binding and histone marks in the boundaries between adjacent clusters. The genomic bins inside clusters were adopted as the control group. To ensure that each cluster had at least one bin not in the cluster boundary, we excluded all the clusters failing to meet the criteria in the following analysis. Among all of the transcription factor binding and histone marks available in the GM06990 cell line (through the ENCODE Project using the UCSC Genome Browser) , the binding sites of insulator CTCF were significantly enriched in the boundaries (fold around 1.38; adjusted p-value < 10-41). As CTCF has been shown to isolate long-range enhancers by looping DNA into higher-order structures that consequently maintain genomic structure [24–26], the results suggest that inferred clusters might be potential organizational units with independent structure and function, as demarcated by the insulator CTCF (Figure 3B). Similar results were obtained in human K562 cell line ( Additional file 1: Figure S 7A). Previous studies proposed that CTCF, together with cohesin, serves as an insulator in certain genomic regions [27–31]. The binding sites of Rad21, a subunit of cohesin, were only enriched in the cluster boundaries in K562 cell line. As Rad21 has twice as many binding sites in the GM06990 cell line as in the K562 cell line, we further examined the enrichment of the co–binding sites of CTCF and Rad21 in cluster boundaries. We found that CTCF-Rad21 co-binding sites were significantly enriched in the cluster boundaries of both cell lines (Figure 3B Additional file 1: Figure S 7A).
In the GM06990 cell line, the transcriptional start sites (TSS) of genes were significantly enriched in the boundaries of inferred clusters (fold 1.48, binomial test p-value < 2.2x10-16). In addition, the level of gene expression in the cluster boundaries was also higher than those within clusters. The top 25% of highly expressed genes had the highest fold-enrichment of gene number on boundaries inside clusters (1.74) compared with the middle 50% of genes (1.44) and the bottom 25% of genes (1.35) (Figure 3C Additional file 1: Figure S 7B). This result agrees with the one suggested in , which suggested the boundary functions of active TSSs. Consistent with the above results, several transcription process-related profiles were also enriched in the boundaries, including the ChIP-seq peaks of Pol II, Taf1, and H3K79me2  (Figure 3B Additional file 1: Figure S 7A). Similar results were also observed in the K562 cell line ( Additional file 1: Figure S 7A-B). These observations indicated that the inferred cluster boundaries might be the potential spots that harbor hypothesized transcription factory compounded with transcription-related factors, and also suggested the potential insulator-like role of transcription factory or Pol II in humans, which may be similar to those reported in Drosophila [34, 35].
To capture the characteristics of dynamics of detailed organization, we compared the inferred clusters in the plus-state regions of the GM06990 and K562 cell lines. For each cell line, every pair of adjacent bins in the inferred clusters was assigned a category to indicate one of the following two types of relationships: two bins belonging to the same clusters (Same) and to different clusters (Diff). In the GM06990 cell line, only 20% of the pairs in the Diff category remained Diff in K562, while approximately 50% in the Diff category became pairs in the Same category. The remaining pairs contained at least one bin with the minus state in K562 (indicated as the Minus category). In contrast, more than half of the pairs of adjacent bins in the Same category stayed Same in K562 (Figure 4B). These observations imply that the boundaries between inferred clusters are notably dynamic.
To investigate whether some important functional elements were enriched in the boundaries between adjacent clusters in a cell-type-specific manner, we classified the boundaries in the GM06990 cell line into two types: common boundaries (those that were also cluster boundaries in K562 cell line) and GM-specific boundaries (those within clusters or with the minus state in K562). In total, 204 bins were identified as common boundaries, and 938 bins were determined to be GM-specific boundaries. We found that the binding sites of CTCF and Pol II in the GM06990 cell line were more enriched in the common boundaries than in the GM-specific ones (Figure 4C). A similar trend was observed for the top 25% highly expressed genes in contrast to the bottom 25% genes (Figure 4C). The above phenomenon was observed regardless of the cell line ( Additional file 1: Figure S 8).
Next, we examined the relationship between the differential expression of genes and the dynamics of detailed organizations. Genomic bins with plus states in both cell lines were ranked by the proportion of differentially expressed genes between the GM06990 and K562 cell lines, while the list of differentially expressed genes was calculated using Limma software package . Each genomic bin with the plus state in both cell lines belongs to two inferred clusters in the two studied cell lines, and a Jaccard index was introduced to evaluate the similarity between the two clusters. Here we selected the top 10% and bottom 10% ranked bins. As shown in Additional file 1: Figure S 9, there was a trend that the higher the proportion of differentially expressed genes, the smaller the Jaccard Index, i.e., the less similarity between the clusters to which bins in the two cell lines belonged. The results suggested that the differentially expressed genes were potentially associated with the dynamics of genome organization around them.
Despite some interesting insights into the inferred genome organization that were raised during this investigation, our approach still faces analytical challenges and limitations. First, due to the current sequencing depth of Hi-C experiments, it is difficult to further narrow bin size (i.e. organizational resolution). Therefore we cannot observe more detailed organization (e.g. at several or dozens of Kbs in length) that might be more relevant to specific functions.Secondly, the use of different restriction enzymes in Hi-C experiments may have caused systematic biases. To evaluate the potential bias from enzymes on inferred genome organization, GeSICA was applied to Hi-C data in the GM06990 cell line using either HindIII or NcoI as restriction enzymes separately. The calculated interaction ratios from each dataset were notably similar to those from the combined datasets ( Additional file 1: Figure S 10A, Additional file 1; S 10B). From each dataset, we also observed the enrichment of the binding sites of CTCF, Pol II and Taf1 in the inferred cluster boundaries ( Additional file 1: Figure S 11A, Additional file 1: S 11B). From our analysis, the inferred genome organization from Hi-C datasets, as determined using different enzymes, was generally consistent to each other. Last, GeSICA is a general method of genome segmentation from Hi-C data that does not consider the rearrangement of the cancer genome, although genome rearrangement might affect inferred organization.
We introduced a two-step strategy, GeSICA, which can be used to investigate the genome organization based on Hi-C data. The first step was based on the assumption that random short-range DNA interactions would be easier to detect in open chromatin environments, and that the calculated interaction ratio could be regarded as an index that quantifies the degree of chromatin openness. The second step was designed on structural features of chromatin organization: the levels of DNA interactions within clusters are greater than those across clusters, and the inferred clusters may be potential organizational units with independent structures and functions. In the foreseeable future, we expect numerous studies to generate unbiased DNA interactome data with improved resolution that will enhance the credibility and efficacy of genome organization characterizations.
"Genomic bins" are defined as non-overlapping genomic intervals of a certain length. Several continuous bins comprise a genomic region. In this study, the default bin-size was set as 100Kb. One end of each paired-end read was considered to be in a bin if its starting position fell within the interval of that bin. For each pair of bins, the number of paired-end tags was set as the absolute interaction value (IV).
In this formula, d was set as the parameter to determine the distal interactions. For the sake of convenience in discretizing the genomic state, we used the logged form of this ratio to generate a spectrum like profile.
Markov Clustering [23, 40] is a clustering algorithm designed for the natural partitioning of weighted graphs. An intuitive perspective of Markov Clustering is to detect the clusters in which random walks would infrequently lead to another one. The whole process is deterministic. Markov Clustering firstly transforms the input adjacency list into a stochastic "Markov" matrix. This matrix portrays the transition probabilities between all pairs of bins. Markov Clustering then simulates random walks in a graph by two major steps named expansion and inflation. Expansion is calculated by taking the stochastic matrix squaring (or nth powering) to calculate the probability of a random walk of length n. The probabilities between bins in the same cluster will be higher than that across different clusters. Furthermore, to aggrandize this effect, an inflation step is raised to take the entry wise Hadamard power of a matrix and then followed by a normalization step to turn the new matrix back into a stochastic or Markov matrix again. Finally, clusters are detected by repeating and alternating expansion and inflation until convergence is obtained, i.e., the probability between the final clusters is less than the given tolerance.
GeSICA is implemented in Python and freely distributed with an open source Artistic License at . The following parameters should be used: -f for the Hi-C raw file format after reads mapping; -r for the desired resolution of the result; -d for the distance to distinguish short-range and long-range interactions; -c for filtering the interactions below a certain distance "c"; and all parameters necessary for Markov Clustering. The output files of GeSICA include three different types of files: an interaction-ratio wiggle file (-wig), a bed file for each bin with the corresponding state and cluster information (-bed) and a cluster file of the bins that each cluster contains (-cluster). GeSICA can be run from the command line and is available for Linux, UNIX and Mac OS. Additional file 1: Figure S 12 illustrates the workflow of GeSICA.
The Hi-C experimental data and the result of PCA for the GM06990 and K562 cell lines from Lieberman-Aiden et al. can be obtained from the GEO database with accession No. GSE18199  and from the Hi-C data browser . The gene expression data of the GM06990 and K562 cell line can be downloaded separately from the GEO database with accession No. GSE14083  and GSE12056 . The list of differentially expressed genes were calculated using the Limma software package . All of the ENCODE transcription factors binding sites, histone modifications, and DNaseI hypersensitive sites data can be accessed through the UCSC Genome Browser [20, 45, 46]. The genome assembly used in this work is hg18. If data from the GM06990 cell line was found not to exist, related data from the GM12878 cell line was employed instead, due to the similarity between these two cell lines . The peak files of these data, which are available in both the hg18 and hg19 genome assemblies, were collected, and liftover was applied to transform the peak files that only existed in hg19 back to hg18. The list of human housekeeping genes was derived from , which reported 575 housekeeping genes in total.
Chromosome Conformation Capture
Genome segmentation from intra-chromosomal associations
RNA Polymerase II
Transcription initiation factor TFIID subunit 1
Mono-Methylated Histone H3 at Lysine 9
Tri-Methylated Histone H3 at Lysine 9
Tri-Methylated Histone H3 at Lysine 27
Di-Methylated Histone H3 at Lysine 79.
We would like to thank X. Shirley Liu, Kai Fu and Qian Zhao and two anonymous reviewers for their advices on our method and the insightful suggestions on the accomplishment of this paper. This study was supported by funds from National Natural Science Foundation of China (31071114), the National Basic Research Program of China (973 Program; No. 2010CB944904, and 2011CB965104), the Shanghai Rising-Star Program (10QA1407300), the New Century Excellent Talents in the University of China (NCET-11-0389) and the Innovative Research Team Program Ministry of Education of China (IRT1168).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.