Integrative genome-wide chromatin signature analysis using finite mixture models
© Taslim et al.; licensee BioMed Central Ltd. 2012
Published: 26 October 2012
Regulation of gene expression has been shown to involve not only the binding of transcription factors at target gene promoters but also the characteristics of the histones around which DNA is wrapped. Some histone modifications, for example di-methylation of histone H3 at lysine 4 (H3K4me2), have been shown to occur at promoters and to activate target genes. However, no clear pattern has been shown to predict human promoters. This paper proposes a novel quantitative approach to characterize patterns of promoter regions and to predict novel and alternative promoters. We used high-throughput data generated by chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) for RNA Polymerase II (Pol-II) and H3K4me2. Common patterns of promoter regions are modeled using a mixture model involving double-exponential and uniform distributions. The fitted models are then used to search the entire genome for regions displaying similar patterns. Regions that correlate highly with the common patterns are identified as putative novel promoters. We used the proposed algorithm, RNA-seq data and several transcript databases to find alternative promoters in the MCF7 breast cancer cell line. We found 7,235 high-confidence regions that display the identified promoter patterns. Of these, 4,167 regions (58%) can be mapped to RefSeq regions, and 2,444 regions are in a gene body or overlap with transcripts (non-coding RNAs, ESTs, and transcripts predicted from RNA-seq data). Some of these may be alternative promoters. We also found 193 regions that map to enhancer regions (represented by androgen and estrogen receptor binding sites) and other regulatory regions such as CTCF (CCCTC binding factor) binding sites and CpG islands.
About 6% (431 regions) of these correlated regions do not overlap with any transcripts or regulatory regions, suggesting that they may be new promoters or markers for other annotations that are currently undiscovered.
A multicellular organism consists of hundreds of different cell types. A cell typically expresses only a fraction of its genes. Each cell type differs from the others because it activates a different set of genes, whose activity turns various biological processes on and off. The process by which a cell determines which genes it will express, and when, is called gene regulation. Because of the multitude of cell types, the regulation of gene expression in complex genomes, such as the human genome, is known to be an extremely complicated process. It is now well accepted that, apart from sequence polymorphisms and variations, gene regulation in humans plays an important role in the onset and progression of many diseases. By matching gene expression profiles to those of known tumors, researchers can type cancer cells of unknown tissue origin. Understanding the mechanisms that govern gene regulation is therefore crucial. For many genes, expression levels are controlled by the attachment of specific proteins, known as transcription factors, to locations on the DNA, where they activate or suppress expression of the target genes. The location where a transcription factor binds is known as the promoter region. Recent discoveries show that regulation of gene expression involves not only the binding of transcription factors at target gene promoters but also epigenetic events such as modifications of the histones around which DNA is wrapped [1–3]. Certain histone modifications, for example di-methylation of histone H3 at lysine 4 (H3K4me2), have been suggested to relax nucleosome packing, allowing nuclear factors to bind to the promoter region and activate the gene. Specific chromatin signatures were also reported to be present at gene promoters. Thus, characterizing histone modifications at promoter regions contributes fundamentally to deciphering the mechanism of gene expression.
To complicate the process even further, more than half of human genes are known to have multiple promoters. Genes that display complex transcriptional regulation in different cellular conditions or developmental stages have been shown to use alternative promoters. Predicting all of these gene promoters, including their alternatives, is therefore important for understanding the mechanism of gene regulation.
With the rapid availability of high-throughput technologies such as chromatin immunoprecipitation followed by next-generation sequencing (ChIP-seq), scientists can now observe the binding patterns of a protein of interest across the entire genome. Genome-wide identification of promoters is commonly done using an antibody against RNA polymerase II (the enzyme required for gene transcription). However, because of non-specific binding of Pol-II over the genome and the specific characteristics of the antibody against Pol-II, it is hard to predict promoters from Pol-II enrichment alone. The dynamics of transcribing Pol-II throughout the gene body also make it hard to pinpoint the exact promoter region. Furthermore, there is evidence that Pol-II may accumulate at a promoter even though the gene is not transcribed. This phenomenon, known as RNA Pol-II stalling, has been shown to occur in Drosophila and may also happen in human.
Thus, a better promoter identification algorithm is needed to account for these different situations. It is conceivable that promoter regions display unique combinations of chromatin and Pol-II patterns. Conditions such as Pol-II stalling may display patterns different from those of transcribing genes. As an attempt to address this problem, in this article we propose a computational method that uses a finite mixture model to identify promoter signature profiles based on both Pol-II and H3K4me2 binding patterns. We chose the H3K4me2 pattern because H3K4 di-methylation has been shown to promote the transcriptional activity of genes. Using the fitted model, we scan the genome for regions that display the identified promoter signatures. We call these regions putative promoters. Aided by RNA-seq data combined with several transcript databases, we annotate these putative promoters as predicted alternative and novel promoters. We have also found that similar patterns exist in regions associated with gene regulatory sites other than promoters, such as ER/AR (Estrogen and Androgen Receptor) binding sites. These two proteins are known to bind to non-promoter regions known as enhancers [8, 9]. We have also found genomic regions displaying these Pol-II and H3K4me2 patterns that map exclusively to other regulatory regions such as CTCF (CCCTC binding factor) binding sites and CpG islands.
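The genome scan described above can be illustrated with a minimal sketch. It assumes that a promoter signature is available as a vector of expected bin values (the template) and that similarity is measured by Pearson correlation over a sliding window. The function names, the toy data, the single-track input and the 0.99 cutoff are hypothetical; the text does not specify how the Pol-II and H3K4me2 tracks are combined or which threshold is applied.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def scan_track(track, template, min_r):
    """Slide the template along a binned track.

    Returns (start_bin, r) for every window whose correlation with the
    template is at least min_r. min_r is a hypothetical cutoff.
    """
    width = len(template)
    hits = []
    for start in range(len(track) - width + 1):
        r = pearson(track[start:start + width], template)
        if r >= min_r:
            hits.append((start, r))
    return hits

# Toy data, made up for illustration: the track contains one scaled copy
# of the template starting at bin 3.
template = [0, 1, 4, 9, 4, 1, 0]
track = [0, 0, 0, 0, 2, 8, 18, 8, 2, 0, 0, 0, 1, 0]
hits = scan_track(track, template, min_r=0.99)
```

Because correlation ignores scale, a window matches the template whenever it has the same shape, regardless of read depth. This is the property that makes a shape-based scan different from a simple enrichment threshold.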
Two ChIP-seq data sets are used to identify promoter patterns: RNA Pol-II ChIP-seq data and H3K4me2 ChIP-seq data, both from the MCF7 breast cancer cell line. RNA-seq (RNA sequencing) data, also from MCF7, are used to identify transcripts in the breast cancer cell line, including alternatively spliced transcripts. Genome annotation tracks for non-coding RNAs (i.e., snoRNA and miRNA), ESTs (Expressed Sequence Tags), CpG islands and CTCF (CCCTC binding factor) binding sites are downloaded from the UCSC Genome Browser. ER/AR (Estrogen and Androgen Receptor) binding sites are retrieved from the HRTBLDb (Hormone Receptor Target Binding Loci) database.
Specifically, we consider the H3K4me2 and Pol-II ChIP-seq profiles along 10-kb regions surrounding well-annotated TSSs of known genes in the RefSeq database. Each 10-kb region (5 kb on each side of the TSS) contains read counts in 100-bp bins. In the pre-processing step, we smooth the data using a moving average filter that replaces the count in each bin with the average of three consecutive bins. Next, to prevent interference from neighboring genes, we exclude genes whose TSSs lie within 10 kb of each other. Furthermore, to prevent degenerate clustering, we remove regions with low binding intensity or low variance. Low-intensity regions are regions whose maximum read count among the 100-bp bins over the 10-kb region is less than 4. Low-variance regions are regions whose variance is below the 10th percentile over all 10-kb regions. These filtering criteria result in a dataset consisting of a total of 9,859 10-kb regions. K-means clustering with correlation as the distance measure is then performed to find sets of common patterns. The optimal number of clusters is determined using silhouette values. A larger silhouette value indicates greater similarity of patterns within a cluster than between clusters. In our application, we found that clustering these 9,859 regions into 4 common patterns yields the highest silhouette. Next, we model the characteristic signature of Pol-II and H3K4me2 within each cluster using a mixture of double-exponential and uniform distributions. The double-exponential components can capture both unimodal and bimodal distributions. This is essential because Pol-II and H3K4me2 peaks have been shown to be unimodal and bimodal, respectively. The uniform component is used to model the tails of the Pol-II profiles.
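The pre-processing filters and the shape of the mixture density can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study. The moving-average window is assumed to be centered and to shrink at region edges, the percentile uses the nearest-rank definition, the filters are applied to whatever counts are passed in, and the mixture uses untruncated double-exponential (Laplace) components. All function names, weights, locations and scales are hypothetical.

```python
import math

MIN_PEAK = 4   # from the text: drop regions whose maximum bin count is below 4
VAR_Q = 10     # from the text: drop regions below the 10th variance percentile

def smooth(counts):
    """Three-bin moving average.
    Assumption: the window is centered and shrinks at the two region edges."""
    out = []
    for i in range(len(counts)):
        window = counts[max(0, i - 1):i + 2]
        out.append(sum(window) / len(window))
    return out

def variance(values):
    """Population variance of one region's bin counts."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def percentile(values, q):
    """Nearest-rank percentile (an assumed definition; others exist)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

def filter_regions(regions, min_peak=MIN_PEAK, var_q=VAR_Q):
    """Keep regions that pass both the intensity and the variance filter."""
    cutoff = percentile([variance(r) for r in regions], var_q)
    return [r for r in regions
            if max(r) >= min_peak and variance(r) >= cutoff]

def laplace_pdf(x, loc, scale):
    """Double-exponential (Laplace) density."""
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)

def mixture_pdf(x, lo, hi, w_uniform, components):
    """Uniform on [lo, hi] plus weighted Laplace components.
    components is a list of (weight, loc, scale); all weights sum to 1."""
    uniform = w_uniform / (hi - lo) if lo <= x <= hi else 0.0
    return uniform + sum(w * laplace_pdf(x, loc, s) for w, loc, s in components)

# Toy data (made up for illustration only).
smoothed = smooth([0, 3, 6, 3, 0])
kept = filter_regions([[0, 0, 0, 0], [0, 1, 2, 1], [0, 5, 10, 5]])
# One central peak (unimodal) versus two peaks flanking the TSS (bimodal),
# evaluated at the TSS (x = 0) on a region spanning -5000..5000 bp.
unimodal = mixture_pdf(0.0, -5000, 5000, 0.2, [(0.8, 0.0, 500.0)])
bimodal = mixture_pdf(0.0, -5000, 5000, 0.2,
                      [(0.4, -800.0, 300.0), (0.4, 800.0, 300.0)])
```

With a single double-exponential component the density peaks at the TSS, while two components placed on either side of it leave a dip at the TSS. This is how the same family can represent both a unimodal and a bimodal profile, with the uniform term adding a flat background.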
Recently, RNA polymerase II has been discovered to be present at enhancer regions. These regions, which are found to affect genes far away, can produce their own RNA molecules. We therefore examine whether the same promoter pattern can be found at enhancer regions. We use the binding sites of ER (Estrogen Receptor) and AR (Androgen Receptor) as representatives of enhancer regions, since both of these proteins have been shown to bind at distal enhancer regions. By overlapping the unmapped regions with ER binding sites, we found 120 regions with similar promoter patterns. An example is shown in Figure 6D. However, after mapping ER binding sites, we did not find any overlap with AR binding sites.
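The overlap step can be sketched as a simple interval test. The sketch assumes that regions and binding sites are given as (chromosome, start, end) tuples with half-open coordinates and that any shared base counts as an overlap. The function names and all coordinates are hypothetical and are not taken from the study's data.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]

def regions_hitting_sites(regions, sites):
    """Return the regions that overlap at least one site on the same chromosome."""
    by_chrom = {}
    for chrom, start, end in sites:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = []
    for chrom, start, end in regions:
        if any(overlaps((start, end), s) for s in by_chrom.get(chrom, [])):
            hits.append((chrom, start, end))
    return hits

# Made-up coordinates for illustration only.
unmapped_regions = [
    ("chr1", 10000, 20000),
    ("chr1", 50000, 60000),
    ("chr2", 10000, 20000),
]
er_sites = [
    ("chr1", 19500, 20500),   # overlaps the first region
    ("chr1", 60000, 60500),   # starts exactly where the second region ends
    ("chr3", 10000, 20000),   # same coordinates, different chromosome
]
er_hits = regions_hitting_sites(unmapped_regions, er_sites)
```

This linear scan over sites is adequate for small site lists. For genome-scale annotation tracks, sorted intervals or an interval tree would be the usual choice.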
[Table: Number of correlated regions that overlap with each genome annotation or transcript set, including transcripts detected using RNA-seq. Column: number of overlaps.]
In this paper, we develop a novel algorithm based on a finite mixture model to predict promoter regions using ChIP-seq profiles. We are interested in identifying transcriptionally active promoters by clustering all TSS regions. We use the term promoter to describe these regions throughout the paper. We identified putative promoter regions based on their statistical significance. Our algorithm takes advantage of the new sequencing technology, which allows one to observe binding patterns, by modeling the shape of these promoter patterns instead of simply categorizing binding sites as binary (present/absent). Four common models representing the shapes of promoter patterns are obtained by the K-means clustering algorithm. Although these patterns appear similar, the shift in the location of the peaks may be meaningful. For example, the shift may indicate genes that are poised to be transcribed but not yet active. Furthermore, the distinctive patterns may prove important in differentiating the functions or behavior of these promoters. More detailed investigation is needed to draw a clearer picture of the gene expression mechanism. Nevertheless, the proposed algorithm may help with the discovery of novel promoters (including alternative promoters) and aid the ongoing annotation of promoters from different ChIP-seq experiments. Finally, the proposed algorithm may also be extended to identify enhancer elements important in distal gene regulation. For instance, in Figure 6C, the combined Pol-II and H3K4me2 peaks map to a potential enhancer region with detectable transcripts in the RNA-seq experiment. These short transcripts are likely to be the recently discovered eRNAs, short RNAs transcribed from enhancer regions whose function is still not clear. These findings may lead to new insight into the epigenetic mechanisms of transcription regulation, with applications in cancer.
Based on “Chromatin signature analysis and prediction of genome-wide novel promoters using finite mixture model”, by Cenny Taslim, Shili Lin, Kun Huang and Tim HM Huang, which appeared in Genomic Signal Processing and Statistics (GENSIPS), 2011 IEEE International Workshop on. © 2011 IEEE.
This work was supported by NCI U54CA113001 (Integrative Cancer Biology Program), NSF grant DMS-1042946, PhARMA Foundation, and NCI P30CA054174 (Cancer Center Support Grant) of the National Institutes of Health and by generous gifts from the Cancer Therapy and Research Center Foundation, University of Texas Health Science Center at San Antonio. We thank Dr. Hatice Gulcin Ozer for her help with raw ChIP-seq data and analysis of RNA-seq data, and Ms. Ayse Selen Yilmaz for her assistance with Cufflinks.
This article has been published as part of BMC Genomics Volume 13 Supplement 6, 2012: Selected articles from the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS) 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S6.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.