Data sets and genome annotations
Two ChIP-seq data sets are used to identify patterns of promoters, RNA Pol-II ChIP-seq data and H3K4me2 ChIP-seq data, both from MCF7 (normal breast cancer cell line). RNA-seq (RNA sequencing) data also from MCF7 are used to identify transcripts in the breast cancer cell line including alternative splicing. Genome annotation databases such as non-coding RNA (ie. snoRNA and miRNA), ESTs (Expressed Sequence Tags), CpG island and CTCF (CCCTC binding factor) tracks are downloaded from UCSC genome browser. ER/AR (Estrogen and Androgen Receptor) binding sites are retrieved from HRTBLDb (Hormone Receptor Target Binding Loci) database [10].
Methods
In the 1st step, we characterize Pol-II binding and chromatin mark patterns by performing k-means clustering around the gene transcription starting site (TSS) of known genes. This is then followed in the second step of fitting a double-exponential and uniform mixture model. At the end of the two step procedure the patterns identified will be used to scan the genome to identify putative promoter regions (see Figure 1).
Specifically, we consider the H3K4me2 and Pol-II ChIP-seq profiles along 10-kb regions surrounding well-annotated TSS in known genes using RefSeq database. Each 10-kb (with 5-kb on each side of TSS) regions contains read counts in bins of size 100-bp. In pre-processing step, we smooth the data using a moving average filter replacing each count in each bin with the average of three consecutive bins. Next, in order to prevent interference from neighboring genes, we exclude genes with TSSs within 10-kb of each other. Furthermore, to prevent degenerate clustering, we remove regions with low binding intensities and low variance. Low binding intensities regions are regions with maximum read counts less than 4 among the 100-bp bins over the 10-kb regions. Low variance regions are defined as regions with variance less than 10th percentile over all 10-kb regions. These filtering criteria result in a dataset consisting a total of 9,859 10-kb regions. K-means clustering using correlation as distance measurement is then performed to find sets of common patterns. The optimal number of cluster is determined using silhouette values [11]. Larger value of silhouette indicates greater similarity of these patterns within a cluster compared to between clusters. In our application, we found clustering these 9,859 regions into 4 common patterns yields the highest silhouette. Next, we modeled the characteristic signature of Pol-II and H3K4me2 within each cluster using a double-exponential and uniform mixture. The double exponential components will be able to capture both unimodal and bimodal distribution. This is essential because Pol-II and H3K4me2 peaks has been shown to be unimodal and bimodal, respectively. The uniform component will be used to model the tails of Pol-II profiles.
Let y1(t) and y2(t) be the read counts of Pol-II and H3K4me2 ChIP-seq in the 10-kb region around TSS of a gene where t is an indicator variables denoting the bin index, respectively. If we quantify the data into bins of size = 100-bp, then t ∈ T = {-50,- 49, ..., 49, 50}. Let R(t) be the chromosomal region relative to the TSS of the gene. Thus, for t = -50, R(t) denotes region 4901-bp to 5000-bp upstream of TSS. The mixture model for each profile (i.e. Pol-II and H3K4me2) can be defined as follows:
(1)
where μ and β are the location and scale parameters of the double exponential distribution, respectively. π is the mixing proportion (i.e. ), a and b are the parameters of the uniform distribution. k = 1, 2 for fitting Pol-II and H3K4me2 ChIP-seq profiles respectively. Each model is fitted by minimizing the Kullback-Leibler distance [12] to f
k
(t) as follows:
(2)
using generalized pattern search (GPS) algorithm [13]. GPS method is a derivatives-free optimization algorithm using positive spanning directions. The GPS algorithm is run until one of the following criteria is satisfied: (1) the number of function evaluations reaches 20,000; (2) maximum number of iterations the algorithm performs reaches 2000; (3) the minimum distance between the current points at two consecutive iteration is less than 10-6, (4) After a successful poll, the difference between the function value at the previous best point and the function value at the current best point is less than 10-6. The search algorithm is repeated 16 times with different initial points. Using this strategy, we obtained four distinct models of Pol-II and H3K4me2 signatures representing the majority of the patterns exist at promoter region of known genes. Each model is a mixture of double exponential and uniform components. Figure 2 shows the 4 distinct patterns modeled by the finite mixture.
Finally, we scan the whole genome using the fitted models to find regions that display these Pol-II and H3K4me2 patterns (see Figure 3). We concatenate the fitted Pol-II and H3K4me2 models then use a sliding window of 10-kb moving 1-bp at a time to find regions with the Pol-II and H3K4me2 fitted model. Once genome-wide correlation with these models have been obtained, a threshold for these values must be established in order to classify regions as putative promoters which display these promoter signatures. A null distribution of the test statistics (correlation) are approximated by randomly permuting the read counts of the H3K4me2 and Pol-II regions and calculating their correlation with the fitted model. Regions with high correlation coefficients are defined as regions that have correlation greater than a threshold z. The threshold z is chosen as the 95th percentile of the asymptotic distribution of the test statistics. These genomic locations which display these specific patterns of Pol-II and H3K4me2 are designated as potential promoters. For brevity, we will refer to the fitted Pol-II and H3K4me2 patterns as promoter patterns. We further annotate these correlated regions as known promoters and predicted alternative promoters using RNA-seq data in MCF7 along with transcripts databases such as ESTs, snoRNA/miRNA downloaded from USCS genome browser.