As described by Fan et al [10], MCF7 human breast cancer cells (American Type Culture Collection, Manassas, VA) and MCF7-OTH cells were cultured and treated with E2 (10-8 mol/L) for three hours. Then, cells were cross-linked with 1% formaldehyde and chromatin immunoprecipitation was done as previously described [11]. The antibodies against Pol II were purchased from Santa Cruz Biotechnology (Santa Cruz, sc-899 X and sc-8005 X). After immunoprecipitation and purification, ChIP DNA sample was run in 12% PAGE and the 100–300 bp DNA fraction was excised and eluted from the gel slice. Then, Illumina library was constructed and sequenced with Illumina/Solexa Genome Analyzer.
Most DNA binding proteins, such as transcription factors, bind to cis-acting DNA element with specific sequence features usually described by a position weight matrix (PWM). A hypothetical distribution of ChIP-seq-derived DNA fragments corresponding to transcription factors and RNA polymerase II (Pol II) is shown in Figs. 1A and 1B, respectively. For transcription factors, ChIP-seq detects a set of fragments that cluster and center around distinct biological binding sites, forming a "peak" around the binding locus (Fig. 1A). In contrast, Pol II binds throughout promoter regions, the 5'- and 3'- untranslated regions, the open reading frame (ORF), and downstream regions of the activated gene (Fig. 1B). Although Pol II can form a distinct peak around the transcription start site under certain circumstances, commonly-used peak-finding algorithms are not able to identify Pol II-enriched regions in gene transcript region derived from ChIP-seq experiments.
A Poisson mixture model to identify transcripts with different Pol II binding quantity
Based on the assumption that the number of Pol II-binding fragments detected within gene transcript region, including 5'- and 3'-untranslated regions and open reading frames, follows a Poisson distribution, we developed a mixture model to identify differences in Pol II binding under two conditions, control vs. treatment. Denote y
ij
as the Pol II quantity for gene i (i = 1,..., n) under condition j (j = 1, 2). In this application, j = 1, 2 indicates two biological conditions, MCF7 control (vehicle-treated) and MCF7 treated (3 hour E2 treatment), respectively. Marginally, y
ij
follows a Poisson distribution,
(1)
where λ
ij
denotes the expected quantity of Pol II binding for gene i under condition j (j = 1, 2). For the i-th gene, if Pol II binding quantities within the gene transcript region are statistically different between the two biological conditions, y
ij
follows marginal distributions with different parameters λi 1and λi 2; conversely, if the number of Pol II binding does not demonstrate a significant difference, y
ij
follows the same distribution with a unified λ. A mixture model is proposed to estimate the posterior probability of differential Pol II binding quantity.
For unified Pol II binding quantity between two conditions, λi 1and λi 2follow the same Gamma distribution (2). The selection of Gamma distribution is based on two considerations. Firstly, the two parameter Gamma distribution is a very flexible function which can be used to describe a wide range of distribution shapes. Secondly, the Gamma distribution is a conjugate distribution of Poisson for the λ parameters. Hence, the follow-up expectation step in an E-M algorithm has a close form solution.
(2)
Otherwise, (λi 1, λi 2) are distributed independently.
(3)
Denote Z
i
as the Bernoulli random variable with probability p, i.e., it is equal to 1 if (λi 1, λi 2) are independently distributed (and equal to 0 if not). Therefore, the joint distribution of Pol II binding quantity is modeled by a mixture of uniform binding events (p = 0) and differential binding events (p = 1).
(4)
Based on equation (4), we implement E-M algorithm by treating Z
i
as missing data.
The E-step of the algorithm is specified as:
(5)
In the M-step, the parameters in the Gamma distribution (α1, β1, α2, β2) are estimated by maximizing the two likelihood functions in Equation 6.
(6)
In the M-step, the optimization procedure is challenging, because searching for the optimal solutions for Gamma parameters can be trapped into local optimum, causing either slow convergence or failure to converge on the global optimal solution. In order to overcome these difficulties, we utilize particle swarm optimization (PSO), an artificial intelligence approach that mimics a behavior of swarm-forming agents, providing a good balance between global optimum searching and computation efficiency [12].
Because the likelihood functions of (α1, β1) and (α2, β2) are factorized in equation 6, the PSO optimization procedures are conducted independently using the following four steps.
Step 1: 100 particles (potential solution) were initially randomly distributed in 2-dimensional parameter spaces (α1, β1) or (α2, β2).
Step 2: the likelihood of each of the 100 particles are calculated by following Equation 6.
Step 3: the velocity vector of the particle, serving as the guide to search for the optimal solution, was calculated using Equation 7.
(7)
where P
global
is the global optimal solution achieved so far; Pk-localis the local optimal solution achieved by particle k; and C0, C1, C2 are adjustable weight factors used to control searching speed.
Step 4: in the solution space, all the particles are re-positioned based on their current positions and movement velocities calculated in Equation 8.
(8)
Steps 1–4 will be iterated until further particle movement cannot result in higher likelihood (defined in Equation 6).
At the convergence, Z
i
can be interpreted as the probability of differential Pol II binding between two conditions. Although the model derivation is based on ChIP-seq data from MCF7 cells before and after treatment with vehicle or E2 for 3 hours, it can also be equally applied to OHT-MCF7 (+/- E2 treatment). In practice, the solution of the E-M algorithm converges in only 5 to 6 cycles.