HEPeak pipeline
To address the aforementioned MeRIP-seq issues, HEPeak incorporates several high-throughput sequencing tools in its pipeline. First, HEPeak uses TopHat [18] to align fragmented mRNA reads to the reference transcriptome, allowing short reads to span exon-exon junctions. Next, SAMtools [19] is applied to exclude multi-mapping reads and to index the alignment results. After these pre-processing steps, HEPeak performs HMM-based peak calling on the exons of each gene, with introns excluded, to identify the genomic loci of methylation sites. HEPeak writes its output in BED format, which can be visualized together with the input alignments in IGV 2.1 [20].
Exome-based peak calling
The goal of peak calling in MeRIP-seq is to detect regions in transcripts where the read counts in the IP sample are more enriched than those in the control sample. As with ExomePeak, our previously developed peak calling algorithm for MeRIP-seq, HEPeak performs peak calling on the connected exons of a specific gene, in clear contrast to genome-based ChIP-seq peak calling methods such as MACS [21]. Projecting the genome onto the transcriptome circumvents the ambiguity of isoform assignment while preserving the convenience of gene-based annotation, which makes biological interpretation of the predictions straightforward.
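The projection onto connected exons can be illustrated with a minimal Python sketch. It assumes a plus-strand gene with non-overlapping exons given as 0-based, end-exclusive genomic intervals; the function names `build_exon_index` and `genome_to_exome` are hypothetical and are not part of HEPeak.

```python
from bisect import bisect_right

def build_exon_index(exons):
    """Sort exons and record where each one starts on the concatenated exons.

    `exons` is a list of (start, end) genomic intervals, 0-based, end-exclusive.
    Assumption: exons do not overlap and the gene is on the plus strand.
    """
    exons = sorted(exons)
    offsets = []
    total = 0
    for start, end in exons:
        offsets.append(total)   # exome coordinate of the first base of this exon
        total += end - start
    return exons, offsets, total

def genome_to_exome(pos, exons, offsets):
    """Map a genomic position to its coordinate on the concatenated exons.

    Returns None when the position lies in an intron or outside the gene.
    """
    starts = [s for s, _ in exons]
    i = bisect_right(starts, pos) - 1   # last exon starting at or before pos
    if i < 0:
        return None
    start, end = exons[i]
    if pos >= end:
        return None
    return offsets[i] + (pos - start)

# Toy gene with three exons of length 100, 150 and 100.
exons, offsets, exome_length = build_exon_index([(100, 200), (500, 650), (900, 1000)])
```

Reads that fall in introns map to `None` and are simply ignored, which is what restricting peak calling to exons amounts to in this toy setting.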
The definition of HMM for MeRIP-seq data
Given a particular mRNA (RefSeq gene), its concatenated exons are first divided into $N$ contiguous bins, whose size is set to the read length $L$. For the $n$th bin, the unknown hidden methylation status is denoted $z_n \in \{1, 2\}$, where 1 represents the unmethylated state and 2 the methylated state. Since a peak likely spans multiple bins, we assume that the methylation status $z_n$ follows a first-order Markov chain, whose transition matrix $A$ contains entries defined as
$$A_{jk} = P(z_n = k \mid z_{n-1} = j) \tag{1}$$
where $A_{jk}$ denotes the probability that the latent variable switches from state $j$ at the $(n-1)$th bin to state $k$ at the $n$th bin, and $j, k \in \{1, 2\}$ index the hidden states. Additionally, we assume the initial probabilities $P(z_1 = 1) = \pi$ and $P(z_1 = 2) = 1 - \pi$.
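As a rough illustration of the binning and the Markov chain prior, the following Python sketch splits a concatenated exon length into bins of one read length and evaluates the log prior probability of a state path. The helper names and the numerical values of $\pi$ and $A$ are made up for the example, and letting the last bin be shorter than $L$ is an assumption.

```python
import math

def bin_edges(exome_length, read_length):
    """Split [0, exome_length) into contiguous bins of width read_length.

    Assumption: the last bin is allowed to be shorter than read_length.
    """
    return [(s, min(s + read_length, exome_length))
            for s in range(0, exome_length, read_length)]

def log_path_prior(states, pi, A):
    """Log-probability of a state path under the first-order Markov chain.

    states: sequence of 1 (unmethylated) or 2 (methylated)
    pi:     P(z_1 = 1)
    A:      2x2 matrix, A[j-1][k-1] = P(z_n = k | z_{n-1} = j)
    """
    init = [pi, 1.0 - pi]
    logp = math.log(init[states[0] - 1])
    for prev, cur in zip(states, states[1:]):
        logp += math.log(A[prev - 1][cur - 1])
    return logp

bins = bin_edges(350, 100)
A = [[0.9, 0.1],   # illustrative values only
     [0.2, 0.8]]
lp = log_path_prior([1, 1, 2, 2], pi=0.8, A=A)
```

Large diagonal entries in `A` make long runs of the same state more probable, which is how the chain encodes the expectation that a peak spans several bins.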
Next, let $x_n$ denote the read count in the IP sample and $y_n$ the read count in the control sample, both for bin $n$. We assume that, given the methylation status $z_n$, these read counts follow Poisson distributions defined as
$$P(x_n \mid z_n) = \mathrm{Poisson}(x_n;\, M_{IP}\lambda_{IP,z_n}) \tag{2}$$
$$P(y_n \mid z_n) = \mathrm{Poisson}(y_n;\, M_{ctrl}\lambda_{ctrl}) \tag{3}$$
where $M_{IP}$ and $M_{ctrl}$ are the total reads (sequencing depth) in the IP and the control samples, respectively, and $\lambda_{IP,z_n}$, for $z_n = 1$ or 2, and $\lambda_{ctrl}$ are the corresponding normalized Poisson rates. Note that $\lambda_{IP,z_n}$ switches according to the status of $z_n$, whereas $\lambda_{ctrl}$ stays the same.
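The Poisson read-count model can be sketched in a few lines of Python. The sequencing depths and normalized rates below are invented for illustration, and `bin_loglik` is a hypothetical helper that assumes the IP and control counts are independent given the state.

```python
import math

def poisson_logpmf(count, mean):
    """Log of the Poisson probability mass function."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)

def bin_loglik(x, y, state, m_ip, m_ctrl, lam_ip, lam_ctrl):
    """log P(x_n | z_n) + log P(y_n | z_n) for one bin.

    state:    1 (unmethylated) or 2 (methylated)
    lam_ip:   pair of normalized IP rates, one per state
    lam_ctrl: single normalized control rate, shared by both states
    """
    mean_ip = m_ip * lam_ip[state - 1]   # depends on the state
    mean_ctrl = m_ctrl * lam_ctrl        # does not depend on the state
    return poisson_logpmf(x, mean_ip) + poisson_logpmf(y, mean_ctrl)

# Illustrative numbers: expected IP counts of 5 or 20, expected control count of 10.
params = dict(m_ip=1e6, m_ctrl=2e6, lam_ip=(5e-6, 2e-5), lam_ctrl=5e-6)
ll_unmeth = bin_loglik(18, 9, state=1, **params)
ll_meth = bin_loglik(18, 9, state=2, **params)
```

With 18 IP reads against 9 control reads, the methylated state gives the higher log-likelihood, because only the IP mean changes between the two states.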
It would be intuitive next to define the relationship between the Poisson rates for the methylated and unmethylated states in the IP and the control samples. However, unlike in ChIP-seq, where this relationship is mostly defined only for the IP sample, defining it for both the IP and the control is non-trivial, and model complexity must also be kept in check to avoid difficulties in subsequent inference. To this end, we transform the formulation by observing that, given (2) and (3), the conditional probability of observing $x_n$ reads in the IP sample, given the combined read count $t_n = x_n + y_n$ of the IP and control samples, follows the binomial distribution
$$P(x_n \mid t_n, z_n) = \binom{t_n}{x_n}\, p_{z_n}^{x_n} (1 - p_{z_n})^{t_n - x_n} \tag{4}$$
where
$$p_{z_n} = \frac{M_{IP}\lambda_{IP,z_n}}{M_{IP}\lambda_{IP,z_n} + M_{ctrl}\lambda_{ctrl}} \tag{5}$$
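The step from the Poisson model to the binomial form can be checked directly. The derivation below is a sketch that assumes $x_n$ and $y_n$ are independent given $z_n$, and it introduces the shorthand $\mu_x = M_{IP}\lambda_{IP,z_n}$ and $\mu_y = M_{ctrl}\lambda_{ctrl}$ (these two symbols, and the notation $\lambda_{IP,z_n}$ for the state-dependent IP rate, are used here for illustration only). It relies on the fact that the sum of two independent Poisson variables is Poisson with mean $\mu_x + \mu_y$.

```latex
\begin{align*}
P(x_n \mid t_n, z_n)
  &= \frac{P(x_n \mid z_n)\, P(y_n = t_n - x_n \mid z_n)}{P(t_n \mid z_n)} \\
  &= \frac{\dfrac{e^{-\mu_x}\mu_x^{x_n}}{x_n!} \cdot
           \dfrac{e^{-\mu_y}\mu_y^{\,t_n - x_n}}{(t_n - x_n)!}}
          {\dfrac{e^{-(\mu_x + \mu_y)}(\mu_x + \mu_y)^{t_n}}{t_n!}} \\
  &= \binom{t_n}{x_n}
     \left(\frac{\mu_x}{\mu_x + \mu_y}\right)^{x_n}
     \left(\frac{\mu_y}{\mu_x + \mu_y}\right)^{t_n - x_n}
\end{align*}
```

The success probability $\mu_x / (\mu_x + \mu_y)$ is the only quantity that survives, so the two Poisson rates enter the model only through this ratio.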
Note that $p_{z_n}$, for $z_n = 1$ (or 2), can be interpreted as the proportion of the mean IP read count in the combined mean read count of the IP and control samples for a bin when it is unmethylated (or methylated). The distribution (4) effectively combines the reads in the IP and control samples under one model. As such, instead of using (2) and (3), we define (4) as the emission probability of the proposed HMM and work with $p_k$ directly. Doing so avoids modelling and inferring the potentially complex relationships between the rates.
Given $X = \{x_1, x_2, x_3, \ldots, x_N\}$, the set of read counts for the $N$ bins, and $Z = \{z_1, z_2, z_3, \ldots, z_N\}$, the sequence of methylation states, we use $\gamma(z_{n,k})$ to denote the marginal posterior distribution of the latent variable $z_n$ at state $k$, and $\varepsilon(z_{n-1,j}, z_{n,k})$ to denote the joint posterior distribution of two successive latent variables, so that
$$\gamma(z_{n,k}) = P(z_n = k \mid X, \theta) \tag{6}$$
$$\varepsilon(z_{n-1,j}, z_{n,k}) = P(z_{n-1} = j, z_n = k \mid X, \theta) \tag{7}$$
Here, the parameter set is defined as $\theta = \{\pi, A, p\}$. Then, the log likelihood for the proposed HMM chain can be expressed as
$$\ln P(X, Z \mid \theta) = \ln P(z_1 \mid \pi) + \sum_{n=2}^{N} \ln P(z_n \mid z_{n-1}, A) + \sum_{n=1}^{N} \ln P(x_n \mid t_n, z_n, p) \tag{8}$$
We call this new formulation HEPeak, for Hidden Markov Model (HMM)-based Exome Peak finding. The graphical model of the HEPeak formulation is shown in Figure 2A. Compared with ExomePeak, HEPeak accounts for the correlation of reads between adjacent bins and more accurately models the behaviour of methylated reads in MeRIP-seq (Figure 2B).
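The binomial emission and the joint log likelihood of a given state path can be written out as a short Python sketch. It assumes that the log likelihood decomposes into an initial-state term, transition terms and binomial emission terms, with the binomial success probabilities stored as a pair `p`; the function names and toy numbers are illustrative and not taken from the HEPeak software.

```python
import math

def binom_logpmf(x, t, p):
    """Log binomial probability of x IP reads out of t combined reads."""
    log_choose = math.lgamma(t + 1) - math.lgamma(x + 1) - math.lgamma(t - x + 1)
    return log_choose + x * math.log(p) + (t - x) * math.log(1.0 - p)

def joint_loglik(x, t, states, pi, A, p):
    """ln P(X, Z | theta) for one gene.

    x, t:   per-bin IP counts and combined (IP + control) counts
    states: per-bin states, 1 (unmethylated) or 2 (methylated)
    pi:     P(z_1 = 1); A: 2x2 transition matrix; p: (p_1, p_2), each in (0, 1)
    """
    init = [pi, 1.0 - pi]
    ll = math.log(init[states[0] - 1])            # initial state term
    for prev, cur in zip(states, states[1:]):     # transition terms
        ll += math.log(A[prev - 1][cur - 1])
    for xn, tn, zn in zip(x, t, states):          # emission terms
        ll += binom_logpmf(xn, tn, p[zn - 1])
    return ll

# Toy example: the second bin is IP-enriched, the first is not.
A = [[0.9, 0.1], [0.1, 0.9]]
ll_good = joint_loglik([3, 9], [10, 10], [1, 2], pi=0.5, A=A, p=(0.3, 0.8))
ll_bad = joint_loglik([3, 9], [10, 10], [2, 1], pi=0.5, A=A, p=(0.3, 0.8))
```

In this toy case the path that labels the enriched bin as methylated scores higher than the reversed labelling, as expected.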
The EM solution
Given HEPeak, the goal is to call peaks, i.e., predict $z_n$ for all $n$, and at the same time estimate the model parameters $\theta$. To this end, we developed an Expectation-Maximization (EM) solution, which performs peak calling and parameter estimation iteratively. The steps of the EM algorithm are given below; the detailed derivation is included in the appendix.
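At the $m$th iteration, proceed as follows.
E step: Given the parameter $\theta^{(m-1)}$ estimated at the $(m-1)$th iteration, calculate the posterior distribution of the latent variables $P(Z \mid X, \theta^{(m-1)})$.
(9)
M step: Compute and update $\pi^{(m)}$, $A_{jk}^{(m)}$ and $p_k^{(m)}$ for all $j, k$ as
$$\pi^{(m)} = \frac{\gamma(z_{1,1})}{\sum_{j=1}^{2} \gamma(z_{1,j})} \tag{10}$$
$$A_{jk}^{(m)} = \frac{\sum_{n=2}^{N} \varepsilon(z_{n-1,j}, z_{n,k})}{\sum_{l=1}^{2} \sum_{n=2}^{N} \varepsilon(z_{n-1,j}, z_{n,l})} \tag{11}$$
$$p_k^{(m)} = \frac{\sum_{n=1}^{N} \gamma(z_{n,k})\, x_n}{\sum_{n=1}^{N} \gamma(z_{n,k})\, t_n} \tag{12}$$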
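One way to realize these two steps is the standard scaled forward-backward recursion followed by closed-form updates. The Python sketch below assumes that approach, with binomial emissions, and is not the HEPeak implementation; all function names, starting values and the toy data are hypothetical.

```python
import math

def binom_pmf(x, t, p):
    """Binomial emission probability of x IP reads out of t combined reads."""
    log_choose = math.lgamma(t + 1) - math.lgamma(x + 1) - math.lgamma(t - x + 1)
    return math.exp(log_choose + x * math.log(p) + (t - x) * math.log(1.0 - p))

def forward_backward(x, t, pi, A, p):
    """E step: scaled forward-backward pass.

    Returns gamma[n][k], xi[n-1][j][k] (pairwise posteriors, written as
    epsilon in the text) and the log-likelihood of the data.
    Indices 0/1 stand for states 1/2.
    """
    n_bins = len(x)
    emis = [[binom_pmf(x[n], t[n], p[k]) for k in (0, 1)] for n in range(n_bins)]
    init = [pi, 1.0 - pi]
    alpha, scale = [], []
    for n in range(n_bins):
        if n == 0:
            a = [init[k] * emis[0][k] for k in (0, 1)]
        else:
            a = [emis[n][k] * sum(alpha[n - 1][j] * A[j][k] for j in (0, 1))
                 for k in (0, 1)]
        c = sum(a)                      # scaling factor avoids underflow
        scale.append(c)
        alpha.append([v / c for v in a])
    beta = [[1.0, 1.0] for _ in range(n_bins)]
    for n in range(n_bins - 2, -1, -1):
        beta[n] = [sum(A[j][k] * emis[n + 1][k] * beta[n + 1][k] for k in (0, 1))
                   / scale[n + 1] for j in (0, 1)]
    gamma = [[alpha[n][k] * beta[n][k] for k in (0, 1)] for n in range(n_bins)]
    xi = [[[alpha[n - 1][j] * A[j][k] * emis[n][k] * beta[n][k] / scale[n]
            for k in (0, 1)] for j in (0, 1)] for n in range(1, n_bins)]
    return gamma, xi, sum(math.log(c) for c in scale)

def em_step(x, t, pi, A, p):
    """One EM iteration: E step, then closed-form M-step updates."""
    gamma, xi, loglik = forward_backward(x, t, pi, A, p)
    new_pi = gamma[0][0]                # gamma[0] already sums to one
    new_A = []
    for j in (0, 1):
        row = [sum(s[j][k] for s in xi) for k in (0, 1)]
        total = sum(row)
        new_A.append([v / total for v in row])
    new_p = [sum(g[k] * xn for g, xn in zip(gamma, x)) /
             sum(g[k] * tn for g, tn in zip(gamma, t)) for k in (0, 1)]
    return new_pi, new_A, new_p, loglik

def fit(x, t, pi=0.5, A=((0.9, 0.1), (0.1, 0.9)), p=(0.3, 0.7),
        max_iter=100, tol=1e-8):
    """Iterate EM until the log-likelihood changes by less than tol."""
    history = []
    for _ in range(max_iter):
        pi, A, p, loglik = em_step(x, t, pi, A, p)
        history.append(loglik)          # log-likelihood before this update
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
    return pi, A, p, history

# Toy gene: bins 5 to 8 are IP-enriched.
x = [2, 3, 1, 2, 9, 8, 9, 10, 2, 1, 3, 2]
t = [10] * 12
pi_hat, A_hat, p_hat, history = fit(x, t)
gamma_hat, _, _ = forward_backward(x, t, pi_hat, A_hat, p_hat)
```

The recorded log-likelihood should never decrease from one iteration to the next, which gives a simple sanity check on any implementation. The starting values matter: with `p` initialized so that the second entry is larger, state 2 ends up as the enriched state.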
After the EM iterations converge, the estimated model parameters $\theta$ are obtained. Given the estimated $\theta$, the Viterbi algorithm is applied to maximize the joint likelihood in (8) and obtain the maximum a posteriori (MAP) estimate of the methylation status $z_n$.
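The decoding step can be sketched with a textbook log-domain Viterbi recursion. The Python code below assumes binomial emissions with success probabilities `p` and fixed, already estimated parameters; the parameter values and counts are illustrative, and the function names are hypothetical.

```python
import math

def binom_logpmf(x, t, p):
    """Log binomial probability of x IP reads out of t combined reads."""
    log_choose = math.lgamma(t + 1) - math.lgamma(x + 1) - math.lgamma(t - x + 1)
    return log_choose + x * math.log(p) + (t - x) * math.log(1.0 - p)

def viterbi(x, t, pi, A, p):
    """Most probable state path (1 = unmethylated, 2 = methylated).

    Assumes 0 < pi < 1, all entries of A positive and 0 < p[k] < 1,
    so that every logarithm is finite.
    """
    log_init = [math.log(pi), math.log(1.0 - pi)]
    score = [log_init[k] + binom_logpmf(x[0], t[0], p[k]) for k in (0, 1)]
    back = []                                   # back-pointers per bin
    for n in range(1, len(x)):
        new_score, ptr = [], []
        for k in (0, 1):
            cands = [score[j] + math.log(A[j][k]) for j in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new_score.append(cands[best] + binom_logpmf(x[n], t[n], p[k]))
        score = new_score
        back.append(ptr)
    state = 0 if score[0] >= score[1] else 1    # best final state
    path = [state]
    for ptr in reversed(back):                  # trace back to the first bin
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [s + 1 for s in path]

path = viterbi(x=[2, 3, 1, 9, 8, 9, 2, 1], t=[10] * 8,
               pi=0.8, A=[[0.9, 0.1], [0.2, 0.8]], p=(0.2, 0.9))
# path == [1, 1, 1, 2, 2, 2, 1, 1]
```

Working in the log domain avoids the numerical underflow that multiplying many small probabilities would cause on long transcripts.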
Peak region detection
In order to evaluate the statistical significance of the putative peak regions predicted by the Viterbi algorithm, the log odds ratio of the posterior for the peak state ($z_n = 2$) over the posterior for the background state ($z_n = 1$) is computed as
$$\mathrm{PeakScore}_n = \log \frac{\gamma(z_{n,2})}{\gamma(z_{n,1})} \tag{13}$$
Briefly, this log-transformed scoring method [22–24] uses the posterior probability of each bin to assess the confidence of a potential peak region. A potential peak region is defined as a run of consecutive bins that the Viterbi algorithm assigns to the peak state, and its PeakScore is calculated as the average of the PeakScores of its constituent bins. Next, PeakScore is assumed to follow a Gaussian distribution with mean (mean(PeakScore)) and standard deviation (std(PeakScore)) [24], estimated from all the bins. After z-transforming the PeakScores, a one-sided test for the significance of each potential peak region is conducted and a p-value is calculated. Finally, the Benjamini-Hochberg method [25] is used to correct for multiple testing and compute the False Discovery Rate (FDR).
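The scoring and testing steps can be sketched in Python as follows. The sketch assumes strictly positive posteriors (so the log odds are finite), uses the sample standard deviation (the text does not say which estimator is used), and takes the one-sided p-value as the upper tail of the standard normal distribution. The function names are hypothetical.

```python
import math

def bin_scores(gamma):
    """Per-bin PeakScore: log odds of the peak state over the background state.

    gamma[n] = (posterior of state 1, posterior of state 2), both > 0.
    """
    return [math.log(g[1] / g[0]) for g in gamma]

def peak_regions(path):
    """(start, end) index pairs, end exclusive, of runs of state 2."""
    regions, start = [], None
    for n, z in enumerate(path):
        if z == 2 and start is None:
            start = n
        elif z != 2 and start is not None:
            regions.append((start, n))
            start = None
    if start is not None:
        regions.append((start, len(path)))
    return regions

def region_pvalues(scores, regions):
    """One-sided p-value of each region's mean score after a z-transform."""
    mu = sum(scores) / len(scores)
    sd = math.sqrt(sum((s - mu) ** 2 for s in scores) / (len(scores) - 1))
    pvals = []
    for a, b in regions:
        z = (sum(scores[a:b]) / (b - a) - mu) / sd
        pvals.append(0.5 * math.erfc(z / math.sqrt(2)))   # P(Z >= z)
    return pvals

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):          # from the largest p-value down
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted

regions = peak_regions([1, 1, 2, 2, 1, 2])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Regions whose adjusted p-value falls below a chosen FDR threshold would then be reported as peaks; the threshold itself is a user choice and is not fixed by the method.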