HEPeak: an HMM-based exome peak-finding package for RNA epigenome sequencing data

Background Methylated RNA Immunoprecipatation combined with RNA sequencing (MeRIP-seq) is revolutionizing the de novo study of RNA epigenomics at a higher resolution. However, this new technology poses unique bioinformatics problems that call for novel and sophisticated statistical computational solutions, aiming at identifying and characterizing transcriptome-wide methyltranscriptome. Results We developed HEP, a Hidden Markov Model (HMM)-based Exome Peak-finding algorithm for predicting transcriptome methylation sites using MeRIP-seq data. In contrast to exomePeak, our previously developed MeRIP-seq peak calling algorithm, HEPeak models the correlation between continuous bins in an m6A peak region and it is a model-based approach, which admits rigorous statistical inference. HEPeak was evaluated on a simulated MeRIP-seq dataset and achieved higher sensitivity and specificity than exomePeak. HEPeak was also applied to real MeRIP-seq datasets from human HEK293T cell line and mouse midbrain cells and was shown to be able to recapitulate known m6A distribution in transcripts and identify novel m6A sites in long non-coding RNAs. Conclusions In this paper, a novel HMM-based peak calling algorithm, HEPeak, was developed for peak calling for MeRIP-seq data. HEPeak is written in R and is publicly available.


Background
RNA methylation is an emerging area that studies chemical modifications in the nucleotides of RNAs [1][2][3][4]. Such modification in especially coding mRNAs or transcripts has been shown [5,6] or speculated to play a critical role in regulating cellular functions [7][8][9]. However, the overall mechanism by which mRNA is methylated and the related functions in different contexts including various diseases are still elusive. Deciphering their functions and regulations under various contexts represents a grand challenge facing the biology community.
The state-of-the-art high throughput technology that enables the detection of RNA methylation in transcriptome is an affinity-based shotgun sequencing approach known as Methylated RNA immunoprecipitation (IP) sequencing (MeRIP-Seq) [2]. MeRIP-Seq was first introduced in recent studies [1,2,10,11] on transcriptome-wide mRNA m 6 A methylation and is a high throughput sequencing assay that is designed for transcriptomewide survey of RNA epigenetics [6]. As shown in Figure 1, in MeRIP-seq, mRNA is first fragmented before immunoprecipitation with anti-m6A antibody, and then the immunoprecipitated and control mRNA fragments are subject to sequencing. The output includes an IP and a control sample, which measure the immunoprecipitated m 6 A-methylated mRNA reads and the mRNA expression (or RNA-seq measurement), respectively. These paired samples are used to reconstruct the transcriptome-wide m 6 A methylome. While MeRIP-seq has demonstrated high accuracy in identifying the cell-specific transcriptome methylation patterns, as a nascent assay, MeRIP-Seq poses unique bioinformatics challenges that call for novel and sophisticated statistical computational algorithms.
From a biological perspective, MeRIP-Seq can be thought as a combination of two well-studied methods: ChIP-Seq [12][13][14] and RNA-Seq [15,16]. Like ChIP-seq, reads accumulate around the methylation sites to form peaks. Unlike ChIP-seq based measurements for DNA methylation, MeRIP-seq measures mRNA methylation and hence produces read peaks around the methylation sites that span two or more exons. In addition, the control sample of MeRIP-seq measures mRNA expression, which, compared to those in ChIP-Seq, can vary much more drastically in different cells or tissues. Due to these unique features, ExomePeak [17] was developed specifically for peak calling, or methylation site prediction, in MeRIP-seq. Although ExomePeak can perform fairly robust exomebased peak calling, it ignored the dependency of reads, and therefore could either miss true peaks with low intensity or erroneously predict narrow, noisy outliers as true peaks. In this paper, we introduce HEPeak, a novel Hidden Markov model (HMM) for exome-based peak calling algorithm. The test results showed that HEPeak improved both prediction sensitivity and specificity over ExomePeak.

HEPeak pipeline
To address the aforementioned MeRIP-seq issues, HEPeak includes several high-throughput sequencing tools in its pipeline. First, HEPeak utilizes TopHat [18] to align fragmented mRNA reads to the reference transcriptome, allowing short reads to span exon-exon junctions. Next, SAM-tools [19] is applied to exclude the multi-mapping reads and index alignment results. After these pre-processing steps, HEPeak performs HMM-based peak calling on the exons of each gene, where the introns are excluded, to identify the genomic locus of methylation sites. The output result of HEPeak is in BED format, which can be visualized together with input alignments in IGV2.1 [20].

Exome-based peak calling
The goal of peak calling in MeRIP-seq is to detect regions in transcripts where the read counts in the IP sample is more "enriched" than those in the control sample. Just as with ExomePeak, our previously developed peak calling algorithm for MeRIP-seq, HEPeak performs the peak calling on connected exons of a specific gene, a clear contrast to genome-based ChIP-seq peak calling methods, such as MACS [21]. This projection of genome onto transcriptome effectively circumvents the difficulty due to the ambiguity of isoforms' assignment but it still preserves the convenience of gene-based annotation, making biological interpretation of the prediction straightforward.

The definition of HMM for MeRIP-seq data
Given a particular mRNA (RefSeq gene), its concatenated exons are first divided into N mutually connected bins, whose size is selected as the read length L. With respect to the n th bin, the unknown hidden methylation status is denoted as z n ∈ {1, 2} where 1 represents unmethylation and 2 otherwise. Since a peak likely spans multiple bins, we assume that the methylation status z n follows a first order Markov chain, whose transition matrix A contains entries defined as A jk = P(z n = k|z n−1 = j), j, k ∈ {1, 2} (1) Figure 1 Illustration of MeRIP-seq. Total RNAs are collected from cells and fragmented into~100 nt long. A part of fragmented RNAs is taken as the control sample, which is submitted for sequencing. The remaining part is subject to immunoprecipitation by antibody to isolate the methylated sequences, which are then submitted for sequencing.
where A jk denotes the probability for the latent variable switching from the status j at the (n -1) th bin to the status k at the n th bin. Here j, k is the indicator of the hidden state. Additionally, we assume that the initial probability P(z l = 1) = π and P(z l = 2) = 1 -π.
Next, let x n denote the read counts in the IP sample and y n the counts in the control sample, both for bin n. We assume that, given the methylation status z n , these read counts follow the Poisson distribution defined as where M IP and M ctrl are the total reads (sequencing depth) in the IP and the control samples, respectively and λ IP,z n for z n = 1, or 2 and l ctrl are the normalized Poisson rates, respectively. It is worthwhile pointing out that λ IP,z n switches according to the status of z n ; on the contrary, l ctrl stays the same.
It would be intuitive next to define the relationship between the Poisson rates for the methylated and unmethylated in the IP and the control sample, respectively. However, unlike in ChIP-seq, where this relationship is mostly defined only for the IP sample, defining the relationship for both the IP and the control is non-trivial and model complexity also needs to be assessed to avoid potential difficulties in subsequent inference. To this end, we transform the formulation by observing that, given (2) and (3), the conditional probability of observing x n in the IP given the total reads in the control as t n = x n + y n follows the binomial distribution where Note that p z n for z n = 1 (or 2) can be considered as the percentage of the mean IP read counts in the combined read counts of the IP and control samples for a bin, when it is unmethylated (or methylated). The distribution (4) effectively combines the reads in the IP and control samples under one model. As such, instead of using (2) and (3), we define (4) as the emission probability of the proposed HMM and work with p z n directly. Doing so avoids modelling and inferring the potentially complex relationships between the rates. Given X = {x 1 , x 2 , x 3 ,..., x N }, a set of reads for N bins and Z = {z 1 , z 2 , z 3 ,..., z N }, the sequence of methylation, we use g(z n,k ) to denote the marginal posterior distribution of a latent variable z n at state k, and ε(z n-1 , z n ) to denote the joint posterior distribution of two successive latent variables, so that Here, the parameter is defined as θ = {A k,j ∀k∀j; π ; p k ∀k}. Then, the log likelihood for the proposed HMM chain can be expressed as We call this new formulation HEPeak or Hidden Markov Model (HMM)-based Exome Peak finding. The graphical model of HEPeak formulation is shown in Figure  2A. Compared with ExomePeak, HEPeak considers the correlation of the reads between adjacent bins and more accurately models the behaviour of methylated reads in MeRIP-Seq ( Figure 2B).

The EM solution
Given HEPeak, the goal is to call peaks, i.e., predict z n ∀n, and at the same time estimate the model parameters: θ. To this end, we developed an Expected-Maximization (EM) solution, which performs peak calling and parameter estimation in an iterative fashion. We provide the steps of the EM algorithm in the following. The detailed derivation is included in appendix.
At the m th iteration, proceed as follows. E step: Given parameter θ (m-1) , estimated at the m-1 step, calculate the posterior distribution of the latent variable P(Z|X, θ (m-1) ).
M step: Compute and update π (m) , A jk (m) and p k (m) for all j, k as After the EM iteration converges, the model parameter θ can be obtained. Given the estimated θ, the Viterbi algorithm is applied to maximize the joint likelihood in (8) to obtain the maximum a posteriori (MAP) estimate of the methylation status z n .

Peak region detection
In order to evaluate the statistical significance of the putative peak regions predicted by the Viterbi algorithm, the log odds ratio of the posterior for the peak state (z n = 2) over the posterior for the background state (z n = 1) can be computed as follows Briefly, this log-transformed scoring method [22][23][24] tries to utilize the posterior probability of each bin to assess the confidence of the potential peak region. The potential peak region is defined as consecutive bins predicted by the Viterbi and its PeakScore is calculated as the averaged PeakScores for all the combined bins. Next, PeakScore is assumed to follow a Gaussian distribution with mean (mean(PeakScore)) and standard deviation(std(PeakScore)) [24], estimated from all the bins. Then, after performing the z transform of Peak-Scores, a one-sided test for significance of the potential peak region can be conducted and p-value can calculated. Then, the Benjamini-Hochberg method [25] is utilized to correct the multiple testing and compute the False Discovery Rate (FDR).

Simulation test
Because we do not have the ground truth for the methylation status in real data, the performance of HEPeak was first validated using a simulated data, where read counts for the IP and the control samples were simulated according to the proposed HEPeak model.
Specifically, a total of 5000 genes, whose lengths were randomly selected from 500 nt to 3k nt, were generated. Reads of each gene in both IP and the control samples were allowed to vary according to the Poisson distribution, where we chose l ∈ (5~20) and assumed it constant for both methylated and unmethylated bins. Additionally, we set l IP ∈ (l ctrl , 100) when methylated and l IP = (0, l ctrl ), when unmethylated, resulting in 14200 peaks generated. The transition matrix A was defined as A = 0.7 0.3 0.1 0.9 and the initial probability π = 0.2 Note that A and π were based on the estimates obtained by HEPeak when applied to the real m 6 A data discussed in the next section. Figure 3 showed an illustration of the simulated data. In general, when a bin is methylated, there were more reads in IP than in control; otherwise, there were more reads in control.
The receiver operating characteristics (ROC) curve of the peak calling results is shown in Figure 4A and we can see that the ROC curve of HEPeak wraps around that of ExomePeak, which indicates that HEPeak achieves a higher detection sensitivity and specificity. The area under the curve (AUC) for HEPeak is 0.979, which is larger than that of ExomePeak (0.955). As shown in Figure  4B, the read distributions of a simulated gene with 10 bins marked as methylated peaks and 90 bins as unmethylated, the corresponding detection results show that HEPeak can correctly detect 8 out of 10 true peaks, with 1 false positive, while ExomePeak results in 7 false positives to get the same sensitivity.

Evaluation of HEPeak on real m 6 A MeRIP-seq data
To further validate the accuracy of HEPeak, we applied HEPeak to two m 6 A MeRIP-seq datasets including one  from human HEK293T cell line [1] and the other from the mouse midbrain cells [8]. The raw fastq datasets were obtained from Gene Expression Omnibus (GEO accession: GSE29714 and GSE47217). The datasets were preprocessed according to the HEPeak pipeline, where the raw data was first aligned to the reference hg19 and mm10 assembly by TopHat, and then peak calling was performed to predict the transcriptome-wide m 6 A methylation for each dataset. As a comparison, Exome-Peak was also applied to these datasets.
A large number of genes were predicted to have m 6 A methylation sites in both human and mouse datasets. For HEK293T dataset, HEPeak identified 24281 peaks on 10715 genes at a FDR < 0.025, whereas ExomePeak (at the default setting) reported 15164 peaks on 7344 genes. Out of all the genes, 7340 genes were predicted to be methylated by both HEPeak and ExomePeak, whereas 3375 genes were predicted only by HEPeak, as opposed to 44 genes uniquely reported by ExomePeak ( Figure 5A). For mouse midbrain cells, HEPeak discovered 25138 peaks on 11336 genes (FDR < 0.025); in contrast, ExomePeak detected 19324 peaks on 9421 genes. Among them, 9201 genes were shared by the two algorithms, while HEPeak identified 1915 more genes than ExomePeak ( Figure 5B). The above results demonstrate that more potential methylated genes ignored by ExomePeak, can be discovered by HEPeak, which makes use of dependency of consecutive bins and greatly boosts the detection sensitivity. The advantage of HEPeak becomes even clearer if we carefully examined the results in IGV for the two datasets ( Figure 6A and Figure 6B). Take HEK293T dataset for example. For gene SEC24A, visual inspection should confirm methylation where read counts in the IP sample show slight enrichment to that in control sample. HEPeak demonstrate a higher sensitivity by utilizing the whole consecutive bins to determine the peak region where reads are greatly enriched compared to other region. For gene MRPL45, both methods found m 6 A methylation sites. However, due to HMM, HEPeak correctly merged the two peaks into one peak.

HEPeak recapitulates previous reported m 6 A patterns
On average, HEPeak predicted 2.27 and 2.22 sites per gene in human and mouse, respectively. Next, we examined the pattern of m 6 A sites by mapping all the peaks to the transcriptome and tallying the distribution of m 6 A sites in genes. For mRNA residing peaks, about 45% of the peaks located in the 3'UTRs, about 35% in the CDS, and only less than 20% from the 5'UTR ( Figure 7). As shown in Figure 8, m 6 A methylation sites were significantly enriched near the stop codon and overly present in the 3'UTR for both human and mouse, indicating that m 6 A may be involved in transcriptional regulation, consistent with the reported results in previous studies [1,2]. To gain additional insights into prediction, DREME [26] was performed on the called peak sequences to predict the motif of the m 6 A methylation site. As shown in Figure 9, the most enriched motifs for the HEK293T cells and mouse midbrain cells are GGACH [10,11], which were identified bound by methytransferase METTL3 and METTL14 [27].

HEPeak revealed distribution of m 6 A in lncRNA
We next examined the m 6 A sites predicted by HEPeak in long non-coding RNAs (lncRNAs), i.e., non-coding RNAs of more than 300 bp in length. m 6 A sites were found in lncRNAs in [28,29]. In human HEK293T cells, about 1847 peaks were predicted in lncRNAs, which accounted for 12.1% of the total predicted peaks ( Figure 10). Similarly, in mouse midbrain cells, 2759 peaks (10.9% of the total peaks) were detected in lncRNAs. We then examined the distribution of the peaks in lncRNA in human HEK293T cells and found it is significantly different from that in mRNAs ( Figure 11). Instead of being enriched near the stop codon in mRNAs, m 6 A sites in lncRNAs favour 5'UTR over 3'UTR. A similar pattern was also observed for mouse midbrain cells. These findings imply that the regulatory functions in mRNAs may be different from those in lncRNAs.

Conclusion
In this paper, a novel HMM-based peak calling algorithm, HEPeak, was developed for peak calling for MeRIP-seq data. By introducing the exome-based annotation, HEPeak circumvents the ambiguity related to isoforms. In order to characterize correlation between continuous bins in an m 6 A peak region, HEPeak utilized  HMM to model the dependency. Additionally, IP reads and control reads are modelled in one mathematical model to avoid separate HMM peak-calling procedures in IP and control as in RIPSeeker [24]. Compared with ExomePeak, which treated each bin independently, HEPeak was shown to achieve higher detection specificity and sensitivity in the simulated data. When applying HEPeak to the collection of two published MeRIP-seq data from human and mouse, the results revealed that m 6 A methylation extensively existed in genes. HEPeak showed higher sensitivity than ExomePeak and predicted more novel m 6 A sites. Particularly, almost all the peaks detected by ExomePeak can be found by HEPeak. Moreover, with respect to the peak regions, m 6 A sites called by HEPeak were biologically more meaningful than ExomePeak, by connecting separate m 6 A sites together, of which gaps were not tested significantly enriched by ExomePeak due to the limitation of the independence assumption.
Furthermore, in both human and mouse mRNAs, the distributions of m 6 A sites were similar, where more m 6 A sites were observed in the 3'UTR as supposed to CDS and 5'UTR, and the sites were significantly enriched near the stop codon as previously reported. These findings highly suggest that m 6 A may play a role in transcriptional regulation. In addition, we examined the sequence motif of the predicted m 6 A sites and found that both human and mouse shared the similar m 6 A motif -GGACH. This consistency suggests that m 6 A methylation uses the same mechanism in different cells and species. Moreover, m 6 A sites were also predicted in lncRNAs but bear a different distribution from that in mRNAs, implying that m 6 A may have different roles in regulating mRNAs and lncRNAs. (18) Given x n follows a binomial distribution, then p(x n |z n,k ; t n ) = t n x n * p k x n (1 − p k ) t n −x n ⇔ ln p(x n |z n,k ; t n , p) = ln t n ! − ln x n ! − ln y n ! +x n * ln p k + (t n − x n ) * ln(1 − p k ) Thus, p k can be computed through maximizing the likelihood function of the total probability, the same as setting the first derivative equal to zero,  In the same fashion, π k and A j,k can be computed, ∂Q ∂A j,k = 0 ⇒ A j,k = N n=2 ε(z n−1,j , z n,k ) 2 l=1 N n=2 ε(z n−1,j , z n,l ) List of abbreviations used HMM, Hidden-Markov Model; FDR, False discovery rate; HEPeak, HMM-based exome peak calling method; ExomePeak, Exome-based peak calling method; MeRIP-seq, Methylated RNA Immunoprecipatation combined with RNA sequencing; EM, Expectation of maximum likelihood method; CDS, Coding DNA sequence; UTR, Untranslated region.