Histone modifications involved in cassette exon inclusions: a quantitative and interpretable analysis
© Liu et al.; licensee BioMed Central. 2014
Received: 26 August 2014
Accepted: 20 November 2014
Published: 19 December 2014
Chromatin structure and epigenetic modifications have been shown to involve in the co-transcriptional splicing of RNA precursors. In particular, some studies have suggested that some types of histone modifications (HMs) may participate in the alternative splicing and function as exon marks. However, most existing studies pay attention to the qualitative relationship between epigenetic modifications and exon inclusion. The quantitative analysis that reveals to what extent each type of epigenetic modification is responsible for exon inclusion is very helpful for us to understand the splicing process.
In this paper, we focus on the quantitative analysis of HMs’ influence on the inclusion of cassette exons (CEs) into mature RNAs. With the high-throughput ChIP-seq and RNA-seq data obtained from ENCODE website, we modeled the association of HMs with CE inclusions by logistic regression whose coefficients are meaningful and interpretable for us to reveal the effect of each type of HM. Three type of HMs, H3K36me3, H3K9me3 and H4K20me1, were found to play major role in CE inclusions. HMs’ effect on CE inclusions is conservative across cell types, and does not depend on the expression levels of the genes hosting CEs. HMs located in the flanking regions of CEs were also taken into account in our analysis, and HMs within bounded flanking regions were shown to affect moderately CE inclusions. Moreover, we also found that HMs on CEs whose length is approximately close to nucleosomal-DNA length affect greatly on CE inclusion.
We suggested that a few types of HMs correlate closely to alternative splicing and perhaps function jointly with splicing machinery to regulate the inclusion level of exons. Our findings are helpful to understand HMs’ effect on exon definition, as well as the mechanism of co-transcriptional splicing.
Nucleosome is the fundamental building unit of eukaryotic chromatin, consisting of ∼147-bp double-helical DNA (nucleosomal DNA) wrapped around a histone octamer in the left superhelix. In all, 75–90% of genomic DNA is packaged into nucleosomes, with adjacent nucleosomes being separated by stretches of DNA, which are referred as linker sequences [1, 2]. Various types of post-translational covalent modifications imposed on the histone tails, including acetylation, methylation, phosphorylation and ubiquitination [3, 4], have been shown to play important roles in various biological processes, such as transcription regulation, co-transcriptional splicing, DNA replication and repair, by functioning alone or jointly to change the charge of the nucleosome particle, and/or by recruiting other protein effectors [5–7].
With the advent of genome-wide ChIP-chip and RNA-seq techniques, mapping of global patterns of nucleosome positioning and epigenetic modifications has become commonplace and has been performed on many organisms [8–11]. One insight revealed by such large-scale biological datasets is that exons are marked with high nucleosome-occupancy levels in contrast to introns, and the occupancy pattern of nucleosomes is independent of the gene expression level . The nucleosome distribution difference can not be completely explained by GC content difference between exon and intron . Therefore, nucleosome positioning has been supposed to be finely regulated by transcription machinery and chromatin remodeling complexes to assist the recognition of splicing sites. The concept of exon/intron definition was accordingly proposed to describe the function of nucleosome positioning in RNA precursor splicing [13–15]. Also, it has been shown that alternative splicing is a major mechanism for promoting transcriptome and proteome diversity, particularly in mammals . As indicated by Wang et al., 92–94% of human genes undergo alternative splicing (AS) . Eight type of AS, including skipping exon (ES), mutually exclusive exon (ME), alternative 5’ splice site selection (A5SS), alternative 3’ splice site selection (A3SS) and intron retention (IR), Alternative first exon (AFE), Alternative last exon (ALE) and Tandem 3’ UTRs, have been identified from 15 diverse human tissue and cell line transcriptomes by deep sequencing . Many studies have been conducted to explore the cross-talk between HMs and splicing machinery [5, 17, 18], and found that some type of HMs have high enrichments on exons rather than on introns [12, 19]. For example, H3K36me3 has been suggested to be a conserved epigenetic mark for exons, as it shows significant high level on expressed exons than on introns across many species and cell types [20, 21]. H3K9me was found to function jointly with the chromodomain protein HP1 γ to favor inclusion of alternative exons . Meanwhile, some other type of HMs, such as H3K79me1, H4K20me1, have also been shown to be highly enriched in internal exons than in flanking introns, implying that HM patterns are closely associated with the co-transcriptional splicing of RNA precursor [19, 22]. Luco et al. have demonstrated that different HM patterns led to different splicing outcomes in a set of human genes . Moreover, HMs have been supposed to participate in multiple type of alternative splicing . Enroth et al. trained a rule-based model to predict exon inclusions with considerable accuracy by using HMs proximal to cassette exons (CEs). The generated rules in the form of “IF …THEN…” imply that some specific combinational patterns of HMs facilitate exon inclusion while some other patterns inhibit exon inclusion . Zhu et al. modeled the quantitative relationship between HMs and CE inclusion by linear regression and showed that HMs were predictive of exon inclusion levels .
According to the inventory of known alternative isoforms , skipping exons are the most prevalent in mammals among eight types of alternative splicing events, and are mainly responsible for the proteome diversity. In this paper, we therefore focus on the quantitative analysis of which HMs affect the CE inclusion and to what extent HMs affect the CE inclusion, by integrating the ChIP-seq and RNA-seq high-throughput data of three human cell types of ENCODE Tier 1, Gm12878, H1-hESC and K562. More precisely, we regarded respectively the CE inclusion and surrounding HMs as response variable and explanatory variables, and then used logistic regression to model their quantitative relationship. The remarkably high prediction accuracy of CE inclusion based on HMs proximal to the splicing sites confirms previous qualitative conclusions that HMs involve in co-transcriptional splicing. We also showed that comparable accuracy can be obtained by using only a few types of HMs to build the logistic regression models, which implies that several types of HMs dominate the effect on CE inclusion, and their effects were further demonstrated to be independent of the expression levels of the genes hosting the CEs. HMs located in the flanking regions of CEs were taken into account in our analysis, and we found that HMs within bounded flanking regions (about three tandem nucleosomes away from the CE splicing sites) affect moderately CE inclusions. Interestingly, HMs located on CEs whose length are close to nucleosomal DNA show most predictive of CE inclusion. Considering the fact that the average length of human cassette exons is about 150 bp, we thus suppose that exons evolutionarily fit to the nucleosomal-DNA length so that exons exactly wrap the histones, which in turn facilitate epigenetic modifications to participate in co-transcriptional splicing.
Results and discussion
Nucleosome positioning is independent of cassette exon inclusion
Histone modifications correlate closely to cassette exon inclusion
We inspected whether HMs associate with the inclusion level of CEs. For the three types of cell lines, we derived the ChIP-seq signal distributions of all types of HMs on CEs and ±250 bp flanking regions in the similar way mentioned above. The 2–4th columns of Figure 1 showed the mean ChIP-seq signal distributions of the three types of HMs, H3K36me3, H3K9me3 and H4K20me1, for Gm12878 and K562 cell lines, Additional file 1: Figure S1 showed the mean ChIP-seq signal distributions of other seven type of HMs for Gm12878 and K562, and Additional file 1: Figure S2 showed the signal distributions of all type of HMs for H1-hESC cell line available on ENCODE. From these signal distributions, it can be found that the HM signals corresponding to the two different inclusion levels of CEs diverges greatly in comparison to nucleosome positioning signals. H3K36me3 showed positive correlation to CE inclusion levels, i.e., the higher the CE inclusion levels, the stronger the signals on both CEs and flanking regions. H4K20me1 display similar signal distributions to H3K36me3, although its divergence between two classes was not as large as that of H3K36me3. H3K9me3 and H3K27me3 display slightly different distributions, the mean ChIP signals on CEs of high-class CEs is stronger than that of the low-class, while the contrary is the case on the flanking regions. Other type of HMs, including H3K9ac, H3K27ac, H3K4me3, H3K4me2 and H3K79me2, showed significantly different signal distributions that were negative correlated to CE inclusion levels, i.e., the low-class had stronger signals compared to the high-class. Interesting, these HMs can be categorized into two types in the view of their signal curve shape: bell and slash. As HMs on flanking regions have also been reported to affect CE inclusion, we regarded the HM signals of upstream and downstream regions ±250 bp flanking CEs as independent variables, and thus each type of HM corresponds to three components, denoted by suffix _up, mid and _down, respectively. We calculated the Pearson correlation coefficients between each pair of variables and performed hierarchal clustering, as shown in Additional file 1: Figure S3. Also, the correlation coefficients between CE inclusion level (represented as miso_level) and HMs were also shown in Additional file 1: Figure S3. It is found that the three components of each type of HMs positively correlated to each other. Also, H3K27ac, H3K9ac, H3K4me2, H3K4me3 and H3K79me2 were positively correlated to each other and thus formed a dense cluster, while H3K36me3, H3K9me3, H3K27me3 and H4K20me1 diverged from other HMs and formed another relatively loose cluster. In particular, we found that H3K36me3_mid and H3K36me3_down had considerably positive correlation coefficients to CE inclusion level. Take Gm12878 as an example, the correlation coefficients were 0.21 (p-value = 3.7e-148) and 0.143 (p-value = 3.1e-68). In addition, H4K20me1_mid also had positive correlation coefficients to CE inclusion level for all three cell lines, while H4K20me1_down and H3K9me3_down showed negative correlation. In summary, HMs show considerably divergent signal distributions corresponding to different CE inclusion levels, especially compared to the nucleosome positioning signal distribution, we therefore suggested that some types of HMs, such as H3K36me3 and H4K20me1, on CEs and flanking regions were closely related to CE inclusion.
Modeling CE inclusion by Logistic regression
We go one step further to computationally model the correlation between HMs and CE inclusion. The distributions of CE frequencies with respect to the binned inclusion levels were derived based on the ψ values for the three cell lines, as shown in Additional file 1: Figure S4. We found that the the inclusion levels of most CEs are close to either 0 or 1, and the cumulative frequency distributions approach to the CDF of a specific Bernoulli distribution, indicating that most CEs are universally either included into or excluded from mature RNAs in the measured cell population. It is rational for us to suppose that the inclusion levels of CE samples were generated from a binomial distribution, given that sequencing errors and the ambiguity of mapping short reads to genome are responsible for some noise samples.
Therefore, we regarded the inclusion of CEs as a two-class classification problem, for which a variety of classifiers can be applicable. For the sake of interpretability, we employed the logistic regression model to estimate the effect of each type of HMs on the inclusion of CEs, as its regression coefficients and odd-ratios are meaningful for evaluating the importance of each explanatory variable. Specifically, we regarded the signals of each type of HMs on CEs as an explanatory variables, and the CE inclusion as a binary response variable. HM signals on upstream and downstream regions ±250 bp flanking CEs are also respectively regarded as independent explanatory variables. For each cell line, we took ten types of HMs that expanded to 30 explanatory variables to build the logistic regression models. The inclusion levels were discretized by applying a predefined cutoff δ (0.5 ≤ δ < 1), so that the inclusion levels greater than δ were set to 1 and the inclusion levels less than 1-δ were set to 0.
It is helpful to reinspect the relationship between nucleosome positioning and CE inclusions by applying the logistic regression model. Therefore, we built logistic models based on only nucleosome signals using the same way mentioned above. The AUC and accuracies of the models built on nucleosome positioning signals were also shown in Figure 1 for Gm12878 and K562 (H1-hESC’s nucleosome positioning is not available). Both ROC and accuracy curves were very close to 0.5 and kept almost unchanged when δ increased from 0.5 to 0.975, implying that the prediction accuracy was roughly equal to random guess. The results confirm our previous conclusion that nucleosome positioning is independent to CE inclusion.
HMs are predictive of differentially expressed CEs
Number of differentially expressed CE samples (#CEs) and AUC values of the logistic regression models for predicting differentially expressed CE inclusion across cell lines
Cell line for testing
Cell line for training
Several types of HMs play dominant role on regulate CE inclusion
To estimate the effect of each type of HM on CE inclusion, we inspected the regression coefficients and odds-ratios obtained by logistic regression. Additional file 1: Figure S5 illustrated the regression coefficients and odds ratios on natural-log scale for Gm12878. Overall, H3K36me3, H3K9me3, H4K20me1 and H3K27me3 possessed significant regression coefficients and log-scaled odds-ratios compared to the others type of HMs. Especially, H3K36me3_mid and H3K36me3_down have large positive regression coefficients and odds-ratios, which implies the dominant role in CE inclusion played by H3K36me3. This is consistent with a variety of previous studies that H3K36me3 marks expressed exons [20, 21]. For the K562 and H1-hESC cell lines, we got similar results that H3K36me3, H3K9me3 and H4K20me1 still hold the top three large regression coefficients, as shown in Additional file 1: Figure S6-S7. In addition, it is worth noting that H3K9me3_mid in H1-hESC achieved even larger regression coefficient and log-scaled odds-ratio than H3K36me3_mid and H3K36me3_down. This is consistent with another straightforward evidence that H3K9me3 participates in the inclusion of alternative exons by functioning jointly with chromodomain protein HP1 γ . H3K9me3_down, H4K20me1_down and H3K27me3_down had large negative regression coefficients and log-scaled odds-ratios, which indicated that the HMs succeeding the CEs function to inhibit CE inclusion. In fact,the inhibitive function on CE inclusion of these downstream HMs has also been suggested by Enroth et al. . This is also consistent with Podlaha et al.’s conclusion that HMs correlated to CE inclusion via specific spatial patterns along the upstream and downstream regions around CEs . Although no biologically experimental evidence supporting the function difference played by HMs located relatively different positions have been proposed, we suppose that these in silico results provide a strong hint for biochemical validation.
AUC values and accuracies of the logistic regression models built on the three simplified and full model
Variables in model
HM’s effect on CE inclusion does not depend on gene expression level
The elongation speed of RNA Pol II has been suggested to affect CE inclusion and shape the pattern of HMs [31, 32]. Slow speed of RNA Pol II elongation nearby splicing sites is helpful for exon recognition by splicing factors and thus favors exon inclusion . This inspires our interest in whether the effect of HMs on CE inclusion is correlated to the expression level of hosting genes. For this purpose, we calculated the number of Fragments Per Kilobase of exon per Million fragments mapped (FPKM) by cufflinks . FPKM was first proposed by Mortazavi et al.  and has been widely used as the gene expression levels by many studies [25, 27], as they have higher sensitivity and accuracy than microarray signals. In our analysis, FPKM values were normalized into range [0,1], and then mapped to CEs by virtue of the annotation file that links CEs to their hosting genes .
HMs in flanking regions have only limited effect on CE inclusion
HMs on approximate nucleosomal DNA-length CEs greatly affect CE inclusion
RNA precursor splicing and transcription elongation are traditionally deemed to function separately, but increasing evidence suggests that splicing is functionally coupled to transcription elongation . It has been proposed that specific HMs may help to recruit related TFs or splicing factors, or to change the RNA Pol II elongation speed, which can affect the assembly of spliceosomes or the recognition of splicing sites [12, 19]. However, few studies have paid attention to the quantitative relationship between HMs and exon inclusions. In this paper, we used high-throughput ChIP-seq and RNA-seq data covering three human cell lines to carry out a comprehensive quantitative analysis, and found that HMs along exons and flanking regions are strongly predictive for exon inclusions.
Consistent to previous studies [19, 20, 23], H3K36me3 was found to be most predictive for CE inclusions in our analysis. In particular,  utilized biological experiments (RNA interference, RNA immunoprecipitation and quantitative RT-PCR) to validate the causal role of HMs in alternative splice site selections, and indicated that H3K36me3 could interact with PTB protein to regulate alternative splicing. The observations validate our findings. H3K9me3 was also shown to play considerable effect on exon inclusions, which has been demonstrated to favor exon inclusion by involving the recruitment of HP1 γ protein . It is worth to note the bidirectional functions of H3K9me3, i.e., its high level on promoter inhibits gene expression while its enrichment on gene body favors exon inclusion. In addition, H4K20me1 has been suggested to involve exon inclusion by several previous works [14, 24, 25] and our analysis. Although the mechanism of H4K20me1 affecting exon inclusion has not been experimentally confirmed, our analysis may indicate its genuine role and help planning biochemical experiments.
Some researchers have revealed the quantitative correlation between HMs on promoter and gene expression level. For example, causal relationship between HMs and gene expression has been inferred by building Bayesian network . Karlic et al.  showed the high prediction accuracy by modeling the relationship between HMs and gene expression level using simple linear regression. Furthermore, HMs on transcribed regions have been shown to be more predictive for gene expression level than those on promoters . It seems that the predictability of HMs for exon inclusion should be affected by gene expression level. But our analysis suggests that the predictability is independent of the expression level of hosting genes. Moveover, HMs on exons are dominantly predictive compared to those on flanking regions, although the later can slightly improve the prediction accuracy.
However, some studies have indicated that the exon-intron marker of HMs is cell type-specific, and even some HMs patterns show bias towards introns . Stable HM profiles have been observed even if different splicing results are induced by RNA interference . Furthermore, the causality between HM patterns and exon inclusions has not been specified. In fact, both directions of causality have been reported. HMs have been shown to interact with splicing factors to influence alternative splicing , while a recent study indicated that specific HMs could be determined by pre-mRNA splicing . These findings confound the emerging concept of exon definition by HMs. Actually, the co-transcriptional splicing of RNA precursor is a complex and multi-step biological process that involves recognition of splice sites, spliceosome assembly and intron excision. HMs were suggested to participate in one or more of the processing steps, resulting in the establishment of specific HM patterns. For example, the dynamic change of histone acetylation and deacetylation has been shown to drive the spliceosome assembly and rearrangement . On the contrary, Zhou et al. have demonstrated that splicing regulators interact with histone modifying enzymes to modulate the chromatin structure for the sake of proper transcriptional elongation rate . The causality of HMs marking exons, i.e., HM patterns act as the role for recognition of splice sites or HM patterns are just the result of spliceosome activities, is not clear. More studies should be devoted to exploring the interacting mechanism of HM with splicing machinery.
Finally, it is worthy to point out that our analysis has some common findings with Zhu et al. in quantitatively modeling CE inclusion by HMs. Both our analysis and Zhu et al. found HM signals around CEs were predictive of CE inclusion level, and the quantitative correlation is conservative across cell types. But our analysis differed from Zhu et al.’ findings in at least three aspects . First, we show that HMs’ effect on CE inclusion do not depend on gene expression level. Second, we analyzed the the effect of HMs located in flanking regions, and found that HMs within bounded flanking regions moderately affect CE inclusion, while Zhu et al. overlook this important phenomenon. Third, we explored the relationship between exon length and CE inclusion, and found that HMs on approximate nucleosome-length exons affect mostly on CE inclusion. Besides, the sequencing depth of the ChIP-seq and RNA-seq data of the ENCODE datasets used in our analysis is far larger than that used by Zhu et al., which enable us to mine more deeply to the epigenetic regulation.
In this paper, we carried out a comprehensive quantitative analysis of HMs’ effect on CE inclusion by integrating the ChIP-seq and RNA-seq high-throughput data of three human cell lines, Gm12878, H1-hESC and K562. We employed the logistic regression model to capture the quantitative relationship between HMs and CE inclusion, and obtained significant prediction accuracy based on the HMs located on CEs and flanking regions. We also showed that considerably high accuracy can be obtained by using only a few types of HMs, including H3K36me3, H3k9me3 and H4K20me1, to learn logistic regression models, which implies that several types of HMs dominate the effect on CE inclusion. We went further to demonstrate that HMs’ effect on CE inclusion is conservative across cell tpye and does not depend on the expression levels of the genes hosting the CEs. Furthermore, HMs located in the regions flanking CEs were also taken into account in our analysis, and were shown to had statistically significant but limited effect on CE inclusions. Interestingly, HMs on CEs whose lengths were close to nucleosomal DNA showed most predictive of CE inclusion. Based on the findings we concluded that HMs on CEs and flanking regions function jointly with splicing machinery to regulate the RNA precursor splicing.
We chose the three human cell lines of Tier 1 cell types of ENCODE , Gm12878, K562 and H1-hESC, because both the CE annotations and high-throughput sequencing data (ChIP-seq and RNA-seq) are available, which facilitates us to carry out quantitative studies.
ChIP-seq and RNA-seq data
High-throughput ChIP-seq datasets of HMs tracks published by Broad Institute of MIT and Harvard were employed in our analysis . For each cell line, only ten type of HMs, including H3K36me3, H3K9me3, H4K20me1, H3K27ac, H3K9ac, H3K4me2, H3K4me3, H3K79me2, H3K27me3 and H3K4me1, were taken into account. Other protein factors, such as H2A.Z, RNA Pol II, etc. were excluded. Short reads were aligned to the human reference genome (GRCh37/hg19) using MAQ  with default parameters, and fragment densities were computed by counting the number of reads overlapping each 25 bp window along the genome.
ChIP-seq datasets of nucleosome positioning tracks  for Gm12878 and K562 provided by Stanford University were obtained. Short reads were mapped to the human reference genome (GRCh37/hg19) in color-space with the probabilistic mapper, DNAnexus. Nucleosome density signals were generated by first shifting reads by 74 bp in the 5’ to 3’ direction and then counting the total number of reads starting at each genomic coordinate on both strands. These counts were then smoothed using a 60 bp window.
Pair-end RNA-seq tracks provided by California Institute of Technology (Caltech) were used for our analysis . Raw reads (2 × 75 bp) were mapped to the reference human genome (version hg19) using TopHat  with default settings, with the exception of specifying an empirically determined mean inner-mate distance. The datasets used in our analysis of the three cell lines are available in the Additional files 2, 3 and 4, respectively.
CE annotations and inclusion analysis
All CE events were obtained from the annotation files provided by Katz et al. . The annotation files (release 2) included all alternative splicing events that were derived from the inventory of human genes and mRNA isoforms published by Nilsen et al. . We made use of this catalog to get the average length of the human cassette exons. There are in total 42485 human cassette exons in release 2 and the average length is 146.17 bp.
RNA-seq data were analyzed by MISO , a statistical model that estimates the expression of alternatively spliced exons and isoforms, and assesses confidence in these estimates, to determine the inclusion level of each CE. In particular, MISO outputs the “percentage spliced in” (PSI or ψ) that denotes the fraction of mRNA isoforms including CEs of interest. On the basis of ψ, we partitioned the CE samples into different bins. For example, we define the 0.75–0.25 partitioning rule as that the high portion consists of CEs with inclusion level greater than 0.75, the low portion consists of those with inclusion level less than 0.25 and the middle portion consists of the rest ones. After removal of those skipping exons whose hosting genes are not expressed or counts of mapped short reads are too rare to compute the inclusion ratio with statistical significance (MISO parameter min_event_reads is set to 20), there are 14762, 17142 and 16052 skipping exons for Gm12878, K562 and H1-hESC, respectively.
Differentially expressed CE detection
We used MISO  to detect differentially expressed CEs between samples. After the calculation of CE inclusion levels, MISO computes the Bayes factor, which represents the weight of the evidence in the data in favor of differential inclusion, for each exon between two samples. In particular, the Gm12878 and K562 include 2 RNA-seq replicates, while H1-hESC includes 4 RNA-seq replicates. We carried out differentially expressed CE detection for each pair of replicates, each of which came from two different cell lines. Therefore, we have 4 pairs of replicate-replicate comparisons for Gm12878 vs. K562, and eight pairs of replicate-replicate comparisons for both Gm12878 vs. H1-hESC and K562 vs. H1-hESC. After differentially inclusion detection, differentially expressed CEs were filtered out based on their coverage or magnitude of change. We filtered out CE samples by such parameter setting: Bayes_factor = 2 and delta - psi = 0.15. Finally, we performed intersect operator on all replicate-replicate comparisons of each pair of cell lines, and got the differentially expressed CEs. The numbers of differentially expressed CE samples are shown in Table 1. The differentially expressed CE samples are listed in Additional file 5.
HM signal calculation
HM signals on CE and flanking regions were calculated in two steps. First, the smoothed read counts falling into the regions of interest were added up. If only part of the smoothing windows overlaps the regions of interest, the read counts were allocated in proportion to the fraction of overlapping part of the smoothing window. Second, the total read count was divided by the number of base pairs included in the computed region to obtain the mean signal strengths of HMs over the target region.
in which x i represents a certain type of histone modification, y represents the CE inclusion level and β i is the regression coefficient.
We employ the standard software SAS 9.2 to conduct logistic regression analysis. For SAS’s logistic procedure, we adopt the likelihood score criterion for variable selection, i.e., χ2 values were calculated to evaluate the model to choose a new variable that achieves the highest χ2 value compared to other remaining variables not included the current model (i.e., best parameter is set to 1). The likelihood score curves for the three cell lines were shown in Additional file 1: Figure S8, from which we found that the chi-square score increases rapidly at the beginning and then tends to steady after about 10 variables were included in the model.
To get statistical robust model, we randomly extracted 1500 samples from each classes of samples after CE inclusion level discretization by adopting the SAS’s surveyselect process, so as to eliminate the effect of imbalanced number of training samples that are generated by different discretization thresholds δ. We thus obtained a dataset including 3000 CE samples, and next randomly extract two thirds of these 3000 samples as training set to build the logistic regression model and the remaining as an independent test set. Furthermore, the random sampling, training and test procedures were repeated 50 times to get the mean performance measures of the logistic regression model. For the experiments of predicting differentially expressed CEs across cell types, we used all samples and run the same process to build statistically significant models, as the numbers of differentially expressed CE samples are no more than 1500. The SAS code used in our analysis is available in Additional file 6.
This work was supported by the National Natural Science Foundation of China under grants No. 31100954, No. 61300100, No. 61272380 and No. 61173118.
- Luger K, Mader AW, Richmond RK, Sargent DF, Richmond TJ: Crystal structure of the nucleosome core particle at 2.8a resolution. Nature. 1997, 389: 251-260. 10.1038/38444.PubMedView ArticleGoogle Scholar
- Richmond TJ, Davey CA: The structure of dna in the nucleosome core. Nature. 2003, 423: 145-150. 10.1038/nature01595.PubMedView ArticleGoogle Scholar
- Kouzarides T: Chromatin modifications and their function. Cell. 2007, 128: 693-705. 10.1016/j.cell.2007.02.005.PubMedView ArticleGoogle Scholar
- Bernstein BE, Meissner A, Lander ES: The mammalian epigenome. Cell. 2007, 128: 669-681. 10.1016/j.cell.2007.01.033.PubMedView ArticleGoogle Scholar
- Saint-Andre V, Batsche E, Rachez C, Muchardt C: Histone h3 lysine 9 trimethylation and hp1γ favor inclusion of alternative exons. Nat Struct Mol Biol. 2011, 18 (3): 337-344. 10.1038/nsmb.1995.PubMedView ArticleGoogle Scholar
- Dion MF, Altschuler SJ, Wu LF, Rando OJ: Genomic characterization reveals a simple histone h4 acetylation code. Proc Natl Acad Sci U S A. 2005, 102: 5501-5506. 10.1073/pnas.0500136102.PubMed CentralPubMedView ArticleGoogle Scholar
- Unnikrishnan A, Gafken PR, Tsukiyama T: Dynamic changes in histone acetylation regulate origins of dna replication. Nat Struct Mol Biol. 2010, 17 (4): 430-437. 10.1038/nsmb.1780.PubMed CentralPubMedView ArticleGoogle Scholar
- Segal E, Fondufe-Mittendorf Y, Chen L, Thastrom A, Field Y, Moore IK, Wang JP, Widom J: A genomic code for nucleosome positioning. Nature. 2006, 442 (7104): 772-778. 10.1038/nature04979.PubMed CentralPubMedView ArticleGoogle Scholar
- Li Z, Schug J, Tuteja G, White P, Kaestner KH: The nucleosome map of the mammalian liver. Nature. 2011, 18 (6): 742-746.Google Scholar
- Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K, Roh TY, Peng W, Zhang MQ, Zhao K: Combinatorial patterns of histone acetylations and methylations in the human genome. Nat Genet. 2008, 40: 897-903. 10.1038/ng.154.PubMed CentralPubMedView ArticleGoogle Scholar
- Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell. 2007, 129: 823-837. 10.1016/j.cell.2007.05.009.PubMedView ArticleGoogle Scholar
- Schwartz S, Meshorer E, Ast G: Chromatin organization marks exon-intron structure. Nat Struct Mol Biol. 2009, 16 (9): 990-995. 10.1038/nsmb.1659.PubMedView ArticleGoogle Scholar
- Tilgner H, Nikolaou C, Althammer S, Sammeth M, Beato M, Valcarcel J, Guigo R: Nucleosome positioning as a determinant of exon recognition. Nat Struct Mol Biol. 2009, 16 (9): 996-1001. 10.1038/nsmb.1658.PubMedView ArticleGoogle Scholar
- Keren H, Lev-Maor G, Ast G: Alternative splicing and evolution: diversification, exon definition and function. Nat Rev Genet. 2010, 11 (5): 345-355. 10.1038/nrg2776.PubMedView ArticleGoogle Scholar
- Gelfman S, Burstein D, Penn O, Savchenko A, Amit M, Schwartz S, Pupko T, Ast G: Changes in exon-intron structure during vertebrate evolution affect the splicing pattern of exons. Genome Res. 2012, 22 (1): 35-50. 10.1101/gr.119834.110.PubMed CentralPubMedView ArticleGoogle Scholar
- Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456 (7221): 470-476. 10.1038/nature07509.PubMed CentralPubMedView ArticleGoogle Scholar
- Hnilicova J, Hozeifi S, Duskova E, Icha J, Tomankova T, Stanek D: Histone deacetylase activity modulates alternative splicing. PLoS One. 2011, 6 (2): 16727-10.1371/journal.pone.0016727.View ArticleGoogle Scholar
- Gan Q, Chepelev I, Wei G, Tarayrah L, Cui K, Zhao K, Chen X: Dynamic regulation of alternative splicing and chromatin structure in Drosophila gonads revealed by RNA-seq. Cell Res. 2010, 20 (7): 763-783. 10.1038/cr.2010.64.PubMed CentralPubMedView ArticleGoogle Scholar
- Andersson R, Enroth S, Rada-iglesias A, Wadelius C, Komorowski J: Nucleosomes are well positioned in exons and carry characteristic histone modifications. Genome Res. 2009, 19: 1732-1741. 10.1101/gr.092353.109.PubMed CentralPubMedView ArticleGoogle Scholar
- Kim S, Kim H, Fong N, Erickson B, Bentley DL: Pre-mrna splicing is a determinant of histone h3k36 methylation. Proc Natl Acad Sci. 2011, 108 (33): 13564-13569. 10.1073/pnas.1109475108.PubMed CentralPubMedView ArticleGoogle Scholar
- Kolasinska-Zwierz P, Down T, Latorre I, Liu T, Liu XS, Ahringer J: Differential chromatin marking of introns and expressed exons by H3K36me3. Nat Genet. 2009, 41 (3): 376-381. 10.1038/ng.322.PubMed CentralPubMedView ArticleGoogle Scholar
- Huff JT, Plocik AM, Guthrie C, Yamamoto KR: Reciprocal intronic and exonic histone modification regions in humans. Nat Struct Mol Biol. 2010, 17 (12): 1495-1499. 10.1038/nsmb.1924.PubMed CentralPubMedView ArticleGoogle Scholar
- Luco RF, Pan Q, Tominaga K, Blencowe BJ, Pereira-Smith OM, Misteli T: Regulation of alternative splicing by Histone modifications. Science. 2010, 327 (5968): 996-1000. 10.1126/science.1184208.PubMed CentralPubMedView ArticleGoogle Scholar
- Zhou YP, Lu YL, Tian WD: Epigenetic features are significantly associated with alternative splicing. BMC Genomics. 2012, 13: 123-10.1186/1471-2164-13-123.PubMed CentralPubMedView ArticleGoogle Scholar
- Enroth S, Bornelo S: Combinations of Histone modifications mark exon inclusion levels. PLoS ONE. 2012, 7 (1): 29911-10.1371/journal.pone.0029911.View ArticleGoogle Scholar
- Zhu S, Wang G, Liu B, Wang Y: Modeling exon expression using histone modifications. PLoS ONE. 2013, 8 (6): 67448-10.1371/journal.pone.0067448.View ArticleGoogle Scholar
- Katz Y, Wang ET, Airoldi EM, Burge CB: Analysis and design of rna sequencing experiments for identifying isoform regulation. Nat Methods. 2010, 7 (12): 1009-1015. 10.1038/nmeth.1528.PubMed CentralPubMedView ArticleGoogle Scholar
- Thomas DJ, Rosenbloom KR, Clawson H, Hinrichs AS, Trumbower H, Raney BJ, Karolchik D, Barber GP, Harte RA, Hillman-Jackson J, Kuhn RM, Rhead BL, Smith KE, Thakkapallayil A, Zweig AS, Haussler D, Kent WJ, Encode Project Consortium: The encode project at UC Santa Cruz. Nucleic Acids Res. 2007, 35 (Database issue): 663-667.View ArticleGoogle Scholar
- Podlaha O, De S, Gonen M, Michor F: Histone modifications are associated with transcript isoform diversity in normal and cancer cells. PLoS Comput Biol. 2014, 10 (6): 1003611-10.1371/journal.pcbi.1003611.View ArticleGoogle Scholar
- Karlic R, Chung HR, Lasserre J, Vlahovicek K, Vingron M: Histone modification levels are predictive for gene expression. PNAS. 2010, 107 (7): 2926-29231. 10.1073/pnas.0909344107.PubMed CentralPubMedView ArticleGoogle Scholar
- Zhou HL, Hinman MN, Barron VA, Geng C, Zhou G, Luo G, Siegel RE, Lou H: Hu proteins regulate alternative splicing by inducing localized histone hyperacetylation in an rna-dependent manner. Proc Natl Acad Sci U S A. 2011, 108 (36): 627-635. 10.1073/pnas.1103344108.View ArticleGoogle Scholar
- Gunderson FQ, Merkhofer EC, Johnson TL: Dynamic histone acetylation is critical for cotranscriptional spliceosome assembly and spliceosomal rearrangements. Proc Natl Acad Sci U S A. 2011, 108 (5): 2004-2009. 10.1073/pnas.1011982108.PubMed CentralPubMedView ArticleGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by rna-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28 (5): 511-515. 10.1038/nbt.1621.PubMed CentralPubMedView ArticleGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by rna-seq. Nat Methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.PubMedView ArticleGoogle Scholar
- Roy M, Kim N, Xing Y, Lee C: The effect of intron length on exon creation ratios during the evolution of mammalian genomes. RNA. 2008, 14 (11): 2261-2273. 10.1261/rna.1024908.PubMed CentralPubMedView ArticleGoogle Scholar
- Moore MJ, Proudfoot NJ: Pre-mrna processing reaches back to transcription and ahead to translation. Cells. 2009, 136: 688-700. 10.1016/j.cell.2009.02.001.View ArticleGoogle Scholar
- Yu H, Zhu S, Zhou B, Xue H, Han JD: Inferring causal relationships among different histone modifications and gene expression. Res. 2008, 18 (8): 1314-1324.Google Scholar
- Dhami P, Saffrey P, Bruce AW, Dillon SC, Chiang K, Bonhoure N, Koch CM, Bye J, James K, Foad NS, Ellis P, Watkins NA, Ouwehand WH, Langford C, Andrews RM, Dunham I, Vetrie D: Complex exon-intron marking by histone modifications is not determined solely by nucleosome distribution. PLoS ONE. 2010, 5 (8): 12339-10.1371/journal.pone.0012339.View ArticleGoogle Scholar
- Ernst J, Kheradpour P, Mikkelsen TS: Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011, 473 (7345): 43-9. 10.1038/nature09906.PubMed CentralPubMedView ArticleGoogle Scholar
- Li H, Ruan J, Durbin R: Mapping short dna sequencing reads and calling variants using mapping quality scores. Nature. 2008, 18 (11): 1851-1858.Google Scholar
- Valouev A, Johnson SM, Boyd SD, Smith CL, Fire AZ, Sidow A: Determinants of nucleosome organization in primary human cells. Nature. 2011, 474 (7352): 516-520. 10.1038/nature10002.PubMed CentralPubMedView ArticleGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by rna-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28 (5): 511-515. 10.1038/nbt.1621.PubMed CentralPubMedView ArticleGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL: Tophat: discovering splice junctions with rna-seq. Bioinformatics. 2009, 25 (9): 1105-1111. 10.1093/bioinformatics/btp120.PubMed CentralPubMedView ArticleGoogle Scholar
- Nilsen TW, Graveley BR: Expansion of the eukaryotic proteome by alternative splicing. Nature. 2010, 463 (7280): 457-463. 10.1038/nature08909.PubMed CentralPubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.