Genome-wide prediction of cis-acting RNA elements regulating tissue-specific pre-mRNA alternative splicing
© Wang et al. 2009
Published: 7 July 2009
Skip to main content
© Wang et al. 2009
Published: 7 July 2009
Human genes undergo various patterns of pre-mRNA splicing across different tissues. Such variation is primarily regulated by trans-acting factors that bind on exonic and intronic cis-acting RNA elements (CAEs). Here we report a computational method to mechanistically identify cis-acting RNA elements that contribute to the tissue-specific alternative splicing pattern. This method is an extension of our previous model, SplicingModeler, which predicts the significant CAEs that contribute to the splicing differences between two tissues. In this study, we introduce tissue-specific functional levels estimation step, which allows evaluating regulatory functions of predicted CAEs that are involved in more than two tissues.
Using a publicly available Affymetrix Genechip® Human Exon Array dataset, our method identifies 652 cis-acting RNA elements (CAEs) across 11 human tissues. About one third of predicted CAEs can be mapped to the known RBP (RNA binding protein) binding sites or match with other predicted exonic splicing regulator databases. Interestingly, the vast majority of predicted CAEs are in intronic regulatory regions. A noticeable exception is that many exonic elements are found to regulate the alternative splicing between cerebellum and testes. Most identified elements are found to contribute to the alternative splicing between two tissues, while some are important in multiple tissues. This suggests that genome-wide alternative splicing patterns are regulated by a combination of tissue-specific cis-acting elements and "general elements" whose functional activities are important but differ across multiple tissues.
In this study, we present a model-based computational approach to identify potential cis-acting RNA elements by considering the exon splicing variation as the combinatorial effects of multiple cis-acting regulators. This methodology provides a novel evaluation on the functional levels of cis-acting RNA elements by estimating their tissue-specific functions on various tissues.
Alternative splicing of pre-mRNA is a major mechanism of diversifying the protein coding potential of eukaryotic genomes. According to studies using large-scale expressed sequence tags (EST), as high as 60% of human genes are estimated to undergo alternative splicing . Moreover, recent studies demonstrate that alternative splicing plays a pivotal role in regulating tissue-specific patterns of gene expression [2–7]. One mechanism for establishing tissue-specific alternative splicing is by modulation of the expression levels and/or intrinsic functions of "general" RNA binding proteins (RBPs) [8–10]. For instance, previous studies have reported that modulation in the relative concentrations of hnRNP A/B proteins and SR proteins can control both the alternative splice site choice and the inclusion/exclusion ratio of selected alternative exons [11–15]. In addition, another mechanism of tissue-specific alternative splicing is mediated by "tissue-specific RBPs", which accounts for the restricted expression of many RNA binding proteins to distinct tissues or developmental stages. A prototypic example is the role of Nova proteins which are crucial factors regulating brain-specific alternative splicing in the formation of synapses [16, 17]. More complicated regulatory mechanisms involving the combination of large numbers of trans-acting RBPs (RNA binding proteins that bind to the cis-acting RNA elements to control pre-mRNA splicing) and cis-acting elements (binding sites of trans-acting RNA binding proteins) remain unclear and further studies are warranted.
We applied this method using Affymetrix GeneChip® Exon Array data from 11 human tissues . Without loss of generality, only cassette exon alternative splicing is considered in this study. Data from the eleven tissues resulted in 55 paired experiments which predicted 652 statistically significant cis-acting RNA elements (CAEs) to be involved in tissue-specific alternative splicing. Of all the predicted CAEs, the majority are located within intronic regions close to 5'splice site (ss) or 3'ss. Nearly one-third of total predicted CAEs can be mapped to known RBPs' binding sites or exonic splicing prediction databases, and another two-thirds of CAEs may be potential splicing factors. Bipartite network demonstrates that the predicted elements can be classified into two different categories; one group of CAEs functions with RBPs only in a small set of tissues, while another group plays general but differential activities across a large number of tissues.
Cerebellum, spleen and testes have the largest diversities in exon expression when compared to other tissues (darker color in the lower diagonal of Figure 2). Interestingly, cerebellum and testes don't appear to have large average differential PSR ratios when compared to other tissues. Since average differential PSR ratios evaluate the average percentage of differentially expressed exons per transcript, this suggests that the alternative splicing events of these two tissues compared to other tissues is distributed broadly among transcripts. In contrast, thyroid and liver are found to have the highest differential PSR ratios but very low number of differential PSRs, which indicates that the alternative splicing events on thyroid and liver are taking place in a more specific set of genes.
In this study, we focus our analysis on one of the most important splicing variants, cassette exons , where exon inclusion and skipping leads to different types of protein isoforms. First, for each pair of tissue comparison, we use the estimation of overall gene expression to normalize the expression signals of each exon. The splicing index (SI), defined by Srinivasan K. et al , was then utilized to evaluate the relative quantity of splicing difference between two tissues. Second, we applied SplicingModeler, a computational tool we previously developed  to identify putative hexamers whose functional differences between two tissues potentially contribute to the differences in splicing patterns. SplicingModeler results in two scores for each candidate hexamer; Exon Inclusion Scores (EIC), indicating the importance of the specific hexamer; and Relative Functional Levels (RFL), where a positive or negative value suggests its role in exon inclusion in one of two comparing tissues. The most significant hexamers, which receive EIC scores that are more than 5 × IQR (interquantile range) away from median EIC score, are selected as the potential cis-acting elements that contribute to the splicing variations between two tissues.
Permutation tests are applied to validate the statistical significance of predicted cis-acting elements. For each pair of tissues, we randomized the order of observed SI values, and conducted prediction using SplicingModeler. This approach not only effectively disconnects the functional relationships between exons and their regulatory sequences, but also preserves the regulatory element contents and the distribution of the levels of splicing variation. Wilcoxon test was then conducted on the predicted EIC scores (Exon Inclusion Score) with the alternative hypothesis that the predicted cis-acting elements from the original data have greater EIC scores than those from the randomized data. Permutation tests between all the tissue comparisons resulted in significant p-values lower than 2.2e-16, which suggested that SplicingModeler predictions are biologically meaningful.
In order to estimate the functions of predicted elements in each tissue (not the relative functions in each tissue pair), we define Tissue-Specific Functional Level (TSFL), which is derived from the Relative Functional Levels (RFL) that are calculated for each hexamer (Equation 4) in each tissue pair. For each hexamer, the TSFL for each tissue is achieved by solving a reduced linear equation with 55 equations (total pairs of comparison) restricted by the relative relationships of 11 parameters, where each parameter models its function in the respective tissues (See Method section). The TSFL represents each hexamer's relative contribution to exon inclusion or exclusion in a specific mRNA isoform. For each element, the median value of their functions (TSFL) on exon inclusion among 11 tissues is set as baseline (0).
In this study, using an extended version of SplicingModeler, we successfully identified 652 cis-acting RNA elements in 11 human tissues that are important in regulating alternative splicing. Permutation tests for all experiments suggest that the predicted cis-acting elements are statistically significant and biologically relevant. Thirteen percent of predicted elements can be mapped to known binding sites of RNA binding proteins, and 31.4% can be matched in any one of the three databases that document biologically-determined known sites and predicted sites using other bioinformatics approaches.
Previous studies on splicing regulation tends to focus on exonic regions, our results suggest that intronic regions may also be equally, if not more, important for the regulation of tissue-specific alternative splicing. Consistent with previous observations [27–29], the majority of cis-acting elements predicted are in intronic regions, and only 23% are in exonic regions. An important exception, of all 38 elements regulating alternative splicing between cerebellum and testes, 32 are exonic regulators. Interestingly, of all the exonic regulators in cerebellum and testes, only one contributes more exon inclusion in testes than cerebellum, indicating that a large set of exonic cis-acting elements are regulating exon skipping in testes. Although alternative splicing is frequently observed in human brain and testes [3, 6, 30], the differentially expressed exons (Figure 2) show that these two tissues undergo different splicing patterns. Compared to other tissues, more exonic elements are found to regulate alternative cassette exons in testes.
Different from relative functional levels, derived from SplicingModeler to estimate the relative activities of cis-acting RNA elements for each pair of tissues, tissue-specific functional levels are developed in this study to represent the activities of predicted cis-acting RNA elements on exon inclusion in one tissue. From the hierarchical clustering on the tissue-specific functional levels of predicted cis-acting elements across all tissues, we observe positive correlations within each of the two major clusters. The correlation coefficients between cerebellum and spleen, and between liver and proState, even reach high values of 0.553 and 0.415 respectively, indicating that a large number of cis-acting RNA elements are playing similar functions across these tissues.
From bipartite network showing the regulatory relationship of predicted cis-acting elements (CAE) in different tissues (Figure 6), we clearly see that most of the predicted CAEs (small circles surrounding tissues) are only connected to two tissues; meanwhile, a small number of CAEs (big circles centered and surrounded by tissues) are regulating multiple tissues. We also demonstrate that the vast majority of predicted CAEs are only related to two tissues (Figure 7). Interestingly, 14 CAEs are predicted to be Nova binding sites, and all of them are only significantly regulating the splicing differences between two tissues. Similar situations also happen on SRp40, hnRNP-F and hnRNP-H related CAEs, in which 88.9%, 75% and 100% are found to regulate only two tissues respectively. This is either caused by the degenerative features of these binding sites, which is not considered in the current model, or implies that the binding sites of these general splicing factors slightly vary in different tissues. Interestingly, 44.6% of all hnRNP-B, hnRNP-C and hnRNP-I (PTB) related CAEs are significantly presented in multiple (>= 3) tissues. These data indicate that the majority of CAEs may be only recognized by tissue-specific RBPs and only function in a small set of tissues. By contrast, a small number of CAEs bound by highly abundant RBPs like hnRNP-B/C/I, could also contribute to the differential regulation of alternative splicing across multiple tissues.
Systematic investigation of RNA binding proteins and their complicated interactions and regulatory roles on alternative splicing remains a challenging problem. Computational models integrating genome-wide expression data can assist in the identification of mechanisms influencing tissue-specific alternative splicing. Popular computational methods of cis-acting RNA elements prediction are typically based on low- or high-throughput microarray or sequencing data, and then analyzed to predict binding sites using the consensus sequences of the studied RBPs [24, 31–33]. Such methods are useful for predicting binding specifiCity of specific RBP's, but lack integration of tissue-specific splicing and combinatorial regulatory mechanisms. Moreover, these methods are mainly restricted to exonic splicing regulator prediction. Recently, Das et al developed another computational model to identify CAEs for tissue-specific alternative splicing, which uses the correlation of motif parameters together with gene-level normalized exon expression signals to identify splicing regulatory motifs . However, combinatorial effects of different splicing factors were not considered. Besides, only upstream and downstream intronic regions are analyzed in that study, however, exonic splicing enhancers (ESE) and silencers (ESS) are also important parts of alternative splicing regulators as well. Our method is based on the assumption that alternative splicing is regulated by RBPs in a combinatorial fashion, and therefore potentiates an unbiased investigation on the functions of cis-acting elements in both intronic and exonic regions.
Compared with previous computational methods for the prediction of CAEs regulating tissue-specific alternative splicing [24, 31–34], our method attempts to predict cis-regulatory elements in combination with their functions and contributions to exon inclusions. Different from SplicingModeler, this study emphasizes the tissue-specific rather than relative functional levels between tissues, so that we are able to compare CAEs' activities in a broader scale. Further improvements of our model in future include sequence degeneracy of motif candidates and pre-mRNA secondary structure.
Human Exon Array dataset, containing exon array data for 11 tissues (including breast, cerebellum, heart, kidney, liver, muscle, pancreas, proState, spleen, testes, and thyroid) with 3 replicates for each, were obtained from Affymetrix Exon 1.0ST Array Sample Dataset . Affymetrix Power Tool  (APT Release 1.8.5) were used to pre-process exon array data such as probe intensity normalization, probeset and transcript expression summarization, implementation of MiDAS algorithm  etc. Since our analysis focuses primarily on cassette exons, Affymetrix core PSR dataset, which contains annotated RefSeq transcripts and full-length mRNAs, were used to filter out exon clusters with multiple PSRs. UCSC known gene database and annotation files  were retrieved from UCSC Genome Browser.
Exons with |SI|>= 2.0 are selected as splice variants between these two tissues; this can be translate into four times splicing difference.
It has been reported that splicing factors works primarily in exonic and intronic regions that are adjacent to splice sites [41–43]. As such, SplicingModeler only considers intronic regions within 300-bp and exonic regions within 150-bp to 5' and 3' splice sites as four potential cis-regulatory regions. SplicingModeler requires the SI values as well as regulatory sequences to perform the prediction of regulatory cis-acting elements. The differentially expressed exons are mapped to UCSC known gene annotation  to obtain their genomic loci. Sequences in regulatory regions, including intronic/exonic regions upstream and downstream of 5' and 3' splice sites, are then retrieved from Human Genome Database (hg18) [44, 45].
where, n r indicates the number of regulatory regions (n r = 1 to 4 corresponding to 3'ss-intronic region, 3'ss-exonic region, 5'ss-exonic region, and 3'ss-intronic regions); S k, r is the total set of functional candidate elements in regulatory region r; S k, i, r is the number of binding sites of candidate element i in regulatory r-th region on the k-th exon; RFL i, r stands for the relative functional level of motif i in regulatory region r; a positive and negative value implies its function on exon inclusion in one of the two comparing tissues.
S k, i, r can be obtained from the regulatory sequences, SI k stands for the splicing index of the k-th exon, which will be measured by the exon array experiment, and calculated based upon Equation 1. The Relative Functional Level (RFL) can then be estimated by fitting k equations with n r parameters using least-squares procedure. The significant cis-acting elements are selected in an iterative fashion. In brief, n = 20 candidate elements were randomly selected from all the candidates. The adjusted reciprocal of sum square error, calculated following least-squares procedure, is equally deposited to exon inclusion contribution (EIC) score of each motif candidate . The estimated RFL values are also accumulated into the final RFL scores. A detailed description on the SplicingModeler procedure can be found in .
In this study, to speed up the modeling, motif candidates whose binding sites occurrences are less than 5% of all alternatively spliced exons are filtered out. For each tissue pair comparison, we repeat the procedure for 10 million times. Motif candidates whose EIC scores are higher than median + 5 × IQR (interquantile range) are selected as significant cis-acting elements.
Functional levels of predicted elements in each tissue, or Tissue-Specific Functional Level (TSFL), is derived from the Relative Functional Levels (RFL) that are calculated for each hexamer (Equation 4) in tissue pairs. The TSFL is calculated through the following two steps:
For any motif candidate , such equation exists for every tissue comparison. Based on nRFL m, i, j score estimated from SplicingModeler, TSFL value was solved using least-squares procedure by randomly selecting one tissue as the standard and assigning 0 to its corresponding TSFL. For every predicted motif, the derived TSFL score is further normalized assuming standard normal distribution.
To illustrate regulatory relationships of predicted cis-acting RNA elements (CAEs) on regulated tissues, a bipartite network is constructed using Cytoscape . Predicted CAEs and tissues are represented by circular and diamond-shaped nodes respectively. Different colors of circles stand for different types of regulatory regions (green: intronic region adjacent to 3'ss; blue: intronic region adjacent to 5'ss; aubergine: exonic region adjacent to 3'ss; red: exonic region adjacent to 5'ss). The size of CAE is proportional to the number of connected tissues. CAEs connected to multiple tissues are clustered at the center of graph, while CAEs only associated with two tissues are located at surrounding regions. The connection between each CAE and tissue indicates that there is a regulatory relationship between them. Colors of connections for each CAE describe the tissue-specific functional levels (TSFL) across tissues (Red: TSFL > 0; Blue: TSFL < 0).
This work is supported by the grant from National Institutes of Health (1R01GM085121, to JRS and YL), China National 863 High-Tech Program 2007AA02Z302 (YL), CSC (China Scholarship Council) State Scholarship Programs (WF), and the Indiana Genomics Initiative of Indiana University (supported in part by the Lilly Endowment, Inc., YL).
This article has been published as part of BMC Genomics Volume 10 Supplement 1, 2009: The 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.