Functional importance of different patterns of correlation between adjacent cassette exons in human and mouse
© Peng et al; licensee BioMed Central Ltd. 2008
Received: 01 October 2007
Accepted: 26 April 2008
Published: 26 April 2008
Alternative splicing expands transcriptome diversity and plays an important role in regulation of gene expression. Previous studies focus on the regulation of a single cassette exon, but recent experiments indicate that multiple cassette exons within a gene may interact with each other. This interaction can increase the potential to generate various transcripts and adds an extra layer of complexity to gene regulation. Several cases of exon interaction have been discovered. However, the extent to which the cassette exons coordinate with each other remains unknown.
Based on EST data, we employed a metric of correlation coefficients to describe the interaction between two adjacent cassette exons and then categorized these exon pairs into three different groups by their interaction (correlation) patterns. Sequence analysis demonstrates that strongly-correlated groups are more conserved and contain a higher proportion of pairs with reading frame preservation in a combinatorial manner. Multiple genome comparison further indicates that different groups of correlated pairs have different evolutionary courses: (1) The vast majority of positively-correlated pairs are old, (2) most of the weakly-correlated pairs are relatively young, and (3) negatively-correlated pairs are a mixture of old and young events.
We performed a large-scale analysis of interactions between adjacent cassette exons. Compared with weakly-correlated pairs, the strongly-correlated pairs, including both the positively and negatively correlated ones, show more evidence that they are under delicate splicing control and tend to be functionally important. Additionally, the positively-correlated pairs bear strong resemblance to constitutive exons, which suggests that they may evolve from ancient constitutive exons, while negatively and weakly correlated pairs are more likely to contain newly emerging exons.
Alternative splicing is a post-transcriptional mechanism in which the exon sequences of primary transcripts are differently included in mature RNAs. It expands proteomic diversity and regulates developmental or tissue-specific processes by generating multiple different transcripts from a single gene [1–3]. Recent comparison of data from transcript sequencing and microarray profiling indicates that alternative splicing is more frequent in higher eukaryotes. Independent studies estimate that 40–60% of human genes undergo alternative splicing [4–7]. Mutations that disrupt splicing and/or alternative splicing are reported to be an important source of many human diseases [8–11]. Thus, many efforts have been invested in understanding the regulation of alternative splicing [12–17].
Alternative splicing is a complex phenomenon with four basic forms. (1) Exon skipping (also known as cassette exon), in which the alternative exon is either included or skipped as a whole in the mature mRNA transcript, is the most common alternative splicing event in mammals. The next two types are (2) alternative donor splice site and (3) alternative acceptor splice site. These two types of alternative exons are flanked on one side by a constitutive splice site and on the other side by several competing alternative sites. Finally, (4) intron retention is an abundant form of alternative splicing in plants but relatively rare in mammals. Combinations of the simple forms make up complex types of alternative splicing, which account for 33% of all the conserved events between human and mouse [5, 19, 20]. Furthermore, many genes contain more than one alternatively spliced region. The possible coordination between different alternative regions adds an extra layer of complexity to alternative splicing regulation. Some recent studies have identified cases in which different alternative exons belonging to the same gene appear to be coordinated [21–23]. Fededa et al. constructed a minigene carrying two alternative EDI regions in tandem. Mutations that change the alternative splicing of the upstream EDI strongly affect the splicing pattern of the downstream one. The CACNA1G gene is another example. Emerick et al. analyzed hundreds of full-length cDNA sequences of this gene and found evidence for both pair-wise and high-order correlations between different alternative exons.
These studies highlight the importance of investigating the extent to which alternative exons are coordinated across a large number of genes. Studying the coordination patterns between alternative exons and their functional importance on a genome-wide scale can provide valuable information for our understanding of alternative splicing. This information will allow researchers to further investigate individual, or 'modular', alternative splicing events in the light of the cooperative effects between alternative splicing events and to gain deeper insight into the molecular functions and regulatory mechanisms of gene control [23, 24]. Information on alternative exon coordination will also help researchers eliminate transcript forms that are unlikely to appear when all possible combinations of multiple alternative regions are considered. Narrowing down the whole transcript-form space can greatly enhance efforts to catalogue and construct a full-length transcriptome.
Adjacent cassette exons are a pair of cassette exons that are adjacent in the final transcript (Figure 1). Our study focused on adjacent cassette exons because they are the simplest form of possible exon interaction and because these events can be covered by sufficient ESTs. Of the two exons in an adjacent cassette exon pair, the one in the 5' upstream region is called the upstream exon and the other the downstream exon. For a given adjacent cassette exon pair, let the random variable u represent the inclusion/exclusion status of the upstream exon in a mature transcript and let d represent the status of the downstream exon (Figure 1). An EST/cDNA sequence i spanning these two exons then corresponds to an observation (u_i, d_i) of the random variables. There are 4 possible value combinations of the pair (u, d):
1) (1, 0), the upstream exon is included while the downstream exon is excluded;
2) (0, 1), the upstream exon is excluded but the downstream exon is included;
3) (1, 1), both are included;
4) (0, 0), both are excluded.
The interaction between random variable pair (u, d) can represent the interaction between the adjacent cassette exons. Correlation coefficient was taken as the metric to estimate this interaction (see Methods).
To obtain our dataset we extracted all of the adjacent cassette exons deposited in the Altsplice database of the ASD project (see Methods), resulting in 4326 adjacent cassette exon pairs from 2451 genes in human and 1905 adjacent cassette exon pairs from 923 genes in mouse (information for all pairs is listed in additional file 2). To obtain a relatively reliable value of the correlation coefficient, we took into account only those pairs with more than 10 EST sequences covering the alternatively spliced region. Most of the pairs (3003 in human and 1314 in mouse) passed this filter. A large portion of them had substantially higher EST coverage: among the pairs that passed the filter, the median number of ESTs is 37 for human and 31 for mouse. (The distribution of EST evidence number per exon pair is shown in supplementary Figure S1.)
Different patterns of correlation between adjacent cassette exons
The distribution in Figure 2A shows strong multimodality. Two major peaks and one minor peak can be identified: 1) many of the coefficients center at zero; 2) a comparable number fall into a narrow bin near 1, forming a very sharp peak; 3) the minor peak is at -1, with fewer data than the other two peaks. Although the mouse data set is smaller than the human one, the correlation pattern in mouse is very similar (Figure 2B): there are again more positively correlated pairs than negatively correlated ones, and three peaks can be identified at the same places. The small peak at -1 is clearer in mouse than its counterpart in human. The multimodality indicates that the two exons in a pair interact with each other in different ways. We may therefore expect different regulatory properties and functional importance among the groups formed by these peaks.
Table 1. Sequence lengths among different groups.
Table 2. Frame preservation among different groups. Columns: # of pairs, # in CDS, # of frame preservation*.
Mutually exclusive alternative splicing events have long been noticed. Many researchers have labeled mutual exclusion as one of five basic alternative splicing types (together with exon skipping, alternative acceptor, alternative donor and intron retention). The definition of mutual exclusion reflects the concern that alternative splicing may be regulated by different cellular conditions. For example, an alternative splicing event with transcript form 2 and transcript form 3 from Figure 1 will be reported as a mutually exclusive event in the ASAP and ASD annotations, because these two transcripts may be the splicing products of two different cellular conditions (for instance, two tissue types). However, from a mathematical point of view, mutual exclusivity is not a property of a particular transcript but instead describes the relationship between differently spliced transcripts, which makes it difficult to define precisely. When all possible transcript forms are considered, our data suggest that strictly "mutually exclusive" exons account for only a small portion of all the pairs.
The correlation coefficient estimates the correlation between two binary variables (the inclusion/exclusion states of the upstream and downstream exons). The correlation coefficient assumes that both variables are normally distributed and that their joint distribution is bivariate normal, which is not satisfied by our data. The small EST counts further reduce the accuracy of the estimate. To rule out the possibility that the distribution pattern of correlation coefficients is an artifact of the relatively small EST counts, we performed a re-sampling procedure and a permutation procedure. The results show that the two features of the distribution, 1) more positively-correlated pairs and 2) multimodality, are very unlikely to be random results (P-value < 1e-5; see supplementary material for the details of the methods and results of the two procedures). We also checked the correlation between the two exons using the Fisher exact test (as done previously; see supplementary material for the calculation and discussion of the statistical test). The P-values of the pairs are consistent with the Pearson correlation coefficients: a pair with a small P-value tends to be strongly (positively or negatively) correlated. The P-values for each pair, together with the FDR (False Discovery Rate) correction, are listed in additional file 2 for readers interested in high-confidence pairs.
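As an illustration of the Fisher exact test mentioned above, the following Python sketch computes a two-sided P-value from the four EST counts of a pair. It is only a sketch under stated assumptions: the function name, the argument order, and the two-sided convention (summing all tables that are no more probable than the observed one) are choices made here, and the authors' exact procedure is described in their supplementary material.

```python
from math import comb

def fisher_exact_two_sided(n11, n10, n01, n00):
    """Two-sided Fisher exact P-value for one adjacent cassette exon pair.

    n11: ESTs including both exons      (u, d) = (1, 1)
    n10: ESTs including upstream only   (u, d) = (1, 0)
    n01: ESTs including downstream only (u, d) = (0, 1)
    n00: ESTs skipping both exons       (u, d) = (0, 0)
    """
    row1 = n11 + n10              # ESTs that include the upstream exon
    col1 = n11 + n01              # ESTs that include the downstream exon
    n = n11 + n10 + n01 + n00     # all ESTs spanning the pair
    denom = comb(n, col1)

    def prob(k):
        # Hypergeometric probability of k ESTs including both exons,
        # with the row and column totals held fixed.
        return comb(row1, k) * comb(n - row1, col1 - k) / denom

    lowest = max(0, col1 - (n - row1))
    highest = min(row1, col1)
    p_obs = prob(n11)
    # Assumed convention: sum every table at most as probable as the observed one.
    total = sum(p for p in (prob(k) for k in range(lowest, highest + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

For a hypothetical pair with 6 ESTs containing only the upstream exon and 6 containing only the downstream exon, `fisher_exact_two_sided(0, 6, 6, 0)` gives a small P-value, whereas a balanced table such as `(3, 3, 3, 3)` gives a P-value of 1.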
Splice site strength and exon/intron length associated with different correlations
The three groups were examined for differences in sequence properties. The strength of the splice site, which is important in the regulation of splicing, was investigated first. In general, the splice sites in cassette pairs, including ME, IND and LNK, are slightly weaker than those in constitutive exons, which is consistent with previous results that constitutive splice sites are stronger than alternative ones [28, 29]. However, we did not observe meaningful differences among the ME, IND and LNK groups (see Table S1).
Next we examined the exon and intron lengths in human (Table 1). The exons of different groups are of similar length, while the intermediate introns (the intron between upstream and downstream exons) in the strongly-correlated groups (ME and LNK) are much shorter than in the weakly-correlated group (IND) (P-value approximately zero, Wilcoxon test). The median lengths of intermediate introns for ME, IND and LNK are 551, 1935, and 731 nt, respectively. This result suggests that a short intermediate intron plays a role in the delicate control of strongly-correlated pairs. The intermediate introns in ME are extremely short. This may reflect the fact that steric interference is an important mechanism resulting in mutually exclusive splicing. If two adjacent exons are too close to each other, the intron between them cannot be efficiently spliced out by the spliceosome. Thus, no transcript including both exons will be generated. In the ME group, 18 out of 60 transcripts in human and 21 out of 60 transcripts in mouse have an intermediate intron of less than 100 nt. Steric interference possibly accounts for the mutual exclusivity of these pairs.
The lengths of upstream introns (the intron flanking the 5' side of the upstream exon) are also significantly different between groups (LNK > IND > ME, P-value < 0.005), but the downstream introns (the intron flanking the 3' side of the downstream exon) are almost the same length. Previous studies report that the upstream intron length affects alternative splicing. We observed its impact in our analysis of cassette exon pairs. These observations also hold in the mouse data (Table 1). In the LNK group, the two exons in a pair are included/excluded simultaneously, behaving like an individual exon. The short intermediate intron and long upstream intron may be required for this synchronization.
Strongly correlated pairs show increased selection pressure for combinatorial reading-frame preservation
An important aspect of studying alternative splicing is to find which splice events are functionally significant. We investigated whether the different exon correlation patterns are associated with different functional importance. Reading-frame preservation has long been adopted as a criterion to find possible functional cassette exons [17, 28]. If the length of a cassette exon is a multiple of 3 nt, the inclusion or skipping of this exon will not alter the protein reading frame of subsequent exons. Thus, a functional exon skipping event tends to be frame-preserving.
We first checked which adjacent pairs have both exons in CDS, using the annotations from UCSC knownGene (see Methods). The proportions of pairs with both exons in CDS are 85%, 71.2% and 87.1% for the ME, IND and LNK groups, respectively (Table 2). Strongly-correlated pairs are more likely to be located in CDS (P-value < 2.2e-16, IND vs. LNK and ME, Fisher exact test). It has been reported that exons which are more functionally important (evolutionarily old) have a higher proportion of exons in CDS compared with newly emerging exons.
Next we checked the proportion of these possible protein-coding exons which preserve the reading frame. The results show that, in all groups, the fractions of exons whose lengths are multiples of 3 nt are just above one third, roughly equal to the proportion expected by chance (Table 2). However, if the correlation of the adjacent exons is functionally important, it may not be necessary that the lengths of both the upstream and downstream exons be exact multiples of 3 nt. For example, the skipping of a 50 nt exon will cause a shift of -2 in the reading frame for subsequent exons in the final transcript. But if the next exon of 53 nt is included, as in an ME pair, the subsequent exons will get another +2 shift. In this case, the net shift is 0 for subsequent exons and frame disruption is limited to a local region. The overall reading frame remains intact. Thus, in the ME group, the transcript will preserve the reading frame when the length difference of the two exons is an exact multiple of 3 nt. Similarly, in the LNK group, a length sum that is an exact multiple of 3 nt preserves the protein ORF. We did observe significant signals in exon length combinations among different groups, both in human and mouse (Table 2). When considering exon length difference, the frame-preserving proportion of ME increases to 70.6% in human and 76.5% in mouse, while it remains roughly one third in the LNK and IND groups. This difference in proportions is significant (P-value < 1.3e-6 for human, P-value < 8.4e-8 for mouse, ME vs. non-ME, Fisher exact test). We also observed in the LNK group an increased proportion of the exon length sum being frame-preserving (42.8% in human and 44.8% in mouse) compared to ME and IND. This increase is significant in human and marginal in mouse (P-value < 7.3e-3 for human, < 0.055 for mouse, LNK vs. non-LNK, Fisher exact test).
The relatively weak signal in the LNK group may be partly attributed to the different exon inclusion levels (defined as the fraction of transcripts that include this pair among all transcripts) among groups (See following).
The LNK and ME pairs show a significant increase in the reading-frame preservation in a combinatorial manner. These observations may be the result of natural selection and suggest a functional importance for exon correlation.
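The three frame-preservation criteria discussed in this section (single exon length, length difference for ME-like pairs, length sum for LNK-like pairs) can be sketched in Python as follows. The function names are hypothetical and chosen here for illustration; the checks simply restate the multiple-of-3 rules given in the text.

```python
def preserves_frame_single(exon_len):
    """A single cassette exon is frame-preserving if its length is a multiple of 3 nt."""
    return exon_len % 3 == 0

def preserves_frame_exclusive(upstream_len, downstream_len):
    """ME-like pair: one exon replaces the other, so the length difference must be a multiple of 3 nt."""
    return (upstream_len - downstream_len) % 3 == 0

def preserves_frame_linked(upstream_len, downstream_len):
    """LNK-like pair: both exons are included or skipped together, so the length sum must be a multiple of 3 nt."""
    return (upstream_len + downstream_len) % 3 == 0

# The 50 nt / 53 nt example from the text: neither exon is frame-preserving alone,
# but swapping one for the other leaves the downstream reading frame intact.
print(preserves_frame_single(50), preserves_frame_single(53))   # False False
print(preserves_frame_exclusive(50, 53))                        # True
print(preserves_frame_linked(50, 53))                           # False
```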
Strongly correlated cassette exons are more conserved
The flanking introns in the ME and LNK groups are also more conserved than those in IND (Figure 3). The intron conservation level in strongly-correlated groups is even higher than the level of introns flanking constitutive exons (data not shown). This is similar to findings from previous studies which reported that the introns flanking conserved cassette exons are on average significantly more conserved than both introns flanking constitutive and species-specific cassette exons [17, 36]. This would indicate the regulatory role of these introns. The high intron conservation observed here again supports the possibility that exon interaction regulation is associated with functional importance. Intriguingly, in the strongly-correlated pairs, the intermediate intron (between two exons) is more conserved than the upstream and downstream introns (Figure 3). This conservation difference is more significant in the ME group.
Table 3. Evolutionary course of pairs among different groups (includes exon age difference*).
The majority of LNK cassette exons are major-form exons
The results for the LNK group are entirely different: the distribution is heavily skewed toward 1. Most exons in LNK are major-form exons with a very high inclusion level (Figure 5C). This is consistent with the high sequence conservation in LNK and may also partially explain the low selection pressure for frame preservation observed in our study. Of the two different transcripts, with or without the LNK exon pair, the transcripts in which the LNK pair is skipped are expressed at low abundance. Though skipping of the LNK pair may break the reading frame, the low abundance of this defective transcript will probably not cause severe loss of overall product activity. It has been reported that major-form exons exhibit low selection pressure for frame preservation.
Related to our observation that LNK pairs usually have a high inclusion level, Plass et al. showed that the exonic structure is more conserved at higher inclusion levels, and that this correlates with the sequence conservation of the alternative exons. From this point of view, the conserved LNK pairs can be thought of as transcript fragments with conserved exonic structure and a high inclusion level. Thus, the previous work and our analysis reached the same conclusion from two different angles.
Evolutionary course of adjacent cassette exon pairs
Our study indicates that the strongly-correlated groups (ME and LNK) tend to be more conserved and functionally important. We therefore investigated the origin and evolutionary course of the different interaction patterns. Exon duplication has long been known as a source of mutually exclusive splicing [40, 41]. We detected 78 duplications among all 2154 adjacent pairs in human (using previously published criteria, see Methods). The exon duplications are significantly enriched in ME (Table 3): 40.0% of ME pairs arise from exon duplication, while only 2.55% of IND and 2.61% of LNK pairs are of this origin (P < 2.2e-6, ME vs. non-ME). This enrichment is also observed in the mouse data (P < 2.2e-6, ME vs. non-ME). It has been hypothesized that most duplicated exons are mutually exclusively spliced. However, under our stricter criteria, mutually exclusive pairs account for only a very small proportion of all the adjacent cassette exon pairs. Although a much higher proportion of ME pairs arise from exon duplication, most pairs originating from duplication interact in an independent or linked manner.
Another important source of cassette exons in vertebrates is the exonization of interspersed repeat sequences [42, 43]. To systematically examine whether these repeat elements contribute to our correlated pairs, we employed RepeatMasker to identify the exons originating from repeats (See methods). Strikingly, in human, only 5.74% of the pairs in the LNK group overlap with repeat elements, while the numbers for ME and IND groups are 13.3% and 31.3%, respectively (Table 3). Most of the repeat elements are Alu, which is a primate-specific SINE element. This finding indicates many ME and IND pairs emerged in recent genome evolution and there are few recent emergences of exons in the LNK group (P < 2.2e-16, LNK vs. non-LNK). The same trend can be observed in the mouse (Table 3).
To further investigate the evolutionary course of exon interactions, we took a multiple genome comparison strategy similar to that of Zhang et al. The age of an exon is determined by its conservation in the most divergent organism. The rationale, for example, is that a human exon whose ortholog is present in a given organism must have been "born" before the divergence between human and that organism. All exons are divided into three groups: young, middle, and old (the evolutionary scale is limited to vertebrates, see Methods). The age of an exon pair is then taken to be the age of the younger exon in the pair. Zhang et al. reported that younger exons are more likely to be alternatively spliced and to have a low inclusion level. It has also been reported that younger exons are more likely to originate from repeat elements and to be located in the UTR region. These results are consistent with the observations from our pair analysis (see following).
Compared with the ME and IND groups, exons in the LNK group are old. The age difference between the two exons in a pair confirms this observation. In human, the proportion of LNK pairs in which the two exons are of different ages is 22.6% (216/957), while the proportions for the ME and IND groups are 45.0% (27/60) and 59.5% (677/1137), respectively. In mouse, the numbers are 48.3% (29/60), 54.0% (229/424), and 23.3% (118/507) for ME, IND and LNK, respectively (Table 3). LNK pairs tend to be ancient events and the two exons in an LNK pair are uniformly old, while the exon ages in an ME or IND pair are more likely to differ and many of these exons emerged recently in evolution (all P-values < 2.5e-4, LNK vs. ME/IND).
The ages of the two exons in a pair are often different. An interesting observation is that the upstream exon is usually younger than the downstream one. The human exon conservation curves (Figure 3), in which the downstream exons are more conserved, especially for ME, support this idea. In Table 3, there are more upstream-young pairs, defined as pairs in which the upstream exon is younger than the downstream one, than upstream-old pairs in each group. We observed this in both human and mouse. A permutation test (see supplementary material, Fig S5) confirmed the upstream-downstream asymmetry in the LNK and IND groups (P < 1e-4). The asymmetry is not significant in ME, probably due to the limited amount of data. The upstream exon may tend to be younger than the downstream exon because a change in the upstream exon is more likely to affect the downstream one. When a new cassette exon is generated, by mutation or other genetic variation, this change may also affect the splicing of the subsequent exon. The two exons thus become an adjacent cassette pair. One experiment has reported that some exon interactions are polar: mutations in the upstream exon strongly affect the splicing of the downstream exon, while similar mutations in the downstream exon have no effect on the upstream exon. However, more systematic studies are needed to validate and explain the difference between upstream and downstream exons.
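The age conventions used in this section (the pair takes the age of its younger exon; a pair is upstream-young when the upstream exon is younger than the downstream one) can be written as a short Python sketch. The string labels, the `AGE_RANK` mapping, and the function names are assumptions made here for illustration only.

```python
# Age classes named in the text, ordered from youngest to oldest (labels are illustrative).
AGE_RANK = {"young": 0, "middle": 1, "old": 2}

def pair_age(upstream_age, downstream_age):
    """The age of a pair is the age of its younger exon."""
    return min(upstream_age, downstream_age, key=lambda age: AGE_RANK[age])

def age_asymmetry(upstream_age, downstream_age):
    """Classify a pair by which exon is younger."""
    diff = AGE_RANK[upstream_age] - AGE_RANK[downstream_age]
    if diff < 0:
        return "upstream-young"
    if diff > 0:
        return "upstream-old"
    return "same-age"

print(pair_age("young", "old"))        # young
print(age_asymmetry("young", "old"))   # upstream-young
```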
Currently, most studies concerning the regulation of alternative splicing focus on a single alternative exon. Using EST data covering adjacent cassette exon pairs in human and mouse, we used the correlation coefficient to describe the interaction between adjacent cassette exons and demonstrated that these pairs show various correlation patterns. The pairs were then categorized into three groups according to the correlation coefficient: ME (mutually exclusive pairs), IND (independent pairs) and LNK (linked pairs). The three groups show little difference in exon length and splice site strength, but the strongly-correlated pairs (ME and LNK) have a much shorter intermediate intron (the intron between upstream and downstream exons) than weakly-correlated pairs (IND). Comparison between human and mouse shows that the sequences in ME and LNK are more conserved, in both the exons and the flanking introns, especially the intermediate intron. The strongly-correlated pairs also show a significant increase in ORF preservation in a combinatorial manner. Sequence conservation, together with ORF preservation, indicates that strongly-correlated pairs are under more regulatory control and tend to be functionally important. The multiple-genome comparison further revealed that exon pairs with different correlation patterns may undergo different evolutionary courses. Most LNK pairs are old, and the two exons in an LNK pair are usually of similar age, that is, similarly old. In contrast, most IND pairs emerged recently in genome evolution. The two exons in an IND pair are more likely to be of different ages, and the younger ones are frequently recruited from repeat elements. ME pairs lie between IND and LNK. There are many old and functional ME pairs, which mainly originated from exon duplication, while there are also many newly emerged ME events which seem to be less functionally important. As in the IND group, a substantial portion of the new ME pairs originate from repeat elements.
The evolution of individual alternative exons has two different paths: some alternative exons might have originated as a result of relaxation of a splicing signal which originally supported only constitutive splicing, while other alternative exons might come from mutations that turn an intron segment or repeat element directly into an alternative exon [34, 42, 43]. We speculate that there are similar paths in the formation of an adjacent cassette exon pair. The LNK exons in our analysis closely resemble constitutive exons in sequence conservation, high inclusion level and low selection pressure for frame preservation. The LNK exons were probably originally constitutive exons. With the weakening of the splicing signal and/or the exon interaction effect, an adjacent constitutive pair can change into an LNK pair. The relatively small difference between the exon ages in a pair also supports this hypothesis. On the other hand, exon birth events are frequent during evolution. A newborn exon may affect its neighboring exon and result in a correlated adjacent pair. Many exons in ME and IND are new and the two exons in a pair are often of different ages, so ME and IND pairs may often originate in this way. The exon ages were derived from multiple sequence comparison. With the accumulation of multiple genome alignment data (e.g. the 28 vertebrate genomes alignment on the UCSC bioinformatics site), the age can be determined on a finer scale. More delicate analysis on that finer scale (e.g. a continuous variable approach) can then tell us more about what happened during evolution.
We presented the primary regulatory and evolutionary properties for the differently interacting adjacent pairs. However, the splicing mechanism generating these interaction patterns is largely unknown. The IND pairs can be explained as independent control of each individual exon in a pair, but we do not know why the two exons in a ME pair are incapable of being spliced to each other or why the exons in a LNK pair are always included/excluded simultaneously. The possible mechanisms generating the different correlation patterns can be divided into "direct" and "indirect" interactions between the two exons in a pair. In direct interactions, the successful splicing of one exon promotes/prevents the splicing of the other one. Steric interference is one such direct interaction and possibly accounts for a portion of the ME cases. Spliceosome incompatibility  is another known direct interaction resulting in mutually exclusive splicing, though there is no evidence that it plays a role in our ME category (No "AT-AC" canonical splicing signal for U11/U12 spliceosome is observed in any ME case). These two known direct interactions, steric interference and spliceosome incompatibility, cannot explain all the cases we observed in the data.
correlated pair by tissue-specific regulation
We further tested an alternative, indirect hypothesis of the tissue-specific regulation: whether the genes hosting LNK and ME cases are expressed on average in a lower number of tissues. If true, this will give additional hints as to whether the strongly correlated pairs are especially regulated. However, we did not observe a significant difference in the tissue-specific expression of host genes (two approaches, based on EST and microarray data, see supplementary for details). This may be due to the fact that different subsets of genes are regulated in a tissue-specific manner at the transcriptional and alternative splicing levels [24, 45]. The extraordinary conservation in flanking introns indicates the existence of regulatory motifs, but more comprehensive analysis and experimental verification should be carried out to explore the interaction mechanism, both direct and indirect interactions.
The formation of exon correlation is complex and has various sources. Some correlated pairs may arise randomly. For example, most LNK pairs are in major form and some low-abundance isoforms may be splicing artifacts. Despite the various sources of the correlation, the differences between correlated and independent pairs deviate significantly from a random result, indicating the effect of natural selection. The next step will be to investigate the functions of these correlated pairs. Exon interaction can expand the protein-coding and regulatory potential of transcripts. For example, it would be interesting to explore the functions of the mutually exclusive pairs. Many ME pairs originate from exon duplication and the two ME exons are highly similar. Subtle changes in the sequence may play an essential role in protein function, such as a change in the catalytic site of an enzyme. Such analyses may reveal important results. Using Gene Ontology annotation, we performed functional enrichment analysis with GeneMerge, but observed no significant enrichment. Exon interaction may be a prevalent phenomenon in transcripts rather than one enriched in a particular pathway or biological process.
There are several limitations in our work. First, as mentioned previously, we pooled all the ESTs from different libraries and cannot distinguish the effect of tissue-specific control. This is due to the limited amount of data. A recent microarray analysis explored tissue-specific cassette exons and exon correlation across more than 20 tissues and cell lines in mouse. The results indicate that many tissue-specific exons and correlated pairs are enriched in the brain. In the future it will be necessary to compare and combine data from different sources to investigate the correlation mechanism. The second limitation is that we studied only the interaction between adjacent cassette exons and not between distant exons. Focusing on adjacent exons reflects a trade-off between an exhaustive description of exon correlation and limited data coverage. Our analysis is mainly based on EST/cDNA sequences, which are usually local fragments of full-length gene transcripts, so correlation between distant cassette exons is outside the scope of this study. Full-length sequence analysis by Emerick et al. discovered that non-adjacent exons interact with one another in the CACNA1G gene. The two correlated cassette exons in the two sequential EDI regions are also not adjacent in Fededa's experiments. Nevertheless, since RNA splicing is coupled with transcription and splicing of exons is largely sequential from upstream to downstream [48, 49], adjacent exons can be expected to interact with each other more often than distant pairs. Our observation that the distances between the two exons in strongly-correlated pairs are shorter than in loosely-correlated pairs supports this hypothesis. Our investigation of correlation between adjacent pairs shows only a small portion of the whole picture of exon interactions.
With the accumulation of full-length data and the progress of new technologies, such as single molecule profiling, it will be possible to further investigate the complex interactions among exons in a gene, which will eventually provide a finer picture and a deeper understanding of alternative splicing and its regulation.
We presented a genome-wide study of exon interactions in alternative splicing, using adjacent cassette exon pairs as a model. The adjacent cassette exon pairs show distinct interaction patterns and can thus be categorized into different groups. Compared with the loosely-correlated group, the strongly-correlated groups are more conserved and contain a higher proportion of pairs whose reading frame is preserved in a combinatorial manner. Additionally, the positively-correlated pairs bear a strong resemblance to constitutive exons, which suggests that they may have evolved from ancient constitutive exons. The negatively and weakly correlated pairs are more likely to contain newly emerging exons.
Just as the tissue-specific expression of a cassette exon indicates a possible function for that alternative splicing event, a specific interaction between alternative exons suggests that they are under delicate splicing control and tend to be functionally important. However, little is known about the mechanism of the observed interaction and its function; addressing this is the future direction of this study.
Calculation of exon correlation coefficients
where $\bar{u}$ and $\bar{d}$ are the means of u and d, respectively, and n is the number of ESTs covering this region (Figure 1). Correlation coefficients always fall in the range [-1, +1]. A value of r = -1 means the two exons in a pair are mutually exclusive: exactly one of the two exons is present in each transcript. A value of r = 1 means the upstream and downstream exons are linked: the two are always included or excluded together. A value of r ≈ 0 indicates that the two exons are included or excluded independently.
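The interpretation of r can be checked with a small sketch. Because the defining equation is not reproduced in this text, the sketch assumes that r is the standard Pearson correlation coefficient computed over per-EST binary indicators u and d (1 if the upstream or downstream exon is included in an EST, 0 if it is skipped). The function name and the handling of the undefined case are illustrative assumptions, not part of the original pipeline.

```python
from math import sqrt


def exon_pair_correlation(u, d):
    """Pearson correlation between per-EST inclusion indicators.

    u[i] and d[i] are 1 if the upstream / downstream exon is included
    in EST i and 0 if it is skipped (assumed encoding).
    Returns None when r is undefined (one exon never varies).
    """
    if len(u) != len(d) or len(u) == 0:
        raise ValueError("u and d must be non-empty and of equal length")
    n = len(u)
    mean_u = sum(u) / n
    mean_d = sum(d) / n
    cov = sum((ui - mean_u) * (di - mean_d) for ui, di in zip(u, d))
    var_u = sum((ui - mean_u) ** 2 for ui in u)
    var_d = sum((di - mean_d) ** 2 for di in d)
    if var_u == 0 or var_d == 0:
        return None  # an exon that is always included or always skipped
    return cov / sqrt(var_u * var_d)


# Mutually exclusive: every EST carries exactly one of the two exons.
r_me = exon_pair_correlation([1, 0, 1, 0], [0, 1, 0, 1])
# Linked: both exons are included together or skipped together.
r_lnk = exon_pair_correlation([1, 1, 0, 0], [1, 1, 0, 0])
# Independent: inclusion of one exon says nothing about the other.
r_ind = exon_pair_correlation([1, 1, 0, 0], [1, 0, 1, 0])
```

Under these assumptions the three toy inputs reproduce the three regimes described above: r close to -1, +1 and 0, respectively.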
Adjacent cassette exon pair data set
Our data set of adjacent cassette exons was extracted from the Altsplice database of EBI (release 2, April 2005). The records are predicted by an EST-based computational pipeline. The initial data for Altsplice comprised 898,295 high-quality EST-genome alignments in human and 837,329 alignments in mouse. Alternative splicing events in Altsplice are delineated as inconsistencies between distinct splice patterns (EST-genome alignments). Thus an adjacent cassette exon pair is identified in Altsplice data as either a mutually exclusive splicing event (inconsistency between transcript structures 2 and 3, Figure 1) or a skipping event of two exons (inconsistency between transcript structures 1 and 4, Figure 1). We extracted all the data of these two alternative splicing types and the corresponding EST alignments covering the region deposited in Altsplice. This strategy yielded 4326 adjacent cassette exon pairs in human. The same pipeline was applied to the Altsplice mouse data (release 2, April 2005) and resulted in 1905 pairs. In several places, properties of individual cassette exons serve as the background when discussing adjacent pairs. These data were calculated directly from the Altsplice annotation.
Determination of the CDS and UTR location for exon pairs
We downloaded the knownGene annotations of human and mouse from the UCSC server (knownGene.txt.gz for hg17 and mm5). All exon pairs were mapped to the corresponding human or mouse genome. If the two exons in a pair were both located in a CDS region in the knownGene annotation, the pair was treated as a CDS pair. Otherwise, the pair was treated as a UTR pair.
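The classification rule can be sketched as follows. The text does not say whether "located in a CDS region" requires full containment, so this sketch assumes that both exons must lie entirely within the coding interval and that all coordinates are 0-based, half-open. The function name and argument layout are hypothetical.

```python
def classify_pair(exon_up, exon_down, cds_start, cds_end):
    """Label an exon pair 'CDS' if both exons lie inside the coding
    interval [cds_start, cds_end), otherwise 'UTR'.

    Each exon is a (start, end) tuple in 0-based, half-open coordinates
    (an assumption of this sketch).
    """
    def in_cds(exon):
        start, end = exon
        return cds_start <= start and end <= cds_end

    return "CDS" if in_cds(exon_up) and in_cds(exon_down) else "UTR"


# Both exons inside the coding region of the hosting transcript.
label_cds = classify_pair((100, 200), (300, 400), cds_start=50, cds_end=500)
# Upstream exon starts before the coding region begins.
label_utr = classify_pair((10, 60), (300, 400), cds_start=50, cds_end=500)
```

A looser criterion (any overlap with the CDS) would only change the body of `in_cds`.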
Sequence conservation curve of cassette exon pair and flanking introns
The genome alignments between human and mouse come from the University of California, Santa Cruz Genome Bioinformatics Site. We downloaded the human-mouse alignments (hg17 vs mm5) for the human conservation curve in Figure 3 and the mouse-human alignments (mm5 vs hg17) for the mouse curve in Figure 4. For the human-referenced percent identity curve, insertions in the mouse genome were discarded and gaps in the mouse sequence were treated as mismatches. The procedure is similar for the mouse-referenced curve. The curves show the average percent identity of the alignment in a 9-base sliding window.
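A minimal sketch of the windowed percent identity for a single pairwise alignment is given below. It assumes the two aligned sequences are supplied as equal-length strings with "-" marking gaps; the function name is hypothetical, and averaging the resulting curves across many exon pairs would be a separate step.

```python
def percent_identity_curve(ref, other, window=9):
    """Percent identity in a sliding window along the reference sequence.

    ref, other: aligned sequences of equal length, '-' marks a gap.
    Columns where the reference has a gap (insertions in the other
    species) are discarded; gaps in the other species count as mismatches.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    # One 0/1 entry per reference base; a gap in `other` never equals a base.
    matches = [
        1 if r.upper() == o.upper() else 0
        for r, o in zip(ref, other)
        if r != "-"
    ]
    return [
        100.0 * sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]


# One insertion in the second sequence (dropped) and one gap (mismatch).
curve = percent_identity_curve("ACG-TACGTA", "ACGATAC-TA")
```

In this toy alignment nine reference bases remain after the insertion column is dropped, so the curve has a single window in which eight of nine positions match.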
Conservation of cassette exon pair events
We took a different approach to explore conserved cassette-exon pair events, that is, cases in which a cassette-exon pair in one species is also a cassette-exon pair in the other species. For a cassette-exon pair in human, we first found the mouse homolog of the human gene hosting this pair via BioMart. If the mouse homolog also contained a cassette-exon pair (one of the 1905 pairs in our analysis), we aligned the two exons of the human pair with the exons of the mouse pair using Blast. Pairs in which both exons were conserved in the mouse pair (E-value < 0.001) were categorized as conserved pairs. The same procedure was applied to the mouse data.
The homology between the two exons in a pair was identified using the bl2seq implementation of TBLASTN. If the two exons showed significant homology, they were considered to originate from exon duplication. In both human and mouse, the significance threshold was set at an E-value of 0.001, the same threshold used in the work of Letunic et al.
Overlapping with repeats
We used RepeatMasker to identify interspersed repeat elements. The type and location of the repeat elements were determined by parsing the ".out" output file. We used the "-s" parameter on the command line, which is the most sensitive setting when running RepeatMasker. If an exon shared at least one nucleotide with any repeat sequence, we considered it to overlap with repeat elements. An exon pair was considered to overlap with a repeat if at least one of its exons overlapped with any repeat.
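The overlap rule reduces to an interval test, sketched below. The sketch assumes that exon and repeat coordinates are given as 1-based inclusive (start, end) tuples on the same sequence; the function names are hypothetical and the parsing of the RepeatMasker output is not shown.

```python
def intervals_overlap(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_overlaps_repeat(exon, repeats):
    """True if the exon shares at least one nucleotide with any repeat."""
    return any(intervals_overlap(exon, rep) for rep in repeats)


def pair_overlaps_repeat(exon_up, exon_down, repeats):
    """A pair overlaps a repeat if at least one of its exons does."""
    return (exon_overlaps_repeat(exon_up, repeats)
            or exon_overlaps_repeat(exon_down, repeats))


repeats = [(200, 300), (900, 950)]
# The upstream exon ends exactly on the first base of a repeat.
touching = pair_overlaps_repeat((100, 200), (400, 500), repeats)
# Neither exon shares a base with any repeat.
clear = pair_overlaps_repeat((100, 199), (400, 500), repeats)
```

With inclusive coordinates, sharing a single boundary base is enough to count as an overlap, which matches the one-nucleotide criterion stated above.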
Evolutionary ages of exons
The evolutionary age of an exon is determined by the most distant species in which its ortholog can be found. The rationale is that a human exon whose ortholog is present in a given organism must have been "born" before the divergence between human and that organism. This procedure is similar to the work of Zhang et al. We checked the orthologs of human exons in the chimpanzee, dog, mouse, rat, chicken, zebrafish, and fugu genomes. (The choice of genomes is adopted from the UCSC multiple alignments; see below.) Note that we did not require the orthologs in distant species to exist in intermediate lineages. For example, a human-fugu ortholog may be absent in the intermediate dog, mouse and chicken genomes. If an exon is human-specific (found only in the human genome), then its most distant ortholog is in human. In this way, each exon in our study is associated with one of the eight vertebrate genomes. The eight species were further divided into three groups according to evolutionary history. The first group consists of human and chimpanzee (divergence time: < 5 million years). The second group consists of dog, mouse and rat (divergence time: ~93 million years). The third group consists of the remaining species (chicken, zebrafish, and fugu; divergence time: > 310 million years). If the most distant ortholog of an exon falls into the first, second or third group, the exon is classified as young, middle or old, respectively.
A similar procedure was performed on mouse exons. We checked the orthologs of mouse exons in rat, dog, human and chicken. The five organisms were also divided into three groups: 1) mouse and rat (divergence time: < 40 million years); 2) dog and human (divergence time: 93 million years); 3) chicken (divergence time: 310 million years). According to the most distant ortholog, each mouse exon is classified as young, middle or old.
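The age assignment for human exons can be sketched as a lookup on the set of species in which an ortholog was detected. The species grouping follows the text; the function name, the dictionary layout and the input format are assumptions made for illustration.

```python
# Age group of each genome relative to human, as described in the text.
AGE_GROUP = {
    "human": "young", "chimpanzee": "young",
    "dog": "middle", "mouse": "middle", "rat": "middle",
    "chicken": "old", "zebrafish": "old", "fugu": "old",
}
GROUP_RANK = {"young": 0, "middle": 1, "old": 2}


def classify_human_exon_age(species_with_ortholog):
    """Return 'young', 'middle' or 'old' for a human exon.

    species_with_ortholog: iterable of species names in which an
    ortholog was found. Intermediate lineages need not be present;
    only the most distant group matters.
    """
    present = set(species_with_ortholog) | {"human"}
    unknown = present - set(AGE_GROUP)
    if unknown:
        raise ValueError(f"unknown species: {sorted(unknown)}")
    return max((AGE_GROUP[s] for s in present), key=GROUP_RANK.__getitem__)


age_specific = classify_human_exon_age([])            # human-specific exon
age_mammal = classify_human_exon_age(["chimpanzee", "mouse"])
age_fish = classify_human_exon_age(["fugu"])          # no intermediates needed
```

The same structure would apply to mouse exons with the five-genome grouping, by swapping in a mouse-referenced table.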
The presence of orthologous exons in other genomes was determined from the multiple genome alignments of the University of California, Santa Cruz Genome Bioinformatics Site. We downloaded the human-referenced eight-genome alignment (hg17, multiz8way). The species in the multiple genome alignments are discussed above. The corresponding orthologous sequence of the second species was extracted according to the coordinates of the human exon, if it was present in the UCSC alignment. The orthologous sequence was considered a conserved exon only if the AG and GT dinucleotides bordering the exon in the second species were conserved. This criterion has also been adopted by Zhang et al. For mouse exons, we likewise downloaded the mouse-referenced five-genome alignment (mm5, multiz5way) and determined their orthologous exons in other species.
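The splice-site criterion can be illustrated with a short check on the orthologous sequence. This sketch assumes that the sequence of the second species is available as a plain string on the transcribed strand, with the candidate exon at 0-based, half-open coordinates; the function name is hypothetical.

```python
def splice_sites_conserved(seq, exon_start, exon_end):
    """True if the exon seq[exon_start:exon_end] is flanked by an
    upstream AG (acceptor) and a downstream GT (donor) dinucleotide.

    Coordinates are 0-based, half-open (an assumption of this sketch).
    """
    if exon_start < 2 or exon_end + 2 > len(seq):
        return False  # flanking intronic bases are not available
    acceptor = seq[exon_start - 2:exon_start].upper()
    donor = seq[exon_end:exon_end + 2].upper()
    return acceptor == "AG" and donor == "GT"


# Lowercase = intron, uppercase = exon (toy sequences).
conserved = splice_sites_conserved("ttttagATGGCCgtaaaa", 6, 12)
# The donor site is mutated from GT to GC.
mutated = splice_sites_conserved("ttttagATGGCCgcaaaa", 6, 12)
```

An exon on the reverse strand would need to be reverse-complemented before applying the same test.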
Many thanks to Jin Gu, Pufeng Du and Jing Zhang for helpful discussions. We thank Xuesong Lu for the assistance in microarray data analysis and Greg Ziegler for his help in proofreading the manuscript. We also thank the anonymous reviewers for helpful suggestions. This work is supported in part by NSFC (grants 60572086, 60775002, 30625012 and 60702002), China Postdoctoral Science Foundation (20060400060) and Tsinghua Basic Research Foundation.
- Graveley BR: Alternative splicing: increasing diversity in the proteomic world. Trends Genet. 2001, 17 (2): 100-107. 10.1016/S0168-9525(00)02176-4.
- Blencowe BJ: Alternative splicing: New insights from global analyses. Cell. 2006, 126 (1): 37-47. 10.1016/j.cell.2006.06.023.
- Kriventseva EV, Koch I, Apweiler R, Vingron M, Bork P, Gelfand MS, Sunyaev S: Increase of functional diversity by alternative splicing. Trends Genet. 2003, 19 (3): 124-128. 10.1016/S0168-9525(03)00023-4.
- Kan Z, Rouchka EC, Gish WR, States DJ: Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res. 2001, 11 (5): 889-900. 10.1101/gr.155001.
- Kim E, Magen A, Ast G: Different levels of alternative splicing among eukaryotes. Nucleic Acids Res. 2007, 35 (1): 125-131. 10.1093/nar/gkl924.
- Modrek B, Lee C: A genomic view of alternative splicing. Nat Genet. 2002, 30 (1): 13-19. 10.1038/ng0102-13.
- Modrek B, Resch A, Grasso C, Lee C: Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res. 2001, 29 (13): 2850-2859. 10.1093/nar/29.13.2850.
- Buratti E, Baralle M, Baralle FE: Defective splicing, disease and therapy: searching for master checkpoints in exon definition. Nucleic Acids Res. 2006, 34 (12): 3494-3510. 10.1093/nar/gkl498.
- Caceres JF, Kornblihtt AR: Alternative splicing: multiple control mechanisms and involvement in human disease. Trends Genet. 2002, 18 (4): 186-193. 10.1016/S0168-9525(01)02626-9.
- Cartegni L, Chew SL, Krainer AR: Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nat Rev Genet. 2002, 3 (4): 285-298. 10.1038/nrg775.
- Pagani F, Baralle FE: Genomic variants in exons and introns: identifying the splicing spoilers. Nat Rev Genet. 2004, 5 (5): 389-396. 10.1038/nrg1327.
- Goren A, Ram O, Amit M, Keren H, Lev-Maor G, Vig I, Pupko T, Ast G: Comparative analysis identifies exonic splicing regulatory sequences – The complex definition of enhancers and silencers. Molecular cell. 2006, 22 (6): 769-781. 10.1016/j.molcel.2006.05.008.
- Wang ZF, Xiao XS, Van Nostrand E, Burge CB: General and specific functions of exonic splicing silencers in splicing control. Molecular cell. 2006, 23 (1): 61-70. 10.1016/j.molcel.2006.05.018.
- Ule J, Stefani G, Mele A, Ruggiu M, Wang XN, Taneri B, Gaasterland T, Blencowe BJ, Darnell RB: An RNA map predicting Nova-dependent splicing regulation. Nature. 2006, 444 (7119): 580-586. 10.1038/nature05304.
- Modrek B, Lee CJ: Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat Genet. 2003, 34 (2): 177-180. 10.1038/ng1159.
- Ast G: How did alternative splicing evolve?. Nat Rev Genet. 2004, 5 (10): 773-782. 10.1038/nrg1451.
- Yeo GW, Van Nostrand E, Holste D, Poggio T, Burge CB: Identification and analysis of alternative splicing events conserved in human and mouse. Proc Natl Acad Sci USA. 2005, 102 (8): 2850-2855. 10.1073/pnas.0409742102.
- Wang BB, Brendel V: Genomewide comparative analysis of alternative splicing in plants. Proc Natl Acad Sci USA. 2006, 103 (18): 7175-7180. 10.1073/pnas.0602039103.
- Sugnet CW, Kent WJ, Ares M, Haussler D: Transcriptome and genome conservation of alternative splicing events in humans and mice. Pacific Symposium on Biocomputing. 2004, 66-77.
- Koren E, Lev-Maor G, Ast G: The emergence of alternative 3' and 5' splice site exons from constitutive exons. PLoS computational biology. 2007, 3 (5): e95. 10.1371/journal.pcbi.0030095.
- Fededa JP, Petrillo E, Gelfand MS, Neverov AD, Kadener S, Nogues G, Pelisch F, Baralle FE, Muro AF, Kornblihtt AR: A polar mechanism coordinates different regions of alternative splicing within a single gene. Molecular cell. 2005, 19 (3): 393-404. 10.1016/j.molcel.2005.06.035.
- Xing Y, Resch A, Lee C: The multiassembly problem: reconstructing multiple transcript isoforms from EST fragment mixtures. Genome Res. 2004, 14 (3): 426-441. 10.1101/gr.1304504.
- Emerick MC, Parmigiani G, Agnew WS: Multivariate Analysis and Visualization of Splicing Correlations in Single-Gene Transcriptomes. BMC Bioinformatics. 2007, 8 (1): 16. 10.1186/1471-2105-8-16.
- Fagnani M, Barash Y, Ip J, Misquitta C, Pan Q, Saltzman AL, Shai O, Lee L, Rozenhek A, Mohammad N: Functional coordination of alternative splicing in the mammalian central nervous system. Genome Biol. 2007, 8 (6): R108. 10.1186/gb-2007-8-6-r108.
- Stamm S, Riethoven JJ, Le Texier V, Gopalakrishnan C, Kumanduri V, Tang YS, Barbosa-Morais NL, Thanaraj TA: ASD: a bioinformatics resource on alternative splicing. Nucleic Acids Res. 2006, 34: D46-D55. 10.1093/nar/gkj031.
- Kim N, Alekseyenko AV, Roy M, Lee C: The ASAP II database: analysis and comparative genomics of alternative splicing in 15 animal species. Nucleic Acids Res. 2007, 35 (Database issue): D93-98. 10.1093/nar/gkl884.
- Carmel I, Tal S, Vig I, Ast G: Comparative analysis detects dependencies among the 5' splice-site positions. RNA. 2004, 10 (5): 828-840.
- Sorek R, Shemesh R, Cohen Y, Basechess O, Ast G, Shamir R: A non-EST-based method for exon-skipping prediction. Genome Res. 2004, 14 (8): 1617-1623. 10.1101/gr.2572604.
- Xia H, Bi J, Li Y: Identification of alternative 5'/3' splice sites based on the mechanism of splice site competition. Nucleic Acids Res. 2006, 34 (21): 6305-6313. 10.1093/nar/gkl900.
- Smith CWJ: Alternative splicing – When two's a crowd. Cell. 2005, 123 (1): 1-3. 10.1016/j.cell.2005.09.010.
- Fox-Walsh KL, Dou Y, Lam BJ, Hung SP, Baldi PF, Hertel KJ: The architecture of pre-mRNAs affects mechanisms of splice-site pairing. Proc Natl Acad Sci USA. 2005, 102 (45): 16176-16181. 10.1073/pnas.0508489102.
- Sorek R, Shamir R, Ast G: How prevalent is functional alternative splicing in the human genome?. Trends Genet. 2004, 20 (2): 68-71. 10.1016/j.tig.2003.12.004.
- Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D: The UCSC Known Genes. Bioinformatics. 2006, 22 (9): 1036-1046. 10.1093/bioinformatics/btl048.
- Zhang XHF, Chasin LA: Comparison of multiple vertebrate genomes reveals the birth and evolution of human exons. Proc Natl Acad Sci USA. 2006, 103 (36): 13427-13432. 10.1073/pnas.0603042103.
- Xing Y, Lee CJ: Protein modularity of alternatively spliced exons is associated with tissue-specific regulation of alternative splicing. PLoS genetics. 2005, 1 (3): e34. 10.1371/journal.pgen.0010034.
- Philipps DL, Park JW, Graveley BR: A computational and experimental approach toward a priori identification of alternatively spliced exons. RNA. 2004, 10 (12): 1838-1844.
- Pritsker M, Doniger TT, Kramer LC, Westcot SE, Lemischka IR: Diversification of stem cell molecular repertoire by alternative splicing. Proc Natl Acad Sci USA. 2005, 102 (40): 14290-14295. 10.1073/pnas.0502132102.
- Xing Y, Lee C: Evidence of functional selection pressure for alternative splicing events that accelerate evolution of protein subsequences. Proc Natl Acad Sci USA. 2005, 102 (38): 13526-13531. 10.1073/pnas.0501213102.
- Plass M, Eyras E: Differentiated evolutionary rates in alternative exons and the implications for splicing regulation. BMC Evol Biol. 2006, 6: 50. 10.1186/1471-2148-6-50.
- Kondrashov FA, Koonin EV: Origin of alternative splicing by tandem exon duplication. Hum Mol Genet. 2001, 10 (23): 2661-2669. 10.1093/hmg/10.23.2661.
- Letunic I, Copley RR, Bork P: Common exon duplication in animals and its role in alternative splicing. Hum Mol Genet. 2002, 11 (13): 1561-1567. 10.1093/hmg/11.13.1561.
- Sorek R, Ast G, Graur D: Alu-containing exons are alternatively spliced. Genome Res. 2002, 12 (7): 1060-1067. 10.1101/gr.229302.
- Lev-Maor G, Sorek R, Shomron N, Ast G: The birth of an alternatively spliced exon: 3' splice-site selection in Alu exons. Science. 2003, 300 (5623): 1288-1291. 10.1126/science.1082588.
- Alekseyenko AV, Kim N, Lee CJ: Global analysis of exon creation versus loss and the role of alternative splicing in 17 vertebrate genomes. RNA. 2007, 13 (5): 661-670. 10.1261/rna.325107.
- Pan Q, Shai O, Misquitta C, Zhang W, Saltzman AL, Mohammad N, Babak T, Siu H, Hughes TR, Morris QD: Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Molecular cell. 2004, 16 (6): 929-941. 10.1016/j.molcel.2004.12.004.
- Consortium GO: The Gene Ontology (GO) database and informatics resource. Nucl Acids Res. 2004, 32 (suppl_1): D258-261. 10.1093/nar/gkh036.
- Castillo-Davis CI, Hartl DL: GeneMerge–post-genomic analysis, data mining, and hypothesis testing. Bioinformatics. 2003, 19 (7): 891-892. 10.1093/bioinformatics/btg114.
- Neugebauer KM: Please hold – the next available exon will be right with you. Nat Struct Mol Biol. 2006, 13 (5): 385-386. 10.1038/nsmb0506-385.
- Dye MJ, Gromak N, Proudfoot NJ: Exon tethering in transcription by RNA polymerase II. Molecular cell. 2006, 21 (6): 849-859. 10.1016/j.molcel.2006.01.032.
- Zhu J, Shendure J, Mitra RD, Church GM: Single molecule profiling of alternative pre-mRNA splicing. Science. 2003, 301 (5634): 836-838. 10.1126/science.1085792.
- UCSC Genome Browser. [http://genome.ucsc.edu/]
- BioMart. [http://www.biomart.org]
- RepeatMasker Open-3.0. [http://www.repeatmasker.org]
- Hedges SB, Kumar S: Vertebrate genomes compared. Science. 2002, 297 (5585): 1283-1285. 10.1126/science.1076231.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.