Fast rate of evolution in alternatively spliced coding regions of mammalian genes

Background At least half of mammalian genes are alternatively spliced. Alternative isoforms are often genome-specific and it has been suggested that alternative splicing is one of the major mechanisms for generating protein diversity in the course of evolution. Another way of looking at alternative splicing is to consider sequence evolution of constitutive and alternative regions of protein-coding genes. Indeed, it turns out that constitutive and alternative regions evolve in different ways. Results A set of 3029 orthologous pairs of human and mouse alternatively spliced genes was considered. The rate of nonsynonymous substitutions (dN), the rate of synonymous substitutions (dS), and their ratio (ω = dN/dS) appear to be significantly higher in alternatively spliced coding regions compared to constitutive regions. When N-terminal, internal and C-terminal alternatives are analysed separately, C-terminal alternatives appear to make the main contribution to the observed difference. The effects become even more pronounced in a subset of fast evolving genes. Conclusion These results provide evidence of weaker purifying selection and/or stronger positive selection in alternative regions and thus one more confirmation of accelerated evolution in alternative regions. This study corroborates the theory that alternative splicing serves as a testing ground for molecular evolution.


Background
Alternative splicing is a major mechanism for generating functional and evolutionary diversity of proteins in mammals [1,2], for a review see [3]. Indeed, alternative splicing allows for generation of novel proteins without sacrificing old ones [2]. If a new isoform proves to be beneficial, its fraction increases by subtle regulatory changes. On the other hand, unlike gene duplication, alternative splicing does not lead to dramatic changes in protein concentrations. Moreover, it has been demonstrated that duplicated genes are rarely alternatively spliced compared to singletons [4,5].
There are good reasons to believe that some key mutational events driving evolution might reside in introns, untranslated regions (UTRs) and/or nontranscribed regulatory regions [6][7][8]. A large fraction of alternative splicing events occur in untranslated regions [9]. Nevertheless, most studies of molecular evolution have focused on the analysis of protein coding regions, as these data are sim-pler to obtain and are more amenable to functional interpretation.
From this point of view, alternative regions of genes occupy an intermediate position. Alternatively spliced regions are often evolutionary young: indeed, about a half of genes in human-mouse orthologous pairs have speciesspecific isoforms [2,10]. In many respects, constitutive and alternative regions are organized in different ways. Alternative human splice sites are on the average weaker than constitutive ones [11]. Non-canonical GC-AG introns tend to be alternative [12]. Among human exons conserved in mouse, about 77% of alternative cassette exons are flanked on both sides by long conserved intronic sequences, compared to only 17% of the constitutive exons [13]. Overall, statistical and evolutionary features of constitutive and alternative exons are sufficiently different to provide for computational recognition of these exons [14][15][16].
In several early studies it has been observed that patterns of nucleotide substitutions are different in alternative and constitutive coding regions. Iida and Akashi [17] analysed 26 pairs of alternatively spliced human genes and their non-human mammalian orthologs and demonstrated that synonymous divergence was lower and the nonsynonymous divergence was higher in alternative regions compared to constitutive regions. Evidence for diversifying selection was observed in alternative regions of CD45 [18], whereas the reduced rate of synonymous substitutions in an alternative region of BRCA1 [19] was assigned to purifying selection due to exonic splicing enhancer sites [20]. Recently, lower synonymous divergence in alternative exons compared to constitutive ones was demonstrated in a large-scale study of human, chimpanzee, mouse and rat genes [21].
Here we analyze evolutionary patterns in a set of 3029 pairs of orthologous human and mouse genes. We consider all types of alternative splicing and analyze separately 5'-, internal, and 3'-regions of genes, as well as faster and slower evolving genes.

Results
We considered 3029 alternatively spliced human genes and their mouse orthlogs ( Figure 1). The sample was divided into three bins of equal size with respect to nucleotide identity in coding regions. Nucleotide alignments of coding regions were sliced into constitutive (C) and alternative (A) fragments. Alternative fragments were further sorted into N-terminal (AN), internal (AI), and C-terminal (AC). To reveal the general pattern of evolution in these regions, we estimated the amino acid identity (Id), the nonsynonymous substitution rate (d N ), the synonymous substitution rate (d S ), and ω = d N /d S (Table 1). Glo-bal meta-alignments (concatenated alignment fragments across all considered genes and across three rate bins, see Methods) of five types (C, A, AN, AI, AC) were used. We It turned out that Id(A)<Id(C), dN(A)>dN(C), and ω(A)>ω(C) for alternatively spliced genes irrespective of the rate of evolution. These results show that negative selection is weaker and/or positive selection is stronger in alternative regions and thus confirm that alternatively spliced coding regions are hotspots of molecular evolution. Unexpectedly, dN and ω rise dramatically at the Cterminal alternative regions (Figures 3 and 5).
The pattern of synonymous substitutions is more complicated, as it depends on the rate of evolution ( Figure 4). The general pattern is that d S in alternative regions increases in the 5' to 3' direction. In genes evolving at the medium rate, d S (AN)<d S (AI)<d S (AC), whereas in fast evolving genes d S (AI)>d S (AC).
For control, we considered N-terminal and C-terminal constitutively spliced regions and performed similar analysis. All computed evolutionary parameters were the same as in the constitutive regions in general (data not shown). Thus the observed difference cannot be explained simply by faster evolution at gene termini.
The data flow through the analysis pipeline Figure 1 The data flow through the analysis pipeline.
Next, we analyzed individual gene pairs. For each of 2358 genes with the total lengths of constitutive and of alternative regions both exceeding 80 bp, ω for constitutive (ω C ) and alternative (ω A ) regions was calculated separately. Figure 6a represents the distribution of the difference (ω Cω A ). The distribution is skewed, showing that ω tends to be greater in alternative regions. We used the chi-squared test to compare the distributions of |ω C -ω A | in the case ω C >ω A and in the case ω C <ω A . The null hypothesis that the distributions were the same was rejected at the significance level 10 -7 . When N-terminal, internal, and C-terminal alternatives were considered separately, the effect was the same (Figure 6b-d). The null hypothesis that the distribution of |ω C -ω AN | was symmetrical was rejected at the significance level 10 -9 , the one that the distribution of |ω Cω AI | was symmetrical, at the significance level 10 -3 , and the one that the distribution of |ω C -ω AC | was symmetrical, at the significance level 10 -2 . Therefore, the detailed analysis of individual genes confirmed the observations made on concatenated alignments.

Discussion
Evolutionary patterns in different functional regions are known to be significantly different. Conserved genes are duplicated relatively more often [22], although shortly after duplication the evolutionary rate might increase, as the purifying selection is weaker [23,24], and the selection pattern in the two copies may be different [25]. Duret and Mouchiroud [26] observed lower nonsynonymous divergence in genes expressed in multiple tissues when compared to genes with more limited expression patterns, whereas the synonymous substitution rate was roughly the same. Similarly, Pál, Papp and Hurst [27] demonstrated that highly expressed genes tend to be more conserved then genes expressed at a lower level. Our results are consistent with these observations if one assumes that constitutive regions are expressed in more tissues and at a higher rate that alternative ones. Indeed, the former assumption holds for genes with isoforms having clear tis-Nonsynonymous substitution rate (d N ) in alignments of con-stitutive and alternative coding regions of 3029 alternatively spliced genes genes classified by the overall evolution rate Young gene regions tend to evolve faster. Several studies [23][24][25]28] demonstrated post-duplicational relaxation of purifying selection in paralogs. Our results provided evidence of stronger positive selection and/or weaker purifying selection in alternative gene regions.
One possible explanation for our observations could be that the data are contaminated by non-functional isoforms (hence relaxation of purifying selection takes place). We do not believe that to be the case for the following reasons. Firstly, these regions were conserved between human and mouse at a sufficiently high similarity level of 70% nucleotide identity. Secondly, the observed pattern of increased d N level in alternative regions was the most pronounced in 3' (C-terminal) regions, that are the most reliable as regards gene recognition and have higher EST coverage due to polyA-primed ESTs.
As we considered only alternatives derived from RefSeq proteins, we could miss some alternatives and thus label a fraction of the alternative regions as constitutive. However, that could only contaminate the constitutive sample with alternative regions and thus blur the observed differences, but not create any spurious effect.
Recently Xing and Lee [21] observed similar rate of nonsynonymous substitutions in alternative and constitutive regions whereas the rate of synonymous substitutions was lower in alternative regions, especially in tissue-specific exons [29]. One possible explanation for that was based on the assumption that conserved alternative exons contain more candidate splicing enhancer sites than constitutive ones [15]. As such sites could be expected to be conserved, like in BRCA1 [19,20], that could lead to higher conservation of synonymous codon positions in alternative regions compared to constitutive ones. However, this explanation seems to be incorrect, since, although indeed d S is lower in splicing enhancers, the fractions of constitutive and alternative regions covered by splicing enhancers are the same [30], and if the RNA selection pressure is the same in alternative and constitutive regions, it cannot distort the measurement of ω [31].
Selection measure ω = d N /d S in alignments of constitutive and alternative coding regions of 3029 alternatively spliced genes classified by the overall evolution rate However, this effect has not been observed in our study, and the substitution rates differ from those in [21]. As our results are consistent and statistically significant for all classes of genes (fast, medium and slow evolving) and all gene regions (N-terminal, internal, C-terminal), and do not seem to be caused by contamination, there should be other reasons for this discrepancy. One of them could be the fact that we considered all types of alternatives, as opposed to only cassette exons in other studies. We also considered short alternative regions, its skipping or inclusion might be regulated "outside". Another one could be the use of different methods to calculate the rates of evolution. We used our own implementation of the first method of Ina [32] here, as we needed a tool for very long alignments (~ 3·10 6 bp), whereas Xing and Lee [21,29] used a maximum likelihood method implemented in the PAML package [33]. On the other hand, we considered only RefSeq isoforms and did not distinguish between the minor and major isoform alternatives.
An explanation for our finding could be that the total length of regulatory sites experiencing purifying selection is still small compared to the total length of alternative regions. The pattern of substitutions in insects is less consistent [34]: in N-terminal alternatives, the synonymous rate is higher than in constant regions, whereas in internal alternatives, there are more amino acid substitutions, similar to our observations here.

Conclusion
Overall, this study corroborates the idea that alternative splicing serves as a testing ground for molecular evolution. Several lines of evidence confirm this hypothesis: (i) alternatively spliced isoforms are often evolutionary young both in mammals [2,10] and in insects [35]; (ii) the rate of nonsynonymous substitutions is higher in alternative regions compared to constitutive ones (this study), (iii) constitutive exons in genes with genome-specific alternative splicing evolve faster than constitutive  [23][24][25]), (iv) many young (rodent-specific, missing in human and pig as an outgroup) exons are alternatively spliced and tend to have ω>1 in the mouse-rat comparison [30], and (v) the frequency of nonsynonymous SNPs in human genes is higher in alternative regions than in constitutive regions [37].

Definitions
In an alternatively spliced gene, constitutive regions are defined as the ones that are always exonic and coding, and alternative regions as the ones that are either coding or spliced out. An exon can be either completely constitutive, or completely alternative, or non-coding, or consist of constitutive, alternative and non-coding regions.
A local meta-alignment is a concatenate of all alignment fragments of a fixed type (for example, coding alternative regions) for one particular gene.
A global meta-alignment is the concatenate of local metaalignments of a fixed type for all genes of a fixed group (for example, for all fast-evolving genes).

Data
Human and mouse mRNA sequences were taken from the NCBI RefSeq database [38] and orthologs were identified and aligned as described previously [39].
Overall, 12356 pairs of orthologous human and mouse genes were considered. The data flow through the analysis pipeline is shown in Figure 1. Out of 12356 human genes, 5754 genes had more than one protein isoform in the EDAS database of alternatively spliced genes [40]. These proteins were mapped to the human-mouse mRNA alignments using Pro-Frame [41] and the results were parsed using a set of Perl scripts that identified constitutive and alternative coding fragments of the human genes and their reading frames. Alternatives confirmed by protein sequences and read in a single frame were identified in 3079 genes. We further restricted the dataset to 3029 pairs with more than 70% nucleotide identity between the human and mouse genes, as we doubted that the other ones were reliable.
2358 genes were selected for individual substitution rate analysis. These were the ones with both the total length of the human-mouse alignment length of the alternative regions and that of the constitutive regions exceeding 80 base pairs.

Data classification
We grouped genes with comparable average substitution rates and formed three bins of equal size: slow, mediumspeed, and fast-evolving genes.
We also considered alternative coding regions corresponding to protein N-terminal, middle, and C-terminal parts separately.
Distributions of ω C -ω A , ω C -ω AN , ω C -ω AI , ω C -ω AC Figure 6 Distributions of ω C -ω A , ω C -ω AN , ω C -ω AI , ω C -ω AC . a. The distribution of ω C -ω A for 2358 genes with the total length of human-mouse alignments of constitutive regions and of alternative regions both exceeding 80 bp shows that ω tends to be larger in alternative regions. In particular, there are 23 genes with ω C -ω A <-0.8 and no genes with ω C -ω A >0.8. b, c, d. The distributions of ω C -ω AN , ω C -ω AI , ω C -ω AC for genes with long N-terminal (1674 genes, top, ω C -ω AN ), internal (976 genes, middle, ω C -ω AI ), and C-terminal (110 genes, bottom, ω C -ω AC ) alternative regions, respectively, show that ω A* tends to be larger than ω c in all types of alternative regions. the grey shadows are symmetrical to the left parts of the histograms.