Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis
© Campbell et al. 2006
Received: 13 June 2006
Accepted: 28 December 2006
Published: 28 December 2006
Skip to main content
© Campbell et al. 2006
Received: 13 June 2006
Accepted: 28 December 2006
Published: 28 December 2006
Recently, genomic sequencing efforts were finished forOryza sativa(cultivated rice) andArabidopsis thaliana(Arabidopsis). Additionally, these two plant species have extensive cDNA and expressed sequence tag (EST) libraries. We employed the Program to Assemble Spliced Alignments (PASA) to identify and analyze alternatively spliced isoforms in both species.
A comprehensive analysis of alternative splicing was performed in rice that started with >1.1 million publicly available spliced ESTs and over 30,000 full length cDNAs in conjunction with the newly enhanced PASA software. A parallel analysis was performed with Arabidopsis to compare and ascertain potential differences between monocots and dicots. Alternative splicing is a widespread phenomenon (observed in greater than 30% of the loci with transcript support) and we have described nine alternative splicing variations. While alternative splicing has the potential to create many RNA isoforms from a single locus, the majority of loci generate only two or three isoforms and transcript support indicates that these isoforms are generally not rare events. For the alternate donor (AD) and acceptor (AA) classes, the distance between the splice sites for the majority of events was found to be less than 50 basepairs (bp). In both species, the most frequent distance between AA is 3 bp, consistent with reports in mammalian systems. Conversely, the most frequent distance between AD is 4 bp in both plant species, as previously observed in mouse. Most alternative splicing variations are localized to the protein coding sequence and are predicted to significantly alter the coding sequence.
Alternative splicing is widespread in both rice and Arabidopsis and these species share many common features. Interestingly, alternative splicing may play a role beyond creating novel combinations of transcripts that expand the proteome. Many isoforms will presumably have negative consequences for protein structure and function, suggesting that their biological role involves post-transcriptional regulation of gene expression.
Cultivated rice (Oryza sativa) is considered a model for species within the Poaceae family and, in particular, for agronomically important cereals such as maize (Zea mays), wheat (Triticum aestivum), and barley (Hordeum vulgare). Recently, map-based sequencing of ajaponica subspecies of rice has been completed with the final sequence representing ~95% of the estimated 389 Mb genome . In addition to having a nearly complete genomic sequence, rice has 33,799 full-length cDNA (FL-cDNAs) sequences as well as a collection of >1.15 million publicly available expressed sequence tags (ESTs) and cDNAs [2,3]. The rice genome assembly, based upon the TIGR annotation effort , contains 12 pseudomolecules totaling ~372 Mb of sequence and 55,890 genes (of which 13,327 are identified as transposable element (TE)-related).
In this study, we performed a comprehensive analysis of alternative splicing in rice, a model monocotyledonous species, using the latest set of rice transcript data and the Program to Assemble Spliced Alignments (PASA). The PASA software  assembles and clusters spliced transcript alignments, providing transcript-based gene structures that are used to automatically improve existing gene annotations by adding untranslated regions (UTRs), adjusting intron and exon boundaries, and adding new models that represent alternative splicing, among its numerous other functions. Alignment assembly, as performed via PASA, is particularly well-suited to the study of alternative splicing. PASA assembles overlapping and compatible alignments into maximal alignment assemblies; compatible alignments are defined as overlapping alignments that are transcribed on the same strand and have identical introns in their regions of overlap. If all overlapping transcripts yield consistent gene structures, they are assembled into a single alignment assembly, and by doing so, PASA acts to simply consolidate the transcript alignments into a single gene structure. However, overlapping transcript alignments that have different introns in their region of overlap, due to a splicing variation, are found incompatible and cannot be assembled together, and hence occupy distinct maximal alignment assemblies. After assembling all overlapping alignments, those distinct PASA assemblies found overlapping at the same locus and transcribed in the same orientation yield representative splicing isoforms for that locus.
Many algorithms have been introduced to reconstruct gene structures from spliced alignments of transcripts to genome sequences [6–9]. These most often decompose the alignments into introns and exons, build a splicing graph data structure  and traverse the graph to find combinations of introns and exons that are best supported by the underlying data under some scoring scheme. An inherent danger of splicing graphs is that they have the potential to report combinations of introns and exons for which there is no evidence in existing transcript alignments. Alternative strategies to reconstruct isoforms assemble individual whole alignments instead of their component introns and exons, and by doing so, retain distal correlations between non-adjacent exons found within single transcript alignments [5,11,12]. Isoform reconstruction (alignment assembly) as performed by PASA is strictly an exercise in dynamic programming to chain (assemble) together a compatible series of whole transcript alignments such that each alignment is found within a maximal chain of compatible alignments . This generates an isoform structure supported by the largest number of compatible whole EST and whole cDNA alignments.
Our comprehensive analysis confirms that alternative splicing is a widespread phenomenon, and also confirms prior research into alternative splicing in both rice and Arabidopsis [5,13–17] The most recent analysis for these species identified 14,452 alternative splicing events affecting 6,586 rice genes and 8,264 alternative splicing events affecting 4,707 Arabidopsis genes . Our analysis incorporates new data to extend these numbers to 36,650 events affecting 8,772 rice genes and 16,252 events affecting 5,313 Arabidopsis genes. We also note some parallels with alternative splicing in mammalian species. For example, alternate acceptor (AA) splice sites for these two plant species are most often separated by only three base pairs (bp); a feature observed in mammalian species [19–22]. However, when looking at the AA class overall for rice and Arabidopsis, the majority of these events are predicted to result in a downstream frameshift in translation, consistent with research in mouse . Alternate donor (AD) splice sites are most often found to be separated by only four bp in both species, and this maximal occurrence was previously observed in mouse with the majority of the AD isoforms yielding a frameshift . The retained intron (RI) class of alternative splicing is prevalent in both rice and Arabidopsis and often introduces a premature stop codon in the intron. These three classes (AA, AD, and RI) represent the majority of the alternative splicing events in both rice and Arabidopsis with the occurrence of the remaining classes being far less frequent. Further, the frequent introduction of either a frameshift or premature stop codon by the three predominant classes of splicing events suggests that post-transcriptional regulation via nonsense mediated decay (NMD) may play a more important role than augmenting the proteome.
Summary statistics are given for the PASA-generated assemblies in rice and Arabidopsis.
35,078 (33,799 FL)
66,642 (61,117 FL)
1,190,502 (33,797 FL)
683,137 (61,113 FL)
Map to Genome
1,149,730 (32,349 FL)
671,774 (61,085 FL)
1,055,860 (30,099 FL)
603,234 (56,668 FL)
Excluding Single Exon Transcripts
501,379 (30,099 FL)
369,092 (56,668 FL)
PASA alignment assemblies
43,869 (23,308 FL)
29,183 (22,951 FL)
PASA assembly clusters
27,497 (19,888 FL)
21,922 (18,701 FL)
We limited our analysis of alternative splicing to only those transcript alignments that have a nearly perfect alignment to the genome (e.g. 90% of their length at greater than 95% identity). Our overall averages for length and identity exceeded 99% (data not shown). Additionally, all exon boundaries required canonical splice sites (e.g. GT-AG, GC-AG, AT-AC) [25–29]. Finally, since we are focusing on alternatively spliced transcripts, it is essential that we have high confidence in the mapping of the exon-intron boundaries. We have found that spliced alignment programs will sometimes introduce gaps and mismatches within the exon sequence adjacent to the splice site in order to include a consensus splice site in the resulting alignment. We required that each alignment segment have at least three exact bp matches directly adjoining the consensus donor and acceptor splice site in order to exclude this potential artifact. Approximately 90–92% of rice and Arabidopsis transcript alignments met all of these strict validation criteria. One potential source of invalidation here would be due to the use of transcribed sequences from non-japonica cultivars of rice or non-Columbia ecotypes for Arabidopsis.
Using PASA, we assembled the 501,379 spliced rice transcript alignments into 43,869 alignment assemblies, 23,308 of which contain FL-cDNAs (Table 1). We also assembled 369,092 spliced Arabidopsis transcript alignments into 29,183 alignment assemblies, of which 22,951 include FL-cDNAs (Table 1). For ESTs that met all other mapping criteria but lack evidence for being spliced (i.e. they are a single exon), we excluded these transcripts from being assembled. This invalidation step will minimize the spurious inclusion of incompletely processed transcripts in the retained intron class. The alignment assemblies, each of which corresponds to at least one distinct mRNA isoform, served as the substrate for this current and comprehensive analysis of alternative splicing in higher plants.
The alignment assemblies were clustered based on overlapping genome coordinates and transcribed orientations in order to group together those assemblies that correspond to the same genomic locus or gene. Each cluster of PASA alignment assemblies can contain either single or multiple transcript alignment assemblies. A cluster comprising multiple assemblies indicates splicing variations mapped to the same genomic locus. A total of 18,506 rice and 16,763 Arabidopsis clusters comprise only a single assembly (i.e. a single transcript isoform), thereby lacking evidence of alternative splicing. However, 8,991 (32.7%) rice clusters have more than one assembly and range from 5,151 clusters having two assemblies to two clusters having 17 distinct assemblies [seeAdditional file 1]. The results for Arabidopsis are comparable to rice. The majority of clusters supporting alternative splicing contain just two assemblies; 57.3% of the multiple assembly containing clusters for rice and 72.3% in Arabidopsis [seeAdditional file 1]. The data from both species demonstrate that as the number of assemblies per cluster increases, the corresponding number of clusters decreases precipitously. While alternative splicing can produce a wide variety of isoforms, such as the 38,106 possible forms for theDrosophila Dscam gene , these data in rice and Arabidopsis clearly indicate that alternative splicing produces only a limited number of isoforms in plants. Given that PASA alignment assemblies in clusters of multiples are indicative of alternative splicing variations, and each assembly is representative of a given (partial or complete) splicing isoform of a gene, we henceforth use the term assembly and isoform interchangeably.
The AA and AD classes involve isoforms with a shared intron, overlap among adjacent exons, but different splice sites chosen for either the donor or the acceptor sites.
The ATE class is defined as having two completely distinct sets of terminal (either 5' or 3') exons that possess no overlapping exon sequence. The RE class includes a distinct exon in one isoform that is spliced out of its corresponding isoform (e.g. the SE isoform). The IWI and TWI classes are conceptually similar where either the 5' or the 3' end of the transcript isoform occurs in an intron of its longer isoform. The final alternative splicing class, RI, is defined by the alternatively spliced isoform retaining an intron that is found spliced in the corresponding SI isoform.
The support for alternative splicing derived from PASA-mediated assemblies can correspond to either a single transcript or multiple independent transcripts. The number of underlying individual transcript alignments exhibiting a specific splicing variation may reflect thein vivo prevalence of the variation as compared with the number of transcripts exhibiting a mutually exclusive variation. Caution is needed when only a single transcript supports a variation, as this may reflect a cDNA that was captured prior to the completed processing of the pre-mRNA by the spliceosome, or a possible aberration in transcription . Ratios of the total number of underlying transcripts responsible for each mutually exclusive pair of splicing variations found via pairwise comparisons of isoforms were calculated, and this value allowed us to infer the major and minor isoforms. This ratio is termed the transcript ratio (see Materials and Methods for details). The analysis presented here requires that each variation be supported by two or more transcripts.
Ratios of the number of transcripts supporting the alternative splicing classes of skipped exon (SE), initiation within intron (IWI), termination within intron (TWI), and retained intron (RI).
Summation of the distribution of the nine alternative splicing classes by events, assembly, cluster and gene for rice and Arabidopsis.
One caveat required when looking at the breakdown of alternative splicing by class in Table 3is that the sums of the nine splicing classes exceed the totals presented above. This disparity is due to the fact that a single cluster can contain multiple assemblies [seeAdditional file 1] and each assembly can contain several of the alternative splicing classes when compared to the other assemblies in the shared cluster. The four most frequently observed alternative splicing classes in both rice and Arabidopsis are AA, SI, RI, and AD, with the AA class being the most frequent. Of the remaining five classes (ATE, TWI, SE, RE and IWI), the percentages for the assembly comparisons are higher in each case for rice than in Arabidopsis but with a similar relative distribution.
When considering the abundance of splicing variations found on a cluster basis, the reciprocal RI/SI and symmetrical AA classes occur most frequently in rice and Arabidopsis. As observed in the alignment assembly statistics, the cluster percentages for each of the remaining classes are higher in rice than in Arabidopsis. Three analyses using EST clustering in Arabidopsis have shown that the RI class was most frequent among the classes of alternative splicing [16,18,34]. Notably, both the prevalence of the TWI and IWI classes are far greater in rice when compared to Arabidopsis. Earlier we noted that the TWI and IWI classes are largely unsupported in comparison to their counterparts that lack these variations. Their accumulation in the rice data may be due to the vast quantity of rice ESTs as compared with FL-cDNAs. Arabidopsis has nearly double the number of FL-cDNAs in the assemblies and the subsequent clusters when compared to rice and these FL-cDNA assemblies presumably provide a more complete representation of the gene, which may be reflected in the lower rates of TWI and IWI in Arabidopsis.
Given that a cluster of assemblies can contain one or more of the alternative splicing isoforms, an interesting parallel in the distribution of alternative splicing in the clusters for the two species can be observed. In both rice and Arabidopsis, the number of discrete alternative splicing classes in a single cluster can range from one to nine [seeAdditional file 3]. These data show that as the complexity of the splicing isoforms increases in a cluster, the number of clusters decreases.
Prior work using a limited set of FL-cDNA transcripts has shown the close proximity of alternative splicing sites in Arabidopsis  and this has also been noted in human and mouse [20,22,23,36,37]. The difference in base pairs (bp) between either the alternatively spliced isoforms' donor or the acceptor site was calculated as "delta bp". These delta bp values were analyzed to determine whether the AA or AD splicing isoform would retain its translation frame by being a multiple of three (an integral number of codons) or result in a translational frameshift for those delta bp values that are not a multiple of three. For the 7,071 assembly comparisons that were performed to identify the delta bps for the rice AA class, 2,797 (39.6%) are predicted to retain their translational frame with the remaining 4,274 AA having a translational frameshift downstream of the splice site. For the 3,174 assembly comparisons in the Arabidopsis AA class, 1,491 (47.0%) retain their translational frame and the remaining 1,683 will be frameshifted.
Similar to the AA class, the majority of AD assemblies in both Arabidopsis and rice are predicted to lead to a frameshift when translated. For the 3,795 assembly comparison in rice, 2,717 (71.6%) will lead to a frameshift, and for Arabidopsis, 1,002 of the total 1,635 (61.3%) assembly comparison will lead to a frameshift in translation.
The enrichment of GC donors in alternatively spliced genes has been observed previously in human andC. elegans[43,44]. We isolated all of the donor and acceptor dinucleotides in each splicing event for the AD and AA classes and separately isolated all the canonical donor dinucleotides involved directly in either the AA or AD splicing cases [28,29,42]. A Chi-Square analysis for the distribution of the AG and AC acceptor supports the null hypothesis that the AG (P = 0.99) and AC (P = 0.86) acceptor dinucleotides involved as alternative acceptor splice sites are not significantly different from their overall distribution. For the donors, the AT dinucleotide distribution in alternative splicing is similar to that observed in the whole set (P = 0.85) whereas the GC distribution was significantly elevated (GT (P = 0.04) and GC (P = 3.05 × 10-47)).
FL-cDNAs have proven highly valuable to the improvement of structural annotation of gene models in Arabidopsis [5,17,45]. Likewise, the RIKEN effort to generate a large set of rice FL-cDNAs has greatly improved its structural annotation [13,14]. We have evaluated the nine classes of splicing events in clusters in which at least one alignment assembly incorporates a FL-cDNA. These FL-cDNA assemblies (FL-assemblies) are used to provide reference gene structures to which splicing variations found in other isoforms can be localized to the protein-coding region (CDS) or to untranslated regions (UTRs) flanking the CDS. A total of 11,716 splicing variations were mapped to 5,599 rice clusters containing at least one FL-cDNA reference assembly. For Arabidopsis, 6,250 splicing variations were mapped to 3,909 FL-cDNA clusters containing at least one FL-cDNA reference assemblies.
Using the annotation derived from FL-cDNAs, the alternative splicing effects are classified relative to the CDS and UTRs.
CDS overlaps UTRs
3' UTRs contained
3' UTR overlaps CDS
5' UTR contained
5' UTR overlaps CDS
CDS overlaps UTRs
3' UTRs contained
3' UTR overlaps CDS
5' UTR contained
5' UTR overlaps CDS
A more restrictive analysis was performed to identify the effects of splicing variations on the coding sequence of alternatively spliced genes based solely on FL-assemblies for both rice and Arabidopsis. Only those assemblies generating a complete ORF of greater than 200 amino acids and having the same translational frame for a subset of the CDS were retained for this analysis. In these cases, the FL-assembly encoding the longest ORF was chosen as the reference to which alternate FL-assemblies were compared.
Frame change and protein truncation statistics by class for FL-cDNAs supported assemblies within the same cluster.
No. with Frame Changes
Average % of protein length
No. with Frame Changes
Average % of protein length
A total of 738 Arabidopsis FL-cDNA supported alternatively spliced assemblies were compared with 829 reference assemblies (Table 5). The data for Arabidopsis are quite similar to that observed in rice, particularly for the most frequently observed classes of alternative splicing (AA, AD, and RI).
This study presents a large scale analysis of alternative splicing profiles in the rice genome. In addition, a parallel analysis in Arabidopsis was performed for comparative purposes. Publicly available FL-cDNAs and ESTs were used with the PASA program to generate maximal transcript alignment assemblies and to cluster the assemblies according to alternative splicing isoforms. PASA-generated clusters having more than one assembly were screened for each of the nine classes of alternative splicing classes. In addition to enumerating the splicing variations found, we examined the underlying transcript supporting evidence to evaluate variations as major or minor isoforms. Finally, we localized splicing variations to the CDS and UTR regions of gene structures and assessed the effects of alternative splicing on the translated products of the isoforms.
When evaluating the clusters, the RI class was the most frequent for both rice (45.1%) and Arabidopsis (47.9%). Previous results in Arabidopsis and rice using an EST clustering strategy showed that the RI class is the most frequent [16,18,34]. However, by excluding the single exon transcripts, we observed fewer RI events (for example 6,887 vs. 7,774 in rice and 3,655 vs. 4,635 in Arabidopsis) than a recent report . By localizing variations to the CDS or UTR of FL-assembly based gene structures, we found the RI isoform to be largely contained in the coding sequence (Table 5), often leading to the inclusion of an in-frame PTC. Comparison of the SI assembly to its RI counterpart shows that the RI assembly has equal or greater transcript support for 43% of isoform pairs in rice and 30% in Arabidopsis. Thus, the RI class is the prevalent isoform in a sizable minority of cases.
The AA class is the second most frequently encountered class in the clusters from both of these species. Studies with mammalian EST clustering strategies revealed that the maximal occurrence of the distance between two alternate acceptor splice sites is 3 bp [20,22,23], which we also observed in both rice and Arabidopsis. A CAG tandem repeat is the most common sequence at these acceptor splice sites, consistent with the reports in mammalian and plant species [18,20]. Additionally, the pictograms reveal a pauCity of G nucleotides in the NAGNAG duplication, another feature in agreement with mammalian research . Additional studies will be needed to determine the rates of proximal and distal usage in the NAGNAG motif where the AA events have the maximal occurrence.In vitro research on closely spaced tandem AG acceptors has shown that both acceptors are competitive when the distance between them is less than 6 bp and within 19–23 bp of the branch point sequence . The highest occurrences of binned delta bps for the AA class for rice and Arabidopsis range from three bp to six bp which supports thein vitro results. However, a propensity toward alternative splicing has also been noted in genes that have unusually large distances (from 40 bp up to 400 bp) from the identified branch point to the acceptor site, and this interim sequence is hypothesized to contain splicing signal motifs . A systematic analysis of AA events in higher plants will be necessary to identify how these closely spaced alternate acceptors are utilized and to compare the results with those from mammalian systems.
The peak occurrence for the length between alternate donor splice sites was found to be 4 bp and the second most frequent value was 5 bp in both plant species. These maximal occurrence values for the AD class have been previously reported in mouse [22,23]. In addition, for mammalian species, the AD distribution was reported to have a very low, ~3 bp, periodiCity from 4 bp up to 15 bp . When considering the consensus sequence at the donor splice junction of AG|GTAWGN derived from human consensus (MAG|GTRAGT) with a strong requirement for a guanine at positions -1 and +5 , our data supports the hypothesis that this 4 bp peak occurs when either the proximal or the distal GT is used as a donor (Figure 5) [22,23]. Unlike the AA class, there does not appear to be a consensus sequence between the two donor GT dinucleotides aside from a potential enrichment for purine nucleotides. Akerman and Mandel-Gutfreund also hypothesized that a 3 bp difference in sequence length for the AD class would be "infrequent" . However, the occurrence of a 3 bp AD difference in sequence length is equal to or greater than nearly all other length differences observed indicates flexibility in 5' splice site recognition in higher plants. In the AD class, we observed an elevated rate of the donor dinucleotide sequence GC for GT, a trait previously observed in humans andC. elegans[43,44]. A survey of AD splicing events across eukaryotes will be needed to determine whether the elevated rates of the GC donor in the AD class are conserved. As with the AA class, using the FL-cDNA dataset, the majority of the AD class splicing events occur within the CDS and the majority of the AD events lead to a translational frameshift.
The transcript support for the AA and AD classes is of particular interest with respect to potential for translational frameshift. The AA class occurs about twice the rate of the AD class. However, the transcript support histograms have a very similar profile. These histograms show that, while the alternative isoform is less frequent in nearly all cases, these alternative isoforms are neither rare nor isolated. The possibility of a frameshift in the majority of events having small differences in length indicate that the AA and AD classes in both rice and Arabidopsis compare with the results in mouse [43,44]. Using the FL-cDNA as a reference for delineating the likely translated ORF, seven of the nine classes of alternative splicing occur predominantly in the coding sequence and these events have a significant effect on the translated protein. Prior work in other eukaryotic systems has suggested that alternatively spliced transcripts having a PTC are removed by non-sense mediated decay (NMD) . In mammals, degradation of PTC containing mRNAs is induced by the presence of a PTC 50 bp (or more) upstream of the exon/exon boundary in a spliced mRNA . Signals leading to decay occur at the time of translation where exon junction complexes are displaced during translation and the presence of a termination codon before an exon junction induces NMD of the transcript [48,49]. NMD is observed in all eukaryotic species that have been examined and these PTC-containing mRNAs are rapidly degraded [50,51]. A recent report suggested that nearly one-third of all alternative splicing events in human meet the criteria for NMD-mediated degradation and this phenomenon was termed regulated unproductive splicing and translation (RUST) . The conservation of RUST in higher plants has not been described in any detail yet. However the presence of an NMD-like pathway in plants was first suggested by studies showing reduced stability of mRNAs containing PTCs. Studies on mutants ofwaxy mRNA containing PTCs in rice (Oryza sativa) have suggested that splicing of the first intron present upstream of PTC is important for NMD of mutantwaxy mRNA . Also, nonsense but not missense mutants ofxantha mRNA in barley (Hordeum vulgare) appear to be subjected to rapid degradation, even though the mutant mRNA contains PTC in the last exon . Despite these findings, the mechanism of NMD in plants and its role in plant growth and development have not been clarified. In addition, there has been a recent study on the function of plant UPFs which showed that UPF3 of Arabidopsis suppresses aberrantly spliced mRNAs containing PTCs .
Interestingly, when restricting the alternative splicing analysis to the UTRs, the AA, AD, RI and SE classes are found to occur far more frequently in the 5'UTR than in the 3' UTR. Given that EST coverage is known to be 3' biased due to the nature of cDNA library generation, the 5'UTR bias for alternative splicing is notable. Following the rule for NMD, these 5'UTR localized events are assumed not to be involved in this post-transcriptional control pathway nor altering (directly) the coding sequence. A recent report using EST alignments in human also identified an enrichment of alternative splicing in the 5'UTR compared to the 3' UTR and the authors suggested this may represent a method for translational control [55,56].
Alternative splicing has also been proposed as a mechanism to expand the proteomic diversity within a genome. As the EST collections for rice and Arabidopsis have dramatically increased, successive reports have greatly expanded the number of identified alternative splicing isoforms. The data presented here suggest that, while alternative splicing may produce novel translated combinations of the primary amino acid sequence, there is also a large number of transcripts produced by the AA, AD, and RI classes having either frameshifts or PTCs that may have a role in post-transcriptional regulation. These new data will need to be explored via experimentation to determine whether RUST and NMD-mediated pathways are degrading these alternatively spliced transcripts.
All publicly available ESTs and mRNA sequences forOryza sativa and Arabidopsis thaliana(no distinction was made for cultivars or ecotypes) were obtained from the public databases, primarily GenBank, but in the case of rice, supplemented with an EST data set newly released to DDBJ [2,3]. Special care was taken to exclude sequences from RefSeq , given that most are not experimentally derived and result in many cases from predicted genes in completed genomes.
The PASA pipeline was used as previously described  with the following modifications. Earlier application of PASA utilized the BLAT  and sim4  transcript alignment software. Here, we chose instead to use the GMAP software , due to its improved accuracy, speed, and impressively low memory requirements. All transcripts, including ESTs and FL-cDNAs were aligned to the genome, followed by our transcript alignment validation procedure and subsequent alignment assembly. Only near perfect alignments were utilized in the alignment assembly process, requiring at least 95% identity across at least 90% of the transcript length. In addition, all internal alignment boundaries were required to adjoin consensus splice sites. In addition to the GT-AG and GC-AG introns, we extended our validation criteria to allow the more rare AT-AC introns. In the case of candidate AT-AC introns, we required that the AT donor site match the extended conserved donor site consensus sequence ATatcc (extended consensus shown in lower case). All other combinations of donors and acceptors were disallowed. The assembly of alignments via the PASA alignment assembly algorithm, and subsequent clustering of overlapping alignment assemblies with same transcribed orientation were performed exactly as previously described. All clusters of PASA assemblies correspond to single-linkage clusters of alignment assemblies having identical transcribed orientations and overlapping by at least 50% of either alignment's genome span. Clusters of alignment assemblies were mapped to existing genome annotations as previously described, using a 1% overlap criteria instead of the original 40% employed.
Each cluster of PASA alignment assemblies containing more than one alignment assembly was a direct result of at least one splicing variation. An all-vs-all comparison was performed among each pair of alignment assemblies within each cluster. An individual alignment assembly was labeled as having a specific splicing variation as evident when compared to another alignment assembly supporting the variation. For example, an alignment assembly was labeled as having a retained intron only after it was compared to another alignment assembly that had this specific intron spliced. Likewise, an alignment assembly was labeled as having an alternative acceptor site when compared to another alignment assembly that exhibited the corresponding exon with a different acceptor site at its boundary. Most variations were labeled reciprocally, including AD, AA, ATE, RI, and SE. In the case of an AA or AD splice site, and in the case of ATE isoforms, both alignment assemblies exhibit the corresponding variation, and each is labeled with this exact property. In the case of RI or SE, one assembly has this property, but the other assembly that provides evidence for the retained intron or skipped exon in former, has the converse property, and is labeled accordingly. The label serving as the counterpart to SE is RE, and the counterpart to RI is SI. However, the variations that initiate or terminate within intron (IWI or TWI) are asymmetrically labeled; only the specific assembly that exhibited the property was given the corresponding label. Labeling isoforms in this way is useful for the purpose of easily identifying isoforms with a given class of splicing variation, or to identify isoforms that exhibit a combination of splicing variations.
The individual transcripts within an alignment assembly that provide supporting evidence for individual splicing variation (label) were identified as those transcripts within an assembly that were disjoint from the assembly being compared and found to overlap the coordinates of the splicing variation. This comparison was performed reciprocally, regardless of whether the label was symmetrical or asymmetrical, so that for every splicing variation identified, the supporting evidence for each mutually exclusive variation was captured. We calculated the ratio of transcript support for each variation and used this ratio to infer the major and minor splice variants. The combination of labels assigned to an isoform, the genome coordinates of the localized variation, and the identity of the transcripts supporting each variation were stored in a MySQL database.
Paired isoforms (A, B), found in the same cluster of PASA assemblies, and found to overlap each other, were examined as follows:
AA and AD: All exons of isoform A are mapped to exons of isoform B based on genome coordinates. Alternate donor or acceptor splice sites are identified where one-to-one mappings exist between exons having non-identical coordinates at internal splice junctions.
RI: Introns of isoform B are compared to the exons of isoform A. An intron of isoform B completely encapsulated by an exon of isoform A is identified and classified as a retained intron in isoform A. Isoform B is reciprocally labeled as having the spliced intron variation.
SE: Exons of isoform A are compared to introns of isoform B. Exons of isoform A found completely encapsulated by a single intron of isoform B, but having neighboring exons on both sides that anchor to exons in isoform B, are classified as retained exons. A single 'retained exons' classification can contain one or more neighboring exons, all localized to a single intron of isoform B. Isoform B is reciprocally labeled as having 'skipped exons' that localize to one of its introns.
ATE: Exons of isoform A are mapped to exons of isoform B based on genome coordinates. In a search for overlapping exons between isoform A and isoform B starting from either termini, if the first overlapping exon found between isoform A and B is not a terminal exon in either isoform, the series of exons from the corresponding termini of isoform A found prior to the first overlapping exon are grouped and classified as alternate terminal exons (a single alternate terminal event).
IWI and TWI: The initial or terminal exon of isoform A begins or ends in an intron of isoform B and overlaps an exon of isoform B. Although we use the terms initiation within intron (IWI) and termination within intron (TWI), this simply indicates that our transcript alignment evidence begins or ends within an intron and is not necessarily indicative of transcriptional processes involving either transcriptional initiation, termination, or polyadenylation sites that may be responsible for these variations.
All pairs of isoforms exhibiting the AA splice site or AD splice site (isoforms specifically labeled with these variations as described above) were examined. For each pair of symmetrically labeled isoforms, the length between the alternate splice sites was computed, assigned as the delta value, binned and counted accordingly.
Localizing a splicing variation to the CDS or UTR requires that we are highly confident in the location of these features. Confidence in this matter is based on the accuracy of the underlying gene structure. We cannot rely solely on existing rice or Arabidopsis gene structure annotations because in many cases, the genes are largely predicted byab initio gene prediction programs, and we cannot confidently ascertain the accuracy of these gene predictions. We can, however, have a high degree of confidence in gene structures that are supported by FL-cDNAs; those cDNAs that encode both the complete protein-coding structure of the gene and extended UTR regions. Consequently, we limit our localizations of splicing variations to clusters of PASA assemblies including at least one FL-cDNA.
In such a cluster of PASA alignment assemblies, we chose a FL-cDNA that was found to encode a complete gene structure such that a full-length ORF was found with both a start and a stop codon. In cases where multiple candidate full-length assemblies existed, we chose the entry with the longest complete ORF. All other assemblies in this cluster were compared to this reference gene structure, and the positions of splicing variations were mapped to the CDS, the 5' UTR, or the 3' UTR.
Similarly to the above localization of splicing variations, where we required a high level of confidence in the underlying gene structure, here we require a high level of confidence in the complete gene structures for multiple isoforms of a single gene, so that we can perform meaningful comparisons between isoforms and ascertain the effects of the splicing variation(s) on the primary structure of the protein. Our analysis was hereby confined to those clusters of PASA alignment assemblies including multiple assemblies considered to be full-length (assemblies including at least one FL-cDNA). The protein isoform lengths were compared, and their gene structures were examined for the presence of a change in reading frame.
Enhancements to the PASA software resulting from our efforts described here include:
the capability to utilize GMAP for mapping and aligning transcripts to the genome, and as an alternative to blat coupled with sim4.
the consensus splice site requirement was extended to include the U12-type AT-AC introns.
the transcript alignment validation procedure can verify perfect sequence matches between the genome and the transcript sequence atn-number of bp adjoining the tentative splice junction (default forn is three bp, as employed here).
automated identification and classification of splicing variations found between clustered PASA alignment assemblies.
CGI-scripts that provide for web-browser based navigation of the results from the mining of alternative splicing variations, including data tables and images that provide illustrations of splicing graphs, localizations of splicing variations, and supporting transcript alignments.
The data sets and analysis generated in this study are available upon request.
This work was supported by a National Science Foundation Plant Genome Research Program grant to C. R. B. (DBI-0321538). The authors wish to thank Kevin Childs and Jennifer Wortman for their thoughtful discussions and helpful comments. We also wish to thank the anonymous reviewers' comments that improved the quality and presentation of these data.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.