Alt Event Finder: a tool for extracting alternative splicing events from RNA-seq data
© Zhou et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Alternative splicing increases proteome diversity by expressing multiple gene isoforms that often differ in function. Identifying alternative splicing events from RNA-seq experiments is important for understanding the diversity of transcripts and for investigating the regulation of splicing.
We developed Alt Event Finder, a tool for identifying novel splicing events by using transcript annotation derived from genome-guided construction tools, such as Cufflinks and Scripture. With a proper combination of alignment and transcript reconstruction tools, Alt Event Finder is capable of identifying novel splicing events in the human genome. We further applied Alt Event Finder to a set of RNA-seq data from rat liver tissues, and identified dozens of novel cassette exon events whose splicing patterns changed after extensive alcohol exposure.
Alt Event Finder is capable of identifying de novo splicing events from data-driven transcript annotation, and is a useful tool for studying splicing regulation.
Alternative splicing is an important level of gene regulation that greatly contributes to proteome diversity. It enables one gene to produce multiple isoforms that can have different biological functions. In humans, more than 90% of genes encode multiple protein isoforms, and many diseases are caused by the dysregulation of splicing patterns. Traditionally, EST (Expressed Sequence Tag) databases and microarray technologies have been utilized to study splicing regulation [4–7]. In recent years, high-throughput RNA sequencing (RNA-seq) technology has revolutionized functional genomics by offering the most comprehensive and accurate measurements of RNAs. In addition to previously known splicing events, RNA-seq technology can be used to identify novel splicing events.
Many bioinformatics tools have been developed to derive splicing patterns from RNA-seq data. For instance, dozens of strategies have been designed for aligning RNA-seq reads. Tools such as TopHat, MMES, SpliceMap, SplitSeek, G-Mo-R-Se, GSNAP and SAW enable alignment of short sequencing reads over splice junction sites, even across large intronic regions. Based on such splicing-sensitive alignments, follow-up algorithms, such as Cufflinks and Scripture, have been developed to reconstruct transcript isoforms using a genome-guided approach. Although the idea of reconstructing the whole transcriptome is intriguing, a quantitative estimate of the expression level of each isoform is difficult, particularly for transcripts expressed at low levels and/or when more than a few isoforms exist. In addition, isoform-based approaches increase the complexity of studying splicing regulation when many isoforms are present in the sample. Event-based approaches, in contrast, focus only on the inclusion and exclusion of individual splicing events, regardless of membership in different isoforms. This greatly reduces the computational complexity, and offers a direct path for studying splicing regulation. Based on the sequencing reads supporting inclusion and exclusion events, MISO (mixture of isoforms) is designed to estimate the percentage of inclusion for every previously documented alternative splicing event in a sample. It further offers a probabilistic framework for detecting differentially regulated exons, and provides functional insights into pre-mRNA processing.
One requirement for running MISO is a pre-defined alternative event annotation. Such an annotation relies heavily on previous knowledge, and is incomplete or even unavailable for many species. For instance, in the official MISO release, the alternative splicing annotation library is only available for the human, mouse, and Drosophila genomes, which rules out event-based analysis of datasets from other species. In addition, even for species whose alternative splicing has been heavily investigated, identifying novel splicing events can be important. Therefore, a tool for detecting novel splicing events directly from RNA-seq data is desirable.
In this study, we developed a tool, Alt Event Finder, for generating de novo annotation of alternative splicing events from a map of transcripts and isoforms reconstructed from RNA-seq experiments. In conjunction with upstream alignment and isoform reconstruction tools, we demonstrated that Alt Event Finder can identify novel cassette exon events that are not documented in the established databases. We evaluated the performance of this strategy with different combinations of alignment and transcript reconstruction algorithms, using a human dataset in which alternative splicing events have been extensively investigated. We further applied this tool to an RNA-seq dataset from the rat genome, for which alternative splicing annotation is not available.
To test the performance of our strategy, we applied Alt Event Finder to an RNA-seq dataset derived from human primary hepatocytes; the RNA-seq experiment was conducted using the SOLiD 5500×l system (Life Technologies). The dataset consists of 7 pairs of samples derived from 7 individuals. Each pair includes a drug-exposed sample and a control sample. To test the performance of Alt Event Finder on data with various sequencing depths, in addition to the 14 RNA-seq samples, we created 7 patient-specific datasets by merging the exposed and control samples from the same individual, and 1 hepatocyte-specific dataset by merging all 14 samples together.
To evaluate the performance of the proposed strategy, we compared our data-derived events with known alternative splicing events documented in the MISO release (based on the UCSC hg19 assembly). For each sample, we calculated a rate of known events (RKE), which measures the percentage of identified events that were in the known splicing event annotation, and a recall value, which was calculated as the percentage of known splicing events that were recovered by our strategy. As shown in Figure 2B, the rate of known events varies from 0.4 to 0.57. This indicates that a substantial portion of the splicing events we detected was not documented in the current database, although junction reads were found in support of their existence. The recall values, however, are low, ranging from 0.004 to 0.025. This is not surprising, since the known event annotation aims at completeness and therefore documents events from many tissues under a variety of biological conditions; most of these events are not expected to be present in one tissue under one or two biological conditions. We further evaluated the relationship between sequencing depth and the rate of known events (Figure 2C) and recall (Figure 2D). The rate of known events does not change appreciably with depth, suggesting that genes expressed at lower levels contain a similar percentage of novel events as the more abundant transcripts, but require greater sequencing depth to identify. The recall, however, increases almost linearly with the logarithm of the total number of mappable reads. These results (Figure 2C and 2D) indicate that deeper sequencing identifies many more events while the percentage of novel events stays roughly constant, so the number of novel events identified also grows with depth.
We further evaluated how the performance of Alt Event Finder is influenced by the alignment and transcriptome reconstruction tools. For the alignment tool, in addition to our customized RNA-seq pipeline, which focuses on known splicing junctions, we also tested TopHat, one of the most widely used RNA-seq alignment programs. For the transcriptome reconstruction tool, in addition to Cufflinks, which aims at maximizing specificity, we also tested Scripture, a computational algorithm aiming at higher sensitivity.
We applied Alt Event Finder to study alcohol-induced alternative splicing changes in liver tissue, using alcohol-preferring rats as a model system. Seven female rats were heavily exposed to alcohol for 10 weeks followed by 2 weeks without alcohol, and another 7 were not subjected to alcohol exposure (controls). An RNA-seq experiment was conducted on the liver tissues. After sequence alignment using TopHat, 123,017,701 and 92,389,972 total reads were mapped in the 7 control and 7 alcohol-exposed animals, respectively. Scripture was used for transcript reconstruction. Alt Event Finder identified 505 candidate events with a mixture of multiple isoforms in the combined sample of all 14 rats. With a MISO isoform differential expression test, we found 75 events with a Bayes factor (BF) larger than 2; a BF of 2 means the data favor a difference in splicing over no difference by a factor of two. A more stringent cutoff of BF > 5 yielded 55 events.
In this study, we developed a tool, Alt Event Finder, which generates splicing event annotations from RNA-seq data. Most event-based analysis tools, such as MISO, cannot work without a library of known event annotations. Therefore they cannot be applied to a genome for which annotation is unavailable, such as the rat genome. Even for a genome in which alternative splicing has been extensively studied, such as human or mouse, the lack of a de novo event finding tool limits the power to study events that have not been previously documented. Alt Event Finder bridges the gap between event-based analysis and isoform-based transcriptome reconstruction algorithms, such as Cufflinks and Scripture. It is an important addition to the current alternative splicing (AS) analysis toolset.
Our algorithm extracts "minimum non-overlapping exon units" (Figure 1B) from RNA-seq-derived transcript isoform annotation based on Cufflinks or Scripture, and further identifies potential alternative events. This strategy greatly increases the flexibility of our method. Although the current study focuses on cassette exons, it can be easily extended to other types of splicing events, such as intron retention, alternative 5' donor, and alternative 3' acceptor events. This is important because certain types of events can be more prevalent in specific tissue types. For instance, cassette exons are dominant in brain tissues, while alternative 5' donor and 3' acceptor events are more abundant in liver tissues.
Alt Event Finder relies on upstream alignment and isoform reconstruction tools. We evaluated how different tool combinations affect the ability to discover novel splicing events. We found that a customized alignment pipeline based on known exon boundaries performs better at low sequencing coverage (< 100 million reads), while TopHat performs better at high sequencing coverage. This is because TopHat derives exon structures mainly from the accumulation of RNA sequencing reads. Since it does not rely on existing exon annotations, at lower coverage the data may not have adequate power to properly identify exons expressed at low levels. At higher coverage, however, TopHat not only has enough power to precisely map the boundaries of known exons, but is also more suitable for identifying novel exons. We also found that we can generally identify more AS events using Scripture as the isoform reconstruction tool than using Cufflinks, because Scripture aims at maximizing sensitivity, while Cufflinks aims at specificity. Overall, we recommend combining the mapping algorithm based on known exon annotation with Scripture at low sequencing depth, and combining TopHat with Scripture at high sequencing coverage.
To find the cause of the low recall, we investigated the AS events that were identified with the official MISO library but not found in our annotations. One major cause is the lack of junction reads between the cassette exon and the constitutive exons, which makes the inclusive isoform undetectable by Cufflinks and Scripture, but still quantifiable by MISO because reads cover the cassette exon. Another cause is additional alternatively spliced 3' and 5' sites on a cassette exon event, which make an event in our annotation differ from the official MISO annotation.
Since Alt Event Finder is a data-driven approach, its power depends strongly on the sequencing depth. When the sequencing depth is low, many junction reads will be missed, and many exons expressed at low levels could be "disconnected"; this significantly decreases the power of the transcriptome reconstruction algorithm to rebuild isoforms from RNA-seq data, and therefore affects the performance of Alt Event Finder. When possible, increasing the sequencing depth can significantly elevate the power of novel event identification.
When deep sequencing data are not available, we recommend pooling sequencing reads from all the samples at the de novo event identification step. This enables identification of events that are lowly expressed in individual samples. It also enables identification of events that show complete inclusion in one condition but complete exclusion in another. These events cannot be identified within individual samples, yet such inclusion/exclusion switches are of particular interest.
We used two RNA-seq datasets for de novo alternative splicing event identification: human hepatocytes and rat liver tissue. In the human study, primary hepatocytes were isolated from seven individual subjects and treated with Rifampin. Total RNA from both control and treated samples was extracted. RNA-seq experiments were conducted using the SOLiD 5500×l system with the standard protocol. In the rat study, RNA-seq experiments were conducted on liver tissues from 7 non-drinking alcohol-preferring rats, and 7 alcohol-preferring rats that were heavily exposed to alcohol for 10 weeks followed by 2 weeks without alcohol. The experiment was conducted on the SOLiD 4 system with the standard protocol.
The known alternative splicing event annotation for the human genome was retrieved from the official MISO library (based on the UCSC hg19 assembly). The annotation file was generated from transcript annotation using an EST database; a splicing event was considered alternative if it was supported by multiple ESTs.
We used two RNA-seq alignment pipelines: TopHat, and a customized strategy using BFAST as the primary aligner together with known splicing sites documented in the UCSC Known Gene database. TopHat v1.4.0 was used with standard parameter settings on color space data. The customized pipeline uses BFAST as the primary aligner because of its compatibility with small insertions/deletions and its reported higher sensitivity on color space data. Our customized RNA-seq pipeline aligns reads at two levels: against genomic DNA sequences, and against a junction library built from all possible exon combinations within a 100,000-bp span, based on documented exon boundaries. This differs from the TopHat strategy, which uses sequencing read enrichment and splicing sequence features (GU...AG) for exon boundary detection.
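The junction library described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `junction_pairs`, the tuple representation of exons, and the reading of "within a 100,000-bp span" as the distance from the donor exon end to the acceptor exon start are all assumptions made for illustration. The sketch only enumerates coordinate pairs for exons on one chromosome and strand; extracting the flanking sequences for alignment is left out.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) on a single chromosome and strand


def junction_pairs(exons: List[Exon], max_span: int = 100_000) -> List[Tuple[Exon, Exon]]:
    """Enumerate candidate (donor exon, acceptor exon) pairs.

    Assumption: a pair is kept when the acceptor starts after the donor ends
    and the gap between them is at most max_span bases.
    """
    ordered = sorted(set(exons))  # sort by start, drop duplicate exons
    pairs = []
    for i, (d_start, d_end) in enumerate(ordered):
        for a_start, a_end in ordered[i + 1:]:
            if a_start <= d_end:
                continue  # overlapping or abutting exons cannot form a junction
            if a_start - d_end > max_span:
                break  # later exons start even further away
            pairs.append(((d_start, d_end), (a_start, a_end)))
    return pairs


# Hypothetical exon coordinates, for illustration only
example_exons = [(100, 200), (300, 400), (500, 600)]
example_pairs = junction_pairs(example_exons)
```

Because every downstream exon inside the span is paired with every upstream exon, the library contains junctions for exon-skipping isoforms that are absent from the transcript annotation, as long as both exon boundaries are documented.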
Based on the alignment output from TopHat or the customized pipeline, Cufflinks v1.2.1 and Scripture were used for isoform reconstruction. fastMISO (Mixture of Isoforms) was used to calculate the percentage of inclusion for annotated and novel alternative splicing events. Standard parameter settings were used for all three programs.
As shown in Figure 1A, Alt Event Finder uses transcript isoform annotation from Cufflinks (GTF format) or Scripture (BED format) as input. The output is the data-derived alternative event annotation in GFF3 format, which can be used as MISO input. From the isoform annotation, Alt Event Finder extracts "minimum non-overlapping exon regions" as expression units (Figure 1B), counts the number of isoforms that include each expression unit, and then derives the appropriate AS events from the string of counts (Figure 1B). In this study, we focus on cassette exons.
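The unit extraction and counting step can be sketched as follows. This is a simplified illustration, not the Alt Event Finder source: the function names, the half-open coordinate convention, and in particular the rule used to call a cassette exon (a unit separated from both neighbours by intronic gaps and included in fewer isoforms than either neighbour) are assumptions, since the exact rule applied to the string of counts is not spelled out in the text.

```python
from typing import List, Tuple

Exon = Tuple[int, int]          # half-open interval [start, end)
Isoform = List[Exon]            # exons of one reconstructed isoform
Unit = Tuple[int, int, int]     # (start, end, number of isoforms covering it)


def exon_units(isoforms: List[Isoform]) -> List[Unit]:
    """Cut the exons of one gene at every exon boundary and count isoforms per piece."""
    boundaries = sorted({pos for iso in isoforms for exon in iso for pos in exon})
    units = []
    for start, end in zip(boundaries, boundaries[1:]):
        # Number of isoforms having an exon that fully contains this piece
        count = sum(any(s <= start and end <= e for s, e in iso) for iso in isoforms)
        if count:  # pieces covered by no isoform are purely intronic
            units.append((start, end, count))
    return units


def cassette_candidates(units: List[Unit]) -> List[Tuple[Exon, Exon, Exon]]:
    """Assumed rule: an isolated unit with a lower count than both flanking units."""
    hits = []
    for left, mid, right in zip(units, units[1:], units[2:]):
        isolated = left[1] < mid[0] and mid[1] < right[0]
        if isolated and mid[2] < min(left[2], right[2]):
            hits.append((left[:2], mid[:2], right[:2]))
    return hits


# Hypothetical gene with two isoforms; the second skips the middle exon
toy_isoforms = [
    [(100, 200), (300, 400), (500, 600)],
    [(100, 200), (500, 600)],
]
toy_units = exon_units(toy_isoforms)
toy_events = cassette_candidates(toy_units)
```

In the toy gene the count string is 2, 1, 2, and the middle unit is reported together with its upstream and downstream units, which is the triplet a GFF3 cassette exon entry for MISO would describe. A unit that abuts its neighbour, as produced by an alternative 3' acceptor site, is not reported by this rule.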
The performance of Alt Event Finder was evaluated by comparison with the splicing event annotation in the MISO library. Events from the two annotations are considered consistent only if the genomic loci of the alternative exon (cassette exon) and its 5' upstream and 3' downstream exons are identical. This ensures the most conservative evaluation. Performance is assessed using three measurements: the total number of identified events, the rate of known events, and the recall. The rate of known events is defined as the percentage of data-driven events that are also known events, and recall is defined as the percentage of known events that are also found among the data-driven events.
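The three measurements can be computed with set operations, as in the minimal sketch below. The event key (chromosome, strand, upstream exon, cassette exon, downstream exon) mirrors the exact-match criterion described above; the function name `evaluate_events` and all coordinates are hypothetical.

```python
from typing import Iterable, Tuple

Exon = Tuple[int, int]
Event = Tuple[str, str, Exon, Exon, Exon]  # chrom, strand, upstream, cassette, downstream


def evaluate_events(detected: Iterable[Event], known: Iterable[Event]) -> Tuple[int, float, float]:
    """Return (number of detected events, rate of known events, recall)."""
    detected_set, known_set = set(detected), set(known)
    # Two events match only if all three exons have identical coordinates
    shared = detected_set & known_set
    rke = len(shared) / len(detected_set) if detected_set else 0.0
    recall = len(shared) / len(known_set) if known_set else 0.0
    return len(detected_set), rke, recall


def toy_event(offset: int) -> Event:
    """Build a made-up cassette exon event for illustration."""
    return ("chr1", "+", (offset, offset + 100), (offset + 200, offset + 300), (offset + 400, offset + 500))


known_events = [toy_event(1000 * i) for i in range(8)]        # 8 annotated events
detected_events = [toy_event(1000 * i) for i in range(2)]     # 2 of them recovered
detected_events += [toy_event(1000 * i + 50) for i in range(2)]  # 2 novel events
n_detected, rke, recall = evaluate_events(detected_events, known_events)
```

With 4 detected events, 2 of which match the 8 annotated ones, the rate of known events is 0.5 and the recall is 0.25. An event whose cassette exon matches but whose flanking exon boundary is shifted counts as novel under this criterion.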
This work is supported by grants from the National Institutes of Health AA017941, CA113001, GM085121, and GM088076. The Center for Medical Genomics (high throughput sequencing core) is supported by the Indiana Genomics Initiative of Indiana University (supported in part by the Lilly Endowment, Inc.).
This article has been published as part of BMC Genomics Volume 13 Supplement 8, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S8.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.