Read-Split-Run: an improved bioinformatics pipeline for identification of genome-wide non-canonical spliced regions using RNA-Seq data
© The Author(s). 2016
Published: 22 August 2016
Most existing tools for detecting next-generation sequencing-based splicing events focus on generic splicing events. Consequently, special types of non-canonical splicing events of short mRNA regions (IRE1α targeted) have not yet been thoroughly addressed at a genome-wide level using bioinformatics approaches in conjunction with next-generation technologies. During endoplasmic reticulum (ER) stress, the gene encoding the RNase Ire1α is known to splice out a short 26 nt region from the mRNA of the transcription factor Xbp1 non-canonically within the cytosol. This causes an open reading frame-shift that induces expression of many downstream genes in reaction to ER stress as part of the unfolded protein response (UPR). We previously published an algorithm termed “Read-Split-Walk” (RSW) to identify non-canonical splicing regions using RNA-Seq data and applied it to ER stress-induced Ire1α heterozygote and knockout mouse embryonic fibroblast cell lines. In this study, we have developed an improved algorithm “Read-Split-Run” (RSR) for detecting genome-wide Ire1α-targeted genes with non-canonical spliced regions at a faster speed. We applied the RSR algorithm using different combinations of several parameters to the previously RSW tested mouse embryonic fibroblast cells (MEF) and the human Encyclopedia of DNA Elements (ENCODE) RNA-Seq data. We also compared the performance of RSR with two other alternative splicing events identification tools (TopHat (Trapnell et al., Bioinformatics 25:1105–1111, 2009) and Alt Event Finder (Zhou et al., BMC Genomics 13:S10, 2012)) utilizing the context of the spliced Xbp1 mRNA as a positive control in the data sets we identified it to be the top cleavage target present in Ire1α +/− but absent in Ire1α −/− MEF samples and this comparison was also extended to human ENCODE RNA-Seq data.
Proof of principle came in our results by the fact that the 26 nt non-conventional splice site in Xbp1 was detected as the top hit by our new RSR algorithm in heterozygote (Het) samples from both Thapsigargin (Tg) and Dithiothreitol (Dtt) treated experiments but absent in the negative control Ire1α knock-out (KO) samples. Applying different combinations of parameters to the mouse MEF RNA-Seq data, we suggest a General Linear Model (GLM) for both Tg and Dtt treated experiments. We also ran RSR for a human ENCODE RNA-Seq dataset and identified 32,597 spliced regions for regular chromosomes. TopHat (Trapnell et al., Bioinformatics 25:1105–1111, 2009) and Alt Event Finder (Zhou et al., BMC Genomics 13:S10, 2012) identified 237,155 spliced junctions and 9,129 exon skipping events (excluding chr14), respectively. Our Read-Split-Run algorithm also outperformed others in the context of ranking Xbp1 gene as the top cleavage target present in Ire1α +/− but absent in Ire1α −/− MEF samples. The RSR package including source codes is available at http://bioinf1.indstate.edu/RSR and its pipeline source codes are also freely available at https://github.com/xuric/read-split-run for academic use.
Our new RSR algorithm has the capability of processing massive amounts of human ENCODE RNA-Seq data for identifying novel splice junction sites at a genome-wide level in a much more efficient manner when compared to the previous RSW algorithm. Our proposed model can also predict the number of spliced regions under any combinations of parameters. Our pipeline can detect novel spliced sites for other species using RNA-Seq data generated under similar conditions.
In metazoans, during endoplasmic reticulum (ER) stress, the endoribonuclease (RNase) Inositol Requiring Enzyme 1a (Ire1α) initiates removal of a 26 nt region from the mRNA encoding the transcription factor Xbp1 via an non-canonical mechanism (atypically within the cytosol). This causes a transitional open reading frame-shift to produce a potent transcription factor, Xbp1s, that induces expression of numerous downstream genes in response to ER stress as part of the unfolded protein response (UPR) [1, 2]. In addition, spliceosome-independent cytoplasmic splicing, as a part of the unfolded protein response pathway, has been described in yeast  where HAC1p was found to be the sole splicing substrate of Ire1. The mechanism of Ire1α-mediated RNA-splicing is likely conserved in all eukaryotes as well .
In recent years, many popular methods have been developed to identify novel splice sites in RNA-Seq data, including TopHat  and Alt Event Finder . A detailed review on the limitations of several other tools for identification of alternative splicing events (TrueSight , Splicing-Compass , and PASTA ) was described previously . In short, indeed none of these existing tools were suitably designed for detecting the type of non-canonical sometimes called non-canonical splice sites generated by Ire1α-targeted Xbp1 mRNA splicing. Given that non-canonical splicing events of short mRNA regions occurring within the cytosol have not yet been investigated using next-generation technologies at a genome-wide level, cutting-edge bioinformatics methods of detecting such targets are needed to quickly discover such splicing events in a patient-specific manner in order to derive future therapeutic value.
In order to supply the medical and scientific fields with such a tool we previously developed a novel bioinformatics pipeline method, named Read-Split-Walk  for detecting non-canonical, short, splicing regions using RNA-Seq data. We applied the method to ER stress-induced Ire1α heterozygous and knockout mouse embryonic fibroblast (MEF) cell lines to identify Ire1α targets of which the 26 nt non-canonical splice site in Xbp1 was detected as the most prominent splice target by our initial RSW pipeline in heterozygous (Het) samples, not mapped in the negative control Ire1α knockout (KO) samples for both Thapsigargin (Tg) and Dithiothreitol (Dtt) treated experiments. In our previous study, we also compared the Xbp1 results from our approach with results using the alignment program BWA , Bowtie2 , STAR , Exonerate  and the Unix “grep” command. Although our previous RSW method gave better results overall than the above-mentioned approaches, we realized that RSW’s running speed needed to be further improved in order to handle the massive amount of data in other experiments (human ENCODE project: https://www.encodeproject.org). In addition, we wanted to test, under different combinations of parameters, how and where reported spliced regions would differ. Therefore, we have designed a newer algorithm which we call “Read-Split-Run” (RSR) that can process RNA-Seq data in a more efficient manner with flexible parameters. We also proposed a linear regression equation under the assumption of the Generalized Linear Model for RSR parameters that can automatically predict the number of spliced regions given any parameter settings for a particular experiment.
We compared our RSR algorithm with the above-mentioned alternative splicing events detection tools using metrics of how each tool ranks Xbp1 as the top cleavage target and its presence and absence in Ire1α +/− and Ire1α −/− MEF samples. We have also compared our RSR pipeline and other tools to process a human ENCODE dataset and reported their statistics of running performance and sensitivity (the number of spliced junctions identified).
The web interface features of RSR
The spliced regions detected by the RSR pipeline for mouse MEF RNA-Seq data
Comparison of total number of junctions identified by RSR for five cases from Tg and Dtt treated samples
Minimum split size
Maximum candidate distance
Read mapping region boundary buffer size
Minimum candidate distance
Minimum number of supporting reads
Maximum good alignment allowed per read
Total number of junctions identified
Total number of junctions identified
Total number of junctions identified
Total number of junctions identified (Novel/Known)
The spliced regions detected by the RSR pipeline for human ENCODE data
Comparison of detected spliced regions between RSR and other tools
Number of reads for supporting Xbp1 26 nt spliced regions reported by RSR and other tools
500 nM Thapsigargin (Tg)
1 mM Dithiothreitol (Dtt)
Het (Ire1α +/−)
KO (Ire1α −/−)
Het (Ire1α +/−)
KO (Ire1α −/−)
Alt Event Finder
The general linear model for RSR algorithm
Parameters consideration in RSR pipeline
Our proposed model of taking different parameter combinations to run the bowtie aligner and RSR algorithm could be applied to other species. However, different parameter combinations would predict different outcomes (i.e. number of spliced regions) for different species, even for different experiments. In our pipeline, we chose three parameters (MS, MD, and BB) to test how different combinations affect the prediction outcome. Specifically, we have run a number of test cases (100 for Tg and 64 for Dtt) with different combinations and used the results to generate linear regression equations. To increase the robustness of our RSR algorithm, it is ideal to perform a large-scale simulation study in order to look for the optimal combination. Typical questions remain to be answered: What would be the trade-off between lower/upper bound split size vs alignment sensitivity? What is the optimal consolidation slip/buffer size? More or fewer supporting reads will be reported if different cut-off criteria are applied, and these can be adjusted to achieve the desired balance between sensitivity and specificity in specific applications.
We chose three parameters (minimum split size (MS), maximum candidate distance (MD), and read mapping region boundary buffer size (BB)) for the RSR algorithm because the number of reads supporting spliced regions could be different given different combination values of these parameters. Our GLM was generated by running a number of test cases. A smaller MS and bigger MD and BB could increase the number of junctions reported. Empirically MS could be set to approximately 1/3 of the read length, and should also be larger than 8 bp to ensure the split half is not too short to be mapped accurately. An estimated MD value could be determined according to the average gene length of species tested. Therefore, users could choose their customized MD value according to the species on which their experiments were performed. This criterion is determined according to the assumption that the split pairs supporting the junction are often mapped onto the same gene. Finally, BB could be set to a number that should not reflect a large boundary variation (5 or less would be reasonable).
The file deletion of our RSR algorithm
We noticed, by splitting reads into multiple read half pairs, the size of the result files substantially increased when the human ENCODE dataset was processed. To reduce the hard drive storage of large data files, we automatically delete files throughout the pipeline as they are no longer needed. For example, if the split step only needs to utilize the unmapped read datasets, the alignment files generated from the first step of bowtie are deleted. After the second step of bowtie is finished, all unmapped files will be deleted and only the alignment file will be kept for the next RSR step. The RSR pipeline can also compare spliced regions between samples and output reported regions side-by-side.
RSR running speed, sensitivity, and specificity
The overlapping spliced junctions identified by TopHat and our RSR
500 nM Thapsigargin (Tg)
1 mM Dithiothreitol (Dtt)
In our original RSW paper, we also compared the Xbp1 results from our approach with results using the alignment program BWA , Bowtie2 , STAR , Exonerate  and the Unix “grep” command. Although our RSW method gave better results overall than the above-mentioned approaches, comparison results also suggested that reads supporting removal of the 26 nt intron from Xbp1 mRNA were not fully acknowledged. A study using in vitro cleavage assay combined with microarray analysis reported 13 additional mRNAs as Ire1α cleavage targets . The discovery shed light on the existence of other possible targets. A future version of the algorithm will focus on rescuing these false negative reads in order to achieve a better sensitivity.
Applying RSR on human ENCODE RNA-Seq data
The discovery of a new set of non-canonical splicing events in humans is important not only because of the obvious potential for novel alteration of targeted transcript function, but also the potential for the resulting excised sequences to function as silencing RNAs associated with particular disease states. In addition, the frequency of these novel splicing events could be subject to altered regulation in some individuals, resulting in identifiable splicing profiles associated with the risk of certain diseases. We used ENCODE RNA-Seq datasets to train our RSR algorithm and hope to identify additional targets and elucidate their splicing patterns. Results should eventually provide unique insight of elucidating how short non-canonical spliced sequences act their biological functions in the context of relevant biological processes and diseases.
The positive control for our application, the Xbp1 26 nt non-canonical splice site, was clearly detected in Het samples but not in the KO control samples from Tg and Dtt treated MEF experiments, and was reported again as the top cleavage target for an Ire1α target splice site. Although we have tested the RSR using human ENCODE datasets, our algorithm could also be easily extended for prediction of spliced regions for other species under any given parameter settings.
The RNA-Seq read sequence data
The mouse test data were downloaded from NCBI Gene Expression Omnibus (GEO) under the accession number GSE54631. Mouse embryonic fibroblast cells (MEF) that were heterozygous for Ire1α (Ire1α(+/−)) and cells which had Ire1α knocked out (Ire1α(−/−)) treated for 4 h with either 500nM Thapsigargin (Tg) or 1 mM Dithiothreitol (Dtt). Both RNA-Seq experiments are single end reads and had no experimental replicates performed.
The human test data are ENCSR000CUR which were downloaded from the ENCODE project (https://www.encodeproject.org/experiments/ENCSR000CUR/). The data were paired-end RNA-Seq experiments performed on human skin melanocytes primary whole cells (NHEM-M2) and sequenced using Illumina HiSeq 2000. There were two biological replicates (adult 52 years old and adult 55 years old) and no technical replicates used in this experiment.
The reference genome for Read-Split-Run
We downloaded the mouse (mm9) and human (hg19) genome reference sequences from the University of California Santa Cruz (UCSC) genome browser (http://genome.ucsc.edu). We also downloaded respective UCSC gene files (knownGene.txt) from the UCSC genome browser. The splice junction file was created by setting the sequence entry on each side of the junction site to 4 bp shorter than the read length using a RNA-Seq software python script (getsplicefa.py) from ERANGE version 3.1 (http://woldlab.caltech.edu/~alim/RNAseq/). The original reference genome and splice junction site file were merged together to form an expanded genome.
The algorithm of Read-Split-Run
The present work began by porting the previous pipeline from being written in Perl to C++ (compiled with g++ 4.8.3 using optimization parameter –O4 on Linux). Porting to C++ resulted in a speedup by a factor of roughly two to three. Other than the sequence alignment using bowtie, most of the running time in the pipeline is in comparing aligned sub-sequences to determine the set of matched pairs, and comparing matched pairs to determine which support each other. The previous pipeline compared all pairs in each step, resulting in a running time that is quadratic in the number of sub-sequences coming out of the second bowtie step. We improve this step drastically so that the running time is quadratic only in the number of sub-sequences that were derived from the same initial non-aligned read sequence (typically less than a few hundred, whereas there may be millions of sub-sequences coming out of the second bowtie step). We obtain a similar improvement in the step that scans matched pairs to determine which support each other.
Methods of running Read-Split-Run on mouse MEF RNA-Seq data
The running parameter values employed for both Tg and Dtt samples
Minimum split size (MS)
Maximum candidate distance (MD)
Read mapping region boundary buffer size (BB)
8, 11, 12, 16
10000, 20000, 30000, 40000, 50000
1, 3, 5, 7, 9
11, 16, 20, 24
10000, 20000, 40000, 50000
3, 5, 7, 9
Methods of running Read-Split-Run on human ENCODE RNA-Seq dataset
The phases: bowtie, splitting, and second step of bowtie were performed on same hardware mentioned above as the Dtt and Tg sets, whereas the RSR program was run on the “compute node,” described above. The parameters for this run were: MS - 33, MD - 50,000, and BB - 5. Before we could run the split-pairs portion of the pipeline, the output from bowtie (phase 2) had to be split into individual chromosomes so that they could fit into memory. Even in doing so, chromosome 14 had so much data (611Gb) that it could not be run. No junctions were identified on chromosome M.
Comparison with other tools
We compared our RSR algorithm with a couple of other NGS based alternative splicing events detection tools (TopHat  and Alt Event Finder ). We applied these tools on RNA-Seq data from a mouse embryonic fibroblast (MEF) cell line to check which of these tools can identify and rank Xbp1 as the top cleavage target and its presence and absence in Ire1 +/− and Ire1 −/− MEF samples and extended the analysis to the ENCODE RNA-Seq datasets. We ran TopHat v2.0.13 using options: −I 3000000, −g 10, --coverage-search, −microexon-search, to generate the “accepted_hits.bam” file for RNA-Seq data for each experiment condition from MEF cell line. Alt Event Finder v0.1 was ran by taking the “transcript.gtf” file generated from Cufflinks [17–20] and the “accepted_hits.bam” file generated by TopHat. Other metrics (i.e. running speed and usability) of these tools were also examined.
A general linear model for RSR
BB, boundary buffer size; Dtt, Dithiothreitol; ENCODE, Encyclopedia of DNA Elements; ER, endoplasmic reticulum; GLM, General Linear Model; Het, heterozygote; KO, knock-out; MD, maximum candidate distance; MEF, mouse embryonic fibroblast; MS, minimum split size; RSR, Read-Split-Run; RSW, Read-Split-Walk; Tg, Thapsigargin; UPR, unfolded protein response
This research was supported by ISU start-up funds to YB. RJK acknowledges support from NIH grants DK088227, DK042394, DK103183 and CA128814 and the Hevery Foundation. The authors thank The Center for Genomic Advocacy (TCGA) and Department of Mathematics and Computer Science at Indiana State University for computing servers. Authors also thank reviewers for comments and Gary Stuart for helping biological interpretation.
The publication charges for this article have been funded by the corresponding author.
This article has been published as part of BMC Genomics Volume 17 Supplement 7, 2016: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2015: genomics. The full contents of the supplement are available online at http://bmcgenomics.biomedcentral.com/articles/supplements/volume-17-supplement-7.
Availability of data and material
The datasets supporting the conclusions of this article are included within the article and its additional files. The RSR source and accompanying examples are freely available for academic use at https://github.com/xuric/read-split-run under the Apache License, Version 2.0 license.
YB designed and supervised the project, performed the analysis, provided biological interpretation, and wrote the manuscript. JK wrote the software code, performed the analysis, and wrote the manuscript. BD wrote the software code, ran the pipeline, and provided the human data results. FJ and LD participated in result comparisons between RSR and other tools. JRH performed the experiments and provided biological interpretation. RJK designed the experiments and wrote the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Calfon M, Zeng H, Urano F, Till JH, Hubbard SR, Harding HP, Clark SG, Ron D. IRE1 couples endoplasmic reticulum load to secretory capacity by processing the XBP-1 mRNA. Nature. 2002;415(6867):92–6.View ArticlePubMedGoogle Scholar
- Yoshida H, Matsui T, Yamamoto A, Okada T, Mori K. XBP1 mRNA is induced by ATF6 and spliced by IRE1 in response to ER stress to produce a highly active transcription factor. Cell. 2001;107(7):881–91.View ArticlePubMedGoogle Scholar
- Ruegsegger U, Leber JH, Walter P. Block of HAC1 mRNA translation by long-range base pairing is released by cytoplasmic splicing upon induction of the unfolded protein response. Cell. 2001;107(1):103–14.View ArticlePubMedGoogle Scholar
- Back SH, Lee K, Vink E, Kaufman RJ. Cytoplasmic IRE1alpha-mediated XBP1 mRNA splicing in the absence of nuclear processing and endoplasmic reticulum stress. J Biol Chem. 2006;281(27):18691–706.View ArticlePubMedGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105–11.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhou A, Breese MR, Hao Y, Edenberg HJ, Li L, Skaar TC, Liu Y. Alt Event Finder: a tool for extracting alternative splicing events from RNA-seq data. BMC Genomics. 2012;13 Suppl 8:S10.View ArticlePubMedPubMed CentralGoogle Scholar
- Li Y, Li-Byarlay H, Burns P, Borodovsky M, Robinson GE, Ma J. TrueSight: a new algorithm for splice junction detection using RNA-seq. Nucleic Acids Res. 2013;41(4):e51.View ArticlePubMedGoogle Scholar
- Aschoff M, Hotz-Wagenblatt A, Glatting KH, Fischer M, Eils R, Konig R. SplicingCompass: differential splicing detection using RNA-seq data. Bioinformatics. 2013;29(9):1141–8.View ArticlePubMedGoogle Scholar
- Tang S, Riva A. PASTA: splice junction identification from RNA-sequencing data. BMC Bioinformatics. 2013;14:116.View ArticlePubMedPubMed CentralGoogle Scholar
- Bai Y, Hassler J, Ziyar A, Li P, Wright Z, Menon R, Omenn GS, Cavalcoli JD, Kaufman RJ, Sartor MA. Novel bioinformatics method for identification of genome-wide non-canonical spliced regions using RNA-Seq data. PLoS One. 2014;9(7):e100864.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60.View ArticlePubMedPubMed CentralGoogle Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21.View ArticlePubMedGoogle Scholar
- Slater GS, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005;6:31.View ArticlePubMedPubMed CentralGoogle Scholar
- Oikawa D, Tokuda M, Hosoda A, Iwawaki T. Identification of a consensus element recognized and cleaved by IRE1 alpha. Nucleic Acids Res. 2010;38(18):6265–73.View ArticlePubMedPubMed CentralGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.View ArticlePubMedPubMed CentralGoogle Scholar
- Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011;27(17):2325–9.View ArticlePubMedGoogle Scholar
- Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011;12(3):R22.View ArticlePubMedPubMed CentralGoogle Scholar
- Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013;31(1):46–53.View ArticlePubMedGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511–5.View ArticlePubMedPubMed CentralGoogle Scholar