eRNA: a graphic user interface-based tool optimized for large data analysis from high-throughput RNA sequencing
© Yuan et al.; licensee BioMed Central Ltd. 2014
Received: 30 October 2013
Accepted: 26 February 2014
Published: 5 March 2014
RNA sequencing (RNA-seq) is emerging as a critical approach in biological research. However, its high-throughput advantage is significantly limited by the capacity of bioinformatics tools. The research community urgently needs user-friendly tools to efficiently analyze the complicated data generated by high throughput sequencers.
We developed a standalone tool with graphic user interface (GUI)-based analytic modules, known as eRNA. The capacity of performing parallel processing and sample management facilitates large data analyses by maximizing hardware usage and freeing users from tediously handling sequencing data. The module “miRNA identification” includes GUIs for raw data reading, adapter removal, sequence alignment, and read counting. The module “mRNA identification” includes GUIs for reference sequences, genome mapping, transcript assembling, and differential expression. The module “Target screening” provides expression profiling analyses and graphic visualization. The module “Self-testing” offers the directory setups, sample management, and a check for third-party package dependency. Integration of other GUIs, including Bowtie, miRDeep2, and miRspring, extends the program’s functionality.
eRNA focuses on the common tools required for the mapping and quantification analysis of miRNA-seq and mRNA-seq data. The software package provides an additional choice for scientists who require a user-friendly computing environment and high-throughput capacity for large data analysis. eRNA is available for free download at https://sourceforge.net/projects/erna/?source=directory.
Keywords: RNA sequencing, Bioinformatics tool, Graphic user interface, Parallel processing
Advances in high-throughput sequencing (HTS) technologies have enabled the analysis of genome-wide RNA profiles with high accuracy and unprecedentedly deep coverage while costs continue to decrease. The Illumina HiSeq 2500 sequencing system is able to sequence 192 RNA samples (24 multiplexed samples per lane), yielding up to six billion paired-end reads in a run (http://support.illumina.com/). Due to its high capacity, RNA sequencing (RNA-seq) has become a necessary research approach for transcriptomic studies and integrated systems analyses.
To date, many bioinformatics tools have been developed to support the identification of known RNAs and analysis of RNA expression profiles. A common workflow for micro-RNA sequencing (miRNA-seq) analysis includes adapter removal, sequence alignment, and read counting. To complete this process, various tools have been developed, including DSAP, E-miR, miRanalyzer, miRDeep2, MIReNA, miRExpress, miRNAkey, miRspring, mirTools, and SeqBuster (Additional file 1: Table S1). These miRNA tools perform very well with respect to sensitivity, accuracy, and visualization for miRNA identification. Unlike miRNA-seq, a popular workflow for mRNA sequencing (mRNA-seq) analysis includes genome mapping, transcript assembling, and differential expression analysis, each accomplished separately by a combination of standalone tools (namely Bowtie, SAMtools, TopHat, and Cufflinks) and R packages in R environments. Some open-source analytic workbenches or software solutions have been developed to integrate these different third-party tools, such as ArrayExpressHTS, Chipster, ExpressionPlot, GENE-Counter, GenePattern (http://www.broadinstitute.org/cancer/software/genepattern/modules/RNA-seq), GeneProf, RNA-seq Toolkit (RST), RobiNA, and TCW (Additional file 1: Table S2). Of these, the web-based tools provide a GUI-based computer platform. User-friendly access through web browsers makes RNA-seq data analysis possible for a broad range of research scientists. The standalone tools, however, are more flexible than the web-based tools. Because they are installed and operated locally, users may adjust the parameters or even write a program using command-line code to meet their specific requirements. For some open-source tools, users may revise the code and integrate it into their own workflow for RNA-seq data analysis.
Although there are few limits on sequencing data outputs and sample sizes, the use of current bioinformatics tools remains challenging for a broad range of research scientists due to insufficient abilities to process large data, as well as limitations on data inputs and sample management. Large data analysis and multiple RNA sample management through the web-based tools are not practical due to the limits of network connections, the capacity of server computers, and the security of remote data storage. It is also time-consuming to upload sequencing data and reference sequences. In some cases, additional modifications are required prior to miRNA analysis. For example, users have to trim adapter sequences and convert the inputs from FASTQ to FASTA format when using some miRNA tools (namely mirTools or miRspring). The lack of sample management further complicates data analysis and increases the potential for errors. It is impractical to analyze a large data set from numerous biological samples with different traits, along with their technical and biological replicates, when only one RNA sample can be processed at a time. Furthermore, the running time required to process large data is another challenge for RNA-seq data analysis. Some tools process the datasets from only one RNA sample at a time because of their limits on parallel processing. In addition, for scientists without any programming experience, it is often difficult to set parameters and convert data formats in command-line tools. Although some standalone tools are user-friendly for bioinformaticians and computer scientists, mastering such knowledge is often beyond the comfort level of most research scientists. To meet these challenges, we developed a GUI-based tool called eRNA, which integrates common tools required for RNA-seq analysis and facilitates large-scale data analysis.
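The manual preprocessing mentioned above (3' adapter trimming followed by FASTQ-to-FASTA conversion) can be illustrated with a minimal Python sketch. This is not eRNA's implementation, which is written in Perl; the function names, the `min_overlap` and `min_len` thresholds, and the exact-match trimming rule are assumptions chosen for illustration only.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) from FASTQ lines (four lines per record)."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq = record[0], record[1]
            if not header.startswith("@"):
                raise ValueError(f"malformed FASTQ header: {header!r}")
            yield header[1:], seq
            record = []


def trim_adapter(seq: str, adapter: str, min_overlap: int = 6) -> str:
    """Remove a 3' adapter by exact matching (hypothetical, simplified rule).

    Cut at the first full adapter match; otherwise cut a partial adapter
    prefix of at least min_overlap bases found at the very end of the read.
    """
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    longest = min(len(adapter), len(seq)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if seq.endswith(adapter[:overlap]):
            return seq[: len(seq) - overlap]
    return seq


def fastq_to_fasta(lines: Iterable[str], adapter: str, min_len: int = 15) -> str:
    """Return FASTA text for reads at least min_len bases long after trimming."""
    out = []
    for read_id, seq in read_fastq(lines):
        trimmed = trim_adapter(seq, adapter)
        if len(trimmed) >= min_len:
            out.append(f">{read_id}\n{trimmed}")
    return "\n".join(out) + ("\n" if out else "")
```

Real adapter trimmers usually tolerate mismatches and use quality scores; the exact-match rule here only shows the shape of the step that a GUI-based tool hides from the user.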
The module “Self-testing” is the initial step in the RNA-seq analysis pipeline and should be run before any other module. It guides all the setup steps needed for a successful run in eRNA, including directory setup, sample management, third-party tool checking, and package dependency checking (Figure 1A). The directory setup allows raw data and results to be allocated across more than one hard drive in a computer. Sample management is used for task assignment in parallel processing and for creating associations among raw data, RNA samples, and the biological traits. The third-party tool and package dependency checks detect the third-party RNA analytic tools and Perl packages required by eRNA. With raw data in FASTQ format and reference sequences in FASTA format as data input, the eRNA software package integrated with the third-party tools is able to perform miRNA-seq or mRNA-seq data analysis.
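A third-party tool check of this kind can be sketched in a few lines of Python. The sketch only tests whether executables are reachable on the `PATH`; it does not reproduce eRNA's own check (which also covers Perl packages), and the executable names listed in the usage note are assumptions about a typical installation.

```python
import shutil
from typing import Dict, Iterable, List, Optional


def check_tools(tools: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(report: Dict[str, Optional[str]]) -> List[str]:
    """Return the sorted names of tools that could not be found."""
    return sorted(name for name, path in report.items() if path is None)
```

For example, `missing_tools(check_tools(["bowtie", "bowtie2", "samtools", "tophat", "cufflinks"]))` would list whichever of these assumed executable names are not installed, so the user can fix the environment before starting an analysis.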
The module “Target screening” performs differential expression profiling analysis and recursive partitioning analysis (Figure 1D). The R environment and R packages are required for this module. The former pipeline uses the method implemented in the R package DESeq to reveal differentially expressed genes between two groups of given RNA samples. The latter pipeline uses the model implemented in the R package “party” to rank the importance of the expressed genes identified by the “miRNA identification” or “mRNA identification” module with respect to the biological traits of the given RNA samples.
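As background for the differential expression pipeline, the normalization step of the DESeq method (median-of-ratios size factors, as described by Anders and Huber) can be sketched in Python. This is an illustrative re-implementation of that one step, not the R package's code and not part of eRNA; the function name and the dictionary-based count table are assumptions.

```python
import math
from statistics import median
from typing import Dict, List


def size_factors(counts: Dict[str, List[int]]) -> List[float]:
    """Median-of-ratios size factors, one per sample.

    counts maps a gene name to its raw read counts across samples.
    Genes with a zero count in any sample are skipped because their
    geometric mean across samples is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios: List[List[float]] = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c == 0 for c in gene_counts):
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Raises statistics.StatisticsError if no gene is expressed in all samples.
    return [math.exp(median(r)) for r in log_ratios]
```

Dividing each sample's counts by its size factor puts samples sequenced at different depths on a common scale before the statistical test is applied.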
In summary, raw data and reference sequence preparation, sample information input, and software parameter settings in eRNA are optimized to ensure a user friendly environment. The learning time to understand RNA data analysis is minimized. The preparation of raw data and references in a successful run is significantly simplified.
Another feature of sample management in eRNA is the distribution of the data flow in parallel processing. Once parallel analysis is triggered, the whole analytic workload is split into as many components as there are threads. eRNA automatically distributes the raw data among the components as inputs, based on the size of the raw data for each RNA sample. The data in each component are analyzed separately and simultaneously.
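One simple way to distribute samples by raw data size is a greedy "largest first" assignment, sketched below in Python. The text does not specify the balancing rule eRNA uses, so this heuristic, the function name, and the size units are assumptions for illustration.

```python
import heapq
from typing import Dict, List


def distribute_samples(sizes: Dict[str, int], n_workers: int) -> List[List[str]]:
    """Assign samples to n_workers groups with roughly equal total input size.

    sizes maps a sample name to the size of its raw data (e.g. in bytes).
    Samples are taken from largest to smallest and each is given to the
    group with the smallest total so far (a greedy heuristic).
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    heap = [(0, i) for i in range(n_workers)]  # (assigned size, group index)
    groups: List[List[str]] = [[] for _ in range(n_workers)]
    # Sort by descending size; the name breaks ties so the result is deterministic.
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return groups
```

Each resulting group could then be handed to one thread or process, so that all workers finish at roughly the same time even when sample sizes differ widely.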
Case study on miRNA-seq data analysis
The case study on mRNA-seq data analysis
GUIs of the aligners
The GUIs allow users to apply the aligners Bowtie1 and Bowtie2 for sequence alignment, including index building, separately from the other pipelines provided by eRNA (Figure 9A). Fourteen of the 64 optional parameters in Bowtie (v.1) are exposed in the Bowtie1 GUI, and 22 of the 73 in Bowtie (v.2) are exposed in the Bowtie2 GUI.
GUIs of the third-party miRNA tools
Functional comparison of eRNA, miRspring, and miRDeep2
Identification of known miRNAs
Visualization of analytic results
Discovery of miRNAs
miRNA expression profiling analysis
Batch data processing
Visualization of sequencing quality control
Graphic viewers for quality control
Graphic viewers known as QS Viewer, SD Viewer, and IL Viewer are used for sequencing quality control in RNA-seq experiments (Figure 9B). QS Viewer plots the distribution of quality scores (Q scores) per sequencing cycle for each miRNA sample, which can be used to assess sequencing quality. SD Viewer plots RNAs against selected reference sequences to display sequencing depth, which indicates transcript abundance. IL Viewer plots the distribution of insert lengths from the sequencing library to show the general quality of RNA sequencing library construction.
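The kind of per-cycle summary that a quality score viewer plots can be computed directly from the quality lines of a FASTQ file. The Python sketch below reports only the mean Q score per cycle rather than a full distribution, and it assumes the Sanger/Illumina 1.8+ encoding (ASCII offset 33); the function name is hypothetical and this is not QS Viewer's code.

```python
from typing import Iterable, List


def mean_quality_per_cycle(quality_lines: Iterable[str], offset: int = 33) -> List[float]:
    """Return the mean Phred quality score at each sequencing cycle.

    quality_lines holds the fourth line of each FASTQ record. Reads may
    differ in length; each cycle is averaged over the reads that reach it.
    """
    totals: List[int] = []
    counts: List[int] = []
    for qual in quality_lines:
        for cycle, ch in enumerate(qual.rstrip("\n")):
            if cycle == len(totals):
                totals.append(0)
                counts.append(0)
            totals[cycle] += ord(ch) - offset  # decode ASCII to Phred score
            counts[cycle] += 1
    return [t / c for t, c in zip(totals, counts)]
```

A steady drop in the returned values toward later cycles is the typical pattern such a plot is used to detect.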
It is challenging for developers to strike a balance between a user-friendly environment and high efficiency with respect to the processing of RNA-seq data analysis. GUIs and sample management in eRNA provide a user-friendly environment and fulfill the requirements for large data analysis. The use of multi-threads technology makes parallel processing of RNA-seq data possible. The objectives of eRNA are listed as follows:
As a rule, such a tool should be easy to use and require no prior knowledge of specific computer programming language. GUIs will save time in learning how to use this software. A user-friendly framework will allow biological researchers to focus on RNA data analysis and biological interpretation. Also, preparations of raw data and reference sequences are simplified in the GUI-based tool. There are no requirements for raw data conversion. Reference sequences can be downloaded from public databases and used without further manipulation. Automated format conversion is also available.
Parallel processing in eRNA allows for the analysis of multiple RNA samples at the same time. This approach uses computing power efficiently by balancing computer performance and running time. The sample management function exempts biological researchers from manually inputting numerous data sets. eRNA can be used for both small- and large-scale RNA-seq data. The package has been successfully tested on a personal computer as well as on an advanced server computer. Biological researchers may customize their own computer platforms at a relatively low cost.
eRNA aims to help users gain insight into the underlying biology of the expressed RNAs determined by RNA-seq. The current version of eRNA has been integrated with other tools for identification, differential expression profiling analysis, and visualization of known miRNAs and mRNAs, as well as the discovery of novel miRNAs, target gene screening using recursive partitioning analysis, sequence alignment, and the visualization of sequencing quality control. Additional mRNA-seq tools [31, 32] besides the TopHat-Cufflinks pipeline used in the module “mRNA identification”, more differential gene expression methods besides the R package DESeq used in the module “Target screening”, and enrichment tools for pathway analysis will be incorporated into future versions of eRNA.
eRNA can be used for the identification of RNAs and expression profiling analysis of miRNA-seq and mRNA-seq data. It is easy to use and requires no prior specific computer science knowledge. A user-friendly framework allows biological researchers to focus on biological interpretation. Parameter settings and preparations of raw data and reference sequences are simplified. Parallel processing in eRNA allows for the analysis of multiple RNA samples at the same time. The sample management function exempts biological researchers from manually inputting numerous data sets.
Availability and requirements
eRNA is available for free download and use at https://sourceforge.net/projects/erna/?source=directory according to the GNU Public License. The user manual, which covers installation and the required running environments, is included in the eRNA package. Any use by non-academics requires a license. We developed eRNA in Perl on the Linux operating system. The development and testing environments were Fedora Linux 17 (x86, 64-bit) on a personal computer equipped with one Intel Core i7-3770K CPU (3.5 GHz, 4 cores per CPU) and 32 GB memory, and a Red Hat Enterprise Linux Server (release 5.9, x86, 64-bit) equipped with four Intel Xeon X5687 CPUs (3.6 GHz, 4 cores per CPU) and 96 GB memory. Other software environments included Perl (version 5.14), Perl-Gtk2 (version 1.241), Bioperl (version 1.6, http://www.bioperl.org/), and R (version 2.15, http://www.r-project.org/).
GUI: Graphic user interface
miRNA-seq: microRNA sequencing; mRNA-seq: mRNA sequencing
We would like to thank the Human and Molecular Genetics Center at Medical College of Wisconsin for sequencing consultation and support.
This work was supported by the Advancing a Healthier Wisconsin fund (Project #5520227) and by the National Institutes of Health (R01CA157881) to LW.
- Huang P-J, Liu Y-C, Lee CC, Lin W-C, Gan RR, Lyu PC, Tang P: DSAP: deep-sequencing small RNA analysis pipeline. Nucleic Acids Res. 2010, 38: W385-W391. 10.1093/nar/gkq392.
- Buermans HP, Ariyurek Y, van Ommen G, den Dunnen JT, ’t Hoen PA: New methods for next generation sequencing based microRNA expression profiling. BMC Genomics. 2010, 11: 716. 10.1186/1471-2164-11-716.
- Hackenberg M, Rodríguez-Ezpeleta N, Aransay AM: miRanalyzer: an update on the detection and analysis of microRNAs in high-throughput sequencing experiments. Nucleic Acids Res. 2011, 39: W132-W138. 10.1093/nar/gkr247.
- Friedländer MR, Mackowiak SD, Li N, Chen W, Rajewsky N: miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res. 2011, 40 (1): 37-52.
- Mathelier A, Carbone A: MIReNA: finding microRNAs with high accuracy and no learning at genome scale and from deep sequencing data. Bioinformatics. 2010, 26 (18): 2226-2234. 10.1093/bioinformatics/btq329.
- Wang W-C, Lin F-M, Chang W-C, Lin K-Y, Huang H-D, Lin N-S: miRExpress: analyzing high-throughput sequencing data for profiling microRNA expression. BMC Bioinforma. 2009, 10: 328. 10.1186/1471-2105-10-328.
- Ronen R, Gan I, Modai S, Sukacheov A, Dror G, Halperin E, Shomron N: miRNAkey: a software for microRNA deep sequencing analysis. Bioinformatics. 2010, 26 (20): 2615-2656. 10.1093/bioinformatics/btq493.
- Humphreys DT, Suter CM: miRspring: a compact standalone research tool for analyzing miRNA-seq data. Nucleic Acids Res. 2013, 41 (15): e147. 10.1093/nar/gkt485.
- Zhu E, Zhao F, Xu G, Hou H, Zhou L, Li X, Sun Z, Wu J: mirTools: microRNA profiling and discovery based on high-throughput sequencing. Nucleic Acids Res. 2010, 38: W392-W397. 10.1093/nar/gkq393.
- Pantano L, Estivill X, Marti E: SeqBuster, a bioinformatic tool for the processing and analysis of small RNAs datasets, reveals ubiquitous miRNA modifications in human embryonic cells. Nucleic Acids Res. 2009, 38 (5): e34.
- Li Y, Zhang Z, Liu F, Vongsangnak W, Jing Q, Shen B: Performance comparison and evaluation of software tools for microRNA deep-sequencing data analysis. Nucleic Acids Res. 2012, 40 (10): 4298-4305. 10.1093/nar/gks043.
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012, 7 (3): 562-578. 10.1038/nprot.2012.016.
- Goncalves A, Tikhonov A, Brazma A, Kapushesky M: A pipeline for RNA-seq data processing and quality assessment. Bioinformatics. 2011, 27 (6): 867-869. 10.1093/bioinformatics/btr012.
- Kallio MA, Tuimala JT, Hupponen T, Klemelä P, Gentile M, Scheinin I, Koski M, Käki J, Korpelainen EI: Chipster: user-friendly analysis software for microarray and other high-throughput data. BMC Genomics. 2011, 12: 507. 10.1186/1471-2164-12-507.
- Friedman BA, Maniatis T: ExpressionPlot: a web-based framework for analysis of RNA-Seq and microarray gene expression data. Genome Biol. 2011, 12 (7): R69. 10.1186/gb-2011-12-7-r69.
- Cumbie JS, Kimbrel JA, Di Y, Schafer DW, Wilhelm LJ, Fox SE, Sullivan CM, Curzon AD, Carrington JC, Mockler TC, Chang JH: GENE-counter: a computational pipeline for the analysis of RNA-Seq data for gene expression differences. PLoS One. 2011, 6 (10): e25279. 10.1371/journal.pone.0025279.
- Halbritter F, Vaidya HJ, Tomlinson SR: GeneProf: analysis of high-throughput sequencing experiments. Nat Methods. 2012, 9 (1): 7-8.
- Givan SA, Bottoms CA, Spollen WG: Computational analysis of RNA-seq. Methods Mol Biol. 2012, 883: 201-219. 10.1007/978-1-61779-839-9_16.
- Lohse M, Bolger AM, Nagel A, Fernie AR, Lunn JE, Stitt M, Usadel B: RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res. 2012, 40: W622-W627. 10.1093/nar/gks540.
- Soderlund C, Nelson W, Willer M, Gang DR: TCW: transcriptome computational workbench. PLoS One. 2013, 8 (7): e69401. 10.1371/journal.pone.0069401.
- Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2009, 38 (6): 1767-1771.
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25. 10.1186/gb-2009-10-3-r25.
- Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL: TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013, 14 (4): R36. 10.1186/gb-2013-14-4-r36.
- Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L: Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013, 31 (1): 46-53.
- Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11 (10): R106. 10.1186/gb-2010-11-10-r106.
- Strobl C, Malley J, Tutz G: An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychol Methods. 2009, 14 (4): 323-348.
- Huang X-Y, Yuan T-Z, Tschannen M, Sun Z, Jacob H, Du M-J, Liang M-H, Dittmar RL, Liu Y, Kohli M, Thibodeau SN, Boardman L: Characterization of human plasma-derived exosomal RNAs by deep sequencing. BMC Genomics. 2013, 14: 319. 10.1186/1471-2164-14-319.
- Kozomara A, Griffiths-Jones S: miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 2010, 39: D152-D157.
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9 (4): 357-359. 10.1038/nmeth.1923.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
- Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, Rätsch G, Goldman N, Hubbard TJ, Harrow J, Guigó R, Bertone P, The RGASP Consortium: Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods. 2013, 10 (12): 1185-1191. 10.1038/nmeth.2722.
- Steijger T, Abril JF, Engström PG, Kokocinski F, Hubbard TJ, Guigó R, Harrow J, Bertone P, The RGASP Consortium: Assessment of transcript reconstruction methods for RNA-seq. Nat Methods. 2013, 10 (12): 1177-1184. 10.1038/nmeth.2714.
- Rapaport F, Khanin R, Liang Y-P, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D: Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013, 14 (9): R95. 10.1186/gb-2013-14-9-r95.
- Khatri P, Sirota M, Butte AJ: Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012, 8 (2): e1002375. 10.1371/journal.pcbi.1002375.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.