QuickRNASeq lifts large-scale RNA-seq data analyses to the next level of automation and interactive visualization

Zhao, Shanrong; Xi, Li; Quan, Jie; Xi, Hualin; Zhang, Ying; von Schack, David; Vincent, Michael; Zhang, Baohong

doi:10.1186/s12864-015-2356-9

Software
Open access
Published: 08 January 2016

QuickRNASeq lifts large-scale RNA-seq data analyses to the next level of automation and interactive visualization

Shanrong Zhao¹,
Li Xi¹,
Jie Quan²,
Hualin Xi²,
Ying Zhang¹,
David von Schack¹,
Michael Vincent¹ &
…
Baohong Zhang¹

BMC Genomics volume 17, Article number: 39 (2016) Cite this article

10k Accesses
31 Citations
36 Altmetric
Metrics details

Abstract

Background

RNA sequencing (RNA-seq), a next-generation sequencing technique for transcriptome profiling, is being increasingly used, in part driven by the decreasing cost of sequencing. Nevertheless, the analysis of the massive amounts of data generated by large-scale RNA-seq remains a challenge. Multiple algorithms pertinent to basic analyses have been developed, and there is an increasing need to automate the use of these tools so as to obtain results in an efficient and user friendly manner. Increased automation and improved visualization of the results will help make the results and findings of the analyses readily available to experimental scientists.

Results

By combing the best open source tools developed for RNA-seq data analyses and the most advanced web 2.0 technologies, we have implemented QuickRNASeq, a pipeline for large-scale RNA-seq data analyses and visualization. The QuickRNASeq workflow consists of three main steps. In Step #1, each individual sample is processed, including mapping RNA-seq reads to a reference genome, counting the numbers of mapped reads, quality control of the aligned reads, and SNP (single nucleotide polymorphism) calling. Step #1 is computationally intensive, and can be processed in parallel. In Step #2, the results from individual samples are merged, and an integrated and interactive project report is generated. All analyses results in the report are accessible via a single HTML entry webpage. Step #3 is the data interpretation and presentation step. The rich visualization features implemented here allow end users to interactively explore the results of RNA-seq data analyses, and to gain more insights into RNA-seq datasets. In addition, we used a real world dataset to demonstrate the simplicity and efficiency of QuickRNASeq in RNA-seq data analyses and interactive visualizations. The seamless integration of automated capabilites with interactive visualizations in QuickRNASeq is not available in other published RNA-seq pipelines.

Conclusion

The high degree of automation and interactivity in QuickRNASeq leads to a substantial reduction in the time and effort required prior to further downstream analyses and interpretation of the analyses findings. QuickRNASeq advances primary RNA-seq data analyses to the next level of automation, and is mature for public release and adoption.

Background

RNA sequencing (RNA-seq) has emerged as a powerful technology in transcriptome profiling [1–3]. Our previous side-by-side comparison between RNA-seq and microarray in investigating T cell activation demonstrated that RNA-seq analysis has many advantages over microarray analysis [4]. In contrast to hybridization-based microarray analyses, RNA-seq has the extra benefits of obtaining transcription start and stop sites, alternative spliced isoforms, and genetic variants in addition to gene expression levels. One apparent shortcoming of early non-stranded (standard) RNA-seq protocols is that a sequence read loses the strand origin information, thus making it difficult to determine accurately the expression levels of overlapping genes transcribed from opposite strands. A comparison of stranded with non-stranded RNA-seq led us to conclude that stranded RNA-seq provides a more accurate estimation of gene expression levels than non-stranded RNA-seq [5].

Short reads generated by RNA-seq experiments must first be aligned, or mapped, to a reference genome or transcriptome assembly. The general objective of mapping or aligning a collection of sequence reads to a reference is to discover the true location (origin) of each read with respect to that reference. Although a large number of read mapping algorithms have been developed in recent years [6–10], the accurate alignment of RNA-seq reads is still a challenge. Indeed, some features of a reference genome such as repetitive regions, assembly errors, and assembly gaps render this objective impossible for a subset of reads. Furthermore, because RNA-seq libraries are constructed from transcribed RNA, intronic sequences are not present in exon-exon spanning reads. Therefore, when aligning the sequences to a reference genome, reads that span exon-exon junctions have to be split across potentially thousands of bases of intronic sequence. Many of the RNA-seq alignment tools, including STAR [11], GSNAP [12], MapSplice [13], and TopHat [14], use reference transcriptomes to inform the alignment of junction reads. The benefits of using a reference transcriptome to map RNA-seq reads have been demonstrated clearly in our previous RNA-seq analyses [15, 16].

The second important step in most RNA-seq analyses is gene or isoform quantification. A common method to estimate gene or transcript abundance from RNA-seq data is to count the number of reads that map uniquely to each gene or transcript. RPKM (reads per kilobase per million reads) is widely used to represent the relative abundance of mRNAs for a gene or transcript. A number of algorithms have been developed to infer gene and isoform abundance [17, 18], including RSEM [19, 20], Cufflinks [21], IsoEM [22], featureCounts [23], and HTSeq [24]. A gene can be expressed in one or more transcript isoforms; accordingly, its expression level should be represented as the sum of its isoforms. However, estimating the expression of individual isoforms is intrinsically more difficult because different isoforms of a gene typically have a high proportion of genomic overlap. Accordingly, a simpler union exon-based approach has been proposed, in which all overlapping exons of the same gene are first merged into union exons, and the total length of the union exons is taken to represent the gene length. We carried out a side-by-side comparison between union exon-based approach and transcript-based method in RNA-seq gene quantification [25], and found that gene expressions were significantly underestimated when the union exon-based approach was used. Therefore, we strongly discourage using the union exon-based approach in gene quantification despite its simplicity.

Although the time and cost for generating RNA-seq data are decreasing, the analysis of massive amounts of RNA-seq data still remains challenging. Numerous software packages and algorithms for basic data quality control (QC) and analyses have been developed, which has led to the need to apply these tools efficiently to obtain results within a reasonable timeframe, especially for large datasets. Based on our own experience with in-house analyses of multiple RNA-seq datasets of varying size using open source tools, the main challenges, gaps, and bottlenecks for large-scale RNA-seq data analyses can be summarized as follows:

1.
Selecting appropriate software packages and setting software-specific parameters. Making the right or best choice can be difficult because many similar tools are available. Setting software parameters is even harder if not impossible, because it often requires both an in-depth understanding of the algorithms and sufficient hands-on experience, which disadvantages researchers new to this field.
2.
Writing scripts to make different components work seamlessly in a pipeline. A variety of algorithms have been designed to perform different tasks, but they have been developed (and/or maintained) independently by different research groups and often use different programming languages. Moreover, those algorithms do not understand each other well, and the output(s) from one algorithm often cannot be used as input(s) for another algorithm. As a result, additional bridging scripts are necessary, which ideally requires a data analyst who is familiar with a number of programming languages, including Shell script, Perl, Python, Java, C/C++, and R.
3.
Integrating and summarizing analyses results from individual samples. In general, most algorithms are implemented to process an individual sample. Consequently, the results of primary RNA-seq data analyses have to be further processed, integrated, and summarized for reporting, presentation, and downstream analysis. Usually, data integration and summarization are tedious and not easy to execute efficiently.
4.
Identifying RNA-seq sample outliers. It is not uncommon that some samples have low quality and often substitute samples are not available, especially for RNA-seq of clinical specimen. RNA-seq is a complicated multistep process that involves sample collection/stabilization, RNA extraction, fragmentation, cDNA synthesis, adapter ligation, amplification, purification, and sequencing. Any mistake in this complex sequence of protocols can result in biased or even unusable data. Therefore, it is necessary to establish stringent RNA-seq data quality metrics to identify outliers that should be excluded from further downstream data analysis.
5.
Detecting sample swapping and mislabeling. For large-scale RNA-seq studies in which hundreds or even thousands of RNA samples are sequenced and analyzed, it is not unusual that some samples are mishandled and appear to be swapped or sequenced more than once. Such errors can become a serious problem for downstream data analyses and interpretation of results, especially for longitudinal sample analyses. It is difficult to identify such mistakes based only on RNA-seq QC metrics and/or gene expression profiles. To confirm whether samples are from the same subject, it is more reliable to compare genetic markers among samples, such as single nucleotide polymorphisms (SNPs).
6.
Sharing the results of RNA-seq data analyses with experimental scientists. Nearly all RNA-seq data analyses are performed using Linux clusters or workstations; however, analyses results in Linux are often inaccessible to most experimental scientists. RNA-seq data analyses typically generate a large number of files and large amounts of data that are difficult to comprehend or digest directly by experimental scientists. Therefore, easily accessible interfaces are needed that not only provide a quick and easy way for non-expert users to obtain high-level visualizations of the main RNA-seq analyses outputs (e.g., QC results), but also allow them to drill down further or export the results into additional analysis applications of their choice. To the best of our knowledge, very few RNA-seq related open source packages provide all these options.

To address these challenges, we have implemented a new pipeline named QuickRNASeq to advance the automation and visualization of RNA-seq data analyses results, and have constantly improved and refined its implementation since its inception. QuickRNASeq significantly reduces data analysts’ hands-on time, which results in a substantial decrease in the time and effort needed for the primary analyses of RNA-seq data before proceeding to further downstream analysis and interpretation. Additionally, QuickRNASeq provides a dynamic data sharing and interactive visualization environment for end users. All the results are accessible from a web browser without the need to set up a web server and database. The rich visualization features implemented in QuickRNASeq enable non-expert end users to interact easily with the RNA-seq data analyses results, and to drill down into specific aspects to gain insights into often complex datasets simply through a point-and-click approach.

Implementation

QuickRNASeq is designed for simplicity and visual interactivity. A few important principles dictate its implementation. First, all components of the pipeline are freely available in the public domain. Second, it is easy to deploy and use. Third, all analyses results including RNA-seq QC metrics, sample correlations, and gene quantifications are accessible via a web browser and can be further explored interactively. An overview of QuickRNASeq (Fig. 1) illustrates its three main steps. Step #1 performs RNA-seq read mapping, counting, aligned read QC, and SNP calling. Step #1 processes each sample completely independently of each other, and is computationally intensive. Therefore, all samples can be processed in parallel in a high performance computing cluster (HPC), or in a serial fashion on a standalone workstation. Step #2 merges the results from the individual sample and generates an integrated and interactive project report for data interpretation in Step #3.

Input files

In addition to raw sequence reads in FASTQ format, the only other required inputs are a reference genome file in FASTA format and a corresponding gene annotation file in GTF (gene transfer format). QuickRNASeq can be applied to any species as long as its genome and gene annotations are available; for example, human, mouse, rat, and cynomolgus or rhesus monkeys. A gene annotation file can exist in many formats, but GTF has become the de facto standard; however, not all tools accept gene annotation files in GTF format as input. For example, RSeQC (RNA-seq quality control package) [26] accepts gene annotation only in BED (browser extensible display) format, though the majority of gene annotations in the public domain are not available in BED format. To ensure that the exact same annotations are used by the different components in QuickRNASeq, we wrote Perl scripts to convert gene annotation files from GTF to BED format. This avoids any discrepancy or inconsistency among gene annotations that are available in different formats.

Step #1: single sample processing

This step consists mainly of read mapping, counting, aligned read QC, and SNP calling, and the corresponding algorithms used to perform these tasks are STAR [11, 27], featureCounts [23], RSeQC [26], and VarScan [28] respectively. STAR aligns spliced sequences of any length with moderate error rates, provides scalability for emerging sequencing technologies, and generates output files ready for transcript/gene expression quantification [27]. The algorithms featureCounts [23] and HTSeq [24] are comparable in terms of counting results, but featureCounts is considerably faster than HTSeq by an order of magnitude for gene-level summarization and requires far less computer memory. Read mapping and counting typically are very time consuming, and we chose STAR and featureCounts in QuickRNASeq mainly because of their high speed and accuracy.

The RSeQC [26] package provides a number of modules that can comprehensively inspect sequence quality, nucleotide composition bias, PCR bias, GC bias, mapped reads distribution, coverage uniformity, and strand specificity. All such QC metrics are valuable for outlier detection. VarScan [28] is a platform-independent software tool that can detect variants in RNA-seq data. It employs a robust heuristic/statistic approach to call variants that meet desired thresholds for read depth, base quality, variant allele frequency, and statistical significance. To verify samples from the same subject, it is unnecessary to call SNPs across all chromosomes. In practice, it is sufficient to use only SNPs from the chromosome that contains the major histocompatibility complex (MHC) genes. For human, mouse, and rat, these are chromosomes 6, 17, and 20, respectively. As mentioned earlier, numerous software packages that can perform similar tasks are freely available; however, we found that the combination of STAR, featureCounts, RSeQC, and VarScan represents one of the best toolsets.

Computational algorithms for RNA-seq analyses are continuously being improved, including STAR, featureCounts, RSeQC, and VarScan. Therefore, we designed our pipeline to be independent of its underlying software version and ensured that it can handle RNA-seq samples from a variety of species. To decouple the dependence of QuickRNASeq pipeline upon underlying computational algorithms and species, we introduced a plain text configuration file that can store project, species, and software-specific parameters. This configuration file also improves the reproducibility of RNA-seq data analyses and simplifies the command lines in QuickRNASeq. For the convenience of QuickRNASeq users, a configuration file template has been provided for easy customization.

Step #2: data integration, QC, and summary

Step #2 aims mainly to merge results generated in Step #1 for each individual RNA-seq sample. Additionally, it runs many across-sample calculations, such as correlation-based QC and a SNP correlation matrix. As shown in Fig. 1, the second step performs the following tasks:

1.
Merge mapping, counting summaries, and RSeQC metrics from individual samples.
2.
Generate a read counting table ready for downstream analysis of all annotated genes.
3.
Calculate a SNP (and gene expression) correlation matrix among samples.
4.
Perform correlation-based sample QC, calculation of MADScore (median absolute deviation score), and data normalization.
5.
Produce RNA-seq metrics and correlation plots ready for PowerPoint presentations.
6.
Generate a comprehensive HTML QC report for individual sample.
7.
Produce a dynamic and integrated QC metrics plot for individual samples.
8.
Generate a master HTML entry webpage for data analyses results.

Each individual task listed above is performed by a corresponding Bash, Perl, or R script, and a master script coordinates the execution of all these tasks. The main scripts and their functions are listed in Table 1. As shown in Table 1, the primary RNA-seq data analyses can be performed by as few as two shell command lines (star-fc-qc.sh and star-fc-qc.summary.sh). All the plots generated in Step #2 are ready for presentations, and the gene counting table can feed downstream differential analysis algorithms. The highly automated features in Step #2 make QuickRNASeq an efficient tool for typical standard RNA-seq analyses, and our pipeline substantially reduces the hands-on time (not the computational time) that data analysts have to spend on primary RNA-seq data analyses.

Table 1 Description of main scripts in the QuickRNASeq package

Full size table

We implemented a correlation-based QC to detect potential outliers in the RNA-seq data by calculating a MADScore for each sample. In general, an outlier appears to deviate markedly from other samples in a RNA-seq study, and thus its correlation with other samples will be relatively low. The MADScore is calculated as follows. For each sample, calculate the correlation difference, which is simply the difference between the average of all the pairwise correlations that involve the sample and the average of all the pairwise correlations that do not involve the sample. If a sample is an outlier, then the difference will be negative. Accordingly, there will be a vector of values (one for each sample). Then this vector of difference is converted to MADScores (robust Z-scores) by subtracting the medians and dividing it by median absolute deviations (MAD). A standard MADScore cutoff (e.g., −5) is set to determine the outliers.

Step #3: interactive data visualization

Primary RNA-seq data analyses results are represented by a standard file folder structure, and an integrated report provides comprehensive QC metrics and a gene expression table. RNA-seq data analyses typically generate a variety of files and large amounts of data, and the master entry webpage generated in Step #2 makes data navigation and visualization more convenient. More importantly, the project report offers interactive visualizations of RNA-seq QC and gene expression levels, and provides analytical tools to gain insights into the data. All required JavaScript libraries have already been packaged into the QuickRNASeq project report; thus, deployment of the data into a web server is optional.

JavaScript has become the hallmark of the web 2.0 technologies because it greatly enhances interactive visualizations. The availability of JavaScript-based open source visualization libraries has fueled the adoption of this technique. We implemented the interactive data visualization in QuickRNASeq using these libraries, including JQuery [29], D3 (Data-Driven Documents) [30], canvasXpress [31], SlickGrid [32], and Nozzle [33]. JQuery [29] makes HTML page traversal, manipulation, event handling, and animation simple. D3 [30] manipulates HTML documents based on input data using HTML5, SVG, and CSS (cascade style sheet). canvasXpress [31] supports a large number of plotting types and offers sample grouping, data transformation, and many other features that are usually only seen in commercial software. SlickGrid [32] is a powerful web-based spreadsheet component that supports searching, sorting, and pagination of tabular datasets, and can be scaled to handle millions of data points. Nozzle [33] is an R package that provides an API (application programming interface) to generate HTML reports with dynamic user interface elements. Nozzle is designed to facilitate summarization and rapid browsing of complex results in data analysis pipelines where multiple analyses are performed frequently on big datasets. By combining these visualization libraries with RNA-seq analyses results, we created multiple dynamic HTML pages to present the RNA-seq QC metrics, and to present gene expression profiles in boxplot and heat map formats dynamically and interactively.

Results and discussion

Test run of QuickRNASeq on a publicly available dataset

GENCODE annotation [34, 35] is based on Ensembl [36] but with improved coverage and accuracy, and thus is used by the ENCODE consortium [37] as well as many other projects (e.g., 1000 Genomes [38]) as the reference gene set. Therefore, we chose the GENCODE annotation for our test run. GENCODE Release 19 was downloaded from the GENCODE web site [35]. An analysis of RNA-seq data from 1641 samples across 43 tissues of 175 individuals in the Genotype-Tissue Expression (GTEx) project [39, 40] revealed the landscape of gene expression across tissues, and catalogued thousands of tissue-specific genes. For our test run, we selected 48 GTEx samples from five donors. The sample identifiers, annotations, and RNA-seq mapping summaries for all 48 samples are listed in Table 2. Note that a sequence read can be aligned uniquely to a reference genome, or mapped to multiple locations. Some reads cannot be mapped to the reference genome at all. The percentages of reads that were uniquely mapped, mapped to multiple locations, or unmapped are given in Table 2. The complete report for our test run of the GTEx dataset can be downloaded directly from the QuickRNASeq project home page, and is briefly described below.

Table 2 Annotation and mapping summary for the 48 samples used in the QuickRNASeq test run

Full size table

All analyses results accessible from a single entry webpage

A screenshot of the entry webpage for the results of the test run is shown in Fig. 2. The page uses Noozle’s presentation template, which collates sections into a single neat web page with functionalities to expand or collapse individual or whole sections. In the “QC Metrics” section, both static images and interactive plots are provided for a variety of QC measures, including read mapping summaries, read counting statistics, SNP correlations among samples, number of expressed genes at various RPKM cutoffs, and correlations among gene expression profiles. All static QC plots can be enlarged into a new window by clicking on the iconized image, and the corresponding more dynamic and interactive plots are accessible by clicking the pointing hand icon. The interactive plots of QC measures offer many interactive features over static images, such as zooming in and zooming out. The raw data that was used to generate these figures can be accessed simply by clicking the corresponding hyperlinked text. The “Parallel Plot” and “Expression Table” sections in Fig. 2 are detailed later. Furthermore, all the result files and figures are directly accessible by expanding the “Raw Data Files” section shown at the bottom of Fig. 2. The entry webpage makes data navigation and visualization more convenient and intuitive, especially for experimental scientists.

SNP correlation to detect mishandled samples

SNP correlation plots help to verify whether samples are from the same subject or not. By definition, SNP concordance among samples from the same subject will be much higher than those samples from different subjects. In the first case, typical examples may be samples of different tissues from the same subject or longitudinal samples from the same subject. For simplicity, we selected samples from three donors to illustrate the usefulness of the SNP concordance plot (Fig. 3). As we expected, the SNP correlation plot in Fig. 3a is clustered by the donors. The corresponding correlation plot after the swap of SRR598044 and SRR608096 is shown in Fig. 3b where the correlation pattern indicates that the two samples are wrongly labeled. The true identifiers for the two swapped samples are indicated on the right of the plot. Sample swapping is typically very difficult to detect when it occurs. We have tried different methods to rectify mislabeled or swapped samples and found that a SNP correlation-based approach gave the best results (data not shown).

Integrated QC metrics for individual sample

The parallel plot in Fig. 4 is a common way to visualize high-dimensional data and it is used widely in multivariate data analysis. We implemented the parallel plot to link all related QC measurements for all samples into one plot. Each axis within the plot represents a sample feature or a QC measurement. There are multiple ways users can control the look and feel of the plot, such as selecting a subset of samples to view, changing the order of the axes by drag-and-drop, and removing unwanted axes for a clearer view by dragging them off the plot to either side. The linked table is searchable, and for any selected sample in the table, its corresponding QC measures are highlighted simultaneously on the plot with tooltips showing the measurement values.

MAD, an alternative and more robust measure of dispersion has been proposed to detect outliers [41]. We extended MAD to implement a correlation-based QC to detect potential outliers. The MADScore was calculated as described above, and is listed in the table in Fig. 4. To determine whether a potential outlier identified from the correlation-based QC is a true outlier, we recommend that the corresponding QC report is also checked. The comprehensive QC report for an individual sample can be accessed by clicking the corresponding sample identifier in the table in Fig. 4. For example, some representative RNA-seq QC metrics for SRR603068 (highlighted in Fig. 4) are shown in Fig. 5. The metrics correspond to reads duplication rate, distribution of reads versus percentages of GC content, nucleotide composition bias, distribution of read quality score, plot of junction saturation, and characteristics of the splicing junction sites.

Two strategies are used to determine the read duplication rate, as indicated in Fig. 5a. For the sequence-based strategy, reads with exactly the same sequence content are regarded as duplicated reads, whereas, for the mapping-based strategy, reads mapped to the same genomic location are regarded as duplicated reads. For spliced reads, reads mapped to the same starting position that splice the same way are regarded as duplicated reads. SRR603068 is a brain sample, and its nucleotide composition is biased towards A/T, as indicated in Fig. 5c. For RNA-seq data, we often want to know whether the sequencing depth is enough for the analyses, and the saturation plot shown in Fig. 5d is very valuable for this. For a well annotated organism, the number of expressed genes in a particular tissue is almost fixed so the number of splice junctions is also fixed. These numbers should be rediscovered from saturated RNA-seq data. The plot in Fig. 5d indicates that more reads should be sequenced for performing alternative splicing analyses. In Fig. 5f, all multiple splicing events spanning the same intron have been consolidated into one splicing junction, and a novel junction is considered as complete_novel if neither of the two splice sites can be annotated by a gene model. Otherwise, it is partial_novel, meaning that one of the splice site (5′SS or 3′SS) is new, while the other splice site is annotated (known). While the majority of junctions in Fig. 5f are annotated, over 20 % are either complete_novel or partial_novel.

Interactive visualization of gene expression profiles

One of the most important objectives in many RNA-seq studies is to estimate gene expression levels under certain biological or disease conditions. With the help of the visualization tools shown in Fig. 6, differences in gene expression levels across samples under different conditions can be highlighted easily by a few mouse-clicks either in the boxplot (Fig. 6b) or heat map view (Fig. 6c). A keyword search box at the top of the table (Fig. 6a) provides an easy way to look at related genes such as kinases and interleukins. Gene expression profiles can be grouped and split on the fly according to the sample annotations, such as tissue type, visiting time, and treatment arms. Moreover, the look and feel of a plot, such as font size, color, plot type, and scales for x-axis and y-axis, can be customized by right clicking on the plot and selecting relevant options from the dropdown menu. An annotated heat map (Fig. 6c) is informative in comparing gene expression profiles across different conditions, and can help reveal the relationships between gene expression levels and corresponding biological conditions. Detailed instructions on how to use advanced visualization features of the interactive plot are described in the QuickRNAseq user guide that is bundled with the QuickRNASeq package.

Scalability of QuickRNASeq

All samples can be processed in parallel in Step #1 of the QuickRNAseq pipeline (Fig. 1). In principle, there is no limitation to the number of RNA-seq samples, as long as enough storage is available. For easy data sharing, the web 2.0 visualization tools allow user to interact with the analyses results without the need of a web server and/or database. Therefore, in QuickRNAseq we pack all the data into JavaScript objects within a HTML document. For a RNA-seq project with 1000 samples, the number of gene expression data points can exceed 20 million, assuming that more than 20,000 genes are expressed. As a result, most browsers such as Internet Explorer, Safari, Firefox, and Chrome fail to load such huge datasets because they surpass the memory limit allocated to these web browsers. To solve this problem, we used pako [42], a web-based compression technique, to significantly reduce the number of objects to be created without compromising the end user experience.

Limitations and running of QuickRNASeq

QuickRNASeq is presumed to be executed in a HPC environment, which can process multiple samples in parallel. The out-of-the-box QuickRNASeq pipeline has been fully tested in a HPC computing environment using the IBM Platform’s Load Sharing Facility (LSF) [43], a powerful workload management platform for demanding, distributed HPC environments. The IBM Platform’s LSF provides a comprehensive set of intelligent, policy-driven scheduling features that enable users to utilize all the computing infrastructure resources and ensure optimal application performance. In addition to LSF, many other notable job scheduling software are available [44]. For a cluster that uses a job scheduler other than LSF, star-fc-qc.sh (implementation of Step #1 in Fig. 1) needs to be customized accordingly. The only required change in the script is the way of job submission, and this command is dependent the job scheduling software. For researchers with no access to a HPC computing environment, we implemented star-fc-qc.ws.sh, a customized script that runs on a standard Linux workstation. Of course, analyzing large RNA-seq datasets on a single workstation is not typical and not recommended.

For gene quantifications, QuickRNAseq requires a complete genome sequence and well-annotated genes as inputs. The pipeline is not intended for the discovery of novel isoforms. QuickRNASeq is designed for use by bioinformaticians, experimental biologists, and geneticists in the fields of genome-scale analysis, functional genomics, and systems biology; however, downloading, installing, and running the QuickRNASeq pipeline in a Linux environment will require some basic computer-based expertise. A README.txt is provided along with the QuickRNASeq package, which explains step-by-step how to run QuickRNASeq. In addition, users can examine the configuration and sample annotation file under the test_run folder in the QuickRNASeq package. QuickRNASeq can be run without a sample annotation file, but it is strongly recommended that users provide meaningful annotations for all samples. A proper annotation file should be tab delimited, and QuickRNASeq requires that the first and second columns correspond to sample and subject identifiers, respectively. Sample names should start with a letter, and should not contain any white spaces.

In QuickRNASeq, we selected FeatureCounts, a union exon based approach, for gene quantification. According to our own most recent research [25], union exon based approach is discouraged. Unfortunately, there is still a long way to go for the switch from union exon based approach to transcript-based method in estimation of gene expression levels because of the inaccuracy of isoform quantification [25], especially for those isoforms with low expression, and gene-based annotation databases. Traditionally, functional enrichment analyses rely upon annotation databases such as Gene Ontology (GO) [45], Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [46] and other commercial knowledge systems. All such annotations have been recorded and centered on genes, not transcripts or isoforms. In practical RNA-seq data analyses, the switch from gene to isoform in quantification should ideally go with the switch in annotation hand by hand.

The current version of QuickRNASeq focuses on the automation of primary processing steps in RNA-seq data analyses, and these steps are in general biological question independent. We plan to expand QuickRNASeq to downstream analyses in the future, including differential analysis and pathway enrichment. Downstream analyses are usually driven by biological questions and experimental designs and thus different from project to project. How to automate such analyses in a user friendly manner remains a challenge for our practical implementation.

QuickRNASeq versus QuickNGS

While this paper was in preparation, Wagle et al. [47] published QuickNGS, a new workflow system to analyze data from multiple next-generation sequencing (NGS) projects at a time. QuickNGS uses parallel computing resources, a comprehensive backend database, and the careful selection of previously published algorithmic approaches to build fully automated data analysis workflows. An overview of our comparison of the QuickRNASeq pipeline with the QuickNGS workflow is provided in Table 3. In summary, compared with QuickNGS, QuickRNASeq is more tailored to RNA-seq data. In QuickRNASeq, we developed scripts to perform RNA-seq-specific data integration and to generate integrated and interactive project reports in a fully automated manner. All the results from QuickRNASeq can be shared easily and further explored from a web browser on a personal computer even without internet access. Our pipeline QuickRNASeq provides a noticeable advancement of RNA-seq data analyses by incorporating a high degree of automation together with interactive visualizations.

Table 3 Comparison of QuickRNASeq with QuickNGS

Full size table

Conclusions

By combing the best open source tool sets developed for RNA-seq data analyses and the most advanced web 2.0 technologies, we implemented the QuickRNASeq pipeline, which significantly reduces the efforts involved in primary RNA-seq data analyses and generates an integrated project report for data sharing and interactive visualization. The dynamic visualization features enable end users to explore and digest RNA-seq data analyses results intuitively and interactively, and to gain deep insights into RNA-seq datasets. The configuration file contains project, species, and software related parameters, and thus improves the reproducibly in RNA-seq data analyses. We have already applied QuickRNASeq to in-house large scale RNA-seq projects, and its current version is stable and mature for public release and adoption.

Availability of software and supporting Information

Project name: QuickRNASeq pipeline

Project home page: http://quickrnaseq.sourceforge.net

Operating system: Linux

Programming languages: Bash scripting, Perl, R, JavaScript

Dependencies: R packages edgeR, reshape2 and ggplot2

Other requirements: None

License: GNU GPL version 3

References

Wang Z, Gerstein M, Snyder M. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.
Article PubMed PubMed Central CAS Google Scholar
Mutz KO, Heilkenbrinker A, Lönne M, Walter JG, Stahl F. Transcriptome analysis using next-generation sequencing. Curr Opin Biotechnol. 2013;24(1):22–30.
Article PubMed CAS Google Scholar
Mantione KJ, Kream RM, Kuzelova H, Ptacek R, Raboch J, Samuel JM, et al. Comparing bioinformatic gene expression profiling methods: microarray and RNA-Seq. Med Sci Monit Basic Res. 2014;20:138–42.
Article PubMed PubMed Central Google Scholar
Zhao S, Fung-Leung W-P, Bittner A, Ngo K, Liu X. Comparison of RNA-seq and microarray in transcriptome profiling of activated T cells. PLoS ONE. 2014;9(1):e78644.
Article PubMed PubMed Central CAS Google Scholar
Zhao S, Zhang Y, Gordon W, Quan J, Xi H, Du S, et al. Comparison of stranded and non-stranded RNA-seq transcriptome profiling and investigation of gene overlap. BMC Genomics. 2015;16:487.
Article CAS Google Scholar
Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011;8(6):469–77.
Article PubMed CAS Google Scholar
Capobianco E. RNA-Seq data: a complexity journey. Comput Struct Biotechnol J. 2014;11(19):123–30.
Article PubMed PubMed Central Google Scholar
Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, Rätsch G, et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods. 2013;10(12):1185–91.
Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013;14:91.
Article PubMed PubMed Central Google Scholar
Borozan I, Watt SN, Ferretti V. Evaluation of alignment algorithms for discovery and identification of pathogens using RNA-seq. PLoS ONE. 2013;8(10):e76935.
Article PubMed PubMed Central CAS Google Scholar
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21.
Article PubMed PubMed Central CAS Google Scholar
Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26(7):873–81.
Article PubMed PubMed Central CAS Google Scholar
Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 2010;38(18):e178.
Article PubMed PubMed Central CAS Google Scholar
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.
Article PubMed PubMed Central CAS Google Scholar
Zhao S. Assessment of the impact of using a reference transcriptome in mapping short RNA-seq reads. PLoS ONE. 2014;9(7):e101374.
Article PubMed PubMed Central CAS Google Scholar
Zhao S, Zhang B. A comprehensive evaluation of Ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics. 2015;16:97.
Article PubMed PubMed Central CAS Google Scholar
Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, Zavolan M. Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. Genome Biol. 2015;16:150.
Article PubMed PubMed Central CAS Google Scholar
Angelini C, De Canditiis D, De Feis I. Computational approaches for isoform detection and estimation: good and bad news. BMC Bioinformatics. 2014;15:135.
Article PubMed PubMed Central Google Scholar
Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNA-Seq gene expression estimation with read mapping inaccuracy. Bioinformatics. 2009;26:493–500.
Article PubMed PubMed Central CAS Google Scholar
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.
Article PubMed PubMed Central CAS Google Scholar
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511–5.
Article PubMed PubMed Central CAS Google Scholar
Nicolae M, Mangul S, Măndoiu II, Zelikovsky A. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms Mol Biol. 2011;6:9.
Article PubMed PubMed Central Google Scholar
Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–30.
Article PubMed CAS Google Scholar
Anders S, Theodor P, Huber W. HTSeq — a Python framework to work with high-throughput sequencing data. Bioinformatics. 2014;31(2):166–9.
Article PubMed PubMed Central Google Scholar
Zhao S, Zhang B. Union exon based approach for RNA-seq gene quantification: to be or not to be? PLoS ONE. 2015;10(11):e0141910.
Article PubMed PubMed Central CAS Google Scholar
Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012;28(16):2184–5.
Article PubMed CAS Google Scholar
Dobin A, Gingeras TR. Mapping RNA-seq reads with STAR. Curr Protoc Bioinformatics. 2015;51:11.14.1–11.14.19.
Article Google Scholar
Koboldt D, Zhang Q, Larson D, Shen D, McLellan M, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22:568–76.
Article PubMed PubMed Central CAS Google Scholar
jQuery. https://jquery.com. Accessed 15 November 2015.
Data-Driven Documents. http://d3js.org. Accessed 15 November 2015.
canvasXpress. http://canvasxpress.org. Accessed 15 November 2015.
SlickGrid. https://github.com/mleibman/SlickGrid. Accessed 15 November 2015.
Gehlenborg N, Noble MS, Getz G, Chin L, Park PJ. Nozzle: a report generation toolkit for data analysis pipelines. Bioinformatics. 2013;29:1089–91.
Article PubMed PubMed Central CAS Google Scholar
Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: the reference human genome annotation for the ENCODE Project. Genome Res. 2012;22(9):1760–74.
Article PubMed PubMed Central CAS Google Scholar
GENCODE. http://www.gencodegenes.org/releases/19.html. Accessed 15 November 2015.
Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2014. Nucleic Acids Res. 2014;42(Database issue):D749–55.
Article PubMed PubMed Central CAS Google Scholar
The ENCODE Project. http://www.genome.gov/encode/. Accessed 15 November 2015.
1000 Genomes. http://www.1000genomes.org/. Accessed 15 November 2015.
GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348(6235):648–60.
Article PubMed Central CAS Google Scholar
Melé M, Ferreira PG, Reverter F, DeLuca DS, Monlong J, Sammeth M, et al. Human genomics. The human transcriptome across tissues and individuals. Science. 2015;348(6235):660–5.
Article PubMed PubMed Central CAS Google Scholar
Leys C, Ley C, Klein O, Bernard P, Licata L. Detecting outliers: do not use standard deviation around the mean, use absolute deviation around the median. J Exp Soc Psycho. 2013;49(4):764–6.
Article Google Scholar
Pako. https://github.com/nodeca/pako. Accessed 15 November 2015.
IBM Platform LSF. http://www.ibm.com/systems/platformcomputing/products/lsf. Accessed 15 November 2015.
Job scheduler software. https://en.wikipedia.org/wiki/List_of_job_scheduler_software. Accessed 15 November 2015.
The Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucl Acids Res. 2015;43(Database issue):D1049–56.
Article PubMed Central Google Scholar
Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004;32(Database issue):D277–80.
Article PubMed PubMed Central CAS Google Scholar
Wagle P, Nikolić M, Frommolt P. QuickNGS elevates Next-Generation Sequencing data analysis to a new level of automation. BMC Genomics. 2015;16:487.
Article PubMed PubMed Central CAS Google Scholar

Download references

Acknowledgments

The authors thank Alexander Dobin for valuable assistance with running STAR, Shi Wei for assistance with using featureCounts, and Liguo Wang for his help with RSeQC. The development of QuickRNASeq was motivated partially by large scale in-house RNA sequencing projects in which thousands of samples were sequenced and analyzed. We thank Karen Page and Mina Hassan-Zahraee for their support and collaboration.

Author information

Authors and Affiliations

PharmaTherapeutics Clinical R&D, Pfizer Worldwide Research and Development, Cambridge, MA, 02139, USA
Shanrong Zhao, Li Xi, Ying Zhang, David von Schack, Michael Vincent & Baohong Zhang
Computational Sciences Center of Emphasis, Pfizer Worldwide Research and Development, Cambridge, MA, 02139, USA
Jie Quan & Hualin Xi

Authors

Shanrong Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Li Xi
View author publications
You can also search for this author in PubMed Google Scholar
Jie Quan
View author publications
You can also search for this author in PubMed Google Scholar
Hualin Xi
View author publications
You can also search for this author in PubMed Google Scholar
Ying Zhang
View author publications
You can also search for this author in PubMed Google Scholar
David von Schack
View author publications
You can also search for this author in PubMed Google Scholar
Michael Vincent
View author publications
You can also search for this author in PubMed Google Scholar
Baohong Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Shanrong Zhao or Baohong Zhang.

Additional information

Competing interests

All authors are employees of Pfizer. The authors declare that they have no competing interests.

Authors’ contributions

SZ and BZ conceived, designed, implemented, tested, validated the workflow, and drafted the manuscript. LX compared QuickRNASeq with QuickNGS. JQ, HX, YZ, DS, and MV participated in the design and requirements specification discussions. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Zhao, S., Xi, L., Quan, J. et al. QuickRNASeq lifts large-scale RNA-seq data analyses to the next level of automation and interactive visualization. BMC Genomics 17, 39 (2016). https://doi.org/10.1186/s12864-015-2356-9

Download citation

Received: 16 November 2015
Accepted: 23 December 2015
Published: 08 January 2016
DOI: https://doi.org/10.1186/s12864-015-2356-9

QuickRNASeq lifts large-scale RNA-seq data analyses to the next level of automation and interactive visualization