QuickRNASeq lifts large-scale RNA-seq data analyses to the next level of automation and interactive visualization
© Zhao et al. 2016
Received: 16 November 2015
Accepted: 23 December 2015
Published: 8 January 2016
RNA sequencing (RNA-seq), a next-generation sequencing technique for transcriptome profiling, is being increasingly used, in part driven by the decreasing cost of sequencing. Nevertheless, the analysis of the massive amounts of data generated by large-scale RNA-seq remains a challenge. Multiple algorithms pertinent to basic analyses have been developed, and there is an increasing need to automate the use of these tools so as to obtain results in an efficient and user friendly manner. Increased automation and improved visualization of the results will help make the results and findings of the analyses readily available to experimental scientists.
By combing the best open source tools developed for RNA-seq data analyses and the most advanced web 2.0 technologies, we have implemented QuickRNASeq, a pipeline for large-scale RNA-seq data analyses and visualization. The QuickRNASeq workflow consists of three main steps. In Step #1, each individual sample is processed, including mapping RNA-seq reads to a reference genome, counting the numbers of mapped reads, quality control of the aligned reads, and SNP (single nucleotide polymorphism) calling. Step #1 is computationally intensive, and can be processed in parallel. In Step #2, the results from individual samples are merged, and an integrated and interactive project report is generated. All analyses results in the report are accessible via a single HTML entry webpage. Step #3 is the data interpretation and presentation step. The rich visualization features implemented here allow end users to interactively explore the results of RNA-seq data analyses, and to gain more insights into RNA-seq datasets. In addition, we used a real world dataset to demonstrate the simplicity and efficiency of QuickRNASeq in RNA-seq data analyses and interactive visualizations. The seamless integration of automated capabilites with interactive visualizations in QuickRNASeq is not available in other published RNA-seq pipelines.
The high degree of automation and interactivity in QuickRNASeq leads to a substantial reduction in the time and effort required prior to further downstream analyses and interpretation of the analyses findings. QuickRNASeq advances primary RNA-seq data analyses to the next level of automation, and is mature for public release and adoption.
RNA sequencing (RNA-seq) has emerged as a powerful technology in transcriptome profiling [1–3]. Our previous side-by-side comparison between RNA-seq and microarray in investigating T cell activation demonstrated that RNA-seq analysis has many advantages over microarray analysis . In contrast to hybridization-based microarray analyses, RNA-seq has the extra benefits of obtaining transcription start and stop sites, alternative spliced isoforms, and genetic variants in addition to gene expression levels. One apparent shortcoming of early non-stranded (standard) RNA-seq protocols is that a sequence read loses the strand origin information, thus making it difficult to determine accurately the expression levels of overlapping genes transcribed from opposite strands. A comparison of stranded with non-stranded RNA-seq led us to conclude that stranded RNA-seq provides a more accurate estimation of gene expression levels than non-stranded RNA-seq .
Short reads generated by RNA-seq experiments must first be aligned, or mapped, to a reference genome or transcriptome assembly. The general objective of mapping or aligning a collection of sequence reads to a reference is to discover the true location (origin) of each read with respect to that reference. Although a large number of read mapping algorithms have been developed in recent years [6–10], the accurate alignment of RNA-seq reads is still a challenge. Indeed, some features of a reference genome such as repetitive regions, assembly errors, and assembly gaps render this objective impossible for a subset of reads. Furthermore, because RNA-seq libraries are constructed from transcribed RNA, intronic sequences are not present in exon-exon spanning reads. Therefore, when aligning the sequences to a reference genome, reads that span exon-exon junctions have to be split across potentially thousands of bases of intronic sequence. Many of the RNA-seq alignment tools, including STAR , GSNAP , MapSplice , and TopHat , use reference transcriptomes to inform the alignment of junction reads. The benefits of using a reference transcriptome to map RNA-seq reads have been demonstrated clearly in our previous RNA-seq analyses [15, 16].
The second important step in most RNA-seq analyses is gene or isoform quantification. A common method to estimate gene or transcript abundance from RNA-seq data is to count the number of reads that map uniquely to each gene or transcript. RPKM (reads per kilobase per million reads) is widely used to represent the relative abundance of mRNAs for a gene or transcript. A number of algorithms have been developed to infer gene and isoform abundance [17, 18], including RSEM [19, 20], Cufflinks , IsoEM , featureCounts , and HTSeq . A gene can be expressed in one or more transcript isoforms; accordingly, its expression level should be represented as the sum of its isoforms. However, estimating the expression of individual isoforms is intrinsically more difficult because different isoforms of a gene typically have a high proportion of genomic overlap. Accordingly, a simpler union exon-based approach has been proposed, in which all overlapping exons of the same gene are first merged into union exons, and the total length of the union exons is taken to represent the gene length. We carried out a side-by-side comparison between union exon-based approach and transcript-based method in RNA-seq gene quantification , and found that gene expressions were significantly underestimated when the union exon-based approach was used. Therefore, we strongly discourage using the union exon-based approach in gene quantification despite its simplicity.
Selecting appropriate software packages and setting software-specific parameters. Making the right or best choice can be difficult because many similar tools are available. Setting software parameters is even harder if not impossible, because it often requires both an in-depth understanding of the algorithms and sufficient hands-on experience, which disadvantages researchers new to this field.
Writing scripts to make different components work seamlessly in a pipeline. A variety of algorithms have been designed to perform different tasks, but they have been developed (and/or maintained) independently by different research groups and often use different programming languages. Moreover, those algorithms do not understand each other well, and the output(s) from one algorithm often cannot be used as input(s) for another algorithm. As a result, additional bridging scripts are necessary, which ideally requires a data analyst who is familiar with a number of programming languages, including Shell script, Perl, Python, Java, C/C++, and R.
Integrating and summarizing analyses results from individual samples. In general, most algorithms are implemented to process an individual sample. Consequently, the results of primary RNA-seq data analyses have to be further processed, integrated, and summarized for reporting, presentation, and downstream analysis. Usually, data integration and summarization are tedious and not easy to execute efficiently.
Identifying RNA-seq sample outliers. It is not uncommon that some samples have low quality and often substitute samples are not available, especially for RNA-seq of clinical specimen. RNA-seq is a complicated multistep process that involves sample collection/stabilization, RNA extraction, fragmentation, cDNA synthesis, adapter ligation, amplification, purification, and sequencing. Any mistake in this complex sequence of protocols can result in biased or even unusable data. Therefore, it is necessary to establish stringent RNA-seq data quality metrics to identify outliers that should be excluded from further downstream data analysis.
Detecting sample swapping and mislabeling. For large-scale RNA-seq studies in which hundreds or even thousands of RNA samples are sequenced and analyzed, it is not unusual that some samples are mishandled and appear to be swapped or sequenced more than once. Such errors can become a serious problem for downstream data analyses and interpretation of results, especially for longitudinal sample analyses. It is difficult to identify such mistakes based only on RNA-seq QC metrics and/or gene expression profiles. To confirm whether samples are from the same subject, it is more reliable to compare genetic markers among samples, such as single nucleotide polymorphisms (SNPs).
Sharing the results of RNA-seq data analyses with experimental scientists. Nearly all RNA-seq data analyses are performed using Linux clusters or workstations; however, analyses results in Linux are often inaccessible to most experimental scientists. RNA-seq data analyses typically generate a large number of files and large amounts of data that are difficult to comprehend or digest directly by experimental scientists. Therefore, easily accessible interfaces are needed that not only provide a quick and easy way for non-expert users to obtain high-level visualizations of the main RNA-seq analyses outputs (e.g., QC results), but also allow them to drill down further or export the results into additional analysis applications of their choice. To the best of our knowledge, very few RNA-seq related open source packages provide all these options.
To address these challenges, we have implemented a new pipeline named QuickRNASeq to advance the automation and visualization of RNA-seq data analyses results, and have constantly improved and refined its implementation since its inception. QuickRNASeq significantly reduces data analysts’ hands-on time, which results in a substantial decrease in the time and effort needed for the primary analyses of RNA-seq data before proceeding to further downstream analysis and interpretation. Additionally, QuickRNASeq provides a dynamic data sharing and interactive visualization environment for end users. All the results are accessible from a web browser without the need to set up a web server and database. The rich visualization features implemented in QuickRNASeq enable non-expert end users to interact easily with the RNA-seq data analyses results, and to drill down into specific aspects to gain insights into often complex datasets simply through a point-and-click approach.
In addition to raw sequence reads in FASTQ format, the only other required inputs are a reference genome file in FASTA format and a corresponding gene annotation file in GTF (gene transfer format). QuickRNASeq can be applied to any species as long as its genome and gene annotations are available; for example, human, mouse, rat, and cynomolgus or rhesus monkeys. A gene annotation file can exist in many formats, but GTF has become the de facto standard; however, not all tools accept gene annotation files in GTF format as input. For example, RSeQC (RNA-seq quality control package)  accepts gene annotation only in BED (browser extensible display) format, though the majority of gene annotations in the public domain are not available in BED format. To ensure that the exact same annotations are used by the different components in QuickRNASeq, we wrote Perl scripts to convert gene annotation files from GTF to BED format. This avoids any discrepancy or inconsistency among gene annotations that are available in different formats.
Step #1: single sample processing
This step consists mainly of read mapping, counting, aligned read QC, and SNP calling, and the corresponding algorithms used to perform these tasks are STAR [11, 27], featureCounts , RSeQC , and VarScan  respectively. STAR aligns spliced sequences of any length with moderate error rates, provides scalability for emerging sequencing technologies, and generates output files ready for transcript/gene expression quantification . The algorithms featureCounts  and HTSeq  are comparable in terms of counting results, but featureCounts is considerably faster than HTSeq by an order of magnitude for gene-level summarization and requires far less computer memory. Read mapping and counting typically are very time consuming, and we chose STAR and featureCounts in QuickRNASeq mainly because of their high speed and accuracy.
The RSeQC  package provides a number of modules that can comprehensively inspect sequence quality, nucleotide composition bias, PCR bias, GC bias, mapped reads distribution, coverage uniformity, and strand specificity. All such QC metrics are valuable for outlier detection. VarScan  is a platform-independent software tool that can detect variants in RNA-seq data. It employs a robust heuristic/statistic approach to call variants that meet desired thresholds for read depth, base quality, variant allele frequency, and statistical significance. To verify samples from the same subject, it is unnecessary to call SNPs across all chromosomes. In practice, it is sufficient to use only SNPs from the chromosome that contains the major histocompatibility complex (MHC) genes. For human, mouse, and rat, these are chromosomes 6, 17, and 20, respectively. As mentioned earlier, numerous software packages that can perform similar tasks are freely available; however, we found that the combination of STAR, featureCounts, RSeQC, and VarScan represents one of the best toolsets.
Computational algorithms for RNA-seq analyses are continuously being improved, including STAR, featureCounts, RSeQC, and VarScan. Therefore, we designed our pipeline to be independent of its underlying software version and ensured that it can handle RNA-seq samples from a variety of species. To decouple the dependence of QuickRNASeq pipeline upon underlying computational algorithms and species, we introduced a plain text configuration file that can store project, species, and software-specific parameters. This configuration file also improves the reproducibility of RNA-seq data analyses and simplifies the command lines in QuickRNASeq. For the convenience of QuickRNASeq users, a configuration file template has been provided for easy customization.
Step #2: data integration, QC, and summary
Merge mapping, counting summaries, and RSeQC metrics from individual samples.
Generate a read counting table ready for downstream analysis of all annotated genes.
Calculate a SNP (and gene expression) correlation matrix among samples.
Perform correlation-based sample QC, calculation of MADScore (median absolute deviation score), and data normalization.
Produce RNA-seq metrics and correlation plots ready for PowerPoint presentations.
Generate a comprehensive HTML QC report for individual sample.
Produce a dynamic and integrated QC metrics plot for individual samples.
Generate a master HTML entry webpage for data analyses results.
Description of main scripts in the QuickRNASeq package
Master script for Step #1 in Fig. 1
Same as star-fc-qc.sh, but implemented for a standalone workstation
Master script for Step #2 in Fig. 1
Merge STAR mapping summary
Merge featureCounts counting summary
Merge read distribution from RSeQC
Calculate all-against-all pairwise SNP correlations
Merge counts table from individual samples
Perform correlation-based QC, and calculate normalization factor
Plot the summaries for read mapping, counting, or read distribution
Plot a correlations matrix
Plot the number of genes with varying RPKM cut-offs
Generate a HTML QC report for individual sample
Generate a comprehensive, integrated, and interactive project report
Utility to extract gene annotation from a GTF file
Utility to convert a gene annotation from GTF to BED format
Template configuration file for customization
We implemented a correlation-based QC to detect potential outliers in the RNA-seq data by calculating a MADScore for each sample. In general, an outlier appears to deviate markedly from other samples in a RNA-seq study, and thus its correlation with other samples will be relatively low. The MADScore is calculated as follows. For each sample, calculate the correlation difference, which is simply the difference between the average of all the pairwise correlations that involve the sample and the average of all the pairwise correlations that do not involve the sample. If a sample is an outlier, then the difference will be negative. Accordingly, there will be a vector of values (one for each sample). Then this vector of difference is converted to MADScores (robust Z-scores) by subtracting the medians and dividing it by median absolute deviations (MAD). A standard MADScore cutoff (e.g., −5) is set to determine the outliers.
Step #3: interactive data visualization
Results and discussion
Test run of QuickRNASeq on a publicly available dataset
Annotation and mapping summary for the 48 samples used in the QuickRNASeq test run
All analyses results accessible from a single entry webpage
SNP correlation to detect mishandled samples
Integrated QC metrics for individual sample
Two strategies are used to determine the read duplication rate, as indicated in Fig. 5a. For the sequence-based strategy, reads with exactly the same sequence content are regarded as duplicated reads, whereas, for the mapping-based strategy, reads mapped to the same genomic location are regarded as duplicated reads. For spliced reads, reads mapped to the same starting position that splice the same way are regarded as duplicated reads. SRR603068 is a brain sample, and its nucleotide composition is biased towards A/T, as indicated in Fig. 5c. For RNA-seq data, we often want to know whether the sequencing depth is enough for the analyses, and the saturation plot shown in Fig. 5d is very valuable for this. For a well annotated organism, the number of expressed genes in a particular tissue is almost fixed so the number of splice junctions is also fixed. These numbers should be rediscovered from saturated RNA-seq data. The plot in Fig. 5d indicates that more reads should be sequenced for performing alternative splicing analyses. In Fig. 5f, all multiple splicing events spanning the same intron have been consolidated into one splicing junction, and a novel junction is considered as complete_novel if neither of the two splice sites can be annotated by a gene model. Otherwise, it is partial_novel, meaning that one of the splice site (5′SS or 3′SS) is new, while the other splice site is annotated (known). While the majority of junctions in Fig. 5f are annotated, over 20 % are either complete_novel or partial_novel.
Interactive visualization of gene expression profiles
Scalability of QuickRNASeq
Limitations and running of QuickRNASeq
QuickRNASeq is presumed to be executed in a HPC environment, which can process multiple samples in parallel. The out-of-the-box QuickRNASeq pipeline has been fully tested in a HPC computing environment using the IBM Platform’s Load Sharing Facility (LSF) , a powerful workload management platform for demanding, distributed HPC environments. The IBM Platform’s LSF provides a comprehensive set of intelligent, policy-driven scheduling features that enable users to utilize all the computing infrastructure resources and ensure optimal application performance. In addition to LSF, many other notable job scheduling software are available . For a cluster that uses a job scheduler other than LSF, star-fc-qc.sh (implementation of Step #1 in Fig. 1) needs to be customized accordingly. The only required change in the script is the way of job submission, and this command is dependent the job scheduling software. For researchers with no access to a HPC computing environment, we implemented star-fc-qc.ws.sh, a customized script that runs on a standard Linux workstation. Of course, analyzing large RNA-seq datasets on a single workstation is not typical and not recommended.
For gene quantifications, QuickRNAseq requires a complete genome sequence and well-annotated genes as inputs. The pipeline is not intended for the discovery of novel isoforms. QuickRNASeq is designed for use by bioinformaticians, experimental biologists, and geneticists in the fields of genome-scale analysis, functional genomics, and systems biology; however, downloading, installing, and running the QuickRNASeq pipeline in a Linux environment will require some basic computer-based expertise. A README.txt is provided along with the QuickRNASeq package, which explains step-by-step how to run QuickRNASeq. In addition, users can examine the configuration and sample annotation file under the test_run folder in the QuickRNASeq package. QuickRNASeq can be run without a sample annotation file, but it is strongly recommended that users provide meaningful annotations for all samples. A proper annotation file should be tab delimited, and QuickRNASeq requires that the first and second columns correspond to sample and subject identifiers, respectively. Sample names should start with a letter, and should not contain any white spaces.
In QuickRNASeq, we selected FeatureCounts, a union exon based approach, for gene quantification. According to our own most recent research , union exon based approach is discouraged. Unfortunately, there is still a long way to go for the switch from union exon based approach to transcript-based method in estimation of gene expression levels because of the inaccuracy of isoform quantification , especially for those isoforms with low expression, and gene-based annotation databases. Traditionally, functional enrichment analyses rely upon annotation databases such as Gene Ontology (GO) , Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways  and other commercial knowledge systems. All such annotations have been recorded and centered on genes, not transcripts or isoforms. In practical RNA-seq data analyses, the switch from gene to isoform in quantification should ideally go with the switch in annotation hand by hand.
The current version of QuickRNASeq focuses on the automation of primary processing steps in RNA-seq data analyses, and these steps are in general biological question independent. We plan to expand QuickRNASeq to downstream analyses in the future, including differential analysis and pathway enrichment. Downstream analyses are usually driven by biological questions and experimental designs and thus different from project to project. How to automate such analyses in a user friendly manner remains a challenge for our practical implementation.
QuickRNASeq versus QuickNGS
Comparison of QuickRNASeq with QuickNGS
Scope and application
Next-generation sequencing: WGS, RNA-seq, miRNA-seq, Chip-seq
Requires external MySQL database and web server support
Purpose of web interface
Track the progress of data analysis and provide access to result files
Provide access to analyses results and interactive visualization
Interactive, very rich and dynamic interface built upon web 2.0 technology
Limited. Reduction of the hands-on time
“ONE-STOP” integrated report. Particularly implemented to support large-scale RNA-seq. High level of automation and efficiency
By combing the best open source tool sets developed for RNA-seq data analyses and the most advanced web 2.0 technologies, we implemented the QuickRNASeq pipeline, which significantly reduces the efforts involved in primary RNA-seq data analyses and generates an integrated project report for data sharing and interactive visualization. The dynamic visualization features enable end users to explore and digest RNA-seq data analyses results intuitively and interactively, and to gain deep insights into RNA-seq datasets. The configuration file contains project, species, and software related parameters, and thus improves the reproducibly in RNA-seq data analyses. We have already applied QuickRNASeq to in-house large scale RNA-seq projects, and its current version is stable and mature for public release and adoption.
Availability of software and supporting Information
Project name: QuickRNASeq pipeline
Project home page: http://quickrnaseq.sourceforge.net
Operating system: Linux
Dependencies: R packages edgeR, reshape2 and ggplot2
Other requirements: None
License: GNU GPL version 3
The authors thank Alexander Dobin for valuable assistance with running STAR, Shi Wei for assistance with using featureCounts, and Liguo Wang for his help with RSeQC. The development of QuickRNASeq was motivated partially by large scale in-house RNA sequencing projects in which thousands of samples were sequenced and analyzed. We thank Karen Page and Mina Hassan-Zahraee for their support and collaboration.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Wang Z, Gerstein M, Snyder M. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.PubMedPubMed CentralView ArticleGoogle Scholar
- Mutz KO, Heilkenbrinker A, Lönne M, Walter JG, Stahl F. Transcriptome analysis using next-generation sequencing. Curr Opin Biotechnol. 2013;24(1):22–30.PubMedView ArticleGoogle Scholar
- Mantione KJ, Kream RM, Kuzelova H, Ptacek R, Raboch J, Samuel JM, et al. Comparing bioinformatic gene expression profiling methods: microarray and RNA-Seq. Med Sci Monit Basic Res. 2014;20:138–42.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhao S, Fung-Leung W-P, Bittner A, Ngo K, Liu X. Comparison of RNA-seq and microarray in transcriptome profiling of activated T cells. PLoS ONE. 2014;9(1):e78644.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhao S, Zhang Y, Gordon W, Quan J, Xi H, Du S, et al. Comparison of stranded and non-stranded RNA-seq transcriptome profiling and investigation of gene overlap. BMC Genomics. 2015;16:487.View ArticleGoogle Scholar
- Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011;8(6):469–77.PubMedView ArticleGoogle Scholar
- Capobianco E. RNA-Seq data: a complexity journey. Comput Struct Biotechnol J. 2014;11(19):123–30.PubMedPubMed CentralView ArticleGoogle Scholar
- Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, Rätsch G, et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods. 2013;10(12):1185–91.Google Scholar
- Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013;14:91.PubMedPubMed CentralView ArticleGoogle Scholar
- Borozan I, Watt SN, Ferretti V. Evaluation of alignment algorithms for discovery and identification of pathogens using RNA-seq. PLoS ONE. 2013;8(10):e76935.PubMedPubMed CentralView ArticleGoogle Scholar
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21.PubMedPubMed CentralView ArticleGoogle Scholar
- Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26(7):873–81.PubMedPubMed CentralView ArticleGoogle Scholar
- Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 2010;38(18):e178.PubMedPubMed CentralView ArticleGoogle Scholar
- Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhao S. Assessment of the impact of using a reference transcriptome in mapping short RNA-seq reads. PLoS ONE. 2014;9(7):e101374.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhao S, Zhang B. A comprehensive evaluation of Ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics. 2015;16:97.PubMedPubMed CentralView ArticleGoogle Scholar
- Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, Zavolan M. Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. Genome Biol. 2015;16:150.PubMedPubMed CentralView ArticleGoogle Scholar
- Angelini C, De Canditiis D, De Feis I. Computational approaches for isoform detection and estimation: good and bad news. BMC Bioinformatics. 2014;15:135.PubMedPubMed CentralView ArticleGoogle Scholar
- Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNA-Seq gene expression estimation with read mapping inaccuracy. Bioinformatics. 2009;26:493–500.PubMedPubMed CentralView ArticleGoogle Scholar
- Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.PubMedPubMed CentralView ArticleGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Nicolae M, Mangul S, Măndoiu II, Zelikovsky A. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms Mol Biol. 2011;6:9.PubMedPubMed CentralView ArticleGoogle Scholar
- Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–30.PubMedView ArticleGoogle Scholar
- Anders S, Theodor P, Huber W. HTSeq — a Python framework to work with high-throughput sequencing data. Bioinformatics. 2014;31(2):166–9.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhao S, Zhang B. Union exon based approach for RNA-seq gene quantification: to be or not to be? PLoS ONE. 2015;10(11):e0141910.PubMedPubMed CentralView ArticleGoogle Scholar
- Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012;28(16):2184–5.PubMedView ArticleGoogle Scholar
- Dobin A, Gingeras TR. Mapping RNA-seq reads with STAR. Curr Protoc Bioinformatics. 2015;51:11.14.1–11.14.19.View ArticleGoogle Scholar
- Koboldt D, Zhang Q, Larson D, Shen D, McLellan M, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22:568–76.PubMedPubMed CentralView ArticleGoogle Scholar
- jQuery. https://jquery.com. Accessed 15 November 2015.
- Data-Driven Documents. http://d3js.org. Accessed 15 November 2015.
- canvasXpress. http://canvasxpress.org. Accessed 15 November 2015.
- SlickGrid. https://github.com/mleibman/SlickGrid. Accessed 15 November 2015.
- Gehlenborg N, Noble MS, Getz G, Chin L, Park PJ. Nozzle: a report generation toolkit for data analysis pipelines. Bioinformatics. 2013;29:1089–91.PubMedPubMed CentralView ArticleGoogle Scholar
- Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: the reference human genome annotation for the ENCODE Project. Genome Res. 2012;22(9):1760–74.PubMedPubMed CentralView ArticleGoogle Scholar
- GENCODE. http://www.gencodegenes.org/releases/19.html. Accessed 15 November 2015.
- Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2014. Nucleic Acids Res. 2014;42(Database issue):D749–55.PubMedPubMed CentralView ArticleGoogle Scholar
- The ENCODE Project. http://www.genome.gov/encode/. Accessed 15 November 2015.
- 1000 Genomes. http://www.1000genomes.org/. Accessed 15 November 2015.
- GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348(6235):648–60.PubMed CentralView ArticleGoogle Scholar
- Melé M, Ferreira PG, Reverter F, DeLuca DS, Monlong J, Sammeth M, et al. Human genomics. The human transcriptome across tissues and individuals. Science. 2015;348(6235):660–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Leys C, Ley C, Klein O, Bernard P, Licata L. Detecting outliers: do not use standard deviation around the mean, use absolute deviation around the median. J Exp Soc Psycho. 2013;49(4):764–6.View ArticleGoogle Scholar
- Pako. https://github.com/nodeca/pako. Accessed 15 November 2015.
- IBM Platform LSF. http://www.ibm.com/systems/platformcomputing/products/lsf. Accessed 15 November 2015.
- Job scheduler software. https://en.wikipedia.org/wiki/List_of_job_scheduler_software. Accessed 15 November 2015.
- The Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucl Acids Res. 2015;43(Database issue):D1049–56.PubMed CentralView ArticleGoogle Scholar
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004;32(Database issue):D277–80.PubMedPubMed CentralView ArticleGoogle Scholar
- Wagle P, Nikolić M, Frommolt P. QuickNGS elevates Next-Generation Sequencing data analysis to a new level of automation. BMC Genomics. 2015;16:487.PubMedPubMed CentralView ArticleGoogle Scholar