TIGAR2: sensitive and accurate estimation of transcript isoform expression with longer RNA-Seq reads
© Nariai et al.; licensee BioMed Central Ltd. 2014
Published: 12 December 2014
High-throughput RNA sequencing (RNA-Seq) enables quantification and identification of transcripts at single-base resolution. Recently, longer sequence reads become available thanks to the development of new types of sequencing technologies as well as improvements in chemical reagents for the Next Generation Sequencers. Although several computational methods have been proposed for quantifying gene expression levels from RNA-Seq data, they are not sufficiently optimized for longer reads (e.g. > 250 bp).
We propose TIGAR2, a statistical method for quantifying transcript isoforms from fixed and variable length RNA-Seq data. Our method models substitution, deletion, and insertion errors of sequencers based on gapped-alignments of reads to the reference cDNA sequences so that sensitive read-aligners such as Bowtie2 and BWA-MEM are effectively incorporated in our pipeline. Also, a heuristic algorithm is implemented in variational Bayesian inference for faster computation. We apply TIGAR2 to both simulation data and real data of human samples and evaluate performance of transcript quantification with TIGAR2 in comparison to existing methods.
TIGAR2 is a sensitive and accurate tool for quantifying transcript isoform abundances from RNA-Seq data. Our method performs better than existing methods for the fixed-length reads (100 bp, 250 bp, 500 bp, and 1000 bp of both single-end and paired-end) and variable-length reads, especially for reads longer than 250 bp.
Massively parallel sequencing of cDNA libraries constructed from RNA samples (RNA-Seq) has become a popular choice for quantifying gene expression levels of transcript isoforms . Advantages of RNA-Seq over conventional microarray technologies include its larger dynamic range for quantification and capacity of identifying novel isoforms at one nucleotide resolution without the need for designing cDNA probes. A typical RNA-Seq data analysis workflow consists of two components: aligning sequenced reads to the reference cDNA sequences, and quantifying transcript isoform abundances based on the number of mapped reads on the reference sequences. In measuring gene expression levels, FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is calculated under the assumption that a relative expression level of an isoform is proportional to the number of cDNA fragments that originate from it .
Since reads are typically 50-300 bp paired-end for Illumina sequencers, in many cases, they can be aligned to more than one isoform and/or locations on the reference sequences. One of challenges for accurate estimation of gene expression is to handle such multi-mapped reads . Several approaches have been proposed to model uncertainty of read mappings in a probabilistic framework, and it has been shown that the statistical inference of read mapping is effective for more accurate estimation of gene expression levels [4, 5]. Although rigorous simulation analyses with various conditions (such as 35 bp vs. 70 bp, and single-end vs. paired-end data) have been performed with several tools in the literature , cases for longer reads, such as 250 bp or longer that can be produced from the latest Illumina MiSeq sequencer, have not been extensively studied so far. Moreover, there are few methods suitable for RNA-Seq data produced from new types of sequencers, such as the Ion Torrent PGM sequencer, which generate variable-length reads with relatively higher error rate of substitutions, deletions, and insertions [7, 8].
In this paper, we present a statistical method, TIGAR2, which implements new features for improving sensitivity and accuracy of quantification of isoform expression levels from RNA-Seq data by extending the originally developed method . First, for achieving maximum sensitivity for mapping longer reads to reference sequences, TIGAR2 can handle aligned reads from BWA-MEM , as well as other widely used alignment tools such as Bowtie2 . Sequencing errors (substitutions, deletions and insertions) within reads that can be inferred from the gapped alignments of reads to reference sequences are modelled under a probabilistic framework in TIGAR2. Second, in order to speed up the variational Bayesian inference in TIGAR2, a new algorithm is implemented so that only reads that can influence isoform abundance parameters in the next iteration are detected and considered in the following update equations.
In order to evaluate quantification performance with TIGAR2, we prepare simulation data that emulates Illumina fixed-length reads (both single-end and paired-end) and Ion Torrent variable-length reads data. For simulating the variable-length reads, a variable read length distribution is empirically estimated from the actual RNA-Seq data by non-parametric regression with Gaussian kernels as basis functions in our analysis. We also apply TIGAR2 to real data of human cell line samples and evaluate consistency of estimated gene expression levels among technical replicates.
Alignment of reads to reference sequences
Reference cDNA sequences in the FASTA format of model organisms are either available from the RefSeq database , or can be generated from the whole genome reference sequence and a gene annotation file (GTF format) with a tool called "gffread", which is included in the Cufflinks package . For cases of non-model organisms, de novo transcriptome assembly might be considered , and then the resulting contigs can be used as reference sequences. Given a set of cDNA sequences in FASTA format, the FM-index for the following alignment step is constructed with the corresponding alignment tool. Then, gapped-alignments of reads to the reference sequences are generated with Bowtie2 or BWA-MEM with allowing multiple mappings of reads to the reference cDNA sequences.
Generative model of RNA-Seq data
where is a hyperparameter, is a constant, is the number of transcript isoforms, and . Here, represents the noise isoform abundance (reads that are not generated from any known isoform are assigned).
Where is the length of transcript isoform , and is the global fragment size distribution. We construct based on the normal distribution with mean μ F and standard deviation σ F , which can be either specified according to experimental protocols, or can be estimated from the primary alignments of reads for the case of paired-end data. represents the probability of the start position of the first pair of read given the transcript isoform choice and fragment size, and calculate if mRNAs have poly(A) tails, and if mRNAs do not have poly(A) tails. represents the probability of the orientation of read given the transcript isoform choice. For a strand specific protocol, it can be set as and . Otherwise, it can be automatically estimated from the primary alignment of reads from the RNA-Seq data. represents the probability of the alignment state of read given the transcript isoform choice, fragment size, start position, and orientation of read . The transition probability of the alignment state is calculated as described previously .
where emit is the emission probability of nucleotide characters of the first pair of read , is the nucleotide character, is the base call quality score, is the nucleotide character of the corresponding reference sequence, is the alignment state of the first pair of read at position . emit is similarly calculated as for the first pair of the read.
Modelling of variable read length distribution
where is the maximum read length of the read data.
Estimation of transcript isoform abundances
In our variational Bayesian inference approach, latent variables (true alignments of reads) as well as model parameters (transcript isoform abundances) are estimated as the posterior distribution. We use the Dirichlet distribution for the prior distribution
with a single hyperparameter For , the prior favors solutions in which some of isoforms have zero abundance. Hence, controls the complexity of model parameters (the number of possible transcript isoforms). A hyperparameter is selected as a maximizer of the lower bound of the marginal log likelihood of the observed data. Here, we consider = 0.001, 0.01, 0.1, or 1.0. Each iteration step of the variational approximation updates posterior distribution until a convergence criterion is satisfied. In the VBE step, the expected number of reads that are mapped to the transcript isoform t is obtained by . In the VBM step, the expected abundance of transcript isoform t is obtained by , where . Details of these update equations and calculation of the lower bound of the marginal likelihood are described in . Recently, it has been shown that the variational inference described here is accurate in estimating the mean of posterior transcript expression, but not the variance .
For each t, set theta_converged [t] to true if did not change from the previous step, and false otherwise.
For each n, if theta_converged [t] is true for all , then set to false, and true otherwise.
Then, in the VBE step, E z [Z nt = 1] is computed where is true. The algorithm heuristically eliminates unnecessary calculations of drastically in the later part of iterations, in which most of are already converged and only a fraction of reads should be considered for calculating the update equations.
Results and discussion
Simulation data analysis
Real data analysis
Computational resources required for the real data analysis
CPU time in alignment (minutes)
CPU time in estimation (minutes)
To see the scalability of TIGAR2 for a large dataset, it is applied to 100 million synthetic reads (100 bp single-end). It required 16 GB memory and 2,621 minutes of CPU time.
All the experiments were performed on an Intel Xeon CPU E5-2670 processor (2.60GHz) with the Red Hat Enterprise Linux Server release 6.2.
We have developed a computational method, named TIGAR2, which is accurate and sensitive in quantifying gene expression levels of transcript isoforms from RNA-Seq data. TIGAR2 outperformed existing methods with simulation data of both single-end and paired-end reads (100 bp, 250 bp, 500 bp and 1000 bp), especially for reads > 250 bp. TIGAR2 will be more effective for accurate detection and quantification of transcript isoforms compared to other existing methods, as new technologies for longer sequencing become available.
Instead of trying to find novel transcript isoforms from RNA-Seq data, reference cDNA sequences of transcript isoforms are assumed to be known in the TIGAR2 pipeline. Although there are a couple of algorithms to predict novel transcript isoforms or fusion genes [2, 14, 21], TIGAR2 does not provide the novel predictions at the moment. However, once candidates of novel transcript isoforms are predicted by external tools, they can be treated as known and gene expression levels of these novel isoforms can be quantified and assessed with TIGAR2. Another possible extension of TIGAR2 includes modelling of underlying genomic variation for identifying allele-specific gene expression. Because the cost of whole genome-sequencing is dropping sharply, it is becoming feasible to use both genomic information as well as gene expression data. Finally, there should be an optimal balance between the maximum number of allowed alignments per read and the convergence speed. These topics will be investigated as our future works.
Availability of supporting data
The implementation of TIGAR2 and the documentation is available in the GitHub repository, https://github.com/nariai/tigar2.
This work was supported (in part) by MEXT Tohoku Medical Megabank Project and CREST. All computational resources were provided by the Supercomputing services, Tohoku Medical Megabank Organization, Tohoku University.
The publication costs for this article were partly funded by MEXT Tohoku Medical Megabank Project and CREST.
This article has been published as part of BMC Genomics Volume 15 Supplement 10, 2014: Proceedings of the 25th International Conference on Genome Informatics (GIW/ISCB-Asia): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S10.
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews Genetics. 2009, 10 (1): 57-63. 10.1038/nrg2484.PubMedPubMed CentralView ArticleGoogle Scholar
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols. 2012, 7 (3): 562-578. 10.1038/nprot.2012.016.PubMedPubMed CentralView ArticleGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.PubMedView ArticleGoogle Scholar
- Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN: RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics (Oxford, England). 2010, 26 (4): 493-500. 10.1093/bioinformatics/btp692.View ArticleGoogle Scholar
- Nariai N, Hirose O, Kojima K, Nagasaki M: TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference. Bioinformatics (Oxford, England). 2013, 29: (18):2292-2299.View ArticleGoogle Scholar
- Li B, Dewey CN: RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC bioinformatics. 2011, 12: 323-10.1186/1471-2105-12-323.PubMedPubMed CentralView ArticleGoogle Scholar
- Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y: A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC genomics. 2012, 13: 341-10.1186/1471-2164-13-341.PubMedPubMed CentralView ArticleGoogle Scholar
- Bragg LM, Stone G, Butler MK, Hugenholtz P, Tyson GW: Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS computational biology. 2013, 9 (4): e1003031-10.1371/journal.pcbi.1003031.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England). 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.View ArticleGoogle Scholar
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nature methods. 2012, 9 (4): 357-359. 10.1038/nmeth.1923.PubMedPubMed CentralView ArticleGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research. 2007, 35 (Database): D61-65. 10.1093/nar/gkl842.PubMedPubMed CentralView ArticleGoogle Scholar
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature biotechnology. 2011, 29 (7): 644-652. 10.1038/nbt.1883.PubMedPubMed CentralView ArticleGoogle Scholar
- Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol MJ, Gnirke A, Nusbaum C, et al: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature biotechnology. 2010, 28 (5): 503-510. 10.1038/nbt.1633.PubMedPubMed CentralView ArticleGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology. 2010, 28 (5): 511-515. 10.1038/nbt.1621.PubMedPubMed CentralView ArticleGoogle Scholar
- Papastamoulis P, Hensman J, Glaus P, Rattray M: Improved variational Bayes inference for transcript expression estimation. Statistical applications in genetics and molecular biology. 2014, 1-14.Google Scholar
- Glaus P, Honkela A, Rattray M: Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics (Oxford, England). 2012, 28 (13): 1721-1728. 10.1093/bioinformatics/bts260.View ArticleGoogle Scholar
- Hensman J, Glaus P, Honkela A, Rattray M: Fast Approximate Inference of Transcript Expression Levels from RNA-seq Data. ArXiv e-prints. 2013, 1308: 5953-Google Scholar
- Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, Leamon JH, Johnson K, Milgrew MJ, Edwards M, et al: An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011, 475 (7356): 348-352. 10.1038/nature10242.PubMedView ArticleGoogle Scholar
- Thorvaldsdottir H, Robinson JT, Mesirov JP: Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in bioinformatics. 2013, 14 (2): 178-192. 10.1093/bib/bbs017.PubMedPubMed CentralView ArticleGoogle Scholar
- Wang B, Heath-Engel H, Zhang D, Nguyen N, Thomas DY, Hanrahan JW, Shore GC: BAP31 interacts with Sec61 translocons and promotes retrotranslocation of CFTRDeltaF508 via the derlin-1 complex. Cell. 2008, 133 (6): 1080-1092. 10.1016/j.cell.2008.04.042.PubMedView ArticleGoogle Scholar
- Kinsella M, Harismendy O, Nakano M, Frazer KA, Bafna V: Sensitive gene fusion detection using ambiguously mapping RNA-Seq read pairs. Bioinformatics (Oxford, England). 2011, 27 (8): 1068-1075. 10.1093/bioinformatics/btr085.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.