 Research
 Open Access
 Published:
TIGAR2: sensitive and accurate estimation of transcript isoform expression with longer RNASeq reads
BMC Genomics volume 15, Article number: S5 (2014)
Abstract
Background
Highthroughput RNA sequencing (RNASeq) enables quantification and identification of transcripts at singlebase resolution. Recently, longer sequence reads become available thanks to the development of new types of sequencing technologies as well as improvements in chemical reagents for the Next Generation Sequencers. Although several computational methods have been proposed for quantifying gene expression levels from RNASeq data, they are not sufficiently optimized for longer reads (e.g. > 250 bp).
Results
We propose TIGAR2, a statistical method for quantifying transcript isoforms from fixed and variable length RNASeq data. Our method models substitution, deletion, and insertion errors of sequencers based on gappedalignments of reads to the reference cDNA sequences so that sensitive readaligners such as Bowtie2 and BWAMEM are effectively incorporated in our pipeline. Also, a heuristic algorithm is implemented in variational Bayesian inference for faster computation. We apply TIGAR2 to both simulation data and real data of human samples and evaluate performance of transcript quantification with TIGAR2 in comparison to existing methods.
Conclusions
TIGAR2 is a sensitive and accurate tool for quantifying transcript isoform abundances from RNASeq data. Our method performs better than existing methods for the fixedlength reads (100 bp, 250 bp, 500 bp, and 1000 bp of both singleend and pairedend) and variablelength reads, especially for reads longer than 250 bp.
Background
Massively parallel sequencing of cDNA libraries constructed from RNA samples (RNASeq) has become a popular choice for quantifying gene expression levels of transcript isoforms [1]. Advantages of RNASeq over conventional microarray technologies include its larger dynamic range for quantification and capacity of identifying novel isoforms at one nucleotide resolution without the need for designing cDNA probes. A typical RNASeq data analysis workflow consists of two components: aligning sequenced reads to the reference cDNA sequences, and quantifying transcript isoform abundances based on the number of mapped reads on the reference sequences. In measuring gene expression levels, FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is calculated under the assumption that a relative expression level of an isoform is proportional to the number of cDNA fragments that originate from it [2].
Since reads are typically 50300 bp pairedend for Illumina sequencers, in many cases, they can be aligned to more than one isoform and/or locations on the reference sequences. One of challenges for accurate estimation of gene expression is to handle such multimapped reads [3]. Several approaches have been proposed to model uncertainty of read mappings in a probabilistic framework, and it has been shown that the statistical inference of read mapping is effective for more accurate estimation of gene expression levels [4, 5]. Although rigorous simulation analyses with various conditions (such as 35 bp vs. 70 bp, and singleend vs. pairedend data) have been performed with several tools in the literature [6], cases for longer reads, such as 250 bp or longer that can be produced from the latest Illumina MiSeq sequencer, have not been extensively studied so far. Moreover, there are few methods suitable for RNASeq data produced from new types of sequencers, such as the Ion Torrent PGM sequencer, which generate variablelength reads with relatively higher error rate of substitutions, deletions, and insertions [7, 8].
In this paper, we present a statistical method, TIGAR2, which implements new features for improving sensitivity and accuracy of quantification of isoform expression levels from RNASeq data by extending the originally developed method [5]. First, for achieving maximum sensitivity for mapping longer reads to reference sequences, TIGAR2 can handle aligned reads from BWAMEM [9], as well as other widely used alignment tools such as Bowtie2 [10]. Sequencing errors (substitutions, deletions and insertions) within reads that can be inferred from the gapped alignments of reads to reference sequences are modelled under a probabilistic framework in TIGAR2. Second, in order to speed up the variational Bayesian inference in TIGAR2, a new algorithm is implemented so that only reads that can influence isoform abundance parameters in the next iteration are detected and considered in the following update equations.
In order to evaluate quantification performance with TIGAR2, we prepare simulation data that emulates Illumina fixedlength reads (both singleend and pairedend) and Ion Torrent variablelength reads data. For simulating the variablelength reads, a variable read length distribution is empirically estimated from the actual RNASeq data by nonparametric regression with Gaussian kernels as basis functions in our analysis. We also apply TIGAR2 to real data of human cell line samples and evaluate consistency of estimated gene expression levels among technical replicates.
Methods
A pipeline of running TIGAR2 consists of two steps: alignment of reads to reference sequences, and estimation of transcript isoform abundances based on the alignment result (Figure 1). Since the first part of the pipeline uses external alignment tools for aligning reads to the reference sequences, it is recommended to run the whole pipeline in the UNIX environment. Details of each step are described in the following sections.
Alignment of reads to reference sequences
Reference cDNA sequences in the FASTA format of model organisms are either available from the RefSeq database [11], or can be generated from the whole genome reference sequence and a gene annotation file (GTF format) with a tool called "gffread", which is included in the Cufflinks package [2]. For cases of nonmodel organisms, de novo transcriptome assembly might be considered [12], and then the resulting contigs can be used as reference sequences. Given a set of cDNA sequences in FASTA format, the FMindex for the following alignment step is constructed with the corresponding alignment tool. Then, gappedalignments of reads to the reference sequences are generated with Bowtie2 or BWAMEM with allowing multiple mappings of reads to the reference cDNA sequences.
Generative model of RNASeq data
After the alignment is complete, TIGAR2 takes the resulting SAM/BAM and the FASTA files as input for transcript isoform abundance estimation. We use a generative model for RNASeq data as described in Figure 2, which is an extended version of the original model [5]. Here, $\mathit{\theta}$ is a model parameter that represents transcript isoform abundances, and ${Z}_{nt}$ is an indicator variable and it takes one if read $n$ is generated from transcript isoform $t$, and zero otherwise. ${R}_{n}^{1}$ and ${R}_{n}^{2}$ are the nucleotide sequence of the first and second pair of read n, respectively. Then, the joint probability of the model is decomposed as the product of conditional probabilities as follows:
P ($\mathit{\theta}$) is the prior distribution of the parameter and we assume the Dirichlet distribution:
where ${\alpha}_{t}>0$ is a hyperparameter, $C$ is a constant, $T$ is the number of transcript isoforms, and ${\text{\Sigma}}_{t=0}^{T}\phantom{\rule{2.77695pt}{0ex}}{\theta}_{t}=1$. Here, ${\theta}_{0}$ represents the noise isoform abundance (reads that are not generated from any known isoform are assigned).
$P\left({Z}_{nt}\mathsf{\text{}}\theta \right)$ is the conditional probability of ${Z}_{nt}$ given $\mathit{\theta}$ and we further decompose as follows:
where ${T}_{n},{F}_{n},{S}_{n},{O}_{n}$, ${A}_{n}^{1}$, and ${A}_{n}^{2}$ respectively represent the transcript isoform choice, fragment size, read start position, orientation, and alignment state of the first pair and second pair of read $n$. $P\left({T}_{n}\right\mathit{\theta})$ represents the probability of read n generated from transcript isoform ${T}_{n}$ given a parameter vector, and we compute $P({T}_{n}=t\mathit{\theta})={\theta}_{t}$. Compared to the original model of TIGAR [5], a fragment size variable is now included in the model. The conditional probability of observing ${F}_{n}={f}_{n}$ given ${T}_{n}={t}_{n}$ is calculated by truncated and normalized distribution [6, 13, 14]:
Where ${l}_{t}$ is the length of transcript isoform $t$, and ${d}_{F}\left(x\right)$ is the global fragment size distribution. We construct ${d}_{F}\left(x\right)$ based on the normal distribution with mean μ_{ F } and standard deviation σ_{ F }, which can be either specified according to experimental protocols, or can be estimated from the primary alignments of reads for the case of pairedend data. $P\left({S}_{n}\mathsf{\text{}}{T}_{n},{F}_{n}\right)$ represents the probability of the start position of the first pair of read $n$ given the transcript isoform choice and fragment size, and calculate $P\left({S}_{n}=s\mathsf{\text{}}{T}_{n}=t\right)=1/{f}_{t}$ if mRNAs have poly(A) tails, and $P\left({S}_{n}=s\mathsf{\text{}}{T}_{n}=t\right)=1/\left({f}_{t}L+1\right)$ if mRNAs do not have poly(A) tails. $P\left({O}_{n}\mathsf{\text{}}{T}_{n}\right)$ represents the probability of the orientation of read $n$ given the transcript isoform choice. For a strand specific protocol, it can be set as $P\left({O}_{n}=0\mathsf{\text{}}{T}_{n}=t\right)=1$ and $P\left({O}_{n}=0\mathsf{\text{}}{T}_{n}=t\right)=0$. Otherwise, it can be automatically estimated from the primary alignment of reads from the RNASeq data. $P\left({A}_{n}^{1},{A}_{n}^{2}\mathsf{\text{}}{T}_{n},{F}_{n},{S}_{n},{O}_{n}\right)$represents the probability of the alignment state of read $n$ given the transcript isoform choice, fragment size, start position, and orientation of read $n$. The transition probability of the alignment state is calculated as described previously [5].
Finally, $P\left({R}_{n}^{1},{R}_{n}^{2}\mathsf{\text{}}{Z}_{nt}=1\right)$ is the conditional probability of sequence of the first and second pair of read $n$ given ${Z}_{nt}=1$. We calculate this probability considering the observed read length as
where emit $\left({r}^{1}\left[x\right],{q}^{1}\left[x\right],{c}^{1}\left[x\right],{a}^{1}\left[x\right]\right)$ is the emission probability of nucleotide characters of the first pair of read $n$, ${r}^{1}\left[x\right]$ is the nucleotide character, ${q}^{1}\left[x\right]$ is the base call quality score, ${c}^{1}\left[x\right]$ is the nucleotide character of the corresponding reference sequence, ${a}^{1}\left[x\right]$ is the alignment state of the first pair of read $n$ at position $x$. emit $\left({r}^{2}\left[x\right],{q}^{2}\left[x\right],{c}^{2}\left[x\right],{a}^{2}\left[x\right]\right)$ is similarly calculated as for the first pair of the read.
Modelling of variable read length distribution
Some sequencers, such as Ion Torrent PGM, produce reads whose lengths are variable. In order to simulate such variable read length, we model the conditional probability of the read length given the fragment size, which is also calculated by the truncated distribution [4]
where length $\left({R}_{n}\right)$ is the observed length of read $n$, and ${d}_{R}\left(x\right)$ is the global read length distribution. Here, ${d}_{R}\left(x\right)$ can be constructed based on a linear combination of the smooth functions by fitting it to the data in a nonparametric manner with M equally spaced Gaussian kernels as basis functions. Let
where ${a}_{i}$ is the coefficient parameter, and ${m}_{i}(x)$ is the normal distribution with mean μ_{ i } and standard deviation σ. From the RNASeq data, observations of read lengths and their frequency, $({x}_{n},{y}_{n})$, are constructed, where ${x}_{n}$ is the read length, and is ${y}_{n}$ the frequency of ${x}_{n}$, and ${\text{\Sigma}}_{n=1}^{N}\phantom{\rule{2.77695pt}{0ex}}{y}_{n}=1$. Then, the least squares estimate (LSE) of the parameter vector $a={\left({a}_{1},\dots ,{a}_{M}\right)}^{T}$ is obtained by
Define a real value matrix ${\mathbf{\text{B}}}_{ij}={m}_{j}\left({x}_{i}\right)$. Then, the ordinary LSE is calculated by
Then, the global read length distribution ${d}_{R}\left(x\right)$ can be constructed from $g\left(x\right)$ as:
where $\text{max}\left(L\right)$ is the maximum read length of the read data.
An example of the estimated read length distribution from the real sequencing data of a human cell line (HeLa) sequenced by the Ion Torrent PGM sequencer (http://ioncommunity.lifetechnologies.com) is shown in Figure 3.
Estimation of transcript isoform abundances
In our variational Bayesian inference approach, latent variables (true alignments of reads) as well as model parameters (transcript isoform abundances) are estimated as the posterior distribution. We use the Dirichlet distribution for the prior distribution $\theta ~D\left({\alpha}_{0},\dots ,{\alpha}_{0}\right)$
with a single hyperparameter ${\alpha}_{0}>0.$ For ${\alpha}_{0}<1$, the prior favors solutions in which some of isoforms have zero abundance. Hence, ${\alpha}_{0}$ controls the complexity of model parameters (the number of possible transcript isoforms). A hyperparameter ${\alpha}_{0}$ is selected as a maximizer of the lower bound of the marginal log likelihood of the observed data. Here, we consider ${\alpha}_{0}$ = 0.001, 0.01, 0.1, or 1.0. Each iteration step of the variational approximation updates posterior distribution until a convergence criterion is satisfied. In the VBE step, the expected number of reads that are mapped to the transcript isoform t is obtained by $\hat{{r}_{t}}={\text{\Sigma}}_{n}{E}_{Z}\left[{Z}_{nt}=1\right]$. In the VBM step, the expected abundance of transcript isoform t is obtained by ${E}_{\theta}[{\widehat{\theta}}_{t}]={\widehat{\alpha}}_{t}/({\mathrm{\Sigma}}_{t},\widehat{{\alpha}_{t\prime}})$, where $\hat{{\alpha}_{t}}={\alpha}_{0}+\hat{{r}_{t}}$. Details of these update equations and calculation of the lower bound of the marginal likelihood are described in [5]. Recently, it has been shown that the variational inference described here is accurate in estimating the mean of posterior transcript expression, but not the variance [15].
The bottleneck of the computational cost of the inference algorithm is the calculation of ${E}_{z}\left[{Z}_{nt}=1\right]$ for all the possible alignments in the VBE step, which takes O(M) time if the total number of possible alignments is M. This time complexity is upper bounded by O(NT), where N is the number of reads and T is the number of cDNA reference sequences. Suppose some ${E}_{\theta}\left[{\hat{\theta}}_{t}\right]$ are already converged (unchanged from the previous iteration step) at the current step. We store the information in a Boolean variable theta_converged [t], which takes true if ${E}_{\theta}\left[{\hat{\theta}}_{t}\right]$ is converged, and false otherwise for each isoform t. Let ${\tau}_{n}$ be a set of isoforms to which read n is aligned. In the next VBE step, for each read n, ${E}_{z}[{Z}_{nt}=1]$ will not change if theata_converged [t] is true for all $t\in {\tau}_{n}$. To represent this information, we introduce a Boolean variable $read\mathsf{\text{\_}}movable\phantom{\rule{2.77695pt}{0ex}}\left[n\right]$, which takes false if ${E}_{z}[{Z}_{nt}=1]$ will not change in the next VBE step, and true otherwise. The following algorithm computes $read\mathsf{\text{\_}}movable\phantom{\rule{2.77695pt}{0ex}}\left[n\right]$ at the start of each iteration:

1.
For each t, set theta_converged [t] to true if ${E}_{\theta}\left[{\hat{\theta}}_{t}\right]$ did not change from the previous step, and false otherwise.

2.
For each n, if theta_converged [t] is true for all $t\in {\tau}_{n}$, then set $read\mathsf{\text{\_}}movable\phantom{\rule{2.77695pt}{0ex}}\left[n\right]$ to false, and true otherwise.
Then, in the VBE step, E_{ z }[Z_{ nt } = 1] is computed where $read\mathsf{\text{\_}}movable\phantom{\rule{2.77695pt}{0ex}}\left[n\right]$ is true. The algorithm heuristically eliminates unnecessary calculations of ${E}_{z}[{Z}_{nt}=1]$ drastically in the later part of iterations, in which most of ${E}_{\theta}\left[{\hat{\theta}}_{t}\right]$ are already converged and only a fraction of reads should be considered for calculating the update equations.
Results and discussion
Simulation data analysis
We evaluate the performance of quantifying gene expression levels with TIGAR2 compared to existing methods using simulation data. First, 10,000 transcript isoforms in the human RefSeq database [11] are randomly chosen. Second, a set of true gene expression levels is constructed, in which log of isoform abundance is sampled from the standard normal distribution. Then, we generated 20 million, 8 million, 4 million, and 2 million RNASeq singleend reads of 100 bp, 250 bp, 500 bp, and 1000 bp, respectively, so that the total throughput of nucleotides remains the same. Similarly, 10 million, 4 million, 2 million, and 1 million pairedend reads of 100 bp, 250 bp, 500 bp, and 1000 bp, respectively, have been generated whose fragment size follows the normal distribution with μ_{ F } = 300, 750, 1250, and 2500, and σ_{ F } = 40, 100, 200, and 400, respectively. In order to simulate sequencing errors, we prepared a set of simulation data with 1% substitution, 1% deletion, and 1% insertion errors. All the simulation data was generated by our inhouse software. After aligning reads to the reference cDNA sequences with Bowtie2 (the maximum number of allowed alignments per read is 100), transcript isoform abundances are estimated with TIGAR2. For comparing the performance, TIGAR1 [5], RSEM v1.2.10 [6] and Cufflinks v2.1.1 (with default options except 'u' and 'G' options) [2] are applied to the same simulation data. Although BitSeq [16] is also a relevant method, it is not included in our experiment since performance comparison with TIGAR was already conducted in their analysis [17]. Similarly, variablelength reads are generated according to the estimated read length distribution as shown in Figure 3, and isoform expression levels are estimated with each method. The root mean square errors of the estimated abundances (log of FPKMs) compared to the true gene expression levels are calculated and shown in Figure 4 and 5. For both fixedlength (singleend and pairedend) and variablelength reads, TIGAR2 consistently performed better than others. Especially, when read lengths > 250bp, the prediction accuracies with TIGAR2 over those with RSEM and Cufflinks are markedly better, which can be explained by more sensitive mapping with the latest alignment tools and efficient optimization of multimapped reads by the variational Bayesian inference implemented in TIGAR2. Since RSEM uses Bowtie as an aligner in the integrated pipeline, it becomes more difficult to align longer reads to the reference sequences without gappedalignments of the reads, which potentially loses sensitivity of mappings.
Real data analysis
To evaluate performance with TIGAR2 for real RNASeq data analysis, we obtained 4.25 million singleend reads of variable lengths of the human HeLa cell, which is publicly available from the Life Technologies' web site (http://ioncommunity.lifetechnologies.com). The sequencing was performed with the Ion PGM sequencer, which detects the protons released sequentially when one of the four nucleotide bases is introduced in realtime [18]. We divided the RNASeq data into two data sets, assuming that they are technical replicates obtained from the same experimental conditions. Gene expression levels were estimated with TIGAR2, RSEM, and Cufflinks and plotted in Figure 6 (the Pearson correlation coefficients of the estimated abundances between the two technical replicates were 0.897, 0.888 and 0.888, respectively). The result shows that the quantification with TIGAR2 was most consistent among the technical replicates, compared to those with RSEM and Cufflinks. TIGAR2 outputs the optimized read alignment on cDNA references in BAM format after inference is done, so that predicted isoforms can be followed up. The resultant BAM file can be loaded into a genome browser, such as Integrative Genomics Viewer [19]. This function is also a new feature that is not available in the original TIGAR and TopHatCufflinks. The bottom track in Figure 7 shows the optimized read alignments estimated with TIGAR2 for NM_001139441, which is an isoform of BAP31 that is known to be expressed in HeLa cells [20]. Compared to the read alignment by Bowtie2 (the top track in Figure 7), not only the amount of reads assigned to the isoform increased, but also it became easier to identify possible sequencing errors from genomic variants.
Computational resources
CPU time and memory required in the real data analysis are summarized in Table 1. TIGAR2 was the fastest among others, notably more than two times faster than TIGAR1 with practical memory requirement. TopHatCufflinks was slower than TIGAR2, TIGAR1 and RSEM, especially in the alignment step.
To see the scalability of TIGAR2 for a large dataset, it is applied to 100 million synthetic reads (100 bp singleend). It required 16 GB memory and 2,621 minutes of CPU time.
All the experiments were performed on an Intel Xeon CPU E52670 processor (2.60GHz) with the Red Hat Enterprise Linux Server release 6.2.
Conclusions
We have developed a computational method, named TIGAR2, which is accurate and sensitive in quantifying gene expression levels of transcript isoforms from RNASeq data. TIGAR2 outperformed existing methods with simulation data of both singleend and pairedend reads (100 bp, 250 bp, 500 bp and 1000 bp), especially for reads > 250 bp. TIGAR2 will be more effective for accurate detection and quantification of transcript isoforms compared to other existing methods, as new technologies for longer sequencing become available.
Instead of trying to find novel transcript isoforms from RNASeq data, reference cDNA sequences of transcript isoforms are assumed to be known in the TIGAR2 pipeline. Although there are a couple of algorithms to predict novel transcript isoforms or fusion genes [2, 14, 21], TIGAR2 does not provide the novel predictions at the moment. However, once candidates of novel transcript isoforms are predicted by external tools, they can be treated as known and gene expression levels of these novel isoforms can be quantified and assessed with TIGAR2. Another possible extension of TIGAR2 includes modelling of underlying genomic variation for identifying allelespecific gene expression. Because the cost of whole genomesequencing is dropping sharply, it is becoming feasible to use both genomic information as well as gene expression data. Finally, there should be an optimal balance between the maximum number of allowed alignments per read and the convergence speed. These topics will be investigated as our future works.
Availability of supporting data
The implementation of TIGAR2 and the documentation is available in the GitHub repository, https://github.com/nariai/tigar2.
References
 1.
Wang Z, Gerstein M, Snyder M: RNASeq: a revolutionary tool for transcriptomics. Nature reviews Genetics. 2009, 10 (1): 5763. 10.1038/nrg2484.
 2.
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression analysis of RNAseq experiments with TopHat and Cufflinks. Nature protocols. 2012, 7 (3): 562578. 10.1038/nprot.2012.016.
 3.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nature methods. 2008, 5 (7): 621628. 10.1038/nmeth.1226.
 4.
Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN: RNASeq gene expression estimation with read mapping uncertainty. Bioinformatics (Oxford, England). 2010, 26 (4): 493500. 10.1093/bioinformatics/btp692.
 5.
Nariai N, Hirose O, Kojima K, Nagasaki M: TIGAR: transcript isoform abundance estimation method with gapped alignment of RNASeq data by variational Bayesian inference. Bioinformatics (Oxford, England). 2013, 29: (18):22922299.
 6.
Li B, Dewey CN: RSEM: accurate transcript quantification from RNASeq data with or without a reference genome. BMC bioinformatics. 2011, 12: 32310.1186/1471210512323.
 7.
Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y: A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC genomics. 2012, 13: 34110.1186/1471216413341.
 8.
Bragg LM, Stone G, Butler MK, Hugenholtz P, Tyson GW: Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS computational biology. 2013, 9 (4): e100303110.1371/journal.pcbi.1003031.
 9.
Li H, Durbin R: Fast and accurate short read alignment with BurrowsWheeler transform. Bioinformatics (Oxford, England). 2009, 25 (14): 17541760. 10.1093/bioinformatics/btp324.
 10.
Langmead B, Salzberg SL: Fast gappedread alignment with Bowtie 2. Nature methods. 2012, 9 (4): 357359. 10.1038/nmeth.1923.
 11.
Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated nonredundant sequence database of genomes, transcripts and proteins. Nucleic acids research. 2007, 35 (Database): D6165. 10.1093/nar/gkl842.
 12.
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al: Fulllength transcriptome assembly from RNASeq data without a reference genome. Nature biotechnology. 2011, 29 (7): 644652. 10.1038/nbt.1883.
 13.
Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol MJ, Gnirke A, Nusbaum C, et al: Ab initio reconstruction of cell typespecific transcriptomes in mouse reveals the conserved multiexonic structure of lincRNAs. Nature biotechnology. 2010, 28 (5): 503510. 10.1038/nbt.1633.
 14.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology. 2010, 28 (5): 511515. 10.1038/nbt.1621.
 15.
Papastamoulis P, Hensman J, Glaus P, Rattray M: Improved variational Bayes inference for transcript expression estimation. Statistical applications in genetics and molecular biology. 2014, 114.
 16.
Glaus P, Honkela A, Rattray M: Identifying differentially expressed transcripts from RNAseq data with biological variation. Bioinformatics (Oxford, England). 2012, 28 (13): 17211728. 10.1093/bioinformatics/bts260.
 17.
Hensman J, Glaus P, Honkela A, Rattray M: Fast Approximate Inference of Transcript Expression Levels from RNAseq Data. ArXiv eprints. 2013, 1308: 5953
 18.
Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, Leamon JH, Johnson K, Milgrew MJ, Edwards M, et al: An integrated semiconductor device enabling nonoptical genome sequencing. Nature. 2011, 475 (7356): 348352. 10.1038/nature10242.
 19.
Thorvaldsdottir H, Robinson JT, Mesirov JP: Integrative Genomics Viewer (IGV): highperformance genomics data visualization and exploration. Briefings in bioinformatics. 2013, 14 (2): 178192. 10.1093/bib/bbs017.
 20.
Wang B, HeathEngel H, Zhang D, Nguyen N, Thomas DY, Hanrahan JW, Shore GC: BAP31 interacts with Sec61 translocons and promotes retrotranslocation of CFTRDeltaF508 via the derlin1 complex. Cell. 2008, 133 (6): 10801092. 10.1016/j.cell.2008.04.042.
 21.
Kinsella M, Harismendy O, Nakano M, Frazer KA, Bafna V: Sensitive gene fusion detection using ambiguously mapping RNASeq read pairs. Bioinformatics (Oxford, England). 2011, 27 (8): 10681075. 10.1093/bioinformatics/btr085.
Acknowledgements
This work was supported (in part) by MEXT Tohoku Medical Megabank Project and CREST. All computational resources were provided by the Supercomputing services, Tohoku Medical Megabank Organization, Tohoku University.
Declarations
The publication costs for this article were partly funded by MEXT Tohoku Medical Megabank Project and CREST.
This article has been published as part of BMC Genomics Volume 15 Supplement 10, 2014: Proceedings of the 25th International Conference on Genome Informatics (GIW/ISCBAsia): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S10.
Author information
Affiliations
Corresponding authors
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
NN and MN conceived the study, NN, KK, TM, and MN designed the computational experiments, NN performed the analysis, and NN, KK, TM, and MN interpreted the results. YS, YK, and YYK collaborated on data collection and interpretation of the results. NN, KK, TM, YS, YK, YYK and MN wrote the manuscript. All the authors read and approved the final manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Nariai, N., Kojima, K., Mimori, T. et al. TIGAR2: sensitive and accurate estimation of transcript isoform expression with longer RNASeq reads. BMC Genomics 15, S5 (2014). https://doi.org/10.1186/1471216415S10S5
Published:
Keywords
 Transcript isoform quantification
 RNASeq
 variational Bayesian inference
 graphical models