Volume 15 Supplement 5
Transcriptome assembly and quantification from Ion Torrent RNA-Seq data
© Mangul et al.; licensee BioMed Central Ltd. 2014
Published: 14 July 2014
High throughput RNA sequencing (RNA-Seq) can generate whole transcriptome information at the single transcript level providing a powerful tool with multiple interrelated applications including transcriptome reconstruction and quantification. The sequences of novel transcripts can be reconstructed from deep RNA-Seq data, but this is computationally challenging due to sequencing errors, uneven coverage of expressed transcripts, and the need to distinguish between highly similar transcripts produced by alternative splicing. Another challenge in transcriptomic analysis comes from the ambiguities in mapping reads to transcripts.
We present MaLTA, a method for simultaneous transcriptome assembly and quantification from Ion Torrent RNA-Seq data. Our approach explores transcriptome structure and incorporates a maximum likelihood model into the assembly and quantification procedure. A new version of the IsoEM algorithm suitable for Ion Torrent RNA-Seq reads is used to accurately estimate transcript expression levels. The MaLTA-IsoEM tool is publicly available at: http://alan.cs.gsu.edu/NGS/?q=malta
Experimental results on both synthetic and real datasets show that Ion Torrent RNA-Seq data can be successfully used for transcriptome analyses. They also suggest that the MaLTA-IsoEM solution achieves higher transcriptome assembly and quantification accuracy than existing state-of-the-art approaches.
Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, generates full transcriptome data at the single transcript level and provides a powerful tool with multiple interrelated applications, including transcriptome assembly [1–4], gene and transcript expression level estimation [5–8] (also known as transcriptome quantification), the study of trans- and cis-regulatory effects, the study of parent-of-origin effects [9–11], and the calling of expressed variants.
RNA-Seq has become the technology of choice for performing transcriptome analysis, rapidly replacing array-based technologies. The Ion Torrent technology offers the fastest sequencing protocol for RNA-Seq experiments and is able to sequence a whole transcriptome in a few hours. Most current research using RNA-Seq employs methods that depend on existing transcriptome annotations. Unfortunately, as shown by recent targeted RNA-Seq studies, existing transcript libraries still miss large numbers of transcripts. The incompleteness of annotation libraries poses a serious limitation to using this powerful technology, since accurate normalization of RNA-Seq data critically requires knowledge of expressed transcript sequences [5–8]. Another challenge in transcriptomic analysis comes from the ambiguities in read/tag mapping to transcripts. Ubiquitous regulatory mechanisms such as the use of alternative transcription start and polyadenylation sites, alternative splicing, and RNA editing result in multiple messenger RNA (mRNA) isoforms being generated from a single genomic locus. Most prevalently, alternative splicing is estimated to take place for over 90% of the multi-exon human genes across diverse cell types, with as much as 68% of multi-exon genes expressing multiple isoforms in a clonal cell line of colorectal cancer origin. The ability to reconstruct full length transcript sequences and accurately estimate their expression levels is widely believed to be critical for unraveling gene functions and transcription regulation mechanisms.
Here, we focus on two main problems in transcriptome analysis, namely, transcriptome assembly and quantification. Transcriptome assembly, also known as novel transcript discovery or reconstruction, is the problem of assembling the full length transcript sequences from the RNA sequencing data. Assembly can be done de novo or it can be assisted by existing genome and transcriptome annotations. Transcriptome quantification is the problem of estimating the expression level of each transcript. In the remainder of this section we give a brief description of the common protocols used for mRNA sequencing.
Transcriptome assembly and quantification from RNA-Seq data have been the focus of much research in recent years. The sequences of novel transcripts together with their expression levels can be inferred from deep RNA-Seq data, but this is computationally challenging due to the short length of the reads, the high percentage of sequencing errors, uneven coverage of expressed transcripts, and the need to distinguish between highly similar transcripts produced by alternative splicing. A number of methods address the problem of transcriptome assembly and quantification from RNA sequencing data. Methods for transcriptome assembly fall into three categories: "genome-guided", "genome-independent" and "annotation-guided" methods. Genome-independent methods such as Trinity or Trans-ABySS directly assemble reads into transcripts. A commonly used approach for such methods is the de Bruijn graph utilizing "k-mers". The use of genome-independent methods becomes essential when there is no trusted genome reference that can be used to guide assembly. On the other end of the spectrum, annotation-guided methods [26–28] make use of available information in existing transcript annotations to aid in the discovery of novel transcripts. RNA-Seq reads can be mapped onto the reference genome, reference annotations, exon-exon junction libraries, or combinations thereof, and the resulting alignments are used to assemble transcripts.
Many transcriptome reconstruction methods fall in the genome-guided category. They typically start by mapping sequencing reads onto the reference genome, using spliced alignment tools such as TopHat or SpliceMap. The spliced alignments are used to identify putative exons, splice junctions and transcripts that explain the alignments. While some methods aim to achieve the highest sensitivity, others work to predict the smallest set of transcripts explaining the given input reads. Furthermore, some methods aim to reconstruct the set of transcripts that would ensure the highest quantification accuracy. Scripture constructs a splice graph from the mapped reads and reconstructs transcripts corresponding to all possible paths in this graph. It then uses paired-end information to filter out some transcripts. Although Scripture achieves very high sensitivity, it may predict many incorrect isoforms. The method of Trapnell et al. [4, 31], referred to as Cufflinks, constructs a read overlap graph and reconstructs transcripts using a minimal size path cover via a reduction to maximum matching in a weighted bipartite graph. TRIP uses an integer programming model where the objective is to select the smallest set of putative transcripts that yields a good statistical fit between the fragment length distribution empirically determined during library preparation and fragment lengths implied by mapping read pairs to selected transcripts. IsoLasso uses the LASSO algorithm, and it aims to achieve a balance between quantification accuracy and predicting the minimum number of transcripts. It formulates the problem as a quadratic program, with additional constraints to ensure that all exons and junctions supported by the reads are included in the predicted isoforms.
CLIIQ uses an integer linear programming solution that minimizes the number of predicted isoforms explaining the RNA-Seq reads while minimizing the difference between estimated and observed expression levels of exons and junctions within the predicted isoforms. Traph proposes a method based on network flows for a multiassembly problem arising from transcript identification and quantification with RNA-Seq. Another method, CLASS, uses local read coverage patterns of RNA-Seq reads and contiguity constraints from read pairs and spliced reads to predict transcripts from RNA-Seq data. iReckon is a method for simultaneous determination of the transcripts and estimation of their abundances. This probabilistic approach incorporates multiple biological and technical phenomena, including novel isoforms, intron retention, unspliced pre-mRNA, PCR amplification biases, and multi-mapped reads. iReckon utilizes regularized Expectation-Maximization to accurately estimate the abundances of known and novel transcripts.
Alignment of RNA-Seq reads onto the reference genome, reference annotations, exon-exon junction libraries, or combinations thereof is the first step of RNA-Seq analyses, unless none of these are available, in which case it is recommended to use de novo assembly methods [23, 24]. The best mapping strategy depends on the purpose of the RNA-Seq analysis. If the focus of the study is to estimate transcript and gene expression levels rather than to discover new transcripts, then it is recommended to map reads directly onto the set of annotated transcripts using a fast tool for ungapped read alignment. To be able to discover new transcriptional variants, one should map the reads onto the reference genome. Recently, many bioinformatics tools, called spliced read aligners, have been developed to map RNA-Seq reads onto a reference genome [29, 30]. Alternatively, RNA-Seq reads can be mapped onto the genome using a local alignment tool such as the Ion Torrent mapper, TMAP. Both spliced alignments and local alignments can be used to detect novel transcriptional and splicing events including exon boundaries, exon-exon junctions, gene boundaries, transcription start sites (TSS) and transcription end sites (TES).
In our experiments we used TopHat with default parameters. For assessing transcriptome quantification accuracy, Ion Torrent reads from cancer datasets were mapped onto the External RNA Controls Consortium (ERCC) RNA spike-in controls reference with added polyA tails of 200 bp using TMAP. Reads for the MAQC datasets were mapped onto Ensembl known transcripts with added polyA tails of 200 bp, also using TMAP.
Splice graph and putative transcripts
After constructing the splice graph, MaLTA enumerates all maximal paths using a depth-first-search algorithm. These paths correspond to putative transcripts. Note that a gene with n pseudo-exons may have as many as 2^n − 1 possible candidate transcripts, each composed of a subset of the n pseudo-exons. The next subsection presents a maximum likelihood transcriptome assembly and quantification algorithm that selects a minimal subset of candidate transcripts that best fits the observed RNA-Seq reads. The key ingredient is an expectation-maximization algorithm for estimating expression levels of candidate transcripts.
Maximum likelihood transcriptome assembly
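The path enumeration described above can be sketched as follows. This is a minimal illustration, not the MaLTA implementation: it assumes the splice graph is a directed acyclic graph whose nodes are pseudo-exons and whose edges are observed adjacencies or junctions, and it takes a "maximal path" to mean a path from a node with no incoming edge to a node with no outgoing edge. The function name and the toy gene are hypothetical.

```python
from collections import defaultdict


def enumerate_maximal_paths(nodes, edges):
    """Return every source-to-sink path of a DAG as a tuple of nodes.

    nodes: iterable of hashable pseudo-exon identifiers
    edges: iterable of (u, v) pairs, meaning v may directly follow u
    """
    successors = defaultdict(list)
    has_predecessor = set()
    for u, v in edges:
        successors[u].append(v)
        has_predecessor.add(v)

    paths = []

    def dfs(node, path):
        # Depth-first search: extend the current path until a sink is reached.
        path.append(node)
        if not successors[node]:
            paths.append(tuple(path))
        else:
            for nxt in successors[node]:
                dfs(nxt, path)
        path.pop()

    # Start one search from every node that has no incoming edge.
    for node in nodes:
        if node not in has_predecessor:
            dfs(node, [])
    return paths


# Hypothetical gene with four pseudo-exons; exons 2 and 3 can each be skipped.
toy_nodes = [1, 2, 3, 4]
toy_edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
paths = enumerate_maximal_paths(toy_nodes, toy_edges)
for p in paths:
    print(p)
```

The number of paths can grow exponentially with the number of pseudo-exons, which is why a selection step over the candidates is needed afterwards.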
Existing transcriptome assembly methods [3, 4] use read pairing information and fragment length distribution to accurately assemble the set of transcripts expressed in a sample. This information is not available for current Ion Torrent technology, which can make it challenging to assemble transcripts. The Ion Torrent PGM platform is able to produce single reads with read lengths in the 50–300 bp range. Our approach is to simultaneously explore the transcriptome structure and perform transcriptome quantification using a maximum likelihood model. MaLTA starts from the set of putative transcripts and selects the subset of these transcripts with the highest support from the RNA-Seq data. Maximum likelihood estimates of putative transcripts are computed using an Expectation Maximization (EM) algorithm which takes into account alternative splicing and read mapping ambiguities. EM algorithms are currently the state-of-the-art approach to transcriptome quantification from RNA-Seq reads, and have been shown to outperform count-based approaches. Several independent implementations of the EM algorithm exist in the literature [7, 40].
We developed a new version of IsoEM suitable for Ion Torrent RNA-Seq reads. IsoEM is an expectation-maximization algorithm for transcript frequency estimation. It overcomes the problem of reads mapping to multiple transcripts using an iterative framework which simultaneously estimates transcript frequencies and imputes the missing origin of the reads. A key feature of IsoEM is that it exploits information provided by the distribution of insert sizes, which is tightly controlled during sequencing library preparation under current RNA-Seq protocols. We previously showed that modeling insert sizes is highly beneficial for transcript expression level estimation even for RNA-Seq data consisting of single reads, as in the case of Ion Torrent. Modeling insert sizes contributes to increased estimation accuracy by disambiguating the transcript of origin for the reads. In IsoEM, insert lengths are combined with base quality scores and, if available, strand information to probabilistically allocate reads to transcripts during the expectation step of the algorithm. Since most Ion Torrent sequencing errors are insertions and deletions, we developed a version of IsoEM capable of handling insertions and deletions in read alignments.
The main idea of the MaLTA approach is to cover all transcriptional and splicing variants present in the sample with the minimum set of putative transcripts. We use the new version of the IsoEM algorithm described above to estimate expression levels of putative transcripts. Since IsoEM is run with all possible candidate transcripts, the number of transcripts that are predicted to have non-zero frequency can still be very large. Instead of selecting all transcripts with non-zero frequency, we would like to select a small set of transcripts that explain all observed splicing events and have the highest support from the sequencing data. To realize this idea we use a greedy algorithm which traverses the list of candidate transcripts sorted in descending order by expression level, and selects a candidate transcript only if it contains a transcriptional or splicing event not explained by the previously selected transcripts.
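To make the expectation and maximization steps concrete, the following is a deliberately simplified sketch of an EM loop for transcript frequency estimation. It is not the IsoEM code: the per-read compatibility weights are assumed to be given (in IsoEM they would combine insert length, base quality and strand information), transcript lengths are used without any effective-length correction, and all names are hypothetical.

```python
def em_transcript_frequencies(read_weights, lengths, n_iter=200):
    """Estimate transcript frequencies with a basic EM loop.

    read_weights: one dict per read, mapping transcript index -> weight,
                  where the weight scores how well the read alignment fits
                  that transcript (assumed precomputed)
    lengths:      transcript lengths in bp, indexed like the weights
    """
    n = len(lengths)
    freqs = [1.0 / n] * n  # start from uniform frequencies
    for _ in range(n_iter):
        # E-step: fractionally assign each read to its compatible transcripts.
        expected_reads = [0.0] * n
        for weights in read_weights:
            total = sum(freqs[t] * w for t, w in weights.items())
            if total == 0.0:
                continue  # read is incompatible with every expressed transcript
            for t, w in weights.items():
                expected_reads[t] += freqs[t] * w / total
        # M-step: length-normalize the expected read counts and renormalize.
        per_base = [c / length for c, length in zip(expected_reads, lengths)]
        norm = sum(per_base)
        if norm == 0.0:
            break  # no read could be assigned; keep the current estimate
        freqs = [x / norm for x in per_base]
    return freqs


# Toy input: 30 reads unique to transcript 0, 10 unique to transcript 1,
# and 20 reads that fit both transcripts equally well.
toy_reads = [{0: 1.0}] * 30 + [{1: 1.0}] * 10 + [{0: 1.0, 1: 1.0}] * 20
toy_freqs = em_transcript_frequencies(toy_reads, lengths=[1000, 1000])
print(toy_freqs)
```

In this toy case the ambiguous reads end up split in proportion to the evidence from the uniquely mapped reads, which is the behavior that distinguishes EM from simple count-based estimates.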
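The greedy selection step described above can be illustrated with a short sketch. It assumes that each candidate transcript comes with an estimated expression level and a set of hashable event identifiers (for example junction or boundary labels); the data layout and names are hypothetical and not taken from the MaLTA code.

```python
def greedy_select(candidates):
    """Select candidates in descending expression order, keeping a candidate
    only if it contains at least one event not yet explained.

    candidates: list of (name, expression_level, set_of_events)
    """
    selected = []
    explained = set()
    by_expression = sorted(candidates, key=lambda c: c[1], reverse=True)
    for name, _expression, events in by_expression:
        if not events <= explained:  # contributes at least one new event
            selected.append(name)
            explained |= events
    return selected


# Hypothetical candidates for one gene; events are junction labels.
toy_candidates = [
    ("T1", 10.0, {"j_a", "j_b"}),
    ("T2", 5.0, {"j_b"}),
    ("T3", 2.0, {"j_b", "j_c"}),
    ("T4", 0.5, {"j_a", "j_c"}),
]
selected = greedy_select(toy_candidates)
print(selected)
```

Here T2 and T4 are dropped because every event they contain is already explained by a more highly expressed transcript.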
Results and discussions
We evaluated the accuracy of the MaLTA-IsoEM approach on both simulated and real human RNA-Seq data. The human genome sequence (hg18, NCBI build 36) was downloaded from UCSC together with the KnownGenes transcripts annotation table. Genes were defined as clusters of known transcripts defined by the GNFAtlas2 table. In our simulation experiments, we simulate reads together with spliced alignments to the genome; these alignments are provided to all compared methods. We varied the length of single-end reads, which were randomly generated per gene by sampling fragments from known transcripts. All the methods were compared on datasets with various read lengths, i.e., 50 bp, 100 bp, 200 bp, and 400 bp. Expression levels of transcripts inside each gene cluster followed uniform and geometric distributions. To model the library preparation process of an RNA-Seq experiment, we simulated fragment lengths from a normal probability distribution with different means and 10% standard deviation.
All reconstructed transcripts were matched against annotated transcripts. As in previous studies, two transcripts were assumed to match if and only if their internal exon boundary coordinates (i.e., all exon coordinates except the beginning of the first exon and the end of the last exon) were identical. We use sensitivity and positive predictive value (PPV) to evaluate the performance of different assembly methods. Sensitivity is defined as the proportion of annotated transcripts that are matched by assembled transcripts, i.e., sensitivity = TP/(TP + FN). Positive predictive value (PPV) is defined as the proportion of assembled transcripts that match annotated transcripts, i.e., PPV = TP/(TP + FP).
Transcriptome quantification accuracy was evaluated by comparing RNA-Seq estimates with TaqMan qRT-PCR measurements or External RNA Controls Consortium (ERCC) RNA spike-in controls. The coefficient of determination (R²) between the qRT-PCR/ERCC values and the Fragments Per Kilobase of exon length per Million reads (FPKM) estimates was used as the accuracy measure.
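The fragment sampling step of such a simulation can be sketched as below. This is only an illustration under stated assumptions, not the simulator used in the experiments: "10% standard deviation" is read as a standard deviation equal to 10% of the mean fragment length, the fragment start is drawn uniformly, the read is taken from the 5' end of the fragment, and no sequencing errors are introduced. All names are hypothetical.

```python
import random


def simulate_read(transcript, read_len, mean_frag_len, rng):
    """Sample one single-end read from a transcript sequence.

    Returns (start, read), where start is the 0-based fragment start.
    """
    sd = 0.10 * mean_frag_len  # assumption: sd is 10% of the mean
    frag_len = int(round(rng.gauss(mean_frag_len, sd)))
    # Clamp so that the fragment is non-empty and fits in the transcript.
    frag_len = max(1, min(frag_len, len(transcript)))
    start = rng.randint(0, len(transcript) - frag_len)  # inclusive bounds
    fragment = transcript[start:start + frag_len]
    return start, fragment[:read_len]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
toy_transcript = "ACGT" * 250  # hypothetical 1000 bp transcript
toy_reads = [simulate_read(toy_transcript, 100, 250, rng) for _ in range(5)]
for start, read in toy_reads:
    print(start, len(read))
```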
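The matching rule and the two accuracy measures can be sketched as follows. The sketch assumes each transcript is given as a sorted list of (start, end) exon coordinates on a single chromosome and strand; a real comparison would also key on chromosome and strand and would need a separate rule for single-exon transcripts, which have no internal boundaries. Transcripts sharing the same internal boundaries are counted once. Function names are hypothetical.

```python
def internal_boundaries(exons):
    """Return all exon coordinates except the start of the first exon and
    the end of the last exon."""
    coords = [c for exon in exons for c in exon]
    return tuple(coords[1:-1])


def sensitivity_ppv(assembled, annotated):
    """Compute (sensitivity, PPV) from two lists of transcripts."""
    annotated_keys = {internal_boundaries(t) for t in annotated}
    assembled_keys = {internal_boundaries(t) for t in assembled}
    tp = len(annotated_keys & assembled_keys)  # annotated and assembled
    fn = len(annotated_keys - assembled_keys)  # annotated but missed
    fp = len(assembled_keys - annotated_keys)  # assembled but not annotated
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv


# Hypothetical gene: two annotated and two assembled transcripts.
toy_annotated = [
    [(100, 200), (300, 400), (500, 600)],
    [(100, 200), (500, 600)],
]
toy_assembled = [
    [(90, 200), (300, 400), (500, 650)],   # differs only at the outer ends
    [(100, 200), (300, 450), (500, 600)],  # internal boundary differs
]
sens, ppv = sensitivity_ppv(toy_assembled, toy_annotated)
print(sens, ppv)
```

The first assembled transcript counts as a match even though its outer ends differ, since only internal boundaries are compared.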
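For illustration, the two quantities in this comparison can be computed as below. The sketch assumes the usual FPKM definition (fragments per kilobase of transcript per million mapped fragments) and takes R² to be the squared Pearson correlation between the two sets of values; the source does not state whether values were transformed (for example to log scale) before the correlation was computed, so none is applied here. Function names and numbers are hypothetical.

```python
def fpkm(fragment_count, length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragment_count * 1e9 / (length_bp * total_mapped)


def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical values: 500 fragments on a 2000 bp transcript, 1M mapped.
example_fpkm = fpkm(500, 2000, 1_000_000)
# Perfectly proportional estimates give R² equal to 1.
example_r2 = r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(example_fpkm, example_r2)
```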
Comparison on simulated RNA-Seq data
In this section, we use the sensitivity and PPV defined above to compare MaLTA to other transcriptome assembly tools. The most recent versions of Cufflinks (version 2.1.1) and IsoLasso (v 2.6.0) with the default parameters are used for performance comparison. We explore the influence of read and fragment length on the performance of assembly methods.
Table: Sensitivity and PPV comparison between methods on datasets simulated assuming uniform and geometric expression of transcripts, respectively, with read length 400 bp, mean fragment length 450 bp and 10% standard deviation. Columns: # assembled transcripts; # confirmed annotated transcripts.
There is a strong correlation between the number of splicing events within the gene and the number of annotated transcripts. A high number of splicing events leads to increased number of candidate transcripts, which makes the selection process more difficult. To explore the behavior of the methods depending on number of transcripts per gene we divided all genes into categories according to the number of annotated transcripts and calculated the sensitivity and PPV within each such category.
Table: Sensitivity and PPV comparison between methods for different combinations of read and fragment lengths: (50 bp, 250 bp), (100 bp, 250 bp), (100 bp, 500 bp), (200 bp, 250 bp), (400 bp, 450 bp). Columns: # assembled transcripts; # confirmed annotated transcripts.
Comparison on Ion Torrent cancer and MAQC RNA-Seq datasets
Table: Read mapping statistics and read length for Ion Torrent HeLa datasets. Columns: type of cancer; # mapped reads; mean read length (bp). Rows include: breast ductal carcinoma.
Table: Performance comparison of transcriptome assembly between Cufflinks and MaLTA for Ion Torrent HeLa datasets. Columns: # assembled transcripts; # confirmed annotated transcripts.
Table: Read mapping statistics and correlation between gene expression levels estimated by IsoEM and qPCR measurement for Ion Torrent UHRR dataset. Columns: # mapped reads.
Table: Read mapping statistics and correlation between gene expression levels estimated by IsoEM and qPCR measurement for Ion Torrent HBRR dataset. Columns: # mapped reads.
Table: Correlation (R²) between known frequencies of spiked-in ERCC controls and gene expression levels estimated by IsoEM and Cufflinks for Ion Torrent HeLa datasets.
In this paper we described the MaLTA-IsoEM method for simultaneous transcriptome assembly and quantification from Ion Torrent RNA-Seq data. Our approach explores transcriptome structure and incorporates a maximum likelihood model into the assembly and quantification procedure. Results on real cancer and MAQC RNA-Seq datasets show that Ion Torrent RNA-Seq data can be successfully used for transcriptome analysis. Transcriptome assembly and quantification accuracy was confirmed by comparison to annotated transcripts and TaqMan qRT-PCR measurements and External RNA Controls Consortium RNA spike-in controls. Experimental results on both real and synthetic datasets generated with various sequencing parameters and distribution assumptions suggest increased transcriptome assembly and quantification accuracy of MaLTA-IsoEM compared to existing state-of-the-art approaches.
S.M., A.C., S.A.S., I.M. and A.Z. were supported in part by Agriculture and Food Research Initiative Competitive Grant no. 201167016-30331 from the USDA National Institute of Food and Agriculture and by Life Technology Grants "Novel transcript reconstruction from Ion Torrent sequencing" and "Viral Metagenome Reconstruction Software for Ion Torrent PGM Sequencer". S.M., A.C. and A.Z. were supported in part by NSF award IIS-0916401. I.M. was supported in part by NSF award IIS-0916948. S.M. is supported by National Science Foundation grants 0513612, 0731455, 0729049, 0916676, 1065276, 1302448 and 1320589, and National Institutes of Health grants K25-HL080079, U01-DA024417, P01-HL30568, P01-HL28481, R01-GM083198, R01-MH101782 and R01-ES022282. S.M. was supported in part by Institute for Quantitative & Computational Biosciences Fellowship, UCLA and Second Century Initiative Bioinformatics University Doctoral Fellowship, Georgia State University. A.C. was supported in part by Molecular Basis of Disease Fellowship, Georgia State University.
Publication costs for this work were funded by the corresponding authors' institutions.
This article has been published as part of BMC Genomics Volume 15 Supplement 5, 2014: Selected articles from the Third IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S5.
- Guttman M, Garber M, Levin J, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol M, Gnirke A, Nusbaum C, Rinn J, Lander E, Regev A: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnology. 2010, 28 (5): 503-510. 10.1038/nbt.1633.
- Li W, Feng J, Jiang T: IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly. Lecture Notes in Computer Science. 2011, 6577: 168. 10.1007/978-3-642-20036-6_18.
- Mangul S, Caciula A, Al Seesi S, Brinza D, Banday AR, Kanadia R: An integer programming approach to novel transcript reconstruction from paired-end RNA-Seq reads. Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. 2012, BCB '12, New York, NY, USA: ACM, 369-376. 10.1145/2382936.2382983.
- Trapnell C, Williams B, Pertea G, Mortazavi A, Kwan G, van Baren M, Salzberg S, Wold B, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010, 28 (5): 511-515. 10.1038/nbt.1621.
- Li B, Ruotti V, Stewart R, Thomson J, Dewey C: RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010, 26 (4): 493-500. 10.1093/bioinformatics/btp692.
- Mortazavi A, Williams B, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008, 10.1038/nmeth.1226.
- Nicolae M, Mangul S, Mandoiu I, Zelikovsky A: Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms for Molecular Biology. 2011, 6: 9. 10.1186/1748-7188-6-9.
- Wang E, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore S, Schroth G, Burge C: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456 (7221): 470-476. 10.1038/nature07509.
- McManus CJ, Coolon JD, Duff MO, Eipper-Mains J, Graveley BR, Wittkopp PJ: Regulatory divergence in Drosophila revealed by mRNA-seq. Genome Research. 2010, 20 (6): 816-825. 10.1101/gr.102491.109.
- Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK: Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics. 2009, 25 (24): 3207-3212. 10.1093/bioinformatics/btp579. [http://bioinformatics.oxfordjournals.org/content/25/24/3207.abstract]
- Gregg C, Zhang J, Butler JE, Haig D, Dulac C: Sex-specific parent-of-origin allelic expression in the mouse brain. Science (New York, N.Y.). 2010, 329 (5992): 682-685. 10.1126/science.1190831.
- Duitama J, Srivastava P, Mandoiu I: Towards Accurate Detection and Genotyping of Expressed Variants from Whole Transcriptome Sequencing Data. BMC Genomics. 2012, 13 (Suppl 2): S6. 10.1186/1471-2164-13-S2-S6.
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63. 10.1038/nrg2484.
- Gene expression profiling using Ion semiconductor sequencing. 2013, [http://tools.invitrogen.com/content/sfs/brochures/gene-expression-profiling-using-Ion-semiconductor-sequencing.pdf]
- Mercer TR, Gerhardt DJ, Dinger ME, Crawford J, Trapnell C, Jeddeloh JA, Mattick JS, Rinn JL: Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nature Biotechnology. 2012, 30: 99-104. 10.1038/nbt.2024.
- Griffith M, et al: Alternative expression analysis by RNA sequencing. Nature Methods. 2010, 7 (10): 843-847. 10.1038/nmeth.1503.
- Ponting C, Belgard T: Transcribed dark matter: meaning or myth?. Human Molecular Genetics. 2010, 10.1093/hmg/ddq362.
- Pandey V, Nutter RC, Prediger E: Applied Biosystems SOLiD™ System: Ligation-Based Sequencing. 2008, Wiley-VCH Verlag GmbH & Co. KGaA, 29-42. 10.1002/9783527625130.ch3.
- Thomas RK, Nickerson E, Simons JF, Janne PA, Tengs T, Yuza Y, Garraway LA, Laframboise T, Lee JC, Shah K, O'Neill K, Sasaki H, Lindeman N, Wong KK, Borras AM, Gutmann EJ, Dragnev KH, Debiasi R, Chen TH, Glatt KA, Greulich H, Desany B, Lubeski CK, Brockman W, Alvarez P, Hutchison SK, Leamon JH, Ronan MT, Turenchalk GS, Egholm M, Sellers WR, Rothberg JM, Meyerson M: Sensitive mutation detection in heterogeneous cancer specimens by massively parallel picoliter reactor sequencing. Nat Med. 2006, 12 (7): 852-5. 10.1038/nm1437.
- Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456 (7218): 53-59. 10.1038/nature07517.
- Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, Leamon JH, Johnson K, Milgrew MJ, Edwards M, Hoon J, Simons JF, Marran D, Myers JW, Davidson JF, Branting A, Nobile JR, Puc BP, Light D, Clark TA, Huber M, Branciforte JT, Stoner IB, Cawley SE, Lyons M, Fu Y, Homer N, Sedova M, Miao X, Reed B, Sabina J, Feierstein E, Schorn M, Alanjary M, Dimalanta E, Dressman D, Kasinskas R, Sokolsky T, Fidanza JA, Namsaraev E, McKernan KJ, Williams A, Roth GT, Bustillo J: An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011, 475 (7356): 348-352. 10.1038/nature10242.
- Garber M, Grabherr MG, Guttman M, Trapnell C: Computational methods for transcriptome annotation and quantification using RNA-seq. Nature Methods. 2011, 8 (6): 469-477. 10.1038/nmeth.1613.
- Grabherr M: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology. 2011, 29 (7): 644-652. 10.1038/nbt.1883.
- Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, et al: De novo assembly and analysis of RNA-seq data. Nature Methods. 2010, 7 (11): 909-912. 10.1038/nmeth.1517.
- Pevzner PA: 1-Tuple DNA sequencing: computer analysis. J Biomol Struct Dyn. 1989, 7: 63-73.
- Roberts A, Pimentel H, Trapnell C, Pachter L: Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011, 10.1093/bioinformatics/btr355.
- Mangul S, Caciula A, Glebova O, Mandoiu I, Zelikovsky A: Improved transcriptome quantification and reconstruction from RNA-Seq reads using partial annotations. In Silico Biology. 2011, 11 (5): 251-261.
- Feng J, Li W, Jiang T: Inference of Isoforms from Short Sequence Reads. Proc RECOMB. 2010, 138-157.
- Trapnell C, Pachter L, Salzberg S: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111. 10.1093/bioinformatics/btp120.
- Au KF, Jiang H, Lin L, Xing Y, Wong WH: Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Research. 2010, 10.1093/nar/gkq211.
- Roberts A, Trapnell C, Donaghey J, Rinn J, Pachter L: Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology. 2011, 12 (3): R22. 10.1186/gb-2011-12-3-r22.
- Li W, Feng J, Jiang T: IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly. Journal of Computational Biology. 2011, 18 (11): 1693-707. 10.1089/cmb.2011.0171.
- Tibshirani R: Regression shrinkage and selection via the LASSO. Journal of Royal Statistical Society. 1996, 58: 267-288. 10.1111/j.1467-9868.2011.00771.x.
- Lin YY, Dao P, Hach F, Bakhshi M, Mo F, Lapuk A, Collins C, Sahinalp SC: CLIIQ: Accurate Comparative Detection and Quantification of Expressed Isoforms in a Population. Proc 12th Workshop on Algorithms in Bioinformatics. 2012, 10.1007/978-3-642-33122-0_14.
- Tomescu AI, Kuosmanen A, Rizzi R, Mäkinen V: A novel min-cost flow method for estimating transcript expression with RNA-Seq. BMC Bioinformatics. 2013, 14 (S-5): S15. 10.1186/1471-2105-14-S5-S15.
- Song L, Florea L: CLASS: constrained transcript assembly of RNA-seq reads. BMC Bioinformatics. 2013, 14 (S-5): S14. 10.1186/1471-2105-14-S5-S14.
- Mezlini AM, Smith EJM, Fiume M, Buske O, Savich GL, Shah S, Aparicio S, Chiang DY, Goldenberg A, Brudno M: iReckon: Simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Research. 2012, 23 (3): 519-529. 10.1101/gr.142232.
- Reid LH: Proposed methods for testing and selecting the ERCC external RNA controls. BMC Genomics. 2005, 6: 1-18. 10.1186/1471-2164-6-1.
- Pal S, Gupta R, Kim H, Wickramasinghe P, Baubet V, Showe LC, Dahmane N, Davuluri RV: Alternative transcription exceeds alternative splicing in generating the transcriptome diversity of cerebellar development. Genome Research. 2011, 10.1101/gr.120535.111.
- Li B, Dewey C: RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011, 12: 323. 10.1186/1471-2105-12-323.
- MAQC Consortium: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology. 2006, 24 (9): 1151-1161. 10.1038/nbt1239.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.