Volume 15 Supplement 5
Selected articles from the Third IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013): Genomics
A memory-efficient algorithm to obtain splicing graphs and de novoexpression estimates from de Bruijn graphs of RNA-Seq data
- Sing-Hoi Sze^{1, 2}Email author and
- Aaron M Tarone^{3}
https://doi.org/10.1186/1471-2164-15-S5-S6
© Sze and Tarone; licensee BioMed Central Ltd. 2014
Published: 14 July 2014
Abstract
Background
The recent advance of high-throughput sequencing makes it feasible to study entire transcriptomes through the application of de novo sequence assembly algorithms. While a popular strategy is to first construct an intermediate de Bruijn graph structure to represent the transcriptome, an additional step is needed to construct predicted transcripts from the graph.
Results
Since the de Bruijn graph contains all branching possibilities, we develop a memory-efficient algorithm to recover alternative splicing information and library-specific expression information directly from the graph without prior genomic knowledge. We implement the algorithm as a postprocessing module of the Velvet assembler. We validate our algorithm by simulating the transcriptome assembly of Drosophila using its known genome, and by performing Drosophila transcriptome assembly using publicly available RNA-Seq libraries. Under a range of conditions, our algorithm recovers sequences and alternative splicing junctions with higher specificity than Oases or Trans-ABySS.
Conclusions
Since our postprocessing algorithm does not consume as much memory as Velvet and is less memory-intensive than Oases, it allows biologists to assemble large libraries with limited computational resources. Our algorithm has been applied to perform transcriptome assembly of the non-model blow fly Lucilia sericata that was reported in a previous article, which shows that the assembly is of high quality and it facilitates comparison of the Lucilia sericata transcriptome to Drosophila and two mosquitoes, prediction and experimental validation of alternative splicing, investigation of differential expression among various developmental stages, and identification of transposable elements.
Background
With the advance of high-throughput sequencing techniques, it is feasible to study entire transcriptomes through the application of de novo sequence assembly algorithms [1–8]. A popular strategy of transcriptome assembly algorithms is to first obtain a de Bruijn graph that contains all branching possibilities [7–10]. An additional step is then performed to construct predicted transcripts from the graph. This strategy is employed by Oases [10] and Trans-ABySS [9], which use output from Velvet [5] and ABySS [6] respectively to obtain predicted transcripts. One drawback of the approach is that Oases can be more memory-intensive than Velvet, which limits its application when computational resources are limited. Alternatively, Trinity [8] uses a different approach of first clustering the data, then constructing an individual de Bruijn graph for each cluster that has simple structure.
We implement the algorithm as a postprocessing module of Velvet. We validate our algorithm by simulating the transcriptome assembly of Drosophila using its known complete genome under the condition that all gene transcripts have high expression levels, and by performing Drosophila transcriptome assembly using publicly available RNA-Seq libraries. We further employ a de novo expression estimate to simultaneously evaluate the differential expression levels across libraries without requiring any prior knowledge of the genome, which was validated in [11]. We have applied our algorithm to perform transcriptome assembly of the non-model blow fly Lucilia sericata in [11].
Methods
De Bruijn graph
Given a set of reads and a parameter k, a de Bruijn graph is defined by constructing a vertex for each k-mer that appears within the reads. A pair of k-mers are connected by a directed edge if the (k − 1)-suffix of the first k-mer is the same as the (k − 1)-prefix of the second k-mer. It has been observed that the de Bruijn graph can be used to implicitly assemble these reads through linking together the same k-mer that appears in different reads [12, 13]. Since the number of vertices and edges in a de Bruijn graph depends on the number of distinct k-mers from the reads rather than the total number of reads, this strategy is very popular among short read assembly algorithms for high-throughput sequencing data [2, 3, 5–7].
Postprocessing algorithm
In order to retain alternative splicing information, Heber et al [14] developed an EST assembly algorithm that retains all the junctions in the de Bruijn graph. By imposing a k-mer coverage cutoff, each component becomes a splicing graph that specifies the alternatively spliced variants of a gene. While this strategy was proved to be successful for EST assembly, there are significant challenges in transcriptome assembly from high-throughput sequencing data that are caused by the shorter reads.
We develop a postprocessing algorithm that extracts the de Bruijn graph from Velvet [5] to construct non-linear splicing graphs that represent the transcriptome. In order to retain as much alternative path information as possible, Velvet is applied without using the tour bus algorithm that removes the bubbles in the graph, while still allowing the removal of short tips. Each node returned from Velvet corresponds to a maximal succession of vertices with no branches.
SNPs
Strongly connected components
Forward-backward tangles
Since Velvet assembles the forward and the backward strands simultaneously, each gene should be represented by two disjoint components, one on each strand, which do not contain any cycles. Although there are no more cycles after removing the strongly connected components that are not just a single edge, it is still possible to have forward-backward tangles in which a forward node and the corresponding backward node reside within the same connected component. These forward-backward tangles can be identified by depth-first search [15].
Splicing graphs
We extract all the nodes within the strongly connected components that are not just a single edge and within the forward-backward tangles. We treat each node as an individual assembly that consists only of a single node while ignoring the junction information within these complicated regions. We then remove these nodes and their adjacent edges, and extract the connected components in the remaining graph. Each of these connected components does not contain cycles and should mostly represent alternatively spliced variants of one gene. Only one of the two possible orientations is retained for each extracted node and each connected component.
Junction adjustment
Note that there can still be ambiguities due to the presence of repeating patterns across junctions. Since the graph no longer contains cycles after the previous processing steps, we recover the first k − 1 letters in each starting node with no incoming edges by restoring the removed letters. After these adjustments, we consider each resulting component as a splicing graph that specifies the alternative splicing paths of one gene. Note that we only resolve simple cases and do nothing when there are simultaneously a split and a merge at the two ends of an edge. To remove very short assemblies, we retain only the splicing graphs in which all paths from a source to a sink have sequence length above 2k − 1.
De novoexpression estimate
In order to evaluate differential expression levels in a non-model organism in which no prior information is available, we employ a measure of number of reads per kilobase of node per million reads (RPKM) [11] that is similar to the statistics used by [16] and [17]. Since there is no information about exons in a de novo assembly, reads that appear in the assembly are used instead of mapped reads. Also, each node in a splicing graph is evaluated instead of each exon, with each read that contains a k-mer within a node contributing to that node. Within each node, a RPKM estimate is reported independently for each library within the same assembly. A validation of the de novo RPKM values was given in [11] that shows strong correlations of these values with the ones given by Cufflinks [18] on genes without alternative splicing and good correlations on nodes from genes with alternatively splicing.
Postprocessing software
A software program implementing our postprocessing algorithm is available at http://faculty.cse.tamu.edu/shsze/postprocess. In order to make the results directly applicable to other software during downstream analysis, we represent each assembly in an annotated FASTA format, in which each potentially non-linear structure is represented by a collection of nodes, with connecting edge information and RPKM values for each library embedded within the name of each node.
Results and discussion
Drosophila melanogastersimulations
To simulate the transcriptome assembly of Drosophila, we extracted all gene transcripts from the D. melanogaster genome. For each gene transcript, we randomly pick reads until an average nucleotide coverage of 100 is reached while allowing varying percentages of mismatches in the reads, giving 70598749 reads of length 75.
We applied Velvet by setting the parameters max_branch_length, max_divergence and max_gap_count to 0, while enabling read_trkg. We performed assemblies over different values of hash length k and cov_cutoff c. We extracted the de Bruijn graph from the LastGraph file and applied our postprocessing algorithm. Since de novo sequence assembly is performed mostly on non-model organisms and possible function of the assembled sequences is accessed with respect to a closely related organism, we used translated BLAST search [19] to simulate its usage.
Statistics of the simulated transcriptome assemblies of Drosophila using its known complete genome over different values of k and k-mer coverage cutoff c with 0.1% mismatches in the reads.
k_c | initial nodes | largest tangle | largest SCC | splicing graphs | max length | N50 | >1-node graphs | max nodes | avg nodes | SNPs | total hits | unique hits | >1-hit graphs | max hits | time (mins) | memory (GB) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25_3 | 38884 | 17900 | 9937 | 15713 | 37380 | 2366 | 1361 | 3106 | 10 | 883 | 12731 | 10162 | 643 | 27 | 80,3 | 21,2 |
25_5 | 34822 | 16979 | 9255 | 15521 | 37380 | 2374 | 1351 | 266 | 7 | 517 | 12708 | 10160 | 643 | 27 | 80,3 | 21,2 |
25_10 | 34494 | 16712 | 9057 | 15486 | 37380 | 2373 | 1345 | 194 | 7 | 481 | 12699 | 10158 | 639 | 27 | 80,3 | 21,2 |
31_3 | 28342 | 5037 | 2080 | 13819 | 45158 | 2704 | 1719 | 1007 | 7 | 496 | 12523 | 11112 | 546 | 12 | 76,3 | 18,2 |
31_5 | 27307 | 4971 | 1898 | 13740 | 45158 | 2714 | 1717 | 167 | 6 | 381 | 12494 | 11110 | 552 | 13 | 76,3 | 18,2 |
31_10 | 27265 | 4947 | 1885 | 13829 | 45158 | 2704 | 1698 | 161 | 6 | 377 | 12536 | 11109 | 542 | 13 | 76,3 | 18,2 |
Statistics of the simulated transcriptome assemblies of Drosophila using its known complete genome over different values of k and k-mer coverage cutoff c with 0.2% mismatches in the reads. The notations are the same as in Table 1.
k_c | initial nodes | largest tangle | largest SCC | splicing graphs | max length | N50 | >1-node graphs | max nodes | avg nodes | SNPs | total hits | unique hits | >1-hit graphs | max hits | time (mins) | memory (GB) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25_3 | 45305 | 23504 | 15883 | 13240 | 26909 | 2255 | 634 | 8671 | 27 | 2049 | 8258 | 6188 | 315 | 16 | 94,3 | 30,2 |
25_5 | 29090 | 16349 | 11411 | 11734 | 27251 | 2321 | 606 | 1832 | 11 | 337 | 8156 | 6180 | 321 | 12 | 94,3 | 30,2 |
25_10 | 26297 | 15235 | 10367 | 11606 | 27251 | 2329 | 595 | 165 | 8 | 257 | 8116 | 6176 | 319 | 13 | 94,3 | 30,2 |
31_3 | 23544 | 5604 | 2331 | 11993 | 44990 | 2536 | 583 | 1520 | 12 | 611 | 9561 | 8488 | 281 | 17 | 83,3 | 21,2 |
31_5 | 19869 | 4299 | 2097 | 11650 | 44990 | 2545 | 571 | 253 | 7 | 248 | 9548 | 8488 | 281 | 13 | 83,3 | 21,2 |
31_10 | 19541 | 4222 | 2056 | 11642 | 44990 | 2545 | 572 | 96 | 7 | 233 | 9544 | 8484 | 281 | 13 | 83,3 | 21,2 |
When k is small, the larger number of splicing graphs resulted in more complete assemblies, although the sequences were shorter and thus more fragmented. When k is large, the maximum and median (N50) lengths of splicing graphs approached the maximum and median lengths of gene transcripts in the known Drosophila genome, which are 69439 and 3231 respectively. Between 5 to 12% of splicing graphs had non-linear structures. These values are a significant portion of the percentage of known Drosophilagenes that have more than one alternatively spliced variant, which is 27%. A small number of SNPs were recovered, which may be due to variations in repeats or the inability to separate gene families.
When compared to the total number of BLAST hits, the number of unique BLAST hits to different Drosophila genes was not much smaller. When compared to the total number of splicing graphs, only a small number of graphs have BLAST hits to more than one gene. Within these graphs, the maximum number of different genes that have BLAST hits to a graph was small, thus we have mostly achieved the goal that each splicing graph should represent alternatively spliced variants of only one gene. When the k-mer coverage cutoff c is 3, the number of junctions and some of the splicing graphs were very large. Otherwise, the results were similar over different cutoffs for the same value of k. This is due to the consistent high coverage that is guaranteed by the simulation.
Drosophilapublic libraries
To investigate the transcriptome assembly of Drosophila under realistic conditions, we obtained reads of length 75 from six RNA-Seq libraries in [20] at the sequence read archive [21] that include the following developmental stages: 2-16 hours embryos (SRX019647), third instar larvae (SRX019648), mixed pupae (SRX019651, two replicates), adult females (SRX019652), and adult males (SRX019653). Since sequence quality decreases toward the end of a read, we trimmed each read by removing all positions including and to the right of the first position that has a quality score of less than 15, giving a total of 102262392 reads with average length 40. We compare the performance of our postprocessing algorithm to Oases and Trans-ABySS on machines with 32 GB physical memory. Since the memory requirement of Oases exceeds 32 GB when the k-mer length is small, we fix k to 35.
For Oases, Velvet was applied with hash length k without setting cov_cutoff while enabling read_trkg. Oases was then applied on the results from Velvet with cov_cutoff c. For Trans-ABySS, abyss-pe was applied with k-mer size k, mean k-mer coverage threshold c, and minimum number of pairs n=10. Trans-ABySS was then applied on the results from abyss-pe by utilizing the assembly.py script with the single k-mer length. For our postprocessing algorithm and Oases, all reads were treated as single-end reads.
Comparisons of the Drosophila transcriptome assemblies of our postprocessing algorithm, Oases and Trans-ABySS using six publicly available libraries over different values of k-mer coverage cutoff c.
postprocess | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
k_c | initial nodes | largest tangle | largest SCC | splicing graphs | max length | N50 | >1-node graphs | max nodes | avg nodes | SNPs | total hits | unique hits | >1-hit graphs | max hits | time (mins) | memory (GB) |
35_3 | 227614 | 178545 | 88094 | 75367 | 10539 | 544 | 2048 | 124 | 6 | 16703 | 38448 | 10719 | 392 | 5 | 86,18 | 22,2 |
35_5 | 125414 | 87895 | 41654 | 47958 | 8678 | 705 | 1720 | 93 | 6 | 11334 | 27010 | 9889 | 429 | 13 | 86,17 | 22,2 |
35_10 | 57978 | 31785 | 12695 | 27695 | 6383 | 705 | 1020 | 63 | 6 | 5034 | 17271 | 8070 | 308 | 5 | 86,16 | 22,2 |
Oases | ||||||||||||||||
k_c | locus | max length | N50 | >1-trans locus | max trans | avg trans | total hits | unique hits | >1-hit locus | max hits | time (mins) | memory (GB) | ||||
35_3 | 39584 | 15586 | 801 | 3824 | 13 | 3 | 29928 | 10898 | 256 | 4 | 94,28 | 29,32 | ||||
35_5 | 28537 | 15586 | 936 | 2616 | 16 | 3 | 22460 | 10103 | 245 | 4 | 94,26 | 29,30 | ||||
35_10 | 17075 | 11104 | 982 | 1377 | 14 | 3 | 13800 | 8201 | 185 | 5 | 94,24 | 29,26 | ||||
Trans-ABySS | ||||||||||||||||
k_c | trans | max length | N50 | >1-node trans | max nodes | avg nodes | total hits | unique hits | time (mins) | memory (GB) | ||||||
35_3 | 91365 | 15586 | 898 | 50467 | 60 | 8 | 33600 | 10639 | 205,1 | 4,1 | ||||||
35_5 | 55164 | 10582 | 997 | 27763 | 46 | 7 | 25779 | 9944 | 195,1 | 4,1 | ||||||
35_10 | 28455 | 8865 | 929 | 13665 | 43 | 6 | 16032 | 8154 | 178,1 | 4,1 |
Comparisons of the Drosophila transcriptome assemblies of our postprocessing algorithm, Oases and Trans-ABySS using four publicly available libraries over different values of k and k-mer coverage cutoff c. The notations are the same as in Table 3.
postprocess | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
k_c | initial nodes | largest tangle | largest SCC | splicing graphs | max length | N50 | >1-node graphs | max node | avg nodes | SNPs | total hits | unique hits | >1-hit graphs | max hits | time (mins) | memory (GB) |
31_3 | 293034 | 251819 | 132958 | 87306 | 7571 | 542 | 1914 | 36 | 5 | 13216 | 37135 | 10752 | 516 | 7 | 81,24 | 20,2 |
31_5 | 153123 | 115511 | 60504 | 53199 | 9708 | 748 | 1881 | 98 | 5 | 8419 | 27103 | 9868 | 683 | 8 | 81,23 | 20,2 |
31_10 | 70809 | 36861 | 19839 | 35955 | 7393 | 621 | 1224 | 108 | 5 | 3746 | 22037 | 8399 | 442 | 8 | 81,21 | 20,2 |
35_3 | 175184 | 123605 | 85923 | 73584 | 7525 | 559 | 2246 | 79 | 6 | 10311 | 37115 | 10565 | 737 | 8 | 81,22 | 20,2 |
35_5 | 98897 | 58409 | 40689 | 47081 | 9382 | 731 | 1808 | 134 | 6 | 6741 | 26560 | 9631 | 743 | 12 | 81,21 | 20,2 |
35_10 | 48595 | 19438 | 13375 | 28269 | 7008 | 706 | 1062 | 90 | 5 | 2967 | 17829 | 7883 | 461 | 8 | 81,19 | 20,2 |
Oases | ||||||||||||||||
k _c | locus | max length | N50 | >1-trans locus | max trans | avg trans | total hits | unique hits | >1-hit locus | max hits | time (mins) | memory (GB) | ||||
31_3 | 35587 | 15986 | 994 | 4881 | 18 | 3 | 26559 | 10819 | 410 | 5 | 87,24 | 25,27 | ||||
31_5 | 26679 | 15906 | 1109 | 3234 | 20 | 3 | 21084 | 9944 | 336 | 5 | 87,21 | 25,21 | ||||
31_10 | 21283 | 8174 | 877 | 1637 | 16 | 3 | 17225 | 8449 | 188 | 4 | 87,19 | 25,21 | ||||
35_3 | 37377 | 9826 | 846 | 3724 | 16 | 3 | 28492 | 10652 | 346 | 6 | 75,14 | 17,17 | ||||
35_5 | 27573 | 12562 | 979 | 2644 | 14 | 3 | 21992 | 9751 | 332 | 5 | 75,13 | 17,17 | ||||
35_10 | 18072 | 7934 | 939 | 1389 | 12 | 3 | 14614 | 7953 | 194 | 5 | 75,12 | 17,17 | ||||
Trans-ABySS | ||||||||||||||||
k _c | trans | max length | N50 | >1-node trans | max nodes | avg nodes | total hits | unique hits | time (mins) | memory (GB) | ||||||
31_3 | 113157 | 14353 | 1149 | 72266 | 56 | 6 | 33780 | 10527 | 201,1 | 4,1 | ||||||
31_5 | 62292 | 14395 | 1282 | 37656 | 72 | 6 | 24614 | 9810 | 193,1 | 4,1 | ||||||
31_10 | 32509 | 17057 | 1075 | 19837 | 50 | 5 | 16676 | 8313 | 172,1 | 4,1 | ||||||
35_3 | 76220 | 14351 | 1142 | 40606 | 79 | 6 | 31619 | 10288 | 179,1 | 4,1 | ||||||
35_5 | 46431 | 14385 | 1239 | 23632 | 38 | 5 | 23451 | 9603 | 172,1 | 4,1 | ||||||
35_10 | 24956 | 9139 | 1095 | 12968 | 30 | 5 | 15057 | 7956 | 154,1 | 4,1 |
Conclusions
We have developed a postprocessing algorithm that can recover alternative splicing information directly from de Bruijn graphs of RNA-Seq data. Our strategy does not require prior genomic knowledge and supports the study of differential expression through investigating de novo RPKM values [11]. The computational time is linear in the size of the de Bruijn graph, and our algorithm takes a few minutes to half-an-hour to complete after results from Velvet are available in the test cases (see Tables 1-4). It uses less memory than Velvet, while running Oases together with Velvet without setting cov_cutoff is often more memory-intensive than running Velvet with cov_cutoff (see Tables 1-4). Our algorithm performs well on simulations with low percentages of mismatches in the reads and generally has higher specificity than Oases or Trans-ABySS. It is most suitable in situations in which a more reliable assembly is desired at the expense of lower sensitivity. Our algorithm has been applied to perform transcriptome assembly of the non-model blow fly Lucilia sericata in [11], which allows comparison of its transcriptome to the closely related model organism Drosophila through translated BLAST search, investigation of alternative splicing and differential expression among various developmental stages, and identification of transposable elements.
Declarations
Acknowledgements
Computations were performed on the Brazos Cluster at Texas A&M University. AMT was supported by startup funds from the College of Agriculture and Life Sciences at Texas A&M University and Texas Agrilife Research. S-HS was supported by NSF grant MCB-0951120. This work was supported in part by NIJ grant 2012-DN-BX-K024. Points of view in this document are those of the authors and do not necessarily represent the official position or policies of the U.S. Department of Justice.
Declarations
Publication costs for this work were funded by the Open Access to Knowledge (OAK) Fund at the Texas A&M University Libraries.
This article has been published as part of BMC Genomics Volume 15 Supplement 5, 2014: Selected articles from the Third IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S5.
Authors’ Affiliations
References
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. 2007, 17: 1697-1706. 10.1101/gr.6435207.PubMedPubMed CentralView ArticleGoogle Scholar
- Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 2008, 18: 810-820. 10.1101/gr.7337908.PubMedPubMed CentralView ArticleGoogle Scholar
- Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome Res. 2008, 18: 324-330. 10.1101/gr.7088808.PubMedPubMed CentralView ArticleGoogle Scholar
- Hernandez D, François P, Farinelli L, Østerås M, Schrenzel J: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 2008, 18: 802-809. 10.1101/gr.072033.107.PubMedPubMed CentralView ArticleGoogle Scholar
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18: 821-829. 10.1101/gr.074492.107.PubMedPubMed CentralView ArticleGoogle Scholar
- Birol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, Morin RD, Zhao Y, Hirst M, Schein JE, Horsman DE, Connors JM, Gascoyne RD, Marra MA, Jones SJM: De novo transcriptome assembly with ABySS. Bioinformatics. 2009, 25: 2872-2877. 10.1093/bioinformatics/btp367.PubMedView ArticleGoogle Scholar
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010, 20: 265-272. 10.1101/gr.097261.109.PubMedPubMed CentralView ArticleGoogle Scholar
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011, 29: 644-652. 10.1038/nbt.1883.PubMedPubMed CentralView ArticleGoogle Scholar
- Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, Griffith M, Raymond A, Thiessen N, Cezard T, Butterfield YS, Newsome R, Chan SK, She R, Varhol R, Kamoh B, Prabhu AL, Tam A, Zhao Y, Moore RA, Hirst M, Marra MA, Jones SJM, Hoodless PA, Birol I: De novo assembly and analysis of RNA-seq data. Nat Methods. 2010, 7: 909-912. 10.1038/nmeth.1517.PubMedView ArticleGoogle Scholar
- Schulz MH, Zerbino DR, Vingron M, Birney E: Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012, 28: 1086-1092. 10.1093/bioinformatics/bts094.PubMedPubMed CentralView ArticleGoogle Scholar
- Sze SH, Dunham JP, Carey B, Chang PL, Li F, Edman RM, Fjeldsted C, Scott MJ, Nuzhdin SV, Tarone AM: A de novo transcriptome assembly of Lucilia sericata (Diptera: Calliphoridae) with predicted alternative splices, single nucleotide polymorphisms, and transcript expression estimates. Insect Mol Biol. 2012, 21: 205-221. 10.1111/j.1365-2583.2011.01127.x.PubMedView ArticleGoogle Scholar
- Pevzner PA: l-tuple DNA sequencing: computer analysis. J Biomol Struct Dyn. 1989, 7: 63-73.PubMedGoogle Scholar
- Idury RM, Waterman MS: A new algorithm for DNA sequence assembly. J Comput Biol. 1995, 2: 291-306. 10.1089/cmb.1995.2.291.PubMedView ArticleGoogle Scholar
- Heber S, Alekseyev M, Sze SH, Tang H, Pevzner PA: Splicing graphs and EST assembly problem. Bioinformatics. 2002, 18: S181-188.PubMedView ArticleGoogle Scholar
- Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms, Second Edition. 2001, The MIT PressGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.PubMedView ArticleGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25: 1105-1111. 10.1093/bioinformatics/btp120.PubMedPubMed CentralView ArticleGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28: 511-515. 10.1038/nbt.1621.PubMedPubMed CentralView ArticleGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1016/S0022-2836(05)80360-2.PubMedView ArticleGoogle Scholar
- Daines B, Wang H, Wang L, Li Y, Han Y, Emmert D, Gelbart W, Wang X, Li W, Gibbs R, Chen R: The Drosophila melanogaster transcriptome by paired-end RNA sequencing. Genome Res. 2011, 21: 315-324. 10.1101/gr.107854.110.PubMedPubMed CentralView ArticleGoogle Scholar
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Lu Z, Madden TL, Madej T, Maglott DR, Marchler-Bauer A, Miller V, Mizrachi I, Ostell J, Panchenko A, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Slotta D, Souvorov A, Starchenko G, Tatusova TA, Wagner L, Wang Y, Wilbur WJ, Yaschenko E, Ye J: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010, 38: D5-16. 10.1093/nar/gkp967.PubMedPubMed CentralView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.