A comparison across non-model animals suggests an optimal sequencing depth for de novo transcriptome assembly

Background The lack of genomic resources can present challenges for studies of non-model organisms. Transcriptome sequencing offers an attractive method to gather information about genes and gene expression without the need for a reference genome. However, it is unclear what sequencing depth is adequate to assemble the transcriptome de novo for these purposes. Results We assembled transcriptomes of animals from six different phyla (Annelids, Arthropods, Chordates, Cnidarians, Ctenophores, and Molluscs) at regular increments of reads using Velvet/Oases and Trinity to determine how read count affects the assembly. This included an assembly of mouse heart reads because we could compare those against the reference genome that is available. We found qualitative differences in the assemblies of whole-animals versus tissues. With increasing reads, whole-animal assemblies show rapid increase of transcripts and discovery of conserved genes, while single-tissue assemblies show a slower discovery of conserved genes though the assembled transcripts were often longer. A deeper examination of the mouse assemblies shows that with more reads, assembly errors become more frequent but such errors can be mitigated with more stringent assembly parameters. Conclusions These assembly trends suggest that representative assemblies are generated with as few as 20 million reads for tissue samples and 30 million reads for whole-animals for RNA-level coverage. These depths provide a good balance between coverage and noise. Beyond 60 million reads, the discovery of new genes is low and sequencing errors of highly-expressed genes are likely to accumulate. Finally, siphonophores (polymorphic Cnidarians) are an exception and possibly require alternate assembly strategies.


Background
RNA-seq has provided a powerful tool for analysis of 26 transcriptomes. For non-model organisms with limited 27 genomic information, transcriptome sequencing provides 28 a cost-saving tool by only sequencing functional and 29 protein coding RNAs, thus providing direct information 30 about the genes [1]. There are many benefits of sequencing 31 a genome, but for relatively large genomes such as human 32 and mouse, protein coding regions account for under 5%, Cruz, CA, USA Full list of author information is available at the end of the article thus most of the sequencing effort would go to sequenc-34 ing either regulatory regions or repetitive elements [2]. 35 Smaller genomes could be sequenced and assembled to 36 complement the transcriptomes, though this is not a 37 tractable approach if a genome is quite large. Even still, de 38 novo genome assembly can produce errors by itself [3]. 39 Despite its advantage, transcriptome assembly does 40 present additional challenges when compared to genome 41 assembly. Unlike genomes where most sequences should 42 be approximately equally represented, coverage of any 43 given sequence in a transcriptome can vary over sev-44 eral orders of magnitude due to expression differences 45 [4]. Because coverage can vary, there is also a question 46 of sequencing depth. Theoretically, there is a sequenc-47 ing depth beyond which addition of more reads does not 48 provide new information, known as the saturation depth. 49 original library. It is suggested that sequencing of very 103 small numbers of reads can be most subject to biases 104 and that cDNA normalization can improve the unifor-105 mity of the library at low numbers of reads [17]. Such 106 an approach might be quite costly, and the computational 107 sub-sampling approach has the advantage of drawing from 108 the largest pool of reads and avoid biases which could 109 occur at low numbers of reads. Subsets of the filtered 110 library were generated containing 1,5,10,20,30,40,50,60,111 and 70 million reads. Reads from each set were included 112 in the next largest set, thus all of the reads in the 1 million 113 set are included in the 5 million read set, and so forth. 114 These sets were assembled with Velvet/Oases [18,19] and 115 Trinity [20] (For a detailed comparison of assemblers, 116 see [21]). 117 Schulz et al. reported reliable parameters for Oases 118 which produced high-quality assemblies of mouse and 119 human cell cultures, using 64 million and 30 million reads, 120 respectively [19]. This included use of a broad k-mer range 121 with a low starting k-mer of 19 or 21 up to a k-mer of 33 122 or 35. Accordingly we used k-mers from 21 to 33. Also, a 123 minimum k-mer coverage is required by Oases to retain 124 any given node during the assembly process; by default 125 this is 3 in Oases, that is, any node must have at least 126 three-fold coverage for that node to be used. Some differ-127 ences were observed in the output when this parameter 128 was changed, and so the same data were assembled with 129 coverage cutoff of 3 (referred to hereafter as C3) and a 130 stricter cutoff of 10 (C10).

131
The number of transcripts (Oases terminology for con-132 tigs) increases steadily for all assemblies ( Figure 1A). C10 F1 133 also had substantially fewer transcripts and accordingly 134 much higher mean and median lengths ( Figure 1B-D). 135 The pattern of increase for median and N50 (length for 136 which half of the total bases are in contigs of this length or 137 longer) tracked the mean for the C10 assembly, but not the 138 C3 assembly which did not have a clear qualitative pattern. 139 The mean, median and N50 were all lower for the Trinity 140 assembly than the C3 despite having far fewer contigs.

141
Oases generates transcript "loci", which is Oases ter-142 minology for the de-Brujin graph clusters meant to rep-143 resent genes and their splice variants or highly-similar 144 paralogs. Both curves approach to a plateau for locus 145 counts ( Figure 1E-F). The greatest increase in loci was 146 between using 10 million to 20 million reads for both 147 C3 and C10. Similarly, the C3 assembly shows a decrease 148 in the number of transcripts per read ( Figure 1G), while 149 the C10 assembly shows an almost constant number of 150 transcripts per read. The number of transcripts increases 151 while the number of loci tend to level off and this means 152 the number of transcripts per locus always increases with 153 more reads ( Figure 1H). That is, on average, more variants 154 will be generated with more reads even though some of 155 these are likely due to noise. While the Trinity assembly 156  genes showed a much higher value of 51.24% [22,23]. 166 Interestingly, for all assemblies except for mouse, the aver-167 age GC content of the assembled contigs was lower than 168 that of the raw reads ( Figure 2), suggesting either that   Four of the animals showed modest gains in mean, 193 median and N50 with more reads (average 20% from 194 fewest to most reads), while P.robusta and H.californensis 195 nearly doubled from the fewest to the most reads 196 ( Figure 3B-D). Most of the transcript-length increase 197 occurred before 30 million reads, suggesting that adding 198 more reads did not produce longer sequences beyond 199 that threshold, or that they became longer at the same 200 rate that new, short transcripts were generated. As with 201 the mouse samples, transcripts were added continually 202 with more reads ( Figure 3A). Compared to the mouse, 203    However, gene duplications present difficulties for such 257 assessments unless one had a priori knowledge of how 258 many copies should be present in the genome. For this 259 study, we also used the subset of eukaryotic KOGs con-260 taining 248 genes from the CEGMA pipeline which 261 were identified as single-copy orthologs in most genomes 262 [26,27]. Almost one third of these KOGs are involved 263 in processes like transcription and translation and were 264 expected to be expressed in many tissues. Trinity and 265 Oases with a lower coverage cutoff of 3 found simi-266 lar numbers of KOGs at much lower numbers of reads 267 ( Figure 4B) than compared to the C10 assembly. Also 268 more KOGs were found within expected length much 269 faster with C3 than with the higher cutoff of 10, and 270 the Trinity assembly outperformed both of these. These 271 results suggest that it is better to have a lower cutoff and 272 assemble more sequences. Likewise, the Trinity assem-273 bly had more transcripts than C10 and were shorter than 274 those in C3, yet more KOGs were found with fewer 275 reads and more coding transcripts were correctly assem-276 bled at greater numbers of reads. However, for the Oases 277 assemblies this had remarkably little effect on the number 278 of correct canonical proteins that were found (Figure 4, 279 triangles). Although there is some overestimation, no pro-280 tein designated as too short was ever correct. Regarding 281 the fate of the other full-length proteins, for C3 at 70 mil-282 lion reads, 186 KOGs were found within the expected 283 range, though only 131 were correct. Eight of the 186 284 KOGs had only 1 mismatch in the amino-acid sequence 285 compared to the reference protein which could be due 286 to errors, splice variants, tissue-specific modifications or 287 alleles. The remaining KOGs had at least two amino-288 acid changes but were within the size range. Thus for 289 the mouse, the size range was a reliable predictor of true 290 full-length proteins. 292 We then examined our invertebrate transcriptomes for 293 completion using the same set of KOGs. There was a 294 clear, qualitative difference between whole-body organ-295 isms ( Figure 5A) and dissected tissues ( Figure 5B). C10 F5 296 mouse data are included for reference. For whole-body 297 transcriptomes, over 90% of the KOGs were detectable at 298 20 million reads, yet the number of within-length KOGs 299 went down with higher numbers of reads past 20 mil-300 lion. This could be caused if proteins declared to be 301 within-range were longer than the true protein due to mis-302 assembly causing addition of pieces, or if the true protein 303 became mis-assembled with addition of noisy reads. In 304 nearly all of our assemblies, it was the latter: mis-assembly 305 of the putative protein which generated stop codons. 306 C.multidentata ( Figure 5A, purple) was again exceptional, 307 as the number of within-length KOGs increased more 308 slowly with addition of more reads than the other two 309 we expected that other abundant transcripts should mis-322 assemble at high numbers of reads in that tissue. However, 323 the dissected-tissue transcriptomes had longer transcripts 324 and fewer loci, suggesting this was not the case. Since 325 whole-animal transcriptomes include all tissues, a greater 326 proportion of the genome is expressed so coverage of any 327 given transcript or splice-variant is proportionally much 328 lower. The length saturation patterns appear to be dif-329 ferent between whole-animal and tissue transcriptomes. 330 However, using conserved genes as a metric, there appears 331 to be limited benefit of sequencing beyond 60 million 332 reads.

381
In this study, number of whole animals and tissues from 382 non-model organisms and one mouse organ were assem-383 bled and the completeness was assessed using a set of 384 conserved genes. Additionally, a comparison was made 385 between two high-performing assemblers with respect to 386 the mouse data. Oases required much greater memory 387 usage while Trinity had much longer run times (approx-388 imately 2-fold longer). Both Trinity and Oases perform 389 comparably at assembling conserved genes across a large 390 set, indicating that the saturation depth is not greatly 391 affected by assembler choice.

392
Overall, these results suggest that for whole-body tran-393 scriptomes and individual organs or cells, 30 and 20 394 million reads are sufficient for mRNA level coverage, 395 respectively. For the read length used in this study, that 396 would produce 2-3 gigabases of sequence. It should be 397 noted that the mouse data consisted of shorter reads than 398 used for the invertebrates, but this did not appear to have 399 substantial effect as this difference was only between 75bp 400 reads and 100bp reads. Assembly errors are evident in 401 whole-body transcriptomes after 30 million reads, and the 402 average length appeared to level off at the same depth. 403 Presumably this depth would apply for studies of differen-404 tial expression as well, as the highly expressed transcripts 405 should be present and distinguishable at that sequencing 406 depth. In our experience, we find it is optimal to acquire 407 between 50 and 60 million reads, and then sub-sample up 408 around 20 or 30 million. This approach reliably assembles 409 nearly all proteins of interest. There are still observable 410 differences between assemblies, although some of these 411 differences may ultimately be due to variations in RNA 412 quality or properties of the animal.   [28]. We generated a script to blast and ana-467 lyze the matches, kogblaster.py (on the public repository, 468 as above). Briefly, the reference KOGs (860 orthologous 469 groups from NCBI, or 248 orthologous groups, from 470 http://korflab.ucdavis.edu/Datasets/cegma/) were aligned 471 to each assembly with tblastn with an e-value cutoff of 472 10 −6 . For each alignment, the subject hit was translated 473 and coding sequences were only kept if they contained 474 both start and stop codons. From this subset, the best 475 alignment was declared to be the correct sequence. Next, 476 the length of the correct sequence was used to estimate 477 whether that sequence was full-length relative to the con-478 served orthologs. For each KOG in the CEGMA dataset, 479 there were 6 proteins from 6 species and there was some 480 variability in protein length (average 11.8% from longest 481 mately, only those within the size range (1) were declared 499 as full-length sequences.

500
The animals in this study were treated ethically and