RNA-seq has provided a powerful tool for the analysis of transcriptomes. For non-model organisms with limited genomic information, transcriptome sequencing offers a cost-saving approach by sequencing only functional, protein-coding RNAs, thus providing direct information about the genes. Sequencing a genome has many benefits, but in relatively large genomes such as those of human and mouse, protein-coding regions account for less than 5%, so most of the sequencing effort would go toward regulatory regions or repetitive elements. Smaller genomes could be sequenced and assembled to complement the transcriptomes, though this approach is not tractable when a genome is very large. Even so, de novo genome assembly can itself introduce errors.
Despite these advantages, transcriptome assembly presents additional challenges compared to genome assembly. Unlike genomes, where most sequences should be approximately equally represented, the coverage of any given sequence in a transcriptome can vary over several orders of magnitude due to differences in expression. Because coverage varies, the question of adequate sequencing depth arises. Theoretically, there is a sequencing depth beyond which the addition of more reads provides no new information, known as the saturation depth. Several studies have used approaches that map reads onto reference genomes, and these have suggested saturation depths for 95% gene coverage ranging from 1.2 million to 50 million reads at the mRNA level, and up to 700 million reads for splice variants [5–7]. However, these studies all used short reads of around 36 bp and did not assemble the transcriptomes de novo.
Several recent studies have already made use of next-generation sequencing reads for de novo transcriptome assembly [8–15]. The number of reads used for assembly in these studies varies widely, ranging from 2.6 million up to 106 million reads [10, 11]. The assembly strategies are equally varied, but all share the initial step of removing low-quality reads and adapters, after which all remaining reads are assembled. Estimates of assembly quality vary as well, the most common measure being based on BLAST hits to public databases such as UniProt, though it has been noted that the under-representation of many taxa in public databases limits this approach.
While many parameters can be optimized for a specific assembly, acquiring additional reads by resequencing is both inconvenient and costly. At present, there is no clear consensus on what sequencing depth is optimal or what factors determine an adequate depth. With too few reads, the problems are obvious: genes or splice variants are omitted. On the other hand, it has been suggested that greater depth may introduce errors into differential expression analyses, increase cost, and lengthen assembly time. Therefore, here we apply the same assembly strategy across a diverse set of organisms to isolate the effects of read count on assembly quality and to obtain a general estimate of the optimal read count. We compare trends from de novo assemblies across six phyla. These animals include the mouse (used as a control for the non-model samples), the Humboldt squid Dosidicus gigas, the scaleworm Harmothoe imbricata, the decapod Sergestes similis, the copepod Pleuromamma robusta, the ctenophore Hormiphora californensis, and the siphonophore Chuniphyes multidentata. To our knowledge, this is the first study to suggest an optimal number of reads for de novo assembly for the purposes of mRNA-level analysis. These results are applicable to studies of organisms with limited genomic resources.