The transcriptome of each of two species of color-polymorphic theridiid spider was sequenced using Illumina technology and assembled using the assembler Trinity. By sequencing pools of individuals at great depth, and by combining RNA-seq libraries with sequencing libraries derived from normalized cDNA (ncDNA), we have been able to reconstruct the transcriptome of each species with apparent completeness. The great utility of RNA-seq data comes from its ability to capture digital gene expression information in the form of relative read coverage. Consequently, RNA-seq is biased towards generating sequence from the most highly expressed transcripts. Since many transcripts are likely to be rare, with perhaps less than 1% of expressed genes accounting for 50% of cellular mRNA, a typical RNA-seq experiment will fail to record sequence from many of them. By using both ncDNA-derived data and RNA-seq data we have been able both to assemble rare transcripts into contigs and to tentatively examine DE. The contribution of the ncDNA data to the assemblies was clear, as only 70-80% of the RNA-seq reads mapped back to the Metazoan blastx-positive components (>100 bp). However, it is also likely that the use of ncDNA resulted in the detection of a large and diverse spider “meta-transcriptome”: an inventory of expressed genes from organisms associated with the spiders (endo- and ectoparasites, commensal and external contaminant organisms). The fascinating discrepancy in the proportion of non-spider sequences between the temperate, mainland species T. californicum and the tropical, island species T. grallator will be explored elsewhere.
Our transcriptome assemblies are naturally not complete, either in sampling the full diversity of genes and their various isoforms or in the full-length assembly of those genes into contigs. Since the detection of gene transcripts by transcriptome sequencing depends upon the expression of those transcripts, transcripts that are only expressed at certain life stages will be missed. Since adult female spiders contain developing eggs, our sampling of this life stage will naturally also capture some transcripts from early development. Accepting the absence of some life-stage-specific transcripts, several lines of evidence indicate that our gene sampling is otherwise quite comprehensive. First, the numbers of coding genes predicted, and other characteristics of the assemblies, were consistent between the two species (for example see Table 2 and Figure 1), with the number of Metazoan blastx-positive components (>200 bp) differing by only 8% (T. californicum: 20,611; T. grallator: 18,868). Second, the distributions of the top hit taxa and associated E-values (Additional file 3: Figures S2 and S3) from the blastx homology searches, as well as all GO-term assignment analyses (Additional file 3: Figures S5, S7, S9), were remarkably consistent across both species. Furthermore, when GO-terms were assigned to gene families the two species shared 131 of 135 (97.04%) unique GO terms. Third, the CEGMA analysis (Supplementary Sections 8–12, Additional file 3: Tables S3-S6) indicated that 99% (T. californicum) and 98% (T. grallator) of the 248 CEGs were at least partially represented.
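The two consistency figures quoted above can be checked with simple arithmetic. The short sketch below uses only counts stated in the text; the variable names are illustrative, and it assumes the difference in component counts is expressed relative to the larger (T. californicum) count, which gives a value of roughly 8%.

```python
# Counts quoted in the text (Metazoan blastx-positive components, >200 bp).
t_californicum = 20611
t_grallator = 18868

# Assumption: the difference is expressed relative to the larger count.
pct_diff = 100 * (t_californicum - t_grallator) / t_californicum

# Unique GO terms assigned to gene families: shared vs. total.
shared_go, total_go = 131, 135
pct_shared = 100 * shared_go / total_go

print(f"difference in component counts: {pct_diff:.1f}%")
print(f"shared GO terms: {pct_shared:.2f}%")
```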
The transcriptomes of T. californicum and T. grallator contain a large number of contigs that represent components or “genes” (>200 bp: T. californicum 83,701; T. grallator 89,166; Table 1). These components include both protein-coding genes (whose sequence includes untranslated regions (UTRs), i.e. 5′UTR, 3′UTR, and transcribed introns) and transcribed non-coding sequences. The non-protein-coding genes (i.e. microRNA, ribosomal RNA, transfer RNA, transposons and transposable elements) likely comprise more than 50% of the spider transcriptome, but we have not attempted to characterize these here. The set of putative protein-coding components is, however, impressive, and we estimate that these species express at least 18,868 (>200 bp) protein-coding genes and probably in excess of 21,495 (>200 bp; perhaps many more if contigs between 100 and 199 bp are considered). Theridion spiders, assuming that T. californicum and T. grallator are representative of the genus, therefore appear to have more protein-coding genes than the well-characterized two-spotted spider mite Tetranychus urticae (18,423) and a similar number to Homo sapiens (21,828). For T. californicum and T. grallator only ca. 4.5% of the Markov-predicted genes (shared among the species and not microbial) had no known homologue. Given the large number of Araneae-specific gene families (Figure 1), this low percentage of genes with no known homologues may seem surprising. However, many of these homology matches are likely to stem from the fact that the relatively few spider-derived protein and EST sequences available in public databases are biased towards those that are specific to spiders, i.e. sequences from venom and silk gland EST-sequencing experiments (e.g. Latrodectus hesperus; see Additional file 3: Figure S3), together with venom-gland sequences from other organisms. Of 961 curated venom peptide sequences downloaded from ArachnoServer, T. californicum had 18 and T. grallator had only 14 (23 overall for both species) RBH blast matches to diverse arachnid venom peptides (see Additional file 3: Table S12), so if many Theridion genes do code for venom peptides then these might be mostly unknown. Until the reads/transcripts can be mapped back to a reference genome it is not possible to be sure about the numbers of Theridion genes. Our transcripts are de novo assembled and will include erroneously concatenated transcripts as well as single transcripts that have been split into separate components. Fragmentation is likely to be common for highly repetitive silk genes, for example, and we have demonstrated that short contigs (100–199 bp) are likely to contain many fragments of single genes. However, this is unlikely to detract from the fact that the gene catalogue for these spiders, the first comprehensive list for any spider, is undoubtedly large.
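The reciprocal-best-hit (RBH) criterion used for the venom peptide matches can be illustrated with a minimal sketch. It assumes that BLAST results have already been parsed into dictionaries mapping each query to a list of (subject, bit score) pairs; the function name, input shape, identifiers and scores are all hypothetical and do not reproduce the pipeline used in this study.

```python
def reciprocal_best_hits(hits_a_to_b, hits_b_to_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a.

    Each argument maps a query ID to a list of (subject_id, bit_score)
    tuples, e.g. parsed from tabular BLAST output (hypothetical input shape).
    """
    def best(hits):
        # Keep the highest-scoring subject per query (first one wins on ties).
        return {q: max(subjects, key=lambda s: s[1])[0]
                for q, subjects in hits.items() if subjects}

    best_ab = best(hits_a_to_b)
    best_ba = best(hits_b_to_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy example with made-up identifiers and scores.
spider_vs_venom = {
    "comp1": [("toxinA", 210.0), ("toxinB", 95.0)],
    "comp2": [("toxinB", 120.0)],
    "comp3": [("toxinA", 80.0)],
}
venom_vs_spider = {
    "toxinA": [("comp1", 205.0), ("comp3", 78.0)],
    "toxinB": [("comp1", 90.0), ("comp2", 118.0)],
}
rbh_pairs = reciprocal_best_hits(spider_vs_venom, venom_vs_spider)
print(rbh_pairs)
```

In this toy case comp3 also hits toxinA, but toxinA's best hit is comp1, so comp3 is not reported. This strictness is one reason RBH counts against a curated database can be low even when weaker one-way similarity exists.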
In this study, pooling individuals placed a constraint upon our ability to measure DE between the (double recessive) Yellow and (dominant) Colored morphs of these spiders and hence to detect gene pathways associated with the color polymorphism. Without true biological replicates, estimation of the coefficient of variation, and hence testing of statistical significance, becomes impossible. We attempted to circumvent this limitation by borrowing from microarray approaches, normalizing read counts and estimating common dispersion from a defined set of house-keeping (HK) genes. Even so, over such a large set of genes this approach was still of limited utility (as evidenced by the lack of congruence between the two species in terms of numbers of DE genes and enriched GO-terms; Supplemental Sections 23–26, Additional file 3: Tables S10, S11 and Figure S11). Consequently, we chose to focus on the subset of ommochrome- and pteridine-associated genes identified by RBH against D. melanogaster homologues in a survey of pigment-pathway associated genes. Since homology was established among the pigment genes and among the HK genes, we were able to use the two species as biological replicates, and although statistical power was still weak for significance testing, both species showed a marked and congruent increase in expression of pigment-associated genes in Colored individuals. This result is logical since it is known that the Yellow form is double recessive with respect to all the patterned, colored morphs. As such, the recessive Yellow alleles would be expected to show lower expression levels for associated pigment genes when compared to the dominant Colored alleles, and this one-tailed expectation is corroborated by both ommochrome and pteridine pigment pathway genes (Figure 4). These results are also important because they demonstrate that many pigmentation genes are differentially expressed in adult spiders, i.e. expression is not restricted to younger instars, perhaps because pigment granules are constantly being cycled. The implication of a role for pteridines in the color polymorphism of these spiders is also very significant because 1) pteridine pigments have not been described in spiders, and 2) the involvement of this pathway provides an intriguing link between stored guanine and overlying yellow, red and very dark-brown pigments, which have been assumed to be exclusively ommochrome-derived. Together these components interact to generate the various color morphs [6, 23]. Of course, the mere presence of the pteridine pathway genes does not necessarily mean that the animals generate pteridine pigments in any appreciable amount, even if it is suggestive of this.
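The logic of house-keeping normalization followed by a one-tailed comparison can be sketched as follows. This is a deliberately simplified illustration, not the analysis performed in this study: it does not estimate common dispersion, it scales each library by its total HK-gene count, and it substitutes a one-tailed sign test across pigment genes for the significance test actually used. All gene names, read counts and the pseudocount are invented for illustration.

```python
import math

def hk_scale(counts, hk_genes):
    """Divide every count by the library's total over house-keeping genes."""
    hk_total = sum(counts[g] for g in hk_genes)
    return {g: c / hk_total for g, c in counts.items()}

def log2_fold_changes(colored, yellow, genes, pseudo=1e-6):
    """log2(Colored / Yellow) per gene; the pseudocount avoids log(0)."""
    return {g: math.log2((colored[g] + pseudo) / (yellow[g] + pseudo))
            for g in genes}

def one_tailed_sign_test(n_up, n_total):
    """P(X >= n_up) for X ~ Binomial(n_total, 0.5)."""
    tail = sum(math.comb(n_total, k) for k in range(n_up, n_total + 1))
    return tail / 2 ** n_total

# Made-up read counts for two pooled libraries (hypothetical gene names).
hk = ["hk1", "hk2"]
pigment = ["pig1", "pig2", "pig3", "pig4", "pig5"]
colored_raw = {"hk1": 1000, "hk2": 2000,
               "pig1": 90, "pig2": 60, "pig3": 45, "pig4": 30, "pig5": 12}
yellow_raw = {"hk1": 500, "hk2": 1000,
              "pig1": 20, "pig2": 15, "pig3": 10, "pig4": 5, "pig5": 8}

lfc = log2_fold_changes(hk_scale(colored_raw, hk),
                        hk_scale(yellow_raw, hk), pigment)
n_up = sum(1 for v in lfc.values() if v > 0)   # genes higher in Colored
p_value = one_tailed_sign_test(n_up, len(pigment))
print(n_up, len(pigment), p_value)
```

With so few genes even a consistent direction of change gives only weak evidence, which mirrors the limited statistical power noted above. Note also that the Colored library here has twice the HK total of the Yellow library, so raw counts alone would overstate the difference.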
This homology-based approach to pathway-gene identification works because of the deep evolutionary conservation of the pathways associated with the production of many animal pigments. Indeed, pigments are often derived from the waste or terminal products of key metabolic processes such as heme and guanine, or metabolites generated during the production and recycling of the cofactor H4biopterin. Nonetheless, the pathways and the enzymes recruited into various roles do vary, and the assumption that spider homologues to D. melanogaster enzymes should have equivalent roles is not trivial, especially given that these organisms probably had a last common ancestor some 725 Ma.