Multi-tissue transcriptomics of the black widow spider reveals expansions, co-options, and functional processes of the silk gland gene toolkit

Background: Spiders (Order Araneae) are essential predators in every terrestrial ecosystem largely because they have evolved potent arsenals of silk and venom. Spider silks are high-performance materials made almost entirely of proteins, and thus represent an ideal system for investigating genome-level evolution of novel protein functions. However, genomic-level resources remain limited for spiders.

Results: We de novo assembled a transcriptome for the Western black widow (Latrodectus hesperus) from deeply sequenced cDNAs of three tissue types. Our multi-tissue assembly contained ~100,000 unique transcripts, of which > 27,000 were annotated by homology. Comparing transcript abundance among the different tissues, we identified 647 silk gland-specific transcripts, including the few known silk fiber components (e.g. six spider fibroins, spidroins). Silk gland-specific transcripts are enriched compared to the entire transcriptome in several functions, including protein degradation, inhibition of protein degradation, and oxidation-reduction. Phylogenetic analyses of 37 gene families containing silk gland-specific transcripts demonstrated novel gene expansions within silk glands, and multiple co-options of silk-specific expression from paralogs expressed in other tissues.

Conclusions: We propose a transcriptional program for the silk glands that involves regulating gland-specific synthesis of silk fiber and glue components followed by protecting and processing these components into functional fibers and glues. Our black widow silk gland gene repertoire provides extensive expansion of resources for biomimetic applications of silk in industry and medicine. Furthermore, our multi-tissue transcriptome facilitates evolutionary analysis of arachnid genomes and adaptive protein systems.

Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-365) contains supplementary material, which is available to authorized users.


de novo transcript assembly
Transcripts were de novo assembled using Velvet-Oases [1] and Trinity [2]. Velvet [3] is intended to assemble whole genomes, and its algorithm expects equal sequencing read depth across the entire assembly, whereas mRNA abundance can vary widely. In addition, multiple transcripts (isoforms) can be generated from a single locus. Oases [1] attempts to assemble all isoforms and accounts for unequal read depth. Trinity is a stand-alone package built for de novo assembly of mRNA-sequencing data.
Each tissue-specific library was assembled separately. Velvet-Oases was run with multiple k-mer lengths ranging from 29 to 61 in increments of 4, using the Python script supplied with Oases.
The multiple runs were merged with the Oases -merge option. Because Velvet-Oases generated 30-400 thousand transcripts depending on the library, a single "transcript" per "locus" was chosen using the oases-to-csv Python script [4]. Trinity was run with default parameters using a single k-mer of 25.
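The one-transcript-per-locus reduction described above can be sketched as follows. This is not the oases-to-csv script [4]; its selection rule is not stated here, so the rule below (highest confidence, then greatest length) is an assumption made for illustration. The function name `pick_one_per_locus` is hypothetical, and the header pattern assumes Oases-style names of the form `Locus_N_Transcript_i/n_Confidence_c_Length_l`.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumed Oases-style header, e.g.
# "Locus_12_Transcript_3/5_Confidence_0.750_Length_1834"
HEADER_RE = re.compile(
    r"Locus_(\d+)_Transcript_(\d+)/(\d+)_Confidence_([0-9.]+)_Length_(\d+)"
)


def pick_one_per_locus(headers: Iterable[str]) -> Dict[int, str]:
    """Keep one header per locus.

    Assumed criterion (not taken from oases-to-csv): prefer the
    highest confidence score, breaking ties by the greater length.
    """
    best: Dict[int, Tuple[Tuple[float, int], str]] = {}
    for header in headers:
        match = HEADER_RE.search(header)
        if match is None:
            continue  # ignore headers that do not follow the assumed pattern
        locus = int(match.group(1))
        score = (float(match.group(4)), int(match.group(5)))
        if locus not in best or score > best[locus][0]:
            best[locus] = (score, header)
    return {locus: header for locus, (_, header) in best.items()}


example_headers = [
    "Locus_1_Transcript_1/2_Confidence_0.500_Length_900",
    "Locus_1_Transcript_2/2_Confidence_0.750_Length_600",
    "Locus_2_Transcript_1/1_Confidence_1.000_Length_300",
]
chosen = pick_one_per_locus(example_headers)
```

Any other ranking (for example, longest transcript only) could be substituted by changing the `score` tuple.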
To generate the most complete possible set of L. hesperus transcripts, we combined tissue-specific assemblies using CAP3 [5]. We first ran CAP3 [5] using default parameters on each Trinity-derived tissue-specific assembly and labeled the resulting combined sequences (contigs) and singletons according to tissue type. We then concatenated all six files (tissue-specific contigs and tissue-specific singletons) and again ran CAP3 with default parameters.
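The labeling and concatenation step between the two CAP3 rounds might look like the sketch below. It does not run CAP3; it only shows how contigs and singletons from each tissue could be tagged before pooling. The label scheme (`<tissue>_<kind>_<original name>`) and the function names are assumptions for illustration, and the sequences are placeholders.

```python
from typing import Dict, Tuple


def label_fasta(fasta_text: str, tissue: str, kind: str) -> str:
    """Prefix every FASTA header with '<tissue>_<kind>_' (hypothetical scheme)."""
    labeled = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            labeled.append(f">{tissue}_{kind}_{line[1:]}")
        else:
            labeled.append(line)
    return "\n".join(labeled) + "\n"


def pool_assemblies(assemblies: Dict[Tuple[str, str], str]) -> str:
    """Concatenate labeled FASTA text from every (tissue, kind) pair."""
    return "".join(
        label_fasta(text, tissue, kind)
        for (tissue, kind), text in assemblies.items()
    )


# Three tissues x (contigs, singletons) = six inputs, as in the text.
tissues = ("silk", "venom", "cephalothorax")
first_round = {
    (tissue, kind): f">{kind}1\nACGTACGT\n"  # placeholder sequences
    for tissue in tissues
    for kind in ("contig", "singleton")
}
pooled = pool_assemblies(first_round)  # input for the second CAP3 run
```

Keeping the tissue in the sequence name means singletons from the second round can still be traced to their tissue of origin.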
For Velvet-Oases-derived assemblies, we chose the "best" transcript for each "locus" using the oases-to-csv Python script [4] for each tissue-specific assembly. We then ran CAP3 with default parameters on the three concatenated "best transcript" files. Contigs generated among libraries do not retain any tissue-specific labeling. We predicted open reading frames (ORFs) for each of the resulting assembled transcripts from both programs using GetOrf [6] and retained only those that were predicted to encode at least 30 amino acids. We compared the quality and completeness of the Trinity-derived transcriptome to that of the Velvet-Oases-derived transcriptome by comparison to previously described proteins, according to methods described in the main document.
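The 30-amino-acid ORF filter can be illustrated with a minimal six-frame scan. This is not GetOrf [6]; it is a simplified sketch that treats an ORF as a stretch of codons uninterrupted by a stop codon (one common definition, assumed here), uses the standard stop codons only, and reports codon counts rather than translated sequences. The function names are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_stretches(seq: str, min_aa: int = 30) -> List[Tuple[str, int, int]]:
    """Scan all six frames for stop-free stretches of at least min_aa codons.

    Returns (strand, frame, codon_count) tuples.
    """
    seq = seq.upper()
    found = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run = 0  # codons seen since the last stop codon
            for i in range(frame, len(strand_seq) - 2, 3):
                if strand_seq[i:i + 3] in STOP_CODONS:
                    if run >= min_aa:
                        found.append((strand, frame, run))
                    run = 0
                else:
                    run += 1
            if run >= min_aa:  # stretch running to the end of the sequence
                found.append((strand, frame, run))
    return found


def has_orf(seq: str, min_aa: int = 30) -> bool:
    """True if any frame holds a stop-free stretch of at least min_aa codons."""
    return bool(stop_free_stretches(seq, min_aa))


long_enough = "GCT" * 30 + "TAA"  # 30 codons followed by a stop
too_short = "GCT" * 10            # at most 10 codons in any frame
```

A transcript would be retained when `has_orf` is true; here that holds for `long_enough` and not for `too_short`.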

Trinity-derived transcriptome outperforms Velvet-Oases
We generated over 149 million high-quality 75 or 100 bp paired-end sequence reads from expressed genes (cDNAs) in three tissues of adult female black widows: silk glands, venom glands and cephalothoraxes (Additional File 1, Table S1). De novo assembly of each tissue-specific library resulted in 19-450 thousand transcripts depending on assembly method and tissue type (Additional File 1, Table S2). These transcripts were grouped into "loci" or "components" by Velvet-Oases [1] and Trinity [2], respectively. "Loci" and "components" have similar underlying mathematical definitions and are typically interpreted as representing the same genomic locus. Multiple transcripts (e.g. isoforms) can be generated from a single locus.
Trinity assemblies resulted in more loci (16.8-72.1 thousand) than Velvet-Oases assemblies (10.6-36.5 thousand), but fewer total transcripts (Additional File 1, Table S1; Trinity: 19.3-114.4 thousand, Velvet-Oases: 36.7-426.7 thousand). Because of the large number of transcripts generated by Velvet-Oases, we used a single transcript per locus when combining the tissue-specific assemblies into a putative transcriptome using CAP3. For Trinity, we retained all transcripts when combining the tissue-specific assemblies into a Trinity-derived transcriptome.
The Trinity-derived assembly was more complete than the Velvet-Oases-derived assembly in that it possessed more homologs of several sets of previously described sequences. For instance, the Trinity-derived transcriptome included complete homologs of 99% of the Core Eukaryotic Genes (CEGs), while Velvet-Oases recovered 90% of CEGs, as determined by CEGMA [7]. The Trinity-derived transcriptome also possessed homologs of more unique tick and fruit fly RefSeq proteins than did Velvet-Oases, as assessed by significant BLASTX alignments (E-score < 1e-5; Table S2). Importantly, the Trinity-derived transcriptome recovered 99% of 999 previously described non-redundant L. hesperus cDNA and genomic sequences, while the Velvet-Oases transcriptome recovered only 88% (Additional File 1, Table S2). Finally, using BLASTX alignments to tick proteins, we found fewer potential cases of chimeric "assembled sequences" in the Trinity-derived transcriptome than in the Velvet-Oases-derived one. Specifically, 11.2% of Trinity-derived assembled transcripts had non-overlapping alignments to two different fruit fly proteins versus 13% of Velvet-Oases-derived ones (E-score < 1e-10). Using more stringent alignments (E-score < 1e-50), only 4.9% and 6.7% of assembled transcripts were potentially chimeric in the Trinity- and Velvet-Oases-derived transcriptomes, respectively.
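The chimera criterion above (non-overlapping alignments of one assembled transcript to two different proteins, below an E-score cutoff) can be sketched as follows. This is an illustrative reading of that criterion, not the authors' script; the `Hit` record, the function names, and the example hits are hypothetical. Coordinates are positions on the query transcript, which BLASTX may report with start greater than end for reverse frames.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, NamedTuple, Set


class Hit(NamedTuple):
    query: str    # assembled transcript
    subject: str  # matched protein
    qstart: int   # alignment start on the query
    qend: int     # alignment end on the query
    evalue: float


def overlaps(a: Hit, b: Hit) -> bool:
    """True if two hits cover at least one common query position."""
    a_lo, a_hi = sorted((a.qstart, a.qend))
    b_lo, b_hi = sorted((b.qstart, b.qend))
    return a_lo <= b_hi and b_lo <= a_hi


def potential_chimeras(hits: Iterable[Hit], max_evalue: float = 1e-10) -> Set[str]:
    """Flag queries with non-overlapping hits to two different subjects."""
    by_query = defaultdict(list)
    for hit in hits:
        if hit.evalue < max_evalue:
            by_query[hit.query].append(hit)
    flagged = set()
    for query, query_hits in by_query.items():
        for a, b in combinations(query_hits, 2):
            if a.subject != b.subject and not overlaps(a, b):
                flagged.add(query)
                break
    return flagged


example_hits = [
    Hit("t1", "P1", 1, 300, 1e-60), Hit("t1", "P2", 400, 900, 1e-20),
    Hit("t2", "P1", 1, 300, 1e-60), Hit("t2", "P3", 100, 400, 1e-60),
    Hit("t3", "P4", 1, 300, 1e-60), Hit("t3", "P4", 400, 900, 1e-60),
]
loose = potential_chimeras(example_hits, max_evalue=1e-10)
strict = potential_chimeras(example_hits, max_evalue=1e-50)
```

In this toy input only `t1` is flagged at the 1e-10 cutoff, and nothing is flagged at 1e-50 because one of its two hits no longer passes. The percentage of potentially chimeric transcripts would then be the flagged count divided by the number of assembled transcripts.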