Orthologous genes identified by transcriptome sequencing in the spider genus Stegodyphus

Background The evolution of sociality in spiders involves a transition from an outcrossing to a highly inbreeding mating system, a shift to a female biased sex ratio, and an increase in the reproductive skew among individuals. Taken together, these features are expected to result in a strong reduction in the effective population size. Such a decline in effective population size is expected to affect population genetic and molecular evolutionary processes, resulting in reduced genetic diversity and relaxed selective constraint across the genome. In the genus Stegodyphus, permanent sociality and regular inbreeding has evolved independently three times from periodic-social (outcrossing) ancestors. This genus is therefore an ideal model for comparative studies of the molecular evolutionary and population genetic consequences of the transition to a regularly inbreeding mating system. However, no genetic resources are available for this genus. Results We present the analysis of high throughput transcriptome sequencing of three Stegodyphus species. Two of these are periodic-social (Stegodyphus lineatus and S.tentoriicola) and one is permanently social (S. mimosarum). From non-normalized cDNA libraries, we obtained on average 7,000 putative uni-genes for each species. Three-way orthology, as predicted from reciprocal BLAST, identified 1,792 genes that could be used for cross-species comparison. Open reading frames (ORFs) could be deduced from 1,345 of the three-way alignments. Preliminary molecular analyses suggest a five- to ten-fold reduction in heterozygosity in the social S. mimosarum compared with the periodic-social species. Furthermore, an increased ratio of non-synonymous to synonymous polymorphisms in the social species indicated relaxed efficiency of selection. However, there was no sign of relaxed selection on the phylogenetic branch leading to S. mimosarum. Conclusions The 1,792 three-way ortholog genes identified in this study provide a unique resource for comparative studies of the eco-genomics, population genetics and molecular evolution of repeated evolution of inbreeding sociality within the Stegodyphus genus. Preliminary analyses support theoretical expectations of depleted heterozygosity and relaxed selection in the social inbreeding species. Relaxed selection could not be detected in the S. mimosarum lineage, suggesting that there has been a recent transition to sociality in this species.


Background
Sociality in spiders involves the formation of permanent family groups and extensive cooperation in prey capture and brood care. While sociality is relatively rare in spiders with less than 25 known social species, it is intriguing because it has evolved independently at least 18 times in six families across the spider phylogeny [1]. The evolution of sociality involves convergent life history traits that include female-biased sex ratios, an inbreeding mating system, and cooperative breeding with allomaternal care [1]. The genus Stegodyphus contains three social species [2], and although the phylogenetic relationship of Stegodyphus is poorly resolved, a partial molecular phylogeny suggests three independent origins of sociality, and that sociality is a derived state [3]. Sociality is hypothesized to evolve via the subsocial route through the elimination of pre-mating dispersal, the transition to intracolony mating, and the evolution of cooperation among individuals. The presence of independent social lineages and periodic-social (ancestral) congeners makes Stegodyphus a suitable system for comparative studies of the evolution of regular inbreeding and sociality, and for examining its consequences for evolutionary processes and genomic architecture.
The combination of strong inbreeding, female-biased sex ratios, and dynamic meta-population structure in the social spiders is expected to have pronounced effects on genetic processes [4,5]. Comparative data from studies on selfing and outcrossing plants are consistent with theoretical predictions that inbreeding populations are expected to experience reduced genetic diversity. This prediction is founded on three factors: firstly, that inbreeding reduces effective population size relative to outbreeding populations; secondly, high homozygosity reduces the effective frequency of recombination throughout the genome; and thirdly, isolation between populations is increased with inbreeding [4]. As a consequence, the efficacy of natural selection should be reduced. Examining these expectations and additional genomic consequences of the transition to sociality and regular inbreeding has so far not been possible due to a lack of genetic resources. Next generation sequencing technologies have made it possible to survey genomes of non-model organisms [e.g. [6][7][8][9][10][11]], and to identify suitable parts of the genome for phylogenetic, molecular evolutionary and population genetic studies. Using 454 pyrosequencing (Roche), we sequenced the transcriptomes of three Stegodyphus species to identify ortholog gene sequences across species to use in phylogenetic and population genetic studies.
The three Stegodyphus species sequenced here (S. lineatus, S. tentoriicola and S. mimosarum) were specifically selected to be as distantly related as possible, based on the most recent molecular phylogeny of the genus [3], to maximize the chance of finding regions conserved across the entire genus for the design of primers that will amplify in all or most Stegodyphus species. One social species and two periodic-social species [3] were included to perform preliminary contrasts of evolutionary patterns of the transcriptomes for inbreeding and outcrossing mating systems to substantiate evolutionary hypotheses to test in future studies.
We obtained a library of 1,792 ortholog loci and detected open reading frames (ORFs) (> 189 bp) in 1,345 loci. Preliminary analysis on the transcriptome sequence data showed that heterozygosity was strongly depleted in the social individual compared with the two periodic-social individuals (5 to 10-fold). Estimates of dN/dS, CG-content and codon usage bias did not reveal differences in the strength of selection averaged over the phylogenetic branches. However, levels of non-synonymous and synonymous polymorphisms suggest that selection is relaxed in the social S. mimosarum compared to the periodic-social species. From these results we hypothesize that permanent sociality evolved recently in S. mimosarum.

Results and Discussion
Assembly and characterization of transcriptomes Three transcriptomes were assembled from quality filtered sequences into 10,398, 6,882 and 7,195 transcripts, with a median length of approximately 700 bp and median coverage of 4.2, 4.7 and 4.7 for S. lineatus, S. mimosarum and S. tentoriicola respectively (see Table 1 for original and assembled data). Based on transcript clustering, 8,944, 5,893 and 6,094 individual putative genes for each of the respective species were identified, of which approximately 8.5% were found to have more than one splice variant. The statistics for the original and assembled data are summarised in Table 1.
Approximately half of the putative genes in S. lineatus and 60% in S. mimosarum and S. tentoriicola were annotated by comparative analysis using BLASTx against the NCBI non-redundant protein database (see Table 1). Functional annotation (Gene Ontology) was similar across all three species with the highest number of annotated transcripts related to metabolic processes (GO:0008152) and cellular processes (GO:0009987) (Figure 1A). Many of these genes are probably highly conserved genes, such as genes involved in DNA repair, gene expression regulation and transfer of chemical substances. This category of genes contains the largest amount of products in the Gene Ontology (GO) database, and the abundance of transcripts in different categories in the present data set is similar to the abundance of the number of products in the GO database.
The study species were found to have highly similar patterns of coding GC content and codon usage. The overall GC content of the transcriptomes was 32.2%, 34.1% and 33.2% in S. lineatus, S. mimosarum and S. tentoriicola, respectively, while in defined coding regions (see below) the GC content was slightly higher (approximately 40% in all three species) compared to the overall GC content. The patterns of codon usage in these genes, studied as the effective number of codons (N c ') and the GC content in synonymous sites (GC 3 ), were highly similar in all study species. The effective numbers of codons was approximately 50, and AT-rich codons were preferred ( Figure 2). The amount of codon usage bias was of the same order of magnitude as several Diptera and Hymenoptera species [12,13]. However, while in many Diptera species the preferred codons are GC-rich, ATrich codons were more common in Stegodyphus. Thirtyseven Nematode species have shown large variation in GC content of preferred codons, but, with a few exceptions, the GC content in third codon position is generally much higher in Nematode genomes than found in the spider species here [14].
Orthology 1,792 transcripts were found to have reciprocal best hit in all three pair-wise comparisons, and were considered to be ortholog across all three study species. However, 940, 1,131 and 1,275 respective sequences were found to have best reciprocal hits between two species only ( Figure 1B.). Of these sequences, 261 S. lineatus, 241 S. mimosarum and 259 S. tentoriicola transcripts showed reciprocal best hits among two species pairs only, which might have been caused by technical issues or paralogy  (see Additional file 1). To avoid paralog genes in the downstream analyses, these loci were excluded. The alignments of 1,792 ortholog sequences serve as a highly useful library for further genetic studies across the Stegodyphus genus. Since the three species sequenced here are phylogenetically distant within the genus, the likelihood that PCR primers designed from conserved regions in the alignments will amplify throughout the genus is maximized. Indeed, the first ten PCR primer pairs designed from ortholog sequences amplified in 8 species tested across the genus. The sequences of these primer pairs can be found in Additional file 2.

Molecular evolution
To study patterns of molecular evolution of protein coding genes we generated multiple alignments of the longest open reading frames detected from the three way ortholog sequences. This resulted in 1,345 alignments with a median length of 171 codons ranging from 63 to 1,092. These 1,345 ortholog alignments can be found in Additional file 3. Eighty-seven percent of the longest translated ORFs had at least one significant blastp hit (e-value cut-off 10 -6 ) in the NCBI non-redundant database, suggesting that in the majority of genes the correct reading frame was detected.
Single genes were concatenated into 20 units with an average length of 41,825 bp to estimate genome-wide substitution patterns, and the total length of the aligned region was 836,517 bp, gaps included. Alignments were concatenated to increase reliability of branch length estimation, and we generated 20 units (and not one single unit) to be able to estimate confidence limits. The proportion of variable sites in pair-wise comparisons was 5.7% both between S. lineatus and S. mimosarum and between S. lineatus and S. tentoriicola, and 4.2%. between S. mimosarum and S. tentoriicola. The transcriptomewide selection against amino acid changing mutations was found to be strong in all lineages, but in the S. lineatus lineage selection was significantly stronger than in the other two lineages. The estimates of the ratio of nonsynonymous to synonymous divergence (ω) (95% confidence limits in parentheses) were 0.094 (0.088-0.099), 0.119 (0.111-0.128) and 0.11 (0.104-0.117) in S. lineatus, S. tentoriicola, and S. mimosarum lineages respectively. This is comparable with genome-wide estimates of ω in Drosphila [15]. We note that because we extracted RNA from the entire body and used non-normalized cDNA libraries, genes that are expressed in all tissues are overrepresented in the dataset. These genes are likely to be under strong purifying selection [16].

Genetic polymorphism in social and periodic-social spiders
We found strikingly fewer polymorphisms in the transcriptome from the social species compared to that of periodicsocial species. From the assembled transcript in coverage range 10 to 150, we detected 6,425 and 5,043 polymorphisms for S. lineatus and S. tentoriicola respectively, while only 943 polymorphisms were detected from S. mimosarum transcripts. If only one transcript per gene was taken into account the difference in detected polymorphisms was even higher, with 3,901 polymorphism in S. lineatus and 2,415 in S. tentoriicola respectively, while only 591 polymorphisms were detected in the S. mimosarum transcript. These data are in keeping with low population level genetic variation in mitochondrial loci that were previously demonstrated in two genera of social spiders [17,18]. Our data for the social S. mimosarum transcriptome is consistent with the prediction of reduced population level polymorphism in social species as a consequence of inbreeding. This is the goal of future comparative studies of social and periodic-social Stegodyphus species.
In addition, we found evidence for a higher ratio of nonsynonymous polymorphisms to synonymous polymorphisms (π(a)/π(s)) in the social species compared with the outcrossing congeners. The number of polymorphisms that could be determined as synonymous and non-synonymous were 567 and 199 for S. lineatus, 385 and 239 for S. tentoriicola, and 22 and 24 for S. mimosarum respectively. The estimated π(a)/π(s) were 0.10 for S. lineatus and 0.18 for S. tentoriicola, and 0.32 for the social species S. mimosarum, indicating a relaxation of purifying selection. Lower polymorphism and higher π (a)/π(s) in the social species suggest lower effective population size and relaxed purifying selection, as expected for highly inbred species [4,19]. Contrary to these findings, the ω estimates in different branches did not show consistent significant differences between social and periodic-social species. ω is expected to increase over evolutionary time in inbreeding social species with low effective population size and relaxed purifying selection because of increased accumulation of slightly deleterious mutation [20]. However, since we did not observe an increased ω relaxation of purifying selection in S. mimosarum probably only occurred relatively recently. Assuming that relaxed purifying selection followed the transition to sociality, we tentatively hypothesize that sociality and regular inbreeding in S. mimosarum evolved relatively recently, and that the transition was not associated with speciation.
Under the assumption of a molecular clock of synonymous substitutions, we estimated the time for speciation using two different types of calibration. First we estimated a 'spider mutation rate' from an estimated synonymous divergence of dS = 2.64 (1.32 on each branch) between S. lineatus and Acanthoscurria gomesiana (Theraphosidae) (based on the 16,000 bp alignment of A. Gomesiana haemocyte EST data [23] and data from this study). The split time between these species has been estimated at~245 my [24,25], yielding a mutation rate of approximately 5, 4E-9 per year. Stegodyphus species are annual with one generation per year, therefore, based on this estimated mutation rate, we date the split between S. mimosarum and S. tentoriicola to~15 million years ago, and the split of S. mimosarum/S. tentoriicola and S. lineatus to~21 million years ago (Figure 3). We also used a direct mutation rate estimate from mutation accumulation lineages of Drosophila melanogaster by Haag-Liautard et al. [26]. Using this mutation rate estimate of 8.4E-9, we date the splits between S. mimosarum and S. tentoriicola at~8 million years ago and between S. lineatus and S. mimosarum/ S. tentoriicola at~13 million years ago (Figure 3). We note that the mutation rates considered here may differ from the actual mutation rate in Stegodyphus species and that these time estimates should be considered with caution.

Conclusions
We identified 1,792 three-way ortholog genes that provide a unique resource for comparative studies in ecology, ecogenomics, population genetics and molecular evolution that relate to the evolution of sociality in the spider genus Stegodyphus. Preliminary analyses suggest depleted heterozygosity and relaxed selection in the inbreeding social species S. mimosarum relative to outcrossing congeners. However, relaxed selection could not be detected in the substitution pattern, suggesting a recent transition to sociality.

Materials and methods
Live specimens were collected from Israel (S. lineatus) and South Africa (S. mimosarum and S. tentoriicola) and brought to Aarhus University, Denmark. Total RNA was extracted from the entire body of one individual of each species and complementary DNA (cDNA) was synthesized using Evrogen SMART adapters [27] (Evrogen: http:// www.evrogen.com). cDNA libraries were sequenced using the Genome Sequencer FLX instrument (454, Roche) at the Gene Pool facility at the University of Edinburgh. The raw sequence reads and assembled contigs can be downloaded (EBI: ERP001169, Genbank: JT023378 -JT033759 (S. lineatus), JT033760 -JT040598 (S. mimosarum), JT040599 -JT047773 (S. tentoriicola)).
For the assembly of the sequence reads we applied GS De Novo Assembler Software (2.3) using transcriptome assembly default settings. For the S. mimosarum assembly an additional -large option for large genomes was used, since the assembly could otherwise not be completed. The assembled contigs (isotigs in GS De Novo Assembler) represent transcripts. Alternative splicing was quantified based on isotig clustering by GS De Novo Assembler software. The isotig clusters (isogroup in GS De Novo Assembler) are referred to as putative genes in the results.
A reciprocal best similarity hit method was used for the ortholog sequence identification, where triplets of reciprocal best hits were selected to represent a putative ortholog set of transcripts. For simplicity they are referred to as ortholog sequences in the results. Nucleotide BLAST was used for the similarity search with default parameters and an E-value cut-off of 10 -5 using Bioedit 7.0.4.1 [29]. First, a pair-wise reciprocal blast was performed of which the best hits based on bit scores were selected. If the best hits were similar in both directions, the sequences were scored as putative ortholog between the two species. The putative threeway ortholog sequences were selected from these loci if they were putative orthologs in all three way comparisons. The reciprocal best hit selection was performed using a set of Java scripts implemented in a locally developed sequence manipulation tool (available on request). Putative ortholog sequences between S. lineatus and Acanthoscurria gomesiana were detected based on the same principle using identified orthologs to query in S. lineatus.
For identifying polymorphisms we developed a custom Python script SNP_finder.py (available on request) to mine polymorphic sites from GS De Novo Assembler output files. The polymorphisms were called from sites with a coverage range of 10-150. Minor allele frequency was required to be > 0.3 and only contigs longer than 100 were used, excluding 10 nucleotides from each edge of an assembly. Detected SNPs were assigned ancestral or derived by assuming that the nucleotide found in the same position in the two other species was the ancestral nucleotide. π Slin =0.00225 π Nlin =0.00023 π N /π Slin = 0.10 π Smim =0.000029 π Nmim =0.000093 π N /π Smim = 0.32 π Sten =0.00164 π Nten =0.00030 π N /π Sten = 0.18 Figure 3 Presumed phylogeny of three Stegodyphus study species. Two species, S. dumicola and S. africanus, indicated with dashed branches, were added based on Johannesen et al (2007) to illustrate that the branches to S. tentoriicola and S. mimosarum do not represent single species only. Red branch bounds indicate social species. The splitting of the terminal branches in the species studied here represents the level of polymorphism within these three species. Split time estimates, substitution rate estimates (dN, dS and ω) and polymorphism estimates (π) were mapped on the phylogeny. The root was placed under the assumption of a molecular clock.
the derived nucleotide changed the amino acid, and synonymous if not. From the detected ortholog sequences we generated alignments of protein coding regions. The longest open reading frames of detected ortholog sequences, obtained with the online version of Virtual Ribosome [30], were aligned using the alignment program PRANK [31]. The alignments were manually edited assuming that most frame shift errors are caused by sequencing or assembly errors. All sequences were cut after a frame shift gap due to uncertainties of the correct reading frame. All gaps divisible by three were moved to the closest inframe position. To reduce the risk of false positive open reading frames, alignments shorter than 189 bp were discarded. This threshold was set assuming that the number of stop codons in a random sequence with length n follows a binomial distribution where k is the number of stop codons, and p is the probability of the random occurrence of a stop codon, i.e. 3/64. We set the probability to achieve an ORF (k = 0) with a length n to 0.05. From this equation the minimum length of ORFs was solved. Thus, to set the probability of a random ORF occurring at below 5%, 63 was selected as ORF length threshold.
The alignments with ORFs (1345 in total) were assigned a random number, and sequences were concatenated in ascending order to form 20 sets of alignments with an equal number of genes. The concatenated genes were used to estimate the transcriptome-wide substitution rates in protein coding genes. The pair-wise proportion of sequence diversity was calculated as the proportion of variable sites from the total. To estimate the rate of synonymous (dS), non-synonymous (dN) and the dN/dS ratio (ω) for each branch of the phylogeny, we used maximum likelihood estimations implemented in the program codeml in the PAML 4.2 package [32]. Codon usage bias of the aligned transcripts was measured as the GC content in silent sites and the effective number of codons was computed using the formula by Novembre [33] calculated with the software INCA 2.1 [34].