Termites
Five cDNA libraries were constructed from R. flavipes workers, soldiers, alates, and larval stages collected in March 2006 from a single colony on the campus of the University of Florida, Gainesville. The larvae were divided, based on size, into early (putatively stages 1-2) and late (putatively stages 3-4) stages. Several hundred individuals of each caste/stage totaling about 2 gm of wet weight were obtained and stored at -80°C until used. The individuals were not separated according to sex; therefore, the cDNA libraries contained expressed genes in both males and females, although we do not know the precise sex ratio among individuals that were used for mRNA isolation and cDNA synthesis. The larvae have a dichotomous developmental pathway, developing into either nymphs (which then become alates or second-form reproductives) or workers (some proportion of which subsequently become presoldiers and then soldiers).
cDNA Libraries
mRNA isolation, cDNA synthesis and library construction was undertaken by AGCT Corp. (Cranston, RI). The cDNA was non-normalized and cloned into pBluescript II SK+ vector after digestion with Eco RV and Not l. We did not attempt to systematically exclude bacterial, protozoan, or other symbiont DNA from the libraries. Libraries were plated on LB medium with 50 mg/litre ampicillin and grown at 37°C overnight. Several hundred colonies were sampled randomly from the plates for further processing and sequencing. Colonies were placed in 150 μl liquid LB medium containing ampicillin, grown overnight at 37°C, and stored at -80°C after adding 20 μl of sterile glycerol. The colonies were then shipped on dry ice to the University of California-Riverside for sequencing the 5'-ends of the clones using the T3 primer.
Sequence Analyses
Using Sequencher (v4.7, Gene Codes Corp, Ann Arbor, MI), we removed the vector sequence and trimmed ends using program defaults. Sequences shorter than 150 bp were discarded from further analysis after trimming. ESTs were organized into contiguous sequences (contigs) using the default parameters in EGassembler [14]. Highly repetitive sequences were masked using the Drosophila RepBase library in EGassembler.
Gene discovery rate for each library was estimated by dividing the number of contiguous sequences by the total number of singleton sequences. Minimum average sequence insert sizes of cDNA clones were estimated by dividing the total number of base pairs sequenced by the number of ESTs in each library. The maximum length is limited by the length of sequence; thus, the actual insert size was likely underestimated. Both singletons and contigs were assigned putative functions using a batch BLASTX reference search from the non-redundant protein database using BlastStation (v2.61, TMSoftware, Arcadia, CA), with an e-value ≤1e-10. Genes assigned putative housekeeping roles were identified using a list derived from Eisenberg et al. [15]. Gene ontology (GO) enrichment analysis was performed using Blast2GO tool [16]. The functional annotation in Blast2GO was performed in three steps. First, gene sequences were queried against gene or protein databases and potential homologs identified. During the mapping step, known GO annotations of the homologs were retrieved. Finally, the homolog annotations were used to annotate the uncharacterized genes.
For a set of annotated genes (including genes for which the annotations are obtained outside Blast2GO), several tools for statistical and visual analysis are available. Among these, Blast2GO includes a tool for performing enrichment analysis, i.e., the identification of GO annotations whose abundance is significantly different between two sets of annotated genes. The GO term enrichment analysis functionality is achieved by integrating Blast2GO with Gossip, a software package that employs Fisher's exact test to estimate the significance of associations between two categorical variables, while correcting for multiple testing using FDR (false discovery rate), FWER (family-wise error rate) and single test p-value. A set of GO terms that are under- or over- represented at a specified significance value is obtained as a result of performing the enrichment analysis.
All EST sequences are deposited in dbEST http://ncbi.nlm.nih.gov/dbEST/ under dbEST-Id 65538065-65551007, and in GenBank under accession numbers G0898823-G0911765.
In silico transcript abundance
Relative transcript abundance in the ESTs derived from each of the five cDNA libraries was analyzed using the R-statistic [17]. The R-statistic can be seen as entropy of partitioning contigs (corresponding to genes) among multiple cDNA libraries, and can be used to identify differentially expressed genes as those genes with high R-value [17]. More precisely, the test statistic R compares the null hypothesis, L0 ("the abundance of a transcript is the same in all libraries"), with the alternative hypothesis, L1 ("the abundance of the transcript in each library is different", i.e., the corresponding gene is differentially expressed), using the log ratio of the two likelihoods, i.e., log(L
1
/L
0
). Thus, for a particular transcript (contig) j, we have:
where m is the number of EST libraries considered, x
i,j
is the number of ESTs in contig j belonging to the i th library, N
i
is the total number of ESTs in the i th library and f
j
is the frequency of all of the ESTs of contig j in all of the libraries. If x
i,j
= 0 for a library i, then the contribution of the i th library to R
j
is 0. Compared to other similar methods (e.g., Fisher's exact test), which can only work for two libraries at a time, the R-statistic can work with a multiple libraries. This feature makes the use of the R-statistic desirable for our analysis.
We analyzed the 20 contigs with the highest R-value, including 10 contigs that returned a BLASTX hit with an e-value ≤1e-10 and 10 that had no BLAST hit or a hit with an e-value ≥1e-10. The cut-off for the number of contigs (genes) differentially expressed was identified by investigating the frequency of R-statistic among the contigs as described [17]. Putative protein functional analysis of the latter group (i.e., those contigs with no BLAST hit or e-value ≥1e-10) was performed using the protein function predictor [18] which pools sequences with similar structural motifs and assigns GO terms, to obtain an initial indication of gene function. The protein function predictor attempts to predict protein function to a large number of genes than a conventional BLAST search by providing low-resolution function without losing accuracy. We further validated the R-statistic measure of significant difference using a chi-squared analysis.