The use of Open Reading frame ESTs (ORESTES) for analysis of the honey bee transcriptome

Background The ongoing efforts to sequence the honey bee genome require additional initiatives to define its transcriptome. Towards this end, we employed the Open Reading frame ESTs (ORESTES) strategy to generate profiles for the life cycle of Apis mellifera workers. Results Of the 5,021 ORESTES, 35.2% matched with previously deposited Apis ESTs. The analysis of the remaining sequences defined a set of putative orthologs whose majority had their best-match hits with Anopheles and Drosophila genes. CAP3 assembly of the Apis ORESTES with the already existing 15,500 Apis ESTs generated 3,408 contigs. BLASTX comparison of these contigs with protein sets of organisms representing distinct phylogenetic clades revealed a total of 1,629 contigs that Apis mellifera shares with different taxa. Most (41%) represent genes that are in common to all taxa, another 21% are shared between metazoans (Bilateria), and 16% are shared only within the Insecta clade. A set of 23 putative genes presented a best match with human genes, many of which encode factors related to cell signaling/signal transduction. 1,779 contigs (52%) did not match any known sequence. Applying a correction factor deduced from a parallel analysis performed with Drosophila melanogaster ORESTES, we estimate that approximately half of these no-match ESTs contigs (22%) should represent Apis-specific genes. Conclusions The versatile and cost-efficient ORESTES approach produced minilibraries for honey bee life cycle stages. Such information on central gene regions contributes to genome annotation and also lends itself to cross-transcriptome comparisons to reveal evolutionary trends in insect genomes.

Results: Of the 5,021 ORESTES, 35.2% matched with previously deposited Apis ESTs. The analysis of the remaining sequences defined a set of putative orthologs whose majority had their best-match hits with Anopheles and Drosophila genes. CAP3 assembly of the Apis ORESTES with the already existing 15,500 Apis ESTs generated 3,408 contigs. BLASTX comparison of these contigs with protein sets of organisms representing distinct phylogenetic clades revealed a total of 1,629 contigs that Apis mellifera shares with different taxa. Most (41%) represent genes that are in common to all taxa, another 21% are shared between metazoans (Bilateria), and 16% are shared only within the Insecta clade. A set of 23 putative genes presented a best match with human genes, many of which encode factors related to cell signaling/signal transduction. 1,779 contigs (52%) did not match any known sequence. Applying a correction factor deduced from a parallel analysis performed with Drosophila melanogaster ORESTES, we estimate that approximately half of these no-match ESTs contigs (22%) should represent Apis-specific genes.

Conclusions:
The versatile and cost-efficient ORESTES approach produced minilibraries for honey bee life cycle stages. Such information on central gene regions contributes to genome annotation and also lends itself to cross-transcriptome comparisons to reveal evolutionary trends in insect genomes.

Background
The honey bee, Apis mellifera, occupies a prominent place in biological research due to its social behavior, learning capabilities, haplodiploid mechanism of sex determination, and plasticity in phenotype (caste) and longevity. Thus, it is a model organism for classical and sociogenetic studies. In addition, bees drive a large-scale apicultural industry, and also generate important income in smallscale subsistence beekeeping. And finally, bees are of great economic and ecological relevance for their role as generalist pollinators.
The decision to include the honey bee amongst the current organisms for complete genome sequencing, http:// www.hgsc.bcm.tmc.edu/projects/honeybee/ was, therefore, well founded, yet information on its transcriptome is still meager. When starting this study, little over 250 genes were annotated as partial or full length coding sequences, and only about 15,500 expressed sequence tags (mainly 5'-ESTs generated from a normalized bee brain cDNA library [1]) were available in public databases. Thus, even after sequencing of the honey bee genome will be completed a considerable transcriptome sequencing effort will still be required for unequivocal genome annotation, gene identification, and subsequent functional studies.
We used the ORESTES (Open Reading frame Expressed Sequence Tags) strategy to generate ESTs from different life cycle stages of the honey bee, such as appropriate for a genome annotation initiative. This strategy preferentially generates ESTs of the central, and thus most informative portion of the transcript [2], and frequently also identifies less abundant mRNAs [3]. The efficacy of the Open Reading frame ESTs strategy, in the context of an organism for which there is limited genomic information, has recently been demonstrated for Schistosoma mansoni This cost-efficient approach increased the already existent Apis EST database by 30% new reads. Of the 5,021 ORESTES, only 35.2% matched with previously deposited Apis ESTs. When assembled with the existent Apis ESTs in the NCBI database, the ORESTES sequences extended 66% of the mixed contigs. Together these data indicate that the ORESTES methodology could effectively complement the current efforts towards the definition of the Apis transcriptome.
Distribution of Best-BLASTX-matches for assembled Apis mellifera Open Reading frame ESTs Figure 1 Distribution of Best-BLASTX-matches for assembled Apis mellifera Open Reading frame ESTs. After assembly into contigs and singlets the sequences were submitted to a search against a non-redundant protein database (NCBI). Independent of its E-score, the best match in each BLASTX result was listed according to organism category.

Honey bee Open Reading frame ESTs
We generated a total of 87 mini-libraries from the four major life cycle stages of honey bee workers (embryo, larva, pupa, adult) by the use of arbitrary primers and a low-stringency RT-PCR protocol [2]. From these libraries we obtained 5,021 sequences of appropriate standard quality (sequence > 100 bases; Phred 15) and with an average size of 373.9 bp. These sequences were deposited in the GenBank EST database (accession numbers CK628548 to CK633568). In the annotation pipeline, these were first submitted to BLASTN searches against Apis mellifera sequences deposited in the NCBI EST database (dbEST). At this step, 35.2% (1,769) of the validated sequences matched Apis ESTs (Table 1). In a subsequent step, a BLASTX comparison of the remaining sequences against the nr-NCBI database permitted the annotation of an additional 22.4% (1,123) of the honey bee ORESTES, while the remaining 42.4% (2,129) did not match any known sequence. This rather large set of ESTs that did not result in significant alignment with any sequence deposited in non-redundant databases contains candidates for novel honey bee genes.
The 5,021 Apis ORESTES were assembled by CAP3 into 488 contigs of a mean size of 519 bp, leaving 893 singlets. In a second round of BLASTX comparisons against the nr-NCBI database, 28.5% of the contigs and 9.2% of the singlets were classified as putative orthologs. When the respective best matches were classified according to species or higher order taxa (Figure 1), 89.6% were from the arthropod clade (including fully or partially sequenced Apis mellifera genes). The largest fraction of these putative orthologs showed best matches with predicted Anopheles genes (43.9%), followed by ORESTES that were classified as putative orthologs of Drosophila (29.5%).

Gene Ontology classification
We assigned level 3 Gene Ontology (GO) classifications to 326 of the total of 488 assembled contigs; 162 contigs did not match any sequence in the nr-protein database. In the manual annotation preceding the GO analysis we preferentially assigned the contigs with respect to their Drosophila orthologs. The cellular component, biological process, and molecular function classifications of the honey bee sequences are shown in Table 2. In the biological process categories there is a clear prevalence for ESTs representing cell communication, cell growth and maintenance, metabolism and morphogenesis. For molecular function, the dominant assignments were to enzymatic activity and to nucleic acid binding and related functions (translation factor, transcription factor). When compared to the corresponding GO results obtained for the bee brain ESTs [1], we noted a similar distribution in category dominance structure, except for the molecular functions 'transporter and ligand binding/carrier' which have a higher representation in the bee brain ESTs than in our ORESTES contigs. This discrepancy most probably reflects functional differences in the tissues used in these two studies.

Clustering of honey bee ESTs
We clustered the contigs generated in this study (AmOR-ESTES contigs) with the Apis mellifera ESTs already present in the NCBI dbEST database (further referred to as AmNCBI contigs). Clustering performed by CAP3 resulted in a total of 3,408 contigs and led to a general increase in read depth ( Figure 2A). This increase in read depth is reflected in the CAP3 assembled mixed sequences of the two databases. Mean length is 696 bp for the AmNCBI contigs and 496 bp for the AmORESTES contigs ( Figure  2B). For the mixed contigs we noted a mean increase of about 150 bp in contig length, thus documenting that the ORESTES sequences add considerable information to the characterization of the honey bee transcriptome and for subsequent studies of specific genes.
Within the total contig population, 9.5% of the assembled sequences (323) are represented by mixed contigs of both AmORESTES and AmNCBI sequences, and within this group 66.3% (214) of the original AmNCBI contigs were considerably extended, or were joined across gaps by the AmORESTES contigs, as illustrated in Figure 2C. The fact that the number of mixed contigs is relatively low compared to total contig number may be attributed to two aspects. First, most of the AmNCBI contigs were obtained from a single tissue (brain) library, whereas the AmOR-ESTES sequences represent whole body transcripts of all life cycle stages of the honey bee. Second, the AmNCBI sequences are mainly 5'-ESTs, whereas the AmORESTES sequences are expected to cover more central cDNA regions.

Genome comparison
Even though the total number of ESTs available for Apis mellifera is still low when compared to established genomic model organisms, we performed an across genome analysis with the set of 3,408 honey bee contigs. This involved sequential BLASTX searches, using the honey bee sequences as query entries against protein databases of Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans, human, protozoan and fungal origin. With this selection of organisms we intended to extract information on the percentage of genes that Apis shares (i) with all organisms, (ii) with animals, (iii) with different sets of metazoans, (iv) and exclusively with insects. The cutoff E-value in these comparisons was set at 10 -6 , as used in comparisons of similar nature [4], and the representation of the respective putative orthologs was listed across taxonomic levels.
We found that 1,629 Apis contigs presented significant match with sequences belonging to at least one of the taxa genomes. From these, 460 contigs (28.2%) correspond to genes with a representation in all the above taxa ( Figure  3). In addition, further 211 contigs (12.9%) could also be classified as common to all organisms since they were represented in all but in one of the members of this set of taxa (at this level, Anopheles and Drosophila were considered as a single group representing Diptera). This increases the set of EST contigs that the honey bee may share with all organisms to 41.2%, or, when considering the entire set of 3,408 contigs, to 19.7%. The second largest set of ESTs (312 + 37 contigs) is the one that is represented as genes common to the bilaterian clade (or metazoans in general), and only the third largest set (198 + 68 ESTs) contains genes that are represented solely in hymenopterans and dipterans, and thus in the insect clade.
Since deep-level phylogeny relationships within the bilateria are still a matter of debate, we separated our dataset according to the two prevalent hypotheses. The traditional view clusters arthropods within the coelomate clade. In our set of genomes, this tree architecture would be represented by genes shared between insects and the human genome. The alternative, more recently proposed hypothesis joins arthropods with nematodes to form an ecdysozoan clade [5]. The result of our comparison, which places emphasis on shared genes and not on the frequency of gene losses, is more consistent with the traditional view, since the coelomate clade is represented in this analysis with almost five times more shared genes than the ecdysozoan clade.
To infer on functional aspects within this pattern of genes that different clades appear to have in common we performed a Gene Ontology classification on biological process. In the set of Apis ESTs that stands for genes putatively shared with all organisms, the majority was classified as having a role in metabolism, and thus can be considered to represent basic functions. In contrast, the majority of Apis ESTs that are shared within the insect  clade was represented in the biological process categories of cell growth and/or maintenance and cell communication. The corresponding insect-specific genes are therefore supposedly involved in more specialized functions. A similar conclusion can be reached from the micro-and macroarray analyses of transcripts detected in adult honey bee workers performing different tasks during their adult life cycle [6,7].
A total of 70 putative ortholog ESTs did not comply with any of the plausible phylogenies, yet nevertheless, this set may contain ESTs of interesting information content, especially when considering that the main set of genes Similarity and representation pattern of assembled Apis mellifera ESTs (ORESTES + NCBI dbESTs) with predicted proteins of other organisms Figure 3 Similarity and representation pattern of assembled Apis mellifera ESTs (ORESTES + NCBI dbESTs) with predicted proteins of other organisms. In this comparison we included eukaryotes with completely sequenced genomes (Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans and human), plus higher taxon groups, such as protozoans (primarily represented by Plasmodium falciparum and P. yoelii) and fungi (primarily represented by Saccharomyces cerevisiae, Schizosaccharomyces pombe and Neurospora crassa). These BLASTX comparisons were performed with an E-value cut-off level set at 10 -6 . Subsequently, the representation pattern of each of the Apis ESTs in each of the eukaryotic genomes was listed. Out of the total 3,408 Apis EST contigs, 1,629 could be classified as putative orthologs, and these were grouped according to the representation of these genes at the different taxonomic levels.
within this group consists of Apis mellifera contigs that overlap with a mammalian genome. A manual analysis of these 23 contigs by BLASTX against the nr-NCBI database revealed that they are (at least by three orders of magnitude in E-values) more similar to mammals (especially to humans) than to other vertebrates and even other insects. This suggests that these genes may have diverged less in Apis and mammals and, therefore, may be subject to related selection pressure. Alternatively, at least some of them may have been modified in Diptera, and thus would show up as insect genes only once further non-dipteran insect genomes or transcriptomes have been sequenced and annotated. As shown in Table 3, this set of bee/ human contigs contains a considerable number of predicted proteins related to cell signaling/signal transduction and transcription factors. Such a bias to information processing in our dataset of genes shared between honey bees and a mammalian genome may reveal system properties related to complex functions.
Finally, we found that 1,779 (52,2%) of the assembled EST contigs did not match with any sequence of the analyzed organisms. Such a large proportion of Apis-specific contigs is likely to be an overestimate. As noted in a previous study [1], this might be partly due to technical problems, such as, sequencing of cDNA inserts consisting mainly of 3'-untranslated regions, the presence of unspliced intron sequences, cDNAs with a negative reading frame, or chimaeric cDNAs. However, the major portion of the Apis-specific contigs may have become classified as species-specific due to their relatively short ORFs. We performed an ESTScan analysis http:// www.ch.embnet.org/software/ESTScan.html on the Apisspecific contigs which detected ORFs in 56% of the assembled ESTs. These ORFs are, however, relatively short, with a mean ORF length around 280 bp. Short ORF length represents a notorious problem to alignment algorithms resulting in low match scores, and consequently, a more frequent classification of short ORF ESTs as speciesspecific transcripts. For the honey bee, this has been shown for the brain cDNA library where 84% of the ESTs with ORFs shorter than 450 bp were classified as speciesspecific, against 24% in the EST set that had ORFs larger than 450 bp [1].
In order to gain a general perspective on the representation of species-specific ESTs we also directed our attention to estimates obtained in whole-genome cross-species analyses. For instance, a figure of 18.6% of species-specific genes was ascertained for Drosophila melanogaster in a genome comparison which included Anopheles gambiae as the other insect representative [8]. Based on this informa- tion, and taking advantage of a set of Drosophila melanogaster ORESTES, generated in a parallel project, we calculated the frequency of Drosophila-specific ORESTES sequences to obtain a more realistic estimate on Apis-specific genes in relatively small sets of ESTs. In this analysis, a set of 5,000 CAP3 assembled Drosophila ORESTES (409 contigs) was submitted to sequential BLASTX searches against protein databases of Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans, human, protozoan and fungal, as described for the Apis contigs. For comparison, this same analysis was also performed with Apis ESTs, using separately 5,000 AmORESTES (486 contigs) and 5,000 AmNCB (632 contigs).
The Drosophila and Apis EST contigs consistently showed relatively low proportions of insect-specific genes (6-13%). Still lower (ca. 1% each) was the proportion of ESTs that had significantly higher similarity scores with eukaryotes other than Insecta. In all EST sets we found a large fraction of sequences that were classified as either Drosophila-specific (40%) or Apis-specific (51% for AmNCBI dbESTs and 47% for AmORESTES). This high proportion of species-specific genes, therefore appears to be generated independent of the method used in EST sequencing, as it is represented in similar proportions in both the ORESTES set and the conventional 5'-EST set ( Figure 4).
Percentage of honey bee and Drosophila ESTs representing putative species-specific genes (blue bars) in relation to ESTs that represent genes solely shared within the insect clade (pink bars), or that have higher similarity with eukaryotes other than the insect clade (yellow bars) Figure 4 Percentage of honey bee and Drosophila ESTs representing putative species-specific genes (blue bars) in relation to ESTs that represent genes solely shared within the insect clade (pink bars), or that have higher similarity with eukaryotes other than the insect clade (yellow bars). In separate comparisons, the Apis mellifera contigs (ORESTES + NCBI dbESTs, n = 5,000), AmOR-ESTES (n = 5,000), AmNCBI dbESTs (n = 5,000), and Drosophila melanogaster ORESTES contigs (n = 5,000) were analyzed against protein databases of an insect (Anopheles gambiae) and several non-insect species (C. elegans, protozoans, fungi and H. sapiens) with completely sequenced genomes. The cut-off E-value in these comparisons was set at 10 -6 .
The figure of 40% Drosophila-specific genes obtained for our Drosophila ORESTES set can be directly set in contrast with the estimate of 18,6% species-specific genes reported in the Drosophila genome based study [8], and this would predict an overestimate factor of 2.15 for species-specific genes in the EST sets. When this factor is applied to the honey bee ORESTES, the 47% estimate for species-specific AmORESTES can thus be corrected to a more realistic figure of 22%. This estimate is in agreement with the results of Whitfield et al. [1] who observed that 24% of the honey bee genes represented by ESTs with ORFs larger than 450 bp did not have matches to any known protein sequences. This Apis-specific gene estimate is also in range when considering that the two dipteran species are thought to have separated from a common ancestor approximately 250 million years ago, whereas the postulated sister-group relationship of Hymenoptera and Mecopteroidea [9] suggests a pre-permian divergence, with a predicted separate lineage evolution of over 280 million years for honey bees and dipterans [10].

Conclusions
The generation of a relative small set of Open Reading frame ESTs (ORESTES) that match and complement the already existent Apis EST database shows that this approach is sufficiently robust and favorably complements other strategies, such as ESTs prepared from normalized cDNA libraries. Its inherent properties of detecting transcripts of low abundance and aligning with central regions of transcripts [2,3] also make it a suitable tool in searches for novel honey bee genes and their annotation in parallel with ongoing genome sequencing projects. Furthermore, the genome comparisons performed in this and other studies [1,11] highlight that the elevated number of putative Apis-specific genes will still require extensive transcriptome sequencing for high quality genome annotation, and will play an important role in the question of insect genome organization and model systems in comparative studies [12].

Biological samples and RNA extraction
Samples of the four major stages of the honey bee life cycle were collected from Apis mellifera colonies (Africanized hybrids) kept in the experimental apiary of the Dept. Genetics, Univ. São Paulo, Campus Ribeirão Preto, Brazil. Each embryo sample contained approximately 300 eggs retrieved from a frame on which the queen had been caged for up to 72 hours. This assured that we covered the entire embryonic period. The larval sample was a representation of all five instars and included also spinningstage larvae. Prepupae and pupae, including white-eyed, pink-eyed, brown-eyed and pigmenting pupae, were pooled into the pupal samples. For the adult sample we collected newly emerged bees, a random sample of hive bees (picked from a brood frame), and returning foragers. All these samples were snap frozen in liquid nitrogen. Total RNA was isolated using TRIzol reagent (Invitrogen). The lipid-rich larval and pupal samples required two additional extraction steps with phenol/chloroform and chloroform to obtain RNA of adequate purity.
In the case of Drosophila melanogaster, dechorionated embryos, larvae plus prepupae and pupae, as well as adult flies were collected from an isogenic y, w 1118 stock of Drosophila melanogaster. These were immediately frozen in liquid nitrogen and stored at -80°C until use. Total RNA was extracted with TRIzol, as described for Apis mellifera.

Generation of Open Reading frame ESTs (ORESTES)
From high quality DNA-free total RNA samples we isolated poly(A) + RNA using an Oligotex II (Qiagen) kit. To assess poly(A) + RNA quality of the samples we performed Northern blot hybridizations with an actin (Apis mellifera) or tubulin (Drosophila melanogaster) probe. The probes were labeled by a random priming reaction in the presence of [α-32 P]dCTP. The actin fragment was amplified using the primers described in Table 4. The Drosophila tubulin probe was already available from previous studies. High quality total RNA preparations were subjected to a DNase I treatment, and the absence of DNA contaminants was assessed by Southern blot hybridization of PCR products amplified with Apis or Drosophila 16S mitochondrial DNA primers, respectively. High quality poly(A) + RNAs were aliquoted and stored at -80°C.
ORESTES profiles were generated according to Dias-Neto et al. [2]. Briefly, aliquots of 15 ng of purified mRNA were subjected to reverse transcription reactions utilizing SuperScript II Reverse Transcriptase (Invitrogen) and a set of randomly selected primers (Table 4). First-strand cDNA synthesis occurred at 37°C for 60 min in a total volume of 20 µl. The products of this reaction were diluted 1:5 in water and stored at -20°C. The cDNAs contained in 1 µl of each diluted RT-product were then amplified by PCR using the same or a single alternative random primer in a PCR mix (Ready-to-Go PCR bead, Amersham Biosciences). The amplification protocol consisted of an initial step at 75°C for 5 min, followed by a 45 cycles touchdown series (95°C for 30 s, a gradually decreasing annealing temperature from 66 to 44°C lasting 10 s per step and a decrease of 2°C per step, 72°C for 1 min), and a final extension reaction at 72°C for 7 min.
Aliquots of the PCR products (3-5 µl) were run on 1% agarose gels and stained with ethidium bromide. From profiles that presented near-even smears we excised two sets of amplification products, one covering a size range from 300 to 700 bp and a second one from 700 to 1500 bp. For cloning, these were extracted from the agarose gels (QIAquick Gel Extraction kit, Qiagen) and ligated into pUC18 (SureClone Ligation kit, Amersham Biosciences) for transformation of competent E. coli DH5α-cells by heat shock. Bacteria were grown in 2 × YT medium before aliquots were plated on 2 × YT agar containing ampicillin.
Blue-white selected positive colonies were picked, grown overnight in 2 × YT medium in 96-well plates, and used as templates for PCR using vector primers (M13 forward and reverse). An aliquot of each amplification product was analyzed on a 1% agarose gel before another 1 µl aliquot was submitted to DNA sequencing using standard protocols of the DYEnamic™ ET Terminator kit (Amersham Bio-sciences). The reaction products were analyzed in a MegaBACE™ 1000 automated sequencer. Only profiles with more than 80% positive PCR reactions were sequenced.

Sequence analysis
After passing through the Base Caller Cimaron 1.53 Slim Phredfy (insert size > 100, "N" nucleotides less than 20%, and "N" repetitions of less than 6 nucleotides) and Score-Card procedure (MegaBACE™) to check sequences quality, reads that were larger than 100 nt were submitted to an automated protocol for data analysis (Gene Annotation Pipeline) of the Apis mellifera or Drosophila melanogaster  ORESTES. The protocol consisted of the following steps: conversion of electropherograms (Phred, to formats .fasta, .phd and .qual), primer and vector detection and trimming (Cross_match) and masking of repeats (Repeat-Masker). Validated fasta format sequences were then submitted to a general BLASTN search against GenBank entries for mitochondrial and rRNA, as well as bacterial and fungal RNA to detect and eliminate contaminant sequences.
For the Apis mellifera ORESTES, subsequent BLASTN searches were performed against the approximately 15,500 Apis mellifera EST sequences deposited in GenBank dbEST. In this case, significant E values were set at 10 -30 . Searches against the non-redundant protein database entries used the BLASTX option with E-values set at 10 -10 as significance cut-off level.
CAP3 was used to clusterize the ORESTES sequences of both species. For Apis mellifera, the annotation of the 488 contigs was manually checked, giving preference to Drosophila sequences in the Unigene assignment. Subsequently, the contigs were batch submitted to a Gene Ontology procedure utilizing the FatiGO tools [13]http:// fatigo.bioinfo.cnio.es/. Clusterization of the Apis ORESTES contigs and singletons with the Apis mellifera ESTs deposited in GenBank dbEST was also performed using a CAP3 routine (standard parameters).