Collembase: a repository for springtail genomics and soil quality assessment

Background Environmental quality assessment is traditionally based on responses of reproduction and survival of indicator organisms. For soil assessment the springtail Folsomia candida (Collembola) is an accepted standard test organism. We argue that environmental quality assessment using gene expression profiles of indicator organisms exposed to test substrates is more sensitive, more toxicant specific and significantly faster than current risk assessment methods. To apply this species as a genomic model for soil quality testing we conducted an EST sequencing project and developed an online database. Description Collembase is a web-accessible database comprising springtail (F. candida) genomic data. Presently, the database contains information on 8686 ESTs that are assembled into 5952 unique gene objects. Of those gene objects ~40% showed homology to other protein sequences available in GenBank (blastx analysis; non-redundant (nr) database; expect-value < 10-5). Software was applied to infer protein sequences. The putative peptides, which had an average length of 115 amino-acids (ranging between 23 and 440) were annotated with Gene Ontology (GO) terms. In total 1025 peptides (~17% of the gene objects) were assigned at least one GO term (expect-value < 10-25). Within Collembase searches can be conducted based on BLAST and GO annotation, cluster name or using a BLAST server. The system furthermore enables easy sequence retrieval for functional genomic and Quantitative-PCR experiments. Sequences are submitted to GenBank (Accession numbers: EV473060 – EV481745). Conclusion Collembase is a resource of sequence data on the springtail F. candida. The information within the database will be linked to a custom made microarray, based on the Agilent platform, which can be applied for soil quality testing. In addition, Collembase supplies information that is valuable for related scientific disciplines such as molecular ecology, ecogenomics, molecular evolution and phylogenetics.


Background
Organisms are able to maintain homeostasis in changing environments by regulating their metabolic machinery. To accomplish this, organisms continuously have to adjust the expression of their genes. This is particularly evident when environmental challenges drive organisms to the boundaries of their ecological niche and induce stress responses (e.g. [1]). In recent years, significant understanding has been obtained on the signal transduction pathways by which stress affects gene transcription [2]. The question arises whether it is possible to sense aspects of the environment by investigating transcriptional profiles of exposed organisms.
Recent advances in the field of toxicogenomics suggest that environmental quality can indeed be diagnosed by transcriptional profiling [3] and it is generally acknowledged that genomic techniques, and more specifically transcriptomics, have the potential to revolutionize environmental risk assessment [4][5][6][7][8][9]. The prospects are that gene expression studies will enable a fast and sensitive detection and evaluation of environmental stressors and toxicants. This is strengthened by the fact that several recent studies have shown that transcription profiling can be applied as an early indicator of toxicity [10,11] in a dose-dependent manner [12].
We started a project that aims to develop a microarraybased methodology for soil quality assessment using the parthenogenetic springtail Folsomia candida (Collembola). This species, which is easy to culture and has a short generation time, was chosen because it is already a standard test organism in ecotoxicology [13]. It lives in direct contact with the soil and toxicological data are already widely available (e.g. ECOTOX database from U.S. EPA [14]). Furthermore, a standard test looking at survival and reproduction after 28 day exposure is in place that follows OECD (Organisation for Economic Co-operation and Development) and ISO (International Standard Organization) guidelines. Although the latter test is conducted in a standardized laboratory setting, it has been shown that the outcomes are predictive of natural situations [15]. However, there are several shortcomings to the current test. First, it does not provide information about the nature of the stressor. Second, the mode of action of toxicants cannot be verified. Third, the test is time-consuming as it lasts for at least 28 days. Finally, the test is rather labor intensive.
By extending the ISO standard test with genomic technologies, these shortcomings may be circumvented. However, genomic information on F. candida is very poor: a search for sequences yields only 52 hits in the National Center of Biotechnology (NCBI;[16]) nucleotide database (July 5 th 2007), mainly consisting of 18S rRNA, 28S rRNA and cytochrome c oxidase sequences used as phylogenetic markers.
A time-and cost effective way to retrieve sequence information on the functional part of the genome is to set up an Expressed Sequence Tag (EST) project, which was conducted for the F. candida transcriptome. Here we report on the sequencing and annotation of ~9000 ESTs, which form the starting point for the construction of an oligo array that can be applied in soil quality testing. The sequences were processed, assembled, BLAST-based annotated and stored in a web-accessible database [17]. The database can be searched for BLAST-based annotations and Gene Ontology terms [18] and by using a stand alone BLAST server. Collembase furthermore enables retrieval of sequence information on (differentially) expressed genes, which can then be applied in functional genomic and Quantitative-Polymerase Chain Reaction (Q-PCR) validation experiments.
Although Collembase was primarily created for the development of a microarray, we expect that it is of interest for researchers outside the field of ecotoxicology as well. Due to its short generation time, F. candida is often used in ecological studies [13]. In addition, Collembola have a crucial position in the phylogeny of the arthropods and, thus, also have the attention from evolutionary biologists (e.g. [19]). The retrieved genome data will significantly enhance molecular ecological and evolutionary studies on F. candida.

Construction of cDNA libraries
To restrict redundant sequencing we chose to start our EST project with a normalized cDNA pool. RNA extraction from the parthenogenetic, clonally reproducing collembolan Folsomia candida (laboratory strain 'Berlin'; Vrije Universiteit Amsterdam) was carried out using the Spin Vacuum (SV) Total RNA isolation system (Promega). Animals (eggs, juveniles and adult females) were taken from a culture of mixed age with a more or less even age distribution. All animals (~100 mg) were pooled before RNA extraction. Concentration and purity of the total RNA pool was checked by UV absorption (260 and 280 nm). Quality of total RNA was evaluated on a 1% agarose gel (stained with SYBR Gold stain; Invitrogen) and on an Agilent BioAnalyzer (Agilent Technologies). Afterwards 0.1 volumes of 3 M sodium acetate and 3 volumes of 96% ethanol were added and total RNA was shipped at room temperature to Evrogen (Moscow, Russia).
Double-stranded cDNA synthesis (SMART technology [20]), normalization and library construction were performed by Evrogen. The reaction was started with 0.3 μg total RNA and cDNA was SMART amplified (18 PCR cycles) and normalized by the procedure described by [21], which consists of cDNA denaturation/reassociation, a duplex-specific nuclease (DNS) treatment [22] and PCR amplification. The cDNA thus obtained was used for library construction as follows. The cDNA was incubated with restriction enzymes Sbf1 and Not1, and ligated into Sbf1 and Not1 digested pAL17.2 vector (Evrogen). The resulting plasmids were subsequently transformed into E. coli (Evrogen). Finally, glycerol stocks were made (17% glycerol), which were transferred to the Vrije Universiteit (Amsterdam) on dry-ice and stored at -80°C until further use.
Real-time PCR was performed on an Opticon 1 real-time PCR machine (MJ Research) using SYBR green 2X Mastermix (Finnzymes), according to [23]. Real time PCR reactions used 3 μl normalized and non-normalized nonligated cDNA template (0.2 μg/100μl). The program used for amplification was: denaturation (95°C for 15 min.), 2-step amplification and quantification (92°C for 15s, 60°C for 1 min. and one fluorescence measurement), melting curve program (60-90°C with a heating rate of 0.1°C per second and one fluorescence measurement per second). As can be seen in Figure 1 the normalization procedure was effective: transcripts that were highly abundant in the original pool ( Figure 1A) occurred considerably diminished after normalization ( Figure 1B) as compared to lower abundant transcripts. Differences in Ct-values between the high abundant 28S rRNA and β-actin transcripts and the less abundant USP-RXR and RNA helicase Dead1 transcripts was reduced from about 14 cycles to less than three cycles. However, the least abundant transcripts (Ultrabithorax and Kruppel) were not very well enriched: they maintained high Ct-values.
De Boer et al. (unpublished data) constructed cDNA libraries enriched for stress responsive genes as described by [24]. In short, 960 clones were isolated from each of two subtracted cDNA libraries enriched for 1) cadmiumand 2) phenanthrene responsive genes. Both libraries were built using the suppression subtractive hybridization procedure (SSH) [25] making use of poly (A)+ RNA isolated from ~150 exposed unsynchronized adult individuals (whole body; laboratory strain 'Berlin'; Vrije Universiteit Amsterdam). Exposure to cadmium was performed by placing animals on cellulose filters wetted to approximately 50% water-holding capacity with a 267 μmole/l CdCl 2 solution for 48 h. Animals were exposed to phenanthrene by placing them on a compressed layer of LUFA 2.2 soil spiked with 840 μm/kg phenanthrene according to the standard ISO11267 [26] protocol for 6 days.

EST sequencing, bioinformatics and construction of the database
In total, 9984 cDNA clones were picked and sequenced (Greenomics; Wageningen University and Research Center) using the M13 forward primer. Clones originating from the normalized library were sequenced from the 5' end of the gene (8064 total). The cDNA fragments from the SSH procedure were not ligated directionally, and Relative abundance of six cDNAs before (upper) and after (lower) normalization as measured using quantitative PCR Figure 1 Relative abundance of six cDNAs before (upper) and after (lower) normalization as measured using quantitative PCR. Act: β-actin; 28S: 28S rDNA; De: RNA helicase Dead1; RXR: RXR-USP; Ub: Ultrabithorax; Kr: Kruppel.
therefore not sequenced from a predefined orientation (960 clones from each of the two libraries).
Raw trace files were processed using Trace2dbest [27], employing a Phred [28,29] quality threshold of 20 and a minimal high quality sequence length of 150 base pairs (bp). Of the 9984 sequences 1142 sequences did not pass the quality control, and were excluded from further analysis. A summary of the number of sequences that remained from each of the three libraries after processing of the raw data is given in Table 1.
and Phrap (P. Green, personal communication [31]) were applied, as part of the Partigene script [27], to cluster and assemble the ESTs into unique gene objects. This procedure resulted in 6092 unique sequences. There were 4686 singletons and 1406 clusters with more than one sequence. Of those 1406 clusters 920 consisted of two sequences only. The redundancy (defined as total number of sequences/clusters) was 1.45, 1.32 and 1.62 for the total dataset, the normalized library and the cadmium library respectively, but appeared considerably higher in the phenanthrene enriched library (3.18). The highest sequence depth also occurred the phenanthrene enriched library with 98 ESTs in one cluster, compared to a maximum of 31 and 16 ESTs per cluster for the normalized and cadmium library respectively.
Sequences that were assigned to one cluster were not always assembled into one single contiguous consensus sequence (contig) by Phrap, due to high quality base pair differences between sequences. The Phrap assembly (Partigene default criteria) resulted in a total number of 6212 contigs instead of the 6092 given above ( Table 2). The length of those 6212 contigs ranged between 153 bp and 1636 bp and was on average 520 bp (see Additional file 2). The sequence variation that was observed within those clusters might constitute natural occurring (allelic) variation (e.g. Single Nucleotide Polymorphisms), Taq polymerase errors and/or gene duplications, and will have to be confirmed by re-sequencing efforts.
Furthermore, a PERL script, which is made available on [17], was used to determine the sequence overlap between the three libraries. This script determined for each cluster which library contributed ESTs to that cluster. The overlap appeared rather low (Figure 2). Only seven clusters contained sequences from each of the three libraries (Table  3). At least three of those clusters remained un-annotated. However, it has to be mentioned that the sequence overlap that was observed might be an underestimation of the actual overlap in the database, as 5' sequencing (Normalized library) generally results in an overestimation of the number of unique sequences [32].
The contigs were subjected to BLAST [33] searches of Gen-Bank using blastx (against non-redundant database), blastn (against non-redundant database), tblastx (against dbEST) and an additional blastx (against non-redundant database restricted to Insecta). In addition, sequences were compared to all known and predicted proteins of Caenorhabditis elegans, Drosophila melanogaster and Mus musculus. Those species were chosen as they have fully sequenced genomes. In addition, C. elegans and D. melanogaster belong, like F. candida, to the group of molting animals (Ecdysozoa). A summary of the BLAST analyses is given in Table 4. Clusters that were perfect nucleotide matches to baker's yeast (Saccharomyces cerevisiae; 125 clusters) and human sequences (15 clusters) were regarded as contamination and later on removed. The relatively high number of yeast clusters observed (~2%) is explained by the fact that in our laboratory F. candida is fed baker's yeast. The fact that the food of F. candida is in itself a genomic model species was advantageous when pruning the database: these sequences are readily identified by their high bit and e-values scores in the BLAST searches.
F. candida harbors intracellular bacteria of the genus Wolbachia [34] and its gut contains many bacterial species as well [35]. Those might turn up as contaminating sequences in the EST dataset. To pinpoint contaminating sequences from bacterial origin the clusters were compared to all protein encoding sequences found in the genome of Escherichia coli (GenBank: U00096) and in the Wolbachia endosymbiont of Drosophila melanogaster (Gen-Bank: AE017196). Sequences showing significant homology to E. coli or Wolbachia (blastx; e-value < 10 -5 ), but not to D. melanogaster, C. elegans or M. musculus, were marked as putative contaminants. In total 70 of such clus-  Table 3 shows the five most abundant transcripts for each of the three libraries. The SSH procedure conducted on phenanthrene exposed animals appeared efficient. Of the top five phenanthrene clusters three show high similarities to monooxygenases of the cytochrome P450 enzyme family, which are known to be involved in phase I biotransformation of lipophilic substances such as phenanthrene [36]. The two other clusters show homology to other monooxygenases, and might be involved in phase I metabolism as well. The results for the cadmium library are less straightforward. Two of the five most abundant clusters remain un-annotated, and two clusters show resemblance to accessions that are not from animal origin. Note that one of those two latter clusters (cluster Fcc00170) occurred in all three libraries ( Table 3). As with the 'bacterial clusters', those clusters are currently not discarded from the database and are submitted to GenBank. Supplementary experiments will be conducted to determine the exact origin of those clusters, and whether or not they represent contaminants.
The absence of highly expressed house-keeping genes among the five most abundant transcripts in the normalized library, suggests that the normalization procedure was successful. Without normalization more highly abundant transcripts, like tubulins, ribosomal proteins and actins, would have been sequenced (e.g. [37]). Although these sequences are present in the dataset, they do not form the list of most abundantly sequenced transcripts. For example, more than 40 ribosomal protein sequences were obtained (e.g. cluster Fcc02740), but most of these were represented by only one or two ESTs.  and human mRNA from the analysis). The Partigene [27] PERL scripts were used to store all the information in a web-accessible relational database [17]. All processed ESTs, excluding the ones marked as human and yeast contamination, were submitted to dbEST (accession numbers: -).

Current contents of the database
Currently, Collembase comprises data on 8686 ESTs, which are structured in 5952 clusters. That is 6092 minus the 140 clusters from yeast and human origin. To enable easy access to the sequence dataset, the information gathered was stored in a relational database and a web-interface was created. For all clusters data is offered on (1) the ESTs within a cluster and their clone names, (2) the cDNA library from which the ESTs originated, (3) blastx and blastn hits against GenBank 'nr' databases and tblastx hits against dbEST, which all will be updated regularly, (4) the consensus sequences as generated by Phrap, and (5) the GO terms when available. Furthermore, for each cluster the BLAST results and the processed ESTs can be downloaded.
Collembase can be explored library-specific using text queries (e.g. cluster name or BLAST annotation) and by sequence similarity using a local BLAST [33] server. Furthermore, a Primer3 web-server [40] was implemented to enable PCR primer design on the assembled sequences.

Future application and intended uses of the database Soil quality and risk assessment
The dataset presented here was generated mainly to obtain the required genomic information to construct a microarray for soil quality assessment. The array, which is based on the Agilent microarray technology, is linked to  (12) No Significant Hit --Fcc00018 (9) No Significant Hit --

*) e-value from blastn
Collembase: The 60-mer oligos printed on the chip follow the nomenclature of the clusters from which they were derived. This "linkage" enables straightforward sequence retrieval. Sequences of differentially expressed genes can be downloaded from Collembase and used in validation experiments (e.g. Q-PCR). Furthermore, in the near future we intend to store microarray and Q-PCR gene expression data as well. This freely accessible online repository will allow evaluation and analysis of the data by the scientific community (sensu [41]).
The small overlap between the toxicant enriched libraries and the normalized library (Figure 2), in combination with the higher redundancy of the toxicant enriched libraries (especially the phenanthrene library), suggests that metal and PAH exposure trigger different genes in F. candida. Although our expression data on F. candida still have to be verified by actual gene expression assays, such specificity would imply that transcription profiles contain a signature of the nature of the stress, and that different stresses can be distinguished by transcription profiling. This view is strengthened by a recent ecotoxicogenomic study by [42]. These authors showed that in the crustacean Daphnia magna different substances belonging to one chemical class (metals) can be discriminated on the basis of their characteristic expression profiles. Finally, we believe that transcription profiling will enable mechanistic insight in responses to mixtures of toxicants, a relatively new and unknown field in (eco)toxicology.

Other applications
The collembolan F. candida is frequently used in experimental studies (for a recent review see [13]), therefore Collembase could be useful outside the field of ecotoxicology as well. We expect applicability in the following research areas:

Ecogenomics
To fully disentangle the molecular mechanisms by which organisms deal with ecological challenges and environ- F. candida has this potential as the species is easy to rear in the laboratory, reproduces parthenogenetically, has a short generation time, has a well-defined ecology and is traceable in (mesocosm) field experiments. It seems obvious that the sequence information stored in Collembase can be exploited to answer ecological questions, e.g. related to drought-tolerance, starvation and microbial resistance in soil ecosystems.

Molecular ecology and population genetics
The EST dataset presented holds information applicable in molecular ecological-and population genetic studies. For example, within the dataset 184 contigs showing one or more tandem-repeats (microsatellites) with a minimum of five repeats were discovered using the MISA PERL script [45] (Additional file 4). Within some of the clusters up to three different alleles were observed. However, due to the limited redundancy in our dataset and the fact that the libraries were constructed from animals from one parthenogenetic strain it is impossible to determine their degree of polymorphism. Still, in theory those loci are molecular markers that can be applied to unravel the forces that maintain genetic diversity and generate population genetic structure in this soil and cave inhabiting species. Furthermore, it seems obvious that the dataset and its accompanying microarray could be helpful in finding out whether transcriptional regulation is an important driver of adaptive evolution in this species.

Phylogenetics and comparative genomics
Collembola take an exceptional and fascinating position in the tree of life. Together with other basal hexapods (e.g. Protura, Diplura) they are positioned in-between the insects and crustaceans. However, recently some authors suggested that the six-legged body plan found among basal hexapods and insects evolved minimally twice (e.g. [46,47]). The dataset presented here might add the sequence information that is needed to gain a more detailed insight into the evolution of these groups, and the relationship between insects and crustaceans. Using the BLAST tool, Collembase can be queried for genes valuable for phylogenetic inference. Degenerate PCR primers can be developed on the retrieved sequences to obtain information on other basal hexapod groups.

Conclusion
Collembase provides EST and related data on the springtail F. candida. In the near future this database will be supplemented with microarray expression data. We expect that our strategy will impact soil quality testing. In addi-tion, it is clear that Collembase holds information applicable to many fields of ecological sciences (e.g. molecular ecology and ecogenomics, molecular evolution and phylogenetics).
formatics analyses and drafted the manuscript. MdB constructed the libraries enriched for stress-responsive genes. TdB, BN and JM assisted in setting up the project, and the laboratory work. RK-L coordinated the sequencing at Greenomics, Wageningen UR. NvS participated in the conception of the study, and helped to draft the manuscript. DR participated in experimental design, supervised the project and shaped the final version of the manuscript. All authors have read and approved the final version of the manuscript.