Construction and characterization of an expressed sequenced tag library for the mosquito vector Armigeres subalbatus

Background The mosquito, Armigeres subalbatus, mounts a distinctively robust innate immune response when infected with the nematode Brugia malayi, a causative agent of lymphatic filariasis. In order to mine the transcriptome for new insight into the cascade of events that takes place in response to infection in this mosquito, 6 cDNA libraries were generated from tissues of adult female mosquitoes subjected to immune-response activation treatments that lead to well-characterized responses, and from aging, naïve mosquitoes. Expressed sequence tags (ESTs) from each library were produced, annotated, and subjected to comparative analyses. Results Six libraries were constructed and used to generate 44,940 expressed sequence tags, of which 38,079 passed quality filters and were included in the annotation project and subsequent analyses. All of these sequences were collapsed into clusters, resulting in 8,020 unique sequence clusters or singletons. EST clusters were annotated and curated manually within the ASAP (A Systematic Annotation Package for Community Analysis of Genomes) web portal according to BLAST results from comparisons to GenBank, and the Anopheles gambiae and Drosophila melanogaster genome projects. Conclusion The resulting dataset is the first of its kind for this mosquito vector and provides a basis for future studies of mosquito vectors regarding the cascade of events that occurs in response to infection, thereby providing insight into vector competence and innate immunity.


Background
The perpetuation of mosquito-borne diseases is dependent on the compatibility of the pathogen with its invertebrate and vertebrate hosts, as dictated by each respective genome. The failure of traditional mosquito-borne disease control efforts to reduce the burden of these diseases on public health has created an incentive to develop a more comprehensive understanding of molecular interactions between host and pathogen, in order to develop novel means to control disease transmission. Innate immune responsiveness in the mosquito host is of particular interest in such explorations because extensive research efforts have shown that vector mosquito species produce robust humoral and cellular immune responses against invading pathogens [1][2][3][4].
A vector species that employs a unique, robust immune response against an invading pathogen is the mosquito, Armigeres subalbatus, a natural vector of the nematode parasites that cause lymphatic filariasis. This debilitating disease affects 120 million people annually, one third of whom suffer gross pathology (CDC 2006). Ar. subalbatus is ideally suited for laboratory studies of immune responsiveness because it is a natural vector of the filarial worm, Brugia pahangi, but it exhibits a refractory state to the microfilariae of Brugia malayi by virtue of a strong melanotic encapsulation response; therefore, it is the ideal organism for studying molecular mechanisms of the antifilarial worm response as a function of the broader innate immune capacity of the mosquito. In fact, Ar. subalbatus is one of the few species of mosquito to effectively use melanotic encapsulation as a natural defense mechanism against metazoan pathogens [5]. Ar. subalbatus also serves as a competent laboratory vector of Plasmodium gallinaceum, the causative agent of avian malaria in Asia ([6] and Christensen et al., unpublished data), and also has been implicated in the transmission of Japanese encephalitis virus in Taiwan [7,8].
Experimental evidence has shown that humoral and cellular immune responses play a fundamental role in mosquito refractoriness to a particular pathogen; however, very little is known about their genetic control. As a result, our laboratory is using Expressed Sequence Tags (ESTs) as a tool to elucidate the function of known genes and assist in the discovery of previously unknown, "immunity"-related genes. In addition, this high-throughput molecular approach to gene discovery provides the capacity to tactically design oligonucleotide-based microarrays that can be further used to gain insight into vector-pathogen interactions. With no genome sequencing project on the horizon for Ar. subalbatus, these EST libraries and microarrays constitute the only tools currently available to gauge immune responsiveness in this medically important vector species.
We previously reported a comprehensive analysis of ESTs from complementary DNA (cDNA) libraries created from adult, female Ar. subalbatus hemocytes [1]. Experimental evidence has shown the importance of mosquito hemocytes (blood cells) as both initiators and mediators of mosquito immune responses [9][10][11][12][13]; therefore, material was collected from the perfusate (which contains hemocytes) of mosquitoes inoculated with Micrococcus luteus and Escherichia coli at 1, 3, 6, 12, and 24 hours post bacterial inoculation. These bacterial species have been extensively used to examine immune peptide production in mosquitoes [14,15], and each activates a different arm of the innate immune response. The primary response of Ar. subalbatus to E. coli is phagocytosis, whereas the primary response to M. luteus is melanization, and it has been determined that this is independent of Gram type [10,11].
In order to more completely represent the baseline physiology and innate immune capabilities of this mosquito, cDNA libraries were created from adult, female Ar. subalbatus mRNA collected from whole body mosquitoes inoculated with the same mixture of bacteria. Material also was collected from whole body Ar. subalbatus exposed to filarial worm parasites. A blood meal containing B. malayi induces the melanization response in Ar. subalbatus; therefore, whole body material was collected from female mosquitoes 24, 48, and 72 hours after an infective blood feed. Intrathoracic injection of Dirofilaria immitis microfilariae into the mosquito's hemocoel also induces a strong melanotic encapsulation response in Ar. subalbatus and is a model system by which the immune response is stimulated without exposing the mosquito to both the parasite and a blood meal [16]. This model system for infection facilitates the uncoupling of two processes (namely blood meal digestion and ovarian development) that compete for biochemical resources [17]. Whole body mosquitoes inoculated with D. immitis were collected at 24 and 48 hours post-inoculation. Libraries also were constructed from 5-7 and 14-21 day old naïve whole body females to ensure representation of transcripts from non-immune-activated, aging mosquitoes. An attempt to sequence clones from a library from blood-fed naïve females was not successful.

Results and Discussion
Sequencing and clustering
Non-normalized cDNA libraries were constructed from newly emerged female mosquitoes inoculated with bacteria, inoculated or blood fed with filarial worm parasites, and from aging, naïve adult females. ESTs were sequenced from the 5' end by the University of Wisconsin Genome Sequencing Center and the National Yang-Ming University core facility, and were assembled to collapse the entire dataset, reduce redundancy, and simplify downstream annotation (Table 1). Of the 44,940 trace files generated by the two sequencing units, 38,079 passed quality control (85% success rate) and were sent to assembly, with an average high-quality (phred score 20+) length of 450 bases. Assembly collapsed the data into 8,020 clusters, of which 4,949 are composed of one trace (singletons). The deepest cluster contains 870 ESTs, with an average of 11 (+/-37) ESTs per cluster.
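The summary figures above can be cross-checked with simple arithmetic. The sketch below uses only the counts reported in this section; the reading that the mean of 11 ESTs per cluster refers to clusters with more than one member is an inference from the numbers, not something stated in the text.

```python
# Counts taken from the text above.
total_traces = 44_940
passed_qc = 38_079
clusters = 8_020
singletons = 4_949

# Fraction of traces that survived quality control.
pass_rate = passed_qc / total_traces

# Clusters with two or more ESTs, and the ESTs they contain.
multi_clusters = clusters - singletons
ests_in_multi = passed_qc - singletons

# Mean depth over all clusters versus multi-member clusters only.
mean_all = passed_qc / clusters
mean_multi = ests_in_multi / multi_clusters

print(f"QC pass rate: {pass_rate:.1%}")
print(f"Multi-member clusters: {multi_clusters}")
print(f"Mean ESTs per cluster (all): {mean_all:.2f}")
print(f"Mean ESTs per cluster (excluding singletons): {mean_multi:.2f}")
```

Under this reading, the pass rate rounds to the reported 85%, and the mean depth excluding singletons is close to the reported value of 11.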

Functional annotation of EST clusters and singletons
Consensus sequences from clustering were output in fasta format and used in comparisons to the GenBank nonredundant database, the D. melanogaster and An. gambiae genomes, and to the other Ar. subalbatus EST sets created during the project.
Each EST cluster/singleton and its corresponding sequence similarity data were uploaded into ASAP. Within the ASAP interface, annotators assessed sequence alignments and followed intact hyperlinks to NCBI, the Wellcome Trust Sanger Institute (Ensembl), FlyBase, and orthologous sequences within ASAP and at National Yang-Ming University, in order to ascribe a predicted gene product and/or function to each sequence. Supporting evidence for each annotation is typically in the form of a hyperlink to a database and can be viewed in ASAP. These annotations then were reviewed and approved or rejected by a curator. Annotation followed the controlled vocabulary established in a previous study, such that each EST cluster was attributed with some functional information and an indication of the quality of the BLAST hit used to attribute that information [1]. Of the 8,020 EST clusters, 2,843 were annotated as "unknown" (having no significant match to any of the databases searched), and 1,896 were annotated as "conserved unknown" with varying degrees of confidence. Sequences were submitted to NCBI as annotated EST clusters into the Core Nucleotide database and made available for public viewing through ASAP.

Library to library comparison
An analysis of EST clusters from the complete project was done by combining annotations and cluster composition (in terms of source libraries) to provide insight into the molecular effort put forth by the mosquito in the face of different types of immunological challenge. Within Microsoft Access, a table was built that contains EST clusters according to ASAP ID number, contig number (created during assembly in Seqman), project (cDNA library) from which ESTs were contributed, and the number of ESTs contributed per project. Queries were built to extract the number of EST clusters unique to a particular library (e.g., bacteria-inoculated whole body), or shared between projects (e.g., bacteria-inoculated whole body and hemocyte libraries) (Figure 1). Sixty-nine shared clusters are unique to the response to bacterial inoculation, 98 are unique to the response against filarial nematodes, and 4,498 are represented in at least one of the 4 immune-activated projects. Amongst those 4,498, 20 are represented in all 4 of those projects, perhaps indicative of the importance of these genes in immune responsiveness. Included amongst these 20 are a Clip domain serine protease (An. gambiae [ENSANGP00000017225]), Serpin 27A (D. melanogaster [FBgn0028990]), and Aslectin (AY426975), a ficolin-like pattern recognition molecule [18]. Unique to the response against B. malayi infection is a protein-tyrosine kinase, involved in the JAK-STAT cascade, which is represented by 107 ESTs.
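The kind of query described above (clusters unique to one combination of libraries, or present in the immune-activated libraries but absent from naïve ones) can be sketched with set operations. This is only an illustration in Python, not the Microsoft Access queries that were actually used; the cluster identifiers and EST counts are invented, and the library codes are the short names used in the table captions of this paper.

```python
from collections import defaultdict

# Hypothetical rows of the cluster table: (cluster_id, library, est_count).
rows = [
    ("c1", "asuhem", 5), ("c1", "imacbac", 2),
    ("c2", "brumal", 7),
    ("c3", "asuhem", 1), ("c3", "imacbac", 1),
    ("c3", "diroinf", 3), ("c3", "brumal", 2),
    ("c4", "n7", 4), ("c4", "brumal", 1),
    ("c5", "diroinf", 2), ("c5", "brumal", 6),
]

IMMUNE = {"asuhem", "imacbac", "diroinf", "brumal"}
NAIVE = {"n7", "n14"}


def libraries_by_cluster(rows):
    """Map each cluster to the set of libraries that contributed ESTs."""
    libs = defaultdict(set)
    for cluster_id, library, est_count in rows:
        if est_count > 0:
            libs[cluster_id].add(library)
    return dict(libs)


def clusters_exactly_in(libs, wanted):
    """Clusters whose ESTs come from exactly the given set of libraries."""
    wanted = set(wanted)
    return sorted(c for c, present in libs.items() if present == wanted)


def immune_only(libs, require_all=False):
    """Clusters absent from naive libraries and present in at least one
    (or, with require_all=True, in all) immune-activated libraries."""
    selected = []
    for cluster_id, present in libs.items():
        if present & NAIVE:
            continue
        hit = present & IMMUNE
        if (hit == IMMUNE) if require_all else bool(hit):
            selected.append(cluster_id)
    return sorted(selected)


libs = libraries_by_cluster(rows)
bacteria_shared = clusters_exactly_in(libs, {"asuhem", "imacbac"})
any_immune = immune_only(libs)
all_immune = immune_only(libs, require_all=True)
print(bacteria_shared, any_immune, all_immune)
```

The two modes of `immune_only` correspond to the "at least one of the 4" and "all 4" groupings reported in the text.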
To examine the statistical likelihood that the numbers of ESTs in each cluster represent a true sampling of the biological variation between the six libraries, and to compare the results of clustering with microarray results [19], cluster data were submitted to the IDEG6 website for analysis [20]. The number of ESTs in each cluster was normalized based on the number of total ESTs collected and the total number of ESTs in each library. Six statistics were compared, including Audic and Claverie [21], Greller and Tobin [22], Stekel [23], Chi Square 2 × 2, general Chi Square, and Fisher's Exact Test, all corrected via a Bonferroni method. The results of the entire test are included as a supplementary table (see Additional file 1). Table 2 presents the 99 EST clusters that show the highest significant difference between libraries (p < 0.00001, R > 4). Although significant increases in ESTs encoding immune-related products are observed (i.e. sequences expected to be increasing according to infection status of the mosquitoes used to collect material for libraries), this is not always the case. Several clusters that encode "housekeeping" products are demonstrably enriched for ESTs from immune-challenged libraries (e.g. cytochromes and dehydrogenases), suggesting that these metabolic genes play essential roles in the physiology of an immune and/or stress response. In addition, many of the clusters that are significantly different between libraries encode gene products of completely unknown function. A comparison with microarray data from Aliota et al. [19] shows some overlap between the two methods. Out of the 99 clusters that are significantly different (Table 2), 19 share significant changes when compared with microarray data from B. malayi infected females (highlighted in Table 2). Combined, these EST and microarray data provide several target sequences for further study in relation to mosquito innate immunity.
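To make the multi-library comparison concrete, the sketch below computes a log-likelihood-ratio statistic of the form commonly attributed to Stekel et al. [23], R = Σ x_i · ln(x_i / (N_i · f)), where x_i is the EST count for a cluster in library i, N_i is the size of library i, and f is the pooled frequency of the cluster across all libraries. This form, the use of the natural logarithm, and the example counts are assumptions made for illustration; IDEG6 may scale or report the statistic differently. A Bonferroni-adjusted threshold is included because the text states that all tests were corrected in this way.

```python
import math


def stekel_r(counts, library_sizes):
    """Log-likelihood-ratio statistic for one cluster across libraries.

    counts[i] is the number of ESTs from library i in the cluster;
    library_sizes[i] is the total number of ESTs in library i.
    Libraries with a zero count contribute nothing (x * ln x -> 0).
    """
    total_count = sum(counts)
    if total_count == 0:
        return 0.0
    pooled_freq = total_count / sum(library_sizes)
    r = 0.0
    for x, n in zip(counts, library_sizes):
        if x > 0:
            r += x * math.log(x / (n * pooled_freq))
    return r


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical cluster: counts proportional to library size give R = 0.
even = stekel_r([10, 20], [1000, 2000])
# Hypothetical cluster found in only one of two equal-sized libraries.
skewed = stekel_r([30, 0], [1000, 1000])
print(even, skewed, bonferroni_threshold(0.05, 8020))
```

A cluster sampled in proportion to library size yields R near zero, while a cluster concentrated in one library yields a large R, which is the behaviour exploited when ranking clusters by differential representation.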

Gene ontology
To attribute more functional information to annotations in ASAP, Gene Ontology (GO) classifications were migrated from Flybase annotations to homologous Ar. subalbatus clusters, because Flybase contains the most complete dataset for a related species from which to draw. GO annotations were attributed to 2,793 (35% of total) EST clusters. From the perspective of the entire dataset,
Figure 1 A comparison of EST clusters from 6 Ar. subalbatus cDNA libraries (8,020 clusters total). The type of immune response activation for mosquitoes is listed in the primary row and column. At the intersection of each row and column, the number of clusters unique to that combination of libraries is listed in bold, followed by the number of those clusters that are designated as unknown (U) or conserved unknown (CU). Clusters from the 4 immune response activated libraries (Immune activated combined) were queried against the naïve libraries such that: a cluster is represented in at least 1 of the 4 libraries (but not in naive) (top -1), or clusters are represented in all 4 of the libraries (but not in naive) (bottom -2).
Table 2 Clusters and their constituent ESTs were analysed using IDEG6 [20] to find clusters where the number of ESTs were statistically different between the libraries. Of 8,020 clusters analysed, 99 showed a significantly differential number of ESTs collected from one or more of the libraries. Each row represents the Genbank Accession number for cluster, and the columns are the number of ESTs from each of the six libraries that are a member of it. The libraries are: asuhem (bacteria-inoculated, hemocyte), diroinf (whole body, Dirofilaria immitis injected), imacbac (bacteria-inoculated whole body), brumal (blood-fed Brugia malayi), n7 and n14 (naïve, 5-7 and 12-14 days of age). The R value is the inverse log of the Stekel R Score [23], and the Chi value is a general Chi square analysis.
Of particular interest for this dataset are those clusters related to innate immunity, so a more in-depth (4th and 6th tier) view is presented (Figure 2).
Because of the unique immune response capabilities of Ar. subalbatus, EST clusters were interrogated beyond the GO analysis for clusters encoding immunity-related proteins. Those clusters encoding proteins that have a documented role in Ar. subalbatus immunity were sorted according to representation in different libraries (Table 3). In addition, immunity-related genes and proteins were subdivided into categories including CASPs (caspases), CATs (catalases), and CLIPs (CLIP-domain serine proteases) (Tables 3 and 4). Clusters identified as immunity-related according to homology to genes in ImmunoDB [24] were broken down into the number of ESTs represented per cluster from each library (Table 4).
Summary of Gene Ontology assignments to 2,793 Ar. subalbatus clusters
This analysis underscores the degree to which immunity-related ESTs are enriched in libraries from bacteria-inoculated mosquitoes. Particularly from the hemocyte library, ESTs from all subcategories are represented in abundance (Table 4). We expected to see some evidence of increased abundance of ESTs related to melanization, because published reports on the melanization response indicate that phenoloxidase is up-regulated as a result of immune-response activation [5]. However, few ESTs representing the biochemical pathway of melanogenesis were evident amongst the clusters (see Table 3). This limited representation could be a result of cloning bias inherent in library production, or could have been introduced by inoculation methodology, or even by wound healing. Alternatively, up-regulation may not be necessary to effect the response that we know to be occurring in the mosquito at the time points chosen for library construction [19].

Comparisons with Ae. aegypti, An. gambiae, and D. melanogaster
The family Culicidae contains approximately 2,500 species of mosquitoes, of which only a handful are capable of vectoring disease. Much of the current effort to understand the molecular components of vector competence has focused on An. gambiae and Ae. aegypti [25,26], because these species transmit disease agents that have a tremendous impact on global public health (malaria, and dengue fever and yellow fever viruses, respectively). Comparative genomic analyses of these mosquitoes and of the ongoing genome project on Culex pipiens quinquefasciatus, as compared to the fruit fly, have provided and will continue to provide resources to bolster systematic studies of common and mosquito species-specific gene function [25][26][27][28]. This includes gaining new insight into the molecular basis of insecticide resistance, host-seeking behaviour, blood feeding, and vector-parasite interactions that are unique to blood-feeding (hematophagous) vectors. The last of these is perhaps the most dramatic separation between the mosquitoes and fruit flies: hematophagy is intimately tied to a variety of physiologies including oogenesis and immunity, and therefore imposes unique demands on mosquitoes as compared to Drosophila. In a microarray analysis of An. gambiae, 25% of the genes on the array changed transcript levels in response to blood-feeding [29].
Among the Diptera, there is an evolutionary divergence of approximately 250 million years separating mosquitoes from D. melanogaster. The mosquitoes An. gambiae and Ae. aegypti are separated by 150 million years [26]. An. gambiae is a member of the subfamily Anophelinae, which contains the primary vectors of human malaria. In contrast, Ae. aegypti is a member of the subfamily Culicinae, which contains the majority of mosquito species that are of medical or veterinary importance, e.g., Aedes, Culex, Armigeres, and Mansonia.
These two mosquito subfamilies differ significantly in genomic structure [30][31][32], and in vector competence. Broadly, Anopheles species are most often incriminated as vectors of parasitic disease agents (e.g., malaria and filarial worm parasites), and Aedes and Culex species are critically important in the transmission of arthropod-borne viruses as well as filarial worms.
Table 4 Peptide sequences were harvested from ImmunoDB [24] for An. gambiae, Ae. aegypti, and D. melanogaster for each of the following gene families: All 8,020 EST clusters were blasted against a database of these peptides using blastx (e-value cutoff of 1e-5). BLAST hits were parsed from output files using tcl_blast_parser_123_V017.tcl [55] with a cut-off of 40% match, 1e-20 normalized e-value, and a minimum match length of 30 residues. Top hits were taken for each significant match and verified manually. Matches then were categorized by family and subfamily according to Waterhouse et al. (2007) (Left column) and the composition (# of compiled sequences) of EST clusters were collated from the assembly according to the library(s) from which ESTs came: asuhem (bacteria-inoculated, hemocyte), diroinf (whole body, Dirofilaria immitis injected), imacbac (bacteria-inoculated whole body), brumal (blood-fed Brugia malayi), n7 and n14 (naïve, 5-7 and 12-14 days of age). Stekel R values [23] were determined for the groupings using IDEG6 [20] and those having a score of 0.001 or better were considered significantly differential, and are flagged with bolded and underlined text for the row.
Figure 3 Homologous sequences for Ar. subalbatus found in fly databases. A) Comparative analysis of Ar. subalbatus EST clusters with predicted peptides from 3 other mosquito species with completed genomes: Ae. aegypti, An. gambiae, and C.p. quinquefasciatus. Overlapping regions indicate homologous sequences from blastx searches against the peptide databases. A homolog is defined as having an e-value cutoff of 1e-20, a percent match of 40% (true matches, not conserved), and a minimum match length of 30 for the high-scoring segment pair (HSP). This comparison includes 8,020 possible cluster sequences from Ar. subalbatus (brackets), of which 3,013 had no homolog. Boxes directly adjacent to circles indicate 1) the species being compared to Ar. subalbatus, and 2) the total # of homologous sequences between that species and Ar. subalbatus. B) A gene list of the total of overlapping and non-overlapping Ar. subalbatus homologs to Ae. aegypti, An. gambiae, and C.p. quinquefasciatus was compared to a gene list of homologs found to D. melanogaster. A significant number of genes (2,074) from Ar. subalbatus have no homolog in the fruit fly, but qualify as homologs to genes in other mosquito species.
Ar. subalbatus is a competent vector of viruses and parasites, and is more closely related to Ae. aegypti than to An. gambiae; Ae. aegypti and Ar. subalbatus are phylogenetically linked at the level of tribe (Culicini). Therefore, comparisons between these two species of mosquito may provide unique insights into vector competence and innate immunity.
Based on the evolutionary distance, vector status, and vector competence of the fly species for which we have genome data, we asked: of the 8,020 EST clusters or singletons, how many have homologs in the available databases for 4 fly genomes/transcriptomes? The output from blastx analysis of predicted peptide sequences was filtered to search for homologous sequences using an e-value cutoff of 1e-20, a percent match of 40% (true matches, not conserved), and a minimum match length of 30 for the high-scoring segment pair. A large number of clusters (3,013 (38%)) did not have a homolog in any database as defined by this screen.
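The homolog screen described above amounts to three thresholds applied to each BLAST hit. The following is a minimal sketch of such a filter, assuming hits are available in the standard 12-column tabular BLAST format (query, subject, percent identity, alignment length, ..., e-value, bit score); the original analysis used a Tcl parser [55], so the input format, the function names, and the example identifiers here are illustrative assumptions rather than the pipeline that was run.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pct_identity: float
    align_len: int
    evalue: float


def parse_tabular_line(line: str) -> Hit:
    """Parse one line of 12-column tabular BLAST output."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]))


def is_homolog(hit: Hit, max_evalue: float = 1e-20,
               min_identity: float = 40.0, min_length: int = 30) -> bool:
    """Apply the three cutoffs named in the text to a single hit."""
    return (hit.evalue <= max_evalue
            and hit.pct_identity >= min_identity
            and hit.align_len >= min_length)


def best_homologs(lines: Iterable[str]) -> Dict[str, Hit]:
    """Keep the lowest e-value hit per query among hits passing the screen."""
    best: Dict[str, Hit] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        hit = parse_tabular_line(line)
        if not is_homolog(hit):
            continue
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return best


# Hypothetical hits; all identifiers and values are invented.
example = [
    "cl1\tpepA\t62.5\t120\t45\t0\t1\t360\t10\t129\t3e-45\t180",
    "cl1\tpepB\t55.0\t100\t45\t0\t1\t300\t10\t109\t1e-30\t130",
    "cl2\tpepC\t35.0\t200\t130\t0\t1\t600\t1\t200\t1e-25\t110",
    "cl3\tpepD\t80.0\t25\t5\t0\t1\t75\t1\t25\t1e-22\t90",
    "cl4\tpepE\t48.0\t60\t31\t0\t1\t180\t1\t60\t1e-10\t60",
]
homologs = best_homologs(example)
print(sorted(homologs))
```

In this example only `cl1` survives: the other queries fail on identity, match length, or e-value respectively. Clusters with no surviving hit in any database would be counted among those with no homolog.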
Those clusters that were homologous were subjected to Venn analysis (Figure 3A) to discover overlapping predicted peptides in 3 other mosquito species: Ae. aegypti (Ae Vectorbase AaegL1.1), An. gambiae (Anoph Vectorbase AgamP3), and C.p. quinquefasciatus (Cpip Vectorbase CpipJ1.0_5), and the fruit fly, D. melanogaster. The mosquito with the largest number of gene products that are uniquely homologous to Ar. subalbatus is Ae. aegypti, as would be predicted by the degree of relatedness of these two mosquitoes. In comparing Ar. subalbatus to all available mosquito and Drosophila homologous predicted peptides, 2,908 sequences are represented in all fly species. A significant number (2,074) of clusters from Ar. subalbatus qualify as homologs to genes in other mosquito species, but have no homolog in the fruit fly (Figure 3B).
Taking this one step further, from quantity of hits to quality of hits, we looked at the frequency distribution of e-value hits for the homologous sequences (Figure 4). There is an obvious shift toward more significant e-values for homologs in Ae. aegypti, a shift away from more significant e-values for homologs in Drosophila and Anopheles, and homologs in Cx. pipiens display an intermediate shift, closer to that seen in Ae. aegypti.
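The Venn analysis can be expressed as set arithmetic over the clusters that have a homolog in each species. The sketch below is a hypothetical illustration: the species keys and cluster identifiers are invented, and the function simply returns, for every combination of species, the clusters with homologs in exactly those species and no others.

```python
from itertools import combinations


def venn_regions(homologs_by_species):
    """Return {species_combination: clusters found in exactly that combination}."""
    species = sorted(homologs_by_species)
    regions = {}
    for k in range(1, len(species) + 1):
        for combo in combinations(species, k):
            # Clusters with a homolog in every species of the combination...
            inside = set.intersection(*(homologs_by_species[s] for s in combo))
            # ...minus clusters that also have a homolog in any other species.
            others = [homologs_by_species[s] for s in species if s not in combo]
            outside = set().union(*others) if others else set()
            regions[combo] = inside - outside
    return regions


# Hypothetical homolog lists for three mosquito species and the fruit fly.
mosquito = {
    "Ae": {"c1", "c2", "c3"},
    "An": {"c1", "c3"},
    "Cx": {"c1", "c4"},
}
fly = {"c1", "c2"}

regions = venn_regions(mosquito)
# Clusters with a homolog in at least one mosquito but none in the fly.
mosquito_only = set().union(*mosquito.values()) - fly
print(regions[("Ae", "An", "Cx")], sorted(mosquito_only))
```

The last set corresponds to the comparison summarized in Figure 3B, where mosquito homologs are contrasted with the fruit fly homolog list.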

Conclusion
Following recognition of any pathogen in a mosquito, a cascade of innate immune responses ensues that can include humoral responses (e.g. production of antimicrobial peptides), cellular (e.g. phagocytosis) and cell-mediated events (e.g. melanotic encapsulation). Because immunity-related genes function in concert to clear a pathogen [33,34], it is informative to use a holistic approach when evaluating expression and/or regulation; i.e., it is likely that most of these genes are not activated independently of other immune-response genes. For example, in Ar. subalbatus, the biochemical pathway required for melanin biosynthesis is well characterized, but there is much to learn about the anti-filarial worm response as a whole in this mosquito species. What is readily apparent from the limited number of functional genomics studies that have investigated insect immunity is that we really do not know very much about the mechanisms required to successfully eliminate an invading pathogen from a refractory mosquito (Aliota et al. [19]), and the subsequent changes necessary for a successful return to homeostasis.
There are a large number of unknown genes found in this and many other EST and microarray projects. We hypothesize that a large proportion of these unknowns are functionally linked to the unique and specific immune response of Ar. subalbatus, because of the material used to construct the libraries from which ESTs were produced. The rapidly expanding bank of large EST datasets and whole genome sequences for mosquitoes [1,26,28,35-39] provides the capability to critically evaluate the unknowns in the context of the many characterized facets of innate immunity, simultaneously. A microarray platform based on this Ar. subalbatus EST dataset has been designed for this purpose, and was screened with material from immune-response activated mosquitoes (Aliota et al. [19]).
For comparative purposes at the species level, this large dataset provides an important addition to the available sequence databases. Dipterans exhibit extraordinary variation in morphology, behaviour and physiology, so these ESTs add to the ongoing and increasingly powerful comparisons of fly species [29,40-42]. By virtue of hematophagy, mosquitoes are presented with unique physiologic challenges as compared to fruit flies; at a minimum, blood-feeding requires host-seeking, triggers oogenesis, and exposes mosquitoes to a variety of bloodborne pathogens. Some of these challenges are shared with other vectors of disease agents. Vector-borne diseases such as malaria, leishmaniasis, African and American trypanosomiasis, Lyme disease and epidemic typhus are caused by disease agents that are transmitted by mosquitoes, sandflies, tsetse, kissing bugs, ticks and body lice, respectively. There is a great deal of promise for enhancing our understanding of vector biology through genome sequencing and functional genomics analysis that will be increasingly available for a number of these species [43].

Mosquito maintenance
Ar. subalbatus was obtained from the University of Notre Dame in 1986. Larvae were hatched in distilled water and fed a ground slurry of Tetramin® fish food. Pupae were separated by sex, and females transferred in lots of 80 to cartons. Adult females were fed on 0.3 M sucrose-soaked cotton balls. All mosquitoes were maintained at 26.5° ± 1°C, 75% ± 10% relative humidity with a 16 hr/8 hr light/dark cycle beginning and ending with a 90 min crepuscular period [1].

Immune response activation and tissue collection
To construct libraries from immune response activated mosquitoes, 2-3 day old adult female Ar. subalbatus either were inoculated or infected with the pathogen or parasite known to elicit the response of interest. Naïve blood fed mosquitoes: Sucrose was removed from the cartons 14-16 h prior to presenting mosquitoes to uninfected gerbils for a blood meal. Gerbils were anesthetized as described previously. Replete females were returned to the insectary for 24, 48, or 72 hours prior to harvesting. The library developed from this source did not produce quality sequences, so sequence data are unavailable.

Tissue collection
RNA was isolated from 5-10 whole bodies for the following libraries: E. coli and M. luteus inoculated, D. immitis inoculated, B. malayi infected, naïve 7-day, naïve 14-day, and naïve blood-fed. For whole body collection, infected or inoculated female mosquitoes were collected at the aforementioned time points, frozen on dry ice, and stored at -80°C until ready for extraction. Frozen bodies were homogenized in a 1.5 ml tube using a Kontes® tissue grinder in the presence of guanidinium thiocyanate-phenol-chloroform solution [48]. For hemocyte-derived bacteria-inoculated libraries, a volume displacement method was used, as previously described [1]. One drop of perfusate was collected from each mosquito, kept on dry ice, and stored at -80°C until ready for extraction.

Library construction
RNA was extracted from mosquito whole bodies or hemocytes by single-step guanidinium thiocyanate-phenol-chloroform extraction [48]. RNA was visualized on ethidium bromide-stained agarose gels to confirm quality, and then material from all time points was pooled. Complementary DNA libraries were constructed using the SMART cDNA Library Construction kit (Clontech, Palo Alto, CA). Purified RNA was poly(A) selected for the long range PCR templates for whole body libraries, while total RNA was used for the hemocyte libraries.

Sequence collection
For all libraries, sequence data were collected as previously described [1]. Briefly, plaques were blue/white screened, isolated by robotic picker, and used directly as a template in PCR reactions at the University of Wisconsin. At Yang-Ming the library was subjected to a mass excision protocol to produce plasmid templates for sequencing, as described in the manufacturer's protocol. The number of ESTs produced from either method is described in Table 1.

EST clustering
A total of 44,940 trace files from both the UW and Yang-Ming collections were base-called and vector-trimmed using phred version 0.020425.c [49,50]. A "trim_cutoff" value of 0.025 was used to remove poor quality bases from the ends of reads, and SCF3 trace files were output for downstream clustering. Verified duplicate files from replicate sequencing were removed from the pool to reduce perceived cluster depth and improve data analysis. Poly-A tails and any remaining vector sequences were then removed with TIGR's seqclean [51], traces identified as contaminants from E. coli or any of the pathogens used for stimulus were removed, and finally, all traces with fewer than 51 bases of quality sequence were discarded, resulting in 38,079 traces proceeding into the assembler.
Quality trace data were clustered using LaserGene Seqman, Genome Edition (DNASTAR, Inc.) [52] on a Windows XP workstation. A rapid, high stringency clustering was performed first, using the "Fast Assembler" module, with the following parameters: minimum match 90%, match size 25, match spacing 150, gap penalty 0, gap length penalty 0.7, end position mismatch 0, and minimum sequence length 50. These parameters are very conservative within the context of this program (i.e., they minimize false joins), so further automated merging with the "Classic Assembler" module was performed at a match size of 12 and a minimum match percentage of 90%. All other parameters were set to default. This had the effect of merging clusters that were very closely related with minimal gap sizes.
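The final two cleaning steps (poly-A removal and the minimum-length cutoff) can be illustrated with a short sketch. This is not a reimplementation of seqclean or phred: the minimum poly-A run length of 10 and the read names and sequences are assumptions chosen for illustration, and only the 51-base cutoff comes from the text.

```python
import re

MIN_QUALITY_BASES = 51  # reads with fewer bases than this are discarded


def strip_poly_a(seq, min_run=10):
    """Remove a trailing run of at least min_run A's (assumed threshold)."""
    return re.sub(rf"A{{{min_run},}}$", "", seq.upper())


def filter_reads(reads):
    """Trim poly-A tails, then keep reads that meet the length cutoff."""
    kept = {}
    for name, seq in reads.items():
        trimmed = strip_poly_a(seq)
        if len(trimmed) >= MIN_QUALITY_BASES:
            kept[name] = trimmed
    return kept


# Hypothetical reads.
reads = {
    "r1": "ACGT" * 20 + "A" * 15,  # 80 bases after trimming: kept
    "r2": "ACGT" * 10 + "A" * 20,  # 40 bases after trimming: discarded
    "r3": "ACGT" * 13,             # 52 bases, no tail: kept
}
clean = filter_reads(reads)
print({name: len(seq) for name, seq in clean.items()})
```

Applying the length cutoff after trimming, as here, ensures that a read is judged on the sequence that would actually enter the assembler.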

Similarity searches
To predict gene products and assign gene ontology classifications, EST clusters were compared to sequences from the GenPept database (Genbank version 156) and gene products from the whole genome annotations of D. melanogaster (Flybase version 5.1), An. gambiae (ENSEMBL genebuild 41), and Ae. aegypti (Vectorbase version L1.1). A FASTA-formatted file was collected from the assembly software, and subjected to BLASTX searches using the aforementioned databases. An E-value cut-off of 1e-3 was used to reduce non-informative hits, and filtering was not used. Search results were uploaded to the A Systematic Annotation Package for Community Analysis of Genomes (ASAP) annotation workbench for manual annotation [53].

Sequence annotation
The annotations of EST clusters in ASAP were conducted in a similar fashion as outlined previously [1], excluding protein domain searches due to the large data set size. Homolog information was collected for both An. gambiae (Ensembl) and Ae. aegypti (Vectorbase), with links provided to those databases. Special attention was paid to Gene Ontology descriptors on the matches to D. melanogaster in Flybase. Where an annotation to Flybase was rated "putative" or better, Gene Ontology information was transferred onto the cluster annotation.

Data sharing
All data for this project are publicly accessible in ASAP via the web as annotated collapsed EST clusters [54]. Individual ESTs have been deposited with the National Center for Biotechnology Information (NCBI) dbEST: database of Expressed Sequence Tags, under the following accession number range: EU204979-EU212998.
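As a quick consistency check, the size of the accession range can be computed directly. The sketch assumes that the range is contiguous and inclusive and that both accessions share a two-letter prefix; the function name is hypothetical.

```python
def accession_count(first, last, prefix_len=2):
    """Number of accessions in an inclusive, contiguous range."""
    if first[:prefix_len] != last[:prefix_len]:
        raise ValueError("accessions must share the same prefix")
    return int(last[prefix_len:]) - int(first[prefix_len:]) + 1


n_accessions = accession_count("EU204979", "EU212998")
print(n_accessions)
```

Under these assumptions the range spans 8,020 accessions, which matches the number of EST clusters reported above.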