Characterizing Ancylostoma caninum transcriptome and exploring nematode parasitic adaptation

Background Hookworm infection is one of the most important neglected diseases in developing countries, with approximately 1 billion people infected worldwide. To better understand hookworm biology and nematode parasitism, the present study generated a near complete transcriptome of the canine hookworm Ancylostoma caninum to a very high coverage using high throughput technology, and compared it to those of the free-living nematode Caenorhabditis elegans and the parasite Brugia malayi. Results The generated transcripts from four developmental stages, infective L3, serum stimulated L3, adult male and adult female, covered 93% of the A. caninum transcriptome. The broad diversity among nematode transcriptomes was confirmed, and an impact of parasitic adaptation on transcriptome diversity was inferred. Intra-population analysis showed that A. caninum has higher coding sequence diversity than humans. Examining the developmental expression profiles of A. caninum revealed major transitions in gene expression from larval stages to adult. Adult males expressed the highest number of selectively expressed genes, but adult female expressed the highest number of selective parasitism-related genes. Genes related to parasitism adaptation and A. caninum specific genes exhibited more expression selectivity while those conserved in nematodes tend to be consistently expressed. Parasitism related genes were expressed more selectively in adult male and female worms. The comprehensive analysis of digital expression profiles along with transcriptome comparisons enabled identification of a set of parasitism genes encoding secretory proteins in animal parasitic nematode. Conclusions This study validated the usage of deep sequencing for gene expression profiling. Parasitic adaptation of the canine hookworm is related to its diversity and developmental dynamics. This comprehensive comparative genomic and expression study substantially improves our understanding of the basic biology and parasitism of hookworms and, is expected, in the long run, to accelerate research toward development of vaccines and novel anthelmintics.


Background
Genomic data is revolutionizing molecular parasitology and has been used to prioritize drug targets in parasites at a genomic level [1]. Similarly, pan-phylum genomic studies in parasitic nematodes have identified both highly conserved nematode-specific proteins [2], which are attractive as drug candidates as their targeting will not affect the host, and nematode-specific indels in essential proteins [3], which could also be good candidates for "indel-based" drug design in nematodes. Complementing the genomic data, expression data reflects the dynamics of genetic information. Analysis of digital expression data, obtained by sequencing cDNAs, is crucial for studying and understanding organism's development, physiology, and environmental adaptation. Knowledge of these mechanisms in parasites is essential to substantially accelerate research toward the development of both new therapies to prevent parasite infections and vaccines (or novel anthelmintics) needed to control them.
Infection of humans by parasitic nematodes results in substantial human mortality and morbidity, especially in tropical regions of Africa, Asia, and the Americas. Hook-worms, probably the most significant public health threat of these nematodes, are the second largest contributor to the 26.7 million annual DALYs (Disability Adjusted Life Years) from iron-deficiency anemia due to blood feeding by adult worms. Chronic anemia from hookworm infection is particularly devastating to children, who suffer from stunted growth and impaired intellectual development, to mothers who are at increased risk for anemia during pregnancy and childbirth, and to the elderly [4][5][6]. Current hookworm control strategies are limited to deworming of infected people using anthelmintic drugs. However, rapid re-infection in endemic areas and the lack of sterile immunity necessitates repeated treatments, which will in turn result in resistance. The high rates of re-infection after drug therapy mean that vaccines remain the best hope for worm control in humans in the future. No vaccine is yet available, despite substantial support from The Bill and Melinda Gates Foundation specifically for the development of a hookworm vaccine [7]. Until safe and effective vaccines are developed, anthelmintics will continue to be used for treatment and control of nematode infections in humans. Thus, there is a critical need for further research to identify new vaccine and drug targets which requires better understanding of the biology and parasitism of these devastating parasites.
Ancylostoma caninum, a canine hookworm closely related to the human parasites Ancylostoma duodenale and Necator americanus [8], is the most widely used model for human hookworm infections [9]. Similar to other hookworms, adult A. caninum inhabit the small intestine and produce eggs that pass in the feces and hatch in the soil. The first stage larva feeds on bacteria and molts twice to form the non-feeding, infective third stage (iL3). iL3 enters the host by penetrating the skin, molts twice, and matures in to the adult (Ad) stage in the small intestine. A. caninum iL3 can also infect a host, temporarily abort maturation and enter an arrested state (hypobiosis) within the host's somatic tissues [10], reactivating in response to host physiological changes such as pregnancy [11].
A. caninum is a Clade V nematode [12] that also includes the well-studied free-living model nematode Caenorhabditis elegans. C. elegans was the first multicellular genome to be sequenced [13] and it remains the only metazoan for which the sequence of every nucleotide is known to high confidence. Recently, the genome of the human parasite Brugia malayi, has been sequenced and analyzed [14]. Brugia malayi is phylogenetically classified in Clade III [15]. The distant phylogenetic relationship between A. caninum and B. malayi (compared to A. caninum and C. elegans) makes investigation of nematode adaptation to parasitism easier, as similarities shared by A. caninum and B. malayi (but not by C. elegans) are likely to be associated with adaptation to parasitism. Our previous studies based on limited coverage of A. caninum revealed the existence of genes unique to hookworm and the different selective pressures on these genes [16,17]. Another study using microarray technology [18] found several hundred genes in A. caninum changed their expression during the worm's transition from a free living to a parasitic larva. However, because of the limitation of the data and/or methods of these studies, many questions, especially those related to parasitism, remain to be fully explored.
To better understand the biology of parasitism and facilitate prioritization of potential vaccine and drug targets, the present study deeply sequenced the A. caninum transcriptome with a combination of two distinct sequencing technologies, ABI Sanger capillary and 454/ Roche massively parallel sequencing platforms. Over 1.5 million cDNAs were generated from different cDNA libraries constructed from pre-parasitic, parasitic larval and adult stages. These reads covered over 90% of the A. caninum transcriptome with an average depth of 10×. This dataset was also used to perform comprehensive comparative analysis among A. caninum, B.malayi and C. elegans, and the unprecedented depth of coverage enabled comparison of digital expression profiles leading to reliable identification of differentially expressed genes during development. This study provides the first nearly complete transcriptome from a parasitic nematode and provides valuable information about nematode adaptations to parasitism, in addition to revealing several candidates for further study as drug target or vaccine components.
Assembly, which was performed to reduce data redundancy and improve sequence quality and length, grouped the sequences into 48,326 transcripts longer than 90 bp, for a total of 23 Mb. The transcript consensus sequences are available at http://nematode.net [19]. The average transcript length was 477 bp and average coverage was 10×. Using the core eukaryotic genes as a reference, we estimated that 93% of the A. caninum transcriptome is identified (See additional file 1), making this the first par-asitic nematode with a near complete sequenced transcriptome.

Nematode transcriptome diversity and parasitism related genes
Even though A. caninum and C. elegans fall in the same phylogenetic Clade (Clade V) [12], only about 20% of A. caninum transcripts are homologous to C. elegans coding genes, and even lower number (14%) to B. malayi coding genes ( Figure 1A). However, when only considering the highly expressed transcripts (those sequenced deeply enough to provide confident stage selectivity in this case) about 43% of A.caninum transcripts are homologous to C. elegans. When all the transcripts were considered, the vast majority (77%) of the A. caninum transcripts were species-specific. This indicates high transcriptome diversity among nematodes. However, this diversity did not correspond to a drastic difference on functional level. The total unique number of KOs associated to the A. caninum and C. elegans genes were very similar (Table 2), with only one (amino acid metabolism pathway; P < 0.001) out of the 33 identified pathways having a statistically significant increased number of unique KOs (359 vs. 315) in A. caninum.
There were 1,643 transcripts with B. malayi homologs (1,365 genes) but no C. elegans homologs ( Figure 1A) despite A. caninum being more closely related phylogenetically to C. elegans. The majority of these transcripts (1,093 out of 1,643) failed to find any GO annotations. Nevertheless, functions of the 550 transcripts having GO annotation are enriched in 3 GO terms, prolyl oligopeptidase activity (GO:0004287, P = 3.5e-6), nucleic acid binding (GO: 0003676, P = 5.1e-5), and DNA binding (GO: 0003677, P = 1.7e-3), with the most enriched category being prolyl oligopeptidase activity. In addition, malic enzyme activity was enriched (P = 5.2e-3) though it failed our FDR cutoff because of the small number of entries in this activity. As a comparison, no GO term enrichment was detected when considering the B. malayi genes with homology to C. elegans but not A. caninum. Meanwhile, homology comparison among the free-living C. elegans and the parasites A. caninum and B. malayi found that more B. malayi genes share homology with A. caninum (5,991) than with C. elegans (5,532) ( Figure 1B). The higher number of homologous genes among parasites was statistically significant (P < 1.0e-4, Chi-square test). Since B. malayi (Clade III) is phylogentically more distant from A. caninum than C. elegans (both are in Clade V) [12], B. malayi would share a similar level of homology with both C. elegans and A. caninum if parasitism had no effect on gene evolution. Therefore, we hypothesize that the 1,643 transcripts represent putative parasitism related genes.

Comparative genomics of gene expression during development of A. caninum
The deep sequencing of this study allowed us to examine differential expression of inferred transcripts and shed light on their functions due to the association of the gene expression with molecular function. Figure 2 summarizes the expression selectivity of the 16,359 transcripts that have sequencing depth for confident estimation of expression selectivity, and Table 3 contains the ten most abundant transcripts selectively expressed in different stages (The list of the top 20 most abundant transcripts is available as additional file 2). Only half of the transcripts are expressed through all the stages. The most dramatic change observed between the developmental stages studied was the transition from larvae to adult: more than 3,000 transcripts were turned off and nearly 1,500 turned on. This is not unexpected, especially since the developing L4 stage was not examined. The serum stimulation turned off 78 transcripts, and turned on 401 transcripts. More than one third of the turned-on genes were turned Among the 401 transcripts turned on by serum stimulation, 13 of them are parasitism related ( Figure 2). Given that there are a total of 562 parasitism related transcripts among all the 16,359 transcripts (whose expression selectivity could be confidently ascertained), there is no evidence to support that serum stimulation triggers an extensive expression of parasitism related genes. However, parasitism related genes are more selective expressed in adults. Nearly 18% of these genes (99 out of 562) exhibit M, F, or M and F selectivity, which is significant (P s < 1.0e-4, Chi-square test) when compared to the overall of 9% (1,505 out of 16,359; Figure 2). Interesting, compared to the small fraction of male specific transcripts related to parasitism (17/604), a large fraction of female specific transcripts (35 out of 213) are parasitism related.
The majority of nematode conserved transcripts (2,789 of 4,056) exhibited constant expression over all stages, while less than 40% (3,446 of 8,783) of the A. caninum specific transcripts exhibited the same expression pattern ( Figure 2). This difference is highly significant statistically (P < 1.0E-10, Chi-square test). More than 80% of iL3 selective transcripts (66 out of 78) are A. caninum specific. The different expression pattern of conserved tran-scripts and A. caninum specific transcripts suggests caution in using cDNA data to estimate transcriptome diversity. Using limited number of cDNA reads can underestimate the diversity. For example, the homolog rate between A. caninum and C. elegans would be 54% ((8074-3446-304)/8074) if only the transcripts expressed constantly across the life cycle were considered while that is 20% when all transcripts are included.
The expression profiles defined by our sequencing were compared to the data published by Datu et al., [18]. Datu et al. studied transcriptional changes in the hookworm, A. caninum, during the transition from a free-living to a parasitic larva using suppression subtractive hybridization (SSH) and custom oligonucleotide microarray printed with the SSH expressed sequence tags. Comparison of the two expression profiles of the most highly upregulated mRNAs associated with serum stimulation obtained by different orthogonal approaches confirmed consistency in expression of 9 out of the 10 up-regulated mRNA associated with serum stimulation. The only one not consistent is cDNA that has been broken into several transcripts in our assembly, therefore giving rise to this discrepancy (additional file 3).

Functional profile of transcripts with different expression selectivity
Examining the function of transcripts with different expression patterns using GO terms revealed that enriched and depleted GO terms correlate to the biology of the corresponding stages. The heatmaps of molecular function GO terms ( Figure 3 and Figure 4; See additional file 4) shows that transcripts with different expression selectivity exhibit different functional profiles, although that of the transcripts selectively expressed in both iL3 and ssL3 is closer to that of those constantly expressed, and those of the transcripts selectively expressed in F, M, and both F and M look more similar. Significantly enriched GO terms in the category of molecular functions are shown in Table 4. The top three enriched GO terms of transcripts expressed constantly over all stages are zinc ion binding (GO:0008270, P = 7.4e-12), protein binding (GO:0005515, P = 1.8e-11), and nucleic acid binding (GO:0003676, P = 9.3e-11). The top three transcripts selectively expressed in both F and M are astacin activity (GO:0008533, P = 8.8e-17), a structural constituent of cuticle (GO:0042302, P = 3.5e-8), and cysteine-type endopeptidase activity (GO:0004197, P = 4.0e-11). These terms have all been previously associated with genes involved in parasitism [20][21][22]. The three most significant enriched GO terms of the selectively expressed transcripts in both iL3 and ssL3 are rhodopsin-like receptor activity (GO:0001584, P = 3.1e-9), N-acetyltransferase activity (GO:0008080, P = 6.6e-6), and sugar hydrogen
Interesting, the most significant enriched GO term of the transcripts exclusively expressed in M is serine-type endopeptidase inhibitor activity (GO:0004867, P = 2.6e-8).

Secretory parasitism related genes in A. caninum
Next we attempted to identify transcripts encoding secreted proteins associated with parasitism. Of the 562 transcripts whose differential expression we were able to define (the top 60 differentially expressed A. caninum transcript with homology to B. malayi but not C. elegans are available as additional file 5), 112 were not expressed in the preparasitic iL3 stage but expressed in other stages (Figure 2), suggesting a potential role in parasitism. How-ever, we were only able to detect a secretory signal peptide in 9 transcripts (See additional file 6). This low number might underestimate the true number due to the fragmented nature of our data. Another possibility is that some secretory proteins are released by alternate, poorly characterized secretory pathways, and would therefore be missed by searching for secretory peptide sequences, as was shown recently for B. malayi secreted proteins [23].
In addition, we found more than 700 transcripts with signal peptides from the transcripts that have neither B. malai nor C. elegans homologs (data not shown).

Intra-population polymorphism, Synonymous/nonsynonymous Single Nucleotide Polymorphism and positive selection
Due to their high evolutionary rates [24,25], nematodes are believed to have a significant number of single nucleotide polymorphisms (SNPs). A large number of sites with SNP (76,568) were detected over the total 23,038,913 assembled bases (total length of the 48,326 transcripts). Since the average coverage of our transcripts is about 10×, we estimated θ (= 4Nμ) for A. caninum as 1.2 × 10 -3 . Among the SNP sites only 345 sites have more than two alleles. We obtained confident translations for 6,502 cDNAs containing 20,715 of the 76,568 SNPs. Of these, 10,848 were non-synonymous, and 9,867 were synonymous, with an average dN/dS ratio of 0.3. Among the 518 transcripts possessing more than 9 polymorphic sites, 39 were under positive selection (dN/dS > 1.0), and there were two functional categories identified by the associated GO terms: GO:0004298 (threonine endopeptidase activity and GO:0006511 (ubiquitin-dependent protein catabolic activity). Of these, one parasitism related transcript (contig43771 encoding a protein histidine kinase) exhibited more than 9 polymorphic sites and was under positive selection.

Discussion
The A. caninum transcriptome was sequenced with unprecedented coverage in the present study. While fragmentation is still an issue, the non-biased cloning-free transcript sampling using the Roche/454 technology combined with the conventional Sanger technology in this study enabled an in depth sampling of over 93% of the A. caninum transcriptome. Comparing the A. caninum transcriptome with the coding sequences of C. elegans and B. malayi confirmed the high diversity of nematode transcriptomes. Intra-species studies revealed high expression dynamics of the nematode transcriptome, and suggested an impact of the adaptation to parasitism on A. caninum genes and gene expression. Nematodes have higher evolutionary rate than most other eukaryotes [24,25]. Only 20% of our A. caninum transcripts shared homology to C. elegans genes. Since A. caninum and C. elegans are from the same phylogenetic clade [12], this lowly shared homology clearly illustrates a high evolutionary rate. The high rate can lead to high polymorphism within species. Our estimation of θ for A. caninum is 1.2 × 10 -3 , which is about 2 times higher than that of human coding regions [26]. We may have significantly underestimated θ because each final transcript is probably derived from multiple individuals rather than single worms. This high DNA polymorphism is in agreement with the high evolutionary rate of A. caninum. Most SNPs are di-allelic just as detected in human, which suggest a similar mechanism shaping SNPs in both human and nematodes. High mutation rate and diversity are features of the phylum Nematoda. A previous survey of more than 30 nematode species distributed over four nematode clades found that only about 15% of sequenced ESTs could be found in all four nematode clades [27]. The same study suggested that about 30-50% of nematode genes are species specific. Interestingly, our previous studies [16,17], based on limited number of A. caninum genes (9,000 and 4,000 genes respectively), found that about 50% of the A. caninum genes had homologs in Caenorhabditis species. This discrepancy is likely because the conserved genes tend to be expressed at higher levels and therefore are sequenced more deeply. The previous transcriptome studies analyzed transcripts generated using conventional cDNA libraries (cloning based with capillary sequencing), therefore only the most abundant transcripts were represented in those studies. In fact, when only considering the highly expressed transcripts (at least 10 reads sequenced), the homologous rate between A.caninum and C. elegans is 43%.
The high evolutionary rate and diversity of nematodes may contribute to their ability to adapt to nearly every habitat on earth [28]. In addition, parasitism has evolved independently at least nine times in nematodes [29], and the evolution of parasitism plays a role in shaping the nematode transcriptome. The comparative genomic analysis showed that significantly more B. malayi coding genes share homology with A. caninum than with C. elegans. B. malayi is in clade III and both A. caninum and C. elegans are in clade V, thus it is expected that B. malayi share similar homologs with A. caninum and C. elegans without parasitism adaptation. Parasitic nematodes originated from non-parasites, with subsequent adaptation to the host environment. One major difference between free-living and host environments is the availability and abundance of oxygen. Intestinal parasites like hookworms must adapt to the low oxygen levels in their host by using alternative energy and metabolism pathways. They also need to develop systems to evade the host defense mechanisms. The enriched GO terms of the A. caninum transcripts that have homology with B. malayi genes but not with C. elegans reveal the effect of these adaptations. The most significantly enriched GO term is prolyl oligopeptidase activity. Prolyl oligopeptidase is a family of serinetype endopeptidases [30]. One of its members in the parasitic kinetoplastid, Trypanosoma cruzi, is critical for the parasite to invade mammalian host cells [31]. Another highly represented transcript is malic enzyme. Malic enzyme converts malate to pyruvate in the mitochondrion, and is important for adaptation to low oxygen environment in the host [32]. In addition, transcripts encoding nucleic acid binding and DNA binding proteins were also significantly enriched. It is possible that parasitic nematodes interfere with the host's transcription and translation system during invasion, or these transcripts encode endogenous enzymes required for further development and morphological changes that occur in the host. In contrast to the GO term enrichment of these A. caninum transcripts, the B. malayi genes sharing homology with C. elegans only did not exhibit any GO term enrichment.
Despite the strong adaptive capability of nematodes, we failed to find evidence of strong positive selection in A. caninum. Only 7.5% of transcripts are under possible positive selection by the dN/dS test. We also failed to detect an extensive positive selection for the parasitism related genes. The lack of extensive positive selection in nematodes could suggest that these organisms have a high mutation rate. We expect our on-going nematode genomic projects to provide additional information about nematode evolution http://www.genome.gov/10002154. Gene expression in A. caninum is highly dynamic, with only half of the genes being constantly expressed over all four stages. This 50% is actually a conservative estimation because we excluded the transcripts that lacked sequencing depth. The most drastic difference is found among expression profiles of larvae and adults, indicating the strong effect of adaptation to the host environment. This is confirmed by the functional GO term enrichments. Genes selectively expressed in parasitic adults (both in F and M) are enriched in functions important for parasitism. In fact, 10 out of the 14 GO terms enriched in parasitic adults are related to parasitism. Cysteine endopeptidase is indispensable for parasites, and its numerous functions enabling parasites to defend against their hosts are well-documented [20,21]. Cysteine proteases are also important for digestion of protein and hemoglobin in the blood meal [33]. Astacin plays a crucial role in A. caninum tissue migration [22]. The cuticle is a protective external layer of nematodes that provides the primary defense for parasitic nematodes. Several cuticlar changes associated with parasitism have been documented [34]. The glyceraldehyde-3-phosphate dehydrogenase enzyme is necessary for parasites to utilize host glucose as an energy source. In contrast, genes selectively expressed in larvae tend to be enriched in functions related to basic cellular functions such as oxidoreductase activity, signal transduction, and ribosome structure. The most significant term (based on P-value of GO enrichment test) is rhodopsin-like receptor activity, which are chemosensory receptors in C. elegans and may be required for larval nematodes to interact with their free-living environment.
Surprisingly, only a small number of genes were turned on by serum stimulation, which indicate that exposure to host-like conditions in vitro does not trigger immediate gene expression changes. Also, serum stimulation did not turn on expression of an increased number of parasitism related genes in our study. This suggests that hookworms do not drastically change gene expression during the transition to parasitism upon entering their hosts, i.e. many of the molecules that are released during infections are already synthesized in the iL3 and stored for rapid release during infection [35][36][37]. However, using differential subtractive hybrization, Datu et al [18] found that the genes expressed in serum stimulated A. caninum L3 did not overlap with genes associated with developmental changes during recovery of C. elegans dauers, and suggested that genes expressed in response to activation with serum were involved in parasitism rather than development [18]. One transcript selectively expressed in ssL3 stage in our study encoded triacylglycerol lipase, which is known to play a role during dauer recovery in C. elegans [38]. Invading hookworm L3 have been compared to recovering dauer larvae [39]. Another group of transcripts expressed selectively in ssL3 versus iL3 are the allergen V5/Tpx-1 related proteins or Ancylostoma secreted proteins (ASPs), originally isolated from excretory/secretory products of A. caninum L3 [35], and subsequently from numerous other nematodes. While its function is unknown, a second, related set of ASPs have been described from the adult stages [40], and were among the most abundant transcripts in adult M and F in this study (Table 3). Overall, expression dynamics correlated with progression through the stages of the life cycle.
Genes with different evolutionary conservation exhibit different expression pattern. Genes conserved among nematodes tend to be expressed consistently in all stages, and A. caninum specific genes tend to be expressed more selectively. These expression patterns can be associated with their corresponding functions. Conserved genes are involved in basic cellular activities and thus are required for all stages. Species-specific genes are functionally more closely related to the specific life cycle stage and niche. For example, the iL3 selective transcripts are mainly A. caninum specific and are depleted of primary metabolic process (GO:0044238) and macromolecule metabolic process (GO:0043170) (data not shown). In addition, the parasitism related genes are more like to be selectively expressed in adult stages. One would expect this since the adults are parasitic whereas the infective larvae are usually free-living. We found that the male adult differentially expressed transcripts are enriched in serine-type endopeptidase inhibitor activity. The male reproductive tract of mammals is enriched in peptidase inhibitors that function in protection and regulation of fertilization [41], suggesting a similar mechanism may be at work in nematodes.

Conclusion
In summary, this study allowed generation and cataloging of all genes expressed in four transcriptomes of A. caninum. Our analysis segregated those genes in multiple dimensions including functional, developmental and phylogenetic categories. The observations agreed with, but also extended, information on previously described genes, suggesting that the newly discovered genes will provide additional unique value. This information identified sets of stage-specific genes, as well as pre-parasitic and parasitic genes that defined differences in metabolic and cellular processes between stages. Furthermore, parasitic adaptation has been shown to be related to transcriptome diversity and developmental dynamics. This dataset is a resource for more complete microarrays, RT-PCR, RNA interference and proteomics. The latter can identify parasite proteins that occur in specific developmental stages, parasite excretory-secretory products, and the external cuticular surface. In addition, the cDNAs generated will enable better annotation of the upcoming genome sequence http://www.genome.gov/10002154. Such extended genomic studies will aid in the identification of genes involved in host recognition, infection, migration and immune invasion as well as the categorization of targets for vaccine and anthelmintic drugs. Finally, the methodology developed in this study illustrates the effectiveness of deep sequencing as a means for analyzing differential gene expression.

Nematode extraction
The Baltimore strain of A. caninum (U.S. National Parasite Collection accession 100655.00) was maintained in beagles as described [42]. Animals were housed and treated in accordance with George Washington University Institutional Care and Use Committee guidelines. Infective L3 (iL3) were recovered from 7-10 day old coprocultures using a modified Baermann technique, washed clean of debris with BU buffer (50 mM Na2PO4/ 22 mM KH2PO4/70 mM NaCl, pH 6.8; Hawdon et al., 1991), and treated with 1% HCl in BU for 30 min at 22°C. The larvae were washed twice with sterile BU and snapfrozen by immersion in liquid N 2 . Frozen larvae were stored at -80°C until used for library construction. Acti-vated (serum stimulated) larvae were generated as described previously (Brand et al, 2004). Briefly, approximately 5,000 A. caninum L 3 were incubated in a 500 μl volume of RPMI 1640 medium supplemented with 25 mM HEPES (pH 7.0) and antibiotics (RPMI-complete) containing 10% canine serum filtrate (<10 kDa ultrafiltrate) and 15 mm S-methyl-glutathione (Sigma Chemical). Negative-control (non-activated) L3 were incubated in RPMI-c alone. L3 were incubated in 10% CO 2 at 37°C for 24 hours. Following incubation, the medium containing the L3 was collected, transferred to microcentrifuge tubes and centrifuged for 5 minutes at 14,000 rpm. The supernatant was removed, and an aliquot of L3 was tested for activation as described previously [43]. L3 pellets showing greater than 85% feeding were pooled and used to isolate nucleic acids. Adult male and female were collected from intestines of infected dogs following euthanasia, and frozen at -80 C until nucleic acids were isolated.

Preparation of A. caninum staged RNA and cDNA libraries
Frozen worm pellets were pulverized using an Alloy Tool Steel Set (Fisher Scientific International). Total RNA from adult and larval parasites was prepared using TRIzol Reagent (GibcoBRL, Life Technologies or Invitrogen, Carlsbad, CA). cDNA libraries from four stages, infective L3 larva (iL3), activated L3 larva (ssL3), adult male (M), and female (F), were generated as previously described [17,44].

Capillary and high throughput sequencing
The cDNA libraries from the four stages were sequenced using the Roche/454 FLX platform [44] and capillary based Sanger sequencing using ABI 3730 and 3700 platforms [17]. All sequences were deposited to GenBank: accession numbers of the Sanger sequences are BM077300 -BM077991; EW741128 -EW744730; EX534506 -EX567272; EX827505 -EX828593; EY458148 -EY473938; FC539038 -FC555743; the Roche/454 SFF files can be found in SRA, libraries SRX000115-SRX000118. The sequences from different platforms underwent different methods for base calling and detection of high quality regions, trimming of linkers, screening for low complexity regions and contaminants, and returning high-quality sequences. Raw cDNA sequences were processed i.e. quality trimmed and screened for vector sequences using SeqClean http://compbio.dfci.harvard.edu/tgi/. The hybrid assembly was dominated by FLX reads (sff format files), therefore we used the 454 Life Sciences' Newbler assembler v1.1.03.21 followed by addition of the ABI reads. All subsequent analyses were based on these contigs, hereafter referred as transcripts, and their constituent reads.

Comparative analysis and functional assignments
The core eukaryotic genes (CEGs) [45] were used to estimate the completeness of the A. caninum transcriptome. A hidden Markov model profile search of the 48,326 transcripts against the 248 CEG profiles of C. elegans genes was carried out using the HMMER [46]. Significant hits were identified according to the suggested cutoffs [45].
C. elegans and B. malayi coding genes were downloaded from Biomart [47]. The A. caninum transcripts were compared against these coding genes using WU-BLAST to identify homologs and matches with a raw BLAST bitscore larger than 50 were considered significant homologs [27]. By this way, A. caninum transcripts could be classified in 4 groups: those sharing homology with both C.elegans and B. malayi genes, those sharing homology with only C. elegans genes or only B. malayi genes, and those sharing no homology with any other species. We defined the first group as nematode conserved, the last group as A. caninum specific, and those sharing homology only with B. malayi genes as parasitism related. KEGG orthology (KO) [48] of the transcripts were identified through BLAST searching against the KEGG database. As with the homologs, bitscores larger than 50 were used as a cut-off. The recorded KOs were mapped to the cellular and biochemical pathways using the KEGG reference maps [48]. The number of shared and unique KOs for each pathway was compared and statistically evaluated by Chi-square test with Bonferroni correction.
Gene Ontology (GO) associations of the parasitism related transcripts were performed using Interproscan [49], and significant enrichment of GO terms were computed based on the hypergeometric distribution using FUNC [50]. A probability refinement was done to remove the GO terms identified as significant due to their children terms. GO term enrichment analysis is vulnerable to false discovery. We used the false discovery rate (FDR) computed by FUNC to reduce false discovery. Therefore, unless specified otherwise, the GO term enrichment was selected based on both p-value < 0.05 (after refinement) and FDR <0.1.

Expression pattern examination
Expression patterns were defined by stage specificity and stage bias of the transcript's constituent reads. After assembly, transcripts with length greater than 90 base pairs (bp) were subjected to expression pattern examination as follows: the numbers of reads originating from the different cDNA libraries for each transcript was recorded, and transcripts with reads originating from only one stage (or a set of stages) were designated to be specific to that stage (or that set of stages). To increase the confidence with which specificity was assigned, we required the transcripts to be sequenced deeply enough to ensure that an observed stage absence (i.e. 0 reads from a specific stage) have a confidence interval above 95%. It turned out that this required the transcript be sequenced at least 10 times (i.e. transcripts have at least 10 constituent reads). Specifically, for each transcript, its observed absence from a stage (if any) was compared to its expected number of stage-specific reads (calculated by multiplying the marginal read frequency of the four stages by the total number of reads in that transcript). If the probability of the observed absence (to a Poisson distribution with the mean as the expected number of reads) was higher than 0.05, the transcript was considered to lack depth of coverage and was excluded from the analysis (including the function-related analyses). This requirement excluded 32,099 transcripts, leaving 16,359 transcripts with defined stage-specificity (with high confidence).
Stage-specific expression implies that a specific gene is expressed during one developmental stage (or one set of development stages) but not in the other. However, gene expression does not always follow the on/off model. Its expression dynamics also includes the expression fluctuation in different stages. In some stages one gene may have enhanced expression while it may have depleted expression in other stages. Such transcripts were designated as stage-biased in our analysis. These transcripts that were biased in expression towards a certain stage were selected by comparing the numbers of reads in different stages (iL3, ssL3, and Adult) using a statistical approach defined by Audic et al. [51] with a significance of P < 1e-05. We focused on the comparisons of iL3 vs ssL3 and ssL3 vs Adult (male and females were treated together as adult stage).

Functional examination of transcripts with different expression pattern
The predicted functions of the transcripts with different expression patterns were examined with GO association as described previously. In consideration of the amount of data and parasitism, we focused on 5 groups of transcripts: those constantly expressed in all four stages; those selectively expressed in both iL3 and ssL3; those exclusively expressed in F, those exclusively expressed in M; and those selectively expressed in both F and M. To visualize the functional profile of these groups, we used heatmap to illustrate the enrichment and depletion of GO terms related to molecular function. To increase the visibility, we included only the GO terms that are enriched or depleted in at least one of the 5 groups. Meanwhile, significantly enriched GO terms in molecular function of these groups were recorded and manually examined to explore the functional association of selective expression.

Identification of secretory parasitism related genes
Upon the identification of parasitism related genes, several criteria were required to further identify parasitism genes encoding secretory proteins [52] that the parasite might use to interfere with the host cellular functions and enable successful parasitism. These criteria require the genes i) to share homology to B. malayi genes but not C. elegans (are in the subset of parasitism related genes identified above), ii) are expressed in the parasitic stages (ssL3, M, F) but should not be expressed in the preparasitic iL3 stage; iii) to have signal peptide for secretion (signal peptides for secretions and trans-membrane domains were identified using PHOBIUS [53], with an additional requirement that the SP is within the first 70 amino acids). Hence, the transcripts expressed in parasitic stages, without homologs in C. elegans but with a significant homology to B. malayi genes and with a signal peptide for secretion but without membrane containing domains were classified as putative secretory parasitism genes. The functions of these transcripts were assigned based on homology found by InterProScan [49].

Intra-population polymorphism and synonymous and nonsynonymous Single Nucleotide Polymorphisms (SNPs)
The cDNAs generated in this study originate from RNA isolated from a population of individuals, enabling us to estimate the rate of population polymorphism of the transcriptomic reads. We used Polybayes [54] to detect the SNPs by using the transcript (i.e. contigs) as reference. Based on the detected SNPs, we estimated the θ, a representation of DNA sequence polymorphism which is related to the mutation rate and effective population size, by following Watterson [55], where n = 10 (the estimated coverage depth) and S is the fraction of sites with polymorphism. By treating the estimated coverage depth as n, we have likely underestimated θ. To identify if the SNPs contribute to synonymous or non-synonymous changes the transcripts were translated using Prot4Est [56]. The dN and dS were calculated using the method of Nei and Gojobori [57]. To improve reliability of the estimation, only transcripts with more than 9 polymorphic sites were analyzed. Functions of transcripts under positive selection (dN/dS > 1.0) were further investigated for GO term enrichment (as stated above).