Our identification of 26,669 M. persicae ESTs (Additional file 3) from 16 cDNA libraries extends previous sequencing efforts for this species 25-fold, and contributes to the rapidly expanding resources that are available for aphid genomics. In addition to the described M. persicae data, the GenBank database contains 66,298 ESTs for Acyrthosiphon pisum (pea aphid), 8,344 for Aphis gossypii (cotton aphid), 4,263 for Toxoptera citricida (brown citrus aphid), 959 for M. persicae, and 458 for Rhopalosiphum padi (bird cherry-oat aphid). Sequencing of the A. pisum genome is ongoing, and a comprehensive database for all aphid genomics information has been established (http://www.aphidbase.com). Functional analysis of aphid genes that are identified by sequencing or expression studies will be facilitated by the recent demonstration that it is possible to silence aphid gene expression by RNA interference [41, 42].
The broad selection of source material for cDNA library construction (Table 3) permitted sequencing of ESTs representing genes expressed at different developmental stages and morphs, as well as genes expressed in response to viral infection, and alternate host plant utilization. In addition, the production of separate libraries from heads, digestive tracts, and salivary glands ensured that genes of special interest to the study of plant-aphid interactions are well-represented in our database. Our comparison of EST frequencies between non-normalized libraries enabled in silico prediction of differential gene expression (Table 5). Clustering of the ESTs from our first few libraries (Figure 1) indicated a high degree of redundancy. We responded to this by normalizing all subsequent libraries, which significantly increased our rate of new gene discovery but eliminated our ability to make inferences about differential expression between libraries. Therefore, it was advantageous to our project to make both types of library, preserving in some cases the natural transcript ratios present in the source tissues, and in others bringing the representation of housekeeping genes more in line with that of rarely expressed transcripts.
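The in silico comparison of EST frequencies between two non-normalized libraries can be illustrated with a minimal sketch. Fisher's exact test is used here as one common choice for count data; the statistic actually applied to produce Table 5 is not stated in this section, and the function name and the counts in the example are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: ESTs assigned to the contig in library 1
    b: all other ESTs in library 1
    c: ESTs assigned to the contig in library 2
    d: all other ESTs in library 2
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x contig ESTs in library 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    # (the small tolerance guards against floating-point ties)
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Hypothetical counts: 12 of 2,000 ESTs in one library vs. 1 of 2,000 in another
p_value = fisher_exact_two_sided(12, 1988, 1, 1999)
print(f"p = {p_value:.4g}")
```

A test of this kind is only meaningful for non-normalized libraries, because normalization deliberately distorts the transcript ratios that the counts are meant to reflect.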
Because aphids transmit plant viruses, and are themselves infected by entomopathogenic viruses, we searched our database for sequences with homology to known viral genes. No ESTs with homology to Potato leafroll virus (PLRV) were identified in our database, even in libraries made from aphids feeding on PLRV-infected plants. This absence of PLRV cDNA sequences is consistent with the fact that PLRV does not replicate within the aphids.
Five contigs are annotated as densovirus proteins, including one predicted to be specific to the aphid digestive tract (Table 5). All but one of the densovirus ESTs are from the USDA lineage, but this could well be an artifact relating to the fact that the gut cDNA library was made from this aphid strain. A densovirus has been reported to infect the anterior portion of the digestive tract of M. persicae, with infected aphids characterized by reduced size, delayed development, and decreased fecundity. Densoviruses represent potential biological pest control agents, and similar viruses from the families Baculoviridae and Tetraviridae have been commercialized for this purpose.
When the stringency of the BlastX search was reduced to an E-value cutoff of 1E-4, one unigene (contig 3464, Additional file 2) had a Dasheen mosaic virus (DsMV) polyprotein as its best hit. DsMV is a non-persistent RNA virus known to be transmitted by M. persicae. Over two-thirds of the 25 ESTs in the contig homologous to the DsMV polyprotein are derived from the salivary gland or head libraries, consistent with the fact that these non-circulative viruses are retained within the mouthparts of their aphid vectors. The relatively large number of DsMV-derived ESTs, which were found in seven different libraries from three lineages (F001, G006, and USDA), is unexpected in light of the fact that this virus should not replicate within the aphids, and none of the plants used for aphid rearing showed obvious signs of viral infection. Furthermore, the host range of DsMV is not known to overlap with the host plants used in these experiments.
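The effect of an E-value cutoff on best-hit annotation can be illustrated with a minimal sketch. It assumes that the BlastX results are available in the standard 12-column tab-delimited format, with the E-value in the eleventh column; the source does not specify the output format, and the function name, identifiers, and values below are hypothetical.

```python
def best_hits(tabular_lines, max_evalue=1e-4):
    """Return {query_id: (subject_id, evalue)} for the best hit of each query.

    Hits with an E-value above max_evalue are discarded; among the remaining
    hits, the one with the lowest E-value is kept for each query.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical hits: query, subject, then the remaining tabular columns
example_lines = [
    "contig_A\tviral_polyprotein\t35.0\t120\t70\t3\t1\t360\t10\t129\t2e-05\t48.1",
    "contig_A\tunrelated_protein\t30.0\t80\t50\t2\t5\t244\t40\t119\t0.5\t30.0",
    "contig_B\tweak_match\t28.0\t60\t40\t1\t1\t180\t1\t60\t0.003\t35.2",
]

print(best_hits(example_lines, max_evalue=1e-4))   # relaxed cutoff
print(best_hits(example_lines, max_evalue=1e-10))  # stricter cutoff
```

Under the relaxed cutoff only the first hypothetical query retains a hit, and under the stricter cutoff none do, which mirrors how a weakly supported viral match can appear only after the search stringency is reduced.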
Functional significance of annotated unigenes
In cruciferous plants, myrosinase enzymes (β-thioglucosidases, EC 3.2.1.147) initiate the rapid breakdown of glucosinolates into insect-deterrent hydrolysis products during herbivory. However, the aphids Brevicoryne brassicae (cabbage aphid) and Lipaphis erysimi (turnip aphid) have co-opted this defensive system by sequestering plant-derived glucosinolates and producing their own myrosinase as a defense against predators [44–47]. One EST from our database (accession number ES221351, from the G006 lineage) has significant homology to the B. brassicae myrosinase gene. Although attempts to measure myrosinase activity in M. persicae have been unsuccessful, it is notable that aliphatic rather than indole glucosinolates were used as enzymatic substrates in these experiments. Aliphatic glucosinolates are recovered intact in the honeydew of M. persicae on A. thaliana, showing that these aphids are able to avoid or inactivate plant myrosinases. In contrast, A. thaliana indole glucosinolates are largely broken down within the aphids. Although this glucosinolate breakdown may occur by a non-enzymatic mechanism, it is also possible that M. persicae possesses a myrosinase activity that is specific to indole rather than aliphatic glucosinolates.
The genetic mechanisms regulating the cyclically parthenogenetic life cycle characteristic of most aphids are largely unknown. Environmental cues, including shortening days, trigger development of sexual morphs in the autumn. A gene from A. pisum, ApSD1, with similarity to a protein involved in amino acid transport in GABAergic neurons, is upregulated in pea aphids reared under short photoperiod conditions. We identified one EST, which we had annotated as an amino acid transporter, as being significantly similar to ApSD1. This EST (accession number EC388175) was sequenced from the G006 male library, which is consistent with a role for this amino acid transporter in the development of winged sexual morphs.
M. persicae has evolved to tolerate plant allelochemicals and insecticides by diverse strategies, including amplification of E4 esterase genes, point mutations in insecticide targets, and increased activity of glutathione S-transferases in response to glucosinolates in artificial diets. Out of 11 contigs with significant homology to M. persicae esterases, two contigs (720 and 3118) from our database are nearly identical to the M. persicae E4 esterase (GenBank Accession CAA52648), whereas nine others appear to represent different genes. These nine sequences may have evolved following amplification to acquire novel functions in the hydrolysis of plant secondary metabolites encountered during the expansion of the insect's host range, or in the breakdown of newly developed insecticides. Other potential detoxification genes represented in our database include 24 glutathione S-transferases and 53 cytochrome P450s (Additional file 2).
Among the 168 salivary gland contigs that are predicted to encode secreted proteins (Additional file 3), approximately 62% are of unknown function. However, others could have potential function in aphid virulence based on their homology to known proteins. For instance, contig 1300 encodes a protein that belongs to an insect-specific family that includes the yellow proteins of D. melanogaster, which are involved in cuticular development and behavior, and the major royal jelly proteins of Apis mellifera (honeybee). A. mellifera proteins from this family are high in essential amino acids and comprise up to 90% of the total protein content of the jelly that is fed to developing larvae. Although major royal jelly proteins are thought to be produced in the cephalic glands of nurse bees, another member of this protein family (MRJP 8) was recently identified as a component of honeybee venom. In M. persicae, the homologous protein is less abundant, and ESTs were only found in the salivary gland and normalized head libraries (MpSG and MpHnorm in Table 3). Nevertheless, it is tempting to speculate that the protein has a virulence function in aphids. Two other genes expressed in salivary glands, represented by contigs 2422 and 3025, are predicted to encode secreted proteins that play a role in proteolysis, and therefore could have interesting functions in the interaction between M. persicae and its host plants. Contig 2422, which has highest homology to a sequence of unknown function from D. melanogaster (GenBank accession NP_611740), encodes a protease-associated domain. Contig 3025 encodes a protein with homology to Der1, a gene involved in the degradation of misfolded proteins in yeast.
DNA Sequence Polymorphisms
Comparison of ESTs from the three M. persicae lineages identified a large number of potential sequence polymorphisms, which were subjected to stringent post-processing to reduce sequencing artifacts. The remaining 167 SNPs, represented by multiple ESTs in more than one aphid lineage (Additional file 6), are a good data source for the identification of M. persicae genetic markers. Furthermore, as suggested by the cathepsin B-N sequence data (Figure 3), these polymorphisms may provide clues about functional divergence of proteins in different M. persicae lineages.
However, when we re-sequenced 11 of these SNPs from genomic DNA templates, only about half were confirmed (Table 6), suggesting that many potential sequence differences in our EST collection are the result of errors created during reverse transcription, PCR amplification, or sequencing. This highlights the importance of developing effective criteria to select a list of high-confidence SNPs from the large number of polymorphisms predicted by programs such as POLYBAYES, and of validating predicted polymorphisms by re-sequencing of genomic DNA.
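One way to express a selection criterion for candidate SNPs is sketched below. It encodes one possible reading of "represented by multiple ESTs in more than one aphid lineage": at least two alleles must each be supported by a minimum number of ESTs, and the site must be covered by ESTs from at least two lineages. The thresholds, the function name, and the input layout are assumptions made for illustration and are not the post-processing rules actually applied to the POLYBAYES output.

```python
from collections import Counter


def is_high_confidence(observations, min_ests_per_allele=2, min_lineages=2):
    """Decide whether a candidate SNP passes a simple support filter.

    observations: list of (lineage, base) pairs, one per EST covering the site.
    """
    allele_counts = Counter(base for _, base in observations)
    lineages = {lineage for lineage, _ in observations}
    # Alleles seen in too few ESTs are treated as possible sequencing errors
    supported = [b for b, n in allele_counts.items() if n >= min_ests_per_allele]
    return len(supported) >= 2 and len(lineages) >= min_lineages


# Hypothetical candidate sites, using the lineage names from this study
well_supported = [("F001", "A"), ("F001", "A"), ("G006", "G"), ("USDA", "G")]
singleton_allele = [("F001", "A"), ("F001", "A"), ("G006", "A"), ("USDA", "G")]

print(is_high_confidence(well_supported))
print(is_high_confidence(singleton_allele))
```

A filter of this kind reduces, but does not eliminate, artifacts introduced during reverse transcription or PCR amplification, which is why re-sequencing from genomic DNA remains necessary.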
Given the greater reproducibility of gene expression data collected with oligonucleotide microarrays, as opposed to spotted cDNA microarrays, we decided to develop oligonucleotide microarrays for future studies on M. persicae gene expression. The highest quality microarrays currently available are those fabricated by in situ oligonucleotide synthesis, a technology pioneered by Agilent. When using such arrays, the number of required technical replicates is reduced because of the high degree of reproducibility between spots, allowing the user to concentrate resources on analyzing biological replicates. In addition, the high cost of purchasing synthesized oligonucleotides makes traditional custom printing of high density arrays at core facilities feasible only if many arrays will be made. There are no up-front costs to design microarrays on Agilent's eArray platform, and the minimum number of slides to order is one.
Transcriptional profiling with microarrays is a powerful technique for identifying genes involved in the response of an organism to its environment. We anticipate that M. persicae microarrays can be used to answer a variety of fundamental questions about aphid biology and plant-aphid interactions. Genes critical to the status of this insect as an agricultural pest can be identified by studying expression changes induced by different crop plants and in response to virus infection. Research on aphid genes specifically expressed in salivary glands may identify proteins that prevent clogging of sieve elements or otherwise contribute to the phloem-specific feeding style of aphids. Conversely, these salivary proteins likely also provide phloem-specific cues that allow plants to recognize aphid feeding and mount a defense response. Microarray experiments will allow association of gene expression changes with polyphenism, the development of morphologically different individuals (e.g. winged and unwinged) that are otherwise genetically identical. Analysis of gene expression in aphids feeding on artificial diets or plants with altered amino acid content can identify genes that are critical for the interaction with endosymbiotic B. aphidicola bacteria, which synthesize essential amino acids and allow aphids to survive on the otherwise nutritionally imbalanced phloem sap.
The broad host range and differences in host plant preferences among individual lineages of M. persicae are some of the more interesting aspects of the biology of this insect. Gene expression differences that underlie within-species variation can be identified by microarray analysis. By sequencing cDNA libraries made from aphids that were raised on both Solanaceae and Cruciferae, we have increased the probability that future microarray experiments performed by ourselves and others will include aphid genes that are expressed only under these particular growth conditions. Evidence for such regulated gene expression comes from our non-normalized libraries, which included two genes that were overrepresented among ESTs from N. benthamiana in comparison to A. thaliana (Table 5). DNA microarray experiments will almost certainly identify additional genes with host plant-specific expression patterns. Further research on the function of such differentially expressed genes will illuminate adaptations that have allowed some M. persicae lineages to expand their host range to include tobacco. Other M. persicae lineages, which show differences in their ability to reproduce on A. thaliana (J. Kim and G. Jander, unpublished results), can be studied to identify aphid adaptations for feeding on Cruciferae. In addition, microarray experiments with M. persicae feeding on A. thaliana will provide a unique opportunity to simultaneously study gene expression changes on both sides of a plant-insect interaction.
Given the broad range of questions that can be addressed by microarray analysis of M. persicae gene expression, the Agilent microarray that we have developed will be of broad interest to aphid researchers. Although the technology necessary for hybridizing and scanning synthesized Agilent arrays is somewhat different from that used for experiments with spotted oligonucleotide arrays, it is available at many universities. The microarrays described here will be made available at cost to other researchers and can be obtained by contacting the corresponding author (G.J.).