Genomic resources for Myzus persicae: EST sequencing, SNP identification, and microarray design

Background The green peach aphid, Myzus persicae (Sulzer), is a world-wide insect pest capable of infesting more than 40 plant families, including many crop species. However, despite the significant damage inflicted by M. persicae in agricultural systems through direct feeding damage and by its ability to transmit plant viruses, limited genomic information is available for this species. Results Sequencing of 16 M. persicae cDNA libraries generated 26,669 expressed sequence tags (ESTs). Aphids for library construction were raised on Arabidopsis thaliana, Nicotiana benthamiana, Brassica oleracea, B. napus, and Physalis floridana (with and without Potato leafroll virus infection). The M. persicae cDNA libraries include ones made from sexual and asexual whole aphids, guts, heads, and salivary glands. In silico comparison of cDNA libraries identified aphid genes with tissue-specific expression patterns, and gene expression that is induced by feeding on Nicotiana benthamiana. Furthermore, 2423 genes that are novel to science and potentially aphid-specific were identified. Comparison of cDNA data from three aphid lineages identified single nucleotide polymorphisms that can be used as genetic markers and, in some cases, may represent functional differences in the protein products. In particular, non-conservative amino acid substitutions in a highly expressed gut protease may be of adaptive significance for M. persicae feeding on different host plants. The Agilent eArray platform was used to design an M. persicae oligonucleotide microarray representing over 10,000 unique genes. Conclusion New genomic resources have been developed for M. persicae, an agriculturally important insect pest. These include previously unknown sequence data, a collection of expressed genes, molecular markers, and a DNA microarray that can be used to study aphid gene expression. These resources will help elucidate the adaptations that allow M. persicae to develop compatible interactions with its host plants, complementing ongoing work illuminating plant molecular responses to phloem-feeding insects.


Background
Insects in the order Hemiptera, which includes all insects that feed exclusively or predominantly on phloem sap, currently represent the most significant challenge for agricultural pest management programs. Although transgenic plants producing Bacillus thuringiensis (Bt) toxin have achieved resistance to many devastating lepidopteran pests, these crops remain susceptible to infestation by aphids and other hemipterans. Reduction in insecticide application, concomitant with the widespread cultivation of Bt crops, has resulted in hemipteran pests being the primary insect threat in major agricultural systems [1]. Aphid feeding causes an alteration of plant source-sink relationships [2], the induction of premature leaf senescence [3], secondary pathogen infection through fungal growth on aphid honeydew, and the transmission of plant viruses [4]. Among these, virus transmission by aphids represents the greatest threat for agricultural crops. Myzus persicae (green peach aphid), which is capable of transmitting more than 100 plant viruses, is the world's most versatile aphid viral vector [5,6]. In particular, M. persicae is a very efficient vector of Potato leafroll virus (PLRV), which can lead to yield reductions of 40-70% in infected fields [7]. M. persicae lineages can vary considerably in their PLRV transmission efficiency [8], suggesting that there are lineage-specific genetic factors that influence this trait.
M. persicae has been found on hundreds of mostly dicotyledonous plant species [6]. Given this broad host range, it is not surprising that differences in host plant utilization among M. persicae lineages are quite common, and efforts have been made to identify molecular variation that correlates with host range and reproductive life cycle [9,10]. The best-studied example of such variation is represented by the typically red-colored lineages that are able to thrive on tobacco [11]. These first appeared on tobacco in Japan more than 60 years ago and have spread to all tobaccogrowing regions of the world. In the United States, M. persicae has been found on tobacco since at least 1947 [12]. Red strains of M. persicae were first reported in the United States in 1985, and by 1987 had become the dominant color morph on tobacco [13,14]. The tobacco-adapted lineage of M. persicae has been granted subspecific status, Myzus persicae nicotianae [15].
In the laboratory, several plant species are convenient hosts for rearing and studying M. persicae. Brassica oleracea (cabbage) is commonly employed as a host plant for maintaining aphid cultures. Arabidopsis thaliana (Arabidopsis), a well-developed model genetic organism with a fully-sequenced genome, is readily consumed by M. persicae, and A. thaliana microarray studies have identified genes that are induced or repressed in response to M. persicae feeding [16][17][18][19][20]. Nicotiana benthamiana, a wild relative of tobacco [21] also serves as a host plant for some lineages of M. persicae. Virally induced gene silencing (VIGS) is particularly effective in N. benthamiana, permitting rapid screening of individual genes to study their importance in defense against M. persicae and other herbivores [22]. P. floridana (downy ground-cherry) serves as a model solanaceous plant for studying the transmission of PLRV by M. persicae [23].
So far, only limited sequence information and genetic markers are available for the estimated 313 Mb nuclear genome of M. persicae [24][25][26][27]. However, recent advances in DNA sequencing make it possible to rapidly acquire information about the coding regions of any genome by building complementary DNA (cDNA) libraries and sequencing expressed sequence tags (ESTs). Here we describe the creation of such an EST database from 16 sequenced M. persicae cDNA libraries and the use of these data to make in silico predictions of differentially expressed genes, identify single nucleotide polymorphisms (SNPs) between lineages, and develop probes for an oligonucleotide microarray to study aphid gene expression.

Phenotypic Characterization of M. persicae Lineages
Phenotypic characterization and microsatellite genotyping of lineage 2001-12, which was collected in Scotland, has been described previously as clone Type B [28]. Other M. persicae lineages were collected at five sites in the United States, and were characterized to determine their reproductive life cycle (Additional file 1). Of the 46 tested aphid lines, eight were holocyclic, five intermediate, 15 androcyclic, and 17 anholocyclic. One lineage remained unclassified because multiple replicates failed to grow. Seven holocyclic M. persicae lineages and one anholocyclic tobacco-adapted lineage were genotyped with seven microsatellite markers, and were determined to be genetically distinct (Table 1).
Seven holocyclic M. persicae lineages (Table 1) were tested for their ability to transmit PLRV acquired from detached virus-infected leaves of P. floridana to virus-free P. floridana plants. In three independent trials, aphids of clone G006 transmitted PLRV efficiently, whereas aphids from lineage F001 failed to transmit the virus consistently (Table 2). In these experiments, the F001 and G006 clones exhibited similar growth rates and fecundity, suggesting that the observed differences in transmission are attributable to differences in the clones' capacity to acquire and/or transmit the virus, rather than to differences in the amount of time spent feeding on the infected or uninfected leaves. In control experiments, aphids from both lineages that were transferred from uninfected detached leaves to uninfected plants failed to transmit PLRV. The USDA aphid lineage, which was used as a positive control, transmitted PLRV with 100% efficiency in these experiments (data not shown).

cDNA Library Construction and Sequencing
As summarized in Table 3, 16 cDNA libraries representing a diversity of tissues and developmental stages were constructed from four aphid lineages (USDA, 2001-12, F001, and G006). Aphids were reared on host plants from the Solanaceae and Brassicacae families, as well as on plants with and without PLRV infection. Since sequencing nonnormalized libraries showed a high level of redundancy, normalized cDNA libraries were created to improve the rate of new gene discovery (Fig. 1). Although normalization increased the gene discovery rate, it also precluded making inferences about differential gene expression by comparing EST frequencies between these libraries. Alto-gether, sequencing of the cDNA libraries produced a total of 26,759 high quality sequencing reads, which have been submitted to GenBank (accession numbers: DW010205-DW015017,  EC387039-EC390992,  EE570018-EE572264, EE260858-EE265165, ES444641-ES444705,  ES217505-ES226848, and ES449829-ES451794).

Sequence Assembly and Annotation
We identified 3965 contigs and 6376 singletons in the 26,759 high-quality sequences. BlastX (E-value cutoff = 1E-5) was run on all 10,341 unigenes against a database containing all NCBI RefSeq proteins plus the 105 M. persicae proteins available at the time in GenBank (January 25,2007). A one-line annotation was generated for each contig in the following way: (i) if the best Blast hit was for a known M. persicae protein, that annotation was used; otherwise (ii) if there was a hit for a Drosophila melanogaster protein, that annotation was considered to be most reliable; and (iii) if there were no hits to D. melanogaster or M. persicae proteins, we used the annotation of the best Blast hit. In addition, the top ten Blast hits are listed for each contig in Additional file 2. Information on the ESTs from which each contig is built, including Gen-Bank accession number and source library, is provided (Additional file 3). GOslim annotations [29] were tabulated for each contig, and a summary of the molecular function and biological process annotation is provided for the total EST collection and separately for the ESTs from the non-normalized libraries (Fig. 2).
The top twenty contigs in terms of representation in the EST collection can be found in Table 4, ranked according   G002  40  60  -G003  80  --G006 c  100  100  100  G010  100  --F001 c  40  0  10  F009  80  --F012 40 40 17 a Five plants per experiment; b Ten plants per experiment, except six plants for F012; c Lineages used for cDNA library construction to the number of ESTs they contain. Seven of the most highly expressed genes in M. persicae have no significant similarity to any proteins in the RefSeq database [30]. We compared our sequences to the mitochondrial genome of the aphid Schizaphis graminum (greenbug) and identi-fied 880 ESTs likely to represent mitochondrial genes. Of these, 491 constitute four contigs that are among the most highly expressed genes in our database (Table 4).    The low overlap between the contig sequences may be due to the fact that neither EST database represents the whole transcriptome. However, it is likely that many of the contigs in one aphid which do not have homologues in the database of the other aphid species may represent genes responsible for the adaptation to specific host plants, or for other differences in physiology which have evolved since the species' divergence. There is significant similarity between predicted coding regions in a subset of M. persicae and A. pisum genes. From more than 5000 shared contig sequences, 1585 have greater than 95% identity in the coding region (coding regions are defined as open reading frames of 50 residues or longer that can be aligned to Uniprot sequences with BlastX E-values less than 1E-10). These genes are likely to represent cellular housekeeping genes, but may also include genes which are essential for aspects of plant-insect interactions which are specific to, and common among, aphids.

Comparision of M. persicae ESTs to other Genomic Data
Similarity searches against other aphid ESTs, performed using the TBlastX program (E-value cutoff = 1E-10), identified 4500 unigenes with no similarities to previously described aphid ESTs. Therefore, these unigenes represent newly described aphid cDNA sequences. However, some of these sequences may arise from untranslated regions of genes, which may not be highly conserved between spe-cies. These 4500 unigenes were subsequently compared with the nr database (non-redundant NCBI protein and nucleotide database) using TBlastX and BlastX programs. A total of 2423 unigenes had no hits at E-value <1E-10. Some subset of these may represent M. persicae-specific genes. The other 2077 unigenes represent "new-toaphids" features: genes identified in non-aphid species, which are not represented among the over 80,000 ESTs from five other aphid species that were previously submitted to GenBank.
Blasting the 959 M. persicae ESTs previously described by Figueroa et al. [27] against our database revealed that 806 of these ESTs match up to 520 of our unigenes at E-val-ues<1E-100. Based on this stringent cut-off value, 153 of these previously described M. persicae ESTs were not found in our data set -however, 109 of these ESTs have matches at E-values<1E-20, indicating that they may represent closely related gene family members.
BlastN (E-value cutoff = 1E-20) of all ESTs against the three Buchnera aphidicola genomes available in GenBank, as well as BlastX (percentage identity cutoff 80%) against all Buchnera aphidicola protein sequences in GenPept identified 90 sequences that are almost certainly from the bacterial endosymbiont of M. persicae, Buchnera aphidicola.
The top Blast hit for five of our unigenes (all singletons) are for B. aphidicola proteins, indicating that our filtering failed to remove a small number of contaminating bacterial sequences. No Escherichia coli or Saccharomyces cerevisiae proteins appeared as a top-ten Blast hit for any of our unigenes giving us confidence that significant contamination from these sources is not a concern.

In Silico Prediction of Differentially Expressed Genes
We used the previously described R statistic [31] to identify the contigs showing the greatest differences in EST abundance among four of the non-normalized libraries (MpH, MpG, MpNB, MpW). A log likelihood ratio statistic was calculated that estimated the extent to which differences in gene expression correspond to the heterogeneity of the libraries. The twenty top hits of differentially expressed genes are presented with a brief description of the protein, the value of the R statistic, and the abundance of the gene in each of the four libraries (Table  5). Among the twenty contigs showing the highest R value, twelve represent genes that are over-expressed in the head library. None of these genes show similarities to published proteins with known function (E-value cutoff = 1E-5), and ten of the twelve genes were found only in head or full body cDNA libraries made from M. persicae or other aphids. The six contigs representing genes that were most highly expressed in the gut library include two with no homology to GenBank sequences. Two other contigs show similarity to the lysosomal cysteine protease cathepsin B-N. Contig 3427 shows similarities to a structural protein from densoviruses, which have recently been described as infecting the stomach cells of aphids [32]. Contig 1196, which represents a gene that is more highly expressed in the gut, shows similarity to a glutathione Stransferase (GST). GSTs belong to a large family of proteins implicated in xenobiotic detoxification, and an increase in GST activity has been associated with the adaptation to plant secondary metabolites in M. persicae [33]. Two contigs with no homology to known genes have a significant overrepresentation of ESTs from aphids reared on N. benthamiana rather than A. thaliana. One of these contigs, number 1079, also contains five ESTs from the digestive tract library and none from the head library, suggesting this gene as a candidate for involvement in aphid response to tobacco-specific defenses.

Prediction of Secreted Salivary Proteins
In order to find aphid proteins involved in the successful infestation of host plants, we have identified cDNA sequences expressed in the salivary glands that are predicted to encode for secreted proteins. These proteins may be required for the establishment of prolonged phloem feeding and suppression of plant defenses. Using stringent criteria (see Materials and Methods) we identified 186 contigs representing sequences expressed in salivary glands. Subsequent in silico translation and signal peptide prediction resulted in the identification of 45 M. persicae proteins that may be secreted from the salivary glands (Additional file 4). These include a total of fifteen proteins that are predicted to possess an anchor sequence (Additional file 4), indicating that these proteins remain in the cell membrane upon secretion and might function as receptors or proteins involved in transport. For instance, the protein encoded by contig 515, is a close homologue of tetraspanin 29FA in D. melanogaster, where it functions as a cell surface receptor binding protein involved in sig-

SNP Identification and Validation
SNPs are effective molecular markers for genetic mapping and can also be used to estimate the level of sequence divergence between lineages. Using the program Poly-Bayes [34], we identified 12,722 potential SNPs from our EST sequences. Since we were interested in identifying SNPs that represented differences between, rather than within, lineages, we filtered our list of polymorphisms to include only those SNPs representing a nucleotide difference between two lineages, and exhibiting no apparent heterozygosity within lineages. This resulted in ~800 polymorphisms that can serve as potential molecular markers for differentiating aphid lineages (Additional file 5).
Many of the predicted SNPs are represented by only a single EST in one or more lineage, allowing for the possibility that observed sequence differences are artifacts resulting from an error in reverse transcription of the mRNA, PCR amplification of the cDNA, or the sequencing reaction. Therefore, we generated a list of high-confidence polymorphisms in which each of any two lineages was represented by two or more ESTs with the same base at the polymorphic position. This resulted in 167 high-confidence SNPs (Additional file 6), from which we selected a small subset to validate by re-sequencing genomic DNA from M. persicae lineages USDA, F001, and G006. Eleven SNPs from seven contigs were selected for validationcontigs contained either one, two, or three predicted polymorphisms (Table 6). No sequence differences between these lineages were detected when sequencing 158 bp of a control gene, EF-1α (accession numbers EF660853-EF660855). Five of the 11 tested SNPs were confirmed by resequencing ( Table 6). Four of these confirmed SNPs represented differences between the two green aphid lineages, F001 and G006, and the red tobacco-adapted USDA lineage. Moreover, three of these SNPs were in the open reading frame of contig 254, which is annotated as a lysosomal cysteine protease cathepsin B-N.
Contig 254 is one of the most highly expressed genes (Table 4) and differentially expressed in the aphid digestive tract ( Table 5). Two of these cathepsin SNPs represent non-synonymous, non-conservative amino acid changes (Fig. 3), indicating a possible functional change in enzyme activity. Structural modeling with the protein fold recognition program Phyre [35] indicates that these residues are in a conserved region in an exposed loop on the surface of the protein. One EST from the second red M. persicae lineage (2001-12), shared the three polymorphic nucleotides with the red USDA lineage. Therefore, it may be informative to genotype a wider range of red and green aphids at this locus to determine whether these polymorphisms correlate with differences in host range or life cycle between lineages. Although it is tempting to speculate that this gut-specific protease is undergoing rapid evolution in order to avoid plant protease inhibitors, the small size of our SNP dataset and the high (>50%) occurrence of false positives prevent us from inferring the significance of changes in this protein arising from variation between these lineages.

Microarray Design
The Agilent eArray platform was used to design a microarray based on our ESTs and an additional 1,121 ESTs from other sources that were available in GenBank (including [27]. Of the total of 10,525 unigenes assembled from these ESTs, we successfully designed 60-mer probes for 10,478 using eArray software. For >95% of the unigenes, three 60-mer probes were designed, corresponding to different regions of the gene. The actual synthesized array consists of one probe group representing all 10,478 unigenes, a second probe group with alternate 60-mers for 4139 of the unigenes, 11 ESTs from Schizaphis graminum (greenbug), negative controls corresponding to plant and human specific genes, and positive controls representing insect housekeeping genes. The current slide layout con-  [27], and 458 for Rhopalosiphon padi (bird cherry-oat aphid; [39]). Sequencing of the A. pisum genome is ongoing [36], and a comprehensive database for all aphid genomics information has been established (http://www.aphidbase.com; [40]). Functional analysis of aphid genes that are identified by sequencing or expression studies will be facilitated by the recent demonstration that it is possible to silence aphid gene expression by RNA interference [41,42].
The broad selection of source material for cDNA library construction ( Table 3) permitted sequencing of ESTs representing genes expressed at different developmental stages and morphs, as well as genes expressed in response to viral infection, and alternate host plant utilization. In addition, the production of separate libraries from heads, digestive tracts, and salivary glands ensured that genes of special interest to the study of plant-aphid interactions are well-represented in our database. Our comparison of EST frequencies between non-normalized libraries enabled in silico prediction of differential gene expression ( Table 5).
Clustering of the ESTs from our first few libraries ( Figure  1) indicated a high degree of redundancy. We responded to this by normalizing all subsequent libraries, which significantly increased our rate of new gene discovery but eliminated our ability to make inferences about differential expression between libraries. Therefore, it was advantageous to our project to make both types of library, preserving in some cases the natural transcript ratios present in the source tissues, and in others bringing the representation of housekeeping genes more in line with that of rarely expressed transcripts.

Virus-derived Genes
Because aphids transmit plant viruses, and are themselves infected by entomopathogenic viruses, we searched our database for sequences with homology to known viral genes. No ESTs with homology to Potato leafroll virus (PLRV) were identified in our database, even in libraries made from aphids feeding on PLRV-infected plants. This absence of PLRV cDNA sequences is consistent with the fact that PLRV does not replicate within the aphids.
Five contigs are annotated as densovirus proteins, including one predicted to be specific to the aphid digestive tract ( Table 5). All but one of the densovirus ESTs are from the USDA lineage, but this could well be an artifact relating to the fact that the gut cDNA library was made from this aphid strain. A densovirus has been reported to infect the anterior portion of the digestive tract of M. persicae, with infected aphids characterized by reduced size, delayed development, and decreased fecundity [32]. Densoviruses represent potential biological pest control agents, and similar viruses from the families Baculoviridae and Tetraviridae have been commercialized for this purpose.
When the stringency of the BlastX search was reduced to an E-value cutoff of 1E-4, one unigene (contig 3464, Additional file 2) has a Dasheen mosaic virus (DsMV) polyprotein as its best hit. DsMV is a non-persistent RNA virus SNPs in contig 254, cathepsin B, a putative gut-specific cysteine protease G006 TGGCTGATGGTCAACTCATGGAACGCAGATTGGGGTGACAATGGTCTTTTCAAAATTCAA F001 TGGCTGATGGTCAACTCATGGAACGCAGATTGGGGTGACAATGGTCTTTTCAAAATTCAA USDA TGGTTGATGGTCAACTCATGGAACGAAGATTGGGGTGACAATGGTTTTTTCAAAATTCAA *** ********************* ******************* ************** Leu-Leu Ala-Glu Leu-Phe known to be transmitted by M. persicae [43]. Over twothirds of the 25 ESTs in the contig homologous to the DsMV polyprotein are derived from the salivary gland or head libraries, consistent with the fact that these non-circulative viruses are retained within the mouthparts of their aphid vectors. The relatively large number of DsMVderived ESTs, which were found in seven different libraries from three lineages (F001, G006, and USDA), is unexpected in light of the fact that this virus should not replicate within the aphids, and none of the plants used for aphid rearing showed obvious signs of viral infection. Furthermore, the host range of DsMV is not known to overlap with the host plants used in these experiments.

Functional significance of annotated unigenes
In cruciferous plants, myrosinase enzymes (β-thioglucosidases, EC 3.2.1.147) initiate the rapid breakdown of glucosinolates into insect-deterrent hydrolysis products during herbivory. However the aphids Brevicoryne brassicae (cabbage aphid) and Lipaphis erysimi (turnip aphid) have co-opted this defensive system by sequestering plantderived glucosinolates and producing their own myrosinase as a defense against predators [44][45][46][47]. One EST from our database (accession number ES221351, from the G006 lineage) has significant homology to the B. brassicae myrosinase gene. Although attempts to measure myrosinase activity in M. persicae have been unsuccessful, it is notable that aliphatic rather than indole glucosinolates were used as enzymatic substrates in these experiments [48]. Aliphatic glucosinolates are recovered intact in the honeydew of M. persicae on A. thaliana, showing that these aphids are able to avoid or inactivate plant myrosinases. In contrast, A. thaliana indole glucosinolates are largely broken down within the aphids [49]. Although this glucosinolate breakdown may occur by a non-enzymatic mechanism, it is also possible that M. persicae possesses a myrosinase activity that is specific to indole rather than aliphatic glucosinolates.
The genetic mechanisms regulating the cyclically parthenogenetic life cycle characteristic of most aphids are largely unknown. Environmental cues, including shortening days, triggers development of sexual morphs in the autumn [50]. A gene from A. pisum, ApSD1, with similarity to a protein involved in amino acid transport in GABAergic neurons, is upregulated in pea aphids reared under short photoperiod conditions [51]. We identified one EST, which we had annotated as an amino acid transporter, as being significantly similar to ApSD1. This EST (accession number EC388175) was sequenced from the G006 male library, which is consistent with a role for this amino acid transporter in the development of winged sexual morphs.
M. persicae has evolved to tolerate plant allelochemicals and insecticides by diverse strategies, including amplification of E4 esterase genes [52], point mutations in insecticide targets [53], and increased activity of glutathione Stransferases in response to glucosinolates in artificial diets [33]. Out of 11 contigs with significant homology to M. persicae esterases, two contigs (720 and 3118) from our database are nearly identical to the M. persicae E4 esterase (GenBank Accession CAA52648), whereas nine others appear to represent different genes. These nine sequences may have evolved following amplification to acquire novel functions in the hydrolysis of plant secondary metabolites encountered during the expansion of the insect's host range, or in the breakdown of newly developed insecticides. Other potential detoxification genes represented in our database include 24 glutathione Stransferases and 53 cytochrome P450s (Additional file 2).
Among the 168 salivary gland contigs that are predicted to encode secreted proteins (Additional file 3), approximately 62% are of unknown function. However, others could have potential function in aphid virulence based on their homology to known proteins. For instance, contig 1300 encodes a protein that belongs to an insect-specific family that includes the yellow proteins of D. melanogaster, that are involved in cuticular development and behavior [54], and the major royal jelly proteins of Apis mellifera (honeybee). A. mellifera proteins from this family are high in essential amino acids and comprise up to 90% of the total protein content of the jelly that is fed to developing larvae [55]. Although major royal jelly proteins are thought to be produced in the cephalic glands of nurse bees [56], another member of this protein family (MRJP 8) was recently identified as a component of the honeybee venom [57]. In M. persicae, the homologous protein is less abundant, and ESTs were only found in the salivary gland and normalized head libraries (MpSG and MpHnorm in Table 3). Nevertheless, it is tempting to speculate that the protein has a virulence function in aphids. Two other genes expressed in salivary glands, represented by contigs 2422 and 3025, are predicted to encode secreted proteins that play a role in proteolysis, and therefore could have interesting functions in the interaction between M. persicae and its host plants. Contig 2422, which has highest homology to a sequence of unknown function from D. melanogaster (GenBank accession NP_611740), encodes a protease-associated domain. Contig 3025 encodes a protein with homology to Der1, a gene involved in the degradation of misfolded proteins in yeast [58].

DNA Sequence Polymorphisms
Comparison of ESTs from the three M. persicae lineages identified a large number of potential sequence polymorphisms which were subjected to stringent post-processing to reduce sequencing artifacts. The remaining 167 SNPs, represented by multiple ESTs in more than one aphid lineage (Additional file 6), are a good data source for the identification of M. persicae genetic markers. Furthermore, as suggested by the cathepsin B-N sequence data ( Figure  3), these polymorphisms may provide clues about functional divergence of proteins in different M. persicae lineages.
However, when we re-sequenced 11 of these SNPs from genomic DNA templates, only about half were confirmed ( Table 6), suggesting that many potential sequence differences in our EST collection are the result of errors created during reverse transcription, PCR amplification, or sequencing. This highlights the importance of developing effective criteria to select a list of high-confidence SNPs from the large number of polymorphisms predicted by programs such as POLYBAYES, and of validating predicted polymorphisms by re-sequencing of genomic DNA.

Microarray Development
Given the greater reproducibility of gene expression data collected with oligonucleotide microarrays, as opposed to spotted cDNA microarrays, we decided to develop oligonucleotide microarrays for future studies on M. persicae gene expression [59]. The highest quality microarrays currently available are those fabricated by in situ oligonucleotide synthesis, a technology pioneered by Agilent. When using such arrays, the number of required technical replicates is reduced because of the high degree of reproducibility between spots, allowing the user to concentrate resources on analyzing biological replicates. In addition, the high cost of purchasing synthesized oligonucleotides makes traditional custom printing of high density arrays at core facilities feasible only if many arrays will be made. There are no up-front costs to design microarrays on Agilent's eArray platform, and the minimum number of slides to order is one.
Transcriptional profiling with microarrays is a powerful technique for identifying genes involved in the response of an organism to its environment. We anticipate that M. persicae microarrays can be used to answer a variety of fundamental questions about aphid biology and plant-aphid interactions. Genes critical to the status of this insect as an agricultural pest can be identified by studying expression changes induced by different crop plants and in response to virus infection. Research on aphid genes specifically expressed in salivary glands may identify proteins that prevent clogging of sieve elements or otherwise contribute to the phloem-specific feeding style of aphids. Conversely, these salivary proteins likely also provide phloem-specific cues that allow plants to recognize aphid feeding and mount a defense response. Microarray experiments will allow association of gene expression changes with polyphenism, the development for morphologically different individuals (e.g. winged and unwinged) that are otherwise genetically identical. Analysis of gene expression in aphids feeding on artificial diets or plants with altered amino acid content can identify genes that are critical for the interaction with endosymbiotic B. aphidicola bacteria, which synthesize essential amino acids and allow aphids to survive on the otherwise nutritionally imbalanced phloem sap.
The broad host range and differences in host plant preferences among individual lineages of M. persicae are some of the more interesting aspects of the biology of this insect. Gene expression differences that underlie within-species variation can be identified by microarray analysis. By sequencing cDNA libraries made from aphids that were raised on both Solanaceae and Cruciferae, we have increased the probability that future microarray experiments performed by ourselves and others will include aphid genes that are expressed only under these particular growth conditions. Evidence for such regulated gene expression comes from our non-normalized libraries, which included two genes that were overrepresented among ESTs from N. benthamiana in comparison to A. thaliana (Table 5). DNA microarray experiments will almost certainly identify additional genes with host plant specific expression patterns. Further research on the function of such differentially expressed genes will illuminate adaptations that have allowed some M. persicae lineages to expand their host range to include tobacco. Other M. persicae lineages, which show differences in their ability to reproduce on A. thaliana (J. Kim and G. Jander, unpublished results), can be studied to identify aphid adaptations for feeding on Cruciferae. In addition, microarray experiments with M. persicae feeding on A. thaliana will provide the unique opportunity to simultaneously study gene expression changes on both sides of a plant-insect interaction.
Given the broad range of questions that can be addressed by microarray analysis of M. persicae gene expression, the Agilent microarray that we have developed will be of broad interest to aphid researchers. Although the technology necessary for hybridizing and scanning synthesized Agilent arrays is somewhat different from that used for experiments with spotted oligonucleotide arrays, it is available at many universities. The microarrays described here will be made available at cost to other researchers and can be obtained by contacting the corresponding author (G.J.).

Conclusion
By sequencing and analyzing 26,669 M. persicae ESTs, we have generated new genomic resources for this aphid species. Expressed aphid genes, in particular many that show no significant similarity to genes from other organisms, have been identified. Molecular markers that were found by comparing three aphid lineages will be useful not only for genotyping natural isolates, but also future genetic studies with M. persicae. The DNA microarray that has been developed will permit further investigation of agriculturally and ecologically relevant transcriptional regulation in M. persicae.
To date, the lack of genomic resources for M. persicae has stood in stark contrast to the threat posed by this aphid to agricultural systems worldwide. By studying aphid gene expression responses to virus infection, different host plants, and other stresses, it will be possible to obtain a better understanding of this important biological interaction. In addition, our increasing understanding of plant molecular responses to phloem-feeding insects will be complemented by elucidation of the adaptations that allow these insects to establish compatible interactions with their host plants. Further research on M. persicae gene expression responses will aid in efforts to breed crops with increased aphid resistance and will advance ongoing research into aphid ecology, evolution, and physiology.

Aphid Collection, Rearing, and Characterization
M. persicae lineages were collected from five sites in the United States and one in Scotland, as described in Table 1 and Additional file 1. Aphid lineages were started from a single greenhouse-or field-collected insect. M. persicae lineages were genotyped in a single multiplex PCR reaction containing dye-labelled primers (Additional file 7) to amplify seven microsatellite loci: M86 and M40 [25] and S17b, myz2, myz3, myz9 and myz25 [26]. One primer for each locus was fluorescently labeled with the following dyes; NED™: S17b-forward and myz25-forward, For life cycle characterization, about ten months after their field collection, three parthenogenetic lines of each aphid lineage were established on cabbage seedlings in small plexiglass cages. Aphids were raised on these plants as separate lines for three generations at 20°C long day (16:8 h light:dark cycle) to remove maternal and grandmaternal effects. Three third-generation (G3) adults were transferred to a new plant and placed at short day conditions (10:14 h light:dark cycle) at 15°C. After three days the adults were removed and the juveniles were returned to short day conditions. When the G4 aphids were third or fourth instars, three aphids per line were transferred to a new plant and returned to the cabinet under inducing conditions to give birth to the first batch of G5 individuals (G5-1). One week later the three G4 adults were transferred to a new plant to give birth to the G5 batch 2 individuals (G5-2). This processes was repeated a third time to generate the G5 batch 3 progeny. The adult morphs of the three batches of G5 progeny from each line were scored. Lineages that produced males and pre-sexual females (gynoparae) and in at least one of the three replicates and no asexual females (vivipara) in any of the three replicate lines were classified as cyclical parthenogens (holocyclics). Lineages that produced all three morphs males, gynoparae and vivipara were classified as intermediates. Lineages that produced males in at least one replicate and vivipara but no gynoparae were classified as androcyclics and finally lineages that failed to produce any sexual morphs were classified as obligate parthenogens (anholocyclics). Gynoparae are winged females and were distinguished from alate viviparous aphids by the exclusively production of sexual female progeny (ovipara).
Holocyclic M. persicae lineages were tested for their ability to transmit PLRV. A large quantity (>200) of aphids reared on healthy P. floridana plants was placed in a dish containing P. floridana leaves infected with PLRV. Aphids were allowed to feed for a 48 hour acquisition period before being transferred to healthy young P. floridana plants. After a 5-day transmission period, plants were treated with insecticide (Dibrom 8E; Valent, Walnut Creek, CA, USA) and transferred to the greenhouse. Symptoms of PLRV were observed within three to five weeks.

Tissue Collection
Three separate aphid tissues (digestive tracts, heads, and salivary glands) were isolated from the USDA red lineage. Digestive tracts and heads were dissected from alate adult asexual females that had been anesthetized by dipping each individual in 70% ethanol. Dissections were performed using a pin embedded in a wooden handle that was inserted dorsally between the head and thorax while holding aphids by their wings with forceps. Aphid heads with the intact digestive tract attached were removed by applying light pressure anteriorly with the pin and posteriorly with the forceps. Following dissection, head and gut tissue were separated and stored individually in RNAlater (Ambion, Austin, TX, USA) at -20°C for up to one month. Salivary gland tissue was dissected from ~400 aphids of different life stages (predominantly fourth instar alates and alate adults) reared on cabbage (B. oleraceae). Aphid heads were detached from their bodies as described above, and then the salivary glands were exposed by removal of the antennae, stylet, and head capsule. Following this, both sets of principal and accessory glands were carefully removed from the remaining tissues and stored in RNAlater (Ambion, Austin, TX, USA) solution.
Adult males were obtained after approximately six weeks and adult sexual females after seven weeks following transfer to short day conditions (13:11 h light:dark cycle at 18°C ± 2). Altogether, 92 sexual females and 128 males of lineage F001, and 81 sexual females and 134 males of lineage G006 were flash frozen and stored at -80°C for library construction.

cDNA Library Construction
Total RNA was isolated using RNeasy kits (Qiagen, Valencia, CA, USA) or Tripure reagant (Roche, Indianapolis, IN, USA) and purified mRNA using Oligotex resin (Qiagen, Valencia, CA, USA) or Dynabeads mRNA Purification Kit (Invitrogen, Carlsbad, CA, USA). All cDNA libraries were made from mRNA with the exception of the salivary gland library, (MpSG), which was made from total RNA. Genomic DNA was isolated from flash frozen aphids following the "salting-out" protocol [60].

Non-normalized libraries
Four libraries were made following the LD PCR protocol from the Creator SMART cDNA Library Construction Kit (Clontech, Mountain View, CA, USA). cDNA generated by reverse transcription was amplified, digested with Sfi 1A and Sfi 1B, and size fractionated. Double-stranded cDNA was directionally cloned into the pDNR-LIB plasmid vector, and transformed into DH10B competent cells (Invit-rogen, Carlsbad, CA, USA). The cDNA for a fifth nonnormalized library was generated from mRNA and cloned using a Superscript Plasmid system with Gateway technology (Invitrogen, Carlsbad, CA, USA). For this library, the size fractionated cDNA was cloned into the pSPORT1 vector cut with NotI and SalI and the recombinant plasmids were used to transform ElectroMAX DH10B cells (Invitrogen, Carlsbad, CA, USA).

Normalized libraries
Eleven normalized cDNA libraries were constructed using the TRIMMER direct cDNA Normalization kit (Evrogen, Moscow, Russia) in conjunction with the Creator SMART kit. We generated cDNA by reverse transcription of total RNA (salivary gland library) and mRNA (all other libraries). Double-stranded cDNA for normalization was generated using 15-21 PCR cycles. The double-stranded cDNA was denatured and allowed to re-hybridize under stringent conditions; subsequently the reaction mixture was treated with a duplex specific nuclease [61]. The duplexes corresponded disproportionately to abundant cDNAs, leaving a population of single-stranded cDNA molecules in which the representation of rare transcripts was increased. The remaining single stranded cDNA molecules were subsequently amplified, and library construction proceeded as for non-normalized libraries.

EST Sequencing
Sequencing reactions were performed either on purified plasmids or on PCR-amplified products.

PCR-amplified products
Library aliquots were spread onto selective media and grown overnight at 37°C. Colonies were picked manually into 384 well plates (Genetix, New Milton, Hampshire, UK) containing selective media and grown overnight at 37°C. One µL of liquid culture was used as a template for colony PCR (primer sequences in Additional file 7). Colony PCR products were analyzed by gel electrophoresis to confirm the presence of an insert. PCR products were purified using MinElute 96 UF plates (Qiagen, Valencia, CA, USA) or AMpure (Agencourt Biosciences, Beverly, MA, USA). Sequencing reactions were carried out using ABI PRISM BigDye technology, and sequences were analyzed on the ABI 3730XL automated multicapillary sequencer (Applied Biosystems, Foster City, CA, USA).

Purified plasmids
Library aliquots were spread onto Q-tray vented bioassay plates (Genetix, New Milton, Hampshire, UK) containing selective media and grown for 18 hours at 37°C. Colonies were picked by the Qbot robotic colony manipulator (Genetix, New Milton, Hampshire, UK) into 384-well plates containing selective media and grown for 12 hours at 37°C. Plasmid DNA was purified using SprintPrep 384 HC kits (Agencourt Biosciences, Beverly, MA, USA) and subject to dye-terminator fluorescent DNA sequencing. The sequencing products were purified using CleanSEQ (Agencourt Biosciences, Beverly, MA, USA), and the sequences were analyzed on the ABI 3730XL (Applied Biosystems, Foster City, CA, USA) automated multicapillary sequencer.

Sequence Processing and Annotation
We used Phred [62,63] to make base calls from sequence traces. Vector and adaptor sequences were identified from each EST using Crossmatch [63], and trimmed along with poly-A tails and low quality sequence (i.e. 10 or more bases out of 25 with a quality score below 20). ESTs containing less than 100 bases of quality sequence were discarded. All ESTs were compared to the GenBank nr database using BlastN [64]. Those ESTs for which the best Blast targets were B. aphidicola sequences with E-value less than 1E-20 were considered to be endosymbiont contamination and were filtered out. ESTs passing quality tests were clustered using the TribeMCL software (1E-50 and 95% identity for Blast alignment; inflation value 5; [65], and consensus contig sequences were generated using Cap3 [66]. For functional annotation we used the Gene Ontology annotation from the NCBI Gene database (Maglott et al., 2005). For classification purposes, we converted all the GO terms in the Gene database to GOslim [29]. BlastX was used to match the contig sequences to the NCBI refseq protein sequences. GOslim annotations of the Refseq hits with BlastX e-value less than 1E-5 were transferred to the query contigs.

SNP Identification and Confirmation
The PolyBayes program [34] was used to identify SNPs. Consensus contig sequences were used as anchor sequences for each alignment, and the cutoff for the SNP probability score was 0.84. Seven PCR primer pairs were used to amplify the 11 predicted SNPs (contigs 129, 254, 1080, 3202, 3285, 3347, 3713: Additional file 7) from the F001, G006 and USDA lineages. Primers for elongation factor 1α (EF-1α) were used in control PCR reactions. PCR products were purified using Ampure (Agencourt Biosciences, Beverly, MA, USA). Sequencing reactions were carried out using ABI PRISM BigDye technology (Applied Biosystems, Foster City, CA, USA), and sequences were analyzed on an ABI 3730XL automated multicapillary sequencer. Sequences were aligned using ClustalW [67] to confirm the presence of the putative polymorphism.

In silico Prediction of Tissue-Specific Gene Expression
The method proposed by [31] was used for comparing gene expression profiles from four of the non-normalized cDNA libraries. For each contig, the number of ESTs from each library was counted and the R statistic was calculated. Potential salivary gland specific ESTs were also identified from the normalized salivary gland library. Contigs considered likely to be expressed in salivary glands contained at least 2 ESTs from the MpSG library, and at least 50% of the ESTs from those contigs were derived from the MpSG library. For each contig, the probability of meeting the selection criteria was calculated using the binomial distribution, the percentage of all ESTs that were in salivary gland libraries (12.1%), and the number of ESTs that contributed to the given contig. Predicted open reading frames from potential salivary gland contigs were translated in silico using the ExPASy translator tool [68]. The predicted protein sequence was run through SignalP 3.0 [69], using both the neural network and Hidden Markov model to identify a possible signal peptide in the predicted proteins.

Microarray Design
A 15,000-element expression array containing 60-mers representing the identified M. persicae unigene set was designed using eArray (Agilent, Santa Clara, CA, USA). Arrays are being produced at Agilent by in situ oligonucleotide synthesis.