All stages except egg and first instar larvae were included in our libraries. We collected 2nd and 3rd instar larvae infested apples from a fallow orchard in Urbana, IL during summer 2007 by dissecting them from apples, washing in water, and freezing at -80°C. We reared pupae from infested hawthorn fruit collected from trees in South Bend, IN, in fall 2006. Some pupae were transferred to 4°C after a prediapause period of two weeks and kept under diapause conditions for 4 months, while a second set of pupae was kept at 25°C until eclosion. Pupae were frozen from each set at regular intervals. After removal from diapause additional pupae from the first set were reared at 25°C to provide hawthorn race adults. This stratified sampling plan allowed us to collect a variety of developmental stages including prewinter diapausing pupae, postwinter diapausing pupae, and multiple points of pharate adult development  and post eclosion adult maturation. Apple host race adults were caught on fruit in a fallow apple orchard in Urbana, IL, in the summer of 2007. Adult heads were separated from the rest of the body, with the goal of increasing representation of olfactory transcripts, and heads and bodies were used to construct separate libraries. For numbers of flies representing each stage/body part/host race in our libraries and how they were spread across sequencing runs see Table 1.
We used a two-step RNA extraction procedure beginning with an initial TRIZOL (Invitrogen) extraction, followed by further purification on the filter-based RNeasy (Qiagen) kit. Beginning the extraction with TRIZOL extraction maximizes clean yields from fatty tissues, particularly in larvae and pupae, and the filter-based RNeasy kit eliminated the genomic DNA and body pigments typically left behind by TRIZOL extraction. Total RNA extractions were pooled into four samples representing 1) larvae, 2) pupae, 3) adult bodies minus heads, and 4) adult heads. The Oligotex Mini Kit (Qiagen) was then used to purify mRNA from each of the four total RNA pools. cDNAs were synthesized from 500 ng of mRNA following the Clontech's Creator SMART cDNA synthesis system using modified Oligo-dT and 5' RACE primers. Primer sequences were: CDSIII-First 454: 5' TAG AGA CCG AGG CGG CCG ACA TGT TTT GTT TTT TTT TCT TTT TTT TTT VN 3' and SMARTIV: 5' AAG CAG TGG TAT CAA CGC AGA GTG GCC ATT ACG GCC GGG 3'.
Separate normalization of each of the four stage-specific cDNA libraries and 454 pyrosequencing were carried out at the W.M. Keck Center for Comparative and Functional Genomics, University of Illinois at Urbana-Champaign, Roy J. Carver Biotechnology Center. For normalization, 300 ng of cDNAs were denatured, allowed to self-anneal, DSN treated (Duplex/double stranded specific Nuclease; Evrogen, Russia), and remaining transcripts were PCR amplified to make normalized ds-cDNAs. Titration runs were performed for each of the four libraries on 1/16th region of a 70 × 75 PicoTiterPlate (PTP), and subsequent bulk runs were split between two 1/4th regions respectively to spread each bulk run across at least two plates (Additional file 1).
Sequence assembly was performed using the Newbler Assembler with sequences masked for oligonucleotide adaptors used in SMART cDNA synthesis and normalization. All contigs and singlets were annotated by BLAST search against both the Drosophila melanogaster non-redundant protein database where the e-value threshold was set at 1e-5 for confident annotation. Gene Ontology (GO) assignments were also assigned based on sequence similarity to D. melanogaster and transcripts with GO assignments were collapsed down into 14 major GO categories under the Biological Process category.
To specifically search for transcripts associated with olfaction, and therefore possibly host fruit choice, we searched our contigs and long (>100 bp) singleton sequences with complete sets of D. melanogaster odorant receptors (ORs), gustatory receptors (GRs), and odorant binding proteins (OBPs), and recorded all hits to all of these receptors, even in cases of relatively large e-values. We then evaluated all hits manually for the presence of characteristic conserved amino acids, transmembrane domains, and other features of olfactory and gustatory receptors. We similarly searched for several other transcripts from Drosphila or moth species that have been implicated in peripheral chemoreception (K. Wanner, personal communication). The regulation of diapause is much less well understood, but we compiled and searched our data for a set of candidate transcripts identified by Denlinger et al.  and Hahn et al. . We further searched for transcripts classified as being involved in either "circadian rhythm" or "eclosion rhythm" in the GO matches above, yielding a set of 92 total candidate transcripts for diapause. EST sequences matching a gene from these transcripts of interest were aligned (BLASTX) against the NCBI NR protein data set to verify best match to our EST. This methodology does not ensure that the gene of interest and the Rhagoletis EST are orthologs or even fulfil the same function, but does provide a first hypothesis for future functional studies.
Because multiple individuals from different populations were combined for the pyrosequencing run we can identify putative single nucleotide polymorphism (SNP) variation within and between sequencing runs. To accomplish this we re-assembled reads in MOSAIK http://bioinformatics.bc.edu/marthlab/Mosaik using the Newbler-assembled contigs as anchors. Newbler assembly notation codes putative SNPs with flanking insertion characters (presumably to reflect uncertainty in indel errors), whereas most reference-guided assemblies provide a direct alignment at polymorphic positions that are easily recognized by SNP-finding packages. We used GigaBayes http://bioinformatics.bc.edu/marthlab/GigaBayes[55, 56] to estimate the (Bayesian) posterior probability that a given variable site in the re-assembly represents a true polymorphism, setting the threshold probability at 90%.
Because pupae and larvae were run separately and represent pools from hawthorn and apple populations, respectively, we were also able to identify putative host-specific allelic variation in SNPs. We calculated Pearson χ2 statistics for a host race by allele contingency table at each SNP locus, and we report host specific alleles as those with χ2
1 values exceeding p = 0.01. This test is not strictly applied because multiple reads could have come from the same individual, resulting in a non-representative population sample. However, the procedure provides a useful metric to identify candidates for future confirmation/characterization. In addition, we established reading frames from local BLASTX alignments  using the methods of Hahn et al.  to compensate for 454 under/overcall errors. From these alignments we determined whether each SNP putatively segregating between host races is synonymous or nonsynonymous. We performed conservative formal searches for microsatellite motifs within our sequences using the program MSATCOMMANDER  to identify sequences containing di-, tri-, tetra-, penta-, and hexanucleotide repeats with a minimum length of 7.
To determine whether stage specific expression can still be detected in normalized libraries, we focused on subsets of contigs annotated as larval cuticle or digestive transcripts, both of which are, in other species, expressed only in larvae [52, 53]. We considered only contigs with ≥ 20 reads in order to examine whether we would be able to detect strongly skewed contributions of larvae or pupae to a given contig.
Although any EST project is unlikely to represent all of a species' transcriptome, EST data can provide important first pass on transcript gain or loss among taxa. We used information from previous analyses that identified groups of orthologous genes (ortholog groups) by comparing completed genomes using Smith-Waterman sequence comparisons. The results from these analyses are available in an online data base (orthoDB) that allows searching for ortholog groups according to level of conservation in different taxonomic groups . To test for ortholog groups that were lost in Drosophila but potentially retained in Rhagoletis, we compiled a list of loci that had a single copy in Apis, Tribolium, Bombyx, Anopheles, Aedes, and Culex, but were not found in any of the 12 annotated Drosophila genomes (n = 66 ortholog groups). To identify genes that are unique to both Rhagoletis and Drosophila we limited our search to genes that had at least one copy in each of the 12 Drosophila species but were absent in non-drosophilid genomes (n = 850 ortholog groups). All nucleotide sequences from these ortholog groups were blasted against our set of Rhagoletis contigs using the tBLASTn algorithm. Translated amino acid sequences of contigs that had a blast match with an expected value < = e-30 to one or more translated sequences from a particular ortholog groups were aligned with all the translated amino acid sequences from that group with ClustalW. These alignments were imported into the MEGA software, which allowed for Genbank searches of the contig using the BLASTp algorithm (local alignment of protein sequences) against the non-redundant protein database (NCBI-NR). We displayed all results of these BLAST searches using the web-BLAST's built-in construction of neighbour-joining phylogenetic trees. If this first screening supported the respective hypothesis of gene loss or gain, then the top BLAST hits down to suitable outgroups were imported into MEGA and aligned with the Rhagoletis contig and ortholog group sequences using ClustalW. We trimmed the alignment to match the aligned length of the Rhagoletis sequences and constructed phylogenetic trees (neighbour-joining, 1000 bootstrap intervals). In order to infer the presence of a gene in Rhagoletis that is absent in Drosophila we required that the translated Rhagoletis sequence formed a well-supported clade with outgroup species genes (bootstrap values > 70) that did not include any Drosophila copies. To infer that a gene was novel to both Rhagoletis and Drosophila, it had to 1) form a well supported clade (bootstrap value > 70) with its closest Drosophila relative that lacked any non-drosophilid genes and 2) form a separate sister clade to a related sister clade composed of Rhagoletis and Drosophila genes. Our detection scheme was conservative because it ignored cases in which there was no evidence for an ancestral gene in the Schizophora [58, 59]. However, it avoided the quandary of having to decide at which point along a single evolutionary branch a sequence represented a new gene. The only exception for this rule was made in cases when no insect matches other than Drosophila could be obtained from Genbank. In this case the gene in question was classified as novel to the Schizophora.