The maternal and early embryonic transcriptome of the milkweed bug Oncopeltus fasciatus

Background Most evolutionary developmental biology ("evo-devo") studies of emerging model organisms focus on small numbers of candidate genes cloned individually using degenerate PCR. However, newly available sequencing technologies such as 454 pyrosequencing have recently begun to allow for massive gene discovery in animals without sequenced genomes. Within insects, although large volumes of sequence data are available for holometabolous insects, developmental studies of basally branching hemimetabolous insects typically suffer from low rates of gene discovery. Results We used 454 pyrosequencing to sequence over 500 million bases of cDNA from the ovaries and embryos of the milkweed bug Oncopeltus fasciatus, which lacks a sequenced genome. This indirectly developing insect occupies an important phylogenetic position, branching basal to Diptera (including fruit flies) and Hymenoptera (including honeybees), and is an experimentally tractable model for short-germ development. 2,087,410 reads from both normalized and non-normalized cDNA assembled into 21,097 sequences (isotigs) and 112,531 singletons. The assembled sequences fell into 16,617 unique gene models, and included predictions of splicing isoforms, which we examined experimentally. Discovery of new genes plateaued after assembly of ~1.5 million reads, suggesting that we have sequenced nearly all transcripts present in the cDNA sampled. Many transcripts have been assembled at close to full length, and there is a net gain of sequence data for over half of the pre-existing O. fasciatus accessions for developmental genes in GenBank. We identified 10,775 unique genes, including members of all major conserved metazoan signaling pathways and genes involved in several major categories of early developmental processes. We also specifically address the effects of cDNA normalization on gene discovery in de novo transcriptome analyses. Conclusions Our sequencing, assembly and annotation framework provide a simple and effective way to achieve high-throughput gene discovery for organisms lacking a sequenced genome. These data will have applications to the study of the evolution of arthropod genes and genetic pathways, and to the wider evolution, development and genomics communities working with emerging model organisms. [The sequence data from this study have been submitted to GenBank under study accession number SRP002610 (http://www.ncbi.nlm.nih.gov/sra?term=SRP002610). Custom scripts generated are available at http://www.extavourlab.com/protocols/index.html. Seven Additional files are available.]

Background New and emerging model organisms occupy an increasingly important part of the developmental biology and developmental genetics research landscape. While studying a huge diversity of animals has long been the norm in the classical fields of experimental embryology and functional morphology [see for example [1][2][3]], the molecular biology revolution and the advent of the "model system" concept [4] created demand for a small number of highly genetically manipulable organisms that could be intensively studied [5]. Research on these "big six" [sensu 6] genetic model organisms has led to enormous advances in our understanding of general principles of embryogenesis. However, placing these general principles in an evolutionary context requires broader taxonomic sampling. Many researchers have highlighted the need for developing new model organisms for specific comparative, evolutionary and ecological questions [6][7][8]. It has also been suggested, however, that the single gene expression approach of the last several decades of evolutionary developmental biology ("evo-devo") has outlived its usefulness, and that what are needed are not more model organisms, but rather a smaller number of groups chosen for the ability to functionally manipulate genes [9,10]. Sophisticated gene expression techniques and even stable germline transgenesis have been developed in a large array of models outside of the "big six" [see for example [11,12]]. The ancient history of the small RNA processing machinery [13,14] means that gene knockdown is a feasible goal for most organisms, as long as the sequences of genes of interest are available.
While whole genome sequencing is an increasingly viable option for some organisms, many new models, particularly within the arthropods, lack the large community resources necessary to finance and maintain annotation of a genome. For these reasons, many researchers studying nontraditional model organisms have turned to Sangersequenced EST libraries [see for example [15,16]]. In principle this method of gene discovery can lead to highthroughput expression and functional genetic analyses of multiple genes [see for example [17]]. In practice, however, most non-traditional organism studies are still subject to a gene discovery bottleneck. This is largely because at the scale needed to uncover rare developmental transcripts, Sanger-based EST sequencing quickly becomes technically and financially prohibitive for many labs working on organisms with smaller research communities. In addition, those smaller-scale EST projects that have been carried out are often not publically available in easily searchable formats, and their potential contribution to the developmental and evolutionary biology fields is thus limited.
Next-generation sequencing (NGS) offers comparative and evolutionary developmental biologists a way to obtain orders of magnitude more developmental gene data than ever before, at a fraction of its former cost. Several studies have demonstrated the feasibility of NGS for identifying SNPs for population studies and gene sequences for use as phylogenetic markers [18][19][20][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35]. Unfortunately, the lack of suitable protocols for cDNA preparation, and of established pipelines for analysis have left this tool under-utilized by many evo-devo researchers. Furthermore, according to some estimates [35], few of these studies have been carried out at a scale large enough to provide significant recovery of rare transcripts, and therefore of developmental genes. Here we present an optimized protocol for synthesizing cDNA for 454 Titanium pyrosequencing, as well as a simple workflow for de novo assembly of the data without a reference genome, annotation and analysis of the dataset, and a demonstration of its utility for comparative developmental genetics.
A large body of literature is dedicated to the development and genomics of holometabolous insects (insects undergoing complete metamorphosis between embryonic and adult stages). Tens of holometabolous insect genomes are now available, thanks largely to work on Drosophila melanogaster, other drosophilids, and dipteran disease vectors [36,37]. In contrast, relatively little is known about the development of hemimetabolous insects, which undergo incomplete metamorphosis. Although several of these insects are amenable to laboratory culture and a variety of experimental manipulations, molecular developmental studies are scarce, and gene discovery rates remain low. Notable exceptions among the Hemiptera are the aphid Acyrthosiphon pisum and the Chagas' disease vector Rhodnius prolixus, whose genomes are completed and in progress respectively [38,39]. However, the aphid genome has undergone extensive duplications and gene loss, possibly due to its unusual reproductive and ecological characteristics [38]. The mammalian blood feeding needs of R. prolixus make it a sub-optimal organism for developmental studies.
The milkweed bug Oncopeltus fasciatus ( Figure 1A-D) has emerged as a promising hemipteran system for studying the molecular development of hemimetabolous insects [40][41][42]. It can be reared easily and cheaply in the laboratory, and has a long history as a laboratory animal for classical embryology and pattern formation studies [43][44][45]. More recently, robust protocols for in situ hybridization, live imaging of embryogenesis, and RNAi-mediated gene knockdown have been developed and successfully applied to the study of the evolution of development [see for example [46,47]].
Here we present the results of the sequencing and de novo assembly of the Oncopeltus ovarian and early embryonic transcriptome. We outline an assembly and analysis framework using a combination of existing tools and freely available custom-made command line computational tools, which we hope will make this approach to gene discovery accessible to comparative developmental biologists. We identify homologues of genes involved in all major signaling pathways and developmental processes, including biologically verified splicing isoforms for some genes. We also address the need for library normalization in these studies, and show that at large enough scales of NGS, large numbers of developmental genes can be discovered even with omission of a normalization step.

Results and Discussion
Assembling the ovarian and embryonic transcriptome of O. fasciatus We prepared cDNA from ovaries and early to mid-staged embryos of O. fasciatus, covering oogenesis and all major stages of embryonic patterning ( Figure 1B-D). These cDNA samples were prepared using a protocol optimized for preparation of small or limiting samples for 454 pyrosequencing (see Materials and Methods). From these libraries, we generated a total of 2,087,410 sequence reads (Table 1). This includes reads generated using GS-FLX technology as well as both normalized (N) and nonnormalized (NN) cDNA sequenced using the GS-FLX Titanium platform. As expected, the reads generated using GS-FLX Titanium technology were substantially longer than those generated using GS-FLX technology (Table 1, Figure 2A). However, the N sample gave an unexpectedly low number of reads, which were on average shorter than those generated by the NN sample (Table 1; Figure 2A). Given that a pilot run of one lane (1/8 plate) of this same normalized cDNA sample generated roughly equal number and size-distribution as a NN pilot study (Additional file 1), we suspect that a technical error reduced the sequencing efficiency of this plate. Despite the comparatively low yield of this normalized cDNA, it still generated more than 600,000 high quality reads that we therefore included in subsequent analyses.
We used the cDNA assembly algorithm of Newbler v2.3 (Roche) to screen the reads for adaptor sequence   Effects of normalization and 454 sequencing chemistry on read length and isotig length. (A) Titanium sequencing chemistry (grey, black) generally results in longer read lengths when compared with FLX chemistry (white). However, the normalized sample run with Titanium chemistry (black) had shorter read lengths than the non-normalized sample (grey). This result is likely due to a technical error in that particular sequencing run, since a 1/8 plate run of the same sample showed a read length distribution comparable to that of the nonnormalized sample (Additional file 1). (B) Isotig length distributions from assemblies of Titanium-sequenced data. The longest isotig per isogroup is shown. The number of bases in the non-normalized (grey) and normalized (black) samples has been equalized to eliminate possible bias due to the greater number and length of reads obtained from the run of the normalized sample (see (A)). The isotigs generated from the normalized cDNA tended to be shorter than those produced by the non-normalized cDNA (see also To our knowledge, Newbler v2.3 and higher are the only assembly programs that address alternative splicing and can output multiple isoforms per gene. Newbler v2.3 explicitly accounts for alternative splicing by creating a hierarchical assembly composed of three elements: contigs, isotigs, and isogroups. For consistency, we follow their terminology. Contigs are stretches of assembled reads that are free of branching conflicts. In other words, contigs can be thought of as exons or sets of exons that are always co-transcribed. Isotigs represent a particular continuous path through a set of contigs, i.e. a transcript. An isogroup is the set of isotigs arising from the same set of contigs, i.e. a gene. Different isotigs within an isogroup are thought to represent alternative isoforms of the same gene. Note that it is possible for an isogroup to contain only one isotig, and it is also possible for an isotig to be composed of only one contig. After the initial Newbler assembly, we noticed substantial redundancy among the singletons. We therefore subjected the 178,770 unassembled singletons to a secondary assembly with CAP3 [48]. This secondary assembly reduced the number of singletons from 178,770 to 112,531 (28,143 cap3_contigs and 84,388 cap3_singlets). Thus, in total, our assembly generated a total of 133,628 sequences, including isotigs, cap3_contigs and cap3_singlets (Table 2).
The average coverage among contigs was 23.2 reads/ bp (median coverage = 6.9 reads/bp) (Additional file 2). This coverage value is more than twice as high as the highest reported value from a de novo transcriptome assembly to date [summarized in [20]]. Such deep coverage should be helpful for overcoming the presence of insertion/deletion errors in the individual raw reads [49].
To test whether our assembly would have been aided by the inclusion of nucleotide sequence from Rhodnius prolixus, the most closely related hemipteran to O. fasciatus whose genome is currently being sequenced [39], we used the BLASTN algorithm to compare our isotigs (the longest isotig per isogroup) with the published ESTs of R. prolixus with an e-value cut-off of 1e-6. Consistent with previous observations of extremely low levels of conservation between insect genomes [50] we found that only 53 out of 16,617 isotigs had hits to R. prolixus ESTs on the nucleotide level. These results suggest that de novo sequencing and assembling efforts will be necessary for most insect species, even when sequence data are available for other members of the same order. We note, however, that a recent study [51] has shown that it may be possible to incorporate EST data from different species into a de novo assembly by using amino acid sequence rather than nucleotide sequence.

Validation of predicted alternate isoforms
To examine whether the alternative isoforms predicted by Newbler v2.3 are in fact present in developing embryos of O. fasciatus, we first focused on a gene of particular interest to developmental biologists, nanos. This conserved metazoan gene was first described as a loss of function mutation in Drosophila melanogaster [52], and is necessary for germ cell and posterior somatic development [reviewed in [53]]. Newbler v2.3 predicted this gene to encode two alternative isotigs within a single isogroup ( Figure 3B). The two isotigs differ in that the longer contains an additional 100-bp exon that is absent from the shorter ( Figure 3B). We designed PCR primers against sequences present in both isotigs ( Figure 3B arrows), which amplified two bands differing by~100 bp from a pool of embryonic cDNA ( Figure 3C). Sequencing of these two bands confirmed that they differ exactly as predicted by Newbler v2.3 ( Figure 3D). Importantly, a previous version of Newbler (v2.0), which does not account for alternative splicing, failed to join together the three fragments which were linked by Newbler v2.3 ( Figure 3A). Because of this, Newbler v2.0 (and presumably other assemblers which do not address branching within contigs) predicted three separate contigs, only one of which could be identified as nanos with BLASTX, as the others fall in poorly conserved regions of the gene. Thus, the ability of Newbler2.3 to handle branching conflicts between reads allows this program to assemble longer continuous sequences, which are therefore in turn more easily annotated using BLAST.
To further characterize the accuracy of Newbler's predictions of alternative transcript isoforms, we randomly selected 10 isogroups that contained exactly two alternative isotigs differing by the presence/absence of a single contig (Additional file 3). As we did for nanos, we designed primers to flank the region differing between the two predicted isoforms (Additional file 3A), and performed RT-PCR on O. fasciatus embryonic cDNA. In eight of ten instances, we observed bands of the predicted sizes following agarose gel electrophoresis (Additional file 3B,C). However, in four of the eight positive cases, additional, unpredicted bands were present (Additional file 3). In one of the ten cases, we observed two RT-PCR products, but only one of them was of the predicted size (Additional file 3C, lane 6). Taken together, these results suggest that Newbler v2.3 has a low rate of false positives in the prediction of multiple splicing isoforms. Including our investigation of nanos, only one of 11 test cases (9.1%) produced a single RT-PCR product where Newbler v2.3 had predicted multiple products. However, we observed that roughly half of the time, Newbler v2.3 failed to predict all of the isoforms identified via RT-PCR.

Transcriptome annotation
A BLASTN search of our dataset for the 93 existing GenBank accessions for O. fasciatus sequences yielded a hit result for 56% of the accessions, with an e-value cutoff of 1e-10. This result may be due in part to the short length of some of the GenBank sequences. Accordingly, we found that accessions with hits in the database were significantly longer (mean length 729 bp) than accessions without hits (mean length 397 bp) (unpaired Student's t-Test: t = 2.89, DF = 91, p = 0.0048). Of greater relevance to developmental applications of this dataset, however, was our finding that 85% of O. fasciatus developmental genes with existing GenBank accessions (n = 32) are represented in our transcriptome.
We then used BLASTX to map the 133,628 O. fasciatus sequences (isotigs, cap3_contigs and cap3_singletons) against the entire RefSeq Protein database with an e-value cut-off of 1e-10. To simplify these statistics, we report only the BLAST results for the longest isotig per isogroup, under the assumption that all isotigs within an isogroup share nearly identical BLAST results. Of 16,617 isotigs, 7,219 (43.4%) had at least one hit. Of the 28,143 cap3_contigs, 2,594 (9.2%) had hits, and of the 84,388 cap3_singlets, 2,367 (2.8%) had hits. These values are higher than comparable BLAST statistics of most other published studies of 454-generated de novo transcriptomes [24][25][26]30,32,33], likely because deeper sequencing increases the length of assembled sequences and thereby makes these sequences more likely to be identified via BLAST. The unidentifiable sequences likely originate from UTRs or non-conserved portions of protein-coding sequences. Of the top BLAST hits, 89.3% were genes from arthropod sequences (Additional file 4). Of the 12,180 O. fasciatus sequences with BLAST hits, 1,455 hit non-overlapping segments of the same top BLAST hit (i.e. potentially unassembled portions of the same transcript), and 825 hit overlapping segments of the same top BLAST hit (i.e. potential paralogs). Excluding those 1,455 potentially double-counted BLAST hits, our transcriptome identified a total of 10,775 genes. The assembled sequences generated in this study, as well as pre-computed BLAST results, are available as flat files from the authors upon request.
To explore and summarize the functional categories of the genes sequenced in this study, we obtained the Gene Ontology (GO) terms associated with the top 20 BLAST hits of each sequence using Blast2GO [54]. Among the 7,059 genes for which we obtained GO terms, we observed a wide diversity of functional categories represented on all levels of the Gene Ontology database ( Figure 4). The O. fasciatus sequences fall into GO categories with a roughly similar distribution to that of the well-annotated Drosophila melanogaster genome, suggesting that our sequence data contain a large diversity of genes involved in a variety of biological processes, and do not contain any notable biases towards particular categories of genes.

Assessing coverage of the O. fasciatus transcriptome
We wished to know how thoroughly our sequencing efforts sampled the true diversity of transcripts present in our cDNA samples. This is a two-part question: first, of the genes truly expressed during O. fasciatus oogenesis and embryogenesis, how many did we identify? And second, of these identified genes, how thoroughly had we assembled their full-length transcripts?
To address the first question, we created eight separate assemblies of progressively larger sub-samples of our total reads and tallied the total number of genes identified via BLASTX. The number of newly discovered genes began to plateau after~1.5M reads (1 7/8 plates in our case) ( Figure 5 black line). However, the N50 isotig length continued to increase roughly linearly over this range of reads ( Figure 5 grey line). These results suggest that additional sequencing of this sample is unlikely to identify substantially more genes, but may continue to lengthen the existing sequences. Although in the absence of a sequenced genome it is not possible to accurately estimate how many genes are in fact present in the O. fasciatus transcriptome, we note that while several developmental genes of interest were identified in this study, others were not. (Tables 3, 4 and see  below). Because these data suggest that we have sequenced these specific cDNA samples quite deeply, some form of specific target enrichment may be necessary for future attempts to discover additional genes not identified in this dataset.
To address the second question, we employed a method proposed by O'Neil and colleagues [20] for addressing the question of how closely our sequences approached full-length transcripts. Their metric, the "ortholog hit ratio," compares the length of the newly discovered sequence that obtains a BLAST hit versus the full length of its top hit [20]. Thus, an ortholog hit ratio of one implies that a transcript has been assembled to its true full length, while values over one suggest insertions in the query sequence relative to its top BLAST hit. We note the caveat that many genes contain relatively poorly conserved regions that may fail to obtain a BLAST hit at all, causing the ortholog hit ratio to be an underestimate in these cases (Additional file 5). In our dataset, many of the O. fasciatus isotigs appear to be nearly fully assembled, while the singletons predictably tend to represent small portions of their top BLAST hit in RefSeq ( Figure 6). In total, of the 7,219 isotigs with BLAST hits, 3,953 (54.8%) had ratios > 0.5 and 2,689 (37.2%) had ratios > 0.8. We also asked, for those O. fasciatus sequences of developmental genes already present in GenBank that overlapped with transcriptome hits (n = 23), whether our transcriptome data provided any net gain in transcript sequence compared to the GenBank accession sequence. In 15/23 cases (68%), the transcriptome data extended the known sequence beyond that reported in GenBank by an average of 349 bp (range: 82-1,366 bp). In most cases, additional 3' sequence was obtained (Figure 7).

Assessing the value of cDNA normalization
Reducing the representation of highly abundant transcripts (i.e. normalizing the cDNA) is often considered essential to capture sequence from genes expressed at lower levels, including many important developmental genes [see for example [55][56][57]]. However, we hypothesized that current next-generation sequencing technologies could provide sufficiently deep sequence to render normalization largely unnecessary for construction of de novo transcriptomes for comparative developmental biologists. To address this question, we assessed the relative contribution of the N and NN cDNA to our final assembly using several strategies.
First, to test whether our normalization protocol successfully reduced the presence of highly abundant transcripts, we created separate assemblies from the N and NN cDNA samples (equalizing the total number of bases to reduce the contribution of additional sequence found in the NN sample). The N assembly contained a greater number of isotigs that were shorter on average Column heights reflect the percentage of annotated sequences in each assembly that mapped to a given Biological Process GO term. The relative percentages of genes falling into GO categories are comparable between our O. fasciatus transcriptome (black) and the D. melanogaster transcriptome (white). than those in the NN assembly ( Figure 2B). Additionally, more singletons were generated in the N assembly relative to the NN assembly (Table 2). Further, similar to the results obtained by Bellin and colleagues [27], we observed the predicted decrease in the maximum number of reads per contig in the N assembly compared to the NN assembly ( Figure 8A, B), demonstrating that the normalization procedure successfully reduced the sequencing of highly abundant transcripts. These statistics, which could be interpreted to suggest that the N reads generated an inferior assembly, may result from the shorter average length of reads in the N sample ( Figure 2A). Indeed, Newbler rejected 7.9% (30,780) of the N reads as too short, compared to only 1% (3,935) of the NN reads. However, these assembly statistics could also indicate greater heterogeneity in the N sample, which would suggest that normalization might increase the number of new genes identified.
To discriminate between these possibilities, we explored the contribution of the N and NN reads to the genes discovered in our full assembly. We used BLASTN to map one plate's worth of raw reads from the N sample and from the NN sample (equalized to contain the same number of base pairs) against the complete assembled transcriptome, with an e-value cutoff of 1e-4. We then explored the GO annotation of those genes hit exclusively by only one of these two samples. We observed similar overall GO term distributions between the N and NN samples ( Figure 8C). We found that a small number of GO terms (n = 20) were significantly differentially represented in the two samples, albeit generally with very few sequences in each GO term (Additional file 6). For example, we were surprised to see that three of the four terms statistically over-represented in the N sample were related to ribosome function (14/750 (1.9%) of the N hits were annotated with 'ribosomal subunit', compared to 1/1124 (0.09%) NN hits; FDR-corrected p-value = 0.006). In contrast, several terms related to active transmembrane transport were over-represented in the NN sample (Additional file 6) possibly indicating that normalization may have reduced the representation of genes involved in certain basic metabolic processes.
As an additional way to investigate the contribution of the N and NN samples to identifying specific genes of interest for our studies, we manually examined the results of mapping the N and NN samples to the fully assembled transcriptome. Of the 79 genes of interest that we investigated, four (5.1%) were uniquely present in the N sample, whereas nine (11.4%) were uniquely present in the NN sample, and the remaining 66 (83.6%) were present in reads of both the N and NN samples (Tables 3,  4). Although this may be an artifact of sequencing depth (i.e. low-abundance genes of interest may be present in only one of the two cDNA samples simply due to sampling effects rather than the normalization protocol per se), our data suggest that the normalized cDNA sample did not contribute disproportionately to gene discovery.

Gene discovery for developmental studies
The ultimate goal of this sequencing project was to identify a wide diversity of candidate genes involved in developmental processes. Traditionally, such gene discovery in "non-model" organisms has required degenerate PCR, which is labor-intensive, expensive, and prone to failure. The annotated transcriptome assembly we present here allows researchers to identify genes of interest via simple text searches, or via BLAST searches. To demonstrate the usefulness of these data for largescale gene discovery, we report here the identification of several components from each of the seven widely studied metazoan signaling pathways (Table 3) as well as many genes involved in specific developmental processes (Table 4). We note that the majority of these gene fragments are of suitable length for immediate application of such widely used techniques as in situ hybridization and RNAi-based functional knockdown. In cases of functional experiments where full-length proteins are desirable, such as protein overexpression, RACE PCR will likely be required. Importantly, we note that many genes of interest were present among the singletons, many of which are long enough for immediate use as sequences for in situ hybridization probes or RNAi Randomly chosen subsets of increasing numbers of Titanium reads were used to generate progressively larger sub-assemblies. The number of reads in each sub-assembly (X axis) is plotted against the number of unique BLAST hits in each sub-assembly (left Y axis: black), and against the N50 isotig length (right Y axis: grey). For this analysis BLAST was performed against the SwissProt database. The number of unique BLAST hits plateaus when the assembly is composed of approximately 1.5 million reads. However, the N50 isotig length maintains an approximately constant rate of increase. templates, emphasizing the importance of including these in NGS gene discovery studies.
Although we identified a diverse array of genes, some well-studied genes known to be expressed during embryogenesis were not easily identified in this study. For example, our BLAST results only contained three genes from the Hox cluster (fushi tarazu, Antennapedia, and Abdominal-B), although orthologs of all the canonical arthropod Hox genes are known to be present in O. fasciatus [58]. However, using the O. fasciatus Hox gene sequence fragments available from NCBI as a BLAST query against our transcriptome did reveal sequences for all Hox genes except Sex combs reduced. It is possible that these genes are expressed at very low levels during the developmental stages sampled here, suggesting that enrichment techniques may be necessary to more easily identify certain genes of interest. We do note, however, that fushi tarazu, the only Hox cluster gene not previously identified in O. fasciatus, was identified in both N and NN samples of this transcriptome dataset (Table 4).

Case study: gene discovery for endocrine regulation of development
In addition to surveying the transcriptome for genes involved in embryonic patterning and other developmental Hit ID indicates if gene hits were found among isotigs (I), Cap3-assembled contigs (C), or unassembled singletons (S). Sequence length (range) indicates the shortest and longest S, C or I hit sequences for each gene. These results were generated by BLASTing the raw reads from the N and NN samples against the full assembly. When multiple sequences were obtained via name search, they were tested to see whether they could be made to form a contig with Sequencher or CLC Combined Workbench (see Methods). Asterisk indicates hits only present in normalized GS-FLX reads. X(Y † ) indicates that the X sequences with hits could be assembled into Y contigs.  Hit ID indicates if gene hits were found among isotigs (I), CAP3-assembled contigs (C), or unassembled singletons (S). Sequence length (range) indicates the shortest and longest S, C or I hit sequences for each gene. These results were generated by BLASTing the raw reads from the N and NN samples against the full assembly. When multiple sequences were obtained via name search, they were tested to see whether they could be made to form a contig with Sequencher or CLC Combined Workbench (see Methods). Asterisk indicates hits only present in normalized GS-FLX reads. X(Y † ) indicates that the X sequences with hits could be assembled into Y contigs. Boldface indicates genes also present in Table 3.
processes, we asked whether we could also identify genes known to be employed in biological processes during postembryonic development of holometabolous insects. Recent studies have suggested that many of the genes used during holometabolous insect metamorphosis may also play important roles during embryogenesis in hemimetabolous insects [59,60]. To investigate this, we searched the O. fasciatus transcriptome for expression of key ecdysteroidand juvenile hormone (JH)-related genes. We identified transcripts for many of the known ecdysteroid biosynthesis genes, including cytochrome P450 genes encoded by the Drosophila Halloween family, such as shade (CYP314A1), shadow (CYP315A1), phantom (CYP306A1) and disembodied (CYP302A1) ( Table 4). We also detected expression of ecdysone response genes. In particular, we identified many of the ecdysone-regulated genes that play key roles during molting and metamorphosis, including E75, HR3, and HR4 (Table 4). The presence of these genes in the ovaries and early embryos of O. fasciatus corroborates recent studies that implicate ecdysone-response genes in key developmental processes during embryogenesis [59][60][61]. As might be expected for a situation where ecdysone regulates embryonic development but not molting, transcripts encoding insect peptide hormones implicated in eclosion behavior, such as ecdysis-triggering hormone, eclosion hormone and crustacean cardioactive peptide, were not detected. JH biosynthesis and response genes were also isolated ( Table 4). JH has been shown to play a role in promoting embryonic development and tissue maturation [62]. The expression of these genes, together with that of JH esterase and JH binding proteins, is consistent with previous studies implicating tight control of JH during embryogenesis [63].

Conclusions
We have used 454 pyrosequencing to create an early developmental transcriptome for the milkweed bug O. fasciatus in the absence of a reference genome. Although genomic sequence data will be necessary in the future for linkage or cis-regulatory analyses, at the early stages of establishing new model organisms, one of the most important goals is often gene discovery. In this regard, while no transcriptome generated in this way can realistically be "complete" in the sense of containing full length transcripts for all expressed genes, we propose that for many evolutionary developmental biology studies, the approach described here is a useful one for fast, high-throughput gene discovery. A high priority for comparative developmental biology research is gene expression and function analyses. By sequencing at great depth and testing a variety of cDNA preparation methods (normalized, non-normalized, embryo-and ovaryspecific), we have generated tens of thousands of gene sequences of sufficient lengths for the commonly used developmental techniques of in situ hybridization and RNAi-mediated gene knockdown. These data can also be used for phylogenetic, population genetic, and functional genomic applications, provide a starting point for identification of genomic regulatory sequences, and assist with assembly of hemipteran genomes sequenced in the future.

Note added in Proof
While this article was in review, Kumar and Blaxter [64] published a comparison of de novo assemblers for 454 transcriptome data, and reported important shortcomings of Newbler v2.3 compared to other available assemblers. Specifically, the authors reported that Newbler v2.3 produced the smallest assembly (i.e. the smallest number of base pairs incorporated into contigs) of the assemblers tested. The authors argue that this poor performance is likely because Newbler v2.3 inexplicably discards portions of read overlap information. In contrast, a newer, currently unreleased version of Newbler, v2.5, produced the most complete assembly of all those tested. Kumar and Blaxter (2010) therefore strongly advise all de novo 454 transcriptome assembly projects which have used Newbler v2.3 to recompute their assemblies with Newbler v2.5.
To address this concern, we obtained a pre-release version of Newbler v2.5 from Roche and reassembled the O. fasciatus data, again using the -nosplit flag. In contrast to Kumar and Blaxter (2010), we observed much less dramatic differences between the assemblies Figure 6 Ortholog hit ratio analysis of isotigs and CAP3reassembled singletons. An ortholog hit ratio of one implies that a transcript has been assembled to its true full length. For isotigs (black), a majority (54.8%) appear to contain at least 50% of the full length transcript sequence (arrow), while over one-third (37.2%) appear to represent at least 80% of the full length transcript sequence (arrowhead). Most singletons (grey) represent much smaller percentages of full-length transcripts.
produced by Newbler v2.3 and Newbler v2.5 (Additional file 7). For example, Kumar and Blaxter (2010) report that Newbler v2.5 increased their total assembly size by 39% compared to Newbler v2.3. For the O. fasciatus data analyzed here, Newbler v2.5 increased the total assembly size by less than 1% (Additional file 7). Further, we observed very similar numbers of isogroups, isotigs, and singletons between the two assemblies (Additional file 7). We did observe a 16% increase in the number of contigs reported by Newbler v2.5, but this difference was markedly less than the 80% increase observed in the data analyzed by Kumar and Blaxter (2010). After BLASTing all of the assembled isotigs and cap3-assembled singletons against the RefSeq database, we identified a total of 10,886 unique BLAST hits, compared to 10,775 genes identified using Newbler v2.3.
These results suggest that, although we did observe a modest increase in assembly size using Newbler v2.5, the analyses presented in the current study are largely robust against differences between currently available versions of Newbler. One possible explanation for the difference between these results and those observed by Kumar and Blaxter (2010), is the greater sequencing depth performed in the current study. If in fact the poor performance of Newbler v2.3 involves discarding information in regions of low coverage, the fact that our dataset includes~2.4x more reads than that analyzed by Kumar and Blaxter (2010) may explain the reduced improvement that Newbler v2.5 provided our dataset. Figure 7 The O. fasciatus transcriptome adds sequence data to existing GenBank accessions, which in turn improves annotation of transcriptome sequences. (A) Extended contig for Of-hunchback (bottom), comprising the complete mRNA GenBank accession (top, light grey), two isotigs and one CAP3 contig from the transcriptome (middle, dark grey). The largest isotig provides an additional 252 bp of 3' UTR sequence to the GenBank sequence (black). Comparison with the GenBank sequence enabled isotig 08619 and cap3_contig 21314 to be assembled into the same contig. (B) Extended contig for Of-homothorax (bottom), with a partial mRNA GenBank accession (top, light grey) and two transcriptome isotigs (middle, dark grey). Both isotigs extend beyond the known GenBank sequence at the 3' and 5' ends, extending the known region by 449 bp in total (black). Both isotigs had been identified as homothorax, and because they did not overlap, they were classified as belonging to the same transcript rather than being paralogs. The GenBank sequence bridges an 87 bp gap between the isotigs, confirming that both sequences are fragments of a single gene.
We also suggest that the reduced number of genes identified via BLAST observed by Kumar and Blaxter (their Table five) may result from the fact that the authors excluded singletons from their analyses. If Newbler v2.3 indeed fails to assemble regions of low coverage and instead retains those reads as singletons, many genes of interest may only be present as singletons. Indeed, we observed many genes of interest exclusively represented as singletons (Tables 3 and 4). Thus, for the purpose of gene discovery, we emphasize that future de novo transcriptome projects should analyze singletons as an important source of useful gene sequence.
Although our results do not appear to be greatly sensitive to which version of Newbler is used, we agree with Kumar and Blaxter (2010) that future transcriptome project should use utilize the most current Figure 8 Normalization decreases coverage of highly abundant genes, but does not change the GO term distribution of contigs. In both samples, most contigs are composed of <10 2 reads. However, the non-normalized sample (A) contains contigs with many more reads per contigs than the normalized sample (B). In other words, normalization preferentially decreases the number of reads of those contigs with the most reads. (C) GO term distributions do not differ dramatically between pyrosequenced libraries of N versus NN cDNA. However, see Additional file 6 for exceptions. Column heights reflect the percentage of annotated sequences in each assembly that mapped to a given GO term. Note that the GO terms shown represent the results of mapping the N and NN reads against the complete assembly, rather than those obtained via independent assemblies of N and NN reads.
available version of Newbler, or whichever assembler algorithm they find most useful for their data.

Animal culture
The O. fasciatus specimens sequenced in this study were originally purchased from the Carolina Biological Supply Company (Burlington, NC) and were maintained in the laboratory on sunflower seeds under a 12h:12h light/ dark cycle at 28°C.

cDNA Synthesis
For our pilot study using the GS-FLX platform, total RNA was isolated from mature ovaries ( Figure 1B) and from mixed-stage embryos representing the first three days of development (roughly 60% of embryogenesis at 28°C; Figure 1C, D) using TRIzol (Invitrogen), following the manufacturer's protocols. For each RNA sample, approximately 5 μg of cDNA was prepared using the SMART cDNA library construction kit (Clontech, CA, USA). The cDNA was normalized using Evrogen's Trimmer-Direct cDNA Normalization kit (Evrogen, Moscow, Russia), and subsequently digested with SfiI to partially remove the SMART adapters. The size distributions of total RNA and cDNA were assessed on 1.0% agarose gels following each step of the protocol.
To prepare cDNA for sequencing on the GS-FLX Titanium platform, we followed a modified version of the SMART cDNA protocol [65] that has been optimized for cDNA quality and yield from small quantities of total RNA. A helpful guide that formed the initial basis for the optimization of this protocol was once available online from Evrogen, but has since been removed. At the time these libraries were prepared, Roche had not yet provided a specific protocol for cDNA library preparation for 454 pyrosequencing. Subsequently, the company has released a cDNA protocol that requires approximately 500 ng of purified mRNA (typically requiring isolation of 10 to 50 μg of total RNA). While useful for larger tissue samples, the Roche cDNA preparation protocol is difficult to apply to samples in which RNA quantity is limiting, as is the case with many non-model organisms. The protocol we present here does not require the loss-prone step of mRNA purification, and we have found that it produces sufficient quantities of high-quality cDNA when 5 μl of the RNA (18S and 28S bands) can be visualized on a 1% agarose gel stained with ethidium bromide. Compared with the original SMART protocol, we have optimized the primers, PCR conditions, and downstream purification steps to maximize the yield of double-stranded cDNA required for 454 pyrosequencing. We initially optimized this protocol for Roche's original 454 library preparation protocol (not specific to cDNA), which required input of double-stranded DNA amounts of 2.5-10 μg (in our experience, typically 10-20 μg prepared cDNA as measured by UV absorbance). However, newer protocols from Roche require only 500 ng doublestranded cDNA, limiting the need for a secondary amplification step, as described here, for samples with highly limiting quantities of total RNA.
After separately isolating total RNA from mature ovaries ( Figure 1B) and from each of the first three days of embryogenesis ( Figure 1C, D) as described above, each RNA sample was treated with DNAse to remove potential genomic contamination. Equal amounts of each sample were then pooled for use as a template for first strand cDNA synthesis. Due to concerns that the poly (T) primer used in the SMART kit could interfere with pyrosequencing, the 3'-primer used was modified in two ways: (1) the poly(T) was interrupted every fourth base by the inclusion of a cytosine [sensu 30]; and (2) the primer contained an MmeI site which allowed most of the poly(T) to be removed during digestion. This 3'-primer (PD243Mme-30TC, 5'-ATT CTA GAG CGC ACC TTG GCC TCC GAC TTT TCT TTT CTT TTT TTT TCT TTT TTT TTT VN-3') was used during first strand synthesis and for all subsequent amplification steps. Because MmeI also cleaves relatively commonly within eukaryotic genes, it may not always be desirable to use this enzyme for library preparation. As an alternative, we have additionally found that a similar 3' primer containing an SfiI cleavage site (PD243-30TC, 5'-ATT CTA GAG GCC ACC TTG GCC GAC ATG TTT  TCT TTT CTT TTT TTT TCT TTT TTT TTT VN-3') is also effective in producing cDNA that yields highquality 454 data (data not shown).
For first-strand synthesis, 3 μg of total RNA (in 6 μl) and 2 μl 3' primer (12 μM) were mixed and denatured at 65°C for 5 minutes, then placed on ice. Reverse transcription reactions using SuperScript II (Invitrogen) in the manufacturer's recommended buffer were performed for 50 minutes at 42°C using twice the recommended concentration of enzyme, 1 μl of Protector RNAse inhibitor (Roche) to avoid RNA degradation, 2 μl 5' primer (12 μM), 2 μl 10 mM DTT, and 1 μl 10 mM dNTPs. Template-switching essential for the SMART technique was achieved using a 5' primer (PD242, 5'-AAG CAG TGG TAT CAA CGC AGA GTG GCC ACG AAG GCC rGrGrG-3') with three RNA nucleotides at its 3' end, which contains an SfiI site. Reactions were then heat-inactivated for 15 minutes at 70°C and diluted 1:5 in milliQ water in preparation for PCR amplification. Contrary to some expectations, SuperScript III reverse transcriptase (Invitrogen) may be substituted in this protocol with equivalent results (data not shown).
To maximize yield during cDNA amplification, the first round of amplification was conducted using a 2:2:1 mix (v:v:v) of Hemo KlenTaq (New England Biolabs), Phusion (New England Biolabs), and PfuTurbo (Stratagene) polymerases. This mixture of enzymes was determined empirically to provide the highest yield of cDNA with a range of input first-strand concentrations. Cesium KlenTaq AC (DNA Polymerase Technologies) and the hot start versions of Phusion and PfuTurbo polymerases in the same ratio may be also substituted at this step without sacrificing yield; this may produce fewer PCR artifacts in the final cDNA preparation. Buffer conditions (MgCl 2 and DMSO) were also empirically optimized to maximize yield and minimize PCR artifacts. Reactions were performed in 100 μL total volume in 1X Phusion HF buffer, 1.5 μL polymerase mix, 5 μL first-strand cDNA (previously diluted 1:5 in H 2 O), 1 μL 3' primer (PD243Mme-30TC, 12 μM), 1 μL 5' primer (PCRIIA, 5'-AAG CAG TGG TAT CAA CGC AGA GT-3', 12 μM), and a final concentration of 1% DMSO, 1.5 mM MgCl 2 (in addition to the MgCl 2 already present in the HF buffer), and 200 μM dNTPs. Reactions were cycled with the following program: 1 minute at 95°C, followed by 16-20 cycles of 30 seconds at 95°C (see below for determining optimal number of cycles), 30 seconds at 66°C, and 3 minutes at 72°C, and a final 10 minutes at 72°C. After cooling to room temperature, 10 μL 3M NaOAc pH 5.5 was added to each 100 μL secondary PCR reaction followed by purification with the QiaQuick PCR purification kit (Qiagen) using the manufacturer's recommended protocol. For all purification steps, samples were eluted with TM buffer (10 mM Tris-HCl pH 8.5, 1 mM MgCl 2 ) to prevent strand separation of double-stranded cDNA.
To produce sufficient cDNA for sequencing, Advantage 2 (Clontech) polymerase was used under the manufacturer's recommended conditions during the second round of amplification using the same primer concentrations and 1 μl of undiluted primary PCR product. We recommend testing a range of dilutions of the primary PCR product to obtain the desired quantity of amplified cDNA in 9-10 PCR cycles. In cases of highly limiting RNA concentration, we have also found that a secondary PCR reaction using a 1:1:1 mix of Phusion, Cesium Klen-Taq AC, and Deep Vent (exo-) (New England Biolabs) polymerase in ThermoPol reaction buffer supplemented with 1.5 mM MgSO 4 and 1% DMSO produces the highest yield of secondary PCR product (note that this polymerase mix does not produce optimal results when used for first-round amplification). Secondary PCR reactions were cycled using the same parameters as the primary PCR but running for approximately 10 cycles.
To prevent overcycling during both rounds of PCR amplification, each reaction was prepared in duplicate, and one reaction was spiked with 1 μl of 1:750 SybrGreen I (Invitrogen). The spiked reactions were monitored in real time on an Mx3005P QPCR machine (Stratagene Inc.), and the samples were removed when amplification began to plateau. To increase the representation of double-stranded cDNA, two cycles of "chase PCR" were conducted following each round of cDNA amplification after the optimal number of cycles had been reached. Excess primers were added (1.5 μL of each, 12 μM primer per 100 μL reaction), and each reaction was subjected to two additional non-denaturing cycles of 1 minute at 77°C, 1 minute at 65°C, and 3 minutes at 72°C, followed by a 10 minute extension at 72°C.
Following the second round of amplification and PCR purification, the cDNA samples were double-digested with SfiI and MmeI (40 and 26 units per 150 μl reaction, respectively). cDNA species <500 bp were then removed using Chroma Spin 400 columns (Clontech) which had been equilibrated with TM buffer following the manufacturer's protocol. It should be noted that the Chroma Spin column protocol suggested in the Clontech SMART cDNA kit is non-optimal, and that following the protocol provided with the separately purchased columns is less labor-intensive and produces a higher yield of size-selected cDNA. Equilibration of Chroma Spin columns is critical for maximizing the yield of doublestranded cDNA as required by the Roche library preparation protocols. Following size selection, cDNA was blunt-ended with the NEB Quick Blunting kit (New England Biolabs) and purified once more with the Qia-Quick kit. After each step of cDNA synthesis, the size distribution was checked on 1.0% agarose gels, and the cDNA samples were quantified using a Qubit (Invitrogen), after observing that the NanoDrop 1000 (Thermo Scientific) did not reliably quantify ds-cDNA (C. Dunn, personal communication).
To prepare normalized cDNA for GS-FLX Titanium sequencing, 1 μl of the twice-amplified, purified cDNA sample described above was subjected to Evrogen's DSN-treatment protocol, followed by a single round of further amplification, SfiI/MmeI digestion, and size selection. Approximately 5 μl of normalized and nonnormalized cDNA were synthesized.

Titanium Pyrosequencing
For the pilot study using the GS-FLX platform, EnGen-Core (University of South Carolina) conducted the final steps of library preparation, including nebulization, adaptor-ligation, and sequencing of each sample (¼ plate each). For sequencing using the Titanium platform, the samples were nebulized, adaptor-ligated, and pyrosequenced by the Institute for Genome Science and Policy DNA Sequencing Facility (Duke University).

Sequence Assembly
Raw reads were assembled using the cDNA assembly algorithm of Newbler v2.3 (Roche) with default assembly parameters. An adaptor-trimming step was included in the assembly (the "-v" flag), and the "-nosplit" flag was also used to reduce the generation of extremely short contigs that might otherwise have been created. All of the raw reads generated in this study have been submitted to the NCBI Short Read Archive (Study Accession Number: SRP002610.1).
Because redundancy was observed among the singletons generated by Newbler v2.3, the singletons were reassembled using CAP3 [48], with '-z' option set to 1. Prior to this secondary assembly, the singletons were screened for adaptor sequences using both cross_match [66][67][68] and a custom python script (Casey Dunn, personal communication), We note that Newbler can also be used to produce a .fasta and corresponding .qual files of trimmed reads using the '-tr' option. The final assembly thus consists of three types of sequences: Newblerassembled sequences, cap3_contigs, and cap3_singlets, all of which were subjected to subsequent analyses.

Sequence Annotation
Sequences were first mapped against the RefSeq Protein database [ [69], downloaded from ftp://ftp.ncbi.nih.gov/ blast/db/ on April 27, 2010] using BLASTX. All BLAST searches were conducting using BLAST v2.2.23+ [70] with an e-value cut-off of 1e-10. We then used Blast2GO v1.2.7 [54] to retrieve the Gene Ontology (GO) [71] terms and their parents associated with the top 20 BLAST hits for each sequence. To avoid potentially double-counting sequences that might represent un-assembled portions of the same transcript, a custom python script ("transcripto-me_blast_summarizer.py", available at http://www.extavourlab.com/protocols/index.html) was used to identify sequences with identical top BLAST hits prior to GO annotation. If multiple sequences hit non-overlapping portions of the same top BLAST hit, we used the conservative assumption that these sequences represented unassembled portions of the same transcript, and therefore only tallied the GO terms of one of these sequences. However, if multiple sequences hit overlapping portions of the same top BLAST hit, we considered these sequences potential paralogs and retained them all. Thus, the counts of sequences in each GO term only include one sequence per top BLAST hit, unless the multiple sequences mapped to overlapping portions of the same BLAST hit. These counts were used to compare the distribution of sequences among specific GO terms between the transcriptomes of O. fasciatus and the Drosophila melanogaster genome. For this comparison, we used a precomputed GO annotation of the D. melanogaster genome [72].
The FASTA formatted transcriptome data set file was examined in TextWrangler (v. 3.1, Bare Bones Software, Inc.). Candidate genes were sought via whole gene names and, where possible, via the gene name abbreviations, while avoiding irrelevant hits. The FASTA header annotation of transcriptome sequences includes the top 20 BLASTx hits to the RefSeq database as described above.

Estimating sequencing depth
To estimate how thoroughly our sequencing efforts sampled the O. fasciatus transcriptome, eight progressively larger subsets of the reads were independently assembled. The total number of genes was then identified via BLASTX. For these smaller assemblies, reads from one plate each of normalized and non-normalized reads were combined in random order and sampled without replacement. For each assembly, we BLASTed the longest isotig of each isogroup, and all of the singletons, against the SwissProt database [ [73], downloaded from ftp://ftp.ncbi.nih.gov/blast/db/ on April 21, 2010]. We used the relatively small SwissProt database in order to reduce computation time. However, the absolute values of BLAST hits against this database are likely to be underestimates of those values that would have been obtained from a larger database such as RefSeq or nr. If multiple isotigs or contigs hit non-overlapping portions of the same top BLAST hit, only one of these sequences was counted. However, because frequent cases of identical, unassembled singletons were observed, we counted only one singleton per top BLAST hit, regardless of whether these hits overlapped or not.
We used a custom python script to calculate the ortholog hit ratio. This script, "ortholog_hit_ratio_calculator.py" is available at http://www.extavourlab.com/protocols/ index.html).

Assessing the importance of cDNA normalization
To assess the relative contribution of cDNA normalization to the quality of our assembly, the screened, raw reads from both normalized (N) and non-normalized (NN) samples were mapped against the complete assembly of all reads using the BLASTN algorithm [70] with an e-value cut-off of 1e-4. Based on these results, the Fisher's Exact Test was used to identify over-and under-represented terms in each gene list. This test was performed using Blast2GO (two-tailed, removing double IDs so that only those genes hit uniquely by either N or NN reads were considered). The BLASTN results were also investigated using text searches to find whether certain genes of interest were present in only one of the two cDNA samples.