A genomic perspective to assessing quality of mass-reared SIT flies used in Mediterranean fruit fly (Ceratitis capitata) eradication in California

Background Temperature sensitive lethal (tsl) mutants of the tephritid C. capitata are used extensively in control programs involving sterile insect technique in California. These flies are artificially reared and treated with ionizing radiation to render males sterile for further release en masse into the field to compete with wild males and disrupt establishment of invasive populations. Recent research suggests establishment of C. capitata in California, despite the fact that over 250 million sterile flies are released weekly as part of the state’s preventative program. In this project, genome-level quality assessment was performed, measured as expression differences between the Vienna-7 tsl mutants used in SIT programs and wild flies. RNA-seq was performed to provide a genome-wide map of the messenger RNA populations in C. capitata, and to investigate significant expression changes in Vienna-7 mass reared flies. Results Flies from the Vienna-7 colony showed a markedly reduced abundance of transcripts related to visual and chemical responses, including light stimuli, neural development and signaling pathways when compared to wild flies. In addition, genes associated with muscle development and locomotion were shown to be reduced. This suggests that the Vienna-7 line may be less competitive in mating and host plant finding where these stimuli are utilized. Irradiated flies showed several transcripts representing stress associated with irradiation. Conclusions There are significant changes at the transcriptome level that likely alter the competitiveness of mass reared flies and provide justification for pursuing methods for strain improvement, increasing competitiveness of mass-reared flies, or exploring alternative SIT approaches to increase the efficiency of eradication programs.


Background
The Mediterranean fruit fly, Ceratitis capitata (Wiedemann) is a polyphagous insect of great economic importance. Native to Africa, this fly is now distributed worldwide, with established populations in southern Europe, the Mediterranean region, Africa, Australia and some South American countries [1]. C. capitata was introduced to Hawaii in 1907 and considered established by 1910 [2,3]. It represents a major and ongoing threat to the continental United States, particularly in the agriculturally rich regions of Florida and California. Each year, detections of C. capitata are made, and efforts are ongoing to eliminate invasive populations in order to prevent further establishment and spread in the mainland. For example, each week, over 250 million sterile male mass reared flies are released in California to disrupt establishment of invasive populations. Despite this effort, a recent analysis of historical capture patterns suggested that C. capitata, along with other tephritid pests, is already established in California [4]. Establishment of C. capitata in areas despite mass reared sterile males release suggests a weakness or failure of this approach, possibly rooted in the quality of the mass reared flies.
The sterile insect technique (SIT) is a very important and useful tool for the control of the Mediterranean fruit fly. It makes use of laboratory reared sterile males which are released in large numbers to compete with wild invasive males for fertilizing females. In rearing flies for SIT programs, the presence of females in release colonies is undesirable as it reduces the number of males that can be reared from a set amount of diet. Additionally, release of female flies along with sterile males would confound the SIT approach through additional fruit damage caused by oviposition, and through competition with wild females for mating with the sterilized male flies [5][6][7]. For these reasons, to improve the efficiency and reduce the cost of SIT implementation, several genetic sexing strains (GSS) were developed that allow for the selection or elimination of females early in the rearing process. GSS traits include pupal color, differential development time, and temperature sensitivity [6]. Current GSS systems in C. capitata are based on a translocation that links the selectable trait with the Y sex chromosome. Among these GSS, the temperature sensitive lethal mutants (tsl) allow for the elimination of female eggs through incubation at elevated temperature. The tsl trait is often linked with a white pupal trait (wp), which allows for straightforward visual confirmation of presence of the trait during the pupal stage. Mutants carrying these traits have been developed by researchers at the International Atomic Energy Agency in Vienna Austria (IAEA) and thus the C. capitata tsl strains are named "Vienna"; several independent lines have been developed from independent transformation events, with newer lines attempting to reduce trait breakdown by creating breakpoints closer to the sexing traits (e.g. Vienna-7 and Vienna-8) [5,8].
An important factor that determines the success of SIT is the competitiveness of the mass-reared sterile males with the wild males. Sterilization of males is achieved by gamma-irradiation in doses up to 145 Gy (http://www.cdfa. ca.gov/plant/pdep/prpinfo). Irradiated males are known to be at a mating disadvantage in terms of attractiveness to wild females, courtship behavior, or acceptance of the courtship by females [9,10]. The process of irradiating flies may not only render fly males sterile, but ionizing radiation may potentially cause multiple unknown mutations in the genome. Additional disadvantages in fly quality can arise from long term inbreeding of colonies reared on highly artificial environments [11]. Insects growing in captivity for several generations would likely adapt to confinement conditions and experience inbreeding depression through changes in the frequencies and allele state of mutations present in the population, reducing the genetic diversity and fitness in the wild [12][13][14]. Other factors such as shipping to the target program location and handling of irradiated pupae may also reduce the quality of flies.
Together, these factors bring to question the overall efficiency of mass-reared SIT flies, and demonstrate the importance of investigating differences between mass reared flies and their wild counterparts from alternative perspectives. By cataloguing and quantitatively assessing these differences at the genome and transcriptome levels, measurable change can be determined, helping to understand why mass reared flies may be outperformed in the wild. In the present work the transcriptome of C. capitata was constructed from adult and pupal males, and the transcript expression profiles were compared between a wild C. capitata strain found in Hawaii and the mass-reared GSS Vienna-7 utilized in the Mediterranean Fruit Fly Exclusion Program by the California Department of Food and Agriculture (CDFA), reared in their facility in Waimanalo, Hawaii. Analyses were done with the aim of providing a basic landscape of C. capitata transcriptome, as well as to shed light on the effects of long term mass rearing of the Vienna-7 line, as well as effects of gamma-irradiation. The hypothesis is that through long-term mass rearing, selection, and irradiation, there will be consistent changes in expression patterns in Vienna-7 derived flies that are indicative of reduced quality of these flies when compared to their wild counterparts. Overall, comparative analysis of expression and genetic changes that arise through mass rearing will provide insight into developing more competitive mass reared flies in SIT programs.

Sequencing and quality filtering
For RNA-seq analysis, approximately 190 million paired 76 bp reads were obtained from Illumina Hiseq 2000 sequencing, totaling over 28 Gb of data (Table 1). These reads were evenly distributed between all of the libraries sequenced. All raw reads were submitted to the NCBI Sequence Read Archive under accession numbers SAMN02208095 -SAMN02208112 associated with BioProject PRJNA208956. From 190,186,205 raw read pairs, filtering and normalization reduced the read abundance to 17,217,414 (9.05%) reads used as input into the Trinity assembly, greatly reducing the computational requirements for assembly and avoiding complicating de Bruijn graphs with low quality or overabundant sequences.
De novo transcriptome assembly, assembly filtering and gene prediction The raw assembly of the C. capitata transcriptome was constructed using the Trinity assembly pipeline with all filtered reads from all eighteen libraries pooled into one dataset. This assembly yielded 190,958 contigs, with an N50 contig size of 2,686 bases, 64,803 contigs greater than 1000 bp, and a transcript sum of 236.1 Mb. While not all contigs produced by Trinity were likely to represent true transcripts in C. capitata, this contig set was used as a starting point for defining the transcriptome present in our sample. Filtering based off of read abundance and component isoform percentage removed 135,957 sequences, leaving 55,001 remaining. Further filtering through identification of likely coding sequence based on ORF prediction identified a total of 18,919 transcripts across 10,775 unigenes, with an N50 transcript size of 3,546 bp and transcript sum of 53.00 Mb. This filtered assembly is considered a high quality transcript set, and was used for downstream analysis.

Gene annotation
Transcripts and genes identified in the final filtered assembly were annotated using public databases as described in Methods. More than 90% of the translated transcripts had homologs in the Uniprot database (e value < 1.00E-5) ( Table 2). The resulting annotated transcript sequences were submitted to NCBI as Transcriptome Shotgun Assembly SUB276938, under Bioproject (PRJNA208956).

Pfam abundance
From the 10,776 unigenes identified in the assembly, 7,886 had a Pfam annotation based on translated sequence. These annotated genes belonged to one of 2,711 unique Pfam families. Forty six percent of these families could be placed into a Pfam Clan, for a total of 333 unique Pfam clans. The abundance of TMM normalized reads on each protein family was calculated to assess the global transcriptome composition (Additional file 1: Table S1). The protein family most abundant across unigenes was PF00089 (trypsins), followed by PF00069 (protein kinases). The protein family with the highest read counts was PF00379, annotated as chitin-binding proteins, representing structural proteins likely part of the insect cuticle; this family was over-represented in all pupae libraries of Vienna-7 and wild type. Interestingly, the second and fifth protein families with the highest read abundance corresponded to viral sequences: PF00910 (viral RNA helicases), and PF08762 (CRPV capsid protein-like family). These viral-related families showed very high counts in the Vienna-7 libraries as opposed to the wild type flies, making for up to 43% of the normalized counts in one of the irradiated libraries.  To better visualize the protein family makeup of the C. capitata transcriptome, the two top viral protein families were removed to overcome expression bias due to viral infection. Also, a family of peptidases (PF12381) with the sixth most abundant count numbers was removed as 98% of the counts were derived from a single library, clearly representing an outlier of a single replication. The absolute abundance of each of the top 40 Pfams is shown in Figure 1 together with the total of normalized counts for that family across all the libraries. The PBP_GOBP (PF01395), a family of olfactory receptors was the second most abundant family after chitinbinding proteins.
A cluster based on Spearman rank-based correlation coefficients was created to group similarly expressed protein families ( Figure 2). The heatmap representing the clustering showed marked grouped differences between the wild Hawaiian colony and Vienna-7. Differences between adults and pupae were also visible and perhaps more abundant, these subclusters are shown in Figure 1. Derived subclusters with Pfam annotations are presented in Additional file 2: Figure S1 and Additional file 3: Figure S2, Figures 3 and 4. A second clustering was run using only untreated flies from both the artificially reared colony and the wild Hawaiian population at either adult or pupal stage. These clusters show that at least 50% of the protein family composition differs in abundance between both types of flies at both stages tested (Figure 5a and b). Finally, irradiated and nonirradiated fly libraries (Vienna-7 flies only) were clustered together, using the same algorithm; most of the segregation was observed between the growing stages, while differences between irradiated and non-irradiated samples were minimal and mostly concentrated in the pupal stage ( Figure 6). In addition, in this last clustering, we observed that the differences in protein family abundance between irradiated and non-irradiated samples were inconsistent across replication, pointing to the randomness of the irradiation effects on the genome.

Fold change
A statistically based measurement is necessary to identify the most potentially differentially regulated genes between repeated sequenced mRNA libraries. The overall differences between the two types of flies (Wild vs. Vienna-7), between adults and pupae, and between irradiated and nonirradiated flies from the Vienna-7 colony were tested. For these three effects a total of 5,619 unigenes were identified as statistically differentially regulated in at least one of the comparisons (FDR corrected p-value < 0.05). Not surprisingly, the comparison between adults and pupae yielded the highest number of differentially regulated transcripts (4,634 genes), followed by the comparison between the wild Figure 1 Absolute abundance of unigenes across all libraries and total normalized counts on each library for the 40 Pfams with highest count values. The unigenes identified in the sequencing fell into one of 2,711 unique Pfam families. TMM normalized read counts on each protein family were calculated. Viral protein families were removed to overcome expression bias. Also, a family of peptidases (PF12381) with the sixth most abundant normalized count numbers was removed as 98% of the counts were derived from a single library, clearly representing an outlier of a single replication. Chitin binding proteins were the most abundant (PF 00379) followed by a family of olfactory receptors (PF01395). and Vienna-7 (968 genes). The comparison between irradiated and non-irradiated flies showed only 17 genes significantly differentially regulated. Additionally, six independent pair wise comparisons between replicated libraries were tested ( Table 3); each of these yielded a range of differentially expressed genes, varying from 148 genes (Vienna-7 irradiated pupae vs. non-irradiated pupae) to over four thousand genes (wild adult vs. wild pupae).

Gene ontology term enrichment of differentially expressed genes
Gene Ontology (GO) classification coupled with term enrichment analysis was used to group statistically differentially regulated genes by biological function allowing for a better understanding of the biological differences between the flies used in the experiment. The term enrichment results for the comparison between adults and pupae showed highly significant enrichment on up-regulated genes with terms related to primary metabolism, including amino acid, carbohydrate, lipid and general anabolism and catabolism processes (e.g. GO (Table 4). To better understand the differences between colonies, GO term enrichment was also run on the list of significantly differentially regulated unigenes obtained from the separate pair wise comparison between Vienna-7 pupae only and wild pupae only (3,094 unigenes) and between Vienna-7 adults only and wild adults only (1,649 unigenes). Vienna-7 pupae compared to wild pupae showed several stress-related terms with higher abundance in the Vienna-7 flies, including response to cold (GO:0009409, GO:0009631), disease response (GO:000 9617), and oxidative stress (increase in antioxidant machinery in response to reactive oxygen species originated by stress)-related terms (such as electron transport and  oxidative phosphorylation: GO:0022900, GO:0022904, GO:0042773, GO:0006119, and GO:0042775). In addition, several viral-related terms were present, four of them in the top 40 most significant terms, with an additional "gene silencing" term (GO:0019079, GO:0019058, GO:0016032, GO:0022415, GO:0031047). Terms with less abundance in the Vienna-7 pupae than in the wild corresponded to those in signaling and neurological processes (e.g. GO:0007154, GO:0023052, GO:0044700, GO:0007165, GO:0050877) ( Table 5). Comparing adults between both colonies showed more abundance of a vast amount of nucleic acid metabolism, DNA replication and cell cycle-related terms in the Vienna-7 colony (Table 6). Interestingly, the wildtype colony showed a distinctly high enrichment in terms related to response to light stimuli including phototransduction processes, rhodopsin signaling, detection of visible light and the like. A term related to perception of chemical stimulus was also found among the top highly significant GO terms, suggesting impact on both the visual and olfaction systems. This last comparison also showed higher abundance of motility-related terms in the wild fly compared to Vienna-7, these terms including muscle cell differentiation, striated muscle cell development and differentiation, locomotion and motility ( Table 6).
The term enrichment for the overall comparison between irradiated and non-irradiated samples in the Vienna-7 colony showed high significance for terms related to hormonal synthesis and related compounds in up-regulated genes (i.e. GO:0016126, GO:0008299, GO:0006694), suggesting an increased activity of these pathways as a result of irradiation. On the other hand, genes with higher expression values in the non-irradiated  samples resulted in a term enrichment of high significance for several viral-related terms (e.g. GO:0019058, GO:00 22415, GO:0016032, GO:0)006278 (Additional file 5: Table S3). Because of our special interest in the effects of irradiation on the Vienna-7 colony, the differential expression between irradiated and non-irradiated samples from adults and pupa were separated and the GO term enrichment for these pairwise sample comparisons were analyzed. Terms related to virus/transposon sequences were found enriched only in the down-regulated irradiated vs. non irradiated pupae dataset, suggesting that irradiation somehow affects the presence of virus or transposon mobility in the Vienna-7 colony, but only when applied to pupae. Additionally, two GO terms related to reproduction were among the top 20 found downregulated in pupae irradiated vs. non-irradiated flies (i.e. GO:0022414 and GO:0000003), perhaps pointing to the desired sterilization results expected from the treatment (Additional file 6: Table S4). Up-regulated terms in irradiated vs. non irradiated pupae samples were highly enriched for DNA repair-related mechanism (e.g. GO: 0006974, GO:0006281, GO:0006302), suggesting DNA damage (Additional file 6: Table S4). In adults, irradiation effects seemed to be very random, with biological processes affected ranging from ion transport to amino acids metabolism (Additional file 7: Table S5).

Discussion
Whole transcriptome analyses are of growing importance for the understanding of biological processes that take place in organisms of interest. We have generated a de novo transcriptome assembly of the pupal and adult stages of male Mediterranean fruit fly, C. capitata, using paired-end RNA-seq analysis. The assembly identified 18,919 transcripts and 10,775 unigenes which were annotated and used for a descriptive and comparative analysis of the mRNA composition of Hawaiian wild flies and mass reared GSS Vienna-7 flies at adult and pupae stages, before and after ionizing radiation treatment.
With the data, we were able to delineate a general protein family-based composition of C. capitata, as well as identify specific genes and protein families present in specific libraries. In addition, we have built and made public extensive data obtained from the sequencing. This data can be used for generation of new hypotheses regarding different aspects on the biology of this important pest as well as a resource for developing assays for assessing the quality of mass reared flies.  CN_P1  CN_P2  CN_P3  W_P1  W_P2  W_P3  CN_A1  CN_A2  CN_A3  W_A1  W_A2  W_A3 A B Figure 5 Clustering of identified Pfams present on untreated flies from the artificially reared colony and the wild Hawaiian population. Figure 5a displays pupal stage and Figure 5b displays adult stage. At least 50% of the protein family composition differs in abundance between both types of flies at both stages tested.
Differences between adults and pupae across both types of flies showed that several developmental processes were significantly overrepresented in pupae as compared to the adult. Also, the four most significantly enriched GO terms were related to cell adhesion, a process which is known to play a key role in development and tissue morphogenesis [15,16]. While these results are not surprising, they provide a means of verification on the quality of the experiment and the data analysis performed.
Grouping absolute expression values by Pfam and clustering the data, we noted considerable differences between the artificially reared colonies and the wild Hawaiian flies. Furthermore, by contrasting the Vienna-7 with the Hawaiian wild flies' relative transcript levels and statistical significance, we have identified the presence of several viral-related sequences among the sequenced transcripts, which suggests the existence of a virus in the mass reared colony; this is supported by the occurrence of several stress-related transcripts. Preliminary analyses showed that these sequences may belong to a picornavirus; however, it is possible that these transcripts are an indication of increased or activated transposon activity in the Vienna-7 colony. That the possible presence of virus is detrimental for the quality of the reared colonies needs to be tested, as several endogenous viruses have been found in different strains of C. capitata and are commonly found in other fruit flies [17][18][19]. Nonetheless, this is an interesting point to investigate and simple measures could be employed to monitor viral levels in mass reared colonies and measures put forth to reduce the impact of virus on fly quality. Additionally, by comparing the Vienna-7 colony to wild pupae and adults, the differential expression analyses showed marked down-regulation of signaling and neurological processes. In adults, two important sensory mechanisms were observed to be down-regulated in the Vienna-7 flies: light response processes and chemoreception. A third marked difference was observed in genes related to muscle development, muscle differentiation and locomotion, which were also reduced in abundance in the Vienna-7 colonies compared to the wild Hawaiian flies. Given the number and consistency of GO terms in the above mentioned categories, we hypothesize that Vienna-7  CI_A1  CI_A2  CI_A3  CI_P1  CI_P2  CI_P3  CN_A1  CN_A2  CN_A3  CN_P1 CN_P2 CN_P3 Figure 6 Clustering of identified Pfams in irradiated and non-irradiated fly libraries (Vienna-7 flies only); Most of the segregation was observed between the growing stages, while differences between irradiated and non-irradiated samples were minimal and mostly concentrated in the pupal stage.
flies may have reduced fitness and competitiveness due to impaired response to light stimuli as a consequence of mass rearing in artificial conditions under low/artificial light, this impairment being reflected at the signaling (perception and signal transduction) and neuronal levels (responsive mechanism). Vienna-7 may also have reduced host and mate finding ability due to decreased chemical sensory development. In addition muscular development is diminished, reducing movement or flight ability in the flies as well as having potential impacts on longevity. As for irradiation effects, Pfam abundance clustering demonstrated the marked randomness of the irradiation effects, while few genes may be consistently affected by the treatment. This is not to say that irradiation does not have a deleterious effect on the fly, just that expression level changes are not consistent between replicate treatments. The study showed high induction of statistically differentially regulated genes in GO terms related to DNA repair mechanisms, demonstrating the DNA damage that occurs in the irradiated samples. Single and double stranded DNA damage caused by ionizing gamma radiation has been extensively reported in mammals [20][21][22], and is very likely one of the effects in flies. The data also showed that irradiation may somehow affect the presence of virus in pupae. One possible explanation of this is that the proportion of the population that is virus infected is weaker than other flies, and thus do not survive the impact of irradiation. In general, we can conclude that long-term artificial rearing of flies, added to the effect of ionizing radiation, may affect several specific pathways and biological processes in the fly, all of which may translate into reduced quality of individuals released for SIT.

Conclusions
The California Department of Food and Agriculture (CDFA) reports a weekly release of approximately 125,000 sterile flies per square mile over more than 1,200 square miles by the USDA-CDFA Mediterranean Fruit Fly Exclusion Program, at an annual cost of approximately 15 million dollars for preventive purposes (http://www.cdfa.ca.gov/plant/pdep/prpinfo/pg1.html). Under that perspective, it is highly desirable that the released sterile flies are of the best possible quality and with a high rate of competitiveness in the field. The basic principles under which current SIT methods are applied at this time (long term captivity and gamma irradiation) may limit its potential utility, and alternatives to SIT or modifications to it should be considered. One such potential approach involves transgenesis, a concept already tested in C. capitata through the insertion of a tetracyclinerepressible transactivator [23][24][25], and in Olive fly (Bactrocera olea) with the use of a dominant, female-specific lethal genetic system [26]. Another alternative is population replacement. This method has been widely studied in the control mosquito-borne diseases with insects carrying antipathogen genes and thus unable to transmit disease [27]. Similar approaches could be utilized to produce only male progeny from mated females, and subsequent mating with these males will continue spreading this trait, effectively allowing the wild population to serve as the rearing mechanism. Additionally, in population replacement approaches, repeated mass release of flies would not be required, and the trait is pushed through the population. Those males carrying the trait are wild derived, and thus there is not expected to be a reduced fitness cost compared to flies derived from the wild population. As the International Atomic Energy Agency has expressed its interest in developing alternatives to gamma irradiation used for SIT programs [28], recent advances in gene silencing through RNAi methods should also be considered which could induce sterilization. These alternative approaches would have reduced impact on the quality of the fly as irradiation is not performed, which has been shown to impact courtship behavior. Overall, the analysis presented here provides the foundation for beginning to evaluate mass reared flies at a genomic level, and to develop tools for better monitoring the quality of these flies over generations of colony production.

Methods
Sampling of Ceratitis capitata Vienna-7 colony and Hawaiian wild fly C. capitata Vienna-7 derived mass reared flies were obtained from the California Department of Agriculture medfly mass rearing facility in Waimanalo, Hawaii. Newly formed pupae were collected (~40 ml) in triplicate before irradiation. In addition, triplicate samples were subjected   to gamma-irradiation at the USDA-APHIS irradiation facility (140 Gy) following the standard methods used for mass reared flies being shipped to California for SIT release. Both irradiated and non-irradiated flies were transferred to the USDA-ARS Pacific Basin Agricultural Research Center (PBARC) for use in this study. Pools of five pupae were flash frozen in liquid nitrogen from each replicate approximately 1 day prior to adult emergence for RNA extraction and sequencing. A subset of pupae was placed in emergence cages and adult male flies were  In silico library normalization and de novo transcriptome assembly De novo reconstruction of the transcriptome was done utilizing the Trinity package [29]. Raw reads obtained were first normalized to reduce redundant read data and discard read errors using Trinity's normalize_by_kmer_ coverage.pl script with a kmer size of 25 and maximum read coverage of 30. The Trinity de novo RNAseq assembly pipeline (Inchworm, Chrysalis, and Butterfly) was executed using default parameters, implementing the -REDUCE flag in Butterfly and utilizing the Jellyfish k-mer counting approach [30]. Assembly was completed in 3 hours and 13 minutes on a compute node with 32 Xeon 3.1 GHz cpus and 256 GB of RAM on the USDA-ARS Pacific Basin Agricultural Research Center Moana compute cluster (http://moana.dnsalias.org).

Assembly filtering and gene prediction
The output of the Trinity pipeline is a FASTA formatted file containing sequences defined as a set of transcripts, including alternatively spliced isoforms determined during graph reconstruction in the Butterfly step. These transcripts are grouped into gene components which represent multiple isoforms across a single unigene model. While many full length transcripts were expected to be present, it is likely that the assembly also consisted of erroneous contigs, partial transcript fragments, and noncoding RNA molecules. This collection of sequences was thus filtered to identify contigs containing full or near full length transcripts or likely coding regions and isoforms that are represented at a minimum level based off of read abundance. Pooled non-normalized reads were aligned to the unfiltered Trinity.fasta transcript file using bowtie 0.12.7 [31], through the alignReads.pl script distributed with Trinity. Abundance of each transcript was calculated using RSEM 1.2.0 (RNA-Seq by Expectation Maximization) [32], utilizing the Trinity wrapper 'run_RSEM.pl'. Through this wrapper, RSEM read abundance values were calculated on a per-isoform and per-unigene basis. In addition, percent composition of each transcript component of each unigene was calculated. From these results, the original assembly file produced by Trinity was filtered to remove transcripts that represent less than 5% of the RSEM based expression level of its parent unigene or transcripts with transcripts per million (TPM) value below 0.5.
Coding sequence was predicted from the filtered transcripts using the 'transcripts_to_best_scoring_ORFs.pl' script distributed with the Trinity software from both strands of the transcripts. This approach uses the software Transdecoder (http://transdecoder.sourceforge.net/) which first identifies the longest open reading frame (ORF) for each transcript and then uses the 500 longest ORFs to build a Markov model against a randomization of these ORFs to distinguish between coding and non-coding regions. This model is then used to score the likelihood of the longest ORFs in all of the transcripts, reporting only those putative ORFs which outscore the other reading frames. Thus, the low abundance filtered transcript assembly was split into contigs that contain complete open reading frames, contigs containing transcript fragments with predicted partial open reading frames, and contigs containing no ORF prediction. The resulting retained transcript sets containing transcripts above the abundance threshold and containing a likely open reading frame were merged and subjected to annotation and utilized in subsequent analysis.

Gene annotation
The filtered transcripts were annotated using the UniRef SwissProt database, Pfam-A, eggNOG, and gene ontology utilizing a beta release of the Trinotate annotation pipeline. The filtered transcript set was first subjected to blastp alignment against the UniRef-Swissprot database (downloaded 12/11/2012) using blast-2-2-26+ with e-value cutoff of 1.0E-5. In addition, protein domains were identified through searching the Pfam_A database using HMMER 3.0. Signal peptides and transmembrane regions were annotated with SignalP 4.1 and TMHMM 2.0, respectively. The resulting outputs were loaded into a Trinotate database where eggNOG and Gene Ontology terms were added and the resulting annotation set was exported as a delimited file for further analysis. In addition, transcripts were subjected to blastx alignment against the Drosophila melanogaster protein set (Flybase.org, Dmel-r5.44) [33,34] and Uni-Ref90 using an e-value cutoff of 1e-5 to identify homologous genes in these databases.

Read library mapping and expression analysis
Because the Trinity assembler is able to accurately predict splice isoforms, gene and isoform expression quantification was performed using RSEM, which is particularly well suited to work with multiple isoforms where the same read may map to multiple sequences. The filtered transcript set described above (containing transcripts passing a minimum read abundance cutoff and containing predicted full or partial ORFs) was used for analysis to avoid skewing expression quantification results with non-coding and fragmented data. Reads from each sequencing library were independently mapped to this high confidence transcriptome assembly using bowtie (v 0.12.7) using the align-Reads.pl script distributed with Trinity. The resulting bam formatted mapping files were sorted and used to produce fragment abundance estimation by RSEM. Transcript abundance values were produced as expected read count at both unigene and individual transcript isoform level.

Absolute expression analysis by Pfam
Read count values for unigenes were normalized using the trimmed mean of M values (TMM) method and transformed into fragments per feature kilobase per million reads mapped (FPKM) for each gene and the individual isoforms that compose each gene for each developmental library using scripts provided by Trinity. TMM-FPKM normalized read counts across genes in the same Pfam family were added together to assess family abundance. For clustering, Pfams with less than two gene members and normalized counts less than 50 in at least one library were removed. Pfams were clustered using Spearman rank correlation coefficients with complete linkage as distance measurement using Cluster v3.0 software [35,36]. Clusters were visualized in a heatmap using Java TreeView (http:// jtreeview.sourceforge.net).

Differential expression analyses
Transcript abundance values were compared among samples using the EdgeR package (bioconductor) to assess statistical significance [37]. The general linear models capability (glm EdgeR) was used to account for the multiple factors in the experiment (i.e. colony, developmental stage and irradiation treatment). The non-normalized expected counts for each gene, calculated by RSEM in the previous step, were used as input for EdgeR; contrasts were built between the three main overall effects and between the six pairwise comparisons of biological relevance for a total of nine comparisons. The data was normalized between each pair of groups or samples with the TMM method using the calcNormFactors function to account for the effect of RNA composition. Out of the 10,775 genes identified, a total of 9,920 genes were used for differential analyses after filtering out genes with count per million mapped reads less than 2 in any of the three replicate samples.

Gene ontology term enrichment
The bioconductor package "topGO" was used to run a GO term enrichment analysis on the data [38]. Genes declared significantly differentially regulated on each of the treatment-wise comparisons and mapped with GO terms were used as predefined lists of genes of interest. Terms with less than 5 annotated genes were cut-out from the GO hierarchy during the enrichment using the nodSize option, and only the "biological processes" ontology was used. GO terms were then subjected to the classic Fisher's exact test included in the topGO package.

Availability of supporting data
All raw reads were submitted to the NCBI Sequence Read Archive under accession numbers SAMN02208095 -SA MN02208112 associated with BioProject PRJNA208956.

Additional files
Additional file 1: Table S1. Abundance of TMM normalized reads on each protein family.
Additional file 2: Figure S1. Sub-cluster of identified Pfams derived from Figure 2. Differences between adults and pupae libraries across both types of flies. Subcluster shows Pfams with higuer abundance in the adult stage.
Additional file 3: Figure S2. Sub-cluster of identified Pfams derived from Figure 2. Differences between adults and pupae libraries across both types of flies. Subcluster shows Pfams with higher abundance in pupae stage.