De novo characterization of a whitefly transcriptome and analysis of its gene expression during development

Background Whitefly (Bemisia tabaci) causes extensive crop damage throughout the world by feeding directly on plants and by vectoring hundreds of species of begomoviruses. Yet little is understood about its genes involved in development, insecticide resistance, host range plasticity and virus transmission. Results To facilitate research on whitefly, we present a method for de novo assembly of whitefly transcriptome using short read sequencing technology (Illumina). In a single run, we produced more than 43 million sequencing reads. These reads were assembled into 168,900 unique sequences (mean size = 266 bp) which represent more than 10-fold of all the whitefly sequences deposited in the GenBank (as of March 2010). Based on similarity search with known proteins, these analyses identified 27,290 sequences with a cut-off E-value above 10-5. Assembled sequences were annotated with gene descriptions, gene ontology and clusters of orthologous group terms. In addition, we investigated the transcriptome changes during whitefly development using a tag-based digital gene expression (DGE) system. We obtained a sequencing depth of over 2.5 million tags per sample and identified a large number of genes associated with specific developmental stages and insecticide resistance. Conclusion Our data provides the most comprehensive sequence resource available for whitefly study and demonstrates that the Illumina sequencing allows de novo transcriptome assembly and gene expression analysis in a species lacking genome information. We anticipate that next generation sequencing technologies hold great potential for the study of the transcriptome in other non-model organisms.


Background
The whitefly Bemisia tabaci (Gennadius) is a genetically diverse complex containing some of the most destructive invasive pests of many ornamental and glasshouse crops worldwide [1,2]. The species complex colonizes more than 600 different species of plants, transmits many plant viruses, feeds on phloem sap, and promotes the growth of damaging fungi on honeydew excretions deposited on plants [3][4][5][6]. Recent phylogenetic analysis combined with a pattern of reproductive isolation among genetic groups within B. tabaci indicate that the complex contains at least 24 cryptic species, some of which have been referred to as "biotypes" in the last 20 years [7,8]. As the separa-tion at the species level within the B. tabaci complex is yet to be fully resolved, we have retained the commonly used term "biotype" to link this study with existing literature. The most predominant and damaging biotypes of B. tabaci are the B and Q biotypes [9,10]. While the former is known for its high fitness parameters, the Q biotype whitefly has a unique ability to develop and maintain high levels of resistance to major classes of insecticides owing to biological and genetic factors [11,12].
Despite its global importance, genomic sequence resources available for the whitefly are scarce, especially for the Q biotype. Currently (March 30th, 2010), there are about 9110 EST and 762 nucleotide sequences available on NCBI for the B biotype whitefly, and only 683 nucleotide sequences have been deposited for the Q biotype whitefly. The previous EST sequencing efforts for the B biotype whitefly have allowed the development of smallscale microarrays for gene expression analysis in the con- Full list of author information is available at the end of the article text of insecticide resistance and parasitoid-whitefly interactions [13][14][15]. While these studies have highlighted the utility of cDNA sequencing for candidate gene discovery in the absence of a genome sequence, a comprehensive description of the genes expressed in insecticideresistant Q biotype whitefly remains unavailable.
Over the past several years, the next generation sequencing technology has emerged as a cutting edge approach for high-throughput sequence determination and this has dramatically improved the efficiency and speed of gene discovery [16,17]. For example, the Illumina sequencing technology is able to generate over one billion bases of high-quality DNA sequence per run at less than 1% of the cost of capillary-based methods [18]. Furthermore, this next generation sequencing has also significantly accelerated and improved the sensitivity of gene-expression profiling and, is expected to boost collaborative and comparative genomics studies [19,20]. Previously, Illumina sequencing of transcriptomes for organisms with completed genomes confirmed that the relatively short reads produced can be effectively assembled and used for gene discovery and comparison of gene expression profiles [21,22]. Despite its obvious potential, next generation sequencing methods have not yet been applied to whitefly research.
In this study, we generated over three billion bases of high-quality DNA sequence with Illumina technology and demonstrated the suitability of short-read sequencing for de novo assembly and annotation of genes expressed in a eukaryote without the prior genome information. In a single run, we identified 168,900 distinct sequences including hundreds of insecticide target and metabolism genes. Furthermore, we compared the gene expression profiles of whiteflies during different developmental stages using a digital gene expression system. The assembled, annotated transcriptome sequences and gene expression profiles provide an invaluable resource for the identification of whitefly genes involved in insecticide resistance, development and virus transmission.

Illumina sequencing and reads assembly
To obtain an overview of the whitefly gene expression profile at different developmental stages, a cDNA sample was prepared from whitefly eggs, nymphs, pupae and adults and sequenced using the Illumina sequencing platform. After cleaning and quality checks, we obtained 43 million of 75 bp reads from one plate (8 lanes) of sequencing. To facilitate sequence assembly, these raw reads were randomly clipped into 21-mers for sequence assembly using SOAPdenovo software [23]. These short 21-mers were assembled, resulting in 4,274,766 contigs ( Table 1). The mean contig size was 40 bp with lengths ranging from as small as 22 bp to as large as 2,189 bp. The size distribution of these contigs is shown in Additional file 1. Using paired-end joining and gap-filling, the contigs were further assembled into 170,115 scaffolds with a mean size of 266 bp including 4,588 scaffolds larger than 1,000 bp (Table 1). After clustering using TGICL software [24], the 170,115 scaffolds generated 168,900 distinct sequences with 1,206 clusters (mean size: 372 bp) and 167,694 singletons (mean size: 265 bp) ( Table 1). In this manuscript, the singleton means a scaffold that matches no other scaffolds and the distinct sequences include all the clusters and singletons generated from scaffolds. The size distribution of these distinct sequences is shown in Additional file 1. To demonstrate the quality of sequencing data, we randomly selected 6 singletons and designed 6 pairs of primers for RT-PCR amplification. In this analysis, 5 out of 6 primer pairs resulted in a band of the expected size and the identity of all five PCR products were confirmed by Sanger sequencing (data not shown).

Annotation of predicted proteins
For annotation, distinct gene sequences were first searched using BLASTx against the non-redundant (nr) NCBI nucleotide database using a cut-off E-value of 10 -5 . Using this approach, 27,290 genes (16.2% of all distinct sequences) returned an above cut-off BLAST result (see Additional file 2). Because of the relatively short length of distinct gene sequences (mean size of 266 bp) and lack of genome information in whitefly, most of the 168,900 assembled sequences could not be matched to known genes (83.8%). Figure 1 indicates that the proportion of sequences with matches in nr databases is greater among the longer assembled sequences. Specifically, a 97% of match efficiency was observed for sequences longer than 2,000 bp, whereas the match efficiency decreased to about 48% for those ranging from 500 to 1,000 bp and to 11.5% for sequences between 100 to 500 bp ( Figure 1). The E-value distribution of the top hits in the nr database showed that 22% of the mapped sequences have strong Sequences with E-value < 10 -5 27,290 homology (smaller than 1.0E -50 ), whereas 78% of the homolog sequences ranged between 1.0E -5 to 1.0E -50 (Figure 2A). The similarity distribution has a comparable pattern with 18% of the sequences having a similarity higher than 80%, while 82% of the hits have a similarity ranging from 18% to 80% ( Figure 2B). For species distribution, 20% of the distinct sequences have top matches (first hit)trained with sequences from the hemipteran species pea aphid (Acyrthosiphon pisum), followed by the body louse (Pediculus humanus corporis) (15%), red flour beetle (Tribolium castaneum) (12%) and honey bee (Apis mellifera) (10%) ( Figure 2C). There were 126 distinct sequences with the highest homology to genes from B. tabaci and the majority of these hits were to cytochrome P450 and heat shock protein (data not shown).

Gene ontology (GO) and clusters of orthologous groups (COG) classification
GO assignments were used to classify the functions of the predicted Q biotype whitefly genes. Based on sequence homology, 7,330 sequences can be categorized into 52 functional groups ( Figure 3). In each of the three main categories (biological process, cellular component and molecular function) of the GO classification, 'Cellular process', "Cell part" and "Binding" terms are dominant respectively; however, we did not find any genes in the clusters of 'Symplast', 'Chemoattractant' and 'Chemorepellent'. We also noticed a high-percentage of genes from categories of 'Metabolic process', 'Catalytic' and 'Pigmentation' and only a few genes from terms of 'Cell killing' and 'Nutrient reservoir' (Figure 3). To further evaluate the completeness of our transcriptome library and the effectiveness of our annotation process, we searched the annotated sequences for the genes involved in COG classifications. In total, out of 27,290 nr hits, 7,790 sequences have a COG classification ( Figure 4). Among the 25 COG categories, the cluster for 'General function prediction' represents the largest group (1778, 22.8%) followed by 'Replication, recombination and repair' (897, 11.5%) and 'Translation, ribosomal structure and biogenesis' (842, 10.8%). The following categories: extracellular structures (2, 0.025%); nuclear structure (8, 0.1%) and cell motility (22, 0.28%), represent the smallest groups ( Figure 4). To identify the biological pathways that are active in the Q biotype whitefly, we mapped the 27,290 annotated sequences to the reference canonical pathways in Kyoto Encyclopedia of Genes and Genomes (KEGG) [25]. In total, we assigned 11,104 sequences to 214 KEGG pathways. The pathways with most representation by the unique sequences were starch and sucrose metabolism (553 members); purine metabolism (458 members) and galactose metabolism (183 members). These annotations provide a valuable resource for investigating specific processes, functions and pathways during whitefly research.

Detection of sequences related to the insecticide targets and metabolism
As the Q biotype whitefly is highly resistant to insecticides, we analyzed the sequences related to insecticide targets and metabolism and compared our data with sequences from NCBI nucleotides and EST database [13]. As shown in Table 2, we identified a number of sequences which are homologous to enzymes related to insecticide   Figure 1 Effect of query sequence length on the percentage of sequences for which significant matches were found. The proportion of sequences with matches (with a cut-off E-value of 1.0E-5) in NCBI nr databases is greater among the longer assembled sequences. metabolic resistance, such as carboxylesterase and glutathione S-transferase; and insecticide targets, such as acetylcholinesterase, GABA receptor and nicotinic acetylcholine receptor. Among them, the GABA receptor sequences were identified from the whitefly for the first time. In total, we obtained 28 nicotinic acetylcholine receptor related sequences. After removing redundant sequences, we identified 9 different nicotinic acetylcholine receptor subunits (Table 3). As an insect genome generally has 10 to 11 nicotinic acetylcholine receptor subunits [26], our database represents a nearly complete collection of such genes in whitefly. This finding further demonstrates the quality of our sequencing data and will certainly facilitate Q biotype whitefly insecticide resistance research. It was previously observed that the resistance to organophosphates in the B biotype B. tabaci was associated with a point mutation (Phe392Trp) in ace1type acetylcholinesterase [27]. Therefore, we compared the protein sequence of the Q biotype whitefly ace1-type acetylcholinesterase with that of the B biotype. Sequence alignment shows that the Q biotype whitefly acetylcholinesterase also contains the Phe392Trp mutation (data not shown). Furthermore, three additional mutations were observed in the N-terminal region of the Q biotype whitefly ace1-type acetylcholinesterase (Additional file 3) which might confer further organophosphate resistance.
However, the relevance of these additional mutations requires further investigation.

Digital gene expression (DGE) library sequencing
An immediate application of our transcriptome sequence data includes gene expression profiling under different physiological conditions. Using the DGE method, which generates absolute rather than relative gene expression measurements and avoids many of the inherent limitations of microarray analysis, we analyzed the gene expression variations during whitefly development. We sequenced three DGE libraries: egg & nymph, pupa and adult, and generated between 2.8 and 3.9 million raw tags for each of the three samples (Table 4). After removing the low quality reads, the total number of tags per library ranged from 2.7 to 3.8 million and the number of tag entities with unique nucleotide sequences ranged from 84,583 to 134,042 (Table 4). For mRNA expression, heterogeneity and redundancy are two significant characteristics. While the majority of mRNA is expressed at low levels, a small proportion of mRNA is highly expressed. Therefore, the distribution of tag expression was used to evaluate the normality of the DGE data. As shown in Figure 5, the distribution of total tags and distinct tags over different tag abundance categories showed similar patterns for all three DGE libraries. However, under the dis- vi ri o n p a rt a n tio xi d a n t a u xi lia ry tr a n sp o rt p ro te in b in d in g ca ta ly tic ch e m o a tt ra ct a n t ch e m o re p e lle n t e le ct ro n ca rr ie r e n zy m e re g u la to r m e ta llo ch a p e ro n e m o le cu la r tr a n sd u ce r n u tr ie n t re se rv o ir p ro te a so m e re g u la to r p ro te in ta g st ru ct u ra l m o le cu le tr a n sc ri p tio n re g u la to r tr a n sl a tio n re g u la to r tr a n sp o rt e r biological_process cellular_component molecular_function Percent of genes

Comparison of GO classification
Number of genes tribution of total tags, high-expression tags with copy numbers larger than 100 are in absolute dominance whereas low-expression tags with copy numbers smaller than 10 occupy the majority of distinct tag distributions ( Figure 5).  (Table 4).
To confirm whether the number of detected genes increases proportionally to sequencing amount (total tag number), a saturation analysis was performed. Additional file 5 shows a trend of saturation where the number of detected genes almost ceases to increase when the number of reads reaches 2 million. Next, the level of gene expression was determined by calculating the number of unambiguous tags for each gene and then normalizing this to the number of transcripts per million tags (TPM). As summarized in Figure 6, the results show that mRNA transcribed from the major kinds of genes is represented in fewer than ten copies and only a small proportion of genes are highly expressed. In the Additional file 6, we have listed the top 20 most abundantly expressed genes from adult whiteflies. While analyzing the five most abundantly expressed genes with annotation, we noticed the presence of glyceraldehyde-3-phosphate dehydrogenase, translation elongation factor 2, cytochrome P450 CYP6CX1v2 and tubulin alpha chain in all the three libraries.

Distribution of DGE tags on genes
DGE is a SAGE-based transcript profiling method. In theory, all the sequenced tags in the libraries should be mapped to the last NlaIII restriction site on the sense strand at the 3'-end of the transcript. However, like the results in DGE analysis of mouse and zebrafish transcrip- tomes [20,22], we found that approximately 60% of the tags mapped to the classical site (data not shown). This is probably due to the incomplete NlaIII digestion during library preparation and the usage of alternative polyadenylation and/or splicing sites [28]. Detection of multiple tags with high count numbers for a predicted transcript indicates the reliability of the transcript sequence [22]. Furthermore, the information obtained from multiple tags per transcript is valuable for the verification of ab initio gene predictions.

Changes in gene expression profile among the different developmental stages
To identify genes showing a significant change in expression during different developmental stages, the differentially expressed tags between two samples were identified by an algorithm developed by Audic et al. [29]. A total of 12,257 significantly changed tag entities were detected between the adult and pupa whitefly libraries. Those tags were mapped to 3,318 genes with 822 genes up-regulated and 2,496 genes down-regulated ( Figure 7A & Additional file 7). Between pupa and egg & nymph libraries, a total of 2,197 differentially expressed genes were detected with  . We found that more than 60% of the highly regulated genes are orphan sequences -no homologues found in the NCBI database. This might be consistent with the expectation that Hemiptera development expresses genes with no homologues in other species.

Functional annotation of differentially expressed genes
To understand the functions of differentially expressed genes, we mapped all the genes to terms in KEGG database and, compared this with the whole transcriptome background, with a view to search for genes involved in metabolic or signal transduction pathways that were significantly enriched. Among all the genes with KEGG pathway annotation, 284 differentially expressed genes were identified between pupa and egg & nymph libraries. Notably, specific enrichment of genes was observed for pathways involved in energy and lipid metabolism, such as the citrate cycle, phenylpropanoid biosynthesis and fatty acid metabolism. In the citrate cycle pathway, all the differentially expressed genes were significantly upregulated in pupa, including pyruvate dehydrogenase, isocitrate dehydrogenase, pyruvate carboxylase and citrate synthase. This probably indicates that the metabolic rate of whitefly at pupa stage is higher than that of whitefly at egg & nymph stages. Between adult and pupa stages, a total of 514 differentially expressed genes were found and a specific enrichment of starch and sucrose metabolism pathways was noticed. Interestingly, we also found the enrichment of differentially expressed genes in the glycosphingolipid biosynthesis pathway. To further evaluate our DGE data, we analyzed the expression level of vitello-genin which is an egg yolk gene highly expressed in the adult female [30]. As shown in Figure 8A-B, both vitellogenin 1 and vitellogenin 2 were highly expressed in the adult whitefly. However, both genes were expressed at significantly lower levels in the pupa stage and only vitellogenin 2 is present during the egg & nymph stage. For ecdysone receptor and ecdysone-inducible protein E75, we noticed a much high expression level in nymphs and pupas than that in adult whitefly ( Figure 8C-D). This is consistent with their roles in the orchestration of development during molts and metamorphosis [31]. As a control, the translation elongation factor 2 is equally expressed during the three developmental stages ( Figure  8E).

Conclusion
With this study, we present a rapid and cost-effective method for transcriptome and DGE analysis using Illumina sequencing technology. The single run produced more than 168,900 distinct sequences with 27,290 sequences having an above cut-off BLAST result. These findings provide a substantial contribution to existing sequence resources for the whitefly and will certainly accelerate insecticide resistance research in the Q biotype whitefly. To our knowledge, this is the first publication using Illumina sequencing technology for an organism without prior genome annotation. Additionally, we have demonstrated the feasibility of using Illumina sequencing based DGE system for gene expression profiling and have provided new leads for functional studies of genes involved in whitefly development.

Insect rearing and sample preparation
Stock cultures of Q biotype whiteflies (Bemisia tabaci) (cytochrome oxidase I sequence GenBank accession no: DQ473394) were maintained on cotton Gossypium hirsutum L (Malvaceae) cv. "Zhe-Mian 1793" in climate cham- bers at 27 ± 1°C, a photoperiod of 14 h light:10 h darkness and 70 ± 10% relative humidity. The purity of the cultures was monitored every 3-5 generations using the random amplified polymorphic DNA-polymerase chain reaction technique and the sequence of mitochondrial cytochrome oxidase I gene, which has been used widely to differentiate whitefly (B. tabaci) "biotypes" [32]. Since the quantity of eggs is extremely low, a mixture of eggs and first to third nymphs were collected as one sample. The pupae were collected as another sample. For adults, individuals were collected from the culture using a glass tube (5 × 0.5 cm) and the sex was determined under a stereo microscope. Then adults of the same sex were pooled into a plastic tube using an aspirator. Finally, these samples were frozen at -80°C until use.

RNA isolation and library preparation for transcriptome analysis
Total RNA was isolated using SV total RNA isolation system (Promega) according to the manufacturer's protocol. RNA integrity was confirmed using the 2100 Bioanalyzer (Agilent Technologies) with a minimum RNA integrated number value of 8. The samples for transcriptome analysis were prepared using Illumina's kit following manufacturer's recommendations. Briefly, mRNA was purified from 6 μg of total RNA (a mixture of RNA from egg & nymph, pupa and adult at equal ratio) using oligo (dT) magnetic beads. Following purification, the mRNA is fragmented into small pieces using divalent cations under elevated temperature and the cleaved RNA fragments were used for first strand cDNA synthesis using reverse Distribution of total clean tags Distribution of distinct clean tags transcriptase and random primers. This was followed by second strand cDNA synthesis using DNA polymerase I and RNaseH. These cDNA fragments then went through an end repair process and ligation of adapters. These products were purified and enriched with PCR to create the final cDNA library.

Analysis of Illumina sequencing results
The cDNA library was sequenced on the Illumina sequencing platform (GAII). The size of the library is approximately 200 bp and both ends of the libraries are sequenced. Image deconvolution and quality value calculations were performed using the Illumina GA pipeline 1.3. The raw reads were cleaned by removing adaptor sequences, empty reads and low quality sequences (reads with unknown sequences 'N'). The reads obtained were randomly clipped into 21 bp K-mers for assembly using de Bruijn graph and SOAPdenovo software [23]. After assessing different K-mer sizes, we found that the 21-mer   provided the best result for transcriptome assembly. Small K-mers make the graph very complex; while large K-mers can have poor overlap in regions with low sequencing depth. After sequence assembly, the resultant contigs were joined into scaffolds using the read mate pairs. To obtained distinct gene sequences, the scaffolds were clustered using TGI Clustering tools [24]. Distinct sequences were used for blast search and annotation against an NCBI nr database using an E-value cut-off of 10 -5 . Functional annotation by gene ontology terms (GO; http://www.geneontology.org) was analyzed by Blast2go software. The COG and KEGG pathways annotation was performed using Blastall software against Cluster of Orthologous Groups database and Kyoto Encyclopedia of Genes and Genomes database. The data sets are available at the NCBI Short Read Archive (SRA) with the accession number: SRX018661. The assembled sequences have been deposited in the NCBI's TSA database and can be searched using the Gene-ID listed in Additional file 2.

Digital gene expression library preparation and sequencing
Tag library preparation for the three Q biotype whitefly samples (egg & nymph, pupa, and adult) was performed in parallel using Illumina gene expression sample preparation kit. Briefly, total RNA from the three samples was used for mRNA capture with magnetic oligo(dT) beads. First and second strand cDNA were synthesized and bead-bound cDNA was subsequently digested with NlaIII. The cDNA fragments with 3' ends were then purified with magnetic beads and Illumina adapter 1 was added to their 5' ends. The junction of Illumina adapter 1 and CATG site is the recognition site of MmeI, which cuts 17 bp downstream of the CATG site, producing tags with adapter 1. After removing 3' fragments with magnetic beads precipitation, Illumina adapter 2 was introduced at 3' ends of tags, acquiring tags with different adapters at both ends to form a tag library. After 15 cycles of linear PCR amplification, 85 base strips were purified by PAGE gel electrophoresis. These strips were then digested, and the single-chain molecules were fixed onto the Illumina sequencing chip for sequencing. The data sets are available at the NCBI SRA with the accession number: SRX018662.

Analysis and mapping of DGE tags
Sequencing-received raw image data was transformed by base calling into sequence data. Prior to mapping reads to the reference database, we filtered all sequences to remove adaptor sequence, low quality sequences (tags with unknown sequences 'N'), empty tags (sequence with only adaptor sequences but no tags); low complexity, and tags with a copy number of 1 (probably sequencing error). A preprocessed database of all possible CATG+17 nucleotide tag sequences was created using our transcriptome reference database. For annotation, all tags were mapped to the reference sequences and only allowed no more than 1 nucleotide mismatch. All the tags mapped to reference sequences from multiple genes were filtered and the remaining tags were designed as unambiguous tags. For gene expression analysis, the number of expressed tags was calculated and then normalized to TPM (number of transcripts per million tags); and the differentially expressed tags were used for mapping and annotation. The complete lists of differentially expressed genes are shown in Additional file 7 and 8.

Evaluation of DGE libraries
A statistical analysis of the frequency of each tag in the different cDNA libraries was performed to compare gene-expression in different developmental stages. Statistical comparison was performed with a custom written scripts using the method described by Audic et al. [29]. FDR (false discovery rate) was used to determine the threshold of P value in multiple test and analysis. We used FDR < 0.001 as the threshold to judge the significance of gene expression difference. For pathway enrichment analysis, we mapped all differentially expressed genes to terms in KEGG database and looked for significantly enriched KEGG terms compared to the genome background.