Whitefly (Bemisia tabaci) genome project: analysis of sequenced clones from egg, instar, and adult (viruliferous and non-viruliferous) cDNA libraries

Background The past three decades have witnessed a dramatic increase in interest in the whitefly Bemisia tabaci, owing to its nature as a taxonomically cryptic species, the damage it causes to a large number of herbaceous plants because of its specialized feeding in the phloem, and its ability to serve as a vector of plant viruses. Among the most important plant viruses transmitted by B. tabaci are those in the genus Begomovirus (family Geminiviridae). Surprisingly, little is known about the genome of this whitefly. The haploid genome size for male B. tabaci has been estimated by flow cytometry to be approximately one billion bp, about five times the size of the genome of the fruitfly Drosophila melanogaster. The genes involved in whitefly development, in host range plasticity, and in begomovirus vector specificity and competency are unknown.
Results To address this general shortage of genomic sequence information, we have constructed three cDNA libraries from non-viruliferous whiteflies (eggs, immature instars, and adults) and two from adult insects that fed on tomato plants infected by two geminiviruses: Tomato yellow leaf curl virus (TYLCV) and Tomato mottle virus (ToMoV). In total, the sequence of 18,976 clones was determined. After quality control and removal of 5,542 clones of mitochondrial origin, 9,110 sequences remained, which were assembled into 3,843 singletons and 1,017 contigs. Comparisons with public databases indicated that the libraries contained genes involved in cellular and developmental processes. In addition, approximately 1,000 bases aligned with the genome of the B. tabaci endosymbiotic bacterium Candidatus Portiera aleyrodidarum, originating primarily from the egg and instar libraries. Apart from the mitochondrial sequences, the longest and most abundant sequence encodes vitellogenin, which originated from whitefly adult libraries, indicating that much of the gene expression in this insect is directed toward the production of eggs.
Conclusion This is the first functional genomics project involving a hemipteran (Homopteran) insect from the subtropics/tropics. The B. tabaci sequence database now provides an important tool to initiate identification of whitefly genes involved in development, behaviour, and B. tabaci-mediated begomovirus transmission.


Background
The past three decades have witnessed a dramatic increase in the economic importance of the whitefly Bemisia tabaci (Genn.) (Aleyrodidae; Hemiptera) in subtropical and mild temperate agriculture systems, owing to the damage it causes to plants when it feeds in the phloem, and its ability to transmit plant viruses. B. tabaci occupies tropical and subtropical habitats, producing 11-15 generations per year [1,2]. The B. tabaci complex consists of diverse biological 'types' with distinct genetic polymorphisms [3,4], and differences in host range, fecundity, dispersal behaviours, prokaryotic endosymbiont composition, and competency with respect to the transmission of begomoviruses, a group of small circular ssDNA plant viruses (genus Begomovirus, family Geminiviridae) [5,6]. The highly fecund Old World B biotype can produce ~300 eggs/female and colonizes over 500 host species, while the New World A type colonizes about 200 species and has a lower fecundity (100 eggs/female). In contrast, the Jatropha type colonizes only a few species within the genus Jatropha and exhibits low fecundity (~30-50 eggs/female) [3].
B. tabaci adults develop from eggs after passing through four instars in approximately 2-3 wk; development is temperature dependent. Members of this complex are haplodiploid: unfertilized eggs give rise to haploid males, and fertilized eggs develop into diploid females (arrhenotoky) [1,2].
The B type of B. tabaci transmits begomoviruses to a large number of crop, ornamental, and weed species [7]. Begomoviruses have either one (monopartite) or two (bipartite) genomic components [8]. Those infecting tomato constitute a large group of begomoviruses. Among them, the bipartite Tomato mottle virus (ToMoV) originated in the New World (Florida/Caribbean region), whereas the monopartite Tomato yellow leaf curl virus (TYLCV) is indigenous to the Old World (Middle East and Africa). TYLCV was recently introduced to the Caribbean Islands and has since spread into the southeastern states of the U.S.A. [9].
Begomoviruses are transmitted by B. tabaci in a circulative manner [10,11]. Virus particles ingested through the stylets enter the oesophagus and the filter chamber, are transported through the gut into the hemocoel, reach the salivary glands and are finally 'transmitted' during feeding, about 8-12 h after the beginning of an acquisition access period [10]. Velocity of translocation is reported to be an intrinsic property of the vector, not of the virus [12,13]. Once the latent period has elapsed, B. tabaci is able to transmit begomoviruses, and in particular TYLCV, for its lifetime [14,15]. The ingestion of TYLCV by the whitefly vector is accompanied by a marked decrease in whitefly longevity and fertility [15]. In contrast, whiteflies that have ingested ToMoV displayed higher fecundity when reared on virus-free tomato than whiteflies not exposed to the virus [16]. TYLCV transcripts have been found in B. tabaci harbouring this virus, whereas viral transcripts are not detected in whiteflies that have ingested ToMoV [17], suggesting a fundamental difference in interactions between these two begomoviruses and their whitefly vector.
At least one whitefly species that colonizes some of the same hosts as B. tabaci (e.g. the greenhouse whitefly, Trialeurodes vaporariorum) is known to be capable of ingesting, but does not transmit begomoviruses [12], and at least one barrier to transmission has been shown to occur at the gut/hemocoel interface [12,18]. The receptors that are hypothesized to mediate begomovirus translocation into the salivary glands of B. tabaci, which is a requisite to transmission, and their genes, are presently unidentified.
Surprisingly, very little is known about the genetic make-up of this insect. The nuclear DNA content of B. tabaci males and females was estimated by flow cytometry as 1.04 and 2.06 pg, respectively, indicating that the haploid genome of B. tabaci contains about one billion bp, which is approximately five times the size of the genome of the fruitfly Drosophila melanogaster [19]. However, it is still not clear whether this size estimate will prove to be accurate, and so a long-term goal is to determine the complete genome sequence of this whitefly. Ultimately it is of interest to isolate and identify the genes expressed during the life cycle of B. tabaci and to understand the genetic make-up of this pest. Of particular interest is the identification of specific genes, and their functions, that are expressed during the development of B. tabaci, as well as those involved in circulative virus transmission, the detoxification of insecticides, and the determination of polyphagy or monophagy in different B. tabaci biotypes. Consequently, the construction of cDNA libraries and the analysis of the sequences for the widespread 'B' biotype of B. tabaci constitute a first step in this endeavour.

Results
From 18,976 sequencing attempts, 9,110 sequences remained after quality, vector and adapter trimming, and removal of mitochondrial DNA sequences (see Materials and methods). The fraction of cleaned sequences out of the total number of sequences from each library was between one third and one half (Table 1). The number of sequences from the various libraries that were assembled into contigs and singletons was as follows: EGG: 201, INST: 1816, HBT: 2093, TYLCV: 2704, and TOMOV: 2296 (Table 1).
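As a quick consistency check on the figures quoted above, the per-library counts can be summed and compared with the total number of sequencing attempts. The following is a minimal sketch in Python; the variable names are illustrative and not part of the original analysis, and only the numbers are taken from the text.

```python
# Cleaned sequences per library, as quoted in the text (Table 1).
cleaned = {"EGG": 201, "INST": 1816, "HBT": 2093, "TYLCV": 2704, "TOMOV": 2296}

# Total number of sequencing attempts, as quoted in the text.
total_attempts = 18976

# The per-library counts should add up to the reported 9,110 cleaned sequences.
total_cleaned = sum(cleaned.values())

# Overall fraction of attempts that survived cleaning (all libraries pooled).
overall_fraction = total_cleaned / total_attempts

print(total_cleaned)
print(round(overall_fraction, 2))
```

The pooled fraction falls just under one half, which is consistent with the per-library range of one third to one half stated in the text; the per-library fractions themselves cannot be recomputed here because the number of attempts per library is given only in Table 1.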

EST assembly into contigs
To identify clones belonging to the same gene, sequences were assembled into contigs [see Additional files 1 and 2] using the Staden Gap4 program [20]. The advantage of this program is that both the bases and their quality values are used to assess overlap and to derive the contig consensus sequence. The genome of the whitefly primary endosymbiotic bacterium Candidatus Portiera aleyrodidarum [21] (AY268081.1) was included in the assembly, since a preliminary analysis revealed that some of the sequences were identical to bacterial DNA. In the assembly process 4,860 contigs and singletons were assembled from 9,109 sequences (Table 2). The number of singletons was 3,843. The GC content and the average length were higher in the contigs than in the singletons. The contigs with more than a single sequence resulted in sequences 1.5-fold longer than the average sequence length. Figure 1 shows that the contigs with up to 25 sequences had a linear relationship between sequence number and contig length, i.e. the assembly process produced sequences longer than the singletons. Table 3 shows that the largest contig (with respect to sequence number) was assembled with sequences from the genome of Candidatus Portiera aleyrodidarum (AY268081.1), and was derived primarily from the egg (EGG) and instar (INST) libraries; endosymbiont sequences were rare in adult whitefly libraries (Figure 2). Most clones that shared sequence homology with this bacterium aligned between nucleotide coordinates 23,000 and 24,000 of the partial sequence of the bacterial genome, which encodes a 16S ribosomal RNA gene. The region downstream was found to be rich in adenine, a feature which may have contributed to capture by the poly(T) primer intended to selectively bind poly(A)-containing eukaryotic mRNAs.
The second largest contig was composed of sequences homologous to the published B. tabaci mitochondrial genome (AY521259). The number of mitochondrial DNA clones was extremely high. We performed an initial screening for mitochondrial sequences by running RepeatMasker against the published mitochondrial genome (see Methods: analysis of library quality) using a threshold of up to 10% substitutions in the matching DNA region or a Smith-Waterman score of at least 2500. For initial selection we preferred not to use stringent criteria in order not to leave aside nuclear genes and to allow assem- This region contains the large subunit ribosomal RNA and three tRNA genes. It is possible that these sequences allowed RNA:tRNA dimer formation and initiation of cDNA synthesis as described in some retroviruses [22].
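The screening rule quoted above (up to 10% substitutions in the matching region, or a Smith-Waterman score of at least 2500) can be expressed as a simple predicate. The following is a minimal sketch in Python, not the authors' pipeline: the `MaskHit` record, its field names and the example clones are hypothetical, and parsing of real RepeatMasker output is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class MaskHit:
    """One clone's best match to the mitochondrial genome (hypothetical record)."""
    clone_id: str
    percent_substitutions: float  # % substitutions in the matching region
    sw_score: int                 # Smith-Waterman score of the match


# Thresholds as quoted in the text.
MAX_SUBSTITUTIONS = 10.0
MIN_SW_SCORE = 2500


def is_mitochondrial(hit: MaskHit) -> bool:
    """Flag a clone if EITHER criterion from the text is met."""
    return (hit.percent_substitutions <= MAX_SUBSTITUTIONS
            or hit.sw_score >= MIN_SW_SCORE)


# Made-up example hits, for illustration only.
hits = [
    MaskHit("clone_a", 4.0, 1200),   # few substitutions: flagged
    MaskHit("clone_b", 14.0, 3100),  # high score: flagged
    MaskHit("clone_c", 18.5, 900),   # neither criterion met: kept
]

flagged = [h.clone_id for h in hits if is_mitochondrial(h)]
print(flagged)
```

Because the two criteria are joined by "or", the predicate flags a match that satisfies either one, whatever the other value is.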
Figure 1. Number of sequences building a contig versus the contig length. Scatter plot of contig sequence number (x axis; number of sequences that make up a given contig) versus contig length (y axis; bases). The colour scale represents the number of sequences; the size of the square represents the number of HBT sequences. The annotations for the ten contigs with the highest number of sequences are shown.
The third largest contig shared high homology with other insect vitellogenin genes; this contig is also the second longest (2,883 bp) and originated exclusively from the adult whitefly libraries (Figure 2). Since the libraries were not normalized, the high level of redundancy made it possible to appraise the level of expression of certain genes at certain developmental stages, or as a consequence of virus ingestion and/or transmission. Figure 3 shows that in the thirteen longest contigs (each containing 28 or more sequences) the INST library seems to have a different set of expressed genes than the adult libraries. Contig Bt-ToMoV-023-1-C12-T3_C12 was composed of 28 sequences which shared no significant homology with any of the sequence databases that were searched. However, a search of the Interpro database [23] revealed that it contains a signal peptide and a trans-membrane domain.

Identification of whitefly contigs and singletons by BLAST analysis
To identify homologies and identities of the contigs and singletons to known proteins, genes and/or genomes, contigs and singletons were subjected to blastn and blastx searches against the following databases: non-redundant protein database (nr, NCBI), non-redundant nucleotide database (nt, NCBI), Swiss-Prot [24], Flybase protein database [25] and EST other (non-human and non-mouse, NCBI) [see Additional file 3]. Managing and parsing the BLAST outputs were carried out using the BioCloneDB application [26]. About 45% of the contigs and singletons had a match with an E-value of 1.0e-06 or smaller to one of the above databases (Table 4). In total, 1,544 contigs and singletons had homology to a protein in the nr database. The E-value distribution of the top hits in the nr database (Figure 3a) showed that 43% of the contigs and singletons with a homolog ranged between 1.0e-20 and 1.0e-50, whereas 67% had a moderate to strong homology (smaller than 1.0e-20). The species distribution of the top hits (Figure 3b) showed that 58% of contigs and singletons had sequence homology to genomes of insects completely or partially sequenced, and were approximately evenly distributed between the mosquito Anopheles gambiae, the honeybee Apis mellifera, the whitefly B. tabaci and the fruitfly D. melanogaster.
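The E-value cutoff used above can be illustrated with a small filter over BLAST tabular output. The following is a minimal sketch in Python and is not the BioCloneDB parser: it assumes the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score), and the query names, subject names and values in the sample are made up.

```python
import csv
import io

# Cutoff quoted in the text: keep hits with an E-value of 1.0e-06 or smaller.
E_VALUE_CUTOFF = 1.0e-06


def top_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return {query: (subject, evalue)} for the best hit passing the cutoff."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue = float(row[10])  # column 11 of the tabular format
        if evalue > cutoff:
            continue  # hit too weak to count as a homology
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical hits, for illustration only.
rows = [
    ["contig_1", "subj_A", "62.5", "200", "75", "0", "1", "600", "10", "209", "2e-83", "310"],
    ["contig_1", "subj_B", "40.0", "150", "90", "2", "1", "450", "5", "154", "6e-13", "75"],
    ["single_7", "subj_C", "30.0", "50", "35", "1", "1", "150", "1", "50", "0.002", "35"],
]
sample = "\n".join("\t".join(r) for r in rows)

hits = top_hits(sample)
print(hits)
```

Queries with no entry in the returned dictionary correspond to the "no identifiable homology" class at this cutoff.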

Sequences with no identifiable homology
No homologous sequences could be found for 2,649 (54.5%) of the contigs and singletons among the databases searched. The proportion without homology was higher among the singletons (58%). Because the libraries were poly(dT)-primed, some of these sequences may represent 3' untranslated regions (3' UTRs). It is also possible that the putative homologous regions are too short to produce a significant alignment.

Comparison to existing B. tabaci sequences in NCBI
There were 448 contigs and singletons presenting high similarity (E-value equal to or smaller than 1.0e-40) with B. tabaci DNA sequences. The majority of these hits (399) were to the whitefly mitochondrial genome (AY521257.1), and represented mitochondrial sequences that were not removed in the preassembly process. The BLAST search against the EST database did not reveal any ESTs originating from B. tabaci (currently there are no B. tabaci ESTs in GenBank). Thus, homology searches indicated that the majority of the contigs and singletons described herein are novel B. tabaci genes, which are not present in the NCBI sequence databases.

Assignment of the whitefly contigs and singletons to common Gene Ontology terms
Based on homologies with the Swiss-Prot database, the contigs and singletons were assigned a biological process, molecular function and cellular component from the Gene Ontology (GO) terminology [27]. The GO terms were extracted electronically using the FatiGO tool [28]. The top hits of 1,224 contigs and singletons with an E-value equal to or smaller than 1.0e-06 were to 922 different Swiss-Prot entries (Figure 4). According to the FatiGO analysis of this list of Swiss-Prot entries, only 35 did not have a GO annotation. The most dominant biological process GO annotation at level 2 was physiological process (97% of the proteins), and 95% were annotated as cellular process. The most dominant molecular function GO categories at level 2 were catalytic activity (51%) and binding (50%). The most dominant cellular component was cell (97%) and the second largest was organelle (78%).

Comparing whitefly Gene Ontology to Drosophila
To evaluate how similar B. tabaci is to Drosophila, a blastx search was carried out against all Drosophila proteins. Here, 1,053 contigs and singletons had a top hit with an E-value equal to or smaller than 1.0e-06. This set of genes was used to compare B. tabaci to Drosophila with respect to their GO profiles. The overrepresented terms were associated with ribosome and protein biosynthesis as well as mitochondria and generation of precursor metabolites and energy (Table 5). The underrepresented terms were 'unknowns', or were identified as receptors, or as having a role in signal transduction (Table 6).
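The text does not state which statistical test was used to call GO terms over- or underrepresented. One common way to frame such a comparison, shown here purely as an assumed illustration, is a one-sided hypergeometric test: given a reference set of N proteins of which K carry a term, how likely is it to see at least k annotated proteins in a sample of n? The following is a minimal sketch in Python; all counts are hypothetical and are not taken from Tables 5 or 6.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n proteins are drawn from N, of which K carry the term."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts, for illustration only (not from the paper).
N = 1000   # proteins in the reference set
K = 50     # reference proteins annotated with the GO term
n = 100    # proteins in the sample (e.g. those with a whitefly top hit)
k = 15     # sample proteins annotated with the GO term (expected: n*K/N = 5)

p_over = hypergeom_upper_tail(k, K, n, N)
print(p_over)
```

A small upper-tail probability suggests overrepresentation of the term in the sample; the lower tail, P(X <= k), would be used in the same way for underrepresentation.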

Mapping whitefly contigs and singletons to pathways
KEGG EC numbers [29] were additionally extracted from the top nr homologies. In total, out of the 1,544 nr hits, 48 had an EC number, 37 of which were unique. The EC numbers were mapped to their respective pathways using the KEGG tools (gpath) (Table 7).

Multiple alignments of vitellogenin-like contigs
Nine contigs shared the same vitellogenin homolog from the sawfly Athalia rosae (BAA22791.1), as revealed in a blastx search against the nr database. Table 8 shows homologies to vitellogenin ranging from an E-value of 6.0e-13 to 2.0e-83; the location of homology was not the same for the various contigs. To evaluate whether the contigs encoded a protein family or the same protein, a multiple alignment of the contigs having homology to the carboxyl terminus of vitellogenin (amino acids 1371 to 1770, BAA22791.1) was performed (Figure 5). The multiple alignment clearly demonstrated that at least two different types of proteins were represented (contigs 2, 6 and 8 versus contigs 4 and 7). The contigs shared a region rich in serine codons flanked by an AAC repeat in the DNA sequence, two features not found in the C-terminal moiety of the A. rosae vitellogenin homolog (Figure 6). In A. rosae vitellogenin, two serine repeats can be found in the N-terminal moiety of the protein, between amino acids 344 and 367 and between amino acids 402 and 434.

The whitefly contig and EST database
A relational database with a web-based front end (WhiteFlyDB) was created to store, navigate, annotate and retrieve sequence and contig information [30]. This database is based on the BioCloneDB application [26]. The database contains all the relevant contig information, such as the names of the sequences that compose each contig, the top hits against the described databases and the information extracted for these top hits (GO, EC, cellular location), as well as information on the homology itself. The sequences in FASTA format and the tab-delimited BLAST reports can be easily extracted and imported into Excel files.

Discussion
The whitefly B. tabaci is a major pest of agricultural crops because it causes damage by feeding and because it transmits many important viruses to plant species cultivated for food and fiber nearly worldwide. Prior to the present research, and despite the importance of B. tabaci, only a handful of mRNA sequences (mostly partial) encoding nuclear protein-coding genes had been published in GenBank. They include sequences encoding actins, a para-sodium channel, putative knottins, a NADP-dependent ketose reductase, two heat shock proteins, a nicotinic acetylcholine receptor alpha subunit, an acetylcholinesterase-like protein, and a diffusible secreted glycoprotein. The results described in this communication represent the first attempt to develop a functional genomics program involving a homopteran species.
Since the amount of total RNA that could be extracted from eggs and instars was extremely low, we did not isolate polyA+ RNA; instead we used total RNA as the template for synthesizing cDNA, which inevitably reduced the mRNA representation in the sample. Libraries have been prepared from another insect pest, the brown citrus aphid Toxoptera citricida, starting from RNA samples enriched in polyA+ RNA [31]. However, it has to be noted that an adult aphid weighs approximately 300 micrograms, while an adult whitefly weighs approximately 30 micrograms. Moreover, the weight of a whitefly egg is approximately 1/1000 that of an adult. We did not normalize the libraries, which allowed us to roughly estimate and compare the levels of expression of major genes in the different libraries.
The fraction of the expressed whitefly genes present in our database can be roughly estimated. Although the genome size of B. tabaci was estimated to be approximately five times that of Drosophila [19], it is reasonable to speculate that the two insect species have approximately the same number of protein-encoding nuclear genes. The whitefly database contains the sequences of 975 contigs and 3,322 singletons (non-mitochondrial and non-bacterial). If we assume that each contig and each singleton represents a transcript of a single protein-coding nuclear gene, our sequences represent 4,297 genes. The number of gene families (protein families) in Drosophila has been estimated as 674 and the number of genes that are not members of a gene family has been estimated as 10,786; altogether 11,460 protein-encoding genes [32]. Hence the B. tabaci database may represent approximately one third of the insect nuclear protein-encoding genes. Additional sequencing from the 3' end of the clones may provide a more accurate estimate.
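The estimate above follows from simple arithmetic on the quoted figures. The following minimal Python sketch reproduces it, under the same assumption stated in the text that each contig and singleton corresponds to one nuclear protein-coding gene; the variable names are illustrative.

```python
# Figures quoted in the text (non-mitochondrial, non-bacterial sequences).
contigs = 975
singletons = 3322

# Drosophila figures quoted in the text [32].
gene_families = 674
genes_outside_families = 10786

# Assumption from the text: one contig or singleton = one gene.
whitefly_genes = contigs + singletons
drosophila_genes = gene_families + genes_outside_families

# Fraction of the assumed gene complement represented in the database.
fraction = whitefly_genes / drosophila_genes

print(whitefly_genes, drosophila_genes)
print(round(fraction, 2))
```

The ratio comes out slightly above one third. It is an upper-bound style estimate, since several non-overlapping contigs or singletons may derive from the same gene.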
Within this whitefly database approximately half of the sequences had a match with an E-value of 1.0e-06 or smaller to one of the databases; 1,544 sequences had homology to a protein in the nr database. Approximately 60% of the whitefly contigs presented homologies with sequenced genomes of other insect species. No homologous sequence could be found for 2,649 contigs and singletons (54.5%) in any of the databases searched.
It was notable that, apart from the endosymbiont and mitochondrial sequences, the most abundant contig encoded vitellogenin. This ancient protein is the major yolk protein of eggs, where it is used as a food source during embryogenesis [33]. There are three vitellogenin genes in Drosophila [25]. The whitefly vitellogenin sequences were found exclusively in libraries from adult whiteflies, indicating that a relatively large share of transcriptional activity is directed towards the production of eggs.
The database developed in this study provides a large source of information for studies of whitefly development, circulative transmission of begomoviruses, and choice of host plant. Comparing the sequences present in the various libraries may provide preliminary information on genes expressed during acquisition and transmission of begomoviruses, and ultimately those involved in B. tabaci development.

Conclusion
The set of sequences developed in this study makes available to the scientific community the first DNA sequence database for an important hemipteran (homopteran) pest of agricultural crops. Its availability will allow the investigation of important questions regarding whitefly biology, development, gene expression, and comparative biology. It will also facilitate studies to elucidate the genetics underlying gene expression in pest and non-pest biotypes, and the basis for virus-vector specificity, resistance to insecticides, and plant host preferences for this cryptic species. This sequence set has been arrayed in a microchip format, which enables biologically based questions to be addressed by examining gene functionalities and expression patterns of the whitefly genome.

Libraries from eggs and instars
Directional cDNA libraries were constructed using the Creator SMART cDNA Library Construction Kit (Clontech). Eggs and instars were collected from leaves of cotton plants caged with whiteflies (B. tabaci, B biotype) and held in insect-proof cages. RNA was isolated using TRIzol and Phase Lock Gel-Heavy tubes. The first-strand cDNA was synthesized using the total RNA and the CDS III/3' PCR primer, which contains a Sfi IB site. The cDNA was amplified by PCR: the first-strand cDNA was used together with the 5' PCR primer, which contains a Sfi IA site, and the CDS III/3' PCR primer. Following phenol treatment, the DNA was digested with SfiI and size-fractionated using a CHROMA SPIN-400 column. The high molecular weight cDNA fractions (with sticky ends) were pooled and ligated with dephosphorylated pDNR-LIB vector digested at the Sfi IA and Sfi IB sites. The recombinant plasmids were electroporated into DH5α and 10G competent cells, and plated on LB agar plates containing chloramphenicol.

Sequencing
Plasmid clones were isolated from 1.7 ml overnight Luria-Bertani broth cultures using a Qiagen 9600 robot and