De novo characterization of a whitefly transcriptome and analysis of its gene expression during development
© Wang et al; licensee BioMed Central Ltd. 2010
Received: 10 December 2009
Accepted: 24 June 2010
Published: 24 June 2010
Whitefly (Bemisia tabaci) causes extensive crop damage throughout the world by feeding directly on plants and by vectoring hundreds of species of begomoviruses. Yet little is understood about its genes involved in development, insecticide resistance, host range plasticity and virus transmission.
To facilitate research on whitefly, we present a method for de novo assembly of whitefly transcriptome using short read sequencing technology (Illumina). In a single run, we produced more than 43 million sequencing reads. These reads were assembled into 168,900 unique sequences (mean size = 266 bp) which represent more than 10-fold of all the whitefly sequences deposited in the GenBank (as of March 2010). Based on similarity search with known proteins, these analyses identified 27,290 sequences with a cut-off E-value above 10-5. Assembled sequences were annotated with gene descriptions, gene ontology and clusters of orthologous group terms. In addition, we investigated the transcriptome changes during whitefly development using a tag-based digital gene expression (DGE) system. We obtained a sequencing depth of over 2.5 million tags per sample and identified a large number of genes associated with specific developmental stages and insecticide resistance.
Our data provides the most comprehensive sequence resource available for whitefly study and demonstrates that the Illumina sequencing allows de novo transcriptome assembly and gene expression analysis in a species lacking genome information. We anticipate that next generation sequencing technologies hold great potential for the study of the transcriptome in other non-model organisms.
The whitefly Bemisia tabaci (Gennadius) is a genetically diverse complex containing some of the most destructive invasive pests of many ornamental and glasshouse crops worldwide [1, 2]. The species complex colonizes more than 600 different species of plants, transmits many plant viruses, feeds on phloem sap, and promotes the growth of damaging fungi on honeydew excretions deposited on plants [3–6]. Recent phylogenetic analysis combined with a pattern of reproductive isolation among genetic groups within B. tabaci indicate that the complex contains at least 24 cryptic species, some of which have been referred to as "biotypes" in the last 20 years [7, 8]. As the separation at the species level within the B. tabaci complex is yet to be fully resolved, we have retained the commonly used term "biotype" to link this study with existing literature. The most predominant and damaging biotypes of B. tabaci are the B and Q biotypes [9, 10]. While the former is known for its high fitness parameters, the Q biotype whitefly has a unique ability to develop and maintain high levels of resistance to major classes of insecticides owing to biological and genetic factors [11, 12].
Despite its global importance, genomic sequence resources available for the whitefly are scarce, especially for the Q biotype. Currently (March 30th, 2010), there are about 9110 EST and 762 nucleotide sequences available on NCBI for the B biotype whitefly, and only 683 nucleotide sequences have been deposited for the Q biotype whitefly. The previous EST sequencing efforts for the B biotype whitefly have allowed the development of small-scale microarrays for gene expression analysis in the context of insecticide resistance and parasitoid-whitefly interactions [13–15]. While these studies have highlighted the utility of cDNA sequencing for candidate gene discovery in the absence of a genome sequence, a comprehensive description of the genes expressed in insecticide-resistant Q biotype whitefly remains unavailable.
Over the past several years, the next generation sequencing technology has emerged as a cutting edge approach for high-throughput sequence determination and this has dramatically improved the efficiency and speed of gene discovery [16, 17]. For example, the Illumina sequencing technology is able to generate over one billion bases of high-quality DNA sequence per run at less than 1% of the cost of capillary-based methods . Furthermore, this next generation sequencing has also significantly accelerated and improved the sensitivity of gene-expression profiling and, is expected to boost collaborative and comparative genomics studies [19, 20]. Previously, Illumina sequencing of transcriptomes for organisms with completed genomes confirmed that the relatively short reads produced can be effectively assembled and used for gene discovery and comparison of gene expression profiles [21, 22]. Despite its obvious potential, next generation sequencing methods have not yet been applied to whitefly research.
In this study, we generated over three billion bases of high-quality DNA sequence with Illumina technology and demonstrated the suitability of short-read sequencing for de novo assembly and annotation of genes expressed in a eukaryote without the prior genome information. In a single run, we identified 168,900 distinct sequences including hundreds of insecticide target and metabolism genes. Furthermore, we compared the gene expression profiles of whiteflies during different developmental stages using a digital gene expression system. The assembled, annotated transcriptome sequences and gene expression profiles provide an invaluable resource for the identification of whitefly genes involved in insecticide resistance, development and virus transmission.
Results and discussion
Illumina sequencing and reads assembly
Summary for the Q biotype whitefly transcriptome
Total number of reads
Total base pairs (bp)
Average read length
Total number of contigs
Mean length of contigs
Total number of scaffolds
Mean length of scaffolds
Total distinct sequences
Sequences with E-value < 10-5
Annotation of predicted proteins
Gene ontology (GO) and clusters of orthologous groups (COG) classification
Detection of sequences related to the insecticide targets and metabolism
Genes related to the insecticide targets and metabolism
Number of sequences had a hit with nr database1
Number of known sequences from NCBI nucleotide database2
Number of known sequences from B biotype whitefly EST project
Nicotinic acetylcholine receptor
Identified nicotinic acetylcholine receptor (nAChR) genes
alpha 1 subunit
alpha 2 subunit
alpha 3 subunit
alpha 4 subunit
alpha 5 subunit
alpha 7 subunit
alpha 10 subunit
beta 1 subunit
Digital gene expression (DGE) library sequencing
Statistics of DGE sequencing
egg & nymph
Tag Mapping to Gene
Tag Mapping to Gene
% of tag
Tag Mapping to Gene
Tag Mapping to Gene
% of tag
% of ref genes
Mapping sequences to the reference transcriptome database
Distribution of DGE tags on genes
DGE is a SAGE-based transcript profiling method. In theory, all the sequenced tags in the libraries should be mapped to the last NlaIII restriction site on the sense strand at the 3'-end of the transcript. However, like the results in DGE analysis of mouse and zebrafish transcriptomes [20, 22], we found that approximately 60% of the tags mapped to the classical site (data not shown). This is probably due to the incomplete NlaIII digestion during library preparation and the usage of alternative polyadenylation and/or splicing sites . Detection of multiple tags with high count numbers for a predicted transcript indicates the reliability of the transcript sequence . Furthermore, the information obtained from multiple tags per transcript is valuable for the verification of ab initio gene predictions.
Changes in gene expression profile among the different developmental stages
Functional annotation of differentially expressed genes
With this study, we present a rapid and cost-effective method for transcriptome and DGE analysis using Illumina sequencing technology. The single run produced more than 168,900 distinct sequences with 27,290 sequences having an above cut-off BLAST result. These findings provide a substantial contribution to existing sequence resources for the whitefly and will certainly accelerate insecticide resistance research in the Q biotype whitefly. To our knowledge, this is the first publication using Illumina sequencing technology for an organism without prior genome annotation. Additionally, we have demonstrated the feasibility of using Illumina sequencing based DGE system for gene expression profiling and have provided new leads for functional studies of genes involved in whitefly development.
Insect rearing and sample preparation
Stock cultures of Q biotype whiteflies (Bemisia tabaci) (cytochrome oxidase I sequence GenBank accession no: DQ473394) were maintained on cotton Gossypium hirsutum L (Malvaceae) cv. "Zhe-Mian 1793" in climate chambers at 27 ± 1°C, a photoperiod of 14 h light:10 h darkness and 70 ± 10% relative humidity. The purity of the cultures was monitored every 3-5 generations using the random amplified polymorphic DNA-polymerase chain reaction technique and the sequence of mitochondrial cytochrome oxidase I gene, which has been used widely to differentiate whitefly (B. tabaci) "biotypes" . Since the quantity of eggs is extremely low, a mixture of eggs and first to third nymphs were collected as one sample. The pupae were collected as another sample. For adults, individuals were collected from the culture using a glass tube (5 × 0.5 cm) and the sex was determined under a stereo microscope. Then adults of the same sex were pooled into a plastic tube using an aspirator. Finally, these samples were frozen at -80°C until use.
RNA isolation and library preparation for transcriptome analysis
Total RNA was isolated using SV total RNA isolation system (Promega) according to the manufacturer's protocol. RNA integrity was confirmed using the 2100 Bioanalyzer (Agilent Technologies) with a minimum RNA integrated number value of 8. The samples for transcriptome analysis were prepared using Illumina's kit following manufacturer's recommendations. Briefly, mRNA was purified from 6 μg of total RNA (a mixture of RNA from egg & nymph, pupa and adult at equal ratio) using oligo (dT) magnetic beads. Following purification, the mRNA is fragmented into small pieces using divalent cations under elevated temperature and the cleaved RNA fragments were used for first strand cDNA synthesis using reverse transcriptase and random primers. This was followed by second strand cDNA synthesis using DNA polymerase I and RNaseH. These cDNA fragments then went through an end repair process and ligation of adapters. These products were purified and enriched with PCR to create the final cDNA library.
Analysis of Illumina sequencing results
The cDNA library was sequenced on the Illumina sequencing platform (GAII). The size of the library is approximately 200 bp and both ends of the libraries are sequenced. Image deconvolution and quality value calculations were performed using the Illumina GA pipeline 1.3. The raw reads were cleaned by removing adaptor sequences, empty reads and low quality sequences (reads with unknown sequences 'N'). The reads obtained were randomly clipped into 21 bp K-mers for assembly using de Bruijn graph and SOAPdenovo software . After assessing different K-mer sizes, we found that the 21-mer provided the best result for transcriptome assembly. Small K-mers make the graph very complex; while large K-mers can have poor overlap in regions with low sequencing depth. After sequence assembly, the resultant contigs were joined into scaffolds using the read mate pairs. To obtained distinct gene sequences, the scaffolds were clustered using TGI Clustering tools . Distinct sequences were used for blast search and annotation against an NCBI nr database using an E-value cut-off of 10-5. Functional annotation by gene ontology terms (GO; http://www.geneontology.org) was analyzed by Blast2go software. The COG and KEGG pathways annotation was performed using Blastall software against Cluster of Orthologous Groups database and Kyoto Encyclopedia of Genes and Genomes database. The data sets are available at the NCBI Short Read Archive (SRA) with the accession number: SRX018661. The assembled sequences have been deposited in the NCBI's TSA database and can be searched using the Gene-ID listed in Additional file 2.
Digital gene expression library preparation and sequencing
Tag library preparation for the three Q biotype whitefly samples (egg & nymph, pupa, and adult) was performed in parallel using Illumina gene expression sample preparation kit. Briefly, total RNA from the three samples was used for mRNA capture with magnetic oligo(dT) beads. First and second strand cDNA were synthesized and bead-bound cDNA was subsequently digested with NlaIII. The cDNA fragments with 3' ends were then purified with magnetic beads and Illumina adapter 1 was added to their 5' ends. The junction of Illumina adapter 1 and CATG site is the recognition site of MmeI, which cuts 17 bp downstream of the CATG site, producing tags with adapter 1. After removing 3' fragments with magnetic beads precipitation, Illumina adapter 2 was introduced at 3' ends of tags, acquiring tags with different adapters at both ends to form a tag library. After 15 cycles of linear PCR amplification, 85 base strips were purified by PAGE gel electrophoresis. These strips were then digested, and the single-chain molecules were fixed onto the Illumina sequencing chip for sequencing. The data sets are available at the NCBI SRA with the accession number: SRX018662.
Analysis and mapping of DGE tags
Sequencing-received raw image data was transformed by base calling into sequence data. Prior to mapping reads to the reference database, we filtered all sequences to remove adaptor sequence, low quality sequences (tags with unknown sequences 'N'), empty tags (sequence with only adaptor sequences but no tags); low complexity, and tags with a copy number of 1 (probably sequencing error). A preprocessed database of all possible CATG+17 nucleotide tag sequences was created using our transcriptome reference database. For annotation, all tags were mapped to the reference sequences and only allowed no more than 1 nucleotide mismatch. All the tags mapped to reference sequences from multiple genes were filtered and the remaining tags were designed as unambiguous tags. For gene expression analysis, the number of expressed tags was calculated and then normalized to TPM (number of transcripts per million tags); and the differentially expressed tags were used for mapping and annotation. The complete lists of differentially expressed genes are shown in Additional file 7 and 8.
Evaluation of DGE libraries
A statistical analysis of the frequency of each tag in the different cDNA libraries was performed to compare gene-expression in different developmental stages. Statistical comparison was performed with a custom written scripts using the method described by Audic et al. . FDR (false discovery rate) was used to determine the threshold of P value in multiple test and analysis. We used FDR < 0.001 as the threshold to judge the significance of gene expression difference. For pathway enrichment analysis, we mapped all differentially expressed genes to terms in KEGG database and looked for significantly enriched KEGG terms compared to the genome background.
We thank the technical assistance of Dan-Mei Yao and Jin-Hua Liu. We are grateful to Ding Jeak Ling for comments on the manuscript. Financial support for this study was provided by the National Basic Research Programme of China (Project 2009CB119203), the National Natural Science Foundation of China (Project 30730061), the China National Science-Technology Support Programme (Project 2006BAD08A18) and the Fundamental Research Funds for the Central Universities (2009QNA6025).
- Perring TM, Cooper AD, Rodriguez RJ, Farrar CA, Bellows TS: Identification of a whitefly species by genomic and behavioral studies. Science. 1993, 259: 74-77. 10.1126/science.8418497.PubMedView ArticleGoogle Scholar
- Boykin LM, Shatters RG, Rosell RC, McKenzie CL, Bagnall RA, De Barro PJ, Frohlich DR: Global relationships of Bemisia tabaci (Hemiptera: Aleyrodidae) revealed using Bayesian analysis of mitochondrial COI DNA sequences. Mol Phylogenet Evol. 2007, 44: 1306-1319. 10.1016/j.ympev.2007.04.020.PubMedView ArticleGoogle Scholar
- Czosnek H, Ghanim M, Ghanim M: The circulative pathway of begomoviruses in the whitefly vector Bemisia tabaci; insights from studies with Tomato yellow leaf curl virus. Ann Appl Biol. 2002, 140: 215-231. 10.1111/j.1744-7348.2002.tb00175.x.View ArticleGoogle Scholar
- Seal S, vandenBosch F, Jeger M: Factors influencing begomovirus evolution and their increasing global significance: implications for sustainable control. Crit Rev Plant Sci. 2006, 25: 23-46. 10.1080/07352680500365257.View ArticleGoogle Scholar
- Jiu M, Zhou XP, Liu SS: Acquisition and transmission of two begomoviruses by the B and a non-B biotype of Bemisia tabaci from Zhejiang, China. Journal of Phytopathology. 2006, 154: 587-591. 10.1111/j.1439-0434.2006.01151.x.View ArticleGoogle Scholar
- Dalton R: Whitefly infestations: the Christmas Invasion. Nature. 2006, 443: 898-900. 10.1038/443898a.PubMedView ArticleGoogle Scholar
- Xu J, De Barro PJ, Liu SS: Reproductive incompatibility among genetic groups of Bemisia tabaci supports the proposition that the whitefly is a cryptic species complex. Bull Entomol Res. 2010, 100: 359-366. 10.1017/S0007485310000015.PubMedView ArticleGoogle Scholar
- Dinsdale A, Cook L, Riginos C, Buckley Y, De Barro PJ: Refined global analysis of Bemisia tabaci (Gennadius) (Hemiptera: Sternorrhyncha: Aleyrodoidea) mitochondrial CO1 to identify species level genetic boundaries. Ann Entomol Soc Am. 2010, 103: 196-208. 10.1603/AN09061.View ArticleGoogle Scholar
- Perring TM: The Bemisia tabaci species complex. Crop Protection. 2001, 20: 725-737. 10.1016/S0261-2194(01)00109-0.View ArticleGoogle Scholar
- Liu SS, De Barro PJ, Xu J, Luan JB, Zang LS, Ruan YM, Wan FH: Asymmetric mating interactions drive widespread invasion and displacement in a whitefly. Science. 2007, 318: 1769-1772. 10.1126/science.1149887.PubMedView ArticleGoogle Scholar
- Nauen R, Stumpf N, Elbert A: Toxicological and mechanistic studies on neonicotinoid cross resistance in Q-type Bemisia tabaci (Hemiptera: Aleyrodidae). Pest Manag Sci. 2002, 58: 868-875. 10.1002/ps.557.PubMedView ArticleGoogle Scholar
- Horowitz AR, Kontsedalov S, Khasdan V, Ishaaya I: Biotypes B and Q of Bemisia tabaci and their relevance to neonicotinoid and pyriproxyfen resistance. Arch Insect Biochem Physiol. 2005, 58: 216-225. 10.1002/arch.20044.PubMedView ArticleGoogle Scholar
- Leshkowitz D, Gazit S, Reuveni E, Ghanim M, Czosnek H, McKenzie C, Shatters RL, Brown JK: Whitefly (Bemisia tabaci) genome project: analysis of sequenced clones from egg, instar, and adult (viruliferous and non-viruliferous) cDNA libraries. BMC Genomics. 2006, 7: 79-10.1186/1471-2164-7-79.PubMed CentralPubMedView ArticleGoogle Scholar
- Ghanim M, Kontsedalov S: Gene expression in pyriproxyfen-resistant Bemisia tabaci Q biotype. Pest Manag Sci. 2007, 63: 776-783. 10.1002/ps.1410.PubMedView ArticleGoogle Scholar
- Mahadav A, Gerling D, Gottlieb Y, Czosnek H, Ghanim M: Parasitization by the wasp Eretmocerus mundus induces transcription of genes related to immune response and symbiotic bacteria proliferation in the whitefly Bemisia tabaci. BMC Genomics. 2008, 9: 342-10.1186/1471-2164-9-342.PubMed CentralPubMedView ArticleGoogle Scholar
- Schuster SC: Next-generation sequencing transforms today's biology. Nature Methods. 2008, 5: 16-18. 10.1038/nmeth1156.PubMedView ArticleGoogle Scholar
- Ansorge WJ: Next-generation DNA sequencing techniques. New Biotechnology. 2009, 25: 195-203. 10.1016/j.nbt.2008.12.009.PubMedView ArticleGoogle Scholar
- Huang W, Marth G: EagleView: a genome assembly viewer for next-generation sequencing technologies. Genome Res. 2008, 18: 1538-1543. 10.1101/gr.076067.108.PubMed CentralPubMedView ArticleGoogle Scholar
- Blow N: Transcriptomics: The digital generation. Nature. 2009, 458: 239-242. 10.1038/458239a.PubMedView ArticleGoogle Scholar
- t Hoen PA, Ariyurek Y, Thygesen HH, Vreugdenhil E, Vossen RH, de Menezes RX, Boer JM, van Ommen GJ, den Dunnen JT: Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Res. 2008, 36: e141-10.1093/nar/gkn705.View ArticleGoogle Scholar
- Rosenkranz R, Borodina T, Lehrach H, Himmelbauer H: Characterizing the mouse ES cell transcriptome with Illumina sequencing. Genomics. 2008, 92: 187-194. 10.1016/j.ygeno.2008.05.011.PubMedView ArticleGoogle Scholar
- Hegedus Z, Zakrzewska A, Agoston VC, Ordas A, Racz P, Mink M, Spaink HP, Meijer AH: Deep sequencing of the zebrafish transcriptome response to mycobacterium infection. Mol Immunol. 2009, 46: 2918-2930. 10.1016/j.molimm.2009.07.002.PubMedView ArticleGoogle Scholar
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010, 20: 265-272. 10.1101/gr.097261.109.PubMed CentralPubMedView ArticleGoogle Scholar
- Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, Lee Y, White J, Cheung F, Parvizi B: TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics. 2003, 19: 651-652. 10.1093/bioinformatics/btg034.PubMedView ArticleGoogle Scholar
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004, 32: D277-280. 10.1093/nar/gkh063.PubMed CentralPubMedView ArticleGoogle Scholar
- Shao YM, Dong K, Zhang CX: The nicotinic acetylcholine receptor gene family of the silkworm, Bombyx mori. BMC Genomics. 2007, 8: 324-10.1186/1471-2164-8-324.PubMed CentralPubMedView ArticleGoogle Scholar
- Alon M, Alon F, Nauen R, Morin S: Organophosphates' resistance in the B-biotype of Bemisia tabaci (Hemiptera: Aleyrodidae) is associated with a point mutation in an ace1-type acetylcholinesterase and overexpression of carboxylesterase. Insect Biochem Mol Biol. 2008, 38: 940-949. 10.1016/j.ibmb.2008.07.007.PubMedView ArticleGoogle Scholar
- Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008, 40: 1413-1415. 10.1038/ng.259.PubMedView ArticleGoogle Scholar
- Audic S, Claverie JM: The significance of digital gene expression profiles. Genome Res. 1997, 7: 986-995.PubMedGoogle Scholar
- Shechter A, Aflalo ED, Davis C, Sagi A: Expression of the reproductive female-specific vitellogenin gene in endocrinologically induced male and intersex Cherax quadricarinatus crayfish. Biol Reprod. 2005, 73: 72-79. 10.1095/biolreprod.104.038554.PubMedView ArticleGoogle Scholar
- Parthasarathy R, Palli SR: Stage- and cell-specific expression of ecdysone receptors and ecdysone-induced transcription factors during midgut remodeling in the yellow fever mosquito, Aedes aegypti. J Insect Physiol. 2007, 53: 216-229. 10.1016/j.jinsphys.2006.09.009.PubMedView ArticleGoogle Scholar
- Luo C, Yao Y, Wang R, Yan F, Hu D, Zhang Z: The use of mitochondrial cytochrome oxidase I (mt COI) gene sequences for the identification of biotypes of Bemisia tabaci (Gennadius) in China. Acta Entomologica Sinica. 2002, 45: 759-763.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.