Transcriptome analysis and comparison reveal divergence between two invasive whitefly cryptic species

Background Invasive species are valuable model systems for examining the evolutionary processes and molecular mechanisms associated with their specific characteristics by comparison with closely related species. Over the past 20 years, two species of the whitefly Bemisia tabaci species complex, Middle East-Asia Minor 1 (MEAM1) and Mediterranean (MED), have both spread from their origin Middle East/Mediterranean to many countries despite their apparent differences in many life history parameters. Previously, we have sequenced the transcriptome of MED. In this study, we sequenced the transcriptome of MEAM1 and took a comparative genomic approach to investigate the transcriptome evolution and the genetic factors underlying the differences between MEAM1 and MED. Results Using Illumina sequencing technology, we generated 17 million sequencing reads for MEAM1. These reads were assembled into 57,741 unique sequences and 15,922 sequences were annotated with an E-value above 10-5. Compared with the MED transcriptome, we identified 3,585 pairs of high quality orthologous genes and inferred their sequence divergences. The average differences in coding, 5' untranslated and 3' untranslated region were 0.83%, 1.66% and 1.43%, respectively. The level of sequence divergence provides additional support to the proposition that MEAM1 and MED are two species. Based on the ratio of nonsynonymous and synonymous substitutions, we identified 24 sequences that have evolved in response to positive selection. Many of those genes are predicted to be involved in metabolism and insecticide resistance which might contribute to the divergence of the two whitefly species. Conclusions Our data present a comprehensive sequence comparison between the two invasive whitefly species. This study will provide a road map for future investigations on the molecular mechanisms underlying their biological differences.


Background
Genomic resources and information about invasive species are valuable for evolutionary studies to determine how and why phenotypes specific to non-indigenous species have been formed [1,2]. Moreover, they will aid ongoing efforts to understand and control the ecological, genetic and economic impacts of the invasive species. However, genomic or expressed sequence tag (EST) resources required to identify candidate genes or genomic changes associated with invasiveness are not yet developed for most invasive species [3]. In fact, only few invasive species have had significant genomic resources developed for this purpose [4]. Over the past several years, the next generation sequencing technology has significantly accelerated the speed of gene discovery and, is expected to boost genomics studies [5][6][7]. Because this technology eliminates the need for cloning ESTs, which introduces bias, and has greatly increased the quantity of data that can be generated in a short time at a reduced cost compared with traditional Sanger sequencing of cDNA libraries [8]. This technology has been proved to be a valuable addition to evolutionary and ecological research for non-model organisms [9]. However, so far, few studies have explored the potential of using next generation sequencing to investigate the source of genetic variation underlying the evolution of an invasive species [3,10].
The whitefly Bemisia tabaci (Gennadius) (Hemiptera: Aleyrodidae) has a global distribution with substantial genetic diversity and has been recorded over 600 species of plants [11][12][13]. Recent phylogenetic analysis combined with a determination of a consistent pattern of reproductive isolation among many genetic groups within B. tabaci indicates that the whitefly is a complex containing at least 28 cryptic species (herein species) [12,[14][15][16][17]. Two species of the complex, Middle East-Asia Minor 1 (herein MEAM1) and Mediterranean (herein MED), as designated by Dinsdale et al. [15] and commonly referred to as the B and Q 'biotype' respectively in the past 20 years, have risen to international prominence since the 1980s due to their global invasion [18][19][20][21][22]. The invasive ability and damage potential of MEAM1 has earned it a place as one of the world's top 100 invasive species http://www.issg.org [23]. Some effort has been made to understand the multiple factors that contribute to the incursion of the two species into new regions and habitats. For example, asymmetric mating interactions between MEAM1 and its indigenous competitors have been shown to play a major role in the invasion of MEAM1 into China and Australia [19]. While both MEAM1 and MED are known for their invasiveness, their biological characteristics are rather different. For example, the invasion of MED seems more closely related to its strong resistance to major classes of insecticides [23][24][25][26][27][28]. Several studies have revealed that the greater abundance of MED relative to MEAM1 in Israel and southern Spain were associated with the higher levels of resistance to pyriproxyfen and neonicotinoids in MED [25,26]. Both species have been speculated to have a wide range of host plants, although up to date the knowledge of their actual host range is very limited partly due to the confusion of their species status in the past [12,29,30].
The experimental evidence available also shows clearly that the two species differ substantially in host range [31,32], interactions with begomoviruses [33,34] and mating behavior [35,36]. Because of those differences between MEAM1 and MED, competitive interactions between them where they co-occur are common and have significant impacts on their distribution. In Zhejiang Province of China, MEAM1 probably arrived in the late 1990s and has been rapidly displacing the indigenous species of the B. tabaci complex [19,37]. In 2005, MED appeared and gradually replaced MEAM1 and has become the only or predominant B. tabaci in some locations [37].
Natural selection under different ecological and agricultural environments is likely to have driven the evolution and divergence between MEAM1 and MED whiteflies and resulted in their reproductive isolation and biological differences. However, the molecular factors responsible for the differences between MEAM1 and MED species are almost unknown. Furthermore, we have no information about how natural selection may have affected the transcriptomes of these invasive whiteflies since their divergence from a common ancestor. So far, studies about sequences divergence in this whitefly species complex only focused on a few genes, such as cytochrome oxidase 1, nuclear ribosomal intergenic transcribed spacer 1, and 16S ribosomal DNA, which are important molecules to differentiate genetic groups of B. tabaci [12,15]. The sequence divergence of acetylcholinesterase among different whitefly populations of MEAM1 has also been documented because of its role in insecticide resistance [38]. However, the investigations of individual genes can not provide an accurate description of genome wide DNA sequence divergence. A more robust picture of genomics divergence between MEAM1 and MED may be attained by examining large numbers of genes that have been selected without prior interest in their biological functions or evolutionary histories [39]. The transcriptome represents a sample of the spatiotemporally expressed genome and can be used as an entry into the genome divergence study [40].
Previously, we have sequenced the transcriptome of MED using Illumina sequencing technology and demonstrated that this technology allows de novo transcriptome assembly in a species lacking genome information [41]. In this study, we generated over one billion bases of high-quality DNA sequence for MEAM1 with Illumina technology. Based on these DNA sequences, we identified 57,741 distinct sequences including hundreds of insecticide target and metabolism genes. To reveal the genetic differences between MEAM1 and MED, we compared the sequence variations between them and identified a number of orthologous genes that show signs of diversifying natural selection. The assembled, annotated transcriptome sequences of MEAM1 provide an invaluable resource for the identification of whitefly genes involved in biological invasion, insecticide resistance and host adaptation. The identification of divergent sequences between MED and MEAM1 opens the door for future investigation of the molecular mechanisms underlying the biological variations between them.

Illumina sequencing and reads assembly of MEAM1 transcriptome
To obtain an overview of the MEAM1 whitefly transcriptome, a cDNA sample was prepared from a mixture of RNA from egg & nymph, pupa, female adult and male adult at equal ratio, and sequenced using the Illumina sequencing platform. We obtained a total of 17 million of 75 bp reads, which have been deposited in the NCBI Short Read Archive (SRA) under the accession number: SRX022878. These raw reads were assembled using SOAPdenovo software and resulted in 123,055 contigs (Table 1) [42]. The contigs were assembled into 104,722 scaffolds using paired-end joining and gap-filling (mean size: 326 bp) ( Table 1). The 104,722 scaffolds were further clustered into 57,741 distinct sequences including 135 clusters and 57,606 singletons (Table 1). In this article, a cluster means a sequence composed of several scaffolds and the singleton means a scaffold that matches no other scaffolds. Next, we analyzed the length distribution of the 57,741 distinct sequences. As shown in Figure 1, although nearly 70% of the sequences (40,254) are between 100 to 500 bp, we identified 4,480 sequences longer than 1,000 bp.

Annotation of predicted proteins and Gene Ontology classification
For functional annotation, distinct gene sequences were searched using BLASTx against the non-redundant (nr) NCBI nucleotide database using a cut-off E-value of 1 × 10 -5 . A total of 15,922 genes returned an above cut-off BLAST result representing about 27.6% of all distinct sequences (See additional file 1). The E-value distribution of the top hits in the nr database showed that 34.2% of the mapped sequences have strong homology (smaller than 1.0E -40 ), whereas 65.8% of the homolog sequences ranged between 1.0E -5 to 1.0E -40 ( Figure 2A). Likewise, the similarity distribution showed that 39% of the sequences have a similarity higher than 60%, while 61% of the hits have a similarity ranging from 18% to 60% ( Figure 2B). Similar to the results of MED transcriptome [41], the highest percentage of MEAM1 sequences were matched to the pea aphid (Acyrthosiphon pisum) (17.7%), followed by the body louse (Pediculus humanus corporis) (14.3%), red flour beetle (Tribolium castaneum) (12.4%) and honey bee (Apis mellifera) (10.9%) ( Figure 2C). Gene Ontology (GO) assignments were used to classify the functions of the predicted MEAM1 whitefly genes. Based on sequence homology, 4,711 sequences can be categorized into 52 functional groups under three main divisions (See additional file 2, red bars). Next, we compared the GO of MEAM1 and MED whitefly transcriptomes [41] and found that the distributions of gene functions from these two species are extremely similar (See additional file 2). This expected result indicates that there is no bias in the construction of the libraries from the MEAM1 and MED whiteflies. Compared to the MED transcriptome which has 7,330 sequences with GO annotation [41], the number of sequences with GO annotation in MEAM1 (4,771) is lower. This is probably due to the differences in the amount of sequencing data generated from the two samples (MEAM1: 1G; MED: 3G). For both species, in the three main divisions (cellular component, molecular function and biological process) of the GO classification, 'Cell part', 'Binding' and 'Cellular process', terms are dominant respectively. For both species, we also noticed a high-percentage of genes from categories of 'Cell', 'Catalytic' and 'Metabolic process' (See additional file 2).

Identification and analysis of the orthologous genes between MED and MEAM1
To compare the sequence divergence of the two species, we analyzed the possible orthologous genes between their transcriptomes using bidirectional best hit which has been widely used to identify orthologous genes [9,40,43,44]. By this way, we identified 24,945 pairs of putative orthologs with an average length of 397 bp and 98.22% identity (range from 80.2% to 100%). To remove potential paralogs, these putative orthologous genes were further screened against the Swissprot database. Only pairs of sequences that mapped unambiguously to the same protein in Swissprot database with an E-value < 1 × 10 -5 were selected as orthologous genes. Among these sequence pairs, 3,997 pairs of sequences could be mapped unambiguously to the same protein, suggesting strongly that they are orthologous genes ( Figure 3). The    (Table 2), a value slightly higher than that of pea aphid genome (mean = 38.8) and Apis mellifera genome (mean = 38.6) [45,46]. However, the GC contents in the UTR regions of MEAM1 are 37.12% (5'UTR) and 35.19% (3'UTR) which is slightly lower than that of the pea aphid and honeybee genomes. Considering the large percentage of noncoding sequences in the whitefly genome [47], the overall GC content of the whitefly should be comparable to that of the pea aphid and honeybee. The 3,585 translated genes were annotated and classified using Kyoto Encyclopedia of Genes and Genomes (KEGG) database (See additional file 3). Given the differences between MEAM1 and MED, we propose that these pathways represent a transcriptome involved in core cellular and physiological functions common to the two whitefly species.

The sequence divergence between MEAM1 and MED
For the 5'UTR, the GC content is 37.12%, and 6.3% of the compared nucleotides occur in CpG contexts ( Table  2). Differences between 5'UTRs of MED and MEAM1 orthologous genes occur at 1.66% of the positions. Interestingly, CpG sites in the 5'UTR differ at 5.29% of positions, whereas non-CpG sites differ at 1.46%. Thus, within 5'UTRs, differences occur approximately 3.6 times more often at CpG sites than at non-CpG sites. For the 3'UTR, the GC content is 35.19% and 3.74% of the nucleotides are in a CpG context. The overall difference of 3'UTR between MED and MEAM1 is 1.43%. CpG and non-CpG sites differ at 7.32% and 1.1%, respectively. Hence, in the 3'UTR, CpG sites contain  6.65 times more differences than non-CpG sites. These results suggest that a substantial proportion of the DNA sequence divergence between the two species is caused by changes at CpG sites. To understand the mechanism of evolution, we compared the ratio of transition (ts) and transversion (tv) [48]. Overall, the transitional differences are 1.41 times more frequent than transversional differences in 5'UTRs (Table 2). Interestingly, the transition-transversion ratio is higher in the CpG positions (2.32) than the non-CpG positions (1.28). This is consistent with the suggestion that the predominant type of mutations in the CpG sites is cytosine deamination, which results in transitional differences [49]. As in 5'UTR, transitional substitutions at 3'UTR are more common than transversions (ts/tv = 1.56) and transitions are even more frequent than transversions at CpG sites (ts/tv = 3.18, Table 2) compared with non-CpG sites (1.36). When comparing divergences of 3'UTR and 5'UTR, the overall and non-CpG sites divergence of 3'UTR is slightly lower than that at 5'UTR. However, for the CpG sites, the divergence at 3'UTR is higher than that of 5'UTR (7.32% versus 5.29%). Therefore, the mutation rates are likely to differ for CpG sites in different regions of genes. It reflects differences in mutational pressures between these regions because of differential methylation rate, which is also consistent with the finding that the transition/transversion rate is higher in the 3'UTRs than in the 5'UTRs (Table 2). Among the 3,585 orthologous gene pairs, the overall divergence in coding regions is 0.83%. In non-CpG sites, the divergence is slightly lower (0.61%); whereas in the CpG sites, the divergence (3.99%) is 6.5 times as high as that of non-CpG sites. Apart from CpG context, the Filter of genes with unexpected stop codon Figure 3 Identification of the orthologous gene pairs between MEAM1 and MED. The bidirectional best hit method was used to identify genes that are putatively orthologs. Coding sequences (CDS) of the orthologous genes were determined by BLASTx against all known proteins in Swissprot database using a threshold of 1.0E-5. After removing the UTR regions, sequences shorter than 150 bp and with unexpected codons in the CDS region were further filtered.  Figure 4). Among the sequences with Ka/Ks values > 1, a couple of genes are involved in sugar and amino acids metabolic processes suggesting those processes are under strongly positive selection and are critical to the specialization of the whiteflies (Table 3). For example, the alpha-trehalose-phosphate synthase (TPS), which is a key enzyme for the synthesis of trehalose, has the highest Ka/Ks value (2.90). The amino acid sequences comparison of the TPS revealed that, in the MEMA1 species, a highly conserved glutamic acid residue in the catalytic domain is replaced by an alanine ( Figure 5A) [53]. Except the sequences with high Ka/Ks, many sequences had Ka/Ks values equal to or only slightly greater than zero, suggesting that these genes have evolved under high selective constraint (See additional file 4) [39].

Analysis of sequences with weak amino-acid similarity
To determine the transcriptome-wide levels of coding sequence divergence, we calculated the sequence homology of orthologous genes between MED and MEAM1. The 3,585 sequence pairs had a mean homology of 99.17% and ranged from 92% to 100% (See additional file 3). The large variation in sequence homology suggests that estimated values of species divergence based on few genes may produce misleading results. Among these sequence pairs, 604 show 100% homology, suggesting these genes are well conserved between MED and MEAM1 (See additional file 3).
To reveal the proteins responsible for the differences between the two species, we analyzed the sequence pairs with weak amino-acid similarity. Interestingly, many of the divergent genes are related to sugar, protein and amino acid metabolism, such as aminopeptidase (92%), oligopeptide transporter (93.6%), aspartate aminotransferase (93.8%) and galactose oxidase (95.8%) (See additional file 3). The sequence divergences of these genes may cause functional differences in corresponding enzymes and result in the biological variations between MED and MEAM1. Furthermore, we noticed that a couple of proteins involved in insecticide resistance are also highly divergent, such as cytochrome P450 4C1 (93.4%) and glutathione S-transferase (95.3%). It was previously observed that the resistance to pesticides appeared to be enhanced by cytochrome P450 [54]. Therefore, we compared the protein sequence of the MED cytochrome P450 4C1 with that of MEAM1 ( Figure  5B). Sequence alignment showed that the MED whitefly cytochrome P450 4C1 is clearly different with the MEAM1 homolog. The relevance of these mutations in pesticide resistance requires further investigation.

Discussion
For the B. tabaci complex, 28 cryptic species have been delineated based on a number of phylogenetic analysis and reproductive isolation studies; however, the evolutionary origin and basal taxa have not been ascertained. Furthermore, the genetic factors associated with the evolution of invasive whitefly species are almost unknown. This is partially due to the fact that few molecules are available for inferring the evolutionary history and genetic divergence of B. tabaci. So far, only the mitochondrial 16S DNA, cytochrome oxidase 1 and the nuclear ribosomal intergenic spacer have been explored [12,15]. In this study, we have identified nearly 3,585 ortholog gene pairs and determined the average sequence divergence between MEAM1 and MED to be 0.83% for the CDS. This is much higher than the reported mean 0.45% divergence at the coding region between human and  chimpanzee [55,56]. The gene divergence at the non-coding regions is even more obvious for 5'UTR (1.66%) and 3'UTR (1.43%) regions -nearly 1.5 times the divergence between the 5'UTR (1.12%) and 3'UTR (0.86%) regions of human and chimpanzee [56]. In addition, it is likely that our analyses might have underestimated the level of divergence between the two species. The orthologous genes were identified with very high stringency (reciprocal best match and the same unique Swissprot hit) which may have filtered out genes that have diversified in response to selection and no longer exhibit similarity significant enough to be identified with BLAST [39]. This strictness could have biased our data set toward more conserved sequences. Altogether, these results indicate that despite high-sequence identity in their CDS, the MEAM1 and MED species have diverged substantially between their transcriptomes. The level of sequence divergence provides additional support to the proposition that MEAM1 and MED are two species [15], in addition to the evidence of reproductive isolation [35,36]. Previous studies have shown that MEAM1 and MED differ in many life history parameters, such as mating behavior, insecticide resistance and host plant utilization [25,31,35,57]. Interestingly, we have identified a number of divergent sequences which might contribute to these biological differences between the two species. For example, MEAM1 and MED have different capacity to utilize various host plants and weeds [32,58]. In this study, we found that a number of genes involved in sugar and amino acid utilization are highly divergent, such as alpha-trehalose-phosphate synthase, oligopeptide transporter, aminopeptidase and galactose oxidase. Further functional characterization and comparison of these enzymes in MEAM1 and MED might reveal the molecular mechanisms underneath those differences. We also noticed that a couple of genes involved in insecticide resistance, such as cytochrome P450 4C1 and glutathione S-transferase, are clearly divergent between MEAM1 and MED. These differences might be responsible for the higher insecticide resistance of the MED species [25,54]. Nonetheless, the functions of those proteins in driving the evolution and divergence of the MEAM1 and MED species need to be further characterized.
The higher level of similarity observed in the CDS could be attributable to the presence of purifying natural selection on the CDS of the genes. To estimate the extent to which selection affects protein-coding DNA sequences, we calculated the number of nucleotide substitutions that change amino acids (Ka) and the number of substitutions that do not (Ks). Among the 3,585 pairs of transcript compared, the average Ka/Ks ratio is 0.225. Surprisingly, this ratio is remarkably similar to the average Ka/Ks ratio of rat and mouse coding region (0.19) and Ka/Ks ratio of human and chimpanzee (0.22) [56,59]. This is unexpected because the effective population size is much larger in whiteflies than in rodents and humans, which should have resulted in more effective selection against deleterious variants [60]. The ratio of Ka/Ks is a good indicator of selective pressure and has been used to identify proteincoding genes under positive and purifying selection [61].  Of the 1,161 orthologous pairs for which Ka and Ks could be calculated, 24 have Ka/Ks > 1, suggesting that those sequences might play important roles in the speciation and adaptive evolution of the whitefly. While the genes we have earmarked through this study are suitable subjects for future research, more rigorous statistical tests of positive selection, using multiple sequence samples, are required to confirm current results as well as to detect specific codons undergoing adaptive change.
Major advances in transcriptomics have become feasible in non-model organisms as a result of technology developments in next-generation sequencing. This study dramatically increased the number of genes from the MEAM1 whitefly (previously referred to as B biotype) [62]. Together with the MED whitefly transcriptome [41], we can provide an initial estimate of the number of transcribed genes in the whitefly. By reciprocal best match, we have identified 24,945 homologous sequences between MEAM1 and MED. Those 24,945 sequences were separately sequenced and assembled during the Illumina sequencing suggesting strongly that they are valid transcripts. As many distinct sequences represent nonoverlapping portions of the same transcript, this number is probably an overestimation of the actual genes and can be considered as an upper-bound of our library. For gene annotation, 12,931 sequences of the MEAM1 transcriptome have significant Swissprot hits with a minimum E-value of 1 × 10 -5 and 20,824 MED sequences have a significant hit [41]. As many of these Swissprot hits are likely to be duplicates (e.g., two sequences blasting to different parts of the same gene), we analyzed the unique gene names identified during these Blast searches. Among these Swissprot hits, 8,872 unique gene names were identified during the Blast search of the MEAM1 transcriptome and 11,219 unique gene names were identified for the MED transcriptome (data not shown). Considering both sets of results, the lower-bound of genes for the whitefly transcriptome should be more than 8,872. In fact, many assembled sequences lack matches to public database because they are either too short to permit appropriate alignments or represent highly divergent genes. For example, about 20% of the genes in the pea aphid (Acyrthosiphon pisum), the closest relative of B. tabaci with a sequenced genome, showed no homology to other metazoan genes [46]. Therefore, it is reasonable to postulate that the minimum number of genes in the whitefly will exceed 11,000. Although a precise estimation of transcriptome coverage is unachievable without the full genome information, our collection of unique transcripts represents a substantial percentage of the genes from the genome of B. tabaci.

Conclusions
In this study, we have demonstrated that it is feasible to use Illumina sequencing to rapidly characterize multiple transcriptomes and compare their differences in an invasive non-model species. The level of sequence divergence is consistent with the previous proposition that MEAM1 and MED whiteflies are two species. Furthermore, we have identified hundreds of sequences showing high sequence divergence and found 24 genes under strong positive selection. The divergent sequences identified in this study will be an invaluable resource for studies of whitefly speciation, invasion, insecticide resistance and host plant utilization. To our knowledge, this is the first attempt using Illumina sequencing to study the transcriptome divergence of invasive species. We anticipate that this methodology holds great potential for the identification of genetic variation underlying the evolution of other invasive species.

Insect rearing and sample preparation
Cotton (Gossypium hirsutum cv. Zhe-Mian 1793) was cultivated to the 7-8 true-leaf stage for experiments. A pair of virgin adults of MEAM1 (mitochondrial cytochrome oxidase 1 gene GenBank accession no: GQ332577) were released onto cotton plant to oviposit and develop for five generations in a climate chamber at 27 ± 1°C, a photoperiod of 14 h light:10 h darkness and 70 ± 10% relative humidity [63]. The purity of the culture was monitored using the random amplified polymorphic DNA-polymerase chain reaction technique and the sequence of mitochondrial cytochrome oxidase 1 gene, which has been used widely to differentiate B. tabaci genetic groups [28]. Since the quantity of eggs was extremely low, a mixture of eggs and first to third instar nymphs were collected as one sample. The pupae were collected as another sample. For adults, individuals were collected from the culture using a glass tube (5 × 0.5 cm) and the sex was determined under a stereo microscope. Then the male and female adults were pooled separately into plastic tubes using an aspirator. Finally, these samples were frozen at -80°C until use.

RNA isolation and library preparation for transcriptome analysis
Total RNA was isolated separately from the four samples (egg & nymph, pupa, female adult, and male adult) using SV total RNA isolation system (Promega) according to the manufacturer's protocol [64]. RNA integrity was confirmed using the 2100 Bioanalyzer (Agilent Technologies) with a minimum RNA integrated number value of 8. mRNA was purified from 8 μg of total RNA (a mixture of RNA from egg & nymph, pupa, female adult and male adult at equal ratio) using oligo (dT) magnetic beads. The cDNA library for transcriptome sequencing was prepared using Illumina's kit following manufacturer's recommendations as described before [41].

Analysis of Illumina sequencing results
The cDNA library was sequenced at The Beijing Genome Institute (Shenzhen, China) on the Illumina sequencing platform (GAII). The size of the library is approximately 200 bp and both ends of the library are sequenced with a length of 75 bp. Image deconvolution and quality value calculations were performed using the Illumina GA pipeline 1.5. The raw reads were cleaned by removing adaptor sequences, empty reads and low quality sequences (reads with unknown sequences 'N'). The reads obtained were randomly clipped into 21 bp K-mers for assembly using de Bruijn graph and SOAPdenovo software [42]. After sequence assembly, the resultant contigs were joined into scaffolds using the read mate pairs. To obtain distinct gene sequences, the scaffolds were clustered using TGI Clustering tools with the default parameters [65]. Distinct sequences were used for Blastx search and annotation against the NCBI nr protein database using an E-value cut-off of 10 -5 . Functional annotation of GO terms http://www.geneontology.org was analyzed by Blast2go software. The data sets of Illumina sequencing are available at the NCBI Short Read Archive (SRA) with the accession number: SRX022878. The assembled sequences have been deposited in the NCBI's Transcriptome Shotgun Assembly (TSA) database under the accession number of HP643344 to HP701084 and can be searched using the GeneID listed in Additional file 1.

Identification of orthologous genes and prediction of the coding and untranslated regions
We used the bidirectional best hit method in MegaBLAST to identify genes that are putatively orthologs between MEAM1 and MED. Pairs of sequences that were each other's best hit and longer than 200 bp were retained as putative orthologs. Bidirectional best hit has been widely used to identify the orthologous genes between closely related species [9,40,44] and this approach has been found to outperform more complex orthology identification algorithms [43]. As de novo transcriptome assemblies generally struggle to differentiate members of gene families, using bidirectional best hit to identify orthologs may not completely exclude the orthologous paralogs. Therefore, these putative orthologous genes were further screened against the Swissprot database to remove potential paralogs. If two sequences are orthologous paralogs, during the Blast search, they probably will hit to different genes in the Swissprot database. Only pairs of sequences that mapped unambiguously to the same protein in Swissprot database with an E-value < 1 × 10 -5 were selected as orthologous genes. Alignments containing any frameshifts and indels were filtered. CDS of the orthologous genes were determined by BLASTx against all known proteins in Swissprot database using a threshold of 1 × 10 -5 . CDS with unexpected stop codon in the Blast hit region and shorter than 150 bp were removed. Start codon positions were determined by examination of the in-frame ATG codon present 30 bp upstream or downstream of the beginning of the aligned reference protein. The 5'UTR of each pair of orthologs was identified based on the results of CDS and start codon prediction. The stop codon positions were determined by examination of in-frame TAA, TAG and TGA motifs present within 30 bp of the stop codon of the reference protein. Similarly, the 3'UTR of orthologs was defined based on the CDS and stop codon prediction. To prevent false positive results, only UTR pairs with an Evalue < 1 × 10 -30 were selected for further analyses.

Sequence divergence analyses and estimation of substitution rates
For each pair of orthologs, 5'UTR, coding and 3'UTR regions were extracted respectively. For the CDS region, pair-wise alignments were generated for all the orthologous gene pairs based on protein sequences and backtranslated to DNA sequences for subsequent analysis. The CDS and UTR regions were aligned separately to each other with a MegaBlast algorithm and checked manually for errors. Alignments are available upon request. Because MEAM1 and MED are closely related, multiple substitutions at the same site are highly unlikely. Therefore, the sequence divergence was calculated by dividing the number of substitutions by the number of base pairs compared. The divergence was determined for the contexts of nondegenerate (nd), fourfold degenerate (4d), CpG and non-CpG [56]. The ratio of transitions over transversions (ts/tv) was determined for the 5'UTR, CDS and 3'UTR as well. Substitution rates were estimated separately for synonymous (Ks) and non-synonymous sites (Ka) using an approximate method implemented in the KaKs Calculator Version 1.2 [66]. Pair-wise approximate analyses were performed using the YN method [61]. Because the sequencing errors are distributed among synonymous and nonsynonymous sites at equal frequencies, they are not expected to influence the results of analyses [39].

Additional material
Additional file 1: Top BLASTx hits from NCBI nr database. BLASTx results against the NCBI nr database for all the distinct sequences with a cut-off E value above 1.0E-5 are shown.
Additional file 2: Histogram presentation of Gene Ontology (GO) classification of genes from the MEAM1 and MED whiteflies. The results are summarized in three main categories: biological process, cellular component and molecular function. The right y-axis indicates the number of genes in a category. The left y-axis indicates the percentage of a specific category of genes in that main category. GO analysis showed that the distributions of gene functions for MEAM1 and MED whiteflies are similar.
Additional file 3: List of the orthologous gene pairs between MEAM1 and MED. The length of orthologous region, sequence homology and Swissprot, nr, KEGG annotations were shown as well.
Additional file 4: Summary of Ka and Ks values of each orthologous gene pairs. S-Substitutions: synonymous substitutions; N-Substitutions: nonsynonymous substitutions; Ka: nonsynonymous substitution rate; Ks: synonymous substitution rate.