For the B. tabaci complex, 28 cryptic species have been delineated based on a number of phylogenetic analysis and reproductive isolation studies; however, the evolutionary origin and basal taxa have not been ascertained. Furthermore, the genetic factors associated with the evolution of invasive whitefly species are almost unknown. This is partially due to the fact that few molecules are available for inferring the evolutionary history and genetic divergence of B. tabaci. So far, only the mitochondrial 16S DNA, cytochrome oxidase 1 and the nuclear ribosomal intergenic spacer have been explored [12, 15]. In this study, we have identified nearly 3,585 ortholog gene pairs and determined the average sequence divergence between MEAM1 and MED to be 0.83% for the CDS. This is much higher than the reported mean 0.45% divergence at the coding region between human and chimpanzee [55, 56]. The gene divergence at the non-coding regions is even more obvious for 5'UTR (1.66%) and 3'UTR (1.43%) regions - nearly 1.5 times the divergence between the 5'UTR (1.12%) and 3'UTR (0.86%) regions of human and chimpanzee . In addition, it is likely that our analyses might have underestimated the level of divergence between the two species. The orthologous genes were identified with very high stringency (reciprocal best match and the same unique Swissprot hit) which may have filtered out genes that have diversified in response to selection and no longer exhibit similarity significant enough to be identified with BLAST . This strictness could have biased our data set toward more conserved sequences. Altogether, these results indicate that despite high-sequence identity in their CDS, the MEAM1 and MED species have diverged substantially between their transcriptomes. The level of sequence divergence provides additional support to the proposition that MEAM1 and MED are two species , in addition to the evidence of reproductive isolation [35, 36].
Previous studies have shown that MEAM1 and MED differ in many life history parameters, such as mating behavior, insecticide resistance and host plant utilization [25, 31, 35, 57]. Interestingly, we have identified a number of divergent sequences which might contribute to these biological differences between the two species. For example, MEAM1 and MED have different capacity to utilize various host plants and weeds [32, 58]. In this study, we found that a number of genes involved in sugar and amino acid utilization are highly divergent, such as alpha-trehalose-phosphate synthase, oligopeptide transporter, aminopeptidase and galactose oxidase. Further functional characterization and comparison of these enzymes in MEAM1 and MED might reveal the molecular mechanisms underneath those differences. We also noticed that a couple of genes involved in insecticide resistance, such as cytochrome P450 4C1 and glutathione S-transferase, are clearly divergent between MEAM1 and MED. These differences might be responsible for the higher insecticide resistance of the MED species [25, 54]. Nonetheless, the functions of those proteins in driving the evolution and divergence of the MEAM1 and MED species need to be further characterized.
The higher level of similarity observed in the CDS could be attributable to the presence of purifying natural selection on the CDS of the genes. To estimate the extent to which selection affects protein-coding DNA sequences, we calculated the number of nucleotide substitutions that change amino acids (Ka) and the number of substitutions that do not (Ks). Among the 3,585 pairs of transcript compared, the average Ka/Ks ratio is 0.225. Surprisingly, this ratio is remarkably similar to the average Ka/Ks ratio of rat and mouse coding region (0.19) and Ka/Ks ratio of human and chimpanzee (0.22) [56, 59]. This is unexpected because the effective population size is much larger in whiteflies than in rodents and humans, which should have resulted in more effective selection against deleterious variants . The ratio of Ka/Ks is a good indicator of selective pressure and has been used to identify protein-coding genes under positive and purifying selection . Of the 1,161 orthologous pairs for which Ka and Ks could be calculated, 24 have Ka/Ks > 1, suggesting that those sequences might play important roles in the speciation and adaptive evolution of the whitefly. While the genes we have earmarked through this study are suitable subjects for future research, more rigorous statistical tests of positive selection, using multiple sequence samples, are required to confirm current results as well as to detect specific codons undergoing adaptive change.
Major advances in transcriptomics have become feasible in non-model organisms as a result of technology developments in next-generation sequencing. This study dramatically increased the number of genes from the MEAM1 whitefly (previously referred to as B biotype) . Together with the MED whitefly transcriptome , we can provide an initial estimate of the number of transcribed genes in the whitefly. By reciprocal best match, we have identified 24,945 homologous sequences between MEAM1 and MED. Those 24,945 sequences were separately sequenced and assembled during the Illumina sequencing suggesting strongly that they are valid transcripts. As many distinct sequences represent nonoverlapping portions of the same transcript, this number is probably an overestimation of the actual genes and can be considered as an upper-bound of our library. For gene annotation, 12,931 sequences of the MEAM1 transcriptome have significant Swissprot hits with a minimum E-value of 1 × 10-5 and 20,824 MED sequences have a significant hit . As many of these Swissprot hits are likely to be duplicates (e.g., two sequences blasting to different parts of the same gene), we analyzed the unique gene names identified during these Blast searches. Among these Swissprot hits, 8,872 unique gene names were identified during the Blast search of the MEAM1 transcriptome and 11,219 unique gene names were identified for the MED transcriptome (data not shown). Considering both sets of results, the lower-bound of genes for the whitefly transcriptome should be more than 8,872. In fact, many assembled sequences lack matches to public database because they are either too short to permit appropriate alignments or represent highly divergent genes. For example, about 20% of the genes in the pea aphid (Acyrthosiphon pisum), the closest relative of B. tabaci with a sequenced genome, showed no homology to other metazoan genes . Therefore, it is reasonable to postulate that the minimum number of genes in the whitefly will exceed 11,000. Although a precise estimation of transcriptome coverage is unachievable without the full genome information, our collection of unique transcripts represents a substantial percentage of the genes from the genome of B. tabaci.