Large-scale identification of odorant-binding proteins and chemosensory proteins from expressed sequence tags in insects

Background Insect odorant binding proteins (OBPs) and chemosensory proteins (CSPs) play an important role in chemical communication of insects. Gene discovery of these proteins is a time-consuming task. In recent years, expressed sequence tags (ESTs) of many insect species have accumulated, thus providing a useful resource for gene discovery. Results We have developed a computational pipeline to identify OBP and CSP genes from insect ESTs. In total, 752,841 insect ESTs were examined from 54 species covering eight Orders of Insecta. From these ESTs, 142 OBPs and 177 CSPs were identified, of which 117 OBPs and 129 CSPs are new. The complete open reading frames (ORFs) of 88 OBPs and 123 CSPs were obtained by electronic elongation. We randomly chose 26 OBPs from eight species of insects, and 21 CSPs from four species for RT-PCR validation. Twenty-two OBPs and 16 CSPs were confirmed by RT-PCR, supporting the efficiency and reliability of the algorithm. Together with all family members obtained from the NCBI (OBPs) or the UniProtKB (CSPs), 850 OBPs and 237 CSPs were analyzed for their structural characteristics and evolutionary relationships. Conclusions A large number of new OBPs and CSPs were found, providing the basis for a deeper understanding of these proteins. In addition, the conserved motif and evolutionary analyses provide some new insights into the evolution of insect OBPs and CSPs. Motif patterns fine-tune the functions of OBPs and CSPs, leading to minor differences in the binding of sex pheromones or plant volatiles in different insect Orders.


Background
Insects are highly successful terrestrial animals that have complicated communication systems. Insect odorant binding proteins (OBPs) play an important role in insect chemical communication. Until recently, it was believed that pheromones and other odors entering the aqueous lumen of chemosensilla were transported by OBPs to transmembrane odorant receptors (ORs) [1,2] and finally degraded by odorant degradation enzymes (ODEs) [3][4][5][6][7]. Recently, however, an active role of OBPs has been reported, in which a conformational change of the OBP, triggered by the presence of the ligand in its binding pocket, activates the membrane-bound receptor [8]. Insect OBPs, particularly in Lepidoptera, can be classified into two subfamilies, pheromone-binding proteins (PBPs) and general odorant binding proteins (GOBPs) [9]. OBPs are small, water-soluble proteins 120 to 150 amino acids long. A typical feature of OBPs is the presence of six positionally conserved cysteines. These six cysteines form three disulfide bridges, which play important roles in maintaining the protein tertiary structure. Another essential criterion is an acceptable similarity in protein sequence (e-value of BLAST analysis) with other family members. Insect chemosensory proteins (CSPs) represent another gene family suggested to have similar properties in binding and transporting pheromones and other ligands. Insect CSPs are smaller than OBPs, with about 100-120 amino acids, and bear no sequence similarity to OBPs. CSPs have only four conserved cysteines linked by disulfide bridges between neighboring residues [10] and are better conserved than OBPs across species [11].
Numerous efforts have been made to obtain the sequences of insect OBPs [9,12-20] and CSPs [14,21-28] by direct cloning, which normally involves designing degenerate primers based on conserved protein sequences, amplifying the fragment and obtaining the full-length sequences by Rapid Amplification of cDNA Ends (RACE). Thanks to the completion of genome sequencing projects for several insect species, large-scale gene discovery is now possible using bioinformatics. By searching available genome sequences, Hekmat-Scafe et al. found 51 OBP genes in Drosophila melanogaster and a new subfamily of OBPs [29]; Maleszka et al. showed that Apis mellifera has only 21 OBP genes [14]; Zhou et al. identified 66 putative OBPs in Aedes aegypti and 11 additional sequences in Anopheles gambiae by developing a specific algorithm [30]. By comparative genomic analysis of the OBP families in 12 Drosophila genomes, Vieira et al. identified 595 OBP genes and found that purifying selection governs the evolution of the OBP family [31]. In 2006, Zhou et al. conducted a comprehensive search for CSP genes in insect genomes and ESTs and identified 74 putative CSP genes from 22 insect species [32]. Gong et al. performed a genome-wide analysis based on the conserved cysteine residues and similarity to CSPs in other insects, finding 20 candidate CSPs in the silkworm [33]. However, genome searching for new genes is limited to a few insect species, as genome sequences are not available for most insects. Fortunately, an increasing number of insect expressed sequence tags (ESTs) are being deposited in the dbEST database of the National Center for Biotechnology Information (NCBI). Insect ESTs are a valuable resource that has not been fully exploited for mining new OBP or CSP genes. Pugalenthi et al. developed a new algorithm using a Regularized Least Squares Classifier (RLSC) to predict OBPs with a high accuracy of 97.7%. This approach could be used to identify novel OBPs that have low similarity to known ones [34]. Recently, Zhou et al. used a Motif-Search algorithm to screen putative OBPs in the silkworm and found 13 OBP-like genes, far fewer than in fruit flies and mosquitoes [35].
Here, we develop a computational pipeline to identify OBP and CSP genes from insect ESTs of 54 species across eight Orders, including Blattaria, Coleoptera, Diptera, Hemiptera, Hymenoptera, Lepidoptera, Orthoptera and Phthiraptera. In total, 117 new OBPs and 129 new CSPs were found, of which 38 genes from eight species were experimentally validated by RT-PCR. In addition, the conserved cysteine patterns, motif patterns and phylogenetic relationships of known OBPs and CSPs were analyzed.

Identification of new OBP and CSP genes from insect ESTs
We collected 752,841 insect ESTs from the dbEST [36] and constructed a local database for further analysis. The ESTs are from 54 insect species that cover eight Orders of Insecta. We searched for OBPs and CSPs with a computational pipeline as detailed in Figure 1. In total, 2,380 ESTs were found to satisfy the strict criteria, and they yielded 142 OBPs from 38 species and 177 CSPs from 37 species. Of these genes, more than 80% of OBPs (117) and 70% of CSPs (129) have not been reported before (Table 1 and the additional files).
In some insects, more than 10 OBPs or CSPs were identified. For example, 29 new OBPs were predicted in Diabrotica virgifera and 10 in Nasonia giraulti. Fifteen CSPs were predicted in Solenopsis invicta, of which 14 have not been reported before, and 10 new CSP genes were found in Gryllus bimaculatus. However, fewer than five OBPs or CSPs were identified in most species. We plotted the number of identified OBPs or CSPs against the total number of ESTs in each species and could not find any clear relationship (data not shown).
Figure 1. The computational pipeline used to identify insect OBPs and CSPs from expressed sequence tags. The accession numbers of OBPs and CSPs used in this analysis are listed in Additional File 6.

Conserved cysteines pattern
The presence of conserved cysteines is a typical feature of OBPs and CSPs. We therefore analyzed the cysteine patterns (C-patterns) of OBPs and CSPs in different Orders (Table 2). Generally, there were no major differences between Orders, except for the presence in Diptera of a subclass of OBPs, the C-plus OBPs, which contain eight conserved cysteines. In the typical C-pattern, there were three amino acids between the second and third cysteines in all OBPs, while eight residues were present between the fifth and sixth cysteines in most insect OBPs. The numbers of amino acids between the other three pairs of neighboring cysteines were rather variable. To evaluate the variability in the distance between each pair of neighboring cysteines, we calculated the coefficients of variation (Table 3). In most insects, the distance between the fourth and fifth cysteines was the most variable. However, in Hymenoptera, the distance between the first and second cysteines was the most variable, with a coefficient of variation of 11.66. The highest variations were found in the OBPs of Diptera. By contrast, the C-patterns of CSPs were much more conserved.
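To illustrate the variability measure, the coefficient of variation of a set of inter-cysteine distances is the standard deviation divided by the mean. The sketch below assumes that the value is reported as a percentage and that the sample standard deviation is used; the text states neither choice, and the distances are invented for illustration only.

```python
from statistics import mean, stdev


def coefficient_of_variation(distances):
    """Return the coefficient of variation (in percent) of a list of distances.

    Assumptions (not stated in the paper): sample standard deviation
    and scaling to percent.
    """
    if len(distances) < 2:
        raise ValueError("need at least two distances")
    m = mean(distances)
    if m == 0:
        raise ValueError("mean distance is zero")
    return 100.0 * stdev(distances) / m


# Hypothetical numbers of residues between the first and second
# conserved cysteines in five sequences of one Order.
example_distances = [26, 27, 30, 26, 29]
cv = coefficient_of_variation(example_distances)
```

A larger value indicates that the spacing between that pair of cysteines varies more among the sequences of an Order.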

Motif-pattern analysis
Conserved motifs are important elements of functional domains. We used the MEME server to discover conserved motifs in OBPs and CSPs [37]. The full-length sequences of OBPs and CSPs, either collected from the database or newly predicted in this work, were used for motif analysis. The parameters used in this and all other motif predictions of this study were: minimum width = 6, maximum width = 10, maximum number of motifs to find = 8. As a result, eight motifs were found for both CSPs and OBPs. Only five motifs were present in more than 50% of OBPs, while all eight motifs were present in more than 50% of CSPs (Figure 2).
Since a large number of OBP genes have been reported in species of Lepidoptera, we carried out a motif-pattern analysis of GOBPs and PBPs to compare these two subfamilies. The GOBPs and PBPs were combined into one set of sequences and then submitted to the MEME server. Although both GOBPs and PBPs have the same eight motifs, the motif patterns were quite different (Figure 3). The seventh motif was located at the C-terminus of all 41 tested PBPs, but appeared at the N-terminus of 12 out of 20 GOBPs. Only six GOBPs shared the same motif pattern with PBPs. Interestingly, one GOBP lacked the fifth motif and one had two copies of the seventh motif.
When the GOBP sequences of both lepidopteran and dipteran insects were combined into one set of sequences for motif analysis, we also found that the motif patterns were different between lepidopteran and dipteran GOBPs (Figure 4). Of the eight motifs in the Lepidoptera, only the second and seventh were found in most dipteran GOBPs. The first, third and eighth motifs appeared in only one dipteran GOBP. Interestingly, the motif patterns of PBPs were also different between the Lepidoptera and Hymenoptera. Similarly, motif patterns of lepidopteran and dipteran PBPs were analyzed by combining the PBP sequences of both lepidopteran and dipteran insects into one set of sequences. The order of the eight motifs in the Lepidoptera was 7-3-2-4-5-8-6-1, whereas it was 3-7-4-2-5-1-8-6 in the Hymenoptera (Figure 5). Furthermore, one PBP lacked one motif and two PBPs lacked five motifs in the Hymenoptera. These differences may imply functional differences of OBPs in different Orders. It should be noted that the motifs found by the MEME server are not comparable when different sets of sequences are used for analysis. Thus, the motifs in different figures (Figures 2, 3, 4 and 5) should not be compared with one another, since we used different input sequences.
Table note: X, any amino acid. The C-pattern in Diptera includes two types (typical 6-C and atypical 8-C patterns), which are listed separately. The accession numbers of OBPs and CSPs used in this analysis are listed in Additional File 6.

Phylogenetic analysis of OBPs and CSPs
The neighbor-joining trees were inferred by MEGA4.0 using the p-distance amino acid model with 1000 bootstrap replicates [38]. In the evolutionary tree for GOBPs and PBPs, these two subfamilies were mainly clustered by Order, indicating that most genes appeared after the diversification of the Orders (Figure 6). This is consistent with the existence of an Order-specific motif pattern as described above, suggesting that most GOBP and PBP genes have evolved recently. However, the situation is different for CSPs. Although lepidopteran CSPs were mainly clustered as an independent group, some of them are in the same clade as CSPs from other Orders, suggesting that some CSPs are ancient, whereas others appeared after the diversification of Orders (Figure 7).

Experimental validation of identified OBPs and CSPs
Most predicted full-length OBPs or CSPs were assembled from several ESTs. To validate the reliability of the computational pipeline, we randomly chose 26 OBPs from eight species and 22 CSPs from four species for RT-PCR validation. To cover as much of each sequence as possible, the primers were designed at both ends of the transcripts assembled by the CAP3 software. As a result, 22 OBPs and 16 CSPs were successfully amplified by RT-PCR. The PCR results were confirmed by sequencing (Figure 8). Most validated OBPs and CSPs contain intact ORFs.

Discussion
With only a few insect genomes sequenced, expressed sequence tags (ESTs) are a good resource for new gene discovery and expression profile analysis. As the cost of sequencing rapidly decreases, an abundance of insect ESTs has become available, particularly in recent years, providing an opportunity to discover new OBPs and related genes on a large scale. In this study, more than 100 new OBPs and CSPs were found from insect ESTs, suggesting that this approach is effective.
Although more than 10 OBP or CSP genes were found in some insects, fewer than five OBPs were identified in most species. Generally, there is no correlation between the number of identified genes and the number of ESTs. Though some OBPs and CSPs are ubiquitous or expressed in nonsensory organs, both classes of proteins are believed to be abundant in the antennae and other chem-  [35]. Generally, the motifs analyzed by Order, including the new OBP genes found in our work, are similar to those in Zhou's report. This supports the view that the C-patterns of both OBP and CSP genes are highly conserved.
As most known PBP and GOBP genes have been identified in the Lepidoptera, we conducted a MEME motif analysis to compare these two subfamilies of OBP genes. Interestingly, all PBPs have an identical MEME motif pattern, 6-1-2-8-3-4-5-7, though they are more divergent than GOBPs at the protein-sequence level. GOBPs show four different motif patterns, the most common one being 7-6-1-2-8-3-4-5. To the best of our knowledge, this is the first report of a motif difference between the GOBP and PBP subfamilies. This difference in the motif pattern might imply a functional difference between PBPs and GOBPs. It also hints that GOBP genes might have broad functions. Generally, PBPs bind and transport sex pheromones, while GOBPs are involved in sensing plant volatiles. A recent report by Zhou et al. shows that BmorGOBP2 in B. mori can also bind a sex pheromone component (bombykol) [35]. Although sex pheromones in moths are species-specific, their chemical structures are similar, consisting of a hydrocarbon chain that contains an oxygenated functional group (ester, alcohol, aldehyde or epoxide) [39]. Thus, it is reasonable that PBPs from different insects have an identical motif pattern. By contrast, GOBPs can bind both plant volatiles and sex pheromones, which display a broad diversity in chemical structure. We argue that this is the reason why GOBPs have divergent motif patterns.

Figure 3. Motif analysis of Lepidoptera PBPs and GOBPs.
In addition, we found that the C-patterns are similar, whereas the motif patterns are different among diverse Orders. We reason that the C-pattern is the key structure of OBPs and CSPs and should therefore be highly conserved, whereas motif patterns fine-tune the functions of OBPs and CSPs, leading to minor differences in the binding of sex pheromones or plant volatiles in different insect Orders.

Conclusion
In conclusion, our results indicate that the computational pipeline we used in this study is efficient and reliable in identifying new OBP and CSP genes with insect EST resources. The large number of the newly found OBPs and CSPs in our study provides the basis for functional studies of these proteins. In addition, analysis of protein sequences showed that there is generally no major difference in C-patterns of OBPs or CSPs between different insect Orders, whereas conserved motif patterns are quite different between insect Orders and between the GOBPs and PBPs in Lepidoptera. Together with the evolutionary analysis, the results provide some new insights into the differentiation and evolution of insect OBPs and CSPs.

Insects
The cotton aphid (Aphis gossypii), peach aphid (Myzus persicae), brown plant hopper (Nilaparvata lugens) and pea aphid (Acyrthosiphon pisum) were collected from the campus of Nanjing Agricultural University. The Asiatic migratory locust (Locusta migratoria) was bought from an insect rearing factory in Shandong province, China. The Ameri-   redundant protein sequences (nr) were downloaded from the FTP server of NCBI. In total, 290 CSP sequences were retrieved from the UniProtKB [40,41].

Computational pipeline for gene discovery
The computational pipeline is shown in Figure 1. The sequences of known OBPs and CSPs were used to search a local database of insect ESTs using the program TBLASTN [42] (version 2.2.17) with an e-value cutoff of 10.0. To obtain more information, the BLAST hits were used as queries to search the local EST database using the BLASTN [42] program (e-value = 1.0e-20). The ESTs meeting the criteria were collected. After removal of identical sequences by Perl scripts, the remaining sequences were assembled with the CAP3 software (version date: 08/29/02) [43]. The assembled sequences were then used as queries to search against the non-redundant protein sequences (nr) with the BLASTX program (default parameters) [42]. We kept those sequences whose BLASTX hits belong to the PBP_GOBP [44] or OS-D [10,12,45] family as putative OBP or CSP genes.
The deduced protein sequences were further confirmed by searching the Pfam database with the default parameter (e-value = 1.0) [46].
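The Perl scripts used to remove identical sequences are not shown in the source. As a rough illustration of what such a step could look like, the Python sketch below parses FASTA text and keeps the first record for each distinct sequence. The function names, the case-insensitive comparison and the example records are all hypothetical.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_identical(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Hypothetical ESTs: est2 is identical to est1 apart from letter case.
fasta_text = """>est1
ATGGCTAAC
TGA
>est2
atggctaactga
>est3
ATGCCCTGA
"""
unique_records = remove_identical(parse_fasta(fasta_text))
```

Only exact duplicates are dropped here; overlapping but non-identical ESTs are left for the assembly step.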
Accession numbers of all OBP or CSP sequences used for C-pattern analysis, motif analysis and phylogenetic analysis are listed in Additional File 6.

C-pattern analysis
The protein sequences of OBPs and CSPs were aligned using ClustalX [47] (version 1.83) with default gap-penalty parameters to locate six or four conserved cysteines, and only those sequences with six (for OBP) or four (for CSP) conserved cysteines were used for C-pattern analysis. The number of amino acids between cysteines was counted separately.
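The counting step can be sketched as follows, assuming the alignment columns of the conserved cysteines have already been located (for example by inspecting the ClustalX alignment). Gaps are ignored so that only real residues are counted. The function names, the example alignment row and the `C-Xn-C` output notation are illustrative assumptions.

```python
def cysteine_spacings(aligned_seq, cys_columns):
    """Count residues between consecutive conserved cysteines.

    aligned_seq: one row of a multiple alignment, with '-' for gaps.
    cys_columns: 0-based alignment columns of the conserved cysteines.
    """
    cols = sorted(cys_columns)
    for c in cols:
        if aligned_seq[c].upper() != "C":
            raise ValueError(f"column {c} is not a cysteine")
    spacings = []
    for left, right in zip(cols, cols[1:]):
        between = aligned_seq[left + 1:right]
        # Count only real residues, not alignment gaps.
        spacings.append(sum(1 for ch in between if ch != "-"))
    return spacings


def c_pattern(spacings):
    """Format spacings as a pattern string such as C-X6-C-X2-C."""
    return "C" + "".join(f"-X{n}-C" for n in spacings)


# Hypothetical aligned fragment with four conserved cysteines.
row = "MKC--AAAAAACGGCT-TTC"
columns = [2, 11, 14, 19]
spacings = cysteine_spacings(row, columns)
pattern = c_pattern(spacings)
```

Collecting these spacings for every sequence of an Order gives the per-position distributions from which a C-pattern can be summarized.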

Motif analysis
According to the average length of known genes, predicted ORFs with more than 120 amino acids (aa) for OBPs and 100 aa for CSPs were regarded as intact ORFs. All OBP and CSP sequences with an intact ORF were used for motif discovery and pattern analysis. The parameters used for motif discovery were: minimum width = 6, maximum width = 10, maximum number of motifs to find = 8. Motif analysis was conducted using the MEME [48] (version 3.5.7) online server http://meme.sdsc.edu. The motifs identified in more than half of the input sequences with a p-value < 0.0001 were counted and viewed by WebLogo [49].
Figure 6. The evolutionary tree of PBP and GOBP.
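The two filters described above (ORF length and motif prevalence) can be sketched in a few lines. The tuple layout used for motif hits is a hypothetical stand-in for parsed MEME output, not the actual MEME file format, and the function names are assumptions.

```python
from collections import defaultdict

# Length thresholds stated in the text (ORFs must be longer than these).
MIN_ORF_LENGTH = {"OBP": 120, "CSP": 100}


def is_intact_orf(protein, family):
    """True if the predicted ORF is longer than the family threshold."""
    return len(protein) > MIN_ORF_LENGTH[family]


def widespread_motifs(hits, n_sequences, p_cutoff=1e-4):
    """Return motif ids found in more than half of the input sequences.

    hits: iterable of (sequence_id, motif_id, p_value) tuples
          (hypothetical layout for parsed motif-search results).
    """
    carriers = defaultdict(set)
    for seq_id, motif_id, p_value in hits:
        if p_value < p_cutoff:
            carriers[motif_id].add(seq_id)
    return sorted(m for m, seqs in carriers.items()
                  if len(seqs) > n_sequences / 2)


# Invented example with four input sequences and three motifs.
hits = [
    ("s1", 1, 1e-8), ("s2", 1, 1e-6), ("s3", 1, 1e-5),
    ("s1", 2, 1e-7), ("s2", 2, 1e-3),  # second hit fails the p-value cutoff
    ("s4", 3, 1e-9),
]
kept = widespread_motifs(hits, n_sequences=4)
```

In this example only motif 1 passes, because it is the only motif with significant hits in more than two of the four sequences.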

Phylogenetic analysis
To improve reliability, only those sequences that covered the region of the six cysteines (for OBPs) or four cysteines (for CSPs) were used in the phylogenetic analysis. In total, 114 OBP and 224 CSP sequences were used. The protein sequences were aligned by ClustalX (version 1.83) with default gap-penalty parameters. The evolutionary trees were constructed based on consensus sequences by the MEGA4.0 [38] program with neighbor-joining [50] phylogeny using the p-distance model. An unrooted tree was generated with 1000 bootstrap replications.
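For readers unfamiliar with the p-distance model, it is simply the proportion of aligned sites at which two sequences differ. The sketch below computes it for a small invented alignment. Skipping columns where either sequence has a gap (pairwise deletion) is an assumption here, since the gap treatment used in MEGA4.0 is not stated in the text.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are skipped
    (pairwise deletion; an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def distance_matrix(aligned):
    """Return {(name_x, name_y): p-distance} for every pair of sequences."""
    names = list(aligned)
    return {(x, y): p_distance(aligned[x], aligned[y])
            for i, x in enumerate(names) for y in names[i + 1:]}


# Invented three-sequence alignment.
aligned = {
    "seqA": "MKCLA-CD",
    "seqB": "MKCIAECD",
    "seqC": "MRCIA-CE",
}
matrix = distance_matrix(aligned)
```

A neighbor-joining tree is then built from such a pairwise distance matrix; that step is left to dedicated software.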

RNA extraction and cDNA synthesis
The whole bodies of cotton aphids, peach aphids and pea aphids were used for RNA extraction, whereas only the heads with antennae of brown plant hoppers and red fire ants were collected. For the American cockroach, Asiatic migratory locust and two-spotted cricket, the antennae were dissected and used for RNA extraction. The collected tissues were flash-frozen in liquid nitrogen and kept at -70°C until further use. Total RNA was extracted by homogenizing antennae or other tissues in Trizol™ reagent (Invitrogen, Carlsbad, CA, USA) or with the E.Z.N.A.® Total RNA Kit II (Omega) following the manufacturer's instructions. The cDNA template was synthesized with an Oligo(dT)18 anchor primer, using M-MLV reverse transcriptase (Invitrogen, Carlsbad, CA, USA) at 37°C for 50 min. The reactions were stopped by heating at 70°C for 15 min. Alternatively, we used AMV reverse transcriptase (Takara) at 42°C for 60 min, and stopped the reactions by cooling on ice for 5 min.

RT-PCR
Gene-specific primers spanning the ORFs of predicted OBP and CSP genes were designed using "Primer Premier 5.0" for RT-PCR validation. The sequences of these primers are listed in Additional File 7. PCR experiments were carried out in a PTC-200 (Bio-Rad, Waltham, MA, USA), and touchdown PCR reactions were performed under the following conditions: 94°C for 3 min; 20 cycles at 94°C for 50 sec, 65°C for 1 min, and 72°C for 50 sec, with a decrease of the annealing temperature of 0.5°C per cycle. This was followed by 15 cycles at 94°C for 50 sec, 55°C for 1 min, and 72°C for 50 sec, and a final incubation for 10 min at 72°C. The reactions were performed in 25 μl with 200-600 ng of single-stranded cDNA, 2.0 mM MgCl2, 0.5 mM dNTP, 0.4 μM of each primer and 1.25 U Taq polymerase or EX-Taq polymerase (Takara). PCR products were analyzed by electrophoresis on a 1.5% w/v agarose gel in TAE buffer (40 mmol/L Tris-acetate, 2 mmol/L Na2EDTA·H2O) and the resulting bands were visualized with ethidium bromide. DNA purification was performed using the AxyPrep™ PCR Cleanup Kit (Axygen). Purified products were sub-cloned into a T/A plasmid using the pGEM-T easy vector system (Promega) following the manufacturer's instructions. The plasmid DNA was used to transform competent DH5α or Top10 cells. Positive clones were checked by restriction enzyme digestion and PCR. Plasmid extraction was performed with the E.Z.N.A.™ Plasmid Mini kit (Omega). The PCR products were sequenced by Bioasia (Shanghai, China).
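The touchdown annealing schedule described above can be written out explicitly. The sketch below assumes that the first cycle anneals at 65°C and that the 0.5°C decrement applies from the second cycle onward, so that the twentieth touchdown cycle anneals at 55.5°C before the 15 cycles at 55°C; the text does not state which cycle receives the first decrement. The function name and parameter names are hypothetical.

```python
def touchdown_annealing_schedule(start=65.0, step=0.5, touchdown_cycles=20,
                                 final_temp=55.0, final_cycles=15):
    """List the annealing temperature (in °C) for every PCR cycle.

    Assumption: cycle 1 anneals at `start`, and each later touchdown
    cycle is `step` degrees cooler than the one before it.
    """
    touchdown = [start - step * i for i in range(touchdown_cycles)]
    return touchdown + [final_temp] * final_cycles


schedule = touchdown_annealing_schedule()
total_cycles = len(schedule)
```

Under these assumptions the program runs 35 cycles in total, with the annealing temperature falling steadily during the first 20 and held constant for the remaining 15.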