7TMRmine: a Web server for hierarchical mining of 7TMR proteins
© Lu et al; licensee BioMed Central Ltd. 2009
Received: 08 January 2009
Accepted: 19 June 2009
Published: 19 June 2009
Seven-transmembrane region-containing receptors (7TMRs) play central roles in eukaryotic signal transduction. Due to their biomedical importance, thorough mining of 7TMRs from diverse genomes has been an active target of bioinformatics and pharmacogenomics research. With the accelerating acquisition of diverse sequence information, the need for new and accurate 7TMR/GPCR prediction tools is paramount. Currently available and widely used protein classification methods (e.g., profile hidden Markov models) are highly accurate for assigning proteins to already known 7TMR subfamilies. However, these alignment-based methods are less effective for identifying remote similarities, e.g., identifying proteins from highly divergent or possibly new 7TMR families. In this regard, more sensitive (e.g., alignment-free) methods are needed to complement the existing protein classification methods. A better strategy would be to combine different classifiers, from more specific to more sensitive methods, to identify a broader spectrum of 7TMR protein candidates.
We developed a Web server, 7TMRmine, by integrating alignment-free and alignment-based classifiers specifically trained to identify candidate 7TMR proteins as well as transmembrane (TM) prediction methods. This new tool enables researchers to easily assess the distribution of GPCR functionality in diverse genomes or individual newly-discovered proteins. 7TMRmine is easily customized and facilitates exploratory analysis of diverse genomes. Users can integrate various alignment-based, alignment-free, and TM-prediction methods in any combination and in any hierarchical order. Sixteen classifiers (including two TM-prediction methods) are available on the 7TMRmine Web server. Not only can the 7TMRmine tool be used for 7TMR mining, but also for general TM-protein analysis. Users can submit protein sequences for analysis, or explore pre-analyzed results for multiple genomes. The server currently includes prediction results and the summary statistics for 68 genomes.
7TMRmine facilitates the discovery of 7TMR proteins. By combining prediction results from different classifiers in a multi-level filtering process, prioritized sets of 7TMR candidates can be obtained for further investigation. 7TMRmine can also be used as a general TM-protein classifier. Comparisons of TM and 7TMR protein distributions among 68 genomes revealed interesting differences in the evolution of these protein families among major eukaryotic phyla.
Seven-transmembrane-region containing receptors (7TMRs), often referred to as G protein-coupled receptors (GPCRs), constitute the largest receptor superfamily in vertebrates and other metazoans [1–3]. GPCRs, activated by a diverse array of ligands, are the central players in eukaryotic signal transduction and are involved in a wide variety of physiological processes. Mutations in genes encoding GPCRs are associated with major diseases (e.g., hypertension, cardiac dysfunction, depression, pain). Due to their biomedical importance, thorough mining of 7TMRs from diverse genomes is an active endeavor of bioinformatics and pharmacogenomics research. However, efforts to identify all member proteins in this superfamily from diverse genomes are hindered by their extreme sequence divergence. In order to facilitate more sensitive and thorough mining, many computational methods, both alignment-based and alignment-free classification methods, were developed particularly for these proteins.
Protein classification methods
Computational methods of predicting protein functions rely on detecting similarities among proteins. The majority of protein classification methods rely on alignment to known protein sequences to identify the similarities and to build various forms of models (e.g., regular expression patterns, protein fingerprints, position-specific scoring matrices, and profile hidden Markov models). However, generating reliable alignments of divergent candidate 7TMR sequences is practically impossible. Another disadvantage of alignment-based methods is that the resulting models are built only from known "positives" (protein sequences of interest) without incorporating information that discriminates positives from "negatives" (unrelated protein sequences). Consequently, these classifiers are affected by sampling bias, which is propagated and/or amplified during subsequent re-training. Alignment-free protein classification methods, in contrast, overcome these problems. Instead of alignments, various descriptors are extracted from each sequence (e.g., amino acid composition, dipeptide frequencies, and physico-chemical properties), and pattern recognition or multivariate statistical methods are trained to discriminate positive protein samples from negative samples.
Our recent comparative analyses showed that alignment-free classifiers are more sensitive to remote similarities than alignment-based profile hidden Markov model (profile HMM) methods [8–10]. They can also identify weak similarities from short subsequences. We also observed that these alignment-free classifiers perform better than profile-HMM methods when a sufficiently large training set is unavailable. For example, one alignment-free method was successfully used to identify extremely divergent 7TMRs (odorant and gustatory receptors) for the first time from the Drosophila melanogaster genome [11–13]. One disadvantage of alignment-free classifiers is their relatively high false-positive rate. Profile-HMM classifiers, on the other hand, are accurate in identifying members of well-established protein families with few false positives. Combining both approaches hierarchically provides greater sensitivity with fewer false positives.
Hierarchical classification strategy
Our study mining 7TMR protein candidates from the Arabidopsis thaliana genome showed the power of hierarchically combining multiple classifiers, including both traditional alignment-based and newer alignment-free methods. We identified 394 Arabidopsis thaliana proteins as 7TMR candidates and prioritized 54 of them for further investigation. More recently, Gookin et al. used a similar strategy, combining several methods hierarchically, and identified a small number of GPCR candidates from three plant genomes including A. thaliana. They showed that a subset of the Arabidopsis proteins predicted to be GPCR candidates can interact with the Arabidopsis G-protein α subunit (AtGPA1) in a yeast complementation assay.
In order to facilitate hierarchical identification of 7TMR proteins, we developed the Web server, 7TMRmine. 7TMRmine permits users to customize the integration of both alignment-based and alignment-free classifiers in any combination and order. 7TMRmine is a Web-based mining system as well as a database for 7TMR candidates from a growing collection of diverse genomes. It allows researchers to generate and explore prioritized lists of 7TMR candidates. It also allows researchers to examine the performance of various methods. Furthermore, 7TMRmine can be used for other transmembrane protein identification.
While all known GPCR proteins have seven transmembrane (TM) regions, an increasing number of alternative 'G protein-independent' signaling mechanisms are associated with some 7TM protein groups. For example, the plant-specific mildew resistance locus O (MLO) protein family is one of the most divergent 'GPCR' families [16, 17], and, not surprisingly, MLO's interaction with Gα has not been shown despite great effort (AM Jones and R Panstruga, unpublished data). Another problem is that none of the candidate plant GPCRs has been shown to activate the Gα subunit; therefore they do not fulfill the most important criterion for GPCR classification. A third problem is represented by the odorant receptor (OR) family in insects, another extremely diverged group of 7TM proteins. These proteins act independently of known G-protein-coupled second messenger pathways [18, 19]. With these problems acknowledged, it is no longer appropriate to label the entire 7TM protein group as GPCRs, because this group includes 'G protein-dependent' and 'G protein-independent' signaling proteins as well as putative scaffolds. Following the notation used in our previous study, we designate these proteins as candidate 7-transmembrane receptors (7TMRs), not GPCRs. Our goal here is to provide a tool capable of identifying the entire set of 7TMRs from diverse genomes. Having a comprehensive inventory of 7TMRs from diverse organisms will facilitate studies on the evolution of GPCRs and help address the functionality of the large number of orphan GPCRs, many of which are critical to human health.
Construction and content
Overview of the 7TMRmine Web server
The 7TMRmine Web server includes protein classifiers and the database of the classification results. The Web interface is developed in HTML, PHP, and Perl. The database is managed in MySQL. The user interface is available through standard Web browsers (tested with Safari, Firefox, and Internet Explorer). The Web server and all classifier programs run on the Linux operating system with the Apache HTTP server (tested on Red Hat Linux 9 and CentOS 4.2/5.1).
Fourteen classifiers (four alignment-based and ten alignment-free) were trained to identify 7TMR candidates and are included in the current 7TMRmine (Figure 2A):
The profile hidden Markov model (profile HMM) is an alignment-based classifier that provides a full probabilistic representation of protein families. The program package Sequence Alignment and Modeling System (SAM, version 3.5) [24, 25] is used for implementing profile HMMs. The expect values (E-values) for SAM are calculated based on a constant sample size of 30,000, regardless of the genome size. Therefore, the E-values can be directly compared between different genomes. Strope and Moriyama reported that when an E-value threshold of 0.05 was used, profile-HMM classifiers were nearly 100% accurate in identifying proteins belonging to the same 7TMR classes (within-class prediction). However, at the same E-value threshold, these classifiers performed much more poorly (70% or lower accuracy) in identifying distant 7TMRs (between-class prediction). Therefore, in 7TMRmine, we chose three E-value thresholds to provide different levels of identification stringency. They are listed as three different classifiers: SAM, SAM1, and SAM2. The SAM classifier uses the most stringent E-value threshold, E = 0.05. The SAM1 classifier uses E = 4.23 as the threshold, which is based on the highest E-value given to Arabidopsis MLOs (specifically, MLO3). The SAM2 classifier is the least stringent, with the threshold E = 6.52, which is obtained at the minimum error point based on the classification of the training set (total errors: 4 out of 2,030 training samples, with no false positives and 4 false negatives).
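The three stringency levels can be illustrated with a minimal Python sketch. The threshold values are those given in the text; the function name, the protein names, the example E-values, and the choice of an inclusive comparison (E-value less than or equal to the cutoff) are assumptions made for illustration and are not taken from the 7TMRmine implementation.

```python
# E-value cutoffs quoted in the text for the three SAM-based classifiers.
SAM_THRESHOLDS = {"SAM": 0.05, "SAM1": 4.23, "SAM2": 6.52}


def sam_calls(e_value):
    """Return, per classifier, whether a hit with this E-value is called positive.

    Assumption: a hit is positive when its E-value is at or below the cutoff.
    """
    return {name: e_value <= cutoff for name, cutoff in SAM_THRESHOLDS.items()}


# Hypothetical E-values for three proteins (illustrative only).
example_hits = [("protA", 1e-20), ("protB", 3.1), ("protC", 6.0)]
for protein, e_value in example_hits:
    print(protein, sam_calls(e_value))
```

A protein accepted by SAM is necessarily accepted by SAM1 and SAM2 as well, so the three classifiers form nested sets of decreasing stringency.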
GPCRHMM was developed by Wistrand et al. These authors constructed a compartmentalized HMM incorporating distinct loop length patterns and differences in amino acid composition between cytosolic loops, extracellular loops, and membrane regions, based on a diverse set of GPCR sequences. Their training set included eleven of the 13 PFAM GPCR protein families. They considered the remaining two divergent families, the Drosophila odorant receptor family 7tm_6 (PF02949) and the plant family Mlo (PF03094), to be outliers and excluded them from their training set. The sensitivity (against 1,706 positives obtained from GPCRDB [28, 29]) and false positive rate (against 1,071 negatives) of GPCRHMM are reported as 92.8% and 0–1.18%, respectively.
LDA, QDA, LOG, and KNN
These classifiers are parametric and non-parametric discrimination methods (linear, quadratic, and logistic discriminant analyses, as well as nonparametric K-nearest neighbor) described by Moriyama and Kim. These classifiers use amino acid composition and physico-chemical properties as sequence descriptors. For KNN classifiers, the number of neighbors, K, is chosen from 5, 10, 15, or 20, and the classifiers are designated KNN5, KNN10, KNN15, and KNN20, respectively. Based on a training set including 1,000 positives (obtained from GPCRDB) and 750 negatives, cross-validation tests showed that these methods have true positive rates of 97.7–98.7% and false positive rates of 2.9–3.6%. The S-PLUS statistical package version 8.1.1 for Linux (TIBCO Software Inc., Palo Alto, CA, USA) is used for classifier development and application.
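As an illustration of the K-nearest neighbor idea only, the following Python sketch classifies a descriptor vector by majority vote among its K closest training vectors. It is not the S-PLUS implementation used by the server; the Euclidean distance, the two-dimensional toy descriptors, and all names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def knn_predict(query, training, k=5):
    """Classify `query` by majority vote among its k nearest training points.

    `training` is a list of (descriptor_vector, label) pairs.
    Assumption: Euclidean distance between descriptor vectors.
    """
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy two-dimensional descriptors (hypothetical values, not real protein data).
toy_training = [
    ((0.10, 0.20), "7TMR"), ((0.20, 0.10), "7TMR"), ((0.15, 0.15), "7TMR"),
    ((0.80, 0.90), "other"), ((0.90, 0.80), "other"), ((0.85, 0.95), "other"),
]
print(knn_predict((0.20, 0.20), toy_training, k=3))
print(knn_predict((0.90, 0.90), toy_training, k=3))
```

Larger values of K, such as the 5, 10, 15, and 20 offered by the server, smooth the decision boundary at the cost of sensitivity to small clusters.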
SVM-AA and SVM-di
These classifiers are based on support vector machines (SVMs), learning machines that make binary classifications based on a hyperplane separating a remapped instance space. Amino acid composition (SVM-AA) and dipeptide frequencies (SVM-di) are used as the sequence descriptors. Strope and Moriyama reported that the true and false positive rates of SVM-AA are >96% and 4–6%, respectively. SVM-AA performed much better than the profile-HMM classifier in identifying distant 7TMRs (~90% accuracy by SVM-AA, compared with lower than 80% by profile HMMs), and similar accuracies were observed with SVM-AA even for short sub-sequences. Bhasin and Raghava used SVM-di for their GPCRpred classifier and reported 99.5% accuracy from cross-validation tests based on a training set including the five major 7TMR classes. We use SVMlight version 6.01 developed by Joachims [32, 33] for the SVM implementation with the radial basis function (rbf) kernel. We performed a grid analysis with five-fold cross-validation to obtain the optimal set of parameters (γ for the rbf kernel and the trade-off, C) for our training set. For SVM-AA and SVM-di, the values used were (γ, C) = (155, 0.5) and (417, 0.5291), respectively.
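The two descriptor types and the rbf kernel can be sketched in a few lines of Python. This is an illustrative sketch, not the code used to train the server's SVMs: the function names, the handling of non-standard residues (they are skipped), and the example sequences are assumptions. The kernel follows the usual radial basis form exp(-γ‖x − y‖²), with γ = 155 taken from the SVM-AA setting quoted above.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Return the 20 amino acid frequencies of `seq` (non-standard residues skipped)."""
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dipeptide_frequencies(seq):
    """Return the 400 overlapping dipeptide frequencies of `seq`.

    Pairs containing a non-standard residue are skipped.
    """
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)
             if seq[i] in AMINO_ACIDS and seq[i + 1] in AMINO_ACIDS]
    total = len(pairs)
    return [pairs.count(a + b) / total if total else 0.0
            for a, b in product(AMINO_ACIDS, repeat=2)]


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical short sequences, for illustration only.
x = aa_composition("MKTLLVAGAVLLAC")
y = aa_composition("MKSLLIAGAVFLAC")
print(rbf_kernel(x, y, gamma=155))
```

Because composition vectors sum to one and differ only slightly between related proteins, a large γ such as 155 is needed for the kernel to separate them.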
The PLS-ACC classifier uses partial least squares (PLS) regression with sequence descriptors based on the auto/cross-covariance transformation of amino acid properties. We use an R implementation [34, 35]: the PLS package (ver. 2.1-0) developed by Mevik and Wehrens [36, 37]. The classification was done using the threshold score, 0.4982, which was obtained at the minimum error point. PLS-ACC was found to perform better than profile-HMM classifiers and PSI-BLAST when training sets are small and also against short sub-sequences, consistently achieving better than 90% accuracy, whereas the accuracy of profile-HMM classifiers fluctuates and falls as low as 80%.
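A minimum-error-point threshold can be understood as the score cutoff at which the number of false positives plus false negatives on the training set is smallest. The Python sketch below shows one way to find such a cutoff; it assumes that higher scores indicate positives (as for a PLS score; for E-values the comparison would be reversed), considers only observed scores as candidate cutoffs, and uses made-up scores and labels.

```python
def minimum_error_threshold(scores, labels):
    """Return (threshold, errors) minimizing false positives + false negatives.

    A sample is called positive when its score is >= threshold.
    `labels` holds True for known positives and False for known negatives.
    """
    best = None
    for threshold in sorted(set(scores)):
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
        errors = false_pos + false_neg
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best


# Hypothetical training scores and labels (illustrative only).
toy_scores = [0.10, 0.30, 0.45, 0.55, 0.70, 0.90]
toy_labels = [False, False, True, False, True, True]
print(minimum_error_threshold(toy_scores, toy_labels))
```

When several cutoffs give the same number of errors, this sketch keeps the lowest one, which favors sensitivity over specificity.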
All classifiers except for GPCRHMM were trained using the dataset including 1,015 each of positive (GPCR) and negative (non-GPCR) sequences (these sequences are available on the 7TMRmine website). GPCR sequences were randomly sampled from GPCRDB (June 2006 release) [28, 29]. Only non-GPCR "Class Z (Archaeal/bacterial/fungal opsins)" sequences were excluded from sampling. Non-GPCR sequences were randomly sampled from UniProtKB/SwissProt (manually curated part of UniProt) [38, 39]. We manually examined this random-negative set to ensure that no known GPCR sequences were included.
Classifier performance against known proteins
In order to understand how these classifiers perform on actual 7TMR proteins, we tested them against the entire set of sequences obtained from GPCRDB [28, 29]. In Additional file 1, the percentage of positives identified by each classifier is summarized. GPCRDB includes one non-GPCR class, "Class Z: Archaeal/bacterial/fungal opsins", which includes bacteriorhodopsins, proteorhodopsins, and related fungal opsins. They are light-driven proton and chloride pumps. Although these proteins have 7TM regions, they are not GPCRs and are not involved in signal transduction. Therefore, we consider these proteins to be important negative test samples.
As shown in Additional file 1, the percentage of positives obtained by the classifiers varies depending on the GPCR class. Only Class A (Rhodopsin-like), frizzled/smoothened, and vertebrate taste receptors (T2R) are consistently identified at higher than 96% by every classifier. GPCRHMM completely missed insect odorant receptors and plant MLOs. This is because GPCRHMM was not trained on these proteins, as described earlier. Compared to alignment-based classifiers (SAM/SAM1/SAM2 and GPCRHMM), all alignment-free classifiers showed very high false positive rates (shown as % positives against Class Z). In order to reduce false positive rates, Moriyama et al. took the intersection of six selected classifiers (SVM-AA, SVM-di, PLS-ACC, LDA, QDA, and KNN20). As shown in Additional file 1, this strategy (called "6 class") reduced the false positive rate to ~6% without affecting the true positive rates. By taking the union of "6 class" and GPCRHMM as well as SAM2, we achieved the highest coverage for all GPCR classes without increasing the false positive rate. Additional file 1 also shows the classifier performance against the GPCR datasets from two organisms (Homo sapiens and D. melanogaster). Using the combination classifier "6 class + GPCRHMM + SAM2", nearly 100% of all known 7TMRs were recovered from these two genomes.
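The "6 class + GPCRHMM + SAM2" combination is a set operation: the intersection of the six alignment-free positive sets, followed by a union with the GPCRHMM and SAM2 positive sets. A minimal Python sketch follows; the classifier names come from the text, while the function name and the protein identifiers are hypothetical.

```python
SIX_CLASSIFIERS = ["SVM-AA", "SVM-di", "PLS-ACC", "LDA", "QDA", "KNN20"]


def combined_positives(predictions):
    """Return the '6 class + GPCRHMM + SAM2' candidate set.

    `predictions` maps each classifier name to the set of protein IDs it
    calls positive.
    """
    # "6 class": proteins called positive by all six alignment-free classifiers.
    six_class = set.intersection(*(predictions[name] for name in SIX_CLASSIFIERS))
    # Union with the two alignment-based classifiers.
    return six_class | predictions["GPCRHMM"] | predictions["SAM2"]


# Hypothetical predictions for five proteins p1..p5 (illustrative only).
toy_predictions = {
    "SVM-AA": {"p1", "p2", "p3"},
    "SVM-di": {"p1", "p2", "p4"},
    "PLS-ACC": {"p1", "p2"},
    "LDA": {"p1", "p2", "p3"},
    "QDA": {"p1", "p2", "p5"},
    "KNN20": {"p1", "p2", "p3"},
    "GPCRHMM": {"p4"},
    "SAM2": {"p1"},
}
print(sorted(combined_positives(toy_predictions)))
```

The intersection step suppresses false positives that are specific to a single alignment-free method, while the union step recovers proteins that only the alignment-based models detect.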
Transmembrane prediction methods
HMMTOP2.1 [40–42] and TMHMM2.0 are both HMM-based TM-prediction methods and are considered to be two of the best TM-prediction methods [e.g., [44, 45]]. Many secreted proteins contain short N-terminal signal peptides, which often have strongly hydrophobic segments; consequently, many TM-prediction methods misidentify these signal peptides as TM regions. Phobius [46, 47] addressed this problem by combining a signal peptide model, SignalP-HMM, with TMHMM, improving overall accuracy in detecting and differentiating proteins with signal peptides and proteins with TM segments.
We incorporated HMMTOP2.1 and Phobius in our classifier set. As shown in Figure 2A, users can set their own rules with the number of TM regions (from 0 to 15 or more) and the location of the N-terminus (inside or outside the cell). Proteins that satisfy these rules are identified as 'positives', and all others as 'negatives'. These options give users flexibility in mining transmembrane proteins. The topology of canonical GPCR proteins has seven TM-regions with the N-terminus located extracellularly. However, no single TM-prediction method predicts exactly seven TM-regions for all known 7TMRs. Among known GPCR sequences in the GPCRDB, fewer than 85% are predicted to have exactly seven TM-regions by either Phobius or HMMTOP2.1 (Additional file 2). Choosing a TM number ranging from five to nine, for example, covered 99% of the known GPCRs. In addition to the prediction accuracy problem, some divergent 7TMRs may have their N-termini located intracellularly (Additional file 2; also see [49, 50]). Furthermore, test sequences may include partial proteins. Therefore, users are advised to use a range in the number of predicted TM regions for identification purposes.
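Such a TM rule amounts to a range check on the predicted number of TM regions plus an optional constraint on the N-terminus location. The Python sketch below illustrates this; the function name, the "in"/"out" labels, and the example predictions are assumptions for illustration and do not reflect the server's internal data format.

```python
def passes_tm_rule(n_tm, n_terminus, min_tm=5, max_tm=9,
                   allowed_n_terminus=("in", "out")):
    """Return True if a predicted topology satisfies the user's TM rule.

    n_tm: predicted number of TM regions.
    n_terminus: "in" (intracellular) or "out" (extracellular).
    The default range of five to nine follows the example given in the text.
    """
    return min_tm <= n_tm <= max_tm and n_terminus in allowed_n_terminus


# Hypothetical predictions: (protein ID, number of TM regions, N-terminus location).
toy_topologies = [("p1", 7, "out"), ("p2", 6, "in"), ("p3", 12, "out"), ("p4", 1, "in")]

# Relaxed rule: five to nine TM regions, either orientation.
relaxed = [pid for pid, n_tm, nterm in toy_topologies if passes_tm_rule(n_tm, nterm)]
# Canonical GPCR topology: exactly seven TM regions, extracellular N-terminus.
canonical = [pid for pid, n_tm, nterm in toy_topologies
             if passes_tm_rule(n_tm, nterm, 7, 7, ("out",))]
print(relaxed, canonical)
```

The relaxed rule tolerates prediction errors and partial sequences, whereas the canonical rule gives a smaller, higher-confidence set.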
Genes encoding transmembrane proteins constitute 20–30% of both prokaryotic and eukaryotic genomes [51–54]. Therefore, TM-region prediction is in general one of the most important steps for analyzing proteins. Inclusion of TM-prediction options adds flexibility to explore beyond just 7TM proteins. For this purpose, the users may elect to use only TM-prediction options with any number of levels (Figure 2A). In this regard, 7TMRmine works as a flexible analysis tool for examining TM protein candidates from entire genomes.
User submitted sequences
For user-submitted protein sequences, all classifiers are run first and the identification results are displayed for users to review. If the user chooses to perform further hierarchical analysis, the option interface similar to Figure 2A is presented, allowing the user to build and perform their own hierarchical 7TMR mining for any sequences.
Utility and discussion
7TMR protein mining from the Arabidopsis thaliana genome
7TMR proteins form the largest receptor superfamily in vertebrates and other metazoans (e.g., ~800 in human, ~1,000 in Caenorhabditis elegans). However, few 7TMR candidates are reported in plants and fungi. Only 22 candidate Arabidopsis 7TMRs have been described to date (a more recent review is found in Moriyama and Opiyo, in press 65). We explored the possibility of finding more divergent groups of 7TMR candidates from the A. thaliana genome using both alignment-free and alignment-based methods. For the 7TMRmine server, we updated all classifiers using a larger training dataset, and added new classifiers (SAM1, SAM2, GPCRHMM, and Phobius). The server also includes a newer release of the A. thaliana genome (TAIR8; 32,690 proteins excluding those shorter than 35 amino acids; 27,066 proteins further excluding predicted alternative-splicing products).
Table: Number of 7TMR candidates predicted from 27,066 A. thaliana proteins.a
Classifier | Number of 7TMR candidates
(row label missing) | 39 (1)b
SAM (E = 0.05) | 10 (10)
SAM1 (E = 4.23) | 24 (16)
SAM2 (E = 6.52) | 28 (16)
(row label missing) | 1,123 (22)* [1,393]
(row label missing) | 191 (20)
(row label missing) | 1,207 (22)* [1,499]
(row label missing) | 197 (13)
Phobius & HMMTOP: 5–10TM | 969 (22)* [1,212]
Phobius & HMMTOP: 7TM | 103 (11)
By using either Phobius or HMMTOP, ~200 of 27,066 A. thaliana proteins (or ~250 of 32,690 including alternative-splicing products) were predicted to have exactly seven TM-regions. 103 proteins (134 including alternative-splicing products) were predicted to be 7-TM proteins by both methods. The 22 (or 27 including alternative-splicing products) known A. thaliana 7TMR proteins were predicted to have between six and eight TM-regions by Phobius and between seven and ten TM-regions by HMMTOP. Only 11 of the 22 proteins (or 13 of 27 including alternative-splicing products) are predicted to have exactly seven TM-regions by both methods. Note that GTG1 and GTG2 are predicted to have eight or nine TM-regions (one of the two GTG2 alternative-splicing products, AT4G27630.1, is predicted to have only five TM-regions by both methods). Of the 27,066 A. thaliana proteins, 969 proteins have between five and ten TM-regions by both methods. The range "5–10TMs" (by HMMTOP) was also used by Moriyama et al. as the range giving the best coverage against the entire GPCR dataset for the hierarchical classification.
As shown in this example, users can choose classifiers in any combination and in any number of levels (currently up to six) to create their own hierarchical filtering system. By using less strict methods at the earlier levels and more strict methods at the later levels, the 7TMRmine Web server facilitates the prioritization of the 7TMR protein candidate set and the generation of a protein set of manageable size for further investigation. The union and intersection of positive or negative sets can be easily obtained as shown in Figure 3C. Figure 3D shows an example of the list of all classifier prediction results. Protein sequences as well as the classification results can be downloaded from this page for further analysis. For example, protein sequences can be submitted to GPCR classification tools such as GPCRsIdentifier, GPCRsclass and GPCRpred [31, 61], and GPCRTree for further family classification.
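Multi-level filtering of this kind can be sketched as a sequence of set intersections, where each level keeps only the proteins that its classifier (or combination of classifiers) calls positive. The Python sketch below is an illustration under stated assumptions, not the server's code: the function name, the level contents, and the protein identifiers are hypothetical, and only the limit of six levels is taken from the text.

```python
MAX_LEVELS = 6  # current limit stated in the text


def hierarchical_filter(candidates, levels):
    """Apply each filtering level in order and record the survivors.

    `levels` is a list of sets; each set holds the protein IDs called
    positive at that level. Returns one survivor set per level.
    """
    if len(levels) > MAX_LEVELS:
        raise ValueError("at most six levels are supported")
    remaining = set(candidates)
    history = []
    for positives in levels:
        remaining = remaining & positives  # keep only proteins positive at this level
        history.append(remaining)
    return history


# Hypothetical example: a permissive first level followed by stricter ones.
all_proteins = {"p1", "p2", "p3", "p4", "p5"}
toy_levels = [
    {"p1", "p2", "p3", "p4"},  # e.g. a relaxed TM-number rule
    {"p2", "p3", "p4"},        # e.g. an alignment-free combination
    {"p3"},                    # e.g. a stringent alignment-based classifier
]
history = hierarchical_filter(all_proteins, toy_levels)
print([sorted(level) for level in history])
```

Recording the survivors after every level makes it possible to see where each candidate is lost, which helps when choosing how strict each level should be.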
Distribution of transmembrane proteins among eukaryotic genomes
Distribution of 7TMR proteins among eukaryotic genomes
7TMR candidates in the A. thaliana, rice, and poplar genomes
As described earlier, from the A. thaliana genome, the 16 high-ranking proteins identified by Gookin et al. as well as 15 of the 22 known 7TMRs are found in the 132 proteins (156 including predicted alternative-splice forms) obtained from the intersection of the "6 classifiers" AND "7–8 TM" predictions (see Venn diagrams for A. thaliana in Figure 7). All six MLOs among the remaining seven known 7TMRs are included in the 49 proteins (57 including predicted alternative-splice forms) obtained from the intersection of "5–10 TM" AND "SAM2+GPCRHMM" (Venn diagrams including "5–10 TM" are available on the website). The remaining HHP5 as well as GTG1 are predicted as positives by both "5–10 TM" and "6 classifiers" but by neither GPCRHMM nor SAM2. GTG2 is not predicted by "6 classifiers" because PLS-ACC does not identify it as positive. Based on these results, we consider the 162 proteins (excluding predicted alternative-splicing forms; obtained by combining the 132 proteins identified by both "6 classifiers" AND "7–8 TM" with the 49 proteins identified by both "SAM2+GPCRHMM" AND "5–10 TM") to be the most likely 7TMR candidates from the A. thaliana genome (see Additional file 3). Similar lists generated for Oryza sativa (rice) and Populus trichocarpa (California poplar) include 84 and 153 candidates, respectively (see Additional files 4 and 5). High-ranking protein sets identified by Gookin et al. included 13 rice and 20 poplar proteins. Of their rice GPCR candidates, six proteins are included in our intersection set of "7–8 TM" AND "6 classifiers", and two proteins are included in the intersection set of "5–10 TM" AND "SAM2+GPCRHMM". Two of the remaining five proteins are included in the intersection set of "5–10 TM" AND "6 classifiers". Three are not identified by any of these criteria due to negative predictions by SVM-AA (for three proteins) and SVM-di (one protein). Among the 20 poplar GPCR candidates claimed by Gookin et al., 17 proteins are included in our intersection set of "7–8 TM" AND "6 classifiers". Among the three proteins not included in our list, two proteins are predicted to be negatives by SVM-AA.
7TMRmine facilitates the discovery of extremely divergent 7TMR proteins from diverse genomes. By combining prediction results from various classifiers, including alignment-based and alignment-free classifiers as well as transmembrane prediction methods, in a multi-level filtering process, prioritized sets of 7TMR candidates can be obtained for further investigation. Furthermore, 7TMRmine can be used as a general transmembrane-protein classifier. Statistics provided for 68 pre-analyzed genomes revealed interesting differences in the evolution of these protein families among major eukaryotic phyla.
Availability and requirements
7TMRmine is freely available from http://bioinfolab.unl.edu/emlab/7tmr using any current Web browser.
The authors thank Qiaomei Zhong for developing the early prototype of the database and Web interface. We also thank Dr. Stephen O. Opiyo and Pooja K. Strope for training PLS, SAM, and SVM classifiers. This work was in part funded by Nebraska EPSCoR Women in Science, NSF EPSCoR Type II grant, and the grant number R01LM009219 from the National Library of Medicine to E.N.M., and the NIGMS (GM65989-01), the DOE (DE-FG02-05er15671), and the NSF (MCB-0209711, MCB-0723515) to A.M.J. The authors have no conflicts of interest that are directly relevant to the content of this article.
- Bjarnadóttir TK, Gloriam DE, Hellstrand SH, Kristiansson H, Fredriksson R, Schiöth HB: Comprehensive repertoire and phylogenetic analysis of the G protein-coupled receptors in human and mouse. Genomics. 2006, 88 (3): 263-273. 10.1016/j.ygeno.2006.04.001.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ: Initial sequencing and analysis of the human genome. Nature. 2001, 409 (6822): 860-921. 10.1038/35057062.
- Thomas JH, Robertson HM: The Caenorhabditis chemoreceptor gene families. BMC Biol. 2008, 6: 42. 10.1186/1741-7007-6-42.
- Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist CJ: The PROSITE database. Nucleic Acids Res. 2006, 34 (Database issue): D227-230. 10.1093/nar/gkj063.
- Attwood TK, Bradley P, Flower DR, Gaulton A, Maudling N, Mitchell AL, Moulton G, Nordle A, Paine K, Taylor P, Uddin A, Zygouri C: PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res. 2003, 31 (1): 400-402. 10.1093/nar/gkg030.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A: The Pfam protein families database. Nucleic Acids Res. 2008, 36 (Database issue): D281-288.
- Moriyama EN, Kim J: Protein family classification with discriminant function analysis. Genome Exploitation: Data Mining the Genome. Edited by: Gustafson JP, Shoemaker R, Snape JW. 2005, New York: Springer, 121-132.
- Opiyo SO, Moriyama EN: Protein family classification with partial least squares. J Proteome Research. 2007, 6 (2): 846-853. 10.1021/pr060534k.
- Strope PK, Moriyama EN: Simple alignment-free methods for protein classification: a case study from G-protein-coupled receptors. Genomics. 2007, 89 (5): 602-612. 10.1016/j.ygeno.2007.01.008.
- Clyne PJ, Warr CG, Carlson JR: Candidate Taste Receptors in Drosophila. Science. 2000, 287: 1830-1833. 10.1126/science.287.5459.1830.
- Clyne PJ, Warr CG, Freeman MR, Lessing D, Kim JH, Carlson JR: A novel family of divergent seven-transmembrane proteins: Candidate odorant receptors in Drosophila. Neuron. 1999, 22 (2): 327-338. 10.1016/S0896-6273(00)81093-4.
- Kim J, Moriyama EN, Warr CG, Clyne PJ, Carlson JR: Identification of novel multi-transmembrane proteins from genomic databases using quasi-periodic structural properties. Bioinformatics. 2000, 16 (9): 767-775. 10.1093/bioinformatics/16.9.767.
- Moriyama EN, Strope PK, Opiyo SO, Chen Z, Jones AM: Mining the Arabidopsis thaliana genome for highly-divergent seven transmembrane receptors. Genome Biol. 2006, 7: R96. 10.1186/gb-2006-7-10-r96.
- Gookin TE, Kim J, Assmann SM: Whole proteome identification of plant candidate G-protein coupled receptors in Arabidopsis, rice, and poplar: computational prediction and in-vivo protein coupling. Genome Biol. 2008, 9 (7): R120. 10.1186/gb-2008-9-7-r120.
- Devoto A, Hartmann HA, Piffanelli P, Elliott C, Simmons C, Taramino G, Goh CS, Cohen FE, Emerson BC, Schulze-Lefert P, Panstruga R: Molecular phylogeny and evolution of the plant-specific seven-transmembrane MLO family. J Mol Evol. 2003, 56 (1): 77-88. 10.1007/s00239-002-2382-5.
- Devoto A, Piffanelli P, Nilsson I, Wallin E, Panstruga R, von Heijne G, Schulze-Lefert P: Topology, subcellular localization, and sequence diversity of the Mlo family in plants. J Biol Chem. 1999, 274 (49): 34993-35004. 10.1074/jbc.274.49.34993.
- Sato K, Pellegrino M, Nakagawa T, Vosshall LB, Touhara K: Insect olfactory receptors are heteromeric ligand-gated ion channels. Nature. 2008, 452 (7190): 1002-1006. 10.1038/nature06850.
- Wicher D, Schafer R, Bauernfeind R, Stensmyr MC, Heller R, Heinemann SH, Hansson BS: Drosophila odorant receptors are both ligand-gated and cyclic-nucleotide-activated cation channels. Nature. 2008, 452 (7190): 1007-1011. 10.1038/nature06861.
- MySQL. [http://www.mysql.com]
- Huala E, Dickerman AW, Garcia-Hernandez M, Weems D, Reiser L, LaFond F, Hanley D, Kiphart D, Zhuang M, Huang W, Mueller LA, Bhattacharyya D, Bhaya D, Sobral BW, Beavis W, Meinke DW, Town CD, Somerville C, Rhee SY: The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res. 2001, 29 (1): 102-105. 10.1093/nar/29.1.102.
- The Arabidopsis Information Resource (TAIR). [http://www.arabidopsis.org]
- Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14 (9): 755-763. 10.1093/bioinformatics/14.9.755.
- Hughey R, Krogh A: Hidden Markov models for sequence analysis: Extension and analysis of the basic method. Comput Appl Biosci. 1996, 12 (2): 95-107.
- SAM: Sequence Alignment and Modeling System. [http://compbio.soe.ucsc.edu/sam.html]
- Karchin R, Karplus K, Haussler D: Classifying G-protein coupled receptors with support vector machines. Bioinformatics. 2002, 18 (1): 147-159. 10.1093/bioinformatics/18.1.147.
- Wistrand M, Kall L, Sonnhammer EL: A general model of G protein-coupled receptor sequences and its application to detect remote homologs. Protein Sci. 2006, 15 (3): 509-521. 10.1110/ps.051745906.
- GPCRDB: Information system for G protein-coupled receptors (GPCRs). [http://www.gpcr.org/7tm_old/]
- Horn F, Bettler E, Oliveira L, Campagne F, Cohen FE, Vriend G: GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res. 2003, 31 (1): 294-297. 10.1093/nar/gkg103.
- Vapnik VN: The Nature of Statistical Learning Theory. 1999, New York: Springer-Verlag, 2nd edition.
- Bhasin M, Raghava GP: GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Nucleic Acids Res. 2004, 32 (Web Server issue): W383-389. 10.1093/nar/gkh416.
- Joachims T: Making large-Scale SVM Learning Practical. Advances in Kernel Methods – Support Vector Learning. Edited by: Schölkopf B, Burges C, Smola A. 1999, Cambridge: MIT Press, 169-184.
- SVMlight. [http://svmlight.joachims.org/]
- The R Project for Statistical Computing. [http://www.r-project.org/]
- R Development Core Team: R: A Language and Environment for Statistical Computing. 2008, Vienna, Austria.
- Mevik B-H, Wehrens R: The pls Package: Principal Component and Partial Least Squares Regression in R. Journal of Statistical Software. 2007, 18 (2): 1-24.
- pls. [http://mevik.net/work/software/pls.html]
- UniProt. [http://www.uniprot.org/]
- The UniProt Consortium: The universal protein resource (UniProt). Nucleic Acids Res. 2008, 36 (Database issue): D190-195.
- Tusnády GE, Simon I: Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol. 1998, 283 (2): 489-506. 10.1006/jmbi.1998.2107.
- Tusnády GE, Simon I: The HMMTOP transmembrane topology prediction server. Bioinformatics. 2001, 17 (9): 849-850. 10.1093/bioinformatics/17.9.849.
- HMMTOP: Prediction of transmembrane helices and topology of proteins. [http://www.enzim.hu/hmmtop]
- Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001, 305 (3): 567-580. 10.1006/jmbi.2000.4315.
- Chen CP, Kernytsky A, Rost B: Transmembrane helix predictions revisited. Protein Sci. 2002, 11 (12): 2774-2791. 10.1110/ps.0214502.
- Cuthbertson JM, Doyle DA, Sansom MS: Transmembrane helix prediction: a comparative evaluation and analysis. Protein Eng Des Sel. 2005, 18 (6): 295-308. 10.1093/protein/gzi032.
- Phobius: A combined transmembrane topology and signal peptide predictor. [http://phobius.cbr.su.se/]
- Käll L, Krogh A, Sonnhammer EL: Advantages of combined transmembrane topology and signal peptide prediction – the Phobius web server. Nucleic Acids Res. 2007, 35 (Web Server issue): W429-432. 10.1093/nar/gkm256.
- Bendtsen JD, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides: SignalP 3.0. J Mol Biol. 2004, 340 (4): 783-795. 10.1016/j.jmb.2004.05.028.
- Hsieh M-H, Goodman HM: A novel gene family in Arabidopsis encoding putative heptahelical transmembrane proteins homologous to human adiponectin receptors and progestin receptors. J Exp Bot. 2005, 56 (422): 3137-3147. 10.1093/jxb/eri311.
- Benton R, Sachse S, Michnick SW, Vosshall LB: Atypical membrane topology and heteromeric function of Drosophila odorant receptors in vivo. PLoS Biol. 2006, 4 (2): e20. 10.1371/journal.pbio.0040020.
- Wallin E, von Heijne G: Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci. 1998, 7 (4): 1029-1038.
- Stevens TJ, Arkin IT: Do more complex organisms have a greater proportion of membrane proteins in their genomes?. Proteins. 2000, 39 (4): 417-420. 10.1002/(SICI)1097-0134(20000601)39:4<417::AID-PROT140>3.0.CO;2-Y.
- Liu J, Rost B: Comparing function and structure between entire proteomes. Protein Sci. 2001, 10 (10): 1970-1979. 10.1110/ps.10101.
- Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA: Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucleic Acids Res. 2006, 34 (3): 1066-1080. 10.1093/nar/gkj494.
- Jones AM, Assmann SM: Plants: the latest model system for G-protein research. Embo Rep. 2004, 5 (6): 572-578. 10.1038/sj.embor.7400174.
- Pandey S, Assmann SM: The Arabidopsis putative G protein-coupled receptor GCR1 interacts with the G protein alpha subunit GPA1 and regulates abscisic acid signaling. Plant Cell. 2004, 16 (6): 1616-1632. 10.1105/tpc.020321.
- Grigston JC, Osuna D, Scheible WR, Liu C, Stitt M, Jones AM: d-Glucose sensing by a plasma membrane regulator of G signaling protein, AtRGS1. FEBS Lett. 2008, 582 (25–26): 3577-3584. 10.1016/j.febslet.2008.08.038.
- Pandey S, Nelson DC, Assmann SM: Two novel GPCR-type G proteins are abscisic acid receptors in Arabidopsis. Cell. 2009, 136 (1): 136-148. 10.1016/j.cell.2008.12.026.
- Maeda Y, Ide T, Koike M, Uchiyama Y, Kinoshita T: GPHR is a novel anion channel critical for acidification and functions of the Golgi apparatus. Nat Cell Biol. 2008, 10 (10): 1135-1145. 10.1038/ncb1773.
- Gao QB, Wang ZZ: Classification of G-protein coupled receptors at four levels. Protein Eng Des Sel. 2006, 19 (11): 511-516. 10.1093/protein/gzl038.
- Bhasin M, Raghava GP: GPCRsclass: a web tool for the classification of amine type of G-protein-coupled receptors. Nucleic Acids Res. 2005, 33 (Web Server issue): W143-147. 10.1093/nar/gki351.
- Davies MN, Secker A, Halling-Brown M, Moss DS, Freitas AA, Timmis J, Clark E, Flower DR: GPCRTree: online hierarchical classification of GPCR function. BMC Res Notes. 2008, 1: 67. 10.1186/1756-0500-1-67.
- Bargmann CI: Neurobiology of the Caenorhabditis elegans Genome. Science. 1998, 282 (5396): 2028-2033. 10.1126/science.282.5396.2028.
- Katinka MD, Duprat S, Cornillot E, Metenier G, Thomarat F, Prensier G, Barbe V, Peyretaillade E, Brottier P, Wincker P, Delbac F, El Alaoui H, Peyret P, Saurin W, Gouy M, Weissenbach J, Vivares CP: Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature. 2001, 414 (6862): 450-453. 10.1038/35106579.
- Moriyama EN, Opiyo SO: Bioinformatics of Seven Transmembrane Receptors in Plant Genomes. Integrated G Proteins Signaling in Plants. Edited by: Yalovsky S, Baluska F, Jones A. Springer-Verlag, in press.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.