SCMMTP: identifying and characterizing membrane transport proteins using propensity scores of dipeptides
© Liou et al. 2015
Published: 9 December 2015
Identifying putative membrane transport proteins (MTPs) and understanding the transport mechanisms involved remain important challenges for the advancement of structural and functional genomics. However, transporter characteristics are mainly derived from MTP crystal structures, and MTPs are hard to crystallize. Therefore, it is desirable to develop bioinformatics tools for the effective large-scale analysis of available sequences to identify novel transporters and characterize such transporters.
This work proposes a novel method (SCMMTP) based on the scoring card method (SCM) using dipeptide composition to identify and characterize MTPs from an existing dataset containing 900 MTPs and 660 non-MTPs, which are separated into a training dataset consisting of 1,380 proteins and an independent dataset consisting of 180 proteins. SCMMTP estimates the propensity scores of amino acids and dipeptides to be MTPs. The SCMMTP training and test accuracy levels reached 83.81% and 76.11%, respectively. The test accuracy of a support vector machine (SVM), a complicated classification method with a low possibility for biological interpretation, using the position-specific scoring matrix (PSSM) as a protein feature is 80.56%; thus SCMMTP is comparable to SVM-PSSM. To identify MTPs, SCMMTP is applied to three datasets: 1) human transmembrane proteins, 2) a photosynthetic protein dataset, and 3) a human protein database. The finding that MTPs show α-helix-rich structures agrees with previous studies. MTPs use residues with low hydration energy. It is hypothesized that, after filtering substrates, the hydrated water molecules need to be released from the pore regions.
SCMMTP yields estimated propensity scores of amino acids and dipeptides to be MTPs, which can be used to identify novel MTPs and characterize transport mechanisms for use in further experiments.
Membrane transport proteins (MTPs), or transporters, span lipid bilayers and form gates for hydrophilic solutes to cross hydrophobic membranes. Transporters are essential in many biological processes, such as nutrient uptake, metabolite secretion, ion homeostasis, signaling, energy transduction, immune system recognition processes, osmoregulation, and other physiological and developmental processes in the cell. Currently, several commercial drugs target ion channels or carrier proteins, indicating that transporter proteins have tremendous therapeutic potential [3, 4].
MTPs are primarily involved in the transportation of amino acids, cations, anions, sugars, proteins, mRNAs, electrons, water, and hormones. According to the transporter nomenclature panel of the International Union of Biochemistry and Molecular Biology, MTPs can be classified into six groups based on their mode of transport, energy coupling mechanisms, molecular phylogeny, and substrate specificity. MTPs are thought to constitute 3-16% of the total number of open reading frames in prokaryotic genomes. Identifying putative MTPs and understanding their transport characteristics are important challenges in the advancement of structural and functional genomics. MTPs have been identified by proteomics strategies, such as absorbance spectroscopy, gel electrophoresis, metal-affinity columns and shift assay, chromatography, mass spectroscopy, and combined spectroscopic studies. However, two main features of MTPs make them difficult to identify. First, transporters are usually minor components of cell membranes. Protein engineers often use E. coli or yeast as hosts to overexpress MTPs, but the overexpressed MTPs tend to be toxic to the host or are expressed as unfolded, inactive proteins. Second, most MTPs contain a series of hydrophobic residues, which causes them to be under-represented in two-dimensional electrophoresis. Bioinformatics tools are needed for effective large-scale analysis of available sequences to identify novel transporters, direct further experiments and provide information about transport mechanisms.
Recently, several machine learning methods have been proposed for predicting membrane transporters from amino acid sequence information. Lin et al. used a support vector machine (SVM) to predict transporter families from the transporter classification system. Gromiha et al. analyzed the amino acid compositions in MTPs and used different classifiers implemented in the WEKA program to discriminate channel/pore proteins, electrochemical transporters, and active transporters. Li et al. developed a general approach combining homology-based and machine learning methods, using transporter sequence features learned from well-curated proteomes as guides, to predict major transporter families/subfamilies defined in the transporter classification database. Ou et al. analyzed the amino acid composition of transporters and developed a radial basis network-based method for classifying these proteins into channel/pore proteins, electrochemical transporters, active transporters, and six transporter families with amino acid properties and position-specific scoring matrix (PSSM) profiles. Mishra et al. contributed to the substrate specificity annotations of transporters by developing SVM models that discriminate between amino acid, anion, cation, electron, protein/mRNA, sugar, and other transporters. Transporter characteristics can be investigated based on the crystal structures of the transport proteins and their transported substrates. Sauguest et al. used pentameric ligand-gated ion channels to examine ion permeation. Hibbs and Gouaux identified permeation and activation principles in an anion receptor. Zhou et al. and Kopfer et al. used potassium channels to determine the relationship between ion coordination and hydration, as well as the Coulomb knock-on mechanism. However, the goal of most studies has been to predict channel families; only a few studies have constructed a general predictor to predict whether proteins are channel proteins.
Although these predictors can provide a range of prediction accuracy levels, independent statistical work is needed to examine MTP characteristics, the understanding of which is mostly based on crystal structures and also to some extent on sequences.
SCM can provide insight into protein function prediction based on interpretable propensity scores [15–18]. To create the SCMMTP, a previously published dataset was used. The proposed method estimated the propensity scores of 400 individual dipeptides and used the difference between dipeptide compositions of positives and negatives to predict putative transporters. The method was further optimized using an Intelligent Genetic Algorithm (IGA). The propensity scores of 20 natural amino acids were derived from the dipeptide scores and used to identify informative physicochemical properties (PCPs) of membrane transporters. The SCM method achieved a 10-fold cross-validation (10-CV) accuracy of 81.12% and a test accuracy of 76.11%. Several PCPs from the AAindex database or from some PCP studies have been useful for describing transporters. First, the "hydropathy index" scale (KYTJ820101) was found to precisely reflect the conformational characteristics of transporters. Second, MTPs are expected to have higher preferences for α-helices outside, rather than inside, of protein molecules (WERD780104). Finally, the channels are generally composed of residues with low hydration energy levels. This occurs because after hydrated solutes pass through the membrane via MTPs, the channel must release the water molecules so that additional substances can be transported.
Materials and methods
In this work, we propose a novel SCMMTP for the identification and characterization of MTPs based on the propensity scores of dipeptides and amino acids. The MTP characterization includes the analysis of protein PCPs, the visualization of the MTP propensity scores and a PCP mining method. This method utilized the propensity scores of amino acids, allowing for the analysis of the MTPs. Figure 1 presents a flowchart of the experimental design.
We established five datasets based on different sources of transporter proteins from various species: MTP-TRN1380, MTP-TST180, HTS380, HMPAS4942 and PSPGO649. MTP-TRN1380 and MTP-TST180 were used as the training data for the SCMMTP classifier and as the independent test set, respectively. HTS380 and HMPAS4942 both contained human transporter proteins. PSPGO649 was composed of photosynthetic proteins. HTS380, HMPAS4942 and PSPGO649 were used for the identification of MTPs. The numbers of transporters and non-transporters in each dataset are summarized in Additional File 1: Table S1.
MTP-TRN1380 and MTP-TST180
Mishra et al. provided a dataset which included 10,780 transporter, carrier, and channel proteins collected from the UniProt database. Mishra et al. removed fragmented sequences and sequences annotated with more than two substrates, those based solely on sequence similarity, and sequences which exhibited a similarity exceeding 70%. The primary dataset contained 900 transporters and 660 non-transporters; the non-transporters (the negative dataset) were randomly chosen from all proteins in UniProt, excluding the 10,780 MTPs. The 1,560 sequences in our dataset were divided into training and test datasets. The training dataset, named MTP-TRN1380, consisted of 780 transporters and 600 non-transporters, while the test dataset, named MTP-TST180, included the remaining 120 transporters and 60 non-transporters.
Huang et al. gathered 5,176 human transporter proteins from SwissProt. Huang et al. divided the sequences into "confirmed", "potential" and "non-transporter" by manually checking four annotations in SwissProt, i.e., protein names, gene names, function and sequence similarities. After reducing the sequence identity with a threshold of 25%, HTS had 380 transporters, 144 potential transporters and 2,815 non-transporters. The 380 transporters, named HTS380, were used for the identification of MTPs.
Kim and Yi built a human protein database containing 36,585 proteins with 5,386 transporters. This study considered the transporter members of this database, excluding sequences with uncommon amino acids. Finally, 4,942 transporters were collected as HMPAS4942.
Most photosynthetic proteins are membrane-embedded and take part in electron transport reactions of photosynthesis. This process involves the transport of electrons, protons and other solutes via proton complex. This work adopted the PSPGO dataset from the previous study to identify MTPs. The sequence identity was reduced to 25%. PSPGO contained 649 photosynthetic proteins as the positive dataset and 649 randomly chosen sequences from non-photosynthetic proteins as the negative dataset. In this work, we used only the positive part of PSPGO, called PSPGO649.
SCM-based MTP classifier (SCMMTP)
The Scoring Card Method (SCM) has already been used to analyze various protein functions [15–18] from sequence information. In contrast to the SVM classifier, SCM offers greater simplicity and interpretability by using the propensity scores of amino acids and dipeptides to identify and characterize protein function. The current work proposes an SCM-based method (SCMMTP) to predict MTPs. The SCMMTP implementation corresponds to the original SCM algorithm without any major adjustments, as follows:
Construct a training dataset, MTP-TRN1380, consisting of 780 MTPs and 600 non-MTPs.
where i indicates the residues, and S'_i, S_i, Max_i and Min_i denote the scaled target dipeptide compositions, original target dipeptide compositions, maximum dipeptide compositions and minimum dipeptide compositions, respectively, of the corresponding residues.
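The scoring idea described above (dipeptide compositions, a scaled scoring card, and a weighted-sum score) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes min-max scaling of the composition differences to the range 0 to 1000 (chosen to match the reported score range), omits the IGA optimization step entirely, and all function names and the toy sequences are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 dipeptides


def dipeptide_composition(seq):
    """Return the fraction of each of the 400 dipeptides in a sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for a, b in zip(seq, seq[1:]):
        if a + b in counts:  # skip pairs containing non-standard residues
            counts[a + b] += 1
            total += 1
    return {dp: (c / total if total else 0.0) for dp, c in counts.items()}


def mean_composition(seqs):
    """Average the dipeptide compositions over a set of sequences."""
    comps = [dipeptide_composition(s) for s in seqs]
    return {dp: sum(c[dp] for c in comps) / len(comps) for dp in DIPEPTIDES}


def initial_scoring_card(positives, negatives, top=1000.0):
    """Min-max scale (positive - negative) composition differences to [0, top].

    Assumption: the 0..1000 range and min-max scaling are illustrative choices.
    """
    pos, neg = mean_composition(positives), mean_composition(negatives)
    diff = {dp: pos[dp] - neg[dp] for dp in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    span = (hi - lo) or 1.0  # avoid division by zero if all differences are equal
    return {dp: top * (d - lo) / span for dp, d in diff.items()}


def scm_score(seq, card):
    """Weighted sum of dipeptide compositions and their propensity scores."""
    comp = dipeptide_composition(seq)
    return sum(comp[dp] * card[dp] for dp in DIPEPTIDES)


def predict(seq, card, threshold):
    """Call a sequence an MTP when its score exceeds the threshold."""
    return scm_score(seq, card) > threshold


# Toy data (hypothetical sequences, not taken from MTP-TRN1380).
toy_pos = ["LFLFLFVVLL", "LLFFLLFFVV"]
toy_neg = ["NKNKNEQNQN", "QNQNEKNKNE"]
toy_card = initial_scoring_card(toy_pos, toy_neg)
print(scm_score("LFLFLF", toy_card) > scm_score("NKNKNK", toy_card))
```

In practice the threshold would be tuned on the training set, and the initial card would be the starting point for the optimization step rather than the final scoring card.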
The SCMMTP performance was compared to other classifiers with the features commonly used in protein function prediction. This work considered the SVM, J48, Bayes and k-Nearest Neighbor (KNN) classifiers in combination with the amino acid composition (AAC), the dipeptide composition (DPC), the normalized PSSM (PSSM400), and the 531 PCPs from the AAindex database as the features. SVM is widely applied for protein function prediction and has also been implemented for MTP classification. We used LIBSVM to create SVM classifiers with a radial basis kernel. The optimal SVM parameters were chosen via a grid search according to the 10-fold cross-validation (10-CV) accuracy on MTP-TRN1380. The other classifiers were implemented using WEKA. The K parameter of each KNN classifier was chosen based on the best 10-CV accuracy on MTP-TRN1380. We tried five different K values for each KNN classifier, i.e., 3, 5, 7, 9 and 13. We used the default WEKA parameter settings when applying both the decision tree (J48) and the Naïve Bayes classifiers.
MTPs were analysed and characterized using the PCP mining method and the propensity score visualization method.
The PCP mining method, SCM-PCPs, was introduced to identify the important physicochemical properties (PCPs) of heme-binding proteins based on the propensity scores of 20 amino acids. To find a set of PCPs possibly correlated with a considered protein function, we examined the 544 indices representing different PCPs from the AAindex database. After removing the PCPs containing the value 'NA', 531 PCP indices were left and considered in this work. The application of SCM-PCPs to MTP analysis included the following steps: 1) Calculate the R values between the amino acid propensity scores of MTPs (generated by SCMMTP) and each of the 531 PCPs. 2) Calculate the R values between the amino acid propensity scores and the informative PCPs collected based on the domain knowledge of MTPs. 3) If the R value between a PCP and the amino acid propensity scores of MTPs is > 0.5, the PCP is chosen as a candidate PCP for further analysis.
The visualization method aimed to express the MTP propensity scores on proteins to determine their characteristics. The structure coordinate files of the proteins were colored according to the amino acid or dipeptide scores, and rendered using PyMOL. The red and blue colors represented high and low propensity score residues, respectively.
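The selection of K by 10-fold cross-validation described above can be illustrated with a small pure-Python sketch. It stands in for the WEKA workflow rather than reproducing it: the Euclidean distance, the fold assignment, the tie handling and all function names are assumptions made for illustration.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(tx, x)), ty)
        for tx, ty in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(xs, ys, k, folds=10, seed=0):
    """Pooled accuracy of k-NN over a shuffled `folds`-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test = idx[f::folds]  # every `folds`-th shuffled index forms one fold
        test_set = set(test)
        tr_x = [xs[i] for i in idx if i not in test_set]
        tr_y = [ys[i] for i in idx if i not in test_set]
        correct += sum(knn_predict(tr_x, tr_y, xs[i], k) == ys[i] for i in test)
    return correct / len(xs)


def select_k(xs, ys, candidates=(3, 5, 7, 9, 13)):
    """Return the candidate K with the best cross-validation accuracy, plus all scores."""
    scores = {k: cv_accuracy(xs, ys, k) for k in candidates}
    best = max(candidates, key=lambda k: scores[k])  # first candidate wins ties
    return best, scores
```

Here `xs` would be feature vectors such as AAC or DPC and `ys` the MTP/non-MTP labels of the training set; the candidate values are those listed in the text.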
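The correlation-based selection in the steps above can be sketched in a few lines. This is a hedged illustration only: it assumes that an amino acid score is the average of the 40 dipeptide scores in which the residue occupies the first or second position, and the function names and the layout of `pcp_indices` (a mapping from an index identifier to per-residue values) are hypothetical.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_scores(dipeptide_card):
    """Average the scores of the dipeptides that contain each amino acid.

    Assumption: 20 dipeptides with the residue first plus 20 with it second.
    """
    scores = {}
    for aa in AMINO_ACIDS:
        vals = [dipeptide_card[aa + other] for other in AMINO_ACIDS]
        vals += [dipeptide_card[other + aa] for other in AMINO_ACIDS]
        scores[aa] = sum(vals) / len(vals)
    return scores


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def candidate_pcps(aa_scores, pcp_indices, cutoff=0.5):
    """Return {pcp_id: R} for the indices whose R with the AA scores exceeds cutoff."""
    x = [aa_scores[aa] for aa in AMINO_ACIDS]
    selected = {}
    for pcp_id, index in pcp_indices.items():
        r = pearson_r(x, [index[aa] for aa in AMINO_ACIDS])
        if r > cutoff:
            selected[pcp_id] = r
    return selected
```

With the 531 AAindex entries loaded into `pcp_indices`, the returned dictionary could be sorted by R to obtain a ranked list of candidate properties.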
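One common way to color a structure by per-residue values is to write the values into the B-factor column of the coordinate file and let the viewer map that column to a color ramp. The sketch below assumes this approach (the text does not say how the coloring was done) and assumes fixed-column PDB format; the function name is hypothetical.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def scores_to_bfactor(pdb_lines, aa_scores):
    """Replace the B-factor field (columns 61-66) of ATOM records with residue scores.

    Lines that are not ATOM records, are too short, or carry a residue
    without a score are passed through unchanged.
    """
    out = []
    for line in pdb_lines:
        if line.startswith("ATOM") and len(line) >= 66:
            aa = THREE_TO_ONE.get(line[17:20])  # residue name, columns 18-20
            if aa in aa_scores:
                # 6.1f keeps scores up to 1000.0 within the six-character field
                line = line[:60] + f"{aa_scores[aa]:6.1f}" + line[66:]
        out.append(line)
    return out
```

After loading the rewritten file in PyMOL, a command such as `spectrum b, blue_white_red` maps low B-factor values to blue and high values to red, which matches the color convention described above.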
Results and discussion
Performance comparisons of different MTP predictors
Because of the variance in the MTP datasets, many predictors used different MTP datasets for creating their prediction models. For example, TPpred used mitochondrial proteins as the dataset, while TransportTP used sequences from the TCDB database. We evaluated the performance of the SCMMTP method and other generic MTP classifiers (Decision tree, J48; Naïve Bayes; K-nearest neighbors, KNN; Support vector machine, SVM) with four types of feature sets (amino acid composition, AAC; dipeptide composition, DPC; physicochemical property, PCP; Position-Specific Scoring Matrix, PSSM) to discriminate between MTPs and non-MTPs.
Table: Performance comparison of MTP predictors on the independent test set. (Table body not recoverable; the surviving row labels are KNN-AAC (k = 7), KNN-DPC (k = 5), KNN-AAindex (k = 7) and KNN-PSSM (k = 13).)
SCMMTP yielded test accuracy, sensitivity, specificity and MCC results of 76.11%, 80.00%, 68.33% and 0.47, respectively. SCMMTP performed better than the other predictors except KNN-PSSM and SVM-PSSM, which showed accuracies of 76.67% and 80.56%, respectively. Although predictors using PSSM as features also performed well in predicting MTPs in a previous study, this feature does not always provide satisfactory performance: the Bayes-PSSM and J48-PSSM predictors had accuracies of only 65.00% and 69.44%, respectively. Among the predictors using DPC as a feature, SCMMTP performed better than Bayes-DPC, J48-DPC, KNN-DPC and SVM-DPC, which had accuracies of 44.44%, 59.44%, 69.44% and 70.56%, respectively. The classifiers using AAC as a feature generally performed poorly, except KNN-AAC, which had an accuracy of 71.67%. Bayes-AAC, J48-AAC and SVM-AAC showed an average accuracy of 59.48%, with Bayes-AAC yielding the lowest performance among all the predictors. This suggests that AAC is not a good feature for predicting MTPs, even when different machine learning methods are used.
The SVM-PSSM method outperformed the other classifiers with a test accuracy of 80.56%. However, the SVM uses a complicated classification model with a low possibility for biological interpretation. Moreover, time cost is also a problem, because generating a PSSM profile takes a long time. On the other hand, SCMMTP, with a straightforward weighted-sum model and dipeptide composition as its feature set, provides propensity scores which are interpretable in biological analysis.
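The four measures reported above follow from a confusion matrix using the standard definitions. As a worked check, the sketch below uses counts that are inferred rather than reported: assuming the independent test set holds 120 MTPs and 60 non-MTPs (900 − 780 and 660 − 600), a sensitivity of 80.00% and a specificity of 68.33% correspond to 96 true positives and 41 true negatives.

```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sensitivity, specificity, mcc


# Inferred counts (an assumption, see the paragraph above): 120 MTPs, 60 non-MTPs.
acc, sens, spec, mcc = binary_metrics(tp=96, fn=24, tn=41, fp=19)
print(f"{acc:.2%} {sens:.2%} {spec:.2%} {mcc:.2f}")
# prints: 76.11% 80.00% 68.33% 0.47
```

The inferred counts reproduce all four reported values, which supports the 120/60 composition of the test set.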
SCMMTP performance for identifying MTPs using existing datasets
Errors in this MTP identification may be due to the different methods used to establish the various datasets. In HMPAS4942, Kim and Yi collected transporter sequences including both experimentally verified and predicted MTPs. The ten lowest-scoring sequences in our identification work do not have experimental evidence of being MTPs. These sequences were all predicted as MTPs using other prediction tools. Eight were predicted as MTPs (i.e., Q86XP4, Q9H480, Q6LA62, B4YCR0, H9PSV8, Q8HNQ6, Q658P4 and D6RA35, with respective scores of 321.89, 354.36, 354.56, 373.43, 375.93, 377.04, 377.32, and 379.32). In addition, Q8IYB3 and Q15287 (serine/arginine repetitive matrix protein 1, 364.88, and RNA-binding protein with serine-rich domain 1, 383.91) were selected using ortholog selection methods due to a lack of experimental evidence.
The PSPGO649 dataset contained photosynthetic proteins, which are very diverse in terms of structure and function, ranging from soluble to membrane-embedded. Furthermore, membrane photosynthetic proteins do not necessarily traverse the bilayer, as they are often subunits of bigger protein complexes. Thus, the PSPGO649 dataset is expected to contain many MTPs, but it is not purely a membrane-transporter dataset. Those proteins classified by SCMMTP as negative may be fragments or subunits of larger photosynthetic proteins, auxiliary subunits, soluble proteins of the Calvin cycle, components of light-harvesting complexes, or proteins that help to activate other photosynthetic proteins. In addition, a closer look at the bottom-10 sequences shows that most of these sequences have homology-based annotations, which are not experimentally verified. In contrast, the sequences that have annotations based on experimental evidence (i.e., UniProt IDs P84990, P73202, and P09927) function as light receptors, which does not imply a transporter function (see Table S4) [30, 31].
The MTPs of HTS380 have characteristics similar to those in the MTP-TRN1380 and MTP-TST180 datasets. However, the sequences of HTS380 were collected only from humans, in whom some pore-forming proteins have functions which differ from those of the transporters. For example, peroxisomal membrane protein 2 has a score of 429.33 and is among the seven lowest-scoring sequences; it seems to be involved in pore-forming activity and may contribute to the unspecific permeability of the peroxisomal membrane. This function differs from that in MTP-TRN1380 and MTP-TST180, in which the MTPs are permeable to specific substrates.
Comparing the dipeptide compositions of the MTPs and non-MTPs
The SCMMTP dipeptide scores revealed that the top-5 dipeptides with the highest scores are LF, FY, DL, VE, and QV, which scored 1000, 998, 995, 994, and 990, respectively. The five dipeptides with the lowest scores were QN, NE, NK, FL, and AV, with scores of 1, 5, 5, 12, and 13, respectively. The averaged dipeptide compositions of MTPs and non-MTPs were calculated for comparison. The Mann-Whitney U-test, a non-parametric statistical method, was applied to evaluate the statistical significance of the difference in averaged dipeptide compositions between MTPs and non-MTPs. Among the top-5 ranked dipeptides, LF, DL, VE, and QV showed significant differences based on a p-value threshold of 0.05, with p-values of 0.00, 0.00, 0.01 and 0.03, respectively. However, FY had a p-value of 0.25, indicating no significant difference between MTPs and non-MTPs. Among the five lowest-scoring dipeptides, QN, NE, NK, FL, and AV had p-values of 0.00, 0.00, 0.00, 0.01 and 0.00, respectively, indicating a significant difference between MTPs and non-MTPs. These results suggest that, although most dipeptides with the highest and the lowest scores differ significantly in composition between MTPs and non-MTPs, some dipeptides (such as FY) become useful only after the scores are tuned.
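A comparison of this kind can be sketched with a small Mann-Whitney U implementation. The version below is a minimal illustration under stated assumptions: it uses the normal approximation with tie correction and no continuity correction, which may differ from whatever statistical package was actually used, and the function name is hypothetical.

```python
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic of sample x, p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all values identical: no evidence of a difference
    z = (u1 - mean_u) / sqrt(var_u)
    return u1, erfc(abs(z) / sqrt(2))  # two-sided p-value
```

For one dipeptide, `x` would hold its composition in each MTP sequence and `y` its composition in each non-MTP sequence; a p-value below 0.05 would then be read as a significant difference, as in the text.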
MTPs characterization using the propensity visualizing method
The SCMMTP predictor operates by calculating dipeptide and amino acid propensity scores of MTPs and non-MTPs. Visualization techniques provide a way to represent these results and discover informative patterns within the structure of a given protein class. In this study, the protein structures were colored according to the SCMMTP-derived dipeptide (DP) and amino acid (AA) propensity scores.
Table: The MTP propensity scores and the PCPs selected from the AAindex database based on R. (Table body not recoverable; the surviving column header is "MTP score (Rank)".)
These observations are consistent with the results from several previous studies. Amongst these, the analysis of AA propensities and physicochemical properties of photosynthetic proteins showed that the hydrophobic interactions are crucial for electron transport reactions . Site-directed mutations of Val102, Phe219, and Glu276 residues are shown to impair the transport function of SmbA protein, which mediates the transport of antimicrobial peptides .
The correlation between the propensity scores derived from SCMMTP and amino acid composition is evaluated with the Pearson correlation coefficient (R value). The high correlation coefficient (R = 0.95) between the propensity scores of amino acids and the composition difference between MTPs and non-MTPs indicates that SCMMTP-derived scores are effective in discriminating between positive and negative classes. The distributions of AA propensity scores on the surface of several highly-scored MTPs and non-MTPs have been visualized in Figure 3B. The red color represents the positions of highly-scored amino acids, whereas the low-scored AAs are colored in blue.
As shown in Figure 3, MTPs contain more regions colored in red than non-MTPs. Furthermore, high-scored regions in MTPs are mainly present in the transmembrane α-helices. Hence, the increased occurrence of hydrophobic residues in MTPs, evident from the AA propensity and composition analysis, is due to the presence of long stretches of these residues in the membrane spanning α-helices.
MTPs characterization using physicochemical properties
In this study, the SCM-PCPs method was used to identify the physicochemical properties (PCPs) of MTPs. The correlations between SCMMTP-derived AA scores and AAindex indices of the PCPs have been estimated and the top-ranked PCPs from AAindex database are presented in Table 3. The three selected PCPs with their corresponding R values are: OLSK800101 or "Average internal preferences" (R = 0.86); KYTJ820101 or "Hydropathy index" (R = 0.85); WERD780104 or "Free energy change of epsilon(i) to alpha(Rh)" (R = 0.74).
A. Hydropathic characteristics of MTPs
SCM indicates that the KYTJ820101 property, described as the "hydropathy index", has a high positive correlation (R = 0.854) with AA scores. KYTJ820101 represents a hydropathy scale in which each of the 20 amino acids is assigned a value reflecting its relative hydrophobicity and hydrophilicity based on experimental observations.
In contrast to soluble proteins, little is understood about the structure and folding of membrane-related proteins. To date, very few high-resolution three-dimensional structures have been solved for membrane proteins due to the need for sophisticated techniques for diffraction studies. Problems originate from the inherent insolubility of membrane proteins due to the presence of hydrophobic domains. In fact, as of mid-February 2012 only 320 unique membrane proteins had been deposited in the Protein Data Bank, representing less than 1% of the total data. While structural determination has progressed in recent years, most membrane protein crystal structures solved are taken from bacteria because eukaryotic membrane proteins are more difficult to crystallize.
Given this lack of experimental structural data, three-dimensional structures can be inferred from amino acid sequences applying an appropriate hydrophobicity scale. Examination of the hydropathy of a given sequence can help crystallographers measure the distribution of hydrophobic and hydrophilic regions, predict whether or not a given peptide segment is sufficiently hydrophobic to interact with or reside within the interior of the membrane, define secondary structures, and study the relationships between buried/exposed behaviour of the residues and their nature [33, 37].
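A hydropathy profile of the kind described above is usually computed as a sliding-window average over a per-residue scale. The sketch below uses the Kyte-Doolittle values; the window of 19 residues and the cutoff of 1.6 are commonly quoted heuristics for membrane-spanning segments and are treated here as adjustable assumptions, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (AAindex entry KYTJ820101).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of each full window along the sequence.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, cutoff=1.6):
    """Start positions (0-based) of windows whose mean hydropathy exceeds cutoff."""
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h > cutoff]
```

Plotting the profile against the window start position gives a hydropathy plot; runs of consecutive positions returned by `candidate_tm_windows` would mark segments that are hydrophobic enough to be candidate membrane-spanning helices.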
Figure 4A shows hydropathy plots of the MTPs P0AGM7 and P11551. It clearly shows the physicochemical property of "surrounding hydrophobicity in α-helix", which has a high positive correlation (R = 0.836) with AA scores. In Figure 4B-1, the non-MTP Q8NGY6 showed transmembrane regions but displayed few high-scoring amino acids in the α-helix. In Figure 4B-2, the non-MTP Q9H4M7 does not show transmembrane regions; thus, Figure 4 indicates that these two PCPs are important classification features.
Furthermore, the correlation results obtained between SCM scores and the proposed hydropathy values reveal that Ile plays a significant role in the stability and functionality of MTPs. The Ile, Phe, Gly, Ala and Val residues ranked top-5 in our SCM-derived scale are hydrophobic and are top- or middle-ranked in the proposed hydropathy scale. Further investigation is needed to determine the role of Ile in the structure and stability of membrane transporter proteins.
The hydrophilic Arg, Gln, Asp, Lys and Glu residues, ranked by Kyte & Doolittle as the bottom-20, -18, -15, -19 and -16, respectively, are also in the bottom-5 of the SCM-derived scores. This may imply a reduced prevalence of hydrophilic residues in the native structures of membrane transporter proteins.
The abundance of Ile and Phe residues in the membrane proteins agrees with previous findings, which mention their location in the acyl chain areas of membrane lipids. It should also be noted that steric effects may affect the folding of membrane transporter proteins independent of hydropathy.
B. Preferences of transporters to form outside α-helices
The WERD780104 property showed a high correlation (R = 0.7445) with SCM-derived scores and is described in the AAindex as "free energy change of epsilon(i) to alpha(Rh)". The amino acid indices of WERD780104 reflect the effects of local solute-solvent interactions (i.e., interactions between a residue and water with no influence from the neighboring residues) on the conformational preferences of the 20 naturally occurring amino acids, summarized from protein X-ray data. In this property, different residue conformations have been assigned to one of three types of protein backbone structures: nonregular structure, helix, and extended structure. Nonregular structures include residues in an epsilon(i) conformation, which defines isolated extended residues or those which are part of a run of two or three extended residues. On the other hand, the helix structure includes residues in the αRh conformation, defining those residues as part of a right-handed α-helix. In WERD780104, the free energy change (Δ(ΔG°)) was used to express the preferences of each residue for a nonregular structure relative to their preferences for the α-helical structure in going from the inside to the outside of a protein molecule.
In general, the scale proposed by Wertz et al. confirmed the general preference of polar groups to be on the outside of protein molecules, while the non-polar groups are on the inside. However, regarding the conformational pattern, important inferences can be drawn from the positive correlation obtained between SCM scores and the WERD780104 scale of free energy changes, as follows: 1) transporter proteins are more stable in α-helical structures than in non-regular or extended structures and 2) transporter proteins have higher preferences for α-helices outside, rather than inside.
Molecular transport proteins are regarded as 'outside' or surface polypeptide chains and face the cavity, pore or channel, in contrast to membrane-buried regions. As discussed by Wertz et al., proteins are usually more stable if they have non-regular or helical structures on the surface, because of the greater increase in entropy in going from the inside (where the librational motions of all types of residues are highly restricted) to the outside of a protein (where the restrictions on the librational motions are less severe).
MTP characterization using both the propensity visualization and physicochemical properties
Table: The correlations between the propensity scores and PCPs including membrane single span helix, membrane multi-span helix and amino acid hydration energies. (Table body not recoverable; the surviving column header is "AAC of single span segment".)
The polar residues are thought to play an important role in ion selection, depending on their hydration energy. Illergard et al. indicated that the polar residues in the core of membrane proteins are conserved and often interact with water. This suggests that the polar residues of transmembrane proteins usually work in aqueous environments and that the hydration energy influences the channel selectivity. The hydration energy of each amino acid is also provided in Table 4. WOLR810101 is the "hydration potential" provided by the AAindex, and newer hydration potentials are provided by Konig et al. The high correlation (0.79) between the propensity scores and the hydration energies indicates that the transmembrane segments are prone to be composed of low hydration energy residues. The MTPs are folded in the membrane, which is an extremely hydrophobic environment, and are composed of hydrophobic, low hydration energy amino acids that could decrease the folding energy. Since polar residues are important to ion selection, the relationship of the polar residue propensity scores to the hydration energies of polar residues is also shown in Table 4.
The correlations of the amino acid scores with WOLR810101 and with the newer hydration energies are 0.54 and 0.80, respectively, suggesting that the correlation increases after updating the hydration energies. It also indicates that the high hydration energy residues have high propensity scores. This leads to the conclusion that the transmembrane regions are prone to being composed of high hydration energy amino acids, and the polar residues in the transmembrane regions are responsible for transporting hydrophilic molecules.
Despite the growing number of MTP sequences in public databases, their three-dimensional structures are being resolved at far slower rates. In this study, several machine learning methods have been applied to predict MTPs from sequences. Furthermore, a novel scoring card method (SCM)-based method, SCMMTP, has been proposed for the prediction and analysis of MTPs. The SCMMTP method yielded a good prediction performance and utilized dipeptide and amino acid propensity scores of MTPs to analyze their structure and physicochemical properties.
Considering the importance of MTPs in numerous biological processes, understanding of their nature can help to facilitate important future applications, including drug design.
This work was also funded by the Ministry of Science and Technology of Taiwan under the contract numbers MOST-104-2627-M-009-009- and MOST-104-2221-E-009-183-, and by the "Center for Bioinformatics Research of Aiming for the Top University Program" of the National Chiao Tung University and Ministry of Education, Taiwan, R.O.C. for the project 104W962. This work was also supported in part by the UST-UCSD International Center of Excellence in Advanced Bioengineering sponsored by the Taiwan Ministry of Science and Technology with I-RiCE Program under Grant Number: MOST-103-2911-I-009-101-.
This article has been published as part of BMC Genomics Volume 16 Supplement 12, 2015: Joint 26th Genome Informatics Workshop and 14th International Conference on Bioinformatics: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S12.
- Mishra NK, Chang J, Zhao PX: Prediction of Membrane Transport Proteins and Their Substrate Specificities Using Primary Sequence Information. PLoS One. 2014, 9 (6): e100278.
- Ravna AW, Sylte I: Homology modeling of transporter proteins (carriers and ion channels). Methods Mol Biol. 2012, 857: 281-299.
- Saier MH, Tran CV, Barabote RD: TCDB: the Transporter Classification Database for membrane transport protein analyses and information. Nucleic Acids Res. 2006, 34 (Database issue): D181-D186.
- Lin H, Han L, Cai C, Ji Z, Chen Y: Prediction of transporter family from protein sequence by support vector machine approach. Proteins. 2006, 62 (1): 218-231.
- Ren Q, Paulsen IT: Large-scale comparative genomic analyses of cytoplasmic membrane transport systems in prokaryotes. J Mol Microbiol Biotechnol. 2006, 12 (3-4): 165-179.
- Jain S, Ranjan P, Sengupta D, Naik PK: TpPred: A Tool for Hierarchical Prediction of Transport Proteins Using Cluster of Neural Networks and Sequence Derived Features. International Journal for Computational Biology. 2014, 1 (1): 28-36.
- Barbier-Brygoo H, Gaymard F, Rolland N, Joyard J: Strategies to identify transport systems in plants. Trends Plant Sci. 2001, 6 (12): 577-585.
- Gromiha MM, Yabuki Y: Functional discrimination of membrane proteins using machine learning techniques. BMC Bioinformatics. 2008, 9 (1): 135.
- Li H, Benedito VA, Udvardi MK, Zhao PX: TransportTP: a two-phase classification approach for membrane transporter prediction and characterization. BMC Bioinformatics. 2009, 10 (1): 418.
- Ou YY, Chen SA, Gromiha MM: Classification of transporters using efficient radial basis function networks with position-specific scoring matrices and biochemical properties. Proteins. 2010, 78 (7): 1789-1797.
- Sauguet L, Poitevin F, Murail S, Van Renterghem C, Moraga-Cid G, Malherbe L, et al: Structural basis for ion permeation mechanism in pentameric ligand-gated ion channels. EMBO J. 2013, 32 (5): 728-741.
- Hibbs RE, Gouaux E: Principles of activation and permeation in an anion-selective Cys-loop receptor. Nature. 2011, 474 (7349): 54-60.
- Zhou Y, Morais-Cabral JH, Kaufman A, MacKinnon R: Chemistry of ion coordination and hydration revealed by a K+ channel-Fab complex at 2.0 Å resolution. Nature. 2001, 414 (6859): 43-48.
- Köpfer DA, Song C, Gruene T, Sheldrick GM, Zachariae U, de Groot BL: Ion permeation in K+ channels occurs by direct Coulomb knock-on. Science. 2014, 346 (6207): 352-355.
- Charoenkwan P, Shoombuatong W, Lee HC, Chaijaruwanich J, Huang HL, Ho SY: SCMCRYS: Predicting Protein Crystallization Using an Ensemble Scoring Card Method with Estimating Propensity Scores of P-Collocated Amino Acid Pairs. PLoS One. 2013, 8 (9): e72368.
- Huang HL, Charoenkwan P, Kao TF, Lee HC, Chang FL, Huang WL, et al: Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. BMC Bioinformatics. 2012, 13 (Suppl 17): S3.
- Liou YF, Charoenkwan P, Srinivasulu YS, Vasylenko T, Lai SC, Lee HC, et al: SCMHBP: prediction and analysis of heme binding proteins using propensity scores of dipeptides. BMC Bioinformatics. 2014, 15 (Suppl 6): S4.
- Huang HL: Propensity Scores for Prediction and Characterization of Bioluminescent Proteins from Sequences. PLoS One. 2014, 9 (5): e97158.
- Ho SY, Shu LS, Chen JH: Intelligent evolutionary algorithms for large parameter optimization problems. IEEE Transactions on Evolutionary Computation. 2004, 8 (6): 522-541.
- Kawashima S, Ogata H, Kanehisa M: AAindex: Amino acid index database. Nucleic Acids Res. 2000, 28 (1): 374.
- Huang H-L, Li M-C, Vasylenko T, Ho S-Y: Computational prediction and analysis of human transporters using physicochemical properties of amino acids. International Journal of Engineering and Technical Research. 2 (2): 180-187.
- Kim MS, Yi GS: HMPAS: Human Membrane Protein Analysis System. Proteome Sci. 2013, 11 (Suppl 1): S7.
- Vasylenko T, Liou YF, Chen HA, Charoenkwan P, Huang HL, Ho SY: SCMPSP: Prediction and characterization of photosynthetic proteins based on a scoring card method. BMC Bioinformatics. 2015, 16 (Suppl 1): S8.
- Bradley AP: The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recogn. 1997, 30 (7): 1145-1159.
- Huang HL, Lin IC, Liou YF, Tsai CT, Hsu KT, Huang WL, et al: Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties. BMC Bioinformatics. 2011, 12 (Suppl 1): S47.
- Chang CC, Lin CJ: LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2 (3).
- Frank E, Hall M, Trigg L, Holmes G, Witten IH: Data mining in bioinformatics using Weka. Bioinformatics. 2004, 20 (15): 2479-2481.
- DeLano WL, Lam JW: PyMOL: A communications tool for computational models. Abstr Pap Am Chem S. 2005, 230: U1371-U1372.
- Saier MH, Tran CV, Barabote RD: TCDB: the Transporter Classification Database for membrane transport protein analyses and information. Nucleic Acids Res. 2006, 34 (Database issue): D181-D186.
- Watanabe Y, Feick RG, Shiozawa JA: Cloning and Sequencing of the Genes Encoding the Light-Harvesting B806-866 Polypeptides and Initial Studies on the Transcriptional Organization of Puf2b, Puf2a and Puf2c in Chloroflexus-Aurantiacus. Arch Microbiol. 1995, 163 (2): 124-130.
- Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y, et al: Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res. 1996, 3 (3): 185-209.
- Corbalan N, Runti G, Adler C, Covaceuszach S, Ford RC, Lamba D, et al: Functional and structural study of the dimeric inner membrane protein SbmA. J Bacteriol. 2013, 195 (23): 5352-5361.
- Kyte J, Doolittle RF: A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982, 157 (1): 105-132.
- Santoni V, Molloy M, Rabilloud T: Membrane proteins and proteomics: un amour impossible?. Electrophoresis. 2000, 21 (6): 1054-1070.
- Sciara G, Mancia F: Highlights from recently determined structures of membrane proteins: a focus on channels and transporters. Curr Opin Struct Biol. 2012, 22 (4): 476-481.
- Grisshammer RK, Buchanan SK: Structural biology of membrane proteins. Royal Society of Chemistry. 2006, 4.
- Eisenberg D: Three-dimensional structure of membrane and surface proteins. Annual Review of Biochemistry. 1984, 53 (1): 595-623.
- Wertz DH, Scheraga HA: Influence of water on protein structure. An analysis of the preferences of amino acid residues for the inside or outside and for specific conformations in a protein molecule. Macromolecules. 1978, 11 (1): 9-15.
- Engelman DM, Zaccai G: Bacteriorhodopsin is an inside-out protein. Proc Natl Acad Sci U S A. 1980, 77 (10): 5894-5898.
- Nakashima H, Nishikawa K: The amino acid composition is different between the cytoplasmic and extracellular sides in membrane proteins. FEBS Letters. 1992, 303 (2): 141-146.
- Landolt-Marticorena C, Williams KA, Deber CM, Reithmeier RA: Non-random distribution of amino acids in the transmembrane segments of human type I single span membrane proteins. J Mol Biol. 1993, 229 (3): 602-608.
- Illergård K, Kauko A, Elofsson A: Why are polar residues within the membrane core evolutionary conserved?. Proteins: Structure, Function, and Bioinformatics. 2011, 79 (1): 79-91.
- König G, Bruckner S, Boresch S: Absolute hydration free energies of blocked amino acids: implications for protein solvation and stability. Biophysical Journal. 2013, 104 (2): 453-462.
- Fu D, Libson A, Miercke LJ, Weitzman C, Nollert P, Krucinski J, Stroud RM: Structure of a glycerol-conducting channel and the basis for its selectivity. Science. 2000, 290 (5491): 481-486.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.