Prediction of essential proteins based on gene expression programming
© Zhong et al.; licensee BioMed Central Ltd. 2013
Published: 1 October 2013
Essential proteins are indispensable for cell survive. Identifying essential proteins is very important for improving our understanding the way of a cell working. There are various types of features related to the essentiality of proteins. Many methods have been proposed to combine some of them to predict essential proteins. However, it is still a big challenge for designing an effective method to predict them by integrating different features, and explaining how these selected features decide the essentiality of protein. Gene expression programming (GEP) is a learning algorithm and what it learns specifically is about relationships between variables in sets of data and then builds models to explain these relationships.
In this work, we propose a GEP-based method to predict essential protein by combing some biological features and topological features. We carry out experiments on S. cerevisiae data. The experimental results show that the our method achieves better prediction performance than those methods using individual features. Moreover, our method outperforms some machine learning methods and performs as well as a method which is obtained by combining the outputs of eight machine learning methods.
The accuracy of predicting essential proteins can been improved by using GEP method to combine some topological features and biological features.
Essential proteins are indispensable to support cellular life . Identifying essential proteins can help us understand the minimal requirements for cell survival, which plays a significant role in the emerging field of synthetic biology . Since the deleting, interrupting or blocking of essential proteins leads to the death of organisms, essential proteins can serve as candidates of drug-targets for developing novel therapies of diseases, such as cancers or infectious diseases . Moreover, some studies have pointed out that essential proteins have correlation with human disease genes . However it is expensive and time-consuming to experimentally identify essential proteins.
In recent years, many computational approaches have been presented to identify essential proteins according to their features. One group of researchers focus on detecting essential proteins based on their topological features in protein-protein interaction (PPI) networks, since previous studies have shown that the removal of those proteins with a larger number of neighbours in PPI networks is more likely to cause the organism to die . Therefore, many centrality methods have been come up with such as Degree Centrality (DC) , Betweenness Centrality (BC) , Closeness Centrality (CC) , Subgraph Centrality (SC) , Eigenvector Centrality (EC) , Information Centrality (IC) , Edge Clustering Coefficient Centrality (NC)  and so on. However, these centrality methods have their own limits. For example, they highly depend on the accuracy of PPI networks and ignore the useful biological features. Recently, new methods that combine their topological features with their biological ones have been developed. Hart and his fellows have pointed out that the essentiality is a special property of protein complex and essential proteins are often highly clustered in certain protein complexes . Based on this observation, Ren et al.  integrate PPI network topology and protein complexes information to predict essential proteins. According to the fact that proteins in protein complexes tend to be co-expressed, Li et al.  propose a new prediction method called PeC and Tang et al  propose another one, WDC, which integrates network topology with gene expression profiles. Considering the fact that essential proteins are more conserved than non-essential ones  and they frequently connects to each other , Peng et al.  have proposed an iterative method for the prediction of essential proteins based on the orthology and PPI networks. Their results show that the accuracy of predicting essential proteins can be improved by combing their biological features with their topological features. Although the methods mentioned above combine some features of essential proteins efficiently and explain how these features work together to decide the essentiality of proteins, better methods needs to be developed to integrate more appropriate features. Because there are different types of features that relates to protein essentiality, which suggests that multiple aspects of organisms contribute to determining the essentiality of proteins .
Another group of researchers use machine learning algorithms, such as support vector machine (SVM) , decision tree , Naive Bayes and so on, to detect essential proteins. These methods train a classifier according to the features of known essential proteins and non-essential ones. Then test the classifier in the same organism or the other organisms. For example, Gustafson et al.  select a lot of essentiality related features including both topological features, such as the degree centrality (DC) in PPI networks, and biological features, such as open reading frame (ORF) length, Phyletic retention (PHY), paralogs, codon adaptation index (CAI) and so on. And then they use a Naive Bayes classifier to make prediction. Hwang et al.  build SVM classifier which combines the biological features, such as ORF length, strand and PHY, and topological features in PPI network, such as DC, BC, CC and so on. Additionally, some researchers combine several classifiers to predict essential proteins. Acencio et al.  learn 12 different network topological features (DC, BC, CC and so on) in the integrated network and some biological features, such as cellular localization and biological processes information. Then several decision tree-based classifiers are trained and tested based on these features. A best essentiality classifier is obtained by combining the outputs of these diverse classifiers. Deng et al.  also train their classifier by combining the results of four separate classifiers (Naive Bayes classifier, logistical regression model, C4.5 decision tree and CN2 rule). In , more approaches of detecting essential protein are introduced and discussed.
Now that those machine learning methods are available as software packages , they can be easily adapted to predict essential proteins using input features. However, it is difficult for them to explain how these features are used for classification. Moreover, few methods are able to automatically select appropriate features. Researchers tend to analyse the relationship between features and the essentiality of proteins with statistical methods. And then they decide which features are selected to train classifiers [19, 22]. Gene expression programming (GEP) is a learning algorithm and what it learns specifically is about relationships between variables in sets of data, and it builds models to describe these relationships . The features that show weak positive correlation with essentiality of proteins will not be selected by the GEP classifier.
In this work, we propose a GEP-based method to predict essential protein by combining biological features, such as subcellular location, and topological features, such as DC, BC, CC, SC, EC, IC, NC, and other composed features computed by the methods PeC, WDC and ION using biological and topological features. We carry out experiments on S. cerevisiae (Baker's yeast) data. The experimental results show that our method outperforms others using one of features calculated by existing methods (DC, BC, CC, SC, EC, IC, NC, PeC, WDC and ION). Moreover, in terms of area under an ROC curve (AUC), our method achieves better results than other machine learning methods (SVM, SMO, NaiveBayes, Bayes Network, RBF Network, J48, Random Tree, Random Forest, NaiveBayes Tree), and it performs as well as a method that uses multiple machine learning methods.
Results and Discussion
In this section, we firstly analyze the results of 10-fold cross-validation. Then we compare the prediction of our method with other existing methods which calculate individual features. Moreover our methods are compared with other machine learning-based methods. Finally we show our best classifier and explain how to combine individual features to decide the essentiality of protein.
Comparison of 10-fold cross-validation results
Comparison with other existing methods
Comparison between GEP and the methods using individual feature
Comparison with other machine learning-based methods
Comparison of average AUC between GEP and other machine learning based methods.
Analysis of GEP classifier
As the expression shown, ION is a very predictive features. Because it relates to the evolutionary conservation of proteins, and essential proteins are often highly conserved across organisms . The proteins located in endoplasmic reticulum or nucleus tend to possess indispensable functions, which is consistent with the observation of Acencio et al. . However, those proteins located in cytoplasm or vacuole are less likely to be essential proteins. Note that some input features are not present in the expression. Our GEP classifier discards those features that are either less effective in predicting essentiality or replaced by other features. For example the classifier selects localized topological features such as WDC and SC instead of global ones such as BC, CC and IC. Because localized centrality measures can obtain better prediction performance than the global ones . Both WDC and PeC depend on the features of co-expression and co-clustering of essential proteins, the classifier uses one of them in terms of their capability of prediction.
As different types of features relates to the essentiality of proteins, it is still a big challenge for designing an effective method to predict them by integrating different features and explaining how these selected features decide the essentiality of protein. In this work, we propose a GEP-based prediction method which combines topological and biological features. Compared to other machine learning-based methods, it is able to select predictive features automatically and generates an expression that describes the relationships between them. We carry out experiments on S. cerevisiae (Baker's yeast) data.(1) Ten classifiers are obtained from 10-fold cross-validation based on all input features. The average AUC value is 0.7730. (2) In terms of average AUC values, our method outperforms a number of machine learning methods and has comparable performance to the method which combines the output results of eight decision trees. (3) We evaluate our classifier by testing it on all proteins in PPI network with all available learning features. The results indicate our classifier performs better than those that use individual features. Thus, our method can effectively combine a range of different features to predict essential proteins.
We implement experiments based on data of S. cerevisiae (Bakers' Yeast) because both its PPI and gene essentiality data are the most complete and reliable among various species. The PPI data of S. cerevisiae is downloaded from DIP database  using the version published on Oct.10, 2010, without self-interactions and repeated interactions. There are total of 5093 proteins and 24743 edges.
The list of essential proteins is integrated from the following databases: MIPS , SGD , DEG  and SGDP , which contains 1285 essential proteins. Among the 1285 essential protein, 1167 proteins present in PPI network. In our study, these 1167 proteins are regarded as essential proteins while other 3926(= 5093-1167) proteins are nonessential proteins.
The information of orthologous proteins used in method ION, is download from Version 7 of the In Paranoid database . The gene expression data of yeast is retrieved from Tu et al., 2005 , containing 6,777 gene products (proteins) and 36 samples in total. Among the 6777 proteins, 4858 proteins are involved in the yeast PPI network. The subcellular information is downloaded from eSLDB database , which categorizes the 5093 proteins in PPI network into 16 different subcellular localizations.
Our GEP classifier is constructed to predict essential proteins based on topological and biological features. The topological features include degree centrality, betweenness centrality, closeness centrality, subgraph centrality, eigenvector centrality, information centrality and edge clustering coefficient centrality in PPI network, which are calculated by the centrality methods DC, BC, CC, SC, EC, IC and NC, respectively. Some composited features calculated by methods (PeC, WDC, and ION) which integrate the topological features with biological features are also used in this work.
Additionally, some biological features such as subcellular localization are considered. Because subcellular location plays a crucial role in protein function and proteins perform their functions in certain subcellular compartments. Acencio et al.  find that proteins located in nuclear subcellular compartments tend to be essential, because most essential biological processes for cell viability take place in nuclear. In contrast, most membrane proteins with functions as transporters or participate in metabolism related processes are more likely to be nonessential. In this work, all proteins in PPI network are involved in 16 different localizations including Vacuole, Vesicles, Lysosome, Membrane, Mitochondrion, Peroxisome, Secretory pathway, Cell wall, Cytoskeleton, Endoplasmic reticulum, Golgi, Transmembrane, Cytoplasm, Nucleus and Endosome, Extracellular.
For each of feature has its own value ranges, we normalize all features by dividing them by corresponding maximum values, so that ranges of features value are -1 to 1. The coefficient (0.1, 0.2, 0.5, and 1.0) is added to adjust the contribution of each feature.
As one kind of the Evolutionary Algorithms (EA's), Gene expression programming (GEP) is a genotype/phenotype genetic algorithm that combines the merits of both genetic algorithms and genetic programming . Each chromosome in GEP is expressed using nonlinear entities that can be represented as a fixed-length linear encoded string, such as mathematical expressions, polynomial constructs, logical expression, and so forth. Chromosomes can be evolved and new ones are generated by some genetic operations of mutation, transposition and recombination guided by a fitness function. GEP is able to do global searches for classification and performs well, but it is seldom adopted to solve the classification problem of essential proteins.
In this work, GEP is developed to predict essential proteins. First of all, a set of functions and terminals is chosen to define chromosomes that will be expressed as nonlinear entities. The set of functions contains arithmetic operators and logic operators, such as add, subtract, multiply, divide, min, max, equal, sqrt, log, exp, abs, while the set of terminals contains variables representing features of proteins (for example, topological and subcellular localization properties) and relevant coefficients. Then we build the chromosomal structure. For each chromosome the length of its head and the number of genes will be given.
The second step is to randomly generate first population. The input parameter indicates the size of population. We varied the number of populations from 800 to 12000, and kept track of results each model produces. According to the results, the performance of models increases gradually as the number of populations rises. Thus, we chose the maximum, 12000, as the number of populations.
Where TPi, TNi, FPi, and FNi indicate, respectively, the number of true positives, true negatives, false positives, and false negatives. Given protein features, we use the fitness function to compute scores for all chromosomes in population.
The forth step is to select the top 30% of populations as eugenic ones. Then the fifth step is that performing a set of genetic operations (including mutation, transposition and crossover) on eugenic ones reproduces chromosomes of the next generation that has the same size as former one.
Parameters used in our GEP method
Description of parameter
Setting of parameter
Number of Population
Length of Gene
Length of Chromosome
Length of Head
Number of Generation
+,-,*, =,/, Sqrt, Log, Exp, Abs, Max, Min
Fitness Function Name
Training and testing set preparation
We build and evaluate our classifiers in terms of 10-fold cross-validation analysis, in which original data are divided into 10 equal datasets, and nine-folds are used to train the classifier and the remaining one fold is used for testing. Since the ratio of essential and non-essential proteins in original data is about 1:3.36 (essential proteins: non-essential proteins = 1167:3926), each fold data maintains the same ratio of essential proteins and nonessential proteins in original data. The cross-validation process is repeated ten times to generated ten classifiers, with each of the ten datasets used exactly once as testing data.
This work is supported in part by the National Natural Science Foundation of China under Grant No. 61232001, No.61128006 and No.61003124; the Program for New Century Excellent Talents in University (NCET-12-0547); the U.S. National Science Foundation under Grants CCF-0514750, CCF-0646102 and CNS-0831634; Scientific Research Fund of Hunan Provincial Education Department NO.12B080.
The publication costs for this article were funded by the corresponding author.
This article has been published as part of BMC Genomics Volume 14 Supplement S4, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S4.
- Zhang R, Lin Y: DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic acids research. 2009, 37: D455-458. 10.1093/nar/gkn858.PubMed CentralView ArticlePubMedGoogle Scholar
- Glass JI, Hutchison CA, Smith HO, Venter JC: A systems biology tour de force for a near-minimal bacterium. Molecular systems biology. 2009, 5: 330-PubMed CentralView ArticlePubMedGoogle Scholar
- Clatworthy AE, Pierson E, Hung DT: Targeting virulence: a new paradigm for antimicrobial therapy. Nature chemical biology. 2007, 3: 541-548. 10.1038/nchembio.2007.24.View ArticlePubMedGoogle Scholar
- Furney S, Alba MM, Lopez-Bigas N: Differences in the evolutionary history of disease genes affected by dominant or recessive mutations. BMC Genomics. 2006, 7: 165-10.1186/1471-2164-7-165.PubMed CentralView ArticlePubMedGoogle Scholar
- Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature. 2001, 411: 41-42. 10.1038/35075138.View ArticlePubMedGoogle Scholar
- Hahn MW, Kern AD: Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Mol Biol Evol. 2005, 22: 803-806. 10.1093/molbev/msi072.View ArticlePubMedGoogle Scholar
- Joy MP, Brock A, Ingber DE, Huang S: High-betweenness proteins in the yeast protein interaction network. J Biomed Biotechnol. 2005, 2005: 96-103. 10.1155/JBB.2005.96.PubMed CentralView ArticlePubMedGoogle Scholar
- Wuchty S, Stadler PF: Centers of complex networks. J Theor Biol. 2003, 223: 45-53. 10.1016/S0022-5193(03)00071-7.View ArticlePubMedGoogle Scholar
- Estrada E, Rodríguez-Velázquez JA: Subgraph centrality in complex networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2005, 71: 056103-View ArticlePubMedGoogle Scholar
- Bonacich P: Power and centrality: A family of measures. American journal of sociology. 1987, 92: 12-View ArticleGoogle Scholar
- Karen Stephenson, Zelen M: Rethinking centrality: Methods and examples. Social Networks. 2002, 11: 37-Google Scholar
- Wang J, Li M, Wang H, Pan Y: Identification of Essential Proteins Based on Edge Clustering Coefficient. IEEE/ACM transactions on computational biology and bioinformatics/IEEE, ACM. 2012, 9: 1070-1080.View ArticlePubMedGoogle Scholar
- Hart GT, Lee I, Marcotte E: A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinformatics. 2007, 8: 236-10.1186/1471-2105-8-236.PubMed CentralView ArticlePubMedGoogle Scholar
- Ren J, Wang JX, Li M, Wang H, Liu BB: Prediction of Essential Proteins by Integration of PPI Network Topology and Protein Complexes Information. Lect N Bioinformat. 2011, 6674: 12-24.Google Scholar
- Peng W, Wang J, Wang W, Liu Q, Wu F-X, Pan Y: Iteration method for predicting essential proteins based on orthology and protein-protein interaction networks. BMC Systems Biology. 2012, 6: 87-10.1186/1752-0509-6-87.PubMed CentralView ArticlePubMedGoogle Scholar
- Tang X, Wang J, Pan Y: Identifying essential proteins via integration of protein interaction and gene expression data. Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on: 4-7 October 2012. 2012, 1-4. 10.1109/BIBM.2012.6392716.View ArticleGoogle Scholar
- Jordan IK, Rogozin IB, Wolf YI, Koonin EV: Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome research. 2002, 12: 962-968. 10.1101/gr.87702. Article published online before print in May 2002.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang J, Peng W, Wu FX: Computational approaches to predicting essential proteins: a survey. Proteomics Clin Appl. 2013, 7: 181-192. 10.1002/prca.201200068.View ArticlePubMedGoogle Scholar
- Hwang YC, Lin CC, Chang JY, Mori H, Juan HF, Huang HC: Predicting essential genes based on network and sequence analysis. Molecular bioSystems. 2009, 5: 1672-1678. 10.1039/b900611g.View ArticlePubMedGoogle Scholar
- Seringhaus M, Paccanaro A, Borneman A, Snyder M, Gerstein M: Predicting essential genes in fungal genomes. Genome research. 2006, 16: 1126-1135. 10.1101/gr.5144106.PubMed CentralView ArticlePubMedGoogle Scholar
- Gustafson AM, Snitkin ES, Parker SC, DeLisi C, Kasif S: Towards the identification of essential genes using targeted genome sequencing and comparative analysis. BMC Genomics. 2006, 7: 265-10.1186/1471-2164-7-265.PubMed CentralView ArticlePubMedGoogle Scholar
- Deng J, Deng L, Su S, Zhang M, Lin X, Wei L, Minai AA, Hassett DJ, Lu LJ: Investigating the predictability of essential genes across distantly related organisms using an integrative approach. Nucleic acids research. 2011, 39: 795-807. 10.1093/nar/gkq784.PubMed CentralView ArticlePubMedGoogle Scholar
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter. 2009, 11: 10-18. 10.1145/1656274.1656278.View ArticleGoogle Scholar
- Ferreira C: Gene expression programming: A new adaptive algorithm for solving problems. arXiv preprint cs/0102027. 2001Google Scholar
- Acencio ML, Lemke N: Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information. BMC Bioinformatics. 2009, 10: 290-10.1186/1471-2105-10-290.PubMed CentralView ArticlePubMedGoogle Scholar
- Park K, Kim D: Localized network centrality and essentiality in the yeast-protein interaction network. Proteomics. 2009, 9: 5143-5154. 10.1002/pmic.200900357.View ArticlePubMedGoogle Scholar
- Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic acids research. 2002, 30: 303-305. 10.1093/nar/30.1.303.PubMed CentralView ArticlePubMedGoogle Scholar
- Mewes HW, Frishman D, Mayer KF, Munsterkotter M, Noubibou O, Pagel P, Rattei T, Oesterheld M, Ruepp A, Stumpflen V: MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic acids research. 2006, 34: D169-172. 10.1093/nar/gkj148.PubMed CentralView ArticlePubMedGoogle Scholar
- Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, et al: SGD: Saccharomyces Genome Database. Nucleic acids research. 1998, 26: 73-79. 10.1093/nar/26.1.73.PubMed CentralView ArticlePubMedGoogle Scholar
- Saccharomyces Genome Deletion Project. [http://www-sequence.stanford.edu/group/yeast_deletion_project/deletions3.html]
- Ostlund G, Schmitt T, Forslund K, Kostler T, Messina DN, Roopra S, Frings O, Sonnhammer EL: InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic acids research. 2010, 38: D196-203. 10.1093/nar/gkp931.PubMed CentralView ArticlePubMedGoogle Scholar
- Tu BP, Kudlicki A, Rowicka M, McKnight SL: Logic of the yeast metabolic cycle: temporal compartmentalization of cellular processes. Science. 2005, 310: 1152-1158. 10.1126/science.1120499.View ArticlePubMedGoogle Scholar
- Pierleoni A, Martelli PL, Fariselli P, Casadio R: eSLDB: eukaryotic subcellular localization database. Nucleic acids research. 2007, 35: D208-212. 10.1093/nar/gkl775.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.