Predicting metabolic pathway membership with deep neural networks by integrating sequential and ontology information

Background Inference of protein’s membership in metabolic pathways has become an important task in functional annotation of protein. The membership information can provide valuable context to the basic functional annotation and also aid reconstruction of incomplete pathways. Previous works have shown success of inference by using various similarity measures of gene ontology. Results In this work, we set out to explore integrating ontology and sequential information to further improve the accuracy. Specifically, we developed a neural network model with an architecture tailored to facilitate the integration of features from different sources. Furthermore, we built models that are able to perform predictions from pathway-centric or protein-centric perspectives. We tested the classifiers using 5-fold cross validation for all metabolic pathways reported in KEGG database. Conclusions The testing results demonstrate that by integrating ontology and sequential information with a tailored architecture our deep neural network method outperforms the existing methods significantly in the pathway-centric mode, and in the protein-centric mode, our method either outperforms or performs comparably with a suite of existing GO term based semantic similarity methods.

such as missing proteins or reactions in the network. One example of such prediction tasks is pathway membership inference, which is to determine whether a protein is a member in the enzyme list of a given pathway. This is an important annotation task that can not only provide context to the basic function annotation of proteins but also more importantly aid reconstruction of incomplete metabolic pathways, which can subsequently help better understand metabolism and physiology of cells and provide complementary perspective to study evolutionary [1].
However, traditional sequence similarity-based homology approaches to characterizing proteins for their enzymatic properties run into difficulties when sequence identity is lower than 60% [2]. Facing this challenge, various efforts have been made to go beyond individual proteins and their homologs to leverage the large amount of annotations for proteins in their functional context, such as from curated reference dataset or features extracted from proteins. The example of curated reference dataset is Gene Ontology (GO), which provide a hierarchy of controlled terms defining protein functions with varied levels of specificity for different cellular functions/processes [3,4]. The semantic similarity between two proteins can be used to replace the sequence-based similarity method.
Various similarity measures have been developed to quantify the semantic similarity of GO terms and applied it in quantitative comparison of functional similarity of gene products, although most of these methods are not developed for metabolic pathway membership inference [5][6][7][8][9][10]. Essentially, those measures mainly involve two steps of calculation : 1) calculation of GO term similarity, and 2) calculation of protein similarity, based on GO term similarity. In the first step, the semantic similarity between two GO terms is calculated to incorporate the GO hierarchy, via information contains in the GO tree such as node, edge or combination of the two. In the second step, protein similarities are aggregated from their terms' similarities. To infer the protein's membership in the pathway, the similarity between the proteins are then used [7,11]. More recently, in [5], a hybrid approach to take into account of both information content of individual GO terms and the whole GO hierarchy with a simple Cosine similarity is shown to be advantageous in both prediction accuracy and running time as compared with other semantic similarity-based methods.
In general, however, the prediction task of proteins' annotation, including the prediction of protein's metabolic pathway annotation, may come from two perspectives. One perspective is the pathway centric perspective and the other is protein centric perspective. In the pathway centric perspective, the relevant question is: given a pathway, predict the proteins participate in the pathway, thus this perspective leads to prediction problem of association of pathway and its enzymatic reaction. On the other hand, the protein centric problem asks a different question: given a protein and its annotation, predict enzymatic reaction that they catalyzed. This question can be translated into prediction of set of metabolic pathways of which a given protein is likely to be a member. While the protein centric perspective is more natural in protein annotation, it turns out more computationally challenging as it is multi-class classification problem, as compared to the binary classification problem for pathway centric membership prediction.
In this work, we set out to develop new computational approach based on neural networks for predicting pathway membership from both directions: the protein centric and pathway centric problems. In doing so, we also explore integrating both ontology and sequential information to further improve the accuracy. Specifically, we develop a neural network model with an architecture tailored to facilitate the integration of features from different sources. Table 1 shows the performance of our method for pathway membership prediction, in comparison to using a suite of different ontology-based gene similarity methods mentioned in the Methods. Because GO has three separate hierarchies: BP, CC, and MF, we thus evaluated the prediction performance for using each hierarchy. In addition, we also evaluated the performance of different featured used in this experiment separately.

Results and discussion
We developed a method to include the graph structure information of gene ontology and the information contain in ontology terms as feature representation of proteins. The inclusion of both graph structure and information content in our method can significantly improve performance of pathway prediction membership. When a simple When cosine method is used as a baseliner method, our method's performance is statistically significant higher (p <0.05), while other machine learning methods such as KNN and SVM are lower. However, it is interesting to note that the performance of methods that are designed specifically to use the ontology-based semantic similarity such as SimGIC, and Resnik, are mostly the worst performance in all ontologies, even below the baseline cosine method. The reason behind this may be explained by the fact that most of ontology-based semantic similarity methods are based on calculating the similarity distance between the proteins only, without the learning process such as SVM classifier.
The good performances of prediction methods when using GO terms ontology are expected since the GO terms are curated data. The BP terms are especially information rich of protein function dataset. Other ontology terms, i.e. MF and CC, are not as rich as the BP in terms of function information, thus the performance of methods in predicting protein membership of pathway when using these ontologies are below the BP ontology. This pattern is consistent with our intuition that metabolic pathways are better characterized as biological processes (BP). Realizing this, we tested the performances of neural network method and base classifier when using non function based curated data, such as k-mer which transform the sequence information into frequency of k-mer amino acids, as input features to the models. Compared to the performances when GO terms are used as features, the sequence-based features are less effective in pathway membership prediction task ( Table 2). The top model performance when using this feature are .786 for neural networks model.
We also tested the effect of multi modal features as input to our neural network model. We tested two different possibilities of combining the multi modal features in our NN model, by concatenating the features at early stage and at later stage. Addition of information to the method can improve the prediction performance of NN model (Table 3), although in other models it can lower the prediction performance. For example, compare to single modal of GO term in NN architecture, the use of multimodal data can increase the performance from .952 to When we considered the metabolic membership prediction task as a pathway centric problem, we needed to build many models, one for each pathway. Thus, for a given protein to be classified, we need to run it for every model and obtained the predicted output. The protein centric prediction task, on the other hand, will predict multiple classes at once thus can be built from one model. Table 4 shows the performance of neural network method in comparison to other methods by using either single modal or multi modal features.
Similar to pathway-centric prediction task, the performances of the protein-centric methods are best when BP ontology is used as feature. The F measure of NN for Table 3 The ROC score of methods for multi-modal data. NN is neural network model, NN 1/0 is neural network model that use binary representation of GO terms as features. (concat) is approach where GO terms and k-mer is concatenated as single vector to represent each protein, (multi-input) approach where GO terms and k-mer are used as two input to the model. The number of layers in neural network are three and the dimension of neurons in each layer are 128,64, and 1  It is worth noting that, the protein-centric membership prediction is a multi-class classification whereas the pathway-centric membership prediction is a binary classification, which means that the former one is much more challenging, as reflected in the prediction performance. Therefore, while performance for protein-centric membership prediction may seem low, it should be assessed in the context of multi-class (320 classes to be exact) classification with a 1/320 = 0.3% accuracy from a random classifier.

Conclusion
In this work, we developed a neural network-based method for pathway membership inference using both gene ontology (GO) similarity and sequential features between a query protein and proteins that are known to the members of a given pathway. By replacing binary vector of the GO term annotation for a gene with the information content of individual GO terms and incorporating GO hierarchy with ancestor nodes that are directly present in gene annotation, we can create information rich vector representation for a gene. We built multilayer forward feeding neural networks that are able to integrate the GO term features and sequential features. We demonstrated that our NN based method outperformed other classifiers including SVM and random forest and the methods that are specifically designed to use the GO term features alone. Moreover, the NN based method is also able to answer question from both the pathway centric and protein centric perspectives, which makes the method more versatile in scaled up application for protein annotation.

Dataset
We used the gene ontology and gene annotation from GeneOntology (GO, http://geneontology.org), version 2019-07-01. The GO's ontology consists of three ontologies, i.e. biological process (BP), cellular components (CC) and molecular functions (MF). This version of GO contains 31043 BP, 11973 MF, and 4397 CC terms. The annotation provides association between proteins and their corresponding GO terms either manually reviewed by curator or automatically generated by prediction tools. Out all of available evidence codes, only IEA (Inferred from Electronic Annotation) has not assigned manually by a curator. Therefore, it is necessary to exclude the IEA evidence code to prevent cyclic prediction: predict the protein annotations by using predicted data. In this experiment, we exclude annotations encoded by IEA. We downloaded human KEGG pathway data set from Kyoto Encyclopedia of Genes and Genomes database [12], http://rest.kegg.jp. The database consists of 320 human pathways. We excluded pathways that consists less than 10 proteins to ensure adequate training and testing in the cross-validation scheme and mapped the NCBI gene id to its corresponding Uniprot identifier. As a result, we obtained 308 pathways and the number of proteins in the pathways range from 10 to 521 proteins with most of the pathways having proteins less than 100 proteins (Fig. 1).

Data representation
We used multimodal data as input to our model, including the GO terms and k-mer information from protein sequences. While a simplistic approach to represent GO terms is a binary vector with 1 or 0 representing the presence or absence of GO terms in annotation of given gene, our method adopts a scheme from [5], which considers both of the structure of the GO graph and the information content of the GO terms in building the vector of the gene and their corresponding annotations (Fig. 2).
Specifically, before we build the gene vector, we first calculated the semantic value (SV) for each GO term in the annotation of a given protein. We used a normalized information content of term t i by dividing the information Fig. 1 Distribution of pathways and the number of their proteins used in this experiment Fig. 2 Generation of vector representation from GO dataset. In this example, the protein is annotated with t3 and t4. To generate protein's feature vector, the normalized IC of t3 and t4 is used in the first stage. On second stage, the semantic value (SV) of all term ancestors of t3 and t4 are calculated. Since t3 and t4 share common ancestor, t1 and t2, the semantic value for t1 and t2 are average semantic value (SV). See Material and Methods for detail description content of term t i with the maximum IC in whole set of GO terms T as follows: Then we expanded the annotation of a given protein by including all of the ancestor terms: for each annotation term t i in a given protein, we assigned the weighted semantic value for all ancestor terms of term t i , defined as follows: wherew is the weight, in this case we use a fix constant of 0.5, t pi is all ancestor terms of term t i and d pi is the path length of term t i to its ancestor t pi . The path length is defined as the difference of the maximum depth between the two terms in the GO tree. When there are multiple GO terms in the annotation of a given protein, it is possible that these GO terms may have ancestor terms in common. Therefore, during expansion of the annotation vector for a given protein, a common ancestor term will have multiple semantic values, each for annotation term in the original annotation, as the common ancestor term may receive a semantic value from all of its descendants. Hence, we calculated the average of these values (SV (t p )) as the new semantic values for a common ancestor term t p . Note that, in GO hierarchy, there are other relationships such as "NOT" and "contribute to", between two GO terms; in this study, however, we only include "is_a" relationship for calculating the semantic value, following the same practice as in other method such as [7], which we compare with.
After this procedure, a gene is represented as a vector of n-dimension, where n = |T|, each dimension corresponding to one GO term in the gene ontology hierarchy, with a semantic value being either a) the normalized information content if the GO term is present in the gene annotation, or b) a value assigned as above for a GO term whose descendant(s) is present, or c) a value of zero if a GO term is not of either of the two former cases.
In addition to gene annotation data as input to our model, we also used sequence-based features, such k-mer. The k-mer feature represents the sequence information as the frequency of k-mer, in this case we used k = 2.

Neural network architecture
Artificial neural network is inspired by biological process [13]. It consists of layers of neurons that are fully connected between layers, but no connection between neurons in the same layer. Each neuron performs linear transformation operation of weighted information summation coming from all neurons in previous layer adjusted by some biases followed by nonlinear activation function f, as define by following equation: While there are many activation options available for neural network. The two most used activation functions are ReLU and Sigmoid. The ReLu set the lower bound output of neuron to 0 the output of neuron to be minimum of 0, while sigmoid squashing the output of neuron and bounded to be between 0 and 1. In this experiment, we used the ReLU activation in the hidden layer, while Sigmoid is used in the output layer. The formal definition of ReLU ( 4 ) and Sigmoid (5) are: We implemented a multi-layer feed forward deep neural network in our model. We stacked three fully connected layers where the first layer is the input layer, the second layer is hidden layer, and the last layer is output layer. The input of the network is the n-dimensional vector of protein's features (Fig. 4). We used multi-modal features, i.e. GO terms and k-mer, and we either used a single modal or multi-modal features. For a single modal feature, we adopt architecture in Fig. 3. For multi-modal features, we combined the features' vectors at early stage or at later stage. At early stage, we concatenate multiple vectors into one vector as input to the model, thus the architecture similar to single input vector (Fig. 3). On the other hand, the concatenation at later stage happens inside the model where multi input model accept multiple input of vectors, then the model combine it in hidden layer while processing the inputs (NN multi input, Fig. 4). Note that convolution neural networks were attempted and did not get good performance, which we believe may be attributed to lack of convoluted patterns/features in protein sequences, unlike 2d images. Depending upon the classification task, the dimension of output layer is either 1 or n, where n is the number of classes to be predicted (n = 308). In binary classification, the dimension of output layer is 1, while in multi-label classification the dimension of output layer is n. For binary classification task, we built one model for each class, while for multi-label classification task, we built one model. We performed optimization by comparing different number of neurons in each layer (data not shown).
We implemented the Keras library to build our model. We chose to minimize the binary cross entropy function loss using the Adam optimizer with learning rate 0.001 for binary classification task. For multi label classification task, we chose to minimize the F1 function loss. To prevent overfitting in our model, we implemented the dropout (0.5) regularization. Note that unless explicitly mentioned otherwise, the default values of the hyperparameters are used in this study, and it is conceivable that better performance than reported in Tables 1, 2, and 3 can be achieved should these hyperparameters be optimally tuned.

Training
In our experiment, we trained individual model separately for each pathway in the binary classification task. We performed 5-fold cross validation for each pathway. For each pathway, positive dataset consists of proteins that belong to the pathway while negative dataset is generated by selecting equal number of random proteins that do not belong to the pathway or interacting with proteins in the pathway. We followed this procedure since proteins in the pathway tend to interact each other, and by using this approach we ensured that there are no proteins in the negative dataset that are interacting with proteins in the positive dataset. We used BioGrid dataset to determine the interacting protein. We also excluded proteins that have no GO terms information in the pathways.
For multi-label classification task, we followed different approach. Since in both multi class or multi label classification task a positive sample can be a negative sample for other classes, we did not generate negative dataset. We simply consider negative dataset of a given pathways are proteins in other pathways. We also did not perform 5-fold cross validation, instead we randomly held 5 proteins from each pathway as testing dataset and the rest as training dataset.

Baseline classifiers
We used several GO based semantic similarity measures and baseline classifier as comparison to our method. We used the most commonly used semantic similarity measures, Resnik [10] and simGIC [9]. These measures mainly use the information content (IC) of each node to quantify the GO terms in the GO graph. The IC is described as: where p(t) is term frequency of t in a given annotation corpus, such as Gene Ontology Annotation (GOA). These measures use same principal in calculating similarity between two proteins, which is based on the similarities of their corresponding terms. For protein similarities of Resnik's measures, we followed method from [14]. In addition to these methods, we also calculated the similarity of two proteins q and p based on their dot product where t is the term of GO terms T. To determine whether query protein q belong to the model, we used the average similarity score between the query protein and set of proteins P of incomplete pathways as S(q, P) = p∈P S(q, p)/|P| (8) where s(q, p) is the similarity score between query protein q and a member protein p as calculated by Eq. 7 and |P| is the number of known proteins for the incomplete pathway P.
In addition to GO based semantic similarity methods, we also use some of mostly used base classifiers in machine learning: SVM, RF, and KNN. We implemented the Scikit library of SVM, RF, and KNN by using all default parameters. We used parameters as follows: rbf kernel and C = 1e10 in SVM, number of forest is 100 in RF, and number of neighbor is 5 in KNN. We implemented Scikit SVM, RF, and KNN libraries.

Predictive performance evaluation
We adopted two different performance measures, each for pathway centric and protein centric prediction task respectively. For pathway centric task, we considered the task as binary classification problem and used receiving operating characteristic (ROC) curve analysis to evaluate the performance. The ROC curve of perfect classifier has the area under the ROC curve (AUC) of 1. The perfect curve rises steeply from bottom left to top left and move toward top right. We calculated ROC curve for each pathway and average across all pathways. ROC curve measures the performance of classifier at various threshold setting and represents the tradeoff between true positive rate (TPR) and false positive rate (FPR). The TPR and FPR for each pathway c are defined as: where FP c , TN c , TP c , and FN c are the number of false positive, true negative, true positive and false positive respectively in pathway c. We then calculated the AUC of ROC from the above FPR and TPR and average the ROC score over all pathways. For protein centric task, we considered the task as multilabel classification since one protein can have multiple label, and used the F1 score and Matthews Correlation Coefficient (MCC) to evaluate performance. The precision and recall are defined as p = TP TP + FP (11) r = TP TP + FN (12) where TP, FP, and FN are the number of true positive, false positive, and false negative respectively. The F measure is harmonic mean of precision and recall. The value range between 0 and 1. The perfect score of 1 means that both of the precision and recall reach their maximum score of 1. However, when the precision reach maximum, it increases the TN, thus reducing the recall. On the other hand, when the recall reaches maximum score, it increases the FP, thus reducing the precision. Thus, F measure hardly reach maximum score 1. The F1 measure is defined as while MCC is defined as: